Phoenix Observability

Debug, evaluate, and monitor LLM applications with open-source tracing

16036 Stars

Open-source AI observability platform for LLM tracing, evaluation, and monitoring. Use when debugging LLM applications with detailed traces, running evaluations on datasets, or monitoring production AI systems with real-time insights.

observability tracing llm-monitoring evaluation debugging opentelemetry phoenix ai-ops

See It In Action

Interactive preview & real-world examples


AI Conversation Simulator

See how users interact with this skill

User Prompt

My LangChain chatbot is returning empty responses for some users. Help me set up Phoenix tracing to debug this issue.


Agent Response

Complete setup with OpenTelemetry instrumentation, trace collection, and debugging insights from the Phoenix UI

Quick Start (3 Steps)

Get up and running in minutes

1. Install

claude-code skill install phoenix-observability

2. Config

3. First Trigger

@phoenix-observability help

Commands

| Command | Description | Required Args |
| --- | --- | --- |
| @phoenix-observability debug-production-llm-failures | Track down why your chatbot gives inconsistent responses by analyzing detailed traces of user interactions | None |
| @phoenix-observability evaluate-model-performance-systematically | Run automated evaluations on your test datasets to measure hallucination, relevance, and toxicity | None |
| @phoenix-observability monitor-production-ai-systems | Set up real-time monitoring for token usage, latency, and quality metrics in your production environment | None |

Typical Use Cases

Debug production LLM failures

Track down why your chatbot gives inconsistent responses by analyzing detailed traces of user interactions

Evaluate model performance systematically

Run automated evaluations on your test datasets to measure hallucination, relevance, and toxicity

Monitor production AI systems

Set up real-time monitoring for token usage, latency, and quality metrics in your production environment

Overview

Phoenix - AI Observability Platform

Open-source AI observability and evaluation platform for LLM applications with tracing, evaluation, datasets, experiments, and real-time monitoring.

When to use Phoenix

Use Phoenix when:

  • Debugging LLM application issues with detailed traces
  • Running systematic evaluations on datasets
  • Monitoring production LLM systems in real-time
  • Building experiment pipelines for prompt/model comparison
  • Self-hosting observability without vendor lock-in

Key features:

  • Tracing: OpenTelemetry-based trace collection for any LLM framework
  • Evaluation: LLM-as-judge evaluators for quality assessment
  • Datasets: Versioned test sets for regression testing
  • Experiments: Compare prompts, models, and configurations
  • Playground: Interactive prompt testing with multiple models
  • Open-source: Self-hosted with PostgreSQL or SQLite

Consider alternatives instead:

  • LangSmith: Managed platform with LangChain-first integration
  • Weights & Biases: Deep learning experiment tracking focus
  • Arize Cloud: Managed Phoenix with enterprise features
  • MLflow: General ML lifecycle, model registry focus

Quick start

Installation

pip install arize-phoenix

# Optional extras and companion packages
pip install "arize-phoenix[embeddings]"  # Embedding analysis
pip install arize-phoenix-otel           # OpenTelemetry config
pip install arize-phoenix-evals          # Evaluation framework
pip install arize-phoenix-client         # Lightweight REST client

Launch Phoenix server

import phoenix as px

# Launch in notebook (ThreadServer mode)
session = px.launch_app()

# View UI
session.view()  # Embedded iframe
print(session.url)  # http://localhost:6006

Command-line server (production)

# Start Phoenix server
phoenix serve

# With PostgreSQL
export PHOENIX_SQL_DATABASE_URL="postgresql://user:pass@host/db"
phoenix serve --port 6006

Basic tracing

from phoenix.otel import register
from openinference.instrumentation.openai import OpenAIInstrumentor

# Configure OpenTelemetry with Phoenix
tracer_provider = register(
    project_name="my-llm-app",
    endpoint="http://localhost:6006/v1/traces"
)

# Instrument OpenAI SDK
OpenAIInstrumentor().instrument(tracer_provider=tracer_provider)

# All OpenAI calls are now traced
from openai import OpenAI
client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Hello!"}]
)

Core concepts

Traces and spans

A trace represents a complete execution flow, while spans are individual operations within that trace.

from phoenix.otel import register
from opentelemetry import trace

# Setup tracing
tracer_provider = register(project_name="my-app")
tracer = trace.get_tracer(__name__)

# Create custom spans
# (`query`, `retriever`, and `llm` are application objects assumed to exist)
with tracer.start_as_current_span("process_query") as span:
    span.set_attribute("input.value", query)

    # Child spans are automatically nested
    with tracer.start_as_current_span("retrieve_context"):
        context = retriever.search(query)

    with tracer.start_as_current_span("generate_response"):
        response = llm.generate(query, context)

    span.set_attribute("output.value", response)

Projects

Projects organize related traces:

import os
os.environ["PHOENIX_PROJECT_NAME"] = "production-chatbot"

# Or per-trace
from phoenix.otel import register
tracer_provider = register(project_name="experiment-v2")

Framework instrumentation

OpenAI

from phoenix.otel import register
from openinference.instrumentation.openai import OpenAIInstrumentor

tracer_provider = register()
OpenAIInstrumentor().instrument(tracer_provider=tracer_provider)

LangChain

from phoenix.otel import register
from openinference.instrumentation.langchain import LangChainInstrumentor

tracer_provider = register()
LangChainInstrumentor().instrument(tracer_provider=tracer_provider)

# All LangChain operations traced
from langchain_openai import ChatOpenAI
llm = ChatOpenAI(model="gpt-4o")
response = llm.invoke("Hello!")

LlamaIndex

from phoenix.otel import register
from openinference.instrumentation.llama_index import LlamaIndexInstrumentor

tracer_provider = register()
LlamaIndexInstrumentor().instrument(tracer_provider=tracer_provider)

Anthropic

from phoenix.otel import register
from openinference.instrumentation.anthropic import AnthropicInstrumentor

tracer_provider = register()
AnthropicInstrumentor().instrument(tracer_provider=tracer_provider)

Evaluation framework

Built-in evaluators

from phoenix.evals import (
    OpenAIModel,
    HallucinationEvaluator,
    RelevanceEvaluator,
    ToxicityEvaluator,
    llm_classify
)

# Setup model for evaluation
eval_model = OpenAIModel(model="gpt-4o")

# Evaluate hallucination
hallucination_eval = HallucinationEvaluator(eval_model)
results = hallucination_eval.evaluate(
    input="What is the capital of France?",
    output="The capital of France is Paris.",
    reference="Paris is the capital of France."
)

Custom evaluators

import pandas as pd

from phoenix.evals import OpenAIModel, llm_classify

eval_model = OpenAIModel(model="gpt-4o")

HELPFULNESS_TEMPLATE = """
Evaluate if the response is helpful for the given question.

Question: {input}
Response: {output}

Is this response helpful? Answer 'helpful' or 'not_helpful'.
"""

# llm_classify evaluates a DataFrame whose columns fill the template variables
def evaluate_helpfulness(input_text, output_text):
    df = pd.DataFrame([{"input": input_text, "output": output_text}])
    return llm_classify(
        dataframe=df,
        model=eval_model,
        template=HELPFULNESS_TEMPLATE,
        rails=["helpful", "not_helpful"]
    )

Run evaluations on dataset

from phoenix import Client
from phoenix.evals import (
    HallucinationEvaluator,
    OpenAIModel,
    RelevanceEvaluator,
    run_evals
)

client = Client()
eval_model = OpenAIModel(model="gpt-4o")

# Get spans to evaluate
spans_df = client.get_spans_dataframe(
    project_name="my-app",
    filter_condition="span_kind == 'LLM'"
)

# Run evaluations
eval_results = run_evals(
    dataframe=spans_df,
    evaluators=[
        HallucinationEvaluator(eval_model),
        RelevanceEvaluator(eval_model)
    ],
    provide_explanation=True
)

# Log results back to Phoenix
client.log_evaluations(eval_results)

Datasets and experiments

Create dataset

from phoenix import Client

client = Client()

# Create dataset
dataset = client.create_dataset(
    name="qa-test-set",
    description="QA evaluation dataset"
)

# Add examples
client.add_examples_to_dataset(
    dataset_name="qa-test-set",
    examples=[
        {
            "input": {"question": "What is Python?"},
            "output": {"answer": "A programming language"}
        },
        {
            "input": {"question": "What is ML?"},
            "output": {"answer": "Machine learning"}
        }
    ]
)

Run experiment

from phoenix import Client
from phoenix.experiments import run_experiment

client = Client()

def my_model(input_data):
    """Your model function; `generate_answer` is assumed to be defined elsewhere."""
    question = input_data["question"]
    return {"answer": generate_answer(question)}

def accuracy_evaluator(input_data, output, expected):
    """Custom evaluator."""
    correct = expected["answer"].lower() in output["answer"].lower()
    return {
        "score": 1.0 if correct else 0.0,
        "label": "correct" if correct else "incorrect"
    }

# Run experiment
results = run_experiment(
    dataset_name="qa-test-set",
    task=my_model,
    evaluators=[accuracy_evaluator],
    experiment_name="baseline-v1"
)

print(f"Average accuracy: {results.aggregate_metrics['accuracy']}")

Client API

Query traces and spans

from phoenix import Client

client = Client(endpoint="http://localhost:6006")

# Get spans as DataFrame
spans_df = client.get_spans_dataframe(
    project_name="my-app",
    filter_condition="span_kind == 'LLM'",
    limit=1000
)

# Get specific span
span = client.get_span(span_id="abc123")

# Get trace
trace = client.get_trace(trace_id="xyz789")

Log feedback

from phoenix import Client

client = Client()

# Log user feedback
client.log_annotation(
    span_id="abc123",
    name="user_rating",
    annotator_kind="HUMAN",
    score=0.8,
    label="helpful",
    metadata={"comment": "Good response"}
)

Export data

# Export to pandas
df = client.get_spans_dataframe(project_name="my-app")

# Export traces
traces = client.list_traces(project_name="my-app")

Production deployment

Docker

docker run -p 6006:6006 arizephoenix/phoenix:latest

With PostgreSQL

# Set database URL
export PHOENIX_SQL_DATABASE_URL="postgresql://user:pass@host:5432/phoenix"

# Start server
phoenix serve --host 0.0.0.0 --port 6006

Environment variables

| Variable | Description | Default |
| --- | --- | --- |
| PHOENIX_PORT | HTTP server port | 6006 |
| PHOENIX_HOST | Server bind address | 127.0.0.1 |
| PHOENIX_GRPC_PORT | gRPC/OTLP port | 4317 |
| PHOENIX_SQL_DATABASE_URL | Database connection | Temporary SQLite |
| PHOENIX_WORKING_DIR | Data storage directory | OS temp dir |
| PHOENIX_ENABLE_AUTH | Enable authentication | false |
| PHOENIX_SECRET | JWT signing secret | Required if auth enabled |
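The same variables can be passed straight to the Docker image. A minimal sketch, assuming an external PostgreSQL instance (the connection URL and port mappings below are placeholders):

```shell
# Sketch: run the published image with a PostgreSQL backend,
# exposing the HTTP UI (6006) and the gRPC/OTLP collector (4317)
docker run -p 6006:6006 -p 4317:4317 \
  -e PHOENIX_SQL_DATABASE_URL="postgresql://user:pass@host:5432/phoenix" \
  arizephoenix/phoenix:latest
```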

With authentication

export PHOENIX_ENABLE_AUTH=true
export PHOENIX_SECRET="your-secret-key-min-32-chars"
export PHOENIX_ADMIN_SECRET="admin-bootstrap-token"

phoenix serve

Best practices

  1. Use projects: Separate traces by environment (dev/staging/prod)
  2. Add metadata: Include user IDs, session IDs for debugging
  3. Evaluate regularly: Run automated evaluations in CI/CD
  4. Version datasets: Track test set changes over time
  5. Monitor costs: Track token usage via Phoenix dashboards
  6. Self-host: Use PostgreSQL for production deployments
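Practices 2 and 5 can be sketched in plain Python, with no Phoenix dependency: tag every span's attributes with user and session IDs at request time, then sum token counts over exported span rows. The `user.id`, `session.id`, and `attributes.llm.token_count.total` keys are assumptions modeled on OpenInference conventions, and each dict below stands in for one row of `client.get_spans_dataframe(...)`.

```python
def annotate_span(attributes: dict, user_id: str, session_id: str) -> dict:
    """Practice 2: include user and session IDs on every span for debugging."""
    return {**attributes, "user.id": user_id, "session.id": session_id}


def total_tokens(span_rows: list[dict]) -> int:
    """Practice 5: aggregate token usage across exported span rows."""
    return sum(row.get("attributes.llm.token_count.total", 0) for row in span_rows)


rows = [
    annotate_span({"attributes.llm.token_count.total": 120}, "u1", "s1"),
    annotate_span({"attributes.llm.token_count.total": 80}, "u2", "s2"),
]
print(total_tokens(rows))  # 200
```

With the IDs attached, the spans dataframe can be filtered per user or session before aggregating, which is what makes the cost dashboards actionable.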

Common issues

Traces not appearing:

from phoenix.otel import register

# Verify the collector endpoint
tracer_provider = register(
    project_name="my-app",
    endpoint="http://localhost:6006/v1/traces"  # Correct endpoint
)

# Force flush pending spans before the process exits
tracer_provider.force_flush()

High memory in notebook:

import phoenix as px

# Close session when done
session = px.launch_app()
# ... do work ...
session.close()
px.close_app()

Database connection issues:

# Verify PostgreSQL connection
psql $PHOENIX_SQL_DATABASE_URL -c "SELECT 1"

# Check Phoenix logs
phoenix serve --log-level debug



Environment Matrix

Dependencies

arize-phoenix>=12.0.0
Python 3.8+
PostgreSQL (optional for production)

Framework Support

OpenAI ✓ (recommended) LangChain ✓ LlamaIndex ✓ Anthropic ✓ OpenTelemetry ✓

Context Window

Token Usage ~1K-3K tokens for typical evaluation tasks


Information

Author
davila7
Updated
2026-01-30
Category
debugging