Phoenix Observability

Debug, evaluate, and monitor LLM applications with open-source tracing

16036 Stars

Open-source AI observability platform for LLM tracing, evaluation, and monitoring. Use when debugging LLM applications with detailed traces, running evaluations on datasets, or monitoring production AI systems with real-time insights.

observability tracing llm-monitoring evaluation debugging opentelemetry phoenix ai-ops

See It In Action

Interactive preview & real-world examples


AI Conversation Simulator

See how users interact with this skill

User Prompt

My LangChain chatbot is returning empty responses for some users. Help me set up Phoenix tracing to debug this issue.


Agent Response

Complete setup with OpenTelemetry instrumentation, trace collection, and debugging insights from the Phoenix UI

Quick Start (3 Steps)

Get up and running in minutes

1. Install

claude-code skill install phoenix-observability

2. Config

3. First Trigger

@phoenix-observability help

Commands

| Command | Description | Required Args |
| --- | --- | --- |
| @phoenix-observability debug-production-llm-failures | Track down why your chatbot gives inconsistent responses by analyzing detailed traces of user interactions | None |
| @phoenix-observability evaluate-model-performance-systematically | Run automated evaluations on your test datasets to measure hallucination, relevance, and toxicity | None |
| @phoenix-observability monitor-production-ai-systems | Set up real-time monitoring for token usage, latency, and quality metrics in your production environment | None |

Typical Use Cases

Debug production LLM failures

Track down why your chatbot gives inconsistent responses by analyzing detailed traces of user interactions

Evaluate model performance systematically

Run automated evaluations on your test datasets to measure hallucination, relevance, and toxicity

Monitor production AI systems

Set up real-time monitoring for token usage, latency, and quality metrics in your production environment

Overview

Phoenix - AI Observability Platform

Open-source AI observability and evaluation platform for LLM applications with tracing, evaluation, datasets, experiments, and real-time monitoring.

When to use Phoenix

Use Phoenix when:

  • Debugging LLM application issues with detailed traces
  • Running systematic evaluations on datasets
  • Monitoring production LLM systems in real-time
  • Building experiment pipelines for prompt/model comparison
  • Self-hosting observability without vendor lock-in

Key features:

  • Tracing: OpenTelemetry-based trace collection for any LLM framework
  • Evaluation: LLM-as-judge evaluators for quality assessment
  • Datasets: Versioned test sets for regression testing
  • Experiments: Compare prompts, models, and configurations
  • Playground: Interactive prompt testing with multiple models
  • Open-source: Self-hosted with PostgreSQL or SQLite

Consider alternatives instead:

  • LangSmith: Managed platform with LangChain-first integration
  • Weights & Biases: Deep learning experiment tracking focus
  • Arize Cloud: Managed Phoenix with enterprise features
  • MLflow: General ML lifecycle, model registry focus

Quick start

Installation

pip install arize-phoenix

# Optional extras and companion packages
pip install "arize-phoenix[embeddings]"  # Embedding analysis
pip install arize-phoenix-otel           # OpenTelemetry config
pip install arize-phoenix-evals          # Evaluation framework
pip install arize-phoenix-client         # Lightweight REST client

Launch Phoenix server

import phoenix as px

# Launch in notebook (ThreadServer mode)
session = px.launch_app()

# View UI
session.view()  # Embedded iframe
print(session.url)  # http://localhost:6006

Command-line server (production)

# Start Phoenix server
phoenix serve

# With PostgreSQL
export PHOENIX_SQL_DATABASE_URL="postgresql://user:pass@host/db"
phoenix serve --port 6006

Basic tracing

from phoenix.otel import register
from openinference.instrumentation.openai import OpenAIInstrumentor

# Configure OpenTelemetry with Phoenix
tracer_provider = register(
    project_name="my-llm-app",
    endpoint="http://localhost:6006/v1/traces"
)

# Instrument OpenAI SDK
OpenAIInstrumentor().instrument(tracer_provider=tracer_provider)

# All OpenAI calls are now traced
from openai import OpenAI
client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Hello!"}]
)

Core concepts

Traces and spans

A trace represents a complete execution flow, while spans are individual operations within that trace.

from phoenix.otel import register
from opentelemetry import trace

# Setup tracing
tracer_provider = register(project_name="my-app")
tracer = trace.get_tracer(__name__)

# Create custom spans
# (`query`, `retriever`, and `llm` are application objects assumed to exist)
with tracer.start_as_current_span("process_query") as span:
    span.set_attribute("input.value", query)

    # Child spans are automatically nested
    with tracer.start_as_current_span("retrieve_context"):
        context = retriever.search(query)

    with tracer.start_as_current_span("generate_response"):
        response = llm.generate(query, context)

    span.set_attribute("output.value", response)

Projects

Projects organize related traces:

import os
os.environ["PHOENIX_PROJECT_NAME"] = "production-chatbot"

# Or per-trace
from phoenix.otel import register
tracer_provider = register(project_name="experiment-v2")

Framework instrumentation

OpenAI

from phoenix.otel import register
from openinference.instrumentation.openai import OpenAIInstrumentor

tracer_provider = register()
OpenAIInstrumentor().instrument(tracer_provider=tracer_provider)

LangChain

from phoenix.otel import register
from openinference.instrumentation.langchain import LangChainInstrumentor

tracer_provider = register()
LangChainInstrumentor().instrument(tracer_provider=tracer_provider)

# All LangChain operations traced
from langchain_openai import ChatOpenAI
llm = ChatOpenAI(model="gpt-4o")
response = llm.invoke("Hello!")

LlamaIndex

from phoenix.otel import register
from openinference.instrumentation.llama_index import LlamaIndexInstrumentor

tracer_provider = register()
LlamaIndexInstrumentor().instrument(tracer_provider=tracer_provider)

Anthropic

from phoenix.otel import register
from openinference.instrumentation.anthropic import AnthropicInstrumentor

tracer_provider = register()
AnthropicInstrumentor().instrument(tracer_provider=tracer_provider)

Evaluation framework

Built-in evaluators

from phoenix.evals import (
    OpenAIModel,
    HallucinationEvaluator,
    RelevanceEvaluator,
    ToxicityEvaluator,
    llm_classify
)

# Setup model for evaluation
eval_model = OpenAIModel(model="gpt-4o")

# Evaluate hallucination
hallucination_eval = HallucinationEvaluator(eval_model)
results = hallucination_eval.evaluate(
    input="What is the capital of France?",
    output="The capital of France is Paris.",
    reference="Paris is the capital of France."
)

Custom evaluators

import pandas as pd

from phoenix.evals import OpenAIModel, llm_classify

eval_model = OpenAIModel(model="gpt-4o")

HELPFULNESS_TEMPLATE = """
Evaluate if the response is helpful for the given question.

Question: {input}
Response: {output}

Is this response helpful? Answer 'helpful' or 'not_helpful'.
"""

# llm_classify evaluates a DataFrame whose columns fill the template variables
def evaluate_helpfulness(input_text, output_text):
    df = pd.DataFrame([{"input": input_text, "output": output_text}])
    return llm_classify(
        dataframe=df,
        model=eval_model,
        template=HELPFULNESS_TEMPLATE,
        rails=["helpful", "not_helpful"]
    )

Run evaluations on dataset

from phoenix import Client
from phoenix.evals import (
    HallucinationEvaluator,
    OpenAIModel,
    RelevanceEvaluator,
    run_evals
)

client = Client()
eval_model = OpenAIModel(model="gpt-4o")

# Get spans to evaluate
spans_df = client.get_spans_dataframe(
    project_name="my-app",
    filter_condition="span_kind == 'LLM'"
)

# Run evaluations
eval_results = run_evals(
    dataframe=spans_df,
    evaluators=[
        HallucinationEvaluator(eval_model),
        RelevanceEvaluator(eval_model)
    ],
    provide_explanation=True
)

# Log results back to Phoenix
client.log_evaluations(eval_results)

Datasets and experiments

Create dataset

from phoenix import Client

client = Client()

# Create dataset
dataset = client.create_dataset(
    name="qa-test-set",
    description="QA evaluation dataset"
)

# Add examples
client.add_examples_to_dataset(
    dataset_name="qa-test-set",
    examples=[
        {
            "input": {"question": "What is Python?"},
            "output": {"answer": "A programming language"}
        },
        {
            "input": {"question": "What is ML?"},
            "output": {"answer": "Machine learning"}
        }
    ]
)

Run experiment

from phoenix import Client
from phoenix.experiments import run_experiment

client = Client()

def my_model(input_data):
    """Your model function; `generate_answer` is assumed to be defined elsewhere."""
    question = input_data["question"]
    return {"answer": generate_answer(question)}

def accuracy_evaluator(input_data, output, expected):
    """Custom evaluator."""
    correct = expected["answer"].lower() in output["answer"].lower()
    return {
        "score": 1.0 if correct else 0.0,
        "label": "correct" if correct else "incorrect"
    }

# Run experiment
results = run_experiment(
    dataset_name="qa-test-set",
    task=my_model,
    evaluators=[accuracy_evaluator],
    experiment_name="baseline-v1"
)

print(f"Average accuracy: {results.aggregate_metrics['accuracy']}")

Client API

Query traces and spans

from phoenix import Client

client = Client(endpoint="http://localhost:6006")

# Get spans as DataFrame
spans_df = client.get_spans_dataframe(
    project_name="my-app",
    filter_condition="span_kind == 'LLM'",
    limit=1000
)

# Get specific span
span = client.get_span(span_id="abc123")

# Get trace
trace = client.get_trace(trace_id="xyz789")

Log feedback

from phoenix import Client

client = Client()

# Log user feedback
client.log_annotation(
    span_id="abc123",
    name="user_rating",
    annotator_kind="HUMAN",
    score=0.8,
    label="helpful",
    metadata={"comment": "Good response"}
)

Export data

# Export to pandas
df = client.get_spans_dataframe(project_name="my-app")

# Export traces
traces = client.list_traces(project_name="my-app")

Production deployment

Docker

docker run -p 6006:6006 arizephoenix/phoenix:latest

With PostgreSQL

# Set database URL
export PHOENIX_SQL_DATABASE_URL="postgresql://user:pass@host:5432/phoenix"

# Start server
phoenix serve --host 0.0.0.0 --port 6006

Environment variables

| Variable | Description | Default |
| --- | --- | --- |
| PHOENIX_PORT | HTTP server port | 6006 |
| PHOENIX_HOST | Server bind address | 127.0.0.1 |
| PHOENIX_GRPC_PORT | gRPC/OTLP port | 4317 |
| PHOENIX_SQL_DATABASE_URL | Database connection | Temporary SQLite |
| PHOENIX_WORKING_DIR | Data storage directory | OS temp dir |
| PHOENIX_ENABLE_AUTH | Enable authentication | false |
| PHOENIX_SECRET | JWT signing secret | Required if auth enabled |
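The same variables can be passed straight to the Docker image. A minimal sketch, assuming an external PostgreSQL instance (the connection URL and port mappings below are placeholders):

```shell
# Sketch: run the published image with a PostgreSQL backend,
# exposing the HTTP UI (6006) and the gRPC/OTLP collector (4317)
docker run -p 6006:6006 -p 4317:4317 \
  -e PHOENIX_SQL_DATABASE_URL="postgresql://user:pass@host:5432/phoenix" \
  arizephoenix/phoenix:latest
```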

With authentication

export PHOENIX_ENABLE_AUTH=true
export PHOENIX_SECRET="your-secret-key-min-32-chars"
export PHOENIX_ADMIN_SECRET="admin-bootstrap-token"

phoenix serve

Best practices

  1. Use projects: Separate traces by environment (dev/staging/prod)
  2. Add metadata: Include user IDs, session IDs for debugging
  3. Evaluate regularly: Run automated evaluations in CI/CD
  4. Version datasets: Track test set changes over time
  5. Monitor costs: Track token usage via Phoenix dashboards
  6. Self-host: Use PostgreSQL for production deployments
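Practices 2 and 5 can be sketched in plain Python, with no Phoenix dependency: tag every span's attributes with user and session IDs at request time, then sum token counts over exported span rows. The `user.id`, `session.id`, and `attributes.llm.token_count.total` keys are assumptions modeled on OpenInference conventions, and each dict below stands in for one row of `client.get_spans_dataframe(...)`.

```python
def annotate_span(attributes: dict, user_id: str, session_id: str) -> dict:
    """Practice 2: include user and session IDs on every span for debugging."""
    return {**attributes, "user.id": user_id, "session.id": session_id}


def total_tokens(span_rows: list[dict]) -> int:
    """Practice 5: aggregate token usage across exported span rows."""
    return sum(row.get("attributes.llm.token_count.total", 0) for row in span_rows)


rows = [
    annotate_span({"attributes.llm.token_count.total": 120}, "u1", "s1"),
    annotate_span({"attributes.llm.token_count.total": 80}, "u2", "s2"),
]
print(total_tokens(rows))  # 200
```

With the IDs attached, the spans dataframe can be filtered per user or session before aggregating, which is what makes the cost dashboards actionable.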

Common issues

Traces not appearing:

from phoenix.otel import register

# Verify the collector endpoint
tracer_provider = register(
    project_name="my-app",
    endpoint="http://localhost:6006/v1/traces"  # Correct endpoint
)

# Force flush pending spans before the process exits
tracer_provider.force_flush()

High memory in notebook:

import phoenix as px

# Close session when done
session = px.launch_app()
# ... do work ...
session.close()
px.close_app()

Database connection issues:

# Verify PostgreSQL connection
psql $PHOENIX_SQL_DATABASE_URL -c "SELECT 1"

# Check Phoenix logs
phoenix serve --log-level debug



Environment Matrix

Dependencies

arize-phoenix>=12.0.0
Python 3.8+
PostgreSQL (optional for production)

Framework Support

OpenAI ✓ (recommended) LangChain ✓ LlamaIndex ✓ Anthropic ✓ OpenTelemetry ✓

Context Window

Token Usage ~1K-3K tokens for typical evaluation tasks


Information

Author
davila7
Updated
2026-01-30
Category
debugging