Phoenix Observability
Debug, evaluate, and monitor LLM applications with open-source tracing
Open-source AI observability platform for LLM tracing, evaluation, and monitoring. Use when debugging LLM applications with detailed traces, running evaluations on datasets, or monitoring production AI systems with real-time insights.
Quick Start (3 Steps)
Get up and running in minutes
Install
claude-code skill install phoenix-observability
First Trigger
@phoenix-observability help
Commands
| Command | Description | Required Args |
|---|---|---|
| @phoenix-observability debug-production-llm-failures | Track down why your chatbot gives inconsistent responses by analyzing detailed traces of user interactions | None |
| @phoenix-observability evaluate-model-performance-systematically | Run automated evaluations on your test datasets to measure hallucination, relevance, and toxicity | None |
| @phoenix-observability monitor-production-ai-systems | Set up real-time monitoring for token usage, latency, and quality metrics in your production environment | None |
Typical Use Cases
Debug production LLM failures
Track down why your chatbot gives inconsistent responses by analyzing detailed traces of user interactions
Evaluate model performance systematically
Run automated evaluations on your test datasets to measure hallucination, relevance, and toxicity
Monitor production AI systems
Set up real-time monitoring for token usage, latency, and quality metrics in your production environment
Overview
Phoenix - AI Observability Platform
Open-source AI observability and evaluation platform for LLM applications with tracing, evaluation, datasets, experiments, and real-time monitoring.
When to use Phoenix
Use Phoenix when:
- Debugging LLM application issues with detailed traces
- Running systematic evaluations on datasets
- Monitoring production LLM systems in real-time
- Building experiment pipelines for prompt/model comparison
- Self-hosting observability without vendor lock-in
Key features:
- Tracing: OpenTelemetry-based trace collection for any LLM framework
- Evaluation: LLM-as-judge evaluators for quality assessment
- Datasets: Versioned test sets for regression testing
- Experiments: Compare prompts, models, and configurations
- Playground: Interactive prompt testing with multiple models
- Open-source: Self-hosted with PostgreSQL or SQLite
When to consider alternatives:
- LangSmith: Managed platform with LangChain-first integration
- Weights & Biases: Deep learning experiment tracking focus
- Arize Cloud: Managed Phoenix with enterprise features
- MLflow: General ML lifecycle, model registry focus
Quick start
Installation
pip install arize-phoenix

# With specific backends
pip install "arize-phoenix[embeddings]"  # Embedding analysis
pip install arize-phoenix-otel           # OpenTelemetry config
pip install arize-phoenix-evals          # Evaluation framework
pip install arize-phoenix-client         # Lightweight REST client
Launch Phoenix server
import phoenix as px

# Launch in notebook (ThreadServer mode)
session = px.launch_app()

# View UI
session.view()      # Embedded iframe
print(session.url)  # http://localhost:6006
Command-line server (production)
# Start Phoenix server
phoenix serve

# With PostgreSQL
export PHOENIX_SQL_DATABASE_URL="postgresql://user:pass@host/db"
phoenix serve --port 6006
Basic tracing
from phoenix.otel import register
from openinference.instrumentation.openai import OpenAIInstrumentor

# Configure OpenTelemetry with Phoenix
tracer_provider = register(
    project_name="my-llm-app",
    endpoint="http://localhost:6006/v1/traces"
)

# Instrument OpenAI SDK
OpenAIInstrumentor().instrument(tracer_provider=tracer_provider)

# All OpenAI calls are now traced
from openai import OpenAI
client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Hello!"}]
)
Core concepts
Traces and spans
A trace represents a complete execution flow, while spans are individual operations within that trace.
from phoenix.otel import register
from opentelemetry import trace

# Setup tracing
tracer_provider = register(project_name="my-app")
tracer = trace.get_tracer(__name__)

# Create custom spans
with tracer.start_as_current_span("process_query") as span:
    span.set_attribute("input.value", query)

    # Child spans are automatically nested
    with tracer.start_as_current_span("retrieve_context"):
        context = retriever.search(query)

    with tracer.start_as_current_span("generate_response"):
        response = llm.generate(query, context)

    span.set_attribute("output.value", response)
Projects
Projects organize related traces:
import os
os.environ["PHOENIX_PROJECT_NAME"] = "production-chatbot"

# Or per tracer provider
from phoenix.otel import register
tracer_provider = register(project_name="experiment-v2")
Framework instrumentation
OpenAI
from phoenix.otel import register
from openinference.instrumentation.openai import OpenAIInstrumentor

tracer_provider = register()
OpenAIInstrumentor().instrument(tracer_provider=tracer_provider)
LangChain
from phoenix.otel import register
from openinference.instrumentation.langchain import LangChainInstrumentor

tracer_provider = register()
LangChainInstrumentor().instrument(tracer_provider=tracer_provider)

# All LangChain operations traced
from langchain_openai import ChatOpenAI
llm = ChatOpenAI(model="gpt-4o")
response = llm.invoke("Hello!")
LlamaIndex
from phoenix.otel import register
from openinference.instrumentation.llama_index import LlamaIndexInstrumentor

tracer_provider = register()
LlamaIndexInstrumentor().instrument(tracer_provider=tracer_provider)
Anthropic
from phoenix.otel import register
from openinference.instrumentation.anthropic import AnthropicInstrumentor

tracer_provider = register()
AnthropicInstrumentor().instrument(tracer_provider=tracer_provider)
Evaluation framework
Built-in evaluators
from phoenix.evals import (
    OpenAIModel,
    HallucinationEvaluator,
    RelevanceEvaluator,
    ToxicityEvaluator,
)

# Setup model for evaluation
eval_model = OpenAIModel(model="gpt-4o")

# Evaluate hallucination
hallucination_eval = HallucinationEvaluator(eval_model)
results = hallucination_eval.evaluate(
    input="What is the capital of France?",
    output="The capital of France is Paris.",
    reference="Paris is the capital of France."
)
Custom evaluators
import pandas as pd
from phoenix.evals import OpenAIModel, llm_classify

eval_model = OpenAIModel(model="gpt-4o")

HELPFULNESS_TEMPLATE = """
Evaluate if the response is helpful for the given question.

Question: {input}
Response: {output}

Is this response helpful? Answer 'helpful' or 'not_helpful'.
"""

def evaluate_helpfulness(input_text, output_text):
    # llm_classify operates on a DataFrame whose columns fill the template variables
    df = pd.DataFrame([{"input": input_text, "output": output_text}])
    return llm_classify(
        dataframe=df,
        model=eval_model,
        template=HELPFULNESS_TEMPLATE,
        rails=["helpful", "not_helpful"],
    )
Run evaluations on dataset
from phoenix import Client
from phoenix.evals import (
    OpenAIModel,
    HallucinationEvaluator,
    RelevanceEvaluator,
    run_evals,
)

client = Client()
eval_model = OpenAIModel(model="gpt-4o")

# Get spans to evaluate
spans_df = client.get_spans_dataframe(
    project_name="my-app",
    filter_condition="span_kind == 'LLM'"
)

# Run evaluations
eval_results = run_evals(
    dataframe=spans_df,
    evaluators=[
        HallucinationEvaluator(eval_model),
        RelevanceEvaluator(eval_model)
    ],
    provide_explanation=True
)

# Log results back to Phoenix
client.log_evaluations(eval_results)
Datasets and experiments
Create dataset
from phoenix import Client

client = Client()

# Create dataset
dataset = client.create_dataset(
    name="qa-test-set",
    description="QA evaluation dataset"
)

# Add examples
client.add_examples_to_dataset(
    dataset_name="qa-test-set",
    examples=[
        {
            "input": {"question": "What is Python?"},
            "output": {"answer": "A programming language"}
        },
        {
            "input": {"question": "What is ML?"},
            "output": {"answer": "Machine learning"}
        }
    ]
)
Run experiment
from phoenix import Client
from phoenix.experiments import run_experiment

client = Client()

def my_model(input_data):
    """Your model function."""
    question = input_data["question"]
    return {"answer": generate_answer(question)}

def accuracy_evaluator(input_data, output, expected):
    """Custom evaluator."""
    correct = expected["answer"].lower() in output["answer"].lower()
    return {
        "score": 1.0 if correct else 0.0,
        "label": "correct" if correct else "incorrect"
    }

# Run experiment
results = run_experiment(
    dataset_name="qa-test-set",
    task=my_model,
    evaluators=[accuracy_evaluator],
    experiment_name="baseline-v1"
)

print(f"Average accuracy: {results.aggregate_metrics['accuracy']}")
Client API
Query traces and spans
from phoenix import Client

client = Client(endpoint="http://localhost:6006")

# Get spans as DataFrame
spans_df = client.get_spans_dataframe(
    project_name="my-app",
    filter_condition="span_kind == 'LLM'",
    limit=1000
)

# Get specific span
span = client.get_span(span_id="abc123")

# Get trace
trace = client.get_trace(trace_id="xyz789")
Log feedback
from phoenix import Client

client = Client()

# Log user feedback
client.log_annotation(
    span_id="abc123",
    name="user_rating",
    annotator_kind="HUMAN",
    score=0.8,
    label="helpful",
    metadata={"comment": "Good response"}
)
Export data
# Export to pandas
df = client.get_spans_dataframe(project_name="my-app")

# Export traces
traces = client.list_traces(project_name="my-app")
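Once spans are exported, ordinary DataFrame or stdlib analysis applies. A minimal sketch of post-export analysis, using plain Python and hypothetical latency values standing in for rows returned by `get_spans_dataframe` (the helper name and record shape are illustrative, not a Phoenix API):

```python
from statistics import quantiles

def latency_percentiles(latencies_ms):
    """Compute p50/p95 from a list of span latencies in milliseconds."""
    cuts = quantiles(sorted(latencies_ms), n=100)  # 99 percentile cut points
    return {"p50": cuts[49], "p95": cuts[94]}

# Hypothetical latencies; in practice derive them from exported span timestamps
latencies = [100 * (i + 1) for i in range(20)]  # 100..2000 ms
print(latency_percentiles(latencies))
```

The same computation works directly on a DataFrame column via `df.quantile([0.5, 0.95])` if pandas is already in the pipeline.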
Production deployment
Docker
docker run -p 6006:6006 arizephoenix/phoenix:latest
With PostgreSQL
# Set database URL
export PHOENIX_SQL_DATABASE_URL="postgresql://user:pass@host:5432/phoenix"

# Start server
phoenix serve --host 0.0.0.0 --port 6006
Environment variables
| Variable | Description | Default |
|---|---|---|
| PHOENIX_PORT | HTTP server port | 6006 |
| PHOENIX_HOST | Server bind address | 127.0.0.1 |
| PHOENIX_GRPC_PORT | gRPC/OTLP port | 4317 |
| PHOENIX_SQL_DATABASE_URL | Database connection | SQLite (temp dir) |
| PHOENIX_WORKING_DIR | Data storage directory | OS temp |
| PHOENIX_ENABLE_AUTH | Enable authentication | false |
| PHOENIX_SECRET | JWT signing secret | Required if auth enabled |
With authentication
export PHOENIX_ENABLE_AUTH=true
export PHOENIX_SECRET="your-secret-key-min-32-chars"
export PHOENIX_ADMIN_SECRET="admin-bootstrap-token"

phoenix serve
Best practices
- Use projects: Separate traces by environment (dev/staging/prod)
- Add metadata: Include user IDs, session IDs for debugging
- Evaluate regularly: Run automated evaluations in CI/CD
- Version datasets: Track test set changes over time
- Monitor costs: Track token usage via Phoenix dashboards
- Self-host: Use PostgreSQL for production deployments
Common issues
Traces not appearing:
from phoenix.otel import register

# Verify endpoint
tracer_provider = register(
    project_name="my-app",
    endpoint="http://localhost:6006/v1/traces"  # Correct endpoint
)

# Force flush
from opentelemetry import trace
trace.get_tracer_provider().force_flush()
High memory in notebook:
# Close session when done
session = px.launch_app()
# ... do work ...
session.close()
px.close_app()
Database connection issues:
# Verify PostgreSQL connection
psql "$PHOENIX_SQL_DATABASE_URL" -c "SELECT 1"

# Check Phoenix logs
phoenix serve --log-level debug
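Before digging into instrumentation, it can help to confirm the collector endpoint is reachable at all. A generic TCP reachability check using only the standard library (not a Phoenix API):

```python
import socket

def is_reachable(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# e.g. check the default Phoenix HTTP port before registering the tracer
print(is_reachable("localhost", 6006))
```

If this returns False, the problem is the server or the network path, not the instrumentation code.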
References
- Advanced Usage - Custom evaluators, experiments, production setup
- Troubleshooting - Common issues, debugging, performance
Resources
- Documentation: https://docs.arize.com/phoenix
- Repository: https://github.com/Arize-ai/phoenix
- Docker Hub: https://hub.docker.com/r/arizephoenix/phoenix
- Version: 12.0.0+
- License: Apache 2.0
Information
- Author
- davila7
- Updated
- 2026-01-30
- Category
- debugging