LangSmith Observability
Debug, evaluate, and monitor your LLM applications in production
LLM observability platform for tracing, evaluation, and monitoring. Use when debugging LLM applications, evaluating model outputs against datasets, monitoring production systems, or building systematic testing pipelines for AI applications.
Example
Prompt: "My RAG application is giving inconsistent answers. Help me set up tracing to debug the retrieval and generation steps."
Result: complete visibility into each step of your LLM pipeline, with inputs, outputs, latency, and error details.
Quick Start (3 Steps)
Get up and running in minutes
Install
claude-code skill install langsmith-observability
First Trigger
@langsmith-observability help
Commands
| Command | Description | Required Args |
|---|---|---|
| @langsmith-observability debug-llm-application-issues | Trace through complex LLM chains and agents to identify where failures occur | None |
| @langsmith-observability systematic-model-evaluation | Test model performance against datasets with automated scoring | None |
| @langsmith-observability production-monitoring-setup | Monitor LLM applications in production with cost and performance tracking | None |
Typical Use Cases
Debug LLM Application Issues
Trace through complex LLM chains and agents to identify where failures occur
Systematic Model Evaluation
Test model performance against datasets with automated scoring
Production Monitoring Setup
Monitor LLM applications in production with cost and performance tracking
Overview
LangSmith - LLM Observability Platform
Development platform for debugging, evaluating, and monitoring language models and AI applications.
When to use LangSmith
Use LangSmith when:
- Debugging LLM application issues (prompts, chains, agents)
- Evaluating model outputs systematically against datasets
- Monitoring production LLM systems
- Building regression testing for AI features
- Analyzing latency, token usage, and costs
- Collaborating on prompt engineering
Key features:
- Tracing: Capture inputs, outputs, latency for all LLM calls
- Evaluation: Systematic testing with built-in and custom evaluators
- Datasets: Create test sets from production traces or manually
- Monitoring: Track metrics, errors, and costs in production
- Integrations: Works with OpenAI, Anthropic, LangChain, LlamaIndex
Use alternatives instead:
- Weights & Biases: Deep learning experiment tracking, model training
- MLflow: General ML lifecycle, model registry focus
- Arize/WhyLabs: ML monitoring, data drift detection
Quick start
Installation
pip install langsmith

# Set environment variables
export LANGSMITH_API_KEY="your-api-key"
export LANGSMITH_TRACING=true
Basic tracing with @traceable
from langsmith import traceable
from openai import OpenAI

client = OpenAI()

@traceable
def generate_response(prompt: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content

# Automatically traced to LangSmith
result = generate_response("What is machine learning?")
OpenAI wrapper (automatic tracing)
from langsmith.wrappers import wrap_openai
from openai import OpenAI

# Wrap client for automatic tracing
client = wrap_openai(OpenAI())

# All calls automatically traced
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Hello!"}]
)
Core concepts
Runs and traces
A run is a single execution unit (LLM call, chain, tool). Runs form hierarchical traces showing the full execution flow.
from langsmith import traceable

@traceable(run_type="chain")
def process_query(query: str) -> str:
    # Parent run
    context = retrieve_context(query)           # Child run
    response = generate_answer(query, context)  # Child run
    return response

@traceable(run_type="retriever")
def retrieve_context(query: str) -> list:
    # Assumes a vector_store defined elsewhere
    return vector_store.search(query)

@traceable(run_type="llm")
def generate_answer(query: str, context: list) -> str:
    # Assumes an llm client defined elsewhere
    return llm.invoke(f"Context: {context}\n\nQuestion: {query}")
Projects
Projects organize related runs. Set via environment or code:
import os
from langsmith import traceable

os.environ["LANGSMITH_PROJECT"] = "my-project"

# Or per-function
@traceable(project_name="my-project")
def my_function():
    pass
Client API
from langsmith import Client

client = Client()

# List runs
runs = list(client.list_runs(
    project_name="my-project",
    filter='eq(status, "success")',
    limit=100
))

# Get run details
run = client.read_run(run_id="...")

# Create feedback
client.create_feedback(
    run_id="...",
    key="correctness",
    score=0.9,
    comment="Good answer"
)
Datasets and evaluation
Create dataset
from langsmith import Client

client = Client()

# Create dataset
dataset = client.create_dataset("qa-test-set", description="QA evaluation")

# Add examples
client.create_examples(
    inputs=[
        {"question": "What is Python?"},
        {"question": "What is ML?"}
    ],
    outputs=[
        {"answer": "A programming language"},
        {"answer": "Machine learning"}
    ],
    dataset_id=dataset.id
)
Run evaluation
from langsmith import evaluate

def my_model(inputs: dict) -> dict:
    # Your model logic
    return {"answer": generate_answer(inputs["question"])}

def correctness_evaluator(run, example):
    prediction = run.outputs["answer"]
    reference = example.outputs["answer"]
    score = 1.0 if reference.lower() in prediction.lower() else 0.0
    return {"key": "correctness", "score": score}

results = evaluate(
    my_model,
    data="qa-test-set",
    evaluators=[correctness_evaluator],
    experiment_prefix="v1"
)

print(f"Average score: {results.aggregate_metrics['correctness']}")
Built-in evaluators
from langsmith import evaluate
from langsmith.evaluation import LangChainStringEvaluator

# Use LangChain evaluators
results = evaluate(
    my_model,
    data="qa-test-set",
    evaluators=[
        LangChainStringEvaluator("qa"),
        LangChainStringEvaluator("cot_qa")
    ]
)
Advanced tracing
Tracing context
from langsmith import tracing_context

with tracing_context(
    project_name="experiment-1",
    tags=["production", "v2"],
    metadata={"version": "2.0"}
):
    # All traceable calls inherit context
    result = my_function()
Manual runs
from langsmith import trace

with trace(
    name="custom_operation",
    run_type="tool",
    inputs={"query": "test"}
) as run:
    result = do_something()
    run.end(outputs={"result": result})
Process inputs/outputs
from langsmith import traceable

def sanitize_inputs(inputs: dict) -> dict:
    if "password" in inputs:
        inputs["password"] = "***"
    return inputs

@traceable(process_inputs=sanitize_inputs)
def login(username: str, password: str):
    return authenticate(username, password)
Sampling
import os
os.environ["LANGSMITH_TRACING_SAMPLING_RATE"] = "0.1"  # 10% sampling
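Conceptually, sampling admits each top-level trace with probability equal to the rate. A minimal sketch of that behavior (illustrative only, not LangSmith's actual implementation):

```python
import random

def should_trace(sampling_rate: float) -> bool:
    """Decide whether to record a trace, given a rate in [0.0, 1.0]."""
    # random.random() returns a float in [0.0, 1.0), so a rate of 1.0
    # always traces and a rate of 0.0 never does
    return random.random() < sampling_rate

# With a 10% rate, roughly 1 in 10 requests gets traced
sampled = sum(should_trace(0.1) for _ in range(10_000))
```

Lower rates reduce ingestion volume and overhead at the cost of trace completeness.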
LangChain integration
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate

# Tracing enabled automatically with LANGSMITH_TRACING=true
llm = ChatOpenAI(model="gpt-4o")
prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful assistant."),
    ("user", "{input}")
])

chain = prompt | llm

# All chain runs traced automatically
response = chain.invoke({"input": "Hello!"})
Production monitoring
Hub prompts
from langsmith import Client

client = Client()

# Pull prompt from hub
prompt = client.pull_prompt("my-org/qa-prompt")

# Use in application
result = prompt.invoke({"question": "What is AI?"})
Async client
from langsmith import AsyncClient

async def main():
    client = AsyncClient()

    runs = []
    async for run in client.list_runs(project_name="my-project"):
        runs.append(run)

    return runs
Feedback collection
from langsmith import Client

client = Client()

# Collect user feedback
def record_feedback(run_id: str, user_rating: int, comment: str | None = None):
    client.create_feedback(
        run_id=run_id,
        key="user_rating",
        score=user_rating / 5.0,  # Normalize to 0-1
        comment=comment
    )

# In your application
record_feedback(run_id="...", user_rating=4, comment="Helpful response")
Testing integration
Pytest integration
from langsmith import test

@test
def test_qa_accuracy():
    result = my_qa_function("What is Python?")
    assert "programming" in result.lower()
Evaluation in CI/CD
from langsmith import evaluate

def run_evaluation():
    results = evaluate(
        my_model,
        data="regression-test-set",
        evaluators=[accuracy_evaluator]
    )

    # Fail CI if accuracy drops
    assert results.aggregate_metrics["accuracy"] >= 0.9, \
        f"Accuracy {results.aggregate_metrics['accuracy']} below threshold"
Best practices
- Structured naming - Use consistent project/run naming conventions
- Add metadata - Include version, environment, user info
- Sample in production - Use sampling rate to control volume
- Create datasets - Build test sets from interesting production cases
- Automate evaluation - Run evaluations in CI/CD pipelines
- Monitor costs - Track token usage and latency trends
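To make the cost bullet concrete, here is a hedged sketch that estimates spend from the prompt/completion token counts LangSmith records on each run. The per-million-token prices below are placeholder values, not real pricing:

```python
def estimate_cost_usd(prompt_tokens: int, completion_tokens: int,
                      price_in_per_m: float = 2.50,
                      price_out_per_m: float = 10.00) -> float:
    """Estimate USD cost from token counts and per-million-token prices.

    The default prices are illustrative placeholders; substitute your
    model's actual rates.
    """
    return (prompt_tokens * price_in_per_m
            + completion_tokens * price_out_per_m) / 1_000_000

# e.g. a run with 1,200 prompt tokens and 300 completion tokens
cost = estimate_cost_usd(1200, 300)  # 0.006 with the placeholder prices
```

Summing this over the runs returned by `client.list_runs(...)` gives a simple per-project spend trend to watch alongside latency.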
Common issues
Traces not appearing:
import os
# Ensure tracing is enabled
os.environ["LANGSMITH_TRACING"] = "true"
os.environ["LANGSMITH_API_KEY"] = "your-key"

# Verify connection
from langsmith import Client
client = Client()
print(client.list_projects())  # Should work
High latency from tracing:
import os
from langsmith import Client

# Enable background batching (default)
client = Client(auto_batch_tracing=True)

# Or use sampling
os.environ["LANGSMITH_TRACING_SAMPLING_RATE"] = "0.1"
Large payloads:
from langsmith import traceable

# Hide sensitive/large fields
@traceable(
    process_inputs=lambda x: {k: v for k, v in x.items() if k != "large_field"}
)
def my_function(data):
    pass
References
- Advanced Usage - Custom evaluators, distributed tracing, hub prompts
- Troubleshooting - Common issues, debugging, performance
Resources
- Documentation: https://docs.smith.langchain.com
- Python SDK: https://github.com/langchain-ai/langsmith-sdk
- Web App: https://smith.langchain.com
- Version: 0.2.0+
- License: MIT
Information
- Author: davila7
- Updated: 2026-01-30
- Category: debugging
Related Skills
- Phoenix Observability: Open-source AI observability platform for LLM tracing, evaluation, and monitoring.