LangSmith Observability

Debug, evaluate, and monitor your LLM applications in production


LLM observability platform for tracing, evaluation, and monitoring. Use it when debugging LLM applications, evaluating model outputs against datasets, monitoring production systems, or building systematic testing pipelines for AI applications.

Tags: observability, tracing, evaluation, monitoring, debugging, llm-ops, testing, langsmith

Example Interaction
User Prompt

My RAG application is giving inconsistent answers. Help me set up tracing to debug the retrieval and generation steps.

Agent Response

Tracing gives you complete visibility into each step of your LLM pipeline, including inputs, outputs, latency, and error details.

Quick Start (3 Steps)

Get up and running in minutes

1

Install

claude-code skill install langsmith-observability

2

Config

3

First Trigger

@langsmith-observability help
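Step 2 (Config) typically amounts to exporting your LangSmith credentials before running your application. A minimal sketch, using the same environment variables as the Installation section below (values are placeholders):

```shell
# LangSmith credentials and tracing switch (placeholder values)
export LANGSMITH_API_KEY="your-api-key"
export LANGSMITH_TRACING=true
# Optional: group traces under a named project
export LANGSMITH_PROJECT="my-project"
```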

Commands

| Command | Description | Required Args |
| --- | --- | --- |
| @langsmith-observability debug-llm-application-issues | Trace through complex LLM chains and agents to identify where failures occur | None |
| @langsmith-observability systematic-model-evaluation | Test model performance against datasets with automated scoring | None |
| @langsmith-observability production-monitoring-setup | Monitor LLM applications in production with cost and performance tracking | None |

Typical Use Cases

Debug LLM Application Issues

Trace through complex LLM chains and agents to identify where failures occur

Systematic Model Evaluation

Test model performance against datasets with automated scoring

Production Monitoring Setup

Monitor LLM applications in production with cost and performance tracking

Overview

LangSmith - LLM Observability Platform

Development platform for debugging, evaluating, and monitoring language models and AI applications.

When to use LangSmith

Use LangSmith when:

  • Debugging LLM application issues (prompts, chains, agents)
  • Evaluating model outputs systematically against datasets
  • Monitoring production LLM systems
  • Building regression testing for AI features
  • Analyzing latency, token usage, and costs
  • Collaborating on prompt engineering

Key features:

  • Tracing: Capture inputs, outputs, latency for all LLM calls
  • Evaluation: Systematic testing with built-in and custom evaluators
  • Datasets: Create test sets from production traces or manually
  • Monitoring: Track metrics, errors, and costs in production
  • Integrations: Works with OpenAI, Anthropic, LangChain, LlamaIndex

When to use alternatives instead:

  • Weights & Biases: Deep learning experiment tracking, model training
  • MLflow: General ML lifecycle, model registry focus
  • Arize/WhyLabs: ML monitoring, data drift detection

Quick start

Installation

pip install langsmith

# Set environment variables
export LANGSMITH_API_KEY="your-api-key"
export LANGSMITH_TRACING=true

Basic tracing with @traceable

from langsmith import traceable
from openai import OpenAI

client = OpenAI()

@traceable
def generate_response(prompt: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content

# Automatically traced to LangSmith
result = generate_response("What is machine learning?")

OpenAI wrapper (automatic tracing)

from langsmith.wrappers import wrap_openai
from openai import OpenAI

# Wrap client for automatic tracing
client = wrap_openai(OpenAI())

# All calls automatically traced
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Hello!"}]
)

Core concepts

Runs and traces

A run is a single execution unit (LLM call, chain, tool). Runs form hierarchical traces showing the full execution flow.

from langsmith import traceable

# Assumes `vector_store` and `llm` are initialized elsewhere in your application

@traceable(run_type="chain")
def process_query(query: str) -> str:
    # Parent run
    context = retrieve_context(query)  # Child run
    response = generate_answer(query, context)  # Child run
    return response

@traceable(run_type="retriever")
def retrieve_context(query: str) -> list:
    return vector_store.search(query)

@traceable(run_type="llm")
def generate_answer(query: str, context: list) -> str:
    return llm.invoke(f"Context: {context}\n\nQuestion: {query}")

Projects

Projects organize related runs. Set via environment or code:

import os
from langsmith import traceable

os.environ["LANGSMITH_PROJECT"] = "my-project"

# Or per-function
@traceable(project_name="my-project")
def my_function():
    pass

Client API

from langsmith import Client

client = Client()

# List runs
runs = list(client.list_runs(
    project_name="my-project",
    filter='eq(status, "success")',
    limit=100
))

# Get run details
run = client.read_run(run_id="...")

# Create feedback
client.create_feedback(
    run_id="...",
    key="correctness",
    score=0.9,
    comment="Good answer"
)

Datasets and evaluation

Create dataset

from langsmith import Client

client = Client()

# Create dataset
dataset = client.create_dataset("qa-test-set", description="QA evaluation")

# Add examples
client.create_examples(
    inputs=[
        {"question": "What is Python?"},
        {"question": "What is ML?"}
    ],
    outputs=[
        {"answer": "A programming language"},
        {"answer": "Machine learning"}
    ],
    dataset_id=dataset.id
)

Run evaluation

from langsmith import evaluate

def my_model(inputs: dict) -> dict:
    # Your model logic
    return {"answer": generate_answer(inputs["question"])}

def correctness_evaluator(run, example):
    prediction = run.outputs["answer"]
    reference = example.outputs["answer"]
    score = 1.0 if reference.lower() in prediction.lower() else 0.0
    return {"key": "correctness", "score": score}

results = evaluate(
    my_model,
    data="qa-test-set",
    evaluators=[correctness_evaluator],
    experiment_prefix="v1"
)

print(f"Average score: {results.aggregate_metrics['correctness']}")

Built-in evaluators

from langsmith import evaluate
from langsmith.evaluation import LangChainStringEvaluator

# Use LangChain evaluators (my_model as defined above)
results = evaluate(
    my_model,
    data="qa-test-set",
    evaluators=[
        LangChainStringEvaluator("qa"),
        LangChainStringEvaluator("cot_qa")
    ]
)

Advanced tracing

Tracing context

from langsmith import tracing_context

with tracing_context(
    project_name="experiment-1",
    tags=["production", "v2"],
    metadata={"version": "2.0"}
):
    # All traceable calls inherit context
    result = my_function()

Manual runs

from langsmith import trace

with trace(
    name="custom_operation",
    run_type="tool",
    inputs={"query": "test"}
) as run:
    result = do_something()
    run.end(outputs={"result": result})

Process inputs/outputs

from langsmith import traceable

def sanitize_inputs(inputs: dict) -> dict:
    if "password" in inputs:
        inputs["password"] = "***"
    return inputs

@traceable(process_inputs=sanitize_inputs)
def login(username: str, password: str):
    return authenticate(username, password)  # `authenticate` defined elsewhere

Sampling

import os
os.environ["LANGSMITH_TRACING_SAMPLING_RATE"] = "0.1"  # 10% sampling
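Conceptually, rate-based sampling keeps roughly that fraction of traces. A sketch of the idea, for illustration only (this is not the SDK's internal implementation):

```python
import random

def should_sample(rate: float) -> bool:
    # Keep roughly `rate` fraction of traces (illustrative only)
    return random.random() < rate

random.seed(0)  # deterministic for the example
kept = sum(should_sample(0.1) for _ in range(10_000))
# kept lands close to 1,000 out of 10,000 traces
```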

LangChain integration

from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate

# Tracing enabled automatically with LANGSMITH_TRACING=true
llm = ChatOpenAI(model="gpt-4o")
prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful assistant."),
    ("user", "{input}")
])

chain = prompt | llm

# All chain runs traced automatically
response = chain.invoke({"input": "Hello!"})

Production monitoring

Hub prompts

from langsmith import Client

client = Client()

# Pull prompt from hub
prompt = client.pull_prompt("my-org/qa-prompt")

# Use in application
result = prompt.invoke({"question": "What is AI?"})

Async client

from langsmith import AsyncClient

async def main():
    client = AsyncClient()

    runs = []
    async for run in client.list_runs(project_name="my-project"):
        runs.append(run)

    return runs

Feedback collection

from langsmith import Client

client = Client()

# Collect user feedback
def record_feedback(run_id: str, user_rating: int, comment: str = None):
    client.create_feedback(
        run_id=run_id,
        key="user_rating",
        score=user_rating / 5.0,  # Normalize to 0-1
        comment=comment
    )

# In your application
record_feedback(run_id="...", user_rating=4, comment="Helpful response")

Testing integration

Pytest integration

from langsmith import test

@test
def test_qa_accuracy():
    result = my_qa_function("What is Python?")
    assert "programming" in result.lower()

Evaluation in CI/CD

from langsmith import evaluate

def run_evaluation():
    results = evaluate(
        my_model,
        data="regression-test-set",
        evaluators=[accuracy_evaluator]
    )

    # Fail CI if accuracy drops
    assert results.aggregate_metrics["accuracy"] >= 0.9, \
        f"Accuracy {results.aggregate_metrics['accuracy']} below threshold"

Best practices

  1. Structured naming - Use consistent project/run naming conventions
  2. Add metadata - Include version, environment, user info
  3. Sample in production - Use sampling rate to control volume
  4. Create datasets - Build test sets from interesting production cases
  5. Automate evaluation - Run evaluations in CI/CD pipelines
  6. Monitor costs - Track token usage and latency trends
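Practices 1 and 2 can be as simple as a small naming helper plus environment variables. A sketch (the `project_name` helper is hypothetical, not part of the langsmith SDK):

```python
import os

def project_name(app: str, env: str, version: str) -> str:
    # Hypothetical helper for consistent project names, e.g. "rag-qa-prod-v2"
    return f"{app}-{env}-{version}"

# Route traces to the structured project name (practice 1)
os.environ["LANGSMITH_PROJECT"] = project_name("rag-qa", "prod", "v2")
```

Version, environment, and user metadata (practice 2) can then be attached per run via the `metadata` argument shown in the Tracing context section above.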

Common issues

Traces not appearing:

import os
# Ensure tracing is enabled
os.environ["LANGSMITH_TRACING"] = "true"
os.environ["LANGSMITH_API_KEY"] = "your-key"

# Verify connection
from langsmith import Client
client = Client()
print(list(client.list_projects()))  # Should list your projects

High latency from tracing:

import os
from langsmith import Client

# Enable background batching (default)
client = Client(auto_batch_tracing=True)

# Or use sampling
os.environ["LANGSMITH_TRACING_SAMPLING_RATE"] = "0.1"

Large payloads:

from langsmith import traceable

# Hide sensitive/large fields
@traceable(
    process_inputs=lambda x: {k: v for k, v in x.items() if k != "large_field"}
)
def my_function(data):
    pass


Environment Matrix

Dependencies

langsmith>=0.2.0
Python 3.8+

Framework Support

  • LangChain ✓ (recommended)
  • LlamaIndex ✓
  • OpenAI ✓
  • Anthropic ✓

Context Window

Token Usage ~1K-5K tokens for typical tracing operations


Information

Author
davila7
Updated
2026-01-30
Category
debugging