LangSmith Observability

Debug, evaluate, and monitor your LLM applications in production


LLM observability platform for tracing, evaluation, and monitoring. Use it when debugging LLM applications, evaluating model outputs against datasets, monitoring production systems, or building systematic testing pipelines for AI applications.

Tags: observability, tracing, evaluation, monitoring, debugging, llm-ops, testing, langsmith

Example Interaction
User Prompt

My RAG application is giving inconsistent answers. Help me set up tracing to debug the retrieval and generation steps.

Agent Response

Tracing gives you complete visibility into each step of your LLM pipeline, including inputs, outputs, latency, and error details.

Quick Start (3 Steps)

Get up and running in minutes

1

Install

claude-code skill install langsmith-observability

2

Config

3

First Trigger

@langsmith-observability help
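Step 2 (Config) typically amounts to exporting your LangSmith credentials before running your application. A minimal sketch, using the same environment variables as the Installation section below (values are placeholders):

```shell
# LangSmith credentials and tracing switch (placeholder values)
export LANGSMITH_API_KEY="your-api-key"
export LANGSMITH_TRACING=true
# Optional: group traces under a named project
export LANGSMITH_PROJECT="my-project"
```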

Commands

| Command | Description | Required Args |
| --- | --- | --- |
| @langsmith-observability debug-llm-application-issues | Trace through complex LLM chains and agents to identify where failures occur | None |
| @langsmith-observability systematic-model-evaluation | Test model performance against datasets with automated scoring | None |
| @langsmith-observability production-monitoring-setup | Monitor LLM applications in production with cost and performance tracking | None |

Typical Use Cases

Debug LLM Application Issues

Trace through complex LLM chains and agents to identify where failures occur

Systematic Model Evaluation

Test model performance against datasets with automated scoring

Production Monitoring Setup

Monitor LLM applications in production with cost and performance tracking

Overview

LangSmith - LLM Observability Platform

Development platform for debugging, evaluating, and monitoring language models and AI applications.

When to use LangSmith

Use LangSmith when:

  • Debugging LLM application issues (prompts, chains, agents)
  • Evaluating model outputs systematically against datasets
  • Monitoring production LLM systems
  • Building regression testing for AI features
  • Analyzing latency, token usage, and costs
  • Collaborating on prompt engineering

Key features:

  • Tracing: Capture inputs, outputs, latency for all LLM calls
  • Evaluation: Systematic testing with built-in and custom evaluators
  • Datasets: Create test sets from production traces or manually
  • Monitoring: Track metrics, errors, and costs in production
  • Integrations: Works with OpenAI, Anthropic, LangChain, LlamaIndex

When to use alternatives instead:

  • Weights & Biases: Deep learning experiment tracking, model training
  • MLflow: General ML lifecycle, model registry focus
  • Arize/WhyLabs: ML monitoring, data drift detection

Quick start

Installation

pip install langsmith

# Set environment variables
export LANGSMITH_API_KEY="your-api-key"
export LANGSMITH_TRACING=true

Basic tracing with @traceable

from langsmith import traceable
from openai import OpenAI

client = OpenAI()

@traceable
def generate_response(prompt: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content

# Automatically traced to LangSmith
result = generate_response("What is machine learning?")

OpenAI wrapper (automatic tracing)

from langsmith.wrappers import wrap_openai
from openai import OpenAI

# Wrap client for automatic tracing
client = wrap_openai(OpenAI())

# All calls automatically traced
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Hello!"}]
)

Core concepts

Runs and traces

A run is a single execution unit (LLM call, chain, tool). Runs form hierarchical traces showing the full execution flow.

from langsmith import traceable

# Assumes `vector_store` and `llm` are initialized elsewhere in your application

@traceable(run_type="chain")
def process_query(query: str) -> str:
    # Parent run
    context = retrieve_context(query)  # Child run
    response = generate_answer(query, context)  # Child run
    return response

@traceable(run_type="retriever")
def retrieve_context(query: str) -> list:
    return vector_store.search(query)

@traceable(run_type="llm")
def generate_answer(query: str, context: list) -> str:
    return llm.invoke(f"Context: {context}\n\nQuestion: {query}")

Projects

Projects organize related runs. Set via environment or code:

import os
from langsmith import traceable

os.environ["LANGSMITH_PROJECT"] = "my-project"

# Or per-function
@traceable(project_name="my-project")
def my_function():
    pass

Client API

from langsmith import Client

client = Client()

# List runs
runs = list(client.list_runs(
    project_name="my-project",
    filter='eq(status, "success")',
    limit=100
))

# Get run details
run = client.read_run(run_id="...")

# Create feedback
client.create_feedback(
    run_id="...",
    key="correctness",
    score=0.9,
    comment="Good answer"
)

Datasets and evaluation

Create dataset

from langsmith import Client

client = Client()

# Create dataset
dataset = client.create_dataset("qa-test-set", description="QA evaluation")

# Add examples
client.create_examples(
    inputs=[
        {"question": "What is Python?"},
        {"question": "What is ML?"}
    ],
    outputs=[
        {"answer": "A programming language"},
        {"answer": "Machine learning"}
    ],
    dataset_id=dataset.id
)

Run evaluation

from langsmith import evaluate

def my_model(inputs: dict) -> dict:
    # Your model logic
    return {"answer": generate_answer(inputs["question"])}

def correctness_evaluator(run, example):
    prediction = run.outputs["answer"]
    reference = example.outputs["answer"]
    score = 1.0 if reference.lower() in prediction.lower() else 0.0
    return {"key": "correctness", "score": score}

results = evaluate(
    my_model,
    data="qa-test-set",
    evaluators=[correctness_evaluator],
    experiment_prefix="v1"
)

print(f"Average score: {results.aggregate_metrics['correctness']}")

Built-in evaluators

from langsmith import evaluate
from langsmith.evaluation import LangChainStringEvaluator

# Use LangChain evaluators (my_model as defined above)
results = evaluate(
    my_model,
    data="qa-test-set",
    evaluators=[
        LangChainStringEvaluator("qa"),
        LangChainStringEvaluator("cot_qa")
    ]
)

Advanced tracing

Tracing context

from langsmith import tracing_context

with tracing_context(
    project_name="experiment-1",
    tags=["production", "v2"],
    metadata={"version": "2.0"}
):
    # All traceable calls inherit context
    result = my_function()

Manual runs

from langsmith import trace

with trace(
    name="custom_operation",
    run_type="tool",
    inputs={"query": "test"}
) as run:
    result = do_something()
    run.end(outputs={"result": result})

Process inputs/outputs

from langsmith import traceable

def sanitize_inputs(inputs: dict) -> dict:
    if "password" in inputs:
        inputs["password"] = "***"
    return inputs

@traceable(process_inputs=sanitize_inputs)
def login(username: str, password: str):
    return authenticate(username, password)  # `authenticate` defined elsewhere

Sampling

import os
os.environ["LANGSMITH_TRACING_SAMPLING_RATE"] = "0.1"  # 10% sampling
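Conceptually, rate-based sampling keeps roughly that fraction of traces. A sketch of the idea, for illustration only (this is not the SDK's internal implementation):

```python
import random

def should_sample(rate: float) -> bool:
    # Keep roughly `rate` fraction of traces (illustrative only)
    return random.random() < rate

random.seed(0)  # deterministic for the example
kept = sum(should_sample(0.1) for _ in range(10_000))
# kept lands close to 1,000 out of 10,000 traces
```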

LangChain integration

from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate

# Tracing enabled automatically with LANGSMITH_TRACING=true
llm = ChatOpenAI(model="gpt-4o")
prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful assistant."),
    ("user", "{input}")
])

chain = prompt | llm

# All chain runs traced automatically
response = chain.invoke({"input": "Hello!"})

Production monitoring

Hub prompts

from langsmith import Client

client = Client()

# Pull prompt from hub
prompt = client.pull_prompt("my-org/qa-prompt")

# Use in application
result = prompt.invoke({"question": "What is AI?"})

Async client

from langsmith import AsyncClient

async def main():
    client = AsyncClient()

    runs = []
    async for run in client.list_runs(project_name="my-project"):
        runs.append(run)

    return runs

Feedback collection

from langsmith import Client

client = Client()

# Collect user feedback
def record_feedback(run_id: str, user_rating: int, comment: str = None):
    client.create_feedback(
        run_id=run_id,
        key="user_rating",
        score=user_rating / 5.0,  # Normalize to 0-1
        comment=comment
    )

# In your application
record_feedback(run_id="...", user_rating=4, comment="Helpful response")

Testing integration

Pytest integration

from langsmith import test

@test
def test_qa_accuracy():
    result = my_qa_function("What is Python?")
    assert "programming" in result.lower()

Evaluation in CI/CD

from langsmith import evaluate

def run_evaluation():
    results = evaluate(
        my_model,
        data="regression-test-set",
        evaluators=[accuracy_evaluator]
    )

    # Fail CI if accuracy drops
    assert results.aggregate_metrics["accuracy"] >= 0.9, \
        f"Accuracy {results.aggregate_metrics['accuracy']} below threshold"

Best practices

  1. Structured naming - Use consistent project/run naming conventions
  2. Add metadata - Include version, environment, user info
  3. Sample in production - Use sampling rate to control volume
  4. Create datasets - Build test sets from interesting production cases
  5. Automate evaluation - Run evaluations in CI/CD pipelines
  6. Monitor costs - Track token usage and latency trends
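Practices 1 and 2 can be as simple as a small naming helper plus environment variables. A sketch (the `project_name` helper is hypothetical, not part of the langsmith SDK):

```python
import os

def project_name(app: str, env: str, version: str) -> str:
    # Hypothetical helper for consistent project names, e.g. "rag-qa-prod-v2"
    return f"{app}-{env}-{version}"

# Route traces to the structured project name (practice 1)
os.environ["LANGSMITH_PROJECT"] = project_name("rag-qa", "prod", "v2")
```

Version, environment, and user metadata (practice 2) can then be attached per run via the `metadata` argument shown in the Tracing context section above.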

Common issues

Traces not appearing:

import os
# Ensure tracing is enabled
os.environ["LANGSMITH_TRACING"] = "true"
os.environ["LANGSMITH_API_KEY"] = "your-key"

# Verify connection
from langsmith import Client
client = Client()
print(list(client.list_projects()))  # Should list your projects

High latency from tracing:

import os
from langsmith import Client

# Enable background batching (default)
client = Client(auto_batch_tracing=True)

# Or use sampling
os.environ["LANGSMITH_TRACING_SAMPLING_RATE"] = "0.1"

Large payloads:

from langsmith import traceable

# Hide sensitive/large fields
@traceable(
    process_inputs=lambda x: {k: v for k, v in x.items() if k != "large_field"}
)
def my_function(data):
    pass


Environment Matrix

Dependencies

langsmith>=0.2.0
Python 3.8+

Framework Support

  • LangChain ✓ (recommended)
  • LlamaIndex ✓
  • OpenAI ✓
  • Anthropic ✓

Context Window

Token Usage ~1K-5K tokens for typical tracing operations


Information

Author
davila7
Updated
2026-01-30
Category
debugging