ML Model Debugger

Pinpoint PyTorch training failures in seconds, not overnight

✨ The solution you've been looking for

Verified
Tested and verified by our team
Official
Officially endorsed by the platform
4.6 / 5.0
3,800 Downloads
280 Stars
Medium Token Usage

Quickly diagnose machine learning model issues with detailed debugging info and fix suggestions

Machine Learning Debugging Model Optimization TensorFlow PyTorch
Repository

See It In Action

Interactive preview & real-world examples


AI Conversation Simulator

See how users interact with this skill

User Prompt

My model training isn't converging, can you help?

Skill Processing

Analyzing training process: checking loss curves, gradient flow, learning rate...

Agent Response

Found issue: Gradient explosion (gradients > 100 after layer 15). Suggestions: 1) Reduce learning rate to 1e-4; 2) Add gradient clipping; 3) Use LayerNorm. Here's the fix code...

User Prompt

My model inference is too slow. How can I optimize it?

Skill Processing

Performance analysis: profiling inference process, identifying bottlenecks...

Agent Response

Bottleneck in attention layer (85% of time). Suggestion: switching to Flash Attention 2 can speed up inference 3-5x. Here's the implementation...

Quick Start (3 Steps)

Get up and running in minutes

1

Install

Install the skill using Claude Code CLI

claude-code skill install ml-model-debugger
2

Config

Set your framework: export ML_FRAMEWORK=pytorch (or tensorflow/jax). Optional: create .ml-debugger.yaml for custom rules.
3

First Trigger

During training, ask: 'Why isn't my model converging?' or '@debug-ml check-training'

@debug-ml check-training model.py

Commands

| Command | Description | Required Args |
| --- | --- | --- |
| @debug-ml check-training <file> | Diagnose training issues (vanishing/exploding gradients, overfitting) | Training script path |
| @debug-ml profile <file> | Analyze inference performance and identify bottlenecks | Inference script path |
| @debug-ml validate-data <data path> | Check data quality, distribution, and potential issues | Dataset path |
| @debug-ml visualize <checkpoint> | Visualize model weights, activations, and gradients | Model checkpoint path |
| @debug-ml compare <ckpt1> <ckpt2> | Compare two model checkpoints or configurations | Two checkpoint paths |

Typical Use Cases

Gradient Explosion Debugging

Training loss suddenly becomes NaN

User: "Training loss became NaN at epoch 3"
Agent: *Running gradient check*
"Detected gradient explosion:
- Layer 12 linear weights gradient: 1.2e8
- Layer 15 attention gradient: inf

Fix suggestions:
1) Add gradient clipping: clip_grad_norm_(model.parameters(), max_norm=1.0)
2) Reduce learning rate: 1e-3 → 1e-4
3) Check layer 15 initialization"
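The clipping fix suggested above can be sketched in PyTorch. The model, batch, and sizes here are illustrative placeholders standing in for the real training script:

```python
import torch
import torch.nn as nn

# Toy model and batch; stand-ins for the real training script.
model = nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4)  # reduced from 1e-3

x, y = torch.randn(8, 10), torch.randn(8, 1)

optimizer.zero_grad()
loss = nn.functional.mse_loss(model(x), y)
loss.backward()

# Rescale gradients so their global L2 norm is at most 1.0,
# preventing a single bad batch from blowing up the weights.
total_norm = nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
```

clip_grad_norm_ returns the pre-clipping norm, which is worth logging: a sudden jump in it is often the first visible sign of instability.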

Overfitting Detection

99% training accuracy, but only 60% validation accuracy

@debug-ml check-training train.py

Output:
"Detected severe overfitting:
- Train/Val gap: 39%
- Model capacity too large (10M params vs 5K training samples)

Suggestions:
1) Add Dropout(0.3) in fully connected layers
2) Use data augmentation
3) Reduce model size or increase training data"
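Suggestion 1 above, adding Dropout(0.3) to the fully connected layers, might look like this in PyTorch (layer sizes are illustrative):

```python
import torch.nn as nn

# Same idea as the suggestion: Dropout(0.3) after each hidden FC layer.
model = nn.Sequential(
    nn.Linear(784, 256),
    nn.ReLU(),
    nn.Dropout(0.3),  # randomly zeroes 30% of activations during training
    nn.Linear(256, 64),
    nn.ReLU(),
    nn.Dropout(0.3),
    nn.Linear(64, 10),
)

model.train()  # dropout active while training
model.eval()   # dropout disabled for validation/inference
```

Remember to call model.eval() before computing validation accuracy; otherwise dropout stays active and understates it.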

Inference Optimization

Batch inference speed doesn't meet expectations

@debug-ml profile inference.py

Output:
"Performance bottleneck analysis:
┌─────────────────┬──────────┬─────────┐
│ Layer           │ Time     │ Percent │
├─────────────────┼──────────┼─────────┤
│ SelfAttention   │ 850ms    │ 85%     │
│ FFN             │ 120ms    │ 12%     │
│ Others          │ 30ms     │ 3%      │
└─────────────────┴──────────┴─────────┘

Optimization suggestions:
- Use Flash Attention 2 (expected 4x speedup)
- Enable torch.compile() (additional 20% speedup)"
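As a sketch of the attention fix: in PyTorch 2.x, torch.nn.functional.scaled_dot_product_attention dispatches to a fused, FlashAttention-style kernel on supported GPUs. Shapes below are illustrative; the actual speedup depends on hardware and sequence length:

```python
import torch
import torch.nn.functional as F

# (batch, heads, seq_len, head_dim) -- illustrative shapes
q = torch.randn(2, 8, 128, 64)
k = torch.randn(2, 8, 128, 64)
v = torch.randn(2, 8, 128, 64)

# Fused attention; replaces a manual softmax(q @ k.T / sqrt(d)) @ v,
# which materializes the full (seq_len x seq_len) attention matrix.
out = F.scaled_dot_product_attention(q, k, v)
```

torch.compile(model) can then be layered on top for additional graph-level fusion, as the report suggests.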

Composability

Works seamlessly with data pipeline and experiment tracking skills to build complete ML development workflows.

Works Well With:

Data Pipeline Builder Experiment Tracker Model Optimizer Dataset Validator

Example Workflow:

# Complete ML Debugging Workflow
@validate-dataset data/train  # First validate data
@debug-ml check-training train.py  # Debug training
@profile-model --gpu  # Performance analysis
@log-experiment wandb  # Record to experiment tracking system

Overview

ML Model Debugger is a debugging tool designed for machine learning engineers, helping you quickly locate and resolve issues during model training and inference.

Core Features

  • Training Diagnostics: Automatically detect vanishing/exploding gradients, overfitting, and other issues
  • Performance Analysis: Identify model inference performance bottlenecks
  • Data Validation: Check data quality and distribution
  • Visualization: Provide rich visualization tools
  • Framework Support: Works with mainstream frameworks including PyTorch, TensorFlow, and JAX
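As an illustration of the kind of check behind Training Diagnostics, a gradient-norm classifier might look like the sketch below. The function name and thresholds are hypothetical, not the skill's actual implementation:

```python
# Classify per-layer gradient norms; thresholds are illustrative.
def diagnose_gradients(grad_norms, low=1e-6, high=1e3):
    """Map {layer_name: grad_norm} to 'vanishing'/'exploding'/'healthy'."""
    labels = {}
    for name, norm in grad_norms.items():
        if norm < low:
            labels[name] = "vanishing"
        elif norm > high:
            labels[name] = "exploding"
        else:
            labels[name] = "healthy"
    return labels

report = diagnose_gradients({"layer12": 1.2e8, "layer3": 1e-9, "layer1": 0.5})
print(report)  # {'layer12': 'exploding', 'layer3': 'vanishing', 'layer1': 'healthy'}
```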

Use Cases

  • Model training not converging
  • Inference speed too slow
  • Abnormal model performance
  • Data preprocessing issues

Advanced Features

Automatic Fix Suggestions

Generate fix code suggestions based on detected issues.

Real-time Monitoring

Monitor key metrics in real-time during training.
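A minimal sketch of how such monitoring can be done in plain PyTorch, using tensor hooks to record per-parameter gradient norms on each backward pass (the model and metric choice are illustrative):

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(4, 4), nn.ReLU(), nn.Linear(4, 1))
grad_norms = {}

def track(name):
    # The hook runs during backward() and records the gradient's L2 norm.
    def hook(grad):
        grad_norms[name] = grad.norm().item()
    return hook

for name, param in model.named_parameters():
    param.register_hook(track(name))

loss = model(torch.randn(3, 4)).sum()
loss.backward()

for name, norm in sorted(grad_norms.items()):
    print(f"{name}: grad norm {norm:.4f}")
```

The same dictionary can be flushed to a dashboard or experiment tracker each step to watch for drift toward vanishing or exploding values.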

What Users Are Saying

Real feedback from the community

D
deep_learner_42
Transformer Training

Spent two days debugging vanishing gradients, this tool found the root cause in 5 minutes - LayerNorm was in the wrong position. Amazing!

C
cv_engineer
Computer Vision

Performance analysis showed data loading was the bottleneck (70% of time), not the model itself. Optimized and training speed increased 3x.

R
research_scientist
NLP Research

Visualization features are great for seeing attention weight distributions. Would like more custom visualization options.

Environment Matrix

Dependencies

Python 3.9+
PyTorch 2.0+ / TensorFlow 2.10+ / JAX 0.4+
CUDA 11.8+ (optional for GPU acceleration)

Framework Support

PyTorch 2.x ✓ (native support)
TensorFlow 2.10+ ✓
JAX/Flax 0.4+ ✓
HuggingFace Transformers ✓
Lightning AI ✓

Model Compatibility

Claude 3.5 Sonnet ✓ (recommended)
Claude 3 Opus ✓
GPT-4 Turbo ✓

Context Window

Token Usage: Analyzing a single training script uses ~2K-5K tokens, depending on code complexity.

Security & Privacy

Network Access
Optional - only connects when uploading experiment logs to Weights & Biases / MLflow
File Permissions
Read: model files, training scripts, checkpoints. Write: debugging reports to ./debug-reports/
Data Flow
Model weights and training data are analyzed locally. Only metadata (like loss curves) is optionally uploaded to experiment tracking platforms.
Sandbox
Runs in an isolated Python environment and loads models in read-only mode (original checkpoints are never modified)

Information

Author
AI Tools
Version
1.5.2
License
Apache 2.0
Updated
2026-01-14
Category
Machine Learning