ML Model Debugger
Pinpoint PyTorch training failures in seconds, not overnight
✨ The solution you've been looking for
Quickly diagnose machine learning model issues with detailed debugging info and fix suggestions
See It In Action
Interactive preview & real-world examples

AI Conversation Simulator
See how users interact with this skill
User Prompt
My model training isn't converging, can you help?
Skill Processing
Analyzing training process: checking loss curves, gradient flow, learning rate...
Agent Response
Found issue: Gradient explosion (gradients > 100 after layer 15). Suggestions: 1) Reduce learning rate to 1e-4; 2) Add gradient clipping; 3) Use LayerNorm. Here's the fix code...
User Prompt
My model inference is too slow; how can I optimize it?
Skill Processing
Performance analysis: profiling inference process, identifying bottlenecks...
Agent Response
Bottleneck in the attention layer (85% of total time). Suggestion: switching to Flash Attention 2 can speed up inference 3-5x. Here's the implementation...
Quick Start (3 Steps)
Get up and running in minutes
Install
Install the skill using Claude Code CLI
claude-code skill install ml-model-debugger
Config
export ML_FRAMEWORK=pytorch (or tensorflow/jax). Optional: create .ml-debugger.yaml for custom rules.
First Trigger
During training, ask: 'Why isn't my model converging?' or '@debug-ml check-training'
@debug-ml check-training model.py
Commands
| Command | Description | Required Args |
|---|---|---|
| @debug-ml check-training <file> | Diagnose training issues (vanishing/exploding gradients, overfitting) | Training script path |
| @debug-ml profile <file> | Analyze inference performance, identify bottlenecks | Inference script path |
| @debug-ml validate-data <data path> | Check data quality, distribution, and potential issues | Dataset path |
| @debug-ml visualize <checkpoint> | Visualize model weights, activations, and gradients | Model checkpoint path |
| @debug-ml compare <ckpt1> <ckpt2> | Compare different model checkpoints or configurations | Two checkpoint paths |
Typical Use Cases
Gradient Explosion Debugging
Training loss suddenly becomes NaN
User: "Training loss became NaN at epoch 3"
Agent: *Running gradient check*
"Detected gradient explosion:
- Layer 12 linear weights gradient: 1.2e8
- Layer 15 attention gradient: inf
Fix suggestions:
1) Add gradient clipping: clip_grad_norm_(model.parameters(), max_norm=1.0)
2) Reduce learning rate: 1e-3 → 1e-4
3) Check layer 15 initialization"
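The clipping and learning-rate suggestions above can be sketched as a minimal PyTorch training step. The model, data, and sizes here are illustrative placeholders, not taken from the diagnostic output:

```python
import torch
import torch.nn as nn

# Hypothetical toy model and batch, for illustration only.
model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 1))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # reduced from 1e-3
loss_fn = nn.MSELoss()

x, y = torch.randn(16, 32), torch.randn(16, 1)

optimizer.zero_grad()
loss = loss_fn(model(x), y)
loss.backward()
# Clip the global gradient norm to 1.0 before the optimizer step;
# clip_grad_norm_ returns the total norm measured before clipping.
total_norm = nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
```

Clipping by global norm rescales all gradients together, so it bounds the update magnitude without changing the update direction.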
Overfitting Detection
99% training accuracy, but only 60% validation accuracy
@debug-ml check-training train.py
Output:
"Detected severe overfitting:
- Train/Val gap: 39%
- Model capacity too large (10M params vs 5K training samples)
Suggestions:
1) Add Dropout(0.3) in fully connected layers
2) Use data augmentation
3) Reduce model size or increase training data"
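The Dropout suggestion above can be sketched as follows; the layer sizes are hypothetical, not taken from the report:

```python
import torch
import torch.nn as nn

# Classifier head with Dropout(0.3) between fully connected layers.
classifier = nn.Sequential(
    nn.Linear(512, 128),
    nn.ReLU(),
    nn.Dropout(0.3),  # zeroes 30% of activations at random during training
    nn.Linear(128, 10),
)
classifier.train()  # Dropout is active while training
classifier.eval()   # Dropout is disabled for validation/inference
```

Remember to call `train()`/`eval()` appropriately: Dropout only regularizes in training mode and becomes a no-op at inference.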
Inference Optimization
Batch inference speed doesn't meet expectations
@debug-ml profile inference.py
Output:
"Performance bottleneck analysis:
┌─────────────────┬──────────┬─────────┐
│ Layer │ Time │ Percent │
├─────────────────┼──────────┼─────────┤
│ SelfAttention │ 850ms │ 85% │
│ FFN │ 120ms │ 12% │
│ Others │ 30ms │ 3% │
└─────────────────┴──────────┴─────────┘
Optimization suggestions:
- Use Flash Attention 2 (expected 4x speedup)
- Enable torch.compile() (additional 20% speedup)"
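One widely available route to a fused, FlashAttention-style kernel is PyTorch 2.x's built-in torch.nn.functional.scaled_dot_product_attention, which dispatches to an optimized backend when hardware and dtypes allow. This sketch uses illustrative shapes and is not the skill's generated fix:

```python
import torch
import torch.nn.functional as F

# (batch, heads, seq_len, head_dim); shapes are illustrative.
q = torch.randn(2, 8, 128, 64)
k = torch.randn(2, 8, 128, 64)
v = torch.randn(2, 8, 128, 64)

# Dispatches to a fused (FlashAttention-style) kernel when available,
# falling back to a plain math implementation otherwise.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)

# torch.compile (PyTorch 2.x) can add further speedup on top, e.g.:
# model = torch.compile(model)
```

The fused path avoids materializing the full attention matrix, which is where most of the memory traffic in a naive implementation goes.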
Composability
Works seamlessly with data pipeline and experiment tracking skills to build complete ML development workflows
Example Workflow:
# Complete ML Debugging Workflow
@validate-dataset data/train # First validate data
@debug-ml check-training train.py # Debug training
@profile-model --gpu # Performance analysis
@log-experiment wandb # Record to experiment tracking system
Overview
ML Model Debugger is a debugging tool designed specifically for machine learning engineers, helping you quickly locate and resolve various issues during model training and inference.
Core Features
- Training Diagnostics: Automatically detect vanishing/exploding gradients, overfitting, and other issues
- Performance Analysis: Identify inference performance bottlenecks
- Data Validation: Check data quality and distribution
- Visualization: Visualize model weights, activations, and gradients
- Framework Support: Works with mainstream frameworks including PyTorch, TensorFlow, and JAX
Use Cases
- Model training not converging
- Inference speed too slow
- Abnormal model performance
- Data preprocessing issues
Advanced Features
Automatic Fix Suggestions
Generate fix code suggestions based on detected issues.
Real-time Monitoring
Monitor key metrics in real-time during training.
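A minimal version of such monitoring can be built on top of named_parameters; the helper name log_grad_norms is hypothetical and not part of the skill's API:

```python
import torch
import torch.nn as nn

def log_grad_norms(model: nn.Module) -> dict[str, float]:
    """Collect per-parameter gradient L2 norms after loss.backward().
    Hypothetical helper, not part of the skill's API."""
    return {name: p.grad.norm().item()
            for name, p in model.named_parameters()
            if p.grad is not None}

# Example: one backward pass on a toy layer, then inspect the norms.
model = nn.Linear(8, 4)
loss = model(torch.randn(2, 8)).sum()
loss.backward()
norms = log_grad_norms(model)  # keys: 'weight', 'bias'
```

Calling a helper like this once per step and plotting the results is often enough to catch exploding or vanishing gradients early.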
What Users Are Saying
Real feedback from the community
Spent two days debugging vanishing gradients, this tool found the root cause in 5 minutes - LayerNorm was in the wrong position. Amazing!
Performance analysis showed data loading was the bottleneck (70% of time), not the model itself. Optimized and training speed increased 3x.
Visualization features are great for seeing attention weight distributions. Would like more custom visualization options.
Security & Privacy
- Network Access: Optional; connects only when uploading experiment logs to Weights & Biases / MLflow
- File Permissions: Reads model files, training scripts, and checkpoints; writes debugging reports to ./debug-reports/
- Data Flow: Model weights and training data are analyzed locally; only metadata (such as loss curves) is optionally uploaded to experiment tracking platforms
- Sandbox: Runs in an isolated Python environment and loads models read-only (original checkpoints are never modified)
Information
- Author: AI Tools
- Version: 1.5.2
- License: Apache 2.0
- Updated: 2026-01-14
- Category: Machine Learning