ML Model Debugger
Pinpoint PyTorch training failures in seconds, not overnight
✨ The solution you've been looking for
Quickly diagnose machine learning model issues with detailed debugging info and fix suggestions
See It In Action
Interactive preview & real-world examples

AI Conversation Simulator
See how users interact with this skill
User Prompt
My model training isn't converging, can you help?
Skill Processing
Analyzing training process: checking loss curves, gradient flow, learning rate...
Agent Response
Found issue: Gradient explosion (gradients > 100 after layer 15). Suggestions: 1) Reduce learning rate to 1e-4; 2) Add gradient clipping; 3) Use LayerNorm. Here's the fix code...
User Prompt
My model inference is too slow; how can I optimize it?
Skill Processing
Performance analysis: profiling inference process, identifying bottlenecks...
Agent Response
Bottleneck in the attention layer (85% of total time). Suggestion: switching to Flash Attention 2 can speed up inference 3-5x. Here's the implementation...
Quick Start (3 Steps)
Get up and running in minutes
Install
Install the skill using Claude Code CLI
claude-code skill install ml-model-debugger
Config
export ML_FRAMEWORK=pytorch (or tensorflow/jax). Optional: create .ml-debugger.yaml for custom rules.
First Trigger
During training, ask: 'Why isn't my model converging?' or '@debug-ml check-training'
@debug-ml check-training model.py
Commands
| Command | Description | Required Args |
|---|---|---|
| @debug-ml check-training <file> | Diagnose training issues (vanishing/exploding gradients, overfitting) | Training script path |
| @debug-ml profile <file> | Analyze inference performance, identify bottlenecks | Inference script path |
| @debug-ml validate-data <data path> | Check data quality, distribution, and potential issues | Dataset path |
| @debug-ml visualize <checkpoint> | Visualize model weights, activations, and gradients | Model checkpoint path |
| @debug-ml compare <ckpt1> <ckpt2> | Compare different model checkpoints or configurations | Two checkpoint paths |
Typical Use Cases
Gradient Explosion Debugging
Training loss suddenly becomes NaN
User: "Training loss became NaN at epoch 3"
Agent: *Running gradient check*
"Detected gradient explosion:
- Layer 12 linear weights gradient: 1.2e8
- Layer 15 attention gradient: inf
Fix suggestions:
1) Add gradient clipping: clip_grad_norm_(model.parameters(), max_norm=1.0)
2) Reduce learning rate: 1e-3 → 1e-4
3) Check layer 15 initialization"
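The clipping and learning-rate suggestions above can be sketched as a minimal PyTorch training step. The model, data, and sizes here are illustrative placeholders, not taken from the diagnostic output:

```python
import torch
import torch.nn as nn

# Hypothetical toy model and batch, for illustration only.
model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 1))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # reduced from 1e-3
loss_fn = nn.MSELoss()

x, y = torch.randn(16, 32), torch.randn(16, 1)

optimizer.zero_grad()
loss = loss_fn(model(x), y)
loss.backward()
# Clip the global gradient norm to 1.0 before the optimizer step;
# clip_grad_norm_ returns the total norm measured before clipping.
total_norm = nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
```

Clipping by global norm rescales all gradients together, so it bounds the update magnitude without changing the update direction.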
Overfitting Detection
99% training accuracy, but only 60% validation accuracy
@debug-ml check-training train.py
Output:
"Detected severe overfitting:
- Train/Val gap: 39%
- Model capacity too large (10M params vs 5K training samples)
Suggestions:
1) Add Dropout(0.3) in fully connected layers
2) Use data augmentation
3) Reduce model size or increase training data"
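The Dropout suggestion above can be sketched as follows; the layer sizes are hypothetical, not taken from the report:

```python
import torch
import torch.nn as nn

# Classifier head with Dropout(0.3) between fully connected layers.
classifier = nn.Sequential(
    nn.Linear(512, 128),
    nn.ReLU(),
    nn.Dropout(0.3),  # zeroes 30% of activations at random during training
    nn.Linear(128, 10),
)
classifier.train()  # Dropout is active while training
classifier.eval()   # Dropout is disabled for validation/inference
```

Remember to call `train()`/`eval()` appropriately: Dropout only regularizes in training mode and becomes a no-op at inference.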
Inference Optimization
Batch inference speed doesn't meet expectations
@debug-ml profile inference.py
Output:
"Performance bottleneck analysis:
┌─────────────────┬──────────┬─────────┐
│ Layer │ Time │ Percent │
├─────────────────┼──────────┼─────────┤
│ SelfAttention │ 850ms │ 85% │
│ FFN │ 120ms │ 12% │
│ Others │ 30ms │ 3% │
└─────────────────┴──────────┴─────────┘
Optimization suggestions:
- Use Flash Attention 2 (expected 4x speedup)
- Enable torch.compile() (additional 20% speedup)"
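One widely available route to a fused, FlashAttention-style kernel is PyTorch 2.x's built-in torch.nn.functional.scaled_dot_product_attention, which dispatches to an optimized backend when hardware and dtypes allow. This sketch uses illustrative shapes and is not the skill's generated fix:

```python
import torch
import torch.nn.functional as F

# (batch, heads, seq_len, head_dim); shapes are illustrative.
q = torch.randn(2, 8, 128, 64)
k = torch.randn(2, 8, 128, 64)
v = torch.randn(2, 8, 128, 64)

# Dispatches to a fused (FlashAttention-style) kernel when available,
# falling back to a plain math implementation otherwise.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)

# torch.compile (PyTorch 2.x) can add further speedup on top, e.g.:
# model = torch.compile(model)
```

The fused path avoids materializing the full attention matrix, which is where most of the memory traffic in a naive implementation goes.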
Composability
Works seamlessly with data pipeline and experiment tracking skills to build complete ML development workflows
Example Workflow:
# Complete ML Debugging Workflow
@validate-dataset data/train # First validate data
@debug-ml check-training train.py # Debug training
@profile-model --gpu # Performance analysis
@log-experiment wandb # Record to experiment tracking system
Overview
ML Model Debugger is a debugging tool designed specifically for machine learning engineers, helping you quickly locate and resolve various issues during model training and inference.
Core Features
- Training Diagnostics: Automatically detect vanishing/exploding gradients, overfitting, and other issues
- Performance Analysis: Identify inference performance bottlenecks
- Data Validation: Check data quality and distribution
- Visualization: Visualize model weights, activations, and gradients
- Framework Support: Works with mainstream frameworks including PyTorch, TensorFlow, and JAX
Use Cases
- Model training not converging
- Inference speed too slow
- Abnormal model performance
- Data preprocessing issues
Advanced Features
Automatic Fix Suggestions
Generate fix code suggestions based on detected issues.
Real-time Monitoring
Monitor key metrics in real-time during training.
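A minimal version of such monitoring can be built on top of named_parameters; the helper name log_grad_norms is hypothetical and not part of the skill's API:

```python
import torch
import torch.nn as nn

def log_grad_norms(model: nn.Module) -> dict[str, float]:
    """Collect per-parameter gradient L2 norms after loss.backward().
    Hypothetical helper, not part of the skill's API."""
    return {name: p.grad.norm().item()
            for name, p in model.named_parameters()
            if p.grad is not None}

# Example: one backward pass on a toy layer, then inspect the norms.
model = nn.Linear(8, 4)
loss = model(torch.randn(2, 8)).sum()
loss.backward()
norms = log_grad_norms(model)  # keys: 'weight', 'bias'
```

Calling a helper like this once per step and plotting the results is often enough to catch exploding or vanishing gradients early.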
What Users Are Saying
Real feedback from the community
Spent two days debugging vanishing gradients, this tool found the root cause in 5 minutes - LayerNorm was in the wrong position. Amazing!
Performance analysis showed data loading was the bottleneck (70% of time), not the model itself. Optimized and training speed increased 3x.
Visualization features are great for seeing attention weight distributions. Would like more custom visualization options.
Security & Privacy
- Network Access: Optional; connects only when uploading experiment logs to Weights & Biases / MLflow
- File Permissions: Reads model files, training scripts, and checkpoints; writes debugging reports to ./debug-reports/
- Data Flow: Model weights and training data are analyzed locally; only metadata (such as loss curves) is optionally uploaded to experiment tracking platforms
- Sandbox: Runs in an isolated Python environment and loads models read-only (original checkpoints are never modified)
Information
- Author: AI Tools
- Version: 1.5.2
- License: Apache 2.0
- Updated: 2026-01-14
- Category: Machine Learning