TrainLens runs as a step-level autonomous agent alongside your training loop — observing every rank, every step, diagnosing failures as they form, and acting before the run crashes or wastes hours of compute.
~30% of runs hit a detectable failure — and SemiAnalysis measured the resulting compute waste at 2.8–10.7% of total cluster spend, logged separately from debugging costs.
Goodput loss: SemiAnalysis / Nebius, Total Cost of a GPU Cluster, Feb 2026 (2.795%–10.667% of cluster spend, a separate TCO line from debugging).
CO₂: GPU TDP × PUE 1.2 × 0.4 kg CO₂/kWh.
Compute recovery: ~60% of goodput loss recovered via early termination.
Debug savings: failures/month = (GPU count × 720 hrs) ÷ MTBF (25k GPU-hr enterprise, 20k standard, 15k silver tier, per SemiAnalysis); TrainLens surfaces the root cause live, saving 4 hrs per failure (diagnosis + re-launch setup) at a $180/hr loaded engineering rate.
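To make the arithmetic above concrete, here is a small worked example using the formulas in the methodology note. The cluster size, goodput-loss fraction, and GPU power draw are illustrative assumptions, not measured values.

```python
# Illustrative worked example of the savings methodology above.
# Cluster size, goodput-loss fraction, and GPU TDP are assumptions for the arithmetic.

gpus            = 512        # assumed cluster size
hours_per_month = 720
mtbf_gpu_hours  = 25_000     # enterprise tier per SemiAnalysis
eng_rate        = 180        # $/hr loaded engineering rate
hours_saved     = 4          # diagnosis + re-launch setup per failure
goodput_loss    = 0.05       # assumed 5%, within the 2.8–10.7% range
recovery_frac   = 0.60       # ~60% of goodput loss recovered via early termination
gpu_tdp_kw      = 0.7        # assumed 700 W TDP (e.g. H100 SXM)
pue             = 1.2
kg_co2_per_kwh  = 0.4

failures_per_month  = gpus * hours_per_month / mtbf_gpu_hours
debug_savings       = failures_per_month * hours_saved * eng_rate
gpu_hours_recovered = gpus * hours_per_month * goodput_loss * recovery_frac
co2_avoided_kg      = gpu_hours_recovered * gpu_tdp_kw * pue * kg_co2_per_kwh

print(f"failures/month:      {failures_per_month:.1f}")      # ≈ 14.7
print(f"debug savings:       ${debug_savings:,.0f}/month")   # ≈ $10,617
print(f"GPU-hours recovered: {gpu_hours_recovered:,.0f}")    # ≈ 11,059
print(f"CO₂ avoided:         {co2_avoided_kg:,.0f} kg")      # ≈ 3,716
```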
A GPU at 95% utilization still doesn't tell you which rank is dragging the job, whether your dataloader is stalling, or whether you'll hit OOM in 200 steps. An agent watching every step does.
DataLoader blocking the GPU? TrainLens surfaces dataloader wait time every step so you know the source immediately.
Unstable or drifting step duration signals memory fragmentation, NCCL variance, or scheduling noise — visible per step.
See worst-rank vs median-rank timing and skew across all DDP ranks. Know exactly which node is slowing the whole job.
GPU memory growing step over step? TrainLens tracks the allocation trend so you catch OOM before it crashes your run.
TrainLens runs as a separate aggregator process co-located with your training loop — a closed agent loop that reads signals, classifies run state, and acts autonomously. No CUDA stream contention. No process group hijacking.
One line to deploy. Wrap your step with trace_step(), use TrainLensTrainer, or attach TrainLensCallback. The agent auto-detects your stack — FSDP, ZeRO, Accelerate — and instruments itself.
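A minimal sketch of the trace_step() path is shown below. The name trace_step comes from the description above; the import path and call signature are assumptions.

```python
# Minimal sketch: wrapping a vanilla PyTorch loop with trace_step().
# The module path and signature of trace_step() are assumptions; only the
# name itself comes from the description above.
import torch
from trainlens import trace_step  # assumed import path

model = torch.nn.Linear(128, 10)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

for step in range(100):
    x = torch.randn(32, 128)
    y = torch.randint(0, 10, (32,))
    with trace_step():  # agent records timing, memory, and grad health for this step
        loss = torch.nn.functional.cross_entropy(model(x), y)
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```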
Step timing, GPU memory, gradient health, rank skew, MFU, communication overlap — the agent collects from every rank, every step, continuously. Nothing is sampled or buffered.
ML models run on the live signal stream and catch problems before the run crashes or wastes hours of compute. The agent surfaces the root cause, alerts you, or terminates the run autonomously. You set the policy; the agent executes.
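A policy might look roughly like the sketch below. The AgentPolicy class and its field names are illustrative assumptions, not the actual TrainLens configuration surface; only the idea of choosing between monitoring, alerting, escalating, and terminating comes from the text.

```python
# Hypothetical sketch of "you set the policy; the agent executes".
# Class name and fields are illustrative assumptions, not the real TrainLens API.
from dataclasses import dataclass

@dataclass
class AgentPolicy:
    on_gradient_explosion: str = "terminate"  # NaN/Inf or norm spike: stop the run
    on_oom_forecast: str = "alert"            # projected OOM within N steps: notify the owner
    on_rank_skew: str = "escalate"            # persistent straggler rank: open an incident
    default: str = "monitor"                  # everything else: observe and log

policy = AgentPolicy()
```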
Forward · backward · optimizer · overhead · dataloader wait — split and tracked per step, every step. The foundational signal that makes every other diagnosis possible.
Worst-rank vs median-rank timing and skew across all ranks. Pin the node dragging the job in seconds.
Allocated, reserved, and peak memory per step. See fragmentation building before it crashes the run.
Per-step gradient norm tracking, NaN / Inf detection, and spike detection. Auto-terminates on gradient explosion (a sketch of this check follows below).
Real attained MFU vs theoretical peak. Know instantly whether you are compute-bound or wasting hardware.
All-gather / reduce-scatter latency, DDP comm-compute overlap ratio, and pipeline parallel bubble ratio — per step.
trainlens deep adds per-layer forward / backward timing and memory — pin the bottleneck layer directly.
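To make the gradient-health check above concrete, here is a rough sketch of per-step NaN / Inf and norm-spike detection of the kind described. The thresholds, window length, and function shape are assumptions, not TrainLens internals.

```python
# Sketch of a per-step gradient-health check: NaN/Inf detection plus a simple
# spike test against a running median of recent gradient norms.
# Thresholds and window length are illustrative assumptions.
import math
from collections import deque

import torch

norm_history: deque[float] = deque(maxlen=200)

def gradient_health(model: torch.nn.Module, spike_factor: float = 10.0) -> str:
    total_sq = 0.0
    for p in model.parameters():
        if p.grad is None:
            continue
        if not torch.isfinite(p.grad).all():
            return "terminate"                     # NaN/Inf gradient: unrecoverable
        total_sq += p.grad.float().pow(2).sum().item()
    norm = math.sqrt(total_sq)
    if norm_history and norm > spike_factor * sorted(norm_history)[len(norm_history) // 2]:
        return "alert"                             # sudden spike vs. running median
    norm_history.append(norm)
    return "ok"
```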
Rich terminal dashboard during training, plus a NiceGUI browser UI — accessible via SSH port-forward for remote pods — for live monitoring and run comparison.
Every run is stored and labeled. Review outcomes, compare across sessions, and watch the agent's detection models improve as more labeled runs accumulate.
The agent runs ML models on your live signal stream and decides — monitor, alert, escalate, or terminate — before the run crashes or wastes hours of compute.
TrainLens emits per-step training signals directly into the tools you already use — under your existing W&B run, your active MLflow experiment, your Prometheus scrape target, or your OTel collector. No new dashboards. No new accounts.
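A rough sketch of wiring this up is below. The enable_exporters() helper and its keyword arguments are hypothetical; the real integration points are your active wandb.run or MLflow run, a Prometheus Pushgateway, and an OTel collector endpoint, as listed in the grid that follows.

```python
# Hypothetical sketch of attaching exporters to tools you already use.
# enable_exporters() and its arguments are assumptions, not the real TrainLens API.
import wandb
import trainlens  # assumed top-level module name

wandb.init(project="llm-pretrain")  # TrainLens would log under this existing run
trainlens.enable_exporters(         # hypothetical helper
    wandb=True,                                              # per-step signals + detection report
    prometheus_pushgateway="http://pushgw.internal:9091",    # assumed Pushgateway endpoint
    otlp_endpoint="http://otel-collector.internal:4317",     # assumed gRPC OTLP endpoint
)
```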
Weights & Biases
Pushed into your active wandb.run with a detection report after each run.
MLflow
Logged to your active MLflow run
Prometheus
Push to Pushgateway per step
OpenTelemetry
Export via gRPC or HTTP
Hugging Face
Drop-in TrainLensTrainer
PyTorch Lightning
Standard callback API
PyTorch
trace_step() context manager
Accelerate
trainlens_step() or zero-code callback
DeepSpeed
ZeRO-2/3 aware grad norms & memory
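For the Hugging Face path, a drop-in Trainer swap might look like the sketch below. The import path is an assumption, and the drop-in class is assumed to accept the same arguments as transformers.Trainer; only the name TrainLensTrainer comes from the integration list above.

```python
# Sketch of the drop-in TrainLensTrainer for a Hugging Face fine-tune.
# Import path assumed; everything else is a standard transformers setup.
from datasets import load_dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer, TrainingArguments
from trainlens.integrations.transformers import TrainLensTrainer  # assumed import path

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")

dataset = load_dataset("glue", "sst2", split="train[:1%]")
dataset = dataset.map(
    lambda ex: tokenizer(ex["sentence"], truncation=True, padding="max_length", max_length=128),
    batched=True,
)

# Only the Trainer class changes relative to a plain transformers.Trainer setup.
trainer = TrainLensTrainer(
    model=model,
    args=TrainingArguments(output_dir="out", per_device_train_batch_size=8, num_train_epochs=1),
    train_dataset=dataset,
)
trainer.train()
```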
TrainLens is benchmarked end-to-end across GPU generations and training scales — single node to multi-node, single GPU to hundreds of ranks, across clusters and at the org level.
Full benchmark methodology and results available on request. Contact us →
From a solo researcher chasing a slow run to an MLOps team protecting a 1,000-GPU cluster — TrainLens gives you live visibility and autonomous failure handling at every scale.
You notice step time doubled between two runs. TrainLens shows the dataloader wait spiked from 2ms to 180ms — a dataset preprocessing bottleneck that nvidia-smi wouldn't reveal.
An 8-GPU DDP run is 30% slower than expected. TrainLens shows rank 3 is consistently 40ms behind the median — a single slow NVLink link causing the whole job to wait at the barrier.
A long training job starts showing signs of gradient instability at step 800. TrainLens predicts failure and terminates the run cleanly — saving 6 hours of A100 time before the inevitable crash.
Deploy in two minutes. The agent watches every step, catches failures before they crash your run, and protects your compute budget — autonomously.