TrainLens runs as a step-level autonomous agent alongside your training loop — observing every rank, every step, diagnosing failures as they form, and acting before the run crashes or wastes hours of compute.
~30% of runs hit a detectable failure — and SemiAnalysis measured the resulting compute waste at 2.8–10.7% of total cluster spend, logged separately from debugging costs.
Goodput loss: SemiAnalysis / Nebius, Total Cost of a GPU Cluster, Feb 2026 (2.795%–10.667% of cluster spend, a separate TCO line from debugging).
CO₂: GPU TDP × PUE 1.2 × 0.4 kg CO₂/kWh.
Compute recovery: ~60% of goodput loss recovered via early termination.
Debug savings: failures/month = (GPU count × 720 hrs) ÷ MTBF (25k GPU-hr enterprise, 20k standard, 15k silver tier, per SemiAnalysis); TrainLens surfaces the root cause live, saving 4 hrs per failure (diagnosis + re-launch setup) at a $180/hr loaded engineering rate.
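To make the arithmetic above concrete, here is a small worked example using the formulas in the methodology note. The cluster size, goodput-loss fraction, and GPU power draw are illustrative assumptions, not measured values.

```python
# Illustrative worked example of the savings methodology above.
# Cluster size, goodput-loss fraction, and GPU TDP are assumptions for the arithmetic.

gpus            = 512        # assumed cluster size
hours_per_month = 720
mtbf_gpu_hours  = 25_000     # enterprise tier per SemiAnalysis
eng_rate        = 180        # $/hr loaded engineering rate
hours_saved     = 4          # diagnosis + re-launch setup per failure
goodput_loss    = 0.05       # assumed 5%, within the 2.8–10.7% range
recovery_frac   = 0.60       # ~60% of goodput loss recovered via early termination
gpu_tdp_kw      = 0.7        # assumed 700 W TDP (e.g. H100 SXM)
pue             = 1.2
kg_co2_per_kwh  = 0.4

failures_per_month  = gpus * hours_per_month / mtbf_gpu_hours
debug_savings       = failures_per_month * hours_saved * eng_rate
gpu_hours_recovered = gpus * hours_per_month * goodput_loss * recovery_frac
co2_avoided_kg      = gpu_hours_recovered * gpu_tdp_kw * pue * kg_co2_per_kwh

print(f"failures/month:      {failures_per_month:.1f}")      # ≈ 14.7
print(f"debug savings:       ${debug_savings:,.0f}/month")   # ≈ $10,617
print(f"GPU-hours recovered: {gpu_hours_recovered:,.0f}")    # ≈ 11,059
print(f"CO₂ avoided:         {co2_avoided_kg:,.0f} kg")      # ≈ 3,716
```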
A GPU at 95% utilization still doesn't tell you which rank is dragging the job, whether your dataloader is stalling, or whether you'll hit OOM in 200 steps. An agent watching every step does.
DataLoader blocking the GPU? TrainLens surfaces dataloader wait time every step so you know the source immediately.
Unstable or drifting step duration signals memory fragmentation, NCCL variance, or scheduling noise — visible per step.
See worst-rank vs median-rank timing and skew across all DDP ranks. Know exactly which node is slowing the whole job.
GPU memory growing step over step? TrainLens tracks the allocation trend so you catch OOM before it crashes your run.
TrainLens runs as a separate aggregator process co-located with your training loop — a closed agent loop that reads signals, classifies run state, and acts autonomously. No CUDA stream contention. No process group hijacking.
One line to deploy. Wrap your step with trace_step(), use TrainLensTrainer, or attach TrainLensCallback. The agent auto-detects your stack — FSDP, ZeRO, Accelerate — and instruments itself.
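A minimal sketch of the trace_step() path is shown below. The name trace_step comes from the description above; the import path and call signature are assumptions.

```python
# Minimal sketch: wrapping a vanilla PyTorch loop with trace_step().
# The module path and signature of trace_step() are assumptions; only the
# name itself comes from the description above.
import torch
from trainlens import trace_step  # assumed import path

model = torch.nn.Linear(128, 10)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

for step in range(100):
    x = torch.randn(32, 128)
    y = torch.randint(0, 10, (32,))
    with trace_step():  # agent records timing, memory, and grad health for this step
        loss = torch.nn.functional.cross_entropy(model(x), y)
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```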
Step timing, GPU memory, gradient health, rank skew, MFU, communication overlap — the agent collects from every rank, every step, continuously. Nothing is sampled or buffered.
ML models run on the live signal stream and catch problems before the run crashes or wastes hours of compute. The agent surfaces the root cause, alerts you, or terminates the run autonomously. You set the policy; the agent executes.
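A policy might look roughly like the sketch below. The AgentPolicy class and its field names are illustrative assumptions, not the actual TrainLens configuration surface; only the idea of choosing between monitoring, alerting, escalating, and terminating comes from the text.

```python
# Hypothetical sketch of "you set the policy; the agent executes".
# Class name and fields are illustrative assumptions, not the real TrainLens API.
from dataclasses import dataclass

@dataclass
class AgentPolicy:
    on_gradient_explosion: str = "terminate"  # NaN/Inf or norm spike: stop the run
    on_oom_forecast: str = "alert"            # projected OOM within N steps: notify the owner
    on_rank_skew: str = "escalate"            # persistent straggler rank: open an incident
    default: str = "monitor"                  # everything else: observe and log

policy = AgentPolicy()
```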
Forward · backward · optimizer · overhead · dataloader wait — split and tracked per step, every step. The foundational signal that makes every other diagnosis possible.
Worst-rank vs median-rank timing and skew across all ranks. Pin the node dragging the job in seconds.
Allocated, reserved, and peak memory per step. See fragmentation building before it crashes the run.
Per-step gradient norm tracking, NaN / Inf detection, and spike detection. Auto-terminates on gradient explosion (a sketch of this check follows below).
Real attained MFU vs theoretical peak. Know instantly whether you are compute-bound or wasting hardware.
All-gather / reduce-scatter latency, DDP comm-compute overlap ratio, and pipeline parallel bubble ratio — per step.
trainlens deep adds per-layer forward / backward timing and memory — pin the bottleneck layer directly.
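To make the gradient-health check above concrete, here is a rough sketch of per-step NaN / Inf and norm-spike detection of the kind described. The thresholds, window length, and function shape are assumptions, not TrainLens internals.

```python
# Sketch of a per-step gradient-health check: NaN/Inf detection plus a simple
# spike test against a running median of recent gradient norms.
# Thresholds and window length are illustrative assumptions.
import math
from collections import deque

import torch

norm_history: deque[float] = deque(maxlen=200)

def gradient_health(model: torch.nn.Module, spike_factor: float = 10.0) -> str:
    total_sq = 0.0
    for p in model.parameters():
        if p.grad is None:
            continue
        if not torch.isfinite(p.grad).all():
            return "terminate"                     # NaN/Inf gradient: unrecoverable
        total_sq += p.grad.float().pow(2).sum().item()
    norm = math.sqrt(total_sq)
    if norm_history and norm > spike_factor * sorted(norm_history)[len(norm_history) // 2]:
        return "alert"                             # sudden spike vs. running median
    norm_history.append(norm)
    return "ok"
```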
Rich terminal dashboard during training, plus a NiceGUI browser UI — accessible via SSH port-forward for remote pods — for live monitoring and run comparison.
Every run is stored and labeled. Review outcomes, compare across sessions, and watch the agent's detection models improve as more labeled runs accumulate.
The agent runs ML models on your live signal stream and decides — monitor, alert, escalate, or terminate — before the run crashes or wastes hours of compute.
TrainLens emits per-step training signals directly into the tools you already use — under your existing W&B run, your active MLflow experiment, your Prometheus scrape target, or your OTel collector. No new dashboards. No new accounts.
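A rough sketch of wiring this up is below. The enable_exporters() helper and its keyword arguments are hypothetical; the real integration points are your active wandb.run or MLflow run, a Prometheus Pushgateway, and an OTel collector endpoint, as listed in the grid that follows.

```python
# Hypothetical sketch of attaching exporters to tools you already use.
# enable_exporters() and its arguments are assumptions, not the real TrainLens API.
import wandb
import trainlens  # assumed top-level module name

wandb.init(project="llm-pretrain")  # TrainLens would log under this existing run
trainlens.enable_exporters(         # hypothetical helper
    wandb=True,                                              # per-step signals + detection report
    prometheus_pushgateway="http://pushgw.internal:9091",    # assumed Pushgateway endpoint
    otlp_endpoint="http://otel-collector.internal:4317",     # assumed gRPC OTLP endpoint
)
```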
Weights & Biases
Pushed into your active wandb.run with a detection report after each run.
MLflow
Logged to your active MLflow run
Prometheus
Push to Pushgateway per step
OpenTelemetry
Export via gRPC or HTTP
Hugging Face
Drop-in TrainLensTrainer
PyTorch Lightning
Standard callback API
PyTorch
trace_step() context manager
Accelerate
trainlens_step() or zero-code callback
DeepSpeed
ZeRO-2/3 aware grad norms & memory
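For the Hugging Face path, a drop-in Trainer swap might look like the sketch below. The import path is an assumption, and the drop-in class is assumed to accept the same arguments as transformers.Trainer; only the name TrainLensTrainer comes from the integration list above.

```python
# Sketch of the drop-in TrainLensTrainer for a Hugging Face fine-tune.
# Import path assumed; everything else is a standard transformers setup.
from datasets import load_dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer, TrainingArguments
from trainlens.integrations.transformers import TrainLensTrainer  # assumed import path

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")

dataset = load_dataset("glue", "sst2", split="train[:1%]")
dataset = dataset.map(
    lambda ex: tokenizer(ex["sentence"], truncation=True, padding="max_length", max_length=128),
    batched=True,
)

# Only the Trainer class changes relative to a plain transformers.Trainer setup.
trainer = TrainLensTrainer(
    model=model,
    args=TrainingArguments(output_dir="out", per_device_train_batch_size=8, num_train_epochs=1),
    train_dataset=dataset,
)
trainer.train()
```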
TrainLens is benchmarked end-to-end across GPU generations and training scales — single node to multi-node, single GPU to hundreds of ranks, across clusters and at the org level.
Full benchmark methodology and results available on request. Contact us →
From a solo researcher chasing a slow run to an MLOps team protecting a 1,000-GPU cluster — TrainLens gives you live visibility and autonomous failure handling at every scale.
You notice step time doubled between two runs. TrainLens shows the dataloader wait spiked from 2ms to 180ms — a dataset preprocessing bottleneck that nvidia-smi wouldn't reveal.
An 8-GPU DDP run is 30% slower than expected. TrainLens shows rank 3 is consistently 40ms behind the median — a single slow NVLink link causing the whole job to wait at the barrier.
A long training job starts showing signs of gradient instability at step 800. TrainLens predicts failure and terminates the run cleanly — saving 6 hours of A100 time before the inevitable crash.
Deploy in two minutes. The agent watches every step, catches failures before they crash your run, and protects your compute budget — autonomously.