Autonomous training agent — now in beta

An agent for your training jobs. It detects failures, diagnoses root causes, and saves your compute.

TrainLens runs as a step-level autonomous agent alongside your training loop — observing every rank, every step, diagnosing failures as they form, and acting before the run crashes or wastes hours of compute.

Observe · Diagnose · Act
<1% throughput overhead (TrainLens S1 benchmark · A100 & H100)
~30% of training runs hit a detectable failure or silent slowdown (W&B State of ML 2024)
Real Cost of Wasted Compute

GPU failures cost more than your bill shows.

~30% of runs hit a detectable failure — and SemiAnalysis measured the resulting compute waste at 2.8–10.7% of total cluster spend, logged separately from debugging costs.

Example cluster: 512 × H100 (cluster tier per SemiAnalysis, Feb 2026)

Goodput loss: 6.7% (of a maximum 10.7%)
Annual spend: $11.1M (512 × H100 × $2.50/hr × 8,760 hrs)
Wasted compute / yr: $740K (6.7% goodput loss)
CO₂ wasted / yr: 32.4 tonnes

with TrainLens: detect · diagnose · terminate
Early failure detection · Root cause in real time · Auto-terminate failing runs
Compute recovered / yr: $444K (~60% of goodput loss recovered via early termination)
Engineer time saved / yr: $10.8K (~18 failures/mo × 4 hrs × $180/hr eng rate)
Total saved / yr: $455K ($37.9K/mo, compute + eng)
CO₂ saved / yr: 19.5 tonnes (4.2 cars off the road)

Goodput loss: SemiAnalysis / Nebius Total Cost of a GPU Cluster, Feb 2026 (2.795%–10.667%, separate TCO line from debugging). CO₂: GPU TDP × PUE 1.2 × 0.4 kg CO₂/kWh. Compute recovery: ~60% of goodput loss via early termination. Debug savings: failures/month = (GPU count × 720 hrs) ÷ MTBF (25k GPU-hr enterprise, 20k standard, 15k silver-tier per SemiAnalysis); TrainLens surfaces root cause live — saves 4 hrs per failure (diagnosis + re-launch setup) at $180/hr loaded eng rate.
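To make the footnote's arithmetic concrete, here is a rough Python sketch of the same formulas. The 700 W H100 TDP and the MTBF tier picked below are illustrative assumptions, so the printed figures approximate the numbers above rather than reproduce them exactly.

# Rough sketch of the cost model described in the footnote above. The 700 W
# H100 TDP and the 20k GPU-hr MTBF tier are illustrative assumptions, so the
# printed values approximate (not reproduce) the figures shown above.

GPUS, PRICE_PER_HR, HOURS_PER_YR = 512, 2.50, 8_760
GOODPUT_LOSS = 0.067                  # within the 2.8-10.7% SemiAnalysis range
MTBF_GPU_HRS = 20_000                 # "standard" tier per the footnote (assumption)
ENG_RATE, HRS_PER_FAILURE = 180, 4    # loaded eng rate, diagnosis + re-launch time

annual_spend   = GPUS * PRICE_PER_HR * HOURS_PER_YR
wasted_compute = annual_spend * GOODPUT_LOSS
recovered      = wasted_compute * 0.60          # ~60% of goodput loss recovered via early termination

failures_per_month  = GPUS * 720 / MTBF_GPU_HRS
eng_saved_per_month = failures_per_month * HRS_PER_FAILURE * ENG_RATE

# CO2 from the wasted GPU-hours: TDP x PUE x grid intensity (700 W TDP assumed)
co2_wasted_tonnes = GPUS * 0.7 * HOURS_PER_YR * 1.2 * 0.4 * GOODPUT_LOSS / 1_000

print(f"spend ${annual_spend/1e6:.1f}M/yr, wasted ${wasted_compute/1e3:.0f}K/yr, "
      f"recovered ${recovered/1e3:.0f}K/yr, "
      f"{failures_per_month:.0f} failures/mo, ${eng_saved_per_month:,.0f}/mo eng time")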

Why your jobs need an agent

Utilization metrics show 95%.
The agent sees what they miss.

A GPU at 95% utilization still doesn't tell you which rank is dragging the job, whether your dataloader is stalling, or whether you'll hit OOM in 200 steps. An agent watching every step does.

Input stalls

DataLoader blocking the GPU? TrainLens surfaces dataloader wait time on every step, so you can pinpoint the source of the stall immediately.

Jittery step times

Unstable or drifting step duration signals memory fragmentation, NCCL variance, or scheduling noise — visible per step.

DDP rank stragglers

See worst-rank vs median-rank timing and skew across all DDP ranks. Know exactly which node is slowing the whole job.

Memory creep

GPU memory growing step over step? TrainLens tracks the allocation trend so you catch OOM before it crashes your run.

Observe. Diagnose. Act.
On every step.

TrainLens runs as a separate aggregator process co-located with your training loop — a closed agent loop that reads signals, classifies run state, and acts autonomously. No CUDA stream contention. No process group hijacking.

01

Attach the agent

One line to deploy. Wrap your step with trace_step(), use TrainLensTrainer, or attach TrainLensCallback. The agent auto-detects your stack — FSDP, ZeRO, Accelerate — and instruments itself.

trace_step() HF Trainer Lightning
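A minimal attach sketch for a plain PyTorch loop is shown below. trace_step() is the context manager named on this page; the trainlens import path and the rest of the loop are illustrative assumptions, so treat this as a shape rather than the exact API.

# Minimal sketch: wrapping each training step with trace_step().
# The trainlens import path and the toy model are illustrative assumptions;
# trace_step() itself is the context manager named on this page.
import torch
from trainlens import trace_step   # assumed import path

model = torch.nn.Linear(128, 10).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

for step in range(1_000):
    batch = torch.randn(32, 128, device="cuda")
    with trace_step():             # agent records timing, memory, grad health for this step
        loss = model(batch).pow(2).mean()
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()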
02

Agent reads every step

Step timing, GPU memory, gradient health, rank skew, MFU, communication overlap — the agent collects from every rank, every step, continuously. Nothing is sampled or buffered.

timing memory grad FSDP MFU
03

Agent diagnoses & acts

ML models run on the live signal stream and flag trouble before the run crashes or wastes hours of compute. The agent surfaces the root cause, alerts you, or terminates the run autonomously. You set the policy; the agent executes.

dashboard ML prediction auto-terminate
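As a shape, a policy might be expressed like the sketch below. The configure() call and every field name in it are hypothetical, used only to illustrate the monitor / alert / escalate / terminate split; they are not the actual TrainLens API.

# Hypothetical policy sketch. The configure() entry point and every field name
# here are illustrative, not the real TrainLens API; the sketch only shows the
# shape of a monitor / alert / escalate / terminate policy.
import trainlens  # assumed package name

trainlens.configure(
    policy={
        "out_of_memory":       "terminate",  # predicted OOM: stop cleanly, keep the checkpoint
        "gradient_divergence": "terminate",  # NaN/Inf or exploding norms: stop before the crash
        "training_plateau":    "alert",      # notify, let a human decide
        "communication_hang":  "escalate",   # page the on-call after repeated stalled steps
        "thermal_throttle":    "monitor",    # log only
    },
)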

Every signal. Every step. Always on.

Core signal

Step time breakdown

Forward · backward · optimizer · overhead · dataloader wait: split and tracked on every step. The foundational signal that makes every other diagnosis possible.

forward 38% · backward 44% · optimizer 8% · dataloader 6% · overhead 4%
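For reference, a generic way to measure a breakdown like this in your own loop is to bracket each phase with CUDA events. The sketch below is plain PyTorch and illustrative only; it is not how TrainLens instruments the step.

# Generic per-step phase timing with CUDA events. Illustrates the breakdown
# above; not TrainLens's instrumentation.
import time
import torch

def timed(fn):
    """Run fn on the current CUDA stream and return (result, milliseconds)."""
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    out = fn()
    end.record()
    torch.cuda.synchronize()
    return out, start.elapsed_time(end)

model = torch.nn.Linear(1024, 1024).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

data_t0 = time.perf_counter()
batch = torch.randn(64, 1024, device="cuda")              # stands in for dataloader wait
dataloader_ms = (time.perf_counter() - data_t0) * 1e3

loss, forward_ms = timed(lambda: model(batch).pow(2).mean())
_, backward_ms   = timed(loss.backward)
_, optimizer_ms  = timed(optimizer.step)
optimizer.zero_grad()

print(f"dataloader {dataloader_ms:.2f} ms  forward {forward_ms:.2f} ms  "
      f"backward {backward_ms:.2f} ms  optimizer {optimizer_ms:.2f} ms")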
DDP rank stragglers

Worst-rank vs median-rank timing and skew across all ranks. Pin the node dragging the job in seconds.
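The raw signal behind this view can be reproduced in vanilla PyTorch with an all-gather of per-rank step times. A rough sketch, not the TrainLens implementation:

# Rough sketch: compare per-rank step times in a DDP job with an all-gather.
# Illustrates the straggler signal; assumes the process group is initialized.
import statistics
import torch.distributed as dist

def report_rank_skew(step: int, step_time_ms: float) -> None:
    """Gather every rank's step time and report worst vs median on rank 0."""
    times = [None] * dist.get_world_size()
    dist.all_gather_object(times, step_time_ms)
    if dist.get_rank() == 0:
        median = statistics.median(times)
        worst_rank = max(range(len(times)), key=times.__getitem__)
        print(f"step {step}: median {median:.1f} ms, "
              f"worst rank {worst_rank} at {times[worst_rank]:.1f} ms "
              f"(+{times[worst_rank] - median:.1f} ms skew)")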

GPU memory trend

Allocated, reserved, and peak memory per step. See fragmentation building before it crashes the run.
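The underlying counters are PyTorch's allocator statistics. A rough per-step snapshot sketch, illustrative only:

# Rough sketch of the per-step memory signal using PyTorch's allocator stats.
# Illustrative only; not TrainLens's collector.
import torch

def memory_snapshot(device: int = 0) -> dict:
    """Allocated / reserved / peak memory in MiB for one CUDA device."""
    mib = 1024 ** 2
    return {
        "allocated_mib": torch.cuda.memory_allocated(device) / mib,
        "reserved_mib":  torch.cuda.memory_reserved(device) / mib,
        "peak_mib":      torch.cuda.max_memory_allocated(device) / mib,
    }

# Called once per step, a rising allocated_mib trend at a fixed batch size is
# the "memory creep" pattern that precedes an OOM.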

Gradient health

Per-step gradient norm tracking, NaN / Inf detection, and spike detection. Auto-terminates on gradient explosion.
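A hand-rolled version of the same check looks roughly like the sketch below; it is illustrative PyTorch, not the TrainLens collector, and the spike rule is an arbitrary placeholder.

# Rough sketch of a per-step gradient health check: norm, NaN/Inf, spike.
# Illustrative only; the 10x spike threshold is an arbitrary placeholder.
import torch

def gradient_health(model: torch.nn.Module, prev_norm: float | None = None):
    grads = [p.grad for p in model.parameters() if p.grad is not None]
    total_norm = torch.linalg.vector_norm(
        torch.stack([torch.linalg.vector_norm(g) for g in grads])
    ).item()
    non_finite = any(not torch.isfinite(g).all() for g in grads)
    spike = prev_norm is not None and total_norm > 10 * prev_norm
    return total_norm, non_finite, spike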

MFU — model FLOP utilization

Real attained MFU vs theoretical peak. Know instantly whether you are compute-bound or wasting hardware.
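The formula itself is simple. Below is a sketch using the common 6 × parameters × tokens approximation for transformer training FLOPs; the per-GPU peak figure is a datasheet assumption you would set for your own hardware.

# Rough MFU sketch using the common 6 * params * tokens approximation for
# transformer training FLOPs. The peak figure is a per-GPU datasheet
# assumption, not something TrainLens defines.

def mfu(params: float, tokens_per_step: float, step_time_s: float,
        peak_flops_per_gpu: float, num_gpus: int) -> float:
    """Attained model FLOP utilization for one training step."""
    flops_per_step = 6.0 * params * tokens_per_step     # forward + backward approximation
    attained = flops_per_step / step_time_s             # FLOP/s actually achieved
    return attained / (peak_flops_per_gpu * num_gpus)

# Example: 7B-parameter model, 262,144 tokens per step, 2.8 s steps on 8 GPUs
# at an assumed 989 TFLOP/s BF16 dense peak per GPU.
print(f"MFU = {mfu(7e9, 262_144, 2.8, 989e12, 8):.1%}")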

FSDP & comm overlap

All-gather / reduce-scatter latency, DDP comm-compute overlap ratio, and pipeline parallel bubble ratio — per step.

Per-layer deep mode

trainlens deep adds per-layer forward / backward timing and memory — pin the bottleneck layer directly.
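A generic approximation of the per-layer forward timing that deep mode reports can be built with forward hooks, as sketched below. It synchronizes around every leaf module, so it is illustrative only and not something to leave in a production run.

# Generic per-layer forward timing with hooks. A rough illustration of what
# deep mode surfaces; not TrainLens's implementation, which also covers
# backward timing and per-layer memory.
import time
import torch

def attach_layer_timers(model: torch.nn.Module) -> dict:
    """Record forward wall-time per leaf module (synchronizes, so adds overhead)."""
    timings, starts = {}, {}
    for name, module in model.named_modules():
        if len(list(module.children())) > 0:
            continue  # leaf modules only
        def pre_hook(mod, inp, _name=name):
            torch.cuda.synchronize()
            starts[_name] = time.perf_counter()
        def post_hook(mod, inp, out, _name=name):
            torch.cuda.synchronize()
            timings[_name] = (time.perf_counter() - starts[_name]) * 1e3
        module.register_forward_pre_hook(pre_hook)
        module.register_forward_hook(post_hook)
    return timings   # populated after each forward pass, in milliseconds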

Live terminal & browser dashboard

Rich terminal dashboard during training, plus a NiceGUI browser UI — accessible via SSH port-forward for remote pods — for live monitoring and run comparison.

Run history

Every run is stored and labeled. Review outcomes, compare across sessions, and watch your model improve as more runs accumulate.

The agent's intelligence layer

Predicts failure before it costs you.

The agent runs ML models on your live signal stream and decides — monitor, alert, escalate, or terminate — before the run crashes or wastes hours of compute.

Agent decision on every step: monitor · alert · escalate · terminate
From step one
Immediate detection
Catches obvious failures instantly — no history or warm-up needed.
Learns your hardware
Pattern recognition
Learns the normal signature of your workload and flags deviations before they escalate.
Across many steps
Trend analysis
Detects slow-burn degradation — the kind that looks fine step-to-step but is quietly heading for failure.
Failure classes detected
Out of Memory · Gradient Divergence · Training Plateau · Communication Hang · Thermal Throttle
Agent-initiated termination
When the agent predicts imminent failure, it stops the run cleanly — before it crashes or wastes another hour of compute on a job that won't recover.

Your metrics, pushed into your existing monitoring stack.

TrainLens emits per-step training signals directly into the tools you already use — under your existing W&B run, your active MLflow experiment, your Prometheus scrape target, or your OTel collector. No new dashboards. No new accounts.

Zero config — auto-detects your stack on the first step
Signal flow: trainlens/ per-step metrics → W&B run · MLflow experiment · Prometheus · OTel collector
Observability

Weights & Biases

Pushed into your active wandb.run with a detection report after each run.

MLflow

Logged to your active MLflow run

Prometheus

Push to Pushgateway per step

OpenTelemetry

Export via gRPC or HTTP

Training frameworks

Hugging Face

Drop-in TrainLensTrainer

PyTorch Lightning

Standard callback API

PyTorch

trace_step() context manager

Accelerate

trainlens_step() or zero-code callback

DeepSpeed

ZeRO-2/3 aware grad norms & memory

Distributed support: FSDP1 & FSDP2 · DeepSpeed ZeRO-2/3 grad norms & memory · gradient checkpointing. Auto-detected, no config required.
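For the Lightning row above, attachment looks like the sketch below, under the assumption that TrainLensCallback is the standard Lightning callback named on this page and that the trainlens import path is as shown; check the docs for exact signatures.

# Sketch: attaching via the standard Lightning callback API. The trainlens
# import path, and the assumption that TrainLensCallback is a Lightning
# callback, are illustrative.
import torch
import lightning as L
from trainlens import TrainLensCallback   # assumed import path

class TinyModule(L.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 1)
    def training_step(self, batch, batch_idx):
        x, y = batch
        return torch.nn.functional.mse_loss(self.layer(x), y)
    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)

dataset = torch.utils.data.TensorDataset(torch.randn(256, 32), torch.randn(256, 1))
loader = torch.utils.data.DataLoader(dataset, batch_size=32)

trainer = L.Trainer(callbacks=[TrainLensCallback()], max_epochs=1, accelerator="auto")
trainer.fit(TinyModule(), train_dataloaders=loader)

# Hugging Face: TrainLensTrainer is described above as a drop-in, so (under
# that assumption) you would swap the Trainer class name and keep your existing
# model / args / dataset arguments unchanged.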
Benchmarks

Validated on real hardware, real models, real failures.

TrainLens is benchmarked end-to-end across GPU generations and training scales — single node to multi-node, single GPU to hundreds of ranks, across clusters and at the org level.

<1%
Overhead per step
Measured on A100 80GB, LLaMA-2 7B, 8-GPU DDP
A100 · H100
GPU generations
BF16 and FP8 training validated
Single → Multi-node
Scale range
DDP, FSDP, and Pipeline Parallel validated

Full benchmark methodology and results available on request. Contact us →

Built for every team that trains models.

From a solo researcher chasing a slow run to an MLOps team protecting a 1,000-GPU cluster — TrainLens gives you live visibility and autonomous failure handling at every scale.

Researcher
Debug a slow training run in minutes

You notice step time doubled between two runs. TrainLens shows the dataloader wait spiked from 2ms to 180ms — a dataset preprocessing bottleneck that nvidia-smi wouldn't reveal.

ML Team
Find the rank slowing your DDP job

An 8-GPU DDP run is 30% slower than expected. TrainLens shows rank 3 is consistently 40ms behind the median — a single slow NVLink link causing the whole job to wait at the barrier.

MLOps
Stop failing runs before they waste GPU hours

A long training job starts showing signs of gradient instability at step 800. TrainLens predicts failure and terminates the run cleanly — saving 6 hours of A100 time before the inevitable crash.

An agent for every
training job you run.

Deploy in two minutes. The agent watches every step, catches failures before they crash your run, and protects your compute budget — autonomously.

PyTorch 2.5+ · Python 3.10+ · Single GPU to multi-node DDP · FSDP & Pipeline Parallel · No account required