
You Only Compute Once

Are training failures eating up your GPU dollars?

Every failure makes you pay four times: provisioning replacement resources, restarting the job, restoring from the last checkpoint, and recomputing the work lost since that checkpoint was taken. On a 1,024-GPU cluster, that waste can exceed $300,000 per month.
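As a rough sketch of how those four costs add up, the snippet below totals the GPU-hours lost to a single failure and scales that to a monthly figure. All function names and input values are hypothetical, chosen for illustration only; they are not TorchPass measurements and are not meant to reproduce the $300,000 figure. The model assumes the whole job stalls during every phase, as it typically does in synchronous distributed training.

```python
def wasted_gpu_hours(n_gpus, provision_h, restart_h, restore_h, recompute_h):
    """GPU-hours lost to one failure, summed over the four phases.

    Assumption: the whole job stalls during every phase, so each of the
    n_gpus GPUs loses the same wall-clock time.
    """
    stalled_hours = provision_h + restart_h + restore_h + recompute_h
    return n_gpus * stalled_hours


def monthly_waste_usd(n_gpus, failures_per_month, gpu_hour_usd, **phase_hours):
    """Monthly cost of failures at a given price per GPU-hour."""
    per_failure = wasted_gpu_hours(n_gpus, **phase_hours)
    return failures_per_month * per_failure * gpu_hour_usd


# Illustrative inputs only (hypothetical): an 8-GPU job, 3 failures a month,
# $2.00 per GPU-hour, and 2 hours of stall per failure across the four phases.
print(monthly_waste_usd(
    n_gpus=8,
    failures_per_month=3,
    gpu_hour_usd=2.0,
    provision_h=0.5,
    restart_h=0.25,
    restore_h=0.25,
    recompute_h=1.0,
))  # prints 96.0
```

Plugging in your own cluster size, failure frequency, checkpoint interval, and GPU pricing gives a first-order estimate of what failures cost you today.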

The Clockwork YOCO Guarantee

TorchPass ends this waste. Run your training jobs with TorchPass and we commit that at least 90% of covered failures will be resolved through live GPU migration — no checkpoint restart, no rollback, no recompute. If we fall short in a contract year, you receive a YOCO credit of 25% of your annual TorchPass license fee, applied to your next renewal or expansion.

TorchPass keeps distributed AI training running through failures using live GPU migration

Without TorchPass

Every disruption, large or small, crashes the training job and triggers a heavy restart penalty: provisioning replacement resources, restarting the job, restoring from the last checkpoint, and recomputing the work done since that checkpoint was taken.

With TorchPass

Failures are handled seamlessly through live GPU migration. The job live-migrates to spare resources, and training continues without rollback, recompute, or lost progress.

How the YOCO Guarantee Works

  1. Purchase a TorchPass subscription.

  2. Install TorchPass on a supported stack. 

  3. Opt in to the YOCO Guarantee and we will validate your environment.

  4. Enable TorchPass on your training jobs.

  5. Start saving money by not having to restart and recompute when there’s a failure!

Details:

Covered workloads: Training jobs validated to run with TorchPass, in a TorchPass-supported environment.
Covered failures: Covered failures are those described as “Qualifying Migration Events” in the table below. Failures explicitly not covered are those described as “Disqualifying Migration Events” in the table below.
The commitment: Run your training jobs with TorchPass and at least 90% of covered failures will be resolved through live GPU migration, with no lost training progress, no rollback, and no recompute of completed steps.
The remedy: If the annual migration success rate falls below 90%, you receive a YOCO Credit of 25% of your annual TorchPass license fee, applied to your next renewal or expansion.
How it’s measured: Success rate = successful qualifying migrations ÷ total qualifying migrations, measured on a trailing 12-month basis at renewal or expansion. Measurement period must be at least 3 months for expansion purchased in the first year.
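
The measurement and remedy can be sketched in a few lines. This is an illustration of the arithmetic described above, not Clockwork's actual accounting: the `MigrationEvent` type, the function names, and the license fee in the example are hypothetical, and treating a period with no qualifying events as "no credit due" is an assumption the terms above do not spell out.

```python
from dataclasses import dataclass

GUARANTEED_RATE = 0.90  # from the guarantee: at least 90% of covered failures
CREDIT_FRACTION = 0.25  # from the guarantee: 25% of the annual license fee


@dataclass
class MigrationEvent:
    qualifying: bool  # True for a Qualifying Migration Event
    succeeded: bool   # True if live GPU migration resolved the failure


def success_rate(events):
    """Successful qualifying migrations / total qualifying migrations."""
    qualifying = [e for e in events if e.qualifying]
    if not qualifying:
        return None  # assumption: rate is undefined with no qualifying events
    return sum(e.succeeded for e in qualifying) / len(qualifying)


def yoco_credit(events, annual_license_fee):
    """Credit owed if the success rate falls below 90%, otherwise 0."""
    rate = success_rate(events)
    if rate is None or rate >= GUARANTEED_RATE:
        return 0.0
    return CREDIT_FRACTION * annual_license_fee


# Hypothetical year: 20 qualifying events (17 migrated successfully) plus
# 2 disqualifying events, which are excluded from the calculation.
events = (
    [MigrationEvent(qualifying=True, succeeded=True)] * 17
    + [MigrationEvent(qualifying=True, succeeded=False)] * 3
    + [MigrationEvent(qualifying=False, succeeded=False)] * 2
)
print(success_rate(events))          # prints 0.85
print(yoco_credit(events, 100_000))  # prints 25000.0
```

Disqualifying events drop out of both the numerator and the denominator, so only failures TorchPass is expected to handle affect the rate.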

Qualifying TorchPass Migration Events:

| Qualifying Migration Events | Some Example(s) |
| --- | --- |
| Planned migration via Cordon (K8s) or DRAIN state (Slurm) or other supported signal | Node cordon after detection of rising GPU temperature; node cordon to remove node for security patching |
| Job, process, or training execution interruption | Pod delete; process terminates; kernel panic; NCCL hang / stuck rank; OOM kill; CUDA driver fault |
| Host, GPU, or local fabric failure | GPU falls off bus; uncorrectable HBM ECC; NVLink or NIC failure; node power loss; entire node failure |

Disqualifying TorchPass Migration Events:

| Disqualifying Migration Events | Some Example(s) |
| --- | --- |
| Shared infrastructure failure | Switch / fabric-wide outage; filesystem unmount; filesystem full; not enough memory |
| Unable to recover | Failures exceeding parallelism redundancy; silent data corruption; node 0 failures with Slurm; failures due to spare resources not being available |

FAQs

Do I pay extra for the YOCO Guarantee?

No. The YOCO Guarantee comes with TorchPass. You just need an active paid TorchPass subscription and must explicitly opt in so we can validate your environment.

What exactly is being guaranteed?

When you run training jobs with TorchPass, we commit that at least 90% of covered failures will be resolved through live GPU migration. No checkpoint restart, no rollback, and no recompute of completed steps.

What happens if TorchPass falls short of 90%?

If your annual migration success rate falls below 90%, you receive a YOCO credit worth 25% of your annual TorchPass license fee, applied to your next renewal or expansion.

How is the success rate measured?

Success rate = successful qualifying migrations ÷ total qualifying migrations. It’s measured on a trailing 12-month basis at renewal or expansion; for an expansion purchased in the first year, the measurement period must be at least 3 months.

What kind of training failures are covered?

Examples of covered (“qualifying”) failures: planned migrations via cordon/drain signals (e.g., removing a node because of rising GPU temperature or for a security patch), job or process interruptions (e.g., pod deletes, kernel panics, NCCL hangs, OOM kills), and host, GPU, or local fabric failures (e.g., GPU falls off bus, uncorrectable HBM ECC errors, NVLink or NIC failures, node power loss).

Which failures are not covered?

Examples of disqualifying events: failures on jobs that are not configured to use TorchPass, widespread infrastructure failures (e.g., fabric-wide outages), and cases where compatible spare capacity isn’t available.

What do I need to do to be eligible?