
Workload Failover

Job-aware resilience that keeps AI running. Stateful fault tolerance absorbs NIC failures and link flaps with instant rerouting to healthy paths, preserving collective integrity.

The result: nonstop AI jobs that avoid disruption, wasted GPU hours, and costly restarts across clusters and clouds.

Resilience

Accelerate AI Jobs with Real-time Resilience

Transform utilization by eliminating costly restarts.

Disruptive link failures shouldn’t bring AI training to a halt. Clockwork Workload Failover delivers job-aware resilience that absorbs NIC failures and link flaps in real time—keeping clusters productive and avoiding wasted GPU hours.

Disruptive Network Failures and Link Flaps Are Common and Expensive

Why traditional fabrics force resets instead of recovery.

At scale, even rare NIC or optical failures compound into frequent job restarts. With thousands of links in a cluster, the cluster-wide mean time to failure is measured in minutes, not years, causing GPU stalls, lost hours, and mounting cost. One of the most common problems is InfiniBand/RoCE link failure. Even if each NIC-to-leaf-switch link had a mean time to failure of 5 years, the high number of transceivers means the first job failure would be expected after only 26.28 minutes.

Time to first job failure in a brand-new cluster:
26.28 minutes
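
The 26.28-minute figure can be reproduced with a short calculation. The sketch below assumes link failures are independent and exponentially distributed, so the expected time to the first failure among N links is the per-link mean time to failure divided by N. The link count of 100,000 is not stated in the text; it is the value implied by the quoted figure, with a year taken as 365 days.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600; assumes a 365-day year


def minutes_to_first_failure(link_mttf_years: float, num_links: int) -> float:
    """Expected minutes until the first of num_links independent links fails.

    Assumes exponentially distributed failures, so the minimum of N
    lifetimes has mean MTTF / N.
    """
    if num_links <= 0:
        raise ValueError("num_links must be positive")
    return link_mttf_years * MINUTES_PER_YEAR / num_links


# 100,000 links is an assumption inferred from the 26.28-minute figure.
print(f"{minutes_to_first_failure(5, 100_000):.2f} minutes")
```

Under the same assumptions, a cluster with fewer links fails less often, but the trend is linear: halving the link count only doubles the expected time to first failure.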
“Achieving high utilization with them [GPUs] is difficult due to the high failure rate of various components, especially networking.”

Clockwork’s Workload Failover Provides Resilience to Link Flaps

Sustain training momentum with resilient, job-aware networking

Link/NIC flapping

  • Quickly detect link/NIC failure

  • Use an alternate path

  • Monitor failed paths and reuse them on recovery
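
The three steps above can be illustrated with a small path-selection sketch. This is not Clockwork's implementation; `PathSelector`, `mark_down`, `mark_up`, and the path names `nic0` and `nic1` are hypothetical, and failure detection is assumed to come from some external health probe that calls these methods.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class PathSelector:
    """Tracks path health and picks the most preferred healthy path."""

    paths: List[str]  # ordered from most to least preferred
    down: Set[str] = field(default_factory=set)

    def mark_down(self, path: str) -> None:
        # Step 1: a (hypothetical) probe reports a link/NIC failure.
        self.down.add(path)

    def mark_up(self, path: str) -> None:
        # Step 3: the monitored path has recovered and can be reused.
        self.down.discard(path)

    def active(self) -> str:
        # Step 2: traffic uses the first path not currently marked down.
        for path in self.paths:
            if path not in self.down:
                return path
        raise RuntimeError("no healthy path available")


selector = PathSelector(paths=["nic0", "nic1"])
print(selector.active())   # preferred path while everything is healthy
selector.mark_down("nic0")
print(selector.active())   # alternate path during the failure
selector.mark_up("nic0")
print(selector.active())   # preferred path again after recovery
```

Keeping the paths in preference order means recovery needs no extra logic: once a failed path is marked up again, the next selection returns to it automatically.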

Link/NIC flapping

  • Before Clockwork: A NIC failure kills the job, halting training until a restart and checkpoint recovery.

  • After Clockwork: Jobs stay alive — throughput dips briefly, then recovers to full speed within a minute.

Figure: Node inbound InfiniBand throughput

  • AI job resilience in action: Even with multiple NIC flaps, the workload continues running — no restart required.

  • Graceful recovery: Throughput dips briefly, then automatically restores to full speed, preserving collective progress and avoiding wasted GPU hours.


Stop wasting GPU cycles. Start scaling smarter.
Clusters must deliver high uptime while running at maximum efficiency.

Turn your GPU clusters into a competitive advantage—not a cost center.