Launching TorchPass: A New Class of Fault Tolerance to End Failure-Driven GPU Waste In AI Training

AI Performance Optimization

Dynamic Traffic Control (DTC) uses real-time telemetry to keep GPUs productive on any network. It balances congestion, paces queues, and prevents stalls, which accelerates synchronized collectives.

The result: accelerated AI training across Ethernet, InfiniBand, and RoCE entirely in software, without proprietary hardware.

Detecting & Eliminating Contention

Reroute workloads instantly, eliminate collisions.

What is Contention?

QPairs collide on links and contend for network bandwidth.

Clockwork’s Solution

QPairs with contention have high one-way delays. Clockwork shifts their traffic from congested paths to uncongested paths.
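
The idea above, flagging QPairs by high one-way delay and moving them to uncongested paths, can be illustrated with a minimal Python sketch. This is not Clockwork's implementation. The `QueuePair` class, the `find_contended` and `reroute` functions, the delay threshold (a multiple of the lowest observed delay), and the "least-loaded path" choice are all hypothetical assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class QueuePair:
    """Hypothetical record for one QPair and its latest telemetry sample."""
    name: str
    path: str                # identifier of the network path currently in use
    one_way_delay_us: float  # measured one-way delay in microseconds


def find_contended(qps, factor=2.0):
    """Flag QPairs whose delay exceeds `factor` times the lowest observed delay.

    Assumption: the lowest delay approximates an uncongested path.
    """
    if not qps:
        return []
    baseline = min(qp.one_way_delay_us for qp in qps)
    return [qp for qp in qps if qp.one_way_delay_us > factor * baseline]


def reroute(qps, all_paths, factor=2.0):
    """Move contended QPairs off shared paths onto the least-loaded uncongested path.

    Returns a dict mapping QPair name to (old_path, new_path).
    """
    load = {p: 0 for p in all_paths}
    for qp in qps:
        load[qp.path] += 1

    contended = find_contended(qps, factor)
    congested = {qp.path for qp in contended}
    candidates = [p for p in all_paths if p not in congested]

    moves = {}
    for qp in contended:
        # Skip if the path is no longer shared or there is nowhere to go.
        if load[qp.path] <= 1 or not candidates:
            continue
        target = min(candidates, key=lambda p: load[p])
        load[qp.path] -= 1
        load[target] += 1
        moves[qp.name] = (qp.path, target)
        qp.path = target
    return moves


# Example: two QPairs collide on path "C" while path "D" is idle.
qps = [
    QueuePair("qp1", "A", 10.0),
    QueuePair("qp2", "B", 12.0),
    QueuePair("qp3", "C", 50.0),
    QueuePair("qp4", "C", 55.0),
]
moves = reroute(qps, ["A", "B", "C", "D"])
```

In this example one of the two colliding QPairs is moved to the idle path, and the other stays in place because its path is no longer shared. A production system would need continuous, synchronized delay measurement and safeguards against oscillation, which this sketch leaves out.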

Workload Acceleration

Proven Throughput Gains Across Real-World AI Workloads

Hyperscaler with Clockwork vs. Dynamic Load Balancing (DLB)

2 all-to-all jobs

With Clockwork enabled, the hyperscaler sees 33% more outbound throughput than with DLB.

Large Social Media Company with Clockwork vs. ECMP

2 all-to-all jobs

With Clockwork enabled, the large social media company sees 29% more throughput than with ECMP.

Stop wasting GPU cycles. Start scaling smarter.
Clusters must deliver high uptime while running at maximum efficiency.

Turn your GPU clusters into a competitive advantage, not a cost center.