NIC Failure, Link Failure

Link / NIC Flapping: Before And After Clockwork

Without Clockwork, a NIC failure halts AI jobs entirely. With Clockwork, jobs continue at reduced throughput during a failure and quickly return to full capacity, ensuring robust resilience and uninterrupted operation.

AI Networking Reinvented: Accelerate Smarter With Software

Boost AI job speed and GPU utilization with blazing fast efficiency

Clockwork's pure software solution delivers zero-loss network performance at scale on commodity NICs and switches. No proprietary hardware or expensive upgrades needed.

Frequent Network Issues Cited By GPU Operators And Customers

The ToR has 2 down uplinks. It's unknown why they went down 6 months ago!

We don't have anything like this. We barely have per minute data, this is per second.

Alert: Urgent/Cluster: Link has flapped more than 8 times within the past hour. We have to send technicians to swap out.
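An alert like the one quoted above can be approximated with a sliding-window counter. The following is a minimal sketch, not Clockwork's implementation: the class name `FlapDetector`, its parameters, and the assumption that flap timestamps arrive in increasing order are all hypothetical and chosen for illustration.

```python
from collections import deque


class FlapDetector:
    """Hypothetical sliding-window counter of link flaps (illustration only)."""

    def __init__(self, max_flaps=8, window_s=3600.0):
        self.max_flaps = max_flaps    # alert when MORE than this many flaps...
        self.window_s = window_s      # ...fall inside this many seconds
        self._events = deque()        # timestamps of recent flaps, oldest first

    def record_flap(self, ts):
        """Record a flap at time ts (seconds); return True if the alert fires."""
        self._events.append(ts)
        # Discard flaps that have aged out of the window.
        while self._events and ts - self._events[0] > self.window_s:
            self._events.popleft()
        return len(self._events) > self.max_flaps


# Ten flaps, one per minute: the alert fires from the ninth flap onward.
detector = FlapDetector()
alerts = [detector.record_flap(minute * 60.0) for minute in range(10)]
```

A per-second telemetry feed, as opposed to per-minute data, is what makes a counter like this meaningful for short-lived flaps.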

Network links get saturated quickly.

The last flow dictates the start of the next iteration.
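The observation above can be illustrated with a small calculation. This is a sketch under a simplifying assumption (the communication phase of an iteration ends only when its slowest flow completes); the function name and the flow times are made up for illustration.

```python
def comm_phase_time(flow_times_ms):
    """Time for a synchronized communication phase: the slowest flow decides."""
    return max(flow_times_ms)


balanced = [10.0, 10.0, 10.0, 10.0]        # hypothetical flow times in ms
one_straggler = [10.0, 10.0, 10.0, 25.0]   # a single slow flow

balanced_time = comm_phase_time(balanced)          # 10.0 ms
straggler_time = comm_phase_time(one_straggler)    # 25.0 ms
mean_straggler = sum(one_straggler) / len(one_straggler)  # 13.75 ms
```

Although the average flow time rises only from 10 ms to 13.75 ms, the phase itself takes 25 ms, which is why tail flows matter more than averages for iteration time.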

We tuned the DCQCN threshold to increase throughput, but latency and congestion went up!

157 Minutes: The time to first job failure on a new 10,000-GPU cluster

  • Disruptions Impact AI: Network bottlenecks and link outages disrupt AI workloads, forcing job restarts.

  • Wasted GPU Cycles: Optical failures and overheating cause cluster pauses, lowering GPU utilization.

  • Reduced ROI: Delayed AI performance leads to poor returns on capital investment.

  • Solution: Deliver fault tolerance and reliability despite network failures to achieve high GPU utilization and efficiency.

Mean time to failure for GPU
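To put the 157-minute figure in perspective, the arithmetic below converts a cluster-level time to failure into an implied per-unit figure. It is a rough sketch that assumes the 10,000 refers to independently failing units with constant failure rates, and that any single failure stops the job; none of these assumptions is stated in the source, and the function name is hypothetical.

```python
def implied_unit_mttf_minutes(cluster_mttf_min, n_units):
    """Under independent, constant-rate failures, cluster MTTF = unit MTTF / N,
    so the implied unit MTTF is cluster MTTF * N."""
    return cluster_mttf_min * n_units


unit_mttf_min = implied_unit_mttf_minutes(157, 10_000)   # 1,570,000 minutes
unit_mttf_years = unit_mttf_min / (60 * 24 * 365)        # roughly 3 years
```

Under these assumptions, each unit would fail only about once every three years, yet the cluster as a whole still sees a failure within hours.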

Network Acceleration Software For AI Workloads

Clockwork's approach is fundamentally different. Our unique software-based solution ensures reliability and fabric acceleration without relying on custom hardware or in-band network telemetry. Compatible with standard Ethernet switches and NICs, it can scale beyond 100,000 GPU nodes while cutting costs, boosting flexibility, and enhancing resilience.

Clockwork's Solution For CPU Cloud

"Achieving high utilization with them (GPUs) is even more difficult due to the high failure rates of various components, especially networking." (SemiAnalysis)
Challenges, with their descriptions and consequences:

  1. Lack of visibility: No real-time insights into connectivity, path quality, or message-level and job-level metrics. Consequence: AI infrastructure teams struggle to detect and fix issues fast.

  2. Lack of reliability: Link and NIC flapping and failures due to overheating. Consequence: job crashes, frequent restarts, and checkpointing.

  3. Network contention and congestion: Bursty traffic from multiple data flows collides on links and contends for bandwidth. Consequence: low throughput, high latency, and degraded NCCL performance.

Impact On Your AI Workloads

Cloud Deluxe Boosts App Performance

Eliminate network congestion even under high load. Postpone autoscaling until it's really needed. Improve performance of latency-sensitive apps and AI/ML workloads.

Do more with fewer resources.

Fine-grained Visibility

Comprehensive real-time fleet and workload monitoring

Continuous Reliability

Comprehensive operations despite link and NIC failures

Fabric Acceleration

Eliminate contention and congestion, improve throughput

David A Maltz, 
Technical Fellow and Corporate Vice President,
Microsoft Azure

It has been great to partner with Clockwork's team as we've conducted successful trials on Azure, and, moving forward, we believe their tech will prove highly effective in identifying and eliminating network bottlenecks.

Amin Vahdat,
Engineering Fellow and VP,
Google Cloud

I've collaborated with the Clockwork team since their Stanford days. They solve a decades-old problem in scalable, high-accuracy network clock sync. It's a foundational technology and Clockwork's application of it can solve basic problems.

Interested In Learning More About Clockwork.io?

We're here to help! Please complete the form and we'll be in touch soon.

Contact Sales

Google Cloud

Get started with our fully managed solution on Google Cloud

Microsoft Azure

Get started with our fully managed solution on Azure Marketplace

Amazon Web Services

Coming soon to AWS Marketplace. Contact us for a private beta.

Oracle Cloud

Now on Oracle Cloud Marketplace. Contact us for a private beta.

Built To Run On All Of Your Environments

40% of AI training and inference time is spent on network communications. (Source: AMD)
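The 40% figure bounds how much a faster network can help end to end. The sketch below applies Amdahl's law, assuming communication and computation do not overlap (real training jobs often overlap them, so treat this as an illustrative upper-level estimate only); the function and parameter names are hypothetical.

```python
def overall_speedup(comm_fraction, comm_speedup):
    """Amdahl's law: only the communication share of the job gets faster."""
    return 1.0 / ((1.0 - comm_fraction) + comm_fraction / comm_speedup)


twice_as_fast = overall_speedup(0.40, 2.0)   # 1 / (0.6 + 0.2) = 1.25
upper_bound = 1.0 / (1.0 - 0.40)             # limit as communication cost goes to zero
```

Under this assumption, halving communication time makes a job 1.25x faster, and even free communication cannot exceed roughly 1.67x.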