
Frequent Network Issues Cited By GPU Operators And Customers
The ToR has 2 down uplinks. It's unknown why they went down 6 months ago!
We don't have anything like this. We barely have per-minute data; this is per-second.
Alert: Urgent/Cluster: Link has flapped more than 8 times within the past hour. We have to send technicians to swap out.
Network links get saturated quickly.
The last flow dictates the start of the next iteration.
We tuned the DCQCN threshold to increase throughput, but latency and congestion went up!
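The flap alert quoted above ("more than 8 times within the past hour") can be read as a sliding-window count per link. The following is a minimal sketch under that reading, not a description of any vendor's alerting logic: the `FlapDetector` class, its parameter names, and the idea that flaps arrive as timestamped events are all assumptions made for illustration.

```python
from collections import deque


class FlapDetector:
    """Sliding-window flap counter for a single link (hypothetical helper)."""

    def __init__(self, max_flaps=8, window_s=3600.0):
        self.max_flaps = max_flaps    # alert when the count exceeds this
        self.window_s = window_s      # window length in seconds (one hour)
        self._events = deque()        # timestamps of recent flaps, oldest first

    def record_flap(self, ts):
        """Record a flap at time ts (seconds) and report whether to alert.

        Assumes timestamps arrive in non-decreasing order.
        """
        self._events.append(ts)
        # Discard flaps that have fallen out of the window.
        while self._events and self._events[0] <= ts - self.window_s:
            self._events.popleft()
        return len(self._events) > self.max_flaps


detector = FlapDetector()
# Nine flaps, five minutes apart: only the ninth exceeds the threshold of 8.
alerts = [detector.record_flap(i * 300.0) for i in range(9)]
```

Keeping only the timestamps inside the window bounds memory per link, so the same structure could be held for every link in a fabric.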
157 Minutes: The time to first job failure on a new 10,000-GPU cluster
- Disruptions Impact AI: Network bottlenecks and link outages disrupt AI workloads, forcing job restarts.
- Wasted GPU Cycles: Optical failures and overheating cause cluster pauses, lowering GPU utilization.
- Reduced ROI: Delayed AI performance leads to poor returns on capital investment.
- Solution: Deliver fault tolerance and reliability despite network failures to achieve high GPU utilization and efficiency.

Network Acceleration Software For AI Workloads
Clockwork's approach is fundamentally different. Our unique software-based solution ensures reliability and fabric acceleration without relying on custom hardware or in-band network telemetry. Compatible with standard Ethernet switches and NICs, it can scale beyond 100,000 GPU nodes while cutting costs, boosting flexibility, and enhancing resilience.

Clockwork's Solution For CPU Cloud

"Achieving high utilization with them (GPUs) is even more difficult due to the high failure rates of various components, especially networking."
Challenges
- 1. Lack of visibility
  Description: No real-time insights into connectivity, path quality, or message-level and job-level metrics.
  Consequences: AI infrastructure teams struggle to detect and fix issues fast.
- 2. Lack of reliability
  Description: Link / NIC flapping and failures due to overheating.
  Consequences: Job crashes, frequent restarts, and checkpointing.
- 3. Network contention and congestion
  Description: Bursty traffic in which multiple data flows collide on links and contend for bandwidth.
  Consequences: Low throughput, high latency, degraded NCCL performance.
Impact On Your AI Workloads


Fine-grained Visibility
Comprehensive real-time fleet and workload monitoring


Continuous Reliability
Comprehensive operations despite link and NIC failures


Fabric Acceleration
Eliminate contention and congestion, improve throughput
David A. Maltz,
Technical Fellow and Corporate Vice President,
Microsoft Azure
It has been great to partner with Clockwork's team as we've conducted successful trials on Azure, and, moving forward, we believe their tech will prove highly effective in identifying and eliminating network bottlenecks.

Amin Vahdat,
Engineering Fellow and VP,
Google Cloud
I've collaborated with the Clockwork team since their Stanford days. They solve a decades-old problem in scalable, high-accuracy network clock sync. It's a foundational technology and Clockwork's application of it can solve basic problems
