Clockwork's Network Acceleration Software for AI Workloads
Built on Breakthrough Clock Sync Technology
Clockwork is built on a breakthrough software-based clock synchronization technology developed by its founders at Stanford. It enables accurate one-way delay measurements on every data path at scale, without specialized hardware or network support.
Unlock the full potential of your GPU clusters with Clockwork. Our software solution eliminates contention and congestion, delivering unmatched reliability, enhanced visibility, and consistently high performance to keep your AI infra running at peak efficiency.
Clockwork's GPU Cloud Solution
Our approach is fundamentally different. Our unique software-based solution ensures reliability and fabric acceleration without relying on custom hardware or in-band network telemetry. Compatible with standard Ethernet switches and NICs, it can scale beyond 100,000 GPU nodes while cutting costs, boosting flexibility, and enhancing resilience.
Clockwork Job and Network Fleet Monitoring
Without real-time insight into connectivity, path quality, and message-level and job-level metrics, AI infra/ops teams cannot identify and resolve issues quickly.
Clockwork’s software:
- Create NIC-to-NIC Probe Mesh: Small probe packets traverse the edges of the mesh
- Monitor Network Health: Probes continuously check for liveness of paths, whether or not there’s data on the paths
- Measure NIC-to-NIC Delays: Synchronize the clocks at all the NICs and obtain accurate one-way delays for every QPair of interest
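To illustrate why synchronized clocks matter here, the following is a minimal sketch of how a one-way delay could be derived from a probe's send and receive timestamps once each NIC's clock offset is known. It is not Clockwork's implementation: the `Probe` record, the `offset_ns` table, and the function names are hypothetical, and the per-NIC offsets are assumed to come from a separate clock synchronization step.

```python
from dataclasses import dataclass
from statistics import median


@dataclass(frozen=True)
class Probe:
    """Hypothetical record for one probe packet on a NIC-to-NIC path."""
    src_nic: str
    dst_nic: str
    tx_ns: int  # send timestamp, read from the source NIC's clock
    rx_ns: int  # receive timestamp, read from the destination NIC's clock


def one_way_delay_ns(probe, offset_ns):
    """Return the one-way delay of a probe in nanoseconds.

    offset_ns[nic] is that NIC's estimated clock offset from a common
    reference (clock reading minus reference time). Assumed to be supplied
    by the clock synchronization layer.
    """
    tx_ref = probe.tx_ns - offset_ns[probe.src_nic]
    rx_ref = probe.rx_ns - offset_ns[probe.dst_nic]
    return rx_ref - tx_ref


def path_delays_ns(probes, offset_ns):
    """Group probes by (src, dst) path and return the median delay per path."""
    by_path = {}
    for p in probes:
        key = (p.src_nic, p.dst_nic)
        by_path.setdefault(key, []).append(one_way_delay_ns(p, offset_ns))
    return {path: median(delays) for path, delays in by_path.items()}
```

Without the offset correction, the difference `rx_ns - tx_ns` would mix the real network delay with the disagreement between the two clocks, which is why a round-trip measurement is the usual fallback when clocks are not synchronized.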
Clockwork Solution To Link/NIC Flapping
Link/NIC failures due to overheating optics are a common problem in InfiniBand and RoCE networks, resulting in job crashes and more frequent checkpointing and restarts.
“Alert: Urgent | Cluster: link has flapped more than 8 times within the past hour”
Clockwork’s software:
- Quickly detect link/NIC failures
- Use an alternate path
- Monitor health of failed paths and re-use them when they recover
- Ensure continuous operations despite failures, leading to higher GPU utilization and faster job completion time.
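The detect, fail over, and recover steps above can be sketched as a small state machine driven by probe results. This is an illustrative sketch only, not Clockwork's algorithm: the `PathSelector` class and the `fail_after` and `recover_after` thresholds are assumptions chosen for the example. Requiring several consecutive successful probes before reusing a path is one simple way to avoid switching back to a link that is still flapping.

```python
class PathSelector:
    """Track path health from probe results and pick a usable path.

    Hypothetical sketch: a path is marked failed after `fail_after`
    consecutive lost probes and marked healthy again after `recover_after`
    consecutive successful probes.
    """

    def __init__(self, paths, fail_after=3, recover_after=5):
        self.paths = list(paths)  # ordered by preference
        self.fail_after = fail_after
        self.recover_after = recover_after
        self.healthy = {p: True for p in self.paths}
        # Consecutive probe results that contradict the current state.
        self.streak = {p: 0 for p in self.paths}

    def record_probe(self, path, ok):
        """Update a path's health with one probe result (ok=True if answered)."""
        if ok == self.healthy[path]:
            self.streak[path] = 0  # result agrees with the current state
            return
        self.streak[path] += 1
        threshold = self.recover_after if ok else self.fail_after
        if self.streak[path] >= threshold:
            self.healthy[path] = ok
            self.streak[path] = 0

    def active_path(self):
        """Return the most preferred healthy path, or None if all are down."""
        for p in self.paths:
            if self.healthy[p]:
                return p
        return None
```

In use, traffic would be steered to `active_path()` after every batch of probe results, so a failed primary path is replaced by an alternate and picked up again once it has proven stable.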
Link/NIC Flapping: Before and After Clockwork
Without Clockwork, a NIC failure halts AI jobs entirely. With Clockwork, jobs continue at reduced throughput during a failure and quickly return to full capacity once the path recovers, ensuring robust resilience and uninterrupted performance.
Clockwork's Solution to Fabric Contention
Bursty traffic from multiple data flows collides on links and contends for bandwidth, resulting in low throughput, high latency, and degraded NCCL performance.
“Network links get saturated quickly … The last flow dictates starting of next iteration”
Clockwork’s software:
- Use QPair-level delay measurements to intelligently detect “fabric contention” (the oversubscription of certain paths in the fabric)
- Balance the load evenly across the network fabric to eliminate contention and increase the total throughput.
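As a rough illustration of the idea, the sketch below flags a path as contended when its measured one-way delay rises well above its uncongested baseline, then moves flows from contended paths to the least-loaded uncontended ones. It is a simplified example under stated assumptions, not Clockwork's method: the function names, the `factor` threshold, and the use of flow count as the load metric are all hypothetical.

```python
def contended_paths(delay_us, baseline_us, factor=2.0):
    """Return paths whose measured delay exceeds `factor` times the baseline.

    delay_us and baseline_us map a path name to a delay in microseconds.
    The threshold factor is an assumption made for this example.
    """
    return {p for p, d in delay_us.items() if d > factor * baseline_us[p]}


def rebalance(flows_by_path, contended):
    """Move flows off contended paths onto the least-loaded other paths.

    Load is approximated by flow count. A flow is moved only while the
    contended path carries at least two more flows than the target, so the
    loop always terminates and never overshoots.
    """
    placement = {p: list(flows) for p, flows in flows_by_path.items()}
    targets = [p for p in placement if p not in contended]
    if not targets:
        return placement  # nowhere to move flows
    for src in sorted(placement):
        if src not in contended:
            continue
        while True:
            dst = min(targets, key=lambda p: len(placement[p]))
            if len(placement[src]) - len(placement[dst]) < 2:
                break
            placement[dst].append(placement[src].pop())
    return placement
```

A production system would weigh flows by bandwidth rather than count and would re-measure delays after each move, but the structure is the same: measure, detect oversubscribed paths, and shift load toward paths with headroom.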