Clockwork's Network Acceleration Software for AI Workloads

Built on Breakthrough Clock Sync Technology
Clockwork is built on a software-based clock synchronization technology developed by its founders at Stanford. It enables accurate one-way delay measurements on every data path at scale, without specialized hardware or network support.

Unlock the full potential of your GPU clusters with Clockwork. Our software solution eliminates contention and congestion, delivering unmatched reliability, enhanced visibility, and consistently high performance to keep your AI infra running at peak efficiency.

Clockwork's GPU Cloud Solution

Our approach is fundamentally different. Our unique software-based solution ensures reliability and fabric acceleration without relying on custom hardware or in-band network telemetry. Compatible with standard Ethernet switches and NICs, it can scale beyond 100,000 GPU nodes while cutting costs, boosting flexibility, and enhancing resilience.

Clockwork Job and Network Fleet Monitoring

Without real-time insight into connectivity, path quality, and message-level and job-level metrics, AI infra/ops teams cannot identify and resolve issues quickly.

Clockwork’s software:

  • Create NIC-to-NIC Probe Mesh: Small probe packets traverse the edges of the mesh
  • Monitor Network Health: Probes continuously check path liveness, whether or not there is data on the paths
  • Measure NIC-to-NIC Delays:
    • Synchronize the clocks at all the NICs
    • Obtain accurate one-way delays for every QPair of interest
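
To make the measurement idea concrete, here is a minimal sketch of how one-way delays and path liveness could be derived from probe timestamps. It is not Clockwork's implementation: the `Probe` record, the NIC identifiers, and the `timeout_ns` parameter are hypothetical, and the code simply assumes the sender and receiver clocks are already synchronized, so that a one-way delay is the receive timestamp minus the send timestamp.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import median


@dataclass(frozen=True)
class Probe:
    """One probe packet observed on a directed NIC-to-NIC path (hypothetical record)."""
    src: str    # sending NIC
    dst: str    # receiving NIC
    tx_ns: int  # send time on the sender's clock, in nanoseconds
    rx_ns: int  # receive time on the receiver's clock, in nanoseconds


def one_way_delays(probes):
    """Group one-way delays (rx - tx) by directed path.

    Assumes the two clocks are synchronized; otherwise rx - tx also
    contains the clock offset between the NICs.
    """
    delays = defaultdict(list)
    for p in probes:
        delays[(p.src, p.dst)].append(p.rx_ns - p.tx_ns)
    return delays


def summarize(probes):
    """Return min, median and sample count of the one-way delay per path."""
    return {
        path: {"min_ns": min(d), "median_ns": median(d), "count": len(d)}
        for path, d in one_way_delays(probes).items()
    }


def dead_paths(expected_paths, probes, now_ns, timeout_ns):
    """Return expected paths with no probe received in the last timeout_ns."""
    last_seen = {}
    for p in probes:
        key = (p.src, p.dst)
        last_seen[key] = max(last_seen.get(key, p.rx_ns), p.rx_ns)
    return sorted(
        path for path in expected_paths
        if path not in last_seen or now_ns - last_seen[path] > timeout_ns
    )


probes = [
    Probe("nic0", "nic1", 1000, 6000),
    Probe("nic0", "nic1", 2000, 8000),
    Probe("nic1", "nic0", 1000, 5000),
]
stats = summarize(probes)
expected = [("nic0", "nic1"), ("nic1", "nic0"), ("nic0", "nic2")]
stale = dead_paths(expected, probes, now_ns=10_000, timeout_ns=3_000)
```

Because the probes run whether or not application data is flowing, a liveness check like `dead_paths` can flag a path before a job ever tries to use it.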

Clockwork Solution To Link/NIC Flapping

Link and NIC failures caused by overheating optics are a common problem in InfiniBand and RoCE networks, resulting in job crashes and more frequent checkpointing and restarts.

“Alert: Urgent | Cluster: link has flapped more than 8 times within the past hour”
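
An alert like the one quoted above amounts to a sliding-window count. As a rough sketch only (not how Clockwork or any particular monitoring system implements it), such a rule could be evaluated as follows. The class name `FlapCounter`, its parameters, and the use of seconds for timestamps are assumptions made for illustration; the threshold of 8 and the one-hour window are taken from the alert text.

```python
from collections import deque


class FlapCounter:
    """Counts link flaps inside a sliding time window (illustrative only)."""

    def __init__(self, threshold=8, window_s=3600):
        self.threshold = threshold  # alert when the count exceeds this value
        self.window_s = window_s    # window length in seconds
        self._events = deque()      # flap timestamps, oldest first

    def record_flap(self, t_s):
        """Record a flap at time t_s (seconds) and report whether to alert."""
        self._events.append(t_s)
        # Drop flaps that have fallen out of the window.
        while self._events and self._events[0] <= t_s - self.window_s:
            self._events.popleft()
        return len(self._events) > self.threshold


counter = FlapCounter()
# Nine flaps, one per minute: only the ninth exceeds the threshold of 8.
results = [counter.record_flap(i * 60) for i in range(9)]
# A flap more than an hour later starts from a nearly empty window.
later = counter.record_flap(480 + 3600)
```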

Clockwork’s software:

  • Quickly detect link/NIC failures
  • Use an alternate path
  • Monitor health of failed paths and re-use them when they recover
  • Ensure continuous operation despite failures, leading to higher GPU utilization and faster job completion
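
The detect, fail over, and recover cycle in the list above can be sketched as a small state machine. This is an illustration under stated assumptions, not Clockwork's implementation: the `PathSelector` class, the fixed preference order, and the rule that a failed path is reused after `recover_after` consecutive successful probes are all hypothetical choices.

```python
class PathSelector:
    """Chooses the most preferred healthy path (illustrative sketch)."""

    def __init__(self, paths, recover_after=3):
        self.paths = list(paths)            # preference order, best first
        self.recover_after = recover_after  # good probes needed before reuse
        self.healthy = {p: True for p in self.paths}
        self._ok_streak = {p: 0 for p in self.paths}

    def report_probe(self, path, ok):
        """Update the health of a path from a single probe result."""
        if not ok:
            # One failed probe marks the path as down right away.
            self.healthy[path] = False
            self._ok_streak[path] = 0
        elif not self.healthy[path]:
            # A failed path keeps being probed and must prove itself again.
            self._ok_streak[path] += 1
            if self._ok_streak[path] >= self.recover_after:
                self.healthy[path] = True
                self._ok_streak[path] = 0

    def active_path(self):
        """Return the first healthy path, or None if every path is down."""
        for p in self.paths:
            if self.healthy[p]:
                return p
        return None


selector = PathSelector(["nic0", "nic1"])
before = selector.active_path()       # preferred path while all is well
selector.report_probe("nic0", ok=False)
during = selector.active_path()       # traffic moves to the alternate
selector.report_probe("nic0", ok=True)
selector.report_probe("nic0", ok=True)
still_failed = selector.active_path() # two good probes are not yet enough
selector.report_probe("nic0", ok=True)
after = selector.active_path()        # path is reused once it has recovered
```

Requiring several good probes before reuse is one simple way to avoid bouncing traffic back onto a link that is still flapping.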

Link/NIC Flapping: Before and After Clockwork

Without Clockwork, a NIC failure halts AI jobs entirely. With Clockwork, jobs continue at reduced throughput during a failure and quickly return to full capacity, so the job keeps running instead of crashing.

Clockwork's Solution to Fabric Contention

Bursty traffic causes multiple data flows to collide on links and contend for bandwidth, resulting in low throughput, high latency, and degraded NCCL performance.

“Network links get saturated quickly … The last flow dictates starting of next iteration”

Clockwork’s software:

  • Use QPair-level delay measurements to detect “fabric contention” (the oversubscription of certain paths in the fabric)
  • Balance the load evenly across the network fabric to eliminate contention and increase the total throughput.
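
The page does not describe the algorithms behind these two steps, so the following is only a sketch of one plausible approach. It assumes that contention shows up as queuing delay, meaning the current one-way delay of a path minus its uncongested baseline, and it uses a plain greedy "largest flow onto the least-loaded path" heuristic for balancing. The function names, the threshold, and the flow and path identifiers are all hypothetical.

```python
import heapq


def contended_paths(current_ns, baseline_ns, threshold_ns):
    """Return paths whose delay above baseline exceeds threshold_ns.

    current_ns and baseline_ns map a path to a one-way delay in nanoseconds.
    """
    return sorted(
        path for path, delay in current_ns.items()
        if delay - baseline_ns[path] > threshold_ns
    )


def balance_flows(flow_gbps, paths):
    """Greedily assign flows to paths, largest flow first.

    Each flow goes to the path with the smallest load assigned so far.
    This is a generic heuristic, not an optimal or vendor-specific one.
    """
    heap = [(0.0, p) for p in paths]  # (assigned load in Gbps, path)
    heapq.heapify(heap)
    assignment = {}
    for flow, rate in sorted(flow_gbps.items(), key=lambda kv: -kv[1]):
        load, path = heapq.heappop(heap)
        assignment[flow] = path
        heapq.heappush(heap, (load + rate, path))
    return assignment


hot = contended_paths(
    current_ns={"p0": 60_000, "p1": 12_000},
    baseline_ns={"p0": 10_000, "p1": 10_000},
    threshold_ns=20_000,
)
plan = balance_flows({"a": 40, "b": 30, "c": 20, "d": 10}, ["p0", "p1"])
```

In this toy input, the four flows end up split so that each path carries the same total rate, which is the kind of even spread the list above describes.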

Fabric Contention: Before and After Clockwork

Clockwork improves both throughput and latency in high-performance networks. In an All-to-All workload under contention, it recovers throughput from 39 Gbps to 92 Gbps while reducing latency to under 50 microseconds, even with simultaneous jobs.

100% Pure Software: Accelerate any workload on any AI accelerator and network.

Clockwork’s software replaces costly hardware, enabling rapid congestion resolution while ensuring reliability, acceleration, and full network visibility to keep AI jobs running 24/7.
