Network Acceleration Software For AI Workloads

Our approach is fundamentally different. Our software-based solution delivers reliability and fabric acceleration without relying on custom hardware or in-band network telemetry. It is compatible with RoCE and InfiniBand switches and NICs, so you can scale clusters from a few hundred GPUs to hundreds of thousands while cutting costs, boosting flexibility, and improving AI resilience.

The Sauce: High Accuracy Clock Sync at Scale

Our innovative solution is built on top of our breakthrough clock sync technology, which enables accurate true one-way delay measurements on each data path at scale, without specialized hardware or network support. Our "data-path aware" software transforms traditional networks into high-performance systems designed for the rigorous demands of today's AI workloads.

before-after-clockwork.png

Software Accelerated AI Networks

Upgrade and future-proof existing networks to match the speed and scale of AI workloads without expensive hardware and proprietary, brittle add-ons for congestion control and load balancing.

With Clockwork's software-based solution, get unmatched visibility and continuous high-performance networking, supercharging your AI jobs 24/7 -- whether on-prem or in the cloud, compatible with InfiniBand and RoCE. No bottlenecks. No GPU waste. No rigid, specialized hardware holding you back. 

Link / NIC Flapping: Before and After Clockwork

Without Clockwork, a NIC failure halts AI jobs entirely. With Clockwork, jobs continue at reduced throughput during a failure and quickly return to full capacity once the link recovers, so the job keeps running instead of crashing.

Nic-flapping.png
Easily Identify Network Related GPU Utilization Gaps
LInk-failure.png

Clockwork Solution To Link/NIC Flapping

Link/NIC failures due to overheating optics are a common problem in InfiniBand and RoCE networks, resulting in job crashes and more frequent checkpointing and restarting.

Clockwork’s software lets you:

  • Quickly detect link/NIC failures

  • Use an alternate path

  • Monitor health of failed paths and re-use them when they recover

  • Ensure continuous operations despite failures, leading to higher GPU utilization and faster job completion time.
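
The four steps above can be sketched as a small path selector. This is a minimal illustration, not Clockwork's implementation: the `PathSelector` class, its method names, and the path labels are all hypothetical, and the sketch assumes paths are tried in a fixed preference order.

```python
class PathSelector:
    """Hypothetical sketch: pick the most-preferred path that is currently healthy."""

    def __init__(self, paths):
        self.paths = list(paths)          # ordered by preference (assumption)
        self.healthy = set(self.paths)    # all paths start out healthy

    def mark_failed(self, path):
        # Called when failure detection reports a link/NIC failure on this path.
        self.healthy.discard(path)

    def mark_recovered(self, path):
        # Called when health monitoring sees the failed path working again.
        if path in self.paths:
            self.healthy.add(path)

    def current(self):
        # Return the first healthy path in preference order, or None if all are down.
        for path in self.paths:
            if path in self.healthy:
                return path
        return None


selector = PathSelector(["primary", "backup"])
selector.mark_failed("primary")       # traffic moves to the alternate path
print(selector.current())
selector.mark_recovered("primary")    # the recovered path is re-used
print(selector.current())
```

Keeping the failed path in the preference list, rather than deleting it, is what allows it to be re-used as soon as it is marked recovered.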

Clockwork Fleet And Job Monitoring

Create NIC-to-NIC Probe Mesh

Small probe packets traverse these edges
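
As a rough illustration of what a NIC-to-NIC probe mesh looks like, the sketch below enumerates every ordered pair of NICs as a directed probe edge. The function name `build_probe_mesh` and the NIC labels are hypothetical, and a full mesh is an assumption made for simplicity; a real deployment may probe only a subset of pairs.

```python
from itertools import permutations


def build_probe_mesh(nics):
    """Return every ordered (source, destination) NIC pair as a directed probe edge."""
    # Ordered pairs are used because the forward and reverse directions of a
    # path can behave differently, so each direction gets its own probe edge.
    return list(permutations(nics, 2))


edges = build_probe_mesh(["nic0", "nic1", "nic2"])
for src, dst in edges:
    print(f"probe {src} -> {dst}")
```

A full mesh of n NICs has n * (n - 1) directed edges, which is why the probes need to be small.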

Monitor Network Health

Probes continuously check for liveness of paths, whether or not there's data on the paths
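
One simple way to turn probe arrivals into a liveness signal is a timeout: a path counts as alive only if a probe was seen on it recently. The sketch below assumes that rule; the `LivenessTracker` class and the 0.5 second timeout are hypothetical and chosen only for illustration.

```python
class LivenessTracker:
    """Hypothetical sketch: a path is alive if a probe arrived within timeout_s."""

    def __init__(self, timeout_s):
        self.timeout_s = timeout_s
        self.last_seen = {}   # path -> time (seconds) of the last probe received

    def record_probe(self, path, now_s):
        # Call whenever a probe is received on the path.
        self.last_seen[path] = now_s

    def is_alive(self, path, now_s):
        # A path that has never delivered a probe is treated as not alive.
        seen = self.last_seen.get(path)
        return seen is not None and (now_s - seen) <= self.timeout_s


tracker = LivenessTracker(timeout_s=0.5)
tracker.record_probe(("nic0", "nic1"), now_s=10.0)
print(tracker.is_alive(("nic0", "nic1"), now_s=10.4))
print(tracker.is_alive(("nic0", "nic1"), now_s=10.6))
```

Because the probes run independently of application traffic, this check works even on paths that are currently idle.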

NIC-NIC Delay Measurement

Synchronize the clocks at all the NICs, obtain transmit (Tx) and receive (Rx) timestamps, and get ECN marks at switches

OWD = Rx - Tx

Obtain OWDs for every QPair of interest
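
The OWD formula can be applied per QPair as in the sketch below. It assumes timestamps are in nanoseconds and that the sender and receiver clocks are already synchronized; any residual clock offset adds directly to the measured delay. The function names, the sample format, and the use of a median to summarize each QPair are illustrative assumptions.

```python
from statistics import median


def one_way_delay_ns(tx_ns, rx_ns):
    """OWD = Rx - Tx, valid only when both timestamps come from synchronized clocks."""
    return rx_ns - tx_ns


def owd_per_qpair(samples):
    """Group (qpair_id, tx_ns, rx_ns) samples by QPair and return the median OWD of each."""
    delays = {}
    for qpair_id, tx_ns, rx_ns in samples:
        delays.setdefault(qpair_id, []).append(one_way_delay_ns(tx_ns, rx_ns))
    return {qpair_id: median(values) for qpair_id, values in delays.items()}


samples = [
    ("qp1", 1000, 1500),
    ("qp1", 2000, 2700),
    ("qp1", 3000, 3600),
    ("qp2", 1000, 1200),
]
print(owd_per_qpair(samples))  # {'qp1': 600, 'qp2': 200}
```

The median is used here only because it is less sensitive to a single delayed packet than the mean.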

Prevent AI Job Crashes From NIC/Link Failures
AI techstack.png

100% Pure Software For Multi-vendor Accelerators and Networks

Our software-driven approach redefines AI networking, supporting InfiniBand and RoCE in on-prem and cloud environments. By providing fine-grained visibility, enhanced reliability, and accelerated performance, Clockwork ensures your AI workloads run smoothly and efficiently around the clock.

Achieve Peak Job Throughput and Latency at Any Scale

Problem: Fabric Contention

AI workloads are homogeneous and highly correlated, so multiple data flows can collide on links and compete for bandwidth. The example below shows two All-to-All jobs contending for bandwidth, leading to latency spikes and throughput drops.

Contention.png

Interested In Learning More About Clockwork.io?

We're here to help! Please complete the form and we'll be in touch soon. 

Contact Sales

Accelerate Any AI Workload on Any Network

Clockwork's Solution To Fabric Contention

Bursty traffic causes multiple data flows to collide on links and contend for bandwidth, resulting in low throughput, high latency, and degraded NCCL performance.

Clockwork's software:

  • Uses QPair-level delay measurements to detect "fabric contention" (the oversubscription of certain paths in the network fabric).

  • Balances the load evenly across the network fabric to eliminate contention and increase total throughput.
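
The two steps above, detection from delay measurements followed by rebalancing, can be sketched as follows. This is a toy model under stated assumptions, not Clockwork's algorithm: a path is flagged as contended when its measured delay exceeds a multiple of its uncongested baseline, load is counted as the number of flows per path, and the function names, the `factor` threshold, and the path and flow labels are all hypothetical.

```python
def contended_paths(owd_ns, baseline_ns, factor=2.0):
    """Flag paths whose measured delay exceeds factor times their baseline delay."""
    return [path for path, delay in owd_ns.items() if delay > factor * baseline_ns[path]]


def rebalance(flows_by_path, contended):
    """Move flows off contended paths onto the least-loaded uncontended path."""
    loads = {path: list(flows) for path, flows in flows_by_path.items()}
    targets = [path for path in loads if path not in contended]
    if not targets:
        return loads  # nowhere to move flows, leave the assignment unchanged
    for src in contended:
        if src not in loads:
            continue
        while loads[src]:
            dst = min(targets, key=lambda path: len(loads[path]))
            # Stop once moving another flow would no longer reduce the imbalance.
            if len(loads[dst]) + 1 >= len(loads[src]):
                break
            loads[dst].append(loads[src].pop())
    return loads


owd = {"spine1": 90_000, "spine2": 12_000}        # measured delays (ns)
baseline = {"spine1": 10_000, "spine2": 10_000}   # uncongested delays (ns)
hot = contended_paths(owd, baseline)
print(hot)
print(rebalance({"spine1": ["f1", "f2", "f3", "f4"], "spine2": []}, hot))
```

In this example the four flows end up split two and two across the paths, which is the "even load" outcome the second bullet describes.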

Fabric-contention.png
No hardware required
Deploy anywhere in minutes 
Supports RoCE & InfiniBand

Fine-grained Visibility

Comprehensive real-time fleet and workload monitoring

Continuous Reliability

Continuous operations despite link and NIC failures

Fabric Acceleration

Eliminate contention and congestion, improve throughput
