Network Acceleration Software For AI Workloads
Our approach is fundamentally different. Our software-based solution delivers reliability and fabric acceleration without relying on custom hardware or in-band network telemetry. It is compatible with RoCE and InfiniBand switches and NICs, and it scales to clusters of any size, from a few hundred GPUs to hundreds of thousands, while cutting costs, boosting flexibility, and delivering unmatched AI resilience.
The Sauce: High Accuracy Clock Sync at Scale
Our innovative solution is built on top of our breakthrough clock sync technology, which enables accurate true one-way delay measurements on each data path at scale, without specialized hardware or network support. Our "data-path aware" software transforms traditional networks into high-performance systems designed for the rigorous demands of today's AI workloads.


Software Accelerated AI Networks
Upgrade and future-proof existing networks to match the speed and scale of AI workloads without expensive hardware and proprietary, brittle add-ons for congestion control and load balancing.
With Clockwork's software-based solution, get unmatched visibility and continuous high-performance networking, supercharging your AI jobs 24/7 -- whether on-prem or in the cloud, compatible with InfiniBand and RoCE. No bottlenecks. No GPU waste. No rigid, specialized hardware holding you back.
Link / NIC Flapping: Before and After Clockwork
Without Clockwork, a NIC failure halts AI jobs entirely. With Clockwork, jobs continue at reduced throughput during a failure and quickly return to full capacity, ensuring robust resilience and uninterrupted performance.

Easily Identify Network-Related GPU Utilization Gaps

Clockwork Solution To Link/NIC Flapping
Link and NIC failures caused by overheating optics are a common problem in InfiniBand and RoCE networks, resulting in job crashes and more frequent checkpointing and restarting.
Clockwork’s software lets you:
- Quickly detect link/NIC failures
- Use an alternate path
- Monitor health of failed paths and re-use them when they recover
- Ensure continuous operations despite failures, leading to higher GPU utilization and faster job completion time.
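
The detect, reroute, and recover steps above can be pictured with a minimal Python sketch. This is an illustration only, not Clockwork's implementation: the `PathSelector` class, the path names, and the `fail_after` / `recover_after` thresholds are all hypothetical, and probe results are assumed to arrive as simple received/lost booleans.

```python
from dataclasses import dataclass


@dataclass
class PathState:
    name: str
    missed: int = 0        # consecutive probes lost on this path
    ok_streak: int = 0     # consecutive probes received on this path
    healthy: bool = True


class PathSelector:
    """Hypothetical failover helper (illustrative, not a real Clockwork API)."""

    def __init__(self, paths, fail_after=3, recover_after=2):
        # Paths are kept in preference order: the first healthy one is used.
        self.paths = {p: PathState(p) for p in paths}
        self.fail_after = fail_after        # assumed threshold
        self.recover_after = recover_after  # assumed threshold

    def record_probe(self, path, received):
        st = self.paths[path]
        if received:
            st.missed = 0
            st.ok_streak += 1
            # Re-use a failed path only after it has answered several probes.
            if not st.healthy and st.ok_streak >= self.recover_after:
                st.healthy = True
        else:
            st.ok_streak = 0
            st.missed += 1
            # Declare the path down after enough consecutive losses.
            if st.missed >= self.fail_after:
                st.healthy = False

    def pick(self):
        """Return the most-preferred healthy path, or None if all are down."""
        for name, st in self.paths.items():
            if st.healthy:
                return name
        return None
```

In this sketch a job keeps sending on whatever `pick()` returns, so a flapping NIC degrades throughput to the alternate path instead of stopping the job, and traffic returns to the preferred path once it has proven itself healthy again.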
Clockwork Fleet And Job Monitoring
Create NIC-to-NIC Probe Mesh
Small probe packets traverse the edges of this mesh
Monitor Network Health
Probes continuously check for liveness of paths, whether or not there's data on the paths
NIC-NIC Delay Measurement
Synchronize the clocks at all the NICs, obtain transmit (Tx) and receive (Rx) timestamps, and get ECN marks at switches
OWD = Rx - Tx
Obtain OWDs for every QPair of interest
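
As a rough illustration of the OWD = Rx - Tx step, the Python sketch below groups probe timestamps by QPair and computes the one-way delay for each probe. The record layout `(qpair_id, tx_ns, rx_ns)` and the function names are assumptions made for this example. It also assumes both timestamps come from clocks that are already synchronized, since any residual clock offset adds directly to the computed delay.

```python
from collections import defaultdict
from statistics import median


def one_way_delays(probes):
    """Group probes by QPair and compute OWD = Rx - Tx for each one.

    probes: iterable of (qpair_id, tx_ns, rx_ns) tuples (hypothetical layout),
    with timestamps in nanoseconds taken from synchronized NIC clocks.
    """
    per_qpair = defaultdict(list)
    for qpair_id, tx_ns, rx_ns in probes:
        per_qpair[qpair_id].append(rx_ns - tx_ns)
    return dict(per_qpair)


def summarize(owds):
    """Reduce each QPair's delay samples to min / median / max."""
    return {
        qpair_id: {
            "min_ns": min(samples),
            "median_ns": median(samples),
            "max_ns": max(samples),
        }
        for qpair_id, samples in owds.items()
    }
```

The minimum per QPair is a reasonable stand-in for the uncongested baseline of that path, while the median and maximum show how far queuing has pushed delay above it.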
Prevent AI Job Crashes From NIC/Link Failures


100% Pure Software For Multi-vendor Accelerators and Networks
Our software-driven approach redefines AI networking, supporting InfiniBand and RoCE in on-prem and cloud environments. By providing fine-grained visibility, enhanced reliability, and accelerated performance, Clockwork ensures your AI workloads run smoothly and efficiently around the clock.
Achieve Job Peak Performance Throughput and Latency at Any Scale
Problem: Fabric Contention
Because AI workloads are homogeneous and highly correlated, multiple data flows can collide on links and compete for bandwidth. The example below shows two All-to-All jobs contending for bandwidth, leading to latency spikes and throughput drops.

Accelerate Any AI Workload on Any Network
Clockwork's Solution To Fabric Contention
Bursty traffic causes multiple data flows to collide on links and contend for bandwidth, resulting in low throughput, high latency, and degraded NCCL performance.
Clockwork's software:
- Uses QPair-level delay measurements to intelligently detect "fabric contention" (the oversubscription of certain paths in the network fabric).
- Balances the load evenly across the network fabric to eliminate contention and increase the total throughput.
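
One way to picture delay-based contention detection and rebalancing is the Python sketch below. It is a simplified stand-in, not Clockwork's algorithm: the `factor` threshold, the per-path baseline delays, and the greedy move-to-least-loaded policy are all assumptions chosen to keep the example short.

```python
from collections import Counter


def contended_paths(owd_ns, baseline_ns, factor=2.0):
    """Flag paths whose current delay exceeds factor x their baseline delay."""
    return {p for p, delay in owd_ns.items() if delay > factor * baseline_ns[p]}


def rebalance(flow_to_path, owd_ns, baseline_ns, factor=2.0):
    """Greedily move flows off contended paths onto less-loaded ones.

    flow_to_path: dict mapping flow id -> path id (hypothetical identifiers).
    Returns a new mapping; the input is left unchanged.
    """
    hot = contended_paths(owd_ns, baseline_ns, factor)
    cool = sorted(p for p in owd_ns if p not in hot)
    load = Counter(flow_to_path.values())  # flows currently on each path
    new_map = dict(flow_to_path)
    if not cool:
        return new_map  # nowhere better to move flows
    for flow, path in flow_to_path.items():
        if path not in hot:
            continue
        target = min(cool, key=lambda p: (load[p], p))
        # Move only if the target stays lighter than the source was.
        if load[target] + 1 < load[path]:
            new_map[flow] = target
            load[path] -= 1
            load[target] += 1
    return new_map
```

For example, with four flows on a path whose delay is ten times its baseline and one idle alternative, this sketch moves two flows across and leaves two in place, so neither path ends up oversubscribed.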

No hardware required
Deploy anywhere in minutes
Supports RoCE & InfiniBand


Fine-grained Visibility
Comprehensive real-time fleet and workload monitoring


Continuous Reliability
Continuous operations despite link and NIC failures


Fabric Acceleration
Eliminate contention and congestion, improve throughput