Clockwork.io | AI Network Reinvented

Link / NIC Flapping: Before And After Clockwork

Without Clockwork, a NIC failure halts AI jobs entirely. With Clockwork, jobs continue at reduced throughput during a failure and quickly return to full capacity, so a NIC fault temporarily degrades a job instead of interrupting it.

AI Networking Reinvented: Accelerate Smarter With Software

Boost AI job speed and GPU utilization with blazing fast efficiency

Clockwork's pure software solution delivers zero-loss network performance at scale on commodity NICs and switches. No proprietary hardware or expensive upgrades needed.

Explore Platform

Frequent Network Issues Cited By GPU Operators And Customers

The ToR has 2 down uplinks. It's unknown why they went down 6 months ago!

We don't have anything like this. We barely have per minute data, this is per second.

Alert: Urgent/Cluster: Link has flapped more than 8 times within the past hour. We have to send technicians to swap out.

Network links get saturated quickly.

The last flow dictates the start of the next iteration.

We tuned the DCQCN threshold to increase throughput, but latency and congestion went up!

157 Minutes: The time to first job failure on a new 10,000-GPU cluster

Disruptions Impact AI: Network bottlenecks and link outages disrupt AI workloads, forcing job restarts.
Wasted GPU Cycles: Optical failures and overheating cause cluster pauses, lowering GPU utilization.
Reduced ROI: Delayed AI performance leads to poor returns on capital investment.
Solution: Deliver fault tolerance and reliability despite network failures to achieve high GPU utilization and efficiency.

Contact Sales

Network Acceleration Software For AI Workloads

Clockwork's approach is fundamentally different. Our unique software-based solution ensures reliability and fabric acceleration without relying on custom hardware or in-band network telemetry. Compatible with standard Ethernet switches and NICs, it can scale beyond 100,000 GPU nodes while cutting costs, boosting flexibility, and enhancing resilience.

Clockwork's Solution For CPU Cloud

"Achieving high utilization with them (GPUs) is even more difficult due to the high failure rates of various components, especially networking."

Challenge: Description: Consequences

1. Lack of visibility: No real-time insight into connectivity, path quality, or message-level and job-level metrics: AI infrastructure teams struggle to detect and fix issues fast.
2. Lack of reliability: Link / NIC flapping and failures due to overheating: Job crashes, frequent restarts, and checkpointing.
3. Network contention and congestion: Bursty traffic in which multiple data flows collide on links and contend for bandwidth: Low throughput, high latency, and degraded NCCL performance.

Impact On Your AI Workloads

Cloud Deluxe Boosts App Performance

Eliminate network congestion even under high load. Postpone autoscaling until it is really needed. Improve the performance of latency-sensitive apps and AI/ML workloads.

Do more with fewer resources.


Fine-grained Visibility

Comprehensive real-time fleet and workload monitoring

Continuous Reliability

Comprehensive operations despite link and NIC failures

Fabric Acceleration

Eliminate contention and congestion, improve throughput

David A Maltz,
Technical Fellow and Corporate Vice President,
Microsoft Azure

It has been great to partner with Clockwork's team as we've conducted successful trials on Azure, and, moving forward, we believe their tech will prove highly effective in identifying and eliminating network bottlenecks.

Amin Vahdat,
Engineering Fellow and VP,
Google Cloud

I've collaborated with the Clockwork team since their Stanford days. They solve a decades-old problem in scalable, high-accuracy network clock sync. It's a foundational technology and Clockwork's application of it can solve basic problems

Interested In Learning More About Clockwork.io?

We're here to help! Please complete the form and we'll be in touch soon.

Google Cloud

Get started with our fully managed solution on Google Cloud


Microsoft Azure

Get started with our fully managed solution on Azure Marketplace


Amazon Web Services

Coming soon to AWS Marketplace. Contact us for a private beta.