Watch Oracle Cloud World Video on Performant AI Networks

Accelerate AI Around The Clock.

AI that never stalls. GPUs that never sit idle. Clockwork’s hardware-agnostic Software-Driven Fabric keeps workloads crash-proof and accelerated, and keeps GPUs fully utilized, at any scale.

No crashes. No slowdowns. Just efficient speed-to-market.

“As AI infrastructure scales to tens of thousands of GPUs for training and inference, the bottleneck has shifted from compute to communication. With accelerators running in lockstep, a single link flap, congestion spike or straggler can stall progress and crater utilization. The operational priority is utilizing real-time fabric visibility for faster fault isolation and recovery to keep workloads moving instead of looping through costly restarts. And as Mixture of Experts (MoE) models with high rank expert parallelism proliferate, the all-to-all exchange intensifies, raising the bar even higher for GPU communication efficiency.”
Dylan Patel
Founder, CEO, and Chief Analyst, SemiAnalysis

Clockwork Launches FleetIQ to Recast GPU Economics, Appoints Suresh Vasudevan as CEO

Uber accelerates incident detection, DCAI drives faster AI training and cluster efficiency, and Nebius raises MTBF in large-scale distributed AI—all powered by Clockwork’s first-of-its-kind Software-Driven Fabric.

“MI350X series systems with ROCm software and Pollara NICs provide a strong foundation for performance and reliability in AI training and inference. As deployments expand, ecosystem innovation, such as Clockwork’s software-driven approach, adds complementary capabilities that help ensure efficiency and consistency at scale.”
Vamsi Boppana
SVP, AI, AMD
“At Uber, we tackle real-time logistics problems where every millisecond matters—latency spikes don’t just hurt customer experience, they directly impact driver retention and revenue. In our tests across a hybrid, multi-cloud environment, Clockwork delivered significant coverage and accuracy improvements over networking observability. Their unique innovation can greatly help Uber expedite the detection and fault-localization of networking issues: from hours to minutes, which will greatly improve service tail latency and prevent noisy neighbor impact. We are in the process of rolling out Clockwork across Uber infrastructure, and look forward to experiencing their full capabilities at Uber’s scale. Clockwork’s software-driven fabric provides foundational observability for the hybrid, multi-cloud environment, helping us deliver what matters most: improved infrastructure utilization, enhanced resiliency, and ultimately, a better experience for the millions of people who rely on our platform every day.”
Albert Greenberg
Chief Architect Officer, Uber
“At Broadcom, our focus has always been on delivering Ethernet-centric infrastructure that scales AI with both performance and efficiency. Clockwork’s software-driven fabric adds an essential layer of agility and observability that enhances the power of our silicon. With proactive fleet monitoring and seamless failover, Clockwork enables platforms such as our Tomahawk 6 and Jericho4 to realize their full potential in flexibility, uptime, and AI performance. Together, we’re driving open, adaptable fabrics that allow enterprises to build AI infrastructure that is resilient, high-performing, and future-ready.”
Ram Velaga
Senior Vice President and General Manager, Core Switching Group, Broadcom
“At Nscale, we are building the foundation for AI at planetary scale—making it faster, more efficient, and more resilient for the world’s most ambitious organizations. To do that, we seek partners who share our vision for redefining what’s possible. Clockwork’s approach aligns perfectly with ours, and together we’re creating an AI infrastructure that is not only powerful and reliable, but ready to support the most demanding innovations of the future.”
David Power
CTO, Nscale
“We have been working with Clockwork to evaluate their software-driven fabric on our AI infrastructure, and seeing meaningful improvements in reliability. This is exactly what our customers need when running large-scale AI workloads where any disruption can be costly. We like how this approach works across different network configurations without requiring hardware lock-in. As we continue to scale our infrastructure, solutions that focus on the communication layer — which is often a bottleneck — are becoming increasingly important for delivering the performance and reliability that our customers expect.”
Danila Shtan
CTO, Nebius
“Our mission at DCAI is to remove barriers to high-performance AI infrastructure — not only to serve researchers, startups, and enterprises today, but also to build the sovereign foundations of tomorrow’s innovation economy. Gefion is a game-changing resource driving breakthroughs in quantum computing, drug discovery, advanced weather forecasting and beyond. To succeed, we must deliver resilience, reliability and efficiency at an unprecedented scale — performance once reserved for hyperscalers. Partnering with Clockwork enables us to operate Gefion seamlessly and reliably, even as workloads and demands increase. The result is a compute-efficient, fault-tolerant infrastructure that researchers and industries can trust — lowering costs, eliminating wasted GPU cycles, and helping us deliver a sovereign AI capability second to none.”
Dr. Nadia Carlsten
CEO, DCAI
“At WhiteFiber, Clockwork helps us deploy GPU clusters faster and with greater consistency. Their observability and rapid localization of fabric issues not only reduce deployment times but also validate the reliability of our infrastructure, ensuring clients’ AI workloads run on clusters built for performance, resilience and scale.”
Tom Sanfillippo
CTO, WhiteFiber

Compute Isn’t The Bottleneck. Communication Is.

Convert Idle GPUs into Productive Intelligence

GPU clusters deliver only 30–55% of peak capacity (Nvidia), wasting billions at scale. GPUs should be busy processing AI workloads, but instead they sit idle waiting on the network.

18–57% of training time and 40–75% of inference time is lost to communication (AMD), turning even small hiccups like link flaps into costly restarts and wasted GPU hours.

The result: an AI efficiency gap caused by three compounding issues—visibility (pinpointing slowdowns), reliability (frequent cluster failures), and performance (traffic collisions and congestion).
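To make the scale of that waste concrete, here is a back-of-the-envelope sketch in Python. The cluster size and GPU-hour price are illustrative assumptions, not measured or quoted figures; only the 30–55% delivered-capacity range comes from the statistics above.

```python
# Back-of-the-envelope estimate of GPU capacity stranded by communication stalls.
# Cluster size and $/GPU-hour are illustrative assumptions; the 30-55% delivered
# capacity range is taken from the figures cited above.

gpus = 16_384                # assumed cluster size
hours_per_year = 24 * 365
gpu_hour_cost = 2.50         # assumed blended $/GPU-hour

for delivered in (0.30, 0.55):   # fraction of peak capacity actually delivered
    wasted_fraction = 1.0 - delivered
    wasted_gpu_hours = gpus * hours_per_year * wasted_fraction
    wasted_dollars = wasted_gpu_hours * gpu_hour_cost
    print(f"delivered {delivered:.0%}: "
          f"{wasted_gpu_hours:,.0f} stranded GPU-hours/year "
          f"≈ ${wasted_dollars:,.0f}/year")
```

Under these assumed rates, a single cluster of this size strands roughly $160M–$250M of compute per year, which is how the waste compounds into billions across a fleet.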

Clockwork closes this gap. Our hardware-agnostic software fabric delivers nanosecond-precise visibility, dynamic traffic control, and job-aware resilience, so AI jobs run through failures, GPU utilization rises, and stranded capacity becomes dependable power—at scale.

Time to first job failure in a brand-new cluster: 26.28 minutes

“Achieving high utilization with them (GPUs) is difficult due to the high failure rate of various components, especially networking.”

See AI Jobs Run Through the Failures That Break Others

In GPU clusters, network link failures are constant—and they can crash critical AI jobs in an instant. Clockwork makes those failures irrelevant. Watch how our software fabric keeps jobs running, uninterrupted, even when a live network cable is pulled.
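For context on what the demo is protecting against, here is a minimal, hypothetical sketch (assuming a PyTorch job launched with torchrun on GPU nodes and the NCCL backend; none of this is Clockwork code) of the default behavior: when a link goes down mid-collective, the blocking all-reduce stalls until its timeout expires and the whole job has to be torn down and restarted from a checkpoint.

```python
# Hypothetical baseline (not Clockwork code): a synchronous NCCL all-reduce in a
# PyTorch job launched with torchrun. If a link flaps mid-collective, the call
# blocks until the timeout below expires; depending on NCCL error-handling
# settings the watchdog then raises/aborts, and every rank must restart from
# the last checkpoint.
import datetime

import torch
import torch.distributed as dist


def sync_gradients(grad: torch.Tensor) -> None:
    # Gradient synchronization is a blocking collective across every rank,
    # so a single bad link stalls the entire job.
    dist.all_reduce(grad, op=dist.ReduceOp.SUM)


if __name__ == "__main__":
    dist.init_process_group(
        backend="nccl",
        timeout=datetime.timedelta(minutes=10),  # how long a dead link can stall us
    )
    device = f"cuda:{dist.get_rank() % torch.cuda.device_count()}"
    grad = torch.ones(1024, device=device)
    try:
        sync_gradients(grad)
    except RuntimeError:
        # Without fabric-level fault tolerance the only recovery path is a full
        # teardown and restart of all ranks from the last checkpoint.
        dist.destroy_process_group()
        raise
```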

“All cloud providers and infrastructure teams have these problems. These are important problems to solve.”
Jag Brar
VP and Distinguished Engineer

AI Training Communication Constraints

40%
of AI training and inference time is spent on network communications

26.28 mins
time to first job failure in a brand-new cluster

45–70%
of GPU potential capacity is wasted in real-world clusters

AI Training and Inference

[Diagram: job crashes and slowdowns, lower goodput per dollar, lower model FLOPS, wasted GPU cycles, inconsistent SLAs, vendor lock-in and hardware obsolescence, and interrupted operations (higher MTTR/MTTD), spanning communication libraries (NCCL, NVSHMEM, UCX, MPI, GDS), I/O protocols (NFS, NVMe-oF, S3), and network APIs/transports (TCP/UDP, Ethernet, libfabric, InfiniBand, ibverbs, RoCEv2) that carry training data ingestion, checkpoint writes, parameter exchange, model load/fetch, query/request, application/user I/O, and control traffic between the compute cluster, storage, and AI factories.]

Stringent I/O demand

Lossless, terabit-scale bandwidth, microsecond latency

Synchronized, stateful flows

Application, GPU-to-GPU, storage I/O flows

Multiple networks / transports

Ethernet, RoCE, and InfiniBand

Frequent hardware failures

Jobs forced to restart too often

Clockwork Software-Driven Fabric Optimizes Cluster Utilization

AI Training and Inference

[Diagram: with Clockwork’s Global Clock Sync, Dynamic Traffic Control, and FleetIQ, the same compute, storage, and AI-factory fabric delivers faster job completion times, higher goodput per dollar, higher model FLOPS, optimal utilization, consistent SLAs, a multi-vendor, future-proofed investment, and 24/7 resilient operations through stateful fault-tolerance, efficient performance, and cross-stack visibility.]

Cross-stack visibility

Identify WHY jobs are slow, inefficient, or failing, and correlate them with underlying infrastructure issues; a simple sketch of this kind of correlation follows below.

Stateful fault-tolerance

Jobs continue without disruption despite infrastructure failures.

Efficient performance

Eliminate congestion, contention, and infrastructure bottlenecks.
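As a rough illustration of the cross-stack correlation idea above, here is a generic sketch with made-up step timings and fabric events; it is not Clockwork’s tooling or data model.

```python
# Generic cross-stack correlation sketch: line up slow training steps with fabric
# events that overlap them in time. Step timings and events below are made up.
from datetime import datetime, timedelta

# (step, start time, duration in seconds) as a job-side profiler might emit them
steps = [
    (1041, datetime(2025, 1, 7, 12, 0, 0), 1.9),
    (1042, datetime(2025, 1, 7, 12, 0, 2), 2.0),
    (1043, datetime(2025, 1, 7, 12, 0, 4), 9.6),   # straggler step
]

# (timestamp, description) as network telemetry might emit them
fabric_events = [
    (datetime(2025, 1, 7, 12, 0, 5), "link flap: leaf-12 port 33"),
]

SLOW_THRESHOLD_S = 3.0  # assumed cutoff for a "slow" step

for step, start, duration in steps:
    if duration < SLOW_THRESHOLD_S:
        continue
    end = start + timedelta(seconds=duration)
    overlapping = [desc for ts, desc in fabric_events if start <= ts <= end]
    print(f"step {step} took {duration:.1f}s; "
          f"overlapping fabric events: {overlapping or 'none'}")
```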

Explainer Videos: Software-Driven Fabrics Optimize Cluster Utilization

100% Software-Driven Fabric
For Multi-vendor Accelerators and Networks

Clockwork’s breakthrough software eliminates the need for expensive, proprietary hardware, enabling hosts to rapidly detect and resolve congestion and network contention. It delivers reliability, acceleration, and full visibility into workload and network health to keep AI jobs running around the clock.
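As a loose illustration of how a host, given tightly synchronized clocks, could flag fabric congestion from one-way delay inflation alone, here is a generic sketch with assumed numbers; it is not Clockwork’s implementation.

```python
# Generic host-side, delay-based congestion check (assumed technique sketch, not
# Clockwork's implementation): with sender and receiver clocks synchronized to a
# common timebase, one-way delay inflation above an uncongested baseline signals
# queue build-up somewhere in the fabric.

BASELINE_OWD_US = 12.0       # assumed uncongested one-way delay, microseconds
CONGESTION_MARGIN_US = 30.0  # assumed inflation threshold, microseconds


def one_way_delay_us(tx_us: float, rx_us: float) -> float:
    """One-way delay; meaningful only if both timestamps share a timebase."""
    return rx_us - tx_us


def is_congested(tx_us: float, rx_us: float) -> bool:
    return one_way_delay_us(tx_us, rx_us) - BASELINE_OWD_US > CONGESTION_MARGIN_US


# A probe sent at t=1,000,000 us arrives at t=1,000,055 us on the receiver's
# synchronized clock: 55 us OWD, 43 us above baseline, so it is flagged.
print(is_congested(1_000_000.0, 1_000_055.0))  # True
```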

Learn More

Stop wasting GPU cycles. Start scaling smarter.
Clusters must deliver high uptime while running at maximum efficiency.

Turn your GPU clusters into a competitive advantage—not a cost center.