About Us

Clockwork builds software that optimizes GPU clusters for fault tolerance, deterministic performance, and increased utilization. Today’s AI workloads are massively distributed and depend on GPUs staying tightly in sync: performance is limited not by how fast each GPU is, but by how efficiently thousands of them can communicate. Clockwork’s Software-Driven AI Fabrics optimize communication and make workloads resilient by combining observability to catch and quickly resolve problems, fault tolerance to keep jobs running through failures, and performance optimization that dynamically routes and paces traffic to avoid congestion.

The result: Clockwork FleetIQ improves GPU cluster utilization and job completion times by 1.1–1.5x, reduces disruptive failures by over 90%, and works across any network (InfiniBand, RoCE, or Ethernet), any GPU vendor, and any deployment model (from hyperscaler to on-premises), all delivered entirely in software.

2018

TickTock Networks founded, based on research incubated at Stanford University.

2021

TickTock Networks renamed to Clockwork Systems.

2022

Clockwork.io debuts publicly.

AI Factories — massive clusters with thousands of GPUs — are the most expensive computing infrastructures ever built. Yet despite their staggering scale and cost, they typically achieve only 20–40% utilization.

The reason is surprising: the bottleneck in AI is not compute — it is communication.

Modern AI workloads require thousands of GPUs to operate in tight synchronization. A single straggler, transient network issue, or GPU failure can stall or crash entire jobs, wasting hours of GPU time and billions of dollars of infrastructure capacity at hyperscale.

The design principles that made cloud computing successful — statelessness, horizontal elasticity, location transparency, and tolerance of individual failure — are not merely insufficient for tightly coupled workloads; they are often directly counterproductive.

AI training is forcing a transition from infrastructure optimized for independent request handling to infrastructure optimized for coordinated completion of shared phases across many devices. This shift represents the emergence of a new infrastructure discipline — one built around coordinated systems rather than independent services.

In this new discipline, the communication fabric becomes a first-class system component, responsible not only for moving data, but for coordinating the behavior of the distributed system itself. AI infrastructure must prioritize predictable performance over peak performance, failure domain engineering over simple redundancy, and fabric-first networking over host-centric communication.

We propose Software-Driven AI Fabrics — programmable, vendor-neutral control layers that operate across any accelerator, any network (RoCE, InfiniBand, or Ethernet), and any cloud or on-prem environment. At its core, a Software-Driven AI Fabric functions as a closed-loop control system for distributed AI infrastructure: continuously observing the state of communication and computation, deciding how the system should adapt, and acting on the network and communication layer in real time.

Today, Software-Driven AI Fabrics provide:

  • Observability to detect and resolve problems quickly
  • Fault tolerance to keep AI jobs running through failures
  • Performance optimization that dynamically reroutes, paces, and slices traffic flows to relieve congestion and enforce QoS

The result is dramatically higher GPU utilization, faster job completion, and better return on AI infrastructure investment through a 100% software-based solution.

Our Technology Pillars

  • Global Clock Sync — software-based near-nanosecond clock synchronization across every component (host, switch, NIC, etc.) in a distributed environment. This enables telemetry at near-nanosecond granularity, capable of accurately measuring one-way network delays, and provides the foundation for real-time insights into network and workload conditions (a minimal sketch of one-way delay measurement follows this list).
  • Distributed State Transfer — the ability to capture distributed network and communication state (model parameters, connection health, routing, NCCL communicators, rank progress) and transfer it, with appropriate modifications, to alternative destinations. This capability underpins network fault tolerance through path failover and enables live GPU migration, allowing AI workloads to continue running through failures (see the second sketch after this list).
  • Dynamic Traffic Control — treats the network as a programmable system, steering traffic dynamically in response to real-time telemetry. During congestion events, it identifies imbalanced or colliding flows and shifts traffic across underutilized paths. It can pace or delay specific queue pairs at the packet level, bounding tail latencies that would otherwise stall synchronized collectives. Because it runs entirely in software, it operates across Ethernet, InfiniBand, and RoCE fabrics without requiring proprietary switch features.
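
To ground the first pillar, here is a minimal sketch of why synchronized clocks matter: one-way delay becomes a simple subtraction of timestamps, with no round-trip estimate or symmetric-path assumption. The probe format, flow names, and threshold are illustrative assumptions, not Clockwork interfaces:

```python
from dataclasses import dataclass

@dataclass
class Probe:
    flow_id: str
    tx_ns: int  # send time, on the sender's synchronized clock
    rx_ns: int  # receive time, on the receiver's synchronized clock

def one_way_delay_ns(p: Probe) -> int:
    # With clocks synchronized to near-nanosecond accuracy, one-way delay
    # is just the receive timestamp minus the send timestamp.
    return p.rx_ns - p.tx_ns

def flag_stragglers(probes: list[Probe], baseline_ns: int, factor: float = 1.5) -> list[str]:
    """Flag flows whose one-way delay exceeds factor x the expected baseline."""
    return [p.flow_id for p in probes if one_way_delay_ns(p) > factor * baseline_ns]

# A flow whose delay jumped from ~8 us to ~20 us stands out immediately.
probes = [
    Probe("qp-17", tx_ns=1_000, rx_ns=9_000),   # ~8 us: healthy
    Probe("qp-42", tx_ns=2_000, rx_ns=22_000),  # ~20 us: congested
]
print(flag_stragglers(probes, baseline_ns=8_000))  # -> ['qp-42']
```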
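
And a correspondingly minimal sketch of the path-failover idea behind the second pillar, again with hypothetical, simplified fields standing in for real RDMA/NCCL state:

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class ConnState:
    qp_id: int     # RDMA queue pair identifier (simplified stand-in)
    rank: int      # NCCL rank this connection serves
    next_seq: int  # progress marker: next expected sequence number
    path: str      # network path currently carrying the flow

def fail_over(state: ConnState, healthy_path: str) -> ConnState:
    """Re-target captured connection state onto a healthy path.

    Progress markers (rank, next_seq) are preserved so the collective can
    resume where it left off; only the path-specific field is rewritten.
    """
    return replace(state, path=healthy_path)

captured = ConnState(qp_id=17, rank=3, next_seq=84_211, path="leaf2-spine1")
restored = fail_over(captured, healthy_path="leaf2-spine3")
print(restored)  # same rank and progress, new path
```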

Together, these capabilities create a continuously operating control loop: high-precision observation of system behavior, distributed visibility into communication state, and the ability to enact coordinated changes across the network and communication stack.
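
As a rough end-to-end illustration of that loop, the toy example below observes per-path delay telemetry, decides which flow to move, and acts by re-steering it. The fabric interface, path names, and thresholds are illustrative stand-ins, not Clockwork's actual APIs:

```python
class ToyFabric:
    """Toy stand-in for a programmable fabric with two paths."""
    def __init__(self) -> None:
        self.flows = {"qp-17": "path-A", "qp-42": "path-A"}

    def telemetry_snapshot(self) -> dict[str, int]:
        # Observe: per-path p99 one-way delay in ns (path-A is congested).
        return {"path-A": 25_000, "path-B": 7_000}

    def flows_on(self, path: str) -> list[str]:
        return [f for f, p in self.flows.items() if p == path]

    def move_flow(self, flow: str, to_path: str) -> None:
        # Act: steer one flow onto an underutilized path.
        self.flows[flow] = to_path

def control_step(fabric: ToyFabric, delay_budget_ns: int = 10_000) -> None:
    snapshot = fabric.telemetry_snapshot()             # observe
    for path, p99 in snapshot.items():                 # decide
        if p99 > delay_budget_ns and fabric.flows_on(path):
            spare = min(snapshot, key=snapshot.get)    # least-loaded path
            fabric.move_flow(fabric.flows_on(path)[0], to_path=spare)

fabric = ToyFabric()
control_step(fabric)
print(fabric.flows)  # {'qp-17': 'path-B', 'qp-42': 'path-A'}
```

A production loop would also pace or delay individual queue pairs rather than only rerouting, as described under Dynamic Traffic Control, and would run continuously rather than as a single step.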

Conclusion

These capabilities manifest in three primary application domains: fault tolerance, workload mobility, and communication optimization. These domains are not separate products; they are expressions of the same observe-decide-act control loop, running continuously and transparently beneath the training framework. The end state is autonomic collective communications: a system that continuously monitors its own health and performance, predicts failures before they occur, adapts topology and routing in real time, and optimizes communication parameters based on observed conditions.

This vision is grounded in infrastructure that already exists. Distributed State Tracking already provides real-time visibility from the RDMA queue-pair level up through collective operations and into the training workload. Observability-Driven Orchestration already translates those signals into coordinated, multi-step interventions beneath running jobs.

Collective communication is the nervous system of distributed AI training. Today, that nervous system is static, fragile, and blind. Clockwork is making that nervous system intelligent: able to sense what is happening across the infrastructure, reason about what should change, and act without disrupting the workloads it supports. The result is AI training that doesn’t just run fast when everything is working, but runs well all the time.

Our Leadership

Balaji Prabhakar
Co-Founder
Suresh Vasudevan
CEO
Mendel Rosenblum
Chief Scientist
Deepak Merugu
Co-Founder and Chief Engineer
Yilong Geng
Co-Founder and CTO
Dan Zheng
Chief Business Officer
Anita Pandey
VP of Growth Marketing
Joe Tarantino
VP of Global Sales
Gavin Cohen
VP of Product
Alison Qu
VP of Finance

Our Investors

New Enterprise Associates

Lead Series A investor

Lip-Bu Tan
CEO of Intel
Diane Greene
Former CEO of VMware
John Chambers
Former CEO and Executive Chairman of Cisco Systems
John Hennessy
Former President of Stanford University, Chairman of the Board of Alphabet
Ram Shriram
Trustee of Stanford University and of Alphabet
Jerry Yang
AME Cloud Ventures, Co-Founder of Yahoo!

Want to join our team? Check out our current positions.