About Us
Clockwork builds software that optimizes GPU clusters for fault tolerance, deterministic performance, and increased utilization. Today’s AI workloads are massively distributed and depend on GPUs staying tightly in sync. Performance isn’t limited by how fast each GPU is, but by how efficiently thousands of them can communicate. Clockwork Software-Driven AI Fabrics optimize communication and make workloads resilient by combining observability to catch and quickly resolve problems, fault tolerance to keep jobs running through failures, and performance optimization that dynamically routes and paces traffic to avoid congestion.
The result: Clockwork FleetIQ improves GPU cluster utilization and job completion times by 1.1–1.5x, reduces disruptive failures by over 90%, and works across any network (InfiniBand, RoCE, or Ethernet), any GPU vendor, and any deployment model (from hyperscaler to on-premises), entirely in software.
2018
TickTock Networks founded based on research incubated at Stanford University.
2021
TickTock Networks renames to Clockwork Systems.
2022
Clockwork.io debuts publicly.
AI Factories — massive clusters with thousands of GPUs — are the most expensive computing infrastructures ever built. Yet despite their staggering scale and cost, they typically achieve only 20–40% utilization.
The reason is surprising: the bottleneck in AI is not compute — it is communication.
Modern AI workloads require thousands of GPUs to operate in tight synchronization. A single straggler, transient network issue, or GPU failure can stall or crash entire jobs, wasting hours of GPU time and billions of dollars of infrastructure capacity at hyperscale.
The design principles that made cloud computing successful — statelessness, horizontal elasticity, location transparency, and tolerance of individual failure — are not merely insufficient for tightly coupled workloads; they are often directly counterproductive.
AI training is forcing a transition from infrastructure optimized for independent request handling to infrastructure optimized for coordinated completion of shared phases across many devices. This shift represents the emergence of a new infrastructure discipline — one built around coordinated systems rather than independent services.
In this new discipline, the communication fabric becomes a first-class system component, responsible not only for moving data, but for coordinating the behavior of the distributed system itself. AI infrastructure must prioritize predictable performance over peak performance, failure domain engineering over simple redundancy, and fabric-first networking over host-centric communication.
We propose Software-Driven AI Fabrics — programmable, vendor-neutral control layers that operate across any accelerator, any network (RoCE, InfiniBand, or Ethernet), and any cloud or on-prem environment. At its core, a Software-Driven AI Fabric functions as a closed-loop control system for distributed AI infrastructure: continuously observing the state of communication and computation, deciding how the system should adapt, and acting on the network and communication layer in real time.
Today, Software-Driven AI Fabrics provide:
- Observability to detect and resolve problems quickly
- Fault tolerance to keep AI jobs running through failures
- Performance optimization that dynamically reroutes, paces, and slices traffic flows to relieve congestion and enforce QoS
The result is dramatically higher GPU utilization, faster job completion, and better return on AI infrastructure investment through a 100% software-based solution.
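As a minimal sketch of the observe-decide-act loop described above (all names, thresholds, and telemetry fields here are hypothetical illustrations, not Clockwork's actual API), the three capabilities compose into a single control step:

```python
# Hypothetical sketch of a closed-loop fabric controller: observe
# telemetry, decide how to adapt, act on the communication layer.

def observe(telemetry):
    """Extract per-link one-way delays (ns) from synchronized telemetry."""
    return {link: t["one_way_delay_ns"] for link, t in telemetry.items()}

def decide(delays, threshold_ns=50_000):
    """Flag links whose one-way delay suggests congestion."""
    return [link for link, delay in delays.items() if delay > threshold_ns]

def act(congested_links):
    """Placeholder action: reroute traffic away from congested links."""
    return {link: "reroute" for link in congested_links}

def control_step(telemetry):
    """One iteration of the observe-decide-act loop."""
    return act(decide(observe(telemetry)))

# Example: one congested spine link, one healthy one.
actions = control_step({
    "spine1": {"one_way_delay_ns": 120_000},
    "spine2": {"one_way_delay_ns": 8_000},
})
print(actions)  # {'spine1': 'reroute'}
```

In a real fabric the decide step would weigh many signals at once and the act step would span rerouting, pacing, and failover; the point of the sketch is only the loop structure, running continuously beneath the workload.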
Our Technology Pillars
- Global Clock Sync — software-based near-nanosecond clock synchronization across every component (host, switch, NIC, etc.) in a distributed environment. This enables near-nanosecond-granular telemetry capable of accurately measuring one-way network delays and provides the foundation for real-time insights into network and workload conditions.
- Distributed State Transfer — the ability to capture distributed network and communication state, including model parameters, connection health, routing, NCCL communicators and rank progress, and transfer it, with appropriate modifications, to alternative destinations. This capability underpins network fault tolerance through path failover and enables live GPU migration, allowing AI workloads to continue running through failures.
- Dynamic Traffic Control — treats the network as a programmable system, steering traffic dynamically in response to real-time telemetry. During congestion events, it identifies imbalanced or colliding flows and shifts traffic across underutilized paths. It can pace or delay specific queue pairs at the packet level, bounding tail latencies that would otherwise stall synchronized collectives. Because it runs entirely in software, it operates across Ethernet, InfiniBand, and RoCE fabrics without requiring proprietary switch features.
Together, these capabilities create a continuously operating control loop: high-precision observation of system behavior, distributed visibility into communication state, and the ability to enact coordinated changes across the network and communication stack.
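To illustrate why Global Clock Sync matters for the telemetry in this loop: one-way delay is just the receiver's timestamp minus the sender's, but that subtraction is only meaningful when the two clocks agree. A minimal sketch (the function and its parameters are illustrative, not a Clockwork interface):

```python
# Hypothetical sketch: with synchronized clocks, one-way network delay
# is recv_time - send_time. Without synchronization, only round-trip
# time can be measured, which hides asymmetric congestion on one path.

def one_way_delay_ns(send_ts_ns, recv_ts_ns, clock_offset_ns=0):
    """One-way delay of a packet from sender and receiver timestamps.

    clock_offset_ns is how far the receiver's clock runs ahead of the
    sender's; near-nanosecond synchronization keeps this term close to
    zero, so the measurement is dominated by the true network delay.
    """
    return recv_ts_ns - send_ts_ns - clock_offset_ns

# A packet sent at t = 1_000_000 ns arrives at t = 1_042_000 ns on a
# receiver whose clock runs 2_000 ns ahead of the sender's clock:
print(one_way_delay_ns(1_000_000, 1_042_000, clock_offset_ns=2_000))  # 40000
```

If the clocks were off by even tens of microseconds, the offset term would swamp the delay itself, which is why tight software clock sync is the foundation for the per-direction, per-path measurements that the rest of the control loop consumes.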
Conclusion
These capabilities manifest in three primary application domains: fault tolerance, workload mobility, and communication optimization. The three application domains are not separate products. They are expressions of the same observe-decide-act control loop, running continuously and transparently beneath the training framework. The end state is autonomic collective communications: a system that continuously monitors its own health and performance, predicts failures before they occur, adapts topology and routing in real time, and optimizes communication parameters based on observed conditions.
This vision is grounded in infrastructure that already exists. Distributed State Tracking already provides real-time visibility from the RDMA queue-pair level up through collective operations and into the training workload. Observability-Driven Orchestration already translates those signals into coordinated, multi-step interventions beneath running jobs.
Collective communication is the nervous system of distributed AI training. Today, that nervous system is static, fragile, and blind. Clockwork is making that nervous system intelligent: able to sense what is happening across the infrastructure, reason about what should change, and act without disrupting the workloads it supports. The result is AI training that doesn’t just run fast when everything is working, but runs well all the time.
Our Investors
New Enterprise Associates
Lead Series A investor