About Us

Clockwork was founded in 2018 by a team from Stanford University. Its technology enables time-sensitive applications in areas such as financial trading, high-tech, and online gaming. Because its solutions are software-based, they can run anywhere: in on-premises data centers, public clouds, or hybrid environments. Clockwork.io takes aim at the ‘clockless architecture’ prevalent in distributed systems and networks, and seeks to redefine much of how these technologies, which underlie the cloud, are practiced today.

2018

TickTock Networks is founded, based on research incubated at Stanford University.

2021

TickTock Networks is renamed Clockwork Systems.

2022

Clockwork.io debuts publicly.

Our Vision

Accelerate AI Around the Clock With Software-Driven Fabrics

Even with trillions invested in GPUs, real clusters of thousands to 100,000+ accelerators routinely deliver only 30–55% of their theoretical performance.

At hyperscale, this inefficiency compounds into billions of dollars of stranded capacity – the AI efficiency gap. The bottleneck is no longer compute; it is communication. AI is the most distributed and demanding workload in computing, and as in all distributed systems, its efficiency depends on communication. Fragile fabrics, where a single flapping link or hot path can force a full restart, waste millions of GPU-hours. In practice, teams run into four compounding gaps that turn small hiccups into hours of lost time.

First, a visibility gap: difficulty correlating workload performance (step time, throughput, time-to-first-token) with infrastructure faults (GPU, card, cable, switch). Without this view, minor issues grow until jobs stall.

Second, a reliability gap: networks flap, GPUs reboot, storage pauses, drivers misbehave, configs drift – any can trigger checkpoint restarts.

Third, a performance gap: traffic collisions, poorly tuned collectives, or congestion stretch training and inference.

Fourth, a cost gap: today’s segregated GPU, storage, and service fabrics triple optics, ports, cabling, and power – inflating cost and stranding capacity.

Together, these gaps leave thousands of GPUs idle, stretch recovery times, and convert minor glitches into major expense and delay.

Clockwork’s Thesis

Clockwork makes AI communication reliable and performant by:

Treating communication as the new Moore’s Law — measurable, predictable, controllable in real time.
Building this as a software layer that can Run Anywhere across diverse libraries, transports, accelerators, and vendors.

Our Vision: Software-Driven Fabric

Communication is the New Moore’s Law. Run Anywhere in Software.

At scale, AI is communication-bound: synchronized collectives and storage I/O traverse thousands of links. Congestion, collisions, and transient faults turn tiny hiccups into idle GPUs and failed jobs, pushing utilization far below the silicon’s promise. Large operators report chronic restarts and slowdowns (e.g., Llama-3 with 466 restarts in 54 days; Alibaba noting ~60% of large jobs slowed). Fragile fabrics directly translate into idle silicon and blown schedules.

Clockwork’s Software-Driven Fabric (SDF) treats the fabric as a programmable software control plane. It first delivers pervasive visibility — one-way delay and per-flow insights from network through communication libraries to the training job. Then it applies closed-loop, path-aware control to steer and pace traffic in real time, ensuring faults cause slowdowns, not crashes. Net effect: higher utilization, shorter time-to-train, and economically sustainable fleets.
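As a rough illustration of what closed-loop, path-aware steering can look like, the sketch below smooths one-way delay samples per path and moves traffic only when another path is clearly better. It is a minimal sketch under stated assumptions, not Clockwork's implementation: the class names, the smoothing factor, and the 20% switching margin are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class PathStats:
    """Smoothed one-way delay for a single path (hypothetical model)."""
    ewma_us: Optional[float] = None

    def observe(self, delay_us: float, alpha: float = 0.2) -> None:
        # Exponentially weighted moving average; alpha is an assumed value.
        if self.ewma_us is None:
            self.ewma_us = delay_us
        else:
            self.ewma_us = alpha * delay_us + (1 - alpha) * self.ewma_us


class PathSelector:
    """Keeps traffic on the current path unless another is clearly faster."""

    def __init__(self, paths: List[str], switch_margin: float = 0.2) -> None:
        self.stats: Dict[str, PathStats] = {p: PathStats() for p in paths}
        self.current = paths[0]
        # Required relative improvement before switching (assumed), which
        # avoids oscillating between paths with similar delay.
        self.switch_margin = switch_margin

    def report(self, path: str, delay_us: float) -> None:
        """Feed one measured one-way delay sample (microseconds) for a path."""
        self.stats[path].observe(delay_us)

    def choose(self) -> str:
        """Return the path to use next, given the telemetry seen so far."""
        measured = {p: s.ewma_us for p, s in self.stats.items()
                    if s.ewma_us is not None}
        if not measured:
            return self.current
        best = min(measured, key=measured.get)
        cur = measured.get(self.current)
        if cur is None or measured[best] < cur * (1 - self.switch_margin):
            self.current = best
        return self.current
```

With two paths at similar delay the selector stays put; once the current path's smoothed delay rises well above the alternative, the next call to `choose()` moves traffic to the alternative.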

Run Anywhere

AI factories are inherently heterogeneous: diverse APIs (NCCL/RCCL, NVSHMEM, MPI), transports (InfiniBand, RoCE, TCP), accelerators (NVIDIA, AMD, custom), and NIC/switch vendors. Vendor-specific optimizations may help short-term but erode portability and produce fragile fabrics.

Clockwork’s Software-Driven Fabric is a neutral layer: any GPU, any NIC, any switch, any transport. It plugs into standard libraries and APIs to provide visibility, real-time control, and fault tolerance across InfiniBand and Ethernet/RoCE, without app changes. By living in software, it normalizes vendor and generational differences, ships new features quickly, and enforces consistent policy and performance.

Converge the Fabric

Today, deterministic performance forces operators to build three separate networks: GPU collectives, storage I/O, and service traffic. A software-driven fabric enables one physical network for all flows.

Clockwork’s Software-Driven Fabric classifies flows at the source, uses microsecond telemetry to steer around hot paths, paces background traffic, and admits or defers work to protect priority jobs. Convergence eliminates duplicated optics, ports, and stranded capacity, turning performance from a wiring problem into a software scheduling problem, again without application changes.
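To make the idea of classifying flows and admitting or deferring work concrete, here is a small sketch of priority-ordered admission on one shared link. It assumes the three traffic classes named above, with GPU collectives ranked first; the priority order, the `admit` function, and the bandwidth figures are illustrative assumptions rather than a description of the product.

```python
from enum import IntEnum
from typing import List, Tuple


class FlowClass(IntEnum):
    """Traffic classes from the text; the priority order is an assumption."""
    COLLECTIVE = 0  # GPU collectives, treated as highest priority here
    STORAGE = 1     # storage I/O
    SERVICE = 2     # service traffic


# A flow is (name, class, demanded bandwidth in Gb/s).
Flow = Tuple[str, FlowClass, float]


def admit(flows: List[Flow], capacity_gbps: float) -> Tuple[List[str], List[str]]:
    """Admit flows in priority order while they fit; defer the rest."""
    admitted: List[str] = []
    deferred: List[str] = []
    remaining = capacity_gbps
    # sorted() is stable, so flows of equal class keep their arrival order.
    for name, _cls, demand in sorted(flows, key=lambda f: f[1]):
        if demand <= remaining:
            admitted.append(name)
            remaining -= demand
        else:
            deferred.append(name)
    return admitted, deferred


flows = [
    ("checkpoint-write", FlowClass.STORAGE, 60.0),
    ("allreduce", FlowClass.COLLECTIVE, 70.0),
    ("metrics", FlowClass.SERVICE, 20.0),
]
admitted, deferred = admit(flows, capacity_gbps=100.0)
```

In this example the collective is admitted first, the checkpoint write is deferred because it no longer fits, and the small service flow uses part of the leftover capacity. A real scheduler would also pace deferred traffic rather than simply holding it.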

FleetIQ – Clockwork’s Software-Driven Fabric

Fast, Efficient and Fault-Tolerant AI – On Any Hardware, At Any Scale

FleetIQ delivers breakthrough capabilities that redefine network reliability and performance:

Global ClockSync: software-based, sub-microsecond synchronization across hosts, switches, NICs, and SmartNICs. Provides per-flow, real-time telemetry of provisioning and runtime efficiency. Alerts trigger at the first sign of link, node, or workload trouble.
Stateful fault tolerance: flapping NICs and link failures no longer crash jobs. Fleet Failover reroutes traffic instantly, preserving collective integrity and saving thousands of GPU-hours weekly in large clusters.
Dynamic Traffic Control (DTC): makes the network programmable. Traffic is dynamically steered around congestion, imbalanced flows are rebalanced, and packet pacing bounds tail latencies that otherwise stall collectives. Runs entirely in software across InfiniBand, Ethernet, and RoCE.

Clockwork’s Software-Driven Fabric is more than optimization: it is a new foundation for AI infrastructure. By fusing nanosecond-level visibility, dynamic traffic control, and job-aware resilience, FleetIQ converts communication from the weakest link into the strongest lever for performance. This is how trillions in GPU investment become dependable capacity, and how idle silicon becomes productive intelligence. The winners in AI will be those who master communication. Clockwork exists to make that mastery real: infrastructure that is faster, more resilient, and economically sustainable at global scale.

Our CEO’s Leadership Philosophy