
Clockwork FleetIQ Platform

Nanosecond-Accurate Visibility Correlated Across the Stack

AI at scale slows when GPU, cluster, or cloud communication falters. FleetIQ unifies nanosecond-level visibility, dynamic traffic control, and job-aware resilience in one software control plane — transforming communication into a performance lever. The result: fewer restarts, faster training, true operating capacity.

3 Dysfunctions AI Infrastructure Teams Grapple With

Dysfunctional networks hurt GPU utilization, job completion time, and overall ROI

Visibility gap
Troubleshooting unpredictable queues and slow jobs is hampered by poor visibility into network misconfigurations, link flaps, and congestion
Resiliency gap
Network links routinely fail or degrade, and a single link flap in a large cluster can cause job restarts, wasting thousands of GPU hours
Performance gap
Network congestion and contention result in too much time spent on data ingestion and exchange instead of compute, slowing AI jobs

AI Fabrics Are Different

LLM Training Patterns vs. Traditional Cloud Computing

Figure: NIC egress traffic during LLM training vs. traditional cloud computing traffic patterns

Source: Alibaba HPN: A Data Center Network for Large Language Model Training, ACM SIGCOMM '24, August 4-8, 2024, Sydney, NSW, Australia

  • Separate back-end and front-end networks

  • Highly demanding back-end network:
    Lossless | Very high-bandwidth | Low latency and jitter | In-order delivery

  • Frequent network failures due to optical port density, overheating, dust, etc.

A Fabric Built for AI at Scale

Clockwork Frees AI from Communication Constraints

Diagram: Clockwork FleetIQ combines Global Clock Sync and Dynamic Traffic Control across compute clusters, storage, and AI factories, delivering stateful fault-tolerance, efficient performance, and cross-stack visibility.

Clockwork FleetIQ transforms AI infrastructure by unifying nanosecond visibility, job-aware resilience, and dynamic traffic control into one software-driven fabric. Unlike static, vendor-bound networks, FleetIQ runs anywhere—across GPUs, NICs, switches, and transports—normalizing performance and accelerating training without application changes.

Any Accelerator, Any Network: RoCE to InfiniBand
Accelerate Time-to-First-Token at Any Scale

From Precision to Control: End-to-End Fabric Intelligence

Transform AI fabrics into resilient, high-performance networks

Clockwork FleetIQ Platform Foundation: Global ClockSync

Sub-microsecond accurate visibility

Global ClockSync aligns every host, NIC, and switch to a shared sub-microsecond timeline. This unified clock enables precise telemetry and real-time correlation across jobs, GPUs, and networks—turning invisible slowdowns into observable, actionable data.
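A minimal sketch of what a shared timeline makes possible, using hypothetical event names rather than Clockwork's actual API: once sender and receiver clocks agree to well under the network latency, one-way NIC-to-NIC delay is simply the difference of two timestamps, and events from different hosts can be merged onto a single time axis.

```python
# Minimal sketch of clock-synchronized telemetry (hypothetical, not Clockwork's API).
# With every host on a shared timeline, one-way delay is rx_time - tx_time,
# and per-host event streams can be merged into one fleet-wide ordering.
from dataclasses import dataclass

@dataclass
class TelemetryEvent:
    host: str           # host that recorded the event
    timestamp_ns: int    # timestamp on the shared (synchronized) timeline
    kind: str            # e.g. "chunk_tx", "chunk_rx", "qp_stall"
    detail: str = ""

def one_way_delay_ns(tx: TelemetryEvent, rx: TelemetryEvent) -> int:
    """One-way NIC-to-NIC delay; only meaningful because both clocks share a timeline."""
    return rx.timestamp_ns - tx.timestamp_ns

def unified_timeline(events: list[TelemetryEvent]) -> list[TelemetryEvent]:
    """Merge per-host event streams into a single time-ordered, fleet-wide view."""
    return sorted(events, key=lambda e: e.timestamp_ns)

# Example: attribute a slow collective chunk to the path that delayed it.
tx = TelemetryEvent("host-a", 1_000_000_000, "chunk_tx", "qp=17 chunk=42")
rx = TelemetryEvent("host-b", 1_000_184_000, "chunk_rx", "qp=17 chunk=42")
print(one_way_delay_ns(tx, rx), "ns")   # 184_000 ns for this chunk
```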

Diagram: Global ClockSync spans front-end and back-end networking, measuring QPair, NIC-to-NIC, chunk, and message latency across NCCL/RCCL queue pairs, CPUs, GPUs, and Ethernet NICs.

Delivers Nanosecond Telemetry, Unified Time Sync, and Precise Root-Cause Attribution

Clockwork’s NCCL Plugin
Provides Granular Fleet & AI Job Visibility

Supports RoCE and InfiniBand

Dynamic Traffic Control (DTC) actively steers flows to avoid collisions and incast collapse. By pacing queue pairs and shifting traffic across underutilized paths, DTC bounds tail latencies and keeps synchronized collectives moving forward.
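As a rough mental model (our simplification, not Clockwork's algorithm), dynamic traffic control can be pictured as two per-flow decisions: steer the flow onto the least-loaded equivalent path, and pace its queue pairs so that synchronized senders do not burst into the same switch buffer at once.

```python
# Conceptual sketch of dynamic traffic control (our simplification, not Clockwork's algorithm):
# 1) steer each flow onto the currently least-loaded equal-cost path,
# 2) pace queue pairs to a per-path budget so synchronized collectives avoid incast collapse.

def pick_path(path_load_gbps: dict[str, float]) -> str:
    """Steer the next flow to the least-utilized path."""
    return min(path_load_gbps, key=path_load_gbps.get)

def pace_rates(qp_demand_gbps: dict[str, float], path_capacity_gbps: float) -> dict[str, float]:
    """Scale per-queue-pair rates down proportionally if total demand exceeds path capacity."""
    total = sum(qp_demand_gbps.values())
    if total <= path_capacity_gbps:
        return dict(qp_demand_gbps)
    scale = path_capacity_gbps / total
    return {qp: rate * scale for qp, rate in qp_demand_gbps.items()}

# Example: three queue pairs demanding 60 Gbps each on a 100 Gbps path are paced to ~33 Gbps,
# while the next flow is steered to whichever path currently carries the least traffic.
print(pick_path({"spine-1": 72.0, "spine-2": 18.0, "spine-3": 55.0}))   # spine-2
print(pace_rates({"qp-1": 60.0, "qp-2": 60.0, "qp-3": 60.0}, 100.0))
```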

Diagram: Dynamic Traffic Control works alongside Global ClockSync, applying packet flow control to NCCL/RCCL queue pairs across back-end NICs using QPair and NIC-to-NIC latency measurements.

Delivers Network Auto-failover, Congestion Control and Load Balancing

Addressing the Visibility Gap: Clockwork Fleet Audit,
Fleet Monitoring, Workload Monitoring

From clean starts to continuous uptime: end-to-end AI fleet assurance

Provisioning: provision nodes, network, storage, firmware, and base schedule to deploy a reproducible, known-good baseline fleet for running AI workloads.

Operations: observe, detect, troubleshoot, fix, and optimize infrastructure to keep the fleet healthy, performant, and cost-effective while AI jobs run.

Every host, NIC, and switch is aligned to a single sub-microsecond timeline. This unified clock underpins provisioning and operations, enabling precise telemetry, real-time visibility, and job-aware control across the fleet — the base layer for reliable, efficient AI at scale.

Fleet Audit

  • Software checks
  • Node checks
  • Front-end network
  • Back-end GPU network validation

Fleet Monitoring

  • Runtime link failures/flaps
  • Runtime fabric topology
  • Runtime fabric performance
  • Congestion and contention monitoring

Workload Monitoring

  • Deep workload visibility
  • Correlation of data path performance with network metrics to identify the root cause of job slowdowns (see the sketch below)
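To make the correlation idea concrete, here is a hedged sketch with hypothetical data and link names: per-iteration step times are compared against per-link congestion samples taken on the same synchronized timeline, and the link whose congestion best tracks the slowdown is flagged as the likely root cause.

```python
# Hedged sketch of root-cause correlation (hypothetical data, not Clockwork's implementation):
# compare per-iteration step times against per-link congestion sampled on the same
# synchronized timeline, and flag the link that best tracks the slowdown.
from statistics import correlation   # Pearson correlation, Python 3.10+

step_time_ms = [112, 110, 145, 160, 111, 158, 113]          # training step durations
link_congestion = {
    "leaf1-spine2": [0.2, 0.1, 0.8, 0.9, 0.2, 0.9, 0.1],    # queue occupancy samples
    "leaf3-spine1": [0.4, 0.5, 0.4, 0.5, 0.5, 0.4, 0.5],
}

suspect = max(link_congestion, key=lambda l: correlation(step_time_ms, link_congestion[l]))
print("likely bottleneck link:", suspect)                    # leaf1-spine2
```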

ClockSync Foundation

  • Sub-microsecond time alignment across every host, NIC, and switch
  • Shared timeline underpinning audit, monitoring, and workload telemetry

Disruptive Network Failures and Link Flaps
Are Common and Expensive

Failures Happen Frequently—Even in Brand New Clusters

One of the most common problems encountered is InfiniBand/RoCE link failure. Even if each NIC-to-leaf-switch link had a mean time to failure of 5 years, the sheer number of transceivers means the first job failure would be expected after only 26.28 minutes.

Time to first job failure in a brand-new cluster: 26.28 minutes
“Achieving high utilization with them (GPUs) is difficult due to the high failure rate of various components, especially networking.”
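The 26.28-minute figure follows from simple failure arithmetic. A minimal sketch, assuming independent link failures and roughly 100,000 NIC-to-leaf optical links (a count implied by the math rather than stated above): the expected time to the first failure anywhere in the fabric is the per-link MTTF divided by the number of links.

```python
# Back-of-envelope reproduction of the 26.28-minute figure (assumes ~100,000 independent
# NIC-to-leaf links, a number implied by the arithmetic rather than stated above).
mttf_per_link_minutes = 5 * 365 * 24 * 60   # 5-year mean time to failure per link
num_links = 100_000                          # assumed optical link count in the cluster

# With independent (exponential) failures, the expected time to the first failure
# anywhere in the fabric is MTTF / N.
time_to_first_failure = mttf_per_link_minutes / num_links
print(f"{time_to_first_failure:.2f} minutes")   # 26.28 minutes
```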

Link/NIC Flapping: Before and After Clockwork

Without Clockwork, a NIC failure halts AI jobs entirely. With Clockwork, jobs continue at reduced throughput during a failure and quickly return to full capacity, ensuring robust resilience and uninterrupted performance.
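A purely illustrative model of that failover behavior (our assumption about the mechanics, not Clockwork's code): when a NIC fails, its queue pairs are remapped onto the surviving NICs so the job keeps progressing at reduced aggregate bandwidth, and full capacity returns once the link recovers.

```python
# Illustrative failover model (our simplification): remap queue pairs off a failed NIC
# so the job keeps running at reduced bandwidth instead of restarting.
def remap_queue_pairs(nic_to_qps: dict[str, list[str]], failed_nic: str) -> dict[str, list[str]]:
    """Spread the failed NIC's queue pairs round-robin across the healthy NICs."""
    healthy = {nic: qps[:] for nic, qps in nic_to_qps.items() if nic != failed_nic}
    survivors = list(healthy)
    for i, qp in enumerate(nic_to_qps.get(failed_nic, [])):
        healthy[survivors[i % len(survivors)]].append(qp)
    return healthy

placement = {"nic0": ["qp1", "qp2"], "nic1": ["qp3", "qp4"], "nic2": ["qp5", "qp6"]}
degraded = remap_queue_pairs(placement, "nic1")   # job continues on nic0/nic2 at ~2/3 bandwidth
print(degraded)                                   # nic0 picks up qp3, nic2 picks up qp4
```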


Workload Acceleration

Proven Throughput Gains Across Real-World AI Workloads

Hyperscaler with Clockwork vs Dynamic Load Balancing (DLB)

2 all-to-all jobs

The hyperscaler with Clockwork enabled has 33% more outbound throughput vs. DLB

Large Social Media Company with Clockwork vs. ECMP

2 all-to-all jobs

The large social media company with Clockwork enabled has 29% more throughput vs. ECMP

Learn More

Stop wasting GPU cycles. Start scaling smarter.
Clusters must deliver high uptime while running at maximum efficiency.

Turn your GPU clusters into a competitive advantage—not a cost center.