MRC: A Step Forward for GPU Networking
On May 5th, OpenAI announced MRC (Multipath Reliable Connection), a new open RDMA transport co-developed with AMD, Broadcom, Intel, Microsoft, and NVIDIA, and released through the Open Compute Project. It’s a meaningful contribution to GPU networking, and we’re glad to see it land in the open. Here’s a quick read on what MRC is, why it matters, and how we think about it at Clockwork.
What MRC Brings
MRC extends RDMA over Converged Ethernet (RoCE) and adds SRv6-based source routing. It centers on four building blocks:
- Multipath packet spraying – Packets from a single Queue Pair (QP) are sprayed across many paths and across all planes of a multi-plane network. Receiving NICs handle the resulting out-of-order delivery through direct memory placement (a simplified sketch follows this list).
- SRv6 source routing – Each packet encodes its end-to-end path as compact micro-segment IDs in the IPv6 destination address, letting switches do simple static lookups (see the encoding sketch below).
- Multi-plane topology – NICs are split into multiple ports, each of which can connect to multiple parallel network planes, enabling larger GPU clusters without adding network tiers.
- Reliability and congestion control – NICs can detect failures and reroute across the parallel network planes. Selective acknowledgements and packet trimming (where switches support it) enable fast retransmission to recover from packet loss.
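To make the spraying and reordering idea concrete, here is a minimal, illustrative Python sketch – not MRC's actual wire format or API, and every name in it (Packet, spray, place, MTU) is an assumption. It shows a sender round-robining one QP's packets across all (plane, path) combinations and a receiver using the packet sequence number to write each payload directly into its buffer offset, so arrival order doesn't matter:

```python
# Illustrative sketch of multipath packet spraying with direct memory
# placement. Names (Packet, spray, place) are hypothetical, not MRC APIs.
from dataclasses import dataclass
from itertools import cycle

MTU = 4096  # assumed payload size per packet

@dataclass
class Packet:
    qp: int        # queue pair the packet belongs to
    psn: int       # packet sequence number within the QP
    plane: int     # which parallel network plane carries it
    path: int      # which path within that plane
    payload: bytes

def spray(qp: int, message: bytes, planes: int, paths_per_plane: int):
    """Split a message into packets and round-robin them across all
    (plane, path) combinations -- one QP's traffic uses every plane."""
    routes = cycle((pl, pa) for pl in range(planes)
                   for pa in range(paths_per_plane))
    for psn, off in enumerate(range(0, len(message), MTU)):
        plane, path = next(routes)
        yield Packet(qp, psn, plane, path, message[off:off + MTU])

def place(buf: bytearray, pkt: Packet):
    """Receiver side: the PSN determines the buffer offset, so packets can
    be written straight into memory in any arrival order (no reorder queue)."""
    buf[pkt.psn * MTU : pkt.psn * MTU + len(pkt.payload)] = pkt.payload

# Usage: spray a message over 2 planes x 4 paths, then deliver out of order.
msg = bytes(range(256)) * 64
pkts = list(spray(qp=7, message=msg, planes=2, paths_per_plane=4))
recv = bytearray(len(msg))
for pkt in reversed(pkts):  # worst-case reordering
    place(recv, pkt)
assert bytes(recv) == msg
```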
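The source-routing building block can be sketched the same way. The field sizes below are assumptions, not MRC's published layout: we pack 16-bit micro-segment IDs after a 32-bit block prefix into the 128-bit IPv6 destination address, in the style of SRv6 uSID, so each switch matches its own ID with a single static lookup and shifts the remainder to expose the next hop:

```python
# Hypothetical SRv6-uSID-style encoding: a 32-bit block followed by up to
# six 16-bit micro-segment IDs packed into the 128-bit destination address.
# Field sizes are illustrative assumptions, not MRC's specified format.
import ipaddress

BLOCK_BITS, USID_BITS, ADDR_BITS = 32, 16, 128
MAX_USIDS = (ADDR_BITS - BLOCK_BITS) // USID_BITS  # 6

def encode_path(block: int, usids: list[int]) -> ipaddress.IPv6Address:
    """Pack the per-hop micro-segment IDs, left-aligned after the block."""
    assert len(usids) <= MAX_USIDS
    addr = block << (ADDR_BITS - BLOCK_BITS)
    for i, usid in enumerate(usids):
        addr |= usid << (ADDR_BITS - BLOCK_BITS - (i + 1) * USID_BITS)
    return ipaddress.IPv6Address(addr)

def next_hop(addr: ipaddress.IPv6Address) -> tuple[int, ipaddress.IPv6Address]:
    """A switch reads the first uSID (one static lookup), then shifts the
    remaining IDs left so the next switch sees its own ID in front."""
    a = int(addr)
    usid = (a >> (ADDR_BITS - BLOCK_BITS - USID_BITS)) & ((1 << USID_BITS) - 1)
    block = a >> (ADDR_BITS - BLOCK_BITS)
    rest = (a << USID_BITS) & ((1 << (ADDR_BITS - BLOCK_BITS)) - 1)
    return usid, ipaddress.IPv6Address((block << (ADDR_BITS - BLOCK_BITS)) | rest)

# Usage: a 3-hop path through switches 0x0100, 0x0200, 0x0300.
dst = encode_path(0xFC000000, [0x0100, 0x0200, 0x0300])
hop, dst = next_hop(dst)   # the first switch pops its own ID
assert hop == 0x0100
```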
Many of these ideas are shared with UEC (Ultra Ethernet Consortium), though there are also significant differences – most notably, MRC retains QPs as the primary connection model, while UEC moves away from QPs for connection management.
Adopting MRC and realizing its full value requires a coordinated set of changes across hardware and software: MRC-capable NICs and switches, multi-plane fabric topologies, and updated control-plane logic and collective communication libraries (NCCL and equivalents) that interoperate cleanly with QP-level entropy-value (EV) management, packet reordering, and new path-failure signaling.
What MRC Tells Us About the Industry
MRC, UEC/UET, and the broader push toward adaptive routing all point to the same reality: large GPU fabrics have become dynamic failure domains. Congestion, path imbalance, packet loss, link flaps, and switch- and plane-level events translate directly into stalled GPUs and slower training runs.
This is the same operating reality that led Clockwork to invest early in Fleet Monitoring, Workload Monitoring, and Workload Fault Tolerance. At GPU-cluster scale, the fabric cannot be treated as a static underlay. It has to be continuously sensed, correlated with workload impact, and acted upon before transient degradation becomes lost GPU time.
Our products are, by design, transport-agnostic and work equally well across RoCEv2, InfiniBand, UEC, and MRC. Most operators will run heterogeneous estates even as they adopt MRC, UEC, and other new technologies, and Clockwork gives them one operational view across all those fabrics.
Our solutions and new transports like MRC are complementary layers:
- MRC and UEC/UET improve packet delivery, congestion signaling, and path selection within their own ecosystems.
- Clockwork sits above any single transport, running probes from end-hosts and, where available, switches, to give operators a complete view of the fabric experience – host-to-host, switch-to-switch, path-level, link-level, and workload-correlated. We tie those fabric signals back to specific jobs, ranks, collectives, QPs, and GPU utilization, and we close the loop with control actions like link resilience, workload fault tolerance, and live GPU migration (a toy version of that correlation step is sketched below).
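As a toy illustration of that correlation step – every name, data structure, and threshold here is hypothetical, not Clockwork's actual product API – the core idea is a join between fabric-level probe results and workload-level QP placement:

```python
# Toy illustration of correlating fabric probe signals with workloads.
# All names and thresholds are hypothetical, not Clockwork's product API.
from dataclasses import dataclass

@dataclass
class ProbeResult:
    path: tuple[str, str]   # (src_host, dst_host) measured by the probe
    latency_us: float       # observed one-way latency
    loss_pct: float         # observed loss on that path

@dataclass
class QPPlacement:
    job: str                      # training job the QP belongs to
    rank_pair: tuple[int, int]    # communicating ranks
    path: tuple[str, str]         # hosts the QP's traffic traverses

def degraded_jobs(probes: list[ProbeResult], qps: list[QPPlacement],
                  max_latency_us: float = 50.0, max_loss_pct: float = 0.1):
    """Join degraded paths against QP placement to find impacted jobs."""
    bad_paths = {p.path for p in probes
                 if p.latency_us > max_latency_us or p.loss_pct > max_loss_pct}
    return sorted({qp.job for qp in qps if qp.path in bad_paths})

# Usage: one degraded host pair implicates the job whose QPs cross it.
probes = [ProbeResult(("h1", "h2"), 230.0, 0.0),
          ProbeResult(("h1", "h3"), 12.0, 0.0)]
qps = [QPPlacement("llm-train-42", (0, 1), ("h1", "h2"))]
print(degraded_jobs(probes, qps))  # ['llm-train-42']
```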
Put simply: MRC strengthens the fabric. Clockwork helps operators see and act on what’s happening across the entire estate.