8 Things You Didn't Know About CUBIC's Hidden QUIC Bug

Posted by u/Tiobasil · 2026-05-17 21:52:46

When a congestion controller misbehaves, connections slow down, sometimes for the rest of their lifetime. This is the story of a subtle bug in CUBIC, the default congestion controller in Linux, that left the congestion window (cwnd) stuck at its minimum after a severe loss event. The bug originated in a kernel optimization intended to align CUBIC with RFC 9438; once ported to Cloudflare's QUIC implementation (quiche), it caused a cascade of failures. Here are eight key insights from the debugging effort.

1. CUBIC Is Everywhere—Even in QUIC

Standardized in RFC 9438, CUBIC governs how most TCP and QUIC connections on the public internet probe bandwidth, detect loss, and recover. At Cloudflare, our open-source QUIC library, quiche, uses CUBIC as its default congestion controller. This means the same algorithm that manages data flow for billions of web requests also sits in the critical path for QUIC traffic. Any flaw in CUBIC's logic can ripple across a significant share of Cloudflare's served connections.

Source: blog.cloudflare.com

2. The Symptom: A Test That Fails 61% of the Time

Our investigation began with erratic failures in integration tests for quiche's ingress proxy. These tests simulated heavy packet loss early in a connection, a scenario that exercised CUBIC's recovery after a congestion collapse. The failure rate—61%—was too high to ignore. Most congestion control tests focus on steady-state growth, but rare behaviors at minimum cwnd can hide devastating bugs. This one was hiding in plain sight.

3. The Core Logic: Loss-Based Bandwidth Probing

Every loss-based congestion controller, including CUBIC, operates on a simple premise: increase the congestion window (cwnd) while there is no packet loss, and decrease it when loss occurs. The goal is to maximize throughput by inferring available network capacity. After a loss event, CUBIC grows cwnd along a cubic curve: quickly at first, flattening as it approaches the window size at which the last loss occurred, then accelerating again to probe for additional capacity. This design works well in steady state, but corners such as the minimum-cwnd region are rarely tested.
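The shape of that growth curve can be sketched from the window function in RFC 9438, W_cubic(t) = C * (t - K)^3 + W_max, with the RFC's constant C = 0.4. This is a minimal illustration, not quiche's or the kernel's code: the function names are made up for the example, windows are counted in segments, and the values 100 and 70 are arbitrary.

```python
C = 0.4  # RFC 9438 scaling constant for the cubic curve


def cubic_k(w_max: float, cwnd_epoch: float) -> float:
    """Seconds for the curve to climb from cwnd_epoch back up to w_max."""
    return (max(w_max - cwnd_epoch, 0.0) / C) ** (1.0 / 3.0)


def w_cubic(t: float, w_max: float, cwnd_epoch: float) -> float:
    """Target window, in segments, t seconds into the current epoch."""
    return C * (t - cubic_k(w_max, cwnd_epoch)) ** 3 + w_max


if __name__ == "__main__":
    # Illustrative values: the last loss happened at 100 segments and the
    # new epoch starts at 70 segments.
    w_max, start = 100.0, 70.0
    for t in (0.0, 2.0, cubic_k(w_max, start), 6.0, 8.0):
        print(f"t={t:5.2f}s  target={w_cubic(t, w_max, start):7.2f}")
```

The curve starts at the post-loss window, reaches W_max exactly at t = K, and only grows beyond it afterwards, which is the flatten-then-probe behavior described above.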

4. The Bug: Cwnd Permanently Pinned at Minimum

In the failing scenario, after a severe loss event, CUBIC reduced cwnd to its minimum value (traditionally 2 packets). Normally, the algorithm should gradually increase cwnd again as acknowledgments arrive. But here, the cwnd never recovered—it stayed at the minimum for the entire connection. This effectively throttled the sender to a trickle, causing timeouts and throughput collapse. The bug appeared only when the connection was app-limited (the sender had no application data to send) immediately after the loss.
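To get a feel for how quickly a window reaches that floor, here is a small sketch of multiplicative decrease with a minimum. It assumes the RFC 9438 decrease factor of 0.7, the 2-packet floor mentioned above, and a starting window of 10 packets (an assumption for the example, not a value from the investigation). The function names are hypothetical.

```python
BETA_CUBIC = 0.7  # multiplicative decrease factor from RFC 9438
MIN_CWND = 2.0    # floor in packets, the traditional value mentioned above


def on_congestion_event(cwnd: float) -> float:
    """Shrink the window on a loss event, but never below the floor."""
    return max(cwnd * BETA_CUBIC, MIN_CWND)


def events_until_floor(cwnd: float) -> int:
    """Count back-to-back congestion events needed to hit the floor."""
    events = 0
    while cwnd > MIN_CWND:
        cwnd = on_congestion_event(cwnd)
        events += 1
    return events


if __name__ == "__main__":
    print(events_until_floor(10.0))
```

With these assumptions, five consecutive congestion events are enough to take a 10-packet window down to the floor. QUIC can get there faster still: RFC 9002 resets the window to its minimum in a single step when persistent congestion is declared.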

5. Root Cause: The App-Limited Exclusion Rule

The bug was traced to a Linux kernel change that implemented the app-limited exclusion described in RFC 9438 Section 4.2-12. The rule states that when a sender is app-limited (has no data to fill the cwnd), it should not count acknowledgments toward increasing cwnd. This prevents inflated windows caused by idle periods. The kernel change correctly fixed a TCP issue, but when ported to QUIC's different ACK handling, it broke the recovery logic. In certain sequences, CUBIC's increase algorithm would never be triggered.
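One way to see why idle periods inflate the window is to feed idle time into the cubic window function from RFC 9438. The sketch below compares the target window when an idle period is counted as elapsed time against the target when it is excluded. It is an illustration under assumed numbers (a loss at 100 segments, a restart at 70, one second of sending followed by ten seconds of idling); `target_after_idle` is a hypothetical helper, not a kernel or quiche function.

```python
C = 0.4  # RFC 9438 scaling constant


def w_cubic(t: float, w_max: float, cwnd_epoch: float) -> float:
    """Cubic target window, in segments, t seconds into the epoch."""
    k = (max(w_max - cwnd_epoch, 0.0) / C) ** (1.0 / 3.0)
    return C * (t - k) ** 3 + w_max


def target_after_idle(active_s: float, idle_s: float, w_max: float,
                      cwnd_epoch: float, exclude_idle: bool) -> float:
    """Target window after sending for active_s seconds, then idling."""
    # Excluding idle time means the curve only advances while data flows.
    t = active_s if exclude_idle else active_s + idle_s
    return w_cubic(t, w_max, cwnd_epoch)


if __name__ == "__main__":
    naive = target_after_idle(1.0, 10.0, 100.0, 70.0, exclude_idle=False)
    careful = target_after_idle(1.0, 10.0, 100.0, 70.0, exclude_idle=True)
    print(f"idle counted:  {naive:.1f} segments")
    print(f"idle excluded: {careful:.1f} segments")
```

Counting the idle period pushes the target to roughly 225 segments, far above the window at which loss was last seen, while excluding it keeps the target at roughly 87 segments. That gap is the inflation the exclusion rule is meant to prevent.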

6. Porting from TCP to QUIC Opened a New Vulnerability

TCP and QUIC handle acknowledgments differently: QUIC reports received packets in ACK frames that carry one or more ranges of packet numbers, while TCP relies on a cumulative acknowledgment number with optional selective acknowledgment (SACK) blocks. The kernel's optimization assumed TCP's behavior, and in quiche's ACK processing the app-limited state was detected too aggressively. After a congestion collapse, the sender often appeared app-limited because the cwnd was so small that a single acknowledgment could cover all outstanding bytes, leaving nothing in flight at that moment. This triggered the exclusion rule and froze cwnd growth.
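A toy model makes it clearer why only very small windows are affected. It assumes that one ACK acknowledges up to two packets and that the sender looks idle whenever an ACK leaves zero packets in flight. Both are simplifications chosen for the sketch, not a description of quiche's actual detection logic, and `drained_ack_fraction` is a hypothetical name.

```python
def drained_ack_fraction(cwnd: int, ack_every: int = 2) -> float:
    """Fraction of ACKs in one round trip that find the pipe empty.

    Toy assumptions: the sender starts with cwnd packets in flight, each
    ACK covers up to ack_every packets, and the sender refills the window
    only after the ACK has been processed.
    """
    in_flight = cwnd
    remaining = cwnd  # packets of this round still waiting for an ACK
    acks = 0
    drained = 0
    while remaining > 0:
        acked = min(ack_every, remaining)
        remaining -= acked
        in_flight -= acked
        acks += 1
        if in_flight == 0:
            drained += 1  # this ACK makes the sender look idle
        in_flight += acked  # ACK clocking: send replacements afterwards
    return drained / acks


if __name__ == "__main__":
    for cwnd in (2, 4, 10):
        print(cwnd, drained_ack_fraction(cwnd))
```

In this model a 2-packet window is drained by every ACK, while a 10-packet window is never drained, so a check that fires on an empty pipe hits precisely the connections that most need to grow.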

7. The Elegant (Near-)One-Line Fix

After extensive analysis, the Cloudflare team discovered a simple workaround: modify the condition that determines whether the connection is app-limited. By ensuring that the app-limited flag is cleared after a loss event, even if the sender has no new data, CUBIC could resume normal growth as acknowledgments arrived. The fix was nearly a one-line change in quiche's CUBIC implementation—a testament to how a small oversight can cause disproportionate impact.
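The effect of such a change can be illustrated with a small simulation. This is one possible reading of the description above, not the actual quiche patch: it assumes a drained pipe is what marks the sender app-limited, that one ACK covers up to two packets, and that growth is one packet per round trip instead of the real cubic curve. The `clear_after_loss` switch and all names are hypothetical.

```python
from typing import List

MIN_CWND = 2   # packets, the floor after the collapse
ACK_EVERY = 2  # assumed: one ACK acknowledges up to two packets


def simulate(rounds: int, clear_after_loss: bool) -> List[int]:
    """Return cwnd after each round trip, starting right after a collapse."""
    cwnd = MIN_CWND
    trace = []
    for _ in range(rounds):
        # A single ACK empties the pipe whenever the window is this small.
        pipe_drained = cwnd <= ACK_EVERY
        # Hypothetical fix: after a loss, a drained pipe alone no longer
        # marks the connection as app-limited.
        app_limited = pipe_drained and not clear_after_loss
        if not app_limited:
            cwnd += 1  # simplified growth, standing in for the cubic curve
        trace.append(cwnd)
    return trace


if __name__ == "__main__":
    print("without fix:", simulate(5, clear_after_loss=False))
    print("with fix:   ", simulate(5, clear_after_loss=True))
```

Under these assumptions the window stays at 2 packets in every round without the change and climbs by one packet per round with it. Once the window exceeds what a single ACK can cover, the pipe stops draining and the exclusion rule no longer interferes.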

8. Lessons Learned for Congestion Control Testing

This bug highlights the importance of testing congestion controllers under extreme and rare conditions—especially the minimum cwnd regime after a collapse. Most benchmarks focus on throughput and fairness in steady-state, but real-world networks suffer from transient losses that push algorithms into corners. Developers and researchers should include scenarios that force the sender into app-limited states immediately after loss, and test across different transport protocols (TCP vs. QUIC) to uncover porting discrepancies.
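A regression check for this class of bug can be very small. The sketch below assumes the test harness records one cwnd sample per round trip; the helper name, the trace format, and the `grace` parameter are all hypothetical and would need adapting to a real test suite.

```python
from typing import Sequence


def stuck_at_minimum(trace: Sequence[int], min_cwnd: int, grace: int) -> bool:
    """True if cwnd never rises above min_cwnd after the first grace samples."""
    tail = trace[grace:]
    return len(tail) > 0 and all(c <= min_cwnd for c in tail)


if __name__ == "__main__":
    pinned = [10, 7, 2, 2, 2, 2]     # collapses and never recovers
    recovered = [10, 7, 2, 3, 4, 6]  # collapses, then grows again
    print(stuck_at_minimum(pinned, min_cwnd=2, grace=2))
    print(stuck_at_minimum(recovered, min_cwnd=2, grace=2))
```

A test that injects heavy early loss, keeps the sender briefly app-limited, and then asserts that this helper returns False would have flagged the pinned window directly instead of surfacing it as an intermittent timeout.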

In the end, a textbook kernel optimization revealed hidden assumptions about ACK processing and app-limited behavior. The fix not only restored CUBIC's reliability in QUIC but also deepened our understanding of how congestion control algorithms interact with modern transport protocols. For anyone working on network stacks, this story is a reminder that even the most battle-tested code can have unexpected bugs—and that testing the edge cases matters more than ever.