USB Host/Device/DRD Controller (USB 2.0/3.x/4)

This page turns a USB Host/Device/DRD controller into an executable mental model: roles and state machines, xHCI-style rings/DMA, scheduling and recovery ladders—so throughput, latency, and robustness can be budgeted, instrumented, and verified with measurable pass criteria. It focuses strictly on controller/firmware/driver logic (not PHY/Type-C/SI), providing concrete checklists and troubleshooting steps to stop retry storms and keep enumeration and data paths stable in real products.

H2-1. What This Page Covers: USB Controller Scope & Mental Model

A USB controller is the traffic engine: it turns transfer intent into scheduled transactions, moves data via DMA, and enforces Host/Device/DRD role ownership—without mixing PHY, Type-C policy, retimers, or connector protection scope.

Card A · Controller = Transactions + Scheduling + DMA + Role Ownership
  • Transactions: endpoint contexts, control transfers, and completion events define “what happened” (not just throughput).
  • Scheduling: microframe/slot decisions decide latency, jitter, and service guarantees for interrupt/iso endpoints.
  • DMA: rings/descriptors translate transfers into memory movement; cache/IOMMU rules decide correctness under load.
  • Role ownership: Host vs Device vs DRD determines who owns SOF, port reset, address assignment, and recovery sequencing.
Pass criteria (definition-level)
For any observed issue, it must be classifiable into (1) transaction, (2) scheduling, (3) DMA/memory, or (4) role ownership within X minutes using controller-visible signals and counters.
Card B · Common Controller Forms (SoC / Bridge / FPGA / PCIe Card)
SoC-integrated controller
Typical risks: power/clock domains, cache coherency assumptions, interrupt storms. First check: role state + ring health + cache policy alignment.
Bridge / docking controller
Typical risks: fixed firmware constraints, class limitations, descriptor/quirk handling. First check: enumeration pipeline + endpoint map vs class needs.
FPGA / programmable controller
Typical risks: verification coverage, queue depth sizing, deterministic scheduling under contention. First check: scheduler invariants + completion ordering.
PCIe add-in card controller
Typical risks: MSI-X routing, IOMMU mapping, latency jitter from CPU affinity. First check: interrupt moderation + ring batching + DMA mapping correctness.
Pass criteria (form-factor fit)
The controller form must be identifiable and mapped to its primary failure surface (domains vs firmware constraints vs verification vs PCIe/IOMMU) before deep debugging starts.
Card C · Scope Contract (Hard Boundary to Prevent Topic Overlap)
In-scope (covered here)
Role FSM (Host/Device/DRD), endpoint/transfer model, scheduling & service guarantees, DMA rings/descriptors, enumeration robustness, error handling & recovery ladder, and controller-side observability (counters/trace/hooks).
Out-of-scope (route to sibling pages)
  • USB PHY electrical / SSC / eye: see “USB PHY (2.0 / 3.x / 4)”.
  • Type-C policy / PD / Alt-Mode details: see “Type-C Orientation & Signal MUX”.
  • Redriver/Retimer EQ and channel reach: see “USB Redriver / Retimer”.
  • Hub downstream port management: see “USB Hub Controller”.
  • Connector protection (TVS/load switch) parts & layout: see “USB Port ESD/TVS & Load Switch”.
Pass criteria (scope control)
Any statement that depends on electrical behavior (SI/eye/SSC), Type-C policy, or connector protection layout must be replaced by a one-line pointer to the correct sibling page.
Diagram · USB System Block Map (Controller in-scope boundary)
[Diagram: Host CPU/SoC with OS/driver and memory (buffers, cache, IOMMU) feeds the USB controller (in-scope: Role FSM for Host/Device/DRD, microframe scheduler, DMA rings with TRBs/descriptors, doorbell/events, EP contexts, transfer types); the USB PHY, Type-C, retimer, ESD/TVS, and hub blocks toward the connector are out-of-scope sibling pages.] Focus here: role → schedule → DMA → observability, then route electrical topics to siblings.
Diagram intent: put the controller at the center, label PHY/Type-C/retimers/protection as separate scopes, and keep this page strictly on controller logic and ownership.

H2-2. Roles & Port Ownership: Host vs Device vs DRD (OTG/Role-Swap)

DRD stability comes from explicit ownership: every critical resource must have a clear owner in each role, and role swap must follow a quiesce → drain → commit → health-probe sequence to prevent “alive device, dead link” failures.

Card A · Ownership Map (Who Owns What)
The list below works as an ownership table: each row names the resource, its expected owner, and the first failure symptom when ownership is ambiguous.
SOF / timing master
Owner: Host. Symptom if wrong: periodic endpoints drift, isochronous jitter spikes, or transfers stall without clear errors.
Port reset / address assignment
Owner: Host. Symptom if wrong: enumeration loops, “new device every plug”, or configuration never commits.
Endpoint contexts (EP0 + data EPs)
Owner: Role-dependent (Host manages schedules; Device enforces endpoint behavior). Symptom if wrong: EP stalls persist across swaps or one interface starves others.
DMA rings / descriptors / doorbells
Owner: controller + memory subsystem. Symptom if wrong: “alive but no progress” under load (ring head/tail stop moving) or completion events mismatch.
Clock/reset domains for role swap
Owner: platform integration. Symptom if wrong: role shows “swapped” in software but controller datapath stays in old mode.
Interrupt/event routing
Owner: controller + OS integration. Symptom if wrong: completions occur but software never sees them (or old events appear after swap).
Pass criteria (ownership clarity)
For each role (Host/Device) and each swap phase, ownership of the listed resources must be unambiguous and observable via state, counters, or trace markers.
Card B · Role Swap Triggers (Interface View, Not PD Details)
  • Physical result inputs: ID/CC outcomes can appear as a role-request signal (policy details belong elsewhere).
  • Policy manager requests: platform firmware/EC can request swap to satisfy a system policy.
  • OS requests: user or driver stack can request a role change (e.g., accessory mode, diagnostics).
Controller-side swap sequence (robust default)
  1. Quiesce: stop admitting new transfers; freeze scheduling decisions.
  2. Drain: complete or cancel in-flight TRBs/descriptors; ensure rings reach a stable boundary.
  3. Commit: switch role mode; reinitialize role-owned contexts (EP/port state, event routing).
  4. Health probe: run a minimal “progress check” (EP0 sanity / event ring progress) before resuming full traffic.
Pass criteria (swap correctness)
Swap must complete with no orphan rings, no stale events, and a successful health probe within X ms, followed by stable transfers for Y minutes.
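The four-step swap sequence above can be sketched as a small state machine. This is a minimal Python model with illustrative names (`DrdController`, `SwapError`, the `probe` callback), not a real driver API:

```python
from enum import Enum, auto

class Role(Enum):
    HOST = auto()
    DEVICE = auto()

class SwapError(RuntimeError):
    pass

class DrdController:
    """Toy DRD role-swap sequencer: quiesce -> drain -> commit -> health probe.
    Names and fields are illustrative, not a real controller interface."""
    def __init__(self, role=Role.HOST):
        self.role = role
        self.admitting = True   # are new transfers allowed?
        self.in_flight = []     # outstanding TRBs/descriptors

    def submit(self, trb):
        if not self.admitting:
            raise SwapError("quiesced: no new submissions")
        self.in_flight.append(trb)

    def swap_to(self, new_role, probe):
        # 1) Quiesce: stop admitting new work; freeze scheduling decisions.
        self.admitting = False
        # 2) Drain: complete or cancel in-flight work to a stable ring boundary.
        self.in_flight.clear()
        # 3) Commit: switch role mode; reinitialize role-owned contexts.
        self.role = new_role
        # 4) Health probe: minimal progress check before resuming full traffic.
        if not probe(self):
            raise SwapError("health probe failed: hold traffic, enter recovery")
        self.admitting = True
        return self.role
```

In a real stack the probe would issue an EP0 sanity request or check event-ring progress; here it is just a callback so the sequencing rule stays visible. Note that a failed probe leaves the controller quiesced rather than reopening traffic.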
Card C · Classic Failure: “Alive Device, Dead Link” After Swap
Most post-swap flaps are not mysterious: they collapse into three layers, and each layer has a first check that is controller-visible.
Layer 1 · Domains (clock/reset/power)
Likely cause: role toggled in software but hardware domain stayed in old mode.
Quick check: role-state + domain-state markers change together (no “split-brain”).
Layer 2 · Queues (rings/events/ordering)
Likely cause: old TRBs or events survive swap; doorbells reorder; completions mismatch.
Quick check: ring head/tail monotonic progress; event ring contains only post-swap sequence markers.
Layer 3 · Memory (cache/DMA coherency)
Likely cause: rings or contexts read stale data under load.
Quick check: controlled A/B run with explicit flush/invalidate or forced coherent path changes outcome.
Pass criteria (post-swap stability)
After any role swap, transfers must show continuous forward progress: no repeated re-enumeration, no ring stalls, and no periodic flaps over a Y-minute observation window.
Diagram · DRD Role FSM (Quiesce → Commit → Health Probe)
[Diagram: DRD role FSM — Detached (no ownership) → attach → Host (SOF / reset / addr) or Device (EP behavior / responses); a role request enters Swap-Prep (quiesce + drain), then Commit (switch role) when ready, then Health Probe (progress check); success resumes traffic, while timeout or failure enters Error-Recover (reset ladder) until recovery completes.] The swap is safe only if queues are drained and progress is verified before full traffic resumes.
Diagram intent: enforce a disciplined swap path (prep/drain → commit → probe). This prevents “role changed” from becoming a false-positive when rings, events, or domains are still in the old state.

H2-3. Controller Architecture: USB2, USB3.x, USB4 (Controller View)

The controller is best debugged as a closed loop: command creates work, transfer rings describe it, DMA executes it, and event rings prove forward progress—while PHY and electrical topics remain out-of-scope.

Card A · USB2 (EHCI/OHCI/UHCI) vs USB3.x (xHCI) — the practical difference
Control-plane “truth”
USB2 era: progress is dominated by frame/queue state + interrupts.
xHCI era: progress is dominated by ring state + doorbells + event completions.
Transfer representation
USB2 era: list/queue descriptors (software-driven scheduling assumptions).
xHCI era: TRBs on transfer rings (hardware-driven completion evidence).
First debug check (controller-only)
Verify a monotonic loop: doorbell → DMA → event ring → interrupt → driver. If the loop breaks, the issue is typically controller/software integration, not PHY electrical behavior.
Pass criteria (architecture comprehension)
Any stall must be placeable on the loop above within X minutes using ring pointers, event counters, and interrupt delivery status.
Card B · Controller vs PHY boundary (only the interface contract)
Controller contract
Owns command/transfer scheduling, role state, DMA execution, and recovery policy based on link-status results. Electrical tuning is not owned here.
PHY contract (out-of-scope)
Owns electrical compliance and signal integrity. The controller consumes link up/down and error indications, then chooses reset/retrain/recover actions.
Boundary sanity check
If events and ring progress stop, debug the controller loop first. If progress continues but errors cluster around link-status transitions, route electrical topics to PHY/retimer pages.
Card C · USB4 (controller view): control + routing + tunneling objects
  • More than speed: the controller manages routing decisions and tunneling resources (logical objects), not only endpoints.
  • Isolation: a tunnel must be diagnosable and recoverable without resetting unrelated traffic.
  • Observability: tunnel-level counters and state markers are needed to prove forward progress and to prevent false “link up” assumptions.
Pass criteria (USB4 scope)
USB4 is treated as controller-managed objects (control/routing/tunnels). Physical-layer training and SI remain out-of-scope and must be routed to sibling pages.
Diagram · xHCI-style Controller Block Diagram (debug loop anchor)
[Diagram: Host/OS side (USB driver, memory buffers, IOMMU mappings, MSI-X/IRQ) connects to the USB controller core (command ring, event ring, transfer rings for EP0 / EP1..n / iso+int, microframe scheduler, DMA engine with doorbell and MSI-X interrupt paths, counters and trace); the PHY interface is out-of-scope except for link status.] Debug anchor: prove doorbell→DMA→event→IRQ→driver forward progress before routing to PHY topics.
Diagram intent: expose the controller’s closed loop and its observability points (ring pointers, events, interrupts). PHY electrical details remain out-of-scope.

H2-4. Endpoint & Transfer Model: Control/Bulk/Interrupt/Isochronous

Transfer types are not “labels.” They define service guarantees and recovery rules, which map directly to endpoint contexts, transfer rings, and scheduler decisions.

Card A · EP0 control transfers: why enumeration stability lives here
  • EP0 is universal: every device must expose it, so every system depends on its correctness.
  • EP0 carries the critical path: address assignment, configuration commit, descriptor tree, and capability discovery.
  • Controller-side failure modes: setup/data/status boundary bugs, over-aggressive timeouts, EP0 starvation by other endpoints, or small-transfer DMA coherency gaps.
Pass criteria (EP0 robustness)
Over X plug cycles, enumeration must succeed ≥ Y% with no repeated address/config loops, and EP0 must not exceed T ms worst-case response under concurrent traffic.
Card B · Bulk vs Interrupt: throughput vs bounded latency (scheduler view)
Bulk (best-effort throughput)
Uses deep queues and batching to maximize payload throughput. Primary risks are queue collapse, excessive interrupt cost, or DMA mapping inefficiency.
Interrupt (bounded service latency)
Requires periodic service and predictable upper-bound latency. Primary risks are starvation by bulk traffic or over-aggressive interrupt moderation.
Pass criteria (service targets)
Bulk must reach ≥ X% of expected throughput without ring stalls; interrupt endpoints must meet ≤ L ms worst-case service latency under concurrent bulk load.
Card C · Isochronous: smooth delivery beats “zero errors”
  • Iso is time-bound: late data is useless, so skipping a missed slot beats retrying it late.
  • Scheduler is the control point: microframe allocation and service ordering decide continuity.
  • Controller-visible signals: underrun/overrun counters and periodic timing markers prove smoothness.
Pass criteria (iso continuity)
Over Y minutes, isochronous underrun events must stay ≤ X/min, and service jitter must remain within J (controller timestamp definition).
Diagram · Transfer Types vs Scheduling (queue + priority + retry)
[Diagram: one scheduler (microframes / service order) feeding four queue types.]
  • Control — queue: EP0 ring; priority: high; retry: bounded.
  • Bulk — queue: deep rings; priority: best-effort; retry: allowed.
  • Interrupt — queue: periodic; priority: guaranteed; retry: bounded.
  • Isochronous — queue: time slots; priority: time-bound; retry: not useful.
Same controller, different guarantees: priority + queue model + retry rule define real-world behavior.
Diagram intent: compare the four transfer types using only three controller-relevant knobs—queue model, scheduling priority, and retry usefulness—without dragging PHY details into this page.

H2-5. Scheduling & Bandwidth Budgeting (Microframes, Service Guarantees)

Real-world “smoothness” is decided by a microframe budget and a service order. The practical goal is to prove that periodic endpoints keep their guarantees while bulk traffic remains stable—using controller-visible accounting only.

Card A · Microframe budget accounting (placeholders, controller view)
Step 1 · Link capacity
Start from R_link (nominal payload rate per microframe unit).
Step 2 · Protocol overhead
Subtract OH_proto (tokens/headers/handshakes/spacing as placeholders).
Step 3 · Scheduling overhead
Subtract OH_sched (service switching cost, periodic reservations, fragmentation).
Step 4 · Software/ISR overhead
Subtract OH_sw (doorbell frequency, ISR cost, buffer management jitter).
Step 5 · Available payload
P_avail = R_link − OH_proto − OH_sched − OH_sw
Pass criteria (budget quality)
The accounting must explain a “smoothness” failure by mapping it to one of the terms above within X minutes, without using PHY electrical arguments.
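The five-step accounting reduces to one subtraction plus an attribution rule. A minimal sketch with placeholder units (all term names come from the ladder above; the attribution heuristic is illustrative):

```python
def available_payload(r_link, oh_proto, oh_sched, oh_sw):
    """P_avail = R_link - OH_proto - OH_sched - OH_sw (placeholder units)."""
    p_avail = r_link - oh_proto - oh_sched - oh_sw
    if p_avail <= 0:
        raise ValueError("budget exhausted before payload: re-check overhead terms")
    return p_avail

def explain_shortfall(demand, r_link, oh_proto, oh_sched, oh_sw):
    """Attribute a smoothness failure to the dominant overhead term,
    so the next debug step targets one knob instead of 'the link'."""
    terms = {"OH_proto": oh_proto, "OH_sched": oh_sched, "OH_sw": oh_sw}
    p_avail = r_link - sum(terms.values())
    if demand <= p_avail:
        return "budget OK"
    return max(terms, key=terms.get)  # largest overhead term to investigate first
```

The point of the helper is that a “stutter” complaint becomes a named term (protocol, scheduling, or software overhead), which maps directly onto the pass criterion above.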
Card B · How service guarantees are implemented (period, burst, order)
Knob 1 · Period
A periodic endpoint must be served at least once every N microframes to keep worst-case latency bounded.
Knob 2 · Burst
Burst limits prevent a single service from consuming the next window; cap per-service payload to preserve continuity.
Knob 3 · Service order
The scheduler must ensure periodic endpoints cannot be starved by bulk rings; ordering rules must be observable via ring progress and completion timing.
Pass criteria (guarantees)
Periodic endpoints must meet ≤ L worst-case service gap (microframe units), and bulk throughput must remain ≥ X% of target without ring stalls.
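The three knobs can be combined in a toy per-microframe scheduler. This is an illustrative Python sketch (dict-based endpoint records and the `burst_cap` parameter are assumptions, not an xHCI algorithm):

```python
def schedule_microframe(periodic, bulk_backlog, burst_cap, now):
    """Pick work for one microframe.

    Knob 1 (period): a periodic endpoint is served once its deadline
    (last_service + period) has arrived. Knob 3 (order): periodic endpoints
    go first, sorted by deadline, so bulk cannot starve them. Knob 2 (burst):
    bulk fills the remainder, capped so it cannot consume the next window.

    `periodic` is a list of dicts: {"name", "period", "last_service"},
    all times in microframe units.
    """
    served = []
    for ep in sorted(periodic, key=lambda e: e["last_service"] + e["period"]):
        if now - ep["last_service"] >= ep["period"]:
            served.append(ep["name"])
            ep["last_service"] = now      # deadline met; next deadline moves out
    bulk_grant = min(bulk_backlog, burst_cap)  # burst cap preserves continuity
    return served, bulk_grant
```

Because service order and the burst cap are explicit here, both are observable: a starved periodic endpoint shows up as a growing `now - last_service`, matching the ≤ L service-gap criterion.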
Card C · “Bandwidth is enough, but it still stutters” — typical controller-side roots
Case 1 · Preemption / service order
Quick check: verify periodic reservation is enforced and bulk cannot monopolize consecutive windows.
Fix: tighten service order + cap burst.
Pass: periodic service gap ≤ X microframes.
Case 2 · Interrupt storm / moderation mismatch
Quick check: correlate event ring progress with IRQ delivery and ISR time clusters.
Fix: tune coalescing (batch) while protecting periodic deadlines.
Pass: IRQ rate ≤ Y, ISR p99 ≤ T.
Case 3 · Ring depth / buffer starvation
Quick check: inspect ring occupancy and producer/consumer distance for “tail-chase”.
Fix: increase ring depth + prefill buffers + reduce per-TRB overhead.
Pass: occupancy stays within [A..B] under peak load.
Diagram · Microframe Budget Ladder (capacity → overheads → guarantees)
[Diagram: budget ladder — R_link − OH_proto − OH_sched − OH_sw = P_avail; P_avail splits into guarantee pools G_iso (isochronous) and G_int (interrupt), with bulk taking the remainder P_avail − G_iso − G_int.] Controller view: values are placeholders; the budget explains service guarantees and stutter without PHY electrical details.
Diagram intent: a reusable ladder that turns “stutter” into an explainable budget—then splits the remaining payload into iso/int guarantees and bulk remainder.

H2-6. DMA & Data Path: Rings, TRBs/Descriptors, Cache Coherency

Low latency is achieved when the descriptor path and the data path both have provable visibility. The practical requirement is a monotonic lifecycle: produce → doorbell → DMA → complete → event → reclaim.

Card A · TRB/descriptor lifecycle (producer/consumer loop)
1) Produce
Driver writes TRBs/descriptors and advances the producer pointer.
2) Doorbell
Doorbell notifies the controller; TRB visibility must be guaranteed before ringing.
3) DMA fetch & move
DMA reads descriptors and transfers payload between endpoint and memory buffers.
4) Complete & post event
Controller records completion on the event ring and advances the consumer side.
5) IRQ / poll & reclaim
Driver consumes events, updates ring pointers, and reclaims TRBs for reuse.
Pass criteria (forward progress)
Producer/consumer pointers must advance monotonically under load; completion latency p99 ≤ X, and ring stalls must be explainable by a single broken step in the loop above.
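The produce → doorbell → complete → reclaim loop can be modeled with two monotonic indexes whose distance is the ring occupancy. A minimal sketch (names like `TransferRing` and `dma_complete` are illustrative, not xHCI register names):

```python
class TransferRing:
    """Toy producer/consumer transfer ring. The producer index advances when
    the driver writes TRBs; the consumer index advances when the controller
    posts completions. Both are monotonic; occupancy is their distance."""
    def __init__(self, depth):
        self.depth = depth
        self.producer = 0        # driver-advanced (TRBs written)
        self.consumer = 0        # controller-advanced (TRBs completed)
        self.doorbell_rung = False

    def occupancy(self):
        return self.producer - self.consumer

    def produce(self, n=1):
        if self.occupancy() + n > self.depth:
            raise RuntimeError("ring full: producer would overtake consumer")
        self.producer += n

    def doorbell(self):
        # Real drivers must make TRBs visible (barrier/flush) before this.
        self.doorbell_rung = True

    def dma_complete(self, n=1):
        if not self.doorbell_rung or n > self.occupancy():
            raise RuntimeError("no work visible to the controller")
        self.consumer += n   # completion posted; TRBs are reclaimable
```

A “ring stall” in this model is simply one index stopping while the other does not, which is exactly the forward-progress check named in the pass criterion.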
Card B · Cache coherency & IOMMU: the three hard pitfalls
Pitfall 1 · TRB written, DMA cannot see it
Cause: missing visibility barrier / flush before doorbell.
Fix: enforce “descriptor visible before notify” rule.
Pass: no “ghost doorbells” over Y runs.
Pitfall 2 · DMA wrote data, CPU reads old bytes
Cause: missing invalidate or cache-line overlap.
Fix: define ownership boundaries for buffers and invalidate on completion.
Pass: data correctness holds across N randomized buffer offsets.
Pitfall 3 · Scatter-gather edge cases look “random”
Cause: alignment/page boundary/mapping fragmentation.
Fix: constrain segment rules + validate mapping success per segment.
Pass: zero silent truncation across X mixed-size transfers.
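Pitfall 3’s segment rules can be enforced before submission rather than debugged afterwards. A hedged sketch with assumed limits (4 KiB page, 64 KiB per-segment cap, 4-byte alignment — substitute the real controller’s constraints):

```python
PAGE = 4096  # assumed page size for this sketch

def validate_sg_segment(addr, length, max_seg=65536, align=4):
    """Return the list of reasons a scatter-gather segment is unsafe to
    submit. Rules are illustrative: alignment, non-zero length, a
    per-segment size cap, and no silent page-boundary crossing."""
    reasons = []
    if addr % align:
        reasons.append("misaligned address")
    if length == 0:
        reasons.append("zero-length segment")
    if length > max_seg:
        reasons.append("segment exceeds per-TRB cap")
    if (addr % PAGE) + length > PAGE:
        reasons.append("crosses page boundary: split before mapping")
    return reasons
```

Rejecting a bad segment with a named reason turns “random-looking” SG failures into a deterministic submission error, which is what the zero-silent-truncation pass criterion requires.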
Card C · Low-latency playbook (controller-visible knobs)
  • Zero-copy: reduce cache pollution and copies; requires strict buffer ownership and reclaim discipline.
  • Pre-allocation: remove runtime allocation/mapping jitter from the hot path.
  • Batch doorbells: reduce MMIO frequency while keeping periodic deadlines safe.
  • Interrupt coalescing (MSI-X): prevent IRQ storms while keeping completion p99 within target.
Pass criteria (latency + stability)
Completion latency p99 ≤ X, IRQ rate ≤ Y, and ring occupancy never hits “empty” or “full” thresholds for more than T.
Diagram · DMA Path from Endpoint to Memory (descriptor path + data path)
EP FIFO payload DMA Engine read / write System Memory buffers App buffer Cache flush / invalidate IOMMU map / unmap Transfer Ring TRBs Event Ring completion IRQ Visibility points Separate paths: descriptors (rings/events/IRQ) and payload (DMA/memory). Both must be coherent for low latency.
Diagram intent: show two independent correctness requirements—descriptor visibility and payload visibility—so “random” latency or corruption can be traced to cache/IOMMU/ring steps.

H2-7. Enumeration & Descriptor Pipeline (Robustness First)

Enumeration stability is decided by a predictable control-transfer pipeline and disciplined fallback rules. The focus here is controller/firmware behavior: state progression, descriptor parsing safety, and bounded retries.

Card A · Pipeline breakdown (reset → address → descriptor tree → configuration)
Stage 1 · Reset / Default
Required: bus reset + EP0 ready.
Quick check: EP0 control completions appear as a complete triad (setup → data → status).
Fallback: restart control path domain (not a full-system reset).
Stage 2 · Set Address
Required: SET_ADDRESS (control transfer).
Quick check: the first GET_DESCRIPTOR after address change succeeds without extra retries.
Fallback: return to Default if address confirmation is inconsistent.
Stage 3 · Descriptor Tree Walk
Required: GET_DESCRIPTOR (DEVICE / CONFIG / STRING / BOS as needed).
Quick check: length negotiation and short-packet handling remain consistent across segmented reads.
Fallback: roll back to the tree root (avoid unnecessary full reset).
Stage 4 · Set Configuration / Interface
Required: SET_CONFIGURATION (+ SET_INTERFACE for alt settings).
Quick check: endpoint enable + ring allocation completes before traffic is allowed.
Fallback: return to Address and re-apply configuration (bounded attempts).
Pass criteria (pipeline)
Enumeration outcomes must be attributable to a single stage within X minutes of logging, and stage-level rollbacks must resolve transient failures without escalating to full resets.
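The stage pipeline with bounded, stage-local rollback can be sketched as a loop over stage names. `run_stage` is a stand-in for the real control transfers; the stage names and retry cap are illustrative:

```python
STAGES = ["Reset", "Address", "DescriptorTree", "Configured"]

def enumerate_device(run_stage, n_retry=3):
    """Walk the pipeline front to back. On failure, roll back exactly one
    stage (stage-local rollback, never a full reset) and retry; each stage
    gets at most n_retry failures before escalating to Error-Recover.
    `run_stage(name)` returns True on success."""
    attempts = {s: 0 for s in STAGES}
    i = 0
    while i < len(STAGES):
        stage = STAGES[i]
        if run_stage(stage):
            i += 1
        else:
            attempts[stage] += 1
            if attempts[stage] > n_retry:
                raise RuntimeError(f"escalate: {stage} exceeded {n_retry} retries")
            i = max(i - 1, 0)   # stage-local rollback
    return "Ready"
```

Because attempts are counted per stage, a transient descriptor-walk failure resolves with one rollback, while a persistently failing stage escalates with a single, attributable reason — matching the pipeline pass criterion.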
Card B · BOS / SS companion / alternate settings — controller-side pitfalls
Pitfall 1 · BOS parsing safety
Likely cause: length/unknown-capability handling is not robust.
Quick check: validate “length-bounded parse” and skip unknown capabilities.
Fix: enforce bounds + tolerate optional blocks.
Pass: BOS walk never reads past buffer; no parse crash across N device variants.
Pitfall 2 · SS companion consistency
Likely cause: companion parameters do not match endpoint descriptors during enable.
Quick check: cross-check burst/interval fields before programming endpoints.
Fix: enforce descriptor cross-validation rules.
Pass: endpoint programming rejects inconsistent combos with a single clear error reason.
Pitfall 3 · Alternate settings rebind
Likely cause: SET_INTERFACE occurs while DMA/rings are still active.
Quick check: ensure “quiesce → reprogram → restart” ordering is enforced.
Fix: stop endpoints + drain completions before re-enabling.
Pass: alt switch completes within T without ring corruption or stuck completions.
Card C · Stability policy: bounded retries, timeouts, backoff (no retry storms)
Policy variables (placeholders)
T_ctl (control timeout budget) · N_retry (retry cap) · B_k (backoff schedule)
Rule 1 · No infinite retries
After N_retry, transition to Error-Recover; do not loop forever on a failing step.
Rule 2 · Backoff is mandatory
A failure must wait B_k before the next attempt to prevent a retry storm from amplifying instability.
Rule 3 · Layered rollback
Prefer stage-local rollback (tree step) over global reset, escalating only when evidence shows state corruption.
Pass criteria (robustness)
Enumeration success rate ≥ X% across Y reconnect cycles; average time ≤ T_avg, p95 ≤ T_p95. Failure modes must map to a stage + rule violation (timeout/retry/backoff/rollback).
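Rules 1 and 2 combine into a bounded-retry helper. A minimal sketch where B_k follows an exponential schedule with a cap (the base delay, cap, and retry count are placeholder values):

```python
def backoff_schedule(base_ms, n_retry, cap_ms):
    """B_k = min(base * 2**k, cap) for k in 0..n_retry-1: bounded waits."""
    return [min(base_ms * (2 ** k), cap_ms) for k in range(n_retry)]

def run_with_backoff(op, base_ms=10, n_retry=4, cap_ms=200, sleep=lambda ms: None):
    """Attempt op() at most n_retry+1 times, waiting B_k between failures.
    When retries are exhausted, return False so the caller transitions to
    Error-Recover instead of looping (Rule 1: no infinite retries)."""
    for wait in backoff_schedule(base_ms, n_retry, cap_ms) + [None]:
        if op():
            return True
        if wait is None:
            return False        # retries exhausted -> hand off to recovery
        sleep(wait)             # Rule 2: mandatory backoff between attempts
    return False
```

The injectable `sleep` keeps the helper testable; a firmware version would use a timer. The schedule itself is the anti-storm property: each failure widens the gap to the next attempt.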
Diagram · Enumeration State Pipeline (steps + bounded rollback)
[Diagram: Reset (EP0) → Address (SET_ADDR) → Descriptor Tree Walk → Config (EP enable) → Ready (traffic); rollback rules: stage fail → back to Address; repeated fail → Error-Recover with N_retry + backoff; recover → Reset.] Controller/FW view: stage-local rollback first; bounded retries + backoff prevent retry storms.
Diagram intent: a deterministic enumeration pipeline with explicit rollback severity, so failures are isolated to a stage instead of escalating into repeated full resets.

H2-8. Class Support from a Controller Perspective (MSC/UASP/CDC + Composite)

Class support is not an OS driver tutorial. The controller requirement is an endpoint map, queue depth, interrupt policy, and isolation rules that keep one workload from degrading the whole device.

Card A · MSC BOT vs UASP: controller requirements differ by concurrency
Dimension 1 · Queue depth
BOT: typically shallow, sequential behavior.
UASP: deeper concurrent command queue; multiple in-flight transfers must complete reliably.
Dimension 2 · Completion pressure
UASP increases event frequency; ring progress and interrupt moderation must avoid cluster stalls.
Dimension 3 · DMA model
Deeper queues amplify scatter-gather edge cases; mapping and segment rules must be validated before submission.
Dimension 4 · Error recovery
Prefer local recovery (endpoint queue drain / reset) over global re-enumeration when concurrency is high.
Pass criteria (storage classes)
Under target concurrency, completion latency p99 ≤ X and no sustained ring backlog; error recovery must converge within T without forcing re-enumeration.
Card B · CDC (serial): small, frequent packets stress IRQ + DMA
Risk 1 · IRQ amplification
Quick check: measure IRQ rate under steady CDC traffic.
Fix: coalesce completions and batch event processing.
Pass: IRQ rate ≤ X at Y packets/s.
Risk 2 · Ring starvation
Quick check: observe ring occupancy for “empty hits” during bursty traffic.
Fix: prefill buffers + increase ring depth within bounds.
Pass: occupancy remains within [A..B] throughout bursts.
Risk 3 · Cache churn
Quick check: correlate latency spikes with buffer reuse and completion clustering.
Fix: use stable buffer pools and avoid per-packet allocation.
Pass: p99 latency ≤ T with no periodic spikes.
Card C · Composite devices: resource isolation and failure domains
Isolation axis 1 · Queues / rings
Allocate independent rings where possible; prevent one interface from saturating global completion capacity.
Isolation axis 2 · Interrupt policy
Keep latency-sensitive endpoints protected from bulk completion bursts by tuning moderation per queue/vector.
Isolation axis 3 · Buffers
Avoid a single shared pool that allows one class to consume all buffers; enforce per-class quotas.
Isolation axis 4 · Budget
Preserve periodic service guarantees even under peak bulk or CDC load; enforce scheduler reservations.
Pass criteria (composite)
Under peak load of one function, other functions must still meet minimum service guarantees ≥ X. Injected timeouts in one interface must not force global re-enumeration.
Diagram · Class → Endpoint Map (minimal labels, controller requirements)
[Diagram: classes (MSC/BOT, UASP with deep queue, CDC with small packets, composite) mapped to endpoints (EP0, BULK OUT/IN, INT IN/OUT); UASP stresses queue depth, CDC is IRQ-sensitive, and composite devices need isolated rings / IRQ / buffers / budget.] Controller view: class support = endpoint map + queue depth + IRQ policy + isolation (not an OS tutorial).
Diagram intent: map each class to endpoint types with minimal labels, then highlight the controller knobs that prevent one workload from degrading the entire composite device.

H2-9. Error Handling & Recovery: Stalls, Resets, Timeouts, Link Events

Field instability becomes repeatable when errors are classified into controller-visible buckets and recovery is executed as a layered ladder. The goal is to converge quickly without escalating into full-system reboots or retry storms.

Card A · Error taxonomy (transaction vs stall vs controller halt)
Bucket 1 · Transaction errors
Symptom: timeouts, repeated retries, missing completions, abnormal latency tails.
First check: completion/event latency distribution + ring backlog slope (per EP, per window).
Likely cause: submission bursts, IRQ delay, queue depth mismatch, stalled producer/consumer.
Minimal recovery: stop new submits → drain completions → bounded retry with backoff.
Bucket 2 · Protocol stalls (EP halt)
Symptom: STALL, halted EP, control stage failure patterns (EP0).
First check: stall events are attributed to the correct EP/interface (not a global mapping bug).
Likely cause: invalid request sequencing, alt rebind without quiesce, stale queue state.
Minimal recovery: clear/stop the affected EP → rebind queues → resume with a probe.
Bucket 3 · Controller halt / fatal
Symptom: host controller halted, ring halted, fatal error reason code.
First check: halt reason + last state transition + whether ring pointers stopped advancing.
Likely cause: domain reset ordering error, internal watchdog, descriptor corruption.
Minimal recovery: quiesce → reset the minimal controller domain → validate rings before resume.
Pass criteria (taxonomy)
Every field failure must map to exactly one primary bucket within a W second observation window, with a consistent first-check item (counter or trace) that explains the chosen recovery level.
Card B · Layered recovery: quiesce → reset domain → re-enumerate (escalate with evidence)
Step 0 · Quiesce (mandatory)
Freeze new submissions, drain in-flight completions, and scope the failure domain to the smallest affected EP/port/controller block.
Level 1 · Retry (bounded)
Trigger: transient timeout with ring progress still healthy.
Action: retry ≤ N with backoff B_k.
Pass: completion latency returns within T for Z windows.
Level 2 · Reset EP (local)
Trigger: stalls or repeat failures isolated to one endpoint/interface.
Action: stop EP → clear halt → rebuild queue state → probe.
Pass: EP traffic resumes with error rate ≤ X over Y seconds.
Level 3 · Reset port (mid)
Trigger: multi-EP failures on the same port or repeated link events.
Action: port reset + controlled re-enable.
Pass: stable link state and successful probes for Z windows.
Level 4 · Reset controller domain (heavy)
Trigger: controller halt, ring no-progress, or repeated fatal reason codes.
Action: reset minimal domain(s) + validate ring pointers before resume.
Pass: rings advance normally and error counters remain flat for T.
Level 5 · Re-enumerate (last resort)
Escalate only when evidence shows state corruption beyond local domains. Cap attempts and enforce cooldown to avoid oscillation.
Card C · Anti-storm controls: backoff, circuit breaker, health probe
Storm detection (placeholders)
retries/s ≥ X · p95 completion latency ≥ T · ring backlog not decreasing for Y windows
Guard 1 · Exponential backoff
Increase wait time B_k after each failure to prevent tight retry loops from amplifying instability.
Guard 2 · Circuit breaker
When storm thresholds are met, freeze submissions and force a recovery ladder step rather than continuing retries.
Guard 3 · Health probe
After recovery, use a lightweight probe to confirm progress before reopening high-rate traffic; keep the probe scoped to the affected domain.
Pass criteria (anti-storm)
Under a fault burst, recovery converges within T_recover and retry rate remains ≤ X/s. Re-enumeration oscillation stays ≤ N per hour.
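The storm thresholds and Guard 2 can be sketched as a simple breaker object; the thresholds and window counts are exactly the placeholders named above (X, T, Y):

```python
class CircuitBreaker:
    """Trip (open) when any storm condition holds: retry rate >= X,
    p95 completion latency >= T, or ring backlog non-decreasing for
    Y consecutive windows. Open means: freeze submissions and force a
    recovery-ladder step instead of continuing retries."""
    def __init__(self, max_retry_rate, max_p95_ms, backlog_windows):
        self.max_retry_rate = max_retry_rate    # X (retries/s)
        self.max_p95_ms = max_p95_ms            # T (latency)
        self.backlog_windows = backlog_windows  # Y (windows)
        self._stuck = 0
        self._last_backlog = None
        self.open = False

    def observe(self, retry_rate, p95_ms, backlog):
        # Track consecutive windows where backlog failed to decrease.
        if self._last_backlog is not None and backlog >= self._last_backlog:
            self._stuck += 1
        else:
            self._stuck = 0
        self._last_backlog = backlog
        if (retry_rate >= self.max_retry_rate
                or p95_ms >= self.max_p95_ms
                or self._stuck >= self.backlog_windows):
            self.open = True
        return self.open
```

Closing the breaker again would happen only after Guard 3’s health probe succeeds; that reset path is deliberately left out of the sketch so the trip conditions stay readable.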
Diagram · Recovery Ladder (escalate from light to heavy)
[Diagram: recovery rules (quiesce first: freeze + drain; backoff B_k; circuit breaker stops storms; health probe verifies progress) feed the escalation ladder — Level 1 Retry (N + B_k; trigger: transient; pass: stable for Z), Level 2 Reset EP (local; trigger: isolated EP; pass: err ≤ X), Level 3 Reset Port (mid; trigger: same port; pass: probe OK), Level 4 Reset Ctrl domain (heavy; trigger: halt; pass: rings move), Level 5 Re-enumerate (last resort; trigger: corruption; with cooldown).] Escalate with evidence. Avoid full-system reboot. Cap retries and enforce backoff to prevent storms.
Diagram intent: a deterministic ladder that escalates from local recovery to re-enumeration, with storm controls and explicit pass criteria at each level.

H2-10. Observability & Test Hooks (Trace, Counters, Loopback/PRBS – Controller Side)

Fixing instability requires measurable signals. The controller must export counters, traces, and injection hooks that close the loop: detect → classify → recover → verify.

Card A · Must-have counters (domain + slice + window)
Data-plane
submitted/completed TRBs · bytes · backlog · completion latency p50/p95/p99 (per EP).
Control-plane
state transitions · port reset count · enumeration retries · role-change events (per port, per window).
Error-plane
timeouts · stalls · halted events · underrun/overrun · drop/abort reasons (with reason codes).
Slicing rule (required)
Every counter must support: port / EP / type / time window. Avoid relying on a single global accumulated value.
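The slicing rule can be made concrete with a small counter store keyed by every required dimension. A hedged sketch, assuming a millisecond timebase and illustrative dimension names; a real controller would back this with fixed-size per-port arrays rather than a dictionary.

```python
from collections import defaultdict


class SlicedCounters:
    """Every counter is keyed by (port, EP, transfer type, time window),
    so a single global accumulator can never hide a per-port regression."""

    def __init__(self, window_ms: int = 1000) -> None:
        self.window_ms = window_ms
        self._c = defaultdict(int)

    def bump(self, name, port, ep, ttype, t_ms, n=1):
        # Window index is derived from the event timestamp, not wall-clock reads.
        self._c[(name, port, ep, ttype, t_ms // self.window_ms)] += n

    def total(self, name, port=None, ep=None, ttype=None, window=None):
        """Sum over any slice; None means 'all values of that dimension'."""
        return sum(v for (nm, p, e, t, w), v in self._c.items()
                   if nm == name
                   and port in (None, p)
                   and ep in (None, e)
                   and ttype in (None, t)
                   and window in (None, w))
```

Querying `total("timeout", port=0)` versus `total("timeout")` is exactly the comparison that localizes a fault to one port instead of blaming "the bus".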
Card B · Trace points (doorbell, event, state transitions) via ring-buffer trace
Control trace
state transitions for enumeration/recovery/role changes, with timestamps and scoped object IDs.
Data trace
submissions/doorbells, ring pointers, completion batches, and backlog deltas (low overhead).
Error trace
stall/halt/timeout reason codes and the triggering state at the time of fault detection.
Minimal trace schema (required)
timestamp · event type · object id (port/EP/queue) · delta (latency/backlog) · reason code (optional)
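The minimal schema above maps directly onto a fixed-size ring-buffer trace. This is a sketch under stated assumptions: the field names mirror the schema line, the sizes are illustrative, and a firmware implementation would use a preallocated struct array with lock-free slot writes rather than Python objects.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TraceEvent:
    ts_us: int                     # timestamp
    kind: str                      # event type: "doorbell" | "complete" | "state" | "error"
    obj: str                       # object id, e.g. "port0/ep2-in"
    delta: int = 0                 # latency or backlog delta
    reason: Optional[int] = None   # optional reason code


class RingTrace:
    """Fixed-size ring: the hot path only overwrites a slot (no allocation),
    and the oldest events are dropped rather than blocking producers."""

    def __init__(self, size: int = 1024) -> None:
        self._buf = [None] * size
        self._head = 0

    def push(self, ev: TraceEvent) -> None:
        self._buf[self._head % len(self._buf)] = ev
        self._head += 1

    def snapshot(self):
        # Return surviving events oldest-first for offline decoding.
        n = min(self._head, len(self._buf))
        return [self._buf[i % len(self._buf)]
                for i in range(self._head - n, self._head)]
```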
Card C · Injection + loopback/PRBS hooks (interface + pass criteria only)
Injection hooks (examples)
forced timeout · forced stall · drop completion · queue jam (scoped to a port/EP).
Pattern traffic (controller side)
enable/disable pattern mode · read error counters · validate windowed pass criteria.
Pass criteria (validation loop)
Identical injections must trigger the same recovery ladder step; counters + traces must explain escalation. Recovery success rate ≥ X% across Y injections with bounded recovery time ≤ T.
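The "identical injection → identical ladder step" requirement reduces to a repeatability metric over recorded runs. A minimal sketch, assuming each injection run logs the ladder step it escalated to:

```python
from collections import Counter


def ladder_repeatability(runs):
    """runs: {scenario: [ladder step chosen on each injection run, ...]}.
    Returns, per scenario, the modal step and the fraction of runs that
    landed on it -- the repeatability figure compared against the X% gate."""
    out = {}
    for scenario, steps in runs.items():
        step, hits = Counter(steps).most_common(1)[0]
        out[scenario] = (step, hits / len(steps))
    return out
```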
Diagram · Controller Telemetry Map (data/control/error buses)
USB controller blocks (DMA · Sched · Event) feed three telemetry buses — Data (bytes · backlog), Control (states · resets), Error (timeouts · stalls) — which export through three interfaces: Counters (snapshot window), Ring Trace (low overhead), and Injection (fault hooks). Every export slices by: port · EP · type · time window. Observability closes the loop: counters + trace explain recovery escalation; injection validates repeatability.
Diagram intent: three telemetry buses (data/control/error) mapped to export interfaces, so failure classification and recovery escalation are backed by measurable evidence.

H2-11. Engineering Checklist (Design → Bring-up → Production)

This section is a project gate. Each item is written as a verifiable check with a fast validation method, a failure signature, and a pass threshold placeholder. Scope stays on controller/firmware/driver behaviors (queueing, DMA, interrupts, role swap, recovery, telemetry).

Design Gate · Architecture frozen with measurable budgets
Queue depth budget (rings/TRBs)
Check: ring depth per EP/type supports worst-case bursts (no unbounded backlog).
How: track backlog slope + p95/p99 completion latency in a fixed window.
Fail signature: throughput looks fine but p99 latency spikes; completions arrive in clumps.
Pass: backlog peak ≤ X and p99 ≤ T over Y windows.
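The p99/backlog gate above can be computed directly from windowed samples. A hedged sketch using a nearest-rank percentile; the budget values are the X/T placeholders from the checklist, not recommendations.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile (p in (0, 100]) over one fixed sampling window."""
    xs = sorted(samples)
    k = max(0, math.ceil(p / 100 * len(xs)) - 1)
    return xs[k]


def queue_gate(latencies_ms, backlog_samples, p99_budget_ms, backlog_cap):
    """Design-gate check: p99 completion latency and backlog peak within budget.
    Catches the classic fail signature: fine throughput, clumped completions."""
    return (percentile(latencies_ms, 99) <= p99_budget_ms
            and max(backlog_samples) <= backlog_cap)
```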
DMA + cache coherency plan
Check: coherency responsibility is explicit (coherent vs non-coherent, IOMMU on/off).
How: run a descriptor/data integrity loop (submit → DMA → complete → verify) with forced stress.
Fail signature: sporadic “old data”, descriptor corruption, or non-repeatable timeouts.
Pass: zero integrity mismatches across N iterations; error counters flat.
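The submit → DMA → complete → verify loop can be sketched with a per-buffer CRC so that stale data or descriptor corruption surfaces as a mismatch count. The `submit`/`complete` hooks are illustrative stand-ins for the real DMA path, not a controller API.

```python
import os
import zlib


def integrity_loop(submit, complete, iterations=1000, size=4096):
    """Each buffer carries a CRC32 tag computed before submission; any 'old
    data' returned by a non-coherent path fails the tag check on completion."""
    mismatches = 0
    for _ in range(iterations):
        payload = os.urandom(size)   # fresh pattern each pass defeats caching luck
        tag = zlib.crc32(payload)
        submit(payload)
        returned = complete()
        if zlib.crc32(returned) != tag:
            mismatches += 1
    return mismatches
```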
Interrupt strategy (MSI-X / moderation / batching)
Check: completion batching and IRQ moderation prevent CPU saturation and storm amplification.
How: correlate IRQ rate vs completion batch size vs p99 latency.
Fail signature: IRQ rate explodes at load; timeouts increase while link remains “up”.
Pass: IRQ rate ≤ X/min and timeout ≤ Y/min at target load.
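The IRQ-vs-batch correlation reduces to three derived ratios over a fixed window. A minimal sketch; the field names are illustrative, and the interesting signal is the trend of `completions_per_irq` as load rises.

```python
def moderation_report(irq_count, completions, bytes_moved, window_s):
    """Correlate IRQ rate vs batch size: a healthy moderated path shows
    completions/IRQ well above 1 and a bounded IRQ rate at target load."""
    return {
        "irq_per_min": irq_count / window_s * 60,
        "completions_per_irq": completions / max(irq_count, 1),
        "bytes_per_irq": bytes_moved / max(irq_count, 1),
    }
```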
DRD role swap + quiesce ordering
Check: swap sequence stops submissions, drains completions, then switches domains (no stale DMA).
How: verify role FSM trace and ring pointer monotonicity across swaps.
Fail signature: device appears alive but traffic dead; repeating attach/detach loops.
Pass: swap success ≥ X% and recovery time ≤ T.
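The quiesce ordering can be written as a strict barrier sequence. Everything here is hypothetical: `MockController` and its method names stand in for whatever quiesce interface the real driver exposes; only the ordering and the hard in-flight check are the point.

```python
class MockController:
    """Minimal stand-in for a DRD controller's quiesce interface (illustrative)."""

    def __init__(self):
        self.in_flight = 3
        self.role = "host"
        self.log = []

    def stop_submissions(self):
        self.log.append("stop")

    def drain_completions(self, timeout_ms):
        self.in_flight = 0           # real HW: poll ring pointers, bounded wait
        self.log.append("drain")

    def disable_ep_contexts(self):
        self.log.append("disable_ep")

    def switch_role(self, new_role):
        self.role = new_role
        self.log.append("switch")

    def reinit_rings(self):
        self.log.append("reinit")


def role_swap(ctrl, new_role):
    """Strict quiesce barrier: never switch roles while DMA may be in flight."""
    ctrl.stop_submissions()
    ctrl.drain_completions(timeout_ms=50)   # bounded wait, not forever
    if ctrl.in_flight != 0:
        raise RuntimeError("quiesce failed: stale DMA would survive the swap")
    ctrl.disable_ep_contexts()
    ctrl.switch_role(new_role)
    ctrl.reinit_rings()
```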
Recovery ladder policy (bounded, evidence-based)
Check: retry caps, escalation evidence, cooldown, and anti-storm rules are enforced.
How: use injection regression to confirm deterministic ladder steps.
Fail signature: retry storm; oscillating re-enumeration; full system reboot as default.
Pass: T_recover ≤ T; re-enum ≤ N/hour; retries/s ≤ X.
Concrete materials (examples)
  • Discrete host controller ICs (PCIe → USB 3.x): Renesas uPD720202 / uPD720201; VIA Labs VL805; ASMedia ASM1042A; ASMedia ASM1142 / ASM2142 / ASM3142.
  • USB4 host controllers (PCIe → USB4): ASMedia ASM4242; Intel JHL8440 (Thunderbolt 4 / USB4 controller).
  • Peripheral/device controllers: Infineon (Cypress) EZ-USB FX3 CYUSB3014; EZ-USB FX2LP CY7C68013A; FTDI FT600 / FT601 (USB 3.0 FIFO bridge).
  • Config / FW NVM anchors: Winbond W25Q64JV (SPI NOR); Microchip 24AA64 (I²C EEPROM) (used for board config / IDs when applicable).
  • Clock anchors (board-level examples): Abracon ASFL1 series (e.g., 24/25 MHz oscillators) (frequency depends on controller reference requirements).
Bring-up Gate · Enumeration → EP matrix → baseline → injection regression
Enumeration checklist (EP0 robustness)
Check: reset → address → descriptor tree → configuration has traceable steps and bounded retries.
How: capture per-step timing and failure reasons in a fixed window (state trace + counters).
Fail signature: stalls at the same stage; repeated resets without progress.
Pass: enum time ≤ T and failure ≤ X/1000 runs.
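The bounded-retry enumeration walk can be sketched as a per-step loop with a cap. `do_step` is an illustrative hook into the real EP0 control pipeline; the step names mirror the reset → address → descriptor → configuration sequence above.

```python
ENUM_STEPS = ("RESET", "ADDR", "DESC", "CFG")


def enumerate_device(do_step, max_retries=3):
    """Walk enumeration with per-step retry caps; returns (ok, retries-by-step).
    Failing a step after the cap escalates instead of looping forever, so the
    retry account itself becomes the trace evidence for the ladder."""
    retries = {}
    for step in ENUM_STEPS:
        for attempt in range(max_retries + 1):
            if do_step(step):
                retries[step] = attempt
                break
        else:
            retries[step] = max_retries + 1
            return False, retries   # bounded: hand off to the recovery ladder
    return True, retries
```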
Endpoint verification matrix
Check: Control/Bulk/Interrupt/Iso are verified across speed, concurrency, and queue depth variants.
How: slice counters by port/EP/type/window and compare to baseline profiles.
Fail signature: one EP class starves others; iso jitter grows under background bulk load.
Pass: per-class error ≤ X and p99 latency/jitter ≤ T.
Throughput + latency baseline
Check: baseline is reproducible: throughput, p95/p99 latency, backlog peaks, and CPU cost.
How: fixed workload + fixed sampling window snapshots; store per build.
Fail signature: later builds drift with no signal; “works” but performance regresses.
Pass: drift ≤ X% across Y releases.
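Comparing a per-build snapshot against the stored baseline is a one-line drift calculation plus a gate over all tracked metrics. A minimal sketch; the metric names are examples, not a required schema.

```python
def drift_pct(baseline, current):
    """Signed drift of one metric versus the stored per-build baseline."""
    return (current - baseline) / baseline * 100.0


def regression_gate(baseline, current, limit_pct):
    """Pass only when every tracked metric stays inside the drift budget,
    so a silent p99 regression fails the gate even if throughput holds."""
    return all(abs(drift_pct(baseline[k], current[k])) <= limit_pct
               for k in baseline)
```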
Injection regression (deterministic recovery)
Check: forced timeout/stall/drop completion/queue jam triggers the same ladder step with evidence.
How: run repeated injections; compare counters + trace signatures per scenario.
Fail signature: non-repeatable outcomes; escalation without measurable reason.
Pass: repeatability ≥ X% and T_recover ≤ T.
Concrete materials (bring-up anchors)
  • Device-side controller for validation targets: CYUSB3014 (FX3) enables repeatable bulk/iso/intr patterns (device role).
  • Host-side discrete controller for regression rigs: uPD720202 or VL805 boards are common for stable xHCI behavior comparisons.
  • NVM for controlled firmware/config variants: W25Q64JV (SPI NOR) + 24AA64 (EEPROM) support reproducible config matrices.
Production Gate · Stable telemetry, thresholds, rollback, soak metrics
Telemetry schema frozen (counters + trace)
Check: counter names/fields and trace event types are versioned and backward compatible.
How: snapshot window definitions and slicing dimensions are documented and enforced.
Fail signature: field logs cannot be compared across builds; “unknown reason” dominates.
Pass: parser compatibility ≥ X%; missing fields ≤ Y.
Threshold alarms (X per minute) + controlled actions
Check: alarms trigger bounded recovery and traffic shaping (not a full reboot).
How: validate alarms using injection and soak bursts; measure false/true positive rates.
Fail signature: alert fatigue or late detection; storm continues despite alarms.
Pass: false alarm ≤ X% and missed alarm ≤ Y%.
Firmware upgrade + rollback
Check: rollback triggers are defined (stability window, error thresholds, repeated ladder escalations).
How: run forced-fault + upgrade tests; verify return to baseline in T.
Fail signature: “half-upgraded” states; enumeration instability after update.
Pass: rollback success ≥ X%; recovery time ≤ T.
Soak test metrics (long-run)
Check: error rate, latency drift, recovery count, and re-enumeration count are trended by hour/day.
How: aggregate by windows; store percentiles and slopes (not only totals).
Fail signature: “fragile after a week” (drift) while short tests look clean.
Pass: drift ≤ X%; re-enum ≤ N/week; errors ≤ Y.
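The "store slopes, not only totals" rule is an ordinary least-squares fit over equally spaced soak windows. A minimal sketch assuming one sample per window:

```python
def slope_per_window(ys):
    """Least-squares slope of a metric across consecutive soak windows.
    A near-zero slope means stable; a positive slope on errors or p99
    latency is the 'fragile after a week' drift that totals hide."""
    n = len(ys)
    xs = range(n)
    mx, my = (n - 1) / 2, sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den
```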
Concrete materials (production anchors)
  • NVM for dual-image/rollback patterns (platform-dependent): W25Q64JV (SPI NOR) is a common anchor for firmware storage on embedded platforms.
  • Stable SoC DRD platforms (examples): NXP i.MX8MM / i.MX8MP; TI AM625 / AM642 (SoC-integrated controllers with mature ecosystems).
Diagram · USB Controller Bring-up Flow (gated steps)
Bring-up flow = evidence + gates (controller side): Step 1 · Power/clock domains ready (gate) → Step 2 · Enumerate (EP0 trace; enum time ≤ T) → Step 3 · Throughput baseline (Δ ≤ X%) → Step 4 · Low latency (p99 ≤ T) → Step 5 · Recovery injection (T_recover ≤ T). Evidence artifacts (stored per build): counter snapshots (port/EP/type/window) · ring trace (state + doorbell + errors) · injection log (scenario → ladder step). Gates prevent “works on bench” from reaching production without telemetry and repeatability.
Diagram intent: five gated steps plus an evidence row (snapshots, traces, injection logs) to support regression and production readiness.

H2-12. Applications & IC Selection Logic (Controller-Only, before FAQs)

Selection is driven by controller-side requirements: roles (Host/Device/DRD), speed generation (2.0/3.x/4), ports and concurrency, DMA/latency characteristics, observability hooks, and production lifecycle (upgrade/rollback). PHY/Type-C/retimer/ESD details remain out-of-scope for this page.

Card A · Application map → controller pressure points
Industrial gateway / edge box
Pressure: multi-port concurrency, bounded recovery, stable telemetry schema, long-run soak stability.
Must-have: predictable IRQ moderation, counter slicing (port/EP/type/window), recovery ladder with cooldown.
Camera / capture / vision
Pressure: iso/interrupt jitter, latency percentiles, DMA zero-copy feasibility.
Must-have: ring depth budget, completion batching, stable p99 under background traffic.
Storage enclosure / UASP targets
Pressure: queue depth + command concurrency, bounded error recovery, stable throughput baseline.
Must-have: efficient DMA scatter-gather, predictable completion latency tails, storm protection.
Dock / multi-function expansion
Pressure: topology changes and compatibility matrix across hosts/OS builds.
Must-have: observability hooks, deterministic recovery, and controlled enumeration retries.
Card B · Selection dimensions (controller view, ordered)
Tier 1 · Hard filters
Role support (Host/Device/DRD) · generation (USB 2.0 / 3.x / 4) · ports (#) · concurrency budget (EP/streams).
Tier 2 · Latency + DMA
Ring depth scaling · scatter-gather overhead · cache/IOMMU cost · batching/IRQ moderation · p99 stability under load.
Tier 3 · Production lifecycle
Telemetry completeness · deterministic recovery ladder · upgrade/rollback ability · long-run soak indicators · ecosystem maturity.
Concrete controller IC examples (by role)
  • Host (PCIe attach, USB 3.x): Renesas uPD720202 / uPD720201; VIA Labs VL805; ASMedia ASM1042A; ASMedia ASM1142 / ASM2142 / ASM3142.
  • USB4 host (PCIe attach): ASMedia ASM4242; Intel JHL8440 (Thunderbolt 4 / USB4 controller).
  • Peripheral / device controllers: CYUSB3014 (FX3); CY7C68013A (FX2LP); FT600 / FT601.
  • SoC-integrated DRD platforms (examples): NXP i.MX8MM / i.MX8MP; TI AM625 / AM642; Rockchip RK3568 / RK3588.
Card C · Landing strategy: SoC integrated vs discrete controller vs PCIe attach
Option 1 · SoC integrated controller
When: tight integration, lower BOM, controlled platform ecosystem.
Must-have: telemetry exports, deterministic recovery ladder, coherent DMA plan.
Hidden risk: limited injection hooks or restricted low-level counters on some platforms.
Examples: i.MX8MP, AM625 (platform-specific).
Option 2 · Discrete controller (board-level)
When: clearer isolation, repeatable regression rigs, easier multi-platform reuse.
Must-have: ring depth headroom, IRQ moderation, stable telemetry slicing.
Hidden risk: driver/firmware compatibility matrix across OS builds.
Examples: uPD720202, VL805, ASM1142/ASM2142/ASM3142.
Option 3 · PCIe attach controller (add-in / external box)
When: scale-out ports or performance upgrades without redesigning the base SoC.
Must-have: IOMMU compatibility, IRQ scaling, deterministic recovery and telemetry.
Hidden risk: platform-level resource contention (CPU/IRQ/IOMMU), not raw link speed.
Examples: ASM4242 (USB4), JHL8440 (USB4/TB4).
Diagram · Selection Decision Tree (controller-only)
Start from controller requirements and filter in order: Role (Host / Device / DRD) → Generation (2.0 / 3.x / 4) → Ports (count + concurrency) → DMA + latency (rings · IOMMU · IRQ scaling) → Ecosystem + production (telemetry · rollback · soak). The tree converges on three landings: SoC integrated (AM625 · i.MX8MP), discrete controller (uPD720202 · VL805), or PCIe attach (ASM4242 · JHL8440). Out-of-scope here: PHY / Type-C orientation / retimers / ESD selection → see sibling pages.
Diagram intent: a question-first tree that converges on three landing strategies while keeping PHY/Type-C/retimer/ESD selection out of scope.


H2-13. FAQs (Controller-Side Troubleshooting, 4-line Answers + JSON-LD)

Scope guard: only controller/firmware/driver logic (EP0 control transfers, rings/TRBs, DMA/cache/IOMMU, IRQ moderation, DRD role FSM, recovery ladder, counters/trace). PHY/SI/Type-C/retimer/ESD selection stays out-of-scope for this page.

Host enumeration intermittently fails but the waveform looks “normal” — retry accounting or cache coherency first?
Likely cause: control-transfer retry/timeout policy mismatch, or stale descriptor/data due to non-coherent DMA (flush/invalidate missing or mis-scoped).
Quick check: slice EP0 failures by step (RESET/ADDR/DESC/CFG) + count retries per step + verify DMA buffers show monotonic updates (no “old” payload).
Fix: bound retries per step + add exponential backoff/cooldown; harden cache/IOMMU policy for EP0 buffers (explicit sync on submit/complete).
Pass criteria: enum success ≥ X% over Y runs; step retry ≤ N; EP0 timeout ≤ X/1000; no stale-buffer mismatches in N iterations.
Enumeration stalls at GET_DESCRIPTOR / SET_CONFIGURATION — where is the first controller-side sanity check?
Likely cause: EP0 queue starvation (ring depth too small), incorrect control transfer state transitions, or timeout escalation without evidence.
Quick check: record per-step timing and “in-flight” control TRB count; confirm doorbell→event completion path is continuous (no missing events).
Fix: reserve EP0 ring budget; enforce one-step-at-a-time state machine; add deterministic timeout ladder (retry → reset EP0 → reset port → re-enum) with caps.
Pass criteria: per-step time ≤ T ms; missing completion events ≤ X/10k; escalation level matches evidence ≥ X%; re-enum ≤ N/hour.
Device enumerates but the expected function does not show up — descriptor pipeline or alternate settings first?
Likely cause: descriptor tree parsing/validation issue (BOS/SS companion/alt-setting mismatch) causing incorrect interface/EP activation.
Quick check: log parsed descriptor nodes (interfaces/alt/EP types) and compare to what is actually programmed into endpoint contexts/rings.
Fix: strict descriptor validation + fallback path; ensure alt-setting transitions quiesce old EPs before enabling new EP contexts.
Pass criteria: parsed-vs-programmed EP map mismatch = 0; alt-setting switch success ≥ X%; function bind failures ≤ X/1000 enumerations; recovery time ≤ T ms.
Throughput meets target but latency jitter is large — check IRQ moderation or queue depth/doorbell batching first?
Likely cause: IRQ storm or poor moderation (too many small completions), or shallow rings causing bursty doorbell/completion patterns.
Quick check: correlate IRQ rate vs completion batch size vs p99 completion latency; observe backlog slope (steady vs saw-tooth).
Fix: enable moderation/coalescing + cap IRQ rate; increase ring depth; batch doorbells; keep completion batching deterministic under load.
Pass criteria: p99 completion latency ≤ T ms at load L; IRQ rate ≤ X/min; backlog peak ≤ N; jitter (p99–p50) ≤ T ms.
Counters look clean but periodic “micro-stalls” occur — scheduler starvation or timer/housekeeping jitter?
Likely cause: controller scheduling gaps due to housekeeping/interrupt bursts, or completion processing delays masking true stall sources.
Quick check: add trace points at doorbell submit, schedule decision, and completion retire; measure gap histograms between these events.
Fix: isolate housekeeping from critical completion path; prioritize periodic endpoints; bound per-ISR work and shift heavy parsing to deferred context.
Pass criteria: schedule-gap p99 ≤ T µs; completion-retire p99 ≤ T µs; micro-stall count ≤ X/hour; jitter slope stable over Y hours.
After DRD role swap, the device still “exists” but all transfers fail — reset domain or endpoint quiesce ordering first?
Likely cause: swaps happen without draining in-flight DMA/TRBs, leaving stale endpoints/contexts active; or an overly broad reset wipes state unexpectedly.
Quick check: confirm quiesce sequence (stop submit → drain complete → disable EP contexts → switch role → re-init rings) in trace; validate ring pointers stop moving before reset.
Fix: implement strict quiesce barrier + bounded wait; reset the minimal required domains; re-seed rings and counters on role change.
Pass criteria: role swap success ≥ X%; swap recovery time ≤ T ms; “dead transfer” incidents ≤ X/1000 swaps; post-swap event ring integrity errors = 0.
Recovery makes things worse (re-enumeration oscillation) — how to stop retry storms without “reboot everything”?
Likely cause: unbounded retries and immediate re-enumeration create feedback loops; escalation happens without evidence (no cooldown, no fuse).
Quick check: compute retry rate and re-enum rate per minute; verify ladder step triggers require concrete counters/trace evidence.
Fix: introduce exponential backoff + cooldown windows; add circuit breaker (fuse) after N failures; escalate only when evidence persists.
Pass criteria: retry rate ≤ X/s; re-enum ≤ N/hour; recovery success ≥ X%; mean time to stable state ≤ T s.
UASP is less stable than BOT — queue depth/concurrency or recovery policy clearing in-flight commands?
Likely cause: insufficient concurrency budget (stream/tag depth), or recovery clears in-flight queues causing host-side timeouts and replays.
Quick check: track max in-flight commands and abort/reset counts; correlate failures with queue depth and recovery escalations.
Fix: increase command/transfer ring headroom; make recovery selective (per EP/stream) and bounded; preserve progress evidence to avoid full queue flush.
Pass criteria: UASP abort/reset ≤ X/hour; command timeout ≤ X/10k; sustained p99 latency ≤ T ms at queue depth Q; recovery escalations ≤ N/day.
Composite device: one interface drags down the rest — where is the first isolation check?
Likely cause: shared ring/IRQ budget with no per-interface shaping; one EP class monopolizes scheduling or floods interrupts.
Quick check: slice counters/latency by interface/EP; verify that backlog and IRQ contributions are bounded per interface.
Fix: allocate per-interface ring headroom; enforce per-class priorities; apply per-interface rate limiting and IRQ moderation rules.
Pass criteria: cross-interface latency inflation ≤ X%; per-interface backlog ≤ N; starvation events = 0 over Y minutes; IRQ share per interface ≤ X%.
CDC / small interrupt packets cause CPU spikes — IRQ storm or inefficient DMA path?
Likely cause: too many tiny completions and interrupts (no coalescing), or per-packet DMA mapping/sync overhead dominating.
Quick check: measure completions per IRQ and bytes per completion; observe CPU time per completion path vs payload.
Fix: enable completion batching; aggregate small transfers; pre-map buffers and reduce per-packet sync; cap IRQ rate while maintaining bounded latency.
Pass criteria: bytes/IRQ ≥ X; completions/IRQ ≥ Y; CPU% ≤ X% at rate R; p99 latency ≤ T ms.
Isochronous shows drops even when the bandwidth budget looks sufficient — what controller-side check is fastest?
Likely cause: periodic schedule fragmentation or service order jitter (microframe timing), not raw capacity; ring starvation for iso due to competing bulk.
Quick check: log microframe service order for iso EPs; compare iso backlog growth vs bulk activity and IRQ bursts.
Fix: enforce iso priority and reservation; increase iso ring headroom; reduce completion latency tails with IRQ moderation and deterministic scheduling.
Pass criteria: iso drop rate ≤ X/min; iso service jitter p99 ≤ T µs; iso backlog peak ≤ N; p99 completion latency ≤ T ms at load L.
Works for hours, then becomes “fragile” — first check counter windows or ring lifecycle leaks?
Likely cause: ring/descriptor lifecycle leak (TRBs not reclaimed), counter-window resets hiding trends, or accumulating cache/IOMMU sync debt under long-run load.
Quick check: trend backlog/latency/error rates by hour; verify ring free-space monotonicity; confirm counters are sampled with stable windows (no silent resets).
Fix: harden ring reclamation invariants; add periodic health probes; pin telemetry window definitions; add fuse/cooldown to prevent late-stage storms.
Pass criteria: drift (p99 latency) ≤ X% over Y hours; ring free-space never below N; re-enum ≤ N/week; error rate slope stable (|Δ| ≤ X per hour).