
PCIe Controller / Endpoint / Root Complex


This page explains how a PCIe Controller / Root Complex / Endpoint becomes “usable and operable” end-to-end: from enumeration and configuration to interrupts, DMA/IOMMU, ATS/PRI, SR-IOV, and reset/power recovery.

The goal is to turn field symptoms (device missing, MMIO hang, stale DMA, VF no IRQ, timeout floods) into a measurable checklist and pass criteria so issues can be attributed and closed with data.

H2-1 · Scope & Role Map (RC / EP / Controller Contract)

This chapter defines the control-plane contract for PCIe roles so every later section stays in-bounds: who discovers, who configures, who moves data, and who owns recovery decisions.

Covers (this page is responsible for)
  • Discovery & configuration: enumeration, configuration space, capability gating.
  • Resource mapping: BAR sizing/mapping, interrupt delivery (MSI/MSI-X) at the control-plane level.
  • Data movement semantics: MMIO doorbells, DMA visibility basics, IOMMU interaction (concept-level).
  • Feature responsibilities: ATS/PRI, SR-IOV, error reporting and recovery ownership.
  • Operational stability: reset/power semantics from the controller view (what changes, what must be restored).
Does NOT cover (go to sibling pages)
  • PHY / SerDes / SI: eye/jitter, insertion loss, equalization tuning, layout rules.
  • Retimer / Redriver design: CDR/DFE/CTLE device selection and channel extension strategy.
  • Switch deep dive: port fan-out design, switch silicon policies, detailed ACS/ARI behavior.
  • Compliance workflows: official PCI-SIG test flows and lab procedures.

Use this page to decide what the controller must do; use sibling pages to solve how the link meets electrical requirements.

Role distinctions (control-plane semantics)
Root Complex (RC) / Root Port
  • Owns discovery: initiates configuration transactions to find and classify endpoints.
  • Owns global policy: assigns bus numbers and resource windows (address space + interrupts).
  • Owns recovery: decides whether to reset, disable, or degrade when errors occur.
Endpoint (EP)
  • Responds to configuration: exposes capabilities; becomes usable only after RC policy is applied.
  • Drives data-plane: performs DMA (if present) and signals events via MSI/MSI-X.
  • Reports health: provides error status/counters; supports function-level recovery paths.
Upstream / Downstream ports (fabric ports)
  • Route transactions: forward requests and completions; do not own system policy.
  • Shape visibility: affect what is reachable based on windows and routing, without becoming the policy owner.
  • Create failure surfaces: mis-windowing can hide devices; congestion can mimic link instability.
Typical system patterns (kept at navigation depth)
CPU Root Complex ↔ Switch ↔ Multiple Endpoints

This pattern stresses enumeration policy, resource sizing, and operational recovery. If a device “disappears”, start from the RC policy chain (bus numbering → windowing → capability gating) before assuming an electrical issue.

FPGA-based RC or EP in embedded systems

Role boundaries can blur in SoC/FPGA designs. Bring-up becomes easier when responsibilities are explicit: a single owner for enumeration, a single owner for DMA visibility rules, and a single owner for recovery actions.

Capability index (what to look for, not a keyword dump)
  • Enumerate devices: stable discovery across cold boot, warm reset, and hot reset.
  • Configure capabilities: predictable feature gating (MSI-X, SR-IOV, ATS) with verification points.
  • Map BAR & resources: correct sizing and address placement without conflicts.
  • Deliver interrupts: scalable per-queue vectors without storms.
  • Move data safely: DMA visibility rules aligned with IOMMU policy (no stale or wandering writes).
  • Recover from faults: AER-driven triage and reset semantics (PERST#/FLR) that restore usability.
Pass criteria (page-level contract)
  • Enumeration succeeds on cold boot and warm reset with 0 missing functions across X cycles.
  • BAR mapping is stable with no overlaps; MMIO accesses complete without stalls across X minutes.
  • Interrupt delivery remains bounded: no storm events; CPU distribution meets X policy.
  • Error handling is actionable: AER events are classified and recovery restores service within X seconds.
Diagram: Control-plane role map (RC policy & config, EP DMA & interrupts, switch forwarding only).

H2-2 · Transaction Model Primer (TLP/DLLP/Config System View)

PCIe controller issues often present as “random stalls” or “works until load”. A correct system view of posted vs non-posted, completions, and ordering/visibility prevents misdiagnosis and makes later topics (ATS/PRI, SR-IOV) stable.

A) Packet types & why they matter in system behavior
  • Posted requests (typical writes): do not require a completion. The system can appear “fast” while still being wrong if the producer signals too early (doorbell before data visibility).
  • Non-posted requests (typical reads and some control paths): require a completion. These transactions are completion-bound and expose congestion, backpressure, or policy errors as stalls.
  • Completions are the controller’s truth source for “did it really happen”. A system that “hangs” is often actually failing at the completion boundary (timeouts, replays, poisoned data).
Practical consequence checklist
  • If a read stalls: treat it as a completion problem first (policy/congestion/recovery), not a “write problem”.
  • If a write “succeeds” but data is wrong: suspect ordering/visibility between data buffers and doorbells.
  • If config access is flaky: suspect enumeration/windowing/capability gating before assuming electrical instability.
B) Completions: timeout / retry / poison (why systems “fake-hang”)

A “hang” can be a controller waiting at a completion boundary. Treat timeouts and retries as accounting signals: something prevented the completion from arriving in time.

  • Completion timeout: the request left, but the completion did not return in the allowed window. Common controller-side causes include congestion/backpressure, blocked routing, or recovery policy that never converges.
  • Replay / retry behavior: repeated attempts inflate latency and can look like instability under load. The key is whether retries correlate with resource pressure (queues, interrupt moderation, reset loops).
  • ECRC / poisoned data: indicates an integrity concern at the transaction level. The correct first step is to classify the event and decide a recovery action, not to immediately rewrite the electrical design.
60-second triage (first checks)
  1. Confirm whether the failure is completion-bound (reads/config) or visibility-bound (writes/DMA).
  2. Capture controller-visible symptoms: timeout events, retry/replay pressure, and error class (correctable vs fatal).
  3. Decide a safe recovery action: isolate scope (one function vs the whole device), then apply reset semantics (X) and verify re-enumeration.
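As a minimal sketch of triage step 1, the classification can be captured in a few lines of C (the enum and function names are ours, purely illustrative):

```c
/* Triage step 1 sketch: classify the failing operation class before
 * touching hardware. Enum and function names are illustrative only. */
enum op_kind   { OP_MEM_READ, OP_CFG_READ, OP_MEM_WRITE, OP_DMA_WRITE };
enum fail_mode { COMPLETION_BOUND, VISIBILITY_BOUND };

static enum fail_mode classify_failure(enum op_kind op)
{
    /* Non-posted requests wait on a completion; a stall here points at
     * congestion, routing, or recovery policy. */
    if (op == OP_MEM_READ || op == OP_CFG_READ)
        return COMPLETION_BOUND;
    /* Posted writes and DMA "succeed" early; wrong or stale data points
     * at ordering/visibility between buffers and doorbells. */
    return VISIBILITY_BOUND;
}
```

Making this decision first keeps the investigation on the right axis: completion-bound failures get the timeout/retry treatment in B), visibility-bound failures get the ordering treatment in C).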
C) Ordering & visibility rules (the “doorbell before data” trap)

Many controller bring-up failures are not “data corruption” but visibility mismatches. The producer and consumer observe different timelines unless ordering rules are enforced.

  • Rule 1 — Separate data and control: data buffers can be written (posted writes) while control registers (doorbells) signal readiness. If the signal arrives first, the consumer reads incomplete or stale data.
  • Rule 2 — Completion is a barrier for non-posted paths: reads/config are naturally gated by completions; writes need explicit discipline (barriers, sequencing, queue protocol).
  • Rule 3 — DMA visibility depends on the platform contract: IOMMU policy and cache coherency determine what is visible when. Treat visibility as a first-class design requirement.
Pass criteria (placeholders X)
  • Under stress, posted-write control signaling never outruns data visibility (0 stale reads across X iterations).
  • Completion-bound operations do not time out within X load window (read/config success rate ≥ X%).
  • Error events are classifiable and recoverable; recovery converges within X seconds.
Diagram: Request → Fabric → Completion view (shows completion-bound behavior and failure tags). Non-posted paths are completion-bound; posted writes require explicit visibility discipline.

H2-3 · Enumeration & Configuration Space (Power-on → Usable)

This chapter targets the most common bring-up and field failures: a device is missing, or it appears but is not usable. The focus stays on the control-plane chain: bus numbering → bridge windows → configuration gating → BAR sizing/assignment → capability enablement.

Control-plane flow (what “usable” requires)
  1. Discover topology: scan buses and identify bridges/functions without assuming resources exist yet.
  2. Assign bus numbers: set primary/secondary/subordinate so downstream devices are reachable.
  3. Open bridge windows: memory/prefetchable windows must cover the endpoint BAR apertures.
  4. Gate device usability: enable minimum command bits (Memory/Bus Master) only after mapping is valid.
  5. Size and assign BARs: BAR sizing declares resource demand; assignment is the system promise.
  6. Enable capabilities: MSI-X/AER/SR-IOV/ATS are turned on only with clear verify points.
  7. Restore after reset: after hot reset/FLR, re-apply the configuration subset that is not retained (X).
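The bus-numbering part of this flow (steps 1 and 2) can be sketched as a depth-first walk. The data model below is a simplified stand-in (a real implementation discovers children via configuration reads rather than a prebuilt tree); the invariant it demonstrates is checklist item 6: the subordinate number must cover the full scan depth.

```c
/* Depth-first bus-number assignment over a simplified topology model. */
#define MAX_CHILDREN 4

struct pci_node {
    int is_bridge;
    int primary, secondary, subordinate;   /* filled in by the scan */
    int nchildren;
    struct pci_node *children[MAX_CHILDREN];
};

/* Returns the highest bus number consumed in this subtree. */
static int assign_buses(struct pci_node *n, int parent_bus, int *next_bus)
{
    if (!n->is_bridge)
        return parent_bus;                 /* endpoints live on the parent bus */

    n->primary   = parent_bus;
    n->secondary = (*next_bus)++;          /* bus immediately behind the bridge */

    int deepest = n->secondary;
    for (int i = 0; i < n->nchildren; i++) {
        int d = assign_buses(n->children[i], n->secondary, next_bus);
        if (d > deepest)
            deepest = d;
    }
    n->subordinate = deepest;              /* too small => devices disappear */
    return deepest;
}
```

If a bridge's subordinate value is smaller than the deepest bus behind it, everything past that depth becomes unreachable, which is exactly the "missing behind bridge" symptom in the checklist below.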
A) Checklist — the first 8 fields to read (lspci-style)
1) Vendor ID / Device ID

Good: stable IDs match expectation. If not: suspect reachability (bridge numbering/window) or reads returning invalid values.

2) Class Code / Revision

Good: correct function class and expected mode. If off: suspect firmware strapping or device mode gating.

3) Header Type (Endpoint vs Bridge / Multi-function)

Good: header matches expected topology role. If mismatched: re-check the scan path and function interpretation.

4) Command (Memory / Bus Master enable)

Good: Memory and Bus Master enabled only after resources are assigned. If Memory is off: MMIO appears dead. If Bus Master is off: DMA cannot work.

5) Status / Capabilities pointer

Good: capability list is present and consistent. If missing: suspect partial enumeration or firmware policy limitations.

6) Bus numbers (Primary / Secondary / Subordinate on bridges)

Good: subordinate range covers downstream scan depth. If too small: devices “behind” the bridge disappear.

7) Bridge windows (Memory / Prefetchable)

Good: windows are large and aligned enough for the endpoint BAR apertures. If undersized: BAR assignment may succeed but accesses fail or get blocked.

8) BAR base + aperture sizing

Good: base is non-zero, aligned, and the aperture matches the device need. If 0/overlap: resource policy or placement conflicts are likely.

Pass criteria (first-boot visibility)
  • All expected functions appear with stable IDs across X cold boots.
  • Bridge subordinate ranges cover full scan depth; no “missing behind bridge” events in X cycles.
  • BARs are non-zero, aligned, and reachable; MMIO reads/writes complete without stalls over X minutes.
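BAR sizing works by writing all-ones to the BAR and decoding the bits the device refuses to latch. A hedged C sketch of the 32-bit memory-BAR decode follows (the real sequence also handles 64-bit and I/O BARs, and must restore the original BAR value afterward):

```c
#include <stdint.h>

/* 32-bit memory BAR sizing decode: after software writes 0xFFFFFFFF to
 * the BAR, the device returns its mask; the hardwired-zero low bits
 * encode the aperture. Bits [3:0] of a memory BAR are attribute bits
 * (type / prefetchable), not address bits. */
static uint64_t bar_aperture_size(uint32_t readback_after_all_ones)
{
    uint32_t addr_bits = readback_after_all_ones & ~0xFu;
    if (addr_bits == 0)
        return 0;                       /* BAR not implemented */
    return (uint64_t)(~addr_bits) + 1;  /* two's complement of the mask */
}
```

A readback of 0xFFFF0000 therefore decodes to a 64 KB aperture, and the aperture is always a power of two, which is why alignment and window budgeting matter in the cases below.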
B) BAR sizing playbook (card list, no tables)
Case 1 — Device appears, but MMIO reads return errors or time out
  • Likely cause: Memory window does not cover the BAR base+aperture, or access is blocked by policy.
  • Quick check: BAR base/alignment + bridge memory/prefetchable windows.
  • Fix direction: expand/realign windows; ensure Memory enable is applied after placement is valid.
  • Pass criteria: MMIO completes with 0 stalls over X accesses.
Case 2 — BAR assignment fails or lands at 0 (resource exhaustion)
  • Likely cause: host aperture is insufficient or fragmented; alignment requirements cannot be satisfied.
  • Quick check: requested aperture size (BAR sizing) vs available host/bridge window budget.
  • Fix direction: re-place resources, adjust window budgets, or reduce optional apertures (X).
  • Pass criteria: BAR base non-zero, aligned, stable across X resets.
Case 3 — Prefetchable behavior is inconsistent (performance swings)
  • Likely cause: prefetchable vs non-prefetchable mapping does not match the device’s expectation.
  • Quick check: BAR attributes (prefetchable bit) and mapping policy consistency across resets.
  • Fix direction: make mapping deterministic; avoid mixing policies per boot path (X).
  • Pass criteria: throughput variance < X% across X runs.
Case 4 — Hot reset restores link, but the device is “half-dead”
  • Likely cause: configuration subset is not restored (Command, MSI-X, SR-IOV state, BAR re-placements).
  • Quick check: compare key fields before/after reset (Command, BAR base, enabled capabilities).
  • Fix direction: define a deterministic restore list and re-verify enumeration chain (X).
  • Pass criteria: post-reset service returns within X seconds with 0 missing functions.
C) Capability map (feature → enable action → verify signal)
Interrupt family: MSI / MSI-X
  • Enable action: choose MSI-X when per-queue scaling is required; keep vector mapping deterministic.
  • Verify signal: vector usage and per-queue event separation are visible (counts and mapping match expectation).
Reliability family: AER
  • Enable action: turn on actionable error reporting so faults can be classified and recovered.
  • Verify signal: errors are categorized (correctable vs fatal) and recovery policy converges within X.
Virtualization family: SR-IOV
  • Enable action: allocate PF/VF resources in a way that windows and BAR budgets can sustain.
  • Verify signal: VFs enumerate and remain stable across reset; resource placement does not overlap.
Translation family: ATS
  • Enable action: enable only when the platform contract for translation and isolation is clear.
  • Verify signal: translation path is actually exercised and remains stable under load (X).
Power family: PM
  • Enable action: treat D-state transitions as policy-driven; keep restore paths deterministic.
  • Verify signal: resume does not cause re-enumeration loss or capability regressions across X.
Pass criteria (capability stability)
  • Capabilities remain enabled as intended across X hot resets with no regressions.
  • BAR placement and bridge windows remain compatible after VF creation or policy changes (0 overlaps, 0 unreachable apertures).
Diagram: Configuration space navigation (Header → BAR sizing/placement → Capability gating). Read the structure first and use it as a navigation map for bring-up decisions rather than as a bitfield table.

H2-4 · Interrupts & Doorbells (Legacy / MSI / MSI-X → Scalable)

Many endpoint performance failures are notification failures: the interrupt model and queue signaling determine whether throughput scales or collapses into interrupt storms and tail-latency spikes.

1) Model — from “can interrupt” to “can scale”
  • Legacy interrupts are limited for modern multi-queue devices and often collapse under load.
  • MSI improves routing, but vector count and isolation may still limit per-queue scaling.
  • MSI-X enables per-queue vectors so hot queues can be isolated, moderated, and distributed across CPUs.
Scaling principle

A scalable design binds queue → vector → CPU deterministically, then controls interrupt rate with moderation rather than allowing storm-driven scheduling.

2) Configuration — MSI-X as a per-queue contract
  • Vector budget: allocate enough vectors for the active queues (Q0..Qn).
  • Mapping determinism: keep queue-to-vector mapping stable across reset and mode changes.
  • Isolation intent: avoid funneling multiple hot queues into one vector unless that is explicitly intended.
Quick check (configuration sanity)
  • Per-queue interrupts are observable: each hot queue produces its own vector activity (no hidden funneling).
  • No “vector starvation”: vectors are not shared unintentionally across unrelated traffic classes (X).
3) Performance tuning — moderation before storm

Interrupt frequency should track useful work. Too many notifications create storms; too few increase tail latency. Use policy knobs in this order:

  1. Fix mapping first: ensure queue → vector mapping reflects the intended parallelism.
  2. Then moderate: apply coalescing/moderation to cap interrupt rate (X).
  3. Then distribute: set affinity so hot vectors do not pile onto one CPU core.
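The "moderate" step can be sketched as a simple coalescing model; the thresholds and field names are illustrative, not any specific device's register layout:

```c
#include <stdint.h>

/* Coalescing sketch (assumed device model): raise one MSI-X event when
 * either `max_frames` completions have accumulated or `max_usecs` have
 * elapsed since the first pending completion. */
struct coalesce {
    unsigned max_frames, max_usecs;
    unsigned pending;
    uint64_t first_pending_us;
};

static void on_completion(struct coalesce *c, uint64_t now_us)
{
    if (c->pending++ == 0)
        c->first_pending_us = now_us;   /* start the timer on first event */
}

static int should_fire(const struct coalesce *c, uint64_t now_us)
{
    if (c->pending == 0)
        return 0;
    if (c->pending >= c->max_frames)
        return 1;                       /* frame threshold: batch is full */
    return (now_us - c->first_pending_us) >= c->max_usecs;  /* latency cap */
}
```

The two thresholds express the trade-off stated above: `max_frames` caps the interrupt rate under load, while `max_usecs` bounds the tail latency added by batching.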
Doorbells vs interrupts (queue devices)
  • Doorbell (MMIO): signals new work; too frequent writes can become a control-path bottleneck.
  • MSI-X interrupt: signals completion/events; without moderation, the CPU becomes the bottleneck.
  • Discipline: keep doorbell signaling aligned with visibility rules (data before signal).
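The last discipline point, data before signal, looks like this in a hedged C sketch. Here `atomic_thread_fence` stands in for the platform's write barrier and `db` for a real MMIO doorbell register; on actual hardware both would be platform-specific primitives:

```c
#include <stdatomic.h>
#include <stdint.h>

/* "Data before signal": publish the descriptor, force visibility, then
 * ring. In this sketch both pointers are plain memory; a real driver
 * uses a wmb-style barrier and a posted MMIO write. */
static void submit_work(volatile uint32_t *desc_slot, uint32_t desc,
                        volatile uint32_t *db, uint32_t tail)
{
    *desc_slot = desc;                          /* 1. data/descriptor first */
    atomic_thread_fence(memory_order_release);  /* 2. make it visible       */
    *db = tail;                                 /* 3. then the doorbell     */
}
```

Swapping steps 1 and 3 is the "doorbell before data" trap from H2-2: the device may read a stale or half-written descriptor.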
4) Verification — data-driven acceptance criteria
Metric 1: Interrupt rate

Pass criteria: interrupt rate remains within X / second under peak load, without storm bursts.

Metric 2: CPU distribution

Pass criteria: hot vectors are spread per policy; no single core exceeds X% sustained interrupt handling load.

Metric 3: Queue latency

Pass criteria: queue completion latency meets p99 < X under steady-state load.

Metric 4: Throughput stability

Pass criteria: throughput does not collapse at peak; variance stays within X% over X runs.

Diagram: Queue → MSI-X vector → CPU mapping for scalable interrupt handling. Moderation is a knob: cap the interrupt rate (X/s) without collapsing per-queue isolation.

H2-5 · DMA & Memory Model (Host Memory, IOMMU, Coherency, Visibility)

DMA failures are rarely “random.” Most fall into a short control-and-visibility chain: address domain (IOVA vs PA) → mapping lifetime (IOMMU/IOTLB) → coherency world (coherent vs non-coherent) → ordering boundary (data vs descriptor vs doorbell). This chapter focuses on why DMA reads stale data or writes to the wrong place, and how to prove the root cause with measurable checks.

A) DMA bring-up shortest path (minimum viable demo)
Step 0 — define a deterministic test buffer
  • Pattern: fixed-size buffer (e.g., X KB) filled with a known signature and checksum.
  • Alignment: enforce X-byte alignment; keep a separate “misaligned” variant for later.
  • Boundary controls: create a “cross-page” variant (buffer spans two pages) to trigger mapping edge cases.
Step 1 — prove address width and reachability
  • 64-bit DMA enable: confirm the device consumes the full DMA address width (avoid high-bit truncation).
  • IOVA/PA sanity: record the DMA address used by the device and the backing physical placement (for correlation).
  • No surprise remap: pin or otherwise stabilize the mapping during the demo to isolate variables (X).
Step 2 — enforce visibility boundaries (data vs signal)
  • CPU → device: ensure data/descriptor writes become visible before ringing a doorbell (flush or barrier as required).
  • Device → CPU: ensure CPU reads see device writes (invalidate or coherent path).
  • Descriptor discipline: treat descriptors as control-plane; keep them separate from large data buffers.
Pass criteria (minimum DMA demo)
  • Pattern checksum matches after X DMA read/write cycles (no corruption, no stale reads).
  • Cross-page variant completes with 0 unexpected faults over X iterations.
  • High-address placement remains correct (no truncation) across X resets.
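Step 0's deterministic buffer can be sketched as follows. The position-dependent fill and FNV-1a checksum are illustrative choices, not mandated by anything above; any derivable pattern works, as long as stale or partial DMA changes the checksum:

```c
#include <stdint.h>
#include <stddef.h>

/* Deterministic test buffer (Step 0): fill is a function of position and
 * seed, so a stale, shifted, or truncated transfer is detectable. */
static void fill_pattern(uint8_t *buf, size_t len, uint32_t seed)
{
    for (size_t i = 0; i < len; i++)
        buf[i] = (uint8_t)(seed ^ (i * 131u));  /* byte depends on position */
}

static uint32_t fnv1a(const uint8_t *buf, size_t len)
{
    uint32_t h = 2166136261u;                   /* FNV offset basis */
    for (size_t i = 0; i < len; i++) {
        h ^= buf[i];
        h *= 16777619u;                         /* FNV prime */
    }
    return h;
}
```

Usage follows the demo flow: fill the source, run the DMA, checksum the destination, and compare; a single flipped or stale byte fails the comparison.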
B) IOMMU fault triage (fault → locate → fix)
Fault classification tree (first decision points)
  1. Not-present / unmapped: the IOVA has no valid translation at the time of access.
  2. Permission: mapping exists but violates R/W permissions or device domain policy.
  3. Address width / format: the device issued a truncated or malformed DMA address (common with 64-bit not enabled).
  4. Lifetime mismatch: mapping was torn down or reused while the device still uses it (stale mapping).
  5. Cross-page edge: first page mapped, second page missing; faults only on certain buffer sizes.
Quick checks (fastest to confirm/deny)
  • Correlation: match the faulting IOVA against the intended buffer range (start/end).
  • Repro trigger: test the cross-page variant and a high-address placement variant.
  • Lifetime probe: hold mapping longer (X) and verify whether faults disappear.
  • Permissions: enforce least-privilege and confirm the required R/W bits are present.
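The correlation quick check can be mechanized. The sketch below returns a first hypothesis to test, not a verdict, and the truncation check is deliberately simplified to an exact low-32-bit match against the buffer start:

```c
#include <stdint.h>

#define PAGE_SIZE 4096u

enum fault_hint {
    HINT_IN_RANGE,              /* mapped range itself is suspect (perm/lifetime) */
    HINT_IN_RANGE_SECOND_PAGE,  /* cross-page mapping hole (classification 5)     */
    HINT_TRUNCATED_HIGH_BITS,   /* 64-bit DMA not enabled end-to-end (item 3)     */
    HINT_FOREIGN                /* stale/reused mapping or unrelated traffic      */
};

static enum fault_hint classify_fault(uint64_t fault_iova,
                                      uint64_t buf_start, uint64_t buf_len)
{
    if (fault_iova >= buf_start && fault_iova < buf_start + buf_len) {
        /* Inside the intended range: a fault past the first page
         * boundary suggests the second page was never mapped. */
        uint64_t first_page_end = (buf_start / PAGE_SIZE + 1) * PAGE_SIZE;
        return fault_iova >= first_page_end ? HINT_IN_RANGE_SECOND_PAGE
                                            : HINT_IN_RANGE;
    }
    /* Simplified truncation check: low 32 bits of a high buffer match
     * the faulting address exactly. */
    if (buf_start > UINT32_MAX && fault_iova == (buf_start & UINT32_MAX))
        return HINT_TRUNCATED_HIGH_BITS;
    return HINT_FOREIGN;
}
```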
Fix directions (root-cause aligned)
  • Unmapped: ensure map occurs before doorbell; avoid racing map/unmap with in-flight DMA.
  • Permission: correct R/W permissions and domain binding; avoid over-broad mappings as a “fix.”
  • Width: enable 64-bit addressing end-to-end; validate that high bits are preserved.
  • Lifetime: prevent IOVA reuse until all DMA completion is observed; add a drain point (X).
  • Cross-page: map the entire span; validate scatter/gather segments do not leave holes.
Pass criteria (fault stability)
  • 0 IOMMU faults across X hours of peak DMA load.
  • Fault recovery (if intentionally injected) converges within X ms (p99), no retry storms.
C) Coherency strategy (when flush/invalidate is mandatory)
Three coherency worlds
  • Non-coherent: software must manage cache visibility explicitly (flush/invalidate), or stale reads are expected.
  • Coherent interconnect: data visibility is largely automatic, but ordering between data and signals still matters.
  • Mixed: some buffers are coherent while others are streaming; inconsistent policy causes “only sometimes” failures.
Decision rules (fast and practical)
  • Descriptors: treat as control-plane; ensure CPU writes are visible before doorbell.
  • Device reads (CPU → device): flush/clean cache lines that contain DMA-read data on non-coherent paths.
  • Device writes (device → CPU): invalidate cache lines before CPU consumption on non-coherent paths.
  • Signal ordering: data becomes visible first, then notify (doorbell / flag / interrupt).
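On non-coherent paths the decision rules reduce to a direction-driven sync helper. In this sketch, counters stand in for the platform's real cache-maintenance primitives so the decision logic is observable; a real port calls the architecture's clean/invalidate routines instead:

```c
#include <stddef.h>

static int cleans, invalidates;
static void cache_clean(void *b, size_t n)      { (void)b; (void)n; cleans++; }
static void cache_invalidate(void *b, size_t n) { (void)b; (void)n; invalidates++; }

enum dma_dir { DMA_TO_DEVICE, DMA_FROM_DEVICE };

/* Rule table from above: CPU->device needs a clean before the device
 * reads; device->CPU needs an invalidate before the CPU consumes;
 * coherent paths need neither (ordering is handled at the doorbell). */
static void dma_sync(void *buf, size_t len, enum dma_dir dir, int coherent)
{
    if (coherent)
        return;                      /* coherent path: no maintenance needed */
    if (dir == DMA_TO_DEVICE)
        cache_clean(buf, len);       /* flush dirty lines the device will read */
    else
        cache_invalidate(buf, len);  /* drop stale lines the CPU would read */
}
```

Mixed-coherency systems fail "only sometimes" precisely when one buffer skips this decision; making the direction explicit per buffer removes the ambiguity.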
Pass criteria (stale read/write elimination)
  • 0 stale reads in pattern test across X million DMA operations.
  • Doorbell/descriptor ordering violations remain at 0 across X stress runs.
Diagram: DMA uses an IOVA visible to the device; the IOMMU translates it to a PA and enforces permissions. Faults often indicate mapping lifetime, permission, width, or cross-page issues. Visibility boundary: data and descriptors become visible first, then the doorbell (flush/invalidate as required).

H2-6 · ATS / PRI / PASID (Device-side Translation: Performance vs Safety)

ATS/PRI/PASID shift part of address-translation intelligence toward the device. The upside is lower translation overhead and better tail latency in translation-heavy workloads. The downside is a new failure space: stale translations, invalidation mistakes, and fault recovery instability. This chapter defines the enablement gates, the minimum bring-up proof, and measurable acceptance criteria.

A) Feature gate (platform prerequisites)
Must-have gates
  • IOMMU stable: DMA is already correct and fault-free without ATS/PRI.
  • Capabilities visible: ATS/PRI/PASID capabilities can be discovered and configured.
  • Invalidation path: an explicit translation invalidation mechanism exists and is observable.
  • Policy contract: translation and protection domains are defined (no “enable and hope”).
Nice-to-have gates
  • Counters/telemetry: ATC hit/miss, ATS request rate, PRI request rate, and fault counters.
  • Recovery policy: bounded retries, backoff, and a deterministic “disable ATS/PRI” fallback (X).
  • Isolation validation: PASID/domain separation can be tested and proven (X scenarios).
B) Bring-up checklist (minimum proof: ATS + invalidate + fault recovery)
1) Prove ATS is actually exercised (not just “enabled”)
  • Workload: translation-heavy access pattern (high-frequency, small random I/O).
  • Evidence: ATC hit/miss counters change as expected; ATS requests correlate with misses.
  • Sanity control: disable ATS and confirm the performance/latency signature changes (X).
2) Prove invalidation correctness (stale translation prevention)
  • Stimulus: change a mapping or revoke access intentionally (controlled test case).
  • Evidence: device stops using the old translation after invalidation; no silent writes to old PA.
  • Failure signature: repeated faults or corruption indicates stale ATC/IOTLB state (X).
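The invalidation contract can be illustrated with a one-entry device-side cache. Real ATCs are larger and tag entries differently, but the rule being tested is the same: after an invalidation, the stale IOVA→PA pair must never be used again:

```c
#include <stdint.h>

/* One-entry ATC model (illustrative, not a hardware description). */
struct atc { uint64_t iova, pa; int valid; };

static void atc_fill(struct atc *c, uint64_t iova, uint64_t pa)
{
    c->iova = iova; c->pa = pa; c->valid = 1;
}

static void atc_invalidate(struct atc *c, uint64_t iova)
{
    if (c->valid && c->iova == iova)
        c->valid = 0;                /* honor the invalidation, no grace period */
}

/* Returns 1 and sets *pa on a hit; 0 forces a fresh ATS translation
 * request instead of reusing stale state. */
static int atc_lookup(const struct atc *c, uint64_t iova, uint64_t *pa)
{
    if (c->valid && c->iova == iova) { *pa = c->pa; return 1; }
    return 0;
}
```

The failure signature described above corresponds to `atc_invalidate` being skipped or ignored: the lookup keeps returning the old PA, and the device silently writes to revoked memory.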
3) Prove PRI recovery is bounded (no request storms)
  • Stimulus: inject a page-miss scenario; observe PRI request behavior.
  • Evidence: mapping is established (or denial is returned) and the device resumes or fails deterministically.
  • Guardrail: bounded retry count and backoff; escalate to fallback if no convergence (X).
Common risks (what to prevent explicitly)
  • Stale translation: ATC uses an old mapping after revoke/remap → data corruption risk.
  • Invalidation gaps: invalidation sent but not honored/observed → intermittent faults or silent misdirected DMA.
  • PRI storms: missing backoff and retry bounds → tail-latency explosion and system instability.
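The PRI guardrail (bounded retries plus backoff) can be sketched as follows. The `outcomes` array is a stand-in for the OS page-request service, recording whether attempt i resolves the page; a real driver would sleep for the backoff interval instead of accumulating it:

```c
/* Bounded PRI recovery sketch: retries are capped and backed off
 * exponentially; no convergence ends in a deterministic fallback
 * (e.g. disable ATS/PRI) rather than a request storm. */
static int resolve_page(const int *outcomes, int max_retries,
                        unsigned base_backoff_us, unsigned *waited_us)
{
    unsigned backoff = base_backoff_us;
    *waited_us = 0;
    for (int attempt = 0; attempt <= max_retries; attempt++) {
        if (outcomes[attempt])
            return 1;                /* mapping established, device resumes */
        *waited_us += backoff;       /* a real driver would sleep here */
        backoff *= 2;                /* exponential backoff */
    }
    return 0;                        /* escalate to the fallback path */
}
```

The two properties the acceptance criteria below measure are visible here: total wait is bounded and grows predictably, and a non-converging fault exits deterministically instead of retrying forever.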
C) Pass criteria (quantified thresholds — placeholders)
ATS effectiveness
  • ATC hit rate: ≥ X% on the target workload.
  • Tail latency: p99 improves by ≥ X% (or stays within X if throughput is the priority).
PRI stability
  • PRI request rate: ≤ X / s in steady state.
  • Fault convergence: ≤ X ms (p99), no retry storms.
PASID / isolation safety
  • Separation proof: cross-context access attempts fail deterministically (X test matrix).
  • No silent corruption: invalidation/remap tests show 0 “old PA” writes across X cycles.
Diagram: ATS fills the device-side translation cache (ATC) on misses; PRI handles page misses via OS/driver mapping, then the device resumes. Invalidation correctness prevents stale translations. Proof requires counters: ATC hit/miss, ATS/PRI rate, and bounded fault recovery (X).

H2-7 · SR-IOV & Multi-Function (PF/VF → Isolation & Operability)

SR-IOV turns a device from “it runs” into “it can be safely shared and operated.” The practical goal is a controlled resource slicing model (queues, MSI-X vectors, BAR windows), a repeatable enablement chain (gate → enable → enumerate → validate), and an explicit per-VF observability contract (counters, resets, and bounded failure domains).

A) SR-IOV enable playbook (firmware → platform → OS stack)
Gate → Enable → Enumerate → Validate
  1. Gate: confirm SR-IOV capability is visible and permitted by platform policy (X).
    Quick check: capability present; platform does not block VF creation.
  2. Enable: set the VF count and provision VF resource windows (BAR space, vector budget).
    Quick check: VF count reflects the intended number; resource window sizing does not collide.
  3. Enumerate: verify VFs appear consistently across resets and hot resets (X).
    Quick check: VF enumeration success ≥ X% over X cycles.
  4. Validate: verify per-VF queues, interrupts, and counters are functional and isolated.
    Quick check: per-VF counters increment only for the target VF; vector/queue mapping matches policy.
Pass criteria (enablement)
  • VFs enumerate successfully ≥ X% across X reset cycles.
  • Per-VF BAR windows are conflict-free (0 overlaps), and VF MMIO is accessible without errors.
  • Per-VF MSI-X vectors deliver interrupts to the intended CPUs with stable distribution (X).
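The slicing arithmetic behind step 2 can be checked up front, before any VF is enabled. The structure and all numbers below are illustrative placeholders, not a real device's budget:

```c
/* VF slicing sanity sketch: the requested slicing must fit the PF's
 * totals, including the PF's own reservation, before VF enablement. */
struct pf_budget { unsigned total_vectors, total_queues; };

static int vf_slicing_fits(const struct pf_budget *pf, unsigned num_vfs,
                           unsigned vec_per_vf, unsigned q_per_vf,
                           unsigned pf_reserved_vec, unsigned pf_reserved_q)
{
    return num_vfs * vec_per_vf + pf_reserved_vec <= pf->total_vectors
        && num_vfs * q_per_vf  + pf_reserved_q  <= pf->total_queues;
}
```

Running this check at "Gate" time turns a late, confusing enumeration failure into an early, explicit provisioning error.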
B) Operability contract (per-VF counters / reset / firmware compatibility)
Per-VF counters (minimum dashboard)
  • Traffic: throughput and queue depth/latency (p95/p99 placeholders X).
  • Errors: retry/timeout/fatal counters per VF, plus per-queue drops (X).
  • Interrupt health: per-vector counts and coalescing effectiveness (X).
Per-VF reset and recovery
  • Reset granularity: VF recovery must not disrupt neighbor VFs (bounded failure domain).
  • Recovery time: VF returns to “ready” within X ms (p95/p99).
  • State cleanup: queue/doorbell/counter state converges after reset (no ghost traffic).
Firmware / driver compatibility envelope
  • Matrix: define known-good combinations (firmware, PF driver, VF driver) with version bounds (X).
  • Upgrade rule: upgrades must preserve VF stability across reset paths (H2-8 linkage).
  • Rollback rule: rollback must restore VF operability without manual re-provisioning (X).
C) Failure scenarios library (symptom → first check → direction)
VF enumeration is unstable
  • First check: reset path semantics and timing (PERST#/hot reset/FLR interactions with VF provisioning).
  • Direction: ensure VF provisioning happens after the selected reset path completes; avoid racing enumeration (X).
VF disappears after hot reset or power event
  • First check: whether the reset type clears VF enablement state and requires reprovision.
  • Direction: treat VF provisioning as a state that must be re-applied after specific resets (X).
One VF degrades all tenants (shared slowdown)
  • First check: queue/vector budgets and per-VF rate limits (resource contention).
  • Direction: enforce per-VF budgets and verify tail latency isolation via per-VF counters (X).
VF driver binds incorrectly or sporadically
  • First check: PF/VF role separation and compatibility matrix (firmware + PF + VF stack versions).
  • Direction: pin known-good combinations and validate across reset sequences (X).
IOMMU grouping blocks intended isolation
  • First check: whether VF assignment boundaries match the intended isolation domain.
  • Direction: align VF tenancy model with the platform’s isolation domain contract (no implicit assumptions).
Diagram: The PF provisions multiple VFs. Each VF has a budget of queues and MSI-X vectors plus a per-VF observability contract (stats and a bounded reset domain). Resource slicing covers queues, MSI-X, and BAR windows, with per-VF stats; isolation is measured by counters.

H2-8 · Reset / Power Management / Hot-Plug (From “Enumerates” to “Never Disappears”)

Field stability often fails at the control-plane state machine: reset semantics, power-state transitions, and hot-plug re-enumeration. This chapter provides a practical decision tree for selecting PERST#, Hot Reset, FLR, and function resets, then a safe tuning order for ASPM/L1SS, and measurable pass criteria for recovery time and “device missing” events.

A) Reset decision tree (symptom → pick reset type)
Selection rules (practical)
  • Config/functional corruption but link is up: prefer FLR / function reset to keep the impact bounded.
    Verify after reset: enumeration intact, BAR/MMIO usable, error counters converge (X).
  • Link training or enumeration is unstable: prefer Hot Reset first, then escalate to PERST# if instability persists.
    Verify after reset: link stays up for X minutes under load; enumeration success ≥ X%.
  • SR-IOV VF recovery: reset selection must account for whether VF provisioning state is cleared and must be re-applied (X).
    Verify after reset: VF count restored, per-VF queues/interrupts/counters are functional.
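The selection rules above can be captured as a tiny decision function; the symptom names are ours, purely illustrative:

```c
/* Reset-selection sketch mirroring the rules above: prefer the smallest
 * blast radius (FLR), retrain on link instability (Hot Reset), and
 * escalate to PERST# only when instability persists. */
enum symptom   { SYM_FUNC_CORRUPT_LINK_UP, SYM_LINK_UNSTABLE, SYM_LINK_STILL_UNSTABLE };
enum reset_sel { RESET_FLR, RESET_HOT, RESET_PERST };

static enum reset_sel pick_reset(enum symptom s)
{
    switch (s) {
    case SYM_FUNC_CORRUPT_LINK_UP: return RESET_FLR;   /* bounded impact  */
    case SYM_LINK_UNSTABLE:        return RESET_HOT;   /* retrain first   */
    default:                       return RESET_PERST; /* full escalation */
    }
}
```

Encoding the policy this way keeps escalation deterministic: operators verify the post-reset criteria for the chosen level before moving to the next, rather than jumping straight to PERST#.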
B) Power states tuning order (ASPM/L1SS without “device missing”)
Safe sequence: stabilize first, then enable power savings gradually
  1. Baseline (no deep power states): prove 24h stability (0 device-missing events, error counters converge).
    Metrics: recovery time, enumeration success, AER counters (X).
  2. Enable shallow savings: turn on one mechanism at a time; measure exit latency impact (X).
    Check: no increase in completion timeouts; no surprise retrains/resets.
  3. Enable deeper states (L1SS): validate wake/exit timing and device readiness within platform budgets (X ms).
    Check: “device missing” remains 0/24h; p99 latency stays within X.
Fast diagnosis if enabling ASPM/L1SS causes dropouts
  • First check: exit latency and readiness timing vs platform timeout budgets (X).
  • Second check: unexpected re-enumeration or resource reshuffle after wake.
  • Direction: back off to the last stable state; then re-enable stepwise with metrics.
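The "stabilize first, then deepen, back off on regression" sequence above can be sketched as a one-step state machine. The stage names and the single `stable_24h` predicate are illustrative assumptions; a real gate would combine device-missing counts, timeout trends, and exit-latency measurements.

```python
# Sketch of the stepwise ASPM/L1SS enablement policy described above.
# Stage names and the stability predicate are illustrative assumptions.

STAGES = ["baseline", "shallow_aspm", "l1ss"]

def next_power_stage(current: str, stable_24h: bool) -> str:
    """Advance one stage only after the current stage proves stable;
    back off one stage on any device-missing or timeout regression."""
    i = STAGES.index(current)
    if not stable_24h:
        return STAGES[max(i - 1, 0)]   # back off to the last stable state
    return STAGES[min(i + 1, len(STAGES) - 1)]
```

The deliberate asymmetry (one step forward only on proven stability, one step back on any regression) is what prevents oscillating between deep power states and dropouts.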
C) Pass criteria (recovery time, enumeration success, counter convergence)
  • Recovery time: ≤ X ms (p95) and ≤ X ms (p99) for the selected reset path.
  • Re-enumeration: success ≥ X% across X cycles (including hot resets).
  • Device missing: 0 / 24h under peak traffic + power-state toggling.
  • Error convergence: fatal = 0; correctable ≤ X after stabilization window (X minutes).
  • State consistency: power state transitions do not change functional resource layout unexpectedly.
Acceptance: recovery time ≤ X ms, re-enumeration ≥ X%, device missing = 0/24h, errors converge. Tune power states stepwise and measure exit latency and stability at each step.
Diagram: Control-plane timeline for enumeration, activity, low-power entry/exit, and hot-plug. Resets and power transitions must be validated with recovery time and “device missing” metrics.

H2-9 · RAS & Observability (AER/ECRC/Logs/Counters for Root-Cause)

When symptoms look like SI, control-plane evidence must be checked first. This chapter defines a minimal observability panel (AER classes, replay/timeout, ECRC/poison), then provides a repeatable attribution template: symptom → counters to capture → likely layer → next-page jump (PHY/Retimer/Compliance links only, no cross-topic expansion).

A) Minimal observability panel (10 key signals)
Link health (trend first, not snapshots)
  • AER Correctable: rate (X / min) and burstiness (p95/p99 window X).
  • AER Non-fatal: count per hour (X) and correlation with workload/power states.
  • AER Fatal: 0 tolerated; always capture a full snapshot window (pre/post).
  • Replay counter: indicates retransmit pressure; track vs throughput (X).
  • Completion timeout: indicates blocked completions; track vs queue depth (X).
Data integrity signals (evidence that must be preserved)
  • ECRC errors: treat as integrity evidence; correlate with retries and resets (X).
  • Poisoned TLP: indicates corrupted payload propagated upward; always log context (X).
Recovery and stability KPIs (field-operability)
  • Device-missing events: must be 0 / 24h under stress (X).
  • Reset counters: per type (PERST#/hot reset/FLR) with recovery time p95/p99 (X ms).
  • Re-enumeration success: ≥ X% across X cycles, including power transitions.
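Because the panel above is "trend first, not snapshots," each counter needs a rate over a retained window rather than a raw reading. A minimal sketch of such a tracker (window size and sampling cadence are placeholders):

```python
# Minimal trend tracker for the observability panel: rates, not snapshots.
# Window length is a placeholder; counters are assumed monotonically increasing.

from collections import deque

class CounterTrend:
    """Track a monotonically increasing counter as a per-window rate."""

    def __init__(self, window: int = 60):
        self.samples = deque(maxlen=window)  # (timestamp_s, raw count)

    def sample(self, ts: float, count: int) -> None:
        self.samples.append((ts, count))

    def rate(self) -> float:
        """Events per second over the retained window."""
        if len(self.samples) < 2:
            return 0.0
        (t0, c0), (t1, c1) = self.samples[0], self.samples[-1]
        return (c1 - c0) / (t1 - t0) if t1 > t0 else 0.0
```

One instance per signal (replay, completion timeout, AER correctable, ...) makes the attribution cards in the next section answerable from data instead of memory.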
B) Attribution matrix (no tables; if/then cards)
IF: completion timeout + replay increase with load, ECRC/poison stays 0
  • THEN: congestion / backpressure / software queueing is more likely than a raw channel defect.
  • CAPTURE: replay + timeout trend vs throughput + queue depth (window X).
  • NEXT: jump to software stack gating and queue strategy (H2-10); only later validate channel via compliance page.
IF: ECRC errors or poisoned TLP appears (even if throughput is low)
  • THEN: treat as integrity evidence requiring preservation and controlled reproduction.
  • CAPTURE: AER class + ECRC/poison counters + state snapshot (pre/post window X).
  • NEXT: validate with PHY/retimer and compliance workflows (links only; do not expand here).
IF: errors correlate with power transitions (ASPM/L1SS enablement)
  • THEN: control-plane timing / recovery budgets are suspect (exit latency, readiness).
  • CAPTURE: timestamped events + recovery time distribution + device-missing count (X).
  • NEXT: jump to reset/power state-machine tuning (H2-8) and stack gating (H2-10).
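The three if/then cards above collapse into one ordered classifier: integrity evidence wins over everything, power-transition correlation comes next, and congestion is the default explanation for load-correlated timeouts. A hedged sketch (field names and thresholds are assumptions):

```python
# Hedged sketch of the attribution cards as one function.
# Counter field names and the implied thresholds (X) are placeholders.

def attribute(c: dict) -> str:
    """Map a counter snapshot to the most likely layer and next page."""
    if c.get("ecrc", 0) > 0 or c.get("poisoned_tlp", 0) > 0:
        # Integrity evidence outranks all other signals: preserve it.
        return "integrity: preserve evidence, validate via PHY/compliance pages"
    if c.get("power_transition_correlated", False):
        return "control-plane timing: go to reset/power tuning (H2-8)"
    if c.get("completion_timeout", 0) > 0 and c.get("replay", 0) > 0:
        return "congestion/backpressure: go to software stack gating (H2-10)"
    return "inconclusive: widen the capture window"
```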
C) Event retention (field log schema recommendations)
  • Timestamp + severity: AER class (correctable/non-fatal/fatal) and burst window ID (X).
  • Counter snapshot: replay, completion timeout, ECRC, poison, reset counts (pre/post window X).
  • Configuration snapshot: negotiated speed/width, power state (ASPM/L1SS), enabled features (ATS/PRI/SR-IOV) as on/off or counts.
  • Load context: throughput, queue depth, and tail latency (p95/p99 placeholders X).
  • Action taken: record-only / degrade / reset (type) + recovery time (X ms).
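One possible shape for the retention schema above, as a serializable record. The field names are illustrative and should be adapted to the fleet's logging pipeline; the point is that every event carries counters, configuration, and load context together.

```python
# Illustrative event record matching the retention schema above.
# Field names are assumptions; adapt to the actual logging pipeline.

from dataclasses import dataclass, field, asdict

@dataclass
class PcieEvent:
    timestamp: float
    severity: str                                 # "correctable" | "non-fatal" | "fatal"
    counters: dict = field(default_factory=dict)  # replay, timeout, ecrc, poison, resets
    config: dict = field(default_factory=dict)    # speed/width, ASPM/L1SS, ATS/PRI/SR-IOV
    load: dict = field(default_factory=dict)      # throughput, queue depth, p95/p99
    action: str = "record"                        # "record" | "degrade" | "reset:<type>"

# asdict(...) yields a plain dict ready for JSON retention.
```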
Diagram: Severity funnel (Correctable → Non-fatal → Fatal) with action mapping (record / degrade / reset) and evidence preservation (counters, logs, snapshots).

H2-10 · Firmware & Software Stack (Firmware→OS→Driver→User-space)

Hardware capability is not an on/off switch; it is a chain. This chapter explains the ownership boundaries across firmware, OS, drivers, and user-space, then defines feature gating for ATS/PRI/SR-IOV and a field-safe upgrade/rollback strategy. The goal is a repeatable bring-up path: enumerate → map → interrupt → DMA → observe.

A) Bring-up path (firmware → OS → driver → application)
Firmware (BIOS/UEFI)
  • Owns: baseline enumeration and resource allocation.
  • Common failures: hidden devices due to gating or insufficient resource windows.
  • Verify: device is visible and resources are consistent after resets (X).
OS
  • Owns: BAR mapping, interrupt routing, IOMMU policy and DMA mapping model.
  • Common failures: blocked mapping or mis-grouped isolation domains (X).
  • Verify: stable mappings and predictable recovery behavior across power transitions.
Driver
  • Owns: queue model, doorbells, interrupt moderation, error reporting and recovery hooks.
  • Common failures: suboptimal vector/queue mapping causing latency spikes (X).
  • Verify: DMA path correctness and stable per-queue performance under load.
User-space / Application
  • Owns: workload behavior, SLA metrics, and observability consumption.
  • Common failures: missing telemetry, causing SI-like misattribution (H2-9 linkage).
  • Verify: per-device and per-tenant health KPIs remain within X under stress.
B) Feature gating (ATS/PRI/SR-IOV requires a chain)
ATS / PRI
  • Goal: reduce translation overhead and support controlled fault recovery.
  • Required layers: platform policy + IOMMU support + driver enablement.
  • Minimal test: ATS request works; invalidation works; controlled fault recovery converges.
  • Pass: ATC hit-rate (X), PRI rate (X), fault recovery time ≤ X ms.
SR-IOV
  • Goal: tenantable device partitioning with per-VF observability.
  • Required layers: firmware resource windows + driver provisioning + VF stack binding.
  • Minimal test: stable VF enumeration; per-VF queue + per-VF MSI-X + per-VF counters.
  • Pass: VF stability ≥ X% across resets; neighbor VFs unaffected by VF recovery.
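The "capability requires a chain" rule above is mechanical enough to check in software: a feature counts as usable only when every required layer reports enabled. A sketch (layer names are assumptions standing in for real platform/IOMMU/driver probes):

```python
# Sketch of the feature-gate chain rule: a capability is usable only
# when every layer agrees. Layer names are illustrative assumptions.

CHAINS = {
    "ats_pri": ["platform_policy", "iommu", "driver"],
    "sriov":   ["firmware_windows", "pf_driver", "vf_stack"],
}

def feature_usable(feature: str, enabled_layers: set) -> bool:
    """True only if every required layer in the chain is enabled."""
    return all(layer in enabled_layers for layer in CHAINS[feature])
```

Running this as a bring-up gate turns "the feature exists on paper" into a concrete pass/fail before any workload depends on it.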
C) Upgrade / rollback strategy (field-safe operations)
  • Pre-check: enforce a known-good compatibility matrix (firmware + PF + VF stack) with bounds (X).
  • Rollout gates: stage by fleet fraction; watch device-missing (0/24h), AER fatal (0), and replay/timeout trends (X).
  • Rollback triggers: any fatal increase, recovery time drift, or VF enumeration instability beyond X.
  • Post-check: run a reset/power transition suite; confirm observability panel remains consistent (H2-9 linkage).
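The rollout gates and rollback triggers above can be sketched as one staging function: advance the fleet fraction only while every health signal holds, and treat any trigger as an immediate rollback. Thresholds and field names are placeholders for the (X) bounds in the checklist.

```python
# Illustrative rollout gate: stage by fleet fraction, roll back on any
# trigger. Health field names and thresholds (X) are placeholders.

def rollout_step(fraction: float, health: dict) -> float:
    """Return the next fleet fraction, or 0.0 to signal rollback."""
    triggered = (
        health.get("device_missing_24h", 0) > 0
        or health.get("aer_fatal", 0) > 0
        or health.get("recovery_p99_ms", 0) > health.get("budget_ms", float("inf"))
    )
    if triggered:
        return 0.0                      # rollback trigger hit
    return min(fraction * 2, 1.0)       # double the staged fraction, capped
```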
Diagram: A layered ownership model (Firmware → OS → Driver → User-space) with a feature gate chain (ATS/PRI/SR-IOV). Capability becomes real only when every layer is aligned and validated.

H2-11 · Engineering Checklist (Design → Bring-up → Production)

A deliverable-focused, three-gate checklist that turns “works in the lab” into “field-operable”. Each item includes a Pass criteria placeholder (X) and practical example MPNs for common support parts (clock/reset/telemetry/config).

Gate 1 — Design Gate (budgets + policies locked)
Entry condition: budgeted & reviewable
  • Resource budget for BAR aperture, MSI-X vectors, queues, and DMA descriptors. Pass criteria: X% headroom across worst-case SKUs.
  • Refclk strategy (SRNS/SRIS assumption, spread-spectrum tolerance, clock-domain ownership). Pass criteria: jitter budget meets X and remains stable across power states.
    Example MPNs (clocking)
    Jitter cleaner: Silicon Labs Si5341 / Si5345
    Clock generator: Si5332
    PCIe-grade clock buffer (example family): Renesas/IDT 9FGV series
  • Reset policy (PERST#/hot reset/FLR selection rules + recovery budgets). Pass criteria: recovery p95 ≤ X ms, p99 ≤ X ms.
    Example MPNs (reset supervision)
    Reset supervisor: TI TPS386000 / Maxim MAX16052
  • RAS policy (AER severity actions: record / degrade / reset; evidence retention schema). Pass criteria: fatal = 0; correctable rate bounded by X / min.
  • Virtualization partition plan (PF/VF budget: queues, vectors, BAR windows, per-VF counters). Pass criteria: VF isolation holds under fault injection (X).
  • Non-volatile config for board ID / straps / feature defaults (if required by design). Pass criteria: deterministic configuration across cold/warm boots.
    Example MPNs (EEPROM / GPIO)
    I²C EEPROM: Microchip 24AA02E64 / 24LC64
    I²C GPIO expander: TI TCA9539
Exit condition
Budgets and policies are reviewable, testable, and linked to measurable pass criteria (X).
Gate 2 — Bring-up Gate (minimum viable proof chain)
Entry condition: reproducible lab setup
Minimal chain (do not reorder)
  • Enumerate: device visible and stable across warm reboot. Pass criteria: 0 device-missing in X cycles.
  • Config & BAR: BAR sizing and mapping consistent after reset. Pass criteria: BAR layout stable; no overlaps; X% headroom.
  • Interrupts: MSI-X routing correct for at least one queue (then scale). Pass criteria: interrupt count matches workload within X%.
  • DMA: host memory read/write correctness under stress patterns. Pass criteria: 0 data mismatch across X GB transferred.
  • IOMMU: DMA mapping correctness with faults observable. Pass criteria: faults detected and recovered within X ms.
  • SR-IOV: VF enumeration stable; per-VF queue/interrupt/counters validated. Pass criteria: VF stability ≥ X% across reset cycles.
  • ATS/PRI (if used): translation cache + invalidation + controlled page-fault recovery. Pass criteria: ATC hit-rate X; recovery ≤ X ms.
Optional lab helpers (MPNs)
Power monitor for per-rail correlation: TI INA238 / INA231
Temperature sensor for thermal correlation: TI TMP117 / ADI ADT7420
Exit condition
The proof chain is reproducible and survives controlled reset/power transitions with bounded recovery (X).
Gate 3 — Production Gate (telemetry + regression + compatibility)
Entry condition: stable bring-up
  • Telemetry always-on: AER severity trends, replay/timeout, ECRC/poison, reset counters. Pass criteria: fatal = 0; correctable bounded by X / min.
  • Alert hygiene: thresholds, suppression, and action mapping (record/degrade/reset). Pass criteria: no alert storms; signal-to-noise ≥ X.
  • Compatibility matrix: firmware + driver + feature gating combinations. Pass criteria: controlled rollout gates (X) and rollback triggers defined.
  • Regression suite mapped to three gates (design assumptions, bring-up chain, production KPIs). Pass criteria: suite run-time ≤ X and catches known failure modes.
Exit condition
Field stability is measurable, actionable, and regression-protected across versions (X).
Diagram: A three-gate flow that prevents “bring-up surprises” and turns capability into a repeatable, field-operable deliverable.

H2-12 · Applications & IC Selection (Controller / RC / EP)

Selection is driven by control-plane requirements (enumeration, MSI-X scale, DMA/IOMMU, ATS/PRI, SR-IOV, RAS) and operability (telemetry + recovery + compatibility matrix). Board-level SerDes/SI details are intentionally left to the PHY/Retimer/Compliance pages.

Use-case buckets (what the control-plane must provide)
Bucket 1 — CPU Root Complex (servers / industrial hosts)
  • Must-have: stable enumeration, large resource windows, MSI-X scale, AER/RAS, robust reset/power recovery.
  • Nice-to-have: SR-IOV manageability, strong observability tooling, deterministic downgrade policy.
  • Example MPNs (RC SoC families): NXP LS1046A, NXP LS2088A, Marvell CN9130
Bucket 2 — FPGA Endpoint (DAQ cards / accelerators / bridges)
  • Must-have: predictable BAR model, doorbells/queues, MSI-X mapping, correct DMA + fault visibility.
  • Nice-to-have: SR-IOV capability (or partition equivalent), built-in counters for attribution.
  • Example MPNs (FPGA with PCIe hard IP options): AMD Xilinx XCKU5P, AMD Xilinx XCVU9P, Intel Agilex AGF014, Microchip PolarFire MPF300T
Bucket 3 — High-throughput endpoints (NIC / NVMe / accelerators)
  • Must-have: large queue model + MSI-X scale, robust RAS evidence chain, stable recovery.
  • Virtualization: SR-IOV (VF count, per-VF counters), IOMMU-aware DMA paths.
  • ATS/PRI: consider when translation overhead dominates tail latency; validate invalidation + fault convergence.
  • Example MPNs (endpoint controllers): Broadcom BCM57414 (PCIe NIC controller), Intel I225-LM (PCIe GbE controller), Phison PS5026-E26 (NVMe SSD controller), Phison PS5018-E18 (NVMe SSD controller), Silicon Motion SM2264 (NVMe SSD controller), InnoGrit IG5236 (NVMe SSD controller)
Bucket 4 — External / expansion (cabled boxes / backplanes)
  • Must-have: strong observability, predictable downgrade behavior, recovery budgets with zero device-missing.
  • Go deeper (examples, not expanded here): PCIe switch: Broadcom PEX88096 / Microchip Switchtec PFX100x; PCIe redriver: TI DS80PCI810 (details belong to dedicated pages).
Selection rubric (scorecards, no tables)
1) Gen compatibility & downgrade policy
  • Must-have: deterministic negotiation and stable operation after downgrade.
  • Verify: negotiated speed/width stays stable across resets and power transitions (X).
  • Risk: flapping links that appear SI-like but are policy/timeout driven.
2) MSI-X scale & queue model
  • Must-have: per-queue interrupt mapping and measurable coalescing behavior.
  • Verify: interrupt distribution and tail latency p95/p99 bounded (X).
  • Risk: performance cliffs misdiagnosed as link quality issues.
3) DMA / IOMMU / ATS / PRI chain
  • Must-have: correct DMA under IOMMU policy and observable faults.
  • Nice-to-have: ATS/PRI when translation overhead dominates.
  • Verify: invalidation works; fault recovery converges ≤ X ms.
  • Risk: stale mappings and silent data corruption.
4) SR-IOV operability
  • Must-have: VF stability, per-VF counters, predictable reset domains.
  • Verify: VF enumeration success ≥ X% across reset/power suites.
  • Risk: tenant instability and un-debuggable incidents.
5) RAS evidence chain (AER/ECRC/poison/timeout/replay)
  • Must-have: counters usable for attribution and retention fields for incident replay.
  • Verify: correctable bounded; fatal = 0; recovery time distributions stable.
  • Risk: SI vs control-plane ambiguity becomes unresolvable.
6) Software ecosystem & tooling
  • Must-have: stable driver support and field-diagnostic hooks aligned to the telemetry panel.
  • Verify: upgrade/rollback works with clear regression gates (X).
  • Risk: features exist on paper but cannot be safely enabled in production.
Go deeper (linked pages only)
  • PHY / SerDes — link only (no SI expansion here).
  • Retimer / Redriver — link only (channel extension details there).
  • Compliance & Test Hooks — link only (PRBS/eye/jitter workflows there).
  • Cabled PCIe / External Boxes — link only (connectors/cabling there).
Diagram: Capability-to-usecase fit presented as four cards (required + bonus), avoiding tables while keeping selection structured.
BOM note
Listed MPNs are examples for planning and cross-checking. Final selection must match platform requirements (lane count, Gen target, clocking mode, feature gates) and be validated by datasheets + lab bring-up.


H2-13 · FAQs (Field Troubleshooting, Structured Answers)

This section closes long-tail, on-site troubleshooting without introducing new domains. Each question uses the fixed structure: Likely cause / Quick check / Fix / Pass criteria (with quantified placeholders X).

Measurement rule
If a check mentions a counter/log/register, capture it with a timestamp and a workload label so trends and correlations are reproducible.
Device shows in lspci, driver loads, but DMA reads stale data — first check cache maintenance or IOMMU mapping?

Likely cause: Cache coherency boundary is violated (missing flush/invalidate), or IOMMU/IOVA mapping is stale/wrong-direction for the buffer.

Quick check: Correlate data mismatches with (a) IOMMU fault logs and (b) buffer lifetime events (map/unmap). Compare results using a known-coherent buffer path (if available) versus a non-coherent path.

Fix: Enforce the correct cache maintenance at ownership handoff (CPU→Device flush, Device→CPU invalidate), and ensure mapping lifecycle is correct (no use-after-unmap, correct DMA direction, alignment to cache lines).

Pass criteria: 0 data mismatches over X GiB transferred; IOMMU faults = 0 during X minutes; p99 DMA completion latency ≤ X µs.

BAR is present but MMIO access hangs — first check ordering/posted write flush or completion timeout?

Likely cause: Posted writes are never flushed (no readback/ack path), or non-posted MMIO reads are blocked until completion timeout due to a stuck internal state / decode mismatch.

Quick check: Check RAS evidence: completion timeout counters/events, UR/CA status, and whether a read-after-write to the same region returns or stalls. Verify the device is in D0 and BAR aperture matches decode expectations.

Fix: Add an explicit flush/ack pattern (write → readback/doorbell-ack), correct BAR sizing/decoding, and align timeout/ordering policy to the device’s service guarantees (avoid indefinite waits; implement bounded retries).

Pass criteria: 0 MMIO hangs across X iterations; completion timeouts = 0 under load; p99 MMIO read latency ≤ X µs.
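The write → readback flush pattern with a bounded wait can be sketched as below. The `write`/`read` callables are hypothetical stand-ins for a real register accessor (Python cannot perform MMIO directly); the structure to copy is the readback after the posted write and the timeout instead of an indefinite spin.

```python
# Hedged sketch of the write -> readback flush pattern with a bounded
# wait. The write/read callables are hypothetical register accessors.

import time

def doorbell_with_flush(write, read, reg: int, value: int,
                        timeout_s: float = 0.01) -> bool:
    """Post a write, then force it to complete with a readback,
    bounded by a timeout instead of waiting forever."""
    write(reg, value)
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if read(reg) == value:       # readback flushes the posted write
            return True
    return False                     # escalate: log timeout, bounded retry
```

Returning `False` instead of hanging is the key behavior: the caller gets a bounded failure it can log and retry, matching the "avoid indefinite waits" rule above.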

SR-IOV VFs appear, but only PF has interrupts — MSI-X vector mapping or per-VF enable?

Likely cause: MSI-X is enabled only for PF, or VF vector tables / per-VF interrupt enables are not programmed, leaving VF queues without routable interrupts.

Quick check: Verify per-function MSI-X enable state and confirm VF interrupt counters increment when VF traffic runs. Check whether vector allocation is sufficient for VF queue count.

Fix: Allocate vectors per VF (or per VF-queue policy), enable MSI-X per VF, and validate queue→vector→CPU mapping is consistent with the device’s partition rules.

Pass criteria: VF interrupt rate matches workload within ±X%; 0 missed interrupt events over X minutes; PF and VF counters remain separated and stable.
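Vector budgeting for the fix above is simple arithmetic worth making explicit: every VF queue needs a routable vector, and reserving one extra per VF for admin/error use is a common policy (the "+1" is an assumption here, not a spec requirement).

```python
# Back-of-envelope MSI-X budgeting for per-VF interrupt enablement.
# The +1 admin vector per VF is a policy assumption, not a spec rule.

def msix_vectors_needed(num_vfs: int, queues_per_vf: int,
                        admin_per_vf: int = 1) -> int:
    return num_vfs * (queues_per_vf + admin_per_vf)

def budget_ok(table_size: int, num_vfs: int, queues_per_vf: int) -> bool:
    """Does the device's MSI-X table cover the partition plan?"""
    return msix_vectors_needed(num_vfs, queues_per_vf) <= table_size
```

Checking this at provisioning time catches the "VF queues without routable interrupts" failure before traffic ever runs.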

ATS enabled but performance gets worse — ATC miss storm or invalidation overhead?

Likely cause: Translation cache thrashes (low ATC hit-rate), or frequent invalidations dominate, turning ATS into overhead rather than savings.

Quick check: Measure ATC hit-rate, ATS request rate, and invalidation frequency; correlate with p99 latency and throughput. A hit-rate collapse with rising invalidations indicates ATS is not benefiting this workload.

Fix: Increase mapping stability (longer-lived mappings, larger pages where valid), reduce invalidation churn, and gate ATS per workload. Apply caps on outstanding ATS requests if supported.

Pass criteria: ATC hit-rate ≥ X%; p99 latency does not regress by more than X% versus ATS-off; ATS request rate ≤ X/s at steady state.

PRI requests spike then the device wedges — fault handling loop or missing backoff?

Likely cause: Page-fault handling does not converge (retries without backoff), or outstanding PRI requests overflow internal queues, leading to a deadlock-like wedge.

Quick check: Track PRI request rate, outstanding PRI depth, and fault-convergence time. If the same fault repeats and outstanding depth climbs until progress stops, backoff/limit is missing.

Fix: Add bounded retries with exponential backoff, cap outstanding PRI, and define a recovery action (abort, FLR, or controlled reset). Use a fallback mode when PRI cannot converge (disable PRI or pre-fault/pin critical ranges).

Pass criteria: PRI rate ≤ X/s; fault convergence p99 ≤ X ms; wedge events = 0 over X hours at peak load.
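The bounded-retry-with-backoff fix can be sketched as follows. The `resolve` callable is a hypothetical stand-in for the platform's page-fault handler; the retry cap, base delay, and the FLR fallback label are illustrative.

```python
# Sketch of bounded PRI fault retries with exponential backoff and a
# defined recovery action. The resolve() handler is a hypothetical
# stand-in; limits and the fallback label are illustrative.

def handle_fault(resolve, max_retries: int = 5, base_delay_ms: float = 1.0):
    """Retry a page-fault resolution with exponential backoff; signal a
    recovery action instead of spinning forever."""
    delay = base_delay_ms
    for attempt in range(max_retries):
        if resolve():
            return ("converged", attempt)
        delay *= 2                 # exponential backoff before next retry
        # a real handler would sleep(delay / 1000) here
    return ("escalate_flr", max_retries)   # fallback: controlled reset path
```

The escalation return value matters as much as the backoff: it is what turns a wedge into a logged, bounded recovery event.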

Hot reset recovers link but functions disappear — resource re-allocation or FLR sequence?

Likely cause: Bus/resource windows are not reprogrammed after reset, or function-level reset (FLR) and configuration restore are ordered incorrectly, leaving functions unconfigured or hidden.

Quick check: Compare pre/post reset device list and resource map (BDF presence, BAR addresses, bridge windows). Confirm SR-IOV enable state and capability visibility is restored post-reset.

Fix: Use a deterministic recovery sequence: enumerate → program bridge windows → restore config → enable features (MSI-X/SR-IOV) → validate counters. Apply FLR only where the platform can reliably reconfigure afterward.

Pass criteria: 0 missing functions across X hot resets; recovery p99 ≤ X ms; post-reset configuration matches baseline within X%.
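The deterministic recovery sequence in the fix can be encoded as an ordered pipeline that stops at the first failing step, so the post-reset state is never partially configured past the point of failure. Step names mirror the sequence above; the per-step callables are assumptions.

```python
# The deterministic post-reset recovery sequence as an ordered pipeline.
# Step names mirror the fix above; each callable returns True on success.

RECOVERY_ORDER = ["enumerate", "program_bridge_windows",
                  "restore_config", "enable_features", "validate_counters"]

def recover(steps: dict) -> tuple:
    """Run recovery steps strictly in order; stop at the first failure
    and report which step broke recovery."""
    for name in RECOVERY_ORDER:
        if not steps[name]():
            return (False, name)
    return (True, None)
```

Reporting the failing step name is what makes "functions disappear after hot reset" attributable: the log shows whether enumeration, window programming, or feature re-enable was the break point.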

ASPM saves power but devices randomly vanish — L1SS exit latency vs watchdog timeout?

Likely cause: L1SS exit latency exceeds a software/hardware watchdog or service deadline, triggering timeouts that cascade into device removal or repeated re-enumeration.

Quick check: Correlate disappearance events with power-state transitions (ASPM/L1SS entry/exit). Check link retrain/down events and whether timeouts spike immediately after low-power exit.

Fix: Disable L1SS first to confirm causality, then re-enable with conservative settings (latency budgets aligned to timeouts). Ensure required clocks/aux power remain valid across low-power states and extend service deadlines where needed.

Pass criteria: 0 device-missing events over 24h; exit latency p99 ≤ X ms; retrain/down events ≤ X per day at steady workload.

Correctable AER floods under load but eye looks clean — replay/timeout due to backpressure?

Likely cause: Congestion/backpressure drives replay or completion deadlines (timeouts) rather than a physical-layer integrity issue; software queue starvation can amplify the symptom.

Quick check: Compare correctable AER rate against replay counters, completion timeout counters, and queue depth/CPU service time. If error rate tracks throughput and queue pressure, root cause is control-plane congestion.

Fix: Reduce outstanding requests, tune request sizes (MPS/MRRS policy), and fix queue servicing (avoid starvation). If supported, increase buffering/credits and enforce bounded backpressure handling.

Pass criteria: Correctable AER rate ≤ X/min at peak load; replay ratio ≤ X%; completion timeouts = 0; throughput drop ≤ X%.

VFIO passthrough works until reboot — firmware resource window or IOMMU group shift?

Likely cause: Firmware resource allocation changes across boot (BAR windows/decoding policy), or the device’s isolation boundary changes (IOMMU group shifts with topology/ACS policy), breaking passthrough assumptions.

Quick check: Compare boot-to-boot snapshots: BDF stability, BAR addresses, bridge windows, and group identity. If identity or resources change without hardware changes, firmware policy is the primary suspect.

Fix: Lock required firmware settings (resource decode, enumeration stability, isolation policy), persist device mode settings if applicable, and add a boot-time validation gate that fails fast when the resource map deviates.

Pass criteria: 0 passthrough regressions across X reboots; BDF/resource map stable within X; IOMMU group identity stable across X reboots.

Gen negotiation downgrades unexpectedly — capability gating or firmware policy?

Likely cause: Platform policy caps speed/width, or repeated training/retrain events trigger conservative downgrade; capability exchange assumptions (e.g., clocking expectations) can also force fallback.

Quick check: Record negotiated speed/width versus expected target and track retrain counts. If the negotiated state changes after retries or power transitions, policy gating or training instability is implicated.

Fix: Align capability gating across firmware/driver, define an explicit downgrade policy (with alerts), and remove hidden caps. Require a “stable negotiated state” checkpoint before declaring link-ready.

Pass criteria: Negotiated speed/width stable across X reset/power cycles; unexpected downgrades = 0/24h; retrains ≤ X/day under steady workload.

MSI-X works but CPU usage explodes — interrupt moderation/affinity missing?

Likely cause: Interrupt rate is effectively “per event” without moderation/coalescing, or queue→CPU mapping causes hot-core overload and excessive context churn.

Quick check: Measure interrupts/sec per queue and packets (or completions) per interrupt, plus CPU distribution across cores. A low packets-per-interrupt ratio with skewed core load indicates missing moderation/affinity.

Fix: Enable interrupt moderation/coalescing, batch doorbells/completions, tune queue count to match core budget, and enforce stable queue→vector→CPU mapping for predictable load distribution.

Pass criteria: CPU usage reduced by ≥ X% at target throughput; interrupt rate ≤ X/s; p99 latency ≤ X µs; throughput drop ≤ X%.
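The arithmetic behind the quick check is worth stating: coalescing N events per interrupt divides the interrupt rate by N, trading a bounded amount of extra latency for CPU headroom. The numbers below are illustrative.

```python
# Quick arithmetic behind the packets-per-interrupt check.
# Event rates below are illustrative, not measured values.

def interrupt_rate(events_per_s: float, events_per_irq: float) -> float:
    """Interrupts/sec after coalescing events_per_irq events per IRQ."""
    return events_per_s / events_per_irq

# 1M completions/s with no coalescing = 1M IRQs/s; coalescing 64
# events per IRQ drops this to 15,625 IRQs/s.
```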

Completions time out only on large transfers — MRRS/MPS mismatch or software queue starvation?

Likely cause: Large transfers trigger excessive split completions (MRRS/MPS policy mismatch) or overwhelm completion buffering; software queue starvation/backpressure can make completions miss deadlines.

Quick check: Correlate timeout events with transfer size; record MRRS/MPS settings and outstanding read depth. If timeouts scale with size and outstanding depth, split/completion pressure is the primary driver.

Fix: Set compatible MRRS/MPS policy, cap outstanding reads, and ensure software queues cannot starve completion service. If needed, split transfers at the software layer and align completion timeouts to worst-case service time.

Pass criteria: 0 completion timeouts over X TB transferred; p99 completion latency ≤ X µs; replay/timeout counters remain bounded (≤ X/min) at peak load.
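The size-dependence in this symptom follows directly from how reads are split: one read of size S issues ceil(S / MRRS) requests, and each request's completion data arrives in pieces no larger than MPS. A quick sketch of that arithmetic (values illustrative):

```python
# Arithmetic behind "completions time out only on large transfers":
# request and completion counts scale with transfer size. Values are
# illustrative; real MRRS/MPS come from negotiated device control.

def split_counts(size: int, mrrs: int, mps: int) -> tuple:
    """(read requests issued, total completion TLPs received)."""
    requests = -(-size // mrrs)                    # ceil division
    completions_per_req = -(-min(size, mrrs) // mps)
    return requests, requests * completions_per_req

# A 1 MiB read with MRRS=512 and MPS=128 becomes 2048 requests and
# 8192 completion TLPs; completion pressure grows with transfer size.
```

This is why capping outstanding reads and aligning the completion timeout to worst-case service time are part of the fix: both bound the completion pressure that scales with size.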