PCIe Controller / Endpoint / Root Complex
This page explains how a PCIe Controller / Root Complex / Endpoint becomes “usable and operable” end-to-end: from enumeration and configuration to interrupts, DMA/IOMMU, ATS/PRI, SR-IOV, and reset/power recovery.
The goal is to turn field symptoms (device missing, MMIO hang, stale DMA, VF no IRQ, timeout floods) into a measurable checklist and pass criteria so issues can be attributed and closed with data.
H2-1 · Scope & Role Map (RC / EP / Controller Contract)
This chapter defines the control-plane contract for PCIe roles so every later section stays in-bounds: who discovers, who configures, who moves data, and who owns recovery decisions.
In scope (controller contract):
- Discovery & configuration: enumeration, configuration space, capability gating.
- Resource mapping: BAR sizing/mapping, interrupt delivery (MSI/MSI-X) at the control-plane level.
- Data movement semantics: MMIO doorbells, DMA visibility basics, IOMMU interaction (concept-level).
- Feature responsibilities: ATS/PRI, SR-IOV, error reporting and recovery ownership.
- Operational stability: reset/power semantics from the controller view (what changes, what must be restored).
Out of scope (covered on sibling pages):
- PHY / SerDes / SI: eye/jitter, insertion loss, equalization tuning, layout rules.
- Retimer / Redriver design: CDR/DFE/CTLE device selection and channel extension strategy.
- Switch deep dive: port fan-out design, switch silicon policies, detailed ACS/ARI behavior.
- Compliance workflows: official PCI-SIG test flows and lab procedures.
Use this page to decide what the controller must do; use sibling pages to solve how the link meets electrical requirements.
Root Complex (RC):
- Owns discovery: initiates configuration transactions to find and classify endpoints.
- Owns global policy: assigns bus numbers and resource windows (address space + interrupts).
- Owns recovery: decides whether to reset, disable, or degrade when errors occur.
Endpoint (EP):
- Responds to configuration: exposes capabilities; becomes usable only after RC policy is applied.
- Drives data-plane: performs DMA (if present) and signals events via MSI/MSI-X.
- Reports health: provides error status/counters; supports function-level recovery paths.
Switches:
- Route transactions: forward requests and completions; do not own system policy.
- Shape visibility: affect what is reachable based on windows and routing, without becoming the policy owner.
- Create failure surfaces: mis-windowing can hide devices; congestion can mimic link instability.
This pattern stresses enumeration policy, resource sizing, and operational recovery. If a device “disappears”, start from the RC policy chain (bus numbering → windowing → capability gating) before assuming an electrical issue.
Role boundaries can blur in SoC/FPGA designs. Bring-up becomes easier when responsibilities are explicit: a single owner for enumeration, a single owner for DMA visibility rules, and a single owner for recovery actions.
- Enumerate devices: stable discovery across cold boot, warm reset, and hot reset.
- Configure capabilities: predictable feature gating (MSI-X, SR-IOV, ATS) with verification points.
- Map BAR & resources: correct sizing and address placement without conflicts.
- Deliver interrupts: scalable per-queue vectors without storms.
- Move data safely: DMA visibility rules aligned with IOMMU policy (no stale or wandering writes).
- Recover from faults: AER-driven triage and reset semantics (PERST#/FLR) that restore usability.
- Enumeration succeeds on cold boot and warm reset with 0 missing functions across X cycles.
- BAR mapping is stable with no overlaps; MMIO accesses complete without stalls across X minutes.
- Interrupt delivery remains bounded: no storm events; CPU distribution meets X policy.
- Error handling is actionable: AER events are classified and recovery restores service within X seconds.
H2-2 · Transaction Model Primer (TLP/DLLP/Config System View)
PCIe controller issues often present as “random stalls” or “works until load” failures. A correct system view of posted vs non-posted requests, completions, and ordering/visibility prevents misdiagnosis and gives later topics (ATS/PRI, SR-IOV) a stable foundation.
- Posted requests (typical writes): do not require a completion. The system can appear “fast” while still being wrong if the producer signals too early (doorbell before data visibility).
- Non-posted requests (typical reads and some control paths): require a completion. These transactions are completion-bound and expose congestion, backpressure, or policy errors as stalls.
- Completions are the controller’s source of truth for “did it really happen”. A system that “hangs” usually does so at the completion boundary (timeouts, replays, poisoned data).
- If a read stalls: treat it as a completion problem first (policy/congestion/recovery), not a “write problem”.
- If a write “succeeds” but data is wrong: suspect ordering/visibility between data buffers and doorbells.
- If config access is flaky: suspect enumeration/windowing/capability gating before assuming electrical instability.
A “hang” can be a controller waiting at a completion boundary. Treat timeouts and retries as accounting signals: something prevented the completion from arriving in time.
- Completion timeout: the request left, but the completion did not return in the allowed window. Common controller-side causes include congestion/backpressure, blocked routing, or recovery policy that never converges.
- Replay / retry behavior: repeated attempts inflate latency and can look like instability under load. The key is whether retries correlate with resource pressure (queues, interrupt moderation, reset loops).
- ECRC / poisoned data: indicates an integrity concern at the transaction level. The correct first step is to classify the event and decide a recovery action, not to immediately rewrite the electrical design.
- Confirm whether the failure is completion-bound (reads/config) or visibility-bound (writes/DMA).
- Capture controller-visible symptoms: timeout events, retry/replay pressure, and error class (correctable vs fatal).
- Decide a safe recovery action: isolate scope (one function vs the whole device), then apply reset semantics (X) and verify re-enumeration.
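The first-response rules above can be captured as a small triage helper. This is a sketch with hypothetical names (`classify_symptom` and its labels are illustrative, not from any real tool), encoding only the read/write/config heuristics stated earlier:

```python
# Hypothetical triage helper (illustrative names, not a real API):
# maps an observed symptom to the failure class discussed above.
def classify_symptom(op: str, observed: str) -> str:
    """op: 'read' | 'write' | 'config'; observed: 'stall' | 'wrong_data' | 'flaky'."""
    if op in ("read", "config") and observed == "stall":
        # Non-posted paths are completion-bound: suspect policy/congestion/recovery.
        return "completion-bound"
    if op == "write" and observed == "wrong_data":
        # Posted writes "succeed" early: suspect data-vs-doorbell ordering.
        return "visibility-bound"
    if op == "config" and observed == "flaky":
        # Flaky config access: suspect enumeration/windowing/capability gating.
        return "control-plane"
    return "unclassified"
```

The value of writing the rules down this way is that every incident report can carry a machine-checkable first classification before anyone touches the electrical design.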
Many controller bring-up failures are not “data corruption” but visibility mismatches. The producer and consumer observe different timelines unless ordering rules are enforced.
- Rule 1 — Separate data and control: data buffers can be written (posted writes) while control registers (doorbells) signal readiness. If the signal arrives first, the consumer reads incomplete or stale data.
- Rule 2 — Completion is a barrier for non-posted paths: reads/config are naturally gated by completions; writes need explicit discipline (barriers, sequencing, queue protocol).
- Rule 3 — DMA visibility depends on the platform contract: IOMMU policy and cache coherency determine what is visible when. Treat visibility as a first-class design requirement.
- Under stress, posted-write control signaling never outruns data visibility (0 stale reads across X iterations).
- Completion-bound operations do not time out within X load window (read/config success rate ≥ X%).
- Error events are classifiable and recoverable; recovery converges within X seconds.
H2-3 · Enumeration & Configuration Space (Power-on → Usable)
This chapter targets the most common bring-up and field failures: a device is missing, or it appears but is not usable. The focus stays on the control-plane chain: bus numbering → bridge windows → configuration gating → BAR sizing/assignment → capability enablement.
- Discover topology: scan buses and identify bridges/functions without assuming resources exist yet.
- Assign bus numbers: set primary/secondary/subordinate so downstream devices are reachable.
- Open bridge windows: memory/prefetchable windows must cover the endpoint BAR apertures.
- Gate device usability: enable minimum command bits (Memory/Bus Master) only after mapping is valid.
- Size and assign BARs: BAR sizing declares resource demand; assignment is the system promise.
- Enable capabilities: MSI-X/AER/SR-IOV/ATS are turned on only with clear verify points.
- Restore after reset: after hot reset/FLR, re-apply the configuration subset that is not retained (X).
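The “size and assign BARs” step uses the standard PCI sizing probe: software writes all-1s to the BAR, reads it back, and the aperture size is the two's complement of the masked readback. A host-side sketch of the decode (32-bit BARs only; 64-bit BARs pair two registers, and the function name is illustrative):

```python
def bar_size_from_readback(readback: int, io_bar: bool = False) -> int:
    """Standard PCI BAR sizing decode: after the all-1s write, bits the
    device cannot implement read back as 0; the aperture size is the
    two's complement of the readback with type/attribute bits masked."""
    mask = 0xFFFFFFFC if io_bar else 0xFFFFFFF0  # strip type/attribute bits
    return (~(readback & mask) + 1) & 0xFFFFFFFF
```

For example, a 64 KiB memory BAR reads back 0xFFFF0000 after the all-1s write, which decodes to 0x10000; the sizing result is the resource demand that the later assignment step must honor.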
- Vendor/Device ID. Good: stable IDs match expectation. If not: suspect reachability (bridge numbering/window) or reads returning invalid values.
- Class code. Good: correct function class and expected mode. If off: suspect firmware strapping or device mode gating.
- Header type. Good: header matches expected topology role. If mismatched: re-check the scan path and function interpretation.
- Command register. Good: Memory and Bus Master enabled only after resources are assigned. If Memory is off: MMIO appears dead. If Bus Master is off: DMA cannot work.
- Capabilities pointer. Good: capability list is present and consistent. If missing: suspect partial enumeration or firmware policy limitations.
- Bus numbers. Good: subordinate range covers downstream scan depth. If too small: devices “behind” the bridge disappear.
- Bridge windows. Good: windows are large and aligned enough for the endpoint BAR apertures. If undersized: BAR assignment may succeed but accesses fail or get blocked.
- BAR base. Good: base is non-zero, aligned, and the aperture matches the device need. If 0/overlap: resource policy or placement conflicts are likely.
- All expected functions appear with stable IDs across X cold boots.
- Bridge subordinate ranges cover full scan depth; no “missing behind bridge” events in X cycles.
- BARs are non-zero, aligned, and reachable; MMIO reads/writes complete without stalls over X minutes.
- Likely cause: Memory window does not cover the BAR base+aperture, or access is blocked by policy.
- Quick check: BAR base/alignment + bridge memory/prefetchable windows.
- Fix direction: expand/realign windows; ensure Memory enable is applied after placement is valid.
- Pass criteria: MMIO completes with 0 stalls over X accesses.
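The quick check above (BAR base/alignment vs bridge windows) reduces to an interval-containment test plus natural alignment. A sketch, assuming Type 1 header semantics where the window limit is inclusive (function names are illustrative):

```python
def window_covers(bar_base: int, bar_size: int, win_base: int, win_limit: int) -> bool:
    """True iff the BAR aperture [bar_base, bar_base + bar_size) fits
    inside the bridge memory window [win_base, win_limit] (limit inclusive)."""
    return win_base <= bar_base and (bar_base + bar_size - 1) <= win_limit

def is_naturally_aligned(bar_base: int, bar_size: int) -> bool:
    """BARs must be aligned to their own size (sizes are powers of two)."""
    return bar_base % bar_size == 0
```

Running this over every (BAR, upstream-window) pair in the enumerated tree is a cheap way to turn “MMIO appears dead” into a concrete coverage or alignment violation.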
- Likely cause: host aperture is insufficient or fragmented; alignment requirements cannot be satisfied.
- Quick check: requested aperture size (BAR sizing) vs available host/bridge window budget.
- Fix direction: re-place resources, adjust window budgets, or reduce optional apertures (X).
- Pass criteria: BAR base non-zero, aligned, stable across X resets.
- Likely cause: prefetchable vs non-prefetchable mapping does not match the device’s expectation.
- Quick check: BAR attributes (prefetchable bit) and mapping policy consistency across resets.
- Fix direction: make mapping deterministic; avoid mixing policies per boot path (X).
- Pass criteria: throughput variance < X% across X runs.
- Likely cause: configuration subset is not restored (Command, MSI-X, SR-IOV state, BAR re-placements).
- Quick check: compare key fields before/after reset (Command, BAR base, enabled capabilities).
- Fix direction: define a deterministic restore list and re-verify enumeration chain (X).
- Pass criteria: post-reset service returns within X seconds with 0 missing functions.
- Enable action: choose MSI-X when per-queue scaling is required; keep vector mapping deterministic.
- Verify signal: vector usage and per-queue event separation are visible (counts and mapping match expectation).
- Enable action: turn on actionable error reporting so faults can be classified and recovered.
- Verify signal: errors are categorized (correctable vs fatal) and recovery policy converges within X.
- Enable action: allocate PF/VF resources in a way that windows and BAR budgets can sustain.
- Verify signal: VFs enumerate and remain stable across reset; resource placement does not overlap.
- Enable action: enable only when the platform contract for translation and isolation is clear.
- Verify signal: translation path is actually exercised and remains stable under load (X).
- Enable action: treat D-state transitions as policy-driven; keep restore paths deterministic.
- Verify signal: resume does not cause re-enumeration loss or capability regressions across X.
- Capabilities remain enabled as intended across X hot resets with no regressions.
- BAR placement and bridge windows remain compatible after VF creation or policy changes (0 overlaps, 0 unreachable apertures).
H2-4 · Interrupts & Doorbells (Legacy / MSI / MSI-X → Scalable)
Many endpoint performance failures are notification failures: the interrupt model and queue signaling determine whether throughput scales or collapses into interrupt storms and tail-latency spikes.
- Legacy interrupts are limited for modern multi-queue devices and often collapse under load.
- MSI improves routing, but vector count and isolation may still limit per-queue scaling.
- MSI-X enables per-queue vectors so hot queues can be isolated, moderated, and distributed across CPUs.
A scalable design binds queue → vector → CPU deterministically, then controls interrupt rate with moderation rather than allowing storm-driven scheduling.
- Vector budget: allocate enough vectors for the active queues (Q0..Qn).
- Mapping determinism: keep queue-to-vector mapping stable across reset and mode changes.
- Isolation intent: avoid funneling multiple hot queues into one vector unless that is explicitly intended.
- Per-queue interrupts are observable: each hot queue produces its own vector activity (no hidden funneling).
- No “vector starvation”: vectors are not shared unintentionally across unrelated traffic classes (X).
Interrupt frequency should track useful work. Too many notifications create storms; too few increase tail latency. Use policy knobs in this order:
- Fix mapping first: ensure queue → vector mapping reflects the intended parallelism.
- Then moderate: apply coalescing/moderation to cap interrupt rate (X).
- Then distribute: set affinity so hot vectors do not pile onto one CPU core.
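A deterministic queue→vector→CPU binding can be sketched as below. All names are illustrative (real drivers establish this via their MSI-X allocation and affinity APIs); the point is that wrapping onto a shared vector happens explicitly, never by accident:

```python
def build_irq_map(num_queues: int, num_vectors: int, cpus: list) -> dict:
    """Deterministic queue -> vector -> CPU binding: one vector per queue
    while the budget lasts, then an explicit wrap; vectors are spread
    round-robin across the given CPU set."""
    mapping = {}
    for q in range(num_queues):
        vec = q % num_vectors          # wraps only when queues exceed budget
        mapping[q] = {"vector": vec, "cpu": cpus[vec % len(cpus)]}
    return mapping
```

Because the map is a pure function of (queues, vectors, CPUs), it is stable across resets and mode changes, which is exactly the determinism requirement above.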
- Doorbell (MMIO): signals new work; too frequent writes can become a control-path bottleneck.
- MSI-X interrupt: signals completion/events; without moderation, the CPU becomes the bottleneck.
- Discipline: keep doorbell signaling aligned with visibility rules (data before signal).
Pass criteria: interrupt rate remains within X / second under peak load, without storm bursts.
Pass criteria: hot vectors are spread per policy; no single core exceeds X% sustained interrupt handling load.
Pass criteria: queue completion latency meets p99 < X under steady-state load.
Pass criteria: throughput does not collapse at peak; variance stays within X% over X runs.
H2-5 · DMA & Memory Model (Host Memory, IOMMU, Coherency, Visibility)
DMA failures are rarely “random.” Most fall into a short control-and-visibility chain: address domain (IOVA vs PA) → mapping lifetime (IOMMU/IOTLB) → coherency world (coherent vs non-coherent) → ordering boundary (data vs descriptor vs doorbell). This chapter focuses on why DMA reads stale data or writes to the wrong place, and how to prove the root cause with measurable checks.
- Pattern: fixed-size buffer (e.g., X KB) filled with a known signature and checksum.
- Alignment: enforce X-byte alignment; keep a separate “misaligned” variant for later.
- Boundary controls: create a “cross-page” variant (buffer spans two pages) to trigger mapping edge cases.
- 64-bit DMA enable: confirm the device consumes the full DMA address width (avoid high-bit truncation).
- IOVA/PA sanity: record the DMA address used by the device and the backing physical placement (for correlation).
- No surprise remap: pin or otherwise stabilize the mapping during the demo to isolate variables (X).
- CPU → device: ensure data/descriptor writes become visible before ringing a doorbell (flush or barrier as required).
- Device → CPU: ensure CPU reads see device writes (invalidate or coherent path).
- Descriptor discipline: treat descriptors as control-plane; keep them separate from large data buffers.
- Pattern checksum matches after X DMA read/write cycles (no corruption, no stale reads).
- Cross-page variant completes with 0 unexpected faults over X iterations.
- High-address placement remains correct (no truncation) across X resets.
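The signature buffer and cross-page variant above can be sketched host-side. This is a pure simulation (no real DMA is performed), with illustrative names and an assumed 4 KiB page size:

```python
import zlib

PAGE = 4096  # assumed page size

def make_pattern(size: int, seed: int = 0xA5) -> bytes:
    """Known-signature buffer: deterministic bytes, so a stale or partial
    DMA read changes the checksum instead of failing silently."""
    return bytes((seed + i) & 0xFF for i in range(size))

def roundtrip_ok(sent: bytes, received: bytes) -> bool:
    """Checksum plus byte-exact compare after a DMA read/write cycle."""
    return zlib.crc32(sent) == zlib.crc32(received) and sent == received

def crosses_page(iova: int, size: int) -> bool:
    """Cross-page variant trigger: True when the buffer spans two pages."""
    return (iova // PAGE) != ((iova + size - 1) // PAGE)
```

In a real bring-up the pattern is placed in the DMA buffer, the device moves it, and `roundtrip_ok` is evaluated on the far side; `crosses_page` selects buffer placements that exercise mapping edge cases.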
- Not-present / unmapped: the IOVA has no valid translation at the time of access.
- Permission: mapping exists but violates R/W permissions or device domain policy.
- Address width / format: the device issued a truncated or malformed DMA address (common with 64-bit not enabled).
- Lifetime mismatch: mapping was torn down or reused while the device still uses it (stale mapping).
- Cross-page edge: first page mapped, second page missing; faults only on certain buffer sizes.
- Correlation: match the faulting IOVA against the intended buffer range (start/end).
- Repro trigger: test the cross-page variant and a high-address placement variant.
- Lifetime probe: hold mapping longer (X) and verify whether faults disappear.
- Permissions: enforce least-privilege and confirm the required R/W bits are present.
- Unmapped: ensure map occurs before doorbell; avoid racing map/unmap with in-flight DMA.
- Permission: correct R/W permissions and domain binding; avoid over-broad mappings as a “fix.”
- Width: enable 64-bit addressing end-to-end; validate that high bits are preserved.
- Lifetime: prevent IOVA reuse until all DMA completion is observed; add a drain point (X).
- Cross-page: map the entire span; validate scatter/gather segments do not leave holes.
- 0 IOMMU faults across X hours of peak DMA load.
- Fault recovery (if intentionally injected) converges within X ms (p99), no retry storms.
- Non-coherent: software must manage cache visibility explicitly (flush/invalidate), or stale reads are expected.
- Coherent interconnect: data visibility is largely automatic, but ordering between data and signals still matters.
- Mixed: some buffers are coherent while others are streaming; inconsistent policy causes “only sometimes” failures.
- Descriptors: treat as control-plane; ensure CPU writes are visible before doorbell.
- Device reads (CPU → device): flush/clean cache lines that contain DMA-read data on non-coherent paths.
- Device writes (device → CPU): invalidate cache lines before CPU consumption on non-coherent paths.
- Signal ordering: data becomes visible first, then notify (doorbell / flag / interrupt).
- 0 stale reads in pattern test across X million DMA operations.
- Doorbell/descriptor ordering violations remain at 0 across X stress runs.
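The “data becomes visible first, then notify” rule can be shown with a toy producer/consumer ring. The barrier comment stands in for the flush/write-barrier a real non-coherent path requires; `FakeRing` is illustrative, not a driver structure:

```python
class FakeRing:
    """Toy producer/consumer demonstrating the data-before-signal rule:
    the doorbell only advances after the payload write has completed."""
    def __init__(self):
        self.data = {}
        self.doorbell = 0
    def produce(self, idx: int, payload: bytes):
        self.data[idx] = payload   # 1. data write (posted)
        # 2. on real hardware: flush / write barrier goes here
        self.doorbell = idx + 1    # 3. signal only after data is visible
    def consume(self) -> list:
        # The consumer trusts the doorbell: if the rule is violated,
        # it reads incomplete or stale entries.
        return [self.data[i] for i in range(self.doorbell)]
```

Inverting steps 1 and 3 is exactly the bug class described above: the consumer observes the doorbell before the data, and “succeeds” with stale contents.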
H2-6 · ATS / PRI / PASID (Device-side Translation: Performance vs Safety)
ATS/PRI/PASID shift part of address-translation intelligence toward the device. The upside is lower translation overhead and better tail latency in translation-heavy workloads. The downside is a new failure space: stale translations, invalidation mistakes, and fault recovery instability. This chapter defines the enablement gates, the minimum bring-up proof, and measurable acceptance criteria.
- IOMMU stable: DMA is already correct and fault-free without ATS/PRI.
- Capabilities visible: ATS/PRI/PASID capabilities can be discovered and configured.
- Invalidation path: an explicit translation invalidation mechanism exists and is observable.
- Policy contract: translation and protection domains are defined (no “enable and hope”).
- Counters/telemetry: ATC hit/miss, ATS request rate, PRI request rate, and fault counters.
- Recovery policy: bounded retries, backoff, and a deterministic “disable ATS/PRI” fallback (X).
- Isolation validation: PASID/domain separation can be tested and proven (X scenarios).
- Workload: translation-heavy access pattern (high-frequency, small random I/O).
- Evidence: ATC hit/miss counters change as expected; ATS requests correlate with misses.
- Sanity control: disable ATS and confirm the performance/latency signature changes (X).
- Stimulus: change a mapping or revoke access intentionally (controlled test case).
- Evidence: device stops using the old translation after invalidation; no silent writes to old PA.
- Failure signature: repeated faults or corruption indicates stale ATC/IOTLB state (X).
- Stimulus: inject a page-miss scenario; observe PRI request behavior.
- Evidence: mapping is established (or denial is returned) and the device resumes or fails deterministically.
- Guardrail: bounded retry count and backoff; escalate to fallback if no convergence (X).
- Stale translation: ATC uses an old mapping after revoke/remap → data corruption risk.
- Invalidation gaps: invalidation sent but not honored/observed → intermittent faults or silent misdirected DMA.
- PRI storms: missing backoff and retry bounds → tail-latency explosion and system instability.
- ATC hit rate: ≥ X% on the target workload.
- Tail latency: p99 improves by ≥ X% (or stays within X if throughput is the priority).
- PRI request rate: ≤ X / s in steady state.
- Fault convergence: ≤ X ms (p99), no retry storms.
- Separation proof: cross-context access attempts fail deterministically (X test matrix).
- No silent corruption: invalidation/remap tests show 0 “old PA” writes across X cycles.
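The invalidation contract (translation gets dropped, device re-requests instead of reusing an old PA) can be modeled with a toy translation cache. `ToyATC` and its fields are illustrative software stand-ins, not hardware registers:

```python
class ToyATC:
    """Minimal address-translation-cache model for the invalidation test:
    after invalidate(), the next access must miss and re-fetch the new PA."""
    def __init__(self, iommu: dict):
        self.iommu = iommu       # authoritative IOVA-page -> PA table
        self.cache = {}
        self.hits = 0
        self.misses = 0
    def translate(self, page: int) -> int:
        if page in self.cache:
            self.hits += 1
        else:
            self.misses += 1
            self.cache[page] = self.iommu[page]   # ATS request on miss
        return self.cache[page]
    def invalidate(self, page: int):
        self.cache.pop(page, None)                # honor the invalidation
```

The “stale translation” pitfall above is this model with `invalidate()` omitted after a remap: the device keeps writing to the old PA, which is the silent-corruption signature the acceptance criteria check for.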
H2-7 · SR-IOV & Multi-Function (PF/VF → Isolation & Operability)
SR-IOV turns a device from “it runs” into “it can be safely shared and operated.” The practical goal is a controlled resource slicing model (queues, MSI-X vectors, BAR windows), a repeatable enablement chain (gate → enable → enumerate → validate), and an explicit per-VF observability contract (counters, resets, and bounded failure domains).
- Gate: confirm SR-IOV capability is visible and permitted by platform policy (X). Quick check: capability present; platform does not block VF creation.
- Enable: set the VF count and provision VF resource windows (BAR space, vector budget). Quick check: VF count reflects the intended number; resource window sizing does not collide.
- Enumerate: verify VFs appear consistently across resets and hot resets (X). Quick check: VF enumeration success ≥ X% over X cycles.
- Validate: verify per-VF queues, interrupts, and counters are functional and isolated. Quick check: per-VF counters increment only for the target VF; vector/queue mapping matches policy.
- VFs enumerate successfully ≥ X% across X reset cycles.
- Per-VF BAR windows are conflict-free (0 overlaps), and VF MMIO is accessible without errors.
- Per-VF MSI-X vectors deliver interrupts to the intended CPUs with stable distribution (X).
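The “conflict-free per-VF BAR windows” criterion reduces to a stride layout plus an interval-overlap check. A sketch, assuming the common fixed-stride VF BAR layout (names illustrative):

```python
def vf_windows(base: int, stride: int, count: int) -> list:
    """VF BAR apertures at a fixed stride from the PF-provisioned base;
    returns [start, end) intervals, one per VF."""
    return [(base + i * stride, base + (i + 1) * stride) for i in range(count)]

def no_overlaps(windows: list) -> bool:
    """True iff no two [start, end) intervals intersect."""
    ws = sorted(windows)
    return all(ws[i][1] <= ws[i + 1][0] for i in range(len(ws) - 1))
```

Running `no_overlaps` over the union of PF and VF apertures after every provisioning or policy change is the cheap mechanical version of the “0 overlaps” pass criterion.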
- Traffic: throughput and queue depth/latency (p95/p99 placeholders X).
- Errors: retry/timeout/fatal counters per VF, plus per-queue drops (X).
- Interrupt health: per-vector counts and coalescing effectiveness (X).
- Reset granularity: VF recovery must not disrupt neighbor VFs (bounded failure domain).
- Recovery time: VF returns to “ready” within X ms (p95/p99).
- State cleanup: queue/doorbell/counter state converges after reset (no ghost traffic).
- Matrix: define known-good combinations (firmware, PF driver, VF driver) with version bounds (X).
- Upgrade rule: upgrades must preserve VF stability across reset paths (H2-8 linkage).
- Rollback rule: rollback must restore VF operability without manual re-provisioning (X).
- First check: reset path semantics and timing (PERST#/hot reset/FLR interactions with VF provisioning).
- Direction: ensure VF provisioning happens after the selected reset path completes; avoid racing enumeration (X).
- First check: whether the reset type clears VF enablement state and requires reprovision.
- Direction: treat VF provisioning as a state that must be re-applied after specific resets (X).
- First check: queue/vector budgets and per-VF rate limits (resource contention).
- Direction: enforce per-VF budgets and verify tail latency isolation via per-VF counters (X).
- First check: PF/VF role separation and compatibility matrix (firmware + PF + VF stack versions).
- Direction: pin known-good combinations and validate across reset sequences (X).
- First check: whether VF assignment boundaries match the intended isolation domain.
- Direction: align VF tenancy model with the platform’s isolation domain contract (no implicit assumptions).
H2-8 · Reset / Power Management / Hot-Plug (From “Enumerates” to “Never Disappears”)
Field stability often fails at the control-plane state machine: reset semantics, power-state transitions, and hot-plug re-enumeration. This chapter provides a practical decision tree for selecting PERST#, Hot Reset, FLR, and function resets, then a safe tuning order for ASPM/L1SS, and measurable pass criteria for recovery time and “device missing” events.
- Config/functional corruption but link is up: prefer FLR / function reset to keep the impact bounded. Verify after reset: enumeration intact, BAR/MMIO usable, error counters converge (X).
- Link training or enumeration is unstable: prefer Hot Reset first, then escalate to PERST# if instability persists. Verify after reset: link stays up for X minutes under load; enumeration success ≥ X%.
- SR-IOV VF recovery: reset selection must account for whether VF provisioning state is cleared and must be re-applied (X). Verify after reset: VF count restored, per-VF queues/interrupts/counters are functional.
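The reset decision tree above can be sketched as a tiny policy function; the escalation order (smallest blast radius first) is the point, and the function name and inputs are illustrative:

```python
def pick_reset(link_up: bool, config_corrupt: bool, tried_hot_reset: bool) -> str:
    """Smallest blast radius first: FLR for bounded functional faults,
    Hot Reset to retrain an unstable link, PERST# only as escalation."""
    if link_up and config_corrupt:
        return "FLR"          # bounded, per-function impact
    if not link_up and not tried_hot_reset:
        return "HOT_RESET"    # retrain without a fundamental reset
    return "PERST#"           # escalate: full fundamental reset
```

Codifying the policy this way keeps recovery deterministic: the same symptoms always select the same reset path, which makes the recovery-time pass criteria measurable.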
- Baseline (no deep power states): prove 24h stability (0 device-missing events, error counters converge). Metrics: recovery time, enumeration success, AER counters (X).
- Enable shallow savings: turn on one mechanism at a time; measure exit latency impact (X). Check: no increase in completion timeouts; no surprise retrains/resets.
- Enable deeper states (L1SS): validate wake/exit timing and device readiness within platform budgets (X ms). Check: “device missing” remains 0/24h; p99 latency stays within X.
- First check: exit latency and readiness timing vs platform timeout budgets (X).
- Second check: unexpected re-enumeration or resource reshuffle after wake.
- Direction: back off to the last stable state; then re-enable stepwise with metrics.
- Recovery time: ≤ X ms (p95) and ≤ X ms (p99) for the selected reset path.
- Re-enumeration: success ≥ X% across X cycles (including hot resets).
- Device missing: 0 / 24h under peak traffic + power-state toggling.
- Error convergence: fatal = 0; correctable ≤ X after stabilization window (X minutes).
- State consistency: power state transitions do not change functional resource layout unexpectedly.
H2-9 · RAS & Observability (AER/ECRC/Logs/Counters for Root-Cause)
When symptoms look like signal-integrity (SI) problems, control-plane evidence must be checked first. This chapter defines a minimal observability panel (AER classes, replay/timeout, ECRC/poison), then provides a repeatable attribution template: symptom → counters to capture → likely layer → next-page jump (PHY/Retimer/Compliance links only, no cross-topic expansion).
- AER Correctable: rate (X / min) and burstiness (p95/p99 window X).
- AER Non-fatal: count per hour (X) and correlation with workload/power states.
- AER Fatal: 0 tolerated; always capture a full snapshot window (pre/post).
- Replay counter: indicates retransmit pressure; track vs throughput (X).
- Completion timeout: indicates blocked completions; track vs queue depth (X).
- ECRC errors: treat as integrity evidence; correlate with retries and resets (X).
- Poisoned TLP: indicates corrupted payload propagated upward; always log context (X).
- Device-missing events: must be 0 / 24h under stress (X).
- Reset counters: per type (PERST#/hot reset/FLR) with recovery time p95/p99 (X ms).
- Re-enumeration success: ≥ X% across X cycles, including power transitions.
- THEN: congestion / backpressure / software queueing is more likely than a raw channel defect.
- CAPTURE: replay + timeout trend vs throughput + queue depth (window X).
- NEXT: jump to software stack gating and queue strategy (H2-10); only later validate channel via compliance page.
- THEN: treat as integrity evidence requiring preservation and controlled reproduction.
- CAPTURE: AER class + ECRC/poison counters + state snapshot (pre/post window X).
- NEXT: validate with PHY/retimer and compliance workflows (links only; do not expand here).
- THEN: control-plane timing / recovery budgets are suspect (exit latency, readiness).
- CAPTURE: timestamped events + recovery time distribution + device-missing count (X).
- NEXT: jump to reset/power state-machine tuning (H2-8) and stack gating (H2-10).
- Timestamp + severity: AER class (correctable/non-fatal/fatal) and burst window ID (X).
- Counter snapshot: replay, completion timeout, ECRC, poison, reset counts (pre/post window X).
- Configuration snapshot: negotiated speed/width, power state (ASPM/L1SS), enabled features (ATS/PRI/SR-IOV) as on/off or counts.
- Load context: throughput, queue depth, and tail latency (p95/p99 placeholders X).
- Action taken: record-only / degrade / reset (type) + recovery time (X ms).
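The evidence fields above map naturally onto one structured record per event. Every key below is a placeholder name for the template, not a vendor schema:

```python
import json
import time

def make_snapshot(severity: str, counters: dict, config: dict,
                  load: dict, action: str) -> str:
    """One attribution record per AER event, serialized for retention."""
    record = {
        "ts": time.time(),
        "severity": severity,   # correctable / non-fatal / fatal
        "counters": counters,   # replay, completion_timeout, ecrc, poison, resets
        "config": config,       # speed/width, ASPM/L1SS state, features on/off
        "load": load,           # throughput, queue depth, p95/p99 latency
        "action": action,       # record-only / degrade / reset(<type>)
    }
    return json.dumps(record)
```

Keeping the pre/post snapshots in one fixed schema is what makes the “symptom → counters → likely layer” attribution repeatable across incidents and teams.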
H2-10 · Firmware & Software Stack (Firmware→OS→Driver→User-space)
Hardware capability is not an on/off switch; it is a chain. This chapter explains the ownership boundaries across firmware, OS, drivers, and user-space, then defines feature gating for ATS/PRI/SR-IOV and a field-safe upgrade/rollback strategy. The goal is a repeatable bring-up path: enumerate → map → interrupt → DMA → observe.
- Owns: baseline enumeration and resource allocation.
- Common failures: hidden devices due to gating or insufficient resource windows.
- Verify: device is visible and resources are consistent after resets (X).
- Owns: BAR mapping, interrupt routing, IOMMU policy and DMA mapping model.
- Common failures: blocked mapping or mis-grouped isolation domains (X).
- Verify: stable mappings and predictable recovery behavior across power transitions.
- Owns: queue model, doorbells, interrupt moderation, error reporting and recovery hooks.
- Common failures: suboptimal vector/queue mapping causing latency spikes (X).
- Verify: DMA path correctness and stable per-queue performance under load.
- Owns: workload behavior, SLA metrics, and observability consumption.
- Common failures: missing telemetry, causing SI-like misattribution (H2-9 linkage).
- Verify: per-device and per-tenant health KPIs remain within X under stress.
- Goal: reduce translation overhead and support controlled fault recovery.
- Required layers: platform policy + IOMMU support + driver enablement.
- Minimal test: ATS request works; invalidation works; controlled fault recovery converges.
- Pass: ATC hit-rate (X), PRI rate (X), fault recovery time ≤ X ms.
- Goal: tenantable device partitioning with per-VF observability.
- Required layers: firmware resource windows + driver provisioning + VF stack binding.
- Minimal test: stable VF enumeration; per-VF queue + per-VF MSI-X + per-VF counters.
- Pass: VF stability ≥ X% across resets; neighbor VFs unaffected by VF recovery.
- Pre-check: enforce a known-good compatibility matrix (firmware + PF + VF stack) with bounds (X).
- Rollout gates: stage by fleet fraction; watch device-missing (0/24h), AER fatal (0), and replay/timeout trends (X).
- Rollback triggers: any fatal increase, recovery time drift, or VF enumeration instability beyond X.
- Post-check: run a reset/power transition suite; confirm observability panel remains consistent (H2-9 linkage).
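The rollback triggers above can be codified so a staged rollout evaluates them mechanically. All metric and limit names are illustrative placeholders for the X values in the text:

```python
def should_rollback(metrics: dict, limits: dict) -> bool:
    """True if any rollback trigger fires: fatal AER, device-missing,
    recovery-time drift, or VF enumeration instability beyond bounds."""
    return (
        metrics["aer_fatal"] > 0
        or metrics["device_missing_24h"] > 0
        or metrics["recovery_p99_ms"] > limits["recovery_p99_ms"]
        or metrics["vf_enum_success_pct"] < limits["vf_enum_success_pct"]
    )
```

Evaluating this gate at each fleet-fraction stage keeps rollback a data-driven decision rather than a judgment call made during an incident.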
H2-11 · Engineering Checklist (Design → Bring-up → Production)
A deliverable-focused, three-gate checklist that locks “works in lab” into “field-operable”. Each item includes a Pass criteria placeholder (X) and practical example MPNs for common support parts (clock/reset/telemetry/config).
- ☐ Resource budget for BAR aperture, MSI-X vectors, queues, and DMA descriptors. Pass criteria: X% headroom across worst-case SKUs.
- ☐ Refclk strategy (SRNS/SRIS assumption, spread-spectrum tolerance, clock-domain ownership). Pass criteria: jitter budget meets X and remains stable across power states. Example MPNs (clocking): jitter cleaner Silicon Labs Si5341 / Si5345; clock generator Si5332; PCIe-grade clock buffer (example family) Renesas/IDT 9FGV series.
- ☐ Reset policy (PERST#/hot reset/FLR selection rules + recovery budgets). Pass criteria: recovery p95 ≤ X ms, p99 ≤ X ms. Example MPNs (reset supervision): TI TPS386000 / Maxim MAX16052.
- ☐ RAS policy (AER severity actions: record / degrade / reset; evidence retention schema). Pass criteria: fatal = 0; correctable rate bounded by X / min.
- ☐ Virtualization partition plan (PF/VF budget: queues, vectors, BAR windows, per-VF counters). Pass criteria: VF isolation holds under fault injection (X).
- ☐ Non-volatile config for board ID / straps / feature defaults (if required by design). Pass criteria: deterministic configuration across cold/warm boots. Example MPNs (EEPROM / GPIO): I²C EEPROM: Microchip 24AA02E64 / 24LC64; I²C GPIO expander: TI TCA9539.
- ☐ Enumerate: device visible and stable across warm reboot. Pass criteria: 0 device-missing in X cycles.
- ☐ Config & BAR: BAR sizing and mapping consistent after reset. Pass criteria: BAR layout stable; no overlaps; X% headroom.
- ☐ Interrupts: MSI-X routing correct for at least one queue (then scale). Pass criteria: interrupt count matches workload within X%.
- ☐ DMA: host memory read/write correctness under stress patterns. Pass criteria: 0 data mismatch across X GB transferred.
- ☐ IOMMU: DMA mapping correctness with faults observable. Pass criteria: faults detected and recovered within X ms.
- ☐ SR-IOV: VF enumeration stable; per-VF queue/interrupt/counters validated. Pass criteria: VF stability ≥ X% across reset cycles.
- ☐ ATS/PRI (if used): translation cache + invalidation + controlled page-fault recovery. Pass criteria: ATC hit-rate X; recovery ≤ X ms.
- Example MPNs (telemetry): temperature sensor for thermal correlation: TI TMP117 / ADI ADT7420.
- ☐ Telemetry always-on: AER severity trends, replay/timeout, ECRC/poison, reset counters. Pass criteria: fatal = 0; correctable bounded by X / min.
- ☐ Alert hygiene: thresholds, suppression, and action mapping (record/degrade/reset). Pass criteria: no alert storms; signal-to-noise ≥ X.
- ☐ Compatibility matrix: firmware + driver + feature gating combinations. Pass criteria: controlled rollout gates (X) and rollback triggers defined.
- ☐ Regression suite mapped to three gates (design assumptions, bring-up chain, production KPIs). Pass criteria: suite run-time ≤ X and catches known failure modes.
H2-12 · Applications & IC Selection (Controller / RC / EP)
Selection is driven by control-plane requirements (enumeration, MSI-X scale, DMA/IOMMU, ATS/PRI, SR-IOV, RAS) and operability (telemetry + recovery + compatibility matrix). Board-level SerDes/SI details are intentionally left to the PHY/Retimer/Compliance pages.
- Must-have: stable enumeration, large resource windows, MSI-X scale, AER/RAS, robust reset/power recovery.
- Nice-to-have: SR-IOV manageability, strong observability tooling, deterministic downgrade policy.
- Example MPNs (RC SoC families): NXP LS1046A, NXP LS2088A, Marvell CN9130
- Must-have: predictable BAR model, doorbells/queues, MSI-X mapping, correct DMA + fault visibility.
- Nice-to-have: SR-IOV capability (or partition equivalent), built-in counters for attribution.
- Example MPNs (FPGA with PCIe hard IP options): AMD Xilinx XCKU5P, AMD Xilinx XCVU9P, Intel Agilex AGF014, Microchip PolarFire MPF300T
- Must-have: large queue model + MSI-X scale, robust RAS evidence chain, stable recovery.
- Virtualization: SR-IOV (VF count, per-VF counters), IOMMU-aware DMA paths.
- ATS/PRI: consider when translation overhead dominates tail latency; validate invalidation + fault convergence.
- Example MPNs (endpoint controllers): Broadcom BCM57414 (PCIe NIC controller), Intel I225-LM (PCIe GbE controller), Phison PS5026-E26 (NVMe SSD controller), Phison PS5018-E18 (NVMe SSD controller), Silicon Motion SM2264 (NVMe SSD controller), InnoGrit IG5236 (NVMe SSD controller)
- Must-have: strong observability, predictable downgrade behavior, recovery budgets with zero device-missing.
- Go deeper (examples, not expanded here): PCIe switch: Broadcom PEX88096 / Microchip Switchtec PFX100x; PCIe redriver: TI DS80PCI810 (details belong to dedicated pages).
- Must-have: deterministic negotiation and stable operation after downgrade.
- Verify: negotiated speed/width stays stable across resets and power transitions (X).
- Risk: flapping links that appear SI-like but are policy/timeout driven.
- Must-have: per-queue interrupt mapping and measurable coalescing behavior.
- Verify: interrupt distribution and tail latency p95/p99 bounded (X).
- Risk: performance cliffs misdiagnosed as link quality issues.
- Must-have: correct DMA under IOMMU policy and observable faults.
- Nice-to-have: ATS/PRI when translation overhead dominates.
- Verify: invalidation works; fault recovery converges ≤ X ms.
- Risk: stale mappings and silent data corruption.
- Must-have: VF stability, per-VF counters, predictable reset domains.
- Verify: VF enumeration success ≥ X% across reset/power suites.
- Risk: tenant instability and un-debuggable incidents.
- Must-have: counters usable for attribution and retention fields for incident replay.
- Verify: correctable bounded; fatal = 0; recovery time distributions stable.
- Risk: SI vs control-plane ambiguity becomes unresolvable.
- Must-have: stable driver support and field-diagnostic hooks aligned to the telemetry panel.
- Verify: upgrade/rollback works with clear regression gates (X).
- Risk: features exist on paper but cannot be safely enabled in production.
- PHY / SerDes — link only (no SI expansion here).
- Retimer / Redriver — link only (channel extension details there).
- Compliance & Test Hooks — link only (PRBS/eye/jitter workflows there).
- Cabled PCIe / External Boxes — link only (connectors/cabling there).
H2-13 · FAQs (Field Troubleshooting, Structured Answers)
This section closes long-tail, on-site troubleshooting without introducing new domains. Each question uses the fixed structure: Likely cause / Quick check / Fix / Pass criteria (with quantified placeholders X).
Device shows in lspci, driver loads, but DMA reads stale data — first check cache maintenance or IOMMU mapping?
Likely cause: Cache coherency boundary is violated (missing flush/invalidate), or IOMMU/IOVA mapping is stale/wrong-direction for the buffer.
Quick check: Correlate data mismatches with (a) IOMMU fault logs and (b) buffer lifetime events (map/unmap). Compare results using a known-coherent buffer path (if available) versus a non-coherent path.
Fix: Enforce the correct cache maintenance at ownership handoff (CPU→Device flush, Device→CPU invalidate), and ensure mapping lifecycle is correct (no use-after-unmap, correct DMA direction, alignment to cache lines).
Pass criteria: 0 data mismatches over X GiB transferred; IOMMU faults = 0 during X minutes; p99 DMA completion latency ≤ X µs.
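Before debugging cache-maintenance sequencing, the ownership-handoff rules above can be reduced to a structural pre-check on the buffer itself. A minimal Python sketch, assuming a 64-byte cache line (verify against your SoC) and illustrative function/parameter names:

```python
CACHE_LINE = 64  # assumed cache-line size; check your SoC's actual value

def dma_buffer_issues(offset: int, length: int, direction: str) -> list:
    """Flag common DMA-buffer mistakes that surface as stale data.

    `offset` is the buffer's start address (or page offset), `length` its size,
    `direction` the intended DMA direction at map time.
    """
    issues = []
    if offset % CACHE_LINE:
        issues.append("start not cache-line aligned: partial-line corruption risk")
    if length % CACHE_LINE:
        issues.append("length not a cache-line multiple: tail shares a line")
    if direction not in ("to_device", "from_device", "bidirectional"):
        issues.append("unknown DMA direction: maintenance ops cannot be chosen")
    return issues

# A 4 KiB, line-aligned device->CPU buffer passes the structural check:
assert dma_buffer_issues(0x1000, 4096, "from_device") == []
# A buffer straddling a cache line is flagged:
assert dma_buffer_issues(0x1010, 100, "to_device") != []
```

Buffers that fail this check can corrupt neighboring data on invalidate even when the map/unmap lifecycle is otherwise correct, which is why alignment belongs in the Quick check before any flush/invalidate experiments.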
BAR is present but MMIO access hangs — first check ordering/posted write flush or completion timeout?
Likely cause: Posted writes are never flushed (no readback/ack path), or non-posted MMIO reads are blocked until completion timeout due to a stuck internal state / decode mismatch.
Quick check: Check RAS evidence: completion timeout counters/events, UR/CA status, and whether a read-after-write to the same region returns or stalls. Verify the device is in D0 and BAR aperture matches decode expectations.
Fix: Add an explicit flush/ack pattern (write → readback/doorbell-ack), correct BAR sizing/decoding, and align timeout/ordering policy to the device’s service guarantees (avoid indefinite waits; implement bounded retries).
Pass criteria: 0 MMIO hangs across X iterations; completion timeouts = 0 under load; p99 MMIO read latency ≤ X µs.
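The flush/ack pattern in the Fix can be sketched as a bounded write-then-readback helper. A Python mock, assuming the register reads back the written value (doorbell registers often do not, so substitute a status/ack read in practice); `write_reg`/`read_reg` stand in for real MMIO accessors:

```python
import time

class MmioTimeout(Exception):
    """Raised instead of waiting indefinitely on a stuck completion."""

def posted_write_then_flush(write_reg, read_reg, value, timeout_s=0.01):
    """Posted write, then force ordering with a bounded readback.

    A non-posted read to the same device pulls previously posted writes
    through ahead of it; the deadline replaces an indefinite wait.
    """
    write_reg(value)                       # posted: may sit in a write buffer
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if read_reg() == value:            # read flushes the posted write
            return
    raise MmioTimeout("readback did not complete within budget")

# Mock register backing store demonstrating the pattern:
reg = {"v": 0}
posted_write_then_flush(lambda v: reg.update(v=v), lambda: reg["v"], 7)
assert reg["v"] == 7
```

The point is the bounded deadline: a hang converts into a countable `MmioTimeout` event that feeds the Pass criteria instead of wedging the caller.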
SR-IOV VFs appear, but only PF has interrupts — MSI-X vector mapping or per-VF enable?
Likely cause: MSI-X is enabled only for PF, or VF vector tables / per-VF interrupt enables are not programmed, leaving VF queues without routable interrupts.
Quick check: Verify per-function MSI-X enable state and confirm VF interrupt counters increment when VF traffic runs. Check whether vector allocation is sufficient for VF queue count.
Fix: Allocate vectors per VF (or per VF-queue policy), enable MSI-X per VF, and validate queue→vector→CPU mapping is consistent with the device’s partition rules.
Pass criteria: VF interrupt rate matches workload within ±X%; 0 missed interrupt events over X minutes; PF and VF counters remain separated and stable.
ATS enabled but performance gets worse — ATC miss storm or invalidation overhead?
Likely cause: Translation cache thrashes (low ATC hit-rate), or frequent invalidations dominate, turning ATS into overhead rather than savings.
Quick check: Measure ATC hit-rate, ATS request rate, and invalidation frequency; correlate with p99 latency and throughput. A hit-rate collapse with rising invalidations indicates ATS is not benefiting this workload.
Fix: Increase mapping stability (longer-lived mappings, larger pages where valid), reduce invalidation churn, and gate ATS per workload. Apply caps on outstanding ATS requests if supported.
Pass criteria: ATC hit-rate ≥ X%; p99 latency does not regress by more than X% versus ATS-off; ATS request rate ≤ X/s at steady state.
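The per-workload gating decision above can be expressed as a simple threshold check over the measured counters. A Python sketch in which the thresholds stand in for the X placeholders:

```python
def ats_worth_enabling(atc_hits: int, atc_misses: int,
                       invalidations_per_s: float,
                       min_hit_rate: float = 0.90,
                       max_inval_rate: float = 1000.0) -> bool:
    """Gate ATS per workload from measured counters.

    Thresholds are illustrative placeholders (the X values); tune them
    against p99 latency measured with ATS off as the baseline.
    """
    total = atc_hits + atc_misses
    hit_rate = atc_hits / total if total else 0.0
    return hit_rate >= min_hit_rate and invalidations_per_s <= max_inval_rate

assert ats_worth_enabling(95_000, 5_000, 200.0) is True      # healthy cache
assert ats_worth_enabling(40_000, 60_000, 200.0) is False    # hit-rate collapse
assert ats_worth_enabling(95_000, 5_000, 50_000.0) is False  # invalidation storm
```

Running this gate per workload (rather than enabling ATS globally) matches the Fix: ATS stays off wherever the counters show it is overhead rather than savings.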
PRI requests spike then the device wedges — fault handling loop or missing backoff?
Likely cause: Page-fault handling does not converge (retries without backoff), or outstanding PRI requests overflow internal queues, leading to a deadlock-like wedge.
Quick check: Track PRI request rate, outstanding PRI depth, and fault-convergence time. If the same fault repeats and outstanding depth climbs until progress stops, backoff/limit is missing.
Fix: Add bounded retries with exponential backoff, cap outstanding PRI, and define a recovery action (abort, FLR, or controlled reset). Use a fallback mode when PRI cannot converge (disable PRI or pre-fault/pin critical ranges).
Pass criteria: PRI rate ≤ X/s; fault convergence p99 ≤ X ms; wedge events = 0 over X hours at peak load.
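The bounded-retry-with-backoff policy can be sketched as follows; the caps and delays are illustrative placeholders, and `resolve_fault` stands in for the real OS fault handler:

```python
import time

MAX_OUTSTANDING_PRI = 32   # cap; placeholder for the device's real queue depth
MAX_RETRIES = 6            # bounded retries before escalating to recovery

def handle_page_fault(resolve_fault, outstanding: int) -> str:
    """Bounded, backed-off PRI retry loop; returns the action taken.

    `resolve_fault()` returns True when the fault converges. Rejecting at the
    outstanding cap sheds load instead of letting the queue wedge.
    """
    if outstanding >= MAX_OUTSTANDING_PRI:
        return "reject"
    delay = 0.001
    for _ in range(MAX_RETRIES):
        if resolve_fault():
            return "resolved"
        time.sleep(delay)
        delay *= 2                     # exponential backoff between retries
    return "escalate"                  # FLR / controlled reset / disable PRI

assert handle_page_fault(lambda: True, outstanding=0) == "resolved"
assert handle_page_fault(lambda: False, outstanding=0) == "escalate"
assert handle_page_fault(lambda: True, outstanding=32) == "reject"
```

Every exit path maps to a countable outcome (resolved / rejected / escalated), which is what makes the wedge-events-per-hour Pass criteria measurable.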
Hot reset recovers link but functions disappear — resource re-allocation or FLR sequence?
Likely cause: Bus/resource windows are not reprogrammed after reset, or function-level reset (FLR) and configuration restore are ordered incorrectly, leaving functions unconfigured or hidden.
Quick check: Compare pre/post reset device list and resource map (BDF presence, BAR addresses, bridge windows). Confirm SR-IOV enable state and capability visibility is restored post-reset.
Fix: Use a deterministic recovery sequence: enumerate → program bridge windows → restore config → enable features (MSI-X/SR-IOV) → validate counters. Apply FLR only where the platform can reliably reconfigure afterward.
Pass criteria: 0 missing functions across X hot resets; recovery p99 ≤ X ms; post-reset configuration matches baseline within X%.
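The deterministic sequence in the Fix can be encoded so the ordering is enforced and the first broken step is reported, rather than discovered later as missing functions. A Python sketch with `executor` standing in for platform hooks:

```python
RECOVERY_STEPS = [
    "enumerate",
    "program_bridge_windows",
    "restore_config",
    "enable_msix",
    "enable_sriov",
    "validate_counters",
]

def run_recovery(executor) -> list:
    """Run the post-reset sequence in order, failing fast on the first bad step.

    `executor(step)` is a platform hook returning True on success; later steps
    never run after a failure, so the failure point is unambiguous.
    """
    done = []
    for step in RECOVERY_STEPS:
        if not executor(step):
            raise RuntimeError(f"recovery failed at {step}; later steps skipped")
        done.append(step)
    return done

assert run_recovery(lambda s: True) == RECOVERY_STEPS
```

Logging `done` plus the failing step gives the evidence needed to attribute "functions disappeared" to a specific stage (window programming vs. config restore vs. feature enable).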
ASPM saves power but devices randomly vanish — L1SS exit latency vs watchdog timeout?
Likely cause: L1SS exit latency exceeds a software/hardware watchdog or service deadline, triggering timeouts that cascade into device removal or repeated re-enumeration.
Quick check: Correlate disappearance events with power-state transitions (ASPM/L1SS entry/exit). Check link retrain/down events and whether timeouts spike immediately after low-power exit.
Fix: Disable L1SS first to confirm causality, then re-enable with conservative settings (latency budgets aligned to timeouts). Ensure required clocks/aux power remain valid across low-power states and extend service deadlines where needed.
Pass criteria: 0 device-missing events over 24h; exit latency p99 ≤ X ms; retrain/down events ≤ X per day at steady workload.
Correctable AER floods under load but eye looks clean — replay/timeout due to backpressure?
Likely cause: Congestion/backpressure drives replay or completion deadlines (timeouts) rather than a physical-layer integrity issue; software queue starvation can amplify the symptom.
Quick check: Compare correctable AER rate against replay counters, completion timeout counters, and queue depth/CPU service time. If error rate tracks throughput and queue pressure, root cause is control-plane congestion.
Fix: Reduce outstanding requests, tune request sizes (MPS/MRRS policy), and fix queue servicing (avoid starvation). If supported, increase buffering/credits and enforce bounded backpressure handling.
Pass criteria: Correctable AER rate ≤ X/min at peak load; replay ratio ≤ X%; completion timeouts = 0; throughput drop ≤ X%.
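The counter correlation in the Quick check can be automated as a coarse attribution rule. A Python sketch; the thresholds are illustrative heuristics, not fixed limits:

```python
def attribute_correctable_errors(aer_per_min: float, replay_per_min: float,
                                 timeouts_per_min: float,
                                 queue_depth_p99: float,
                                 queue_depth_budget: float) -> str:
    """Coarse first-pass attribution: congestion vs. physical layer.

    If errors track backpressure signals (timeouts firing, queues over
    budget, replay dominating the AER rate), investigate the control plane
    before touching the channel.
    """
    congested = timeouts_per_min > 0 or queue_depth_p99 > queue_depth_budget
    if congested and replay_per_min >= aer_per_min * 0.5:
        return "control-plane congestion"
    return "investigate physical layer"

assert attribute_correctable_errors(120, 100, 5, 900, 512) == "control-plane congestion"
assert attribute_correctable_errors(120, 10, 0, 100, 512) == "investigate physical layer"
```

The rule is deliberately conservative: it only diverts attention from the physical layer when multiple congestion signals agree, keeping the SI-vs-control-plane attribution falsifiable.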
VFIO passthrough works until reboot — firmware resource window or IOMMU group shift?
Likely cause: Firmware resource allocation changes across boot (BAR windows/decoding policy), or the device’s isolation boundary changes (IOMMU group shifts with topology/ACS policy), breaking passthrough assumptions.
Quick check: Compare boot-to-boot snapshots: BDF stability, BAR addresses, bridge windows, and group identity. If identity or resources change without hardware changes, firmware policy is the primary suspect.
Fix: Lock required firmware settings (resource decode, enumeration stability, isolation policy), persist device mode settings if applicable, and add a boot-time validation gate that fails fast when the resource map deviates.
Pass criteria: 0 passthrough regressions across X reboots; BDF/resource map stable within X; IOMMU group identity stable across X reboots.
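The boot-to-boot comparison can be scripted as a snapshot diff keyed by BDF, feeding the fail-fast validation gate. A Python sketch with illustrative field names:

```python
def snapshot_diff(prev: dict, curr: dict) -> list:
    """Compare boot-to-boot device snapshots keyed by BDF.

    Each value is a per-device dict (e.g. {"bar0": ..., "iommu_group": ...});
    any change without a hardware change points at firmware policy drift.
    """
    changes = []
    for bdf in prev.keys() - curr.keys():
        changes.append(f"{bdf}: missing after reboot")
    for bdf in curr.keys() - prev.keys():
        changes.append(f"{bdf}: newly appeared")
    for bdf in prev.keys() & curr.keys():
        for field in prev[bdf]:
            if prev[bdf].get(field) != curr[bdf].get(field):
                changes.append(f"{bdf}: {field} changed")
    return changes

boot1 = {"0000:01:00.0": {"bar0": 0xF000_0000, "iommu_group": 12}}
boot2 = {"0000:01:00.0": {"bar0": 0xE000_0000, "iommu_group": 12}}
assert snapshot_diff(boot1, boot1) == []
assert snapshot_diff(boot1, boot2) == ["0000:01:00.0: bar0 changed"]
```

Wiring a non-empty diff to a boot-time failure (rather than a log line) is what makes the gate "fail fast": passthrough never starts against a drifted resource map.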
Gen negotiation downgrades unexpectedly — capability gating or firmware policy?
Likely cause: Platform policy caps speed/width, or repeated training/retrain events trigger conservative downgrade; capability exchange assumptions (e.g., clocking expectations) can also force fallback.
Quick check: Record negotiated speed/width versus expected target and track retrain counts. If the negotiated state changes after retries or power transitions, policy gating or training instability is implicated.
Fix: Align capability gating across firmware/driver, define an explicit downgrade policy (with alerts), and remove hidden caps. Require a “stable negotiated state” checkpoint before declaring link-ready.
Pass criteria: Negotiated speed/width stable across X reset/power cycles; unexpected downgrades = 0/24h; retrains ≤ X/day under steady workload.
MSI-X works but CPU usage explodes — interrupt moderation/affinity missing?
Likely cause: Interrupt rate is effectively “per event” without moderation/coalescing, or queue→CPU mapping causes hot-core overload and excessive context churn.
Quick check: Measure interrupts/sec per queue and packets (or completions) per interrupt, plus CPU distribution across cores. A low packets-per-interrupt ratio with skewed core load indicates missing moderation/affinity.
Fix: Enable interrupt moderation/coalescing, batch doorbells/completions, tune queue count to match core budget, and enforce stable queue→vector→CPU mapping for predictable load distribution.
Pass criteria: CPU usage reduced by ≥ X% at target throughput; interrupt rate ≤ X/s; p99 latency ≤ X µs; throughput drop ≤ X%.
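The packets-per-interrupt reasoning can be turned into a quick estimate of steady-state interrupt rate under count/timer coalescing. A simplified Python model (real devices layer adaptive moderation on top of this):

```python
def expected_interrupt_rate(events_per_s: float, coalesce_count: int,
                            coalesce_usecs: float) -> float:
    """Estimate steady-state interrupt rate under count/timer coalescing.

    An interrupt fires when `coalesce_count` events accumulate, or when
    `coalesce_usecs` elapses with events pending, whichever happens first;
    there is never more than one interrupt per event.
    """
    if events_per_s <= 0:
        return 0.0
    by_count = events_per_s / coalesce_count
    by_timer = 1e6 / coalesce_usecs
    return min(events_per_s, max(by_count, by_timer))

# 2M completions/s with 64-event coalescing: the count threshold dominates.
assert expected_interrupt_rate(2_000_000, 64, 50) == 31_250.0
# At 100 events/s the timer fires per pending event: one interrupt each.
assert expected_interrupt_rate(100, 64, 50) == 100.0
```

Comparing this estimate against the measured interrupts/sec per queue shows immediately whether moderation is actually engaged or the device is interrupting per event.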
Completions time out only on large transfers — MRRS/MPS mismatch or software queue starvation?
Likely cause: Large transfers trigger excessive split completions (MRRS/MPS policy mismatch) or overwhelm completion buffering; software queue starvation/backpressure can make completions miss deadlines.
Quick check: Correlate timeout events with transfer size; record MRRS/MPS settings and outstanding read depth. If timeouts scale with size and outstanding depth, split/completion pressure is the primary driver.
Fix: Set compatible MRRS/MPS policy, cap outstanding reads, and ensure software queues cannot starve completion service. If needed, split transfers at the software layer and align completion timeouts to worst-case service time.
Pass criteria: 0 completion timeouts over X TB transferred; p99 completion latency ≤ X µs; replay/timeout counters remain bounded (≤ X/min) at peak load.
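The split-completion pressure can be lower-bounded from MRRS/MPS alone. A Python sketch that deliberately ignores RCB-boundary splits, so real completion counts can be higher than this bound:

```python
from math import ceil

def tlp_counts(transfer_bytes: int, mrrs: int, mps: int) -> tuple:
    """Lower-bound TLP counts for a DMA read, ignoring RCB boundary splits.

    Returns (read_requests, completion_tlps): each read request covers at
    most MRRS bytes, and each completion payload is at most MPS bytes.
    """
    reads = ceil(transfer_bytes / mrrs)
    completions = ceil(transfer_bytes / mps)
    return reads, completions

# A 1 MiB read with MRRS=512 and MPS=256 needs 2048 requests, >=4096 completions:
assert tlp_counts(1 << 20, 512, 256) == (2048, 4096)
# Raising MRRS to 4096 cuts requests 8x, but completion pressure is unchanged:
assert tlp_counts(1 << 20, 4096, 256) == (256, 4096)
```

The second case illustrates why timeouts that scale with transfer size often respond to capping outstanding reads rather than raising MRRS: completion-side pressure is governed by MPS, not MRRS.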