FPGA Accelerator Card: Power, JTAG, Clocks, PCIe/CXL
An FPGA accelerator card succeeds in the data center only when its four planes—power, configuration/control, clocks, and PCIe/CXL ties—are engineered as a single measurable boot-and-train sequence, not as isolated blocks. Reliable bring-up comes from card-side observability (PG/FAULT, boot states, lock signals, LTSSM groups, and update/rollback evidence) that turns “card not detected” and “intermittent failures” into repeatable triage steps.
Boundary: FPGA endpoint card scope, evidence, and link-only neighbors
The fastest way to ship a reliable FPGA accelerator card is to lock the scope early: what is owned on the card (power/config/clock/PCIe-CXL bindings) and what is only referenced (platform fabric, rack power, management stack). This prevents “looks fine” bring-up loops caused by mixing system-level topics into card-level debugging.
Scope Guard (mechanically checkable)
Link-only neighbors (mention and link, no expansion): PCIe Switch/Retimer · 48V/12V Bus & Hot-Swap · BMC · Time Card · In-band Telemetry & Power Log.
What this page solves (reader tasks)
- Define a card-level power tree (rails, dependencies, sequencing) that stays stable across temperature and workload transients.
- Turn PG/FAULT signals into a deterministic triage flow (symptom → first signal → most likely plane).
- Design a safe configuration path: boot modes, flash partitions, update/rollback, and “recoverable by JTAG” guarantees.
- Keep on-card clocks lock-stable and jitter-contained so link training does not drift into intermittent failures.
- Bring up PCIe/CXL endpoints using only card-owned evidence (PERST#/CLKREQ#, PLL_LOCK, DONE, LTSSM states).
Architecture: unify bring-up and debugging with four planes
A stable FPGA accelerator card behaves like a coordinated system, not a set of independent blocks. The “four planes” model turns complex failures into deterministic triage: identify the plane that holds first evidence, then check the coupling points where failures propagate across planes.
Four planes (each plane includes: signals → evidence → failure signature)
| Plane | Critical signals / resources | What to observe first | Typical failure signature |
|---|---|---|---|
| Data (PCIe / CXL) | High-speed lanes, lane mapping, PERST#, CLKREQ#, capabilities | Enumeration result, LTSSM state, link width/speed, error counters | Not detected, downshift, intermittent drop under load |
| Power (multi-rail) | Power tree, sequencing dependencies, PG/FAULT, margin controls | PG timing, rail droop/ripple evidence, fault codes, retry behavior | Boot flakiness, "works cold" but fails hot, configuration fails mid-way |
| Control (config & debug) | QSPI/OSPI flash, boot-mode straps, JTAG chain, SMBus/I3C hooks | Config progress/DONE, flash status, version/rollback state | Update bricks the card, stuck init, large lot-to-lot variation |
| Clock (refclk & PLL) | Refclk input, PLL/lock indicator, clock fanout, sensitive rails | PLL_LOCK stability, refclk presence, drift vs. temperature | Training unstable, failures appear after warm-up or noise events |
Practical rule: when a failure looks like “data plane”, check clock and power coupling first; when a failure looks like “control plane”, check power integrity during configuration.
Coupling points (where root causes hide)
| Coupling | Why it matters | First evidence to check |
|---|---|---|
| Power ↔ Clock | Clock IC and transceiver supplies amplify rail noise; lock may look “OK” but jitter worsens link margin. | PLL_LOCK stability, sensitive rail PG/FAULT, temperature trend |
| Clock ↔ Data | Refclk quality and lock timing gate training and stability; weak margin shows as downshift or retrain events. | LTSSM transitions, link speed/width shifts, error bursts |
| Power ↔ Control | Configuration is a “stress test”: flash reads/writes and FPGA init can expose droop or sequencing gaps. | Config progress stalls, fault codes during init window, rail droop evidence |
| Control ↔ Data | Bitstream/version determines endpoint behavior; mismatch can look like lane or training faults. | Device ID, image selection (A/B), capability reporting consistency |
Bring-up “first hour” checklist (5 steps)
- Power plane: all required PG signals asserted in order; no silent retries or latched faults.
- Clock plane: refclk present; PLL_LOCK is stable across a short warm-up window.
- Control plane: configuration reaches DONE; image selection/rollback state is known.
- Data plane: PERST# release timing is sane; LTSSM reaches link-up with expected width/speed.
- Coupling check: stress (temp or load) does not flip any of: PLL_LOCK, PG/FAULT, LTSSM state.
Diagram reading tip: failures rarely stay inside one plane; the first stable evidence usually appears in power/clock signals before the data plane collapses.
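The five-step checklist above can be sketched as a single gating function. This is an illustrative sketch, not a real driver API: the snapshot field names (`pg_asserted`, `pll_lock_stable`, `cfg_done`, `ltssm`, `stress_flipped_signals`) and the required rail set are assumptions for the example.

```python
# Hypothetical "first hour" bring-up gate. All field names are illustrative.
REQUIRED_PG = {"VCCINT", "AVCC", "MGMT"}  # example critical-rail set

def first_hour_check(snap: dict) -> list:
    """Return the list of failed checklist steps; an empty list means all five gates pass."""
    failures = []
    # Power plane: all required PG signals asserted
    if not REQUIRED_PG.issubset(snap.get("pg_asserted", set())):
        failures.append("power: missing PG")
    # Clock plane: lock stable across a warm-up window, not a momentary edge
    if not snap.get("pll_lock_stable", False):
        failures.append("clock: PLL_LOCK not stable over warm-up window")
    # Control plane: configuration reached DONE
    if not snap.get("cfg_done", False):
        failures.append("control: configuration did not reach DONE")
    # Data plane: LTSSM reached link-up
    if snap.get("ltssm") != "L0":
        failures.append("data: LTSSM did not reach link-up (L0)")
    # Coupling check: stress must not flip any gating signal
    if snap.get("stress_flipped_signals"):
        failures.append("coupling: stress flipped " + ",".join(snap["stress_flipped_signals"]))
    return failures
```

The return value doubles as evidence: the first entry names the plane that holds the first failure, which is where triage should start.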
Power tree and rail budgeting: derive rails from behaviors
An FPGA accelerator card power tree is a behavior model: rails are defined by what must stay stable during configuration, link training, and sustained workloads. Budgeting must capture both average and peak currents, plus noise sensitivity and dependencies that can turn “normal readings” into intermittent failures.
Rail families (card scope, no CPU VRM theory)
- Core / VCCINT: highest dynamic current; transient integrity gates overall stability and init robustness.
- Transceiver AVCC / RX / TX: noise- and jitter-sensitive domain; rail quality strongly impacts link margin and training stability.
- I/O banks: multiple voltage domains; mis-assigned bank voltages often show up as “works in some modes, fails in others”.
- DDR/HBM rails (if present): treat as a separate domain with isolation and clean bring-up windows; keep the discussion at power-domain level.
- AUX / 1.8V / 3.3V management: always-on/standby domain that keeps control/telemetry/rollback observable even when the main domain fails.
Budget workflow (practical steps)
- Start from a workload power profile (idle / configure / train / sustained) and map power to domains.
- Convert power to Iavg and Ipeak per rail; treat configuration and training windows as stress events.
- Add margin for process tolerance, temperature rise, aging, and worst-case lane width/speed targets.
- Flag noise-critical rails (especially transceiver supplies) and reserve routing/decoupling budget early.
- Close the loop with thermal planning: hottest components are often regulators/inductors, not only the FPGA die.
The next chapter (sequencing) turns these rails into a deterministic state machine with evidence (PG/FAULT) for triage.
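The power-to-current conversion in the workflow above is simple arithmetic, sketched below. The voltage, wattage, and flat 25% margin are placeholder assumptions; a real budget applies per-rail margins derived from tolerance, temperature, and aging analysis.

```python
# Illustrative rail-budget arithmetic: convert per-domain power to Iavg/Ipeak
# with a design margin. All numbers are placeholders, not recommendations.

def rail_budget(p_avg_w: float, p_peak_w: float, v_nom: float, margin: float = 1.25):
    """Return (Iavg, Ipeak) in amps with a flat design margin applied."""
    i_avg = p_avg_w / v_nom * margin
    i_peak = p_peak_w / v_nom * margin
    return i_avg, i_peak

# Example: a hypothetical 0.85 V core domain at 40 W average / 70 W peak
i_avg, i_peak = rail_budget(40.0, 70.0, 0.85)
```

Treat the configuration and training windows as separate (`p_peak_w`-class) stress events rather than folding them into the average.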
Rail checklist (definition → evidence → failure signature)
| Rail family | Target (typ.) | Iavg / Ipeak | Noise sensitivity | PG condition & dependency | Typical failure signature |
|---|---|---|---|---|---|
| Core (VCCINT) | Low-V | High / Very high | Med (droop-sensitive) | PG with debounce; must be stable before init | Boot flakiness, config stalls, resets under burst |
| Transceiver (AVCC/RX/TX) | Low-V | Med / High | High (jitter coupling) | PG tied to PLL/link windows; rail must be quiet | Training unstable, downshift, intermittent drop after warm-up |
| I/O banks | Multi-V | Low / Med | Med | PG per bank group; depends on mode/straps | Mode-specific failures, marginal GPIO/sideband behavior |
| Memory domain (if present) | Multi-V | Med / High | Med | Separate PG; sequenced to avoid init collisions | Init-time errors, intermittent during traffic spikes |
| Management (AUX/1.8/3.3) | Low/Med-V | Low / Low | Low | Always-on PG; keeps telemetry and recovery reachable | “No evidence” failures: cannot read faults, no recovery path |
Multi-rail PMIC sequencing: order, ramp, dependencies, and observability
Sequencing is the card’s control system. The goal is not only “power on”, but “power on with evidence”: each transition is gated by measurable conditions (PG, PLL_LOCK, PERST#, config progress) and protected by timeouts and fault policies that prevent oscillating retries.
Sequencing essentials (three levers)
- Order: always-on management rails first; main rails next; sensitive transceiver rails enter only when prerequisites are stable.
- Ramp: soft-start slope must avoid both droop (too fast) and watchdog-style timeouts (too slow).
- Dependencies: tie key transitions to PG, PLL_LOCK, and PERST# so training never starts on unstable foundations.
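The three levers above can be modeled as an evidence-gated state machine: each stage advances only when its gate condition is measurably true, and the first failing stage is latched as the “stopped here” marker. The state names and gate fields below follow this chapter’s model but are otherwise hypothetical.

```python
# Minimal sketch of sequencing as an evidence-gated state machine.
# Gate field names (pg_mgmt, pg_core, pg_xcvr, pll_lock_stable) are illustrative.

SEQUENCE = [
    ("MGMT_RAILS",    lambda ev: ev["pg_mgmt"]),                            # always-on first
    ("MAIN_RAILS",    lambda ev: ev["pg_core"]),                            # main rails next
    ("XCVR_RAILS",    lambda ev: ev["pg_core"] and ev["pg_xcvr"]),          # sensitive rails last
    ("PLL_LOCK",      lambda ev: ev["pll_lock_stable"]),                    # windowed lock
    ("PERST_RELEASE", lambda ev: ev["pll_lock_stable"] and ev["pg_xcvr"]),  # never train on unstable base
]

def advance(ev: dict) -> str:
    """Return the first state whose gate fails, or 'TRAIN' if every gate passes.
    Latching where progress stopped is the evidence used for triage."""
    for state, gate in SEQUENCE:
        if not gate(ev):
            return state
    return "TRAIN"
```

A timeout on any stage should latch the same state name, so “sequencing stalled” is always attributable to one gate.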
PMIC capabilities mapped to field outcomes
- Programmable soft-start: converts “ramp quality” into a controllable parameter, reducing lot-to-lot variance.
- PG/FAULT aggregation: turns intermittent failures into deterministic evidence (latched codes + timestamps).
- Margining: validates guardband for production, temperature drift, and aging without changing the nominal operating point.
- Retry policy: controlled retry/backoff prevents rail oscillation that can worsen link and configuration stability.
Fault policy should distinguish “hard stop” vs “retry/derate” vs “log-only” to keep cards observable without masking real damage.
Observability rules (failures must leave evidence)
- Latch the first fault source (rail, class, time window) before a retry clears the symptom.
- Record the transition where progress stopped (RAIL_RAMP, PLL_LOCK, INIT, TRAIN) to prevent blind debugging.
- Keep the management domain alive so telemetry and recovery remain reachable after main-rail shutdown.
Implementation guidance: expose PG/FAULT and PLL_LOCK as readable, latched evidence; keep the management domain alive so faults remain accessible after shutdown.
PG/FAULT design: turning “looks powered but unstable” into a deterministic triage path
Many field issues are not “wrong voltage,” but incorrect power-good definition and fault policy. Robust cards treat PG as a gate (with threshold, debounce, and delay) and treat FAULT as evidence (latched cause, state, and window), so symptoms map to a short, repeatable debug path.
PG design essentials (threshold · debounce · delay)
- Threshold: set PG around the worst-case operating window, not only the nominal rail value.
- Debounce: prevent transient dips from triggering oscillating retries; match debounce to soft-start and load-step behavior.
- Delay/settling: PG asserted does not mean “quiet enough” for training/config; reserve a settling window for sensitive domains.
Multi-PG aggregation (AND/OR rules that prevent false “ready”)
- HARD_OK (gate to INIT/TRAIN): critical rails combined with AND (Core + Transceiver + Mgmt reachable).
- SOFT_OK (run but log/derate): optional rails or secondary domains do not block progress, but must create events.
- LOG_ONLY (evidence only): near-threshold conditions (margin/temperature) should be recorded without causing reset storms.
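The HARD_OK / SOFT_OK / LOG_ONLY rules above amount to a small aggregation function. The rail-class assignments below are examples only; a real design encodes them in the supervisor or PMIC configuration.

```python
# Sketch of multi-PG aggregation. Rail-class membership is illustrative.
HARD = {"VCCINT", "AVCC", "MGMT"}   # AND-combined; gates INIT/TRAIN
SOFT = {"MEM"}                      # run, but every miss must create an event
LOG_ONLY = {"MARGIN_WARN"}          # record only; must never cause a reset storm

def aggregate(pg_and_flags: set, events: list) -> bool:
    """Return True when HARD_OK is satisfied; append events for SOFT/LOG conditions."""
    for rail in SOFT - pg_and_flags:            # missing optional rail -> event, not block
        events.append(("SOFT_FAIL", rail))
    for flag in LOG_ONLY & pg_and_flags:        # near-threshold condition -> evidence only
        events.append(("LOG", flag))
    return HARD <= pg_and_flags                 # false "ready" is impossible: AND over HARD
```

The key property is that SOFT and LOG conditions change the event stream, never the HARD_OK gate.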
The most common “powered but unstable” failure mode is entering training while a sensitive rail is not yet quiet or a prerequisite is not truly stable.
Fault-to-symptom mapping (first signals → suspect domains → logs)
| Symptom | First signals to check | Likely suspect domain | Associated evidence (logs/fields) |
|---|---|---|---|
| Not enumerated / training fails | HARD_OK composite, PERST#, LTSSM state | Transceiver rails, clock prerequisites | Last state, retry count, LTSSM snapshot, PG settle time window |
| Link drops under stress | Transient FAULT, rail droop indicators, temperature point | Core rail transient integrity, transceiver noise coupling | Peak-current window, FAULT latch time, temperature at event, error counters |
| Bitstream fails mid-configuration | Config progress, flash read integrity flags | Management + config domain stability | Config phase marker, integrity check result, brownout-style events |
| Intermittent reset after warm-up | Thermal threshold events, repeated retries | Regulator/inductor hotspots, protection thresholds | Over-temp flags, hysteresis/derate state, last-good timestamp |
Keep management rails alive during faults so evidence remains readable after main-rail shutdown.
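The fault-to-symptom table can be shipped as data, so triage tooling always starts from the same first signals. The symptom keys and signal names below mirror the table and are otherwise illustrative.

```python
# Symptom -> first-signals lookup, mirroring the fault-to-symptom table above.
TRIAGE = {
    "not_enumerated":     ["HARD_OK composite", "PERST#", "LTSSM state"],
    "drops_under_stress": ["transient FAULT", "rail droop indicators", "temperature point"],
    "cfg_fails_midway":   ["config progress", "flash read integrity flags"],
    "reset_after_warmup": ["thermal threshold events", "repeated retries"],
}

def first_signals(symptom: str) -> list:
    """Return the ordered first-check list for a symptom; unknown symptoms
    fall back to collecting a generic signature snapshot."""
    return TRIAGE.get(symptom, ["signature snapshot (rails/temps/version)"])
```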
Configuration architecture: boot modes, flash, fallback, and safe updates
Configuration reliability is a system problem: stable prerequisites (PG + clock lock), an integrity-checked boot source, and an update pipeline that can always fall back to a known-good image. The objective is to prevent “one update bricks the card” while keeping a minimal recovery path.
Milestones from power-on to DONE (vendor-agnostic)
- Prerequisites ready: HARD_OK PG composite + stable clock lock window.
- Boot source selected: straps/mode pins define which image is attempted.
- Read + integrity check: CRC/ECC/signature gates the transition to load.
- Load + progress markers: record the phase where progress stops (evidence for H2-5 triage).
- DONE / ready: only then enable downstream training and full workloads.
Config flash selection (what usually breaks in the field)
- Capacity planning: Golden + Update images, metadata, version tags, and rollback records.
- Speed vs window: longer load windows increase exposure to thermal drift and transient power events.
- Integrity features: ECC/CRC and signed manifests prevent silent corruption from becoming runtime instability.
- Write endurance: update cadence + wear leveling strategy must avoid “hot-sector death”.
- Power-loss safety: update steps must be transactional; incomplete writes must never replace Golden.
Fallback model (Golden / Update) and rollback conditions
- Golden image: minimal, validated, and protected; used as the guaranteed recovery anchor.
- Update image: versioned and verified; activated only after integrity and health gates.
- Rollback triggers: integrity failure, repeated init/train failures, crash loops, or mismatch against board ID/revision policy.
- Minimal recovery path: keep a service mode available (e.g., JTAG path) even when flash is unreadable.
Keep the update pipeline separate from the “bootable anchor” so failures remain recoverable by design.
Update safety principle: never overwrite the recovery anchor; treat each transition as a gate (verify + health) and always latch the rollback reason.
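The update safety principle above can be sketched as a transaction that only ever writes the Update slot. Slot names, gate fields (`signature_ok`, `health_ok`), and the rollback-reason string are assumptions for illustration; the invariant being demonstrated is that Golden is never a write target.

```python
# Hedged sketch of a Golden/Update (A/B) transition. The recovery anchor
# ("golden") is never overwritten; all field names are illustrative.

def apply_update(slots: dict, new_image: dict) -> dict:
    """Write only the Update slot; activate it only after verify + health gates.
    A power cut mid-update leaves an uncommitted Update slot and a bootable Golden."""
    assert "golden" in slots, "recovery anchor must exist before any update"
    slots["update"] = dict(new_image, committed=False)   # transactional write
    if new_image.get("signature_ok") and new_image.get("health_ok"):
        slots["update"]["committed"] = True
        slots["active"] = "update"
    else:
        slots["active"] = "golden"                       # deterministic rollback
        slots["rollback_reason"] = "verify_or_health_gate_failed"  # latched evidence
    return slots
```

Note that the rollback reason is latched in the same step that selects Golden, matching the “always latch the rollback reason” rule.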
JTAG & board-level debug: bring-up, boundary scan, and production observability
JTAG is the lowest-level access path when the card cannot boot into a higher-layer workflow. A robust design treats JTAG as an engineering deliverable: it supports early bring-up, shortens “powered but not enumerated” debug loops, and provides boundary-scan coverage for manufacturing.
Three JTAG use cases by lifecycle stage
- Bring-up: validate power/clock/config prerequisites using chain visibility (ID response) and minimal access steps.
- Debug: isolate the breakpoint for “powers on but cannot enumerate” by checking reset release, configuration readiness, and chain continuity.
- Production test: use boundary scan to catch solder opens/shorts on critical sideband nets and essential connectivity.
JTAG chain design (multi-device chains)
- Reset interaction: avoid keeping the chain permanently held by board reset policy (TRST# / reset domain alignment).
- TCK strategy: support safe low-speed bring-up and higher-speed production modes without marginal timing.
- Voltage domains: ensure the access port matches device IO levels; include protection and reference rails if required.
- Isolation / bypass: provide a practical way to bypass segments to localize a broken section in the field.
- Mechanical access: the connector or pads must remain reachable with a fixture without removing thermal or airflow hardware.
JTAG access checklist (deliverable)
| Item | What to specify |
|---|---|
| Access location | Header/pads placement, fixture approach direction, ground reference proximity |
| Minimum signals | TCK, TMS, TDI, TDO, TRST# (if used), GND, and a reference for the IO voltage level domain |
| Protection | ESD handling, series resistors, and safe behavior under accidental mis-connection |
| Level domain | IO voltage compatibility and any translation/isolation policy for mixed-voltage chains |
| Bypass points | Jumpers/resistor options to isolate chain segments for fault localization |
| Production fixture | Probe/pogo requirements, alignment features, and test-time constraints |
Boundary scan focus: prioritize essential sideband connectivity (reset, strap-like signals, management busses, and the JTAG path itself).
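The “isolation / bypass” point above implies a simple localization procedure: read the chain at low TCK and compare what responds against the expected device order. The device names below are hypothetical; real chains compare 32-bit IDCODE values.

```python
# Hypothetical chain-continuity check for fault localization.
EXPECTED = ["FPGA", "CPLD", "FLASH_BRIDGE"]   # illustrative chain order

def locate_break(read_ids: list) -> str:
    """Return 'chain OK' or a message naming the first divergent or missing position."""
    for i, (exp, got) in enumerate(zip(EXPECTED, read_ids)):
        if exp != got:
            return f"break or wrong device at position {i} (expected {exp})"
    if len(read_ids) < len(EXPECTED):
        return f"chain truncated after position {len(read_ids) - 1}"
    return "chain OK"
```

Combined with bypass jumpers, a “truncated after position N” result points directly at the segment between device N and N+1.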
Clocking & synthesis: refclk sources, PLL/jitter cleaners, and distribution
Card-level clocking must be reliable to lock, controlled for jitter, and explainable across temperature. The clock tree is part of the bring-up and power-state definition: training should only proceed after a stable lock window is reached and recorded.
Clock tree objectives (engineering KPIs)
- Lock robustness: deterministic lock after power transitions and resets, with clear lock visibility.
- Jitter control: preserve margin for the most sensitive links (PCIe/CXL SerDes reference paths).
- Thermal explainability: temperature-driven failures must map to lock state, noise coupling, and measurable checkpoints.
Key design points (card-level only)
- Refclk sourcing: host-provided vs on-card sources require an explicit “valid” policy and a clear fallback behavior.
- Power dependencies: PLL/jitter-cleaner analog rails must be stable before lock is considered meaningful.
- Distribution: fanout topology and domain partitioning reduce noise injection into sensitive clock paths.
- Common pitfall: PG passes but lock is not stable; training starts too early or drifts after warm-up.
Clock checklist (deliverable)
| Item | What to record or verify |
|---|---|
| Refclk spec | Frequency, tolerance window, and the “valid” indication used by the card |
| Lock indicator | PLL_LOCK signal policy, stable-window definition, and where it gates link training |
| Power dependency | Analog/digital rail readiness prerequisites and sequencing tie-ins |
| Distribution plan | Fanout stages, endpoint mapping, and sensitive-path isolation rules |
| Measurement points | Suggested test points for refclk, post-cleaner clock, and endpoint clocks |
| Thermal behavior | Lock-loss counters, last lock-loss time marker, and correlation to temperature windows |
Common failure pattern: power-good passes while lock is not stable; gating must use a stable window, not a momentary edge.
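“Stable window, not momentary edge” reduces to requiring lock to be continuously true for a hold window before training is allowed. The sample representation and window length below are placeholder assumptions.

```python
# Sketch of windowed lock gating: a momentary lock edge is not enough.
# `samples` is a periodic capture of the PLL_LOCK signal; window length is a placeholder.

def lock_stable(samples: list, window: int = 10) -> bool:
    """True only if the most recent `window` consecutive samples all show lock."""
    return len(samples) >= window and all(samples[-window:])
```

Any lock-loss inside the window restarts it, which is exactly the behavior a momentary-edge gate misses.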
PCIe / CXL PHY ties: lane mapping, sidebands, reset & training observability
Endpoint-card integration is defined by hard ties: lane mapping and polarity decisions, sideband wiring and timing, and a minimal observability path for training stability. The goal is not just enumeration, but repeatable training with fast isolation when failures appear under load or temperature.
Endpoint-card binding requirements (no retimer deep-dive)
- Lane mapping & polarity: document connector lanes → board nets → FPGA SerDes channels; polarity inversion must be explicit and consistent.
- Sidebands: PERST#, CLKREQ#, and WAKE# (if used) must align with power-good and clock stability windows.
- Training observability: use LTSSM grouping and error counters to turn symptoms into a deterministic debug path.
- CXL note: when claimed, apply stricter stability windows and additional training/health checks without expanding fabric architecture.
Bring-up quick path (deliverable)
The quick path is designed to separate “wiring/timing prerequisites” from “training margin” issues.
Signals & test points table (minimum set)
| Signal | Purpose | Expected behavior | Probe & symptom cue |
|---|---|---|---|
| REFCLK | Training reference | Present before PERST# release; stable during training | TP_REFCLK; absence or instability → Detect/Polling stalls |
| PLL_LOCK | Clock stability gate | Must be stable-window true before training starts | TP_LOCK; “PG OK but unstable link” often correlates to non-windowed lock |
| PERST# | Endpoint reset | Released after PG + stable lock window; no bounce | TP_PERST; bounce → intermittent enumeration / early failures |
| CLKREQ# | Clock request (if used) | Consistent polarity/levels with platform policy; clean edges | TP_CLKREQ; mis-handled policy → intermittent low-power transitions |
| WAKE# | Wake signaling (if used) | Valid only in defined states; avoid floating conditions | TP_WAKE; floating or wrong pull → spurious wake or no wake |
| LTSSM group | Training stage | Detect → Polling → Config → L0; Recovery should be bounded | Readout/telemetry; frequent Recovery → margin/jitter/power integrity suspicion |
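The table’s prerequisites compose into two small checks: a PERST# release gate, and a bound on Recovery frequency. The Recovery rate limit below is a placeholder, not a specification value.

```python
# Sketch tying the signal table together. Rate limit is an illustrative assumption.

def may_release_perst(refclk_present: bool, pll_lock_windowed: bool, hard_ok: bool) -> bool:
    """Endpoint reset release requires the PG composite plus a *windowed* lock,
    with REFCLK present before release."""
    return refclk_present and pll_lock_windowed and hard_ok

def recovery_bounded(recovery_count: int, window_s: float, limit_per_min: float = 5.0) -> bool:
    """Bounded Recovery rate; exceeding it points at jitter/power-integrity margin loss."""
    return recovery_count / (window_s / 60.0) <= limit_per_min
```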
Thermals & derating: closing the loop between power, temperature, and reliability
Thermal issues often surface as link-training or configuration failures because temperature shifts power margin and clock stability. A usable thermal plan combines sensor placement, thresholds, and a derating policy that keeps the system observable while reducing risk.
Why thermals can look like “link/config” problems
- Temperature rise → power margin shrinks: droop and noise worsen near limits, increasing retries and intermittent failures.
- Temperature rise → clock margin shrinks: lock edges and jitter budget degrade, pushing training into Recovery more often.
- Result: enumeration instability, training drops, or configuration timeouts appear “protocol-like” but originate from thermal coupling.
Sensor placement & thresholds (card-level)
| Hotspot zone | What it indicates | Recommended policy |
|---|---|---|
| FPGA hotspot | Core thermal headroom | Warning/critical thresholds; correlate with training stability and configuration completion rate |
| PMIC / power stages | Derating triggers | Track protection edges; log derating flags and fault counters for reproducible triage |
| Inductors | Local heat accumulation | Use as early indicator for sustained stress; monitor alongside droop/noise symptoms |
| Clock/PLL chip | Lock stability margin | Correlate lock loss / training Recovery frequency with temperature windows |
Derating strategy (observable & stable)
- Level 1 — Warn: raise alarms and log the thermal window (no performance change).
- Level 2 — Derate: reduce stress to protect margin while preserving observability (record flags and timestamps).
- Level 3 — Protect: enter a safe state with controlled retry rules and explicit reason codes.
Derating should align with clock lock gating and training stability checks to prevent “silent degradation”.
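The three derating levels can be sketched as a threshold ladder that always leaves evidence when it acts. The temperature thresholds are placeholders; real values come from the card’s thermal characterization.

```python
# Three-level derating sketch (Warn / Derate / Protect). Thresholds are placeholders.
THRESHOLDS = [(85.0, "WARN"), (95.0, "DERATE"), (105.0, "PROTECT")]

def derating_level(temp_c: float, log: list) -> str:
    """Return the highest level whose threshold is crossed; log every non-normal level
    so derating never becomes silent degradation."""
    level = "NORMAL"
    for limit, name in THRESHOLDS:
        if temp_c >= limit:
            level = name
    if level != "NORMAL":
        log.append((level, temp_c))   # observable: flag + value recorded
    return level
```

A real implementation adds hysteresis around each threshold so the level does not chatter near a boundary.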
Thermal triage checklist (deliverable)
- Reproduce: define load profile + airflow condition + warm-up time window; record ramp rate and ambient.
- Thresholds: note which hotspot crosses warning/critical at failure onset.
- Correlate: check derating flags, PLL lock stability window, and training Recovery/error counters in the same time slice.
- Isolate: compare “same load, different airflow” and “same temperature, different load” to separate power vs clock vs link margin.
- Confirm: verify stability after mitigation using the same stress and temperature window.
H2-11 — Validation & Production Checklist (Definition of Done)
This chapter turns “it seems to work” into a card-side evidence pack: repeatable pass/fail criteria, measurable thresholds, and minimal logs that make power/config/clock/link failures diagnosable without guessing.
Evidence pack (minimum)
A DoD item is complete only when a stored artifact exists (log fields / counter dumps / report IDs / waveforms) and the same trigger reproduces the same outcome.
- Conditions: temperature window, airflow state, link speed/width, workload profile, power margining level.
- Procedure: cycle counts, injected faults, timeouts, and recovery policy.
- Criteria: thresholds for PG stability, PLL lock, enumeration success, training recovery counts, error counters.
- Evidence: event codes + timestamps, “last failure reason”, and signature snapshot (rails/temps/version).
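The four evidence-pack sections above can be carried as one record type, with completeness checked mechanically. The field contents are illustrative; the structure mirrors the Conditions / Procedure / Criteria / Evidence list.

```python
# Minimal evidence-pack record sketch; section names echo the list above.
from dataclasses import dataclass, asdict

@dataclass
class EvidenceRecord:
    conditions: dict   # temperature window, airflow, link speed/width, workload, margin level
    procedure: dict    # cycle counts, injected faults, timeouts, recovery policy
    criteria: dict     # thresholds: PG stability, lock, enumeration, recovery/error counts
    evidence: dict     # event codes + timestamps, last failure reason, signature snapshot

def is_complete(rec: EvidenceRecord) -> bool:
    """DoD rule: a record counts only when every section carries stored artifacts."""
    return all(bool(v) for v in asdict(rec).values())
```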
R&D validation (DVT/EVT) — break the edge conditions
| DoD item | Setup (trigger) | Pass criteria (example) | Evidence (card-side) | Example MPN hooks |
|---|---|---|---|---|
| Power cycle repeatability (cold + warm cycles) | N cycles; defined ramp; defined airflow; fixed workload after link-up | No PG bounce; no stuck state; enumeration success rate = 100%; recovery count ≤ threshold | pg_mask timeline; last_fault_code; boot_state; retry_count; timestamp | Supervisor / log NVM |
| Margining robustness (key rails) | ±X% margin on selected rails; repeat at two temperatures | Stable lock + stable training; no silent degradation (error counters remain bounded) | rail_vout/iin snapshots; margin_level; link_error counters | PMBus manager |
| Fault injection (rail drop / UV / OT) | Inject one fault at a time; defined timeout policy | Correct containment (alarm vs. shutdown); reason code always present; recovery is repeatable | fault_code + rail_id; last_reason; shutdown_class; retry_count | Multi-rail controller |
| Config update & rollback (Golden + Update) | Multiple update/verify/switch cycles; cut power during critical windows | No “brick” state; rollback triggers are deterministic; Golden always boots | image_slot; cfg_result; rollback_count; bitstream_id; signature_ok | QSPI NOR / FRAM |
| Clock lock stability (thermal sweep) | Hold at temperature corners; refclk present/absent cases | Lock time within limit; lock_loss_count = 0 (or bounded) during stress | pll_lock_time_ms; lock_loss_count; refclk_present | Jitter cleaner |
| Link training endurance (long run) | T hours; multiple speeds/widths; workload toggles | Link drops = 0 (or bounded); recovery bounded; LTSSM never stuck | link_up_count; link_drop_count; recovery_count; last_ltssm_group | Telemetry monitor |
Production test (PVT) — fast screen with high signal-to-noise
| Test | Goal | Pass criteria (example) | Evidence |
|---|---|---|---|
| JTAG boundary scan | Detect solder opens/shorts on critical nets with fixture-friendly coverage | Scan passes; failing nets map to “open/short” class; repeatability across fixtures | scan_report_id; failing_net_list (if fail); coverage summary |
| Signature snapshot | Single-page health record for shipment gating | All rails in range; temps in range; version fields valid; link trains once | bitstream_id; rails_vout/iin; max_temp; pll_lock; link_up_once flag |
Field evidence (card-side) — minimal logs that stop guesswork
| Group | Fields (examples) | When to update | Why it matters |
|---|---|---|---|
| Power | rail_id, pg_mask, fault_code, shutdown_class, derating_level, max_temp_at_fault, retry_count Optional: per-rail min/max vout, brownout_count | On PG transitions, fault assertion, shutdown decision, recovery attempt | Separates “rail instability” from “protocol symptoms” |
| Clock | refclk_present, pll_lock_time_ms, lock_loss_count, last_lock_loss_temp Optional: refclk_fault_count | On lock/unlock, link training start, thermal threshold crossings | Explains intermittent training failures and thermal drift sensitivity |
| Link + Config | image_slot, cfg_result, rollback_count, bitstream_id, signature_ok, link_up_count, link_drop_count, recovery_count, last_ltssm_group | After configuration completes, after training completes, on each recovery/drop | Distinguishes “bitstream/update issues” from “training margin issues” |
Reference MPN list (examples) — parts that enable the evidence
| Function | Example MPNs | Where it fits the DoD |
|---|---|---|
| PMBus power system manager | LTC2977 (8-channel power system manager with fault logs / telemetry) | Enables repeatable margining + rail snapshots + fault logs used in DVT cycles and production “signature” records. |
| Digital multiphase controller | Infineon XDPE132G5C / XDPE132G5H; Renesas ISL68137; MPS MP2953B (examples; phase count and protocol vary by design) | Provides programmable sequencing/telemetry hooks to correlate rail behavior with configuration/training outcomes (used by DoD evidence fields). |
| Temperature sensing (multi-zone) | TI TMP468 (multi-zone SMBus/I²C temperature sensor) | Supports thermal triage evidence: max_temp, temp-at-fault, derating correlation, and hotspot mapping. |
| Digital power monitor | TI INA228 (20-bit power/energy monitor; shunt based) | Creates quantifiable “rail load vs symptom” snapshots to back up margining and endurance claims. |
| Multi-rail supervisor / watchdog | TI TPS386000 (quad-supply supervisor with delay + watchdog) | Hardens reset/timeout policies; provides deterministic gating for “state machine” timeouts that appear in DoD pass/fail criteria. |
| QSPI/Quad-SPI NOR flash | Winbond W25Q128JV (128M-bit serial NOR flash family) | Stores golden/update images and metadata fields (image_slot, cfg_result, rollback_count, version tags). |
| Nonvolatile log memory (FRAM) | FM25V02-G / FM25V02A (SPI FRAM examples) | Stores small, high-endurance “last_reason / counters / timestamps” records that survive power loss during faults. |
| Jitter cleaner / clock generator | Si5341B-D-GM (ultra-low jitter clock generator / attenuator family) | Supports clock-lock evidence (lock_time, lock_loss_count) and reduces “clock-noise disguised as link problems”. |
H2-12 — FAQs (10) for FPGA Accelerator Cards
Each answer is written for card-side triage: the shortest measurable path from symptom → signals → evidence fields, with references back to the relevant chapters (H2-4…H2-11). No platform-level detours.
1. Why are all power rails “PG green” but PCIe still cannot enumerate the card?
- Confirm `perst_seen` and the release moment vs `boot_state` (H2-4).
- Check `refclk_present` and `pll_lock_time_ms` (H2-8).
- Read a coarse `last_ltssm_group` to see if training even started (H2-9).
2. Flash reads/writes fine, but bitstream programming fails intermittently—what is the most common cause?
- Capture `cfg_fail_reason`, `image_slot`, `signature_ok` (H2-6).
- Log `rail_min_v_at_cfg` and `temp_at_cfg` (H2-5/H2-10).
- Verify PLL lock stayed stable across the full configuration window (H2-8).
3. One boot succeeds and the next fails—what three sequencing parameters should be checked first?
- Compare a successful vs a failed `boot_state` trace and `pg_mask` transitions (H2-4).
- Check whether the failure always occurs after the same timeout stage (H2-4).
- Inspect retry loops: no rapid “power-bounce” cycles (H2-11 DoD).
4. How should voltage margining be done without making the system “less stable the more it is tested”?
- Baseline snapshot: rails/temps/counters (H2-11).
- Apply one margin step → hold → run the same stress script (H2-4/H2-11).
- Restore the baseline and verify counters did not drift (H2-11).
5. Training fails as temperature rises—how to tell clock jitter from power derating?
Power derating shows up as `derating_level`, rail droop, or fault warnings near the failure; clock/jitter issues show up as lock instability (`pll_lock_loss_count`, abnormal `pll_lock_time_ms`) while rails remain within bounds. Both can produce the same symptom, so logs must capture “what changed first.”
- Check `max_temp` and whether derating was asserted (H2-10).
- Compare rail minima vs PLL lock stability around the event (H2-8/H2-10).
- Correlate with `recovery_count` and LTSSM group changes (H2-9).
6. JTAG connects, but boundary scan shows many errors—what hardware-layer issues are typical?
- Verify JTAG I/O level compatibility and power-domain readiness (H2-7).
- Reduce TCK and re-run to test signal-integrity sensitivity (H2-7).
- Segment the chain (jumpers/bypass) to identify the failing stage (H2-7).
7. PERST# timing looks correct, but LTSSM is stuck early—what root causes are common?
- Log `refclk_present` and PLL lock stability before PERST# release (H2-8).
- Read `last_ltssm_group` to classify Detect vs Polling vs Config (H2-9).
- Validate sidebands and lane-mapping rules at the card connector (H2-9).
8. How should the golden image / fallback be designed so one update cannot brick the card?
- Maintain `image_slot`, `pending_update`, `commit_flag` (H2-6).
- Define rollback triggers: signature fail, DONE timeout, health-check fail (H2-6).
- Record `rollback_count` and `last_update_result` (H2-11).
9. A multi-rail PMIC reports a fault but voltages look “normal”—how to interpret fault classes and logs?
- Read `fault_code`, `fault_class`, `rail_id` (H2-5).
- Capture `rail_min_v_at_fault` and `iin_peak_at_fault` (H2-5/H2-11).
- Check whether derating/temperature triggered the event (H2-10).
10. What hidden defects are most often missed in production, and how can tests cover them?
- Run boundary scan for opens/shorts (H2-7).
- Record a consistent signature (rails/temps/version/counters) per unit (H2-11).
- Add a short at-speed training + counter check (H2-9/H2-11).