
FPGA Accelerator Card: Power, JTAG, Clocks, PCIe/CXL


An FPGA accelerator card succeeds in the data center only when its four planes—power, configuration/control, clocks, and PCIe/CXL ties—are engineered as a single measurable boot-and-train sequence, not as isolated blocks. Reliable bring-up comes from card-side observability (PG/FAULT, boot states, lock signals, LTSSM groups, and update/rollback evidence) that turns “card not detected” and “intermittent failures” into repeatable triage steps.

H2-1 · Boundary & “What this page is / isn’t”

Boundary: FPGA endpoint card scope, evidence, and link-only neighbors

The fastest way to ship a reliable FPGA accelerator card is to lock the scope early: what is owned on the card (power/config/clock/PCIe-CXL bindings) and what is only referenced (platform fabric, rack power, management stack). This prevents “looks fine” bring-up loops caused by mixing system-level topics into card-level debugging.

Scope Guard (mechanically checkable)

Allowed: multi-rail power tree · sequencing & dependencies · PG/FAULT & margining · config flash (QSPI/OSPI) · JTAG / boundary scan · clock synthesis / PLL lock · PCIe/CXL sidebands · LTSSM observability · rail/thermal telemetry
Banned: CPU VRM control theory (VR13/VR12+) · PSU PFC/LLC details · 48V hot-swap SOA deep dives · retimer equalization internals · BMC/Redfish platform stack

Link-only neighbors (mention and link, no expansion): PCIe Switch/Retimer · 48V/12V Bus & Hot-Swap · BMC · Time Card · In-band Telemetry & Power Log.

What this page solves (reader tasks)

  • Define a card-level power tree (rails, dependencies, sequencing) that stays stable across temperature and workload transients.
  • Turn PG/FAULT signals into a deterministic triage flow (symptom → first signal → most likely plane).
  • Design a safe configuration path: boot modes, flash partitions, update/rollback, and “recoverable by JTAG” guarantees.
  • Keep on-card clocks lock-stable and jitter-contained so link training does not drift into intermittent failures.
  • Bring up PCIe/CXL endpoints using only card-owned evidence (PERST#/CLKREQ#, PLL_LOCK, DONE, LTSSM states).
Figure F1 — Card-level boundary map (owned planes vs link-only neighbors)
[Figure: central FPGA accelerator card scope with four owned planes (power, control, clock, PCIe/CXL bindings); neighbor systems (PCIe switch/retimer, 48V/12V bus & hot-swap, BMC, time card) shown as link-only blocks.]
H2-2 · Card-level architecture: 4 planes (Data / Power / Control / Clock)

Architecture: unify bring-up and debugging with four planes

A stable FPGA accelerator card behaves like a coordinated system, not a set of independent blocks. The “four planes” model turns complex failures into deterministic triage: identify the plane that holds first evidence, then check the coupling points where failures propagate across planes.

Four planes (each plane includes: signals → evidence → failure signature)

  • Data plane (PCIe/CXL). Signals: high-speed lanes, lane mapping, PERST#, CLKREQ#, capabilities. Observe first: enumeration result, LTSSM state, link width/speed, error counters. Failure signature: not detected, downshift, intermittent drop under load.
  • Power plane (multi-rail). Signals: power tree, sequencing dependencies, PG/FAULT, margin controls. Observe first: PG timing, rail droop/ripple evidence, fault codes, retry behavior. Failure signature: boot flakiness, “works cold” but fails hot, mid-configuration failures.
  • Control plane (config & debug). Signals: QSPI/OSPI flash, boot-mode straps, JTAG chain, SMBus/I3C hooks. Observe first: config progress/DONE, flash status, version/rollback state. Failure signature: update bricks the card, stuck init, large lot-to-lot variation.
  • Clock plane (refclk & PLL). Signals: refclk input, PLL/lock indicator, clock fanout, sensitive rails. Observe first: PLL_LOCK stability, refclk presence, drift vs temperature. Failure signature: unstable training, failures after warm-up or noise events.

Practical rule: when a failure looks like “data plane”, check clock and power coupling first; when a failure looks like “control plane”, check power integrity during configuration.

Coupling points (where root causes hide)

  • Power ↔ Clock: clock IC and transceiver supplies amplify rail noise; lock may look “OK” while jitter erodes link margin. First evidence: PLL_LOCK stability, sensitive-rail PG/FAULT, temperature trend.
  • Clock ↔ Data: refclk quality and lock timing gate training and stability; weak margin shows as downshift or retrain events. First evidence: LTSSM transitions, link speed/width shifts, error bursts.
  • Power ↔ Control: configuration is a stress test; flash reads/writes and FPGA init can expose droop or sequencing gaps. First evidence: config progress stalls, fault codes during the init window, rail droop evidence.
  • Control ↔ Data: the bitstream/version determines endpoint behavior; a mismatch can look like lane or training faults. First evidence: device ID, image selection (A/B), capability reporting consistency.

Bring-up “first hour” checklist (5 steps)

  • Power plane: all required PG signals asserted in order; no silent retries or latched faults.
  • Clock plane: refclk present; PLL_LOCK is stable across a short warm-up window.
  • Control plane: configuration reaches DONE; image selection/rollback state is known.
  • Data plane: PERST# release timing is sane; LTSSM reaches link-up with expected width/speed.
  • Coupling check: stress (temp or load) does not flip any of: PLL_LOCK, PG/FAULT, LTSSM state.
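A minimal sketch of this first-hour checklist as a host-side script, assuming a hypothetical `card` object whose `read()` field names (pg_mask, pll_lock, ltssm_group, and so on) stand in for whatever telemetry the design actually exposes, and a `stress()` hook for a temperature or load transient:

```python
import time

REQUIRED_PG_MASK = 0b11111   # assumption: five monitored rails, one PG bit each
WARMUP_S = 30                # short stability window before declaring lock "stable"

GATES = ("pg_mask", "fault_code", "pll_lock", "ltssm_group")

def first_hour_check(card, stress) -> list[str]:
    """Five-step bring-up check; returns a list of failures (empty = pass).
    `card.read(field)` and `stress()` are assumed hooks, not a vendor API."""
    failures = []

    # 1. Power plane: all PG bits asserted, no latched fault or silent retry.
    if card.read("pg_mask") != REQUIRED_PG_MASK:
        failures.append("power: PG composite incomplete")
    if card.read("fault_code") != 0:
        failures.append("power: fault latched (silent retry?)")

    # 2. Clock plane: refclk present, PLL_LOCK continuously true over warm-up.
    t0 = time.monotonic()
    while time.monotonic() - t0 < WARMUP_S:
        if not (card.read("refclk_present") and card.read("pll_lock")):
            failures.append("clock: refclk/lock unstable in warm-up window")
            break
        time.sleep(0.5)

    # 3. Control plane: configuration reached DONE with a known image slot.
    if not card.read("config_done"):
        failures.append(f"control: stalled, slot={card.read('image_slot')}")

    # 4. Data plane: LTSSM reached link-up with the expected width/speed.
    if card.read("ltssm_group") != "L0":
        failures.append(f"data: LTSSM={card.read('ltssm_group')}, not L0")

    # 5. Coupling: stress must not flip any gate.
    before = {g: card.read(g) for g in GATES}
    stress()  # e.g. a temperature ramp or load transient
    flipped = [g for g in GATES if card.read(g) != before[g]]
    if flipped:
        failures.append(f"coupling: gates changed under stress: {flipped}")

    return failures
```

The value of step 5 is the before/after gate comparison: a gate that flips under stress localizes the coupling before the data plane collapses.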
Figure F2 — Four-plane block diagram (single view for bring-up & triage)
[Figure: host-to-card PCIe/CXL data plane, on-card power tree with PMIC/supervisor and PG/FAULT, control plane with config flash (A/B) and JTAG chain, and clock plane with refclk, PLL/jitter cleaner, PLL_LOCK, and fanout.]

Diagram reading tip: failures rarely stay inside one plane; the first stable evidence usually appears in power/clock signals before the data plane collapses.

H2-3 · Power tree & rail budgeting

Power tree and rail budgeting: derive rails from behaviors

An FPGA accelerator card power tree is a behavior model: rails are defined by what must stay stable during configuration, link training, and sustained workloads. Budgeting must capture both average and peak currents, plus noise sensitivity and dependencies that can turn “normal readings” into intermittent failures.

Rail families (card scope, no CPU VRM theory)

  • Core / VCCINT: highest dynamic current; transient integrity gates overall stability and init robustness.
  • Transceiver AVCC / RX / TX: noise- and jitter-sensitive domain; rail quality strongly impacts link margin and training stability.
  • I/O banks: multiple voltage domains; a misassigned bank voltage often shows up as “works in some modes, fails in others”.
  • DDR/HBM rails (if present): treat as a separate domain with isolation and clean bring-up windows; keep the discussion at power-domain level.
  • AUX / 1.8V / 3.3V management: always-on/standby domain that keeps control/telemetry/rollback observable even when the main domain fails.

Budget workflow (practical steps)

  • Start from a workload power profile (idle / configure / train / sustained) and map power to domains.
  • Convert power to Iavg and Ipeak per rail; treat configuration and training windows as stress events.
  • Add margin for process tolerance, temperature rise, aging, and worst-case lane width/speed targets.
  • Flag noise-critical rails (especially transceiver supplies) and reserve routing/decoupling budget early.
  • Close the loop with thermal planning: hottest components are often regulators/inductors, not only the FPGA die.
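A small sketch of the budget arithmetic, assuming illustrative rail names, voltages, and margin factors (the per-rail numbers and the roughly 1.37x margin stack are placeholders, not recommendations):

```python
from dataclasses import dataclass

@dataclass
class RailBudget:
    name: str
    vnom: float           # nominal rail voltage (V)
    p_avg: float          # mapped average power (W) from the workload profile
    p_peak: float         # worst phase (configure/train/sustained) power (W)
    noise_critical: bool  # reserve routing/decoupling budget early if True

# Illustrative margin stack: process x temperature x aging x headroom.
MARGIN = 1.10 * 1.08 * 1.05 * 1.10   # ~1.37x; tune per program requirements

def budget(rails: list[RailBudget]) -> None:
    for r in rails:
        i_avg = r.p_avg / r.vnom
        i_peak = r.p_peak * MARGIN / r.vnom
        tag = "NOISE-CRITICAL" if r.noise_critical else ""
        print(f"{r.name:<14} Iavg={i_avg:6.1f} A  Ipeak={i_peak:6.1f} A  {tag}")

budget([
    RailBudget("VCCINT",    0.85, 60.0, 110.0, noise_critical=True),
    RailBudget("AVCC_GTX",  0.90,  8.0,  14.0, noise_critical=True),
    RailBudget("VCCIO_1V8", 1.80,  3.0,   5.0, noise_critical=False),
    RailBudget("AUX_3V3",   3.30,  2.0,   3.0, noise_critical=False),
])
```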

The next chapter (sequencing) turns these rails into a deterministic state machine with evidence (PG/FAULT) for triage.

Rail checklist (definition → evidence → failure signature)

  • Core (VCCINT). Target: low voltage. Iavg/Ipeak: high / very high. Noise sensitivity: medium (droop-sensitive). PG condition & dependency: PG with debounce; must be stable before init. Failure signature: boot flakiness, config stalls, resets under burst.
  • Transceiver (AVCC/RX/TX). Target: low voltage. Iavg/Ipeak: medium / high. Noise sensitivity: high (jitter coupling). PG condition & dependency: PG tied to PLL/link windows; the rail must be quiet. Failure signature: unstable training, downshift, intermittent drops after warm-up.
  • I/O banks. Target: multiple voltages. Iavg/Ipeak: low / medium. Noise sensitivity: medium. PG condition & dependency: PG per bank group; depends on mode/straps. Failure signature: mode-specific failures, marginal GPIO/sideband behavior.
  • Memory domain (if present). Target: multiple voltages. Iavg/Ipeak: medium / high. Noise sensitivity: medium. PG condition & dependency: separate PG; sequenced to avoid init collisions. Failure signature: init-time errors, intermittent errors during traffic spikes.
  • Management (AUX/1.8V/3.3V). Target: low/medium voltage. Iavg/Ipeak: low / low. Noise sensitivity: low. PG condition & dependency: always-on PG; keeps telemetry and recovery reachable. Failure signature: “no evidence” failures (faults unreadable, no recovery path).
Figure F3 — FPGA card power tree map (rail families, PG/FAULT, telemetry)
[Figure: input power from slot/aux feeding on-card regulators grouped by rail family (core, transceiver, I/O banks, memory, management), with a PG/FAULT bus for evidence and gating plus V/I/temperature telemetry hooks into the FPGA domains.]
H2-4 · Multi-rail PMIC & sequencing

Multi-rail PMIC sequencing: order, ramp, dependencies, and observability

Sequencing is the card’s control system. The goal is not only “power on”, but “power on with evidence”: each transition is gated by measurable conditions (PG, PLL_LOCK, PERST#, config progress) and protected by timeouts and fault policies that prevent oscillating retries.

Sequencing essentials (three levers)

  • Order: always-on management rails first; main rails next; sensitive transceiver rails enter only when prerequisites are stable.
  • Ramp: soft-start slope must avoid both droop (too fast) and watchdog-style timeouts (too slow).
  • Dependencies: tie key transitions to PG, PLL_LOCK, and PERST# so training never starts on unstable foundations.

PMIC capabilities mapped to field outcomes

  • Programmable soft-start: converts “ramp quality” into a controllable parameter, reducing lot-to-lot variance.
  • PG/FAULT aggregation: turns intermittent failures into deterministic evidence (latched codes + timestamps).
  • Margining: validates guardband for production, temperature drift, and aging without changing the nominal operating point.
  • Retry policy: controlled retry/backoff prevents rail oscillation that can worsen link and configuration stability.

Fault policy should distinguish “hard stop” vs “retry/derate” vs “log-only” to keep cards observable without masking real damage.

Observability rules (failures must leave evidence)

  • Latch the first fault source (rail, class, time window) before a retry clears the symptom.
  • Record the transition where progress stopped (RAIL_RAMP, PLL_LOCK, INIT, TRAIN) to prevent blind debugging.
  • Keep the management domain alive so telemetry and recovery remain reachable after main-rail shutdown.
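A compact sketch of the Figure F4 state machine in code, assuming a duck-typed `card` with `read()`/`write()` telemetry hooks and a `power_down_main_rails()` action; the state names mirror the figure, while the gate fields and timeout values are illustrative:

```python
import time

# Ordered states with their gate field and timeout in seconds (illustrative).
SEQUENCE = [
    ("STANDBY",    "mgmt_pg",     0.050),
    ("RAIL_RAMP",  "main_pg",     0.100),
    ("PLL_LOCK",   "pll_lock",    0.200),
    ("FPGA_INIT",  "config_done", 2.000),
    ("LINK_TRAIN", "link_up",     1.000),
]

def run_sequence(card, max_retries=3, backoff_s=0.5) -> str:
    """Advance OFF -> READY. On timeout, latch the first cause and the last
    state exactly once, then apply bounded retry/backoff, never oscillation."""
    for attempt in range(max_retries + 1):
        ok = True
        for state, gate, timeout in SEQUENCE:
            deadline = time.monotonic() + timeout
            while not card.read(gate):
                if time.monotonic() > deadline:
                    # Latch evidence only for the first failure seen.
                    if card.read("fault_latched") is None:  # None = not latched
                        card.write("fault_latched", {
                            "last_state": state, "gate": gate,
                            "attempt": attempt, "t": time.time(),
                        })
                    ok = False
                    break
                time.sleep(0.001)
            if not ok:
                break
        if ok:
            return "READY"
        card.power_down_main_rails()   # management domain stays alive for evidence
        time.sleep(backoff_s * (attempt + 1))
    return "FAULT_LATCHED"
```

Note the two observability rules embedded here: the first fault is latched exactly once, and retries are bounded with backoff instead of clearing the symptom.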
Figure F4 — Power sequencing state machine (PG/PLL_LOCK/PERST# gates + timeouts)
[Figure: state machine OFF → STANDBY (mgmt rails on) → RAIL_RAMP (gate: PG) → PLL_LOCK (gate: LOCK) → FPGA_INIT (config begins) → LINK_TRAIN (gate: PERST#) → READY, with per-state timeouts feeding a FAULT_LATCHED sink that latches the first cause, records the last state, and applies a retry/backoff policy; evidence buses: PG/FAULT, PLL_LOCK, PERST#/LTSSM.]

Implementation guidance: expose PG/FAULT and PLL_LOCK as readable, latched evidence; keep the management domain alive so faults remain accessible after shutdown.

H2-5 · PG/FAULT design

PG/FAULT design: turning “looks powered but unstable” into a deterministic triage path

Many field issues are not “wrong voltage,” but incorrect power-good definition and fault policy. Robust cards treat PG as a gate (with threshold, debounce, and delay) and treat FAULT as evidence (latched cause, state, and window), so symptoms map to a short, repeatable debug path.

PG design essentials (threshold · debounce · delay)

  • Threshold: set PG around the worst-case operating window, not only the nominal rail value.
  • Debounce: prevent transient dips from triggering oscillating retries; match debounce to soft-start and load-step behavior.
  • Delay/settling: PG asserted does not mean “quiet enough” for training/config; reserve a settling window for sensitive domains.

Multi-PG aggregation (AND/OR rules that prevent false “ready”)

  • HARD_OK (gate to INIT/TRAIN): critical rails combined with AND (Core + Transceiver + Mgmt reachable).
  • SOFT_OK (run but log/derate): optional rails or secondary domains do not block progress, but must create events.
  • LOG_ONLY (evidence only): near-threshold conditions (margin/temperature) should be recorded without causing reset storms.
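A sketch of debounced PG qualification plus the aggregation classes above, with illustrative rail groupings, debounce/settle times, and `card.read_pg()` / `card.log_event()` as assumed hooks:

```python
import time

HARD_RAILS = {"VCCINT", "AVCC_GTX", "AUX_3V3"}   # AND-combined: gates INIT/TRAIN
SOFT_RAILS = {"VCCIO_1V8"}                        # run but log/derate on loss

def pg_qualified(card, rail, debounce_s=0.005, settle_s=0.020) -> bool:
    """PG must stay asserted through the debounce window, then a settle window
    must pass before the rail counts as 'quiet enough' for training/config."""
    t0 = time.monotonic()
    while time.monotonic() - t0 < debounce_s:
        if not card.read_pg(rail):
            return False               # transient dip: never counts as good
        time.sleep(0.001)
    time.sleep(settle_s)               # PG asserted != quiet; reserve settling
    return card.read_pg(rail)

def aggregate(card) -> dict:
    hard_ok = all(pg_qualified(card, r) for r in HARD_RAILS)
    soft_events = [r for r in SOFT_RAILS if not pg_qualified(card, r)]
    for r in soft_events:
        card.log_event(f"SOFT_OK lost: {r}")   # an event, not a reset storm
    return {"HARD_OK": hard_ok, "SOFT_EVENTS": soft_events}
```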

The most common “powered but unstable” failure mode is entering training while a sensitive rail is not yet quiet or a prerequisite is not truly stable.

Fault-to-symptom mapping (first signals → suspect domains → logs)

  • Not enumerated / training fails. First signals: HARD_OK composite, PERST#, LTSSM state. Suspect domain: transceiver rails, clock prerequisites. Evidence: last state, retry count, LTSSM snapshot, PG settle-time window.
  • Link drops under stress. First signals: transient FAULT, rail-droop indicators, temperature at event. Suspect domain: core-rail transient integrity, transceiver noise coupling. Evidence: peak-current window, FAULT latch time, temperature at event, error counters.
  • Bitstream fails mid-configuration. First signals: config progress, flash read-integrity flags. Suspect domain: management + config domain stability. Evidence: config phase marker, integrity-check result, brownout-style events.
  • Intermittent reset after warm-up. First signals: thermal threshold events, repeated retries. Suspect domain: regulator/inductor hotspots, protection thresholds. Evidence: over-temp flags, hysteresis/derate state, last-good timestamp.

Keep management rails alive during faults so evidence remains readable after main-rail shutdown.

Figure F5 — PG/FAULT triage decision tree (symptom → evidence bus → suspect domain → next action)
[Figure: decision tree from symptom class (no enumerate / train fail, drop under stress, config fails mid-way, reset after warm-up) to the evidence buses (PG/FAULT threshold/debounce/delay, PLL_LOCK stable window, PERST#/LTSSM snapshot), branching to suspect domains (core rail droop/Ipeak, transceiver quiet window, management evidence path, optional IO/memory domains) and next actions: validate PG gating, capture evidence, reproduce the window.]
H2-6 · Configuration architecture

Configuration architecture: boot modes, flash, fallback, and safe updates

Configuration reliability is a system problem: stable prerequisites (PG + clock lock), an integrity-checked boot source, and an update pipeline that can always fall back to a known-good image. The objective is to prevent “one update bricks the card” while keeping a minimal recovery path.

Milestones from power-on to DONE (vendor-agnostic)

  • Prerequisites ready: HARD_OK PG composite + stable clock lock window.
  • Boot source selected: straps/mode pins define which image is attempted.
  • Read + integrity check: CRC/ECC/signature gates the transition to load.
  • Load + progress markers: record the phase where progress stops (evidence for H2-5 triage).
  • DONE / ready: only then enable downstream training and full workloads.

Config flash selection (what usually breaks in the field)

  • Capacity planning: Golden + Update images, metadata, version tags, and rollback records.
  • Speed vs window: longer load windows increase exposure to thermal drift and transient power events.
  • Integrity features: ECC/CRC and signed manifests prevent silent corruption from becoming runtime instability.
  • Write endurance: update cadence + wear leveling strategy must avoid “hot-sector death”.
  • Power-loss safety: update steps must be transactional; incomplete writes must never replace Golden.

Fallback model (Golden / Update) and rollback conditions

  • Golden image: minimal, validated, and protected; used as the guaranteed recovery anchor.
  • Update image: versioned and verified; activated only after integrity and health gates.
  • Rollback triggers: integrity failure, repeated init/train failures, crash loops, or mismatch against board ID/revision policy.
  • Minimal recovery path: keep a service mode available (e.g., JTAG path) even when flash is unreadable.

Keep the update pipeline separate from the “bootable anchor” so failures remain recoverable by design.
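A minimal sketch of the transactional flow in Figure F6, assuming a `flash` wrapper whose methods (write_slot, verify_slot, set_boot_pointer, trial_boot_healthy, latch_reason) are illustrative stand-ins; the invariants are that the golden slot is never written and the boot pointer only commits after verify and health both pass:

```python
def safe_update(flash, candidate_image: bytes) -> str:
    """Transactional update: golden is the untouched recovery anchor, and
    each transition is a gate. Helper methods are assumed, not a vendor API."""
    # 1. UPDATE: write only into the update slot; golden stays untouched.
    flash.write_slot("update", candidate_image)

    # 2. VERIFY: integrity gates (CRC/ECC/signature) before any switch.
    if not flash.verify_slot("update"):
        flash.latch_reason("verify_failed")
        return "rolled_back"           # boot pointer never moved

    # 3. SWITCH: select the candidate tentatively, via atomic metadata.
    flash.set_boot_pointer("update", tentative=True)

    # 4. HEALTH: one trial boot must reach DONE + link-up within limits.
    if not flash.trial_boot_healthy():
        flash.set_boot_pointer("golden", tentative=False)
        flash.latch_reason("health_failed")
        return "rolled_back"

    # 5. COMMIT: clear the tentative flag; the update becomes active.
    flash.set_boot_pointer("update", tentative=False)
    return "committed"
```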

Figure F6 — Safe update pipeline (Update → Verify → Switch → Health → Commit; else Rollback)
[Figure: transactional update flow: GOLDEN (protected anchor) and UPDATE (candidate) images pass through Update → Verify (CRC/ECC/signature) → Switch → Health (init/train checks) → Commit; the rollback path returns to GOLDEN and latches the reason, and a service mode (JTAG/safe boot) stays reachable with no dependency on the candidate.]

Update safety principle: never overwrite the recovery anchor; treat each transition as a gate (verify + health) and always latch the rollback reason.

H2-7 · JTAG & board-level debug

JTAG & board-level debug: bring-up, boundary scan, and production observability

JTAG is the lowest-level access path when the card cannot boot into a higher-layer workflow. A robust design treats JTAG as an engineering deliverable: it supports early bring-up, shortens “powered but not enumerated” debug loops, and provides boundary-scan coverage for manufacturing.

Three JTAG use cases by lifecycle stage

  • Bring-up: validate power/clock/config prerequisites using chain visibility (ID response) and minimal access steps.
  • Debug: isolate the breakpoint for “powers on but cannot enumerate” by checking reset release, configuration readiness, and chain continuity.
  • Production test: use boundary scan to catch solder open/short faults on critical sideband nets and essential connectivity.

JTAG chain design (multi-device chains)

  • Reset interaction: avoid keeping the chain permanently held by board reset policy (TRST# / reset domain alignment).
  • TCK strategy: support safe low-speed bring-up and higher-speed production modes without marginal timing.
  • Voltage domains: ensure the access port matches device IO levels; include protection and reference rails if required.
  • Isolation / bypass: provide a practical way to bypass segments to localize a broken section in the field.
  • Mechanical access: the connector or pads must remain reachable with a fixture, without removing heatsinks or other airflow hardware.

JTAG access checklist (deliverable)

  • Access location: header/pad placement, fixture approach direction, ground-reference proximity.
  • Minimum signals: TCK, TMS, TDI, TDO, TRST# (if used), GND, and a stable reference for the IO-level domain (VTref).
  • Protection: ESD handling, series resistors, and safe behavior under accidental mis-connection.
  • Level domain: IO voltage compatibility and any translation/isolation policy for mixed-voltage chains.
  • Bypass points: jumper/resistor options to isolate chain segments for fault localization.
  • Production fixture: probe/pogo requirements, alignment features, and test-time constraints.

Boundary scan focus: prioritize essential sideband connectivity (reset, strap-like signals, management busses, and the JTAG path itself).
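A small sketch of a chain sanity check run before deeper debug, assuming a hypothetical `tap` interface that returns the scanned IDCODE list; the expected chain and the IDCODE values are illustrative, not real part IDs:

```python
EXPECTED_CHAIN = [
    ("FPGA",       0x0BA00477),   # illustrative IDCODEs, not real part values
    ("CLOCK_CHIP", 0x1234A0DD),
]

def check_chain(tap) -> list[str]:
    """Compare scanned IDCODEs against the expected chain. A wrong device
    count or a mismatched ID localizes the broken segment (unpowered device,
    broken TDI/TDO link) before any bypass jumpering is attempted."""
    issues = []
    scanned = tap.scan_idcodes()          # assumed hook: list of 32-bit IDCODEs
    if len(scanned) != len(EXPECTED_CHAIN):
        issues.append(f"device count {len(scanned)} != {len(EXPECTED_CHAIN)}"
                      " (unpowered device or broken chain segment?)")
    for (name, want), got in zip(EXPECTED_CHAIN, scanned):
        if got != want:
            issues.append(f"{name}: IDCODE {got:#010x} != {want:#010x}")
    return issues
```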

Figure F7 — JTAG access & chain topology (bring-up · debug · production test)
[Figure: JTAG access header/pads (TCK, TMS, TDI, TDO, TRST#, GND, VTref) into a chain through the FPGA primary TAP and optional devices (config flash, clock chip, retimer/buffer), with bypass points for segment isolation and outputs for bring-up (chain alive?), debug (breakpoint isolation), and production boundary scan.]
H2-8 · Clocking & synthesis

Clocking & synthesis: refclk sources, PLL/jitter cleaners, and distribution

Card-level clocking must be reliable to lock, controlled for jitter, and explainable across temperature. The clock tree is part of the bring-up and power-state definition: training should only proceed after a stable lock window is reached and recorded.

Clock tree objectives (engineering KPIs)

  • Lock robustness: deterministic lock after power transitions and resets, with clear lock visibility.
  • Jitter control: preserve margin for the most sensitive links (PCIe/CXL SerDes reference paths).
  • Thermal explainability: temperature-driven failures must map to lock state, noise coupling, and measurable checkpoints.

Key design points (card-level only)

  • Refclk sourcing: host-provided vs on-card sources require an explicit “valid” policy and a clear fallback behavior.
  • Power dependencies: PLL/jitter-cleaner analog rails must be stable before lock is considered meaningful.
  • Distribution: fanout topology and domain partitioning reduce noise injection into sensitive clock paths.
  • Common pitfall: PG passes but lock is not stable; training starts too early or drifts after warm-up.

Clock checklist (deliverable)

  • Refclk spec: frequency, tolerance window, and the “valid” indication used by the card.
  • Lock indicator: PLL_LOCK signal policy, stable-window definition, and where it gates link training.
  • Power dependency: analog/digital rail readiness prerequisites and sequencing tie-ins.
  • Distribution plan: fanout stages, endpoint mapping, and sensitive-path isolation rules.
  • Measurement points: suggested test points for refclk, post-cleaner clock, and endpoint clocks.
  • Thermal behavior: lock-loss counters, last lock-loss time marker, and correlation to temperature windows.
Figure F8 — Card-level clock tree (refclk sources → PLL/jitter cleaner → fanout → endpoints + training gate)
[Figure: host and on-card refclk sources through a select/valid policy into a PLL/jitter cleaner with PLL_LOCK, gated by analog/digital power dependencies, then fanout to endpoints (FPGA transceivers, PCIe/CXL reference path) with measurement test points; the training gate allows link training only after lock is stable over a window.]

Common failure pattern: power-good passes while lock is not stable; gating must use a stable window, not a momentary edge.
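A minimal sketch of windowed lock gating, with the window length as an illustrative parameter and `card.read()` as an assumed telemetry hook:

```python
import time

def lock_stable(card, window_s=1.0, poll_s=0.01) -> bool:
    """Gate training on PLL_LOCK held continuously for `window_s`. Any single
    dropout fails the window, so a momentary lock edge can never open the
    training gate; callers may retry and log lock_loss_count on failure."""
    start = time.monotonic()
    while time.monotonic() - start < window_s:
        if not card.read("pll_lock"):
            return False
        time.sleep(poll_s)
    return True

def allow_training(card) -> bool:
    # Refclk presence and a full stable-lock window are both prerequisites.
    return card.read("refclk_present") and lock_stable(card)
```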

H2-9 · PCIe / CXL PHY ties

PCIe / CXL PHY ties: lane mapping, sidebands, reset & training observability

Endpoint-card integration is defined by hard ties: lane mapping and polarity decisions, sideband wiring and timing, and a minimal observability path for training stability. The goal is not just enumeration, but repeatable training with fast isolation when failures appear under load or temperature.

Endpoint-card binding requirements (no retimer deep-dive)

  • Lane mapping & polarity: document connector lanes → board nets → FPGA SerDes channels; polarity inversion must be explicit and consistent.
  • Sidebands: PERST#, CLKREQ#, and WAKE# (if used) must align with power-good and clock stability windows.
  • Training observability: use LTSSM grouping and error counters to turn symptoms into a deterministic debug path.
  • CXL note: when claimed, apply stricter stability windows and additional training/health checks without expanding fabric architecture.

Bring-up quick path (deliverable)

Step 1: Power rails stable (key PG combination true); record the timestamp.
Step 2: Refclk present at TP_REFCLK and within the tolerance window.
Step 3: PLL_LOCK stable (windowed); allow PERST# release gating.
Step 4: PERST# released (TP_PERST); verify no unintended re-assert.
Step 5: Enumeration succeeds (device visible); capture a link speed/width snapshot.
Step 6: Stability validation under stress + warm-up; monitor Recovery frequency and error counters.

The quick path is designed to separate “wiring/timing prerequisites” from “training margin” issues.
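Step 6 is the part most worth automating. A sketch that samples Recovery and error counters over a stress window and flags unbounded growth, with the counter names and pass threshold as illustrative assumptions:

```python
import time

def stability_watch(card, duration_s=600, sample_s=5,
                    max_recovery_per_min=2.0) -> dict:
    """Sample recovery/error counters under stress and warm-up. A bounded,
    flat Recovery rate is acceptable; an accelerating rate points at margin
    (clock/power coupling) rather than wiring/timing prerequisites."""
    samples = []
    t_end = time.monotonic() + duration_s
    while time.monotonic() < t_end:
        samples.append({
            "t": time.monotonic(),
            "recovery": card.read("recovery_count"),
            "errors": card.read("link_error_count"),
            "temp": card.read("max_temp"),
        })
        time.sleep(sample_s)
    rec_delta = samples[-1]["recovery"] - samples[0]["recovery"]
    rate = rec_delta / (duration_s / 60)
    return {
        "recovery_per_min": rate,
        "pass": rate <= max_recovery_per_min,
        "temp_span": (samples[0]["temp"], samples[-1]["temp"]),
    }
```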

Signals & test points table (minimum set)

  • REFCLK (training reference). Expected: present before PERST# release; stable during training. Probe/symptom cue: TP_REFCLK; absence or instability → Detect/Polling stalls.
  • PLL_LOCK (clock-stability gate). Expected: stable-window true before training starts. Probe/symptom cue: TP_LOCK; “PG OK but unstable link” often correlates with non-windowed lock.
  • PERST# (endpoint reset). Expected: released after PG plus a stable lock window; no bounce. Probe/symptom cue: TP_PERST; bounce → intermittent enumeration / early failures.
  • CLKREQ# (clock request, if used). Expected: polarity/levels consistent with platform policy; clean edges. Probe/symptom cue: TP_CLKREQ; mishandled policy → intermittent low-power transitions.
  • WAKE# (wake signaling, if used). Expected: valid only in defined states; never left floating. Probe/symptom cue: TP_WAKE; floating or wrong pull → spurious wake or no wake.
  • LTSSM group (training stage). Expected: Detect → Polling → Config → L0; Recovery bounded. Probe/symptom cue: readout/telemetry; frequent Recovery → suspect margin/jitter/power integrity.
Figure F9 — Lane/sideband ties & training observability map (endpoint card)
[Figure: host root complex → connector (x16 lanes) → FPGA SerDes channels, with table-backed lane map/polarity, sidebands (PERST#, CLKREQ#, WAKE#) with test points, an observability block (LTSSM, error counters), and the bring-up quick path PG → REFCLK → LOCK → PERST# → enumeration → stability (Recovery bounded).]
H2-10 · Thermals & derating

Thermals & derating: closing the loop between power, temperature, and reliability

Thermal issues often surface as link-training or configuration failures because temperature shifts power margin and clock stability. A usable thermal plan combines sensor placement, thresholds, and a derating policy that keeps the system observable while reducing risk.

Why thermals can look like “link/config” problems

  • Temperature rise → power margin shrinks: droop and noise worsen near limits, increasing retries and intermittent failures.
  • Temperature rise → clock margin shrinks: lock edges and jitter budget degrade, pushing training into Recovery more often.
  • Result: enumeration instability, training drops, or configuration timeouts appear “protocol-like” but originate from thermal coupling.

Sensor placement & thresholds (card-level)

  • FPGA hotspot (core thermal headroom): warning/critical thresholds; correlate with training stability and configuration completion rate.
  • PMIC / power stages (derating triggers): track protection edges; log derating flags and fault counters for reproducible triage.
  • Inductors (local heat accumulation): use as an early indicator of sustained stress; monitor alongside droop/noise symptoms.
  • Clock/PLL chip (lock-stability margin): correlate lock loss and training Recovery frequency with temperature windows.

Derating strategy (observable & stable)

  • Level 1 — Warn: raise alarms and log the thermal window (no performance change).
  • Level 2 — Derate: reduce stress to protect margin while preserving observability (record flags and timestamps).
  • Level 3 — Protect: enter a safe state with controlled retry rules and explicit reason codes.

Derating should align with clock lock gating and training stability checks to prevent “silent degradation”.
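A sketch of the three-level policy with hysteresis, using illustrative thresholds; the release temperature sits below the assert temperature so the card latches a level instead of oscillating:

```python
LEVELS = [  # (level name, assert deg C, release deg C): illustrative thresholds
    ("PROTECT", 105, 95),
    ("DERATE",   95, 88),
    ("WARN",     85, 80),
]

def derating_level(temp_c: float, current: str) -> str:
    """Highest level whose assert threshold is crossed wins; an already-active
    level stays latched until temperature drops below its release threshold."""
    for name, assert_c, release_c in LEVELS:
        if temp_c >= assert_c:
            return name
        if current == name and temp_c >= release_c:
            return name               # latched until below the release point
    return "NORMAL"

# Example: warm-up through WARN into DERATE, then cool-down with hysteresis.
state = "NORMAL"
for t in (82, 86, 96, 92, 87, 79):
    state = derating_level(t, state)
    print(f"{t:>3} C -> {state}")    # log level + timestamp as evidence
```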

Thermal triage checklist (deliverable)

  • Reproduce: define load profile + airflow condition + warm-up time window; record ramp rate and ambient.
  • Thresholds: note which hotspot crosses warning/critical at failure onset.
  • Correlate: check derating flags, PLL lock stability window, and training Recovery/error counters in the same time slice.
  • Isolate: compare “same load, different airflow” and “same temperature, different load” to separate power vs clock vs link margin.
  • Confirm: verify stability after mitigation using the same stress and temperature window.
Figure F10 — Thermal → power/clock margin → training/config failures + triage map
[Figure: hotspot zones (FPGA, PMIC/power stages, inductors, clock/PLL) crossing warn/critical thresholds degrade power margin (noise up) and clock margin (jitter up), surfacing as training failures, link drops, and config failures; the triage map runs reproduce → thresholds → correlate → isolate → confirm stability.]

H2-11 — Validation & Production Checklist (Definition of Done)

This chapter turns “it seems to work” into a card-side evidence pack: repeatable pass/fail criteria, measurable thresholds, and minimal logs that make power/config/clock/link failures diagnosable without guessing.

Pass/Fail thresholds · Evidence required · Reproducible triggers · Card-side only

Evidence pack (minimum)

A DoD item is complete only when a stored artifact exists (log fields / counter dumps / report IDs / waveforms) and the same trigger reproduces the same outcome.

  • Conditions: temperature window, airflow state, link speed/width, workload profile, power margining level.
  • Procedure: cycle counts, injected faults, timeouts, and recovery policy.
  • Criteria: thresholds for PG stability, PLL lock, enumeration success, training recovery counts, error counters.
  • Evidence: event codes + timestamps, “last failure reason”, and signature snapshot (rails/temps/version).

Figure F11 — DoD flow: DVT → PVT → Field Evidence → Evidence Pack
[Figure: DVT/EVT validation (power cycles, config update/rollback, clock lock & drift, link train & stress) and PVT production tests (JTAG scan for opens/shorts, signature of rails/temps/version) feed card-side field evidence logs (power, clock, link/config) into the evidence pack: pass/fail thresholds, stored artifacts, reproducible triggers, ready for ship/RMA triage.]
Tip: keep counters coarse but consistent (loss_count, retry_count, last_reason, max_temp, last_image_slot).

R&D validation (DVT/EVT) — break the edge conditions
Each DoD item must define: Setup (temp/airflow/load/link mode), Procedure (counts/time/fault), Pass criteria (threshold), and Evidence (stored fields/report IDs/waveforms).
  • Power cycle repeatability (cold + warm cycles). Setup: N cycles; defined ramp; defined airflow; fixed workload after link-up. Pass: no PG bounce; no stuck state; enumeration success rate = 100%; recovery count ≤ threshold. Evidence: pg_mask timeline, last_fault_code, boot_state, retry_count, timestamp. MPN hooks: supervisor / log NVM.
  • Margining robustness (key rails). Setup: ±X% margin on selected rails; repeat at two temperatures. Pass: stable lock + stable training; no silent degradation (error counters remain bounded). Evidence: rail_vout/iin snapshots, margin_level, link_error counters. MPN hooks: PMBus manager.
  • Fault injection (rail drop / UV / OT). Setup: inject one fault at a time; defined timeout policy. Pass: correct containment (alarm vs shutdown); reason code always present; recovery is repeatable. Evidence: fault_code + rail_id, last_reason, shutdown_class, retry_count. MPN hooks: multi-rail controller.
  • Config update & rollback (golden + update). Setup: multiple update/verify/switch cycles; cut power during critical windows. Pass: no “brick” state; rollback triggers are deterministic; golden always boots. Evidence: image_slot, cfg_result, rollback_count, bitstream_id, signature_ok. MPN hooks: QSPI NOR / FRAM.
  • Clock lock stability (thermal sweep). Setup: hold at temperature corners; refclk present/absent cases. Pass: lock time within limit; lock_loss_count = 0 (or bounded) during stress. Evidence: pll_lock_time_ms, lock_loss_count, refclk_present. MPN hooks: jitter cleaner.
  • Link training endurance (long run). Setup: T hours; multiple speeds/widths; workload toggles. Pass: link drops = 0 (or bounded); Recovery bounded; LTSSM never stuck. Evidence: link_up_count, link_drop_count, recovery_count, last_ltssm_group. MPN hooks: telemetry monitor.
“Example MPN hooks” indicates which component classes typically provide the telemetry/controls needed to collect evidence; exact picks depend on rail count, current, and firmware ownership.
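A sketch of the margining-robustness item as a DVT script, assuming a PMBus-style `pmic.set_margin(rail, fraction)` wrapper and the card counters as illustrative hooks; the key disciplines from H2-5 and this checklist are one rail group at a time, an identical stress script per step, and baseline restore with cooldown:

```python
import time

COUNTERS = ("link_error_count", "lock_loss_count", "recovery_count")

def margin_sweep(pmic, card, rails, steps=(-0.05, +0.05), hold_s=120):
    """DVT margining sketch. `pmic.set_margin` and the counter names are
    illustrative stand-ins for the real telemetry interface."""
    results = []
    for rail in rails:
        for step in steps:
            before = {c: card.read(c) for c in COUNTERS}
            pmic.set_margin(rail, step)      # e.g. -5% / +5% of nominal
            time.sleep(2)                    # settle before training/config
            card.run_stress(hold_s)          # same stress script every step
            delta = {c: card.read(c) - before[c] for c in COUNTERS}
            results.append({"rail": rail, "step": step, "delta": delta})
            pmic.set_margin(rail, 0.0)       # restore nominal...
            time.sleep(30)                   # ...and cool down before the next step
    return results
```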
Production test (PVT) — fast screen with high signal-to-noise
  • JTAG boundary scan. Goal: detect solder opens/shorts on critical nets with fixture-friendly coverage. Pass: scan passes; failing nets map to an “open/short” class; repeatable across fixtures. Evidence: scan_report_id, failing_net_list (on fail), coverage summary.
  • Signature snapshot. Goal: single-page health record for shipment gating. Pass: all rails in range; temperatures in range; version fields valid; link trains once. Evidence: bitstream_id, rails_vout/iin, max_temp, pll_lock, link_up_once flag.
Production-friendly design tip: keep a read-only signature block (rails/temps/version/counters) accessible over the card’s management interface, so the fixture can log one consistent JSON-like record per unit.
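A sketch of that signature record as the fixture might log it, one JSON record per unit; the field names mirror the tables above and the rail list is illustrative:

```python
import json
import time

def signature_snapshot(card) -> str:
    """Read-only, single-record health signature for shipment gating; the
    fixture logs one consistent JSON record per unit."""
    record = {
        "timestamp": time.time(),
        "bitstream_id": card.read("bitstream_id"),
        "image_slot": card.read("image_slot"),
        "rails": {r: {"vout": card.read(f"{r}_vout"),
                      "iin":  card.read(f"{r}_iin")}
                  for r in ("VCCINT", "AVCC_GTX", "AUX_3V3")},
        "max_temp": card.read("max_temp"),
        "pll_lock": card.read("pll_lock"),
        "link_up_once": card.read("link_up_count") > 0,
        "counters": {c: card.read(c) for c in
                     ("retry_count", "lock_loss_count", "link_drop_count")},
    }
    return json.dumps(record, sort_keys=True)
```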
Field evidence (card-side) — minimal logs that stop guesswork
Logs should answer: what happened, which domain (power/config/clock/link), what was the last stable state, and what changed (temperature/rail margin/retry).
  • Power. Fields: rail_id, pg_mask, fault_code, shutdown_class, derating_level, max_temp_at_fault, retry_count (optional: per-rail min/max vout, brownout_count). Update: on PG transitions, fault assertion, shutdown decision, recovery attempt. Why: separates “rail instability” from “protocol symptoms”.
  • Clock. Fields: refclk_present, pll_lock_time_ms, lock_loss_count, last_lock_loss_temp (optional: refclk_fault_count). Update: on lock/unlock, link-training start, thermal threshold crossings. Why: explains intermittent training failures and thermal drift sensitivity.
  • Link + Config. Fields: image_slot, cfg_result, rollback_count, bitstream_id, signature_ok, link_up_count, link_drop_count, recovery_count, last_ltssm_group. Update: after configuration completes, after training completes, on each recovery/drop. Why: distinguishes “bitstream/update issues” from “training margin issues”.
Reference MPN list (examples) — parts that enable the evidence
These are example manufacturer part numbers commonly used for card-side telemetry, supervision, logging, and clock conditioning. Final selection depends on rail count/current, footprint, firmware ownership, and supply constraints.
  • PMBus power system manager. Example: LTC2977 (8-channel power system manager with fault logs/telemetry). Fit: enables repeatable margining, rail snapshots, and fault logs used in DVT cycles and production “signature” records.
  • Digital multiphase controller. Examples: Infineon XDPE132G5C / XDPE132G5H, Renesas ISL68137, MPS MP2953B (phase count and protocol vary by design). Fit: programmable sequencing/telemetry hooks to correlate rail behavior with configuration/training outcomes (used by DoD evidence fields).
  • Temperature sensing (multi-zone). Example: TI TMP468 (multi-zone SMBus/I²C temperature sensor). Fit: thermal-triage evidence: max_temp, temp-at-fault, derating correlation, and hotspot mapping.
  • Digital power monitor. Example: TI INA228 (20-bit shunt-based power/energy monitor). Fit: quantifiable “rail load vs symptom” snapshots to back up margining and endurance claims.
  • Multi-rail supervisor / watchdog. Example: TI TPS386000 (quad-supply supervisor with delay + watchdog). Fit: hardens reset/timeout policies; deterministic gating for the state-machine timeouts in the DoD pass/fail criteria.
  • QSPI/Quad-SPI NOR flash. Example: Winbond W25Q128JV (128 Mbit serial NOR flash family). Fit: stores golden/update images and metadata fields (image_slot, cfg_result, rollback_count, version tags).
  • Nonvolatile log memory (FRAM). Examples: FM25V02-G / FM25V02A (SPI FRAM). Fit: stores small, high-endurance “last_reason / counters / timestamps” records that survive power loss during faults.
  • Jitter cleaner / clock generator. Example: Si5341B-D-GM (ultra-low-jitter clock generator / jitter-attenuator family). Fit: supports clock-lock evidence (lock_time, lock_loss_count) and reduces clock noise disguised as link problems.


H2-12 — FAQs (10) for FPGA Accelerator Cards

Each answer is written for card-side triage: the shortest measurable path from symptom → signals → evidence fields, with references back to the relevant chapters (H2-4…H2-11). No platform-level detours.

40–70 words per answer · 3–5-step triage path · Card-side evidence fields · Maps back to H2 chapters

Figure F12 — FAQ triage map (Symptom → Fast checks → Chapter)
[Figure: FAQ triage map. Card not detected → PERST#, REFCLK, PLL_LOCK, boot_state, LTSSM group (H2-4/H2-9). Config fails → cfg_reason, image_slot, rail_min_v, temp_at_cfg (H2-6/H2-5). Thermal link drops → derating_level, rail droop, lock_loss, recovery_count (H2-8/H2-10). JTAG scan errors → TCK rate, chain power, level domain, fixture (H2-7).]
Keep card logs coarse but consistent: boot_state, pg_mask, fault_code, derating_level, pll_lock_loss_count, last_ltssm_group, cfg_reason.

1. Why are all power rails “PG green” but PCIe still cannot enumerate the card?
“PG green” only proves voltages reached thresholds; enumeration also needs PERST# release, a stable REFCLK, and a valid boot state. The most common blockers are PERST# gated too early/late, REFCLK present but PLL not locked, or lane/sideband binding mistakes that keep the endpoint from reaching Detect/Polling.
  1. Confirm perst_seen and the release moment vs boot_state (H2-4).
  2. Check refclk_present and pll_lock_time_ms (H2-8).
  3. Read a coarse last_ltssm_group to see if training even started (H2-9).
Maps to: H2-4 sequencing/state machine, H2-9 PCIe sidebands & observability.
2. Flash reads/writes fine, but bitstream programming fails intermittently—what is the most common cause?
Intermittent configuration failure is usually caused by a configuration-time stability window violation: rail droop/derating during configuration, unstable clock/PLL lock while the FPGA samples the stream, or metadata mismatch (image pointer/signature) that only appears on certain boots. “Readable flash” does not guarantee at-speed, low-noise, deterministic configuration.
  1. Capture cfg_fail_reason, image_slot, signature_ok (H2-6).
  2. Log rail_min_v_at_cfg and temp_at_cfg (H2-5/H2-10).
  3. Verify PLL lock stayed stable across the full configuration window (H2-8).
Maps to: H2-6 config architecture, H2-5 fault triage.
3. One boot succeeds and the next fails—what three sequencing parameters should be checked first?
The fastest way to eliminate “random boots” is to lock down three parameters: (1) dependency gating (which PG/LOCK must be true before the next step), (2) ramp rate (soft-start slope and inrush behavior), and (3) timeout/backoff policy (how long to wait and how retries are spaced). Small shifts here create PG bounce and silent early-stage stalls.
  1. Compare a successful vs failed boot_state trace and pg_mask transitions (H2-4).
  2. Check whether the failure always occurs after the same timeout stage (H2-4).
  3. Inspect retry loops: no rapid “power-bounce” cycles (H2-11 DoD).
Maps to: H2-4 sequencing/state machine.
4. How should voltage margining be done without making the system “less stable the more it is tested”?
Safe margining is controlled change, not repeated stress. Margin one rail group at a time, keep a settle window before training/config, and always restore baseline with a cooldown to avoid thermal accumulation. Instability usually comes from stacking margins across coupled domains (power + clock), or from aggressive retries that hide the real boundary.
  1. Baseline snapshot: rails/temps/counters (H2-11).
  2. Apply one margin step → hold → run the same stress script (H2-4/H2-11).
  3. Restore baseline and verify counters did not drift (H2-11).
Maps to: H2-4 sequencing controls, H2-11 validation methodology.
5. Training fails as temperature rises—how to tell clock jitter from power derating?
Separate the two by evidence: power-derating signatures show up as derating_level, rail droop, or fault warnings near the failure; clock/jitter issues show up as lock instability (pll_lock_loss_count, abnormal pll_lock_time_ms) while rails remain within bounds. Both can produce the same symptom, so logs must capture “what changed first.”
  1. Check max_temp and whether derating was asserted (H2-10).
  2. Compare rail minima vs PLL lock stability around the event (H2-8/H2-10).
  3. Correlate with recovery_count and LTSSM group changes (H2-9).
Maps to: H2-8 clocking, H2-10 thermals/derating.
6. JTAG connects, but boundary scan shows many errors—what hardware-layer issues are typical?
Many boundary-scan failures are board-level, not logic-level: voltage-domain mismatch on JTAG pins, a device in the chain not powered, TRST#/reset unintentionally asserted, TCK too fast for the fixture/cabling, or poor contact resistance. In multi-device chains, one unready device can corrupt the entire scan result.
  1. Verify JTAG I/O level compatibility and power-domain readiness (H2-7).
  2. Reduce TCK and re-run to test signal-integrity sensitivity (H2-7).
  3. Segment the chain (jumpers/bypass) to identify the failing stage (H2-7).
Maps to: H2-7 JTAG & production observability.
7. PERST# timing looks correct, but LTSSM is stuck early—what root causes are common?
Early LTSSM stalls usually mean the training prerequisites are not truly met: REFCLK quality/lock is marginal, lane polarity or lane mapping is wrong, sidebands (CLKREQ#/WAKE# where applicable) are missing, or the endpoint is not ready when PERST# is released. The key is to confirm whether LTSSM ever leaves Detect and how consistently it reaches Polling.
  1. Log refclk_present and PLL lock stability before PERST# release (H2-8).
  2. Read last_ltssm_group to classify “Detect vs Polling vs Config” (H2-9).
  3. Validate sidebands and lane mapping rules at the card connector (H2-9).
Maps to: H2-9 PHY ties & observability, H2-8 clock prerequisites.
8. How should golden image / fallback be designed so one update cannot brick the card?
A safe update requires two images (immutable golden + update slot), a verifiable commit point, and deterministic rollback. Always boot golden by default, validate the update (signature/CRC + first-boot health check), and only then set a commit flag. Power-loss windows must be assumed: metadata should update atomically, and rollback must be possible without external tools.
  1. Maintain image_slot, pending_update, commit_flag (H2-6).
  2. Define rollback triggers: signature fail, DONE timeout, health check fail (H2-6).
  3. Record rollback_count and last_update_result (H2-11).
Maps to: H2-6 configuration architecture, H2-11 evidence pack.
9. A multi-rail PMIC reports a fault but voltages look “normal”—how to interpret fault classes and logs?
“Normal voltage now” can hide a fault that was transient or latched: short UV dips, overcurrent foldback, thermal warning, or PG deglitch behavior. Treat the PMIC report as an event record: read fault class (latched vs auto-retry), map it to rail_id, and correlate with rail minima/peak current at the fault timestamp. That correlation is the fastest root-cause filter.
  1. Read fault_code, fault_class, rail_id (H2-5).
  2. Capture rail_min_v_at_fault and iin_peak_at_fault (H2-5/H2-11).
  3. Check whether derating/temperature triggered the event (H2-10).
Maps to: H2-5 PG/FAULT triage.
10. What hidden defects are most often missed in production, and how can tests cover them?
The most expensive “escapes” are marginal defects: weak joints that pass basic power-up, wrong strap values that only appear under training, thermal-interface problems that show up after minutes, or connector/contact intermittency. Coverage improves when production tests combine boundary scan, a signature snapshot, and a short at-speed stress (training/loopback) plus a quick thermal/electrical stimulus that exposes margins.
  1. Run boundary scan for opens/shorts (H2-7).
  2. Record a consistent signature (rails/temps/version/counters) per unit (H2-11).
  3. Add a short at-speed training + counter check (H2-9/H2-11).
Maps to: H2-7 production observability, H2-11 DoD checklist.
