FPGA Accelerator Card: Power, JTAG, Clocks, PCIe/CXL
An FPGA accelerator card succeeds in the data center only when its four planes—power, configuration/control, clocks, and PCIe/CXL ties—are engineered as a single measurable boot-and-train sequence, not as isolated blocks. Reliable bring-up comes from card-side observability (PG/FAULT, boot states, lock signals, LTSSM groups, and update/rollback evidence) that turns “card not detected” and “intermittent failures” into repeatable triage steps.
Boundary: FPGA endpoint card scope, evidence, and link-only neighbors
The fastest way to ship a reliable FPGA accelerator card is to lock the scope early: what is owned on the card (power/config/clock/PCIe-CXL bindings) and what is only referenced (platform fabric, rack power, management stack). This prevents “looks fine” bring-up loops caused by mixing system-level topics into card-level debugging.
Scope Guard (mechanically checkable)
Link-only neighbors (mention and link, no expansion): PCIe Switch/Retimer · 48V/12V Bus & Hot-Swap · BMC · Time Card · In-band Telemetry & Power Log.
What this page solves (reader tasks)
- Define a card-level power tree (rails, dependencies, sequencing) that stays stable across temperature and workload transients.
- Turn PG/FAULT signals into a deterministic triage flow (symptom → first signal → most likely plane).
- Design a safe configuration path: boot modes, flash partitions, update/rollback, and “recoverable by JTAG” guarantees.
- Keep on-card clocks lock-stable and jitter-contained so link training does not drift into intermittent failures.
- Bring up PCIe/CXL endpoints using only card-owned evidence (PERST#/CLKREQ#, PLL_LOCK, DONE, LTSSM states).
Architecture: unify bring-up and debugging with four planes
A stable FPGA accelerator card behaves like a coordinated system, not a set of independent blocks. The “four planes” model turns complex failures into deterministic triage: identify the plane that holds first evidence, then check the coupling points where failures propagate across planes.
Four planes (each plane includes: signals → evidence → failure signature)
| Plane | Critical signals / resources | What to observe first | Typical failure signature |
|---|---|---|---|
| Data (PCIe / CXL) | High-speed lanes, lane mapping, PERST#, CLKREQ#, capabilities | Enumeration result, LTSSM state, link width/speed, error counters | Not detected, downshift, intermittent drop under load |
| Power (multi-rail) | Power tree, sequencing dependencies, PG/FAULT, margin controls | PG timing, rail droop/ripple evidence, fault codes, retry behavior | Boot flakiness, "works cold" but fails hot, configuration fails mid-way |
| Control (config & debug) | QSPI/OSPI flash, boot-mode straps, JTAG chain, SMBus/I3C hooks | Config progress/DONE, flash status, version/rollback state | Update bricks the card, stuck init, large lot-to-lot variation |
| Clock (refclk & PLL) | Refclk input, PLL/lock indicator, clock fanout, sensitive rails | PLL_LOCK stability, refclk presence, drift vs. temperature | Training unstable, failures appear after warm-up or noise events |
Practical rule: when a failure looks like “data plane”, check clock and power coupling first; when a failure looks like “control plane”, check power integrity during configuration.
Coupling points (where root causes hide)
| Coupling | Why it matters | First evidence to check |
|---|---|---|
| Power ↔ Clock | Clock IC and transceiver supplies amplify rail noise; lock may look “OK” but jitter worsens link margin. | PLL_LOCK stability, sensitive rail PG/FAULT, temperature trend |
| Clock ↔ Data | Refclk quality and lock timing gate training and stability; weak margin shows as downshift or retrain events. | LTSSM transitions, link speed/width shifts, error bursts |
| Power ↔ Control | Configuration is a “stress test”: flash reads/writes and FPGA init can expose droop or sequencing gaps. | Config progress stalls, fault codes during init window, rail droop evidence |
| Control ↔ Data | Bitstream/version determines endpoint behavior; mismatch can look like lane or training faults. | Device ID, image selection (A/B), capability reporting consistency |
Bring-up “first hour” checklist (5 steps)
- Power plane: all required PG signals asserted in order; no silent retries or latched faults.
- Clock plane: refclk present; PLL_LOCK is stable across a short warm-up window.
- Control plane: configuration reaches DONE; image selection/rollback state is known.
- Data plane: PERST# release timing is sane; LTSSM reaches link-up with expected width/speed.
- Coupling check: stress (temp or load) does not flip any of: PLL_LOCK, PG/FAULT, LTSSM state.
Diagram reading tip: failures rarely stay inside one plane; the first stable evidence usually appears in power/clock signals before the data plane collapses.
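The five-step checklist above can be sketched as a single gating function. This is an illustrative sketch, not a real driver API: the snapshot field names (`pg_asserted`, `pll_lock_stable`, `cfg_done`, `ltssm`, `stress_flipped_signals`) and the required rail set are assumptions for the example.

```python
# Hypothetical "first hour" bring-up gate. All field names are illustrative.
REQUIRED_PG = {"VCCINT", "AVCC", "MGMT"}  # example critical-rail set

def first_hour_check(snap: dict) -> list:
    """Return the list of failed checklist steps; an empty list means all five gates pass."""
    failures = []
    # Power plane: all required PG signals asserted
    if not REQUIRED_PG.issubset(snap.get("pg_asserted", set())):
        failures.append("power: missing PG")
    # Clock plane: lock stable across a warm-up window, not a momentary edge
    if not snap.get("pll_lock_stable", False):
        failures.append("clock: PLL_LOCK not stable over warm-up window")
    # Control plane: configuration reached DONE
    if not snap.get("cfg_done", False):
        failures.append("control: configuration did not reach DONE")
    # Data plane: LTSSM reached link-up
    if snap.get("ltssm") != "L0":
        failures.append("data: LTSSM did not reach link-up (L0)")
    # Coupling check: stress must not flip any gating signal
    if snap.get("stress_flipped_signals"):
        failures.append("coupling: stress flipped " + ",".join(snap["stress_flipped_signals"]))
    return failures
```

The return value doubles as evidence: the first entry names the plane that holds the first failure, which is where triage should start.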
Power tree and rail budgeting: derive rails from behaviors
An FPGA accelerator card power tree is a behavior model: rails are defined by what must stay stable during configuration, link training, and sustained workloads. Budgeting must capture both average and peak currents, plus noise sensitivity and dependencies that can turn “normal readings” into intermittent failures.
Rail families (card scope, no CPU VRM theory)
- Core / VCCINT: highest dynamic current; transient integrity gates overall stability and init robustness.
- Transceiver AVCC / RX / TX: noise- and jitter-sensitive domain; rail quality strongly impacts link margin and training stability.
- I/O banks: multiple voltage domains; mis-assigned bank voltages often show up as “works in some modes, fails in others”.
- DDR/HBM rails (if present): treat as a separate domain with isolation and clean bring-up windows; keep the discussion at power-domain level.
- AUX / 1.8V / 3.3V management: always-on/standby domain that keeps control/telemetry/rollback observable even when the main domain fails.
Budget workflow (practical steps)
- Start from a workload power profile (idle / configure / train / sustained) and map power to domains.
- Convert power to Iavg and Ipeak per rail; treat configuration and training windows as stress events.
- Add margin for process tolerance, temperature rise, aging, and worst-case lane width/speed targets.
- Flag noise-critical rails (especially transceiver supplies) and reserve routing/decoupling budget early.
- Close the loop with thermal planning: hottest components are often regulators/inductors, not only the FPGA die.
The next chapter (sequencing) turns these rails into a deterministic state machine with evidence (PG/FAULT) for triage.
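The power-to-current conversion in the workflow above is simple arithmetic, sketched below. The voltage, wattage, and flat 25% margin are placeholder assumptions; a real budget applies per-rail margins derived from tolerance, temperature, and aging analysis.

```python
# Illustrative rail-budget arithmetic: convert per-domain power to Iavg/Ipeak
# with a design margin. All numbers are placeholders, not recommendations.

def rail_budget(p_avg_w: float, p_peak_w: float, v_nom: float, margin: float = 1.25):
    """Return (Iavg, Ipeak) in amps with a flat design margin applied."""
    i_avg = p_avg_w / v_nom * margin
    i_peak = p_peak_w / v_nom * margin
    return i_avg, i_peak

# Example: a hypothetical 0.85 V core domain at 40 W average / 70 W peak
i_avg, i_peak = rail_budget(40.0, 70.0, 0.85)
```

Treat the configuration and training windows as separate (`p_peak_w`-class) stress events rather than folding them into the average.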
Rail checklist (definition → evidence → failure signature)
| Rail family | Target (typ.) | Iavg / Ipeak | Noise sensitivity | PG condition & dependency | Typical failure signature |
|---|---|---|---|---|---|
| Core (VCCINT) | Low-V | High / Very high | Med (droop-sensitive) | PG with debounce; must be stable before init | Boot flakiness, config stalls, resets under burst |
| Transceiver (AVCC/RX/TX) | Low-V | Med / High | High (jitter coupling) | PG tied to PLL/link windows; rail must be quiet | Training unstable, downshift, intermittent drop after warm-up |
| I/O banks | Multi-V | Low / Med | Med | PG per bank group; depends on mode/straps | Mode-specific failures, marginal GPIO/sideband behavior |
| Memory domain (if present) | Multi-V | Med / High | Med | Separate PG; sequenced to avoid init collisions | Init-time errors, intermittent during traffic spikes |
| Management (AUX/1.8/3.3) | Low/Med-V | Low / Low | Low | Always-on PG; keeps telemetry and recovery reachable | “No evidence” failures: cannot read faults, no recovery path |
Multi-rail PMIC sequencing: order, ramp, dependencies, and observability
Sequencing is the card’s control system. The goal is not only “power on”, but “power on with evidence”: each transition is gated by measurable conditions (PG, PLL_LOCK, PERST#, config progress) and protected by timeouts and fault policies that prevent oscillating retries.
Sequencing essentials (three levers)
- Order: always-on management rails first; main rails next; sensitive transceiver rails enter only when prerequisites are stable.
- Ramp: soft-start slope must avoid both droop (too fast) and watchdog-style timeouts (too slow).
- Dependencies: tie key transitions to PG, PLL_LOCK, and PERST# so training never starts on unstable foundations.
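The three levers above can be modeled as an evidence-gated state machine: each stage advances only when its gate condition is measurably true, and the first failing stage is latched as the “stopped here” marker. The state names and gate fields below follow this chapter’s model but are otherwise hypothetical.

```python
# Minimal sketch of sequencing as an evidence-gated state machine.
# Gate field names (pg_mgmt, pg_core, pg_xcvr, pll_lock_stable) are illustrative.

SEQUENCE = [
    ("MGMT_RAILS",    lambda ev: ev["pg_mgmt"]),                            # always-on first
    ("MAIN_RAILS",    lambda ev: ev["pg_core"]),                            # main rails next
    ("XCVR_RAILS",    lambda ev: ev["pg_core"] and ev["pg_xcvr"]),          # sensitive rails last
    ("PLL_LOCK",      lambda ev: ev["pll_lock_stable"]),                    # windowed lock
    ("PERST_RELEASE", lambda ev: ev["pll_lock_stable"] and ev["pg_xcvr"]),  # never train on unstable base
]

def advance(ev: dict) -> str:
    """Return the first state whose gate fails, or 'TRAIN' if every gate passes.
    Latching where progress stopped is the evidence used for triage."""
    for state, gate in SEQUENCE:
        if not gate(ev):
            return state
    return "TRAIN"
```

A timeout on any stage should latch the same state name, so “sequencing stalled” is always attributable to one gate.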
PMIC capabilities mapped to field outcomes
- Programmable soft-start: converts “ramp quality” into a controllable parameter, reducing lot-to-lot variance.
- PG/FAULT aggregation: turns intermittent failures into deterministic evidence (latched codes + timestamps).
- Margining: validates guardband for production, temperature drift, and aging without changing the nominal operating point.
- Retry policy: controlled retry/backoff prevents rail oscillation that can worsen link and configuration stability.
Fault policy should distinguish “hard stop” vs “retry/derate” vs “log-only” to keep cards observable without masking real damage.
Observability rules (failures must leave evidence)
- Latch the first fault source (rail, class, time window) before a retry clears the symptom.
- Record the transition where progress stopped (RAIL_RAMP, PLL_LOCK, INIT, TRAIN) to prevent blind debugging.
- Keep the management domain alive so telemetry and recovery remain reachable after main-rail shutdown.
Implementation guidance: expose PG/FAULT and PLL_LOCK as readable, latched evidence; keep the management domain alive so faults remain accessible after shutdown.
PG/FAULT design: turning “looks powered but unstable” into a deterministic triage path
Many field issues are not “wrong voltage,” but incorrect power-good definition and fault policy. Robust cards treat PG as a gate (with threshold, debounce, and delay) and treat FAULT as evidence (latched cause, state, and window), so symptoms map to a short, repeatable debug path.
PG design essentials (threshold · debounce · delay)
- Threshold: set PG around the worst-case operating window, not only the nominal rail value.
- Debounce: prevent transient dips from triggering oscillating retries; match debounce to soft-start and load-step behavior.
- Delay/settling: PG asserted does not mean “quiet enough” for training/config; reserve a settling window for sensitive domains.
Multi-PG aggregation (AND/OR rules that prevent false “ready”)
- HARD_OK (gate to INIT/TRAIN): critical rails combined with AND (Core + Transceiver + Mgmt reachable).
- SOFT_OK (run but log/derate): optional rails or secondary domains do not block progress, but must create events.
- LOG_ONLY (evidence only): near-threshold conditions (margin/temperature) should be recorded without causing reset storms.
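The HARD_OK / SOFT_OK / LOG_ONLY rules above amount to a small aggregation function. The rail-class assignments below are examples only; a real design encodes them in the supervisor or PMIC configuration.

```python
# Sketch of multi-PG aggregation. Rail-class membership is illustrative.
HARD = {"VCCINT", "AVCC", "MGMT"}   # AND-combined; gates INIT/TRAIN
SOFT = {"MEM"}                      # run, but every miss must create an event
LOG_ONLY = {"MARGIN_WARN"}          # record only; must never cause a reset storm

def aggregate(pg_and_flags: set, events: list) -> bool:
    """Return True when HARD_OK is satisfied; append events for SOFT/LOG conditions."""
    for rail in SOFT - pg_and_flags:            # missing optional rail -> event, not block
        events.append(("SOFT_FAIL", rail))
    for flag in LOG_ONLY & pg_and_flags:        # near-threshold condition -> evidence only
        events.append(("LOG", flag))
    return HARD <= pg_and_flags                 # false "ready" is impossible: AND over HARD
```

The key property is that SOFT and LOG conditions change the event stream, never the HARD_OK gate.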
The most common “powered but unstable” failure mode is entering training while a sensitive rail is not yet quiet or a prerequisite is not truly stable.
Fault-to-symptom mapping (first signals → suspect domains → logs)
| Symptom | First signals to check | Likely suspect domain | Associated evidence (logs/fields) |
|---|---|---|---|
| Not enumerated / training fails | HARD_OK composite, PERST#, LTSSM state | Transceiver rails, clock prerequisites | Last state, retry count, LTSSM snapshot, PG settle time window |
| Link drops under stress | Transient FAULT, rail droop indicators, temperature point | Core rail transient integrity, transceiver noise coupling | Peak-current window, FAULT latch time, temperature at event, error counters |
| Bitstream fails mid-configuration | Config progress, flash read integrity flags | Management + config domain stability | Config phase marker, integrity check result, brownout-style events |
| Intermittent reset after warm-up | Thermal threshold events, repeated retries | Regulator/inductor hotspots, protection thresholds | Over-temp flags, hysteresis/derate state, last-good timestamp |
Keep management rails alive during faults so evidence remains readable after main-rail shutdown.
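The fault-to-symptom table can be shipped as data, so triage tooling always starts from the same first signals. The symptom keys and signal names below mirror the table and are otherwise illustrative.

```python
# Symptom -> first-signals lookup, mirroring the fault-to-symptom table above.
TRIAGE = {
    "not_enumerated":     ["HARD_OK composite", "PERST#", "LTSSM state"],
    "drops_under_stress": ["transient FAULT", "rail droop indicators", "temperature point"],
    "cfg_fails_midway":   ["config progress", "flash read integrity flags"],
    "reset_after_warmup": ["thermal threshold events", "repeated retries"],
}

def first_signals(symptom: str) -> list:
    """Return the ordered first-check list for a symptom; unknown symptoms
    fall back to collecting a generic signature snapshot."""
    return TRIAGE.get(symptom, ["signature snapshot (rails/temps/version)"])
```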
Configuration architecture: boot modes, flash, fallback, and safe updates
Configuration reliability is a system problem: stable prerequisites (PG + clock lock), an integrity-checked boot source, and an update pipeline that can always fall back to a known-good image. The objective is to prevent “one update bricks the card” while keeping a minimal recovery path.
Milestones from power-on to DONE (vendor-agnostic)
- Prerequisites ready: HARD_OK PG composite + stable clock lock window.
- Boot source selected: straps/mode pins define which image is attempted.
- Read + integrity check: CRC/ECC/signature gates the transition to load.
- Load + progress markers: record the phase where progress stops (evidence for H2-5 triage).
- DONE / ready: only then enable downstream training and full workloads.
Config flash selection (what usually breaks in the field)
- Capacity planning: Golden + Update images, metadata, version tags, and rollback records.
- Speed vs window: longer load windows increase exposure to thermal drift and transient power events.
- Integrity features: ECC/CRC and signed manifests prevent silent corruption from becoming runtime instability.
- Write endurance: update cadence + wear leveling strategy must avoid “hot-sector death”.
- Power-loss safety: update steps must be transactional; incomplete writes must never replace Golden.
Fallback model (Golden / Update) and rollback conditions
- Golden image: minimal, validated, and protected; used as the guaranteed recovery anchor.
- Update image: versioned and verified; activated only after integrity and health gates.
- Rollback triggers: integrity failure, repeated init/train failures, crash loops, or mismatch against board ID/revision policy.
- Minimal recovery path: keep a service mode available (e.g., JTAG path) even when flash is unreadable.
Keep the update pipeline separate from the “bootable anchor” so failures remain recoverable by design.
Update safety principle: never overwrite the recovery anchor; treat each transition as a gate (verify + health) and always latch the rollback reason.
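The update safety principle above can be sketched as a transaction that only ever writes the Update slot. Slot names, gate fields (`signature_ok`, `health_ok`), and the rollback-reason string are assumptions for illustration; the invariant being demonstrated is that Golden is never a write target.

```python
# Hedged sketch of a Golden/Update (A/B) transition. The recovery anchor
# ("golden") is never overwritten; all field names are illustrative.

def apply_update(slots: dict, new_image: dict) -> dict:
    """Write only the Update slot; activate it only after verify + health gates.
    A power cut mid-update leaves an uncommitted Update slot and a bootable Golden."""
    assert "golden" in slots, "recovery anchor must exist before any update"
    slots["update"] = dict(new_image, committed=False)   # transactional write
    if new_image.get("signature_ok") and new_image.get("health_ok"):
        slots["update"]["committed"] = True
        slots["active"] = "update"
    else:
        slots["active"] = "golden"                       # deterministic rollback
        slots["rollback_reason"] = "verify_or_health_gate_failed"  # latched evidence
    return slots
```

Note that the rollback reason is latched in the same step that selects Golden, matching the “always latch the rollback reason” rule.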
JTAG & board-level debug: bring-up, boundary scan, and production observability
JTAG is the lowest-level access path when the card cannot boot into a higher-layer workflow. A robust design treats JTAG as an engineering deliverable: it supports early bring-up, shortens “powered but not enumerated” debug loops, and provides boundary-scan coverage for manufacturing.
Three JTAG use cases by lifecycle stage
- Bring-up: validate power/clock/config prerequisites using chain visibility (ID response) and minimal access steps.
- Debug: isolate the breakpoint for “powers on but cannot enumerate” by checking reset release, configuration readiness, and chain continuity.
- Production test: use boundary scan to catch solder opens/shorts on critical sideband nets and essential connectivity.
JTAG chain design (multi-device chains)
- Reset interaction: avoid keeping the chain permanently held by board reset policy (TRST# / reset domain alignment).
- TCK strategy: support safe low-speed bring-up and higher-speed production modes without marginal timing.
- Voltage domains: ensure the access port matches device IO levels; include protection and reference rails if required.
- Isolation / bypass: provide a practical way to bypass segments to localize a broken section in the field.
- Mechanical access: the connector or pads must remain reachable with a fixture without removing thermal or airflow hardware.
JTAG access checklist (deliverable)
| Item | What to specify |
|---|---|
| Access location | Header/pads placement, fixture approach direction, ground reference proximity |
| Minimum signals | TCK, TMS, TDI, TDO, TRST# (if used), GND, and a reference for the IO voltage level domain |
| Protection | ESD handling, series resistors, and safe behavior under accidental mis-connection |
| Level domain | IO voltage compatibility and any translation/isolation policy for mixed-voltage chains |
| Bypass points | Jumpers/resistor options to isolate chain segments for fault localization |
| Production fixture | Probe/pogo requirements, alignment features, and test-time constraints |
Boundary scan focus: prioritize essential sideband connectivity (reset, strap-like signals, management busses, and the JTAG path itself).
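The “isolation / bypass” point above implies a simple localization procedure: read the chain at low TCK and compare what responds against the expected device order. The device names below are hypothetical; real chains compare 32-bit IDCODE values.

```python
# Hypothetical chain-continuity check for fault localization.
EXPECTED = ["FPGA", "CPLD", "FLASH_BRIDGE"]   # illustrative chain order

def locate_break(read_ids: list) -> str:
    """Return 'chain OK' or a message naming the first divergent or missing position."""
    for i, (exp, got) in enumerate(zip(EXPECTED, read_ids)):
        if exp != got:
            return f"break or wrong device at position {i} (expected {exp})"
    if len(read_ids) < len(EXPECTED):
        return f"chain truncated after position {len(read_ids) - 1}"
    return "chain OK"
```

Combined with bypass jumpers, a “truncated after position N” result points directly at the segment between device N and N+1.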
Clocking & synthesis: refclk sources, PLL/jitter cleaners, and distribution
Card-level clocking must be reliable to lock, controlled for jitter, and explainable across temperature. The clock tree is part of the bring-up and power-state definition: training should only proceed after a stable lock window is reached and recorded.
Clock tree objectives (engineering KPIs)
- Lock robustness: deterministic lock after power transitions and resets, with clear lock visibility.
- Jitter control: preserve margin for the most sensitive links (PCIe/CXL SerDes reference paths).
- Thermal explainability: temperature-driven failures must map to lock state, noise coupling, and measurable checkpoints.
Key design points (card-level only)
- Refclk sourcing: host-provided vs on-card sources require an explicit “valid” policy and a clear fallback behavior.
- Power dependencies: PLL/jitter-cleaner analog rails must be stable before lock is considered meaningful.
- Distribution: fanout topology and domain partitioning reduce noise injection into sensitive clock paths.
- Common pitfall: PG passes but lock is not stable; training starts too early or drifts after warm-up.
Clock checklist (deliverable)
| Item | What to record or verify |
|---|---|
| Refclk spec | Frequency, tolerance window, and the “valid” indication used by the card |
| Lock indicator | PLL_LOCK signal policy, stable-window definition, and where it gates link training |
| Power dependency | Analog/digital rail readiness prerequisites and sequencing tie-ins |
| Distribution plan | Fanout stages, endpoint mapping, and sensitive-path isolation rules |
| Measurement points | Suggested test points for refclk, post-cleaner clock, and endpoint clocks |
| Thermal behavior | Lock-loss counters, last lock-loss time marker, and correlation to temperature windows |
Common failure pattern: power-good passes while lock is not stable; gating must use a stable window, not a momentary edge.
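“Stable window, not momentary edge” reduces to requiring lock to be continuously true for a hold window before training is allowed. The sample representation and window length below are placeholder assumptions.

```python
# Sketch of windowed lock gating: a momentary lock edge is not enough.
# `samples` is a periodic capture of the PLL_LOCK signal; window length is a placeholder.

def lock_stable(samples: list, window: int = 10) -> bool:
    """True only if the most recent `window` consecutive samples all show lock."""
    return len(samples) >= window and all(samples[-window:])
```

Any lock-loss inside the window restarts it, which is exactly the behavior a momentary-edge gate misses.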
PCIe / CXL PHY ties: lane mapping, sidebands, reset & training observability
Endpoint-card integration is defined by hard ties: lane mapping and polarity decisions, sideband wiring and timing, and a minimal observability path for training stability. The goal is not just enumeration, but repeatable training with fast isolation when failures appear under load or temperature.
Endpoint-card binding requirements (no retimer deep-dive)
- Lane mapping & polarity: document connector lanes → board nets → FPGA SerDes channels; polarity inversion must be explicit and consistent.
- Sidebands: PERST#, CLKREQ#, and WAKE# (if used) must align with power-good and clock stability windows.
- Training observability: use LTSSM grouping and error counters to turn symptoms into a deterministic debug path.
- CXL note: when claimed, apply stricter stability windows and additional training/health checks without expanding fabric architecture.
Bring-up quick path (deliverable)
The quick path is designed to separate “wiring/timing prerequisites” from “training margin” issues.
Signals & test points table (minimum set)
| Signal | Purpose | Expected behavior | Probe & symptom cue |
|---|---|---|---|
| REFCLK | Training reference | Present before PERST# release; stable during training | TP_REFCLK; absence or instability → Detect/Polling stalls |
| PLL_LOCK | Clock stability gate | Must be stable-window true before training starts | TP_LOCK; “PG OK but unstable link” often correlates to non-windowed lock |
| PERST# | Endpoint reset | Released after PG + stable lock window; no bounce | TP_PERST; bounce → intermittent enumeration / early failures |
| CLKREQ# | Clock request (if used) | Consistent polarity/levels with platform policy; clean edges | TP_CLKREQ; mis-handled policy → intermittent low-power transitions |
| WAKE# | Wake signaling (if used) | Valid only in defined states; avoid floating conditions | TP_WAKE; floating or wrong pull → spurious wake or no wake |
| LTSSM group | Training stage | Detect → Polling → Config → L0; Recovery should be bounded | Readout/telemetry; frequent Recovery → margin/jitter/power integrity suspicion |
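The table’s prerequisites compose into two small checks: a PERST# release gate, and a bound on Recovery frequency. The Recovery rate limit below is a placeholder, not a specification value.

```python
# Sketch tying the signal table together. Rate limit is an illustrative assumption.

def may_release_perst(refclk_present: bool, pll_lock_windowed: bool, hard_ok: bool) -> bool:
    """Endpoint reset release requires the PG composite plus a *windowed* lock,
    with REFCLK present before release."""
    return refclk_present and pll_lock_windowed and hard_ok

def recovery_bounded(recovery_count: int, window_s: float, limit_per_min: float = 5.0) -> bool:
    """Bounded Recovery rate; exceeding it points at jitter/power-integrity margin loss."""
    return recovery_count / (window_s / 60.0) <= limit_per_min
```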
Thermals & derating: closing the loop between power, temperature, and reliability
Thermal issues often surface as link-training or configuration failures because temperature shifts power margin and clock stability. A usable thermal plan combines sensor placement, thresholds, and a derating policy that keeps the system observable while reducing risk.
Why thermals can look like “link/config” problems
- Temperature rise → power margin shrinks: droop and noise worsen near limits, increasing retries and intermittent failures.
- Temperature rise → clock margin shrinks: lock edges and jitter budget degrade, pushing training into Recovery more often.
- Result: enumeration instability, training drops, or configuration timeouts appear “protocol-like” but originate from thermal coupling.
Sensor placement & thresholds (card-level)
| Hotspot zone | What it indicates | Recommended policy |
|---|---|---|
| FPGA hotspot | Core thermal headroom | Warning/critical thresholds; correlate with training stability and configuration completion rate |
| PMIC / power stages | Derating triggers | Track protection edges; log derating flags and fault counters for reproducible triage |
| Inductors | Local heat accumulation | Use as early indicator for sustained stress; monitor alongside droop/noise symptoms |
| Clock/PLL chip | Lock stability margin | Correlate lock loss / training Recovery frequency with temperature windows |
Derating strategy (observable & stable)
- Level 1 — Warn: raise alarms and log the thermal window (no performance change).
- Level 2 — Derate: reduce stress to protect margin while preserving observability (record flags and timestamps).
- Level 3 — Protect: enter a safe state with controlled retry rules and explicit reason codes.
Derating should align with clock lock gating and training stability checks to prevent “silent degradation”.
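The three derating levels can be sketched as a threshold ladder that always leaves evidence when it acts. The temperature thresholds are placeholders; real values come from the card’s thermal characterization.

```python
# Three-level derating sketch (Warn / Derate / Protect). Thresholds are placeholders.
THRESHOLDS = [(85.0, "WARN"), (95.0, "DERATE"), (105.0, "PROTECT")]

def derating_level(temp_c: float, log: list) -> str:
    """Return the highest level whose threshold is crossed; log every non-normal level
    so derating never becomes silent degradation."""
    level = "NORMAL"
    for limit, name in THRESHOLDS:
        if temp_c >= limit:
            level = name
    if level != "NORMAL":
        log.append((level, temp_c))   # observable: flag + value recorded
    return level
```

A real implementation adds hysteresis around each threshold so the level does not chatter near a boundary.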
Thermal triage checklist (deliverable)
- Reproduce: define load profile + airflow condition + warm-up time window; record ramp rate and ambient.
- Thresholds: note which hotspot crosses warning/critical at failure onset.
- Correlate: check derating flags, PLL lock stability window, and training Recovery/error counters in the same time slice.
- Isolate: compare “same load, different airflow” and “same temperature, different load” to separate power vs clock vs link margin.
- Confirm: verify stability after mitigation using the same stress and temperature window.
H2-11 — Validation & Production Checklist (Definition of Done)
This chapter turns “it seems to work” into a card-side evidence pack: repeatable pass/fail criteria, measurable thresholds, and minimal logs that make power/config/clock/link failures diagnosable without guessing.
Evidence pack (minimum)
A DoD item is complete only when a stored artifact exists (log fields / counter dumps / report IDs / waveforms) and the same trigger reproduces the same outcome.
- Conditions: temperature window, airflow state, link speed/width, workload profile, power margining level.
- Procedure: cycle counts, injected faults, timeouts, and recovery policy.
- Criteria: thresholds for PG stability, PLL lock, enumeration success, training recovery counts, error counters.
- Evidence: event codes + timestamps, “last failure reason”, and signature snapshot (rails/temps/version).
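The four evidence-pack sections above can be carried as one record type, with completeness checked mechanically. The field contents are illustrative; the structure mirrors the Conditions / Procedure / Criteria / Evidence list.

```python
# Minimal evidence-pack record sketch; section names echo the list above.
from dataclasses import dataclass, asdict

@dataclass
class EvidenceRecord:
    conditions: dict   # temperature window, airflow, link speed/width, workload, margin level
    procedure: dict    # cycle counts, injected faults, timeouts, recovery policy
    criteria: dict     # thresholds: PG stability, lock, enumeration, recovery/error counts
    evidence: dict     # event codes + timestamps, last failure reason, signature snapshot

def is_complete(rec: EvidenceRecord) -> bool:
    """DoD rule: a record counts only when every section carries stored artifacts."""
    return all(bool(v) for v in asdict(rec).values())
```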
R&D validation (DVT/EVT) — break the edge conditions
| DoD item | Setup (trigger) | Pass criteria (example) | Evidence (card-side) | Example MPN hooks |
|---|---|---|---|---|
| Power cycle repeatability (cold + warm cycles) | N cycles; defined ramp; defined airflow; fixed workload after link-up | No PG bounce; no stuck state; enumeration success rate = 100%; recovery count ≤ threshold | pg_mask timeline; last_fault_code; boot_state; retry_count; timestamp | Supervisor / log NVM |
| Margining robustness (key rails) | ±X% margin on selected rails; repeat at two temperatures | Stable lock + stable training; no silent degradation (error counters remain bounded) | rail_vout/iin snapshots; margin_level; link_error counters | PMBus manager |
| Fault injection (rail drop / UV / OT) | Inject one fault at a time; defined timeout policy | Correct containment (alarm vs. shutdown); reason code always present; recovery is repeatable | fault_code + rail_id; last_reason; shutdown_class; retry_count | Multi-rail controller |
| Config update & rollback (Golden + Update) | Multiple update/verify/switch cycles; cut power during critical windows | No “brick” state; rollback triggers are deterministic; Golden always boots | image_slot; cfg_result; rollback_count; bitstream_id; signature_ok | QSPI NOR / FRAM |
| Clock lock stability (thermal sweep) | Hold at temperature corners; refclk present/absent cases | Lock time within limit; lock_loss_count = 0 (or bounded) during stress | pll_lock_time_ms; lock_loss_count; refclk_present | Jitter cleaner |
| Link training endurance (long run) | T hours; multiple speeds/widths; workload toggles | Link drops = 0 (or bounded); recovery bounded; LTSSM never stuck | link_up_count; link_drop_count; recovery_count; last_ltssm_group | Telemetry monitor |
Production test (PVT) — fast screen with high signal-to-noise
| Test | Goal | Pass criteria (example) | Evidence |
|---|---|---|---|
| JTAG boundary scan | Detect solder opens/shorts on critical nets with fixture-friendly coverage | Scan passes; failing nets map to “open/short” class; repeatability across fixtures | scan_report_id; failing_net_list (if fail); coverage summary |
| Signature snapshot | Single-page health record for shipment gating | All rails in range; temps in range; version fields valid; link trains once | bitstream_id; rails_vout/iin; max_temp; pll_lock; link_up_once flag |
Field evidence (card-side) — minimal logs that stop guesswork
| Group | Fields (examples) | When to update | Why it matters |
|---|---|---|---|
| Power | rail_id, pg_mask, fault_code, shutdown_class, derating_level, max_temp_at_fault, retry_count Optional: per-rail min/max vout, brownout_count | On PG transitions, fault assertion, shutdown decision, recovery attempt | Separates “rail instability” from “protocol symptoms” |
| Clock | refclk_present, pll_lock_time_ms, lock_loss_count, last_lock_loss_temp Optional: refclk_fault_count | On lock/unlock, link training start, thermal threshold crossings | Explains intermittent training failures and thermal drift sensitivity |
| Link + Config | image_slot, cfg_result, rollback_count, bitstream_id, signature_ok, link_up_count, link_drop_count, recovery_count, last_ltssm_group | After configuration completes, after training completes, on each recovery/drop | Distinguishes “bitstream/update issues” from “training margin issues” |
Reference MPN list (examples) — parts that enable the evidence
| Function | Example MPNs | Where it fits the DoD |
|---|---|---|
| PMBus power system manager | LTC2977 (8-channel power system manager with fault logs / telemetry) | Enables repeatable margining + rail snapshots + fault logs used in DVT cycles and production “signature” records. |
| Digital multiphase controller | Infineon XDPE132G5C / XDPE132G5H; Renesas ISL68137; MPS MP2953B (examples; phase count and protocol vary by design) | Provides programmable sequencing/telemetry hooks to correlate rail behavior with configuration/training outcomes (used by DoD evidence fields). |
| Temperature sensing (multi-zone) | TI TMP468 (multi-zone SMBus/I²C temperature sensor) | Supports thermal triage evidence: max_temp, temp-at-fault, derating correlation, and hotspot mapping. |
| Digital power monitor | TI INA228 (20-bit power/energy monitor; shunt based) | Creates quantifiable “rail load vs symptom” snapshots to back up margining and endurance claims. |
| Multi-rail supervisor / watchdog | TI TPS386000 (quad-supply supervisor with delay + watchdog) | Hardens reset/timeout policies; provides deterministic gating for “state machine” timeouts that appear in DoD pass/fail criteria. |
| QSPI/Quad-SPI NOR flash | Winbond W25Q128JV (128M-bit serial NOR flash family) | Stores golden/update images and metadata fields (image_slot, cfg_result, rollback_count, version tags). |
| Nonvolatile log memory (FRAM) | FM25V02-G / FM25V02A (SPI FRAM examples) | Stores small, high-endurance “last_reason / counters / timestamps” records that survive power loss during faults. |
| Jitter cleaner / clock generator | Si5341B-D-GM (ultra-low jitter clock generator / attenuator family) | Supports clock-lock evidence (lock_time, lock_loss_count) and reduces “clock-noise disguised as link problems”. |
H2-12 — FAQs (10) for FPGA Accelerator Cards
Each answer is written for card-side triage: the shortest measurable path from symptom → signals → evidence fields, with references back to the relevant chapters (H2-4…H2-11). No platform-level detours.
1. Why are all power rails “PG green” but PCIe still cannot enumerate the card?
- Confirm `perst_seen` and the release moment vs `boot_state` (H2-4).
- Check `refclk_present` and `pll_lock_time_ms` (H2-8).
- Read a coarse `last_ltssm_group` to see if training even started (H2-9).
2. Flash reads/writes fine, but bitstream programming fails intermittently—what is the most common cause?
- Capture `cfg_fail_reason`, `image_slot`, `signature_ok` (H2-6).
- Log `rail_min_v_at_cfg` and `temp_at_cfg` (H2-5/H2-10).
- Verify PLL lock stayed stable across the full configuration window (H2-8).
3. One boot succeeds and the next fails—what three sequencing parameters should be checked first?
- Compare a successful vs a failed `boot_state` trace and `pg_mask` transitions (H2-4).
- Check whether the failure always occurs after the same timeout stage (H2-4).
- Inspect retry loops: no rapid “power-bounce” cycles (H2-11 DoD).
4. How should voltage margining be done without making the system “less stable the more it is tested”?
- Baseline snapshot: rails/temps/counters (H2-11).
- Apply one margin step → hold → run the same stress script (H2-4/H2-11).
- Restore the baseline and verify counters did not drift (H2-11).
5. Training fails as temperature rises—how to tell clock jitter from power derating?
Power derating shows up as `derating_level`, rail droop, or fault warnings near the failure; clock/jitter issues show up as lock instability (`pll_lock_loss_count`, abnormal `pll_lock_time_ms`) while rails remain within bounds. Both can produce the same symptom, so logs must capture “what changed first.”
- Check `max_temp` and whether derating was asserted (H2-10).
- Compare rail minima vs PLL lock stability around the event (H2-8/H2-10).
- Correlate with `recovery_count` and LTSSM group changes (H2-9).
6. JTAG connects, but boundary scan shows many errors—what hardware-layer issues are typical?
- Verify JTAG I/O level compatibility and power-domain readiness (H2-7).
- Reduce TCK and re-run to test signal-integrity sensitivity (H2-7).
- Segment the chain (jumpers/bypass) to identify the failing stage (H2-7).
7. PERST# timing looks correct, but LTSSM is stuck early—what root causes are common?
- Log `refclk_present` and PLL lock stability before PERST# release (H2-8).
- Read `last_ltssm_group` to classify Detect vs Polling vs Config (H2-9).
- Validate sidebands and lane-mapping rules at the card connector (H2-9).
8. How should the golden image / fallback be designed so one update cannot brick the card?
- Maintain `image_slot`, `pending_update`, `commit_flag` (H2-6).
- Define rollback triggers: signature fail, DONE timeout, health-check fail (H2-6).
- Record `rollback_count` and `last_update_result` (H2-11).
9. A multi-rail PMIC reports a fault but voltages look “normal”—how to interpret fault classes and logs?
- Read `fault_code`, `fault_class`, `rail_id` (H2-5).
- Capture `rail_min_v_at_fault` and `iin_peak_at_fault` (H2-5/H2-11).
- Check whether derating/temperature triggered the event (H2-10).
10. What hidden defects are most often missed in production, and how can tests cover them?
- Run boundary scan for opens/shorts (H2-7).
- Record a consistent signature (rails/temps/version/counters) per unit (H2-11).
- Add a short at-speed training + counter check (H2-9/H2-11).