GPU / AI Accelerator Card (Power, HBM, Retimers & Telemetry)
A GPU/AI accelerator card is won or lost at the board level: high-current VRMs must survive brutal transients, HBM rails must stay quiet during tight training windows, and PCIe/NVLink links must keep margin across temperature.
This page focuses on practical, measurable card-level integration—power tree partitioning, PDN/layout/sense, sequencing/inrush, retimer/clock placement, and telemetry loops—so issues become repeatable, diagnosable, and production-screenable.
Card-level engineering (not chip-level internals)
What this page solves (card-level “three pillars”)
- Power integrity: high-current VRMs and fast load steps where droop, noise, and protection timing decide stability (not average watts).
- Thermal reality: hotspot-driven behavior where sensor placement + control latency can cause throttling jitter and temperature-correlated failures.
- High-speed hooks: PCIe/NVLink integration on the card—placement, isolation, and verification of retimers/clock islands (no deep silicon theory).
Definition of “done” (what proves the card is solid)
- Transient-safe rails: droop and recovery remain within budget during worst-case load steps; protections do not false-trigger.
- Thermal-stable performance: sustained throughput without oscillating throttles; failures do not appear only at high temperature.
- Repeatable validation: a small, deterministic set of waveforms + telemetry fields can reproduce and isolate issues across units.
Explicit out-of-scope (link to sibling pages)
- Retimer internals: equalization/CDR/algorithms → PCIe Switch / Retimer page.
- Rack power system design: PSU paralleling, 48V distribution/hot-swap theory → CRPS / 48V Bus & Hot-Swap pages.
- Telemetry platforms: cross-node aggregation/anomaly detection architecture → In-band Telemetry & Power Log page.
The card is treated as a product that must deliver stable rails, controlled hotspots, and reliable links under real workloads. Deep silicon design and rack power theory are intentionally excluded to keep the page actionable and non-overlapping.
From upstream power to GPU rails: what the card must “digest”
Upstream power can look stable while the card fails because the limiting factors live in the end-to-end path: connector contact resistance, plane/busbar impedance, VRM transient behavior, and temperature-driven drift. A correct mental model separates where voltage is measured from where voltage matters.
Upstream constraints (treated as inputs, not a design topic)
- Input source: 12V or 48V presence is only a label; what matters is allowed droop, ripple, and transient capability at the card connector.
- Delivery path: slot + auxiliary connectors define current density, heating, and allowable voltage drop before VRMs can even respond.
- Environment: temperature and airflow/cold-plate contact shift losses and margins over time.
Why “upstream looks fine” but the card still drops
- Static drop: current through path resistance causes localized undervoltage at the VRM input or rail sense point.
- Dynamic drop: fast load steps create inductive voltage spikes/dips in planes and loops before regulation catches up.
- Thermal coupling: rising temperature increases resistive loss and can tighten protection and timing margins.
- Measurement mismatch: telemetry may be filtered or sampled too slowly to reflect the worst microsecond-to-millisecond events.
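The two drop mechanisms above can be bounded with a first-order estimate. The numbers below are illustrative placeholders, not from any specific card:

```python
# First-order estimate of static (IR) vs dynamic (L*di/dt) drop.
# All values are illustrative placeholders -- substitute measured ones.
R_PATH = 0.2e-3    # ohms: connector + plane resistance into the VRM input
L_LOOP = 0.5e-9    # henries: effective loop inductance seen by the step
I_STEP = 400.0     # amps: worst-case load-step magnitude
T_EDGE = 1e-6      # seconds: load-step rise time

static_drop  = I_STEP * R_PATH              # localized IR drop at full current
dynamic_drop = L_LOOP * (I_STEP / T_EDGE)   # L*di/dt droop before regulation responds

print(f"static IR drop: {static_drop * 1e3:.0f} mV")
print(f"dynamic droop:  {dynamic_drop * 1e3:.0f} mV")
```

Even with these modest assumed values, the dynamic term dominates the static one, which is why average-current reasoning misses the failure.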
| Looks OK upstream | Actually decides card stability |
|---|---|
| Input bus voltage average is steady | Local rail droop at the GPU/HBM sense location during worst-case load steps |
| Rack inlet temperature is normal | Hotspot temperature at VRM stages / GPU region and thermal control loop latency |
| Power telemetry shows “reasonable” values | Telemetry bandwidth + filtering: whether the worst transient is captured or hidden |
| Links train most of the time | Temperature-correlated margin loss from placement, reference isolation, and rail/clock coupling on the card |
The diagram highlights the practical failure chain: connector/path impedance + fast load steps + thermal drift can create local droop and margin loss, even when the upstream bus looks “stable.” Telemetry helps only if measurement points and bandwidth match the events.
GPU core vs HBM vs AUX rails: partitioning for stability and measurability
A GPU/AI accelerator card succeeds when its power tree is partitioned into rails with distinct electrical roles. The partitioning goal is not cosmetic naming—it is to keep fast transients from contaminating noise-sensitive islands, and to ensure that measurement points represent the truth locations where decisions are made.
Rail classes (electrical behavior, not just function)
- GPU core rail (transient-dominant): highest current and fastest load steps; droop budget and recovery timing define stability.
- HBM rails (noise/cleanliness-dominant): sensitive to ripple, switching noise coupling, and return-path cleanliness during critical windows.
- AUX rails (threshold/timing-dominant): PLL/SerDes/I/O/retimer/MCU support rails where sequencing and enable thresholds decide bring-up success.
Partitioning rules (actionable checks)
- Noise isolation: keep HBM/PLL/SerDes islands away from high dv/dt nodes; prevent shared return paths that inject switching currents.
- Transient containment: prevent GPU core load steps from pulling down sensitive rails through shared input impedance or poorly placed bulk caps.
- Return-path control: maintain continuous, predictable returns for each island; avoid return currents crossing noisy regions.
- Truth measurement points: define where voltage is “owned” (sense point), how current is interpreted (IMON), and where hotspots are captured (Tsense).
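One low-tech way to enforce the "truth point" rule is to keep the rail-to-sense-point mapping in data, so validation scripts and droop budgets cannot silently diverge. A minimal sketch; rail names and locations are hypothetical, not a real netlist:

```python
# Sketch: record the "truth points" per rail so droop budgets, telemetry
# checks, and protection setpoints all reference the same locations.
RAIL_TRUTH_POINTS = {
    "VCORE":   {"vsense": "GPU-side Kelvin pair", "imon": "VRM controller IMON", "tsense": "hottest phase stage"},
    "VDD_HBM": {"vsense": "HBM island sense pad", "imon": "HBM VRM IMON",        "tsense": "HBM-adjacent diode"},
    "VAUX":    {"vsense": "PLL/retimer island",   "imon": None,                  "tsense": "island diode"},
}

def rails_missing_truth_points(rails):
    """Rails whose droop budget cannot yet be tied to a physical sense point."""
    return sorted(name for name, pts in rails.items() if pts["vsense"] is None)
```

Running the check as part of design review catches rails whose "measured" voltage has no defined physical owner.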
Common symptoms that point to partitioning problems
- Stable input bus but unstable rails: local droop at the rail sense point during burst events.
- Passes one boot, fails the next: AUX sequencing or sensitive-rail noise hitting a short initialization window.
- Performance jitter: hotspot sensor lag or rail noise triggering throttling and protective derating.
The diagram shows three rail classes and the minimum set of “truth points” that keep measurements aligned with decisions: voltage sense for droop budgets, IMON for current interpretation, and Tsense for hotspot control.
Transients, stability, and efficiency under GPU load steps
GPU VRMs are primarily limited by event-driven transients, not average power. The worst case is a fast load step that creates immediate droop through path inductance and finite control response, then risks false protection or link/compute instability if recovery and measurement bandwidth do not match the event.
Why average watts are misleading
- Load steps dominate: microbursts and state transitions create short droop windows that can fail training or trigger resets.
- Two droop mechanisms: immediate droop from loop/plane inductance, then slower droop from finite regulation and capacitor limits.
- Thermal drift matters: hotter stages raise loss and tighten margin, making the same load step fail only at high temperature.
Multiphase design knobs (decision logic + trade-offs)
- Phase count: reduces per-phase stress and ripple, but increases layout complexity and cross-coupling sensitivity.
- Switching frequency: improves response at the cost of loss and heat; lower frequency shifts burden to decoupling and planes.
- Inductor (L): larger L lowers ripple but slows response; smaller L speeds response but raises ripple/noise sensitivity.
- Output capacitor network: high-frequency caps near the load, mid/bulk caps to hold energy; placement and ESL dominate outcomes.
- Stability margin: prioritize repeatable phase margin across temperature and tolerance (verification > theory).
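The inductor and frequency knobs trade off directly in the standard buck relations. A first-order sketch under an assumed operating point (all values illustrative, per-phase):

```python
# Buck-converter first-order trade-off: per-phase ripple vs transient slew.
# Illustrative operating point -- not from a specific design.
VIN, VOUT = 12.0, 0.8      # volts
FSW = 800e3                # switching frequency, Hz
L = 150e-9                 # per-phase inductance, H
N_PHASES = 8

duty = VOUT / VIN
ripple_per_phase = (VIN - VOUT) * duty / (L * FSW)   # A peak-to-peak
slew_per_phase   = (VIN - VOUT) / L                   # A/s during a step (duty saturated high)
slew_total       = slew_per_phase * N_PHASES

print(f"per-phase ripple:   {ripple_per_phase:.1f} App")
print(f"total current slew: {slew_total / 1e6:.0f} A/us")
```

Halving L doubles both the ripple and the available slew, which is exactly the response-versus-noise trade the bullet describes.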
Protection without false trips
- OCP/OTP timing: blanking and debounce must tolerate legitimate load steps while still reacting to true faults.
- Measurement bandwidth: telemetry sampling and filtering can hide peaks; protection should not rely on slow averages.
- Setpoints as budgets: thresholds should map to rail droop and thermal budgets, not arbitrary “safe-looking” numbers.
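The blanking-plus-debounce idea can be expressed as a small qualification routine. This is a conceptual sketch, not any controller's actual algorithm; thresholds and times are placeholders:

```python
# Sketch: OCP qualification with blanking + debounce so legitimate load
# steps don't trip protection. All thresholds/times are illustrative.
OCP_LIMIT_A = 900.0   # trip threshold
BLANK_US    = 5.0     # ignore over-current this long after a known load step
DEBOUNCE_US = 20.0    # over-current must persist this long to count as a fault

def ocp_trips(samples_us, step_events_us):
    """samples_us: list of (t_us, current_A); step_events_us: load-step times."""
    over_since = None
    for t, i in samples_us:
        blanked = any(0 <= t - s < BLANK_US for s in step_events_us)
        if i > OCP_LIMIT_A and not blanked:
            if over_since is None:
                over_since = t
            if t - over_since >= DEBOUNCE_US:
                return True          # sustained over-current: real fault
        else:
            over_since = None        # excursion ended or was blanked
    return False
```

A short spike inside the blanking window passes; a sustained over-current still trips, which is the asymmetry the bullet asks for.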
The waveform illustrates how a load step creates droop and recovery, while protection blanking and telemetry sampling can either tolerate legitimate events or hide the true peak. Design success ties rail budgets to phase count, frequency, inductor choice, capacitor network placement, and verified stability margin across temperature.
Layout, planes, decoupling, and sense: turning VRM budgets into real rails
A high-current GPU rail is decided by the board current loop and the truth measurement points. Even with correct VRM settings, excessive loop inductance, poor return-path control, mis-partitioned decoupling, or a noisy sense route can reshape the droop waveform and cause false protection triggers, training failures, or performance jitter.
Planes, copper, and via arrays (current density + loop inductance)
- Current density hotspots: connector pins, neck-down copper, via fields, and phase-output merge points drive local heating and extra drop.
- Return-path control: keep the high-current loop compact and predictable; avoid forcing return currents to detour across sensitive islands.
- Via fields: treat vias as both R (drop/heat) and L (transient spikes); distribute to reduce bottlenecks and loop area.
Decoupling strategy (HF / MF / bulk roles)
- HF caps near the load: suppress fast spikes where ESL dominates; placement is often more important than capacitance value.
- Mid-frequency network: controls plane resonance and supports control-loop dynamics between VRM output and the load island.
- Bulk caps for energy: support slower events and hold-up; avoid placing bulk behind “inductance walls” that disconnect energy when it is needed.
- ESR/ESL partition: ESL sets the fast edge; ESR provides damping—both affect stability and overshoot.
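The HF/bulk role split falls out of the series R-L-C impedance of each capacitor. A quick sketch with assumed (illustrative) part values shows ESL flipping the ranking at high frequency:

```python
import math

def cap_impedance(f_hz, c_f, esr_ohm, esl_h):
    """Magnitude of a capacitor's series R-L-C impedance at frequency f."""
    w = 2 * math.pi * f_hz
    return math.hypot(esr_ohm, w * esl_h - 1.0 / (w * c_f))

# Illustrative parts: a bulk polymer cap vs an MLCC near the load.
bulk = dict(c_f=470e-6, esr_ohm=6e-3, esl_h=3e-9)
mlcc = dict(c_f=10e-6,  esr_ohm=2e-3, esl_h=0.4e-9)

for f in (10e3, 1e6, 50e6):
    zb = cap_impedance(f, **bulk) * 1e3
    zm = cap_impedance(f, **mlcc) * 1e3
    print(f"{f/1e3:>8.0f} kHz  bulk {zb:7.2f} mohm   MLCC {zm:7.2f} mohm")
```

The bulk part wins at 10 kHz but loses badly at 50 MHz, where its ESL dominates: the two cannot substitute for each other, only complement.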
Remote sense (truth voltage) without oscillation or noise pickup
- Sense point equals budget point: droop/UV limits must map to the location that actually determines GPU stability.
- Kelvin routing: route sense as a tight pair, short, and away from switch nodes and high-current merges to avoid “measuring noise.”
- Cross-domain mistakes: do not let sense return cross noisy islands; ambiguous returns can inject ripple into the feedback path.
Common pitfalls and fastest validations
- “Telemetry looks fine” but droops happen: sampling/filtering misses the valley; verify with probing at the true sense location during a real load step.
- False OCP/UV trips: loop inductance + poor decoupling creates deeper droop than expected; compare VRM output node vs load node waveforms.
- Temperature-only failures: copper/via/connector heating increases path drop; correlate droop depth and hotspot temperature under the same workload.
The figure highlights the two truths that dominate board-level PI: (1) the rail transient is set by the physical current loop and decoupling placement, and (2) sense must represent the load truth point with clean routing—otherwise tuning and protection decisions are based on the wrong voltage.
Noise sensitivity, sequencing, and clock-island isolation (card level)
HBM-related rails behave like a noise-sensitive island. Stability is often decided during a short sensitive window (initialization/training phases) when ripple, coupling, and sequencing thresholds matter more than steady averages. This section treats HBM power and clocking as card-level islands with clear isolation boundaries.
HBM rail priorities (what to protect)
- Ripple and coupling: short spikes and resonance bursts can matter more than steady ripple numbers.
- Return cleanliness: prevent VRM switch currents from sharing the same return corridor as HBM/clock islands.
- Measurement realism: place sense/monitor points inside the HBM island so that “good” readings reflect the sensitive area.
Sequencing rules (without protocol/PHY internals)
- Threshold clarity: define which rails must be valid before enabling dependent rails and clock distribution.
- Ramp-rate discipline: avoid ramps that are too fast (overshoot/noise/false PG) or too slow (timeouts/marginal thresholds).
- Fault containment: treat HBM rails as a fault domain—limit cascading effects into GPU core rail during marginal bring-up.
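Sequencing rules like these are easier to audit when the dependency list is data rather than prose. A minimal sketch; rail names and thresholds are hypothetical:

```python
# Sketch: express rail dependencies as data, then check a captured
# bring-up trace against them. Names/thresholds are illustrative.
SEQUENCE = [
    # (rail, must_be_valid_before_it, PG threshold as fraction of nominal)
    ("VAUX_3V3", None,        0.90),
    ("VCORE",    "VAUX_3V3",  0.92),
    ("VDD_HBM",  "VCORE",     0.92),
    ("CLK_EN",   "VDD_HBM",   None),   # an enable, not a rail
]

def check_order(pg_times_ms):
    """pg_times_ms: {rail: time PG asserted, ms}. Returns list of violations."""
    errors = []
    for rail, prereq, _thr in SEQUENCE:
        if prereq and pg_times_ms[rail] <= pg_times_ms[prereq]:
            errors.append(f"{rail} valid before {prereq}")
    return errors
```

Feeding the same check every captured bring-up trace turns "random" ordering bugs into a deterministic pass/fail.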
Clock island (distribution + isolation on the card)
- Island concept: keep reference/clock distribution and its supporting rails inside a controlled region with predictable return.
- Keep-out near switch nodes: avoid placing clock distribution near high dv/dt VRM areas to reduce coupling risk.
- Card-level verification: check for temperature/workload correlation between training success and island noise/thermal behavior.
Common symptoms and fastest validations
- Intermittent bring-up: succeeds cold but fails warm, or fails only after repeated cycles → suspect sensitive-window noise or sequencing margins.
- Training success jitter: inconsistent success rate under the same procedure → suspect coupling across island boundaries.
- “Looks OK” telemetry: rails appear stable on slow telemetry → verify at HBM island sense points with appropriate bandwidth.
The figure frames HBM rails and reference/clock distribution as controlled islands. A practical “noise keep-out” concept around VRM switch nodes, plus return-path control, helps prevent coupling that can destabilize sensitive-window behavior during initialization and training.
Retimer placement, sideband reliability, and jitter isolation (card level)
On an accelerator card, link stability is a system of three parallel paths—high-speed lanes, reference clock, and sideband control. Retimers are used to preserve margin across connectors, traces, and temperature drift, while reliable sideband signaling and a clean refclock island prevent “works sometimes” training failures.
Why a retimer is needed on the card (practical triggers)
- Insertion loss budget: connector + trace + via transitions can consume margin even when average throughput looks fine.
- Pluggability variance: contact quality and assembly tolerance change the real channel each time the card is inserted.
- Temperature drift: loss, impedance, and noise coupling shift with temperature, shrinking training margin in worst cases.
Placement logic (where it should sit, and why)
- Protect the worst segment: place retimers near the boundary that dominates loss/variance (often the connector/backplane side).
- Shorten the “bad part”: reduce the length of the highest-loss or most variable segment that the host/GPU must train through.
- Keep routing predictable: limit layer changes and uncontrolled via clusters; enforce consistent constraints for lane groups.
Power and refclock isolation (treat retimer as a sensitive island)
- Power island: keep retimer supply away from high dv/dt VRM switch regions; avoid noisy return corridors.
- Refclock island: route reference distribution with controlled return; avoid crossing VRM noise keep-out areas.
- Keep-out concept: define a practical “do not place/route clocks here” zone near switch-node activity.
Sideband reliability (PERST#/CLKREQ# as card wiring problems)
- Do not treat sideband as “easy”: threshold margin and coupling can break training just as effectively as lane margin issues.
- Clear reference return: avoid sideband routes that share noisy return paths or run parallel to high-current switching zones.
- Intermittent training: cold/warm or plug/unplug sensitivity often points to sideband robustness and timing margins.
The diagram emphasizes the practical card-level view: retimers are positioned to protect the highest-variance or highest-loss segment, while refclock and sideband must be routed as first-class signals with clean returns and noise keep-out awareness.
PMBus, IMON, temperature, and power capping: measure-to-control (not measure-to-display)
Card telemetry is only useful when it supports a stable control loop. For power capping and throttling decisions, time alignment, filtering choices, and control latency can matter more than the number of sensors. “Measured accurately” beats “measured everywhere.”
On-card telemetry sources (what exists on a typical accelerator card)
- VRM telemetry: voltage, current, power, and controller temperature (often exposed digitally).
- IMON / current indicators: used for protection and power control—only valuable when the measurement meaning is understood.
- Hotspot temperature: near VRM phases, GPU vicinity, and sensitive islands (e.g., memory/clock regions).
- Event counters (concept): retry/error/training-related indicators with timestamps for correlation.
Sampling, filtering, and timestamps (the difference between correlation and causality)
- Time alignment: power, temperature, and event counters react with different delays; align to avoid wrong root-cause conclusions.
- Filtering is a trade: heavy filtering hides peaks/valleys; weak filtering can cause noisy policy triggers.
- Peak vs average: stability can be decided by short excursions that never appear in slow telemetry.
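A synthetic example makes the peak-vs-average point concrete: a 50 µs valley that would violate a droop budget simply vanishes under a 1 ms boxcar average:

```python
# Demonstration: a 50 us droop valley disappears under heavy averaging.
# Synthetic trace; real telemetry ADCs often average over ms-scale windows.
DT_US = 10
trace = [0.80] * 200                       # 2 ms of 10 us samples, nominal 0.80 V
for k in range(100, 105):                  # 50 us valley down to 0.72 V
    trace[k] = 0.72

def boxcar(samples, n):
    return [sum(samples[i:i + n]) / n for i in range(len(samples) - n + 1)]

true_min     = min(trace)
filtered_min = min(boxcar(trace, 100))     # 1 ms averaging window

print(f"true valley:     {true_min:.3f} V")
print(f"reported valley: {filtered_min:.3f} V")
```

The averaged stream reports roughly 0.796 V for a rail that actually hit 0.72 V: the "worst case" never appears in telemetry at all.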
Power capping and throttle policies (stable loop over aggressive loop)
- Policy inputs: V/I/P/T plus event indicators (concept) must be trustworthy and time-consistent.
- Policy outputs: capping/throttle acts on workload behavior; delay can create hunting and oscillation.
- Hysteresis/windows: prevent rapid toggling when measurements are noisy or delayed.
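A hysteresis band plus a hold-off window is the standard anti-hunting structure. The sketch below is conceptual, with illustrative thresholds, not a vendor policy:

```python
# Sketch of a power-cap policy with hysteresis and a hold-off window to
# avoid hunting. Thresholds and timing are illustrative placeholders.
CAP_HIGH_W = 700.0   # engage throttle above this
CAP_LOW_W  = 650.0   # release only below this (hysteresis band)
HOLD_OFF_S = 0.5     # minimum time between policy changes

class CapPolicy:
    def __init__(self):
        self.throttled = False
        self.last_change = -1e9

    def update(self, t_s, power_w):
        if t_s - self.last_change < HOLD_OFF_S:
            return self.throttled                # inside hold-off: no change
        if not self.throttled and power_w > CAP_HIGH_W:
            self.throttled, self.last_change = True, t_s
        elif self.throttled and power_w < CAP_LOW_W:
            self.throttled, self.last_change = False, t_s
        return self.throttled
```

Noise inside the 650-700 W band never toggles the policy, and the hold-off caps the maximum toggle rate even when measurements are delayed.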
Common misjudgments and fastest validations
- Power looks under limit but throttle triggers: hotspot temperature or delayed measurements may be driving policy.
- Telemetry stable but intermittent faults: peaks/valleys are missed by sampling; verify at the sensitive measurement point.
- Throttle thrashing: control latency + filtering can cause oscillation; adjust windows/hysteresis conceptually and re-check.
The closed-loop view prevents a common failure mode: using slow, heavily filtered telemetry for fast decisions. Align time across sensors/events, choose filters that preserve critical excursions, and tune policy windows to avoid hunting.
Sequencing windows, inrush droop, and how to prevent “random” bring-up failures
Accelerator cards often fail intermittently not because steady-state power is insufficient, but because critical enable and training windows overlap with ramp transients, inrush-induced droop, or false PG/FAULT triggers. A card-level sequencing plan treats rails, enables, and protection thresholds as one timed system.
Conceptual power-up order (why order changes stability)
- AUX → control domain: bring up monitoring and decision logic first so PG/FAULT decisions are trustworthy.
- Main rails → HBM rails: ramp high-current domains before noise-sensitive islands finalize initialization.
- High-speed link enable last: allow refclock and rails to settle before enabling link training or retimer/link islands.
Inrush and soft-start (keeping droop out of sensitive windows)
- Inrush source: large bulk capacitance and multiple domains charging simultaneously during ramp.
- Droop chain: input sag → VRM input dip → output droop → PG jitter → unstable enable/training outcomes.
- Soft-start concept: slope limiting and domain staggering reduce the deepest droop and spread current demand over time.
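For a linear soft-start ramp, inrush per domain is roughly I = C·dV/dt, which makes the benefit of staggering easy to estimate. Capacitances, voltages, and ramp times below are illustrative:

```python
# First-order inrush estimate: charging bulk capacitance at a controlled
# soft-start slope, with and without domain staggering. Values illustrative.
DOMAINS = {          # domain: (bulk capacitance F, nominal V, ramp time s)
    "core": (4000e-6, 0.8, 2e-3),
    "hbm":  (1500e-6, 1.1, 2e-3),
    "aux":  ( 500e-6, 3.3, 2e-3),
}

def inrush_a(c_f, v, t_ramp):
    return c_f * v / t_ramp          # I = C * dV/dt for a linear ramp

simultaneous = sum(inrush_a(*d) for d in DOMAINS.values())
staggered    = max(inrush_a(*d) for d in DOMAINS.values())

print(f"all domains at once:             {simultaneous:.2f} A")
print(f"staggered (worst single domain): {staggered:.2f} A")
```

Staggering cuts the peak demand from the sum of all domains to the worst single domain, which is exactly what keeps the droop chain out of sensitive windows.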
False trigger windows (PG/FAULT must match ramp reality)
- Sense vs truth: PG may observe a point that does not represent the load’s true worst-case droop location.
- Sampling mismatch: overly fast detection can misread noise as failure; overly slow detection can miss valleys.
- Window overlap: enabling links or memory during ramp transients turns deterministic behavior into probability.
Fault containment (a bad rail should not take down the entire card)
- Partition by domain: separate control/AUX, core power, HBM power, and link/clock islands conceptually and electrically.
- Branch protection (concept): local current limiting, e-fuse/fuse concepts, and controlled shutoff prevent cascade failures.
- Recordable failures: containment should preserve logs (PG/FAULT timeline) instead of causing total blackouts.
The timing diagram highlights the key card-level idea: avoid enabling sensitive functions during ramp transients, and ensure PG/FAULT windowing reflects real ramp behavior instead of reacting to short-lived droop or noise.
What proves an accelerator card is solid: bring-up, production, and field reproducibility
“It boots once” is not readiness. A production-ready accelerator card is proven by repeatable results across a validation matrix (temperature × load × scenario) with clear pass criteria and preserved evidence (waveforms, events, and timestamps) that enable fast root-cause analysis and field reproducibility.
Three validation layers (same product, different goals)
- Engineering bring-up: expose structural weaknesses (ramp windows, droop, thermal hotspots, training robustness).
- Production screening: repeatable coverage at scale (high signal, low ambiguity; minimal manual interpretation).
- Field/RMA reproduction: recreate symptoms with controlled conditions and aligned evidence (time + events + environment).
Must-test categories (what typically separates “works” from “robust”)
- Load-step integrity: droop depth/duration and recovery stability under bursts.
- Ripple/noise checks: critical rails during sensitive windows (startup, enable, high-load transitions).
- Thermal robustness: hotspot behavior across airflow and temperature corners.
- Link training success rate: repeatability across temperature, insertion variance, and workload states.
- Soak/burn-in: long-run stability and intermittent event/error trends.
- Policy consistency: power capping/throttle behavior that is stable (no hunting) and predictable across corners.
Evidence to preserve (so failures can be explained, not guessed)
- Waveforms: startup/ramp, load-step, and critical rail windows captured at meaningful sense points.
- Events: PG/FAULT transitions, throttle triggers, and training-related indicators (concept) with timestamps.
- Context: temperature corner, airflow configuration, load state, and insertion condition for reproducibility.
Pass criteria principles (define “good” in a measurable way)
- Power integrity: droop and ripple stay within the card’s budget for sensitive windows.
- Thermal: hotspots do not cross derating thresholds under defined conditions.
- Link: training success remains consistently high across corners, not only under room/idle conditions.
- Control: capping/throttle does not oscillate; policy windows and telemetry alignment prevent thrashing.
The matrix approach prevents “room-temperature optimism.” Each cell combines scenario tags (training/full/step) under temperature and load corners, while pass criteria stay explicit and comparable across revisions and production lots.
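Enumerating the matrix explicitly keeps coverage honest: every cell exists before testing starts and must end with a recorded result. A sketch with illustrative corners:

```python
# Sketch: enumerate the validation matrix (temperature x load x scenario)
# so every cell gets an explicit pass/fail record. Corners are illustrative.
from itertools import product

TEMPS_C   = [0, 25, 45, 70]
LOADS     = ["idle", "50%", "100%", "step-burst"]
SCENARIOS = ["training", "full", "step"]

matrix = [
    {"temp_c": t, "load": l, "scenario": s, "result": None}
    for t, l, s in product(TEMPS_C, LOADS, SCENARIOS)
]
print(f"{len(matrix)} cells to cover")   # 4 * 4 * 3 = 48
```

Any cell still holding `None` at release time is a named, visible coverage gap rather than "room-temperature optimism."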
Field Debug Playbook: Symptoms → Likely Causes → First Measurements
This section turns common GPU/AI accelerator card failures into a repeatable “first-hour” workflow. It stays card-level: rail behavior, telemetry timing, thermal contact/airflow, and link islands (PCIe/NVLink).
Common Setup (Do this before chasing any symptom)
The fastest debug comes from consistent capture windows, consistent probe points, and a minimal “evidence pack”.
1) Align time (events must be comparable)
- Record PG/FAULT edge order and the exact moment the symptom happens (reset/black screen/drop/throttle).
- Prefer logs with timestamps (or monotonic counters) so “cause vs effect” is not guessed.
- If telemetry is averaged, note the averaging and conversion time—slow telemetry hides fast droops.
2) Define the three windows
- Power-up window: sequencing + inrush + first enabling of link islands.
- Training window: PCIe/NVLink enable + refclk stability + island rails settling.
- Burst window: worst di/dt load-step and peak current events.
3) Probe the “truth point” (avoid false comfort)
- For Vcore/HBM, prioritize the load-side sense region (near the GPU/HBM decoupling field), not only at the controller.
- Keep ground reference short (spring ground / nearby return) to avoid seeing ringing that is purely measurement artifact.
- For link islands (retimer/refclk), probe the island rails and refclk domain separately from the main power plane.
Evidence pack (save these every time)
- Two waveforms: Vcore (or main rail) + one AUX/island rail, captured over the relevant window.
- Telemetry snapshot: V/I/P/T + any PG/FAULT bits around the event.
- Thermal snapshot: hotspot temps and “contact/airflow state” (blocked airflow / cold-plate contact suspicion / fan curve state).
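Merging telemetry samples and PG/FAULT events onto one sorted timeline turns "what moved first" into a lookup instead of a guess. A minimal sketch with fabricated sample data:

```python
# Sketch: merge telemetry samples and PG/FAULT events onto one timeline so
# cause-vs-effect is read off the data. All values below are fabricated.
telemetry = [(0.000, "vcore", 0.80), (0.010, "vcore", 0.74), (0.020, "vcore", 0.80)]
events    = [(0.012, "PG_VCORE", "deassert"), (0.013, "FAULT", "assert")]

timeline = sorted(
    [(t, "telemetry", name, val) for t, name, val in telemetry] +
    [(t, "event", name, val) for t, name, val in events]
)
for entry in timeline:
    print(entry)
```

In this fabricated trace the Vcore dip at t=0.010 precedes the PG deassert at t=0.012, pointing at the power domain first.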
Symptom A — Random reboot / black screen (often “looks fine” until it doesn’t)
Treat this as a timing problem first: find what dropped (PG/rail/link) and in which window (power-up, training, burst).
Likely causes (grouped by domain)
- Power: fast Vdroop → UV/PG drop; OCP/OTP hit or false hit; island rail dip triggers a cascade.
- Thermal: localized hotspot trips protection; cold-plate contact/airflow creates a “hidden” hotspot.
- Link/Control: link island instability amplifies into a system-level fault (often temperature dependent).
First 3 checks (the “three-piece kit”)
- VRM telemetry: check PG/FAULT ordering and whether current/temperature flags rise before the event.
- Rail waveform: capture Vcore + one AUX/island rail during the burst window; look for a short valley (droop) and recovery ringing.
- Hotspot state: compare hotspot gradient vs average; verify airflow/cold-plate contact consistency.
Fast isolation rules (don’t guess)
- If PG drops first → power domain likely (droop/UV/OCP blanking mismatch).
- If hotspot spikes first → thermal/contact/airflow likely.
- If rails look stable but link errors climb first → link island power/refclk/sideband integrity likely.
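The three isolation rules reduce to "which domain's indicator moved first." A sketch with hypothetical event names:

```python
# Sketch of the isolation rule as code: classify by which indicator moved
# first relative to the failure. Event names/domains are illustrative.
DOMAIN_OF = {
    "PG_DROP": "power", "UV": "power", "OCP": "power",
    "HOTSPOT_SPIKE": "thermal", "OTP": "thermal",
    "LINK_ERR": "link", "RETRAIN": "link",
}

def first_domain(events):
    """events: list of (t_s, name); returns the domain that moved first."""
    for _t, name in sorted(events):
        if name in DOMAIN_OF:
            return DOMAIN_OF[name]
    return "unknown"
```

Applied to the aligned timeline from the common setup, the classification is deterministic instead of an argument between teams.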
Parts to reference (examples, with concrete OPNs)
- VR controller: Infineon XDPE192C3B-0000 / Infineon XDPE132G5H-G000
- Power stage / DrMOS: Infineon TDA21490AUMA1, Renesas ISL99390BR5935
- Power monitor: TI INA229 / TI INA238; eFuse (card rail): TI TPS25982
Symptom B — PCIe/NVLink training failure or intermittent link drop
Link failures that correlate with temperature or insertion often originate from island rails, refclk cleanliness, or sideband robustness—at the card level.
Likely causes (card-level)
- Channel margin: connector/trace loss + temperature drift shrinks margin.
- Retimer island: island rail noise or refclk contamination makes training fragile.
- Sideband integrity: PERST#/CLKREQ# (and similar) behaves unreliably under ground bounce/noise.
First 3 checks
- Telemetry timing: correlate drop events with island rail voltage/current and temperature.
- Island rail waveform: capture the retimer/refclk rail during the training window; look for ripple bursts or dips.
- Thermal correlation: compare cold vs hot; training that fails only after warm-up is a strong hint.
Fast isolation
- Hot-only drops → margin/thermal drift dominates; focus on island rail stability + refclk cleanliness.
- Insertion-dependent → connector/contact variability; confirm with repeated seat/unseat and capture success rate.
- Training-window-only → sequencing/island enable timing; check whether the island rail is fully settled before enable.
Parts to reference (examples)
- PCIe Retimer (example OPN): Astera Labs PT5162LR / PT5161LRS
- Jitter/clock: Si5341 (jitter attenuator/clock generator class)
- Temp sensing near islands: TI TMP464 (remote diode channels help map hotspots)
Symptom C — Performance swings / throttle oscillation (“power cap hunting”)
Most “mysterious throttling” is a loop problem: measurement latency + filtering + policy thresholds interacting with bursty workloads.
Likely causes
- Telemetry loop: sampling/averaging hides peaks, then policy over-corrects; delays create oscillation.
- Thermal thresholds: hotspot crosses a threshold repeatedly (airflow/contact instability).
- Peak power events: short spikes (not average power) trigger limits.
First 3 checks
- Time-correlate throttle events with V/I/P/T; note telemetry conversion time and averaging.
- Capture burst window: confirm whether short peaks align with policy triggers (even if average is low).
- Hotspot gradient: check whether one region runs away while “board temp” looks normal.
Fast isolation
- If throttle matches hotspot temperature closely → thermal/contact/airflow dominates.
- If throttle matches power spikes → peak management dominates (limit based on peak, not average).
- If throttle lags telemetry heavily → measurement/policy latency dominates.
Parts to reference (examples)
- Power monitors: TI INA229 (power/energy/charge monitoring class), TI INA238
- PMBus fault logging/rail supervision: ADI LTC2977
- VR telemetry source: Infineon XDPE132G5H-G000 / XDPE192C3B-0000 class
Symptom D — Only fails when hot (warm-up dependent)
“Hot-only” failures usually indicate shrinking margin: electrical (loss/noise) or mechanical/thermal (contact/airflow).
What typically changes with temperature
- Channel margin shrinks (loss, impedance drift, connector behavior).
- VRM efficiency/thermal headroom changes; hotspot can cross OTP earlier.
- Refclk cleanliness and island rail ripple sensitivity rises.
First 3 checks
- Compare cold vs hot telemetry (same workload): does I/P rise or does T rise faster than expected?
- Capture a training window waveform hot vs cold on island rails.
- Map hotspot gradient using remote sensing (diode channels) near VRM, HBM area, and retimer area.
Fast isolation
- If hot-only aligns with link drops → focus on island rails + refclk + local cooling.
- If hot-only aligns with PG/FAULT → VRM thermal/protection dynamics likely.
Parts to reference (examples)
- Thermal mapping: TI TMP464
- Retimer island reference: Astera Labs PT5162LR class + refclk conditioning Si5341 class
- VR thermal telemetry: Renesas ISL99390BR5935 (SPS with telemetry) / Infineon TDA21490AUMA1 class
Symptom E — Only fails on cold start (first boot after power-off)
Cold-start failures are often power-up timing and threshold related. Capture “the first attempt” evidence—later retries can hide the root cause.
Likely causes (first-boot specific)
- Sequencing/soft-start window mismatch (AUX/control rails not settled before enable).
- Inrush-induced dip creates a “near-miss” that breaks training only on the first attempt.
- Temperature-dependent thresholds and sensor offsets cause false protection events.
First 3 checks
- Capture the power-up window: rails vs time with PG/FAULT ordering.
- Check for inrush dip and recovery ringing on the main input or main rail path.
- Compare first boot vs second boot: if the second succeeds, the issue is likely window/settling related.
Fast isolation
- If failure disappears after a quick retry → settling/inrush/timing is highly suspicious.
- If only one domain stays marginal (e.g., island rail) → focus on that rail’s soft-start and local decoupling.
Parts to reference (examples)
- eFuse / inrush control (card-side protection): TI TPS25982
- PMBus sequencing + fault logs: ADI LTC2977 class
- VRM telemetry source: Infineon XDPE132G5H-G000 / XDPE192C3B-0000 class
Figure F11 — Symptom → Measurements → Isolation (card-internal first)
Use this fault tree to avoid random probing: start from symptom, run the three-piece kit, then isolate into POWER / THERMAL / LINK+CONTROL domains.
FAQs (Troubleshooting-first, card-level)
Each answer follows a strict card-level playbook: likely causes → first measurements → fix/verify. Part numbers are representative examples to anchor categories (controller / power stage / monitor / eFuse / retimer / jitter-cleaner).
1 Why can “average power” look fine, yet a burst causes droop or reboot?
Diagnosis: the failure mode is usually transient response, not steady-state watts. Fast di/dt can pull the GPU rail below its minimum at the true sense point, while telemetry averages hide the event.
First measurements (10 minutes):
- Capture rail waveform at GPU-side sense and at VRM output; compare droop depth and recovery time.
- Check whether OCP/UVP blanking overlaps the burst window; verify with fault pins/logs.
- Correlate droop with hotspot temperature (VRM + connector + retimer power island).
Fix/verify: reduce PDN impedance where it matters (path + vias + decoupling placement), align sensing to the true load point, and validate with repeatable load-steps across temperature corners.
- Digital multiphase controllers: XDPE192C3B-0000 / XDPE132G5H-G000
- 90A-class power stages: TDA21490AUMA1 / ISL99390BR5935
- Power/energy monitors: INA229 (SPI) / INA238 (I²C)
- Fault logging / sequencing: LTC2977
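The two numbers worth extracting from every captured waveform are droop depth and recovery time; averages hide both. A sketch of that extraction, assuming samples as `(t_us, volts)` pairs and an illustrative ±2% recovery band:

```python
def droop_metrics(samples, v_nom, recovery_band=0.02):
    """Return (droop_depth_v, recovery_time_us) for the worst droop
    in a list of (t_us, volts) samples."""
    v_min = min(v for _, v in samples)
    t_min = next(t for t, v in samples if v == v_min)
    # Recovery: first time after the minimum that the rail re-enters
    # the +/- recovery_band window around nominal.
    lo, hi = v_nom * (1 - recovery_band), v_nom * (1 + recovery_band)
    t_rec = None
    for t, v in samples:
        if t <= t_min:
            continue
        if lo <= v <= hi:
            t_rec = t
            break
    return v_nom - v_min, (t_rec - t_min) if t_rec is not None else None
```

Run the same extraction on GPU-side sense and VRM-output captures; a large delta between the two is the path-inductance signature described above.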
2 Are more VRM phases always better? When can “more phases” hurt stability?
Diagnosis: phase count helps current sharing and ripple, but it also increases control/measurement complexity. Poor interleaving, mismatched current sense, or layout-induced delays can create phase imbalance, ringing, or false protections.
First measurements:
- Measure phase current balance (IMON/inductor DCR sense or stage monitor outputs) under bursts.
- Check rail impedance peaks (scope + step load), not only ripple at steady state.
- Verify compensation margin indirectly: overshoot/undershoot + settling vs temperature and airflow.
Fix/verify: tune as a system (controller settings + power stages + output network + sensing). Validate with the same burst profile at hot and cold boot.
- Multiphase controller family: XDPE132G5H-G000
- Smart power stage with current/temp outputs: ISL99390BR5935
- PMBus manager for repeatable margins/logs: LTC2977
3 Where do remote-sense designs most often go wrong (voltage “looks right” but is wrong)?
Diagnosis: the common failure is a bad Kelvin reference: sense traces pick up switching noise, share return with high current, or cross noisy islands. The controller regulates the “wrong voltage,” masking true droop at the GPU pads.
First measurements:
- Compare sense-reported voltage vs direct probe at the true load pads (short ground spring).
- Look for temperature-dependent offset (sense copper resistance + return path shift).
- Check whether sense filtering introduces delay that worsens burst droop.
Fix/verify: route sense as a quiet differential pair to the correct point, keep it away from switch nodes/retimer clocks, and verify with step loads plus thermal sweep.
- Voltage/current monitors for cross-check: INA238 / INA229
- Board thermal visibility (remote diode): TMP464
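The temperature-dependent offset has a simple back-of-envelope: any resistance the sense return shares with load current injects an IR error that grows with copper's tempco (about 0.393%/°C from 25 °C). A sketch with illustrative numbers:

```python
def sense_offset_mv(shared_return_mohm_25c, load_a, temp_c):
    """IR error (mV) injected when the sense return shares resistance
    with the load current path; copper tempco ~0.393%/degC from 25 degC."""
    r_mohm = shared_return_mohm_25c * (1 + 0.00393 * (temp_c - 25.0))
    return r_mohm * load_a  # mOhm * A = mV
```

Even 0.1 mΩ of shared return at 400 A is tens of millivolts of "regulated to the wrong voltage," and it drifts with temperature, which is exactly the thermal-sweep signature to look for.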
4 Why can heavy decoupling still produce poor load-step performance?
Diagnosis: “more capacitors” does not equal “lower impedance.” ESL and current-loop area dominate at high frequency; bulk caps placed far away help energy but not the first microseconds. A misplaced via field can turn decoupling into an antenna.
First measurements:
- Measure the first droop (fast) vs later sag (slow) to separate ESL vs energy deficit.
- Compare VRM-output ripple to GPU-pad ripple; large delta points to path inductance.
- Check hotspots at vias/planes near the connector and VRM output choke area.
Fix/verify: tighten the high-current loop (planes + via arrays), place HF caps at the true load boundary, and re-validate with identical step profiles.
- Power monitors for waveform-to-telemetry correlation: INA229 / INA238
- PMBus fault logs to tie events to steps: LTC2977
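Separating the fast (ESL/loop-inductance) droop from the slow (bulk-energy) sag can be automated on the same step-load capture. A sketch, assuming `(t_us, volts)` samples and an illustrative 1 µs boundary between "fast" and "slow":

```python
def split_droop(samples, v_nom, fast_window_us=1.0):
    """Return (fast_droop_v, slow_sag_v): worst deviation from nominal
    inside vs beyond the fast window of a step-load capture."""
    fast = [v for t, v in samples if t <= fast_window_us]
    slow = [v for t, v in samples if t > fast_window_us]
    fast_droop = v_nom - min(fast) if fast else 0.0
    slow_sag = v_nom - min(slow) if slow else 0.0
    return fast_droop, slow_sag
```

If the fast term dominates, attack loop inductance and HF cap placement at the load boundary; if the slow term dominates, look at bulk energy and regulator bandwidth instead of adding more capacitors.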
5 How should HBM rail ripple limits be set, and why can “clean-looking” rails still fail training?
Diagnosis: HBM-related failures are often caused by measurement blind spots (insufficient bandwidth, probing artifacts) or noise coupling during the narrow training/initialization window. “Clean” at a far test point can still be noisy at the HBM island.
First measurements:
- Probe at the HBM island boundary (short ground), not only at the regulator output.
- Capture rail noise during training events, not only steady state.
- Check clock island isolation: ref-clock/power share can inject periodic noise.
Fix/verify: define ripple spec at the real load boundary, ensure sequencing aligns to the sensitive window, and re-run training success-rate across temperature corners.
- PMBus logging / sequencing for reproducible windows: LTC2977
- Clock/jitter cleanup (clock island anchor): Si5341
- Rail monitors for correlation: INA238
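Re-running training success-rate across corners only pays off if the pass/fail call is mechanical. A minimal sketch (the 99.9% target is an example, not a spec):

```python
def weak_corners(results, min_rate=0.999):
    """results: {corner_name: (passes, runs)}.
    Return corners whose training success-rate misses the target."""
    return sorted(c for c, (p, n) in results.items() if p / n < min_rate)
```

A corner that fails only cold or only hot narrows the suspect list to sequencing alignment or thermally shifted noise coupling, respectively.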
6 At high temperature the card drops PCIe/NVLink links—what to suspect first: retimer, power, or clock?
Diagnosis: temperature-driven dropouts commonly come from margin collapse: retimer supply droop, ref-clock jitter increase, or connector/trace loss changes. The fastest discriminator is correlation: errors vs retimer-rail noise vs clock stability vs local hot spots.
First measurements:
- Trend link errors against retimer island temperature and retimer supply ripple.
- Confirm ref-clock integrity at the retimer boundary (clean supply + isolation).
- Inspect sideband integrity (PERST#/CLKREQ#) under thermal stress.
Fix/verify: stabilize retimer rails first (local decoupling + clean island), then harden ref-clock distribution, then revisit placement/trace constraints. Validate by thermal ramp + long-run training/error counters.
- PCIe/CXL retimer anchor: PT5162LR
- Clock/jitter cleaner anchor: Si5341
- Temp + power correlation: TMP464 + INA238
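The "correlation as discriminator" idea can be made concrete: trend link errors against each candidate axis and let the strongest correlation pick the first target. A sketch with equal-length sample series (axis names are illustrative):

```python
def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

def dominant_axis(errors, temp_c, ripple_mv, jitter_ps):
    """Which axis tracks the link-error trend most strongly?"""
    axes = {"thermal": temp_c, "retimer_supply": ripple_mv, "ref_clock": jitter_ps}
    return max(axes, key=lambda k: abs(pearson(errors, axes[k])))
```

This is only a triage ranking, not proof of causality; confirm with the thermal ramp plus long-run error counters as described above.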
7 What misjudgments come from slow telemetry sampling, and how should filters/thresholds be set?
Diagnosis: if sampling is slower than the event, telemetry reports the “average after the crash.” Filters can hide real spikes, while delayed interpretation can still trip protections on events that have already passed. Thresholds must match the policy time constants (throttle, retry, fault latch).
First measurements:
- Compare scope waveforms to telemetry time series and quantify lag/aliasing.
- Validate alert response vs burst width (avoid missing droop but prevent false trips).
- Stamp events consistently: voltage droop, current spike, temperature rise, and link errors.
Fix/verify: define “fast protection” vs “slow policy” channels, use hysteresis and rate limits for capping decisions, and re-run the same workload burst set to confirm stability.
- High-precision monitors: INA229 / INA238
- Sequencing + fault logs + telemetry aggregation: LTC2977
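Quantifying the aliasing gap is a one-liner once scope and telemetry data are aligned: compare the true minimum to what a slow sampler would have seen. A sketch with integer-microsecond timestamps for simplicity:

```python
def scope_vs_telemetry_min(samples, telemetry_period_us):
    """samples: (t_us, volts) at full scope rate.
    Return (scope_min, telemetry_min): what really happened vs what a
    sampler reading every telemetry_period_us would report."""
    scope_min = min(v for _, v in samples)
    tele = [v for t, v in samples if t % telemetry_period_us == 0]
    return scope_min, min(tele)
```

A large gap between the two minima is the signature of a "fast protection" event being adjudicated by a "slow policy" channel, which is exactly the split the fix/verify step formalizes.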
8 Why can power capping create performance oscillation, and how to make it smoother?
Diagnosis: capping is a control loop. If measurement + filtering + actuation latency is too large, the policy over-corrects and causes “sawtooth” frequency/power. The fix is usually better timing, hysteresis, and rate limits—not simply tighter limits.
First measurements:
- Plot measured power vs applied throttle decision time; look for phase lag and overshoot.
- Separate GPU rail events from AUX/retimer islands (avoid false global throttles).
- Check that the power estimate matches real rail power during bursts, not only steady state.
Fix/verify: implement smoother control (deadband + slew limits), and verify with repeated burst workloads at fixed ambient and at hot steady state.
- Power monitor for accurate inputs: INA229 / INA238
- Policy timestamps + fault context: LTC2977
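The "deadband + slew limits" fix can be sketched as a single control tick; all gains and limits here are illustrative, not tuned values:

```python
def cap_step(freq_mhz, power_w, cap_w, deadband_w=10.0, slew_mhz=15.0):
    """One capping-loop tick: move frequency toward the cap with a
    deadband (no action near the cap) and a slew limit (bounded step)."""
    err = power_w - cap_w
    if abs(err) <= deadband_w:
        return freq_mhz                       # inside deadband: hold
    step = min(abs(err) * 0.5, slew_mhz)      # proportional, slew-limited
    return freq_mhz - step if err > 0 else freq_mhz + step
```

The deadband prevents hunting around the cap and the slew limit bounds each correction, so a large measured overshoot no longer produces the sawtooth over-correction described above.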
9 Cold-boot training fails intermittently—sequencing or inrush? What is most common?
Diagnosis: many “sequencing bugs” are actually inrush-induced droop. Cold impedance shifts and connector/plane resistance can pull AUX or main rails low during soft-start, causing PG chatter and missing the training window.
First measurements:
- Capture AUX and main rail ramp with PG/FAULT markers; look for sag and bounce.
- Compare cold vs warm start: ramp time, peak inrush, and connector hot spots.
- Check enable ordering: AUX → control → main rails → HBM → high-speed link enable.
Fix/verify: shape inrush (slew control, staged enables) and contain faults per island so a single droop does not reset the entire card. Verify with repeated cold boots and training success-rate.
- Smart eFuse / inrush management anchor: TPS25982
- Sequencing + fault logs: LTC2977
- Power monitors for ramp correlation: INA238
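The enable ordering in the checks above can be enforced as a tiny gate that refuses to advance past a rail whose power-good is not asserted; a sketch with the same rail names used in the checklist:

```python
RAIL_ORDER = ["aux", "control", "main", "hbm", "link"]

def next_enable(pg_status):
    """pg_status: {rail: power_good_bool}. Return the next rail to
    enable, or None when every PG in the chain is up."""
    for rail in RAIL_ORDER:
        if not pg_status.get(rail, False):
            return rail
    return None
```

Gating on PG rather than on fixed delays is what makes staged enables robust to the cold-impedance shifts that stretch ramp times.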
10 OCP/OTP seems to trip—could it be a false trigger? How to verify quickly?
Diagnosis: false triggers often come from the measurement path: noisy current sense, poor temperature sensor placement, or thresholds that ignore burst profiles. Confirm whether the rail actually exceeded limits at the true sense point, then align blanking and thresholds to real events.
First measurements:
- Scope: rail voltage + current proxy during the failing burst; capture fault pin/log timestamp.
- Validate sensor location: does it represent the real hotspot (stage vs inductor vs connector)?
- Compare trip events against thermal ramp and airflow changes.
Fix/verify: improve sense integrity, add filtering where appropriate (without hiding true faults), and re-run corner cases (hot steady state, cold boot, burst loads).
- Power stage with monitors for correlation: ISL99390BR5935 / TDA21490AUMA1
- Power/energy monitors: INA229
- Remote diode temperature sensing: TMP464
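Whether the rail "actually exceeded limits" at the true sense point reduces to one question: did the current stay above the limit longer than the blanking time? A sketch, assuming 1 µs-spaced `(t_us, amps)` samples:

```python
def real_overcurrent(current_samples, limit_a, blank_us):
    """True only if current exceeded limit_a for longer than the
    blanking time; shorter excursions are false-trigger candidates.
    Samples are (t_us, amps) spaced 1 us apart."""
    run = 0
    for _, i in current_samples:
        run = run + 1 if i > limit_a else 0
        if run > blank_us:
            return True
    return False
```

A trip event whose captured current never satisfies this test points back at the measurement path (noisy sense, threshold/blanking mismatch) rather than a genuine fault.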
11 In production, how to screen “marginally stable” cards before customers find them?
Diagnosis: marginal stability usually appears only at corners: temperature extremes, burst workloads, and link training retries. Production screening must combine a validation matrix (temp × load × scenario) with a short set of discriminating signatures (droop, recovery, error counters, and thermal deltas).
First measurements (factory-friendly):
- Run a standardized burst profile and capture min rail voltage + recovery time.
- Check training success-rate and time-to-link under controlled hot/cold soak.
- Archive fault logs and key waveforms as shipment evidence.
Fix/verify: convert “intermittent” into “repeatable” using scripted bursts + thermal steps, and require pass/fail thresholds that match field conditions.
- PMBus manager for sequencing + telemetry + fault logs: LTC2977
- Rail monitors for automated thresholds: INA238
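The validation matrix becomes screenable once every corner's signatures are checked against explicit limits. A sketch (metric names and limits are illustrative placeholders):

```python
def screen(card_results, limits):
    """card_results: {(temp, scenario): {metric: value}};
    limits: {metric: max_allowed}. Return failing (corner, metric) pairs."""
    fails = []
    for corner, metrics in card_results.items():
        for name, value in metrics.items():
            if name in limits and value > limits[name]:
                fails.append((corner, name))
    return fails
```

The output doubles as the shipment evidence called for above: an empty list is the pass record, and any entries name exactly which corner and which signature missed.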
12 Field blackout/reboot with a 10-minute window—what is the most effective measurement set?
Diagnosis: the goal is fast isolation inside the card: power transient vs thermal throttle vs link instability. The best set is a triad that can run in parallel: VRM telemetry snapshot, rail waveforms at two points, and hotspot temperature/airflow evidence.
First measurements (triage kit):
- Read: V/I/P/T + last fault reason from VRM/PMBus logs.
- Scope: GPU-side rail + VRM output simultaneously; mark the failure instant.
- Thermals: retimer island + VRM hotspot + connector temperature; confirm cooling contact/flow.
Fix/verify: once the dominant axis is identified, rerun the same workload with one controlled change at a time (fan curve, capping policy, retimer island supply) to confirm causality.
- Telemetry monitors: INA229 / INA238
- Remote diode temperature sensor: TMP464
- Sequencing/logs for rapid evidence: LTC2977
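The triad can feed a first-cut classifier so the 10-minute window ends with a named axis rather than raw data. A sketch only; the thresholds (30 mV sense-to-VRM delta, thermal limit) are illustrative and must come from the card's own budgets:

```python
def triage(vrm_fault, droop_gpu_mv, droop_vrm_mv, hotspot_c, limit_c, link_errors):
    """Pick the dominant card-internal axis from one triage capture."""
    if hotspot_c >= limit_c:
        return "thermal"            # hotspot at/over limit: cooling first
    if vrm_fault or (droop_gpu_mv - droop_vrm_mv) > 30:
        return "power"              # logged fault or large path delta
    if link_errors > 0:
        return "link"               # errors without power/thermal evidence
    return "inconclusive"           # extend capture, do not guess
```

An "inconclusive" result is a valid outcome: it means the capture window missed the event and should be extended, not that the card is healthy.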