GPU / AI Accelerator Card (Power, HBM, Retimers & Telemetry)
A GPU/AI accelerator card is won or lost at the board level: high-current VRMs must survive brutal transients, HBM rails must stay quiet during tight training windows, and PCIe/NVLink links must keep margin across temperature.
This page focuses on practical, measurable card-level integration—power tree partitioning, PDN/layout/sense, sequencing/inrush, retimer/clock placement, and telemetry loops—so issues become repeatable, diagnosable, and production-screenable.
Card-level engineering (not chip-level internals)
What this page solves (card-level “three pillars”)
- Power integrity: high-current VRMs and fast load steps where droop, noise, and protection timing decide stability (not average watts).
- Thermal reality: hotspot-driven behavior where sensor placement + control latency can cause throttling jitter and temperature-correlated failures.
- High-speed hooks: PCIe/NVLink integration on the card—placement, isolation, and verification of retimers/clock islands (no deep silicon theory).
Definition of “done” (what proves the card is solid)
- Transient-safe rails: droop and recovery remain within budget during worst-case load steps; protections do not false-trigger.
- Thermal-stable performance: sustained throughput without oscillating throttles; failures do not appear only at high temperature.
- Repeatable validation: a small, deterministic set of waveforms + telemetry fields can reproduce and isolate issues across units.
Explicit out-of-scope (link to sibling pages)
- Retimer internals: equalization/CDR/algorithms → PCIe Switch / Retimer page.
- Rack power system design: PSU paralleling, 48V distribution/hot-swap theory → CRPS / 48V Bus & Hot-Swap pages.
- Telemetry platforms: cross-node aggregation/anomaly detection architecture → In-band Telemetry & Power Log page.
The card is treated as a product that must deliver stable rails, controlled hotspots, and reliable links under real workloads. Deep silicon design and rack power theory are intentionally excluded to keep the page actionable and non-overlapping.
From upstream power to GPU rails: what the card must “digest”
Upstream power can look stable while the card fails because the limiting factors live in the end-to-end path: connector contact resistance, plane/busbar impedance, VRM transient behavior, and temperature-driven drift. A correct mental model separates where voltage is measured from where voltage matters.
Upstream constraints (treated as inputs, not a design topic)
- Input source: 12V or 48V presence is only a label; what matters is allowed droop, ripple, and transient capability at the card connector.
- Delivery path: slot + auxiliary connectors define current density, heating, and allowable voltage drop before VRMs can even respond.
- Environment: temperature and airflow/cold-plate contact shift losses and margins over time.
Why “upstream looks fine” but the card still drops
- Static drop: current through path resistance causes localized undervoltage at the VRM input or rail sense point.
- Dynamic drop: fast load steps create inductive voltage spikes/dips in planes and loops before regulation catches up.
- Thermal coupling: rising temperature increases resistive loss and can tighten protection and timing margins.
- Measurement mismatch: telemetry may be filtered or sampled too slowly to reflect the worst microsecond-to-millisecond events.
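The two drop mechanisms above can be bounded with a first-order estimate. The numbers below are illustrative placeholders, not from any specific card:

```python
# First-order estimate of static (IR) vs dynamic (L*di/dt) drop.
# All values are illustrative placeholders -- substitute measured ones.
R_PATH = 0.2e-3    # ohms: connector + plane resistance into the VRM input
L_LOOP = 0.5e-9    # henries: effective loop inductance seen by the step
I_STEP = 400.0     # amps: worst-case load-step magnitude
T_EDGE = 1e-6      # seconds: load-step rise time

static_drop  = I_STEP * R_PATH              # localized IR drop at full current
dynamic_drop = L_LOOP * (I_STEP / T_EDGE)   # L*di/dt droop before regulation responds

print(f"static IR drop: {static_drop * 1e3:.0f} mV")
print(f"dynamic droop:  {dynamic_drop * 1e3:.0f} mV")
```

Even with these modest assumed values, the dynamic term dominates the static one, which is why average-current reasoning misses the failure.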
| Looks OK upstream | Actually decides card stability |
|---|---|
| Input bus voltage average is steady | Local rail droop at the GPU/HBM sense location during worst-case load steps |
| Rack inlet temperature is normal | Hotspot temperature at VRM stages / GPU region and thermal control loop latency |
| Power telemetry shows “reasonable” values | Telemetry bandwidth + filtering: whether the worst transient is captured or hidden |
| Links train most of the time | Temperature-correlated margin loss from placement, reference isolation, and rail/clock coupling on the card |
The diagram highlights the practical failure chain: connector/path impedance + fast load steps + thermal drift can create local droop and margin loss, even when the upstream bus looks “stable.” Telemetry helps only if measurement points and bandwidth match the events.
GPU core vs HBM vs AUX rails: partitioning for stability and measurability
A GPU/AI accelerator card succeeds when its power tree is partitioned into rails with distinct electrical roles. The partitioning goal is not cosmetic naming—it is to keep fast transients from contaminating noise-sensitive islands, and to ensure that measurement points represent the truth locations where decisions are made.
Rail classes (electrical behavior, not just function)
- GPU core rail (transient-dominant): highest current and fastest load steps; droop budget and recovery timing define stability.
- HBM rails (noise/cleanliness-dominant): sensitive to ripple, switching noise coupling, and return-path cleanliness during critical windows.
- AUX rails (threshold/timing-dominant): PLL/SerDes/I/O/retimer/MCU support rails where sequencing and enable thresholds decide bring-up success.
Partitioning rules (actionable checks)
- Noise isolation: keep HBM/PLL/SerDes islands away from high dv/dt nodes; prevent shared return paths that inject switching currents.
- Transient containment: prevent GPU core load steps from pulling down sensitive rails through shared input impedance or poorly placed bulk caps.
- Return-path control: maintain continuous, predictable returns for each island; avoid return currents crossing noisy regions.
- Truth measurement points: define where voltage is “owned” (sense point), how current is interpreted (IMON), and where hotspots are captured (Tsense).
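One low-tech way to enforce the "truth point" rule is to keep the rail-to-sense-point mapping in data, so validation scripts and droop budgets cannot silently diverge. A minimal sketch; rail names and locations are hypothetical, not a real netlist:

```python
# Sketch: record the "truth points" per rail so droop budgets, telemetry
# checks, and protection setpoints all reference the same locations.
RAIL_TRUTH_POINTS = {
    "VCORE":   {"vsense": "GPU-side Kelvin pair", "imon": "VRM controller IMON", "tsense": "hottest phase stage"},
    "VDD_HBM": {"vsense": "HBM island sense pad", "imon": "HBM VRM IMON",        "tsense": "HBM-adjacent diode"},
    "VAUX":    {"vsense": "PLL/retimer island",   "imon": None,                  "tsense": "island diode"},
}

def rails_missing_truth_points(rails):
    """Rails whose droop budget cannot yet be tied to a physical sense point."""
    return sorted(name for name, pts in rails.items() if pts["vsense"] is None)
```

Running the check as part of design review catches rails whose "measured" voltage has no defined physical owner.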
Common symptoms that point to partitioning problems
- Stable input bus but unstable rails: local droop at the rail sense point during burst events.
- Passes one boot, fails the next: AUX sequencing or sensitive-rail noise hitting a short initialization window.
- Performance jitter: hotspot sensor lag or rail noise triggering throttling and protective derating.
The diagram shows three rail classes and the minimum set of “truth points” that keep measurements aligned with decisions: voltage sense for droop budgets, IMON for current interpretation, and Tsense for hotspot control.
Transients, stability, and efficiency under GPU load steps
GPU VRMs are primarily limited by event-driven transients, not average power. The worst case is a fast load step that creates immediate droop through path inductance and finite control response, then risks false protection or link/compute instability if recovery and measurement bandwidth do not match the event.
Why average watts are misleading
- Load steps dominate: microbursts and state transitions create short droop windows that can fail training or trigger resets.
- Two droop mechanisms: immediate droop from loop/plane inductance, then slower droop from finite regulation and capacitor limits.
- Thermal drift matters: hotter stages raise loss and tighten margin, making the same load step fail only at high temperature.
Multiphase design knobs (decision logic + trade-offs)
- Phase count: reduces per-phase stress and ripple, but increases layout complexity and cross-coupling sensitivity.
- Switching frequency: improves response at the cost of loss and heat; lower frequency shifts burden to decoupling and planes.
- Inductor (L): larger L lowers ripple but slows response; smaller L speeds response but raises ripple/noise sensitivity.
- Output capacitor network: high-frequency caps near the load, mid/bulk caps to hold energy; placement and ESL dominate outcomes.
- Stability margin: prioritize repeatable phase margin across temperature and tolerance (verification > theory).
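The inductor and frequency knobs trade off directly in the standard buck relations. A first-order sketch under an assumed operating point (all values illustrative, per-phase):

```python
# Buck-converter first-order trade-off: per-phase ripple vs transient slew.
# Illustrative operating point -- not from a specific design.
VIN, VOUT = 12.0, 0.8      # volts
FSW = 800e3                # switching frequency, Hz
L = 150e-9                 # per-phase inductance, H
N_PHASES = 8

duty = VOUT / VIN
ripple_per_phase = (VIN - VOUT) * duty / (L * FSW)   # A peak-to-peak
slew_per_phase   = (VIN - VOUT) / L                   # A/s during a step (duty saturated high)
slew_total       = slew_per_phase * N_PHASES

print(f"per-phase ripple:   {ripple_per_phase:.1f} App")
print(f"total current slew: {slew_total / 1e6:.0f} A/us")
```

Halving L doubles both the ripple and the available slew, which is exactly the response-versus-noise trade the bullet describes.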
Protection without false trips
- OCP/OTP timing: blanking and debounce must tolerate legitimate load steps while still reacting to true faults.
- Measurement bandwidth: telemetry sampling and filtering can hide peaks; protection should not rely on slow averages.
- Setpoints as budgets: thresholds should map to rail droop and thermal budgets, not arbitrary “safe-looking” numbers.
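The blanking-plus-debounce idea can be expressed as a small qualification routine. This is a conceptual sketch, not any controller's actual algorithm; thresholds and times are placeholders:

```python
# Sketch: OCP qualification with blanking + debounce so legitimate load
# steps don't trip protection. All thresholds/times are illustrative.
OCP_LIMIT_A = 900.0   # trip threshold
BLANK_US    = 5.0     # ignore over-current this long after a known load step
DEBOUNCE_US = 20.0    # over-current must persist this long to count as a fault

def ocp_trips(samples_us, step_events_us):
    """samples_us: list of (t_us, current_A); step_events_us: load-step times."""
    over_since = None
    for t, i in samples_us:
        blanked = any(0 <= t - s < BLANK_US for s in step_events_us)
        if i > OCP_LIMIT_A and not blanked:
            if over_since is None:
                over_since = t
            if t - over_since >= DEBOUNCE_US:
                return True          # sustained over-current: real fault
        else:
            over_since = None        # excursion ended or was blanked
    return False
```

A short spike inside the blanking window passes; a sustained over-current still trips, which is the asymmetry the bullet asks for.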
The waveform illustrates how a load step creates droop and recovery, while protection blanking and telemetry sampling can either tolerate legitimate events or hide the true peak. Design success ties rail budgets to phase count, frequency, inductor choice, capacitor network placement, and verified stability margin across temperature.
Layout, planes, decoupling, and sense: turning VRM budgets into real rails
A high-current GPU rail is decided by the board current loop and the truth measurement points. Even with correct VRM settings, excessive loop inductance, poor return-path control, mis-partitioned decoupling, or a noisy sense route can reshape the droop waveform and cause false protection triggers, training failures, or performance jitter.
Planes, copper, and via arrays (current density + loop inductance)
- Current density hotspots: connector pins, neck-down copper, via fields, and phase-output merge points drive local heating and extra drop.
- Return-path control: keep the high-current loop compact and predictable; avoid forcing return currents to detour across sensitive islands.
- Via fields: treat vias as both R (drop/heat) and L (transient spikes); distribute to reduce bottlenecks and loop area.
Decoupling strategy (HF / MF / bulk roles)
- HF caps near the load: suppress fast spikes where ESL dominates; placement is often more important than capacitance value.
- Mid-frequency network: controls plane resonance and supports control-loop dynamics between VRM output and the load island.
- Bulk caps for energy: support slower events and hold-up; avoid placing bulk behind “inductance walls” that disconnect energy when it is needed.
- ESR/ESL partition: ESL sets the fast edge; ESR provides damping—both affect stability and overshoot.
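The HF/bulk role split falls out of the series R-L-C impedance of each capacitor. A quick sketch with assumed (illustrative) part values shows ESL flipping the ranking at high frequency:

```python
import math

def cap_impedance(f_hz, c_f, esr_ohm, esl_h):
    """Magnitude of a capacitor's series R-L-C impedance at frequency f."""
    w = 2 * math.pi * f_hz
    return math.hypot(esr_ohm, w * esl_h - 1.0 / (w * c_f))

# Illustrative parts: a bulk polymer cap vs an MLCC near the load.
bulk = dict(c_f=470e-6, esr_ohm=6e-3, esl_h=3e-9)
mlcc = dict(c_f=10e-6,  esr_ohm=2e-3, esl_h=0.4e-9)

for f in (10e3, 1e6, 50e6):
    zb = cap_impedance(f, **bulk) * 1e3
    zm = cap_impedance(f, **mlcc) * 1e3
    print(f"{f/1e3:>8.0f} kHz  bulk {zb:7.2f} mohm   MLCC {zm:7.2f} mohm")
```

The bulk part wins at 10 kHz but loses badly at 50 MHz, where its ESL dominates: the two cannot substitute for each other, only complement.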
Remote sense (truth voltage) without oscillation or noise pickup
- Sense point equals budget point: droop/UV limits must map to the location that actually determines GPU stability.
- Kelvin routing: route sense as a tight pair, short, and away from switch nodes and high-current merges to avoid “measuring noise.”
- Cross-domain mistakes: do not let sense return cross noisy islands; ambiguous returns can inject ripple into the feedback path.
Common pitfalls and fastest validations
- “Telemetry looks fine” but droops happen: sampling/filtering misses the valley; verify with probing at the true sense location during a real load step.
- False OCP/UV trips: loop inductance + poor decoupling creates deeper droop than expected; compare VRM output node vs load node waveforms.
- Temperature-only failures: copper/via/connector heating increases path drop; correlate droop depth and hotspot temperature under the same workload.
The figure highlights the two truths that dominate board-level PI: (1) the rail transient is set by the physical current loop and decoupling placement, and (2) sense must represent the load truth point with clean routing—otherwise tuning and protection decisions are based on the wrong voltage.
Noise sensitivity, sequencing, and clock-island isolation (card level)
HBM-related rails behave like a noise-sensitive island. Stability is often decided during a short sensitive window (initialization/training phases) when ripple, coupling, and sequencing thresholds matter more than steady averages. This section treats HBM power and clocking as card-level islands with clear isolation boundaries.
HBM rail priorities (what to protect)
- Ripple and coupling: short spikes and resonance bursts can matter more than steady ripple numbers.
- Return cleanliness: prevent VRM switch currents from sharing the same return corridor as HBM/clock islands.
- Measurement realism: place sense/monitor points inside the HBM island so that “good” readings reflect the sensitive area.
Sequencing rules (without protocol/PHY internals)
- Threshold clarity: define which rails must be valid before enabling dependent rails and clock distribution.
- Ramp-rate discipline: avoid ramps that are too fast (overshoot/noise/false PG) or too slow (timeouts/marginal thresholds).
- Fault containment: treat HBM rails as a fault domain—limit cascading effects into GPU core rail during marginal bring-up.
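Sequencing rules like these are easier to audit when the dependency list is data rather than prose. A minimal sketch; rail names and thresholds are hypothetical:

```python
# Sketch: express rail dependencies as data, then check a captured
# bring-up trace against them. Names/thresholds are illustrative.
SEQUENCE = [
    # (rail, must_be_valid_before_it, PG threshold as fraction of nominal)
    ("VAUX_3V3", None,        0.90),
    ("VCORE",    "VAUX_3V3",  0.92),
    ("VDD_HBM",  "VCORE",     0.92),
    ("CLK_EN",   "VDD_HBM",   None),   # an enable, not a rail
]

def check_order(pg_times_ms):
    """pg_times_ms: {rail: time PG asserted, ms}. Returns list of violations."""
    errors = []
    for rail, prereq, _thr in SEQUENCE:
        if prereq and pg_times_ms[rail] <= pg_times_ms[prereq]:
            errors.append(f"{rail} valid before {prereq}")
    return errors
```

Feeding the same check every captured bring-up trace turns "random" ordering bugs into a deterministic pass/fail.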
Clock island (distribution + isolation on the card)
- Island concept: keep reference/clock distribution and its supporting rails inside a controlled region with predictable return.
- Keep-out near switch nodes: avoid placing clock distribution near high dv/dt VRM areas to reduce coupling risk.
- Card-level verification: check for temperature/workload correlation between training success and island noise/thermal behavior.
Common symptoms and fastest validations
- Intermittent bring-up: succeeds cold but fails warm, or fails only after repeated cycles → suspect sensitive-window noise or sequencing margins.
- Training success jitter: inconsistent success rate under the same procedure → suspect coupling across island boundaries.
- “Looks OK” telemetry: rails appear stable on slow telemetry → verify at HBM island sense points with appropriate bandwidth.
The figure frames HBM rails and reference/clock distribution as controlled islands. A practical “noise keep-out” concept around VRM switch nodes, plus return-path control, helps prevent coupling that can destabilize sensitive-window behavior during initialization and training.
Retimer placement, sideband reliability, and jitter isolation (card level)
On an accelerator card, link stability is a system of three parallel paths—high-speed lanes, reference clock, and sideband control. Retimers are used to preserve margin across connectors, traces, and temperature drift, while reliable sideband signaling and a clean refclock island prevent “works sometimes” training failures.
Why a retimer is needed on the card (practical triggers)
- Insertion loss budget: connector + trace + via transitions can consume margin even when average throughput looks fine.
- Pluggability variance: contact quality and assembly tolerance change the real channel each time the card is inserted.
- Temperature drift: loss, impedance, and noise coupling shift with temperature, shrinking training margin in worst cases.
Placement logic (where it should sit, and why)
- Protect the worst segment: place retimers near the boundary that dominates loss/variance (often the connector/backplane side).
- Shorten the “bad part”: reduce the length of the highest-loss or most variable segment that the host/GPU must train through.
- Keep routing predictable: limit layer changes and uncontrolled via clusters; enforce consistent constraints for lane groups.
Power and refclock isolation (treat retimer as a sensitive island)
- Power island: keep retimer supply away from high dv/dt VRM switch regions; avoid noisy return corridors.
- Refclock island: route reference distribution with controlled return; avoid crossing VRM noise keep-out areas.
- Keep-out concept: define a practical “do not place/route clocks here” zone near switch-node activity.
Sideband reliability (PERST#/CLKREQ# as card wiring problems)
- Do not treat sideband as “easy”: threshold margin and coupling can break training just as effectively as lane margin issues.
- Clear reference return: avoid sideband routes that share noisy return paths or run parallel to high-current switching zones.
- Intermittent training: cold/warm or plug/unplug sensitivity often points to sideband robustness and timing margins.
The diagram emphasizes the practical card-level view: retimers are positioned to protect the highest-variance or highest-loss segment, while refclock and sideband must be routed as first-class signals with clean returns and noise keep-out awareness.
PMBus, IMON, temperature, and power capping: measure-to-control (not measure-to-display)
Card telemetry is only useful when it supports a stable control loop. For power capping and throttling decisions, time alignment, filtering choices, and control latency can matter more than the number of sensors. “Measured accurately” beats “measured everywhere.”
On-card telemetry sources (what exists on a typical accelerator card)
- VRM telemetry: voltage, current, power, and controller temperature (often exposed digitally).
- IMON / current indicators: used for protection and power control—only valuable when the measurement meaning is understood.
- Hotspot temperature: near VRM phases, GPU vicinity, and sensitive islands (e.g., memory/clock regions).
- Event counters (concept): retry/error/training-related indicators with timestamps for correlation.
Sampling, filtering, and timestamps (the difference between correlation and causality)
- Time alignment: power, temperature, and event counters react with different delays; align to avoid wrong root-cause conclusions.
- Filtering is a trade: heavy filtering hides peaks/valleys; weak filtering can cause noisy policy triggers.
- Peak vs average: stability can be decided by short excursions that never appear in slow telemetry.
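A synthetic example makes the peak-vs-average point concrete: a 50 µs valley that would violate a droop budget simply vanishes under a 1 ms boxcar average:

```python
# Demonstration: a 50 us droop valley disappears under heavy averaging.
# Synthetic trace; real telemetry ADCs often average over ms-scale windows.
DT_US = 10
trace = [0.80] * 200                       # 2 ms of 10 us samples, nominal 0.80 V
for k in range(100, 105):                  # 50 us valley down to 0.72 V
    trace[k] = 0.72

def boxcar(samples, n):
    return [sum(samples[i:i + n]) / n for i in range(len(samples) - n + 1)]

true_min     = min(trace)
filtered_min = min(boxcar(trace, 100))     # 1 ms averaging window

print(f"true valley:     {true_min:.3f} V")
print(f"reported valley: {filtered_min:.3f} V")
```

The averaged stream reports roughly 0.796 V for a rail that actually hit 0.72 V: the "worst case" never appears in telemetry at all.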
Power capping and throttle policies (stable loop over aggressive loop)
- Policy inputs: V/I/P/T plus event indicators (concept) must be trustworthy and time-consistent.
- Policy outputs: capping/throttle acts on workload behavior; delay can create hunting and oscillation.
- Hysteresis/windows: prevent rapid toggling when measurements are noisy or delayed.
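A hysteresis band plus a hold-off window is the standard anti-hunting structure. The sketch below is conceptual, with illustrative thresholds, not a vendor policy:

```python
# Sketch of a power-cap policy with hysteresis and a hold-off window to
# avoid hunting. Thresholds and timing are illustrative placeholders.
CAP_HIGH_W = 700.0   # engage throttle above this
CAP_LOW_W  = 650.0   # release only below this (hysteresis band)
HOLD_OFF_S = 0.5     # minimum time between policy changes

class CapPolicy:
    def __init__(self):
        self.throttled = False
        self.last_change = -1e9

    def update(self, t_s, power_w):
        if t_s - self.last_change < HOLD_OFF_S:
            return self.throttled                # inside hold-off: no change
        if not self.throttled and power_w > CAP_HIGH_W:
            self.throttled, self.last_change = True, t_s
        elif self.throttled and power_w < CAP_LOW_W:
            self.throttled, self.last_change = False, t_s
        return self.throttled
```

Noise inside the 650-700 W band never toggles the policy, and the hold-off caps the maximum toggle rate even when measurements are delayed.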
Common misjudgments and fastest validations
- Power looks under limit but throttle triggers: hotspot temperature or delayed measurements may be driving policy.
- Telemetry stable but intermittent faults: peaks/valleys are missed by sampling; verify at the sensitive measurement point.
- Throttle thrashing: control latency + filtering can cause oscillation; adjust windows/hysteresis conceptually and re-check.
The closed-loop view prevents a common failure mode: using slow, heavily filtered telemetry for fast decisions. Align time across sensors/events, choose filters that preserve critical excursions, and tune policy windows to avoid hunting.
Sequencing windows, inrush droop, and how to prevent “random” bring-up failures
Accelerator cards often fail intermittently not because steady-state power is insufficient, but because critical enable and training windows overlap with ramp transients, inrush-induced droop, or false PG/FAULT triggers. A card-level sequencing plan treats rails, enables, and protection thresholds as one timed system.
Conceptual power-up order (why order changes stability)
- AUX → control domain: bring up monitoring and decision logic first so PG/FAULT decisions are trustworthy.
- Main rails → HBM rails: ramp high-current domains before noise-sensitive islands finalize initialization.
- High-speed link enable last: allow refclock and rails to settle before enabling link training or retimer/link islands.
Inrush and soft-start (keeping droop out of sensitive windows)
- Inrush source: large bulk capacitance and multiple domains charging simultaneously during ramp.
- Droop chain: input sag → VRM input dip → output droop → PG jitter → unstable enable/training outcomes.
- Soft-start concept: slope limiting and domain staggering reduce the deepest droop and spread current demand over time.
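For a linear soft-start ramp, inrush per domain is roughly I = C·dV/dt, which makes the benefit of staggering easy to estimate. Capacitances, voltages, and ramp times below are illustrative:

```python
# First-order inrush estimate: charging bulk capacitance at a controlled
# soft-start slope, with and without domain staggering. Values illustrative.
DOMAINS = {          # domain: (bulk capacitance F, nominal V, ramp time s)
    "core": (4000e-6, 0.8, 2e-3),
    "hbm":  (1500e-6, 1.1, 2e-3),
    "aux":  ( 500e-6, 3.3, 2e-3),
}

def inrush_a(c_f, v, t_ramp):
    return c_f * v / t_ramp          # I = C * dV/dt for a linear ramp

simultaneous = sum(inrush_a(*d) for d in DOMAINS.values())
staggered    = max(inrush_a(*d) for d in DOMAINS.values())

print(f"all domains at once:             {simultaneous:.2f} A")
print(f"staggered (worst single domain): {staggered:.2f} A")
```

Staggering cuts the peak demand from the sum of all domains to the worst single domain, which is exactly what keeps the droop chain out of sensitive windows.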
False trigger windows (PG/FAULT must match ramp reality)
- Sense vs truth: PG may observe a point that does not represent the load’s true worst-case droop location.
- Sampling mismatch: overly fast detection can misread noise as failure; overly slow detection can miss valleys.
- Window overlap: enabling links or memory during ramp transients turns deterministic behavior into probability.
Fault containment (a bad rail should not take down the entire card)
- Partition by domain: separate control/AUX, core power, HBM power, and link/clock islands conceptually and electrically.
- Branch protection (concept): local current limiting, e-fuse/fuse concepts, and controlled shutoff prevent cascade failures.
- Recordable failures: containment should preserve logs (PG/FAULT timeline) instead of causing total blackouts.
The timing diagram highlights the key card-level idea: avoid enabling sensitive functions during ramp transients, and ensure PG/FAULT windowing reflects real ramp behavior instead of reacting to short-lived droop or noise.
What proves an accelerator card is solid: bring-up, production, and field reproducibility
“It boots once” is not readiness. A production-ready accelerator card is proven by repeatable results across a validation matrix (temperature × load × scenario) with clear pass criteria and preserved evidence (waveforms, events, and timestamps) that enable fast root-cause analysis and field reproducibility.
Three validation layers (same product, different goals)
- Engineering bring-up: expose structural weaknesses (ramp windows, droop, thermal hotspots, training robustness).
- Production screening: repeatable coverage at scale (high signal, low ambiguity; minimal manual interpretation).
- Field/RMA reproduction: recreate symptoms with controlled conditions and aligned evidence (time + events + environment).
Must-test categories (what typically separates “works” from “robust”)
- Load-step integrity: droop depth/duration and recovery stability under bursts.
- Ripple/noise checks: critical rails during sensitive windows (startup, enable, high-load transitions).
- Thermal robustness: hotspot behavior across airflow and temperature corners.
- Link training success rate: repeatability across temperature, insertion variance, and workload states.
- Soak/burn-in: long-run stability and intermittent event/error trends.
- Policy consistency: power capping/throttle behavior that is stable (no hunting) and predictable across corners.
Evidence to preserve (so failures can be explained, not guessed)
- Waveforms: startup/ramp, load-step, and critical rail windows captured at meaningful sense points.
- Events: PG/FAULT transitions, throttle triggers, and training-related indicators (concept) with timestamps.
- Context: temperature corner, airflow configuration, load state, and insertion condition for reproducibility.
Pass criteria principles (define “good” in a measurable way)
- Power integrity: droop and ripple stay within the card’s budget for sensitive windows.
- Thermal: hotspots do not cross derating thresholds under defined conditions.
- Link: training success remains consistently high across corners, not only under room/idle conditions.
- Control: capping/throttle does not oscillate; policy windows and telemetry alignment prevent thrashing.
The matrix approach prevents “room-temperature optimism.” Each cell combines scenario tags (training/full/step) under temperature and load corners, while pass criteria stay explicit and comparable across revisions and production lots.
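Enumerating the matrix explicitly keeps coverage honest: every cell exists before testing starts and must end with a recorded result. A sketch with illustrative corners:

```python
# Sketch: enumerate the validation matrix (temperature x load x scenario)
# so every cell gets an explicit pass/fail record. Corners are illustrative.
from itertools import product

TEMPS_C   = [0, 25, 45, 70]
LOADS     = ["idle", "50%", "100%", "step-burst"]
SCENARIOS = ["training", "full", "step"]

matrix = [
    {"temp_c": t, "load": l, "scenario": s, "result": None}
    for t, l, s in product(TEMPS_C, LOADS, SCENARIOS)
]
print(f"{len(matrix)} cells to cover")   # 4 * 4 * 3 = 48
```

Any cell still holding `None` at release time is a named, visible coverage gap rather than "room-temperature optimism."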
Field Debug Playbook: Symptoms → Likely Causes → First Measurements
This section turns common GPU/AI accelerator card failures into a repeatable “first-hour” workflow. It stays card-level: rail behavior, telemetry timing, thermal contact/airflow, and link islands (PCIe/NVLink).
Common Setup (Do this before chasing any symptom)
The fastest debug comes from consistent capture windows, consistent probe points, and a minimal “evidence pack”.
1) Align time (events must be comparable)
- Record PG/FAULT edge order and the exact moment the symptom happens (reset/black screen/drop/throttle).
- Prefer logs with timestamps (or monotonic counters) so “cause vs effect” is not guessed.
- If telemetry is averaged, note the averaging and conversion time—slow telemetry hides fast droops.
2) Define the three windows
- Power-up window: sequencing + inrush + first enabling of link islands.
- Training window: PCIe/NVLink enable + refclk stability + island rails settling.
- Burst window: worst di/dt load-step and peak current events.
3) Probe the “truth point” (avoid false comfort)
- For Vcore/HBM, prioritize the load-side sense region (near the GPU/HBM decoupling field), not only at the controller.
- Keep ground reference short (spring ground / nearby return) to avoid seeing ringing that is purely measurement artifact.
- For link islands (retimer/refclk), probe the island rails and refclk domain separately from the main power plane.
Evidence pack (save these every time)
- Two waveforms: Vcore (or main rail) + one AUX/island rail, captured over the relevant window.
- Telemetry snapshot: V/I/P/T + any PG/FAULT bits around the event.
- Thermal snapshot: hotspot temps and “contact/airflow state” (blocked airflow / cold-plate contact suspicion / fan curve state).
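Merging telemetry samples and PG/FAULT events onto one sorted timeline turns "what moved first" into a lookup instead of a guess. A minimal sketch with fabricated sample data:

```python
# Sketch: merge telemetry samples and PG/FAULT events onto one timeline so
# cause-vs-effect is read off the data. All values below are fabricated.
telemetry = [(0.000, "vcore", 0.80), (0.010, "vcore", 0.74), (0.020, "vcore", 0.80)]
events    = [(0.012, "PG_VCORE", "deassert"), (0.013, "FAULT", "assert")]

timeline = sorted(
    [(t, "telemetry", name, val) for t, name, val in telemetry] +
    [(t, "event", name, val) for t, name, val in events]
)
for entry in timeline:
    print(entry)
```

In this fabricated trace the Vcore dip at t=0.010 precedes the PG deassert at t=0.012, pointing at the power domain first.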
Symptom A — Random reboot / black screen (often “looks fine” until it doesn’t)
Treat this as a timing problem first: find what dropped (PG/rail/link) and in which window (power-up, training, burst).
Likely causes (grouped by domain)
- Power: fast Vdroop → UV/PG drop; OCP/OTP hit or false hit; island rail dip triggers a cascade.
- Thermal: localized hotspot trips protection; cold-plate contact/airflow creates a “hidden” hotspot.
- Link/Control: link island instability amplifies into a system-level fault (often temperature dependent).
First 3 checks (the “three-piece kit”)
- VRM telemetry: check PG/FAULT ordering and whether current/temperature flags rise before the event.
- Rail waveform: capture Vcore + one AUX/island rail during the burst window; look for a short valley (droop) and recovery ringing.
- Hotspot state: compare hotspot gradient vs average; verify airflow/cold-plate contact consistency.
Fast isolation rules (don’t guess)
- If PG drops first → power domain likely (droop/UV/OCP blanking mismatch).
- If hotspot spikes first → thermal/contact/airflow likely.
- If rails look stable but link errors climb first → link island power/refclk/sideband integrity likely.
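The three isolation rules reduce to "which domain's indicator moved first." A sketch with hypothetical event names:

```python
# Sketch of the isolation rule as code: classify by which indicator moved
# first relative to the failure. Event names/domains are illustrative.
DOMAIN_OF = {
    "PG_DROP": "power", "UV": "power", "OCP": "power",
    "HOTSPOT_SPIKE": "thermal", "OTP": "thermal",
    "LINK_ERR": "link", "RETRAIN": "link",
}

def first_domain(events):
    """events: list of (t_s, name); returns the domain that moved first."""
    for _t, name in sorted(events):
        if name in DOMAIN_OF:
            return DOMAIN_OF[name]
    return "unknown"
```

Applied to the aligned timeline from the common setup, the classification is deterministic instead of an argument between teams.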
Parts to reference (examples, with concrete OPNs)
- VR controller: Infineon XDPE192C3B-0000 / Infineon XDPE132G5H-G000
- Power stage / DrMOS: Infineon TDA21490AUMA1, Renesas ISL99390BR5935
- Power monitor: TI INA229 / TI INA238; eFuse (card rail): TI TPS25982
Symptom B — PCIe/NVLink training failure or intermittent link drop
Link failures that correlate with temperature or insertion often originate from island rails, refclk cleanliness, or sideband robustness—at the card level.
Likely causes (card-level)
- Channel margin: connector/trace loss + temperature drift shrinks margin.
- Retimer island: island rail noise or refclk contamination makes training fragile.
- Sideband integrity: PERST#/CLKREQ# (and similar) behaves unreliably under ground bounce/noise.
First 3 checks
- Telemetry timing: correlate drop events with island rail voltage/current and temperature.
- Island rail waveform: capture the retimer/refclk rail during the training window; look for ripple bursts or dips.
- Thermal correlation: compare cold vs hot; training that fails only after warm-up is a strong hint.
Fast isolation
- Hot-only drops → margin/thermal drift dominates; focus on island rail stability + refclk cleanliness.
- Insertion-dependent → connector/contact variability; confirm with repeated seat/unseat and capture success rate.
- Training-window-only → sequencing/island enable timing; check whether the island rail is fully settled before enable.
Parts to reference (examples)
- PCIe Retimer (example OPN): Astera Labs PT5162LR / PT5161LRS
- Jitter/clock: Si5341 (jitter attenuator/clock generator class)
- Temp sensing near islands: TI TMP464 (remote diode channels help map hotspots)
Symptom C — Performance swings / throttle oscillation (“power cap hunting”)
Most “mysterious throttling” is a loop problem: measurement latency + filtering + policy thresholds interacting with bursty workloads.
Likely causes
- Telemetry loop: sampling/averaging hides peaks, then policy over-corrects; delays create oscillation.
- Thermal thresholds: hotspot crosses a threshold repeatedly (airflow/contact instability).
- Peak power events: short spikes (not average power) trigger limits.
First 3 checks
- Time-correlate throttle events with V/I/P/T; note telemetry conversion time and averaging.
- Capture burst window: confirm whether short peaks align with policy triggers (even if average is low).
- Hotspot gradient: check whether one region runs away while “board temp” looks normal.
Fast isolation
- If throttle matches hotspot temperature closely → thermal/contact/airflow dominates.
- If throttle matches power spikes → peak management dominates (limit based on peak, not average).
- If throttle lags telemetry heavily → measurement/policy latency dominates.
Parts to reference (examples)
- Power monitors: TI INA229 (power/energy/charge monitoring class), TI INA238
- PMBus fault logging/rail supervision: ADI LTC2977
- VR telemetry source: Infineon XDPE132G5H-G000 / XDPE192C3B-0000 class
Symptom D — Only fails when hot (warm-up dependent)
“Hot-only” failures usually indicate shrinking margin: electrical (loss/noise) or mechanical/thermal (contact/airflow).
What typically changes with temperature
- Channel margin shrinks (loss, impedance drift, connector behavior).
- VRM efficiency/thermal headroom changes; hotspot can cross OTP earlier.
- Refclk cleanliness and island rail ripple sensitivity rises.
First 3 checks
- Compare cold vs hot telemetry (same workload): does I/P rise or does T rise faster than expected?
- Capture a training window waveform hot vs cold on island rails.
- Map hotspot gradient using remote sensing (diode channels) near VRM, HBM area, and retimer area.
Fast isolation
- If hot-only aligns with link drops → focus on island rails + refclk + local cooling.
- If hot-only aligns with PG/FAULT → VRM thermal/protection dynamics likely.
Parts to reference (examples)
- Thermal mapping: TI TMP464
- Retimer island reference: Astera Labs PT5162LR class + refclk conditioning Si5341 class
- VR thermal telemetry: Renesas ISL99390BR5935 (SPS with telemetry) / Infineon TDA21490AUMA1 class
Symptom E — Only fails on cold start (first boot after power-off)
Cold-start failures are often power-up timing and threshold related. Capture “the first attempt” evidence—later retries can hide the root cause.
Likely causes (first-boot specific)
- Sequencing/soft-start window mismatch (AUX/control rails not settled before enable).
- Inrush-induced dip creates a “near-miss” that breaks training only on the first attempt.
- Temperature-dependent thresholds and sensor offsets cause false protection events.
First 3 checks
- Capture the power-up window: rails vs time with PG/FAULT ordering.
- Check for inrush dip and recovery ringing on the main input or main rail path.
- Compare first boot vs second boot: if the second succeeds, the issue is likely window/settling related.
Fast isolation
- If failure disappears after a quick retry → settling/inrush/timing is highly suspicious.
- If only one domain stays marginal (e.g., island rail) → focus on that rail’s soft-start and local decoupling.
Parts to reference (examples)
- eFuse / inrush control (card-side protection): TI TPS25982
- PMBus sequencing + fault logs: ADI LTC2977 class
- VRM telemetry source: Infineon XDPE132G5H-G000 / XDPE192C3B-0000 class
Figure F11 — Symptom → Measurements → Isolation (card-internal first)
Use this fault tree to avoid random probing: start from symptom, run the three-piece kit, then isolate into POWER / THERMAL / LINK+CONTROL domains.
FAQs (Troubleshooting-first, card-level)
Each answer follows a strict card-level playbook: likely causes → first measurements → fix/verify. Part numbers are representative examples to anchor categories (controller / power stage / monitor / eFuse / retimer / jitter-cleaner).
1 Why can “average power” look fine, yet a burst causes droop or reboot?
Diagnosis: the failure mode is usually transient response, not steady-state watts. Fast di/dt can pull the GPU rail below its minimum at the true sense point, while telemetry averages hide the event.
First measurements (10 minutes):
- Capture rail waveform at GPU-side sense and at VRM output; compare droop depth and recovery time.
- Check whether OCP/UVP blanking overlaps the burst window; verify with fault pins/logs.
- Correlate droop with hotspot temperature (VRM + connector + retimer power island).
Fix/verify: reduce PDN impedance where it matters (path + vias + decoupling placement), align sensing to the true load point, and validate with repeatable load-steps across temperature corners.
- Digital multiphase controllers: XDPE192C3B-0000 / XDPE132G5H-G000
- 90A-class power stages: TDA21490AUMA1 / ISL99390BR5935
- Power/energy monitors: INA229 (SPI) / INA238 (I²C)
- Fault logging / sequencing: LTC2977
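The two numbers worth extracting from every captured waveform are droop depth and recovery time; averages hide both. A sketch of that extraction, assuming samples as `(t_us, volts)` pairs and an illustrative ±2% recovery band:

```python
def droop_metrics(samples, v_nom, recovery_band=0.02):
    """Return (droop_depth_v, recovery_time_us) for the worst droop
    in a list of (t_us, volts) samples."""
    v_min = min(v for _, v in samples)
    t_min = next(t for t, v in samples if v == v_min)
    # Recovery: first time after the minimum that the rail re-enters
    # the +/- recovery_band window around nominal.
    lo, hi = v_nom * (1 - recovery_band), v_nom * (1 + recovery_band)
    t_rec = None
    for t, v in samples:
        if t <= t_min:
            continue
        if lo <= v <= hi:
            t_rec = t
            break
    return v_nom - v_min, (t_rec - t_min) if t_rec is not None else None
```

Run the same extraction on GPU-side sense and VRM-output captures; a large delta between the two is the path-inductance signature described above.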
2 Are more VRM phases always better? When can “more phases” hurt stability?
Diagnosis: phase count helps current sharing and ripple, but it also increases control/measurement complexity. Poor interleaving, mismatched current sense, or layout-induced delays can create phase imbalance, ringing, or false protections.
First measurements:
- Measure phase current balance (IMON/inductor DCR sense or stage monitor outputs) under bursts.
- Check rail impedance peaks (scope + step load), not only ripple at steady state.
- Verify compensation margin indirectly: overshoot/undershoot + settling vs temperature and airflow.
Fix/verify: tune as a system (controller settings + power stages + output network + sensing). Validate with the same burst profile at hot and cold boot.
- Multiphase controller family: XDPE132G5H-G000
- Smart power stage with current/temp outputs: ISL99390BR5935
- PMBus manager for repeatable margins/logs: LTC2977
3 Where do remote-sense designs most often go wrong (voltage “looks right” but is wrong)?
Diagnosis: the common failure is a bad Kelvin reference: sense traces pick up switching noise, share return with high current, or cross noisy islands. The controller regulates the “wrong voltage,” masking true droop at the GPU pads.
First measurements:
- Compare sense-reported voltage vs direct probe at the true load pads (short ground spring).
- Look for temperature-dependent offset (sense copper resistance + return path shift).
- Check whether sense filtering introduces delay that worsens burst droop.
Fix/verify: route sense as a quiet differential pair to the correct point, keep it away from switch nodes/retimer clocks, and verify with step loads plus thermal sweep.
- Voltage/current monitors for cross-check: INA238 / INA229
- Board thermal visibility (remote diode): TMP464
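The temperature-dependent offset has a simple back-of-envelope: any resistance the sense return shares with load current injects an IR error that grows with copper's tempco (about 0.393%/°C from 25 °C). A sketch with illustrative numbers:

```python
def sense_offset_mv(shared_return_mohm_25c, load_a, temp_c):
    """IR error (mV) injected when the sense return shares resistance
    with the load current path; copper tempco ~0.393%/degC from 25 degC."""
    r_mohm = shared_return_mohm_25c * (1 + 0.00393 * (temp_c - 25.0))
    return r_mohm * load_a  # mOhm * A = mV
```

Even 0.1 mΩ of shared return at 400 A is tens of millivolts of "regulated to the wrong voltage," and it drifts with temperature, which is exactly the thermal-sweep signature to look for.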
4 Why can heavy decoupling still produce poor load-step performance?
Diagnosis: “more capacitors” does not equal “lower impedance.” ESL and current-loop area dominate at high frequency; bulk caps placed far away help energy but not the first microseconds. A misplaced via field can turn decoupling into an antenna.
First measurements:
- Measure the first droop (fast) vs later sag (slow) to separate ESL vs energy deficit.
- Compare VRM-output ripple to GPU-pad ripple; large delta points to path inductance.
- Check hotspots at vias/planes near the connector and VRM output choke area.
Fix/verify: tighten the high-current loop (planes + via arrays), place HF caps at the true load boundary, and re-validate with identical step profiles.
- Power monitors for waveform-to-telemetry correlation: INA229 / INA238
- PMBus fault logs to tie events to steps: LTC2977
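Separating the fast (ESL/loop-inductance) droop from the slow (bulk-energy) sag can be automated on the same step-load capture. A sketch, assuming `(t_us, volts)` samples and an illustrative 1 µs boundary between "fast" and "slow":

```python
def split_droop(samples, v_nom, fast_window_us=1.0):
    """Return (fast_droop_v, slow_sag_v): worst deviation from nominal
    inside vs beyond the fast window of a step-load capture."""
    fast = [v for t, v in samples if t <= fast_window_us]
    slow = [v for t, v in samples if t > fast_window_us]
    fast_droop = v_nom - min(fast) if fast else 0.0
    slow_sag = v_nom - min(slow) if slow else 0.0
    return fast_droop, slow_sag
```

If the fast term dominates, attack loop inductance and HF cap placement at the load boundary; if the slow term dominates, look at bulk energy and regulator bandwidth instead of adding more capacitors.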
5 How should HBM rail ripple limits be set, and why can “clean-looking” rails still fail training?
Diagnosis: HBM-related failures are often caused by measurement blind spots (insufficient bandwidth, probing artifacts) or noise coupling during the narrow training/initialization window. “Clean” at a far test point can still be noisy at the HBM island.
First measurements:
- Probe at the HBM island boundary (short ground), not only at the regulator output.
- Capture rail noise during training events, not only steady state.
- Check clock island isolation: ref-clock/power share can inject periodic noise.
Fix/verify: define ripple spec at the real load boundary, ensure sequencing aligns to the sensitive window, and re-run training success-rate across temperature corners.
- PMBus logging / sequencing for reproducible windows: LTC2977
- Clock/jitter cleanup (clock island anchor): Si5341
- Rail monitors for correlation: INA238
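Re-running training success-rate across corners only pays off if the pass/fail call is mechanical. A minimal sketch (the 99.9% target is an example, not a spec):

```python
def weak_corners(results, min_rate=0.999):
    """results: {corner_name: (passes, runs)}.
    Return corners whose training success-rate misses the target."""
    return sorted(c for c, (p, n) in results.items() if p / n < min_rate)
```

A corner that fails only cold or only hot narrows the suspect list to sequencing alignment or thermally shifted noise coupling, respectively.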
6 At high temperature the card drops PCIe/NVLink links—what to suspect first: retimer, power, or clock?
Diagnosis: temperature-driven dropouts commonly come from margin collapse: retimer supply droop, ref-clock jitter increase, or connector/trace loss changes. The fastest discriminator is correlation: errors vs retimer-rail noise vs clock stability vs local hot spots.
First measurements:
- Trend link errors against retimer island temperature and retimer supply ripple.
- Confirm ref-clock integrity at the retimer boundary (clean supply + isolation).
- Inspect sideband integrity (PERST#/CLKREQ#) under thermal stress.
Fix/verify: stabilize retimer rails first (local decoupling + clean island), then harden ref-clock distribution, then revisit placement/trace constraints. Validate by thermal ramp + long-run training/error counters.
- PCIe/CXL retimer anchor: PT5162LR
- Clock/jitter cleaner anchor: Si5341
- Temp + power correlation: TMP464 + INA238
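The "correlation as discriminator" idea can be made concrete: trend link errors against each candidate axis and let the strongest correlation pick the first target. A sketch with equal-length sample series (axis names are illustrative):

```python
def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

def dominant_axis(errors, temp_c, ripple_mv, jitter_ps):
    """Which axis tracks the link-error trend most strongly?"""
    axes = {"thermal": temp_c, "retimer_supply": ripple_mv, "ref_clock": jitter_ps}
    return max(axes, key=lambda k: abs(pearson(errors, axes[k])))
```

This is only a triage ranking, not proof of causality; confirm with the thermal ramp plus long-run error counters as described above.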
7 What misjudgments come from slow telemetry sampling, and how should filters/thresholds be set?
Diagnosis: if sampling is slower than the event, telemetry reports the “average after the crash.” Filters can hide real spikes, while delayed interpretation can still trip protections on events that have already passed. Thresholds must match the policy time constants (throttle, retry, fault latch).
First measurements:
- Compare scope waveforms to telemetry time series and quantify lag/aliasing.
- Validate alert response vs burst width (avoid missing droop but prevent false trips).
- Stamp events consistently: voltage droop, current spike, temperature rise, and link errors.
Fix/verify: define “fast protection” vs “slow policy” channels, use hysteresis and rate limits for capping decisions, and re-run the same workload burst set to confirm stability.
- High-precision monitors: INA229 / INA238
- Sequencing + fault logs + telemetry aggregation: LTC2977
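Quantifying the aliasing gap is a one-liner once scope and telemetry data are aligned: compare the true minimum to what a slow sampler would have seen. A sketch with integer-microsecond timestamps for simplicity:

```python
def scope_vs_telemetry_min(samples, telemetry_period_us):
    """samples: (t_us, volts) at full scope rate.
    Return (scope_min, telemetry_min): what really happened vs what a
    sampler reading every telemetry_period_us would report."""
    scope_min = min(v for _, v in samples)
    tele = [v for t, v in samples if t % telemetry_period_us == 0]
    return scope_min, min(tele)
```

A large gap between the two minima is the signature of a "fast protection" event being adjudicated by a "slow policy" channel, which is exactly the split the fix/verify step formalizes.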
8 Why can power capping create performance oscillation, and how to make it smoother?
Diagnosis: capping is a control loop. If measurement + filtering + actuation latency is too large, the policy over-corrects and causes “sawtooth” frequency/power. The fix is usually better timing, hysteresis, and rate limits—not simply tighter limits.
First measurements:
- Plot measured power vs applied throttle decision time; look for phase lag and overshoot.
- Separate GPU rail events from AUX/retimer islands (avoid false global throttles).
- Check that the power estimate matches real rail power during bursts, not only steady state.
Fix/verify: implement smoother control (deadband + slew limits), and verify with repeated burst workloads at fixed ambient and at hot steady state.
- Power monitor for accurate inputs: INA229 / INA238
- Policy timestamps + fault context: LTC2977
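The "deadband + slew limits" fix can be sketched as a single control tick; all gains and limits here are illustrative, not tuned values:

```python
def cap_step(freq_mhz, power_w, cap_w, deadband_w=10.0, slew_mhz=15.0):
    """One capping-loop tick: move frequency toward the cap with a
    deadband (no action near the cap) and a slew limit (bounded step)."""
    err = power_w - cap_w
    if abs(err) <= deadband_w:
        return freq_mhz                       # inside deadband: hold
    step = min(abs(err) * 0.5, slew_mhz)      # proportional, slew-limited
    return freq_mhz - step if err > 0 else freq_mhz + step
```

The deadband prevents hunting around the cap and the slew limit bounds each correction, so a large measured overshoot no longer produces the sawtooth over-correction described above.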
9 Cold-boot training fails intermittently—sequencing or inrush? What is most common?
Diagnosis: many “sequencing bugs” are actually inrush-induced droop. Cold impedance shifts and connector/plane resistance can pull AUX or main rails low during soft-start, causing PG chatter and missing the training window.
First measurements:
- Capture AUX and main rail ramp with PG/FAULT markers; look for sag and bounce.
- Compare cold vs warm start: ramp time, peak inrush, and connector hot spots.
- Check enable ordering: AUX → control → main rails → HBM → high-speed link enable.
Fix/verify: shape inrush (slew control, staged enables) and contain faults per island so a single droop does not reset the entire card. Verify with repeated cold boots and training success-rate.
- Smart eFuse / inrush management anchor: TPS25982
- Sequencing + fault logs: LTC2977
- Power monitors for ramp correlation: INA238
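The enable ordering in the checks above can be enforced as a tiny gate that refuses to advance past a rail whose power-good is not asserted; a sketch with the same rail names used in the checklist:

```python
RAIL_ORDER = ["aux", "control", "main", "hbm", "link"]

def next_enable(pg_status):
    """pg_status: {rail: power_good_bool}. Return the next rail to
    enable, or None when every PG in the chain is up."""
    for rail in RAIL_ORDER:
        if not pg_status.get(rail, False):
            return rail
    return None
```

Gating on PG rather than on fixed delays is what makes staged enables robust to the cold-impedance shifts that stretch ramp times.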
10 OCP/OTP seems to trip—could it be a false trigger? How to verify quickly?
Diagnosis: false triggers often come from the measurement path: noisy current sense, poor temperature sensor placement, or thresholds that ignore burst profiles. Confirm whether the rail actually exceeded limits at the true sense point, then align blanking and thresholds to real events.
First measurements:
- Scope: rail voltage + current proxy during the failing burst; capture fault pin/log timestamp.
- Validate sensor location: does it represent the real hotspot (stage vs inductor vs connector)?
- Compare trip events against thermal ramp and airflow changes.
Fix/verify: improve sense integrity, add filtering where appropriate (without hiding true faults), and re-run corner cases (hot steady state, cold boot, burst loads).
- Power stage with monitors for correlation: ISL99390BR5935 / TDA21490AUMA1
- Power/energy monitors: INA229
- Remote diode temperature sensing: TMP464
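Whether the rail "actually exceeded limits" at the true sense point reduces to one question: did the current stay above the limit longer than the blanking time? A sketch, assuming 1 µs-spaced `(t_us, amps)` samples:

```python
def real_overcurrent(current_samples, limit_a, blank_us):
    """True only if current exceeded limit_a for longer than the
    blanking time; shorter excursions are false-trigger candidates.
    Samples are (t_us, amps) spaced 1 us apart."""
    run = 0
    for _, i in current_samples:
        run = run + 1 if i > limit_a else 0
        if run > blank_us:
            return True
    return False
```

A trip event whose captured current never satisfies this test points back at the measurement path (noisy sense, threshold/blanking mismatch) rather than a genuine fault.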
11 In production, how to screen “marginally stable” cards before customers find them?
Diagnosis: marginal stability usually appears only at corners: temperature extremes, burst workloads, and link training retries. Production screening must combine a validation matrix (temp × load × scenario) with a short set of discriminating signatures (droop, recovery, error counters, and thermal deltas).
First measurements (factory-friendly):
- Run a standardized burst profile and capture min rail voltage + recovery time.
- Check training success-rate and time-to-link under controlled hot/cold soak.
- Archive fault logs and key waveforms as shipment evidence.
Fix/verify: convert “intermittent” into “repeatable” using scripted bursts + thermal steps, and require pass/fail thresholds that match field conditions.
- PMBus manager for sequencing + telemetry + fault logs: LTC2977
- Rail monitors for automated thresholds: INA238
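The validation matrix becomes screenable once every corner's signatures are checked against explicit limits. A sketch (metric names and limits are illustrative placeholders):

```python
def screen(card_results, limits):
    """card_results: {(temp, scenario): {metric: value}};
    limits: {metric: max_allowed}. Return failing (corner, metric) pairs."""
    fails = []
    for corner, metrics in card_results.items():
        for name, value in metrics.items():
            if name in limits and value > limits[name]:
                fails.append((corner, name))
    return fails
```

The output doubles as the shipment evidence called for above: an empty list is the pass record, and any entries name exactly which corner and which signature missed.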
12 Field blackout/reboot with a 10-minute window—what is the most effective measurement set?
Diagnosis: the goal is fast isolation inside the card: power transient vs thermal throttle vs link instability. The best set is a triad that can run in parallel: VRM telemetry snapshot, rail waveforms at two points, and hotspot temperature/airflow evidence.
First measurements (triage kit):
- Read: V/I/P/T + last fault reason from VRM/PMBus logs.
- Scope: GPU-side rail + VRM output simultaneously; mark the failure instant.
- Thermals: retimer island + VRM hotspot + connector temperature; confirm cooling contact/flow.
Fix/verify: once the dominant axis is identified, rerun the same workload with one controlled change at a time (fan curve, capping policy, retimer island supply) to confirm causality.
- Telemetry monitors: INA229 / INA238
- Remote diode temperature sensor: TMP464
- Sequencing/logs for rapid evidence: LTC2977
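The triad can feed a first-cut classifier so the 10-minute window ends with a named axis rather than raw data. A sketch only; the thresholds (30 mV sense-to-VRM delta, thermal limit) are illustrative and must come from the card's own budgets:

```python
def triage(vrm_fault, droop_gpu_mv, droop_vrm_mv, hotspot_c, limit_c, link_errors):
    """Pick the dominant card-internal axis from one triage capture."""
    if hotspot_c >= limit_c:
        return "thermal"            # hotspot at/over limit: cooling first
    if vrm_fault or (droop_gpu_mv - droop_vrm_mv) > 30:
        return "power"              # logged fault or large path delta
    if link_errors > 0:
        return "link"               # errors without power/thermal evidence
    return "inconclusive"           # extend capture, do not guess
```

An "inconclusive" result is a valid outcome: it means the capture window missed the event and should be extended, not that the card is healthy.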