
CPU VRM (VR13/VR12+): Multiphase Design & PMBus Telemetry


CPU VRM (VR13/VR12+) is a multiphase, telemetry-rich power stage that converts the server bus rail into a stable, fast-responding low-voltage/high-current CPU rail. This page focuses on how to select, validate, and debug controllers, DrMOS, current sensing, transient/loop stability, protections, and PMBus observability—without drifting into mainboard, BMC, PSU, or 48V topics.

H2-1 · Scope & Boundary: What This Page Solves

The focus is the CPU voltage regulator module (VRM) meeting VR13/VR12+ behavior: fast transient control, stable loop dynamics, robust protections, and usable PMBus telemetry. Motherboard, rack power, and management topics are out of scope.

Scope Box
  • What: Convert the board bus (typically 12 V) into an ultra-low-voltage, high-current CPU rail (Vcore/Vcc), while honoring VR13/VR12+ dynamic requirements.
  • Design knobs: Multiphase topology, load-line (Vdroop), compensation/bandwidth, current sensing & balancing, protection policy, thermal efficiency, and PMBus observability.
  • Proof: “Power rating” is not the finish line. Evidence must include transient waveforms, stability margin, protection behavior, and agreement between reported telemetry and measured waveforms.
  • Not covered: Whole-board rail sequencing, BMC/Redfish, PSU/48 V architecture, DDR/PCIe/storage fabrics, rack-level telemetry platforms (links only).
System Boundary Signals (VRM-facing)

The CPU VRM is defined by its boundary: a power path, control/state signals, and a telemetry interface. Keeping the boundary explicit prevents cross-topic drift and makes verification measurable.

  • Power: 12 V Bus → Multiphase Buck → CPU Rail
  • Control: VID (SVID/AVS) + Slew
  • States: EN / PGOOD
  • Fault: VRHOT# / FAULT
  • Sense: V/I/T Sampling
  • Telemetry: PMBus (Status + Logs)
Deliverables (What the Reader Can Execute)
  • Architecture decisions: phase count, switching frequency range, power stage/inductor boundaries, and why each knob exists in server CPU rails.
  • Load-line & transient targets: how Vdroop shapes output impedance, what to measure, and which failure modes it prevents.
  • Stability evidence: practical margin targets and how to prove them (frequency-domain where possible, time-domain always).
  • Current sensing & balancing: DCR vs shunt vs RDS(on) sensing tradeoffs tied to error, bandwidth, and phase current balance.
  • Protection behavior: trigger → response → recovery mapping for OCP/OVP/UVP/OTP, including common false-trigger mechanisms.
  • PMBus observability: which telemetry fields are useful for acceptance vs field debug, and how filtering can hide real events.
Out-of-scope (Links Only)
  • Motherboard power tree (overview)
  • BMC / Redfish / IPMI
  • PSU / CRPS
  • 48 V hot-swap / eFuse
  • DDR5 power / RCD / SPD
  • PCIe / NIC / GPU fabrics
  • Rack telemetry platform
Quality bar for the rest of the page: every section must map to a knob, an evidence method, and a failure mode it prevents.
Figure F0 — VRM boundary: energy path, signals, and what stays out-of-scope
[Figure F0: block diagram. 12 V Bus (input power domain) → Multiphase VRM (controller + DrMOS + L/C; sensing + protection + PMBus) → CPU Rail (Vcore/Vcc, high di/dt). Boundary blocks: VID / slew control (target V + transition), sense inputs (V/I/T sampling taps), states (PGOOD / VRHOT# / FAULT), PMBus telemetry (V/I/T + status + faults). Out-of-scope (links only): BMC / Redfish, PSU / 48 V, DDR5 power, PCIe / NIC / GPU, rack telemetry.]

H2-2 · VR13/VR12+ Constraints That Actually Matter

A “high-current” VRM can still fail a server platform if transient behavior, protection consistency, or observability is weak. VR13/VR12+ is best treated as a closed loop: spec intent → design knobs → evidence → pitfalls.

Acceptance should be based on waveforms and behavior—steady-state power numbers alone do not predict burst stability, false trips, or field reproducibility.
Spec → Knobs → Evidence → Pitfalls (Engineering Table)
Each entry lists the spec intent, then the design knobs (VRM-side), the evidence (what to measure), and the common pitfalls.

VID tracking & transitions: target V changes without overshoot or stalls
  • Design knobs: VID interface handling, slew/transition control, control-loop bandwidth, output impedance shaping
  • Evidence: Step VID transitions: Vout waveform (overshoot/undershoot), settling time, repeated transitions at temp corners
  • Common pitfalls: Too aggressive slew causing overshoot; too slow slew causing droop under burst; relying on PMBus-only “smoothed” values

DC accuracy (incl. temp drift): voltage is correct where it matters
  • Design knobs: Sense point selection, offset/gain calibration, reference stability, temperature compensation strategy
  • Evidence: Vout at load points (low/nom/high) × temperature corners; compare scope-measured Vout vs reported telemetry
  • Common pitfalls: Interpreting designed droop as “bad accuracy”; sense routing injecting error; unverified telemetry scaling

Load-line / Vdroop behavior: controlled droop prevents worst-case overshoot
  • Design knobs: Load-line slope, current-sense method, compensation, output capacitor ESR/ESL profile
  • Evidence: V–I curve (Vout vs Iload) + transient overlay; confirm droop is consistent across current range and temperature
  • Common pitfalls: Making droop “too hard” → overshoot/false OVP; making droop “too soft” → undervoltage under burst and resets

Transient response (di/dt): burst steps do not crash the rail
  • Design knobs: Phase count, switching frequency, inductor value/saturation, output capacitor network, loop bandwidth, OCP policy
  • Evidence: Load step tests: undershoot magnitude, recovery time, ringing; measure with low-inductance probing and current probe
  • Common pitfalls: Testing only one corner (input V, temp, load); “slow” steps that miss real burst content; capacitor effective value loss under bias

Stability margin: loop remains stable across corners
  • Design knobs: Compensation type/parameters, sampling & digital delay, output network (ESR/ESL), mode transitions (CCM/DCM/phase shedding)
  • Evidence: Frequency-domain where possible (injection/Bode). Always validate with step response + phase current waveforms
  • Common pitfalls: Capacitor changes shifting poles/zeros; light-load mode changes creating boundary instability; ignoring delay limits in digital control

Current balance (per-phase): no hidden hotspot phases
  • Design knobs: Current-sense topology (DCR/shunt/RDS(on)), per-phase calibration, layout symmetry, inductor tolerance
  • Evidence: Per-phase current telemetry vs thermal map; compare phase-to-phase spread at different loads and temperatures
  • Common pitfalls: “Average current looks fine” while one phase overheats; DCR temp drift uncorrected; routing parasitics dominating sense signals

Protection behavior: trigger → response → recovery is consistent
  • Design knobs: OCP mode (cycle/hiccup/latch), OVP/UVP thresholds & blanking, OTP derating vs shutdown, soft-start + pre-bias handling
  • Evidence: Fault injection: overload, short, thermal. Correlate waveform behavior with status/fault registers
  • Common pitfalls: Noise/edge ringing causing false trips; thresholds without adequate debounce; recovery policy that repeats brownout cycles

Telemetry usefulness (PMBus): data helps acceptance and debug
  • Design knobs: Telemetry sampling rate, filtering, snapshot/black-box support, fault register design
  • Evidence: Compare PMBus V/I/T + flags against real waveforms during events; verify time alignment and event capture
  • Common pitfalls: Over-filtered values hiding excursions; missing fast event capture; using “stable numbers” as proof of stability

Practical reading: each row must end with an evidence plan (waveform or register) and a corner case (temperature, mode transition, or fast load burst).

Engineering conclusion (why “power rating” is not enough)
Server-grade CPU VRMs are separated by transient immunity, stability margin across corners, consistent protection behavior, and telemetry that captures real events—not by steady-state watts alone.
Figure F1 — Spec intent mapped to design knobs (many-to-many, verified by evidence)
[Figure F1: many-to-many mapping from spec intent (VID tracking, DC accuracy, load-line, transient, stability, protection & telemetry) to design knobs (VID / slew, phases / interleaving, fSW & inductor (L), Cout ESR / ESL, compensation / bandwidth, current sensing, limits / state policy, PMBus filter / snapshot). Knobs interact; validate with waveforms, stability evidence, and PMBus fault/status correlation.]

H2-3 · System Decomposition of a Multiphase Buck VRM

Treat a CPU VRM as two coupled channels: an energy path that moves power and a signal path that enforces behavior. Clear module responsibilities make selection, verification, and fault isolation repeatable.

Energy Path (power flow)

Power flows from the board bus (typically 12 V) into multiple interleaved phases, is combined at the summing node, and is buffered by the output capacitor network before reaching the CPU rail. In a server VRM, the energy path must remain predictable under large di/dt and across temperature corners.

12 V Bus → Phase 1..N → Summing Node → Cout Network → CPU Rail
Signal Path (control + observability)

The signal path is a closed loop: V/I/T sensing feeds the digital controller; the controller applies compensation, phase management, limits, and state logic; PWM/drive signals command each phase. Telemetry and fault/status reporting must align with real waveforms, otherwise the VRM becomes difficult to accept and debug in the field.

V/I/T Sense → Digital Controller → PWM/Drive · Protection State · PMBus Telemetry
Module Responsibilities (what each block must do)
  • Digital controller: sampling → control law/compensation → phase management → limits/state machine → telemetry & fault registers.
  • DrMOS / power stage: switching + conduction + driver losses; sets switching-node quality and concentrates heat sources.
  • Inductor (magnetics): ripple shaping, transient di/dt capability, saturation margin, loss/temperature, and DCR measurability.
  • Output capacitor network: transient current partitioning and impedance shaping via ESR/ESL across frequency.
Why Multiphase (engineering reasons, not marketing)
  • Ripple reduction: interleaving cancels portions of ripple, improving rail noise and margin.
  • Transient strength: smaller per-phase current steps improve response headroom under burst load.
  • Thermal spreading: heat sources distribute across phases, reducing single-point hotspots.
  • Current density control: phase distribution reduces local copper stress at high currents.
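The ripple-reduction point can be put into numbers with the commonly quoted ideal ripple-cancellation ratio for N interleaved buck phases. The sketch below is an illustration only: it assumes identical inductors and perfect 360°/N interleaving, and the function name and example values are not taken from any particular controller.

```python
import math

def ripple_cancellation_ratio(n_phases: int, duty: float) -> float:
    """Summed inductor ripple current divided by single-phase ripple current.

    Ideal model: identical inductors, phases shifted by 360/N degrees.
    """
    if n_phases < 1:
        raise ValueError("n_phases must be >= 1")
    if not 0.0 < duty < 1.0:
        raise ValueError("duty must be in (0, 1)")
    # Integer part of N*D selects which ripple segment the duty cycle falls in.
    m = math.floor(n_phases * duty)
    num = n_phases * (duty - m / n_phases) * ((m + 1) / n_phases - duty)
    return num / (duty * (1.0 - duty))

# Example: 12 V in, 1.0 V out gives a duty cycle near 1/12 (losses ignored).
duty = 1.0 / 12.0
for n in (1, 2, 4, 6):
    print(n, "phases ->", round(ripple_cancellation_ratio(n, duty), 3))
```

A ratio of 1.0 means no cancellation (single phase); a ratio of 0 occurs where the duty cycle is an exact multiple of 1/N. Real designs land in between and shift with tolerance, which is why the text asks for waveform evidence.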
Verification mindset: every block needs at least one evidence item (waveform, thermal, or register) that matches its responsibility.
Evidence Checklist (minimum proof)
  • Waveforms: Vout transient, switching-node sanity, and phase-to-phase consistency under load steps.
  • Thermal: DrMOS and inductor hotspots plus phase spread under sustained load.
  • Telemetry correlation: PMBus V/I/T and fault/status must align with observed events (no “hidden” trips).
  • Corner repeatability: behavior remains consistent across temperature and mode transitions.
Figure F2 — Multiphase VRM: energy path (top) and signal/management path (bottom)
[Figure F2: two-channel block diagram. Energy path: 12 V Bus (VIN) → Phases 1..N (DrMOS + inductor each) → Summing Node → Cout Network (ESR / ESL) → CPU Rail (Vcore / Vcc). Signal / management path: V/I/T sense taps (including a Vout tap) → ADC → Digital Controller (control + phases + limits, state machine, fault registers) → per-phase PWM / drive commands; states PGOOD / VRHOT# / FAULT; PMBus telemetry (V/I/T + status) for host VRM access.]

Diagram reading: top channel is power movement; bottom channel is how behavior is enforced and observed (sensing, limits, states, PMBus).

H2-4 · Load-Line (Vdroop) Is Protection, Not “Voltage Loss”

Load-line defines how the CPU rail target changes with load current. Its purpose is to shape output impedance so transient risk becomes predictable: fewer dangerous overshoots, controlled ringing, and fewer false OVP/UVP trips.

Definition (engineering meaning)

Load-line is the intentional relationship between Vout and Iload. The slope is commonly treated as an equivalent resistance: ΔV / ΔI (mΩ). This is not “making voltage worse”; it is setting a controllable output-impedance target so the rail remains safe during large di/dt events.
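As a worked illustration of the ΔV / ΔI relationship, the sketch below computes the regulation target at the sense point for a given load. It assumes the VID setpoint is the zero-load target and that the slope is a single constant; the function name and the numeric values are hypothetical.

```python
def loadline_target_v(vid_v: float, i_load_a: float, r_ll_mohm: float) -> float:
    """Expected Vout target at the sense point: VID minus I * R_LL."""
    return vid_v - i_load_a * r_ll_mohm * 1e-3

vid = 1.000   # V, hypothetical VID setpoint (assumed to be the zero-load target)
r_ll = 0.5    # mOhm, hypothetical load-line slope
for i_load in (0, 100, 200):
    print(f"{i_load:>3} A -> {loadline_target_v(vid, i_load, r_ll):.3f} V")
```

With these example numbers, a 200 A load moves the target 100 mV below the setpoint. That offset is intended behavior, so DC accuracy should be judged against this line rather than against the VID value alone.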

What it prevents (why VR platforms use it)
  • Overshoot control: without proper droop, load release can produce overvoltage and ringing that risks false OVP or stress.
  • Undershoot control: with an overly soft droop, load application can create deeper dips, risking UVP and resets.
  • Predictable transients: droop makes the transient “budget” manageable across corners instead of relying on luck.
Slope in mΩ: how to interpret it
  • Slope concept: a steeper slope (larger mΩ) yields more droop at high load; a flatter slope (smaller mΩ) holds Vout “harder.”
  • Where it is measured: the sensing point matters; routing and remote-sense choices change what “droop” looks like.
  • Why it interacts: output capacitors (ESR/ESL), compensation/bandwidth, and current sensing all shape the effective output impedance.
Common pitfalls (symptom → root cause → evidence)
Each entry lists the symptom, then the likely VRM-side cause and the evidence to collect.

Overshoot / ringing on load release
  • Likely VRM-side cause: Load-line too hard (too flat), output impedance not sufficiently shaped, or loop response too aggressive near crossover
  • Evidence to collect: Transient waveform on load release; correlate with OVP flags and verify droop slope vs current sweep

Deep dips on load application
  • Likely VRM-side cause: Load-line too soft (too steep), insufficient transient headroom, or overly conservative limits restricting response
  • Evidence to collect: Load-step undershoot and recovery time; verify per-phase response and check for limit activity

False UVP/OVP trips “without obvious current spikes”
  • Likely VRM-side cause: Current sensing error / filtering mismatch, switching-node noise coupling into sense, or insufficient debounce/blanking
  • Evidence to collect: Waveform vs PMBus telemetry mismatch; inspect fault/status timing and validate sense integrity under fast edges

Good steady-state numbers but unstable behavior under burst
  • Likely VRM-side cause: Mode transition (phase shedding / CCM↔DCM) interacting with droop and loop dynamics
  • Evidence to collect: Repeat step tests across mode boundaries; check phase enabling behavior and transient consistency
Acceptance checklist (procurement-friendly)
  • DC sweep: verify Vout vs Iload follows the intended droop slope consistently across the operating range.
  • Step load: capture undershoot/overshoot and recovery time; validate behavior on both load apply and load release.
  • Corners: repeat at temperature corners and across operating modes (where phase management changes behavior).
  • Correlation: PMBus faults/status must match what waveforms show (no hidden excursions).
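The DC sweep item can be turned into a pass/fail number by fitting a straight line to measured (Iload, Vout) points and comparing the fitted slope with the intended load-line. The following is a minimal sketch using an ordinary least-squares fit; the function names and the sample data are illustrative assumptions, not measurements.

```python
def fit_loadline(i_a, v_v):
    """Least-squares line through (I, V) points.

    Returns (slope_mohm, v_zero_load). The slope is reported as a positive
    resistance when Vout falls as load rises.
    """
    n = len(i_a)
    if n < 2 or n != len(v_v):
        raise ValueError("need at least two matched (I, V) points")
    mean_i = sum(i_a) / n
    mean_v = sum(v_v) / n
    sxx = sum((i - mean_i) ** 2 for i in i_a)
    if sxx == 0:
        raise ValueError("load currents must not all be equal")
    sxy = sum((i - mean_i) * (v - mean_v) for i, v in zip(i_a, v_v))
    slope_v_per_a = sxy / sxx
    return -slope_v_per_a * 1e3, mean_v - slope_v_per_a * mean_i

def max_residual_mv(i_a, v_v):
    """Largest deviation of any point from the fitted line, in mV."""
    slope_mohm, v0 = fit_loadline(i_a, v_v)
    return max(abs(v - (v0 - i * slope_mohm * 1e-3)) for i, v in zip(i_a, v_v)) * 1e3

# Hypothetical sweep data (A, V); real data would come from a scope or DMM.
currents = [0, 50, 100, 150, 200]
volts = [1.000, 0.976, 0.950, 0.924, 0.900]
slope, v0 = fit_loadline(currents, volts)
print(f"slope = {slope:.3f} mOhm, zero-load = {v0:.4f} V, "
      f"max residual = {max_residual_mv(currents, volts):.2f} mV")
```

A large residual relative to the fit suggests the droop is not consistent across the range, which is the condition the checklist asks to rule out. Repeating the fit per temperature corner gives a simple way to compare slopes.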
Practical differentiator: at the same power rating, the platform grade is defined by load-line correctness and transient window—measured, repeatable, and consistent across corners.
Figure F3 — V–I load-line plus transient overlay (hard vs soft droop)
[Figure F3: two plots. Left: Vout vs Iload for hard and soft load-lines, slope = ΔV / ΔI (mΩ), measured at the sense point. Right: Vout vs time for an Iload step, marking undershoot, recovery, and overshoot for hard and soft load-lines.]

Diagram reading: the left plot sets the intended droop slope; the right plot shows how “hard vs soft” droop shifts overshoot/undershoot risk under step load.

H2-5 · Four Current-Sensing Paths for Server VRMs (VR13/VR12+)

Current sensing in a multiphase CPU VRM serves three deliverables: protection thresholds, per-phase current sharing, and telemetry trust. Selection must be driven by error sources, bandwidth, loss, layout sensitivity, and how well the method supports consistent current balancing.

Server-grade rule: “total current looks correct” does not guarantee per-phase sharing is correct. Sharing errors convert directly into hotspots and early failures.
Comparison matrix (selection boundary)
Each method is rated on the same seven attributes.

Inductor DCR sense
  • Error & drift: Strong thermal drift; sensitive to thermal gradients and sense placement
  • Dynamic fidelity: Good for VRM dynamics if routing + filtering are disciplined
  • Loss / burden: Low incremental loss; no large extra copper drop
  • Layout sensitivity: High (parasitics + noise coupling easily bias readings)
  • Current sharing support: Common per-phase approach; sharing quality depends on drift compensation
  • Calibration & production: Requires drift model / temperature compensation / calibration strategy
  • Best fit in VR13 servers: Cost-effective multiphase designs where sharing is actively controlled and validated across corners

Shunt resistor sense
  • Error & drift: Best linearity and stable transfer if thermals are managed
  • Dynamic fidelity: High fidelity when Kelvin routing is correct
  • Loss / burden: Highest I²R loss; adds heat + burden voltage
  • Layout sensitivity: Medium–high (Kelvin sense mandatory; thermal gradients still matter)
  • Current sharing support: Excellent reference for total current; per-phase adds BOM/complexity
  • Calibration & production: Calibration is straightforward; must validate hot-state drift
  • Best fit in VR13 servers: When highest accuracy is required and power/thermal budget can absorb shunt losses

DrMOS RDS(on) sense
  • Error & drift: Large device-to-device and temperature variation; transfer depends on junction conditions
  • Dynamic fidelity: Good bandwidth; must control switching noise feedthrough
  • Loss / burden: Minimal added power loss (uses existing device parameter)
  • Layout sensitivity: Medium (noise and temperature estimation dominate)
  • Current sharing support: Convenient for per-phase sharing; mismatch can silently bias sharing
  • Calibration & production: Needs characterization + normalization; must prove sharing stability over temperature
  • Best fit in VR13 servers: Highly integrated platforms prioritizing density, with disciplined validation of per-phase consistency

CT / magnetic sense (limited use)
  • Error & drift: Method-dependent; can avoid some DC drift issues but adds system constraints
  • Dynamic fidelity: Potentially very high bandwidth; depends on implementation
  • Loss / burden: Varies; introduces magnetics + routing considerations
  • Layout sensitivity: Medium (placement and coupling matter)
  • Current sharing support: Used when sensing constraints demand it; not a mainstream server VRM default
  • Calibration & production: Production flow depends on magnetic tolerances and validation plan
  • Best fit in VR13 servers: Special cases (e.g., uncommon constraints on bandwidth/isolation); keep scope limited in this page
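The thermal drift noted for DCR sensing can be sized with a first-order copper model. The sketch below assumes a linear temperature coefficient of roughly 0.393 %/°C referenced to 20 °C, which is a common approximation for copper windings; the function names and example values are hypothetical, and a real controller's compensation model may differ.

```python
CU_TEMPCO_PER_C = 0.00393  # approximate copper tempco near 20 C (assumption)

def dcr_at_temp(dcr_ref_mohm: float, temp_c: float, temp_ref_c: float = 20.0) -> float:
    """Winding resistance at temp_c, from a linear copper model."""
    return dcr_ref_mohm * (1.0 + CU_TEMPCO_PER_C * (temp_c - temp_ref_c))

def current_from_dcr(v_sense_mv: float, dcr_ref_mohm: float, temp_c: float,
                     temp_ref_c: float = 20.0) -> float:
    """Temperature-compensated current estimate (mV / mOhm = A)."""
    return v_sense_mv / dcr_at_temp(dcr_ref_mohm, temp_c, temp_ref_c)

def uncompensated_error_pct(temp_c: float, temp_ref_c: float = 20.0) -> float:
    """Over-read in percent if the controller assumes the reference DCR."""
    return CU_TEMPCO_PER_C * (temp_c - temp_ref_c) * 100.0

# Example: 0.5 mOhm inductor at a 100 C winding temperature.
print(f"DCR hot: {dcr_at_temp(0.5, 100.0):.4f} mOhm")
print(f"Uncompensated over-read: {uncompensated_error_pct(100.0):.1f} %")
```

Under this model, an 80 °C rise changes the sense gain by roughly 30 %, which would move OCP thresholds and sharing decisions if left uncorrected. The inductor temperature used for compensation is itself an estimate, so the correction should be validated hot.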
How sensing maps to VRM behavior
  • Per-phase current: drives current sharing, phase management, and phase-to-phase thermal balance.
  • Total current (IMON / summed): drives OCP/ILIM thresholds, power reporting, and system acceptance metrics.
  • Filtering and delay: incorrect filtering can make telemetry “look clean” while control decisions are made from biased or delayed current estimates.
Field symptoms that often trace back to sensing bias
  • Normal current readout but abnormal local heating: per-phase bias forces uneven sharing; one phase silently carries more current.
  • Intermittent current-limit events without obvious current spikes: noise injection or filtering mismatch creates false current peaks in the controller domain.
  • Hot-state instability or unexpected derating: drift compensation is insufficient; thresholds shift with temperature.
  • Mode-boundary disturbances (phase shedding / light-load modes): sensed current discontinuities destabilize sharing and limit logic.
Acceptance checklist (minimum proof)
  • Correlation: compare telemetry trends against independent observation under the same load actions (step and steady-state).
  • Sharing integrity: verify per-phase spread stays bounded across temperature and operating modes.
  • Protection truth: confirm OCP/ILIM events align with real waveform evidence (not only register states).
  • Thermal realism: hotspot map must match the expected current distribution; mismatches indicate sensing/estimation bias.
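For the sharing-integrity item, a simple per-phase spread metric makes "bounded" measurable. The sketch below reports the worst deviation from the mean phase current; the function name and the sample readings are hypothetical, and any acceptance limit would have to be set per design.

```python
def phase_spread_pct(phase_currents_a):
    """Return (worst deviation from the mean in percent, index of that phase)."""
    n = len(phase_currents_a)
    if n == 0:
        raise ValueError("no phase currents given")
    mean = sum(phase_currents_a) / n
    if mean <= 0:
        raise ValueError("mean phase current must be positive")
    devs = [(i - mean) / mean * 100.0 for i in phase_currents_a]
    worst = max(range(n), key=lambda k: abs(devs[k]))
    return devs[worst], worst

# Hypothetical per-phase readings in amps; phase index 3 carries extra current.
readings = [30.0, 30.0, 30.0, 36.0]
dev, idx = phase_spread_pct(readings)
print(f"worst phase = {idx}, deviation = {dev:+.1f} % of mean")
```

Running the same metric at several loads and temperatures, and comparing the flagged phase with the thermal map, gives a direct check on whether telemetry and heat agree.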
Figure F4 — Where each sensing method “taps” the VRM (DCR, shunt, RDS(on), and summed IMON)
[Figure F4: simplified VRM (12 V VIN → Phases 1..N, each DrMOS + L → sum node → Cout (ESR / ESL) → CPU rail) showing tap locations for DCR sense, RDS(on) sense, shunt sense, and summed IMON feeding the controller (share + limits + ADC). Error entry points: temperature / routing and parasitic / noise. Outputs: per-phase share and total IMON.]

Diagram reading: a sensing method is defined as much by where it taps and what noise/thermal errors it admits as by nominal accuracy numbers.

H2-6 · Loop & Compensation: Make “Stable” and “Fast” Both Verifiable

A server CPU VRM must be fast enough for di/dt events while staying stable across tolerance, temperature, and mode changes. Digital control does not bypass analog limits: sampling, filtering, and compute/PWM update delay all reduce usable phase margin.

Minimum closed-loop model (what matters in practice)
  • Power stage plant: LC double-pole defines the natural roll-off; output network adds ESR/ESL effects that reshape impedance.
  • Digital reality: sampling + filtering + computation + PWM update create effective delay (phase loss) that limits bandwidth.
  • Sensing interaction: voltage and current sensing quality affects both stability and protection decisions (noise and bias feed into the loop).
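The phase cost of the "digital reality" item can be estimated by treating the combined sampling, computation, and PWM-update latency as a pure delay, whose phase lag is 360° × f × t_delay. The sketch below uses that approximation; the function names and the 1 µs example delay are assumptions for illustration.

```python
def delay_phase_loss_deg(freq_hz: float, delay_s: float) -> float:
    """Phase lag contributed by a pure time delay at a given frequency."""
    return 360.0 * freq_hz * delay_s

def max_crossover_hz(delay_s: float, phase_budget_deg: float) -> float:
    """Highest crossover frequency that keeps delay loss within a phase budget."""
    return phase_budget_deg / (360.0 * delay_s)

delay = 1e-6  # s, hypothetical total loop delay
for fc in (50e3, 100e3, 200e3):
    print(f"fc = {fc / 1e3:.0f} kHz -> {delay_phase_loss_deg(fc, delay):.0f} deg lost")
print(f"fc limit for an 18 deg budget: {max_crossover_hz(delay, 18.0) / 1e3:.0f} kHz")
```

Because the loss grows linearly with frequency, doubling the crossover doubles the phase that compensation must recover. This is one way to read "delay-limited" when choosing a crossover band.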
Type-II vs Type-III (selection boundary)
  • Type-II: suitable when the target crossover does not demand aggressive phase boost and when tolerance/temperature margins must stay conservative.
  • Type-III: used when the design needs higher crossover or stronger shaping near the crossover region to preserve phase margin under delay and output-network variation.
  • Selection criterion: choose the simplest compensation that still achieves the target crossover band and phase-margin band across corners.
Targets as ranges (server-grade acceptance)

Use bands instead of a single number: set a target crossover band (fc) that respects switching frequency and effective delay, and set a target phase-margin band (PM) that absorbs component tolerances, temperature drift, and mode-boundary shifts. A “fast” loop that cannot hold PM across corners is not server-grade.

Target fc band · Target PM band · Delay-limited · Corner-stable
Verification path (must pass all three)
  • Bode measurement: confirm crossover lands in the target band and phase margin lands in the target band.
  • Step load response: verify undershoot/overshoot, recovery time, and ringing stay within the transient window.
  • Per-phase current behavior: confirm transient sharing stays bounded; “pretty Vout” with bad sharing hides hotspots and long-term failures.
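The Bode check can be automated from exported frequency-response data. The sketch below finds the first falling 0 dB crossing by interpolating in log-frequency and reports phase margin as 180° plus the loop phase at that point. That phase convention is an assumption: some analyzers already report margin directly, so the offset should be confirmed for the instrument in use. Function names and the sample data are hypothetical.

```python
import math

def crossover_and_pm(freq_hz, gain_db, phase_deg):
    """Return (fc_hz, phase_margin_deg) at the first falling 0 dB crossing,
    or None if the gain never crosses 0 dB in the data."""
    for k in range(len(freq_hz) - 1):
        g0, g1 = gain_db[k], gain_db[k + 1]
        if g0 >= 0.0 > g1:
            t = g0 / (g0 - g1)  # fractional position between the two points
            lf0, lf1 = math.log10(freq_hz[k]), math.log10(freq_hz[k + 1])
            fc = 10.0 ** (lf0 + t * (lf1 - lf0))
            phase = phase_deg[k] + t * (phase_deg[k + 1] - phase_deg[k])
            return fc, 180.0 + phase
    return None

def in_band(value: float, lo: float, hi: float) -> bool:
    return lo <= value <= hi

# Hypothetical sweep: frequency (Hz), loop gain (dB), loop phase (deg).
f = [10e3, 30e3, 100e3, 300e3]
g = [22.0, 10.0, -4.0, -18.0]
p = [-95.0, -110.0, -130.0, -160.0]
fc, pm = crossover_and_pm(f, g, p)
print(f"fc = {fc / 1e3:.1f} kHz, PM = {pm:.1f} deg")
print("fc in band:", in_band(fc, 40e3, 90e3), "| PM in band:", in_band(pm, 45.0, 75.0))
```

The band limits in the example are placeholders. Running the same check on sweeps taken at each temperature and operating mode shows whether the measured crossover and margin stay inside the chosen bands.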
Practical acceptance: a loop is “done” only when Bode, step load, and per-phase behavior remain consistent across temperature and operating modes.
Figure F5 — Bode targets shown as bands (crossover band and phase-margin band)
[Figure F5: Bode gain (magnitude vs frequency) and phase (phase vs frequency) plots with the 0 dB line, a shaded target fc band, a shaded target PM band, and annotations for LC / ESR shaping and delay.]

Diagram reading: set targets as ranges; validate that measured crossover and phase margin stay inside bands across temperature and operating modes.

H2-7 · Transient Response & Phase Management: Link N / fsw / L / Cout

Transient performance is not a single number. Server VRMs are judged by a window: undershoot, overshoot, recovery time, and ringing. The dominant design skill is linking phase count (N), switching frequency (fsw), inductance (L), and the output capacitor network (Cout) without breaking stability or thermals.

Treat transient as four acceptance windows
  • Undershoot: minimum Vout during a load step (di/dt-driven). Tightest constraint for CPU rails.
  • Overshoot: peak above target on load release or on rebound after correction; can trip OVP or stress the load.
  • Recovery time: time to return into the steady-state window.
  • Ringing: oscillation amplitude and decay; often influenced by output-network damping and mode boundaries.
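Three of the four windows can be extracted directly from a captured step-load waveform. The sketch below assumes the first sample is the load-step instant, that `v_target` is the post-step regulation target (including any load-line offset), and that "recovered" means staying inside ±`window_v` for the rest of the capture. Ringing amplitude and decay need a separate analysis and are not covered here. All names and sample values are hypothetical.

```python
def transient_metrics(t_s, v_v, v_target, window_v):
    """Undershoot, overshoot, and recovery time from a sampled Vout capture."""
    if len(t_s) != len(v_v) or not v_v:
        raise ValueError("t_s and v_v must be non-empty and equal length")
    undershoot = max(0.0, v_target - min(v_v))
    overshoot = max(0.0, max(v_v) - v_target)
    outside = [k for k, v in enumerate(v_v) if abs(v - v_target) > window_v]
    if not outside:
        recovery = 0.0
    elif outside[-1] == len(v_v) - 1:
        recovery = None  # capture ended before the rail settled
    else:
        # First sample after the last excursion outside the window.
        recovery = t_s[outside[-1] + 1] - t_s[0]
    return {"undershoot_v": undershoot, "overshoot_v": overshoot,
            "recovery_s": recovery}

# Hypothetical capture: 1 us sample spacing around a load step.
t = [k * 1e-6 for k in range(6)]
v = [1.000, 0.930, 0.970, 1.020, 1.005, 1.000]
print(transient_metrics(t, v, v_target=1.000, window_v=0.010))
```

Applying the same function to captures taken at different step amplitudes, edge rates, and temperatures makes it easier to compare runs on identical terms.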
Four knobs that must move together
  • Phase count (N): lowers per-phase current and spreads heat; also raises effective ripple frequency. More phases are not automatically “faster” if phase management and delay limit usable bandwidth.
  • Switching frequency (fsw): can improve control authority, but increases switching and driver losses and raises EMI sensitivity.
  • Inductance (L): lower L improves di/dt capability but increases ripple and peak currents, raising RMS loss and saturation risk.
  • Cout network: MLCC handles high-frequency current (low ESL/ESR) but loses capacitance under DC bias; polymer adds damping (ESR) and mid/low-frequency energy to control ringing and recovery.
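The linkage between N, fsw, and L can be seen with three first-order buck relations: effective ripple frequency N × fsw, per-phase ripple Vout(1 − D)/(L × fsw), and the upper bound on summed inductor current slew with all phases driven the same way. The sketch below ignores controller delay, duty-cycle limits, losses, and inductance roll-off with current. The function names and example values are hypothetical.

```python
def effective_ripple_freq_hz(n_phases: int, fsw_hz: float) -> float:
    """Ripple frequency seen at the output with ideal interleaving."""
    return n_phases * fsw_hz

def phase_ripple_a(vin: float, vout: float, l_h: float, fsw_hz: float) -> float:
    """Peak-to-peak inductor ripple current in one phase (ideal buck)."""
    duty = vout / vin
    return vout * (1.0 - duty) / (l_h * fsw_hz)

def max_slew_a_per_us(n_phases: int, vin: float, vout: float, l_h: float):
    """Upper bounds on summed inductor current slew, as (rise, fall) in A/us.

    Rise assumes every high-side switch is on; fall assumes every one is off.
    """
    rise = n_phases * (vin - vout) / l_h * 1e-6
    fall = n_phases * vout / l_h * 1e-6
    return rise, fall

n, vin, vout, l_h, fsw = 6, 12.0, 1.0, 150e-9, 600e3  # example values only
rise, fall = max_slew_a_per_us(n, vin, vout, l_h)
print(f"effective ripple frequency: {effective_ripple_freq_hz(n, fsw) / 1e6:.1f} MHz")
print(f"per-phase ripple: {phase_ripple_a(vin, vout, l_h, fsw):.2f} A p-p")
print(f"slew bound: +{rise:.0f} A/us on load apply, -{fall:.0f} A/us on load release")
```

With a low output voltage, the fall-rate bound is much smaller than the rise-rate bound, since only Vout is available to ramp the inductors down. This asymmetry is one reason load release and overshoot lean on the Cout network and load-line rather than on loop speed alone.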
Common field pattern: average load looks stable, but burst load causes droop or protection events. A frequent root cause is phase-shedding / mode boundary interaction with compensation and current-limit logic.
Phase management trade-offs (efficiency vs readiness)
  • Phase shedding: improves light-load efficiency but introduces a boundary where plant and sharing behavior change.
  • Diode emulation: reduces reverse current loss in light load; can change damping and transition behavior.
  • CCM/DCM boundaries: mode transitions can create discontinuities in sensed current and effective control delay if filtering is not aligned.
Minimum verification (VRM-side only)
  • Step load: vary amplitude and edge rate; capture undershoot/overshoot/recovery/ringing consistently.
  • Boundary crossing: force light-to-heavy transitions across phase-shedding thresholds; confirm clean transitions.
  • Thermal realism: validate that transient tuning does not create a single-phase hotspot under burst patterns.
N · fsw · L · MLCC · Polymer · Shedding · CCM/DCM
Figure F6 — Knob linkage map: N / fsw / L / Cout and impact trends (transient + thermal)
[Figure F6: linkage map from design knobs (N: share / ripple; fsw: speed / loss; L: di/dt / saturation; Cout network: MLCC + polymer) to outcomes: transient window (undershoot = min Vout, overshoot = peak rebound, recovery = settling time, ringing = damping) and thermal / efficiency (loss, hotspot, efficiency). Trend notes: heat share ↑, loss ↑, di/dt ↑ when L ↓, damping ↑ with polymer. Mode boundaries under burst: shedding, diode emulation, CCM/DCM.]

Diagram reading: adjust knobs as a set; check that phase-management boundaries do not destabilize compensation or limit logic during burst loads.

H2-8 · DrMOS / Power Stage & Inductor Selection: Efficiency and Thermals Are Gates

In server VRMs, “works on the bench” is not the finish line. Reliability is set by power-stage loss breakdown, SOA under short bursts, and the real heat path into copper and airflow. Inductor saturation and phase-to-phase variation are equally critical because they reshape current sharing and hotspots.

DrMOS selection: four practical checks
  • Current capability & SOA: separate short peak windows from continuous current; hot-state margins matter most.
  • Loss breakdown: conduction (RDS(on)), switching (Qg/Qgd, tr/tf), and driver/dead-time related losses rise quickly with fsw and transient stress.
  • Reverse and transition losses: mode and boundary behavior can create “hidden loss” even when average current is modest.
  • Thermal resistance & layout path: junction-to-board and board-to-air paths define whether SOA is real or theoretical.
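A first-order loss budget helps connect the loss-breakdown and thermal-path checks. The sketch below uses textbook approximations: conduction loss weighted by duty cycle between high-side and low-side RDS(on), and hard-switching overlap loss of 0.5 × Vin × I × (tr + tf) × fsw. It ignores ripple current, gate-drive, dead-time, and reverse-recovery losses, so it is a lower bound rather than a datasheet-grade estimate. All names and values are hypothetical.

```python
def power_stage_loss_w(i_phase_a, vin, vout, rds_hs_mohm, rds_ls_mohm,
                       t_sw_ns, fsw_hz):
    """Return (conduction_w, switching_w) for one phase, first-order only.

    t_sw_ns is the combined rise + fall overlap time of the high-side switch.
    """
    duty = vout / vin
    r_eff_mohm = duty * rds_hs_mohm + (1.0 - duty) * rds_ls_mohm
    p_cond = i_phase_a ** 2 * r_eff_mohm * 1e-3
    p_sw = 0.5 * vin * i_phase_a * (t_sw_ns * 1e-9) * fsw_hz
    return p_cond, p_sw

def junction_temp_c(t_ambient_c, p_total_w, theta_ja_c_per_w):
    """Steady-state junction estimate from a single thermal resistance."""
    return t_ambient_c + p_total_w * theta_ja_c_per_w

# Example values only: 40 A per phase, 12 V -> 1.0 V, 600 kHz.
p_cond, p_sw = power_stage_loss_w(40.0, 12.0, 1.0, 4.0, 1.0, 10.0, 600e3)
tj = junction_temp_c(55.0, p_cond + p_sw, 10.0)
print(f"conduction = {p_cond:.2f} W, switching = {p_sw:.2f} W, Tj ~ {tj:.1f} C")
```

The single thermal resistance stands in for the whole junction-to-air path, which in practice depends on copper area, heatsink contact, and airflow. The value used should come from measurement on the actual board rather than from a package datasheet figure.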
Inductor selection: what changes sharing and lifetime
  • Saturation current: must cover transient peak at temperature corners; saturation reshapes plant and can amplify droop/ringing.
  • Temperature rise curve: sets long-term reliability and phase-to-phase thermal balance.
  • DCR: contributes to loss and, when used for sensing, to current-sharing accuracy under drift.
  • Noise / vibration risk: mechanical excitation can cause audible artifacts and reliability concerns under burst loads.
Production reality: small phase-to-phase variations (RDS(on), DCR, thermal contact, airflow) can bias current sharing and create a single-phase hotspot. Validation must include corner and sample spread.
Acceptance checklist (server-grade)
  • Thermal mapping: confirm hotspot location and temperature under continuous and burst patterns.
  • Short-burst stress: prove hot-state SOA margin during burst load and mode transitions.
  • Sharing integrity: verify per-phase temperature and current remain bounded (no “one phase carries it”).
  • Inductor safety: confirm saturation margin and temperature rise across airflow and ambient corners.
Figure F7 — Heat path cross-section: DrMOS → copper → heatsink/airflow (with hotspot and measurement points)
[Figure F7: thermal cross-section. Power stage: DrMOS junction (Tj) → package / leadframe → thermal pad → PCB copper (top copper, inner planes, bottom copper), with a hotspot-risk marker and a nearby NTC. Cooling + magnetics: heatsink contact + interface, airflow, and the inductor (saturation, DCR, temperature rise) with its own hotspot, vibration, and NTC markers.]

Diagram reading: SOA claims become real only when the junction-to-air path is intact; magnetics must hold saturation and temperature rise under burst patterns and airflow corners.

H2-9 · Protection & Fault Handling: Trigger → Response → Reset

Server CPU VRMs are evaluated by a complete protection loop: the trigger condition, the response action, the external signals exposed to the platform, and the reset/retry policy. Misalignment in sensing, blanking, or debounce can cause repeated faults even when no obvious overcurrent is present.

Protection loop, written as an engineering contract
  • Trigger: OCP / OVP / UVP / OTP / Short
  • Response: warning (throttle/limit) → fault (shut down or latch) → retry (hiccup) if enabled
  • External signals: PGOOD, VRHOT#, FAULT#, (optional) ALERT# and IMON behavior
  • Record: PMBus status/fault registers and optional “snapshot” at the event boundary
  • Reset: auto-recovery, host clear, or power-cycle depending on policy
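The contract above can be sketched as a small state machine. This is an illustrative model only, not any controller's actual logic: the class and state names are hypothetical, each call to `step` represents one policy evaluation, and the retry counter is assumed to reset only on a host clear.

```python
from enum import Enum, auto

class VrmState(Enum):
    NORMAL = auto()
    WARNING = auto()
    FAULT = auto()
    RETRY = auto()
    LATCHED = auto()

class ProtectionFsm:
    def __init__(self, max_retries: int = 3):
        self.state = VrmState.NORMAL
        self.max_retries = max_retries
        self.retries = 0
        self.log = []  # (state name, reason): stands in for a fault record

    def _go(self, new_state, reason):
        self.state = new_state
        self.log.append((new_state.name, reason))

    def step(self, warn=False, fault=False, host_clear=False):
        s = self.state
        if s is VrmState.LATCHED:
            if host_clear:               # only an explicit clear leaves latch-off
                self.retries = 0
                self._go(VrmState.NORMAL, "host clear")
        elif s is VrmState.FAULT:
            if self.retries < self.max_retries:
                self.retries += 1
                self._go(VrmState.RETRY, f"hiccup retry {self.retries}")
            else:
                self._go(VrmState.LATCHED, "retries exhausted")
        elif fault:
            self._go(VrmState.FAULT, "fault trigger")
        elif s is VrmState.RETRY:
            self._go(VrmState.NORMAL, "restart ok")
        elif warn and s is VrmState.NORMAL:
            self._go(VrmState.WARNING, "warning threshold")
        elif not warn and s is VrmState.WARNING:
            self._go(VrmState.NORMAL, "warning cleared")
        return self.state

# A persistent fault with one allowed retry ends in latch-off.
fsm = ProtectionFsm(max_retries=1)
for _ in range(4):
    fsm.step(fault=True)
for entry in fsm.log:
    print(entry)
```

Writing the policy down in this form makes the reset behavior explicit and gives each transition a reason that can be compared against the status and fault registers captured on real hardware.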
OCP: three response styles and their trade-offs
Each mode lists what it does, then the typical trade-offs.

Cycle-by-cycle limit
  • What it does: Limits phase current every switching cycle to prevent runaway while keeping output alive.
  • Typical trade-offs: Best transient continuity, but more sensitive to sensing noise and delay; can reshape output impedance and stability margins.

Hiccup (retry)
  • What it does: Shuts down for a cool-off interval, then retries with a controlled restart window.
  • Typical trade-offs: Improves SOA safety and reduces thermal stress, but appears as repeated brownouts/restarts at the platform level.

Shutdown + latch
  • What it does: Disables output and holds off until a defined clear condition occurs.
  • Typical trade-offs: Strongest protection, but lowest availability; misconfiguration or false triggers cause severe downtime.
OVP / UVP: threshold, debounce, and blanking are not optional
  • Threshold alignment: OVP/UVP windows must tolerate legitimate transient excursions while still catching real faults quickly.
  • Debounce: prevents single-cycle glitches or measurement spikes from causing a fault state.
  • Blanking: avoids treating expected startup or load-step edges as faults; too short causes nuisance trips, too long delays true protection.
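The interaction between debounce and blanking can be shown with a small sample-based model. The sketch below is one simple interpretation, not a description of any specific controller: blanking ignores the first samples after an event, and debounce requires a number of consecutive over-threshold samples before asserting. The function name and sample data are hypothetical.

```python
def debounced_fault(samples_v, threshold_v, debounce_count, blank_samples=0):
    """Return the sample index where an over-threshold fault asserts, or None.

    Samples with index < blank_samples are ignored (blanking). After that,
    the fault asserts on the debounce_count-th consecutive sample above
    threshold; any sample at or below threshold resets the run.
    """
    run = 0
    for k, v in enumerate(samples_v):
        if k < blank_samples:
            continue
        run = run + 1 if v > threshold_v else 0
        if run >= debounce_count:
            return k
    return None

glitch = [1.00, 1.30, 1.00, 1.00, 1.00]     # single-sample spike
sustained = [1.00, 1.30, 1.30, 1.30, 1.30]  # real overvoltage
print("glitch, no debounce:  ", debounced_fault(glitch, 1.20, 1))
print("glitch, debounce of 2:", debounced_fault(glitch, 1.20, 2))
print("sustained, debounce 2:", debounced_fault(sustained, 1.20, 2))
```

The same model shows the cost side: each added debounce sample or blanking interval delays the response to a sustained fault by that much, which is the trade-off the threshold-alignment item describes.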
OTP: internal temperature vs external NTC, derating vs shutdown
  • Internal (Tj): fast protection and accurate device stress indicator; depends on heat path quality.
  • External (NTC): slower but reflects board/copper and airflow realities.
  • Derating: maintains availability if ramp/restore logic is stable.
  • Hard shutdown: maximal safety when thermal margins collapse or short/overload persists.
Why faults happen without “obvious overcurrent”: current-sense bias/temperature drift, switch-node ringing/ground bounce affecting comparators, and PGOOD/blanking/debounce settings that treat allowed transients as violations.
Short-circuit: SOA protection and the fastest disable path
  • Primary risk: peak stress and delay, not average current. Rapid response must remain stable under ringing and measurement noise.
  • Practical expectation: a defined hard-off path plus a deterministic retry/latch policy and a recordable fault signature.
OCP · OVP · UVP · OTP · PGOOD · VRHOT# · FAULT# · PMBus
Figure F8 — Fault state machine: Normal → Warning → Fault → Retry/Latch (with PMBus snapshot points)
[Figure F8: state diagram. Normal (PGOOD = 1, telemetry OK) → Warning / Throttle (VRHOT# active, limit / derate) → Fault (PGOOD drop, FAULT# assert) → Retry (hiccup: cool-off, controlled restart) or Latch-off (wait for clear or power-cycle). Triggers: OCP, OVP, UVP, OTP, short. PMBus snapshot at entry. External signals: PGOOD, VRHOT#, FAULT#, ALERT# (optional).]

Diagram reading: define a deterministic reset policy and ensure PMBus registers capture the boundary conditions (noise, debounce, blanking) that cause nuisance trips.

H2-10 · PMBus Telemetry & Observability: Data That Actually Helps Debug

VRM telemetry is only useful when it forms a proof chain. Continuous readings validate steady-state and sharing; status/fault registers explain protection events; and event snapshots (when available) reveal what averaged telemetry cannot: short transients.

Telemetry layers (from “numbers” to “evidence”)
  • Continuous telemetry: Vout / Iout / Pout, per-phase current, temperature, efficiency estimate
  • Status & fault: status word, fault codes, threshold flags, protection counters (if supported)
  • Event snapshot (optional): black-box register freeze around a fault boundary
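Continuous telemetry words usually need decoding before they become volts, amps, or degrees. The sketch below decodes the PMBus LINEAR11 format (5-bit signed exponent, 11-bit signed mantissa), which is commonly used for readings such as READ_IOUT and READ_TEMPERATURE_1. Whether a given controller uses LINEAR11 for a particular register is an assumption to confirm against its PMBus map; Vout in particular is often reported in a different format selected by VOUT_MODE. The raw values are made up for the example.

```python
def decode_linear11(raw: int) -> float:
    """Decode a 16-bit PMBus LINEAR11 word into a real-world value.

    Bits 15..11: signed 5-bit exponent N
    Bits 10..0 : signed 11-bit mantissa Y
    Value = Y * 2**N
    """
    exponent = (raw >> 11) & 0x1F
    mantissa = raw & 0x7FF
    if exponent > 0x0F:      # sign-extend the 5-bit exponent
        exponent -= 0x20
    if mantissa > 0x3FF:     # sign-extend the 11-bit mantissa
        mantissa -= 0x800
    return mantissa * 2.0 ** exponent


print(decode_linear11(0xF0C8))  # 50.0  (mantissa 200, exponent -2)
print(decode_linear11(0x07FF))  # -1.0  (mantissa -1, exponent 0)
```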
Acceptance fields (bring-up / production)
  • Steady Vout: within the defined window across load and temperature corners.
  • Current sharing: per-phase spread bounded; no persistent “hot phase”.
  • Thermal trend: temperature rise and slope under realistic airflow.
  • Signal behavior: PGOOD/VRHOT# transitions match the intended policy.
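The current-sharing acceptance field can be reduced to one number per reading. The sketch below computes the worst-case deviation from the per-phase mean; the function name, the example currents, and the 10 % limit are hypothetical, and the real limit must come from the platform's own acceptance criteria.

```python
from statistics import mean
from typing import Sequence


def phase_spread_pct(phase_currents: Sequence[float]) -> float:
    """Worst-case deviation from the per-phase mean, as a percentage of the mean."""
    avg = mean(phase_currents)
    if avg <= 0:
        raise ValueError("mean phase current must be positive")
    return max(abs(i - avg) for i in phase_currents) / avg * 100.0


SPREAD_LIMIT_PCT = 10.0  # hypothetical acceptance limit

for reading in ([25.0, 24.0, 26.0, 25.0], [20.0, 20.0, 20.0, 40.0]):
    spread = phase_spread_pct(reading)
    verdict = "pass" if spread <= SPREAD_LIMIT_PCT else "hot phase suspected"
    print(f"{spread:.1f} % -> {verdict}")
```

Both readings sum to 100 A, so a total-current check alone would not distinguish them; the spread metric does.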
Debug fields (field / lab)
  • Fault code + status word: the first evidence layer; classify OCP/OVP/UVP/OTP pathways.
  • Per-phase deviation: detects sensing bias, magnetics variation, or thermal contact imbalance.
  • Temperature slope: separates heat-path issues from short transient events.
  • Protection counters / retry counts: identifies nuisance trips vs persistent faults (hiccup patterns).
“Readings look stable” can be an illusion: telemetry is often low-rate and averaged. A fault can be triggered by a short transient that never appears in filtered registers. When available, snapshot/black-box evidence outranks slow averages.
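A short numeric sketch shows how averaging can hide the event that tripped protection. All numbers are assumptions for illustration: a 1 µs sample period, a 1 ms averaging block, a 20 µs dip to 0.850 V on a 1.000 V rail, and a 0.900 V UVP threshold.

```python
from typing import List, Sequence


def block_average(samples: Sequence[float], window: int) -> List[float]:
    """Average non-overlapping blocks of `window` samples (a crude telemetry filter)."""
    return [sum(samples[i:i + window]) / window
            for i in range(0, len(samples) - window + 1, window)]


# Hypothetical 1 µs sample period: 2 ms of a 1.000 V rail with a 20 µs dip to 0.850 V.
vout = [1.000] * 2000
for i in range(500, 520):
    vout[i] = 0.850

reported = block_average(vout, window=1000)   # one telemetry value per 1 ms
uvp = 0.900                                   # hypothetical UVP threshold

print(min(vout) < uvp)           # True: the raw rail crosses the threshold
print(min(reported) < uvp)       # False: the averaged register never shows it
print(round(min(reported), 3))   # 0.997
```

The comparator sees a 150 mV excursion while the averaged register moves by only 3 mV, which is why a fault log can disagree with telemetry that "looks stable".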
Minimum observability loop (VRM-side only)
  • 1) Start with status/fault: classify the trigger and response path.
  • 2) Check V/I + per-phase spread: confirm sharing and detect sensing bias.
  • 3) Check thermal trend: identify heat-path/hotspot patterns vs transient-only events.
  • 4) Correlate with policy: debounce/blanking/retry/latch settings decide whether the event becomes downtime.
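Step 1 of the loop above usually starts with the PMBus STATUS_WORD. The sketch below maps set bits to the flag names as defined in the PMBus specification; treat the table as a starting point and confirm it against the controller's own PMBus map, since manufacturer-specific bits and detailed cause codes live in other registers. The example value is made up.

```python
from typing import List

# STATUS_WORD bit names per the PMBus specification (low byte equals STATUS_BYTE).
STATUS_WORD_BITS = {
    15: "VOUT",
    14: "IOUT/POUT",
    13: "INPUT",
    12: "MFR_SPECIFIC",
    11: "POWER_GOOD# (negated)",
    10: "FANS",
    9: "OTHER",
    8: "UNKNOWN",
    7: "BUSY",
    6: "OFF",
    5: "VOUT_OV_FAULT",
    4: "IOUT_OC_FAULT",
    3: "VIN_UV_FAULT",
    2: "TEMPERATURE",
    1: "CML",
    0: "NONE_OF_THE_ABOVE",
}


def decode_status_word(word: int) -> List[str]:
    """Return the names of all set bits, most significant first."""
    return [name for bit, name in sorted(STATUS_WORD_BITS.items(), reverse=True)
            if word & (1 << bit)]


# Hypothetical reading after an overcurrent shutdown.
print(decode_status_word(0x4850))
# ['IOUT/POUT', 'POWER_GOOD# (negated)', 'OFF', 'IOUT_OC_FAULT']
```

A summary bit such as IOUT/POUT only says which detailed status register to read next; the per-category registers hold the actual cause.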
Figure F9 — Telemetry pipeline: sense → sampling → filtering → registers → PMBus (where transients disappear)
[Figure F9 content: PMBus observability, from sensing to evidence. Sense points (Vout, Iout/IMON, per-phase current, internal Tj, board NTC) → Processing (sampling/ADC: rate and delay; filter/average: smooths and hides spikes, so transients are hidden here; event detect: threshold and debounce) → Registers and PMBus (telemetry registers: averaged; fault registers: cause codes; snapshot: optional).]

Diagram reading: averaged telemetry validates steady-state, while fault registers and snapshots explain events. Filtering and readout rate can hide the transient that triggered protection.

H2-11 · Validation & Production Test: Proving VR13/VR12+ Compliance

Server CPU VRMs are accepted by evidence, not by nominal power rating. A practical test plan must create a repeatable chain of proof from lab characterization to production screening to field self-checks—while staying strictly on the VRM side (Vcore rail, phases, protections, and PMBus telemetry).

1) Three-Layer Proof Framework (what each layer must prove)

Layer | Primary goal | Minimum evidence (VRM-side)
R&D | Prove the design is both stable and fast, across corners. | Bode target window, step-load transient (ΔV & recovery), thermal steady state, phase current balance, protection trigger/response/retry behavior, telemetry sanity.
Production | Fast screening: catch assembly, configuration, and early-life defects with low false rejects. | PGOOD behavior, basic accuracy at 2–3 load points, short burst load check, temperature spot-check, PMBus communication + key status/fault registers readable/clearable.
Field | Keep failures diagnosable and recoverable with consistent evidence retention. | Power-up voltage window self-check, fault/status clear policy, boot history counters, “last-fault snapshot” if supported by the controller/PMBus map.

2) R&D Validation: the 6 things that must be demonstrated

  • Loop stability under real delays: confirm crossover/phase margin with the actual sensing path, digital sampling, and PWM latency in place.
  • Load-step transient compliance: record undershoot/overshoot, settling time, and ringing at defined di/dt steps that represent burst conditions.
  • Thermal steady-state margin: verify hotspots (DrMOS + inductors) and the thermal gradient along the copper/airflow path.
  • Current sharing as a system metric: validate phase-to-phase current spread and correlate it to temperature spread (electrical sharing and thermal sharing).
  • Protection closure: prove “trigger → response → reset/retry” is deterministic (cycle-by-cycle vs hiccup vs latch-off), and logs match the event.
  • Telemetry trust boundary: verify what telemetry can and cannot show (filtering/update rate vs transient reality), and define capture methods for fast events.
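The load-step item above becomes comparable across builds once each capture is reduced to a few numbers. The sketch below extracts undershoot, overshoot, and settling time from one exported waveform. The function name, the settling band, and the sample data are assumptions; `nominal` is meant to be the post-step target, which on a rail with a load-line includes the expected droop.

```python
from typing import Optional, Sequence, Tuple


def step_metrics(t: Sequence[float], v: Sequence[float], nominal: float,
                 band: float) -> Tuple[float, float, Optional[float]]:
    """Return (undershoot, overshoot, settling_time) for one captured load step.

    undershoot / overshoot are positive magnitudes relative to `nominal`.
    settling_time is measured from t[0] to the first sample after which the
    trace stays within nominal +/- band; None if the trace never settles.
    """
    undershoot = max(0.0, nominal - min(v))
    overshoot = max(0.0, max(v) - nominal)
    settle: Optional[float] = None
    for i in range(len(v) - 1, -1, -1):      # walk backwards from the end
        if abs(v[i] - nominal) > band:       # last out-of-band sample found
            if i + 1 < len(v):
                settle = t[i + 1] - t[0]
            break
    else:
        settle = 0.0                         # the trace never left the band
    return undershoot, overshoot, settle


# Hypothetical capture: time in microseconds, Vout in volts.
t_us = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
vout = [1.000, 0.940, 0.960, 1.010, 0.997, 1.002, 1.000, 1.000]

under, over, settle = step_metrics(t_us, vout, nominal=1.000, band=0.005)
print(f"undershoot={under * 1000:.0f} mV, overshoot={over * 1000:.0f} mV, settle={settle} us")
```

For this trace the result is roughly 60 mV undershoot, 10 mV overshoot, and settling at 4 µs. The numbers are only as good as the capture, so the probing guidance elsewhere on this page still applies.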

3) Production Test: the smallest fast-screen set (low ambiguity)

  • PGOOD / FAULT pins: correct assertion behavior under nominal power-up and a short controlled disturbance.
  • Basic regulation checks: light + mid load points for static error, plus one short burst load for gross dynamic failures.
  • Thermal spot-check: quick IR/NTC reading to catch solder/thermal-pad defects and abnormal phase heating.
  • PMBus bring-up: confirm basic reads, status word/fault registers, and the intended “clear-on-read / clear-on-command” policy.

4) Field Self-Check: preserving evidence without over-scoping

  • Power-up window check: Vout within window after soft-start; if not, read fault/status first, then attempt recovery.
  • Evidence retention: keep counters and last-fault classification when possible; avoid clearing everything blindly on every boot.
  • Consistency rules: define what must be latched (e.g., last OCP/OTP class) vs what can be auto-cleared (benign warnings).
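The ordering rule in these bullets (read evidence, persist it, and only then clear) can be sketched with the bus access abstracted behind two callables. STATUS_WORD (0x79) and CLEAR_FAULTS (0x03) are standard PMBus command codes; everything else here, including `capture_then_clear`, `FakeVrm`, and the history record layout, is a hypothetical stand-in for a real driver and log store.

```python
from typing import Callable, Dict, List

STATUS_WORD = 0x79    # PMBus command code
CLEAR_FAULTS = 0x03   # PMBus command code


def capture_then_clear(read_word: Callable[[int], int],
                       send_command: Callable[[int], None],
                       history: List[Dict[str, int]],
                       boot_count: int) -> int:
    """Read STATUS_WORD, persist it if non-zero, and only then clear faults."""
    status = read_word(STATUS_WORD)
    if status != 0:
        history.append({"boot": boot_count, "status_word": status})
        send_command(CLEAR_FAULTS)
    return status


class FakeVrm:
    """Stand-in for a real bus driver; holds one sticky STATUS_WORD value."""

    def __init__(self, status: int) -> None:
        self.status = status

    def read_word(self, command: int) -> int:
        return self.status if command == STATUS_WORD else 0

    def send_command(self, command: int) -> None:
        if command == CLEAR_FAULTS:
            self.status = 0


vrm = FakeVrm(status=0x4850)          # hypothetical leftover fault from the last run
history: List[Dict[str, int]] = []
capture_then_clear(vrm.read_word, vrm.send_command, history, boot_count=7)
print([hex(h["status_word"]) for h in history], hex(vrm.status))  # ['0x4850'] 0x0
```

A real policy would also decide which classes stay latched across boots; the sketch only guarantees that nothing is cleared before it has been recorded.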

5) Minimal Test Matrix (corners that actually matter)

A minimal matrix that still catches most VR13/VR12+ escapes includes: Temperature (cold/hot), VIN (low/high), Load (light/mid/heavy + step), and phase mode (shedding vs all-on).
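The corner list can be enumerated mechanically so that no combination is skipped by accident. The sketch below builds the full cross product and then a reduced set that pairs phase mode with load level. The reduction rule (shedding for light/mid, all-on for heavy, both for step) is one reading of Figure F11 and should be treated as an assumption, not a requirement.

```python
from itertools import product
from typing import Dict, List

corners = {
    "temp": ["cold", "hot"],
    "vin": ["low", "high"],
    "load": ["light", "mid", "heavy", "step"],
    "mode": ["shedding", "all-on"],
}

# Full cross product of every corner value.
full_matrix: List[Dict[str, str]] = [
    dict(zip(corners, combo)) for combo in product(*corners.values())
]

# Reduced set: only test the phase modes that are plausible at each load level.
allowed_modes = {
    "light": {"shedding"},
    "mid": {"shedding"},
    "heavy": {"all-on"},
    "step": {"shedding", "all-on"},
}
reduced_matrix = [c for c in full_matrix if c["mode"] in allowed_modes[c["load"]]]

print(len(full_matrix), len(reduced_matrix))  # 32 20
```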

Example material numbers (reference only; confirm platform/interface fit)
  • VR13 digital multiphase controllers: TI TPS53688RSBT; MPS MP2965; Renesas RAA229131GNP#HA0
  • Power stages / DrMOS / SPS: Renesas ISL99390R5935 / ISL99390BR5935; Infineon TDA21490AUMA1; Vishay SiC658A
  • Phase doubler (high phase count builds): Renesas ISL6617A / ISL6617AFRZ
  • Load equipment (examples): Keysight N3300A mainframe + N3306A module; Chroma 6314A mainframe
  • Stability/PI tools (examples): Picotest J2100A/J2101A injection transformer, J2111A current injector; Keysight N7020A power rail probe
  • Probes (examples): Tektronix TCP0030A current probe; Keysight N2790A differential probe
Notes: Part numbers above are commonly used examples across CPU/GPU VRM validation workflows; exact VR generation (VR12+/VR13/VR13.HC), CPU interface (SVID/AVSBus/PWM-VID), and PMBus map must match the target platform.
Figure F11 — Minimal VRM test matrix + step-load evidence map
[Figure F11 content: Minimal proof set (VRM-side). Corner matrix with columns Temp / VIN / Load / Mode: (Cold/Hot, Low/High, Light/Mid, Shedding); (Cold/Hot, Low/High, Heavy, All-on); (Cold/Hot, Low/High, Step load, Both). Evidence checklist: Bode (crossover/margin), step (ΔV/recovery), thermal (hotspots), sharing (phase balance). Step-load panel: Vout and Iload traces showing the nominal level, undershoot, recovery, and a pass window defined by ΔV plus settling; capture with low-inductance Vout sensing.]
The matrix focuses on VRM-dominant corners (temperature, VIN, load, phase mode) and ties each corner to a minimal evidence set (stability, transient, thermal, sharing, protections, and PMBus consistency).
One-line related links (no expansion): Rack Server Mainboard (power tree & rails), In-band Telemetry & Power Log (system aggregation).

H2-12 · Field Debug Playbook: Fastest Path from Symptom to VRM Root Cause

The shortest debug path avoids “swap-and-hope”. The workflow below is designed for VRM-side isolation only: use fault evidence first, then waveforms, then controlled isolation steps.

Step 0 — Minimum kit (specific material numbers)

  • Power-rail probe (mV sensitivity, fast edges): Keysight N7020A
  • Differential probe (floating measurements): Keysight N2790A
  • Current probe (phase / load current): Tektronix TCP0030A
  • Loop injection (Bode on real VRM): Picotest J2100A / J2101A, and J2111A
  • Load stepping: Picotest transient load steppers (e.g., P2105A with S10/S50 stepper probes), or a modular electronic load (Keysight N3300A + N3306A, Chroma 6314A)

Step 1 — Read evidence first (before touching knobs)

  • Classify the event using status/fault registers (OCP-like, OVP-like, UVP-like, OTP-like) and any “last-fault snapshot”.
  • Check persistence rules: confirm whether the platform clears faults on read, on command, or on power cycle (avoid destroying evidence).
  • Look for mismatch: “telemetry looks normal” while faults occur often indicates filtering/update-rate limits or the wrong measurement point.

Step 2 — Capture the right waveforms (measurement point quality matters)

  • Vout capture: use low-inductance sensing at the rail (avoid long ground leads that create fake ringing).
  • Phase evidence: capture per-phase current (or inductor ripple proxy) and correlate with hotspot location.
  • Pin-level signals: PGOOD, VRHOT#, FAULT aligned in time with Vout and load step.

Step 3 — Isolate quickly (change one variable per trial)

  • Fix operating mode: temporarily disable phase shedding or force all-on to see if burst failures disappear.
  • Validate sensing integrity: compare controller telemetry vs external probe; large mismatch often points to sensing/placement issues.
  • Localize by substitution: swap in a known-good output-capacitor network (same footprint) to distinguish control-loop issues from PDN issues.
Symptom → Priority checks → Likely VRM root causes → Next probe

Heavy-load droop / reboot

  • Priority checks: OCP evidence, PGOOD deglitch policy, phase balance (one phase running hot).
  • Likely root causes: OCP too tight or false-trigger; sensing error driving poor current sharing; effective Cout lower than expected (DC bias/aging).
  • Next probe: low-inductance Vout, phase current, ripple/thermal map on DrMOS + inductors.

Thermal throttling / VRHOT# asserts

  • Priority checks: VRHOT# source (internal Tj vs NTC), hotspot location and gradient.
  • Likely root causes: thermal path discontinuity (pad/copper/airflow); current sharing offset causing a single phase hotspot; mode switching losses due to phase management.
  • Next probe: thermal imaging + phase current spread + rail ripple under steady load.

PMBus looks normal, but the board is unstable

  • Priority checks: telemetry update/filtering vs event speed; fault registers around the event.
  • Likely root causes: fast transient excursions hidden by filtering; SW-node ringing corrupting comparators; measurement setup injecting artifacts.
  • Next probe: Vout + PGOOD/FAULT time-aligned capture; confirm measurement ground strategy.

Intermittent start-up failure

  • Priority checks: soft-start slope, pre-bias behavior, PGOOD deglitch/timeouts.
  • Likely root causes: UVP/OVP window hit during ramp; inrush/limit interaction with output network; unwanted mode switching during ramp.
  • Next probe: start-up waveforms (Vout/VIN/PGOOD) + immediate fault register snapshot.
Figure F12 — 3-step VRM debug flow (evidence → waveforms → isolation)
[Figure F12 content: Debug fast path (VRM-side). 1) Read evidence: status/fault registers, counters/history, last-fault snapshot. 2) Capture waveforms: Vout (low-inductance sense), phase current/thermal, PGOOD/VRHOT/FAULT. 3) Isolate (one change): fix phase mode, validate sensing path, swap Cout sample. Root-cause buckets (VRM-side): sensing (tap/routing/offset), control (mode/compensation/limits), power stage (SOA/thermal path), output network (Ceff/ESR/ESL). Decision rule: evidence → waveforms → one-variable isolation (repeatable, minimal collateral changes).]
The flow enforces evidence preservation first, then time-aligned waveform capture, then one-variable isolation steps that converge quickly on VRM-side buckets.
One-line related links (no expansion): Rack Server Mainboard (system rails & sequencing), In-band Telemetry & Power Log (aggregation and anomaly detection).

FAQs: CPU VRM (VR13/VR12+) Selection, Validation, and Debug

These questions stay strictly on the VRM side: controller, power stages (DrMOS), current sensing, transient behavior, protections, and PMBus observability. No expansion into mainboard/BMC/PSU/48V domains.

1) Why can “low reported current” still overheat a DrMOS?

The most common cause is uneven phase sharing: total IOUT looks moderate, but one or two phases carry excess current due to sensing offset, layout-induced error, or device mismatch. Another frequent cause is telemetry hiding peaks (slow PMBus update / heavy filtering). Confirm with per-phase current (or inductor ripple proxy) and a thermal image; then re-check sensing taps and phase balance calibration.

Example P/N: Controller TPS53688 / MP2965 / RAA229131; DrMOS TDA21490 / ISL99390 / SiC658A.
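A quick calculation shows why uneven sharing matters thermally even when total current is moderate: conduction loss scales with the square of each phase's current. The 1 mΩ effective resistance and the current splits below are hypothetical, and switching losses are ignored.

```python
from typing import List, Sequence


def conduction_loss_w(phase_currents: Sequence[float], r_eff_ohm: float) -> List[float]:
    """Per-phase I^2 * R conduction loss in watts."""
    return [i * i * r_eff_ohm for i in phase_currents]


R_EFF = 0.001  # hypothetical 1 mOhm effective series resistance per phase

balanced_currents = [25.0, 25.0, 25.0, 25.0]
skewed_currents = [20.0, 20.0, 20.0, 40.0]

balanced = conduction_loss_w(balanced_currents, R_EFF)
skewed = conduction_loss_w(skewed_currents, R_EFF)

print(sum(balanced_currents), sum(skewed_currents))   # 100.0 100.0 (same total)
print(round(max(skewed) / max(balanced), 2))          # 2.56
```

The reported total is identical in both cases, yet the overloaded phase dissipates about 2.56 times the conduction loss of a balanced phase, concentrated in one package.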
2) Why can a “harder” load-line (less droop) trigger OVP or instability?

A very small droop reduces effective damping, so fast unload events can create larger overshoot and ringing. That overshoot can trip OVP or push the loop into poor phase margin if compensation was tuned assuming a softer output impedance. Validate by capturing unload transients at the rail with a low-inductance probe, then confirm phase margin (Bode) and revisit load-line and compensation together rather than treating droop as a standalone knob.

Example P/N: Controller TPS53688 / MP2965; probing N7020A (rail) + N2790A (diff).
3) For DCR sensing, what field error source shows up most often—and how to verify fast?

The top field error source is temperature mismatch: the inductor’s actual copper temperature diverges from the RC network’s assumed temperature, so DCR-based current is biased—often worsening under airflow gradients. A fast check is to compare controller IMON with an external current probe during steady load and a short step. If the phase imbalance grows with temperature or changes with airflow, the DCR model or tap routing is the likely culprit.

Example P/N: Current probe TCP0030A; controllers TPS53688 / RAA229131 (typical DCR-sense platforms).
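The size of this bias can be estimated from copper's temperature coefficient of resistance, roughly 0.39 % per °C. The sketch below linearises DCR around 25 °C and shows what a sense path reports when its compensation assumes a cooler inductor than the real one. The DCR value, temperatures, and function names are assumptions for illustration.

```python
ALPHA_CU = 0.00393  # approximate temperature coefficient of copper, per °C


def dcr_at(dcr_25c: float, temp_c: float) -> float:
    """Copper DCR at temp_c, linearised around 25 °C (an approximation)."""
    return dcr_25c * (1.0 + ALPHA_CU * (temp_c - 25.0))


def reported_current(true_current: float, dcr_25c: float,
                     actual_temp_c: float, assumed_temp_c: float) -> float:
    """Current a DCR-sense path reports when it compensates for the wrong temperature."""
    v_sense = true_current * dcr_at(dcr_25c, actual_temp_c)   # real voltage across DCR
    return v_sense / dcr_at(dcr_25c, assumed_temp_c)          # scaled with assumed DCR


# Hypothetical phase: 30 A true current, 0.3 mOhm DCR at 25 °C,
# winding actually at 95 °C while compensation assumes 65 °C.
i_rep = reported_current(30.0, 0.0003, actual_temp_c=95.0, assumed_temp_c=65.0)
print(f"reported {i_rep:.1f} A, error {(i_rep / 30.0 - 1.0) * 100:.1f} %")
```

A 30 °C mismatch gives an over-read of about 10 % in this example, enough to shift both phase balancing and the effective OCP point, and it changes with airflow because the mismatch itself does.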
4) Are more phases always better? When do more phases become harder to stabilize or balance?

More phases reduce ripple and spread heat, but they can increase control complexity: longer timing paths, tighter matching requirements, and more opportunities for sensing offset to create phase imbalance. If doublers are used, added delay and mismatch can degrade dynamics and complicate loop tuning. The practical boundary is reached when phase sharing and stability margins become sensitive to small routing or component variations, especially across temperature corners.

Example P/N: Doubler ISL6617A; controllers TPS53688 / MP2965 / RAA229131.
5) Why can light-load modes (phase shedding / DCM) worsen transient response?

With phase shedding, fewer active phases and DCM behavior increase effective output impedance and reduce available di/dt support, so burst load steps produce deeper undershoot and longer recovery. Mode transitions can also introduce nonlinearity (dead-time, current limit behavior, sampling alignment), making short bursts look worse than steady averages. A fast isolation step is to temporarily force “all-on / CCM” mode and compare step-load waveforms and fault incidence.

Example P/N: Controllers TPS53688 / MP2965 (commonly offer configurable phase management / mode lock).
6) If efficiency is high, why can temperature still be uncontrollable? Common layout/thermal-path issues?

High efficiency does not guarantee low temperature if the thermal resistance path is poor. Common issues include voided thermal pads, insufficient copper heat spreading under the DrMOS, weak via stitching to inner planes, and airflow misalignment that leaves hotspots in recirculation zones. Confirm by mapping hotspot location and gradient; if one phase consistently dominates temperature, check phase sharing and local copper/thermal contact first, not only switching loss metrics.

Example P/N: DrMOS examples TDA21490 / ISL99390 / SiC658A (thermal path differences matter).
7) OCP limit is unchanged, but the field shows “random hiccup.” What noise/ringing triggers it?

Random hiccup often comes from false overcurrent detection: SW-node ringing, ground bounce, or sense routing pickup makes the comparator or sampled current appear to exceed threshold for a brief moment. Too-short blanking/deglitch, aggressive filtering choices, or marginal phase margin can amplify the risk. The fastest verification is time-aligned capture of VOUT, phase current proxy, and FAULT/PGOOD with correct probing (short ground, low-inductance rail sense), then tuning blanking and loop damping.

Example P/N: Probing N2790A + N7020A; controllers TPS53688 / RAA229131 (blanking/limits are controller knobs).
8) What misjudgments happen if PMBus sampling is too slow, and how to keep it stable but event-capable?

Slow update rates and heavy averaging can make telemetry look “perfect” while fast undervoltage or overcurrent events occur between samples. That leads to wrong conclusions such as “current is low” or “VOUT never dips.” Use two modes: a stable, filtered profile for acceptance testing, and a faster/event-oriented profile for debug (higher update, less averaging, or snapshot features if supported). Always correlate PMBus data with a captured rail waveform during faults.

Example P/N: Controllers TPS53688 / MP2965 / RAA229131 (telemetry configuration and event capture depend on PMBus map).
9) What start-up problems does pre-bias cause, and how should soft-start be made compatible?

With pre-bias, a naïve soft-start can interpret the rail as already “too high” or cause unexpected current direction during ramp, triggering UVP/OVP windows or confusing PGOOD sequencing. Compatibility requires configuring soft-start and protection windows to tolerate a pre-charged output, ensuring current limit behavior during ramp does not fight the existing rail voltage, and validating ramp waveforms plus immediate fault snapshots after a failed start. The key is deterministic trigger–response–retry behavior during the ramp interval.

Example P/N: Controllers TPS53688 / RAA229131 (pre-bias handling and window policies are controller features).
10) If phase currents are unbalanced, suspect inductor/DrMOS variation first—or the sensing network?

Suspect the sensing network and routing first, because it can bias one phase systematically (tap placement, RC constants, pickup, calibration). Next, evaluate component variation that is strongly temperature-coupled: inductor DCR spread, DrMOS RDS(on) spread, and local thermal gradients that create runaway. A fast method is to compare per-phase telemetry against an external current probe and correlate with thermal imaging; if electrical imbalance tracks temperature and location, sensing and thermal path are primary suspects.

Example P/N: Current probe TCP0030A; DrMOS examples TDA21490 / ISL99390; controllers TPS53688 / MP2965.
11) Step-load tests pass in the lab, but the field still reboots. What corner is most often missed?

The most common missed corners are temperature (cold/hot), VIN low/high, and phase mode (shedding vs all-on) under burst-like repetition, not a single isolated step. Another frequent gap is measurement setup: poor rail sensing can hide real dips or invent ringing. Re-run a minimal matrix that includes mode transitions and a burst pattern, and capture VOUT with a low-inductance rail probe while checking fault/status immediately after the event.

Example P/N: Load gear N3300A+N3306A (modular) or Chroma 6314A; probing N7020A.
12) PGOOD never seems to drop, yet resets occur. Which VRM-side fault/status should be checked first?

PGOOD can stay high while brief events occur outside its deglitch window, or while a non-PGOOD fault condition is logged. Prioritize reading status word and fault registers that capture OCP/OVP/UVP/OTP classifications, counters, and “last-fault” snapshots where available. Also confirm the platform’s clear policy (clear-on-read vs clear-on-command), because evidence can be lost during routine polling. Time-align any captured reset with fault snapshots and a rail waveform to avoid false attribution.

Example P/N: Controllers TPS53688 / RAA229131 / MP2965 (fault map depth varies by controller).