CRPS / Server PSU: PFC+LLC, PMBus Telemetry, Redundancy

A CRPS/server PSU is more than an AC-to-DC converter. It is a controlled power subsystem designed for redundancy, combining stable current sharing, fast protection behavior, and trustworthy PMBus telemetry and fault logs so that the server bus stays alive through real-world brownouts, transients, and thermal stress.

H2-1 — What is CRPS / Server PSU & Boundary

Engineering definition

A CRPS (Common Redundant Power Supply) is a hot-redundant server power module that converts AC (optionally HVDC in some deployments) into a regulated DC rail (commonly 12 V or 48 V) using a PFC front end plus an isolated LLC stage with synchronous rectification. It supports N+1 / 1+1 redundancy via OR-ing and current-share control, while exposing PMBus telemetry and fault logs to improve serviceability.

This page focuses on the PSU module itself (inside-the-box conversion, redundancy interface, and observability), not downstream distribution or board-level point-of-load regulation.

What this page covers

  • Input domain: EMI/surge/inrush handling and brownout behavior relevant to server uptime.
  • Conversion chain: PFC → HV bus energy → isolated LLC regulation → SR/output filtering → hold-up.
  • Redundancy interface: OR-ing (ideal diode / OR-ing FET) and current-share (droop vs active share).
  • Observability: PMBus telemetry, status words, fault logs, and PSU-internal fan/thermal derating.

What this page does NOT cover (link out only)

  • 48 V / 12 V bus hot-swap controllers, eFuses, downstream power distribution.
  • CPU VRM / DDR rails / on-board sequencing and point-of-load stability details.
  • Rack PDU metering / switching, facility power monitoring, site-level energy analytics.
  • BMC/IPMI/Redfish workflows, KVM/IP video/USB transport details, platform security deep dives.
Figure F1 — CRPS end-to-end block view (navigation map)
[Figure F1: functional blocks 1–5 along the power path: (1) AC input with EMI/surge, (2) PFC (PF, THD, bus), (3) HV bus energy/hold-up, (4) isolated LLC regulation, (5) SR + output filter (ripple, transient), feeding the 12 V / 48 V DC output through OR-ing (ideal diode) and current share. Alongside: PMBus/MCU (telemetry, status, fault log) and fan/thermal (curves, derating, NTC/RPM).]
Reading guide: blocks 1→5 follow the power path; the PMBus/thermal block provides observability across multiple taps (circles).

H2-2 — Electrical Architecture Map (AC → DC, domain breakdown)

A server PSU is best understood as six tightly coupled domains. Each domain owns specific KPIs, produces characteristic failure symptoms, and exposes distinct telemetry clues. Mapping symptoms to domains prevents misdiagnosis (for example, trusting stable average power readings while a fast transient is what actually triggers a reset).


Domain ① — EMI / surge / inrush
  • Controls: conducted EMI, surge energy, inrush stress during plug-in / brownout recovery.
  • KPIs: EMI margin, surge robustness, repeatable start behavior.
  • Common pitfalls: nuisance trips on brownout; “starts only sometimes” due to marginal inrush strategy.
  • Typical symptoms: repeated restart attempts; input-side protection events.
  • Telemetry clues: input status flags, brownout counters (if available), start attempt logs.
Domain ② — PFC (front-end)
  • Controls: input current shaping and HV bus energy with PF/THD constraints.
  • KPIs: PF, THD, efficiency across load range, stable bus regulation.
  • Common pitfalls: light-load mode transitions causing audible noise or EMI peaks; aggressive current limiting leading to brownout sensitivity.
  • Typical symptoms: bus sag under step load; unstable PF/THD in certain load bands.
  • Telemetry clues: bus voltage trend, input power factor estimate, PFC fault/status bits.
Domain ③ — Isolated LLC (resonant regulation)
  • Controls: isolated power transfer and regulation via frequency control over a resonant tank.
  • KPIs: efficiency, stability over input range, light-load behavior, dynamic response.
  • Common pitfalls: gain margin erosion at extremes (frequency limits); burst/skip transitions that disturb output or create acoustic noise.
  • Typical symptoms: intermittent output ripple spikes; “clean at full load, messy at light load.”
  • Telemetry clues: switching mode indicators (if exposed), regulation fault flags, correlated thermal rise on primary devices.
Domain ④ — SR + output filter + hold-up
  • Controls: synchronous rectification timing, output ripple/noise, hold-up delivery during input loss.
  • KPIs: ripple/noise, transient droop, hold-up time, thermal on SR path.
  • Common pitfalls: SR timing drift at light load; hold-up shortfall due to bus energy strategy limits, not only capacitor size.
  • Typical symptoms: reset on AC dropouts; output droop not visible in slow telemetry.
  • Telemetry clues: fast fault logs (UV/PG events), bus discharge signature, output current/voltage snapshots around events.
Domain ⑤ — OR-ing + current-share (redundancy)
  • Controls: reverse blocking, hot-redundant insertion/removal, stable load sharing (droop or active share).
  • KPIs: share accuracy, share stability, minimal bus disturbance on unit drop/add.
  • Common pitfalls: share-loop oscillation; OR-ing device thermal stress; share signal integrity issues.
  • Typical symptoms: one PSU “hunts” or drops out; unexplained heating; bus micro-dips during failover.
  • Telemetry clues: per-unit current mismatch trend, OR-ing thermal flags, share-related status bits, dropout timestamps.
Domain ⑥ — Digital management (PMBus) + thermal
  • Controls: sensing, reporting, fan curves, derating, and fault handling policies.
  • KPIs: telemetry trustworthiness, useful logs (time correlation), controlled derating vs hard shutdown.
  • Common pitfalls: slow averaging masks fast droops; mismatched sensor placement causes fan “overreaction.”
  • Typical symptoms: telemetry looks stable while resets occur; fan ramps without obvious temperature cause.
  • Telemetry clues: status word hierarchy, log depth/retention, sensor-to-event correlation quality.
Practical rule: when a symptom appears, first map it to domains ①–⑥, then use the listed “telemetry clues” to confirm. This avoids chasing the wrong subsystem (for example, debugging share stability when the root cause is a light-load mode transition).
Figure F2 — Six-domain map with KPI/telemetry anchors
[Figure F2: six domain tiles with KPI anchors: ① EMI/surge/inrush (start margin, brownout); ② PFC front end (PF/THD, HV bus); ③ LLC isolation (gain region, light load); ④ SR + output filter + hold-up (ripple, hold-up time, energy delivery); ⑤ OR-ing + current share (reverse block, share stability, N+1 behavior); ⑥ PMBus + thermal (fault log, fan curve, telemetry trust).]
Anchor usage: domains ①–⑥ align with the narrative structure of the page. The KPI labels in the figure are intentionally minimal so that it stays readable on mobile screens.

H2-3 — Front-end: EMI, surge, inrush & brownout behavior

Practical goal

The AC front end must survive real data-center events (plug-in/inrush, short sags, generator/UPS transfers) without turning an energy shortage into a misclassified overcurrent fault. Robust behavior depends on coordinated inrush limiting, brownout detection, controlled derating, and event logging that preserves the sequence of causes and effects.

This section stays on the PSU input domain: EMI/surge/inrush handling and how brownout propagates into HV bus energy and output stability. Downstream distribution components are out of scope.


Event A — Plug-in / restore power: inrush and start reliability

Symptom: Sudden breaker trips, repeated start attempts, or “starts only sometimes” after a brief unplug/replug.
Cause: Bulk capacitance charging plus EMI filter dynamics create high peak current; the NTC cold/hot state changes the peak; active inrush control and PFC soft-start may not be aligned.
Telemetry clues: Start-attempt counters, input fault flags during start, and timestamped “start fail” records (when available).
Action: Prefer staged soft-start (limit dV/dt of HV bus charging), record a distinct “inrush/start” fault category, and ensure a deterministic restart policy rather than blind retries.

Event B — Brownout / sag: avoid misclassifying energy deficit as OCP

Symptom: System resets during a short AC sag; average power telemetry looks normal; one unit in a redundant pair drops out.
Cause: Input sag limits PFC energy delivery, HV bus voltage falls, and the LLC stage may leave its regulated region. If classification is poor, the event may be handled as overcurrent/short rather than brownout.
Telemetry clues: HV bus trend (if exposed), the sequence of flags AC_FAIL → bus UV → output UV/PG drop, and a clustered log timestamp pattern.
Action: Use explicit brownout detection and staged derating (controlled power reduction) before shutdown. Log the event order so root cause stays in the input domain rather than being blamed on output protection.

Event C — Generator/UPS transfer: frequency/shape disturbances and repeated faults

Symptom: Fault “storms” around transfers: repeated resets, audible noise, or PF/THD anomalies concentrated in a narrow time window.
Cause: Abnormal input frequency or waveform distortion challenges PFC current shaping and control-mode transitions, especially near light-load boundaries.
Telemetry clues: Timestamp clusters, input-related status bits, and repeated “recover/derate/restart” sequences in the fault log.
Action: Harden decision thresholds under abnormal input and prioritize logging that captures “transfer window” context (pre-event load, input status, and the first triggering flag).
Figure F3 — Input disturbance → HV bus → output → fault-log timeline
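[Figure F3: trend-view timeline (not to scale) with Vin, Vbus, Vout, and Log traces starting at t0. An input sag depletes bus energy until hold-up ends; the log records AC_FAIL, BUS_UV, PG↓, and RESTART in that order. Classification focus: brownout (energy) versus overcurrent (load), decided by event order.]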
The most actionable evidence is the sequence: input sag → bus energy drop → output hold-up end → logged flags. This prevents blaming a brownout on overcurrent protection.
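
The "first trigger decides the class" idea can be sketched in a few lines of Python. This is a minimal illustration, not a vendor log format: the flag names (`AC_FAIL`, `VIN_UV`, `BUS_UV`, `IOUT_OC`, `PG_DROP`), the tuple layout, and the function name `classify_event` are all assumptions made for the example.

```python
from typing import List, Tuple

# Hypothetical flag names; real PSUs expose vendor-specific status bits and log formats.
ENERGY_FLAGS = {"AC_FAIL", "VIN_UV", "BUS_UV"}   # input or HV-bus energy deficit
LOAD_FLAGS = {"IOUT_OC"}                         # load-side overcurrent

def classify_event(log: List[Tuple[float, str]]) -> str:
    """Classify a cluster of logged flags by its FIRST trigger, not its last flag."""
    if not log:
        return "no-evidence"
    ordered = sorted(log, key=lambda entry: entry[0])  # sort by timestamp
    first_flag = ordered[0][1]
    if first_flag in ENERGY_FLAGS:
        return "brownout (energy deficit)"
    if first_flag in LOAD_FLAGS:
        return "overcurrent (load)"
    return "unclassified"

# Entries arrive out of order on purpose; an OCP flag appears late in the cascade.
sag_log = [(12.004, "BUS_UV"), (12.000, "AC_FAIL"),
           (12.011, "PG_DROP"), (12.012, "IOUT_OC")]
print(classify_event(sag_log))
```

A classifier that looked only at the last flag would report overcurrent for `sag_log`; sorting by timestamp and reading the first trigger keeps the root cause in the input domain.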

H2-4 — PFC stage deep dive: topology & control (efficiency, THD, dynamics)

The PFC stage shapes the input current, sets the HV bus energy available for hold-up, and strongly influences light-load behavior (noise, EMI margin, and nuisance flags). For 1–3 kW server PSUs, the practical choice is usually among boost CCM, CRM, and TCM, with trade-offs that show up as field symptoms and telemetry patterns.

What to optimize

PF/THD compliance, efficiency across load range, bus stability under sags, and predictable restart behavior.

Where issues appear

Light-load mode transitions, current limiting under brownout, and EMI margin under distorted input.

What to read first

Bus voltage trend, PFC status/fault bits, and event timestamps around transfers and sags.


Topology comparison (CCM vs CRM vs TCM)

Each mode is summarized below as a short decision card: best fit, typical pressure, common symptoms, and the most useful telemetry clues.

CCM (Continuous Conduction)
  • Best fit: high power, predictable bus regulation, stable PF/THD under load.
  • Typical pressure: switching loss and EMI filtering complexity.
  • Common symptoms: EMI margin sensitivity; thermal rise on switching devices at heavy load.
  • Telemetry clues: bus ripple trend, thermal flags correlated with high load, PFC limit events.
CRM (Critical Conduction)
  • Best fit: high efficiency in mid load; simpler current shaping in some designs.
  • Typical pressure: variable frequency and EMI peaks in certain regions.
  • Common symptoms: audible/EMI artifacts near mode boundaries; PF/THD variation by load band.
  • Telemetry clues: event clusters around specific load levels; PF/THD estimate anomalies.
TCM (Triangular Current Mode)
  • Best fit: improved light-load efficiency and lower switching loss in some regions.
  • Typical pressure: control complexity and sensitivity to distorted input.
  • Common symptoms: nuisance flags during transfers; instability if thresholds are poorly tuned.
  • Telemetry clues: mode-switch indicators (if exposed), restart/derate patterns during input anomalies.

Digital PFC control: loops, sampling, and limit policies

Voltage loop: Sets HV bus energy (hold-up headroom). Overly aggressive tuning may amplify disturbances; overly slow tuning may cause bus sag during load steps.
Current loop: Shapes input current to meet PF/THD. Sampling delay and filtering must not create misclassification during abnormal input.
Limit & mode logic: Soft-start, current limit, and mode transitions determine whether the system derates gracefully or enters a restart “storm.” The fault log should capture which gate actually triggered.
What PF/THD indicates: PF/THD degradation concentrated in a load band is often a signature of control-mode boundaries, making it a diagnostic clue rather than a purely compliance metric.
Figure F4 — PFC modes + loop/control blocks + telemetry taps
[Figure F4: three PFC mode panels showing inductor current (IL): CCM boost (EMI focus), CRM critical mode (mode boundary), and TCM (light load). Digital control chain: Vbus sense → voltage loop → current reference → current loop → PWM, with soft-start, current limit, and mode-switch logic. Outputs: input current shape (PF/THD signature), HV bus energy (hold-up headroom), and PMBus telemetry (status + log).]
Reading guide: mode choice shapes IL behavior and EMI risk; the control chain (loops + limits + mode logic) determines whether brownout/transfer events become controlled derating or repeated restart storms.

H2-5 — Isolated LLC stage deep dive (gain, light-load, dynamics & stability)

Engineering takeaway

LLC behavior is defined by the resonant tank and its gain curve across switching frequency. Under wide input range and wide load range, the operating point can approach frequency limits or ZVS margins, where efficiency, noise, and regulation robustness can change sharply.

Scope: isolated LLC + synchronous rectification (SR) behavior inside the PSU. Downstream board-level conversion is out of scope.


1) Gain curve and operating region (what the control can and cannot “move”)

What matters: Resonant tank parameters define the gain-vs-frequency shape, which sets the regulation window under high/low input and different load bands. When the operating point is forced near the min/max switching frequency, control authority shrinks and small disturbances can produce visible output artifacts.
What to observe: Frequency command distribution (does it “stick” near limits?), the correlation between HV bus movement and frequency, and thermal sensitivity localized to primary switches / transformer / resonant components (an indicator of margin).
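
To make the gain-vs-frequency shape concrete, the sketch below evaluates the widely used first-harmonic-approximation (FHA) gain expression for an LLC tank. The FHA model itself, the inductance ratio `ln = Lm/Lr`, the quality factor `q`, and every numeric value are assumptions chosen for illustration; none of them come from a specific CRPS design, and FHA loses accuracy away from resonance.

```python
import math

def llc_gain(fn: float, ln: float, q: float) -> float:
    """FHA voltage-gain magnitude of an LLC resonant tank.

    fn: switching frequency normalized to the series resonant frequency
    ln: inductance ratio Lm / Lr (illustrative parameter)
    q:  quality factor; a higher q corresponds to a heavier load
    """
    real = (ln + 1.0) * fn**2 - 1.0
    imag = (fn**2 - 1.0) * fn * q * ln
    return (ln * fn**2) / math.hypot(real, imag)

LN = 5.0  # hypothetical inductance ratio
for label, q in (("light load", 0.3), ("heavy load", 1.0)):
    gains = ", ".join(f"fn={fn:.1f}: {llc_gain(fn, LN, q):.2f}"
                      for fn in (0.6, 0.8, 1.0, 1.5))
    print(f"{label:10s} (q={q}) -> {gains}")
```

With these example numbers the gain is exactly 1 at resonance for any load, while at `fn = 0.6` the light-load curve still boosts and the heavy-load curve has already fallen below unity. That is the "control authority shrinks" effect: pushing frequency toward the lower clamp no longer buys gain when the load is heavy and the HV bus is sagging.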

2) Light-load + SR pitfalls (burst, reverse current, noise and efficiency dips)

Typical mechanisms: Light load often triggers burst/skip behavior (energy delivered in packets). If SR timing or thresholds are not well coordinated, the risk of reverse current and “energy backflow” increases, while audible noise and ripple envelopes become more pronounced.
Telemetry & log clues: Ripple envelope patterns at light load, clustered mode-related events around a narrow load band, and repeated recover/derate sequences that appear “periodic” rather than a one-time hard fault.

3) Dynamics and stability (where parasitics create “strange” behavior)

Common traps: A frequency clamp (upper/lower) can create a “cannot regulate” region. Transformer and layout parasitics introduce extra poles/zeros that change loop behavior by operating point, creating region-specific instability signatures.
How to avoid misattribution: Region-specific symptoms (only at certain Vin/load bands), combined with frequency “edge sticking” and a distinctive ripple shape, strongly suggest an LLC operating-point issue rather than an external load problem.
Figure F5 — LLC gain curve (engineering view) and operating zones
[Figure F5: LLC gain versus switching frequency (fsw, low to high; trend, not to scale) between fmin and fmax. The operating point shifts along the operating path across the Vin range (high Vin to low Vin). Marked zones: regulation zone with comfortable margin, ZVS margin band, and burst zone. Near the boundaries: audible noise, efficiency dip, stress.]
Reading guide: the gain curve defines where regulation is easy versus where frequency limits or reduced ZVS margin make light-load noise, efficiency dips, or instability more likely.

H2-6 — Secondary rectification, output regulation & hold-up (ripple, transients, sampling mismatch)

Boundary

Output quality is defined at the PSU interface: ripple/noise, transient behavior, and hold-up capability. Board-level VRMs may further shape rails, but the PSU must deliver a predictable interface and accurate event evidence when failures occur.

Output ripple and “sensitive loads” (interface-centric view)

Ripple and noise matter because they set the disturbance level presented to the server system input. A practical approach is to treat ripple as an interface requirement and focus on repeatable compliance across load bands, temperatures, and redundant sharing states. If ripple artifacts appear only in specific bands, it is often a signature of mode transitions (LLC/SR behavior) rather than random noise.


Hold-up time: where energy comes from (and why “just add capacitance” is not the only lever)

Hold-up is primarily driven by available HV-bus energy and the allowed bus discharge window. The key is not the capacitor value alone, but the usable energy delta between a safe upper bus level and the minimum bus level that still supports regulation.

Engineering estimate (order-of-magnitude)
E ≈ 1/2 · C · (V1² − V2²)
t_hold ≈ (E · η) / P_load

C is the HV-bus (bulk) capacitance, E is the usable stored energy, V1→V2 is the allowed HV-bus discharge window, η is conversion efficiency during hold-up, and P_load is the delivered output power. This estimate is used to size margin and to interpret logs, not as a full design derivation.
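
As a worked example of the estimate above, the following sketch plugs in illustrative numbers. The capacitance, bus voltages, efficiency, and load are assumptions picked only to show the arithmetic, not values from any particular PSU.

```python
def usable_energy_j(c_farads: float, v1: float, v2: float) -> float:
    """Usable HV-bus energy between the starting level V1 and the minimum level V2."""
    return 0.5 * c_farads * (v1**2 - v2**2)

def hold_up_time_s(c_farads: float, v1: float, v2: float,
                   efficiency: float, p_load_w: float) -> float:
    """Order-of-magnitude hold-up time: t_hold ≈ (E · η) / P_load."""
    return usable_energy_j(c_farads, v1, v2) * efficiency / p_load_w

# Illustrative assumptions: 680 µF bulk capacitance, 390 V bus, 1200 W load, η = 0.95.
C_BULK = 680e-6
baseline = hold_up_time_s(C_BULK, v1=390.0, v2=300.0, efficiency=0.95, p_load_w=1200.0)
wider_window = hold_up_time_s(C_BULK, v1=390.0, v2=250.0, efficiency=0.95, p_load_w=1200.0)

print(f"usable energy (390 V -> 300 V): {usable_energy_j(C_BULK, 390.0, 300.0):.2f} J")
print(f"hold-up, V2 = 300 V: {baseline * 1e3:.1f} ms")
print(f"hold-up, V2 = 250 V: {wider_window * 1e3:.1f} ms")
```

With the same capacitor, letting the bus discharge to 250 V instead of 300 V raises the estimate from roughly 16.7 ms to roughly 24.1 ms. This is why the allowed discharge window (set by how low the LLC stage can still regulate) matters as much as the capacitor value.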


Transient response vs PMBus sampling: why “telemetry looks fine” can still mean a reset

What happens: Fast transients (µs–ms) can violate output limits long before slow telemetry windows capture an averaged value. As a result, PMBus voltage/current readings may appear stable even when the output briefly collapses.
What to trust first: Timestamped fault logs and status transitions (AC_FAIL / BUS_UV / VOUT_UV / PG drop) provide the event sequence that telemetry averages can hide. If min/max-latch style indicators exist, they are more representative of transients than moving averages.
How to reduce ambiguity: Prefer logging that captures first-trigger cause and the order of flags. This prevents incorrectly blaming the load for what is fundamentally an energy or regulation boundary event.
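
The mismatch between averaged telemetry and a fast droop is easy to reproduce numerically. The sketch below is synthetic: the 10 µs sample spacing, the 100 ms averaging window, the 200 µs dip to 10.5 V, and the 11.0 V undervoltage threshold are all assumed values, and `simulate_vout` is a hypothetical helper rather than anything a PSU exposes.

```python
def simulate_vout(n_samples=10_000, dip_start=5_000, dip_len=20,
                  nominal_v=12.0, dip_v=10.5):
    """Return fast Vout samples (assumed 10 µs apart) containing one short dip."""
    return [dip_v if dip_start <= i < dip_start + dip_len else nominal_v
            for i in range(n_samples)]

UV_THRESHOLD_V = 11.0  # hypothetical undervoltage limit

samples = simulate_vout()                     # 10,000 samples, about 100 ms of output
window_average = sum(samples) / len(samples)  # what one averaged reading would report
min_latch = min(samples)                      # what a min-latch style register would hold

print(f"averaged reading: {window_average:.3f} V")
print(f"min latch:        {min_latch:.3f} V")
print("average crosses UV limit:  ", window_average < UV_THRESHOLD_V)
print("min latch crosses UV limit:", min_latch < UV_THRESHOLD_V)
```

The averaged value moves by only 3 mV, so it never approaches the threshold, while the min-latch value shows the full 1.5 V droop. The same reasoning applies to current and power readings.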
Figure F6 — Hold-up discharge + PG + PMBus sampling windows + log anchors
[Figure F6: timeline of Vbus, Vout, PG, and PMBus sampling windows starting at t0: bus discharge, hold-up end, PG drop, and a fast transient that falls between PMBus windows. Fault-log anchors, in order: AC_FAIL, BUS_UV, VOUT_UV, RESTART, leading back to the root cause.]
Reading guide: hold-up is governed by HV-bus energy and the allowed discharge window. Slow PMBus windows can miss fast transients, so timestamped log anchors and status transitions should be used to establish the true event sequence.

H2-7 — Redundancy: OR-ing + current share (active vs droop)

Engineering definition

Redundant paralleling (1+1 / N+1) is successful only when the combined output behaves like a single, predictable source: stable current sharing, controlled fault isolation, and clean exit/entry without bus disturbance.

What redundancy must solve (the four hard problems)

Current balance: Uneven sharing concentrates thermal stress in one unit, triggering derating or repeated exit/entry cycles that look like system “jitter.”
Circulation current: Small output setpoint mismatch or path resistance mismatch can create loop currents that waste power and heat OR-ing devices.
Fault isolation: OR-ing prevents backfeed and keeps a single unit failure from pulling down the shared bus.
No-bus-disturbance exit: When a unit drops out (fault/derate/AC loss), the remaining units must pick up load without a sharp bus sag or repeated toggling.

OR-ing (ideal diode / OR-ing FET): why it exists and how it fails

OR-ing is the isolation gate at the PSU output. Its primary job is to block reverse current (backfeed) and to limit the blast radius of a single-unit fault. In practice, OR-ing is also where heat accumulates when there is circulation current, frequent handover, or marginal reverse-current thresholds.

Evidence to check: OR-ing device temperature asymmetry between units under similar load, repeated “handover-like” events, and output steps/ripple bursts that correlate with entry/exit transitions.
What it usually means: Persistent OR-ing heating often indicates backfeed tendency or circulation current, not simply “a bad FET.” The fastest discriminator is whether the heating tracks sharing mismatch or tracks transitions.

Current sharing strategies: droop share vs active share (share bus)

Droop share (voltage droop): Each unit intentionally reduces its Vout reference as its output current increases. This naturally encourages balance with minimal inter-unit signaling. Trade-off: the shared bus voltage can vary more with load, and mismatch in path resistance can leave residual imbalance.
Active share (share bus): Units exchange a share signal and adjust their internal targets to equalize current more tightly. Benefit: better balance and a steadier bus. Risk: an extra coupled control loop can oscillate if bandwidth/latency/noise margins are poor or if the share line becomes intermittent.
Stability triad (diagnosis mindset): Sharing stability is usually determined by (1) share signal integrity and dropout behavior, (2) sharing-loop speed relative to regulation dynamics, and (3) the output path impedance model (OR-ing drop, wiring, connectors).
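
A DC steady-state model shows how droop sharing reacts to setpoint and path mismatch. The sketch assumes each unit behaves as an ideal source `v_set` in series with a droop resistance and a path resistance, and that every OR-ing device is conducting (no reverse blocking is modeled). The function name `droop_share` and all values are illustrative assumptions.

```python
from typing import List, Tuple

def droop_share(v_set: List[float], r_droop: float,
                r_path: List[float], i_load: float) -> Tuple[float, List[float]]:
    """Solve bus voltage and per-unit currents for paralleled droop-controlled units.

    Each unit i obeys: v_bus = v_set[i] - (r_droop + r_path[i]) * i_unit[i],
    and the unit currents sum to i_load.
    """
    conductance = [1.0 / (r_droop + rp) for rp in r_path]
    v_bus = (sum(v * g for v, g in zip(v_set, conductance)) - i_load) / sum(conductance)
    currents = [(v - v_bus) * g for v, g in zip(v_set, conductance)]
    return v_bus, currents

# Illustrative assumptions: 5 mΩ droop, 1 mΩ path per unit, 100 A load.
matched_bus, matched = droop_share([12.20, 12.20], 0.005, [0.001, 0.001], 100.0)
offset_bus, offset = droop_share([12.20, 12.23], 0.005, [0.001, 0.001], 100.0)

print(f"matched setpoints: bus {matched_bus:.3f} V, currents {[round(i, 1) for i in matched]}")
print(f"30 mV offset:      bus {offset_bus:.3f} V, currents {[round(i, 1) for i in offset]}")
```

In this example a 30 mV setpoint offset shifts the split from 50 A / 50 A to 47.5 A / 52.5 A. A steeper droop would shrink the imbalance at the cost of more bus-voltage variation with load, which is the trade-off described above.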

Fault-tree style: fast path from symptom to evidence

Symptom: uneven current / thermal imbalance
  • Check: Iout mismatch across units at a stable load; one unit runs hotter or derates earlier.
  • Clue: imbalance follows specific slot/cable path → path impedance mismatch; imbalance follows one PSU only → control/share issue.
  • Action direction: validate sharing mode (droop/active), confirm share signal continuity, confirm OR-ing drop symmetry under load.
Symptom: share oscillation / bus “jitter”
  • Check: repeated entry/exit cycles, output steps, or periodic ripple bursts; share-related status/log flags if available.
  • Clue: active-share is more sensitive to latency/noise and share line intermittency.
  • Action direction: prioritize share signal integrity and degradation behavior (what happens when share is lost).
Symptom: OR-ing device overheating
  • Check: OR-ing temperature asymmetry, heating that correlates with mismatch or with transitions.
  • Clue: circulation current (setpoint/path mismatch) or backfeed tendency (handover threshold).
  • Action direction: reduce circulation drivers (setpoint alignment/path symmetry) and verify reverse-current blocking threshold behavior.
Symptom: single unit drop-out causes system disturbance
  • Check: order of events: derate/limit → share imbalance → drop-out → bus sag.
  • Clue: thermal or protection threshold too close to nominal in one unit creates repeated “handover storms.”
  • Action direction: confirm clean dropout semantics (no backfeed, no oscillatory re-entry) and rely on event sequence rather than averaged telemetry.
Figure F7 — Parallel redundancy: OR-ing, share bus, remote sense (abstract), and circulation paths
[Figure F7: PSU A and PSU B (identical interfaces: DC output stage, OR-ing ideal diode, Iout and temperature sensing, share control) feed a common DC bus and a system load with dynamic demand. Also shown: share bus, remote sense (abstract), and the circulation path. Callouts: share open/noise, OR-ing hot, uneven current, drop-in/out.]
Reading guide: OR-ing blocks backfeed; share loops set current balance; circulation current is driven by setpoint/path mismatch and is a common root cause of OR-ing heat and “jitter” under redundancy.

H2-8 — Digital PSU & PMBus telemetry (what to trust)

Core principle

Telemetry is evidence, not truth-by-default. The most reliable diagnosis comes from status transitions and fault-log sequence, while averaged power/voltage numbers can mislead during fast transients or calibration drift.

What PMBus/SMBus typically exposes (map by category)

Input-side: Vin, Iin, Pin, and related status flags. Useful for identifying brownout-like conditions and input anomalies that precede bus discharge.
Output-side: Vout, Iout, Pout, current-limit/derate flags. Most valuable when used as cross-unit comparisons (PSU A vs PSU B), not as single absolute truth.
Thermal & fans: Temperature points and fan RPM. Trend is often more meaningful than a single number, because sensor placement differs from the true hotspot.
Status & fault evidence: Status words/bits and fault-log entries (timestamped if supported). This layer usually carries the “first trigger cause” chain.
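
As an illustration of the status layer, the sketch below decodes a raw 16-bit `STATUS_WORD` value and a LINEAR11-encoded reading. The bit names follow the standard PMBus `STATUS_WORD` layout as commonly documented, but treat the table as an assumption to verify against the PSU's own PMBus documentation. The raw values are made up, and no bus access is shown.

```python
# Standard PMBus STATUS_WORD bit names (verify against the PSU's documentation).
STATUS_WORD_BITS = {
    15: "VOUT", 14: "IOUT/POUT", 13: "INPUT", 12: "MFR_SPECIFIC",
    11: "POWER_GOOD#", 10: "FANS", 9: "OTHER", 8: "UNKNOWN",
    7: "BUSY", 6: "OFF", 5: "VOUT_OV_FAULT", 4: "IOUT_OC_FAULT",
    3: "VIN_UV_FAULT", 2: "TEMPERATURE", 1: "CML", 0: "NONE_OF_THE_ABOVE",
}

def decode_status_word(word: int) -> list:
    """Return the names of all set bits, most significant first."""
    if not 0 <= word <= 0xFFFF:
        raise ValueError("STATUS_WORD must fit in 16 bits")
    return [name for bit, name in sorted(STATUS_WORD_BITS.items(), reverse=True)
            if word & (1 << bit)]

def decode_linear11(raw: int) -> float:
    """Decode LINEAR11: 5-bit signed exponent (high bits), 11-bit signed mantissa."""
    exponent = (raw >> 11) & 0x1F
    mantissa = raw & 0x7FF
    if exponent > 0x0F:
        exponent -= 0x20   # two's-complement sign extension
    if mantissa > 0x3FF:
        mantissa -= 0x800
    return mantissa * 2.0 ** exponent

example_status = 0x2848  # made-up raw value for illustration
print(decode_status_word(example_status))
print(decode_linear11(0xF9CC))  # made-up raw reading
```

For the made-up value `0x2848` the decoder reports an input-side summary bit, negated power-good, unit off, and an input undervoltage fault, which is the kind of pattern that points toward the input domain. Output voltage readings typically use a different encoding selected by `VOUT_MODE`, so `decode_linear11` should not be applied to them blindly.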

Why some readings must be discounted (sampling, averaging, calibration)

Sampling window mismatch: Fast events (µs–ms) can occur entirely between PMBus update windows. Averaged V/I/P can look fine while PG drops or UV triggers in reality.
Current/power estimation chain: Shunt/DCR-based sensing depends on calibration and temperature drift. Derived power (V×I) inherits phase and timing mismatch between measurements.
Temperature point bias: “Temp” is only meaningful with context: which sensor, which airflow condition, and whether it tracks OR-ing / magnetics / primary hotspots.

Telemetry trust grading matrix (use it like a troubleshooting standard)

Strong
  • Examples: status bits, protection triggers, PG drop, UV/OV/OCP/OTP flags, fault-log first trigger.
  • Good for: establishing the event sequence and true root trigger chain (what happened first, then what followed).
  • Common trap: ignoring the order of events and only reading the final fault (“last hit” bias).
Medium
  • Examples: temperature trends, fan RPM, derate states, long-window Iout/Vout stability.
  • Good for: explaining why margins shrink (thermal headroom, derating) and why a unit exits/enters under redundancy.
  • Common trap: treating a single temperature number as a universal hotspot indicator.
Weak
  • Examples: averaged power, efficiency estimates, short-window V/I during transients, any value with heavy filtering.
  • Good for: context and sanity checks; should not be used alone to “convict” a root cause.
  • Common trap: believing “telemetry looks normal” during a reset event (sampling mismatch).

How to read evidence in the right order (5-step field workflow)

1) Status & triggers first: Confirm which protection or state transition actually occurred (PG, UV, OCP, OTP, derate). This determines the branch of investigation.
2) Fault log sequence: Use timestamps or event order to identify the first trigger and the follow-on cascade (input anomaly → bus discharge → output UV, etc.).
3) Thermal & redundancy context: Check whether one unit is consistently warmer or derating earlier; correlate with uneven Iout and OR-ing heat patterns.
4) Cross-unit comparisons: Compare PSU A vs PSU B for Iout/Vout/Temp/Fan. Relative mismatch is often more meaningful than absolute values.
5) Averaged numbers last: Treat averaged power and filtered telemetry as supporting evidence only, especially around fast transient events.
Figure F8 — Telemetry taps: what is observed vs blind spots + trust ladder
[Figure F8: simplified power path AC in → PFC → HV bus → LLC + SR → DC out, with telemetry taps for Vin, Iin, Vbus, Vout, Iout, Temp/Fan, and Status/Log, plus regions marked “Not observed.” Trust ladder: Strong (status bits, protection triggers, fault-log order), Medium (temperature trends, fan RPM, derate states), Weak (averaged V/I/P, filtered values during transients).]
Reading guide: taps show what the PSU exposes, while “Not observed” highlights blind spots (fast/local events). Use the trust ladder to prioritize status/log evidence over averaged numbers when diagnosing resets or redundancy jitter.

H2-9 — Fan & thermal control inside PSU

Thermal control loop

PSU thermal behavior is a closed loop: internal loss hot-spots drive temperature rise, sensors observe selected points, control logic commands fan RPM and derating, and the resulting airflow and power limit reshape the hot-spot map. Stability, noise, and lifetime depend on how this loop is structured.

Internal hot zones (where heat concentrates)

Hot zones vary by load, airflow, and redundancy imbalance. The most useful model groups them by loss mechanism: conduction-heavy devices, switching-heavy devices, magnetics, and the control board’s local regulators/drivers.

Conduction-heavy: PFC switches/rectification paths, secondary synchronous rectifiers, and OR-ing devices often dominate at high current and when circulation currents exist.
Switching-heavy: Primary LLC switches and gate-drive regions can heat rapidly under wide input range and high-frequency operation.
Magnetics: Transformer and resonant components can run warm even when silicon looks fine, especially when airflow is uneven or dusty.
Control board: Local regulators, sensing networks, and drivers can become hidden hot-spots that bias temperature readings and trigger derating earlier than expected.

Fan control strategies (curve vs closed-loop vs power-based)

Fixed curve: Predictable and simple. Less adaptive to localized hot-spots and airflow changes, but often stable and repeatable across units.
Temperature closed-loop: Tracks measured temperature points and can reduce steady-state temperature. Risk: RPM “hunting” near thresholds if sensing is noisy or loop gain is high.
Power closed-loop: Uses output power/load proxies to anticipate heat. Fast response to load steps, but can be conservative when efficiency and airflow vary with conditions.
Engineering trade-off: The practical triangle is noise vs lifetime vs thermal headroom. Data-center behavior typically prefers predictable margins over aggressive acoustic optimization.

Derating semantics (gradual power limit vs hard shutdown)

Derating should behave as a controlled reduction of Pout_limit (or equivalent current limit), not as an abrupt latch-off. A gradual limit allows the PSU to stay online while protecting silicon and magnetics, and it reduces the chance of a “handover storm” in redundant operation.

Typical triggers: Temperature rise, fan abnormality, and sustained imbalance that concentrates heat in one unit. Input anomalies can also increase internal loss and accelerate temperature rise.
Common chain in redundancy: Hotter unit derates → current share drifts → OR-ing and conduction paths heat more → unit exits/enters repeatedly → bus disturbance increases.
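
The difference between a gradual limit and an abrupt latch-off can be shown with a simple piecewise-linear derating curve. This is only one possible shape. The rated power, the 85 °C derate start, the 105 °C shutdown point, and the 50% floor are hypothetical numbers, and `pout_limit_w` is an illustrative name.

```python
def pout_limit_w(temp_c: float, rated_w: float = 1200.0,
                 derate_start_c: float = 85.0, shutdown_c: float = 105.0,
                 floor_fraction: float = 0.5) -> float:
    """Return the allowed output power for a given hotspot temperature.

    Below derate_start_c the full rating is allowed; between derate_start_c and
    shutdown_c the limit falls linearly toward floor_fraction of the rating;
    at or above shutdown_c the unit is treated as off.
    """
    if temp_c >= shutdown_c:
        return 0.0
    if temp_c <= derate_start_c:
        return rated_w
    progress = (temp_c - derate_start_c) / (shutdown_c - derate_start_c)
    return rated_w * (1.0 - (1.0 - floor_fraction) * progress)

for t in (70.0, 85.0, 90.0, 95.0, 100.0, 105.0):
    print(f"{t:5.1f} °C -> Pout_limit {pout_limit_w(t):7.1f} W")
```

In a redundant pair, a curve like this lets the hotter unit shed load in small steps while its partner picks up the difference, rather than dropping out entirely and forcing an abrupt handover.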
Figure F9 — PSU thermal loop: hot zones, airflow, sensors, fan control, and Pout_limit derating
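[Figure F9: PSU enclosure with airflow from the fan (tach feedback) across hot zones: PFC, LLC switches, transformer, SR, OR-ing, and control board; sensors T1, T2, T3, and inlet. Fan control maps temperature and Pout to RPM; the derate block (temperature/fan inputs) applies a gradual Pout_limit. Chain: Temp → RPM → Pout_limit (illustrative only).]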
Reading guide: sensors observe selected points (T1/T2/T3/Inlet), control commands fan RPM, and derating gradually reduces Pout_limit to avoid hard shutdowns and redundancy “handover storms.”

H2-10 — Protections & fault behaviors (and what upstream sees)

Protection mindset

Protection behavior is best understood by its action mode (latch / hiccup / foldback) and by the event sequence recorded in status bits and fault logs. External symptoms (restarts, jitter, unit drop-out) are signatures of these internal actions.

Protection families (grouped by what they protect)

Output stress: OCP/short-circuit handling, OPP, and foldback behaviors that prevent silicon and magnetics from exceeding safe operating limits.
Output quality: UVP/OVP actions that keep the DC output within the allowed window and coordinate with PSU_OK semantics.
Thermal & mechanical: OTP and fan-fail handling (tach loss / blocked fan). These often appear as derating first, then stronger actions if temperature continues rising.
Input anomalies (PSU-side): Input undervoltage/brownout-like conditions can trigger controlled shutdown or reduced power limit, and should be correlated with status/log sequence.

Action modes: latch-off vs hiccup vs foldback (why external symptoms differ)

Latch-off: Stops repeated stress by requiring a clear reset/retry condition. External signature: persistent “off” until a reset condition is met.
Hiccup: Periodically retries after a fault. External signature: rhythmic restart pulses, often misread as “random reboot” if evidence is averaged.
Foldback / limiting: Reduces output capability to stay within safe limits. External signature: output sags or never reaches the expected level under load, without fully latching off.
Key discriminator: The “rhythm” (periodic vs persistent vs bounded sag) is a fast clue, but the final answer should come from Status/Log order.
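
As a rough illustration of reading the "rhythm", the sketch below labels a sampled Vout trace as latch-off, hiccup, or foldback. The function name and the thresholds (5% regulation band, 20% "off" level, two restarts to call it hiccup) are assumptions chosen for the example. The result is only a first guess and does not replace status/log order.

```python
def classify_action_mode(vout, v_nom=12.0, reg_band=0.05,
                         off_frac=0.2, min_restarts=2):
    """Guess the protection action mode from a list of Vout samples.

    Thresholds are illustrative assumptions, not specification values.
    """
    off_level = off_frac * v_nom           # below this, treat the output as off
    low_level = (1.0 - reg_band) * v_nom   # below this, output is out of regulation
    is_off = [v < off_level for v in vout]

    # Count off -> on transitions (restart attempts).
    restarts = sum(1 for prev, cur in zip(is_off, is_off[1:]) if prev and not cur)

    if any(is_off):
        if is_off[-1] and restarts == 0:
            return "latch-off"      # went off once and stayed off
        if restarts >= min_restarts:
            return "hiccup"         # repeated retry rhythm
        return "indeterminate"      # one restart only: capture a longer window
    if any(v < low_level for v in vout):
        return "foldback"           # bounded sag without a full shutdown
    return "normal"
```

For example, `classify_action_mode([12, 0, 12, 0, 12, 0])` returns `"hiccup"`, while a trace that sags to about 10.5 V and stays there returns `"foldback"`.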

Minimal coordination interfaces (PSU-side only)

Only the PSU-side meaning is in scope here: how these signals are produced and how they align with protection actions. System handling is intentionally not covered.

PSU_OK (abstract): Reflects internal regulation/health state. The timing around faults is a strong clue when aligned with status/log evidence.
Present (abstract): Indicates insertion/availability. Useful for distinguishing true removal from internal latch/lockout.
Alert (abstract): Signals that a status change occurred. Best used as a trigger to read status bits and fault logs promptly.
PMBus status & fault log: Carries the “first trigger → cascade → final action” story that external symptoms alone cannot reliably reveal.
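
To make "read status bits" concrete, here is a minimal sketch that decodes a 16-bit STATUS_WORD value into flag names. The bit names follow the standard PMBus STATUS_WORD layout as commonly documented. A given PSU's firmware may use MFR_SPECIFIC bits or deviate in places, so treat the table as an assumption to confirm against the unit's own PMBus documentation. The function name is hypothetical.

```python
from __future__ import annotations

# Bit position -> flag name (standard PMBus STATUS_WORD layout; verify per PSU).
STATUS_WORD_BITS = {
    15: "VOUT",
    14: "IOUT/POUT",
    13: "INPUT",
    12: "MFR_SPECIFIC",
    11: "POWER_GOOD#",      # set when power-good is negated
    10: "FANS",
    9: "OTHER",
    8: "UNKNOWN",
    7: "BUSY",
    6: "OFF",
    5: "VOUT_OV_FAULT",
    4: "IOUT_OC_FAULT",
    3: "VIN_UV_FAULT",
    2: "TEMPERATURE",
    1: "CML",
    0: "NONE_OF_THE_ABOVE",
}


def decode_status_word(word: int) -> list[str]:
    """Return the names of all set bits, most significant first."""
    if not 0 <= word <= 0xFFFF:
        raise ValueError("STATUS_WORD must be a 16-bit value")
    return [name
            for bit, name in sorted(STATUS_WORD_BITS.items(), reverse=True)
            if word & (1 << bit)]
```

A reading of `0x0844` decodes to `["POWER_GOOD#", "OFF", "TEMPERATURE"]`: the unit is off, power-good is negated, and a temperature condition is flagged. That points to the thermal registers and the fault log as the next things to read.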

Symptom-to-protection map (fast field table)

External symptom | Likely action mode | What to read first (strong evidence) | Next action direction
Rhythmic restarts / periodic output pulses | Hiccup | Status bits + fault-log order (first trigger vs final UV) | Confirm whether the first trigger is OCP/short, OTP, or input anomaly; correlate timing with PSU_OK changes.
Stays off until reset condition | Latch-off | Latch-related status + persistent fault flag | Identify the severity class (short/OV/OTP). Avoid relying on averaged power numbers around the event.
Output droops under load, no full shutdown | Foldback / limiting | Current-limit/OPP flags + status word transitions | Check whether limiting is thermal-initiated (derate chain) or load-initiated (OCP/OPP chain).
One redundant unit keeps dropping in/out | Derate → drop-out chain | Per-unit temp/fan/derate flags + share/OR-ing related events (if present) | Compare PSU A vs PSU B; look for thermal asymmetry first, then validate share integrity and OR-ing heating patterns.
“Jittery” bus behavior during transitions | Coupled loops / transitions | Event order + PSU_OK edges + fault-log markers | Use time sequence: trigger → limiting/derate → exit/entry. Do not conclude from averaged telemetry alone.
Figure F10 — Fault timeline: internal triggers, PSU_OK edges, and log markers (sequence matters)
[Diagram labels: time markers t0 to t4; traces Vin, Vbus, Vout, Iout, PSU_OK, Status/Log; events E1, E2, E3 (labeled Trigger, Limit, Hiccup); evidence priority: Status/Log > Trends > Averages.]
Reading guide: use the sequence (E1→E2→E3) to identify the first trigger and follow-on cascade. External rhythm suggests action mode (hiccup/latch/foldback), but status bits and log order should decide the root cause.
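
The "sequence matters" rule can be sketched as a small helper that orders exported log entries and separates the first trigger from the cascade and the final action. The `Event` fields (`seq`, `t_ms`, `flag`) and the flag names are hypothetical; real fault-log formats are vendor-specific. Ordering uses the log sequence number rather than the timestamp, on the assumption that timestamps may be too coarse to separate events.

```python
from typing import NamedTuple, Optional


class Event(NamedTuple):
    seq: int    # log sequence number (hypothetical field)
    t_ms: int   # coarse timestamp in milliseconds (hypothetical field)
    flag: str   # decoded fault or status flag name


def summarize(events) -> Optional[dict]:
    """Split an event list into first trigger, cascade, and final action."""
    ordered = sorted(events, key=lambda e: e.seq)
    if not ordered:
        return None
    return {
        "first_trigger": ordered[0].flag,
        "cascade": [e.flag for e in ordered[1:-1]],
        "final_action": ordered[-1].flag,
    }
```

With three entries that share one coarse timestamp, for example `VIN_UV` (seq 1), `IOUT_OC` (seq 2), and `VOUT_UV` (seq 3), the helper still reports `VIN_UV` as the first trigger and `VOUT_UV` as the final action.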

H2-11 · Validation & compliance checklist (R&D / production / compliance)

This section defines “done” for a CRPS/server PSU using a traceable evidence bundle: electrical performance plots, protection timing, redundancy stability, PMBus calibration, fault-log decode examples, and safety/EMC test-point mapping.

Definition of done (evidence package)

Evidence categories: Electrical · Reliability · Safety · EMI/EMC · Production test · Telemetry & logs
Deliverables: efficiency/PF/THD curves + ripple/noise & transient plots + hold-up proof + protection timing traces + parallel-share stability results + PMBus calibration report + fault-log decode examples + compliance test-point map.
  • Golden configuration: AC range, output voltage (12V/48V), firmware revision, PMBus address map, fan profile, protection thresholds & action modes (latch/hiccup/foldback).
  • Golden load scripts: steady-state sweep, dynamic step, burst/idle, brownout profile, redundancy entry/exit, and (if supported) hot-insertion/removal behavior.
  • Golden log scripts: event trigger definitions, timestamp resolution, ring-buffer depth, export format, and known-good decode examples.
Scope note: this checklist covers PSU internals and PSU-facing interfaces only. Downstream distribution/hot-swap details belong to sibling pages and are not expanded here.

R&D electrical validation (what to measure, how to prove)

Test item | Pass evidence (what to capture) | Typical failure signature → likely root
Efficiency curve | η vs load at low/mid/high line; thermal steady-state noted; fan profile documented. | Mid-load dip → ZVS margin/transformer parasitics; light-load loss → burst/SR reverse current; high-load loss → magnetics/rectification/OR-ing conduction.
PF / THD | PF, THD vs load; harmonic snapshot at representative loads; line conditions recorded. | THD spikes at light load → mode hopping; PF drop at high load → current limit or inductor saturation; noise correlation → burst/skip behavior.
Ripple & noise | Ripple measured with standardized probe method; bandwidth limit stated; worst-case condition identified. | HF spikes → SR commutation/layout; LF ripple → loop gain/cap ESR; “random bursts” → burst-mode or pre-trigger protection.
Transient response | Load step plots (ΔI, di/dt, settle time); peak deviation and recovery; operating mode annotated. | Slow recovery → compensation; overshoot → secondary dynamics/SR timing; repeated kicks → current-limit interaction.
Hold-up time | AC removal to Vout drop; Vbulk discharge trace; load and trigger points defined; log alignment shown. | Short hold-up → bulk energy/brownout threshold; steps → OR-ing/share interaction; “telemetry looks OK” → sampling/averaging mismatch.
Discipline: every plot must state line (low/mid/high), load, fan mode, thermal stabilization, and instrument setup (bandwidth/averaging/shunt).

Reliability & stress validation (what breaks first)

Reliability validation should focus on hotspot parts, electrical stress under abnormal conditions, repetitive fault behavior, and redundancy dynamics that can amplify thermal imbalance.

Compliance checklist (safety + EMI/EMC) — PSU-side scope

Typical families: safety (IEC 62368-1 and related), emissions (CISPR family), immunity (IEC 61000-4-x family), and surge/ESD at PSU interfaces.
  • Safety evidence: insulation system description, creepage/clearance rule set, hipot test points and acceptance criteria, protective earth continuity, and touch-current/leakage methodology (as applicable).
  • EMI evidence: conducted-emissions setup, worst-case operating points (line/load/fan mode), and A/B comparison after any fix.
  • Immunity evidence: surge/EFT/ESD stress points mapped to AC inlet and signal ports; PSU response captured as Vbulk/Vout plus protection bits/log entries (not only “survived”).
The goal is not to list standards, but to publish test points, evidence formats, and repeatable re-test procedures that tie back to telemetry and logs.

Production test (minimum set) + calibration & logging

Production tests should prioritize hard-to-rework risks (HV side, magnetics, redundancy/protection behavior) and hard-to-debug field issues (telemetry calibration and log usability).

Recommended output per unit: a “Calibration & Identity record” (serial, FW, coefficients, threshold profile ID, log-format version) for field traceability and batch analysis.

Reference material list (example part numbers by function)

These part numbers are representative examples (not the only valid choices). Final selection depends on power level, input range, topology, and supply strategy.

  • PFC controllers: UCC28070 (2-phase interleaved CCM PFC), UCC28180 (CCM PFC), NCP1654 (CCM PFC), L6562A (transition/CRM-class PFC controller).
  • LLC / resonant controllers: UCC256404 (LLC resonant controller), L6599A (resonant half-bridge controller), NCP1397 (resonant controller with HV drivers).
  • Synchronous rectifier control: UCC24612 (SR controller example).
  • Digital control / PMBus endpoints: UCD3138-class digital controller; INA233-class PMBus power monitor (telemetry nodes).
  • ORing / current share: LTC4359 (ideal diode controller), LTC4370 (current sharing + ideal diodes).
  • Fan PWM + tach: EMC2305 (fan controller example).
Figure F11 — Validation flow & PSU test-point map (PSU-side)

H2-12 · FAQs (CRPS / Server PSU)

These FAQs focus on PSU-internal causes and PSU-facing interfaces: OR-ing/current share stability, PMBus telemetry and fault logs, brownout/hold-up behavior, fan/thermal control, protection modes, and validation practices.

1 Why is current sharing very uneven after two redundant PSUs are paralleled—what share/OR-ing clues come first?

Start by verifying whether the imbalance is real: compare each unit’s IOUT source (shunt/DCR vs estimate) and confirm both units see the same bus voltage at the OR-ing output. Next check for OR-ing drop mismatch and reverse-current blocking behavior, then confirm the share method (droop vs active share bus) and whether the share signal is present, stable, and correctly terminated. (Examples: LTC4359, LTC4370)

Check first: per-PSU IOUT credibility → OR-ing ΔV/temperature → share-bus validity/stability.
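
For droop-based sharing, a small worked model shows how sensitive the split is to setpoint mismatch. The sketch assumes each unit behaves as an ideal voltage source behind an effective droop resistance (droop plus OR-ing path) and that the OR-ing device blocks reverse current. The function name and all numbers are illustrative assumptions.

```python
def droop_share(v1_set, v2_set, r1, r2, i_load):
    """Return (i1, i2, v_bus) for two droop-controlled units feeding one bus.

    Model (assumption): v_bus = v_set - r * i for each unit, i1 + i2 = i_load.
    """
    i1 = (v1_set - v2_set + r2 * i_load) / (r1 + r2)
    # OR-ing blocks reverse current, so neither unit can sink current.
    i1 = min(max(i1, 0.0), i_load)
    i2 = i_load - i1
    # The bus follows whichever unit is actually conducting.
    v_bus = v1_set - r1 * i1 if i1 > 0 else v2_set - r2 * i2
    return i1, i2, v_bus


# 50 mV setpoint mismatch, 2 mOhm effective droop per unit, 100 A load.
i1, i2, v_bus = droop_share(12.20, 12.15, 0.002, 0.002, 100.0)
print(f"I1={i1:.1f} A  I2={i2:.1f} A  Vbus={v_bus:.3f} V")
```

In this example a 50 mV mismatch turns a nominal 50/50 split into roughly 62.5 A and 37.5 A. That is why per-unit IOUT credibility and OR-ing ΔV are the first things to check.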
2 Sharing is fine at first, then PSUs “fight” and oscillate—what are the most common causes?

Time-dependent “fighting” is usually driven by drift or a marginal control loop: temperature drift in current sensing (DCR/shunt calibration), OR-ing heating that changes effective droop, or noise/latency on the share bus that turns sharing into a positive feedback loop. Correlate current swing with temperature rise and share-bus noise; then tighten filtering/compensation and re-check stability under hot steady-state, not only at cold start. (Examples: LTC4370, INA233)

Check first: IOUT swing vs temperature → share-bus noise/latency → hot steady-state stability margin.
3 Why does the OR-ing FET / ideal diode run hot—circulating current or reverse backfeed, and how to tell?

Distinguish direction. Reverse backfeed shows up when one PSU is off or ramping and current flows “backwards” through its OR-ing path; circulating current happens when both are on and small voltage mismatches drive loop current. Measure OR-ing voltage drop and temperature, then do a minimal isolation test: temporarily remove one unit or disable share and observe whether heat disappears. Fault logs that flag reverse-current events are strong evidence. (Examples: LTC4359)

Check first: OR-ing ΔV + temperature → one-unit-off test → reverse-current / OR-ing fault flags.
4 PMBus power looks stable, but the system still reboots—why can telemetry “miss” transients?

Many PMBus readings are low-rate averages; a fast droop or brief protection event can occur entirely between updates and be “smoothed out.” Use a time-aligned evidence chain: capture VOUT and VBULK on a scope, then correlate with status-bit edges and fault-log timestamps. If the log resolution is coarse, rely on protection flags and restart counters rather than averaged power. Consider raising telemetry update rate or adding dedicated fast droop detection at PSU scope. (Examples: INA233, UCD3138)

Check first: VOUT/VBULK scope trace → status-bit edges → fault log timestamp granularity.
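
The averaging effect is easy to reproduce with synthetic numbers. The sketch below builds one telemetry window of Vout samples containing a brief droop and compares the averaged reading with the true minimum. The window length, droop depth, and undervoltage threshold are assumptions chosen for illustration.

```python
def window_report(samples, uv_threshold):
    """Compare what an averaged reading shows with what actually happened."""
    avg = sum(samples) / len(samples)
    lowest = min(samples)
    return {
        "average": avg,
        "minimum": lowest,
        "average_sees_uv": avg < uv_threshold,
        "minimum_sees_uv": lowest < uv_threshold,
    }


# Assumed scenario: 100 ms window sampled at 10 kHz (1000 samples),
# with a 1 ms droop from 12.0 V down to 10.8 V in the middle.
samples = [12.0] * 1000
for i in range(500, 510):
    samples[i] = 10.8

report = window_report(samples, uv_threshold=11.4)
print(report)
```

The averaged value stays near 11.99 V and never crosses the 11.4 V threshold, while the true minimum is 10.8 V. A reading like this is consistent with "telemetry looks stable" even though the load saw a real droop.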
5 Is 80 PLUS Titanium enough to judge data-center energy impact—where do real deployments still get burned?

80 PLUS certifies efficiency only at a few fixed load points (10%, 20%, 50%, and 100% of rated load for Titanium) under specified test conditions; it does not guarantee comparable efficiency across the actual operating window. Real energy surprises often come from redundancy operation at low load (efficiency falloff), thermal conditions that shift losses, fan power that rises sharply with heat, and mode transitions at light load (burst/skip) that trade efficiency for stability/noise. A deployment-relevant proof is a hot steady-state efficiency map across the true load distribution and redundancy modes. (Examples: UCC28070, UCC256404)

Check first: hot efficiency map across real load distribution + redundancy mode + fan power contribution.
6 Light-load squeal/noise or sudden efficiency drop—does it usually come from PFC or LLC mode switching?

Use “when and how” to separate causes. Noise that tracks line voltage or PF/THD behavior often implicates PFC mode hopping; noise that appears at a load threshold points to LLC burst/skip and SR behavior (including reverse current or discontinuous commutation). Sweep load at fixed temperature and capture PFC inductor-current envelope plus LLC switching frequency region; look for sudden frequency jumps, burst packets, or SR timing anomalies that align with audible bands. (Examples: UCC28180, UCC256404)

Check first: load-threshold vs line-dependent symptom → PFC current envelope + LLC frequency region alignment.
7 A small brownout triggers shutdown or repeated restarts—how to reason about the PFC/Vbulk strategy?

Brownout is usually a Vbulk story: as input sags, PFC can no longer hold the bulk bus, and control may choose to derate, shut down cleanly, or attempt a restart. If thresholds are tight or hysteresis is weak, the system can “hunt” and restart repeatedly. Inspect VIN, VBULK, brownout status bits, restart counters, and inrush limiter temperature; then tune the brownout threshold, restart delay, and the shutdown sequence so the exit is deterministic and logged. (Examples: UCC28070, NCP1654)

Check first: VIN/VBULK trace + brownout flags → restart policy (delay/hysteresis) → clean shutdown sequencing.
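
The "hunting" behavior can be illustrated with a two-threshold comparator run over a Vbulk trace. This is a simplified sketch: it models only the off/on thresholds and ignores restart delay and soft-start. The function name, trace values, and thresholds are assumptions.

```python
def count_restarts(vbulk_trace, v_off, v_on):
    """Count off -> on transitions for a comparator with hysteresis.

    The unit stops when Vbulk falls below v_off and restarts only
    once Vbulk has recovered to at least v_on.
    """
    if v_on < v_off:
        raise ValueError("v_on must be at or above v_off")
    running = True
    restarts = 0
    for v in vbulk_trace:
        if running and v < v_off:
            running = False
        elif not running and v >= v_on:
            running = True
            restarts += 1
    return restarts


# Vbulk hovering around the shutdown threshold during a shallow brownout.
trace = [330, 318, 322, 319, 323, 318, 324, 340]
print(count_restarts(trace, v_off=320, v_on=321))  # weak hysteresis
print(count_restarts(trace, v_off=320, v_on=335))  # wider hysteresis
```

With 1 V of hysteresis the same trace produces three restarts; widening the restart threshold to 335 V reduces that to a single, clean recovery.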
8 Hold-up time is short—what are the most common causes, and what is the lowest-cost way to fix it?

The top causes are: insufficient usable bulk energy (cap tolerance, aging, temperature), a brownout/shutdown policy that exits too early, or a load profile that has higher transient demand than assumed. Lowest-cost fixes usually come from control/policy first: adjust brownout thresholds/hysteresis, optimize ride-through behavior, and verify the worst-case load step during dropout. Only then consider increasing bulk capacitance or altering bus-voltage targets. Prove the fix with a Vbulk discharge trace and aligned event logs. (Examples: UCD3138)

Check first: Vbulk discharge + shutdown threshold → dropout load profile → policy tuning before adding capacitance.
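
A first-order estimate helps separate "not enough usable energy" from "policy exits too early". The sketch uses the standard capacitor energy relation, t = η·C·(V_start² − V_min²) / (2·P_out), where V_min is the bulk voltage at which the downstream stage stops regulating. All component values below are illustrative assumptions, and the model ignores ESR and any load change during dropout.

```python
def holdup_time_s(c_bulk_f, v_start, v_min, p_out_w, eff=0.96):
    """Estimate hold-up time from usable bulk-capacitor energy.

    eff is the assumed efficiency of the stage between Vbulk and the output.
    """
    if v_min >= v_start:
        return 0.0
    usable_energy_j = 0.5 * c_bulk_f * (v_start ** 2 - v_min ** 2)
    return eff * usable_energy_j / p_out_w


# Assumed example: 940 uF bulk, 390 V nominal, 300 V minimum, 2 kW load.
nominal = holdup_time_s(940e-6, 390.0, 300.0, 2000.0)
aged = holdup_time_s(0.8 * 940e-6, 390.0, 300.0, 2000.0)  # 20% capacitance loss
print(f"nominal: {nominal * 1e3:.1f} ms, aged: {aged * 1e3:.1f} ms")
```

In this example the nominal estimate is about 14.0 ms and drops to about 11.2 ms with 20% capacitance loss. Because V_min enters as a squared term, lowering the shutdown threshold often buys more time than a modest capacitance increase, which is consistent with tuning policy before adding capacitance.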
9 Fans suddenly ramp to maximum, but temperature looks “not high”—sensor placement error or control strategy?

Separate measurement from policy. A “not high” temperature may be a cool sensor location while hotspots rise elsewhere, or a sensor offset/jump. The other common cause is policy-driven ramp: power-based control, input-abnormal derating, or sharing imbalance triggers preemptive cooling. Compare multiple temperature channels, fan target vs actual RPM, and whether the event aligns with POUT, VIN flags, or share anomalies. If available, log the controller’s chosen fan mode and the trigger that caused the ramp. (Examples: EMC2305, TMP117)

Check first: multi-sensor correlation → fan target vs RPM → policy triggers (power/input/share) in logs.
10 For over-temperature, is latch-off or hiccup better, and how does it affect redundancy stability?

Latch-off is safer and easier to diagnose, but it reduces availability until a manual or commanded recovery occurs. Hiccup can recover automatically, but in redundant systems it can repeatedly inject bus disturbances and trigger current re-distribution, sometimes causing the partner unit to run hotter and destabilize sharing. A robust approach is staged response: derate first, then controlled shutdown, with clear status bits and fault-log ordering. Validate that one-unit exit does not produce a bus dip that cascades into a second unit event. (Examples: LTC4370)

Check first: derating-before-shutdown policy + clear logs → verify “single-unit exit” does not disturb the bus.
11 One PSU repeatedly drops out and returns; externally only small voltage wiggles—how to use logs to find root cause?

Build an event timeline. Start from the first fault flag (not the averaged readings): identify whether the trigger is protection (OCP/OTP/UVP), input abnormality, fan fault, OR-ing reverse-current, or share instability. Then correlate status-word transitions with Vbulk/Vout scope captures and the restart counter. If timestamps are coarse, the ordering of flags still matters: input-related events typically precede Vbulk collapse, while share/OR-ing issues often show reverse-current or current-limit hints before Vout movement. Export the last N events each time to avoid ring-buffer overwrite. (Examples: UCD3138, INA233)

Check first: first fault flag → scope alignment (Vbulk/Vout) → restart counter + ordered status transitions.
12 In validation, what predicts real field stability (not just “nice lab plots”)?

Field stability is best predicted by stress cases that mirror deployment reality: hot steady-state operation, brownout/short interruption profiles, hold-up under true load transients, redundancy entry/exit without bus disturbance, and deterministic protection behavior that produces actionable logs. A “done” PSU has repeatable evidence: efficiency/PF/THD across the real load window, Vbulk/Vout transient captures, and fault logs that correctly identify root-class (input, thermal, share/OR-ing, protection) with consistent sequencing. Production calibration and log export integrity are part of stability, not an afterthought. (Examples: UCC28070, UCD3138)

Check first: hot + brownout + redundancy dynamics + log usefulness (repeatable root-class identification).