CRPS / Server PSU: PFC+LLC, PMBus Telemetry, Redundancy
A CRPS/server PSU is not only an AC-to-DC converter—it is a controlled power subsystem designed for redundancy: stable current sharing, fast protection behavior, and trustworthy PMBus telemetry/fault logs that keep the server bus alive through real-world brownouts, transients, and thermal stress.
H2-1 — What is CRPS / Server PSU & Boundary
A CRPS (Common Redundant Power Supply) is a hot-redundant server power module that converts AC (optionally HVDC in some deployments) into a regulated DC rail (commonly 12 V or 48 V) using a PFC front end plus an isolated LLC stage with synchronous rectification. It supports N+1 / 1+1 redundancy via OR-ing and current-share control, while exposing PMBus telemetry and fault logs to improve serviceability.
This page focuses on the PSU module itself (inside-the-box conversion, redundancy interface, and observability), not downstream distribution or board-level point-of-load regulation.
What this page covers
- Input domain: EMI/surge/inrush handling and brownout behavior relevant to server uptime.
- Conversion chain: PFC → HV bus energy → isolated LLC regulation → SR/output filtering → hold-up.
- Redundancy interface: OR-ing (ideal diode / OR-ing FET) and current-share (droop vs active share).
- Observability: PMBus telemetry, status words, fault logs, and PSU-internal fan/thermal derating.
What this page does NOT cover (link out only)
- 48 V / 12 V bus hot-swap controllers, eFuses, downstream power distribution.
- CPU VRM / DDR rails / on-board sequencing and point-of-load stability details.
- Rack PDU metering / switching, facility power monitoring, site-level energy analytics.
- BMC/IPMI/Redfish workflows, KVM/IP video/USB transport details, platform security deep dives.
H2-2 — Electrical Architecture Map (AC → DC, domain breakdown)
A server PSU is best understood as six tightly coupled domains. Each domain owns specific KPIs, produces characteristic failure symptoms, and exposes distinct telemetry clues. Mapping symptoms to domains prevents misdiagnosis (for example, trusting stable average power readings while a fast transient triggers a reset).
Domain 1: AC input front end (EMI / surge / inrush)
- Controls: conducted EMI, surge energy, inrush stress during plug-in / brownout recovery.
- KPIs: EMI margin, surge robustness, repeatable start behavior.
- Common pitfalls: nuisance trips on brownout; “starts only sometimes” due to marginal inrush strategy.
- Typical symptoms: repeated restart attempts; input-side protection events.
- Telemetry clues: input status flags, brownout counters (if available), start attempt logs.
Domain 2: PFC stage and HV bus
- Controls: input current shaping and HV bus energy with PF/THD constraints.
- KPIs: PF, THD, efficiency across load range, stable bus regulation.
- Common pitfalls: light-load mode transitions causing audible noise or EMI peaks; aggressive current limiting leading to brownout sensitivity.
- Typical symptoms: bus sag under step load; unstable PF/THD in certain load bands.
- Telemetry clues: bus voltage trend, input power factor estimate, PFC fault/status bits.
Domain 3: Isolated LLC stage
- Controls: isolated power transfer and regulation via frequency control over a resonant tank.
- KPIs: efficiency, stability over input range, light-load behavior, dynamic response.
- Common pitfalls: gain margin erosion at extremes (frequency limits); burst/skip transitions that disturb output or create acoustic noise.
- Typical symptoms: intermittent output ripple spikes; “clean at full load, messy at light load.”
- Telemetry clues: switching mode indicators (if exposed), regulation fault flags, correlated thermal rise on primary devices.
Domain 4: Secondary rectification, output filtering, and hold-up
- Controls: synchronous rectification timing, output ripple/noise, hold-up delivery during input loss.
- KPIs: ripple/noise, transient droop, hold-up time, thermal on SR path.
- Common pitfalls: SR timing drift at light load; hold-up shortfall due to bus energy strategy limits, not only capacitor size.
- Typical symptoms: reset on AC dropouts; output droop not visible in slow telemetry.
- Telemetry clues: fast fault logs (UV/PG events), bus discharge signature, output current/voltage snapshots around events.
Domain 5: Redundancy interface (OR-ing and current share)
- Controls: reverse blocking, hot-redundant insertion/removal, stable load sharing (droop or active share).
- KPIs: share accuracy, share stability, minimal bus disturbance on unit drop/add.
- Common pitfalls: share-loop oscillation; OR-ing device thermal stress; share signal integrity issues.
- Typical symptoms: one PSU “hunts” or drops out; unexplained heating; bus micro-dips during failover.
- Telemetry clues: per-unit current mismatch trend, OR-ing thermal flags, share-related status bits, dropout timestamps.
Domain 6: Digital control, telemetry, and thermal management
- Controls: sensing, reporting, fan curves, derating, and fault handling policies.
- KPIs: telemetry trustworthiness, useful logs (time correlation), controlled derating vs hard shutdown.
- Common pitfalls: slow averaging masks fast droops; mismatched sensor placement causes fan “overreaction.”
- Typical symptoms: telemetry looks stable while resets occur; fan ramps without obvious temperature cause.
- Telemetry clues: status word hierarchy, log depth/retention, sensor-to-event correlation quality.
H2-3 — Front-end: EMI, surge, inrush & brownout behavior
The AC front end must survive real data-center events (plug-in/inrush, short sags, generator/UPS transfers) without turning an energy shortage into a misclassified overcurrent fault. Robust behavior depends on coordinated inrush limiting, brownout detection, controlled derating, and event logging that preserves the sequence of causes and effects.
This section stays on the PSU input domain: EMI/surge/inrush handling and how brownout propagates into HV bus energy and output stability. Downstream distribution components are out of scope.
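The interaction between brownout detection and event logging can be illustrated with a small hysteretic comparator. This is a minimal sketch, not a description of any particular controller: the thresholds (80 V / 90 V RMS), the sample data, and the event names are assumptions chosen for illustration. It shows how separate drop and recover levels keep a sag that hovers near the threshold from producing repeated enter/exit events.

```python
from dataclasses import dataclass


@dataclass
class BrownoutDetector:
    """Hysteretic brownout comparator that logs state changes (illustrative only)."""
    v_drop: float = 80.0      # assumed: declare brownout below this RMS input (V)
    v_recover: float = 90.0   # assumed: clear brownout only above this level (V)
    in_brownout: bool = False

    def update(self, t_ms: int, vin_rms: float, log: list) -> bool:
        if not self.in_brownout and vin_rms < self.v_drop:
            self.in_brownout = True
            log.append((t_ms, "BROWNOUT_ENTER", vin_rms))
        elif self.in_brownout and vin_rms > self.v_recover:
            self.in_brownout = False
            log.append((t_ms, "BROWNOUT_EXIT", vin_rms))
        return self.in_brownout


# Hypothetical sag that hovers around 80 V before the line recovers.
samples = [(0, 115.0), (10, 82.0), (20, 79.0), (30, 81.0),
           (40, 79.5), (50, 84.0), (60, 92.0)]

detector = BrownoutDetector()
events = []
for t_ms, vin in samples:
    detector.update(t_ms, vin, events)

for event in events:
    print(event)
```

With a single 80 V threshold the same samples would toggle state at 30, 40, and 50 ms; with the hysteresis band only one enter and one exit are logged, which keeps the recorded sequence readable.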
Event A — Plug-in / restore power: inrush and start reliability
Event B — Brownout / sag: avoid misclassifying energy deficit as OCP
Event C — Generator/UPS transfer: frequency/shape disturbances and repeated faults
H2-4 — PFC stage deep dive: topology & control (efficiency, THD, dynamics)
The PFC stage shapes the input current, sets the HV bus energy available for hold-up, and strongly influences light-load behavior (noise, EMI margin, and nuisance flags). For 1–3 kW server PSUs, the practical choice is usually among boost CCM, CRM, and TCM, with trade-offs that show up as field symptoms and telemetry patterns.
- What it must deliver: PF/THD compliance, efficiency across load range, bus stability under sags, and predictable restart behavior.
- Where it tends to fail: light-load mode transitions, current limiting under brownout, and EMI margin under distorted input.
- What to watch: bus voltage trend, PFC status/fault bits, and event timestamps around transfers and sags.
Topology comparison (CCM vs CRM vs TCM)
Each mode is summarized below with its best fit, typical pressure points, common symptoms, and the most useful telemetry clues.
CCM (continuous conduction mode)
- Best fit: high power, predictable bus regulation, stable PF/THD under load.
- Typical pressure: switching loss and EMI filtering complexity.
- Common symptoms: EMI margin sensitivity; thermal rise on switching devices at heavy load.
- Telemetry clues: bus ripple trend, thermal flags correlated with high load, PFC limit events.
CRM (critical conduction mode)
- Best fit: high efficiency in mid load; simpler current shaping in some designs.
- Typical pressure: variable frequency and EMI peaks in certain regions.
- Common symptoms: audible/EMI artifacts near mode boundaries; PF/THD variation by load band.
- Telemetry clues: event clusters around specific load levels; PF/THD estimate anomalies.
TCM (triangular current mode)
- Best fit: improved light-load efficiency and lower switching loss in some regions.
- Typical pressure: control complexity and sensitivity to distorted input.
- Common symptoms: nuisance flags during transfers; instability if thresholds are poorly tuned.
- Telemetry clues: mode-switch indicators (if exposed), restart/derate patterns during input anomalies.
Digital PFC control: loops, sampling, and limit policies
H2-5 — Isolated LLC stage deep dive (gain, light-load, dynamics & stability)
LLC behavior is defined by the resonant tank and its gain curve across switching frequency. Under wide input range and wide load range, the operating point can approach frequency limits or ZVS margins, where efficiency, noise, and regulation robustness can change sharply.
Scope: isolated LLC + synchronous rectification (SR) behavior inside the PSU. Downstream board-level conversion is out of scope.
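To make the gain-curve idea concrete, the sketch below evaluates the widely used first-harmonic approximation (FHA) of LLC tank gain. FHA is a standard simplified model brought in here for illustration, not something this page derives. The inductance ratio `ln = Lm/Lr` and the quality factors `qe` are assumed example values, and a real design needs validation beyond FHA.

```python
import math


def llc_gain(fn: float, ln: float, qe: float) -> float:
    """FHA voltage-gain magnitude of an LLC resonant tank.

    fn: normalized switching frequency f_sw / f_0
    ln: inductance ratio Lm / Lr
    qe: quality factor sqrt(Lr / Cr) / R_e (rises as load gets heavier)
    """
    real = (ln + 1.0) * fn ** 2 - 1.0
    imag = (fn ** 2 - 1.0) * fn * qe * ln
    return ln * fn ** 2 / math.hypot(real, imag)


# Assumed example tank: Ln = 5, swept over light, medium, and heavy load.
for qe in (0.2, 0.5, 1.0):
    row = ", ".join(
        f"fn={fn:.1f}: {llc_gain(fn, 5.0, qe):.3f}" for fn in (0.6, 0.8, 1.0, 1.5)
    )
    print(f"Qe={qe}: {row}")
```

At `fn = 1` the gain is unity regardless of load, which is why designs try to sit near resonance at nominal input. Below resonance, the light-load curve offers boost while the heavy-load curve can fall under unity, which is one way the "frequency limit" problem described above shows up.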
1) Gain curve and operating region (what the control can and cannot “move”)
2) Light-load + SR pitfalls (burst, reverse current, noise and efficiency dips)
3) Dynamics and stability (where parasitics create “strange” behavior)
H2-6 — Secondary rectification, output regulation & hold-up (ripple, transients, sampling mismatch)
Output quality is defined at the PSU interface: ripple/noise, transient behavior, and hold-up capability. Board-level VRMs may further shape rails, but the PSU must deliver a predictable interface and accurate event evidence when failures occur.
Output ripple and “sensitive loads” (interface-centric view)
Ripple and noise matter because they set the disturbance level presented to the server system input. A practical approach is to treat ripple as an interface requirement and focus on repeatable compliance across load bands, temperatures, and redundant sharing states. If ripple artifacts appear only in specific bands, it is often a signature of mode transitions (LLC/SR behavior) rather than random noise.
Hold-up time: where energy comes from (and why “just add capacitance” is not the only lever)
Hold-up is primarily driven by available HV-bus energy and the allowed bus discharge window. The key is not the capacitor value alone, but the usable energy delta between a safe upper bus level and the minimum bus level that still supports regulation.
E ≈ 1/2 · C · (V1² − V2²)
t_hold ≈ (E · η) / P_load
E is the usable stored energy, V1→V2 is the allowed HV-bus discharge window, η is conversion efficiency during hold-up, and P_load is the delivered output power. This estimate is used to size margin and to interpret logs, not as a full design derivation.
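The two estimates above can be worked through numerically. The following sketch uses assumed example values (1000 µF bulk capacitance, a 390 V to 300 V discharge window, 95 % efficiency, 1600 W load); none of these figures come from a specific PSU. It compares two levers: widening the discharge window versus adding capacitance.

```python
def usable_energy_j(c_bulk_f: float, v_start: float, v_min: float) -> float:
    """E ~ 1/2 * C * (V1^2 - V2^2), in joules."""
    return 0.5 * c_bulk_f * (v_start ** 2 - v_min ** 2)


def hold_up_ms(c_bulk_f: float, v_start: float, v_min: float,
               efficiency: float, p_load_w: float) -> float:
    """t_hold ~ (E * eta) / P_load, returned in milliseconds."""
    energy = usable_energy_j(c_bulk_f, v_start, v_min)
    return energy * efficiency / p_load_w * 1e3


# Assumed baseline: 1000 uF, 390 V -> 300 V, eta = 0.95, 1600 W.
baseline = hold_up_ms(1000e-6, 390.0, 300.0, 0.95, 1600.0)
# Lever 1: let the bus discharge to 280 V instead of 300 V.
wider_window = hold_up_ms(1000e-6, 390.0, 280.0, 0.95, 1600.0)
# Lever 2: add 10 % more capacitance with the original window.
more_cap = hold_up_ms(1100e-6, 390.0, 300.0, 0.95, 1600.0)

print(f"baseline:      {baseline:.2f} ms")
print(f"wider window:  {wider_window:.2f} ms")
print(f"+10 % C:       {more_cap:.2f} ms")
```

Under these assumptions, lowering the minimum bus level by 20 V buys more hold-up than a 10 % capacitance increase, which is the point made above about the usable energy delta mattering more than the capacitor value alone.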
Transient response vs PMBus sampling: why “telemetry looks fine” can still mean a reset
H2-7 — Redundancy: OR-ing + current share (active vs droop)
Redundant paralleling (1+1 / N+1) is successful only when the combined output behaves like a single, predictable source: stable current sharing, controlled fault isolation, and clean exit/entry without bus disturbance.
What redundancy must solve (the four hard problems)
OR-ing (ideal diode / OR-ing FET): why it exists and how it fails
OR-ing is the isolation gate at the PSU output. Its primary job is to block reverse current (backfeed) and to limit the blast radius of a single-unit fault. In practice, OR-ing is also where heat accumulates when there is circulation current, frequent handover, or marginal reverse-current thresholds.
Current sharing strategies: droop share vs active share (share bus)
Fault-tree style: fast path from symptom to evidence
Symptom: uneven current sharing
- Check: Iout mismatch across units at a stable load; one unit runs hotter or derates earlier.
- Clue: imbalance follows specific slot/cable path → path impedance mismatch; imbalance follows one PSU only → control/share issue.
- Action direction: validate sharing mode (droop/active), confirm share signal continuity, confirm OR-ing drop symmetry under load.
Symptom: units hunt or oscillate
- Check: repeated entry/exit cycles, output steps, or periodic ripple bursts; share-related status/log flags if available.
- Clue: active-share is more sensitive to latency/noise and share line intermittency.
- Action direction: prioritize share signal integrity and degradation behavior (what happens when share is lost).
Symptom: OR-ing device runs hot
- Check: OR-ing temperature asymmetry, heating that correlates with mismatch or with transitions.
- Clue: circulation current (setpoint/path mismatch) or backfeed tendency (handover threshold).
- Action direction: reduce circulation drivers (setpoint alignment/path symmetry) and verify reverse-current blocking threshold behavior.
Symptom: one unit drops out and the bus sags
- Check: order of events: derate/limit → share imbalance → drop-out → bus sag.
- Clue: thermal or protection threshold too close to nominal in one unit creates repeated “handover storms.”
- Action direction: confirm clean dropout semantics (no backfeed, no oscillatory re-entry) and rely on event sequence rather than averaged telemetry.
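The setpoint-mismatch and path-mismatch clues above can be quantified with a simple static droop model. This is a sketch under stated assumptions: each PSU is treated as an ideal source behind an effective droop resistance (droop slope plus path resistance), and the values (12.2 V setpoints, 4 mΩ droop, 100 A load) are made up for illustration. OR-ing blocking is not modeled, so the result is only meaningful while all currents are positive.

```python
def droop_share(v_set, r_droop, i_load):
    """Static droop sharing: return (bus voltage, per-unit currents).

    v_set:   no-load setpoint of each unit (V)
    r_droop: effective droop + path resistance of each unit (ohm)
    i_load:  total load current (A)
    """
    conductances = [1.0 / r for r in r_droop]
    v_bus = (sum(v * g for v, g in zip(v_set, conductances)) - i_load) / sum(conductances)
    currents = [(v - v_bus) * g for v, g in zip(v_set, conductances)]
    return v_bus, currents


# Matched units: the load splits evenly.
_, matched = droop_share([12.20, 12.20], [0.004, 0.004], 100.0)
# 20 mV setpoint mismatch with identical paths.
_, setpoint_skew = droop_share([12.22, 12.20], [0.004, 0.004], 100.0)
# Identical setpoints, one path 1 mOhm more resistive.
_, path_skew = droop_share([12.20, 12.20], [0.004, 0.005], 100.0)

print("matched:       ", [round(i, 2) for i in matched])
print("setpoint skew: ", [round(i, 2) for i in setpoint_skew])
print("path skew:     ", [round(i, 2) for i in path_skew])
```

In this model a 20 mV setpoint error shifts about 5 A between the units, and a 1 mΩ path difference shifts roughly 11 A. Swapping slots moves the second kind of imbalance with the slot, while the first follows the PSU, which matches the clue in the fault tree.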
H2-8 — Digital PSU & PMBus telemetry (what to trust)
Telemetry is evidence, not truth-by-default. The most reliable diagnosis comes from status transitions and fault-log sequence, while averaged power/voltage numbers can mislead during fast transients or calibration drift.
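A short simulation shows why an averaged reading can look healthy while a trip-level droop occurs. The numbers are assumptions for illustration only: a 100 ms averaging window, a 2 ms droop from 12.0 V to 10.8 V, and an 11.0 V fast undervoltage threshold.

```python
STEP_US = 100           # simulation resolution (microseconds)
WINDOW_MS = 100         # assumed telemetry averaging window
UV_THRESHOLD_V = 11.0   # assumed fast undervoltage trip level

n_samples = WINDOW_MS * 1000 // STEP_US   # 1000 samples per window
vout = [12.0] * n_samples
for i in range(400, 420):                 # 2 ms droop starting at t = 40 ms
    vout[i] = 10.8

reported_avg = sum(vout) / len(vout)      # what slow telemetry would report
true_min = min(vout)                      # what a scope or fast comparator sees

print(f"averaged reading: {reported_avg:.3f} V")
print(f"true minimum:     {true_min:.3f} V")
print(f"UV trip crossed:  {true_min < UV_THRESHOLD_V}")
```

The averaged value moves by only about 24 mV, well inside normal regulation tolerance, while the instantaneous voltage crosses the assumed trip level. The status bit and fault-log entry are the only telemetry that records the event.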
What PMBus/SMBus typically exposes (map by category)
Why some readings must be discounted (sampling, averaging, calibration)
Telemetry trust grading matrix (use it like a troubleshooting standard)
| Trust level | Examples | What it is good for | Common trap |
|---|---|---|---|
| Strong | Status bits, protection triggers, PG drop, UV/OV/OCP/OTP flags, fault-log first-trigger | Establishes the event sequence and true root trigger chain (what happened first, then what followed). | Ignoring order of events and only reading the final fault (“last hit” bias). |
| Medium | Temperature trends, fan RPM, derate states, long-window Iout/Vout stability | Explains why margins shrink (thermal headroom, derating) and why a unit exits/enters under redundancy. | Treating a single temperature number as a universal hotspot indicator. |
| Weak | Averaged power, efficiency estimates, short-window V/I during transients, any value with heavy filtering | Provides context and sanity checks, but should not be used alone to “convict” a root cause. | Believing “telemetry looks normal” during a reset event (sampling mismatch). |
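Since status bits sit at the top of the trust matrix, it helps to decode them explicitly rather than read raw hex. The sketch below decodes a 16-bit STATUS_WORD value using the bit names from the PMBus specification as commonly documented; treat the table as an assumption to verify against the specific PSU's datasheet, since vendors differ in which bits they implement. The example values are hypothetical and no bus access is shown.

```python
# STATUS_WORD (PMBus command 0x79) bit names; verify against the PSU datasheet.
STATUS_WORD_BITS = {
    15: "VOUT", 14: "IOUT/POUT", 13: "INPUT", 12: "MFR_SPECIFIC",
    11: "POWER_GOOD#", 10: "FANS", 9: "OTHER", 8: "UNKNOWN",
    7: "BUSY", 6: "OFF", 5: "VOUT_OV_FAULT", 4: "IOUT_OC_FAULT",
    3: "VIN_UV_FAULT", 2: "TEMPERATURE", 1: "CML", 0: "NONE_OF_THE_ABOVE",
}


def decode_status_word(word: int) -> list:
    """Return the names of all set bits, most significant first."""
    return [name for bit, name in sorted(STATUS_WORD_BITS.items(), reverse=True)
            if word & (1 << bit)]


def newly_set(previous: int, current: int) -> list:
    """Bits that are set now but were clear in the previous read."""
    return decode_status_word(current & ~previous)


# Hypothetical reads taken before and after an input sag.
before, after = 0x0000, 0x2848
print(decode_status_word(after))
print(newly_set(before, after))
```

Comparing consecutive reads with `newly_set` is a simple way to capture transitions, which the matrix above ranks as stronger evidence than any single snapshot.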
How to read evidence in the right order (5-step field workflow)
H2-9 — Fan & thermal control inside PSU
PSU thermal behavior is a closed loop: internal loss hot-spots drive temperature rise, sensors observe selected points, control logic commands fan RPM and derating, and the resulting airflow and power limit reshape the hot-spot map. Stability, noise, and lifetime depend on how this loop is structured.
Internal hot zones (where heat concentrates)
Hot zones vary by load, airflow, and redundancy imbalance. The most useful model groups them by loss mechanism: conduction-heavy devices, switching-heavy devices, magnetics, and the control board’s local regulators/drivers.
Fan control strategies (curve vs closed-loop vs power-based)
Derating semantics (gradual power limit vs hard shutdown)
Derating should behave as a controlled reduction of Pout_limit (or equivalent current limit), not as an abrupt latch-off. A gradual limit allows the PSU to stay online while protecting silicon and magnetics, and it reduces the chance of a “handover storm” in redundant operation.
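A gradual derating policy can be sketched as a piecewise-linear power limit. Every number here is a hypothetical placeholder (1600 W rating, derating from 85 °C, shutdown at 105 °C, 50 % floor); real thresholds come from the PSU's thermal design and are not stated on this page.

```python
def pout_limit_w(temp_c: float,
                 p_rated_w: float = 1600.0,        # assumed rated output power
                 t_derate_start_c: float = 85.0,   # assumed start of derating
                 t_shutdown_c: float = 105.0,      # assumed controlled-shutdown point
                 min_fraction: float = 0.5) -> float:
    """Piecewise-linear power limit: full power, then a ramp, then shutdown."""
    if temp_c <= t_derate_start_c:
        return p_rated_w
    if temp_c >= t_shutdown_c:
        return 0.0  # controlled shutdown once the ramp is exhausted
    span = (temp_c - t_derate_start_c) / (t_shutdown_c - t_derate_start_c)
    return p_rated_w * (1.0 - (1.0 - min_fraction) * span)


for temp in (70.0, 85.0, 95.0, 104.0, 105.0):
    print(f"{temp:5.1f} C -> limit {pout_limit_w(temp):7.1f} W")
```

The ramp gives the partner unit in a redundant pair time to absorb load progressively, instead of seeing a full step when this unit latches off.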
H2-10 — Protections & fault behaviors (and what upstream sees)
Protection behavior is best understood by its action mode (latch / hiccup / foldback) and by the event sequence recorded in status bits and fault logs. External symptoms (restarts, jitter, unit drop-out) are signatures of these internal actions.
Protection families (grouped by what they protect)
Action modes: latch-off vs hiccup vs foldback (why external symptoms differ)
Minimal coordination interfaces (PSU-side only)
Only the PSU-side meaning is in scope here: how these signals are produced and how they align with protection actions. System handling is intentionally not covered.
Symptom-to-protection map (fast field table)
| External symptom | Likely action mode | What to read first (strong evidence) | Next action direction |
|---|---|---|---|
| Rhythmic restarts / periodic output pulses | Hiccup | Status bits + fault-log order (first trigger vs final UV) | Confirm whether the first trigger is OCP/short, OTP, or input anomaly; correlate timing with PSU_OK changes. |
| Stays off until reset condition | Latch-off | Latch-related status + persistent fault flag | Identify the severity class (short/OV/OTP). Avoid relying on averaged power numbers around the event. |
| Output droops under load, no full shutdown | Foldback / limiting | Current-limit/OPP flags + status word transitions | Check whether limiting is thermal-initiated (derate chain) or load-initiated (OCP/OPP chain). |
| One redundant unit keeps dropping in/out | Derate → drop-out chain | Per-unit temp/fan/derate flags + share/OR-ing related events (if present) | Compare PSU A vs PSU B; look for thermal asymmetry first, then validate share integrity and OR-ing heating patterns. |
| “Jittery” bus behavior during transitions | Coupled loops / transitions | Event order + PSU_OK edges + fault-log markers | Use time sequence: trigger → limiting/derate → exit/entry. Do not conclude from averaged telemetry alone. |
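The "first trigger vs final fault" distinction in the table can be made mechanical. This is a minimal sketch that assumes the exported log carries a monotonically increasing sequence counter alongside a coarse timestamp; the record layout, flag names, and root-class mapping are all hypothetical and would need to match the actual log format.

```python
from typing import NamedTuple


class LogEntry(NamedTuple):
    seq: int    # assumed monotonically increasing event counter
    t_s: int    # coarse timestamp in seconds (may tie across entries)
    flag: str   # hypothetical fault-flag name


# Hypothetical mapping from first-trigger flag to root class.
ROOT_CLASS = {
    "VIN_UV_FAULT": "input",
    "OT_FAULT": "thermal",
    "IOUT_OC_FAULT": "load/protection",
    "VOUT_UV_FAULT": "output (often a consequence)",
}


def summarize(entries):
    """Order by sequence counter and report first trigger and last hit."""
    ordered = sorted(entries, key=lambda e: e.seq)
    first, last = ordered[0], ordered[-1]
    return {
        "first_trigger": first.flag,
        "root_class": ROOT_CLASS.get(first.flag, "unknown"),
        "last_hit": last.flag,
        "sequence": [e.flag for e in ordered],
    }


# Ring-buffer export in arbitrary order; all timestamps fall in the same second.
exported = [
    LogEntry(103, 5121, "VOUT_UV_FAULT"),
    LogEntry(101, 5121, "VIN_UV_FAULT"),
    LogEntry(102, 5121, "IOUT_OC_FAULT"),
]
report = summarize(exported)
print(report)
```

Reading only the last entry would blame an output undervoltage, and reading only the overcurrent flag would suggest a load fault. Ordering by sequence shows an input sag came first, which is the energy-deficit case that should not be misclassified as OCP.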
H2-11 · Validation & compliance checklist (R&D / production / compliance)
This section defines “done” for a CRPS/server PSU using a traceable evidence bundle: electrical performance plots, protection timing, redundancy stability, PMBus calibration, fault-log decode examples, and safety/EMC test-point mapping.
Definition of done (evidence package)
- Golden configuration: AC range, output voltage (12 V / 48 V), firmware revision, PMBus address map, fan profile, protection thresholds & action modes (latch/hiccup/foldback).
- Golden load scripts: steady-state sweep, dynamic step, burst/idle, brownout profile, redundancy entry/exit, and (if supported) hot-insertion/removal behavior.
- Golden log scripts: event trigger definitions, timestamp resolution, ring-buffer depth, export format, and known-good decode examples.
R&D electrical validation (what to measure, how to prove)
| Test item | Pass evidence (what to capture) | Typical failure signature → likely root |
|---|---|---|
| Efficiency curve | η vs load at low/mid/high line; thermal steady-state noted; fan profile documented. | Mid-load dip → ZVS margin/transformer parasitics; light-load loss → burst/SR reverse current; high-load loss → magnetics/rectification/ORing conduction. |
| PF / THD | PF, THD vs load; harmonic snapshot at representative loads; line conditions recorded. | THD spikes at light load → mode hopping; PF drop at high load → current limit or inductor saturation; noise correlation → burst/skip behavior. |
| Ripple & noise | Ripple measured with standardized probe method; bandwidth limit stated; worst-case condition identified. | HF spikes → SR commutation/layout; LF ripple → loop gain/cap ESR; “random bursts” → burst-mode or pre-trigger protection. |
| Transient response | Load step plots (ΔI, di/dt, settle time); peak deviation and recovery; operating mode annotated. | Slow recovery → compensation; overshoot → secondary dynamics/SR timing; repeated kicks → current-limit interaction. |
| Hold-up time | AC removal to Vout drop; Vbulk discharge trace; load and trigger points defined; log alignment shown. | Short hold-up → bulk energy/brownout threshold; steps → ORing/share interaction; “telemetry looks OK” → sampling/averaging mismatch. |
Reliability & stress validation (what breaks first)
Reliability validation should focus on hotspot parts, electrical stress under abnormal conditions, repetitive fault behavior, and redundancy dynamics that can amplify thermal imbalance.
Compliance checklist (safety + EMI/EMC) — PSU-side scope
- Safety evidence: insulation system description, creepage/clearance rule set, hipot test points and acceptance criteria, protective earth continuity, and touch-current/leakage methodology (as applicable).
- EMI evidence: conducted-emissions setup, worst-case operating points (line/load/fan mode), and A/B comparison after any fix.
- Immunity evidence: surge/EFT/ESD stress points mapped to AC inlet and signal ports; PSU response captured as Vbulk/Vout plus protection bits/log entries (not only “survived”).
Production test (minimum set) + calibration & logging
Production tests should prioritize hard-to-rework risks (HV side, magnetics, redundancy/protection behavior) and hard-to-debug field issues (telemetry calibration and log usability).
Reference material list (example part numbers by function)
These part numbers are representative examples (not the only valid choices). Final selection depends on power level, input range, topology, and supply strategy.
- PFC controllers: UCC28070 (2-phase interleaved CCM PFC), UCC28180 (CCM PFC), NCP1654 (CCM PFC), L6562A (transition/CRM-class PFC controller).
- LLC / resonant controllers: UCC256404 (LLC resonant controller), L6599A (resonant half-bridge controller), NCP1397 (resonant controller with HV drivers).
- Synchronous rectifier control: UCC24612 (SR controller example).
- Digital control / PMBus endpoints: UCD3138-class digital controller; INA233-class PMBus power monitor (telemetry nodes).
- ORing / current share: LTC4359 (ideal diode controller), LTC4370 (current sharing + ideal diodes).
- Fan PWM + tach: EMC2305 (fan controller example).
H2-12 · FAQs (CRPS / Server PSU)
These FAQs focus on PSU-internal causes and PSU-facing interfaces: OR-ing/current share stability, PMBus telemetry and fault logs, brownout/hold-up behavior, fan/thermal control, protection modes, and validation practices.
1. Why is current sharing very uneven after two redundant PSUs are paralleled, and which share/OR-ing clues come first?
Start by verifying whether the imbalance is real: compare each unit’s IOUT source (shunt/DCR vs estimate) and confirm both units see the same bus voltage at the OR-ing output. Next check for OR-ing drop mismatch and reverse-current blocking behavior, then confirm the share method (droop vs active share bus) and whether the share signal is present, stable, and correctly terminated. (Examples: LTC4359, LTC4370)
Check first: per-PSU IOUT credibility → OR-ing ΔV/temperature → share-bus validity/stability.
2. Sharing is fine at first, then PSUs “fight” and oscillate. What are the most common causes?
Time-dependent “fighting” is usually driven by drift or a marginal control loop: temperature drift in current sensing (DCR/shunt calibration), OR-ing heating that changes effective droop, or noise/latency on the share bus that turns sharing into a positive feedback loop. Correlate current swing with temperature rise and share-bus noise; then tighten filtering/compensation and re-check stability under hot steady-state, not only at cold start. (Examples: LTC4370, INA233)
Check first: IOUT swing vs temperature → share-bus noise/latency → hot steady-state stability margin.
3. Why does the OR-ing FET / ideal diode run hot: circulating current or reverse backfeed, and how to tell?
Distinguish direction. Reverse backfeed shows up when one PSU is off or ramping and current flows “backwards” through its OR-ing path; circulating current happens when both are on and small voltage mismatches drive loop current. Measure OR-ing voltage drop and temperature, then do a minimal isolation test: temporarily remove one unit or disable share and observe whether heat disappears. Fault logs that flag reverse-current events are strong evidence. (Examples: LTC4359)
Check first: OR-ing ΔV + temperature → one-unit-off test → reverse-current / OR-ing fault flags.
4. PMBus power looks stable, but the system still reboots. Why can telemetry “miss” transients?
Many PMBus readings are low-rate averages; a fast droop or brief protection event can occur entirely between updates and be “smoothed out.” Use a time-aligned evidence chain: capture VOUT and VBULK on a scope, then correlate with status-bit edges and fault-log timestamps. If the log resolution is coarse, rely on protection flags and restart counters rather than averaged power. Consider raising telemetry update rate or adding dedicated fast droop detection at PSU scope. (Examples: INA233, UCD3138)
Check first: VOUT/VBULK scope trace → status-bit edges → fault log timestamp granularity.
5. Is 80 PLUS Titanium enough to judge data-center energy impact, and where do real deployments still get burned?
80 PLUS mainly certifies efficiency at specific load points; it does not guarantee comparable efficiency across the actual operating window. Real energy surprises often come from redundancy operation at low load (efficiency falloff), thermal conditions that shift losses, fan power that rises sharply with heat, and mode transitions at light load (burst/skip) that trade efficiency for stability/noise. A deployment-relevant proof is a hot steady-state efficiency map across the true load distribution and redundancy modes. (Examples: UCC28070, UCC256404)
Check first: hot efficiency map across real load distribution + redundancy mode + fan power contribution.
6. Light-load squeal/noise or sudden efficiency drop: does it usually come from PFC or LLC mode switching?
Use “when and how” to separate causes. Noise that tracks line voltage or PF/THD behavior often implicates PFC mode hopping; noise that appears at a load threshold points to LLC burst/skip and SR behavior (including reverse current or discontinuous commutation). Sweep load at fixed temperature and capture PFC inductor-current envelope plus LLC switching frequency region; look for sudden frequency jumps, burst packets, or SR timing anomalies that align with audible bands. (Examples: UCC28180, UCC256404)
Check first: load-threshold vs line-dependent symptom → PFC current envelope + LLC frequency region alignment.
7. A small brownout triggers shutdown or repeated restarts. How to reason about the PFC/Vbulk strategy?
Brownout is usually a Vbulk story: as input sags, PFC can no longer hold the bulk bus, and control may choose to derate, shut down cleanly, or attempt a restart. If thresholds are tight or hysteresis is weak, the system can “hunt” and restart repeatedly. Inspect VIN, VBULK, brownout status bits, restart counters, and inrush limiter temperature; then tune the brownout threshold, restart delay, and the shutdown sequence so the exit is deterministic and logged. (Examples: UCC28070, NCP1654)
Check first: VIN/VBULK trace + brownout flags → restart policy (delay/hysteresis) → clean shutdown sequencing.
8. Hold-up time is short. What are the most common causes, and what is the lowest-cost way to fix it?
The top causes are: insufficient usable bulk energy (cap tolerance, aging, temperature), a brownout/shutdown policy that exits too early, or a load profile that has higher transient demand than assumed. Lowest-cost fixes usually come from control/policy first: adjust brownout thresholds/hysteresis, optimize ride-through behavior, and verify the worst-case load step during dropout. Only then consider increasing bulk capacitance or altering bus-voltage targets. Prove the fix with a Vbulk discharge trace and aligned event logs. (Examples: UCD3138)
Check first: Vbulk discharge + shutdown threshold → dropout load profile → policy tuning before adding capacitance.
9. Fans suddenly ramp to maximum, but temperature looks “not high”. Is it sensor placement error or control strategy?
Separate measurement from policy. A “not high” temperature may be a cool sensor location while hotspots rise elsewhere, or a sensor offset/jump. The other common cause is policy-driven ramp: power-based control, input-abnormal derating, or sharing imbalance triggers preemptive cooling. Compare multiple temperature channels, fan target vs actual RPM, and whether the event aligns with POUT, VIN flags, or share anomalies. If available, log the controller’s chosen fan mode and the trigger that caused the ramp. (Examples: EMC2305, TMP117)
Check first: multi-sensor correlation → fan target vs RPM → policy triggers (power/input/share) in logs.
10. For over-temperature, is latch-off or hiccup better, and how does it affect redundancy stability?
Latch-off is safer and easier to diagnose, but it reduces availability until a manual or commanded recovery occurs. Hiccup can recover automatically, but in redundant systems it can repeatedly inject bus disturbances and trigger current re-distribution, sometimes causing the partner unit to run hotter and destabilize sharing. A robust approach is staged response: derate first, then controlled shutdown, with clear status bits and fault-log ordering. Validate that one-unit exit does not produce a bus dip that cascades into a second unit event. (Examples: LTC4370)
Check first: derating-before-shutdown policy + clear logs → verify “single-unit exit” does not disturb the bus.
11. One PSU repeatedly drops out and returns; externally only small voltage wiggles appear. How to use logs to find root cause?
Build an event timeline. Start from the first fault flag (not the averaged readings): identify whether the trigger is protection (OCP/OTP/UVP), input abnormality, fan fault, OR-ing reverse-current, or share instability. Then correlate status-word transitions with Vbulk/Vout scope captures and the restart counter. If timestamps are coarse, the ordering of flags still matters: input-related events typically precede Vbulk collapse, while share/OR-ing issues often show reverse-current or current-limit hints before Vout movement. Export the last N events each time to avoid ring-buffer overwrite. (Examples: UCD3138, INA233)
Check first: first fault flag → scope alignment (Vbulk/Vout) → restart counter + ordered status transitions.
12. In validation, what predicts real field stability (not just “nice lab plots”)?
Field stability is best predicted by stress cases that mirror deployment reality: hot steady-state operation, brownout/short interruption profiles, hold-up under true load transients, redundancy entry/exit without bus disturbance, and deterministic protection behavior that produces actionable logs. A “done” PSU has repeatable evidence: efficiency/PF/THD across the real load window, Vbulk/Vout transient captures, and fault logs that correctly identify root-class (input, thermal, share/OR-ing, protection) with consistent sequencing. Production calibration and log export integrity are part of stability, not an afterthought. (Examples: UCC28070, UCD3138)
Check first: hot + brownout + redundancy dynamics + log usefulness (repeatable root-class identification).