Edge Site Power & Backup: 48V Hot-Swap & Ride-Through Telemetry
Edge Site Power & Backup covers the end-to-end 48V front-end protection and ride-through chain—from hot-swap and surge stacking to OR-ing power-path switching and supercap/battery backup—so the load bus stays alive during brownouts and outages. It also defines the telemetry and fault-logging evidence needed to diagnose field events remotely and prove the design with a practical validation checklist.
What it is & boundary: what “Edge Site Power & Backup” covers
Definition (engineering scope)
Edge Site Power & Backup is the front-end energy continuity and observability stack from 48V input to the site DC bus. It combines entry protection, hot-swap inrush control, OR-ing / power-path management, and backup energy (supercapacitor hold-up or battery ride-through), plus telemetry and event logging that make brownouts and faults provable in the field.
Boundary statement (what this page does NOT cover)
- Rack PDU branch metering and downstream load distribution (belongs to Micro Edge Datacenter Rack).
- OOB BMC architecture and management protocols (belongs to Micro Edge Datacenter Rack / OOB management pages).
- PoE PSE design and 802.3 standards (belongs to Edge Backhaul PoE++ Node).
- Timing (PTP/SyncE/GNSS) and clock trees (belongs to Timing & Synchronization pages).
Typical deployment patterns (why this stack matters)
| Edge site context | Power/backup problem it must solve |
|---|---|
| Street cabinet / outdoor micro-site (long cables, frequent surges) | Entry protection must absorb surge energy without nuisance resets; hot-swap must prevent connector arcing and MOSFET SOA failures; logs must prove whether resets come from input sag or overcurrent. |
| Indoor micro edge room (shared DC plant, maintenance hot-plug) | Hot-swap and OR-ing must allow module replacement without collapsing the bus; telemetry must surface thermal derating, latched faults, and backup health before service impact. |
| Enterprise/industrial edge (uptime and auditability) | Backup must guarantee a defined ride-through window during brownouts; event logs must be consistent enough to support incident postmortems and automated maintenance triggers. |
Key outcome metrics (what “done” looks like)
These are the measurable targets that drive every design decision in later chapters.
| Metric | Why it matters in the field |
|---|---|
| Ride-through time (ms to minutes) | Defines how long the DC bus stays within tolerance during input loss/sag. This must include detection + switchover latency, not only stored energy. |
| Peak inrush / dV/dt | Controls connector stress, upstream plant stability, and hot-swap MOSFET SOA. Poor tuning causes either arcing or slow ramp overheating. |
| Allowed bus droop | Sets the usable voltage window for supercap/battery ride-through. A tighter droop budget increases energy requirements and accelerates thermal constraints. |
| OR-ing / reverse current | Prevents backfeeding between sources (main ↔ backup) and avoids oscillatory “source fighting” during recovery. |
| Telemetry coverage | Determines whether remote operations can distinguish “power fault” from “load fault”. At minimum: input/bus voltage, current, temperature, fault codes, backup SoC/health, and switch events. |
| Event evidence completeness | A reset without evidence wastes field time. Logging must capture the decisive timeline: what crossed which threshold, when, and what action followed. |
Practical definition: This subsystem keeps an edge site DC bus alive through input disturbances and produces enough telemetry and logs to prove whether an outage was caused by surge, brownout, inrush/SOA, overcurrent, or thermal derating.
Figure F1 — System boundary: energy path + backup path + telemetry loop
Requirements & sizing: quantify the problem before choosing hardware
Start from the disturbance, not the datasheet
Sizing becomes reliable only when the site disturbance is defined as a time-budgeted event. “Ride-through for X seconds” is the output; the inputs are: input envelope (48V range + surge/sag), load profile (steady + peak + startup), and bus tolerance (minimum acceptable bus voltage and droop).
| Sizing input | Decision it drives |
|---|---|
| 48V envelope (e.g., 36–60V + surge) | Protection stack and hot-swap thresholds (UV/OV), plus whether brownouts are “sag” events or full outages. |
| Event shape (drop rate and duration) | How much time exists for detection and switchover; steep sags demand faster control and more stable thresholds. |
| Load profile (steady + peak + startup) | Required backup power and the “worst moment” that typically occurs during recovery (recharge + peak load). |
| Allowed bus droop | Usable voltage window for supercap/battery and the minimum DC/DC headroom; tighter droop means more stored energy. |
| Recovery policy (auto reattach vs hold) | OR-ing hysteresis and recharge limits; incorrect recovery causes oscillatory switching (“source fighting”). |
Ride-through tiers (what changes across ms → seconds → minutes)
The time tier defines the dominant failure mode and therefore where engineering effort should be spent.
| Target window | Best-fit approach |
|---|---|
| 10–50 ms (control-dominant) | The critical risks are false detection and threshold chatter. A large energy store is often unnecessary; stable sensing and switchover timing are. |
| 0.1–5 s (supercap-dominant) | The critical risks are ESR-limited droop and aging (capacity fade + ESR rise). Voltage window and efficiency determine usable energy far more than nameplate capacitance. |
| 5–10 min (battery/UPS-dominant) | The critical risks are temperature derating, rate capability, and maintenance reality. Telemetry must expose SOC/SOH and thermal headroom before service impact. |
A design may use both supercap and battery: supercap covers fast switchover and short dips; battery covers longer outages. The handoff must be explicit in the time budget.
Sizing skeleton (minimal math, maximum correctness)
For supercapacitor hold-up, the usable stored energy depends on the voltage window rather than the nameplate value.
Use the energy form below as a first-order check:
E = 0.5 · C · (V1² − V2²)
where V1 and V2 are the usable capacitor voltages (after accounting for bus minimum and conversion headroom).
Convert energy to time using load power and realistic efficiency:
t ≈ (E · η) / Pload.
For engineering-grade sizing, apply correction terms that dominate field outcomes:
- Efficiency η: include conversion losses in both discharge and recovery.
- ESR droop: ensure initial current does not violate the allowed bus droop; ESR often sets the true limit.
- Aging margin: capacity fades while ESR increases; reserve margin to keep ride-through valid at end-of-life.
- Detection + switchover latency: stored energy must cover the entire timeline, not only the sustain window.
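The first-order check and its correction terms can be sketched in a few lines of Python. This is an illustrative sketch, not a sizing tool: the function names, the way ESR is folded in (raising the effective minimum voltage by I × ESR at end of discharge), and the choice to subtract latency from the sustain time are all assumptions made for the example.

```python
def usable_energy_j(c_farads, v_start, v_min):
    """First-order usable energy: E = 0.5 * C * (V1^2 - V2^2)."""
    if v_min > v_start:
        raise ValueError("v_min must not exceed v_start")
    return 0.5 * c_farads * (v_start ** 2 - v_min ** 2)


def ride_through_s(c_farads, v_start, v_min, p_load_w, efficiency=0.9,
                   esr_ohm=0.0, capacity_fade=0.0, latency_s=0.0):
    """Estimate sustain time t ~= (E * eta) / Pload with simple corrections.

    Assumptions (illustrative only):
    - capacity_fade is the end-of-life fractional loss of capacitance
    - ESR raises the lowest usable capacitor voltage by I * ESR, with
      I evaluated at the minimum voltage (worst-case current)
    - detection + switchover latency is subtracted from the result
    """
    c_eol = c_farads * (1.0 - capacity_fade)
    i_at_vmin = p_load_w / (efficiency * v_min)
    v_min_eff = v_min + i_at_vmin * esr_ohm
    if v_min_eff >= v_start:
        return 0.0  # ESR droop alone consumes the whole voltage window
    energy = usable_energy_j(c_eol, v_start, v_min_eff)
    return max(0.0, energy * efficiency / p_load_w - latency_s)


# Hypothetical bank: 10 F stack, 48 V down to 36 V, 500 W load
ideal = ride_through_s(10, 48, 36, 500, efficiency=1.0)
aged = ride_through_s(10, 48, 36, 500, efficiency=0.9, esr_ohm=0.05,
                      capacity_fade=0.2, latency_s=0.01)
print(f"ideal: {ideal:.2f} s, end-of-life with ESR: {aged:.2f} s")
```

Comparing the ideal and end-of-life numbers side by side shows how much of the nameplate ride-through is consumed by efficiency, ESR, and aging before any margin is added.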
For battery ride-through, sizing is dominated by rate, temperature, and policy: maximum discharge current, cold-start derating, and whether recharge is allowed while loads remain at peak. Avoiding thermal runaway and derating loops is usually more important than theoretical capacity.
Output artifact: requirements template (copy/paste)
This table turns “backup time” into a measurable contract that later chapters can validate.
| Field requirement | Fill-in value |
|---|---|
| Input envelope | Nominal 48V, min/max, surge level, typical sag profiles (rate + duration) |
| Disturbance type | Full outage / voltage sag / intermittent dips (specify which must not reboot) |
| Load profile | Steady W, peak W/A (duration), startup inrush characteristics |
| Allowed bus droop | Minimum Vbus, droop limit during switchover, recovery tolerance |
| Ride-through target | Target time window and the “time budget” split: detect → switch → sustain → recover |
| Recovery policy | Auto reattach thresholds, hysteresis, recharge current limit, avoid source fighting |
| Telemetry contract | Must-report signals, sample rates (trend vs event), fault codes, event timestamps |
| Acceptance test | Minimum set of brownout/hot-plug/short tests required for sign-off |
Figure F2 — Energy window and time budget (detect → switch → sustain → recover)
48V front-end hot-swap: the goal is not “plug-in works”, but “plug-in never burns”
Hot-swap path (minimum system)
A 48V hot-swap front-end is a controlled energy transfer path: input filter → hot-swap controller → power MOSFET(s) → DC bus capacitor/load. Field failures usually occur at the moment stored energy is forced through a MOSFET operating in the linear region, or when cable inductance and connector bounce create repeated high-stress transients.
Inrush is capacitor charging, and the dominant control knob is dV/dt:
I_inrush ≈ C_load × dV/dt
Because C_load is often uncertain in the field, settings must remain stable across a wide range of capacitance and cable conditions.
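A short Python sketch makes the dV/dt trade-off concrete. It assumes a purely capacitive load (no DC load current during the ramp) and a linear output ramp; under those assumptions the energy dissipated in the pass FET equals 0.5 × C_load × Vin² regardless of ramp rate, while the peak FET power scales with dV/dt. The function names and example values are hypothetical.

```python
def inrush_current_a(c_load_f, dv_dt_v_per_s):
    """I_inrush ~= C_load * dV/dt."""
    return c_load_f * dv_dt_v_per_s


def ramp_summary(c_load_f, v_in, dv_dt_v_per_s):
    """Summarize a linear ramp into a purely capacitive load (assumption)."""
    i_inrush = inrush_current_a(c_load_f, dv_dt_v_per_s)
    return {
        "i_inrush_a": i_inrush,
        "t_ramp_s": v_in / dv_dt_v_per_s,
        # Worst instantaneous FET power occurs at the start, when Vds ~= Vin
        "p_fet_peak_w": v_in * i_inrush,
        # Energy dissipated in the FET over the full ramp
        "e_fet_j": 0.5 * c_load_f * v_in ** 2,
    }


# Same ramp setting, 10x spread in field capacitance (hypothetical values)
for c_load in (470e-6, 4700e-6):
    s = ramp_summary(c_load, v_in=54.0, dv_dt_v_per_s=1000.0)
    print(f"C={c_load * 1e6:.0f} uF: I={s['i_inrush_a']:.2f} A, "
          f"t={s['t_ramp_s'] * 1e3:.0f} ms, E_fet={s['e_fet_j']:.2f} J")
```

Running the loop over the expected capacitance range is a quick way to see whether one dV/dt setting keeps both inrush current and FET energy acceptable at the extremes.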
Why MOSFETs fail even when “current limit looks fine”
| Failure mechanism | Field symptom → root cause → design fix |
|---|---|
| Ramp too slow (linear heating) | Symptom: MOSFET runs hot during plug-in or fails after repeated starts. Root cause: the FET stays in the linear region too long, so the dissipated energy (P ≈ Vds × Id over the ramp time) exceeds the MOSFET SOA limits. Fix: set ramp/TIMER to exit the linear region quickly; verify SOA at the worst-case Vin, load, and ambient. |
| Cable inductance + connector bounce | Symptom: plug-in arcing, sporadic resets, or “mystery” overvoltage trips. Root cause: long cable inductance and intermittent contact create overshoot and ringing; repeated micro-plugs can apply many stress pulses in seconds. Fix: keep input energy loops tight; tune dV/dt and add appropriate filtering/clamping (see the Surge/ESD & protection stack section); avoid rapid retry loops that amplify bounce events. |
| “Smart” current limit oscillation (hiccup abuse) | Symptom: repeated latch-off / auto-retry, bus never stabilizes, eventual MOSFET or connector damage. Root cause: limit + retry interacts with load capacitance and thresholds, producing a repetitive energy dump pattern (each attempt charges partially, then collapses). Fix: use a clear fault policy: latch-off for hard faults, controlled retry for benign events; add hysteresis and minimum off-time so the system does not “hammer” the same fault. |
Design criteria (write settings as verifiable rules)
ILIM & dV/dt
- Choose ILIM to protect connector and upstream plant while still charging the maximum expected C_load within TIMER.
- Choose dV/dt so inrush stays bounded, but not so slow that MOSFET linear power becomes the dominant risk.
TIMER / retry policy
- Timer must reflect worst-case energy transfer; “slow safe ramp” can be unsafe if it violates SOA.
- Auto-retry should have minimum off-time and limited count to avoid repetitive stress.
UV/OV thresholds
- Thresholds require hysteresis to prevent chatter during sags and noisy cables.
- Debounce must match event tier: ms-tier needs stability; seconds-tier must avoid false trips.
Fuse/breaker coordination
- Electronic limiting should reduce energy fast; upstream protection should isolate only on persistent faults.
- Fault policy should avoid “nuisance breaker trips” caused by repeated retries.
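The retry rules above (minimum off-time, limited count, latch-off for hard faults) can be written as a small policy object. The sketch below is one possible shape in Python; the class name, method names, and default limits are assumptions for illustration and do not correspond to any specific controller.

```python
class RetryPolicy:
    """Bounded auto-retry with a minimum off-time; hard faults latch off."""

    def __init__(self, max_retries=3, min_off_s=1.0):
        self.max_retries = max_retries
        self.min_off_s = min_off_s
        self.retries = 0
        self.last_trip = None
        self.latched = False

    def on_fault(self, now, hard):
        """Record a trip; latch on a hard fault or when retries are used up."""
        self.last_trip = now
        if hard or self.retries >= self.max_retries:
            self.latched = True

    def may_retry(self, now):
        """A retry is allowed only when unlatched and the off-time has elapsed."""
        if self.latched or self.last_trip is None:
            return False
        return now - self.last_trip >= self.min_off_s

    def on_retry(self):
        self.retries += 1

    def on_stable(self):
        """Call after the bus has been stable long enough to forgive history."""
        self.retries = 0
        self.last_trip = None


policy = RetryPolicy(max_retries=2, min_off_s=1.0)
policy.on_fault(now=0.0, hard=False)
print(policy.may_retry(0.5), policy.may_retry(1.0))  # off-time gate
```

Keeping the retry counter separate from the latch makes it straightforward to log both values in the fault snapshot.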
Output artifact: hot-swap selection & settings table (copy/paste)
| Item | What must be compared / recorded |
|---|---|
| Input range & UV/OV | 48V envelope (min/max), UV/OV thresholds, hysteresis, debounce policy |
| Inrush control | dv/dt ramp control method, max ILIM range, start-up profile stability vs unknown C_load |
| SOA protection | Timer behavior, foldback/hiccup options, MOSFET SOA check method at worst Vin/ambient |
| Current sensing | Sense method (Rsense / Rds(on)), accuracy, IMON availability |
| Fault interface | FAULT pin, power-good, latch-off vs retry count, minimum off-time |
| Power FET drive | Gate drive strength, external FET count support, parallel capability, thermal constraints |
Figure F3 — Hot-swap charging equivalent: where inrush and SOA risk come from
Surge/ESD & protection stack: layered protection beats “one big TVS”
Why “a large TVS” is not a protection strategy
A TVS diode primarily limits peak voltage. Many field outages are caused by energy rather than peak: sustained overvoltage, repeated surge bursts, long-cable ringing, or reverse current during recovery. A robust 48V entry must be designed as a layered protection stack where each layer has a distinct job.
Rule of thumb: TVS limits peak, hot-swap/OVP/OCP limits energy, and OR-ing blocks direction. Reliability comes from the sequence of actions, not a single component rating.
Layered protection stack (from connector inward)
| Layer | Job and the failure it prevents |
|---|---|
| TVS clamp | Limits surge peaks and protects downstream silicon from fast overvoltage spikes. Must be paired with short current loops and realistic thermal handling. |
| Input filter / damping | Reduces ringing and prevents high-frequency energy from coupling into controller thresholds and gates. Poor damping can create false UV/OV triggers. |
| OV/UV cutoff | Turns sustained abnormal input into a controlled disconnect. Requires hysteresis and debounce to avoid chatter in sag events. |
| OCP / short protection | Limits fault energy during shorts and prevents repeated high-stress cycles. Policy (latch vs controlled retry) must avoid “hammering” a persistent fault. |
| OR-ing / reverse current block | Prevents backfeeding from the DC bus/backup path into the input line and avoids source fighting during recovery. |
Protection coordination: electronics vs fuse/breaker
Electronic protection is optimized for fast energy limiting and telemetry; fuses/breakers are optimized for ultimate isolation. Coordination must ensure that transient events are handled without nuisance trips, while persistent faults still lead to safe isolation.
- Electronics first: limit energy quickly during inrush/short bursts to avoid upstream plant collapse.
- Isolation eventually: for persistent faults, allow upstream isolation rather than endless retry loops.
- Policy matters: latch-off for hard faults; controlled retry for benign brownouts with minimum off-time.
Field symptom → likely cause (fast triage map)
| Observed symptom | Most common cause inside the protection stack |
|---|---|
| TVS runs hot or fails short | Surge energy exceeds thermal design; poor heat spreading; repeated bursts without cooldown; loop inductance raises stress. |
| Input drops / repeated UV trips | Threshold too tight; inadequate hysteresis; filter/loop causes false detects; upstream plant interacts with inrush limits. |
| Repeated latch-off / auto-retry | Persistent OV/short; retry policy “hammers” the same fault; reverse current conflicts during recovery; poor OR-ing hysteresis. |
Output artifact: protection-layer checklist (tick-box ready)
- TVS loop: surge current loop is short; thermal path is verified; clamp level aligns with downstream OV limits.
- Filter/damping: ringing is controlled; no false UV/OV triggers during cable events.
- OV/UV: thresholds + hysteresis + debounce match sag profiles; no chatter near boundaries.
- OCP/short: energy is limited quickly; policy avoids repeated high-stress retries; persistent faults isolate safely.
- Reverse current: OR-ing blocks backfeed from bus/backup; recovery does not create source fighting.
- Telemetry: clamp/OV/UV/OCP events are logged with timestamps and reason codes for postmortem evidence.
Figure F4 — Layered protection stack (connector → clamp → filter → hot-swap → OR-ing → bus)
OR-ing & power-path management: seamless switchover without backfeed
What OR-ing must guarantee (the four constraints)
A dual-source 48V site power path is judged by behavior during sag and recovery, not by steady-state wiring. The OR-ing stage must simultaneously achieve seamless ride-through, reverse-current blocking, stable failback, and diagnosable switching so the DC bus does not chatter or reset.
Peak goal: keep Vbus above the system reset/UV boundary during main sag.
Hard rule: prevent backfeed from Vbus/backup into the main input line during recovery.
Common causes of switchover “chatter” (and what they mean)
| Observed behavior | Likely cause inside power-path management |
|---|---|
| Bus dips and recovers repeatedly | Failover and failback thresholds too close; insufficient hysteresis; short debounce so noise triggers multiple transitions. |
| Main and backup “fight” (source hunting) | Priority policy missing or weak; forward drop differences too small; control loop/compensation not stable at crossover. |
| Unexpected reverse current alarms | Ideal-diode reverse blocking threshold set too late; recovery ramp pushes Vbus into the main line; sensing noise around zero-current. |
| Failback happens too early | Main input meets threshold briefly but is not stable; missing “stable time” gate; load transient causes immediate re-failover. |
Output artifact: power-path state machine (implementation-ready)
Use explicit thresholds, hysteresis, and minimum on/off times to avoid repeated transitions and hidden stress events.
| State | Entry / exit conditions (with hysteresis and timing) |
|---|---|
| NORMAL (Main) | Entry: Main_OK asserted and stable; reverse current below threshold. Exit → SAG_DETECT: Main falls below Vcut for > Tdebounce. |
| SAG_DETECT | Entry: main sag detected, but not confirmed. Exit → BACKUP: Main remains below Vcut for > Tdebounce, or Vbus approaches UV boundary. Exit → NORMAL: Main rises above Vcut + HYS before timeout. |
| BACKUP | Entry: enable backup path; enforce reverse blocking toward main line. Exit → RECOVER_WAIT: Main rises above Vreturn (Vreturn > Vcut) and stays stable. |
| RECOVER_WAIT | Entry: main appears recovered. Exit → NORMAL: Main stays above Vreturn for > Tstable and reverse current remains bounded. Exit → BACKUP: Main drops below Vcut again or causes source fighting. |
| FAIL_LOCK (optional) | Entry: persistent reverse current / overcurrent / overtemp events exceed limits. Exit: requires explicit recovery condition (cooldown or service action); prevents “hammering” a hard fault. |
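As a sketch of how the state table might translate into code, the Python below implements the five states with explicit thresholds and timers. Several simplifications are assumptions of this example: SAG_DETECT is entered on the first sample below Vcut and the debounce runs inside that state, reverse-current and source-fighting checks are reduced to a single `hard_fault` input, and all numeric limits are placeholder values.

```python
from dataclasses import dataclass
from enum import Enum, auto


class State(Enum):
    NORMAL = auto()
    SAG_DETECT = auto()
    BACKUP = auto()
    RECOVER_WAIT = auto()
    FAIL_LOCK = auto()


@dataclass
class Limits:  # placeholder values, not recommendations
    v_cut: float = 40.0
    hys: float = 1.0
    v_return: float = 44.0   # must be above v_cut
    v_bus_uv: float = 38.0
    uv_margin: float = 0.5
    t_debounce: float = 0.002
    t_stable: float = 0.5


class PowerPath:
    def __init__(self, limits):
        self.lim = limits
        self.state = State.NORMAL
        self.t_enter = 0.0

    def _go(self, state, now):
        self.state, self.t_enter = state, now

    def clear_lock(self, now):
        """Explicit recovery action; the next step() re-evaluates the inputs."""
        if self.state is State.FAIL_LOCK:
            self._go(State.NORMAL, now)

    def step(self, now, v_main, v_bus, hard_fault=False):
        lim, dwell = self.lim, now - self.t_enter
        if hard_fault:
            if self.state is not State.FAIL_LOCK:
                self._go(State.FAIL_LOCK, now)
        elif self.state is State.NORMAL:
            if v_main < lim.v_cut:
                self._go(State.SAG_DETECT, now)
        elif self.state is State.SAG_DETECT:
            if v_main > lim.v_cut + lim.hys:
                self._go(State.NORMAL, now)
            elif dwell > lim.t_debounce or v_bus < lim.v_bus_uv + lim.uv_margin:
                self._go(State.BACKUP, now)
        elif self.state is State.BACKUP:
            if v_main > lim.v_return:
                self._go(State.RECOVER_WAIT, now)
        elif self.state is State.RECOVER_WAIT:
            if v_main < lim.v_cut:
                self._go(State.BACKUP, now)
            elif v_main < lim.v_return:
                self.t_enter = now  # not stable yet: restart the Tstable timer
            elif dwell > lim.t_stable:
                self._go(State.NORMAL, now)
        return self.state  # FAIL_LOCK holds until clear_lock()


pp = PowerPath(Limits())
for t, vm, vb in [(0.0, 48, 48), (0.001, 39, 47), (0.004, 39, 45),
                  (0.1, 45, 45), (0.7, 45, 45)]:
    print(t, pp.step(t, vm, vb).name)
```

Each transition is a natural place to emit a switch-over event with its reason, which keeps the state machine and the event log aligned.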
Figure F5 — Dual-source OR-ing switchover timing: sag → backup takeover → stable failback
Supercap subsystem: a millisecond UPS when engineered, a “giant resistor” when not
Where supercaps win (and the boundary)
Supercaps are strongest in short ride-through and high pulse power events. The limiting factors are not nominal capacitance alone, but the usable voltage window and the instantaneous drop caused by ESR. Many “cap bank looks large but cannot hold” failures are actually ESR- and Vmin-driven.
ESR rule: bus drop under pulse load is dominated by ΔV_ESR = I_peak × ESR_total.
The total includes capacitors, busbars, connectors, protection and OR-ing path resistance.
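The ESR rule is easiest to audit when every series contribution is listed explicitly. The Python sketch below sums a resistance budget and checks the resulting pulse drop against an allowed droop; the path names and resistance values are hypothetical.

```python
def esr_droop_v(i_peak_a, path_ohm):
    """Delta V_ESR = I_peak * ESR_total, with ESR_total summed over the path."""
    return i_peak_a * sum(path_ohm.values())


def droop_ok(i_peak_a, path_ohm, allowed_droop_v):
    return esr_droop_v(i_peak_a, path_ohm) <= allowed_droop_v


# Hypothetical resistance budget for the backup discharge path (ohms)
path = {
    "caps": 0.020,
    "busbar": 0.002,
    "connector": 0.003,
    "oring_and_protection": 0.005,
}
print(f"droop at 50 A: {esr_droop_v(50.0, path):.2f} V")
print("within 2.0 V budget:", droop_ok(50.0, path, allowed_droop_v=2.0))
```

Re-running the check with end-of-life ESR values shows how much pulse margin remains after aging.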
Charge strategy: avoid a second inrush event
A supercap bank behaves like a large load during charging. Without controlled charge limiting and windowing, the charger can cause secondary stress on the main 48V plant and trigger repeated UV events upstream.
Current limiting
- Ramp or constant-current charge prevents sudden plant droop.
- Define a maximum charge current that cannot collapse Vbus under worst-case load.
Charging window
- Charge only when main input is stable and margins exist.
- Temperature derating avoids high-stress charge at cold ESR peaks or hot lifetime limits.
Series balancing & protection (engineering trade-offs)
| Design block | What must be decided and verified |
|---|---|
| Passive vs active balancing | Passive is simple but dissipative; active improves efficiency but adds complexity and validation load. The selection must match thermal limits and allowable quiescent drain. |
| Overvoltage & overtemp | Protect individual cells and the stack: overvoltage is a primary lifetime killer; overtemp accelerates aging. Protection must isolate or reduce charge, not just alarm. |
| Health aging | Capacity fades while ESR rises; the most common failure mode is “pulse load causes reset” long before total energy looks low. Track ESR proxy and ride-through margin over time. |
Output artifact: supercap design checklist (tick-box ready)
- Stack sizing: series count, single-cell rating, and system Vmax/Vmin margin are defined.
- Usable window: Vmin is set by bus UV boundary + OR-ing drop + DC/DC minimum input.
- ESR budget: ESR_total target includes caps + interconnect + protection + OR-ing path; pulse drop is verified.
- Charge limit: maximum charge current and ramp time cannot pull the plant into UV under worst-case load.
- Balancing: passive/active method selected; fault modes and thermal impact are validated.
- Protection: cell OV/OT, stack OV/OT, short protection, and isolation policy are defined.
- Monitoring: cell/stack voltage, temperature, charge/discharge current, and event codes are logged.
Figure F6 — Supercap module expansion: charger, balancing, monitor, and the ESR drop path
Battery backup subsystem: the management closed-loop for minutes to hours
What a long backup path must achieve (beyond “enough energy”)
Minute- to hour-scale backup is defined by a stable operating loop: safe connection to the DC bus, controlled charging that does not collapse the plant, trustworthy SOC/SOH for runtime estimation, and a clear alarm policy that tells remote operations what must be acted on immediately.
Boundary: this section covers the backup pack scope (pack + charger + gauge + protection/disconnect). It does not expand into full AC UPS inverter architecture.
Chemistry selection: only the decision axes (no generic overview)
| Decision axis | What it controls in site backup engineering |
|---|---|
| Safety | Thermal runaway risk and protection strategy; impacts how aggressively charging and recovery can be managed remotely. |
| Temperature window | Cold discharge capability and charge restrictions; determines derating rules and runtime confidence in winter conditions. |
| Cycle + calendar life | How quickly SOH fades under frequent micro-outages; determines replacement planning and alarm thresholds. |
| Maintenance model | Field replaceability, periodic checks, and transport/storage constraints; maps directly into alarm severity and service actions. |
| Power vs energy | Whether the pack must support large takeover currents; influences IR/impedance limits and bus stability during transfer. |
Charging & power-path policy: avoid “backup causes instability”
Supply-while-charge (managed power-path)
- Load supply has priority; charging is limited by plant margin.
- Charge current is windowed by Vin stability and bus headroom.
- Prevents back-to-back UV triggers during partial sag conditions.
Charge-only-when-mains-is-good
- Defines “mains good” as a threshold + stable time gate.
- Reduces stress on weak plants but increases recharge time.
- Pairs well with strict temperature-based derating rules.
Charging behaves like a sustained additional load. The loop must ensure charge limiting never pulls the plant below site undervoltage boundaries.
Fuel gauge & telemetry: the minimum fields for remote operations
The goal is not academic estimation methods, but actionable remote visibility and reliable runtime confidence.
| Field | Operational meaning |
|---|---|
| SOC | Runtime estimate; must be bounded by temperature derating and load profile assumptions. |
| SOH | Replacement planning; tracks capacity fade and internal resistance increase trends. |
| Vpack / Ipack | Validates discharge/charge behavior; detects abnormal loads and incorrect power-path transitions. |
| Temperature | Enforces safe charge/discharge windows; drives derating and high-severity thermal alarms. |
| IR/impedance proxy + cycle count | Early warning for “reset on takeover” scenarios where energy looks adequate but pulse drop becomes unacceptable. |
Output artifact: a site-ready alarm dictionary draft (must-report vs maintenance)
The alarm system is most useful when severity, trigger rules, report payload, and recommended actions are standardized.
| Alarm class | Trigger rule and what must be reported |
|---|---|
| Critical (must escalate) | Thermal unsafe state, protection trip/lock, or pack disconnect on load. Report snapshot: Vpack, Ipack, Temperature, SOC, SOH, charger state, time stamp. |
| Major | SOH below service threshold, abnormal impedance rise, repeated charge aborts. Report: trend values + recent event counters and min/max records. |
| Minor / Maintenance | Calibration drift indication, slow recharge, mild temperature derating events. Report: low-rate trend only; no alert storm. |
Figure F7 — Battery backup closed-loop: sensors → gauge → controller → remote → policy actions
Digital power & PMBus telemetry: turn power into an observable system
Why “readable” is not “usable” (the practical goal)
PMBus succeeds only when metrics, units, thresholds, and reporting rules are designed as a system. High-frequency transients are not carried as waveforms; the reliable method is low-rate trends plus event snapshots.
Rule: do not depend on PMBus for high-frequency waveforms. Use summary statistics (min/max/peak/counters) and fault snapshots at event time.
Minimum must-have telemetry set (site-ready)
| Metric | Why it is required |
|---|---|
| Vin / Iin | Plant margin tracking and detecting input stress before UV events. |
| Vbus / Ibus | Bus stability, load steps, and verifying power-path handoffs. |
| Temperatures | Derating decisions and early detection of thermal runaway risk. |
| Fault status + counters | Turns “it happened” into evidence; supports recurring root-cause triage. |
| Energy reserve (cap / battery) | Predicts ride-through capability and prevents false confidence in backup availability. |
Sampling strategy: trend vs event (the only scalable method)
Trend (low-rate)
- Seconds- to minutes-scale sampling (e.g., 10s/60s/300s) for drift and thermal patterns.
- Stores steady metrics, calibration-corrected, unit-normalized.
- Used for predictive maintenance and capacity planning.
Event (fault snapshot)
- Triggered by UV/OV/OCP/OT/reverse-current or repeated retries.
- Reports a compact snapshot: Vin/Vbus/I/T + reserve state + reason code.
- Supports rapid remote triage without waveform transport.
Output artifact: PMBus metric-to-policy mapping table template
A reusable mapping prevents “metrics exist but nobody knows what to do with them”.
| Metric | Use | Sampling | Threshold (trigger / clear) | Reporting |
|---|---|---|---|---|
| VBUS | bus margin | trend + event | V<Vuv (debounce) / V>Vuv+HYS (stable) | event snapshot + min/max summary |
| IBUS | load stress | trend | I>Ilmt (debounce) / I<Ilmt−HYS | counter + peak summary |
| TEMP | derating | trend + event | T>Thigh / T<Thigh−HYS | alert only on sustained violation |
| Reserve | runtime | trend | SOC<Smin / SOC>Smin+HYS | scheduled report + maintenance flag |
| Fault flags | evidence | event | status asserted / cleared | reason code + snapshot payload |
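The trigger/clear columns in the table imply a small piece of logic: a condition must hold for a debounce time before the alarm asserts, and a separate clear threshold must hold for a stable time before it releases. A minimal Python sketch for an undervoltage-style metric follows; the class name and numeric values are assumptions for illustration.

```python
class HysteresisAlarm:
    """Low-side alarm: assert below `trigger`, release above `clear`."""

    def __init__(self, trigger, clear, debounce_s, stable_s):
        self.trigger = trigger
        self.clear = clear            # clear = trigger + HYS
        self.debounce_s = debounce_s
        self.stable_s = stable_s
        self.active = False
        self._since = None            # time the qualifying condition began

    def update(self, now, value):
        if self.active:
            cond, hold = value > self.clear, self.stable_s
        else:
            cond, hold = value < self.trigger, self.debounce_s
        if not cond:
            self._since = None        # condition broken: restart qualification
        else:
            if self._since is None:
                self._since = now
            if now - self._since >= hold:
                self.active = not self.active
                self._since = None
        return self.active


uv = HysteresisAlarm(trigger=38.0, clear=40.0, debounce_s=0.005, stable_s=0.1)
for t, v in [(0.000, 37), (0.002, 39), (0.010, 37), (0.016, 37),
             (0.020, 39), (0.030, 41), (0.140, 41)]:
    print(f"t={t:.3f}s Vbus={v} V alarm={uv.update(t, v)}")
```

The same structure works for high-side metrics such as temperature or current by inverting the two comparisons.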
Figure F8 — Telemetry data path: digital power → PMBus → site controller → logs & alerts
Fault handling & logging: “power outages are manageable—missing evidence is not”
Objective: turn a brownout into a provable root-cause chain
After a site brownout, a reboot is only the symptom. A usable logging design produces a consistent evidence chain: event time, reason code, snapshot, and counters that can distinguish input sag, power-path chatter, protection retries, thermal derating, and reserve depletion.
Principle: PMBus is not an oscilloscope. For millisecond-level transients, rely on latched flags and summary statistics (min/max/peak) plus a small, fixed snapshot at event time.
Fault tiers and the matching recording granularity
| Tier | Typical duration | What must be recorded (site-ready) |
|---|---|---|
| Transient | ms | Latched UV/OV/OCP/OTP flags, min/max VIN/VBUS, peak IBUS, and a compact reason code. |
| Short | s | Event snapshot + retry counters, switch-over counters, and charger/OR-ing state transitions. |
| Long | min+ | Low-rate trends: temperatures, VIN margin, current/derating state, and reserve trend (cap V or SOC/SOH). |
Minimum event dictionary (what must exist to reconstruct the cause)
Protection & state events
- UV / OV: trigger + clear with debounce and hysteresis.
- OCP / short: limit engaged, foldback, or hard trip (with counters).
- Thermal: derating state vs shutdown trip (with temp channel).
- Latch-off: lock reason + unlock criteria.
Power-path evidence
- Switch-over: main→backup / backup→main transitions (count + last reason).
- OR-ing state: which path is sourcing the bus at the event moment.
- Reserve snapshot: cap V or SOC/SOH at event time.
- Charger state: charging / limited / paused / fault.
Rule: every event must attach the same fixed snapshot payload so different incidents can be compared directly.
Fixed snapshot payload (small, consistent, and sufficient)
| Snapshot field | Why it is needed |
|---|---|
| timestamp + event_id / reason_code | Anchors the incident and allows correlation with network/server logs without ambiguity. |
| VIN / VBUS + IIN / IBUS | Separates input sag from power-path instability and identifies overload vs protection behavior. |
| Temp channels (hot-swap / OR-ing / charger / pack) | Establishes thermal derating → bus collapse chains and avoids “heat happened later” confusion. |
| OR-ing state + switch-over counter | Proves whether backup attempted takeover, chattered, or never engaged at all. |
| charger state + reserve (cap V or SOC/SOH) | Explains why a system with apparent energy still resets (reserve depleted, limited, or unavailable). |
| fault flags + retry counters | Shows protection cycles (retries) vs a single hard trip, and supports “recurring root cause” diagnosis. |
- VIN sag, VBUS follows: points to upstream plant (brownout) or wiring impedance, not OR-ing chatter.
- VIN stable, VBUS dips: points to power-path handoff, OCP retry, or thermal derating on the path.
- Reserve high, takeover fails: points to OR-ing thresholds/hysteresis, state machine priority, or protection lock.
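These three patterns can be applied mechanically once every event carries the same snapshot. The Python sketch below uses a reduced, hypothetical snapshot (only the fields the rules need) and one possible ordering of the checks; the field names, thresholds, and the way overlapping patterns are resolved are assumptions for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:  # reduced, hypothetical subset of the fixed payload
    timestamp: float
    reason_code: str
    vin_min: float
    vbus_min: float
    reserve_pct: float


def triage(s, vin_uv, vbus_uv, reserve_ok_pct=50.0):
    """Map a snapshot to the first-look branch described in the text."""
    vin_sag = s.vin_min < vin_uv
    vbus_dip = s.vbus_min < vbus_uv
    if not vbus_dip:
        return "no_bus_event"
    if not vin_sag:
        return "power_path"       # VIN stable, VBUS dips
    if s.reserve_pct >= reserve_ok_pct:
        return "takeover_failed"  # reserve high, yet the bus followed VIN
    return "upstream_sag"         # VIN sag, VBUS follows


snap = Snapshot(timestamp=1700000000.0, reason_code="UV",
                vin_min=34.0, vbus_min=35.0, reserve_pct=90.0)
print(triage(snap, vin_uv=38.0, vbus_uv=38.0))
```

The output is only a starting branch for investigation; counters and OR-ing state from the full snapshot are still needed to confirm the cause.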
Output artifact: fault triage tree (symptom → log evidence → root cause)
This tree is designed for the most common field entries: reboot/outage, repeated alarms, and performance drop from derating.
| Symptom entry | First evidence to check | Likely root-cause branch |
|---|---|---|
| Reboot / outage | UV flag + min VIN/VBUS at event time | Input sag (VIN drops) vs bus-path issue (VIN stable, VBUS drops) |
| Frequent switch-over | Switch-over counter + OR-ing state timeline | Threshold chatter / insufficient hysteresis / noisy sensing / incorrect priority |
| Repeated retries | OCP flag + retry counters + peak current summary | Inrush-driven limit oscillation / intermittent short / unstable limit settings |
| Throughput drop | Thermal derating state + temperature trends | High-temp + charging + load + low VIN corner causing controlled derating |
| Backup not available | Reserve snapshot (cap V or SOC/SOH) + charger state | Reserve depleted / locked out / charge window too strict / aging (SOH) |
Figure F9 — Brownout timeline with logging tap points (what to capture, when)
Thermal & efficiency: backup heat can be more deceptive than the main load
Why backup thermal problems appear “only in the corner”
Backup subsystems often run quietly until a corner case appears: high ambient with charging enabled, high bus load, and low VIN margin. In this corner, conduction losses, OR-ing drops, charger dissipation, and protection behavior combine and can trigger derating, alarms, or cascading undervoltage events.
Rule: thermal design must include derating curves and sensor placement. A “cold” sensor reading does not prove a hot path is safe.
Heat source list (site backup relevant)
Primary heat contributors
- Hot-swap MOSFET: linear region time, protection retries, and high current paths.
- OR-ing drop: continuous Vdrop × I loss during sourcing.
- Charger: sustained dissipation during recharge windows.
Hidden / corner contributors
- Balancing resistors: can become steady heat under imbalance conditions.
- TVS / clamp: abnormal heating under frequent surge/clamp activity.
- Cabling + connectors: localized I²R heating that shifts sensor trust.
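A quick estimate shows why the OR-ing drop and connector resistance belong on this list. The sketch below uses only the two standard conduction-loss relations (P = Vdrop × I and P = I²R). All numeric values are illustrative assumptions, not measurements from any specific design.

```python
def oring_loss_w(v_drop: float, i_load: float) -> float:
    """Continuous conduction loss in the OR-ing element: P = Vdrop * I."""
    return v_drop * i_load


def i2r_loss_w(i_load: float, r_ohm: float) -> float:
    """Resistive loss in a cable, connector, or FET channel: P = I^2 * R."""
    return i_load ** 2 * r_ohm


# Illustrative values only (assumptions):
i_bus = 20.0  # A, bus current while sourcing

print(f"ideal-diode path, 30 mV drop: {oring_loss_w(0.03, i_bus):.2f} W")
print(f"diode-style path, 0.5 V drop: {oring_loss_w(0.5, i_bus):.2f} W")
print(f"5 mΩ connector + contact:     {i2r_loss_w(i_bus, 0.005):.2f} W")
```

Because the resistive terms scale with the square of current, a connector that is harmless at nominal load can become a localized hotspot at peak load, which is why it can mislead a sensor placed elsewhere.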
Design strategy: derating curve + thermal path + correct sensing points
| Strategy element | What “done” looks like in a site backup subsystem |
|---|---|
| Derating curve | Stepwise limit (current/power) vs temperature, avoiding abrupt shutdown unless unsafe thresholds are crossed. |
| Thermal path | Heat source → copper/heat spreader → chassis → airflow boundary, documented at the block level (no CFD required). |
| Sensor placement | At the true hotspots: hot-swap path, OR-ing element, charger region, and pack thermal reference, not on a cold corner. |
| Logging linkage | Thermal derating/OTP status is time-aligned with UV events to prove a thermal chain vs an input chain. |
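The "stepwise limit vs temperature" row can be sketched as a lookup with hysteresis, so the limit does not chatter when temperature hovers near a step. The step table, shutdown threshold, and hysteresis band below are hypothetical placeholders; real values come from the thermal characterization of the design.

```python
from bisect import bisect_right

# Hypothetical derating steps: (temperature threshold in °C, current limit in A).
# At or above each threshold, the corresponding limit applies.
DERATING_STEPS = [(-40.0, 20.0), (70.0, 15.0), (85.0, 10.0), (100.0, 5.0)]
SHUTDOWN_C = 115.0   # assumed unsafe threshold: hard shutdown at or above this
HYSTERESIS_C = 5.0   # temperature must fall this far below a step to relax it


def step_index(temp_c: float) -> int:
    """Index of the highest step whose threshold is <= temp_c."""
    thresholds = [t for t, _ in DERATING_STEPS]
    return max(bisect_right(thresholds, temp_c) - 1, 0)


def next_limit(temp_c: float, current_idx: int) -> tuple[int, float]:
    """Return (new step index, current limit in A); a limit of 0.0 means shutdown.

    Shutdown recovery policy (latch vs auto-restart) is left out of this sketch.
    """
    if temp_c >= SHUTDOWN_C:
        return len(DERATING_STEPS) - 1, 0.0
    idx = step_index(temp_c)
    if idx < current_idx:
        # Cooling: relax only once temperature is below the step by the hysteresis band.
        idx = min(max(step_index(temp_c + HYSTERESIS_C), idx), current_idx)
    return idx, DERATING_STEPS[idx][1]
```

Tightening happens immediately on a rising temperature, while relaxing waits for the hysteresis band. Logging each step transition gives the "Logging linkage" row something concrete to time-align with UV events.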
Output artifact: thermal risk FMEA mini-table
| Component | Failure mode | Field symptom | Monitored metric | Mitigation |
|---|---|---|---|---|
| Hot-swap MOS | Overheat from linear region / retries | Derating, then UV reset under load | Temp + OCP counters | Shorten linear-region time, tune limits, improve heat spread |
| OR-ing element | Continuous drop heating | Frequent thermal alarms during sourcing | Vdrop + Temp | Lower drop path, airflow, and threshold/hysteresis tuning |
| Charger | Sustained dissipation at high ambient | Charge aborts, reserve never recovers | Charger state + Temp | Windowed charging, derating curve, mechanical thermal path |
| Balancer | Steady heat under imbalance | Hotspot alarms near pack | Pack temp + imbalance indicator | Balance policy + thermal placement, service threshold |
| TVS / clamp | Abnormal heating from frequent clamp activity | Warming, degradation, eventual clamp failure | Clamp temp + event counters | Protection stack review, surge path control, monitoring |
Figure F10 — Thermal source map (block-level, no simulation)
Validation checklist: proving it survives real site events
What “pass” means for an edge site power & backup subsystem
Validation should demonstrate three outcomes simultaneously: (1) the 48V input can be hot-plugged and survive surge/noise without destructive stress, (2) the power path can ride through and switch sources without inducing brownout resets, and (3) telemetry/logs can reconstruct the timeline and root cause after the event.
- VBUS droop stays above reset threshold
- Hot-plug inrush stays below ILIM
- No FET SOA violation window
- Reverse current stays blocked
- Backup takeover within budget
- Fault log is time-aligned
Recommended minimum instruments: fast scope (≥200MHz), current probe or shunt + diff probe, programmable DC source, surge generator (if required), thermal camera, and a log collector that timestamps events.
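Several of the pass criteria above can be checked automatically against a scope export. The following is a rough sketch, assuming equally spaced samples and assuming that input current approximates the FET drain current during the event. The `Capture` layout and the single-number energy budget `e_soa_j` are hypothetical simplifications; a real SOA check compares the VDS/IDS trajectory against the datasheet SOA curve for the relevant pulse width.

```python
from dataclasses import dataclass


@dataclass
class Capture:
    """Hypothetical scope export: equally spaced samples from one hot-plug event."""
    dt_s: float          # sample interval (s)
    vbus: list[float]    # bus voltage (V)
    i_in: list[float]    # input current (A), assumed equal to FET drain current
    vds: list[float]     # hot-swap FET drain-source voltage (V)


def check_capture(c: Capture, v_reset: float, i_lim: float, e_soa_j: float) -> dict[str, bool]:
    """Evaluate three pass criteria against one capture."""
    # Energy dissipated in the FET: sum of VDS * IDS * dt over the capture.
    fet_energy_j = sum(v * i for v, i in zip(c.vds, c.i_in)) * c.dt_s
    return {
        "vbus_above_reset": min(c.vbus) > v_reset,
        "inrush_below_ilim": max(c.i_in) < i_lim,
        "fet_energy_within_budget": fet_energy_j <= e_soa_j,
    }


# Illustrative three-sample capture (values are made up for the example).
cap = Capture(dt_s=1e-4, vbus=[48.0, 44.0, 47.0], i_in=[2.0, 8.0, 3.0], vds=[40.0, 20.0, 1.0])
print(check_capture(cap, v_reset=40.0, i_lim=10.0, e_soa_j=0.05))
```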
Event-driven test matrix (copy/paste for acceptance)
| Scenario | Stimulus | What to measure (fast path) | Telemetry & logs | Pass criteria |
|---|---|---|---|---|
| Hot-plug | Cable L: short/long; load C: min/nom/max; repeated insertions | Inrush peak, dv/dt, VBUS dip, FET VDS/IDS vs time, gate ramp shape | Fault pins, ILIM flag, retry/latch reason, event counter | No latch unless designed; VBUS stays above UV; FET temperature rise bounded |
| Brownout | VIN sag slope; minimum VIN; duration (ms→s) | Detect time, switchover latency, VBUS hold-up window, absence of “ping-pong” oscillation | VIN min, switchover cause, backup energy snapshot, timestamps | Seamless ride-through; no repeated toggling; clean recovery with hysteresis |
| Surge/ESD | Specified surge level; negative transient; input noise | Clamp voltage, overshoot at protected node, FET stress, filter ringing | OV/UV events, clamp overtemp (if monitored), protection layer trigger ID | Protection layers trip in intended order; no TVS thermal runaway |
| Short/OCP | Short at bus; short at downstream; step load to peak | Current limit stability, hiccup vs latch-off behavior, recovery timing | OCP cause code, retry count, last-good snapshot, thermal flag | Limits clamp without oscillation; recovery policy matches spec; no connector damage |
| Backup endurance | Target hold-up time at hot/cold; aged C/ESR and battery derating | Delivered energy, efficiency, VBUS profile, ESR heating | Remaining energy model vs measured, SOH trend, alarm thresholds | Meets time with margin at temperature; alarms precede collapse |
| Telemetry integrity | Disconnect backhaul; reboot controller; power cycles | Timestamp continuity, missing samples, event ordering | Store-and-forward buffer, monotonic event IDs, clock sync method | Logs reconstructable even with link loss; no silent overflow |
Practical trick: every scenario should produce a deterministic “event signature” (flags + counters + min/max snapshots) so that field cases can match lab cases.
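One way to make that matching concrete is to store each lab scenario as a signature record and compare field events against the library. The sketch below is illustrative only: the `EventSignature` fields, the flag names, and the tolerances are assumptions, not a defined log format.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class EventSignature:
    """Hypothetical signature: flags + counters + min/max snapshot."""
    flags: frozenset   # e.g. frozenset({"UV", "SWITCHOVER"})
    retry_count: int
    vin_min: float     # V
    vbus_min: float    # V


def matches(field: EventSignature, lab: EventSignature,
            v_tol: float = 1.0, retry_tol: int = 1) -> bool:
    """True if a field event resembles a lab reference within assumed tolerances."""
    return (
        field.flags == lab.flags
        and abs(field.retry_count - lab.retry_count) <= retry_tol
        and abs(field.vin_min - lab.vin_min) <= v_tol
        and abs(field.vbus_min - lab.vbus_min) <= v_tol
    )


def best_match(field: EventSignature, library: dict) -> Optional[str]:
    """Return the name of the first lab scenario that matches, or None."""
    for name, lab in library.items():
        if matches(field, lab):
            return name
    return None
```

A field event that matches no lab signature is itself useful evidence: it points to a scenario the test matrix has not yet covered.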
Failure exposure: the shortest tests that reveal the biggest hidden risk
The highest-leverage tests intentionally combine “worst pairings” that commonly trigger latent faults: HOT AMBIENT + LOW VIN + CHARGING + PEAK LOAD, and LONG CABLE + HIGH CLOAD + FAST INSERT. These combinations amplify MOSFET linear-region stress, OR-ing thermal loss, and control-loop boundary conditions.
Data to capture per run (recommended): VIN/VBUS, IIN/IBUS, FET VDS, temperature at FET + OR-ing + charger hotspot, state ID, and a compact “event frame” (cause, min/max, energy remaining).
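A compact event frame is easiest to keep consistent between lab and field when it has a fixed binary layout. The sketch below packs one frame into 28 bytes with Python's standard `struct` module. The field list, units, and cause codes are hypothetical and would need to follow whatever the site controller actually logs.

```python
import struct
from typing import NamedTuple


class EventFrame(NamedTuple):
    """Hypothetical fixed-size event frame (cause, min/max, energy remaining)."""
    event_id: int      # monotonic event counter
    uptime_ms: int     # controller uptime at the event
    cause: int         # assumed cause code, e.g. 1 = UV, 2 = OCP, 3 = OTP
    state_id: int      # power-path state at the event
    vin_min_mv: int    # minimum VIN in the event window (mV)
    vbus_min_mv: int   # minimum VBUS in the event window (mV)
    i_peak_ma: int     # peak bus current in the event window (mA)
    energy_mj: int     # energy remaining in the backup reserve (mJ)


# Little-endian: two uint32, two uint8, two pad bytes, four uint32 (28 bytes).
_FRAME = struct.Struct("<IIBBxxIIII")


def pack_frame(frame: EventFrame) -> bytes:
    """Serialize one frame to its fixed 28-byte representation."""
    return _FRAME.pack(*frame)


def unpack_frame(raw: bytes) -> EventFrame:
    """Parse a 28-byte record back into an EventFrame."""
    return EventFrame(*_FRAME.unpack(raw))
```

Fixed-size records make it straightforward to size a store-and-forward buffer and to detect gaps from the monotonic `event_id`.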
BOM / IC selection checklist: choose by criteria (with real part numbers)
How to use this table
Part numbers below are reference-grade building blocks for a 48V edge site power & backup design. Selection priority should match the failure modes from validation: hot-plug SOA control, reverse-current blocking, predictable switchover, and field-reconstructable logs.
- 80V-class front-end
- Stable current limit
- Reverse blocking
- Energy-aware backup
- Telemetry + fault log
“PMBus” often means: (a) native PMBus device, or (b) I²C telemetry + a controller that publishes PMBus-like objects upstream. Both are acceptable if logs stay consistent.
IC shortlist (grouped by function block)
| Function block | Criteria (what matters most) | Concrete part numbers (examples) |
|---|---|---|
| 48V hot-swap | ILIM accuracy & stability; dv/dt control; SOA/power limiting; latch-off vs retry; fault signaling | TI: LM5069 (9–80V hot-swap, power limiting), TPS2490 (9–80V hot-swap, latch-off); ADI: LTC4282 (high-current hot-swap with I²C-compatible monitoring) |
| OR-ing / ideal diode | Reverse blocking; fast switchover; low loss; stability under noise; multi-source behavior | ADI: LTC4370 (two-supply diode-OR + current sharing; rated for 0–18V rails, so not for the 48V input itself), LTC4357 (80V ideal diode controller), LTC4359 (ideal diode + reverse input protection); TI: LM5050-1 (5–75V OR-ing FET controller) |
| Supercap backup | Charger + backup boost integration; health/ESR monitoring; inrush/hot-swap behavior; cap balancing hooks | ADI: LTC3350 (supercap charger + backup + monitoring), LTC3351 (hot-swappable supercap backup controller + monitoring) |
| Battery charger (48V systems) | Wide VIN headroom; buck-boost if needed; multi-chem support; charge termination; thermal regulation; telemetry hooks | ADI: LTC4020 (55V buck-boost multi-chem battery charger), LT8490 (high-voltage buck-boost charge controller, up to ~80V class) |
| Battery gauge / pack manager | SOC/SOH accuracy across temperature; wide pack voltage; protections/alarm codes; SMBus ecosystem | TI: BQ34Z100-G1 (wide-range fuel gauge up to 65V with translation), BQ40Z50 (1–4 series pack manager / gauge over SMBus) |
| Power system manager (PMBus) | Sequencing & supervision; ADC accuracy; fault log & black-box snapshot; GPIO for enable/PG; PMBus command depth | ADI: LTC2977 (8-channel PMBus power system manager with telemetry + fault logs); TI: UCD9090A (10-rail PMBus/I²C sequencer & monitor), UCD90160A (16-rail PMBus sequencer & system manager) |
| High-voltage current/energy monitor | VIN range; accuracy; alert thresholds; energy/charge accumulation; event-friendly sampling | ADI: LTC2946 (2.7–100V current/voltage/power/energy/charge monitor, I²C); TI: INA228 (85V, 20-bit, I²C current/voltage/power/energy/charge monitor) |
Tip: when multiple telemetry sources exist (hot-swap + system manager + current monitor), define one “truth map”: which device owns VIN min/max, which owns VBUS droop, which owns energy remaining, and how timestamps align.
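The truth map can live as a small configuration table that is checked whenever the telemetry set changes. This is a sketch under assumptions: the device names and metric keys are hypothetical labels, not register names from any of the parts listed above.

```python
# Hypothetical truth map: metric -> the single device that owns it.
TRUTH_MAP = {
    "vin_min_max": "hot_swap_controller",
    "vbus_droop": "system_manager",
    "energy_remaining": "current_monitor",
    "timestamp": "system_manager",
}


def validate_sources(truth_map: dict, reported: dict) -> list:
    """List problems: owners that do not report their metric, and
    reported metrics that have no owner assigned in the map.

    `reported` maps each metric to the set of devices that publish it.
    """
    problems = []
    for metric, owner in truth_map.items():
        if owner not in reported.get(metric, set()):
            problems.append(f"{metric}: owner {owner} does not report it")
    for metric in reported:
        if metric not in truth_map:
            problems.append(f"{metric}: no owner assigned")
    return problems
```

Several devices may report the same quantity; the map only decides whose value is logged as authoritative, which keeps lab and field logs comparable.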
“Function → key criteria → validation method” (so BOM decisions are testable)
| Block | Key criteria to lock early | How to validate (fast & field) |
|---|---|---|
| Hot-swap controller | ILIM stability (no oscillation), dv/dt programmability, SOA/power limiting behavior, retry vs latch-off policy, fault pin semantics | Hot-plug across cable L/CLOAD; scope VDS×IDS window; inject short; check cause code + counters |
| OR-ing / ideal diode | Reverse blocking threshold, takeover speed, thermal loss, noise immunity around threshold, multi-source stability (no ping-pong) | Brownout ramp tests; force reverse delta-V; measure IBACK; log switchover count |
| Supercap manager | Charge limit impact on mains, ESR awareness, cap voltage balance strategy integration, backup boost stability under step load | Ride-through at temperature; compare predicted vs delivered energy; record ESR/health flags |
| Battery charger + gauge | Derating at hot ambient, charge termination reliability, SOC/SOH drift control, alarm dictionary completeness | Endurance runs hot/cold; log SOC vs coulomb count; verify alarm thresholds pre-fail |
| PMBus system manager | ADC accuracy vs rails, fault log depth, snapshot alignment across resets, GPIO mapping to enables/PGs | Induce UV/OV/OCP; verify logs reconstruct a timeline; simulate link loss & cache behavior |
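For the supercap row, "compare predicted vs delivered energy" needs a prediction to compare against. A first-order estimate uses the capacitor energy relation E = ½·C·(V_start² − V_min²) at constant load power. The sketch below ignores ESR loss and capacitance derating, and the converter efficiency is an assumed figure, so treat the result as an upper bound to be checked against the ride-through measurement.

```python
def usable_energy_j(c_farad: float, v_start: float, v_min: float) -> float:
    """Energy released by a capacitor bank discharging from v_start to v_min."""
    return 0.5 * c_farad * (v_start ** 2 - v_min ** 2)


def holdup_time_s(c_farad: float, v_start: float, v_min: float,
                  p_load_w: float, efficiency: float = 0.9) -> float:
    """Predicted ride-through time at constant load power.

    Ignores ESR loss and aging; efficiency is an assumed converter figure.
    """
    return usable_energy_j(c_farad, v_start, v_min) * efficiency / p_load_w


def prediction_error(predicted_s: float, measured_s: float) -> float:
    """Relative error of the measurement against the prediction (negative = shortfall)."""
    return (measured_s - predicted_s) / predicted_s


# Illustrative numbers only: 10 F bank, 10 V down to 5 V, 100 W load.
t_pred = holdup_time_s(10.0, 10.0, 5.0, 100.0)
print(f"predicted hold-up: {t_pred:.3f} s")
print(f"error vs a 3.0 s measurement: {prediction_error(t_pred, 3.0):+.1%}")
```

A shortfall that grows with temperature cycling or age suggests rising ESR or capacitance loss, which is the trend the ESR/health flags are meant to expose.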
If design docs need specific part numbers, keep the shortlist above, then freeze one or two parts per block once bench results confirm stability under worst-case combinations.
FAQs (12): Edge Site Power & Backup
Each answer is written to stay inside this page’s scope (48V front-end hot-swap, protection stack, OR-ing/power-path, supercap/battery backup, PMBus telemetry, fault logging, thermal, and validation). Answers reference the earlier sections for deeper details.