Micro-DC Rack: DC Power Distribution, Protection & Telemetry

Q: What is the practical engineering boundary between a Micro-DC Rack and a traditional rack PDU?

A Micro-DC Rack focuses on in-rack DC branch switching, electronic protection, and diagnostic-grade telemetry. It delivers a closed loop: branch control, fast fault response, and actionable logs (reason_code + timestamps + snapshots) exposed through an OOB control plane. It does not cover upstream power conversion or facility distribution.

Q: Why can an eFuse trip frequently even when “average current looks small”?

Average current can hide inrush spikes and fast transients. Common triggers include inrush peaks masked by filtering, di/dt-induced voltage spikes from wiring inductance, shunt thermal drift, and noise coupling into sense/enable pins. Start with the last reason_code, trip_time, and a pre/post snapshot (Vbus, Ibranch, Tchannel) before changing thresholds or debounce.
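To make the "average looks small" point concrete, here is a minimal sketch with a synthetic waveform. The sample rate, load levels, and the 20 A fast-trip threshold are all illustrative assumptions, not values from any specific eFuse.

```python
# Synthetic branch current: 2 A steady load plus a 1 ms, 25 A inrush pulse.
# Every number here is an illustrative assumption, not measured data.
SAMPLE_PERIOD_S = 1e-4       # 10 kHz sampling (assumed)
WINDOW_S = 1.0               # averaging window seen by a slow dashboard
OCP_THRESHOLD_A = 20.0       # hypothetical fast-trip threshold

n_samples = int(round(WINDOW_S / SAMPLE_PERIOD_S))   # 10,000 samples
inrush_samples = 10                                  # 1 ms at 10 kHz
current_a = [25.0 if i < inrush_samples else 2.0 for i in range(n_samples)]

average_a = sum(current_a) / len(current_a)   # what "average current" reports
peak_a = max(current_a)                       # what the fast comparator sees
would_trip = peak_a > OCP_THRESHOLD_A

print(f"average={average_a:.3f} A, peak={peak_a:.1f} A, trips={would_trip}")
```

The one-second average barely moves above the steady load, while the peak crosses the threshold. This is why the pre/post snapshot matters more than a trend line when diagnosing a trip.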

Q: What happens if short-circuit response is too fast or too slow?

A response that is too fast can misclassify inrush or di/dt transients as a hard short, causing nuisance trips. A response that is too slow pushes the switch toward its SOA limit, raises thermal stress, and can spread the disturbance to other branches. Validate with a controlled short fixture and step loads, and require logs that separate SC from OCP and record response timing and peak values.

Q: After a remote power-cycle, the load still won’t come up—what codes and timestamps should be checked first?

Check last reason_code (UV/OV/OCP/SC/OT/remote-off/lockout), then trip_time and attempt_count/backoff state. Next inspect a pre/post snapshot: Vbus droop, Ibranch peak, channel temperature, and enable state. For correlation, rely on monotonic time plus a near-absolute mapping used by the OOB plane.

Q: How can “critical-domain priority power” be implemented without introducing a new single point of failure?

Define a minimal critical domain (OOB controller, alarms, essential comms) and give it a dedicated branch with conservative thresholds, default-on behavior, and independent fault handling. Avoid coupling it to noncritical branches through shared enables or fragile bus dependencies. Ensure resets preserve the critical domain, and log policy_version + reason_code for auditability and testability.

Q: How should door/tamper events link to power actions without causing accidental shutdowns?

Treat door/tamper primarily as audit and permission signals. Use debounce/hysteresis and sensor-fault detection. Door-open should log and restrict privileged writes; forced/tamper events can require confirmation or two-step approval for power actions. Combine power shedding only with additional safety signals (e.g., smoke/leak) to avoid false shutdowns.

Q: What is the most reasonable division of labor for I³C, I²C, and PMBus inside a Micro-DC Rack?

Use I³C for dense on-board discovery and higher-rate telemetry where supported, with clear bus-recovery paths. Use I²C for broad compatibility with sensors and monitors, managing line length, pull-ups, and hang isolation. Use PMBus/SMBus semantics for power-oriented devices (limits, faults, telemetry fields), and keep long or noisy runs isolated or bridged via a controller boundary.

Q: If current measurement drifts over time, how to tell shunt heating from layout issues or the sampling chain?

Check correlation with temperature and duty cycle first: self-heating shifts the shunt value, and thermal gradients add error. Then verify true Kelvin routing and quiet returns; small ground drops can look like drift. Finally inspect the sampling chain: input protection, saturation recovery, ADC reference stability, and filtering. Confirm with controlled DC test points and log the calibration version for traceability.

Q: How should an event log be designed so “who triggered the power action” remains traceable after power loss?

Require every write to carry identity context and log actor_id, request_id, policy_version, channel_id, and result status, alongside reason_code and timestamps. Use a ring buffer plus critical-event snapshot commits on trip/lockout. Reconstruct ordering via monotonic time and store the last known near-absolute mapping for cross-system correlation, without relying on external services.
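The logging approach described above could be sketched as follows. This is a minimal illustration, assuming a hypothetical `PowerActionRecord` type and an in-memory list standing in for nonvolatile storage; the set of reason codes treated as critical is also an assumption.

```python
from collections import deque
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class PowerActionRecord:
    """Hypothetical record carrying the identity context named in the text."""
    t_mono: float          # monotonic seconds since boot
    actor_id: str
    request_id: str
    policy_version: str
    channel_id: int
    reason_code: str
    result: str

# Assumed set of events that must survive power loss.
CRITICAL_REASONS = {"SC", "OCP", "OT", "LOCKOUT"}

class EventLog:
    def __init__(self, capacity=256):
        self.ring = deque(maxlen=capacity)  # volatile ring: oldest entries drop off
        self.committed = []                 # stand-in for a nonvolatile commit area

    def append(self, record):
        self.ring.append(record)
        if record.reason_code in CRITICAL_REASONS:
            # Commit immediately so the record outlives a power loss.
            self.committed.append(asdict(record))

log = EventLog(capacity=2)
log.append(PowerActionRecord(10.0, "alice", "req-1", "pol-7", 3, "REMOTE_OFF", "ok"))
log.append(PowerActionRecord(11.0, "system", "req-2", "pol-7", 3, "SC", "tripped"))
log.append(PowerActionRecord(12.0, "bob", "req-3", "pol-7", 3, "REMOTE_ON", "ok"))
```

With a capacity of two, the first record has already been overwritten, but the short-circuit trip remains in the committed area with its actor and request context intact.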


A Micro-DC Rack is an in-rack DC branch power platform that combines eFuses/high-side switches, telemetry + event logs, and OOB control to make power distribution controllable, protective, and diagnosable.

It keeps critical domains (OOB/alarms) alive while enforcing staged policies for faults and environmental/security events—without entering PSU/UPS or facility power topics.

H2-1 · Scope & Boundary

Scope & Boundary

This page focuses on rack-internal DC distribution and branch-level control: how a Micro-DC Rack uses eFuses/high-side switches, per-branch sensing, event logs, and OOB hooks to make power delivery controllable, protectable, and observable at the branch level.

In scope (what this page goes deep on)

DC bus → branch channels: how loads are segmented, switched, and isolated inside a rack.
Per-branch protection: OCP/short/OV/UV/OT, SOA limits, reverse-current blocking, inrush control.
Per-branch observability: voltage/current/temperature telemetry, fault reason codes, timestamps.
Remote operations: power-cycle, staged turn-on, load shedding, “retry vs latch-off” policy.
OOB hooks: minimal interfaces for reading telemetry/logs and applying safe, auditable control.

Out of scope (explicitly not covered here)

PSU internal topologies (e.g., PFC/LLC), UPS/ATS behavior, or facility wiring codes/standards.
AC mains PDU design details beyond a boundary comparison.
PCIe/CXL fabrics, ToR interposers, or accelerator interconnects (separate pages).
Full BMC SoC architecture and TPM/HSM internals (only referenced as integration dependencies).

Sibling-page link strategy (listed, not expanded)

Need PSU details → CRPS / Server PSU
Need bus hot-swap deep dive → 48 V / 12 V Bus & Hot-Swap
Need broader rack sensors/access → Rack Environment & Access Control

The boundary is intentional: Micro-DC Rack treats upstream power as a DC source black box and concentrates on branch-level safety, control, and diagnostics.

Scope Guard (reader view): Branch channels, eFuse/high-side switching, per-branch telemetry and event logs, OOB integration hooks.

Out-of-scope topics are allowed only as one-line boundary references that point to the appropriate sibling page.

Figure F1 — Boundary map: what Micro-DC Rack covers

The boundary is deliberate: upstream power is treated as a DC source black box, while this page goes deep on branch-level safety, control, and diagnostics.

H2-2 · 1-minute Definition

1-minute Definition: What a Micro-DC Rack Solves

Answer block (definition)

A Micro-DC Rack is a rack-internal DC distribution system that splits a DC bus into multiple protected branches, using eFuses/high-side switches to switch and isolate loads and using per-branch sensing to report telemetry and fault events. It enables safe remote operations—power-cycle, staged turn-on, and load shedding—through OOB management hooks with auditable logs and timestamps.

The focus is not “how power is generated,” but how rack-internal DC is distributed, protected, observed, and controlled at branch granularity.

Where it appears (typical deployments)

Edge micro-sites / remote racks: limited on-site access, strong need for remote recovery and audit trails.
Small private clusters: mixed loads (compute/network/storage) with different criticality levels.
Unattended enclosures: environmental or access events must trigger safe, staged power actions.

What it enables (three keywords)

Switching: per-branch on/off, power-cycle, staged ramp, group sequencing.
Protection: fast short response, controlled current limiting, thermal/SOA enforcement, reverse blocking.
Observability: branch V/I/T telemetry + reason codes + timestamps + “before/after” snapshots.

Practical boundary vs “traditional PDU” (conceptual, not AC deep-dive)

Granularity: Micro-DC is built for per-branch actions and diagnostics, not only aggregate metering.
Diagnostics: “why did it shut off?” is answered by reason code + timestamp, not guesswork.
Automation: control is designed to be called by OOB workflows with permissions and audit logs.

Figure F0 — Problem → Capabilities → Outcomes (minimal text)

A Micro-DC Rack turns “power issues” into controllable actions and diagnosable events (reason code + timestamp + telemetry snapshot).

H2-3 · System Architecture

System Architecture: Power Path + Management Path

A Micro-DC Rack is defined by two parallel paths: the power path (how DC is distributed and protected) and the management path (how measurements, events, and control decisions flow to OOB networks and platforms). The upstream supply is treated as a DC source black box; the emphasis is rack-internal segmentation, safety, and observability.

Power path (rack-internal)

DC Source (black box) → Rack bus / busbar → Branch channels (×N) → Loads.
Branch channels provide switching (on/off, power-cycle, staged turn-on) and protection (OCP/short/OT/SOA/reverse block).
Loads are organized by criticality into domains (IT / Network / Aux), enabling predictable policies such as load shedding.

Management path (measure → log → act)

Sensors (V/I/T + door/tamper) feed aggregation (ADC/MUX and low-speed buses).
A controller enforces safe actions and creates event logs (reason codes + timestamps + snapshots).
An OOB uplink exposes branch/domain objects to a platform for alerting, tickets, and audits.

Multi-domain power: a practical organizing model

IT domain: compute loads that may tolerate staged power but require stable recovery procedures.
Network domain: connectivity-critical loads often kept at higher priority during shedding.
Aux domain: sensing, access, indicators, and OOB control power kept alive for diagnosis and recovery.

Design anchor: troubleshooting starts from event logs (reason code + timestamp) and is validated with telemetry snapshots (V/I/T before and after). This prevents “average power looks normal” from masking the real trigger.

Legend: Power (solid) · Control (dashed) · Telemetry (dotted)

Figure F1 — Micro-DC Rack block diagram (power, control, telemetry)

The diagram separates the power path from the management path. Branch channels (×N) are the control and diagnostic unit exposed to OOB platforms.

H2-4 · Power Distribution Strategy

Power Distribution Strategy (Rack-internal, not PSU topology)

Distribution strategy is expressed as bus choices, two-level protection boundaries, and remote operation policies. The goal is predictable rack behavior: one faulty branch should be isolated quickly, while critical domains and OOB diagnostics remain available.

Bus voltage trade-offs (effects inside the rack)

Higher bus voltage reduces current for the same power, easing conductor loss and busbar size.
Device stress shifts: branch switches must tolerate higher voltage transients while still keeping adequate SOA margin.
Measurement implications: lower current reduces shunt heating; higher voltage increases the need for robust input protection and divider accuracy.
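The current and conductor-loss effect of bus voltage follows directly from I = P / V and P_loss = I² · R. The sketch below works one example; the 600 W load and 5 mΩ path resistance are illustrative assumptions.

```python
def branch_current_and_loss(power_w, bus_v, path_res_ohm):
    """Return (current in A, conductor loss in W) for a given delivered power."""
    current_a = power_w / bus_v
    loss_w = current_a ** 2 * path_res_ohm
    return current_a, loss_w

LOAD_W = 600.0        # assumed branch load
PATH_RES_OHM = 0.005  # assumed busbar + connector + shunt resistance

results = {}
for bus_v in (12.0, 48.0):
    results[bus_v] = branch_current_and_loss(LOAD_W, bus_v, PATH_RES_OHM)
    current_a, loss_w = results[bus_v]
    print(f"{bus_v:.0f} V bus: {current_a:.1f} A, {loss_w:.2f} W lost in the path")
```

Raising the bus voltage by 4× cuts the current by 4× and the resistive loss by 16× for the same path, which is also why shunt self-heating drops at higher bus voltage.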

Two-level protection (avoid “one fault takes the rack”)

Level 1 — Bus-level protection: a rack-wide guardrail and last-resort cutoff (referenced only).
Level 2 — Branch-level electronic protection: the primary fault isolator (OCP/short/OT/SOA/reverse).
Rule of thumb: branch channels should trip first and log the reason; bus-level protection is a rare fallback.

Remote operation policies (turn actions into predictable behavior)

Power-cycle: define minimum off-time, retry budget, and “latch-off” conditions for repeated faults.
Load shedding: prioritize domains so Aux/OOB stays alive for diagnosis and recovery.
Staged turn-on: group branches and apply stagger intervals to reduce bus droop and false trips.

Policy must be observable: each action should leave an auditable trace (who/what/when/why/result) and correlate to a branch’s event log timeline.
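A staged turn-on plan can be expressed as a simple schedule. The sketch below assumes branches are already grouped in priority order and that a fixed stagger interval between groups is acceptable; the group names and the 250 ms interval are hypothetical.

```python
def staged_turn_on(groups, stagger_s):
    """Build (time offset in seconds, branch id) pairs.

    groups: list of branch-id lists, highest priority first.
    Every branch in a group is enabled at the same offset; each
    following group waits one more stagger interval.
    """
    schedule = []
    for index, group in enumerate(groups):
        for branch_id in group:
            schedule.append((index * stagger_s, branch_id))
    return schedule

# Aux/OOB first, then network, then IT loads (hypothetical branch names).
plan = staged_turn_on([["aux0"], ["net0", "net1"], ["it0"]], stagger_s=0.25)
for offset_s, branch_id in plan:
    print(f"t+{offset_s:.2f}s enable {branch_id}")
```

In practice the stagger interval would be chosen from measured bus droop and inrush duration rather than fixed up front.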

Figure F2 — Branch strategy state machine (minimal text)

The branch channel is the control unit: policy defines whether overload recovers, short latches off, thermal cools down, and remote actions are auditable.

H2-5 · Branch Protection Core

Branch Protection Core: eFuse / High-Side Switch (Deep Dive)

A branch channel is the unit that turns rack-internal DC into a controllable and diagnosable service. It must (1) deliver current safely under normal load, (2) isolate faults fast without collapsing the bus, and (3) explain every action using reason codes, timestamps, and snapshots. Topics are limited to DC-bus internal behavior (no facility/AC events).

What a branch channel must guarantee

Controlled power: on/off, power-cycle, staged turn-on, and group policies.
Electronic protection: OCP/short/OT/OV/UV with SOA-aware behavior.
Isolation by design: one branch fault should not drag the whole rack bus down.
Forensics: every trip produces a reason code + timestamp + “before/after” telemetry snapshots.

Mechanism: protection loops inside the branch

Current path: switch FET(s) + sense element (shunt or RDS(on)) feed fast detection and measurement.
Control path: gate control enforces soft-start and a programmable current-limit profile.
Thermal path: temperature sensing triggers derating, cooldown, or latch-off depending on policy.
Stateful recovery: retry budget, backoff, and latch-off conditions prevent endless oscillation.

Field pitfalls (why it “trips wrong” or “fails to trip”)

di/dt + parasitic L: fast current edges across wiring inductance cause bus or switch-node spikes that look like OV/UV or false short events.
Sense placement & drift: hot shunt or poor Kelvin routing biases current measurement and thresholds.
Capacitive loads + sequencing: inrush holds the channel in current limit, which builds thermal stress and leads to late OT trips.
Concurrent turn-on: multiple branches start together → bus droop → UV cascades; mitigate with stagger and priority domains.

Recoverability policy (minimum set): retry budget + backoff, explicit latch-off conditions, manual clear workflow, and a black-box record containing reason code, timestamps, and snapshots.

A “retry loop with no evidence” increases downtime; a “latch-off with evidence” reduces support time.
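The recoverability policy above can be sketched as a small decision function. The specific choices here are assumptions for illustration: a budget of three retries, exponential backoff from one second, and short-circuit (SC) as the only reason code that latches off immediately.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RetryPolicy:
    retry_budget: int = 3                        # assumed maximum automatic retries
    base_backoff_s: float = 1.0                  # assumed first backoff delay
    latch_on: frozenset = frozenset({"SC"})      # reasons that never auto-retry

def next_action(policy, reason_code, attempt_count):
    """Decide what a branch does after a trip.

    Returns ("LATCH_OFF", None) when a manual clear is required,
    or ("RETRY", delay_s) with an exponentially growing delay.
    """
    if reason_code in policy.latch_on:
        return ("LATCH_OFF", None)
    if attempt_count >= policy.retry_budget:
        return ("LATCH_OFF", None)
    return ("RETRY", policy.base_backoff_s * 2 ** attempt_count)

policy = RetryPolicy()
for attempt in range(4):
    print("OCP attempt", attempt, "->", next_action(policy, "OCP", attempt))
print("SC attempt 0 ->", next_action(policy, "SC", 0))
```

Each decision should be written to the event log together with the reason code and attempt count, so a latch-off always comes with its evidence.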

Key specification checklist (spec → impact → knob → field symptom)

Spec item	What it controls	Engineering knob	Field symptom if wrong
VIN / VBUS range	Operating margin vs DC-bus transients	OV/UV thresholds + deglitch + safe derating	Random resets during load steps; unexplained UV/OV events
I_CONT / I_PEAK	Thermal stability and startup envelope	Stagger, soft-start, inrush shaping	Stable idle but trips during boot or simultaneous start
SOA / short energy	Whether the switch survives worst-case faults	Fast short detect + limit profile + timeout	FET damage despite “not that high” steady current
Limit mode	Fault containment vs heating	Constant-current vs foldback vs fast-trip	Either nuisance trips (too aggressive) or overheating (too soft)
Short response time	Bus stability and device survival	Blanking/deglitch tuned to real parasitics	Does not trip on hard short, or false short on fast load steps
Reverse blocking	Domain isolation and backfeed prevention	Back-to-back FET behavior + reverse thresholds	Unexpected cross-domain coupling; “ghost power” paths
OV/UV + debounce	Cascading trips during bus droop/spike	Debounce windows + staged retry policies	Rack-wide oscillation: trip → recover → trip loops
Thermal sensing	Preventing long-tail failures	Cooldown vs latch-off; derating thresholds	Late OT trips after minutes; performance derates are invisible
Current accuracy + drift	Threshold truthfulness over temperature	Kelvin sense, calibration, conservative margin	“Doesn’t look high” but trips, or never trips until damage
Telemetry bandwidth	Debug resolution vs noise + storage	Low-rate trends + event snapshots	Either too noisy to trust, or too sparse to explain a trip
Reason code model	Forensic explainability	Primary + flags, consistent versioning	“Trip happened” with no actionable why/how/when

Figure F3 — Single-branch eFuse channel internals (protection + observability)

The branch channel combines fast protection loops with an observability plane (ADC + registers + reason codes + snapshots). Every trip becomes diagnosable evidence.

H2-6 · Telemetry & Event Log

Telemetry & Event Log: Make “Observability” Debuggable

Observability is useful only when it answers: which branch, why, when, and what changed before/after. This section defines the minimal telemetry objects and event-log design needed for reliable troubleshooting in a Micro-DC Rack.

Telemetry objects (by layer)

Bus: Vbus (and optionally Ibus) to detect droop and operating margin.
Branch: Ibranch, channel temperature, switch state, and fault flags.
Env / Access: temperature/humidity plus door/tamper events for correlation.

Status & configuration provenance (avoid “unknown changes”)

State: ON/OFF/LIMIT/TRIP/COOL with last transition cause.
Reason code: primary cause + flags (e.g., OCP with UV flag).
Retry: retry counter, last retry time, and latch-off conditions.
Config version: threshold/policy version ID used at the time of an event.

Event-log “three-piece set” (dictionary · timestamps · persistence)

Event dictionary: OCP/SC/OT/UV/OV/REMOTE_OFF/DOOR_OPEN as a versioned list.
Timestamp strategy: local monotonic time + platform-aligned approximate time.
Buffer & power-loss behavior: ring buffer + critical event snapshots (minimum viable).

Low-rate trends · Event-driven snapshots · Short burst capture

A practical trade-off: keep low-rate telemetry for trends, capture high-resolution evidence only around events (pre/post snapshots). This reduces bandwidth and storage while preserving root-cause explainability.
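One way to capture evidence only around events is to keep a short rolling history and freeze it on a trip. The sketch below is a simplified illustration; the `SnapshotRecorder` class, the window sizes, and the sample tuple layout are assumptions.

```python
from collections import deque

class SnapshotRecorder:
    """Keeps the last few samples and attaches pre/post windows to each trip."""

    def __init__(self, pre_samples=4, post_samples=2):
        self.history = deque(maxlen=pre_samples)  # rolling pre-event window
        self.post_samples = post_samples
        self.pending = None                       # event still collecting post data
        self.events = []                          # completed event records

    def sample(self, t_mono, vbus, ibranch, tchannel):
        point = (t_mono, vbus, ibranch, tchannel)
        if self.pending is not None:
            self.pending["post_snapshot"].append(point)
            if len(self.pending["post_snapshot"]) >= self.post_samples:
                self.events.append(self.pending)
                self.pending = None
        self.history.append(point)

    def trip(self, t_mono, reason_code):
        if self.pending is not None:
            # A second trip arrived early: close the first with partial post data.
            self.events.append(self.pending)
        self.pending = {
            "t_mono": t_mono,
            "reason_code": reason_code,
            "pre_snapshot": list(self.history),
            "post_snapshot": [],
        }

rec = SnapshotRecorder()
for t in range(6):
    rec.sample(float(t), 48.0, 2.0, 40.0)
rec.trip(5.5, "OCP")
rec.sample(6.0, 47.1, 0.0, 41.0)
rec.sample(7.0, 48.0, 0.0, 40.5)
```

Only the samples around the trip are retained at full resolution; everything else can be decimated for trend storage.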

Minimal event record fields (field → purpose → pitfall)

Field	Purpose	Common pitfall if missing
event_type	Normalize causes (OCP/SC/OT/UV/OV/REMOTE_OFF/DOOR_OPEN)	Different faults look identical; automation can’t route tickets
source_id	Bind to branch/domain/bus object	Cannot answer “which branch” quickly
t_mono	Reliable ordering on-device (monotonic)	Events reorder after reboots or clock changes
t_approx	Approximate absolute time aligned to platform	Hard to correlate with door/env alerts and platform logs
reason_code	Primary cause + flags (e.g., OCP with UV flag)	Only a “trip happened” statement, not actionable
pre_snapshot	State + V/I/T immediately before the event	No evidence for whether it was droop, inrush, or drift
post_snapshot	State + V/I/T immediately after the event	Cannot confirm recovery or thermal consequences
retry_count	Expose oscillation and policy behavior	Hidden “retry storms” waste time and mask real faults
config_version	Record threshold/policy revision in effect	“It used to work” becomes un-debuggable after config changes
actor (optional)	Mark remote/manual actions for audit trails	Remote-off events look like faults; accountability is lost
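The field table above can be turned into a completeness check, and t_approx can be derived from t_mono with a stored reference pair. This is a sketch under assumptions: the `TimeMapping` type and helper names are hypothetical, and the mapping is a plain offset with no drift correction.

```python
from dataclasses import dataclass

# Required fields taken from the table above; "actor" is optional there.
REQUIRED_FIELDS = (
    "event_type", "source_id", "t_mono", "t_approx", "reason_code",
    "pre_snapshot", "post_snapshot", "retry_count", "config_version",
)

@dataclass(frozen=True)
class TimeMapping:
    """Last known pairing of device monotonic time and platform wall time."""
    mono_ref_s: float
    wall_ref_s: float   # Unix seconds as provided by the platform (assumed)

def to_approx(t_mono, mapping):
    """Map monotonic time to approximate absolute time, or None if never synced."""
    if mapping is None:
        return None
    return mapping.wall_ref_s + (t_mono - mapping.mono_ref_s)

def missing_fields(record):
    """List required keys that are absent from an event record dict."""
    return [name for name in REQUIRED_FIELDS if name not in record]

mapping = TimeMapping(mono_ref_s=100.0, wall_ref_s=1_700_000_000.0)
record = {
    "event_type": "OCP", "source_id": "branch-3", "t_mono": 105.5,
    "t_approx": to_approx(105.5, mapping), "reason_code": "OCP+UV",
    "pre_snapshot": [], "post_snapshot": [], "retry_count": 1,
}
print(missing_fields(record))
```

A record with t_approx set to None is still complete: ordering comes from t_mono, and the absolute time can be filled in once a mapping exists.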

Figure F4 — Event timeline and correlation (branch + env/access)

Debuggable observability relies on event ordering (monotonic timestamps), platform correlation (approximate time), and pre/post snapshots that explain why the branch transitioned from LIMIT to TRIP.

H2-7 · Env/Security Sensors

Env/Security Sensors: Power-Aware Policies (Logic + Interfaces)

Environmental and security signals are not “extra dashboards” in a Micro-DC Rack—they become policy inputs that shape branch actions (derate, shed, lockout) and create auditable evidence. This section focuses on linkage logic, qualification (debounce/hysteresis), and event-log binding.

Sensor scope (kept intentionally narrow)

Thermal / humidity: temp, humidity (trend + threshold events).
Door / tamper: door-open, chassis tamper, service-panel open.
Smoke / leak: presence events only (severity mapping is policy-driven).
Interface: sensors feed the same policy engine that can actuate branch/group power states.

Power-side action primitives (what “linkage” can actually do)

Branch actions: remote-off, lockout (manual clear), derate, delayed retry/cooldown.
Group actions: load shedding by priority (non-critical first), staged restore.
Preserve domains: keep OOB + alarm domain alive even during emergency shedding.
Evidence capture: trigger pre/post snapshots and annotate the event log.

Policy examples (trigger → qualify → act → record → clear)

Door open → debounce → write audit event + restrict high-risk actions (e.g., threshold changes) → record actor/request_id → clear when door closed + cool-down window.
Smoke / leak → qualify + severity map → shed non-critical group first; preserve OOB/alarm → record “policy_shed” reason code + snapshots → clear requires manual confirmation.
Temp high → hysteresis → derate group; if sustained, shed by priority → record “thermal_policy” + config_version → clear after sustained safe temp (with hysteresis).

Data trust (do not treat noise as emergencies)

Debounce: door/tamper chatter and intermittent contacts must not cause oscillation.
Hysteresis: avoid repeated shed/restore around a single threshold.
Sensor fault detect: open/short/drift should raise sensor_fault and switch to a safe degraded policy.
Event binding: every policy action must reference the sensor event_id and snapshots for postmortems.

Minimal linkage rule: sensor events never directly “cut power” without qualification. The policy engine decides severity and maps it to branch/group actions while preserving OOB + alarm visibility.
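The qualification step can be sketched with two small helpers: a sample-count debounce for contact inputs and a two-threshold hysteresis for analog readings. The class names, the three-sample hold, and the 45/40 °C thresholds are illustrative assumptions.

```python
class Debounce:
    """Accept a new boolean state only after it persists for hold_samples reads."""

    def __init__(self, hold_samples, initial=False):
        self.hold_samples = hold_samples
        self.state = initial
        self.count = 0

    def update(self, raw):
        if raw == self.state:
            self.count = 0            # chatter back to the current state resets
        else:
            self.count += 1
            if self.count >= self.hold_samples:
                self.state = raw
                self.count = 0
        return self.state

class Hysteresis:
    """Assert at trip_at, release only once the value falls to clear_at."""

    def __init__(self, trip_at, clear_at):
        if clear_at >= trip_at:
            raise ValueError("clear_at must be below trip_at")
        self.trip_at = trip_at
        self.clear_at = clear_at
        self.active = False

    def update(self, value):
        if not self.active and value >= self.trip_at:
            self.active = True
        elif self.active and value <= self.clear_at:
            self.active = False
        return self.active

door = Debounce(hold_samples=3)
door_states = [door.update(x) for x in
               [True, False, True, True, True, False, False, False]]

temp_high = Hysteresis(trip_at=45.0, clear_at=40.0)
temp_states = [temp_high.update(t) for t in [39.0, 45.0, 43.0, 41.0, 40.0, 44.0]]
```

Only the qualified outputs should reach the policy engine; the raw readings can still be logged for postmortems.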

Policy template table (copyable operating model)

Sensor event	Severity	Power action	Preserve domain	Log requirement	Clear condition
Door open	L1	Audit + restrict risky controls (threshold changes / bulk cycles)	OOB + alarm	event_type=DOOR_OPEN, actor, request_id, t_mono	Door closed + debounce + short cool-down
Temp high	L2	Derate → if sustained, shed by priority group	OOB + alarm	event_type=TEMP_HIGH, config_version, snapshots	Safe temp sustained (with hysteresis)
Smoke / leak	L3	Shed non-critical first; lockout restore until manual confirm	OOB + alarm	event_type=SMOKE/LEAK, reason_code=POLICY_SHED, snapshots	Manual clear + follow-up safe-check
Sensor fault	L2	Degraded policy (limit risky automation), raise alarm	OOB + alarm	event_type=SENSOR_FAULT, sensor_id, fault_mode	Sensor recovers + validation window

Figure F5 — Sensor → policy → branch/group actions (priority decision tree)

Sensor events are qualified first (debounce/hysteresis/fault-detect), then mapped to a priority ladder (L1–L3) that produces branch/group actions and writes evidence-rich event records.

H2-8 · OOB Management Hooks

OOB Management Hooks: Where the BMC Connects—and Where It Stops

This section defines the minimum OOB closed loop for a Micro-DC Rack: read telemetry, apply safe branch/group controls, pull evidence logs, and trigger updates—without diving into BMC SoC internals. The goal is a clean boundary: object model + privilege gates + audit fields.

Minimum OOB loop (four capabilities)

Read: bus/branch/env telemetry with clear object identifiers.
Control: branch on/off, load-shed by group, policy/threshold application with versioning.
Evidence: pull event logs + snapshots for postmortems and ticket attachments.
Update hook: trigger firmware update workflows and report status (no deep implementation).

Platform responsibilities (beyond the rack)

Alert routing: map events to severities and destinations (NOC, paging, ticketing).
Ticket workflows: attach log packets and snapshots; enforce runbooks.
Audit: immutable trails for who changed what, when, and why.
Fleet policy: batch rollouts of thresholds and policies with staged validation.

Bus/protocol boundary (engineering allocation)

I²C / I³C: local management bus for configuration + moderate-rate telemetry.
SMBus / PMBus: power-oriented objects (telemetry + configuration) with consistent semantics.
RS-485 (optional): longer reach / stronger noise immunity for slower control/monitor paths.
Boundary rule: protocols are tools—object ownership and privilege gating define the system boundary.

Security hooks (hooks only; implementation lives elsewhere)

Authentication: every write/control request is attributable to an actor/session.
RBAC: read-only vs operator vs security admin; risky actions require elevated rights.
Update safety: anti-rollback hook and rollback-safe state reporting.
Non-repudiation: audit fields in logs (actor, request_id, approval_state, config_version).

Privilege matrix (action → role → audit requirement)

Action	Role	Scope	Audit requirement	Safety gate
Read telemetry	Read-only	Rack / group / branch	request_id, t_mono (optional)	None
Pull logs & snapshots	Operator	Rack / branch	actor, request_id, time range	Rate limit
Remote off (single branch)	Operator	Branch	actor, reason, t_mono, branch_id	Optional confirm
Power-cycle (non-critical group)	Operator	Group	actor, request_id, group_id, snapshots	Cooldown window
Change thresholds / policies	Security admin	Group / rack	actor, config_version, approval_state	Two-step apply
Firmware update trigger	Security admin	Controller / rack	actor, package_id, anti-rollback status	Staged rollout
Clear latch-off / unlock	Two-person approval	Branch / group	two actors, reason, t_mono, snapshots	Dual confirmation
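A subset of the privilege matrix can be encoded as data and checked before any write is executed. The sketch below makes several assumptions that the table does not state: roles form a simple ranked hierarchy, the two-person action needs operator level or above, and audit context arrives as a dictionary. All names are hypothetical.

```python
ROLE_RANK = {"read_only": 0, "operator": 1, "security_admin": 2}

# action: (minimum role, required audit fields, distinct actors needed)
ACTIONS = {
    "read_telemetry":    ("read_only", ("request_id",), 0),
    "remote_off":        ("operator", ("actor", "reason", "t_mono", "branch_id"), 1),
    "change_thresholds": ("security_admin",
                          ("actor", "config_version", "approval_state"), 1),
    "clear_latch_off":   ("operator",
                          ("actors", "reason", "t_mono", "snapshots"), 2),
}

def authorize(action, role, audit):
    """Return (allowed, message) for a requested action."""
    min_role, required, min_actors = ACTIONS[action]
    if ROLE_RANK[role] < ROLE_RANK[min_role]:
        return False, "insufficient role"
    missing = [name for name in required if name not in audit]
    if missing:
        return False, "missing audit fields: " + ", ".join(missing)
    # Collect distinct identities from either a single actor or an actor list.
    actors = set(audit.get("actors") or [audit.get("actor")]) - {None}
    if len(actors) < min_actors:
        return False, f"needs {min_actors} distinct actor(s)"
    return True, "ok"

ok_off = authorize("remote_off", "operator",
                   {"actor": "alice", "reason": "maintenance",
                    "t_mono": 12.0, "branch_id": 3})
denied = authorize("remote_off", "read_only", {"actor": "eve"})
same_person = authorize("clear_latch_off", "security_admin",
                        {"actors": ["alice", "alice"], "reason": "fixed",
                         "t_mono": 30.0, "snapshots": []})
two_people = authorize("clear_latch_off", "security_admin",
                       {"actors": ["alice", "bob"], "reason": "fixed",
                        "t_mono": 30.0, "snapshots": []})
```

Safety gates such as cooldown windows and rate limits would sit after this check and are left out here.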

Figure F6 — OOB data flow and permission boundary (read vs write vs update)

The rack controller exposes a minimal OOB surface (read, control, evidence, update hooks). All write/update actions pass through an RBAC gate and produce audit fields (actor, request_id, approval_state, config_version).

Boundary reminder: this page defines hooks and privilege gates. BMC internals (SoC, DDR, network stack) and TPM/HSM implementation are intentionally excluded and should be linked to their dedicated pages.

H2-9 · Hardware Implementation Checklist

Hardware Implementation Checklist: From “Works” to “Production-Stable”

Production stability in a Micro-DC Rack is usually lost in predictable places: current sensing realism, switching thermal paths, rack-internal power integrity, and management-bus robustness. This checklist focuses on the highest-yield details that prevent false trips, missing trips, and telemetry dropouts.

Reduce false OCP · Robust buses · Thermal margin · Clear fail-safe

1) Current sensing (shunt + Kelvin + input protection)

Shunt placement: keep the power loop compact; avoid placing the shunt where return current can bypass it.
True Kelvin routing: sense traces must be a dedicated pair, symmetric, and kept away from high di/dt nodes.
Thermal coupling: expect temperature gradients; reduce drift by keeping the shunt environment predictable.
Front-end protection: protect amplifier/ADC inputs from inductive spikes without distorting normal sensing.
Bandwidth intent: decide whether the control reacts to peaks or averages; filter accordingly to avoid “spike = trip”.

2) Switch & thermal path (heat flow is part of protection accuracy)

Heat path: device → copper → via array → backside spreader; validate that heat actually leaves the hot spot.
OT thresholds: ensure thermal shutdown and recovery behavior does not oscillate (add time qualification where needed).
Copper sizing: sized for both current and heat spreading; narrow necks create local thermal cliffs.
Parallel / redundancy (light touch): if used, keep symmetry; treat redundancy as a policy-managed grouping problem.

3) Bus and branch PI (rack-internal only)

Return path discipline: high di/dt current must loop locally; avoid “return wandering” through sensitive ground.
Decoupling distribution: bulk near the bus entry + high-frequency close to branch switching elements.
Inrush concurrency: stagger/group enables so “many branches at once” does not create a bus droop trip cascade.
Spike containment: the inductive spike source is often layout; treat mitigation as loop + placement first.

4) Communications robustness (I³C/I²C/PMBus + isolation trigger)

Line length & pull-ups: plan for total bus capacitance; pull-ups must meet rise-time without creating ringing.
Topology clarity: segment or buffer when needed; prevent a single long spur from dominating bus timing.
Common-mode reality: if crossing grounds/domains or long cable runs, consider isolation at the boundary.
Hang recovery: include a bus reset/recovery path (watchdog + controlled re-init) to avoid “silent telemetry loss”.
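For the pull-up item, a first-pass I²C sizing check uses two bounds: a minimum resistance set by the sink-current limit and a maximum set by the rise-time budget, where the 30% to 70% rise time of an RC edge is about 0.8473 · R · C. The sketch assumes the commonly used 0.4 V at 3 mA low-level condition and a 300 ns Fast-mode rise-time limit; check these against the device datasheets in use.

```python
# ln(0.7 / 0.3): time for an RC edge to rise from 30% to 70% of VDD, in RC units.
RISE_TIME_FACTOR = 0.8473

def pullup_bounds(vdd_v, bus_cap_f, rise_time_max_s, vol_max_v=0.4, iol_a=0.003):
    """Return (r_min, r_max) in ohms for an open-drain bus pull-up.

    r_min keeps the low level below vol_max_v at the sink current iol_a.
    r_max keeps the rising edge within rise_time_max_s for the bus capacitance.
    """
    r_min = (vdd_v - vol_max_v) / iol_a
    r_max = rise_time_max_s / (RISE_TIME_FACTOR * bus_cap_f)
    return r_min, r_max

# Assumed example: 3.3 V bus, 200 pF total capacitance, 300 ns rise-time limit.
r_min, r_max = pullup_bounds(3.3, 200e-12, 300e-9)
print(f"pull-up window: {r_min:.0f} to {r_max:.0f} ohms")

# Same bus with 400 pF: the window closes, so the bus needs segmenting or buffering.
r_min_long, r_max_long = pullup_bounds(3.3, 400e-12, 300e-9)
window_closed = r_max_long < r_min_long
```

When the window closes, no resistor value satisfies both limits, which is the quantitative trigger for the segmentation and buffering advice above.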

5) Fail-safe defaults (what happens when control is lost)

Default states: define which branches default OFF and which must stay ON for OOB/alarm continuity.
Priority domains: OOB + alarm domain power must be preserved during shedding/lockout policies.
Config versioning: threshold/policy versions should be logged so “changed policy” is visible in postmortems.

High-yield red lines: true Kelvin sensing, local return loops, staged enables, bus hang recovery, and explicit OOB/alarm power priority.

Figure F7 — Layout-focused checklist (Kelvin, sensitive keep-outs, thermal path, decoupling)

Focus on Kelvin correctness, local return loops, thermal exit paths, and management-bus recovery/isolation triggers to prevent false protection events and telemetry loss.

Boundary reminder: this checklist stays within rack-internal DC distribution and management wiring. Facility-level surge/EMC theory and PSU internals are excluded by design.

H2-10 · Validation & Production Test

Validation & Production Test: Calibration, Fault Injection, and Field Self-Checks

Testing for a Micro-DC Rack is not just “power on and read telemetry.” It must prove that protection actions are correct, telemetry is trustworthy across conditions, and evidence logs are sufficient for fast troubleshooting. The emphasis here is test hooks specific to rack DC distribution: stimulus → expected action → expected log fields.

R&D validation (stimulus → measurement → pass criteria)

Overcurrent / short: step loads and short fixtures; measure response time and state transitions.
SOA / thermal: steady + pulsed load; confirm derating behavior and thermal shutdown stability.
False-trip resilience: noise injection, bus droop, concurrent enables; track trip probability and recovery behavior.
Concurrency scenarios: staged enable vs all-at-once; confirm no cascading bus droop trips.

Production test + maintenance (minimum set with high coverage)

Channel self-test: on/off response and readback of state bits and fault flags.
Sensor open/short detection: door/leak/temp fault checks with explicit reason codes.
Current calibration: one-point or two-point strategy, with a plan for temperature-drift compensation.
Log pull + integrity check: verify log packet completeness and the presence of required audit fields.
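A two-point current calibration reduces to solving for gain and offset from two reference loads. The sketch below uses made-up ADC counts and reference currents; temperature compensation is not modeled.

```python
def two_point_cal(raw_lo, ref_lo, raw_hi, ref_hi):
    """Solve ref = gain * raw + offset from two calibration points."""
    if raw_hi == raw_lo:
        raise ValueError("calibration points must differ")
    gain = (ref_hi - ref_lo) / (raw_hi - raw_lo)
    offset = ref_lo - gain * raw_lo
    return gain, offset

def apply_cal(raw, gain, offset):
    """Convert a raw reading to amps using stored calibration constants."""
    return gain * raw + offset

# Assumed fixture data: 110 counts at 1.0 A and 1010 counts at 10.0 A.
gain, offset = two_point_cal(110, 1.0, 1010, 10.0)
reading_a = apply_cal(560, gain, offset)
print(f"gain={gain:.4f} A/count, offset={offset:.3f} A, 560 counts -> {reading_a:.2f} A")
```

Storing the gain, offset, and a calibration version alongside the channel makes later drift investigations traceable.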

Fault injection mindset (matrix-driven, not ad-hoc)

Fault types: OCP, short, OT, UV/bus droop, remote-off, sensor-fault.
Expected actions: limit/derate, shed, lockout/manual clear, preserve OOB + alarm domain.
Expected logs: reason_code, t_mono, snapshots, config_version; plus actor/request_id for writes.
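The matrix idea can be captured as a table of expectations plus a checker that a test harness calls after each injection. The fault-to-action mapping below is a hypothetical example, not the page's actual matrix; the required log fields follow the list above.

```python
BASE_LOG = {"reason_code", "t_mono", "snapshots", "config_version"}
WRITE_LOG = BASE_LOG | {"actor", "request_id"}   # writes also carry identity

# fault: (expected action, required log fields) -- illustrative mapping only
FAULT_MATRIX = {
    "OCP":        ("limit", BASE_LOG),
    "SC":         ("lockout", BASE_LOG),
    "OT":         ("derate", BASE_LOG),
    "REMOTE_OFF": ("off", WRITE_LOG),
}

def check_injection(fault, observed_action, log_record):
    """Return a list of problems; an empty list means the injection passed."""
    expected_action, required = FAULT_MATRIX[fault]
    problems = []
    if observed_action != expected_action:
        problems.append(f"action: expected {expected_action}, got {observed_action}")
    missing = sorted(required - set(log_record))
    if missing:
        problems.append("missing log fields: " + ", ".join(missing))
    return problems

full_record = {"reason_code": "SC", "t_mono": 42.0,
               "snapshots": [], "config_version": "cfg-12"}
sc_result = check_injection("SC", "lockout", full_record)
remote_result = check_injection("REMOTE_OFF", "off", full_record)
wrong_action = check_injection("OCP", "lockout", full_record)
```

Running every row of the matrix on each firmware or policy revision turns the expected-action and expected-log columns into regression tests.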

Field self-checks (fast, safe, and actionable)

Read-only baseline: state bits, thresholds, policy/config version, and recent events.
Low-risk actuation: validate a non-critical branch/group control under controlled windows.
Degraded operation: if sensors are faulty or telemetry is partial, restrict risky automation and keep OOB/alarm online.

Minimal troubleshooting workflow: check reason_code, then the timestamp, then pull telemetry snapshots from ±1 s around the event, and finally verify config_version and actor to rule out policy changes.

Figure F8 — Fault injection matrix (fault × expected action × expected log fields)

A matrix-driven approach makes validation repeatable: each injected fault must lead to the intended action primitive and a complete, searchable log packet (reason_code, timestamps, snapshots, and configuration context).

Boundary reminder: this section defines rack DC distribution test hooks and workflows (stimulus → action → logs). Generic BIST/POST architectures and crypto internals are intentionally excluded.

H2-11 · Parts / IC Selection Pointers

Parts / IC Selection Pointers: Vendor Questions + Example MPNs

This section is intentionally not a buying guide. It is a practical checklist of “must-ask questions” per function block, plus illustrative MPN examples to anchor the discussion. Always validate derating, SOA, thermal layout, diagnostics, and production test hooks with the latest datasheets and board-level measurements.

Ask · Verify · Log · Recover

Usage tip: For each block, capture (1) protection/telemetry requirements, (2) bus/interface constraints, and (3) the minimum log fields required for postmortems (reason_code, timestamps, snapshots, config_version, actor for writes).

A) eFuse / High-Side Switch (branch protection core)

Two common approaches exist: integrated eFuse (fast integration) or controller + external MOSFET (SOA/thermal flexibility).

Selection dimensions: VIN/VBUS range, continuous current, surge/peak + SOA, limit mode (constant/foldback/fast-trip), short-circuit response, reverse blocking, OV/UV/OT, dv/dt ramp, telemetry & diagnostics, retry/lockout policy.
Must-ask questions:
- Is SOA specified with pulse conditions representative of branch inrush and short fixtures?
- What is the limit behavior (constant vs foldback) and how does it impact dissipation during stalls/overloads?
- How are faults differentiated (OCP vs SC vs OT vs UV/OV vs reverse)? Is there a latched last-fault register?
- Is retry policy configurable (attempt count, backoff, lockout, manual clear)? What is the default after a watchdog reset?
- What is the measurable response time from fault onset to limit/trip under worst-case wiring inductance?
Example MPNs (illustrative):
- Texas Instruments TPS2662 / TPS2663 (integrated eFuse class, high-voltage range family)
- Texas Instruments LM5069 (hot-swap / inrush controller class, external MOSFET)
- Texas Instruments LM25066A (hot-swap + current monitor + PMBus/SMBus telemetry class)
- Infineon PROFET™ high-side switch families (12V/24V domain branch switching class; family selection depends on load)
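The retry-policy question (attempt count, backoff, lockout) can be made concrete with a small model like the one below. It is a sketch under assumed parameters; real devices expose these settings differently, and the function names are hypothetical.

```python
def retry_delays(max_attempts, base_delay_s, factor=2.0):
    """Exponential backoff delays, one per allowed retry attempt."""
    return [base_delay_s * factor ** n for n in range(max_attempts)]


def next_state(attempt_count, max_attempts):
    """After a failed attempt, either retry again or latch into lockout."""
    return "lockout" if attempt_count >= max_attempts else "retry"
```

With three attempts and a 0.5 s base delay, the channel waits 0.5 s, 1 s, and 2 s before latching into lockout and requiring a manual clear.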

B) Current / Voltage Monitoring (trustworthy observability)

Selection dimensions: accuracy + drift, common-mode range, input transient behavior (saturation + recovery), sample rate/bandwidth, alert comparators (threshold granularity, hysteresis, debounce), interface (I²C/SMBus/PMBus), multi-channel needs.
Must-ask questions:
- Under bus spikes/droops, does the monitor saturate? If yes, what is the recovery time?
- Can alerts be qualified (time debounce) and do they support hysteresis to prevent alarm storms?
- Is peak vs average reporting supported (or can firmware derive it without aliasing)?
- What calibration hooks exist (offset/gain trim, temperature compensation strategy)?
Example MPNs (illustrative):
- Texas Instruments INA238 (precision current/power monitor class)
- Texas Instruments INA233 (current/power monitor class, SMBus compatible)
- Texas Instruments INA3221 (3-channel shunt/bus monitor class)
- Analog Devices LTC2946 (coulomb/energy monitor class)
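To illustrate what "qualified" alerts with hysteresis mean in practice, the sketch below models an over-threshold alert with a sample-count debounce and a separate clear threshold. The class name and the choice to clear without debounce are assumptions for illustration, not a description of any particular monitor IC.

```python
class QualifiedAlert:
    """Over-threshold alert with set debounce and clear hysteresis."""

    def __init__(self, set_threshold, clear_threshold, debounce_samples):
        if clear_threshold > set_threshold:
            raise ValueError("clear threshold must not exceed set threshold")
        self.set_threshold = set_threshold
        self.clear_threshold = clear_threshold
        self.debounce_samples = debounce_samples
        self.active = False
        self._count = 0

    def update(self, value):
        """Feed one sample; return whether the alert is active afterwards."""
        if not self.active:
            if value > self.set_threshold:
                self._count += 1
                if self._count >= self.debounce_samples:
                    self.active = True
                    self._count = 0
            else:
                self._count = 0  # any sample below threshold restarts the debounce
        elif value < self.clear_threshold:
            self.active = False
        return self.active
```

A single spike does not raise the alert, and a reading hovering between the two thresholds does not toggle it, which is what prevents alarm storms.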

C) Sensors & Aggregation (temp/humidity/door/leak + mux/ADC)

Selection dimensions: sensor fault detectability (open/short/stuck), debounce/hysteresis strategy, multiplexing scale (channels, address conflicts), bus recovery (hung bus isolation/reset), ADC channels/resolution, input protection for long sensor leads.
Must-ask questions:
- How are sensor wiring faults represented (explicit fault flags vs “invalid readings”)?
- Does the mux/buffer allow isolating a misbehaving branch device without taking the whole bus down?
- What is the recommended bus pull-up / capacitance limit for the intended cable lengths inside the rack?
Example MPNs (illustrative):
- Texas Instruments TCA9548A (8-channel I²C switch / mux class)
- NXP PCA9548A (8-channel I²C switch / mux class)
- Texas Instruments ADS1115 (I²C ADC class for slow-to-moderate telemetry)
- Sensirion SHT31 (temp/humidity sensor class)

D) Controller (low-power MCU vs small management SoC)

Selection dimensions: watchdog + reset domains, brownout behavior, non-volatile event storage hooks, bus peripherals (I³C/I²C/SMBus/PMBus), secure update hooks (versioning + rollback path), and deterministic recovery after faults.
Must-ask questions:
- During brownouts, can the design guarantee last-fault capture (reason_code + timestamp + snapshot)?
- Is a robust update/rollback mechanism feasible (dual image / safe fallback), and can updates be audited (actor/request_id)?
- After a watchdog reset, what is the default policy for branch enables and the OOB/alarm domain?
Example MPNs (illustrative):
- STMicroelectronics STM32H7 / STM32G4 (MCU class)
- NXP LPC55S69 (MCU class)
- Microchip SAME54 (MCU class)
- ASPEED AST2600 (management SoC / BMC-class device; use only for “hooks” here, not a deep dive)

E) Uplink communications (Ethernet PHY / isolation + optional RS-485)

Selection dimensions: PHY robustness, ESD/EMI tolerance strategy (including magnetics/isolation boundary), link-down behavior, offline logging/buffering strategy, and optional long-line fallback (RS-485).
Must-ask questions:
- If uplink drops, can the system continue logging locally and backfill events after recovery?
- Where is the isolation boundary (cross-domain/long cable/common-mode noise), and does isolation preserve management continuity?
- For RS-485 fallback, what is the minimum control+alarm set that must remain available during degraded operation?
Example MPNs (illustrative):
- Texas Instruments DP83867 (Gigabit Ethernet PHY class)
- Microchip KSZ9131 (Gigabit Ethernet PHY class)
- Texas Instruments SN65HVD1781 (RS-485 transceiver class)
- Analog Devices ADM2587E (isolated RS-485 transceiver class)

Checklist item	Option A (fill in)	Option B (fill in)	Option C (fill in)
SOA / surge margin	TBD	TBD	TBD
Limit behavior	CONST / FOLD / TRIP	CONST / FOLD / TRIP	CONST / FOLD / TRIP
Reverse blocking	OK / N/A	OK / N/A	OK / N/A
Diagnostics richness	reason_code granularity	reason_code granularity	reason_code granularity
Retry / lockout policy	attempts + backoff	attempts + backoff	attempts + backoff
Telemetry interface	I²C / SMBus / PMBus	I²C / SMBus / PMBus	I²C / SMBus / PMBus
Production test hooks	self-test + logs	self-test + logs	self-test + logs

Figure F9 — Reusable selection radar + comparison table template

Reuse F9 across pages: keep the radar axes stable (SOA, speed, telemetry, programmability, reverse block, thermal headroom) and fill the compare table with verified, test-backed answers.

H2-12 · FAQs ×12

FAQs: Micro-DC Rack Branch Protection, Telemetry, Env Linkage, and OOB Hooks

These FAQs stay strictly inside the Micro-DC Rack boundary: branch eFuse/high-side switching, telemetry + event logs, environment/door/tamper linkage, and OOB management hooks.

Scope Guard: Allowed—branch protection (eFuse/high-side), branch telemetry/logs, env/door/tamper linkage, OOB permission boundaries. Banned—PSU/UPS/ATS internals, facility distribution/codes, AC PDU deep dive, crypto implementation details.

Figure F10 — FAQ map: “Symptoms → Checks → Knobs” within Micro-DC Rack scope

Use the same “Symptoms → Fast Checks → Knobs” pattern for all 12 FAQs, without leaving the Micro-DC Rack scope.

What is the practical engineering boundary between a Micro-DC Rack and a traditional rack PDU?

Maps to: H2-2 / H2-3

A Micro-DC Rack focuses on in-rack DC branch switching, electronic protection, and diagnostic-grade telemetry. Its core deliverable is a closed loop: branch enable/disable, fast fault response, and actionable logs (reason_code + timestamps + snapshots) exposed through an OOB control plane. It does not describe upstream power conversion or facility distribution.

Why can an eFuse trip frequently even when “average current looks small”?

Maps to: H2-5 / H2-9

“Current looks small” is often a measurement artifact. Common triggers include inrush spikes masked by filtering, di/dt-induced voltage spikes from wiring inductance, shunt thermal drift, and noise coupling into sense/enable pins. Start with last reason_code, trip_time, and a pre/post snapshot (Vbus, Ibranch, Tchannel) before adjusting thresholds or debounce.
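A short numeric illustration of the measurement artifact: in the sketch below, a brief inrush spike barely moves the average yet clearly exceeds a trip threshold. The sample values and the 12 A threshold are hypothetical.

```python
from statistics import mean

# Hypothetical branch current: 95 samples at 0.5 A, 5 samples of 18 A inrush.
samples_a = [0.5] * 95 + [18.0] * 5
trip_threshold_a = 12.0

avg_a = mean(samples_a)   # what a slow, filtered readout would report
peak_a = max(samples_a)   # what the protection loop actually sees

average_looks_safe = avg_a < trip_threshold_a
peak_trips = peak_a > trip_threshold_a
```

Here the average is well under 2 A while the peak is 18 A, so a filtered dashboard value and the eFuse comparator can disagree without either being wrong.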

Constant-current vs foldback current limiting—how to choose, and what are the common pitfalls?

Maps to: H2-5

Constant-current limiting holds the output up longer but lets heat accumulate during stalls, which can cause delayed OT trips and repeated retries. Foldback reduces dissipation but can keep capacitive loads from starting: a discharged load looks like a short to the limiter, and the channel can oscillate between limiting and retry. Decide based on fault energy and recovery goals, then pair the mode with a retry/backoff/lockout policy and logs that distinguish OCP vs OT vs SC.
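The dissipation difference between the two modes can be estimated with a simple model: switch dissipation is the voltage across the switch times the limited current. The sketch below assumes a linear foldback characteristic and hypothetical values (48 V bus, 10 A limit, 2 A foldback floor); real parts define foldback differently.

```python
def switch_dissipation_w(v_in, v_out, i_limit):
    """Power dissipated in the pass element while limiting."""
    return (v_in - v_out) * i_limit


def foldback_limit_a(v_out, v_in, i_max, i_min):
    """Linear foldback: limit falls from i_max at v_out == v_in to i_min at 0 V."""
    frac = max(0.0, min(1.0, v_out / v_in))
    return i_min + (i_max - i_min) * frac


# Hard stall (output held at 0 V) on a hypothetical 48 V branch.
p_const_w = switch_dissipation_w(48.0, 0.0, 10.0)
p_fold_w = switch_dissipation_w(48.0, 0.0, foldback_limit_a(0.0, 48.0, 10.0, 2.0))
```

Under these assumptions constant-current limiting dissipates 480 W in the stall versus 96 W for foldback, which is why the mode choice has to be checked against SOA and the retry policy.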

What happens if short-circuit response is too fast or too slow?

Maps to: H2-5 / H2-10

A response that is too fast causes nuisance trips by misclassifying inrush or transient di/dt as a hard short, especially with tight blanking windows. A response that is too slow leaves the switch under high stress for longer, increasing SOA risk and letting bus disturbances spread to other branches. Validate with a controlled short fixture and step loads, and require logs that separate SC vs OCP and record response timing and peak values.

After a remote power-cycle, the load still won’t come up—what codes and timestamps should be checked first?

Maps to: H2-6 / H2-10

Start with last reason_code (UV/OV/OCP/SC/OT/remote-off/lockout), then the trip_time and the most recent attempt_count/backoff state. Next, inspect a pre/post snapshot around the event: Vbus droop, Ibranch peak, channel temperature, and enable state. If timestamps are inconsistent, rely on monotonic time plus a “near-absolute” mapping used by the OOB plane for correlation.
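The monotonic-plus-mapping idea can be sketched as below: an event's monotonic timestamp is converted to an approximate wall-clock time using the last known anchor pair. The function name and anchor values are hypothetical, and the mapping is only meaningful within the boot epoch in which the anchor was taken.

```python
from datetime import datetime, timedelta, timezone


def to_near_absolute(t_mono_s, anchor_mono_s, anchor_utc):
    """Map a monotonic timestamp to wall-clock time via a known anchor pair."""
    return anchor_utc + timedelta(seconds=t_mono_s - anchor_mono_s)


# Hypothetical anchor: the OOB plane last synced at monotonic 1500.0 s.
anchor_utc = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
trip_utc = to_near_absolute(1530.5, 1500.0, anchor_utc)
```

Ordering within the rack should still be decided on the monotonic value; the mapped time is for correlating with other systems.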

How can “critical-domain priority power” be implemented without introducing a new single point of failure?

Maps to: H2-4 / H2-9

Define a minimal critical domain (OOB controller, alarms, essential comms) and give it a dedicated branch with conservative thresholds, default-on behavior, and independent fault handling. Avoid coupling it to noncritical branches through shared enables or fragile bus dependencies. Ensure a watchdog reset preserves the critical domain, and record policy_version + reason_code so “priority actions” remain auditable and testable.

How should door/tamper events link to power actions without causing accidental shutdowns?

Maps to: H2-7 / H2-8

Treat door/tamper primarily as audit and permission signals, not immediate power-cut triggers. Use debounce and hysteresis, detect sensor faults, and escalate actions based on severity: door-open → log + restrict privileged writes; tamper/forced-open → require confirmation or two-step approval for power actions. Shed power only when door/tamper events coincide with additional safety signals (e.g., smoke/leak), to avoid false shutdowns.
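The escalation ladder can be expressed as a pure function so that it is easy to test. The sketch below assumes inputs are already debounced; the action names and the sensor-fault handling are illustrative assumptions.

```python
def door_tamper_actions(door_open: bool, tamper: bool,
                        sensor_fault: bool, safety_alarm: bool) -> list[str]:
    """Map qualified door/tamper inputs to policy actions, mildest first."""
    if sensor_fault:
        # Untrusted sensor data: record it and hold back risky automation.
        return ["log", "restrict_automation"]
    actions = []
    if door_open or tamper:
        actions += ["log", "restrict_privileged_writes"]
    if tamper:
        actions.append("require_two_step_approval")
    if (door_open or tamper) and safety_alarm:
        # Shedding needs an additional safety signal such as smoke or leak.
        actions.append("shed_noncritical")
    return actions
```

A door-open event alone never reaches the shedding step, which is the property that prevents accidental shutdowns.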

What is the most reasonable division of labor for I³C, I²C, and PMBus inside a Micro-DC Rack?

Maps to: H2-8

Use I³C for dense on-board device discovery and higher-rate telemetry where supported, while keeping robust recovery paths for bus faults. Use I²C for broad compatibility with sensors, muxes, and monitors—pay attention to line length, pull-ups, and bus hang isolation. Use PMBus/SMBus semantics for power-oriented devices (limits, faults, telemetry fields), and keep long or noisy runs isolated or bridged through a controller boundary.

If current measurement drifts over time, how to tell shunt heating from layout issues or the sampling chain?

Maps to: H2-9 / H2-10

First check correlation with temperature and load duty cycle: self-heating shifts the shunt resistance and creates thermal gradients that invalidate a calibration taken at a single operating point. Next verify true Kelvin routing and quiet return paths; small ground drops can look like current error. Finally inspect the sampling chain: amplifier input protection, saturation recovery, ADC reference stability, and digital filtering. Confirm with controlled DC test points and a two-point trim strategy, and log the calibration version for traceability.

How should an event log be designed so “who triggered the power action” remains traceable after power loss?

Maps to: H2-6 / H2-8

Require every write action to carry identity context and be logged with actor_id, request_id, policy_version, channel_id, and result status, alongside reason_code and timestamps. Use a ring buffer plus “critical-event snapshot” commits on trip/lockout. After reboot, reconstruct ordering via monotonic time and include the last known mapping to near-absolute time for cross-system correlation, without relying on external services.
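A minimal sketch of the record and the ring-plus-commit pattern follows. The field names mirror the list above; the `EventLog` class, the set of critical reason codes, and the plain list standing in for non-volatile storage are assumptions for illustration.

```python
from collections import deque
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class PowerEvent:
    t_mono: float
    reason_code: str
    channel_id: int
    actor_id: str
    request_id: str
    policy_version: str
    result: str


class EventLog:
    # Reason codes treated as critical here are an illustrative choice.
    CRITICAL = {"SC", "OCP", "OT", "LOCKOUT"}

    def __init__(self, capacity):
        self.ring = deque(maxlen=capacity)  # oldest entries are overwritten
        self.committed = []                 # stand-in for non-volatile storage

    def record(self, event):
        """Append to the ring; commit critical events immediately."""
        self.ring.append(event)
        if event.reason_code in self.CRITICAL:
            self.committed.append(asdict(event))
```

Because critical events are committed at record time, they survive even if the ring wraps or power is lost before the next bulk flush.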

If simultaneous power-up causes bus droop, how should branch staggering be implemented for stability?

Maps to: H2-4 / H2-5

Stagger by grouping branches based on inrush profile and criticality: bring up the control/OOB domain first, then low-inrush loads, then high-capacitive branches. Use soft-start/ramp control and fixed inter-group delays, and avoid synchronizing retries across many channels. Declare success criteria using telemetry: Vbus droop limits, number of OCP entries, retry counts, and “time-to-ready” measured and logged with timestamps.
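The grouping and fixed inter-group delay can be sketched as a schedule generator plus a success check. Group numbering (0 = control/OOB, 1 = low-inrush, 2 = high-capacitive), the zero-tolerance pass criteria, and the function names are hypothetical.

```python
def stagger_schedule(branches, inter_group_delay_s):
    """branches: iterable of (name, group). Returns sorted (t_offset_s, name)."""
    return sorted((group * inter_group_delay_s, name) for name, group in branches)


def bringup_ok(vbus_min_v, vbus_droop_limit_v, ocp_entries, retries):
    """Example pass criteria: bus stayed above the limit with no OCP or retries."""
    return vbus_min_v >= vbus_droop_limit_v and ocp_entries == 0 and retries == 0


schedule = stagger_schedule(
    [("gpu0", 2), ("oob", 0), ("fan", 1), ("gpu1", 2)],
    inter_group_delay_s=0.5,
)
```

Logging each enable with its timestamp lets "time-to-ready" be measured from the same schedule.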

When smoke/leak/high-temperature alarms occur, what is a recommended staged load-shedding policy?

Maps to: H2-7 / H2-4

Use staged shedding with explicit priorities: preserve OOB/alarms, shed noncritical branches first, then progressively restrict high-power loads if the alarm persists. Combine sensor debounce/hysteresis with multi-signal confirmation to prevent false trips, and record linkage in logs (env_event_id → power_action_id → affected_branches). Keep the policy deterministic and testable: each stage must have measurable entry/exit criteria and a clear reason_code so postmortems can prove the decision path.
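A deterministic staged policy can be as small as a table of entry criteria, as in the sketch below. The stage timings, action names, and the `confirmed` flag (standing for multi-signal confirmation) are hypothetical.

```python
# (minimum alarm age in seconds, action) in escalation order; values are examples.
STAGES = [
    (0.0, "shed_noncritical"),
    (30.0, "restrict_high_power"),
]


def shedding_actions(alarm_age_s, confirmed):
    """Return the actions in force for an alarm of the given age."""
    if not confirmed:
        # An unconfirmed single-sensor alarm is recorded but does not shed load.
        return ["log_only"]
    return ["preserve_oob_alarm"] + [a for t, a in STAGES if alarm_age_s >= t]
```

Each returned action would be logged with its own reason_code so a postmortem can replay the decision path stage by stage.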
