Micro-DC Rack: DC Power Distribution, Protection & Telemetry

A Micro-DC Rack is an in-rack DC branch power platform that combines eFuses/high-side switches, telemetry + event logs, and OOB control to make power distribution controllable, protective, and diagnosable.

It keeps critical domains (OOB/alarms) alive while enforcing staged policies for faults and environmental/security events—without entering PSU/UPS or facility power topics.

H2-1 · Scope & Boundary

Scope & Boundary

This page focuses on rack-internal DC distribution and branch-level control: how a Micro-DC Rack uses eFuses/high-side switches, per-branch sensing, event logs, and OOB hooks to make power delivery controllable, protectable, and observable at the branch level.

In scope (what this page goes deep on)

  • DC bus → branch channels: how loads are segmented, switched, and isolated inside a rack.
  • Per-branch protection: OCP/short/OV/UV/OT, SOA limits, reverse-current blocking, inrush control.
  • Per-branch observability: voltage/current/temperature telemetry, fault reason codes, timestamps.
  • Remote operations: power-cycle, staged turn-on, load shedding, “retry vs latch-off” policy.
  • OOB hooks: minimal interfaces for reading telemetry/logs and applying safe, auditable control.

Out of scope (explicitly not covered here)

  • PSU internal topologies (e.g., PFC/LLC), UPS/ATS behavior, or facility wiring codes/standards.
  • AC mains PDU design details beyond a boundary comparison.
  • PCIe/CXL fabrics, ToR interposers, or accelerator interconnects (separate pages).
  • Full BMC SoC architecture and TPM/HSM internals (only referenced as integration dependencies).
Scope Guard (reader view): Branch channels, eFuse/high-side switching, per-branch telemetry and event logs, OOB integration hooks.
Out-of-scope topics are allowed only as one-line boundary references that point to the appropriate sibling page.
Figure F1 — Boundary map: what Micro-DC Rack covers
Diagram summary: in scope are the DC bus (rack-internal distribution, segmentation and grouping), branch channels (eFuse / high-side switch; OCP, short, OT, SOA), telemetry (per-branch V/I/T, thresholds and status, health and drift flags), and the event log with OOB hooks (reason code + timestamp, remote power-cycle, audit-ready actions). Out of scope: PSU internals, UPS/ATS, facility distribution. Sibling pages, referenced only for boundary: CRPS / Server PSU, 48V/12V Hot-Swap, Rack Env & Access.
The boundary is deliberate: upstream power is treated as a DC source black box, while this page goes deep on branch-level safety, control, and diagnostics.
H2-2 · 1-minute Definition

1-minute Definition: What a Micro-DC Rack Solves

Answer block (definition)

A Micro-DC Rack is a rack-internal DC distribution system that splits a DC bus into multiple protected branches, using eFuses/high-side switches to switch and isolate loads and using per-branch sensing to report telemetry and fault events. It enables safe remote operations—power-cycle, staged turn-on, and load shedding—through OOB management hooks with auditable logs and timestamps.

The focus is not “how power is generated,” but how rack-internal DC is distributed, protected, observed, and controlled at branch granularity.

Where it appears (typical deployments)

  • Edge micro-sites / remote racks: limited on-site access, strong need for remote recovery and audit trails.
  • Small private clusters: mixed loads (compute/network/storage) with different criticality levels.
  • Unattended enclosures: environmental or access events must trigger safe, staged power actions.

What it enables (three keywords)

  • Switching: per-branch on/off, power-cycle, staged ramp, group sequencing.
  • Protection: fast short response, controlled current limiting, thermal/SOA enforcement, reverse blocking.
  • Observability: branch V/I/T telemetry + reason codes + timestamps + “before/after” snapshots.

Practical boundary vs “traditional PDU” (conceptual, not AC deep-dive)

  • Granularity: Micro-DC is built for per-branch actions and diagnostics, not only aggregate metering.
  • Diagnostics: “why did it shut off?” is answered by reason code + timestamp, not guesswork.
  • Automation: control is designed to be called by OOB workflows with permissions and audit logs.
Figure F0 — Problem → Capabilities → Outcomes (minimal text)
Diagram summary: three columns. Problems: remote/unattended sites, random trips that are hard to debug, costly manual reboots, one fault impacting the whole rack. Capabilities: per-branch switching (power-cycle, staging, groups), electronic protection (eFuse/high-side, SOA, OT), telemetry + event logs (V/I/T, reason code, timestamp), OOB management hooks (permissions, audit, safe actions). Outcomes: faster root-cause debug, safer fault isolation, auditable remote ops, automation-ready control.
A Micro-DC Rack turns “power issues” into controllable actions and diagnosable events (reason code + timestamp + telemetry snapshot).
H2-3 · System Architecture

System Architecture: Power Path + Management Path

A Micro-DC Rack is defined by two parallel paths: the power path (how DC is distributed and protected) and the management path (how measurements, events, and control decisions flow to OOB networks and platforms). The upstream supply is treated as a DC source black box; the emphasis is rack-internal segmentation, safety, and observability.

Power path (rack-internal)

  • DC Source (black box) → Rack bus / busbar → Branch channels (×N) → Loads.
  • Branch channels provide switching (on/off, power-cycle, staged turn-on) and protection (OCP/short/OT/SOA/reverse block).
  • Loads are organized by criticality into domains (IT / Network / Aux), enabling predictable policies such as load shedding.

Management path (measure → log → act)

  • Sensors (V/I/T + door/tamper) feed aggregation (ADC/MUX and low-speed buses).
  • A controller enforces safe actions and creates event logs (reason codes + timestamps + snapshots).
  • An OOB uplink exposes branch/domain objects to a platform for alerting, tickets, and audits.

Multi-domain power: a practical organizing model

  • IT domain: compute loads that may tolerate staged power but require stable recovery procedures.
  • Network domain: connectivity-critical loads often kept at higher priority during shedding.
  • Aux domain: sensing, access, indicators, and OOB control power kept alive for diagnosis and recovery.
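The domain model above can be sketched as a priority-ordered shedding routine. This is a minimal illustration, not a reference implementation: the `Branch` type, the `SHED_ORDER` mapping, and the power budget are hypothetical names introduced here, and the only rules taken from the text are that IT loads are shed before Network loads and that the Aux domain is never shed.

```python
from dataclasses import dataclass


@dataclass
class Branch:
    name: str
    domain: str       # "IT", "Network", or "Aux"
    power_w: float    # most recent measured branch power
    on: bool = True


# Hypothetical shed order: lower number is shed first. Aux is absent on
# purpose, so OOB/alarm power is never a shedding candidate.
SHED_ORDER = {"IT": 0, "Network": 1}


def shed_to_budget(branches, budget_w):
    """Switch off branches in priority order until the total fits the budget.

    Returns the names of the branches that were shed, in order.
    """
    total = sum(b.power_w for b in branches if b.on)
    candidates = sorted(
        (b for b in branches if b.on and b.domain in SHED_ORDER),
        key=lambda b: (SHED_ORDER[b.domain], -b.power_w),
    )
    shed = []
    for b in candidates:
        if total <= budget_w:
            break
        b.on = False
        total -= b.power_w
        shed.append(b.name)
    return shed
```

Within a domain the sketch sheds the largest load first, which is one possible tie-break; a real policy might prefer an explicit per-branch priority.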
Design anchor: troubleshooting starts from event logs (reason code + timestamp) and is validated with telemetry snapshots (V/I/T before and after). This prevents “average power looks normal” from masking the real trigger.
Power (solid) Control (dashed) Telemetry (dotted)
Figure F1 — Micro-DC Rack block diagram (power, control, telemetry)
Diagram summary: the power path runs DC source (black box) → rack bus / busbar → branch channels ×N (each a switch plus current sense) → load domains (IT: medium priority; Network: high priority; Aux: keep-alive). The management path runs sensors (V/I/T, door/tamper, env T/H) → aggregation (ADC, MUX, I³C/I²C, PMBus/SMBus) → controller (MCU / mini-BMC; policy + safety) with a local event log (reason code + timestamp, snapshot buffer) → OOB uplink (Ethernet / serial; auth + audit) → platform (alerts, tickets, audit).
The diagram separates the power path from the management path. Branch channels (×N) are the control and diagnostic unit exposed to OOB platforms.
H2-4 · Power Distribution Strategy

Power Distribution Strategy (Rack-internal, not PSU topology)

Distribution strategy is expressed as bus choices, two-level protection boundaries, and remote operation policies. The goal is predictable rack behavior: one faulty branch should be isolated quickly, while critical domains and OOB diagnostics remain available.

Bus voltage trade-offs (effects inside the rack)

  • Higher bus voltage reduces current for the same power, easing conductor loss and busbar size.
  • Device stress shifts: branch switches must tolerate higher voltage transients while keeping adequate SOA margin.
  • Measurement implications: lower current reduces shunt heating; higher voltage increases the need for robust input protection and divider accuracy.
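The first bullet follows from I = P / V and P_loss = I² · R. A small worked sketch makes the scaling concrete; the 1200 W load and 2 mΩ path resistance are illustrative assumptions, not values from this page.

```python
def bus_current_and_loss(power_w, bus_v, path_resistance_ohm):
    """Return (bus current in A, conduction loss in W) for a given load power."""
    current_a = power_w / bus_v
    loss_w = current_a ** 2 * path_resistance_ohm
    return current_a, loss_w


# Same load and same conductor at two bus voltages.
i_12, loss_12 = bus_current_and_loss(1200.0, 12.0, 0.002)   # about 100 A, 20 W
i_48, loss_48 = bus_current_and_loss(1200.0, 48.0, 0.002)   # about 25 A, 1.25 W
```

Raising the bus voltage by 4× cuts current by 4× and conduction loss in the same conductor by 16×, which is why shunt heating and busbar sizing ease at higher bus voltages.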

Two-level protection (avoid “one fault takes the rack”)

  • Level 1 — Bus-level protection: a rack-wide guardrail and last-resort cutoff (referenced only).
  • Level 2 — Branch-level electronic protection: the primary fault isolator (OCP/short/OT/SOA/reverse).
  • Rule of thumb: branch channels should trip first and log the reason; bus-level protection is a rare fallback.

Remote operation policies (turn actions into predictable behavior)

  • Power-cycle: define minimum off-time, retry budget, and “latch-off” conditions for repeated faults.
  • Load shedding: prioritize domains so Aux/OOB stays alive for diagnosis and recovery.
  • Staged turn-on: group branches and apply stagger intervals to reduce bus droop and false trips.
Policy must be observable: each action should leave an auditable trace (who/what/when/why/result) and correlate to a branch’s event log timeline.
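Staged turn-on can be expressed as a simple schedule: groups are enabled in priority order, with a stagger between branches and a longer gap between groups. The sketch below is one possible shape under those assumptions; the function name, parameters, and example intervals are hypothetical and would need tuning against real bus droop measurements.

```python
def staged_turn_on_schedule(groups, stagger_s, group_gap_s):
    """Build a list of (time_offset_s, branch_id) enable times.

    groups: lists of branch ids, highest-priority group first.
    stagger_s: delay between consecutive branches inside a group.
    group_gap_s: delay between the last branch of one group and the
                 first branch of the next.
    """
    schedule = []
    t = 0.0
    for group_index, group in enumerate(groups):
        if group_index > 0:
            t += group_gap_s
        for branch_index, branch_id in enumerate(group):
            if branch_index > 0:
                t += stagger_s
            schedule.append((t, branch_id))
    return schedule


# Example: keep-alive domain first, then network, then compute.
plan = staged_turn_on_schedule(
    [["oob"], ["net0", "net1"], ["it0", "it1"]],
    stagger_s=0.5,
    group_gap_s=2.0,
)
```

A controller would walk this schedule and log each enable as an auditable action, so a later UV trip can be correlated with the branch that was starting at that moment.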
Figure F2 — Branch strategy state machine (minimal text)
Diagram summary: branch states are ON (normal), LIMIT (current limiting), TRIP (latched off), RETRY (auto re-enable), COOL (thermal cooldown), and OFF (remote off). Transition labels: OCP, recover, short, OT, cooldown, retry, remote. The state machine is a policy surface: it defines predictable behavior for overload, short, thermal, and remote operations.
The branch channel is the control unit: policy defines whether overload recovers, short latches off, thermal cools down, and remote actions are auditable.
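The state machine can be captured as a transition table. The sketch below is one hypothetical reading of the figure: the state names come from the diagram, but the event names and the exact edges (for example, that COOL hands off to RETRY) are assumptions made for illustration.

```python
from enum import Enum


class State(Enum):
    ON = "ON"
    LIMIT = "LIMIT"
    TRIP = "TRIP"
    RETRY = "RETRY"
    COOL = "COOL"
    OFF = "OFF"


# (current state, event) -> next state. Assumed edges, not a vendor spec.
TRANSITIONS = {
    (State.ON, "OCP"): State.LIMIT,           # overload enters current limiting
    (State.LIMIT, "RECOVER"): State.ON,       # overload clears
    (State.ON, "SHORT"): State.TRIP,          # hard short latches off
    (State.LIMIT, "SHORT"): State.TRIP,
    (State.ON, "OT"): State.COOL,             # over-temperature forces cooldown
    (State.LIMIT, "OT"): State.COOL,
    (State.COOL, "COOLDOWN_DONE"): State.RETRY,
    (State.TRIP, "RETRY"): State.RETRY,       # policy-approved re-enable
    (State.RETRY, "RECOVER"): State.ON,
    (State.RETRY, "SHORT"): State.TRIP,
    (State.OFF, "REMOTE_ON"): State.ON,
}


def step(state, event):
    """Return the next state; unknown (state, event) pairs leave the state unchanged."""
    if event == "REMOTE_OFF":
        return State.OFF  # a remote-off is accepted from any state
    return TRANSITIONS.get((state, event), state)
```

Keeping the policy in a table makes it easy to log the (state, event, next state) triple with every transition and to review the policy as data.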
H2-5 · Branch Protection Core

Branch Protection Core: eFuse / High-Side Switch (Deep Dive)

A branch channel is the unit that turns rack-internal DC into a controllable and diagnosable service. It must (1) deliver current safely under normal load, (2) isolate faults fast without collapsing the bus, and (3) explain every action using reason codes, timestamps, and snapshots. Topics are limited to DC-bus internal behavior (no facility/AC events).

What a branch channel must guarantee

  • Controlled power: on/off, power-cycle, staged turn-on, and group policies.
  • Electronic protection: OCP/short/OT/OV/UV with SOA-aware behavior.
  • Isolation by design: one branch fault should not drag the whole rack bus down.
  • Forensics: every trip produces a reason code + timestamp + “before/after” telemetry snapshots.

Mechanism: protection loops inside the branch

  • Current path: switch FET(s) + sense element (shunt or RDS(on)) feed fast detection and measurement.
  • Control path: gate control enforces soft-start and a programmable current-limit profile.
  • Thermal path: temperature sensing triggers derating, cooldown, or latch-off depending on policy.
  • Stateful recovery: retry budget, backoff, and latch-off conditions prevent endless oscillation.

Field pitfalls (why it “trips wrong” or “fails to trip”)

  • dI/dt + parasitic L causes bus or switch-node spikes that look like OV/UV or false short events.
  • Sense placement & drift: hot shunt or poor Kelvin routing biases current measurement and thresholds.
  • Capacitive loads + sequencing: inrush holds the switch in current limit long enough to cause thermal stress and late OT trips.
  • Concurrent turn-on: multiple branches start together → bus droop → UV cascades; mitigate with stagger and priority domains.
Recoverability policy (minimum set): retry budget + backoff, explicit latch-off conditions, manual clear workflow, and a black-box record containing reason code, timestamps, and snapshots.
A “retry loop with no evidence” increases downtime; a “latch-off with evidence” reduces support time.
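A retry budget with backoff and explicit latch-off conditions can be reduced to one decision function. This is a sketch under assumed defaults: the budget of 3, the 1 s base delay, exponential doubling, and treating a short circuit ("SC") as an immediate latch-off are illustrative choices, not requirements stated on this page.

```python
def next_retry_action(reason, retry_count, retry_budget=3,
                      base_backoff_s=1.0, latch_reasons=("SC",)):
    """Decide what to do after a trip.

    Returns ("RETRY", delay_s) or ("LATCH_OFF", None). A latched branch
    stays off until the manual clear workflow is completed.
    """
    if reason in latch_reasons or retry_count >= retry_budget:
        return ("LATCH_OFF", None)
    # Exponential backoff: 1 s, 2 s, 4 s, ... with the default base.
    return ("RETRY", base_backoff_s * 2 ** retry_count)
```

The caller would record the decision, the retry counter, and the reason code in the event log so that a retry sequence is visible as evidence rather than as silent oscillation.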

Key specification checklist (spec → impact → knob → field symptom)

| Spec item | What it controls | Engineering knob | Field symptom if wrong |
| --- | --- | --- | --- |
| VIN / VBUS range | Operating margin vs DC-bus transients | OV/UV thresholds + deglitch + safe derating | Random resets during load steps; unexplained UV/OV events |
| ICONT / IPEAK | Thermal stability and startup envelope | Stagger, soft-start, inrush shaping | Stable idle but trips during boot or simultaneous start |
| SOA / short energy | Whether the switch survives worst-case faults | Fast short detect + limit profile + timeout | FET damage despite “not that high” steady current |
| Limit mode | Fault containment vs heating | Constant-current vs foldback vs fast-trip | Either nuisance trips (too aggressive) or overheating (too soft) |
| Short response time | Bus stability and device survival | Blanking/deglitch tuned to real parasitics | Does not trip on hard short, or false short on fast load steps |
| Reverse blocking | Domain isolation and backfeed prevention | Back-to-back FET behavior + reverse thresholds | Unexpected cross-domain coupling; “ghost power” paths |
| OV/UV + debounce | Cascading trips during bus droop/spike | Debounce windows + staged retry policies | Rack-wide oscillation: trip → recover → trip loops |
| Thermal sensing | Preventing long-tail failures | Cooldown vs latch-off; derating thresholds | Late OT trips after minutes; performance derates are invisible |
| Current accuracy + drift | Threshold truthfulness over temperature | Kelvin sense, calibration, conservative margin | “Doesn’t look high” but trips, or never trips until damage |
| Telemetry bandwidth | Debug resolution vs noise + storage | Low-rate trends + event snapshots | Either too noisy to trust, or too sparse to explain a trip |
| Reason code model | Forensic explainability | Primary + flags, consistent versioning | “Trip happened” with no actionable why/how/when |
Figure F3 — Single-branch eFuse channel internals (protection + observability)
Diagram summary: three blocks. Power stage: BUS IN → switch FET → current sense (shunt / RDS(on)) → LOAD OUT, with gate control (slew / soft-start). Protection loops: OCP, short, OT (temp sensor), reverse block. Observability and interface: ADC (V/I/T), registers, reason code (primary + flags; trip / retry / remote / OT), snapshot buffer (pre/post samples), event log ring, FAULT/IRQ pin, management bus (I²C / I³C).
The branch channel combines fast protection loops with an observability plane (ADC + registers + reason codes + snapshots). Every trip becomes diagnosable evidence.
H2-6 · Telemetry & Event Log

Telemetry & Event Log: Make “Observability” Debuggable

Observability is useful only when it answers: which branch, why, when, and what changed before/after. This section defines the minimal telemetry objects and event-log design needed for reliable troubleshooting in a Micro-DC Rack.

Telemetry objects (by layer)

  • Bus: Vbus (and optionally Ibus) to detect droop and operating margin.
  • Branch: Ibranch, channel temperature, switch state, and fault flags.
  • Env / Access: temperature/humidity plus door/tamper events for correlation.

Status & configuration provenance (avoid “unknown changes”)

  • State: ON/OFF/LIMIT/TRIP/COOL with last transition cause.
  • Reason code: primary cause + flags (e.g., OCP with UV flag).
  • Retry: retry counter, last retry time, and latch-off conditions.
  • Config version: threshold/policy version ID used at the time of an event.

Event-log “three-piece set” (dictionary · timestamps · persistence)

  • Event dictionary: OCP/SC/OT/UV/OV/REMOTE_OFF/DOOR_OPEN as a versioned list.
  • Timestamp strategy: local monotonic time + platform-aligned approximate time.
  • Buffer & power-loss behavior: ring buffer + critical event snapshots (minimum viable).
Low-rate trends · Event-driven snapshots · Short burst capture
A practical trade-off: keep low-rate telemetry for trends, capture high-resolution evidence only around events (pre/post snapshots). This reduces bandwidth and storage while preserving root-cause explainability.
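One way to implement pre/post snapshots is a small ring buffer that always holds the most recent samples and is frozen when an event fires. The sketch below assumes samples arrive at a fixed rate and that overlapping events are ignored while a capture is in progress; the class and method names are hypothetical.

```python
from collections import deque


class SnapshotCapture:
    """Keep the last pre_n samples; on an event, also collect post_n samples."""

    def __init__(self, pre_n, post_n):
        if pre_n < 1 or post_n < 1:
            raise ValueError("pre_n and post_n must be at least 1")
        self._pre = deque(maxlen=pre_n)   # ring buffer of recent samples
        self._post_n = post_n
        self._pending = None              # capture currently being filled
        self.records = []                 # completed event records

    def add_sample(self, sample):
        if self._pending is not None:
            self._pending["post"].append(sample)
            if len(self._pending["post"]) >= self._post_n:
                self.records.append(self._pending)
                self._pending = None
        self._pre.append(sample)

    def trigger(self, event_type):
        # Simplification: a second event during an active capture is dropped.
        if self._pending is None:
            self._pending = {
                "event": event_type,
                "pre": list(self._pre),
                "post": [],
            }
```

Only the frozen windows need to reach persistent storage, which is how the bandwidth and storage savings described above are obtained.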

Minimal event record fields (field → purpose → pitfall)

| Field | Purpose | Common pitfall if missing |
| --- | --- | --- |
| event_type | Normalize causes (OCP/SC/OT/UV/REMOTE_OFF/DOOR_OPEN) | Different faults look identical; automation can’t route tickets |
| source_id | Bind to branch/domain/bus object | Cannot answer “which branch” quickly |
| t_mono | Reliable ordering on-device (monotonic) | Events reorder after reboots or clock changes |
| t_approx | Approximate absolute time aligned to platform | Hard to correlate with door/env alerts and platform logs |
| reason_code | Primary cause + flags (e.g., OCP with UV flag) | Only a “trip happened” statement, not actionable |
| pre_snapshot | State + V/I/T immediately before the event | No evidence for whether it was droop, inrush, or drift |
| post_snapshot | State + V/I/T immediately after the event | Cannot confirm recovery or thermal consequences |
| retry_count | Expose oscillation and policy behavior | Hidden “retry storms” waste time and mask real faults |
| config_version | Record threshold/policy revision in effect | “It used to work” becomes un-debuggable after config changes |
| actor (optional) | Mark remote/manual actions for audit trails | Remote-off events look like faults; accountability is lost |
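The record fields above map naturally onto a small immutable structure. The sketch below uses the field names from the table; the types, the use of Python's `time.monotonic()` for `t_mono`, and `time.time()` for `t_approx` are assumptions standing in for whatever clocks a real controller provides.

```python
import time
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class EventRecord:
    event_type: str          # e.g. "OCP", "SC", "OT", "UV", "REMOTE_OFF"
    source_id: str           # branch/domain/bus object identifier
    reason_code: str         # primary cause plus flags
    pre_snapshot: dict       # state + V/I/T just before the event
    post_snapshot: dict      # state + V/I/T just after the event
    retry_count: int
    config_version: str      # threshold/policy revision in effect
    actor: Optional[str] = None   # set only for remote/manual actions
    # Monotonic time gives reliable ordering; wall-clock time is only an
    # approximation used for correlation with platform logs.
    t_mono: float = field(default_factory=time.monotonic)
    t_approx: float = field(default_factory=time.time)


example = EventRecord(
    event_type="OCP",
    source_id="branch/3",
    reason_code="OCP+UV",
    pre_snapshot={"state": "LIMIT", "v": 47.1, "i": 12.4, "t": 71.0},
    post_snapshot={"state": "TRIP", "v": 48.0, "i": 0.0, "t": 72.5},
    retry_count=1,
    config_version="v7",
)
```

A monotonic counter normally restarts at boot, so a real device would likely pair it with a boot counter to keep ordering across reboots; that detail is omitted here.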
Figure F4 — Event timeline and correlation (branch + env/access)
Diagram summary: timeline from t0 to t5. Ibranch: baseline → step up → limit → drop → off → on. State: ON → LIMIT → TRIP → ALERT → REMOTE → ON. Events: load step, OCP, OT flag, remote reset. Correlation lane: door open, env temp high. Log packet: reason + t_mono + snapshots.
Debuggable observability relies on event ordering (monotonic timestamps), platform correlation (approximate time), and pre/post snapshots that explain why the branch transitioned from LIMIT to TRIP.
H2-7 · Env/Security Sensors

Env/Security Sensors: Power-Aware Policies (Logic + Interfaces)

Environmental and security signals are not “extra dashboards” in a Micro-DC Rack—they become policy inputs that shape branch actions (derate, shed, lockout) and create auditable evidence. This section focuses on linkage logic, qualification (debounce/hysteresis), and event-log binding.

Sensor scope (kept intentionally narrow)

  • Thermal / humidity: temp, humidity (trend + threshold events).
  • Door / tamper: door-open, chassis tamper, service-panel open.
  • Smoke / leak: presence events only (severity mapping is policy-driven).
  • Interface: sensors feed the same policy engine that can actuate branch/group power states.

Power-side action primitives (what “linkage” can actually do)

  • Branch actions: remote-off, lockout (manual clear), derate, delayed retry/cooldown.
  • Group actions: load shedding by priority (non-critical first), staged restore.
  • Preserve domains: keep OOB + alarm domain alive even during emergency shedding.
  • Evidence capture: trigger pre/post snapshots and annotate the event log.

Policy examples (trigger → qualify → act → record → clear)

  • Door open → debounce → write audit event + restrict high-risk actions (e.g., threshold changes) → record actor/request_id → clear when door closed + cool-down window.
  • Smoke / leak → qualify + severity map → shed non-critical group first; preserve OOB/alarm → record “policy_shed” reason code + snapshots → clear requires manual confirmation.
  • Temp high → hysteresis → derate group; if sustained, shed by priority → record “thermal_policy” + config_version → clear after sustained safe temp (with hysteresis).

Data trust (do not treat noise as emergencies)

  • Debounce: door/tamper chatter and intermittent contacts must not cause oscillation.
  • Hysteresis: avoid repeated shed/restore around a single threshold.
  • Sensor fault detect: open/short/drift should raise sensor_fault and switch to a safe degraded policy.
  • Event binding: every policy action must reference the sensor event_id and snapshots for postmortems.
Minimal linkage rule: sensor events never directly “cut power” without qualification. The policy engine decides severity and maps it to branch/group actions while preserving OOB + alarm visibility.
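Qualification is usually a combination of debounce for contact inputs and hysteresis for analog thresholds. The two helpers below are a minimal sketch of those techniques; the class names, the sample-count debounce, and the example thresholds are illustrative assumptions.

```python
class Debounce:
    """Accept a new boolean state only after N consecutive differing samples."""

    def __init__(self, required_count, initial=False):
        self.required_count = required_count
        self.state = initial
        self._count = 0

    def update(self, raw):
        if raw == self.state:
            self._count = 0          # chatter resets the counter
        else:
            self._count += 1
            if self._count >= self.required_count:
                self.state = raw
                self._count = 0
        return self.state


class Hysteresis:
    """Assert at on_threshold, release only at the lower off_threshold."""

    def __init__(self, on_threshold, off_threshold):
        if off_threshold >= on_threshold:
            raise ValueError("off_threshold must be below on_threshold")
        self.on_threshold = on_threshold
        self.off_threshold = off_threshold
        self.active = False

    def update(self, value):
        if not self.active and value >= self.on_threshold:
            self.active = True
        elif self.active and value <= self.off_threshold:
            self.active = False
        return self.active
```

With `Hysteresis(45.0, 40.0)`, a temperature hovering around 44 °C neither asserts nor releases the alarm repeatedly, which avoids shed/restore oscillation around a single threshold.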

Policy template table (copyable operating model)

| Sensor event | Severity | Power action | Preserve domain | Log requirement | Clear condition |
| --- | --- | --- | --- | --- | --- |
| Door open | L1 | Audit + restrict risky controls (threshold changes / bulk cycles) | OOB + alarm | event_type=DOOR_OPEN, actor, request_id, t_mono | Door closed + debounce + short cool-down |
| Temp high | L2 | Derate → if sustained, shed by priority group | OOB + alarm | event_type=TEMP_HIGH, config_version, snapshots | Safe temp sustained (with hysteresis) |
| Smoke / leak | L3 | Shed non-critical first; lockout restore until manual confirm | OOB + alarm | event_type=SMOKE/LEAK, reason_code=POLICY_SHED, snapshots | Manual clear + follow-up safe-check |
| Sensor fault | L2 | Degraded policy (limit risky automation), raise alarm | OOB + alarm | event_type=SENSOR_FAULT, sensor_id, fault_mode | Sensor recovers + validation window |
Figure F5 — Sensor → policy → branch/group actions (priority decision tree)
Diagram summary: sensors (temp/humidity, door/tamper, smoke, leak) → qualification (debounce, hysteresis, open/short fault detect) → priority ladder (L1: audit + restrict; L2: derate + shed; L3: shed + lockout) → actions (derate, shed, lockout with manual clear) while keeping OOB + alarm alive. Event log binding: event_type + reason + t_mono, snapshots + config_version.
Sensor events are qualified first (debounce/hysteresis/fault-detect), then mapped to a priority ladder (L1–L3) that produces branch/group actions and writes evidence-rich event records.
H2-8 · OOB Management Hooks

OOB Management Hooks: Where the BMC Connects—and Where It Stops

This section defines the minimum OOB closed loop for a Micro-DC Rack: read telemetry, apply safe branch/group controls, pull evidence logs, and trigger updates—without diving into BMC SoC internals. The goal is a clean boundary: object model + privilege gates + audit fields.

Minimum OOB loop (four capabilities)

  • Read: bus/branch/env telemetry with clear object identifiers.
  • Control: branch on/off, load-shed by group, policy/threshold application with versioning.
  • Evidence: pull event logs + snapshots for postmortems and ticket attachments.
  • Update hook: trigger firmware update workflows and report status (no deep implementation).

Platform responsibilities (beyond the rack)

  • Alert routing: map events to severities and destinations (NOC, paging, ticketing).
  • Ticket workflows: attach log packets and snapshots; enforce runbooks.
  • Audit: immutable trails for who changed what, when, and why.
  • Fleet policy: batch rollouts of thresholds and policies with staged validation.

Bus/protocol boundary (engineering allocation)

  • I²C / I³C: local management bus for configuration + moderate-rate telemetry.
  • SMBus / PMBus: power-oriented objects (telemetry + configuration) with consistent semantics.
  • RS-485 (optional): longer reach / stronger noise immunity for slower control/monitor paths.
  • Boundary rule: protocols are tools—object ownership and privilege gating define the system boundary.

Security hooks (hooks only; implementation lives elsewhere)

  • Authentication: every write/control request is attributable to an actor/session.
  • RBAC: read-only vs operator vs security admin; risky actions require elevated rights.
  • Update safety: anti-rollback hook and rollback-safe state reporting.
  • Non-repudiation: audit fields in logs (actor, request_id, approval_state, config_version).

Privilege matrix (action → role → audit requirement)

| Action | Role | Scope | Audit requirement | Safety gate |
| --- | --- | --- | --- | --- |
| Read telemetry | Read-only | Rack / group / branch | request_id, t_mono (optional) | None |
| Pull logs & snapshots | Operator | Rack / branch | actor, request_id, time range | Rate limit |
| Remote off (single branch) | Operator | Branch | actor, reason, t_mono, branch_id | Optional confirm |
| Power-cycle (non-critical group) | Operator | Group | actor, request_id, group_id, snapshots | Cooldown window |
| Change thresholds / policies | Security admin | Group / rack | actor, config_version, approval_state | Two-step apply |
| Firmware update trigger | Security admin | Controller / rack | actor, package_id, anti-rollback status | Staged rollout |
| Clear latch-off / unlock | 2-person approve | Branch / group | two actors, reason, t_mono, snapshots | Dual confirmation |
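The matrix can be enforced with a small gate that checks role and approver count before any write reaches a branch. The sketch below makes several assumptions that the table does not state: roles are treated as hierarchical, the two approvers for a latch-off clear need at least operator rights, and all identifiers (`ACTION_RULES`, `authorize`, action keys) are hypothetical.

```python
ROLE_RANK = {"read_only": 0, "operator": 1, "security_admin": 2}

# action -> (minimum role, number of distinct approvers required)
ACTION_RULES = {
    "read_telemetry": ("read_only", 1),
    "pull_logs": ("operator", 1),
    "remote_off": ("operator", 1),
    "power_cycle_group": ("operator", 1),
    "change_thresholds": ("security_admin", 1),
    "firmware_update": ("security_admin", 1),
    "clear_latch_off": ("operator", 2),   # assumed role for both approvers
}


def authorize(action, approvers, request_id):
    """Check a request against the matrix.

    approvers: list of (actor_name, role) tuples.
    Returns (allowed, audit_record); the audit record is produced for
    denied requests as well, so refusals are also traceable.
    """
    min_role, needed = ACTION_RULES[action]
    qualified = {
        name for name, role in approvers
        if ROLE_RANK[role] >= ROLE_RANK[min_role]
    }
    allowed = len(qualified) >= needed
    audit = {
        "action": action,
        "request_id": request_id,
        "actors": sorted({name for name, _ in approvers}),
        "allowed": allowed,
    }
    return allowed, audit
```

Using a set of actor names means the same person approving twice does not satisfy the dual-confirmation gate.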
Figure F6 — OOB data flow and permission boundary (read vs write vs update)
Diagram summary: managed objects (bus monitor, branch channels, env/security sensors, event log store with snapshots + reason codes) connect over the management bus (I²C/I³C) and power bus (SMBus) to the rack controller (BMC / MCU) behind an RBAC gate (role / approval). The OOB uplink connects to platform services: alert routing, ticket workflow, audit log (actor + request_id), fleet policy. Roles: read-only, operator, security admin, 2-person approve. Flows are drawn separately for read, write, and update.
The rack controller exposes a minimal OOB surface (read, control, evidence, update hooks). All write/update actions pass through an RBAC gate and produce audit fields (actor, request_id, approval_state, config_version).
Boundary reminder: this page defines hooks and privilege gates. BMC internals (SoC, DDR, network stack) and TPM/HSM implementation are intentionally excluded and should be linked to their dedicated pages.
H2-9 · Hardware Implementation Checklist

Hardware Implementation Checklist: From “Works” to “Production-Stable”

Production stability in a Micro-DC Rack is usually lost in predictable places: current sensing realism, switching thermal paths, rack-internal power integrity, and management-bus robustness. This checklist focuses on the highest-yield details that prevent false trips, missing trips, and telemetry dropouts.

Reduce false OCP · Robust buses · Thermal margin · Clear fail-safe

1) Current sensing (shunt + Kelvin + input protection)

  • Shunt placement: keep the power loop compact; avoid placing the shunt where return current can bypass it.
  • True Kelvin routing: sense traces must be a dedicated pair, symmetric, and kept away from high di/dt nodes.
  • Thermal coupling: expect temperature gradients; reduce drift by keeping the shunt environment predictable.
  • Front-end protection: protect amplifier/ADC inputs from inductive spikes without distorting normal sensing.
  • Bandwidth intent: decide whether the control reacts to peaks or averages; filter accordingly to avoid “spike = trip”.

2) Switch & thermal path (heat flow is part of protection accuracy)

  • Heat path: device → copper → via array → backside spreader; validate that heat actually leaves the hot spot.
  • OT thresholds: ensure thermal shutdown and recovery behavior does not oscillate (add time qualification where needed).
  • Copper sizing: sized for both current and heat spreading; narrow necks create local thermal cliffs.
  • Parallel / redundancy (light touch): if used, keep symmetry; treat redundancy as a policy-managed grouping problem.

3) Bus and branch PI (rack-internal only)

  • Return path discipline: high di/dt current must loop locally; avoid “return wandering” through sensitive ground.
  • Decoupling distribution: bulk near the bus entry + high-frequency close to branch switching elements.
  • Inrush concurrency: stagger/group enables so “many branches at once” does not create a bus droop trip cascade.
  • Spike containment: the inductive spike source is often layout; treat mitigation as loop + placement first.

4) Communications robustness (I³C/I²C/PMBus + isolation trigger)

  • Line length & pull-ups: plan for total bus capacitance; pull-ups must meet rise-time without creating ringing.
  • Topology clarity: segment or buffer when needed; prevent a single long spur from dominating bus timing.
  • Common-mode reality: if crossing grounds/domains or long cable runs, consider isolation at the boundary.
  • Hang recovery: include a bus reset/recovery path (watchdog + controlled re-init) to avoid “silent telemetry loss”.
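For open-drain I²C segments, the pull-up bullet can be turned into a quick bounds check. The sketch uses the sizing relations from the I²C-bus specification, Rp(max) = t_r / (0.8473 · C_b) and Rp(min) = (V_DD − V_OL(max)) / I_OL; the 3.3 V rail and 200 pF bus capacitance in the example are assumed values, while the 300 ns rise time and 0.4 V at 3 mA figures are the Fast-mode limits.

```python
def i2c_pullup_range(vdd, vol_max, iol, t_rise_max, c_bus):
    """Return (rp_min_ohm, rp_max_ohm) for an open-drain I2C bus.

    rp_min is set by the sink current a device can pull low;
    rp_max is set by the RC rise time against total bus capacitance.
    """
    rp_min = (vdd - vol_max) / iol
    rp_max = t_rise_max / (0.8473 * c_bus)
    return rp_min, rp_max


# Fast-mode example: 3.3 V rail, 200 pF of total bus capacitance (assumed).
rp_min, rp_max = i2c_pullup_range(
    vdd=3.3, vol_max=0.4, iol=3e-3, t_rise_max=300e-9, c_bus=200e-12
)
# rp_min is about 967 ohm and rp_max is about 1.77 kohm
```

When the window between the two bounds becomes narrow or inverts, the bus capacitance is too high for the chosen speed, which is the cue to segment or buffer the bus as described above.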

5) Fail-safe defaults (what happens when control is lost)

  • Default states: define which branches default OFF and which must stay ON for OOB/alarm continuity.
  • Priority domains: OOB + alarm domain power must be preserved during shedding/lockout policies.
  • Config versioning: threshold/policy versions should be logged so “changed policy” is visible in postmortems.
High-yield red lines: true Kelvin sensing, local return loops, staged enables, bus hang recovery, and explicit OOB/alarm power priority.
Figure F7 — Layout-focused checklist (Kelvin, sensitive keep-outs, thermal path, decoupling)
Diagram summary: four panels. A) Shunt + Kelvin: power copper on both sides of the shunt, sense pair to amp/ADC, keep-out around the high di/dt node. B) Switch + thermal: FET → copper spreader → via array heat path. C) Decoupling + return: bulk at bus entry, HF cap at the branch, local return loop. D) Mgmt bus + isolation: controller, devices, pull-ups, isolation if cross-domain, long cable, or common-mode noise.
Focus on Kelvin correctness, local return loops, thermal exit paths, and management-bus recovery/isolation triggers to prevent false protection events and telemetry loss.
Boundary reminder: this checklist stays within rack-internal DC distribution and management wiring. Facility-level surge/EMC theory and PSU internals are excluded by design.
H2-10 · Validation & Production Test

Validation & Production Test: Calibration, Fault Injection, and Field Self-Checks

Testing for a Micro-DC Rack is not just “power on and read telemetry.” It must prove that protection actions are correct, telemetry is trustworthy across conditions, and evidence logs are sufficient for fast troubleshooting. The emphasis here is test hooks specific to rack DC distribution: stimulus → expected action → expected log fields.

R&D validation (stimulus → measurement → pass criteria)

  • Overcurrent / short: step loads and short fixtures; measure response time and state transitions.
  • SOA / thermal: steady + pulsed load; confirm derating behavior and thermal shutdown stability.
  • False-trip resilience: noise injection, bus droop, concurrent enables; track trip probability and recovery behavior.
  • Concurrency scenarios: staged enable vs all-at-once; confirm no cascading bus droop trips.

Production test + maintenance (minimum set with high coverage)

  • Channel self-test: on/off response and readback of state bits and fault flags.
  • Sensor open/short detection: door/leak/temp fault checks with explicit reason codes.
  • Current calibration: one-point or two-point strategy, with a plan for temperature-drift compensation.
  • Log pull + integrity check: verify log packet completeness and the presence of required audit fields.
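
The two-point current calibration mentioned above can be sketched as a gain/offset solve. This is a minimal illustration, not a production procedure: the function names, raw ADC counts, and reference currents are all assumptions chosen for the example, and temperature compensation is deliberately left out.

```python
def two_point_cal(raw_lo, raw_hi, ref_lo_a, ref_hi_a):
    """Solve amps = gain * raw + offset from two reference points.

    raw_lo/raw_hi are raw monitor readings (hypothetical ADC counts);
    ref_lo_a/ref_hi_a are the reference currents applied by the fixture.
    """
    if raw_hi == raw_lo:
        raise ValueError("calibration points must produce different raw readings")
    gain = (ref_hi_a - ref_lo_a) / (raw_hi - raw_lo)
    offset = ref_lo_a - gain * raw_lo
    return gain, offset


def apply_cal(raw, gain, offset):
    """Convert a raw reading to amps using stored calibration constants."""
    return gain * raw + offset


# Illustrative numbers only: 0.5 A and 10 A reference loads.
gain, offset = two_point_cal(raw_lo=410, raw_hi=8150, ref_lo_a=0.5, ref_hi_a=10.0)
```

A one-point strategy would fix the gain at its nominal value and solve only for the offset. Whichever is used, storing the constants together with a calibration version keeps later drift investigations traceable.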

Fault injection mindset (matrix-driven, not ad-hoc)

  • Fault types: OCP, short, OT, UV/bus droop, remote-off, sensor-fault.
  • Expected actions: limit/derate, shed, lockout/manual clear, preserve OOB + alarm domain.
  • Expected logs: reason_code, t_mono, snapshots, config_version; plus actor/request_id for writes.
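
One way to keep the matrix from drifting into ad-hoc testing is to hold it as data and check each injected fault against it. The sketch below assumes a simplified matrix with one first-stage action per fault; the `Case` type, the action strings, and `check_case` are hypothetical names for illustration, and the log field names follow the list above.

```python
from dataclasses import dataclass

# Fields every log packet needs, plus extra fields for write-type actions.
BASE_FIELDS = {"reason_code", "t_mono", "snapshots", "config_version"}
WRITE_FIELDS = {"actor", "request_id"}


@dataclass(frozen=True)
class Case:
    fault: str
    expected_action: str   # first-stage action primitive (simplified)
    is_write: bool = False  # True when the fault is an operator/API write


MATRIX = [
    Case("OCP", "LIMIT"),
    Case("SHORT", "LOCKOUT"),
    Case("OT", "DERATE"),
    Case("UV", "SHED"),
    Case("REMOTE_OFF", "OFF", is_write=True),
    Case("SENSOR_FAULT", "DEGRADE"),
]


def check_case(case, observed_action, log_packet):
    """Return a list of failure strings; an empty list means the case passed."""
    failures = []
    if observed_action != case.expected_action:
        failures.append(
            f"action: expected {case.expected_action}, got {observed_action}"
        )
    required = BASE_FIELDS | (WRITE_FIELDS if case.is_write else set())
    missing = sorted(required - set(log_packet))
    if missing:
        failures.append(f"missing log fields: {', '.join(missing)}")
    return failures


# Example: a remote-off whose log packet forgot to record the actor.
packet = {"reason_code": "REMOTE_OFF", "t_mono": 1234.5, "snapshots": [],
          "config_version": "v7", "request_id": "r-42"}
remote_off = next(c for c in MATRIX if c.fault == "REMOTE_OFF")
failures = check_case(remote_off, "OFF", packet)
```

Here `failures` contains a single entry reporting the missing `actor` field, which is the kind of gap a matrix-driven run is meant to surface before deployment.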

Field self-checks (fast, safe, and actionable)

  • Read-only baseline: state bits, thresholds, policy/config version, and recent events.
  • Low-risk actuation: validate a non-critical branch/group control under controlled windows.
  • Degraded operation: if sensors are faulty or telemetry is partial, restrict risky automation and keep OOB/alarm online.
Minimal troubleshooting workflow: check reason_code, then timestamp, then pull ±1s telemetry snapshots, and finally verify config_version + actor (policy changes).
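
The workflow above can be sketched as a small report builder that follows the same order. This is an illustrative sketch only: the `triage` function, the event dictionary layout, and the sample field `i_branch_a` are assumptions, not an existing API.

```python
def triage(event, samples, half_width_s=1.0):
    """Build a triage report in the order: reason_code, timestamp,
    telemetry snapshots within +/- half_width_s, then config context."""
    t_event = event["t_mono"]
    window = [s for s in samples if abs(s["t_mono"] - t_event) <= half_width_s]
    return {
        "reason_code": event["reason_code"],
        "t_mono": t_event,
        "snapshots": window,
        "config_version": event.get("config_version"),
        "actor": event.get("actor"),
    }


# Illustrative event and telemetry samples (monotonic seconds, branch amps).
event = {"reason_code": "OCP", "t_mono": 100.0,
         "config_version": "v7", "actor": "svc-ops"}
samples = [{"t_mono": t, "i_branch_a": i}
           for t, i in [(98.5, 1.0), (99.2, 1.1), (100.0, 9.8),
                        (100.9, 0.0), (101.5, 0.0)]]
report = triage(event, samples)
```

With these inputs the report keeps only the three samples inside the one-second window around the trip and discards the two outside it.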
Figure F8 — Fault injection matrix (fault × expected action × expected log fields)
[Figure F8: Fault Injection Matrix (fault → expected action → expected logs)]
  • Overcurrent: LIMIT → optional SHED; logs reason_code, t_mono, snapshot.
  • Short: TRIP → LOCKOUT; logs reason_code, snapshot, retries.
  • Overtemp: DERATE → SHED; logs temp, t_mono, snapshot, config.
  • Bus droop / UV: SHED → STAGE; logs bus_v, t_mono, snapshot, group.
  • Remote-off: OFF → preserve OOB; logs actor, req_id, config, t_mono.
  • Sensor fault: DEGRADE; logs sensor_id, fault_mode.
  • Action primitives: LIMIT · DERATE · SHED · LOCKOUT · KEEP OOB.
  • Log packet fields: reason_code + t_mono, snapshots (±1 s), config_version, actor + request_id.
  • Pass criteria: correct action, correct logs, safe clear rule.
A matrix-driven approach makes validation repeatable: each injected fault must lead to the intended action primitive and a complete, searchable log packet (reason_code, timestamps, snapshots, and configuration context).
Boundary reminder: this section defines rack DC distribution test hooks and workflows (stimulus → action → logs). Generic BIST/POST architectures and crypto internals are intentionally excluded.
H2-11 · Parts / IC Selection Pointers

Parts / IC Selection Pointers: Vendor Questions + Example MPNs

This section is intentionally not a buying guide. It is a practical checklist of “must-ask questions” per function block, plus illustrative MPN examples to anchor the discussion. Always validate derating, SOA, thermal layout, diagnostics, and production test hooks with the latest datasheets and board-level measurements.

Ask · Verify · Log · Recover
Usage tip: For each block, capture (1) protection/telemetry requirements, (2) bus/interface constraints, and (3) the minimum log fields required for postmortems (reason_code, timestamps, snapshots, config_version, actor for writes).

A) eFuse / High-Side Switch (branch protection core)

Two common approaches exist: integrated eFuse (fast integration) or controller + external MOSFET (SOA/thermal flexibility).

  • Selection dimensions: VIN/VBUS range, continuous current, surge/peak + SOA, limit mode (constant/foldback/fast-trip), short-circuit response, reverse blocking, OV/UV/OT, dv/dt ramp, telemetry & diagnostics, retry/lockout policy.
  • Must-ask questions:
    • Is SOA specified with pulse conditions representative of branch inrush and short fixtures?
    • What is the limit behavior (constant vs foldback) and how does it impact dissipation during stalls/overloads?
    • How are faults differentiated (OCP vs SC vs OT vs UV/OV vs reverse)? Is there a latched last-fault register?
    • Is retry policy configurable (attempt count, backoff, lockout, manual clear)? What is the default after a watchdog reset?
    • What is the measurable response time from fault onset to limit/trip under worst-case wiring inductance?
  • Example MPNs (illustrative):
    • Texas Instruments TPS2662 / TPS2663 (integrated eFuse class, high-voltage range family)
    • Texas Instruments LM5069 (hot-swap / inrush controller class, external MOSFET)
    • Texas Instruments LM25066A (hot-swap + current monitor + PMBus/SMBus telemetry class)
    • Infineon PROFET™ high-side switch families (12V/24V domain branch switching class; family selection depends on load)
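
When asking vendors about retry configurability, it helps to have the intended policy written down in testable form. The sketch below shows one possible attempt-count plus exponential-backoff policy that ends in lockout; `RetryPolicy`, `next_step`, and the default values are hypothetical and do not describe any listed part's behavior.

```python
from dataclasses import dataclass


@dataclass
class RetryPolicy:
    max_attempts: int = 3          # retries allowed before lockout
    base_backoff_s: float = 0.5    # delay before the first retry
    backoff_factor: float = 2.0    # multiplier applied per attempt


def next_step(policy, attempt_count):
    """Decide what to do after a trip.

    attempt_count is the number of retries already made for this fault.
    Returns ("RETRY", delay_s) or ("LOCKOUT", None); lockout is assumed
    to require a manual clear.
    """
    if attempt_count >= policy.max_attempts:
        return ("LOCKOUT", None)
    delay = policy.base_backoff_s * policy.backoff_factor ** attempt_count
    return ("RETRY", delay)


policy = RetryPolicy()
plan = [next_step(policy, n) for n in range(4)]
```

With the defaults, the plan is three retries at 0.5 s, 1 s, and 2 s followed by lockout. The useful vendor question is whether the part can express that policy in hardware or whether the controller must enforce it.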

B) Current / Voltage Monitoring (trustworthy observability)

  • Selection dimensions: accuracy + drift, common-mode range, input transient behavior (saturation + recovery), sample rate/bandwidth, alert comparators (threshold granularity, hysteresis, debounce), interface (I²C/SMBus/PMBus), multi-channel needs.
  • Must-ask questions:
    • Under bus spikes/droops, does the monitor saturate? If yes, what is the recovery time?
    • Can alerts be qualified (time debounce) and do they support hysteresis to prevent alarm storms?
    • Is peak vs average reporting supported (or can firmware derive it without aliasing)?
    • What calibration hooks exist (offset/gain trim, temperature compensation strategy)?
  • Example MPNs (illustrative):
    • Texas Instruments INA238 (precision current/power monitor class)
    • Texas Instruments INA233 (current/power monitor class, SMBus compatible)
    • Texas Instruments INA3221 (3-channel shunt/bus monitor class)
    • Analog Devices LTC2946 (coulomb/energy monitor class)
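
If a monitor's alert comparator cannot be qualified in hardware, the same behavior can be approximated in firmware. The following is a minimal sketch of time debounce on assert plus hysteresis on clear; the class name, thresholds, and sample counts are assumptions for illustration.

```python
class QualifiedAlert:
    """Alert that asserts after N consecutive samples above a threshold
    and clears only once the value falls below a lower threshold."""

    def __init__(self, assert_above, clear_below, debounce_samples):
        if clear_below >= assert_above:
            raise ValueError("clear_below must be lower than assert_above")
        self.assert_above = assert_above
        self.clear_below = clear_below
        self.debounce_samples = debounce_samples
        self.active = False
        self._count = 0

    def update(self, value):
        """Feed one sample and return the current alert state."""
        if not self.active:
            if value > self.assert_above:
                self._count += 1
                if self._count >= self.debounce_samples:
                    self.active = True
                    self._count = 0
            else:
                self._count = 0  # a single quiet sample restarts the debounce
        elif value < self.clear_below:
            self.active = False
        return self.active


alert = QualifiedAlert(assert_above=10.0, clear_below=8.0, debounce_samples=3)
states = [alert.update(v) for v in [11, 9, 11, 11, 11, 9, 7]]
```

The isolated spike at the start does not assert the alert, three consecutive high samples do, and the alert holds at 9 (inside the hysteresis band) before clearing at 7.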

C) Sensors & Aggregation (temp/humidity/door/leak + mux/ADC)

  • Selection dimensions: sensor fault detectability (open/short/stuck), debounce/hysteresis strategy, multiplexing scale (channels, address conflicts), bus recovery (hung bus isolation/reset), ADC channels/resolution, input protection for long sensor leads.
  • Must-ask questions:
    • How are sensor wiring faults represented (explicit fault flags vs “invalid readings”)?
    • Does the mux/buffer allow isolating a misbehaving branch device without taking the whole bus down?
    • What is the recommended bus pull-up / capacitance limit for the intended cable lengths inside the rack?
  • Example MPNs (illustrative):
    • Texas Instruments TCA9548A (8-channel I²C switch / mux class)
    • NXP PCA9548A (8-channel I²C switch / mux class)
    • Texas Instruments ADS1115 (I²C ADC class for slow-to-moderate telemetry)
    • Sensirion SHT31 (temp/humidity sensor class)

D) Controller (low-power MCU vs small management SoC)

  • Selection dimensions: watchdog + reset domains, brownout behavior, non-volatile event storage hooks, bus peripherals (I³C/I²C/SMBus/PMBus), secure update hooks (versioning + rollback path), and deterministic recovery after faults.
  • Must-ask questions:
    • During brownouts, can the design guarantee last-fault capture (reason_code + timestamp + snapshot)?
    • Is a robust update/rollback mechanism feasible (dual image / safe fallback), and can updates be audited (actor/request_id)?
    • After a watchdog reset, what is the default policy for branch enables and the OOB/alarm domain?
  • Example MPNs (illustrative):
    • STMicroelectronics STM32H7 / STM32G4 (MCU class)
    • NXP LPC55S69 (MCU class)
    • Microchip SAME54 (MCU class)
    • ASPEED AST2600 (management SoC / BMC-class device; use only for “hooks” here, not a deep dive)

E) Uplink communications (Ethernet PHY / isolation + optional RS-485)

  • Selection dimensions: PHY robustness, ESD/EMI tolerance strategy (including magnetics/isolation boundary), link-down behavior, offline logging/buffering strategy, and optional long-line fallback (RS-485).
  • Must-ask questions:
    • If uplink drops, can the system continue logging locally and backfill events after recovery?
    • Where is the isolation boundary (cross-domain/long cable/common-mode noise), and does isolation preserve management continuity?
    • For RS-485 fallback, what is the minimum control+alarm set that must remain available during degraded operation?
  • Example MPNs (illustrative):
    • Texas Instruments DP83867 (Gigabit Ethernet PHY class)
    • Microchip KSZ9131 (Gigabit Ethernet PHY class)
    • Texas Instruments SN65HVD1781 (RS-485 transceiver class)
    • Analog Devices ADM2587E (isolated RS-485 transceiver class)
Checklist item | Option A (fill in) | Option B (fill in) | Option C (fill in)
SOA / surge margin | TBD | TBD | TBD
Limit behavior | CONST / FOLD / TRIP | CONST / FOLD / TRIP | CONST / FOLD / TRIP
Reverse blocking | OK / N/A | OK / N/A | OK / N/A
Diagnostics richness | reason_code granularity | reason_code granularity | reason_code granularity
Retry / lockout policy | attempts + backoff | attempts + backoff | attempts + backoff
Telemetry interface | I²C / SMBus / PMBus | I²C / SMBus / PMBus | I²C / SMBus / PMBus
Production test hooks | self-test + logs | self-test + logs | self-test + logs
Figure F9 — Reusable selection radar + comparison table template
[Figure F9: Selection Template (Radar + Compare). Radar axes, scored 1–5 for Options A/B/C: SOA margin, protection speed, telemetry richness, thermal headroom, reverse block, programmability. Comparison table rows (all TBD): SOA / surge, limit mode, reverse block, diag codes, retry / lock, test hooks. Template intent: keep vendor comparisons consistent, measurable, and tied to logs + production tests.]
Reuse F9 across pages: keep the radar axes stable (SOA, speed, telemetry, programmability, reverse block, thermal headroom) and fill the compare table with verified, test-backed answers.

H2-12 · FAQs ×12

FAQs: Micro-DC Rack Branch Protection, Telemetry, Env Linkage, and OOB Hooks

These FAQs stay strictly inside the Micro-DC Rack boundary: branch eFuse/high-side switching, telemetry + event logs, environment/door/tamper linkage, and OOB management hooks.

Scope Guard: Allowed—branch protection (eFuse/high-side), branch telemetry/logs, env/door/tamper linkage, OOB permission boundaries. Banned—PSU/UPS/ATS internals, facility distribution/codes, AC PDU deep dive, crypto implementation details.
Figure F10 — FAQ map: “Symptoms → Checks → Knobs” within Micro-DC Rack scope
[Figure F10: FAQ Map (Micro-DC Rack scope only). Scope pillars: Branch Protection (eFuse / high-side); Telemetry & Logs (reason_code / timestamps); Env / Security (door / tamper / temp); OOB Hooks (read / write / audit). Flow: Symptoms (trips / resets, won’t power-cycle, alarm storms, env events) → Fast Checks (reason_code + trip_time, pre/post snapshot, attempt_count, actor / config_ver) → Knobs (limit mode, debounce / hysteresis, retry / backoff, stagger / priority). Rule: every answer must end with measurable checks (codes + time + snapshot) and actionable knobs (policy + thresholds + recovery).]
Use the same “Symptoms → Fast Checks → Knobs” pattern for all 12 FAQs, without leaving the Micro-DC Rack scope.
1

What is the practical engineering boundary between a Micro-DC Rack and a traditional rack PDU?

Maps to: H2-2 / H2-3
A Micro-DC Rack focuses on in-rack DC branch switching, electronic protection, and diagnostic-grade telemetry. Its core deliverable is a closed loop: branch enable/disable, fast fault response, and actionable logs (reason_code + timestamps + snapshots) exposed through an OOB control plane. It does not describe upstream power conversion or facility distribution.
2

Why can an eFuse trip frequently even when “average current looks small”?

Maps to: H2-5 / H2-9
“Current looks small” is often a measurement artifact. Common triggers include inrush spikes masked by filtering, di/dt-induced voltage spikes from wiring inductance, shunt thermal drift, and noise coupling into sense/enable pins. Start with last reason_code, trip_time, and a pre/post snapshot (Vbus, Ibranch, Tchannel) before adjusting thresholds or debounce.
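
The masking effect of filtering is easy to reproduce with a few numbers. In the sketch below, a short inrush spike sits on a 1 A baseline; the sample values, the 8-sample window, and the 10 A trip threshold are all made up for illustration.

```python
def moving_average(samples, window):
    """Trailing moving average; returns one value per full window."""
    return [
        sum(samples[i - window + 1:i + 1]) / window
        for i in range(window - 1, len(samples))
    ]


# 1 A baseline with a two-sample 12 A inrush spike (illustrative values).
raw = [1.0] * 8 + [12.0, 12.0] + [1.0] * 8
filtered = moving_average(raw, window=8)

TRIP_THRESHOLD_A = 10.0  # hypothetical fast-trip level
raw_peak = max(raw)            # what the eFuse comparator reacts to
filtered_peak = max(filtered)  # what averaged telemetry reports
```

The raw peak is 12 A, above the hypothetical trip level, while the averaged trace never exceeds 3.75 A. That gap is why the pre/post snapshot should include peak values, not only filtered averages.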
3

Constant-current vs foldback current limiting—how to choose, and what are the common pitfalls?

Maps to: H2-5
Constant-current limiting maintains voltage longer but can accumulate heat during stalls, causing delayed OT trips and repeated retries. Foldback reduces dissipation but can prevent startup into capacitive loads: the charging current looks like an overload, the limit folds back, and the branch can oscillate between limiting and retrying. Decide based on fault energy and recovery goals, then pair the mode with a retry/backoff/lockout policy and logs that distinguish OCP vs OT vs SC.
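
The dissipation difference between the two modes can be estimated from the voltage across the switch times the limited current. The sketch below assumes a simple linear foldback characteristic and illustrative values (12 V bus, 5 A limit, 1 A foldback floor); real parts define their own foldback curves, so treat this as a rough comparison only.

```python
def switch_dissipation_w(v_bus, v_out, i_limit):
    """Power dissipated in the pass element while it is limiting."""
    return (v_bus - v_out) * i_limit


def foldback_limit_a(v_out, v_bus, i_max, i_min):
    """Assumed linear foldback: the limit falls from i_max at full output
    voltage to i_min when the output is shorted to 0 V."""
    frac = min(max(v_out / v_bus, 0.0), 1.0)
    return i_min + (i_max - i_min) * frac


V_BUS, I_MAX, I_MIN = 12.0, 5.0, 1.0  # illustrative values

# Hard short: output held at 0 V.
p_constant = switch_dissipation_w(V_BUS, 0.0, I_MAX)
p_foldback = switch_dissipation_w(
    V_BUS, 0.0, foldback_limit_a(0.0, V_BUS, I_MAX, I_MIN)
)
```

Under these assumptions the switch dissipates 60 W in constant-current mode and 12 W with foldback. The same 1 A floor is also all the current available to charge a discharged capacitive load, which is where the startup risk comes from.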
4

What happens if short-circuit response is too fast or too slow?

Maps to: H2-5 / H2-10
A response that is too fast causes nuisance trips by misclassifying inrush or transient di/dt as a hard short, especially when the blanking window is tight. A response that is too slow forces the switch into high stress, increasing SOA risk and spreading bus disturbances to other branches. Validate with a controlled short fixture and step loads, and require logs that separate SC vs OCP and record response timing and peak values.
5

After a remote power-cycle, the load still won’t come up—what codes and timestamps should be checked first?

Maps to: H2-6 / H2-10
Start with last reason_code (UV/OV/OCP/SC/OT/remote-off/lockout), then the trip_time and the most recent attempt_count/backoff state. Next, inspect a pre/post snapshot around the event: Vbus droop, Ibranch peak, channel temperature, and enable state. If timestamps are inconsistent, rely on monotonic time plus a “near-absolute” mapping used by the OOB plane for correlation.
6

How can “critical-domain priority power” be implemented without introducing a new single point of failure?

Maps to: H2-4 / H2-9
Define a minimal critical domain (OOB controller, alarms, essential comms) and give it a dedicated branch with conservative thresholds, default-on behavior, and independent fault handling. Avoid coupling it to noncritical branches through shared enables or fragile bus dependencies. Ensure a watchdog reset preserves the critical domain, and record policy_version + reason_code so “priority actions” remain auditable and testable.
7

How should door/tamper events link to power actions without causing accidental shutdowns?

Maps to: H2-7 / H2-8
Treat door/tamper primarily as audit and permission signals, not immediate power-cut triggers. Use debounce and hysteresis, detect sensor faults, and escalate actions based on severity: door-open → log + restrict privileged writes; tamper/forced-open → require confirmation or two-step approval for power actions. Shed power only when a door/tamper event coincides with an additional safety signal (e.g., smoke/leak), so that a single noisy input cannot cause a false shutdown.
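
The escalation ladder above can be written as a small decision function. This is a sketch under assumptions: inputs are taken to be already debounced, and the action names and the `linkage_actions` function are hypothetical labels rather than a defined interface.

```python
def linkage_actions(door_open, tamper, safety_alarm, sensor_ok=True):
    """Map debounced door/tamper inputs to policy actions.

    safety_alarm stands for an independent signal such as smoke or leak.
    Power shedding is only returned when it coincides with door/tamper.
    """
    if not sensor_ok:
        # A faulty sensor is logged but never drives a power action.
        return ["LOG_SENSOR_FAULT"]
    actions = []
    if door_open:
        actions += ["LOG", "RESTRICT_PRIVILEGED_WRITES"]
    if tamper:
        actions += ["LOG", "REQUIRE_TWO_STEP_APPROVAL"]
    if (door_open or tamper) and safety_alarm:
        actions.append("SHED_NONCRITICAL")
    # Remove duplicates while preserving order.
    return list(dict.fromkeys(actions))


door_only = linkage_actions(door_open=True, tamper=False, safety_alarm=False)
worst_case = linkage_actions(door_open=True, tamper=True, safety_alarm=True)
```

A door-open event alone yields logging and write restrictions but no power action. Shedding appears only in the combined case, which keeps the policy easy to test with a truth table.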
8

What is the most reasonable division of labor for I³C, I²C, and PMBus inside a Micro-DC Rack?

Maps to: H2-8
Use I³C for dense on-board device discovery and higher-rate telemetry where supported, while keeping robust recovery paths for bus faults. Use I²C for broad compatibility with sensors, muxes, and monitors—pay attention to line length, pull-ups, and bus hang isolation. Use PMBus/SMBus semantics for power-oriented devices (limits, faults, telemetry fields), and keep long or noisy runs isolated or bridged through a controller boundary.
9

If current measurement drifts over time, how to tell shunt heating from layout issues or the sampling chain?

Maps to: H2-9 / H2-10
First check correlation with temperature and load duty cycle: self-heating shifts the shunt resistance and creates thermal gradients that invalidate a calibration taken at a single operating point. Next verify true Kelvin routing and quiet return paths; small ground drops can look like current error. Finally inspect the sampling chain: amplifier input protection, saturation recovery, ADC reference stability, and digital filtering. Confirm with controlled DC test points and a two-point trim strategy, and log the calibration version for traceability.
10

How should an event log be designed so “who triggered the power action” remains traceable after power loss?

Maps to: H2-6 / H2-8
Require every write action to carry identity context and be logged with actor_id, request_id, policy_version, channel_id, and result status, alongside reason_code and timestamps. Use a ring buffer plus “critical-event snapshot” commits on trip/lockout. After reboot, reconstruct ordering via monotonic time and include the last known mapping to near-absolute time for cross-system correlation, without relying on external services.
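
A ring buffer with immediate commits for critical events could look like the sketch below. The `EventLog` class, the set of critical reason codes, and the `commit` callback (standing in for a write to non-volatile storage) are assumptions for illustration; the required field names follow the answer above.

```python
from collections import deque

REQUIRED = {"actor_id", "request_id", "policy_version", "channel_id",
            "result", "reason_code", "t_mono"}
CRITICAL = {"TRIP", "LOCKOUT"}  # hypothetical reason codes that force a commit


class EventLog:
    def __init__(self, capacity, commit):
        self.ring = deque(maxlen=capacity)  # oldest entries drop off first
        self.commit = commit                # persists one event (assumed NV write)

    def record(self, event):
        """Reject incomplete events, buffer the rest, commit critical ones."""
        missing = sorted(REQUIRED - set(event))
        if missing:
            raise ValueError(f"event missing fields: {', '.join(missing)}")
        self.ring.append(event)
        if event["reason_code"] in CRITICAL:
            self.commit(event)


def make_event(reason_code, t_mono):
    """Build an illustrative event with placeholder identity context."""
    return {"actor_id": "svc-ops", "request_id": f"r-{int(t_mono)}",
            "policy_version": "v7", "channel_id": 3, "result": "OK",
            "reason_code": reason_code, "t_mono": t_mono}


committed = []  # stands in for non-volatile storage
log = EventLog(capacity=2, commit=committed.append)
for code, t in [("REMOTE_OFF", 10.0), ("TRIP", 11.0), ("REMOTE_ON", 12.0)]:
    log.record(make_event(code, t))
```

After the three events, the ring holds only the two most recent entries, while the trip has been committed separately and so survives both ring wrap-around and power loss. Ordering after reboot still relies on `t_mono`, as described above.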
11

If simultaneous power-up causes bus droop, how should branch staggering be implemented for stability?

Maps to: H2-4 / H2-5
Stagger by grouping branches based on inrush profile and criticality: bring up the control/OOB domain first, then low-inrush loads, then high-capacitive branches. Use soft-start/ramp control and fixed inter-group delays, and avoid synchronizing retries across many channels. Declare success criteria using telemetry: Vbus droop limits, number of OCP entries, retry counts, and “time-to-ready” measured and logged with timestamps.
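
The grouping and fixed inter-group delay can be expressed as a simple schedule. This is a minimal sketch: the branch names, the 0.2 s delay, and `stagger_schedule` are illustrative assumptions, and real delays should come from measured inrush and bus recovery times.

```python
def stagger_schedule(groups, inter_group_delay_s):
    """Return (enable_time_s, branch) pairs.

    groups is an ordered list of branch lists: control/OOB first, then
    low-inrush loads, then high-capacitive branches. Branches within a
    group share the same enable time.
    """
    return [
        (index * inter_group_delay_s, branch)
        for index, group in enumerate(groups)
        for branch in group
    ]


schedule = stagger_schedule(
    [["oob"], ["fan", "nic"], ["gpu"]],
    inter_group_delay_s=0.2,
)
```

The resulting schedule enables `oob` at 0 s, `fan` and `nic` at 0.2 s, and `gpu` at 0.4 s. Logging each enable time alongside the Vbus minimum makes the "time-to-ready" and droop criteria directly checkable.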
12

When smoke/leak/high-temperature alarms occur, what is a recommended staged load-shedding policy?

Maps to: H2-7 / H2-4
Use staged shedding with explicit priorities: preserve OOB/alarms, shed noncritical branches first, then progressively restrict high-power loads if the alarm persists. Combine sensor debounce/hysteresis with multi-signal confirmation to prevent false trips, and record linkage in logs (env_event_id → power_action_id → affected_branches). Keep the policy deterministic and testable: each stage must have measurable entry/exit criteria and a clear reason_code so postmortems can prove the decision path.
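
A deterministic staged policy can be sketched as a pure function from branch priorities and alarm stage to the set of branches to shed. The priority scheme (0 means critical and never shed, higher numbers shed earlier), the branch names, and the `ENV_SHED` reason code are assumptions for illustration.

```python
def shed_plan(branches, stage):
    """Return the branches to shed at a given alarm stage.

    branches maps name -> priority; priority 0 is the OOB/alarm domain
    and is never shed. Each stage sheds one more priority level,
    starting from the highest number (least critical).
    """
    max_prio = max(branches.values())
    threshold = max_prio - stage + 1
    return sorted(
        name for name, prio in branches.items()
        if prio > 0 and prio >= threshold
    )


def linkage_record(env_event_id, power_action_id, stage, affected):
    """Log entry tying the environment event to the power action taken."""
    return {
        "env_event_id": env_event_id,
        "power_action_id": power_action_id,
        "stage": stage,
        "affected_branches": affected,
        "reason_code": "ENV_SHED",  # hypothetical code
    }


BRANCHES = {"oob": 0, "alarm": 0, "compute": 1, "storage": 1, "aux_lighting": 2}
stage1 = shed_plan(BRANCHES, stage=1)
stage2 = shed_plan(BRANCHES, stage=2)
record = linkage_record("env-17", "pwr-203", 2, stage2)
```

Stage 1 sheds only `aux_lighting`; stage 2 adds `compute` and `storage`; `oob` and `alarm` are never returned at any stage. Because the function has no hidden state, each stage can be exercised directly in the fault injection matrix.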