Micro Edge Datacenter Rack: PDU Monitoring & OOB BMC
A Micro Edge Datacenter Rack is the rack-level “governance layer” for edge sites: it distributes power to branches, measures V/I/P/E with trustworthy evidence, enforces per-outlet protection and remote control, and keeps OOB management and audit logs alive during outages. Its success is defined by operability and accountability—every alarm and remote action can be verified with pre/post measurements, fault codes, and traceable logs.
Scope & boundaries: what this page covers (and what it avoids)
This page is rack-level: it focuses on power distribution governance (per-branch metering + protection + remote control), environment sensing, and an out-of-band (OOB) management evidence chain. It intentionally avoids upstream site power, network dataplane, and timing subsystems to prevent overlap with sibling pages.
- Per-branch visibility: voltage/current/power/energy and load profiles that remain trustworthy during bursty loads.
- Per-branch governance: eFuse/high-side switch protection and safe remote actions (off / lockout / power-cycle).
- Site-condition awareness: temperature/humidity/door/fan/airflow sensing with stable thresholds (debounce/hysteresis).
- OOB survivability: a management path that stays controllable even when in-band networking or hosts are down.
- Evidence chain: alarm + action + pre/post snapshots + audit trail (who/when/what/result) for accountability.
| Topic | This page | Sibling link |
|---|---|---|
| Rack PDU metering (V/I/P/E) | Covered (accuracy + sampling + validation) | — |
| Branch protection & remote outlet control | Covered (eFuse/HSS policies + safe actions) | — |
| Environmental monitoring (rack sensors) | Covered (placement + thresholds + false alarms) | — |
| OOB management & audit logs | Covered (survivability + evidence chain) | — |
| 48V front-end hot-swap / site rectifier | Avoided (energy system layer) | Edge Site Power & Backup |
| UPS / supercap / battery hold-up sizing | Avoided (capacity planning layer) | Edge Site Power & Backup |
| PTP / GNSS / SyncE timing design | Avoided (timing subsystem) | Edge Grandmaster / Time Hub |
| UPF/LBO/switch dataplane & security policy engines | Avoided (network function layer) | Edge Gateways / Security Nodes |
Reference architecture: rack building blocks & interfaces
A micro edge rack becomes operationally useful only when power path, sense/protect plane, and management plane are designed as three coordinated layers. The diagram below defines a minimal rack that supports per-branch control, reliable telemetry, and OOB survivability.
- Power path: Feed A/B → Rack PDU → Branch/Outlets → Loads (servers/switches as black boxes).
- Sense/protect plane: metering AFE + eFuse/high-side per branch, producing both alarms and hard trips.
- Management plane: OOB BMC/MCU that stays alive for reads, controlled actions, and audit logs.
- Mgmt Ethernet: stable remote access for inventory, telemetry, and controlled actions.
- NC-SI (optional): shared management path when a dedicated port is unavailable; it shares the host NIC's failure domain, so reachability risk must be handled explicitly.
- Serial console: last-resort recovery channel for misconfigurations and host failures.
- Sensor buses: short, robust links for power metering and environment sensors (I2C/SMBus/PMBus-class buses).
PDU monitoring AFE: what to measure and how to make it trustworthy
Rack metering is operationally useful only when it produces actionable evidence: per-branch load profiles, burst/peak indicators, and pre/post snapshots that explain alarms and trips. This section defines what to measure, how to sample it across multiple time scales, and how to turn error sources into a checkable budget.
| Metric | Why it matters (operations) | Window guidance | Common misuse to avoid |
|---|---|---|---|
| V / I (per branch) | Detect undervoltage, overload, wiring drops, and abnormal draw by branch ID. | Maintain both slow averages and fast snapshots around events. | Treating a stable average as “truth” during burst loads. |
| P (real power) | Capacity planning and anomaly detection when current alone is ambiguous. | Compute from synchronized V/I samples or defined intervals. | Comparing power values computed with different windows. |
| E (energy) | Billing/cost allocation, trend baselining, and “who used what” attribution. | Use long integration windows; log resets and counter rollovers. | Using energy counters to diagnose short transient issues. |
| Peak / burst indicator | Explains “mysterious” trips when average current looks acceptable. | Define peak window explicitly (e.g., max over N samples or over T ms). | Reporting “peak” without a stated window or sampling method. |
| Inrush indicator (startup signature) | Separates legitimate startup surges from persistent overload or short events. | Capture early-time waveform statistics (rise time, peak, duration). | Confusing inrush with long-term load draw. |
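The "log resets and counter rollovers" guidance for energy counters can be sketched as below. This is a minimal illustration, not a reference implementation: the 32-bit counter width, the Wh unit, and the `max_step_wh` plausibility limit are assumptions chosen for the example.

```python
from dataclasses import dataclass

COUNTER_BITS = 32                 # assumed counter width
COUNTER_MOD = 1 << COUNTER_BITS


@dataclass
class EnergyDelta:
    delta_wh: int
    event: str                    # "normal", "rollover", or "reset"


def energy_delta(prev_raw: int, curr_raw: int, max_step_wh: int) -> EnergyDelta:
    """Energy consumed between two raw counter reads.

    max_step_wh is the largest plausible consumption between two reads.
    It separates a rollover (small wrapped delta) from a counter reset.
    """
    if curr_raw >= prev_raw:
        return EnergyDelta(curr_raw - prev_raw, "normal")
    wrapped = (curr_raw - prev_raw) % COUNTER_MOD
    if wrapped <= max_step_wh:
        return EnergyDelta(wrapped, "rollover")
    # The counter moved backwards by an implausible amount: treat it as a
    # reset and count only the energy accumulated since the reset.
    return EnergyDelta(curr_raw, "reset")
```

The returned `event` value is what would be written to the log, so a later audit can tell a genuine drop in consumption from a counter artifact.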
- Voltage: divider → ADC, with a clear sense point definition (where voltage is considered “the branch voltage”).
- Current (two common options): shunt → amplifier/AFE → ADC, or Hall/magnetic sensor → AFE → ADC.
- ADC + filtering: define anti-alias behavior and sample timing before claiming accuracy under burst loads.
- Isolation/common-mode: treat as a rack monitoring constraint (measurement survivability), not a site power design topic.
| Choice | Strength | Typical failure mode in practice | What to validate |
|---|---|---|---|
| Shunt | High linearity and predictable behavior if thermal and wiring are controlled. | Apparent drift due to self-heating, Kelvin sense mistakes, or ground/return coupling. | Temp sweep, load steps, wiring drop sensitivity, offset stability. |
| Hall / magnetic | Isolation-friendly sensing with minimal insertion loss on high currents. | Quiet-looking readings that are wrong due to sensor saturation, bandwidth limits, or external magnetic fields. | High-current peak capture, saturation tests, ambient field sensitivity, calibration repeatability. |
A single-rate stream cannot explain burst loads. Use two time scales: a slow path for trends and energy, and a fast path for peaks/inrush and event snapshots.
| Path | Purpose | Data form | Key rules |
|---|---|---|---|
| Slow path | Energy/trends, baselines, capacity planning, gradual drift detection. | Averages/integrals (V/I/P/E) per branch ID. | Always tag the window and aggregation method; log counter resets and rollovers. |
| Fast path | Peak/inrush capture, explaining trips/alarms, pre/post snapshots. | Ring buffer + event-triggered snapshots (short-window statistics). | Declare peak window; apply anti-alias filtering or oversampling; freeze snapshots on trip/alarm edges. |
| Error source | Typical symptom | Mitigation | Validation test |
|---|---|---|---|
| Gain/offset (AFE + ADC) | All currents look “consistently high/low” across branches. | Factory calibration + field sanity checks with known loads or references. | Two-point calibration; cross-check against a portable reference meter. |
| Thermal drift (sensor + self-heating) | Slow drift that correlates with rack temperature or sustained load. | Thermal design for sensors; temperature compensation; drift alarms. | Temp sweep under load; compare cold vs hot offsets; long soak tests. |
| Wiring drop (sense point ambiguity) | Voltage looks fine at PDU but load behaves like it is undervoltage. | Define sense points; use Kelvin sense where required; document “what V means”. | Load step test; measure at PDU vs load connector; correlate with temperature rise. |
| Aliasing / windowing | Readings look stable while trips occur during bursts or startups. | Two-scale sampling; anti-alias filtering; event snapshots and explicit peak window. | Inject burst loads; verify peak capture; compare slow average vs snapshot evidence. |
| Sensor saturation (Hall/AFE range) | Clipped peaks; “flat-top” current at high load; missing inrush evidence. | Range headroom; detect saturation; flag invalid samples in logs. | High-current pulse test; confirm saturation flag; verify peak indicator behavior. |
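The "detect saturation; flag invalid samples" mitigation can be illustrated with a flat-top detector. This is a sketch under stated assumptions: the 2% margin and the minimum run length of 3 samples are placeholder values, and `full_scale` is assumed to be positive.

```python
def flag_saturation(samples, full_scale, margin=0.02, min_run=3):
    """Return (saturated, validity); validity[i] is False for clipped samples.

    A run of at least `min_run` consecutive samples within `margin` of full
    scale is treated as a flat top, and every sample in it is marked invalid.
    """
    limit = full_scale * (1.0 - margin)
    validity = [True] * len(samples)
    run_start = None
    # The trailing 0.0 is a sentinel that closes a run ending at the last sample.
    for i, x in enumerate(list(samples) + [0.0]):
        if abs(x) >= limit:
            if run_start is None:
                run_start = i
        else:
            if run_start is not None and i - run_start >= min_run:
                for j in range(run_start, i):
                    validity[j] = False
            run_start = None
    return (not all(validity)), validity
```

Storing the validity list next to the snapshot keeps a clipped peak from being read as a real measurement during incident review.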
- Evidence: mismatch correlates with fan speed changes or cable movement; offset jumps after maintenance. Action: verify Kelvin sense routing; isolate measurement ground/return; log maintenance events to explain step changes.
- Evidence: waveform clips at a constant max; peak indicator does not scale with load severity. Action: increase range headroom; add saturation flag; freeze a short snapshot on event edges.
- Evidence: alarm timestamps align with workload bursts; fast snapshots show peaks beyond limits. Action: publish peak definitions; use two-scale sampling; store pre/post snapshots around alarms.
- Evidence: drift tracks ambient temperature, not traffic or compute utilization. Action: add temperature compensation; use drift alarms; separate “measurement health” from “load health” in telemetry.
Branch protection & control: eFuse / high-side switches as “outlet governors”
At rack level, protection is not just about “saving hardware.” It is about governing each outlet with predictable behavior: fast fault containment, slow warnings that prevent surprise outages, and remote actions that are safe, rate-limited, and auditable.
- Current limiting: handle load steps and inrush without masking persistent overload.
- Short-circuit response: contain catastrophic faults with deterministic fast trip.
- Thermal protection: protect cables/connectors and avoid repeated heating cycles.
- Soft-start / ramp control: reduce nuisance trips and quantify startup signatures.
- Remote disconnect + lockout: enforce safe maintenance states and prevent flapping.
- Observability: fault flags + pre/post snapshots to support root-cause evidence.
| Event type | Fast trip (hard containment) | Slow alarm (operator time) | Evidence to log |
|---|---|---|---|
| Hard short / severe overcurrent | Immediate trip; optional latch until reviewed. | Alarm still emitted for context, not for decision-making. | fault_code, trip_reason, peak_window stats, pre/post snapshots. |
| Overload trend (sustained high current) | Trip only if thermal limits are crossed or policy requires. | Early warning; allow staged mitigation before outage. | slow averages, temperature trend, duty factor, operator actions. |
| Startup / inrush nuisance | Avoid repeated fast trips by policy (soft-start / limits). | Alarm when signature deviates from baseline (aging/cable issues). | inrush indicator, ramp time, peak stats, retries and cooldown. |
| Overtemperature | Trip when safety requires; protect connectors and harness. | Pre-trip warning with hysteresis to prevent oscillation. | temp sensors, time-over-threshold, last power-cycle attempt, lockout state. |
| State | Entry conditions | Exit / success criteria | Log fields (audit) |
|---|---|---|---|
| PRECHECK | No safety lockouts; acceptable temperature; authorized operator/action. | Interlocks cleared; snapshot trigger armed. | operator_id, policy_id, interlock_state. |
| SNAPSHOT | Fast buffer available; slow averages up-to-date. | Snapshot IDs stored (pre-action). | snapshot_pre_id, V/I/peak, temp/RH, fault_flags. |
| OFF | Command accepted; branch governor controllable. | Outlet confirmed off (state feedback). | cmd_id, result, off_timestamp. |
| WAIT | Minimum off-time timer running. | Timer met; ready for ON. | min_off_time_ms, wait_complete. |
| ON | Policy allows ramp/soft-start; retry budget available. | No immediate fault; inrush within policy signature. | snapshot_post_id, inrush_indicator, fault_code if any. |
| COOLDOWN / RETRY / LOCKOUT | Post-action stabilization or repeated failures. | Success: stable draw; Failure: lockout + escalation. | cooldown_s, retry_count, lockout_reason, escalation_flag. |
- Cooldown enforced: rate-limit repeated ON/OFF cycles to protect connectors and avoid thermal runaway.
- Retry budget: cap the number of retries; once exceeded, require lockout and escalation.
- Temperature gate: block ON when outlet/ambient temperature is above policy limits.
- Door/service gate: block remote switching during local maintenance windows (door open / service mode).
- Snapshot requirement: refuse destructive actions unless a pre-action snapshot is stored for auditing.
- Invalid-data handling: if metering validity flags indicate drift/saturation, treat evidence as suspect and avoid aggressive policies.
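The state table and interlock list above can be sketched as one guarded power-cycle routine. It is a simplified illustration: `Interlocks`, `Outlet`, `take_snapshot`, and `fault_on_start` are hypothetical names, the 45 °C limit is an assumed policy value, and real code would wait for the minimum off-time and a cooldown between retries instead of continuing immediately.

```python
from dataclasses import dataclass, field


@dataclass
class Interlocks:
    lockout: bool = False
    door_open: bool = False
    temp_c: float = 25.0
    temp_limit_c: float = 45.0    # assumed policy limit


@dataclass
class Outlet:
    on: bool = True
    retry_count: int = 0
    log: list = field(default_factory=list)


def power_cycle(outlet, locks, take_snapshot,
                fault_on_start=lambda: False, max_retries=2):
    """Walk PRECHECK -> SNAPSHOT -> OFF -> WAIT -> ON; return the final state."""
    def record(state, **fields):
        outlet.log.append({"state": state, **fields})

    blocked = [name for name, active in (
        ("lockout", locks.lockout),
        ("door_open", locks.door_open),
        ("over_temp", locks.temp_c > locks.temp_limit_c),
    ) if active]
    if blocked:
        record("PRECHECK", result="DENIED", interlock_state=blocked)
        return "DENIED"
    record("PRECHECK", result="OK", interlock_state=[])

    pre_id = take_snapshot()
    if pre_id is None:            # snapshot requirement: no evidence, no action
        record("SNAPSHOT", result="DENIED", reason_code="no_pre_snapshot")
        return "DENIED"
    record("SNAPSHOT", snapshot_pre_id=pre_id)

    outlet.on = False
    record("OFF", result="OK")
    record("WAIT", wait_complete=True)   # placeholder for the min off-time timer

    while True:
        outlet.on = True
        if not fault_on_start():
            record("ON", result="OK", snapshot_post_id=take_snapshot())
            return "ON"
        outlet.on = False
        if outlet.retry_count >= max_retries:   # retry budget exhausted
            locks.lockout = True
            record("LOCKOUT", retry_count=outlet.retry_count,
                   lockout_reason="retry_budget_exceeded")
            return "LOCKOUT"
        outlet.retry_count += 1
        record("RETRY", retry_count=outlet.retry_count)
```

Every exit path writes a log entry before returning, so a denied or failed action leaves the same kind of trace as a successful one.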
Environmental sensing plan: sensor placement, thresholds, and false-alarm control
Rack environmental monitoring should produce trusted, actionable signals, not alert noise. A practical plan defines a minimal sensor set, installs sensors where readings carry operational meaning, and tunes alarms with hysteresis and time windows. Environmental alarms can then drive rack-level power policies such as derating, shedding selected outlets, or locking out repeated retries—without relying on device-internal thermal controls.
| Signal | Operational question it answers | Recommended locations | Common false-alarm trigger |
|---|---|---|---|
| Temperature (multi-point) | Is the rack airflow effective, and is any zone overheating? | Inlet, exhaust, hotspot zone (near the hottest airflow path). | Sensor too close to vents or heat sources; poor thermal coupling. |
| Humidity (RH) | Is there condensation risk or abnormal moisture ingress? | Inlet-side ambient reference; avoid direct exhaust stream. | Door-open transient; sensor exposed to localized airflow jets. |
| Door (open/close + tamper) | Is the rack in service mode, or is access suspicious/unplanned? | Door frame fixed point with stable alignment; protected wiring path. | Vibration/misalignment causing switch bounce; loose cabling. |
| Fan tach / PWM | Are fans responding, and is airflow capacity degrading? | Fan module harness or controller feedback path (rack domain). | Short tach dropouts; noisy signal; connector intermittency. |
| Airflow / pressure (optional) | Is airflow blocked even if fans report “OK”? | Across filter/duct or strategic flow channel points. | Turbulence or placement too near a fan blade wake. |
| Leak / smoke (optional, covered briefly) | Site compliance or high-risk locations requiring early hazard signals. | Site-defined; treat as an external safety input. | Dust events or maintenance aerosols triggering nuisance alerts. |
| Placement zone | Sensors | Purpose | Avoid | Validation |
|---|---|---|---|---|
| Inlet | Temp, Humidity | Defines ambient baseline; detects site-level changes and condensation risk. | Direct exhaust mixing or localized warm air recirculation. | Compare against site reference; check stability over door events. |
| Exhaust | Temp | Verifies heat removal; supports inlet–exhaust delta trending. | Directly in high-speed fan jet causing oscillation. | Step-load test: delta should increase predictably and settle. |
| Hotspot zone | Temp | Captures local heat accumulation before it becomes a rack-wide issue. | Touching a heatsink/metal surface that biases the reading. | Correlate with fan telemetry; check repeatability after service. |
| Door frame | Door sensor | Separates service vs abnormal access; gates risky remote actions. | Loose mounting or misalignment that causes bounce. | Tap/vibration test; verify debounce filters with event counts. |
| Cable stress points | Door wiring, fan harness | Prevents “sensor disappears” incidents due to maintenance and cable strain. | Routing across sharp edges or moving hinges without relief. | Service cycle test; verify no intermittent tach/door events. |
False alarms are controlled by a three-part rule set: tiered thresholds (WARN/ALARM/CRITICAL), hysteresis to prevent oscillation, and time windows (debounce/averaging) to ignore short transients.
| Signal | Tiering logic | Hysteresis / latch | Window / debounce | Guard against |
|---|---|---|---|---|
| Temperature | Warn on trend; alarm on sustained limit; critical on rapid rise or high absolute. | Exit hysteresis to avoid toggling near threshold; optional critical latch until reviewed. | Sliding average + “time-over-threshold” confirmation. | Door-open gusts, short fan PWM changes, sensor placement artifacts. |
| Humidity | Warn on rising RH; alarm on persistent high RH; critical when combined with low temp margin. | Use hysteresis to prevent repeated edge crossing during marginal conditions. | Longer averaging window than temperature; ignore short spikes during servicing. | Transient moisture events and sensor airflow exposure. |
| Door | Separate “service open” from “unexpected open” with schedules/policy. | Optional latch for tamper alarms until acknowledged. | Debounce in ms; add event-count window for repeated opens. | Contact bounce, vibration, misalignment, loose magnets. |
| Fan tach | Warn on deviation from target; alarm on sustained low RPM or tach loss. | Avoid immediate latch; prefer controlled escalation with cooldown. | Short delay to ignore transient dropouts; confirm across multiple samples. | Single-sample glitches, connector intermittency. |
| Airflow/pressure (opt) | Warn on drift from baseline; alarm when correlated with rising exhaust temp delta. | Hysteresis to ignore turbulence; require correlation with temperature. | Averaging window tuned to filter dynamics. | Turbulence near fans, measurement noise, seasonal baseline changes. |
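Hysteresis and time-over-threshold confirmation from the table can be combined in a small alarm object. The sketch below handles a single threshold; the class name `ThresholdAlarm` and the sample-count form of the confirmation window are assumptions made for the example.

```python
class ThresholdAlarm:
    """Single-level alarm with exit hysteresis and sample-count confirmation."""

    def __init__(self, set_point, hysteresis, confirm_samples):
        self.set_point = set_point
        self.clear_point = set_point - hysteresis
        self.confirm_samples = confirm_samples
        self.active = False
        self._over = 0            # consecutive samples at or above set_point

    def update(self, value):
        if not self.active:
            self._over = self._over + 1 if value >= self.set_point else 0
            if self._over >= self.confirm_samples:
                self.active = True
        elif value < self.clear_point:
            # Clear only after the reading falls below the lower clear point.
            self.active = False
            self._over = 0
        return self.active
```

With a 40 °C set point, 3 °C hysteresis, and three confirming samples, a single 41 °C spike does not raise the alarm, and an active alarm stays raised at 39 °C until the reading falls below 37 °C.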
Environmental signals become useful when they map to bounded rack-level actions. Actions should be reversible, rate-limited, and always accompanied by pre/post evidence (snapshots + logs).
| Alarm level | Allowed rack action | Goal | Evidence to store |
|---|---|---|---|
| WARN | Derate policy (soft limits), increase monitoring frequency. | Prevent escalation while maintaining service. | trend window ID, temp/RH deltas, fan telemetry summary. |
| ALARM | Shed selected outlet groups by priority; enforce cooldown. | Reduce heat/power density in the rack domain. | pre/post snapshots, outlet_group_id, action_id, result. |
| CRITICAL | Lockout repeated retries; controlled shutdown of non-critical outlets; escalate. | Avoid thermal runaway and self-inflicted flapping outages. | fault_code, lockout_reason, audit log (who/when/what), correlation to door state. |
OOB BMC architecture: why OOB exists and what must stay alive
Out-of-band (OOB) management exists to keep visibility, control, and evidence available when in-band access fails. A rack-level OOB design defines: (1) failure scenarios to survive, (2) a minimal keep-alive domain that must remain powered and reachable, and (3) management interface tradeoffs such as a dedicated management port versus NC-SI sharing. Common management interfaces include IPMI and Redfish, but the focus here is operational continuity, not device-internal networking.
| Scenario | What fails (in-band) | What OOB must still do | Evidence to capture |
|---|---|---|---|
| In-band network outage | Management agents unreachable; remote SSH/APIs fail. | Read sensors, confirm power state, execute a controlled outlet action. | door state, env trends, outlet action IDs, timestamps. |
| Host OS hang / crash | In-band telemetry stops; services freeze while power remains on. | Collect last-known snapshots; power-cycle with cooldown and retry limits. | pre/post snapshots, fault flags, retry counters, outcomes. |
| Remote unattended site | No local technician; prolonged downtime if OOB is absent. | Maintain minimal control plane, logs, and safe recovery actions. | audit trail (who/when/what), lockout reasons, escalation flags. |
| Configuration mistakes | In-band misrouting or VLAN changes cause loss of management reachability. | Remain reachable via an independent path and support rollback actions. | network reachability state, mgmt link state, last successful access time. |
| Block | Must remain powered? | Reason | Evidence / fields |
|---|---|---|---|
| BMC power domain | Yes | Guarantees access to sensing/control when hosts or in-band fail. | uptime, reset causes, access state, policy state. |
| Sensor bus (env + door + fan) | Yes | Provides the ground truth for alarms and safe decisions. | sensor validity, last update time, missing-sensor flags. |
| Outlet control interface | Yes | Enables safe power-cycle, lockout, and controlled recovery actions. | action_id, results, cooldown, retry budget, lockout reason. |
| Local event log storage | Yes | Preserves evidence during loss of network or power disturbances. | snapshot IDs, audit fields, last N critical events. |
| Management interface link | Prefer independent | Reduces shared failure domain with the in-band network path. | link state, last successful login, out-of-band reachability. |
| Option | Strength | Shared failure domain risk | Operational fit |
|---|---|---|---|
| Dedicated mgmt ETH | More independent reachability; easier to isolate and monitor. | Lower shared risk with host NIC and in-band config errors. | Best for unattended sites and strict uptime requirements. |
| NC-SI shared port | Reduces ports/cabling; simpler physical build. | Higher shared risk: NIC/PHY/link issues and misconfiguration can affect both OOB and in-band. | Fits constrained deployments where the operational model tolerates shared fault domains. |
- Does not describe internal switch/firewall architectures or dataplane processing.
- Does not deep-dive authentication/PKI/zero-trust policy; focuses on availability and evidence.
- Does not depend on host OS health; OOB remains functional when in-band agents fail.
Telemetry & evidence chain: data model, sampling, event logs, and audit trail
Rack telemetry is most valuable when it forms an evidence chain: every action is attributable, every trip has before/after context, and every report is reproducible. A practical evidence chain defines (1) the minimum accountable evidence set, (2) a data model that ties assets, branches, sensors, and actions together, (3) dual-window sampling to capture both bursts and trends, and (4) log integrity so critical records survive outages and resets.
| Evidence item | Why it matters | Minimum fields | Typical source |
|---|---|---|---|
| Who/when/what acted | Enables accountability and prevents “unknown power changes”. | actor, role, timestamp, action, reason_code | BMC / policy engine |
| Which target was affected | Stops ambiguity across outlets, branches, and groups. | asset_id, FRU, branch_id, outlet_id, group_id | Inventory + PDU controller |
| Pre-snapshot state | Distinguishes overload vs inrush vs policy-driven actions. | V/I/P, env summary, fault_flags, threshold_version | Metering + sensors |
| Action record & interlocks | Explains why a command succeeded/failed or was blocked. | action_id, interlock_state, cooldown, retry_count | BMC state machine |
| Post-snapshot result | Proves the effect and supports verification audits. | result, trip_type, post V/I/P, post env summary | Metering + event logger |
A rack model should tie assets (what exists), branches/outlets (what can be controlled), sensors (what is observed), policies (how decisions are made), and actions (what changed). Correlation IDs link multiple events into a single incident timeline.
| Field | Type / example | Meaning | Used for |
|---|---|---|---|
| asset_id, fru_id | string (RACK-01, PDU-A) | Physical identity and replaceable unit mapping. | Service history, inventory, incident grouping. |
| branch_id, outlet_id, group_id | int/string (B07, O12, GRP-EDGE) | Control scope for limits, trips, and power cycling. | Targeted actions and safe sequencing. |
| sensor_id, sensor_type, location_tag | string (T-INLET, RH-INLET) | Where and what is being measured. | Placement verification, false-alarm diagnosis. |
| state, fault_flags | enum/bitset (ON, TRIPPED) | Outlet/branch state machine + fault indicators. | Troubleshooting and policy gating. |
| threshold_version, policy_id | string (THR-v12) | Which alarm/trip tuning was active at the time. | Explaining “why now” and preventing configuration drift. |
| action_id, actor, action | string/enum (ACT-8891, POWER_CYCLE) | Command identity and initiator. | Auditability and causality chains. |
| result, reason_code, correlation_id | enum/string (DENIED_LOCKOUT, INC-2026-01) | Outcome, failure reason, and incident grouping. | Incident timelines and post-mortems. |
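One way to express the field table as a typed record is sketched below. The field names follow the table; the ISO 8601 timestamp format, the `incident_timeline` helper, and the sample values not shown in the table (such as the actor name) are assumptions for illustration.

```python
import json
from dataclasses import asdict, dataclass
from typing import List, Optional


@dataclass(frozen=True)
class ActionRecord:
    action_id: str
    correlation_id: str
    timestamp: str                # ISO 8601 UTC string (assumed format)
    actor: str
    action: str
    asset_id: str
    branch_id: str
    outlet_id: str
    threshold_version: str
    result: str
    reason_code: Optional[str] = None


def incident_timeline(records: List[ActionRecord],
                      correlation_id: str) -> List[ActionRecord]:
    """Return the records that belong to one incident, oldest first."""
    matching = (r for r in records if r.correlation_id == correlation_id)
    # Same-format ISO 8601 UTC strings sort correctly as plain strings.
    return sorted(matching, key=lambda r: r.timestamp)


denied = ActionRecord("ACT-8891", "INC-2026-01", "2026-01-05T10:00:05Z",
                      "op-1", "POWER_CYCLE", "RACK-01", "B07", "O12",
                      "THR-v12", "DENIED_LOCKOUT", "lockout_active")
exported = json.dumps(asdict(denied))   # flat JSON for log export
```

Keeping `threshold_version` on every record, rather than in a separate configuration log, lets a single exported record explain which tuning was active when the action ran.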
Short-window sampling captures bursts (inrush, sudden load steps, trip preconditions). Long-window sampling captures trends (energy, thermal drift, fan degradation). A common pattern uses a ring buffer for short windows and stores a small pre/post slice only when a trigger fires.
| Signal | Short window (burst) | Long window (trend) | Trigger examples | Stored evidence |
|---|---|---|---|---|
| Branch current / power | High-rate ring buffer + pre/post slice on event | Periodic averages and energy counters | Trip flag, inrush indicator, rapid rise | pre/post V/I/P, peak, event markers |
| Temperature | Short slice around threshold crossings | Trend series (inlet/exhaust delta) | Threshold crossing, fast ramp | threshold_version, time-over-limit |
| Door / fan | Event-driven records with debounce markers | Counts and duty summaries | Unexpected open, tach loss, repeated bounce | actor gating, interlock state, audit fields |
| Policy actions | Always stored (actions are rare but important) | Incident grouping and outcome stats | Derate/shed/lockout entry + exit | action_id, correlation_id, result, reason_code |
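The ring-buffer pattern with pre/post slices can be sketched as follows. `SnapshotBuffer` is a hypothetical name, `post` is assumed to be at least 1, and a trigger that arrives while a post slice is still being collected is folded into the in-flight incident rather than starting a new one.

```python
from collections import deque


class SnapshotBuffer:
    """Rolling short-window buffer that promotes a pre/post slice on a trigger."""

    def __init__(self, pre, post):
        self.post = post
        self._ring = deque(maxlen=pre)   # only the newest `pre` samples survive
        self._pending = None             # incident still collecting post samples
        self.incidents = []

    def push(self, sample, trigger=False):
        if self._pending is not None:
            self._pending["post"].append(sample)
            if len(self._pending["post"]) == self.post:
                self.incidents.append(self._pending)
                self._pending = None
        elif trigger:
            # Freeze the pre-trigger history at the trigger edge.
            self._pending = {"pre": list(self._ring),
                             "trigger": sample,
                             "post": []}
        self._ring.append(sample)
```

Only promoted slices reach `incidents`; everything else in the ring is overwritten, which keeps storage bounded while still preserving context around each event.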
- Local-first: store incident records locally so network loss does not erase evidence.
- Ring buffer + promotion: keep a rolling short-window buffer; on triggers, promote a pre/post slice into the incident record.
- Commit markers: write header → payload → commit flag to avoid half-written entries being treated as valid.
- Monotonic IDs: use increasing event/action IDs to detect gaps and simplify audits.
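The commit-marker and monotonic-ID rules can be illustrated with a small binary log format. The layout below (magic bytes, 32-bit event ID, 16-bit length, CRC32, one commit byte) is an assumption made for the sketch; the CRC is an addition beyond the header → payload → commit sequence described above.

```python
import struct
import zlib

MAGIC = b"EV"
COMMIT = b"\xA5"
HEADER = struct.Struct("<2sIH")   # magic, event_id, payload length


def encode_entry(event_id, payload):
    """Serialize header -> payload -> CRC -> commit flag, in write order."""
    header = HEADER.pack(MAGIC, event_id, len(payload))
    crc = struct.pack("<I", zlib.crc32(header + payload))
    return header + payload + crc + COMMIT


def read_entries(blob):
    """Return committed (event_id, payload) pairs; stop at the first torn entry."""
    out, pos = [], 0
    while pos + HEADER.size <= len(blob):
        magic, event_id, length = HEADER.unpack_from(blob, pos)
        body_end = pos + HEADER.size + length
        end = body_end + 5                     # 4-byte CRC + 1-byte commit flag
        if magic != MAGIC or end > len(blob):
            break
        (crc,) = struct.unpack_from("<I", blob, body_end)
        if blob[end - 1:end] != COMMIT or crc != zlib.crc32(blob[pos:body_end]):
            break
        out.append((event_id, blob[pos + HEADER.size:body_end]))
        pos = end
    return out


def id_gaps(ids):
    """Return (previous, next) pairs where the ID sequence skips values."""
    return [(a, b) for a, b in zip(ids, ids[1:]) if b != a + 1]
```

Because the commit byte is written last, an entry interrupted by a power loss fails the length or commit check and is never returned as valid, while `id_gaps` exposes missing records during an audit.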
Failure modes & troubleshooting: symptoms → likely causes → what to check
Troubleshooting at the rack level should rely on measurable evidence. The table below maps common symptoms to likely causes and the checks & fields that confirm or falsify each hypothesis. This reduces blind power-cycling and prevents policy-driven “flapping” from being mistaken for real electrical faults.
| Symptom | Likely causes (ranked) | What to check (tests & evidence) |
|---|---|---|
| One branch trips frequently | Real overload/short • Inrush captured by trip window • Threshold too tight • Sensor offset/range • Retry flapping | Check: pre/post V/I/P, peak marker, trip_type, threshold_version, cooldown/retry_count. Evidence: overload shows sustained high I; inrush shows a short peak near action start; policy flapping shows repeated action_id entries with no lockout recorded. Next: adjust window/hysteresis; enforce cooldown; validate sensor range and calibration status. |
| Telemetry jumps or looks “wrong” | Sampling window mismatch • Saturation/flat-top • Sensor missing/intermittent • Threshold version drift • Noise/glitches | Check: short-window presence, sensor validity flags, missing-sensor counters, range flags, threshold_version. Evidence: saturation shows clipped peaks; window mismatch shows no burst slice around events; version drift shows incident records logged under inconsistent threshold_version values. Next: tighten trigger rules; fix placement/connection; lock configuration versions with audits. |
| OOB responds to ping, but outlet control fails | Permission/role denial • Interlock active (door/policy lockout) • Cooldown not expired • Outlet control domain not ready • State machine stuck | Check: audit actor/role, interlock_state, lockout_reason, cooldown timer, action result codes. Evidence: permission problems show DENIED results; interlocks show door_open/lockout flags; keep-alive gaps show control domain “not ready”. Next: correct roles; clear lockout with justification; verify keep-alive domain readiness before retry. |
| False environmental alarms | Placement artifacts • No hysteresis • Too short windows • Door/service transients • Fan tach dropouts | Check: alarm windows, hysteresis settings, door event correlation, sensor location_tag and stability. Evidence: alarms coincide with door open; repeated near-threshold toggling implies missing hysteresis; tach glitches are single-sample anomalies. Next: fix placement, add hysteresis, widen time-over-threshold windows, debounce door/fan signals. |
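The evidence patterns for a frequently tripping branch can be turned into a first-pass classifier. This is a rough sketch: the function name, the 80% sustained fraction, the 200 ms inrush window, and the three-cycle flapping limit are all placeholder assumptions, and the output is a hint for the operator rather than a verdict.

```python
def classify_trip(samples_a, limit_a, ms_since_on, cycles_in_window,
                  lockout_active, sustained_fraction=0.8, inrush_ms=200,
                  flap_cycles=3):
    """Suggest a likely cause from a pre-trip current slice and action history."""
    # Repeated power actions with no lockout recorded point at policy flapping.
    if cycles_in_window >= flap_cycles and not lockout_active:
        return "policy_flapping"
    over = sum(1 for x in samples_a if x > limit_a)
    # Most of the slice above the limit: sustained high current.
    if samples_a and over / len(samples_a) >= sustained_fraction:
        return "overload"
    # A brief excursion shortly after switch-on: startup signature.
    if over > 0 and ms_since_on <= inrush_ms:
        return "inrush"
    return "unclassified"
```

An "unclassified" result is a cue to check sensor validity flags and threshold_version before changing any limits.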
- Evidence before action: store a pre-snapshot before any remote power change.
- Rate-limit recovery: cooldown + retry budgets prevent flapping from masquerading as electrical faults.
- Version everything: record threshold_version/policy_id for every incident and action.
Remote operations playbook: safe actions, rollback, and escalation
Remote outlet control is safest when actions run inside explicit guardrails: a risk-ranked action catalog, enforceable interlocks that prevent flapping, and a clear rollback/escalation rule set. The goal is to keep operations repeatable, auditable, and incident-friendly at the rack layer without relying on ad-hoc power cycling.
| Action | Risk level | Preconditions (must be true) | Evidence required | Default guardrails |
|---|---|---|---|---|
| Read-only (telemetry/log export) | L0 | OOB reachable; sensor validity not degraded | timestamp, asset_id, branch/outlet states | No rate limits (audited access only) |
| Soft reset (mgmt logic / controller reset) | L1 | No critical thermal alarms; control domain healthy | pre-snapshot + action_id + result | Retry budget + cooldown timer |
| Outlet cycle (single outlet off/on) | L2 | Door closed (if required); temp below threshold; not in lockout | pre/post V/I/P + reason_code | Min off-time + cooldown; max N cycles |
| Lockout (isolate a problematic branch) | L3 | Repeated failures detected; incident correlation_id assigned | incident pack + escalation record | Requires justification; exit requires review |
| Interlock | Rule | Why it exists | What must be logged |
|---|---|---|---|
| Cooldown | Enforce a minimum interval between power actions | Avoid repeated thermal/electrical stress | cooldown_start, cooldown_end, action_id |
| Retry budget | Max N attempts; then auto-lockout | Stops infinite loops and “flapping” incidents | retry_count, lockout_reason, correlation_id |
| Thermal gating | Block risky actions when temp/severity is high | Prevents remote actions from worsening hotspots | temp_summary, severity, threshold_version |
| Door/service gating | Block actions when door open (configurable) | Protects personnel and avoids unsafe transitions | door_state, interlock_state, actor/role |
Remote operations should converge to a safe state. When actions fail, evidence is preserved first, then the rack enters a controlled rollback path (config version rollback or branch isolation). Persistent hazards or missing observability trigger an on-site escalation.
| Condition | Immediate safe state | Escalate to on-site when | Evidence to bring |
|---|---|---|---|
| Repeated outlet failures | Lockout the affected branch | Retry budget exhausted or trip storms persist | incident pack + action results + trip_type |
| Thermal alarms | Derate / shed non-critical outlets | Temp remains above limit across multiple windows | temp trend + threshold_version + actions taken |
| Untrusted telemetry | Freeze risky actions (L2/L3) | Sensor missing/invalid prevents confirmation | validity flags + gaps + correlation_id |
| OOB reachable but control domain not ready | Stop retries; keep evidence logging | Control readiness cannot be restored remotely | result/reason_code + interlock_state + timestamps |
Validation checklist: what proves the rack is production-ready
A rack is production-ready when validation covers measurement trust, protection correctness, OOB resilience, and field maintainability. Each checklist item should define a method, a pass criterion, and the evidence that must be preserved in logs for audits and incident reviews.
| Test | Method | Pass criteria | Evidence to store |
|---|---|---|---|
| Multi-point calibration | Verify low/mid/high points per range | Meets accuracy spec at every point; no gross non-linearity | cal_id, points, error, timestamp |
| Temperature drift check | Repeat key points under temperature variation | Error remains within spec across conditions | temp_tag, error vs temp, threshold_version |
| Range boundary behavior | Test near low-end and near full-scale | No unflagged saturation; range and validity flags behave as expected | range_flag, peak, validity flags |
| Burst response capture | Trigger short-window on load steps/inrush | Pre/post slices present; peak marker captured | short-window slice, peak marker, correlation_id |
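As an illustration of the multi-point calibration row, the check below expresses each point's error as a percentage of full scale and compares it with a limit. It is a sketch under stated assumptions: the 20 A full scale, the sample readings, and the limits are made-up values, not figures from any datasheet.

```python
def calibration_errors(points, full_scale):
    """points: list of (reference, measured) pairs in the same unit.

    Returns each point's error as a percentage of full scale.
    """
    return [100.0 * (measured - reference) / full_scale
            for reference, measured in points]


def passes(points, full_scale, limit_pct):
    """True when every calibration point is within +/- limit_pct of full scale."""
    return all(abs(err) <= limit_pct
               for err in calibration_errors(points, full_scale))


# Hypothetical low/mid/high points on a 20 A range: (reference A, measured A).
points = [(1.0, 1.02), (10.0, 10.05), (19.0, 18.90)]
result_loose = passes(points, full_scale=20.0, limit_pct=1.0)
result_tight = passes(points, full_scale=20.0, limit_pct=0.3)
```

Storing the per-point errors alongside `cal_id` and a timestamp gives the audit trail something to compare against after a FRU swap.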
Protection correctness
| Fault / scenario | Expected response | Pass criteria | Evidence to store |
|---|---|---|---|
| Short / hard overcurrent | Fast trip and safe isolation | Trip occurs within expected window; branch enters TRIPPED | trip_type, trip_time tag, pre/post snapshot |
| Overload | Alarm first (if designed), then trip if sustained | Alarm thresholds correct; no false flapping | severity, time-over-limit, threshold_version |
| Overtemperature | Derate/shed path before hard shutdown (if applicable) | Actions follow policy without oscillation | action_id chain, temp trend slice, lockout_reason |
| Inrush vs trip discrimination | Avoid tripping on expected startup inrush | Short-window shows peak; trip only when abnormal | peak marker, window settings tag, correlation_id |
OOB resilience
| Scenario | Steps | Pass criteria | Evidence to store |
|---|---|---|---|
| In-band network down | Verify read-only access + incident export via OOB | Telemetry and logs remain accessible | export_id, timestamps, correlation_id |
| Host unresponsive | Attempt allowed L1/L2 actions under interlocks | Actions logged; outlet state transitions correct | action_id, result/reason_code, pre/post snapshot |
| Mgmt port switching | Validate connectivity under port change events | No silent loss of audit trail or control | link events, access logs, actor/role |
| Permission audit | Try actions by role (read vs control vs admin) | Denials are explicit and logged | actor, role, DENIED reason, correlation_id |
Field maintainability
| Check | Method | Pass criteria | Evidence to store |
|---|---|---|---|
| Sensor disconnect detection | Unplug sensor; confirm explicit alarm and validity flag | No silent “good” state; alarms are traceable | sensor_id, validity flags, timestamps |
| Harness mis-plug prevention | Verify keyed connectors/labels; inspect strain relief | No ambiguous connector paths during service | service checklist ID + photo reference tag |
| FRU replace + auto-identify | Replace FRU; confirm asset/FRU identity and policy association | Correct IDs, threshold version, and calibration state visible | asset_id, fru_id, threshold_version, cal_id, audit log |
H2-11 · BOM / IC selection criteria (with concrete part numbers)
A Micro Edge Datacenter Rack succeeds or fails on measurable trust: metering that remains accurate across drift and transients, protection that trips predictably without nuisance events, and OOB control that stays alive when the site is degraded. The approach below is criteria-first, with concrete IC examples as shortlisting starting points.
1) Monitoring AFE / ADC: select for “truth under transients”
Rack metering must stay meaningful under bursty loads (PSU inrush, fan spin-up, step-loads). Selection starts from what must be captured (energy/trends vs short-window peaks) and what dominates error (drift vs aliasing vs layout sensitivity).
| Criterion | What to specify | How to verify (rack-level) |
|---|---|---|
| Measurement set | V/I/P/E, per-branch profile, peak/inrush marker, timestamps | Require evidence fields: pre/post snapshot + peak window + accumulated energy |
| Front-end type | Shunt+amp+ADC vs digital power monitor; Hall only when isolation/low loss dominates | Compare drift at temp corners and noise at low current; observe saturation behavior |
| Dynamic capture | Simultaneous sampling (correlation), programmable averaging, alert/threshold engine | Replay step-load/inrush; confirm peaks are not hidden by averaging |
| Accuracy & drift | Gain/offset, tempco, long-term drift; calibration support (factory + in-field) | 2–3 point calibration; validate after FRU swap; log calibration state |
| Common-mode & range | Max bus/common-mode voltage, shunt full-scale, fault overvoltage tolerance | Apply bus excursions (within spec); confirm no latch-up and that telemetry remains valid |
| EMI / layout sensitivity | Kelvin routing needs, input filtering, anti-alias strategy, ground/return constraints | Check reading stability under switching noise; compare against a reference meter |
| Digital interface | I²C/SMBus or SPI; addressability; CRC/PEC; interrupt options | Bus fault injection (brownout/stall); confirm graceful recovery and error counters |
- TI ADS131M04 — 4-ch, 24-bit simultaneous-sampling ΔΣ ADC (useful for time-correlated multi-rail capture).
- Analog Devices ADE9000 — energy metering / power-quality monitoring IC (useful for “power-quality aware” evidence).
- TI INA228 — precision digital power/energy monitor (strong for shunt telemetry on higher bus voltages).
- TI INA4230 / INA4235 — quad-channel current/voltage/power/energy monitors (dense rail monitoring via SMBus/I²C).
- Microchip PAC1934 — 4-channel power monitor (good for multi-rail health and energy dashboards).
2) eFuse / High-side / Hot-swap: treat each outlet as a governed domain
Branch protection is also an operable state machine: controlled turn-on, transient tolerance, deterministic fault response, and readable fault telemetry that feeds the rack evidence chain.
| Criterion | What to specify | How to verify (rack-level) |
|---|---|---|
| Protection model | Current limit vs breaker; latch-off vs auto-retry; inrush blanking timers | Inrush + overload tests; confirm the branch survives expected inrush and trips on true faults |
| SOA & thermal | RDS(on), package thermal resistance (θJA), board copper; dissipation under worst load | Thermal soak at sustained load; verify no unexpected thermal foldback or shutdown |
| Short-circuit behavior | Fast trip response, peak current, fault energy, foldback strategy | Hard-short tests with controlled wiring; check trip-time repeatability |
| Reverse blocking | Reverse current blocking when back-fed loads exist | Backfeed scenario test; ensure no phantom power paths |
| Telemetry hooks | IMON/diagnostic, PG/FLT, fault codes; “why it tripped” visibility | Map fault codes to logs; capture pre/post snapshots automatically |
| Control policy | Remote enable/disable, sequencing constraints, minimum off-time | Remote cycle loops; verify lockouts and rate limits prevent oscillation |
- TI TPS25982 — 2.7–24V smart eFuse (adjustable fault management + current monitoring).
- TI TPS2660 — higher-voltage eFuse class option (useful when branch bus voltage is higher).
- TI TPS25947 — mid-voltage eFuse / protection switch option.
- Analog Devices LTC4215-1 — hot-swap controller (external MOSFET + configurable behavior).
- Analog Devices LTC4368 — UV/OV and reverse-supply protection controller with bidirectional circuit breaker (useful when bus events dominate).
- onsemi NCP45520 — protected load switch / power switch option for protected distribution.
- Infineon BTS7002-1EPP — smart high-side switch (integrated diagnostics + protection).
- ST VNQ7050AJ — multi-channel high-side driver with diagnostics (verify rail voltage/current fit).
3) Environmental sensing & fan telemetry: make alarms actionable
Environmental telemetry is useful only if alarms represent real risk. Selection should be driven by cable-length tolerance, EMC resilience, drift, and how each sensor maps to a rack-level action (derate, lockout, isolate).
| Sensor / function | Selection criteria | Verification & evidence |
|---|---|---|
| Temperature | Accuracy + drift, conversion time, interface robustness, fault detect (open/short) | Inlet/exhaust correlation; drift check at enclosure temperature corners |
| Humidity | Long-term stability, hysteresis behavior, packaging for condensation risk | Step humidity; confirm alarm does not chatter around thresholds |
| Door / tamper | Debounce, ESD robustness, event timestamping, wiring practice | Open/close burst; verify audit log correlation to access windows |
| Fan tach / PWM | Fan count, tach inputs, PWM outputs, stall detect, rate-of-change control | Stall injection; verify deterministic alarm and optional derate action |
- TI TMP117 — high-accuracy digital temperature sensor (trustworthy absolute temperature).
- Sensirion SHT35 — temperature/humidity sensor (stable behavior for monitoring).
- Microchip EMC2305 — SMBus fan controller (up to five PWM fans; dense fan telemetry).
- Analog Devices MAX31790 — 6-channel PWM fan controller with RPM/tach monitoring.
4) OOB BMC / service MCU + trust anchors: select for survivability & auditability
The rack OOB controller must remain operable during in-band failures. Selection should prioritize always-on behavior, secure/verified boot, interface density, and log integrity under brownouts and resets.
| Criterion | What to specify | How to verify (rack-level) |
|---|---|---|
| Always-on domain | Boot time, brownout behavior, watchdog policy, low-power constraints | Brownout + power-cycle loops; confirm deterministic recovery and no log loss |
| Interfaces | SMBus/I²C fanout, SPI/QSPI flash, UART/console, GPIO for PG/FLT, Ethernet mode | Bus fault injection; confirm retry/backoff and counters recorded in logs |
| Secure boot chain | Signed images, anti-rollback, measured boot hooks (TPM/SE as needed) | Rollback attempt; verify rejection + audit record |
| Remote manageability | IPMI/Redfish feasibility, firmware update resilience | Interrupted update; confirm A/B or recovery path with config preserved |
| Log integrity | Append-only model, monotonic sequence, event IDs, retention on resets | Forced reset; confirm audit trail continuity and no event reordering |
- ASPEED AST2600 — BMC SoC (common OpenBMC target).
- ASPEED AST2500 — earlier-generation BMC option (legacy/cost-driven designs).
- NXP i.MX RT1170 — high-performance service MCU option for deterministic control loops.
- ST STM32H743 — service/control MCU option with strong interface set.
- Microchip ATSAME70J19 — Cortex-M7 service MCU option for telemetry/control.
- Infineon OPTIGA TPM SLB9670 — TPM for measured boot / attestation designs.
- Microchip ATECC608B — secure element for device identity / key storage.
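The log-integrity criterion (append-only model, monotonic sequence, continuity across resets) can be illustrated with a hash-chained log. Hash chaining is a technique swapped in here for illustration; the source does not prescribe it. The class and field names are hypothetical, and a real design would persist entries to non-volatile storage and anchor the chain in the TPM or secure element.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed placeholder digest for the first entry


def _digest(seq: int, prev: str, event: dict) -> str:
    # Canonical JSON (sorted keys) so the same content always hashes the same way.
    body = json.dumps({"seq": seq, "prev": prev, "event": event}, sort_keys=True)
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


class AuditLog:
    """In-memory append-only log; each entry commits to the previous digest."""

    def __init__(self):
        self.entries = []

    def append(self, event: dict) -> dict:
        seq = len(self.entries)  # monotonic sequence number
        prev = self.entries[-1]["digest"] if self.entries else GENESIS
        entry = {"seq": seq, "prev": prev, "event": event,
                 "digest": _digest(seq, prev, event)}
        self.entries.append(entry)
        return entry

    def verify(self) -> bool:
        """Detect reordering, gaps, or edited events by replaying the chain."""
        prev = GENESIS
        for index, entry in enumerate(self.entries):
            if entry["seq"] != index or entry["prev"] != prev:
                return False
            if _digest(entry["seq"], entry["prev"], entry["event"]) != entry["digest"]:
                return False
            prev = entry["digest"]
        return True
```

A forced-reset test can then call `verify()` after recovery: a gap or a reordered event breaks the chain instead of passing silently.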
5) BOM shortlisting worksheet (mechanical gate)
Before a part is “approved,” require matching evidence and repeatability tests aligned to rack operations.
| Block | Must-have specs | Evidence to collect | Example parts (starting points) |
|---|---|---|---|
| Metering | Drift, capture window, common-mode, calibration workflow | Step-load + inrush peak capture + drift report + calibration-state log | ADS131M04, INA228, INA4230/INA4235, PAC1934, ADE9000 |
| Branch governor | Trip timing, blanking, SOA, reverse blocking, fault telemetry | Short/inrush/overtemp injection + trip distribution + fault-code mapping | TPS25982, TPS2660, TPS25947, LTC4215-1, LTC4368, NCP45520 |
| Env & fan | Placement tolerance, debounce, stall detect, EMC tolerance | Alarm-chatter test + fan-stall test + door event correlation | TMP117, SHT35, EMC2305, MAX31790 |
| OOB control | Always-on behavior, secure boot, update resilience, log integrity | Brownout + interrupted update + audit trail continuity test | AST2600 (or split-control: RT1170/STM32H743 + AST-class BMC) |
Notes: “Example parts” are not endorsements; validate electrical limits, package thermal behavior, firmware ecosystem, and supply-chain constraints for the intended rack design.
H2-12 · FAQs (Micro Edge Datacenter Rack)
These FAQs focus on rack-level distribution, monitoring, OOB control, and evidence logging—without crossing into UPS/backup sizing or network dataplane topics.
FAQ 01 · How to draw the boundary between a Micro Edge Rack and “Edge Site Power & Backup”?
A micro edge rack is responsible for rack-level distribution and governable outlets: branch protection/control, trustworthy metering, environmental telemetry, OOB reachability, and an audit trail for remote actions. “Edge Site Power & Backup” covers energy sources and ride-through (48V front-end, UPS/battery/supercap capacity, charge/discharge strategy). If the question is about “how long it stays up,” it belongs to the backup/power page; if it’s about “which branch was cut and why,” it belongs here.
Related: H2-1
FAQ 02 · Why can current readings look normal, yet overcurrent trips happen frequently?
“Normal” readings often reflect averaged or low-rate telemetry, while protection reacts to fast peaks. Inrush and burst loads may be hidden by sampling windows, smoothing, or aliasing. Another pattern is sensor saturation or grounding/return noise that makes the monitor look stable while the eFuse/high-side comparator sees a real overcurrent. The usual fixes are to add short-window peak capture, check sensor validity flags, and correlate trip timestamps with pre/post electrical snapshots and fault codes.
FAQ 03 · Shunt resistor vs Hall sensor: how to choose for rack branch monitoring?
Shunt sensing is usually preferred for accuracy, linearity, and repeatable calibration—ideal for per-branch energy and trend evidence, but it adds insertion loss and needs good Kelvin routing. Hall sensing reduces loss and provides galvanic isolation, which helps when common-mode or safety isolation dominates, but it can suffer from offset drift, external field sensitivity, and bandwidth/saturation constraints. Choose by what must be trusted most: billing-grade energy/trends (shunt) or isolation/low loss and wide common-mode tolerance (Hall).
Related: H2-3
FAQ 04 · How to sample burst load / inrush without missing peaks?
Use a two-layer strategy: low-rate long-window sampling for energy/trends plus a high-rate short-window capture for bursts and inrush. Trigger short windows using dI/dt, power step thresholds, “pre-fault” flags, or outlet enable events. Keep a small ring buffer so each event records pre/post samples around the trigger, with aligned timestamps. In the evidence chain, peaks must be stored as explicit markers (window_id/peak_marker), not inferred from averaged telemetry.
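The two-layer strategy can be sketched with a small ring buffer that holds pre-trigger samples and a step trigger that opens a short window. This is a minimal illustration, assuming a simple sample-to-sample current step as the trigger; the class name, window sizes, and threshold are hypothetical, and `post` is assumed to be at least 1.

```python
from collections import deque


class BurstCapture:
    """Keeps a pre-trigger ring buffer and records a short window per burst."""

    def __init__(self, pre=4, post=4, step_threshold=5.0):
        self.pre = deque(maxlen=pre)      # ring buffer of the most recent samples
        self.post = post                  # samples to record after the trigger
        self.step_threshold = step_threshold
        self.active = None                # window currently being filled
        self.windows = []                 # completed windows
        self.last = None

    def push(self, t, amps):
        if self.active is not None:
            # Fill the open window and track its explicit peak marker.
            self.active["samples"].append((t, amps))
            self.active["peak_marker"] = max(self.active["peak_marker"], amps)
            self.active["remaining"] -= 1
            if self.active["remaining"] == 0:
                self.windows.append(self.active)
                self.active = None
        elif self.last is not None and amps - self.last >= self.step_threshold:
            # Current step detected: open a window seeded with pre-trigger samples.
            self.active = {"window_id": len(self.windows),
                           "samples": list(self.pre) + [(t, amps)],
                           "peak_marker": amps,
                           "remaining": self.post}
        self.pre.append((t, amps))
        self.last = amps


# Hypothetical trace: steady 2 A with a two-sample inrush burst, 1 ms spacing.
samples = [2.0] * 10 + [30.0, 12.0] + [2.0] * 10
cap = BurstCapture()
for i, amps in enumerate(samples):
    cap.push(i * 0.001, amps)
long_window_mean = sum(samples) / len(samples)
```

In this trace the long-window mean stays close to the steady-state current while the captured window keeps the burst peak as its own marker, which is the gap between averaged telemetry and what the protection device sees.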
FAQ 05 · How do constant-current vs foldback current limit affect remote power-up success?
Constant-current limiting tends to “push through” capacitive loads by charging them steadily, improving power-up success—but it can increase device heating and stress if the ramp lasts long. Foldback reduces current as voltage collapses, protecting silicon and wiring, but it can prevent a large load capacitance from ever reaching a valid voltage, causing repeated retries or latch-off. Remote success depends on the combined policy: current-limit curve, blanking time, soft-start ramp, retry budget, and cooldown lockouts.
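A toy simulation makes the difference concrete. It is a sketch under strong assumptions: the load is modeled as a capacitor plus a constant-current consumer that turns on above an undervoltage threshold, the foldback curve is linear, and every number (12 V bus, 10 mF, 3 A load, 5 A limit, 0.5 A foldback floor) is invented for illustration rather than taken from any part.

```python
def constant_current(v, v_in, i_lim=5.0):
    """Limiter delivers the full limit regardless of output voltage."""
    return i_lim


def foldback(v, v_in, i_lim=5.0, i_min=0.5):
    """Limiter current shrinks linearly as the output voltage collapses."""
    return i_min + (i_lim - i_min) * (v / v_in)


def simulate_startup(limit_fn, v_in=12.0, c_farads=0.01, i_load=3.0,
                     v_uvlo=4.0, dt=1e-4, t_max=0.5):
    """Return the time (s) to reach 95% of v_in, or None if never reached."""
    v, t = 0.0, 0.0
    while t < t_max:
        # Assumed load: draws i_load only once its input exceeds v_uvlo.
        load = i_load if v >= v_uvlo else 0.0
        i_src = limit_fn(v, v_in)
        # Euler step of C * dV/dt = i_src - load, clamped to [0, v_in].
        v = min(v_in, max(0.0, v + (i_src - load) * dt / c_farads))
        t += dt
        if v >= 0.95 * v_in:
            return t
    return None


t_cc = simulate_startup(constant_current)
t_fb = simulate_startup(foldback)
```

With these values the constant-current case reaches a valid voltage, while the foldback case stalls near the load's turn-on threshold because the folded-back current is below what the load draws there. That stall is what shows up remotely as repeated retries or latch-off.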
FAQ 06 · Why can remote power-cycle fail to recover a system and sometimes make it worse?
The usual causes are operational: the off-time may be too short for downstream rails to discharge, so the target never truly resets. Repeated cycling can accumulate thermal stress, trigger protection lockouts, or destabilize the rack’s own outlet state machine. Some sites also lack interlocks (temperature/door/alarm gating), allowing cycling during unsafe conditions. A safe playbook enforces minimum off-time, minimum interval between attempts, a retry budget with cooldown, and a clear rollback/escalation path when actions do not improve evidence.
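The pacing rules in that playbook can be sketched as a small policy object that decides whether a power-cycle request may proceed. All timing values and reason codes below are hypothetical defaults; the caller is assumed to hold the outlet off for `min_off_s` on every approved cycle.

```python
class PowerCyclePolicy:
    """Enforces minimum interval, retry budget, and cooldown lockout."""

    def __init__(self, min_off_s=10.0, min_interval_s=60.0,
                 retry_budget=3, cooldown_s=900.0):
        self.min_off_s = min_off_s            # off-time the caller must hold
        self.min_interval_s = min_interval_s  # spacing between attempts
        self.retry_budget = retry_budget
        self.cooldown_s = cooldown_s
        self.attempts = []                    # timestamps of approved cycles
        self.lockout_until = None

    def request(self, now: float) -> tuple[bool, str]:
        """Return (allowed, reason_code) for a power-cycle request at time now."""
        if self.lockout_until is not None:
            if now < self.lockout_until:
                return False, "COOLDOWN_LOCKOUT"
            # Cooldown elapsed: clear the lockout and start a fresh budget.
            self.lockout_until = None
            self.attempts.clear()
        if self.attempts and now - self.attempts[-1] < self.min_interval_s:
            return False, "MIN_INTERVAL"
        if len(self.attempts) >= self.retry_budget:
            self.lockout_until = now + self.cooldown_s
            return False, "RETRY_BUDGET_EXHAUSTED"
        self.attempts.append(now)
        return True, "OK"
```

Every denial carries a reason code so that a refused request appears in the audit trail rather than vanishing.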
FAQ 07 · What are the most common sources of false environmental alarms, and how to debounce?
False alarms usually come from poor sensor placement (too close to a transient hotspot or directly in turbulent airflow), noisy wiring/contacts, or thresholds without hysteresis and time windows. Debouncing needs three pieces: (1) time-window averaging or persistence timers, (2) hysteresis so a reading must “return far enough” before clearing, and (3) a simple state machine (OK/Warn/Critical) to avoid chatter. Every alarm should carry context: recent samples, sensor validity, and the threshold version used.
Related: H2-5
FAQ 08 · Must OOB use a dedicated management port? When is NCSI sharing better?
A dedicated OOB port minimizes shared failure domains and simplifies reachability during site incidents. NCSI sharing can reduce ports and cabling cost and may be acceptable when the shared path is highly reliable and has independent power/boot behavior. The deciding factor is failure-domain tolerance: if the shared in-band path failing would remove OOB access at the exact time it is needed, a dedicated port (or a well-defined redundant alternative) is the safer choice. Document the chosen mode and log failover events as first-class evidence.
Related: H2-6
FAQ 09 · During network outage or host crash, what “keep-alive domains” are required for OOB control?
OOB control requires a minimal always-on domain: an always-on power rail, the sensor/control buses (I²C/SMBus/GPIO) for outlet governors and alarms, a management link path (dedicated or controlled shared), local non-volatile storage for event logs, and a stable time base for ordered evidence. If any one is missing, symptoms appear quickly—unreachable management, missing fault context, or reordered/duplicated events after brownouts. Design the keep-alive domain as a separate “survivability contract,” then validate it with outage drills.
Related: H2-6
FAQ 10 · In the telemetry data model, which fields are mandatory for accountability?
Accountability requires “who, what, when, and with what evidence.” Minimum fields typically include: actor and role (who), action_id and branch_id (what), timestamp plus monotonic sequence/correlation_id (when/ordering), action_result and reason_code (outcome), and pre/post electrical snapshots with trip flags (evidence). Versioning is also mandatory: firmware_version, threshold_version, and cal_id so later audits can reproduce decisions. Without these, logs become “numbers without responsibility” and cannot support root-cause or compliance needs.
Related: H2-7
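The minimum field set can be written down as a record type with a completeness check. The field names follow the answer above where it names them; `sequence`, `pre_snapshot`, and `post_snapshot` are assumed spellings, and the types are illustrative rather than a defined schema.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class ActionRecord:
    actor: str              # who
    role: str
    action_id: str          # what
    branch_id: str
    timestamp: str          # when (wall-clock, e.g. ISO 8601)
    sequence: int           # monotonic ordering across resets
    correlation_id: str
    action_result: str      # outcome
    reason_code: str
    pre_snapshot: dict      # evidence: electrical state before the action
    post_snapshot: dict     # evidence: electrical state after the action
    firmware_version: str   # versioning, so audits can reproduce decisions
    threshold_version: str
    cal_id: str


def missing_fields(raw: dict) -> list[str]:
    """List mandatory fields that are absent or empty in a raw log entry."""
    return [f.name for f in fields(ActionRecord)
            if raw.get(f.name) in (None, "")]
```

Running `missing_fields` at ingest time turns an incomplete record into an explicit finding instead of a gap that is discovered during an audit.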
FAQ 11 · How to prove a trip is a load issue, not measurement-chain or threshold-policy errors?
Use correlation and controlled re-test. First, align the trip timestamp with high-rate short-window captures to see whether a peak/inrush occurred. Second, check sensor validity, saturation flags, calibration state (cal_id), and the active threshold version at the time of the event. Third, run a repeatable injection test (step load or controlled inrush) and confirm the same trip type/time distribution appears. If the evidence differs, the issue is likely policy (thresholds/windows) or measurement integrity, not the load itself.
FAQ 12 · Which production/field tests quickly catch “potentially unreliable racks”?
Fast screening should target repeatability and survivability: (1) multi-point metering sanity plus a quick drift check at two temperatures, (2) step-load/inrush response with peak capture verified, (3) protection injections for overload/short/overtemp and trip repeatability, (4) OOB outage drills (network loss/host unresponsive) with audit trail continuity, and (5) sensor/harness fault checks (open/short detection). These tests catch the most common latent issues before a site visit becomes the debugging tool.
Related: H2-10