Lightweight MEC Storage Node (U.2/U.3 NVMe Backplane)
A Lightweight MEC Storage Node is a compact, low-power edge storage building block that focuses on three things: stable U.2/U.3 backplane operation, reliable PCIe links (often requiring re-timing), and provable power-loss consistency through PLP/hold-up evidence.
In practice, “ready” means drives never vanish under warm reboot or hot-plug, per-slot power events cannot reset the whole node, and every outage has a measurable hold-up window and a recorded flush outcome for remote root-cause analysis.
H2-1 What a “Lightweight MEC Storage Node” is — and what it is not
Boundary + acceptance criteria
This chapter pins down the engineering boundary of a lightweight edge storage node and defines “done” with measurable KPIs—so later sections stay vertical (backplane, PCIe stability, PLP) and do not drift into platform, networking, or timing topics.
Definition (task + constraints): A lightweight MEC storage node is a compact, low-power NVMe bay system designed for edge sites with weak airflow and imperfect power quality. Its job is to keep drives consistently enumerated, keep the PCIe path stable across temperature and servicing, and provide provable data consistency during unexpected power loss (a controlled flush window rather than “hope and pray”).
What it is not (scope boundary): It is not a full edge cloud platform or virtualization stack, and it is not a long-form network storage guide. If a host link exists (e.g., upstream PCIe to a compute node), it is treated only as a boundary interface—protocol stacks and service orchestration are out of scope for this page.
Three KPIs that actually define success (each KPI must have a measurable acceptance test):
| KPI | Engineering meaning | How to measure (acceptance) |
|---|---|---|
| Drive density | Not just “number of bays,” but serviceable bays with predictable hot-plug behavior and management sidebands (identify, locate, fault-isolate per slot). | Enumerate all bays across cold boot/warm reboot/hot-plug cycles; verify per-slot present/reset behavior and management visibility (slot ID, fault LED control, temperature readout where available). |
| Power ↔ thermal | Edge nodes throttle long before they fail: the target is thermal stability under weak airflow and dust-prone environments, including retimer area temperature. | Thermal soak at elevated ambient; record SSD controller temperature, retimer vicinity temperature, and throttling events; verify that performance degradation is bounded and does not trigger link retrains or drive dropouts. |
| Power-loss consistency | The requirement is a provable flush window that protects critical metadata (mapping/journals) so acknowledged writes are not lost across sudden outages. | Power-cut injection under defined write workloads; verify recovery consistency and collect evidence (flush completion flags, event logs, or controlled write barrier results). Repeat across temperature and component aging margin. |
Design responsibilities (what this page owns): U.2/U.3 backplane management, hot-plug/reset sequencing, PCIe retiming strategy and validation, per-slot power protection and inrush control, PLP hold-up design for a flush window, and telemetry that proves root cause remotely.
H2-2 Workload & failure budget: what breaks edge storage first
I/O shape → risk chain → measurable budget
The goal is to translate edge I/O shapes into a failure budget that can be monitored and validated. Most “random drive issues” at the edge trace back to link margin, thermal behavior, power transients, or an insufficient power-loss flush window—not NAND wear.
Edge I/O patterns (described only by I/O shape, not by application stack):
- Small-write heavy (journaling / metadata): 4K random writes, bursty queue depth, sensitive to write barriers and flush timing.
- Read-dominant with write bursts (cache fill/evict): sustained reads, intermittent bursts that stress inrush and thermal headroom.
- Mixed steady-state: sustained heat generation, tail latency sensitivity, and temperature-dependent link stability.
Key engineering insight: Failures rarely start at “SSD wear.” They start at the edges of the system—the PCIe channel margin (retimers/connectors), the thermal envelope (weak airflow), and the power event profile (transients + outages). Designing the node means budgeting these edges and proving them with evidence.
Failure budget table (use as a design target + acceptance checklist):
| Budget axis | What to budget (measurable) | How to collect evidence | Typical corrective action |
|---|---|---|---|
| Link | Drive enumeration failure rate; PCIe correctable error rate; link retrain count; temperature correlation of errors. | PCIe AER counters; link state transitions; retrain logs per slot; boot/enumeration timing traces. | Retimer placement/setting; tighten reset sequencing; enforce lane speed policy; isolate marginal slots. |
| Thermal | SSD controller temperature; throttling events/hour; retimer vicinity temperature; performance drop bound after soak. | NVMe SMART temperature + throttle flags; board sensors near retimers; periodic summaries stored in logs. | Improve heat path; set proactive throttling; re-balance slot power; alarm before link becomes unstable. |
| Power events | 12V droop minimum during inrush; per-slot overcurrent trip rate; reset/PG jitter; brownout occurrence. | Per-slot current/voltage telemetry; PG/RESET state logs; event timestamping for “drive missing” incidents. | Inrush shaping; per-slot power gating; retry policy; tighten sequencing and debounce thresholds. |
| Power-loss consistency | Guaranteed flush window duration (ms) under worst-case load; recovery consistency pass rate. | Power-cut injection tests; flush completion evidence; post-recovery integrity checks logged with timestamps. | Increase hold-up energy; reduce critical-rail load; enforce write barriers; improve evidence logging. |
How to read the budget: each axis should have a target envelope and an alarm envelope. If the alarm envelope is crossed, the node should fail “gracefully” (slot isolation, reduced speed, controlled throttling) and leave a clear evidence trail—rather than exhibiting intermittent, untraceable behavior.
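The target/alarm envelope idea above can be sketched as a tiny classifier. This is an illustrative sketch, not a product policy: the function name, the three outcome labels, and the example thresholds are all assumptions.

```python
# Hypothetical sketch: classify one measured budget-axis value against a
# target envelope and an alarm envelope. Thresholds are illustrative.

def classify_budget(value, target_max, alarm_max):
    """Return 'ok', 'degrade', or 'alarm' for one budget axis.

    'ok'      : inside the target envelope.
    'degrade' : past target but inside the alarm envelope, so the node
                should act gracefully (slot isolation, reduced speed,
                controlled throttling) and log evidence.
    'alarm'   : alarm envelope crossed; isolate and leave a clear trail.
    """
    if value <= target_max:
        return "ok"
    if value <= alarm_max:
        return "degrade"
    return "alarm"

# Example: correctable AER errors per hour on one slot.
print(classify_budget(3, target_max=5, alarm_max=50))
print(classify_budget(120, target_max=5, alarm_max=50))
```

The point of the sketch is that each axis carries two thresholds, so "graceful" behavior has a defined trigger rather than being left to intermittent symptoms.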
H2-3 NVMe backplane architecture: signal, power, and sideband in one picture
Three layers, one serviceable system
The backplane must be treated as three parallel layers—high-speed lanes, per-slot power, and management sidebands. Mixing them causes “mystery failures” that cannot be debugged remotely.
Core idea: A U.2/U.3 backplane is not only a connector board. It is the physical place where link stability, slot power behavior, and field serviceability either become predictable—or become guesswork.
The three “parallel worlds” inside a U.2/U.3 backplane:
- High-speed differential (PCIe lanes): determines whether lanes train quickly, hold margin over temperature, and avoid retrains.
- Power distribution (12V / 3.3Vaux / segmented switches): determines inrush, droop, per-slot fault isolation, and reset stability.
- Management sidebands (SMBus/I2C, SGPIO/SES, PRSNT#/PERST#): determines whether slots are serviceable, diagnosable, and remotely controllable.
Backplane layered checklist (signals that must be measurable or at least loggable):
| Layer | Must-have | Why it matters | Typical evidence / test point |
|---|---|---|---|
| Sideband | PRSNT#, PERST#, SMBus/I2C | Slot state must be unambiguous; reset must be controlled; management must be able to read key status (temperature/IDs where available). | Presence transitions; reset release timing; SMBus polling success rate; slot-identification and fault indication behavior. |
| High-speed | REFCLK distribution | Reference clock quality and topology directly affect training robustness; “works once” is not a stability guarantee. | Enumeration time distribution; link speed consistency; temperature correlation of error counters (detail in H2-4). |
| Power | 12V rail integrity; 3.3Vaux semantics | Power droop and inrush shape frequently trigger reset jitter, partial enumerations, and intermittent missing drives. | Minimum rail voltage during insertion; per-slot power switch event logs; PG/RESET stability checks. |
| Sideband (recommended) | CLKREQ#/WAKE#; LED (Locate/Fault); SGPIO/SES | Improves low-power behavior and service workflows; enables fast slot localization and consistent enclosure-style management. | Controlled wake behavior; field replacement time improvement; consistent fault isolation per slot. |
H2-4 PCIe re-timing on a backplane: when you need it and how you validate it
Decision rules + evidence
Retiming is not a “nice-to-have.” In multi-bay, connector-rich, temperature-stressed edge enclosures, PCIe stability must be proven with measurable evidence (enumeration tail, AER counters, retrain frequency, and temperature correlation).
Retimer vs. redriver — the practical boundary:
| Item | Redriver | Retimer |
|---|---|---|
| What it does | Boosts/equalizes the signal but does not recover timing. | Performs CDR recovery and re-times the link for stronger margin across a long channel. |
| Where it wins | Shorter channels, fewer connectors, lower temperature swing, lower generation speed pressure. | Long channels, multiple connectors, higher density, higher temperature, Gen4/Gen5 sensitivity. |
| Typical symptom when insufficient | “Works in lab, fails in the field” with temperature-dependent errors and sporadic retrains. | When designed correctly, error counters and retrain tails collapse and become predictable. |
| Design coupling | Lower coupling, simpler configuration. | Higher coupling to REFCLK, reset sequencing, and power integrity; must be validated as a system. |
When retimers become “required” (trigger conditions):
- Many bays / high density: parallel high-speed routes and tighter spacing increase crosstalk risk and margin variability.
- Connector-rich channel: each connector adds loss/reflective discontinuities; link training tails widen.
- Long routing / complex backplane: more vias and longer traces increase insertion loss and ISI.
- High temperature / weak airflow: drift pushes marginal links over the edge; errors track temperature.
- Generation speed pressure (Gen4/Gen5): a “barely passes” design at room temperature becomes unstable at field conditions.
Decision rule: The case for retimers is strongest when evidence shows a long-tail problem: sporadic training failures, large variance in enumeration time, rising AER counters under temperature, or recurring retrains under sustained load.
“Retimer needed” measurable checklist (expressed only as evidence, not opinions):
| Metric | What it indicates | How to collect |
|---|---|---|
| Link training failure rate | Channel margin is insufficient or unstable across boots / servicing events. | Repeat cold boot, warm reboot, and hot-plug cycles; record success rate per slot. |
| Enumeration time tail | Long-tail training suggests weak margin even if average looks acceptable. | Log enumeration time distribution (mean + p95/p99) across slots and temperatures. |
| AER correctable error rate | Correctable errors are early warnings; rising rate predicts retrains or dropouts. | Collect PCIe AER counters per slot, trend over time and load windows. |
| Retrain frequency | Retrains are self-healing attempts; frequent retrains are field instability. | Count link state transitions / retrains per day; correlate with temperature and load. |
| Temperature correlation | Strong correlation signals a marginal channel that drifts under thermal stress. | Plot error counters vs sensor readings near retimers and drive areas (trend evidence). |
H2-5 Reset/Hot-plug sequencing: the hidden cause of “random missing drives”
State machine + scope-proof checklist
Many “random missing drive” incidents are sequencing failures: presence and power events do not settle, PERST# releases too early, or droop-induced reset jitter breaks link training. The fix is a controlled, evidence-based sequence.
Minimum hot-plug / power-on sequence: Slot power stable → PRSNT# debounced → REFCLK stable → PERST# held low → PERST# release → Train → Enumerate. Each arrow must have a measurable condition, not an assumption.
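The sequence above can be expressed as an ordered prerequisite list, where PERST# release is gated on every earlier condition. This is a minimal sketch of the gating logic only, not firmware: the predicate names are invented for illustration, and real measurement of each condition is out of scope here.

```python
# Minimal sketch of the hot-plug sequence as ordered prerequisites.
# PERST# may only be released after every earlier step measurably holds.

SEQUENCE = [
    "slot_power_stable",   # 12V settled, no droop (probe point P1)
    "prsnt_debounced",     # presence stable after debounce (P4)
    "refclk_stable",       # clock valid before training (P5)
]

def may_release_perst(status):
    """PERST# release is allowed only when all prerequisites are true."""
    return all(status.get(step, False) for step in SEQUENCE)

def first_blocker(status):
    """Name the first unmet prerequisite (useful as log evidence)."""
    for step in SEQUENCE:
        if not status.get(step, False):
            return step
    return None
```

Logging `first_blocker` per attempt turns "drive intermittently missing" into a named, timestamped break point instead of a guess.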
Common field symptoms and what they usually mean:
| Symptom | Likely break point | Evidence to check |
|---|---|---|
| Drive intermittently missing | PRSNT# contact bounce; PERST# released before power/clock settles; droop forces partial reset. | PRSNT# transitions vs time; PERST# hold/release alignment to slot 12V; PG/RESET jitter events. |
| Enumeration sometimes very slow | Training retries caused by weak margin or unstable prerequisites (clock/reset readiness not consistent). | Enumeration time distribution (tail); retrain count; temperature correlation of error counters. |
| After a dropout, re-attach is unstable | Recovery path is incomplete: power retry happens without a full reset window; sequence lacks a clean “return-to-known-state”. | Power retry timeline; PERST# low duration on retry; repeated toggles of presence and reset. |
Sequencing verification checklist (scope points + expected relationships):
| Probe point | What it proves | Expected relationship (relative) |
|---|---|---|
| P1 Slot 12V | Slot power reaches and holds a stable level (no droop that can re-trigger resets). | Must settle before PERST# release; must not dip during inrush enough to disturb PG/RESET behavior. |
| P2 Slot current | Inrush and step loads are bounded; confirms soft-start and limit behavior. | Peak and shape should align with configured soft-start/limit; repeated spikes imply retries or contact bounce. |
| P3 PERST# (connector side) | Endpoint reset is controlled and free of jitter induced by droop or noisy logic. | Held low long enough after prerequisites; released only after power + REFCLK are stable; re-asserted cleanly on retry. |
| P4 PRSNT# (connector side) | Presence is debounced; avoids multiple false insert/remove triggers. | Should transition once per event and then remain stable; bounce must be filtered before sequencing continues. |
| P5 REFCLK valid | Clock is present and stable before link training begins. | REFCLK must be stable before PERST# release; unstable clock leads to training retries and long tails. |
| P6 PG/RESET (if present) | Global resets are not being re-triggered by a single slot event. | Slot insertion must not create PG chatter; if it does, inrush shaping or isolation is insufficient (see H2-6). |
Practical acceptance: Run repeated cycles (cold boot, warm reboot, hot-plug) and verify that enumeration time tails shrink and retrains do not grow with temperature. Stability is defined by repeatability, not a single “passes once” run.
H2-6 Power tree for U.2/U.3 bays: current steps, inrush, and per-slot protection
Per-slot policy + droop control
Slot power design must be built around current steps and insertion inrush, not average watts. A single slot event should never pull the shared rail down enough to create reset jitter or enumeration failures.
Why “plug-in resets the whole box” happens: Insertion inrush draws a short current spike to charge effective capacitance. If the shared 12V distribution cannot hold, the droop can trigger PG/RESET chatter or disturb PERST# sequencing, producing “random missing drives”.
Current steps that matter at the bay level:
- Insertion inrush: dominant spike; shapes whether 12V droops and whether sequencing remains stable.
- Training/bring-up step: controller transitions to active; can expose marginal rail impedance.
- Runtime burst step: sudden write bursts can create repeated droop events if regulation is weak.
Estimation box: inrush and effective capacitance
How to estimate Cload (effective): measure insertion current waveform and integrate charge: C ≈ ∫I(t)dt / ΔV. Effective capacitance includes SSD input caps plus local decoupling on the backplane.
Design action: control dV/dt with per-slot soft-start, and limit peak current with an eFuse/high-side switch policy.
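The estimation box above (C ≈ ∫I(t)dt / ΔV) can be applied directly to a sampled insertion-current waveform. The sketch below uses trapezoidal integration; the waveform values are made up for illustration.

```python
# Sketch of the estimation box: integrate a sampled insertion-current
# waveform to get charge, then divide by the voltage swing to estimate
# effective load capacitance (C ≈ ∫I(t)dt / ΔV).

def effective_capacitance(current_a, dt_s, delta_v):
    """Trapezoidal ∫I(t)dt over uniformly sampled current, divided by ΔV."""
    charge_c = sum(
        (current_a[i] + current_a[i + 1]) / 2 * dt_s
        for i in range(len(current_a) - 1)
    )
    return charge_c / delta_v

# Illustrative waveform: about 2 mC delivered while the slot charges to
# 12 V, i.e. roughly 167 µF of effective capacitance.
i_wave = [0.0, 2.0, 4.0, 2.0, 0.0]  # sampled amps, 250 µs apart
print(effective_capacitance(i_wave, dt_s=250e-6, delta_v=12.0))
```

The resulting C feeds the soft-start slope choice: a known effective capacitance plus a chosen dV/dt bounds the charging current a slot can demand.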
Per-slot power policy (define behavior as configuration, not as hope):
| Policy field | Purpose | Operational outcome |
|---|---|---|
| Current limit | Caps inrush peak and prevents a single slot from collapsing the shared 12V rail. | Fewer PG/RESET disturbances; less PERST# jitter during insertion. |
| Soft-start slope | Controls dV/dt so the rail charges predictably without triggering undervoltage events. | Insertion becomes repeatable across SSD variations and temperatures. |
| Startup window | Defines how long a slot is allowed to ramp and settle before being flagged as abnormal. | Avoids false faults while still catching shorts or stuck loads. |
| Fault response | Specifies immediate off vs. limiting vs. retry for OC/UV/OT conditions. | Prevents “fault storms” and limits impact radius to one slot. |
| Retry + backoff | Controls how many times and how often a slot will retry after a fault. | Stops oscillations that repeatedly droop the system rail and disturb other slots. |
| Latched-off rule | Locks a bad slot off until a clear, logged service action occurs. | Improves stability and makes incidents diagnosable. |
| Telemetry + event logs | Records per-slot voltage/current/fault counters and reports them to the management MCU. | Turns “random resets” into evidence with timestamps and slot IDs. |
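The policy table above is, in effect, a configuration record. A hypothetical sketch of that record and its retry/latch-off behavior follows; every field name and default value here is an assumption for illustration, not a vendor register map.

```python
# Hypothetical per-slot power policy as explicit configuration, echoing
# the table: limits, retry + backoff, and a latched-off rule.
from dataclasses import dataclass

@dataclass
class SlotPowerPolicy:
    current_limit_a: float = 4.0      # caps inrush peak
    soft_start_ms: float = 5.0        # controls dV/dt on the slot rail
    startup_window_ms: float = 50.0   # ramp-and-settle budget
    max_retries: int = 3              # retry + backoff after a fault
    backoff_ms: float = 500.0         # grows with each retry
    latch_off_after_retries: bool = True

def next_action(policy, retry_count):
    """Decide the response to a repeated slot fault: retry with a growing
    backoff, then latch the slot off so it cannot droop the shared rail."""
    if retry_count < policy.max_retries:
        return ("retry", policy.backoff_ms * (retry_count + 1))
    return ("latch_off", None) if policy.latch_off_after_retries else ("limit", None)
```

Making the retry ceiling and latch-off rule explicit is what stops the oscillation case in the table: a bad slot retries a bounded number of times and then stays off until a logged service action.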
H2-7 Power-loss protection (PLP): turning “unexpected outage” into “provable consistency”
Energy budget + flush window
PLP is not about “never losing power.” It is about guaranteeing a minimum time window for the SSD to complete critical persistence (mapping table / journal / metadata flush) and producing evidence that the window existed.
Correct definition: PLP provides a measurable flush window after power-fail detection so that the drive can transition from a volatile state to a recoverable state. The goal is provable consistency, not “no outage.”
Two implementation paths (platform-level view):
| Path | What it provides | What must still be proven |
|---|---|---|
| Drive-integrated PLP | On-drive capacitors enable the SSD to finish internal commits under power loss. | That the platform detects power-fail cleanly, and that the SSD actually completes the critical commit path. |
| System hold-up | Backplane/node hold-up energy maintains critical rails long enough for controlled flush behavior. | Energy path priority, minimum rail voltage, and consistent window duration across temperature and aging. |
Usable engineering formulas (minimum set):
- Energy budget: E_hold ≥ P_load × t_hold (use critical-rail power, not whole-node watts).
- Hold time from stored energy: t_hold = (E_cap × η) / P_load, where η includes DC/DC efficiency and the usable capacitor voltage range.
Practical target: design for a flush window on the order of 10–50 ms, then validate the worst-case window under temperature and load. The final target is defined by the slowest critical flush path plus margin.
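The energy-budget formulas reduce to a few lines of arithmetic. In this sketch the usable capacitor energy over its voltage window is written out explicitly as ½·C·(Vmax² − Vmin²), and η is taken as converter efficiency only; the component values are illustrative, not a recommended design point.

```python
# Sketch of the PLP energy budget. Usable stored energy is what the
# capacitor releases between Vmax and its minimum usable voltage Vmin.

def usable_energy_j(c_farads, v_max, v_min):
    """Energy extractable as the capacitor discharges from Vmax to Vmin."""
    return 0.5 * c_farads * (v_max**2 - v_min**2)

def hold_time_s(e_cap_j, efficiency, p_load_w):
    """t_hold = (E_cap x eta) / P_load, per the formula above."""
    return e_cap_j * efficiency / p_load_w

# Example: a 10 F EDLC bank used from 2.5 V down to 1.5 V (20 J usable),
# an 85 % efficient converter, and a 4 W critical-rail load.
e = usable_energy_j(10.0, 2.5, 1.5)
print(hold_time_s(e, 0.85, 4.0))  # 4.25 s, far above a 10-50 ms flush target
```

Note the asymmetry the worked number shows: hold-up capacity is usually cheap relative to the 10–50 ms target, so the hard part is proving the window under worst-case temperature, aging, and load, not sizing it.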
What to log as proof (the “provable” part):
- Power-fail detect timestamp (PF event)
- Critical rail minimum voltage during the window
- Flush completion marker (or error marker) tied to the same event
H2-8 Write path policy: cache, flush, FUA, and what you can actually guarantee
Misconceptions + guarantee levels
“Consistency” is not a slogan. It depends on where data is still volatile (caches and queues) and on whether a flush window exists to force the critical commit path to complete under power loss.
Three common misconceptions (and the correction):
| Misconception | Why it fails | What is actually required |
|---|---|---|
| “UPS removes the need for PLP.” | UPS may not cover short droops, switch-over gaps, contact events, or rail-level collapses inside the node. | A measurable flush window at the critical rail and a defined power-fail detect trigger. |
| “Disabling write cache makes it safe.” | Volatility can still exist in multiple layers; safety depends on ordering and forced persistence. | Flush / FUA / barrier policies that actually bind to the critical commit path under PLP. |
| “RAID alone guarantees consistency.” | Redundancy does not automatically guarantee the last-moment metadata ordering or commit completeness. | End-to-end policy alignment and proof under power-fail events (Level 2). |
Guarantee levels (what can be claimed and proven):
| Level | What can be guaranteed | Key prerequisites | Proof artifacts |
|---|---|---|---|
| Level 0 (Best-effort) | No provable flush window. Behavior near outage is not bounded; recovery may require repair or rollback. | None. | Limited: post-mortem only. |
| Level 1 (Device PLP) | A defined flush window exists at the device/power layer; critical commits can be completed within a bounded time. | PLP window exists and is stable; PF detect is reliable. | PF timestamp + rail min voltage + flush marker. |
| Level 2 (End-to-end) | Ordering and persistence are stronger because policy forces the right flush semantics at the right layer. | Flush/FUA/barrier policy alignment + PLP covers worst-case commit latency. | Configuration proof + PF event logs + consistency checks. |
H2-9 Thermal & fanless constraints: why storage nodes throttle long before they fail
Heat path + throttle loop
Edge enclosures are often fanless or airflow-limited. The first visible symptom is usually throttling and link instability (not immediate hardware failure), because controller temperature and channel margin degrade together.
Edge reality: sustained high average temperature, weak airflow, and dust accumulation push SSD controllers into thermal limits. As throughput drops, queues build, power density rises, and the node can heat itself into a feedback loop.
What thermal stress looks like in the field:
- Throughput drop (throttle): long before an outright failure, the SSD reduces performance to protect itself.
- Long-tail latency: IO completes, but worst-case latency grows and spikes more often.
- More retrains at high temperature: channel margin can shrink as components drift with temperature.
Sensor placement: minimum set and recommended set
| Sensor | Physical location | What it explains | Why it matters |
|---|---|---|---|
| T1 SSD (minimum) | Near SSD controller region (not only the chassis). | Throttle events and long-tail latency. | Direct correlation to performance collapse. |
| T2 Retimer area (minimum) | Adjacent to retimer or hottest channel component. | Training retries and margin sensitivity. | Separates “channel drift” from “drive-only” issues. |
| T3 Chassis path (minimum) | Heat spreader / chassis hotspot near SSD mounting. | Conduction effectiveness. | Proves whether the heat path is working. |
| T4 Ambient / inlet (recommended) | Air inlet or enclosure ambient point. | External heat boundary. | Distinguishes “environment hot” vs “node self-heating.” |
| T5 Backplane hotspot (recommended) | Near slot power switch/current sense area. | Power→heat coupling. | Catches localized hotspots affecting stability. |
Operational practice: log temperature trends and correlate with throughput, queue depth, and retrain/AER. Thermal issues are best detected as a rising trend plus feedback-loop signatures, not as a single threshold crossing.
H2-10 Telemetry & field evidence: proving root cause remotely
Evidence schema + triage order
Remote diagnosis needs evidence that explains missing drives, retrains, power-loss consistency, and throttling. Keep telemetry focused: power → link → thermal → drive. Avoid protocol deep-dives and log what proves causality.
Design goal: turn “random” failures into a timeline with slot IDs and counters. A good log answers: what happened, where (slot), when, and what changed first.
Minimum evidence set (node-level, not a BMC spec):
| Category | What to record | What it proves |
|---|---|---|
| Link | PCIe AER counts, retrain/recovery count, enumeration time tail. | Channel instability and when it starts. |
| Thermal | SSD temp (controller), retimer temp, chassis/ambient trend. | Throttle trigger and temperature-correlated errors. |
| Slot power | Per-slot current peaks, OC/limit events, UV/PG-like events (if available). | Inrush/rail stress and whether a slot event destabilizes the node. |
| PLP proof | Power-fail timestamp, critical rail min voltage, flush marker/error marker. | Whether consistency window existed and whether the critical commit completed. |
Event log schema (recommended fields):
- timestamp (single time base), slot_id, event_type (power/link/thermal/plp), severity
- value + unit (temp_C, current_A, count), plus context (boot_id, fw_version, temperature bin)
Triage in that order because causality usually flows that way: power anomalies often create link symptoms, and thermal trends often amplify margin issues; “drive swap” should be the last step, not the first.
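The recommended schema and triage order can be sketched as a record type plus a sort key. Field names mirror the bullets above; the category list, severity labels, and sorting approach are illustrative assumptions, not a defined log format.

```python
# Sketch of the event-log schema and the power -> link -> thermal ->
# drive triage order from the text. Names are illustrative.
from dataclasses import dataclass

TRIAGE_ORDER = ["power", "plp", "link", "thermal", "drive"]

@dataclass
class NodeEvent:
    timestamp: float    # single time base
    slot_id: int
    event_type: str     # power / link / thermal / plp
    severity: str       # e.g. info / warn / alarm
    value: float
    unit: str           # temp_C, current_A, count
    boot_id: str
    fw_version: str

def triage_sort(events):
    """Order events so likely causes (power) are read before likely
    symptoms (link, thermal, drive), then by time within a category."""
    return sorted(events, key=lambda e: (TRIAGE_ORDER.index(e.event_type), e.timestamp))
```

Sorting an incident window this way answers the log's key question directly: what changed first, and in which slot.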
H2-11 · Validation & production checklist: what proves the node is ready
This section converts “it boots and shows drives” into a measurable, repeatable, and remotely provable readiness definition. The focus stays on storage-node reality: PCIe link stability, hot-plug/reset sequencing, per-slot power integrity, and power-fail consistency evidence (PLP/hold-up window + flush markers).
Lab → Factory → Field · Every item has: metric + method + pass + evidence · Includes example part numbers (BOM-ready)
1) Three-layer validation flow (same risks, different tools)
The same failure modes must be proven at three levels. Lab establishes “true margins”, factory turns them into fast screening proxies, and field confirms root cause with evidence trails.
Layer A — Lab validation (prove the physics)
- Worst-channel + worst-temperature link behavior (AER / retrain / tail latency).
- Power-fail injection across workloads + temperature + capacitor-aging equivalents.
- Hot-plug / reset sequencing with scope checkpoints (12V stable → PERST# release → train → enumerate).
Layer B — Production screening (fast, correlated proxies)
- Enumeration-time distribution (watch the “tail”, not the average).
- Short-window retrain/AER burst check under controlled stress.
- Sensor plausibility + quick thermal slope sanity checks.
Layer C — Field self-test & evidence (remote RCA)
- Event snapshots that support an RCA order: power → link → thermal → drive.
- Power-fail timestamps + “flush completed / not completed” markers (when available).
- Trend-based alarms (rising retrains, AER bursts, temperature slope changes).
2) DoD (Definition of Done) checklist — measurable pass/fail + evidence
Each DoD line below can be audited. Evidence should be saved as: CSV logs, event snapshots, and (when needed) scope captures from fixed test points on the backplane.
| Domain | Metric (what to measure) | Method (how to test) | Pass condition (what “ready” means) | Evidence + example material PN(s) |
|---|---|---|---|---|
| LINK / ENUM | Enumeration time distribution (cold boot, full bays) | Repeat cold boots; collect per-slot enumerate timestamps; compute P50/P95 and tail outliers. | No “missing drive” events; tail does not drift with temperature or bay count. | Evidence: enumerate-time histogram + per-slot timeline. PN: Retimer/Redriver options — PT4161LRS, PT5161LRS, BCM85657, PEX88T32, DS160PR810. |
| LINK | Retrain count (per link, per hour) + growth rate vs temperature | Run sustained I/O + thermal soak; log retrain counters; correlate with temperature sensors. | Retrains remain below alarm threshold; no temperature-triggered “retrain storms”. | Evidence: retrain counter trend + temp overlay. PN: Temp sensor TMP117; event memory MB85RC256V. |
| LINK | AER counts by severity (Correctable/Non-Fatal/Fatal) | Enable AER logging; run stress patterns; capture burst windows during high-temp and hot-plug. | No Fatal AER under qualified conditions; Correctable AER stays within defined budget. | Evidence: AER type histogram + timestamps. PN: Retimer options — BCM85657, PEX88T32; redriver DS160PR810. |
| SEQUENCE / HOT-PLUG | Power stable → PERST# release timing (per slot) | Scope on 12V(slot), 3.3Vaux, PERST#, PRSNT#; verify order and minimum delays. | PERST# never releases before power rails are within spec; no PERST# chatter during inrush. | Evidence: scope captures + measured Δt table. PN: Supervisor/reset — TPS3890; I/O expander (sideband) — TCA9535 (optional). |
| SEQUENCE | Hot-plug stability (insert/remove cycles) | Automated insert/remove cycles; monitor enumerate results, retrains, and AER bursts after each event. | No “random missing drives”; recovery time stays inside target window; no escalating retrain trend. | Evidence: per-cycle event log + counters. PN: Per-slot protection — TPS25982; current telemetry — INA226. |
| POWER / INRUSH | Inrush peak + 12V droop at insertion | Measure slot current and 12V droop during insertion across SSD variants; repeat at high temperature. | Droop never triggers global reset; inrush is bounded by configured limit and soft-start slope. | Evidence: current peak + droop plot; slot ID correlation. PN: eFuse/hot-swap — TPS25982; monitor — INA226. |
| POWER / PROTECTION | Short-circuit isolation per slot (no system-wide collapse) | Inject controlled fault at one bay; verify isolation, retry policy, and event reporting fields. | Fault remains local to the slot; retry count obeys policy; evidence captured with timestamp. | Evidence: slot fault record + recovery timeline. PN: TPS25982 (programmable fault response). |
| PLP / PF | Power-fail detect latency + timestamp integrity | Power-fail injection; measure PF detect signal timing and log timestamp generation path. | PF event is never missed; timestamps are monotonic; event is paired with a “flush status” marker if supported. | Evidence: PF timing + event log entries. PN: Supervisor/monitor — TPS3890; retention memory — MB85RC256V. |
| PLP | Hold-up window @ Vmin (energy really available) | Measure key rails during PF; compute time-above-Vmin for the SSD critical rails and controller rails. | Hold-up time meets the design target under worst case: high temperature + aged-cap equivalent + burst write. | Evidence: rail waveform + computed t_hold table. PN: Supercap example — HVZ0E106NF (10F/2.7V EDLC, as a building block). |
| PLP / POLICY | Flush marker semantics: “completed vs not completed” | During PF tests, require a binary completion marker (from SSD or controller) to be logged per event. | Marker is present and consistent; post-reboot checks match marker state (no silent corruption scenarios). | Evidence: marker + reboot validation record. PN: FRAM MB85RC256V for last-event retention; SPI flash (example) W25Q128JV for log storage. |
| THERMAL / FANLESS | SSD + retimer temperature slopes and throttle thresholds | Fanless/weak-airflow emulation; log temps at SSD case, retimer area, chassis; record performance vs time. | No thermal oscillation loop (throttle → queue → more heat); alarms trigger before instability. | Evidence: temp slopes + throttle events + counter overlays. PN: Sensor — TMP117; retimer options — PT4161LRS/PT5161LRS/BCM85657. |
| TELEMETRY | Evidence completeness for RCA (power → link → thermal → drive) | Trigger representative failures (droop, PF, thermal); verify that snapshots include the required fields. | Every critical event has: slot_id, timestamp, severity, counters, temps, and last PF/flush state. | Evidence: structured event schema + sample snapshots. PN: Current monitor INA226; FRAM MB85RC256V. |
| FACTORY | Fast screening proxy: enum tail + short AER/retrain burst window | Production script: cold boot → enumerate → 2–5 min stress → check counters + temps + slot current events. | Units exceeding proxy thresholds are quarantined for lab-level margin analysis. | Evidence: per-unit JSON/CSV test record. PN: Slot power protection TPS25982; sensor TMP117. |
| FACTORY / POWER PATH | Reverse/ORing protection health (if redundant feed exists) | Simulate feed loss / reverse polarity at the input stage; verify no backfeed and controlled switchover behavior. | No reverse current; switchover does not create PERST# chatter or link instability. | Evidence: input rail waveform + event record. PN: Ideal diode controller — LM74700-Q1 (with external N-MOSFET). |
3) Power-fail test matrix (must-cover combinations)
Power-loss validation must be treated as a matrix. “One outage test” is not a proof. The minimum matrix uses three axes: workload × capacitor condition × temperature.
- Workload: idle / steady write / burst write / metadata-heavy small writes (log-style).
- Capacitor condition: new / aged-cap equivalent (reduced usable energy and/or higher ESR behavior).
- Temperature: ambient / high-temp worst-case fanless / cold-boot edge cases.
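Enumerating the full matrix programmatically is a cheap way to prove no combination was skipped. The axis values below come from the matrix description; the Python spelling of each label is an assumption.

```python
# Sketch: enumerate the power-fail test matrix
# (workload x capacitor condition x temperature) so coverage is total.
from itertools import product

WORKLOADS = ["idle", "steady_write", "burst_write", "metadata_small_write"]
CAP_CONDITIONS = ["new", "aged_equivalent"]
TEMPERATURES = ["ambient", "high_temp_fanless", "cold_boot_edge"]

MATRIX = list(product(WORKLOADS, CAP_CONDITIONS, TEMPERATURES))
print(len(MATRIX))  # 4 x 2 x 3 = 24 required power-cut scenarios
```

Each tuple then maps to one injection run whose evidence (PF timestamp, rail minimum, flush marker) is filed under that scenario, which is what makes "one outage test" visibly insufficient.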
4) Reference material part numbers (example BOM building blocks)
The list below provides concrete orderable examples for the functions referenced by the DoD checklist. These are examples to make the page “BOM-ready”; final selection should match lane count, PCIe generation, and the target slot power envelope.
PCIe reach extension (Retimer / Redriver) — pick by channel loss & PCIe generation
- PT4161LRS — Astera Labs Aries PCIe Gen4 x16 Smart Retimer (server/backplane reach extension).
- PT5161LRS — Astera Labs Aries PCIe Gen5 x16 Retimer (higher rate, stricter margins).
- BCM85657 — Broadcom 16-lane PCIe Gen5 / CXL retimer (datacenter fabric retiming).
- PEX88T32 — Broadcom PCIe Gen4 retimer (retimer portfolio option).
- DS160PR810 — Texas Instruments PCIe 4.0 16Gbps 8-channel linear redriver (equalization, not full retiming).
Per-slot power, protection, and telemetry (hot-plug survivability)
- TPS25982 — TI smart eFuse / hot-swap protection (2.7V–24V, programmable current limit, fault response, current monitor).
- INA226 — TI 36V I²C current/voltage/power monitor (bus + shunt, programmable averaging/alerts).
- LM74700-Q1 — TI ideal diode controller (input ORing / reverse protection with external N-MOSFET).
Thermal + reset + power-fail evidence retention
- TMP117 — TI high-accuracy I²C temperature sensor (for SSD case, retimer area, chassis points).
- TPS3890 — TI precision supervisor/reset with programmable delay (RESET/PF detect style integration).
- MB85RC256V — Fujitsu I²C FRAM (last-event retention across brownouts).
- HVZ0E106NF — KEMET 10F / 2.7V EDLC supercapacitor (example building block for hold-up energy paths).
5) Figure K — Validation matrix (Scenario × Metric × Tool)
A single matrix prevents “coverage gaps”. Rows are real deployment scenarios; columns are the metrics that predict missing drives, link instability, and power-fail consistency risk.
H2-12 · FAQs (Lightweight MEC Storage Node)
These FAQs stay strictly within this page’s scope: U.2/U.3 backplane fundamentals, PCIe reach/re-timing, hot-plug/reset sequencing, per-slot power integrity, PLP hold-up evidence, fanless thermal constraints, and telemetry that enables remote root-cause isolation.