Edge CDN / Cache Node: NVMe RAID, PLP, and Thermal Control
An Edge CDN / Cache Node is “done” only when it delivers stable tail latency in both cache-hit and cache-miss paths, and can survive power loss, disk faults, and thermal throttling with recoverable metadata and field-proven evidence (logs + counters).
This page shows how NVMe/RAID/link/thermal choices map to measurable proof—so performance stays predictable and failures can be diagnosed quickly instead of guessed.
H2-1 · What an Edge CDN/Cache Node is (and what “done” means)
This section defines the cache node’s role boundary and a practical “done” definition based on three kinds of evidence: performance, integrity, and operability. The goal is not a generic CDN overview, but a field-verifiable acceptance standard.
An Edge CDN/Cache Node should be specified as an SLA machine: stable tail latency on cache hits, controlled behavior on misses, and recoverable operation under faults. It is not a general-purpose storage array.
- Hit delivery: serve from RAM/NVMe with predictable P95/P99 (not just average throughput).
- Miss-fill: fetch from origin, write data + metadata, commit safely, then serve without tail-latency collapse.
- Hotset stability: handle hot-content shifts without hit-rate oscillation that triggers origin storms.
- Fault tolerance: NVMe timeouts/resets, RAID degraded/rebuild windows, link errors, and thermal throttling must be recoverable.
- Explainability: any P99 regression or hit-rate drop must be explainable via evidence (counters, events, telemetry).
Out of scope here: UPF/slicing, switch queueing/TSN, time-sync systems (PTP/SyncE), and security appliances. If referenced, they must be link-only dependencies.
“Done” should not be a single peak throughput number. It should be defined as controllable P99 across operating states, recoverable integrity after disruptions, and fast root-cause closure using an unbroken evidence chain.
| Evidence | KPIs (examples) | Where to measure | Minimum proof |
|---|---|---|---|
| Performance | P95/P99 latency (hit vs miss separated); tail stability under concurrency & hotset shifts; hit-rate volatility vs origin ratio | App: latency histogram + hit/miss counters. NVMe: latency/timeout counters (time-windowed). NIC: CRC/FEC errors, retransmits, link flaps | Hit-only baseline + mixed-workload soak (≥24 h); step tests (burst + hotset shift) verifying P99 control |
| Integrity | Recovery time after reboot/outage (RTO); controlled degradation during RAID rebuild; no silent metadata/journal corruption signals | Events: reset/outage cause + recovery phases. RAID: degraded/rebuild state + error/repair counters. App: abnormal miss spikes, checksum/validation failures | Power-loss injection matrix (multiple phases); degrade/rebuild injection with service-impact trace |
| Operability | “Symptom → cause → action” closes within one time window; evidence chain does not break (errors → throttling → rebuild); alert thresholds and actions are explainable and auditable | Events: NVMe timeout/reset, RAID state, link flap. Telemetry: temperature/power, throttle flags, fan status. Counters: error rates (rate, not only totals) | Fault drills (disk fault, link errors, thermal trigger, abrupt power cut); postmortem evidence must point to a path segment |
Common anti-patterns that invalidate a “done” claim:
- Average-only metrics: peak throughput looks good while P99 becomes unexplainable under GC/rebuild/throttling.
- Hit-only testing: miss-fill commit and metadata writes blow up tail latency after deployment.
- Treating cache as a storage array: “never lose data” goals can slow recovery and amplify write pressure.
- Over-logging: observability itself increases write load and accelerates performance collapse.
- No injection tests: without power-loss/degrade/thermal drills, there is no recovery playbook or proof chain.
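The evidence discipline above (time-windowed percentiles, with hit and miss traffic kept separate) can be sketched in a few lines. This is a minimal illustration, not a production telemetry pipeline; the sample tuple format and the 60-second window are assumptions.

```python
from collections import defaultdict

def windowed_percentiles(samples, window_s=60):
    """Group (ts_seconds, latency_ms, is_hit) samples into time windows
    and report P95/P99 separately for hit and miss traffic, so a miss-fill
    burst cannot hide inside a blended average."""
    buckets = defaultdict(list)  # (window_start, "hit"/"miss") -> [latency]
    for ts, lat, is_hit in samples:
        key = (int(ts // window_s) * window_s, "hit" if is_hit else "miss")
        buckets[key].append(lat)

    def pct(values, p):
        vals = sorted(values)
        idx = min(len(vals) - 1, int(p / 100.0 * len(vals)))
        return vals[idx]

    return {
        key: {"p95": pct(v, 95), "p99": pct(v, 99), "n": len(v)}
        for key, v in sorted(buckets.items())
    }

# Example: 100 hit samples over 100 s; one slow outlier lands in window 60.
samples = [(i, 1.0, True) for i in range(99)] + [(99, 50.0, True)]
stats = windowed_percentiles(samples)
```

The point of the shape: a single 50 ms outlier is invisible in the mean but shows up as P99 = 50 ms in its window, which is exactly the signal the acceptance table asks for.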
H2-2 · Workload anatomy: hit path vs miss path (why NVMe behaves differently)
This section turns “hit/miss” into a measurable workload signature: path breakdown → bottleneck hypotheses → measurable knobs → evidence points → design implications. This avoids algorithm essays and keeps the focus on latency stability and recoverability.
An edge cache node often behaves less like a classic storage server and more like a network-driven latency system: the hit path is dominated by random-read tail latency and retransmits, while the miss-fill path is dominated by small metadata commits, write amplification, and background work that inflates P99.
- Hit path: RAM/NVMe read → cache response → NIC transmit. Primary risk: random-read tail + retry amplification.
- Miss-fill path: origin fetch → data write → metadata/journal commit → serve. Primary risk: commit points + GC/rebuild overlap.
| Path | Dominant pattern | Likely bottleneck | Evidence (what/where) | Typical “bad” signature |
|---|---|---|---|---|
| HIT | Hot random reads; light metadata writes | NVMe read tail; NIC errors/retries; CPU/IRQ jitter | App: hit-only P99 histogram. NVMe: latency + timeout counters. NIC: CRC/FEC, retransmits, flaps | Average looks fine while P99 spikes periodically; link error rate rises with tail-latency spread |
| MISS-FILL | Origin reads + writes; frequent small commits | Commit points; write amplification/GC; rebuild contention | App: miss rate + origin ratio. RAID: degraded/rebuild state & rate. NVMe: thermal throttle, resets, errors | Miss spikes → origin surge → P99 collapse; rebuild window correlates with long tails |
Scope rule: caching policies (LRU/LFU, etc.) should appear only as workload inputs (hit ratio, write fraction, metadata update frequency)—no algorithm deep dives.
Record these as numbers over a defined time window (minute/hour/day) so later NVMe/RAID/PLP/thermal choices are grounded:
- Object size distribution: P50/P90/P99 (bucketed is best).
- Hit ratio and volatility: average + amplitude of swings (not only a single mean).
- TTL and revalidation behavior: how often metadata commits are triggered.
- Concurrency & burstiness: steady flow vs burst factor, and its impact on P99.
- Read/write mix over time: periodic write bursts are a common P99 amplifier.
- Hotset churn rate: how fast hot content shifts and how the hit ratio responds.
- Origin RTT and failure rate: determines miss-heavy degradation shape.
- Backpressure behavior: when NVMe slows, does the system queue, shed load, cap writes, or amplify origin?
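One of the signature numbers above, hit-ratio volatility, is worth computing explicitly: a stable mean can hide oscillation that triggers origin storms. A minimal sketch, assuming per-window (hits, misses) counts as input:

```python
def hit_ratio_signature(window_counts):
    """window_counts: list of (hits, misses) per time window.
    Returns the mean hit ratio plus the swing amplitude (max - min),
    because the amplitude, not the mean, predicts origin storms."""
    ratios = [h / (h + m) for h, m in window_counts if (h + m) > 0]
    mean = sum(ratios) / len(ratios)
    return {"mean": round(mean, 3),
            "amplitude": round(max(ratios) - min(ratios), 3)}

# Two traces with the same mean but very different stability:
steady   = hit_ratio_signature([(90, 10)] * 6)            # 0.90 every window
swinging = hit_ratio_signature([(99, 1), (81, 19)] * 3)   # 0.99 / 0.81 swings
```

Both traces average 0.90, but only the second one will periodically push 19% of traffic to origin, which is the behavior the “hotset stability” requirement is meant to catch.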
These signatures set the design priorities for the sections that follow:
- NVMe/RAID: prioritize tail-latency stability and recoverability over peak bandwidth.
- PLP: focus on commit safety and metadata/journal survivability, not just “it reboots.”
- Link stability: CRC/FEC errors often surface as application tail latency rather than a hard link-down.
- Thermal/power: the most damaging failure mode is “no obvious error, but gradually slower.”
H2-3 · NVMe layout for cache: namespace, queueing, and latency traps
NVMe is not “fast by default” in cache nodes. The practical goal is controllable P99 under real cache behavior: hot random reads (hits), miss-fill writes, and frequent metadata commits. Layout and queueing must prevent these behaviors from contaminating each other.
In an edge cache node, “NVMe layout” is a method to isolate the I/O behaviors that inflate tail latency: hot reads, miss-fill writes, and metadata commits. The priority is not peak bandwidth; it is avoiding periodic P99 spikes caused by background work and throttling.
- Separate behaviors: keep hit-dominant reads away from write/GC pressure and commit bursts.
- Keep headroom: avoid “nearly full” steady state that amplifies write amplification and GC intensity.
- Measure by time windows: record P99 alongside NVMe temperature, timeouts/resets, and SMART events.
Use a layout that reflects cache-node traffic shapes. Namespaces/pools are useful when they help isolate the workloads that drive tail latency. The key is to prevent mixed read/write + commit bursts from turning into a shared P99 amplifier.
| Knob | Why it matters for P99 | What to verify (evidence) |
|---|---|---|
| Hot-read pool | Protects hit-path P99 from write amplification and GC bursts. | Hit-only P99 histogram stays tight during miss-fill activity. |
| Write/GC pool | Contains miss-fill write pressure so GC activity does not leak into hot reads. | P99 spikes correlate with this pool’s write bursts, not global hit traffic. |
| Metadata / journal pool | Commit points (small writes) often define tail behavior during mixed workloads. | Commit-rate changes align with P99 inflation and event logs (time-windowed). |
| Headroom policy | Low free space increases GC intensity, which creates periodic tail spikes. | P99 spikes reduce when usable free space is increased (A/B evidence). |
Scope rule: this section focuses on behavior isolation and proof. It does not explain PCIe retimer theory or storage-array architecture.
Queue depth and parallelism must be treated as tail-latency knobs. Excess concurrency can push SSDs into internal queueing, background work overlap, and temperature/power throttling—often without an obvious “error” at the application level.
- Trap (GC-driven periodic spikes): average throughput looks stable while P99 spikes repeat in cycles.
- Trap (mixed R/W + commits): as soon as miss-fill and metadata sync intensify, P99 spreads and recovery is slow.
- Trap (throttling): temperature/power limits cause gradual slowdowns (“no hard fault, just slower”).
Minimum proof pattern: perform a concurrency sweep and compare read-only (hit-like) vs mixed workloads; then correlate P99 changes with NVMe temperature and timeout/reset counters.
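The sweep comparison above can be sketched as a small analysis step over per-depth P99 results (as might come from fio runs). The queue depths, latency numbers, and the 2x “knee” factor are illustrative assumptions:

```python
def find_p99_knee(sweep, factor=2.0):
    """sweep: {queue_depth: p99_ms} from a concurrency sweep.
    Returns the first queue depth where P99 jumps by more than
    `factor`x over the previous step -- the point where extra
    concurrency stops helping and starts feeding internal SSD queueing."""
    depths = sorted(sweep)
    for prev, cur in zip(depths, depths[1:]):
        if sweep[cur] > factor * sweep[prev]:
            return cur
    return None

# Hypothetical results: read-only stays flat, mixed collapses at QD 64.
read_only = {8: 0.4, 16: 0.5, 32: 0.6, 64: 0.8}
mixed     = {8: 0.9, 16: 1.1, 32: 1.6, 64: 9.5}
```

If the knee appears only in the mixed run, and the same windows show NVMe temperature or timeout counter movement, the sweep has produced exactly the correlation evidence this section asks for.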
Use a three-layer evidence chain to attribute P99 inflation to NVMe (and avoid misattributing it to origin or networking):
- App layer: hit vs miss separated latency histograms (time-windowed) + request concurrency.
- NVMe layer: latency distribution, timeout/reset counters, SMART/media error signals, temperature and throttle flags.
- Exclusion layer: NIC error rate and origin RTT do not show the same time-window spike pattern.
High-confidence attribution: if P99 spikes align with NVMe thermal/timeout/SMART events, while NIC errors and origin RTT do not spike in the same window, NVMe is the dominant path segment.
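The three-layer attribution rule reduces to a set comparison per time window. A minimal sketch, where each argument is the set of window-start timestamps in which that signal was abnormal (how “abnormal” is thresholded is left out as deployment-specific):

```python
def attribute_spike(p99_windows, nvme_windows, nic_windows, origin_windows):
    """Apply the three-layer rule: NVMe is the dominant segment only if
    its events align with the P99 spike windows while the exclusion
    layer (NIC errors, origin RTT) stays quiet in those same windows."""
    verdicts = {}
    for w in sorted(p99_windows):
        if w in nvme_windows and w not in nic_windows and w not in origin_windows:
            verdicts[w] = "nvme-dominant"
        elif w in nic_windows or w in origin_windows:
            verdicts[w] = "not-nvme (check link/origin)"
        else:
            verdicts[w] = "unattributed (widen evidence)"
    return verdicts

v = attribute_spike(
    p99_windows={120, 180, 240},
    nvme_windows={120, 180},   # thermal/timeout events align here
    nic_windows={240},         # CRC/FEC spike explains this window instead
    origin_windows=set())
```

The same skeleton works for any dominant-segment question in this page; only the evidence sets change.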
H2-4 · RAID for cache nodes: what you protect (and what you don’t)
RAID in cache nodes should be specified around availability and rebuild behavior (SLA continuity), not as a blanket guarantee of “all data correctness.” Cache content is largely regenerable, but metadata/log integrity and operational recoverability must be protected by design and verified with drills.
- Device fault tolerance: continue serving during a disk failure (degraded mode is acceptable if controlled).
- Fast recovery: bounded rebuild time and bounded service impact while rebuilding.
- Consistency boundary: cache is regenerable, but metadata/log signals must not silently corrupt during disruption.
Key framing: RAID primarily addresses “disk fault availability.” It does not automatically guarantee commit ordering or eliminate all silent corruption risks.
| RAID helps protect | RAID does not automatically protect |
|---|---|
| Disk-failure availability (keep serving in degraded mode); capacity redundancy and rebuild pathways; operational continuity during single-disk faults | Commit correctness under abrupt resets/power loss; logical metadata/journal consistency by itself; all forms of silent corruption (without additional detection mechanisms) |
Use criteria that can be checked, measured, and audited. Avoid “RAID level debates” without workload context.
| Check | Criterion | How to validate (proof) |
|---|---|---|
| ✓ | Rebuild time is bounded (target window is explicit). | Measure rebuild duration under realistic background traffic (time-windowed impact). |
| ✓ | P99 remains controllable during degraded and rebuild states. | Track hit/miss separated P99 and origin ratio while forcing degraded/rebuild. |
| ✓ | Service impact controls exist (rebuild rate limiting / prioritization). | Demonstrate rebuild I/O caps and verify P99 improves when caps are engaged. |
| ✓ | Telemetry visibility (state, rate, errors) is complete. | Expose degraded/rebuild state, rebuild rate, error counters; verify alerting. |
| ✓ | Headroom is planned for rebuild and hotset changes. | A/B tests: lower headroom increases tail spikes; adequate headroom stabilizes P99. |
| ✓ | Write amplification awareness during rebuild windows. | Correlate rebuild with NVMe temp/throttle and latency; verify protection actions. |
| ✓ | Integrity boundary is documented (what RAID does not cover). | Postmortem template links cache integrity risks to evidence and recovery steps. |
Rebuild is not “background noise.” In cache nodes, rebuild competes with hit reads and miss-fill writes and can become a P99 amplifier. The service should expose explicit levers to keep the SLA intact:
- Rate limiting: cap rebuild I/O so customer traffic retains tail-latency headroom.
- Mode control: temporary read-only or write deferral when integrity risk or tail spikes exceed thresholds.
- Hotset protection: prioritize the hottest objects and prevent rebuild from evicting hot data patterns.
- Clear alerts: degraded/rebuild + NVMe thermal/throttle events must trigger deterministic actions.
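The rate-limiting lever above is a small control loop: trade rebuild speed for tail-latency headroom, window by window. This is a sketch under assumed units and thresholds (MB/s cap, millisecond P99 target); real arrays expose the cap differently (e.g. md sync speed limits):

```python
def adjust_rebuild_cap(cap_mbps, p99_ms, p99_target_ms,
                       cap_min=50, cap_max=400, step=50):
    """One control step per telemetry window. Never drops below
    cap_min, so rebuild always keeps forward progress."""
    if p99_ms > p99_target_ms:
        return max(cap_min, cap_mbps - step)   # SLA at risk: slow the rebuild
    if p99_ms < 0.8 * p99_target_ms:
        return min(cap_max, cap_mbps + step)   # headroom back: speed it up
    return cap_mbps                            # in band: hold

cap = 300
for p99 in (12.0, 12.0, 3.0):   # target 5 ms: two bad windows, then recovery
    cap = adjust_rebuild_cap(cap, p99, p99_target_ms=5.0)
```

The A/B evidence the checklist asks for falls out naturally: archive (cap, P99) pairs per window and show that P99 tightens when the cap engages.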
H2-5 · Power-loss protection (PLP): turning sudden outages into recoverable events
PLP should be judged by observable outage behavior, not by a checkbox. The goal is to prevent “half-written state” from degrading cache integrity into hit-rate anomalies and origin storms. A good design turns sudden power loss into a bounded recovery sequence with consistent evidence.
The most damaging outage failures are not “the box reboots.” They are integrity edge cases that silently change cache behavior:
- Risk (half-written metadata): object validity and indexing drift after reboot, causing an unstable hit ratio.
- Risk (journal/log gaps): missing evidence around the outage window makes the root cause unprovable.
- Risk (broken ordering): partial commits turn into cache erosion → miss spikes → origin surge.
Scope boundary: this section covers local PLP behavior (SSD PLP, local hold-up, and write-commit rules). It does not discuss site-level backup systems or 48V front-end hot-swap design.
| Layer | What it helps protect | What it does not replace |
|---|---|---|
| SSD PLP (or none) | Improves probability of completing in-flight writes inside the drive and reduces “half-write” outcomes. | Does not automatically define which cache state must be committed as a recoverable checkpoint. |
| Local hold-up | Converts an abrupt power cut into a short hold-up window for controlled write quiesce and final commits. | Cannot guarantee correctness if commit points are not explicitly managed (window alone is not a policy). |
| Write-commit rules | Ensures critical metadata/journal transitions become recoverable events instead of silent drift. | Does not eliminate the need for post-boot consistency checks and evidence alignment. |
Flush/FUA/write ordering are described here only as “when they are required,” not as protocol internals.
In cache nodes, strict commit is most valuable at recoverability boundaries rather than everywhere:
- Critical metadata transitions: when validity/index state changes could alter hit/miss behavior after reboot.
- Checkpoint moments: after a batch of objects becomes “serving-ready,” a recoverable point should be formed.
- Power anomaly signals: if a local power-loss indicator exists, writes should quiesce and finalize a safe boundary.
Overuse is harmful: committing everything can increase write pressure and amplify tail latency. The design should choose commit points that maximize recoverability per write cost.
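The “commit at boundaries, not everywhere” rule can be illustrated with a toy journal: appends stay buffered and cheap, and the `fsync` cost is paid only at an explicit recoverability boundary. The class and file layout are illustrative, not a real cache implementation:

```python
import os
import tempfile

class MetadataJournal:
    """Append journal entries cheaply; pay the fsync cost only at
    commit points (validity/index transitions, checkpoint boundaries),
    maximizing recoverability per write cost."""
    def __init__(self, path):
        self.f = open(path, "ab")
        self.uncommitted = 0

    def append(self, entry: bytes):
        self.f.write(entry + b"\n")   # buffered: no durability cost yet
        self.uncommitted += 1

    def commit(self):
        self.f.flush()
        os.fsync(self.f.fileno())     # durable point: survives a power cut
        self.uncommitted = 0

path = os.path.join(tempfile.mkdtemp(), "journal.log")
j = MetadataJournal(path)
for i in range(100):
    j.append(b"obj-valid %d" % i)    # a batch becoming "serving-ready"
j.commit()                           # one durable checkpoint for the batch
```

Committing after the batch instead of per entry is the trade-off this section describes: entries written after the last commit may be lost on power cut, but the cache state at every commit point is recoverable.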
Verification should use controlled outage injections across operating phases and should end with a consistency proof that aligns with the outage timeline. The objective is to prove: “outage → recovery sequence → stable cache behavior,” without unexplained hit-rate drift or origin storms.
| Injection phase | Injection type | Minimum evidence to collect (time-windowed) |
|---|---|---|
| Idle | Hard cut / fast drop | Boot timeline + reset reason + “clean startup” evidence; no unexpected integrity counters. |
| Hit-heavy | Hard cut | Hit P99 and hit ratio return to baseline quickly; no NVMe timeout/reset spike post-boot. |
| Mixed (R/W) | Hard cut / repeated short cuts | Consistency checks pass; any recovery mode (read-only/limited write) is logged and time-aligned. |
| Miss-fill burst | Hard cut | Post-boot hit ratio and origin ratio stabilize; no “miss explosion” pattern; log timeline is complete. |
| Degraded/rebuild present | Hard cut | RAID state and rebuild rate persist correctly; recovery does not cascade into long P99 collapse. |
Proof rule: if hit ratio and origin ratio stabilize after reboot, and integrity counters + outage logs align to the same time window, PLP is functioning as an engineering control—not a slogan.
H2-6 · Logging & evidence chain: proving integrity without over-logging
Logging in cache nodes must preserve the evidence chain for tail-latency and integrity events while avoiding “log-driven write amplification.” The solution is a minimum evidence set plus windowed metrics, with rate limits and ring buffers to prevent logs from becoming a new bottleneck.
These signals are sufficient to reconstruct most field incidents without crossing into security auditing or unrelated subsystems:
- Cache symptoms: hit ratio step changes, origin ratio, origin failure rate, object validation failure counters.
- NVMe events: timeout/reset counters, SMART/media error indicators, temperature and throttle flags.
- RAID state: degraded/rebuild state, rebuild rate, error counters (as evidence, not as an array design guide).
- Power/reset: reset reason and boot timeline (local ordering only; no PTP time required).
Key rule: keep signals in the same time window so correlation is possible without massive per-request logs.
Use three layers so evidence survives while I/O pressure remains bounded:
| Level | What to store | Why it is safe (does not write-storm) |
|---|---|---|
| L1: Event summary | State changes (power-loss, reboot phases, NVMe reset, RAID state change). | Low frequency, structured, time-aligned. |
| L2: Windowed metrics | P95/P99, hit/origin ratio, timeout counts, temp/throttle flags (per minute or per 5 minutes). | Bounded write rate; supports correlation and trend proof. |
| L3: Triggered detail | Short bursts of detail only when thresholds trip (P99 spikes, hit ratio cliff, repeated resets). | Guarded by rate limit; stored in a ring buffer; auto-downgrades under pressure. |
Guardrails: enforce rate limiting, use a fixed-size ring buffer, and add pressure-aware downgrades so logging never becomes the cause of P99 collapse.
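The L3 guardrails above (ring buffer, rate limit, pressure-aware downgrade) fit in a small class. This is a minimal in-memory sketch; capacities and limits are assumptions:

```python
from collections import deque

class TriggeredDetail:
    """L3 detail capture: bounded ring buffer + per-window flush limit,
    so observability itself can never become the write amplifier."""
    def __init__(self, capacity=1000, max_flushes_per_window=2):
        self.ring = deque(maxlen=capacity)     # oldest detail falls off
        self.max_flushes = max_flushes_per_window
        self.flushes_in_window = 0

    def record(self, line):
        self.ring.append(line)                 # O(1), bounded memory

    def new_window(self):
        self.flushes_in_window = 0             # rate limit resets per window

    def flush_on_trigger(self):
        """Called on a P99 spike or hit-ratio cliff. Returns the detail
        burst, or None if the rate limit already tripped this window."""
        if self.flushes_in_window >= self.max_flushes:
            return None                        # auto-downgrade under pressure
        self.flushes_in_window += 1
        return list(self.ring)

log = TriggeredDetail(capacity=3, max_flushes_per_window=1)
for i in range(5):
    log.record(f"req {i}")
burst = log.flush_on_trigger()    # last 3 entries only
again = log.flush_on_trigger()    # rate-limited: None
```

Because both memory and flush rate are bounded by construction, this layer cannot contribute to the write-amplification anti-pattern described above.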
Use this repeatable workflow to close incidents without relying on massive logs:
- Symptom: identify the primary symptom (P99 spike, hit ratio cliff, origin surge, validation failures).
- Time window: lock a short window where the change begins.
- Key counters: pull NVMe timeout/reset/SMART, RAID state/rate, and reset reason from the same window.
- Thermal/power correlation: check temperature/throttle flags and power-loss markers for alignment.
- Attribution: classify the dominant segment (NVMe path, rebuild/degraded state, commit/recovery boundary, or software path).
Output format: symptom + time window + the 3 strongest correlated signals + the chosen segment. This is usually enough to drive corrective actions.
- Anti-pattern: per-request verbose logs increase write pressure, worsen P99, and hide the original fault.
- Best practice: prefer event summaries + windowed metrics; use triggered detail with rate limits and ring buffers.
H2-7 · Ethernet PHY/retimers: link stability as a cache performance feature
In edge cache nodes, link-layer instability is not “just networking noise.” When PHY/retimer margins collapse, errors trigger correction and retransmission, CPU packet work increases, and P99 latency spreads. A stable link is therefore a measurable cache performance feature.
Cache nodes are often misdiagnosed as “storage-limited” because SSD metrics look busy. However, link instability creates a different signature: effective throughput drops while tail latency expands.
- Symptom (throughput ceiling): the headline link rate is available, but effective delivery stalls below expectation.
- Symptom (retransmission surge): loss/retry effects appear alongside a widening P99 distribution.
- Symptom (P99 divergence): averages remain acceptable while the tail becomes unstable and “spiky.”
- Symptom (CPU/interrupt pressure): packet-processing work increases, which further amplifies jitter.
Scope boundary: this section focuses on PHY/retimer evidence and validation. It does not discuss switch queues, TSN, or timing synchronization.
Treat link stability as an evidence chain. The following counters can connect a physical-layer issue to cache performance outcomes:
| Evidence bucket | What to watch | Why it matters to cache P99 |
|---|---|---|
| PHY / coding errors | CRC/FCS errors, FEC corrected/uncorrected, PCS/PMA error counters. | Error handling and recovery expands tail latency even when average throughput looks okay. |
| Link stability events | Link flap, training events, speed/width fallback indicators. | State changes create step-like performance drops and recovery waves. |
| Correlation proof | Port-to-port comparison, temperature correlation, same-window alignment with retrans and P99. | Correlation turns “suspected link issue” into a dominant segment attribution. |
Proof rule: if errors and retrans rise in the same time window as P99 spreads, and the effect follows a specific port/path or temperature, the link layer is not a background detail—it is the root cause segment.
The purpose is not to explain retimer internals. The purpose is to keep the cache node stable by designing for observability, isolation, and recoverable degradation.
- Port tiering: distinguish upstream/downstream roles so comparisons are meaningful and fault isolation is faster.
- Redundancy & degrade actions: if error thresholds trip, use bounded actions (port failover, service throttling, short read-only windows) with clear evidence.
- Path validation: treat retimer/cable/insertion-loss margins as testable. Swap ports/cables/modules and verify whether errors follow the physical path.
- Thermal correlation checks: run high-load soak tests to see whether errors grow with temperature and whether P99 expands in the same windows.
- Compare: run the same workload on different ports; does the error signature follow a port?
- Swap: swap the cable/module; do CRC/FEC counters move with the physical path?
- Soak: sustain traffic; do errors and retransmits increase with temperature?
- Align: check time-window alignment; errors ↔ retransmits ↔ P99 widening in the same windows.
- Prove: after corrective action, counters and P99 should converge, not just averages.
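The swap step reduces to a simple question about where the error rate reappears. A sketch of that decision, with hypothetical per-port CRC error deltas and a rough 50% threshold (both are assumptions, tuned per site):

```python
def errors_follow_component(before, after, src, dst, threshold=0.5):
    """before/after: {port: crc_error_delta_per_hour}, measured over equal
    soak windows around moving a cable/module from port `src` to `dst`.
    If most of the error rate reappears at `dst`, the moved component
    (cable/module/retimer margin) is implicated; if it stays at `src`,
    suspect the port/board side."""
    base = before.get(src, 0)
    if after.get(dst, 0) >= threshold * base:
        return "follows-path (cable/module/retimer margin)"
    if after.get(src, 0) >= threshold * base:
        return "stays-with-port (board/port side)"
    return "inconclusive (repeat soak)"

# Hypothetical counters: errors migrate with the module from eth0 to eth1.
verdict = errors_follow_component(
    before={"eth0": 240, "eth1": 0},
    after={"eth0": 2, "eth1": 230},
    src="eth0", dst="eth1")
```

Equal-length soak windows before and after the swap matter more than the exact threshold: comparing a 10-minute window to a 10-hour one invalidates the attribution.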
H2-8 · Thermal & power management: preventing silent throttling
The most expensive cache-node failures are often silent: the system is “up,” but performance degrades steadily due to thermal or power limits. Thermal/power management should therefore be designed as a closed-loop control that prevents throttling-driven tail-latency collapse and the resulting efficiency loss.
Silent throttling typically follows a repeatable chain. It is rarely visible in averages, but it is obvious in windowed P95/P99:
1. A temperature or power limit is reached.
2. NVMe and/or NIC throttling engages (reduced internal parallelism or guarded performance states).
3. Latency jitter increases and P99 becomes unstable.
4. Service efficiency drops; miss pressure may increase and load can rise further (feedback).
Scope boundary: node-level sensors and actions only. This section does not cover data-center HVAC, rack PDUs, or facility cooling design.
The goal is not “collect everything.” The goal is to collect a small set of signals that can be correlated in the same time window:
| Telemetry bucket | Signals | Correlation purpose |
|---|---|---|
| NVMe thermal | Drive temperature (multi-point if available), controller temperature, throttle flags. | Explains periodic P99 spikes and performance plateaus. |
| Power delivery | VRM temperature, node power draw / limit indicators. | Separates thermal throttling from power-limit throttling. |
| Airflow | Fan RPM, inlet/outlet temperature delta (ΔT). | Proves whether cooling response matches heat generation. |
| Performance window | Windowed P95/P99 latency and throughput, plus a small set of error counters. | Links throttling evidence to user-visible outcomes. |
Proof rule: throttling is confirmed when temperature/power-limit indicators and P99 widening align in the same time window, and P99 recovers after cooling/power actions.
Actions must be tied to thresholds and must have observable outcomes. Examples below focus on node-level controls:
| Trigger | Action | Expected evidence change |
|---|---|---|
| NVMe temp enters warning band | Adjust fan curve / increase airflow. | ΔT improves, temperature drops, throttle flags clear, P99 narrows. |
| Hotspot persists | Avoid hotspot placement (balance heat sources / reduce localized stress). | Peak device temperature decreases; P99 stops “stair-stepping.” |
| Power limit reached | Apply power cap or bounded throttling policy (controlled reduction vs random collapse). | Throttle events become fewer and more predictable; P99 spikes reduce. |
| Danger zone sustained | Degrade mode or maintenance action (short read-only window, service limiting, targeted intervention). | Prevents cache erosion and protects SLA; evidence shows controlled recovery. |
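The trigger/action table reads naturally as an ordered policy: evaluate bands in severity order and return one bounded action. A sketch mirroring the rows above; the temperature bands and power numbers are deployment-specific assumptions:

```python
def thermal_action(nvme_temp_c, power_w, power_limit_w,
                   warn_c=70, danger_c=80):
    """Evaluate node-level triggers in severity order and return one
    bounded action. Severity ordering matters: a sustained danger-zone
    temperature must win over a routine fan-curve tweak."""
    if nvme_temp_c >= danger_c:
        return "degrade-mode: short read-only window + maintenance alert"
    if power_w >= power_limit_w:
        return "apply power cap (controlled reduction, not random collapse)"
    if nvme_temp_c >= warn_c:
        return "raise fan curve / increase airflow"
    return "steady: keep windowed telemetry only"

a1 = thermal_action(72, 300, 450)   # warning band: airflow response
a2 = thermal_action(83, 300, 450)   # danger zone overrides the fan tweak
a3 = thermal_action(65, 460, 450)   # power-limit path, thermally fine
```

Keeping the policy this explicit is what makes the “expected evidence change” column auditable: each returned action maps to one observable outcome.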
- Soak: run sustained high load; verify temperature trends, throttle flags, and windowed P99 behavior.
- A/B (fan curve): confirm P99 and throttle rates improve without creating new instability.
- A/B (power cap): validate the “slightly lower peak throughput but much tighter P99” trade-off.
- Confirm: temperature/power indicators and P99 changes must align in the same time window.
H2-9 · Failure modes & fast triage: symptoms → likely causes → tests
Field triage on cache nodes should be evidence-driven and fast. The goal is to collapse a noisy symptom into one or two dominant segments (NVMe, RAID, link, thermal/power, or integrity), using node-local signals and a minimal set of actions that change the evidence within minutes.
- Lock the window: define T0→T1 where the symptom started (a step change, periodic spikes, or slow drift).
- Check five evidence buckets: cache KPIs, NVMe events, RAID state, thermal/power throttling, and link errors.
- Run one fast test: choose an action that should tighten P99 or stabilize KPIs quickly if the suspected bucket is correct.
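The three-step workflow can be sketched as a lookup from abnormal evidence buckets to a single fast test. The bucket names and the priority order (physical causes first, since they commonly explain the layers above them) are editorial assumptions, not a fixed standard:

```python
FAST_TESTS = {
    "thermal": "raise airflow briefly; P99 should tighten within minutes",
    "link":    "swap port/cable; errors should follow the physical path",
    "nvme":    "reduce write intensity; timeout/reset growth should stop",
    "raid":    "rate-limit rebuild; P99 should tighten while rebuild continues",
    "cache":   "short read-only window + consistency check",
}

def pick_fast_test(abnormal_buckets):
    """abnormal_buckets: the evidence buckets that changed inside the
    locked T0->T1 window. Returns one bucket and its fast validation
    action, walking physical-first priority order."""
    for bucket in ("thermal", "link", "nvme", "raid", "cache"):
        if bucket in abnormal_buckets:
            return bucket, FAST_TESTS[bucket]
    return None, "node-local evidence clean: investigate upstream dependencies"

bucket, action = pick_fast_test({"nvme", "cache"})
```

The empty-set branch encodes the scope boundary below: if no node-local bucket moved in the window, escalation leaves this page.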
Scope boundary: this triage map does not rely on upstream routing, UPF, or timing systems. If node-local evidence is clean, then external dependencies can be investigated elsewhere.
| Symptom | Evidence points (same time window) | Fastest validation action |
|---|---|---|
| Hit ratio drops suddenly | Integrity errors rise; metadata-related failures appear; NVMe reset/timeout events occur; temperature throttling flags increase; origin failure rate increases. | Short read-only protection window; run consistency check; align NVMe events with KPI change; raise cooling briefly to see if hit behavior stabilizes. |
| P99 spikes (avg looks normal) | NVMe latency tail widens; periodic spikes align with background write pressure; RAID rebuild/degraded state exists; link CRC/FEC counters and retrans rise; throttling flags appear. | Rate-limit rebuild/background work; reduce write intensity; temporary airflow/power cap change; confirm P99 tightens in the same window. |
| RAID degraded → performance collapse | Degraded/rebuild state starts at T0; rebuild speed is high; hottest drive becomes a hotspot; P99 spreads continuously; errors concentrate on one drive/path. | Enable rebuild rate limiting or time-slicing; protect hotspots (avoid single-drive pressure); validate P99 stabilizes while rebuild continues. |
| Frequent NVMe timeouts/resets | Timeout/reset counters increase; temperature/power-limit indicators align; occasional media/SMART anomalies present (high level); events cluster by slot/path. | Thermal correlation test (airflow up / ambient step); bounded power cap test; isolate by swapping slot/path; confirm events follow the physical path. |
| Throughput ceiling (resources not saturated) | Link-layer errors rise; retrans/packet loss observed; CPU packet work increases; NVMe P99 remains reasonable while network-side counters worsen. | Port-to-port comparison; swap cable/module; confirm errors and throughput move with the physical path, then lock in the stable path. |
| Origin surge / “miss storm” | Origin ratio step-up; origin failures/timeouts rise; integrity warnings appear; tail latency spreads; NVMe or link events may co-occur. | Apply bounded protection mode (rate limiting / short read-only window); re-check integrity counters; confirm origin ratio returns toward baseline. |
| Performance slowly drifts worse (hours) | Temperature trends upward; throttle events become more frequent; fan/ΔT response is inadequate; P99 slowly widens (not spiky). | Adjust fan curve and verify ΔT improves; apply power cap to prevent random throttle; confirm P99 distribution tightens over time. |
| Post-restart “works but behaves wrong” | KPIs do not return to baseline; integrity checks fail; event chain shows gaps around restart; background rebuild/cleanup persists longer than expected. | Validate event chain completeness; run consistency checks; reduce background pressure temporarily; confirm KPIs recover in a predictable window. |
| Random short freezes / stalls | Short bursts of timeout-like behavior; counters spike then recover; thermal or link counters show transient excursions; P99 shows narrow spikes. | Compare ports; tighten cooling; rate-limit background work; verify stall frequency drops and P99 spikes reduce. |
| “Everything is fine” except user reports lag | Averages look healthy, but windowed P99 is unstable; tail widens under specific load state (write-heavy or rebuild); throttle flags appear intermittently. | Switch to windowed P95/P99 dashboards; reproduce under controlled state; confirm which evidence bucket aligns with tail widening. |
Format discipline: each row is a three-part decision unit (symptom → evidence → action). If evidence does not align in the same time window, switch rows instead of forcing a favorite hypothesis.
Capture a minimal, consistent set of signals (the five evidence buckets above: cache KPIs, NVMe events, RAID state, thermal/power throttling, and link errors) so triage stays repeatable and comparable across sites. These signals are node-local and sufficient to isolate the dominant segment without referencing UPF, slicing, or timing systems.
H2-10 · Validation & production checklist: what proves it’s ready for field
“Ready for the field” is not a feeling. It is a set of repeatable acceptance actions with pass criteria and archived evidence. This checklist defines what to run before deployment so performance, integrity, and operability remain stable under real outages and degradations.
- Two performance baselines: a cache-hit baseline and a cache-miss (origin + write) baseline.
- One evidence archive: windowed P95/P99 curves, key counters, and an event chain that can reconstruct what happened.
Scope boundary: this checklist focuses on cache-node readiness only. Observability/security appliances and network probes have separate acceptance criteria.
| Test | Pass criteria (behavior) | Evidence to archive |
|---|---|---|
| Hit baseline | Throughput and concurrency meet baseline with a tight P99 distribution (no widening tail). | Windowed P95/P99, throughput, hit ratio, NVMe latency tail snapshot. |
| Miss baseline | Origin + write pressure does not trigger uncontrolled P99 expansion; behavior remains bounded. | Origin ratio, write intensity markers, P99 shape, NVMe events and throttling flags. |
| Concurrency ramp | Each step shows predictable scaling without sudden cliff behavior or periodic spike emergence. | Per-step P99 distribution, counters per evidence bucket, step-change timestamps. |
| Long-soak (24–72h) | No slow drift into throttling or persistent error growth; performance returns after transient events. | Temperature/power trends, throttle flags, RAID state, link errors, and P99 over time. |
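The "windowed P95/P99" evidence in the table above can be produced with a simple bucketing pass over latency samples; this sketch assumes `(timestamp_s, latency_ms)` pairs and uses a nearest-rank percentile, which is adequate for spotting tail widening.

```python
def windowed_percentiles(samples, window_s=60, pcts=(95, 99)):
    """Bucket (timestamp_s, latency_ms) samples into fixed windows and report
    per-window tail percentiles -- the shape that averages hide."""
    windows = {}
    for ts, lat in samples:
        windows.setdefault(int(ts // window_s), []).append(lat)
    out = {}
    for w, lats in sorted(windows.items()):
        lats.sort()
        # Nearest-rank percentile, clamped to the last element.
        out[w] = {f"p{p}": lats[min(len(lats) - 1, int(len(lats) * p / 100))]
                  for p in pcts}
    return out
```

A window with one 200 ms outlier among 5 ms samples shows a healthy P95 but a blown P99, which is exactly the "tight distribution, no widening tail" check the hit baseline requires.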
Power-loss protection is valid only if the node remains recoverable across repeated outage injections in every workload state. Each injection should include a post-restart consistency check and KPI recovery verification.
| Workload state | Injection focus | Pass evidence |
|---|---|---|
| Idle | Baseline restart path, event chain continuity. | Event chain complete; KPIs return to baseline quickly. |
| Write-heavy | Metadata durability under active updates. | Consistency check passes; no integrity counter explosion; hit behavior remains sane. |
| Rebuild active | Outage during degraded service and background rebuild pressure. | Rebuild resumes predictably; P99 remains bounded after restart; no cascading failures. |
| High-temp | Outage when thermal headroom is low and throttling risk is high. | Recovery does not enter a throttle spiral; cooling/policy restores stable P99. |
Evidence discipline: for each injection, archive (a) restart timeline, (b) consistency results, (c) KPIs and P99 recovery curves, and (d) NVMe/RAID/thermal flags in the same time window.
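The state × timing grid and the evidence discipline above can be encoded as a small harness skeleton. The offsets, evidence keys, and archive shape below are illustrative assumptions, not a fixed schema.

```python
from itertools import product

STATES = ["idle", "write_heavy", "rebuild_active", "high_temp"]
CUT_OFFSETS_S = [0.5, 2.0, 10.0]   # multiple cut timings, not one "demo cut"
REQUIRED_EVIDENCE = {"restart_timeline", "consistency_result",
                     "kpi_recovery", "flags_window"}

def build_matrix():
    """State x timing grid: every combination is a separate injection."""
    return [{"state": s, "cut_at_s": t} for s, t in product(STATES, CUT_OFFSETS_S)]

def injection_passes(archive: dict) -> bool:
    """An injection passes only if all evidence is archived, the consistency
    check passed, and P99 recovered within the bounded window."""
    return (REQUIRED_EVIDENCE <= set(archive)
            and archive["consistency_result"] == "pass"
            and archive["kpi_recovery"]["p99_recovered_s"]
                <= archive["kpi_recovery"]["budget_s"])
```

Tracking pass/fail per grid cell (rather than per run) makes it obvious when PLP holds at idle but fails under rebuild pressure.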
| Test | Pass criteria (behavior) | Evidence to archive |
|---|---|---|
| Fault injection | Degraded state is detected, logged, and service remains bounded (no uncontrolled collapse). | Degraded state timeline, drive-level counters, KPI and P99 change at T0. |
| Rebuild window | Rebuild completes within a predictable window without destroying P99 stability. | Rebuild rate, completion time, service impact curve (P99 vs rebuild rate). |
| Rebuild rate limiting | Rate limiting meaningfully tightens P99 while preserving forward progress. | A/B comparison: rebuild speed vs P99 distribution and hotspot temperatures. |
| Category | Pass criteria (behavior) | Evidence to archive |
|---|---|---|
| Thermal step | Under temperature step or reduced airflow, the policy prevents sustained throttle spirals and stabilizes P99. | Temp trends, fan/ΔT response, throttle flags, P99 recovery curve. |
| Fan fault simulation | Fault triggers clear alerts and bounded degrade/maintenance actions before collapse. | Alert timestamps, actions taken, KPI/P99 before and after. |
| Logging integrity | Event chain is reconstructible without log storms; storage is protected from log overflow. | Event summaries + ring buffer behavior + critical counters; evidence that logs do not “write the disks to death.” |
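The "event summaries + ring buffer + rate limit" pattern from the logging-integrity row can be sketched in a few lines; capacity and the per-key interval below are illustrative values.

```python
from collections import deque

class EventRing:
    """Bounded event log: compact summaries in a ring buffer plus a per-key
    rate limit, so a log storm cannot write the SSD to death."""
    def __init__(self, capacity=4096, min_interval_s=1.0):
        self.ring = deque(maxlen=capacity)      # oldest entries drop automatically
        self.min_interval_s = min_interval_s
        self._last = {}

    def log(self, key: str, summary: str, ts: float) -> bool:
        """Record one event summary; returns False when storm-suppressed."""
        if ts - self._last.get(key, float("-inf")) < self.min_interval_s:
            return False
        self._last[key] = ts
        self.ring.append((ts, key, summary))
        return True
```

On-demand detail dumps can sit beside this; the ring holds only what is needed to reconstruct the event chain.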
H2-11 · BOM / IC selection criteria (criteria + example part numbers)
These criteria translate cache-node SLA goals (hit/miss performance, integrity after outages, and fast triage) into verifiable BOM requirements. Each bullet is written as: criterion → why it matters → fastest verification hook.
Part numbers below are shortlist examples to anchor sourcing conversations. Final selection depends on form factor (U.2/U.3/E1.S), lane budget, thermals, endurance class, and the field validation matrix in H2-10.
A) NVMe / SSD (tail-latency stability + PLP semantics)
Selection criteria (verifiable)
- P99/P999 stability under mixed I/O → cache metadata + fills can create tail spikes → verify with long-run latency histograms and periodicity checks (not only avg/peak).
- Explicit power-loss behavior (PLP + firmware checks) → prevents metadata half-writes and “silent cache corruption” → verify with power-cut injection at idle/write-heavy/rebuild and post-boot integrity scan.
- Thermal throttling predictability → throttling turns into hit-rate drop and miss storms → verify with temp ramps and “latency vs temperature” correlation.
- Error recovery & timeout policy → long internal recovery can stall queues and explode P99 → verify with timeout/reset counters, media error trends, and controlled fault injection.
- Telemetry visibility (SMART/NVMe-MI where available) → makes field triage fast → verify sensor availability, polling rate, and event logging completeness.
Example part numbers (enterprise NVMe SSD families)
- Read-intensive / cache-heavy: Solidigm™ D7-P5520 (U.2/E1.S/E1.L), Samsung PM9A3, KIOXIA CM7-R (2.5″), Micron 9400 PRO (U.3/U.2).
- Mixed-use / heavier writes: Solidigm™ D7-P5620 (mixed-use sibling of the D7-P5520), Micron 9400 MAX (U.3/U.2), KIOXIA CM7-V (E3.S).
NVMe SSD must provide power-loss protection (PLP) behavior suitable for metadata integrity, validated by fault-injection power cuts across idle / write-heavy / rebuild states.
NVMe SSD must maintain bounded tail latency (P99/P999) under mixed read+write with sustained temperature ramps; throttling behavior must be observable and logged.
SSD must expose health + error signals (temperature, timeouts/resets, media errors) with stable polling interfaces for field evidence.
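The health/error visibility requirement can be checked with a small interpreter over parsed SMART data. The field names below follow nvme-cli's `nvme smart-log <dev> -o json` output (which reports temperature in Kelvin), and the critical-warning bit follows the NVMe spec (bit 1 = temperature threshold exceeded); the sample values and the 70 °C "hot" threshold are illustrative.

```python
def assess_smart(smart: dict) -> list:
    """Turn a parsed smart-log dict into triage flags for the evidence chain."""
    flags = []
    if smart.get("critical_warning", 0) & 0x02:   # NVMe spec: temp threshold bit
        flags.append("temp_threshold_exceeded")
    temp_c = smart.get("temperature", 0) - 273    # nvme-cli reports Kelvin
    if temp_c >= 70:                              # illustrative hot threshold
        flags.append("hot")
    if smart.get("media_errors", 0) > 0:
        flags.append("media_errors")
    return flags

# Hypothetical sample: 345 K (~72 C) with the temperature warning bit set.
sample = {"critical_warning": 2, "temperature": 345, "media_errors": 0}
```

Polling this at a stable interval and logging flag transitions (not raw values) gives the "field evidence" stream without log volume problems.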
B) RAID / Storage control (availability + rebuild discipline)
Selection criteria (verifiable)
- Rebuild rate shaping / throttling → uncontrolled rebuild steals I/O and collapses service → verify rebuild rate controls and “service impact curve” during rebuild.
- Degraded-mode policy hooks → allows intentional service protection (read-only, hotspot protection, rate limits) → verify controllable policies and observability during degraded state.
- Power-loss consistency boundary → cache data can be regenerated, metadata cannot → verify post-reboot metadata checks and deterministic recovery behavior.
- Error & state telemetry → RAID state changes must be visible for triage → verify counters for degraded/rebuild progress, timeout storms, and device drop events.
Example part numbers (controllers / HBAs)
- Hardware RAID controller: Broadcom MegaRAID 9560-16i (PCIe Gen4 RAID controller family).
- Tri-Mode HBA (SAS/SATA/NVMe backplanes where applicable): Broadcom HBA 9500-16i.
Storage controller/HBA must support deterministic degraded-mode operation with observable rebuild progress and rebuild rate shaping to protect cache-node P99.
Controller must expose RAID state transitions (degraded/rebuild/failed) and device error signals for event logs and fast triage.
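Rebuild rate shaping is ultimately a feedback loop: back off when service P99 exceeds budget, creep back when there is headroom, but never stall forward progress. This is a minimal sketch of one control step; all rates and thresholds are illustrative, and real controllers expose this via their own management interfaces.

```python
def shape_rebuild_rate(current_mbps, p99_ms, p99_budget_ms,
                       step_mbps=50, floor_mbps=25, ceil_mbps=400):
    """One step of a P99-driven rebuild rate controller.

    The floor preserves forward progress; the ceiling bounds rebuild I/O even
    when the service is idle. Hysteresis (back up only below 80% of budget)
    avoids oscillating around the budget line.
    """
    if p99_ms > p99_budget_ms:
        return max(floor_mbps, current_mbps - step_mbps)   # protect service
    if p99_ms < 0.8 * p99_budget_ms:
        return min(ceil_mbps, current_mbps + step_mbps)    # reclaim headroom
    return current_mbps
```

Plotting the resulting rebuild rate against the P99 curve over a rebuild window produces exactly the "service impact curve" evidence named in the table above.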
C) Ethernet PHY / retimers (link stability as a performance feature)
Selection criteria (verifiable)
- Error counter visibility (FEC/CRC/PCS/PMA where available) → ties retransmissions to tail latency → verify counters are readable, comparable per-port, and loggable.
- Temperature-linked margin behavior → weak channels become “random P99 spreaders” → verify BER / error counters vs temperature and insertion loss conditions.
- Predictable recovery (retrain/reset behavior) → avoids long stalls and link flaps → verify recovery time bounds and event logging.
- Port role separation (uplink/downlink) → isolates faults and enables graceful degradation → verify independent telemetry and alarms per port group.
Example part numbers (retimers / PHY examples)
- 25G-class multi-channel retimer: TI DS250DF810 (8-channel, multi-rate retimer).
- 28G-class multi-channel retimer: TI DS280DF810 (8-channel, 20.2–28.4 Gbps retimer family).
- 10GBASE-T PHY (when copper ports exist on the node): Marvell Alaska X 88X3310P / 88X3140 family (PHY examples; choose based on port count and interface).
High-speed link components (PHY/retimer) must expose lane/port error counters and link events suitable for correlation with retransmissions and P99 latency excursions.
Retimer solution must be validated across temperature and channel-loss conditions, with bounded recovery behavior (no long stalls / frequent link flap).
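Readable, comparable error counters are only useful as rates, so a typical check computes per-port deltas between snapshots and flags outliers. Counter names and the threshold below are placeholders; real names depend on the NIC/switch driver.

```python
def error_rate_deltas(prev: dict, curr: dict, interval_s: float) -> dict:
    """Per-port error rates (errors/s) from two counter snapshots."""
    return {port: (curr[port] - prev.get(port, 0)) / interval_s for port in curr}

def noisy_ports(rates: dict, threshold_per_s=1.0) -> list:
    """Ports whose error rate exceeds threshold -- candidates for the
    'moves with port/slot' isolation test before blaming software."""
    return sorted(p for p, r in rates.items() if r > threshold_per_s)
```

Running this per port group (uplink vs downlink) supports the port-role separation criterion: a noisy uplink lane and a noisy downlink lane point at different fault domains.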
D) Power, telemetry & protection (observe → decide → act)
Selection criteria (verifiable)
- Rail telemetry coverage (voltage/current/power/temperature) → turns “slowdown” into measurable cause → verify sensor placement and sampling stability under load steps.
- Fault logging hooks (OV/UV/OC/OT, brownout hints) → supports outage and timeout forensics → verify event capture and ring-buffer retention across resets.
- Power capping / budget enforcement → prevents silent throttling cascades → verify power cap action triggers predictable service degradation rather than random stalls.
- Inrush / hot-swap protection (node-level) → prevents transient-induced NVMe resets → verify controlled inrush and bounded fault response.
Example part numbers (telemetry & control ICs)
- Current/voltage/power monitor: TI INA238 (I²C digital power monitor).
- Multi-rail sequencer + monitor: TI UCD90120A (12-rail PMBus/I²C sequencer/monitor).
- Hot-swap / inrush controller: TI LM5069 (9–80V hot-swap controller, for node input protection where applicable).
- Power system manager (telemetry + fault logging): ADI LTC2971 (Power System Manager family).
Node must provide multi-rail telemetry (V/I/P/T) with event logging suitable for correlating NVMe resets/timeouts and performance degradation.
Power subsystem must support controlled inrush and bounded fault response, and must retain reset/power events across reboot for evidence.
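The forensic value of rail telemetry is in time-window correlation: an NVMe reset with a rail anomaly inside a small window points at a power transient rather than thermal or link causes. A minimal matcher, assuming timestamped reset and rail-event records (the record shape and 2 s window are illustrative):

```python
def correlate(reset_ts: list, rail_events: list, window_s=2.0) -> list:
    """Match NVMe reset timestamps against rail anomaly events (e.g. UV/OC
    flags from a PMBus monitor log) within a fixed time window."""
    out = []
    for r in reset_ts:
        hits = [e for e in rail_events if abs(e["ts"] - r) <= window_s]
        out.append({"reset_ts": r, "rail_hits": hits,
                    "power_suspect": bool(hits)})
    return out
```

The same matcher works against temperature excursions and link events, which is how the three candidate causes in H2-12 FAQ 7 get separated.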
E) Sensors & thermal control (prevent silent throttling)
Selection criteria (verifiable)
- Temperature point coverage (SSD controller region, NIC/PHY area, VRM hotspots, inlet/outlet) → prevents blind throttling → verify multi-point readings and cross-check with throttling thresholds.
- Fan control + failure detection → avoids “slow and getting slower” → verify tach feedback, stall detection, and alarm routing into logs.
- Action mapping (alarm → throttle policy) → ensures graceful degradation → verify that alarms trigger defined actions (rate limiting / maintenance flag) and are recorded.
Example part numbers (fan control / thermal sensing)
- Multi-fan controller (SMBus): Microchip EMC2305 (up to 5 PWM fan drivers).
Thermal design must expose multi-point temperature telemetry and provide closed-loop fan control with stall/failure detection and loggable alarms.
System must implement predictable actions when approaching throttling (pre-alarm thresholds, rate-limit policies) rather than relying on silent device throttling.
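Closed-loop fan control with stall detection reduces to a small control step; the proportional gain, duty floor, and RPM threshold below are illustrative values, not EMC2305 register semantics.

```python
def fan_step(temp_c, target_c=60.0, duty=0.3, gain=0.02,
             tach_rpm=None, min_expected_rpm=500):
    """One control step: proportional duty adjustment toward a hotspot target,
    plus a tach-based stall check. Returns (new_duty, alarm_or_None)."""
    alarm = None
    # Stall check: commanded duty is meaningful but the tach reads near zero.
    if tach_rpm is not None and duty > 0.2 and tach_rpm < min_expected_rpm:
        alarm = "fan_stall"                  # route to logs + maintenance flag
    # Proportional correction, clamped to a duty floor/ceiling.
    duty = min(1.0, max(0.2, duty + gain * (temp_c - target_c)))
    return duty, alarm
```

Raising an explicit `fan_stall` alarm (instead of waiting for device throttling) is the "alarm → action mapping" the criteria above call for.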
H2-12 · FAQs (Edge CDN / Cache Node)
These answers focus on verifiable evidence: tail latency shape, NVMe/RAID/link events, power-loss recovery, and thermal/power throttling. No UPF/slicing, no time sync, no programmable data plane.
1. Which metrics define “done” for a cache node—not just throughput?
Mapped: H2-1 / H2-10
“Done” is proven by three evidence groups: (1) performance in both hit and miss states (throughput plus bounded P95/P99 tail), (2) integrity across outages (reboot does not corrupt metadata or trigger miss storms), and (3) operability (failures are explainable via logs, counters, SMART, and thermal/power traces).
- Hit vs miss: P95/P99, concurrency sweep, and origin bandwidth footprint
- Outage: power-cut + reboot integrity scan + KPI recovery window
- Field: NVMe/RAID/link events + temperature/power correlation are loggable
2. Hit ratio looks normal—why can P99 still spike periodically?
Mapped: H2-3 / H2-8
Periodic P99 spikes often come from background behaviors that do not change hit ratio: NVMe garbage collection/trim cycles, metadata journal flush bursts, or thermal/power throttling events. These create short “stall windows” where queues back up, turning a stable average into an unstable tail. The key clue is repeatable spike timing.
- Latency histogram + spike periodicity over 6–24 hours
- SSD temperature / throttling indicators aligned to spikes
- Write/flush bursts and queue depth around spike windows
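The "repeatable spike timing" clue can be tested directly: if the gaps between spike timestamps are near-constant, a background cycle (GC/trim, journal flush, thermal cycling) is likely. A minimal periodicity check, with an illustrative tolerance:

```python
def spike_period(timestamps, tol_s=5.0):
    """Return the mean gap if P99 spike timestamps are roughly periodic
    (every gap within tol_s of the mean), else None."""
    if len(timestamps) < 3:
        return None                       # too few spikes to call it periodic
    gaps = [b - a for a, b in zip(timestamps, timestamps[1:])]
    mean = sum(gaps) / len(gaps)
    if all(abs(g - mean) <= tol_s for g in gaps):
        return mean
    return None
```

A ~600 s period, for example, is a strong hint toward a timed background task rather than load-driven randomness, and tells you which device-side counters to align next.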
3. If the cache is mostly reads, why does metadata consistency still matter?
Mapped: H2-2 / H2-5
Even “read-heavy” caches perform frequent small metadata writes (object index updates, TTL state, admission/eviction markers, and journal/log pointers). Power loss during these writes can create partial state that looks valid but points to wrong objects, causing cache corruption symptoms: miss storms, inconsistent hits, and origin overload after reboot.
- Power-cut during metadata-heavy windows → reboot → integrity scan and sampled key checks
- Watch for object validation failures, abnormal eviction patterns, and origin fetch bursts
4. What does RAID mainly protect in a cache node—and what does it not?
Mapped: H2-4
RAID primarily protects availability against device failure and provides a path to rebuild and recovery without taking the node down. It does not automatically solve power-loss ordering, metadata semantics, thermal throttling, or tail-latency collapse during rebuild. A cache may be regenerable, but metadata and recovery behavior must still be proven with outage and rebuild tests.
- Measure P99 impact during degraded + rebuild and confirm rebuild rate shaping exists
- Reboot after faults and confirm metadata integrity and stable hit/miss behavior
5. How should a power-loss test matrix be designed to prove PLP is real?
Mapped: H2-5 / H2-10
A convincing PLP matrix varies both system state and cut timing. At minimum: idle, write-heavy fill, mixed hit+fill, RAID degraded/rebuild, and high-temperature conditions. For each case, inject power cuts at multiple offsets, then require a consistent reboot outcome: metadata scan passes, no cache corruption signals, and KPIs return within a bounded window.
- State × timing grid (not one “demo cut”)
- Post-boot integrity scan + sampled key validation
- Event chain: power-loss → recovery actions → KPI stabilization
6. What are the most common field symptoms of “cache corruption,” and how to confirm fast?
Mapped: H2-6 / H2-9
Common symptoms include: sudden miss storms without traffic growth, inconsistent object validation failures, repeated origin fetches for the same keys, and abnormal eviction/TTL behavior after a reboot or reset event. Fast confirmation comes from time-window correlation: align the onset to NVMe/RAID events and power/reset logs, then run a targeted integrity scan plus sampled object checks.
- Miss/origin ratio step change + object validation failure counters
- NVMe reset/timeout + reboot timeline correlation
- Integrity scan + sampled key replay checks
7. Frequent NVMe timeout/reset—how to tell thermal vs power transient vs link issues?
Mapped: H2-3 / H2-7 / H2-8 / H2-9
Separate causes by time correlation and by whether the fault migrates with hardware. Thermal issues track SSD/controller temperature and worsen under hot ambient; power transients align with load steps and rail anomalies; link issues often “follow the path” when a slot, cable, or port changes, and show rising CRC/FEC/PCS errors. Use evidence first: temperature and power traces, link error counters, and reset timestamps.
- Temperature vs reset timestamp correlation (thermal)
- Rail telemetry anomalies near resets (power transient)
- CRC/FEC/PCS errors and “moves with port/slot” behavior (link)
8. During RAID rebuild, how to avoid dragging production traffic down? What switches help?
Mapped: H2-4 / H2-10
Treat rebuild as a controllable background workload. Key levers are rebuild rate shaping, I/O priority separation, and explicit service protection modes: rate limiting, hotspot protection, and a defined degraded policy (including temporary read-only if integrity risk rises). Success is measured by a bounded P99 curve and stable miss/origin behavior while rebuild progresses predictably.
- P99 vs rebuild rate curve (find a safe rebuild envelope)
- Origin ratio stability and no “miss storm” during rebuild
- Rebuild progress telemetry is visible and loggable
9. Why do CRC/FEC errors show up as tail latency instead of a hard link-down?
Mapped: H2-7
Many link faults are soft errors. FEC may correct errors with extra latency, and uncorrected errors trigger retransmissions at higher layers. The link stays “up,” but effective throughput drops and queues build; CPU interrupt/packet-processing load can rise; congestion control reduces send rates. The application experiences this as higher P99, not necessarily an immediate disconnect.
- CRC/FEC/PCS counters rising during P99 expansion
- Retransmission/packet-loss indicators increase without link flap
- Port-to-port comparison isolates the noisy lane/port
10. Why does “no errors but getting slower” usually mean thermal/power management—and how to prove it?
Mapped: H2-8
Silent slowdowns often come from throttling without explicit faults. NVMe and NIC components may reduce performance when temperature or power limits are approached, causing gradual P99 drift before throughput collapses. Proof requires trend correlation: multi-point temperatures, fan RPM, inlet/outlet ΔT, power draw, and any available throttling indicators aligned to latency histograms.
- P99 drift over time + rising hotspot temperatures
- Fan RPM/ΔT anomalies (dust, aging fans, airflow restriction)
- Power cap or device throttling indicators aligned to slowdowns
11. What logging granularity is enough for forensics without writing the SSD to death?
Mapped: H2-6
Log what enables reconstruction of the event chain, not every request. Use tiered logging: compact event summaries (state transitions, counters, timestamps) plus on-demand detail dumps, protected by rate limits and ring buffers. Always include NVMe timeout/reset events, RAID state changes, power/reset reasons, temperature/power anomalies, and hit/miss/origin step changes—each with a stable time window.
- NVMe: timeout/reset + temperature + media error trend
- RAID: degraded/rebuild state + rebuild rate
- Link: error counters + link events (no packet capture required)
12. For procurement, which criteria best predict long-term stable tail latency?
Mapped: H2-11 / H2-10
The best predictors are criteria that survive long-run stress: (1) bounded P99/P999 under mixed I/O with predictable thermal throttling, (2) verified power-loss semantics (PLP + reboot integrity across a matrix), and (3) observability + control (RAID rebuild shaping, readable link error counters, and rail/thermal telemetry tied to actions). Validation results should be a procurement gate, not a post-purchase surprise.
- 24–72h soak with latency histograms (not just averages)
- Power-cut + reboot integrity verification across states
- Rebuild and thermal step tests with bounded service impact