
5G CU Hardware Platform: TSN, Security, and Power Telemetry


A 5G CU is a platform system: it must move control- and user-plane traffic predictably while staying provably secure and operable at scale. This page shows how to design the CU around multi-port Ethernet/QoS, crypto + chain of trust, and DDR/storage power + telemetry so field performance issues can be measured, explained, and fixed.

H2-1 · What a 5G CU is (CU-CP/CU-UP) — and what it is not

Intent: Remove first-minute confusion: define CU scope using interface boundaries (F1 vs N2/N3), then translate those boundaries into hardware platform requirements.

Scope anchor (one sentence): A 5G CU is a compute platform (CU-CP + CU-UP) that terminates F1-C/F1-U toward the DU and connects to the Core via N2/N3, where “multi-port Ethernet + deterministic forwarding, trust chain, and power/telemetry evidence” matter as much as raw throughput.

Not covered here: DU baseband acceleration details, RU/AAS RF/JESD/DPD, optical transport gear, PoE/48V front-end deep dives, and BNG/CGNAT/firewall architecture.

Boundary model: interfaces define failure domains, traffic shape, and evidence

Instead of a generic definition, the CU becomes clear when boundaries are treated as engineering contracts: each interface implies (1) a traffic profile, (2) a fault isolation goal, and (3) what must be observable to prove correctness.

  • Boundary CU ↔ DU (F1-C / F1-U): drives port role separation, congestion containment, MTU consistency, and per-queue evidence (drops/latency).
  • Boundary CU ↔ Core (N2 / N3): drives uplink redundancy goals, segmentation, and the rule that “upstream congestion does not break control-critical flows”.
  • Platform view: the CU is a platform, not a single NIC. It must include a trust chain (secure boot/attestation), DDR/NVMe determinism, and telemetry/log evidence for field root-cause.
CU-CP vs CU-UP: why forwarding/QoS design cannot be “one size fits all”
  • CU-CP (control plane): bursty signaling and control-critical transactions. Design priority is bounded queueing delay and loss avoidance for “small but critical” traffic classes.
  • CU-UP (user plane): sustained high-rate data. Design priority is stable throughput and low tail-latency under contention (avoid “looks fine on average, fails in spikes”).
  • Practical test hook: when CU-UP load approaches line rate, control-critical classes must still show stable counters (no drop bursts) and bounded queue depth excursions.
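The practical test hook above can be sketched as a counter-delta check; all counter names here are hypothetical, since real names vary by NIC/switch vendor:

```python
# Sketch: decide whether CP-critical classes stayed stable across a CU-UP
# load window, from two counter snapshots. Counter names are illustrative.

def cp_class_stable(before: dict, after: dict,
                    max_new_drops: int = 0,
                    max_depth_watermark: int = 1000) -> bool:
    """True if CP drops did not burst and queue depth stayed bounded."""
    new_drops = after["cp_queue_drops"] - before["cp_queue_drops"]
    return (new_drops <= max_new_drops
            and after["cp_queue_depth_watermark"] <= max_depth_watermark)

# Snapshots taken before and after driving CU-UP load toward line rate.
before = {"cp_queue_drops": 12, "cp_queue_depth_watermark": 180}
after  = {"cp_queue_drops": 12, "cp_queue_depth_watermark": 420}
print(cp_class_stable(before, after))  # True: drops flat, depth bounded
```

The point is the shape of the check, not the thresholds: stability is a bounded delta across a load window, not a single healthy reading.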
Deployment shapes (regional DC vs edge-concentrated): only the constraints that matter
  • Edge-concentrated CU: tighter power/thermal headroom → higher sensitivity to rail droop/thermal throttling; telemetry becomes mandatory evidence, not “nice to have”.
  • Regional DC CU: higher port density and higher automation expectations → segmentation and repeatable evidence (counters/logs) become the main availability lever.
What “done right” looks like (platform checklist)
  • Forwarding: clear port roles + predictable QoS mapping for CP vs UP classes.
  • Trust: measurable boot chain (RoT/TPM), policy-backed integrity, and attestation evidence outputs.
  • Determinism: DDR/NVMe health signals that explain performance tail events.
  • Field proof: minimal telemetry set that can correlate congestion, thermal, and power incidents.

Reader takeaway: Treat “CU = platform + evidence” as the organizing principle; the rest of this page is about making forwarding, trust, and determinism provable with the right measurements.

Figure F1 — 5G CU boundary map (what this page covers)
Block diagram: DU on the left, CU in the center (Ethernet switching/TSN, secure boot, DDR/NVMe, and telemetry blocks), Core on the right; F1-C/F1-U and N2/N3 arrows mark the boundaries, with a separate Mgmt/OAM and OOB path.
Use this boundary map as the page “contract”: if a section cannot be mapped to a CU platform block (switching / trust / DDR+NVMe / telemetry) and an interface boundary (F1 or N2/N3), it belongs on a sibling page.

H2-2 · CU networking topology: ports, uplinks, segmentation, and failure domains

Intent: Explain why multi-port design is about failure-domain engineering, not “more ports”. Show how port roles + segmentation produce fast isolation and diagnosable incidents.

Port roles are operational contracts (what breaks together, and what must not)
  • DU-facing (F1): carries CU↔DU service flows; design goal is congestion containment and bounded delay for control-critical classes under CU-UP load.
  • Core-facing (N2/N3 uplinks): where upstream congestion can “push back” into the CU; design goal is redundancy + clear counters to prove whether drops happen inside CU or upstream.
  • Mgmt / OAM: the debugging lifeline; design goal is stability during incidents (logs and counters must still be reachable).
  • OOB (interface-level only): last-resort access path; design goal is strict separation from service forwarding to avoid shared failure modes.
Segmentation (VLAN/VRF) as a diagnosability tool—not a checkbox
  • Minimum viable split: separate F1 domain, Core-uplink domain, and Mgmt domain; keep OOB isolated.
  • What “good” looks like: storms/loops stay inside one domain; counters identify which domain leaked, and where.
  • Verification hook: broadcast/unknown-unicast growth must show up in domain-specific counters, not as global “mystery drops”.
Redundancy: state the goal and the first observation points (no deep router talk)
  • LAG (link aggregation): goal is “single-link failure does not change service class behavior”. Check per-member utilization balance and failover counters.
  • ECMP (when used at CU edge): goal is predictable distribution without pathological reordering for sensitive flows. Check per-path counters and incident-time shifts.
  • Rule of thumb for this page: focus on what the CU can prove (local counters/queue watermarks), not on upstream policy design.
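The per-member utilization check mentioned above can be sketched as a simple imbalance test (thresholds and the 20% tolerance are illustrative, not a standard):

```python
# Sketch: flag per-member utilization imbalance on a LAG. A member deviating
# far from the mean share suggests unhealthy hashing or a failover artifact.

def lag_imbalance(member_util: list, tolerance: float = 0.2) -> bool:
    """True if any member deviates from the mean share by more than tolerance."""
    mean = sum(member_util) / len(member_util)
    return any(abs(u - mean) > tolerance * mean for u in member_util)

print(lag_imbalance([0.48, 0.52]))        # False: balanced 2-member LAG
print(lag_imbalance([0.80, 0.10, 0.10]))  # True: one member carries most load
```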
High-risk failure modes: symptom → first checks → evidence
  • Loop / L2 storm
    Symptom: sudden CPU spikes, intermittent packet loss across multiple ports. First checks: per-port broadcast counters, MAC move/flap indicators, queue depth rising across unrelated classes. Evidence: domain-level counters showing where the storm originated (F1 vs uplink vs mgmt).
  • MTU mismatch
    Symptom: throughput looks fine until specific transactions fail; odd retransmissions or blackholes. First checks: interface MTU consistency across DU-facing and uplink domains; incrementing “giant/fragment/drop” counters. Evidence: capture shows consistent truncation/fragment patterns aligned with domain boundary.
  • PFC / congestion spread (symptom-level)
    Symptom: a single busy stream causes wider latency spikes; tail events appear under load. First checks: queue watermark bursts, pause/flow-control counters, drops concentrated in a specific class. Evidence: incident timeline correlates queue watermark spikes with CP-class jitter or drops.
Topology “golden rules” for CU bring-up (portable across vendors)
  • Make roles visible: label port groups physically and in software; keep role-to-domain mapping stable.
  • Keep isolation explicit: do not share mgmt/OOB with service forwarding paths unless proven safe by tests.
  • Instrument early: baseline counters at idle and at known loads; keep snapshots for incident comparison.
  • Design for triage: every major failure mode must point to a first-check counter set.
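The “instrument early” rule implies keeping baseline snapshots diffable. A minimal sketch, assuming hypothetical counter names and an illustrative significance threshold:

```python
# Sketch: diff an incident snapshot against an idle baseline so triage starts
# from "what moved", not from raw numbers. Counter names are illustrative.
import json

def snapshot_diff(baseline: dict, incident: dict, min_delta: int = 1000) -> dict:
    """Return counters whose growth since baseline exceeds min_delta."""
    return {k: incident.get(k, 0) - baseline[k]
            for k in baseline
            if incident.get(k, 0) - baseline[k] >= min_delta}

baseline = {"f1_bcast": 1_000, "uplink_drops": 50, "mgmt_rx": 10_000}
incident = {"f1_bcast": 250_000, "uplink_drops": 55, "mgmt_rx": 10_400}
print(json.dumps(snapshot_diff(baseline, incident)))
# broadcast growth dominates -> the F1 domain is the first suspect
```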

Engineering mindset: The “best” CU topology is the one that limits blast radius and produces clean evidence within minutes—before deep packet analysis is needed.

Figure F2 — Port roles + segmentation + observation points (CU-side)
Diagram: a CU chassis with four port groups (F1, uplink N2/N3, mgmt, OOB) mapped into segmented VLAN/VRF domains, with OOB kept isolated; observation points mark where to read counters, drops, and queue watermarks at the switch/vSwitch queues and shapers.
The diagram is intentionally CU-side only: it shows how port roles map into segmented domains and where to read evidence (counters, drops, queue watermarks) to isolate incidents quickly.

H2-3 · Multi-port Ethernet switching silicon: feature checklist for a CU platform

Intent: A CU-oriented checklist for choosing switch silicon / NIC modules: focus on determinism, fault isolation, and evidence (not “carrier router” features).

Decision rule: A feature is “CU-useful” only if it improves predictable control behavior, limits blast radius, or produces fast evidence (per-port/per-queue visibility) during incidents.

Keep scope tight: buffers and QoS are sized for CU traffic classes (CP/UP/Mgmt), not for BNG-style deep queues or subscriber policy engines.
Feature checklist (what to require, why it matters, how to verify)
  • Port plan: 25G/50G/100G mix + role-based grouping. Why: F1-facing, core-uplink, and mgmt paths must not share failure modes. Verify: “worst case” test = single-link failure + high CU-UP load while CP counters show no burst drops.
  • Offload boundary: L2/L3 fast path with stable control behavior. Why: the value of offload is reducing software jitter on critical paths, not just saving CPU. Verify: control-plane tasks (telemetry/logs/config) do not induce CP-class drop bursts under sustained CU-UP load.
  • Queues & buffers: per-class queues + readable watermarks. Why: the CU needs bounded delay/jitter, and overly deep buffers can amplify tail latency. Verify: queue watermarks correlate with latency spikes; drops are attributable to a specific class/queue, not “mystery loss”.
  • Timestamp support (entry-level): basic ingress/egress timestamp hooks. Why: enables incident timelines (queue bursts ↔ drop/jitter events) and supports CU-side determinism validation. Verify: consistent time correlation of counters/queue peaks across ports; deep clock-tree/jitter topics belong to the Timing pages.
  • Operability & evidence: per-port / per-queue counters + mirror/tap points. Why: root-cause requires “where did it drop” (ingress/queue/egress) and “which class”. Verify: drops can be localized to a single port/queue, and snapshots can be exported before/after an incident window.
Practical port planning method (CU-sized, redundancy-aware)
  • Step 1 — Role first: split ports into F1-facing, core-uplink, mgmt, and (if present) OOB to prevent shared failure domains.
  • Step 2 — Redundancy next: apply N+1 or dual-uplink targets; avoid “single point” uplink designs that collapse CP stability during uplink congestion.
  • Step 3 — Worst-case validation: run high CU-UP throughput while injecting a link failure; expect CP-class counters to stay stable and queue watermarks to remain bounded.
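Step 1 and Step 2 can be captured as a machine-checkable port plan. A minimal sketch with hypothetical port names, checking the two structural rules (no port in two roles, no single-point uplink):

```python
# Sketch: validate a role-based port plan. Port names are illustrative.
PORT_PLAN = {
    "f1":     ["eth0", "eth1"],
    "uplink": ["eth2", "eth3"],   # dual uplink (N+1 target)
    "mgmt":   ["eth4"],
    "oob":    ["bmc0"],
}

def plan_ok(plan: dict) -> bool:
    """No port appears in two roles, and the uplink group is redundant."""
    all_ports = [p for ports in plan.values() for p in ports]
    no_overlap = len(all_ports) == len(set(all_ports))
    dual_uplink = len(plan.get("uplink", [])) >= 2
    return no_overlap and dual_uplink

print(plan_ok(PORT_PLAN))  # True: roles disjoint, uplink redundant
```

Running such a check at bring-up keeps the role-to-domain mapping stable across config changes, per the golden rules above.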
Figure F3 — Switch data path + “CU-useful” feature callouts
Data-path diagram: ports (F1 group, uplink, mgmt) → parser/classifier → queues/shapers (CP · UP · Mgmt) → switch fabric, plus an “evidence plane”: per-port/per-queue counters, queue-depth watermarks, mirror/tap (SPAN) points, and entry-level timestamp hooks.
The diagram is designed for CU selection and bring-up: it highlights where evidence comes from (per-port/per-queue counters, watermarks, mirror points) and keeps timestamps at an “entry-level” scope.

H2-4 · TSN in a CU context: which IEEE features matter and why

Intent: This is not a TSN tutorial. Here, TSN is a CU tool for making delivery provable: bounded latency/jitter for critical classes when CU-UP load is high.

When TSN makes sense: use TSN only when the requirement is an auditable upper bound (latency/jitter/loss) and the platform can keep time and configuration consistent.

Common trap: misaligned clocks + layered rules can reduce determinism—so every TSN feature must map to a measurable CU objective.
CU determinism targets (engineering form, measurable)
  • Latency bound: critical classes must keep a predictable upper bound even during CU-UP saturation (avoid “average looks fine, tail breaks”).
  • Jitter control: arrival spacing variation must remain within a defined window for time-sensitive classes.
  • Loss bound: critical classes require either “no loss” or a clearly defined drop policy with immediate evidence (which class, which queue, which port).
Which TSN features matter for a CU (and where they land)
  • 802.1AS: time base for measurement & scheduling. CU value: aligns evidence (timestamps/counters) and enables time-aware actions. (Deep time-distribution topics belong to the Timing pages.)
  • 802.1Qbv: time-aware scheduling (egress gates). CU value: reserves time windows for critical classes so they do not wait behind bulk CU-UP frames.
  • 802.1Qci: per-stream policing/filtering. CU value: prevents misbehaving flows from consuming queues and destroying determinism; enables “drop with attribution”.
  • 802.1Qbu / 802.3br: frame preemption (platform-dependent). CU value: reduces worst-case blocking by large frames; useful when tail latency dominates incidents.
Go / No-Go criteria (avoid deploying TSN “just because”)
  • Go: incidents are dominated by tail events; critical classes need provable bounds; the platform can keep time/config consistent.
  • No-Go: the dominant issue is long-term capacity shortfall or upstream congestion; TSN cannot fix missing bandwidth.
  • Validation mindset: every TSN rule must correspond to a measurable change in queue watermarks and class-level latency/jitter evidence.
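In that validation spirit, even a Qbv gate schedule can be sanity-checked before deployment. A minimal sketch with illustrative cycle, slot, and budget numbers (real schedules come from the platform's TSN configuration):

```python
# Sketch: check a Qbv gate schedule. Slots must fill the cycle exactly, and
# the CP slot's worst-case wait (a CP frame arriving just after its gate
# closes waits almost a full cycle) must fit the latency budget.
CYCLE_US = 250.0
SLOTS = [("CP", 50.0), ("UP", 180.0), ("Mgmt", 20.0)]  # (class, duration us)

def qbv_check(cycle_us: float, slots, cp_budget_us: float) -> bool:
    total = sum(d for _, d in slots)
    cp_slot = next(d for c, d in slots if c == "CP")
    worst_wait = cycle_us - cp_slot
    return abs(total - cycle_us) < 1e-9 and worst_wait <= cp_budget_us

print(qbv_check(CYCLE_US, SLOTS, cp_budget_us=210.0))  # True: 200 us bound fits
```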
Failure patterns to watch (symptom → first checks)
  • Clock domain misalignment: deterministic windows drift → check time base consistency and whether schedule windows line up across ports.
  • Priority inversion: “critical” traffic gets stuck behind bulk → check class mapping, queue assignment, and egress gate order.
  • Rule stacking conflicts: shaping + gating + policing interact unexpectedly → check which stage applies first and whether drops are attributable (Qci counters).
Figure F5 — Traffic classes → Qci → queues → Qbv time slots (CU-side)
Diagram: traffic classes (CP, UP, Mgmt) → Qci per-stream policing/filtering (with drop+reason counters) → per-class queues → Qbv time slots at the egress gate toward the DU/uplink, all under an 802.1AS time base. Meaning: Qci contains misbehavior; Qbv bounds the worst-case wait.
The diagram stays CU-side: it shows how Qci (per-stream policing) and Qbv (time-aware gates) map directly to measurable CU objectives—bounded delay/jitter and attributable drops.

H2-5 · Traffic classes, QoS, and congestion containment (CU-specific)

Intent: Explain why CU performance collapses under congestion, and how to apply a minimal QoS set that contains blast radius without self-inflicted complexity.

CU principle: QoS is successful only when it preserves CP stability, keeps UP throughput predictable, and leaves fast evidence (where drops/marks happened).

Keep it CU-sized: this section avoids large-network policy frameworks; it focuses on queueing, shaping, ECN/PFC observability, and a layered debug path inside the CU box.
CU traffic classes (defined by sensitivity + evidence needs)
  • CP Critical (F1-C, N2 control signaling): most sensitive to tail latency and burst loss. Evidence required: per-queue drops, queue depth peaks, CPU backlog correlation.
  • UP Bulk (F1-U, N3 user-plane throughput): most sensitive to sustained congestion and to jitter that causes retransmission or head-of-line blocking. Evidence required: queue watermarks, ECN marks, pause events, per-port utilization.
  • Mgmt / OAM (fault isolation and maintenance access): most sensitive to being starved during incidents. Evidence required: dedicated queue counters + mirror/tap availability for packet capture.
  • Telemetry / Logs (metrics, alarms, evidence export): most sensitive to total blackout (loss of evidence) rather than pure latency. Evidence required: a delivery “floor” (minimum service), plus counters that prove it stayed alive.
A “minimal viable” QoS mapping (avoid over-engineering)
  • Queue count: start with 4–6 queues only. More queues raise the risk of priority inversion and rule conflicts.
  • Stable mapping: traffic class → queue should be deterministic (CP/UP/Mgmt/Telemetry), with a small, fixed set of schedulers applied on top.
  • Guardrails: CP may use priority scheduling, but must include a starvation guard so UP/Mgmt never fully collapses.
  • Policers: apply ingress policing mainly to misbehaving/unknown flows to stop congestion spread, not to every flow.
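A minimal viable mapping like the one above can be written down as data, so the starvation guard is checkable rather than implied. A sketch with illustrative queue numbers and shares (not a vendor QoS schema):

```python
# Sketch: a deterministic class->queue map in the 4-6 queue range, with the
# starvation guard expressed as minimum scheduler shares. Values illustrative.
QOS_MAP = {
    "cp":        {"queue": 0, "scheduler": "priority", "min_share": None},
    "up":        {"queue": 1, "scheduler": "wrr",      "min_share": 0.60},
    "mgmt":      {"queue": 2, "scheduler": "wrr",      "min_share": 0.05},
    "telemetry": {"queue": 3, "scheduler": "wrr",      "min_share": 0.05},
}

def guarded(qos: dict) -> bool:
    """CP may be strict-priority only if every other class keeps a floor."""
    floors = [v["min_share"] for k, v in qos.items() if k != "cp"]
    return all(f is not None and f > 0 for f in floors)

print(guarded(QOS_MAP))  # True: UP/Mgmt/Telemetry cannot fully collapse
```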
Congestion containment: ECN and PFC (CU boundary + observability)
  • ECN (marking): goal is to convert “mystery loss” into controlled backoff under load. Watch: ECN mark counters + queue depth peaks + latency/jitter tail metrics. Interpretation: if marks rise but the tail stays bounded, congestion is being signaled, not silently amplified.
  • PFC (pause): goal is to protect specific classes without spreading the stall. Watch: pause frame counters, persistent watermarks, cross-class jitter spikes. Boundary: enable only when it is clear which class is protected and which evidence proves improvement.
Drop localization runbook (symptom → first checks)
  • MAC/PHY layer: link flaps, CRC/FCS errors, MTU mismatch → check port error counters and link event logs.
  • Queue layer: tail latency spikes, burst drops → check per-queue drops + watermarks + scheduling/priority mapping.
  • CPU / host path: control-plane stalls, telemetry gaps → check CPU backlog indicators and correlation with queue peaks.
  • Virtual switch layer (if used): “software drops” under load → check vSwitch drop counters and mirror at the vSwitch boundary.
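The layered runbook above can be encoded as an ordered triage walk, so the first non-clean layer is reported consistently. All counter names are hypothetical:

```python
# Sketch: walk the drop-localization layers in order and return the first
# layer with a non-zero indicator. Counter names are illustrative.
RUNBOOK = [
    ("mac_phy",  ["crc_errors", "link_flaps", "mtu_mismatch_drops"]),
    ("queue",    ["per_queue_drops", "queue_watermark_bursts"]),
    ("cpu_host", ["cpu_backlog", "softirq_drops"]),
    ("vswitch",  ["vswitch_drops"]),
]

def first_suspect(counters: dict) -> str:
    for layer, keys in RUNBOOK:
        if any(counters.get(k, 0) > 0 for k in keys):
            return layer
    return "clean"

print(first_suspect({"crc_errors": 0, "per_queue_drops": 412}))  # queue
```

The ordering matters: a MAC/PHY fault can masquerade as queue drops, so the physical layer is always checked first.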
Figure F6 — Traffic → QoS map with counters and mirror points (CU-contained)
Diagram: CU traffic classes (CP F1-C/N2, UP F1-U/N3, Mgmt/OAM, Telemetry) → classifier (DSCP/PCP, port role) → a recommended 4–6 queue set (Q0 CP priority with guard, Q1 UP bulk, Q2 Mgmt/OAM, Q3 telemetry floor) → egress to DU (F1) and Core (N2/N3), with CU-contained evidence points: counters, watermarks, ECN marks, PFC pause, mirror taps.
Minimal queue mapping keeps rules explainable. Evidence points (counters, watermarks, ECN/PFC indicators, mirror taps) make congestion behavior diagnosable inside the CU boundary.

H2-6 · Hardware crypto: what to offload (and what not) for CU security

Intent: Clarify CU crypto offload decisions using measurable trade-offs: throughput vs tail latency vs fault isolation, with clean key custody boundaries.

Engineering framing: crypto offload is valuable when it turns security processing into a predictable, measurable datapath without destabilizing CP behavior during load or rekey events.

Scope boundary: protocol internals and large-network security architecture are out of scope; only CU-side offload forms, key custody, and validation points are covered.
CU crypto scenarios (separate control-plane and data-plane)
  • Control-plane TLS: goal is stable handshakes and predictable tail latency for CP-related management/control traffic. Risk: handshake bursts or rekey events can create CPU jitter that indirectly harms CP scheduling. Verify: under sustained UP load, CP queue counters show no burst drops during handshake/rekey windows.
  • Data-plane IPsec: goal is secure transport with measurable throughput and bounded jitter (deployment-dependent). Risk: queueing and reassembly paths can amplify tail latency if offload boundaries are unclear. Verify: measure p99/p999 latency while sweeping throughput; ensure drops remain attributable to a class/queue.
  • Data-plane MACsec: goal is link-layer protection on selected CU uplinks (deployment-dependent). Risk: link-level security can hide congestion if counters and marks are not collected. Verify: observe per-port utilization + queue watermarks + mark/pause counters under load.
Offload forms: inline vs look-aside (choose by three hard criteria)
  • Inline (NIC / inline engine): best when latency is sensitive and the datapath should be short and predictable. Benefits: fewer hops, a clearer failure domain, easier reasoning about determinism. Validate: tail latency stays bounded during rekey; counters show where any loss occurs (no “silent” stalls).
  • Look-aside (accelerator): best when throughput must scale and compute can be expanded independently. Risks: extra datapath hops and topology can introduce tail jitter if the crypto queue becomes a backpressure point. Validate: measure crypto queue depth, DMA/queue occupancy, and p99/p999 latency under load sweeps.
  • Key custody (TPM/HSM): role split is that the TPM/HSM provides non-exportable roots and protected key operations, while accelerators provide bulk throughput. Validate: ensure private key material never appears as an exportable software object; audit with platform key-usage counters/logs.
Common pitfalls (symptom → what to measure first)
  • Throughput “wobbles” after enabling crypto: measure p99/p999 latency, queue depths, and class counters around handshake/rekey events.
  • Lower-than-expected throughput: measure crypto engine utilization, queue occupancy, and whether backpressure aligns with UP queue watermarks.
  • Unexpected CP impact: correlate CP queue drops with CPU load spikes and crypto queue surges; CP should remain stable under UP saturation.
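Since every pitfall above starts with “measure p99/p999”, here is a minimal nearest-rank percentile sketch over a latency window; the sample data is illustrative:

```python
import math

# Sketch: nearest-rank percentiles over a latency sample window -- the
# tail metrics to watch around handshake/rekey events.

def percentile(samples: list, p: float) -> float:
    s = sorted(samples)
    rank = max(1, math.ceil(p * len(s)))  # nearest-rank definition
    return s[rank - 1]

# Illustrative window: mostly ~1 ms, with a rekey-window tail near 20 ms.
lat_ms = [1.0] * 980 + [20.0] * 20
print(percentile(lat_ms, 0.50), percentile(lat_ms, 0.99))
# 1.0 20.0 -> the median looks fine while the p99 tail is broken
```

Averaging this window would report ~1.4 ms and hide the problem entirely, which is exactly the “throughput wobbles” trap.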
Figure F7 — Crypto datapath overlay (CP vs UP) + key custody boundary
Diagram: separate CP (TLS via NIC/vSwitch inline crypto) and UP (IPsec/MACsec via the NIC datapath or a look-aside accelerator) paths inside the CU, with a TPM/HSM key-custody block (non-exportable keys, attestation) and validation probes for tail latency, class counters, rekey bursts, and queue peaks on the uplink toward the Core/DC.
Separate CP (TLS) and UP (IPsec/MACsec) paths to keep performance and evidence predictable. TPM/HSM is a custody boundary; accelerators address bulk throughput. Validate with tail latency, class counters, and queue peaks during rekey windows.

Secure boot & chain of trust: measurable, attestable, serviceable

Intent: Provide provable evidence that the CU runs untampered software, while keeping upgrades, rollback, and field recovery serviceable.

Platform goal: security must be measurable (recorded), attestable (verifiable), and serviceable (recoverable without breaking the evidence chain).

Engineering outcome: every boot produces a compact set of measurements, policy versions, and verification results that can be logged, exported, and audited.
Roles and boundaries: who does what
  • Root of Trust (RoT): anchors boot verification and enforces “only approved code executes”. Must output: verification status + the policy version used for decisions.
  • TPM / Secure Element: protects keys and stores measurements so they can be quoted (attested) without exporting secrets. Must output: measurement digest summary + quote/attest result (as an event record).
  • Platform (boot chain): bootloader → OS/hypervisor → workloads, each step verified and/or measured. Must output: component version IDs + signed/verified flag + measurement record ID.
  • Management/telemetry: exports evidence (logs/attest summaries) and correlates it with incident timelines. Must output: a “trust state” field in alarms and audit reports.
Secure boot vs measured boot vs attestation (engineering selection)
  • Secure boot: blocks execution when verification fails. Choose when preventing unauthorized code from running is the primary goal.
  • Measured boot: records what booted (even if policy allows booting) for forensics and accountability. Choose when field diagnosis and auditability matter as much as prevention.
  • Attestation: proves the recorded measurements to a verifier (remote or local audit), turning “logs” into verifiable evidence. Choose when third-party verification, compliance, or zero-trust operations require proof.
Anti-rollback without killing serviceability
  • Separate “recoverable” vs “non-recoverable” rollback: allow A/B rollback for workloads and non-critical packages, but protect core policy/boot components with monotonic counters or policy versions.
  • Make policy visible: every decision should record policy version, decision reason, and which component failed verification.
  • Safe recovery path: define a controlled “recovery mode” that still produces measurements and audit events (no silent bypass).
Verify: minimum evidence set (must exist on every boot)
  • Boot verification status: per-stage “verified / failed” plus component version IDs.
  • Measurement summary: digest list or IDs referencing measurements, plus policy version and anti-rollback counter state.
  • Attestation summary (if enabled): quote result metadata recorded as an event (success/fail + timestamp + reason).
  • Service actions audit: upgrade/rollback attempts, recovery entries, and the reasons for accept/deny decisions.
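The minimum evidence set above is easiest to audit when every boot emits one structured record. A sketch of such a record; the field names are illustrative, not a TPM/TCG or vendor schema:

```python
import json

# Sketch: the per-boot minimum evidence set as one exportable record.
# Field names are hypothetical; map them to your platform's actual events.

def boot_evidence(stages, policy_version, rollback_counter, attest=None):
    return {
        "stages": stages,                      # per-stage verified flag + version ID
        "policy_version": policy_version,      # policy used for accept/deny decisions
        "anti_rollback_counter": rollback_counter,
        "attestation": attest,                 # quote result metadata, if enabled
    }

rec = boot_evidence(
    stages=[{"stage": "bootloader", "verified": True, "version": "blv-2.1.0"},
            {"stage": "os",         "verified": True, "version": "os-6.6.4"}],
    policy_version="pol-7",
    rollback_counter=12,
)
# A boot is only "evidenced" if every stage reported a verification result.
print(all("verified" in s for s in rec["stages"]) and len(json.dumps(rec)) > 0)
```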
Figure F8 — Chain of trust with evidence outputs (logs + measurements + quotes)
Block diagram: RoT (verify, policy) → bootloader (verify + measure) → OS/hypervisor (measure + enforce) → workloads (CU services), with recorded evidence outputs (verification logs, measurement digest IDs, quotes/attestation proof) and a visible policy rail (anti-rollback counter, policy version, decision reasons).
Evidence must be produced on every boot: verification logs, measurement summaries, and (if enabled) attestation outputs. Anti-rollback policy must be visible and auditable without breaking recovery workflows.

DDR & storage subsystem: ECC, power integrity, and performance determinism

Intent: Explain why CU performance issues often come from memory and storage tail behavior rather than raw compute—and how to measure and diagnose it.

CU determinism rule: peak throughput can look healthy while tail latency, ECC trends, and SSD throttling quietly destroy predictable delivery.

Engineering outcome: map field symptoms (reboots, wobble, latency tails) to observable counters and events that point to a domain.
ECC as availability + diagnosability
  • Correctable errors (CE): an early warning signal. The system stays up, but memory margin may be degrading. Use: trend CE rate vs temperature/load; treat spikes as “time to investigate”, not as harmless noise.
  • Uncorrectable errors (UE): a hard fault indicator that triggers resets, crashes, or forced domain recovery. Use: correlate UE events with reset causes and telemetry to pinpoint whether the cause is power, thermal, or stress related.
  • Why ECC matters for the CU: it turns “mystery instability” into countable evidence that can be trended and tied to conditions. Minimum requirement: expose CE/UE counters and alert thresholds in the CU’s monitoring pipeline.
Determinism: bandwidth vs tail latency (how it fails in CU platforms)
  • Contention: many vCPU/vNIC queues can saturate memory bandwidth or increase cache/memory pressure, inflating p99/p999 even when average looks fine.
  • NUMA sensitivity: cross-socket memory access can add latency tails; the effect becomes visible under bursty workloads and mixed tenants.
  • Measurement-first approach: keep the story grounded in what is measured (p50/p99/p999, steady vs step load) rather than theoretical limits.
NVMe/SSD: performance and reliability signals (CU platform view)
  • Thermal throttling: temperature-driven performance collapse that looks like “random wobble”. Observe: SSD temperature + throttle events aligned with throughput drops and latency spikes.
  • Health / endurance trend: aging can increase background work and error handling. Observe: SMART health summary + error events + sustained write-performance trend.
  • Write amplification / GC effects: background garbage collection can create periodic latency tails. Observe: latency distribution over time; periodic bursts often reveal GC patterns.
Symptom → observable mapping (turn wobble into evidence)
  • Intermittent reboot: check first: reset cause + UE events + telemetry around the incident window. Strong signal: a UE spike or sudden ECC escalation aligned with a thermal or power event log.
  • Throughput wobble: check first: SSD throttling events + latency tail growth + storage/IO queue depth. Strong signal: a temperature throttle that aligns with the performance collapse.
  • Latency tail spikes: check first: p99/p999 drift + NUMA placement changes + memory pressure indicators. Strong signal: step-load tests show “tail expansion” without a corresponding throughput increase.
Verify: minimum reproducible test set
  • Run steady + step load: compare p50/p99/p999 under constant load and sudden bursts to expose tail behavior.
  • Capture essentials: CE/UE counters, latency distribution over time, SSD temperature/throttle events, and health summary snapshots.
  • Close the loop: every incident should attribute to a domain (memory vs storage vs thermal) using synchronized events, not guesses.
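The steady-vs-step comparison above can be reduced to one number per run: the p999/p50 ratio. A minimal sketch with illustrative sample data (integer-ratio ranks avoid floating-point rank surprises):

```python
# Sketch: quantify "tail expansion" between a steady run and a step-load run.
# A growing p999/p50 ratio without a matching throughput gain points at the
# memory / storage / thermal domains. Sample data is illustrative.

def pctl(samples: list, num: int, den: int) -> float:
    """Nearest-rank percentile using an exact integer ratio num/den."""
    s = sorted(samples)
    rank = max(1, -(-num * len(s) // den))  # ceiling division
    return s[rank - 1]

def tail_ratio(samples: list) -> float:
    return pctl(samples, 999, 1000) / pctl(samples, 50, 100)

steady = [2.0] * 999 + [4.0]            # flat latency under constant load
step   = [2.0] * 990 + [30.0] * 10      # bursts inflate only the tail
print(tail_ratio(steady), tail_ratio(step))  # 1.0 15.0
```

A ratio near 1 means delivery is deterministic; a ratio that jumps under step load is the “tail expansion” signal named in the symptom table.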
Figure F4 — DDR/NVMe + telemetry: symptoms → observables → likely domain
Diagram: symptoms (reboots, wobble, tail spikes) map via arrows to observables and domains: CPU/SoC (NUMA effects), DDR ECC (CE/UE counters), NVMe (SMART, SSD throttling, temperature events), latency distribution (p50/p99/p999), and thermal/VRM telemetry; observe, then attribute.
The goal is not just peak throughput. Track ECC (CE/UE), tail latency (p99/p999), SSD temperature/throttling, and thermal/VRM events so “wobble” becomes evidence mapped to a domain.

DDR/storage power monitoring: what to measure and how to alarm without false positives

Intent: Define the monitoring signals that actually explain DDR/SSD instability, and build an alarm scheme that is sensitive to real faults but robust to transient spikes.

Monitoring success criterion: every alarm should map to a recoverability decision (continue, degrade, protect) and leave actionable evidence (pre/post snapshots) for attribution.

Anti-goal: alarms that fire on harmless transients create “alarm blindness” and remove protection when it is needed.
V / I / P / Temp · VRM state · PG / FAULT · SVID / PMBus · Blanking / Debounce · Alarm ladder
What to measure: signals that can explain real CU symptoms
Voltage (V) Captures droop and undervoltage risk. Prefer per-rail min/avg and droop depth over raw single samples. Explains: reboots, silent corruption risk, sudden performance collapse.
Current / Power (I / P) Detects overload, abnormal draw, and thermal runaway precursors. Explains: throttling, repeated limit events, brownout-like behavior during bursts.
Temperature (T) VRM/SSD temperatures indicate proximity to throttling and protective shutdown thresholds. Explains: wobble tied to thermal throttling and latency tails.
VRM state Limit, throttle, fault, or “not-ready” flags are more informative than V alone. Explains: why performance changed even when link/counters look healthy.
PG / FAULT Power-good and fault pins/events anchor “hard” protection decisions. Explains: protective resets and domain shutdowns (must be logged).
SVID / PMBus telemetry Standard channel for voltage, current, temperature and status codes. Explains: repeatable patterns with consistent status codes across incidents.
Sampling and filtering: prevent transient false positives without hiding real faults
Three gates that keep alarms useful
  1. Blanking window: after boot, rail enable, or known load-step phases, record signals but suppress alarms for a short “settle” window.
  2. Debounce rule: require threshold violation to persist for T ms or N consecutive samples before raising an alarm.
  3. Window statistics: alarm on rolling-window min/avg/max (or p95) rather than single samples; choose statistic per signal type.
Rule-of-thumb mapping: droop → window-min + debounce; overcurrent → peak + very short debounce (with context); temperature → window-avg.
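As a sketch of how the three gates compose for the droop case, consider a hypothetical Python monitor (class name, thresholds, and window sizes are illustrative assumptions, not tied to any specific PMBus driver): it suppresses alarms during a post-enable blanking period and fires only when enough samples in a rolling window violate the threshold.

```python
from collections import deque

class DroopGate:
    """Blanking + windowed persistence gate for undervoltage (droop).

    Alarms only when at least `persist` of the last `window` samples
    violate the threshold, and never during the post-enable blanking
    period, so a single transient dip cannot fire the alarm.
    """

    def __init__(self, v_min: float, window: int = 8, persist: int = 3, blank: int = 0):
        self.v_min = v_min                   # undervoltage threshold (volts)
        self.samples = deque(maxlen=window)  # rolling window of raw samples
        self.persist = persist               # violating samples required in window
        self.blank = blank                   # samples to suppress after enable
        self.seen = 0

    def feed(self, volts: float) -> bool:
        """Record one sample; return True when the alarm should fire."""
        self.seen += 1
        self.samples.append(volts)
        if self.seen <= self.blank or len(self.samples) < self.samples.maxlen:
            return False                     # blanking, or window not yet full
        below = sum(1 for v in self.samples if v < self.v_min)
        return below >= self.persist

gate = DroopGate(v_min=0.95, window=4, persist=3)
# A single transient dip is recorded but does not alarm; a sustained dip does.
readings = [1.0, 1.0, 0.90, 1.0, 1.0, 0.90, 0.90, 0.90, 0.90]
fired = [gate.feed(v) for v in readings]   # alarm on the 8th and 9th samples
```

Raw samples are still recorded during blanking, matching gate 1: signals are kept as evidence even while alarms are suppressed.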
Alarm ladder: classify by recoverability and data-integrity risk
Warning (recoverable) Early indication with low immediate risk. Examples: approaching thermal limit, mild droop without PG events, sustained power drift vs baseline. Actions: record evidence, dedup and rate-limit repeats, optionally increase sampling for a short period.
Critical (recoverable, high risk) Strong indicator of imminent instability or SLA impact. Examples: repeated droop episodes, VRM limit state bursts, correlated ECC CE surge + power anomalies. Actions: trigger pre/post snapshots, apply controlled degradation, escalate if evidence persists.
Protect / Shutdown (non-recoverable) Data integrity or hardware safety at risk. Examples: PG loss, sustained undervoltage, hard over-temperature, persistent VRM fault. Actions: protective stop or forced recovery path; ensure evidence is flushed and reported.
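One way to encode the ladder is a small ordered classifier that checks non-recoverable conditions first, so a PG loss can never be downgraded by a milder signal. The signal names and thresholds below are illustrative assumptions, not a standard:

```python
from enum import Enum

class Severity(Enum):
    OK = 0
    WARNING = 1    # recoverable, low immediate risk
    CRITICAL = 2   # recoverable, high risk: snapshot + controlled degradation
    PROTECT = 3    # non-recoverable: data integrity / hardware safety

def classify(pg_ok: bool, vrm_fault: bool, over_temp_hard: bool,
             droop_episodes: int, vrm_limit_bursts: int,
             temp_near_limit: bool) -> Severity:
    """Map conditioned observables to the alarm ladder (illustrative limits)."""
    # Non-recoverable first: PG loss, hard over-temperature, persistent VRM fault.
    if not pg_ok or over_temp_hard or vrm_fault:
        return Severity.PROTECT
    # High risk but recoverable: repeated droop episodes or VRM limit bursts.
    if droop_episodes >= 3 or vrm_limit_bursts >= 3:
        return Severity.CRITICAL
    # Early indication: approaching thermal limit or a mild droop episode.
    if temp_near_limit or droop_episodes >= 1:
        return Severity.WARNING
    return Severity.OK

sev = classify(pg_ok=True, vrm_fault=False, over_temp_hard=False,
               droop_episodes=3, vrm_limit_bursts=0, temp_near_limit=True)
# Repeated droop dominates the near-limit temperature: CRITICAL, not WARNING.
```

The ordering is the point: severity is decided by the worst matching rung, which keeps the mapping from alarm to recoverability decision unambiguous.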
Evidence capture: pre/post windows that make root-cause attribution possible
  • Pre-trigger (before alarm): record short-window V/I/P, VRM state, PG transitions, temperature, and alarm conditioning state (blanking/debounce status).
  • Post-trigger (after alarm): record recovery path, throttling or protective actions taken, reset cause (if any), and final state.
  • Replay goal: decide whether the dominant cause is droop, over-current, over-temperature, or undervoltage—and show the timeline.
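The pre/post capture above can be sketched as a rolling ring buffer that freezes its history at trigger time and then keeps filling a post window; buffer sizes and the record shape are illustrative choices:

```python
from collections import deque

class SnapshotRecorder:
    """Keep a rolling pre-trigger window; on trigger, capture it plus
    the next `post` samples as one evidence bundle."""

    def __init__(self, pre: int = 16, post: int = 16):
        self.ring = deque(maxlen=pre)   # continuous pre-trigger history
        self.post = post
        self.pending = None             # bundle being filled after a trigger
        self.bundles = []               # completed pre/post evidence bundles

    def record(self, sample, triggered: bool = False):
        if self.pending is not None:
            self.pending["post"].append(sample)
            if len(self.pending["post"]) >= self.post:
                self.bundles.append(self.pending)   # bundle complete
                self.pending = None
        elif triggered:
            # Freeze the pre-trigger history; trigger sample starts the post window.
            self.pending = {"pre": list(self.ring), "post": [sample]}
        self.ring.append(sample)

rec = SnapshotRecorder(pre=3, post=2)
for t in range(10):
    rec.record(t, triggered=(t == 5))
# One bundle: pre = [2, 3, 4], post = [5, 6]
```

Because the ring buffer runs continuously, the pre-trigger window is available even for the very first incident after boot, which is what makes post-hoc attribution possible.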
Figure F9 — Power monitoring + alarm ladder (conditioning gates → warning/critical/shutdown)
[Figure: DDR and NVMe power domains (DDR rail + ECC, NVMe rail + temp) emit telemetry (V / I / P / Temp / VRM state / PG / FAULT) through conditioning gates (blanking settle, debounce persist, window min/avg) into a three-level alarm ladder: Warning (recoverable), Critical (high risk), Protect/Shutdown (data integrity); pre/post snapshots build the timeline for droop / OCP / OTP / UV.]
Use conditioning gates (blanking, debounce, window statistics) so alarms track real faults. Classify by recoverability and data-integrity risk, and always capture pre/post evidence for attribution.

Telemetry & evidence: proving determinism and trust in the field

Intent: Build a minimal, high-signal evidence set that proves CU determinism and trust, and enables consistent field attribution without creating alarm storms.

Field-proof rule: each “stability claim” should be backed by a matching field evidence signal on the same time axis (not by anecdotes).

Design target: evidence is compact (summary + event snapshots), correlated (shared timestamps), and minimally invasive (minimum necessary principle).
Queue watermarks · Drops / errors · Crypto offload · Boot / attest state · ECC CE/UE · VRM / thermal
Minimum evidence set: small, actionable, and time-correlated
Network evidence Port drops/errors, queue watermarks, and utilization snapshots. Answers: is instability driven by congestion or queue overflow inside the CU path?
Determinism evidence Latency distribution summaries (p50/p99/p999) and tail-growth indicators. Answers: does “average OK” hide tail spikes that break delivery guarantees?
Security evidence Boot verification state, policy version, and attestation summary status. Answers: what is running, and can it be proven to be untampered?
Memory/storage evidence ECC CE/UE counters, SSD health summary, thermal/throttle events. Answers: are tails or wobble driven by memory margin or storage throttling?
Power/thermal evidence VRM telemetry events, temperature trends, and protective state changes. Answers: is the platform silently degrading due to thermal or power constraints?
Crypto offload evidence Accelerator/NIC offload utilization, throughput, and latency impact snapshots. Answers: is security processing creating variability under load?
Correlation playbooks: turn “wobble” into an evidence-backed attribution
Wobble → thermal throttle Align throughput drop with throttle events and temperature rise on the same timeline. First look: throttle flag/time. Then: temp trend. Confirm: tail growth coincides.
Tail spikes → memory margin Check whether p99/p999 expands alongside a surge in ECC CE or repeated VRM limit states. First look: CE rate. Then: VRM state. Confirm: step-load reproduces tail expansion.
Wobble → congestion/queue overflow Compare queue watermarks and drop counters against utilization bursts to validate congestion-driven loss. First look: watermarks. Then: drops/errors. Confirm: tail spikes and retransmit indicators.
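The playbooks above can be reduced to one mechanical check: does an ordered pattern of event kinds appear on the shared timeline, with each step within a bounded gap of the previous one? The sketch below is a greedy first-match scan; the event kinds and gap value are illustrative assumptions:

```python
def attribute_episode(events, pattern, max_gap=5.0):
    """Return True if `pattern` (ordered event kinds) occurs in `events`
    (list of (timestamp, kind) tuples, any order) with each step arriving
    within `max_gap` seconds of the previous matched step."""
    timeline = sorted(events)           # shared time axis
    idx, last_ts = 0, None
    for ts, kind in timeline:
        if idx < len(pattern) and kind == pattern[idx]:
            if last_ts is not None and ts - last_ts > max_gap:
                return False            # chain broken: steps too far apart
            last_ts = ts
            idx += 1
    return idx == len(pattern)          # every step of the pattern matched

events = [(2.0, "boot"), (10.0, "temp_rise"),
          (12.5, "throttle"), (13.0, "tail_growth")]
# Thermal-throttle playbook: temp_rise → throttle → tail_growth, in order.
```

A greedy scan like this cannot restart a partially matched chain, so it is a triage aid rather than a complete correlator; the escalation rule still requires multiple signals to agree.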
Alert governance: keep signal high and stop alarm storms
  • Deduplication: collapse repeated alarms with a merge window; emit one “episode” record with counters rather than spamming repeats.
  • Rate limiting: cap alarm emission per type; keep the episode timeline as evidence.
  • Threshold drift: prefer baseline + deviation for slowly changing metrics; avoid static thresholds that become wrong after workload shifts.
  • Evidence-based escalation: escalate to critical only when multiple signals agree (e.g., throttle + tail expansion, or droop + PG event).
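The deduplication rule can be sketched as an episode merger: alarms of the same type arriving within a merge window extend one episode record instead of emitting new alerts (the record fields and window length are illustrative):

```python
class EpisodeDeduper:
    """Collapse repeated same-type alarms into one episode record when
    they arrive within `merge_window` seconds of the previous one."""

    def __init__(self, merge_window: float = 30.0):
        self.merge_window = merge_window
        self.episodes = []              # one record per episode, with a repeat count

    def offer(self, ts: float, alarm_type: str):
        if self.episodes:
            last = self.episodes[-1]
            if last["type"] == alarm_type and ts - last["last_ts"] <= self.merge_window:
                last["count"] += 1      # merge: extend the current episode
                last["last_ts"] = ts
                return
        self.episodes.append({"type": alarm_type, "first_ts": ts,
                              "last_ts": ts, "count": 1})

d = EpisodeDeduper(merge_window=10.0)
for ts in (0.0, 3.0, 7.0, 50.0):
    d.offer(ts, "droop")
# Two episodes: "droop" x3 spanning 0-7 s, then "droop" x1 at 50 s.
```

The episode record keeps first/last timestamps and a counter, so the timeline evidence survives even though the repeats are suppressed.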
Evidence retention: summary always-on + event snapshots on demand
  • Always-on summaries: compact periodic stats (p99/p999, CE/UE, watermarks, temperatures) to establish baselines.
  • Event-trigger snapshots: pre/post windows at higher sampling rate around incidents for attribution.
  • Forensic bundle: only for severe episodes; store the minimum subset that proves the correlation path.
Minimum necessary principle: keep telemetry useful without leaking sensitive content
  • Metadata only: export counters, timestamps, states, and summaries—not user payload content.
  • Security evidence as summaries: report boot/attestation status and policy versions, not secret material.
  • Privacy by design: include only fields that directly support stability, trust, and traceability claims.
Figure F10 — Lab validation → field evidence mapping (determinism + trust + traceability)
[Figure: two-column mapping with lab validation on the left (throughput baseline, tail latency tests, ECC trend checks, thermal margin, boot trust tests) connected through a central correlation layer (shared time axis/timestamps, correlation evidence path, multi-signal escalation) to field evidence on the right (queue watermarks, p99/p999 summaries, ECC CE/UE counters, thermal/VRM events, boot/attest status).]
A CU becomes “provable” when lab claims map to field evidence signals on the same time axis. Keep evidence compact (summaries + snapshots), correlated (timestamps), and governed (dedup + drift-aware thresholds).

Validation & debug checklist: lab bring-up → soak → incident drill

Intent: Define “done” with repeatable tests and evidence that proves determinism, trust, and operability for a 5G CU platform.

Done means: every stability claim is backed by time-correlated evidence (counters, snapshots, logs), and every incident drill produces a replayable root-cause timeline (pre/post windows + pass/fail criteria).

Anti-goal: “it looks fine” sign-off without baselines, thresholds, and reproducible drills.
Bring-up baseline · Tail (p99/p999) · TSN verification · Crypto on/off · Chain of trust · Load-step droop · Soak + drills

How to use this checklist (evidence package structure)

Evidence should be small, high-signal, and aligned on a shared time axis (timestamps). Keep periodic summaries always-on, and capture high-rate snapshots only around events.

/evidence/
  baseline/    ports/roles, MTU, VLAN/VRF, counter baselines, config snapshots
  tsn/         802.1AS offset trend, Qbv gate evidence, Qci hit/drop counters
  crypto/      throughput/latency/CPU curves; rotation and fallback logs
  trust/       boot verify state, measurements summary, policy/version, rollback proof
  margining/   load-step droop + VRM telemetry; thermal steady-state + throttle events; ECC counters
  drills/      scenario scripts, triggers, pre/post snapshots, root-cause timeline
Determinism acceptance Use tail metrics (p99/p999) and bounded loss behavior under controlled stress—not averages. Evidence: queue watermarks + drop counters + tail summaries on the same timeline.
Trust acceptance Trust must be measurable: boot verification/measured state, policy version, and audit-ready logs. Evidence: boot status + measurements summary + rollback/upgrade recovery proof.
Operability acceptance Incidents must be replayable: pre/post windows, deduped alerts, and a clear triage entry point. Evidence: incident bundle with trigger, snapshots, and attribution conclusion.

Phase A — Lab bring-up (ports/MTU/VLAN, basic forwarding, counter baseline)

  • Port role snapshot: label and record DU-facing (F1), core-facing (N2/N3), mgmt/OAM, and OOB interface roles (interface-level only).
  • MTU sanity: verify path MTU consistency using at least two payload sizes; confirm where fragmentation or drops appear.
  • Basic forwarding: validate controlled traffic templates across each port role; keep early tests simple and reproducible.
  • Counter baselines: capture port errors, drops, and per-queue watermarks at idle and under light load.
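The "counters do not creep at idle" criterion can be made mechanical: diff an idle baseline snapshot against a later snapshot and flag anything that grew beyond its allowed creep. Counter names and allowances below are illustrative:

```python
def counter_creep(baseline, current, allowed=None):
    """Return the counters that increased beyond their allowed creep
    (default: zero) between an idle baseline and a later snapshot."""
    allowed = allowed or {}
    return [name for name, base in baseline.items()
            if current.get(name, base) - base > allowed.get(name, 0)]

baseline = {"rx_errors": 0, "tx_drops": 2, "crc_errors": 0}
later    = {"rx_errors": 0, "tx_drops": 2, "crc_errors": 4}
creeping = counter_creep(baseline, later)
# crc_errors crept at idle → a fast triage entry (cabling/optics/MTU suspects)
```

Per-counter allowances let known-noisy counters pass while keeping zero tolerance on the ones that should never move at idle.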

Pass criteria (bring-up): link stays stable; error counters do not creep at idle; drops are explainable and reproducible under controlled stress.

Fast triage entry: wobble under “link up” often starts with MTU mismatch, queue tail-drop, or a mis-mapped priority/queue.

Phase B — Determinism checks (TSN verification points on the CU platform)

  • 802.1AS stability: trend the time offset; log step events and stability windows. Focus on “stable vs unstable” evidence, not standard theory.
  • Qbv schedule verification: under a controlled traffic template, verify gate schedule effect via gate evidence + tail metric improvement (on/off comparison).
  • Qci enforcement: inject a deliberately non-conforming test stream and confirm Qci hit/drop counters behave as expected.

Pass criteria (determinism): enabling determinism features reduces tail variability without introducing unexplained loss or priority inversion.

Common failure pattern: “TSN enabled but tail got worse” → check priority mapping, schedule mismatch to traffic, and stacked policies that conflict.

Phase C — Crypto validation (throughput/latency/CPU curves; rotation and fallback)

  • Offload A/B test: run the same traffic template with offload on and off; record throughput, p99/p999 latency, and CPU utilization (per-socket summary if applicable).
  • Failure fallback: inject a controlled failure (e.g., invalid/expired credential in a test environment) and confirm the fallback path is observable and bounded.
  • Key/certificate rotation: execute rotation and verify: (1) transition is logged, (2) service impact is bounded, (3) the system recovers to a verified steady state.
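The A/B pass criterion can also be made mechanical: compare the offload-on tail against a budget relative to the offload-off tail, alongside the CPU comparison. The percentile method, budget value, and sample data below are illustrative assumptions:

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: adequate for compact lab sample sets."""
    s = sorted(samples)
    k = max(0, math.ceil(p / 100.0 * len(s)) - 1)
    return s[k]

def offload_ab_pass(lat_off, lat_on, cpu_off, cpu_on, tail_budget=1.10):
    """Pass when offload reduces CPU without destabilizing the tail:
    p99 with offload must stay within tail_budget x the offload-off p99."""
    return (cpu_on <= cpu_off and
            percentile(lat_on, 99) <= tail_budget * percentile(lat_off, 99))

lat_off = [float(i) for i in range(1, 101)]   # offload off: p99 = 99.0
lat_ok  = list(lat_off)                       # offload on, tail unchanged
lat_bad = lat_off[:-3] + [250.0] * 3          # offload on, tail blown out
```

An explicit tail budget turns "without destabilizing tail latency" into a pass/fail number, so the A/B result can go into the evidence bundle rather than a judgment call.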

Pass criteria (crypto): offload increases throughput (or reduces CPU) without destabilizing tail latency; fallback behavior is controlled and produces evidence.

Fast triage entry: tail spikes after enabling crypto often correlate with CPU slow-path fallback, scheduler contention, or topology/affinity misplacement.

Phase D — Secure boot & chain-of-trust validation (measurable, attestable, serviceable)

  • Measurement consistency: repeated cold boots on the same image should produce consistent measurement summaries and verification status.
  • Rollback protection: attempt to boot an older image/policy; confirm it is blocked with a clear reason and without entering a “half-alive” state.
  • Interrupted upgrade recovery: interrupt an upgrade at a controlled point (test environment) and validate recovery to a bootable and verifiable state.
  • Evidence export: record boot verify state, policy version, and measurement summary as a compact field bundle for field audits.

Pass criteria (trust): verified state is explicit and reproducible; rollback is blocked with proof; interrupted upgrades recover predictably with audit evidence.

Phase E — Power/thermal/memory margining + soak + incident drills

Load-step droop Apply controlled step-load transitions; correlate rail telemetry with tail behavior and any protective state changes. Evidence: droop snapshot + VRM state/telemetry + tail summary (same timestamps).
Thermal steady-state Run to thermal equilibrium; record throttle onset and performance inflection points. Evidence: temperature trend + throttle events + throughput/tail curves.
ECC statistics / injection Use platform-supported methods to validate CE/UE reporting and alarm behavior (where available). Evidence: CE/UE counters + alert ladder transitions + post-action logs.
Soak tests Long-run stability under representative load; verify counters do not drift into abnormal regimes. Evidence: periodic summaries + episode-based event bundles (deduped).
Incident drills Run repeatable drills that produce a root-cause timeline with a clear triage entry. Evidence: trigger → pre/post snapshots → attribution conclusion (one bundle per drill).

Recommended drill scenarios (CU platform focus):

  • Congestion episode: queue watermark growth → drops → tail expansion (prove where loss occurs).
  • Thermal throttle episode: temperature rise → throttle events → throughput/tail change (prove bounded degradation).
  • Trust anomaly episode: attestation/measurement mismatch → controlled block/recovery → audit evidence (prove serviceability).

Reference parts (examples): concrete BOM items that enable the validations

These are example parts to anchor lab setups and platform discussions. Equivalent alternatives exist across vendors.

TPM 2.0 / Root of Trust Infineon OPTIGA™ TPM (e.g., SLB 9670 family), Nuvoton TPM2.0 families, ST TPM families. Validation tie-in: secure/measured boot evidence fields, policy/version binding, audit-friendly proofs.
Secure element (keys/certs) Microchip ATECC608B, NXP SE050, Infineon OPTIGA™ Trust families. Validation tie-in: key protection and rotation tests with observable success/failure states.
BMC / OOB management ASPEED AST2600 (common BMC SoC class). Validation tie-in: evidence retention, event bundling, remote drill execution, and logs export.
Power telemetry (PMBus / sensors) TI INA228/INA229 (power monitor class); digital multiphase controllers with PMBus telemetry (examples by family): TI TPS536xx, Infineon XDPE, Renesas RAA/ISL, MPS MP29xx. Validation tie-in: load-step droop snapshots, alarm ladder tuning, thermal steady-state evidence.
Crypto acceleration (offload A/B) Intel QAT platforms/cards (QuickAssist class), DPU/SmartNIC offload examples like NVIDIA BlueField (platform-dependent). Validation tie-in: throughput/latency/CPU curves and fallback behavior under controlled failures.
NIC / multi-port Ethernet (platform examples) Intel Ethernet Controller E810 (family), NVIDIA/Mellanox ConnectX-6/7 (family), Broadcom NetXtreme-E (family). Validation tie-in: counters, queue watermarks, controlled traffic templates, offload comparisons.

Procurement guidance (non-controversial): prioritize parts that expose stable counters/status codes and support reproducible evidence export (snapshots + timestamps).

Figure F6 — Checklist map (phase → evidence output bundle)
A single visual that shows what each validation phase produces as proof (compact, time-correlated, replayable).
[Figure: flow from validation phases (A: bring-up, B: determinism, C: crypto, D: chain of trust, E: margin + soak) to evidence output bundles (baseline snapshot, tail + TSN evidence, crypto curve + fallback, boot/attest proof, power/thermal/ECC), converging on the incident drill bundle: trigger + pre/post snapshots + replayable root-cause timeline, with timestamps aligned and episode-based records.]
Keep each phase output as a compact evidence bundle: baseline snapshots, determinism proofs (tail + TSN verification points), crypto curves + fallback, chain-of-trust proofs, and power/thermal/ECC bundles—then validate operability via incident drills with pre/post snapshots.


FAQs (5G CU platform)

Each answer is written for CU platform decisions: boundary, measurable checks, and fast triage entry points.

Q: How do CU-CP vs CU-UP responsibilities translate into ports and QoS design?

Map responsibilities to traffic behavior: CU-CP is bursty control traffic, CU-UP is sustained user-plane throughput. Put them on separate logical domains (VLAN/VRF) and assign distinct QoS profiles: strict/low-latency treatment for CP with tight rate limits, and shaped/high-throughput queues for UP with tail-latency monitoring. Validate with per-queue counters, drops, and p99/p999 latency under controlled congestion.

Q: Why does a CU need multiple Ethernet ports, and which ports should be physically isolated?

Multiple ports separate failure and security domains, not just bandwidth. Keep DU-facing (F1), core-facing (N2/N3), and management/OAM distinct, with an out-of-band (OOB) port physically isolated from data-plane paths. Physical isolation is most justified for OOB and high-risk debug access. Use segmentation to prevent L2 storms and to preserve a guaranteed recovery path during incidents.

Q: Which switch/NIC features are required for a CU, and which are mostly marketing?

Must-haves are features that produce actionable evidence: per-port and per-queue counters, queue watermarks, deterministic buffer/drop behavior, reliable mirroring (local or tunneled), and stable driver/firmware support. Timestamp capability matters only if it is accurate and exposed to software for correlation. “Buzzword” features are those without a measurable validation method or without counters/logs that explain tail latency, drops, and congestion propagation.

Q: When does a CU truly need TSN (Qbv/Qci), and how to judge benefit vs complexity?

TSN is justified only when the CU must prove bounded latency/jitter and loss under defined traffic classes, not when “lower average latency” is the goal. Qbv helps enforce time-aware scheduling; Qci enforces per-stream policing/filtering. The cost is configuration complexity and new failure modes (priority inversions, clock-domain mismatch). Decide with an on/off experiment: does p99/p999 tighten without creating unexplained drops?

Q: During congestion, how to quickly tell whether the bottleneck is queues, CPU, or the virtual switch layer?

Start with the hardware counters on a shared timeline. If queue watermarks rise and tail-drop counters increase, the bottleneck is in egress queues/buffers. If drops stay low but throughput collapses while CPU or softirq rises, the bottleneck is host processing. If CPU is moderate but p999 latency explodes, suspect the virtual switch/IO path. Confirm via mirror points and per-layer latency snapshots.

Q: Where should IPsec/MACsec/TLS be applied in a CU, and how to choose an offload form factor?

Use TLS for control-plane sessions, IPsec for network-layer protection across routed segments, and MACsec for link-layer protection on trusted Ethernet hops. Offload choice is driven by latency sensitivity and failure isolation: inline NIC offload minimizes per-packet overhead; look-aside accelerators scale throughput but can add queuing. Always A/B test offload on/off and verify key rotation and fallback behavior with explicit logs.

Q: What is the engineering difference between secure boot and measured boot, and when is attestation required?

Secure boot enforces that only signed images execute; measured boot records what executed so it can be audited. Attestation is needed when a remote party must verify the current platform state (not just the image on disk) before enabling service. Keep it serviceable with dual-image updates, monotonic rollback protection, and clear evidence fields: verify state, measurement summary, policy version, and recovery/rollback reason codes.

Q: Why can performance still jitter even if ECC shows no errors, and which memory/storage indicators matter?

ECC counters can be clean while performance still jitters due to contention and tail effects. Common causes are memory bandwidth saturation, NUMA imbalance, page migration, and thermal throttling that changes effective frequency. On storage, NVMe latency spikes from temperature throttling or background housekeeping can widen tails without “errors.” Track tail latency, bandwidth utilization, throttling flags, and queue/service-time distributions rather than only error counts.

Q: How to set DDR/SSD power-monitor alarm thresholds to avoid false positives (blanking/debounce)?

Measure more than voltage: include current, power, temperature, VRM state/PG/FAULT, and (if available) SVID/PMBus telemetry. Prevent false positives by separating fast protection from slow alarms: apply blanking after known load steps, use debounce duration thresholds, and evaluate windowed statistics (min/percentile) instead of single samples. Grade alerts by recoverability and data-integrity risk, and store pre/post event snapshots for replay.

Q: How can field telemetry attribute performance jitter to power, thermal, congestion, or crypto load?

Treat attribution as correlation on a time axis: align tail latency, queue watermarks/drops, thermal/throttle events, VRM alarms, crypto offload utilization, boot/attest status, and ECC/SMART summaries. Look for ordered patterns (e.g., temperature rise → throttle → tail growth; watermark climb → drops → tail growth; offload fallback → CPU spike → tail growth). Suppress alert storms with deduplication, rate limits, and drift-aware thresholds.

Q: After a security upgrade failure, how to guarantee rollback and traceable evidence?

Use a staged update with an A/B or dual-image scheme: write the new image, verify it, boot into it, then commit only after health checks pass. If upgrade fails, rollback must be automatic and auditable. Preserve evidence across reboots: previous/current version IDs, policy version, boot verify status, measurement summaries, and a clear rollback reason. Validate interruption recovery by cutting power during upgrade in a controlled test.

Q: What is the minimal validation set that proves determinism + trust + operability?

A minimal proof set covers three pillars. Determinism: baseline counters plus a TSN/QoS on/off test that tightens p99/p999 without unexplained loss. Trust: secure/measured boot evidence, rollback protection, and interrupted-upgrade recovery. Operability: power/thermal margining (load steps + steady-state), and at least one incident drill that outputs a replayable bundle (trigger, pre/post snapshots, and a root-cause timeline).