5G CU Hardware Platform: TSN, Security, and Power Telemetry
A 5G CU is a platform system: it must move control- and user-plane traffic predictably while staying provably secure and operable at scale. This page shows how to design the CU around multi-port Ethernet/QoS, crypto + chain of trust, and DDR/storage power + telemetry so field performance issues can be measured, explained, and fixed.
H2-1 · What a 5G CU is (CU-CP/CU-UP) — and what it is not
Intent: Remove first-minute confusion: define CU scope using interface boundaries (F1 vs N2/N3), then translate those boundaries into hardware platform requirements.
Scope anchor (one sentence): A 5G CU is a compute platform (CU-CP + CU-UP) that terminates F1-C/F1-U toward the DU and connects to the Core via N2/N3, where “multi-port Ethernet + deterministic forwarding, trust chain, and power/telemetry evidence” matter as much as raw throughput.
Not covered here: DU baseband acceleration details, RU/AAS RF/JESD/DPD, optical transport gear, PoE/48V front-end deep dives, and BNG/CGNAT/firewall architecture.
Instead of a generic definition, the CU becomes clear when boundaries are treated as engineering contracts: each interface implies (1) a traffic profile, (2) a fault isolation goal, and (3) what must be observable to prove correctness.
| Boundary | CU ↔ DU (F1-C / F1-U) Drives: port role separation, congestion containment, MTU consistency, and per-queue evidence (drops/latency). |
|---|---|
| Boundary | CU ↔ Core (N2 / N3) Drives: uplink redundancy goals, segmentation, and “upstream congestion does not break control-critical flows”. |
| Platform view | CU is a platform, not a single NIC Must include trust chain (secure boot/attestation), DDR/NVMe determinism, and telemetry/log evidence for field root-cause. |
- CU-CP (control plane): bursty signaling and control-critical transactions. Design priority is bounded queueing delay and loss avoidance for “small but critical” traffic classes.
- CU-UP (user plane): sustained high-rate data. Design priority is stable throughput and low tail-latency under contention (avoid “looks fine on average, fails in spikes”).
- Practical test hook: when CU-UP load approaches line rate, control-critical classes must still show stable counters (no drop bursts) and bounded queue depth excursions.
- Edge-concentrated CU: tighter power/thermal headroom → higher sensitivity to rail droop/thermal throttling; telemetry becomes mandatory evidence, not “nice to have”.
- Regional DC CU: higher port density and higher automation expectations → segmentation and repeatable evidence (counters/logs) become the main availability lever.
- Forwarding: clear port roles + predictable QoS mapping for CP vs UP classes.
- Trust: measurable boot chain (RoT/TPM), policy-backed integrity, and attestation evidence outputs.
- Determinism: DDR/NVMe health signals that explain performance tail events.
- Field proof: minimal telemetry set that can correlate congestion, thermal, and power incidents.
Reader takeaway: Treat “CU = platform + evidence” as the organizing principle; the rest of this page is about making forwarding, trust, and determinism provable with the right measurements.
H2-2 · CU networking topology: ports, uplinks, segmentation, and failure domains
Intent: Explain why multi-port design is about failure-domain engineering, not “more ports”. Show how port roles + segmentation produce fast isolation and diagnosable incidents.
- DU-facing (F1): carries CU↔DU service flows; design goal is congestion containment and bounded delay for control-critical classes under CU-UP load.
- Core-facing (N2/N3 uplinks): where upstream congestion can “push back” into the CU; design goal is redundancy + clear counters to prove whether drops happen inside CU or upstream.
- Mgmt / OAM: the debugging lifeline; design goal is stability during incidents (logs and counters must still be reachable).
- OOB (interface-level only): last-resort access path; design goal is strict separation from service forwarding to avoid shared failure modes.
- Minimum viable split: separate F1 domain, Core-uplink domain, and Mgmt domain; keep OOB isolated.
- What “good” looks like: storms/loops stay inside one domain; counters identify which domain leaked, and where.
- Verification hook: broadcast/unknown-unicast growth must show in domain-specific counters, not as global “mystery drops”.
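The verification hook above can be sketched as a snapshot diff. This is a minimal illustration, assuming per-domain broadcast counters are already collected; the domain names and the growth threshold are hypothetical:

```python
# Sketch: attribute broadcast/unknown-unicast growth to a failure domain.
# Counter layout and the 10,000-frame threshold are illustrative, not
# taken from any specific NOS.

def storm_suspects(before, after, threshold=10_000):
    """Compare per-domain broadcast counters between two snapshots and
    return domains whose growth exceeds the threshold."""
    return sorted(
        domain for domain in after
        if after[domain] - before.get(domain, 0) > threshold
    )

before = {"f1": 1_200, "uplink": 900, "mgmt": 50}
after = {"f1": 1_500, "uplink": 410_900, "mgmt": 60}
# Growth is concentrated in one domain, not spread as global mystery drops.
print(storm_suspects(before, after))  # ['uplink']
```

The point is the shape of the evidence: two timestamped snapshots plus a per-domain diff answer "which domain leaked" in one step.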
- LAG (link aggregation): goal is “single-link failure does not change service class behavior”. Check per-member utilization balance and failover counters.
- ECMP (when used at CU edge): goal is predictable distribution without pathological reordering for sensitive flows. Check per-path counters and incident-time shifts.
- Rule of thumb for this page: focus on what the CU can prove (local counters/queue watermarks), not on upstream policy design.
- Loop / L2 storm. Symptom: sudden CPU spikes, intermittent packet loss across multiple ports. First checks: per-port broadcast counters, MAC move/flap indicators, queue depth rising across unrelated classes. Evidence: domain-level counters showing where the storm originated (F1 vs uplink vs mgmt).
- MTU mismatch. Symptom: throughput looks fine until specific transactions fail; odd retransmissions or blackholes. First checks: interface MTU consistency across DU-facing and uplink domains; incrementing “giant/fragment/drop” counters. Evidence: capture shows consistent truncation/fragment patterns aligned with the domain boundary.
- PFC / congestion spread (symptom-level). Symptom: a single busy stream causes wider latency spikes; tail events appear under load. First checks: queue watermark bursts, pause/flow-control counters, drops concentrated in a specific class. Evidence: incident timeline correlates queue watermark spikes with CP-class jitter or drops.
- Make roles visible: label port groups physically and in software; keep role-to-domain mapping stable.
- Keep isolation explicit: do not share mgmt/OOB with service forwarding paths unless proven safe by tests.
- Instrument early: baseline counters at idle and at known loads; keep snapshots for incident comparison.
- Design for triage: every major failure mode must point to a first-check counter set.
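The "instrument early" rule can be sketched as a baseline-versus-incident diff. A minimal sketch, assuming counters are exported as flat name/value pairs; the counter names are hypothetical:

```python
# Sketch: keep an idle/known-load counter baseline and diff it during an
# incident so only the counters that moved need attention.

def counter_delta(baseline, incident):
    """Return only counters that changed, with their growth."""
    return {
        name: incident[name] - baseline.get(name, 0)
        for name in incident
        if incident[name] != baseline.get(name, 0)
    }

baseline = {"eth0.rx_drops": 0, "eth0.crc_errors": 2, "q3.tail_drops": 0}
incident = {"eth0.rx_drops": 184, "eth0.crc_errors": 2, "q3.tail_drops": 9312}
print(counter_delta(baseline, incident))
# {'eth0.rx_drops': 184, 'q3.tail_drops': 9312}
```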
Engineering mindset: The “best” CU topology is the one that limits blast radius and produces clean evidence within minutes—before deep packet analysis is needed.
H2-3 · Multi-port Ethernet switching silicon: feature checklist for a CU platform
Intent: A CU-oriented checklist for choosing switch silicon / NIC modules: focus on determinism, fault isolation, and evidence (not “carrier router” features).
Decision rule: A feature is “CU-useful” only if it improves predictable control behavior, limits blast radius, or produces fast evidence (per-port/per-queue visibility) during incidents.
Keep scope tight: buffers and QoS are sized for CU traffic classes (CP/UP/Mgmt), not for BNG-style deep queues or subscriber policy engines.

| Port plan | 25G/50G/100G mix + role-based grouping Why: F1-facing, core-uplink, and mgmt paths must not share failure modes. Verify: “worst case” test = single-link failure + high CU-UP load while CP counters show no burst drops. |
|---|---|
| Offload boundary | L2/L3 fast-path with stable control behavior Why: offload value is reducing software jitter on critical paths, not just saving CPU. Verify: control-plane tasks (telemetry/logs/config) do not induce CP-class drop bursts under sustained CU-UP load. |
| Queues & buffers | Per-class queues + readable watermarks Why: CU needs bounded delay/jitter; overly deep buffers can amplify tail latency. Verify: queue watermarks correlate to latency spikes; drops are attributable to a specific class/queue, not “mystery loss”. |
| Timestamp support (entry-level) | Basic ingress/egress timestamp hooks Why: enables incident timelines (queue bursts ↔ drop/jitter events) and supports CU-side determinism validation. Verify: consistent time correlation of counters/queue peaks across ports; deep clock-tree/jitter topics belong to the Timing pages. |
| Operability & evidence | Per-port / per-queue counters + mirror/tap points Why: root-cause requires “where did it drop” (ingress/queue/egress) and “which class”. Verify: can localize drops to a single port/queue and export snapshots before/after an incident window. |
- Step 1 — Role first: split ports into F1-facing, core-uplink, mgmt, and (if present) OOB to prevent shared failure domains.
- Step 2 — Redundancy next: apply N+1 or dual-uplink targets; avoid “single point” uplink designs that collapse CP stability during uplink congestion.
- Step 3 — Worst-case validation: run high CU-UP throughput while injecting a link failure; expect CP-class counters to stay stable and queue watermarks to remain bounded.
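Step 3 can be expressed as an explicit pass/fail check over the failover window. The thresholds below are placeholders to be tuned per platform, not normative values:

```python
# Sketch: pass/fail for the "single-link failure under high CU-UP load"
# drill. Queue depth is assumed normalized to 0..1 of buffer capacity.

def cp_stable(cp_drop_series, queue_depth_series,
              max_drop_burst=0, max_queue_depth=0.8):
    """CP class passes if no sample shows a drop burst and queue depth
    stays bounded throughout the injected-failure window."""
    drops_ok = all(d <= max_drop_burst for d in cp_drop_series)
    depth_ok = all(q <= max_queue_depth for q in queue_depth_series)
    return drops_ok and depth_ok

# Samples taken across the link-failure window (illustrative values).
print(cp_stable([0, 0, 0, 0], [0.15, 0.42, 0.61, 0.30]))  # True
print(cp_stable([0, 7, 0, 0], [0.15, 0.42, 0.61, 0.30]))  # False
```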
H2-4 · TSN in a CU context: which IEEE features matter and why
Intent: TSN here is not a textbook. It is a CU tool for making delivery provable: bounded latency/jitter for critical classes when CU-UP load is high.
When TSN makes sense: use TSN only when the requirement is an auditable upper bound (latency/jitter/loss) and the platform can keep time and configuration consistent.
Common trap: misaligned clocks + layered rules can reduce determinism—so every TSN feature must map to a measurable CU objective.

- Latency bound: critical classes must keep a predictable upper bound even during CU-UP saturation (avoid “average looks fine, tail breaks”).
- Jitter control: arrival spacing variation must remain within a defined window for time-sensitive classes.
- Loss bound: critical classes require either “no loss” or a clearly defined drop policy with immediate evidence (which class, which queue, which port).
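The three bounds above can be combined into one check over a sample window. The numbers are illustrative; real bounds come from the service contract, and jitter is simplified here to the spread between consecutive samples:

```python
# Sketch: latency bound + jitter window + loss bound as one verdict.

def within_bounds(latencies_us, drops,
                  lat_bound_us=500, jitter_us=50, loss_bound=0):
    """Latency: worst sample under the cap. Jitter: max change between
    consecutive samples. Loss: drops at or below the contracted bound."""
    jitter = max(abs(a - b) for a, b in zip(latencies_us, latencies_us[1:]))
    return (max(latencies_us) <= lat_bound_us
            and jitter <= jitter_us
            and drops <= loss_bound)

print(within_bounds([120, 140, 130, 150], drops=0))  # True: bounded tail
print(within_bounds([120, 700, 130, 150], drops=0))  # False: tail breaks
```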
| 802.1AS | Time base for measurement & scheduling CU value: aligns evidence (timestamps/counters) and enables time-aware actions. (Deep time-distribution topics belong to Timing pages.) |
|---|---|
| Qbv | Time-aware scheduling (egress gates) CU value: reserves time windows for critical classes so they do not wait behind bulk CU-UP frames. |
| Qci | Per-stream policing/filtering CU value: prevents misbehaving flows from consuming queues and destroying determinism; enables “drop with attribution”. |
| Qbu / 802.3br | Frame preemption (platform-dependent) CU value: reduces worst-case blocking by large frames; useful when tail latency dominates incidents. |
- Go: incidents are dominated by tail events; critical classes need provable bounds; the platform can keep time/config consistent.
- No-Go: the dominant issue is long-term capacity shortfall or upstream congestion; TSN cannot fix missing bandwidth.
- Validation mindset: every TSN rule must correspond to a measurable change in queue watermarks and class-level latency/jitter evidence.
- Clock domain misalignment: deterministic windows drift → check time base consistency and whether schedule windows line up across ports.
- Priority inversion: “critical” traffic gets stuck behind bulk → check class mapping, queue assignment, and egress gate order.
- Rule stacking conflicts: shaping + gating + policing interact unexpectedly → check which stage applies first and whether drops are attributable (Qci counters).
H2-5 · Traffic classes, QoS, and congestion containment (CU-specific)
Intent: Explain why CU performance collapses under congestion, and how to apply a minimal QoS set that contains blast radius without self-inflicted complexity.
CU principle: QoS is successful only when it preserves CP stability, keeps UP throughput predictable, and leaves fast evidence (where drops/marks happened).
Keep it CU-sized: this section avoids large-network policy frameworks; it focuses on queueing, shaping, ECN/PFC observability, and a layered debug path inside the CU box.

| CP Critical | F1-C, N2 (control signaling) Most sensitive to: tail latency and burst loss. Evidence required: per-queue drops, queue depth peaks, CPU backlog correlation. |
|---|---|
| UP Bulk | F1-U, N3 (user-plane throughput) Most sensitive to: sustained congestion and jitter that causes retransmission or head-of-line blocking. Evidence required: queue watermarks, ECN marks, pause events, per-port utilization. |
| Mgmt / OAM | OAM interfaces (fault isolation and maintenance access) Most sensitive to: being starved during incidents. Evidence required: dedicated queue counters + mirror/tap availability for packet capture. |
| Telemetry / Logs | metrics, alarms, evidence export Most sensitive to: total blackout (loss of evidence) rather than pure latency. Evidence required: delivery “floor” (minimum service), plus counters that prove it stayed alive. |
- Queue count: start with 4–6 queues only. More queues raise the risk of priority inversion and rule conflicts.
- Stable mapping: traffic class → queue should be deterministic (CP/UP/Mgmt/Telemetry), then a small set of schedulers applies.
- Guardrails: CP may use priority scheduling, but must include a starvation guard so UP/Mgmt never fully collapses.
- Policers: apply ingress policing mainly to misbehaving/unknown flows to stop congestion spread, not to every flow.
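The "stable mapping + starvation guard" rules can be sketched as a toy scheduler. Queue IDs and the CP budget are hypothetical; a real implementation lives in switch silicon, but the toy makes the guard's effect visible:

```python
# Sketch: deterministic class-to-queue map plus a starvation guard for
# strict-priority CP. Queue IDs and cp_budget are illustrative.

CLASS_TO_QUEUE = {"cp": 7, "up": 4, "mgmt": 2, "telemetry": 1}

def schedule(pending, cp_budget=3):
    """Strict priority for CP, but after cp_budget consecutive CP frames
    one lower-class frame is served so UP/Mgmt never fully starve."""
    order, cp_run = [], 0
    while any(pending.values()):
        if pending["cp"] and cp_run < cp_budget:
            pick = "cp"
        else:
            cp_run = 0
            for cls in ("up", "mgmt", "telemetry", "cp"):
                if pending[cls]:
                    pick = cls
                    break
        cp_run = cp_run + 1 if pick == "cp" else 0
        pending[pick] -= 1
        order.append(pick)
    return order

out = schedule({"cp": 5, "up": 2, "mgmt": 0, "telemetry": 0})
print(out)  # ['cp', 'cp', 'cp', 'up', 'cp', 'cp', 'up']
```

A CP burst is interleaved with UP service instead of starving it, which is exactly the guardrail the bullet describes.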
| ECN (marking) | Goal: convert “mystery loss” into controlled backoff under load. Watch: ECN mark counters + queue depth peaks + latency/jitter tail metrics. Interpretation: if marks rise but tail stays bounded, congestion is being signaled (not silently amplified). |
|---|---|
| PFC (pause) | Goal: protect specific classes, but avoid spreading stall. Watch: pause frame counters, persistent watermarks, cross-class jitter spikes. Boundary: enable only when it is clear which class is protected and which evidence proves improvement. |
- MAC/PHY layer: link flaps, CRC/FCS errors, MTU mismatch → check port error counters and link event logs.
- Queue layer: tail latency spikes, burst drops → check per-queue drops + watermarks + scheduling/priority mapping.
- CPU / host path: control-plane stalls, telemetry gaps → check CPU backlog indicators and correlation with queue peaks.
- Virtual switch layer (if used): “software drops” under load → check vSwitch drop counters and mirror at the vSwitch boundary.
H2-6 · Hardware crypto: what to offload (and what not) for CU security
Intent: Clarify CU crypto offload decisions using measurable trade-offs: throughput vs tail latency vs fault isolation, with clean key custody boundaries.
Engineering framing: crypto offload is valuable when it turns security processing into a predictable, measurable datapath without destabilizing CP behavior during load or rekey events.
Scope boundary: protocol internals and large-network security architecture are out of scope; only CU-side offload forms, key custody, and validation points are covered.

| Control-plane TLS | Goal: stable handshakes and predictable tail latency for CP-related management/control traffic. Risk: handshake bursts or rekey events can create CPU jitter that indirectly harms CP scheduling. Verify: under sustained UP load, CP queue counters do not show burst drops during handshake/rekey windows. |
|---|---|
| Data-plane IPsec | Goal: secure transport with measurable throughput and bounded jitter (deployment-dependent). Risk: queueing and reassembly paths can amplify tail latency if offload boundaries are unclear. Verify: measure p99/p999 latency while sweeping throughput; ensure drops remain attributable to a class/queue. |
| Data-plane MACsec | Goal: link-layer protection on selected CU uplinks (deployment-dependent). Risk: link-level security can hide congestion if counters and marks are not collected. Verify: observe per-port utilization + queue watermarks + mark/pause counters under load. |
| Inline (NIC / inline engine) | Best when: latency is sensitive and the datapath should be short and predictable. Benefits: fewer hops, clearer failure domain, easier to reason about determinism. Validate: tail latency stays bounded during rekey; counters show where any loss occurs (not “silent” stalls). |
|---|---|
| Look-aside (accelerator) | Best when: throughput must scale and compute can be expanded independently. Risks: datapath hops and topology can introduce tail jitter if the crypto queue becomes a backpressure point. Validate: measure crypto queue depth, DMA/queue occupancy, and p99/p999 latency under load sweeps. |
| Key custody (TPM/HSM) | Role split: TPM/HSM provides non-exportable roots and protected key operations; accelerators provide bulk throughput. Validate: ensure private key material does not appear as exportable software objects; audit with platform key usage counters/logs. |
- Throughput “wobbles” after enabling crypto: measure p99/p999 latency, queue depths, and class counters around handshake/rekey events.
- Lower-than-expected throughput: measure crypto engine utilization, queue occupancy, and whether backpressure aligns with UP queue watermarks.
- Unexpected CP impact: correlate CP queue drops with CPU load spikes and crypto queue surges; CP should remain stable under UP saturation.
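The rekey-window correlation above can be sketched as a split of latency samples into inside/outside the rekey windows. Timestamps, the quantile helper, and the event list are illustrative:

```python
# Sketch: compare p99 latency inside vs outside rekey windows so a tail
# spike can be attributed (or not) to crypto events.

def p_quantile(samples, q):
    s = sorted(samples)
    return s[min(len(s) - 1, int(q * len(s)))]

def tail_during(windows, samples):
    """Split (timestamp, latency_us) samples by the rekey windows and
    return (inside_p99, outside_p99)."""
    inside = [v for t, v in samples if any(a <= t <= b for a, b in windows)]
    outside = [v for t, v in samples if not any(a <= t <= b for a, b in windows)]
    return p_quantile(inside, 0.99), p_quantile(outside, 0.99)

rekeys = [(100, 110)]
samples = [(t, 150) for t in range(100)] + [(t, 900) for t in range(100, 111)]
print(tail_during(rekeys, samples))  # (900, 150): spike aligns with rekey
```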
H2-7 · Secure boot & chain of trust: measurable, attestable, serviceable
Intent: Provide provable evidence that the CU runs untampered software, while keeping upgrades, rollback, and field recovery serviceable.
Platform goal: security must be measurable (recorded), attestable (verifiable), and serviceable (recoverable without breaking the evidence chain).
Engineering outcome: every boot produces a compact set of measurements, policy versions, and verification results that can be logged, exported, and audited.

| Root of Trust (RoT) | Anchors boot verification and enforces “only approved code executes”. Must output: verification status + policy version used for decisions. |
|---|---|
| TPM / Secure Element | Protects keys and stores measurements in a way that can be quoted (attested) without exporting secrets. Must output: measurement digest summary + quote/attest result (as an event record). |
| Platform (boot chain) | Bootloader → OS/hypervisor → workloads, each step verified and/or measured. Must output: component version IDs + signed/verified flag + measurement record ID. |
| Management/telemetry | Exports evidence (logs/attest summaries) and correlates it with incident timelines. Must output: “trust state” field in alarms and audit reports. |
| Secure boot | Blocks execution when verification fails. Choose when: preventing unauthorized code from running is the primary goal. |
|---|---|
| Measured boot | Records what booted (even if policy allows booting) for forensics and accountability. Choose when: field diagnosis and auditability matter as much as prevention. |
| Attestation | Proves the recorded measurements to a verifier (remote or local audit), turning “logs” into verifiable evidence. Choose when: third-party verification, compliance, or zero-trust operations require proof. |
- Separate “recoverable” vs “non-recoverable” rollback: allow A/B rollback for workloads and non-critical packages, but protect core policy/boot components with monotonic counters or policy versions.
- Make policy visible: every decision should record policy version, decision reason, and which component failed verification.
- Safe recovery path: define a controlled “recovery mode” that still produces measurements and audit events (no silent bypass).
- Boot verification status: per-stage “verified / failed” plus component version IDs.
- Measurement summary: digest list or IDs referencing measurements, plus policy version and anti-rollback counter state.
- Attestation summary (if enabled): quote result metadata recorded as an event (success/fail + timestamp + reason).
- Service actions audit: upgrade/rollback attempts, recovery entries, and the reasons for accept/deny decisions.
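The per-boot evidence items above can be carried as one compact, exportable record. The field names are illustrative, not a standard attestation format:

```python
# Sketch: the compact per-boot evidence record as a serializable
# structure. Field names are hypothetical.

import json
from dataclasses import dataclass, asdict

@dataclass
class BootEvidence:
    stage: str               # e.g. "bootloader", "os", "workload"
    verified: bool           # per-stage verified / failed
    component_version: str   # component version ID
    policy_version: str      # policy used for the decision
    measurement_id: str      # reference to the recorded digest
    reason: str = ""         # populated on failure or recovery entry

boot_log = [
    BootEvidence("bootloader", True, "bl-2.4.1", "pol-7", "m-001"),
    BootEvidence("os", True, "os-5.10.9", "pol-7", "m-002"),
]
bundle = json.dumps([asdict(e) for e in boot_log])
print(all(e.verified for e in boot_log))  # True: trust state is explicit
```

The "trust state" field for alarms and audit reports then reduces to a fold over this list.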
H2-8 · DDR & storage subsystem: ECC, power integrity, and performance determinism
Intent: Explain why CU performance issues often come from memory and storage tail behavior rather than raw compute—and how to measure and diagnose it.
CU determinism rule: peak throughput can look healthy while tail latency, ECC trends, and SSD throttling quietly destroy predictable delivery.
Engineering outcome: map field symptoms (reboots, wobble, latency tails) to observable counters and events that point to a domain.

| Correctable errors (CE) | Early warning signal: the system stays up, but memory margin may be degrading. Use: trend CE rate vs temperature/load; treat spikes as “time to investigate”, not as harmless noise. |
|---|---|
| Uncorrectable errors (UE) | Hard fault indicator: triggers resets, crashes, or forced domain recovery. Use: correlate UE events with reset causes and telemetry to pinpoint whether it is power/thermal/stress related. |
| Why ECC matters for CU | It turns “mystery instability” into countable evidence that can be trended and tied to conditions. Minimum requirement: expose CE/UE counters and alert thresholds in the CU’s monitoring pipeline. |
- Contention: many vCPU/vNIC queues can saturate memory bandwidth or increase cache/memory pressure, inflating p99/p999 even when average looks fine.
- NUMA sensitivity: cross-socket memory access can add latency tails; the effect becomes visible under bursty workloads and mixed tenants.
- Measurement-first approach: keep the story grounded in what is measured (p50/p99/p999, steady vs step load) rather than theoretical limits.
| Thermal throttling | Temperature-driven performance collapse that looks like “random wobble”. Observe: SSD temperature + throttle events aligned with throughput drops and latency spikes. |
|---|---|
| Health / endurance trend | Aging can increase background work and error handling. Observe: SMART health summary + error events + sustained write performance trend. |
| Write amplification / GC effects | Background collection can create periodic latency tails. Observe: latency distribution over time; periodic bursts often reveal GC patterns. |
| Intermittent reboot | Check first: reset cause + UE events + telemetry around the incident window. Strong signal: UE spike or sudden ECC escalation aligned with a thermal or power event log. |
|---|---|
| Throughput wobble | Check first: SSD throttling events + latency tail growth + storage/IO queue depth. Strong signal: temperature throttle aligns with performance collapse. |
| Latency tail spikes | Check first: p99/p999 drift + NUMA placement changes + memory pressure indicators. Strong signal: step-load tests show “tail expansion” without a corresponding throughput increase. |
- Run steady + step load: compare p50/p99/p999 under constant load and sudden bursts to expose tail behavior.
- Capture essentials: CE/UE counters, latency distribution over time, SSD temperature/throttle events, and health summary snapshots.
- Close the loop: every incident should attribute to a domain (memory vs storage vs thermal) using synchronized events, not guesses.
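The steady-vs-step comparison can be reduced to a single "tail expansion" figure. The quantile helper and sample values are illustrative:

```python
# Sketch: compare tail metrics between a steady run and a step-load run.
# Tail expansion without a throughput gain points at margin problems.

def percentile(samples, q):
    s = sorted(samples)
    return s[min(len(s) - 1, int(q * len(s)))]

def tail_expansion(steady_us, step_us, q=0.99):
    """Ratio of step-load tail to steady-load tail at the same quantile."""
    return percentile(step_us, q) / percentile(steady_us, q)

steady = [200] * 99 + [260]          # p99 under constant load
step = [200] * 90 + [1_500] * 10     # bursts expose a latency tail
print(round(tail_expansion(steady, step), 1))  # 5.8: clear tail expansion
```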
H2-9 · DDR/storage power monitoring: what to measure and how to alarm without false positives
Intent: Define the monitoring signals that actually explain DDR/SSD instability, and build an alarm scheme that is sensitive to real faults but robust to transient spikes.
Monitoring success criterion: every alarm should map to a recoverability decision (continue, degrade, protect) and leave actionable evidence (pre/post snapshots) for attribution.
Anti-goal: alarms that fire on harmless transients create “alarm blindness” and remove protection when it is needed.

| Voltage (V) | Captures droop and undervoltage risk. Prefer per-rail min/avg and droop depth over raw single samples. Explains: reboots, silent corruption risk, sudden performance collapse. |
|---|---|
| Current / Power (I / P) | Detects overload, abnormal draw, and thermal runaway precursors. Explains: throttling, repeated limit events, brownout-like behavior during bursts. |
| Temperature (T) | VRM/SSD temperatures indicate proximity to throttling and protective shutdown thresholds. Explains: wobble tied to thermal throttling and latency tails. |
| VRM state | Limit, throttle, fault, or “not-ready” flags are more informative than V alone. Explains: why performance changed even when link/counters look healthy. |
| PG / FAULT | Power-good and fault pins/events anchor “hard” protection decisions. Explains: protective resets and domain shutdowns (must be logged). |
| SVID / PMBus telemetry | Standard channel for voltage, current, temperature and status codes. Explains: repeatable patterns with consistent status codes across incidents. |
- Blanking window: after boot, rail enable, or known load-step phases, record signals but suppress alarms for a short “settle” window.
- Debounce rule: require threshold violation to persist for T ms or N consecutive samples before raising an alarm.
- Window statistics: alarm on rolling-window min/avg/max (or p95) rather than single samples; choose statistic per signal type.
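The blanking and debounce rules above can be sketched as a small conditioner fed one sample at a time. The settle window and N-consecutive values are placeholders:

```python
# Sketch: blanking window + debounce before an alarm may fire.

class AlarmConditioner:
    def __init__(self, blank_samples=5, debounce_n=3):
        self.blank = blank_samples  # suppress alarms while settling
        self.need = debounce_n      # consecutive violations required
        self.streak = 0

    def sample(self, violating):
        """Feed one sample; return True only when the alarm should fire."""
        if self.blank > 0:          # blanking: record, do not alarm
            self.blank -= 1
            return False
        self.streak = self.streak + 1 if violating else 0
        return self.streak >= self.need

cond = AlarmConditioner(blank_samples=2, debounce_n=3)
fired = [cond.sample(v) for v in [True, True, True, True, True, False, True]]
print(fired)  # [False, False, False, False, True, False, False]
```

The first two violations are swallowed by the blanking window, and the alarm fires only after three consecutive post-blanking violations.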
| Warning (recoverable) | Early indication with low immediate risk. Examples: approaching thermal limit, mild droop without PG events, sustained power drift vs baseline. Actions: record evidence, dedup/re-rate-limit, optionally increase sampling for a short period. |
|---|---|
| Critical (recoverable, high risk) | Strong indicator of imminent instability or SLA impact. Examples: repeated droop episodes, VRM limit state bursts, correlated ECC CE surge + power anomalies. Actions: trigger pre/post snapshots, apply controlled degradation, escalate if evidence persists. |
| Protect / Shutdown (non-recoverable) | Data integrity or hardware safety at risk. Examples: PG loss, sustained undervoltage, hard over-temperature, persistent VRM fault. Actions: protective stop or forced recovery path; ensure evidence is flushed and reported. |
- Pre-trigger (before alarm): record short-window V/I/P, VRM state, PG transitions, temperature, and alarm conditioning state (blanking/debounce status).
- Post-trigger (after alarm): record recovery path, throttling or protective actions taken, reset cause (if any), and final state.
- Replay goal: decide whether the dominant cause is droop, over-current, over-temperature, or undervoltage—and show the timeline.
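The pre/post capture can be sketched with a rolling ring buffer so the window before the alarm is already in memory when the trigger arrives. Window sizes are illustrative:

```python
# Sketch: pre-trigger ring buffer + post-trigger capture for the
# evidence snapshot. The alarm sample itself marks the boundary.

from collections import deque

class EvidenceRecorder:
    def __init__(self, pre=4, post=3):
        self.pre = deque(maxlen=pre)  # rolling pre-trigger window
        self.post_size = post
        self.post_left = 0
        self.snapshot = None

    def record(self, sample, alarm=False):
        if alarm and self.snapshot is None:
            # Freeze the pre-trigger window the moment the alarm fires.
            self.snapshot = {"pre": list(self.pre), "post": []}
            self.post_left = self.post_size
        elif self.snapshot is not None and self.post_left > 0:
            self.snapshot["post"].append(sample)
            self.post_left -= 1
        self.pre.append(sample)

rec = EvidenceRecorder()
for s in range(10):
    rec.record(s, alarm=(s == 5))
print(rec.snapshot)  # {'pre': [1, 2, 3, 4], 'post': [6, 7, 8]}
```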
H2-10 · Telemetry & evidence: proving determinism and trust in the field
Intent: Build a minimal, high-signal evidence set that proves CU determinism and trust, and enables consistent field attribution without creating alarm storms.
Field-proof rule: each “stability claim” should be backed by a matching field evidence signal on the same time axis (not by anecdotes).
Design target: evidence is compact (summary + event snapshots), correlated (shared timestamps), and minimally invasive (minimum necessary principle).

| Network evidence | Port drops/errors, queue watermarks, and utilization snapshots. Answers: is instability driven by congestion or queue overflow inside the CU path? |
|---|---|
| Determinism evidence | Latency distribution summaries (p50/p99/p999) and tail-growth indicators. Answers: does “average OK” hide tail spikes that break delivery guarantees? |
| Security evidence | Boot verification state, policy version, and attestation summary status. Answers: what is running, and can it be proven to be untampered? |
| Memory/storage evidence | ECC CE/UE counters, SSD health summary, thermal/throttle events. Answers: are tails or wobble driven by memory margin or storage throttling? |
| Power/thermal evidence | VRM telemetry events, temperature trends, and protective state changes. Answers: is the platform silently degrading due to thermal or power constraints? |
| Crypto offload evidence | Accelerator/NIC offload utilization, throughput, and latency impact snapshots. Answers: is security processing creating variability under load? |
| Wobble → thermal throttle | Align throughput drop with throttle events and temperature rise on the same timeline. First look: throttle flag/time. Then: temp trend. Confirm: tail growth coincides. |
|---|---|
| Tail spikes → memory margin | Check whether p99/p999 expands alongside a surge in ECC CE or repeated VRM limit states. First look: CE rate. Then: VRM state. Confirm: step-load reproduces tail expansion. |
| Wobble → congestion/queue overflow | Compare queue watermarks and drop counters against utilization bursts to validate congestion-driven loss. First look: watermarks. Then: drops/errors. Confirm: tail spikes and retransmit indicators. |
- Deduplication: collapse repeated alarms with a merge window; emit one “episode” record with counters rather than spamming repeats.
- Rate limiting: cap alarm emission per type; keep the episode timeline as evidence.
- Threshold drift: prefer baseline + deviation for slowly changing metrics; avoid static thresholds that become wrong after workload shifts.
- Evidence-based escalation: escalate to critical only when multiple signals agree (e.g., throttle + tail expansion, or droop + PG event).
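The deduplication rule can be sketched as episode grouping over a merge window. Timestamps are in seconds and the 30 s window is a placeholder:

```python
# Sketch: collapse repeated alarms into "episode" records with a merge
# window, so one incident produces one record with a repeat count.

def episodes(alarm_times, merge_window=30):
    """Group alarm timestamps whose gaps fit within merge_window;
    each episode keeps start, end, and repeat count."""
    out = []
    for t in sorted(alarm_times):
        if out and t - out[-1]["end"] <= merge_window:
            out[-1]["end"] = t
            out[-1]["count"] += 1
        else:
            out.append({"start": t, "end": t, "count": 1})
    return out

print(episodes([0, 5, 12, 200, 210]))
# Two episodes instead of five alarms:
# [{'start': 0, 'end': 12, 'count': 3}, {'start': 200, 'end': 210, 'count': 2}]
```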
- Always-on summaries: compact periodic stats (p99/p999, CE/UE, watermarks, temperatures) to establish baselines.
- Event-trigger snapshots: pre/post windows at higher sampling rate around incidents for attribution.
- Forensic bundle: only for severe episodes; store the minimum subset that proves the correlation path.
- Metadata only: export counters, timestamps, states, and summaries—not user payload content.
- Security evidence as summaries: report boot/attestation status and policy versions, not secret material.
- Privacy by design: include only fields that directly support stability, trust, and traceability claims.
H2-11 · Validation & debug checklist: lab bring-up → soak → incident drill
Intent: Define “done” with repeatable tests and evidence that proves determinism, trust, and operability for a 5G CU platform.
Done means: every stability claim is backed by time-correlated evidence (counters, snapshots, logs), and every incident drill produces a replayable root-cause timeline (pre/post windows + pass/fail criteria).
Anti-goal: “it looks fine” sign-off without baselines, thresholds, and reproducible drills.

How to use this checklist (evidence package structure)
Evidence should be small, high-signal, and aligned on a shared time axis (timestamps). Keep periodic summaries always-on, and capture high-rate snapshots only around events.
| Determinism acceptance | Use tail metrics (p99/p999) and bounded loss behavior under controlled stress—not averages. Evidence: queue watermarks + drop counters + tail summaries on the same timeline. |
|---|---|
| Trust acceptance | Trust must be measurable: boot verification/measured state, policy version, and audit-ready logs. Evidence: boot status + measurements summary + rollback/upgrade recovery proof. |
| Operability acceptance | Incidents must be replayable: pre/post windows, deduped alerts, and a clear triage entry point. Evidence: incident bundle with trigger, snapshots, and attribution conclusion. |
Phase A — Lab bring-up (ports/MTU/VLAN, basic forwarding, counter baseline)
- Port role snapshot: label and record DU-facing (F1), core-facing (N2/N3), mgmt/OAM, and OOB interface roles (interface-level only).
- MTU sanity: verify path MTU consistency using at least two payload sizes; confirm where fragmentation or drops appear.
- Basic forwarding: validate controlled traffic templates across each port role; keep early tests simple and reproducible.
- Counter baselines: capture port errors, drops, and per-queue watermarks at idle and under light load.
Pass criteria (bring-up): link stays stable; error counters do not creep at idle; drops are explainable and reproducible under controlled stress.
Fast triage entry: wobble under “link up” often starts with MTU mismatch, queue tail-drop, or a mis-mapped priority/queue.

Phase B — Determinism checks (TSN verification points on the CU platform)
- 802.1AS stability: trend the time offset; log step events and stability windows. Focus on “stable vs unstable” evidence, not standard theory.
- Qbv schedule verification: under a controlled traffic template, verify gate schedule effect via gate evidence + tail metric improvement (on/off comparison).
- Qci enforcement: inject a deliberately non-conforming test stream and confirm Qci hit/drop counters behave as expected.
Pass criteria (determinism): enabling determinism features reduces tail variability without introducing unexplained loss or priority inversion.
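The pass criteria can be sketched as an explicit on/off verdict for a determinism feature such as Qbv. The inputs and thresholds are illustrative:

```python
# Sketch: A/B verdict for enabling a determinism feature. Pass only if
# the tail improves and no new (unexplained) loss appears.

def ab_verdict(off_p99_us, on_p99_us, off_drops, on_drops):
    """"pass" when the feature reduces the p99 tail without adding
    loss; otherwise flag the run for investigation."""
    tail_improved = on_p99_us < off_p99_us
    no_new_loss = on_drops <= off_drops
    return "pass" if tail_improved and no_new_loss else "investigate"

print(ab_verdict(off_p99_us=900, on_p99_us=420, off_drops=3, on_drops=0))
# pass
print(ab_verdict(off_p99_us=900, on_p99_us=1_200, off_drops=3, on_drops=0))
# investigate
```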
Common failure pattern: “TSN enabled but tail got worse” → check priority mapping, schedule mismatch to traffic, and stacked policies that conflict.
Phase C — Crypto validation (throughput/latency/CPU curves; rotation and fallback)
- Offload A/B test: run the same traffic template with offload on and off; record throughput, p99/p999 latency, and CPU utilization (per-socket summary if applicable).
- Failure fallback: inject a controlled failure (e.g., invalid/expired credential in a test environment) and confirm the fallback path is observable and bounded.
- Key/certificate rotation: execute rotation and verify: (1) transition is logged, (2) service impact is bounded, (3) the system recovers to a verified steady state.
Pass criteria (crypto): offload increases throughput (or reduces CPU) without destabilizing tail latency; fallback behavior is controlled and produces evidence.
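The offload A/B comparison above reduces to computing tail percentiles for both runs on the same traffic template. A minimal sketch with synthetic latency samples (nearest-rank percentiles; real samples would come from the traffic generator's per-packet timestamps):

```python
# A/B tail summary for the crypto-offload test (Phase C).

def percentile(samples, p):
    """Nearest-rank percentile (p in (0, 100]) on a sorted copy."""
    s = sorted(samples)
    k = max(0, min(len(s) - 1, int(round(p / 100.0 * len(s))) - 1))
    return s[k]

def ab_summary(latencies_off, latencies_on):
    return {
        "p99_off":  percentile(latencies_off, 99),
        "p99_on":   percentile(latencies_on, 99),
        "p999_off": percentile(latencies_off, 99.9),
        "p999_on":  percentile(latencies_on, 99.9),
    }

base = [100] * 985 + [400] * 15   # software path: long tail
off  = [80] * 995 + [120] * 5     # offload path: tighter tail
summary = ab_summary(base, off)
print(summary["p99_off"], summary["p99_on"])  # 400 80
```

The pass criterion maps directly onto the dict: offload passes only if `p99_on`/`p999_on` do not regress versus the `_off` baseline while throughput or CPU improves.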
Fast triage entry: tail spikes after enabling crypto often correlate with CPU slow-path fallback, scheduler contention, or topology/affinity misplacement.
Phase D — Secure boot & chain-of-trust validation (measurable, attestable, serviceable)
- Measurement consistency: repeated cold boots on the same image should produce consistent measurement summaries and verification status.
- Rollback protection: attempt to boot an older image/policy; confirm it is blocked with a clear reason and without entering a “half-alive” state.
- Interrupted upgrade recovery: interrupt an upgrade at a controlled point (test environment) and validate recovery to a bootable and verifiable state.
- Evidence export: record boot verify state, policy version, and measurement summary as a compact field bundle for field audits.
Pass criteria (trust): verified state is explicit and reproducible; rollback is blocked with proof; interrupted upgrades recover predictably with audit evidence.
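Two of the Phase D checks are mechanical enough to sketch: measurement consistency across cold boots, and monotonic rollback protection. Field names and version numbering are illustrative; real records come from the boot/attestation evidence export.

```python
# Trust-phase checks (Phase D): identical measurements across cold boots
# of the same image, and a monotonic "never boot older" rollback policy.

def boots_consistent(boot_records):
    """All cold boots of the same image must report identical measurements."""
    summaries = {r["measurement_summary"] for r in boot_records}
    return len(summaries) == 1

def rollback_allowed(current_version, candidate_version):
    """Monotonic policy: never boot an image older than the committed one."""
    return candidate_version >= current_version

boots = [
    {"boot": 1, "measurement_summary": "sha256:ab12"},
    {"boot": 2, "measurement_summary": "sha256:ab12"},
]
print(boots_consistent(boots))   # True
print(rollback_allowed(7, 6))    # False -> block, log a clear reason
```

A blocked rollback should emit the comparison (current vs candidate version) into the evidence bundle so the "blocked with proof" criterion is auditable.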
Phase E — Power/thermal/memory margining + soak + incident drills
| Load-step droop | Apply controlled step-load transitions; correlate rail telemetry with tail behavior and any protective state changes. Evidence: droop snapshot + VRM state/telemetry + tail summary (same timestamps). |
|---|---|
| Thermal steady-state | Run to thermal equilibrium; record throttle onset and performance inflection points. Evidence: temperature trend + throttle events + throughput/tail curves. |
| ECC statistics / injection | Use platform-supported methods to validate CE/UE reporting and alarm behavior (where available). Evidence: CE/UE counters + alert ladder transitions + post-action logs. |
| Soak tests | Long-run stability under representative load; verify counters do not drift into abnormal regimes. Evidence: periodic summaries + episode-based event bundles (deduped). |
| Incident drills | Run repeatable drills that produce a root-cause timeline with a clear triage entry. Evidence: trigger → pre/post snapshots → attribution conclusion (one bundle per drill). |
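The load-step droop row can be sketched as a same-timeline join: flag timestamps where rail droop and tail expansion co-occur. The samples, nominal voltage, and thresholds are synthetic placeholders; a real run exports these from the PMBus monitor and the traffic generator with shared timestamps.

```python
# Droop/tail correlation sketch for Phase E: one timeline, three columns
# (timestamp_s, rail_mV, p999_us), flag co-occurring excursions.

def droop_episodes(samples, nominal_mv=12000, droop_pct=3.0, tail_limit_us=500):
    """Return timestamps where rail droop and tail expansion co-occur."""
    floor_mv = nominal_mv * (1 - droop_pct / 100.0)
    return [
        t for (t, rail_mv, p999_us) in samples
        if rail_mv < floor_mv and p999_us > tail_limit_us
    ]

timeline = [
    (0.0, 12010, 180),
    (1.0, 11500, 900),   # load step: droop and tail expansion together
    (2.0, 11980, 210),
]
print(droop_episodes(timeline))  # [1.0]
```

The same join generalizes to the thermal row: substitute temperature/throttle flags for the rail column and keep the tail column unchanged.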
Recommended drill scenarios (CU platform focus):
- Congestion episode: queue watermark growth → drops → tail expansion (prove where loss occurs).
- Thermal throttle episode: temperature rise → throttle events → throughput/tail change (prove bounded degradation).
- Trust anomaly episode: attestation/measurement mismatch → controlled block/recovery → audit evidence (prove serviceability).
Reference parts (examples): concrete BOM items that enable the validations
These are example parts to anchor lab setups and platform discussions. Equivalent alternatives exist across vendors.
| TPM 2.0 / Root of Trust | Infineon OPTIGA™ TPM (e.g., SLB 9670 family), Nuvoton TPM2.0 families, ST TPM families. Validation tie-in: secure/measured boot evidence fields, policy/version binding, audit-friendly proofs. |
|---|---|
| Secure element (keys/certs) | Microchip ATECC608B, NXP SE050, Infineon OPTIGA™ Trust families. Validation tie-in: key protection and rotation tests with observable success/failure states. |
| BMC / OOB management | ASPEED AST2600 (common BMC SoC class). Validation tie-in: evidence retention, event bundling, remote drill execution, and logs export. |
| Power telemetry (PMBus / sensors) | TI INA228/INA229 (power monitor class); digital multiphase controllers with PMBus telemetry (examples by family): TI TPS536xx, Infineon XDPE, Renesas RAA/ISL, MPS MP29xx. Validation tie-in: load-step droop snapshots, alarm ladder tuning, thermal steady-state evidence. |
| Crypto acceleration (offload A/B) | Intel QAT platforms/cards (QuickAssist class), DPU/SmartNIC offload examples like NVIDIA BlueField (platform-dependent). Validation tie-in: throughput/latency/CPU curves and fallback behavior under controlled failures. |
| NIC / multi-port Ethernet (platform examples) | Intel Ethernet Controller E810 (family), NVIDIA/Mellanox ConnectX-6/7 (family), Broadcom NetXtreme-E (family). Validation tie-in: counters, queue watermarks, controlled traffic templates, offload comparisons. |
Procurement guidance (non-controversial): prioritize parts that expose stable counters/status codes and support reproducible evidence export (snapshots + timestamps).
FAQs (5G CU platform)
Each answer is written for CU platform decisions: boundary, measurable checks, and fast triage entry points.
Q: How do CU-CP vs CU-UP responsibilities translate into ports and QoS design?
Map responsibilities to traffic behavior: CU-CP is bursty control traffic, CU-UP is sustained user-plane throughput. Put them on separate logical domains (VLAN/VRF) and assign distinct QoS profiles: strict/low-latency treatment for CP with tight rate limits, and shaped/high-throughput queues for UP with tail-latency monitoring. Validate with per-queue counters, drops, and p99/p999 latency under controlled congestion.
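The CP/UP separation described above can be written down as data before it is configured. A minimal sketch; the VLAN IDs, queue numbers, rate limits, and metric names are placeholders, not a recommended configuration:

```python
# Illustrative CP/UP QoS plan: separate logical domains, distinct queue
# treatment, and per-plane watch metrics. All values are hypothetical.

QOS_PLAN = {
    "cu_cp": {"vlan": 101, "queue": 7, "policy": "strict-priority",
              "rate_limit_mbps": 200,  "watch": ["p99_us", "drops"]},
    "cu_up": {"vlan": 201, "queue": 3, "policy": "shaped",
              "rate_limit_mbps": None, "watch": ["p999_us", "watermark_pct"]},
}

def queue_for(plane):
    """Resolve a plane name to its queue, failing loudly on unknown planes."""
    return QOS_PLAN[plane]["queue"]

print(queue_for("cu_cp"))  # 7
```

Keeping the plan as one reviewable structure makes the validation step concrete: the `watch` list per plane is exactly what the per-queue counter checks should export.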
Q: Why does a CU need multiple Ethernet ports, and which ports should be physically isolated?
Multiple ports separate failure and security domains, not just bandwidth. Keep DU-facing (F1), core-facing (N2/N3), and management/OAM distinct, with an out-of-band (OOB) port physically isolated from data-plane paths. Physical isolation is most justified for OOB and high-risk debug access. Use segmentation to prevent L2 storms and to preserve a guaranteed recovery path during incidents.
Q: Which switch/NIC features are required for a CU, and which are mostly marketing?
Must-haves are features that produce actionable evidence: per-port and per-queue counters, queue watermarks, deterministic buffer/drop behavior, reliable mirroring (local or tunneled), and stable driver/firmware support. Timestamp capability matters only if it is accurate and exposed to software for correlation. “Buzzword” features are those without a measurable validation method or without counters/logs that explain tail latency, drops, and congestion propagation.
Q: When does a CU truly need TSN (Qbv/Qci), and how to judge benefit vs complexity?
TSN is justified only when the CU must prove bounded latency/jitter and loss under defined traffic classes, not when “lower average latency” is the goal. Qbv helps enforce time-aware scheduling; Qci enforces per-stream policing/filtering. The cost is configuration complexity and new failure modes (priority inversions, clock-domain mismatch). Decide with an on/off experiment: does p99/p999 tighten without creating unexplained drops?
Q: During congestion, how to quickly tell whether the bottleneck is queues, CPU, or the virtual switch layer?
Start with the hardware counters on a shared timeline. If queue watermarks rise and tail-drop counters increase, the bottleneck is in egress queues/buffers. If drops stay low but throughput collapses while CPU or softirq rises, the bottleneck is host processing. If CPU is moderate but p999 latency explodes, suspect the virtual switch/IO path. Confirm via mirror points and per-layer latency snapshots.
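The triage order in this answer can be sketched as a small decision function. The thresholds are placeholders to be tuned against your own baselines, and the inputs are assumed to come from the same shared timeline the answer describes:

```python
# Congestion triage sketch: queue evidence first, then host CPU, then the
# virtual switch / IO path. Thresholds are illustrative placeholders.

def locate_bottleneck(queue_watermark_pct, tail_drops, cpu_pct, p999_ratio):
    """p999_ratio = current p999 / baseline p999 on the same timeline."""
    if queue_watermark_pct > 80 and tail_drops > 0:
        return "egress queues/buffers"
    if cpu_pct > 85:
        return "host processing (CPU/softirq)"
    if p999_ratio > 5:
        return "virtual switch / IO path"
    return "no clear bottleneck -- widen the capture window"

print(locate_bottleneck(95, 1200, 40, 2.0))  # egress queues/buffers
print(locate_bottleneck(30, 0, 92, 1.5))     # host processing (CPU/softirq)
print(locate_bottleneck(25, 0, 45, 9.0))     # virtual switch / IO path
```

The ordering matters: queue evidence is checked first because it is the cheapest to confirm and rules out the other two when positive.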
Q: Where should IPsec/MACsec/TLS be applied in a CU, and how to choose an offload form factor?
Use TLS for control-plane sessions, IPsec for network-layer protection across routed segments, and MACsec for link-layer protection on trusted Ethernet hops. Offload choice is driven by latency sensitivity and failure isolation: inline NIC offload minimizes per-packet overhead; look-aside accelerators scale throughput but can add queuing. Always A/B test offload on/off and verify key rotation and fallback behavior with explicit logs.
Q: What is the engineering difference between secure boot and measured boot, and when is attestation required?
Secure boot enforces that only signed images execute; measured boot records what executed so it can be audited. Attestation is needed when a remote party must verify the current platform state (not just the image on disk) before enabling service. Keep it serviceable with dual-image updates, monotonic rollback protection, and clear evidence fields: verify state, measurement summary, policy version, and recovery/rollback reason codes.
Q: Why can performance still jitter even if ECC shows no errors, and which memory/storage indicators matter?
ECC counters can be clean while performance still jitters due to contention and tail effects. Common causes are memory bandwidth saturation, NUMA imbalance, page migration, and thermal throttling that changes effective frequency. On storage, NVMe latency spikes from temperature throttling or background housekeeping can widen tails without “errors.” Track tail latency, bandwidth utilization, throttling flags, and queue/service-time distributions rather than only error counts.
Q: How to set DDR/SSD power-monitor alarm thresholds to avoid false positives (blanking/debounce)?
Measure more than voltage: include current, power, temperature, VRM state/PG/FAULT, and (if available) SVID/PMBus telemetry. Prevent false positives by separating fast protection from slow alarms: apply blanking after known load steps, use debounce duration thresholds, and evaluate windowed statistics (min/percentile) instead of single samples. Grade alerts by recoverability and data-integrity risk, and store pre/post event snapshots for replay.
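The blanking and debounce rules above can be sketched in a few lines: ignore samples inside the blanking window after a known load step, and assert an alarm only after the fault condition persists for N consecutive samples. The trace values and thresholds are illustrative, not recommended settings:

```python
# Alarm evaluation with blanking + debounce (slow-alarm path only; fast
# hardware protection is separate and must not be debounced like this).

def evaluate_alarms(samples, threshold_mv, blank_until_s, debounce_n):
    """samples: list of (t_seconds, rail_mV). Returns alarm timestamps."""
    alarms, consecutive = [], 0
    for t, mv in samples:
        if t < blank_until_s:        # blanking: skip the known load step
            consecutive = 0
            continue
        if mv < threshold_mv:
            consecutive += 1
            if consecutive == debounce_n:
                alarms.append(t)     # asserted only after N straight hits
        else:
            consecutive = 0
    return alarms

trace = [(0.1, 11400), (0.2, 11450),                # inside blanking window
         (1.0, 11400), (1.1, 11420), (1.2, 11410),  # persistent undervoltage
         (2.0, 11900)]
print(evaluate_alarms(trace, threshold_mv=11600, blank_until_s=0.5, debounce_n=3))
# [1.2]
```

Note the two samples inside the blanking window would have tripped a naive threshold check; the debounce then requires three consecutive hits before asserting at t=1.2.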
Q: How can field telemetry attribute performance jitter to power, thermal, congestion, or crypto load?
Treat attribution as correlation on a time axis: align tail latency, queue watermarks/drops, thermal/throttle events, VRM alarms, crypto offload utilization, boot/attest status, and ECC/SMART summaries. Look for ordered patterns (e.g., temperature rise → throttle → tail growth; watermark climb → drops → tail growth; offload fallback → CPU spike → tail growth). Suppress alert storms with deduplication, rate limits, and drift-aware thresholds.
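The ordered-pattern idea is a subsequence check over a time-sorted event list: an attribution hypothesis matches only if its events occur in the stated causal order. Event names and timestamps below are synthetic:

```python
# Ordered-pattern attribution sketch: does a candidate causal chain appear
# in order on the merged event timeline?

def matches_ordered_pattern(events, pattern):
    """events: list of (t, name); pattern: event names in causal order."""
    idx = 0
    for _, name in sorted(events):   # sort by timestamp
        if idx < len(pattern) and name == pattern[idx]:
            idx += 1
    return idx == len(pattern)

timeline = [(10.0, "temp_rise"), (12.5, "throttle"), (13.0, "tail_growth")]
thermal = ["temp_rise", "throttle", "tail_growth"]
congestion = ["watermark_climb", "drops", "tail_growth"]
print(matches_ordered_pattern(timeline, thermal))     # True
print(matches_ordered_pattern(timeline, congestion))  # False
```

Running every hypothesis chain against the same merged timeline is what lets one incident bundle discriminate between power, thermal, congestion, and crypto causes.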
Q: After a security upgrade failure, how to guarantee rollback and traceable evidence?
Use a staged update with an A/B or dual-image scheme: write the new image, verify it, boot into it, then commit only after health checks pass. If upgrade fails, rollback must be automatic and auditable. Preserve evidence across reboots: previous/current version IDs, policy version, boot verify status, measurement summaries, and a clear rollback reason. Validate interruption recovery by cutting power during upgrade in a controlled test.
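The staged A/B flow can be sketched as a tiny state machine: write, verify, boot, health-check, then commit, falling back to the previous slot with an evidence record on any failure. Slot names and evidence fields are illustrative; real implementations live in the bootloader/BMC firmware:

```python
# Staged A/B update sketch: commit to the new slot only after verify,
# boot, and health checks all pass; otherwise stay on the old slot and
# record an auditable rollback reason.

def staged_update(verify_ok, boot_ok, health_ok):
    """Return (active_slot, evidence dict) after an attempted A->B update."""
    evidence = {"prev": "A", "candidate": "B"}
    if not verify_ok:
        evidence["rollback_reason"] = "signature/measurement verify failed"
        return "A", evidence
    if not (boot_ok and health_ok):
        evidence["rollback_reason"] = "boot/health check failed before commit"
        return "A", evidence
    evidence["committed"] = "B"
    return "B", evidence

slot, ev = staged_update(verify_ok=True, boot_ok=True, health_ok=False)
print(slot, ev["rollback_reason"])
```

The key property is that "commit" is the last step: any interruption before it leaves the previous verified slot as the boot target, which is exactly what the interrupted-upgrade drill validates.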
Q: What is the minimal validation set that proves determinism + trust + operability?
A minimal proof set covers three pillars. Determinism: baseline counters plus a TSN/QoS on/off test that tightens p99/p999 without unexplained loss. Trust: secure/measured boot evidence, rollback protection, and interrupted-upgrade recovery. Operability: power/thermal margining (load steps + steady-state), and at least one incident drill that outputs a replayable bundle (trigger, pre/post snapshots, and a root-cause timeline).