Edge RAN Accelerator (FEC/UPF/TSN) Architecture & Design
An Edge RAN Accelerator is a PCIe-attached card that hardens a few high-impact kernels—FEC, selective UPF data-plane packet kernels, and TSN/time measurement & gate execution—to deliver more predictable throughput and latency than host software alone. In practice, success depends on engineering the full evidence loop: queues/DMA/NUMA, a coherent clock tree, and PMBus-managed power so performance remains stable and field failures become attributable and recoverable.
Focus: a PCIe x16 accelerator card that hardens FEC offload, selected UPF data-plane kernels, and time-aware/TSN execution into measurable, operable, deterministic performance—without turning this page into a DU/UPF/TSN system textbook.
H2-1 · What it is & boundary: what it accelerates (and what it doesn’t)
Definition that can be validated
An Edge RAN Accelerator is a PCIe x16 plug-in card that converts specific edge workloads into deterministic, multi-tenant, and measurable hardware/firmware execution—centered on FEC offload · UPF packet kernels · TSN/time execution.
The “accelerator” contract (three properties, each with acceptance checks)
- Pluggable (PCIe): stable enumeration, predictable reset behavior (FLR/function reset), and consistent performance across PCIe topology (root-complex/retimer changes).
- Reusable (multi-queue / multi-tenant): queue isolation, per-tenant rate limits, and backpressure control so one workload cannot poison another.
- Measurable (telemetry): counters and logs that explain outcomes—throughput, latency histograms, drop reasons, FEC error stats, thermal/power states, and time-related alarms.
Design intent: performance claims must be reproducible via traffic generators + counters + logs, not marketing peak numbers.
Workload boundary map (what is accelerated vs what stays out-of-scope)
| Workload | Accelerates | I/O object | Primary metrics | Out of scope |
|---|---|---|---|---|
| FEC | LDPC/Polar decode/encode chain and buffering hot spots (LLR → decode → HARQ soft-combine) | Bit / code-block / LLR streams + HARQ soft buffers | Gbps, code-block/s, p99/p999 latency, FER/BLER regression guardrails | DU scheduling algorithms, full 3GPP tutorial content |
| UPF kernels | Hardening selected packet/flow-table kernels (classify/count/encap/optional crypto primitives) | Packets/flows via DMA queues; per-flow state/counters | Mpps@64B, Gbps@large packets, flow count, feature-toggle performance matrix | UPF control plane, session/orchestration, full appliance integration |
| TSN/time | Hardware timestamping and time-aware execution primitives used for deterministic behavior | Timestamps, time-ref inputs, gate schedules / policing parameters | Timestamp error budget, gate schedule jitter, tail latency under shaping | TSN switch silicon architecture, grandmaster GNSS disciplining/BMCA |
Practical writing rule: whenever content stops being card/host measurable, it belongs to a sibling page and should be linked—not expanded here.
Boundary with nearby building blocks (short, hard comparisons)
- vs SmartNIC/DPU: DPU emphasizes a general programmable data plane; this page emphasizes FEC determinism, bounded latency, and time-quality hooks.
- vs UPF appliance: the appliance is the full box/system; this page covers accelerated kernels and the PCIe/telemetry contract.
- vs TSN switch: the switch owns port forwarding and queue silicon; this page focuses on timestamping + time-aware execution inside the accelerator domain.
- vs time hub/grandmaster: the time hub disciplines the reference; this page consumes a reference and enforces coherent clocking + alarms.
Suggested internal links (placeholders): SmartNIC/DPU · Edge UPF Appliance · Edge TSN Switch · Edge Time Hub
Typical deployment positions (interface-only view)
Common placements include a DU-side host server, an edge packet-processing host, or an industrial TSN edge node. The integration description remains limited to five interfaces: PCIe x16 · DMA queues · time-reference input · PMBus telemetry · thermal/power states.
Figure F1 — Where the card sits (three flows: FEC, packet, time-ref)
H2-2 · Use cases & success metrics: why “deterministic throughput/latency” matters
Why edge workloads punish “average performance”
Edge RAN and converged edge services often fail not because peak throughput is low, but because tail latency and jitter break real-time expectations. Determinism is the ability to keep throughput and latency predictable under load, feature toggles, and thermal states.
- FEC needs bounded decode time to avoid pipeline bubbles and late delivery.
- Packet kernels need stable Mpps behavior to prevent microbursts from collapsing QoS.
- TSN/time needs tight timestamp error and schedule jitter to keep control loops and industrial traffic repeatable.
Success metrics framework (what must be measurable in the field)
Use a four-quadrant acceptance model. Each quadrant must be backed by traffic generator tests + counters + logs so results are auditable.
Recommended acceptance set (minimum)
Throughput (Mpps@64B + Gbps@large) · Latency (p50/p99/p999) · Time quality (error/jitter histograms) · Power/Thermal events
Avoid single-number claims; require curves or matrices versus workload parameters and feature toggles.
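The quadrant most often faked with averages is latency. A minimal sketch of a percentile-based acceptance gate is below; the nearest-rank percentile, the sample values, and the microsecond budgets are all illustrative, not values from any spec:

```python
# Sketch of a percentile-based acceptance gate (hypothetical thresholds).
# Raw latency samples come from timestamped completion records; the gate
# is decided by the tail, not the mean.

def percentile(samples, p):
    """Nearest-rank percentile over latency samples (microseconds)."""
    ranked = sorted(samples)
    idx = max(0, min(len(ranked) - 1, int(round(p / 100.0 * len(ranked))) - 1))
    return ranked[idx]

def acceptance_gate(samples, p99_budget_us, p999_budget_us):
    """Return (passed, report) for one load point of the latency quadrant."""
    report = {
        "p50": percentile(samples, 50.0),
        "p99": percentile(samples, 99.0),
        "p999": percentile(samples, 99.9),
    }
    passed = (report["p99"] <= p99_budget_us
              and report["p999"] <= p999_budget_us)
    return passed, report

# 990 well-behaved samples plus a handful of outliers: p50 stays at 10 us,
# but whether the card "passes" depends entirely on the p999 budget.
samples = [10.0] * 990 + [40.0] * 9 + [500.0]
ok, report = acceptance_gate(samples, p99_budget_us=50.0, p999_budget_us=100.0)
tight_ok, _ = acceptance_gate(samples, p99_budget_us=50.0, p999_budget_us=30.0)
```

With the looser budget the point passes; tightening only the p999 budget flips the verdict even though p50 and p99 are unchanged—exactly the behavior a single-number claim hides.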
Workload-specific acceptance templates (copy into RFQ / test plans)
FEC offload
Gbps · code-block/s · p99/p999 latency · FER/BLER guardrail
- Required curve: throughput and p99 latency vs block length / code rate (and iteration settings if applicable).
- Hard-to-fake evidence: decode-time histogram + HARQ buffer pressure counters + timeout/fallback counters.
- Field method: controlled input streams + timestamped completion records + counters snapshot at steady state.
UPF packet kernels
Mpps@64B · Gbps@large · flow count · feature matrix
- Two-axis requirement: Mpps and Gbps must both be reported; they are not interchangeable.
- Required matrix: performance vs feature toggles (QoS/encap/crypto primitives/statistics) and flow distributions.
- Field method: traffic generator with mice/elephant mix + per-reason drop counters + queue depth telemetry.
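The "per-reason drop counters" requirement above can be sketched as a small accounting model; the reason codes are illustrative, not a real device register map:

```python
# Sketch: reason-coded drop accounting for packet kernels. A single
# "drops" total hides whether backpressure, policing, or malformed
# packets caused the loss; per-reason counters make it attributable.
from collections import Counter

DROP_REASONS = ("queue_full", "policer", "malformed", "no_flow_match")

class DropAccounting:
    def __init__(self):
        self.by_reason = Counter()

    def record(self, reason):
        # Unknown codes are still counted, never silently discarded.
        if reason not in DROP_REASONS:
            reason = "unknown"
        self.by_reason[reason] += 1

    def snapshot(self):
        """Steady-state snapshot: total plus per-reason breakdown."""
        total = sum(self.by_reason.values())
        return {"total": total, **dict(self.by_reason)}

acct = DropAccounting()
for r in ["queue_full", "queue_full", "policer", "bogus"]:
    acct.record(r)
snap = acct.snapshot()
```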
TSN/time execution
timestamp error · gate jitter histogram · tail latency under shaping
- Error budget: timestamp error distribution must be reported under idle and loaded conditions.
- Determinism proof: gate schedule jitter and tail latency (p99/p999) must remain bounded under worst-case traffic.
- Field method: reference time input + hardware timestamp logs + alarm logs for lock/holdover transitions.
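The idle-vs-loaded error-budget requirement can be sketched as a comparison harness; the nanosecond values and the 10 ns budget are illustrative numbers, not a spec figure:

```python
# Sketch: compare timestamp error distributions under idle vs loaded
# conditions. Error = hardware timestamp minus reference time; a budget
# check on max-abs error catches load-induced excursions the mean hides.
import statistics

def error_stats(hw_ts, ref_ts):
    """Per-sample error (ns) reduced to a mean/stdev/max-abs summary."""
    errs = [h - r for h, r in zip(hw_ts, ref_ts)]
    return {
        "mean_ns": statistics.mean(errs),
        "stdev_ns": statistics.pstdev(errs),
        "max_abs_ns": max(abs(e) for e in errs),
    }

def within_budget(stats, max_abs_budget_ns):
    return stats["max_abs_ns"] <= max_abs_budget_ns

ref = [1000.0 * i for i in range(8)]
idle = [t + 5.0 for t in ref]                  # constant 5 ns offset
loaded = [t + 5.0 + (20.0 if i == 6 else 0.0)  # one load-induced excursion
          for i, t in enumerate(ref)]
idle_ok = within_budget(error_stats(idle, ref), max_abs_budget_ns=10.0)
loaded_ok = within_budget(error_stats(loaded, ref), max_abs_budget_ns=10.0)
```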
Figure F2 — Four-quadrant acceptance model (determinism over peak numbers)
H2-3 · Reference architecture: three pipelines coexisting on one card
One card, four “islands” (responsibility-first)
A practical edge accelerator card is easier to validate and operate when it is organized into four functional islands: Compute · Packet I/O · Timing · Management. Each island owns a stable contract and exposes measurable evidence (counters, alarms, and event logs).
Island responsibilities (deliverables, not part numbers)
| Island | Primary responsibilities | Key resources | Must-expose evidence |
|---|---|---|---|
| Compute | FEC kernels, packet kernels, buffer scheduling, per-queue compute admission control | HBM/DDR/SRAM, on-card scratch buffers, compute pipelines | Kernel timing histograms, timeout/fallback counters, ECC error counters |
| Packet I/O | DMA engines, queue manager, backpressure handling, drop reason classification | PCIe doorbells, MSI-X, descriptor rings, completion queues | Enqueue/dequeue/drop counters, queue depth telemetry, IRQ rate counters |
| Timing | Hardware timestamping, clock mux/PLL, time-ref input handling, time alarms | Ref inputs, PLL domains, timestamp unit, holdover state | Loss-of-lock/holdover logs, time-jump detection, timestamp error stats |
| Management | PMBus telemetry, FRU/asset identity, firmware lifecycle hooks, blackbox logging | MCU/BMC-lite, PMBus sensors, non-volatile log storage | Power cap events, throttle reasons, reset causes, exportable event log |
Two packet I/O integration modes (clear boundary)
There are two common ways to connect packets to the accelerator. The choice changes validation scope and operational complexity.
- Host NIC via DMA: packets remain on the host NIC; the card accelerates selected kernels via DMA queues. This maximizes reuse and keeps physical I/O scope minimal.
- On-card SerDes/PHY: the card owns high-speed I/O and can achieve tighter timing determinism, but this adds SI/bring-up effort and expands the evidence/telemetry requirements.
How FEC / Packet / Time coexist without contaminating each other
Coexistence is an isolation problem. Three shared resources typically create hidden coupling: DMA/queues, memory bandwidth, and clock domains. Isolation requires explicit controls and measurable backpressure behavior.
- Queue isolation: SR-IOV (PF/VF), virt queues, per-tenant queue limits, and priority separation.
- Context isolation: per-tenant contexts for FEC/flow state so one tenant cannot evict another’s working set.
- Bandwidth isolation: memory/QoS arbitration (credits) to keep HARQ buffers and packet buffers from starving each other.
- Fault isolation: function-level reset and watchdog domains to recover a kernel without hard-resetting the entire card.
Figure F3 — Card-level block diagram (four islands + data/control/clock lines)
H2-4 · PCIe x16 host interface: queues, DMA, and memory decide “real” performance
Why Gen4/Gen5 x16 can still underperform
Link bandwidth is rarely the only limiter. Real-world performance is dominated by how efficiently the host can submit work, move data, and retire completions under load—while keeping tail latency bounded.
- Small packets are dominated by per-packet overhead (doorbells, descriptors, interrupts), not raw bandwidth.
- Tail latency grows when queue pressure and memory topology (NUMA/IOMMU/cache) inject jitter into the pipeline.
- Multi-tenant isolation adds additional scheduling/backpressure layers that must be explicitly tuned.
Engineering levers (practical knobs that can be tested)
The following knobs map directly to measurable counters and should be adjusted in a controlled A/B manner:
- DMA submission: scatter-gather depth, batching size, doorbell frequency, completion polling vs interrupt.
- Interrupt path: MSI-X vector allocation, interrupt coalescing, CPU affinity pinning.
- Memory topology: hugepages, NUMA pinning, cache locality, IOMMU vs ATS behavior.
- Queues & isolation: PF/VF layout, queue depth, per-tenant rate limiting, backpressure thresholds.
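The knobs above only produce auditable results when exactly one changes per run and every run is tagged with the full knob set. A minimal A/B sweep harness is sketched below; `run_benchmark` is a stand-in for a real traffic-generator invocation, and its toy cost model (deeper queues buy throughput, charged as tail latency) is purely illustrative:

```python
# Sketch of a controlled A/B knob sweep: one knob varies per run,
# everything else stays pinned to the baseline, and each result record
# carries the full knob set so outcomes remain attributable.

BASELINE = {"queue_depth": 256, "batch_size": 32, "irq_coalesce_us": 50}

def run_benchmark(knobs):
    """Placeholder for a real harness. The toy model rewards deeper
    queues with throughput and charges them with p99 latency."""
    mpps = 10.0 + knobs["queue_depth"] / 256.0
    p99 = 20.0 + knobs["queue_depth"] / 64.0 + knobs["irq_coalesce_us"] / 10.0
    return mpps, p99

def ab_sweep(baseline, knob, values):
    """Vary a single knob; everything else pinned to baseline."""
    results = []
    for v in values:
        knobs = dict(baseline, **{knob: v})
        mpps, p99 = run_benchmark(knobs)
        results.append({"knobs": knobs, "mpps": mpps, "p99_us": p99})
    return results

sweep = ab_sweep(BASELINE, "queue_depth", [128, 256, 512, 1024])
```

Even the toy model reproduces the throughput-vs-tail trade-off of the table below: the deepest queue setting wins on Mpps and loses on p99.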
Throughput vs latency vs jitter: trade-offs that must be explicit
High throughput often pushes toward deeper queues and larger batches, while low latency pushes toward shallow queues and tighter scheduling. Jitter typically appears when the completion path becomes bursty (interrupt storms) or when memory access becomes non-local (NUMA/IOMMU effects).
| Goal | Typical strategy | Common side effect (watch counters) |
|---|---|---|
| Max throughput | Larger batches, deeper queues, aggressive coalescing, higher concurrency | Tail latency inflation; queue pressure spikes; periodic completion bursts |
| Min latency | Smaller batches, bounded queue depth, CPU pinning, polling on hot path | Higher CPU cost; lower peak; risk of underutilizing DMA bandwidth |
| Min jitter | Stable topology (NUMA-local), consistent IRQ rates, predictable backpressure | Requires strict resource partitioning; multi-tenant fairness must be explicit |
Figure F4 — PCIe queues & DMA data path (jitter injection points highlighted)
H2-5 · FEC acceleration deep dive: hardened path from LLR to HARQ
What “FEC offload” actually hardens
FEC acceleration is not a single block. A usable accelerator hardens an end-to-end hot path that starts with LLR ingress and ends at HARQ soft combine, with measurable boundaries and explicit fallback behavior. The goal is not only peak throughput, but predictable p99/p999 decode latency and repeatable quality guardrails.
LLR → Rate match → LDPC/Polar → CRC → HARQ
Pipeline breakdown (stage responsibilities and evidence points)
- LLR ingress: defines bit-width, packing/alignment, and queueing granularity. Poor alignment inflates DMA traffic and buffer churn. Evidence: ingress counters, descriptor errors, per-queue backlog.
- Rate matching: applies deterministic rules that can become a hidden hotspot when implemented with small random accesses. Evidence: stage time histogram, memory reads per block.
- LDPC/Polar decode/encode: dominates compute and tail latency; iteration behavior must be bounded or scheduled explicitly. Evidence: iteration distribution, timeout counters, decode-time histogram.
- CRC: provides a fast, auditable correctness checkpoint and a trigger for retry/fallback decisions. Evidence: CRC pass/fail counts, retry reasons.
- HARQ soft combine: is often the true bottleneck because it stresses memory bandwidth and random access patterns. Evidence: soft-buffer read/write counters, queue pressure, bandwidth saturation flags.
Performance bottleneck map (what usually limits real deployments)
| Bottleneck class | Typical symptom | Must-watch evidence | Practical lever |
|---|---|---|---|
| Soft-buffer bandwidth | Throughput plateaus while compute is not fully utilized; p99 latency spikes under load | HBM/DDR R/W counters, buffer hit/miss, queue depth | Memory QoS/credits, locality-aware buffering, reduce random accesses |
| Parallelism & scheduling | Average latency looks fine but p99/p999 expands; long-tail blocks dominate | Decode-time histogram, iteration distribution, timeout rate | Bounded iterations, tiered queues, admission control |
| LLR quantization trade-off | Lower bit-width improves throughput but degrades quality; higher bit-width saturates bandwidth | FER/BLER guardrail counters + throughput curve | Bit-width profiles (safe/aggressive) + explicit test matrix |
Reliability: detection, watchdog, and fallback
A hardened path must fail in an observable and controllable way. The minimum reliability loop is: detect → contain → fallback → log.
- Error detection: CRC failures, invalid descriptors, ECC faults, and stage timeouts.
- Watchdog domains: per-kernel watchdog and function-level reset to avoid full-card disruption.
- Fallback policy: fail-open keeps service continuity by falling back to software at reduced performance; fail-closed prioritizes correctness/determinism by blocking or alarming when invariants break.
- Evidence: every fallback or reset should emit a timestamped event record with a reason code.
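The detect → contain → fallback → log loop, with the fail-open vs fail-closed policy choice, can be sketched as follows; the reason codes and the software-decode stand-in are illustrative, not a real driver API:

```python
# Sketch of the FEC reliability loop. A stage timeout (detect) either
# triggers a software fallback (fail-open) or blocks with an alarm
# (fail-closed); every action emits a timestamped, reason-coded record.
import time

class FecPath:
    def __init__(self, policy="fail-open"):
        self.policy = policy        # "fail-open" or "fail-closed"
        self.events = []            # timestamped, reason-coded event records

    def _log(self, reason, action):
        self.events.append({"ts": time.time(),
                            "reason": reason, "action": action})

    def decode(self, block, hw_decode, sw_decode):
        try:
            return hw_decode(block)              # normal hardened path
        except TimeoutError:                     # detect: stage timeout
            if self.policy == "fail-open":
                self._log("hw_timeout", "sw_fallback")
                return sw_decode(block)          # contain + fallback
            self._log("hw_timeout", "blocked")   # fail-closed: alarm, no output
            raise

def hw_stuck(block):
    raise TimeoutError("decode watchdog fired")

path = FecPath("fail-open")
out = path.decode(b"\x01", hw_stuck, sw_decode=lambda b: b)
```

The same `decode` call under a fail-closed policy would re-raise instead of returning, which is the "prioritize correctness by blocking" behavior described above.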
Figure F5 — FEC pipeline + buffers + parallelism + counters (card-level view)
H2-6 · UPF / packet kernels on an accelerator: what is worth hardening (and what to avoid)
Keep “UPF acceleration” at kernel scope
Accelerator-friendly UPF work is best described as packet kernels—repeatable match/action building blocks that can be queued, isolated, and measured. This scope prevents the page from expanding into a full UPF system description.
match · action · state · counters · backpressure
Hardenable kernel checklist (with I/O shape, state, and evidence)
| Kernel | I/O shape | State footprint | Must-have evidence |
|---|---|---|---|
| Flow lookup (hash/ACL/TCAM-like) | Packet headers → rule/action index | Flow table entries, eviction policy | Hit/miss/evict counters, lookup latency histogram |
| Stats / counters | Per-flow updates at line rate | Counter memory, overflow handling | Update drops/overflow flags, per-flow sampling |
| Encap / decap | Header rewrite + tunnel metadata | Profile table (tunnel params) | Malformed/drop reasons, per-profile throughput |
| Checksum / validate | Header fields + payload slices | Light/no state | Bad checksum counters, exception reasons |
| Rate limit / shaping | Packet timestamps + token accounting | Per-flow tokens/queues | Shaper drops, queue delay stats, tail latency |
| Optional crypto primitive | Payload blocks + context selector | Key context (kernel scope only) | Crypto on/off performance matrix, error counters |
Coexistence with FEC: isolation and backpressure that must be explicit
When packet kernels share the same card with FEC offload, hidden coupling typically occurs through memory bandwidth, queue priority, and completion burstiness. A stable design makes isolation policies visible and auditable.
- Bandwidth arbitration: credit-based QoS to protect HARQ buffers during packet bursts.
- Queue separation: independent per-tenant queues and priority tiers; avoid head-of-line blocking.
- Backpressure policy: thresholds and drop reasons must be defined (not a single “drop” counter).
- Evidence: per-tenant throughput + p99 latency must remain bounded when the other pipeline saturates.
Figure F6 — Packet kernel chain (match → action → output) with state and evidence
H2-7 · TSN/time features: why a card needs time consistency and hardware scheduling
What TSN/time means on an accelerator card (and what it does not)
On an accelerator card, TSN/time features are not about implementing a full TSN switch. The practical scope is measurable hardware timestamps, time-aware queue gating, and per-stream protection that keep deterministic latency and throughput stable under bursty workloads.
HW timestamp · Gate execution · Per-stream policing · Error budget · Degrade mode
Hardware timestamp: measurement and closed-loop control
Hardware timestamps provide a clock-referenced signal that turns “determinism” into something testable. They are used to measure queueing delay, kernel execution time, and schedule alignment—so that timestamp error and drift can be detected before tail latency becomes unstable.
- Primary use: validate latency budgets and alignment of gate windows (measurement → action).
- Must-have outputs: timestamp error statistics, drift/time-jump events, and per-queue delay counters.
- Failure visibility: loss-of-lock or holdover transitions should emit reason-coded event logs.
Time-aware scheduling (concept scope): gate table execution and jitter sources
Time-aware scheduling on a card means a gate table is executed against a time reference to open/close queue windows predictably. The engineering focus is not on the full protocol, but on gate execution jitter and where it is injected.
| Jitter injection point | Typical symptom | Evidence to capture |
|---|---|---|
| PLL / reference instability | Gate windows drift vs time base; timestamp error rises even when traffic is stable | Lock/holdover events, phase/jitter health flags, time error stats |
| Firmware control latency | Gate updates apply late; occasional schedule misalignment under load | Gate-update latency histogram, missed-window counters |
| IRQ / CPU participation | Long-tail scheduling jitter correlated with host interrupts or contention | IRQ/coalescing stats, queue depth spikes, tail-latency correlation logs |
Per-stream policing: protect determinism from a single bad stream
Per-stream policing is the practical safeguard that prevents one misbehaving stream (burst, malformed pacing, or unexpected rate) from breaking queue determinism. The implementation focus is on stream identification, simple state (tokens/counters), and reason-coded outcomes.
- Input: stream classification result and traffic profile selection.
- State: token accounting and per-stream counters (concept scope).
- Output: drop/mark/shape decisions with explicit reason counters (not a single “drops” total).
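The token accounting and reason-coded outcomes above can be sketched as a classic token bucket; the rate/burst values are illustrative, and a real device would keep this state in hardware and expose the counters read-only:

```python
# Sketch of per-stream token-bucket policing with reason-coded outcomes.
class StreamPolicer:
    def __init__(self, rate_bps, burst_bytes):
        self.rate = rate_bps / 8.0        # refill rate in bytes/second
        self.burst = float(burst_bytes)
        self.tokens = float(burst_bytes)  # start with a full burst allowance
        self.last_t = 0.0
        self.counters = {"pass": 0, "drop_rate": 0}

    def admit(self, pkt_bytes, now):
        # Refill tokens for elapsed time, capped at the burst size.
        self.tokens = min(self.burst,
                          self.tokens + (now - self.last_t) * self.rate)
        self.last_t = now
        if self.tokens >= pkt_bytes:
            self.tokens -= pkt_bytes
            self.counters["pass"] += 1
            return "pass"
        self.counters["drop_rate"] += 1   # reason-coded, not a bare "drop"
        return "drop_rate"

pol = StreamPolicer(rate_bps=8_000, burst_bytes=1500)  # 1 kB/s, 1500 B burst
first = pol.admit(1000, now=0.0)    # within burst -> pass
second = pol.admit(1000, now=0.0)   # burst exhausted -> reason-coded drop
third = pol.admit(1000, now=1.0)    # 1 s refill (1000 B) -> pass again
```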
Time reference interface and safe degradation
A card should define how time reference enters the device, how alarms are raised, and how the system degrades when reference quality drops. Common mechanisms include holdover entry/exit logic, loss-of-lock alarms, and time-jump detection that triggers scheduling protection or conservative modes.
- Reference quality alarms: loss-of-lock, holdover, and ref switching events.
- Degrade policy: reduce strict gating, switch to measurement-only mode, or raise service alarms.
- Evidence: timestamp error and gate jitter should remain auditable throughout transitions.
Figure F7 — Timestamp + gate scheduling path (with jitter injection points)
H2-8 · Coherent clock tree: make jitter, phase, and sync alarms actionable
Why “synchronized” can still be unstable
A system can report “in sync” while still showing unstable determinism because the card may be suffering from reference switching, PLL lock transitions, or cross-domain drift. A coherent clock tree makes time distribution explicit, auditable, and compatible with measurement and scheduling on the same device.
ref inputs → mux → PLL → domains → alarms
Reference inputs and selection: treat switching as a first-class event
Typical reference candidates include 1PPS/10MHz, SyncE-derived reference, and PTP-derived reference signals. The engineering requirement is to define how the card selects inputs, how it reports health, and how it behaves during transitions (hitless where possible, or alarmed with bounded impact).
- Ref selection: explicit mux policy with health checks and priority rules.
- Transition visibility: ref-switch events and lock recovery time must be logged.
- Holdover: define entry/exit conditions and quality reporting during holdover.
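The mux policy above (priority rules, health gating, logged transitions) can be sketched in a few lines; the source names and boolean health flags are illustrative placeholders for real signal-quality checks:

```python
# Sketch of an explicit reference-selection policy: priority-ordered
# candidates, health gating, and an event record for every switch or
# holdover entry, so transitions are never silent.
class RefMux:
    PRIORITY = ["1pps", "synce", "ptp"]   # highest priority first

    def __init__(self):
        self.active = None
        self.events = []

    def select(self, health):
        """health: dict source -> bool. Returns active source; logs switches."""
        for src in self.PRIORITY:
            if health.get(src, False):
                if src != self.active:
                    self.events.append({"event": "ref_switch",
                                        "from": self.active, "to": src})
                self.active = src
                return src
        if self.active is not None:
            self.events.append({"event": "holdover_entry", "from": self.active})
        self.active = None                 # no healthy ref: enter holdover
        return None

mux = RefMux()
mux.select({"1pps": True, "synce": True})   # picks 1pps, logs initial switch
mux.select({"1pps": False, "synce": True})  # falls to synce, logs switch
mux.select({})                              # all refs lost -> holdover entry
```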
Clock tree building blocks: responsibilities and risks
| Block | Responsibility | What can go wrong (actionable) |
|---|---|---|
| Clock mux | Select reference sources and expose switching events | Switch glitches or unexpected source changes; require event logging + policy lockout |
| PLL | Filter jitter, discipline the local clock, and support holdover modes | Lock transitions inject time error; phase noise increases timestamp error and gate jitter |
| Clock buffer | Fanout and isolate domains while controlling skew | Skew and power sensitivity create cross-domain drift; require domain health checks |
Coherent domains: keep compute, timestamp, and optional I/O under one time base
Coherent distribution means a single time base is delivered to domains that must agree: the timestamp domain (measurement), the compute/scheduling domain (execution), and an optional I/O domain (only when the card owns timing-sensitive I/O).
- Timestamp domain: highest sensitivity; defines measurement truth.
- Compute/scheduling domain: must stay aligned with timestamp domain to avoid schedule drift.
- Optional I/O domain: keep coherent only when required; otherwise isolate to reduce coupling.
Alarms and diagnostics: make time quality visible
A coherent clock tree is only useful if it is diagnosable. Minimum card-level telemetry should cover: loss-of-lock and holdover transitions, time-jump detection, and drift trend tracking that correlates with scheduling jitter and latency outliers.
- Loss-of-lock: identify which PLL/ref source failed and how long recovery took.
- Holdover entry/exit: record duration and time error growth indicators.
- Time jump detection: detect and log discontinuities that can break gate execution.
- Drift trend: rolling statistics for early warning and postmortem correlation.
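The drift-trend and time-jump items above combine naturally into one monitor; the window size and 100 ns jump threshold below are illustrative tuning values, not recommendations:

```python
# Sketch: rolling drift trend with a time-jump detector. Slow drift moves
# the rolling mean (early warning); a discontinuity between consecutive
# samples raises a reason-coded alarm (gate-execution hazard).
from collections import deque

class DriftMonitor:
    def __init__(self, window=16, jump_ns=100.0):
        self.errors = deque(maxlen=window)   # recent time-error samples (ns)
        self.jump_ns = jump_ns
        self.alarms = []

    def sample(self, err_ns):
        if self.errors and abs(err_ns - self.errors[-1]) > self.jump_ns:
            self.alarms.append({"alarm": "time_jump",
                                "delta_ns": err_ns - self.errors[-1]})
        self.errors.append(err_ns)

    def trend(self):
        """Rolling mean as an early-warning drift indicator."""
        return sum(self.errors) / len(self.errors) if self.errors else 0.0

mon = DriftMonitor(window=8, jump_ns=100.0)
for e in [5.0, 6.0, 7.0, 8.0]:
    mon.sample(e)            # slow drift: no alarm, trend creeps up
mon.sample(250.0)            # discontinuity: time-jump alarm
```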
Figure F8 — Coherent clock tree: ref → mux → PLL → domains + alarms
H2-9 · PMBus-managed power: not “it powers on”, but “cap, audit, and accountability”
Why high-performance cards throttle, crash, or disappear after deployment
Field issues often come from power being treated as “enable rails and hope” instead of a closed-loop system. Typical failure patterns include sequencing dependency violations, transient droop/overshoot that trips protection, thermally-driven power walls, and a lack of time-aligned evidence linking power events to PCIe/DMA instability.
Typical rail tree and dependencies: “who must be stable before whom”
A practical accelerator power tree is best described by responsibilities and dependencies rather than a long list of rail names. The key is to make sequencing rules explicit so that “random hangs” become explainable.
- Core rail: largest load steps; primary source of transient stress and power wall behavior.
- SerDes rail: sensitive to noise; instability often shows up as link errors or retraining events.
- HBM/DDR rail: training + ECC behavior is strongly coupled to temperature and droop margins.
- PLL/clock rail: small current but high sensitivity; lock quality impacts timestamp and schedule stability.
- Aux rail: management MCU/PMBus/telemetry/logging must remain alive to explain failures.
PMBus loop: telemetry → control → evidence
PMBus-managed power turns power into an observable and controllable subsystem with accountability. The goal is not only protection, but also predictable performance under caps and audit-ready root cause trails.
| PMBus capability | What it enables | Field acceptance evidence |
|---|---|---|
| Telemetry (V/I/T/P) | State-based power profiling (idle/steady/burst/thermal), peak vs duration visibility | Per-rail min/avg/max, peak duration histograms, temperature correlation |
| Power capping | Make “power wall” a predictable limiter; avoid uncontrolled droop trips under burst | Cap value + enforcement counters, stable performance curve under cap, no protection oscillation |
| Fault & event logs | Reason-coded accountability (OCP/OVP/UV/OTP), postmortem without guesswork | Rail ID + reason + timestamp + duration, snapshot of cap/thermal state at event time |
Thermal-power coupling: throttle policy must be explainable
Thermal triggers often cause “mysterious frequency drops” unless the policy is explicit and logged. A robust design exposes temperature states, cap states, and transition reasons so that throttling is predictable and can be verified during acceptance testing.
- Thermal thresholding: include hysteresis and clear recovery criteria.
- Dynamic capping: allow cap to tighten as temperature rises (prevent runaway).
- Host interaction: export thermal/power states and alarms (signals, not platform-wide control theory).
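The hysteresis and dynamic-capping bullets above can be sketched as a small state machine; the temperature thresholds, cap values, and state names are illustrative, and a real card would enforce the cap through PMBus while logging every transition with a reason:

```python
# Sketch of a thermally coupled power-cap policy with hysteresis: each
# level enters above one temperature and exits below a lower one, so the
# cap cannot oscillate at a single threshold.
class ThermalCapPolicy:
    # (name, enter_above_C, exit_below_C, cap_W); exit < enter = hysteresis
    LEVELS = [("hot", 85.0, 80.0, 150.0), ("warm", 75.0, 70.0, 200.0)]
    DEFAULT_CAP = 250.0

    def __init__(self):
        self.state = "cool"
        self.log = []            # reason-coded transition records

    def update(self, temp_c):
        for name, enter, exit_, cap in self.LEVELS:
            if temp_c >= enter or (self.state == name and temp_c >= exit_):
                if self.state != name:
                    self.log.append({"to": name, "temp_c": temp_c, "cap_w": cap})
                self.state = name
                return cap
        if self.state != "cool":
            self.log.append({"to": "cool", "temp_c": temp_c,
                             "cap_w": self.DEFAULT_CAP})
        self.state = "cool"
        return self.DEFAULT_CAP

pol = ThermalCapPolicy()
caps = [pol.update(t) for t in (60.0, 78.0, 73.0, 69.0)]
# 60 C -> 250 W; 78 C -> 200 W (enter warm); 73 C -> still 200 W
# (hysteresis holds below the 75 C entry); 69 C -> back to 250 W
```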
Figure F9 — Rail tree + PMBus monitoring points (sense / limit / action / log)
H2-10 · Reliability & protection: “avoid damage, recover fast, prove what happened”
Reliability target: controlled failure, staged protection, verified recovery
Field reliability is not defined by “never failing,” but by confining failures into predictable behaviors: detect early, isolate impact, recover through staged resets, and preserve an evidence chain that explains every action.
Common failure modes (layered) and what to capture
| Layer | Typical symptom | Evidence (minimum) |
|---|---|---|
| PCIe link | Downtrain, link reset, AER bursts, device disappears/re-enumerates | AER counters, link state transitions, ref/power events around the same timestamp |
| DMA / queues | Queue progress stalls, doorbell stuck, tail latency explodes | Queue depth + progress counters, timeout reasons, reset attempts and outcomes |
| Memory / ECC | Correctable spikes or uncorrectable triggers recovery | ECC counters by region, temperature/power snapshot, workload state |
| PLL / time | Loss-of-lock, timestamp error growth, schedule jitter increases | Lock/holdover events, time-jump detection, gate jitter stats |
| VRM / rails | Throttle, brownout resets, unstable behavior under burst | OCP/UV/OTP logs with rail id + duration + timestamp, cap state at event time |
| Firmware | Heartbeat stops, watchdog triggers, repeated resets | Heartbeat counters, watchdog reason codes, last-known state snapshot |
Protection ladder: mitigate first, reset narrowly, fall back safely
The safest recovery strategy is staged: attempt mitigation before disruptive resets, and prefer function-level recovery before full card resets. This reduces collateral damage and shortens service restoration.
- Mitigate: cap power, throttle queues, reduce burstiness (keep service if possible).
- Function-level reset (FLR): reset only the impacted function or kernel path; preserve other services.
- Card reset: last resort; use only when narrower recovery fails or safety requires it.
- Fallback: degrade mode or software path when hardware is not trustworthy.
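The ladder above is an escalation policy: try the narrowest step first and keep an audit trail of every attempt. A minimal sketch follows; the step names and the boolean attempt API are illustrative, and real steps would drive driver or management-controller hooks:

```python
# Sketch of the staged recovery ladder: mitigate -> FLR -> card reset ->
# software fallback, escalating only when a narrower step fails, with a
# reason-coded record per attempt.
def staged_recovery(fault, steps):
    """steps: ordered list of (name, action); action(fault) -> bool.
    Returns the audit trail of every attempted step."""
    trail = []
    for name, action in steps:
        ok = action(fault)
        trail.append({"step": name, "fault": fault, "recovered": ok})
        if ok:
            break
    return trail

# Toy fault model: mitigation cannot clear a DMA hang, but FLR can.
LADDER = [
    ("mitigate",    lambda f: f == "thermal_spike"),
    ("flr",         lambda f: f in ("thermal_spike", "dma_hang")),
    ("card_reset",  lambda f: True),
    ("sw_fallback", lambda f: True),
]

trail = staged_recovery("dma_hang", LADDER)
# mitigate fails, FLR succeeds: two records, no full card reset needed
```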
Watchdog and heartbeats: “detect quickly, reset correctly, record everything”
A watchdog should observe both control-plane health (heartbeat) and data-path progress (queue forward motion). When triggers occur, recovery should follow the ladder policy and record a consistent snapshot for postmortem.
- Inputs: firmware heartbeat, queue progress, thermal state, cap state, PLL lock health.
- Actions: mitigation → FLR → card reset (escalate only when required).
- Records: reason code + timestamp + “before/after” state (power/thermal/queue).
Evidence chain: make field debugging deterministic
A minimal evidence set should allow correlation across power, time, PCIe, DMA, and firmware—on a shared timeline. Without timestamp alignment, the same symptom will be misdiagnosed repeatedly.
- Single time axis: power events, lock events, queue timeouts, and resets must share a comparable timestamp base.
- Minimum artifacts: counters + last N event logs + snapshot (temp/power/queue depth/cap state).
- Last-gasp: preserve final events across power loss when possible (for true accountability).
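The single-time-axis requirement above reduces to a timestamp-ordered merge of the per-subsystem logs. A minimal sketch, assuming each log is already sorted on a shared timestamp base (the field names are illustrative):

```python
# Sketch: merge power, PCIe, and firmware event logs onto one time axis
# so a reset can be read next to the droop that preceded it.
import heapq

def merge_timelines(*logs):
    """Each log is a time-sorted list of {'ts': float, ...} events."""
    return list(heapq.merge(*logs, key=lambda e: e["ts"]))

power = [{"ts": 10.0, "src": "pmbus", "event": "uv_warn", "rail": "core"}]
pcie  = [{"ts": 10.2, "src": "pcie", "event": "aer_burst"},
         {"ts": 10.5, "src": "pcie", "event": "link_retrain"}]
fw    = [{"ts": 10.4, "src": "fw", "event": "watchdog_flr"}]

timeline = merge_timelines(power, pcie, fw)
# ordered view: uv_warn -> aer_burst -> watchdog_flr -> link_retrain,
# i.e. the rail droop precedes the link symptoms it likely caused
```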
Figure F10 — Fault → protection action → recovery path (with evidence bus)
H2-11 · Validation & field-debug checklist
The goal is not “peak speed”, but repeatable proof across deterministic throughput · deterministic latency/jitter · recoverability · auditability (R&D → production → field).
Definition of “done”: acceptance gates that do not lie
Treat “done” as four gates. Each gate must be reproducible on a bench and collectible in the field as evidence:
- Performance gate: Throughput (Gbps / codeblocks·s⁻¹ / Mpps) stays stable in the target load envelope; p99/p999 latency does not drift.
- Determinism gate: A clear jitter/error budget (timestamp error, gate schedule jitter, tail latency) with diagnosable injection points.
- Power/Thermal gate: No unexplained downclock/reset under power cap and thermal boundary; PMBus telemetry matches protection actions.
- Recoverability gate: DMA hang, PLL unlock, VRM fault → graded reset + software fallback, with time-aligned logs.
R&D validation matrix (table-first)
Write validation as “stimulus + expected behavior + narrowing hints”, not a pile of benchmark screenshots.
| Area | Test stimulus | Pass criteria | If it fails, look at |
|---|---|---|---|
| FEC pipeline | Rate/block/LLR bit-width sweep; HARQ soft-combine pressure (near-full buffers) | Smooth throughput curve; p99 latency does not “step” near saturation; CRC/decoder errors are explainable | HBM/DDR bandwidth, soft-buffer watermarks, iteration histogram, timeout/fallback counters |
| Packet kernels | Separate 64B Mpps vs large-packet Gbps; flow scale 10⁵→10⁷; action toggles (meter/crypto/encap) | Mpps and Gbps both meet targets; miss/evict behavior is predictable; toggles don’t explode tail latency | Queue backpressure, lookup miss/miss penalty, DMA batching, IRQ moderation/congestion |
| TSN/time | Fixed gate schedule + controlled perturbations; timestamp sampling; holdover enter/exit | Gate jitter within budget; timestamp error closes; no time jump during holdover transitions | PLL lock state, clock mux disturbance, firmware scheduling jitter, IRQ preemption |
| PCIe/DMA | Queue depth scan; NUMA remote memory; IOMMU/ATS toggles; doorbell pressure | Throughput/latency/jitter stays explainable via knobs; AER counters do not grow | AER/replay, IOMMU faults, queue depth, MSI-X / interrupt moderation |
| Power/Thermal | Power-cap sweep; thermal step; fan curve perturbation; VRM fault injection (OCP/UV) | No unexplained downclock/reset; telemetry matches thresholds/actions; logs are time-aligned | PMBus sampling/filtering, cap hit counts, VRM fault latches, sensor consistency |
| Reliability | Soak (72h+); controlled insert/remove per spec; firmware update/rollback drills | No silent ECC growth; graded reset restores service; signature checks don’t false-fail | ECC counters, watchdog reasons, FLR success rate, secure-boot reason codes |
Production screening: prove every shipped card behaves the same
- PCIe margin & stability: link training, AER growth, retimer/connector combinations; fixed script → comparable reports.
- Sensor & PMBus sanity: V/I/T readings cross-checked against external meters; threshold actions match event logs.
- Thermal signature: temperature-vs-power curve stays inside a golden envelope (flags assembly/thermal interface variance).
- Firmware integrity: signature, version consistency, rollback drill, FRU fields for traceability.
Field debug playbook: symptom → first checks → safe rollback
| Symptom | Check first (fast) | Likely root-cause class | Rollback knob & evidence |
|---|---|---|---|
| “shakes then hangs” | Queue occupancy, DMA timeout, AER replay/error, NUMA locality | DMA backpressure, IRQ/doorbell storm, IOMMU/ATS interaction, host memory jitter | Reduce queue depth / disable ATS / pin NUMA; capture AER + DMA snapshot + traces |
| Throughput OK, latency spikes | Interrupt moderation, batch size, buffer watermarks, firmware ticks | Over-batching, near-full buffers, firmware scheduling jitter | Reduce batch / tighten watermark alerts; capture p99/p999 + watermark timeline |
| “Synced” but unstable time | PLL lock/holdover flags, time-jump detector, ref-switch events | Noisy/unstable ref, mux switching disturbance, cross-domain drift | Fix ref source / tighten alarms; capture lock timeline + error histogram |
| Random downclock/reset | PMBus fault log (OCP/UV/OTP), cap hit counts, thermal hotspots | VRM transient weakness, thermal contact issues, threshold misconfiguration | Lower cap / increase cooling; capture telemetry + event log (same timebase) |
| FEC regression | LLR bit-width setting, HARQ buffer ECC, iteration distribution | Quantization tradeoff, ECC pressure, tail-latency amplification | Rollback LLR profile / disable aggressive mode; capture BER/BLER vs config |
H2-12 · BOM / IC selection checklist (with example part numbers)
Format: Category → criteria → why it matters → how to verify → representative MPNs. MPNs are examples to anchor procurement and verification (not a “model-only” shopping list).
How to use this checklist
- Define acceptance first: throughput/latency/jitter, power-wall behavior, recovery path → then choose compute/memory/clock/power.
- Prefer observable parts: status/alarm/log interfaces (PMBus, lock flags, AER/ECC counters, secure events).
- Attach a verification step to each key part: e.g., retimers must have margin/AER soak coverage, or field proof becomes impossible.
Selection table (criteria + verification + example MPNs)
| Category | Key criteria (5–8) | Why it matters | How to verify | Example MPNs |
|---|---|---|---|---|
| Compute (FPGA/ASIC/SoC) | Perf/W per kernel; memory bandwidth; queue/virtualization (PF/VF/SR-IOV depending on product shape); upgradability (FW/bitstream); debug visibility; toolchain maturity; secure boot support | Defines what can be hardened across FEC/packet/time pipelines, and strongly shapes tail latency and recoverability (watchdog, graded reset, fallback). | Run target regressions (throughput curve + p99); inject faults to validate reset/fallback; perform update/rollback drills. | AMD Versal Premium VP1902; Altera Agilex 7 AGFB027R24C2E4X; Intel FPGA PAC N3000 |
| Memory (HBM/DDR) | Bandwidth/capacity/power; ECC modes and counters; access-pattern fit (HARQ/soft buffer); thermal behavior; training/refresh stability; failure isolation | HARQ/soft information can saturate bandwidth and amplify tail latency; memory design decides whether performance remains predictable. | Near-full watermark regressions; ECC injection/counters; thermal-step stability checks. | Micron DDR4 MT40A512M16JY-083E:B |
| PCIe fabric (switch/retimer/redriver) | Target Gen4/Gen5 rate; lane count & topology; error visibility (AER); protocol-aware retiming; thermal/power; refclk modes; SI/layout constraints; production-friendly margin flow | “x16 on paper” can still be unstable. Retries translate into jitter and tail latency, and are hard to disprove in the field without counters. | PCIe margin + AER soak across cable/riser/thermal combinations; correlate AER growth with latency spikes. | Broadcom PEX88096; TI retimer DS160PT801; TI redriver DS80PCI810 |
| Clocking (jitter clean / DPLL) | Jitter class & outputs; ref inputs (1PPS/10MHz/PTP/SyncE); mux switch disturbance; holdover behavior; lock/alarm pins; domain partitioning; NVM-config stability | A coherent clock tree is what makes timestamps and hardware scheduling credible. Many “synced but unstable” issues are clock-domain management problems. | Lock/holdover event injection; timestamp error histograms; ref-switch disturbance tests. | Jitter attenuator Si5345; network-sync DPLL AD9545 |
| Power (VRM + PMBus) | Multiphase transient response; PMBus telemetry; power-capping loop quality; fault logging (OCP/UV/OTP); NVM config; phase shedding; thermal/fan policy hooks | “Downclock on deployment” is often power-wall + transient behavior. PMBus closed-loop power makes faults auditable and controllable. | Cap sweep; load-step transient tests; fault injection; align telemetry with external meters. | ADI LTC3880; TI TPS53679; Renesas ISL68224 |
| Sensing (I/V/T monitors) | Accuracy & sampling; alarms; SMBus/I²C; multi-point placement; calibration flow; EMI robustness; data path to MCU/PMBus | Field debugging depends on trustworthy sensors. If telemetry lies, power/thermal root cause can’t be closed. | Cross-check with external meters; drift under thermal shocks; alarm/action consistency. | Power monitor INA228; temp sensor TMP117 |
| Mgmt & Security (MCU/TPM/SE) | Secure/measured boot; firmware signing; update/rollback safety; event-log integrity; key storage; SPI/I²C interfaces; reset/power sequencing support; keep-alive behavior | Auditability needs a trust anchor: firmware/config/fault logs must be provably authentic and consistent across fleets. | Secure-boot negative tests; signature & rollback drills; log-integrity checks; reason-code coverage. | Secure element ATECC608B; TPM 2.0 SLB 9670 |
| Reference cards (BOM anchoring) | Stable P/N and a reusable “system template” (power, thermals, firmware workflow); lets you copy verification scripts | Running your full validation flow on a mature reference card reduces bring-up risk before committing a custom PCB. | Mirror your throughput/latency/power/log fields on the card and confirm they match expectations. | AMD Alveo U55C (A-U55C-P00G-PQ-G) |
Representative BOM shortlist (starter)
A practical pool of example MPNs to kick-start a BOM sheet; bind each item to a verification action:
- PCIe switch: Broadcom PEX88096
- PCIe retimer: TI DS160PT801
- PCIe redriver: TI DS80PCI810
- Jitter cleaner: Skyworks/Silicon Labs Si5345
- Network sync DPLL: Analog Devices AD9545
- PMBus digital power: ADI LTC3880 / TI TPS53679 / Renesas ISL68224
- Power monitor: TI INA228
- Temperature sensor: TI TMP117
- Secure element: Microchip ATECC608B
- TPM: Infineon OPTIGA TPM SLB 9670
- DDR4 example (anchor for “specific MPN”): Micron MT40A512M16JY-083E:B
- Compute anchors: AMD VP1902 / Altera AGFB027R24C2E4X / Intel N3000
H2-13 · FAQs (12)
Each answer is scoped to an Edge RAN Accelerator card: FEC/packet-kernel/TSN-time acceleration over PCIe, plus coherent clocking and PMBus-managed power/telemetry. Control-plane and full system architecture are intentionally out of scope.