
UPF Inline Accelerator Card (FPGA/ASIC, PCIe, PMBus Telemetry)


A UPF Inline Accelerator Card is a PCIe add-in module that hardens the UPF dataplane hot path—flow-table match/action and hardware timestamping—while exposing PMBus telemetry so performance stays deterministic and operable in the field. Its real value is not “peak Gbps,” but predictable 64B Mpps and p99 latency under table churn, PCIe limits, thermal derating, and safe firmware lifecycle control.

What it is & boundary

This section defines the card precisely, draws hard boundaries to avoid architectural confusion, and establishes the "two planes" view: datapath offload vs. control/telemetry on the host.

Definition (engineering-precise): A UPF inline accelerator card is a PCIe add-in coprocessor that hardens the hottest datapath steps—flow key extraction, flow-table lookup, action execution, and optional hardware timestamping—while session/control logic remains on the host UPF software stack.

The practical value comes from making per-packet work bounded and deterministic. A card should be evaluated by what it can guarantee under stress: small packets, high table churn, bursty traffic, and power/thermal constraints.

What this card should offload (and why):

  • Flow lookup + action to remove CPU cache misses, lock contention, and unpredictable branch paths.
  • Packet counters and meters close to the action point to prevent “software accounting” from becoming the bottleneck.
  • Hardware timestamping at a clearly defined placement to support SLA measurement without mixing in host jitter.
  • Telemetry hooks (power/temperature/error counters) to close the operations loop under real field conditions.
Boundary comparisons (responsibility and fit):

  • Card vs UPF appliance
      Responsible for: per-packet hot-path acceleration (lookup/action/stamp), PCIe queueing, and card-level telemetry.
      Not responsible for: chassis-level concerns such as multi-port PHY density, system airflow, OOB management, PSU redundancy, and full platform serviceability.
      Right choice when: a host UPF exists and works but fails to meet Mpps / p99 latency targets, and incremental scaling is needed without changing the full platform.

  • Card vs SmartNIC/DPU
      Responsible for: UPF-focused acceleration: flow-table semantics, action determinism, timestamp placement, and power/thermal observability.
      Not responsible for: a general-purpose network platform: onboard CPU ecosystem, broad programmability model, and a "run everything on the NIC" approach.
      Right choice when: requirements are dominated by one datapath workload (UPF) and a tight acceptance contract (Mpps, jitter, telemetry, table update behavior).

  • Card vs switch fabric
      Responsible for: host-internal offload with stateful actions tied to the UPF pipeline and host software ownership of sessions.
      Not responsible for: network-internal forwarding and hop-by-hop queueing/shaping decisions across ports and a fabric.
      Right choice when: the bottleneck is host per-packet cost and stateful actions/counters, not network hop latency within a fabric.

A deployment snapshot should be described in two planes to prevent misunderstandings:

  1. Datapath plane: packets enter host I/O, hit the card pipeline via PCIe queues, then return for egress/stack integration.
  2. Control & telemetry plane: the host configures tables/actions and continuously consumes counters + PMBus power/thermal data.
Figure F1 — Card-in-host reference architecture (datapath + telemetry). Block diagram: the host UPF software stack (session/control, driver with DMA rings, doorbells, and MSI-X/poll, telemetry and log ingestion) connects via PCIe Gen4/5 x16 DMA queues to the accelerator card's flow-table engine (match/action, counters/meters), timestamp unit (placement-defined stamping), and PMBus telemetry (rails, power, temperature; alerts drive throttle/log).
The architecture is intentionally split into two planes: bounded datapath offload via PCIe DMA queues, and host-owned control plus continuous telemetry.

Accelerated datapath: from packet to action

The hot path is a bounded pipeline: each stage has a clear cost model, a measurable limit, and a failure symptom.

Key idea: A card delivers real value when it reduces per-packet fixed cost. That is why small packets (Mpps) and p99 latency/jitter are often more decisive than large-packet throughput (Gbps).

The accelerated hot path can be expressed as a five-stage pipeline. The description stays at the engineering abstraction level: extract only what is needed to build a stable key, execute bounded actions, and keep host interaction predictable.

Pipeline stages (bounded work per packet):

  1. Parser: extract fields required for key formation (bounded parse budget).
  2. Key build: compose a compact key (width, direction bit, tenant/slice tag as needed).
  3. Lookup: match the key in the flow-table (hit path must be deterministic).
  4. Action apply: apply forwarding/marking/metering/counters and optional timestamping.
  5. Emit/egress: return results via PCIe queues with controlled batching.
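The five stages above can be sketched as a toy model. The field names, actions, and batch-flush behavior below are illustrative assumptions, not a real card API:

```python
from dataclasses import dataclass

# Hypothetical, minimal model of the five-stage hot path.
# Field names and actions are illustrative, not a real card API.

@dataclass(frozen=True)
class FlowKey:
    src: str
    dst: str
    proto: int
    direction: int  # uplink=0 / downlink=1

def parse(pkt: dict) -> dict:
    # Stage 1: bounded parse -- extract only the fields needed for the key.
    return {k: pkt[k] for k in ("src", "dst", "proto", "direction")}

def build_key(fields: dict) -> FlowKey:
    # Stage 2: compose a compact, fixed-width key.
    return FlowKey(**fields)

def lookup(table: dict, key: FlowKey):
    # Stage 3: deterministic hit path (hash lookup); a miss goes to host fallback.
    return table.get(key)

def apply_action(entry: dict, pkt: dict) -> dict:
    # Stage 4: bounded action -- forwarding verdict plus counter update.
    entry["packets"] += 1
    return {"verdict": entry["verdict"], "pkt": pkt}

def emit(batch: list, result: dict, batch_size: int = 4) -> list:
    # Stage 5: batched return via the DMA queue model (list stands in for a ring).
    batch.append(result)
    if len(batch) >= batch_size:
        flushed, batch[:] = list(batch), []
        return flushed
    return []
```

The point of the model is the shape, not the speed: every stage does a bounded amount of work per packet, and the only elastic element is the emit batch, which is exactly where latency/jitter trade-offs live.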

How each stage fails (symptom-driven depth):

  • Parser complexity leaks into p99: variable headers or branching paths inflate worst-case latency and jitter.
  • Key design is the hidden root cause: too wide increases table resource/power; too narrow increases collisions and false hits.
  • Lookup under churn: high update rate competes with lookup bandwidth; weak consistency models create transient misses or action skew.
  • Action side effects: counters/meters can become a bottleneck if updates are contended or not designed for parallelism.
  • Emit bottlenecks are system bottlenecks: DMA batching helps throughput but hurts latency; queue depth controls jitter amplification.

Why software UPF chokes (two causal chains):

  • Mpps chain: small packets → fixed per-packet work dominates → cache misses + branch unpredictability → shared state contention → p99 spikes → drops/retries.
  • Move-the-bytes chain: offload split → DMA + queueing → doorbells/interrupts/polling → NUMA effects + copy amplification → “card looks idle” while system plateaus.

Acceptance should rest on measurable signals tied to pipeline stages, not on generic benchmarks:

  • Large-packet throughput (Gbps): mainly constrained by I/O bandwidth and DMA efficiency.
  • Small-packet rate (Mpps): reflects bounded work per packet (parser/lookup/action plus queue overhead).
  • Latency & jitter (p95/p99): reveals queue depth, batching strategy, and any time-varying throttling behavior.
  • Table update stability: hit ratio and lookup latency must remain stable under churn (avoid “performance cliffs”).
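The Gbps/Mpps distinction follows directly from wire arithmetic. A small helper (assuming standard Ethernet per-frame overhead: preamble, SFD, and inter-frame gap) shows why 64B packets are the stress case:

```python
def line_rate_mpps(link_gbps: float, frame_bytes: int) -> float:
    """Theoretical packet rate at line rate for a given frame size.

    Ethernet adds 20 bytes per frame on the wire:
    7B preamble + 1B start-of-frame delimiter + 12B inter-frame gap.
    """
    wire_bits = (frame_bytes + 20) * 8
    return link_gbps * 1e9 / wire_bits / 1e6  # packets/s expressed in Mpps
```

At 100G, 64B frames arrive at roughly 148.8 Mpps while 1518B frames need only about 8.1 Mpps, so the per-packet fixed-cost budget shrinks by a factor of about 18 between the two cases.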
Figure F2 — Match-action pipeline (offload boundary + key metrics). Pipeline diagram: parser (bounded parse), key build (compact key), lookup (deterministic), and action apply (forward, count, mark, optional timestamp) run inside the card boundary as the bounded hot path; emit/egress with DMA batching sits on the host side, with acceptance metrics mapped to stages (Mpps: per-packet cost; Gbps: bandwidth + DMA; p99 latency: queue + batching).
The offload boundary should be defined by deterministic, bounded work per packet. Egress and batching choices on the host side largely determine p99 jitter.

Flow-table architecture

The flow-table is the performance root cause because every packet hits the lookup path, and real deployments stress the update path during churn. Stability requires a hierarchy, measurable acceptance metrics, and a consistency model that remains hitless under updates.

Two physical realities: lookup happens on every packet and must stay deterministic; updates surge with session/slice dynamics and must not steal resources in a way that creates a p99 latency cliff.

A practical accelerator card typically combines multiple storage types rather than picking a single “best” memory. The reason is simple: different flow populations demand different trade-offs between throughput, flexibility, and capacity.

Physical implementations (engineering meaning, not theory):

  • SRAM hash: optimized for hot flows and deterministic hit latency at very high Mlookups/s, with collision management as the key risk.
  • TCAM: used when wildcard/priority matching must be supported without pushing complexity into host fallback; cost and power are the hard limits.
  • DRAM / host fallback: provides large capacity for cold/overflow entries, but adds variable latency and can amplify jitter if invoked too often.
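The three storage types above can be sketched as one lookup function. The key encoding, rule format, and level names are illustrative, not a real table layout:

```python
# Hypothetical three-level lookup mirroring the hierarchy above:
# L1 exact-match hash (hot flows), L2 wildcard/priority rules,
# L3 host fallback (cold/overflow).

def lookup_hierarchical(l1: dict, l2: list, l3: dict, key: tuple):
    """Return (level, action).

    l2 is a priority-ordered rule list of (mask, value, action);
    l3 models the slow, variable-latency host table.
    """
    if key in l1:                        # deterministic fast hit
        return ("L1", l1[key])
    for mask, value, action in l2:       # wildcard/priority match
        if tuple(k & m for k, m in zip(key, mask)) == value:
            return ("L2", action)
    if key in l3:                        # fallback: capacity, not speed
        return ("L3", l3[key])
    return ("miss", None)
```

In a real design the interesting acceptance question is not whether this works, but how often each level is hit: the L3 path must stay rare, or its variable latency leaks into system p99.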

Table design translates into acceptance contracts. The following five metrics decide whether a card stays stable under churn; each is tied to a measurement method and a typical failure symptom.

Table acceptance metrics:

  • Table size / entry width
      Why it matters: entry width drives memory footprint, power, and the upper bound of how many flows can stay in the fast path; oversized keys reduce effective capacity and raise access energy.
      How to validate: fix the key format, then sweep flow count until the L1 hit ratio collapses; record occupancy vs hit ratio, keeping the packet mix constant to isolate table effects.
      Failure symptom: throughput looks fine at low flow counts, then drops sharply as occupancy increases; p99 rises as fallback grows.

  • Lookup rate (Mlookups/s)
      Why it matters: lookup throughput sets the Mpps ceiling for small packets; even with sufficient PCIe bandwidth, lookup bottlenecks create "Mpps plateaus".
      How to validate: run fixed-size packets (e.g., 64B) with a fixed action set; sweep offered load and record sustained Mpps and lookup latency p99.
      Failure symptom: Gbps appears acceptable on large packets, but 64B Mpps remains low and does not scale with cores/queues.

  • Update rate (entries/s)
      Why it matters: updates surge during session/slice churn; if updates contend with lookups, a performance cliff appears even when average load is moderate.
      How to validate: use a churn script that adds/deletes/modifies flows at controlled rates while keeping steady packet traffic; record lookup p99 and hit ratio.
      Failure symptom: stable at steady state, but collapses when churn spikes; latency shows "steps" rather than gradual changes.

  • Collision / wildcard strategy
      Why it matters: collision handling and wildcard/priority matching define how often fallback is triggered; fallback frequency is the hidden jitter amplifier.
      How to validate: track per-layer hit ratio, collision counters, and fallback rate; ensure the fast-path hit ratio stays above target under realistic flow distributions.
      Failure symptom: unexplained jitter spikes and host CPU load bursts; the card "looks fine" while system p99 drifts upward.

  • Consistency model (hitless update)
      Why it matters: if lookups observe partially applied updates, behavior becomes non-deterministic; stronger models (double-buffer/epoch switch) preserve stability.
      How to validate: verify atomicity: during update storms, lookups must not see transient misses or action skew beyond the specified window; measure the commit latency distribution.
      Failure symptom: rare, hard-to-reproduce misclassification or counter anomalies during updates; symptoms disappear in the lab unless churn is reproduced.

How table churn creates a latency cliff (mechanism chain):

  1. Update rate spikes → updates consume bandwidth/metadata resources required by the lookup path.
  2. Lookup resources are squeezed → collisions increase, L1 hit ratio drops, fallback path is invoked more frequently.
  3. Fallback path lengthens per-packet work → queues grow, batching grows, and p99 latency steps upward.
  4. Weak consistency amplifies symptoms → transient misses/action skew appear exactly when traffic is hardest to stabilize.

Track these signals to make churn explainable: per-layer hit ratio, collision counters, update backlog, commit latency, fallback rate, lookup p99.
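As a sketch, those signals can feed a simple state classifier. The thresholds below are illustrative assumptions, not vendor numbers:

```python
def classify_churn_state(l1_hits: int, l2_hits: int, fallbacks: int,
                         update_backlog: int,
                         l1_target: float = 0.9,
                         backlog_limit: int = 1000) -> str:
    """Classify table health from the churn signals listed above.

    l1_target and backlog_limit are illustrative acceptance thresholds;
    a real deployment derives them from measured baselines.
    """
    total = l1_hits + l2_hits + fallbacks
    if total == 0:
        return "idle"
    l1_ratio = l1_hits / total
    fallback_rate = fallbacks / total
    if update_backlog > backlog_limit:
        return "cliff-risk: update backlog starving lookups"
    if l1_ratio < l1_target:
        return "degraded: L1 hit ratio below target"
    if fallback_rate > 0.01:
        return "jitter-risk: fallback rate elevated"
    return "stable"
```

The ordering encodes the mechanism chain: a growing update backlog is the leading indicator, a falling L1 hit ratio is the middle symptom, and an elevated fallback rate is the jitter amplifier at the end.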

Figure F3 — Table hierarchy (hit ratio vs cost). Block diagram: a three-level flow table (L1 SRAM hash for hot flows with managed collisions, L2 TCAM for wildcard/priority matching with cost/power limits, L3 DRAM/host for cold and overflow flows with variable latency), with fallback arrows, relative cost per miss rising from L1 to L3, and churn risk points where updates steal lookup resources and a hit-ratio drop produces a p99 step.
L1 should carry most hot flows with a stable hit ratio. TCAM is reserved for wildcard/priority needs. DRAM/host provides capacity but must stay rare to avoid jitter.

PCIe subsystem: DMA, queues, and switching

The card can be fast while the system stays slow. PCIe performance is dominated by byte-movement and synchronization overhead: DMA batching, queue depth, doorbell rate, interrupt strategy, IOMMU mapping, and NUMA placement.

Rule of thumb: When bandwidth is sufficient but throughput still does not scale, the limiting factor is usually per-packet overhead: doorbells, descriptors, interrupts, NUMA crossings, and copy amplification.

PCIe sizing is a practical estimate, not a textbook derivation. The goal is to confirm whether the link budget can support the target traffic before tuning queues and DMA details.

Effective bandwidth estimate (practical steps):

  1. Start from the link configuration: Gen and lane count (x8/x16).
  2. Apply an efficiency discount for protocol + transaction overhead (real payload is below headline rate).
  3. Compare the resulting payload budget against target throughput and the expected DMA directionality (RX/TX symmetry or not).
  4. If the budget is tight, optimization will only shift bottlenecks; if the budget is ample, focus on per-packet overhead and jitter control.
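The steps above can be turned into a small estimator. The 0.80 efficiency factor and 1.2x headroom are assumptions; real protocol efficiency depends on max payload size and transaction mix:

```python
# Rough link-budget sketch following the steps above. The efficiency
# factor (TLP headers, flow control; payload-size dependent) and the
# headroom multiplier are assumptions, not vendor figures.

PCIE_GTS = {"gen3": 8.0, "gen4": 16.0, "gen5": 32.0}  # GT/s per lane

def pcie_payload_gbps(gen: str, lanes: int, efficiency: float = 0.80) -> float:
    """Estimated usable payload bandwidth per direction, in Gbit/s."""
    raw = PCIE_GTS[gen] * lanes        # step 1: Gen rate * lane count
    encoded = raw * 128 / 130          # 128b/130b line encoding (Gen3+)
    return encoded * efficiency        # step 2: protocol/transaction discount

def budget_ok(gen: str, lanes: int, target_gbps: float,
              headroom: float = 1.2) -> bool:
    """Steps 3/4: compare the payload budget to the target with headroom."""
    return pcie_payload_gbps(gen, lanes) >= target_gbps * headroom
```

Under these assumptions a Gen4 x16 link yields roughly 200 Gbit/s of usable payload per direction, comfortably above a 100 Gbit/s target; a Gen3 x8 link (about 50 Gbit/s) would fail the same check, and no queue tuning could fix that.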

DMA and queues are the system performance multiplier; the trade-offs are operational choices:

  • Push vs pull: determines who controls pacing and how backpressure is handled under bursts.
  • Descriptor rings: ring depth and queue count determine parallelism and headroom, but excessive depth amplifies jitter.
  • Batching: improves throughput by amortizing doorbells and descriptor processing, but increases latency and tail jitter.
  • Pinned memory / hugepages: reduces page faults and translation overhead; critical when per-packet overhead dominates.
  • IOMMU and mapping: impacts DMA translation cost; behavior must be consistent across deployments to avoid surprises.
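The batching trade-off can be captured in a back-of-envelope model. Both modeling choices below (one doorbell per batch, flush only on a full batch) are simplifications:

```python
def batching_tradeoff(batch: int, pkt_ns: float, doorbell_ns: float,
                      arrival_mpps: float):
    """Amortized per-packet cost and worst-case added latency for a batch size.

    Simplified model: one doorbell per batch; the first packet of a batch
    may wait (batch - 1) inter-arrival times before the batch flushes.
    """
    per_pkt = pkt_ns + doorbell_ns / batch           # amortized sync cost
    inter_arrival_ns = 1e3 / arrival_mpps            # ns between packets
    added_latency = (batch - 1) * inter_arrival_ns   # batching wait
    return per_pkt, added_latency
```

With a hypothetical 200 ns doorbell amortized over 32 packets at 50 Mpps, per-packet sync cost drops from 200 ns to about 6 ns, but a packet can wait about 620 ns for the batch to fill: this is the throughput-vs-p99 dial the bullets above describe.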

MSI-X interrupts vs polling (latency stability boundary):

  • Interrupt-driven: appropriate for low/variable load and power saving, but may introduce jitter at high Mpps due to interrupt rate.
  • Polling (DPDK-style): stabilizes throughput under sustained load, but consumes CPU; queue depth and batching must be controlled to protect p99.
  • Practical acceptance: choose one mode per workload profile and validate p95/p99 latency under peak offered load, not under averages.

Card-level PCIe switching/retiming should be framed as a signal integrity and observability enabler. It becomes relevant when a design integrates multiple endpoints/functions or must hold margin at high speed without unpredictable link retraining.

Card-level PCIe design checklist (field-stable):

  • Link budget: Gen/lanes chosen with margin for worst-case traffic directionality.
  • Topology clarity: root complex → (optional) switch/retimer → endpoint; avoid hidden oversubscription.
  • Queue mapping: RX/TX queues pinned to cores; IRQ affinity aligned with NUMA placement.
  • MSI-X vectors: enough vectors for queue scale; verify interrupt moderation settings.
  • DMA memory: pinned/hugepage policy documented; avoid runtime page faults.
  • IOMMU policy: consistent across environments; confirm impact on sustained Mpps.
  • SR-IOV support: PF/VF isolation validated under load (resource accounting must be predictable).
  • AER enabled: error reporting wired to logs/alerts; no silent link degradation.
  • LTSSM visibility: retrain/downshift counters captured; correlate to performance drops.
  • Reset/hot-plug behavior: card-level reset domains and safe defaults tested; recovery time bounded.
Figure F4 — PCIe topology & queue map (where jitter is created). Block diagram: host root complex with Gen/lane budget and AER reporting; NUMA nodes with core/IRQ pinning and pinned memory (NUMA crossings are a jitter risk); queue and interrupt strategy (MSI-X vs polling, doorbell rate, ring occupancy); an optional switch/retimer for margin and visibility; and the card endpoint with DMA engine (batching control, pinned buffers), RX/TX descriptor rings, and observability counters (doorbell rate, AER/LTSSM errors, retrain counters). Jitter is created where queues grow and batching increases; keep depth and batch bounded.
When the PCIe link budget is sufficient, non-linear scaling is usually caused by per-packet overhead: doorbells, descriptors, interrupts/polling, and NUMA crossings.

Hardware timestamping unit

Hardware timestamps are only useful when the event point is explicit and the error budget is explainable. Placement defines which delays are included; clocking defines which domain the timestamp represents; load defines how much jitter is added.

What timestamps are for (card-level): evidence for probes and measurement, billing/charging records, congestion diagnosis, and slice SLA tracking—without requiring the host to infer timing from software queues.

Timestamp placement must be treated as an engineering contract: the stamp should represent a well-defined event along the datapath, so that downstream analysis can separate fixed offsets from load-dependent jitter. A card may support multiple stamp points, but stability usually improves when one primary event point is selected per use case.

Stamp placement options (what each point actually includes):

  • T1 · After ingress parse: includes front-end SerDes/retimer fixed delay and parse pipeline; minimizes queueing jitter.
  • T2 · After match/action: includes lookup/action arbitration; captures processing completion but becomes load-sensitive.
  • T3 · Before DMA enqueue: includes internal queueing and cross-domain waits; often the biggest jitter contributor.
  • T4 · Host-visible writeback: includes PCIe transaction and batching; useful for correlation, typically the least stable.

A placement is “better” only if it matches the intended meaning. The correct choice is the one that keeps the error terms bounded and explainable.

The timestamp clock domain must also be explicit. A card can maintain an internal timebase and optionally accept an external reference input disciplined by a timing source (for example, PTP/SyncE as a reference). The goal at card level is not to describe the full timing system, but to define which oscillator/PLL domain drives stamps and how alignment is maintained over temperature and time.

Timestamp error classes:

  • Offset (calibratable)
      Typical sources: fixed SerDes/retimer group delay, fixed pipeline stages, constant CDC alignment delay.
      How it shows up: stable bias; stamps are consistently early or late by a constant amount.
      Primary control knob: periodic calibration, phase alignment, and a fixed-delay compensation table.

  • Jitter (load-dependent)
      Typical sources: FIFO depth changes, CDC wait variability, scheduler arbitration, queue build-up, DMA batching delay.
      How it shows up: p95/p99 spread grows with load; "bursty" tails and time-ordering noise.
      Primary control knob: bound queue depth, cap batch size, stabilize arbitration, or choose an earlier stamp point.

  • Drift (environment)
      Typical sources: oscillator temperature drift, PLL phase-noise sensitivity, thermal-throttling side effects on timing paths.
      How it shows up: slow movement over minutes/hours; offset changes with temperature.
      Primary control knob: temperature-aware compensation, reference-aligned discipline, bounded operating states.

Calibration & alignment (card-level, minimal):

  • Periodic alignment: re-align the card timebase to a reference to keep drift bounded.
  • Phase alignment: align PLL phase against the reference domain to reduce long-term offset.
  • Temperature compensation: apply a coarse correction based on measured board temperature and a stored slope table.
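A minimal sketch of the slope-table compensation described above, assuming a piecewise-linear correction over board temperature (the table values and units are illustrative):

```python
# Coarse temperature compensation: a stored slope table maps board
# temperature to a timestamp correction. Values are illustrative.

def temp_correction_ns(temp_c: float, slope_table: list) -> float:
    """Piecewise-linear interpolation over (temp_c, correction_ns) points;
    clamps to the end points outside the table range."""
    pts = sorted(slope_table)
    if temp_c <= pts[0][0]:
        return pts[0][1]
    if temp_c >= pts[-1][0]:
        return pts[-1][1]
    for (t0, c0), (t1, c1) in zip(pts, pts[1:]):
        if t0 <= temp_c <= t1:
            frac = (temp_c - t0) / (t1 - t0)
            return c0 + frac * (c1 - c0)

def compensated_stamp(raw_ns: float, temp_c: float, table: list) -> float:
    """Subtract the temperature-dependent bias from a raw stamp."""
    return raw_ns - temp_correction_ns(temp_c, table)
```

Clamping at the table edges keeps the correction bounded even when the sensor reads outside the characterized range, which matches the "bounded operating states" goal above.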

Acceptance should include: resolution, monotonicity, offset bound after calibration, and p99 jitter growth under peak load.

Figure F5 — Timestamp placement & error budget. Block diagram: stamp points T1 (after ingress parse), T2 (after match/action), T3 (before DMA enqueue), and T4 (host-visible writeback) along the datapath, with the major error sources marked: offset from SerDes/retimer delay; jitter from CDC waits, FIFO depth, queueing variability, DMA batching, and host polling/IRQ; and drift controlled at card level by the local PLL timebase with an optional external reference and periodic plus temperature alignment.
Earlier stamp points reduce queue and DMA batching jitter; later stamp points are easier to correlate to host visibility but are less stable under load.

Control plane interface (card-level): PF/VF, firmware, and safety rails

Card-level control must be isolated from the datapath. PF/VF separation enables tenancy and predictable resource mapping; firmware lifecycle controls prevent upgrades from becoming outages; safety rails ensure failure modes are bounded and diagnosable.

Design goal: datapath traffic stays on primary queues, while configuration and telemetry stay on a sideband control path. This preserves performance stability and makes upgrades and recovery measurable.

In a UPF acceleration context, SR-IOV PF/VF is primarily about isolation and resource accounting. Each VF can represent a tenant or a UPF instance that requires predictable queue capacity, flow-table partitioning, and counter domains. The PF retains privileged control responsibilities and enforces safe configuration boundaries.

PF/VF capability split:

  • Queue ownership
      PF (privileged): create/assign queues, set bounds and policies.
      VF (tenant datapath): use assigned RX/TX queues for datapath traffic.
      Acceptance check: queue isolation holds under peak load; no cross-tenant starvation.

  • Flow-table resources
      PF: partition or quota table entries and counters.
      VF: consume within assigned limits; observe own counters.
      Acceptance check: hit ratio and update backlogs remain explainable per VF.

  • Configuration changes
      PF: apply transactional config and commit/abort.
      VF: read-only or limited knobs scoped to the VF.
      Acceptance check: no partial configuration state is observable to the datapath.

  • Telemetry
      PF: aggregate health, errors, and performance counters.
      VF: read VF-scoped metrics for local diagnosis.
      Acceptance check: metrics remain available during degraded mode.

Firmware or bitstream lifecycle must be treated as a reliability feature. At card level, the key is a minimal secure boot chain combined with versioned profiles, atomic configuration, and safe rollback. This prevents “upgrade succeeded” from meaning “performance and behavior changed silently.”

Firmware lifecycle controls (card-level):

  • Secure boot (minimum): signed image validation and version binding to prevent untrusted load.
  • A/B slots + rollback: upgrade into an inactive slot; rollback on boot/health/performance gating failure.
  • Transactional config: stage → validate → commit; avoid partially applied policies reaching the datapath.
  • Versioned profiles: explicit defaults for batching/queue depth/table policies to avoid hidden behavioral drift.
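The A/B flow can be sketched as a small gating function. Slot names, image handling, and gate predicates are hypothetical simplifications of the stage/validate/commit sequence:

```python
# Sketch of the A/B upgrade flow above: stage into the inactive slot,
# run health/performance gates, then commit or roll back. The gate
# callables stand in for boot, health, and performance checks.

def upgrade(slots: dict, active: str, new_image: str, gates: list) -> str:
    """Return the slot that should be active after an upgrade attempt.

    slots: {"A": image, "B": image}; gates: callables taking the
    candidate image and returning True on pass.
    """
    standby = "B" if active == "A" else "A"
    slots[standby] = new_image                 # stage into the inactive slot
    if all(gate(new_image) for gate in gates):
        return standby                         # commit: switch active slot
    slots[standby] = slots[active]             # rollback: restore known-good
    return active                              # stay on the proven image
```

The key property the sketch preserves is that the running image is never overwritten in place: a failed gate leaves the system exactly where it started, which is what makes "rollback time bound" a testable acceptance item.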

Safety rails define bounded failure modes. The objective is to keep the datapath recoverable and the diagnostic surface readable even when the acceleration pipeline is unavailable. A stable design defaults to a predictable state and provides a minimal read-only window for root cause.

Minimum safety rails (card-level):

  • Watchdog + heartbeat: liveness detection for control and datapath; triggers bounded recovery.
  • Health registers: explicit status machine, fault codes, throttle events, and last-known-good state.
  • Fail-safe defaults: predictable degraded behavior (stop accelerating, preserve diagnostics, avoid deadlocks).
  • Read-only diagnostics window: access to version/config hash, error counters, and key ring metrics during faults.

Acceptance should include: rollback time bound, config atomicity proof, and “metrics available under failure” checks.

Figure F6 — Control vs datapath separation (PF/VF + safety rails). Diagram: the host PF control agent handles configuration, telemetry, and upgrade gating over a sideband path; VFs own RX/TX datapath queues mapped to the card's deterministic pipeline (flow table + stamping) via primary DMA queues; the card-side control manager provides transactional config with commit/abort, A/B slots with rollback gating, and safety rails (watchdog, health registers, fail-safe). Isolation, rollback, and safety rails keep failures bounded and diagnosable.
PF controls configuration and upgrades through a sideband path. VFs own datapath queues. Safety rails preserve diagnostics and enforce fail-safe behavior.

PMBus telemetry & power integrity

Telemetry is operational leverage, not decoration. A useful PMBus design maps each power domain to a small set of readings, assigns thresholds with debounce, and ties each event to a clear policy action and a stable event ID.

Domain-first power tree (card-level): Core (FPGA/ASIC), SerDes, DDR, PCIe, and Aux MCU. Domain mapping makes symptoms explainable: throttling, brownout, or transient droop can be traced to a specific rail group.

A PMBus implementation becomes valuable when it answers “what changed right before performance dropped?” The highest-yield set of telemetry is small: V/I/P/T plus status/fault bits, sampled at a controlled rate and recorded with a consistent event model. Measurements without context are ambiguous; the combination of readings, status, and policy actions is what makes field debugging deterministic.

High-yield telemetry signals (engineering minimum set):

  • Readings: Vout, Iout, Pin/Pout, temperature (domain-local sensor) for trend and budgeting.
  • Status: over-voltage/under-voltage, over-current, over-temperature, power-good anomalies.
  • Peak/min capture (if supported): min Vout or max Iout to catch transient droop and bursts.
  • Fault log (if supported): last-event snapshot for correlation with link and performance counters.

The target is fast classification: “thermal throttle” vs “power cap” vs “brownout/droop” vs “sensor/PMBus fault”.

Thresholds must be paired with a policy that preserves stability. A typical design uses three levels: Warn for evidence and trend, Fault for bounded throttling or power capping, and Critical for controlled reset or safe disablement of acceleration. Debounce prevents noisy sensors and transient spikes from triggering oscillation.

Policy actions (card-level) that keep failures bounded:

  • Thermal throttle: reduce frequency or performance states to keep temperature under control.
  • Power cap: clamp peak power to prevent VRM overload and repeated brownout events.
  • Brownout/droop handling: log, rate-limit bursts, and enter a safe state if power-good becomes unreliable.
  • Evidence first: tie each action to a stable event ID and a compact log record.
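A minimal sketch of the Warn/Fault/Critical policy with debounce described above; the thresholds, window size, and action strings are illustrative assumptions:

```python
# Three-level threshold policy with debounce. A level must persist for
# `window` consecutive samples before its action fires, which prevents
# noisy sensors and transient spikes from causing oscillation.

class DebouncedPolicy:
    def __init__(self, warn: float, fault: float, critical: float,
                 window: int = 3):
        self.warn, self.fault, self.critical = warn, fault, critical
        self.window = window
        self.history = []

    def sample(self, reading: float) -> str:
        """Feed one reading; return the action only once the same level
        has persisted for `window` samples, else 'no-action'."""
        if reading >= self.critical:
            level = "critical: safe reset"
        elif reading >= self.fault:
            level = "fault: throttle/cap"
        elif reading >= self.warn:
            level = "warn: log trend"
        else:
            level = "ok"
        self.history.append(level)
        recent = self.history[-self.window:]
        if len(recent) == self.window and len(set(recent)) == 1:
            return level
        return "no-action"
```

In a real card each returned action would also carry the domain, reading, threshold, and a stable event ID, per the log-record minimum described below.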

Host-facing integration at card level should expose a consistent event model rather than raw dumps. The minimum log record should include: timestamp, domain/rail group, reading, threshold, debounce result, action (throttle/cap/reset/log-only), and event ID. This creates a direct bridge from “symptom” to “evidence” and makes intermittent failures diagnosable.

Domain-to-policy map (signals, thresholds, debounce, action, event ID):

  • Core
      Signals: Vcore, Icore, Pcore, Tcore, status (OT/OC/UV)
      Thresholds: Warn OT-W; Fault OT-F / OC-F; Critical UV-C
      Debounce: 2–5 sample window
      Action: throttle → cap bursts → safe reset if UV persists
      Event ID: EVT_PWR_CORE_*

  • SerDes
      Signals: Vser, Iser, Tser, status (UV/OT)
      Thresholds: Warn OT-W; Fault OT-F; Critical UV-C
      Debounce: short window + rate limit
      Action: throttle SerDes-related states; log for BER correlation
      Event ID: EVT_PWR_SERDES_*

  • DDR
      Signals: Vddr, Iddr, Tddr, status (UV/OT)
      Thresholds: Warn OT-W; Fault OT-F; Critical UV-C
      Debounce: window + hysteresis
      Action: throttle + protect state; avoid partial-table corruption
      Event ID: EVT_PWR_DDR_*

  • PCIe
      Signals: Vpcie, Ipcie, Tpcie, status (UV/PG)
      Thresholds: Warn PG-W; Fault PG-F; Critical UV-C
      Debounce: debounce to avoid chattering
      Action: log + enter bounded mode; protect from repeated retrains
      Event ID: EVT_PWR_PCIE_*

  • Aux MCU
      Signals: Vaux, Taux, status (UV/OT/comm)
      Thresholds: Warn COMM-W; Fault COMM-F; Critical UV-C
      Debounce: retries + timeout
      Action: preserve diagnostics; fail-safe defaults on comm loss
      Event ID: EVT_PWR_AUX_*
Figure F7 — PMBus telemetry loop: domains → monitor → policy → evidence. Block diagram: card power domains (Core, SerDes, DDR, PCIe, Aux MCU) feed V/I/T/P readings and status bits into a PMBus monitor with debounce; a policy engine applies cap, throttle, reset, fail-safe, and rate-limit actions; evidence reaches the host as stable event IDs, compact logs with domain and action, and counters (throttle, cap, brownout, sensor).
A domain-first telemetry map turns intermittent failures into classifiable events: throttle, cap, brownout/droop, or sensor/PMBus faults—each with a stable event ID.

Thermal & reliability: keeping acceleration deterministic

Determinism is the ability to repeat performance over time. Thermal throttling, hot spots, and error recovery can turn peak throughput into noisy variance. A card-level design must make throttling bounded, error signals visible, and recovery states predictable.

Why “runs fast, then slows down” happens: temperature rises, throttle policies engage, timing margin shrinks, and link error recovery becomes more frequent. The symptom is throughput variance and tail-latency spread.

Thermal is the most common root cause of performance drift in long-duration acceleration. Temperature affects not only frequency limits, but also timing margin and bit error behavior in high-speed interfaces. Card-level thermal design must declare the airflow assumption and place sensors where they predict throttling and error risk, not only where it is convenient to read.

Card-level thermal design elements that preserve repeatability:

  • Heat path clarity: package → heatsink → airflow; avoid relying on uncontrolled chassis conduction.
  • Sensor placement: core hot spot, VRM hot spot, SerDes vicinity, and inlet air reference.
  • Bounded throttle states: discrete throttle levels with stable transitions; avoid oscillation.
  • Evidence: record throttle reason codes and duration counters for correlation.
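The "stable transitions, avoid oscillation" requirement can be sketched as a hysteresis state machine; the levels and thresholds below are placeholders, not datasheet values:

```python
# Hypothetical sketch: discrete throttle levels with hysteresis so small
# temperature oscillations cannot cause level flapping. Thresholds are
# illustrative placeholders.
LEVELS = ["normal", "mild", "hard"]
UP = {"normal": 85.0, "mild": 95.0}    # enter the next level at/above this
DOWN = {"mild": 80.0, "hard": 90.0}    # leave a level only at/below this

def next_level(level, temp_c):
    i = LEVELS.index(level)
    if i < 2 and temp_c >= UP[LEVELS[i]]:
        return LEVELS[i + 1]
    if i > 0 and temp_c <= DOWN[LEVELS[i]]:
        return LEVELS[i - 1]
    return level                        # inside the hysteresis band: hold

trace = []
level = "normal"
for t in [70, 86, 92, 83, 79]:
    level = next_level(level, t)
    trace.append(level)
```

Note the 83 °C sample: it sits between the up and down thresholds, so the card holds "mild" instead of oscillating back to "normal" and re-triggering.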

A simple “power–temperature–performance” view prevents overfitting to peak benchmarks. Instead of complex curves, a compact table can express how performance repeatability changes when airflow and ambient conditions move away from the intended envelope. The objective is not to promise a single number, but to guarantee bounded behavior.

Reliability at card level must make silent failures visible. ECC/CRC counters, link retrain events, and PCIe error signals should be tracked and correlated with thermal and power events. Error recovery that hides problems can look like “random packet loss” or “unexplained jitter,” so the design should export minimal counters and reason codes that describe what the hardware is doing.

Card-level reliability controls that protect determinism:

  • ECC/CRC evidence: counters for corrected/uncorrected events and integrity checks.
  • Link resilience: retrain counters and error-rate thresholds that trigger bounded degradation.
  • State protection: guard against partial table state during faults; prefer safe disablement over corruption.
  • Recovery discipline: controlled reset causes, staged restart, and predictable return-to-service states.

Repeatability should be verified with long-duration tests: soak at steady load, step-load transitions, and corner airflow conditions. The acceptance criteria should include bounds on throughput variance, p99 latency spread, event counts (throttle, retrain, errors), and recovery time. When these signals are exported and stable, performance anomalies become explainable rather than mysterious.
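A hedged sketch of such an acceptance gate, with placeholder limits that a real spec would supply:

```python
# Hypothetical acceptance gate for soak results: bounded throughput
# variance, bounded p99 spread, and no unexplained events. The limits
# are placeholders to be replaced by the spec's numbers.
import statistics

def soak_passes(gbps_samples, p99_samples_us,
                max_cv=0.02, max_p99_spread_us=20.0, event_count=0):
    # coefficient of variation of throughput over the soak window
    cv = statistics.pstdev(gbps_samples) / statistics.mean(gbps_samples)
    # spread of p99 latency across capture intervals
    spread = max(p99_samples_us) - min(p99_samples_us)
    return cv <= max_cv and spread <= max_p99_spread_us and event_count == 0

ok = soak_passes([98.0, 98.4, 97.9, 98.2], [110.0, 118.0, 114.0])
```

The point is not the specific limits but that the pass/fail decision is computed from exported signals, so a failed soak names which bound was exceeded.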

Figure F8 — Thermal → throttle → errors → recovery (deterministic loop)
Diagram: the heat path (package → sink → airflow) and sensors (core hot spot, VRM hot spot, SerDes area) feed bounded throttle states (normal, mild, hard), reliability evidence (ECC/CRC counters, retrain events, PCIe AER counters), and a bounded recovery path, accepted via soak, step-load, and corner tests with metrics for throughput variance, p99 latency spread, event counts, and recovery.
Thermal states and error recovery must be bounded and visible. Determinism is proven by stable throttle behavior, visible counters, and repeatable acceptance tests.

Performance model: what limits Gbps vs Mpps vs latency

Three metrics imply three dominant bottlenecks. Big-packet throughput is usually bandwidth-bound, small-packet rate is packet-cost-bound, and tail latency/jitter is queueing-bound. A practical model allocates budget per stage and proves limits with evidence.

Quick classification (use this before tuning):

  • Gbps (large packets): dominated by effective bandwidth (PCIe / memory / SerDes).
  • Mpps (64B / small packets): dominated by per-packet cost (parse → lookup → action → enqueue → doorbell/DMA).
  • Latency & jitter (p99): dominated by queueing and batching (queue depth, batch size, polling vs interrupt, backpressure).

For Gbps in large packets, the ceiling is typically set by moving bytes rather than executing match-action logic. The practical upper bound is the minimum of the transport roofs: PCIe effective bandwidth, host memory bandwidth, and line-side SerDes throughput. When large-packet throughput stops scaling, the fastest path is to verify which roof is flat using DMA throughput and bandwidth counters, then remove hidden copies or inefficient transfer modes.

For Mpps in small packets, bandwidth is rarely the first limit. The dominant constraint is fixed work per packet: parsing, key build, lookup, action apply, queue operations, descriptor submission, and doorbell cadence. A stable engineering model expresses packet rate as a minimum across stage rates: pps ≈ min(parser, lookup, action, enqueue/dequeue, doorbell+DMA submit). Any stage with lock contention, cache miss, or a high-frequency control path can cap Mpps even when Gbps looks healthy.
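The stage-minimum model can be made concrete with a hypothetical per-stage cycle budget; the clock and cycle counts below are illustrative only:

```python
# Hypothetical stage-rate model: each stage's rate is clock / cycles-per-
# packet, and pipeline pps is the minimum stage rate. Numbers are
# illustrative, not measured.
CLOCK_HZ = 800e6
CYCLES_PER_PKT = {
    "parser": 4,
    "lookup": 8,          # e.g. one hash probe on the happy path
    "action": 6,
    "enqueue": 5,
    "doorbell_dma": 10,   # amortized descriptor + doorbell cost
}

def pps_limit(clock_hz, cycles):
    rates = {s: clock_hz / c for s, c in cycles.items()}
    stage = min(rates, key=rates.get)
    return stage, rates[stage]

stage, pps = pps_limit(CLOCK_HZ, CYCLES_PER_PKT)
mpps = pps / 1e6          # 800 MHz / 10 cycles = 80 Mpps in this example
```

In this sketch the doorbell/DMA submit path caps the card at 80 Mpps even though every datapath stage is faster, which is exactly the "Gbps looks healthy, Mpps is capped" pattern.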

For latency and jitter, the typical cause is not raw compute speed but queueing policy. Batching improves efficiency but increases tail. Deep queues stabilize throughput but can amplify p99. A practical decomposition makes tuning goal-directed: T_total ≈ T_pipeline + T_queue + T_batch/DMA. Tail behavior is controlled by queue occupancy distribution, batch rules, and the polling/interrupt boundary.
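A hedged sketch of this decomposition, using a Little's-law style queue dwell and an average batch-fill wait; all numbers are illustrative:

```python
# Hypothetical tail model: T_total = T_pipeline + T_queue + T_batch.
# Queue dwell is occupancy / service rate; batch wait is the average time
# to fill a batch at the arrival rate. Values are placeholders.
def t_total_us(t_pipeline_us, queue_occupancy, service_rate_pps,
               batch, arrival_pps):
    t_queue = queue_occupancy / service_rate_pps * 1e6   # dwell in queue
    t_batch = (batch - 1) / arrival_pps * 1e6            # wait to fill batch
    return t_pipeline_us + t_queue + t_batch

small = t_total_us(2.0, queue_occupancy=64, service_rate_pps=50e6,
                   batch=8, arrival_pps=10e6)
large = t_total_us(2.0, queue_occupancy=64, service_rate_pps=50e6,
                   batch=256, arrival_pps=10e6)
```

Even this toy model shows the tradeoff: growing the batch from 8 to 256 leaves the pipeline and queue terms untouched but multiplies the batch term, which is where the p99 expansion comes from.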

Engineering derivation template (turn goals into budgets):

  1. Set targets: port rate (25/100G), packet mix (64B/IMIX/large), and p99 latency.
  2. List the per-packet path: parse → lookup → action → queue → DMA/doorbell.
  3. Allocate budgets: assign time/cost per stage and a queueing budget for p99.
  4. Identify the bottleneck class: bandwidth-bound vs packet-cost-bound vs queueing-bound.
  5. Select knobs: batch size, queue depth, ring sizes, polling vs interrupt, affinity consistency.
  6. Prove with evidence: throughput/Mpps/p99 plus queue occupancy and action/counter traces.

The objective is explainability: “which budget was exceeded and why,” not just a benchmark number.
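The classification can be expressed as a small triage helper; the thresholds are placeholders, and the check order mirrors the quick classification list above:

```python
# Hypothetical triage helper: map captured evidence to one of the three
# bottleneck classes. Thresholds are placeholders for a real spec.
def classify(bw_utilization, stage_rates_mpps, target_mpps, p50_us, p99_us):
    if bw_utilization >= 0.95:                      # a bandwidth roof is flat
        return "bandwidth-bound"
    if min(stage_rates_mpps.values()) <= target_mpps:  # a stage caps pps
        return "packet-cost-bound"
    if p99_us > 5 * p50_us:                         # tail grows, median flat
        return "queueing-bound"
    return "within-budget"

verdict = classify(0.40, {"lookup": 120, "doorbell": 90},
                   target_mpps=100, p50_us=3.0, p99_us=40.0)
```

Here the doorbell stage rate (90 Mpps) sits below the 100 Mpps target, so the verdict is packet-cost-bound regardless of the healthy bandwidth utilization.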

Figure F9 — Three metrics, three dominant bottleneck classes
Three panels: (A) bandwidth roofs for Gbps (PCIe effective BW, host memory BW, SerDes line rate; throughput ceiling = min of roofs, verified with BW counters); (B) stage-rate minimum for Mpps (parser, lookup, action apply, enqueue/dequeue, doorbell + DMA; pps = min of stage rates); (C) queue/batch decomposition for latency and jitter (T_pipeline, T_queue, T_batch/DMA; tail knobs: queue depth, batch size, poll vs interrupt).
Use the right model for the symptom: bandwidth roofs for Gbps, stage-rate minimum for Mpps, and queue/batch decomposition for p99 latency and jitter.

Validation checklist: how to prove it’s done

A card is “done” only when correctness, repeatable performance, and operability are proven with a compact evidence package. The checklist below prevents “benchmark-only” results and explains why production can be slower than the lab.

Three-layer validation (each layer produces evidence):

  • Correctness: hits/actions/counters/timestamps are consistent and explainable.
  • Performance & stability: Gbps, Mpps, and p99 stay bounded under soak, step-load, and churn stress.
  • Operability: PMBus events, PCIe AER counters, logs, and upgrade/rollback drills close the loop.

Validation should be structured around inputs → actions → outputs. Inputs specify packet mix and constraints (affinity consistency, queue mode, and capture cadence). Actions define durations and corner conditions (soak and step transitions). Outputs are the evidence package: version hashes, configuration snapshots, counters, and event traces tied to a stable test ID.

Test | Traffic / condition | Target | Duration | Pass criteria | Evidence to save
Correctness | Known flow set + action set | Hit/action consistency | Short + repeat | No silent mismatches; counters monotonic; timestamp format stable | Flow snapshots, action logs, counter dump, timestamp samples
Gbps roof | Large packets | Throughput (Gbps) | 10–20 min | Plateau explained by a bandwidth roof | DMA throughput, BW counters, queue occupancy
Mpps roof | 64B packets | Packet rate (Mpps) | 10–20 min | No unexpected stage bottleneck; stable rate | Stage counters (if available), ring/doorbell stats, occupancy
Tail latency | IMIX + queue/batch sweep | p99 latency | 30–60 min | p99 bounded; no oscillation | p50/p99/p999, occupancy histogram, throttle reason codes
Soak stability | Steady load, steady ambient | Repeatability | 30–60 min | Variance bounded after warm-up | Thermal states, throttle duration, error counters
Churn stress | Table updates under load | Stability under churn | 20–40 min | No corruption; p99 bounded; recovery predictable | Update rate trace, hit ratio, consistency guard events
Operability | Fault triggers + recovery | Evidence loop | Scenario-based | PMBus + AER + logs correlate; safe mode works | Event IDs, AER counters, fault logs, recovery timeline
Upgrade / rollback | Firmware/bitstream swap | Non-regression | Procedure | Same behavior and evidence; rollback restores stability | Version hash, config profile, before/after KPIs

Evidence package (minimum deliverable):

  • Version identity: firmware/bitstream hash, driver version, profile ID.
  • Configuration snapshot: queue mode, batch rules, ring sizes, affinity constraints (recorded, not implied).
  • Counters: throughput/Mpps, p99, queue occupancy, throttle/cap events, AER error counts.
  • Event trace: stable event IDs with timestamps for correlation and triage.
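A minimal sketch of the identity/configuration part of that package, with hypothetical field names, showing how drift between two runs becomes mechanically detectable:

```python
# Hypothetical minimum evidence record: enough identity + configuration to
# detect drift between lab and production. Field names are illustrative.
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class EvidencePackage:
    fw_hash: str
    driver_version: str
    profile_id: str
    queue_mode: str
    batch_size: int
    ring_size: int
    affinity: tuple          # recorded, not implied

lab = EvidencePackage("a1b2c3", "2.4.1", "p99-tight", "poll", 32, 1024, (0, 2))
prod = EvidencePackage("a1b2c3", "2.4.1", "p99-tight", "irq", 256, 1024, (0, 2))

def drift(a, b):
    """Fields that differ between two runs: the usual 'mystery' cause."""
    da, db = asdict(a), asdict(b)
    return [k for k, v in da.items() if db[k] != v]

changed = drift(lab, prod)
```

In this example the "mystically slower" production run turns out to differ in queue mode and batch size, which is exactly the configuration drift the checklist is meant to catch.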

When production is slower than the lab, the cause is often configuration drift rather than “mystical load.” The checklist prevents drift by forcing explicit capture of affinity constraints, queue/batch rules, and telemetry sampling cadence. If results still differ, the evidence package enables fast classification: bandwidth roof, per-packet stage limit, or queueing tail.

Figure F10 — Validation pipeline: setup → three layers → evidence → release decision
Flow diagram: a controlled test setup (traffic generator, pinned host affinity, fixed telemetry capture cadence) feeds three validation layers (correctness: hit/action/counter/timestamp with no silent mismatch; performance & stability: Gbps/Mpps/p99 under soak, step-load, and churn with bounded variance; operability: PMBus events, AER counters, logs, upgrade/rollback), producing an evidence package (version hash, config snapshot, counters + events) and a release decision, with failures routed to a model-driven triage loop (Gbps / Mpps / p99).
A complete validation closes the loop: setup is controlled, tests are layered, evidence is compact and reproducible, and failures route back to a model-driven triage.

Debug playbook: symptoms → root cause

How to use this playbook

The goal is a closed loop at card level: symptom → hypothesis → evidence counters → targeted change → re-test. Each path below names the fastest discriminators first, so debugging does not drift into “try-and-hope.”

Symptom trees (card-first)
Symptom A — Throughput is low, but host CPU is not busy
  • Check PCIe effective bandwidth: link width/speed, replay counters, unexpected downshift, DMA completion rate.
  • Check NUMA & IOMMU effects: pinned memory, hugepages, IOMMU passthrough vs translation overhead (look for “copy tax”).
  • Check queue saturation: RX/TX ring occupancy, backpressure, descriptor starvation, doorbell cadence.
  • Evidence to capture (card-level): DMA bytes/s, queue watermarks, PCIe error counters (AER), dropped descriptors, recovery events.

Fast discriminator: If PCIe throughput is capped well below expectation while AER/replay rises, treat it as a link-quality/topology issue before tuning datapath logic.

Symptom B — Small-packet Mpps is poor
  • Check per-packet fixed costs: parser complexity, key-build steps, action fan-out, metadata expansion.
  • Check doorbell/batch strategy: too-frequent doorbells or too-small batches amplify overhead; too-large batches inflate tail latency.
  • Check table collisions & fallback: collision rate, wildcard path frequency, host fallback rate.
  • Evidence to capture: parser cycles/packet, lookup cycles/packet, collision counters, fallback counters, doorbell rate, “work done per interrupt/poll cycle”.

Fast discriminator: If collision/fallback rises with load, Mpps will collapse even when Gbps looks acceptable.

Symptom C — p99 latency / jitter is large
  • Check queueing: ring depth, scheduling policy, head-of-line blocking, burst absorption.
  • Check throttling sources: thermal throttling, rail brownout events, link retrain bursts (micro-stalls).
  • Check DMA batching: large batches create “sawtooth” latency; small batches raise CPU/doorbell cost.
  • Evidence to capture: queue dwell histogram (p50/p95/p99), throttle reason codes, retrain count, DMA batch size distribution.

Fast discriminator: If p99 expands while p50 stays flat, the cause is almost always queue/batch/interrupt scheduling rather than raw pipeline speed.

Symptom D — Timestamp drifts, jumps, or becomes “non-monotonic”
  • Check timebase stability: reference selection, holdover state, phase alignment health flags.
  • Check stamp placement: ingress vs egress vs DMA-writeback (different error terms dominate).
  • Check CDC/FIFO & batching: clock-domain crossings and batch commit boundaries can create step-like artifacts.
  • Evidence to capture: timebase status bits, phase error metrics, per-stage delta (ingress→egress), timestamp jump counter.

Fast discriminator: If jumps align with queue/batch boundaries, the root cause is likely batching/commit timing rather than the oscillator itself.
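That alignment check can be sketched directly, assuming per-packet hardware timestamps and a known batch size; all values are illustrative:

```python
# Hypothetical jump detector: flag non-monotonic or step-like timestamp
# deltas and report whether each jump lands on a batch commit boundary.
def find_jumps(stamps_ns, batch_size, step_ns=1000):
    jumps = []
    for i in range(1, len(stamps_ns)):
        d = stamps_ns[i] - stamps_ns[i - 1]
        if d < 0 or d >= step_ns:
            on_batch_edge = (i % batch_size == 0)
            jumps.append((i, d, on_batch_edge))
    return jumps

# Steady 100 ns spacing, with step artifacts every 3 packets:
stamps = [100, 200, 300, 2300, 2400, 2500, 4500, 4600]
jumps = find_jumps(stamps, batch_size=3)
```

Both detected jumps fall on batch boundaries here, which points at commit timing rather than the oscillator, matching the fast discriminator above.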

Symptom E — Card occasionally resets, drops, or “reconnects”
  • Check power transients: rail UV/OV, inrush, hotspot-induced droop, PMBus fault logs.
  • Check PCIe AER & recovery: surprise-down, completion timeouts, malformed TLPs, link retrain storms.
  • Check watchdog & firmware health: watchdog reason codes, assert logs, last-known-good rollback triggers.
  • Evidence to capture: PMBus fault history, AER snapshot, watchdog reset cause, thermal peak vs reset timestamp.

Fast discriminator: If PMBus logs show UV/OT near the event time, treat it as power/thermal first; performance tuning will not fix instability.
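The "near the event time" correlation can be checked mechanically; the window length is an assumption to tune per platform:

```python
# Hypothetical correlation check for Symptom E: PMBus fault codes that
# land within a short window BEFORE the reset point at power/thermal
# causes first. The 2 s window is a placeholder.
def power_thermal_first(reset_ts, pmbus_faults, window_s=2.0):
    """pmbus_faults: list of (timestamp_s, code) such as ('UV', 'OT')."""
    return [code for ts, code in pmbus_faults
            if 0.0 <= reset_ts - ts <= window_s]

suspects = power_thermal_first(
    reset_ts=100.0,
    pmbus_faults=[(42.0, "OT"), (99.2, "UV"), (100.5, "OC")],
)
```

Only the undervoltage fault at t=99.2 s falls inside the pre-reset window; the early OT and the post-reset OC are excluded, so triage starts with power rather than datapath tuning.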

Figure F11 — Symptom → root-cause decision tree (card-level)
Decision tree: symptoms A–E (A: low Gbps with idle CPU; B: poor small-packet Mpps; C: high p99 latency/jitter; D: timestamp drift/jump; E: drop/reset/reconnect) map to root-cause buckets (PCIe health & topology: AER, replay, downshift, switch; queues & DMA behavior: ring depth, batch, doorbell; flow-table dynamics: collision, fallback, churn; timebase & stamp logic: placement, CDC, alignment; power/thermal & safety: PMBus faults, throttling, watchdog) and to the evidence to collect (PCIe AER snapshot and LTSSM/downshift; queue/DMA watermarks and batch histogram; table hit/miss, collision, fallback rate; per-stage timestamp delta and jump counter; PMBus/thermal/watchdog fault history and reason codes).
Recommended workflow: reproduce the symptom with a controlled traffic profile, capture the evidence blocks above, apply one change per iteration (queue/batch, table update strategy, PCIe topology, power/thermal policy), then re-test.

H2-12 · BOM / selection checklist (criteria + example part numbers)

Selection rule: convert requirements into acceptance clauses

Procurement and engineering should share the same one-page language: Requirement → measurable metric → test method → risk note. The BOM list below provides example material numbers commonly used to build this class of PCIe inline accelerator card.

  • Acceleration determinism: stable Mpps and p99 latency under table churn and temperature drift.
  • Card-level operability: PMBus rails, fault history, and event IDs that survive field conditions.
  • Timestamp credibility: provable error budget from stamp placement, CDC, FIFO, and batching.
  • PCIe robustness: AER visibility, link stability, and predictable DMA behavior across platforms.
One-page acceptance table (copy into RFQ / spec)
Requirement | Metric (what to lock) | Verification (how to prove) | Risk / common pitfall
Small-packet capacity | 64B Mpps @ target line-rate profile; stable across table churn | Traffic generator + fixed NUMA; sweep batch size; churn test (updates/s) | Looks good at steady-state, collapses when collision/fallback rises
Latency determinism | p99 latency & jitter bounds under burst, throttling disabled/enabled | Measure per-stage dwell (queue, pipeline, DMA); correlate with throttle flags | Queue depth “fixes drops” but silently inflates tail latency
Flow-table behavior | Lookup rate, update rate, collision %, hitless update guarantees | Profile hit/miss/collision counters; staged update test (versioned tables) | Table churn steals memory bandwidth and triggers long stalls
PCIe subsystem | Effective GB/s, AER error rate, DMA completion stability | Link training logs + AER snapshot; sustained DMA copy tests; queue watermarks | “Card is fast” but platform caps PCIe or IOMMU adds copy tax
Timestamp credibility | Stamp granularity + provable error budget (per stage) | Inject known delays; compare ingress/egress deltas; monitor jump counters | Batch commits hide real timing; CDC/FIFO adds step artifacts
Field operability | PMBus coverage (V/I/T/P), fault history, event IDs | Fault injection (UV/OT/OV); confirm logs + debounce + clear procedure | Telemetry exists but lacks stable event IDs or is too noisy to use
Reliability | Recovery behavior after reset/power cycle; watchdog reasons | Reset drills + firmware rollback rehearsal; confirm “last-known-good” path | Recovery depends on manual steps; field becomes unserviceable
Figure F12 — BOM domains map (what each part block is responsible for)
Diagram: each BOM block on the UPF inline accelerator card mapped to its responsibility: acceleration engine (FPGA/ASIC datapath + table logic; example devices VP1802 / Agilex 7), PCIe & DMA subsystem (queues, descriptors, AER visibility; PEX88096 / DS160PR412), timebase & timestamp unit (stamp placement + provable budget; Si5345 / 8A34001), power + telemetry (PMBus rails + faults; TPS53679 / ISL68200, INA228 / TPS25982), security + firmware (secure keys + images; ATECC608A, W25Q128JV), thermal & sensors (avoid throttling drift; TMP464 remote diodes), and protection/inrush (card stability on transients; LM5069 hot-swap controller, TPS25982 smart eFuse).
Use this map to ensure every BOM block has a matching acceptance clause (performance, determinism, operability, and robustness).
Example BOM blocks (material numbers)

The list below is intentionally “engineering-facing”: each item is tied to a responsibility on the card. Exact ordering codes depend on package, speed grade, and operating temperature.

PCIe switching / signal integrity
  • PCIe Gen4 switch: PEX88096 (Broadcom) — host-to-multi-function fanout or multi-endpoint topologies.
  • PCIe Gen4 redriver: DS160PR412 (TI) — channel margin extension; use where insertion loss is high.
  • PCIe Gen5 redriver (option): DS320PR810 (TI) — if targeting PCIe 5.0 signal budget (platform-dependent).
Clocking / timebase support (timestamp credibility)
  • Jitter attenuating clock: Si5345 (Skyworks/Silicon Labs) — clean clock distribution and holdover behaviors.
  • Sync management / clock matrix: 8A34001 (Renesas) — timing reference management class device.
  • Ultra-low jitter option: Si5395 (Skyworks/Silicon Labs) — for tighter SerDes jitter budgets.
PMBus power control (rails + fault history)
  • Multiphase controller w/ PMBus: TPS53679 (TI) — server-class VCORE control, NVM + PMBus.
  • Digital hybrid controller w/ PMBus: ISL68200 (Renesas) — telemetry + fault reporting via PMBus/SMBus.
  • Digital power monitor: INA228 (TI) — high-resolution current/voltage/power/energy monitoring (I²C).
Protection / hot-plug stability (card won’t “flake”)
  • Smart eFuse: TPS25982 (TI) — integrated hot-swap behavior + accurate current monitoring.
  • Hot-swap controller: LM5069 (TI) — inrush control and power limiting for live insertion scenarios.
Thermal sensors (determinism vs throttling)
  • Multi-channel remote diode sensor: TMP464 (TI) — monitor hotspots (FPGA/ASIC diodes) + local temp.
Secure boot artifacts (card-level)
  • Secure element / crypto co-processor: ATECC608A (Microchip) — key storage and device authentication primitives.
  • SPI NOR flash: W25Q128JV (Winbond) — firmware/bitstream storage class device.

Acceleration engine device examples (naming-level): VP1802 (AMD Versal Premium), Agilex 7 (Altera/Intel). Final device ordering codes vary by package/speed/temperature and platform IO requirements.


H2-13 · FAQs (UPF Inline Accelerator Card)

These 12 answers stay at card level and point to the exact chapters that prove performance, determinism, and operability with measurable evidence (AER, queue watermarks, table counters, timestamp health, PMBus faults).

Evidence Pack (recommended fields)
  • PCIe: link speed/width, AER snapshot (correctable/uncorrectable), retrain/downshift events
  • DMA/Queues: ring occupancy watermarks, drops, batch histogram, doorbell rate
  • Flow-table: hit/miss, collision %, fallback rate, update backlog, version switch events
  • Timestamp: timebase lock/holdover flags, jump counter, stage deltas (ingress→egress)
  • Power/Thermal: PMBus status/fault history, rail V/I/T, throttle reason codes

Example card-side parts often seen in this class: PEX88096 (PCIe switch), DS160PR412 (PCIe redriver), Si5345 (jitter cleaner), TPS53679/ISL68200 (PMBus controllers), INA228 (power monitor), TMP464 (thermal sensor), TPS25982 (eFuse), LM5069 (hot-swap), W25Q128JV (SPI NOR), ATECC608A (secure element).

FAQ × 12 (answers + chapter mapping)
1) What is the practical boundary between this card and a SmartNIC/DPU?

A UPF inline accelerator card is a PCIe endpoint optimized for the UPF hot path: fixed match-action flow-table, deterministic timestamp export, and card-level telemetry/health. Session management, policy decisions, and most control-plane state remain in the host stack. A SmartNIC/DPU is a broader platform (virtualization, services, and programmable ecosystem); this page stays on PF/VF control, firmware safety, and measurable datapath offload.

Go deeper: H2-1 (boundary) · H2-6 (PF/VF, firmware, safety rails)

2) Why can Gbps look fine while 64B Mpps is poor—where is the root cause usually?

Large packets are bandwidth-limited (PCIe/memory), while 64B traffic is limited by per-packet fixed cost: parser/key-build, lookup/action cycles, and host↔card doorbell/descriptor overhead. Mpps also collapses when collision/fallback rises or batches are too small. Prove the class quickly with doorbell rate, batch histogram, queue watermarks, and table collision/fallback counters before changing hardware.

Go deeper: H2-2 · H2-4 · H2-9

3) Should flow-table design prioritize capacity or update rate, and how to balance?

Capacity matters when the rule set is large and stable; update rate matters when session churn is high or rules change frequently. The real balance point is whether hitless updates and p99 latency remain bounded during churn. Lock acceptance clauses to updates/s, collision %, fallback rate, and “p99 under churn” instead of chasing entry count alone. Versioned tables (double-buffer + atomic swap) reduce stalls.
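A minimal sketch of the versioned-table idea at the host-control level (on the card itself the mechanism would be table banks and a register swap rather than Python objects):

```python
# Hypothetical hitless-update sketch: double-buffered table with an atomic
# version swap, so lookups never observe a half-written table.
import threading

class VersionedTable:
    def __init__(self):
        self._tables = [{}, {}]
        self._active = 0                 # index read by the datapath
        self._lock = threading.Lock()    # serializes writers only

    def lookup(self, key):
        # readers take no lock; they only ever see a fully built table
        return self._tables[self._active].get(key)

    def apply_updates(self, updates):
        with self._lock:
            shadow = dict(self._tables[self._active])  # copy active version
            shadow.update(updates)                     # build off to the side
            idx = 1 - self._active
            self._tables[idx] = shadow
            self._active = idx           # single-word swap = atomic switch

t = VersionedTable()
t.apply_updates({"flow-a": "fwd:q0"})
t.apply_updates({"flow-b": "drop"})
```

The cost of hitless behavior is visible here too: each update burst rebuilds a shadow copy, which is the memory-bandwidth theft that churn clauses should bound.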

Go deeper: H2-3 (table architecture + consistency model)

4) Is TCAM mandatory, and when is SRAM hash enough?

TCAM is valuable for wildcarding and strict priority matching; it is not automatically required for throughput. If most rules are exact-match and priorities are simple, an SRAM hash pipeline can deliver higher lookup rate and lower power. A common compromise is L1 hash for the majority, plus a small TCAM/priority stage for exceptions, with host fallback for rare cases. Decide using measured wildcard frequency and collision behavior.

Go deeper: H2-3 (hierarchy: hash/TCAM/DRAM/host fallback)

5) How does table churn show up as symptoms in production?

Table churn typically appears as p99 latency spikes, sawtooth throughput, hit ratio oscillation, and sudden rises in collision/fallback when updates burst. If updates steal memory bandwidth or trigger table migration/flush, queues inflate and jitter widens without obvious “packet errors.” Capture update backlog, version-switch events, collision %, fallback rate, and queue occupancy histograms; correlate spikes with update bursts to confirm churn.

Go deeper: H2-3 · H2-11

6) Should timestamps be taken at ingress or egress, and what error terms change?

Ingress stamping is earlier and excludes most queueing and emit effects, making it better for isolating parsing and lookup timing. Egress stamping is closer to “leave-card time” but includes queue dwell, action scheduling, CDC/FIFO effects, and batching/commit boundaries. The choice depends on whether the goal is pipeline attribution or end-to-end SLA accounting. A credible design exposes placement selection and stage deltas.

Go deeper: H2-5 (placement + error budget)

7) Why do timestamps occasionally jump or drift, and what is the fastest debug path?

Start with timebase health: lock/holdover flags and reference selection changes. Next, check whether jumps align with queue/batch boundaries (commit timing) rather than oscillator behavior. Then inspect CDC/FIFO error flags, calibration events, and placement profile changes. Evidence should include jump counters, timebase status bits, phase/alignment metrics (if available), and correlation of jitter with throttling or PCIe retrains.

Go deeper: H2-5 · H2-11

8) Why can line rate still be unreachable after adding an accelerator card—common PCIe pitfalls?

The most common culprits are platform-level PCIe limits: unexpected Gen/width downshift, IOMMU copy tax, NUMA mismatch, or switch/retimer margin issues that increase AER/replay and trigger micro-stalls. Queue sizing and DMA batching can also cap effective throughput. Confirm link status and AER first, then validate sustained DMA GB/s and queue watermarks. Hardware choices like PEX88096 (switch) and DS160PR412 (redriver) target topology/margin.

Go deeper: H2-4 (DMA/queues/topology checklist)

9) Which 5 PMBus signals are most valuable to capture for field operability?

The most actionable set is: total input power (budget and caps), core rail current (load transients), hottest temperature (throttle trigger), fault/status word (UV/OV/OT/OC), and throttle reason/event code (turn alarms into a timeline). Devices commonly used around this function include INA228 for high-resolution power monitoring and PMBus controllers such as TPS53679 or ISL68200 for rail telemetry and fault history.
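A hedged sketch of the decode side of that capture, using the standard PMBus LINEAR11 format; on a Linux host the raw words would come from an SMBus read (for example smbus2's read_word_data), which is stubbed out here so the decode stays self-contained:

```python
# Hedged sketch: decode the most actionable PMBus words. Command codes are
# the standard PMBus ones; which rail sits at which address is a
# board-specific assumption, and the transport read is omitted.
PMBUS_CMDS = {
    "READ_VIN": 0x88,            # input voltage
    "READ_IOUT": 0x8C,           # core rail current
    "READ_TEMPERATURE_1": 0x8D,  # hottest sensed temperature
    "READ_PIN": 0x97,            # total input power
    "STATUS_WORD": 0x79,         # UV/OV/OT/OC fault bits (not LINEAR11)
}

def linear11(raw):
    """Decode PMBus LINEAR11: 5-bit signed exponent, 11-bit signed mantissa."""
    exp = raw >> 11
    if exp > 0x0F:               # sign-extend the 5-bit exponent
        exp -= 0x20
    mant = raw & 0x07FF
    if mant > 0x03FF:            # sign-extend the 11-bit mantissa
        mant -= 0x0800
    return mant * 2.0 ** exp

# Example word: exponent -3 (0b11101), mantissa 100 -> 100 * 2**-3 = 12.5
raw_word = (0b11101 << 11) | 100
value = linear11(raw_word)
```

STATUS_WORD is deliberately kept raw: its value is a bit field, not a LINEAR11 quantity, and the fault bits are what turn the other four readings into a timeline.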

Go deeper: H2-7 (telemetry map + thresholds → actions)

10) Why can throughput drop after running at load for a while without explicit errors?

Silent degradation is often controlled derating: thermal throttling, power capping, or error-recovery behavior (retrain/CRC replay) that reduces effective throughput without “hard faults.” Confirm whether temperature ramps align with the drop (e.g., TMP464 hotspot channels), whether PMBus shows throttle flags or rail droop events, and whether PCIe AER correctables increase over time. If symptoms match churn windows, include update counters too.

Go deeper: H2-8 · H2-7 · H2-11

11) After a firmware upgrade, performance becomes worse or unstable—how to rollback and prove?

Treat firmware/bitstream changes as experiments: lock the traffic profile, NUMA placement, batch policy, and telemetry sampling, then compare evidence packs side-by-side. Rollback must be atomic and rehearsed (last-known-good image + configuration versioning). Common card artifacts include SPI NOR like W25Q128JV for images and a secure element like ATECC608A for authentication. Validate stability under churn and thermal soak, not only peak throughput.

Go deeper: H2-6 · H2-10

12) How to write acceptance clauses so throughput/Mpps/latency/stability/observability are all measurable?

Use a matrix: (a) Gbps at defined packet mix, (b) 64B Mpps with fixed batch policy, (c) p99 latency under burst, (d) churn stress (updates/s) with bounded p99, (e) thermal soak to steady-state with no unreported derating, and (f) observability: mandatory evidence fields (AER, queue histograms, table counters, timestamp health, PMBus fault history). Each clause must name a test method and pass/fail thresholds.

Go deeper: H2-10 · H2-12

Figure F13 — FAQ map (questions grouped by the card’s proof points)
Map: answer clusters point back to the H2 proofs: boundary & control (Q1, Q11: PF/VF, firmware safety, rollback + evidence pack), datapath & PCIe (Q2, Q8: Mpps vs doorbell/batch, DMA queues, AER, Gen/lanes, NUMA/IOMMU), flow-table design (Q3, Q4, Q5: capacity, update, churn, collision, fallback, hitless), timestamp trust (Q6, Q7: placement, timebase, CDC/FIFO, calibration, jump counters, stage deltas), PMBus & thermal (Q9, Q10: rails, faults, throttle, steady-state determinism), and acceptance clauses (Q12: Gbps, Mpps, p99, churn, soak, evidence, pass/fail thresholds). All answers remain card-level and reference the exact H2 proof sections.
Use this map to keep FAQ answers short and evidence-based, while pushing deep detail into the referenced H2 sections.