UPF Inline Accelerator Card (FPGA/ASIC, PCIe, PMBus Telemetry)
What it is & boundary
This section defines the card precisely, draws hard boundaries to avoid architectural confusion, and establishes the “two planes” view: datapath offload on the card versus control/telemetry on the host.
Definition (engineering-precise): A UPF inline accelerator card is a PCIe add-in coprocessor that hardens the hottest datapath steps—flow key extraction, flow-table lookup, action execution, and optional hardware timestamping—while session/control logic remains on the host UPF software stack.
The practical value comes from making per-packet work bounded and deterministic. A card should be evaluated by what it can guarantee under stress: small packets, high table churn, bursty traffic, and power/thermal constraints.
What this card should offload (and why):
- Flow lookup + action to remove CPU cache misses, lock contention, and unpredictable branch paths.
- Packet counters and meters close to the action point to prevent “software accounting” from becoming the bottleneck.
- Hardware timestamping at a clearly defined placement to support SLA measurement without mixing in host jitter.
- Telemetry hooks (power/temperature/error counters) to close the operations loop under real field conditions.
| Boundary | What the card is responsible for | What it is not responsible for | When this card is the right choice |
|---|---|---|---|
| Card vs UPF appliance | Per-packet hot path acceleration (lookup/action/stamp), PCIe queueing, and card-level telemetry. | Chassis-level concerns: multi-port PHY density, system airflow, OOB management, PSU redundancy, and full platform serviceability. | Host UPF exists and works, but fails to meet Mpps / p99 latency targets; incremental scaling is needed without changing the full platform. |
| Card vs SmartNIC/DPU | UPF-focused acceleration: flow-table semantics, action determinism, timestamp placement, and power/thermal observability. | General-purpose network platform: onboard CPU ecosystem, broad programmability model, and “run everything on the NIC” approach. | Requirements are dominated by one datapath workload (UPF) and a tight acceptance contract (Mpps, jitter, telemetry, table update behavior). |
| Card vs switch fabric | Host-internal offload with stateful actions tied to the UPF pipeline and host software ownership of sessions. | Network-internal forwarding and hop-by-hop queueing/shaping decisions across ports and a fabric. | The bottleneck is host per-packet cost and stateful action/counters, not network hop latency within a fabric. |
A deployment snapshot is best described in two planes to prevent misunderstandings:
- Datapath plane: packets enter host I/O, hit the card pipeline via PCIe queues, then return for egress/stack integration.
- Control & telemetry plane: the host configures tables/actions and continuously consumes counters + PMBus power/thermal data.
Accelerated datapath: from packet to action
The hot path is a bounded pipeline: each stage has a clear cost model, a measurable limit, and a characteristic failure symptom.
Key idea: A card delivers real value when it reduces per-packet fixed cost. That is why small packets (Mpps) and p99 latency/jitter are often more decisive than large-packet throughput (Gbps).
The accelerated hot path can be expressed as a five-stage pipeline. At the engineering abstraction level, the contract is simple: extract only what is needed to build a stable key, execute bounded actions, and keep host interaction predictable.
Pipeline stages (bounded work per packet):
- Parser: extract fields required for key formation (bounded parse budget).
- Key build: compose a compact key (width, direction bit, tenant/slice tag as needed).
- Lookup: match the key in the flow-table (hit path must be deterministic).
- Action apply: apply forwarding/marking/metering/counters and optional timestamping.
- Emit/egress: return results via PCIe queues with controlled batching.
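As a rough mental model, the stage list above can be turned into a packet-rate ceiling. The per-stage cycle budgets below are hypothetical placeholders, not vendor figures; the point is the two bounds, not the numbers:

```python
# Illustrative per-stage cycle budgets — hypothetical placeholders, not vendor figures.
STAGE_BUDGET_CYCLES = {
    "parse": 8,      # bounded parse budget
    "key_build": 4,  # compact key composition
    "lookup": 12,    # deterministic hit path
    "action": 10,    # counters/meters/optional timestamp
    "emit": 6,       # per-packet share of PCIe enqueue work
}

def mpps_ceiling(clock_mhz: float, budgets: dict) -> tuple:
    """Two useful bounds in Mpps: a serial bound (stages run back-to-back)
    and a fully pipelined bound (limited by the slowest stage)."""
    serial = clock_mhz / sum(budgets.values())      # MHz / cycles-per-pkt = Mpps
    pipelined = clock_mhz / max(budgets.values())
    return serial, pipelined

serial_mpps, pipelined_mpps = mpps_ceiling(800.0, STAGE_BUDGET_CYCLES)
print(f"serial bound: {serial_mpps:.1f} Mpps, pipelined bound: {pipelined_mpps:.1f} Mpps")
```

Reading datasheet Mpps claims against both bounds quickly shows whether a pipeline is overlapped or serialized.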
How each stage fails (symptom-driven depth):
- Parser complexity leaks into p99: variable headers or branching paths inflate worst-case latency and jitter.
- Key design is the hidden root cause: too wide increases table resource/power; too narrow increases collisions and false hits.
- Lookup under churn: high update rate competes with lookup bandwidth; weak consistency models create transient misses or action skew.
- Action side effects: counters/meters can become a bottleneck if updates are contended or not designed for parallelism.
- Emit bottlenecks are system bottlenecks: DMA batching helps throughput but hurts latency; queue depth controls jitter amplification.
Why software UPF chokes (two causal chains):
- Mpps chain: small packets → fixed per-packet work dominates → cache misses + branch unpredictability → shared state contention → p99 spikes → drops/retries.
- Move-the-bytes chain: offload split → DMA + queueing → doorbells/interrupts/polling → NUMA effects + copy amplification → “card looks idle” while system plateaus.
Acceptance should rest on measurable signals tied to pipeline stages, not generic benchmarks:
- Large-packet throughput (Gbps): mainly constrained by I/O bandwidth and DMA efficiency.
- Small-packet rate (Mpps): reflects bounded work per packet (parser/lookup/action plus queue overhead).
- Latency & jitter (p95/p99): reveals queue depth, batching strategy, and any time-varying throttling behavior.
- Table update stability: hit ratio and lookup latency must remain stable under churn (avoid “performance cliffs”).
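The Mpps/Gbps distinction follows directly from Ethernet framing arithmetic: every frame carries 20 B of fixed wire overhead (7 B preamble + 1 B SFD + 12 B inter-frame gap). A quick sketch:

```python
def line_rate_mpps(link_gbps: float, frame_bytes: int) -> float:
    """Max packet rate on the wire. Each frame occupies frame_bytes plus
    20 B of fixed overhead (preamble + SFD + inter-frame gap)."""
    wire_bytes = frame_bytes + 20
    return link_gbps * 1e9 / (wire_bytes * 8) / 1e6

print(f"100G @ 64B   : {line_rate_mpps(100, 64):.1f} Mpps")
print(f"100G @ 1518B : {line_rate_mpps(100, 1518):.2f} Mpps")
```

At 64 B a 100G link demands roughly 18× the packet rate of the 1518 B case, which is why the small-packet number exposes per-packet fixed cost.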
Flow-table architecture
The flow-table is the performance root cause because every packet hits the lookup path, and real deployments stress the update path during churn. Stability requires a hierarchy, measurable acceptance metrics, and a consistency model that remains hitless under updates.
Two physical realities: lookup happens on every packet and must stay deterministic; updates surge with session/slice dynamics and must not steal resources in a way that creates a p99 latency cliff.
A practical accelerator card typically combines multiple storage types rather than picking a single “best” memory. The reason is simple: different flow populations demand different trade-offs between throughput, flexibility, and capacity.
Physical implementations (engineering meaning, not theory):
- SRAM hash: optimized for hot flows and deterministic hit latency at very high Mlookups/s, with collision management as the key risk.
- TCAM: used when wildcard/priority matching must be supported without pushing complexity into host fallback; cost and power are the hard limits.
- DRAM / host fallback: provides large capacity for cold/overflow entries, but adds variable latency and can amplify jitter if invoked too often.
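The tiered lookup order can be sketched as a toy model: SRAM exact match first, TCAM wildcard next, host/DRAM fallback last. Capacities, the integer key format, and the stats layout here are illustrative, not a vendor API:

```python
class TieredFlowTable:
    """Toy model of the tiered lookup order: exact-match SRAM hash first,
    wildcard TCAM (priority order) next, host/DRAM fallback last."""

    def __init__(self, sram_capacity: int = 4):
        self.sram = {}       # exact key -> action (fast, bounded)
        self.tcam = []       # (mask, value, action), first match wins
        self.fallback = {}   # "unbounded" slow tier for overflow entries
        self.sram_capacity = sram_capacity
        self.stats = {"sram": 0, "tcam": 0, "fallback": 0, "miss": 0}

    def insert(self, key: int, action: str) -> None:
        if len(self.sram) < self.sram_capacity:
            self.sram[key] = action
        else:
            self.fallback[key] = action  # overflow lands in the slow tier

    def insert_wildcard(self, mask: int, value: int, action: str) -> None:
        self.tcam.append((mask, value, action))

    def lookup(self, key: int):
        if key in self.sram:
            self.stats["sram"] += 1
            return self.sram[key]
        for mask, value, action in self.tcam:  # priority = insertion order
            if key & mask == value:
                self.stats["tcam"] += 1
                return action
        if key in self.fallback:
            self.stats["fallback"] += 1
            return self.fallback[key]
        self.stats["miss"] += 1
        return None
```

The per-tier stats counters mirror the "per-layer hit ratio" signal used later to explain churn: a rising `fallback` share is the hidden jitter amplifier.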
Table design should be converted into acceptance contracts. The following five metrics decide whether a card stays stable under churn; each is tied to a measurement method and a typical failure symptom.
| Metric | Why it matters | How to validate (practical) | Failure symptom (field-facing) |
|---|---|---|---|
| Table size / entry width | Entry width drives memory footprint, power, and the upper bound of “how many flows can stay in the fast path”. Oversized keys reduce effective capacity and raise access energy. | Fix key format, then sweep flow count until L1 hit ratio collapses; record occupancy vs hit ratio. Keep packet mix constant to isolate table effects. | Throughput looks fine at low flow counts, then drops sharply when occupancy increases; p99 rises as fallback grows. |
| Lookup rate (Mlookups/s) | Lookup throughput sets the Mpps ceiling for small packets. Even with sufficient PCIe bandwidth, lookup bottlenecks create “Mpps plateaus”. | Run fixed-size packets (e.g., 64B) and fixed action set; sweep offered load and record sustained Mpps and lookup latency p99. | Gbps appears acceptable on large packets, but 64B Mpps remains low and does not scale with cores/queues. |
| Update rate (entries/s) | Update surges during session/slice churn. If updates contend with lookups, a performance cliff appears even when average load is moderate. | Use a churn script: periodically add/delete/modify flows at controlled rates while keeping steady packet traffic; record lookup p99 and hit ratio. | Stable at steady state, but collapses when churn spikes; latency shows “steps” rather than gradual changes. |
| Collision / wildcard strategy | Collision handling and wildcard/priority matching define how often fallback is triggered. Fallback frequency is the hidden jitter amplifier. | Track per-layer hit ratio, collision counters, and fallback rate; ensure fast-path hit ratio remains above target under realistic flow distributions. | Unexplained jitter spikes and CPU load bursts on the host; “card looks fine” but system p99 drifts upward. |
| Consistency model (hitless update) | If lookups observe partially-applied updates, behavior becomes non-deterministic. Stronger models (double-buffer/epoch switch) preserve stability. | Verify atomicity: during update storms, lookups must not see transient misses or action skew beyond the specified window; measure commit latency distribution. | Rare, hard-to-reproduce misclassification or counter anomalies during updates; symptoms disappear in lab unless churn is reproduced. |
How table churn creates a latency cliff (mechanism chain):
- Update rate spikes → updates consume bandwidth/metadata resources required by the lookup path.
- Lookup resources are squeezed → collisions increase, L1 hit ratio drops, fallback path is invoked more frequently.
- Fallback path lengthens per-packet work → queues grow, batching grows, and p99 latency steps upward.
- Weak consistency amplifies symptoms → transient misses/action skew appear exactly when traffic is hardest to stabilize.
Track these signals to make churn explainable: per-layer hit ratio, collision counters, update backlog, commit latency, fallback rate, lookup p99.
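One of those signals lends itself to a trivial automated check: a churn cliff shows up as a step, not a drift, in the fast-path hit ratio. A minimal sketch (the 5% step threshold is an arbitrary placeholder):

```python
def detect_cliff(hit_ratio_samples, max_step: float = 0.05):
    """Flag churn-induced 'steps': a cliff is a drop in fast-path hit ratio
    between adjacent measurement intervals larger than max_step.
    Gradual drift passes; returns the indices where a step occurred."""
    return [i for i in range(1, len(hit_ratio_samples))
            if hit_ratio_samples[i - 1] - hit_ratio_samples[i] > max_step]

steady = [0.98, 0.975, 0.97, 0.968]   # gradual drift: acceptable
churn = [0.98, 0.97, 0.82, 0.80]      # step between intervals 1 and 2: cliff
print(detect_cliff(steady), detect_cliff(churn))
```

The same step test applies to lookup p99: "steps rather than gradual changes" is exactly what the churn-stress acceptance row looks for.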
PCIe subsystem: DMA, queues, and switching
The card can be fast while the system stays slow. PCIe performance is dominated by byte-movement and synchronization overhead: DMA batching, queue depth, doorbell rate, interrupt strategy, IOMMU mapping, and NUMA placement.
Rule of thumb: When bandwidth is sufficient but throughput still does not scale, the limiting factor is usually per-packet overhead: doorbells, descriptors, interrupts, NUMA crossings, and copy amplification.
PCIe sizing should be presented as a practical estimate rather than a textbook derivation. The goal is to confirm whether the link budget can support the target traffic before tuning queues and DMA details.
Effective bandwidth estimate (practical steps):
- Start from the link configuration: Gen and lane count (x8/x16).
- Apply an efficiency discount for protocol + transaction overhead (real payload is below headline rate).
- Compare the resulting payload budget against target throughput and the expected DMA directionality (RX/TX symmetry or not).
- If the budget is tight, optimization will only shift bottlenecks; if the budget is ample, focus on per-packet overhead and jitter control.
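These steps reduce to a few lines of arithmetic. A sketch, assuming 128b/130b line coding (Gen3 and later) and a hedged 0.85 TLP-efficiency factor for header and flow-control overhead:

```python
GEN_GTPS = {3: 8.0, 4: 16.0, 5: 32.0}  # GT/s per lane

def pcie_payload_budget_gbps(gen: int, lanes: int,
                             tlp_efficiency: float = 0.85) -> float:
    """Rough payload budget in Gbit/s. 128b/130b encoding applies to Gen3+;
    tlp_efficiency folds in TLP/DLLP headers and flow-control cost —
    0.80–0.90 is a planning range, not a measured figure."""
    raw_gbps = GEN_GTPS[gen] * lanes * (128 / 130)
    return raw_gbps * tlp_efficiency

budget = pcie_payload_budget_gbps(4, 16)
# If the target is e.g. 100 Gbit/s of traffic with both RX and TX DMA
# crossing the link, compare against ~200 Gbit/s of required budget.
print(f"Gen4 x16 payload budget ≈ {budget:.0f} Gbit/s")
```

If the computed budget is within ~2× of the target, treat the link as tight and expect tuning to shift bottlenecks rather than remove them.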
DMA and queues are the system performance multiplier, and the trade-offs are operational choices:
- Push vs pull: determines who controls pacing and how backpressure is handled under bursts.
- Descriptor rings: ring depth and queue count determine parallelism and headroom, but excessive depth amplifies jitter.
- Batching: improves throughput by amortizing doorbells and descriptor processing, but increases latency and tail jitter.
- Pinned memory / hugepages: reduces page faults and translation overhead; critical when per-packet overhead dominates.
- IOMMU and mapping: impacts DMA translation cost; behavior must be consistent across deployments to avoid surprises.
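The batching trade-off is worth quantifying: doorbell rate falls as 1/batch, while the first packet of a batch waits for the batch to fill. A sketch with illustrative numbers:

```python
def batching_tradeoff(arrival_mpps: float, batch_size: int):
    """One line of arithmetic per side of the trade-off: doorbells/s shrink
    with batch size, but the worst-case wait is the batch fill time at the
    current arrival rate (in microseconds)."""
    doorbells_per_sec = arrival_mpps * 1e6 / batch_size
    worst_wait_us = batch_size / arrival_mpps
    return doorbells_per_sec, worst_wait_us

for b in (1, 8, 64):
    db, wait = batching_tradeoff(arrival_mpps=10.0, batch_size=b)
    print(f"batch={b:3d}: {db:>12,.0f} doorbells/s, worst-case wait {wait:.1f} us")
```

At low arrival rates the fill-time term dominates, which is why fixed large batches that look harmless at peak load inflate tail latency off-peak; timeout-bounded batching is the usual compromise.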
MSI-X interrupts vs polling (latency stability boundary):
- Interrupt-driven: appropriate for low/variable load and power saving, but may introduce jitter at high Mpps due to interrupt rate.
- Polling (DPDK-style): stabilizes throughput under sustained load, but consumes CPU; queue depth and batching must be controlled to protect p99.
- Practical acceptance: choose one mode per workload profile and validate p95/p99 latency under peak offered load, not under averages.
Card-level PCIe switching/retiming should be framed as a signal integrity and observability enabler. It becomes relevant when a design integrates multiple endpoints/functions or must hold margin at high speed without unpredictable link retraining.
Card-level PCIe design checklist (field-stable):
- Link budget: Gen/lanes chosen with margin for worst-case traffic directionality.
- Topology clarity: root complex → (optional) switch/retimer → endpoint; avoid hidden oversubscription.
- Queue mapping: RX/TX queues pinned to cores; IRQ affinity aligned with NUMA placement.
- MSI-X vectors: enough vectors for queue scale; verify interrupt moderation settings.
- DMA memory: pinned/hugepage policy documented; avoid runtime page faults.
- IOMMU policy: consistent across environments; confirm impact on sustained Mpps.
- SR-IOV support: PF/VF isolation validated under load (resource accounting must be predictable).
- AER enabled: error reporting wired to logs/alerts; no silent link degradation.
- LTSSM visibility: retrain/downshift counters captured; correlate to performance drops.
- Reset/hot-plug behavior: card-level reset domains and safe defaults tested; recovery time bounded.
Hardware timestamping unit
Hardware timestamps are only useful when the event point is explicit and the error budget is explainable. Placement defines which delays are included; clocking defines which domain the timestamp represents; load defines how much jitter is added.
What timestamps are for (card-level): evidence for probes and measurement, billing/charging records, congestion diagnosis, and slice SLA tracking—without requiring the host to infer timing from software queues.
Timestamp placement must be treated as an engineering contract: the stamp should represent a well-defined event along the datapath, so that downstream analysis can separate fixed offsets from load-dependent jitter. A card may support multiple stamp points, but stability usually improves when one primary event point is selected per use case.
Stamp placement options (what each point actually includes):
- T1 · After ingress parse: includes front-end SerDes/retimer fixed delay and parse pipeline; minimizes queueing jitter.
- T2 · After match/action: includes lookup/action arbitration; captures processing completion but becomes load-sensitive.
- T3 · Before DMA enqueue: includes internal queueing and cross-domain waits; often the biggest jitter contributor.
- T4 · Host-visible writeback: includes PCIe transaction and batching; useful for correlation, typically the least stable.
A placement is “better” only if it matches the intended meaning. The correct choice is the one that keeps the error terms bounded and explainable.
The timestamp clock domain must also be explicit. A card can maintain an internal timebase and optionally accept an external reference input disciplined by a timing source (for example, PTP/SyncE as a reference). The goal at card level is not to describe the full timing system, but to define which oscillator/PLL domain drives stamps and how alignment is maintained over temperature and time.
| Error class | Typical sources (card-level) | How it shows up | Primary control knob |
|---|---|---|---|
| Offset (calibratable) | Fixed SerDes/retimer group delay, fixed pipeline stages, constant CDC alignment delay. | Stable bias: stamps are consistently early/late by a constant amount. | Periodic calibration; phase alignment; fixed-delay compensation table. |
| Jitter (load-dependent) | FIFO depth changes, CDC wait variability, scheduler arbitration, queue build-up, DMA batching delay. | p95/p99 spread grows with load; “bursty” tails and time ordering noise. | Bound queue depth; cap batch size; stable arbitration; choose earlier stamp point. |
| Drift (environment) | Oscillator temperature drift, PLL phase noise sensitivity, thermal throttling side effects on timing paths. | Slow movement over minutes/hours; offset changes with temperature. | Temperature-aware compensation; reference-aligned discipline; bounded operating states. |
Calibration & alignment (card-level, minimal):
- Periodic alignment: re-align the card timebase to a reference to keep drift bounded.
- Phase alignment: align PLL phase against the reference domain to reduce long-term offset.
- Temperature compensation: apply a coarse correction based on measured board temperature and a stored slope table.
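The slope-table compensation can be sketched as a piecewise-linear correction. The calibration points below are illustrative, not measured values:

```python
from bisect import bisect_left

# Illustrative calibration points: (board temperature degC, correction ppb).
SLOPE_TABLE = [(-10, 120.0), (25, 0.0), (55, -80.0), (85, -220.0)]

def temp_compensation_ppb(temp_c: float, table=SLOPE_TABLE) -> float:
    """Coarse drift correction from a stored slope table: linear
    interpolation between sorted calibration points, clamped at the ends."""
    temps = [t for t, _ in table]
    if temp_c <= temps[0]:
        return table[0][1]
    if temp_c >= temps[-1]:
        return table[-1][1]
    i = bisect_left(temps, temp_c)
    (t0, c0), (t1, c1) = table[i - 1], table[i]
    return c0 + (c1 - c0) * (temp_c - t0) / (t1 - t0)

print(temp_compensation_ppb(40.0))  # midway between the 25C and 55C points
```

Clamping at the table ends keeps the correction bounded when a sensor reads outside the calibrated envelope, which matters more than accuracy there.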
Acceptance should include: resolution, monotonicity, offset bound after calibration, and p99 jitter growth under peak load.
Control plane interface (card-level): PF/VF, firmware, and safety rails
Card-level control must be isolated from the datapath. PF/VF separation enables tenancy and predictable resource mapping; firmware lifecycle controls prevent upgrades from becoming outages; safety rails ensure failure modes are bounded and diagnosable.
Design goal: datapath traffic stays on primary queues, while configuration and telemetry stay on a sideband control path. This preserves performance stability and makes upgrades and recovery measurable.
In a UPF acceleration context, SR-IOV PF/VF is primarily about isolation and resource accounting. Each VF can represent a tenant or a UPF instance that requires predictable queue capacity, flow-table partitioning, and counter domains. The PF retains privileged control responsibilities and enforces safe configuration boundaries.
| Capability | PF (privileged) | VF (tenant datapath) | Acceptance check |
|---|---|---|---|
| Queue ownership | Create/assign queues, set bounds and policies. | Use assigned RX/TX queues for datapath traffic. | Queue isolation holds under peak load; no cross-tenant starvation. |
| Flow-table resources | Partition or quota table entries and counters. | Consume within assigned limits; observe own counters. | Hit ratio and update backlogs remain explainable per VF. |
| Configuration changes | Apply transactional config and commit/abort. | Read only or limited knobs scoped to the VF. | No partial configuration state is observable to datapath. |
| Telemetry | Aggregate health, errors, and performance counters. | Read VF-scoped metrics for local diagnosis. | Metrics remain available during degraded mode. |
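The PF/VF split in the table reduces to quota accounting. A toy sketch (names, limits, and the accounting granularity are illustrative):

```python
class PfResourceManager:
    """Toy PF-side accounting: the PF partitions queues and flow-table
    entries; each VF consumes only within its quota, so one tenant hitting
    its limit cannot starve another."""

    def __init__(self, total_queues: int, total_entries: int):
        self.free_queues = total_queues
        self.free_entries = total_entries
        self.vfs = {}

    def assign_vf(self, vf_id: str, queues: int, entries: int) -> None:
        if queues > self.free_queues or entries > self.free_entries:
            raise ValueError("quota exceeds remaining PF resources")
        self.free_queues -= queues
        self.free_entries -= entries
        self.vfs[vf_id] = {"queues": queues, "entries": entries, "used": 0}

    def vf_insert_flow(self, vf_id: str) -> bool:
        vf = self.vfs[vf_id]
        if vf["used"] >= vf["entries"]:
            return False  # this VF hits its own quota; others are unaffected
        vf["used"] += 1
        return True
```

The acceptance checks in the table map directly onto this accounting: per-VF `used` vs quota is what makes hit ratio and update backlog "explainable per VF."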
Firmware or bitstream lifecycle must be treated as a reliability feature. At card level, the key is a minimal secure boot chain combined with versioned profiles, atomic configuration, and safe rollback. This prevents “upgrade succeeded” from meaning “performance and behavior changed silently.”
Firmware lifecycle controls (card-level):
- Secure boot (minimum): signed image validation and version binding to prevent untrusted load.
- A/B slots + rollback: upgrade into an inactive slot; rollback on boot/health/performance gating failure.
- Transactional config: stage → validate → commit; avoid partially applied policies reaching the datapath.
- Versioned profiles: explicit defaults for batching/queue depth/table policies to avoid hidden behavioral drift.
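The A/B slot + rollback control is a small state machine. A minimal sketch, assuming a single boolean health gate (a real design would gate on boot, health, and performance checks):

```python
class AbSlots:
    """Minimal A/B slot state machine: upgrade into the inactive slot,
    commit only after the health gate passes, otherwise fall back to
    last-known-good. Version strings are illustrative."""

    def __init__(self):
        self.slots = {"A": "v1.0", "B": None}
        self.active, self.pending = "A", None

    def stage(self, image: str) -> None:
        inactive = "B" if self.active == "A" else "A"
        self.slots[inactive] = image
        self.pending = inactive

    def boot(self, health_ok: bool) -> str:
        if self.pending is not None:
            if health_ok:
                self.active, self.pending = self.pending, None   # commit
            else:
                self.slots[self.pending], self.pending = None, None  # rollback
        return self.slots[self.active]
```

Because the active slot is never written, a failed gate leaves the card on the exact image it last ran, which is what bounds the rollback time.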
Safety rails define bounded failure modes. The objective is to keep the datapath recoverable and the diagnostic surface readable even when the acceleration pipeline is unavailable. A stable design defaults to a predictable state and provides a minimal read-only window for root cause.
Minimum safety rails (card-level):
- Watchdog + heartbeat: liveness detection for control and datapath; triggers bounded recovery.
- Health registers: explicit status machine, fault codes, throttle events, and last-known-good state.
- Fail-safe defaults: predictable degraded behavior (stop accelerating, preserve diagnostics, avoid deadlocks).
- Read-only diagnostics window: access to version/config hash, error counters, and key ring metrics during faults.
Acceptance should include: rollback time bound, config atomicity proof, and “metrics available under failure” checks.
PMBus telemetry & power integrity
Telemetry is operational leverage, not decoration. A useful PMBus design maps each power domain to a small set of readings, assigns thresholds with debounce, and ties each event to a clear policy action and a stable event ID.
Domain-first power tree (card-level): Core (FPGA/ASIC), SerDes, DDR, PCIe, and Aux MCU. Domain mapping makes symptoms explainable: throttling, brownout, or transient droop can be traced to a specific rail group.
A PMBus implementation becomes valuable when it answers “what changed right before performance dropped?” The highest-yield set of telemetry is small: V/I/P/T plus status/fault bits, sampled at a controlled rate and recorded with a consistent event model. Measurements without context are ambiguous; the combination of readings, status, and policy actions is what makes field debugging deterministic.
High-yield telemetry signals (engineering minimum set):
- Readings: Vout, Iout, Pin/Pout, temperature (domain-local sensor) for trend and budgeting.
- Status: over-voltage/under-voltage, over-current, over-temperature, power-good anomalies.
- Peak/min capture (if supported): min Vout or max Iout to catch transient droop and bursts.
- Fault log (if supported): last-event snapshot for correlation with link and performance counters.
The target is fast classification: “thermal throttle” vs “power cap” vs “brownout/droop” vs “sensor/PMBus fault”.
Thresholds must be paired with a policy that preserves stability. A typical design uses three levels: Warn for evidence and trend, Fault for bounded throttling or power capping, and Critical for controlled reset or safe disablement of acceleration. Debounce prevents noisy sensors and transient spikes from triggering oscillation.
Policy actions (card-level) that keep failures bounded:
- Thermal throttle: reduce frequency or performance states to keep temperature under control.
- Power cap: clamp peak power to prevent VRM overload and repeated brownout events.
- Brownout/droop handling: log, rate-limit bursts, and enter a safe state if power-good becomes unreliable.
- Evidence first: tie each action to a stable event ID and a compact log record.
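The Warn/Fault/Critical + debounce policy can be sketched as an N-consecutive-samples filter. The thresholds and the fire-once semantics below are illustrative simplifications:

```python
class DebouncedThreshold:
    """Three-level threshold with N-consecutive-sample debounce: an event
    fires once when `need` consecutive samples sit at a level, so a single
    transient spike cannot trigger a policy action. Limits illustrative."""

    def __init__(self, warn: float, fault: float, critical: float, need: int = 3):
        self.levels = [("CRITICAL", critical), ("FAULT", fault), ("WARN", warn)]
        self.need = need
        self.streak = {"WARN": 0, "FAULT": 0, "CRITICAL": 0}

    def sample(self, reading: float):
        # Highest level the reading crosses, or None if below Warn.
        active = next((name for name, lvl in self.levels if reading >= lvl), None)
        for name in self.streak:
            self.streak[name] = self.streak[name] + 1 if name == active else 0
        if active and self.streak[active] == self.need:
            return active  # debounced event: tie to a stable event ID here
        return None
```

Usage: feeding a temperature sensor through `sample()` turns one spike into nothing and a sustained excursion into exactly one event, which is the oscillation-free behavior the policy requires.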
Host-facing integration at card level should expose a consistent event model rather than raw dumps. The minimum log record should include: timestamp, domain/rail group, reading, threshold, debounce result, action (throttle/cap/reset/log-only), and event ID. This creates a direct bridge from “symptom” to “evidence” and makes intermittent failures diagnosable.
| Domain / rail group | Signals (V/I/T/P + status) | Thresholds | Debounce | Action | Event ID |
|---|---|---|---|---|---|
| Core | Vcore, Icore, Pcore, Tcore, status(OT/OC/UV) | Warn: OT-W · Fault: OT-F / OC-F · Critical: UV-C | 2–5 sample window | Throttle → cap bursts → safe reset if UV persists | EVT_PWR_CORE_* |
| SerDes | Vser, Iser, Tser, status(UV/OT) | Warn: OT-W · Fault: OT-F · Critical: UV-C | Short window + rate limit | Throttle SerDes-related states; log for BER correlation | EVT_PWR_SERDES_* |
| DDR | Vddr, Iddr, Tddr, status(UV/OT) | Warn: OT-W · Fault: OT-F · Critical: UV-C | Window + hysteresis | Throttle + protect state; avoid partial-table corruption | EVT_PWR_DDR_* |
| PCIe | Vpcie, Ipcie, Tpcie, status(UV/PG) | Warn: PG-W · Fault: PG-F · Critical: UV-C | Debounce to avoid chattering | Log + enter bounded mode; protect from repeated retrains | EVT_PWR_PCIE_* |
| Aux MCU | Vaux, Taux, status(UV/OT/comm) | Warn: COMM-W · Fault: COMM-F · Critical: UV-C | Retries + timeout | Preserve diagnostics; fail-safe defaults on comm loss | EVT_PWR_AUX_* |
Thermal & reliability: keeping acceleration deterministic
Determinism is the ability to repeat performance over time. Thermal throttling, hot spots, and error recovery can turn peak throughput into noisy variance. A card-level design must make throttling bounded, error signals visible, and recovery states predictable.
Why “runs fast, then slows down” happens: temperature rises, throttle policies engage, timing margin shrinks, and link error recovery becomes more frequent. The symptom is throughput variance and tail-latency spread.
Thermal is the most common root cause of performance drift in long-duration acceleration. Temperature affects not only frequency limits, but also timing margin and bit error behavior in high-speed interfaces. Card-level thermal design must declare the airflow assumption and place sensors where they predict throttling and error risk, not only where it is convenient to read.
Card-level thermal design elements that preserve repeatability:
- Heat path clarity: package → heatsink → airflow; avoid relying on uncontrolled chassis conduction.
- Sensor placement: core hot spot, VRM hot spot, SerDes vicinity, and inlet air reference.
- Bounded throttle states: discrete throttle levels with stable transitions; avoid oscillation.
- Evidence: record throttle reason codes and duration counters for correlation.
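Bounded throttle states with hysteresis are what prevent oscillation. A minimal sketch with illustrative trip points:

```python
class ThrottleGovernor:
    """Discrete throttle levels with hysteresis: step up at the trip
    temperature, step back down only after cooling past trip - margin,
    so the state cannot oscillate on a noisy reading. Numbers illustrative."""

    TRIPS = [80.0, 90.0, 100.0]  # degC trip points for levels 1..3
    HYSTERESIS = 5.0             # degC of cooling required to step down

    def __init__(self):
        self.level = 0           # 0 = full performance

    def update(self, temp_c: float) -> int:
        while self.level < len(self.TRIPS) and temp_c >= self.TRIPS[self.level]:
            self.level += 1
        while self.level > 0 and temp_c < self.TRIPS[self.level - 1] - self.HYSTERESIS:
            self.level -= 1
        return self.level
```

A reading hovering at a trip point (e.g. 79–81 °C with no hysteresis) would otherwise flip the state every sample; the 5 °C margin is what makes transitions "discrete and stable" in the sense above.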
A simple “power–temperature–performance” view prevents overfitting to peak benchmarks. Instead of complex curves, a compact table can express how performance repeatability changes when airflow and ambient conditions move away from the intended envelope. The objective is not to promise a single number, but to guarantee bounded behavior.
| Test | Traffic / condition | Target | Duration | Pass criteria | Evidence to save |
|---|---|---|---|---|---|
| Correctness | Known flow set + action set | Hit/action consistency | Short + repeat | No silent mismatches; counters monotonic; timestamp format stable | Flow snapshots, action logs, counter dump, timestamp samples |
| Gbps roof | Large packets | Throughput (Gbps) | 10–20 min | Plateau explained by a bandwidth roof | DMA throughput, BW counters, queue occupancy |
| Mpps roof | 64B packets | Packet rate (Mpps) | 10–20 min | No unexpected stage bottleneck; stable rate | Stage counters (if available), ring/doorbell stats, occupancy |
| Tail latency | IMIX + queue/batch sweep | p99 latency | 30–60 min | p99 bounded; no oscillation | p50/p99/p999, occupancy histogram, throttle reason codes |
| Soak stability | Steady load, steady ambient | Repeatability | 30–60 min | Variance bounded after warm-up | Thermal states, throttle duration, error counters |
| Churn stress | Table updates under load | Stability under churn | 20–40 min | No corruption; p99 bounded; recovery predictable | Update rate trace, hit ratio, consistency guard events |
| Operability | Fault triggers + recovery | Evidence loop | Scenario-based | PMBus + AER + logs correlate; safe mode works | Event IDs, AER counters, fault logs, recovery timeline |
| Upgrade / rollback | Firmware/bitstream swap | Non-regression | Procedure | Same behavior and evidence; rollback restores stability | Version hash, config profile, before/after KPIs |
Evidence package (minimum deliverable):
- Version identity: firmware/bitstream hash, driver version, profile ID.
- Configuration snapshot: queue mode, batch rules, ring sizes, affinity constraints (recorded, not implied).
- Counters: throughput/Mpps, p99, queue occupancy, throttle/cap events, AER error counts.
- Event trace: stable event IDs with timestamps for correlation and triage.
When production is slower than the lab, the cause is often configuration drift rather than “mystical load.” The checklist prevents drift by forcing explicit capture of affinity constraints, queue/batch rules, and telemetry sampling cadence. If results still differ, the evidence package enables fast classification: bandwidth roof, per-packet stage limit, or queueing tail.
Debug playbook: symptoms → root cause
The goal is a closed loop at card level: symptom → hypothesis → evidence counters → targeted change → re-test. Each path below names the fastest discriminators first, so debugging does not drift into “try-and-hope.”
Symptom A — Throughput is low, but host CPU is not busy
- Check PCIe effective bandwidth: link width/speed, replay counters, unexpected downshift, DMA completion rate.
- Check NUMA & IOMMU effects: pinned memory, hugepages, IOMMU passthrough vs translation overhead (look for “copy tax”).
- Check queue saturation: RX/TX ring occupancy, backpressure, descriptor starvation, doorbell cadence.
- Evidence to capture (card-level): DMA bytes/s, queue watermarks, PCIe error counters (AER), dropped descriptors, recovery events.
Fast discriminator: If PCIe throughput is capped well below expectation while AER/replay rises, treat it as a link-quality/topology issue before tuning datapath logic.
Symptom B — Small-packet Mpps is poor
- Check per-packet fixed costs: parser complexity, key-build steps, action fan-out, metadata expansion.
- Check doorbell/batch strategy: too-frequent doorbells or too-small batches amplify overhead; too-large batches inflate tail latency.
- Check table collisions & fallback: collision rate, wildcard path frequency, host fallback rate.
- Evidence to capture: parser cycles/packet, lookup cycles/packet, collision counters, fallback counters, doorbell rate, “work done per interrupt/poll cycle”.
Fast discriminator: If collision/fallback rises with load, Mpps will collapse even when Gbps looks acceptable.
Symptom C — p99 latency / jitter is large
- Check queueing: ring depth, scheduling policy, head-of-line blocking, burst absorption.
- Check throttling sources: thermal throttling, rail brownout events, link retrain bursts (micro-stalls).
- Check DMA batching: large batches create “sawtooth” latency; small batches raise CPU/doorbell cost.
- Evidence to capture: queue dwell histogram (p50/p95/p99), throttle reason codes, retrain count, DMA batch size distribution.
Fast discriminator: If p99 expands while p50 stays flat, the cause is almost always queue/batch/interrupt scheduling rather than raw pipeline speed.
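The p99-vs-p50 discriminator is easy to automate over a queue-dwell sample set. The 5× ratio below is an arbitrary placeholder threshold:

```python
def classify_tail(dwell_us):
    """Apply the fast discriminator to queue-dwell samples (microseconds):
    p99 far above a flat p50 points at queue/batch/interrupt scheduling;
    both moving together points at raw pipeline or bandwidth limits."""
    s = sorted(dwell_us)
    p50 = s[len(s) // 2]
    p99 = s[min(len(s) - 1, int(len(s) * 0.99))]
    if p99 > 5 * p50:
        return "queue/batch/interrupt scheduling"
    return "pipeline-speed or bandwidth limited"

print(classify_tail([10.0] * 99 + [200.0]))  # bursty tail, flat median
```

The same function run per-queue often localizes the problem to one ring or one NUMA node before any tuning begins.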
Symptom D — Timestamp drifts, jumps, or becomes “non-monotonic”
- Check timebase stability: reference selection, holdover state, phase alignment health flags.
- Check stamp placement: ingress vs egress vs DMA-writeback (different error terms dominate).
- Check CDC/FIFO & batching: clock-domain crossings and batch commit boundaries can create step-like artifacts.
- Evidence to capture: timebase status bits, phase error metrics, per-stage delta (ingress→egress), timestamp jump counter.
Fast discriminator: If jumps align with queue/batch boundaries, the root cause is likely batching/commit timing rather than the oscillator itself.
Symptom E — Card occasionally resets, drops, or “reconnects”
- Check power transients: rail UV/OV, inrush, hotspot-induced droop, PMBus fault logs.
- Check PCIe AER & recovery: surprise-down, completion timeouts, malformed TLPs, link retrain storms.
- Check watchdog & firmware health: watchdog reason codes, assert logs, last-known-good rollback triggers.
- Evidence to capture: PMBus fault history, AER snapshot, watchdog reset cause, thermal peak vs reset timestamp.
Fast discriminator: If PMBus logs show UV/OT near the event time, treat it as power/thermal first; performance tuning will not fix instability.
BOM / selection checklist (criteria + example part numbers)
Procurement and engineering should share the same one-page language: Requirement → measurable metric → test method → risk note. The BOM list below provides example part numbers commonly used to build this class of PCIe inline accelerator card.
- Acceleration determinism: stable Mpps and p99 latency under table churn and temperature drift.
- Card-level operability: PMBus rails, fault history, and event IDs that survive field conditions.
- Timestamp credibility: provable error budget from stamp placement, CDC, FIFO, and batching.
- PCIe robustness: AER visibility, link stability, and predictable DMA behavior across platforms.
| Requirement | Metric (what to lock) | Verification (how to prove) | Risk / common pitfall |
|---|---|---|---|
| Small-packet capacity | 64B Mpps @ target line-rate profile; stable across table churn | Traffic generator + fixed NUMA; sweep batch size; churn test (updates/s) | Looks good at steady-state, collapses when collision/fallback rises |
| Latency determinism | p99 latency & jitter bounds under burst + throttling disabled/enabled | Measure per-stage dwell (queue, pipeline, DMA); correlate with throttle flags | Queue depth “fixes drops” but silently inflates tail latency |
| Flow-table behavior | Lookup rate, update rate, collision %, hitless update guarantees | Profile: hit/miss/collision counters; staged update test (versioned tables) | Table churn steals memory bandwidth and triggers long stalls |
| PCIe subsystem | Effective GB/s, AER error rate, DMA completion stability | Link training logs + AER snapshot; sustained DMA copy tests; queue watermark | “Card is fast” but platform caps PCIe or IOMMU adds copy tax |
| Timestamp credibility | Stamp granularity + provable error budget (per stage) | Inject known delays; compare ingress/egress deltas; monitor jump counters | Batch commits hide real timing; CDC/FIFO adds step artifacts |
| Field operability | PMBus coverage (V/I/T/P), fault history, event IDs | Fault injection: UV/OT/OV; confirm logs + debounce + clear procedure | Telemetry exists but lacks stable event IDs or is too noisy to use |
| Reliability | Recovery behavior after reset/power cycle; watchdog reasons | Reset drills + firmware rollback rehearsal; confirm “last-known-good” path | Recovery depends on manual steps; field becomes unserviceable |
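The requirement matrix above can be encoded as machine-checkable pass/fail clauses so both sides evaluate the same evidence pack. The sketch below uses illustrative thresholds and field names; they are assumptions, not values from any datasheet.

```python
# Hypothetical sketch: encode acceptance clauses as (metric, predicate)
# pairs. Thresholds and evidence field names are illustrative only.

CLAUSES = {
    "mpps_64b":        lambda v: v >= 90.0,  # 64B Mpps under churn
    "p99_latency_us":  lambda v: v <= 12.0,  # tail latency bound
    "collision_pct":   lambda v: v <= 1.5,   # flow-table collisions
    "aer_uncorr":      lambda v: v == 0,     # no uncorrectable PCIe errors
    "pmbus_faults":    lambda v: v == 0,     # clean fault history at soak end
}

def evaluate(evidence: dict) -> dict:
    """Return clause -> pass/fail; missing evidence is an automatic fail."""
    return {name: bool(name in evidence and pred(evidence[name]))
            for name, pred in CLAUSES.items()}

pack = {"mpps_64b": 95.2, "p99_latency_us": 10.4,
        "collision_pct": 0.8, "aer_uncorr": 0}   # pmbus_faults missing
result = evaluate(pack)
print(result["mpps_64b"], result["pmbus_faults"])  # True False
```

Treating a missing evidence field as a failure keeps "telemetry exists but was not captured" from slipping through acceptance.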
The list below is intentionally “engineering-facing”: each item is tied to a responsibility on the card. Exact ordering codes depend on package, speed grade, and operating temperature.
- PCIe Gen4 switch: PEX88096 (Broadcom) — host-to-multi-function fanout or multi-endpoint topologies.
- PCIe Gen4 redriver: DS160PR412 (TI) — channel margin extension; use where insertion loss is high.
- PCIe Gen5 redriver (option): DS320PR810 (TI) — if targeting PCIe 5.0 signal budget (platform-dependent).
- Jitter-attenuating clock: Si5345 (Skyworks/Silicon Labs) — clean clock distribution and holdover behaviors.
- Sync management / clock matrix: 8A34001 (Renesas) — timing-reference management class device.
- Ultra-low-jitter option: Si5395 (Skyworks/Silicon Labs) — for tighter SerDes jitter budgets.
- Multiphase controller w/ PMBus: TPS53679 (TI) — server-class VCORE control, NVM + PMBus.
- Digital hybrid controller w/ PMBus: ISL68200 (Renesas) — telemetry + fault reporting via PMBus/SMBus.
- Digital power monitor: INA228 (TI) — high-resolution current/voltage/power/energy monitoring (I²C).
- Smart eFuse: TPS25982 (TI) — integrated hot-swap behavior + accurate current monitoring.
- Hot-swap controller: LM5069 (TI) — inrush control and power limiting for live-insertion scenarios.
- Multi-channel remote diode sensor: TMP464 (TI) — monitors hotspots (FPGA/ASIC diodes) plus local temperature.
- Secure element / crypto co-processor: ATECC608A (Microchip) — key storage and device-authentication primitives.
- SPI NOR flash: W25Q128JV (Winbond) — firmware/bitstream storage class device.
Acceleration engine device examples (naming-level): VP1802 (AMD Versal Premium), Agilex 7 (Altera/Intel).
Final device ordering codes vary by package/speed/temperature and platform IO requirements.
H2-13 · FAQs (UPF Inline Accelerator Card)
These 12 answers stay at card level and point to the exact chapters that prove performance, determinism, and operability. The recurring measurable-evidence fields are:
- PCIe: link speed/width, AER snapshot (correctable/uncorrectable), retrain/downshift events
- DMA/Queues: ring occupancy watermarks, drops, batch histogram, doorbell rate
- Flow-table: hit/miss, collision %, fallback rate, update backlog, version switch events
- Timestamp: timebase lock/holdover flags, jump counter, stage deltas (ingress→egress)
- Power/Thermal: PMBus status/fault history, rail V/I/T, throttle reason codes
Example card-side parts often seen in this class: PEX88096 (PCIe switch), DS160PR412 (PCIe redriver), Si5345 (jitter cleaner), TPS53679/ISL68200 (PMBus controllers), INA228 (power monitor), TMP464 (thermal sensor), TPS25982 (eFuse), LM5069 (hot-swap), W25Q128JV (SPI NOR), ATECC608A (secure element).
1) What is the practical boundary between this card and a SmartNIC/DPU?
A UPF inline accelerator card is a PCIe endpoint optimized for the UPF hot path: fixed match-action flow-table, deterministic timestamp export, and card-level telemetry/health. Session management, policy decisions, and most control-plane state remain in the host stack. A SmartNIC/DPU is a broader platform (virtualization, services, and programmable ecosystem); this page stays on PF/VF control, firmware safety, and measurable datapath offload.
Go deeper: H2-1 (boundary) · H2-6 (PF/VF, firmware, safety rails)
2) Why can Gbps look fine while 64B Mpps is poor—where is the root cause usually?
Large packets are bandwidth-limited (PCIe/memory), while 64B traffic is limited by per-packet fixed cost: parser/key-build, lookup/action cycles, and host↔card doorbell/descriptor overhead. Mpps also collapses when collision/fallback rises or batches are too small. Prove the class quickly with doorbell rate, batch histogram, queue watermarks, and table collision/fallback counters before changing hardware.
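The fixed-cost argument can be made concrete with back-of-envelope math: at 100 Gb/s, a 64B frame plus standard Ethernet overhead (8B preamble + 12B inter-frame gap) occupies 672 bits on the wire, so line rate is roughly 148.8 Mpps and the per-packet budget is under 7 ns.

```python
# Back-of-envelope: why 64B Mpps exposes per-packet fixed cost.
# Assumes standard Ethernet overheads: 8B preamble + 12B IFG = 20B.

def line_rate_mpps(link_gbps, frame_bytes, overhead_bytes=20):
    bits_on_wire = (frame_bytes + overhead_bytes) * 8
    return link_gbps * 1e9 / bits_on_wire / 1e6

def ns_budget_per_packet(link_gbps, frame_bytes, overhead_bytes=20):
    return (frame_bytes + overhead_bytes) * 8 / link_gbps

mpps = line_rate_mpps(100, 64)           # ~148.81 Mpps at 100G line rate
budget = ns_budget_per_packet(100, 64)   # 6.72 ns budget per packet
print(round(mpps, 2), round(budget, 2))
```

Any fixed cost above that budget (doorbell write, descriptor fetch, hash probe chain) caps Mpps regardless of how much PCIe or memory bandwidth is left over.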
3) Should flow-table design prioritize capacity or update rate, and how to balance?
Capacity matters when the rule set is large and stable; update rate matters when session churn is high or rules change frequently. The real balance point is whether hitless updates and p99 latency remain bounded during churn. Lock acceptance clauses to updates/s, collision %, fallback rate, and “p99 under churn” instead of chasing entry count alone. Versioned tables (double-buffer + atomic swap) reduce stalls.
Go deeper: H2-3 (table architecture + consistency model)
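The versioned-table (double-buffer + atomic swap) idea can be illustrated in a few lines. A real card implements this in hardware/firmware; the sketch below only demonstrates the consistency model, and all names are illustrative.

```python
# Hypothetical sketch of double-buffer + atomic swap: lookups always read a
# complete table; updates build a shadow copy and switch with one reference
# assignment, so readers never observe a half-applied update.

import threading

class VersionedTable:
    def __init__(self):
        self._active = {}               # lookups read this reference only
        self._lock = threading.Lock()   # serializes writers, not readers

    def lookup(self, key):
        return self._active.get(key)    # never sees a partial update

    def apply_updates(self, updates: dict):
        with self._lock:
            shadow = dict(self._active)  # build next version off-path
            shadow.update(updates)
            self._active = shadow        # atomic swap (reference assign)

t = VersionedTable()
t.apply_updates({"flow-a": "fwd", "flow-b": "drop"})
t.apply_updates({"flow-b": "fwd"})
print(t.lookup("flow-a"), t.lookup("flow-b"))  # fwd fwd
```

The cost of this scheme is transient double memory for the shadow copy, which is exactly why churn shows up as memory-bandwidth pressure on real hardware.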
4) Is TCAM mandatory, and when is SRAM hash enough?
TCAM is valuable for wildcarding and strict priority matching; it is not automatically required for throughput. If most rules are exact-match and priorities are simple, an SRAM hash pipeline can deliver higher lookup rate and lower power. A common compromise is L1 hash for the majority, plus a small TCAM/priority stage for exceptions, with host fallback for rare cases. Decide using measured wildcard frequency and collision behavior.
Go deeper: H2-3 (hierarchy: hash/TCAM/DRAM/host fallback)
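The compromise described above (exact-match hash first, a small priority stage for exceptions, host fallback last) can be sketched as a tiered lookup. Rule format and names below are illustrative assumptions.

```python
# Hypothetical sketch of a tiered lookup: L1 exact-match hash, L2 small
# TCAM-like priority/wildcard stage, then software fallback for rare cases.

def make_tiered_lookup(hash_table, wildcard_rules, host_fallback):
    """wildcard_rules: list of (priority, predicate, action); higher wins."""
    def lookup(key):
        action = hash_table.get(key)          # L1: exact-match SRAM hash
        if action is not None:
            return action, "hash"
        for _prio, pred, action in sorted(wildcard_rules,
                                          key=lambda r: r[0], reverse=True):
            if pred(key):                     # L2: wildcard/priority stage
                return action, "tcam"
        return host_fallback(key), "host"     # rare: punt to software
    return lookup

lookup = make_tiered_lookup(
    {("10.0.0.1", 80): "fwd"},                      # exact-match entries
    [(10, lambda k: k[1] == 443, "inspect")],       # one wildcard exception
    lambda k: "default",                            # host fallback policy
)
print(lookup(("10.0.0.1", 80)))   # ('fwd', 'hash')
print(lookup(("10.0.0.9", 443)))  # ('inspect', 'tcam')
print(lookup(("10.0.0.9", 22)))   # ('default', 'host')
```

Counting how often each tier answers ("hash" vs "tcam" vs "host") is the measured wildcard-frequency evidence the answer above asks for.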
5) How does table churn show up as symptoms in production?
Table churn typically appears as p99 latency spikes, sawtooth throughput, hit ratio oscillation, and sudden rises in collision/fallback when updates burst. If updates steal memory bandwidth or trigger table migration/flush, queues inflate and jitter widens without obvious “packet errors.” Capture update backlog, version-switch events, collision %, fallback rate, and queue occupancy histograms; correlate spikes with update bursts to confirm churn.
6) Should timestamps be taken at ingress or egress, and what error terms change?
Ingress stamping is earlier and excludes most queueing and emit effects, making it better for isolating parsing and lookup timing. Egress stamping is closer to “leave-card time” but includes queue dwell, action scheduling, CDC/FIFO effects, and batching/commit boundaries. The choice depends on whether the goal is pipeline attribution or end-to-end SLA accounting. A credible design exposes placement selection and stage deltas.
Go deeper: H2-5 (placement + error budget)
7) Why do timestamps occasionally jump or drift, and what is the fastest debug path?
Start with timebase health: lock/holdover flags and reference selection changes. Next, check whether jumps align with queue/batch boundaries (commit timing) rather than oscillator behavior. Then inspect CDC/FIFO error flags, calibration events, and placement profile changes. Evidence should include jump counters, timebase status bits, phase/alignment metrics (if available), and correlation of jitter with throttling or PCIe retrains.
8) Why can line rate still be unreachable after adding an accelerator card—common PCIe pitfalls?
The most common culprits are platform-level PCIe limits: unexpected Gen/width downshift, IOMMU copy tax, NUMA mismatch, or switch/retimer margin issues that increase AER/replay and trigger micro-stalls. Queue sizing and DMA batching can also cap effective throughput. Confirm link status and AER first, then validate sustained DMA GB/s and queue watermarks. Hardware choices like PEX88096 (switch) and DS160PR412 (redriver) target topology/margin.
Go deeper: H2-4 (DMA/queues/topology checklist)
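A quick sanity check for "platform caps PCIe" is to compare measured DMA throughput against the theoretical effective bandwidth after 128b/130b encoding and TLP overhead. The payload and overhead sizes below are typical assumptions (256B max payload, ~24B header/framing per TLP), not platform-specific values.

```python
# Back-of-envelope: raw PCIe Gen4 bandwidth vs effective DMA throughput
# after 128b/130b encoding and per-TLP overhead (sizes are assumptions).

def pcie_effective_gbytes(gt_per_s, lanes, payload=256, tlp_overhead=24):
    raw = gt_per_s * 1e9 * lanes * (128 / 130) / 8   # bytes/s after encoding
    return raw * payload / (payload + tlp_overhead) / 1e9

gen4_x16 = pcie_effective_gbytes(16, 16)  # ~28.8 GB/s effective
gen4_x8 = pcie_effective_gbytes(16, 8)    # halved: the downshift symptom
print(round(gen4_x16, 1), round(gen4_x8, 1))
```

If the sustained DMA test lands far below this envelope at the reported link width, look at IOMMU copy tax, NUMA placement, or queue sizing before blaming the card.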
9) Which 5 PMBus signals are most valuable to capture for field operability?
The most actionable set is: total input power (budget and caps), core rail current (load transients), hottest temperature (throttle trigger), fault/status word (UV/OV/OT/OC), and throttle reason/event code (turn alarms into a timeline). Devices commonly used around this function include INA228 for high-resolution power monitoring and PMBus controllers such as TPS53679 or ISL68200 for rail telemetry and fault history.
Go deeper: H2-7 (telemetry map + thresholds → actions)
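Turning the fault/status word into a timeline starts with decoding its bits. The sketch below follows the common PMBus STATUS_WORD bit layout; treat it as illustrative and confirm positions against the specific controller's datasheet.

```python
# Illustrative decode of a PMBus STATUS_WORD into named fault flags.
# Bit positions follow the standard PMBus STATUS_WORD convention; verify
# against the actual device datasheet before relying on them.

STATUS_BITS = {
    15: "VOUT", 14: "IOUT/POUT", 13: "INPUT", 12: "MFR_SPECIFIC",
    11: "POWER_GOOD#", 10: "FANS", 9: "OTHER", 8: "UNKNOWN",
    7: "BUSY", 6: "OFF", 5: "VOUT_OV_FAULT", 4: "IOUT_OC_FAULT",
    3: "VIN_UV_FAULT", 2: "TEMPERATURE", 1: "CML", 0: "NONE_OF_THE_ABOVE",
}

def decode_status_word(word: int) -> list:
    return [name for bit, name in sorted(STATUS_BITS.items(), reverse=True)
            if word & (1 << bit)]

# 0x000C = bits 3 and 2 set: the "treat as power/thermal first" case
print(decode_status_word(0x000C))  # ['VIN_UV_FAULT', 'TEMPERATURE']
```

Logging the decoded names with a timestamp and a stable event ID is what turns raw alarms into the fault timeline described above.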
10) Why can throughput drop after running at load for a while without explicit errors?
Silent degradation is often controlled derating: thermal throttling, power capping, or error-recovery behavior (retrain/CRC replay) that reduces effective throughput without “hard faults.” Confirm whether temperature ramps align with the drop (e.g., TMP464 hotspot channels), whether PMBus shows throttle flags or rail droop events, and whether PCIe AER correctables increase over time. If symptoms match churn windows, include update counters too.
11) After a firmware upgrade, performance becomes worse or unstable—how to rollback and prove?
Treat firmware/bitstream changes as experiments: lock the traffic profile, NUMA placement, batch policy, and telemetry sampling, then compare evidence packs side-by-side. Rollback must be atomic and rehearsed (last-known-good image + configuration versioning). Common card artifacts include SPI NOR like W25Q128JV for images and a secure element like ATECC608A for authentication. Validate stability under churn and thermal soak, not only peak throughput.
12) How to write acceptance clauses so throughput/Mpps/latency/stability/observability are all measurable?
Use a matrix: (a) Gbps at defined packet mix, (b) 64B Mpps with fixed batch policy, (c) p99 latency under burst, (d) churn stress (updates/s) with bounded p99, (e) thermal soak to steady-state with no unreported derating, and (f) observability: mandatory evidence fields (AER, queue histograms, table counters, timestamp health, PMBus fault history). Each clause must name a test method and pass/fail thresholds.