
SmartNIC / DPU: Programmable Data Plane & Offload Design


A SmartNIC/DPU moves the high-rate data plane from the host CPU onto a programmable card so that packet processing, crypto/compression, and queue shaping run with lower jitter and better isolation. Real outcomes depend on pipeline depth, PCIe/DMA integration, queue/DDR behavior, and measurable telemetry (counters, throttling, rollbackable firmware) rather than on headline “100G” specs.

H2-1 · Definition & Boundary: SmartNIC vs Traditional NIC vs DPU

Featured definition: A SmartNIC/DPU moves the packet data plane (parsing, match/action, queueing, and selected accelerations) from the host CPU onto a card/board with its own processing and memory, reducing host overhead and jitter—at the cost of additional firmware/driver coupling and a larger debug/observability surface.

What this section enables

  • Classify a product using three boundaries (programmability, resource independence, host bypass).
  • Predict where performance and reliability limits will come from (pipeline, memory, queues, or host interface).
  • Decide the right integration model (kernel, SR-IOV, virtio/DPDK) without creating an un-debuggable platform.

Three boundary axes (each with testable checks)

Axis 1 — Programmability: what can change without swapping hardware?
  • Level 0 (fixed offloads): checksum/TSO, basic crypto engines, limited knobs.
  • Level 1 (configurable pipeline): tables/rules are loadable; the action set is bounded.
  • Level 2 (programmable datapath): P4/SDK/microcode defines pipeline stages/actions; versioning and rollback are possible.
Field checks: Can logic change without re-image downtime? Are rule updates versioned/rollbackable? Are per-stage counters visible?

Axis 2 — Resource independence: is there a real execution + memory domain on the card?
  • Traditional NIC: limited embedded logic; the host remains the dominant execution domain.
  • SmartNIC: meaningful on-card resources (cores/accelerators + local memory) for datapath services.
  • DPU: on-card compute behaves like a mini-system (multiple cores, richer memory, stronger isolation and lifecycle control).
Field checks: Does the datapath stay stable under host CPU stress? Can health/telemetry/logging run on-card during host incidents?

Axis 3 — Host bypass degree: how much host hop and host CPU work is removed?
  • Bypass CPU cycles: crypto/compression/regex/flow steering moved off host cores.
  • Bypass host hop: vSwitch/eSwitch functions avoid kernel hot paths and context switching.
  • Bypass trust boundary (card-side isolation): per-tenant rules/keys/counters isolated on-card; measured/secure boot supports an auditable lifecycle.
Field checks: Do packets traverse the host kernel in the steady state? Are per-tenant resources and keys isolated and auditable?

Quick classification rules (If/Then)

If rule changes are limited to a few knobs, treat the device as an enhanced NIC, not a SmartNIC platform.
If datapath logic can be updated and rolled back with per-stage counters, the device qualifies as a programmable SmartNIC.
If on-card compute and memory host a lifecycle-managed control plane (updates/attestation/telemetry), the device behaves as a DPU-class platform.
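The three If/Then rules can be sketched as a small decision function. This is a hypothetical helper for illustration; the parameter names are assumptions, not any vendor's API:

```python
def classify_device(rule_scope: str,
                    per_stage_counters: bool,
                    rollback_supported: bool,
                    on_card_control_plane: bool) -> str:
    """Apply the three classification rules in order of strength.

    rule_scope: "knobs" when only a few fixed settings can change,
    "datapath" when pipeline logic itself can be updated.
    """
    # Rule 3: lifecycle-managed on-card control plane -> DPU-class
    if on_card_control_plane and rollback_supported:
        return "DPU-class"
    # Rule 2: updatable + rollbackable datapath with per-stage counters
    if rule_scope == "datapath" and rollback_supported and per_stage_counters:
        return "programmable SmartNIC"
    # Rule 1: anything less is an enhanced NIC, not a SmartNIC platform
    return "enhanced NIC"
```

The ordering matters: a DPU-class device also passes the SmartNIC test, so the strongest evidence is checked first.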

Comparison table (engineering decision fields)

Programmability model
  • Traditional NIC: fixed functions + limited config.
  • SmartNIC: configurable or programmable pipeline (P4/SDK/microcode).
  • DPU: programmable datapath + richer on-card services.
Isolation boundary
  • Traditional NIC: basic queue/VLAN separation.
  • SmartNIC: per-VF/tenant rules + queue steering; stronger resource partitioning.
  • DPU: multi-tenant isolation with lifecycle controls and stronger policy/key domains.
Control plane location
  • Traditional NIC: host-centric.
  • SmartNIC: split (host API + on-card agents/counters).
  • DPU: on-card control services are first-class (updates/attestation/telemetry).
Best-at metrics
  • Traditional NIC: line-rate throughput for common cases.
  • SmartNIC: lower host CPU%, better jitter, higher effective Mpps for selected flows.
  • DPU: max offload breadth + consistent tail latency under host contention (when designed well).
Typical failure mode
  • Traditional NIC: driver/firmware mismatch; limited visibility.
  • SmartNIC: rule/pipeline scaling hits memory/queues; observability split across domains.
  • DPU: lifecycle complexity; update + attestation + policy consistency becomes critical.

Practical takeaway: classification is not branding—use the three axes above and verify that counters, rollback, and isolation are available as operational evidence.

Figure F1 — Host ↔ SmartNIC/DPU ↔ Network: fast path, control path, telemetry path
The core engineering boundary is observable: a programmable fast path on the card plus a control/telemetry path that supports versioning, rollout, rollback, and measurable evidence (counters/logs).

H2-2 · Offload Decision Map: What to Move onto SmartNIC/DPU (and What Not To)

Offload is valuable only when it produces measurable gains: lower host CPU usage, higher effective Mpps, tighter tail latency, and stronger multi-tenant isolation. The decision should be driven by traffic shape (64B-heavy vs large packets), change rate (how often rules and policies change), and operational evidence (counters, logs, rollback).

Offload buckets by where the benefit comes from

  • CPU cycles: TLS/IPsec/MACsec, compression, regex.
  • Latency/jitter: eSwitch/vSwitch, ACL/flow steering, fewer host hops.
  • I/O efficiency: RDMA/RoCE, zero-copy, queue steering.
  • Isolation: per-tenant queues/rules/keys/counters.

Selection checklist (one-line, engineering-grade)

  • Primary target: optimize Mpps (64B packets) or Gbps (throughput)? Offload choice changes completely.
  • Latency goal: average latency or p99/p999 tail? Queueing and backpressure dominate tails.
  • Rule complexity: how deep is match/action? Table width and action fan-out often hit memory bandwidth first.
  • Change rate: are policies updated daily? High churn requires versioning + safe rollout + rollback evidence.
  • Host transparency: can this stay kernel/SR-IOV/virtio, or does it require DPDK/SDK integration?
  • Observability: are per-stage counters, queue depths, drop reasons, and temperature/rail telemetry accessible?
  • Correctness constraints: does offload need strict replay/ordering semantics (crypto) or content boundaries (compression)?

When offload should be avoided (red lines)

Do not offload when the operational cost outweighs the gain:
  • Unstable requirements: policies change faster than the pipeline/SDK can be safely validated and rolled back.
  • Insufficient evidence: counters/logs are missing, making incidents non-reproducible and blame ambiguous.
  • Tail-latency sensitivity: queueing/backpressure cannot be bounded, creating p999 spikes after offload.
  • High app refactor cost: the required user-space stack changes exceed the CPU% savings.
  • Security lifecycle gaps: firmware update/rollback and measured boot are not auditable under multi-tenant operation.

Validation loop (prove the offload is real and repeatable)

  1. Baseline: measure host-only CPU%, Mpps/Gbps, and p99/p999 under the same traffic mix.
  2. Enable one offload at a time: avoid “everything on” changes that hide root causes.
  3. Collect evidence from three planes: datapath counters (hits/drops/queues), host counters (softirq/context switches), and link/SerDes evidence (FEC/BER/flap) if relevant.
  4. Rollback test: roll back rule/firmware/config versions and confirm the delta is reproducible.
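The validation loop above amounts to comparing metric snapshots and checking that a rollback reproduces the baseline. A minimal sketch, with illustrative metric names and an assumed 5% relative tolerance:

```python
def metric_delta(baseline: dict, candidate: dict) -> dict:
    """Per-metric difference between two runs of the same traffic mix."""
    return {k: candidate[k] - baseline[k] for k in baseline}

def rollback_reproduces(baseline: dict, after_rollback: dict,
                        tol: float = 0.05) -> bool:
    """A rollback is credible when every metric returns to within
    `tol` (relative) of the pre-offload baseline."""
    return all(abs(after_rollback[k] - v) <= tol * abs(v)
               for k, v in baseline.items())

# Example: host-only baseline vs a single offload enabled, then rollback.
baseline = {"host_cpu_pct": 72.0, "p99_us": 180.0}
with_offload = {"host_cpu_pct": 41.0, "p99_us": 150.0}
gain = metric_delta(baseline, with_offload)  # negative = improvement
```

If `rollback_reproduces` fails, the measured "gain" was confounded by something else (thermal state, driver settings, traffic drift) and the offload delta is not yet evidence.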
Figure F2 — Offload decision map: benefit source → constraints → evidence
Offload decisions should map to a measurable benefit source (CPU, jitter, I/O, isolation) and must include evidence (counters, versions, rollback) to avoid irreproducible field incidents.

H2-3 · Programmable Data-Plane Pipeline: Parser → Match/Action → Scheduler

Key idea: SmartNIC/DPU throughput does not fail “randomly” when rules get complex. It typically collapses at one of three places: (1) miss paths (default/slow path or recirculation), (2) action inflation (too many stages or memory touches per packet), or (3) queue backpressure (microbursts filling buffers and pushing latency tails).

Pipeline view (three stages, three budgets)

A programmable data plane can be modeled as a fixed budget per packet. Each packet consumes a portion of the budget in: parsing (extracting fields and metadata), lookup + actions (tables and modifications), and scheduling (queues, shaping, and congestion decisions). When rule complexity increases, the per-packet cost rises until either pipeline depth or memory bandwidth becomes the limiting factor.

Budget 1 — Parse work: field extraction steps · branching · re-parse.
Budget 2 — Table + action work: lookups per packet · action chain length · memory touches.
Budget 3 — Queue work: enqueue/dequeue ops · shaping decisions · microburst absorption.
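The three budgets can be combined into a toy per-packet cost model. All stage timings below are illustrative assumptions, not measured silicon numbers; the point is how the ceiling moves when rules get richer:

```python
def per_packet_ns(parse_steps: int, lookups: int, actions: int,
                  mem_touches: int,
                  t_parse=0.5, t_lookup=1.2, t_action=0.8, t_mem=2.0) -> float:
    """Sum the three budgets in nanoseconds per packet
    (timings are arbitrary placeholders)."""
    return (parse_steps * t_parse + lookups * t_lookup
            + actions * t_action + mem_touches * t_mem)

def ceiling_mpps(cost_ns: float) -> float:
    """A cost of N ns/packet caps the pipeline at 1000/N Mpps."""
    return 1e3 / cost_ns

# A simple rule set vs a "richer" policy on the same hardware model:
simple = per_packet_ns(parse_steps=2, lookups=1, actions=1, mem_touches=1)
rich = per_packet_ns(parse_steps=4, lookups=3, actions=4, mem_touches=6)
```

In this model the richer policy roughly quadruples the per-packet cost, which is exactly the "cliff" behavior: throughput is not a property of the card, it is a property of the rule set running on it.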

Stage A: Parser (fields → metadata)

  • Role: identify headers, extract key fields, and create metadata used by later tables.
  • Engineering constraints: variable-length headers, deep header stacks, and conditional parsing increase stage pressure.
  • Failure signature: rule sets that depend on many extracted fields often trigger extra parsing steps or recirculation.
  • Evidence to collect: parser drops, re-parse/recirculation counters, and “unknown header” counts.

Stage B: Match/Action (tables → decisions)

Match/Action is where most rule complexity lives. Tables are implemented with different resources: TCAM (flexible but power/scale limited), SRAM/hash (efficient but sensitive to collisions and key design), or hybrid pipelines. The real cost is not “how many rules exist,” but how many lookups and actions per packet are executed, and how many memory touches each action requires.

  • Table scale pressure: large rule sets increase key width, hit rate variability, and update overhead.
  • Action inflation: action chains that touch counters, modify multiple headers, or do multiple checks amplify per-packet work.
  • Miss-path penalty: a “miss” can mean slow fallback logic, default actions, or extra recirculation—often the first cliff.
  • Evidence to collect: hit/miss counters per table, action execution counts, and recirculation counts.

Stage C: Scheduler / Queueing (priority, shaping, congestion)

Queueing behavior determines tail latency. Even when average throughput looks stable, microbursts can fill queues, causing backpressure that propagates upstream and creates p99/p999 spikes. When backpressure becomes the steady-state, throughput can appear “fine” while application QoS collapses.

  • Queue depth vs. tail latency: deep queues hide loss but inflate tails; shallow queues reduce tails but may drop earlier.
  • Backpressure propagation: egress congestion can stall ingress if buffering and scheduling are not isolated.
  • Evidence to collect: queue depth distributions, drop reasons (tail drop vs WRED/RED), and p99/p999 latency correlations.
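Why a microburst keeps hurting long after it ends can be seen in a tick-based single-queue sketch (the arrival and drain numbers are arbitrary):

```python
def simulate_queue(arrivals, drain_per_tick, depth_limit):
    """Track queue occupancy per tick; overflow beyond depth_limit
    is counted as drops."""
    q, drops, trace = 0, 0, []
    for a in arrivals:
        q += a
        if q > depth_limit:
            drops += q - depth_limit  # tail drop on overflow
            q = depth_limit
        q = max(0, q - drain_per_tick)
        trace.append(q)
    return trace, drops

# Average arrival equals drain (10/tick), but a single 100-packet burst
# leaves the queue persistently occupied: occupancy never returns to
# zero, so every later packet inherits the backlog as added latency.
trace, drops = simulate_queue([10] * 5 + [100] + [10] * 20, 10, 200)
```

This is the "hidden mode" failure: throughput counters stay flat, drops stay zero, and only the occupancy trace (and p99/p999) reveal that backpressure became the steady state.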

Three common performance cliffs (symptom → mechanism → evidence → direction)

Cliff: Table miss
  • Typical symptom: throughput collapses when rule count grows; drops spike on specific flows.
  • Mechanism: a miss triggers the slow/default path, extra lookups, or recirculation.
  • Evidence to check first: per-table miss counters, fallback path hits, recirculation count.
  • Engineering direction: improve key design, split tables, reduce the miss penalty, add staging with early classification.
Cliff: Action inflation
  • Typical symptom: line rate is reached for simple rules but falls sharply with "richer" policies.
  • Mechanism: too many action stages or memory touches per packet exceed pipeline/memory budgets.
  • Evidence to check first: action execution counts, per-stage utilization, memory pressure indicators.
  • Engineering direction: shorten action chains, precompute metadata, move expensive checks to earlier filtering.
Cliff: Queue backpressure
  • Typical symptom: average latency OK but p99/p999 spikes; intermittent drops during bursts.
  • Mechanism: microbursts fill buffers; backpressure propagates and stalls ingress.
  • Evidence to check first: queue depth distribution, drop reason, tail latency correlation.
  • Engineering direction: bound queues, shape burstiness, isolate queues per tenant/priority, tune scheduling policy.
Figure F3 — Data-plane cross-section: pipeline blocks and where performance cliffs form
Rule complexity increases per-packet work (lookups, actions, memory touches) until a cliff is hit. Keeping table/action/queue evidence visible avoids “throughput is fine” false confidence while tails and drops grow.

H2-4 · Host Interface & Virtualization: PCIe, DMA, SR-IOV, virtio, IOMMU

Key idea: Many “SmartNIC is powerful but performance is weird” deployments are limited by the host↔card interface rather than datapath logic: PCIe transaction overhead, DMA mapping behavior, IOMMU translation costs, queue/interrupt models, and NUMA affinity.

Host↔card critical chain (what sets the ceiling and the tail)

The interface should be treated as three coupled paths: data path (per packet / per queue), control path (driver, policy, firmware lifecycle), and isolation path (multi-tenant VF boundaries and IOMMU rules). If any path lacks measurable evidence or safe rollback, field incidents become non-reproducible.

  • PCIe: bandwidth is necessary, but packet-rate overhead often dominates.
  • DMA: descriptor frequency · doorbells · batching controls Mpps and CPU%.
  • IOMMU: translation cost and TLB behavior influence tail latency and jitter.
  • Queue model: ring sizes · MSI-X vectors · interrupt vs polling shapes tails.
  • NUMA affinity: CPU/memory placement mismatches create jitter and inconsistent throughput.
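Why batching dominates the Mpps ceiling can be sketched with a fixed-cost amortization model. The nanosecond figures below are illustrative assumptions, not measurements of any real NIC:

```python
def effective_mpps(t_per_pkt_ns: float, t_fixed_ns: float,
                   batch: int) -> float:
    """Amortize a fixed per-notification cost (doorbell write,
    interrupt, descriptor-fetch round trip) over `batch` packets."""
    return 1e3 / (t_per_pkt_ns + t_fixed_ns / batch)

# e.g. 5 ns of real per-packet work plus a 100 ns doorbell/IRQ cost:
one_by_one = effective_mpps(5.0, 100.0, batch=1)
batched = effective_mpps(5.0, 100.0, batch=32)
```

The Gbps number barely moves with large frames, but at 64B the fixed cost is paid per packet, which is why "Gbps looks fine, Mpps is low" points at the notification path before it points at the pipeline.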

SR-IOV vs virtio vs user-space bypass (choose by operational cost)

  • SR-IOV: best raw performance and isolation for VFs, but can fragment observability (per-VF counters, drop reasons, and policy attribution).
  • virtio: simpler lifecycle and portability, but the path can be longer; packet-rate bottlenecks may appear earlier.
  • User-space bypass (e.g., DPDK class): pushes Mpps higher but increases lifecycle risk (version matrices, hugepages, strict affinity rules).

Symptoms → root-cause direction (start with evidence, not guesses)

Symptom: Gbps looks fine, but Mpps (64B) is low
  • Common cause clusters: per-packet overhead (doorbells/descriptors), interrupt/polling mismatch, suboptimal batching.
  • Evidence to check first: CPU% + softirq/context switches, queue utilization, ring full/empty counters.
  • First engineering direction: increase batching/aggregation, adjust interrupt moderation or polling, verify affinity and ring sizing.
Symptom: Tail latency (p99/p999) spikes
  • Common cause clusters: queue backpressure, IOMMU translation jitter, control-plane changes interfering with the datapath.
  • Evidence to check first: queue depth distribution, drop reason, latency correlation with updates/telemetry.
  • First engineering direction: bound queues, isolate noisy tenants, schedule safe rollout windows, keep versioned config snapshots.
Symptom: VF jitter/drops under SR-IOV
  • Common cause clusters: resource imbalance (queues/vectors), unclear isolation boundary, driver/firmware mismatch.
  • Evidence to check first: per-VF drops/queues, reset counters, driver + firmware versions, VF allocation map.
  • First engineering direction: define per-VF quotas, consolidate evidence collection, validate the version matrix and rollback plan.
Symptom: Behavior changes after a driver upgrade
  • Common cause clusters: default queue/interrupt settings changed, mapping policy changed, firmware mismatch.
  • Evidence to check first: before/after config snapshot, counters delta, rollback reproducibility.
  • First engineering direction: lock baseline settings, roll out with staged canaries, require rollback verification.

The fastest path to stability is to treat host↔card as an evidence-driven system: capture versions, counters, and configuration snapshots so performance deltas can be reproduced and safely rolled back.

Figure F4 — Host↔card interface: data/control/isolation paths and jitter sources
Interface performance is determined by per-packet overhead (descriptors/doorbells), translation behavior (IOMMU), queue/interrupt models, and NUMA placement. Stable operation requires versioned configuration snapshots and rollback verification.

H2-5 · Queues, Buffers & Memory System: DDR/HBM, Descriptor Rings, Congestion & Microbursts

Featured answer: Microbursts “break” SmartNIC/DPU deployments because instantaneous arrival rate exceeds the drain rate of queues + memory paths. The first visible failure is often tail latency, followed by drops or throughput cliffs. The true limit is frequently memory bandwidth/latency + backpressure, not compute.

Three memory-access classes (treat them as separate bottlenecks)

A data plane consumes memory resources in three different ways. Diagnosing the wrong domain leads to “tuning forever” without progress. The fastest approach is to isolate which domain saturates first under the workload shape (small packets, rule complexity, or burstiness).

1) Packet buffers (payload): store frames during bursts; queue depth drives tail latency and drops.
2) Descriptor rings (metadata): per-packet descriptors, doorbells, and DMA reads/writes shape Mpps and jitter.
3) Flow tables (lookups): TCAM/SRAM/DDR/HBM lookups; misses and recirculation create performance cliffs.

Why microbursts create backpressure (and why averages mislead)

  • Burst growth: a short burst can fill queues faster than they drain, even when average throughput looks safe.
  • Backpressure propagation: once egress queues are persistently non-empty, upstream stages stall and tail latency grows rapidly.
  • Hidden mode: throughput might still look “OK” while p99/p999 and drop reasons quietly worsen.

Practical sizing method (steps, not formulas)

Use this as a repeatable estimation workflow. It keeps the “buffer vs metadata vs table” decision evidence-based.

  • Step A — Define the traffic shape: target packet rate (Mpps) and typical packet sizes (e.g., 64B vs mixed).
  • Step B — Count per-packet memory work: descriptor reads/writes per packet and how many table lookups/actions are executed.
  • Step C — Define the burst window: how long peak bursts last (microseconds to milliseconds) and which queues see them.
  • Step D — Map to resources: buffer depth must absorb the burst window; DDR/HBM bandwidth must cover per-packet metadata + lookup work at target Mpps.
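Steps A–D reduce to two quick estimates. The numbers below are illustrative; real sizing must add per-queue partitioning and DDR efficiency factors on top:

```python
def burst_buffer_bytes(arrival_mpps: float, drain_mpps: float,
                       burst_us: float, pkt_bytes: int) -> float:
    """Steps C/D: buffer depth that absorbs a burst window where
    the arrival rate exceeds the drain rate."""
    excess_pps = (arrival_mpps - drain_mpps) * 1e6
    return excess_pps * (burst_us * 1e-6) * pkt_bytes

def metadata_bw_gbps(mpps: float, desc_bytes: int,
                     touches_per_pkt: int) -> float:
    """Steps B/D: DDR/HBM bandwidth consumed by descriptor traffic
    alone at the target packet rate."""
    return mpps * 1e6 * desc_bytes * touches_per_pkt * 8 / 1e9

# 150 Mpps arriving vs 100 Mpps draining for a 50 µs burst of 64B frames,
# plus the metadata cost of sustaining 100 Mpps with 16B descriptors:
buf = burst_buffer_bytes(150, 100, 50, 64)   # bytes needed for the burst
meta = metadata_bw_gbps(100, 16, 2)          # Gbps of pure metadata traffic
```

Note that the metadata estimate alone can consume tens of Gbps of memory bandwidth before a single payload byte is stored, which is why "Gbps fine, Mpps low" so often maps to the descriptor path.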

Symptoms → likely domain → first evidence

Symptom: average latency OK, but p99/p999 spikes
  • Most likely saturated domain: packet buffers / queueing.
  • First evidence to check: queue depth distribution, drop reason, tail correlation with bursts.
  • First engineering direction: bound queues, shape bursts, isolate priority/tenant queues, verify backpressure boundaries.
Symptom: Gbps fine, but small-packet Mpps is low
  • Most likely saturated domain: descriptor rings / metadata path.
  • First evidence to check: ring full/empty counters, doorbell rate, DMA completion jitter.
  • First engineering direction: increase batching, tune ring sizes, reduce per-packet metadata touches, verify affinity.
Symptom: rules get complex and throughput drops sharply
  • Most likely saturated domain: flow tables / lookup bandwidth.
  • First evidence to check: table hit/miss, lookup counts, recirculation, update pressure.
  • First engineering direction: split tables, reduce lookup depth, improve key design, reduce the miss penalty.
Symptom: some queues starve or oscillate under load
  • Most likely saturated domain: queueing + shared memory arbitration.
  • First evidence to check: per-queue occupancy, scheduling counters, fairness/priority indicators.
  • First engineering direction: rebalance queue weights, isolate tenants, cap burst per class, validate scheduler policy.
Figure F5 — Microburst path: queues, three memory domains, and the first evidence points
Microbursts first inflate queue depth (tail latency), then propagate backpressure and trigger drops or cliffs. Separate payload buffering, metadata rings, and lookup tables to find the true limiting domain.

H2-6 · SerDes/PHY/Retimer: PAM4 Link Budget, Training, FEC & Eye Margin

Featured answer: PAM4 links can “come up” yet still fail in the field because the remaining margin is small and highly sensitive to channel loss, temperature, and equalization settings. The fastest bring-up sequence is physical margin (BER/eye) → training/EQ state → FEC counters → performance validation.

Keep scope tight: external ports and internal PCIe/SerDes

This section focuses on SerDes links that a SmartNIC/DPU directly owns: the external network port SerDes and the internal high-speed SerDes (e.g., PCIe). Many intermittent flaps are rooted in margin erosion at the physical layer, long before datapath logic or software is the culprit.

Channel budget items (what quietly eats margin)

  • Insertion loss: PCB traces, connectors, and cables add frequency-dependent loss that reduces eye opening.
  • Reflections: impedance discontinuities (vias/pads/connectors) create echo and worsen PAM4 decision thresholds.
  • Crosstalk + noise coupling: adjacent lanes and power noise reduce effective SNR and increase BER sensitivity.
  • Environment drift: temperature/voltage changes can turn “barely passing” into “intermittent.”

Retimer vs redriver (practical boundary)

A redriver boosts and equalizes the signal but does not fully re-time it; jitter accumulation remains a limiting factor. A retimer includes clock/data recovery (CDR) and can restore eye quality more aggressively, but it introduces training/compatibility dependencies and makes visibility into training/FEC counters essential.

Training, equalization, and FEC (roles and what to measure)

Training state: negotiation and convergence to stable Tx/Rx settings.
Tx EQ / Rx EQ: FIR/CTLE/DFE choices that recover eye margin.
FEC counters: corrected vs uncorrected errors; rising counters indicate margin debt.
BER / eye margin: physical health; correlate with temperature and rate changes.

Bring-up checklist (recommended order)

  • 1) Physical layer first: verify BER/eye margin and lane-to-lane consistency; check temperature sensitivity.
  • 2) Training/EQ next: confirm training converges reliably; validate Tx EQ and Rx EQ are not at extreme settings.
  • 3) FEC evidence: monitor corrected/uncorrected error counters; spikes often precede flaps and rate downshifts.
  • 4) Performance last: validate negotiated rate, stability under load, and error-free sustain at target throughput.
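Step 3 of the checklist can be automated as a simple counter-delta watch. The warning threshold below is an arbitrary placeholder and must be calibrated per link type and polling interval:

```python
def fec_state(corrected_delta: int, uncorrected_delta: int,
              corrected_warn: int = 1000) -> str:
    """Classify link margin from FEC counter deltas over one poll
    interval (threshold is an illustrative assumption)."""
    if uncorrected_delta > 0:
        return "fail"         # uncorrectable symbols: data already lost
    if corrected_delta > corrected_warn:
        return "margin-debt"  # FEC still hides errors, but margin erodes
    return "healthy"
```

The key operational idea is that a link reporting zero uncorrectable errors is not automatically healthy: a rising corrected-error rate is the early warning that precedes flaps and rate downshifts.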
Figure F6 — PAM4 SerDes channel cross-section: retimer placement and four measurable points
When a PAM4 link is “up but unstable,” prioritize physical evidence (BER/eye) and training/EQ convergence, then use FEC counters as early warning before validating throughput.

H2-7 · Security & Isolation: Root-of-Trust, Firmware Chain, Tenant Key Domains & Attestation

Featured answer: Moving the data plane onto a SmartNIC/DPU extends the trust boundary onto the card. A robust boundary is built from hardware Root-of-Trust + secure/measured boot + controlled firmware updates with rollback. Multi-tenant safety depends on per-tenant/VF domains (keys, policies, telemetry) and on producing auditable evidence (attestation, logs, counters, and configuration snapshots).

Trust chain on the card (what must be anchored)

A SmartNIC/DPU should boot into an identity that can be verified, not just an image that happens to start. Two complementary mechanisms define the boot boundary:

  • Secure boot: prevents unsigned or unauthorized images from running on the card.
  • Measured boot: records what actually started (measurements) so the running state can be proven later.

Firmware lifecycle: update and rollback that can be audited

Firmware changes on a programmable dataplane must be treated as operational change control. The minimum safe posture is a rollback-capable update path and evidence that ties each running state to a version, a signature status, and a policy snapshot.

Rollback-ready updates: A/B slots (or equivalent) to recover from failures fast.
Version pinning: prevent accidental drift across hosts/racks/tenants.
Canary rollout: small blast radius for new dataplane firmware.
Evidence logging: version, build ID, signature status, config snapshot, timestamp.
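A rollback-ready update path is essentially a two-slot state machine. The sketch below is a hypothetical controller model, not any vendor's tooling:

```python
class FirmwareSlots:
    """A/B-slot model: stage to the inactive slot, switch only after a
    health check, keep the previous slot intact for rollback, and log
    every transition as audit evidence."""

    def __init__(self, initial: str = "v1.0"):
        self.slots = {"A": initial, "B": None}
        self.active = "A"
        self.log = []  # evidence trail: (event, slot, version)

    def stage(self, version: str) -> str:
        """Write the new image to the inactive slot; active is untouched."""
        inactive = "B" if self.active == "A" else "A"
        self.slots[inactive] = version
        self.log.append(("stage", inactive, version))
        return inactive

    def commit(self, slot: str, healthy: bool) -> str:
        """Switch only if the canary health check passed."""
        if healthy:
            self.active = slot
        self.log.append(("commit" if healthy else "abort", slot,
                         self.slots[slot]))
        return self.active
```

Because the old slot is never overwritten during staging, a failed canary leaves the device exactly where it was, with the abort recorded in the evidence log.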

Attestation: proving what the card is running

Attestation provides a verifiable “evidence bundle” that ties the running firmware and policy set to measurable identities. It is the operational answer to “which firmware is actually running right now,” especially when multiple tenants share the same physical accelerator.

Domain isolation for multi-tenant / multi-VF

Isolation is most reliable when it is expressed as separate domains with explicit boundaries and evidence for each domain. The practical model is three independent domains per tenant or per VF:

  • Key domain — per-tenant keys & lifecycle: keys should not be shared across VFs/tenants; rotation, revocation, and usage should be audit-visible.
  • Policy domain — per-tenant rules & limits: ACLs/flows/limits must be isolated so one tenant cannot modify or infer another tenant’s policy set.
  • Telemetry domain — per-tenant counters & traces: counters and error visibility must remain tenant-scoped to prevent blame ambiguity and cross-tenant leakage.

Operational evidence checklist (what must exist in logs/counters)

  • Identity: firmware version, build ID, signature verification status, and policy set version.
  • Boot evidence: secure/measured boot results and any boot-policy violations.
  • Change control: update/rollback events with timestamps and outcome codes.
  • Attestation: success/failure counters and last-known evidence hash.
  • Tenant separation: per-tenant/VF configuration snapshot hashes and policy change records.
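The "configuration snapshot hashes" in the checklist can be produced deterministically, so any policy drift shows up as a hash change in the audit log. A minimal sketch, assuming per-tenant config is representable as JSON:

```python
import hashlib
import json

def snapshot_hash(config: dict) -> str:
    """Canonicalize the per-tenant config (sorted keys, fixed
    separators) so the same policy always yields the same hash,
    regardless of dict insertion order."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Logging `(tenant, policy_version, snapshot_hash, timestamp)` on every change makes "which policy was actually active during the incident" answerable from evidence instead of memory.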
Figure F7 — Trust chain, attestation evidence, and per-tenant isolation domains
Secure/measured boot anchors the card identity; attestation exports proof; tenant/VF domains isolate keys, policies, and telemetry with logs, counters, and snapshots as operational evidence.

H2-8 · Power, Thermal & Monitoring: Rails, PMBus Telemetry, Power Capping & Observable Derating

Featured answer: Sustained-load slowdowns that recover only after a reboot are frequently caused by thermal hotspots, rail droop / VRM current limiting, or power-cap/derating states that persist. The fastest path to stability is making the derating loop observable: temperatures, rail minima, current limits, and throttle counters aligned in time.

Partition the card into power domains (map symptom → domain → evidence)

SmartNIC/DPU cards behave like small systems with multiple rails and thermal zones. The most useful layout is domain-based: each domain has its own failure modes, and each mode has a first evidence point that should be logged.

  • Core domain: compute hotspots · power-cap sensitivity.
  • SerDes domain: margin erosion · link errors with heat/noise.
  • DDR domain: bandwidth-driven power · memory thermal rise.
  • PCIe domain: link stability · downshift/replay under stress.
  • Accelerator domain: peak current · localized throttling.

Telemetry chain (PMBus / board controller → host tools)

A stable system needs a complete telemetry chain: sensors and VRMs expose measurements (temperature, voltage minima, current limits), a board controller collects them over PMBus (or equivalent), and the host can correlate them with throttle events. The objective is correlation, not just visibility.

  • Thermal — hotspot temps: core, VRM, memory, and SerDes areas; track max and time-to-throttle.
  • Power rails — rail droop minima: voltage-minimum events and persistent sag indicate margin debt.
  • VRM protection — current limit / OCP: limit events often appear as “mysterious” slowdowns.
  • Derating — throttle counters: reason codes + duration; align with temps and rails.

Slowdown symptoms → evidence alignment → likely direction

Symptom: throughput drops stepwise after minutes
  • Most common derating trigger: thermal throttling.
  • First evidence to align in time: hotspot temperature vs throttle counter increments.
  • First engineering direction: increase thermal margin, reduce peak power, ensure airflow/heat-sink contact quality.
Symptom: temperature looks “fine” but performance still collapses
  • Most common derating trigger: VRM current limiting / rail droop.
  • First evidence to align in time: rail minima + VRM limit counters vs the performance change.
  • First engineering direction: rebalance power domains, reduce transient load, validate rail headroom and decoupling.
Symptom: intermittent link flaps / rate downshift under load
  • Most common derating trigger: SerDes margin erosion (heat/noise).
  • First evidence to align in time: link error/FEC counters vs temperature and rail-noise indicators.
  • First engineering direction: improve margin (EQ, channel loss, power integrity), verify temperature coupling to SerDes.
Symptom: only a reboot restores full speed
  • Most common derating trigger: persistent power-cap/derating state.
  • First evidence to align in time: throttle reason codes + “cap active” duration since boot.
  • First engineering direction: make the derating state observable and resettable; enforce policy-based caps with logging.

Stability “triple”: thermal margin, power budget, observable derating

  • Thermal margin: hotspots stay below thresholds with enough headroom at worst-case ambient.
  • Power budget: rail headroom covers sustained and transient load without entering protection regions.
  • Observable derating: throttle reason codes and counters exist and correlate with sensors in time.
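The "observable derating" leg means throttle events must be joinable with sensor samples in time. A minimal nearest-sample join sketch; the record field names (`t`, `reason`, `temp_c`, `vmin`) are assumptions about how telemetry might be normalized:

```python
def align_throttle_events(samples, events, window_s: float = 5.0):
    """For each throttle event, attach the sensor sample nearest in
    time, so reason codes can be read next to temperatures and rail
    minima. Events with no sample within `window_s` are skipped."""
    aligned = []
    for ev in events:
        nearest = min(samples, key=lambda s: abs(s["t"] - ev["t"]))
        if abs(nearest["t"] - ev["t"]) <= window_s:
            aligned.append({"reason": ev["reason"],
                            "temp_c": nearest["temp_c"],
                            "vmin": nearest["vmin"]})
    return aligned
```

Events that cannot be matched within the window are themselves a finding: they indicate the telemetry polling rate is too slow to attribute derating to a cause.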
Figure F8 — Card power domains, PMBus telemetry chain, and throttle/derating observability loop
Partition rails by domain, collect sensor/VRM telemetry over PMBus, and correlate throttle reason codes with hotspot temperatures and rail minima to stabilize sustained performance.
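Telemetry words read over PMBus from monitors in this class (e.g., an INA233-type rail monitor) commonly use the PMBus Linear-11 encoding. A minimal decoder sketch; bus access and per-device register maps are omitted, and which commands return Linear-11 is device-specific:

```python
def decode_linear11(word: int) -> float:
    """Decode a PMBus Linear-11 word: 5-bit signed exponent, 11-bit signed mantissa."""
    exponent = (word >> 11) & 0x1F
    mantissa = word & 0x7FF
    # Sign-extend the two's-complement fields.
    if exponent > 0x0F:
        exponent -= 0x20
    if mantissa > 0x3FF:
        mantissa -= 0x800
    return mantissa * (2.0 ** exponent)

# Example: mantissa 100 with exponent -2 encodes 25.0 (volts, amps, or watts,
# depending on which PMBus command returned the word).
word = ((0x20 - 2) << 11) | 100   # exponent -2 as 5-bit two's complement
# decode_linear11(word) -> 25.0
```

Decoding on the board controller (rather than logging raw words) is what makes rail minima and throttle windows directly comparable against temperature and performance KPIs.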

H2-9 · Performance Engineering: Gbps vs Mpps, Tail Latency, Zero-Copy & Queue Shaping

Featured answer: A “100G” label is typically a Gbps story (large packets), while 64B traffic is a Mpps story dominated by per-packet fixed costs (descriptors, DMA, queues, lookups). Real user pain is usually tail latency (p99/p999) driven by queueing and contention, with link-side FEC acting as an amplifier. Tuning should follow a strict priority: queues/polling → DMA/mapping → pipeline/rules → link counters.
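The Gbps-vs-Mpps gap is fixed arithmetic: every Ethernet frame also occupies 20 bytes of wire overhead (preamble/SFD + inter-frame gap), so 64B frames at 100G require roughly 148.8 Mpps. A quick sketch:

```python
def line_rate_mpps(link_gbps: float, frame_bytes: int, overhead_bytes: int = 20) -> float:
    """Theoretical packet rate at line rate.

    overhead_bytes covers preamble/SFD (8B) + minimum inter-frame gap (12B).
    """
    bits_per_frame = (frame_bytes + overhead_bytes) * 8
    return link_gbps * 1e9 / bits_per_frame / 1e6

# 100G with 64B frames: ~148.8 Mpps; with 1500B frames: ~8.2 Mpps.
```

The roughly 18x spread between those two numbers is why a card can honestly advertise 100G and still collapse on small packets.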

The performance “tri-metrics” (what must be measured together)

  • Throughput: Gbps (large-packet capacity). Often looks great with big frames, but can hide small-packet collapse.
  • Packet rate: Mpps (per-packet overhead). Exposes descriptor/DMA/queue costs and rule processing depth.
  • Experience: p99/p999 tail latency. Microbursts and contention pull the tail long before averages move.

Where tail latency comes from (bounded to SmartNIC/DPU)

Tail latency rarely comes from a single “slow component.” It is typically a chain effect: queue depth grows, memory and metadata paths contend, and per-packet work spikes when rules get deeper or misses occur.

  • Queueing (depth & scheduling). Symptom: p99 jumps during bursts despite “steady Gbps”. Evidence: queue occupancy histogram + time-aligned p99. Direction: shaping, per-tenant budgets, reduced burst amplification.
  • Memory contention (DDR/PCIe). Symptom: tail grows with concurrency; Mpps droops. Evidence: bandwidth/latency indicators + ring completion jitter. Direction: reduce metadata touches, rebalance domains, avoid hot contention points.
  • Cache/table miss (lookup depth). Symptom: rules get complex → sudden drop / recirculation. Evidence: hit/miss, recirculation counters, lookup depth. Direction: simplify match-action depth, reduce miss penalty, control table growth.
  • Descriptor path (rings/doorbells). Symptom: pps collapse, “hungry” rings, intermittent stalls. Evidence: ring full/empty, doorbell rate, DMA completion spread. Direction: ring sizing, batch strategy, polling/interrupt balance, mapping costs.
  • Link-side FEC (as an amplifier). Symptom: latency tail worsens under errors; rate changes. Evidence: FEC corrected/uncorrected + BER counters. Direction: use counters to confirm margin issues; avoid chasing ghosts in software first.

Zero-copy boundary: why it helps and when it backfires

Zero-copy can reduce CPU work and lower latency tail by removing extra copies and cache pollution, but it may introduce mapping and pinning costs. The decision boundary is whether per-packet fixed overhead is the limiting factor.

  • Helps when: small packets, high queue concurrency, high Mpps, and measurable descriptor/DMA overhead.
  • Backfires when: workloads are already large-packet dominated, or mapping/operational complexity becomes the bottleneck.
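One way to make that decision boundary concrete: convert the target packet rate into a per-packet time/cycle budget and compare it against the measured cost of copies and mapping. A sketch with illustrative numbers, not vendor figures:

```python
def per_packet_budget_ns(target_mpps: float) -> float:
    """Time budget per packet, in nanoseconds, to sustain the target rate."""
    return 1e3 / target_mpps

def cycles_per_packet(target_mpps: float, core_ghz: float) -> float:
    """Cycle budget per packet on one core at the given clock."""
    return per_packet_budget_ns(target_mpps) * core_ghz

# At 148.8 Mpps (64B at 100G) a single 3 GHz core has roughly 20 cycles per
# packet -- a single extra copy or mapping operation blows the budget, which
# is exactly the regime where zero-copy pays off. At 8 Mpps (large packets)
# the budget is ~375 cycles and copies rarely dominate.
```

If your measured per-packet fixed cost fits comfortably inside the budget, zero-copy's mapping and pinning complexity is probably not worth it.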

Queue shaping: reducing microburst damage (card/interface scope)

Shaping is a tail-latency tool: it converts uncontrolled bursts into controlled queueing. It is most effective when it is observable (queue depth distribution improves while p99/p999 improves).

  • Burst cap: limit burstiness.
  • Priority scheduling: protect critical traffic.
  • Per-tenant queue budget: avoid noisy neighbors.
  • Queue depth telemetry: prove the win.
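A burst cap is typically implemented as a token bucket, where the bucket depth bounds the largest burst the downstream queue can see. A minimal sketch, assuming packet-count (rather than byte) accounting for brevity:

```python
class TokenBucket:
    """Burst cap: `rate` tokens/s refill; `depth` = max burst admitted at once."""

    def __init__(self, rate: float, depth: float):
        self.rate, self.depth = rate, depth
        self.tokens, self.last = depth, 0.0

    def admit(self, now: float) -> bool:
        # Refill proportionally to elapsed time, clamped to the bucket depth.
        self.tokens = min(self.depth, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False   # caller drops, or queues the packet for later

# A 10-packet bucket at 1000 pkt/s admits a 10-packet burst, then paces
# admissions at the refill rate.
```

The "prove the win" step is then checking that queue-depth distribution tightens and p99/p999 drops after the cap is applied, not just that the cap is configured.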

Tuning priority (engineering order that avoids blind guessing)

  1. Queues / interrupts / polling: stabilize jitter sources first.
  2. DMA / mapping: reduce descriptor-path overhead and mapping costs.
  3. Pipeline / rules: control lookup depth and miss penalties.
  4. Link counters: use FEC/BER to confirm margin issues, not to guess.
Figure F9 — Tri-metrics view and the main bottleneck chain for small packets and tail latency
Small-packet performance is constrained by per-packet overhead (queues, lookups, rings/DMA). Tail latency is diagnosed by time-aligning p99/p999 with queue depth, ring completion jitter, hit/miss behavior, and (only as confirmation) link-side FEC counters.

H2-10 · Field Debug Playbook: From Symptoms to an Evidence Chain (Drops, Reorder, Timeouts, Drift)

Playbook promise: Treat every incident as an evidence problem, not a guessing contest. For each symptom, capture hardware counters, queue depth, ring/descriptor signals, FEC/BER counters, thermal/rails/throttle, and firmware version + config snapshot with timestamps, then apply a fixed decision order to isolate the root-cause bucket.

Evidence sources (mandatory, time-aligned)

HW counters (per-port / per-queue / per-VF) · Queue depth (histogram) · Ring/descriptor (full/empty/doorbell) · Table hit/miss (if exposed) · FEC/BER counters · Thermal/rails + throttle reasons · Firmware & config snapshot hash

Fixed structure per incident (Symptom → Likely causes → Tests → Fix)

Each item below is intentionally short and operational. Execute tests in order, record before/after snapshots, and change one variable at a time.
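The snapshot discipline above can be sketched as a small helper; the counter names below are placeholders for whatever your card actually exposes:

```python
import hashlib
import json
import time

def incident_snapshot(counters: dict, fw_version: str, config: dict) -> dict:
    """Bundle time-aligned evidence: counters + firmware identity + config hash."""
    return {
        "ts": time.time(),                     # one clock for all evidence streams
        "fw_version": fw_version,
        "config_hash": hashlib.sha256(
            json.dumps(config, sort_keys=True).encode()).hexdigest(),
        "counters": dict(counters),            # copy, so later reads don't mutate
    }

def counter_delta(before: dict, after: dict) -> dict:
    """Diff two snapshots around the failure window; only moved counters matter."""
    return {k: after["counters"][k] - before["counters"][k]
            for k in after["counters"] if k in before["counters"]}
```

Hashing the sorted config is what makes "freeze identity" checkable: two incidents with matching hashes and firmware versions are comparable, otherwise you are debugging two different platforms.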

1) VF intermittent reset / link flap
Symptom
VF resets or link flaps intermittently, often under sustained load or temperature rise.
Likely causes
1) SerDes margin erosion (heat/noise) 2) Persistent derating/power clamp 3) Firmware/driver mismatch after upgrade
Tests
Align time: link state changes ↔ FEC/BER counters ↔ SerDes/board temps.
Check rails: rail minima / VRM current-limit counters around the flap window.
Freeze identity: record firmware version + config snapshot hash before reproducing.
Fix
Reduce thermal stress and confirm margin; if power clamps exist, make them observable; pin/rollback firmware if the issue correlates with a change window.
2) Throughput looks stable but tail latency explodes (p99/p999)
Symptom
Gbps remains high, yet p99/p999 spikes during bursts or high concurrency.
Likely causes
1) Queue depth growth 2) Memory contention / metadata path jitter 3) Shaping absent or mis-scoped
Tests
Queue histogram: capture depth distribution aligned to p99 spikes.
Ring jitter: check DMA completion spread and ring starvation indicators.
Rule pressure: verify hit/miss or recirculation changes during the spike window.
Fix
Apply shaping/burst control, simplify contention points, and validate improvement by queue distribution + p99 together (not averages).
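"Queue distribution + p99 together" implies computing tail quantiles from raw samples, never from averages. A dependency-free nearest-rank sketch over synthetic latencies:

```python
import math

def percentile(samples: list, p: float) -> float:
    """Nearest-rank percentile (p in (0, 100]) over raw latency samples."""
    s = sorted(samples)
    rank = math.ceil(p / 100.0 * len(s))
    return s[max(rank - 1, 0)]

# Synthetic 1000-sample window: the mean stays near 6 us, but the tail is long.
latencies_us = [5] * 989 + [50] * 9 + [400] * 2
# percentile(latencies_us, 50)   -> 5   (median looks healthy)
# percentile(latencies_us, 99)   -> 50  (p99 already 10x the median)
# percentile(latencies_us, 99.9) -> 400 (p999 exposes the microburst damage)
```

This is also why validation must re-measure p99/p999 after a fix: a shaping change can leave the average untouched while collapsing the tail, or vice versa.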
3) pps drop (queue congestion / descriptor starvation)
Symptom
Small-packet rate collapses; large-packet Gbps may look acceptable.
Likely causes
1) Descriptor ring hungry/full oscillation 2) DMA/mapping overhead 3) Queue scheduling overhead
Tests
Ring counters: full/empty events, doorbell behavior, batch size sensitivity.
Mapping cost: compare modes where mapping/pinning changes (observe pps and jitter).
Queue proof: confirm queue depth does not silently grow while pps falls.
Fix
Tune ring sizing and batching, stabilize polling/interrupt balance, and reduce descriptor-path work per packet.
4) Crypto offload correctness (handshake / replay anomalies)
Symptom
Connections fail intermittently, replay-like behavior appears, or correctness deviates only when offload is enabled.
Likely causes
1) Policy domain mismatch (rules vs key domain) 2) Offload boundary mismatch (state/timeout) 3) Version/config drift
Tests
A/B compare: run a controlled bypass path vs offload path and diff outcomes.
Snapshot identity: firmware version + config hash + policy set version at failure time.
Counter focus: per-tenant/VF counters for drops, rejects, and state errors (tenant-scoped).
Fix
Pin/rollback to a known-good identity, tighten policy/key domain separation, and re-validate against a bypass baseline.
5) Compression offload mismatch (boundary / block size issues)
Symptom
Output differs, size ratios look inconsistent, or errors occur only for specific payload patterns or sizes.
Likely causes
1) Block boundary assumptions differ 2) Metadata/state handling differs under burst 3) Resource contention causes timeouts
Tests
Size sweep: reproduce across block sizes and record failure thresholds.
Queue alignment: correlate errors with queue depth spikes and ring starvation.
Identity capture: firmware/config snapshot at both success and failure points.
Fix
Normalize block boundary policy, reduce contention hotspots, and enforce deterministic rollback/pinning for reproducibility.
Figure F10 — Evidence-chain decision flow: symptoms → evidence collectors → root-cause buckets → first fixes
Capture evidence first (counters, queues, rings, FEC/BER, thermal/rails, version snapshots), align in time, then map to the smallest root-cause bucket and apply the least invasive fix before larger changes.

H2-11 · BOM / IC Selection Checklist (Criteria + Part Numbers)

This section turns “SmartNIC/DPU BOM planning” into an engineering checklist: each module lists hard gates, trade-offs, required evidence, and a concrete set of orderable part numbers. The goal is to avoid “marketing-only specs” and keep debug/operability measurable.

Evidence-first counters · Version/rollback control · Thermal & power repeatability · SerDes margin + observability · DDR bandwidth realism

A) How to use this checklist (fast intake → spec lines)

  • Lock inputs first: port speed & count, target Mpps (64B-heavy or not), offload scope (crypto/comp/flow steering), power cap, operating temperature class, and required counters/telemetry for field debug.
  • Score by evidence: prefer parts that expose queue/ring/drop-reason/FEC/rail/throttle counters and support stable firmware/driver pinning + rollback.
  • Turn criteria into procurement language: each module ends with “spec-ready lines” that can be pasted into a purchase spec or supplier questionnaire.

B) Module criteria + concrete part numbers (examples)

Part numbers below are commonly used building blocks. Final selection must match lane counts, connector topology, temperature grade, availability program (some DPUs are sold via platform programs), and compliance requirements.

Module · Hard gates (must-have) · Evidence to require · Example part numbers
Core compute / dataplane
SoC
FPGA
DPU-class
Programmability model (pipeline/firmware), table/queue resources, host interface maturity (PCIe), stable SDK/driver lifecycle. Table hit/miss, drop reason, queue depth, ring starvation, firmware/driver version pinning + rollback proof.
NXP Layerscape: LX2160A, LS1046A
AMD/Xilinx FPGA: XCVU9P-2FLGA2104I, XCKU15P-2FFVE1517I
Intel FPGA: 10AX115N2F45I1SG (Arria 10 GX)
Platform DPUs (often via program): NVIDIA BlueField-2 / BlueField-3 (platform silicon family)
PCIe fabric (optional)
Switch
NTB
Lane/port count, error containment, hot-plug support (if needed), partitioning/virtual switch capability. AER/error counters, port isolation logs, fabric management telemetry, watchdog/reset reason.
Broadcom PCIe Gen4 switch: PEX88096
Microchip Switchtec Gen5: PM50100, PM50084, PM50068, PM50052, PM50036, PM50028
Microchip Switchtec Gen3: PFX/PFX-I family (Switchtec PFX)
PCIe signal integrity
Redriver
Retimer
Supports target PCIe generation, channel count matches lanes, link-training friendly behavior, low-added jitter. Equalization settings readback, link training stability notes, margin/eye hooks (as supported), failure reason logs.
TI PCIe redrivers: DS80PCI402 (x4), DS80PCI810 (x8)
TI PCIe Gen4 redriver: DS160PR410 (x4)
Ethernet/SerDes conditioning
NRZ
PAM4
Rate coverage (10/25/50G classes), retimer vs redriver boundary (CDR needed or not), channel loss budget fit. BER estimate, FEC corrected/uncorrected counters, training status, Rx/Tx EQ readback.
TI retimers: DS125DF410 (9.8–12.5Gbps, x4), DS280DF810 (20.2–28.4Gbps, x8)
DDR memory
Bandwidth
ECC
Bandwidth under concurrency (buffer + descriptors + tables), ECC support, temperature behavior (refresh/power). ECC error counters visibility, throttling events, memory controller perf counters (as available).
Micron DDR4 SDRAM: MT40A512M16TB-062E-R (example orderable DDR4 component)
Telemetry (PMBus/I²C)
Current
Power
Rail voltage/current/power measurement coverage, addressability, accuracy/averaging control. Per-rail min/max, alert logs, time-stamped event capture (if board controller supports it).
TI PMBus monitor: INA233 (example ordering: INA233AIDGSR)
VRM control
Multiphase
Digital
Transient response, phase scaling, control-bus integration, predictable throttle/limit behavior. OCP/OTP counts, loop status (as available), configuration readback, brownout/throttle reason.
Renesas digital multiphase PWM: ISL68137
Renesas smart power stage (example pairing): ISL99227 (often used with Renesas controllers)
Board controller
MCU
Logging
Robust boot/update, watchdog strategy, event logging, safe GPIO policy for resets/power-good fan-in. Reset reasons, rail/thermal snapshots, firmware version hash, “last known good” config record.
NXP: LPC55S69JBD100
ST: STM32G0B1KET6, STM32H743IIT6
Root-of-trust
TPM
Measured boot
Standard compliance, interface fit (SPI/I²C), lifecycle support, provisioning flow. Attestation capability (as used), firmware measurement logs, update/rollback policy proof.
Infineon OPTIGA™ TPM: SLB-9670VQ2-0 (family), example orderable code: SLB9670VQ20FW785XTMA1
Non-volatile storage
QSPI
NOR
Endurance, dual-image support, secure update/rollback compatibility. Image A/B selection evidence, rollback event log, write-protect policy.
Winbond: W25Q256JV, W25Q512JV
Macronix: MX25U25645G, MX25U51245G
Figure F11 — BOM criteria map: module → criteria → evidence
Use this map to keep BOM decisions “evidence-driven”: each module should provide measurable counters, stable version pinning, and predictable behavior under heat/power stress—otherwise peak throughput numbers rarely survive real deployments.

C) “Marketing spec” vs “real pitfall” (questions to force clarity)

  • “100G line-rate”. Ask: what is the 64B Mpps with real rule depth and multi-queue concurrency? Require: queue depth/drops by reason, pipeline hit/miss, p99/p999 latency snapshots.
  • “Supports PCIe Gen4”. Ask: is the link stable across temperature/board variance without manual EQ babysitting? Require: training stability logs, AER counters, margin hooks (as supported), redriver EQ readback.
  • “Secure boot”. Ask: is there a measured chain + rollback + audit trail, or only a signature check? Require: firmware measurement records, version hash snapshot, rollback event log.
  • “Low power”. Ask: does power stay predictable under burst load, or does throttling silently change behavior? Require: rail min/max, OCP/OTP counts, throttle reason + count, thermal hotspot logging.
  • “High bandwidth memory”. Ask: does memory bandwidth hold under mixed traffic (buffers + descriptors + tables)? Require: ECC/error counters + perf counters, drift across temperature and refresh behavior.

D) Spec-ready lines (copy/paste into procurement docs)

  • PCIe switch (if used): must support required lane/port count and provide AER/error counters, partition/NTB diagnostics, and watchdog/reset logs.
  • PCIe redriver/retimer: must support target PCIe generation and expose EQ configuration/readback and training stability notes; no “black-box only” parts.
  • SerDes retimer: must provide BER estimate and FEC corrected/uncorrected counters; training status must be readable.
  • DDR: ECC required; ECC counters must be readable; behavior under high temperature and refresh changes must be documented.
  • VRM + telemetry: rail min/max, OCP/OTP counts, throttle reasons and counts must be logged and readable via PMBus/I²C + board controller.
  • Root-of-trust: measured boot and attestation-capable RoT/TPM preferred; firmware A/B rollback and audit snapshots must be supported.
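The spec lines above can also be enforced mechanically at intake: score each candidate part against its module's hard gates before comparing peak numbers. A sketch with illustrative gate names (tailor them to your RFQ):

```python
# Hard gates per module class -- the names here are illustrative placeholders,
# not a standard vocabulary; map them onto your own spec lines.
HARD_GATES = {
    "serdes_retimer": {"ber_estimate", "fec_counters", "training_status"},
    "ddr":            {"ecc", "ecc_counters_readable"},
    "vrm":            {"rail_min_max", "ocp_otp_counts", "throttle_reason_log"},
    "rot":            {"measured_boot", "ab_rollback", "audit_snapshot"},
}

def gate_check(module: str, advertised: set) -> tuple:
    """Return (passes, missing_evidence) for one candidate part."""
    missing = HARD_GATES[module] - advertised
    return (not missing, sorted(missing))

# A retimer that ships FEC counters but no training-status readback fails
# intake regardless of its datasheet bandwidth numbers.
```

The point is not the code but the inversion: missing evidence disqualifies a part outright, and only gate-passing parts get ranked on performance.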


H2-12 · FAQs (SmartNIC / DPU) — Answers + Evidence-First Checks

These FAQs target engineering “why” questions while staying inside this page’s scope: card-level dataplane offload, PCIe/virtualization integration, queues/memory behavior, PAM4/retimer bring-up, security/firmware chain, power/thermal telemetry, and performance (Gbps vs Mpps vs tail latency).

Boundary · Offload trade-offs · P4 pipeline limits · PCIe & SR-IOV · Queues & microburst · PAM4 & retimers · RoT & rollback · Thermal throttling · Mpps & tail latency
1) What is the practical engineering boundary between a SmartNIC and a DPU?

The boundary is not the form factor but the degree of dataplane independence. A SmartNIC typically accelerates specific functions (steering, vSwitch blocks, crypto/compression) while still depending on host resources for control and state. A DPU behaves like an independent subsystem on the card: its own cores, memory domains, and a broader programmable dataplane/control plane that can offload host networking and security primitives.

  • Use-case test: can the card enforce dataplane policy and collect counters even when host CPU is saturated?
  • Scope note: this compares card-level roles only (no UPF/MEC application placement).
Programmability · Independent resources · Host bypass depth
2) When can crypto offload make latency worse instead of better?

Crypto offload can worsen latency when it introduces extra queueing and batching on the card, or when the data path becomes DMA-heavy with small records. Typical regressions appear with small packets, short TLS records, frequent key context switches, or when crypto shares memory bandwidth with flow tables and queue managers. The result is improved throughput but worse tail latency (p99/p999).

  • Measure first: queue depth, crypto engine busy time, DMA/ring starvation counters.
  • Fix direction: reduce batch size, separate queues, and avoid unnecessary host↔card copies.
Queueing · DMA overhead · Tail latency
3) Why does throughput collapse when match/action rules get more complex?

A programmable dataplane is a bounded pipeline. As rule complexity grows, the design hits limits in table depth, table miss penalties, action expansion (more memory reads/writes), and shared bandwidth in the packet/metadata paths. Even if the line rate is unchanged, the pipeline may require recirculation, additional stages, or slower memory accesses—causing a cliff in Mpps.

  • Evidence: hit/miss counters, stage utilization, memory bandwidth/latency counters (where available).
  • Design lever: simplify actions, reduce table lookups per packet, and keep “fast path” rules shallow.
Pipeline depth · Miss penalty · Memory bandwidth
4) SR-IOV is fast—why does observability often get worse?

SR-IOV improves performance by giving VFs direct-ish access to queues, but it often splits the evidence chain. Counters become fragmented across PF/VF, host stack visibility is reduced, and some drops/rewrites happen in paths that traditional host tools cannot see. If firmware, PF driver, and VF driver versions drift, the same symptoms may map to different counter meanings—making field debug slower.

  • Evidence: PF/VF queue depth, drop-reason counters, reset reasons, version snapshots.
  • Mitigation: define “golden counters” that must remain readable at PF level.
PF/VF counter split · Tooling blind spots · Version drift
5) Is “more VFs” always better? Where do queues/descriptors bottleneck first?

More VFs increases isolation and scheduling flexibility, but bottlenecks usually appear first in metadata plumbing: descriptor rings, doorbells, interrupt moderation/polling loops, and memory bandwidth for queue state. Once many VFs compete, ring starvation and cache/memory contention raise tail latency and reduce Mpps even when aggregate Gbps looks fine.

  • Evidence: ring starvation, doorbell rate, queue occupancy distribution across VFs.
  • Rule of thumb: scale VF count only if per-VF counters remain measurable and stable under load.
Descriptor rings · Doorbells · Memory contention
6) How do microbursts “punch through” SmartNIC buffers?

Microbursts overwhelm buffers because arrival rate briefly exceeds service rate, and the card’s service rate can degrade under pressure. As queues grow, descriptor processing, memory arbitration, and scheduler work increase, which can reduce effective dequeue speed—creating a feedback loop. The visible symptoms are short spikes of drops and a sharp rise in p99/p999 latency even when average utilization seems acceptable.

  • Evidence: queue depth time series, drop-by-reason, ring starvation during burst windows.
  • Mitigation direction: isolate bursty flows, tune shaping/priority, and avoid shared hot queues.
Queue feedback · Descriptor pressure · Tail latency
7) PAM4 links come up but flap intermittently—what are the three most common root causes?

Three common causes are: (1) margin is barely positive (insertion loss/connectors/board variation), (2) training/EQ sensitivity (settings that “work once” but drift), and (3) environmental drift from temperature or power noise that shifts the eye and pushes FEC beyond its comfort zone. Intermittent flaps often correlate with warm-up, vibration, or specific cable/port combinations.

  • Evidence: FEC corrected/uncorrected counters, retraining counts, BER/eye margin indicators (as supported).
  • Debug order: PHY margin → training stability → protocol/perf.
Margin · Training drift · Thermal/power noise
8) How to choose Retimer vs Redriver—and why does a retimer often make debug harder?

A redriver boosts amplitude/EQ but keeps the same clocking domain; a retimer adds CDR and can recover a cleaner clock when channel loss is high. Debug becomes harder because a retimer introduces additional state machines (training, adaptation) and a new counter/visibility boundary. If the retimer does not expose training results and error reasons, failures look “random” across ports and temperature.

  • Choose retimer when: loss budget forces CDR/equalization beyond redriver capability.
  • Require: training status readout, error codes, and margin-related counters.
CDR · State machines · Counter visibility
9) How can the firmware on the card be proven trustworthy, rollbackable, and auditable?

Proof requires three artifacts: (1) a secure/measured boot chain that records what firmware actually ran, (2) an A/B image update path with deterministic rollback, and (3) audit snapshots that bind firmware version, configuration, and key policy to a verifiable identifier. Without these, “secure boot” may only mean signature checking, not field-grade recoverability and accountability.

  • Evidence: measured-boot logs, rollback events, version+config hash snapshots.
  • Operational rule: upgrades must be pin-able and reproducible across fleets.
Measured boot · A/B rollback · Audit snapshot
10) After running hot for a while, performance drifts—how to separate thermal throttling from queue congestion?

Distinguish by aligning two evidence streams. Thermal throttling shows throttle counters, rising hotspot temperature, and rail droop/VRM current-limit events that correlate with the throughput drop. Queue congestion shows sustained queue depth growth, ring starvation, and drop-by-reason increases without matching thermal/throttle triggers. The key is time-correlation: “perf drop timestamp” must match one stream strongly.

  • Evidence: temperature/rail min-max + throttle reason vs queue depth + ring starvation.
  • First action: capture a synchronized snapshot (telemetry + queue counters) during the drift window.
Time correlation · Throttle counters · Queue evidence
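The time-correlation rule can be made quantitative: correlate the throughput series against throttle counters and against queue depth, then attribute the drift to whichever evidence stream correlates more strongly. A dependency-free Pearson sketch over synthetic series:

```python
def pearson(x: list, y: list) -> float:
    """Pearson correlation of two equal-length numeric series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

# Synthetic drift window: throughput steps down exactly when throttle fires,
# while queue depth stays noisy but flat.
throughput = [100, 100, 100, 70, 70, 70]
throttle   = [0,   0,   0,   1,  1,  1]    # throttle-active indicator
queue      = [3,   5,   2,   4,  3,  5]    # queue depth samples
# |pearson(throughput, throttle)| ~ 1.0 while |pearson(throughput, queue)| is
# weak -> thermal/throttle bucket, not congestion.
```

Correlation is a triage signal, not proof; it tells you which evidence stream to drill into first, after which throttle reason codes or queue histograms make the call.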
11) Why can a “100G” card show very low Mpps with 64-byte packets—and what three checks come first?

100G throughput does not guarantee small-packet rate because fixed per-packet costs dominate. The first three checks are: (1) queue/interrupt/polling policy (interrupt storms or poor moderation), (2) descriptor/DMA efficiency (ring starvation, mapping overhead, extra copies), and (3) pipeline rule depth (table lookups/action expansion per packet). Link/FEC issues usually show distinct error counters.

  • Evidence: interrupt/poll counters, ring starvation, pipeline hit/miss, drop reasons.
  • Fix direction: reduce per-packet overhead before chasing raw bandwidth.
Interrupt vs polling · DMA/rings · Rule depth
12) Which selection metrics are most often misleading in marketing—and what must be demanded instead?

Common traps include “line-rate,” “PCIe Gen5,” “HBM bandwidth,” “secure boot,” and “low power.” Each needs a corresponding field-grade evidence requirement: Mpps and tail latency under real rule depth, AER/training stability across temperature, ECC/error visibility, measured boot + rollback logs, and throttle/rail telemetry with time-stamped events. If counters and rollback are missing, fleet operability cost often exceeds performance gains.

  • Demand: counters, failure-reason codes, and version+config snapshots.
  • Reject: “black-box performance” without diagnostics.
Evidence-first · Rollback control · Telemetry
Figure F12 — Fast triage map: symptom → evidence stream → likely bucket
This map keeps troubleshooting inside the card-level scope: diagnose by matching the symptom timestamp to the strongest evidence stream (queues/rings, link counters, or thermal/rail telemetry) before changing pipeline rules or blaming the network.