
SmartNIC / DPU: Programmable Data Plane & Offload Design


A SmartNIC/DPU moves the high-rate data plane from the host CPU onto a programmable card so that packet processing, crypto/compression, and queue shaping run with lower jitter and better isolation. Real outcomes depend on pipeline depth, PCIe/DMA integration, queue/DDR behavior, and measurable telemetry (counters, throttling, rollbackable firmware) rather than on headline “100G” specs.

H2-1 · Definition & Boundary: SmartNIC vs Traditional NIC vs DPU

Featured definition: A SmartNIC/DPU moves the packet data plane (parsing, match/action, queueing, and selected accelerations) from the host CPU onto a card/board with its own processing and memory, reducing host overhead and jitter—at the cost of additional firmware/driver coupling and a larger debug/observability surface.

What this section enables

  • Classify a product using three boundaries (programmability, resource independence, host bypass).
  • Predict where performance and reliability limits will come from (pipeline, memory, queues, or host interface).
  • Decide the right integration model (kernel, SR-IOV, virtio/DPDK) without creating an un-debuggable platform.

Three boundary axes (each with testable checks)

Axis 1 — Programmability: what can change without swapping hardware?
  • Level 0 (fixed offloads): checksum/TSO, basic crypto engines, limited knobs.
  • Level 1 (configurable pipeline): tables/rules are loadable; the action set is bounded.
  • Level 2 (programmable datapath): P4/SDK/microcode defines pipeline stages/actions; versioning and rollback are possible.
Field checks: Can logic change without re-image downtime? Are rule updates versioned/rollbackable? Are per-stage counters visible?

Axis 2 — Resource independence: is there a real execution + memory domain on the card?
  • Traditional NIC: limited embedded logic; the host remains the dominant execution domain.
  • SmartNIC: meaningful on-card resources (cores/accelerators + local memory) for datapath services.
  • DPU: on-card compute behaves like a mini-system (multiple cores, richer memory, stronger isolation and lifecycle control).
Field checks: Does the datapath stay stable under host CPU stress? Can health/telemetry/logging run on-card during host incidents?

Axis 3 — Host bypass degree: how much host hop and host CPU work is removed?
  • Bypass CPU cycles: crypto/compression/regex/flow steering moved off host cores.
  • Bypass host hop: vSwitch/eSwitch functions avoid kernel hot paths and context switching.
  • Bypass trust boundary (card-side isolation): per-tenant rules/keys/counters isolated on-card; measured/secure boot supports an auditable lifecycle.
Field checks: Do packets traverse the host kernel in the steady state? Are per-tenant resources and keys isolated and auditable?

Quick classification rules (If/Then)

If rule changes are limited to a few knobs, treat the device as an enhanced NIC, not a SmartNIC platform.
If datapath logic can be updated and rolled back with per-stage counters, the device qualifies as a programmable SmartNIC.
If on-card compute and memory host a lifecycle-managed control plane (updates/attestation/telemetry), the device behaves as a DPU-class platform.
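The three If/Then rules can be sketched as a small decision function. This is a hypothetical helper for illustration; the parameter names are assumptions, not any vendor's API:

```python
def classify_device(rule_scope: str,
                    per_stage_counters: bool,
                    rollback_supported: bool,
                    on_card_control_plane: bool) -> str:
    """Apply the three classification rules in order of strength.

    rule_scope: "knobs" when only a few fixed settings can change,
    "datapath" when pipeline logic itself can be updated.
    """
    # Rule 3: lifecycle-managed on-card control plane -> DPU-class
    if on_card_control_plane and rollback_supported:
        return "DPU-class"
    # Rule 2: updatable + rollbackable datapath with per-stage counters
    if rule_scope == "datapath" and rollback_supported and per_stage_counters:
        return "programmable SmartNIC"
    # Rule 1: anything less is an enhanced NIC, not a SmartNIC platform
    return "enhanced NIC"
```

The ordering matters: a DPU-class device also passes the SmartNIC test, so the strongest evidence is checked first.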

Comparison table (engineering decision fields)

Programmability model
  • Traditional NIC: fixed functions + limited config.
  • SmartNIC: configurable or programmable pipeline (P4/SDK/microcode).
  • DPU: programmable datapath + richer on-card services.
Isolation boundary
  • Traditional NIC: basic queue/VLAN separation.
  • SmartNIC: per-VF/tenant rules + queue steering; stronger resource partitioning.
  • DPU: multi-tenant isolation with lifecycle controls and stronger policy/key domains.
Control plane location
  • Traditional NIC: host-centric.
  • SmartNIC: split (host API + on-card agents/counters).
  • DPU: on-card control services are first-class (updates/attestation/telemetry).
Best-at metrics
  • Traditional NIC: line-rate throughput for common cases.
  • SmartNIC: lower host CPU%, better jitter, higher effective Mpps for selected flows.
  • DPU: max offload breadth + consistent tail latency under host contention (when designed well).
Typical failure mode
  • Traditional NIC: driver/firmware mismatch; limited visibility.
  • SmartNIC: rule/pipeline scaling hits memory/queues; observability split across domains.
  • DPU: lifecycle complexity; update + attestation + policy consistency becomes critical.

Practical takeaway: classification is not branding—use the three axes above and verify that counters, rollback, and isolation are available as operational evidence.

Figure F1 — Host ↔ SmartNIC/DPU ↔ Network: fast path, control path, telemetry path
The core engineering boundary is observable: a programmable fast path on the card plus a control/telemetry path that supports versioning, rollout, rollback, and measurable evidence (counters/logs).

H2-2 · Offload Decision Map: What to Move onto SmartNIC/DPU (and What Not To)

Offload is valuable only when it produces measurable gains: lower host CPU usage, higher effective Mpps, tighter tail latency, and stronger multi-tenant isolation. The decision should be driven by traffic shape (64B-heavy vs large packets), change rate (how often rules and policies change), and operational evidence (counters, logs, rollback).

Offload buckets by where the benefit comes from

  • CPU cycles: TLS/IPsec/MACsec, compression, regex.
  • Latency/jitter: eSwitch/vSwitch, ACL/flow steering, fewer host hops.
  • I/O efficiency: RDMA/RoCE, zero-copy, queue steering.
  • Isolation: per-tenant queues/rules/keys/counters.

Selection checklist (one-line, engineering-grade)

  • Primary target: optimize Mpps (64B packets) or Gbps (throughput)? Offload choice changes completely.
  • Latency goal: average latency or p99/p999 tail? Queueing and backpressure dominate tails.
  • Rule complexity: how deep is match/action? Table width and action fan-out often hit memory bandwidth first.
  • Change rate: are policies updated daily? High churn requires versioning + safe rollout + rollback evidence.
  • Host transparency: can this stay kernel/SR-IOV/virtio, or does it require DPDK/SDK integration?
  • Observability: are per-stage counters, queue depths, drop reasons, and temperature/rail telemetry accessible?
  • Correctness constraints: does offload need strict replay/ordering semantics (crypto) or content boundaries (compression)?

When offload should be avoided (red lines)

Do not offload when the operational cost outweighs the gain:
  • Unstable requirements: policies change faster than the pipeline/SDK can be safely validated and rolled back.
  • Insufficient evidence: counters/logs are missing, making incidents non-reproducible and blame ambiguous.
  • Tail-latency sensitivity: queueing/backpressure cannot be bounded, creating p999 spikes after offload.
  • High app refactor cost: the required user-space stack changes exceed the CPU% savings.
  • Security lifecycle gaps: firmware update/rollback and measured boot are not auditable under multi-tenant operation.

Validation loop (prove the offload is real and repeatable)

  1. Baseline: measure host-only CPU%, Mpps/Gbps, and p99/p999 under the same traffic mix.
  2. Enable one offload at a time: avoid “everything on” changes that hide root causes.
  3. Collect evidence from three planes: datapath counters (hits/drops/queues), host counters (softirq/context switches), and link/SerDes evidence (FEC/BER/flap) if relevant.
  4. Rollback test: roll back rule/firmware/config versions and confirm the delta is reproducible.
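The validation loop above amounts to comparing metric snapshots and checking that a rollback reproduces the baseline. A minimal sketch, with illustrative metric names and an assumed 5% relative tolerance:

```python
def metric_delta(baseline: dict, candidate: dict) -> dict:
    """Per-metric difference between two runs of the same traffic mix."""
    return {k: candidate[k] - baseline[k] for k in baseline}

def rollback_reproduces(baseline: dict, after_rollback: dict,
                        tol: float = 0.05) -> bool:
    """A rollback is credible when every metric returns to within
    `tol` (relative) of the pre-offload baseline."""
    return all(abs(after_rollback[k] - v) <= tol * abs(v)
               for k, v in baseline.items())

# Example: host-only baseline vs a single offload enabled, then rollback.
baseline = {"host_cpu_pct": 72.0, "p99_us": 180.0}
with_offload = {"host_cpu_pct": 41.0, "p99_us": 150.0}
gain = metric_delta(baseline, with_offload)  # negative = improvement
```

If `rollback_reproduces` fails, the measured "gain" was confounded by something else (thermal state, driver settings, traffic drift) and the offload delta is not yet evidence.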
Figure F2 — Offload decision map: benefit source → constraints → evidence
Offload decisions should map to a measurable benefit source (CPU, jitter, I/O, isolation) and must include evidence (counters, versions, rollback) to avoid irreproducible field incidents.

H2-3 · Programmable Data-Plane Pipeline: Parser → Match/Action → Scheduler

Key idea: SmartNIC/DPU throughput does not fail “randomly” when rules get complex. It typically collapses at one of three places: (1) miss paths (default/slow path or recirculation), (2) action inflation (too many stages or memory touches per packet), or (3) queue backpressure (microbursts filling buffers and pushing latency tails).

Pipeline view (three stages, three budgets)

A programmable data plane can be modeled as a fixed budget per packet. Each packet consumes a portion of the budget in: parsing (extracting fields and metadata), lookup + actions (tables and modifications), and scheduling (queues, shaping, and congestion decisions). When rule complexity increases, the per-packet cost rises until either pipeline depth or memory bandwidth becomes the limiting factor.

Budget 1 — Parse work: field extraction steps · branching · re-parse.
Budget 2 — Table + action work: lookups per packet · action chain length · memory touches.
Budget 3 — Queue work: enqueue/dequeue ops · shaping decisions · microburst absorption.
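The three budgets can be combined into a toy per-packet cost model. All stage timings below are illustrative assumptions, not measured silicon numbers; the point is how the ceiling moves when rules get richer:

```python
def per_packet_ns(parse_steps: int, lookups: int, actions: int,
                  mem_touches: int,
                  t_parse=0.5, t_lookup=1.2, t_action=0.8, t_mem=2.0) -> float:
    """Sum the three budgets in nanoseconds per packet
    (timings are arbitrary placeholders)."""
    return (parse_steps * t_parse + lookups * t_lookup
            + actions * t_action + mem_touches * t_mem)

def ceiling_mpps(cost_ns: float) -> float:
    """A cost of N ns/packet caps the pipeline at 1000/N Mpps."""
    return 1e3 / cost_ns

# A simple rule set vs a "richer" policy on the same hardware model:
simple = per_packet_ns(parse_steps=2, lookups=1, actions=1, mem_touches=1)
rich = per_packet_ns(parse_steps=4, lookups=3, actions=4, mem_touches=6)
```

In this model the richer policy roughly quadruples the per-packet cost, which is exactly the "cliff" behavior: throughput is not a property of the card, it is a property of the rule set running on it.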

Stage A: Parser (fields → metadata)

  • Role: identify headers, extract key fields, and create metadata used by later tables.
  • Engineering constraints: variable-length headers, deep header stacks, and conditional parsing increase stage pressure.
  • Failure signature: rule sets that depend on many extracted fields often trigger extra parsing steps or recirculation.
  • Evidence to collect: parser drops, re-parse/recirculation counters, and “unknown header” counts.

Stage B: Match/Action (tables → decisions)

Match/Action is where most rule complexity lives. Tables are implemented with different resources: TCAM (flexible but power/scale limited), SRAM/hash (efficient but sensitive to collisions and key design), or hybrid pipelines. The real cost is not “how many rules exist,” but how many lookups and actions per packet are executed, and how many memory touches each action requires.

  • Table scale pressure: large rule sets increase key width, hit rate variability, and update overhead.
  • Action inflation: action chains that touch counters, modify multiple headers, or do multiple checks amplify per-packet work.
  • Miss-path penalty: a “miss” can mean slow fallback logic, default actions, or extra recirculation—often the first cliff.
  • Evidence to collect: hit/miss counters per table, action execution counts, and recirculation counts.

Stage C: Scheduler / Queueing (priority, shaping, congestion)

Queueing behavior determines tail latency. Even when average throughput looks stable, microbursts can fill queues, causing backpressure that propagates upstream and creates p99/p999 spikes. When backpressure becomes the steady-state, throughput can appear “fine” while application QoS collapses.

  • Queue depth vs. tail latency: deep queues hide loss but inflate tails; shallow queues reduce tails but may drop earlier.
  • Backpressure propagation: egress congestion can stall ingress if buffering and scheduling are not isolated.
  • Evidence to collect: queue depth distributions, drop reasons (tail drop vs WRED/RED), and p99/p999 latency correlations.
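Why a microburst keeps hurting long after it ends can be seen in a tick-based single-queue sketch (the arrival and drain numbers are arbitrary):

```python
def simulate_queue(arrivals, drain_per_tick, depth_limit):
    """Track queue occupancy per tick; overflow beyond depth_limit
    is counted as drops."""
    q, drops, trace = 0, 0, []
    for a in arrivals:
        q += a
        if q > depth_limit:
            drops += q - depth_limit  # tail drop on overflow
            q = depth_limit
        q = max(0, q - drain_per_tick)
        trace.append(q)
    return trace, drops

# Average arrival equals drain (10/tick), but a single 100-packet burst
# leaves the queue persistently occupied: occupancy never returns to
# zero, so every later packet inherits the backlog as added latency.
trace, drops = simulate_queue([10] * 5 + [100] + [10] * 20, 10, 200)
```

This is the "hidden mode" failure: throughput counters stay flat, drops stay zero, and only the occupancy trace (and p99/p999) reveal that backpressure became the steady state.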

Three common performance cliffs (symptom → mechanism → evidence → direction)

Cliff: Table miss
  • Typical symptom: throughput collapses when rule count grows; drops spike on specific flows.
  • Mechanism: a miss triggers the slow/default path, extra lookups, or recirculation.
  • Evidence to check first: per-table miss counters, fallback path hits, recirculation count.
  • Engineering direction: improve key design, split tables, reduce the miss penalty, add staging with early classification.
Cliff: Action inflation
  • Typical symptom: line rate is reached for simple rules but falls sharply with "richer" policies.
  • Mechanism: too many action stages or memory touches per packet exceed pipeline/memory budgets.
  • Evidence to check first: action execution counts, per-stage utilization, memory pressure indicators.
  • Engineering direction: shorten action chains, precompute metadata, move expensive checks to earlier filtering.
Cliff: Queue backpressure
  • Typical symptom: average latency OK but p99/p999 spikes; intermittent drops during bursts.
  • Mechanism: microbursts fill buffers; backpressure propagates and stalls ingress.
  • Evidence to check first: queue depth distribution, drop reason, tail latency correlation.
  • Engineering direction: bound queues, shape burstiness, isolate queues per tenant/priority, tune scheduling policy.
Figure F3 — Data-plane cross-section: pipeline blocks and where performance cliffs form
Rule complexity increases per-packet work (lookups, actions, memory touches) until a cliff is hit. Keeping table/action/queue evidence visible avoids “throughput is fine” false confidence while tails and drops grow.

H2-4 · Host Interface & Virtualization: PCIe, DMA, SR-IOV, virtio, IOMMU

Key idea: Many “SmartNIC is powerful but performance is weird” deployments are limited by the host↔card interface rather than datapath logic: PCIe transaction overhead, DMA mapping behavior, IOMMU translation costs, queue/interrupt models, and NUMA affinity.

Host↔card critical chain (what sets the ceiling and the tail)

The interface should be treated as three coupled paths: data path (per packet / per queue), control path (driver, policy, firmware lifecycle), and isolation path (multi-tenant VF boundaries and IOMMU rules). If any path lacks measurable evidence or safe rollback, field incidents become non-reproducible.

  • PCIe: bandwidth is necessary, but packet-rate overhead often dominates.
  • DMA: descriptor frequency · doorbells · batching controls Mpps and CPU%.
  • IOMMU: translation cost and TLB behavior influence tail latency and jitter.
  • Queue model: ring sizes · MSI-X vectors · interrupt vs polling shapes tails.
  • NUMA affinity: CPU/memory placement mismatches create jitter and inconsistent throughput.
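Why batching dominates the Mpps ceiling can be sketched with a fixed-cost amortization model. The nanosecond figures below are illustrative assumptions, not measurements of any real NIC:

```python
def effective_mpps(t_per_pkt_ns: float, t_fixed_ns: float,
                   batch: int) -> float:
    """Amortize a fixed per-notification cost (doorbell write,
    interrupt, descriptor-fetch round trip) over `batch` packets."""
    return 1e3 / (t_per_pkt_ns + t_fixed_ns / batch)

# e.g. 5 ns of real per-packet work plus a 100 ns doorbell/IRQ cost:
one_by_one = effective_mpps(5.0, 100.0, batch=1)
batched = effective_mpps(5.0, 100.0, batch=32)
```

The Gbps number barely moves with large frames, but at 64B the fixed cost is paid per packet, which is why "Gbps looks fine, Mpps is low" points at the notification path before it points at the pipeline.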

SR-IOV vs virtio vs user-space bypass (choose by operational cost)

  • SR-IOV: best raw performance and isolation for VFs, but can fragment observability (per-VF counters, drop reasons, and policy attribution).
  • virtio: simpler lifecycle and portability, but the path can be longer; packet-rate bottlenecks may appear earlier.
  • User-space bypass (e.g., DPDK class): pushes Mpps higher but increases lifecycle risk (version matrices, hugepages, strict affinity rules).

Symptoms → root-cause direction (start with evidence, not guesses)

Symptom: Gbps looks fine, but Mpps (64B) is low
  • Common cause clusters: per-packet overhead (doorbells/descriptors), interrupt/polling mismatch, suboptimal batching.
  • Evidence to check first: CPU% + softirq/context switches, queue utilization, ring full/empty counters.
  • First engineering direction: increase batching/aggregation, adjust interrupt moderation or polling, verify affinity and ring sizing.
Symptom: Tail latency (p99/p999) spikes
  • Common cause clusters: queue backpressure, IOMMU translation jitter, control-plane changes interfering with the datapath.
  • Evidence to check first: queue depth distribution, drop reason, latency correlation with updates/telemetry.
  • First engineering direction: bound queues, isolate noisy tenants, schedule safe rollout windows, keep versioned config snapshots.
Symptom: VF jitter/drops under SR-IOV
  • Common cause clusters: resource imbalance (queues/vectors), unclear isolation boundary, driver/firmware mismatch.
  • Evidence to check first: per-VF drops/queues, reset counters, driver + firmware versions, VF allocation map.
  • First engineering direction: define per-VF quotas, consolidate evidence collection, validate the version matrix and rollback plan.
Symptom: Behavior changes after a driver upgrade
  • Common cause clusters: default queue/interrupt settings changed, mapping policy changed, firmware mismatch.
  • Evidence to check first: before/after config snapshot, counters delta, rollback reproducibility.
  • First engineering direction: lock baseline settings, roll out with staged canaries, require rollback verification.

The fastest path to stability is to treat host↔card as an evidence-driven system: capture versions, counters, and configuration snapshots so performance deltas can be reproduced and safely rolled back.

Figure F4 — Host↔card interface: data/control/isolation paths and jitter sources
Interface performance is determined by per-packet overhead (descriptors/doorbells), translation behavior (IOMMU), queue/interrupt models, and NUMA placement. Stable operation requires versioned configuration snapshots and rollback verification.

H2-5 · Queues, Buffers & Memory System: DDR/HBM, Descriptor Rings, Congestion & Microbursts

Featured answer: Microbursts “break” SmartNIC/DPU deployments because instantaneous arrival rate exceeds the drain rate of queues + memory paths. The first visible failure is often tail latency, followed by drops or throughput cliffs. The true limit is frequently memory bandwidth/latency + backpressure, not compute.

Three memory-access classes (treat them as separate bottlenecks)

A data plane consumes memory resources in three different ways. Diagnosing the wrong domain leads to “tuning forever” without progress. The fastest approach is to isolate which domain saturates first under the workload shape (small packets, rule complexity, or burstiness).

1) Packet buffers (payload): store frames during bursts; queue depth drives tail latency and drops.
2) Descriptor rings (metadata): per-packet descriptors, doorbells, and DMA reads/writes shape Mpps and jitter.
3) Flow tables (lookups): TCAM/SRAM/DDR/HBM lookups; misses and recirculation create performance cliffs.

Why microbursts create backpressure (and why averages mislead)

  • Burst growth: a short burst can fill queues faster than they drain, even when average throughput looks safe.
  • Backpressure propagation: once egress queues are persistently non-empty, upstream stages stall and tail latency grows rapidly.
  • Hidden mode: throughput might still look “OK” while p99/p999 and drop reasons quietly worsen.

Practical sizing method (steps, not formulas)

Use this as a repeatable estimation workflow. It keeps the “buffer vs metadata vs table” decision evidence-based.

  • Step A — Define the traffic shape: target packet rate (Mpps) and typical packet sizes (e.g., 64B vs mixed).
  • Step B — Count per-packet memory work: descriptor reads/writes per packet and how many table lookups/actions are executed.
  • Step C — Define the burst window: how long peak bursts last (microseconds to milliseconds) and which queues see them.
  • Step D — Map to resources: buffer depth must absorb the burst window; DDR/HBM bandwidth must cover per-packet metadata + lookup work at target Mpps.
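Steps A–D reduce to two quick estimates. The numbers below are illustrative; real sizing must add per-queue partitioning and DDR efficiency factors on top:

```python
def burst_buffer_bytes(arrival_mpps: float, drain_mpps: float,
                       burst_us: float, pkt_bytes: int) -> float:
    """Steps C/D: buffer depth that absorbs a burst window where
    the arrival rate exceeds the drain rate."""
    excess_pps = (arrival_mpps - drain_mpps) * 1e6
    return excess_pps * (burst_us * 1e-6) * pkt_bytes

def metadata_bw_gbps(mpps: float, desc_bytes: int,
                     touches_per_pkt: int) -> float:
    """Steps B/D: DDR/HBM bandwidth consumed by descriptor traffic
    alone at the target packet rate."""
    return mpps * 1e6 * desc_bytes * touches_per_pkt * 8 / 1e9

# 150 Mpps arriving vs 100 Mpps draining for a 50 µs burst of 64B frames,
# plus the metadata cost of sustaining 100 Mpps with 16B descriptors:
buf = burst_buffer_bytes(150, 100, 50, 64)   # bytes needed for the burst
meta = metadata_bw_gbps(100, 16, 2)          # Gbps of pure metadata traffic
```

Note that the metadata estimate alone can consume tens of Gbps of memory bandwidth before a single payload byte is stored, which is why "Gbps fine, Mpps low" so often maps to the descriptor path.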

Symptoms → likely domain → first evidence

Symptom: average latency OK, but p99/p999 spikes
  • Most likely saturated domain: packet buffers / queueing.
  • First evidence to check: queue depth distribution, drop reason, tail correlation with bursts.
  • First engineering direction: bound queues, shape bursts, isolate priority/tenant queues, verify backpressure boundaries.
Symptom: Gbps fine, but small-packet Mpps is low
  • Most likely saturated domain: descriptor rings / metadata path.
  • First evidence to check: ring full/empty counters, doorbell rate, DMA completion jitter.
  • First engineering direction: increase batching, tune ring sizes, reduce per-packet metadata touches, verify affinity.
Symptom: rules get complex and throughput drops sharply
  • Most likely saturated domain: flow tables / lookup bandwidth.
  • First evidence to check: table hit/miss, lookup counts, recirculation, update pressure.
  • First engineering direction: split tables, reduce lookup depth, improve key design, reduce the miss penalty.
Symptom: some queues starve or oscillate under load
  • Most likely saturated domain: queueing + shared memory arbitration.
  • First evidence to check: per-queue occupancy, scheduling counters, fairness/priority indicators.
  • First engineering direction: rebalance queue weights, isolate tenants, cap burst per class, validate scheduler policy.
Figure F5 — Microburst path: queues, three memory domains, and the first evidence points
Microbursts first inflate queue depth (tail latency), then propagate backpressure and trigger drops or cliffs. Separate payload buffering, metadata rings, and lookup tables to find the true limiting domain.

H2-6 · SerDes/PHY/Retimer: PAM4 Link Budget, Training, FEC & Eye Margin

Featured answer: PAM4 links can “come up” yet still fail in the field because the remaining margin is small and highly sensitive to channel loss, temperature, and equalization settings. The fastest bring-up sequence is physical margin (BER/eye) → training/EQ state → FEC counters → performance validation.

Keep scope tight: external ports and internal PCIe/SerDes

This section focuses on SerDes links that a SmartNIC/DPU directly owns: the external network port SerDes and the internal high-speed SerDes (e.g., PCIe). Many intermittent flaps are rooted in margin erosion at the physical layer, long before datapath logic or software is the culprit.

Channel budget items (what quietly eats margin)

  • Insertion loss: PCB traces, connectors, and cables add frequency-dependent loss that reduces eye opening.
  • Reflections: impedance discontinuities (vias/pads/connectors) create echo and worsen PAM4 decision thresholds.
  • Crosstalk + noise coupling: adjacent lanes and power noise reduce effective SNR and increase BER sensitivity.
  • Environment drift: temperature/voltage changes can turn “barely passing” into “intermittent.”

Retimer vs redriver (practical boundary)

A redriver boosts and equalizes the signal but does not fully re-time it; jitter accumulation remains a limiting factor. A retimer includes clock/data recovery (CDR) and can restore eye quality more aggressively, but it introduces training/compatibility dependencies and makes visibility into training/FEC counters essential.

Training, equalization, and FEC (roles and what to measure)

Training state: negotiation and convergence to stable Tx/Rx settings.
Tx EQ / Rx EQ: FIR/CTLE/DFE choices that recover eye margin.
FEC counters: corrected vs uncorrected errors; rising counters indicate margin debt.
BER / eye margin: physical health; correlate with temperature and rate changes.

Bring-up checklist (recommended order)

  • 1) Physical layer first: verify BER/eye margin and lane-to-lane consistency; check temperature sensitivity.
  • 2) Training/EQ next: confirm training converges reliably; validate Tx EQ and Rx EQ are not at extreme settings.
  • 3) FEC evidence: monitor corrected/uncorrected error counters; spikes often precede flaps and rate downshifts.
  • 4) Performance last: validate negotiated rate, stability under load, and error-free sustain at target throughput.
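Step 3 of the checklist can be automated as a simple counter-delta watch. The warning threshold below is an arbitrary placeholder and must be calibrated per link type and polling interval:

```python
def fec_state(corrected_delta: int, uncorrected_delta: int,
              corrected_warn: int = 1000) -> str:
    """Classify link margin from FEC counter deltas over one poll
    interval (threshold is an illustrative assumption)."""
    if uncorrected_delta > 0:
        return "fail"         # uncorrectable symbols: data already lost
    if corrected_delta > corrected_warn:
        return "margin-debt"  # FEC still hides errors, but margin erodes
    return "healthy"
```

The key operational idea is that a link reporting zero uncorrectable errors is not automatically healthy: a rising corrected-error rate is the early warning that precedes flaps and rate downshifts.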
Figure F6 — PAM4 SerDes channel cross-section: retimer placement and four measurable points
When a PAM4 link is “up but unstable,” prioritize physical evidence (BER/eye) and training/EQ convergence, then use FEC counters as early warning before validating throughput.

H2-7 · Security & Isolation: Root-of-Trust, Firmware Chain, Tenant Key Domains & Attestation

Featured answer: Moving the data plane onto a SmartNIC/DPU extends the trust boundary onto the card. A robust boundary is built from hardware Root-of-Trust + secure/measured boot + controlled firmware updates with rollback. Multi-tenant safety depends on per-tenant/VF domains (keys, policies, telemetry) and on producing auditable evidence (attestation, logs, counters, and configuration snapshots).

Trust chain on the card (what must be anchored)

A SmartNIC/DPU should boot into an identity that can be verified, not just an image that happens to start. Two complementary mechanisms define the boot boundary:

  • Secure boot: prevents unsigned or unauthorized images from running on the card.
  • Measured boot: records what actually started (measurements) so the running state can be proven later.

Firmware lifecycle: update and rollback that can be audited

Firmware changes on a programmable dataplane must be treated as operational change control. The minimum safe posture is a rollback-capable update path and evidence that ties each running state to a version, a signature status, and a policy snapshot.

Rollback-ready updates: A/B slots (or equivalent) to recover from failures fast.
Version pinning: prevent accidental drift across hosts/racks/tenants.
Canary rollout: small blast radius for new dataplane firmware.
Evidence logging: version, build ID, signature status, config snapshot, timestamp.
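A rollback-ready update path is essentially a two-slot state machine. The sketch below is a hypothetical controller model, not any vendor's tooling:

```python
class FirmwareSlots:
    """A/B-slot model: stage to the inactive slot, switch only after a
    health check, keep the previous slot intact for rollback, and log
    every transition as audit evidence."""

    def __init__(self, initial: str = "v1.0"):
        self.slots = {"A": initial, "B": None}
        self.active = "A"
        self.log = []  # evidence trail: (event, slot, version)

    def stage(self, version: str) -> str:
        """Write the new image to the inactive slot; active is untouched."""
        inactive = "B" if self.active == "A" else "A"
        self.slots[inactive] = version
        self.log.append(("stage", inactive, version))
        return inactive

    def commit(self, slot: str, healthy: bool) -> str:
        """Switch only if the canary health check passed."""
        if healthy:
            self.active = slot
        self.log.append(("commit" if healthy else "abort", slot,
                         self.slots[slot]))
        return self.active
```

Because the old slot is never overwritten during staging, a failed canary leaves the device exactly where it was, with the abort recorded in the evidence log.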

Attestation: proving what the card is running

Attestation provides a verifiable “evidence bundle” that ties the running firmware and policy set to measurable identities. It is the operational answer to “which firmware is actually running right now,” especially when multiple tenants share the same physical accelerator.

Domain isolation for multi-tenant / multi-VF

Isolation is most reliable when it is expressed as separate domains with explicit boundaries and evidence for each domain. The practical model is three independent domains per tenant or per VF:

  • Key domain — per-tenant keys & lifecycle: keys should not be shared across VFs/tenants; rotation, revocation, and usage should be audit-visible.
  • Policy domain — per-tenant rules & limits: ACLs/flows/limits must be isolated so one tenant cannot modify or infer another tenant’s policy set.
  • Telemetry domain — per-tenant counters & traces: counters and error visibility must remain tenant-scoped to prevent blame ambiguity and cross-tenant leakage.

Operational evidence checklist (what must exist in logs/counters)

  • Identity: firmware version, build ID, signature verification status, and policy set version.
  • Boot evidence: secure/measured boot results and any boot-policy violations.
  • Change control: update/rollback events with timestamps and outcome codes.
  • Attestation: success/failure counters and last-known evidence hash.
  • Tenant separation: per-tenant/VF configuration snapshot hashes and policy change records.
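The "configuration snapshot hashes" in the checklist can be produced deterministically, so any policy drift shows up as a hash change in the audit log. A minimal sketch, assuming per-tenant config is representable as JSON:

```python
import hashlib
import json

def snapshot_hash(config: dict) -> str:
    """Canonicalize the per-tenant config (sorted keys, fixed
    separators) so the same policy always yields the same hash,
    regardless of dict insertion order."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Logging `(tenant, policy_version, snapshot_hash, timestamp)` on every change makes "which policy was actually active during the incident" answerable from evidence instead of memory.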
Figure F7 — Trust chain, attestation evidence, and per-tenant isolation domains
Secure/measured boot anchors the card identity; attestation exports proof; tenant/VF domains isolate keys, policies, and telemetry with logs, counters, and snapshots as operational evidence.

H2-8 · Power, Thermal & Monitoring: Rails, PMBus Telemetry, Power Capping & Observable Derating

Featured answer: Sustained-load slowdowns that recover only after a reboot are frequently caused by thermal hotspots, rail droop / VRM current limiting, or power-cap/derating states that persist. The fastest path to stability is making the derating loop observable: temperatures, rail minima, current limits, and throttle counters aligned in time.

Partition the card into power domains (map symptom → domain → evidence)

SmartNIC/DPU cards behave like small systems with multiple rails and thermal zones. The most useful layout is domain-based: each domain has its own failure modes, and each mode has a first evidence point that should be logged.

  • Core domain: compute hotspots · power-cap sensitivity.
  • SerDes domain: margin erosion · link errors with heat/noise.
  • DDR domain: bandwidth-driven power · memory thermal rise.
  • PCIe domain: link stability · downshift/replay under stress.
  • Accelerator domain: peak current · localized throttling.

Telemetry chain (PMBus / board controller → host tools)

A stable system needs a complete telemetry chain: sensors and VRMs expose measurements (temperature, voltage minima, current limits), a board controller collects them over PMBus (or equivalent), and the host can correlate them with throttle events. The objective is correlation, not just visibility.

  • Thermal — hotspot temps: core, VRM, memory, and SerDes areas; track max and time-to-throttle.
  • Power rails — rail droop minima: voltage-minimum events and persistent sag indicate margin debt.
  • VRM protection — current limit / OCP: limit events often appear as “mysterious” slowdowns.
  • Derating — throttle counters: reason codes + duration; align with temps and rails.

Slowdown symptoms → evidence alignment → likely direction

Symptom: throughput drops stepwise after minutes
  • Most common derating trigger: thermal throttling.
  • First evidence to align in time: hotspot temperature vs throttle counter increments.
  • First engineering direction: increase thermal margin, reduce peak power, ensure airflow/heat-sink contact quality.
Symptom: temperature looks “fine” but performance still collapses
  • Most common derating trigger: VRM current limiting / rail droop.
  • First evidence to align in time: rail minima + VRM limit counters vs the performance change.
  • First engineering direction: rebalance power domains, reduce transient load, validate rail headroom and decoupling.
Symptom: intermittent link flaps / rate downshift under load
  • Most common derating trigger: SerDes margin erosion (heat/noise).
  • First evidence to align in time: link error/FEC counters vs temperature and rail-noise indicators.
  • First engineering direction: improve margin (EQ, channel loss, power integrity), verify temperature coupling to SerDes.
Symptom: only a reboot restores full speed
  • Most common derating trigger: persistent power-cap/derating state.
  • First evidence to align in time: throttle reason codes + “cap active” duration since boot.
  • First engineering direction: make the derating state observable and resettable; enforce policy-based caps with logging.

Stability “triple”: thermal margin, power budget, observable derating

  • Thermal margin: hotspots stay below thresholds with enough headroom at worst-case ambient.
  • Power budget: rail headroom covers sustained and transient load without entering protection regions.
  • Observable derating: throttle reason codes and counters exist and correlate with sensors in time.
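The "observable derating" leg means throttle events must be joinable with sensor samples in time. A minimal nearest-sample join sketch; the record field names (`t`, `reason`, `temp_c`, `vmin`) are assumptions about how telemetry might be normalized:

```python
def align_throttle_events(samples, events, window_s: float = 5.0):
    """For each throttle event, attach the sensor sample nearest in
    time, so reason codes can be read next to temperatures and rail
    minima. Events with no sample within `window_s` are skipped."""
    aligned = []
    for ev in events:
        nearest = min(samples, key=lambda s: abs(s["t"] - ev["t"]))
        if abs(nearest["t"] - ev["t"]) <= window_s:
            aligned.append({"reason": ev["reason"],
                            "temp_c": nearest["temp_c"],
                            "vmin": nearest["vmin"]})
    return aligned
```

Events that cannot be matched within the window are themselves a finding: they indicate the telemetry polling rate is too slow to attribute derating to a cause.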
Figure F8 — Card power domains, PMBus telemetry chain, and throttle/derating observability loop
Partition rails by domain, collect sensor/VRM telemetry over PMBus, and correlate throttle reason codes with hotspot temperatures and rail minima to stabilize sustained performance.
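Telemetry words read over PMBus from monitors in this class (e.g., an INA233-type rail monitor) commonly use the PMBus Linear-11 encoding. A minimal decoder sketch; bus access and per-device register maps are omitted, and which commands return Linear-11 is device-specific:

```python
def decode_linear11(word: int) -> float:
    """Decode a PMBus Linear-11 word: 5-bit signed exponent, 11-bit signed mantissa."""
    exponent = (word >> 11) & 0x1F
    mantissa = word & 0x7FF
    # Sign-extend the two's-complement fields.
    if exponent > 0x0F:
        exponent -= 0x20
    if mantissa > 0x3FF:
        mantissa -= 0x800
    return mantissa * (2.0 ** exponent)

# Example: mantissa 100 with exponent -2 encodes 25.0 (volts, amps, or watts,
# depending on which PMBus command returned the word).
word = ((0x20 - 2) << 11) | 100   # exponent -2 as 5-bit two's complement
# decode_linear11(word) -> 25.0
```

Decoding on the board controller (rather than logging raw words) is what makes rail minima and throttle windows directly comparable against temperature and performance KPIs.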

H2-9 · Performance Engineering: Gbps vs Mpps, Tail Latency, Zero-Copy & Queue Shaping

Featured answer: A “100G” label is typically a Gbps story (large packets), while 64B traffic is a Mpps story dominated by per-packet fixed costs (descriptors, DMA, queues, lookups). Real user pain is usually tail latency (p99/p999) driven by queueing and contention, with link-side FEC acting as an amplifier. Tuning should follow a strict priority: queues/polling → DMA/mapping → pipeline/rules → link counters.
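The Gbps-vs-Mpps gap is fixed arithmetic: every Ethernet frame also occupies 20 bytes of wire overhead (preamble/SFD + inter-frame gap), so 64B frames at 100G require roughly 148.8 Mpps. A quick sketch:

```python
def line_rate_mpps(link_gbps: float, frame_bytes: int, overhead_bytes: int = 20) -> float:
    """Theoretical packet rate at line rate.

    overhead_bytes covers preamble/SFD (8B) + minimum inter-frame gap (12B).
    """
    bits_per_frame = (frame_bytes + overhead_bytes) * 8
    return link_gbps * 1e9 / bits_per_frame / 1e6

# 100G with 64B frames: ~148.8 Mpps; with 1500B frames: ~8.2 Mpps.
```

The roughly 18x spread between those two numbers is why a card can honestly advertise 100G and still collapse on small packets.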

The performance “tri-metrics” (what must be measured together)

  • Throughput: Gbps (large-packet capacity). Often looks great with big frames, but can hide small-packet collapse.
  • Packet rate: Mpps (per-packet overhead). Exposes descriptor/DMA/queue costs and rule processing depth.
  • Experience: p99/p999 tail latency. Microbursts and contention pull the tail long before averages move.

Where tail latency comes from (bounded to SmartNIC/DPU)

Tail latency rarely comes from a single “slow component.” It is typically a chain effect: queue depth grows, memory and metadata paths contend, and per-packet work spikes when rules get deeper or misses occur.

  • Queueing (depth & scheduling). Symptom: p99 jumps during bursts despite “steady Gbps”. Evidence: queue occupancy histogram + time-aligned p99. Direction: shaping, per-tenant budgets, reduced burst amplification.
  • Memory contention (DDR/PCIe). Symptom: tail grows with concurrency; Mpps droops. Evidence: bandwidth/latency indicators + ring completion jitter. Direction: reduce metadata touches, rebalance domains, avoid hot contention points.
  • Cache/table miss (lookup depth). Symptom: rules get complex → sudden drop / recirculation. Evidence: hit/miss, recirculation counters, lookup depth. Direction: simplify match-action depth, reduce miss penalty, control table growth.
  • Descriptor path (rings/doorbells). Symptom: pps collapse, “hungry” rings, intermittent stalls. Evidence: ring full/empty, doorbell rate, DMA completion spread. Direction: ring sizing, batch strategy, polling/interrupt balance, mapping costs.
  • Link-side FEC (as an amplifier). Symptom: latency tail worsens under errors; rate changes. Evidence: FEC corrected/uncorrected + BER counters. Direction: use counters to confirm margin issues; avoid chasing ghosts in software first.

Zero-copy boundary: why it helps and when it backfires

Zero-copy can reduce CPU work and lower latency tail by removing extra copies and cache pollution, but it may introduce mapping and pinning costs. The decision boundary is whether per-packet fixed overhead is the limiting factor.

  • Helps when: small packets, high queue concurrency, high Mpps, and measurable descriptor/DMA overhead.
  • Backfires when: workloads are already large-packet dominated, or mapping/operational complexity becomes the bottleneck.
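One way to make that decision boundary concrete: convert the target packet rate into a per-packet time/cycle budget and compare it against the measured cost of copies and mapping. A sketch with illustrative numbers, not vendor figures:

```python
def per_packet_budget_ns(target_mpps: float) -> float:
    """Time budget per packet, in nanoseconds, to sustain the target rate."""
    return 1e3 / target_mpps

def cycles_per_packet(target_mpps: float, core_ghz: float) -> float:
    """Cycle budget per packet on one core at the given clock."""
    return per_packet_budget_ns(target_mpps) * core_ghz

# At 148.8 Mpps (64B at 100G) a single 3 GHz core has roughly 20 cycles per
# packet -- a single extra copy or mapping operation blows the budget, which
# is exactly the regime where zero-copy pays off. At 8 Mpps (large packets)
# the budget is ~375 cycles and copies rarely dominate.
```

If your measured per-packet fixed cost fits comfortably inside the budget, zero-copy's mapping and pinning complexity is probably not worth it.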

Queue shaping: reducing microburst damage (card/interface scope)

Shaping is a tail-latency tool: it converts uncontrolled bursts into controlled queueing. It is most effective when it is observable (queue depth distribution improves while p99/p999 improves).

  • Burst cap: limit burstiness.
  • Priority scheduling: protect critical traffic.
  • Per-tenant queue budget: avoid noisy neighbors.
  • Queue depth telemetry: prove the win.
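A burst cap is typically implemented as a token bucket, where the bucket depth bounds the largest burst the downstream queue can see. A minimal sketch, assuming packet-count (rather than byte) accounting for brevity:

```python
class TokenBucket:
    """Burst cap: `rate` tokens/s refill; `depth` = max burst admitted at once."""

    def __init__(self, rate: float, depth: float):
        self.rate, self.depth = rate, depth
        self.tokens, self.last = depth, 0.0

    def admit(self, now: float) -> bool:
        # Refill proportionally to elapsed time, clamped to the bucket depth.
        self.tokens = min(self.depth, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False   # caller drops, or queues the packet for later

# A 10-packet bucket at 1000 pkt/s admits a 10-packet burst, then paces
# admissions at the refill rate.
```

The "prove the win" step is then checking that queue-depth distribution tightens and p99/p999 drops after the cap is applied, not just that the cap is configured.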

Tuning priority (engineering order that avoids blind guessing)

  1. Queues / interrupts / polling: stabilize jitter sources first.
  2. DMA / mapping: reduce descriptor-path overhead and mapping costs.
  3. Pipeline / rules: control lookup depth and miss penalties.
  4. Link counters: use FEC/BER to confirm margin issues, not to guess.
Figure F9 — Tri-metrics view and the main bottleneck chain for small packets and tail latency
Small-packet performance is constrained by per-packet overhead (queues, lookups, rings/DMA). Tail latency is diagnosed by time-aligning p99/p999 with queue depth, ring completion jitter, hit/miss behavior, and (only as confirmation) link-side FEC counters.

H2-10 · Field Debug Playbook: From Symptoms to an Evidence Chain (Drops, Reorder, Timeouts, Drift)

Playbook promise: Treat every incident as an evidence problem, not a guessing contest. For each symptom, capture hardware counters, queue depth, ring/descriptor signals, FEC/BER counters, thermal/rails/throttle, and firmware version + config snapshot with timestamps, then apply a fixed decision order to isolate the root-cause bucket.

Evidence sources (mandatory, time-aligned)

HW counters (per-port / per-queue / per-VF) · Queue depth (histogram) · Ring/descriptor (full/empty/doorbell) · Table hit/miss (if exposed) · FEC/BER counters · Thermal/rails + throttle reasons · Firmware & config snapshot hash

Fixed structure per incident (Symptom → Likely causes → Tests → Fix)

Each item below is intentionally short and operational. Execute tests in order, record before/after snapshots, and change one variable at a time.
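The snapshot discipline above can be sketched as a small helper; the counter names below are placeholders for whatever your card actually exposes:

```python
import hashlib
import json
import time

def incident_snapshot(counters: dict, fw_version: str, config: dict) -> dict:
    """Bundle time-aligned evidence: counters + firmware identity + config hash."""
    return {
        "ts": time.time(),                     # one clock for all evidence streams
        "fw_version": fw_version,
        "config_hash": hashlib.sha256(
            json.dumps(config, sort_keys=True).encode()).hexdigest(),
        "counters": dict(counters),            # copy, so later reads don't mutate
    }

def counter_delta(before: dict, after: dict) -> dict:
    """Diff two snapshots around the failure window; only moved counters matter."""
    return {k: after["counters"][k] - before["counters"][k]
            for k in after["counters"] if k in before["counters"]}
```

Hashing the sorted config is what makes "freeze identity" checkable: two incidents with matching hashes and firmware versions are comparable, otherwise you are debugging two different platforms.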

1) VF intermittent reset / link flap
Symptom
VF resets or link flaps intermittently, often under sustained load or temperature rise.
Likely causes
1) SerDes margin erosion (heat/noise) 2) Persistent derating/power clamp 3) Firmware/driver mismatch after upgrade
Tests
Align time: link state changes ↔ FEC/BER counters ↔ SerDes/board temps.
Check rails: rail minima / VRM current-limit counters around the flap window.
Freeze identity: record firmware version + config snapshot hash before reproducing.
Fix
Reduce thermal stress and confirm margin; if power clamps exist, make them observable; pin/rollback firmware if the issue correlates with a change window.
2) Throughput looks stable but tail latency explodes (p99/p999)
Symptom
Gbps remains high, yet p99/p999 spikes during bursts or high concurrency.
Likely causes
1) Queue depth growth 2) Memory contention / metadata path jitter 3) Shaping absent or mis-scoped
Tests
Queue histogram: capture depth distribution aligned to p99 spikes.
Ring jitter: check DMA completion spread and ring starvation indicators.
Rule pressure: verify hit/miss or recirculation changes during the spike window.
Fix
Apply shaping/burst control, simplify contention points, and validate improvement by queue distribution + p99 together (not averages).
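"Queue distribution + p99 together" implies computing tail quantiles from raw samples, never from averages. A dependency-free nearest-rank sketch over synthetic latencies:

```python
import math

def percentile(samples: list, p: float) -> float:
    """Nearest-rank percentile (p in (0, 100]) over raw latency samples."""
    s = sorted(samples)
    rank = math.ceil(p / 100.0 * len(s))
    return s[max(rank - 1, 0)]

# Synthetic 1000-sample window: the mean stays near 6 us, but the tail is long.
latencies_us = [5] * 989 + [50] * 9 + [400] * 2
# percentile(latencies_us, 50)   -> 5   (median looks healthy)
# percentile(latencies_us, 99)   -> 50  (p99 already 10x the median)
# percentile(latencies_us, 99.9) -> 400 (p999 exposes the microburst damage)
```

This is also why validation must re-measure p99/p999 after a fix: a shaping change can leave the average untouched while collapsing the tail, or vice versa.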
3) pps drop (queue congestion / descriptor starvation)
Symptom
Small-packet rate collapses; large-packet Gbps may look acceptable.
Likely causes
1) Descriptor ring hungry/full oscillation 2) DMA/mapping overhead 3) Queue scheduling overhead
Tests
Ring counters: full/empty events, doorbell behavior, batch size sensitivity.
Mapping cost: compare modes where mapping/pinning changes (observe pps and jitter).
Queue proof: confirm queue depth does not silently grow while pps falls.
Fix
Tune ring sizing and batching, stabilize polling/interrupt balance, and reduce descriptor-path work per packet.
4) Crypto offload correctness (handshake / replay anomalies)
Symptom
Connections fail intermittently, replay-like behavior appears, or correctness deviates only when offload is enabled.
Likely causes
1) Policy domain mismatch (rules vs key domain) 2) Offload boundary mismatch (state/timeout) 3) Version/config drift
Tests
A/B compare: run a controlled bypass path vs offload path and diff outcomes.
Snapshot identity: firmware version + config hash + policy set version at failure time.
Counter focus: per-tenant/VF counters for drops, rejects, and state errors (tenant-scoped).
Fix
Pin/rollback to a known-good identity, tighten policy/key domain separation, and re-validate against a bypass baseline.
5) Compression offload mismatch (boundary / block size issues)
Symptom
Output differs, size ratios look inconsistent, or errors occur only for specific payload patterns or sizes.
Likely causes
1) Block boundary assumptions differ 2) Metadata/state handling differs under burst 3) Resource contention causes timeouts
Tests
Size sweep: reproduce across block sizes and record failure thresholds.
Queue alignment: correlate errors with queue depth spikes and ring starvation.
Identity capture: firmware/config snapshot at both success and failure points.
Fix
Normalize block boundary policy, reduce contention hotspots, and enforce deterministic rollback/pinning for reproducibility.
Figure F10 — Evidence-chain decision flow: symptoms → evidence collectors → root-cause buckets → first fixes
Capture evidence first (counters, queues, rings, FEC/BER, thermal/rails, version snapshots), align in time, then map to the smallest root-cause bucket and apply the least invasive fix before larger changes.

H2-11 · BOM / IC Selection Checklist (Criteria + Part Numbers)

This section turns “SmartNIC/DPU BOM planning” into an engineering checklist: each module lists hard gates, trade-offs, required evidence, and a concrete set of orderable part numbers. The goal is to avoid “marketing-only specs” and keep debug/operability measurable.

Evidence-first counters · Version/rollback control · Thermal & power repeatability · SerDes margin + observability · DDR bandwidth realism

A) How to use this checklist (fast intake → spec lines)

  • Lock inputs first: port speed & count, target Mpps (64B-heavy or not), offload scope (crypto/comp/flow steering), power cap, operating temperature class, and required counters/telemetry for field debug.
  • Score by evidence: prefer parts that expose queue/ring/drop-reason/FEC/rail/throttle counters and support stable firmware/driver pinning + rollback.
  • Turn criteria into procurement language: each module ends with “spec-ready lines” that can be pasted into a purchase spec or supplier questionnaire.

B) Module criteria + concrete part numbers (examples)

Part numbers below are commonly used building blocks. Final selection must match lane counts, connector topology, temperature grade, availability program (some DPUs are sold via platform programs), and compliance requirements.

Module · Hard gates (must-have) · Evidence to require · Example part numbers
Core compute / dataplane
SoC
FPGA
DPU-class
Programmability model (pipeline/firmware), table/queue resources, host interface maturity (PCIe), stable SDK/driver lifecycle. Table hit/miss, drop reason, queue depth, ring starvation, firmware/driver version pinning + rollback proof.
NXP Layerscape: LX2160A, LS1046A
AMD/Xilinx FPGA: XCVU9P-2FLGA2104I, XCKU15P-2FFVE1517I
Intel FPGA: 10AX115N2F45I1SG (Arria 10 GX)
Platform DPUs (often via program): NVIDIA BlueField-2 / BlueField-3 (platform silicon family)
PCIe fabric (optional)
Switch
NTB
Lane/port count, error containment, hot-plug support (if needed), partitioning/virtual switch capability. AER/error counters, port isolation logs, fabric management telemetry, watchdog/reset reason.
Broadcom PCIe Gen4 switch: PEX88096
Microchip Switchtec Gen5: PM50100, PM50084, PM50068, PM50052, PM50036, PM50028
Microchip Switchtec Gen3: PFX/PFX-I family (Switchtec PFX)
PCIe signal integrity
Redriver
Retimer
Supports target PCIe generation, channel count matches lanes, link-training friendly behavior, low-added jitter. Equalization settings readback, link training stability notes, margin/eye hooks (as supported), failure reason logs.
TI PCIe redrivers: DS80PCI402 (x4), DS80PCI810 (x8)
TI PCIe Gen4 redriver: DS160PR410 (x4)
Ethernet/SerDes conditioning
NRZ
PAM4
Rate coverage (10/25/50G classes), retimer vs redriver boundary (CDR needed or not), channel loss budget fit. BER estimate, FEC corrected/uncorrected counters, training status, Rx/Tx EQ readback.
TI retimers: DS125DF410 (9.8–12.5Gbps, x4), DS280DF810 (20.2–28.4Gbps, x8)
DDR memory
Bandwidth
ECC
Bandwidth under concurrency (buffer + descriptors + tables), ECC support, temperature behavior (refresh/power). ECC error counters visibility, throttling events, memory controller perf counters (as available).
Micron DDR4 SDRAM: MT40A512M16TB-062E-R (example orderable DDR4 component)
Telemetry (PMBus/I²C)
Current
Power
Rail voltage/current/power measurement coverage, addressability, accuracy/averaging control. Per-rail min/max, alert logs, time-stamped event capture (if board controller supports it).
TI PMBus monitor: INA233 (example ordering: INA233AIDGSR)
VRM control
Multiphase
Digital
Transient response, phase scaling, control-bus integration, predictable throttle/limit behavior. OCP/OTP counts, loop status (as available), configuration readback, brownout/throttle reason.
Renesas digital multiphase PWM: ISL68137
Renesas smart power stage (example pairing): ISL99227 (often used with Renesas controllers)
Board controller
MCU
Logging
Robust boot/update, watchdog strategy, event logging, safe GPIO policy for resets/power-good fan-in. Reset reasons, rail/thermal snapshots, firmware version hash, “last known good” config record.
NXP: LPC55S69JBD100
ST: STM32G0B1KET6, STM32H743IIT6
Root-of-trust
TPM
Measured boot
Standard compliance, interface fit (SPI/I²C), lifecycle support, provisioning flow. Attestation capability (as used), firmware measurement logs, update/rollback policy proof.
Infineon OPTIGA™ TPM: SLB-9670VQ2-0 (family), example orderable code: SLB9670VQ20FW785XTMA1
Non-volatile storage
QSPI
NOR
Endurance, dual-image support, secure update/rollback compatibility. Image A/B selection evidence, rollback event log, write-protect policy.
Winbond: W25Q256JV, W25Q512JV
Macronix: MX25U25645G, MX25U51245G
Figure F11 — BOM criteria map: module → criteria → evidence
Use this map to keep BOM decisions “evidence-driven”: each module should provide measurable counters, stable version pinning, and predictable behavior under heat/power stress—otherwise peak throughput numbers rarely survive real deployments.

C) “Marketing spec” vs “real pitfall” (questions to force clarity)

  • “100G line-rate”. Ask: what is the 64B Mpps with real rule depth and multi-queue concurrency? Require: queue depth/drops by reason, pipeline hit/miss, p99/p999 latency snapshots.
  • “Supports PCIe Gen4”. Ask: is the link stable across temperature/board variance without manual EQ babysitting? Require: training stability logs, AER counters, margin hooks (as supported), redriver EQ readback.
  • “Secure boot”. Ask: is there a measured chain + rollback + audit trail, or only a signature check? Require: firmware measurement records, version hash snapshot, rollback event log.
  • “Low power”. Ask: does power stay predictable under burst load, or does throttling silently change behavior? Require: rail min/max, OCP/OTP counts, throttle reason + count, thermal hotspot logging.
  • “High bandwidth memory”. Ask: does memory bandwidth hold under mixed traffic (buffers + descriptors + tables)? Require: ECC/error counters + perf counters, drift across temperature and refresh behavior.

D) Spec-ready lines (copy/paste into procurement docs)

  • PCIe switch (if used): must support required lane/port count and provide AER/error counters, partition/NTB diagnostics, and watchdog/reset logs.
  • PCIe redriver/retimer: must support target PCIe generation and expose EQ configuration/readback and training stability notes; no “black-box only” parts.
  • SerDes retimer: must provide BER estimate and FEC corrected/uncorrected counters; training status must be readable.
  • DDR: ECC required; ECC counters must be readable; behavior under high temperature and refresh changes must be documented.
  • VRM + telemetry: rail min/max, OCP/OTP counts, throttle reasons and counts must be logged and readable via PMBus/I²C + board controller.
  • Root-of-trust: measured boot and attestation-capable RoT/TPM preferred; firmware A/B rollback and audit snapshots must be supported.
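The spec lines above can also be enforced mechanically at intake: score each candidate part against its module's hard gates before comparing peak numbers. A sketch with illustrative gate names (tailor them to your RFQ):

```python
# Hard gates per module class -- the names here are illustrative placeholders,
# not a standard vocabulary; map them onto your own spec lines.
HARD_GATES = {
    "serdes_retimer": {"ber_estimate", "fec_counters", "training_status"},
    "ddr":            {"ecc", "ecc_counters_readable"},
    "vrm":            {"rail_min_max", "ocp_otp_counts", "throttle_reason_log"},
    "rot":            {"measured_boot", "ab_rollback", "audit_snapshot"},
}

def gate_check(module: str, advertised: set) -> tuple:
    """Return (passes, missing_evidence) for one candidate part."""
    missing = HARD_GATES[module] - advertised
    return (not missing, sorted(missing))

# A retimer that ships FEC counters but no training-status readback fails
# intake regardless of its datasheet bandwidth numbers.
```

The point is not the code but the inversion: missing evidence disqualifies a part outright, and only gate-passing parts get ranked on performance.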


H2-12 · FAQs (SmartNIC / DPU) — Answers + Evidence-First Checks

These FAQs target engineering “why” questions while staying inside this page’s scope: card-level dataplane offload, PCIe/virtualization integration, queues/memory behavior, PAM4/retimer bring-up, security/firmware chain, power/thermal telemetry, and performance (Gbps vs Mpps vs tail latency).

Boundary · Offload trade-offs · P4 pipeline limits · PCIe & SR-IOV · Queues & microburst · PAM4 & retimers · RoT & rollback · Thermal throttling · Mpps & tail latency
1) What is the practical engineering boundary between a SmartNIC and a DPU?

The boundary is not the form factor but the degree of dataplane independence. A SmartNIC typically accelerates specific functions (steering, vSwitch blocks, crypto/compression) while still depending on host resources for control and state. A DPU behaves like an independent subsystem on the card: its own cores, memory domains, and a broader programmable dataplane/control plane that can offload host networking and security primitives.

  • Use-case test: can the card enforce dataplane policy and collect counters even when host CPU is saturated?
  • Scope note: this compares card-level roles only (no UPF/MEC application placement).
Programmability · Independent resources · Host bypass depth
2) When can crypto offload make latency worse instead of better?

Crypto offload can worsen latency when it introduces extra queueing and batching on the card, or when the data path becomes DMA-heavy with small records. Typical regressions appear with small packets, short TLS records, frequent key context switches, or when crypto shares memory bandwidth with flow tables and queue managers. The result is improved throughput but worse tail latency (p99/p999).

  • Measure first: queue depth, crypto engine busy time, DMA/ring starvation counters.
  • Fix direction: reduce batch size, separate queues, and avoid unnecessary host↔card copies.
Queueing · DMA overhead · Tail latency
3) Why does throughput collapse when match/action rules get more complex?

A programmable dataplane is a bounded pipeline. As rule complexity grows, the design hits limits in table depth, table miss penalties, action expansion (more memory reads/writes), and shared bandwidth in the packet/metadata paths. Even if the line rate is unchanged, the pipeline may require recirculation, additional stages, or slower memory accesses—causing a cliff in Mpps.

  • Evidence: hit/miss counters, stage utilization, memory bandwidth/latency counters (where available).
  • Design lever: simplify actions, reduce table lookups per packet, and keep “fast path” rules shallow.
Pipeline depth · Miss penalty · Memory bandwidth
4) SR-IOV is fast—why does observability often get worse?

SR-IOV improves performance by giving VFs direct-ish access to queues, but it often splits the evidence chain. Counters become fragmented across PF/VF, host stack visibility is reduced, and some drops/rewrites happen in paths that traditional host tools cannot see. If firmware, PF driver, and VF driver versions drift, the same symptoms may map to different counter meanings—making field debug slower.

  • Evidence: PF/VF queue depth, drop-reason counters, reset reasons, version snapshots.
  • Mitigation: define “golden counters” that must remain readable at PF level.
PF/VF counter split · Tooling blind spots · Version drift
5) Is “more VFs” always better? Where do queues/descriptors bottleneck first?

More VFs increases isolation and scheduling flexibility, but bottlenecks usually appear first in metadata plumbing: descriptor rings, doorbells, interrupt moderation/polling loops, and memory bandwidth for queue state. Once many VFs compete, ring starvation and cache/memory contention raise tail latency and reduce Mpps even when aggregate Gbps looks fine.

  • Evidence: ring starvation, doorbell rate, queue occupancy distribution across VFs.
  • Rule of thumb: scale VF count only if per-VF counters remain measurable and stable under load.
Descriptor rings · Doorbells · Memory contention
6) How do microbursts “punch through” SmartNIC buffers?

Microbursts overwhelm buffers because arrival rate briefly exceeds service rate, and the card’s service rate can degrade under pressure. As queues grow, descriptor processing, memory arbitration, and scheduler work increase, which can reduce effective dequeue speed—creating a feedback loop. The visible symptoms are short spikes of drops and a sharp rise in p99/p999 latency even when average utilization seems acceptable.

  • Evidence: queue depth time series, drop-by-reason, ring starvation during burst windows.
  • Mitigation direction: isolate bursty flows, tune shaping/priority, and avoid shared hot queues.
Queue feedback · Descriptor pressure · Tail latency
7) PAM4 links come up but flap intermittently—what are the three most common root causes?

Three common causes are: (1) margin is barely positive (insertion loss/connectors/board variation), (2) training/EQ sensitivity (settings that “work once” but drift), and (3) environmental drift from temperature or power noise that shifts the eye and pushes FEC beyond its comfort zone. Intermittent flaps often correlate with warm-up, vibration, or specific cable/port combinations.

  • Evidence: FEC corrected/uncorrected counters, retraining counts, BER/eye margin indicators (as supported).
  • Debug order: PHY margin → training stability → protocol/perf.
Margin · Training drift · Thermal/power noise
8) How to choose Retimer vs Redriver—and why does a retimer often make debug harder?

A redriver boosts amplitude/EQ but keeps the same clocking domain; a retimer adds CDR and can recover a cleaner clock when channel loss is high. Debug becomes harder because a retimer introduces additional state machines (training, adaptation) and a new counter/visibility boundary. If the retimer does not expose training results and error reasons, failures look “random” across ports and temperature.

  • Choose retimer when: loss budget forces CDR/equalization beyond redriver capability.
  • Require: training status readout, error codes, and margin-related counters.
CDR · State machines · Counter visibility
9) How can the firmware on the card be proven trustworthy, rollbackable, and auditable?

Proof requires three artifacts: (1) a secure/measured boot chain that records what firmware actually ran, (2) an A/B image update path with deterministic rollback, and (3) audit snapshots that bind firmware version, configuration, and key policy to a verifiable identifier. Without these, “secure boot” may only mean signature checking, not field-grade recoverability and accountability.

  • Evidence: measured-boot logs, rollback events, version+config hash snapshots.
  • Operational rule: upgrades must be pin-able and reproducible across fleets.
Measured boot · A/B rollback · Audit snapshot
10) After running hot for a while, performance drifts—how to separate thermal throttling from queue congestion?

Distinguish by aligning two evidence streams. Thermal throttling shows throttle counters, rising hotspot temperature, and rail droop/VRM current-limit events that correlate with the throughput drop. Queue congestion shows sustained queue depth growth, ring starvation, and drop-by-reason increases without matching thermal/throttle triggers. The key is time-correlation: “perf drop timestamp” must match one stream strongly.

  • Evidence: temperature/rail min-max + throttle reason vs queue depth + ring starvation.
  • First action: capture a synchronized snapshot (telemetry + queue counters) during the drift window.
Time correlation · Throttle counters · Queue evidence
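The time-correlation rule can be made quantitative: correlate the throughput series against throttle counters and against queue depth, then attribute the drift to whichever evidence stream correlates more strongly. A dependency-free Pearson sketch over synthetic series:

```python
def pearson(x: list, y: list) -> float:
    """Pearson correlation of two equal-length numeric series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

# Synthetic drift window: throughput steps down exactly when throttle fires,
# while queue depth stays noisy but flat.
throughput = [100, 100, 100, 70, 70, 70]
throttle   = [0,   0,   0,   1,  1,  1]    # throttle-active indicator
queue      = [3,   5,   2,   4,  3,  5]    # queue depth samples
# |pearson(throughput, throttle)| ~ 1.0 while |pearson(throughput, queue)| is
# weak -> thermal/throttle bucket, not congestion.
```

Correlation is a triage signal, not proof; it tells you which evidence stream to drill into first, after which throttle reason codes or queue histograms make the call.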
11) Why can a “100G” card show very low Mpps with 64-byte packets—and what three checks come first?

100G throughput does not guarantee small-packet rate because fixed per-packet costs dominate. The first three checks are: (1) queue/interrupt/polling policy (interrupt storms or poor moderation), (2) descriptor/DMA efficiency (ring starvation, mapping overhead, extra copies), and (3) pipeline rule depth (table lookups/action expansion per packet). Link/FEC issues usually show distinct error counters.

  • Evidence: interrupt/poll counters, ring starvation, pipeline hit/miss, drop reasons.
  • Fix direction: reduce per-packet overhead before chasing raw bandwidth.
Interrupt vs polling · DMA/rings · Rule depth
12) Which selection metrics are most often misleading in marketing—and what must be demanded instead?

Common traps include “line-rate,” “PCIe Gen5,” “HBM bandwidth,” “secure boot,” and “low power.” Each needs a corresponding field-grade evidence requirement: Mpps and tail latency under real rule depth, AER/training stability across temperature, ECC/error visibility, measured boot + rollback logs, and throttle/rail telemetry with time-stamped events. If counters and rollback are missing, fleet operability cost often exceeds performance gains.

  • Demand: counters, failure-reason codes, and version+config snapshots.
  • Reject: “black-box performance” without diagnostics.
Evidence-first · Rollback control · Telemetry
Figure F12 — Fast triage map: symptom → evidence stream → likely bucket
This map keeps troubleshooting inside the card-level scope: diagnose by matching the symptom timestamp to the strongest evidence stream (queues/rings, link counters, or thermal/rail telemetry) before changing pipeline rules or blaming the network.