DPU / SmartNIC: Programmable Dataplane Offload Guide
A DPU/SmartNIC turns dataplane work (switching/overlay, crypto/compression, and telemetry) into a programmable, measurable card-level pipeline so the host CPU stays stable and predictable. Its practical value is proven with evidence (queue/telemetry counters, thermal events, and repeatable p99/p999 tests) rather than with peak throughput alone.
DPU / SmartNIC: boundary, role, and why it matters now
A SmartNIC adds programmable and offload capabilities to a NIC to reduce CPU work on repetitive per-packet tasks. A DPU pushes this further: it is a programmable, data-plane–centric subsystem (often a PCIe card) designed to harden infrastructure datapaths—offload, isolation, telemetry, and lifecycle control—so application cores stay focused on business workloads.
| Work item | CPU | SmartNIC | DPU |
|---|---|---|---|
| Per-packet repeat work | High overhead, jitter-sensitive | Offloads selected fast-path actions | Offloads + makes fast-path more deterministic |
| Infrastructure datapath (vSwitch/overlay hooks) | Consumes cores and cache; hard to isolate | Assists; may still rely on host for orchestration | Runs as a hardened datapath subsystem with policy + counters |
| Security offload (crypto / policy enforcement) | Possible but expensive under load spikes | Often accelerates specific primitives | Accelerates primitives + enforces datapath boundaries and audit evidence |
| Telemetry & evidence (counters/logs/time) | Software-only visibility; sampling gaps | Provides taps and counters for selected paths | First-class telemetry + event logs tied to datapath stages |
- Core efficiency pressure: infrastructure datapaths steal sellable CPU cycles. Offload returns cores to applications and stabilizes runtime variance.
- Tail-latency sensitivity: p99/p999 often breaks first. Queues, interrupts, and cache interference make CPU-based fast paths hard to keep deterministic.
- Isolation and auditability: multi-tenant boundaries and compliance require stronger separation and better evidence than “best-effort software counters.”
- Is the pain CPU jitter or link/PHY issues? If p99 spikes correlate with CPU contention rather than cabling/SerDes faults, dataplane offload is a strong candidate.
- Is the goal throughput or determinism? If average Gbps is fine but p99/p999 is unacceptable, focus on queueing, backpressure, and telemetry—areas DPUs address directly.
- Is hardened infrastructure needed? If security boundaries, audit evidence, and lifecycle control matter, DPU-style separation usually beats ad-hoc host processing.
Architecture overview: the building blocks of a programmable dataplane
A DPU/SmartNIC dataplane is best understood as a pipeline of blocks that answers two questions: what action to take (parse + match/action) and when to take it (queues/QoS/backpressure). Most field failures show up as queue growth, counter anomalies, and thermal throttling long before average throughput looks “bad.”
Parser → Match/Action → Stateful Table → QoS/Queue → Crypto/Compress → DMA/PCIe
The “programmable region” typically spans parsing, classification, policy actions, and state updates. Fixed accelerators often implement crypto/compression primitives, while DMA/PCIe ties the dataplane to host memory and virtualization queues.
Parser
Metric: packets/s sustained under mixed headers · Pitfall: complex header stacks increase per-packet work, pushing queues downstream.
Match / Action
Metric: table hit-rate + rule update latency · Pitfall: table expansion increases misses, shifting pressure to external memory and raising tail latency.
Stateful Table
Metric: state update rate + contention indicators · Pitfall: state hot-spots create backpressure even when link throughput looks fine.
QoS / Queues (tail-latency root)
Metric: queue depth + drops + scheduling latency · Pitfall: average remains good while p99/p999 collapses under bursts and short flows.
Crypto / Compression (accelerators)
Metric: utilization + per-flow setup overhead · Pitfall: small packets/short flows pay startup cost without gaining net benefit, inflating tail latency.
DMA / PCIe (host interface)
Metric: completion latency + backlog counters · Pitfall: batching/doorbell patterns can turn into jitter sources if queue management is mis-sized.
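The tail metrics in the stage list above all reduce to percentile math over sampled values. A minimal sketch (plain Python; the latency samples are hypothetical) of nearest-rank p50/p99/p999:

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile of a non-empty sample list (0 < p <= 1)."""
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered))   # 1-based nearest rank
    return ordered[rank - 1]

# Hypothetical per-packet latencies (microseconds): a stable body plus a
# small set of queueing outliers, the shape that averages hide.
lat_us = [10.0] * 985 + [250.0] * 13 + [4000.0] * 2

p50 = percentile(lat_us, 0.50)    # 10.0  (the body looks healthy)
p99 = percentile(lat_us, 0.99)    # 250.0 (queueing shows up here first)
p999 = percentile(lat_us, 0.999)  # 4000.0 (the burst-and-drain outliers)
```

This is why "average looks fine, p99/p999 collapses" is possible: 98.5% of samples can be perfect while the tail carries all the queueing evidence.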
Programmability models: P4 vs eBPF vs fixed pipeline
Programmability in a DPU/SmartNIC is not a single feature. Different models optimize for different engineering realities: change velocity, deterministic behavior, and state/entry footprint. The goal is to choose a model that matches the task shape, not the marketing label.
P4 (match/action-first)
Best fit for table-driven packet decisions where actions are explicit and repeat per packet. Strong when rules change often and fast-path behavior must remain explainable through counters.
eBPF (event/hook-first)
Best fit for observability, filtering, and policy hooks where signals and events matter more than pure match/action throughput. Strong for rapid iteration and field debugging through targeted instrumentation.
Fixed pipeline / hard datapath (determinism-first)
Best fit when the protocol and behavior are stable and deterministic latency matters more than feature churn. Strong for high throughput with predictable timing, but iteration cost is higher.
| Primary axis | P4 | eBPF | Fixed pipeline |
|---|---|---|---|
| Change velocity | High (rule-driven changes) | Very high (hook-driven iteration) | Low (stable behavior preferred) |
| Deterministic latency | Good when actions are bounded | Varies with hook placement | Best (tight timing paths) |
| State / entries footprint | Scales with tables and actions | Scales with instrumentation scope | Bounded but less flexible |
Observability / filtering / tracing? → Prefer eBPF
Hard determinism with stable behavior? → Prefer fixed pipeline
Tail-latency is the KPI? → Prioritize bounded queues + counters
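To make the "match/action-first" idea concrete, here is a toy Python model of a single table stage with hit/miss counters. This is illustrative only: it is not P4 code and not any vendor's API; `MatchActionTable` and the keys are invented for the sketch.

```python
class MatchActionTable:
    """Toy model of one match/action stage: exact-match key -> action, with
    per-rule hit counters and a miss counter so behavior stays explainable.
    All names and keys are illustrative, not a real P4 target API."""
    def __init__(self, default_action):
        self.rules = {}     # key -> action callable
        self.hits = {}      # key -> hit count
        self.misses = 0
        self.default_action = default_action

    def add_rule(self, key, action):
        self.rules[key] = action
        self.hits[key] = 0

    def apply(self, key, pkt):
        action = self.rules.get(key)
        if action is None:
            self.misses += 1              # misses are evidence, not noise
            return self.default_action(pkt)
        self.hits[key] += 1
        return action(pkt)

# ACL-style usage: one explicit rule, everything else hits the default.
acl = MatchActionTable(default_action=lambda pkt: "drop")
acl.add_rule("10.0.0.0/24", lambda pkt: "forward:port1")
verdict_a = acl.apply("10.0.0.0/24", {"len": 64})     # rule hit
verdict_b = acl.apply("192.168.0.0/16", {"len": 64})  # default (miss counted)
```

The point of the sketch is the counters: in a table-driven fast path, every decision should be attributable to a rule and a counter, which is what keeps the behavior explainable in the field.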
Offload deep dive: vSwitch, overlay, and service chaining
vSwitch and overlay paths consume CPU not because they are “complex,” but because they are repeat-per-packet: encapsulation/decapsulation, classification, rule lookups, counters, and policy actions. Offload moves these repeat actions into a bounded dataplane and leaves the host CPU primarily for control logic and workloads.
| Before: CPU path | After: DPU/SmartNIC path |
|---|---|
| Packet parsing + overlay encap/decap · Lookup + policy actions · Counters in software (sampling gaps) · Bursts inflate queues/interrupt pressure | Parser + match/action in dataplane · Queues/QoS bounded under bursts · Counters + timestamps as first-class taps · Events/logs anchor debug evidence |
State consistency
Symptom: new policy is “applied” but some flows behave as if old rules remain. Root cause is often a mismatch between control-plane commit windows and dataplane table activation. Validation should rely on hit counters and timestamped activation events.
Policy update latency
Symptom: bursts or drops during rule updates. Root cause is frequently table update pressure that shows up first as queue depth growth. Validation should correlate update windows with queue depth, drops, and counter discontinuities.
Replay / rollback readiness
Symptom: rollback does not fully restore previous behavior. Root cause is incomplete snapshots or inconsistent states across dataplane stages. Validation should require versioned policy summaries and event logs with ordering.
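The state-consistency check above can be automated once counters and events share a time base. A sketch, assuming hypothetical inputs (activation events plus per-version hit samples); the function name and grace window are illustrative:

```python
def stale_hits_after_activation(events, hit_samples, grace=0.5):
    """events: [(ts, activated_version)]; hit_samples: [(ts, rule_version)]
    for each observed rule hit. Returns hits on superseded versions seen
    after activation_ts + grace, i.e. evidence that old rules still fire."""
    stale = []
    for act_ts, new_version in events:
        for hit_ts, hit_version in hit_samples:
            if hit_ts > act_ts + grace and hit_version < new_version:
                stale.append((hit_ts, hit_version))
    return stale

events = [(100.0, 2)]                              # v2 activated at t=100
hits = [(99.0, 1), (100.2, 1), (101.0, 2), (102.0, 1)]
leftovers = stale_hits_after_activation(events, hits)
# Only (102.0, 1) is flagged: a v1 hit well past the activation window.
# The v1 hit at 100.2 falls inside the commit/grace window and is expected.
```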
Crypto & compression offload: throughput, tail latency, and key boundaries
Offload is valuable when it removes repeat-per-packet work from the host CPU and converts it into a bounded dataplane behavior. The most common evaluation mistake is to focus on Gbps alone and ignore pps, session shape, and the key lifecycle boundary that determines stability during updates and rollbacks.
5.1 Crypto offload (concept): where it sits and what to measure
Placement
Inbound / outbound datapath stages with counters and timestamps around the crypto block
Why tail latency improves (when it does)
CPU jitter drops when per-packet compute is removed from shared cores and moved into a bounded queue + accelerator path
Key metrics
Gbps (large packets) · pps (small packets) · sessions (concurrency) · key update rate (rotation/rekey events)
5.2 Compression offload: when it helps and what it costs
Good fit
Bandwidth- or IO-bound paths where reducing bytes dominates (e.g., egress links, log/telemetry aggregation, storage write paths).
Watch-outs
Compression adds compute + memory movement. High compression ratio can still lose if it increases queueing or triggers thermal throttling.
5.3 Why offload can become slower (field failure patterns)
Pitfall A — Small packets / short flows
Symptom: average throughput looks fine, but p99/p999 worsens. Likely cause: setup/warm-up cost exceeds per-packet benefit. Signals: session create rate, completion latency spikes, queue depth growth.
Pitfall B — Burst backpressure
Symptom: bursts cause drops or long drains. Likely cause: accelerator saturation and backpressure into queues. Signals: utilization near ceiling, qdepth rising, drops during burst windows.
Pitfall C — Key updates cause micro-outages
Symptom: periodic latency spikes during rotation/rekey. Likely cause: lifecycle boundary not aligned with dataplane activation windows. Signals: key-update events, policy/version counters, hit counter discontinuities.
Pitfall D — Compression “wins ratio” but loses latency
Symptom: fewer bytes, but worse tail. Likely cause: compute + memory movement plus thermal/power limits. Signals: throttle events, latency histogram widening, queue drain time increasing.
5.4 Key boundary principles (device-agnostic)
- Keep long-lived roots inside a dedicated trust boundary. The dataplane should operate with derived or wrapped material, not raw roots.
- Prefer wrapped/derived session keys for fast rotation. Rotation must be observable (events) and reversible (rollback-safe).
- Make activation auditable. Tie key/policy versions to timestamped events and counters so behavior changes can be explained.
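As one way to honor the "derived, not raw roots" principle, the sketch below derives per-session keys from a long-lived root with HKDF-SHA256 (RFC 5869), using only the Python standard library. The context labels and rotation counter are illustrative assumptions, not a prescribed format:

```python
import hmac, hashlib

def hkdf_sha256(root_key: bytes, salt: bytes, info: bytes, length: int = 32) -> bytes:
    """HKDF (RFC 5869) with SHA-256: extract a PRK from the root, then
    expand it under a context label. The root never leaves this function's
    caller-side trust boundary; only derived keys do."""
    prk = hmac.new(salt, root_key, hashlib.sha256).digest()          # extract
    okm, block, counter = b"", b"", 1
    while len(okm) < length:                                         # expand
        block = hmac.new(prk, block + info + bytes([counter]), hashlib.sha256).digest()
        okm += block
        counter += 1
    return okm[:length]

root = b"\x01" * 32          # long-lived root: stays inside the trust boundary
# Rotation is a counter bump in the context label, observable as an event.
k1 = hkdf_sha256(root, b"salt", b"session:42:rotation:1")
k2 = hkdf_sha256(root, b"salt", b"session:42:rotation:2")
```

Because derivation is deterministic, activation can be audited by version: the same (root, label) pair always yields the same key, and a rotation event maps one-to-one to a label change.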
Host interface & virtualization: PCIe + SR-IOV from a DPU viewpoint
Virtualization instability usually does not come from “virtualization” itself. It comes from how queues, DMA batching, and doorbell pacing interact with isolation boundaries. The engineering objective is to keep tail latency bounded while preserving isolation across tenants and workloads.
6.1 PF vs VF roles (minimal, practical)
PF (control/ownership)
Controls resources and configuration: queue creation, policy knobs, and ownership boundaries.
VF (data queues)
Represents isolated data-plane queues for VMs/containers. Stability depends on queue sizing and scheduling fairness.
6.2 DMA, queues, batching, doorbells (tail-latency levers)
| Lever | What it changes | Typical tail-latency symptom |
|---|---|---|
| Queue depth | Buffering under bursts vs drain time | Too shallow → drops; too deep → p99 grows from waiting |
| Batch size | Efficiency vs waiting for a batch | Periodic spikes and wider latency histogram |
| Doorbell pacing | Update frequency for completions/consumption | “Heartbeat” jitter (regular micro-spikes) |
| Backpressure policy | How overload propagates | Burst → long drain tails, drops in windows |
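The queue-depth row in the table above can be illustrated with a deterministic tick simulation: the same burst pushed through a shallow and a deep bounded FIFO. All numbers are illustrative; no real device behaves exactly like this, but the drop-vs-wait trade-off has the same shape:

```python
from collections import deque

def run_queue(burst, service_per_tick, depth, ticks):
    """Bounded FIFO served at a fixed rate. Returns (drops, max_wait_ticks)."""
    q, drops, waits = deque(), 0, []
    for t in range(ticks):
        for _ in range(burst[t] if t < len(burst) else 0):
            if len(q) < depth:
                q.append(t)               # remember the arrival tick
            else:
                drops += 1                # shallow queue sheds the burst
        for _ in range(service_per_tick):
            if q:
                waits.append(t - q.popleft())
    return drops, max(waits) if waits else 0

burst = [40]                              # 40 packets arrive in one tick
shallow = run_queue(burst, service_per_tick=4, depth=8, ticks=20)
deep = run_queue(burst, service_per_tick=4, depth=64, ticks=20)
# shallow -> (32, 1): most of the burst dropped, but drain is quick
# deep    -> (0, 9): nothing dropped, but the last packet waits ~burst/rate
```

This is the "too shallow → drops; too deep → p99 grows from waiting" symptom in miniature: the deep queue converts loss into latency, which is exactly what a p99/p999 histogram picks up.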
6.3 Isolation concepts: IOMMU / ATS (why it affects performance)
- IOMMU introduces translation and consistency paths that can surface in the latency tail when queues and batching are mis-sized.
- ATS-related behaviors can reduce some of that overhead, but they shift sensitivity toward consistency and fault-handling paths.
- Practical rule: stronger isolation increases the need for clean queue evidence (qdepth, drops, completion latency, per-VF counters).
6.4 Troubleshooting cards (problem → symptom → likely cause → what to check)
Problem: p99 jumps after enabling SR-IOV
Symptom: average is stable, tail widens. Likely cause: queue depth too large/small, aggressive batching, doorbell pacing. Check: qdepth, completion latency, batch indicators, drops.
Problem: VFs interfere with each other
Symptom: one tenant load impacts others. Likely cause: shared scheduling or shared accelerator saturation. Check: per-VF counters, utilization, scheduling latency.
Problem: periodic “heartbeat” jitter
Symptom: spikes at regular intervals. Likely cause: doorbell/batch rhythm or reclamation cadence. Check: timestamped events, latency histogram, cadence correlation.
Problem: burst causes long recovery
Symptom: drain time is long after burst ends. Likely cause: backpressure chain and queue drain behavior. Check: backpressure counters, drops, drain time metrics.
Problem: link-like symptoms (separate track)
Symptom: intermittent retries/recovery and throughput swings. Action: treat as link-level suspicion and link to PCIe Switch/Retimer page (no tuning here).
Memory & DDR power/monitoring: when tables, sessions, and buffers become the bottleneck
On a DPU/SmartNIC, external memory is often the difference between “feature works” and “system stays stable.” Tables and sessions grow quietly, buffers activate under bursts, and telemetry/logs compete for bandwidth. When access patterns become random or congested, effective bandwidth collapses and tail latency rises even if average throughput looks acceptable.
7.1 Three DDR roles in a DPU
Tables & session state
Large flow/ACL/session entries and stateful counters that exceed on-chip capacity.
Buffers & reordering
Burst absorption, staging, and reordering that convert overload into queueing time.
Telemetry & log buffering
Event/ring buffering and evidence capture that must remain stable during fault storms.
7.2 The three performance traps that “drag you down”
Trap A — Random access reduces effective bandwidth
Symptom: average throughput survives, but p99/p999 worsens. Likely cause: sparse keys and poor locality turn memory into scattered fetches. Check: queue depth growth, completion/processing latency widening, retry-like counters (concept).
Trap B — Table inflation amplifies cache misses
Symptom: the more rules/entries, the more jitter appears. Likely cause: bigger entries and expanded hotsets increase misses and churn. Check: table occupancy trend, hit-rate proxies, “hotset” drift indicators (concept).
Trap C — Congestion turns into a long tail
Symptom: a burst ends, but recovery remains slow. Likely cause: buffering/reordering creates waiting time; drain time dominates tail. Check: qdepth, drops, drain time, backpressure counters.
7.3 Order-of-magnitude sizing: how big can tables get?
Quick sizing (order-of-magnitude)
Total ≈ entries × bytes/entry → MB / GB
| Sizing note | Why it matters |
|---|---|
| bytes/entry grows silently | Actions, counters, and alignment can expand entries beyond “field list” intuition. |
| entry count is not the only risk | Access distribution (locality vs scatter) is what determines effective bandwidth and tail. |
| buffers activate under bursts | Even “small tables” can jitter when bursts force deep buffering and long drain time. |
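A minimal calculator for the sizing rule above, with an alignment round-up that models the "bytes/entry grows silently" row. The counter size and alignment stride are assumptions for illustration, not device values:

```python
def table_bytes(entries, field_bytes, counter_bytes=16, align=64):
    """Order-of-magnitude sizing: Total ~ entries x bytes/entry.
    counter_bytes models per-entry counters/action state; align models
    padding to the memory stride (both illustrative assumptions)."""
    per_entry = field_bytes + counter_bytes
    per_entry = -(-per_entry // align) * align     # round up to alignment
    return per_entry, entries * per_entry

# Hypothetical flow table: 1M entries, 40B of match fields + action id.
per_entry, total = table_bytes(1_000_000, field_bytes=40)
# per_entry rounds 56B up to 64B, so total is 64 MB rather than the
# "40 MB" a field-list intuition would suggest.
```

Note the second table row still applies: this number bounds capacity, not behavior. Two tables of identical size can differ wildly in tail latency depending on access locality.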
7.4 What to monitor (telemetry fields, device-agnostic)
Thermal
Memory temperature signals and controller-adjacent temperature (concept): rising trends predict throttle and jitter.
Rails (V/I)
Rail voltage/current for memory domains (concept): sudden current step often follows activation of buffers/reorder paths.
Error & retry counters
ECC/repair/retry-like indicators (concept): spikes correlate with widened tail and reduced effective bandwidth.
Timestamped events
Throttle, retries, crash/restart markers: evidence must share the same time base as queue/counter logs.
Power, thermals & telemetry: making observability a first-class dataplane feature
In production, DPU performance is limited by power and thermals long before “feature completeness.” The critical skill is to treat telemetry as evidence: identify throttle windows, correlate queue growth, and confirm whether backpressure or accelerator saturation explains p99/p999 behavior.
8.1 Telemetry fields grouped by diagnostic purpose
Power wall
Power/current/rail status: step changes often precede throttle or unstable queue drain.
Thermals & frequency state
Temperature + frequency/throttle flags: the fastest way to explain “tail got worse without traffic change.”
Congestion evidence
Queue depth, drops, retries, drain time: converts “it feels slow” into measurable queueing.
Accelerator utilization
Crypto/compress utilization: confirms saturation vs mis-sized queues or pacing.
8.2 Event logs: the timestamp consistency rule
- Events matter more than snapshots. Power-fail, throttle, link-flap, and firmware-crash markers should be stored as ordered events.
- One time base. Queue/counter samples and events must share a consistent timestamp domain so a causal timeline can be built.
- Evidence first. Correlate events with queue spikes and utilization changes before changing policies.
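The one-time-base rule can be applied mechanically: pull the samples that fall inside a window after each event. A sketch with hypothetical event names, window size, and values:

```python
def samples_after(events, samples, window=2.0):
    """events: [(ts, name)]; samples: [(ts, value)] on the SAME time base.
    Returns, per event, the samples observed within `window` after it,
    so 'throttle -> queue growth' can be read off directly."""
    timeline = {}
    for ts, name in events:
        timeline[(ts, name)] = [(s_ts, v) for s_ts, v in samples
                                if ts <= s_ts <= ts + window]
    return timeline

events = [(10.0, "throttle_enter"), (30.0, "throttle_exit")]
qdepth = [(9.5, 3), (10.5, 40), (11.2, 55), (29.9, 60), (31.0, 8)]
tl = samples_after(events, qdepth)
# qdepth jumps from 3 to 40..55 right after throttle_enter and recovers
# after throttle_exit: a service-rate change explains the queue growth
# without any traffic change.
```

If events and samples do not share a timestamp domain, this correlation is impossible, which is exactly why the rule is stated before any tuning advice.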
8.3 Why thermal policy hits p99/p999 (not just averages)
Causal chain (practical)
Temperature rises → throttle/derating → service rate drops → queue depth grows → drain time dominates → p99/p999 expands.
8.4 Troubleshooting order (fastest path to root cause)
| Step | What to check first | What it proves |
|---|---|---|
| 1 | Throttle / frequency events | Whether the system’s service rate changed |
| 2 | Queue depth / drain time / drops | Whether tail is queueing-driven |
| 3 | DMA backpressure / completion latency | Whether pacing or backpressure explains spikes |
| 4 | Crypto/compress utilization | Whether accelerators are saturated vs underused |
Secure boot & firmware lifecycle: trusted updates, attestation, and rollback protection
Firmware on a DPU/SmartNIC is part of the dataplane’s trust boundary. A stable deployment needs more than “secure boot enabled”: it needs a repeatable evidence chain that proves what booted, how it was measured, and whether updates can be executed safely without creating a rollback path back to vulnerable images.
9.1 Boot chain as an evidence chain (what to verify, what to record)
RoT (trust anchor)
Evidence: device identity + expected trust mode (concept). Goal: a stable root for verification decisions.
Verify (signature check)
Evidence: pass/fail + reason code (concept) + image version. Goal: prevent unauthorized images from executing.
Measured boot (hashes)
Evidence: measurement digest set (concept) + log summary. Goal: prove what actually booted, not what was intended.
Runtime policy
Evidence: policy summary + key security flags. Goal: show that runtime is enforcing the expected security posture.
9.2 Update mechanics that prevent “safe update, unsafe rollback”
| Mechanism | Engineering principle | What to verify (evidence) |
|---|---|---|
| A/B slots | Always keep a known-good image and a deterministic failback path. | Active slot, candidate slot, last-boot result, failback trigger record. |
| Rollback protection | Old images must not become “newest bootable” after an update. | Monotonic version gate (concept), minimum-allowed version, rollback-deny event. |
| Key/cert rotation | Rotation needs a compatibility window and an audit trail. | Signer identity version, rotation window ID, revocation/deny list summary (concept). |
9.3 Attestation output: what the platform should receive
Attestation should be treated as a platform input: a compact token that summarizes firmware integrity, policy posture, and freshness. It is most useful when it is predictable and easy to consume for admission, monitoring, and audit.
Identity & placement
Device identity + slot/asset identity (concept) to bind evidence to a physical deployment.
Firmware version / build
Version/build ID so policy can enforce “minimum safe version” and detect drift.
Measurement digest summary
Hash/digest set summary (concept) for verification against an allowed baseline list.
Policy summary & freshness
Security posture summary + timestamp/nonce/freshness indicator (concept) to prevent stale evidence reuse.
9.4 Firmware update checklist (operator-ready)
- Signature verified and verification evidence recorded (version + pass/fail + reason).
- Version gate enforced (rollback protection evidence present; old images cannot re-qualify).
- A/B switch rules defined (failback conditions are deterministic and logged).
- Rotation window defined (signer/cert version in allowed window; deny rules audited).
- Audit log complete (timestamped boot/update events align with platform time base).
- Recovery path tested (rollback denied; failback works to a known-good slot).
Deployment forms & decision boundary: when SmartNIC is enough, and when a DPU is required
The practical choice between SmartNIC and DPU is a systems decision: what needs to be offloaded, how strict the isolation requirements are, how stable tail latency must be, and how much firmware governance and auditability the platform is willing to operate.
10.1 Form factors (system-level impact)
PCIe add-in card
Flexible deployment and replacement; requires disciplined firmware version control and slot-aware inventory.
Onboard integration
Consistent platform behavior; ties lifecycle more tightly to the server board and its upgrade cadence.
Mezzanine
Balance between replaceability and integration; operational process defines success (swap + audit + rollback rules).
10.2 Inline vs off-path (impact radius)
| Mode | What improves | What becomes harder |
|---|---|---|
| Inline | Per-packet work reduction, consistent dataplane behavior, stronger enforcement points. | Fault impact radius increases; rollback and audit discipline become mandatory. |
| Off-path | Risk isolation; can stage features without becoming a single point of failure. | Some per-packet gains are reduced; evidence correlation still required for tail issues. |
10.3 Scenario buckets → KPI → recommendation
| Scenario | Key KPIs (examples) | Recommendation (boundary) |
|---|---|---|
| Multi-tenant isolation + stable tail latency | p99/p999, fairness, drops/retry, audit completeness | DPU preferred when isolation + provable posture is required; choose form factor based on maintenance workflow. |
| Crypto / compression / telemetry offload | pps, sessions, key-update rate, utilization, throttle events | SmartNIC can fit limited offload; DPU preferred when orchestration + governance + audit are core requirements. |
| Path acceleration (storage/network, high-level) | tail, drain time, backpressure evidence, failure isolation | DPU preferred when state/orchestration is complex and evidence-based operations are needed. |
10.4 Cost model (what teams underestimate)
Power budget & tail behavior
Thermal/derating often changes service rate and expands tail; plan for evidence-based tuning.
Maintainability
Swap cadence, failback behavior, and version alignment define operational stability.
Firmware governance
Signing, version gates, rollback-deny, and audit logs are ongoing costs—not one-time features.
Failure domain
Inline deployments amplify impact; off-path reduces blast radius but may reduce per-packet benefits.
Validation & stress testing: beyond Gbps (focus on p99/p999)
A DPU/SmartNIC is validated by repeatable tail-latency and steady-state behavior, not by a single “peak throughput” datapoint. The goal is to prove (1) offload is actually active, (2) p99/p999 stays bounded under realistic workload shape, and (3) results remain stable after thermal and state scale-up.
11.1 Metrics as “Bundles” (not single numbers)
Any benchmark that reports only Gbps is incomplete. Real deployments are dominated by pps, flow churn, queueing tails, and thermal steady-state. Treat the metrics below as a single package; missing one makes comparisons unreliable.
- Throughput bundle: Gbps + pps (small packets can collapse pps while Gbps still looks “fine”).
- Workload-shape bundle: flow count, short-flow ratio, burstiness, packet-size mix (these drive state growth and queue spikes).
- Latency bundle: p50 + p99/p999 + jitter (average latency hides queue tails).
- Host-impact bundle: CPU utilization variance (jitter), not only the mean (offload value is often “stability”).
- Steady-state bundle: power/temperature plateau, throttle events, and “after-warm” retest drift.
Common trap
“High throughput but unstable p99” usually points to queue buildup or service-rate changes after warming, not to raw link bandwidth.
11.2 Repeatability: Lock the Variables, Then Compare
Repeatability is a test feature, not an afterthought. A valid A/B comparison requires identical workload controls and an explicit “warm steady-state” phase.
- Traffic model fixed: packet-size distribution + burst pattern + short-flow ratio + flow-count ramp schedule.
- State scale fixed: table size (entries) + bytes/entry assumption (order-of-magnitude) + state churn rate.
- Crypto/key rhythm fixed: session count scale + key/credential update cadence (avoid “random operator timing”).
- Time window fixed: include cold → warm-up → thermal steady-state → retest.
- Evidence alignment: counters/telemetry/events must share a consistent timestamp reference (for causality).
Control Card (copy/paste friendly)
Keep a single test manifest: traffic_profile, flow_ramp, table_size, key_update_cadence,
warmup_minutes, steady_state_minutes, retest_runs.
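One possible concrete shape for that manifest, as a Python dict with a completeness check; the field names follow the control card, and the values are placeholders:

```python
# Required fields from the control card; a partial manifest is not
# comparable across runs, so missing fields should fail before any test.
REQUIRED = ["traffic_profile", "flow_ramp", "table_size", "key_update_cadence",
            "warmup_minutes", "steady_state_minutes", "retest_runs"]

manifest = {
    "traffic_profile": {"pkt_mix": {64: 0.6, 512: 0.3, 1500: 0.1},
                        "burst": "on", "short_flow_ratio": 0.4},
    "flow_ramp": [1_000, 10_000, 100_000],   # step schedule, not a free-run
    "table_size": 1_000_000,                 # entries (order-of-magnitude)
    "key_update_cadence": "30s",             # fixed rhythm, not operator timing
    "warmup_minutes": 10,
    "steady_state_minutes": 30,
    "retest_runs": 5,
}

missing = [k for k in REQUIRED if k not in manifest]
assert not missing, f"manifest incomplete: {missing}"
```

Any serialization works (YAML, JSON); what matters is that the same manifest object drives both sides of every A/B comparison.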
11.3 Evidence Chain: Prove Offload Is Working
“Offload enabled” is not a claim; it is a verifiable state. Use an evidence chain that connects workload → counters/queues → tail behavior.
Symptom: p99/p999 worsens but Gbps stays high
Evidence: queue depth / drain time increases + throttle events + steady-state power/temperature shift. Conclusion: queueing dominates tails; verify service-rate under warm conditions.
Symptom: Small-packet pps collapses
Evidence: per-packet processing counters saturate, queue occupancy rises, drops/retries appear. Conclusion: per-packet overhead dominates; re-run with identical pps target and observe tail response.
Symptom: CPU mean drops, but CPU jitter stays high
Evidence: offload utilization counters do not track traffic, or fallback-path counters increase. Conclusion: offload misses or fallback is active; isolate the configuration where counters correlate with traffic.
Symptom: “Works in lab” but cannot reproduce in production
Evidence: table scale-up and warm steady-state phases are missing; logs lack aligned timestamps. Conclusion: add fixed ramps + steady-state retest; correlate events (link flap → queue spike → throttle) on one time axis.
11.4 Stress Test Set (≤10 cases) That Predicts Real Outcomes
The following set is small enough to run regularly, but broad enough to expose tail-latency and repeatability failures. Each case should record Gbps + pps + p99/p999 + queue depth + throttle events + steady-state power/temperature.
- Small-packet pps: emphasize 64B/small packets; detect per-packet overhead and tail explosion.
- Short-flow churn: high short-flow ratio; expose state churn and table pressure.
- Burst & drain: bursts with recovery window; capture queue drain time and p999 spikes.
- Flow ramp: step flow count upward; identify the p99 “knee point”.
- Table ramp: step table entries upward; observe cache-miss-like behavior (concept level) and tail drift.
- Key/session rhythm: fixed session scale + update cadence; detect “setup cost > benefit” regions.
- Thermal steady-state retest: repeat small-packet + burst cases after temperature plateaus.
- Fallback behavior check: intentionally trigger a safe fallback (policy/config); verify counters and tails match expectations.
- Repeatability loop: re-run the same manifest N times; compare variance, not only the mean.
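The repeatability loop in the last case can be scored in a few lines: compute p99 per run, then compare mean and spread. The run data below is fabricated purely to show why variance matters:

```python
import math, statistics

def p99(samples):
    """Nearest-rank p99 of one run's latency samples."""
    ordered = sorted(samples)
    return ordered[math.ceil(0.99 * len(ordered)) - 1]

def repeatability(runs):
    """Per-run p99, summarized as (mean, population stddev)."""
    per_run = [p99(r) for r in runs]
    return statistics.mean(per_run), statistics.pstdev(per_run)

# Fabricated data: 100 samples per run, 2% tail outliers per run.
stable = [[10] * 98 + [100, 100]] * 5                      # identical runs
noisy = [[10] * 98 + [s, s] for s in (60, 300, 100, 40, 250)]
m1, sd1 = repeatability(stable)   # (100, 0): tight, trustworthy
m2, sd2 = repeatability(noisy)    # mean 150 but large spread across runs
```

Two configurations with similar mean p99 are not equivalent if one of them swings 3x between runs; the spread, not the mean, is what predicts production behavior.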
11.5 Example DUT Material Numbers (for Procurement & Test Planning)
The list below provides orderable part numbers / SKUs commonly used as DPU/SmartNIC DUT references. Exact ordering varies by port speed, form factor, memory size, and “crypto enabled/disabled” variants.
| Category | Vendor | Platform | Material No. / SKU (examples) | Notes (what it anchors in validation) |
|---|---|---|---|---|
| DPU card | NVIDIA | BlueField-3 B3220 (FHHL) | 900-9D3B6-00CV-AA0 (crypto enabled) · 900-9D3B6-00SV-AA0 (crypto disabled) | 200GbE / NDR200 class; good for tail + thermal steady-state + offload evidence counters. |
| DPU card | NVIDIA | BlueField-2 E-Series (FHHL) | 900-9D219-0086-ST1 / MBF2M516A-CECOT · 900-9D219-0086-ST0 / MBF2M516A-EECOT | 100GbE class; useful for pps vs Gbps separation and repeatability manifests. |
| DPU card | AMD Pensando | Elba-based Distributed Services Card | DSC2-2Q200-32R32F64P-S4 | High-scale service chaining scenarios; validates state scale-up + steady-state drift. |
| SmartNIC / DPU option | HPE (Pensando) | Pensando DSC-25 (2×10/25GbE) | P26966-B21 | Good for “is offload actually active?” evidence chain in typical enterprise servers. |
| IPU / DPU adapter | Intel | Intel IPU Adapter E2100-CCQDA2 | E2100CCQDA2RJG1 (example orderable code) | Validates infrastructure offload + repeatability under fixed manifests. |
| SmartNIC adapter | Broadcom | Stingray PS225 family (25GbE) | PS225-H04 / BCM958802A8044C · PS225-H08 / BCM958802A8048C · PS225-H16 / BCM958802A8021C | Anchors SmartNIC-style pipeline validation; good for pps + tail under burst/short flows. |
| DPU silicon | Marvell | OCTEON 10 DPU family | CN102, CN103, CN106, CN106S | Chip-level anchor for DPU-class design discussions; useful when sourcing modules/platforms. |
Procurement hint
For BOM work, record both the vendor ordering SKU and the variant attributes that change behavior in validation: port speed, onboard memory size, crypto enablement, OOB management presence, and cooling/heatsink option.
Use the stress-test set and metric bundles above as a checklist: each workload knob must map to a metrics bundle and a measurable evidence chain. If any column is missing, the benchmark is not predictive of production tail-latency behavior.
FAQs (DPU / SmartNIC)
Each answer stays within the DPU/SmartNIC dataplane and card-level execution boundary: offload verification, queues/DMA, DDR usage/telemetry, crypto/compression trade-offs, thermal evidence, secure firmware flow, deployment shape, and p99/p999 validation.
1) What is the practical engineering boundary between a DPU and a SmartNIC?
A SmartNIC is typically a NIC-class card that accelerates and steers packets with limited platform governance, while a DPU is treated as a dataplane platform with stronger isolation, lifecycle controls, and “infrastructure services” ownership. The boundary is best defined by three questions: (1) must the dataplane enforce multi-tenant isolation and policies, (2) does the card require audited firmware lifecycle and attestation, and (3) is stable p99/p999 under load a primary goal rather than peak Gbps.
- Choose SmartNIC when acceleration is narrow and governance is light.
- Choose DPU when the dataplane becomes a shared, security-relevant infrastructure layer.
2) Why can throughput improve after offload, but p99 latency gets worse?
This usually indicates queueing or service-rate instability, not raw bandwidth limits. Common causes include burst-induced queue buildup, batching that increases tail, thermal steady-state throttling, or a hidden fallback path that activates under specific flows. The fastest way to prove it is an evidence chain: correlate (a) queue depth/drain time, (b) throttle events and frequency states, and (c) offload utilization counters vs traffic. If p99 worsens while Gbps stays flat, the tail is dominated by queue dynamics or warm-state drift.
- Run a burst + warm steady-state retest (same manifest) and compare variance.
- Verify counters track traffic; watch for fallback counters rising.
3) P4 vs eBPF: which fits observability, filtering, and ACL-style tasks?
Use P4 when packet processing needs a structured match/action pipeline with predictable per-packet behavior (L2–L4 steering, ACL-like matches, counters at deterministic points). Use eBPF when the task is driven by hooks and events (telemetry taps, selective filtering, per-flow sampling, troubleshooting-friendly iteration). When determinism and line-rate dataplane structure dominate, P4 is the clearer fit; when rapid iteration and observability dominate, eBPF is usually faster to operationalize.
- P4 strength: explicit pipeline stages and table-driven actions.
- eBPF strength: flexible attachment points and operational debugging.
4) When tables grow and latency “jitters,” is the cause DDR bandwidth or cache-miss behavior, and how can you tell?
Separate the two with controlled ramps. If performance degrades gradually with traffic volume but improves with better locality, it often points to memory-access efficiency (effective bandwidth). If p99/p999 shows a clear “knee” as entries scale, it often indicates a working-set transition where fast on-card resources are no longer effective and random access dominates. Prove it by ramping table entries while keeping traffic shape fixed, then correlating queue depth/drain time, power/temperature steady-state, and any ECC/retry counters available at a concept level.
- DDR-limited patterns: sustained pressure, rising queues, warm-state drift.
- Working-set “knee”: sharp tail jump at a specific entry scale.
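The ramp analysis can be sketched as follows, assuming p99 samples taken at each table-entry scale with traffic shape held fixed (the jump ratio is an illustrative threshold):

```python
def find_tail_knee(entry_scales, p99_us, jump_ratio=1.5):
    """Scan a table-entry ramp and return the first entry scale where p99
    jumps by more than `jump_ratio` over the previous step -- the signature
    of a working-set transition. Returns None for a gradual (DDR-style)
    degradation, which points at memory-access efficiency instead."""
    for prev, cur, scale in zip(p99_us, p99_us[1:], entry_scales[1:]):
        if cur / prev > jump_ratio:
            return scale
    return None
```

A returned scale marks where fast on-card resources stop being effective; a `None` result with steadily rising queues suggests sustained bandwidth pressure.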
5) SR-IOV enabled but performance is unstable—what queue/interrupt/doorbell indicators should be checked first?
Start with per-PF/VF queue health and evidence of backpressure. The first checks are: (1) per-queue occupancy and drain time, (2) drop/retry counters that spike during bursts, (3) imbalance across VFs (one VF queues up while others remain empty), and (4) symptoms consistent with excessive doorbell/update overhead (e.g., tail spikes without bandwidth saturation). Then correlate with thermal and event logs—warm-state throttling or firmware events can masquerade as “virtualization instability.”
- Priority order: throttle/events → queue depth/drain → per-VF drops/retries → fallback counters.
- Validate with a fixed manifest: SR-IOV off vs on, same flow/packet mix, compare p99 variance.
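The per-VF imbalance check can be sketched like this, assuming a snapshot of queue depths keyed by VF id (the ratio threshold is illustrative, not vendor-defined):

```python
def flag_vf_imbalance(vf_queue_depths, ratio=4.0):
    """Return the VF ids whose queue depth exceeds `ratio` x the median --
    the 'one VF queues up while others remain empty' symptom."""
    depths = sorted(vf_queue_depths.values())
    median = depths[len(depths) // 2]
    floor = max(median, 1)  # avoid flagging everything when all queues are empty
    return sorted(vf for vf, d in vf_queue_depths.items() if d > ratio * floor)
```

Flagged VFs are the starting point for the drop/retry and doorbell checks above, always cross-checked against thermal and firmware event logs.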
6) Why does crypto offload often fail to pay off for small packets and short flows?
Small packets and short flows amplify per-flow and per-packet overhead. Setup costs (session churn, policy lookups, key-update rhythm) can dominate the payload work, and batching can inflate tail latency even if average throughput improves. The decision should be made with a break-even test: fixed packet-size mix + fixed flow churn + fixed session scale, then compare Gbps+pps together and inspect p99/p999. Offload is “worth it” only when utilization counters track traffic and tail remains bounded under the expected churn model.
- Watch for: high setup cost, fallback activation, and queue buildup in bursts.
- Require: offload utilization counters + stable warm-state retest.
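A break-even sketch under a fixed churn model, with all cost parameters as illustrative inputs rather than measured values:

```python
def crypto_offload_pays_off(setup_us_per_flow, pkts_per_flow,
                            offload_us_per_pkt, host_us_per_pkt):
    """Amortize per-flow setup cost (session churn, policy lookup,
    key-update rhythm) over the flow's packets, then compare per-packet
    cost against the host path. Short flows inflate the amortized term."""
    amortized_setup = setup_us_per_flow / pkts_per_flow
    offload_cost = offload_us_per_pkt + amortized_setup
    return offload_cost < host_us_per_pkt
```

The same comparison must then be confirmed against p99/p999 under the fixed packet-size mix, since a favorable mean cost can still hide batching-driven tail inflation.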
7) What is the biggest compression offload pitfall—high ratio but exploding latency?
The biggest pitfall is optimizing for ratio while ignoring flush/batch and queueing effects. Compression often works on blocks, so block formation, buffering, and reorder/flush behavior can introduce head-of-line delay and p99 spikes even when average throughput looks better. This becomes severe with bursty traffic, mixed payload sizes, or when memory pressure increases. A correct evaluation records p99/p999 across burst and warm steady-state phases, and correlates compression utilization, queue depth/drain time, and any “fallback” indicators.
- Compression is safest when payloads are large and workload shape is stable.
- It is risky when short messages, frequent flush, or bursty queues dominate.
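A sketch of the burst vs warm steady-state comparison, assuming latency samples from both phases (the percentile helper and the 2x ratio are illustrative):

```python
def percentile(samples, q):
    # Nearest-rank style percentile over a small sample set.
    s = sorted(samples)
    idx = min(len(s) - 1, int(round(q * (len(s) - 1))))
    return s[idx]

def compression_tail_verdict(burst_lat_us, steady_lat_us, max_ratio=2.0):
    """Flag the 'high ratio, exploding tail' pitfall: even when mean
    latency looks fine, reject the configuration when burst-phase p99
    exceeds `max_ratio` x the warm steady-state p99."""
    p99_burst = percentile(burst_lat_us, 0.99)
    p99_steady = percentile(steady_lat_us, 0.99)
    return "risky" if p99_burst > max_ratio * p99_steady else "bounded"
```

A "risky" verdict should send the investigation toward block formation, flush cadence, and queue buildup rather than toward the compression ratio itself.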
8) Inline vs off-path deployment: is the biggest difference reliability or observability?
Both change, but the primary engineering distinction is the failure domain and the enforcement point. Inline placement can enforce and accelerate directly in the traffic path, but it expands the blast radius: card faults, firmware issues, or throttling can directly impact traffic. Off-path placement improves visibility and reduces blast radius, but enforcement becomes indirect and may lag behind the data path. Selection should be driven by service objectives: tail-latency stability, rollback tolerance, and whether enforcement must be synchronous with packet forwarding.
- Inline: strongest enforcement, strict lifecycle discipline required.
- Off-path: safer failure domain, best for observability-first objectives.
9) What are the most typical symptoms of DPU thermal throttling, and how can telemetry prove it?
Thermal throttling typically appears as slow p99 drift that becomes obvious only after warm steady-state, while average throughput may look unchanged. The signature is a causality chain: temperature and power approach a plateau → throttle/frequency events appear → service rate drops → queue depth and drain time rise → p99/p999 jumps. The proof requires aligned timestamps across telemetry, counters, and events. Without an aligned timeline, thermal issues are often misdiagnosed as “random network jitter” or “virtualization instability.”
- Capture: temperature, power, frequency states, throttle events, queue depth/drain, p99/p999.
- Repeat: cold run vs warm steady-state retest under the same manifest.
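With aligned timestamps, the causality chain reduces to an ordering check; a sketch assuming each stage's first-occurrence timestamp on one shared timeline:

```python
def throttle_causality_holds(events):
    """Check the ordering of the thermal causality chain on one aligned
    timeline: temp plateau -> throttle event -> queue growth -> p99 jump.
    `events` maps a stage name to its first timestamp (seconds); any
    missing stage breaks the proof."""
    chain = ["temp_plateau", "throttle_event", "queue_growth", "p99_jump"]
    try:
        times = [events[stage] for stage in chain]
    except KeyError:
        return False
    return all(a <= b for a, b in zip(times, times[1:]))
```

If the ordering does not hold, the tail problem is likely something else (e.g. queue dynamics alone), which is exactly the misdiagnosis the aligned timeline is meant to prevent.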
10) How can firmware updates prevent rollback to an old version that remains bootable?
Anti-rollback is achieved by combining signed boot with a monotonic version gate. A practical approach is: (1) verify signatures in the boot chain, (2) store a monotonic “minimum allowed version” in protected storage, (3) use A/B slots with controlled promotion, and (4) emit auditable update events (requested version, installed version, promotion result, and a rollback-deny reason). Attestation should report version and measurement summaries so the platform can reject nodes that fall below policy.
- Require: signed images + monotonic version policy + A/B governance + audit logs.
- Validate: forced downgrade attempt must fail with a recorded reason.
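A minimal sketch of the monotonic version gate, assuming versions compare as tuples and signature verification happens upstream (names and reason strings are illustrative):

```python
def evaluate_update(installed, requested, min_allowed, signature_ok):
    """Monotonic version-gate sketch: a signed image is required, and the
    requested version must be >= the protected 'minimum allowed version'.
    Returns an auditable (decision, reason) pair, as the audit-log
    requirement above demands."""
    if not signature_ok:
        return ("deny", "unsigned-image")
    if requested < min_allowed:
        return ("deny", f"rollback-below-floor:{min_allowed}")
    return ("allow", f"{installed}->{requested}")
```

In a real system the floor lives in protected storage and is only raised, never lowered, as part of A/B slot promotion.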
11) How can offload be proven to be truly active (not silently falling back to the host CPU)?
Proof requires a three-part evidence chain: (1) offload utilization counters (or pipeline hit counters) must scale with traffic volume and flow mix, (2) fallback/slow-path counters must remain near zero under the intended workload, and (3) host-side disturbance must move in the expected direction (reduced CPU jitter/variance, not just a lower mean CPU%). The clean method is a fixed manifest A/B test: offload disabled vs enabled, same traffic + state scale, then compare p99/p999 and counter correlations.
- Offload “on” must change counters and tail behavior consistently across repeated runs.
- If counters do not correlate with traffic, the dataplane path is not the one assumed.
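Parts (1) and (2) of the chain can be sketched as counter checks over per-interval samples (the tolerance is illustrative; part (3), the host-jitter comparison, belongs in the A/B harness):

```python
def offload_truly_active(traffic_pps, offload_hits, fallback_hits,
                         tolerance=0.05):
    """Check that offload hit counters track traffic (hit ratio stays near
    1.0 within `tolerance`) and that fallback/slow-path counters stay near
    zero relative to total traffic."""
    ratios = [h / t for h, t in zip(offload_hits, traffic_pps) if t]
    hits_track = all(abs(r - 1.0) <= tolerance for r in ratios)
    fallback_quiet = sum(fallback_hits) <= tolerance * sum(traffic_pps)
    return hits_track and fallback_quiet
```

A failing check means the packets are not taking the path assumed, and every downstream performance claim should be re-derived from the counters, not the datasheet.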
12) For selection, what 10 datapoints should be requested from a vendor (and tied to an exact material number)?
Request a comparison sheet that is valid only if it references the exact SKU/OPN under test (material number, firmware version, cooling option), because variants can change tail behavior. The 10 most decision-relevant datapoints are:
- Gbps + pps together (must disclose packet-size mix).
- p50/p99/p999 under at least one burst + one steady-state case.
- Flow count and short-flow ratio limits (state churn) where tails “knee.”
- Table/entry scale where tails “knee” (entries and bytes/entry assumption).
- Crypto: sessions scale + key-update cadence limits (concept-level disclosure).
- Compression: throughput vs tail trade-off and block/flush assumptions (test conditions).
- Thermal steady-state: power, temperature plateau, and throttle event behavior.
- Telemetry coverage: queue depth/drain, util counters, event timestamps (what is observable).
- Virtualization boundary: PF/VF scale and isolation-related counters that remain visible.
- Lifecycle governance: A/B update, rollback policy evidence, and attestation output fields.
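One way to keep such a request auditable is to bind the datapoints to the exact material number in a small structure (field names are illustrative, not any vendor's actual form):

```python
from dataclasses import dataclass, field

@dataclass
class EvidenceRequest:
    """Sketch of a comparison-sheet request tied to an exact SKU/OPN,
    so disclosed datapoints cannot drift to a different variant."""
    material_number: str   # exact SKU/OPN under test
    firmware_version: str
    cooling_option: str
    datapoints: dict = field(default_factory=dict)  # datapoint name -> disclosed value/limit

    def is_complete(self, required):
        # The sheet is decision-ready only when every required datapoint is disclosed.
        return all(name in self.datapoints for name in required)
```

Any sheet that fails `is_complete` against the 10-point list above goes back to the vendor rather than into the selection matrix.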
Example material numbers commonly cited in evaluation requests include:
900-9D3B6-00CV-AA0 / 900-9D3B6-00SV-AA0 (BlueField-3 B3220),
900-9D219-0086-ST1 / 900-9D219-0086-ST0 (BlueField-2 E-Series),
P26966-B21 (HPE Pensando DSC-25),
and PS225-H04 / PS225-H08 / PS225-H16 (Broadcom Stingray PS225).
A single-page FAQ is most useful when each answer points to measurable evidence (queues, counters, telemetry, events) inside the same DPU/SmartNIC boundary.