
DPU / SmartNIC: Programmable Dataplane Offload Guide


A DPU/SmartNIC turns “dataplane work” (switching/overlay, crypto/compression, and telemetry) into a programmable, measurable card-level pipeline so the host CPU can stay stable and predictable. The practical value is proven with evidence—queue/telemetry counters, thermal events, and repeatable p99/p999 tests—rather than peak throughput alone.

Chapter H2-1 Boundary & Decision

DPU / SmartNIC: boundary, role, and why it matters now

A SmartNIC adds programmable and offload capabilities to a NIC to reduce CPU work on repetitive per-packet tasks. A DPU pushes this further: it is a programmable, data-plane–centric subsystem (often a PCIe card) designed to harden infrastructure datapaths—offload, isolation, telemetry, and lifecycle control—so application cores stay focused on business workloads.

Scope note: this page covers dataplane roles, offloads, queues, host interface behavior, DDR usage concepts, and telemetry/logs. It does not deep dive into NIC PHY/SerDes tuning, PCIe retimer equalization, or platform management stacks.
Practical role boundary (who does what, and why)
Work item | CPU | SmartNIC | DPU
Per-packet repeat work | High overhead, jitter-sensitive | Offloads selected fast-path actions | Offloads + makes the fast path more deterministic
Infrastructure datapath (vSwitch/overlay hooks) | Consumes cores and cache; hard to isolate | Assists; may still rely on host for orchestration | Runs as a hardened datapath subsystem with policy + counters
Security offload (crypto / policy enforcement) | Possible but expensive under load spikes | Often accelerates specific primitives | Accelerates primitives + enforces datapath boundaries and audit evidence
Telemetry & evidence (counters/logs/time) | Software-only visibility; sampling gaps | Provides taps and counters for selected paths | First-class telemetry + event logs tied to datapath stages
Why this discussion is urgent in modern servers
  • Core efficiency pressure: infrastructure datapaths steal sellable CPU cycles. Offload returns cores to applications and stabilizes runtime variance.
  • Tail-latency sensitivity: p99/p999 often breaks first. Queues, interrupts, and cache interference make CPU-based fast paths hard to keep deterministic.
  • Isolation and auditability: multi-tenant boundaries and compliance require stronger separation and better evidence than “best-effort software counters.”
Three decision checks (fast, field-proven)
  • Is the pain CPU jitter or link/PHY issues? If p99 spikes correlate with CPU contention rather than cabling/SerDes faults, dataplane offload is a strong candidate.
  • Is the goal throughput or determinism? If average Gbps is fine but p99/p999 is unacceptable, focus on queueing, backpressure, and telemetry—areas DPUs address directly.
  • Is hardened infrastructure needed? If security boundaries, audit evidence, and lifecycle control matter, DPU-style separation usually beats ad-hoc host processing.
Figure F1 — Placement: host vs DPU/SmartNIC datapath split
[Figure F1 diagram: host CPU (app + infra cores) and host DRAM connect via PCIe to the DPU/SmartNIC dataplane (match/action, queues, vSwitch, crypto, compress, telemetry), which connects to network ports and the ToR switch; data-plane and control/management (OOB) paths are distinguished. Takeaways: move repeat per-packet work off the CPU; stabilize p99 via queues + telemetry.]
This diagram clarifies the boundary: CPU keeps application work, while the DPU/SmartNIC hardens the infrastructure datapath with offloads and first-class telemetry.
Chapter H2-2 Architecture Blocks

Architecture overview: the building blocks of a programmable dataplane

A DPU/SmartNIC dataplane is best understood as a pipeline of blocks that answers two questions: what action to take (parse + match/action) and when to take it (queues/QoS/backpressure). Most field failures show up as queue growth, counter anomalies, and thermal throttling long before average throughput looks “bad.”

Canonical pipeline (concept level)

Parser → Match/Action → Stateful Table → QoS/Queue → Crypto/Compress → DMA/PCIe

The “programmable region” typically spans parsing, classification, policy actions, and state updates. Fixed accelerators often implement crypto/compression primitives, while DMA/PCIe ties the dataplane to host memory and virtualization queues.
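As a concept check, the "what action to take" half of the pipeline can be sketched in a few lines; the packet fields, table layout, and counter names below are illustrative only, not any vendor's API:

```python
# Minimal sketch of parse + match/action with hit/miss counters.
# All names are illustrative; real dataplanes use vendor-specific
# tables and APIs.
from collections import Counter

counters = Counter()

def parse(pkt):
    counters["parser.pkts"] += 1
    return {"dst": pkt["dst"]}          # extract only the match key

def match_action(hdr, table):
    action = table.get(hdr["dst"])
    if action is None:
        counters["ma.miss"] += 1        # misses push work downstream
        return "drop"
    counters["ma.hit"] += 1
    return action

table = {"10.0.0.1": "fwd:port1"}
acts = [match_action(parse(p), table)
        for p in ({"dst": "10.0.0.1"}, {"dst": "10.0.0.9"})]
```

The point of the counters is explainability: after any run, hit + miss totals should account for every parsed packet, which is exactly the property the chapters below lean on for validation.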

Block cards: what to expect, what to measure, what often breaks first

Parser

Metric: packets/s sustained under mixed headers · Pitfall: complex header stacks increase per-packet work, pushing queues downstream.

Match / Action

Metric: table hit-rate + rule update latency · Pitfall: table expansion increases misses, shifting pressure to external memory and raising tail latency.

Stateful Table

Metric: state update rate + contention indicators · Pitfall: state hot-spots create backpressure even when link throughput looks fine.

QoS / Queues (tail-latency root)

Metric: queue depth + drops + scheduling latency · Pitfall: average remains good while p99/p999 collapses under bursts and short flows.

Crypto / Compression (accelerators)

Metric: utilization + per-flow setup overhead · Pitfall: small packets/short flows pay startup cost without gaining net benefit, inflating tail latency.

DMA / PCIe (host interface)

Metric: completion latency + backlog counters · Pitfall: batching/doorbell patterns can turn into jitter sources if queue management is mis-sized.
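The QoS/queue pitfall (average fine, tail collapsing) can be reproduced with a toy discrete-time queue; the arrival patterns and service rate are invented numbers chosen only to show the shape of the effect:

```python
# Toy queue: fixed service rate, per-tick arrivals. Waiting time, not
# link speed, drives the tail. Numbers are illustrative only.

def simulate(arrivals, service_rate=10):
    q, waits = 0, []
    for a in arrivals:
        q += a
        waits.extend([q / service_rate] * a)   # rough wait per arrival
        q = max(0, q - service_rate)
    return waits

def pctl(xs, p):
    xs = sorted(xs)
    return xs[min(len(xs) - 1, int(p * len(xs)))]

steady = simulate([10] * 100)                  # 10 pkts/tick, no bursts
bursty = simulate(([30] * 5 + [5] * 20) * 4)   # same average, bursty
# Same offered load (1000 packets over 100 ticks), very different p99.
```

Both traces carry identical average load; only burstiness differs, yet the bursty trace's p99 wait is several times worse. This is why the chapter insists on measuring queue depth and drains, not Gbps.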

Practical control/management view (kept intentionally high-level): a controller pushes rules/keys/policies, and validates effects through counters, queue depth, drops, and timestamped events. Deep platform management stacks belong to separate pages (linked only).
Figure F2 — Match-action pipeline and where offloads/taps sit
[Figure F2 diagram: the pipeline Parser → Match/Action → State → QoS/Queues → Crypto/Compress → DMA/PCIe, split into a programmable region and fixed accelerators + host interface, with telemetry taps (timestamps, counters) and per-block metrics (pps, hit/updates, contention, qdepth/drops, util/setup cost, completion latency). Reading guide: "what to do" = parser + match/action; "when to do" = queues + backpressure; validate with taps + counters + events.]
The pipeline view prevents “black-box offload” thinking: most problems are visible as queue growth, counter anomalies, or throttling before average throughput drops.
Chapter H2-3

Programmability models: P4 vs eBPF vs fixed pipeline

Programmability in a DPU/SmartNIC is not a single feature. Different models optimize for different engineering realities: change velocity, deterministic behavior, and state/entry footprint. The goal is to choose a model that matches the task shape, not the marketing label.

Three models, expressed as engineering boundaries

P4 (match/action-first)

Best fit for table-driven packet decisions where actions are explicit and repeat per packet. Strong when rules change often and fast-path behavior must remain explainable through counters.

eBPF (event/hook-first)

Best fit for observability, filtering, and policy hooks where signals and events matter more than pure match/action throughput. Strong for rapid iteration and field debugging through targeted instrumentation.

Fixed pipeline / hard datapath (determinism-first)

Best fit when the protocol and behavior are stable and deterministic latency matters more than feature churn. Strong for high throughput with predictable timing, but iteration cost is higher.

Trade-off triangle (practical)
Primary axis | P4 | eBPF | Fixed pipeline
Change velocity | High (rule-driven changes) | Very high (hook-driven iteration) | Low (stable behavior preferred)
Deterministic latency | Good when actions are bounded | Varies with hook placement | Best (tight timing paths)
State / entries footprint | Scales with tables and actions | Scales with instrumentation scope | Bounded but less flexible
Selection decision tree (fast, field-oriented)
Start → L2–L4 fast-path decisions? → Prefer P4
Observability / filtering / tracing? → Prefer eBPF
Hard determinism with stable behavior? → Prefer fixed pipeline
Tail-latency is the KPI? → Prioritize bounded queues + counters
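The decision tree above can be written down directly; the dict keys are hypothetical task flags, and a real selection would also weigh toolchains and vendor support:

```python
# Sketch of the selection decision tree. Keys are hypothetical task
# flags, checked in the order the tree asks its questions.

def pick_model(task):
    if task.get("l2_l4_fast_path"):
        return "P4"
    if task.get("observability_or_filtering"):
        return "eBPF"
    if task.get("hard_determinism") and task.get("stable_protocol"):
        return "fixed pipeline"
    if task.get("tail_latency_kpi"):
        return "any model, with bounded queues + counters"
    return "undecided: characterize the task shape first"

model = pick_model({"observability_or_filtering": True})
```

Encoding the tree as code makes the selection auditable: the same task description always routes to the same model, which matches the chapter's "task shape, not marketing label" principle.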
Figure F3 — Model vs task fit (matrix)
[Figure F3 matrix: P4 / eBPF / Fixed pipeline rated per task row — L2–L4 fast path: BEST / GOOD / BEST; observability hook: GOOD / BEST / LIMITED; deterministic latency: GOOD / VARIES / BEST; rapid change: BEST / BEST / LIMITED. Takeaways: P4 for rule-driven fast path · eBPF for hooks/visibility · fixed for tight determinism.]
The same DPU can expose multiple programmability styles; selection should follow task shape and tail-latency goals, not feature checklists.
Chapter H2-4

Offload deep dive: vSwitch, overlay, and service chaining

vSwitch and overlay paths consume CPU not because they are “complex,” but because they are repeat-per-packet: encapsulation/decapsulation, classification, rule lookups, counters, and policy actions. Offload moves these repeat actions into a bounded dataplane and leaves the host CPU primarily for control logic and workloads.

Before vs After (focus on datapath work)
Before: CPU path
  • Packet parsing + overlay encap/decap
  • Lookup + policy actions
  • Counters in software (sampling gaps)
  • Bursts inflate queues/interrupt pressure
After: DPU/SmartNIC path
  • Parser + match/action in dataplane
  • Queues/QoS bounded under bursts
  • Counters + timestamps as first-class taps
  • Events/logs anchor debug evidence
Engineering focus points (how problems appear in the field)

State consistency

Symptom: new policy is “applied” but some flows behave as if old rules remain. Root cause is often a mismatch between control-plane commit windows and dataplane table activation. Validation should rely on hit counters and timestamped activation events.

Policy update latency

Symptom: bursts or drops during rule updates. Root cause is frequently table update pressure that shows up first as queue depth growth. Validation should correlate update windows with queue depth, drops, and counter discontinuities.

Replay / rollback readiness

Symptom: rollback does not fully restore previous behavior. Root cause is incomplete snapshots or inconsistent states across dataplane stages. Validation should require versioned policy summaries and event logs with ordering.
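The validation loop behind these three symptoms can be automated in a basic form: given qdepth samples and an update window on one time base, flag updates whose windows coincide with queue spikes. The record shapes and the 2x threshold are illustrative assumptions:

```python
# Sketch: flag a policy-update window if queue depth inside it spikes
# well above the baseline outside it. Record shapes and the 2x factor
# are illustrative, not a product feature.

def update_suspect(qdepth_samples, window, factor=2.0):
    """qdepth_samples: [(ts, qdepth)]; window: (t_start, t_end)."""
    t0, t1 = window
    inside = [q for ts, q in qdepth_samples if t0 <= ts <= t1]
    outside = [q for ts, q in qdepth_samples if not (t0 <= ts <= t1)]
    if not inside or not outside:
        return False                  # not enough evidence either way
    baseline = sum(outside) / len(outside)
    return max(inside) > factor * max(baseline, 1.0)

samples = [(t, 5) for t in range(10)] + [(t, 40) for t in range(10, 13)]
suspect = update_suspect(samples, window=(10, 13))   # spike inside window
```

Running this per update window turns "rule updates feel bursty" into a yes/no signal tied to timestamps, which is the evidence style the section asks for.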

Scope note: this section covers dataplane offload mechanics and evidence loops (counters, timestamps, queues). Security device theory and protocol-stack internals are intentionally out of scope.
Figure F4 — vSwitch/overlay offload: CPU path vs DPU path + service chain
[Figure F4 diagram: CPU path (before: encap/decap, lookup/policy, software counters, burst → queue/jitter) versus DPU path (after: parser + match/action, queues/QoS, counters + timestamps, events/logs), with an inline service-chain bubble (FW → LB → Monitor). Key signals: queue depth · drops · hit counters · timestamps · throttle/events.]
Offload success is verified by evidence: stable queue depth under bursts, consistent hit counters after updates, and timestamped events that explain p99 behavior.
Chapter H2-5

Crypto & compression offload: throughput, tail latency, and key boundaries

Offload is valuable when it removes repeat-per-packet work from the host CPU and converts it into a bounded dataplane behavior. The most common evaluation mistake is to focus on Gbps alone and ignore pps, session shape, and the key lifecycle boundary that determines stability during updates and rollbacks.

Boundary reminder: cryptographic algorithm details and protocol internals are out of scope here. Key storage devices are not covered; this section only defines the boundary principles and links to Root of Trust.

5.1 Crypto offload (concept): where it sits and what to measure

Placement

Inbound / outbound datapath stages with counters and timestamps around the crypto block

Why tail latency improves (when it does)

CPU jitter drops when per-packet compute is removed from shared cores and moved into a bounded queue + accelerator path

Key metrics

Gbps (large packets) · pps (small packets) · sessions (concurrency) · key update rate (rotation/rekey events)

5.2 Compression offload: when it helps and what it costs

Good fit

Bandwidth- or IO-bound paths where reducing bytes dominates (e.g., egress links, log/telemetry aggregation, storage write paths).

Watch-outs

Compression adds compute + memory movement. High compression ratio can still lose if it increases queueing or triggers thermal throttling.

5.3 Why offload can become slower (field failure patterns)

Pitfall A — Small packets / short flows

Symptom: average throughput looks fine, but p99/p999 worsens. Likely cause: setup/warm-up cost exceeds per-packet benefit. Signals: session create rate, completion latency spikes, queue depth growth.

Pitfall B — Burst backpressure

Symptom: bursts cause drops or long drains. Likely cause: accelerator saturation and backpressure into queues. Signals: utilization near ceiling, qdepth rising, drops during burst windows.

Pitfall C — Key updates cause micro-outages

Symptom: periodic latency spikes during rotation/rekey. Likely cause: lifecycle boundary not aligned with dataplane activation windows. Signals: key-update events, policy/version counters, hit counter discontinuities.

Pitfall D — Compression “wins ratio” but loses latency

Symptom: fewer bytes, but worse tail. Likely cause: compute + memory movement plus thermal/power limits. Signals: throttle events, latency histogram widening, queue drain time increasing.
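Pitfall A is ultimately amortization arithmetic: a per-flow setup cost only pays off when enough packets share it. A back-of-envelope check (all microsecond figures are invented example numbers):

```python
# Sketch: offload breakeven. Offload pays a fixed per-flow setup cost
# plus a small per-packet cost; the CPU pays a larger per-packet cost
# only. All microsecond figures are invented example numbers.

def offload_wins(pkts_per_flow, setup_us, cpu_us_per_pkt, offload_us_per_pkt):
    cpu_cost = pkts_per_flow * cpu_us_per_pkt
    offload_cost = setup_us + pkts_per_flow * offload_us_per_pkt
    return offload_cost < cpu_cost

def breakeven_pkts(setup_us, cpu_us_per_pkt, offload_us_per_pkt):
    return setup_us / (cpu_us_per_pkt - offload_us_per_pkt)

long_ok = offload_wins(10_000, 50, 1.0, 0.2)   # long flow amortizes setup
short_ok = offload_wins(20, 50, 1.0, 0.2)      # short flow pays and loses
```

With these example costs the breakeven is 62.5 packets per flow: below it, offload inflates tail latency exactly as Pitfall A describes, even though throughput on long flows looks great.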

5.4 Key boundary principles (device-agnostic)

  • Keep long-lived roots inside a dedicated trust boundary. The dataplane should operate with derived or wrapped material, not raw roots.
  • Prefer wrapped/derived session keys for fast rotation. Rotation must be observable (events) and reversible (rollback-safe).
  • Make activation auditable. Tie key/policy versions to timestamped events and counters so behavior changes can be explained.

Link-only: TPM / HSM / Root of Trust →

Figure F5 — Crypto/Compress insertion points + key boundary (concept)
[Figure F5 diagram: plaintext vs ciphertext zones, with parser/match counters and queues/QoS (qdepth) feeding a crypto block (sessions, pps/Gbps) and a compression block (ratio vs latency), timestamp/counter taps around each, and a key-store boundary holding wrapped keys. Link-only: the Root of Trust page defines device choices and custody rules.]
The key boundary is defined by where long-lived roots live and how derived/wrapped keys are activated, rotated, and audited. Offload validation should include pps, sessions, key update events, and tail-latency under bursts.
Chapter H2-6

Host interface & virtualization: PCIe + SR-IOV from a DPU viewpoint

Virtualization instability usually does not come from “virtualization” itself. It comes from how queues, DMA batching, and doorbell pacing interact with isolation boundaries. The engineering objective is to keep tail latency bounded while preserving isolation across tenants and workloads.

Boundary reminder: PCIe switch/retimer tuning is out of scope. If symptoms resemble link-level instability, link to the PCIe Switch/Retimer page and treat it as a separate root-cause track.

6.1 PF vs VF roles (minimal, practical)

PF (control/ownership)

Controls resources and configuration: queue creation, policy knobs, and ownership boundaries.

VF (data queues)

Represents isolated data-plane queues for VMs/containers. Stability depends on queue sizing and scheduling fairness.

6.2 DMA, queues, batching, doorbells (tail-latency levers)

Lever | What it changes | Typical tail-latency symptom
Queue depth | Buffering under bursts vs drain time | Too shallow → drops; too deep → p99 grows from waiting
Batch size | Efficiency vs waiting for a batch | Periodic spikes and wider latency histogram
Doorbell pacing | Update frequency for completions/consumption | “Heartbeat” jitter (regular micro-spikes)
Backpressure policy | How overload propagates | Burst → long drain tails, drops in windows
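The batch-size lever is easy to see in isolation: if completions are delivered only when a batch fills, per-packet notification delay becomes periodic, which is exactly the "heartbeat" jitter symptom. A toy model (one arrival per slot, delay measured in slots):

```python
# Toy model of batch-induced waiting: one packet arrives per slot, and
# completions are delivered only when `batch` packets have accumulated.

def completion_delays(n_pkts, batch):
    """Slots each packet waits before its batch is flushed."""
    return [(batch - 1) - (i % batch) for i in range(n_pkts)]

no_batch = completion_delays(1000, batch=1)    # immediate notification
batched = completion_delays(1000, batch=32)    # sawtooth of waiting
# The first packet of every batch waits longest (31 slots here), so the
# latency histogram gains a periodic ridge even at constant load.
```

Larger batches buy efficiency but widen the histogram; doorbell pacing has an analogous effect on the completion side, which is why both appear as levers in the table above.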

6.3 Isolation concepts: IOMMU / ATS (why it affects performance)

  • IOMMU introduces translation and consistency paths that can expose tail-latency if queues and batching are mis-sized.
  • ATS-related behaviors can reduce some overhead, but shift sensitivity toward consistency and fault handling paths.
  • Practical rule: stronger isolation increases the need for clean queue evidence (qdepth, drops, completion latency, per-VF counters).

6.4 Troubleshooting cards (problem → symptom → likely cause → what to check)

Problem: p99 jumps after enabling SR-IOV

Symptom: average is stable, tail widens. Likely cause: queue depth too large/small, aggressive batching, doorbell pacing. Check: qdepth, completion latency, batch indicators, drops.

Problem: VFs interfere with each other

Symptom: one tenant load impacts others. Likely cause: shared scheduling or shared accelerator saturation. Check: per-VF counters, utilization, scheduling latency.

Problem: periodic “heartbeat” jitter

Symptom: spikes at regular intervals. Likely cause: doorbell/batch rhythm or reclamation cadence. Check: timestamped events, latency histogram, cadence correlation.

Problem: burst causes long recovery

Symptom: drain time is long after burst ends. Likely cause: backpressure chain and queue drain behavior. Check: backpressure counters, drops, drain time metrics.

Problem: link-like symptoms (separate track)

Symptom: intermittent retries/recovery and throughput swings. Action: treat as link-level suspicion and link to PCIe Switch/Retimer page (no tuning here).

Link-only: PCIe Switch / Retimer →

Figure F6 — SR-IOV PF/VF queues + DMA/doorbell + isolation boundary (concept)
[Figure F6 diagram: host VMs/pods and host memory rings (PF/VF descriptors) cross an IOMMU boundary over PCIe to the DPU/SmartNIC, where the PF (control/ownership) and VF queues (VF0–VF2) feed a DMA engine with doorbell pacing, completion-latency taps, and timestamps/counters. Debug signals: qdepth · drops · completion latency · per-VF counters · timestamped events.]
Virtualization stability is maintained by treating queues and DMA pacing as first-class tail-latency levers, and validating behavior with per-VF counters and timestamps across the isolation boundary.
Chapter H2-7

Memory & DDR power/monitoring: when tables, sessions, and buffers become the bottleneck

On a DPU/SmartNIC, external memory is often the difference between “feature works” and “system stays stable.” Tables and sessions grow quietly, buffers activate under bursts, and telemetry/logs compete for bandwidth. When access patterns become random or congested, effective bandwidth collapses and tail latency rises even if average throughput looks acceptable.

Boundary reminder: this chapter does not deep-dive DDR5 PMIC/RCD/DB devices. It focuses on how memory usage shapes tail latency and what telemetry fields are needed to prove the cause.

7.1 Three DDR roles in a DPU

Tables & session state

Large flow/ACL/session entries and stateful counters that exceed on-chip capacity.

Buffers & reordering

Burst absorption, staging, and reordering that convert overload into queueing time.

Telemetry & log buffering

Event/ring buffering and evidence capture that must remain stable during fault storms.

7.2 The three performance traps that “drag you down”

Trap A — Random access reduces effective bandwidth

Symptom: average throughput survives, but p99/p999 worsens. Likely cause: sparse keys and poor locality turn memory into scattered fetches. Check: queue depth growth, completion/processing latency widening, retry-like counters (concept).

Trap B — Table inflation amplifies cache misses

Symptom: the more rules/entries, the more jitter appears. Likely cause: bigger entries and expanded hotsets increase misses and churn. Check: table occupancy trend, hit-rate proxies, “hotset” drift indicators (concept).

Trap C — Congestion turns into a long tail

Symptom: a burst ends, but recovery remains slow. Likely cause: buffering/reordering creates waiting time; drain time dominates tail. Check: qdepth, drops, drain time, backpressure counters.

7.3 Order-of-magnitude sizing: how big can tables get?

Quick sizing (order-of-magnitude)

Total ≈ entries × bytes/entry → MB / GB

Sizing note | Why it matters
bytes/entry grows silently | Actions, counters, and alignment can expand entries beyond “field list” intuition.
entry count is not the only risk | Access distribution (locality vs scatter) is what determines effective bandwidth and tail.
buffers activate under bursts | Even “small tables” can jitter when bursts force deep buffering and long drain time.
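The sizing rule and the "bytes/entry grows silently" note combine into a small calculator; the field sizes and the 64 B alignment are illustrative assumptions:

```python
# Sketch of Total ≈ entries × bytes/entry, with per-entry alignment.
# Field sizes and the 64 B alignment are illustrative assumptions.

def table_size_mb(entries, key_bytes, action_bytes, counter_bytes, align=64):
    raw = key_bytes + action_bytes + counter_bytes
    per_entry = ((raw + align - 1) // align) * align   # round up to align
    return entries * per_entry / 1e6

flows = table_size_mb(2_000_000, 32, 16, 16)      # 64 B/entry → 128 MB
# One extra counter byte crosses the alignment boundary and doubles
# bytes/entry — "bytes/entry grows silently" in action.
inflated = table_size_mb(2_000_000, 32, 16, 17)   # 128 B/entry → 256 MB
```

Note how a one-byte change doubles the footprint once alignment is accounted for; this is why sizing from the field list alone underestimates DDR pressure.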

7.4 What to monitor (telemetry fields, device-agnostic)

Thermal

Memory temperature signals and controller-adjacent temperature (concept): rising trends predict throttle and jitter.

Rails (V/I)

Rail voltage/current for memory domains (concept): sudden current step often follows activation of buffers/reorder paths.

Error & retry counters

ECC/repair/retry-like indicators (concept): spikes correlate with widened tail and reduced effective bandwidth.

Timestamped events

Throttle, retries, crash/restart markers: evidence must share the same time base as queue/counter logs.

Link-only: DDR5 PMIC (on-DIMM) →

Figure F7 — DDR roles → access patterns → tail latency + telemetry taps
[Figure F7 concept map: DDR roles (tables/sessions, buffers/reorder, logs/telemetry) → access patterns (random/low locality, miss/churn from table inflation, queueing/drain time) → outcomes (effective bandwidth ↓ from scatter, qdepth ↑ under bursts, p99 ↑ from long tails). Telemetry taps: temp · rail V/I · ECC/retry · timestamped events · qdepth.]
Memory issues surface as “effective bandwidth collapse” and “queue drain tails.” Proving the cause requires timestamp-aligned telemetry across temperature, rails, and error/retry counters.
Chapter H2-8

Power, thermals & telemetry: making observability a first-class dataplane feature

In production, DPU performance is limited by power and thermals long before “feature completeness.” The critical skill is to treat telemetry as evidence: identify throttle windows, correlate queue growth, and confirm whether backpressure or accelerator saturation explains p99/p999 behavior.

Boundary reminder: rack cooling and liquid-loop design are out of scope. This chapter only defines observable interfaces, constraints, and a practical debug order.

8.1 Telemetry fields grouped by diagnostic purpose

Power wall

Power/current/rail status: step changes often precede throttle or unstable queue drain.

Thermals & frequency state

Temperature + frequency/throttle flags: the fastest way to explain “tail got worse without traffic change.”

Congestion evidence

Queue depth, drops, retries, drain time: converts “it feels slow” into measurable queueing.

Accelerator utilization

Crypto/compress utilization: confirms saturation vs mis-sized queues or pacing.

8.2 Event logs: the timestamp consistency rule

  • Events matter more than snapshots. Power-fail, throttle, link-flap, and firmware-crash markers should be stored as ordered events.
  • One time base. Queue/counter samples and events must share a consistent timestamp domain to build a causal timeline.
  • Evidence first. Correlate events with queue spikes and utilization changes before changing policies.
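The "one time base" rule reduces to a merge: events and counter samples that share a timestamp domain can be interleaved into a single causal timeline. A minimal sketch with an invented (ts, label) record shape:

```python
# Sketch: merge ordered events and counter samples into one timeline.
# The (ts, label) record shape is invented for illustration.
import heapq

def timeline(events, samples):
    return list(heapq.merge(sorted(events), sorted(samples)))

events = [(105, "throttle"), (100, "link_flap")]
samples = [(101, "qdepth=10"), (104, "qdepth=90")]
tl = timeline(events, samples)
# Reads causally: link_flap → qdepth climbs → throttle.
```

If the two streams used different clocks, the same merge would produce a timeline that lies about causality, which is why the rule is stated as mandatory rather than nice-to-have.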

8.3 Why thermal policy hits p99/p999 (not just averages)

Causal chain (practical)

Temperature rises → throttle/derating → service rate drops → queue depth grows → drain time dominates → p99/p999 expands.
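The causal chain can be shown with a toy queue: hold arrivals constant, drop the service rate during a throttle window, and watch queue depth grow and then drain slowly. All rates are invented:

```python
# Toy model of the chain: a throttle halves service rate, queue depth
# grows, and drain time (not traffic) expands the tail. Rates invented.

def run_queue(arrive=8, ticks=80, throttle=(20, 40)):
    q, depths = 0, []
    for t in range(ticks):
        rate = 5 if throttle[0] <= t < throttle[1] else 10
        q = max(0, q + arrive - rate)
        depths.append(q)
    return depths

depths = run_queue()
# Queue is empty before the throttle, peaks at its end, and then needs
# ~30 more ticks of headroom (10 - 8 = 2/tick) to drain back to zero.
```

The drain phase is the key: the throttle lasts 20 ticks but the tail stays elevated for 30 more, so p99/p999 degrades long after temperature recovers.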

8.4 Troubleshooting order (fastest path to root cause)

Step | What to check first | What it proves
1 | Throttle / frequency events | Whether the system’s service rate changed
2 | Queue depth / drain time / drops | Whether tail is queueing-driven
3 | DMA backpressure / completion latency | Whether pacing or backpressure explains spikes
4 | Crypto/compress utilization | Whether accelerators are saturated vs underused

Link-only: Cooling / Thermal Management →    Link-only: In-band Telemetry & Power Log →

Figure F8 — Telemetry sources → bus → timestamped event timeline
[Figure F8 diagram: telemetry sources (temp, rail V/I, queues, drops/retry, accelerator utilization) feed a telemetry bus; a timestamped timeline runs t0 link flap → t1 queue spike → t2 throttle → t3 recover, with the causal debug path throttle → qdepth → backpressure → utilization. Rule: events and counters must share one time base to explain p99/p999.]
A useful telemetry system is not a “dashboard.” It is a timestamp-consistent evidence chain that connects events (flap/throttle) to queue behavior and utilization.
Chapter H2-9

Secure boot & firmware lifecycle: trusted updates, attestation, and rollback protection

Firmware on a DPU/SmartNIC is part of the dataplane’s trust boundary. A stable deployment needs more than “secure boot enabled”: it needs a repeatable evidence chain that proves what booted, how it was measured, and whether updates can be executed safely without creating a rollback path back to vulnerable images.

Boundary reminder: TPM/HSM hardware internals and device selection are out of scope. This chapter focuses on the DPU-side workflow and the evidence that operators should be able to collect.

9.1 Boot chain as an evidence chain (what to verify, what to record)

RoT (trust anchor)

Evidence: device identity + expected trust mode (concept). Goal: a stable root for verification decisions.

Verify (signature check)

Evidence: pass/fail + reason code (concept) + image version. Goal: prevent unauthorized images from executing.

Measured boot (hashes)

Evidence: measurement digest set (concept) + log summary. Goal: prove what actually booted, not what was intended.

Runtime policy

Evidence: policy summary + key security flags. Goal: show that runtime is enforcing the expected security posture.

9.2 Update mechanics that prevent “safe update, unsafe rollback”

Mechanism | Engineering principle | What to verify (evidence)
A/B slots | Always keep a known-good image and a deterministic failback path. | Active slot, candidate slot, last-boot result, failback trigger record.
Rollback protection | Old images must not become “newest bootable” after an update. | Monotonic version gate (concept), minimum-allowed version, rollback-deny event.
Key/cert rotation | Rotation needs a compatibility window and an audit trail. | Signer identity version, rotation window ID, revocation/deny list summary (concept).
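The rollback-protection row reduces to a monotonic gate. A sketch (the state, names, and audit record shape are illustrative, and a real gate would also encode A/B failback exceptions):

```python
# Sketch of a monotonic version gate: boot requires a valid signature
# and a version at or above the ratcheting minimum. Names illustrative.

class VersionGate:
    def __init__(self, min_version=0):
        self.min_version = min_version
        self.audit = []                  # ordered, timestampable events

    def try_boot(self, version, signature_ok):
        if not signature_ok:
            self.audit.append(("verify-fail", version))
            return False
        if version < self.min_version:
            self.audit.append(("rollback-deny", version))
            return False
        self.min_version = version       # ratchet forward, never back
        self.audit.append(("boot-ok", version))
        return True

gate = VersionGate(min_version=7)
ok_new = gate.try_boot(8, signature_ok=True)   # newer image boots, ratchets
ok_old = gate.try_boot(6, signature_ok=True)   # old image denied + logged
```

The important property is the audit trail: every deny leaves an ordered event, so "old images cannot re-qualify" is provable from logs, not just asserted.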

9.3 Attestation output: what the platform should receive

Attestation should be treated as a platform input: a compact token that summarizes firmware integrity, policy posture, and freshness. It is most useful when it is predictable and easy to consume for admission, monitoring, and audit.

Identity & placement

Device identity + slot/asset identity (concept) to bind evidence to a physical deployment.

Firmware version / build

Version/build ID so policy can enforce “minimum safe version” and detect drift.

Measurement digest summary

Hash/digest set summary (concept) for verification against an allowed baseline list.

Policy summary & freshness

Security posture summary + timestamp/nonce/freshness indicator (concept) to prevent stale evidence reuse.
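Treated as a platform input, the token can be checked mechanically against a baseline digest list, a minimum safe version, and a freshness window. Field names, the 300 s window, and the deny strings are all illustrative assumptions:

```python
# Sketch: admit/deny from an attestation token. Field names, the
# freshness window, and deny strings are illustrative assumptions.

def admit(token, baseline_digests, min_version, now, max_age_s=300):
    if token["digest"] not in baseline_digests:
        return "deny: unknown measurement"
    if token["fw_version"] < min_version:
        return "deny: below minimum safe version"
    if now - token["ts"] > max_age_s:
        return "deny: stale evidence"
    return "admit"

tok = {"device": "dpu-17", "fw_version": 12, "digest": "abc", "ts": 1000}
fresh = admit(tok, {"abc"}, min_version=10, now=1100)   # within window
stale = admit(tok, {"abc"}, min_version=10, now=2000)   # freshness fails
```

Ordering the checks (measurement, version, freshness) keeps deny reasons specific, which is what makes the token useful for monitoring and audit rather than a bare pass/fail.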

9.4 Firmware update checklist (operator-ready)

  • Signature verified and verification evidence recorded (version + pass/fail + reason).
  • Version gate enforced (rollback protection evidence present; old images cannot re-qualify).
  • A/B switch rules defined (failback conditions are deterministic and logged).
  • Rotation window defined (signer/cert version in allowed window; deny rules audited).
  • Audit log complete (timestamped boot/update events align with platform time base).
  • Recovery path tested (rollback denied; failback works to a known-good slot).

Link-only: TPM / HSM / Root of Trust →

Figure F9 — Boot chain + attestation token flow (evidence-oriented)
[Figure F9 diagram: left-to-right boot chain RoT (trust anchor) → verify (signature) → measured boot (hashes) → runtime policy, each stage emitting evidence (ID, version, hash, policy) summarized into an attestation token (ID · ver · hash · policy · ts) consumed by platform admit/deny, monitoring, and audit; the audit log records boot events, update events, rollback-deny, and timestamps.]
Treat boot as an evidence chain: each stage emits verifiable signals, summarized into an attestation token for admission, monitoring, and audit.
Chapter H2-10

Deployment forms & decision boundary: when SmartNIC is enough, and when a DPU is required

The practical choice between SmartNIC and DPU is a systems decision: what needs to be offloaded, how strict the isolation requirements are, how stable tail latency must be, and how much firmware governance and auditability the platform is willing to operate.

Boundary reminder: this chapter avoids protocol deep dives (network or storage). It focuses on scenario-level KPIs, form factors, and governance cost that determine the correct boundary.

10.1 Form factors (system-level impact)

PCIe add-in card

Flexible deployment and replacement; requires disciplined firmware version control and slot-aware inventory.

Onboard integration

Consistent platform behavior; ties lifecycle more tightly to the server board and its upgrade cadence.

Mezzanine

Balance between replaceability and integration; operational process defines success (swap + audit + rollback rules).

10.2 Inline vs off-path (impact radius)

Mode | What improves | What becomes harder
Inline | Per-packet work reduction, consistent dataplane behavior, stronger enforcement points. | Fault impact radius increases; rollback and audit discipline become mandatory.
Off-path | Risk isolation; can stage features without becoming a single point of failure. | Some per-packet gains are reduced; evidence correlation still required for tail issues.

10.3 Scenario buckets → KPI → recommendation

Scenario | Key KPIs (examples) | Recommendation (boundary)
Multi-tenant isolation + stable tail latency | p99/p999, fairness, drops/retry, audit completeness | DPU preferred when isolation + provable posture is required; choose form factor based on maintenance workflow.
Crypto / compression / telemetry offload | pps, sessions, key-update rate, utilization, throttle events | SmartNIC can fit limited offload; DPU preferred when orchestration + governance + audit are core requirements.
Path acceleration (storage/network, high-level) | tail, drain time, backpressure evidence, failure isolation | DPU preferred when state/orchestration is complex and evidence-based operations are needed.

10.4 Cost model (what teams underestimate)

Power budget & tail behavior

Thermal/derating often changes service rate and expands tail; plan for evidence-based tuning.

Maintainability

Swap cadence, failback behavior, and version alignment define operational stability.

Firmware governance

Signing, version gates, rollback-deny, and audit logs are ongoing costs—not one-time features.

Failure domain

Inline deployments amplify impact; off-path reduces blast radius but may reduce per-packet benefits.


Figure F10 — Scenario buckets → KPI → SmartNIC vs DPU decision map
Decision map: three scenario buckets (multi-tenant isolation, crypto/compression/telemetry offload, path-level acceleration) feed KPI checks (p99/p999, audit, sessions, failure radius) that route to either a SmartNIC (limited offload) or a DPU (governed dataplane); form-factor tags cover PCIe card, onboard, and mezzanine. Rule: strong isolation + audited posture + stable tail → DPU; limited offload → SmartNIC.
Choose by required isolation strength, tail-latency KPI stability, and the organization’s ability to operate firmware governance and audit at scale.

H2-11 · Validation & Stress Testing: Beyond Gbps (Focus on p99/p999)

A DPU/SmartNIC is validated by repeatable tail-latency and steady-state behavior, not by a single “peak throughput” datapoint. The goal is to prove (1) offload is actually active, (2) p99/p999 stays bounded under realistic workload shape, and (3) results remain stable after thermal and state scale-up.

Keywords: Gbps + pps together · p50/p99/p999 · repeatability · thermal steady-state · evidence chain

11.1 Metrics as “Bundles” (not single numbers)

Any benchmark that reports only Gbps is incomplete. Real deployments are dominated by pps, flow churn, queueing tails, and thermal steady-state. Treat the metrics below as a single package; missing one makes comparisons unreliable.

  • Throughput bundle: Gbps + pps (small packets can collapse pps while Gbps still looks “fine”).
  • Workload-shape bundle: flow count, short-flow ratio, burstiness, packet-size mix (these drive state growth and queue spikes).
  • Latency bundle: p50 + p99/p999 + jitter (average latency hides queue tails).
  • Host-impact bundle: CPU utilization variance (jitter), not only the mean (offload value is often “stability”).
  • Steady-state bundle: power/temperature plateau, throttle events, and “after-warm” retest drift.
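One way to keep the bundles together is a single record type, where a run missing any field is treated as non-comparable. The field names and units below are illustrative, not a vendor schema:

```python
from dataclasses import dataclass, asdict

@dataclass
class MetricsBundle:
    """One test run's bundled metrics; a single missing field (None)
    makes the run non-comparable. Names/units are illustrative."""
    gbps: float
    pps_m: float             # million packets/sec, paired with Gbps
    p50_us: float
    p99_us: float
    p999_us: float
    cpu_util_mean: float
    cpu_util_stddev: float   # jitter matters more than the mean
    temp_plateau_c: float    # steady-state plateau, not peak
    throttle_events: int

    def is_complete(self) -> bool:
        # Every field must be populated; None means "not measured".
        return all(v is not None for v in asdict(self).values())

run = MetricsBundle(gbps=180.0, pps_m=260.0, p50_us=8.0, p99_us=41.0,
                    p999_us=120.0, cpu_util_mean=22.0, cpu_util_stddev=3.1,
                    temp_plateau_c=78.0, throttle_events=0)
```

A comparison sheet built from such records makes it mechanically obvious when a vendor or lab run omitted part of the bundle.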

Common trap

“High throughput but unstable p99” usually points to queue buildup or service-rate changes after warming, not to raw link bandwidth.

11.2 Repeatability: Lock the Variables, Then Compare

Repeatability is a test feature, not an afterthought. A valid A/B comparison requires identical workload controls and an explicit “warm steady-state” phase.

  • Traffic model fixed: packet-size distribution + burst pattern + short-flow ratio + flow-count ramp schedule.
  • State scale fixed: table size (entries) + bytes/entry assumption (order-of-magnitude) + state churn rate.
  • Crypto/key rhythm fixed: session count scale + key/credential update cadence (avoid “random operator timing”).
  • Time window fixed: include cold → warm-up → thermal steady-state → retest.
  • Evidence alignment: counters/telemetry/events must share a consistent timestamp reference (for causality).

Control Card (copy/paste friendly)

Keep a single test manifest: traffic_profile, flow_ramp, table_size, key_update_cadence, warmup_minutes, steady_state_minutes, retest_runs.
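A minimal sketch of such a manifest, with placeholder values rather than tuned recommendations:

```python
# Test manifest sketch; keys mirror the control card above.
# All values are placeholders, not tuned recommendations.
manifest = {
    "traffic_profile": {
        "pkt_size_mix": {64: 0.5, 512: 0.3, 1500: 0.2},  # fractions sum to 1
        "burst_pattern": "spike_then_drain",
        "short_flow_ratio": 0.7,
    },
    "flow_ramp": [1_000, 10_000, 100_000],   # stepped flow counts
    "table_size": 500_000,                   # entries (order of magnitude)
    "key_update_cadence_s": 60,
    "warmup_minutes": 15,
    "steady_state_minutes": 30,
    "retest_runs": 5,
}

REQUIRED = {"traffic_profile", "flow_ramp", "table_size",
            "key_update_cadence_s", "warmup_minutes",
            "steady_state_minutes", "retest_runs"}

def manifest_valid(m: dict) -> bool:
    # A run without all controls locked is not a valid A/B comparison.
    return REQUIRED <= m.keys()
```

Storing this manifest alongside each result set is what makes "same manifest, rerun N times" checkable after the fact.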

11.3 Evidence Chain: Prove Offload Is Working

“Offload enabled” is not a claim; it is a verifiable state. Use an evidence chain that connects workload → counters/queues → tail behavior.

Symptom: p99/p999 worsens but Gbps stays high

Evidence: queue depth / drain time increases + throttle events + steady-state power/temperature shift. Conclusion: queueing dominates tails; verify service-rate under warm conditions.

Symptom: Small-packet pps collapses

Evidence: per-packet processing counters saturate, queue occupancy rises, drops/retries appear. Conclusion: per-packet overhead dominates; re-run with identical pps target and observe tail response.

Symptom: CPU mean drops, but CPU jitter stays high

Evidence: offload utilization counters do not track traffic, or fallback-path counters increase. Conclusion: offload misses or fallback is active; isolate the configuration where counters correlate with traffic.

Symptom: “Works in lab” but cannot reproduce in production

Evidence: table scale-up and warm steady-state phases are missing; logs lack aligned timestamps. Conclusion: add fixed ramps + steady-state retest; correlate events (link flap → queue spike → throttle) on one time axis.
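The symptom→evidence pairs above all depend on one aligned timeline. A small sketch of that causality check, assuming events from counters, telemetry, and logs have been reduced to simple (timestamp, name) samples:

```python
def causal_order(events, chain):
    """events: list of (timestamp, name) from merged sources.
    chain: hypothesized cause→effect sequence of event names.
    True if each chain element first occurs no earlier than the
    previous one on the shared time axis."""
    first_seen = {}
    for ts, name in sorted(events):          # one aligned timeline
        first_seen.setdefault(name, ts)      # keep first occurrence
    times = [first_seen.get(n) for n in chain]
    if any(t is None for t in times):
        return False                         # chain element never observed
    return all(a <= b for a, b in zip(times, times[1:]))

log = [(10.0, "link_flap"), (10.4, "queue_spike"),
       (11.2, "throttle"), (12.0, "p999_jump")]
causal_order(log, ["link_flap", "queue_spike", "throttle", "p999_jump"])  # True
```

If the sources lack a consistent timestamp reference, this check is meaningless — which is exactly why the evidence-alignment control earlier in the chapter is mandatory.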

11.4 Stress Test Set (≤10 cases) That Predicts Real Outcomes

The following set is small enough to run regularly, but broad enough to expose tail-latency and repeatability failures. Each case should record Gbps + pps + p99/p999 + queue depth + throttle events + steady-state power/temperature.

  • Small-packet pps: emphasize 64B/small packets; detect per-packet overhead and tail explosion.
  • Short-flow churn: high short-flow ratio; expose state churn and table pressure.
  • Burst & drain: bursts with recovery window; capture queue drain time and p999 spikes.
  • Flow ramp: step flow count upward; identify the p99 “knee point”.
  • Table ramp: step table entries upward; observe cache-miss-like behavior (concept level) and tail drift.
  • Key/session rhythm: fixed session scale + update cadence; detect “setup cost > benefit” regions.
  • Thermal steady-state retest: repeat small-packet + burst cases after temperature plateaus.
  • Fallback behavior check: intentionally trigger a safe fallback (policy/config); verify counters and tails match expectations.
  • Repeatability loop: re-run the same manifest N times; compare variance, not only the mean.
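The repeatability loop in the last case can be reduced to a variance gate over repeated p99 measurements; the 10% relative-spread threshold below is an assumption to calibrate per fleet, not a standard:

```python
from statistics import mean, pstdev

def repeatable(p99_runs, rel_spread=0.10):
    """Re-run the same manifest N times; compare spread, not only mean.
    Flags the result set as non-repeatable when the relative stddev of
    p99 exceeds rel_spread (threshold is an assumed default)."""
    m = mean(p99_runs)
    return pstdev(p99_runs) / m <= rel_spread

repeatable([40.1, 41.0, 39.8, 40.5])   # tight spread → repeatable
repeatable([40.0, 55.0, 41.0, 70.0])   # drifting runs → not repeatable
```

A run set that fails this gate should trigger the thermal steady-state and fallback checks before any A/B conclusion is drawn.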

11.5 Example DUT Material Numbers (for Procurement & Test Planning)

The list below provides orderable part numbers / SKUs commonly used as DPU/SmartNIC DUT references. Exact ordering varies by port speed, form factor, memory size, and “crypto enabled/disabled” variants.

Category — vendor/platform — material no. / SKU (examples) — notes (what it anchors in validation):

  • DPU card — NVIDIA BlueField-3 B3220 (FHHL): 900-9D3B6-00CV-AA0 (crypto enabled), 900-9D3B6-00SV-AA0 (crypto disabled). 200GbE / NDR200 class; good for tail + thermal steady-state + offload evidence counters.
  • DPU card — NVIDIA BlueField-2 E-Series (FHHL): 900-9D219-0086-ST1 / MBF2M516A-CECOT, 900-9D219-0086-ST0 / MBF2M516A-EECOT. 100GbE class; useful for pps vs Gbps separation and repeatability manifests.
  • DPU card — AMD Pensando Elba-based Distributed Services Card: DSC2-2Q200-32R32F64P-S4. High-scale service chaining scenarios; validates state scale-up + steady-state drift.
  • SmartNIC / DPU option — HPE (Pensando) DSC-25 (2×10/25GbE): P26966-B21. Good for the "is offload actually active?" evidence chain in typical enterprise servers.
  • IPU / DPU adapter — Intel IPU Adapter E2100-CCQDA2: E2100CCQDA2RJG1 (example orderable code). Validates infrastructure offload + repeatability under fixed manifests.
  • SmartNIC adapter — Broadcom Stingray PS225 family (25GbE): PS225-H04 / BCM958802A8044C, PS225-H08 / BCM958802A8048C, PS225-H16 / BCM958802A8021C. Anchors SmartNIC-style pipeline validation; good for pps + tail under bursts and short flows.
  • DPU silicon — Marvell OCTEON 10 DPU family: CN102, CN103, CN106, CN106S. Chip-level anchor for DPU-class design discussions; useful when sourcing modules/platforms.

Procurement hint

For BOM work, record both the vendor ordering SKU and the variant attributes that change behavior in validation: port speed, onboard memory size, crypto enablement, OOB management presence, and cooling/heatsink option.

Figure F11 — Validation matrix: workload knobs → metrics bundle → evidence chain
Validation matrix: workload knobs (packet-size mix, burst pattern with spike + drain, flow-count ramp/churn, table-entry scale, key-update rhythm/sessions, thermal steady-state warm retest) map to a metrics bundle (Gbps always paired with pps; p50/p99/p999; CPU jitter variance; queue depth/drain; steady-state power/temperature) and to an evidence chain (hit/utilization counters, queue depth, power/temperature telemetry, throttle events, aligned timeline for cause → effect). Repeatability loop: same manifest → rerun N times → compare variance, not only the mean.

Use this matrix as a checklist: each workload knob must map to a metrics bundle and a measurable evidence chain. If any column is missing, the benchmark is not predictive of production tail-latency behavior.


H2-12 · FAQs (DPU / SmartNIC)

Each answer stays within the DPU/SmartNIC dataplane and card-level execution boundary: offload verification, queues/DMA, DDR usage/telemetry, crypto/compression trade-offs, thermal evidence, secure firmware flow, deployment shape, and p99/p999 validation.

1) What is the practical engineering boundary between a DPU and a SmartNIC?
Maps to: H2-1 / H2-10

A SmartNIC is typically a NIC-class card that accelerates and steers packets with limited platform governance, while a DPU is treated as a dataplane platform with stronger isolation, lifecycle controls, and “infrastructure services” ownership. The boundary is best defined by three questions: (1) must the dataplane enforce multi-tenant isolation and policies, (2) does the card require audited firmware lifecycle and attestation, and (3) is stable p99/p999 under load a primary goal rather than peak Gbps.

  • Choose SmartNIC when acceleration is narrow and governance is light.
  • Choose DPU when the dataplane becomes a shared, security-relevant infrastructure layer.
Scope note: no PHY/SerDes details; decision is based on responsibilities, evidence, and lifecycle constraints.
2) Why can throughput improve after offload, but p99 latency gets worse?
Maps to: H2-8 / H2-11

This usually indicates queueing or service-rate instability, not raw bandwidth limits. Common causes include burst-induced queue buildup, batching that increases tail, thermal steady-state throttling, or a hidden fallback path that activates under specific flows. The fastest way to prove it is an evidence chain: correlate (a) queue depth/drain time, (b) throttle events and frequency states, and (c) offload utilization counters vs traffic. If p99 worsens while Gbps stays flat, the tail is dominated by queue dynamics or warm-state drift.

  • Run a burst + warm steady-state retest (same manifest) and compare variance.
  • Verify counters track traffic; watch for fallback counters rising.
Scope note: no protocol deep dive; focus is on telemetry + repeatable test cases.
3) P4 vs eBPF: which fits observability, filtering, and ACL-style tasks?
Maps to: H2-3

Use P4 when packet processing needs a structured match/action pipeline with predictable per-packet behavior (L2–L4 steering, ACL-like matches, counters at deterministic points). Use eBPF when the task is driven by hooks and events (telemetry taps, selective filtering, per-flow sampling, troubleshooting-friendly iteration). When determinism and line-rate dataplane structure dominate, P4 is the clearer fit; when rapid iteration and observability dominate, eBPF is usually faster to operationalize.

  • P4 strength: explicit pipeline stages and table-driven actions.
  • eBPF strength: flexible attachment points and operational debugging.
Scope note: no compiler/verifier internals; selection is based on task shape and operational change cost.
4) When tables grow and latency “jitters,” is it DDR bandwidth or cache-miss behavior? How to tell?
Maps to: H2-7 / H2-8

Separate the two with controlled ramps. If performance degrades gradually with traffic volume but improves with better locality, it often points to memory-access efficiency (effective bandwidth). If p99/p999 shows a clear “knee” as entries scale, it often indicates a working-set transition where fast on-card resources are no longer effective and random access dominates. Prove it by ramping table entries while keeping traffic shape fixed, then correlating queue depth/drain time, power/temperature steady-state, and any ECC/retry counters available at a concept level.

  • DDR-limited patterns: sustained pressure, rising queues, warm-state drift.
  • Working-set “knee”: sharp tail jump at a specific entry scale.
Scope note: no DDR5 PMIC/RCD/DB device details; only required telemetry fields and ramp methodology.
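The knee test described above can be sketched as a simple ramp scan; the 2× jump factor is an assumed heuristic for illustration, to be calibrated against known-good runs:

```python
def find_knee(entries, p99_us, jump_factor=2.0):
    """Scan a table-entry ramp for the first step where p99 jumps by
    more than jump_factor vs the previous step — the working-set
    'knee' signature. Gradual degradation with no knee points toward
    bandwidth/locality pressure instead. jump_factor is an assumption."""
    for i in range(1, len(entries)):
        if p99_us[i] > p99_us[i - 1] * jump_factor:
            return entries[i]   # entry scale where the knee appears
    return None                 # no knee: suspect DDR-pressure pattern

entries = [10_000, 100_000, 1_000_000, 10_000_000]
p99_us  = [12.0,   14.0,    16.0,      95.0]
find_knee(entries, p99_us)   # knee at the 10M-entry step
```

Run the same scan on queue drain time; a knee that appears in both series at the same entry scale strengthens the working-set interpretation.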
5) SR-IOV enabled but performance is unstable—what queue/interrupt/doorbell indicators should be checked first?
Maps to: H2-6 / H2-8

Start with per-PF/VF queue health and evidence of backpressure. The first checks are: (1) per-queue occupancy and drain time, (2) drop/retry counters that spike during bursts, (3) imbalance across VFs (one VF queues up while others remain empty), and (4) symptoms consistent with excessive doorbell/update overhead (e.g., tail spikes without bandwidth saturation). Then correlate with thermal and event logs—warm-state throttling or firmware events can masquerade as “virtualization instability.”

  • Priority order: throttle/events → queue depth/drain → per-VF drops/retries → fallback counters.
  • Validate with a fixed manifest: SR-IOV off vs on, same flow/packet mix, compare p99 variance.
Scope note: no PCIe retimer tuning; only DPU-visible queue/DMA evidence and event correlation.
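The per-VF imbalance check in step (3) can be sketched as a median-relative filter; the queue depths and the 4× ratio below are illustrative, not vendor thresholds:

```python
from statistics import median

def vf_imbalance(queue_depths, ratio=4.0):
    """Flag any VF whose queue depth exceeds ratio × the median of
    its peers — the 'one VF queues up while others stay empty'
    signature. ratio is an assumed default."""
    med = median(queue_depths.values()) or 1   # avoid divide-by-zero floor
    return [vf for vf, depth in queue_depths.items() if depth > ratio * med]

vf_imbalance({"vf0": 12, "vf1": 9, "vf2": 480, "vf3": 11})   # flags vf2
```

A flagged VF is where to start correlating drops/retries and doorbell-pressure symptoms before blaming SR-IOV itself.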
6) Why does crypto offload often fail to pay off for small packets and short flows?
Maps to: H2-5

Small packets and short flows amplify per-flow and per-packet overhead. Setup costs (session churn, policy lookups, key-update rhythm) can dominate the payload work, and batching can inflate tail latency even if average throughput improves. The decision should be made with a break-even test: fixed packet-size mix + fixed flow churn + fixed session scale, then compare Gbps+pps together and inspect p99/p999. Offload is “worth it” only when utilization counters track traffic and tail remains bounded under the expected churn model.

  • Watch for: high setup cost, fallback activation, and queue buildup in bursts.
  • Require: offload utilization counters + stable warm-state retest.
Scope note: no TLS/IPsec deep dive; focus is on measurable trade-offs and operational break-even.
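The break-even framing can be sketched as per-flow setup amortization; all inputs below are illustrative numbers, not vendor data:

```python
def offload_pays_off(setup_us_per_flow, pkts_per_flow,
                     cpu_us_per_pkt, offload_us_per_pkt):
    """Concept-level break-even: offload wins only when the amortized
    per-flow setup cost plus the per-packet offload cost beats the
    CPU path's per-packet cost. Inputs are illustrative."""
    offload_cost = setup_us_per_flow / pkts_per_flow + offload_us_per_pkt
    return offload_cost < cpu_us_per_pkt

offload_pays_off(50.0, 1000, 1.2, 0.2)   # long flows: setup amortizes → worth it
offload_pays_off(50.0, 5,    1.2, 0.2)   # short flows: setup dominates → not worth it
```

This model deliberately ignores tail effects; a positive break-even still has to survive the p99/p999 inspection described above.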
7) What is the biggest compression offload pitfall—high ratio but exploding latency?
Maps to: H2-5

The biggest pitfall is optimizing for ratio while ignoring flush/batch and queueing effects. Compression often works on blocks, so block formation, buffering, and reorder/flush behavior can introduce head-of-line delay and p99 spikes even when average throughput looks better. This becomes severe with bursty traffic, mixed payload sizes, or when memory pressure increases. A correct evaluation records p99/p999 across burst and warm steady-state phases, and correlates compression utilization, queue depth/drain time, and any “fallback” indicators.

  • Compression is safest when payloads are large and workload shape is stable.
  • It is risky when short messages, frequent flush, or bursty queues dominate.
Scope note: no algorithm internals; focus is on block/queue behavior and tail evidence.
8) Inline vs off-path deployment: is the biggest difference reliability or observability?
Maps to: H2-10

Both change, but the primary engineering distinction is the failure domain and the enforcement point. Inline placement can enforce and accelerate directly in the traffic path, but it expands the blast radius: card faults, firmware issues, or throttling can directly impact traffic. Off-path placement improves visibility and reduces blast radius, but enforcement becomes indirect and may lag behind the data path. Selection should be driven by service objectives: tail-latency stability, rollback tolerance, and whether enforcement must be synchronous with packet forwarding.

  • Inline: strongest enforcement, strict lifecycle discipline required.
  • Off-path: safer failure domain, best for observability-first objectives.
Scope note: not a security appliance tutorial; only deployment trade-offs at the card/system boundary.
9) What are the most typical symptoms of DPU thermal throttling, and how can telemetry prove it?
Maps to: H2-8

Thermal throttling typically appears as slow p99 drift that becomes obvious only after warm steady-state, while average throughput may look unchanged. The signature is a causality chain: temperature and power approach a plateau → throttle/frequency events appear → service rate drops → queue depth and drain time rise → p99/p999 jumps. The proof requires aligned timestamps across telemetry, counters, and events. Without an aligned timeline, thermal issues are often misdiagnosed as “random network jitter” or “virtualization instability.”

  • Capture: temperature, power, frequency states, throttle events, queue depth/drain, p99/p999.
  • Repeat: cold run vs warm steady-state retest under the same manifest.
Scope note: no rack cooling design; only card telemetry and evidence correlation.
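The cold-vs-warm retest can be sketched as a drift gate; the 15% budget below is an assumed tolerance, not a specification:

```python
def warm_drift(cold_p99_us, warm_p99_us, budget=1.15):
    """Thermal-throttle signature: p99 drifts after warm steady-state
    while average throughput looks unchanged. budget is an assumed
    15% tolerance; pair a True result with throttle-event evidence
    before concluding thermal causation."""
    return warm_p99_us > cold_p99_us * budget

warm_drift(40.0, 41.5)   # within budget → no drift flagged
warm_drift(40.0, 62.0)   # warm retest much worse → drift flagged
```

A flagged drift without any throttle/frequency events on the aligned timeline suggests a different cause (e.g., state growth), not thermals.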
10) How can firmware updates prevent a rollback where an old version remains usable?
Maps to: H2-9

Anti-rollback is achieved by combining signed boot with a monotonic version gate. A practical approach is: (1) verify signatures in the boot chain, (2) store a monotonic “minimum allowed version” in protected storage, (3) use A/B slots with controlled promotion, and (4) emit auditable update events (requested version, installed version, promotion result, and a rollback-deny reason). Attestation should report version and measurement summaries so the platform can reject nodes that fall below policy.

  • Require: signed images + monotonic version policy + A/B governance + audit logs.
  • Validate: forced downgrade attempt must fail with a recorded reason.
Scope note: no TPM/HSM component selection; only the DPU-side lifecycle and evidence outputs.
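A minimal sketch of the monotonic version gate and audit trail; a real design keeps the floor in protected storage and verifies image signatures before this check, both out of scope here:

```python
class UpdateGate:
    """Sketch of a monotonic anti-rollback gate with auditable events.
    min_allowed models the protected 'minimum allowed version'."""
    def __init__(self, min_allowed: int):
        self.min_allowed = min_allowed
        self.audit = []          # append-only update event log

    def try_install(self, version: int) -> bool:
        ok = version >= self.min_allowed
        self.audit.append({"requested": version, "result": ok,
                           "reason": None if ok else "below_min_allowed"})
        if ok:
            # Promotion ratchets the floor forward; it never moves back.
            self.min_allowed = max(self.min_allowed, version)
        return ok

gate = UpdateGate(min_allowed=7)
gate.try_install(9)   # upgrade: accepted, floor ratchets to 9
gate.try_install(6)   # forced downgrade: denied, with a recorded reason
```

Note how the validation step from the bullets falls out for free: the forced-downgrade attempt fails and leaves a rollback-deny reason in the audit log.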
11) How can offload be proven to be truly active (not silently falling back to the host CPU)?
Maps to: H2-4 / H2-11

Proof requires a three-part evidence chain: (1) offload utilization counters (or pipeline hit counters) must scale with traffic volume and flow mix, (2) fallback/slow-path counters must remain near zero under the intended workload, and (3) host-side disturbance should improve in the expected direction (reduced CPU jitter/variance, not just lower mean CPU%). The clean method is a fixed manifest A/B test: offload disabled vs enabled, same traffic + state scale, then compare p99/p999 and counter correlations.

  • Offload “on” must change counters and tail behavior consistently across repeated runs.
  • If counters do not correlate with traffic, the dataplane path is not the one assumed.
Scope note: no deep vSwitch internals; only verification methodology and measurable signals.
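The counter-correlation part of the evidence chain can be sketched as a two-signal gate; the correlation and fallback thresholds are assumptions for illustration:

```python
def counters_track_traffic(traffic_pps, offload_hits, fallback_hits,
                           min_corr=0.9, max_fallback_ratio=0.01):
    """Two-signal check: (1) offload hit counters must correlate with
    traffic volume, and (2) fallback/slow-path counters must stay near
    zero. Thresholds are assumed defaults, not standards."""
    # Pearson correlation over per-interval samples, stdlib only.
    n = len(traffic_pps)
    mx = sum(traffic_pps) / n
    my = sum(offload_hits) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(traffic_pps, offload_hits))
    sx = sum((x - mx) ** 2 for x in traffic_pps) ** 0.5
    sy = sum((y - my) ** 2 for y in offload_hits) ** 0.5
    corr = cov / (sx * sy) if sx and sy else 0.0
    fb_ratio = sum(fallback_hits) / max(sum(traffic_pps), 1)
    return corr >= min_corr and fb_ratio <= max_fallback_ratio

good = counters_track_traffic([100, 200, 300, 400], [99, 201, 298, 405], [0, 1, 0, 1])
bad  = counters_track_traffic([100, 200, 300, 400], [50, 50, 50, 50], [0, 0, 0, 0])
```

In the second call, hit counters stay flat while traffic ramps — the "dataplane path is not the one assumed" signature from the bullets above.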
12) For selection, what 10 datapoints should be requested from a vendor (and tied to an exact material number)?
Maps to: H2-11 / H2-10

Request a comparison sheet that is valid only if it references the exact SKU/OPN under test (material number, firmware version, cooling option), because variants can change tail behavior. The most decision-relevant 10 datapoints are:

  • Gbps + pps together (must disclose packet-size mix).
  • p50/p99/p999 under at least one burst + one steady-state case.
  • Flow count and short-flow ratio limits (state churn) where tails “knee.”
  • Table/entry scale where tails “knee” (entries and bytes/entry assumption).
  • Crypto: sessions scale + key-update cadence limits (concept-level disclosure).
  • Compression: throughput vs tail trade-off and block/flush assumptions (test conditions).
  • Thermal steady-state: power, temperature plateau, and throttle event behavior.
  • Telemetry coverage: queue depth/drain, util counters, event timestamps (what is observable).
  • Virtualization boundary: PF/VF scale and isolation-related counters that remain visible.
  • Lifecycle governance: A/B update, rollback policy evidence, and attestation output fields.

Example material numbers commonly cited in evaluation requests include: 900-9D3B6-00CV-AA0 / 900-9D3B6-00SV-AA0 (BlueField-3 B3220), 900-9D219-0086-ST1 / 900-9D219-0086-ST0 (BlueField-2 E-Series), P26966-B21 (HPE Pensando DSC-25), and PS225-H04 / PS225-H08 / PS225-H16 (Broadcom Stingray PS225).

Scope note: no PHY/retimer parameters; the request focuses on tail, repeatability, telemetry, and governance tied to the exact SKU/OPN.
Figure F12 — FAQ coverage map (questions → chapter evidence)
FAQ clusters and their anchor chapters: boundary & deployment (Q1, Q8, Q12 → H2-1 / H2-10, definition + deployment); programmability choice (Q3 → H2-3, P4 / eBPF selection); offload trade-offs (Q2, Q6, Q7, Q11 → H2-5, crypto / compression); queues / SR-IOV stability (Q5 → H2-6, queues / DMA); memory scale & telemetry (Q4, Q9 → H2-7 / H2-8, DDR scale + telemetry timeline); firmware lifecycle (Q10 → H2-9, secure boot + anti-rollback); validation and repeatability anchored in H2-11.

A single-page FAQ is most useful when each answer points to measurable evidence (queues, counters, telemetry, events) inside the same DPU/SmartNIC boundary.