NIC / HCA (Ethernet & InfiniBand) Design & Debug Guide


A NIC/HCA is more than “link up”: real success is stable throughput and predictable p99 latency under load, proven by PHY/FEC counters, queue/interrupt behavior, and thermal/power telemetry. This page explains how Ethernet NICs and RDMA-focused HCAs move packets from PCIe DMA to SerDes/optics, and how to use the right evidence to select, validate, and troubleshoot them in production.

H2-1 · Scope & Boundary

Scope & Boundary: what this page covers (and what it does not)

This chapter is the page “contract”: it defines the owner of the topic, the practical boundary against neighboring pages, and the exact evidence this page will use for selection and debug.

In scope (owner statements)

  • NIC: Ethernet-centric I/O that reduces host CPU work via offloads (e.g., RSS, TSO, checksum) and exposes actionable telemetry.
  • HCA: RDMA-centric I/O that prioritizes tail latency control and verbs semantics (QP/CQ) for RoCE/InfiniBand deployments.
  • System boundary: host-side is a PCIe endpoint (DMA + queues); line-side is PCS/PMA + SerDes linking to optics/cable and the fabric.

Out of scope (guardrails)

  • Retimer/Switch deep-dive: this page only uses NIC-visible symptoms (BER/FEC counters, link training behavior), not retimer internals.
  • Rack timing architecture: this page only covers on-NIC timestamp boundaries; system sync trees belong to the Time Card topic.
  • Programmable dataplane: SmartNIC/DPU compute pipelines are excluded; only a “see also” reference is allowed.

Use this page when a link is “up” but throughput is unstable, FEC-corrected errors trend with temperature, p99 latency explodes under microbursts, or RDMA workloads show rare but severe tail events.

Figure A — Boundary map for NIC/HCA (single-page scope)
Blocks show what is owned here vs “see also” neighbors. Minimal text; element-first layout.
Boundary rule (from the diagram): only NIC-visible evidence is discussed here (FEC/BER, queues, offloads, RDMA behavior, thermal/power hooks); DPU/SmartNIC, Time Card, and PCIe retimer topics are "see also" neighbors.
H2-2 · 1-minute Definition

1-minute definition + first-principles selection questions

How it works (3–5 steps)

  1. DMA & queues: descriptors in host memory define what to transmit/receive and where to place data.
  2. Pipeline: parsing, classification, steering (RSS/flow rules) and optional offloads are applied.
  3. RDMA (if used): verbs/QP/CQ scheduling maps requests to DMA operations with tight latency targets.
  4. Line side: MAC/PCS/PMA drive SerDes lanes; PAM4 DSP and FEC protect the link margin.
  5. Telemetry: counters and sensors (FEC/BER, module DDM, temperature, power) explain “why” performance changes.

First principles: what to check first (before brand/model)

  • Throughput vs packet rate: Gbps describes bandwidth; Mpps predicts small-packet behavior and CPU pressure.
  • Tail latency: p50 can look good while p99/p999 fails under microbursts, queue backpressure, or rare retries.
  • RDMA stability (if required): “supported” is not enough—validate behavior under congestion and loss sensitivity.
  • Observability: FEC corrected/uncorrected, BER proxies, module DDM/CMIS, and event logs must exist and be readable.
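To make the Gbps-vs-Mpps distinction concrete before reading the spec table below, the theoretical packet-rate ceiling follows directly from the on-wire size of each frame. A minimal worked sketch in Python, assuming only the standard Ethernet per-frame overhead (preamble/SFD plus minimum inter-frame gap):

```python
def max_packet_rate_mpps(line_rate_gbps: float, frame_bytes: int) -> float:
    """Theoretical packet-rate ceiling for a line rate and frame size.

    On the wire every Ethernet frame also carries preamble + SFD (8 B) and
    the minimum inter-frame gap (12 B), i.e. 20 B of fixed overhead.
    """
    overhead_bytes = 8 + 12
    bits_per_frame = (frame_bytes + overhead_bytes) * 8
    return line_rate_gbps * 1e9 / bits_per_frame / 1e6

# 100 GbE: ~148.8 Mpps at 64 B, but only ~8.2 Mpps at 1500 B.
for size in (64, 128, 512, 1500):
    print(f"{size:>5} B -> {max_packet_rate_mpps(100, size):7.2f} Mpps")
```

The gap between this ceiling and a card's measured Mpps is the per-packet cost that later chapters attribute to descriptors, queues, and interrupts.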
Spec | What it predicts | Questions to ask | Evidence to expect
Gbps (port speed) | Peak bandwidth ceiling; module/cage compatibility | Which port form factor (QSFP/OSFP)? Which lane mode and FEC requirement? | Link mode report; FEC capability; module DDM/CMIS readout
Mpps (packet rate) | Small-packet performance, interrupt/queue pressure | Mpps at 64B/128B? Single-flow vs multi-flow? Offload on/off conditions? | Packet-rate curve; CPU cost; queue depth/overrun counters
p99 (latency) | Tail risk under microbursts, retries, congestion sensitivity | p50/p99/p999 under defined load? With and without interrupt coalescing? | Latency histograms; retry/replay indicators; queue backpressure signals
RDMA (RoCE/IB) | Remote-memory semantics, tail stability, loss sensitivity | RoCEv2 vs IB mode? What counters exist for congestion/loss events? | QP/CQ health; error syndromes; stable tail under stress profile
FEC (errors) | Link margin vs DSP/FEC workload; "link up but unstable" root cause | Corrected vs uncorrected thresholds? Lane-level counters exposed? | FEC corrected/uncorrected trend; lane error distribution
Thermal (power) | Throttling risk, temperature-coupled error patterns | Max power at target link mode? Thermal sensors? Throttle behavior? | Temperature/power logs; performance vs temperature correlation
Figure B — Key specs map (bandwidth, Mpps, tail, RDMA, telemetry)
Iconic spec tiles with minimal labels (≥18px), optimized for mobile readability.
Takeaway (from the diagram): selection becomes reliable when specs are tied to evidence, i.e. counters, sensors, and repeatable test conditions.
H2-3 · System Architecture Map

System architecture map: PCIe to QSFP/OSFP end-to-end chain

This chapter establishes a fast “evidence chain” across three domains—Host/PCIe, NIC core, and Line side—so unstable throughput, tail latency spikes, or rising corrected errors can be assigned to the right layer before deeper debug.

Three domains (who owns what)

  • Host / PCIe domain: DMA submissions, queue doorbells, MSI-X interrupts, and (light) IOMMU/ATS effects on addressability and completion flow.
  • NIC core domain: MAC + packet pipeline, flow steering, offload engines, and scheduling that turns descriptors into predictable packet movement.
  • Line side domain: PCS/PMA and SerDes lanes connected to QSFP/OSFP modules and the fabric, with link training and error protection (FEC).
Domain | What it controls | Typical symptom | NIC-visible evidence
Host/PCIe | DMA pacing, doorbell cadence, completion flow, interrupt pressure | Periodic throughput drops; CPU spikes; jitter tied to host load | Queue pressure signals; completion stalls; PCIe error category flags
NIC core | Parsing/classification, steering, offloads, scheduling | Gbps looks fine but Mpps collapses; p99 worsens under microbursts | Drops/overruns; scheduler backpressure; flow distribution imbalance
Line side | Link mode, training, lane equalization, FEC protection | Link stays up but stability degrades; error rates track temperature | FEC corrected/uncorrected trend; lane errors; module DDM/CMIS signals

Boundary reminder: the diagram and text stay at a vendor-neutral functional level. No proprietary block names are used, and external fabric configuration is not discussed—only NIC-visible counters, states, and correlations.

Figure C — NIC/HCA end-to-end block diagram (with telemetry hooks)
Box-style architecture map: Host/PCIe → NIC core → PCS/PMA/SerDes → QSFP/OSFP, plus clocks/power/thermal/NVM/mgmt bus.
Diagram note: the support plane (clock/PLL, power rails, NVM, management bus over I²C/SMBus/MCTP) is shown as a dotted boundary; it is referenced here but not expanded.
H2-4 · SerDes & PHY

SerDes & PHY deep dive: PAM4 DSP, FEC, training, and “link up but unstable” behavior

“Link up” only confirms that training reached a workable state. Throughput instability often appears when margin is thin: PAM4 relies on DSP and FEC to keep BER acceptable, and that compensation can vary with temperature, lane conditions, and jitter.

PAM4 vs NRZ: what changes in practice

  • PAM4: smaller voltage spacing per level → higher sensitivity to noise, jitter, and non-ideal channel response.
  • NRZ: larger margin → fewer “near-edge” states where corrected errors climb while the link stays up.
  • Operational signal: rising FEC corrected without immediate uncorrected events often indicates a margin tax that can later convert into tail latency risk.

DSP chain: symptom → module → evidence

  • High-frequency loss → CTLE helps restore high-frequency content; lane-to-lane corrected-error imbalance is a common hint.
  • ISI (inter-symbol interference) → FFE/DFE compensate channel memory; training "not converged" or unstable EQ states often appear before hard failures.
  • Jitter / phase-noise sensitivity → CDR stability matters; errors that correlate with temperature or operating mode suggest clock/jitter stress.
  • Residual random errors → FEC absorbs them until uncorrected events start; the corrected/uncorrected ratio is a leading indicator.

FEC counters: how to read them without guessing

  • Corrected trending up: the link is spending “error budget” to stay stable; margin is shrinking or compensation load is rising.
  • Uncorrected events: represent real frame loss at the physical layer; they amplify retries and tail latency, especially for loss-sensitive transports.
  • Lane concentration: errors localized to a subset of lanes often indicate a channel/connector/module lane-specific issue rather than a global mode mismatch.
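A minimal sketch of this reading discipline, assuming two snapshots of per-lane corrected/uncorrected counters have already been collected (counter names, lane grouping, and the collection path vary by driver; the rate threshold below is an illustrative placeholder, not a vendor limit):

```python
def judge_fec(prev: dict, curr: dict, interval_s: float) -> list[str]:
    """Classify FEC counter movement between two snapshots.

    `prev` / `curr` are assumed to look like:
        {"corrected": [per-lane counts], "uncorrected": [per-lane counts]}
    """
    findings = []
    corr_delta = [c - p for c, p in zip(curr["corrected"], prev["corrected"])]
    uncorr_delta = [c - p for c, p in zip(curr["uncorrected"], prev["uncorrected"])]

    if sum(uncorr_delta) > 0:
        findings.append("uncorrected events present: physical-layer loss; expect retries and tail impact")

    corr_rate = sum(corr_delta) / interval_s
    if corr_rate > 1e3:  # illustrative threshold, not a vendor limit
        findings.append(f"corrected rate ~{corr_rate:.0f}/s: margin is being spent; trend it against temperature")

    total = sum(corr_delta) or 1
    worst = max(corr_delta) if corr_delta else 0
    if len(corr_delta) > 1 and worst / total > 0.6:
        findings.append(f"errors concentrate on lane {corr_delta.index(worst)}: "
                        "suspect a lane-specific channel/connector/module issue")

    return findings or ["no significant FEC movement in this interval"]
```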

Bring-up checklist (line-side): what to validate first

  • Mode agreement: autoneg / capability match, correct lane count and mapping for the selected module mode.
  • Training success: link training reaches stable EQ states (TX/RX equalization) and lane deskew completes.
  • Polarity/mapping sanity: polarity inversions and lane swaps are tolerated only if mapping is consistent end-to-end.
  • Correlations: plot corrected errors against temperature/power and module DDM to identify margin-driven instability.
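For the last checklist item, a plain correlation check over time-aligned samples is often enough to decide whether errors track temperature; a minimal sketch with illustrative numbers (real samples must share the same timestamps and sampling interval):

```python
from statistics import correlation  # requires Python 3.10+

# Illustrative, time-aligned samples (same timestamps, same order, fixed interval).
corrected_per_min = [120, 150, 180, 400, 950, 1400, 1500]
temp_c            = [45,  47,  50,  58,  66,   72,   74]

r = correlation(corrected_per_min, temp_c)
print(f"Pearson r = {r:.2f}")
# r close to +1 supports margin-driven, temperature-coupled instability;
# rising errors with r near 0 point back to channel- or lane-specific causes.
```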
Figure D — PAM4 SerDes DSP + FEC pipeline with a readout panel
Pipeline diagram maps symptoms to DSP blocks; side panel lists the minimal counters/sensors to prove layer and correlation.
Symptom-to-block mapping (from the diagram): loss → CTLE, ISI → FFE/DFE, jitter → CDR, residual random errors → FEC. The goal is proof: correlate counters and sensors with channel, clock, or thermal conditions without guessing.
H2-5 · Packet Pipeline & Offload

Packet pipeline & offload: real boundaries and performance traps (Gbps vs Mpps vs p99)

High wire-rate throughput does not guarantee strong small-packet Mpps or stable p99 latency. The common root cause is per-packet work along the descriptor → queue → DMA → completion → interrupt/coalescing chain, plus pipeline depth and contention.

Why “Gbps is high” but “Mpps collapses”

  • Per-packet overhead dominates: each packet consumes descriptor work, queue arbitration, DMA bookkeeping, and completion processing.
  • Interrupt/coalescing effects: moderation improves CPU efficiency but can stretch tail latency when bursts arrive or queues back up.
  • Pipeline depth and contention: parse/classify/steer stages have finite capacity; microbursts convert into queue backlog and p99 spikes.
  • Evidence pattern: wire-rate looks healthy while drops/overruns, queue pressure, or completion stalls increase.
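The per-packet budget behind these bullets can be made explicit: divide the CPU cycles available per second by the packet rate that must be sustained. A minimal sketch (core count and clock are illustrative assumptions):

```python
def cycles_per_packet(core_ghz: float, cores: int, target_mpps: float) -> float:
    """CPU cycles available per packet at a sustained packet rate."""
    return core_ghz * 1e9 * cores / (target_mpps * 1e6)

# 64 B line rate at 100 GbE is ~148.8 Mpps (see the arithmetic in H2-2).
for cores in (1, 4, 8):
    budget = cycles_per_packet(3.0, cores, 148.8)
    print(f"{cores} core(s) @ 3.0 GHz -> {budget:6.0f} cycles per packet")
# ~20 cycles per packet on one core leaves no room for descriptor work,
# completions, and interrupt handling, which is why Mpps collapses first.
```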

Offload catalog: what it helps, what it can hurt

Offload | Primary benefit | Common trap | NIC-visible evidence
Checksum | Saves host CPU per packet | Does not fix queue/interrupt bottlenecks; Mpps can still be limited | CPU cycles saved while queue pressure is unchanged
TSO/GSO | Reduces TX per-packet overhead for large payloads | Little impact on true small-packet traffic; segmentation does not apply | Higher Gbps with a similar 64B Mpps ceiling
LRO/GRO | Reduces RX packet-rate pressure | Can increase latency variance; may change observability and fairness during bursts | Lower packet rate with p99 changes
RSS / steering | Spreads flows across queues/cores | Imbalanced distribution or queue hotspots can worsen tail latency | Queue-to-queue skew; hotspot drops
Inline crypto | Offloads encryption/decryption for supported paths | Does not replace key/attestation systems; can still be bounded by per-packet pipeline limits | CPU saved with unchanged queue/IRQ limits

Metrics hierarchy (procurement + validation)

  • Wire rate (Gbps): large-packet ceiling; necessary but not sufficient.
  • Mpps (small packets): reveals per-packet cost and queue/interrupt capacity.
  • p99 latency: exposes backlog, coalescing, retries, and burst sensitivity.
  • CPU cycles saved: quantifies offload value without hiding tail-risk.

Practical rule: when Gbps is stable but user-facing latency becomes “spiky,” treat queue backlog and completion/interrupt behavior as first-class suspects before chasing link-level causes.

Figure E — Latency path breakdown (RX → app) with evidence points
Box diagram highlights where Mpps limits and tail latency typically form: descriptors/queues/DMA/completions/IRQ moderation.
Evidence points (from the diagram): drops/overruns, queue pressure, completion stalls, IRQ rate/coalescing. Metrics hierarchy: wire rate (Gbps) · small packets (Mpps) · tail latency (p99) · CPU saved.
H2-6 · RDMA

RDMA in practice: RoCEv2 vs InfiniBand engineering differences, congestion myths, and why tail latency explodes

RDMA is not a single feature toggle. It is an end-to-end completion model (verbs/QP/CQ) that reacts strongly to loss and congestion. Small amounts of retry/timeout can amplify into a backlog that dominates p99/p999 latency.

RoCEv2 vs InfiniBand: “engineering differences” (not a glossary)

  • RoCEv2: runs over Ethernet transport. Practical stability depends on loss/congestion management in the path (mentioned only; no switch details here).
  • InfiniBand: fabric semantics are native and consistent for RDMA traffic, with a more unified verbs/QP/CQ operating model end-to-end.
  • Implication for p99: RoCE environments often show stronger correlation between congestion signals and tail latency; IB environments still suffer tail growth when retries or resource backlogs occur.

QP/CQ/doorbell/WQE: boundaries that create tail latency

  • WQE (work request): the unit of outstanding work. Accumulated outstanding WQEs are the physical form of latency backlog.
  • QP (queue pair): ordering and resource window. A blocked QP delays later work even if the link stays up.
  • CQ (completion queue): completion aggregation. Completion handling cadence affects tail behavior under bursts.
  • Doorbell: submission pressure. Batch submission improves efficiency but can increase waiting time when congestion or retries appear.

Why “a little loss” becomes “huge tail latency”

  • Retry/timeout amplification: once retransmit or timeout logic activates, outstanding work waits behind recovery.
  • Backlog propagation: queued work accumulates while completions slow down; p50 can look fine while p99/p999 degrade sharply.
  • Evidence-first approach: correlate retry/timeout categories with queue backlog, completion error categories, and congestion indicators visible to the NIC/HCA.
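The amplification is easy to reproduce with a toy queueing model: a tiny fraction of operations pays a retry/timeout penalty, and everything queued behind them waits. A minimal sketch with illustrative numbers (service time, timeout, and loss probability are assumptions, not vendor figures):

```python
import random

def simulate_tail(n_ops=200_000, interarrival_us=10.0, service_us=2.0,
                  timeout_us=1000.0, p_retry=0.001, seed=1):
    """Single-queue FIFO model: rare retries stall everything queued behind them."""
    random.seed(seed)
    backlog_us = 0.0                                         # unfinished work ahead of a new arrival
    latencies = []
    for _ in range(n_ops):
        backlog_us = max(0.0, backlog_us - interarrival_us)  # time passes between arrivals
        cost = service_us + (timeout_us if random.random() < p_retry else 0.0)
        latencies.append(backlog_us + cost)                  # wait behind the backlog, then be served
        backlog_us += cost
    latencies.sort()
    pick = lambda q: latencies[int(q * (len(latencies) - 1))]
    return pick(0.50), pick(0.99), pick(0.999)

p50, p99, p999 = simulate_tail()
print(f"p50={p50:.1f} us   p99={p99:.1f} us   p99.9={p999:.1f} us")
# With 0.1% retries the p50 stays near the 2 us service time while p99/p99.9
# inherit the 1 ms timeout plus whatever backlog formed behind it.
```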

What can be proven from the NIC/HCA side (without fabric configuration)

  • Congestion signals (category): marks/pause-like indicators and rising queue occupancy patterns that align with latency spikes.
  • Retry evidence (category): retransmit and timeout trends that precede uncorrected loss symptoms.
  • Completion evidence (category): completion errors or slowed completion cadence that match tail growth.
Figure F — RDMA data path (verbs → QP/CQ → DMA) with observation points
Box diagram shows how WQEs, doorbells, QPs, and completions interact; panels mark where NIC-visible counters prove congestion vs retry vs backlog.
Observation points (from the diagram, NIC-visible categories): retry/timeout, queue backlog, completion errors, congestion indicators. Goal: prove whether tail latency is driven by congestion, retries, or backlog without fabric configuration details.
H2-7 · Queues, Virtualization & Steering

Queues, virtualization, and steering: how RSS / SR-IOV / MSI-X “split performance” and keep it controllable

Parallelism in a NIC/HCA is not “more threads.” It is a controlled mapping: flows → queues → DMA/completions → interrupts → CPU cores. When the mapping is unstable or oversubscribed, symptoms look like software jitter while the root cause is queue backpressure, doorbell write amplification, and cache locality loss.

Queue mental model (what actually runs in parallel)

  • RX path: flow classification → RX queue → DMA to host → completion → MSI-X interrupt (or polling).
  • TX path: TX queue → descriptors → DMA fetch → scheduling → wire.
  • What it controls: wire rate (Gbps) is not enough; Mpps and p99 latency are dominated by queue pressure and completion cadence.

RSS and flow steering: benefits, and why “parallel” can become “noisy”

  • RSS benefit: distributes flows across multiple queues to raise parallel throughput and reduce per-core hotspots.
  • Hidden cost: cache locality can deteriorate when flows bounce across cores; tail latency rises under bursts.
  • Ordering boundary: a single flow should remain on one queue; re-mapping or mixed steering policies can create reordering-like behavior.
  • Flow steering role: prioritizes control (stable mapping, predictable p99) over “more randomness.”
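The ordering boundary and the skew risk are easiest to see as a mapping exercise. A minimal sketch that uses a generic hash in place of the NIC's real RSS function (hardware typically uses a Toeplitz hash with a configurable key; this stand-in only illustrates the stability and distribution questions):

```python
from collections import Counter
from hashlib import blake2b

N_QUEUES = 8

def rss_queue(src_ip, dst_ip, src_port, dst_port, proto="udp") -> int:
    """Map a 5-tuple to an RX queue. Generic stand-in for the NIC's Toeplitz-based RSS."""
    key = f"{proto}:{src_ip}:{src_port}>{dst_ip}:{dst_port}".encode()
    return int.from_bytes(blake2b(key, digest_size=4).digest(), "big") % N_QUEUES

# Property 1: the same flow always lands on the same queue (ordering preserved).
assert rss_queue("10.0.0.1", "10.0.0.2", 40000, 4791) == \
       rss_queue("10.0.0.1", "10.0.0.2", 40000, 4791)

# Property 2: distribution skew across many flows (hotspot queues hurt p99).
load = Counter(rss_queue("10.0.0.1", "10.0.0.2", 30000 + i, 4791) for i in range(4000))
print("flows per queue:", dict(sorted(load.items())))
print("max/min queue ratio:", round(max(load.values()) / min(load.values()), 2))
```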

SR-IOV (PF/VF): isolation is resource slicing

Concept | Practical pitfall | NIC-side proof category
VF count | More VFs can reduce per-VF queue depth and vectors; the aggregate looks fine but per-tenant Mpps/p99 degrades | Queue pressure skew across VFs; completion backlog per VF
Queue resources | Too few queues per VF force contention; hot flows collide and create tail spikes | Hotspot queues, drops/overruns, uneven queue occupancy
Locality | Remote placement (NUMA distance) increases DMA/completion time variance | Completion stalls correlated with placement changes

MSI-X and coalescing: the small-packet vs p99 trade-off knob

  • MSI-X: provides multi-vector interrupts so multiple queues can complete independently (reduces single-IRQ bottlenecks).
  • Coalescing: batches interrupts to reduce CPU overhead, but can increase tail latency by “holding” completions during bursts.
  • Evidence-first rule: interpret p99 spikes together with IRQ rate, queue depth, and completion cadence—not by throughput alone.
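The coalescing trade-off can be bounded with simple arithmetic: the timer and frame-count knobs cap both the IRQ rate and the extra completion-holding delay a packet can see. A minimal sketch (the knob values are illustrative, not recommendations):

```python
def coalescing_bounds(pps: float, usecs: float, max_frames: int):
    """Rough bounds for simple interrupt coalescing.

    Idealized model of rx-usecs / rx-frames style knobs: an interrupt fires when
    `usecs` elapse or `max_frames` completions accumulate, whichever comes first.
    """
    batch_us = min(usecs, max_frames / pps * 1e6)
    irq_per_sec = min(pps, 1e6 / batch_us)   # cannot exceed one interrupt per packet
    worst_added_delay_us = batch_us          # the first completion of a batch waits longest
    return irq_per_sec, worst_added_delay_us

for usecs in (1, 20, 100):                   # illustrative knob settings
    irqs, delay = coalescing_bounds(pps=1e6, usecs=usecs, max_frames=64)
    print(f"rx-usecs={usecs:>3}: <= {irqs:,.0f} IRQ/s, up to {delay:.0f} us of held-back completion")
```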

When it looks like software: random jitter, uneven CPU load, or “driver hiccups” often map to queue backpressure, doorbell amplification (too many submissions per work), or cache miss amplification under bursty traffic.

Figure G — Queue map (flows → RSS/steering → queues → MSI-X → cores/NUMA)
Shows how performance becomes controllable only when mapping and resource slicing are explicit and stable.
Evidence (from the diagram): queue depth · queue skew · IRQ rate · completion backlog · drops/overruns.
H2-8 · PCIe Host Interface (NIC viewpoint)

PCIe host interface (NIC endpoint view): why AER, link downshift, and retraining look like “network jitter”

A NIC/HCA can show throughput drops and latency spikes even when the line side is clean. From the endpoint perspective, link width/speed changes, retraining, and error recovery (AER categories) can stretch completions and stall DMA, which propagates upward as “packet jitter” symptoms.

What matters from the endpoint perspective

  • Link width / speed changes: sudden ceiling shift (step-like throughput drop) without obvious line-side errors.
  • Retraining / recovery episodes: short stalls that resemble “micro-disconnects” to upper layers.
  • AER categories: error handling can trigger retries or recovery paths; tail latency grows even if p50 remains normal.
  • Replay / completion timeout (category): completions slow down, queue pressure increases, and p99 spikes follow.

How to prove “network symptoms” are actually PCIe-driven

  • Step 1: check line-side error trends (category). If they remain flat while performance shifts, host-side probability rises.
  • Step 2: correlate PCIe link state changes (width/speed/retraining/AER categories) with the time of throughput drops or p99 spikes.
  • Step 3: confirm completion stalls and queue pressure increase during those PCIe events.
  • Step 4: conclude the root layer: completions slowed → DMA progress slowed → queues back up → "network jitter" becomes visible.
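Step 2 is mostly bookkeeping once both sides are exported with timestamps; a minimal sketch of the correlation, assuming PCIe link-state/AER events and per-interval p99 samples are available as timestamped records (field names and values are illustrative):

```python
def events_near_spikes(pcie_events, p99_samples, p99_budget_us, window_s=2.0):
    """Return p99 spikes that have a PCIe event within +/- window_s seconds.

    pcie_events : list of (timestamp_s, description) tuples
    p99_samples : list of (timestamp_s, p99_us) tuples
    """
    hits = []
    for ts, p99 in p99_samples:
        if p99 <= p99_budget_us:
            continue
        nearby = [ev for ev_ts, ev in pcie_events if abs(ev_ts - ts) <= window_s]
        hits.append((ts, p99, nearby))
    return hits

# Illustrative data: a width downshift and an AER-corrected burst around t = 120 s.
pcie_events = [(119.4, "link width x16 -> x8"), (119.6, "AER corrected burst")]
p99_samples = [(60, 80), (120, 2400), (180, 85)]
for ts, p99, evs in events_near_spikes(pcie_events, p99_samples, p99_budget_us=200):
    verdict = "PCIe-correlated" if evs else "no PCIe event nearby; look elsewhere"
    print(f"t={ts}s p99={p99}us: {verdict} {evs}")
```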

Firmware/NVM changes: why versions can shift performance or compatibility (principles)

  • Default queue and interrupt behavior can change: different moderation defaults alter Mpps and p99 trade-offs.
  • Virtualization resource slicing can change: VF limits, queue allocation, or completion handling policies may differ across versions.
  • Error handling paths can change: recovery behavior affects how often and how long “stalls” appear in the field.
  • Validation rule: after any version change, re-check the same evidence categories (link state, AER categories, completion cadence, queue pressure).

Boundary: this chapter stays at the NIC endpoint viewpoint. It maps symptoms to layers and proof categories without describing PCIe switch/retimer topology or platform-level signal-integrity design.

Figure H — Symptom-to-layer map (PCIe vs NIC pipeline vs line side vs thermal/power)
A visual matrix linking “throughput jitter / tail spikes / reconnect-like events” to proof signals visible from the NIC/HCA side.
Rule (from the diagram): correlate time-aligned events across layers; do not assume "network" when completions stall or the PCIe link state changes.
H2-9 · Clocking & Timing (NIC internal)

Clocking and timing inside a NIC: jitter budget, CDR, and the real boundary of “hardware timestamp”

This chapter stays inside the NIC/HCA. It explains how clock domains (refclk → PLL → SerDes/CDR → MAC/core) translate into BER/FEC pressure, training stability, and tail latency symptoms, and where a NIC’s hardware timestamp can (and cannot) remove uncertainty.

Internal clock domains (what each domain controls)

  • Refclk domain: the external reference feeding internal synthesis. Instability often shows as broad “margin shrink.”
  • PLL / conditioning domain: produces internal clocks; poor margin increases BER and pushes FEC corrected upward.
  • SerDes / CDR domain: clock recovery and lane timing; sensitive to temperature, supply noise, and channel variability.
  • MAC / core domain: packet pipeline timing; interacts with queueing and completion cadence (p99 exposure).
  • Timestamp domain (if present): defines where time is captured; cross-domain alignment is a bounded error term.

How jitter becomes field symptoms (evidence-first)

  • Margin tightens → lane errors rise → FEC corrected increases (throughput still “looks fine”).
  • Worsening margin → FEC uncorrected / symbol errors → drops/timeouts → throughput jitter appears.
  • Training sensitivity → repeated training adjustments or retraining episodes, especially near temperature/power edges.
  • Practical reading: when line-side module readings are stable but FEC/BER and retraining trends worsen, prioritize clock/power/thermal inside the NIC before blaming the network.

CDR and bring-up stability (why “link up” is not “link stable”)

  • CDR role: tracks timing variation and keeps sampling aligned; changing conditions increase lock difficulty.
  • What instability looks like: BER drift, higher FEC pressure, occasional lane deskew issues, and sporadic retraining.
  • What to correlate: lane error trend + FEC corrected/uncorrected split + retraining events (time aligned with temperature or power changes).

Hardware timestamp boundary (tap point + error components)

Item | What it means inside the NIC | Typical error component
MAC-side tap point | Timestamp captured near MAC processing; easier integration with the packet pipeline | More exposure to queueing and pipeline scheduling variance
PHY/PCS-side tap point | Timestamp captured closer to the wire; better represents egress/ingress timing at the link boundary | Still includes clock-domain-crossing alignment and internal transport delay
Error budget | Even with a HW timestamp, uncertainty remains bounded by internal domains and buffering | Queueing · CDC alignment · SerDes/PCS path
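The "bounded error" idea in the table can be written as a budget: fixed offsets that calibration can remove, plus variable terms that remain. A minimal sketch with purely illustrative numbers; real components must come from the device's own characterization, and the independence assumption behind the root-sum-of-squares combination is itself an assumption:

```python
import math

# Illustrative error components for one tap point, in nanoseconds.
fixed_ns    = {"serdes/pcs path": 25.0, "internal transport": 10.0}                  # calibratable offsets
variable_ns = {"queueing variance": 40.0, "cdc alignment": 8.0, "pll jitter": 3.0}   # 1-sigma terms

calibratable_offset = sum(fixed_ns.values())
# Treating the variable terms as independent, combine them as root-sum-of-squares.
residual_sigma = math.sqrt(sum(v * v for v in variable_ns.values()))

print(f"fixed offset (removable by calibration): {calibratable_offset:.0f} ns")
print(f"residual 1-sigma after calibration:      {residual_sigma:.1f} ns")
# A MAC-side tap typically carries a larger queueing-variance term than a
# PHY/PCS-side tap, which is why the tap point matters more than the label
# "hardware timestamping" by itself.
```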

Boundary: this chapter only covers NIC internal clock domains and timestamp capture points. Rack or fabric-level time distribution (PTP trees, SyncE, GPSDO) belongs to the Time Card page.

Figure I — Clock domains and timestamp tap points (NIC internal only)
Refclk → PLL → SerDes/CDR → MAC/core, plus two tap-point options and bounded error components.
Diagram summary: refclk → PLL → SerDes/CDR → MAC/core, with MAC-side and PHY/PCS-side timestamp tap points; bounded error terms are queueing variance, cross-domain (CDC) alignment, and the SerDes path. NIC internal only.
H2-10 · Power / Thermal / Telemetry Hooks

Power, thermal, and telemetry hooks: rail droop, throttling, hotspots, and how observability enables fast triage

Performance instability often follows a simple loop: power/thermal stress → link margin shrinks → errors rise → recovery/throttling triggers → throughput and p99 drift. This chapter focuses on NIC-visible sensors, counters, and logs that separate “module vs channel vs NIC” without rack-level cooling details.

Power rails (concept level): how droop turns into link instability

  • Common rail categories: core · SerDes · PLL · IO (names vary, roles are consistent).
  • Failure chain: rail droop or noise → CDR/SerDes margin tightens → BER and FEC corrected trend upward → p99 spikes follow.
  • How to validate: correlate voltage/current trends with FEC/BER and retraining events (time aligned).

Thermal hotspots and throttling: why it “looks like a network problem”

  • Hotspots: SerDes edge · ASIC core · PLL area · module cage (thermal coupling matters).
  • Throttling signatures: step-like throughput ceilings, rising error trends near a temperature knee, and clustered retraining episodes.
  • Evidence-first: when throttling appears, look for synchronized changes in temperature, power, and error counters.

Telemetry categories (what to read, and why it proves the layer)

Category | What it captures | How it is used
Health (temp/volt/curr) | Stress signals that compress link margin or trigger protection policies | Time-align with errors and performance shifts
Link (FEC/BER/lane) | Error trends (corrected vs uncorrected) and lane-level drift | Separate "margin shrink" from "hard failures"
Traffic (port counters) | Drops, retries/timeouts, congestion-visible counters (NIC-side) | Connect errors to user-visible symptoms
Logs (event timestamps) | State changes: retrain, recovery, throttling, alerts | Build a causal timeline
Mgmt (I²C/SMBus/MCTP) | Out-of-band access paths for telemetry exchange (concept only) | Enable remote observability
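"Build a causal timeline" is mechanical once every source carries a timestamp; a minimal sketch that merges the categories above into one ordered view (sources, field names, and messages are placeholders for whatever the driver and management path actually expose):

```python
import heapq

def merged_timeline(*sources):
    """Merge pre-sorted (timestamp_s, source, message) streams into one timeline."""
    return list(heapq.merge(*sources, key=lambda event: event[0]))

health  = [(100.0, "health", "asic temp 78C"), (101.0, "health", "asic temp 83C")]
link    = [(101.5, "link", "fec corrected rate x10"), (103.0, "link", "retrain event")]
traffic = [(103.2, "traffic", "rx drops on q3"), (103.4, "traffic", "p99 spike")]
logs    = [(102.8, "log", "throttle state entered")]

for ts, src, msg in merged_timeline(health, link, traffic, logs):
    print(f"{ts:7.1f}s [{src:>7}] {msg}")
# Read top to bottom: thermal stress precedes the error trend, which precedes
# throttling, retrain, drops, and finally the user-visible p99 spike.
```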

Optics module (CMIS/DDM) triage: module vs channel vs NIC

  • Module-first: module readings drift abnormally and align with errors → prioritize module seating, module thermals, or module health.
  • NIC-first: module readings remain stable while NIC FEC/BER/retrain aligns with NIC temperature or rail stress → prioritize NIC thermal/power margin.
  • Neither is obvious: module and error trends are clean but performance jitters → return to queue/PCIe evidence chains (previous chapters).
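The three-way split can be captured as a small decision helper; a minimal sketch in which the boolean inputs stand in for the correlations described above (how each flag is computed depends on the telemetry actually available):

```python
def triage(module_drifts_with_errors: bool,
           nic_errors_track_temp_or_power: bool,
           optics_and_errors_clean: bool) -> str:
    """Coarse first-pass triage: module vs NIC margin vs host-side chains."""
    if module_drifts_with_errors:
        return "module-first: check seating, module thermals, module health"
    if nic_errors_track_temp_or_power:
        return "NIC-first: check NIC thermal/power margin (rails, hotspots, throttling)"
    if optics_and_errors_clean:
        return "neither: return to the queue / PCIe evidence chains (H2-7, H2-8)"
    return "inconclusive: extend the observation window and re-align timestamps"

print(triage(module_drifts_with_errors=False,
             nic_errors_track_temp_or_power=True,
             optics_and_errors_clean=False))
```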

Boundary: rack-level fan curves and liquid-cooling control belong to rack thermal/cooling pages. Here the focus is NIC-visible hotspots, sensor evidence, and timestamped event correlation.

Figure J — Thermal map, sensors, telemetry hub, and throttling loop
A closed-loop view: sensors → policy/actions → performance + error impact → event logs → triage.
Triage rule (from the diagram): module readings stable while NIC errors rise with temperature/power → NIC margin; module readings drift together with the errors → module/cage; both clean → return to queue/PCIe evidence.

H2-11 · Form Factors & Port/Module Integration

Form factors and port/module integration: PCIe AIC → OCP NIC, and what matters at the cage

This section focuses on board-level integration details that decide real-world link stability: mechanical stack-up, return paths at the bezel, port-cage shielding continuity, and low-speed sideband robustness (presence/interrupt/I²C). It stays inside the NIC/HCA card boundary—no storage backplane management and no rack-level cooling control.

Form factor decision points
  • PCIe AIC: bracket and bezel geometry drive EMI leakage paths; airflow is often front-to-back across the cage and heatsink.
  • OCP NIC 3.0: serviceability and platform standardization; tighter mechanical constraints around cage RHS/stack height and chassis reference.
  • Common: port type (QSFP/OSFP), module power budget, and cable bend radius dominate mechanical risk more than PCB routing alone.
Port cage integration checklist (board-level)
  • Shield continuity: cage-to-bezel contact must be low impedance; gaps become slot antennas at high frequency.
  • Return path control: keep a predictable chassis return path near the cage; avoid forcing high-frequency currents across long, narrow “neck” copper.
  • Sideband hardening: presence/LPMode/Reset/Int pins and I²C lines need ESD protection and clean pull-ups near the card edge.
  • Thermal at the cage: module heat couples into the cage; temperature sensors and throttling logic must reflect the hottest realistic point.
  • Service events: hot-plug, partial insertion, and “wobble” during cable routing are the top triggers for intermittent faults.
Sideband boundary (what to implement on the NIC/HCA card)
  • CMIS/DDM reads: treat optics telemetry as a diagnostic tool to separate “module vs. channel vs. NIC” quickly.
  • Presence/interrupt: debounce and ESD-protect; log insert/remove events with timestamps (useful in field RMAs).
  • I²C/SMBus/MCTP: bus switching and isolation are card-level concerns; chassis-wide arbitration policies belong elsewhere.
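A minimal sketch of the debounce-and-log idea for the presence pin, written as a polling loop for clarity (the sampling period and stability count are illustrative, and real implementations usually live in firmware or the driver rather than host Python):

```python
import time

def watch_presence(read_pin, duration_s=10.0, samples=5, period_s=0.01, log=print):
    """Debounce a module presence pin and log timestamped insert/remove events.

    `read_pin` is any callable returning the raw pin level (True = module present).
    A state change is accepted only after `samples` consecutive identical reads,
    which filters the contact bounce seen during hot-plug and cable handling.
    """
    stable = read_pin()
    candidate, count = stable, 0
    deadline = time.monotonic() + duration_s
    while time.monotonic() < deadline:
        raw = read_pin()
        if raw == candidate:
            count += 1
        else:
            candidate, count = raw, 1
        if count >= samples and candidate != stable:
            stable = candidate
            log(f"{time.time():.3f} module {'insert' if stable else 'remove'} event")
        time.sleep(period_s)

# Example with a fake pin that "inserts" a module after ~2 seconds.
start = time.monotonic()
watch_presence(lambda: time.monotonic() - start > 2.0, duration_s=4.0)
```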

Example BOM material numbers (port/cage/sideband) — representative references

The items below are concrete reference MPNs commonly used as building blocks. Mechanical variants (height, latch style, RHS, press-fit vs. SMT) must match the chassis + bezel + heatsink stack-up.

Block | Example MPN (material number) | Why it appears in NIC/HCA port integration
QSFP28 cage | TE Connectivity 2359309-1 (1×1 cage); TE Connectivity 2170790-3 (1×4 cage) | Mechanical cage + EMI containment + module mating robustness; cage/bezel contact quality is a top predictor of "works in lab, fails in rack" behavior.
QSFP-DD / high-density cage | TE Connectivity 2227249-3 | Higher port density raises thermal and shielding constraints; integration quality often shows up as intermittent link retrains and sensitivity to cable strain.
OSFP cage (example) | Amphenol UE62B1620021E1; Amphenol UE62B462002121 | OSFP enables higher-power modules; cage RHS/thermal conduction and bezel grounding become more critical than with lower-power form factors.
OCP NIC 3.0 card-edge connector (example family) | Amphenol Mini Cool Edge examples: ME1016813401101, ME1016811401101 (OCP straddle-mount references) | The connector choice affects insertion loss, serviceability, and signal/ground referencing; selection must match OCP NIC mechanical and platform routing rules.
EMI gasket / contact material | Laird (DuPont) EMI gasket example 4049PA22101800 | Maintains cage-to-bezel electrical contact and reduces slot leakage; an effective gasket/contact strategy often outperforms "more shielding can" alone.
Sideband ESD protection | Texas Instruments TPD4E05U06 (ESD array) | Protects low-speed lines (presence/INT/I²C) at the cage edge; prevents "random" field failures triggered by hot-plug ESD events.
I²C/SMBus port mux (debug + robustness) | Texas Instruments TCA9548A; (alt.) NXP PCA9548A | Allows controlled access to module management channels; useful for isolating a stuck bus and for production test partitioning.
Board ID / FRU EEPROM | Microchip 24AA02E64 (EEPROM with EUI-64 option) | Common for board identity/traceability and provisioning flows; simplifies manufacturing correlation between test logs and physical units.
Practical rule: when intermittent link issues correlate with cable touch, module reinsertion, or chassis vibration, the first suspects are bezel contact, cage grounding continuity, and sideband integrity—before deeper PHY tuning.
Figure K — Port cage integration: shield/return/sideband and where failures couple in
Failure coupling hotspots (from the diagram): (1) bezel contact gaps → EMI leakage and sensitivity to cable touch; (2) weak return path → common-mode noise → link retrains; (3) sideband ESD/bounce → false presence or a stuck I²C bus. Validate with continuity, hot-plug, and cable-strain tests.
Use this diagram as a review checklist: bezel/gasket contact, chassis return integrity near the cage, and sideband ESD protection are often the fastest path to root cause for intermittent field failures.

H2-12 · Validation & Production Checklist

Validation and production checklist: proving it is stable, fast, and diagnosable

A NIC/HCA passes engineering only when it is (1) link-stable under stress, (2) predictable at p99 latency and Mpps, (3) recoverable with clear counters/logs, and (4) reproducible in production. This checklist is organized as a test matrix: test item → observables → pass criteria → failure fingerprint → likely layer.

Layered test structure (what to measure)
  • Link bring-up: BER/eye margin (or equivalent), lane mapping/polarity, training stability, FEC corrected/uncorrected trends.
  • Performance: throughput vs. Mpps, p50/p99 latency, interrupt/coalesce sensitivity, NUMA pinning sensitivity, CPU cycles saved by offloads.
  • RDMA (HCA focus): QP/CQ stability, loss sensitivity (tail-latency explosions), long-duration soak, counter-driven diagnosis.
  • Thermal/Power: throttling thresholds, error rate vs. temperature, module power excursions, event logs correlated with sensor readings.
Pass criteria definition (avoid “looks fine”)
  • Define limits per workload: small-packet (64B) Mpps, mixed-size, and jumbo must each have pass criteria.
  • Latency must include tail: record p99/p99.9 alongside throughput; verify stability under background IRQ pressure.
  • Counter sanity: “clean” runs show stable FEC corrected counts and near-zero uncorrected; spikes must correlate to a known stressor.
  • Reproducibility: the same firmware/NVM build produces the same counter signature across multiple units (production reality check).
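A minimal sketch of turning those criteria into an explicit gate; the limit values are placeholders to be replaced by the per-workload targets defined above:

```python
def evaluate_run(results: dict, limits: dict) -> list[str]:
    """Compare one validation run against explicit pass criteria."""
    failures = []
    if results["mpps_64b"] < limits["min_mpps_64b"]:
        failures.append(f"64B Mpps {results['mpps_64b']} < target {limits['min_mpps_64b']}")
    if results["p99_us"] > limits["max_p99_us"]:
        failures.append(f"p99 {results['p99_us']} us > budget {limits['max_p99_us']} us")
    if results["fec_uncorrected"] > 0:
        failures.append(f"{results['fec_uncorrected']} uncorrected FEC events (expected ~0)")
    if results["fec_corrected_rate"] > limits["max_corrected_per_s"]:
        failures.append("FEC corrected rate above limit without a known stressor")
    return failures

limits  = {"min_mpps_64b": 70.0, "max_p99_us": 150.0, "max_corrected_per_s": 1e3}  # placeholder targets
results = {"mpps_64b": 74.2, "p99_us": 210.0, "fec_uncorrected": 0, "fec_corrected_rate": 120.0}
print(evaluate_run(results, limits) or "PASS")
```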
Production hooks (what to log)
  • Link events: up/down, retrain, lane deskew failures, module insert/remove timestamps.
  • Error counters: FEC corrected/uncorrected, symbol errors, packet drops, congestion markers (as available).
  • Host interface hints: PCIe AER/replay-related evidence (NIC-visible), correlated with throughput/latency dips.
  • Thermal/power: max temperatures, throttling state transitions, rail droop events if monitored.

Reference “golden units” and test gear material numbers (examples)

These are concrete examples frequently used in labs and production lines. They are not requirements—only anchors for procurement and planning.

Category | Example MPN / Model | Role in the checklist
Golden NIC (Ethernet) | Intel E810-CQDA2 | Known baseline for throughput/Mpps/latency comparisons and firmware regression checks (port configuration flexibility helps validation planning).
Golden NIC/HCA (common) | NVIDIA ConnectX-6 Dx, OPN MCX623106AN-CDAT | Reference for counter vocabulary and field-proven stability patterns; useful when correlating FEC/BER, thermal, and link retrain signatures.
Golden HCA (OSFP class) | NVIDIA ConnectX-7 family examples: MCX75310AAS-HEAT, MCX713106AC-CEAT | Anchor for OSFP integration and higher-speed link stress patterns (training stability, power/thermal boundary conditions).
BERT (PHY bring-up) | Keysight M8040A (BERT) / Anritsu MP1900A | Link margin characterization, BER sensitivity vs. temperature/channel, and validation of the "error rate ↔ counter" correlation.
Traffic generator | Spirent TestCenter (example kits vary; procure by required port speeds) | Reproducible Mpps, mixed packet sizes, microburst stress, and tail-latency characterization under controlled conditions.
Production success metric: when a failure happens, the logs and counters must identify the likely layer (module/channel/clock/host interface/thermal) without requiring deep lab-only instruments.
Figure L — Validation test matrix: test item × observables × failure fingerprints
Failure fingerprints (from the diagram, fast triage): FEC corrected spikes that track temperature → thermal/cage/module; throughput drops without port errors → host interface / IRQ / queue pressure; tail latency that explodes after drops → buffer/credit and RDMA sensitivity (HCA). Counters and logs are mandatory; define p99 targets early.
A production-ready NIC/HCA has measurable pass criteria and a counter/log signature that identifies the likely layer when failures happen.


H2-13 · FAQs (NIC / HCA)

FAQs: field symptoms → evidence → likely layer

These FAQs target long-tail searches and practical troubleshooting. Each answer stays within the NIC/HCA boundary and points to the most relevant section(s) for deeper mechanisms and counters.

Tip: align timestamps across FEC/BER, port counters, thermal/power sensors, and event logs. “Link Up” only means training succeeded—it does not guarantee margin under temperature, power, or cable strain.
Why can a port link up, yet throughput fluctuates and FEC “corrected” counters surge?
A rising FEC-corrected rate usually means the link is surviving on error correction rather than clean margin. Correlate per-lane FEC corrected/uncorrected, retrain events, and BER/eye indicators with temperature and module telemetry. If errors track temperature or module power, suspect thermal/power throttling or cage/module integration; if a single lane dominates, suspect channel/equalization margin.
Maps to: H2-4 · H2-10
In a PAM4 link, what do CTLE / FFE / DFE mainly fix, and how do symptoms hint what to adjust?
CTLE primarily compensates high-frequency channel loss; FFE pre-emphasizes edges to reduce inter-symbol interference; DFE cancels post-cursor ISI when the channel creates long “tails.” Start from observable symptoms: lane-localized errors and sensitivity to cable length often point to channel loss/ISI; training instability and wide-lane jitter sensitivity often point to clock recovery margin. Validate changes by watching per-lane FEC and retrain trends.
Maps to: H2-4
Why is “100/200/400G” not enough to choose a card, and how should Mpps/small-packet performance be asked?
Line rate (Gbps) can look perfect while small-packet packet-rate (Mpps) collapses due to descriptor processing, queue depth, interrupt pressure, and pipeline limits. Ask for 64B/128B Mpps, mixed-size profiles, and p99 latency under load—plus CPU cycles saved with offloads enabled. Also request the counter signature during sustained small-packet stress (drops, IRQ rate, queue backpressure) to avoid “benchmark-only” answers.
Maps to: H2-2 · H2-5
Why can enabling TSO/LRO make tail latency worse, even when throughput improves?
Offloads can improve average efficiency but worsen tail latency when they increase batching, amplify burstiness, or interact poorly with interrupt moderation. Large aggregation and aggressive coalescing can add variable queueing delay; poor steering can add cache/NUMA penalties. Verify by toggling offload and coalescing knobs while tracking p99 latency, per-queue drops, and IRQ-per-packet. A “good” setting keeps p99 stable across load changes.
Maps to: H2-5 · H2-7
What field symptoms indicate RSS is misconfigured (jitter, reordering, CPU hotspots)?
Classic signs are one CPU core overheating while others idle, sudden jitter spikes under flow changes, and occasional reordering when steering moves flows across queues. Check per-queue packet distribution, per-core IRQ load, and queue drops. Fixes typically involve a stable hash key, limiting queue migration, aligning queues with NUMA locality, and using flow steering for “heavy hitter” flows so they do not thrash caches.
Maps to: H2-7
Is more SR-IOV VFs always better? Why can throughput stall while interrupts stay high?
VF count is bounded by hard resources: queues, MSI-X vectors, cache, and internal scheduling bandwidth. Too many VFs can fragment queues, trigger IRQ storms, and increase doorbell pressure, so overall throughput stops scaling. Look for high IRQ-per-packet, small per-VF queue depth, and rising backpressure indicators. A practical approach is fewer VFs with adequate queue depth and controlled coalescing, then scale based on measured p99 and IRQ saturation.
Maps to: H2-7
For RoCE RDMA, why can “rare packet loss” make p99 latency explode?
RDMA tail latency is extremely loss-sensitive: even rare drops can trigger retries, timeouts, and QP backoff that multiply queueing delay. The average latency may look fine while p99/p99.9 becomes unusable. Use NIC/HCA counters that reflect retry behavior (CQ/QP error signals, retry/timeout indicators where available) and correlate them with tail spikes. The fix path is reducing drops and stabilizing queue behavior before chasing raw bandwidth.
Maps to: H2-6
How to tell whether the root cause is Ethernet congestion behavior vs NIC/HCA queue backpressure?
Start with a layered counter check. Congestion-policy issues often show external pressure signatures (marks/drops consistent across flows), while NIC/HCA queue backpressure shows ring-full patterns, per-queue drops, IRQ anomalies, and localized hotspots. Align timestamps across port counters, queue metrics, and application p99 spikes. If tail spikes coincide with queue saturation without PHY/FEC degradation, focus on steering/queues/interrupt moderation; if they coincide with RDMA retry behavior, focus on loss sensitivity and buffering.
Maps to: H2-6 · H2-7
The network looks slower, but the real cause is PCIe downshift or AER—how to verify quickly?
When throughput drops without corresponding port errors, verify PCIe link width/speed and look for AER/replay/completion-timeout evidence around the same timestamps as the dips. PCIe retrains or corrected error bursts can masquerade as “network jitter.” Cross-check with NIC event logs and compare behavior in a different slot/platform. Firmware/NVM changes can also shift host-interface behavior, so regression tests should include PCIe stability under temperature and load.
Maps to: H2-8
What are early signs of NIC thermal throttling, and which sensors/counters are most useful?
Early signs include a slow rise in FEC corrected rate, sporadic retrains under sustained load, and throughput steps that correlate with temperature plateaus. The most useful signals are ASIC/on-card temperature, module/cage temperature, power/current telemetry, throttling state transitions, and event logs that mark derate actions. If error rate or tail latency worsens with temperature but improves immediately with airflow changes, prioritize thermal remediation before deeper PHY tuning.
Maps to: H2-10
How can CMIS/DDM optics telemetry separate “bad module” vs “bad link” vs “NIC-side issue”?
Use CMIS/DDM as a triage lens: abnormal optical power, bias current, or module alarms often indicate module/fiber issues; stable optics telemetry combined with FEC corrected growth that tracks NIC temperature points to NIC/cage thermal or power effects. If optics metrics fluctuate with cable movement, suspect connector/cage contact. The fastest isolation is a controlled swap test (module/cable) while comparing the counter signature (FEC, retrains, drops) before and after.
Maps to: H2-10 · H2-11
Why can a NIC advertise IEEE 1588 hardware timestamping, yet accuracy is unstable in practice?
“Hardware timestamping” accuracy depends on where the timestamp is taken (MAC vs PHY), how clock domains cross, and how much variable queueing remains. Interrupt moderation and queue contention can introduce load-dependent jitter; clock/PLL jitter and temperature drift can further degrade stability. A quick check is whether offset error worsens under traffic load—if yes, variable queueing and clock-domain effects are dominating. This page covers NIC-side boundaries, not rack-level timing architecture.
Maps to: H2-9