NIC / HCA (Ethernet & InfiniBand) Design & Debug Guide


A NIC/HCA is more than “link up”: real success is stable throughput and predictable p99 latency under load, proven by PHY/FEC counters, queue/interrupt behavior, and thermal/power telemetry. This page explains how Ethernet NICs and RDMA-focused HCAs move packets from PCIe DMA to SerDes/optics, and how to use the right evidence to select, validate, and troubleshoot them in production.

H2-1 · Scope & Boundary

Scope & Boundary: what this page covers (and what it does not)

This chapter is the page “contract”: it defines the owner of the topic, the practical boundary against neighboring pages, and the exact evidence this page will use for selection and debug.

In scope (owner statements)

  • NIC: Ethernet-centric I/O that reduces host CPU work via offloads (e.g., RSS, TSO, checksum) and exposes actionable telemetry.
  • HCA: RDMA-centric I/O that prioritizes tail latency control and verbs semantics (QP/CQ) for RoCE/InfiniBand deployments.
  • System boundary: host-side is a PCIe endpoint (DMA + queues); line-side is PCS/PMA + SerDes linking to optics/cable and the fabric.

Out of scope (guardrails)

  • Retimer/Switch deep-dive: this page only uses NIC-visible symptoms (BER/FEC counters, link training behavior), not retimer internals.
  • Rack timing architecture: this page only covers on-NIC timestamp boundaries; system sync trees belong to the Time Card topic.
  • Programmable dataplane: SmartNIC/DPU compute pipelines are excluded; only a “see also” reference is allowed.

Use this page when a link is “up” but throughput is unstable, FEC-corrected errors trend with temperature, p99 latency explodes under microbursts, or RDMA workloads show rare but severe tail events.

Figure A — Boundary map for NIC/HCA (single-page scope)
Blocks show what is owned here vs “see also” neighbors. Minimal text; element-first layout.
Boundary rule (from the diagram): only NIC-visible evidence is discussed here (FEC/BER, queues, offloads, RDMA behavior, thermal/power hooks); DPU/SmartNIC, Time Card, and PCIe retimer topics are "see also" neighbors.
H2-2 · 1-minute Definition

1-minute definition + first-principles selection questions

How it works (3–5 steps)

  1. DMA & queues: descriptors in host memory define what to transmit/receive and where to place data.
  2. Pipeline: parsing, classification, steering (RSS/flow rules) and optional offloads are applied.
  3. RDMA (if used): verbs/QP/CQ scheduling maps requests to DMA operations with tight latency targets.
  4. Line side: MAC/PCS/PMA drive SerDes lanes; PAM4 DSP and FEC protect the link margin.
  5. Telemetry: counters and sensors (FEC/BER, module DDM, temperature, power) explain “why” performance changes.

First principles: what to check first (before brand/model)

  • Throughput vs packet rate: Gbps describes bandwidth; Mpps predicts small-packet behavior and CPU pressure.
  • Tail latency: p50 can look good while p99/p999 fails under microbursts, queue backpressure, or rare retries.
  • RDMA stability (if required): “supported” is not enough—validate behavior under congestion and loss sensitivity.
  • Observability: FEC corrected/uncorrected, BER proxies, module DDM/CMIS, and event logs must exist and be readable.
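To make the Gbps-vs-Mpps distinction concrete before reading the spec table below, the theoretical packet-rate ceiling follows directly from the on-wire size of each frame. A minimal worked sketch in Python, assuming only the standard Ethernet per-frame overhead (preamble/SFD plus minimum inter-frame gap):

```python
def max_packet_rate_mpps(line_rate_gbps: float, frame_bytes: int) -> float:
    """Theoretical packet-rate ceiling for a line rate and frame size.

    On the wire every Ethernet frame also carries preamble + SFD (8 B) and
    the minimum inter-frame gap (12 B), i.e. 20 B of fixed overhead.
    """
    overhead_bytes = 8 + 12
    bits_per_frame = (frame_bytes + overhead_bytes) * 8
    return line_rate_gbps * 1e9 / bits_per_frame / 1e6

# 100 GbE: ~148.8 Mpps at 64 B, but only ~8.2 Mpps at 1500 B.
for size in (64, 128, 512, 1500):
    print(f"{size:>5} B -> {max_packet_rate_mpps(100, size):7.2f} Mpps")
```

The gap between this ceiling and a card's measured Mpps is the per-packet cost that later chapters attribute to descriptors, queues, and interrupts.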
Spec | What it predicts | Questions to ask | Evidence to expect
Gbps (port speed) | Peak bandwidth ceiling; module/cage compatibility | Which port form factor (QSFP/OSFP)? Which lane mode and FEC requirement? | Link mode report; FEC capability; module DDM/CMIS readout
Mpps (packet rate) | Small-packet performance, interrupt/queue pressure | Mpps at 64B/128B? Single-flow vs multi-flow? Offload on/off conditions? | Packet-rate curve; CPU cost; queue depth/overrun counters
p99 (latency) | Tail risk under microbursts, retries, congestion sensitivity | p50/p99/p999 under defined load? With and without interrupt coalescing? | Latency histograms; retry/replay indicators; queue backpressure signals
RDMA (RoCE/IB) | Remote-memory semantics, tail stability, loss sensitivity | RoCEv2 vs IB mode? What counters exist for congestion/loss events? | QP/CQ health; error syndromes; stable tail under stress profile
FEC (errors) | Link margin vs DSP/FEC workload; "link up but unstable" root cause | Corrected vs uncorrected thresholds? Lane-level counters exposed? | FEC corrected/uncorrected trend; lane error distribution
Thermal (power) | Throttling risk, temperature-coupled error patterns | Max power at target link mode? Thermal sensors? Throttle behavior? | Temperature/power logs; performance vs temperature correlation
Figure B — Key specs map (bandwidth, Mpps, tail, RDMA, telemetry)
Iconic spec tiles with minimal labels (≥18px), optimized for mobile readability.
Takeaway (from the diagram): selection becomes reliable when specs are tied to evidence, i.e. counters, sensors, and repeatable test conditions.
H2-3 · System Architecture Map

System architecture map: PCIe to QSFP/OSFP end-to-end chain

This chapter establishes a fast “evidence chain” across three domains—Host/PCIe, NIC core, and Line side—so unstable throughput, tail latency spikes, or rising corrected errors can be assigned to the right layer before deeper debug.

Three domains (who owns what)

  • Host / PCIe domain: DMA submissions, queue doorbells, MSI-X interrupts, and (light) IOMMU/ATS effects on addressability and completion flow.
  • NIC core domain: MAC + packet pipeline, flow steering, offload engines, and scheduling that turns descriptors into predictable packet movement.
  • Line side domain: PCS/PMA and SerDes lanes connected to QSFP/OSFP modules and the fabric, with link training and error protection (FEC).
Domain | What it controls | Typical symptom | NIC-visible evidence
Host/PCIe | DMA pacing, doorbell cadence, completion flow, interrupt pressure | Periodic throughput drops; CPU spikes; jitter tied to host load | Queue pressure signals; completion stalls; PCIe error category flags
NIC core | Parsing/classification, steering, offloads, scheduling | Gbps looks fine but Mpps collapses; p99 worsens under microbursts | Drops/overruns; scheduler backpressure; flow distribution imbalance
Line side | Link mode, training, lane equalization, FEC protection | Link stays up but stability degrades; error rates track temperature | FEC corrected/uncorrected trend; lane errors; module DDM/CMIS signals

Boundary reminder: the diagram and text stay at a vendor-neutral functional level. No proprietary block names are used, and external fabric configuration is not discussed—only NIC-visible counters, states, and correlations.

Figure C — NIC/HCA end-to-end block diagram (with telemetry hooks)
Box-style architecture map: Host/PCIe → NIC core → PCS/PMA/SerDes → QSFP/OSFP, plus clocks/power/thermal/NVM/mgmt bus.
Diagram note: the support plane (clock/PLL, power rails, NVM, management bus over I²C/SMBus/MCTP) is shown as a dotted boundary; it is referenced here but not expanded.
H2-4 · SerDes & PHY

SerDes & PHY deep dive: PAM4 DSP, FEC, training, and “link up but unstable” behavior

“Link up” only confirms that training reached a workable state. Throughput instability often appears when margin is thin: PAM4 relies on DSP and FEC to keep BER acceptable, and that compensation can vary with temperature, lane conditions, and jitter.

PAM4 vs NRZ: what changes in practice

  • PAM4: smaller voltage spacing per level → higher sensitivity to noise, jitter, and non-ideal channel response.
  • NRZ: larger margin → fewer “near-edge” states where corrected errors climb while the link stays up.
  • Operational signal: rising FEC corrected without immediate uncorrected events often indicates a margin tax that can later convert into tail latency risk.

DSP chain: symptom → module → evidence

  • High-frequency loss → CTLE helps restore high-frequency content; lane-to-lane corrected-error imbalance is a common hint.
  • ISI (inter-symbol interference) → FFE/DFE compensate channel memory; training "not converged" or unstable EQ states often appear before hard failures.
  • Jitter / phase-noise sensitivity → CDR stability matters; errors that correlate with temperature or operating mode suggest clock/jitter stress.
  • Residual random errors → FEC absorbs them until uncorrected events start; the corrected/uncorrected ratio is a leading indicator.

FEC counters: how to read them without guessing

  • Corrected trending up: the link is spending “error budget” to stay stable; margin is shrinking or compensation load is rising.
  • Uncorrected events: represent real frame loss at the physical layer; they amplify retries and tail latency, especially for loss-sensitive transports.
  • Lane concentration: errors localized to a subset of lanes often indicate a channel/connector/module lane-specific issue rather than a global mode mismatch.
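A minimal sketch of this reading discipline, assuming two snapshots of per-lane corrected/uncorrected counters have already been collected (counter names, lane grouping, and the collection path vary by driver; the rate threshold below is an illustrative placeholder, not a vendor limit):

```python
def judge_fec(prev: dict, curr: dict, interval_s: float) -> list[str]:
    """Classify FEC counter movement between two snapshots.

    `prev` / `curr` are assumed to look like:
        {"corrected": [per-lane counts], "uncorrected": [per-lane counts]}
    """
    findings = []
    corr_delta = [c - p for c, p in zip(curr["corrected"], prev["corrected"])]
    uncorr_delta = [c - p for c, p in zip(curr["uncorrected"], prev["uncorrected"])]

    if sum(uncorr_delta) > 0:
        findings.append("uncorrected events present: physical-layer loss; expect retries and tail impact")

    corr_rate = sum(corr_delta) / interval_s
    if corr_rate > 1e3:  # illustrative threshold, not a vendor limit
        findings.append(f"corrected rate ~{corr_rate:.0f}/s: margin is being spent; trend it against temperature")

    total = sum(corr_delta) or 1
    worst = max(corr_delta) if corr_delta else 0
    if len(corr_delta) > 1 and worst / total > 0.6:
        findings.append(f"errors concentrate on lane {corr_delta.index(worst)}: "
                        "suspect a lane-specific channel/connector/module issue")

    return findings or ["no significant FEC movement in this interval"]
```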

Bring-up checklist (line-side): what to validate first

  • Mode agreement: autoneg / capability match, correct lane count and mapping for the selected module mode.
  • Training success: link training reaches stable EQ states (TX/RX equalization) and lane deskew completes.
  • Polarity/mapping sanity: polarity inversions and lane swaps are tolerated only if mapping is consistent end-to-end.
  • Correlations: plot corrected errors against temperature/power and module DDM to identify margin-driven instability.
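For the last checklist item, a plain correlation check over time-aligned samples is often enough to decide whether errors track temperature; a minimal sketch with illustrative numbers (real samples must share the same timestamps and sampling interval):

```python
from statistics import correlation  # requires Python 3.10+

# Illustrative, time-aligned samples (same timestamps, same order, fixed interval).
corrected_per_min = [120, 150, 180, 400, 950, 1400, 1500]
temp_c            = [45,  47,  50,  58,  66,   72,   74]

r = correlation(corrected_per_min, temp_c)
print(f"Pearson r = {r:.2f}")
# r close to +1 supports margin-driven, temperature-coupled instability;
# rising errors with r near 0 point back to channel- or lane-specific causes.
```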
Figure D — PAM4 SerDes DSP + FEC pipeline with a readout panel
Pipeline diagram maps symptoms to DSP blocks; side panel lists the minimal counters/sensors to prove layer and correlation.
Symptom-to-block mapping (from the diagram): loss → CTLE, ISI → FFE/DFE, jitter → CDR, residual random errors → FEC. The goal is proof: correlate counters and sensors with channel, clock, or thermal conditions without guessing.
H2-5 · Packet Pipeline & Offload

Packet pipeline & offload: real boundaries and performance traps (Gbps vs Mpps vs p99)

High wire-rate throughput does not guarantee strong small-packet Mpps or stable p99 latency. The common root cause is per-packet work along the descriptor → queue → DMA → completion → interrupt/coalescing chain, plus pipeline depth and contention.

Why “Gbps is high” but “Mpps collapses”

  • Per-packet overhead dominates: each packet consumes descriptor work, queue arbitration, DMA bookkeeping, and completion processing.
  • Interrupt/coalescing effects: moderation improves CPU efficiency but can stretch tail latency when bursts arrive or queues back up.
  • Pipeline depth and contention: parse/classify/steer stages have finite capacity; microbursts convert into queue backlog and p99 spikes.
  • Evidence pattern: wire-rate looks healthy while drops/overruns, queue pressure, or completion stalls increase.
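The per-packet budget behind these bullets can be made explicit: divide the CPU cycles available per second by the packet rate that must be sustained. A minimal sketch (core count and clock are illustrative assumptions):

```python
def cycles_per_packet(core_ghz: float, cores: int, target_mpps: float) -> float:
    """CPU cycles available per packet at a sustained packet rate."""
    return core_ghz * 1e9 * cores / (target_mpps * 1e6)

# 64 B line rate at 100 GbE is ~148.8 Mpps (see the arithmetic in H2-2).
for cores in (1, 4, 8):
    budget = cycles_per_packet(3.0, cores, 148.8)
    print(f"{cores} core(s) @ 3.0 GHz -> {budget:6.0f} cycles per packet")
# ~20 cycles per packet on one core leaves no room for descriptor work,
# completions, and interrupt handling, which is why Mpps collapses first.
```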

Offload catalog: what it helps, what it can hurt

Offload | Primary benefit | Common trap | NIC-visible evidence
Checksum | Saves host CPU per packet | Does not fix queue/interrupt bottlenecks; Mpps can still be limited | CPU cycles saved while queue pressure is unchanged
TSO/GSO | Reduces TX per-packet overhead for large payloads | Little impact on true small-packet traffic; segmentation does not apply | Higher Gbps with a similar 64B Mpps ceiling
LRO/GRO | Reduces RX packet-rate pressure | Can increase latency variance; may change observability and fairness during bursts | Lower packet rate with p99 changes
RSS / steering | Spreads flows across queues/cores | Imbalanced distribution or queue hotspots can worsen tail latency | Queue-to-queue skew; hotspot drops
Inline crypto | Offloads encryption/decryption for supported paths | Does not replace key/attestation systems; can still be bounded by per-packet pipeline limits | CPU saved with unchanged queue/IRQ limits

Metrics hierarchy (procurement + validation)

  • Wire rate (Gbps): large-packet ceiling; necessary but not sufficient.
  • Mpps (small packets): reveals per-packet cost and queue/interrupt capacity.
  • p99 latency: exposes backlog, coalescing, retries, and burst sensitivity.
  • CPU cycles saved: quantifies offload value without hiding tail-risk.

Practical rule: when Gbps is stable but user-facing latency becomes “spiky,” treat queue backlog and completion/interrupt behavior as first-class suspects before chasing link-level causes.

Figure E — Latency path breakdown (RX → app) with evidence points
Box diagram highlights where Mpps limits and tail latency typically form: descriptors/queues/DMA/completions/IRQ moderation.
Evidence points (from the diagram): drops/overruns, queue pressure, completion stalls, IRQ rate/coalescing. Metrics hierarchy: wire rate (Gbps) · small packets (Mpps) · tail latency (p99) · CPU saved.
H2-6 · RDMA

RDMA in practice: RoCEv2 vs InfiniBand engineering differences, congestion myths, and why tail latency explodes

RDMA is not a single feature toggle. It is an end-to-end completion model (verbs/QP/CQ) that reacts strongly to loss and congestion. Small amounts of retry/timeout can amplify into a backlog that dominates p99/p999 latency.

RoCEv2 vs InfiniBand: “engineering differences” (not a glossary)

  • RoCEv2: runs over Ethernet transport. Practical stability depends on loss/congestion management in the path (mentioned only; no switch details here).
  • InfiniBand: fabric semantics are native and consistent for RDMA traffic, with a more unified verbs/QP/CQ operating model end-to-end.
  • Implication for p99: RoCE environments often show stronger correlation between congestion signals and tail latency; IB environments still suffer tail growth when retries or resource backlogs occur.

QP/CQ/doorbell/WQE: boundaries that create tail latency

  • WQE (work request): the unit of outstanding work. Accumulated outstanding WQEs are the physical form of latency backlog.
  • QP (queue pair): ordering and resource window. A blocked QP delays later work even if the link stays up.
  • CQ (completion queue): completion aggregation. Completion handling cadence affects tail behavior under bursts.
  • Doorbell: submission pressure. Batch submission improves efficiency but can increase waiting time when congestion or retries appear.

Why “a little loss” becomes “huge tail latency”

  • Retry/timeout amplification: once retransmit or timeout logic activates, outstanding work waits behind recovery.
  • Backlog propagation: queued work accumulates while completions slow down; p50 can look fine while p99/p999 degrade sharply.
  • Evidence-first approach: correlate retry/timeout categories with queue backlog, completion error categories, and congestion indicators visible to the NIC/HCA.
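The amplification is easy to reproduce with a toy queueing model: a tiny fraction of operations pays a retry/timeout penalty, and everything queued behind them waits. A minimal sketch with illustrative numbers (service time, timeout, and loss probability are assumptions, not vendor figures):

```python
import random

def simulate_tail(n_ops=200_000, interarrival_us=10.0, service_us=2.0,
                  timeout_us=1000.0, p_retry=0.001, seed=1):
    """Single-queue FIFO model: rare retries stall everything queued behind them."""
    random.seed(seed)
    backlog_us = 0.0                                         # unfinished work ahead of a new arrival
    latencies = []
    for _ in range(n_ops):
        backlog_us = max(0.0, backlog_us - interarrival_us)  # time passes between arrivals
        cost = service_us + (timeout_us if random.random() < p_retry else 0.0)
        latencies.append(backlog_us + cost)                  # wait behind the backlog, then be served
        backlog_us += cost
    latencies.sort()
    pick = lambda q: latencies[int(q * (len(latencies) - 1))]
    return pick(0.50), pick(0.99), pick(0.999)

p50, p99, p999 = simulate_tail()
print(f"p50={p50:.1f} us   p99={p99:.1f} us   p99.9={p999:.1f} us")
# With 0.1% retries the p50 stays near the 2 us service time while p99/p99.9
# inherit the 1 ms timeout plus whatever backlog formed behind it.
```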

What can be proven from the NIC/HCA side (without fabric configuration)

  • Congestion signals (category): marks/pause-like indicators and rising queue occupancy patterns that align with latency spikes.
  • Retry evidence (category): retransmit and timeout trends that precede uncorrected loss symptoms.
  • Completion evidence (category): completion errors or slowed completion cadence that match tail growth.
Figure F — RDMA data path (verbs → QP/CQ → DMA) with observation points
Box diagram shows how WQEs, doorbells, QPs, and completions interact; panels mark where NIC-visible counters prove congestion vs retry vs backlog.
Observation points (from the diagram, NIC-visible categories): retry/timeout, queue backlog, completion errors, congestion indicators. Goal: prove whether tail latency is driven by congestion, retries, or backlog without fabric configuration details.
H2-7 · Queues, Virtualization & Steering

Queues, virtualization, and steering: how RSS / SR-IOV / MSI-X “split performance” and keep it controllable

Parallelism in a NIC/HCA is not “more threads.” It is a controlled mapping: flows → queues → DMA/completions → interrupts → CPU cores. When the mapping is unstable or oversubscribed, symptoms look like software jitter while the root cause is queue backpressure, doorbell write amplification, and cache locality loss.

Queue mental model (what actually runs in parallel)

  • RX path: flow classification → RX queue → DMA to host → completion → MSI-X interrupt (or polling).
  • TX path: TX queue → descriptors → DMA fetch → scheduling → wire.
  • What it controls: wire rate (Gbps) is not enough; Mpps and p99 latency are dominated by queue pressure and completion cadence.

RSS and flow steering: benefits, and why “parallel” can become “noisy”

  • RSS benefit: distributes flows across multiple queues to raise parallel throughput and reduce per-core hotspots.
  • Hidden cost: cache locality can deteriorate when flows bounce across cores; tail latency rises under bursts.
  • Ordering boundary: a single flow should remain on one queue; re-mapping or mixed steering policies can create reordering-like behavior.
  • Flow steering role: prioritizes control (stable mapping, predictable p99) over “more randomness.”
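The ordering boundary and the skew risk are easiest to see as a mapping exercise. A minimal sketch that uses a generic hash in place of the NIC's real RSS function (hardware typically uses a Toeplitz hash with a configurable key; this stand-in only illustrates the stability and distribution questions):

```python
from collections import Counter
from hashlib import blake2b

N_QUEUES = 8

def rss_queue(src_ip, dst_ip, src_port, dst_port, proto="udp") -> int:
    """Map a 5-tuple to an RX queue. Generic stand-in for the NIC's Toeplitz-based RSS."""
    key = f"{proto}:{src_ip}:{src_port}>{dst_ip}:{dst_port}".encode()
    return int.from_bytes(blake2b(key, digest_size=4).digest(), "big") % N_QUEUES

# Property 1: the same flow always lands on the same queue (ordering preserved).
assert rss_queue("10.0.0.1", "10.0.0.2", 40000, 4791) == \
       rss_queue("10.0.0.1", "10.0.0.2", 40000, 4791)

# Property 2: distribution skew across many flows (hotspot queues hurt p99).
load = Counter(rss_queue("10.0.0.1", "10.0.0.2", 30000 + i, 4791) for i in range(4000))
print("flows per queue:", dict(sorted(load.items())))
print("max/min queue ratio:", round(max(load.values()) / min(load.values()), 2))
```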

SR-IOV (PF/VF): isolation is resource slicing

Concept | Practical pitfall | NIC-side proof category
VF count | More VFs can reduce per-VF queue depth and vectors; the aggregate looks fine but per-tenant Mpps/p99 degrades | Queue pressure skew across VFs; completion backlog per VF
Queue resources | Too few queues per VF force contention; hot flows collide and create tail spikes | Hotspot queues, drops/overruns, uneven queue occupancy
Locality | Remote placement (NUMA distance) increases DMA/completion time variance | Completion stalls correlated with placement changes

MSI-X and coalescing: the small-packet vs p99 trade-off knob

  • MSI-X: provides multi-vector interrupts so multiple queues can complete independently (reduces single-IRQ bottlenecks).
  • Coalescing: batches interrupts to reduce CPU overhead, but can increase tail latency by “holding” completions during bursts.
  • Evidence-first rule: interpret p99 spikes together with IRQ rate, queue depth, and completion cadence—not by throughput alone.
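The coalescing trade-off can be bounded with simple arithmetic: the timer and frame-count knobs cap both the IRQ rate and the extra completion-holding delay a packet can see. A minimal sketch (the knob values are illustrative, not recommendations):

```python
def coalescing_bounds(pps: float, usecs: float, max_frames: int):
    """Rough bounds for simple interrupt coalescing.

    Idealized model of rx-usecs / rx-frames style knobs: an interrupt fires when
    `usecs` elapse or `max_frames` completions accumulate, whichever comes first.
    """
    batch_us = min(usecs, max_frames / pps * 1e6)
    irq_per_sec = min(pps, 1e6 / batch_us)   # cannot exceed one interrupt per packet
    worst_added_delay_us = batch_us          # the first completion of a batch waits longest
    return irq_per_sec, worst_added_delay_us

for usecs in (1, 20, 100):                   # illustrative knob settings
    irqs, delay = coalescing_bounds(pps=1e6, usecs=usecs, max_frames=64)
    print(f"rx-usecs={usecs:>3}: <= {irqs:,.0f} IRQ/s, up to {delay:.0f} us of held-back completion")
```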

When it looks like software: random jitter, uneven CPU load, or “driver hiccups” often map to queue backpressure, doorbell amplification (too many submissions per work), or cache miss amplification under bursty traffic.

Figure G — Queue map (flows → RSS/steering → queues → MSI-X → cores/NUMA)
Shows how performance becomes controllable only when mapping and resource slicing are explicit and stable.
Evidence (from the diagram): queue depth · queue skew · IRQ rate · completion backlog · drops/overruns.
H2-8 · PCIe Host Interface (NIC viewpoint)

PCIe host interface (NIC endpoint view): why AER, link downshift, and retraining look like “network jitter”

A NIC/HCA can show throughput drops and latency spikes even when the line side is clean. From the endpoint perspective, link width/speed changes, retraining, and error recovery (AER categories) can stretch completions and stall DMA, which propagates upward as “packet jitter” symptoms.

What matters from the endpoint perspective

  • Link width / speed changes: sudden ceiling shift (step-like throughput drop) without obvious line-side errors.
  • Retraining / recovery episodes: short stalls that resemble “micro-disconnects” to upper layers.
  • AER categories: error handling can trigger retries or recovery paths; tail latency grows even if p50 remains normal.
  • Replay / completion timeout (category): completions slow down, queue pressure increases, and p99 spikes follow.

How to prove “network symptoms” are actually PCIe-driven

  • Step 1: check line-side error trends (category). If they remain flat while performance shifts, host-side probability rises.
  • Step 2: correlate PCIe link state changes (width/speed/retraining/AER categories) with the time of throughput drops or p99 spikes.
  • Step 3: confirm completion stalls and queue pressure increase during those PCIe events.
  • Step 4: conclude the root layer: completions slowed → DMA progress slowed → queues back up → "network jitter" becomes visible.
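Step 2 is mostly bookkeeping once both sides are exported with timestamps; a minimal sketch of the correlation, assuming PCIe link-state/AER events and per-interval p99 samples are available as timestamped records (field names and values are illustrative):

```python
def events_near_spikes(pcie_events, p99_samples, p99_budget_us, window_s=2.0):
    """Return p99 spikes that have a PCIe event within +/- window_s seconds.

    pcie_events : list of (timestamp_s, description) tuples
    p99_samples : list of (timestamp_s, p99_us) tuples
    """
    hits = []
    for ts, p99 in p99_samples:
        if p99 <= p99_budget_us:
            continue
        nearby = [ev for ev_ts, ev in pcie_events if abs(ev_ts - ts) <= window_s]
        hits.append((ts, p99, nearby))
    return hits

# Illustrative data: a width downshift and an AER-corrected burst around t = 120 s.
pcie_events = [(119.4, "link width x16 -> x8"), (119.6, "AER corrected burst")]
p99_samples = [(60, 80), (120, 2400), (180, 85)]
for ts, p99, evs in events_near_spikes(pcie_events, p99_samples, p99_budget_us=200):
    verdict = "PCIe-correlated" if evs else "no PCIe event nearby; look elsewhere"
    print(f"t={ts}s p99={p99}us: {verdict} {evs}")
```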

Firmware/NVM changes: why versions can shift performance or compatibility (principles)

  • Default queue and interrupt behavior can change: different moderation defaults alter Mpps and p99 trade-offs.
  • Virtualization resource slicing can change: VF limits, queue allocation, or completion handling policies may differ across versions.
  • Error handling paths can change: recovery behavior affects how often and how long “stalls” appear in the field.
  • Validation rule: after any version change, re-check the same evidence categories (link state, AER categories, completion cadence, queue pressure).

Boundary: this chapter stays at the NIC endpoint viewpoint. It maps symptoms to layers and proof categories without describing PCIe switch/retimer topology or platform-level signal-integrity design.

Figure H — Symptom-to-layer map (PCIe vs NIC pipeline vs line side vs thermal/power)
A visual matrix linking “throughput jitter / tail spikes / reconnect-like events” to proof signals visible from the NIC/HCA side.
Rule (from the diagram): correlate time-aligned events across layers; do not assume "network" when completions stall or the PCIe link state changes.
H2-9 · Clocking & Timing (NIC internal)

Clocking and timing inside a NIC: jitter budget, CDR, and the real boundary of “hardware timestamp”

This chapter stays inside the NIC/HCA. It explains how clock domains (refclk → PLL → SerDes/CDR → MAC/core) translate into BER/FEC pressure, training stability, and tail latency symptoms, and where a NIC’s hardware timestamp can (and cannot) remove uncertainty.

Internal clock domains (what each domain controls)

  • Refclk domain: the external reference feeding internal synthesis. Instability often shows as broad “margin shrink.”
  • PLL / conditioning domain: produces internal clocks; poor margin increases BER and pushes FEC corrected upward.
  • SerDes / CDR domain: clock recovery and lane timing; sensitive to temperature, supply noise, and channel variability.
  • MAC / core domain: packet pipeline timing; interacts with queueing and completion cadence (p99 exposure).
  • Timestamp domain (if present): defines where time is captured; cross-domain alignment is a bounded error term.

How jitter becomes field symptoms (evidence-first)

  • Margin tightens → lane errors rise → FEC corrected increases (throughput still “looks fine”).
  • Worsening margin → FEC uncorrected / symbol errors → drops/timeouts → throughput jitter appears.
  • Training sensitivity → repeated training adjustments or retraining episodes, especially near temperature/power edges.
  • Practical reading: when line-side module readings are stable but FEC/BER and retraining trends worsen, prioritize clock/power/thermal inside the NIC before blaming the network.

CDR and bring-up stability (why “link up” is not “link stable”)

  • CDR role: tracks timing variation and keeps sampling aligned; changing conditions increase lock difficulty.
  • What instability looks like: BER drift, higher FEC pressure, occasional lane deskew issues, and sporadic retraining.
  • What to correlate: lane error trend + FEC corrected/uncorrected split + retraining events (time aligned with temperature or power changes).

Hardware timestamp boundary (tap point + error components)

Item | What it means inside the NIC | Typical error component
MAC-side tap point | Timestamp captured near MAC processing; easier integration with the packet pipeline | More exposure to queueing and pipeline scheduling variance
PHY/PCS-side tap point | Timestamp captured closer to the wire; better represents egress/ingress timing at the link boundary | Still includes clock-domain-crossing alignment and internal transport delay
Error budget | Even with a HW timestamp, uncertainty remains bounded by internal domains and buffering | Queueing · CDC alignment · SerDes/PCS path
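The "bounded error" idea in the table can be written as a budget: fixed offsets that calibration can remove, plus variable terms that remain. A minimal sketch with purely illustrative numbers; real components must come from the device's own characterization, and the independence assumption behind the root-sum-of-squares combination is itself an assumption:

```python
import math

# Illustrative error components for one tap point, in nanoseconds.
fixed_ns    = {"serdes/pcs path": 25.0, "internal transport": 10.0}                  # calibratable offsets
variable_ns = {"queueing variance": 40.0, "cdc alignment": 8.0, "pll jitter": 3.0}   # 1-sigma terms

calibratable_offset = sum(fixed_ns.values())
# Treating the variable terms as independent, combine them as root-sum-of-squares.
residual_sigma = math.sqrt(sum(v * v for v in variable_ns.values()))

print(f"fixed offset (removable by calibration): {calibratable_offset:.0f} ns")
print(f"residual 1-sigma after calibration:      {residual_sigma:.1f} ns")
# A MAC-side tap typically carries a larger queueing-variance term than a
# PHY/PCS-side tap, which is why the tap point matters more than the label
# "hardware timestamping" by itself.
```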

Boundary: this chapter only covers NIC internal clock domains and timestamp capture points. Rack or fabric-level time distribution (PTP trees, SyncE, GPSDO) belongs to the Time Card page.

Figure I — Clock domains and timestamp tap points (NIC internal only)
Refclk → PLL → SerDes/CDR → MAC/core, plus two tap-point options and bounded error components.
Diagram summary: refclk → PLL → SerDes/CDR → MAC/core, with MAC-side and PHY/PCS-side timestamp tap points; bounded error terms are queueing variance, cross-domain (CDC) alignment, and the SerDes path. NIC internal only.
H2-10 · Power / Thermal / Telemetry Hooks

Power, thermal, and telemetry hooks: rail droop, throttling, hotspots, and how observability enables fast triage

Performance instability often follows a simple loop: power/thermal stress → link margin shrinks → errors rise → recovery/throttling triggers → throughput and p99 drift. This chapter focuses on NIC-visible sensors, counters, and logs that separate “module vs channel vs NIC” without rack-level cooling details.

Power rails (concept level): how droop turns into link instability

  • Common rail categories: core · SerDes · PLL · IO (names vary, roles are consistent).
  • Failure chain: rail droop or noise → CDR/SerDes margin tightens → BER and FEC corrected trend upward → p99 spikes follow.
  • How to validate: correlate voltage/current trends with FEC/BER and retraining events (time aligned).

Thermal hotspots and throttling: why it “looks like a network problem”

  • Hotspots: SerDes edge · ASIC core · PLL area · module cage (thermal coupling matters).
  • Throttling signatures: step-like throughput ceilings, rising error trends near a temperature knee, and clustered retraining episodes.
  • Evidence-first: when throttling appears, look for synchronized changes in temperature, power, and error counters.

Telemetry categories (what to read, and why it proves the layer)

Category | What it captures | How it is used
Health (temp/volt/curr) | Stress signals that compress link margin or trigger protection policies | Time-align with errors and performance shifts
Link (FEC/BER/lane) | Error trends (corrected vs uncorrected) and lane-level drift | Separate "margin shrink" from "hard failures"
Traffic (port counters) | Drops, retries/timeouts, congestion-visible counters (NIC-side) | Connect errors to user-visible symptoms
Logs (event timestamps) | State changes: retrain, recovery, throttling, alerts | Build a causal timeline
Mgmt (I²C/SMBus/MCTP) | Out-of-band access paths for telemetry exchange (concept only) | Enable remote observability
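"Build a causal timeline" is mechanical once every source carries a timestamp; a minimal sketch that merges the categories above into one ordered view (sources, field names, and messages are placeholders for whatever the driver and management path actually expose):

```python
import heapq

def merged_timeline(*sources):
    """Merge pre-sorted (timestamp_s, source, message) streams into one timeline."""
    return list(heapq.merge(*sources, key=lambda event: event[0]))

health  = [(100.0, "health", "asic temp 78C"), (101.0, "health", "asic temp 83C")]
link    = [(101.5, "link", "fec corrected rate x10"), (103.0, "link", "retrain event")]
traffic = [(103.2, "traffic", "rx drops on q3"), (103.4, "traffic", "p99 spike")]
logs    = [(102.8, "log", "throttle state entered")]

for ts, src, msg in merged_timeline(health, link, traffic, logs):
    print(f"{ts:7.1f}s [{src:>7}] {msg}")
# Read top to bottom: thermal stress precedes the error trend, which precedes
# throttling, retrain, drops, and finally the user-visible p99 spike.
```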

Optics module (CMIS/DDM) triage: module vs channel vs NIC

  • Module-first: module readings drift abnormally and align with errors → prioritize module seating, module thermals, or module health.
  • NIC-first: module readings remain stable while NIC FEC/BER/retrain aligns with NIC temperature or rail stress → prioritize NIC thermal/power margin.
  • Neither is obvious: module and error trends are clean but performance jitters → return to queue/PCIe evidence chains (previous chapters).
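The three-way split can be captured as a small decision helper; a minimal sketch in which the boolean inputs stand in for the correlations described above (how each flag is computed depends on the telemetry actually available):

```python
def triage(module_drifts_with_errors: bool,
           nic_errors_track_temp_or_power: bool,
           optics_and_errors_clean: bool) -> str:
    """Coarse first-pass triage: module vs NIC margin vs host-side chains."""
    if module_drifts_with_errors:
        return "module-first: check seating, module thermals, module health"
    if nic_errors_track_temp_or_power:
        return "NIC-first: check NIC thermal/power margin (rails, hotspots, throttling)"
    if optics_and_errors_clean:
        return "neither: return to the queue / PCIe evidence chains (H2-7, H2-8)"
    return "inconclusive: extend the observation window and re-align timestamps"

print(triage(module_drifts_with_errors=False,
             nic_errors_track_temp_or_power=True,
             optics_and_errors_clean=False))
```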

Boundary: rack-level fan curves and liquid-cooling control belong to rack thermal/cooling pages. Here the focus is NIC-visible hotspots, sensor evidence, and timestamped event correlation.

Figure J — Thermal map, sensors, telemetry hub, and throttling loop
A closed-loop view: sensors → policy/actions → performance + error impact → event logs → triage.
Triage rule (from the diagram): module readings stable while NIC errors rise with temperature/power → NIC margin; module readings drift together with the errors → module/cage; both clean → return to queue/PCIe evidence.

H2-11 · Form Factors & Port/Module Integration

Form factors and port/module integration: PCIe AIC → OCP NIC, and what matters at the cage

This section focuses on board-level integration details that decide real-world link stability: mechanical stack-up, return paths at the bezel, port-cage shielding continuity, and low-speed sideband robustness (presence/interrupt/I²C). It stays inside the NIC/HCA card boundary—no storage backplane management and no rack-level cooling control.

Form factor decision points
  • PCIe AIC: bracket and bezel geometry drive EMI leakage paths; airflow is often front-to-back across the cage and heatsink.
  • OCP NIC 3.0: serviceability and platform standardization; tighter mechanical constraints around cage RHS/stack height and chassis reference.
  • Common: port type (QSFP/OSFP), module power budget, and cable bend radius dominate mechanical risk more than PCB routing alone.
Port cage integration checklist (board-level)
  • Shield continuity: cage-to-bezel contact must be low impedance; gaps become slot antennas at high frequency.
  • Return path control: keep a predictable chassis return path near the cage; avoid forcing high-frequency currents across long, narrow “neck” copper.
  • Sideband hardening: presence/LPMode/Reset/Int pins and I²C lines need ESD protection and clean pull-ups near the card edge.
  • Thermal at the cage: module heat couples into the cage; temperature sensors and throttling logic must reflect the hottest realistic point.
  • Service events: hot-plug, partial insertion, and “wobble” during cable routing are the top triggers for intermittent faults.
Sideband boundary (what to implement on the NIC/HCA card)
  • CMIS/DDM reads: treat optics telemetry as a diagnostic tool to separate “module vs. channel vs. NIC” quickly.
  • Presence/interrupt: debounce and ESD-protect; log insert/remove events with timestamps (useful in field RMAs).
  • I²C/SMBus/MCTP: bus switching and isolation are card-level concerns; chassis-wide arbitration policies belong elsewhere.
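A minimal sketch of the debounce-and-log idea for the presence pin, written as a polling loop for clarity (the sampling period and stability count are illustrative, and real implementations usually live in firmware or the driver rather than host Python):

```python
import time

def watch_presence(read_pin, duration_s=10.0, samples=5, period_s=0.01, log=print):
    """Debounce a module presence pin and log timestamped insert/remove events.

    `read_pin` is any callable returning the raw pin level (True = module present).
    A state change is accepted only after `samples` consecutive identical reads,
    which filters the contact bounce seen during hot-plug and cable handling.
    """
    stable = read_pin()
    candidate, count = stable, 0
    deadline = time.monotonic() + duration_s
    while time.monotonic() < deadline:
        raw = read_pin()
        if raw == candidate:
            count += 1
        else:
            candidate, count = raw, 1
        if count >= samples and candidate != stable:
            stable = candidate
            log(f"{time.time():.3f} module {'insert' if stable else 'remove'} event")
        time.sleep(period_s)

# Example with a fake pin that "inserts" a module after ~2 seconds.
start = time.monotonic()
watch_presence(lambda: time.monotonic() - start > 2.0, duration_s=4.0)
```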

Example BOM material numbers (port/cage/sideband) — representative references

The items below are concrete reference MPNs commonly used as building blocks. Mechanical variants (height, latch style, RHS, press-fit vs. SMT) must match the chassis + bezel + heatsink stack-up.

Block | Example MPN (material number) | Why it appears in NIC/HCA port integration
QSFP28 cage | TE Connectivity 2359309-1 (1×1 cage); TE Connectivity 2170790-3 (1×4 cage) | Mechanical cage + EMI containment + module mating robustness; cage/bezel contact quality is a top predictor of "works in lab, fails in rack" behavior.
QSFP-DD / high-density cage | TE Connectivity 2227249-3 | Higher port density raises thermal and shielding constraints; integration quality often shows up as intermittent link retrains and sensitivity to cable strain.
OSFP cage (example) | Amphenol UE62B1620021E1; Amphenol UE62B462002121 | OSFP enables higher-power modules; cage RHS/thermal conduction and bezel grounding become more critical than with lower-power form factors.
OCP NIC 3.0 card-edge connector (example family) | Amphenol Mini Cool Edge examples: ME1016813401101, ME1016811401101 (OCP straddle-mount references) | The connector choice affects insertion loss, serviceability, and signal/ground referencing; selection must match OCP NIC mechanical and platform routing rules.
EMI gasket / contact material | Laird (DuPont) EMI gasket example 4049PA22101800 | Maintains cage-to-bezel electrical contact and reduces slot leakage; an effective gasket/contact strategy often outperforms "more shielding can" alone.
Sideband ESD protection | Texas Instruments TPD4E05U06 (ESD array) | Protects low-speed lines (presence/INT/I²C) at the cage edge; prevents "random" field failures triggered by hot-plug ESD events.
I²C/SMBus port mux (debug + robustness) | Texas Instruments TCA9548A; (alt.) NXP PCA9548A | Allows controlled access to module management channels; useful for isolating a stuck bus and for production test partitioning.
Board ID / FRU EEPROM | Microchip 24AA02E64 (EEPROM with EUI-64 option) | Common for board identity/traceability and provisioning flows; simplifies manufacturing correlation between test logs and physical units.
Practical rule: when intermittent link issues correlate with cable touch, module reinsertion, or chassis vibration, the first suspects are bezel contact, cage grounding continuity, and sideband integrity—before deeper PHY tuning.
Figure K — Port cage integration: shield/return/sideband and where failures couple in
Failure coupling hotspots (from the diagram): (1) bezel contact gaps → EMI leakage and sensitivity to cable touch; (2) weak return path → common-mode noise → link retrains; (3) sideband ESD/bounce → false presence or a stuck I²C bus. Validate with continuity, hot-plug, and cable-strain tests.
Use this diagram as a review checklist: bezel/gasket contact, chassis return integrity near the cage, and sideband ESD protection are often the fastest path to root cause for intermittent field failures.

H2-12 · Validation & Production Checklist

Validation and production checklist: proving it is stable, fast, and diagnosable

A NIC/HCA passes engineering only when it is (1) link-stable under stress, (2) predictable at p99 latency and Mpps, (3) recoverable with clear counters/logs, and (4) reproducible in production. This checklist is organized as a test matrix: test item → observables → pass criteria → failure fingerprint → likely layer.

Layered test structure (what to measure)
  • Link bring-up: BER/eye margin (or equivalent), lane mapping/polarity, training stability, FEC corrected/uncorrected trends.
  • Performance: throughput vs. Mpps, p50/p99 latency, interrupt/coalesce sensitivity, NUMA pinning sensitivity, CPU cycles saved by offloads.
  • RDMA (HCA focus): QP/CQ stability, loss sensitivity (tail-latency explosions), long-duration soak, counter-driven diagnosis.
  • Thermal/Power: throttling thresholds, error rate vs. temperature, module power excursions, event logs correlated with sensor readings.
Pass criteria definition (avoid “looks fine”)
  • Define limits per workload: small-packet (64B) Mpps, mixed-size, and jumbo must each have pass criteria.
  • Latency must include tail: record p99/p99.9 alongside throughput; verify stability under background IRQ pressure.
  • Counter sanity: “clean” runs show stable FEC corrected counts and near-zero uncorrected; spikes must correlate to a known stressor.
  • Reproducibility: the same firmware/NVM build produces the same counter signature across multiple units (production reality check).
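A minimal sketch of turning those criteria into an explicit gate; the limit values are placeholders to be replaced by the per-workload targets defined above:

```python
def evaluate_run(results: dict, limits: dict) -> list[str]:
    """Compare one validation run against explicit pass criteria."""
    failures = []
    if results["mpps_64b"] < limits["min_mpps_64b"]:
        failures.append(f"64B Mpps {results['mpps_64b']} < target {limits['min_mpps_64b']}")
    if results["p99_us"] > limits["max_p99_us"]:
        failures.append(f"p99 {results['p99_us']} us > budget {limits['max_p99_us']} us")
    if results["fec_uncorrected"] > 0:
        failures.append(f"{results['fec_uncorrected']} uncorrected FEC events (expected ~0)")
    if results["fec_corrected_rate"] > limits["max_corrected_per_s"]:
        failures.append("FEC corrected rate above limit without a known stressor")
    return failures

limits  = {"min_mpps_64b": 70.0, "max_p99_us": 150.0, "max_corrected_per_s": 1e3}  # placeholder targets
results = {"mpps_64b": 74.2, "p99_us": 210.0, "fec_uncorrected": 0, "fec_corrected_rate": 120.0}
print(evaluate_run(results, limits) or "PASS")
```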
Production hooks (what to log)
  • Link events: up/down, retrain, lane deskew failures, module insert/remove timestamps.
  • Error counters: FEC corrected/uncorrected, symbol errors, packet drops, congestion markers (as available).
  • Host interface hints: PCIe AER/replay-related evidence (NIC-visible), correlated with throughput/latency dips.
  • Thermal/power: max temperatures, throttling state transitions, rail droop events if monitored.

Reference “golden units” and test gear material numbers (examples)

These are concrete examples frequently used in labs and production lines. They are not requirements—only anchors for procurement and planning.

Category | Example MPN / Model | Role in the checklist
Golden NIC (Ethernet) | Intel E810-CQDA2 | Known baseline for throughput/Mpps/latency comparisons and firmware regression checks (port configuration flexibility helps validation planning).
Golden NIC/HCA (common) | NVIDIA ConnectX-6 Dx, OPN MCX623106AN-CDAT | Reference for counter vocabulary and field-proven stability patterns; useful when correlating FEC/BER, thermal, and link retrain signatures.
Golden HCA (OSFP class) | NVIDIA ConnectX-7 family examples: MCX75310AAS-HEAT, MCX713106AC-CEAT | Anchor for OSFP integration and higher-speed link stress patterns (training stability, power/thermal boundary conditions).
BERT (PHY bring-up) | Keysight M8040A (BERT) / Anritsu MP1900A | Link margin characterization, BER sensitivity vs. temperature/channel, and validation of the "error rate ↔ counter" correlation.
Traffic generator | Spirent TestCenter (example kits vary; procure by required port speeds) | Reproducible Mpps, mixed packet sizes, microburst stress, and tail-latency characterization under controlled conditions.
Production success metric: when a failure happens, the logs and counters must identify the likely layer (module/channel/clock/host interface/thermal) without requiring deep lab-only instruments.
Figure L — Validation test matrix: test item × observables × failure fingerprints
Failure fingerprints (from the diagram, fast triage): FEC corrected spikes that track temperature → thermal/cage/module; throughput drops without port errors → host interface / IRQ / queue pressure; tail latency that explodes after drops → buffer/credit and RDMA sensitivity (HCA). Counters and logs are mandatory; define p99 targets early.
A production-ready NIC/HCA has measurable pass criteria and a counter/log signature that identifies the likely layer when failures happen.


H2-13 · FAQs (NIC / HCA)

FAQs: field symptoms → evidence → likely layer

These FAQs target long-tail searches and practical troubleshooting. Each answer stays within the NIC/HCA boundary and points to the most relevant section(s) for deeper mechanisms and counters.

Tip: align timestamps across FEC/BER, port counters, thermal/power sensors, and event logs. “Link Up” only means training succeeded—it does not guarantee margin under temperature, power, or cable strain.
Why can a port link up, yet throughput fluctuates and FEC “corrected” counters surge?
A rising FEC-corrected rate usually means the link is surviving on error correction rather than clean margin. Correlate per-lane FEC corrected/uncorrected, retrain events, and BER/eye indicators with temperature and module telemetry. If errors track temperature or module power, suspect thermal/power throttling or cage/module integration; if a single lane dominates, suspect channel/equalization margin.
Maps to: H2-4 · H2-10
In a PAM4 link, what do CTLE / FFE / DFE mainly fix, and how do symptoms hint what to adjust?
CTLE primarily compensates high-frequency channel loss; FFE pre-emphasizes edges to reduce inter-symbol interference; DFE cancels post-cursor ISI when the channel creates long “tails.” Start from observable symptoms: lane-localized errors and sensitivity to cable length often point to channel loss/ISI; training instability and wide-lane jitter sensitivity often point to clock recovery margin. Validate changes by watching per-lane FEC and retrain trends.
Maps to: H2-4
Why is “100/200/400G” not enough to choose a card, and how should Mpps/small-packet performance be asked?
Line rate (Gbps) can look perfect while small-packet packet-rate (Mpps) collapses due to descriptor processing, queue depth, interrupt pressure, and pipeline limits. Ask for 64B/128B Mpps, mixed-size profiles, and p99 latency under load—plus CPU cycles saved with offloads enabled. Also request the counter signature during sustained small-packet stress (drops, IRQ rate, queue backpressure) to avoid “benchmark-only” answers.
Maps to: H2-2 · H2-5
Why can enabling TSO/LRO make tail latency worse, even when throughput improves?
Offloads can improve average efficiency but worsen tail latency when they increase batching, amplify burstiness, or interact poorly with interrupt moderation. Large aggregation and aggressive coalescing can add variable queueing delay; poor steering can add cache/NUMA penalties. Verify by toggling offload and coalescing knobs while tracking p99 latency, per-queue drops, and IRQ-per-packet. A “good” setting keeps p99 stable across load changes.
Maps to: H2-5 · H2-7
What field symptoms indicate RSS is misconfigured (jitter, reordering, CPU hotspots)?
Classic signs are one CPU core overheating while others idle, sudden jitter spikes under flow changes, and occasional reordering when steering moves flows across queues. Check per-queue packet distribution, per-core IRQ load, and queue drops. Fixes typically involve a stable hash key, limiting queue migration, aligning queues with NUMA locality, and using flow steering for “heavy hitter” flows so they do not thrash caches.
Maps to: H2-7
Is more SR-IOV VFs always better? Why can throughput stall while interrupts stay high?
VF count is bounded by hard resources: queues, MSI-X vectors, cache, and internal scheduling bandwidth. Too many VFs can fragment queues, trigger IRQ storms, and increase doorbell pressure, so overall throughput stops scaling. Look for high IRQ-per-packet, small per-VF queue depth, and rising backpressure indicators. A practical approach is fewer VFs with adequate queue depth and controlled coalescing, then scale based on measured p99 and IRQ saturation.
Maps to: H2-7
For RoCE RDMA, why can “rare packet loss” make p99 latency explode?
RDMA tail latency is extremely loss-sensitive: even rare drops can trigger retries, timeouts, and QP backoff that multiply queueing delay. The average latency may look fine while p99/p99.9 becomes unusable. Use NIC/HCA counters that reflect retry behavior (CQ/QP error signals, retry/timeout indicators where available) and correlate them with tail spikes. The fix path is reducing drops and stabilizing queue behavior before chasing raw bandwidth.
Maps to: H2-6
How to tell whether the root cause is Ethernet congestion behavior vs NIC/HCA queue backpressure?
Start with a layered counter check. Congestion-policy issues often show external pressure signatures (marks/drops consistent across flows), while NIC/HCA queue backpressure shows ring-full patterns, per-queue drops, IRQ anomalies, and localized hotspots. Align timestamps across port counters, queue metrics, and application p99 spikes. If tail spikes coincide with queue saturation without PHY/FEC degradation, focus on steering/queues/interrupt moderation; if they coincide with RDMA retry behavior, focus on loss sensitivity and buffering.
Maps to: H2-6 · H2-7
The network looks slower, but the real cause is PCIe downshift or AER—how to verify quickly?
When throughput drops without corresponding port errors, verify PCIe link width/speed and look for AER/replay/completion-timeout evidence around the same timestamps as the dips. PCIe retrains or corrected error bursts can masquerade as “network jitter.” Cross-check with NIC event logs and compare behavior in a different slot/platform. Firmware/NVM changes can also shift host-interface behavior, so regression tests should include PCIe stability under temperature and load.
Maps to: H2-8
What are early signs of NIC thermal throttling, and which sensors/counters are most useful?
Early signs include a slow rise in FEC corrected rate, sporadic retrains under sustained load, and throughput steps that correlate with temperature plateaus. The most useful signals are ASIC/on-card temperature, module/cage temperature, power/current telemetry, throttling state transitions, and event logs that mark derate actions. If error rate or tail latency worsens with temperature but improves immediately with airflow changes, prioritize thermal remediation before deeper PHY tuning.
Maps to: H2-10
How can CMIS/DDM optics telemetry separate “bad module” vs “bad link” vs “NIC-side issue”?
Use CMIS/DDM as a triage lens: abnormal optical power, bias current, or module alarms often indicate module/fiber issues; stable optics telemetry combined with FEC corrected growth that tracks NIC temperature points to NIC/cage thermal or power effects. If optics metrics fluctuate with cable movement, suspect connector/cage contact. The fastest isolation is a controlled swap test (module/cable) while comparing the counter signature (FEC, retrains, drops) before and after.
Maps to: H2-10 · H2-11
Why can a NIC advertise IEEE 1588 hardware timestamping, yet accuracy is unstable in practice?
“Hardware timestamping” accuracy depends on where the timestamp is taken (MAC vs PHY), how clock domains cross, and how much variable queueing remains. Interrupt moderation and queue contention can introduce load-dependent jitter; clock/PLL jitter and temperature drift can further degrade stability. A quick check is whether offset error worsens under traffic load—if yes, variable queueing and clock-domain effects are dominating. This page covers NIC-side boundaries, not rack-level timing architecture.
Maps to: H2-9