NIC / HCA (Ethernet & InfiniBand) Design & Debug Guide
A NIC/HCA is more than “link up”: real success is stable throughput and predictable p99 latency under load, proven by PHY/FEC counters, queue/interrupt behavior, and thermal/power telemetry. This page explains how Ethernet NICs and RDMA-focused HCAs move packets from PCIe DMA to SerDes/optics, and how to use the right evidence to select, validate, and troubleshoot them in production.
Scope & Boundary: what this page covers (and what it does not)
This chapter is the page “contract”: it defines the owner of the topic, the practical boundary against neighboring pages, and the exact evidence this page will use for selection and debug.
In scope (owner statements)
- NIC: Ethernet-centric I/O that reduces host CPU work via offloads (e.g., RSS, TSO, checksum) and exposes actionable telemetry.
- HCA: RDMA-centric I/O that prioritizes tail latency control and verbs semantics (QP/CQ) for RoCE/InfiniBand deployments.
- System boundary: host-side is a PCIe endpoint (DMA + queues); line-side is PCS/PMA + SerDes linking to optics/cable and the fabric.
Out of scope (guardrails)
- Retimer/Switch deep-dive: this page only uses NIC-visible symptoms (BER/FEC counters, link training behavior), not retimer internals.
- Rack timing architecture: this page only covers on-NIC timestamp boundaries; system sync trees belong to the Time Card topic.
- Programmable dataplane: SmartNIC/DPU compute pipelines are excluded; only a “see also” reference is allowed.
Use this page when a link is “up” but throughput is unstable, FEC-corrected errors trend with temperature, p99 latency explodes under microbursts, or RDMA workloads show rare but severe tail events.
1-minute definition + first-principles selection questions
A NIC or HCA is a PCIe endpoint that moves packets between host memory and a network fabric using DMA, queues, and line-side SerDes. NICs focus on Ethernet offloads (RSS/TSO/checksum) to reduce CPU cost, while HCAs prioritize RDMA semantics and tail latency control for RoCE/InfiniBand. Practical selection depends on packet rate, p99 latency, RDMA stability, and observable error counters (FEC/BER/thermal/power).
How it works (3–5 steps)
- DMA & queues: descriptors in host memory define what to transmit/receive and where to place data.
- Pipeline: parsing, classification, steering (RSS/flow rules) and optional offloads are applied.
- RDMA (if used): verbs/QP/CQ scheduling maps requests to DMA operations with tight latency targets.
- Line side: MAC/PCS/PMA drive SerDes lanes; PAM4 DSP and FEC protect the link margin.
- Telemetry: counters and sensors (FEC/BER, module DDM, temperature, power) explain “why” performance changes.
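The DMA/queue step above can be sketched as a toy descriptor ring. `TxRing` and its methods are invented for illustration only, not a real driver API; the point is the flow: post descriptor, ring doorbell, NIC consumes by DMA, completion returns, and a full ring becomes backpressure.

```python
from collections import deque

class TxRing:
    """Toy descriptor ring (hypothetical API, not a real driver).

    The host posts descriptors and would ring a doorbell; the NIC
    consumes them by DMA and posts completions for the host."""
    def __init__(self, size: int):
        self.size, self.pending, self.completed = size, deque(), 0

    def post(self, desc) -> bool:
        if len(self.pending) >= self.size:
            return False              # ring full: backpressure to the host
        self.pending.append(desc)     # descriptor written to host memory
        return True                   # doorbell write would follow

    def nic_poll(self, budget: int) -> int:
        done = min(budget, len(self.pending))
        for _ in range(done):
            self.pending.popleft()    # NIC DMA-fetches and transmits
        self.completed += done        # completion entries for the host
        return done

ring = TxRing(size=4)
accepted = sum(ring.post(i) for i in range(6))
print(accepted, ring.nic_poll(budget=8))  # 4 4 (two posts hit a full ring)
```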
First principles: what to check first (before brand/model)
- Throughput vs packet rate: Gbps describes bandwidth; Mpps predicts small-packet behavior and CPU pressure.
- Tail latency: p50 can look good while p99/p999 fails under microbursts, queue backpressure, or rare retries.
- RDMA stability (if required): “supported” is not enough—validate behavior under congestion and loss sensitivity.
- Observability: FEC corrected/uncorrected, BER proxies, module DDM/CMIS, and event logs must exist and be readable.
| Spec | What it predicts | Questions to ask | Evidence to expect |
|---|---|---|---|
| Gbps (port speed) | Peak bandwidth ceiling; module/cage compatibility | Which port form factor (QSFP/OSFP)? Which lane mode and FEC requirement? | Link mode report; FEC capability; module DDM/CMIS readout |
| Mpps (packet rate) | Small-packet performance, interrupt/queue pressure | Mpps at 64B/128B? Single-flow vs multi-flow? Offload on/off conditions? | Packet-rate curve; CPU cost; queue depth/overrun counters |
| p99 (latency) | Tail-risk under microbursts, retries, congestion sensitivity | p50/p99/p999 under defined load? With and without interrupt coalescing? | Latency histograms; retry/replay indicators; queue backpressure signals |
| RDMA (RoCE/IB) | Remote-memory semantics, tail stability, loss sensitivity | RoCEv2 vs IB mode? What counters exist for congestion/loss events? | QP/CQ health; error syndromes; stable tail under stress profile |
| FEC (errors) | Link margin vs DSP/FEC workload; “link up but unstable” root cause | Corrected vs uncorrected thresholds? Lane-level counters exposed? | FEC corrected/uncorrected trend; lane error distribution |
| Thermal/Power | Throttling risk, temperature-coupled error patterns | Max power at target link mode? Thermal sensors? Throttle behavior? | Temperature/power logs; performance vs temperature correlation |
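The Gbps-vs-Mpps row above is plain wire arithmetic: every Ethernet frame also carries a 7-byte preamble, 1-byte SFD, and a 12-byte inter-frame gap, which dominates at small frame sizes. A minimal sketch:

```python
def max_packet_rate_mpps(link_gbps: float, frame_bytes: int) -> float:
    """Theoretical max Ethernet packet rate. Each frame on the wire
    also carries a 7B preamble + 1B SFD and a 12B inter-frame gap,
    so small frames cost far more per bit than payload suggests."""
    wire_bits = (frame_bytes + 8 + 12) * 8
    return link_gbps * 1e9 / wire_bits / 1e6

# 100 GbE: ~148.8 Mpps at 64B frames vs ~8.2 Mpps at 1500B.
print(round(max_packet_rate_mpps(100, 64), 1))    # 148.8
print(round(max_packet_rate_mpps(100, 1500), 2))  # 8.22
```

This is why a card can hold line rate with large packets yet fall far short of 148.8 Mpps at 64B: the per-packet work budget, not the bit rate, becomes the ceiling.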
System architecture map: PCIe to QSFP/OSFP end-to-end chain
This chapter establishes a fast “evidence chain” across three domains—Host/PCIe, NIC core, and Line side—so unstable throughput, tail latency spikes, or rising corrected errors can be assigned to the right layer before deeper debug.
Three domains (who owns what)
- Host / PCIe domain: DMA submissions, queue doorbells, MSI-X interrupts, and (light) IOMMU/ATS effects on addressability and completion flow.
- NIC core domain: MAC + packet pipeline, flow steering, offload engines, and scheduling that turns descriptors into predictable packet movement.
- Line side domain: PCS/PMA and SerDes lanes connected to QSFP/OSFP modules and the fabric, with link training and error protection (FEC).
| Domain | What it controls | Typical symptom | NIC-visible evidence |
|---|---|---|---|
| Host/PCIe | DMA pacing, doorbell cadence, completion flow, interrupt pressure | Periodic throughput drops; CPU spikes; jitter tied to host load | Queue pressure signals; completion stalls; PCIe error category flags |
| NIC core | Parsing/classification, steering, offloads, scheduling | Gbps looks fine but Mpps collapses; p99 worsens under microbursts | Drops/overruns; scheduler backpressure; flow distribution imbalance |
| Line side | Link mode, training, lane equalization, FEC protection | Link stays up but stability degrades; error rates track temperature | FEC corrected/uncorrected trend; lane errors; module DDM/CMIS signals |
Boundary reminder: the diagram and text stay at a vendor-neutral functional level. No proprietary block names are used, and external fabric configuration is not discussed—only NIC-visible counters, states, and correlations.
SerDes & PHY deep dive: PAM4 DSP, FEC, training, and “link up but unstable” behavior
“Link up” only confirms that training reached a workable state. Throughput instability often appears when margin is thin: PAM4 relies on DSP and FEC to keep BER acceptable, and that compensation can vary with temperature, lane conditions, and jitter.
PAM4 vs NRZ: what changes in practice
- PAM4: smaller voltage spacing per level → higher sensitivity to noise, jitter, and non-ideal channel response.
- NRZ: larger margin → fewer “near-edge” states where corrected errors climb while the link stays up.
- Operational signal: rising FEC corrected without immediate uncorrected events often indicates a margin tax that can later convert into tail latency risk.
DSP chain: symptom → module → evidence
- High-frequency loss → CTLE helps restore high-frequency content; lane-to-lane corrected error imbalance is a common hint.
- ISI (intersymbol interference) → FFE/DFE compensate channel memory; training “not converged” or unstable EQ states often appear before hard failures.
- Jitter/phase noise sensitivity → CDR stability matters; errors that correlate with temperature or operating mode suggest clock/jitter stress.
- Residual random errors → FEC absorbs them until uncorrected events start; corrected/uncorrected ratio is a leading indicator.
FEC counters: how to read them without guessing
- Corrected trending up: the link is spending “error budget” to stay stable; margin is shrinking or compensation load is rising.
- Uncorrected events: represent real frame loss at the physical layer; they amplify retries and tail latency, especially for loss-sensitive transports.
- Lane concentration: errors localized to a subset of lanes often indicate a channel/connector/module lane-specific issue rather than a global mode mismatch.
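The three reading rules above can be mechanized for a polling loop. In this sketch the warn threshold and the 70% lane-concentration cutoff are illustrative choices, not standard limits; real limits depend on link mode, FEC type, and polling interval.

```python
def classify_fec(corrected_per_lane: list, uncorrected: int,
                 corrected_warn: int = 10_000) -> str:
    """Triage one polling interval's FEC counters. The warn threshold
    and the 70% concentration cutoff are illustrative, not standards."""
    total = sum(corrected_per_lane)
    if uncorrected > 0:
        return "uncorrected: real frame loss; expect retries and tail latency"
    if total > corrected_warn:
        # One lane eating most of the error budget points at a
        # channel/connector/module lane issue, not a global mismatch.
        if max(corrected_per_lane) > 0.7 * total:
            return "corrected, lane-concentrated: suspect channel/connector/module lane"
        return "corrected, spread: global margin shrinking; check temp/power"
    return "healthy"

print(classify_fec([120, 95, 110, 130], 0))        # healthy
print(classify_fec([48_000, 900, 1_100, 800], 0))  # lane-concentrated hint
```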
Bring-up checklist (line-side): what to validate first
- Mode agreement: autoneg / capability match, correct lane count and mapping for the selected module mode.
- Training success: link training reaches stable EQ states (TX/RX equalization) and lane deskew completes.
- Polarity/mapping sanity: polarity inversions and lane swaps are tolerated only if mapping is consistent end-to-end.
- Correlations: plot corrected errors against temperature/power and module DDM to identify margin-driven instability.
Packet pipeline & offload: real boundaries and performance traps (Gbps vs Mpps vs p99)
High wire-rate throughput does not guarantee strong small-packet Mpps or stable p99 latency. The common root cause is per-packet work along the descriptor → queue → DMA → completion → interrupt/coalescing chain, plus pipeline depth and contention.
Why “Gbps is high” but “Mpps collapses”
- Per-packet overhead dominates: each packet consumes descriptor work, queue arbitration, DMA bookkeeping, and completion processing.
- Interrupt/coalescing effects: moderation improves CPU efficiency but can stretch tail latency when bursts arrive or queues back up.
- Pipeline depth and contention: parse/classify/steer stages have finite capacity; microbursts convert into queue backlog and p99 spikes.
- Evidence pattern: wire-rate looks healthy while drops/overruns, queue pressure, or completion stalls increase.
Offload catalog: what it helps, what it can hurt
| Offload | Primary benefit | Common trap | NIC-visible evidence |
|---|---|---|---|
| Checksum | Saves host CPU per packet | Does not fix queue/interrupt bottlenecks; Mpps can still be limited | CPU cycles saved vs queue pressure unchanged |
| TSO/GSO | Reduces TX per-packet overhead for large payloads | Little impact on true small-packet traffic; segmentation does not apply | Higher Gbps with similar 64B Mpps ceiling |
| LRO/GRO | Reduces RX packet rate pressure | Can increase latency variance; may change observability and fairness during bursts | Lower packet rate with p99 changes |
| RSS / Steering | Spreads flows across queues/cores | Imbalanced distribution or queue hotspots can worsen tail latency | Queue-to-queue skew; hotspot drops |
| Crypto inline | Offloads encryption/decryption work for supported paths | Does not replace key/attestation systems; can still be bounded by per-packet pipeline limits | CPU saved with unchanged queue/IRQ limits |
Metrics hierarchy (procurement + validation)
- Wire rate (Gbps): large-packet ceiling; necessary but not sufficient.
- Mpps (small packets): reveals per-packet cost and queue/interrupt capacity.
- p99 latency: exposes backlog, coalescing, retries, and burst sensitivity.
- CPU cycles saved: quantifies offload value without hiding tail-risk.
Practical rule: when Gbps is stable but user-facing latency becomes “spiky,” treat queue backlog and completion/interrupt behavior as first-class suspects before chasing link-level causes.
RDMA in practice: RoCEv2 vs InfiniBand engineering differences, congestion myths, and why tail latency explodes
RDMA is not a single feature toggle. It is an end-to-end completion model (verbs/QP/CQ) that reacts strongly to loss and congestion. Small amounts of retry/timeout can amplify into a backlog that dominates p99/p999 latency.
RoCEv2 vs InfiniBand: “engineering differences” (not a glossary)
- RoCEv2: runs over Ethernet transport. Practical stability depends on loss/congestion management in the path (mentioned only; no switch details here).
- InfiniBand: fabric semantics are native and consistent for RDMA traffic, with a more unified verbs/QP/CQ operating model end-to-end.
- Implication for p99: RoCE environments often show stronger correlation between congestion signals and tail latency; IB environments still suffer tail growth when retries or resource backlogs occur.
QP/CQ/doorbell/WQE: boundaries that create tail latency
- WQE (work request): the unit of outstanding work. Accumulated outstanding WQEs are the physical form of latency backlog.
- QP (queue pair): ordering and resource window. A blocked QP delays later work even if the link stays up.
- CQ (completion queue): completion aggregation. Completion handling cadence affects tail behavior under bursts.
- Doorbell: submission pressure. Batch submission improves efficiency but can increase waiting time when congestion or retries appear.
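Little's law ties these boundaries together: outstanding WQEs are roughly the submission rate times the completion latency, so any slowdown in completions shows up directly as backlog. A sketch:

```python
def outstanding_wqes(submit_rate_per_s: float, completion_latency_s: float) -> float:
    """Little's law for a QP: in-flight WQEs = submission rate x
    completion latency. Backlog is latency made visible."""
    return submit_rate_per_s * completion_latency_s

# 1M ops/s at 10 µs completions: about 10 WQEs in flight.
# If a retry episode stretches completions to 1 ms, backlog grows ~100x:
print(outstanding_wqes(1e6, 10e-6), outstanding_wqes(1e6, 1e-3))
```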
Why “a little loss” becomes “huge tail latency”
- Retry/timeout amplification: once retransmit or timeout logic activates, outstanding work waits behind recovery.
- Backlog propagation: queued work accumulates while completions slow down; p50 can look fine while p99/p999 degrade sharply.
- Evidence-first approach: correlate retry/timeout categories with queue backlog, completion error categories, and congestion indicators visible to the NIC/HCA.
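A toy quantile model makes the amplification concrete: with per-attempt loss p and a full timeout per retry, the median never moves while high quantiles jump by whole timeouts. This is an illustrative model, not a transport specification.

```python
def latency_quantile(q: float, base_us: float, loss: float, timeout_us: float) -> float:
    """Latency at quantile q when each attempt independently fails
    with probability `loss` and each failure costs a full timeout
    before the retry. Illustrative, not a transport spec."""
    retries = 0
    while 1 - loss ** (retries + 1) < q:   # P(success within retries+1 attempts)
        retries += 1
    return base_us + retries * timeout_us

# 0.5% loss, 5 µs base RTT, 1 ms retransmit timeout:
print(latency_quantile(0.50, 5, 0.005, 1000))   # 5 (median unaffected)
print(latency_quantile(0.999, 5, 0.005, 1000))  # 1005 (one timeout owns p999)
```

A 0.5% loss rate leaves p50 and even p99 at the base RTT, while p999 is two hundred times larger; this is the "a little loss, huge tail" mechanism in miniature.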
What can be proven from the NIC/HCA side (without fabric configuration)
- Congestion signals (category): marks/pause-like indicators and rising queue occupancy patterns that align with latency spikes.
- Retry evidence (category): retransmit and timeout trends that precede uncorrected loss symptoms.
- Completion evidence (category): completion errors or slowed completion cadence that match tail growth.
Queues, virtualization, and steering: how RSS / SR-IOV / MSI-X “split performance” and keep it controllable
Parallelism in a NIC/HCA is not “more threads.” It is a controlled mapping: flows → queues → DMA/completions → interrupts → CPU cores. When the mapping is unstable or oversubscribed, symptoms look like software jitter while the root cause is queue backpressure, doorbell write amplification, and cache locality loss.
Queue mental model (what actually runs in parallel)
- RX path: flow classification → RX queue → DMA to host → completion → MSI-X interrupt (or polling).
- TX path: TX queue → descriptors → DMA fetch → scheduling → wire.
- What it controls: wire rate (Gbps) is not enough; Mpps and p99 latency are dominated by queue pressure and completion cadence.
RSS and flow steering: benefits, and why “parallel” can become “noisy”
- RSS benefit: distributes flows across multiple queues to raise parallel throughput and reduce per-core hotspots.
- Hidden cost: cache locality can deteriorate when flows bounce across cores; tail latency rises under bursts.
- Ordering boundary: a single flow should remain on one queue; re-mapping or mixed steering policies can create reordering-like behavior.
- Flow steering role: prioritizes control (stable mapping, predictable p99) over “more randomness.”
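The "one flow stays on one queue" invariant can be illustrated with a stand-in hash. Real NICs use a Toeplitz hash with a programmable key plus an indirection table; CRC32 here is only a placeholder for the deterministic mapping.

```python
import zlib

def rss_queue(five_tuple: tuple, n_queues: int) -> int:
    """Stand-in for RSS hashing (CRC32 instead of Toeplitz).
    The invariant is the point: one flow, one queue, every time."""
    return zlib.crc32(repr(five_tuple).encode()) % n_queues

flow = ("10.0.0.1", "10.0.0.2", 6, 49152, 4791)  # src, dst, proto, sport, dport
assert rss_queue(flow, 8) == rss_queue(flow, 8)   # deterministic: ordering holds
# Changing the queue count remaps flows, which is one reason resizing
# queues under live traffic can look like reordering to the host stack:
print(rss_queue(flow, 8), rss_queue(flow, 16))
```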
SR-IOV (PF/VF): isolation is resource slicing
| Concept | Practical pitfall | NIC-side proof category |
|---|---|---|
| VF count | More VFs can reduce per-VF queue depth and vectors; aggregate looks fine but per-tenant Mpps/p99 degrades. | Queue pressure skew across VFs; completion backlog per VF |
| Queue resources | Too few queues per VF forces contention; hot flows collide, creating tail spikes. | Hotspot queues, drops/overruns, uneven queue occupancy |
| Locality | Remote placement (NUMA distance) increases DMA/completion time variance. | Completion stalls correlated with placement changes |
MSI-X and coalescing: the small-packet vs p99 trade-off knob
- MSI-X: provides multi-vector interrupts so multiple queues can complete independently (reduces single-IRQ bottlenecks).
- Coalescing: batches interrupts to reduce CPU overhead, but can increase tail latency by “holding” completions during bursts.
- Evidence-first rule: interpret p99 spikes together with IRQ rate, queue depth, and completion cadence—not by throughput alone.
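A single-timer coalescing model (deliberately simplified; real NICs add packet-count triggers and adaptive moderation) shows how completions get "held" during a burst:

```python
def completion_delays_us(arrivals_us, coalesce_us):
    """Per-packet completion delay under a single coalescing timer.

    The first pending completion arms the timer; everything arriving
    before it fires is signaled together when it fires."""
    delays, timer_fires = [], None
    for t in arrivals_us:
        if timer_fires is None or t > timer_fires:
            timer_fires = t + coalesce_us   # arm (or re-arm) the timer
        delays.append(timer_fires - t)
    return delays

burst = [0, 1, 2, 3, 100]   # a 4-packet microburst, then a straggler (µs)
print(completion_delays_us(burst, 0))    # [0, 0, 0, 0, 0] (no coalescing)
print(completion_delays_us(burst, 50))   # [50, 49, 48, 47, 50] (held completions)
```

Note the trade: the 50 µs timer cuts five interrupts down to two, but every packet in the burst pays up to the full timer in added completion latency, exactly the p99 cost the bullet above describes.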
When it looks like software: random jitter, uneven CPU load, or “driver hiccups” often map to queue backpressure, doorbell write amplification (too many doorbell writes per unit of useful work), or cache-miss amplification under bursty traffic.
PCIe host interface (NIC endpoint view): why AER, link downshift, and retraining look like “network jitter”
A NIC/HCA can show throughput drops and latency spikes even when the line side is clean. From the endpoint perspective, link width/speed changes, retraining, and error recovery (AER categories) can stretch completions and stall DMA, which propagates upward as “packet jitter” symptoms.
What matters from the endpoint perspective
- Link width / speed changes: sudden ceiling shift (step-like throughput drop) without obvious line-side errors.
- Retraining / recovery episodes: short stalls that resemble “micro-disconnects” to upper layers.
- AER categories: error handling can trigger retries or recovery paths; tail latency grows even if p50 remains normal.
- Replay / completion timeout (category): completions slow down, queue pressure increases, and p99 spikes follow.
How to prove “network symptoms” are actually PCIe-driven
- Step 1: check line-side error trends (category). If they remain flat while performance shifts, host-side probability rises.
- Step 2: correlate PCIe link state changes (width/speed/retraining/AER categories) with the time of throughput drops or p99 spikes.
- Step 3: confirm completion stalls and queue pressure increase during those PCIe events.
- Step 4: conclude root layer: completions slowed → DMA progress slowed → queues back up → “network jitter” becomes visible.
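Step 2 reduces to a time-alignment check over two event logs. The timestamps below are hypothetical; the function only asks what fraction of performance dips sit within a window of a PCIe link event.

```python
def pcie_correlated_fraction(dips_s, pcie_events_s, window_s=1.0):
    """Fraction of throughput dips falling within ±window of a PCIe
    link event (retrain / AER / width-speed change). A high fraction
    argues the 'network jitter' is host-interface-driven."""
    hits = sum(any(abs(d - e) <= window_s for e in pcie_events_s) for d in dips_s)
    return hits / len(dips_s)

dips = [10.2, 33.8, 71.1, 90.5]   # seconds, from the throughput log (hypothetical)
pcie = [10.0, 34.0, 71.5]         # seconds, from the link-state log (hypothetical)
print(pcie_correlated_fraction(dips, pcie))  # 0.75 (3 of 4 dips match PCIe events)
```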
Firmware/NVM changes: why versions can shift performance or compatibility (principles)
- Default queue and interrupt behavior can change: different moderation defaults alter Mpps and p99 trade-offs.
- Virtualization resource slicing can change: VF limits, queue allocation, or completion handling policies may differ across versions.
- Error handling paths can change: recovery behavior affects how often and how long “stalls” appear in the field.
- Validation rule: after any version change, re-check the same evidence categories (link state, AER categories, completion cadence, queue pressure).
Boundary: this chapter stays at the NIC endpoint viewpoint. It maps symptoms to layers and proof categories without describing PCIe switch/retimer topology or platform-level signal-integrity design.
Clocking and timing inside a NIC: jitter budget, CDR, and the real boundary of “hardware timestamp”
This chapter stays inside the NIC/HCA. It explains how clock domains (refclk → PLL → SerDes/CDR → MAC/core) translate into BER/FEC pressure, training stability, and tail latency symptoms, and where a NIC’s hardware timestamp can (and cannot) remove uncertainty.
Internal clock domains (what each domain controls)
- Refclk domain: the external reference feeding internal synthesis. Instability often shows as broad “margin shrink.”
- PLL / conditioning domain: produces internal clocks; poor margin increases BER and pushes FEC corrected upward.
- SerDes / CDR domain: clock recovery and lane timing; sensitive to temperature, supply noise, and channel variability.
- MAC / core domain: packet pipeline timing; interacts with queueing and completion cadence (p99 exposure).
- Timestamp domain (if present): defines where time is captured; cross-domain alignment is a bounded error term.
How jitter becomes field symptoms (evidence-first)
- Margin tightens → lane errors rise → FEC corrected increases (throughput still “looks fine”).
- Worsening margin → FEC uncorrected / symbol errors → drops/timeouts → throughput jitter appears.
- Training sensitivity → repeated training adjustments or retraining episodes, especially near temperature/power edges.
- Practical reading: when line-side module readings are stable but FEC/BER and retraining trends worsen, prioritize clock/power/thermal inside the NIC before blaming the network.
CDR and bring-up stability (why “link up” is not “link stable”)
- CDR role: tracks timing variation and keeps sampling aligned; changing conditions increase lock difficulty.
- What instability looks like: BER drift, higher FEC pressure, occasional lane deskew issues, and sporadic retraining.
- What to correlate: lane error trend + FEC corrected/uncorrected split + retraining events (time aligned with temperature or power changes).
Hardware timestamp boundary (tap point + error components)
| Item | What it means inside the NIC | Typical error component |
|---|---|---|
| Tap point (MAC-side) | Timestamp captured near MAC processing; easier integration with the packet pipeline. | More exposure to queueing and pipeline scheduling variance |
| Tap point (PHY/PCS-side) | Timestamp captured closer to the wire; better represents egress/ingress timing at the link boundary. | Still includes clock-domain crossing alignment and internal transport delay |
| Error budget | Even with HW timestamp, uncertainty remains bounded by internal domains and buffering. | Queueing · CDC alignment · SerDes/PCS path |
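The error-budget row can be made concrete by combining per-domain bounds: plain sum for the worst case, root-sum-square when the terms are independent and roughly Gaussian. The component names and nanosecond values below are illustrative, not datasheet numbers.

```python
from math import sqrt

def timestamp_uncertainty_ns(components_ns: dict) -> dict:
    """Combine per-domain timestamp error bounds: worst case is the
    plain sum; RSS is the usual estimate for independent terms."""
    worst = sum(components_ns.values())
    rss = sqrt(sum(v * v for v in components_ns.values()))
    return {"worst_ns": worst, "rss_ns": round(rss, 1)}

print(timestamp_uncertainty_ns({
    "cdc_alignment": 4.0,    # clock-domain crossing (illustrative)
    "serdes_pcs_path": 6.0,  # internal transport delay variation
    "queueing": 20.0,        # MAC-side tap only; near zero at a PHY/PCS tap
}))
```

The queueing term is the reason the tap point matters: a PHY/PCS-side tap removes the largest component of the budget, while a MAC-side tap keeps it.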
Boundary: this chapter only covers NIC internal clock domains and timestamp capture points. Rack or fabric-level time distribution (PTP trees, SyncE, GPSDO) belongs to the Time Card page.
Power, thermal, and telemetry hooks: rail droop, throttling, hotspots, and how observability enables fast triage
Performance instability often follows a simple loop: power/thermal stress → link margin shrinks → errors rise → recovery/throttling triggers → throughput and p99 drift. This chapter focuses on NIC-visible sensors, counters, and logs that separate “module vs channel vs NIC” without rack-level cooling details.
Power rails (concept level): how droop turns into link instability
- Common rail categories: core · SerDes · PLL · IO (names vary, roles are consistent).
- Failure chain: rail droop or noise → CDR/SerDes margin tightens → BER and FEC corrected trend upward → p99 spikes follow.
- How to validate: correlate voltage/current trends with FEC/BER and retraining events (time aligned).
Thermal hotspots and throttling: why it “looks like a network problem”
- Hotspots: SerDes edge · ASIC core · PLL area · module cage (thermal coupling matters).
- Throttling signatures: step-like throughput ceilings, rising error trends near a temperature knee, and clustered retraining episodes.
- Evidence-first: when throttling appears, look for synchronized changes in temperature, power, and error counters.
Telemetry categories (what to read, and why it proves the layer)
| Category | What it captures | How it is used |
|---|---|---|
| Health (temp/volt/curr) | Stress signals that compress link margin or trigger protection policies. | Time-align with errors and performance shifts |
| Link (FEC/BER/lane) | Error trends (corrected vs uncorrected) and lane-level drift. | Separate “margin shrink” from “hard failures” |
| Traffic (port counters) | Drops, retries/timeouts, congestion-visible counters (NIC-side). | Connect errors to user-visible symptoms |
| Logs (event timestamps) | State changes: retrain, recovery, throttling, alerts. | Build a causal timeline |
| Mgmt (I²C/SMBus/MCTP) | Out-of-band access paths for telemetry exchange (concept only). | Enable remote observability |
Optics module (CMIS/DDM) triage: module vs channel vs NIC
- Module-first: module readings drift abnormally and align with errors → prioritize module seating, module thermals, or module health.
- NIC-first: module readings remain stable while NIC FEC/BER/retrain aligns with NIC temperature or rail stress → prioritize NIC thermal/power margin.
- Neither is obvious: module and error trends are clean but performance jitters → return to queue/PCIe evidence chains (previous chapters).
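The three triage branches above can be written as a small decision sketch. The inputs are the yes/no conclusions you reach from time-aligned telemetry: do module DDM readings drift with the errors, and do NIC FEC/BER/retrain trends align with NIC temperature or rail stress?

```python
def triage(module_readings_drift: bool, nic_stress_aligned: bool) -> str:
    """Module-vs-channel-vs-NIC triage from two telemetry conclusions
    (decision sketch of the bullets above, not a vendor procedure)."""
    if module_readings_drift:
        return "module-first: check seating, module thermals, module health"
    if nic_stress_aligned:
        return "nic-first: check NIC thermal/power margin"
    return "neither: return to the queue/PCIe evidence chains"

print(triage(module_readings_drift=False, nic_stress_aligned=True))
```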
Boundary: rack-level fan curves and liquid-cooling control belong to rack thermal/cooling pages. Here the focus is NIC-visible hotspots, sensor evidence, and timestamped event correlation.
Form Factors & Port/Module Integration: PCIe AIC → OCP NIC, and what matters at the cage
This section focuses on board-level integration details that decide real-world link stability: mechanical stack-up, return paths at the bezel, port-cage shielding continuity, and low-speed sideband robustness (presence/interrupt/I²C). It stays inside the NIC/HCA card boundary—no storage backplane management and no rack-level cooling control.
- PCIe AIC: bracket + bezel geometry drives EMI leakage paths; airflow is often front-to-back across the cage and heatsink.
- OCP NIC 3.0: serviceability and platform standardization; tighter mechanical constraints around cage RHS/stack height and chassis reference.
- Common to both: port type (QSFP/OSFP), module power budget, and cable bend radius dominate mechanical risk more than PCB routing alone.
- Shield continuity: cage-to-bezel contact must be low impedance; gaps become slot antennas at high frequency.
- Return path control: keep a predictable chassis return path near the cage; avoid forcing high-frequency currents across long, narrow “neck” copper.
- Sideband hardening: presence/LPMode/Reset/Int pins and I²C lines need ESD protection and clean pull-ups near the card edge.
- Thermal at the cage: module heat couples into the cage; temperature sensors and throttling logic must reflect the hottest realistic point.
- Service events: hot-plug, partial insertion, and “wobble” during cable routing are the top triggers for intermittent faults.
- CMIS/DDM reads: treat optics telemetry as a diagnostic tool to separate “module vs. channel vs. NIC” quickly.
- Presence/interrupt: debounce and ESD-protect; log insert/remove events with timestamps (useful in field RMAs).
- I²C/SMBus/MCTP: bus switching and isolation are card-level concerns; chassis-wide arbitration policies belong elsewhere.
Example BOM material numbers (port/cage/sideband) — representative references
The items below are concrete reference MPNs commonly used as building blocks. Mechanical variants (height, latch style, RHS, press-fit vs. SMT) must match the chassis + bezel + heatsink stack-up.
| Block | Example MPN (material number) | Why it appears in NIC/HCA port integration |
|---|---|---|
| QSFP28 cage | TE Connectivity 2359309-1 (1×1 cage); TE Connectivity 2170790-3 (1×4 cage) | Mechanical cage + EMI containment + module mating robustness; the cage/bezel contact quality is a top predictor of “works in lab, fails in rack” behavior. |
| QSFP-DD / high-density cage | TE Connectivity 2227249-3 | Higher port density raises thermal and shielding constraints; integration quality often shows up as intermittent link retrains and sensitivity to cable strain. |
| OSFP cage (example) | Amphenol UE62B1620021E1; Amphenol UE62B462002121 | OSFP enables higher power modules; cage RHS/thermal conduction and bezel grounding become more critical than with lower-power form factors. |
| OCP NIC 3.0 card-edge connector (example family) | Amphenol Mini Cool Edge examples: ME1016813401101, ME1016811401101 (OCP straddle mount references) | The connector choice affects insertion loss, serviceability, and signal/ground referencing; selection must match OCP NIC mechanical and platform routing rules. |
| EMI gasket / contact material | Laird (DuPont) EMI gasket example 4049PA22101800 | Used to maintain cage-to-bezel electrical contact and reduce slot leakage; an effective gasket/contact strategy often outperforms “more shielding can” alone. |
| Sideband ESD protection | Texas Instruments TPD4E05U06 (ESD array) | Protects low-speed lines (presence/INT/I²C) at the cage edge; prevents “random” field failures triggered by hot-plug ESD events. |
| I²C/SMBus port mux (debug + robustness) | Texas Instruments TCA9548A; (alt.) NXP PCA9548A | Allows controlled access to module management channels; useful for isolating a stuck bus and for production test partitioning. |
| Board ID / FRU EEPROM | Microchip 24AA02E64 (EEPROM with EUI-64 option) | Common for board identity/traceability and provisioning flows; simplifies manufacturing correlation between test logs and physical units. |
Validation & Production Checklist: proving it is stable, fast, and diagnosable
A NIC/HCA passes engineering only when it is (1) link-stable under stress, (2) predictable at p99 latency and Mpps, (3) recoverable with clear counters/logs, and (4) reproducible in production. This checklist is organized as a test matrix: test item → observables → pass criteria → failure fingerprint → likely layer.
- Link bring-up: BER/eye margin (or equivalent), lane mapping/polarity, training stability, FEC corrected/uncorrected trends.
- Performance: throughput vs. Mpps, p50/p99 latency, interrupt/coalesce sensitivity, NUMA pinning sensitivity, CPU cycles saved by offloads.
- RDMA (HCA focus): QP/CQ stability, loss sensitivity (tail-latency explosions), long-duration soak, counter-driven diagnosis.
- Thermal/Power: throttling thresholds, error-rate vs. temperature, module power excursions, event logs correlated with sensor readings.
- Define limits per workload: small-packet (64B) Mpps, mixed-size, and jumbo must each have pass criteria.
- Latency must include tail: record p99/p99.9 alongside throughput; verify stability under background IRQ pressure.
- Counter sanity: “clean” runs show stable FEC corrected counts and near-zero uncorrected; spikes must correlate to a known stressor.
- Reproducibility: the same firmware/NVM build produces the same counter signature across multiple units (production reality check).
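The reproducibility check can be automated as a fleet-median comparison across units on the same firmware/NVM build. The counter names and the 25% relative tolerance below are placeholders for your own validation set:

```python
from statistics import median

def signature_outliers(units: dict, tolerance: float = 0.25) -> list:
    """Flag units whose counter signature deviates from the fleet
    median by more than `tolerance` (relative) on any counter."""
    counters = next(iter(units.values())).keys()
    med = {c: median(u[c] for u in units.values()) for c in counters}
    flagged = []
    for name, u in units.items():
        if any(med[c] and abs(u[c] - med[c]) / med[c] > tolerance for c in counters):
            flagged.append(name)
    return flagged

fleet = {
    "unit-A": {"fec_corrected": 1000, "drops": 2},
    "unit-B": {"fec_corrected": 1100, "drops": 2},
    "unit-C": {"fec_corrected": 9000, "drops": 2},  # off-signature unit
}
print(signature_outliers(fleet))  # ['unit-C']
```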
- Link events: up/down, retrain, lane deskew failures, module insert/remove timestamps.
- Error counters: FEC corrected/uncorrected, symbol errors, packet drops, congestion markers (as available).
- Host interface hints: PCIe AER/replay-related evidence (NIC-visible), correlated with throughput/latency dips.
- Thermal/power: max temperatures, throttling state transitions, rail droop events if monitored.
Reference “golden units” and test gear material numbers (examples)
These are concrete examples frequently used in labs and production lines. They are not requirements—only anchors for procurement and planning.
| Category | Example MPN / Model | Role in the checklist |
|---|---|---|
| Golden NIC (Ethernet) | Intel E810-CQDA2 | Known baseline for throughput/Mpps/latency comparisons and firmware regression checks (port configuration flexibility helps validation planning). |
| Golden NIC/HCA (common) | NVIDIA ConnectX-6 Dx OPN MCX623106AN-CDAT | Reference for counter vocabulary and field-proven stability patterns; useful when correlating FEC/BER, thermal, and link retrain signatures. |
| Golden HCA (OSFP class) | NVIDIA ConnectX-7 family examples: MCX75310AAS-HEAT, MCX713106AC-CEAT | Anchor for OSFP integration and higher-speed link stress patterns (training stability, power/thermal boundary conditions). |
| BERT (PHY bring-up) | Keysight M8040A (BERT) / Anritsu MP1900A | Used for link margin characterization, BER sensitivity vs. temperature/channel, and validation of “error-rate ↔ counter” correlation. |
| Traffic generator | Spirent TestCenter (example kits vary; procure by required port speeds) | Reproducible Mpps, mixed packet sizes, microburst stress, and tail-latency characterization under controlled conditions. |
FAQs (NIC / HCA): field symptoms → evidence → likely layer
These FAQs target long-tail searches and practical troubleshooting. Each answer stays within the NIC/HCA boundary and points to the most relevant section(s) for deeper mechanisms and counters.