
NVLink / High-Speed Interconnect Switch Explained


An NVLink / high-speed interconnect switch is the fabric node that connects many SerDes links, restoring signal margin with retiming/equalization and protecting system stability with reference-clock jitter conditioning and rich link telemetry.

The core of design-in success is measurable margin: a repeatable tuning flow (CTLE/FIR/DFE), a defensible jitter budget, and logs/counters that turn intermittent field issues into actionable root-cause buckets.

H2-1 · What it is & boundary

What an interconnect switch is—and what it is not

An NVLink-class interconnect switch is a multi-port SerDes switching node that routes high-speed lane groups between endpoints (typically GPUs/accelerators). It combines crosspoint switching (port-to-port mapping), a switch fabric (multi-port forwarding under load), and often signal-conditioning capabilities such as equalization and retiming. In practice, its success is measured by stable BER/CRC behavior across temperature, voltage, and traffic stress, backed by actionable counters and event logs.

The engineering boundary is defined by what must be solved at the topology level versus what can be solved on a single link:

Solve on a single link:
  • Single-link eye closure
  • Lane skew / bonding stability
  • Refclk jitter margin
Solve at the topology level:
  • Port mapping / isolation
  • Multi-hop latency predictability
  • Per-port RAS counters & logs

Out of scope by design: PCIe/CXL switching protocols (ACS/SR-IOV), Ethernet/InfiniBand stack behavior, NIC/DPU dataplane offloads, GPU card VRM/HBM power design, rack-level power/cooling infrastructure, and full BMC/Redfish system architecture. When those topics are needed, a short pointer/link is sufficient; detailed coverage belongs to the relevant sibling pages.

Output: practical boundary comparison

Component · What it fixes · What it won’t fix · Cost & “use when” triggers

  • Redriver
    Fixes: boosts/reshapes signaling to compensate moderate loss; provides basic equalization knobs.
    Won’t fix: cannot remove accumulated timing noise (no CDR); limited help on severe jitter/ISI; does not solve multi-port routing.
    Cost: low latency, lower complexity.
    Use when: the channel loss is manageable and the problem is amplitude/ISI, not clocking margin.
  • Retimer
    Fixes: re-clocks data with CDR; reduces jitter accumulation; improves the eye at the receiver; stabilizes long or noisy channels.
    Won’t fix: does not provide topology-level port mapping; cannot isolate traffic domains; cannot replace fabric-level forwarding.
    Cost: adds fixed latency, power/thermal cost.
    Use when: BER improves with re-clocking and failures correlate with jitter/phase-noise margin.
  • Interconnect switch
    Fixes: routes lane groups across multiple endpoints; enables isolation, reroute/disable policies, and switch-local observability (counters/logs); may also include equalization/retiming.
    Won’t fix: cannot “mask” a fundamentally broken channel without margin; cannot replace endpoint SerDes quality; does not handle protocol-stack performance tuning.
    Cost: more latency/power, system-integration effort.
    Use when: multi-endpoint routing, fault isolation, and verifiable RAS/telemetry are required, not just a cleaner eye.
Figure F1 — Boundary map: Channel → Retimer/Redriver → Switch
(Diagram: the channel's loss, crosstalk, reflections, and skew consume the jitter budget; a redriver/retimer applies equalization (TX FIR · CTLE · DFE) and retiming (CDR) at a latency/power/tuning cost; the interconnect switch adds port mapping, a multi-port fabric, and observability. Takeaway: use the switch when topology routing, isolation, and verifiable telemetry are required, not just a cleaner eye.)
H2-2 · Where it sits

Topology, ports, and lane groups: a practical placement model

In an interconnect domain, a “port” is best treated as a bonded lane group rather than a single wire. This matters because most real-world failures show up as one lane becoming the limiter (skew, loss, crosstalk, or margin collapse under temperature). A placement model that speaks in lane groups makes topology design, validation, and field-debug repeatable.

A useful abstraction is: endpoint ↔ (lane-group links) ↔ interconnect switch ↔ (lane-group links) ↔ endpoint. Without discussing any protocol stack, the key system behaviors can be predicted by three latency contributors:

  • SerDes pipeline latency (per port): baseline encode/decode and elastic buffering.
  • Retiming latency (optional): fixed delay added when CDR re-clocks the data path.
  • Fabric hop latency: forwarding delay that scales with hop count and internal contention.
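The three contributors above add along a path. A minimal sketch of the additive budget, with all class/field names and numbers invented for illustration (not vendor values):

```python
# Hypothetical latency-budget sketch for a multi-hop interconnect path.
# Field names and nanosecond values are illustrative, not vendor data.
from dataclasses import dataclass

@dataclass
class HopLatency:
    serdes_ns: float   # per-port SerDes pipeline (encode/decode + elastic buffer)
    retime_ns: float   # fixed CDR retiming delay (0.0 when retiming is off)
    fabric_ns: float   # forwarding delay for one fabric hop, no contention

def path_latency_ns(hops: list[HopLatency]) -> float:
    """Sum the three contributors over every hop on the path."""
    return sum(h.serdes_ns + h.retime_ns + h.fabric_ns for h in hops)

# Example: two-hop path with retiming enabled on the first hop only.
path = [HopLatency(25.0, 10.0, 40.0), HopLatency(25.0, 0.0, 40.0)]
total = path_latency_ns(path)   # 140.0 ns with these illustrative numbers
```

Contention adds a queueing term on top of this floor, which is why the section recommends recording min/p50/p99 rather than a single number.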

Port planning should optimize not only bandwidth but also maintainability: clear lane-group naming, predictable breakout rules, and an explicit plan for isolation (what happens when a single port or lane fails). This reduces “random” failures into observable, bounded cases.

Output: port planning checklist (interconnect domain)

  • Lane-group definition: fixed lanes-per-port, stable naming, and consistent polarity/ordering rules.
  • Breakout policy: defined breakout modes and the debug plan for worst-lane identification.
  • Skew control: deskew tolerance budgets for bonding; avoid mixing very different path lengths inside one group.
  • Clock domain clarity: which ports share a reference clock; where jitter cleaning sits; how skew is bounded.
  • Sideband intent: reset/health visibility and a minimal method to trigger training/loopback when needed.
  • Isolation paths: ability to disable a port/lane group and keep the rest of the domain stable.
  • Validation hooks: test access (PRBS/loopback), counters snapshot points, and a “worst-case matrix” plan.
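To make the checklist concrete, here is a hypothetical lane-group record; every field and port name is an assumption for illustration, not a vendor schema:

```python
# Sketch of a lane-group ("port") planning record following the checklist
# above. All names (SW0_P0, refclk_A, ...) are hypothetical.
from dataclasses import dataclass

@dataclass
class LaneGroup:
    name: str                # stable name with consistent ordering rules
    lanes: tuple[int, ...]   # fixed lanes-per-port, explicit lane order
    refclk_domain: str       # which reference clock this group shares
    skew_budget_ps: float    # deskew tolerance budget for bonding
    isolated: bool = False   # True once the group is fenced off

def isolate(group: LaneGroup) -> None:
    """Disable one group while the rest of the domain stays stable."""
    group.isolated = True

ports = [
    LaneGroup("SW0_P0", (0, 1, 2, 3), "refclk_A", 30.0),
    LaneGroup("SW0_P1", (4, 5, 6, 7), "refclk_A", 30.0),
]
isolate(ports[1])
active = [p.name for p in ports if not p.isolated]   # ["SW0_P0"]
```

Keeping isolation a first-class field in the plan is what turns a failed lane group into a bounded, observable case rather than a domain-wide outage.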

The placement model intentionally stays at the interconnect domain level. Anything that depends on PCIe/CXL or Ethernet/IB protocol behavior is excluded here and should be handled in its dedicated pages.

Figure F2 — Topology view: endpoints, lane groups, ref clock, sideband
(Diagram: four endpoints with lane-group ports A–H connect through the interconnect switch; a reference clock with a jitter-cleaner PLL and a sideband for reset/health sit alongside the latency contributors (SerDes · retime · hop). Takeaway: plan ports for bandwidth AND maintainability: naming, isolation, clock domains, and validation hooks.)
H2-3 · Key metrics that matter

Datasheet metrics that predict real stability

For an interconnect switch, “good on paper” is not enough. The practical goal is predictable latency and stable error behavior across stress (temperature, voltage, and traffic). The most useful metrics are the ones that can be measured, trended, and tied to a fail signature using switch-local counters and margin tests.

Spec traps to ignore (or demand proof for)

  • Peak bandwidth quoted without internal-contention assumptions.
  • “Typical latency” without retiming mode and hop count.
  • “Supports PAM4” without margining methods.
  • “Low jitter” without test conditions and reference-clock assumptions.

Output: Metric → Engineering meaning → How to measure

Metric (what to check) · Engineering meaning (why it matters) · How to measure / prove

  • Ports, lanes/port, aggregate bandwidth (Capacity)
    Meaning: determines topology scale and whether lane groups can be mapped without awkward breakouts. Weak capacity often forces extra hops, increasing latency variance and margin loss.
    Measure: validate with a topology model: lane-group map, hop count, and “worst-case” mapping. Require a clear porting diagram and supported lane-group modes.
  • SerDes rate mode (NRZ/PAM4) (PHY)
    Meaning: indicates signaling style and sensitivity to channel loss and jitter margin. PAM4 typically demands stronger equalization and more disciplined margining to avoid “runs but unstable.”
    Measure: prove with margining results (eye height/width or equivalent) at the target data rate and channel condition (loss/XT). Demand corner coverage, not only typical.
  • Latency components (Determinism)
    Meaning: latency is not one number: SerDes pipeline + optional retime fixed delay + fabric hop/queue. This predicts tail behavior and multi-hop predictability.
    Measure: request a latency breakdown by mode (retime on/off) and by hop. Measure with controlled traffic patterns and a hop sweep; record min/p50/p99 under thermal stress.
  • Equalization knobs (Margin control)
    Meaning: tunable TX FIR, CTLE, and DFE determine whether the channel can be pulled back from eye closure without overfitting noise. More knobs are useful only if they are observable and repeatable.
    Measure: use PRBS/BERT or built-in margin tests to map “knob sweep → margin change.” Require saved profiles per port and a method to export settings + results.
  • Lane margining support (Proof)
    Meaning: margining turns “it works” into “it has headroom.” It separates marginal designs from robust ones and supports fast binning in production.
    Measure: run a margin sweep per lane group across corners (temp/voltage). Capture the worst-lane distribution and a pass/fail threshold tied to field risk.
  • Error counters (Observability)
    Meaning: CRC trends, deskew events, CDR lock events, and retry/training events provide the earliest signal that margin is collapsing, often before a hard failure.
    Measure: verify counter coverage per port and counter reset semantics. Trend counters against temperature and traffic; require event timestamps or ordered snapshots.
  • Switch-local thermal & rail alerts (Operate)
    Meaning: many “random” failures are thermal or rail-noise correlated. Switch-local alarms enable correlation without depending on external systems.
    Measure: confirm alert thresholds, hysteresis behavior, and log visibility. Heat-soak tests: correlate error slope with temperature and alert states.
  • RAS features (Reliability)
    Meaning: lane repair, port isolation, and link downgrade policies prevent a single weak lane from cascading into full-domain instability and reduce MTTR in the field.
    Measure: fault-inject with worst-lane conditions (margin squeeze) and verify isolation behavior. Require logs that prove why a downgrade/isolation happened.

The best “real metrics” are the ones that can be closed into a loop: measure → trend → correlate → act. If a spec cannot be measured in the intended environment, it should not be used as the primary selection driver.
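One way to sketch the measure → trend → correlate → act loop is a least-squares slope of an error counter against temperature; the counter values and the action threshold below are invented for illustration:

```python
# Minimal "measure -> trend -> correlate -> act" sketch: ordinary
# least-squares slope of CRC-error counts vs temperature. All numbers
# and the trigger threshold are illustrative, not recommended values.
def slope(xs: list[float], ys: list[float]) -> float:
    """Least-squares slope of ys over xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den

temps_c  = [40.0, 50.0, 60.0, 70.0]   # measured per sampling window
crc_errs = [2.0, 3.0, 9.0, 20.0]      # CRC errors per window
err_slope = slope(temps_c, crc_errs)  # errors per degree C (0.6 here)

ACT_THRESHOLD = 0.3                   # illustrative trigger
action = "investigate margin" if err_slope > ACT_THRESHOLD else "keep trending"
```

A rising slope is the "trend" signal the text describes; the "correlate" step would then align it with temperature zones and rail alerts before acting.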

Figure F3 — Metrics map: capacity, determinism, margin, and proof
(Diagram: metric groups Capacity (ports, lanes/port, aggregate bandwidth), Determinism (SerDes latency, retime delay, fabric hop/queue), Margin control (TX FIR, CTLE/DFE, lane margining), and Operate & prove (counters, events, RAS actions) feeding a “stable · predictable · debuggable” outcome. Takeaway: favor metrics that can be measured and trended: margining + counters + latency breakdown.)
H2-4 · Inside the box

Data path anatomy: where switching differs from “a bigger retimer”

A practical internal view is a data-path stack: ingress SerDes conditioning builds a clean lane stream; lane bonding/deskew forms a stable lane group (“port”); a crosspoint or fabric maps ingress ports to egress ports; and the egress SerDes drives the channel. Optional retiming points trade fixed latency for jitter cleanup.

Switching is fundamentally different from retiming because it introduces topology control: port mapping, isolation, and fault containment. Those features must be paired with switch-local counters and events; otherwise failures look random and cannot be proven robust.

Output: error-injection points (what becomes sensitive where)

  • Ingress PHY: EQ overfit can hide margin loss until temperature shifts; CDR lock margin collapses under refclk noise.
  • Bonding/deskew: one “worst lane” dominates; skew drift triggers deskew events and error bursts.
  • Fabric/crosspoint: internal contention creates latency variance; hot spots couple into timing margin if thermals rise.
  • Egress PHY: output jitter/ISI sensitivity depends on final EQ profile and channel variation.
  • Monitor taps: poorly placed counters can show “clean” while the real weak lane is failing.

Architecture discussion stays at the SerDes/crosspoint/fabric level. Protocol-level behaviors and endpoint architectures are excluded and should be treated as separate topics.

Figure F4 — Chip anatomy: PHY, bonding, fabric, retime points, monitors
(Diagram: ingress PHY (CTLE, DFE, TX FIR) → bond/deskew (lane group, worst lane) → crosspoint/fabric (port map, contention, hop-latency variance) → egress PHY (drive + EQ, output jitter), with optional retiming (CDR, fixed latency) and monitor taps (counters, margining, events) marked as sensitive points. Takeaway: switching adds topology control (mapping + isolation) and must be paired with counters and events.)
H2-5 · Retiming & equalization

Channel budget, EQ boundaries, and a repeatable tuning SOP

The fastest way to stabilize a high-speed interconnect is to treat the link as a channel budget problem, not a “knob-twiddling” problem. Channel impairments collapse eye margin through distinct mechanisms, and each EQ tool has a clear boundary: CTLE shapes the receive spectrum, TX FIR pre-emphasizes to counter loss, DFE targets post-cursor ISI, and retiming (CDR) trades fixed latency for timing cleanup.

Channel model → eye / BER impact

Insertion loss reduces high-frequency energy and increases ISI (eye closure in width/height).
Return loss creates reflections that produce “patterned” distortion and unstable convergence.
Crosstalk injects noise that can look like ISI; aggressive DFE may amplify errors.
Group delay ripple distorts symbol timing across frequency, causing non-intuitive failures under corners.

EQ toolbox: boundaries and tradeoffs

Tool · Primary job · Typical side effects / limits

  • TX FIR: counters insertion loss by shaping the transmit spectrum and reducing ISI at the receiver. Limits: can increase sensitivity to coupling/XT; poor profiles cause overshoot and mask real noise.
  • CTLE: boosts high-frequency components at the receiver to reopen the eye under lossy channels. Limits: also boosts noise; too much CTLE reduces SNR and makes DFE decisions unstable.
  • DFE: cancels post-cursor ISI with decision feedback when linear EQ is insufficient. Limits: can misinterpret noise/crosstalk as ISI and amplify error bursts; must be bounded.
  • CDR / retiming: improves timing stability by re-establishing the sampling phase; reduces accumulated jitter sensitivity. Limits: adds fixed latency and can create mode-dependent determinism risks; requires proof under corners.

Output: a copyable tuning SOP (steps + record fields)

  • Step 0 — Lock the experiment (baseline)

    Fix data rate and training mode. Snapshot per-port counters (CRC/deskew/CDR events) and an initial margin readout. Record: rate/mode, profile ID, ambient/board temperature, rail state.

  • Step 1 — Find the worst lane (do not average)

    Run PRBS/BERT (or equivalent) and lane margining to rank lanes. Treat the “worst lane” as the governing constraint for the whole lane group. Record: worst-lane ID, margin curve key points, event-rate slope vs temperature.

  • Step 2 — Converge in a disciplined order: CTLE → TX FIR → DFE

    Adjust one dimension at a time. First stabilize the receive spectrum (CTLE), then shape TX (FIR), then use bounded DFE only if needed. Stop when margin improves monotonically without counter spikes. Record: knob values, pass/fail points, counter deltas per change.

  • Step 3 — Decide on retiming using a threshold, not preference

    Enable retiming when margining shows timing headroom is insufficient or error slopes rise sharply with temperature/voltage. Record fixed latency impact and confirm determinism across modes.

  • Step 4 — Prove headroom (margining across corners)

    Build a corner matrix (temperature, voltage, traffic stress). Require a minimum residual margin and stable counters (no event bursts). Store final per-port profiles and export the proof artifacts.

Common pitfalls: over-DFE can amplify noise and create burst errors; “works at room temp” can fail at hot/cold due to drift; tuning against an average lane hides the true limiter.
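Step 2's one-dimension-at-a-time order can be sketched as a greedy per-knob sweep against a margin readout. The `mock_margin` surface below is a stand-in; a real flow would call the device's margining interface, which this sketch only assumes exists:

```python
# Sketch of the Step-2 convergence order (CTLE -> TX FIR -> bounded DFE),
# one dimension at a time against a margin readout. The margin model is
# a hypothetical stand-in for real per-lane margining results.
def converge(measure_margin, ctle_range, fir_range, dfe_range):
    """Greedy per-dimension sweep in the disciplined order above."""
    best = {"ctle": ctle_range[0], "fir": fir_range[0], "dfe": dfe_range[0]}
    for knob, values in (("ctle", ctle_range),
                         ("fir", fir_range),
                         ("dfe", dfe_range)):
        # adjust one knob at a time, keeping the others fixed
        best[knob] = max(values, key=lambda v: measure_margin({**best, knob: v}))
    return best

def mock_margin(s):
    # Illustrative separable margin surface peaking at ctle=2, fir=1, dfe=1.
    return 10 - (s["ctle"] - 2) ** 2 - (s["fir"] - 1) ** 2 - (s["dfe"] - 1) ** 2

best = converge(mock_margin,
                ctle_range=[0, 1, 2, 3],
                fir_range=[0, 1, 2],
                dfe_range=[0, 1, 2])
# best == {"ctle": 2, "fir": 1, "dfe": 1} for this separable mock surface
```

A real margin surface is not separable, which is exactly why the SOP also requires recording counter deltas per change and stopping when margin stops improving monotonically.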
Figure F5 — Channel budget waterfall: loss/noise/jitter vs recovered eye margin
(Diagram: the ideal eye opening reduced by insertion loss, return loss, crosstalk, and timing jitter, then partially recovered by TX FIR, CTLE, bounded DFE, and retiming (CDR), leaving a residual margin checked against the pass threshold. Takeaway: tune by workflow: worst-lane → CTLE → FIR → bounded DFE → margin proof across corners.)
H2-6 · Reference clock & jitter cleaners

Why reference-clock jitter is a make-or-break line

In high-speed SerDes links, the reference clock is not just a “frequency source.” Its phase noise and distribution noise shape the timing uncertainty seen by the sampling system. When timing headroom becomes small, links can look acceptable by frequency tolerance yet still show elevated error rates, training instability, or temperature-dependent dropouts.

Concept chain: phase noise → integrated jitter → BER risk

What changes · What it does in SerDes · What it looks like in the field

  • Refclk phase noise: reduces effective timing margin through the CDR/PLL path and increases sampling uncertainty. Field signature: BER slope rises with temperature; more CDR lock/training events before hard failures.
  • Distribution noise (fanout / coupling): injects additional jitter after the source; port-to-port sensitivity becomes location-dependent. Field signature: some ports are consistently weaker; failures correlate with certain load/thermal states.
  • Skew / isolation issues: create lane-group instability and reduce the ability to deskew/hold alignment under stress. Field signature: deskew events spike; the link “flaps” only in specific corners.

Output: jitter budget template (fill-in fields)

Inputs

Refclk source (measurement or vendor curve) · distribution nodes (fanout, routing segments) · cleaner mode (if used) · operating corners (temp/voltage).

Process

Convert “noise description” into integrated timing risk (conceptually: phase noise → integrated jitter → RJ/DJ behavior), then correlate with margining and switch-local events/counters.

Outputs

Residual timing margin vs pass threshold · per-port sensitivity map · event-rate trend (CDR/deskew/training) · corner matrix result.
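The “phase noise → integrated jitter” step can be sketched numerically: linearize the single-sideband phase-noise curve, integrate it over the offset band, and convert to RMS seconds. The L(f) points below are illustrative, not from any datasheet:

```python
# Conceptual SSB phase-noise -> RMS jitter conversion:
#   sigma_t = sqrt(2 * integral(10^(L(f)/10) df)) / (2*pi*f_carrier)
# Offsets and the flat -150 dBc/Hz floor are illustrative values only.
import math

def rms_jitter_s(f_carrier_hz, offsets_hz, l_dbc_hz):
    """Integrate L(f) (dBc/Hz) over the offset band into RMS jitter (s)."""
    lin = [10 ** (l / 10.0) for l in l_dbc_hz]
    # trapezoidal integration over the offset band
    area = sum((lin[i] + lin[i + 1]) / 2.0 * (offsets_hz[i + 1] - offsets_hz[i])
               for i in range(len(offsets_hz) - 1))
    return math.sqrt(2.0 * area) / (2.0 * math.pi * f_carrier_hz)

# Illustrative 156.25 MHz refclk, flat -150 dBc/Hz floor, 10 kHz..20 MHz band
offsets = [1e4, 1e5, 1e6, 1e7, 2e7]
noise   = [-150.0] * len(offsets)
jitter_fs = rms_jitter_s(156.25e6, offsets, noise) * 1e15   # ~200 fs RMS
```

In practice the curve is not flat and the integration band must match the downstream PLL/CDR transfer characteristics, which is why the process above correlates the number with margining rather than treating it as a pass/fail by itself.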

When a jitter cleaner is justified (practical criteria)

  • Event correlation: CDR/deskew/training events rise sharply with temperature or operating mode.
  • Timing-direction margin deficit: margining indicates timing headroom is the limiting axis even when amplitude looks acceptable.
  • Location dependence: a subset of ports fail earlier, consistent with clock-tree injection points.
A cleaner is not automatically beneficial. Placement and loop bandwidth shape what noise is rejected vs passed through. If fanout coupling or return-path noise dominates, “adding a cleaner” can mask the real injection point and still fail under corners.
Figure F6 — Refclk tree and cleaner placement: noise injection points to ports
(Diagram: refclk source (phase noise) → jitter-cleaner PLL (loop bandwidth) → fanout (coupling noise) → ports A–G, with source phase noise, fanout noise, and skew/isolation marked as injection points. Takeaway: diagnose by correlation: margining + switch-local events vs temperature and operating modes.)
H2-7 · Power, package, and thermal

Environment-driven drift: why links pass cold and fail hot

Interconnect stability is often limited by environment-driven drift. Temperature rise, coupling noise, and board-level return-path discontinuities can reduce timing and equalization headroom even when frequency tolerance appears acceptable. This chapter focuses only on factors that directly perturb SerDes PHY and PLL/clocking behavior—without expanding into VRM design.

Only the rails that matter (PHY / PLL cleanliness)

Sensitive rails: why “cleanliness” matters

PHY rail noise can translate into eye degradation and higher BER sensitivity.
PLL/clock rail noise can increase timing uncertainty, triggering deskew/CDR events.
Coupling paths are often board-level: return-path detours, plane splits, and shared noisy reference regions.

Decoupling principles (interconnect-domain only)

Keep local bypass close to the sensitive block, preserve a short and continuous return path, and avoid routing that forces the return current to cross discontinuities near SerDes/PLL regions.

Thermal hotspots and stability drift

SerDes banks and PLL regions can form hotspots. As temperature increases, equalization effectiveness can drift and timing headroom can shrink. A practical symptom is a rising slope of error or training/deskew events versus temperature, followed by link flaps or dropouts. Thermal throttling can further change activity patterns and noise coupling, producing second-order stability shifts that appear “random” unless correlated with telemetry.

Package & board-level contributors (focused on return-path continuity)

  • Reference plane continuity: discontinuities can force return currents to detour, increasing coupling into sensitive zones.
  • Return-path control: the interconnect domain should avoid unintentional shared return segments with noisy regions.
  • Local isolation: keep clocking/PLL neighborhoods protected from adjacent switching noise injection points.

Output: thermal–SI linked checklist (what to verify and correlate)

Check item · How to measure / observe · Decision signal

  • Temperature points (die hotspot / SerDes zone / PLL zone). Observe: switch-local sensors (if available) and board sensors closest to SerDes/PLL neighborhoods. Decision signal: event rates change sharply across a temperature band; failures repeat at specific temperatures.
  • Event correlation (deskew / CDR lock / training). Observe: trend counters versus temperature and operating mode (rate/profile/retime state). Decision signal: stable at cold, then sudden increases at hot; “port-local” sensitivity emerges.
  • EQ drift sensitivity. Observe: compare margining or eye metrics before/after thermal soak using the same profile snapshot. Decision signal: residual margin collapses at hot even though the profile is unchanged.
  • Rail alert association (PHY/PLL). Observe: correlate rail alerts (switch-local) with event bursts and margin drops. Decision signal: alerts align with error spikes; mitigation must focus on the injection path, not on average readings.
  • Threshold strategy (with hysteresis). Observe: define trigger thresholds on event slopes and temperature bands; log pre/post snapshots. Decision signal: actions occur before dropouts: degrade/isolate/retrain with evidence retained for root cause.
Field signature focus: “cold OK, hot fails” is rarely random. It is typically a correlated interaction between thermal drift, coupling paths, and reduced timing headroom in SerDes/PLL neighborhoods.
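The hysteresis part of the threshold strategy can be sketched as a simple trip/clear state machine; the threshold values are illustrative:

```python
# Sketch of a hysteresis trigger for event-slope thresholds: trip when
# the slope crosses a high threshold, re-arm only after it falls below
# a lower one, so the action does not chatter. Values are illustrative.
class HysteresisTrigger:
    def __init__(self, trip: float, clear: float):
        assert clear < trip, "clear level must sit below the trip level"
        self.trip, self.clear = trip, clear
        self.active = False

    def update(self, event_slope: float) -> bool:
        """Return True while mitigation (degrade/isolate/retrain) should hold."""
        if not self.active and event_slope >= self.trip:
            self.active = True
        elif self.active and event_slope <= self.clear:
            self.active = False
        return self.active

trig = HysteresisTrigger(trip=1.0, clear=0.4)
states = [trig.update(s) for s in (0.2, 1.2, 0.7, 0.3, 0.5)]
# -> [False, True, True, False, False]: no chattering around one threshold
```

The gap between trip and clear is what lets the action fire once, hold through noisy readings, and release cleanly, with pre/post snapshots logged at both transitions.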
Figure F7 — Thermal hotspots and sensitive blocks (drift path to link instability)
(Diagram: SerDes banks and PLL/clocking regions shown as hot zones on the package next to the fabric, with sensitive PHY/PLL rails and the drift chain: temp ↑ → EQ drift / jitter ↑ → margin ↓ → errors ↑. Takeaway: use telemetry correlation: temperature zones + event slopes + margin proof to avoid “random” hot failures.)
H2-8 · Management & telemetry (switch-local)

Observability: the ability to see degradation before it becomes a dropout

High-speed interconnects are maintainable only when degradation is observable. The goal is not to expose full system-management stacks, but to ensure the switch has switch-local telemetry and logging that can separate gradual link-quality decline from transient external causes.

Management interfaces (existence only)

Switches commonly expose configuration and readout paths through sideband-style interfaces such as I²C/SMBus/MDIO classes. These channels allow reading counters, margining results, and local temperature/rail alerts. System management layers are intentionally out of scope.

Telemetry tiers: what to watch

Tier · Metrics (examples) · Why it matters

  • Link health. Metrics: CRC/error counters, deskew events, CDR lock/loss, training events. Why: shows whether the link is weakening (trend) or experiencing bursts (transient).
  • Margin proof. Metrics: lane margining, eye height/width (or equivalent), worst-lane identification. Why: separates “works” from “has headroom” and identifies the governing lane.
  • Environment correlation. Metrics: switch-local temperature zones, PHY/PLL rail alerts, rate/mode/profile IDs. Why: explains why failures cluster at hot/certain modes and enables deterministic reproduction.

Sampling strategy (trend + trigger)

Periodic sampling (baseline trend)

Sample counters and temperature zones at a steady cadence to detect gradual degradation. Trend slopes are more informative than single-point snapshots.

Triggered sampling (capture evidence)

On mode changes (rate/profile/retime toggles) or event spikes (deskew/CDR bursts), take an immediate full snapshot (cfg + env + counters + margin).

Burst window (around failures)

When link flaps or drops, enable a short burst window to collect dense pre/post evidence for reproducibility and root-cause correlation.
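The trend + trigger part of this strategy can be sketched as a snapshot predicate over successive samples; all field names are assumptions for illustration:

```python
# Sketch of the "triggered sampling" rule above: take a full snapshot on
# a mode change (profile/retime toggle) or on a counter burst. Field
# names (profile_id, retime_on, crc_errors) are illustrative only.
def should_snapshot(prev: dict, cur: dict, burst_delta: int = 50) -> bool:
    """Full snapshot on mode change or on a counter burst between samples."""
    mode_changed = (cur["profile_id"] != prev["profile_id"]
                    or cur["retime_on"] != prev["retime_on"])
    burst = cur["crc_errors"] - prev["crc_errors"] >= burst_delta
    return mode_changed or burst

prev   = {"profile_id": "P7", "retime_on": True, "crc_errors": 120}
steady = {"profile_id": "P7", "retime_on": True, "crc_errors": 123}
spike  = {"profile_id": "P7", "retime_on": True, "crc_errors": 400}
assert not should_snapshot(prev, steady)   # baseline trend only
assert should_snapshot(prev, spike)        # burst -> capture full evidence
```

Periodic samples feed the trend slope; this predicate decides when to escalate from a counter sample to the full cfg + env + counters + margin snapshot.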

Figure F8 — Observability dashboard: Port → counters/margin/env → decision & action
(Diagram: ports/lanes feed switch-local telemetry (counters: CRC · deskew · CDR events; margining: eye_h · eye_w · worst_lane; environment: temp zones · rail alerts; config snapshot: profile_id · retime state) into a “degrading vs transient” decision and retrain/degrade actions, with snapshot triggers on mode changes and event spikes. Takeaway: always log port/lane + profile + env + counters + margin for a reproducible diagnosis.)
H2-9 · Failure modes & field debug

Unstable links: triage by symptom tree (fast narrowing, evidence-driven)

A link can come up and still fail to run stably when headroom is marginal or when a trigger condition (temperature band, mode switch, jitter injection, or coupling path) pushes the channel across its limit. Field debug should follow a repeatable loop: localize (port / direction / condition), separate domains (loopback / PRBS), and prove the trigger using a minimal reproduction matrix.

First 10 minutes: localize before changing anything

1) Localize: port + direction + condition

Identify whether the issue is port-local, direction-specific, or tied to a specific rate/mode/profile.

2) Classify: creeping vs burst behavior

Rising slopes suggest shrinking margin; sudden bursts suggest a trigger (temperature, mode switch, jitter injection).

3) Check worst-lane stability

A fixed worst lane is often position-related; a moving worst lane is often condition-related.

4) Capture evidence

Take a snapshot of config + environment + counters + margining before applying any “fix.”

Symptom classes → what to watch → what to do next

Symptom class · Watch (switch-local) · Next action (fast narrowing)

  • Training fails / retrains. Watch: training events, deskew events, CDR lock/loss; link state transitions. Next: freeze the condition; run loopback/PRBS to separate channel vs clock/retime domain; log a before/after snapshot.
  • BER/CRC creeps upward. Watch: CRC/error slope, margin score trend; temperature zone trend. Next: run a short temperature sweep and compare margin proof; verify whether a single port group dominates the slope.
  • Worst lane is always the same. Watch: worst-lane ID, worst-lane margin; repeatability across re-trains. Next: swap path/cable if possible; keep the profile constant; confirm “position-related” behavior with a minimal matrix.
  • Only hot/cold triggers it. Watch: event bursts vs temperature band; rail alerts (PHY/PLL) if present. Next: thermal-soak at the trigger band; capture dense pre/post evidence; compare margining at identical profiles.
  • Only high load triggers it. Watch: event bursts aligned with activity transitions; counters and margin shifts. Next: hold the rate constant; test activity step changes while logging; look for condition-correlated loss of headroom.

Diagnosis loop: counters → domain split → conclusion

The most reliable triage flow is evidence-first. Use counters to localize the failing port group and direction, then use a loopback/PRBS-style separation step to decide whether the dominant contributor is channel/equalization margin, clock/jitter headroom, or a trigger condition such as temperature.

Top 5 pitfalls (field signatures)

  • Refclk quality / injection
  • Lane order / deskew
  • EQ overfit
  • Broken return path
  • Thermal hotspot drift

Output: minimal reproduction matrix (temperature × rate/mode × port × path)

Use a small matrix to prove triggers with minimal combinations. Each cell should record: pass/fail, profile_id, temperature zones, counters snapshot, and worst-lane + margin. This turns “intermittent” into “reproducible.”

  • Temperature: Cold / Ambient / Hot
  • Rate/Mode: Mode A / Mode B (+ retime on/off)
  • Port group: Group 1 / Group 2
  • Path/Cable: Path A / Path B
  • Record (evidence) per cell: pass/fail + profile_id + env + counters + margin + worst_lane
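Enumerating the matrix is mechanical; a sketch over the axes above, where each cell carries the evidence fields from the “Record” column:

```python
# Sketch of the minimal reproduction matrix: enumerate every axis
# combination, then prune to the suspected trigger band. Axis labels
# mirror the table above; the record fields are the evidence to capture.
from itertools import product

temps  = ["cold", "ambient", "hot"]
modes  = ["modeA", "modeA+retime", "modeB"]
groups = ["group1", "group2"]
paths  = ["pathA", "pathB"]

matrix = [
    {"temp": t, "mode": m, "group": g, "path": p,
     "record": ["pass_fail", "profile_id", "temp_zones",
                "counters_snapshot", "worst_lane", "margin"]}
    for t, m, g, p in product(temps, modes, groups, paths)
]
# 3 * 3 * 2 * 2 = 36 cells; run the suspected trigger band first.
```

Even a pruned subset of cells, each with a full evidence record, is enough to turn “intermittent” into “reproducible under cell X”.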
Figure F9 — Symptom decision tree: from “link flaps” to evidence-driven actions
(Diagram: “link flaps / unstable” branches into training loop, creeping errors, fixed worst lane, temperature-triggered, and load-triggered paths, each with the watch items and actions from the table above; top pitfalls flagged: refclk · deskew · EQ overfit · return path · thermal drift.)
H2-10 · Validation & production test

Proving delivery: lab margin → production screening → field evidence loop

“Done” requires traceability across three environments. Lab characterization must demonstrate margin under stress; production tests must screen edge cases quickly and preserve worst-lane traceability; field telemetry must provide evidence that can be reproduced back in the lab. The output is a closed loop: Lab defines headroom, Prod enforces gates, Field feeds failures back into cases and gates.

Lab: characterize headroom (not just “it runs”)

BERT / PRBS evidence

Quantify error behavior under controlled patterns and conditions, capturing counters and margin proof.

Eye / margin proof

Use eye or equivalent margin metrics to show headroom and identify the governing lane group.

Stress corners

Temperature and supply-corner stress plus coupling scenarios to surface the edge of stability.

Production: fast screening + worst-lane traceability

PRBS loopback / self-test

Use repeatable patterns with loopback to screen marginal links quickly and consistently.

Quick margining

Run a short margin check to catch “passes now, fails later” units before shipment.

Record for traceability

Always store profile_id, worst_lane, margin score, counters, and temperature zones for each port group.

Coverage binding: metrics → method → gate → recorded fields

Validation becomes actionable only when each key metric is bound to a test method, a pass/fail gate type, and required record fields. Gates are expressed by threshold types (margin ≥ threshold, event slope ≤ threshold), without locking to a single vendor value.

Metric / risk · Method · Gate type · Record fields

  • Margin headroom (worst-lane governs). Method: margining / eye-equivalent measurement. Gate: margin ≥ threshold. Record: profile_id, margin_score, worst_lane, worst_lane_margin, temp zones.
  • Stability (creeping vs burst). Method: PRBS/BERT run with trend logging. Gate: event slope ≤ threshold. Record: counters snapshot (CRC/deskew/CDR), timestamps, mode/rate.
  • Training robustness. Method: repeated bring-up cycles + stress corners. Gate: retrain count ≤ threshold. Record: training_events, link_state transitions, profile_id.
  • Temperature susceptibility. Method: thermal sweep / soak with identical profile. Gate: margin drop ≤ threshold. Record: temp zones, margin trend, counters slope, worst-lane stability.

Output: test-case checklist template (ready for lab + prod + field)

Case ID · Purpose · Setup · Method · Gate · Record

  • TC-01 · Worst-lane headroom proof. Setup: Mode A, fixed profile, ambient. Method: margining. Gate: margin ≥ thr. Record: profile_id + worst_lane + margin.
  • TC-02 · Training robustness. Setup: repeated bring-up cycles. Method: bring-up + logs. Gate: retrain ≤ thr. Record: training_events + link_state.
  • TC-03 · Thermal susceptibility. Setup: cold/hot soak, fixed profile. Method: sweep + trend. Gate: drop ≤ thr. Record: temp + margin trend + counters.
  • TC-04 · Production quick screen. Setup: prod-line fixture, standard mode. Method: PRBS loopback. Gate: errors ≤ thr. Record: counters + timestamp + profile_id.
Figure F9 — Closed-loop validation: Lab → Production → Field logs → back to Lab
(Diagram: LAB (BERT/PRBS, eye/margin, stress corners) → PRODUCTION (PRBS loopback, quick margin, pass/fail gates) → FIELD (telemetry, logs & snapshots, worst-lane proof) → closed loop back to LAB: field failure → reproduce → update lab cases → update prod gates. Evidence continuity: profile_id + env + counters + margin + worst_lane across Lab, Prod, and Field.)
H2-11 · IC selection & design-in checklist

IC Selection & Design-In Checklist (with MPN examples)

This chapter converts “what matters” (metrics, SI/clock margin, telemetry, validation) into a purchase-ready checklist: what to ask before committing, what to lock down during design-in, and what to require as evidence for production readiness.

Procurement reality: NVLink/NVLink Switch silicon is commonly sourced as part of a platform / OEM solution path, not as a simple catalog MPN. Plan the supply path and evidence package early (reports + tools + reproducible configuration profiles).

A) Selection dimensions — shortlist axes (what to require as evidence)

“Good-looking datasheet numbers” are not enough. The selection axes below are framed as capability → engineering meaning → evidence type. The MPN list is provided for supporting clock/power building blocks that are typically orderable and must be aligned with the switch’s requirements.

Axis | What it really controls | Orderable MPN examples (non-exhaustive)
Ports & lane groups | Lane bonding/breakout constraints, port remap limits, worst-lane behavior under temperature and load. Require a clear “unsupported mapping” list + verified channel envelope. | Platform-sourced silicon — NVLink Switch is typically obtained via an OEM/platform path (confirm supply + support channel in RFQ).
Retiming modes | Fixed latency cost vs stability gain. Require a mode matrix: which paths retime, how latency classes differ, and how jitter transfer behaves across modes. | Clock cleaners: Si5345 / Si5341, LMK04832, HMC7044, 8V19N850, ZL30273 (as reference-clock conditioning building blocks).
Equalization knobs | Whether TX FIR / CTLE / DFE are controllable, repeatable, exportable. Require: ranges + step sizes + default profiles + an “export/import profile” mechanism. | Jitter attenuators: Si5345, LMK04832, HMC7044, ZL30273; fanout: ADCLK948, LMK1C1104 (clock distribution helpers; choose by I/O standard needs).
Reference clock requirements | “ppm OK” ≠ “phase noise OK”. Require the measurement method (integration band, units) and a decision rule for when a cleaner is required. | Si5345, Si5341, LMK04832, HMC7044, 8V19N850, ZL30273.
Telemetry / RAS | Ability to “see degradation”: margining, CDR/training status, error counters, worst-lane flags, port isolate, lane repair. Require a counter dictionary + event/log export format. | Evidence-driven — the MPN matters less than tooling, counter definitions, log export, and reproducible profiles.
Clean rails for PHY/PLL | Noise-sensitive rails shift jitter margin and EQ behavior with temperature/load. Require PSRR/noise targets and a layout guideline for the rail’s “quiet zone”. | LDOs: TPS7A94, TPS7A88, ADM7150, LT3045.
Package & routability | Whether return paths and reference planes can remain continuous across dense escape routing. Require stackup guidance + keepouts for sensitive clock/SerDes zones. | Layout constraint pack — ask for ball map + breakout guidance + SI channel rules + reference design notes.
Ecosystem & support | Margining tools, scripts, register/profile workflows, version compatibility rules. Require an evidence package: reports + tool versions + reproducible recipes. | EVM/Tools — prefer solutions with evaluation kits & documented automation paths (tool/SDK version pinned in BOM).

MPN notes: clock/fanout/LDO examples above are orderable components frequently used to meet refclk and “clean rail” requirements. Final selection must follow the switch’s specific I/O standards, jitter transfer needs, and power/thermal envelope.

B) Design-in checklist — make it controllable and traceable

The design-in goal is not “link comes up once”, but “margin is measurable, profiles are reproducible, and failures are diagnosable”.

1) Schematic hooks (interconnect domain only)

  • Refclk injection path: defined entry point, optional cleaner placement footprint, and a measurement-friendly node (test header/connector).
  • Clock distribution: controlled fanout / buffering plan (e.g., ADCLK948 or LMK1C1104 class, depending on required I/O standards).
  • Sideband visibility: ensure required access for reading counters, margining metrics, and event logs (protocol specifics remain out of scope).

2) Layout / channel hygiene (what prevents “hot OK, cold fail”)

  • Return path continuity: avoid reference plane splits under critical SerDes/clock routes and fanout branches.
  • Clock quiet zone: keep noisy aggressors away from cleaner/fanout + SerDes PLL region; prioritize short, shielded, consistent-impedance routes.
  • Thermal correlation: place thermal sensors where the SerDes/PLL hotspots actually live; align logging with those sensors.

3) Test hooks (minimum viable bring-up + production screening)

  • PRBS/loopback plan: define how a failing port/lane can be isolated without relying on external “good system state”.
  • Worst-lane capture: ensure the “worst lane” is identifiable and recorded under stress (temperature × rate × load).
  • Clock margin checks: preserve the ability to swap cleaner profiles and verify pass/fail deltas with the same test recipe.

4) Configuration management (non-negotiable for scaling)

Equalization / retiming / clock settings must be treated as a versioned artifact, not a one-off tuning session. The following “minimum record” prevents irreproducible builds:

profile_id: “EQ_NVX_112G_PAM4_A01”
device_stepping: “rev / stepping”
mode: “rate + lane_group + retime_mode”
eq_summary: “CTLE preset, TX FIR preset, DFE enable + caps”
refclk_chain: “source -> (cleaner MPN + config) -> (fanout MPN) -> port groups”
conditions: “temp range, supply range, cable/backplane class”
evidence: “margining snapshot IDs + counter baseline + stress duration”
tooling: “SDK/tool version pinned”

Example orderable building blocks for the refclk_chain: Si5345/Si5341, LMK04832, HMC7044, 8V19N850, ZL30273 (cleaners/attenuators); ADCLK948 or LMK1C1104 class (fanout/buffer); TPS7A94/TPS7A88, ADM7150, LT3045 (clean rails for PLL/SerDes).
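As a sketch of treating the minimum record as a versioned artifact, the fields can be canonicalized and hashed so any knob change yields a new, traceable version. Field values here are illustrative, mirroring the record above:

```python
import hashlib
import json

def freeze_profile(profile: dict) -> dict:
    # Canonical JSON (sorted keys, fixed separators) → stable hash, so any
    # change to the tuned settings produces a new, traceable artifact version.
    blob = json.dumps(profile, sort_keys=True, separators=(",", ":")).encode()
    return {**profile, "artifact_sha256": hashlib.sha256(blob).hexdigest()[:16]}

profile = {
    "profile_id": "EQ_NVX_112G_PAM4_A01",
    "mode": "112G PAM4 / 8-lane group / retime-on",
    "eq_summary": {"ctle": "P3", "tx_fir": "P2", "dfe": True},
    "refclk_chain": "XO -> cleaner (cfg C12) -> fanout -> port groups",
    "tooling": "SDK/tool version pinned",
}
frozen = freeze_profile(profile)
```

The hash makes “same profile_id, silently different knobs” impossible to ship unnoticed: two builds match only if every recorded field matches.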

C) RFQ must-ask questions (≤20) — require verifiable artifacts

Each question below is designed to force an evidence-backed answer (report, tool output, counter dictionary, or documented limitation), so selection does not collapse at thermal corners or in production.

  1. Supply path: Is the interconnect switch silicon available as a discrete MPN, or only via a platform/OEM solution? Provide supported procurement routes and lifecycle policy.
  2. Port/lane mapping limits: Provide a matrix of supported lane-grouping/breakout/remap constraints and known unsupported combinations.
  3. Channel envelope: Provide verified channel conditions (IL/RL/crosstalk classes) and the measurement method used.
  4. Retiming scope: Which paths truly retime? Provide retime mode list and fixed latency classes per mode.
  5. Jitter transfer: Provide jitter transfer / tolerance characterization under key modes and temperature corners.
  6. EQ control: Provide TX FIR / CTLE / DFE range, step sizes, and a documented “export/import profile” workflow.
  7. Lane margining: What margining metrics exist (eye height/width/score), and how can they be read programmatically?
  8. Worst-lane behavior: How is worst-lane detected and flagged? Provide a sample log/counter snapshot under stress.
  9. Training stability: Provide known causes for retrain/deskew churn and mitigation notes (temperature, voltage, connector variance).
  10. Refclk spec (method): Provide refclk phase-noise/jitter requirement including integration band and units.
  11. Cleaner decision rule: Provide a rule-of-thumb (with supporting evidence) for when a jitter cleaner is required and recommended placements.
  12. Clock chain options (orderable MPNs): List qualified/reference cleaners and fanout parts used in validated designs (e.g., Si5345/Si5341, LMK04832, HMC7044, 8V19N850, ZL30273; fanout such as ADCLK948 / LMK1C1104 class).
  13. Clean rails guidance (orderable MPNs): Provide rail noise/PSRR targets and known-good LDO examples for PLL/SerDes rails (e.g., TPS7A94/TPS7A88, ADM7150, LT3045 class).
  14. Telemetry dictionary: Provide a complete counter/state dictionary (names, meanings, reset behavior, overflow behavior).
  15. RAS: Does the device support lane repair and port isolate? Provide conditions, limits, and expected behavior.
  16. Event logs: What event logs are available, how are timestamps generated, and what is the export format?
  17. Validation bundle: Provide a recommended lab characterization plan (BERT/eye/jitter tolerance/thermal stress) and sample reports.
  18. Production recipe: Provide a production screening recipe (PRBS loopback + margin quick check) and pass/fail criteria guidance.
  19. Tooling & versioning: Provide the required SDK/tools, supported automation APIs, and a version-compatibility policy.
  20. Escalation artifacts: If field issues occur, what minimum dataset must be captured (counters + conditions + profiles) for root-cause turnaround?

D) BOM fields template — encode traceability (copy/paste)

These BOM fields separate “demo success” from “production scalable”. The intent is to pin the configuration and evidence chain, not just hardware.

Field | Meaning | Example value
switch_solution_path | How the interconnect switch is procured (discrete MPN vs platform/OEM); also pins the support channel. | “OEM platform module / partner SKU”
device_mpn / stepping | Exact material number + stepping/revision for all orderable supporting chips (clock/LDO/fanout). | LMK04832NKDT; ADCLK948BCPZ; TPS7A94…
ports / lane_grouping | Port count + lane-group plan + any remap assumptions. | “N ports; 8-lane groups; map vA”
supported_modes | Rate/mode list actually validated for this design. | “112G PAM4 class; mode set M1”
retime_mode / latency_class | Selected retiming behavior + the associated latency class. | “Retime-On; Latency-L2”
eq_profile_id | Versioned EQ/retime profile identifier (must be reproducible). | EQ_NVX_112G_PAM4_A01
eq_knob_summary | Human-readable summary of key knobs (not a full register dump). | “CTLE P3; FIR P2; DFE on”
refclk_chain_mpn | Refclk chain components with explicit MPNs and config IDs. | Si5345 + ADCLK948 + config C12
quiet_rail_mpn | Noise-sensitive rail regulator MPN(s) for PHY/PLL supply islands. | TPS7A94 (PLL); ADM7150 (RF/PLL)
telemetry_support | What is readable (margining, counters, thermal/rail alarms) + doc reference. | “Margin Y; Worst-lane Y; Dict v3”
validation_report_refs | Report identifiers for BERT/eye/jitter/thermal stress used for sign-off. | “BERT-RPT-07; TH-RPT-03”
production_test_recipe_id | Screening recipe version (loopback + margin quick check + thresholds). | “PROD_PRBS_MRG_R1”
tool_sdk_version | Tool/SDK version pinned so results remain reproducible across builds. | “SDK 1.8.2; tool 5.4”
Figure F10 — Selection → Design-in → Proof (evidence-first pipeline)
(Diagram: Requirements (ports · lane groups, rate · retime mode, refclk target, telemetry/RAS, channel envelope, thermal corner plan) → Shortlist (evidence required: exportable EQ knobs, readable margining, clock chain plan, RAS + logs) → Design-in (controllable + traceable: PRBS test hooks, thermal sensors, quiet PLL rails, profile versioning). Evidence pack sidebar: reports (BERT · eye · jitter), counters (worst-lane · margin · events · logs), pinned MPNs (Si5345 · LMK04832 · HMC7044 · ADCLK948 · TPS7A94 · ADM7150), reproducibility (profile_id · tool version · stress matrix).)

Keep the diagram “low text, high structure”: each box is a decision or artifact. This prevents mobile clutter while preserving the engineering logic.


H2-12 · FAQs (Field & Selection)

These FAQs are designed to capture long-tail searches and common field-debug questions while staying strictly inside this page’s scope: interconnect switching, retiming/equalization, reference-clock jitter, observability, validation, and design-in/RFQ evidence.

What is the practical boundary between a Retimer, a Redriver, and an Interconnect Switch?

A redriver mainly boosts and equalizes analog signals; a retimer recovers data with a CDR and re-times the stream; an interconnect switch adds fabric-level connectivity (multi-source/multi-destination), isolation, and port remapping. The switch solves topology and fault-domain problems that “bigger retimers” cannot, at the cost of power, latency, and validation complexity.

  • Use a redriver when loss is modest and topology stays 1:1.
  • Use a retimer when CDR retiming is required to restore margin.
  • Use a switch when many endpoints must be dynamically connected and isolated.
See: H2-1 (boundary table), H2-4 (fabric vs retimer data path).

Link training succeeds, but BER/CRC slowly increases over time—what are the most common root-cause buckets?

A “slow climb” typically means the link is running with thin margin that is being consumed by temperature drift, supply/PLL noise, or an overfit equalization profile that degrades across PVT. Another common cause is inadequate observability: counters are sampled too coarsely, so early warning signals (deskew events, CDR near-unlock, margin drops) are missed.

  • Pinpoint port/direction/condition first using counters and snapshots.
  • Correlate with temperature, data rate, and EQ profile changes.
  • Use a minimal reproduction matrix to separate environment vs topology.
See: H2-8 (telemetry & logging), H2-9 (symptom tree).
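The “slow climb” bucket can be separated from burst events automatically once counters are sampled at a fixed interval. The median-based burst heuristic below is an illustrative sketch, not a vendor rule:

```python
def classify_counter_trend(snapshots, burst_factor=10.0):
    """Label cumulative error-counter snapshots as 'stable', 'creeping', or 'burst'.

    snapshots: cumulative error counts sampled at a fixed interval.
    burst_factor: how far one interval's delta must exceed the typical
                  (lower-median) delta to count as a burst — illustrative.
    """
    deltas = [b - a for a, b in zip(snapshots, snapshots[1:])]
    if not any(deltas):
        return "stable"
    nonzero = sorted(d for d in deltas if d)
    typical = nonzero[(len(nonzero) - 1) // 2]      # lower median of nonzero deltas
    if max(deltas) > burst_factor * typical:
        return "burst"       # one interval dominates → event-driven (e.g. retrain hit)
    return "creeping"        # steady accumulation → margin being consumed by drift

print(classify_counter_trend([0, 0, 0, 0]))        # stable
print(classify_counter_trend([0, 2, 4, 6, 9]))     # creeping
print(classify_counter_trend([0, 1, 1, 1, 60]))    # burst
```

A “creeping” label points at thermal/PVT margin consumption; a “burst” label points at discrete events (retrains, deskew slips) and should be correlated with timestamps in the event log.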

Why can adding a “jitter cleaner” make the system less stable instead of better?

Jitter cleaning is not “always better.” Instability often comes from a poor injection point or a loop-bandwidth choice that either tracks reference noise (too wide) or reacts slowly to real disturbances (too narrow), creating wander, lock stress, or unexpected phase steps. Cleaner settings must match the link’s jitter tolerance and the system’s noise spectrum.

  • Verify loop bandwidth and holdover behavior against the target jitter budget.
  • Check where noise is injected (before/after fanout, across isolation boundaries).
  • Confirm the cleaner’s output format and level match downstream requirements.

Example jitter-cleaner families used in practice include devices like Si5345, LMK04832, and HMC7044 (as reference examples, not requirements).

See: H2-6 (refclk & cleaners).

Refclk ppm is “in spec” but links still flap—how should phase noise/jitter be measured and attributed?

PPM only describes frequency accuracy over long intervals; it does not guarantee low phase noise. The correct approach is to convert phase noise to integrated jitter over a defined band, then map that jitter to the link’s CDR and BER sensitivity. Attribution requires controlled tests: isolate the refclk contribution from channel/EQ effects by using repeatable stress conditions and consistent pass/fail criteria.

  • Fix the integration band and report the jitter metric consistently.
  • Correlate jitter margin with counters (deskew, CDR stress, BER slope).
  • Bind the measurement method to validation gates and logs.
See: H2-6 (jitter budgeting), H2-10 (validation gates).
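The phase-noise-to-jitter conversion is the standard single-sideband integration: RMS jitter = sqrt(2 · ∫10^(L(f)/10) df) / (2π · f_carrier). A minimal sketch with a trapezoidal integral and an illustrative flat noise floor (band edges and levels are examples, not a spec):

```python
import math

def integrated_rms_jitter(freqs_hz, l_dbc_hz, f_carrier_hz):
    """RMS jitter (seconds) from SSB phase noise L(f) over a defined band.

    freqs_hz: offset frequencies in ascending order (the integration band).
    l_dbc_hz: phase noise in dBc/Hz at each offset.
    Trapezoidal integration of the linearized noise; the factor of 2
    accounts for both sidebands.
    """
    lin = [10 ** (l / 10.0) for l in l_dbc_hz]
    area = sum((f2 - f1) * (a + b) / 2.0
               for f1, f2, a, b in zip(freqs_hz, freqs_hz[1:], lin, lin[1:]))
    phase_rms_rad = math.sqrt(2.0 * area)
    return phase_rms_rad / (2.0 * math.pi * f_carrier_hz)

# Illustrative: flat -130 dBc/Hz from 12 kHz to 20 MHz on a 156.25 MHz refclk
jit = integrated_rms_jitter([12e3, 20e6], [-130.0, -130.0], 156.25e6)
print(f"{jit * 1e15:.0f} fs RMS")
```

Fixing the integration band in code (rather than in a lab notebook) is what makes the reported jitter number comparable across vendors and across validation runs.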

What is the correct tuning order for TX FIR / CTLE / DFE, and how to avoid overfitting?

A stable methodology starts by locking data rate and baseline presets, then targeting the worst lane under a representative stress. Typically, converge CTLE first (undo channel tilt), then TX FIR (shape pre-emphasis), and use DFE last and sparingly. Overfitting happens when DFE “learns noise” or when a profile is validated only at a single corner.

  • Rate fixed → find worst lane → CTLE → FIR → minimal DFE.
  • Validate with margining across temperature/voltage corners.
  • Freeze and version-control the final profile for traceability.
See: H2-5 (retiming & EQ SOP).
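The tuning order can be expressed as a nested search that prefers minimal DFE. Here `dev` and its `set_*`/`read_margin` methods are hypothetical illustrative names, not a real vendor API — a sketch of the SOP, under the assumption that margin is programmatically readable:

```python
def tune_port(dev, port, rate, ctle_presets, fir_presets, max_dfe_taps=2):
    """Tuning-order sketch: lock rate, find worst lane, CTLE -> FIR -> minimal DFE.

    `dev` is a hypothetical device handle; method names are illustrative.
    """
    dev.set_rate(port, rate)                          # 1) fix the data rate first
    worst = min(dev.lanes(port), key=lambda l: dev.read_margin(port, l))
    best = {"margin": -1.0}
    for ctle in ctle_presets:                         # 2) CTLE: undo channel tilt
        dev.set_ctle(port, worst, ctle)
        for fir in fir_presets:                       # 3) TX FIR: shape pre-emphasis
            dev.set_tx_fir(port, worst, fir)
            for taps in range(max_dfe_taps + 1):      # 4) DFE last, and sparingly
                dev.set_dfe_taps(port, worst, taps)
                m = dev.read_margin(port, worst)
                # strict '>' keeps the fewest DFE taps at equal margin,
                # which avoids letting DFE "learn noise"
                if m > best["margin"]:
                    best = {"margin": m, "ctle": ctle, "fir": fir, "dfe_taps": taps}
    return best  # freeze + version-control this result as the profile
```

The search must then be re-validated at temperature/voltage corners before the winning combination is frozen; a profile that only wins at ambient is exactly the overfit case the FAQ warns about.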

Only a few lanes are always the worst—how can margining/loopback separate channel issues from a bad port?

The fastest separation is “location-coupled vs port-coupled.” Margining reveals whether the failure is eye height, eye width, or timing; loopback and controlled swaps test whether the weakness follows a physical route or a specific silicon lane/port. The goal is to reduce the problem to one variable before deeper EQ changes are attempted.

  • Use margining to characterize the weakness signature consistently.
  • Swap endpoints or mapping to see whether the weakness follows the path.
  • Record deskew events and per-lane counters for repeatable evidence.
See: H2-5 (margining), H2-9 (debug loop).

Links fail only when hot—how does temperature affect SerDes/PLL margin, and what should be logged?

Temperature changes can shift channel loss, alter equalization convergence, and degrade PLL/clock margin—turning a borderline eye into intermittent retraining, deskew stress, or BER slope changes. Robust logging must capture the condition, not just the outcome: per-port temperature, data rate, EQ profile, CDR lock indicators, and counter snapshots at consistent intervals.

  • Correlate failures with thermal zones and “time-at-temperature.”
  • Log per-port: rate, EQ profile ID, CDR/deskew status, BER/CRC counters.
  • Reproduce with a temperature × rate × port matrix before redesign.
See: H2-7 (thermal drift), H2-8 (logging fields), H2-9 (symptom tree).
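The temperature × rate × port matrix from the last bullet is a plain Cartesian product with a soak-time field, so “time-at-temperature” is exercised rather than just instantaneous temperature (values below are illustrative):

```python
from itertools import product

def repro_matrix(temps_c, rates, ports, soak_min=20):
    """Enumerate the minimal reproduction matrix for hot-only failures.

    Each entry is one controlled run; soak_min records the dwell time so
    thermal drift effects are captured, not just the setpoint.
    """
    return [
        {"temp_c": t, "rate": r, "port": p, "soak_min": soak_min}
        for t, r, p in product(temps_c, rates, ports)
    ]

runs = repro_matrix([25, 55, 85], ["100G", "112G"], ["P0", "P3"])
print(len(runs))  # 3 temps × 2 rates × 2 ports = 12 runs
```

Enumerating the matrix up front also bounds the cost of a redesign decision: if the failure reproduces only in one cell, the attribution is already half done.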

The same port behaves very differently across data rates—does this indicate EQ coverage or jitter margin limits?

Large rate sensitivity usually comes from either insufficient equalization range for the channel at a given Nyquist, or a refclk/PLL jitter margin that is rate-dependent. The correct approach is to tie “what changed” to measurable quantities: margining distribution, worst-lane shift, CDR stress flags, and BER slope. Treat it as an attribution problem, not a tuning guess.

  • Compare margining and worst-lane identity across rates.
  • Check CDR/deskew stability and counter behavior vs rate.
  • Validate with a fixed stress recipe and consistent pass/fail gates.
See: H2-3 (metrics→how to measure), H2-5 (EQ), H2-6 (jitter).

How to design a “minimal” production test that catches marginal ports without blowing up cycle time?

Minimal production testing should focus on the most discriminating conditions rather than exhaustive coverage. Use a short PRBS loopback, a margining quick-check, and a “worst-lane” screening rule per port group. The key is traceability: record just enough fields to connect a marginal result to a specific port/rate/temperature and to reproduce it in the lab.

  • Pick representative worst-case rates and channel classes.
  • Use fast margin checks and stop rules instead of long soak tests.
  • Store per-port summaries plus a few snapshots for escalation.
See: H2-10 (validation & production test mapping).

Without high-end lab gear, how can counters + PRBS quickly decide whether “link quality” is acceptable?

A practical field method is to establish a counter baseline, run a short PRBS/loopback window under controlled conditions, and compare the error slope against a known-good reference. Counters provide the “where and when,” while PRBS provides a fast stress. Decisions should follow a symptom tree: identify which port, direction, and condition triggers instability first.

  • Snapshot: rate, temperature, EQ profile, key counters per port.
  • Run PRBS/loopback for a fixed short duration and compare slopes.
  • Escalate only after reproducing with a minimal condition matrix.
See: H2-8 (telemetry), H2-9 (field debug tree).
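The slope-comparison decision can be sketched as follows; the 3× ratio limit is an illustrative heuristic for “meaningfully worse than the known-good reference”, not a vendor threshold:

```python
def field_verdict(ref_slope, dut_counts, window_s, ratio_limit=3.0):
    """Compare a DUT's short-PRBS error slope to a known-good reference.

    ref_slope: errors/second measured on a known-good unit, same recipe.
    dut_counts: (start, end) cumulative error counts over the PRBS window.
    ratio_limit: how much worse than reference is still acceptable
                 (illustrative heuristic).
    """
    dut_slope = (dut_counts[1] - dut_counts[0]) / window_s
    if dut_slope == 0:
        return "pass"
    if ref_slope == 0 or dut_slope / ref_slope > ratio_limit:
        return "escalate"   # reproduce with the minimal condition matrix first
    return "pass"

print(field_verdict(0.01, (100, 103), 300))   # 0.01/s vs 0.01/s reference → pass
print(field_verdict(0.01, (100, 160), 300))   # 0.2/s, 20× reference → escalate
```

Note the asymmetry: any errors on a zero-error reference escalate immediately, because there is no baseline slope to normalize against.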

During port/channel planning, what deskew and layout risks come from lane bonding and breakout?

Lane bonding and breakout increase the probability of lane-to-lane delay mismatch, deskew-window pressure, and reference-plane discontinuities that hurt return paths. They also complicate maintenance and debug because mapping changes can hide which physical path corresponds to a logical port group. Planning should treat “port = lane group” as a first-class constraint and document mapping explicitly.

  • Budget deskew: length, via count, and discontinuity symmetry across lanes.
  • Keep reference planes continuous across breakout regions.
  • Version-control the mapping between logical ports and physical routes.
See: H2-2 (topology & port groups), H2-4 (deskew in data path).
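The deskew budget in the first bullet starts from a simple conversion of trace-length mismatch into time skew. The effective permittivity is stackup-dependent; 3.6 below is an illustrative stripline value, not a rule:

```python
def lane_skew_ps(len_mm, ref_len_mm, eps_eff=3.6):
    """Convert trace-length mismatch to lane skew in picoseconds.

    Propagation velocity v = c / sqrt(eps_eff). eps_eff ~ 3.6 is a typical
    stripline effective permittivity — substitute your stackup's value.
    """
    c_mm_per_ps = 0.299792458              # speed of light in mm/ps
    v = c_mm_per_ps / (eps_eff ** 0.5)     # ~0.158 mm/ps for eps_eff = 3.6
    return (len_mm - ref_len_mm) / v

# A 3 mm breakout mismatch costs roughly 19 ps of the deskew window
print(round(lane_skew_ps(153.0, 150.0), 1))
```

Via count and discontinuity asymmetry add on top of this geometric term, which is why the bullet asks for symmetry across lanes, not just matched nominal lengths.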

What “provable” evidence should be requested from suppliers (margin, jitter, RAS, thermal drift) during selection?

Selection should be driven by evidence that can be reproduced: margining definitions and reports, jitter measurement methods and integration bands, RAS behaviors (degrade/repair/isolation) with limits, and temperature-corner stability data with clear test setups. The strongest RFQs bind each claim to a measurement, a pass/fail gate, and required log fields to support field forensics.

  • Ask for margining methodology, exported metrics, and corner conditions.
  • Ask for jitter/phase-noise test setup and the exact integration band.
  • Ask for RAS feature boundaries, event logs, and failure-mode handling.
See: H2-11 (RFQ + design-in checklist), H2-10 (validation mapping).

Tip: For mobile-friendly troubleshooting, each answer is written as a “mini closed loop”: what it means → what to check first → what evidence to log → which chapter contains the deeper method.

Figure F11 — FAQ map: symptom → tool → decision loop
(Diagram: symptoms (link flaps/retrains/training loops, BER/CRC climbs/slow margin loss, repeating worst lane: path vs port?, hot-only failures/PVT drift, rate sensitivity: EQ vs jitter?) → tools on this page (telemetry & logs: counters + snapshots; margining: eye width/height hints; EQ SOP: CTLE → FIR → DFE; jitter budget: phase noise → risk; decision tree: fast attribution) → outputs (root-cause bucket: channel / clock / EQ / thermal; repro matrix: temp × rate × port × path; pass/fail gate: validation + production rules; RFQ evidence list: provable claims only).)