Ref Clock & PCB Layout for Industrial Ethernet PHYs and TSN
The goal is measurable: reduce CRC errors, BER, and retrain events by removing plane cuts, stubby transitions, and power-noise coupling, then verify the result with fixed-window counters and repeatable PRBS/loopback tests.
Why Ref Clock & Layout Decides Link Stability
Many “random” CRC spikes and link flaps are not random: layout turns power/return/coupling errors into ref-clock phase noise and sampling jitter, shrinking margin until BER/CRC crosses the cliff.
Symptom map → first hypothesis → fastest evidence
- Evidence: correlation vs. load/temperature/EEE transitions/power events. First move: PRBS/loopback A/B, then inspect plane cuts + via transitions.
- Evidence: fails only at top rate, or only with certain partners/cables. First move: rate step-down comparison; check layer changes and reference-plane continuity.
- Evidence: errors cluster around sleep↔wake, not steady-state traffic. First move: disable EEE to isolate; then audit clock/power isolation and decap return loops.
- Evidence: drift grows with temperature or power-state changes. First move: verify tap-point placement and ref-clock environment (no PTP algorithm details here).
“Looks OK” traps that hide layout-driven failures
- Differential-waveform-only diagnosis: a clean differential waveform can still hide common-mode conversion and radiation if the return path is broken.
- Bench-only validation: short cables and quiet supplies mask EEE and peak-load transients where coupling is worst.
- Probe/fixture artifacts: ground leads and fixtures change loop area and parasitics, creating false confidence (or false alarms).
What this page solves (5 checkpoints)
- Clock noise sources & coupling: where phase noise/RJ enters the ref-clock chain.
- Diff-pair geometry control: impedance continuity, transitions, and symmetry.
- Length matching without myth: what to match, when serpentine becomes harmful.
- Return paths & planes: why plane cuts and detours create mode conversion and EMI.
- Isolation/partition: keepouts and “clock islands” to avoid noisy neighborhoods.
Pass/fail lens (use the same yardstick everywhere)
- BER/CRC rate: < X over Y minutes under worst-case traffic.
- Retrain/flap frequency: < X events per hour at top rate.
- EEE transitions: wake/sleep does not increase errors beyond X.
- Correlation proof: errors track (or do not track) temperature/power/load as expected.
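One way to apply the same yardstick everywhere is to derive every rate from counter deltas over a single fixed window. The sketch below is a minimal illustration under stated assumptions: the `WindowSample` field names and the thresholds passed to `passes` are hypothetical stand-ins for the X and Y placeholders, not values from any PHY datasheet.

```python
from dataclasses import dataclass


@dataclass
class WindowSample:
    """Counter deltas collected over one fixed window (hypothetical field names)."""
    bits_transferred: int   # bits moved during the window
    crc_errors: int         # CRC error counter delta
    retrain_events: int     # retrain/flap counter delta
    window_minutes: float   # window length (the Y placeholder)


def crc_per_1e9_bits(sample: WindowSample) -> float:
    """Normalize CRC errors to a fixed denominator of 1e9 bits."""
    if sample.bits_transferred <= 0:
        raise ValueError("window carried no traffic; rate is undefined")
    return sample.crc_errors * 1e9 / sample.bits_transferred


def retrains_per_hour(sample: WindowSample) -> float:
    """Normalize retrain/flap events to events per hour."""
    if sample.window_minutes <= 0:
        raise ValueError("window length must be positive")
    return sample.retrain_events * 60.0 / sample.window_minutes


def passes(sample: WindowSample,
           max_crc_per_1e9: float,
           max_retrains_per_hour: float) -> bool:
    """Apply both thresholds (the X placeholders) to one window."""
    return (crc_per_1e9_bits(sample) < max_crc_per_1e9
            and retrains_per_hour(sample) < max_retrains_per_hour)
```

Keeping the denominator inside the helper functions means two runs can only be compared after both have been normalized the same way.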
30-second triage checklist
- Disable EEE and compare error rate.
- Rate step-down and compare CRC slope.
- Enable PRBS/loopback and isolate board vs. partner.
- Any diff pair crossing a plane split/cut?
- Ref clock routed near high di/dt rails?
- Decap return loops short and direct?
Clock neighborhood noise → “Clock Isolation / PDN Coupling”
Transition sensitivity → “Diff Geometry / Length Matching”
Clock Noise Taxonomy for Ethernet Hardware
“Low-jitter clock” is not a component label—it is a system outcome. Classifying noise by how it looks, where it enters, and how to disprove it quickly prevents endless tuning without root cause.
What matters to the PHY (engineering view)
- Common source: PDN noise, PLL noise, threshold noise. Fast check: compare errors vs. load/temperature and supply events.
- Common source: coupling from DC-DC, nearby clocks, or periodic interference. Fast check: change aggressor state; watch whether the error signature moves.
- Risk: misreading instrumentation or partner tolerance. Fast check: SSC on/off A/B while holding traffic and temperature constant.
Reference clock entry points (focus on layout control)
Coupling paths (where good clocks get corrupted)
Fast check: errors track current steps and EEE state changes.
Fast check: partner sensitivity changes dramatically; EMI worsens with the same traffic.
Fast check: toggling an aggressor changes the error signature.
Fast check: local LDO isolation A/B shows measurable stability change.
Quick classification rules (to avoid endless tuning)
- Error clusters around power-state changes: start with PDN coupling and return loops.
- Error appears at specific activity patterns: suspect deterministic coupling (spurs/aggressors).
- Error depends on partner but not cable length: margin is small; layout amplifies tolerance differences.
- Instrumentation contradictions: suspect probe/fixture loop area and measurement tap mismatch.
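The classification rules above can be written down as a small lookup so that every triage session starts from the same hypotheses. This is only a sketch: the `Observations` field names are hypothetical, and the ordering (measurement doubts first) is an assumption based on the principle that evidence must be trustworthy before it is used.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Observations:
    """Yes/no answers gathered during triage (hypothetical field names)."""
    clusters_at_power_state_changes: bool = False
    tied_to_activity_pattern: bool = False
    partner_dependent: bool = False
    cable_length_dependent: bool = False
    instruments_disagree: bool = False


def first_suspects(obs: Observations) -> List[str]:
    """Map observations to the first hypotheses worth testing."""
    suspects: List[str] = []
    if obs.instruments_disagree:
        # Untrustworthy measurements invalidate every other conclusion.
        suspects.append("probe/fixture loop area or measurement tap mismatch")
    if obs.clusters_at_power_state_changes:
        suspects.append("PDN coupling and decap return loops")
    if obs.tied_to_activity_pattern:
        suspects.append("deterministic coupling (spurs/aggressors)")
    if obs.partner_dependent and not obs.cable_length_dependent:
        suspects.append("small margin amplified by layout")
    return suspects


if __name__ == "__main__":
    print(first_suspects(Observations(partner_dependent=True)))
```

An empty result means none of the four rules fired, which is itself useful: it suggests the failure has not yet been characterized well enough to start layout changes.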
Differential Pair Geometry: What Must Be Controlled
A differential pair is not “stable” because it is equal-length. Stability comes from impedance continuity, symmetry, and return-path continuity across every transition (vias, layer changes, and reference-plane changes).
Controlled impedance (focus on reflections and eye closure)
- Typical discontinuities: width/spacing changes, soldermask openings, pad/via stubs, abrupt neck-downs.
- Layout priority: avoid sudden geometry steps; prefer short, smooth transitions over sharp “stairs”.
- Pass criteria (X): PRBS/loopback shows CRC rate < X and no “top-rate-only” cliff behavior.
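To get a feel for how much a geometry step costs, the standard reflection coefficient at an impedance step, Γ = (Z2 − Z1) / (Z2 + Z1), can be evaluated directly. The sketch below assumes an ideal, lossless, frequency-independent step, which real vias and neck-downs are not; the impedance values in the usage lines are illustrative assumptions only.

```python
import math


def reflection_coefficient(z_from_ohm: float, z_to_ohm: float) -> float:
    """Voltage reflection coefficient seen by a wave going from z_from into z_to."""
    if z_from_ohm <= 0 or z_to_ohm <= 0:
        raise ValueError("impedances must be positive")
    return (z_to_ohm - z_from_ohm) / (z_to_ohm + z_from_ohm)


def return_loss_db(gamma: float) -> float:
    """Return loss in dB for a given reflection coefficient (larger is better)."""
    if gamma == 0:
        return math.inf
    return -20.0 * math.log10(abs(gamma))


if __name__ == "__main__":
    # Assumed example: a 100-ohm differential pair entering an 85-ohm neck-down.
    gamma = reflection_coefficient(100.0, 85.0)
    print(f"gamma = {gamma:.3f}, return loss = {return_loss_db(gamma):.1f} dB")
```

A single step of this size is rarely fatal on its own; the concern is several steps clustered together (pads, vias, neck-downs), where the individual reflections add up.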
Pair symmetry (prevent mode conversion)
- Symmetry means EM symmetry: same layer, same reference plane, and similar neighbors on both sides.
- Avoid “one-side exposure”: do not place one trace near a plane edge/split while the other sits over solid ground.
- Pass criteria (X): stability changes little across partners/cables (variation ≤ X) under identical traffic.
Vias & transitions (where most real failures originate)
Guarding (useful sometimes, harmful sometimes)
- Long parallel run near a strong aggressor is unavoidable.
- Routing density forces close proximity in a noisy region.
- EMI mitigation is needed without breaking impedance continuity.
- Guard forces tight serpentine or geometry steps.
- Guard via fence creates periodic discontinuities.
- Solid reference plane already provides clean return.
Length Matching Without Myth
The goal is not “perfect numbers” in a CAD report. The goal is protecting margin: avoid skew that converts energy into common-mode, and avoid compensation patterns that introduce coupling and impedance ripple.
What to match (scope for this page)
Inter-lane (between lanes): system-dependent; only a reminder here (detailed budgets belong to Key Specs/Cable & Reach pages).
- Priority order: reference continuity → transitions → symmetry → then fine length trim.
- Pass criteria (X): after matching, partner sensitivity does not worsen and BER/CRC improves by ≥ X.
When it matters (decision lens, no protocol tables)
Serpentine pitfalls (why “fixing” length can hurt)
- Self-coupling: tight meanders couple to themselves, acting like a small coupled-line structure.
- Impedance ripple: repeating geometry changes create periodic discontinuities and ISI.
- Local aggressor exposure: length compensation often pushes routing closer to noisy rails or clocks.
Practical default rules (safe by construction)
- Route straight first: protect plane continuity and minimize transitions before trimming length.
- Use relaxed meanders: avoid tight spacing; keep bends gentle to reduce coupling.
- Distribute compensation: do not concentrate meanders into one short region.
- Stay away from noise: keep compensation segments out of DC-DC and clock neighborhoods.
- Preserve symmetry: do not fix one trace while breaking pair symmetry and EM balance.
- Never cross plane cuts: no length compensation is worth a return-path detour.
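A quick calculation helps decide whether a length mismatch is worth a serpentine at all. The sketch below converts mismatch into time skew using t = ΔL · sqrt(ε_eff) / c and compares it with the symbol period. The effective dielectric constant of 4.0 and the 125 MBd symbol rate (the per-pair rate of 1000BASE-T) in the usage lines are example inputs; substitute the stack-up and PHY values of the actual design.

```python
import math

C_M_PER_S = 299_792_458.0  # speed of light in vacuum


def skew_ps(mismatch_mm: float, eps_eff: float) -> float:
    """Time skew in picoseconds caused by a length mismatch in millimetres."""
    if eps_eff < 1.0:
        raise ValueError("effective dielectric constant cannot be below 1")
    return mismatch_mm * 1e-3 * math.sqrt(eps_eff) / C_M_PER_S * 1e12


def skew_fraction_of_ui(mismatch_mm: float, eps_eff: float,
                        symbol_rate_baud: float) -> float:
    """Skew expressed as a fraction of one unit interval (symbol period)."""
    ui_ps = 1e12 / symbol_rate_baud
    return skew_ps(mismatch_mm, eps_eff) / ui_ps


if __name__ == "__main__":
    # Assumed example inputs: 1 mm mismatch, eps_eff = 4.0, 125 MBd.
    print(f"{skew_ps(1.0, 4.0):.2f} ps")
    print(f"{skew_fraction_of_ui(1.0, 4.0, 125e6) * 100:.3f} % of UI")
```

When the result is a tiny fraction of the unit interval, the priority order above applies: a serpentine added to trim that mismatch is more likely to cost margin than to recover it.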
Return Paths & Reference Planes
A broken return path turns a differential system into an antenna and a noise injector. The highest-leverage action is keeping reference planes continuous so return currents can follow the smallest loop area.
Return current basics (engineering view)
- Failure signature: partner/cable sensitivity increases and CRC appears “random” around load or state changes.
- Do: keep critical routes over a solid reference plane; avoid narrow “return bottlenecks”.
- Pass criteria (X): temporary return bridging (A/B test) reduces CRC/BER by ≥ X.
Plane split/cut (crossing gaps is a hard failure)
- Not only “crossing”: running along a gap edge can also destabilize the return distribution.
- Do: enforce a keepout band around plane cuts and connector voids.
- Pass criteria (X): removing a single gap-crossing eliminates a top-rate CRC cliff (≥ X improvement).
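Gap crossings can be flagged automatically before layout review. The sketch below assumes route vertices and plane-gap centerlines have been exported from the CAD tool as 2D coordinates on the relevant layer; the data format and function names are hypothetical, and a real check would treat gaps as polygons with a keepout band rather than as line segments.

```python
from typing import List, Tuple

Point = Tuple[float, float]
Segment = Tuple[Point, Point]


def _orient(a: Point, b: Point, c: Point) -> float:
    """Signed area test: >0 if c is left of a->b, <0 if right, 0 if collinear."""
    return (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])


def _within_box(a: Point, b: Point, p: Point) -> bool:
    """True if p lies inside the bounding box of segment a-b."""
    return (min(a[0], b[0]) <= p[0] <= max(a[0], b[0])
            and min(a[1], b[1]) <= p[1] <= max(a[1], b[1]))


def segments_cross(s1: Segment, s2: Segment) -> bool:
    """True if the two segments intersect or touch."""
    p1, p2 = s1
    q1, q2 = s2
    d1, d2 = _orient(q1, q2, p1), _orient(q1, q2, p2)
    d3, d4 = _orient(p1, p2, q1), _orient(p1, p2, q2)
    if d1 * d2 < 0 and d3 * d4 < 0:
        return True  # proper crossing
    # Collinear/touching cases.
    return ((d1 == 0 and _within_box(q1, q2, p1))
            or (d2 == 0 and _within_box(q1, q2, p2))
            or (d3 == 0 and _within_box(p1, p2, q1))
            or (d4 == 0 and _within_box(p1, p2, q2)))


def gap_crossings(route: List[Point], gaps: List[Segment]) -> List[Tuple[int, int]]:
    """Return (route_segment_index, gap_index) for every crossing found."""
    hits = []
    for i in range(len(route) - 1):
        seg = (route[i], route[i + 1])
        for j, gap in enumerate(gaps):
            if segments_cross(seg, gap):
                hits.append((i, j))
    return hits
```

For a sign-off check the expected result is an empty list; any hit identifies the exact route segment to reroute or to support with return stitching.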
Stitching vias/caps (return migration tools)
Connector region (layout-only: keep return continuous)
- Do: keep a wide, uninterrupted reference plane under the last connector approach region.
- Avoid: narrow neck-down return corridors and unnecessary slots/cutouts close to the port.
- Pass criteria (X): cleaning connector-plane continuity improves worst-case partner margin by ≥ X.
Clock Isolation: Partition, Guard, and Keepouts
Clock integrity is protected at board level by partitioning, keepouts, and controlled routing corridors. The objective is blocking coupling paths from high di/dt islands into the reference clock and PHY timing inputs.
Partition map (three-island strategy)
- Failure signature: errors correlate with load steps or switching frequency changes.
- Do: define boundaries early; route clocks as “protected assets”.
- Pass criteria (X): moving the clock corridor away from noise reduces jitter-related failures by ≥ X.
Keepout rules (board-level DRC mindset)
- Clock keepout: no high di/dt power traces or noisy nets within X of the clock route.
- No plane-gap crossing: clocks and critical pairs must not cross splits/cuts.
- Minimize transitions: avoid unnecessary layer changes and long via ladders.
- Switch-node keepout: treat the DC-DC switch node as a red zone; no sensitive routing enters.
- Prefer clean corridors: route clocks over solid reference, away from connector voids and cutouts.
- Pass criteria (X): applying keepouts reduces error correlation with power activity by ≥ X.
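The keepout rules can be checked with a simple distance test between the clock route and known noise sources. This is a sketch under simplifying assumptions: aggressors are modeled as points (for example, the center of a switch-node pad), coordinates are in millimetres on one layer, and the names and the keepout distance stand in for the X placeholder.

```python
import math
from typing import Dict, List, Tuple

Point = Tuple[float, float]


def point_segment_distance(p: Point, a: Point, b: Point) -> float:
    """Shortest distance from point p to the segment a-b."""
    dx, dy = b[0] - a[0], b[1] - a[1]
    seg_len_sq = dx * dx + dy * dy
    if seg_len_sq == 0.0:
        return math.hypot(p[0] - a[0], p[1] - a[1])
    # Project p onto the segment and clamp to its endpoints.
    t = ((p[0] - a[0]) * dx + (p[1] - a[1]) * dy) / seg_len_sq
    t = max(0.0, min(1.0, t))
    return math.hypot(p[0] - (a[0] + t * dx), p[1] - (a[1] + t * dy))


def keepout_violations(route: List[Point],
                       aggressors: Dict[str, Point],
                       keepout_mm: float) -> List[Tuple[str, float]]:
    """List every aggressor closer to the route than the keepout distance."""
    if len(route) < 2:
        raise ValueError("route needs at least two vertices")
    violations = []
    for name, pos in aggressors.items():
        nearest = min(point_segment_distance(pos, route[i], route[i + 1])
                      for i in range(len(route) - 1))
        if nearest < keepout_mm:
            violations.append((name, nearest))
    return violations
```

The returned distances make it easy to rank violations, so the closest aggressor gets fixed first.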
Guard / shield choices (use only when they improve continuity)
- Do: choose a quiet reference layer; keep the clock corridor consistent.
- Avoid: periodic via fences that create repeated impedance steps near the clock.
- Pass criteria (X): guard additions do not worsen BER/CRC and show measurable noise reduction ≥ X.
Crystal placement (short loops and low parasitics)
Power Noise → Clock/PHY Jitter Coupling
Power ripple, transients, and ground bounce can modulate reference-clock timing and shrink PHY jitter margin. The practical goal is blocking the injection chain at PDN impedance, decap loop, and domain isolation.
Coupling mechanisms (what injects timing noise)
- PSRR limit: ripple passes through at sensitive bands and shows up as timing modulation.
- Ground bounce: high di/dt current shifts local reference and perturbs sampling thresholds.
- Lpkg/Ltrace: fast di/dt creates ΔV across inductance exactly during burst/transition events.
- Pass criteria (X): after mitigation, correlation between error counters and rail events is ≤ X.
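The Lpkg/Ltrace mechanism follows V = L · di/dt, and a rough number is often enough to decide whether it deserves attention. The helper below is a back-of-envelope sketch; the inductance, current step, and edge time in the usage line are illustrative assumptions, not measured values.

```python
def inductive_drop_mv(inductance_nh: float, delta_i_a: float,
                      edge_time_ns: float) -> float:
    """Voltage disturbance in millivolts from a current step across an inductance."""
    if edge_time_ns <= 0:
        raise ValueError("edge time must be positive")
    # nH and ns cancel, so L[nH] * dI[A] / dt[ns] is already in volts.
    return inductance_nh * delta_i_a / edge_time_ns * 1e3


if __name__ == "__main__":
    # Assumed example: 2 nH of package + trace inductance, 0.5 A step in 5 ns.
    print(f"{inductive_drop_mv(2.0, 0.5, 5.0):.0f} mV")
```

Comparing the result with the ripple tolerance of the timing rail shows quickly whether loop inductance or edge rate is the term to attack.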
Decoupling strategy (frequency-domain roles)
LDO vs DC-DC (when isolation is mandatory)
- Mandatory LDO triggers (X): top-rate CRC cliff tracks switch-node events; margin loss ≥ X.
- DC-DC acceptable: sensitive domain has proven isolation corridor + tight decap loops + clean reference.
- Pass criteria (X): after domain isolation, error correlation with power mode changes ≤ X.
Return for decaps (loop closure beats capacitance)
- Common failure: “close” capacitor placement but return detours across cuts, neck-downs, or via bottlenecks.
- Do: treat each decap as a loop; ensure plane continuity and short return paths.
- Pass criteria (X): tightening the loop reduces burst errors by ≥ X at the same workload.
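The claim that loop closure beats capacitance can be illustrated with the series RLC model of a mounted capacitor, |Z| = sqrt(ESR² + (ωL − 1/(ωC))²). The sketch below is a simplified model that ignores plane spreading inductance and parallel capacitors; the component values in the usage lines are assumptions chosen only to show the trend.

```python
import math


def decap_impedance_ohm(freq_hz: float, cap_f: float,
                        esr_ohm: float, loop_l_h: float) -> float:
    """Impedance magnitude of a capacitor in series with its ESR and loop inductance."""
    w = 2.0 * math.pi * freq_hz
    reactance = w * loop_l_h - 1.0 / (w * cap_f)
    return math.hypot(esr_ohm, reactance)


def self_resonant_freq_hz(cap_f: float, loop_l_h: float) -> float:
    """Frequency above which the mounted capacitor behaves inductively."""
    return 1.0 / (2.0 * math.pi * math.sqrt(loop_l_h * cap_f))


if __name__ == "__main__":
    f = 100e6  # assumed frequency of interest
    tight_small = decap_impedance_ohm(f, 100e-9, 0.02, 1e-9)  # 100 nF, 1 nH loop
    loose_large = decap_impedance_ohm(f, 1e-6, 0.02, 3e-9)    # 1 uF, 3 nH loop
    print(f"100 nF / 1 nH: {tight_small:.2f} ohm")
    print(f"1 uF / 3 nH:   {loose_large:.2f} ohm")
```

Above self-resonance the impedance is set almost entirely by loop inductance, which is why a smaller capacitor with a tight return can outperform a larger one with a detoured return.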
Field triage (fast A/B checks)
- Do errors cluster at burst traffic, link training, or EEE wake transitions?
- Do CRC/BER counters correlate with DC-DC load steps or switching activity?
- Does a temporary local decap (tight loop) change the failure rate by ≥ X?
- Does a temporary return bridge (ground strap/foil) change the failure rate by ≥ X?
- Does isolating the timing sub-rail (test LDO) reduce correlation to ≤ X?
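The correlation questions above can be answered from timestamps alone. The sketch below computes the fraction of error events that fall within a window of any rail event and a rough chance baseline for comparison. It assumes both logs share one time base; the function names and the window value are hypothetical.

```python
from bisect import bisect_left
from typing import Sequence


def fraction_near_events(error_times_s: Sequence[float],
                         rail_event_times_s: Sequence[float],
                         window_s: float) -> float:
    """Fraction of error timestamps within +/- window_s of any rail event."""
    if not error_times_s:
        return 0.0
    events = sorted(rail_event_times_s)
    hits = 0
    for t in error_times_s:
        i = bisect_left(events, t)
        # Only the neighbours on either side of t can be the nearest event.
        neighbours = events[max(i - 1, 0): i + 1]
        if any(abs(t - e) <= window_s for e in neighbours):
            hits += 1
    return hits / len(error_times_s)


def chance_fraction(n_events: int, window_s: float, duration_s: float) -> float:
    """Upper bound on the fraction expected if errors were spread uniformly in time."""
    if duration_s <= 0:
        raise ValueError("duration must be positive")
    return min(1.0, 2.0 * window_s * n_events / duration_s)


if __name__ == "__main__":
    errors = [10.0, 10.05, 50.0, 90.2]
    rail = [10.0, 90.0]
    print(fraction_near_events(errors, rail, 0.1), chance_fraction(len(rail), 0.1, 100.0))
```

A measured fraction far above the chance baseline supports a rail-coupling hypothesis; a fraction near the baseline suggests looking elsewhere before touching the PDN.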
Layout Patterns: Good vs Bad (Actionable Rules)
The following rule library turns mechanisms into copyable layout patterns. Each group includes Do/Don’t, quick checks, and pass criteria placeholders (X) for bring-up and production sign-off.
Pair routing rules
- Do: keep pairs over a solid reference plane; avoid long parallel runs with aggressors.
- Do: minimize layer changes; place return migration near mandatory transitions.
- Do: keep the environment symmetric (no one-sided voids/copper changes).
- Don’t: cross plane cuts or run along gap edges within X.
- Don’t: add “decorative” guarding that creates periodic discontinuities.
- Quick check: mark all gap edges, voids, and transition points on the route map.
- Pass criteria (X): partner/cable sensitivity stays within X across test sets.
Clock routing rules
- Do: keep clock runs short/straight; minimize vias and stubs.
- Do: enforce keepouts from switch nodes and high di/dt paths (distance X).
- Do: keep a consistent reference plane; avoid connector cutouts.
- Don’t: route clocks across noisy islands or plane gaps.
- Don’t: place crystals where return paths are fragmented or necked down.
- Quick check: overlay keepout zones and verify zero crossings by critical nets.
- Pass criteria (X): error counters do not step up during power-mode changes (≤ X).
Plane & stitching rules
- Do: keep critical routes off plane gaps; keep a keepout band around cuts.
- Do: add stitching near the exact transition point (layer change, gap edge).
- Do: keep connector approach regions over wide, uninterrupted reference planes.
- Don’t: create narrow return neck-down corridors near ports and transitions.
- Quick check: draw the return path loop (not only the differential waveform).
- Pass criteria (X): A/B return bridging does not change BER/CRC by more than X after fixes.
Placement rules
- Do: place XO/clock buffer close to PHY timing pins; keep corridor clean.
- Do: place decaps to close the loop to the target pins; avoid via/plane bottlenecks.
- Do: separate DC-DC islands from clock/PHY islands with hard keepouts.
- Do: reserve protection footprints while preserving return-plane continuity (layout constraint only).
- Don’t: let connector voids/cutouts sit under the last approach of critical pairs.
- Quick check: highlight noise sources and verify no “timing corridor” crossings.
- Pass criteria (X): station-to-station variation stays within X after production scaling.
Master checklist (copy/paste sign-off)
- No critical pairs or clocks cross plane cuts; no routes run along gap edges within X.
- Clock corridor avoids DC-DC red zones and connector voids; layer changes minimized.
- Every mandatory transition has nearby return migration (stitching) support.
- Decap placement closes loops; return paths are not forced through neck-downs.
- Partition map is enforced with keepouts; sensitive domains have proven isolation.
- Bring-up uses A/B checks: return bridging, local decap loop tightening, timing sub-rail isolation.
Validation & Instrumentation
Layout quality is proven by repeatable evidence. Use built-in tests, partner A/B comparisons, and consistent logging to validate that clock/layout/power coupling is under control before parameter tuning.
Built-in tests (minimum reproducible experiments)
- Loopback: verify local stability without external channel variability.
- PRBS: apply controlled stress to expose margin limits with repeatable statistics.
- Partner A/B: compare sensitivity to different link partners to detect narrow margin.
- Pass criteria (X): loopback is clean and PRBS error rate remains ≤ X over Y minutes.
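Choosing the Y window for a PRBS run is easier with the zero-error confidence bound: to claim BER below a target with confidence CL after observing no errors, roughly N = −ln(1 − CL) / BER bits must be transferred. The sketch below applies that approximation, which assumes independent bit errors and a small target BER; burst errors violate the independence assumption, so treat the result as a minimum. The target values in the usage lines are examples.

```python
import math


def bits_for_zero_error_confidence(target_ber: float, confidence: float) -> float:
    """Bits that must pass error-free to claim BER < target at the given confidence."""
    if not 0.0 < target_ber < 1.0:
        raise ValueError("target_ber must be between 0 and 1")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be between 0 and 1")
    return -math.log(1.0 - confidence) / target_ber


def required_minutes(target_ber: float, confidence: float,
                     bit_rate_bps: float) -> float:
    """Error-free run time needed at a given line rate."""
    if bit_rate_bps <= 0:
        raise ValueError("bit rate must be positive")
    return bits_for_zero_error_confidence(target_ber, confidence) / bit_rate_bps / 60.0


if __name__ == "__main__":
    # Assumed example: BER target 1e-10, 95 % confidence, 1 Gb/s and 100 Mb/s.
    print(f"{required_minutes(1e-10, 0.95, 1e9):.2f} min at 1 Gb/s")
    print(f"{required_minutes(1e-10, 0.95, 100e6):.2f} min at 100 Mb/s")
```

The same target takes ten times longer at one tenth of the line rate, which is one reason to state pass criteria in bits as well as in minutes.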
What to log (counters + context fields)
Scope pitfalls (measurement hygiene)
- Probe loop risk: long ground loops create antennas and distort edge behavior.
- Wrong focus: differential waveform looks clean while return/rails are unstable.
- Over-filtering: averaging or narrow settings can hide burst-only failures.
- Quick check: repeat N captures with a tight return loop and verify counter time-alignment.
Validation ladder (recommended order)
- Local loopback: validate clock/layout stability without the channel.
- Short external link: reduce channel variability and confirm baseline.
- PRBS stress: force margin exposure with repeatable statistics.
- Partner A/B: measure sensitivity to tolerance differences.
- Env sweep: temperature and power-mode sweeps with event logging.
- Regression: re-run the same script after each layout or PDN change.
Failure Modes & Debug Playbook
This playbook standardizes debugging into: Symptom → fastest evidence → fix actions → pass criteria. It stays within clock/layout/power-coupling checks and avoids network-layer storms and TSN parameterization.
Top-rate CRC spikes; lower rate is clean
- Fix actions: remove plane-cut crossings, reduce transitions, add return migration near mandatory transitions.
- Pass criteria (X): CRC spikes < X per 1e9 bits at the target rate.
EEE wake is unstable (flaps around power-save)
- Fix actions: enforce clock keepouts; tighten decap loops; isolate timing sub-rails if correlation persists.
- Pass criteria (X): wake failures ≤ X per Y hours under the same script.
Training fails at target rate
- Fix actions: hunt impedance steps (via/transition clusters), remove tight serpentine, restore symmetry over solid reference.
- Pass criteria (X): training success ≥ X% across N cycles.
Failures correlate with temperature
- Fix actions: validate crystal/clock corridor integrity; ensure PDN loop closure and domain isolation under thermal drift.
- Pass criteria (X): BER/CRC remains within X across the temperature sweep.
One partner is stable; another is fragile
- Fix actions: widen margin by restoring return continuity, enforcing clock corridors, and flattening PDN peaks.
- Pass criteria (X): partner sensitivity drops to ≤ X under identical scripts.
Random link flaps without obvious waveform issues
- Fix actions: lock logging denominators, time-align events, and rerun a fixed script while changing only one variable.
- Pass criteria (X): flap rate < X per Y hours with no unexplained spikes.
Loopback is OK; external link fails
- Fix actions: restore plane continuity near ports, remove neck-down returns, and re-check transition stitching at the approach.
- Pass criteria (X): external PRBS/traffic errors drop by ≥ X at the same settings.
Fix makes behavior “different”, not “better”
- Fix actions: freeze scripts and denominators; keep A/B evidence; change one variable per iteration.
- Pass criteria (X): all key metrics meet X and repeat across N runs.
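To separate "different" from "better", an A/B pair of error counts can be tested against the hypothesis that both builds share one error rate. The sketch below uses the exact conditional binomial test for two Poisson counts; it is a swapped-in statistical technique, not something the playbook prescribes, and it is only practical for small counts (large totals overflow the floating-point arithmetic). Exposure can be bits or minutes, as long as both runs use the same unit.

```python
from math import comb


def ab_p_value(errors_a: int, errors_b: int,
               exposure_a: float = 1.0, exposure_b: float = 1.0) -> float:
    """Two-sided p-value for the hypothesis that A and B have the same error rate."""
    if errors_a < 0 or errors_b < 0:
        raise ValueError("error counts must be non-negative")
    if exposure_a <= 0 or exposure_b <= 0:
        raise ValueError("exposures must be positive")
    n = errors_a + errors_b
    if n == 0:
        return 1.0  # no errors at all: nothing to distinguish
    # Under the null, errors_a ~ Binomial(n, share of exposure belonging to A).
    p = exposure_a / (exposure_a + exposure_b)
    pmf = [comb(n, k) * p**k * (1.0 - p)**(n - k) for k in range(n + 1)]
    observed = pmf[errors_a]
    # Sum the probability of every outcome at least as unlikely as the observed one.
    total = sum(x for x in pmf if x <= observed * (1.0 + 1e-9))
    return min(1.0, total)


if __name__ == "__main__":
    print(ab_p_value(20, 2))   # before vs after a fix, equal windows
    print(ab_p_value(6, 4))    # a difference that may be noise
```

A large p-value means the two runs are statistically indistinguishable, so the change should not be counted as a fix until more windows are collected.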
Boundary lock (avoid cross-page expansion)
Network-layer storms and TSN parameterization belong to their dedicated pages. This playbook stays within clock corridors, return continuity, PDN loop closure, and validation scripts.
Applications & IC Selection
This section is not a product page. It lists device types, key selection signals, and typical reference bundles that make ref-clock and layout success more predictable. Example part numbers are provided as engineering anchors.
A) Application mapping (one-line constraints + layout hooks)
- Layout hooks: clock corridor + keepouts, symmetric fanout, no reference discontinuities near port approach.
- Validation hook: fixed-script PRBS + counters + rail/temp event correlation.
- Layout hooks: port approach return path is top priority; enforce stitching at mandatory transitions.
- Validation hook: partner A/B sensitivity to reveal narrow margin.
- Layout hooks: avoid clock crossing noisy islands; close decap loops before increasing capacitance.
- Validation hook: BER/CRC distributions over fixed windows, not single captures.
B) Selection dimensions (signals that predict layout risk)
- XTAL pins vs CLKIN (routing flexibility vs coupling risk).
- Clock amplitude/threshold tolerance (sensitivity to noise injection).
- Startup/lock robustness under rail transients.
- Layout hook: higher sensitivity → stricter clock corridor + isolated PDN island.
- Hold/lock stability under ground bounce and PDN peaks.
- Susceptibility to spur-like rail modulation (seen as burst errors).
- Layout hook: narrow margin → eliminate plane cuts, reduce transitions, enforce return migration.
- Entry/exit stability under rail mode changes and burst workloads.
- Wake-related retrain/flap sensitivity (symptom-level, not protocol deep-dive).
- Layout hook: EEE flaps → prioritize rail transient correlation and clock isolation.
- Loopback and PRBS support (repeatable stress).
- Error counters (CRC/retrain/flap) and event timestamping fields.
- Layout hook: observability enables fast correlation to rail/temp events.
C) Typical reference bundles (categories + example part numbers)
Examples below are not endorsements. They are provided to anchor concrete design discussions (clock input style, PDN isolation, and validation hooks).
- Ethernet PHY (10/100): TI DP83822I, Microchip LAN8742A
- Ethernet PHY (10/100/1000): TI DP83867IR, Analog Devices ADIN1300, Microchip KSZ9031RNX
- Oscillator (XO/TCXO class): Epson SG-210STF (XO family), SiTime SiT1602 (MEMS XO family)
- Low-noise LDO (clock/PHY island): ADI LT3042 / LT3045, TI TPS7A47 / TPS7A49
Validation hook: loopback + PRBS with CRC/retrain counters and rail/temp event stamps.
- Clock buffer / fanout: TI LMK1C1104 (fanout buffer family), TI CDCLVC1102 (clock buffer family), Renesas/IDT 5PB1108 (fanout buffer family)
- Ethernet PHY (1G class examples): TI DP83869HM, Analog Devices ADIN1300, Microchip KSZ9131RNX
- Oscillator (XO/TCXO class): Abracon ASV (XO family), Epson SG-210STF (XO family)
- Low-noise LDO (clock buffer island): ADI LT3045, TI TPS7A47
Validation hook: partner A/B sensitivity + fixed-window counters to prove wider margin.
- SPE PHY (10BASE-T1L examples): Analog Devices ADIN1100, TI DP83TD510E
- SPE MAC-PHY (10BASE-T1L example): Analog Devices ADIN1110
- Automotive Ethernet PHY (100BASE-T1 examples): NXP TJA1100, TI DP83TC812
- Automotive Ethernet PHY (1000BASE-T1 example): Marvell 88Q2112
- Low-noise LDO (clock/PHY island): ADI LT3042 / LT3045, TI TPS7A47 / TPS7A49
- Clock buffer (optional fanout): TI LMK1C1102 / LMK1C1104
Validation hook: time-align rail events with error bursts; accept only statistical stability (X/Y placeholders).
FAQs
Scope: long-tail on-site troubleshooting only (ref clock, return paths, differential routing, vias, reference planes). Each answer uses a fixed format: Likely cause / Quick check / Fix / Pass criteria (X).
- Use a fixed window: Y minutes or Y bits; do not mix denominators.
- Always log: CRC, retrain/flap, EEE entry/exit, and rail/temp events with timestamps.
- Pass criteria use placeholders: X (threshold) and Y (time/bits/window).
Link is up, but CRC spikes only at full rate — first check return-plane cuts or via transitions?
Likely cause: return-path discontinuity (plane cut/edge) or clustered layer transitions causing return detours and mode conversion.
Quick check: identify any segment that crosses a split/void; count reference changes and via stubs near the port approach; compare internal loopback vs external link counters.
Fix: reroute to stay over a solid reference; move layer changes away from the connector approach; add ground stitching vias at mandatory transitions; remove/shorten stubs.
Pass criteria (X): CRC spikes < X per 1e9 bits over Y minutes at full rate; retrain = 0; BER < X over Y minutes.
Works on bench, fails in enclosure — is clock noise injected by DC-DC coupling?
Likely cause: enclosure changes coupling (field/ground bounce) so DC-DC noise injects into the clock corridor or PHY/PLL island.
Quick check: A/B test open-air vs enclosure with identical scripts; log rail ripple and error bursts; toggle DC-DC operating mode (where possible) to see if errors shift with rail behavior.
Fix: enforce keepouts around the clock route; tighten decap return loops; isolate clock/PHY supply with a low-noise rail (e.g., LDO island); add a stitching fence between noisy and clock regions.
Pass criteria (X): enclosure vs bench delta (CRC/BER) < X; CRC spikes < X per 1e9 bits over Y minutes; no error burst correlation to rail events above X threshold.
Equal-length differential pair is still unstable — did serpentine create tight coupling or impedance ripple?
Likely cause: tight meanders create periodic impedance ripple and local coupling (including to nearby aggressors), reducing eye margin despite “equal length”.
Quick check: inspect meander pitch and spacing; look for long parallel runs next to other nets; compare counters with meander removed (or relaxed) in a controlled A/B build.
Fix: replace tight serpentine with relaxed, segmented meanders; increase spacing; keep meanders on a uniform reference plane and away from plane edges and high-activity nets.
Pass criteria (X): BER < X over Y minutes at target rate; CRC spikes < X per 1e9 bits; aggressor-activity sensitivity delta < X.
EEE wake causes random drops — is ref-clock phase noise worse during power transients?
Likely cause: EEE entry/exit triggers rail transients that couple into the ref-clock/PLL path, exposing a narrow jitter margin.
Quick check: timestamp EEE entry/exit and correlate with CRC/retrain; disable EEE as a diagnostic toggle; capture rail transient at the clock/PHY island during wake events.
Fix: isolate the clock/PHY supply (low-noise island + correct decap return); enforce clock keepouts from high di/dt regions; avoid clock crossings near plane edges.
Pass criteria (X): EEE wake failures < X per Y hours; retrain < X per 24h; CRC spikes during EEE events < X over Y hours.
One board lot is worse — is assembly stress shifting crystal / load caps?
Likely cause: assembly stress and tolerance spread shift crystal ESR/load or parasitic capacitance, changing clock startup margin and noise susceptibility.
Quick check: compare ref-clock offset and startup behavior across lots; thermal sweep while logging CRC; swap crystal/load caps between good/bad units to confirm sensitivity.
Fix: enforce crystal keepouts and symmetric grounding; shorten and balance load-cap traces; tighten component tolerances; reduce mechanical stress coupling (layout + assembly guidance).
Pass criteria (X): lot-to-lot clock offset within X ppm; CRC spikes < X per 1e9 bits over Y minutes; lot-to-lot error-rate delta < X.
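The sensitivity to parasitic capacitance can be estimated with the usual crystal load relation, C_L = C1·C2 / (C1 + C2) + C_stray. The sketch below evaluates it and solves for symmetric load capacitors; the stray capacitance values in the usage lines are assumptions, since the real figure depends on pad geometry, trace length, and the oscillator pins.

```python
def effective_load_pf(c1_pf: float, c2_pf: float, stray_pf: float) -> float:
    """Load capacitance seen by the crystal, in picofarads."""
    if c1_pf <= 0 or c2_pf <= 0 or stray_pf < 0:
        raise ValueError("capacitances must be positive (stray may be zero)")
    return c1_pf * c2_pf / (c1_pf + c2_pf) + stray_pf


def symmetric_cap_for_load_pf(target_load_pf: float, stray_pf: float) -> float:
    """Value for C1 = C2 that produces the target load capacitance."""
    if stray_pf >= target_load_pf:
        raise ValueError("stray capacitance already exceeds the target load")
    return 2.0 * (target_load_pf - stray_pf)


if __name__ == "__main__":
    # Assumed example: 12 pF crystal, 3 pF stray on the reference build.
    caps = symmetric_cap_for_load_pf(12.0, 3.0)
    print(f"C1 = C2 = {caps:.1f} pF")
    # The same capacitors on a lot with 1 pF more stray capacitance:
    print(f"load with extra stray: {effective_load_pf(caps, caps, 4.0):.1f} pF")
```

Because stray capacitance adds directly to the load, a lot-to-lot shift of about a picofarad in pad or trace parasitics moves the crystal away from its specified operating point by the same amount.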
BER is worse after adding “shield/guard” — did you increase parasitic C or create resonant stubs?
Likely cause: guard/shield geometry increases parasitic capacitance (impedance drop) or creates unintended resonant segments and asymmetry, increasing mode conversion.
Quick check: compare pre/post-layout channel behavior using the same PRBS window; inspect guard continuity, distance, and any floating segments; verify whether errors increase near a specific frequency/load mode.
Fix: remove continuous close-in guards; use a stitching via fence at a safe distance instead; keep the pair symmetric over a solid reference; avoid guard segments that cross reference discontinuities.
Pass criteria (X): BER < X over Y minutes; CRC spikes < X per 1e9 bits; pre-scan common-mode delta ≤ X dB at key bands (Y MHz).
Retrain happens every few minutes — is the ref clock routed near high di/dt rails or a split-plane edge?
Likely cause: periodic load/rail events inject noise into the clock corridor or PHY island; plane edges/via stubs turn that noise into repeatable margin loss.
Quick check: correlate retrain timestamps with load and rail events; inspect clock route proximity to inductors/switch nodes; downshift the rate as a margin probe (stability at the lower rate suggests layout-limited margin).
Fix: reroute clock away from noisy regions; add keepouts and stitching fences; isolate the clock/PHY rail; reduce transitions and eliminate plane-cut crossings.
Pass criteria (X): retrain < X per 24h; link flaps < X per 24h; CRC spikes < X per 1e9 bits over Y minutes.
Only long packets fail — is it a marginal eye from impedance discontinuity + return-path detour?
Likely cause: marginal eye/ISI due to discontinuities and return detours; long frames statistically expose narrow margin sooner than short bursts.
Quick check: test error rate vs frame length using a fixed window; compare internal loopback vs external link; confirm error counters scale with payload length at the same rate.
Fix: smooth impedance transitions (vias, layer changes, stubs); keep routing over a solid reference; relax serpentine; increase spacing to aggressors near the port approach.
Pass criteria (X): CRC rate becomes frame-length independent within X; BER < X over Y minutes; CRC spikes < X per 1e9 bits over Y minutes.
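The statistical part of this effect can be quantified: with independent bit errors, the probability that a frame contains at least one error is P = 1 − (1 − BER)^(8·bytes). The sketch below evaluates that expression in a numerically stable way. The independence assumption is a simplification, so errors that scale faster than this model predicts point to a frame-length-dependent mechanism rather than pure statistics; the frame sizes and BER in the usage lines are examples.

```python
import math


def frame_error_probability(ber: float, frame_bytes: int) -> float:
    """Probability that a frame of the given size contains at least one bit error."""
    if not 0.0 <= ber < 1.0:
        raise ValueError("ber must be in [0, 1)")
    if frame_bytes < 0:
        raise ValueError("frame size must be non-negative")
    bits = 8 * frame_bytes
    # 1 - (1 - ber)**bits, written with log1p/expm1 to stay accurate for tiny BER.
    return -math.expm1(bits * math.log1p(-ber))


if __name__ == "__main__":
    ber = 1e-10  # assumed example BER
    short = frame_error_probability(ber, 64)
    long_ = frame_error_probability(ber, 1518)
    print(f"64 B: {short:.3e}, 1518 B: {long_:.3e}, ratio: {long_ / short:.1f}")
```

For small BER the ratio between two frame sizes approaches the ratio of their lengths, which gives a baseline for the "errors scale with payload length" quick check.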
Temperature-dependent CRC — is it crystal ESR/tempco or a PDN impedance peak?
Likely cause: temperature shifts crystal ESR/startup margin or moves a PDN impedance peak that increases noise coupling into clock/PLL and PHY analog blocks.
Quick check: run a controlled thermal sweep while logging CRC and rail events; compare clock offset and stability across temperature; observe whether errors cluster near a specific temperature band.
Fix: improve crystal placement/keepout and reduce stress coupling; isolate clock/PHY supply; adjust decap placement/return loops to flatten PDN peaks that couple into the clock path.
Pass criteria (X): CRC spikes < X per 1e9 bits across full temperature range for Y minutes each; clock drift within X ppm; BER < X over Y minutes.
Different partner switch changes stability — is jitter margin small due to layout-induced noise?
Likely cause: narrow link margin; different partners tolerate different jitter/ISI profiles, exposing layout-induced noise and mode conversion.
Quick check: partner A/B test with the same cable and scripts; compare CRC/BER counters; rate downshift as a margin probe; log whether errors cluster after specific rail events.
Fix: widen margin by eliminating the biggest layout risks first: plane-cut crossings, excessive transitions/stubs, and clock corridor violations; isolate the clock/PHY island supply if rail coupling is observed.
Pass criteria (X): partner A vs B metric delta < X; stable at full rate for Y hours; retrain = 0; CRC spikes < X per 1e9 bits over Y minutes.
Scope looks clean but errors persist — are you measuring at the wrong tap point / probe artifact?
Likely cause: probing hides the real problem (wrong tap point, long ground lead, fixture resonance) or misses clock/PLL sensitivity that does not show in a single capture.
Quick check: re-probe with proper technique (short spring ground or differential probe); compare tap points (near PHY vs near connector); rely on counter statistics (fixed window) to validate improvement.
Fix: correct the measurement setup first; then target the dominant layout risk (return discontinuity, clock keepout violation, via stub) based on the statistical evidence.
Pass criteria (X): measurement-to-measurement variation < X% (same setup); BER < X over Y minutes; CRC spikes < X per 1e9 bits over Y minutes.
Fix improved CRC but EMI got worse — did you trade smaller loop for higher common-mode conversion?
Likely cause: routing change improved differential behavior but increased asymmetry and mode conversion (more common-mode current), often near plane edges or discontinuities.
Quick check: compare common-mode proxy before/after (near-field scan or current clamp where available); review pair symmetry and reference continuity; check if EMI worsens at repeatable bands.
Fix: restore symmetry; keep the pair over a solid reference; avoid plane edges; use a controlled stitching strategy for return migration at unavoidable transitions.
Pass criteria (X): CRC spikes < X per 1e9 bits over Y minutes; retrain = 0 over Y hours; pre-scan emission delta ≤ X dB at key bands (Y MHz).