10/100/1G/2.5G/5G/10G Ethernet PHY Design Guide

Q: Link is up at 1G, but 2.5G/5G shows CRC spikes. Check channel margin or the EQ preset first?

Likely cause: Thin channel margin at higher rates, or the chosen EQ preset/adaptation lands in a poor operating point. Quick check: Lock the rate and run PRBS for N bits using preset A/B; compare local loopback vs end-to-end to isolate internal vs channel. Fix: Use the stable preset (or narrow adaptation range); reduce discontinuities on the path and re-baseline PRBS/counters at the target rate. Pass criteria: CRC ≤ X per 10^6 frames over Y min at 2.5G/5G; PRBS BER ≤ X over N bits; flap count = 0.
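The "N bits" window in the quick check can be sized from the target BER. The sketch below assumes independent bit errors and a zero-error pass, in which case roughly N ≥ -ln(1 - CL) / BER bits support the claim "BER ≤ target" at confidence CL. The function names and the 2.5e9 bit/s figure are illustrative assumptions; the actual PRBS bit rate depends on the PHY's line coding.

```python
import math

def bits_for_zero_error_confidence(target_ber: float, confidence: float = 0.95) -> int:
    """Bits that must pass with zero errors to claim BER <= target_ber."""
    if not (0.0 < target_ber < 1.0 and 0.0 < confidence < 1.0):
        raise ValueError("target_ber and confidence must lie in (0, 1)")
    # Small-BER approximation: (1 - BER)^N ~ exp(-N * BER) <= 1 - CL
    return math.ceil(-math.log(1.0 - confidence) / target_ber)

def window_seconds(n_bits: int, bit_rate_bps: float) -> float:
    """Wall-clock time needed to push n_bits through the link."""
    return n_bits / bit_rate_bps

def crc_per_million_frames(crc_errors: int, frames: int) -> float:
    """Normalize a CRC count to the 'per 10^6 frames' form used in the pass criteria."""
    if frames <= 0:
        raise ValueError("frames must be positive")
    return 1e6 * crc_errors / frames

if __name__ == "__main__":
    n = bits_for_zero_error_confidence(1e-12, confidence=0.95)
    print(f"bits needed: {n:.3e}")
    print(f"window at 2.5e9 bit/s (assumed): {window_seconds(n, 2.5e9):.0f} s")
```

Fixing N this way keeps preset A/B runs comparable: both presets see the same statistical window, so a difference in error count is not an artifact of test length.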

Q: Works in the lab, fails in the cabinet: is it refclk noise injection or a return-plane cut?

Likely cause: Refclk/supply noise couples into PLL/CDR, or return-path disruption increases mode conversion and closes eye margin. Quick check: Correlate errors with cabinet events; compare error slope with refclk source A/B and a controlled supply ripple step (ΔV = X mV). Fix: Improve refclk isolation and restore continuous return path; re-run the same PRBS window and throughput burst in-cabinet. Pass criteria: Error slope ≤ X per min across load steps; PRBS BER ≤ X over N bits; no flap over Y min.

Q: Enabling EEE causes periodic flaps: is it exit latency or partner mismatch?

Likely cause: Partner ability mismatch or exit timing/latency behavior violates the traffic profile. Quick check: Disable EEE and compare flap rate; if stable, re-enable and test two traffic models: bursts every T ms and continuous load at X%. Fix: Keep EEE disabled for that partner/port class, or tune policy and validate with a partner matrix. Pass criteria: Flap count = 0 over Y min; throughput variance ≤ X% under burst model; CRC increase due to EEE ≤ X/10^6 frames.

Q: AN completes but no traffic passes: is it MAC interface timing or FIFO underrun?

Likely cause: MAC-side interface mode/timing mismatch, or datapath starvation (FIFO underrun/overrun). Quick check: Verify MAC/PHY interface mode and delays; check underrun/overrun counters while sending a constant-rate stream at X Mbps. Fix: Correct mode/delays; increase FIFO margin (clocking/pacing) and re-test using the same stream. Pass criteria: Throughput ≥ X% of line rate over Y min; underrun/overrun = 0; CRC ≤ X/10^6 frames.

Q: Only one port fails under temperature: is it PLL lock margin or a solder/connector loss shift?

Likely cause: Marginal PLL/CDR lock at corners, or localized physical loss shift reduces SI margin. Quick check: Compare the failing port vs a known-good port across Tlow→Thigh: time-to-link, retrain count, and error slope. Fix: Rework suspected solder/connector and verify port routing/return; stabilize refclk/supply and lock stable EQ if PLL margin is suspected. Pass criteria: No retrain/flap over Y min at each corner; PRBS BER ≤ X over N bits; error slope within baseline ± X%.

Q: PRBS passes in local loopback but fails end-to-end — which segment does that isolate?

Likely cause: Internal PHY datapath is healthy; the failing segment is the external channel or the link partner. Quick check: Run PCS/PMA loopbacks then remote loopback (if supported); swap partner/port/cable class to separate channel vs partner. Fix: Treat as channel/partner margin: lock stable EQ preset, reduce discontinuities, validate PRBS with a fixed N-bit window at target rate. Pass criteria: End-to-end PRBS BER ≤ X over N bits; CRC ≤ X/10^6 frames over Y min; internal loopbacks remain clean (BER ≤ X).

Q: Changing the magnetics vendor worsened BER. What is the first Cdiff/mismatch sanity check (PHY-only view)?

Likely cause: Differential mismatch introduces mode conversion and reduces receiver margin (stronger at higher rates). Quick check: Compare error slope and rate sensitivity before/after the vendor change using the same PRBS window. Fix: Restore validated part list; if change is mandatory, re-baseline EQ presets and validate PRBS/CRC per rate. Pass criteria: PRBS BER ≤ X over N bits per rate; CRC ≤ X/10^6 frames; delta vs baseline slope ≤ X%.

Q: Link is OK, but throughput collapses with bursts: is it EEE/LPI transitions or PCS errors?

Likely cause: EEE/LPI transitions interact poorly with burst traffic, or PCS errors force retransmissions/pauses. Quick check: Disable EEE and repeat the same burst pattern (Tidle/Tburst at X% line rate); observe CRC/PCS error counters during bursts. Fix: Keep EEE disabled or adjust policy for burst workloads; if PCS errors dominate, lock stable EQ preset and address discontinuities. Pass criteria: Throughput ≥ X% of expected for Y min under burst model; CRC ≤ X/10^6 frames; no periodic stalls aligned with LPI transitions.

Q: Intermittent down/up every few minutes — brownout reset, strap sampling, or thermal?

Likely cause: Supply dip triggers reset, straps sampled under unstable rails, or thermal protection causes periodic recovery. Quick check: Correlate flap timestamps with supply/temperature logs; check reset-cause/interrupt history; measure time-to-link variance against the ≤ X% target. Fix: Improve hold-up/decoupling and reset sequencing; stabilize straps at sampling; address thermal path if correlated. Pass criteria: Flap count = 0 over Y hours; time-to-link ≤ X ms with variance ≤ X%; no reset/thermal events under worst-case load.

Q: MDIO shows no faults but counters climb — which counters are most telling at PCS vs PMA?

Likely cause: Soft errors accumulate without hard faults; PCS indicates block/symbol integrity stress while PMA indicates lock/retrain/alignment stress. Quick check: Separate counters by layer (PCS integrity vs PMA lock/retrain) and compare slopes under PRBS vs real traffic. Fix: If PCS dominates, treat as margin/EQ/channel; if PMA dominates, treat as refclk/supply/temperature sensitivity and stabilize bring-up settings. Pass criteria: Layered counter slopes ≤ X/min; no retrain events over Y min; CRC ≤ X/10^6 frames.


Multi-rate Ethernet PHYs are hard not because they must “link,” but because they must stay stable, measurable, and production-repeatable across 10M→10G and real-world channels.

This page turns bring-up, jitter/margin, EQ/EEE, and PRBS/loopback observability into a stepwise workflow with numeric acceptance criteria, so failures can be isolated fast and fixed with confidence.

Definition & Where This PHY Fits

Multi-rate Ethernet PHYs (10M → 10G) are integration problems: stable BER/latency across rates and partners requires tight clocking, controlled channel margins, and built-in observability for fast bring-up and production repeatability.

What “PHY” means in practice

An Ethernet PHY converts a MAC-side digital interface into a physical link over a differential channel, while exposing measurable signals of link health. For multi-rate designs, the primary challenge is not “link-up,” but predictable stability under rate changes, environmental drift, partner diversity, and real channel losses.

Typical use-cases (where failures hide)

Switch uplink / port-dense gateways: partner variety and port adjacency amplify jitter/EMI sensitivity and cross-port coupling.
Industrial edge: wide temperature, noisy grounds, and long service life demand strong margins and durable diagnostics.
Mixed-rate networks: legacy endpoints coexist with multi-gig links; negotiation and EEE behavior must stay deterministic.

What this page delivers (engineering outcomes)

Margin control: translate “CRC/BER spikes” into a repeatable check → fix → pass-criteria flow.
Clocking discipline: treat ref-clock quality as a budgeted input to PLL/CDR lock and error rate.
EEE safety: enable power save only when exit latency and partner behavior remain bounded and testable.
Observability: define the minimum PRBS/loopback/counters/log fields required for field forensics.

Scope guard (to avoid cross-page overlap)

This page stays at the PHY layer: link establishment, clock/jitter budgeting, adaptive EQ boundaries, EEE stability, and test/bring-up hooks. System topics are referenced only as integration constraints (no deep dives).

TSN scheduling and time windows (Qbv/Qci/GCL), PTP/SyncE timing systems, PoE power architecture, and magnetics/CMC selection are out of scope here.

Diagram: System position + four engineering levers (Margin / Clock / Power-save / Observability)

Integration success is driven by four levers: measurable margins (BER/CRC), disciplined ref-clock/jitter budgeting, EEE enablement that remains bounded under traffic patterns, and observability hooks that isolate failures without lab instruments.

Architecture Map: PCS/PMA, SerDes, DSP/EQ, MAC Interface

A PHY architecture map is a fault-isolation tool: symptoms must translate into “which layer to check first” (MAC interface, auto-negotiation, PCS, PMA/SerDes, EQ, or clock/power integrity).

Layer model (engineering view)

MAC interface layer: RMII/GMII/RGMII/SGMII/USXGMII timing, delays, and error visibility.
Control/management layer: MDIO/MMD access, straps, interrupts, and field-debug logging.
PCS layer: auto-negotiation state, block lock/alignment, and PCS-level error counters.
PMA/SerDes layer: PLL/CDR lock, TX/RX analog behavior, and training/equalization boundaries.
Cross-cutting integrity: ref-clock quality and supply/ground noise modulate margins across all layers.

Diagram: Architecture map for fault isolation (layers + observability hooks)

The map above supports a deterministic debug order: prove MAC-side timing and management visibility, confirm PCS state and counters, validate PMA/SerDes lock, then evaluate EQ/training boundaries under the intended channel. Observability is part of the design, not an afterthought.

Responsibility boundaries (Symptoms → Quick check → Hooks)

Each block below is an isolation unit. The fastest checks are chosen to separate PHY-internal faults from MAC/system issues before deep signal analysis.

MAC interface (RMII/GMII/RGMII/SGMII/USXGMII)

Symptoms: link shows “up” yet throughput collapses, specific frame sizes fail, or errors appear only in one direction.
Quick check: verify interface mode + clock relationship + delay configuration; confirm PCS/PMA counters remain clean in local loopback.
Hooks: readable mode/delay settings, local loopback enable, and a “known-good” test mode independent of higher software layers.

Auto-negotiation & ability exchange

Symptoms: repeated renegotiation, “stuck at 1G,” or specific partners never converge at 2.5G/5G/10G.
Quick check: read partner abilities + AN completion state + remote fault indications; compare negotiated result vs configured policy.
Hooks: partner-ability snapshot, AN state log fields (time-stamped), and a policy control to force/disable targeted rates during triage.

PCS (encode/decode, alignment, PCS counters)

Symptoms: link-up is achieved but CRC/PCS errors grow; errors correlate with rate switches or bursts.
Quick check: check block lock/alignment indicators; inspect PCS error counters before blaming the channel.
Hooks: PCS loopback mode, PCS counters exposed via MDIO/MMD, and interrupts mapped to software logs.

PMA / SerDes (PLL/CDR lock, analog link)

Symptoms: temperature/voltage-sensitive link flaps, “works at low speed only,” or intermittent drops after warm-up.
Quick check: validate PLL lock and CDR lock status; correlate error bursts with lock transitions and ref-clock disturbances.
Hooks: readable lock indicators, optional clock-out for correlation, and PMA loopback to isolate line-side faults.

Adaptive EQ / training (what it fixes vs what it cannot)

Symptoms: good results on short channels but failure on real harness/backplane; “rate works when forced to a different preset.”
Quick check: compare PRBS BER across presets; read EQ/training completion state and any saturation/limit indicators.
Hooks: ability to force safe presets, visibility into EQ/training state, and a repeatable PRBS recipe for A/B comparisons.

EEE (802.3az) power save

Symptoms: periodic flaps, latency spikes during burst traffic, or instability that disappears when EEE is disabled.
Quick check: monitor LPI enter/exit counters; verify partner EEE capability and exit timing margins under intended traffic patterns.
Hooks: per-port EEE enable/disable control, LPI counters exposed to logs, and a stress recipe that reproduces burst/idle transitions.

MDIO/MMD, straps, interrupts (field-debug survivability)

Symptoms: behavior differs across cold boots, field failures cannot be reproduced in a lab, or “no visible faults” while counters climb.
Quick check: confirm strap sampling and reset timing; ensure MDIO access is stable; verify interrupts are rate-limited and logged.
Hooks: deterministic reset/strap scheme, guaranteed MDIO reachability, and a minimum black-box log set (rate, partner, counters, temperature, power events).

Diagram: Symptom → first check routing (fast isolation)

This routing prevents misdiagnosis: start with deterministic layer signals (AN status, lock indicators, counters, PRBS/loopback outcomes), then escalate to channel/SI investigations only after PHY-internal evidence is clean.

Rate Bring-up Flow: AN / Training / Link-up States

A multi-rate PHY bring-up should behave like a deterministic state machine: each stage has a minimal observable, a known set of failure signatures, and a next-action that isolates PHY issues before escalating to system layers.

Fast triage buckets (pick the first lane)

A) No link (never comes up)

Start at: Strap latched → Refclk ready → Reset release
First evidence: MDIO reachable, mode stable, PLL lock asserted
Then: AN/partner ability + remote fault

B) Link flaps (up → down → up)

Start at: Reset causes + PLL/CDR lock stability
First evidence: lock toggles, counters burst in clusters, periodicity
Then: Monitor lane + optional EEE lane (if enabled)

C) Rate-specific instability

Start at: Negotiated rate vs policy, training done, EQ preset A/B
First evidence: errors appear only at the high rate and improve when the rate is forced lower
Then: Jitter budget + injected noise correlation (see Clocking & Jitter Budget)

Step S0 · Strap latched (power-on configuration)

Goal

Mode and address are deterministic across cold boots; no “random” interface/rate behavior.

Observables

MDIO address responds
Mode/status register matches expected straps
Optional strap readback (if supported)

Failure signatures

Different behavior between cold boots; ports enumerate inconsistently; link policy appears ignored.

Next action

Harden strap pull network and sampling timing
Log mode/address fields at boot (black-box)
Prefer software override where available

Step S1 · Refclk ready (clock present & stable)

Goal

Reference clock is stable enough for internal PLL/CDR lock without intermittent loss.

Observables

PLL pre-lock / clock-detect (if exposed)
No error bursts before AN starts
Optional CLKOUT for correlation

Failure signatures

Lock is temperature/traffic sensitive; link works at low rate but not at high rate; periodic dropouts.

Next action

Validate refclk source quality and isolation
Check clock routing return paths and coupling
Proceed to jitter budgeting (see Clocking & Jitter Budget)

Step S2 · Reset release (init determinism)

Goal

Reset deasserts cleanly; internal init completes once; management access is stable.

Observables

Init done / ready status
Reset cause flags (if available)
IRQ rate does not storm

Failure signatures

Link works briefly then drops; repeated init; brownout-like resets under traffic or temperature.

Next action

Correlate resets with power/thermal events (black-box)
Stabilize supplies and reset sequencing
Continue to AN only after S0–S2 are clean

Step S3 · AN / ability exchange (negotiated result)

Goal

AN completes and the negotiated rate/duplex matches policy; partner capability is visible.

Observables

AN complete
Partner ability snapshot
Remote fault / link fault flags

Failure signatures

Endless renegotiation; certain partners never converge at multi-gig; negotiated rate differs from expected.

Next action

Force/disable specific rates to bisect partner issues
Compare partner ability vs local policy settings
Proceed to training/EQ only after AN evidence is clean
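As one concrete reading of the S3 observables, the sketch below decodes the IEEE 802.3 Clause 22 Basic Status Register (register 1): bit 2 is link status, bit 4 is remote fault, bit 5 is auto-negotiation complete. `mdio_read` is a hypothetical callable standing in for whatever MDIO access the platform provides, and multi-gig ability exchange normally lives in Clause 45 MMD registers that are device-specific, so treat this as a minimal starting point rather than a full AN readout.

```python
BMSR_REG = 1                      # Clause 22 Basic Status Register
BMSR_LINK_STATUS = 1 << 2
BMSR_REMOTE_FAULT = 1 << 4
BMSR_AN_COMPLETE = 1 << 5

def decode_bmsr(value: int) -> dict:
    """Extract the S3 observables from a raw 16-bit BMSR value."""
    return {
        "link_up": bool(value & BMSR_LINK_STATUS),
        "remote_fault": bool(value & BMSR_REMOTE_FAULT),
        "an_complete": bool(value & BMSR_AN_COMPLETE),
    }

def read_bmsr_current(mdio_read, phy_addr: int) -> int:
    """Return the current BMSR value.

    Link status is latching-low: the first read reports any past drop,
    the second read reflects the present state.
    mdio_read(phy_addr, reg) is a hypothetical platform hook.
    """
    mdio_read(phy_addr, BMSR_REG)          # discard the latched value
    return mdio_read(phy_addr, BMSR_REG)

if __name__ == "__main__":
    print(decode_bmsr(0x796D))
```

Logging the discarded first read as well can be useful during flap triage, since it reveals a link drop that occurred between polls.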

Step S4 · Training (if present) / EQ convergence

Goal

Training completes; EQ state is stable (no saturation/limit events under the intended channel).

Observables

Training done
EQ preset / taps status
Optional saturation/limit flags

Failure signatures

Bench is stable but real channel fails; only lower rates work; stability depends on a specific preset.

Next action

A/B compare presets with PRBS and counters
Check if instability correlates with refclk or injected noise
Escalate to the jitter budget funnel (see Clocking & Jitter Budget)

Step S5 · Link-up (data path validity)

Goal

Link-up is defined by bounded error counters and stable throughput, not only a status bit.

Observables

PCS/PMA error counter slope
Local loopback result (baseline)
Traffic stability under burst load

Failure signatures

CRC spikes without clear link-down; errors appear only during bursts; one direction is worse.

Next action

Prove PHY with PRBS/loopback before system escalation
Re-check MAC interface timing if PHY evidence is clean
Record counters + temperature + power events (black-box)

Step S6 · Monitor (drift, aging, and field stability)

Goal

Error counters remain within threshold X over window Y across temperature and traffic patterns.

Observables

Error counter slope vs time
Lock event timestamps
LPI enter/exit counters (if EEE enabled)

Failure signatures

Periodic bursts; failures only in a temperature band; errors grow after minutes/hours.

Next action

Correlate bursts with refclk/power/noise injections
Capture “minimum black-box fields” on every event
Use the jitter budget funnel (see Clocking & Jitter Budget) to allocate a jitter margin reserve

Minimum black-box fields (for field forensics)

Capture these fields per port at boot, link events, and error bursts to enable remote isolation without lab instruments.

Mode + negotiated rate + partner ability
AN state, remote fault flags, training done (if present)
PLL/CDR lock status + lock toggle timestamp
PCS/PMA error counters and their slope over time
Temperature + key power events (brownout/thermal/WD)
EEE LPI enter/exit counters (if enabled)
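The field list above can be captured as one record per event. The following is a minimal sketch; the class name, field names, and the JSON-lines format are assumptions chosen for illustration, not a format the guide prescribes. Counter slope is not stored directly, since it can be derived from successive snapshots.

```python
import json
import time
from dataclasses import dataclass, asdict, field
from typing import Dict, List, Optional

@dataclass
class PortSnapshot:
    port: int
    event: str                           # e.g. "boot", "link_up", "link_down", "error_burst"
    mode: str                            # MAC interface mode as configured
    negotiated_rate_mbps: int
    partner_ability: str                 # raw or decoded partner ability snapshot
    an_complete: bool
    remote_fault: bool
    training_done: Optional[bool]        # None when the PHY has no training step
    pll_locked: bool
    cdr_locked: bool
    last_lock_toggle_s: Optional[float]  # timestamp of the last lock transition
    error_counters: Dict[str, int]       # e.g. {"pcs": 12, "pma": 0}
    temperature_c: float
    power_events: List[str]              # e.g. ["brownout"], empty when none
    lpi_enter: Optional[int] = None      # only meaningful with EEE enabled
    lpi_exit: Optional[int] = None
    timestamp_s: float = field(default_factory=time.time)

def to_log_line(snapshot: PortSnapshot) -> str:
    """Serialize one snapshot as a single JSON line for an append-only log."""
    return json.dumps(asdict(snapshot), sort_keys=True)
```

One line per event keeps the log greppable and lets slope be computed offline from any two snapshots of the same port.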

Diagram: Bring-up state flow (deterministic steps + triage branches)

The bring-up flow enables controlled escalation: verify deterministic configuration and clock readiness, prove link establishment evidence (AN/training), then validate stability using counter slopes and lock events before suspecting higher layers.
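The S0 to S6 ordering can be enforced in software by evaluating stage checks in sequence and stopping at the first failure. This is a sketch only: the stage identifiers and the idea of passing one boolean check per stage are assumptions, and each check would wrap device-specific register reads.

```python
from typing import Callable, Dict, Optional

# Stage order follows the S0..S6 steps described in this section.
STAGES = (
    "S0_strap", "S1_refclk", "S2_reset", "S3_an",
    "S4_training", "S5_link_up", "S6_monitor",
)

def first_failing_stage(checks: Dict[str, Callable[[], bool]]) -> Optional[str]:
    """Run checks in bring-up order; return the first stage whose check fails.

    Stages without a check are skipped (for example S4 on a PHY with no
    training step). Later stages are not evaluated once a stage fails.
    """
    for stage in STAGES:
        check = checks.get(stage)
        if check is None:
            continue
        if not check():
            return stage
    return None

if __name__ == "__main__":
    demo = {
        "S0_strap": lambda: True,
        "S1_refclk": lambda: True,
        "S2_reset": lambda: True,
        "S3_an": lambda: False,     # pretend AN never completes
        "S5_link_up": lambda: True,
    }
    print(first_failing_stage(demo))
```

Stopping at the first failure mirrors the rule that AN evidence is only meaningful once S0 to S2 are clean.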

Clocking & Jitter Budget

“Low-jitter clock” must be treated as a measurable input with a budget, isolation plan, and pass criteria; otherwise, high-rate stability becomes non-repeatable across boards, temperature, and port activity.

Clock path map (where jitter enters)

Ref clock → PLL/CDR: reference quality and coupling decide lock margin and sampling noise at high data rates.
Power/ground modulation: supply ripple and return-path discontinuities convert into phase modulation inside the clock path.
Activity injection: neighbor port toggling and switching noise can correlate with bursty error clusters.

Jitter contributors (budget buckets)

Refclk: source phase noise and coupling along the route.
PLL/CDR transfer: internal multiplication/filtering adds residual jitter.
Injected: supply/ground noise and EMI coupling modulate clock/SerDes.
Coupling: cross-port and high-speed adjacency shifts effective sampling margin.
Reserve: margin held for worst-case temperature, aging, and component spread.

Budget card (template, thresholds as X placeholders)

Total jitter limit: X (target at highest intended rate)
Allocate: Refclk X1 + PLL X2 + Injected X3 + Coupling X4 + Reserve Xr
Measure proxies: lock stability, counter slope, and PRBS BER under controlled stress
Isolation plan: swap refclk source, change supply noise, vary neighbor activity (one variable at a time)
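The allocation line can be checked numerically. The sketch below reports two totals: the straight sum the card uses, which is the conservative reading, and a root-sum-square total, which is a common alternative when contributors are assumed to be uncorrelated and random. Which combination applies is a design assumption that depends on the jitter types involved; the units and numbers here are placeholders.

```python
import math
from typing import Dict

def budget_report(limit: float, contributors: Dict[str, float], reserve: float) -> dict:
    """Compare allocated jitter against the total limit.

    contributors: bucket name -> allocation (same unit as limit).
    reserve is added linearly in both cases so it is never 'averaged away'.
    """
    linear_total = sum(contributors.values()) + reserve
    rss_total = math.sqrt(sum(v * v for v in contributors.values())) + reserve
    return {
        "linear_total": linear_total,
        "rss_total": rss_total,
        "linear_ok": linear_total <= limit,
        "rss_ok": rss_total <= limit,
    }

if __name__ == "__main__":
    # Placeholder allocations in arbitrary units.
    alloc = {"refclk": 0.3, "pll": 0.4, "injected": 0.2, "coupling": 0.1}
    print(budget_report(limit=1.0, contributors=alloc, reserve=0.2))
```

A budget that passes only under the RSS assumption is worth flagging, because injected and coupled contributors are often correlated with each other through shared supply or ground paths.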

Acceptance card (template, pass criteria as X placeholders)

Lock stability: PLL/CDR lock does not toggle over Y hours at worst-case temperature
Error slope: PCS/PMA counters ≤ X per minute at target rate under burst load
Correlation: error clusters do not track neighbor activity after isolation measures
Evidence archive: logs + counter snapshots + test recipe identifiers

Debug hooks (measurements that enable correlation)

CLKOUT / recovered clock (if available): enables time correlation with error clusters and lock events.
Lock indicators: PLL/CDR lock state and transition timestamps.
Counters as sensors: error counter slope is the primary proxy for margin when lab gear is absent.
Controlled stress knobs: refclk source selection, supply noise profiles, neighbor port activity patterns.
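Using counters as sensors requires a consistent slope calculation. The sketch below assumes a free-running counter of known width that wraps around; many PHY counters are instead clear-on-read or saturating, in which case the delta logic must change. Names and the 16-bit default are illustrative assumptions.

```python
from typing import List, Tuple

def counter_delta(prev: int, curr: int, width_bits: int = 16) -> int:
    """Increment between two raw reads of a wrapping counter."""
    return (curr - prev) % (1 << width_bits)

def slope_per_minute(samples: List[Tuple[float, int]], width_bits: int = 16) -> float:
    """Average error slope over (timestamp_seconds, raw_counter) samples.

    Samples must be in ascending time order and polled often enough that
    the counter cannot wrap more than once between reads.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    elapsed = samples[-1][0] - samples[0][0]
    if elapsed <= 0:
        raise ValueError("timestamps must be increasing")
    total = sum(
        counter_delta(c0, c1, width_bits)
        for (_, c0), (_, c1) in zip(samples, samples[1:])
    )
    return 60.0 * total / elapsed

if __name__ == "__main__":
    # Counter wraps between the first two reads.
    print(slope_per_minute([(0.0, 65530), (30.0, 4), (60.0, 10)]))
```

Comparing this slope across one stress knob at a time (refclk source, supply noise profile, neighbor activity) is what turns the counter into a margin proxy.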

Diagram: Jitter budget funnel (contributors → sampling margin → BER/CRC)

The funnel turns clock quality into an actionable budget: allocate contributors, preserve a reserve, and validate with lock stability plus counter-slope evidence. If errors track a specific stress knob, the dominant contributor is identified without ambiguous “clock is bad” claims.

Scope guard

This section budgets PHY refclk/PLL/CDR contributions for link stability. SyncE templates, system timing topology, and PTP synchronization are handled on the Timing & Sync pages.

Channel Margin & SI: What Kills BER on Real Boards

If a link is clean on the bench but fragile in a product, the dominant cause is usually channel margin loss from board-level discontinuities and parasitics. This section converts “SI suspicion” into a minimal, repeatable isolation checklist using counters, PRBS/loopback, and A/B stress.

Scope guard

This section is board-level only: routing, vias, reference planes, connector/magnetics parasitics, and how they translate into BER via PHY evidence. Cable standards, PoE/PoDL design, and magnetics theory are handled on their dedicated pages.

Evidence chain (avoid “guessing SI”)

Prove PHY baseline: local loopback + PRBS (if supported) to separate PHY vs channel.
Find the cliff: rate stepping to identify the first unstable rate (10M → 10G).
Correlate: error counter slope vs temperature, supply noise, and neighbor-port activity.
Pin the killer: map symptoms to one of the six channel killers and apply board-level fixes.

Killer 1 · Insertion Loss / ISI (frequency-dependent attenuation)

Symptoms

Stable at low rate but fails at higher rates; BER grows rapidly after a rate step; margin worsens with temperature.

Quick check

Rate stepping: identify the first unstable rate
PRBS + error counter slope under constant load
A/B compare shorter vs longer internal routes (if available)

Fix

Shorten high-speed segments; remove unnecessary vias
Keep differential impedance consistent; avoid stubs and test pads on the main path
Maintain continuous reference planes to preserve return current

Pass criteria

At target rate and worst-case conditions, PRBS / PCS-PMA error slope ≤ X per minute over window Y; no rate-step cliff within the supported ladder.
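The rate-stepping check and the "no rate-step cliff" criterion can be expressed as a scan over the speed ladder. This sketch assumes the measured counter slope per rate is already available and that the threshold X is supplied by the project; the ladder constant simply lists the rates named in this guide.

```python
from typing import Dict, Optional

RATE_LADDER_MBPS = (10, 100, 1000, 2500, 5000, 10000)

def first_unstable_rate(slope_by_rate: Dict[int, float], max_slope: float) -> Optional[int]:
    """Return the lowest rate whose error slope exceeds max_slope, or None.

    Rates that were not measured are skipped rather than treated as passing
    or failing.
    """
    for rate in RATE_LADDER_MBPS:
        slope = slope_by_rate.get(rate)
        if slope is not None and slope > max_slope:
            return rate
    return None

if __name__ == "__main__":
    measured = {1000: 0.0, 2500: 0.1, 5000: 4.2, 10000: 55.0}
    print(first_unstable_rate(measured, max_slope=1.0))
```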

Killer 2 · Return Loss / Reflections (discontinuities)

Symptoms

Burst CRC errors under specific traffic patterns; link “looks up” but errors cluster; stability depends on a narrow set of presets/training outcomes.

Quick check

Preset A/B with PRBS: does one preset eliminate bursts?
Compare builds with/without a specific connector/test pad element
Check “training done” vs “retraining events” (if exposed)

Fix

Remove impedance steps: pad/via transitions, stubbed vias, and inline test points
Keep differential pair geometry consistent through transitions
Avoid reference plane cuts near the discontinuity region

Pass criteria

Error bursts disappear under worst-case traffic; retraining events ≤ X per hour; counter slope stays below X over window Y.

Killer 3 · Crosstalk (neighbor activity injection)

Symptoms

BER increases when adjacent ports become active; multi-port throughput triggers CRC clusters; single-port tests look clean.

Quick check

Neighbor activity A/B: run the same port with neighbors off vs saturated
Correlation test: counter slope vs port utilization
Thermal A/B: higher temperature often reduces crosstalk tolerance

Fix

Increase spacing and manage layer-to-layer pair adjacency
Avoid long parallel runs; introduce separation and routing discipline
Strengthen return planes and reduce shared impedance in grounds

Pass criteria

Under defined neighbor stress (pattern + load), counter slope ≤ X; correlation between errors and neighbor activity becomes statistically weak.
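"Statistically weak" can be given a number by correlating neighbor utilization with the victim port's error slope. The sketch below computes a Pearson coefficient from paired samples; the choice of Pearson and the threshold against which it is judged are assumptions, and a small sample set will give a noisy estimate.

```python
import math
from typing import Sequence

def pearson_r(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation between two equal-length series.

    Returns 0.0 when either series has zero variance (for example, the
    error slope never moves), since no linear relationship is observable.
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)

if __name__ == "__main__":
    neighbor_util_pct = [0, 25, 50, 75, 100]
    victim_slope = [1, 3, 5, 7, 9]     # errors per minute at each step
    print(round(pearson_r(neighbor_util_pct, victim_slope), 3))
```

Running the same calculation before and after a layout or isolation change gives a before/after figure for the crosstalk branch using only counters.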

Killer 4 · Mode Conversion (differential → common-mode imbalance)

Symptoms

EMI sensitivity rises together with BER; behavior changes with chassis/shield configuration; swapping “similar parts” changes stability unexpectedly.

Quick check

Imbalance review: intra-pair length mismatch (ΔL) and asymmetric via/pad structures on the pair
A/B test with controlled common-mode disturbance (trend-based)
Check whether error clusters align with EMI events or ESD exposure logs

Fix

Enforce symmetry: routing, vias, pads, and placement around the pair
Keep return paths balanced and continuous (avoid asymmetric plane cuts)
Minimize unmatched parasitics near connector/magnetics transitions

Pass criteria

BER no longer tracks common-mode stress trends; counter slope remains ≤ X across defined EMI/noise operating scenarios.

Killer 5 · Reference Plane Cuts (broken return path)

Symptoms

A short route still fails; issues cluster near a split plane, connector keep-out, or a high-noise zone; stability changes with load transients.

Quick check

Geometry audit: does the pair cross any plane split, gap, or moat?
Hotspot audit: proximity to DC/DC, clocks, and high di/dt returns
A/B test: load step vs counter slope correlation

Fix

Restore a continuous reference plane under the pair
Bridge return paths when crossing domains (stitching/return bridges)
Move sensitive routing away from high-noise return regions

Pass criteria

Error slope becomes insensitive to load transients; worst-case counter slope ≤ X within window Y across defined operating modes.

Killer 6 · Connector / Magnetics Parasitics (board-edge discontinuity)

Symptoms

Different connectors/magnetics builds shift BER; insertion/removal or mechanical state changes error clusters; the board works without the final front-end stack.

Quick check

A/B BOM: compare vendor or footprint variants under the same PRBS recipe
Placement A/B: compare “PHY-to-front-end length” across revisions
Trend check: do errors follow mechanical state changes?

Fix

Place the front-end close to the PHY; keep transitions short and symmetric
Reduce discontinuities at board edge: pads, vias, and geometry transitions
Ensure consistent return paths around the connector/magnetics region

Pass criteria

BOM variants converge within X BER/counter limits; mechanical state no longer triggers bursts; sustained slope ≤ X over window Y.

Minimal SI check set (field-friendly)

Baseline evidence: record PCS/PMA error counter slope at a fixed load.
Self-proof: local loopback + PRBS/BERT (if supported) to isolate PHY vs channel.
Rate cliff: step through the speed ladder; note the first unstable rate.
Neighbor stress: compare neighbors OFF vs saturated; check correlation.
Thermal/power A/B: test the same recipe at temperature extremes and with supply stress.
Geometry audit: vias/stubs/test pads; pair symmetry; plane cuts; edge transitions.
Revision/BOM A/B: connector/magnetics variants under identical PRBS recipes.

Binary isolation flow (fastest path to the dominant killer)

PHY self-proof: loopback/PRBS fails → return to bring-up and clocking (see Rate Bring-up Flow and Clocking & Jitter Budget).
Find the cliff: first unstable rate defines the margin regime.
Neighbor A/B: strong correlation → crosstalk / injected noise branch.
Thermal/power A/B: strong sensitivity → thin margin branch.
Map to a killer: apply the matching card fix and re-run the same recipe.
Lock pass criteria: counter slope ≤ X within window Y across defined stress cases.
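The isolation flow above reduces to a short decision function once the evidence has been collected. This is a sketch under the assumption that each A/B outcome has already been reduced to a boolean; the branch labels are illustrative and the function does not replace the per-killer cards.

```python
from typing import Optional

def next_branch(
    loopback_prbs_ok: bool,
    first_unstable_rate_mbps: Optional[int],
    neighbor_correlated: bool,
    thermal_power_sensitive: bool,
) -> dict:
    """Pick the next investigation branch from collected evidence."""
    if not loopback_prbs_ok:
        # PHY self-proof failed: channel work would be premature.
        branch = "bring-up and clocking"
    elif neighbor_correlated:
        branch = "crosstalk / injected noise"
    elif thermal_power_sensitive:
        branch = "thin margin"
    else:
        branch = "map to a killer card and re-run the same recipe"
    return {"branch": branch, "margin_regime_mbps": first_unstable_rate_mbps}

if __name__ == "__main__":
    print(next_branch(True, 5000, neighbor_correlated=True, thermal_power_sensitive=False))
```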

Diagram: SI failure map (symptoms ↔ physical causes ↔ fastest checks)

Use the matrix to pick a dominant physical branch quickly. Apply a fix, then re-run the same PRBS/counter recipe to confirm margin recovery.

Adaptive EQ & Polarity / Pair Swap

Adaptive EQ improves robustness against smooth channel loss and ISI, but it is not a substitute for correct layout. Reflections, crosstalk, and mode conversion can push EQ into “thin-margin stability” where training completes but BER remains fragile.

EQ blocks (concept-level mapping)

TX FIR: pre-emphasis for predictable loss/ISI; helps when attenuation is smooth and consistent.
RX CTLE: high-frequency boost to counter frequency roll-off; complements TX FIR.
DFE: cancels post-cursor ISI; effective when ISI is dominant and stable.
AGC: normalizes amplitude; does not correct reflections or timing uncertainty.

Capability vs Limit (keep the boundary explicit)

Capability

Recovers margin against smooth insertion loss and stable ISI
Improves tolerance to process/temperature spread when the channel model is consistent
Helps hit rate ladders when layout is already “correct-by-construction”

Limit

Cannot “fix” strong reflections from discontinuities; may converge but remain burst-prone
Cannot cancel crosstalk injected by neighbor activity; margin becomes load-dependent
Cannot undo mode conversion; imbalance can couple EMI into BER

Configuration strategy (avoid “EQ as magic”)

Default adaptive mode

Best when channel variation is expected (temperature, assembly, port mix)
Use when robustness is prioritized over strict determinism
Requires monitoring: retrain events, tap saturation, and counter slopes

Fixed preset mode

Best when channel is controlled and repeatable (production determinism)
Use to prevent “thin-margin stability” caused by misleading adaptation
Validate with a small preset scan (2–3 options) and lock the winner

A/B validation recipe (minimal, repeatable)

Fix rate + channel + load. Record baseline counter slope.
Mode A: adaptive EQ. Record training time, retrain count, and slope.
Mode B: fixed preset(s). Record the same fields under the same load.
Stress: neighbor activity and temperature A/B. Repeat measurements.
Choose the configuration that keeps slope ≤ X and retrain ≤ X per hour.
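The selection step of the recipe can be automated once each mode has been measured under every stress case. In this sketch a configuration passes only if its worst-case slope and worst-case retrain rate both stay within limits; breaking ties by lowest worst-case slope is an assumption, as are the dictionary keys.

```python
from typing import Dict, List, Optional

def pick_eq_config(
    results: Dict[str, List[dict]],
    max_slope: float,
    max_retrain_per_hour: float,
) -> Optional[str]:
    """Choose an EQ configuration from A/B runs.

    results maps a config name (e.g. "adaptive", "preset_2") to a list of
    runs, one per stress case, each with "slope" and "retrain_per_hour".
    Returns None when no configuration meets both limits in every run.
    """
    passing = {}
    for name, runs in results.items():
        if not runs:
            continue                      # unmeasured configs cannot win
        worst_slope = max(r["slope"] for r in runs)
        worst_retrain = max(r["retrain_per_hour"] for r in runs)
        if worst_slope <= max_slope and worst_retrain <= max_retrain_per_hour:
            passing[name] = (worst_slope, worst_retrain)
    if not passing:
        return None
    return min(passing, key=lambda n: passing[n])

if __name__ == "__main__":
    runs = {
        "adaptive": [{"slope": 0.2, "retrain_per_hour": 3}, {"slope": 0.9, "retrain_per_hour": 6}],
        "preset_1": [{"slope": 0.4, "retrain_per_hour": 0}, {"slope": 0.5, "retrain_per_hour": 0}],
    }
    print(pick_eq_config(runs, max_slope=1.0, max_retrain_per_hour=2))
```

Judging on worst case rather than average is what guards against the "thin-margin stability" this section warns about.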

Polarity / Pair swap (bring-up safety net, not a cure)

Polarity swap: corrects differential inversion when wiring/placement flips the pair.
Pair swap / lane mapping: helps recover from routing order mistakes but can mask deeper SI issues if used blindly.
Validation rule: after enabling swap, re-run PRBS and confirm counter slope does not worsen under neighbor stress.
Production rule: lock mapping deterministically; log mapping state for field forensics.

Diagram: EQ capability boundary (impairments → blocks → outcomes)

Treat EQ as a bounded compensator: if stability depends on a narrow preset or collapses with neighbor stress, the channel discontinuity branch should be fixed at the board level.

EEE (802.3az) Power Save

EEE is a frequent root cause of “rare flaps” and latency spikes in industrial deployments because LPI entry/exit interacts with partner behavior, rate ladders, and bursty traffic. This section turns EEE into an executable decision-and-validation workflow.

Scope guard

This section focuses on EEE engineering behavior: negotiation consistency, LPI entry/exit evidence, exit-latency budget, and validation matrices. TSN time windows and PTP scheduling are handled on their dedicated pages.

Engineering model (what matters for stability)

LPI entry: idle detection triggers the PHY to enter Low Power Idle.
LPI maintain: the link stays “electrically quiet” while the partner expects compliant signaling.
LPI exit: wake-up takes time; exit latency consumes system budget and can create burst loss or jitter.
Stability risk: bursty traffic causes frequent sleep/wake cycles; partner implementations differ by vendor/firmware.

Pitfall 1 · Partner negotiation mismatch (EEE ability resolution)

Symptoms

One switch works while another flaps; behavior changes after a partner firmware update; issues are partner-model specific.

Quick check

Log local + partner EEE advertised abilities and the resolved EEE mode
A/B: EEE OFF vs ON under the same traffic recipe
Track link flap count while keeping rate fixed

Fix

Enforce a consistent policy: disable EEE by default unless partner is known-good
Whitelist partner models/firmware where EEE is validated
Lock rate modes where partner resolution is stable

Pass criteria

Across the defined partner matrix, flap rate ≤ X per hour and resolved EEE mode remains consistent for window Y.
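
The quick check above can be sketched as a small comparison helper. This is a minimal illustration, assuming the advertised abilities are already available as per-rate bitmasks and that the resolved mode is the intersection of both advertisements. The `EeeAbility` bit names are hypothetical labels, not register layouts from any PHY.

```python
from enum import IntFlag


class EeeAbility(IntFlag):
    """Hypothetical per-rate EEE ability bits (labels only, not register layouts)."""
    NONE = 0
    EEE_100M = 1
    EEE_1G = 2
    EEE_2G5 = 4
    EEE_5G = 8
    EEE_10G = 16


def resolve_eee(local: EeeAbility, partner: EeeAbility) -> EeeAbility:
    """EEE is usable only at rates that both sides advertise (intersection)."""
    return local & partner


def resolution_is_consistent(snapshots: list) -> bool:
    """True when every resolved-mode snapshot taken in the window is identical."""
    return len(set(snapshots)) <= 1
```

Logging the output of `resolve_eee` next to the mode the PHY actually reports makes a partner-specific mismatch visible before any flap is counted.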

Pitfall 2 · Exit latency exceeds budget (rate-dependent)

Symptoms

Latency spikes (p99/p999) without obvious BER rise; burst drops during wake; issues appear only at specific speeds.

Quick check

A/B: EEE OFF vs ON; compare latency distribution (p99/p999)
Record LPI enter/exit counts and correlate with spikes
Repeat per-rate to find the first rate where budget breaks

Fix

Disable EEE for critical ports or for the offending rate(s)
Adjust EEE thresholds/policies to reduce frequent sleep/wake cycles
Prefer deterministic behavior over marginal power savings in control loops

Pass criteria

Under the defined burst/idle traffic model, p99 latency ≤ X and p999 latency ≤ X; no wake-related drop bursts in window Y.
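
The A/B latency comparison can be sketched as follows. This is an illustrative helper, assuming latency samples (in microseconds) were captured under the same traffic recipe with EEE OFF and ON. The function names and the nearest-rank percentile method are choices made for this sketch, not something the guide prescribes.

```python
import math


def percentile(samples: list, pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of data at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100.0))
    return ordered[rank - 1]


def eee_tail_penalty(off_us: list, on_us: list) -> dict:
    """Difference in p99 and p99.9 latency between the EEE ON and EEE OFF runs."""
    return {
        "p99_delta_us": percentile(on_us, 99.0) - percentile(off_us, 99.0),
        "p999_delta_us": percentile(on_us, 99.9) - percentile(off_us, 99.9),
    }
```

A tail delta that appears only with EEE ON, while the median stays flat, points at wake latency rather than channel margin.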

Pitfall 3 · Metric accounting artifact (LPI time misread as “idle”)

Symptoms

Monitoring shows low utilization, but users observe stalls; “idle ratio” disagrees with packet captures; false alarms appear during power-save.

Quick check

Separate “LPI duration” from “true idle time” in counters/logs
Align sampling window and denominators across tools
Re-run the same traffic replay with EEE OFF to compare reporting

Fix

Standardize metric definitions: window length, denominators, and “idle” semantics
Log LPI enter/exit counts to contextualize utilization metrics
Use counter slope + packet evidence as primary stability indicators

Pass criteria

For the same traffic replay, monitoring agrees with packet evidence within error ≤ X; no EEE-driven false alarms in window Y.
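
A minimal sketch of the accounting split, assuming the monitoring stack can report busy time and LPI time for each sampling window. The field names are hypothetical; the point is that all three ratios share one denominator.

```python
from dataclasses import dataclass


@dataclass
class WindowSample:
    window_s: float  # sampling window length in seconds
    busy_s: float    # time spent carrying frames
    lpi_s: float     # time spent in Low Power Idle


def idle_breakdown(sample: WindowSample) -> dict:
    """Split the window into utilization, LPI ratio, and true idle ratio."""
    true_idle_s = sample.window_s - sample.busy_s - sample.lpi_s
    if sample.window_s <= 0 or true_idle_s < 0:
        raise ValueError("inconsistent window accounting")
    return {
        "utilization": sample.busy_s / sample.window_s,
        "lpi_ratio": sample.lpi_s / sample.window_s,
        "true_idle_ratio": true_idle_s / sample.window_s,
    }
```

Reporting `lpi_ratio` separately keeps a port that sleeps often from being read as a port with spare capacity.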

Pitfall 4 · Thin-margin wake under temperature / power droop

Symptoms

Flaps increase at hot/cold corners; failures correlate with load steps or brownout; the problem looks like an SI issue until EEE is disabled.

Quick check

Correlate LPI exit events with voltage/temperature alarms (if available)
Repeat A/B across temperature and controlled supply droops
Check whether wake events trigger error counter bursts

Fix

Disable EEE in harsh corners or for ports with tight stability requirements
Improve supply margin and reset/strap stability across wake cycles
Enable a fallback policy: auto-disable EEE on repeated wake-related faults

Pass criteria

Across temperature/power stress matrix, wake-related fault events ≤ X and flap rate ≤ X per hour over window Y.
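
The fallback policy in the fix list can be sketched as a rolling-window counter. This is an assumed design, not a vendor mechanism: `max_faults` and `window_s` stand in for the X and Y placeholders, and the disable is latched so the port does not oscillate between policies.

```python
from collections import deque


class EeeFallback:
    """Disable EEE after `max_faults` wake-related faults within `window_s` seconds."""

    def __init__(self, max_faults: int, window_s: float) -> None:
        self.max_faults = max_faults
        self.window_s = window_s
        self.eee_enabled = True
        self._faults = deque()

    def on_wake_fault(self, now_s: float) -> bool:
        """Record one fault and return the resulting EEE enable state."""
        self._faults.append(now_s)
        # Drop faults that have aged out of the rolling window.
        while self._faults and now_s - self._faults[0] > self.window_s:
            self._faults.popleft()
        if len(self._faults) >= self.max_faults:
            self.eee_enabled = False  # latched until explicitly re-enabled
        return self.eee_enabled
```

Each transition to disabled should also be logged with the fault timestamps so the decision can be audited later.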

Validation recipe (repeatable)

Fix the rate and partner device; record baseline counters and latency distribution.
Mode A: EEE OFF. Run three traffic models: steady load, burst/idle, periodic idle.
Mode B: EEE ON. Repeat the same traffic models with identical sampling windows.
Apply stress: temperature corners and controlled supply droop (if applicable).
Log: LPI enter/exit, flap count, error counter slope, p99/p999 latency.
Decision: enable only if stability remains within X under the entire test matrix.

Partner + rate matrix (why “one lab pass” is not enough)

Partners: switch models/firmware, gateway uplink ports, and mixed-vendor environments.
Rates: test each supported rate independently (1G/2.5G/5G/10G as applicable).
Traffic: steady + bursty + periodic idle; measure counter slope and latency tails.
Outcome: enable EEE only for the validated subset (partners + rates + policies).
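
The matrix and its outcome can be sketched as below. This is illustrative only: the metric keys `flaps_per_hour` and `p99_us` and the threshold arguments are assumptions standing in for the X placeholders.

```python
from itertools import product


def build_matrix(partners, rates, traffic_models):
    """Every (partner, rate, traffic, eee_on) case that must be run."""
    return list(product(partners, rates, traffic_models, (False, True)))


def validated_subset(results, max_flaps_per_hour, max_p99_us):
    """Return (partner, rate) pairs where every EEE-ON case stays within thresholds."""
    verdict = {}
    for (partner, rate, _traffic, eee_on), metrics in results.items():
        if not eee_on:
            continue  # EEE-OFF runs are the baseline, not a gate
        ok = (metrics["flaps_per_hour"] <= max_flaps_per_hour
              and metrics["p99_us"] <= max_p99_us)
        key = (partner, rate)
        verdict[key] = verdict.get(key, True) and ok
    return {key for key, ok in verdict.items() if ok}
```

One failing traffic model is enough to exclude a partner and rate pair, which matches the rule that EEE is enabled only for the validated subset.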

Diagram: EEE enable decision flow

Enable EEE only when the validated partner+rate subset stays within X thresholds under burst/idle traffic and harsh-corner stress.

MDIO/MMD, Straps & Firmware Hooks

Observability determines whether field issues can be reproduced and fixed. This section defines a minimum debug set: resolved abilities, AN/training state, EQ visibility, counter slope evidence, power-save events, and a black-box log schema.

Scope guard

This is not an MDIO standard tutorial and does not enumerate vendor register addresses. The goal is an engineering minimum set that must be exposed through firmware and logs. Remote management protocols (LLDP/NETCONF) are handled elsewhere.

Engineering view: Clause 22 vs Clause 45 / MMD

Clause 22: baseline access for core link status and basic counters.
Clause 45 / MMD: extended device model required for training/EQ visibility, deeper error evidence, and feature states (EEE, sensors).
Rule: expose the minimum set through a stable firmware API and persist snapshots for field forensics.

Minimum set #1 · Link & Resolved Mode

Link up/down + sticky link-change flag
Resolved speed / duplex / master-slave (if applicable)
Local advertised abilities (rate + features)
Partner advertised abilities (rate + features)
Resolved mode snapshot (the “final answer”)
Remote fault / local fault indicators

Why it matters: negotiation evidence prevents misdiagnosing partner incompatibility as SI.
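
As one concrete case of a sticky flag: in Clause 22 the link status bit of the basic status register (register 1, bit 2) latches low, so a single read can report a drop that has already recovered. The sketch below reads twice to separate "dropped since last read" from "up now". `read_reg` is a hypothetical callable standing in for whatever MDIO accessor the platform provides.

```python
BMSR = 1                   # Clause 22 basic status register
BMSR_LINK_STATUS = 1 << 2  # latches low until read


def read_link(read_reg):
    """Return (link_up_now, link_was_down_since_last_read)."""
    first = read_reg(BMSR)   # latched value: bit is 0 if link dropped since the previous read
    second = read_reg(BMSR)  # current value once the latch has been cleared
    up_now = bool(second & BMSR_LINK_STATUS)
    was_down = not (first & BMSR_LINK_STATUS)
    return up_now, was_down
```

Firmware that exposes only the second read loses the evidence of short drops, which is why the minimum set asks for the sticky flag as well.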

Minimum set #2 · AN / Training Status

AN enable + AN complete
AN failure reason (if provided) + retry counter
Training start/done/fail (if supported)
Retrain count + last retrain cause (if supported)
Time-to-link (from reset to link-up) snapshot

Why it matters: separates “can’t negotiate” from “negotiates but has thin margin.”

Minimum set #3 · EQ Visibility

EQ mode: adaptive vs fixed preset
Selected preset/profile ID (if present)
Tap saturation / limit flags (CTLE/FIR/DFE)
Equalization “converged” flag (if present)
Polarity / pair swap state (as seen by PHY)

Why it matters: a “tap at the limit” signature indicates channel fixes are required.

Minimum set #4 · Error Evidence (use slopes, not single reads)

PCS-layer error counters (code/symbol/blocks as applicable)
PMA/PMD-layer error counters (if exposed)
MAC CRC/FCS error counters (if available in the stack)
Counter snapshot + sampling window duration
Computed counter slope (errors per minute / per Gbps)

Why it matters: slope evidence distinguishes burst issues from steady margin loss.
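
A minimal sketch of the slope computation, assuming free-running (not clear-on-read) counters of a known bit width, with at most one wrap per window. Normalizing per Gbps of line rate is one reading of "per Gbps" above, and the function names are illustrative.

```python
def counter_delta(start: int, end: int, width_bits: int = 32) -> int:
    """Difference between two reads of a free-running counter, allowing one wrap."""
    return (end - start) % (1 << width_bits)


def error_slopes(start: int, end: int, window_s: float,
                 rate_gbps: float, width_bits: int = 32) -> dict:
    """Errors per minute, and errors per minute normalized by line rate."""
    if window_s <= 0 or rate_gbps <= 0:
        raise ValueError("window and rate must be positive")
    errors = counter_delta(start, end, width_bits)
    per_minute = errors * 60.0 / window_s
    return {
        "errors": errors,
        "per_minute": per_minute,
        "per_minute_per_gbps": per_minute / rate_gbps,
    }
```

For clear-on-read counters the delta is simply the value read at the end of the window, so the wrap handling should be skipped.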

Minimum set #5 · EEE / Power Events (aligns with the EEE section)

EEE enable + resolved EEE mode
LPI enter/exit counters
Wake failure / wake retry flags (if provided)
EEE-related interrupt/event flags

Why it matters: correlates latency spikes and flaps with LPI transitions.

Minimum set #6 · Interrupts, Sensors & Sticky Faults

Interrupt cause register (must be readable and logged)
Sticky fault bits (remote fault, training fail, wake fail)
Temperature snapshot (if supported)
Voltage/brownout snapshot (if supported)
Flap counter (rolling window)

Why it matters: captures the “moment of failure” for field forensics.

Firmware event hooks (capture the critical transitions)

Link down/up, speed change, and duplex change events
AN complete / AN fail / remote fault events
Training start/done/fail, retrain cause events (if supported)
EEE enter/exit storms and wake failure events
Error spike trigger: slope exceeds threshold X within window Y
All events must include timestamps and a snapshot of the minimum debug set

Black-box log schema (field maintainability)

Core fields

timestamp, port_id/phy_id, speed/duplex, partner_ability_hash, resolved_mode

State evidence

AN_state, training_state, eq_mode/preset, polarity/pair_swap_state

EEE evidence

eee_enabled/resolved, lpi_enter_count, lpi_exit_count, wake_fail_flags

Error evidence

counters_snapshot, window_ms, counter_slope, flap_count_rolling

Health snapshot

temperature, voltage/brownout (if available), interrupt_cause, sticky_faults

Sampling rule: periodic snapshots every T seconds plus event-driven snapshots on link changes, retrain events, and slope spikes.
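
The schema can be sketched as a single record type serialized to one line per snapshot. This is an assumed layout: the field names follow the lists above, while the types, defaults, and the `trigger` field are illustrative choices.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class PhySnapshot:
    timestamp: float
    port_id: int
    speed_mbps: int
    duplex: str
    partner_ability_hash: str
    resolved_mode: str
    an_state: str
    training_state: str
    eq_mode: str
    pair_swap_state: str
    eee_resolved: bool
    lpi_enter_count: int
    lpi_exit_count: int
    wake_fail_flags: int
    counters_snapshot: dict = field(default_factory=dict)
    window_ms: int = 0
    counter_slope: float = 0.0
    flap_count_rolling: int = 0
    temperature_c: Optional[float] = None  # None when the PHY has no sensor
    interrupt_cause: int = 0
    sticky_faults: int = 0
    trigger: str = "periodic"  # or "link_change", "retrain", "slope_spike"

    def to_log_line(self) -> str:
        """Serialize to one compact JSON line with a stable key order."""
        return json.dumps(asdict(self), sort_keys=True, separators=(",", ":"))
```

A stable key order keeps lines comparable across firmware revisions and easy to diff during field forensics.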

Diagram: Observability stack (Counters → Events → Logs)

The minimum set is designed for reproducibility: if a failure cannot be expressed as mode + state + slope + event, it will be hard to debug in the field.

Test & Bring-up: Loopback / PRBS / BERT

A bring-up test is only useful when it is reproducible and isolating. This section defines a fixed ladder: segment with loopbacks, quantify with PRBS/BER slopes, then stress with throughput and corner conditions.

Scope guard

Focus is on what each loopback isolates, how to run PRBS for evidence, and how to build a golden sequence. This section does not cover specific ATE brands or scripting details.

Test layer map (how each tool isolates)

Loopback: removes the channel/partner and collapses the problem to a segment.
PRBS: removes payload semantics; evidence is BER or error slope over window Y.
BERT / Throughput stress: verifies stability under realistic load and buffering behavior.
Golden sequence: a fixed, rate-by-rate ladder with snapshots and pass criteria X for bring-up and production transfer.

Recipe · PCS loopback (digital-side isolation)

Purpose

Isolate MAC/PCS behavior from PMA/channel effects. A fail here indicates a digital-interface, configuration, or PCS-side issue.

Setup

Fix rate and interface mode; capture resolved mode snapshot
Disable external variables (EEE, partner features) for baseline
Define sampling window Y for counters

Steps

Enter PCS loopback
Transmit a known traffic stream or PRBS source (if supported at PCS)
Read counters at start/end; compute slope over window Y

Pass criteria

Error counter slope ≤ X per minute over window Y; no link-state anomalies during the run.

Recipe · PMA loopback (analog-side isolation)

Purpose

Validate PMA-side datapath and clock recovery behavior without requiring a full external channel. Useful for separating PCS vs PMA sensitivity.

Setup

Capture refclk lock and PHY internal lock status (if available)
Record EQ mode/preset used for the run
Define a fixed PRBS pattern and run duration

Steps

Enter PMA loopback
Enable PRBS generator/checker (or equivalent test mode)
Measure BER or compute counter slope over window Y

Pass criteria

BER ≤ X (or slope ≤ X) at the target rate, with stable lock indicators across duration Y.

Recipe · Remote loopback (end-to-end path)

Purpose

Validate the complete link including channel, connector effects, and partner behavior. This is the “system truth” before production transfer.

Setup

Record partner identity/version snapshot and resolved abilities
Fix traffic model (steady + burst/idle) and define window Y
Capture EQ and training status for correlation

Steps

Enable remote loopback (or partner-assisted loopback mode)
Run PRBS/BERT or controlled traffic replay
Apply stress: temperature corner and controlled supply droop (if applicable)
Log counters and slope evidence; note any retrain or flap events

Pass criteria

End-to-end BER/CRC slope ≤ X over window Y across the defined partner + rate matrix; no unexpected retrains.

PRBS pattern selection (engineering rule)

Quick screen: use a short pattern to detect gross margin loss fast.
Deep confidence: use a longer pattern to catch low-probability errors that appear only after long runs.
Production transfer: keep a stable pattern set so results are comparable across stations and shifts.
Rule: every BER conclusion must be tied to rate, partner snapshot, EQ mode, and sampling window.
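
For reference, a PRBS generator and an aligned checker can be sketched in software. The guide does not name specific patterns; PRBS7 (x^7 + x^6 + 1) as the short pattern and PRBS31 (x^31 + x^28 + 1) as the long one are assumptions here, chosen because they are common. The checker assumes the received stream is already aligned to the seed, which real hardware checkers do not require.

```python
def prbs_bits(order: int, tap: int, count: int, seed: int = 1):
    """Yield `count` bits from a Fibonacci LFSR for x^order + x^tap + 1."""
    mask = (1 << order) - 1
    state = seed & mask
    if state == 0:
        raise ValueError("seed must be non-zero")
    for _ in range(count):
        new_bit = ((state >> (order - 1)) ^ (state >> (tap - 1))) & 1
        state = ((state << 1) | new_bit) & mask
        yield new_bit


def count_bit_errors(received: list, order: int, tap: int, seed: int = 1) -> int:
    """Count mismatches against the expected sequence (stream assumed aligned)."""
    expected = prbs_bits(order, tap, len(received), seed)
    return sum(r != e for r, e in zip(received, expected))
```

PRBS7 repeats every 127 bits, so it screens quickly but exercises few run lengths; a longer pattern takes more time and stresses baseline wander and adaptation more thoroughly.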

BER accounting (use windowed slopes)

Window Y: define a fixed sampling window and log start/end snapshots.
Slope: compute errors per minute (or per Gbps) to avoid single-read noise.
Threshold X: BER ≤ X and/or error slope ≤ X within window Y.
Context: log temperature/voltage and EEE state to explain corner-only failures.
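
Window Y also bounds what a clean run can prove. Assuming independent bit errors and zero errors observed, claiming BER ≤ X at confidence CL needs roughly N ≥ -ln(1 - CL) / X bits. The helper names below are illustrative.

```python
import math


def bits_for_confidence(ber_limit: float, confidence: float = 0.95) -> int:
    """Error-free bits needed to claim BER <= ber_limit at the given confidence."""
    if not (0.0 < confidence < 1.0) or ber_limit <= 0.0:
        raise ValueError("confidence must be in (0, 1) and ber_limit positive")
    return math.ceil(-math.log(1.0 - confidence) / ber_limit)


def run_time_s(ber_limit: float, rate_gbps: float, confidence: float = 0.95) -> float:
    """Minimum error-free run time in seconds at the given line rate."""
    return bits_for_confidence(ber_limit, confidence) / (rate_gbps * 1e9)
```

For a 1e-12 target at 95% confidence this comes to about 3e12 bits, or roughly 300 seconds at 10 Gbps, which is why shortening production windows reduces the BER level that can actually be claimed.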

Golden test sequence (rate-by-rate ladder)

Phase 0: reset release, strap snapshot, refclk ready/lock snapshot
Phase 1: AN ability exchange; record resolved mode (rate/duplex/features)
Phase 2: PCS loopback baseline; compute error slope over window Y
Phase 3: PMA loopback baseline; run PRBS; verify BER ≤ X
Phase 4: remote loopback or full-through; validate end-to-end BER/CRC slope
Phase 5: throughput + burst/idle stress; capture latency tails if available
Phase 6: EEE A/B (if enabled); verify no flap or latency budget violation
Phase 7: temperature corner run; repeat Phase 4–5 minimal subset
Phase 8: generate a report pack: snapshots + counters + slopes + verdict

Production transfer tip: keep the same ladder and windows; only shorten duration where confidence remains above threshold X.
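
The ladder can be sketched as an ordered list of checks that stops at the first failure. This is a minimal illustration: each check is a hypothetical callable returning `(passed, evidence)`, and the phase names are placeholders.

```python
def run_ladder(phases):
    """Run (name, check) pairs in order; stop at the first failing phase."""
    report = []
    for name, check in phases:
        passed, evidence = check()
        report.append({"phase": name, "passed": passed, "evidence": evidence})
        if not passed:
            break  # later phases are not meaningful once an earlier one fails
    verdict = all(step["passed"] for step in report) and len(report) == len(phases)
    return verdict, report
```

Stopping early keeps the report honest: a failure in PCS loopback is never reported as a channel problem from a later phase.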

Diagram: Test ladder (from isolate → quantify → stress → sign-off)

The ladder is designed for station-to-station repeatability: each step produces a snapshot that either isolates a segment or quantifies margin.

Production & Reliability

Passing a lab demo is not reliability. Production-ready Ethernet PHY design requires baseline capture, corner screening strategy, and a field triage loop that turns symptoms into quantified margin evidence.

Scope guard

This section covers production transfer and reliability evidence. IEC standard details are handled on the protection page. No vendor-specific production equipment brands are discussed here.

Reliability variables (what shifts margin over time)

Temperature corners

Typical shift: timing and analog margin shrink; training becomes more frequent. Evidence: error slope rises only at hot/cold. Acceptance: slope ≤ X in window Y at the corner rate(s).

Supply noise / droop

Typical shift: PLL/CDR margin thins; wake/training instability appears. Evidence: lock/alarm correlation and burst error spikes. Acceptance: no droop-correlated retrain bursts; slope ≤ X.

Degradation after ESD/surge exposure

Typical shift: not a hard failure, but margin reduces; errors become “more likely” weeks later. Evidence: baseline vs current slope delta. Acceptance: performance remains within baseline ± X under the golden minimal subset.

Aging / mechanical stress

Typical shift: connectors and solder joints introduce intermittent impedance changes. Evidence: rare bursts tied to vibration/handling. Acceptance: burst rate ≤ X and no sustained slope increase across window Y.

Production coverage strategy (full test vs sampling)

Full test must catch: gross margin loss, configuration drift, and immediate instability signatures.
Sampling is suited for: long-duration patterns, deep corners, and extended confidence runs.
Transfer rule: reuse the golden ladder with fixed windows; shorten duration only when confidence stays above threshold X.
Record: resolved mode, error slope, time-to-link, and any retrain/flap events for every unit class.

Boundary condition screening (low V / high T)

Corner screening is where thin margin becomes visible. A unit that is “clean” at room conditions may fail only under low supply or high temperature, which predicts field returns. The screening goal is not maximum stress, but repeatable exposure that separates robust units from thin ones.

Acceptance (placeholders)

At low V / high T, error slope ≤ X over window Y and time-to-link ≤ X with no unexpected retrain bursts.

Degradation detection (not dead, but fragile)

Signal: rare bursts increase, retrain becomes frequent, or failures appear only under load/corners.
Fastest evidence: compare current counter slopes and time-to-link against the golden baseline pack.
Minimal re-test: run the golden minimal subset (PRBS baseline + burst throughput) at the failing rate.
Decision: classify as channel/SI shrink, clock/jitter sensitivity, or environment/power coupling based on correlation.

Field failure triage (3 steps)

Step 1 · Symptom

Classify: no link / link flaps / only one rate fails / load-only errors / corner-only errors.

Step 2 · Fastest isolate

Use loopbacks and windowed counter slopes to collapse the issue to digital-side, analog-side, channel, partner, or environment.

Step 3 · Confirm metric

Produce a quantified conclusion: BER ≤ X, slope ≤ X, time-to-link ≤ X, and retrain/flap ≤ X over window Y.
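
Step 2 can be sketched as a simple decision function over loopback results. The mapping below is an assumed simplification of the isolation logic described in this guide, and the bucket names are illustrative.

```python
def isolate_segment(pcs_ok: bool, pma_ok: bool, remote_ok: bool,
                    fails_with_other_partner=None) -> str:
    """Map loopback outcomes to the most likely failing segment."""
    if not pcs_ok:
        return "digital-side"
    if not pma_ok:
        return "analog-side"
    if not remote_ok:
        if fails_with_other_partner is None:
            return "channel-or-partner"  # swap the partner to split this bucket
        return "channel" if fails_with_other_partner else "partner"
    # Everything passes under test conditions: suspect corners, load, or power.
    return "environment"
```

The result is a starting bucket only; Step 3 still has to confirm it with windowed slopes.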

Diagram: Reliability loop (Design → Test → Field → Feedback)

Reliability is a closed loop: baseline evidence enables production screening, and field logs turn failures into actionable updates.

Engineering Checklist (Design → Bring-up → Production)

This chapter converts the previous content into action items that can be executed, proven (evidence), and accepted (pass criteria). Each gate outputs a reusable “baseline pack” for debugging and production.

Gate 1 — Design Gate (prevent the expensive failures)

1) Mode / Rate / MAC-Interface Freeze

Checklist

Freeze the target rate set (10/100/1G/2.5G/5G/10G) and the required fallback behavior per port.
Freeze MAC-side interface and clocking constraints (e.g., RGMII/SGMII/2500BASE-X/USXGMII).
Freeze strap defaults + reset sequencing (what must be correct before any MDIO writes).
Define “link-up acceptance”: time-to-link (X), no flap within Y minutes, and counter slope ≤ X.

Evidence (keep forever)

Mode matrix (rate × cable/board class × partner type × EEE on/off).
Boot strap table + power/reset timing diagram + “MDIO write-after-reset” rules.
Register snapshot template: ability, AN status, training status, EQ status, error counters.

Pass criteria (placeholders)

Link-up time ≤ X ms
No flap in Y min
Error slope ≤ X / min

2) Reference Clock & Jitter Ownership (make “low-jitter” measurable)

Checklist

Define refclk source, distribution, and measurement points (CLKIN, CLKOUT/recovered clock if available).
Allocate jitter contributors: refclk, PLL, supply-noise modulation, coupling injection.
Define acceptance by correlation: BER/counter slope versus refclk configuration and corner conditions.

Example MPNs (oscillators)

SiTime SiT8918BA-28-33N-25.000000 — 25 MHz oscillator option for PHY refclk.
SiTime SiT9120 (family) — differential oscillator family often used for low-jitter clocking; pick a standard frequency variant per design.

Pass criteria (placeholders)

Refclk jitter ≤ X (RMS)
BER/CRC slope does not worsen beyond X under corner sweep

3) Observability Contract (counters / events / logs must be available)

Checklist

Expose the minimum MDIO/MMD set: partner ability, AN status, training status (if any), EQ status (if any), error counters, interrupts.
Define a black-box field log schema: timestamp, rate, link state, flap count, error counters, temperature/voltage alarms (if provided).
Standardize metric definitions: windowing, denominators, reset behavior, and “good/bad” thresholds.

Example MPNs (PHYs used in industrial systems)

TI DP83867IR — 10/100/1000 PHY family example for robust industrial links.
Microchip KSZ9131RNXI — 10/100/1000 PHY example with RGMII timing options.
Marvell 88E2010 / 88E2110 — 10M/100M/1G/2.5G/5G multi-gig PHY family examples.
Marvell 88X3310P — 1G/2.5G/5G/10GBASE-T PHY family example (10G class).
Aquantia AQR107-B0-EG-Y (or AQR107-B0-IG-Y) — single-port multi-rate 10G-class PHY example.

Pass criteria (placeholders)

All required status/counters readable in-field
Root-cause isolation achievable without an oscilloscope within X minutes

Gate 2 — Bring-up Gate (repeatable, evidence-based debug)

1) Baseline Pack (Golden ladder, fixed windows)

Checklist

Run a fixed “Golden ladder” per target rate: reset → AN/ability exchange → PRBS/BER window → throughput burst → EEE toggle → corner sweep.
Capture counters at consistent timestamps: t0 (post-link), t1 (after PRBS), t2 (after throughput), t3 (after EEE).
Record time-to-link, training completion (if any), retrain events, and flap counts.

Evidence

Per-rate baseline report: link-up time, counter slopes, BER windows, and partner identity.
Register snapshots: AN, training, EQ, and interrupt summary at each ladder milestone.

Pass criteria (placeholders)

PRBS BER ≤ X over Y seconds
Counter slope stable ≤ X / min

2) Isolation Sequence (loopbacks first, then the channel)

Checklist

Execute loopbacks in order: PCS loopback → PMA loopback → remote loopback (if partner supports).
Only after internal loopbacks pass, attribute failures to the channel/connector/cable environment.
For “rate-specific instability”, rerun the exact same ladder window at each rate to compare slopes.

Pass criteria (placeholders)

PCS/PMA loopback passes with BER ≤ X
Remote loopback stable for Y minutes

Gate 3 — Production Gate (consistency, traceability, fast triage)

1) Minimal Full-Test Coverage (fast but meaningful)

Checklist

Fix test windows and patterns (PRBS set + throughput burst) and keep them identical across stations.
Measure the “thin-margin signature”: errors that appear only at specific rates, loads, or EEE transitions.
Store per-unit summary fields: station ID, FW/strap revision, partner ID, pass/fail reason codes.

Evidence

Production log line (single-line JSON/CSV) with counters + environment + time.
Golden unit comparison record per station per shift.

Pass criteria (placeholders)

Station-to-station delta ≤ X
False reject ≤ X%
Escape ≤ X ppm

2) Field Return Entry (3-step triage using counters/logs)

Checklist

Symptom classify: “no link” vs “flap” vs “rate-specific” vs “load/EEE-triggered”.
Fastest isolate: re-run internal loopbacks first, then channel PRBS window.
Confirm degradation: compare today’s slopes against the unit’s original production baseline pack.

Pass criteria (placeholders)

Root-cause bucket identified within X steps
Evidence package complete for feedback closure

Diagram: Gate pipeline (Checklist → Evidence → Pass)

Applications & IC Selection Logic

Selection is expressed in engineering terms: rate coverage, controllability (EEE/EQ), and observability (PRBS/loopback/counters). This page does not cover PoE/TSN/PTP in depth.

Application Slices (strongly in-scope)

1) Switch Uplink / Aggregation (multi-rate copper)

Typical pain: rate-specific instability (2.5G/5G) and “looks fine on bench” failures after integration.
Must-have hooks: per-rate PRBS windows, loopback isolation, stable counter definitions.
EEE must be controllable (enable/disable) for interoperability validation.

Example PHY MPNs

Marvell 88E2010 / 88E2110 — 10M/100M/1G/2.5G/5G PHY family options.
Marvell 88X3310P — 10G-class PHY option for 1G/2.5G/5G/10G designs.
Aquantia AQR107-B0-EG-Y — 10G-class multi-rate PHY option (commonly used for 10G copper).

2) Industrial Gateway / Protocol Bridge (field maintainability first)

Typical pain: “link up but flaps”, or CRC spikes only under burst traffic and temperature corners.
Must-have hooks: partner ability, AN status, error counters, interrupt summary, black-box logs.
Bring-up must be fast: fixed “golden ladder” recipe + clear pass criteria.

Example PHY MPNs

TI DP83867IR — robust 10/100/1000 PHY example used in industrial gateways.
Microchip KSZ9131RNXI — 10/100/1000 PHY example; useful when RGMII timing options matter.

3) Industrial Edge Compute / IPC (long uptime + corners)

Typical pain: slow drift (aging/thermal) that reduces margin without immediate hard failure.
Must-have hooks: long-window error slope monitoring + baseline delta reports.
Clock quality and power-noise immunity must be verified with repeatable correlation tests.

Example supporting MPNs

SiTime SiT8918BA-28-33N-25.000000 — 25 MHz oscillator example for refclk.
Marvell 88E2010 / 88E2110 — multi-gig PHY examples when 2.5G/5G coverage is needed.

4) In-cabinet Backplane / Short Copper Reach (board-driven)

Typical pain: “board-level SI” dominates; loopbacks help separate digital/analog/channel issues.
Must-have hooks: internal loopbacks, per-rate PRBS, and deterministic bring-up sequencing.
Rate upgrade path should be explicit (1G → 2.5G/5G → 10G).

Example PHY MPNs

Marvell 88X3310P — 10G-class PHY example for high-rate copper links.
Aquantia AQR107-B0-IG-Y — 10G-class PHY example seen in reference designs.

Selection Logic (engineering gates, not marketing)

Layer 1 — Hard Gates (fail one → reject)

Rate coverage: required set supported (and future headroom defined).
MAC-side interface: matches SoC/switch requirements (e.g., RGMII/SGMII/2500BASE-X/USXGMII).
EEE controllability: explicit enable/disable + verified interoperability plan.
Debug completeness: PRBS + loopback modes + error counters accessible via MDIO/MMD.
Environment grade: temperature/EMC expectations match deployment (verify ordering grade).

Example “Hard Gate” MPN mapping

10/100/1G class: TI DP83867IR; Microchip KSZ9131RNXI
2.5G/5G class: Marvell 88E2010, 88E2110
10G class: Marvell 88X3310P; Aquantia AQR107-B0-EG-Y/AQR107-B0-IG-Y

Layer 2 — Scorecard (rank candidates)

Score each item 0/1/2 (0 = missing, 1 = partial, 2 = complete). Prefer candidates that reduce debug time and protect production stability.

Observability depth: can EQ/training status be read and correlated to errors?
Loopback coverage: PCS/PMA/remote loopback modes available and documented?
Counter semantics: clear windows/denominators; counters survive resets as required?
EEE behavior: predictable enter/exit latency; partner matrix validation practical?
Bring-up determinism: time-to-link stable; retrain behavior bounded?
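
The two layers can be sketched as a filter followed by a ranking. This is a minimal illustration: the gate and score keys are shorthand for the items listed above, and the candidate records are hypothetical inputs supplied by the evaluator.

```python
HARD_GATES = ("rate_coverage", "mac_interface", "eee_control",
              "debug_complete", "env_grade")
SCORE_ITEMS = ("observability", "loopback", "counter_semantics",
               "eee_behavior", "bringup_determinism")


def rank_candidates(candidates: dict) -> list:
    """Drop candidates failing any hard gate, then rank the rest by total score."""
    ranked = []
    for name, record in candidates.items():
        if not all(record["gates"].get(gate, False) for gate in HARD_GATES):
            continue  # fail one hard gate and the candidate is rejected
        scores = record["scores"]
        if any(scores[item] not in (0, 1, 2) for item in SCORE_ITEMS):
            raise ValueError(f"{name}: scores must be 0, 1, or 2")
        ranked.append((sum(scores[item] for item in SCORE_ITEMS), name))
    ranked.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(name, total) for total, name in ranked]
```

Keeping the gates separate from the score prevents a strong scorecard from compensating for a missing hard requirement.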

Output deliverables (must exist after selection)

Mode matrix + fixed “Golden ladder” recipe names + per-rate pass thresholds (X/Y).
MDIO/MMD readout script + black-box log schema + station-to-station reproducibility plan.

Diagram: Selection flow (Hard Gates → Scorecard → Baseline Pack)

Scope guard: PoE/PoDL, magnetics deep-dive, TSN/PTP timing templates, and protocol stacks are handled by sibling pages.

FAQs (Field Troubleshooting, No Scope Required)

Each FAQ is a closed-form troubleshooting entry: isolate the segment, run a minimal check, apply a reversible fix, and accept using measurable criteria (placeholders: X/Y/N).

Links up at 1G, but 2.5G/5G shows CRC spikes — first check: channel margin or EQ preset?

Likely cause: Thin channel margin at higher rates, or the chosen EQ preset/adaptation lands in a poor operating point.

Quick check: Lock the rate and run PRBS for N bits using preset A/B; compare local loopback vs end-to-end to isolate “internal vs channel”.

Fix: Use the stable preset (or narrow adaptation range); reduce discontinuities on the path and re-baseline PRBS/counters at the target rate.

Pass criteria: CRC ≤ X per 10^6 frames over Y min at 2.5G/5G; PRBS BER ≤ X over N bits; flap count = 0.

Works in lab, fails in cabinet — is it refclk noise injection or return-plane cut?

Likely cause: Refclk/supply noise couples into PLL/CDR, or layout return-path disruption increases mode conversion and closes eye margin.

Quick check: Correlate errors with cabinet events (fans/relays/load steps); compare error slope with refclk source A/B and with a controlled supply ripple step (ΔV = X mV).

Fix: Improve refclk isolation (routing/guard/decoupling) and restore a continuous return path near critical traces; re-run the same PRBS window and throughput burst in-cabinet.

Pass criteria: Error slope ≤ X per min across cabinet load steps; PRBS BER ≤ X over N bits; no cabinet-triggered flap over Y min.

EEE enabled causes periodic flaps: is it exit latency or partner mismatch?

Likely cause: Partner ability mismatch or exit timing/latency behavior that violates the traffic profile, causing repeated transitions and instability.

Quick check: Disable EEE and compare flap rate; if stable, re-enable and test two traffic models: (a) idle-heavy with bursts every T ms, (b) continuous load at X%.

Fix: Keep EEE disabled for that partner/port class, or tune policy (min idle time, exit behavior) and validate with the partner matrix.

Pass criteria: Flap count = 0 over Y min with EEE policy applied; throughput variance ≤ X% under burst model; EEE transitions do not increase CRC beyond X/10^6 frames.

AN completes but no traffic passes: is it MAC interface timing or FIFO underrun?

Likely cause: MAC-side interface mode/timing mismatch (clock phase/delay), or datapath starvation (FIFO underrun/overrun) masquerading as “link up”.

Quick check: Verify MAC/PHY interface configuration match (mode + delays); check underrun/overrun counters while sending a constant-rate stream at X Mbps.

Fix: Correct interface mode and required internal/external delays; increase FIFO margin (clocking/interrupt moderation/driver pacing) and re-test with the same stream.

Pass criteria: Sustained throughput ≥ X% of line rate for Y min; underrun/overrun count = 0; CRC ≤ X/10^6 frames.

Only one port fails under temperature: is it PLL lock margin or a solder/connector loss shift?

Likely cause: Marginal PLL/CDR lock at temperature corners, or localized physical loss shift (connector/solder/port-specific routing) reduces SI margin.

Quick check: Compare that port vs a known-good port: time-to-link, retrain count, and error slope across temperature sweep from Tlow to Thigh.

Fix: Improve port-specific margin (rework suspected solder/connector, verify routing/return path); if PLL margin is suspected, evaluate refclk/supply conditioning and stable EQ settings.

Pass criteria: No retrain/flap over Y min at each corner; PRBS BER ≤ X over N bits; error slope stays within baseline ± X%.

PRBS passes in local loopback but fails end-to-end — which segment does that isolate?

Likely cause: Internal PHY datapath is healthy; the failing segment is the external channel (PCB + connector + cable) or the link partner’s receive path.

Quick check: Run PMA/PCS loopbacks (internal) and then remote loopback (if supported). If only end-to-end fails, swap partner/port/cable class to separate “channel vs partner”.

Fix: Treat as channel/partner margin: lock a stable EQ preset, reduce discontinuities, and validate with PRBS at the target rate using a fixed N-bit window.

Pass criteria: End-to-end PRBS BER ≤ X over N bits; CRC ≤ X/10^6 frames over Y min; internal loopbacks remain clean (BER ≤ X).

Changing magnetics vendor worsened BER — first Cdiff/mismatch sanity check (PHY-only view)

Likely cause: Differential mismatch introduced (effective Cdiff / imbalance), increasing mode conversion and degrading receiver margin at higher rates.

Quick check: Compare error slope and rate sensitivity before/after the vendor change using the same PRBS window; if degradation is strongest at higher rates, suspect added imbalance/mismatch.

Fix: Restore the previous magnetics or enforce a validated part list; if change is mandatory, re-baseline EQ presets and validate end-to-end PRBS/CRC at each target rate.

Pass criteria: PRBS BER ≤ X over N bits at each rate; CRC ≤ X/10^6 frames; delta vs baseline slope ≤ X%.

Link ok, but throughput collapses with bursts — is it EEE/LPI transitions or PCS errors?

Likely cause: EEE/LPI transitions interact poorly with burst traffic, or PCS-level errors force retransmissions/pauses that look like “throughput collapse”.

Quick check: Disable EEE and repeat the exact burst pattern (idle Tidle ms, burst Tburst ms at X% line rate); observe CRC/PCS error counters during bursts.

Fix: Keep EEE disabled (or adjust policy) for burst workloads; if PCS errors dominate, lock stable EQ preset and address channel discontinuities.

Pass criteria: Throughput ≥ X% of expected under burst model for Y min; CRC ≤ X/10^6 frames; no periodic stalls aligned with LPI transitions.

Intermittent down/up every few minutes — brownout reset, strap sampling, or thermal?

Likely cause: Supply dip triggers reset, strap sampling occurs under unstable rails, or thermal protection causes periodic recovery.

Quick check: Correlate flap timestamps with supply/temperature logs; check reset cause/interrupt history; measure time-to-link consistency after each event (variance ≤ X).

Fix: Improve power hold-up/decoupling and reset sequencing; ensure straps are stable at sampling; address thermal path (airflow/heatsinking) if temperature correlates.

Pass criteria: Flap count = 0 over Y hours; time-to-link ≤ X ms with variance ≤ X%; no reset/thermal events under worst-case load.

MDIO shows no faults but counters climb — which counters are most telling at PCS vs PMA?

Likely cause: Soft errors accumulate without hard faults; PCS shows symbol/block integrity issues, while PMA indicates analog lock/alignment stress.

Quick check: Separate counters by layer: PCS integrity (block/symbol/CRC-related) vs PMA lock/retrain/alignment; compare slopes during PRBS and during real traffic.

Fix: If PCS counters dominate, treat it as a margin/EQ/channel issue; if PMA lock/retrain events dominate, treat it as refclk/supply/temperature sensitivity and stabilize the bring-up configuration.

Pass criteria: Layered counter slopes ≤ X/min in steady-state; no retrain events over Y min; CRC ≤ X/10^6 frames.
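
Separating counters by layer is easier with a fixed mapping and two snapshots. The sketch below is an assumption-laden illustration: the counter names are hypothetical placeholders (real names and MMD addresses come from the PHY datasheet), and counter wraparound or clear-on-read behavior is not handled.

```python
# Hypothetical counter names mapped to the layer they belong to.
LAYER_OF = {
    "pcs_block_errors": "PCS",
    "pcs_symbol_errors": "PCS",
    "pma_retrain_count": "PMA",
    "pma_lock_loss": "PMA",
}


def layer_slopes(snap_start, snap_end, elapsed_min):
    """Sum per-layer counter growth per minute between two snapshots."""
    if elapsed_min <= 0:
        raise ValueError("elapsed_min must be positive")
    slopes = {"PCS": 0.0, "PMA": 0.0}
    for name, layer in LAYER_OF.items():
        delta = snap_end.get(name, 0) - snap_start.get(name, 0)
        slopes[layer] += delta / elapsed_min
    return slopes


def dominant_layer(slopes):
    """Return the layer with the highest slope, or None if nothing moved."""
    if all(value == 0 for value in slopes.values()):
        return None
    return max(slopes, key=slopes.get)


# Example with made-up snapshots taken 10 minutes apart.
start = {"pcs_block_errors": 10, "pcs_symbol_errors": 0,
         "pma_retrain_count": 2, "pma_lock_loss": 0}
end = {"pcs_block_errors": 130, "pcs_symbol_errors": 20,
       "pma_retrain_count": 2, "pma_lock_loss": 1}
slopes = layer_slopes(start, end, elapsed_min=10)
layer = dominant_layer(slopes)
```

Taking one pair of snapshots during PRBS and another during real traffic gives the two slopes the quick check asks to compare.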

Polarity / pair swap on PCB — why do only some rates fail?

Likely cause: The swap is only conditionally supported (depends on mode/rate), or the physical routing introduces rate-dependent discontinuities that exceed margin at higher speeds.

Quick check: Verify PHY configuration for polarity/pair-swap support at each target rate; run the same PRBS window at 1G vs 2.5G/5G/10G and compare BER slope vs preset.

Fix: Enable the correct swap mode for the operating rate; if margin-driven, lock a stable EQ preset and reduce discontinuities around the swapped segment.

Pass criteria: PRBS BER ≤ X over N bits at each rate; CRC ≤ X/10^6 frames; no rate-specific flap over Y min.
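
The "BER ≤ X over N bits" criteria used throughout this FAQ only mean something if N is large enough for the chosen X. A common way to size N, assuming independent bit errors and an error-free run, is N ≥ -ln(1 - CL) / BER for confidence level CL. The sketch below applies that formula; the function names are illustrative, and the bit rate passed in is whatever the PRBS generator actually runs at.

```python
import math


def bits_for_zero_error_ber(ber_limit, confidence=0.95):
    """Bits that must pass error-free to claim BER <= ber_limit.

    Assumes independent errors, so P(no errors in N bits) ~ exp(-N * BER).
    """
    if not 0 < ber_limit < 1:
        raise ValueError("ber_limit must be between 0 and 1")
    if not 0 < confidence < 1:
        raise ValueError("confidence must be between 0 and 1")
    return math.ceil(-math.log(1.0 - confidence) / ber_limit)


def prbs_duration_s(n_bits, bit_rate_bps):
    """Wall-clock time needed to send n_bits at the given bit rate."""
    if bit_rate_bps <= 0:
        raise ValueError("bit_rate_bps must be positive")
    return n_bits / bit_rate_bps


# Example: BER target of 1e-12 at 95% confidence, PRBS running at 10 Gb/s.
n_bits = bits_for_zero_error_ber(1e-12, confidence=0.95)
seconds = prbs_duration_s(n_bits, 10e9)
```

For a 1e-12 target this works out to roughly 3e12 error-free bits, so the same PRBS window takes proportionally longer at lower rates; budget test time per rate rather than reusing one duration.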

Lowering TX amplitude helped EMI but BER got worse — how to pick the safe margin?

Likely cause: TX amplitude reduction closes the eye and removes margin, especially at higher rates or under temperature/supply corners.

Quick check: Sweep TX amplitude across K steps and record PRBS BER and CRC/PCS counters per step; identify the “knee” where BER slope rises sharply.

Fix: Choose the lowest TX amplitude that still meets the PRBS/CRC acceptance criteria at all required corners; if EMI still fails, address return path/shielding before sacrificing margin.

Pass criteria: Selected amplitude meets BER ≤ X over N bits and CRC ≤ X/10^6 frames across corners; EMI target met with ≥ X dB margin.
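
The selection rule in the fix can be written down as a short search over the sweep data. This is a sketch under assumptions: the sweep is stored as a mapping from amplitude setting to per-corner measured BER, the amplitude values and corner names are made up, and the function name is hypothetical. It walks down from the highest amplitude and stops at the first step where any corner fails, so an isolated passing point below the knee is never selected.

```python
def lowest_safe_amplitude(sweep, ber_limit):
    """Return the lowest amplitude whose corners all pass, above the knee.

    sweep: {amplitude_setting: {corner_name: measured_ber}} (assumed format).
    Returns None if even the highest amplitude fails.
    """
    safe = None
    for amplitude in sorted(sweep, reverse=True):
        corners = sweep[amplitude]
        if not corners or any(ber > ber_limit for ber in corners.values()):
            break  # first failing step marks the knee; stop descending
        safe = amplitude
    return safe


# Example with made-up sweep data (amplitude settings in arbitrary units).
sweep_data = {
    800: {"nominal": 1e-15, "hot": 1e-14},
    700: {"nominal": 1e-14, "hot": 1e-13},
    600: {"nominal": 1e-13, "hot": 5e-12},
    500: {"nominal": 1e-10, "hot": 1e-9},
}
chosen = lowest_safe_amplitude(sweep_data, ber_limit=1e-12)
```

In this example the 600 step passes at nominal but fails at the hot corner, which is why corner data belongs in the sweep rather than a single room-temperature run.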
