10/100/1G/2.5G/5G/10G Ethernet PHY Design Guide
Multi-rate Ethernet PHYs are not hard because they “link,” but because they must stay stable, measurable, and production-repeatable across 10M→10G and real-world channels.
This page turns bring-up, jitter/margin, EQ/EEE, and PRBS/loopback observability into a stepwise workflow with numeric acceptance criteria, so failures can be isolated fast and fixed with confidence.
Definition & Where This PHY Fits
Multi-rate Ethernet PHYs (10M → 10G) are integration problems: stable BER/latency across rates and partners requires tight clocking,
controlled channel margins, and built-in observability for fast bring-up and production repeatability.
What “PHY” means in practice
An Ethernet PHY converts a MAC-side digital interface into a physical link over a differential channel, while exposing measurable signals of link health.
For multi-rate designs, the primary challenge is not “link-up,” but predictable stability under rate changes, environmental drift, partner diversity, and real channel losses.
Typical use-cases (where failures hide)
- Switch uplink / port-dense gateways: partner variety and port adjacency amplify jitter/EMI sensitivity and cross-port coupling.
- Industrial edge: wide temperature, noisy grounds, and long service life demand strong margins and durable diagnostics.
- Mixed-rate networks: legacy endpoints coexist with multi-gig links; negotiation and EEE behavior must stay deterministic.
What this page delivers (engineering outcomes)
- Margin control: translate “CRC/BER spikes” into a repeatable check → fix → pass-criteria flow.
- Clocking discipline: treat ref-clock quality as a budgeted input to PLL/CDR lock and error rate.
- EEE safety: enable power save only when exit latency and partner behavior remain bounded and testable.
- Observability: define the minimum PRBS/loopback/counters/log fields required for field forensics.
Scope guard (to avoid cross-page overlap)
This page stays at the PHY layer: link establishment, clock/jitter budgeting, adaptive EQ boundaries, EEE stability,
and test/bring-up hooks. System topics are referenced only as integration constraints (no deep dives).
- TSN scheduling and time windows (Qbv/Qci/GCL), PTP/SyncE timing systems, PoE power architecture, and magnetics/CMC selection are out of scope here.
Diagram: System position + four engineering levers (Margin / Clock / Power-save / Observability)
Integration success is driven by four levers: measurable margins (BER/CRC), disciplined ref-clock/jitter budgeting, EEE enablement that remains bounded under traffic patterns,
and observability hooks that isolate failures without lab instruments.
Architecture Map: PCS/PMA, SerDes, DSP/EQ, MAC Interface
A PHY architecture map is a fault-isolation tool: symptoms must translate into “which layer to check first” (MAC interface, auto-negotiation, PCS, PMA/SerDes, EQ, or clock/power integrity).
Layer model (engineering view)
- MAC interface layer: RMII/GMII/RGMII/SGMII/USXGMII timing, delays, and error visibility.
- Control/management layer: MDIO/MMD access, straps, interrupts, and field-debug logging.
- PCS layer: auto-negotiation state, block lock/alignment, and PCS-level error counters.
- PMA/SerDes layer: PLL/CDR lock, TX/RX analog behavior, and training/equalization boundaries.
- Cross-cutting integrity: ref-clock quality and supply/ground noise modulate margins across all layers.
Diagram: Architecture map for fault isolation (layers + observability hooks)
The map above supports a deterministic debug order: prove MAC-side timing and management visibility, confirm PCS state and counters, validate PMA/SerDes lock,
then evaluate EQ/training boundaries under the intended channel. Observability is part of the design, not an afterthought.
Responsibility boundaries (Symptoms → Quick check → Hooks)
Each block below is an isolation unit. The fastest checks are chosen to separate PHY-internal faults from MAC/system issues before deep signal analysis.
MAC interface (RMII/GMII/RGMII/SGMII/USXGMII)
- Symptoms: link shows “up” yet throughput collapses, specific frame sizes fail, or errors appear only in one direction.
- Quick check: verify interface mode + clock relationship + delay configuration; confirm PCS/PMA counters remain clean in local loopback.
- Hooks: readable mode/delay settings, local loopback enable, and a “known-good” test mode independent of higher software layers.
Auto-negotiation & ability exchange
- Symptoms: repeated renegotiation, “stuck at 1G,” or specific partners never converge at 2.5G/5G/10G.
- Quick check: read partner abilities + AN completion state + remote fault indications; compare negotiated result vs configured policy.
- Hooks: partner-ability snapshot, AN state log fields (time-stamped), and a policy control to force/disable targeted rates during triage.
PCS (encode/decode, alignment, PCS counters)
- Symptoms: link-up is achieved but CRC/PCS errors grow; errors correlate with rate switches or bursts.
- Quick check: check block lock/alignment indicators; inspect PCS error counters before blaming the channel.
- Hooks: PCS loopback mode, PCS counters exposed via MDIO/MMD, and interrupts mapped to software logs.
PMA / SerDes (PLL/CDR lock, analog link)
- Symptoms: temperature/voltage-sensitive link flaps, “works at low speed only,” or intermittent drops after warm-up.
- Quick check: validate PLL lock and CDR lock status; correlate error bursts with lock transitions and ref-clock disturbances.
- Hooks: readable lock indicators, optional clock-out for correlation, and PMA loopback to isolate line-side faults.
Adaptive EQ / training (what it fixes vs what it cannot)
- Symptoms: good results on short channels but failure on the real harness/backplane; the target rate works only after a different preset is forced.
- Quick check: compare PRBS BER across presets; read EQ/training completion state and any saturation/limit indicators.
- Hooks: ability to force safe presets, visibility into EQ/training state, and a repeatable PRBS recipe for A/B comparisons.
EEE (802.3az) power save
- Symptoms: periodic flaps, latency spikes during burst traffic, or instability that disappears when EEE is disabled.
- Quick check: monitor LPI enter/exit counters; verify partner EEE capability and exit timing margins under intended traffic patterns.
- Hooks: per-port EEE enable/disable control, LPI counters exposed to logs, and a stress recipe that reproduces burst/idle transitions.
MDIO/MMD, straps, interrupts (field-debug survivability)
- Symptoms: behavior differs across cold boots, field failures cannot be reproduced in a lab, or “no visible faults” while counters climb.
- Quick check: confirm strap sampling and reset timing; ensure MDIO access is stable; verify interrupts are rate-limited and logged.
- Hooks: deterministic reset/strap scheme, guaranteed MDIO reachability, and a minimum black-box log set (rate, partner, counters, temperature, power events).
Diagram: Symptom → first check routing (fast isolation)
This routing prevents misdiagnosis: start with deterministic layer signals (AN status, lock indicators, counters, PRBS/loopback outcomes),
then escalate to channel/SI investigations only after PHY-internal evidence is clean.
Rate Bring-up Flow: AN / Training / Link-up States
A multi-rate PHY bring-up should behave like a deterministic state machine: each stage has a minimal observable,
a known set of failure signatures, and a next-action that isolates PHY issues before escalating to system layers.
Fast triage buckets (pick the first lane)
A) No link (never comes up)
- Start at: Strap latched → Refclk ready → Reset release
- First evidence: MDIO reachable, mode stable, PLL lock asserted
- Then: AN/partner ability + remote fault
B) Link flaps (up → down → up)
- Start at: Reset causes + PLL/CDR lock stability
- First evidence: lock toggles, counters burst in clusters, periodicity
- Then: Monitor step (S6), plus the EEE checks if EEE is enabled
C) Rate-specific instability
- Start at: Negotiated rate vs policy, training done, EQ preset A/B
- First evidence: errors appear only at the high rate and improve when a lower rate is forced
- Then: Jitter budget + injected noise correlation (see Clocking & Jitter Budget)
Step S0 · Strap latched (power-on configuration)
Goal
Mode and address are deterministic across cold boots; no “random” interface/rate behavior.
Observables
- MDIO address responds
- Mode/status register matches expected straps
- Optional strap readback (if supported)
Failure signatures
Different behavior between cold boots; ports enumerate inconsistently; link policy appears ignored.
Next action
- Harden strap pull network and sampling timing
- Log mode/address fields at boot (black-box)
- Prefer software override where available
Step S1 · Refclk ready (clock present & stable)
Goal
Reference clock is stable enough for internal PLL/CDR lock without intermittent loss.
Observables
- PLL pre-lock / clock-detect (if exposed)
- No error bursts before AN starts
- Optional CLKOUT for correlation
Failure signatures
Lock is temperature/traffic sensitive; link works at low rate but not at high rate; periodic dropouts.
Next action
- Validate refclk source quality and isolation
- Check clock routing return paths and coupling
- Proceed to jitter budgeting (see Clocking & Jitter Budget)
Step S2 · Reset release (init determinism)
Goal
Reset deasserts cleanly; internal init completes once; management access is stable.
Observables
- Init done / ready status
- Reset cause flags (if available)
- No interrupt storm (IRQ rate stays bounded)
Failure signatures
Link works briefly then drops; repeated init; brownout-like resets under traffic or temperature.
Next action
- Correlate resets with power/thermal events (black-box)
- Stabilize supplies and reset sequencing
- Continue to AN only after S0–S2 are clean
Step S3 · AN / ability exchange (negotiated result)
Goal
AN completes and the negotiated rate/duplex matches policy; partner capability is visible.
Observables
- AN complete
- Partner ability snapshot
- Remote fault / link fault flags
Failure signatures
Endless renegotiation; certain partners never converge at multi-gig; negotiated rate differs from expected.
Next action
- Force/disable specific rates to bisect partner issues
- Compare partner ability vs local policy settings
- Proceed to training/EQ only after AN evidence is clean
Step S4 · Training (if present) / EQ convergence
Goal
Training completes; EQ state is stable (no saturation/limit events under the intended channel).
Observables
- Training done
- EQ preset / taps status
- Optional saturation/limit flags
Failure signatures
Bench is stable but real channel fails; only lower rates work; stability depends on a specific preset.
Next action
- A/B compare presets with PRBS and counters
- Check if instability correlates with refclk or injected noise
- Escalate to the jitter budget funnel (see Clocking & Jitter Budget)
Step S5 · Link-up (data path validity)
Goal
Link-up is defined by bounded error counters and stable throughput, not only a status bit.
Observables
- PCS/PMA error counter slope
- Local loopback result (baseline)
- Traffic stability under burst load
Failure signatures
CRC spikes without clear link-down; errors appear only during bursts; one direction is worse.
Next action
- Prove PHY with PRBS/loopback before system escalation
- Re-check MAC interface timing if PHY evidence is clean
- Record counters + temperature + power events (black-box)
Step S6 · Monitor (drift, aging, and field stability)
Goal
Error counters remain within threshold X over window Y across temperature and traffic patterns.
Observables
- Error counter slope vs time
- Lock event timestamps
- LPI enter/exit counters (if EEE enabled)
Failure signatures
Periodic bursts; failures only in a temperature band; errors grow after minutes/hours.
Next action
- Correlate bursts with refclk/power/noise injections
- Capture “minimum black-box fields” on every event
- Use the jitter budget funnel (see Clocking & Jitter Budget) to allocate a jitter margin reserve
Minimum black-box fields (for field forensics)
Capture these fields per port at boot, link events, and error bursts to enable remote isolation without lab instruments.
- Mode + negotiated rate + partner ability
- AN state, remote fault flags, training done (if present)
- PLL/CDR lock status + lock toggle timestamp
- PCS/PMA error counters and their slope over time
- Temperature + key power events (brownout/thermal/WD)
- EEE LPI enter/exit counters (if enabled)
Diagram: Bring-up state flow (deterministic steps + triage branches)
The bring-up flow enables controlled escalation: verify deterministic configuration and clock readiness, prove link establishment evidence (AN/training),
then validate stability using counter slopes and lock events before suspecting higher layers.
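The ordered flow above can be sketched as a small gate that reports the first stage lacking clean evidence. This is a minimal sketch, not a vendor implementation: the stage names mirror S0 to S6, while `first_failing_stage` and the per-stage check callables are hypothetical stand-ins for whatever status reads a given PHY exposes.

```python
from typing import Callable, Dict, List, Optional

# Ordered bring-up stages, mirroring steps S0..S6 in the text.
STAGES: List[str] = [
    "S0 strap latched",
    "S1 refclk ready",
    "S2 reset release",
    "S3 AN / ability exchange",
    "S4 training / EQ convergence",
    "S5 link-up",
    "S6 monitor",
]


def first_failing_stage(checks: Dict[str, Callable[[], bool]]) -> Optional[str]:
    """Run the stage checks in order and return the first stage that fails.

    A stage with no registered check counts as failing, so a missing
    observable is never silently treated as a pass.
    Returns None when every stage passes.
    """
    for stage in STAGES:
        check = checks.get(stage)
        if check is None or not check():
            return stage
    return None
```

A stage that does not apply to a given PHY (for example, training on a device without it) can be registered as `lambda: True`, which keeps the omission explicit in the code rather than implicit.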
Clocking & Jitter Budget
“Low-jitter clock” must be treated as a measurable input with a budget, isolation plan, and pass criteria; otherwise,
high-rate stability becomes non-repeatable across boards, temperature, and port activity.
Clock path map (where jitter enters)
- Ref clock → PLL/CDR: reference quality and coupling decide lock margin and sampling noise at high data rates.
- Power/ground modulation: supply ripple and return-path discontinuities convert into phase modulation inside the clock path.
- Activity injection: neighbor port toggling and switching noise can correlate with bursty error clusters.
Jitter contributors (budget buckets)
- Refclk: source phase noise and coupling along the route.
- PLL/CDR transfer: internal multiplication/filtering adds residual jitter.
- Injected: supply/ground noise and EMI coupling modulate clock/SerDes.
- Coupling: cross-port and high-speed adjacency shifts effective sampling margin.
- Reserve: margin held for worst-case temperature, aging, and component spread.
Budget card (template, thresholds as X placeholders)
- Total jitter limit: X (target at highest intended rate)
- Allocate: Refclk X1 + PLL X2 + Injected X3 + Coupling X4 + Reserve Xr
- Measure proxies: lock stability, counter slope, and PRBS BER under controlled stress
- Isolation plan: swap refclk source, change supply noise, vary neighbor activity (one variable at a time)
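The budget card can be kept honest with a few lines of arithmetic. The sketch below assumes a simple linear sum of the buckets, as the card's template is written; the class name, field names, and the use of integer units (for example femtoseconds) are illustrative assumptions. A linear sum is the conservative choice; teams that combine uncorrelated random components differently would change `allocated()` accordingly.

```python
from dataclasses import dataclass


@dataclass
class JitterBudget:
    """Budget card values, all in the same unit (hypothetical: femtoseconds)."""
    total_limit: int  # X: total jitter limit at the highest intended rate
    refclk: int       # X1
    pll: int          # X2
    injected: int     # X3
    coupling: int     # X4
    reserve: int      # Xr: margin held for temperature, aging, spread

    def allocated(self) -> int:
        """Linear sum of all buckets, including the reserve."""
        return self.refclk + self.pll + self.injected + self.coupling + self.reserve

    def headroom(self) -> int:
        """Unallocated budget; negative means the card is over-committed."""
        return self.total_limit - self.allocated()

    def is_closed(self) -> bool:
        """The budget closes only if a reserve exists and the sum fits the limit."""
        return self.reserve > 0 and self.headroom() >= 0
```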
Acceptance card (template, pass criteria as X placeholders)
- Lock stability: PLL/CDR lock does not toggle over Y hours at worst-case temperature
- Error slope: PCS/PMA counters ≤ X per minute at target rate under burst load
- Correlation: error clusters do not track neighbor activity after isolation measures
- Evidence archive: logs + counter snapshots + test recipe identifiers
Debug hooks (measurements that enable correlation)
- CLKOUT / recovered clock (if available): enables time correlation with error clusters and lock events.
- Lock indicators: PLL/CDR lock state and transition timestamps.
- Counters as sensors: error counter slope is the primary proxy for margin when lab gear is absent.
- Controlled stress knobs: refclk source selection, supply noise profiles, neighbor port activity patterns.
Diagram: Jitter budget funnel (contributors → sampling margin → BER/CRC)
The funnel turns clock quality into an actionable budget: allocate contributors, preserve a reserve, and validate with lock stability plus counter-slope evidence.
If errors track a specific stress knob, the dominant contributor is identified without ambiguous “clock is bad” claims.
Scope guard
This section budgets PHY refclk/PLL/CDR contributions for link stability. SyncE templates, system timing topology, and PTP synchronization are handled on the Timing & Sync pages.
Channel Margin & SI: What Kills BER on Real Boards
If a link is clean on the bench but fragile in a product, the dominant cause is usually channel margin loss from board-level discontinuities and parasitics.
This section converts “SI suspicion” into a minimal, repeatable isolation checklist using counters, PRBS/loopback, and A/B stress.
Scope guard
This section is board-level only: routing, vias, reference planes, connector/magnetics parasitics, and how they translate into BER via PHY evidence.
Cable standards, PoE/PoDL design, and magnetics theory are handled on their dedicated pages.
Evidence chain (avoid “guessing SI”)
- Prove PHY baseline: local loopback + PRBS (if supported) to separate PHY vs channel.
- Find the cliff: rate stepping to identify the first unstable rate (10M → 10G).
- Correlate: error counter slope vs temperature, supply noise, and neighbor-port activity.
- Pin the killer: map symptoms to one of the six channel killers and apply board-level fixes.
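A PRBS baseline is only evidence if it runs long enough. The sketch below is not from this page: it uses the common zero-error confidence bound, which assumes independent bit errors, to estimate how many error-free bits are needed to claim a BER below a target. The function names and the 95% default are assumptions for illustration.

```python
import math


def bits_for_confidence(target_ber: float, confidence: float = 0.95) -> float:
    """Error-free bits needed to claim BER < target_ber at the given confidence.

    Uses n = -ln(1 - confidence) / target_ber, which assumes independent errors
    and zero observed errors during the run.
    """
    if not 0.0 < target_ber < 1.0:
        raise ValueError("target_ber must be between 0 and 1")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be between 0 and 1")
    return -math.log(1.0 - confidence) / target_ber


def prbs_run_seconds(target_ber: float, line_rate_bps: float,
                     confidence: float = 0.95) -> float:
    """Minimum error-free PRBS run time at a given line rate."""
    if line_rate_bps <= 0:
        raise ValueError("line_rate_bps must be positive")
    return bits_for_confidence(target_ber, confidence) / line_rate_bps
```

For a target of 1e-12 at 95% confidence this works out to roughly 3e12 bits, or about five minutes at 10 Gb/s; lower rates need proportionally longer runs, which is worth knowing before stepping through the ladder.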
Killer 1 · Insertion Loss / ISI (frequency-dependent attenuation)
Symptoms
Stable at low rate but fails at higher rates; BER grows rapidly after a rate step; margin worsens with temperature.
Quick check
- Rate stepping: identify the first unstable rate
- PRBS + error counter slope under constant load
- A/B compare shorter vs longer internal routes (if available)
Fix
- Shorten high-speed segments; remove unnecessary vias
- Keep differential impedance consistent; avoid stubs and test pads on the main path
- Maintain continuous reference planes to preserve return current
Pass criteria
At target rate and worst-case conditions, PRBS / PCS-PMA error slope ≤ X per minute over window Y; no rate-step cliff within the supported ladder.
Killer 2 · Return Loss / Reflections (discontinuities)
Symptoms
Burst CRC errors under specific traffic patterns; link “looks up” but errors cluster; stability depends on a narrow set of presets/training outcomes.
Quick check
- Preset A/B with PRBS: does one preset eliminate bursts?
- Compare builds with/without a specific connector/test pad element
- Check “training done” vs “retraining events” (if exposed)
Fix
- Remove impedance steps: pad/via transitions, stubbed vias, and inline test points
- Keep differential pair geometry consistent through transitions
- Avoid reference plane cuts near the discontinuity region
Pass criteria
Error bursts disappear under worst-case traffic; retraining events ≤ X per hour; counter slope stays below X over window Y.
Killer 3 · Crosstalk (neighbor activity injection)
Symptoms
BER increases when adjacent ports become active; multi-port throughput triggers CRC clusters; single-port tests look clean.
Quick check
- Neighbor activity A/B: run the same port with neighbors off vs saturated
- Correlation test: counter slope vs port utilization
- Thermal A/B: higher temperature often reduces crosstalk tolerance
Fix
- Increase spacing and manage layer-to-layer pair adjacency
- Avoid long parallel runs; introduce separation and routing discipline
- Strengthen return planes and reduce shared impedance in grounds
Pass criteria
Under defined neighbor stress (pattern + load), counter slope ≤ X; the correlation between errors and neighbor activity falls below an agreed limit.
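The correlation test in this card can be made numeric. As a minimal sketch, the function below computes a Pearson correlation coefficient between per-window neighbor utilization and per-window error counts; the choice of Pearson, the function name, and the sampling scheme are assumptions, not something this page prescribes.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation between two equal-length series.

    xs: neighbor-port utilization per sampling window (hypothetical input)
    ys: error-counter delta on the port under test for the same windows
    Returns 0.0 when either series is constant (no correlation measurable).
    """
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two series of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)
```

A coefficient near 1 when neighbors are saturated, dropping toward 0 after a layout or isolation fix, is the kind of before/after evidence the pass criteria ask for.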
Killer 4 · Mode Conversion (differential → common-mode imbalance)
Symptoms
EMI sensitivity rises together with BER; behavior changes with chassis/shield configuration; swapping “similar parts” changes stability unexpectedly.
Quick check
- Imbalance review: intra-pair length mismatch (ΔL) and asymmetrical via/pad structures on the pair
- A/B test with controlled common-mode disturbance (trend-based)
- Check whether error clusters align with EMI events or ESD exposure logs
Fix
- Enforce symmetry: routing, vias, pads, and placement around the pair
- Keep return paths balanced and continuous (avoid asymmetric plane cuts)
- Minimize unmatched parasitics near connector/magnetics transitions
Pass criteria
BER no longer tracks common-mode stress trends; counter slope remains ≤ X across defined EMI/noise operating scenarios.
Killer 5 · Reference Plane Cuts (broken return path)
Symptoms
A short route still fails; issues cluster near a split plane, connector keep-out, or a high-noise zone; stability changes with load transients.
Quick check
- Geometry audit: does the pair cross any plane split, gap, or moat?
- Hotspot audit: proximity to DC/DC, clocks, and high di/dt returns
- A/B test: load step vs counter slope correlation
Fix
- Restore a continuous reference plane under the pair
- Bridge return paths when crossing domains (stitching/return bridges)
- Move sensitive routing away from high-noise return regions
Pass criteria
Error slope becomes insensitive to load transients; worst-case counter slope ≤ X within window Y across defined operating modes.
Killer 6 · Connector / Magnetics Parasitics (board-edge discontinuity)
Symptoms
Different connectors/magnetics builds shift BER; insertion/removal or mechanical state changes error clusters; the board works without the final front-end stack.
Quick check
- A/B BOM: compare vendor or footprint variants under the same PRBS recipe
- Placement A/B: compare “PHY-to-front-end length” across revisions
- Trend check: do errors follow mechanical state changes?
Fix
- Place the front-end close to the PHY; keep transitions short and symmetric
- Reduce discontinuities at board edge: pads, vias, and geometry transitions
- Ensure consistent return paths around the connector/magnetics region
Pass criteria
BOM variants converge within X BER/counter limits; mechanical state no longer triggers bursts; sustained slope ≤ X over window Y.
Minimal SI check set (field-friendly)
- Baseline evidence: record PCS/PMA error counter slope at a fixed load.
- Self-proof: local loopback + PRBS/BERT (if supported) to isolate PHY vs channel.
- Rate cliff: step through the speed ladder; note the first unstable rate.
- Neighbor stress: compare neighbors OFF vs saturated; check correlation.
- Thermal/power A/B: test the same recipe at temperature extremes and with supply stress.
- Geometry audit: vias/stubs/test pads; pair symmetry; plane cuts; edge transitions.
- Revision/BOM A/B: connector/magnetics variants under identical PRBS recipes.
Binary isolation flow (fastest path to the dominant killer)
- PHY self-proof: loopback/PRBS fails → return to bring-up and clocking (Rate Bring-up Flow; Clocking & Jitter Budget).
- Find the cliff: first unstable rate defines the margin regime.
- Neighbor A/B: strong correlation → crosstalk / injected noise branch.
- Thermal/power A/B: strong sensitivity → thin margin branch.
- Map to a killer: apply the matching card fix and re-run the same recipe.
- Lock pass criteria: counter slope ≤ X within window Y across defined stress cases.
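The isolation flow above is essentially a short decision chain plus a ladder scan. The sketch below shows one way to encode it; the function names, the returned branch labels, and the per-rate slope dictionary are hypothetical, and the branch order simply follows the list.

```python
from typing import Dict, List, Optional

# Speed ladder from the text, lowest to highest.
RATE_LADDER: List[str] = ["10M", "100M", "1G", "2.5G", "5G", "10G"]


def first_unstable_rate(slope_by_rate: Dict[str, float], limit: float) -> Optional[str]:
    """Return the lowest tested rate whose counter slope exceeds the limit."""
    for rate in RATE_LADDER:
        if rate in slope_by_rate and slope_by_rate[rate] > limit:
            return rate
    return None


def isolation_branch(self_proof_ok: bool,
                     neighbor_correlated: bool,
                     thermal_power_sensitive: bool) -> str:
    """Pick the next branch to investigate, in the order given by the flow."""
    if not self_proof_ok:
        return "bring-up and clocking"
    if neighbor_correlated:
        return "crosstalk / injected noise"
    if thermal_power_sensitive:
        return "thin margin"
    return "map to a killer card"
```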
Diagram: SI failure map (symptoms ↔ physical causes ↔ fastest checks)
Use the matrix to pick a dominant physical branch quickly. Apply a fix, then re-run the same PRBS/counter recipe to confirm margin recovery.
Adaptive EQ & Polarity / Pair Swap
Adaptive EQ improves robustness against smooth channel loss and ISI, but it is not a substitute for correct layout.
Reflections, crosstalk, and mode conversion can push EQ into “thin-margin stability” where training completes but BER remains fragile.
EQ blocks (concept-level mapping)
- TX FIR: pre-emphasis for predictable loss/ISI; helps when attenuation is smooth and consistent.
- RX CTLE: high-frequency boost to counter frequency roll-off; complements TX FIR.
- DFE: cancels post-cursor ISI; effective when ISI is dominant and stable.
- AGC: normalizes amplitude; does not correct reflections or timing uncertainty.
Capability vs Limit (keep the boundary explicit)
Capability
- Recovers margin against smooth insertion loss and stable ISI
- Improves tolerance to process/temperature spread when the channel model is consistent
- Helps hit rate ladders when layout is already “correct-by-construction”
Limit
- Cannot “fix” strong reflections from discontinuities; may converge but remain burst-prone
- Cannot cancel crosstalk injected by neighbor activity; margin becomes load-dependent
- Cannot undo mode conversion; imbalance can couple EMI into BER
Configuration strategy (avoid “EQ as magic”)
Default adaptive mode
- Best when channel variation is expected (temperature, assembly, port mix)
- Use when robustness is prioritized over strict determinism
- Requires monitoring: retrain events, tap saturation, and counter slopes
Fixed preset mode
- Best when channel is controlled and repeatable (production determinism)
- Use to prevent “thin-margin stability” caused by misleading adaptation
- Validate with a small preset scan (2–3 options) and lock the winner
A/B validation recipe (minimal, repeatable)
- Fix rate + channel + load. Record baseline counter slope.
- Mode A: adaptive EQ. Record training time, retrain count, and slope.
- Mode B: fixed preset(s). Record the same fields under the same load.
- Stress: neighbor activity and temperature A/B. Repeat measurements.
- Choose the configuration that keeps slope ≤ X and retrain ≤ X per hour.
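The final selection step of the recipe can be sketched as a filter followed by a tie-break. The names below (`EqResult`, `choose_eq_config`) are illustrative, and the sketch assumes each result already holds the worst case observed across the neighbor and temperature stress runs.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class EqResult:
    name: str                 # e.g. "adaptive" or "preset-2" (hypothetical labels)
    counter_slope: float      # worst-case errors per minute across stress runs
    retrains_per_hour: float  # worst-case retrain rate across stress runs


def choose_eq_config(results: List[EqResult],
                     max_slope: float,
                     max_retrains: float) -> Optional[EqResult]:
    """Return the passing configuration with the lowest slope, or None.

    Ties on slope are broken by the lower retrain rate.
    """
    passing = [r for r in results
               if r.counter_slope <= max_slope and r.retrains_per_hour <= max_retrains]
    if not passing:
        return None
    return min(passing, key=lambda r: (r.counter_slope, r.retrains_per_hour))
```

Returning `None` when nothing passes is deliberate: it signals that the channel needs board-level work rather than another preset.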
Polarity / Pair swap (bring-up safety net, not a cure)
- Polarity swap: corrects differential inversion when wiring/placement flips the pair.
- Pair swap / lane mapping: helps recover from routing order mistakes but can mask deeper SI issues if used blindly.
- Validation rule: after enabling swap, re-run PRBS and confirm counter slope does not worsen under neighbor stress.
- Production rule: lock mapping deterministically; log mapping state for field forensics.
Diagram: EQ capability boundary (impairments → blocks → outcomes)
Treat EQ as a bounded compensator: if stability depends on a narrow preset or collapses with neighbor stress, the channel discontinuity branch should be fixed at the board level.
EEE (802.3az) Power Save
EEE is a frequent root cause of “rare flaps” and latency spikes in industrial deployments because LPI entry/exit interacts with partner behavior, rate ladders,
and bursty traffic. This section turns EEE into an executable decision-and-validation workflow.
Scope guard
This section focuses on EEE engineering behavior: negotiation consistency, LPI entry/exit evidence, exit-latency budget, and validation matrices.
TSN time windows and PTP scheduling are handled on their dedicated pages.
Engineering model (what matters for stability)
- LPI entry: idle detection triggers the PHY to enter Low Power Idle.
- LPI maintain: the link stays mostly “electrically quiet,” with periodic refresh signaling that the partner expects to be compliant.
- LPI exit: wake-up takes time; exit latency consumes system budget and can create burst loss or jitter.
- Stability risk: bursty traffic causes frequent sleep/wake cycles; partner implementations differ by vendor/firmware.
Pitfall 1 · Partner negotiation mismatch (EEE ability resolution)
Symptoms
One switch works while another flaps; behavior changes after a partner firmware update; issues are partner-model specific.
Quick check
- Log local + partner EEE advertised abilities and the resolved EEE mode
- A/B: EEE OFF vs ON under the same traffic recipe
- Track link flap count while keeping rate fixed
Fix
- Enforce a consistent policy: disable EEE by default unless partner is known-good
- Whitelist partner models/firmware where EEE is validated
- Lock rate modes where partner resolution is stable
Pass criteria
Across the defined partner matrix, flap rate ≤ X per hour and resolved EEE mode remains consistent for window Y.
Pitfall 2 · Exit latency exceeds budget (rate-dependent)
Symptoms
Latency spikes (p99/p999) without obvious BER rise; burst drops during wake; issues appear only at specific speeds.
Quick check
- A/B: EEE OFF vs ON; compare latency distribution (p99/p999)
- Record LPI enter/exit counts and correlate with spikes
- Repeat per-rate to find the first rate where budget breaks
Fix
- Disable EEE for critical ports or for the offending rate(s)
- Adjust EEE thresholds/policies to reduce frequent sleep/wake cycles
- Prefer deterministic behavior over marginal power savings in control loops
Pass criteria
Under the defined burst/idle traffic model, p99 latency ≤ X and p999 latency ≤ X; no wake-related drop bursts in window Y.
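The latency pass criteria can be checked with a nearest-rank percentile over captured samples. This is a minimal sketch: the nearest-rank definition, the function names, and the microsecond unit are assumptions, and any monitoring tool with a different percentile definition will give slightly different tails.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], p: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least p% at or below it."""
    if not samples:
        raise ValueError("no samples")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    # round() strips floating-point noise before taking the ceiling.
    rank = max(1, math.ceil(round(p * len(ordered) / 100.0, 9)))
    return ordered[rank - 1]


def latency_within_budget(samples_us: Sequence[float],
                          p99_limit_us: float,
                          p999_limit_us: float) -> bool:
    """True when both tail percentiles stay within their limits."""
    return (percentile(samples_us, 99) <= p99_limit_us
            and percentile(samples_us, 99.9) <= p999_limit_us)
```

Running the same check on the EEE OFF and EEE ON captures, per rate, shows directly at which rate the exit latency starts to consume the budget.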
Pitfall 3 · Metric accounting artifact (LPI time misread as “idle”)
Symptoms
Monitoring shows low utilization, but users observe stalls; “idle ratio” disagrees with packet captures; false alarms appear during power-save.
Quick check
- Separate “LPI duration” from “true idle time” in counters/logs
- Align sampling window and denominators across tools
- Re-run the same traffic replay with EEE OFF to compare reporting
Fix
- Standardize metric definitions: window length, denominators, and “idle” semantics
- Log LPI enter/exit counts to contextualize utilization metrics
- Use counter slope + packet evidence as primary stability indicators
Pass criteria
For the same traffic replay, monitoring agrees with packet evidence within error ≤ X; no EEE-driven false alarms in window Y.
Pitfall 4 · Thin-margin wake under temperature / power droop
Symptoms
Flaps increase at hot/cold corners; failures correlate with load steps or brownout; appears like SI until EEE is disabled.
Quick check
- Correlate LPI exit events with voltage/temperature alarms (if available)
- Repeat A/B across temperature and controlled supply droops
- Check whether wake events trigger error counter bursts
Fix
- Disable EEE in harsh corners or for ports with tight stability requirements
- Improve supply margin and reset/strap stability across wake cycles
- Enable a fallback policy: auto-disable EEE on repeated wake-related faults
Pass criteria
Across temperature/power stress matrix, wake-related fault events ≤ X and flap rate ≤ X per hour over window Y.
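The auto-disable fallback mentioned in the fix list can be sketched as a rolling-window fault counter that latches EEE off. The class name, thresholds, and the latching behavior are assumptions for illustration; actually writing the disable to the PHY is left to the platform.

```python
from collections import deque
from typing import Deque


class EeeFallback:
    """Latch EEE off when wake-related faults exceed a limit within a window."""

    def __init__(self, max_faults: int, window_s: float) -> None:
        self.max_faults = max_faults
        self.window_s = window_s
        self.eee_allowed = True
        self._faults: Deque[float] = deque()

    def record_wake_fault(self, now_s: float) -> bool:
        """Record one wake-related fault; return whether EEE is still allowed."""
        self._faults.append(now_s)
        # Drop faults that have aged out of the rolling window.
        while self._faults and now_s - self._faults[0] > self.window_s:
            self._faults.popleft()
        if len(self._faults) > self.max_faults:
            # Latched: stays off until an operator or policy re-enables it.
            self.eee_allowed = False
        return self.eee_allowed
```

Latching rather than automatically re-enabling avoids oscillating between modes in the field, at the cost of some power savings until the port is reviewed.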
Validation recipe (repeatable)
- Fix the rate and partner device; record baseline counters and latency distribution.
- Mode A: EEE OFF. Run three traffic models: steady load, burst/idle, periodic idle.
- Mode B: EEE ON. Repeat the same traffic models with identical sampling windows.
- Apply stress: temperature corners and controlled supply droop (if applicable).
- Log: LPI enter/exit, flap count, error counter slope, p99/p999 latency.
- Decision: enable only if stability remains within X under the entire test matrix.
Partner + rate matrix (why “one lab pass” is not enough)
- Partners: switch models/firmware, gateway uplink ports, and mixed-vendor environments.
- Rates: test each supported rate independently (1G/2.5G/5G/10G as applicable).
- Traffic: steady + bursty + periodic idle; measure counter slope and latency tails.
- Outcome: enable EEE only for the validated subset (partners + rates + policies).
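The "validated subset" outcome can be derived mechanically from the test matrix. In this sketch, a partner and rate pair is allowed only if all three traffic models were run and none failed; the tuple layout, the traffic model labels, and the function name are hypothetical.

```python
from typing import Dict, Iterable, Set, Tuple

# Traffic models named in the validation recipe (labels are illustrative).
REQUIRED_TRAFFIC: Set[str] = {"steady", "burst_idle", "periodic_idle"}

# One matrix entry: (partner, rate, traffic_model, passed)
Result = Tuple[str, str, str, bool]


def eee_allowlist(results: Iterable[Result]) -> Set[Tuple[str, str]]:
    """Return the (partner, rate) pairs where EEE passed every required model."""
    passed: Dict[Tuple[str, str], Set[str]] = {}
    failed: Set[Tuple[str, str]] = set()
    for partner, rate, traffic, ok in results:
        key = (partner, rate)
        if ok:
            passed.setdefault(key, set()).add(traffic)
        else:
            failed.add(key)
    # A pair qualifies only with no failures and full traffic-model coverage.
    return {key for key, models in passed.items()
            if key not in failed and REQUIRED_TRAFFIC <= models}
```

Pairs that were only partially tested are excluded, so an incomplete matrix defaults to EEE off rather than on.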
Diagram: EEE enable decision flow
Enable EEE only when the validated partner+rate subset stays within X thresholds under burst/idle traffic and harsh-corner stress.
MDIO/MMD, Straps & Firmware Hooks
Observability determines whether field issues can be reproduced and fixed. This section defines a minimum debug set:
resolved abilities, AN/training state, EQ visibility, counter slope evidence, power-save events, and a black-box log schema.
Scope guard
This is not an MDIO standard tutorial and does not enumerate vendor register addresses. The goal is an engineering minimum set
that must be exposed through firmware and logs. Remote management protocols (LLDP/NETCONF) are handled elsewhere.
Engineering view: Clause 22 vs Clause 45 / MMD
- Clause 22: baseline access for core link status and basic counters.
- Clause 45 / MMD: extended device model required for training/EQ visibility, deeper error evidence, and feature states (EEE, sensors).
- Rule: expose the minimum set through a stable firmware API and persist snapshots for field forensics.
Minimum set #1 · Link & Resolved Mode
- Link up/down + sticky link-change flag
- Resolved speed / duplex / master-slave (if applicable)
- Local advertised abilities (rate + features)
- Partner advertised abilities (rate + features)
- Resolved mode snapshot (the “final answer”)
- Remote fault / local fault indicators
Why it matters: negotiation evidence prevents misdiagnosing partner incompatibility as SI.
Minimum set #2 · AN / Training Status
- AN enable + AN complete
- AN failure reason (if provided) + retry counter
- Training start/done/fail (if supported)
- Retrain count + last retrain cause (if supported)
- Time-to-link (from reset to link-up) snapshot
Why it matters: separates “can’t negotiate” from “negotiates but has thin margin.”
Minimum set #3 · EQ Visibility
- EQ mode: adaptive vs fixed preset
- Selected preset/profile ID (if present)
- Tap saturation / limit flags (CTLE/FIR/DFE)
- Equalization “converged” flag (if present)
- Polarity / pair swap state (as seen by PHY)
Why it matters: a “tap at the limit” signature indicates channel fixes are required.
Minimum set #4 · Error Evidence (use slopes, not single reads)
- PCS-layer error counters (code/symbol/blocks as applicable)
- PMA/PMD-layer error counters (if exposed)
- MAC CRC/FCS error counters (if available in the stack)
- Counter snapshot + sampling window duration
- Computed counter slope (errors per minute, or errors per gigabit transferred)
Why it matters: slope evidence distinguishes burst issues from steady margin loss.
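The slope computation can be sketched as follows. This is an illustrative example, not a vendor routine: it assumes a free-running counter that wraps modulo 2^width (some PHY counters are clear-on-read or saturating instead, which would need different handling), and the names `CounterSample` and `counter_slope_per_min` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CounterSample:
    t_ms: int   # monotonic timestamp in milliseconds
    value: int  # raw counter read


def counter_slope_per_min(start: CounterSample, end: CounterSample,
                          width_bits: int = 16) -> float:
    """Errors per minute between two reads, tolerating a single counter wrap.

    Assumes a free-running counter that wraps modulo 2**width_bits.
    """
    window_ms = end.t_ms - start.t_ms
    if window_ms <= 0:
        raise ValueError("sampling window must be positive")
    delta = (end.value - start.value) % (1 << width_bits)
    return delta * 60_000 / window_ms
```

Logging both samples alongside the slope keeps the result auditable, because the window duration is part of the evidence.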
Minimum set #5 · EEE / Power Events (aligns with the EEE section)
- EEE enable + resolved EEE mode
- LPI enter/exit counters
- Wake failure / wake retry flags (if provided)
- EEE-related interrupt/event flags
Why it matters: correlates latency spikes and flaps with LPI transitions.
Minimum set #6 · Interrupts, Sensors & Sticky Faults
- Interrupt cause register (must be readable and logged)
- Sticky fault bits (remote fault, training fail, wake fail)
- Temperature snapshot (if supported)
- Voltage/brownout snapshot (if supported)
- Flap counter (rolling window)
Why it matters: captures the “moment of failure” for field forensics.
Firmware event hooks (capture the critical transitions)
- Link down/up, speed change, and duplex change events
- AN complete / AN fail / remote fault events
- Training start/done/fail, retrain cause events (if supported)
- EEE enter/exit storms and wake failure events
- Error spike trigger: slope exceeds threshold X within window Y
- All events must include timestamps and a snapshot of the minimum debug set
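The error-spike trigger and the "timestamp plus snapshot" rule can be sketched together. This is a minimal illustration under assumptions: `make_spike_hook` and the `read_snapshot` callback are hypothetical names, and the snapshot contents depend on what the firmware actually exposes.

```python
import time
from typing import Callable, Optional


def make_spike_hook(threshold_per_min: float,
                    read_snapshot: Callable[[], dict]):
    """Build a hook that emits an event when the slope exceeds the threshold."""

    def on_slope(slope_per_min: float,
                 now: Optional[float] = None) -> Optional[dict]:
        if slope_per_min <= threshold_per_min:
            return None  # below threshold: no event
        return {
            "event": "error_spike",
            "timestamp": time.time() if now is None else now,
            "slope_per_min": slope_per_min,
            "snapshot": read_snapshot(),  # minimum debug set at trigger time
        }

    return on_slope
```

Taking the snapshot inside the hook, rather than later in a logging thread, keeps the captured state close to the moment of failure.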
Black-box log schema (field maintainability)
- Core fields: timestamp, port_id/phy_id, speed/duplex, partner_ability_hash, resolved_mode
- State evidence: AN_state, training_state, eq_mode/preset, polarity/pair_swap_state
- EEE evidence: eee_enabled/resolved, lpi_enter_count, lpi_exit_count, wake_fail_flags
- Error evidence: counters_snapshot, window_ms, counter_slope, flap_count_rolling
- Health snapshot: temperature, voltage/brownout (if available), interrupt_cause, sticky_faults
Sampling rule: periodic snapshots every T seconds plus event-driven snapshots on link changes, retrain events, and slope spikes.
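The schema above maps naturally onto a typed record serialized as one JSON line per snapshot. The sketch below is one possible shape, not a prescribed format: the field types, units (`speed_mbps`, `temperature_c`, `voltage_mv`), and default values are assumptions added for illustration.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class BlackBoxRecord:
    # Core fields
    timestamp: float
    port_id: int
    speed_mbps: int
    duplex: str
    partner_ability_hash: str
    resolved_mode: str
    # State evidence
    an_state: str = "unknown"
    training_state: str = "unknown"
    eq_mode: str = "unknown"
    pair_swap_state: str = "unknown"
    # EEE evidence
    eee_enabled: bool = False
    lpi_enter_count: int = 0
    lpi_exit_count: int = 0
    wake_fail_flags: int = 0
    # Error evidence
    counters_snapshot: dict = field(default_factory=dict)
    window_ms: int = 0
    counter_slope: float = 0.0
    flap_count_rolling: int = 0
    # Health snapshot (None when the PHY does not expose the sensor)
    temperature_c: Optional[float] = None
    voltage_mv: Optional[int] = None
    interrupt_cause: int = 0
    sticky_faults: int = 0


def to_log_line(record: BlackBoxRecord) -> str:
    """Serialize one record as a compact single-line JSON string."""
    return json.dumps(asdict(record), sort_keys=True, separators=(",", ":"))
```

Sorted keys and a single line per record make periodic and event-driven snapshots easy to diff and grep in the field.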
Diagram: Observability stack (Counters → Events → Logs)
The minimum set is designed for reproducibility: if a failure cannot be expressed as mode + state + slope + event, it will be hard to debug in the field.
Test & Bring-up: Loopback / PRBS / BERT
A bring-up test is only useful when it is reproducible and isolating. This section defines a fixed ladder:
segment with loopbacks, quantify with PRBS/BER slopes, then stress with throughput and corner conditions.
Scope guard
Focus is on what each loopback isolates, how to run PRBS for evidence, and how to build a golden sequence.
This section does not cover specific ATE brands or scripting details.
Test layer map (how each tool isolates)
- Loopback: removes the channel/partner and collapses the problem to a segment.
- PRBS: removes payload semantics; evidence is BER or error slope over window Y.
- BERT / Throughput stress: verifies stability under realistic load and buffering behavior.
- Golden sequence: a fixed, rate-by-rate ladder with snapshots and pass criteria X for bring-up and production transfer.
Recipe · PCS loopback (digital-side isolation)
Purpose
Isolate MAC/PCS behavior from PMA/channel effects. A failure here indicates a digital-interface, configuration, or PCS-side issue.
Setup
- Fix rate and interface mode; capture resolved mode snapshot
- Disable external variables (EEE, partner features) for baseline
- Define sampling window Y for counters
Steps
- Enter PCS loopback
- Transmit a known traffic stream or PRBS source (if supported at PCS)
- Read counters at start/end; compute slope over window Y
Pass criteria
Error counter slope ≤ X per minute over window Y; no link-state anomalies during the run.
Recipe · PMA loopback (analog-side isolation)
Purpose
Validate PMA-side datapath and clock recovery behavior without requiring a full external channel. Useful for separating PCS vs PMA sensitivity.
Setup
- Capture refclk lock and PHY internal lock status (if available)
- Record EQ mode/preset used for the run
- Define a fixed PRBS pattern and run duration
Steps
- Enter PMA loopback
- Enable PRBS generator/checker (or equivalent test mode)
- Measure BER or compute counter slope over window Y
Pass criteria
BER ≤ X (or slope ≤ X) at the target rate, with stable lock indicators across duration Y.
Recipe · Remote loopback (end-to-end path)
Purpose
Validate the complete link including channel, connector effects, and partner behavior. This is the “system truth” before production transfer.
Setup
- Record partner identity/version snapshot and resolved abilities
- Fix traffic model (steady + burst/idle) and define window Y
- Capture EQ and training status for correlation
Steps
- Enable remote loopback (or partner-assisted loopback mode)
- Run PRBS/BERT or controlled traffic replay
- Apply stress: temperature corner and controlled supply droop (if applicable)
- Log counters and slope evidence; note any retrain or flap events
Pass criteria
End-to-end BER/CRC slope ≤ X over window Y across the defined partner + rate matrix; no unexpected retrains.
PRBS pattern selection (engineering rule)
- Quick screen: use a short pattern to detect gross margin loss fast.
- Deep confidence: use a longer pattern to catch low-probability errors that appear only after long runs.
- Production transfer: keep a stable pattern set so results are comparable across stations and shifts.
- Rule: every BER conclusion must be tied to rate, partner snapshot, EQ mode, and sampling window.
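To make "short" versus "long" concrete: a PRBS of order n repeats every 2^n - 1 bits, so PRBS7 repeats after 127 bits while PRBS31 repeats after roughly 2.1 billion. The sketch below generates such sequences in software with a Fibonacci LFSR. It is illustrative only: PRBS7 (x^7 + x^6 + 1) and PRBS31 (x^31 + x^28 + 1) are common choices, but which patterns a given PHY supports is device-specific, and `prbs_bits` is a hypothetical helper name.

```python
from typing import Iterator


def prbs_bits(order: int, tap: int, count: int, seed: int = 1) -> Iterator[int]:
    """Yield `count` bits from a Fibonacci LFSR with feedback taps at
    stages `order` and `tap` (for example order=7, tap=6 for PRBS7)."""
    mask = (1 << order) - 1
    state = seed & mask
    if state == 0:
        raise ValueError("seed must be non-zero, or the LFSR stays stuck at 0")
    for _ in range(count):
        new_bit = ((state >> (order - 1)) ^ (state >> (tap - 1))) & 1
        state = ((state << 1) | new_bit) & mask
        yield new_bit


# PRBS7 repeats every 2**7 - 1 = 127 bits.
prbs7 = list(prbs_bits(order=7, tap=6, count=254))
```

A checker on the receive side can run the same generator and count mismatches, which is the software analogue of the PHY's built-in PRBS checker.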
BER accounting (use windowed slopes)
- Window Y: define a fixed sampling window and log start/end snapshots.
- Slope: compute errors per minute (or per gigabit transferred) to avoid single-read noise.
- Threshold X: BER ≤ X and/or error slope ≤ X within window Y.
- Context: log temperature/voltage and EEE state to explain corner-only failures.
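When choosing window Y for a BER claim, a common statistical rule of thumb helps: with zero observed errors, demonstrating BER ≤ X at confidence level CL requires about -ln(1 - CL) / X bits, which is roughly 3 / X at 95%. The sketch below applies that rule. It assumes independent bit errors, and the function names are hypothetical.

```python
import math


def bits_for_zero_error_confidence(ber_limit: float,
                                   confidence: float = 0.95) -> float:
    """Bits that must pass error-free to claim BER <= ber_limit."""
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be between 0 and 1")
    return -math.log(1.0 - confidence) / ber_limit


def run_seconds(ber_limit: float, rate_bps: float,
                confidence: float = 0.95) -> float:
    """Minimum error-free run time at the given line rate."""
    return bits_for_zero_error_confidence(ber_limit, confidence) / rate_bps


# Example: BER <= 1e-12 at 10 Gb/s needs about 300 s error-free at 95%.
seconds_at_10g = run_seconds(1e-12, 10e9)
```

The same BER target takes ten times longer at 1 Gb/s, which is why windows should be defined per rate rather than reused blindly.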
Golden test sequence (rate-by-rate ladder)
- Phase 0: reset release, strap snapshot, refclk ready/lock snapshot
- Phase 1: AN ability exchange; record resolved mode (rate/duplex/features)
- Phase 2: PCS loopback baseline; compute error slope over window Y
- Phase 3: PMA loopback baseline; run PRBS; verify BER ≤ X
- Phase 4: remote loopback or full-through; validate end-to-end BER/CRC slope
- Phase 5: throughput + burst/idle stress; capture latency tails if available
- Phase 6: EEE A/B (if enabled); verify no flap or latency budget violation
- Phase 7: temperature corner run; repeat the minimal subset of Phases 4–5
- Phase 8: generate a report pack: snapshots + counters + slopes + verdict
Production transfer tip: keep the same ladder and windows; only shorten duration where confidence remains above threshold X.
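The ladder can be expressed as an ordered list of steps that each return a verdict and a snapshot, stopping at the first failure so the failing phase is unambiguous. This is a structural sketch only: `run_ladder` and the phase callables are hypothetical, and the real steps would wrap the PHY's loopback and PRBS controls.

```python
from typing import Callable, List, Tuple

# Each phase is (name, step); step() returns (passed, snapshot).
Phase = Tuple[str, Callable[[], Tuple[bool, dict]]]


def run_ladder(phases: List[Phase]) -> dict:
    """Run phases in order, stop at the first failure, return a report pack."""
    report = {"verdict": "pass", "phases": []}
    for name, step in phases:
        ok, snapshot = step()
        report["phases"].append({"name": name, "ok": ok, "snapshot": snapshot})
        if not ok:
            report["verdict"] = "fail"
            report["failed_phase"] = name
            break
    return report
```

Because every phase contributes a snapshot even on success, the report doubles as the baseline pack for later comparison.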
Diagram: Test ladder (from isolate → quantify → stress → sign-off)
The ladder is designed for station-to-station repeatability: each step produces a snapshot that either isolates a segment or quantifies margin.
Production & Reliability
Passing a lab demo is not reliability. Production-ready Ethernet PHY design requires baseline capture, corner screening strategy,
and a field triage loop that turns symptoms into quantified margin evidence.
Scope guard
This section covers production transfer and reliability evidence. IEC standard details are handled on the protection page.
No vendor-specific production equipment brands are discussed here.
Reliability variables (what shifts margin over time)
Temperature corners
Typical shift: timing and analog margin shrink; retraining becomes more frequent. Evidence: error slope rises only at hot/cold.
Acceptance: slope ≤ X in window Y at the corner rate(s).
Supply noise / droop
Typical shift: PLL/CDR margin thins; wake/training instability appears. Evidence: lock/alarm correlation and burst error spikes.
Acceptance: no droop-correlated retrain bursts; slope ≤ X.
Degradation after ESD/surge exposure
Typical shift: not a hard failure, but margin reduces; errors become “more likely” weeks later. Evidence: baseline vs current slope delta.
Acceptance: performance remains within baseline ± X under the golden minimal subset.
Aging / mechanical stress
Typical shift: connectors and solder joints introduce intermittent impedance changes. Evidence: rare bursts tied to vibration/handling.
Acceptance: burst rate ≤ X and no sustained slope increase across window Y.
Production coverage strategy (full test vs sampling)
- Full test must catch: gross margin loss, configuration drift, and immediate instability signatures.
- Sampling is suited for: long-duration patterns, deep corners, and extended confidence runs.
- Transfer rule: reuse the golden ladder with fixed windows; shorten duration only when confidence stays above threshold X.
- Record: resolved mode, error slope, time-to-link, and any retrain/flap events for every unit class.
Boundary condition screening (low V / high T)
Corner screening is where thin margin becomes visible. A unit that is “clean” at room conditions may fail only under low supply or high temperature,
which predicts field returns. The screening goal is not maximum stress, but repeatable exposure that separates robust units from thin ones.
Acceptance (placeholders)
At low V / high T, error slope ≤ X over window Y and time-to-link ≤ X with no unexpected retrain bursts.
Degradation detection (not dead, but fragile)
- Signal: rare bursts increase, retrains become frequent, or failures appear only under load/corners.
- Fastest evidence: compare current counter slopes and time-to-link against the golden baseline pack.
- Minimal re-test: run the golden minimal subset (PRBS baseline + burst throughput) at the failing rate.
- Decision: classify as channel/SI shrink, clock/jitter sensitivity, or environment/power coupling based on correlation.
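The baseline comparison can be sketched as a per-metric relative delta check. This is an illustrative example under assumptions: the metric names are hypothetical, every metric is treated as "higher is worse", and a zero baseline flags any non-zero current value.

```python
def baseline_deltas(baseline: dict, current: dict,
                    tolerance_pct: float) -> dict:
    """Return metrics whose increase over baseline exceeds tolerance_pct.

    Assumes every metric is 'higher is worse' (slopes, time-to-link, retrains).
    """
    flagged = {}
    for name, base in baseline.items():
        now = current.get(name)
        if now is None:
            continue  # metric not captured in the current run
        if base == 0:
            if now > 0:
                flagged[name] = float("inf")
            continue
        delta_pct = (now - base) / base * 100.0
        if delta_pct > tolerance_pct:
            flagged[name] = delta_pct
    return flagged
```

An empty result means the unit is still within the baseline envelope; any flagged metric points at which correlation to investigate first.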
Field failure triage (3 steps)
Step 1 · Symptom
Classify: no link / link flaps / only one rate fails / load-only errors / corner-only errors.
Step 2 · Fastest isolate
Use loopbacks and windowed counter slopes to collapse the issue to digital-side, analog-side, channel, partner, or environment.
Step 3 · Confirm metric
Produce a quantified conclusion: BER ≤ X, slope ≤ X, time-to-link ≤ X, and retrain/flap ≤ X over window Y.
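The isolate step can be written as a small decision function over loopback results. This is a sketch of the logic described above, not a complete triage tool: `isolate_segment` is a hypothetical name, and the partner-swap input assumes a known-good partner is available.

```python
from typing import Optional


def isolate_segment(pcs_ok: bool, pma_ok: bool,
                    remote_ok: Optional[bool],
                    partner_swap_ok: Optional[bool] = None) -> str:
    """Collapse loopback results to the most likely failing segment.

    remote_ok is None when remote loopback was not run.
    partner_swap_ok is the end-to-end result after swapping in a
    known-good partner, or None if no swap was tried.
    """
    if not pcs_ok:
        return "digital-side"
    if not pma_ok:
        return "analog-side"
    if remote_ok is None:
        return "inconclusive: remote loopback not run"
    if remote_ok:
        return "environment or load-dependent"
    if partner_swap_ok is None:
        return "channel or partner"
    return "partner" if partner_swap_ok else "channel"
```

When all loopbacks pass but the field symptom persists, the remaining suspects are load and environment, which is where corner and burst tests take over.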
Diagram: Reliability loop (Design → Test → Field → Feedback)
Reliability is a closed loop: baseline evidence enables production screening, and field logs turn failures into actionable updates.
Engineering Checklist (Design → Bring-up → Production)
This chapter converts the previous content into action items that can be executed, proven (evidence), and accepted (pass criteria). Each gate outputs a reusable “baseline pack” for debugging and production.
Gate 1 — Design Gate (prevent the expensive failures)
1) Mode / Rate / MAC-Interface Freeze
Checklist
- Freeze the target rate set (10/100/1G/2.5G/5G/10G) and the required fallback behavior per port.
- Freeze MAC-side interface and clocking constraints (e.g., RGMII/SGMII/2500BASE-X/USXGMII).
- Freeze strap defaults + reset sequencing (what must be correct before any MDIO writes).
- Define “link-up acceptance”: time-to-link (X), no flap within Y minutes, and counter slope ≤ X.
Evidence (keep forever)
- Mode matrix (rate × cable/board class × partner type × EEE on/off).
- Boot strap table + power/reset timing diagram + “MDIO write-after-reset” rules.
- Register snapshot template: ability, AN status, training status, EQ status, error counters.
Pass criteria (placeholders)
• Link-up time ≤ X ms; • No flap in Y min; • Error slope ≤ X / min
2) Reference Clock & Jitter Ownership (make “low-jitter” measurable)
Checklist
- Define refclk source, distribution, and measurement points (CLKIN, CLKOUT/recovered clock if available).
- Allocate jitter contributors: refclk, PLL, supply-noise modulation, coupling injection.
- Define acceptance by correlation: BER/counter slope versus refclk configuration and corner conditions.
Example MPNs (oscillators)
- SiTime SiT8918BA-28-33N-25.000000 — 25 MHz oscillator option for PHY refclk.
- SiTime SiT9120 (family) — differential oscillator family often used for low-jitter clocking; pick a standard frequency variant per design.
Pass criteria (placeholders)
• Refclk jitter ≤ X (RMS); • BER/CRC slope does not worsen beyond X under corner sweep
3) Observability Contract (counters / events / logs must be available)
Checklist
- Expose the minimum MDIO/MMD set: partner ability, AN status, training status (if any), EQ status (if any), error counters, interrupts.
- Define a black-box field log schema: timestamp, rate, link state, flap count, error counters, temperature/voltage alarms (if provided).
- Standardize metric definitions: windowing, denominators, reset behavior, and “good/bad” thresholds.
Example MPNs (PHYs used in industrial systems)
- TI DP83867IR — 10/100/1000 PHY family example for robust industrial links.
- Microchip KSZ9131RNXI — 10/100/1000 PHY example with RGMII timing options.
- Marvell 88E2010 / 88E2110 — 10M/100M/1G/2.5G/5G multi-gig PHY family examples.
- Marvell 88X3310P — 1G/2.5G/5G/10GBASE-T PHY family example (10G class).
- Aquantia AQR107-B0-EG-Y (or AQR107-B0-IG-Y) — single-port multi-rate 10G-class PHY example.
Pass criteria (placeholders)
• All required status/counters readable in-field; • Root-cause isolation achievable without scope within X minutes
Gate 2 — Bring-up Gate (repeatable, evidence-based debug)
1) Baseline Pack (Golden ladder, fixed windows)
Checklist
- Run a fixed “Golden ladder” per target rate: reset → AN/ability exchange → PRBS/BER window → throughput burst → EEE toggle → corner sweep.
- Capture counters at consistent timestamps: t0 (post-link), t1 (after PRBS), t2 (after throughput), t3 (after EEE).
- Record time-to-link, training completion (if any), retrain events, and flap counts.
Evidence
- Per-rate baseline report: link-up time, counter slopes, BER windows, and partner identity.
- Register snapshots: AN, training, EQ, and interrupt summary at each ladder milestone.
Pass criteria (placeholders)
• PRBS BER ≤ X over Y seconds; • Counter slope stable ≤ X / min
2) Isolation Sequence (loopbacks first, then the channel)
Checklist
- Execute loopbacks in order: PCS loopback → PMA loopback → remote loopback (if partner supports).
- Only after internal loopbacks pass, attribute failures to the channel/connector/cable environment.
- For “rate-specific instability”, rerun the exact same ladder window at each rate to compare slopes.
Pass criteria (placeholders)
• PCS/PMA loopback passes with BER ≤ X; • Remote loopback stable for Y minutes
Gate 3 — Production Gate (consistency, traceability, fast triage)
1) Minimal Full-Test Coverage (fast but meaningful)
Checklist
- Fix test windows and patterns (PRBS set + throughput burst) and keep them identical across stations.
- Measure the “thin-margin signature”: errors that appear only at specific rates, loads, or EEE transitions.
- Store per-unit summary fields: station ID, FW/strap revision, partner ID, pass/fail reason codes.
Evidence
- Production log line (single-line JSON/CSV) with counters + environment + time.
- Golden unit comparison record per station per shift.
Pass criteria (placeholders)
• Station-to-station delta ≤ X; • False reject ≤ X%; • Escape ≤ X ppm
2) Field Return Entry (3-step triage using counters/logs)
Checklist
- Symptom classify: “no link” vs “flap” vs “rate-specific” vs “load/EEE-triggered”.
- Fastest isolate: re-run internal loopbacks first, then channel PRBS window.
- Confirm degradation: compare today’s slopes against the unit’s original production baseline pack.
Pass criteria (placeholders)
• Root-cause bucket identified within X steps; • Evidence package complete for feedback closure
Diagram: Gate pipeline (Checklist → Evidence → Pass)
Applications & IC Selection Logic
Selection is expressed in engineering terms: rate coverage, controllability (EEE/EQ), and observability (PRBS/loopback/counters). This page does not deep-dive PoE/TSN/PTP.
Application Slices (strongly in-scope)
1) Switch Uplink / Aggregation (multi-rate copper)
- Typical pain: rate-specific instability (2.5G/5G) and “looks fine on bench” failures after integration.
- Must-have hooks: per-rate PRBS windows, loopback isolation, stable counter definitions.
- EEE must be controllable (enable/disable) for interoperability validation.
Example PHY MPNs
- Marvell 88E2010 / 88E2110 — 10M/100M/1G/2.5G/5G PHY family options.
- Marvell 88X3310P — 10G-class PHY option for 1G/2.5G/5G/10G designs.
- Aquantia AQR107-B0-EG-Y — 10G-class multi-rate PHY option (commonly used for 10G copper).
2) Industrial Gateway / Protocol Bridge (field maintainability first)
- Typical pain: “link up but flaps”, or CRC spikes only under burst traffic and temperature corners.
- Must-have hooks: partner ability, AN status, error counters, interrupt summary, black-box logs.
- Bring-up must be fast: fixed “golden ladder” recipe + clear pass criteria.
Example PHY MPNs
- TI DP83867IR — robust 10/100/1000 PHY example used in industrial gateways.
- Microchip KSZ9131RNXI — 10/100/1000 PHY example; useful when RGMII timing options matter.
3) Industrial Edge Compute / IPC (long uptime + corners)
- Typical pain: slow drift (aging/thermal) that reduces margin without immediate hard failure.
- Must-have hooks: long-window error slope monitoring + baseline delta reports.
- Clock quality and power-noise immunity must be verified with repeatable correlation tests.
Example supporting MPNs
- SiTime SiT8918BA-28-33N-25.000000 — 25 MHz oscillator example for refclk.
- Marvell 88E2010 / 88E2110 — multi-gig PHY examples when 2.5G/5G coverage is needed.
4) In-cabinet Backplane / Short Copper Reach (board-driven)
- Typical pain: “board-level SI” dominates; loopbacks help separate digital/analog/channel issues.
- Must-have hooks: internal loopbacks, per-rate PRBS, and deterministic bring-up sequencing.
- Rate upgrade path should be explicit (1G → 2.5G/5G → 10G).
Example PHY MPNs
- Marvell 88X3310P — 10G-class PHY example for high-rate copper links.
- Aquantia AQR107-B0-IG-Y — 10G-class PHY example seen in reference designs.
Selection Logic (engineering gates, not marketing)
Layer 1 — Hard Gates (fail one → reject)
- Rate coverage: required set supported (and future headroom defined).
- MAC-side interface: matches SoC/switch requirements (e.g., RGMII/SGMII/2500BASE-X/USXGMII).
- EEE controllability: explicit enable/disable + verified interoperability plan.
- Debug completeness: PRBS + loopback modes + error counters accessible via MDIO/MMD.
- Environment grade: temperature/EMC expectations match deployment (verify ordering grade).
Example “Hard Gate” MPN mapping
- 10/100/1G class: TI DP83867IR; Microchip KSZ9131RNXI
- 2.5G/5G class: Marvell 88E2010, 88E2110
- 10G class: Marvell 88X3310P; Aquantia AQR107-B0-EG-Y/AQR107-B0-IG-Y
Layer 2 — Scorecard (rank candidates)
Score each item 0/1/2 (0 = missing, 1 = partial, 2 = complete). Prefer candidates that reduce
debug time and protect production stability.
- Observability depth: can EQ/training status be read and correlated to errors?
- Loopback coverage: PCS/PMA/remote loopback modes available and documented?
- Counter semantics: clear windows/denominators; counters survive resets as required?
- EEE behavior: predictable enter/exit latency; partner matrix validation practical?
- Bring-up determinism: time-to-link stable; retrain behavior bounded?
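The two-layer logic (hard gates, then a 0/1/2 scorecard) can be sketched as a filter followed by a sort. The gate and score keys below are hypothetical labels for the items listed above, and the candidate names in any real use would come from the project's own evaluation, not from this example.

```python
HARD_GATES = ("rate_coverage", "mac_interface", "eee_control",
              "debug_complete", "env_grade")
SCORE_ITEMS = ("observability", "loopback", "counter_semantics",
               "eee_behavior", "bringup_determinism")


def rank_candidates(candidates: dict) -> list:
    """Drop candidates failing any hard gate, then rank by scorecard total.

    candidates maps name -> {"gates": {gate: bool}, "scores": {item: 0|1|2}}.
    """
    ranked = []
    for name, entry in candidates.items():
        if not all(entry["gates"].get(g, False) for g in HARD_GATES):
            continue  # fail one gate -> reject
        scores = entry["scores"]
        for item in SCORE_ITEMS:
            if scores.get(item, 0) not in (0, 1, 2):
                raise ValueError(f"{name}: score for {item} must be 0, 1, or 2")
        total = sum(scores.get(item, 0) for item in SCORE_ITEMS)
        ranked.append((total, name))
    # Highest total first; ties broken alphabetically for determinism.
    return [name for total, name in sorted(ranked, key=lambda t: (-t[0], t[1]))]
```

A missing gate entry counts as a failure, so incomplete datasheet evidence cannot silently pass Layer 1.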
Output deliverables (must exist after selection)
- Mode matrix + fixed “Golden ladder” recipe names + per-rate pass thresholds (X/Y).
- MDIO/MMD readout script + black-box log schema + station-to-station reproducibility plan.
Diagram: Selection flow (Hard Gates → Scorecard → Baseline Pack)
Scope guard:
PoE/PoDL, magnetics deep-dive, TSN/PTP timing templates, and protocol stacks are handled by sibling pages.
FAQs (Field Troubleshooting, No Scope Required)
Each FAQ is a closed-form troubleshooting entry: isolate the segment, run a minimal check, apply a reversible fix, and accept using measurable criteria (placeholders: X/Y/N).
Links up at 1G, but 2.5G/5G shows CRC spikes — first check: channel margin or EQ preset?
Likely cause: Thin channel margin at higher rates, or the chosen EQ preset/adaptation lands in a poor operating point.
Quick check: Lock the rate and run PRBS for N bits using preset A/B; compare local loopback vs end-to-end to isolate “internal vs channel”.
Fix: Use the stable preset (or narrow adaptation range); reduce discontinuities on the path and re-baseline PRBS/counters at the target rate.
Pass criteria: CRC ≤ X per 10^6 frames over Y min at 2.5G/5G; PRBS BER ≤ X over N bits; flap count = 0.
Works in lab, fails in cabinet — is it refclk noise injection or return-plane cut?
Likely cause: Refclk/supply noise couples into PLL/CDR, or layout return-path disruption increases mode conversion and closes eye margin.
Quick check: Correlate errors with cabinet events (fans/relays/load steps); compare error slope with refclk source A/B and with a controlled supply ripple step (ΔV = X mV).
Fix: Improve refclk isolation (routing/guard/decoupling) and restore a continuous return path near critical traces; re-run the same PRBS window and throughput burst in-cabinet.
Pass criteria: Error slope ≤ X per min across cabinet load steps; PRBS BER ≤ X over N bits; no cabinet-triggered flap over Y min.
Enabling EEE causes periodic flaps — is it exit latency or partner mismatch?
Likely cause: Partner ability mismatch or exit timing/latency behavior that violates the traffic profile, causing repeated transitions and instability.
Quick check: Disable EEE and compare flap rate; if stable, re-enable and test two traffic models: (a) idle-heavy with bursts every T ms, (b) continuous load at X%.
Fix: Keep EEE disabled for that partner/port class, or tune policy (min idle time, exit behavior) and validate with the partner matrix.
Pass criteria: Flap count = 0 over Y min with EEE policy applied; throughput variance ≤ X% under burst model; EEE transitions do not increase CRC beyond X/10^6 frames.
AN completes but no traffic passes — is it MAC interface timing or FIFO underrun?
Likely cause: MAC-side interface mode/timing mismatch (clock phase/delay), or datapath starvation (FIFO underrun/overrun) masquerading as “link up”.
Quick check: Verify MAC/PHY interface configuration match (mode + delays); check underrun/overrun counters while sending a constant-rate stream at X Mbps.
Fix: Correct interface mode and required internal/external delays; increase FIFO margin (clocking/interrupt moderation/driver pacing) and re-test with the same stream.
Pass criteria: Sustained throughput ≥ X% of line rate for Y min; underrun/overrun count = 0; CRC ≤ X/10^6 frames.
Only one port fails under temperature — is it PLL lock margin or solder/connector loss shift?
Likely cause: Marginal PLL/CDR lock at temperature corners, or localized physical loss shift (connector/solder/port-specific routing) reduces SI margin.
Quick check: Compare that port vs a known-good port: time-to-link, retrain count, and error slope across temperature sweep from Tlow to Thigh.
Fix: Improve port-specific margin (rework suspected solder/connector, verify routing/return path); if PLL margin is suspected, evaluate refclk/supply conditioning and stable EQ settings.
Pass criteria: No retrain/flap over Y min at each corner; PRBS BER ≤ X over N bits; error slope stays within baseline ± X%.
PRBS passes in local loopback but fails end-to-end — which segment does that isolate?
Likely cause: Internal PHY datapath is healthy; the failing segment is the external channel (PCB + connector + cable) or the link partner’s receive path.
Quick check: Run PMA/PCS loopbacks (internal) and then remote loopback (if supported). If only end-to-end fails, swap partner/port/cable class to separate “channel vs partner”.
Fix: Treat as channel/partner margin: lock a stable EQ preset, reduce discontinuities, and validate with PRBS at the target rate using a fixed N-bit window.
Pass criteria: End-to-end PRBS BER ≤ X over N bits; CRC ≤ X/10^6 frames over Y min; internal loopbacks remain clean (BER ≤ X).
Changing magnetics vendor worsened BER — first Cdiff/mismatch sanity check (PHY-only view)
Likely cause: Differential mismatch introduced (effective Cdiff / imbalance), increasing mode conversion and degrading receiver margin at higher rates.
Quick check: Compare error slope and rate sensitivity before/after the vendor change using the same PRBS window; if degradation is strongest at higher rates, suspect added imbalance/mismatch.
Fix: Restore the previous magnetics or enforce a validated part list; if change is mandatory, re-baseline EQ presets and validate end-to-end PRBS/CRC at each target rate.
Pass criteria: PRBS BER ≤ X over N bits at each rate; CRC ≤ X/10^6 frames; delta vs baseline slope ≤ X%.
Link OK, but throughput collapses with bursts — is it EEE/LPI transitions or PCS errors?
Likely cause: EEE/LPI transitions interact poorly with burst traffic, or PCS-level errors force retransmissions/pauses that look like “throughput collapse”.
Quick check: Disable EEE and repeat the exact burst pattern (idle Tidle ms, burst Tburst ms at X% line rate); observe CRC/PCS error counters during bursts.
Fix: Keep EEE disabled (or adjust policy) for burst workloads; if PCS errors dominate, lock stable EQ preset and address channel discontinuities.
Pass criteria: Throughput ≥ X% of expected under burst model for Y min; CRC ≤ X/10^6 frames; no periodic stalls aligned with LPI transitions.
Intermittent down/up every few minutes — brownout reset, strap sampling, or thermal?
Likely cause: Supply dip triggers reset, strap sampling occurs under unstable rails, or thermal protection causes periodic recovery.
Quick check: Correlate flap timestamps with supply/temperature logs; check reset cause/interrupt history; measure time-to-link consistency after each event (variance ≤ X).
Fix: Improve power hold-up/decoupling and reset sequencing; ensure straps are stable at sampling; address thermal path (airflow/heatsinking) if temperature correlates.
Pass criteria: Flap count = 0 over Y hours; time-to-link ≤ X ms with variance ≤ X%; no reset/thermal events under worst-case load.
MDIO shows no faults but counters climb — which counters are most telling at PCS vs PMA?
Likely cause: Soft errors accumulate without hard faults; PCS shows symbol/block integrity issues, while PMA indicates analog lock/alignment stress.
Quick check: Separate counters by layer: PCS integrity (block/symbol/CRC-related) vs PMA lock/retrain/alignment; compare slopes during PRBS and during real traffic.
Fix: If PCS counters dominate, treat as margin/EQ/channel issue; if PMA lock/retrain dominates, treat as refclk/supply/temperature sensitivity and stabilize the bring-up configuration.
Pass criteria: Layered counter slopes ≤ X/min in steady-state; no retrain events over Y min; CRC ≤ X/10^6 frames.
Polarity / pair swap on PCB — why do only some rates fail?
Likely cause: The swap is only conditionally supported (depends on mode/rate), or the physical routing introduces rate-dependent discontinuities that exceed margin at higher speeds.
Quick check: Verify PHY configuration for polarity/pair-swap support at each target rate; run the same PRBS window at 1G vs 2.5G/5G/10G and compare BER slope vs preset.
Fix: Enable the correct swap mode for the operating rate; if margin-driven, lock a stable EQ preset and reduce discontinuities around the swapped segment.
Pass criteria: PRBS BER ≤ X over N bits at each rate; CRC ≤ X/10^6 frames; no rate-specific flap over Y min.
Lowering TX amplitude helped EMI but BER got worse — how to pick the safe margin?
Likely cause: TX amplitude reduction closes the eye and removes margin, especially at higher rates or under temperature/supply corners.
Quick check: Sweep TX amplitude across K steps and record PRBS BER and CRC/PCS counters per step; identify the “knee” where BER slope rises sharply.
Fix: Choose the lowest TX amplitude that still meets the PRBS/CRC acceptance at all required corners; if EMI still fails, address return path/shielding before sacrificing margin.
Pass criteria: Selected amplitude meets BER ≤ X over N bits and CRC ≤ X/10^6 frames across corners; EMI target met with ≥ X dB margin.
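The amplitude selection rule ("lowest setting that still passes at every corner") can be sketched over sweep data. The data layout and the `lowest_safe_amplitude` name are assumptions for illustration; amplitude values are whatever units or register steps the PHY exposes.

```python
from typing import Optional


def lowest_safe_amplitude(sweep: dict, ber_limit: float) -> Optional[float]:
    """Pick the lowest amplitude whose BER meets the limit at every corner.

    sweep maps amplitude setting -> {corner_name: measured BER}.
    Returns None when no setting passes all corners.
    """
    safe = [
        amp for amp, corners in sweep.items()
        if corners and all(ber <= ber_limit for ber in corners.values())
    ]
    return min(safe) if safe else None
```

If the result is None, no amplitude setting satisfies the acceptance criteria, which points back to return-path or shielding fixes instead of further amplitude reduction.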