An Ethernet PHY is not just “link up or down” for TSN: it must keep time and latency bounded under real conditions.
This page shows how to design, measure, and validate low-jitter clocks, deterministic latency, EEE side-effects, and PHY hardware timestamp paths so the link stays stable and the timing stays trustworthy.
What an Ethernet PHY is (and what TSN expects from it)
An Ethernet PHY is the physical-layer engine that turns MAC-side digital traffic into a compliant electrical link (and back),
while managing clocks, training, link states, and—when required—hardware timestamp hooks that determine timing repeatability.
Scope boundaries (to prevent topic overlap)
In scope (this page)
Ethernet PHY functions for 10/100/1G/2.5G electrical links
Low-jitter clocking paths and how they affect determinism
EEE (802.3az) behavior as a timing/latency side-effect source
Management interface control (e.g., MDIO register access)
MAC (host-side digital)
May provide timestamping—but the capture location can be far from the line
PHY (line-side mixed-signal + DSP)
Physical coding/decoding, analog front-end, link training
Clock recovery/PLL behavior that shapes jitter and lock stability
EEE state machine transitions (enter/exit) that can introduce steps
Hardware timestamp capture/insert points tied to physical events
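Register access over the management interface is typically IEEE 802.3 Clause-22 MDIO. As a sketch of what a driver puts on the wire, the 32-bit management frame that follows the preamble can be packed like this (the field widths come from the standard; the PHY address, register, and value below are only illustrative):

```python
def mdio_c22_frame(read: bool, phy_addr: int, reg_addr: int, data: int = 0) -> int:
    """Build the 32-bit Clause-22 frame that follows the 32-bit preamble.

    Layout (MSB first): ST(2) OP(2) PHYAD(5) REGAD(5) TA(2) DATA(16).
    """
    assert 0 <= phy_addr < 32 and 0 <= reg_addr < 32 and 0 <= data < (1 << 16)
    st = 0b01                      # start-of-frame bits
    op = 0b10 if read else 0b01    # opcode: read = 10, write = 01
    ta = 0b10                      # turnaround (driven by the station on writes)
    return (st << 30) | (op << 28) | (phy_addr << 23) | (reg_addr << 18) | (ta << 16) | (data & 0xFFFF)

# Example: write 0x8000 (soft-reset bit) to BMCR (register 0) of PHY address 1.
frame = mdio_c22_frame(read=False, phy_addr=1, reg_addr=0, data=0x8000)
```

Clause-45 devices use a different (two-cycle, address/data) frame; the sketch above covers only the Clause-22 case named in the list.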
Practical rule
Determinism is rarely limited by average throughput. It is limited by bounded behavior:
clock integrity, latency variability, and timestamp integrity must be measurable, repeatable, and stable across environment changes.
A) Clock integrity
Quick check
Verify refclk quality meets the system jitter budget (< X) and lock is stable across temperature and cable conditions.
Engineering hook
Must expose lock/training state and (ideally) allow selecting a clean external refclk.
Pass criteria
No clock-related drops; recovered/derived clocks remain within the defined jitter envelope under stress.
B) Bounded latency
Quick check
Confirm latency is either fixed-mode or at least measurable/compensable; detect step events down to X ns.
Engineering hook
Track FIFO alignment, training convergence, and EEE exit behavior as the primary variability sources.
Pass criteria
Under controlled A/B conditions, peak-to-peak latency variation stays below the system bound (X).
C) Timestamp integrity
Quick check
Determine where timestamps are captured (near the line vs host-side). Target resolution/precision < X.
Engineering hook
Identify clock-domain crossings and buffers between capture point and software-visible timestamp registers.
Pass criteria
Offset steps remain bounded (X) and do not correlate with EEE transitions, relinks, or temperature ramps.
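The pass criterion above can be scripted once offsets and EEE counters share a timebase. A minimal sketch, assuming a sampled offset log and a list of LPI-exit times (the thresholds stand in for the "X" placeholders and are filled per system budget):

```python
STEP_LIMIT_NS = 50.0    # placeholder "X": maximum allowed offset step
CORR_WINDOW_S = 0.010   # steps within 10 ms of an LPI exit count as correlated

def find_steps(t, offset_ns, limit):
    """Return the times where consecutive offset samples jump by more than `limit`."""
    return [t[i] for i in range(1, len(t)) if abs(offset_ns[i] - offset_ns[i - 1]) > limit]

def correlated(step_times, event_times, window):
    """Keep only the steps that land within `window` of a logged event."""
    return [s for s in step_times if any(abs(s - e) <= window for e in event_times)]

# Synthetic data: one 80 ns step at t = 60 ms, with an LPI exit logged just after.
t = [i * 0.001 for i in range(100)]
offset = [5.0] * 60 + [85.0] * 40
lpi_exits = [0.0601]

steps = find_steps(t, offset, STEP_LIMIT_NS)
bad = correlated(steps, lpi_exits, CORR_WINDOW_S)
# A non-empty `bad` list fails the pass criterion: the step tracks an EEE event.
```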
Diagram: System layering and TSN-relevant hooks (clocks, EEE, timestamps)
Modes & interfaces: 10/100/1G/2.5G and MAC-side buses
PHY selection becomes predictable when rate, MAC-side interface, and clocking are treated as one decision.
The goal is not only link-up—it is repeatable timing behavior: bounded latency, stable clocks, and a timestamp path that stays consistent under stress.
Inside the PHY: PCS/PMA, DSP blocks, and where determinism is lost
Deterministic timing depends on predictable internal paths. Ethernet PHYs combine analog front-end blocks, digitization, adaptive DSP,
coding/decoding, and multiple clock domains. The same link can be “up” while timing repeatability is degraded by state transitions,
adaptive convergence, and clock-recovery dynamics.
Internal block relationships (what each stage changes)
PMA/AFE + ADC/DAC
Defines line-side signal integrity and noise coupling points
Sets SNR margins that limit DSP headroom and lock robustness
Susceptible to supply noise and common-mode disturbances
DSP / EQ (adaptive)
Adaptive equalization converges differently across cables/temperature
Convergence and re-training can introduce step-like timing shifts
Error bursts often correlate with adaptation events
PCS (coding/decoding)
Controls symbol alignment, buffering, and link state transitions
Elastic buffers can create bounded but real variability
State changes (autoneg/relink) are common sources of steps
PLL/CDR (clock recovery)
Defines jitter tolerance and how refclk noise transfers into the link
Lock/relock dynamics create time windows of higher variability
Temperature drift can shift loop behavior and margins
“Determinism killers” (grouped by what they look like in measurements)
Step events
EEE exit / LPI transitions (state gate toggles)
Link retrain / renegotiation / buffer realignment
Timestamp capture point mode changes or re-sync
Fast check
Correlate timing steps with EEE/link-state counters and lock/relock events.
Short-term jitter
Refclk phase noise transfer through PLL/CDR
Supply noise coupling into PLL/AFE bias networks
Measurement settings that hide spurs or amplify artifacts
Fast check
Swap clock source and re-measure at multiple points (refclk vs MAC IF vs recovered clock).
Slow drift
Temperature drift shifting CDR loop behavior and margins
AFE bias drift or common-mode shifts under load
Clock-source frequency drift interacting with timestamp domains
Fast check
Log temperature and airflow changes and align them against drift slope (X per °C).
Key takeaway
A cleaner refclk mainly improves noise-floor-driven jitter. It does not automatically remove step events caused by state changes
(EEE exit, retraining, buffer realignment) or slow drift driven by thermal behavior. Classify the symptom first.
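Classifying the symptom first can be automated from a single timing series. A rough sketch with illustrative thresholds (not vendor limits): a large sample-to-sample jump reads as a step, a large start-to-end shift as drift, and elevated RMS of the remainder as short-term jitter.

```python
def classify(samples, step_ns=20.0, drift_ns=10.0, jitter_ns=2.0):
    """Bucket a timing series into step / drift / jitter / within-envelope."""
    n = len(samples)
    diffs = [samples[i] - samples[i - 1] for i in range(1, n)]
    if max(abs(d) for d in diffs) > step_ns:
        return "step"            # state change: EEE exit, retrain, buffer realign
    if abs(samples[-1] - samples[0]) > drift_ns:
        return "drift"           # thermal or bias drift
    mean = sum(samples) / n
    rms = (sum((s - mean) ** 2 for s in samples) / n) ** 0.5
    return "jitter" if rms > jitter_ns else "within-envelope"
```

The order matters: a step also inflates RMS, so it must be tested first, exactly as the takeaway above suggests.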
Five internal observation points used throughout this page
① PLL/CDR lock events
Confirms whether timing excursions correlate with lock/relock windows and jitter tolerance margins.
② Training/EQ convergence
Detects adaptive convergence changes that often precede error bursts and latency variability.
③ EEE state transitions
Provides the first correlation test for step-like offset/latency jumps caused by LPI entry/exit.
④ Timestamp engine status
Identifies capture/insert mode, domain crossings, and whether corrections are applied consistently.
⑤ Error/quality counters
Separates physical-layer errors from higher-layer symptoms and supports one-variable A/B experiments.
Low-jitter clocks: reference, recovered clock, jitter transfer & tolerance
Clock quality is only useful when it is mapped to measurable points and bounded outcomes.
A clock plan must specify which domain drives refclk, how PLL/CDR transfers noise, where clocks are observed,
and what pass criteria define “good enough” under stress.
Clock sources and roles (engineer view)
Common sources
On-board crystal / oscillator (XO)
External low-jitter XO (dedicated)
SoC-generated clock (shared tree)
Synchronized clock feed (PHY-side only)
Key roles
refclk: sets PLL/CDR input noise and lock robustness
Clean refclk reduces noise-floor-driven jitter. It does not guarantee removal of step events caused by state changes
(EEE exit, relink, buffer realignment). Measurements must separate noise-floor problems from state-driven steps.
Where to measure (multi-point) and how to avoid artifacts
RBW/VBW/window changes that “improve” plots while system behavior stays unchanged
Probing/ground loop effects that add spurs
Excess averaging that hides rare step events
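The third artifact is easy to demonstrate: a one-sample step that is obvious in raw data nearly vanishes under a long moving average. A synthetic sketch:

```python
def moving_average(x, n):
    """Trailing moving average with a growing window at the start."""
    return [sum(x[max(0, i - n + 1): i + 1]) / min(n, i + 1) for i in range(len(x))]

raw = [0.0] * 50 + [40.0] * 50          # a clean 40 ns step
avg = moving_average(raw, 32)

raw_jump = max(abs(raw[i] - raw[i - 1]) for i in range(1, len(raw)))   # 40.0
avg_jump = max(abs(avg[i] - avg[i - 1]) for i in range(1, len(avg)))   # 40/32 = 1.25
# The averaged trace only ever moves by 40/32 ns per sample: the step is hidden.
```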
Clock budget template (X placeholders) + first 3 correlation checks
Budget template
refclk jitter (integrated X1–X2): < X
PLL/CDR output jitter: < X
Host interface timing margin: > X
Worst-case Δjitter across stress: < X
Lock/relock stability window: < X
Pass criteria
The defined bounds remain valid with temperature, cable length, and EEE/link events.
Correlation checks
1) Swap clock source (A/B)
Confirms whether the symptom tracks refclk noise floor or remains unchanged (suggesting a state-driven step).
2) Move measurement point
Separates “clean refclk” from degraded MAC IF/TS domains, indicating internal crossings or gating effects.
3) Change RBW/VBW/window
Detects settings artifacts: if plots “improve” but link/timing behavior does not, the configuration is masking the true issue.
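The budget template lends itself to a mechanical check. A sketch with placeholder limits (every number below stands in for an "X" from the template):

```python
# Budget lines: (comparison, limit). Values are placeholders, not recommendations.
BUDGET = {
    "refclk_jitter_ps":       ("<", 1.0),   # integrated X1–X2
    "pll_output_jitter_ps":   ("<", 2.0),
    "host_if_margin_ps":      (">", 50.0),
    "stress_delta_jitter_ps": ("<", 0.5),   # worst-case Δjitter across stress
    "relock_window_ms":       ("<", 10.0),
}

def check_budget(measured: dict) -> list:
    """Return the budget lines that fail for one measured condition."""
    fails = []
    for key, (op, limit) in BUDGET.items():
        value = measured[key]
        ok = value < limit if op == "<" else value > limit
        if not ok:
            fails.append(key)
    return fails
```

Running the same check across temperature, cable, and EEE/link-event conditions implements the pass criterion above: the bounds must remain valid under all of them.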
Diagram: Clock tree + measurement points (M1–M4) + tools
Deterministic latency: what varies, what can be bounded, how to measure
Deterministic latency is not a single number. It is a decomposable set of terms across TX/RX pipelines, buffers,
adaptive convergence, and lock-related behavior. TSN readiness requires identifying which terms are fixed, which are configurable,
which are adaptive, and how each term can be observed and bounded.
Latency pipeline map (engineer view)
TX path (host → line)
MAC IF → PCS → FIFO/buffer
DSP/EQ → PMA/AFE → MDI
State events can create step-like shifts
Link segment (line + peer)
Cable + magnetics + peer PHY behavior
Changes can alter EQ convergence
Peer state transitions matter for repeatability
RX path (line → host)
MDI → AFE/PMA → DSP/EQ
PCS → FIFO/buffer → MAC IF
Lock windows can distort short-term timing
Latency decomposition (fixed vs configurable vs adaptive) with observation hooks
Fixed terms (treat as constants in one mode)
Pipeline base delay
AFE/PCS basic pipeline latency under a fixed speed/mode.
EEE (802.3az): power states, wake behavior, and side effects on timing
EEE is not a simple power-save toggle. It is a state machine that changes link behavior. The most important engineering task is to
verify side effects: step-like latency changes around wake events, short-term jitter changes, and compatibility differences across peers.
EEE in practice: state machine and what to correlate
Active
Normal latency distribution (baseline)
Use as reference for A/B comparisons
Correlate with error counters (⑤)
LPI entry / LPI
Low-power idle behavior depends on traffic pattern
Exit behavior is the critical risk for timing
Track EEE state transitions (③)
Wake / recovery
Possible step in latency/offset around exit
Short risk window until stable behavior returns
Correlate with lock events (①) and error bursts (⑤)
Timing side effects (grouped by symptom shape) + fast discrimination
Step events
Latency/offset “jumps” near LPI exit
Buffer realignment and domain gating effects
Peer-dependent behavior across switches
Fast discrimination
Disable EEE and check whether the step disappears; then align the step time with EEE transitions (③).
Short-term jitter rise
Wake window shows degraded jitter tolerance
Clock-domain gating exposes noise coupling
Measurement settings can hide/overstate effects
Fast discrimination
Measure at multiple points and use the same RBW/VBW/window; compare EEE on/off under identical traffic.
Compatibility issues
Peer switch/port differences change stability
Increased error bursts or renegotiation events
Temperature/cable length amplifies differences
Fast discrimination
Hold traffic constant, swap peer device, and compare EEE transition counts (③) and error bursts (⑤).
Log: temperature, ① lock, ③ transitions, latency drift
Diagram: EEE state timeline (Active ↔ LPI ↔ Wake) with measurement points and pass checks
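The EEE-on/EEE-off A/B above reduces to comparing peak-to-peak latency under identical traffic. A synthetic sketch of the comparison (the wake step size is illustrative):

```python
def p2p(samples):
    """Peak-to-peak spread of a latency series."""
    return max(samples) - min(samples)

lat_eee_off = [1000 + (i % 3) for i in range(300)]   # tight baseline distribution
lat_eee_on = lat_eee_off[:]
lat_eee_on[120] += 180                               # one wake-related step near an LPI exit

delta = p2p(lat_eee_on) - p2p(lat_eee_off)
# A large positive delta, with all other variables held constant, points at the
# EEE state machine rather than the channel or the clock tree.
```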
PTP / gPTP hardware timestamping at the PHY: paths, errors, calibration
Hardware timestamping accuracy is determined by where the timestamp is captured/inserted, which clock domain is used,
and which internal stages are included in the correction path. Engineering validation focuses on repeatability, bounded error terms,
and correlation with link state transitions rather than protocol-level details.
Scope (PHY-side reality)
Capture/insert points and their relation to MDI
Error terms: bias, jitter, drift, event steps
Calibration/verification workflow and pass criteria
Not in scope (kept out to avoid cross-page overlap)
Protocol state machines and sync-tree behavior
BMCA, scheduling, or switch queue models
Network-wide configuration policies
Where timestamps happen: MAC vs PHY (engineering differences)
MAC timestamp
Capture point is closer to host processing
MAC↔PHY interface latency becomes an error term
FIFO/CDC effects can show as jitter or step
PHY timestamp
Capture point can be closer to MDI/line side
Internal pipeline + correction path must be consistent
CDC and state transitions must be bounded by tests
First checks
Disable EEE and check if steps disappear
Swap speed/cable length and verify predictable bias change
Align offset/latency changes with link/state counters
Timestamp error sources and an error-budget template
Bias (static offset)
Capture point relative to MDI and fixed pipeline terms
Mode/speed-specific baseline differences
Correction-field configuration mismatches
Budget line (placeholder)
Bias after calibration < X ns (per speed/mode).
Random jitter
FIFO/CDC phase relationship and sampling uncertainty
Clock noise mapped into timestamp domain
Measurement window and statistics definition matter
Budget line (placeholder)
Jitter (RMS) < X ns, p-p < X ns in window T.
Drift + event steps
Temperature-dependent delay and clock parameter shifts
EEE exit windows causing step-like offset/latency changes
Relock/retrain events affecting short-term timing
Budget lines (placeholders)
Drift < X ns/°C and event step < X ns.
After EEE exit, stable within X ms (no step beyond threshold).
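The three budget terms can be separated from one offset log by removing the mean and a fitted slope. A minimal sketch (scaling drift to ns/°C would need a parallel temperature log; here the slope is per second):

```python
def error_terms(t_s, offset_ns):
    """Split an offset log into bias (mean), drift (least-squares slope),
    and random jitter (RMS of the residual)."""
    n = len(t_s)
    bias = sum(offset_ns) / n
    t_mean = sum(t_s) / n
    num = sum((t_s[i] - t_mean) * (offset_ns[i] - bias) for i in range(n))
    den = sum((t_s[i] - t_mean) ** 2 for i in range(n))
    drift = num / den
    resid = [offset_ns[i] - bias - drift * (t_s[i] - t_mean) for i in range(n)]
    jitter_rms = (sum(r * r for r in resid) / n) ** 0.5
    return bias, drift, jitter_rms
```

Event steps are deliberately excluded here: they should be detected separately (as in the step-correlation checks earlier on this page) before fitting, or they corrupt both the bias and the drift estimate.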
Calibration and verification: minimal system → peer test → variable scan → full system
Level 0 — minimal board
Fix speed/mode and cable
EEE off baseline
Calibrate bias to target (X)
Level 1 — peer-to-peer
Two-ended comparison under fixed traffic
Validate correction settings consistency
Check peer sensitivity of bias/jitter
Level 2 — one-variable scan
Speed change (10/100/1G/2.5G)
Cable length/type change
EEE on/off and wake window checks
Level 3 — full system
Bring-up with real power/thermal conditions
Correlate offset with link/state counters
Prove bounded drift and no event steps beyond X
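Level 0 typically ends with a per-speed/mode bias table. A sketch of building and applying such a table; the loopback means below are synthetic:

```python
# Synthetic Level-0 loopback bias means, measured with EEE off and a fixed cable.
raw_bias_ns = {"100M": 412.0, "1G": 248.0, "2.5G": 131.0}

def make_correction(raw: dict) -> dict:
    """Store the negative of each measured bias as the per-mode correction."""
    return {mode: -bias for mode, bias in raw.items()}

def corrected(mode: str, measured_ns: float, corr: dict) -> float:
    """Apply the calibration for the current speed/mode."""
    return measured_ns + corr[mode]

corr = make_correction(raw_bias_ns)
# After calibration, a measurement near the raw bias should land near zero,
# and Levels 1–3 then test whether that zero survives peers, scans, and the chassis.
```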
Diagram: Timestamp capture/insertion map (capture points, correction, and CDC)
Analog front-end, magnetics, EMC/ESD/surge: making the link robust
Robust Ethernet links require treating the PHY AFE, magnetics, protection network, and connector region as a coupled system.
Field failures often come from return-path mistakes and parasitics that convert differential energy into common-mode noise,
especially at higher data rates such as 2.5G.
AFE + magnetics as a coupled system (loss, echo, common-mode conversion)
Differential channel
Insertion loss and return loss shape eye margin
Small impedance breaks can create reflections
Parasitic capacitance becomes visible at 2.5G
Common-mode channel
Diff-to-CM conversion drives EMI and sensitivity issues
Asymmetry and return-path discontinuities amplify CM
CMC helps only when placed and referenced correctly
2.5G “no more hand-waving”
ESD capacitance and stubs become first-order terms
Layout asymmetry directly impacts CM and robustness
Partitioning errors show as intermittent bursts
ESD & surge: real damage paths (energy + return path)
ESD path
Connector → protection → short return to reference
Long return loops inject noise into PHY reference
Wrong placement drives current through magnetics/PHY zone
Surge path
Energy must be steered to the external reference (chassis/PE)
Floating paths cause board ground lift and false failures
Partition boundaries prevent energy from crossing zones
Practical rule
Place protection near the connector, keep the return path short and controlled, and prevent ESD/surge current from entering
the magnetics and PHY zones.
Bring-up & debug playbook: link training, autoneg, cable issues, diagnostics
Debugging Ethernet PHY stability is fastest when symptoms are routed into a small set of “root-cause buckets”
using correlation checks first (peer swap, EEE off, forced speed). The goal is to reduce time-to-isolation:
identify whether failures track the peer, the channel, the environment, or internal state transitions.
Symptom entry points (route first, then drill down)
Link flap
Frequent up/down or retrain loops. Start with correlation checks before deep logging.
Speed drop
Falls back to 100M/1G unexpectedly. Check autoneg outcome and partner capability.
Long cable only
Stable on short cable but fails at long run. Treat as channel-margin sensitivity first.
Hot only / soak
Pass at room temp but fails hot. Correlate with drift and state/event counters.
Chassis-only
Bench passes but fails in full system. Treat as power/ground/EMC coupling until proven otherwise.
Correlation-first triage (highest information gain)
Swap peer
If the issue follows the peer, bucket into compatibility/partner behavior.
EEE OFF
If steps/flaps disappear, bucket into LPI entry/exit windows and wake behavior.
Force speed
If forced mode is stable, bucket into autoneg/training process rather than steady-state channel.
Output of triage
Peer bucket
Different peer fixes it → compatibility or partner state behavior.
Channel bucket
Worse with length/type → magnetics/cable/impedance margin.
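The correlation-first triage above can be encoded as a small router over the three A/B outcomes. Bucket names follow this page; the decision order is an assumption (peer swap first, since it has the highest information gain):

```python
def triage(follows_peer: bool, clean_with_eee_off: bool, stable_when_forced: bool) -> str:
    """Route a link-stability symptom into a root-cause bucket."""
    if follows_peer:
        return "peer/compatibility"
    if clean_with_eee_off:
        return "EEE LPI entry/exit window"
    if stable_when_forced:
        return "autoneg/training process"
    return "channel or environment (continue one-variable scan)"
```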
Autoneg failures: 5 common root causes (symptom → quickest check → next action)
1) Mode mismatch / straps
Likely cause: forced mode conflicts with partner autoneg or strap defaults.
Quick check: force a known-good speed/duplex and compare stability.
Fix: align autoneg policy and strap/config across boots.
Pass criteria: stable link with identical mode after reboot and peer swap.
2) Channel margin (cable/magnetics)
Likely cause: training/FLP is corrupted by loss/echo or poor term/magnetics.
Quick check: short-cable baseline vs long-cable failure reproduction.
Fix: isolate cable type, check magnetics + connector placement and symmetry.
Pass criteria: no renegotiation loops; error counters remain bounded.
3) Clock/power noise → state churn
Likely cause: marginal refclk or supply noise triggers retrain or false transitions.
Quick check: correlate flaps with counters/events; compare clean vs noisy power condition.
Fix: stabilize ref/power; retest under forced speed to separate autoneg vs steady-state.
Pass criteria: retrain count and drops stay below X per hour in window T.
4) Peer compatibility
Likely cause: partner implementation differences or strict corner behavior.
Quick check: swap peer across vendors/models; keep all other variables fixed.
Fix: lock negotiated subset, adjust advertisement policy, or select a compatible profile.
Pass criteria: identical results across peer set under same cable/temperature.
5) EEE interaction
Likely cause: LPI entry/exit timing causes transient instability or misclassification.
Quick check: EEE off A/B; look for steps at wake moments.
Fix: tune EEE policy or disable for deterministic timing applications.
Pass criteria: no offset step beyond X ns; no burst errors after wake.
Cable diagnostics / TDR: use it without false conclusions
When it is meaningful
Stable link state and fixed speed
Known cable type and baseline reference run
Trend comparison, not blind absolute distance
Common false-positive sources
Magnetics + CMC parasitics and routing stubs
ESD array capacitance near connector
Connector/patch-cord reflections
Correct workflow
EEE off and forced speed baseline
Capture a known-good reference signature
Compare deltas under one-variable changes
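The delta-comparison step can be scripted directly: flag only departures from a golden reference signature, so the connector and ESD-array reflections that are always present never count as faults. A sketch with synthetic traces (the threshold is a placeholder):

```python
def tdr_deltas(reference, trace, threshold=0.05):
    """Indices where a TDR trace departs from the known-good reference signature."""
    assert len(reference) == len(trace)
    return [i for i, (r, t) in enumerate(zip(reference, trace)) if abs(t - r) > threshold]

ref  = [0.0, 0.02, 0.03, 0.02, 0.01, 0.0]   # golden run: stubs/ESD bumps included
meas = [0.0, 0.02, 0.03, 0.30, 0.01, 0.0]   # one new reflection
flags = tdr_deltas(ref, meas)
# flags == [3]: only the new feature is reported; baseline reflections are ignored.
```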
Minimal log fields (enough to reproduce and correlate)
Compliance & validation: what to test, what to log, and pass/fail criteria
Validation should be structured as a matrix: conditions (cable, temperature, peer, power, feature states) × metrics (BER, drop rate,
recovery, timestamp stability). The goal is to prevent “bench pass, chassis fail” by forcing worst-case corners and logging enough
context to correlate failures to specific transitions.
Coverage dimensions (minimum set that prevents blind spots)
Signal & errors
BER / CRC / PCS errors
burst behavior and retrain count
drop rate over window T
Timing & timestamp
offset step during events
timestamp validity rate
recovery time after wake/relock
Environment & channel
short vs long cable
temperature corners
peer diversity (models/vendors)
Power disturbance
nominal vs ripple/step
mode transitions
correlate bursts with supply events
Feature states
EEE on/off
timestamp on/off
autoneg vs forced
Test matrix template (conditions × metrics)
Conditions (axis)
cable: short / long (X m)
temp: cold / room / hot (X °C)
peer: A / B / C
power: nominal / ripple (X)
EEE: on / off
timestamp: on / off
Metrics (axis)
BER / CRC / PCS errors
drop rate (X / hour)
recovery time (X ms)
offset step (X ns)
timestamp validity (X%)
retrain count (X / hour)
Corner priority
Mark high-risk corners explicitly (long + hot + EEE on + timestamp on). Run these first and gate release on them.
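Expanding the condition axes into an ordered run list, with the declared high-risk corner forced to the front, takes only a few lines. A sketch over a subset of the axes above:

```python
from itertools import product

AXES = {
    "cable": ["short", "long"],
    "temp":  ["cold", "room", "hot"],
    "eee":   ["off", "on"],
    "ts":    ["off", "on"],
}

def run_list():
    """All condition combinations, with the gating corner sorted first."""
    runs = [dict(zip(AXES, combo)) for combo in product(*AXES.values())]
    risky = {"cable": "long", "temp": "hot", "eee": "on", "ts": "on"}
    runs.sort(key=lambda r: r != risky)     # False sorts before True: corner first
    return runs

runs = run_list()
# 2 * 3 * 2 * 2 = 24 runs; runs[0] is the long + hot + EEE on + timestamp on corner.
```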
Preventing “bench pass, chassis fail” (force coupling paths into the matrix)
Why chassis differs
ground reference shifts and CM noise increases
thermal gradients and soak effects
power mode changes and fan/load transients
neighbor interface coupling
What to do
repeat key corners in the full system
log the same minimal fields as bring-up
align errors with state/event transitions
keep one-variable changes during debug
Pass/fail criteria templates (define the measurement window)
Offset stability
After event (EEE exit / retrain), in window T:
offset step < X ns.
Drop rate
Over T hours:
drops < X / hour.
Recovery time
From event to stable state:
< X ms (stable means counters stop increasing and offset returns within range).
Timestamp validity
Valid timestamps > X% across corners (long + hot included).
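The templates above can be applied mechanically to one corner's results. A sketch in which every limit is a placeholder for "X" (note the validity line is a minimum, the others are maxima):

```python
# Placeholder limits; fill per system budget and measurement window T.
CRITERIA = {
    "offset_step_ns":  50.0,   # max, after EEE exit / retrain
    "drops_per_hour":   1.0,   # max, over T hours
    "recovery_ms":     10.0,   # max, event to stable state
    "ts_validity_pct": 99.9,   # min, across corners
}

def verdict(results: dict) -> dict:
    """Per-criterion pass/fail for one corner of the matrix."""
    return {
        "offset_step_ns":  results["offset_step_ns"]  < CRITERIA["offset_step_ns"],
        "drops_per_hour":  results["drops_per_hour"]  < CRITERIA["drops_per_hour"],
        "recovery_ms":     results["recovery_ms"]     < CRITERIA["recovery_ms"],
        "ts_validity_pct": results["ts_validity_pct"] > CRITERIA["ts_validity_pct"],
    }
```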
Diagram: Validation matrix map (card-grid, not a dense table)
This checklist turns the page's key points (determinism, low jitter, EEE, PTP, robustness) into executable checks, gated by stage:
Design gate prevents board re-spins,
Bring-up gate prevents misdiagnosis,
Production gate prevents “Monday yield collapse” and station-to-station mismatch.
Design gate · Build “determinism” into schematic + layout first
Clock / Jitter
MAC interface
Magnetics / Protection
Power integrity
Quantify the refclk budget first: refclk phase noise/jitter must match the TSN timestamp error budget (placeholder:
refclk jitter < X ps RMS, fill X per system budget), and explicitly define probe points (XO output pin / PHY refclk pin / MAC-side txclk).
Lock down one interface/timing strategy: for RGMII, choose either internal delay or board delay (one and only one);
for SGMII/2500BASE-X, define refclk (25/125 MHz) + jitter requirement to avoid “eye passes but offset jumps.”
Place magnetics + protection by return-path logic: ESD/surge return must avoid the PHY ground-reference sensitive area;
prioritize “clamp energy in the outside zone first” when setting distances between diff pairs, protection, and connector.
Partition rails + add low-noise post-regulation: for PHY analog/PLL rails, prefer a high-PSRR low-noise LDO as a second-stage cleanup;
DC/DC switching ripple should support repeatable injection testing (placeholder: under ripple X mVpp, offset/jitter does not degrade).
Reference part numbers (examples; re-verify package/grade/availability)
TDK ACT45B-101-2P-TL003 (signal-line common-mode filter example; re-check insertion loss vs target band)
Note: part numbers are “implementation anchors + datasheet lookup helpers,” not universal answers.
Within a single family, variants often differ by speed grade, pinout, capacitance, package suffix, etc.
Bring-up gate · Turn “autoneg/EEE/PTP” into repeatable A/B tests
Minimal-system baseline: fix cable / peer / temperature first; run the baseline of “lock speed + EEE OFF + PTP OFF,”
and confirm BER/link-flap rate/error counters are zero or at the noise floor.
Add complexity one variable at a time: enable only one variable (EEE or PTP or speed) per step;
every step must be roll-backable, reproducible, and explainable.
EEE validation focus: log LPI enter/exit counts and exit latency (placeholder: exit < X µs),
and check whether offset shows a step at the exit instant (placeholder: step < X ns).
PTP timestamp validation focus: fix speed + cable length; do loopback/dual-end correlation;
ensure the timestamp path is not distorted by FIFO/clock-domain crossings that introduce “rate-dependent bias,”
and validate that the temperature-drift slope is calibratable.
Use cable diagnostics/TDR only as a locator: first remove EEE/peer/power-noise via correlation checks,
then enable diagnostics—avoid mislabeling “clock/power issues” as “cable issues.”
Minimum fields to log during bring-up (for reproduction + correlation)
Production gate · Make “station consistency” a system capability
Fixture/cable standardization: define a golden cable (length/category/shielding/bend radius) and a golden peer (peer model/firmware/port config);
re-check insertion loss and contact resistance drift weekly.
Mandatory environment fields in logs: temperature/humidity/airflow/enclosure door state/power-station ID/fixture version;
otherwise “Monday yield collapse” is hard to root-cause.
Sampling must cover state-machine edges: frequent EEE in/out, frequent PTP servo updates, power-disturbance injection,
long vs short cable A/B; avoid testing only steady-state.
Criteria must be numeric: flap rate < X/hour; recovery time < X ms;
offset step < X ns; temp drift slope < X ns/°C (fill X per system budget).
The core of the production gate is not “testing more,” but controlling key variables and making every anomaly traceable to one of:
cable / peer / environment / configuration / power.
Checklist map (stage-gated)
Blocks are “must-pass” gates for determinism + TSN readiness
Applications + IC selection logic (PHY-focused)
This section only covers Ethernet PHY-side application points and selection logic that strongly impact TSN/determinism:
speed / interface / clock / EEE / PTP timestamp / EMC robustness / power & thermal.
Protocol stack and switching/scheduling details are intentionally out of scope for this page.
Typical applications (strongly PHY-related)
Industrial TSN endpoints (motion control / drives / I/O nodes)
Engineering anchor: coupling from refclk/power noise into timestamp/offset must be measurable, controllable, and reproducible.
Industrial controllers / gateways (single or dual ports)
Key focus: whether EEE is allowed (many deterministic systems force-disable it or bound policy) and link recovery time (placeholder: < X ms).
Engineering anchor: autoneg/downshift/peer-compat issues must be quickly classified via “EEE OFF / lock speed / swap peer.”
2.5G uplinks (multi-Gig backhaul / port aggregation)
Key focus: 2.5GBASE-T is more sensitive to cabling/magnetics/EMI; the capacitance + diff-matching “margin” is smaller.
Engineering anchor: prioritize a test matrix proving that under “cable length / temperature / power disturbance / EEE / PTP,” BER/flap rate and offset criteria still hold.
Automotive Ethernet note (avoid overlap with sibling pages)
If you later build a dedicated “Automotive Ethernet PHY” subpage, handle it via an internal link here.
This page does not expand into T1/TC10/automotive topology or harness constraints to avoid topic overlap.
Selection logic (scoring + decision tree)
Selection is not “comparing datasheet numbers,” but working backward from TSN goals:
trustworthy timestamps,
bounded latency,
controlled EEE side effects,
passable EMC/ESD/surge,
maintainable production consistency.
Scoring dimensions (PHY-focused)
Speed & MAC interface: 10/100/1G/2.5G; RGMII/SGMII/2500BASE-X/USXGMII (choose based on real SoC/FPGA constraints).
PTP hardware timestamp: sampling point, available calibration/compensation mechanisms, GPIO/interrupt/1PPS support, driver/register accessibility.
Determinism controls: FIFO depth + bypass ability; whether adaptation/power-saving states impact latency in a bounded/observable way.
EEE behavior: LPI enter/exit timing, exit transient impact on offset/BER, peer-compat risk and configurable policy.
Recommended workflow: use the “starter set” to pass bring-up + the verification matrix first, then converge magnetics/protection by EMC/power/cost.
Replace the PHY itself only when “timestamp/interface/diagnostics capability” is insufficient.
Selection decision tree (TSN-first)
Short labels only; details stay in text to keep the diagram clean
Each FAQ is intentionally short and executable (no protocol deep-dives): Likely cause → Quick check → Fix → Pass criteria (thresholds use placeholders “X” to be filled by the system budget).
Example part numbers are provided as “BOM anchors” only (e.g., PESD2ETH-D, SP3012-04UTG, 744231371, TPS7A20, ADP150, SiT1602) — verify package/suffix/ratings/availability.
Likely cause
A transient on EEE (LPI) exit changes the timestamp path/clock domain behavior, or the system mixes MAC-side vs PHY-side capture points across builds/ports.
Quick check
(1) A/B: EEE OFF vs ON, count offset steps per hour and align each step to LPI exit events. (2) Confirm one consistent capture mode: PHY HW timestamp enabled (or MAC), not mixed.
Fix
Keep a deterministic policy: disable EEE for TSN-critical windows, or enforce a bounded EEE policy; standardize on a single timestamp capture point and apply the vendor’s correction/calibration flow. If supply/clock coupling is suspected, isolate PLL/clock rails with a low-noise LDO stage (e.g., TPS7A20 or ADP150).
Pass criteria
With EEE policy applied: offset step < X ns and step rate < X/hour over a soak window T, with steps not correlated to LPI exit.
Correlation
Same board, different switch peer → very different PTP accuracy. What is the first end-to-end correlation check?
Likely cause
The peer is driving different EEE advertisement/policy, different timestamping behavior, or different link conditions (rate/duplex, downshift, retries), which changes the effective time error seen at the endpoint.
Quick check
Normalize the physical layer first: (1) lock the same speed/duplex; (2) A/B with EEE forced OFF on both ends; (3) compare offset jitter while logging LPI enter/exit counts and link partner ID/capabilities.
Fix
Standardize port policy: enforce the same EEE behavior, timestamp mode (PHY HW timestamp vs MAC), and link configuration across switches. If the switch peer is non-negotiable, prioritize a PHY with robust HW timestamp + diagnostics (e.g., LAN8841, DP83867, 88E151x) and validate per-peer profiles.
Pass criteria
Under normalized settings: offset jitter < X ns RMS and no peer-dependent bias > X ns over window T.
2.5G / AN
2.5G links up, but later drops to 1G. Check autoneg first, or cable/magnetics bandwidth first?
Likely cause
A marginal 2.5G channel triggers downshift / renegotiation (cable loss, magnetics limits, excess ESD capacitance, or supply/thermal drift).
Quick check
(1) Read and log downshift / AN-restart counters at the event time. (2) A/B: known-good short Cat6 cable vs the failing cable; then A/B: force 2.5G (no AN) if supported.
Fix
If channel-limited: tighten magnetics/EMC BOM and placement, reduce line capacitance (e.g., replace a high-C array with a low-C Ethernet TVS such as PESD2ETH-D or SP3012-04UTG where appropriate), and tune common-mode control (e.g., a CMC like Würth 744231371 as a tuning knob). If AN-limited: standardize advertisements and disable aggressive downshift policies.
Pass criteria
At 2.5G worst-case: downshift events = 0 and link uptime > (100% − X) over T, with RX/PCS errors below threshold X.
Long / Hot
Link flaps only on long cable or at high temperature — which 3 PHY states/counters should be logged first?
Likely cause
A reduced margin corner exposes channel loss + EQ/CDR sensitivity, or supply/thermal drift that triggers renegotiation, retrain, or LOS events.
Quick check
Log these three “first responders”: (1) link state + negotiated speed/duplex; (2) AN restart / downshift / retrain counters; (3) RX-side error counters (PCS/alignment and FCS/CRC). Correlate flap events with die temperature and supply ripple snapshots.
Fix
If errors rise before the flap: improve channel margin (better cable category, magnetics choice/placement, reduce protection capacitance, tune CMC). If no errors but AN restarts spike: stabilize power/thermal and pin the link policy (lock speed for validation; then re-enable AN only after margin is proven).
Pass criteria
Worst-case (long + hot): link flap < X/hour, AN restart = 0 (or < X) over window T, and error counters remain below X.
EEE
EEE ON: throughput is fine, but latency jitter increases. How to prove it is caused by EEE state transitions?
Likely cause
The Active ↔ LPI ↔ Wake transitions change buffering and clock behavior, adding a bursty component to latency even if average throughput remains unchanged.
Quick check
(1) Collect a latency histogram with timestamps of LPI enter/exit; overlay jitter spikes on exit events. (2) A/B: EEE OFF vs ON at the same link rate and traffic pattern; keep all other variables fixed.
Fix
Disable EEE for TSN-critical ports, or raise the EEE entry threshold so LPI is rare during deterministic traffic. If EEE must remain enabled, require a bounded wake behavior and validate across peers and temperature (do not assume peer compatibility).
Pass criteria
With the final EEE policy: p99.9 latency jitter < X ns and wake recovery < X ms over window T, with no jitter bursts aligned to LPI exit.
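The overlay in the Quick check (jitter spikes vs LPI exit timestamps) reduces to counting how many latency spikes fall inside a short window after an LPI exit. A sketch with illustrative thresholds and timebases:

```python
# Check whether latency spikes align with LPI exit events.
# Spike threshold and alignment window are assumed judgment values.

def spikes_aligned_with_lpi_exit(latencies, lpi_exits, spike_ns, align_us):
    """latencies: list of (t_us, latency_ns); lpi_exits: list of t_us.
       Returns (n_spikes, n_aligned_with_an_exit)."""
    spikes = [t for t, lat in latencies if lat > spike_ns]
    aligned = sum(
        1 for t in spikes
        if any(0 <= t - e <= align_us for e in lpi_exits)
    )
    return len(spikes), aligned

# Example: two spikes, both within 5 us after an LPI exit
lat = [(10, 800), (20, 4200), (30, 900), (55, 5100), (70, 850)]
exits = [18, 52]
n_spikes, n_aligned = spikes_aligned_with_lpi_exit(
    lat, exits, spike_ns=2000, align_us=5)
```

If nearly all spikes are aligned with LPI exits in the EEE-ON run and absent in the EEE-OFF run, the causal claim is made.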
RGMII
RGMII shows occasional CRC errors, but the line side looks clean. Check internal delay / routing skew or I/O supply noise first?
Likely cause
Most often it is RGMII timing margin (wrong internal delay mode, skew, edge rate) — but a close second is I/O rail noise / ground bounce corrupting sampling.
Quick check
(1) A/B: toggle the PHY’s RGMII internal delay configuration (ID on/off) and see whether CRC errors track the setting. (2) Scope the I/O rail ripple and correlate CRC bursts with rail noise and simultaneous switching events.
Fix
Lock one timing strategy: correct internal delay mode + enforce length matching; add modest series damping where needed (board-specific). If rail noise is implicated, strengthen decoupling and isolate sensitive rails with a clean LDO stage (e.g., TPS7A20) and tighten the ground return stitching near the PHY.
Pass criteria
At worst-case traffic and temperature: CRC errors = 0 over T, and timing margin remains > X (as defined by the interface budget).
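The A/B in the Quick check ("do CRC errors track the ID setting?") can be made an explicit decision. A sketch, assuming you have measured CRC error rates per delay configuration; how the delay mode is toggled is PHY-specific (vendor register or a devicetree `phy-mode` variant):

```python
# Decide whether CRC errors track the RGMII internal-delay (ID) setting.
# The 10x ratio is an assumed significance threshold.

def faulty_delay_mode(rate_id_on, rate_id_off, ratio=10.0):
    """rate_*: CRC errors per unit traffic in each configuration.
       Returns the implicated setting, or None if inconclusive
       (then suspect I/O rail noise / ground bounce instead)."""
    hi = max(rate_id_on, rate_id_off)
    lo = min(rate_id_on, rate_id_off)
    if hi == 0 or (lo > 0 and hi / lo < ratio):
        return None  # errors do not track the ID mode
    return "id_on" if rate_id_on > rate_id_off else "id_off"
```

A clean split (errors in one mode, none in the other) points at timing margin; comparable rates in both modes push the investigation toward the I/O rail.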
SGMII
SGMII “looks locked” but occasionally fails training — verify refclk quality first, or lane termination first?
Likely cause
If training failures are rare and temperature/voltage-dependent, refclk jitter / supply noise is often the first suspect; if failures are deterministic by board/cable, then lane SI/termination is more likely.
Quick check
(1) Swap to a known low-jitter 25 MHz source (e.g., SiT1602…25.000000 or ASV-25.000MHZ-LC-T) and re-run training statistics. (2) If still failing, validate lane termination/coupling against the PHY reference design and check for skew/return-path discontinuities.
Fix
Improve refclk integrity (clean routing, isolation, stable rail; add a low-noise LDO stage such as ADP150 where applicable) and match the recommended SGMII termination/AC coupling scheme. Then lock the interface configuration (no mixed modes across builds).
Pass criteria
Training failures = 0 over soak T across the temperature range, and interface error counters remain below X.
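The triage rule in the Likely cause step (rare and corner-dependent vs deterministic per board) can be encoded over training-run records. A sketch with hypothetical record fields:

```python
# Rank the two SGMII suspects from training statistics.
# Rule from the text: deterministic per-board failures -> lane SI/termination;
# rare, corner-dependent failures -> refclk jitter / supply noise.

def first_suspect(runs):
    """runs: list of dicts {'board', 'temp_c', 'failed': bool}."""
    fails = [r for r in runs if r["failed"]]
    if not fails:
        return "none"
    boards_failing = {r["board"] for r in fails}
    boards_all = {r["board"] for r in runs}
    # The same board(s) failing in every run, while others never fail,
    # is the deterministic signature.
    deterministic = all(
        all(r["failed"] for r in runs if r["board"] == b)
        for b in boards_failing
    )
    if deterministic and boards_failing != boards_all:
        return "lane-termination"
    return "refclk-or-supply"

runs = [
    {"board": "A", "temp_c": 25, "failed": True},
    {"board": "A", "temp_c": 85, "failed": True},
    {"board": "B", "temp_c": 25, "failed": False},
    {"board": "B", "temp_c": 85, "failed": False},
]
suspect = first_suspect(runs)
```

This only orders the investigation; the refclk swap and termination review in the Quick check remain the actual evidence.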
EMI
EMI fails in one band — suspect common-mode leakage or clock-harmonic coupling? What is the “one-step” validation?
Likely cause
A narrow band peak is often either clock-harmonic coupling (refclk/PLL domains) or common-mode conversion at magnetics/connector due to return-path asymmetry.
Quick check
Do a single A/B change while measuring the failing band: populate/bypass the common-mode choke (e.g., try a CMC option such as 744231371) to see if the peak moves/drops. If it does not, A/B the clock source path (swap to a cleaner oscillator like SiT1602) and observe peak correlation.
Fix
If common-mode dominated: enforce symmetry and return stitching, tune CMC, and reduce parasitic imbalance near the connector. If clock-harmonic dominated: shorten and shield the clock path, isolate the clock/PLL rail with a clean LDO stage (TPS7A20/ADP150), and reduce coupling loops in layout.
Pass criteria
EMI margin at the failing band improves to ≥ X dB under the same test setup, with no regression in link stability or TSN timing metrics.
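The one-step A/B above yields two peak measurements to compare against the baseline. A sketch of the attribution, with an assumed "significant delta" threshold; levels are illustrative dBµV readings at the failing band:

```python
# Attribute a narrow-band EMI peak from the two single-variable A/B runs.
# The 6 dB minimum delta is an assumed judgment threshold.

def attribute_peak(baseline_db, cmc_ab_db, clock_ab_db, min_delta_db=6.0):
    """Each *_ab_db is the peak level after exactly one change."""
    if baseline_db - cmc_ab_db >= min_delta_db:
        return "common-mode"     # CMC change moved/dropped the peak
    if baseline_db - clock_ab_db >= min_delta_db:
        return "clock-harmonic"  # cleaner oscillator dropped the peak
    return "inconclusive"        # look for other coupling paths

verdict = attribute_peak(baseline_db=58.0, cmc_ab_db=57.5, clock_ab_db=49.0)
```

Keeping every other variable fixed between runs is what makes the subtraction meaningful.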
ESD
ESD hit drops the link, but reconnect is normal. What is the most common return-path mistake?
Likely cause
ESD current returns through the PHY reference ground (or sensitive analog/PLL ground) instead of being dumped at the connector/chassis region, causing a transient reset/LOS/renegotiation.
Quick check
Verify placement and return: TVS arrays (e.g., PESD2ETH-D / SP3012-04UTG) should be connector-side with a short, low-inductance return to the intended ESD sink (often chassis/connector ground). Correlate link drops with PHY reset/interrupt flags.
Fix
Re-route the ESD return so the discharge loop closes locally at the connector region; add stitching vias/short paths; keep the PHY zone “quiet.” Use low-capacitance protection where the channel margin is tight, and re-validate at the worst-case link rate.
Pass criteria
Under the target ESD level: no link drop (preferred), or auto-recovery < X ms with no speed downgrade and no persistent error-counter increase.
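The pass criteria above can be evaluated mechanically over per-discharge event logs. A sketch, assuming hypothetical event fields:

```python
# Pass/fail the ESD recovery criteria from per-discharge event records.
# Field names and the event format are assumptions.

def esd_pass(events, max_recovery_ms):
    """events: one dict per discharge: {'link_dropped': bool,
       'recovery_ms', 'speed_before', 'speed_after', 'error_delta'}."""
    for e in events:
        if not e["link_dropped"]:
            continue                          # preferred outcome: no drop
        if e["recovery_ms"] > max_recovery_ms:
            return False
        if e["speed_after"] < e["speed_before"]:
            return False                      # speed downgrade after strike
        if e["error_delta"] > 0:
            return False                      # persistent counter increase
    return True

ok = esd_pass(
    [{"link_dropped": True, "recovery_ms": 40.0,
      "speed_before": 2500, "speed_after": 2500, "error_delta": 0}],
    max_recovery_ms=100.0,
)
```

Correlating each record with the PHY reset/interrupt flags from the Quick check tells you which failure branch actually fired.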
Diagnostics
Cable diagnostics/TDR reports a short, but swapping the cable “fixes it.” How to tell measurement artifact vs real defect?
Likely cause
The diagnostic ran under inconsistent conditions (link/EEE state, peer behavior, calibration), or the connector contact is intermittent — creating a false short signature.
Quick check
Run diagnostics only in a controlled state: EEE OFF, stable link policy (or link forced down per vendor guidance), and with a known “golden cable.” Repeat N times; if the result is not repeatable, treat it as an artifact.
Fix
Standardize the diagnostic procedure (state + temperature + cable type), update PHY firmware/driver if required, and treat intermittent connector contact as a hardware defect. Use diagnostics as a secondary tool after correlation checks, not as the first root-cause verdict.
Pass criteria
Under the standardized procedure: false-short rate = 0 / N, and any reported defect is repeatable across runs and correlates with independent evidence (errors/flaps).
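The "repeat N times and demand repeatability" rule is easy to codify. A sketch over the per-run fault reports from the standardized procedure:

```python
# Artifact-vs-defect decision from N repeated diagnostic runs under the
# standardized state (EEE OFF, controlled link policy, golden cable).

def classify_tdr(results):
    """results: list of per-run verdicts, e.g. 'short' / 'open' / 'ok'."""
    faults = [r for r in results if r != "ok"]
    if not faults:
        return "no-defect"
    if len(set(results)) == 1 and len(results) > 1:
        return "repeatable-defect"  # still corroborate with errors/flaps
    return "artifact"               # not repeatable -> measurement artifact

verdict = classify_tdr(["short", "ok", "ok", "ok", "short"])
```

Even a repeatable fault is only promoted to root cause once it correlates with independent evidence, per the pass criteria.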
Chassis
PTP is stable on the bench, but drift increases in the chassis. First priority: airflow baffle, ground bounce, or supply noise?
Likely cause
In chassis, two new couplings dominate: supply noise / ground bounce from shared rails and return paths, and temperature gradients (airflow) that change oscillator/PLL behavior.
Quick check
Start with the highest information gain: (1) log PHY die temp and PLL/I/O rail ripple at the same cadence as drift; (2) A/B: power from a clean bench supply vs system supply; (3) only then A/B a simple airflow baffle.
Fix
If drift tracks supply/ground: add a dedicated low-noise stage for clock/PLL rails (TPS7A20 / ADP150), improve return stitching, and reduce shared high-di/dt coupling. If drift tracks temperature gradients: improve airflow guidance and keep the clock source stable (e.g., a robust oscillator like SiT1602).
Pass criteria
In chassis worst-case: drift slope < X ns/°C and offset stability meets the system requirement over soak T, with no correlation to rail ripple beyond X mVpp.
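The drift-slope metric in the pass criteria is a simple least-squares fit of PTP offset against die temperature. A stdlib-only sketch with illustrative data:

```python
# Compute the drift slope (ns/degC) from logged (die temp, PTP offset)
# pairs via ordinary least squares, for comparison against the X ns/degC
# pass threshold. Data below are illustrative.

def drift_slope_ns_per_c(temps_c, offsets_ns):
    n = len(temps_c)
    mt = sum(temps_c) / n
    mo = sum(offsets_ns) / n
    num = sum((t - mt) * (o - mo) for t, o in zip(temps_c, offsets_ns))
    den = sum((t - mt) ** 2 for t in temps_c)
    return num / den

temps = [35.0, 45.0, 55.0, 65.0]
offsets = [10.0, 30.0, 50.0, 70.0]  # perfectly linear: 2 ns/degC
slope = drift_slope_ns_per_c(temps, offsets)
```

The same fit run against rail-ripple samples instead of temperature gives the "no correlation beyond X mVpp" half of the check.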
Field / Logs
Compliance passes, but the field shows intermittent downshift. Which log field is usually missing (EEE / PTP / temperature / cable info)?
Likely cause
Field downshift is often triggered by a state transition or corner condition that compliance didn’t exercise (peer diversity, EEE policy, supply/thermal drift, cable variability).
Quick check
Compare “good vs bad” field captures: if downshift events exist but EEE state (LPI enter/exit count at event time) is missing, root-cause classification becomes impossible. Correlate downshift with temperature and cable category/length if available.
Fix
Add EEE state + LPI enter/exit counters as mandatory fields, plus the peer ID/capabilities and a minimal cable descriptor. Then re-run the validation matrix at the field-relevant corner (long + hot + peer diversity + EEE policy).
Pass criteria
Field events become classifiable: each downshift is attributable to a bucket (EEE/peer/cable/temp/supply) with > X% confidence, and the final policy holds downshift < X/hour over T.
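The bucketing in the pass criteria only works once the mandatory fields exist; absence of the EEE field should be an explicit outcome, not a silent default. A sketch with hypothetical field names and thresholds:

```python
# Bucket a field downshift event once the mandatory log fields are present.
# Field names and thresholds are assumptions; order encodes priority.

def bucket_downshift(ev):
    if "lpi_exits_at_event" not in ev:
        return "unclassifiable"   # the usually-missing EEE state field
    if ev["lpi_exits_at_event"] > 0:
        return "eee"
    if ev.get("die_temp_c", 0) > 85:
        return "temp"
    if ev.get("cable_len_m", 0) > 80:
        return "cable"
    if not ev.get("peer_known", True):
        return "peer"
    return "supply-or-other"

b1 = bucket_downshift({"die_temp_c": 90})  # EEE field missing -> no verdict
b2 = bucket_downshift({"lpi_exits_at_event": 2, "die_temp_c": 40})
```

Counting "unclassifiable" results per fleet release is also a direct measure of whether the logging mandate is actually deployed.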