Industrial-Grade Ethernet PHY for Wide Temp, EMC & TSN
Industrial-grade Ethernet PHY means verifiable robustness: wide-temperature stability, EMC immunity, and predictable low-BER margin with evidence-ready counters and TSN timestamp hooks. It turns “it runs” into “it ships” by defining measurable pass criteria (thresholds are written as X placeholders, to be filled in per project) and a repeatable bring-up and field-forensics workflow.
H2-1 · Definition & Boundary: What “Industrial-Grade PHY” Means
Industrial-grade PHY = predictable link margin (low BER) under wide temperature and strong EMC, with built-in observability to diagnose and prove failures with evidence.
Scope guard
- In-scope: acceptance-ready metrics, test evidence, BER/margin proof, EMC behavior, and diagnostic hooks at PHY/PCS level.
- Out-of-scope: TVS/CMC/magnetics placement details, PoE/PoDL power co-design, TSN scheduling (Qbv/Qci/GCL), and protocol-stack implementations.
The 4-Dimensional Acceptance Model (Engineering, Not Marketing)
A PHY is “industrial-grade” only when all four dimensions are measurable and passable: Temperature robustness, EMC robustness, predictable margin/BER, and observability (evidence chain).
1) Temperature (Wide-temp stability)
- Definition: link stability + margin + latency/timestamp stability across X°C range.
- How to measure: thermal sweep + fixed cable + PRBS/loopback + counter logging.
- Evidence: retrain/link-down counts vs temperature; CRC/PCS-error counters vs temperature.
- Pass criteria (X): retrain ≤ X/hour; CRC ≤ X per 10⁹ bits; delay drift ≤ X ns/°C.
- Common pitfall: “room-temp OK” masking temperature-driven threshold/jitter/channel-loss shifts.
2) EMC (Immunity + recoverability)
- Definition: under ESD/EFT/Surge events, the PHY behavior is recoverable and explainable (counters show a fingerprint).
- How to measure: event injection + continuous traffic/PRBS + synchronized counter snapshots.
- Evidence: error burst shape, recovery time, retrain/auto-neg loop counts, post-event “baseline shift”.
- Pass criteria (X): recovery ≤ X ms; no permanent lock-up; post-event baseline returns within X minutes.
- Common pitfall: only checking “link stays up” while missing degradation (“more fragile after tests”).
3) Margin / BER (Predictable, provable)
- Definition: not “works”, but “measured margin exists” under worst-case channel + temp + noise.
- How to measure: PRBS pattern (X) + duration window (X) + confidence framing (X).
- Evidence: BER statistic window, PCS error counters, retrain counts, rate downshift events.
- Pass criteria (X): BER ≤ X; retrain ≤ X/day; error bursts ≤ X per hour.
- Common pitfall: “0 errors in short time” misread as “BER = 0”.
4) Observability (Diagnostics + evidence chain)
- Definition: failures must answer: when, where (PHY/PCS vs MAC), and under what conditions (temp/power/event).
- How to measure: counter taxonomy + periodic snapshots + event tagging.
- Evidence: CRC/PCS symbol-error/retrain, timestamp noise (if applicable), temperature + supply rails.
- Pass criteria (X): required fields completeness = 100%; root-cause isolation success ≥ X%.
- Common pitfall: only “link up/down” logs (no forensic value).
Metric Definition Table (Acceptance-ready, with X placeholders)
The table below standardizes metric definitions to prevent “window/denominator mismatch” across labs, stations, and field logs.
| Dimension | Metric | Definition (Denominator) | Test mode | Window | Evidence | Pass criteria (X) |
|---|---|---|---|---|---|---|
| Temperature | Temp range | Operating range measured at board sensor (X) | Thermal sweep + PRBS | X minutes per point | retrain/CRC/PCS errors vs temp | No link lock-up; drift ≤ X |
| EMC | ESD class | Specified test level (X) + injection point | Traffic + event injection | X shots / point | burst profile + recovery time | Recover ≤ X ms; stable baseline |
| Margin/BER | BER target | Errors per 10⁹ bits (X) | PRBS pattern (X) | X minutes / run | BER + PCS errors + retrain | BER ≤ X; retrain ≤ X/day |
| Observability | Evidence completeness | Required fields present / total fields | Counter snapshots | Every X seconds | time/temp/power + counters | Completeness = 100% |
Use this quadrant as the acceptance checklist: each dimension must be measurable, logged, and passable (thresholds marked as X).
H2-2 · Field Environment → Failure-Mode Map (Temp / EMC / Long Cable)
Field problems become solvable when symptoms are mapped to coupling paths and verified by a counter “fingerprint” within a defined window.
Why field links fail “randomly”
- Noise is event-driven: VFD speed steps, solenoid switching, welding bursts, ESD hits, and surge events create short error bursts that vanish on the bench.
- Coupling is path-dependent: common-mode injection, ground bounce, shield termination, and return-plane cuts change the receiver’s effective threshold and jitter margin.
- Metrics are often mis-accounted: a CRC rate quoted without a defined window and denominator can look low on average while hiding the short bursts that actually stall traffic.
PHY-relevant failure-mode taxonomy (keep scope tight)
- Link-state class: link flap, intermittent link-down, frequent retrain.
- Integrity class: sporadic CRC bursts, rising PCS/symbol errors, drop counters rising under EMI events.
- Training/negotiation class: auto-neg loops, rate downshift, unstable EEE transitions (if enabled).
- Diagnostics class: false cable-fault reports (open/short) caused by noise bursts or grounding path changes.
Field troubleshooting matrix (Trigger → Evidence → First check)
Always record the measurement window and denominator (e.g., errors per 10⁹ bits, per 1k frames, per minute). Without this, counter comparisons are not actionable.
| Symptom | Trigger (field) | Evidence (fingerprint) | Quick check (≤ 5 min) | Fix direction (PHY scope) | Pass criteria (X) |
|---|---|---|---|---|---|
| Link flap / retrain bursts | VFD speed step, welding burst, cabinet door ESD | retrain↑ + PCS errors↑ within window X; recovery time measurable | Align counters to event timestamps; verify window/denominator | Tune recovery policy; increase logging granularity; isolate to EMI vs clock/power using PRBS/loopback | retrain ≤ X/hour; recovery ≤ X ms |
| CRC bursts, link stays up | Solenoid switching, relay coil, noisy DC bus | CRC↑ but link stable; PCS errors may lead CRC by Δt | Compare CRC window vs PCS window; check if bursts correlate with temp/power dips | Use PRBS to remove protocol variability; increase counter sampling rate around bursts | CRC ≤ X/10⁹ bits in worst case |
| Auto-neg loops / unstable rate | Long cable run, cabinet ground shift, intermittent contact | auto-neg restart count↑; rate downshift events | Lock config to a fixed mode briefly; compare behavior and counters | Stabilize negotiation policy; validate margin with PRBS; verify ref-clock/power stability during transitions | No restart loops; stable rate for X hours |
| False cable fault / intermittent open-short | High EMI burst, ESD hit, ground bounce | Diag flags coincide with EMC bursts; returns to normal after X | Repeat diag with quiet window; cross-check with PCS errors | Tag diag results with event context; avoid treating single burst as a hard cable fault | Diag stability ≥ X% across X runs |
| Fails only at hot/cold edges | Hot cabinet, cold-start, rapid thermal transients | Error rate rises with temperature slope; drift signature | Log temp + counters; check if errors align with ΔT/Δt | Validate across thermal sweep; separate channel-loss vs clock/power drift via loopback | Stable BER ≤ X across X°C |
First-check priority (prevents wasted effort)
- Normalize accounting: confirm window + denominator for every counter trend.
- Read the fingerprint: determine which rises first (PCS errors → CRC → retrain).
- Minimize variables: fixed cable, fixed config, repeatable stimulus (PRBS/loopback).
- Correlate with context: event tags (VFD step/ESD) + temperature + supply rails.
- Escalate by scope: when coupling path points to protection/shielding, only then move to the protection/grounding page.
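The "read the fingerprint" step can be sketched as a small ordering check over windowed counter deltas. This is only an illustration, assuming every counter has already been reduced to per-window deltas on a shared time base; the function names and the counter keys (`pcs`, `crc`, `retrain`) are hypothetical and do not come from any PHY vendor API.

```python
def first_rise(series, threshold=0):
    """Index of the first window whose delta exceeds threshold, or None."""
    for i, value in enumerate(series):
        if value > threshold:
            return i
    return None


def fingerprint(windows, threshold=0):
    """Order counter names by the window in which they first rise.

    windows: dict mapping a counter name to its list of per-window deltas.
    Counters that never exceed the threshold are left out of the result.
    Ties in the same window are broken alphabetically, so a tie carries
    no ordering information.
    """
    risen = []
    for name, series in windows.items():
        idx = first_rise(series, threshold)
        if idx is not None:
            risen.append((idx, name))
    return [name for _, name in sorted(risen)]


# Hypothetical burst: PCS errors appear first, CRC one window later,
# and a retrain only at the end.
example = {
    "pcs":     [0, 0, 4, 9, 2],
    "crc":     [0, 0, 0, 3, 1],
    "retrain": [0, 0, 0, 0, 1],
}
order = fingerprint(example)
```

An ordering of `pcs`, then `crc`, then `retrain` matches the PHY-origin pattern described above; a result that contains only `crc` would instead point toward the MAC or upstream side.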
This map is designed for fast field isolation: identify the coupling path class first, then validate with a counter fingerprint within a defined window.
H2-3 · Link Margin & Low BER: “Predictability” of an Industrial PHY
A link is industrial-grade only when margin exists and is provable: BER is defined by a window + denominator + confidence, and failures can be attributed to measurable contributors.
BER / Margin engineering definition (acceptance-ready)
- Test mode: use PRBS to validate channel+PHY BER; use loopback to isolate internal behavior and remove protocol variability.
- Window + denominator: report errors per 10⁹ bits or per X minutes (always explicit), not “percentage without context”.
- Statistical confidence: for “0 errors” runs, document confidence level CL = X and observed bits N = X (so “0 errors” becomes a BER upper bound, not BER=0).
- Corner sweep: margin must be proven at worst corners (speed / cable / temperature / supply / EMI context = X).
- Pass criteria (X): BER ≤ X; retrain ≤ X/day; burst errors ≤ X/hour; stable rate for X hours.
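The "0 errors is not BER = 0" point follows from a standard zero-error bound: if a run of N bits shows no errors, the probability of that outcome at a true error rate p is (1 − p)^N ≈ e^(−Np), so BER ≤ −ln(1 − CL) / N at confidence CL. The sketch below assumes independent bit errors, and the numeric values in the example (1e-12, 95%) are illustrative rather than thresholds from this page.

```python
import math


def bits_required(ber_target, confidence):
    """Bits to observe with zero errors to claim BER <= ber_target at the given confidence."""
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    if ber_target <= 0.0:
        raise ValueError("ber_target must be positive")
    return -math.log(1.0 - confidence) / ber_target


def ber_upper_bound(bits_observed, confidence):
    """Upper bound on BER after an error-free run of bits_observed bits."""
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    if bits_observed <= 0:
        raise ValueError("bits_observed must be positive")
    return -math.log(1.0 - confidence) / bits_observed


# Illustrative values only: BER target 1e-12 at 95% confidence.
n_bits = bits_required(1e-12, 0.95)
bound = ber_upper_bound(n_bits, 0.95)
```

For these example numbers the required run is about 3 × 10¹² error-free bits, which at an assumed 1 Gb/s bit rate is roughly 50 minutes. Shorter runs can still be reported, but only as the weaker bound that `ber_upper_bound` returns.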
Margin decomposition (engineering-level, measurable)
Predictability comes from identifying the dominant contributor and proving it with a measurement method and a counter fingerprint.
- Channel loss / ISI: sensitivity rises with speed/length/temperature; verify using PRBS across length + temperature corners; look for monotonic slope vs length.
- Return loss (reflections): error bursts often concentrate around specific harness/connector conditions; validate with TDR / return-loss sanity checks and repeatability across reconnects.
- XTALK (crosstalk): errors correlate with neighbor activity or bundle routing; validate with A/B tests (neighbor quiet vs active) while holding traffic constant.
- Noise floor (EMI/power/common-mode): event-driven bursts; validate by aligning counters to event tags (VFD step/relay/ESD) within a fixed time window.
- Jitter (clock/threshold/ground-bounce): errors correlate with temperature slope or supply ripple; validate by tracking drift + error timing (PCS first, CRC later).
Minimum evidence set (copy-ready for lab + field)
- Corner tags: speed X, cable length X, temperature X, supply X, EMC context X.
- Mode: PRBS / loopback; pattern X; duration X; bits window N = X; confidence CL = X.
- Counters: PCS/symbol errors, CRC, drop, retrain, rate-change, EEE transitions (if enabled).
- Outcome: BER ≤ X and dominant contributor = X (supported by a measurable method).
The budget is actionable only if each contributor maps to a measurable method and leaves a consistent counter fingerprint.
H2-4 · EMC Is Not “Just External”: PHY Immunity + System Cooperation
EMC events appear as data errors, drop bursts, retrain loops, or link-down because the receiver threshold and sampling margin are disturbed—so the PHY must expose recovery behavior and a verifiable evidence chain.
PHY-side capabilities (what industrial-grade implies)
- Robust receiver: tolerates common-mode shifts and burst noise without collapsing into non-recoverable states.
- Squelch / threshold tolerance: avoids false “link lost” and reduces negotiation thrashing triggered by short events.
- Layered error detection: PCS/symbol error visibility plus CRC/drop counters for fingerprinting (PCS often leads CRC).
- Fast recovery policy: predictable retrain behavior with bounded recovery time (no retry storm).
- Diagnostics hooks: event-tagging support and periodic snapshots to correlate EMC events with counter spikes.
System cooperation (principles only; layout details out-of-scope)
- Deterministic return paths: grounding/shielding strategy must create a predictable current return during fast transients.
- Consistent reference scheme: mixed chassis bonds or inconsistent shield termination turns repeatable tests into “random” failures.
- Evidence-first validation: EMC acceptance should require counters + recovery time, not only “link did not drop”.
- Boundary note: TVS / CMC / magnetics placement and selection are handled in the dedicated Protection sub-page (one-line pointer only).
EMC test → expected evidence chain (EFT / ESD / Surge)
For each injection type, define a window and check that counters form a consistent fingerprint and recovery stays bounded.
| Injection | Typical symptom | Expected counters (fingerprint) | Pass criteria (X) |
|---|---|---|---|
| EFT (burst) | CRC bursts; occasional retrain | PCS/symbol errors spike within window X → CRC rises → bounded drops | Recover ≤ X ms; retrain ≤ X per test |
| ESD | Link flap; auto-neg restart | Retrain count ↑; link-down time measurable; post-event baseline shift check | No lock-up; baseline returns within X minutes |
| Surge | Immediate drop; rate downshift | Drop counter + retrain; PCS errors indicate receiver disturbance; recovery time logged | Recover ≤ X ms; stable link ≥ X minutes |
This diagram enforces the scope: evidence is validated at PHY/PCS counter level; board-level protection details are handled in the protection sub-page.
H2-5 · Deterministic Latency & TSN Timestamp “Hooks” (PHY Scope Only)
Industrial real-time cares less about averages and more about tail latency and jitter; a PHY is “TSN-ready” only when the timestamp tap point is explicit and the path delay variation is bounded and measurable.
Timestamp tap-point definition (MAC vs PHY/PCS)
- MAC-side timestamp: easier integration, but may include variability from MAC FIFOs, bus arbitration, and buffering—often increasing tail jitter if not tightly controlled.
- PHY/PCS-side timestamp: closer to the wire-time and less exposed to upper-layer queueing, but requires a clear definition of compensation and calibration so the result is comparable across resets and temperature.
- Acceptance rule: the tap point and compensation model must be documented; timestamps from different tap points are not interchangeable.
Latency decomposition (framework-level, measurable)
A useful model splits delay into a deterministic component and a variation component: Total delay = Δt_deterministic + Δt_variation.
- TX path: input framing → FIFO/CDC → PCS encode → serialization → line output.
- RX path: line sampling → recovery/decision → PCS decode → FIFO/CDC → output.
- FIFO / buffering: main driver of variation when occupancy changes under load or during recovery.
- Retiming / state transitions: can introduce repeatable shifts after reset, retrain, or thermal transitions if not controlled.
Measurement evidence (PHY-focused)
- Loop / two-port repeatability: measure Δt distribution under a fixed window and verify p99 tail bounds.
- Reset repeatability: check that post-reset delay does not jump across discrete “modes” beyond X.
- Thermal sweep: track delay shift vs temperature and quantify slope and hysteresis.
Pass criteria (placeholders X)
- Timestamp noise: σ_ts ≤ X (or p99−p50 ≤ X for tail control).
- Path delay variation: Δt_var(p99) ≤ X within window X.
- Thermal delay shift: |Δt/°C| ≤ X and hysteresis ≤ X.
- Reset repeatability: Δt_reset_shift ≤ X across N resets.
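The noise and tail metrics in these criteria can be computed from a list of measured path delays. The sketch below is one possible definition, assuming delays in nanoseconds collected within a single fixed window, the sample standard deviation for σ_ts, and nearest-rank percentiles; other percentile conventions give slightly different values, so the chosen method should be recorded alongside the window.

```python
import statistics


def percentile_nearest_rank(sorted_vals, p):
    """Nearest-rank percentile for an integer p in 1..100 on a pre-sorted list."""
    n = len(sorted_vals)
    rank = -(-p * n // 100)          # integer ceil(p * n / 100)
    return sorted_vals[max(rank, 1) - 1]


def delay_stats(samples_ns):
    """Summarize one window of delay samples: sigma, p50, p99, and the p99-p50 tail."""
    if len(samples_ns) < 2:
        raise ValueError("need at least two samples to estimate sigma")
    s = sorted(samples_ns)
    p50 = percentile_nearest_rank(s, 50)
    p99 = percentile_nearest_rank(s, 99)
    return {
        "sigma": statistics.stdev(s),
        "p50": p50,
        "p99": p99,
        "tail": p99 - p50,
    }


# Synthetic samples (1..100 ns) used only to exercise the function.
stats = delay_stats(list(range(1, 101)))
```

Reporting `tail` alongside `sigma` matters because a distribution with a small standard deviation can still carry a long p99 tail, which is the quantity real-time traffic is sensitive to.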
TSN scheduling parameters (Qbv/Qci/GCL) belong to the TSN switch page; this section stays strictly at the PHY tap-point and delay stability level.
Use the same tap point across builds and tests; otherwise “timestamp jitter” becomes a definition mismatch rather than a real PHY behavior.
H2-6 · Wide Temperature & Aging: Stable Link from -40°C to +105/+125°C
Temperature shifts the entire boundary condition (drive, threshold, jitter, and channel loss) rather than “slightly reducing performance”, so stability must be proven by drift-aware logging and margin evidence—not by single-point tests.
What temperature changes (PHY-relevant list)
- Driver amplitude drift: reduces margin headroom; error rate becomes sensitive to temperature slope and cable length.
- Receiver threshold drift: increases susceptibility to common-mode bursts; PCS errors often lead CRC.
- Jitter / timing drift: compresses sampling margin; tail error bursts become more frequent.
- Cable loss change: increases ISI; high rate and long runs degrade faster at extremes.
- Magnetics parameter drift (high-level): can shift return-loss / mode suppression behavior; protection and magnetics details are handled in dedicated pages.
Validation strategy (slope + hysteresis + soak)
- Slope: quantify how counters and error rate change per °C (not only pass/fail at a single point).
- Hysteresis: compare warm-up vs cool-down behavior to catch stress and recovery asymmetry.
- Soak: hold at extremes for X hours to detect slow degradation and “Monday failures”.
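Slope and hysteresis can both be reduced to single numbers per sweep. The following is a minimal sketch, assuming the metric (delay, error rate, or a counter rate) has been sampled at the same temperature set points on the warm-up and cool-down legs; the function names and the example values are hypothetical.

```python
def slope_per_degree(temps_c, values):
    """Least-squares slope of a metric against temperature (metric units per degree C)."""
    n = len(temps_c)
    if n != len(values) or n < 2:
        raise ValueError("need two or more paired samples")
    mean_t = sum(temps_c) / n
    mean_v = sum(values) / n
    sxx = sum((t - mean_t) ** 2 for t in temps_c)
    if sxx == 0:
        raise ValueError("all temperatures are identical; slope is undefined")
    sxy = sum((t - mean_t) * (v - mean_v) for t, v in zip(temps_c, values))
    return sxy / sxx


def hysteresis(warm_up, cool_down):
    """Largest |warm_up - cool_down| over the set points present in both legs.

    warm_up, cool_down: dicts mapping temperature set point -> metric value.
    """
    common = warm_up.keys() & cool_down.keys()
    if not common:
        raise ValueError("no shared temperature set points")
    return max(abs(warm_up[t] - cool_down[t]) for t in common)


# Synthetic example: delay in ns at three set points.
up = {-40: 10.2, 25: 11.5, 105: 13.1}
down = {105: 13.1, 25: 11.8, -40: 10.3}
k = slope_per_degree(list(up.keys()), list(up.values()))
h = hysteresis(up, down)
```

A straight-line fit is a deliberate simplification; if the metric bends sharply near one extreme, a per-segment slope is more honest than a single number for the whole range.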
Black-box fields to log (production + field)
Drift becomes debuggable only when environment tags and counters are captured with consistent time windows.
| Category | Fields (placeholders X) | Why it matters |
|---|---|---|
| Thermal | Board temp X, ambient X, cold/warm state X | Separates slope from event-driven bursts |
| Power | Rails X, ripple class X, reset/brownout tag X | Explains jitter/threshold drift and retrain loops |
| Link state | Rate/duplex X, negotiation count X, EEE state X | Distinguishes real drift from mode flips |
| Counters | PCS errors, CRC, drops, retrain, rate-change | Builds a temperature-to-error fingerprint |
| Context tags | Load state X, event tag X, window X | Aligns bursts to real-world triggers |
Pass criteria (placeholders X)
- BER across temp range: BER ≤ X from -40°C to +105/+125°C.
- Slope limit: ΔBER/°C ≤ X (or counter slope ≤ X per °C).
- Transition recovery: baseline returns within X minutes after thermal transitions.
- Soak stability: no monotonic degradation during X-hour soak at extremes.
Treat temperature as a drift vector: prove slope, hysteresis, and soak stability instead of relying on single-point pass/fail results.
H2-7 · Power / Clock / Reset: Industrial PHY “False Link” Failures
Many dropouts and jitter spikes are not cable problems; reference clock quality, power integrity, and reset/strap timing can destabilize PLL/CDC/FIFOs and look like SI—so definitions, evidence, and pass criteria must be explicit.
Reference clock quality (engineering definition)
- Frequency offset: ppm ≤ X (measured over window X, after stabilization X).
- Jitter definition: RMS jitter ≤ X over integration band X–X, or cycle-to-cycle jitter ≤ X over window X.
- Coupling to BER: higher clock jitter reduces sampling margin and drives PCS errors and BER bursts under stress.
- Coupling to timestamps: clock noise increases retiming/CDC variability, raising timestamp noise and tail latency.
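The frequency-offset criterion is a simple ratio, but it is worth writing down so the sign convention and the nominal frequency are unambiguous. This sketch assumes a 25 MHz reference purely as an example; the limit is passed in as a parameter because the page leaves it as X.

```python
def offset_ppm(f_measured_hz, f_nominal_hz):
    """Signed frequency offset in parts per million relative to nominal."""
    if f_nominal_hz <= 0:
        raise ValueError("nominal frequency must be positive")
    return (f_measured_hz - f_nominal_hz) / f_nominal_hz * 1e6


def within_budget(f_measured_hz, f_nominal_hz, limit_ppm):
    """True if the absolute offset is inside the stated ppm limit."""
    return abs(offset_ppm(f_measured_hz, f_nominal_hz)) <= limit_ppm


# Example: a 25 MHz reference measured 500 Hz high.
ppm = offset_ppm(25_000_500.0, 25_000_000.0)
```

The measurement window and stabilization time still need to be logged with the result, since a reading taken before the oscillator settles is not comparable to one taken after.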
Power-up & reset timing checklist (vendor-neutral)
- Strap latch stability: straps must be stable for X before/after latch to avoid random mode selection.
- MDIO reachability: readable/writable within X ms; read-back consistency must hold across resets.
- PLL lock: lock time ≤ X; no periodic unlock/relock after warm-up.
- Link stability guard: after link-up, no self-triggered renegotiation or downshift within X minutes (unless configured).
- Hot-state repeat: repeat reset after temperature soak; strap/MDIO/PLL must remain deterministic.
Triage map: symptom → first suspect → quick check → pass
| Symptom | First suspect | Quick check | Pass criteria (X) |
|---|---|---|---|
| Link flap / renegotiation loop | Reset/strap timing, PLL stability | Check strap latch stability + PLL lock/unlock counters | retrain ≤ X, renegotiation ≤ X/day |
| CRC bursts with “clean-looking” waveforms | Clock jitter, power ripple | Hold fixed PRBS window; correlate PCS errors with supply/clock tags | PCS errors ≤ X/window, BER ≤ X |
| Timestamp noise / tail latency thickening | Clock noise, CDC/FIFO variability | Compare tap-point distribution across load and temperature windows | σ_ts ≤ X, p99−p50 ≤ X |
| Only fails after warm-up | PLL margin, strap/clock drift with temperature | Repeat the same checks in thermal steady-state (hot soak) | No unlock events, stable BER ≤ X |
Protection parts and magnetics placement are handled in dedicated pages; this section focuses on the clock/power/reset evidence chain to avoid misdiagnosing “false SI”.
A stable-looking waveform does not rule out clock/power/reset issues; counters and repeatability across reset and temperature must close the evidence loop.
H2-8 · Bring-up Flow: From “Link Up” to Verified Regression
Bring-up must be a repeatable step tree: isolate internal paths first, prove BER with defined statistics, then add diagnostics and stress to control tail behavior and enable regression.
Phase steps (with pass placeholders X)
- Link up: negotiation result and stability. Pass: stable ≥ X min; retries ≤ X.
- Loopback: isolate channel vs internal path. Pass: no errors in window X; counters steady.
- PRBS: prove BER with defined statistics. Pass: BER ≤ X with N bits ≥ X and CL ≥ X.
- Cable diagnostics: classify opens/shorts/pair issues for field service. Pass: consistent across N runs; localization ≤ X.
- Stress: load + temperature + events to expose tail. Pass: p99 latency ≤ X; retrain ≤ X; bursts ≤ X/hour.
Minimal evidence set (to enable regression)
- Link: rate/duplex/EEE state, link-down count, negotiation count.
- Loopback: loop type tag, internal error counters windowed by time.
- PRBS: pattern, time window, total bits N, confidence level CL.
- Diag: classification result, repeatability N, environment tags (temp/power).
- Stress: load tag, temperature segment, event tags, p50/p99 window definition.
“Minimal register set” (vendor-neutral categories)
- Strap / boot config: interface mode, addressing/ID, feature enables (logical checks only).
- MDIO accessibility: ID/version readable, write-read consistency, controlled retries.
- PLL / clocking: lock state, lock time, unlock events counters.
- Autoneg / training: negotiated mode stable, downshift/retrain counts.
- Counters: PCS errors, CRC, drops, retrain, rate-change (for fingerprints).
The step order prevents misattribution: isolate internal behavior before blaming the channel, and lock down pass criteria so regressions can be automated.
H2-9 · Verification & Certification: Turning “Works” Into “Deliverable”
Industrial delivery requires evidence: consistent results across environments, repeatable statistics, and traceable logs that map each pass/fail to a unique configuration and test run.
Deliverable evidence = reproducibility + consistency + traceability
- Reproducibility: under the same conditions, repeated runs (N = X) keep key metrics within threshold X (window definition fixed).
- Consistency: conclusions do not change across the environmental matrix because metric definitions and windows are invariant.
- Traceability: every test point links to a unique config hash, script version, DUT revision, and timestamped evidence snapshot.
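One way to obtain a "unique config hash" is to hash a canonical serialization of the configuration, so that two logically identical configs always produce the same tag. The sketch below assumes the configuration can be expressed as a JSON-serializable dict; the keys shown are hypothetical examples, not a required schema.

```python
import hashlib
import json


def config_hash(config):
    """SHA-256 hex digest of a canonical JSON form of the config dict.

    Keys are sorted and whitespace is removed so that key order and
    formatting do not change the hash.
    """
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical configuration fields for illustration.
cfg_a = {"rate": "100M", "eee": False, "prbs_pattern": "X"}
cfg_b = {"prbs_pattern": "X", "eee": False, "rate": "100M"}   # same content, different order
cfg_c = {"rate": "100M", "eee": True, "prbs_pattern": "X"}    # one field changed

hash_a = config_hash(cfg_a)
hash_b = config_hash(cfg_b)
hash_c = config_hash(cfg_c)
```

Storing the hash in every evidence snapshot makes a silent parameter change visible as a hash change, even when the person reading the log does not know which field moved.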
IEEE 802.3 consistency (categories only, evidence-first)
- Electrical / analog: boundary compliance; evidence = report ID + summary metrics (no deep waveform details here).
- Link bring-up / autoneg: stable mode selection; evidence = negotiation count, downshift, retrain.
- Loopback / PRBS: measurable BER; evidence = pattern, bits N, confidence CL, window X.
- Management / observability: deterministic readback; evidence = MDIO reachability and consistency tags.
- Timestamp / latency hook: defined tap point stability; evidence = σ_ts, p99−p50, thermal drift slope ≤ X.
Interface points for industrial protocol certifications (PHY-facing only)
- Stability evidence: retrain, renegotiation, rate-change counters with fixed windows (per X seconds / per Y frames).
- Error fingerprints: PCS/symbol errors vs CRC/drop (when available) to separate PHY-origin bursts from upstream congestion.
- Diagnostics snapshot: cable diagnosis classification + last-fault time tag (no algorithm detail here).
- Trace tags: strap/mode tag, ref-clock source tag, firmware/script version hash (system-provided) to audit any failure.
Verification matrix skeleton (environment × test category)
Keep the matrix finite and repeatable: only category buckets are listed here; each cell records evidence fields and pass thresholds (X).
| Env dimension | Buckets | Required evidence fields | Pass criteria (X) |
|---|---|---|---|
| Temperature | Cold / Room / Hot | board_temp, mode/rate, retrain, PCS errors, BER window | BER ≤ X; retrain ≤ X; drift slope ≤ X |
| Voltage | Low / Nom / High | rail_id, ripple_class, brownout_flag, counters snapshot | no brownout; errors ≤ X/window |
| Cable | Short / Nominal / Worst-loss | cable_tag, diag_result, PRBS/BER, retrain pattern | BER ≤ X; diag consistent ≥ X runs |
| Interference events | ESD / EFT / Surge | event_tag, time, counters pre/post, link stability window | recovery ≤ X; no persistent fragility |
| Load | Idle / Sustained / Burst | load_tag, p50/p99 window, σ_ts, CRC/PCS errors | p99 ≤ X; σ_ts ≤ X; errors ≤ X/window |
Each grid cell is a deliverable: fixed metric windows + required evidence fields + pass thresholds (X), so results remain comparable across teams and over time.
H2-10 · Field Diagnostics & “Black Box”: Why Industrial PHY Must Be Forensic-Ready
The most expensive field problem is “cannot reproduce.” A black-box evidence chain turns symptoms into counters and tags, then into a clear isolation direction (cable vs EMC event vs clock vs power/reset).
Counter layering (focus on PHY needs)
- PHY / PCS layer (primary): PCS/symbol errors, retrain events, rate changes, link-down causes (when available).
- MAC layer (for boundary): CRC, drops, overruns (used to separate PHY-origin bursts from upstream congestion).
- Upper layer (only as alignment): timeouts/retries/throughput tags (do not diagnose here; use as “symptom timestamp”).
Black box minimal fields (standardized, evidence-ready)
- Time: timestamp (monotonic/UTC tag), window length X.
- Temp: board_temp / ambient tag.
- Power: rail_id, brownout_flag, ripple_class.
- Link: mode/rate/duplex, negotiation count, retrain count, rate-change count.
- Errors (windowed): PCS errors, CRC_windowed, drop_windowed.
- Diagnostics: cable_diag_result (classification), last_diag_time.
- Context tags: load_state, event_tag (ESD/EFT/Surge), script/config hash.
Keep the set minimal but sufficient: the goal is to reproduce evidence, not to log everything.
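The "required fields completeness = 100%" criterion can be checked mechanically on every snapshot. The sketch below assumes snapshots are plain dicts; several field names follow the list above (`board_temp`, `rail_id`, `brownout_flag`, `cable_diag_result`, and so on), while the remaining names are hypothetical spellings chosen for illustration.

```python
REQUIRED_FIELDS = (
    "timestamp", "window_s",
    "board_temp",
    "rail_id", "brownout_flag", "ripple_class",
    "mode", "negotiation_count", "retrain_count", "rate_change_count",
    "pcs_errors", "crc_windowed", "drop_windowed",
    "cable_diag_result", "last_diag_time",
    "load_state", "event_tag", "config_hash",
)


def completeness(record):
    """Return (fraction of required fields present, list of missing field names).

    A field counts as present when its value is not None, so a legitimate
    False or 0 (for example brownout_flag=False) is not treated as missing.
    """
    missing = [f for f in REQUIRED_FIELDS if record.get(f) is None]
    fraction = (len(REQUIRED_FIELDS) - len(missing)) / len(REQUIRED_FIELDS)
    return fraction, missing


# Build a fully populated dummy record, then drop one field.
full = {name: 0 for name in REQUIRED_FIELDS}
full["brownout_flag"] = False
partial = dict(full)
del partial["event_tag"]

full_result = completeness(full)
partial_result = completeness(partial)
```

Rejecting incomplete snapshots at capture time is usually cheaper than discovering weeks later that the one missing tag was the one needed to separate a power event from an EMC event.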
Windowed statistics: turn “random” into comparable fingerprints
- Always window counters: errors per X seconds / per Y frames / per Z bits.
- Trigger snapshots on bursts: if burst > X per window, capture a structured evidence snapshot (fields + counters + tags).
- Keep windows invariant: changing the window redefines the metric and breaks trend comparability.
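The windowing rules above can be sketched as a small reducer over periodic counter reads. This is an illustration under stated assumptions: counters are free-running and read at a fixed period, the error counter is assumed to be 32 bits wide and the bit counter 64 bits wide, and the burst threshold is a placeholder for the page's X.

```python
def counter_delta(cur, prev, width_bits=32):
    """Difference between two reads of a free-running counter, tolerating one wrap."""
    return (cur - prev) % (1 << width_bits)


def errors_per_1e9_bits(err_delta, bits_delta):
    """Normalize an error count to errors per 10^9 bits."""
    if bits_delta <= 0:
        raise ValueError("window carried no bits; the rate is undefined")
    return err_delta * 1e9 / bits_delta


def scan_windows(samples, burst_threshold):
    """Turn consecutive counter reads into windowed rows.

    samples: list of (t_s, error_counter, bit_counter) tuples.
    A row is flagged for a snapshot when its error delta exceeds burst_threshold.
    """
    rows = []
    for (t0, e0, b0), (t1, e1, b1) in zip(samples, samples[1:]):
        d_err = counter_delta(e1, e0, 32)
        d_bits = counter_delta(b1, b0, 64)
        rows.append({
            "t_start": t0,
            "t_end": t1,
            "errors": d_err,
            "bits": d_bits,
            "per_1e9_bits": errors_per_1e9_bits(d_err, d_bits),
            "snapshot": d_err > burst_threshold,
        })
    return rows


# Synthetic reads: the 32-bit error counter wraps between the two samples.
rows = scan_windows([(0.0, 0xFFFFFFFE, 0), (1.0, 3, 10**9)], burst_threshold=4)
```

Computing deltas modulo the counter width keeps a wrap from showing up as a huge negative or positive spike, which would otherwise look like exactly the kind of "random" burst this section is trying to eliminate.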
Troubleshooting: symptom → primary evidence → next isolation step
| Symptom | Primary evidence to check first | Next isolation direction |
|---|---|---|
| Link flap / periodic retrain | retrain window bursts + negotiation count + time tags | If retrain correlates with brownout/reset flags → power/reset; else check event_tag and clock tags |
| CRC bursts under load | PCS errors precede CRC? compare CRC_windowed vs PCS_windowed | If PCS errors lead → EMC/clock/power; if CRC only with no PCS changes → upstream congestion boundary |
| Fails only after warm-up | temp tag slope + unlock events + error burst timing | Thermal drift path (retest in hot soak) → confirm deterministic behavior across reset |
| Event-related “fragility” after ESD/EFT/Surge | event_tag + pre/post counters snapshot + recovery time | If persistent counters drift → EMC path; capture diag snapshot and isolate to clock/power if correlated |
| Cable fault alarms / intermittent opens | cable_diag_result consistency across N runs + time correlation | If diag is consistent → cable direction; if diag flips with events → EMC path or power/reset tagging issue |
A forensic-ready PHY does not “guess”; it preserves a timestamped evidence chain so isolation can be done quickly and reproducibly.
H2-11 · Engineering Checklist (Design → Bring-up → Production)
This checklist turns “industrial-grade” requirements into executable gates: every line item defines a concrete action, the evidence to capture, and a pass criterion (X).
Reference parts (examples, non-exhaustive)
Use these as concrete BOM anchors for checklists and lab scripts. Final selection must match port speed, temperature grade, EMC targets, and board constraints.
Note: protection/magnetics part numbers belong to the Protection & Magnetics sub-pages; keep this page PHY-facing.
Gate 1 — Design (measurability + traceability by construction)
- Action: lock the measurement modes that prove BER (PRBS/loopback).
  Evidence: PRBS pattern + bit count N + confidence CL + window X.
  Pass criteria: BER ≤ X at “baseline cable” and “worst-loss cable”.
- Action: standardize counter names and windowing (PCS errors / retrain / reneg / rate-change).
  Evidence: counters per X seconds (windowed), plus snapshot trigger rules.
  Pass criteria: no ambiguous definitions; windows fixed across all tests.
- Action: define ref-clock quality targets and how they are validated.
  Evidence: XO part number (e.g., Abracon ASFL1-25.000MHZ-EK-T) + measured jitter/offset tags.
  Pass criteria: ref clock jitter ≤ X and frequency offset ≤ X (per spec budget).
- Action: enforce “forensic-ready” trace tags from day one.
  Evidence: config hash, script version, strap/mode tag, EEPROM identity tag (e.g., 24AA02E64).
  Pass criteria: every test record links to unique tags and timestamps.
- Action: power integrity plan tied to PHY counters (avoid “fake SI”).
  Evidence: rail IDs + supervisor flags (e.g., TPS3808) + ripple class.
  Pass criteria: no brownout/reset flags under stress and across temp corners.
- Action: document the “reference PHY class” for this design (speed + industrial grade).
  Evidence: PHY example part chosen (e.g., DP83822I / ADIN1200 / DP83867IR / ADIN1300 / KSZ9131RNX) + operating envelope tags.
  Pass criteria: selected PHY meets temp/EMC/margin/observability targets (X).
Gate 2 — Bring-up (from “link up” to “verifiable baseline”)
- Action: confirm strap latch and management reachability (MDIO).
  Evidence: MDIO readback consistency + strap/mode tag.
  Pass criteria: no intermittent MDIO failures across X resets.
- Action: prove link stability under a fixed baseline condition.
  Evidence: negotiation count, retrain count, link up-time window.
  Pass criteria: retrain = 0 (or ≤ X) over X minutes.
- Action: run loopback (local/remote if available) before any system integration.
  Evidence: PCS errors windowed; CRC windowed (if available).
  Pass criteria: PCS errors ≤ X / window and CRC stable ≤ X / window.
- Action: run PRBS as the primary margin proof.
  Evidence: PRBS pattern, bits N, confidence CL, temperature tag.
  Pass criteria: BER ≤ X at Room and at Hot/Cold corners.
- Action: characterize “event sensitivity” using labeled bursts (ESD/EFT/Surge tagging).
  Evidence: event_tag + pre/post counters snapshot + recovery time.
  Pass criteria: recovery ≤ X and no “persistent fragility” after events.
- Action: if the system uses TSN timestamp hooks, verify timestamp noise floor early.
  Evidence: σ_ts and p99−p50 under a fixed load tag.
  Pass criteria: timestamp noise ≤ X and drift slope ≤ X across temperature steps.
Gate 3 — Production (repeatability + yield + auditability)
- Action: define “minimum production test” (fast, deterministic, scriptable).
  Evidence: link stable window + counters snapshot + device identity tag (e.g., 24AA02E64).
  Pass criteria: stable link + error windows within X.
- Action: enforce a fixed PRBS sampling policy (per-lot or per-unit).
  Evidence: PRBS bits N + pass/fail + config hash.
  Pass criteria: BER ≤ X (policy-defined) with no retest ambiguity.
- Action: track distribution, not just averages (p50/p99/σ).
  Evidence: p99 of retrain bursts and error windows across batch.
  Pass criteria: p99 ≤ X and drift trend slope ≤ X.
- Action: lock firmware/script versions; forbid silent parameter changes.
  Evidence: script version hash recorded in every unit log.
  Pass criteria: 100% traceability; no “unknown config” results.
Stop-Ship (evidence-triggered)
Define hard triggers that must stop shipping/line release. Every trigger must include immediate evidence capture and a retest rule.
- Trigger: retrain burst > X / window or reneg count spikes.
  Immediate action: freeze config hash + capture counters snapshot + preserve failing unit.
  Retest rule: rerun link/loopback/PRBS baseline under the same windows.
- Trigger: PRBS fail or BER > X at any required corner.
  Immediate action: quarantine lot, retain logs (bits N, CL, cable tag).
  Retest rule: repeat N = X runs; require consistent pass before release.
- Trigger: brownout/reset flags correlate with error bursts (power-origin evidence).
  Immediate action: stop shipment, audit rails and supervisor logs (e.g., TPS3808 flags).
  Retest rule: pass stress windows without any brownout/reset flags.
- Trigger: timestamp noise > X (only if TSN hook is a deliverable requirement).
  Immediate action: capture σ_ts / p99−p50 with temperature tag and load tag.
  Retest rule: meet noise and drift thresholds across temperature steps.
Gates prevent “hidden metric drift”: every pass/fail must be backed by fixed windows, trace tags, and counters snapshots.
H2-12 · Applications (Industrial-grade PHY, PHY-facing buckets)
These buckets explain why industrial-grade PHY matters in real machines: each one links system constraints to common pitfalls and the diagnostic hooks that make failures reproducible.
Bucket 1 — PROFINET / EtherNet-IP Field I/O (remote I/O modules)
Key constraints: high-noise cabinets, long runs, mixed grounding, strict “no intermittent” expectation.
Common pitfalls: treating power/reset issues as cable faults; missing windowed counters so bursts look random.
Recommended hooks: PCS errors + retrain windows, event tags (ESD/EFT/Surge), black-box snapshots (time/temp/power).
Bucket 2 — PLC Expansion Modules (modular backplanes / remote heads)
Key constraints: repeated hot/cold cycles, field service swaps, strict traceability across module revisions.
Common pitfalls: drift with temperature not captured; inconsistent strap/config across modules causes “same build, different behavior”.
Recommended hooks: temperature-tagged windows, config hash + identity EEPROM, deterministic bring-up scripts.
Bucket 3 — Industrial Gateways (multi-protocol bridges to Ethernet)
Key constraints: continuous traffic, bursty workloads, frequent firmware updates, “must diagnose remotely”.
Common pitfalls: confusing MAC congestion with PHY-origin errors; missing separation between PCS errors and CRC windows.
Recommended hooks: counter layering (PCS vs CRC/drop), black-box evidence snapshots, consistent event tags during stress.
Bucket 4 — Machine Vision Edge Boxes (high data + harsh EMC)
Key constraints: high throughput, noisy motor environments, thermal gradients, frequent cable handling.
Common pitfalls: marginal BER shows only under sustained load; timestamp/latency hooks drift with temperature.
Recommended hooks: PRBS/BER confidence policy, windowed counters under load tags, temperature-tagged drift logging.
High-interference deployments (PHY-facing view)
- Motor cabinets / long routing: require windowed counters + event tags to avoid “random” narratives.
- Multi-node grounding complexity: enforce trace tags and power/reset evidence to prevent false cable blame.
- Service handling: cable/connector interactions must map to a reproducible evidence snapshot (time/temp/power/counters).
Keep application sections PHY-facing: map system pain to measurable hooks (windows + tags + counters), not to protocol-stack details.
H2-13 · FAQs (Industrial-Grade PHY)
Each FAQ closes a field-debug long tail with evidence-first steps: fixed windows, labeled counters, and pass criteria with measurable thresholds (X).