Loopback / PRBS / BIST for Built-In BER & Field Diagnostics
Loopback / PRBS / BIST turns link debugging from a black box into measurable evidence: BER, error bursts, and lock/align events with reproducible triggers.
It enables fast isolation across bring-up, production test, and field diagnostics by standardizing what to run, what to log, and what “pass” means (time window + confidence + thresholds).
Center Idea: Turn “black-box links” into measurable evidence
PRBS, loopback, and BIST convert link failures from “it sometimes breaks” into measurable BER and reproducible trigger conditions that can be logged, isolated, and acted upon.
Page boundary (strict)
- Covers: loopback modes, PRBS/BER generation & checking, BIST coverage, counters/events/snapshots, production screening flows, and field diagnostic hooks.
- Does NOT cover: protocol compliance details, ESD/TVS selection, or deep EQ algorithm theory (those belong to the related PHY/Protection/Equalization pages).
When this page is the right tool
- Bring-up: isolate whether failures come from TX/RX silicon vs channel vs configuration.
- Production: fast screening + fail binning with time budgets and pass/fail thresholds.
- Field/remote: capture intermittent events via counters, timestamps, and burst snapshots.
When not to use this page as the main reference
- If the goal is protocol certification or compliance test procedures.
- If the primary question is protection parts (TVS/CM choke) or EMI mitigation components.
- If the intent is to learn equalization theory (CTLE/DFE) beyond using presets as diagnostics.
Note: “Margin” can be a proxy score (placeholder) derived from sampler statistics, error-rate vs preset sweeps, or structured stress tests—kept page-local to diagnostics without expanding into full EQ theory.
Taxonomy: Choose the right tool in the first minute
The fastest debug path starts with the correct diagnostic primitive. Loopback isolates where a failure lives, PRBS/BER quantifies margin with confidence, and BIST turns that into repeatable coverage for production and field use.
Loopback
Path validation
Inputs
Loop point (near/far/digital/analog), lane mask, polarity options, preset lock rules (placeholder).
Observables
Lock/unlock events, alignment loss counters, CRC/error counters (if available), time-to-lock.
Typical time
Fast isolation step: T = [X] s (screen), then extend only if unstable.
Common pitfall
A passing loopback does not prove the external channel is healthy (it may bypass it).
Pass criteria
No lock-loss events and no error flags over T = [X] s.
PRBS / BER
Margin measurement
Inputs
Pattern (PRBS7/15/23/31), duration, lock criteria, inversion/slip handling, lane mapping.
Observables
bit_count, error_count, burst snapshots, error-rate vs preset sweeps (optional), lock stability events.
Typical time
Two windows: screen (T=[X] s) and confidence (T=[Y] s) based on target BER and bitrate.
Common pitfall
“Zero errors” over a short window does not justify a low BER claim without a confidence window.
Pass criteria
BER upper bound < [X] at confidence [Y]%, or error_count ≤ [X] over bit_count ≥ [Y].
BIST
Coverage + test cadence
Inputs
Test suite selection, coverage profile (TX/RX/CDR/deskew/FIFO), time budget, retry policy.
Observables
Coverage flags, per-lane pass/fail, bin codes, event logs, and snapshot-on-fail hooks.
Typical time
Production-friendly: short “must-pass” subset (T=[X] s), then deeper diagnostics only on failures.
Common pitfall
“BIST pass but system fails” often indicates a coverage gap or missing stress condition—not a contradiction.
Pass criteria
Coverage = [X]% (required items met) and all critical bins = PASS, with failure snapshots captured when any bin trips.
Recommended combos (fast → deep)
- Isolation first: Loopback to locate the failing segment → PRBS/BER to quantify margin and confidence.
- Production: PRBS screen window → BIST deep-dive on fails with bin codes and snapshots.
- Field: Counters + trigger snapshots to catch intermittent bursts → targeted loopback to narrow scope.
Use this mapping as the first-minute selector: isolate (loopback), quantify (PRBS/BER), then systematize for cadence and coverage (BIST).
Loopback modes deep dive: what it proves vs what it bypasses
A loopback is only useful when its loop point is explicit. The same “PASS” can mean “TX/RX logic is fine” or “the external channel was never tested” depending on what was bypassed.
Loopback rule of thumb (minimum claim)
A passing loopback only proves the blocks inside the loop. Anything outside the loop is not validated and must not be assumed healthy.
Near-end digital loopback (PCS / Deserializer / Elastic buffer)
Where it loops
Inside digital datapath: PCS loopback, post-deserializer, or elastic buffer return.
What it bypasses
External channel and most analog front-end stress; may bypass CDR/EQ depending on implementation.
What it proves
- Lane mapping, polarity configuration (inside the device), and digital path integrity.
- PCS/gearbox/elastic-buffer corner cases that mimic “link instability.”
What it cannot prove
- Channel loss/return loss, connector intermittency, or external crosstalk.
- Analog margin under temperature/power noise stress if the loop bypasses AFE stress.
Pass criteria (placeholder)
No error flags / CRC errors and no alignment-loss events over T = [X] s.
Near-end analog loopback (PMA / AFE)
Where it loops
Inside PMA/AFE: loop point near the sampler or analog return path (implementation-specific).
What it bypasses
External channel; may still exercise parts of the CDR/sampler chain, but does not include real cable/backplane ISI.
What it proves
- Receiver analog chain stability (gross issues), sampler and internal recovery behavior.
- Sensitivity to on-die supply noise/temperature when stress is injected (if supported).
What it cannot prove
- Channel-dependent reflections, connector micro-motion failures, or far-end topology issues.
- Equalization settings that are only relevant under real channel loss and crosstalk.
Pass criteria (placeholder)
Stable lock with error_count ≤ [X] over bit_count ≥ [Y].
Far-end loopback (remote turn-around)
Where it loops
At the far end (remote device) which re-transmits received data/pattern back to the near end.
What it bypasses
Typically does not bypass the channel; validates a larger portion of the end-to-end path.
What it proves
- A significant portion of the real channel and both endpoints can sustain traffic.
- Many “long cable only” failures appear here even if near-end loops pass.
What it cannot prove
- Which specific segment failed (TX silicon vs channel vs RX silicon) without additional loop points.
- Direction-specific issues unless both directions are tested independently.
Pass criteria (placeholder)
BER upper bound < [X] over T = [Y] s, with lock_loss_cnt = 0.
PCS/PMA boundary loops (purpose-built isolation points)
Where it loops
At the logical boundary between PCS and PMA, often via a muxed test path.
What it bypasses
Can bypass either analog or digital side selectively—ideal for proving whether failures are “logic-side” or “analog-side.”
What it proves
- Which side of the boundary is unstable under the same pattern and time window.
- Whether errors correlate with lock events (timing) or with datapath events (mapping/deskew).
What it cannot prove
- End-to-end channel health unless the loop includes the external path.
- Protocol-specific corner cases beyond diagnostics primitives.
Pass criteria (placeholder)
No error bursts above [X] within T = [Y] s, and event counters remain stable (no align/lock flaps).
The isolation power comes from selecting loop points that separate “digital mapping/deskew issues” from “analog timing margin” and from “channel-dependent failures.”
PRBS fundamentals: make it lock, then make it meaningful
PRBS testing is practical when the setup is deterministic: the generator and checker must match, the alignment must be stable, and the evidence must be time-windowed for confidence.
PRBS7
- Best for: quick screen / plumbing checks.
- Risk: may miss long-memory edge cases.
- Window: T=[X] s (screen).
- Must match: poly/seed/invert/bit-order.
PRBS15
- Best for: general bring-up and regression.
- Risk: still not worst-case for some channels.
- Window: T=[X] s (screen) + T=[Y] s (confirm).
- Must match: lane map + alignment.
PRBS23
- Best for: stronger stress / margin probing.
- Risk: longer time needed for confidence.
- Window: T=[Y] s (confirm).
- Must match: checker lock policy.
PRBS31 / Stress
- Best for: worst-case confidence tests.
- Risk: false conclusions if time window is too short.
- Window: computed from target BER and bitrate.
- Must match: scrambler mode (avoid conflicts).
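Deterministic matching starts at the generator itself. The sketch below assumes the standard ITU-T O.150 trinomials for these patterns (PRBS7 = x^7+x^6+1, PRBS15 = x^15+x^14+1, PRBS23 = x^23+x^18+1, PRBS31 = x^31+x^28+1); the actual polynomial, seed, and inversion for a given device must come from its documentation.

```python
# Minimal Fibonacci LFSR for the ITU-T O.150 PRBS trinomials. The checker
# must run the same polynomial, seed, inversion, and bit order as the
# generator, or it will never lock.
TAPS = {7: 6, 15: 14, 23: 18, 31: 28}  # second tap; the first tap is the order

def prbs_gen(order, n_bits, seed=1, invert=0):
    tap2 = TAPS[order]
    mask = (1 << order) - 1
    state = seed & mask
    assert state, "an all-zero seed locks the LFSR at zero"
    out = []
    for _ in range(n_bits):
        msb = (state >> (order - 1)) & 1        # output bit (tap 1 = MSB)
        fb = msb ^ ((state >> (tap2 - 1)) & 1)  # feedback = tap1 XOR tap2
        state = ((state << 1) | fb) & mask      # shift left, inject feedback
        out.append(msb ^ invert)
    return out

# A maximal-length PRBS7 repeats every 2^7 - 1 = 127 bits and contains 64 ones.
seq = prbs_gen(7, 254)
```

A polarity mismatch shows up here exactly as in the pitfall list: `prbs_gen(7, 127, invert=1)` is the bitwise complement of the un-inverted stream, so a checker expecting the other polarity sees errors on every bit.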
Pattern vs test time (confidence template)
- Target: prove BER < [BER_target] with confidence [CL]%.
- Zero-error window requires at least N_bits = -ln(1 - CL) / BER_target, with CL expressed as a fraction (e.g. 0.95).
- Convert to time by T = N_bits / DataRate. Use a short screen window first, then a computed confidence window.
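The template above computes directly; `zero_error_window` is an illustrative helper name, not a device or library API.

```python
import math

def zero_error_window(ber_target, cl, data_rate_bps):
    """Bits and seconds a zero-error run must cover to claim
    'BER < ber_target at confidence cl' (cl as a fraction, e.g. 0.95)."""
    n_bits = -math.log(1.0 - cl) / ber_target
    return n_bits, n_bits / data_rate_bps

# At 10 Gb/s, proving BER < 1e-12 at 95% confidence needs ~3e12 bits (~300 s):
n_bits, t_sec = zero_error_window(ber_target=1e-12, cl=0.95, data_rate_bps=10e9)
```

The 95% case reduces to the familiar rule of thumb: observe roughly 3/BER_target error-free bits before claiming the target.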
Pitfall: polarity inversion
- Symptom: checker never locks or reports constant errors.
- Quick check: toggle invert option or swap P/N mapping (controlled test).
- Fix: align polarity configuration across generator and checker.
- Pass: lock stable for T=[X] s.
Pitfall: bit slip / word alignment
- Symptom: bursty errors, periodic error clusters, or intermittent lock.
- Quick check: monitor align-loss counters and snapshot around bursts.
- Fix: enforce deterministic alignment rules and re-lock policy.
- Pass: align_loss_cnt = 0 over T=[X] s.
Pitfall: lane mapping / deskew
- Symptom: one lane fails consistently or all lanes fail identically.
- Quick check: per-lane error counters + lane-swap A/B test.
- Fix: correct lane order, deskew window, and polarity per lane.
- Pass: error_count ≤ [X] per lane.
Pitfall: scrambler conflict
- Symptom: PRBS appears random to the checker even with strong signal.
- Quick check: confirm whether the link layer scrambles payload on the test path.
- Fix: route PRBS in a non-scrambled test mode or align scrambler settings.
- Pass: checker lock stable and errors match expected stress level.
Pitfall: checker lock policy
- Symptom: “false lock” (looks locked, errors explode) or “never locks.”
- Quick check: log lock/unlock events and compare with error bursts.
- Fix: tighten lock threshold or require stable align window before lock.
- Pass: lock_loss_cnt = 0 and burst_count ≤ [X].
Pitfall: bit order / lane order mismatch
- Symptom: consistent errors that do not change with channel/preset tweaks.
- Quick check: swap MSB/LSB handling or reorder lanes in the checker.
- Fix: align serializer ordering and checker expectation end-to-end.
- Pass: error_count collapses to expected floor under known-good setup.
Practical PRBS starts with deterministic matching (poly/seed/invert, bit order, lane map) and ends with evidence that is windowed for confidence (bit_count and error_count with stable lock).
BER math that matters: confidence, window, and “zero errors”
A short zero-error run does not prove an ultra-low BER. Evidence must be tied to an observation window, bit count, and a stated confidence level. The output should be a defendable upper bound (or a bounded estimate) rather than a slogan.
Copyable test-time calculation steps (template)
- Set targets: BER_target=[ ], CL=[ ]%, DataRate R=[ ] bps, lanes=[ ], direction=[ ].
- Decide evidence type: zero-error upper bound (err_cnt=0) or bounded estimate (err_cnt>0 with stability checks).
- If err_cnt=0, compute required bits: N_bits = -ln(1 - CL) / BER_target, with CL as a fraction (e.g. 0.95).
- Convert to time: T = N_bits / R. Use a short screen window first, then run the computed confidence window.
- Add stability bins: choose Δt=[ ], split into K bins, log per-bin (bit_cnt_i, err_cnt_i, lock/align events_i).
- Output the claim: BER upper bound ≤ [ ] @ CL=[ ] with T=[ ], plus burst flag and event correlation.
Example: “zero errors” but the window is too short
Symptom
A short run shows err_cnt=0 and gets labeled “PASS” without stating the implied BER upper bound.
Quick check
Compute N_bits from CL and BER_target. Compare the required T against the actual observation window.
Fix
Replace the slogan with a defendable claim: upper bound ≤ [X] at CL=[Y]% over bit_cnt=[N].
Pass criteria (placeholder)
err_cnt=0 for T ≥ [computed] and lock_loss_cnt=0.
Example: errors exist — average BER is not enough
Symptom
A test produces some errors; the result is reported as a single average number without stability or event context.
Quick check
Split into K bins (Δt=[ ]) and compare err_cnt_i across bins. Check whether errors align with lock/align events.
Fix
Report (BER_est, a bounded interval) and a stability verdict instead of a single average number.
Pass criteria (placeholder)
BER_est ≤ [X] with stable bins (no dominant burst bin) and lock_loss_cnt=0.
Example: burst errors vs random errors
Symptom
The same average BER appears, but failures in the field are triggered by clustered error bursts.
Quick check
- Bin the window (Δt=[ ]) and look for a small number of bins dominating errors.
- Align bursts to lock/align events and to temp/vdd thresholds (placeholders).
Fix
Add burst evidence: max_err_in_bin, burst_cnt, and snapshots around triggers.
Pass criteria (placeholder)
No dominant burst bins: max_err_in_bin ≤ [X] and burst_cnt ≤ [Y] over T=[ ].
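The binning check above can be sketched as follows; the threshold argument stands in for the `[X]` placeholder, and the field names are illustrative, not a device schema.

```python
def burst_verdict(err_per_bin, max_err_limit):
    """Flag clustered errors that a single average-BER number hides."""
    total = sum(err_per_bin)
    max_err = max(err_per_bin)
    burst_cnt = sum(1 for e in err_per_bin if e > max_err_limit)
    dominant = total > 0 and max_err * 2 > total  # one bin holds most errors
    return {"total": total, "max_err_in_bin": max_err,
            "burst_cnt": burst_cnt, "bursty": burst_cnt > 0 or dominant}

# Same total error count, very different risk profile:
uniform = burst_verdict([1] * 10, max_err_limit=5)       # spread across bins
bursty = burst_verdict([0] * 9 + [10], max_err_limit=5)  # one dominant bin
```

Both records yield the same average BER over the window; only the per-bin view separates the field-risky burst signature from benign random errors.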
A valid claim must include the observation window (T), the bit count, and the stated confidence (CL). Bin-level evidence highlights burst risk.
Instrumentation & hooks: counters, timestamps, snapshots, and freeze
Diagnostics become actionable when counters are timestamped and when burst moments can be captured with a freeze-and-snapshot mechanism. The goal is a reproducible evidence chain across bring-up, production, and field telemetry.
Bring-up checklist (high resolution, fast isolation)
Must-have observables
err_cnt, bit_cnt, lock_loss_cnt, align_loss_cnt, cdr_unlock_events, temperature, vdd ripple (placeholders).
Granularity (placeholder)
per-lane + per-port, sampled every [Δt]. Keep a short rolling window for correlation to lock/align events.
Freeze & snapshot (trigger X)
Trigger when err_cnt in Δt ≥ [X] or on lock/align loss. Freeze key state and push a snapshot to FIFO with timestamp.
Production checklist (throughput + traceability)
Must-have observables
err_cnt, bit_cnt, lock_loss_cnt, align_loss_cnt, cdr_unlock_events, plus unit identifiers (SN/port) in the host log.
Granularity (placeholder)
per-port summary with optional per-lane drill-down on failures. Log every test run with T=[ ], bit_cnt=[ ], and CL=[ ].
Freeze & snapshot (trigger X)
Keep snapshots for only failing units: trigger by burst threshold [X] or by lock_loss_cnt>0. Store snapshot metadata with station ID and time.
Field checklist (remote telemetry, low overhead)
Must-have observables
err_cnt, bit_cnt, lock_loss_cnt, align_loss_cnt, event timestamps, temperature, vdd ripple (placeholders).
Granularity (placeholder)
per-port + per-minute by default; auto-escalate to per-second logging for [T_boost] seconds when a trigger fires.
Freeze & snapshot (trigger X)
Use triggers (burst / lock / align) to capture compact snapshots. Upload snapshot headers first; pull full payload only on demand.
Counters without timestamps are statistics; timestamps plus freeze-and-snapshot turn failures into replayable evidence for root-cause isolation.
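The freeze-and-snapshot mechanics described above can be sketched as a rolling pre-trigger window plus a fixed post-trigger capture; the class name, field names, and thresholds here are illustrative, not a device interface.

```python
from collections import deque

class SnapshotRecorder:
    """Keep the last n_pre per-bin records; on a trigger, freeze them and
    append n_post more bins, then publish the pack as one snapshot."""
    def __init__(self, n_pre, n_post, err_threshold):
        self.n_post, self.err_threshold = n_post, err_threshold
        self._ring = deque(maxlen=n_pre)       # rolling pre-trigger window
        self._frozen, self._post_left = [], 0
        self.snapshots = []

    def push_bin(self, ts, err_cnt, lock_loss):
        rec = {"ts": ts, "err_cnt": err_cnt, "lock_loss": lock_loss}
        if self._post_left:                    # still filling the post-window
            self._frozen.append(rec)
            self._post_left -= 1
            if self._post_left == 0:
                self.snapshots.append(self._frozen)
                self._frozen = []
            return
        self._ring.append(rec)
        if err_cnt >= self.err_threshold or lock_loss:  # trigger condition
            self._frozen = list(self._ring)             # freeze pre-window
            self._post_left = self.n_post

rec = SnapshotRecorder(n_pre=3, n_post=2, err_threshold=5)
for ts in range(3):
    rec.push_bin(ts, 0, False)   # quiet baseline bins
rec.push_bin(3, 7, False)        # burst bin fires the trigger
rec.push_bin(4, 1, False)
rec.push_bin(5, 0, False)        # post-window complete; snapshot published
```

The published snapshot holds bins 1–5: the trigger bin, its pre-trigger context, and the post-trigger tail, which is exactly the replayable evidence the checklists above ask for.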
BIST architecture: on-chip BERT, routing matrix, and coverage
BIST is not “run once and done”. A meaningful BIST plan is a coverage matrix: each coverage item must name what is exercised, how it is tested, what is observable, and what constitutes a pass.
Scope guard
BIST coverage depends on loopback points and bypass routing. A “PASS” only means the covered path is healthy under the applied window. External channel effects, environmental triggers, and system-level interactions may remain outside the BIST matrix.
Group A — Data integrity (path exercised)
Coverage item
TX datapath (lane-by-lane)
Test method
On-chip PRBS generator + internal loop route (placeholder)
Observable
bit_cnt, err_cnt, lane_status
Pass criteria (placeholder)
err_cnt=0 over T=[ ] and lane_status=OK
Common miss
Bypassed blocks hide weak points; verify which sub-blocks are included by routing.
Coverage item
RX datapath (lane-by-lane)
Test method
On-chip PRBS checker fed by routed PRBS stream (placeholder)
Observable
err_cnt, lock events, align events
Pass criteria (placeholder)
err_cnt ≤ [X] and lock_loss_cnt=0
Common miss
Short windows may miss burst triggers; require bins/snapshots for fails.
Coverage item
MAC/PCS bypass sanity (routing correctness)
Test method
BIST mux route check + signature/CRC-like check (placeholder)
Observable
route_status, signature_ok, event flags
Pass criteria (placeholder)
route_status=OK and signature_ok=1
Common miss
Wrong loopback routing can create “self-consistent” passes that do not exercise the intended path.
Group B — Synchronization (lock, deskew, polarity)
Coverage item
CDR lock stability
Test method
PRBS run with lock event monitoring (placeholder)
Observable
cdr_unlock_events, lock_loss_cnt, timestamped events
Pass criteria (placeholder)
cdr_unlock_events=0 over T=[ ]
Common miss
A short pass can hide rare unlock events; use binning/snapshots for fails.
Coverage item
Lane deskew / alignment
Test method
Multi-lane PRBS with alignment monitor (placeholder)
Observable
align_loss_cnt, deskew_fail flag, lane map status
Pass criteria (placeholder)
align_loss_cnt=0 and deskew_fail=0
Common miss
Deskew issues can appear only under stress conditions; keep event timestamps and retry policy.
Coverage item
Polarity / inversion handling
Test method
PRBS with controlled inversion toggle (placeholder)
Observable
lock status, err_cnt jump, inversion_detect flag
Pass criteria (placeholder)
inversion_detect=OK and err_cnt stable (≤ [X]) after switch
Common miss
Wrong inversion state can mimic burst errors; classify with a dedicated polarity check.
Group C — Elasticity & robustness (FIFO, routing, health flags)
Coverage item
Elastic buffer / FIFO integrity
Test method
March test / read-write stress (placeholder)
Observable
fifo_ovf, fifo_udf, parity/ecc flag (placeholder)
Pass criteria (placeholder)
fifo_ovf=0 and fifo_udf=0 over T=[ ]
Common miss
FIFO issues may look like BER bursts; separate with dedicated FIFO flags and snapshots.
Coverage item
Loopback routing sanity
Test method
Mux matrix self-check + route lock (placeholder)
Observable
route_status, illegal_route flag, event log
Pass criteria (placeholder)
illegal_route=0 and route_status=OK
Common miss
Misrouted loopbacks can produce false confidence; require explicit route status evidence.
The routing matrix defines what is truly exercised. Coverage must be stated explicitly as “item → method → observable → pass criteria”.
Production test flow: fast screen → deep dive classification
A two-stage flow protects takt time: a short, strict screen catches obvious faults; only failing units enter deeper isolation, longer PRBS windows, and parameter sweeps to produce actionable bin codes.
Two-stage gating
- Fast screen: short time, strict rules, capture gross issues.
- Deep dive: only for fails; isolate and classify with longer windows and sweeps.
Strong screen rules (template)
- err_cnt=0 is not sufficient. Require lock_loss_cnt=0 and align_loss_cnt=0.
- Burst guard: max_err_in_bin ≤ [X] (placeholder).
- Retry policy: retry=[N], cooldown=[Δt], optional port/cable swap (placeholders).
Fast screen steps (time-first)
Step
Short PRBS screen (t=[ ])
Time budget (placeholder)
t_screen = [t1]
Fail bin (placeholder)
BIN_LOCK / BIN_ALIGN / BIN_BER_BURST / BIN_BER_RAND
Retry strategy (placeholder)
retry=[N], cooldown=[Δt], optional port swap=[yes/no]
Screen pass criteria (placeholder)
err_cnt=0 and lock_loss_cnt=0 and align_loss_cnt=0 and max_err_in_bin ≤ [X].
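The screen rules reduce to one gating function; the bin names follow the fail-bin catalog on this page, and the burst limit stands in for the `[X]` placeholder.

```python
def screen_gate(run, max_err_in_bin_limit):
    """Strict screen: err_cnt=0 is necessary but not sufficient — lock and
    align events and the burst guard must also be clean."""
    if run["lock_loss_cnt"] > 0:
        return "BIN_LOCK"
    if run["align_loss_cnt"] > 0:
        return "BIN_ALIGN"
    if max(run["err_per_bin"]) > max_err_in_bin_limit:
        return "BIN_BER_BURST"
    if sum(run["err_per_bin"]) > 0:
        return "BIN_BER_RAND"
    return "PASS"

clean = {"lock_loss_cnt": 0, "align_loss_cnt": 0, "err_per_bin": [0, 0, 0]}
flaky = {"lock_loss_cnt": 2, "align_loss_cnt": 0, "err_per_bin": [0, 0, 0]}
burst = {"lock_loss_cnt": 0, "align_loss_cnt": 0, "err_per_bin": [0, 9, 0]}
```

Ordering matters: lock and align failures are classified before error counts so that a lock-flap unit is binned as BIN_LOCK rather than misread as a BER problem.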
Deep dive steps (fail-only isolation & classification)
Step
- Loopback isolate (digital first) — time=[t2], bin=[BIN_ROUTE]
- Long PRBS window — time=[t3], output=[bounded evidence]
- Preset/parameter sweep — time=[t4], output=[best/worst bins]
- Event correlation + snapshot — time=[t5], output=[actionable record]
Fail bin catalog (examples)
BIN_LOCK (unlock events) · BIN_ALIGN (deskew/alignment) · BIN_BER_RAND (distributed errors) · BIN_BER_BURST (dominant bins) · BIN_FIFO (ovf/udf flags) · BIN_FIXTURE (port/cable sensitivity).
Deep dive pass criteria (placeholder)
Classification complete with stored evidence: (time budget met) and (bin code assigned) and (snapshot available for review).
Fast screen protects takt time; deep dive produces actionable bins and stored evidence for traceability and repair decisions.
Field diagnostics: turn “intermittent” into reproducible evidence
Field diagnostics succeeds when failures are converted into structured evidence: a first-log set, a triggerable snapshot, and a minimal reproduction recipe. The goal is to narrow the suspect space remotely and reduce blind part swapping.
First fields to log (priority set)
- Counters: err_cnt, bit_cnt, lock_loss_cnt, align_loss_cnt, cdr_unlock_events
- Events (timestamped): lock_event_ts[], align_event_ts[], link_reset_ts[] [placeholder]
- Environment: temp, vdd, vdd_ripple [placeholder]
- Config: data_rate, preset_id, loopback_mode, polarity_state, lane_map [placeholder]
- Burst metrics: max_err_in_bin, burst_cnt, bin_size(Δt) [placeholder]
Scenario: intermittent link drop
Likely cause bucket (classification only)
Bucket-CLKSYNC · Bucket-RXTX · Bucket-ENV (event-trigger) · Bucket-ROUTE
What to log first
cdr_unlock_events + lock_event_ts[] + align_event_ts[] + preset_id + data_rate
Trigger & snapshot (placeholders)
Trigger on lock_loss_cnt ≥ [X] or cdr_unlock_events ≥ [X]. Snapshot: pre=[Npre] bins, post=[Npost] bins, last K events (K=[ ]).
Minimal reproduction recipe
- Fix config: data_rate=[ ], preset_id=[ ], loopback_mode=[off].
- Run PRBS with bin_size Δt=[ ] for T=[ ].
- Arm trigger: cdr_unlock_events ≥ [X] OR lock_loss_cnt ≥ [X].
- On first trigger, export the snapshot pack (events + counters + config).
- Pass criteria after action: cdr_unlock_events=0 over T=[ ] (placeholder).
Scenario: fails after temperature drift
Likely cause bucket (classification only)
Bucket-ENV (temp) · Bucket-RXTX · Bucket-CLKSYNC (event symptom)
What to log first
temp (sample period=[ ]) + err_cnt in Δt + lock/align event timestamps + preset_id
Trigger & snapshot (placeholders)
Trigger when temp crosses [T_high/T_low] AND err_cnt in Δt ≥ [X]. Snapshot includes Δtemp over window and the first error bin index.
Minimal reproduction recipe
- Fix config and start PRBS logging with Δt=[ ].
- Apply a controlled temperature step (up/down) to cross [T].
- Arm combined trigger: temp cross + err_cnt threshold.
- Export the first-trigger pack and repeat N=[ ] times for consistency.
- Pass criteria after mitigation: zero triggers over T=[ ] (placeholder).
Scenario: passes short cable, fails long cable
Likely cause bucket (classification only)
Bucket-CH · Bucket-FIXTURE · Bucket-ENV (trigger) · Bucket-RXTX (if A/B insensitive)
What to log first
cable_id/length_bin=[ ] + max_err_in_bin + burst_cnt + preset_id + lane_map
Trigger & snapshot (placeholders)
Trigger on max_err_in_bin ≥ [X] OR burst_cnt ≥ [X]. Snapshot must include A/B identity fields (short vs long).
Minimal reproduction recipe
- Keep config fixed: data_rate=[ ], preset_id=[ ], loopback=[off].
- Run with short cable for T=[ ] and record baseline bins.
- Swap to long cable (only one variable) and re-run for T=[ ].
- Export both evidence packs; compare burst signatures and event timing.
- Pass criteria: long cable meets screen rules or recommended action applied (placeholder).
Scenario: errors only under load
Likely cause bucket (classification only)
Bucket-ENV (power/load) · Bucket-RXTX · Bucket-CLKSYNC (event symptom)
What to log first
load_state=[ ] + vdd_ripple + err_cnt in Δt + lock/align events
Trigger & snapshot (placeholders)
Trigger when vdd_ripple ≥ [Vripple] AND err_cnt in Δt ≥ [X]. Snapshot includes ripple peak, load_state transitions, and event timestamps.
Minimal reproduction recipe
- Establish idle baseline (load_state=idle) for T=[ ].
- Run a scripted load step idle→active (repeat N=[ ] cycles).
- Arm combined trigger: ripple threshold + error threshold.
- Export first-trigger pack and compare cycle-to-cycle alignment of triggers.
- Pass criteria: no triggers across N=[ ] cycles after action (placeholder).
A closed loop requires triggerable snapshots and exported evidence packs; actions must be verified with the same counters/events.
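An exportable evidence pack built from the first-log set above can be sketched as follows; the schema and field names are illustrative, not a real telemetry format.

```python
import json

def export_evidence_pack(counters, events, config, snapshot_bins):
    """Bundle counters, timestamped events, frozen config, and the trigger
    snapshot into one exportable JSON record."""
    required = ("err_cnt", "bit_cnt", "lock_loss_cnt",
                "align_loss_cnt", "cdr_unlock_events")
    pack = {
        # KeyError here means a mandatory counter was never logged
        "counters": {k: counters[k] for k in required},
        "events": events,           # timestamped lock/align/reset events
        "config": config,           # data_rate, preset_id, loopback_mode, lane_map
        "snapshot": snapshot_bins,  # pre/post bins around the trigger
    }
    return json.dumps(pack, sort_keys=True)

counters = {"err_cnt": 3, "bit_cnt": 10**12, "lock_loss_cnt": 0,
            "align_loss_cnt": 0, "cdr_unlock_events": 1}
events = [{"ts": 12.5, "type": "cdr_unlock"}]
config = {"data_rate": 10e9, "preset_id": 2,
          "loopback_mode": "off", "lane_map": [0, 1]}
pack_json = export_evidence_pack(counters, events, config, [])
```

Failing loudly on a missing mandatory counter is deliberate: a pack with silent gaps cannot support the scenario-closing pass criteria above.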
Isolation strategy: 4-step decision tree to narrow suspect buckets
The goal is not to “measure everything”. The goal is to decide the first next action. Use loopback and PRBS windows to quickly separate channel/fixture issues from transceiver faults and event-driven instability.
Step 1 — Near-end loopback (peel off external variables first)
- Action: enable near-end loopback, run PRBS for T=[ ].
- Observe: err_cnt, lock_loss_cnt, align_loss_cnt, route_status.
- Branch: PASS → suspect Bucket-CH / Bucket-FIXTURE / Bucket-ENV. FAIL → suspect Bucket-RXTX / Bucket-CLKSYNC / Bucket-ROUTE.
- Stop condition: route_status != OK → classify as BIN_ROUTE (placeholder).
Step 2 — Far-end loopback (expand coverage outward)
- Action: enable far-end loopback, run PRBS for T=[ ].
- Observe: err_cnt, max_err_in_bin, lock/align events.
- Branch: Step1 PASS + Step2 FAIL → suspect Bucket-CH / outer-path issue. Step1 FAIL + Step2 FAIL → suspect Bucket-RXTX / Bucket-CLKSYNC.
Step 3 — Known-good A/B (port/cable/peer comparison)
- Action: swap to known-good cable/port/peer (only one variable at a time).
- Observe: A/B sensitivity of burst_cnt and event rates.
- Branch: strong A/B delta → Bucket-FIXTURE / Bucket-CH. weak A/B delta → continue to Step 4.
Step 4 — Environmental perturbation (make triggers repeatable)
- Action: apply temp step or load step and re-run the same PRBS window.
- Observe: temp/vdd_ripple aligned with error bursts and lock/align events.
- Branch: threshold-correlated failures → Bucket-ENV. no correlation but still failing → Bucket-RXTX / Bucket-CLKSYNC.
The decision tree is designed to choose the next action in 3–5 steps and output a suspect bucket, not a full root-cause theory.
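The four steps above can be encoded so each call returns either the next action to run or a suspect bucket; the observation keys and simplified branches here are an assumed sketch, not this page's full branch table.

```python
def next_action(obs):
    """Walk the 4-step tree: missing observations name the next test,
    completed ones narrow to a suspect bucket."""
    if obs.get("route_status") == "FAIL":
        return "BIN_ROUTE"                       # stop condition from Step 1
    if "near_loop" not in obs:
        return "RUN near-end loopback"
    if obs["near_loop"] == "FAIL":
        return "SUSPECT Bucket-RXTX / Bucket-CLKSYNC"
    if "far_loop" not in obs:
        return "RUN far-end loopback"
    if obs["far_loop"] == "FAIL":
        return "SUSPECT Bucket-CH (outer path)"  # Step1 PASS + Step2 FAIL
    if "ab_delta" not in obs:
        return "RUN known-good A/B swap"
    if obs["ab_delta"] == "STRONG":
        return "SUSPECT Bucket-FIXTURE / Bucket-CH"
    if "env_correlated" not in obs:
        return "RUN environmental perturbation"
    return ("SUSPECT Bucket-ENV" if obs["env_correlated"]
            else "SUSPECT Bucket-RXTX / Bucket-CLKSYNC")
```

Driving the tree as a function keeps the "choose the next action, not a full theory" discipline: the caller only ever runs the single test the tree asks for.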
Selection metrics for parts that support diagnostic hooks
This section scores only diagnostic capability: observability (counters/events/snapshots), test flexibility (patterns/per-lane/loopback points), automation readiness (APIs/telemetry integration), determinism (repeatability after reset), and safety (online diagnostics without breaking normal traffic). It does not compare protocol features.
Copy-ready scoring sheet (1–5) + what to verify
- Hook richness score: [1–5] (counters + events + snapshot + freeze)
- Test flexibility score: [1–5] (pattern set + per-lane + loopback points)
- Automation score: [1–5] (read/config/clear/export + stable schema)
- Determinism score: [1–5] (reset repeatability + consistent event sequence)
- Safety score: [1–5] (bounded overhead + non-disruptive monitoring)
- Evidence pack must include: err_cnt, bit_cnt, lock/align events (timestamped), snapshot window (Npre/Npost), config freeze fields.
1) Hook richness (counters · events · snapshot)
What it measures
Whether the device can convert failures into exportable evidence: per-lane counters, timestamped events, and triggerable snapshots with freeze/read semantics.
Score rubric (1 / 3 / 5)
- 1: coarse counters only; no timestamps; no snapshot/freeze.
- 3: counters + events with timestamps; limited snapshot or limited freeze/read behavior.
- 5: per-lane counters + timestamped events + configurable trigger + snapshot with pre/post windows and config freeze fields.
How to verify (copy steps)
- Clear counters; run PRBS window T=[T] with bin_size Δt=[Δt].
- Arm trigger: err_cnt in Δt ≥ [X] OR lock_loss_cnt ≥ [X].
- On trigger, read a frozen snapshot pack: last K events (K=[ ]) + counters + config freeze fields.
- Pass criteria: snapshot includes pre=[Npre] / post=[Npost] bins and timestamps are monotonic.
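The pass criteria above (expected pre/post bin counts, monotonic timestamps) reduce to a small check; the pack layout (a `bins` list of records with a `ts` field) is an assumed example format, not a vendor schema.

```python
def validate_snapshot(pack, n_pre, n_post):
    """Accept a snapshot only if it carries the full pre+post window and
    strictly increasing timestamps."""
    bins = pack["bins"]
    ts = [b["ts"] for b in bins]
    return (len(bins) == n_pre + n_post
            and all(a < b for a, b in zip(ts, ts[1:])))

good = {"bins": [{"ts": t} for t in (10, 11, 12, 13)]}
bad = {"bins": [{"ts": t} for t in (10, 12, 11, 13)]}  # non-monotonic
```

A non-monotonic timestamp sequence is exactly the "readout disturbs counting" symptom listed under common pitfalls, so it fails the check rather than being silently accepted.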
Common pitfalls
- Counters readable but not freezable; readout disturbs counting (inconsistent evidence).
- Events exist but lack timestamps or per-lane attribution (cannot align with environment logs).
- Snapshot misses configuration freeze fields (cannot reproduce the exact state).
Example material numbers (verify package/suffix)
- TI DS280DF810 — PRBS generator/checker + in-system diagnostics hooks.
- TI DS250DF410 — PRBS generator/checker + eye monitor class hooks.
- TI DS125DF1610 — standalone BERT via built-in PRBS generator/checker + mission-mode monitor.
- Renesas HXC44400 — PRBS generator/checker + BIST functions (module diagnostics).
2) Test flexibility (pattern · per-lane · loopback points)
What it measures
How quickly the test can isolate the failure segment: multiple PRBS patterns, per-lane independent control, and multiple loopback insertion points.
Score rubric (1 / 3 / 5)
- 1: one fixed pattern; no per-lane isolation; one loopback mode only.
- 3: several patterns + per-lane enable; limited loopback points.
- 5: broad pattern set + per-lane independent generator/checker + multiple loopback points usable for isolation.
How to verify (copy steps)
- Enable generator/checker per lane: Lane-A ON, Lane-B OFF; confirm only Lane-A counters change.
- Switch pattern set PRBS-[7/15/23/31]; confirm lock state and counters are readable for each pattern.
- Switch loopback point among supported modes; confirm which event types disappear/appear (same stimulus, different coverage).
- Pass criteria: per-lane isolation holds and mode switching yields consistent, explainable evidence changes.
Example material numbers (verify package/suffix)
- TI DS125DF1610 — multi-lane PRBS generator/checker; supports per-channel diagnostics.
- Broadcom PEX88T32 — loop-back supported; PRBS BERT evidence often used for margining screens.
- Renesas HXC44200 — PRBS generator/checker + module-level self-test hooks.
3) Automation friendly (API · telemetry integration · firmware workflow)
What it measures
Whether evidence can be collected hands-free: read/config/clear/export operations over a standard control bus, stable field schema for logging, and the ability to script repeated experiments.
Score rubric (1 / 3 / 5)
- 1: manual-only; limited or unstable register map; no export concept.
- 3: readable registers + basic configuration; export is possible but not schema-stable.
- 5: scriptable read/config/clear/export + stable evidence schema + event timestamps aligned to host time base.
How to verify (copy steps)
- Script cycle: configure → run T=[ ] → arm trigger → export evidence pack → clear counters → re-run.
- Repeat N=[N] times; compare exported pack schema and mandatory fields presence.
- Pass criteria: schema stable across resets/firmware versions (placeholders) and export latency bounded.
Example material numbers (verify package/suffix)
- TI DS280DF810 — register control over a standard bus (optional EEPROM configuration is common in this class).
- TI DS125DF1610 — built-in PRBS generator/checker; typically integrated into scripted bring-up flows.
- Renesas HXC44400 — integrates control logic; suitable for automated module-level diagnostics.
4) Determinism (repeatability after reset)
What it measures
Whether the same stimulus produces the same evidence. Determinism is the foundation for turning intermittent failures into reproducible recipes.
Score rubric (1 / 3 / 5)
- 1: reset leads to drifting states; evidence varies run-to-run.
- 3: core state mostly repeats; some fields are unstable or undocumented.
- 5: reset yields consistent config/state; event sequence and counters are repeatable within defined tolerances.
How to verify (copy steps)
- Run test recipe; export evidence pack.
- Reset; re-apply the exact config; repeat N=[N] times.
- Compare: config freeze fields identical; event ordering stable; burst signature variance ≤ [X] (placeholder).
Example material numbers (verify package/suffix)
- TI DS280DF810 — common for scripted repeats with PRBS windows and consistent readout flows.
- TI DS125DF1610 — built-in PRBS generator/checker helps repeatability screens.
- Broadcom PEX8648 — documented internal loopback / PRBS / BIST procedures support repeatable isolation workflows.
5) Safety (online diagnostics without breaking normal traffic)
What it measures
Whether diagnostics can be enabled in production systems with bounded overhead: non-disruptive monitoring, rate-limited snapshots, and a safe fallback path when triggers are noisy.
Score rubric (1 / 3 / 5)
- 1: diagnostics require disruptive mode changes; high risk to normal operation.
- 3: some non-disruptive hooks; limited controls on trigger rate and overhead.
- 5: non-disruptive monitoring + bounded snapshot rate + clear guardrails; business traffic remains stable.
How to verify (copy steps)
- Enable read-only counters + low-rate telemetry (period=[P]).
- Enable snapshot trigger with max trigger rate ≤ [R] (placeholder).
- Pass criteria: service KPIs unchanged (placeholder) and diagnostics overhead bounded in logs.
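The snapshot rate limit ≤ [R] can be enforced host-side with a simple sliding-window limiter, so a noisy trigger cannot flood the telemetry path. This is a sketch of the guardrail logic only; the trigger source and snapshot mechanism are device-specific:

```python
class SnapshotLimiter:
    """Accept at most max_per_minute snapshot triggers in any 60 s window.

    Excess triggers are counted (for later review) rather than taken, which
    bounds diagnostics overhead while keeping evidence of suppression.
    """
    def __init__(self, max_per_minute):
        self.max = max_per_minute
        self.window = []      # timestamps of accepted triggers
        self.suppressed = 0

    def try_trigger(self, now):
        self.window = [t for t in self.window if now - t < 60.0]
        if len(self.window) < self.max:
            self.window.append(now)
            return True
        self.suppressed += 1
        return False
```

Logging `suppressed` alongside accepted snapshots preserves the fact that bursts occurred even when their snapshots were rate-limited away.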
Example material numbers (verify package/suffix)
- TI DS280DF810 — this device class commonly advertises non-disruptive in-system diagnostic hooks.
- TI DS125DF1610 — mission-mode monitor + PRBS hooks enable online evidence collection.
- Microchip LAN8022 — includes PRBS generator/checker in retimer-style operating modes.
Prefer score bars over radar charts: fewer labels, clearer validation mapping, and mobile-friendly rendering.
Procurement note (keep evidence-first)
Material numbers above are examples for hook-rich devices. Always validate the exact suffix/package, availability, and firmware/register support with a short evidence-pack test (PRBS window + trigger + snapshot + export). “Supports PRBS” is not sufficient unless the evidence pack is repeatable.
FAQs (Loopback / PRBS / BIST) — actionable, evidence-first
Each answer is intentionally short and executable. Thresholds are placeholders you can standardize per product/line: [R] rate, [T] time, [N]=[R]×[T] bits, [BER_target], [CL], [Δt] bin, [X] trigger, [Npre]/[Npost] snapshot windows.
PRBS frequently loses lock but payload traffic “looks fine” — check polarity/slip or pattern config first?
Likely cause: PRBS generator/checker mismatch (polynomial/seed/lane map/inversion) or checker misalignment (bit slip / lane deskew not settled).
Quick check: Force a known-good single lane; set pattern=PRBS[ ] + seed=[ ]; read lock_state, slip_cnt, err_cnt/bit_cnt for T=[T].
Fix: First make generator/checker configuration identical (polarity + polynomial + seed + lane map); then validate deskew/alignment stability (slip_cnt stays 0) before interpreting BER.
Pass criteria: lock_state=stable over T=[T], slip_cnt=0, and err_cnt ≤ [X] (or BER upper bound [BER_target] @ CL=[CL]).
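The generator/checker mismatch mechanics can be seen in a toy PRBS7 (x^7 + x^6 + 1) model. Note the simplification: hardware checkers typically self-synchronize by seeding from received bits, whereas this sketch uses a fixed seed so that seed and inversion mismatches are visible directly:

```python
def prbs7(seed=0x7F, n=1000):
    """Generate n bits of PRBS7 (x^7 + x^6 + 1) from a 7-bit nonzero seed."""
    state = seed & 0x7F
    out = []
    for _ in range(n):
        bit = ((state >> 6) ^ (state >> 5)) & 1   # XOR of taps 7 and 6
        out.append(bit)
        state = ((state << 1) | bit) & 0x7F
    return out

def check(rx_bits, seed=0x7F, invert=False):
    """Compare received bits against a local PRBS7 reference; return error count."""
    ref = prbs7(seed, len(rx_bits))
    if invert:
        ref = [b ^ 1 for b in ref]
    return sum(r != e for r, e in zip(rx_bits, ref))
```

A matched configuration counts zero errors; a polarity flip counts every bit as an error, and a seed/phase mismatch counts roughly half of them, which is why "BER ≈ 0.5" on a link almost always means configuration mismatch, not a physical fault.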
Near-end loopback passes, far-end loopback fails — suspect channel or CDR/EQ first?
Likely cause: the segment unique to far-end loopback is failing: channel/connector loss/reflection bucket or far-end RX path bucket (CTLE/DFE/CDR alignment).
Quick check: Keep identical pattern and time window; compare near vs far: err_cnt, lock_loss_cnt, cdr_unlock_events, plus A/B with known-good cable/fixture.
Fix: If A/B with known-good channel makes far-end pass → prioritize channel bucket; if far-end still fails → prioritize RX path bucket and iterate loopback point choices (if multiple are available).
Pass criteria: far-end loopback shows no lock-loss and BER upper bound < [BER_target] @ CL=[CL] for T=[T].
Zero errors for 10 seconds — can this claim BER < 1e-12? How long is enough?
Likely cause: a statistical over-claim: test window too short and confidence level undefined (zero errors is not “infinite margin”).
Quick check: Record rate [R] and time [T]; compute observed bits [N]=[R]×[T]; state CL=[CL] explicitly.
Fix: Choose T so that the zero-error upper bound at CL=[CL] satisfies: UpperBound(BER | 0 errors, CL) < [BER_target] (use your standard template calculator).
Pass criteria: 0 errors over T=[T_required] where T_required is computed from [BER_target] and [CL].
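The required window follows from the standard zero-error bound: with 0 errors over N bits, the BER upper bound at confidence CL is −ln(1 − CL)/N. A small calculator (straightforward math, no device assumptions):

```python
import math

def required_seconds(ber_target, cl, rate_bps):
    """Minimum zero-error test time so UpperBound(BER | 0 errors, CL) < ber_target."""
    n_required = -math.log(1.0 - cl) / ber_target
    return n_required / rate_bps

def ber_upper_bound(bits_observed, cl):
    """Zero-error BER upper bound at confidence level cl."""
    return -math.log(1.0 - cl) / bits_observed
```

For BER < 1e-12 at CL = 95% on a 10 Gb/s lane, roughly 3e12 bits (about 300 s) are needed; a 10 s zero-error run only bounds BER at about 3e-11, which answers the question in this FAQ directly.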
Errors only occur in “bursts” — power disturbance vs crosstalk event? What to log first?
Likely cause: burst-triggered impairment bucket: power/clock disturbance (VDD ripple / unlock events) or activity-coupled event (neighbor lane/port switching).
Quick check: Enable binning (Δt=[Δt]) + snapshot trigger: err_cnt_in_Δt ≥ [X]; log fields: err_cnt/bit_cnt, lock_loss_cnt, cdr_unlock_events, vdd_ripple=[VDD_ripple], temp=[Temp], neighbor_activity=[ ].
Fix: Classify by correlation: bursts coincident with cdr_unlock/vdd_ripple spikes → power/clock bucket; bursts coincident with neighbor_activity → coupling bucket; then run one controlled A/B (quiet neighbor vs active neighbor) to confirm.
Pass criteria: snapshot pack includes pre=[Npre]/post=[Npost] bins and identifies a dominant trigger bucket with correlation score ≥ [X].
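The binning + trigger step reduces to a few lines once error events carry timestamps. A minimal host-side sketch (bin edges are Δt-wide; the threshold is your [X] placeholder):

```python
def bin_errors(error_times, dt):
    """Bin error timestamps into dt-wide bins; return {bin_index: err_cnt_in_bin}."""
    bins = {}
    for t in error_times:
        idx = int(t // dt)
        bins[idx] = bins.get(idx, 0) + 1
    return bins

def burst_triggers(bins, threshold):
    """Bin indices whose error count crosses the snapshot-trigger threshold."""
    return sorted(i for i, c in bins.items() if c >= threshold)
```

Bins that cross the threshold mark where to align the pre/post snapshot windows and which telemetry fields (vdd_ripple, neighbor_activity) to correlate against.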
Only lane2 fails in a multi-lane link — deskew first or routing/connector first?
Likely cause: lane-specific logical bucket (mapping/deskew/polarity) or lane-specific physical bucket (pair/connector/fixture contact).
Quick check: Do a controlled lane permutation: swap lane mapping (logical) without touching hardware; run PRBS for T=[T] and see whether the error follows the lane index or the physical pair.
Fix: If error follows the logical lane → re-check deskew/alignment state + config freeze fields; if error stays on the physical pair → A/B the connector/fixture and inspect that lane path first.
Pass criteria: per-lane err_cnt within tolerance (max/min ≤ [X]) and no lane-specific lock/align loss over T=[T].
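The lane-permutation decision can be written down explicitly, which helps avoid arguing over the result. A sketch, assuming a single failing lane and a single logical-lane swap between runs:

```python
def fault_bucket(bad_logical_before, bad_logical_after, swapped_pair):
    """Classify a single-lane fault after swapping two logical lane mappings.

    swapped_pair = (a, b): logical lanes a and b exchanged physical pairs.
    If the failure stays on the same logical index, it followed the config:
    mapping/deskew/polarity bucket. If it moved between the swapped indices,
    it followed the physical pair: routing/connector/fixture bucket.
    """
    a, b = swapped_pair
    if bad_logical_before == bad_logical_after:
        return "logical (mapping/deskew/config)"
    if {bad_logical_before, bad_logical_after} == {a, b}:
        return "physical (pair/connector/fixture)"
    return "inconclusive; repeat A/B"
```

The "inconclusive" branch matters: if the failure jumps to an unrelated lane, the single-fault assumption is wrong and the A/B should be rerun.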
After reset: training succeeds but BER is worse — which state machine/freeze point to check?
Likely cause: non-deterministic post-reset state: presets/adaptation state differs run-to-run, or the evidence pack lacks config freeze fields to reproduce the exact trained state.
Quick check: Immediately after training-complete, export freeze fields (preset/equalizer state placeholders), plus baseline counters/events; repeat N=[N] resets and compare evidence packs.
Fix: Add a “train → freeze → export” gate; lock down any auto-adaptation windows (if supported) until evidence is stable; only then iterate presets in a controlled sweep.
Pass criteria: config freeze fields identical across resets and BER upper bound deviation ≤ [X] across N=[N] runs.
Does online PRBS impact live traffic? How to do “low-intrusion” field diagnostics?
Likely cause: intrusive diagnostic mode (forces synthetic data path, changes equalization/clocking, or interrupts business traffic) instead of mission-mode monitoring.
Quick check: Verify availability of read-only counters/events in mission mode; enable telemetry with period P=[P] and snapshot rate limit ≤ [R] triggers per minute.
Fix: Use a three-tier policy: (1) counters-only monitoring, (2) rate-limited snapshot on anomalies, (3) schedule disruptive PRBS/loopback only during maintenance windows.
Pass criteria: business KPIs unchanged (placeholder), telemetry overhead bounded, and snapshot trigger rate ≤ [R] while still capturing evidence on faults.
Internal PRBS checker shows BER=0, but an external BERT reports errors — what correlation check first?
Likely cause: measurement mismatch: pattern/polynomial/seed differs, lane polarity differs, checker alignment differs, or the bit counting window is not equivalent.
Quick check: Lock four items to be identical: polynomial, seed, inversion, lane map; then compare bit_cnt and time window T=[T] between internal and external instruments.
Fix: Use a shared injection/isolation point (same loopback point or same physical segment); run a short correlation window; only after counters agree should BER disagreements be treated as a real link issue.
Pass criteria: |err_cnt_internal − err_cnt_external| ≤ [X] over identical bit_cnt and T=[T].
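The correlation gate is worth encoding, because the common mistake is comparing error counts over non-identical windows. A sketch, assuming both instruments export `err_cnt` and `bit_cnt` for the window:

```python
def correlated(internal, external, tol):
    """Check internal vs external checker agreement over an identical window.

    bit_cnt must match exactly first; otherwise the two error counts are not
    comparable and any BER disagreement is a measurement artifact.
    """
    if internal["bit_cnt"] != external["bit_cnt"]:
        return False, "bit_cnt windows differ; counters are not comparable"
    diff = abs(internal["err_cnt"] - external["err_cnt"])
    return diff <= tol, f"|err_cnt delta| = {diff}"
```

Only after this returns True over a shared injection point should a residual BER gap be escalated as a real link issue.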
Swapping the cable improves a lot — how to discriminate reflection vs insertion loss quickly?
Likely cause: channel-dominated bucket: reflection/connector discontinuity or insertion loss (length-dependent attenuation).
Quick check: Two A/B discriminators: (A) same length, different connectors; (B) same connector class, different lengths. Run identical PRBS window T=[T] and compare err_cnt/lock_loss.
Fix: If connector A/B dominates → treat as reflection bucket and prioritize connector/termination consistency; if length dominates → treat as loss bucket and prioritize reach/EQ headroom confirmation (without expanding theory).
Pass criteria: discriminator identifies one dominant bucket (confidence ≥ [X]) and chosen remediation reduces err_rate by ≥ [X]% over T=[T].
Errors start when temperature changes — what fields are most useful to log?
Likely cause: temperature-sensitive margin bucket: drift causes unlock/align events or increases burst susceptibility under the same operating recipe.
Quick check: Log the minimum evidence set per bin Δt=[Δt]: temp=[Temp], vdd_ripple=[VDD_ripple], err_cnt/bit_cnt, lock_loss_cnt, align_loss_cnt, cdr_unlock_events, retrain_cnt; enable snapshot on err_cnt_in_Δt ≥ [X].
Fix: Use correlation: determine whether temp lead/lag aligns with unlock events or with ripple spikes; then reproduce with a controlled temperature step while keeping the PRBS recipe constant.
Pass criteria: a single dominant correlation path identified (e.g., temp→cdr_unlock→err_burst) with lead/lag ≤ [X] bins and repeatable across N=[N] runs.
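The lead/lag check can be done with a plain cross-correlation over the binned series. A sketch, assuming both series are per-bin event counts on the same Δt grid:

```python
def best_lag(a, b, max_lag):
    """Lag (in bins) at which series a best aligns with series b.

    Positive lag means events in a lead events in b by that many bins,
    e.g. a temperature step preceding CDR-unlock-driven error bursts.
    """
    def score(lag):
        pairs = [(a[i], b[i + lag]) for i in range(len(a))
                 if 0 <= i + lag < len(b)]
        return sum(x * y for x, y in pairs)
    return max(range(-max_lag, max_lag + 1), key=score)
```

A consistent positive lag from temp to unlock events across N runs is exactly the "dominant correlation path" the pass criteria ask for; a lag that wanders run-to-run argues against that bucket.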
Dropping one speed step makes it stable — is it real margin or just an insufficient test window?
Likely cause: false conclusion due to unequal statistical power: low speed “looks stable” because the observed bits [N] and confidence are not comparable to the high-speed test.
Quick check: For each speed, compute [N]=[R]×[T] and ensure both meet the same CL=[CL] requirement for [BER_target] (do not compare 10 s vs 10 s across different [R]).
Fix: Normalize by confidence: choose T_high and T_low so that both satisfy UpperBound(BER | observed errors, CL) < [BER_target]; only then interpret “margin” differences.
Pass criteria: conclusion remains the same after confidence-normalized windows: high speed still fails under T=[T_required_high] while low speed passes under T=[T_required_low].
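The normalization check is cheap to automate: both windows must observe enough bits for the same target and confidence before their results may be compared. A sketch using the standard zero-error bound:

```python
import math

def bits_for_target(ber_target, cl):
    """Bits needed at zero errors so the CL upper bound meets ber_target."""
    return -math.log(1.0 - cl) / ber_target

def windows_comparable(rate1_bps, t1, rate2_bps, t2, ber_target, cl):
    """True only if BOTH tests observe enough bits for the same target and CL.

    Guards against the "10 s vs 10 s" trap: at different rates, equal wall
    time means unequal statistical power.
    """
    need = bits_for_target(ber_target, cl)
    return rate1_bps * t1 >= need and rate2_bps * t2 >= need
```

Only when this returns True does "high speed fails, low speed passes" become evidence of real margin rather than an artifact of the shorter-power window.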
BIST passes but the system still drops link — where is the most common “coverage gap” and what hook to add first?
Likely cause: BIST coverage does not include the failing segment (real channel, real RX recovery path, real alignment/deskew, or business-mode conditions); evidence is missing to map the failure to a bucket.
Quick check: Build a minimal coverage checklist: TX path, RX path, CDR, deskew/alignment, FIFO/buffer, polarity, loopback routing; verify each has an observable (counter/event/snapshot) and a pass gate.
Fix: First add evidence-first hooks: timestamped lock/align events + per-lane counters + triggerable snapshot with config freeze fields; then re-run BIST and correlate drops to a specific missing coverage item.
Pass criteria: for each failing bucket, coverage gap is closed and failures now produce an exportable evidence pack (events + counters + snapshot) within ≤ [T] of occurrence.
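The coverage checklist itself can be held as data, so the "observable per item" requirement becomes a mechanical audit rather than a review-meeting argument. Item names below mirror the checklist in this FAQ; the observable names are hypothetical:

```python
# Minimal BIST coverage checklist from this page; each item needs an observable.
COVERAGE_ITEMS = ["tx_path", "rx_path", "cdr", "deskew_alignment",
                  "fifo_buffer", "polarity", "loopback_routing"]

def coverage_gaps(observables):
    """Checklist items lacking an observable (counter/event/snapshot name).

    Missing keys and None/empty values both count as gaps: an item without
    an observable cannot produce evidence when it fails.
    """
    return [item for item in COVERAGE_ITEMS if not observables.get(item)]
```

Running this against the current evidence schema, before a field escalation, tells you exactly which bucket a "BIST passes but link drops" failure can silently hide in.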