
Synchronous Ethernet (SyncE): Jitter Filtering & Holdover


Core idea
Synchronous Ethernet (SyncE) delivers frequency synchronization over the physical layer by distributing a clean network clock, enabling tight jitter/wander control and predictable holdover in harsh, long-haul industrial deployments. This page turns SyncE into an executable engineering workflow: targets → architecture → filtering/selection/holdover → implementation → verification, with measurable pass criteria.

What is SyncE and When You Actually Need It

Card A · Definition Frequency Sync (Layer-1)

Synchronous Ethernet (SyncE) distributes a network-wide frequency reference by recovering line timing from the Ethernet physical layer. A node uses the PHY CDR recovered clock as an input, applies jitter filtering (DPLL / jitter attenuator), and drives a local clock tree so downstream devices track a stable frequency baseline.

Scope note: SyncE addresses frequency stability (jitter/wander/holdover). Time-of-day or phase alignment belongs to PTP / White Rabbit pages and is not expanded here.

Card B · Positioning No sync vs SyncE vs PTP-only
No sync

Each node free-runs on a local XO. Short tests can look fine, but long-term wander and multi-hop drift accumulate.

PTP-only

Aligns time/phase using timestamps. If local frequency is noisy, the time loop must correct more often and becomes harder to keep stable.

SyncE (+ optional PTP)

Builds a frequency foundation via line timing recovery + filtering. Time alignment (if needed) becomes easier on top of a stable baseline.

Practical takeaway: SyncE primarily reduces frequency error propagation; it is most valuable when the system budget is dominated by wander or when holdover is mandatory during reference loss.

Card C · Deployment triggers Decision checklist
Strong reasons to use SyncE
  • Multi-hop chains where long-term drift becomes visible.
  • Strict jitter/wander budget with small margin at endpoints.
  • Reference-loss tolerance is required (holdover for X minutes/hours).
  • Frequency stability affects deterministic behavior or RF/ADC/DAC timing.
Often not needed
  • Single node or short links with relaxed long-term stability.
  • Systems where timing only impacts non-critical functions.
  • Short duty cycles where wander never accumulates to a limit.
Common false diagnosis
  • Noise coupling from power/ground/layout dominates; SyncE cannot “filter away” board-level coupling.
  • Reference switching flaps due to policy; stability requires hysteresis/cooldown.
  • Link behavior (EEE, auto-neg, flaps) injects disturbances into recovered timing.
Diagram · Three-layer sync map (No sync vs SyncE vs PTP)
Three columns compare No sync (local XO free-run, drift accumulates), SyncE (recover + filter: CDR → DPLL, frequency locked to a traceable baseline), and PTP (time master, timestamps, corrections, phase/ToD aligned). Key idea: SyncE transports frequency via line timing; PTP transports time/phase via timestamps.

Design Targets: Jitter, Wander, and Holdover Goals (What to Budget)

Card A · Key terms cheat-sheet Same language for the whole page
Jitter (fast)
Short-term phase variations that impact instantaneous timing quality.
Pitfall: “Cleaner scope waveform” does not guarantee end-to-end margin if coupling exists downstream.
Wander (slow)
Long-term frequency/phase drift that accumulates over minutes to hours.
Pitfall: Over-filtering can increase slow error visibility via long lock/recovery behavior.
Holdover
Maintaining frequency stability when the reference is lost.
Pitfall: Lab success at room temperature may fail in cabinets due to gradients and stress drift.
Phase noise
Frequency-domain representation often mapped to jitter across an integration band.
Pitfall: Mixing measurement settings (RBW/VBW/band limits) breaks comparability.
Lock quality
Lock is not binary; margin depends on loop bandwidth, noise, and input quality.
Pitfall: “Locked” without stability logging hides marginal conditions.
Cascade filters
Multiple filters across nodes can create slow response or switching artifacts.
Pitfall: “Stronger filter everywhere” can increase recovery time and flapping risk.
Card B · Budget skeleton End-to-end chain view

A SyncE design becomes predictable only when error sources are budgeted along a timing chain. Instead of treating “jitter” as a single number, each stage should declare what it passes, what it attenuates, and what it adds.

Budget stages (fill-in template)
  • Reference input → input quality placeholder: X ppm / X dB
  • PHY CDR (recovered clock) → sensitivity to link behavior; added fast noise: X ps
  • DPLL / jitter filter → bandwidth and attenuation; switching behavior; lock/re-lock time: BW = X
  • Clock tree / fanout → additive jitter and coupling from power/ground: Add = X ps
  • Sink requirement → system pass criteria: Total < X over Y time window
For each stage, record
  • Observable: recovered/filtered outputs, lock flags, alarms
  • Coupling: temperature, voltage, link state events
  • Evidence: logs/plots with timestamped context
Avoid mismatched accounting
  • Keep measurement windows consistent across stages.
  • Do not compare numbers produced under different integration bands.
  • Separate “fast jitter” limits from “slow wander/holdover” limits.
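The budget skeleton above can be kept as data rather than prose. A minimal sketch, assuming uncorrelated random jitter so stage contributions combine as root-sum-square; all numbers are placeholders for the X values in the template, not vendor figures:

```python
import math

# Hypothetical per-stage additive jitter (ps RMS); placeholders, not vendor data.
# Uncorrelated random jitter contributions combine as root-sum-square.
stages = {
    "phy_cdr_recovered": 1.2,   # added fast noise at the CDR tap
    "dpll_filtered":     0.4,   # residual after jitter attenuation
    "clock_tree_fanout": 0.8,   # additive jitter + power/ground coupling
}

def total_rms_jitter_ps(contributions):
    """Root-sum-square of uncorrelated RMS jitter contributions."""
    return math.sqrt(sum(v ** 2 for v in contributions.values()))

sink_limit_ps = 2.0  # example sink requirement: Total < X over window Y
total = total_rms_jitter_ps(stages)
print(f"total = {total:.2f} ps RMS, pass = {total < sink_limit_ps}")
```

Keeping the budget in one versioned structure makes the "mismatched accounting" trap visible: every number carries the same units and the same window assumption.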
Card C · What dominates by scenario Focus effort where margin is lost
5G backhaul chains

Dominant losses often come from filter cascade strategy and switching behavior (lock time, hitless requirements, anti-flap policy).

Power-grid synchronization

Dominant losses often come from holdover drift under temperature and stress, plus incomplete event logging during reference anomalies.

Industrial gateways

Dominant losses often come from power/ground coupling into the clock tree and link-state disturbances (EEE, flaps, auto-neg transitions).

Diagram · Budget waterfall (Ref → PHY CDR → DPLL filter → clock tree → sink)
Budget mindset: each stage declares what it passes, attenuates, and adds, with evidence and pass criteria. Sink requirement template: total timing quality must remain within X (limit) over a window of Y (time), with holdover for X minutes. Fill placeholders with project limits, then instrument each stage to produce evidence (plots + logs + events).

Engineering rule: separate fast jitter limits from slow wander/holdover limits, and keep measurement windows and integration settings consistent across the entire timing chain.

SyncE Architecture: EEC/SEC Roles and the End-to-End Timing Chain

Card A · Clock roles in a packet network source · boundary · transit · sink
Source

Provides the preferred frequency reference and declares usable quality. Stability matters more than a “high label” that flaps.

Boundary / gateway

Selects among references, filters recovered timing, and enforces switching/holdover behavior to keep the chain stable.

Transit

Passes timing through without injecting disturbances; link-state behaviors can couple into recovered timing if unmanaged.

Sink

Consumes the clock and defines pass criteria (jitter/wander/holdover windows) with evidence logging for verification.

Scope note: architecture here describes frequency timing roles. Time/phase roles (PTP/WR) are out of scope for this page.

Card B · EEC vs SEC (engineering view) Recover · Filter · Select · Holdover

A practical SyncE system can be described by four functions that appear at key nodes. The EEC (Ethernet Equipment Clock, ITU-T G.8262) is the node-level "engine" that recovers timing from the PHY, filters it, selects among references, and provides holdover when inputs degrade; its performance classes deliberately mirror the older SDH Equipment Clock (SEC, G.813), which is why the two terms travel together in SyncE documentation.

Recover
PHY CDR produces a recovered clock; link behavior and cabling can modulate timing quality.
Filter
DPLL/jitter attenuation shapes what passes (slow vs fast) and defines recovery dynamics.
Select
Multi-input policy uses quality/alarms/timers; stability requires hysteresis and cooldown.
Holdover
When the preferred input is lost, oscillator stability and steering policy determine drift.
Card C · Where the jitter filter lives PHY · external DPLL · SoC clock tree
Filter inside PHY
  • Short path, simple integration.
  • Higher coupling to link-state behavior.
  • Limited control range in some designs.
External jitter attenuator / DPLL
  • Stronger control (BW, hitless, multi-input).
  • Good for cascade strategy across nodes.
  • Requires clean power/clock-tree hygiene.
SoC / FPGA clock tree
  • High integration, fewer parts.
  • Clock-domain complexity can hide coupling.
  • Verification needs better observability.

Placement rule: the filter location determines which disturbances are attenuated and which are imported into the downstream clock tree.

Diagram · SyncE chain blocks (Recover / Filter / Select / Holdover at each node)
Chain: Ref in → Switch/Router → Gateway → Endpoint. Each node is described by four functions: Recover → Filter → Select → Holdover (backed by CDR, DPLL, and XO/RTC as appropriate). Later chapters deepen each block.

Architecture rule: map every SyncE issue to a chain position (recover/filter/select/holdover) before tuning parameters; the fix becomes measurable and repeatable.

Quality Level and Network Messaging: SSM/ESMC Without Getting Lost

Card A · What quality level means in practice usable · consistent · stable

Quality information carried by SSM / ESMC is a selection signal for frequency reference choices. It helps nodes converge on a consistent reference path, but it must be treated as a policy input, not a guarantee that the chosen reference is always the cleanest.

What it enables
  • Consistent reference selection across nodes.
  • Deterministic switching behavior during faults.
  • Loop-avoidance when combined with topology rules.
Why “higher is not always better”
  • A “best-labeled” input can be unstable (flapping).
  • Configuration mismatch can create false-best decisions.
  • Policies without timers can turn marginal changes into oscillation.

Scope note: this page uses quality messaging only for frequency reference selection. PTP time-domain selection (BMCA) is out of scope here.

Card B · Selection policy template priority + QL + alarm + timers
Policy inputs
  • Priority: a preferred order for inputs (per node, per port).
  • Quality: SSM/ESMC quality level used as eligibility and preference.
  • Alarms: LOS / degradation flags that disqualify an input.
  • Timers: hold-off / cooldown / stability windows that prevent flapping.
Revertive

Returns to the preferred input when it becomes available again. Needs strong stability timers to avoid bouncing.

Non-revertive

Stays on the current input after switching. Reduces bounce risk, but must track long-term quality to avoid silent degradation.

Evidence logging

Every switch should record reason codes, QL values, alarms, timer states, and temperature for later correlation.
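The policy inputs above (priority + QL + alarm + timers) reduce to a small eligibility-then-preference function. A minimal sketch with illustrative field names; note that real SSM QL codes have their own ordering (lower codes are often better), whereas here a higher numeric `ql` is simply treated as better for readability:

```python
# Sketch of the selection policy template: an input is eligible only if it has
# no alarm, meets a QL threshold, and has been stable for the eligibility
# window. Names and thresholds are illustrative, not from a standard.
def eligible(inp, now, min_ql=4, stability_window_s=10.0):
    return (not inp["alarm"]
            and inp["ql"] >= min_ql                      # higher = better here
            and now - inp["stable_since"] >= stability_window_s)

def select(inputs, now):
    """Pick the highest-priority eligible input (lower number = preferred)."""
    ok = [i for i in inputs if eligible(i, now)]
    return min(ok, key=lambda i: i["priority"])["name"] if ok else "HOLDOVER"

inputs = [
    {"name": "REF_A", "priority": 1, "ql": 2, "alarm": False, "stable_since": 0.0},
    {"name": "REF_B", "priority": 2, "ql": 5, "alarm": False, "stable_since": 0.0},
]
print(select(inputs, now=20.0))  # REF_A fails the QL gate, so REF_B wins
```

This makes the "false-best" failure mode concrete: the preferred-priority input loses because eligibility is checked before preference.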

Card C · Classic failure modes loop · false-best · oscillation
QL loop
  • Symptom: reference “chases itself” across a ring.
  • Pattern: mutual preference or topology rule missing.
  • First check: topology eligibility + per-port priority/QL mapping.
False-best
  • Symptom: “best” input yields worse stability.
  • Pattern: config mismatch or unstable link behavior.
  • First check: ESMC parsing/config + link-event correlation.
Oscillation
  • Symptom: frequent switching every minutes/seconds.
  • Pattern: no hysteresis/hold-off/cooldown around thresholds.
  • First check: timer states + reason codes + stability window.
Diagram · QL-driven selector (Ref A / Ref B → Priority/QL/Alarm → Selector → DPLL)
Reference selection logic: decide eligibility and stability before switching, then feed a DPLL filter. Inputs Ref A / Ref B each carry QL, alarm state, and a stability window; the selector combines priority, QL/SSM, alarms, and hold-off before driving the DPLL (BW / hitless). Policy knobs: revertive / non-revertive, hysteresis, cooldown, reason codes + event logs.

Selection rule: use QL/alarms as eligibility signals, then enforce timers (hold-off/cooldown) so marginal changes do not become oscillation; log every switch with reason codes for forensics.

PHY Clock Recovery: CDR, Recovered Clock, and the Hidden Couplings

Card A · Recovered clock sources and paths CDR → divider → outputs

SyncE frequency stability starts at the PHY. Before tuning filters or policies, identify the clock source and the exact tap point used as the reference input.

Line recovered (CDR)
  • Derived from the link signal via CDR.
  • Most sensitive to link-state and cable behaviors.
  • Best for inheriting network frequency when the link is stable.
Local XO / TCXO
  • Independent from link disturbances.
  • Primary anchor for holdover behavior.
  • Quality depends on oscillator class, power, and thermal gradients.
Synthesized (PLL-derived)
  • Generated via PLL/dividers for distribution.
  • Useful for SoC clock trees and fanout.
  • Can import PLL/power noise if the clock tree is not isolated.

Mapping rule: document the chain as source → tap point → reference input → filter → distribution → sink. Debugging becomes faster and measurable.

Card B · Couplings that bite you EEE · link flap · auto-neg · cable
EEE transitions

Power-save entry/exit can create short timing disturbances that appear as “random” phase steps unless event logs are aligned to clock alarms.

First check: correlate EEE state events with recovered-clock alarms and error counters.
Link flap / micro-cuts

Brief down/up cycles can force re-lock paths and trigger unnecessary reference switching if hold-off and stability windows are missing.

First check: up/down counters + negotiated speed changes vs selection/hitless events.
Auto-neg / re-training

Negotiation windows can temporarily degrade recovered timing; treating these intervals as “valid reference” causes selection oscillation.

First check: negotiation timestamps + reference eligibility masks.
Cable / magnetics / SI

Reflections, return-loss margin, and connector behavior can modulate the PHY’s recovery loop, especially under EMI and load steps.

First check: PCS/PMD error counters + environment event alignment (motors/relays/PoE load).

Key idea: many “network features” are effectively timing injectors at the PHY recovery boundary unless guard timers and eligibility rules are defined.

Card C · Loopback/PRBS vs real network lab OK ≠ field stable
Trigger surface differs

Real networks introduce link-state transitions, negotiation, and device-to-device behavior that PRBS does not exercise.

Environment differs

EMI, temperature gradients, power noise, and ground/shield paths can convert “margin” into timing instability.

Topology differs

Cascaded recover/filter/select stages create system-level dynamics that are absent in bench loopbacks.

Minimal verification ladder
  1. Baseline: PRBS/loopback for link health and error floors.
  2. Events: force EEE, renegotiation, and controlled re-link; record timing alarms.
  3. Field: replay typical load/EMI events; correlate to clock quality counters and selection logs.
Diagram · PHY timing tap points (line recovered / local XO / synthesized → reference paths)
PHY timing sources (CDR, XO, PLL/divider) feed tap points (REC_CLK, XO_CLK, SYN_CLK) toward destinations (external DPLL, SoC clock tree, reference output pin). Choose the tap point before tuning filters. Hidden couplings: EEE, link flap, auto-negotiation, cable/SI.

Root cause shortcut: when recovered clock becomes unstable, check the tap point and link-state couplings first; filters and policies cannot fully “fix” a contaminated source.

Jitter Filtering: DPLL Bandwidth, Cascading, and Hitless Behavior

Card A · Bandwidth intuition fast vs slow disturbances
Jitter (fast)

Fast disturbances should be attenuated; shrinking bandwidth blindly can hide problems by slowing lock and increasing recovery time.

Wander (mid/slow)

Too-narrow bandwidth can treat wander as “noise” and push error into long-period deviations that are harder to detect.

Holdover drift (very slow)

Filtering cannot remove oscillator drift; long outages require explicit holdover policy and oscillator quality targets.

Decision input list: recovered-clock disturbance shape, acceptable lock time, and downstream sensitivity. Bandwidth is a system constraint, not a single “better” knob.
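The bandwidth trade-off can be made tangible with a one-pole low-pass standing in for the DPLL phase filter. A sketch under simplifying assumptions (first-order loop, arbitrary sample rate, synthetic disturbances): a narrow loop strongly attenuates a fast 100 Hz disturbance while tracking a slow ramp with only a bounded lag.

```python
import math

def lowpass(samples, fs_hz, bw_hz):
    """One-pole IIR low-pass: a crude stand-in for DPLL phase filtering."""
    alpha = 1.0 - math.exp(-2.0 * math.pi * bw_hz / fs_hz)
    y, out = 0.0, []
    for x in samples:
        y += alpha * (x - y)
        out.append(y)
    return out

fs = 1000.0  # sample rate, Hz (illustrative)
fast = [math.sin(2 * math.pi * 100.0 * n / fs) for n in range(2000)]  # fast jitter
slow = [0.001 * n for n in range(2000)]                               # slow wander ramp

narrow_fast = lowpass(fast, fs, bw_hz=1.0)
narrow_slow = lowpass(slow, fs, bw_hz=1.0)
# Fast jitter is strongly attenuated after the loop settles...
print(max(abs(v) for v in narrow_fast[1000:]))
# ...while the slow ramp is tracked with a small steady-state lag.
print(slow[-1] - narrow_slow[-1])
```

The same model also shows the cascade risk: stacking several narrow stages multiplies the settling time, which is exactly the "too slow / fragile chain" behavior the filter rule warns about.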

Card B · Cascade strategy node-level vs system-level
Node-level strong filtering
  • Useful when each hop is noisy.
  • Reduces immediate jitter propagation.
  • Risk: overly slow chain response when cascaded.
System-level main filter
  • Centralizes timing cleanup at key nodes.
  • Other nodes keep minimal shaping.
  • Needs observability to prevent silent degradation.
Cascade guard rules
  • Declare every tap point and bandwidth goal per stage.
  • Avoid “black-box filters” in series without visibility.
  • Keep reason codes and lock metrics per node for correlation.
Card C · Hitless switching requirements no phase step by design
Necessary conditions
  • Input frequency delta stays within the capture and steering range.
  • Inputs meet eligibility and stability windows before switching.
  • Switching is protected by hold-off/cooldown/hysteresis to prevent bounce.
  • DPLL mode transitions are explicit (not accidental) during switch events.
Verification signals
  • Max phase step ≤ X
  • Recovery time ≤ X
  • Post-switch stability within X for Y seconds
Common false-hitless pattern

A switch “looks smooth” only because measurement windows are too slow or alarms are not time-aligned to the event.
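The verification signals above (max phase step, recovery time, post-switch stability) can be computed directly from a phase log around the switch event. A minimal sketch with hypothetical limits and a synthetic log; the settle test requires the band to hold for several consecutive samples, which is what defeats the "false-hitless" slow-window trap:

```python
# Given a phase log around a switch event, measure the worst single-sample
# step and the index where the phase settles back inside a band for a run of
# samples. Limits (max_step_ns, settle_band_ns) are project placeholders.
def check_switch(phase_ns, t_switch_idx, max_step_ns, settle_band_ns, settle_for=5):
    steps = [abs(phase_ns[i + 1] - phase_ns[i])
             for i in range(t_switch_idx, len(phase_ns) - 1)]
    max_step, settled_at, run = max(steps), None, 0
    for i in range(t_switch_idx, len(phase_ns)):
        run = run + 1 if abs(phase_ns[i]) <= settle_band_ns else 0
        if run >= settle_for:
            settled_at = i - settle_for + 1
            break
    return {"max_step_ns": max_step, "settled_at": settled_at,
            "pass": max_step <= max_step_ns and settled_at is not None}

# Synthetic log: a switch disturbs the phase near i=5, then it decays.
log = [0.0] * 5 + [3.0, 2.0, 1.2, 0.6, 0.3] + [0.1] * 10
print(check_switch(log, t_switch_idx=4, max_step_ns=5.0, settle_band_ns=0.5))
```

Time-aligning this check to the switch event (rather than averaging over a long window) is the point: the same log evaluated with a wide window would hide the 3 ns excursion.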

Diagram · Filter bandwidth map (jitter / wander / holdover drift + wide vs narrow shapes)
Bandwidth intuition: shape matters. Attenuate fast jitter, track slow wander; holdover drift is not a filtering problem; cascade risk: too slow / peaking.

Filter rule: set DPLL bandwidth to attenuate fast jitter while tracking slow wander; cascading “strong filters everywhere” often makes the chain slow and fragile.

Holdover: Oscillator Choice, Disciplining, and Recovery Strategy

Card A · Holdover error sources temp · aging · power · stress

Holdover is a budgeted behavior: it specifies how long frequency stays within limits and which drift terms dominate when the reference disappears.

Temperature / gradients
  • Ambient changes and local hot-spots drive frequency drift.
  • Airflow direction and enclosure gradients matter as much as absolute temperature.
  • First check: log ΔT/Δt near the oscillator, not only chassis air.
Aging
  • Long-term monotonic drift that short tests rarely expose.
  • Needs baseline snapshots across lifecycle milestones.
  • First check: compare to archived “day-0” frequency reference.
Power / load steps
  • Supply noise can translate into phase noise and short disturbances.
  • Co-rail coupling with PHY/CPU can create timing spikes.
  • First check: align supply events with drift slope changes.
PCB / mechanical stress
  • Board flex, mounting torque, and encapsulation can shift frequency.
  • Often shows up as “bench stable, installed drifting”.
  • First check: compare before/after assembly and mounting steps.

Budget entries (placeholders): drift_rate = X, temp_sensitivity = X, max_holdover_time = X, max_recovery_step = X.

Card B · Holdover algorithm outline freeze · steer · recover
Freeze

On reference loss, stop following link noise and hold the last valid state to avoid chasing flapping inputs.

Steer (limited)

Apply slow, bounded corrections using temperature history or learned trends; enforce rate and step limits.

Phase-continuous recovery

When reference returns, re-lock in stages to avoid phase steps: small capture bandwidth first, then restore normal tracking.

Hard limits (placeholders)
  • MAX HOLDOVER TIME = X
  • MAX RECOVERY STEP = X
  • ELIGIBILITY WINDOW = X
Practical definition

A good holdover design is not “perfect stability”, but a predictable drift envelope and a controlled return without secondary disturbances.
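The freeze → steer → recover outline reduces to a bounded correction rule. A minimal sketch assuming a linear temperature coefficient learned earlier; the coefficient, rate limit, and total limit are all placeholders for the hard limits listed above:

```python
# Bounded steering during holdover: predict drift from temperature delta,
# then move toward the correction in rate-limited steps, never exceeding a
# total authority. All coefficients and limits are illustrative placeholders.
def holdover_correction(delta_temp_c, tempco_ppb_per_c,
                        max_step_ppb, max_total_ppb, applied_ppb):
    target = -tempco_ppb_per_c * delta_temp_c            # drift to cancel
    step = max(-max_step_ppb, min(max_step_ppb, target - applied_ppb))
    return max(-max_total_ppb, min(max_total_ppb, applied_ppb + step))

applied = 0.0
for dT in [0.5, 1.0, 1.5, 2.0]:  # slow cabinet warm-up (°C from baseline)
    applied = holdover_correction(dT, tempco_ppb_per_c=10.0, max_step_ppb=5.0,
                                  max_total_ppb=50.0, applied_ppb=applied)
    print(applied)
```

The rate limit is what makes holdover predictable: even if the temperature model is wrong, the correction cannot jump, so the drift envelope stays bounded and the later phase-continuous re-lock has no large step to absorb.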

Card C · Field reality airflow · gradients · stress
Fan / airflow steps

Fan profile changes can create timing wander “steps” by forcing local temperature transients near the oscillator.

Fast closure: align PWM/RPM logs with drift slope changes.
Enclosure thermal gradients

A stable ambient reading can still hide a persistent gradient across the board that pushes long-term drift.

Fast closure: compare multiple temperature points (near XO vs airflow inlet/outlet).
Install / mounting sensitivity

Mounting torque and board constraints can shift the oscillator’s operating point and amplify temperature sensitivity.

Fast closure: record a before/after baseline under the same temperature and supply.

Implementation hint: holdover performance is only as good as logging. Store reference loss/return timestamps, temperature near the oscillator, and power/fan events.

Diagram · Holdover timeline (ref OK → lost → drift → return → re-lock)
Holdover is a timeline, not a point: budget how long, control how to return. Sequence: REF OK (X hours) → REF LOST (event) → HOLDOVER (X minutes: freeze, steer within limits, drift envelope, MAX TIME = X, MAX STEP = X) → REF RETURN (eligible) → RE-LOCK (X sec, staged capture, avoiding secondary disturbances).

Holdover rule: select oscillator class by drift budget, enforce time/step limits, and re-lock using staged, phase-continuous recovery.

Reference Selection and Protection Switching: Avoid Loops and Flapping

Card A · Selector state machine LOCK_A · LOCK_B · HOLDOVER

A stable selector treats reference inputs as eligible only after they pass a window, then switches through controlled transitions rather than reacting to every glitch.

LOCK_A
  • A is selected and stable.
  • Exit on LOS / invalid / policy trigger.
  • Transitions guarded by timers.
LOCK_B
  • B is selected and stable.
  • Exit on LOS / invalid / policy trigger.
  • Symmetric rules reduce corner cases.
HOLDOVER
  • No eligible reference is available.
  • Holdover policy maintains frequency.
  • Return requires eligibility + hold-off.
Card B · Anti-flap controls hysteresis · hold-off · cooldown · voting
Eligibility window

Require stability for X before selection; prevents glitch-driven switching.

Hysteresis

Use separate enter/exit thresholds; reduces threshold-edge oscillation.

Hold-off

Delay switching on transient degradation; avoids reacting to short disturbances.

Cooldown

After a switch, block further switching for X; prevents bounce loops.

Voting / multi-signal

Combine LOS + validity + quality flags; avoids “false-best” from a single indicator.

Observability

Store reason codes and timer states for every switch; enables field forensics and tuning.

Card C · Loop prevention checklist minimum rule set
  • Define a single reference direction per domain; avoid “mutual upstream” paths.
  • Ensure A/B identities remain consistent across ports, nodes, and logs.
  • Bind revertive behavior to eligibility windows and cooldown (no instant bounce-back).
  • Do not cascade multiple unknown selector policies without visibility.
  • Keep failure modes monotonic: degrade → holdover → recover (avoid rapid toggles).
  • Pin policy at boundary nodes so edge glitches do not amplify network-wide switching.
  • Use consistent timestamps and reason codes for loss/restore/switch events.
  • Verify switching with pass criteria: max step ≤ X, recover time ≤ X.
Diagram · Selector FSM (LOCK_A / LOCK_B / HOLDOVER + cooldown)
Selector stability = state machine + timers: avoid loops and flapping. States LOCK_A / LOCK_B / HOLDOVER; transitions on eligibility windows, LOS/invalid, and QL drop/timeout, guarded by cooldown. Stability controls: hysteresis, hold-off, cooldown/voting.

Switching rule: selection must be gated by eligibility windows and protected by hysteresis/hold-off/cooldown to prevent loops and flapping.

Implementation Blueprint: Switch/Gateway Clock Tree, Layout, Noise Hygiene

Card A · Clock tree pattern library recovered → filter → fanout → sinks

A robust SyncE implementation treats the clock path as a domain-crossing system: recovery, selection, cleaning, and distribution must be explicit and reviewable.

Template 1 · Single recovered
  • Input: PHY recovered clock
  • Clean: DPLL / jitter attenuator
  • Distribute: fanout → SoC/FPGA/SerDes
  • Key check: isolate noisy rails from the cleaner
Template 2 · Dual inputs A/B
  • Inputs: recovered A + recovered B
  • Select: eligibility + cooldown guarded
  • Clean: single DPLL stage
  • Key check: switch events must be observable
Template 3 · Holdover-friendly
  • Input: recovered + local XO
  • Discipline: bounded steer model
  • Recover: staged lock return
  • Key check: temperature near XO must be logged
Template 4 · Platform PLL (caution)
  • Input: recovered into SoC PLL
  • Risk: platform noise coupling
  • Mitigation: re-clean before fanout
  • Key check: compare filtered vs sink clocks
Card B · Noise coupling traps power · ground · routing · isolation
Power coupling

Supply ripple and load steps can translate into phase noise in cleaners and fanouts.

Rule: clean rail for DPLL/fanout; avoid shared noisy rails.
Ground / return breaks

Return discontinuities and ground bounce create edge timing uncertainty.

Rule: keep clock return continuous; avoid crossing plane splits.
Routing & crosstalk

Long clock traces and proximity to high-speed pairs inject jitter through coupling and reflections.

Rule: short/straight routing, keepout, and controlled impedance.
Isolation boundary traps

Parasitic capacitance across isolation and ESD current paths can pollute the clock domain.

Rule: define boundaries; avoid unnecessary clock crossings.

Quick triage: if recovered looks good but filtered looks bad, suspect DPLL power/return; if filtered looks good but sink looks bad, suspect fanout/routing/keepout.

Card C · Bring-up probes tap1 · tap2 · tap3
Tap 1 · recovered

Confirms link recovery quality and PHY sensitivity to link events.

Tap 2 · filtered

Confirms the cleaner isolates upstream disturbance and local power noise.

Tap 3 · sink out

Confirms fanout, routing, and clock consumers do not re-contaminate the clock.

Interpretation map: Tap1 bad → link/PHY; Tap1 good & Tap2 bad → cleaner power/return/bandwidth; Tap2 good & Tap3 bad → fanout/routing/keepout.
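The interpretation map reads naturally as a first-bad-tap lookup. A trivial sketch; the suspect labels are quoted from this page:

```python
# Tap triage: the first bad tap localizes the suspect area along the chain.
def triage(tap1_ok, tap2_ok, tap3_ok):
    if not tap1_ok:
        return "link/PHY (recovered clock)"
    if not tap2_ok:
        return "cleaner power/return/bandwidth (DPLL)"
    if not tap3_ok:
        return "fanout/routing/keepout"
    return "chain OK"

print(triage(True, True, False))
```

Encoding the map keeps bring-up reports consistent: every debug log names the same suspect area for the same tap pattern.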

Diagram · Clock tree layout map (power domains + clock domains + icon rules)
Implementation map: clock domains + power domains. The clean rail hosts the DPLL + fanout; the noisy rail hosts SoC/SerDes/SMPS. Path: PHY recovered → selector (A/B eligible) → DPLL (jitter filter, holdover) → fanout → sinks (switch core, gateway SoC PLL domain, FPGA clock in, SerDes refclk). Rules: keep clock routes short, isolate domains, keep return paths continuous.

Blueprint rule: make recovery, cleaning, and distribution explicit; separate power/return domains; validate at Tap1/Tap2/Tap3 to localize contamination.

Verification & Monitoring: Lab Tests, Field Telemetry, and Pass Criteria

Card A · Lab measurement plan inject → disturb → observe

A good validation plan is scenario-driven: apply controlled impairments, change link/platform conditions, and measure at the same three taps to keep comparisons consistent.

Scenario 1 · reference events
  • loss / restore / degrade (placeholders)
  • validate holdover + selection behavior
  • observe Tap1/Tap2/Tap3
Scenario 2 · link events
  • link flap / re-train / cable change
  • check recovered sensitivity vs filtered isolation
  • observe Tap1/Tap2/Tap3
Scenario 3 · platform noise
  • load steps / fan steps / temp ramps
  • check cleaner + fanout immunity
  • observe Tap2/Tap3 changes

Fixed taps: Tap1 = recovered, Tap2 = filtered, Tap3 = sink out. Keeping taps fixed prevents “instrumented optimism” in lab-only setups.

Card B · Counters / logs to record state · events · env · health
Lock & state
  • lock_state (A/B/HOLDOVER)
  • enter/exit timestamps
  • lock quality (placeholder)
Switch events
  • switch A↔B / to holdover
  • reason_code (LOS/invalid/QL/timeout)
  • timers snapshot (eligibility/hold-off/cooldown)
Environment & power
  • temp near XO
  • fan PWM/RPM
  • supply events / brownout
Health counters
  • link up/down count
  • error counters (placeholder)
  • reset/restart events

Field forensics rule: every switch must have a reason code and timer context; otherwise tuning becomes guesswork and cross-node correlation breaks.
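A fixed record shape is what makes the forensics rule enforceable. A sketch of one switch-event record assuming a flat schema; the field names are an illustrative flattening of the four counter groups above, not a standard format:

```python
from dataclasses import dataclass, asdict

# One switch-event record covering state, reason, timer context, environment,
# and health. Field names are illustrative, not from a standard schema.
@dataclass
class SwitchEvent:
    ts: float              # event timestamp
    lock_state: str        # "A" / "B" / "HOLDOVER"
    reason_code: str       # LOS / invalid / QL / timeout
    t_holdoff_s: float     # timer snapshot at the event
    t_cooldown_s: float
    temp_near_xo_c: float  # environment for later correlation
    link_downs: int        # health counter

ev = SwitchEvent(ts=1234.5, lock_state="HOLDOVER", reason_code="LOS",
                 t_holdoff_s=2.0, t_cooldown_s=30.0, temp_near_xo_c=47.3,
                 link_downs=3)
print(asdict(ev))
```

Because every event carries its reason code and timer snapshot, cross-node correlation becomes a join on timestamps rather than guesswork.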

Card C · Pass criteria template X / Y / Z placeholders
  • Lock acquisition: lock time ≤ X within Y window.
  • Switching stability: switch count ≤ X per Y; cooldown enforced.
  • Holdover envelope: drift ≤ X over Y after ref loss.
  • Recovery behavior: max step ≤ X; recover time ≤ X.
  • Noise immunity: platform event causes ≤ X change at Tap2/Tap3.
  • Logging completeness: missing required fields ≤ X.

Acceptance posture: criteria must be measurable from the same taps and the same log schema in both lab and field.
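The X/Y template becomes testable once each criterion binds a measured value to a limit in one table. A sketch with placeholder numbers; the criterion names mirror the bullet list above:

```python
# Pass-criteria table as data: measured vs limit per criterion. All numbers
# are placeholders for the project's X/Y values.
criteria = {
    "lock_time_s":        {"measured": 4.2,  "limit": 10.0},
    "switch_count_per_h": {"measured": 1,    "limit": 3},
    "holdover_drift_ppb": {"measured": 38.0, "limit": 50.0},
    "recovery_step_ns":   {"measured": 0.8,  "limit": 2.0},
}

def evaluate(criteria):
    """Overall pass plus the list of failing criteria, for the report."""
    fails = [k for k, c in criteria.items() if c["measured"] > c["limit"]]
    return {"pass": not fails, "fails": fails}

print(evaluate(criteria))
```

Running the same evaluator on lab and field measurements (same taps, same log schema) is what makes the acceptance posture hold.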

Diagram · Test harness block (source + impairments + DUT + measurement + taps)
Validation harness: inject impairments, measure fixed taps. Reference source → impairments (loss/restore, quality drop, noise events) → DUT (PHY, selector, DPLL/filter, fanout) → measurement (jitter/wander, logs) at Tap1 (recovered), Tap2 (filtered), Tap3 (out). Pass criteria template (placeholders): lock ≤ X; switches ≤ X per Y; holdover drift ≤ X over Y; recovery step ≤ X.

Verification rule: use scenario-driven impairments, fixed taps, a unified telemetry schema, and X/Y/Z pass criteria so lab results translate to field closures.

Engineering Checklist (Design → Bring-up → Production)

Purpose
Convert SyncE architecture decisions into verifiable gates. Each checklist item forces an output artifact (log/plot/config) and a measurable pass condition (X/Y/Z placeholders) to prevent “looks OK” bring-up traps.
Design Gate
Freeze architecture + budget + parts + evidence plan
Check: Define the SyncE clock chain boundary (source → transit → sink).
How: Draw a one-page block map including PHY recovered clock, DPLL/jitter filter, fanout, and consumers.
Evidence: Block diagram + clock-domain table (nodes, frequencies, clock I/O pins).
Pass: Every consumer clock has exactly one selected parent and one measurement tap.
Check: Budget definitions are consistent (jitter vs wander vs holdover drift).
How: Lock a single metric sheet (measurement bandwidths, masks, averaging windows, units).
Evidence: “Budget sheet vX” with bandwidth notes and placeholders (X ps / Y ppb / Z s).
Pass: No metric is used without an explicit bandwidth/window definition.
Check: PHY provides a usable recovered clock path for SyncE.
How: Confirm recovered-clock output modes and pin strapping/registers in the chosen PHY.
Evidence: Datasheet excerpts + pin plan; examples: TI DP83867IR; Microchip KSZ9131RNX.
Pass: A recovered-clock output can be routed to DPLL with known voltage/format.
Check: DPLL/jitter filter choice matches holdover and switching goals.
How: Select by input count, bandwidth range, hitless switching support, and holdover behavior.
Evidence: Shortlist: Renesas 82P33714; Microchip ZL30722; Silicon Labs Si5341A.
Pass: One device can meet target mask with margin using a defined profile.
Check: Oscillator type is justified by required holdover duration.
How: Map drift sources (temp/aging/stress) into the holdover budget and select TCXO/OCXO accordingly.
Evidence: Candidates: SiTime SiT5356; Abracon AST3TQ-10.000MHZ-2; Microchip OX-049.
Pass: Worst-case drift stays within X ppb for Y minutes/hours without ref.
Check: Fanout/clock buffer additive jitter and enable behavior are bounded.
How: Choose buffers with known additive jitter, disable glitch rules, and supply noise tolerance.
Evidence: Examples: TI LMK1C1104; Renesas 5PB1108; TI CDCM6208.
Pass: Distribution adds ≤ X ps RMS jitter and output-enable transitions create no spurs/glitches.
Check: Selection logic prevents reference flapping.
How: Define hysteresis + hold-off + cooldown + revertive policy for QL/alarms.
Evidence: Selector state table + timers (T_holdoff, T_cooldown, T_revert).
Pass: No oscillation under “borderline quality” test vectors.
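The anti-flap policy above (hysteresis + hold-off + cooldown) can be sketched as a small state machine. This is an illustrative model for testing selection logic against "borderline quality" vectors, not the register interface of any DPLL; margin and timer values are placeholders:

```python
class RefSelector:
    """Toy reference selector: a candidate must beat the current source by a
    hysteresis margin, stay better for a hold-off period, and any switch is
    followed by a cooldown (minimum dwell). All values are placeholders."""

    def __init__(self, switch_margin=2, holdoff_ticks=3, cooldown_ticks=10):
        self.switch_margin = switch_margin
        self.holdoff_ticks = holdoff_ticks
        self.cooldown_ticks = cooldown_ticks
        self.selected = None
        self._better_count = 0
        self._cooldown = 0

    def step(self, qualities: dict) -> str:
        """qualities: {ref_name: score}, higher is better. Returns selection."""
        if self.selected is None:
            self.selected = max(qualities, key=qualities.get)
            self._cooldown = self.cooldown_ticks
            return self.selected
        if self._cooldown > 0:          # minimum dwell after a switch
            self._cooldown -= 1
            return self.selected
        best = max(qualities, key=qualities.get)
        if best != self.selected and \
           qualities[best] >= qualities[self.selected] + self.switch_margin:
            self._better_count += 1     # hold-off: must stay better for N ticks
            if self._better_count >= self.holdoff_ticks:
                self.selected = best
                self._better_count = 0
                self._cooldown = self.cooldown_ticks
        else:
            self._better_count = 0
        return self.selected
```

Feeding this model the same "borderline" vectors used in the Pass criterion (quality difference inside the hysteresis band) should produce zero switches; that is the property the bring-up test later re-verifies on hardware.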
Check: Power integrity plan isolates clock-noise aggressors.
How: Separate clock analog rails, enforce return paths, and keep SMPS nodes away from clock inputs/VCXO pins.
Evidence: Rail map + placement rules + “no-cross” keepouts near DPLL/oscillator.
Pass: Target spur/jitter not degraded when loads step (X mA @ Y kHz).
Check: Temperature and power telemetry are instrumented for forensics.
How: Add local temperature + rail current/voltage monitors around oscillator/DPLL zones.
Evidence: Sensor examples: TI TMP117AIDRVR; TI INA226AIDGST; ADI LTC2990.
Pass: Logs correlate lock events with ΔT and rail excursions within X ms.
Check: Measurement tap points are physically reachable and low-intrusion.
How: Expose test pads/connectors on recovered, filtered, and distributed clocks (with buffering as needed).
Evidence: Probe map + expected amplitude/format at each tap.
Pass: Each tap can be measured without changing the clock profile or load.
Check: “Config as data” is defined (versioned profiles, not manual tweaks).
How: Store DPLL profiles + selector policy + alarm masks in a versioned artifact.
Evidence: Profile files + build IDs + register-dump script list.
Pass: A fresh unit can be configured to “known-good” state in ≤ X minutes.
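"Config as data" is easiest to enforce when every profile payload is hashed canonically, so the same logical profile always yields the same hash regardless of how the file was assembled. A minimal sketch (field names are illustrative):

```python
import hashlib
import json

def profile_hash(profile: dict) -> str:
    """Stable SHA-256 over a canonical JSON encoding of a DPLL profile.
    Sorting keys makes the hash independent of dict insertion order, so the
    same logical config always produces the same manufacturing-log hash."""
    payload = json.dumps(profile, sort_keys=True, separators=(",", ":")).encode()
    return hashlib.sha256(payload).hexdigest()

# Hypothetical profile fields, for illustration only:
golden = {"dpll_bw_hz": 0.1, "revertive": False, "holdoff_ms": 500}
```

Storing this hash next to the unit serial is what later makes the Production Gate's "config can be reproduced exactly" check a one-line comparison.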
Bring-up Gate
Prove lock behavior + measurement repeatability
Check: Recovered clock actually tracks the link (not a free-running local clock).
How: Force link changes (speed/remote partner) and watch recovered clock behavior at the PHY output.
Evidence: Frequency/phase plots at PHY CLK output (e.g., DP83867IR / KSZ9131RNX).
Pass: Recovered clock follows expected mode with no unexplained jumps > X.
Check: DPLL achieves lock under nominal conditions with a known profile.
How: Apply the stored configuration and monitor lock/holdover states.
Evidence: Register dumps + status bits over time (e.g., 82P33714 / ZL30722 / Si5341A).
Pass: Lock time ≤ X s; no repeated loss-of-lock in Y minutes.
Check: Filter bandwidth is not “over-tight” (slow relock / wander magnification).
How: Sweep bandwidth presets and measure relock time + output stability under injected disturbances.
Evidence: A/B plots of lock time and output deviation (profile IDs noted).
Pass: Meets mask while relocking within X s after a controlled event.
Check: Hitless/phase-continuity behavior is validated for switching scenarios.
How: Toggle reference inputs with defined Δf/Δphase; measure output phase step.
Evidence: “Switch event” traces + event logs (input A/B, reason, timer state).
Pass: Phase step ≤ X and no output glitch; switching time ≤ Y ms.
Check: Holdover drift matches expectation for the chosen oscillator.
How: Remove reference, hold for T minutes/hours, record output deviation and temperature.
Evidence: Holdover plot + temp log (e.g., SiT5356 / OX-049 + TMP117AIDRVR).
Pass: Drift ≤ X ppb over Y time at specified ΔT.
Check: Reference return does not cause a frequency/phase “slam” (recovery is controlled).
How: Re-apply reference and validate freeze/steer/limit policy during recovery.
Evidence: Recovery traces with policy state markers (freeze → steer → locked).
Pass: Output returns to lock without exceeding phase/frequency limits (X/Y).
Check: Power/thermal coupling does not modulate jitter/wander.
How: Apply load steps and fan/airflow changes; correlate to clock metrics.
Evidence: INA226AIDGST rail logs + clock plot overlay + fan state.
Pass: No measurable degradation beyond X under defined step conditions.
Check: Anti-flap timers behave as specified under marginal inputs.
How: Inject intermittent LOS/QL drops; verify hold-off/cooldown suppress chatter.
Evidence: Event timeline (reason codes + timer counters + selected source).
Pass: Selection does not oscillate; max switches per hour ≤ X.
Check: Clock-buffer output enables do not glitch during resets.
How: Exercise reset sequences; validate clock buffers (LMK1C1104 / 5PB1108) OE behavior.
Evidence: Scope/logic capture at consumer clock pins + reset timing diagram.
Pass: No runt pulses / missing cycles beyond X during reset windows.
Check: Config/profile reproducibility across cold boot.
How: Power-cycle N times; compare register dumps and lock metrics distribution.
Evidence: Boot-to-lock histogram + CRC/hash of config payload.
Pass: Lock success rate ≥ X% and time spread ≤ Y under defined conditions.
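The boot-to-lock reproducibility check reduces to two numbers per batch of power cycles: success rate and time spread. A minimal reduction sketch (the failure marker and sample values are illustrative):

```python
def boot_lock_stats(lock_times_s, fail_marker=None):
    """Summarize N boot-to-lock trials.
    `lock_times_s`: one entry per boot — a float lock time in seconds, or
    `fail_marker` for boots that never locked. Returns (success_rate, spread)."""
    ok = [t for t in lock_times_s if t is not fail_marker]
    success_rate = len(ok) / len(lock_times_s)
    spread = max(ok) - min(ok) if ok else float("inf")
    return success_rate, spread

# Example: 4 cold boots, one failed to lock.
rate, spread = boot_lock_stats([1.2, 1.3, None, 1.1])
```

Both numbers feed the Pass criterion directly (rate ≥ X%, spread ≤ Y); the histogram in the Evidence line is the same data before reduction.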
Check: Field-forensics packet is complete (minimal, sufficient, searchable).
How: Define a single log record schema (timestamp, selected ref, alarms, temp, rails, counters).
Evidence: Example log line + schema version + trigger conditions list.
Pass: Any lock-loss can be explained within X minutes using only the captured fields.
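The single log-record schema can be pinned down as a typed record that serializes to one searchable line. Field names below follow the telemetry keys used on this page; they are illustrative, not a vendor format:

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class SyncEEvent:
    """Minimal forensics record: one line per transition (lock/holdover/switch).
    Field names mirror this page's telemetry vocabulary and are placeholders."""
    t_event: float          # monotonic or wall-clock timestamp
    selected_ref: str       # currently selected timing parent
    reason_code: str        # why this record was emitted
    lock_status: str        # e.g. "locked" / "holdover" / "acquiring"
    temp_near_xo_c: float   # local temperature for drift correlation
    rail_mv: float          # monitored rail voltage snapshot
    schema_version: str = "1.0"

    def to_log_line(self) -> str:
        """One sorted-key JSON line per event: grep-able and diff-able."""
        return json.dumps(asdict(self), sort_keys=True)
```

Emitting on transitions only (plus a ring buffer around each trigger) keeps the bundle "minimal, sufficient, searchable," which is exactly the Pass criterion above.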
Production Gate
Make quality measurable at scale
Check: Golden configuration is immutable and traceable per unit.
How: Program DPLL profile + selector policy from a signed/hashed package.
Evidence: Config hash stored in manufacturing log + unit serial linkage.
Pass: Any unit’s config can be reproduced exactly from its record.
Check: Manufacturing station verifies recovered-clock path.
How: Run a short link-up test and validate PHY recovered-clock mode selection.
Evidence: Pass/fail + measured frequency snapshot at PHY clock pin (X ppm window).
Pass: Clock output within X ppm of expected and stable for Y seconds.
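The station's X ppm window is a one-line computation on the frequency snapshot. A minimal sketch (the 125 MHz nominal matches the recovered-clock examples on this page; the measured value is illustrative):

```python
def ppm_offset(measured_hz: float, nominal_hz: float) -> float:
    """Fractional frequency offset of a measured clock, in parts per million."""
    return (measured_hz - nominal_hz) / nominal_hz * 1e6

# Example: a 125 MHz recovered clock measured at 125,000,250 Hz is +2 ppm.
offset = ppm_offset(125_000_250, 125_000_000)
```

The station then passes the unit when `abs(offset)` stays inside the X ppm window for the full Y-second observation, not just on a single snapshot.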
Check: DPLL lock and alarm health are verified quickly.
How: Read status bits (LOL/LOS/holdover) and confirm lock within a bounded time.
Evidence: Register snapshot for each unit (82P33714 / ZL30722 / Si5341A).
Pass: Lock achieved ≤ X s; alarms cleared; no unexpected toggles in Y s.
Check: Output clock amplitude/format meets consumer requirements.
How: Validate LVCMOS/LVDS/HCSL at distribution outputs (fanout buffer ports).
Evidence: Scope capture on LMK1C1104 / 5PB1108 outputs (rise/fall & duty placeholders).
Pass: Meets VIH/VIL and edge-rate windows without overshoot beyond X.
Check: Thermal sensor sanity and placement correlation.
How: Compare TMP117AIDRVR (or MCP9808-E/MS) readings at room and warmed conditions.
Evidence: Temperature delta report vs reference thermometer (ΔT placeholders).
Pass: Offset within X °C; stable noise band within Y.
Check: Rail monitors detect brownout margins and transients.
How: Use INA226AIDGST or LTC2990 to verify undervoltage thresholds and transient capture.
Evidence: Min/Max voltage and current records during power-cycle test.
Pass: No rail dips beyond X mV; alarms trigger correctly within Y ms.
Check: “Profile + firmware” compatibility gate.
How: Enforce that config profile version matches firmware expected schema.
Evidence: Version tuple stored (FW vX + Profile vY + Schema vZ).
Pass: Mismatch is rejected; unit never ships with “unknown” profile.
Check: Switch/PHY link behavior is stable across EEE and negotiation settings.
How: Run a quick matrix test on negotiated modes relevant to deployment.
Evidence: Mode matrix record (speed/duplex/EEE flags) + recovered clock stability notes.
Pass: No negotiated mode degrades recovered-clock stability beyond X within Y seconds.
Check: Evidence pack is automatically attached to each unit record.
How: Store logs/plots/config hashes for traceability and RMA triage.
Evidence: Link to manufacturing DB record with downloadable artifacts.
Pass: Any shipped unit has a complete evidence bundle (no missing fields).
Check: Production screening catches outliers (jitter or drift tails).
How: Use a short statistical test (N units) with stable fixtures and fixed settings.
Evidence: Histogram + control limits (X/Y) + station calibration record.
Pass: Reject rule is defined; station repeatability verified to ≤ X.
Check: Field-update safety (profile rollback and safe defaults).
How: Support A/B profile slots; verify rollback on lock-loss storm.
Evidence: Update log with “before/after” profile IDs and recovery success criteria.
Pass: Update never bricks timing; rollback completes within X minutes.
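The A/B-slot rollback decision can be expressed as a tiny pure function, which makes the "rollback on lock-loss storm" behavior unit-testable before it ever runs on hardware. The storm threshold is a placeholder for the site-specific budget:

```python
def choose_profile_slot(active: str, lockloss_last_hour: int,
                        storm_threshold: int = 5):
    """Decide whether to roll back to the other A/B profile slot.
    Returns (slot_to_use, rolled_back). `storm_threshold` is a placeholder:
    more lock-loss events than this within the watch window triggers rollback."""
    if lockloss_last_hour > storm_threshold:
        other = "B" if active == "A" else "A"
        return other, True
    return active, False
```

Keeping the decision pure (counters in, slot out) means the update log's "before/after profile IDs" can be replayed against this function during RMA triage.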
[Diagram placeholder — gate checklist funnel: Design Gate (budget + parts, selector policy, probe plan) → Bring-up Gate (lock behavior, holdover drift, switch events) → Production Gate (golden config, fast screening, traceability) → Evidence Bundle (logs, plots, config hashes).]
Diagram: a gate-driven workflow that forces SyncE readiness to be proven by artifacts, not intuition.

H2-12 · Applications & IC Selection (Keep It Near the End)

How to use this section
Map each deployment to a target (jitter mask pressure, required holdover time, switching stability), then pick a minimal functional stack: recovered-clock PHY → DPLL/jitter filter → oscillator → fanout → telemetry. Part numbers below are reference examples to anchor selection logic.
Applications (scenario → constraints → recommended stack)
5G Backhaul
Jitter mask pressure + fast fault recovery
Focus: strong filtering without slow relock; stable selection under marginal links; holdover that survives expected outage windows.
Reference stack examples:
  • Recovered-clock PHY: TI DP83867IR or Microchip KSZ9131RNX
  • DPLL / network synchronizer: Renesas 82P33714 or Microchip ZL30722
  • Jitter cleaner (if split-stage): Silicon Labs Si5341A
  • Oscillator for holdover: Microchip OX-2211-EAE-3091-10M000 or OX-049
  • Fanout / distribution: TI CDCM6208 + TI LMK1C1104
  • Telemetry: TI TMP117AIDRVR + TI INA226AIDGST
Power-Grid Sync
Long-term stability + conservative switching
Focus: predictable holdover drift (temperature/aging/stress), strong anti-flap policy, and complete event logs for compliance and maintenance.
Reference stack examples:
  • Recovered-clock capable nodes: Microchip switch KSZ8567 (recovered clock support) + PHY KSZ9131RNX
  • DPLL / network synchronizer: Microchip ZL30722 or Renesas 82P33714
  • Oscillator emphasis: Abracon OCXO AOCTQ5-X-10.000MHZ-I3-SW or Microchip OX-2211-EAE-3091-10M000
  • Distribution buffer: Renesas 5PB1108 (OE control + low additive jitter)
  • Forensics monitors: ADI LTC2990 + Microchip MCP9808-E/MS
Industrial Gateway
Noisy rails + thermal gradients + mixed clock consumers
Focus: noise hygiene (rail isolation), stable link behavior, and measurement taps designed-in from day one.
Reference stack examples:
  • Recovered-clock PHY: TI DP83867IR or Microchip KSZ9131RNX
  • Jitter cleaning (compact): Silicon Labs Si5341A
  • TCXO option: SiTime SiT5356 or Abracon AST3TQ-T-30.720MHZ-28
  • Clock fanout: TI LMK1C1104 or Renesas 5PB1108
  • Telemetry: TI INA226AIDGSR + TI TMP117AIDRVR
IC Selection (what to look for + example parts)
1) PHY with recovered-clock usability
Selection hooks: recovered clock output mode, stability under link events, deterministic clock output selection (local vs recovered), clean pin/format to DPLL.
Example part numbers: TI DP83867IR (SyncE-capable clock output); Microchip KSZ9131RNX (recovered 125 MHz selection for SyncE).
2) Network synchronizer / DPLL (the “brain”)
Selection hooks: input count and priorities, bandwidth range, hitless switching support, holdover modes, alarm reporting, profile storage, and boot determinism.
Example part numbers: Renesas 82P33714; Microchip ZL30722.
3) Jitter attenuator / clock cleaner (when split-stage)
Selection hooks: deep jitter attenuation vs lock time, clean output formats, stable loss-of-lock handling, and repeatable profiles for manufacturing.
Example part numbers: Silicon Labs Si5341A; TI CDCM6208 (clock generator + jitter cleaner).
4) Oscillator for holdover (TCXO vs OCXO)
Selection hooks: frequency stability over temperature, aging, sensitivity to airflow/vibration, warm-up time (OCXO), and supply-noise sensitivity.
Example part numbers: SiTime SiT5356; Abracon AST3TQ-10.000MHZ-2; Microchip OX-049; Abracon OCXO AOCTQ5-X-10.000MHZ-I3-SW; Connor-Winfield DOCAT020V-010.0M.
5) Fanout + telemetry (distribution and forensics)
Selection hooks: additive jitter, clean OE behavior, per-rail monitoring, and temperature correlation near oscillator/DPLL.
Example part numbers: TI LMK1C1104; Renesas 5PB1108; TI TMP117AIDRVR; TI INA226AIDGST; ADI LTC2990.
[Diagram placeholder — solution stacks by scenario, must vs optional blocks: Backhaul (recovered-clock PHY, DPLL/selector, jitter cleaner, OCXO/TCXO, telemetry); Power-Grid (recovered clock path, DPLL/policy, OCXO holdover, fanout buffer, forensics logs); Gateway (recovered-clock PHY, clock cleaner, TCXO option, noise hygiene, telemetry).]
Diagram: scenario-driven minimal stacks. “Must” blocks build basic SyncE integrity; “Optional” blocks help when budgets are tight or environments are noisy.


H2-13 · FAQs (SyncE) — Troubleshooting Close-Out

Scope and format
These FAQs close out long-tail field issues strictly within the SyncE timing chain. Each answer is fixed to four lines: Likely cause / Quick check / Fix / Pass criteria (X/Y/Z placeholders).
Recovered clock looks clean, but downstream wander is worse — filter bandwidth too wide or wrong tap point?
Likely cause: The “clean” measurement is taken at the wrong tap (Tap1 = recovered clock), while Tap2 (filtered) or Tap3 (distributed) is contaminated by the clock tree, or the filter passband admits slow drift.
Quick check: Compare Tap1/Tap2/Tap3 in the same window; log tap_point, profile_id, lock_status, rail_ripple_mvpp, temp_near_xo.
Fix: Move the measurement and control point to the intended node; tighten/retune DPLL bandwidth and isolate noisy distribution branches.
Pass criteria: Downstream wander (Tap3) ≤ X (unit) over Y (time) with Z (profile and load conditions) and no unexplained tap divergence.
Works on bench, fails in the field — EEE/link flap coupling into the timing chain?
Likely cause: Real network link events (EEE enter/exit, renegotiation, brief link-down) modulate the recovered clock or trigger repeated state changes upstream.
Quick check: Time-align link_event (eee_enter/exit, link_up/down, reneg) with lock_status and selector_state; count event bursts per hour.
Fix: Reduce coupling by disabling/tuning EEE where required for sync stability, and add hold-off/cooldown so brief link glitches do not propagate into selection or DPLL behavior.
Pass criteria: Under field link profile, loss-of-lock events ≤ X/day and recovered/filtered outputs remain within Y (mask) over Z (time).
After reference switch, phase “jumps” — hitless conditions not met or selector cooldown missing?
Likely cause: The new reference was not “hitless-ready” (not locked/stable), or rapid re-switching occurred due to missing cooldown/hold-off policy.
Quick check: For each switch, log reason_code, selector_state, cooldown_timer, and input lock/valid flags; measure phase step at Tap2/Tap3 around the event.
Fix: Enforce hitless preconditions (valid+stable window) and add cooldown/min-dwell; use non-revertive or delayed revertive policy where appropriate.
Pass criteria: Phase step ≤ X (unit) per switch; switch rate ≤ Y/hour over Z (field conditions) with no double-switch inside cooldown.
Holdover meets spec at room temp, fails in cabinet — thermal gradient / airflow / stress drift?
Likely cause: Cabinet thermal gradients, fan airflow, or PCB stress shifts the oscillator’s effective temperature/stability beyond the bench assumption.
Quick check: Log temp_near_xo and temp_near_dpll during holdover; repeat with airflow blocked/redirected and cable/fastener strain relieved; compare drift slope.
Fix: Improve thermal placement and shielding, reduce stress coupling, upgrade oscillator class if required, and tune holdover model/limits for the cabinet profile.
Pass criteria: Holdover drift ≤ X (ppb/ppm) over Y (time) across Z (ΔT and airflow profile) without lock-loss storms.
QL claims “best”, yet network becomes unstable — QL loop or false-best selection?
Likely cause: A QL loop forms (self-reinforcing selection), or selection policy interprets QL in a way that prefers an unstable path (“false-best”).
Quick check: Record ql_in_a, ql_in_b, ql_selected, priority, and reason_code; check for periodic flips correlated with QL changes.
Fix: Add loop-prevention rules (do-not-select self-fed sources), weight policy by stability (lock/alarms), and enforce minimum dwell plus hysteresis.
Pass criteria: QL-driven selection remains stable (≤ X switches/hour) over Y (time) with Z (defined QL transition vectors).
Lock time is too long after recovery — DPLL bandwidth too narrow or cascade too aggressive?
Likely cause: Over-tight bandwidth and/or multi-stage filtering increases acquisition time; cascade stages fight each other under realistic disturbances.
Quick check: Compare lock_time distribution across profile_id (A/B); log input validity and alarm history during acquisition; isolate which stage dominates.
Fix: Widen acquisition bandwidth (then tighten in steady state if supported), simplify cascade where possible, and align stage roles (one for fast, one for slow).
Pass criteria: P95 lock time ≤ X seconds and P99 ≤ Y seconds under Z (defined disturbance and temperature profile).
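The P95/P99 lock-time criterion needs an agreed percentile definition, or two stations will disagree on the same data. A minimal nearest-rank sketch (the sample values are illustrative):

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile (p in 0..100) over a list of lock times.
    Nearest-rank is deliberately conservative: it always returns an observed
    value, never an interpolated one."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

# Example: ten acquisition trials with one slow outlier.
lock_times = [0.8, 0.9, 1.0, 1.1, 1.2, 1.3, 1.5, 2.0, 2.4, 9.0]
```

Note how the 9.0 s outlier dominates P95 on a small sample; that is the point of specifying P95/P99 rather than the mean, which the outlier would merely nudge.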
Switching happens repeatedly every few minutes — hysteresis/hold-off window mis-set?
Likely cause: Thresholds lack hysteresis, hold-off/cooldown is too short, or revertive behavior keeps re-triggering under borderline quality.
Quick check: Compute switch_count_window and annotate each switch with reason_code, cooldown_timer, and QL/alarms; look for periodic patterns.
Fix: Add hysteresis, lengthen hold-off/cooldown, enforce minimum dwell, and adjust revertive policy (delay or disable) for marginal networks.
Pass criteria: Switching ≤ X/hour over Y hours with Z (known marginal input profile) and no “ping-pong” inside cooldown.
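The "switches per window" metric in the Quick check is a sliding-window count. A minimal monitor sketch (window length and limit are placeholders for the X/Y budget):

```python
from collections import deque

class SwitchRateMonitor:
    """Count reference switches inside a sliding time window and flag when
    the rate exceeds a limit. Window/limit values are placeholders."""

    def __init__(self, window_s=3600.0, max_switches=4):
        self.window_s = window_s
        self.max_switches = max_switches
        self._events = deque()  # timestamps of switches inside the window

    def record_switch(self, t: float) -> bool:
        """Register a switch at time t; return True if the limit is exceeded."""
        self._events.append(t)
        while self._events and self._events[0] <= t - self.window_s:
            self._events.popleft()
        return len(self._events) > self.max_switches
```

Annotating each counted switch with its reason_code (as the Quick check specifies) then turns an exceeded limit directly into the periodic-pattern evidence the Fix step needs.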
Only one port causes timing issues — marginal recovered clock (CDR sensitivity) or cable impairment?
Likely cause: That link’s recovered clock is noisier due to cable/partner impairments or PHY sensitivity; intermittent link behavior injects instability.
Quick check: A/B compare ports using identical conditions; log tap_point metrics per port, link_event, and lock/alarms; swap cable/partner to isolate.
Fix: Replace/shorten cable, standardize link mode settings for stability, and avoid using that port as a timing parent unless its recovered output is proven clean.
Pass criteria: Port-to-port recovered metrics differ by ≤ X (unit) over Y (time) with Z (same partner and mode), and no excess link events.
Timing degrades after EMC/ESD events — ground return / power noise coupling into clock tree?
Likely cause: ESD/EMC stress changes return paths or injects supply noise, modulating oscillator/DPLL behavior and contaminating distribution branches.
Quick check: Compare pre/post event: rail_ripple_mvpp, lock_status, reason_code, and temperature; inspect whether only Tap3 worsened (distribution coupling).
Fix: Harden power and return paths for clock domains, isolate noisy rails, improve decoupling near DPLL/XO, and ensure post-event re-init returns to known-good profiles.
Pass criteria: Post-stress metrics remain within X of baseline over Y (time) and recovery re-lock completes within Z without repeated lock-loss.
Two devices disagree on QL interpretation — ESMC config mismatch or parsing policy?
Likely cause: ESMC mapping/config differs, or devices apply different parsing/selection policies that map the same messaging to different QL outcomes.
Quick check: Export and compare policy_version, QL mapping tables, and observed ql_in_*/ql_selected under identical inputs; capture event timeline.
Fix: Normalize mapping and policy versions across nodes; ensure priorities and alarm masks are aligned; avoid ambiguous fallbacks for “unknown” QL states.
Pass criteria: Under the same input conditions, selected source and QL agree (match rate ≥ X%) over Y (time) with Z (defined transitions).
Holdover is fine, but re-lock causes overshoot — steering policy too aggressive?
Likely cause: The recovery path uses overly aggressive steering/limits, causing transient overshoot even though the holdover drift itself is within spec.
Quick check: Mark recovery phases (freeze/steer/lock) with reason_code or state labels; measure output deviation around ref return and correlate with profile_id.
Fix: Introduce staged recovery (freeze → gentle steer → lock), add rate/step limiting, and require a stable reference window before tightening bandwidth.
Pass criteria: Re-lock overshoot ≤ X (unit) and settles within Y (time) after ref return, across Z (temp and link conditions).
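The "gentle steer" stage of the Fix is a rate-limited correction: the output walks toward the reference without ever exceeding a per-update step. A minimal sketch (units and limits are placeholders; real DPLLs expose this as a phase-slope or frequency-step limit):

```python
def steer_step(current_ppb: float, target_ppb: float,
               max_step_ppb: float) -> float:
    """One rate-limited steering update: move the output frequency toward the
    reference, clamped to at most max_step_ppb per update, so re-lock cannot
    overshoot regardless of how far holdover drifted."""
    delta = target_ppb - current_ppb
    step = max(-max_step_ppb, min(max_step_ppb, delta))
    return current_ppb + step

# Example: holdover left the output 120 ppb away; steer back at <= 20 ppb/update.
f = 120.0
for _ in range(10):
    f = steer_step(f, 0.0, max_step_ppb=20.0)
```

Because the step is clamped symmetrically, the walk converges monotonically and never crosses the target, which is the overshoot property the Pass criterion measures.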
Field logs insufficient to debug — missing event timestamps, temperature, or selector state history?
Likely cause: The evidence bundle lacks correlation keys (event time, state, reason), so symptoms cannot be tied to selector/DPLL transitions or environmental changes.
Quick check: Verify that every event includes t_event, selector_state, reason_code, profile_id, ql_selected, temp_near_xo, rail_ripple_mvpp, and a ring-buffer watermark.
Fix: Implement a minimal “diagnostic bundle” schema, log on transitions (lock/holdover/switch), and store enough history to cover the longest expected instability window.
Pass criteria: Any lock-loss/switch episode can be explained within X minutes using only logs collected over Y hours with Z required fields present.
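The Quick check above is mechanical: verify that each record carries the required correlation keys. A minimal validator sketch, using the field names this page defines (they are a convention of this document, not a standard schema):

```python
# Required correlation keys, as listed in this page's forensics vocabulary.
REQUIRED_FIELDS = {
    "t_event", "selector_state", "reason_code", "profile_id",
    "ql_selected", "temp_near_xo", "rail_ripple_mvpp",
}

def missing_fields(record: dict) -> set:
    """Return the set of required forensics keys absent from a log record.
    An empty set means the record is complete for lock-loss triage."""
    return REQUIRED_FIELDS - record.keys()
```

Running this over the whole log during bring-up catches schema drift early, so a field gap is a build failure rather than a field mystery.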