White-Rabbit-style timing achieves sub-nanosecond alignment by combining two-way delay measurement, asymmetry/fixed-delay calibration, and distributed frequency lock.
The practical outcome is a measurable, maintainable time base: calibrate what two-way cannot cancel, lock frequency before phase, and verify with per-hop budgets, monitoring, and field-service evidence.
H2-1 · Definition & Scope (WR-style, not “generic PTP/SyncE”)
Intent
Establish a clear and testable definition of “White-Rabbit-style timing,” explain why it can reach sub-nanosecond synchronization,
and lock the page boundary so that PTP/SyncE/TSN details do not leak into this topic.
30-second definition (engineering view)
White-Rabbit-style timing is an end-to-end control system that combines
two-way delay measurement,
link asymmetry calibration, and
distributed frequency locking
to keep time/phase aligned across a network link.
Sub-ns performance is achieved by closing the loop on what cannot be cancelled by simple round-trip math:
fixed device delays, asymmetry drift, and frequency wander.
The deliverable of this page is a measurable path from architecture → calibration → verification,
rather than protocol message-field walkthroughs.
Controls: loop bandwidth; filter design; holdover policy on link loss.
Outputs: frequency alignment; reduced wander; controlled relock behavior.
Failure modes: bandwidth too wide amplifies noise; too narrow slows response; fast drift in holdover.
Pass criteria: wander ≤ X and holdover drift ≤ Y over Z seconds (placeholder).
What this page will answer (and where)
How accurate can it get? → H2-2 (metrics) + later verification chapter (field acceptance).
What hardware is required? → hardware building blocks chapter (clock/timestamp/phase path).
How is asymmetry calibrated? → calibration chapter (tables, binding, temperature compensation).
How is it verified and monitored? → verification/monitoring chapter (KPIs, logs, failure isolation).
How should engineering decisions be made? → engineering checklist chapter (design → bring-up → production).
Stop line (not covered here)
Protocol message formats, standard clauses, and TSN scheduling tables are intentionally excluded here.
This page focuses on the closed-loop timing architecture, calibration, and measurable verification.
PoE/PoDL power path and surge protection design → PoE / Protection
Diagram · Scope map (what WR-style owns vs link-outs)
The scope map prevents cross-page overlap: WR-style timing is treated as a closed-loop control system; PTP/SyncE/TSN/PoE/Protection are referenced only as dependencies.
H2-2 · Sub-ns Requirements & Success Criteria
Intent
Convert “sub-nanosecond” from a slogan into measurable KPIs, consistent definitions, and acceptance criteria that can be applied across lab, rack, and outdoor deployments.
Engineering definitions (no theory, only what can be verified)
Accuracy: long-window mean time offset versus the reference.
Precision: short-window spread (jitter) of time offset under steady conditions.
Stability: how time offset changes with time and environment (wander, temperature drift, power events).
Mapping to the 3 loops:
two-way delay primarily improves precision;
asymmetry calibration sets accuracy;
distributed frequency lock and holdover sustain stability.
KPI stack (each KPI = what it means → how to observe → common traps → pass criteria)
KPI 1 · Time offset
Meaning: instantaneous timing difference between a node and the reference.
Observe: offset time-series from hardware timestamp events (consistent window definition).
Trap: mismatched denominators or windowing produce “better” plots that are not comparable.
Pass criteria: |offset| ≤ X (p95) over Y minutes (placeholder).
KPI 2 · Time wander (low-frequency drift)
Meaning: slow changes in offset due to temperature, frequency error, or control-loop design.
Observe: offset slope over long windows; compare before/after environmental steps.
Trap: aggressive filtering hides wander while real systems still fail deterministic triggers.
Pass criteria: |drift| ≤ X ns/min under defined ΔT and power events (placeholder).
KPI 3 · Phase noise / jitter (system-level)
Meaning: short-term random variation that limits precision and trigger repeatability.
Observe: phase-error distribution or equivalent short-window statistics from the timing loop.
Trap: measurement settings or logging resolution create “too clean” plots that miss real jitter.
Pass criteria: jitter ≤ X (p95) in the target bandwidth (placeholder).
KPI 4 · Holdover drift (during link loss)
Meaning: offset drift when the link is unavailable and the node must ride its local clock.
Observe: offset trajectory in holdover state, tagged with temperature and power conditions.
Trap: mixing “relock transient” samples with holdover samples breaks comparability.
Pass criteria: drift ≤ X over Y seconds of holdover (placeholder).
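The KPI stack above depends on fixed window and percentile definitions, so a small reference calculation helps keep tools comparable. The following is a minimal sketch, assuming offsets are logged in nanoseconds with sample times in seconds; the function names and the nearest-rank percentile convention are illustrative choices, not something this page prescribes.

```python
import math
from statistics import mean


def percentile(values, pct):
    """Nearest-rank percentile (pct in 0..100) of a non-empty sequence."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def offset_p95_ns(offsets_ns):
    """p95 of |offset| over one measurement window (KPI 1)."""
    return percentile([abs(x) for x in offsets_ns], 95)


def drift_slope_ns_per_min(times_s, offsets_ns):
    """Least-squares slope of offset versus time, in ns/min (KPI 2)."""
    t_bar, o_bar = mean(times_s), mean(offsets_ns)
    num = sum((t - t_bar) * (o - o_bar) for t, o in zip(times_s, offsets_ns))
    den = sum((t - t_bar) ** 2 for t in times_s)
    return 60.0 * num / den
```

Whatever percentile convention is chosen, it should be recorded with the window metadata; two tools using different conventions will report different p95 values from the same samples.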
Scenario layers (dominant error sources → first KPI to check)
Dominant sources: thermal gradients, airflow changes, power noise coupling, EMI-induced relock events.
First KPI: stability (wander slope) and relock step behavior.
Layer 3 · Outdoor / Long fiber
Dominant sources: medium delay temperature coefficient, asymmetry drift, maintenance and module replacement.
First KPI: accuracy (bias) + temperature-correlated drift (offset vs ΔT).
This ladder forces measurable definitions: a sub-ns goal becomes a KPI stack and explicit acceptance criteria with consistent windows, percentiles, and state tagging.
Explain why timing accuracy degrades when the link gets longer, temperature changes, or optics/PHY modules are swapped.
Build a practical delay budget that separates cancelable terms from terms that require calibration and compensation.
Validate temperature tagging and compensation slope (offset vs ΔT).
Pass criteria (placeholder):
under defined load and ΔT, offset distribution remains within X (p95) and drift slope within Y ns/min.
Diagram · Delay budget block (cancel vs calibrate vs drift risk)
The delay budget separates terms that two-way measurement cancels (symmetric delay), terms that require calibration (stable bias), and terms that calibration cannot correct (non-deterministic queueing).
Assemble a practical end-to-end system: define the roles, separate the clock path from the packet/time-correction path,
and set clear hardware boundaries for timestamps, phase measurement, and servo loops.
Roles and responsibilities (architecture view)
Time source (grandmaster-like)
Provides the reference timebase and a stable frequency anchor.
Exposes lock/holdover state for system-level acceptance decisions.
Defines the reference epoch used by endpoints (architecture-level agreement).
WR-capable switch / bridge
Maintains a deterministic datapath for timing traffic (controls queue noise).
Supports hardware timestamp and/or phase measurement primitives where required.
Preserves a consistent timing model across multi-hop deployments.
End node (synchronized endpoint)
Runs the frequency and phase/time servo loops against the reference.
Applies fixed-delay and asymmetry calibration tables (with temperature tagging).
Exports measurable KPIs (offset, wander, drift, lock state) for monitoring and field service.
Two parallel paths (clock distribution vs time/phase correction)
Clock path (thick line)
Goal: distribute and lock frequency (syntonization) to suppress wander.
Sensitive to: jitter injection, PLL bandwidth choices, and holdover behavior.
Observable via: lock state, frequency error, holdover drift metrics.
Packet/time path (thin line)
Goal: measure two-way delay and apply calibrated asymmetry compensation.
Sensitive to: timestamp tap definition, queue noise, and multi-hop forwarding determinism.
Observable via: delay jitter distribution and offset percentile stability.
Hardware boundaries (must be hardware-backed to reach sub-ns)
Boundary 1 · Hardware timestamp (fixed tap)
Timestamp events must be generated at a deterministic physical tap point with consistent TX/RX definitions.
Failure symptom: “stable but biased” offset; repeatable steps after relock.
Boundary 2 · Phase/frequency measurement
Phase/frequency error must be measured with resolution aligned to the target, feeding the servo loops.
Failure symptom: wander dominates; deterministic triggering fails despite “lock.”
Timing traffic must avoid queue-dominated latency; calibration cannot correct non-deterministic jitter.
Failure symptom: offset distribution widens under load; p95/p99 explode.
Pass criteria (placeholder):
across defined load and temperature bands, the system remains locked with bounded p95 offset and bounded drift slope.
Diagram · End-to-end architecture (clock path vs packet/time path)
The thick clock path supports distributed frequency lock and holdover; the thin bidirectional path supports two-way delay measurement and calibrated asymmetry compensation.
H2-5 · Bi-directional Delay Measurement Workflow (Two-way time transfer)
Intent
Clarify how two-way measurement uses four timestamps to estimate one-way delay and offset, and how to keep switch/queue noise
from polluting sub-ns results.
Workflow overview (inputs → outputs → hard preconditions)
Inputs: four timestamp events (t1, t2, t3, t4) captured at deterministic tap points.
Outputs: round-trip delay (RTT), one-way delay estimate, and offset estimate used by the servo loops.
Hard preconditions: fixed tap definitions, deterministic datapath for timing traffic, and bounded measurement jitter.
Acceptance gate (placeholder): offset distribution remains within X (p95) under defined load; measurement jitter does not scale with throughput.
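The workflow above can be sketched with the standard two-way arithmetic. This is a minimal illustration, assuming integer picosecond timestamps and a single calibrated asymmetry term defined as forward delay minus reverse delay; the class and function names are hypothetical, and fixed TX/RX device delays are assumed to be already folded into the timestamps.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class TwoWaySample:
    """Four hardware timestamps in picoseconds."""
    t1: int  # reference sends request (reference clock)
    t2: int  # node receives request (node clock)
    t3: int  # node sends reply (node clock)
    t4: int  # reference receives reply (reference clock)


def round_trip_ps(s: TwoWaySample) -> int:
    """Round-trip delay with the node turnaround time (t3 - t2) removed."""
    return (s.t4 - s.t1) - (s.t3 - s.t2)


def estimate(s: TwoWaySample, asymmetry_ps: int = 0) -> Tuple[float, float]:
    """Return (forward_delay_ps, offset_ps).

    asymmetry_ps is the calibrated (forward - reverse) delay difference.
    With asymmetry_ps = 0 this is the plain symmetric-link estimate.
    """
    forward_delay = (round_trip_ps(s) + asymmetry_ps) / 2
    offset = (s.t2 - s.t1) - forward_delay  # node clock minus reference clock
    return forward_delay, offset


# Example: true forward delay 5000 ps, reverse 4800 ps, node ahead by 300 ps.
sample = TwoWaySample(t1=0, t2=5300, t3=6300, t4=10800)
symmetric = estimate(sample)        # offset is biased by half the asymmetry
calibrated = estimate(sample, 200)  # asymmetry term removes the bias
```

In this example the uncalibrated estimate is wrong by 100 ps, which is half of the 200 ps asymmetry. Round-trip math cannot see that term, so it has to come from calibration.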
Timestamp tap consistency (non-negotiable quality gate)
Rule 1 · Fixed tap definition
Timestamps must be taken at a deterministic physical boundary (MAC/PHY/SerDes boundary) with identical TX/RX tap semantics.
Rule 2 · Tap repeatability
After reset, relock, or retrain, the tap-to-wire latency must not jump between discrete “modes.”
Rule 3 · Tap-path alignment
Timestamp generation must follow the same datapath characteristics as the payload path; avoid “fast timestamp / slow payload” splits.
Fast validation (placeholder): under controlled traffic, offset width stays within X and remains insensitive to throughput steps.
The timeline shows the four timestamp events and the two-way exchange; correctness depends on tap consistency and isolation from queue-dominated jitter.
H2-6 · Calibration: Fixed Delays, Link Asymmetry & Temperature Compensation
Intent
Make sub-ns possible by defining what must be calibrated, how calibration is executed (factory / field / in-service),
and how calibration tables remain valid across temperature and module swaps.
The pipeline emphasizes table binding, temperature-aware compensation, and trigger-driven recalibration to preserve sub-ns correctness over time.
H2-7 · Distributed Frequency Lock (Syntonization) & Servo Loop Design
Intent
Explain why frequency must be locked first to prevent wander from dominating sub-ns timing, and how nested servo loops
balance noise, response speed, and holdover stability without control-theory derivations.
Frequency lock vs phase alignment (what each fixes)
Frequency lock (syntony): suppresses long-term slope (drift) so offset does not grow into wander.
Phase/time alignment: corrects instantaneous residual offset after the slope is controlled.
Engineering symptom: offset ramps linearly → treat it as a frequency/temperature/holdover problem first, not as a phase-only tuning issue.
Acceptance gate (placeholder): drift slope ≤ X over Y minutes before tightening phase-loop residual targets.
Servo structure (outer phase/time loop + inner frequency loop)
Inner loop · Frequency
Measure: frequency error estimate from timing measurements and local clock observables.
The nested structure keeps long-term slope under control (inner loop) while driving residual phase/time error down (outer loop).
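The effect of removing slope before chasing residual phase can be seen in a toy simulation. The sketch below is a generic discrete PI servo, not the specific nested loop of any WR implementation: the integral term stands in for the frequency correction and the proportional term for the phase/time correction. The gains, the one-second update interval, and the function name are assumptions chosen only so the example converges.

```python
def run_servo(freq_error_ppb, initial_offset_ns, steps, kp=0.3, ki=0.1, dt_s=1.0):
    """Simulate a PI servo steering a local clock; returns offset history in ns.

    1 ppb of frequency error moves the offset by 1 ns per second.
    """
    offset_ns = initial_offset_ns
    integral_ppb = 0.0
    history = []
    for _ in range(steps):
        history.append(offset_ns)
        # Integral term accumulates the steady frequency correction.
        integral_ppb += ki * offset_ns
        # Proportional term pulls down the residual phase/time error.
        correction_ppb = kp * offset_ns + integral_ppb
        # Offset evolves with the frequency error that is not yet steered out.
        offset_ns += (freq_error_ppb - correction_ppb) * dt_s
    return history


with_frequency_term = run_servo(50.0, 100.0, 200)
phase_only = run_servo(50.0, 100.0, 200, ki=0.0)
```

With the integral term enabled the offset settles toward zero. With `ki=0.0` the loop settles at a constant residual of `freq_error_ppb / kp`, which is the kind of steady error that more phase-loop tuning alone does not remove.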
H2-8 · Hardware Building Blocks (Clock Tree / Phase Measurement / Timestamp Path)
Intent
Provide a hardware-first checklist for sub-ns timing: required clock-tree elements, phase measurement capability beyond basic
timestamping, and deterministic timestamp paths with repeatable latency across resets, training, and load changes.
Clock tree essentials (reference → discipline → distribute → consume)
Reference source
Stable reference input with defined noise/jitter envelope and known temperature behavior.
PLL / DPLL discipline
The actuation point for syntony: resolution, tuning range, and stability determine drift slope and holdover behavior.
Distribution & consumers
Fanout to PHY/FPGA/SoC and to timestamp/phase blocks; manage noise coupling and preserve repeatable latency.
Selection hooks (placeholder): phase-noise/jitter budget, tuning resolution, and temperature sensitivity aligned to sub-ns KPIs.
At sub-ns targets, the effective resolution and jitter of ordinary timestamp capture can become the bottleneck unless finer-grain
phase observability exists.
Engineering requirements
Phase observables must share the same timebase as timestamp capture.
Outputs must be calibratable and temperature-aware (residual error tracked).
Repeatability matters more than raw “spec” resolution.
Acceptance gate (placeholder): phase/timestamp residual distribution stays within X in steady windows and remains stable across resets.
Conservative ramping policies for reacquire stability.
Acceptance gate (placeholder): hardware provides deterministic capture and actuation; software policies keep windows stable and enforce table binding.
Diagram · Clock tree + timestamp datapath (with domain crossing)
The upper half shows clock discipline and distribution; the lower half shows timestamp capture and transport through CDC/FIFO.
Sub-ns correctness depends on deterministic boundaries and repeatable latency across operating states.
H2-9 · Network Topologies & Redundancy (and how to coexist with TSN)
Intent
Translate deployment reality into measurable design rules: topology choice, redundancy behavior, and TSN coexistence boundaries.
TSN handles deterministic forwarding windows; WR-style timing maintains the timebase (no TSN GCL details here).
Topology selection (control vs calibration complexity)
Tree / Star
Paths are stable and easy to segment into per-link budgets.
Calibration table lifecycle is simpler (clear bindings and fewer path flips).
Fault isolation is faster (branch-local diagnosis).
Ring
Strong redundancy, but path flips can invalidate timing assumptions.
Failover must define when to trust new asymmetry estimates.
Reacquire policies must prevent thrash on marginal links.
Multi-segment chains
Each added segment adds fixed delay + temperature drift + asymmetry terms.
Segment-level calibration and binding become mandatory, not optional.
End-to-end verification must include path-change scenarios.
Acceptance gate (placeholder): steady-topology offset p95 ≤ X; after path change, reacquire ≤ Y;
peak transient ≤ Z.
Redundancy strategies (paths, sources, and failover triggers)
Dual path
Define main vs backup timing overlay paths.
On switch, enter a protection window before trusting new offsets.
Bind calibration tables to path identity (avoid stale compensation).
Dual time source
Trigger on lock-status loss, drift slope breach, or offset instability.
Use hysteresis to prevent rapid flapping between sources.
After switch, ramp corrections to avoid overshoot.
Offset p95 breach over a defined window (not a single sample).
Acceptance gate (placeholder): failover triggers are stable (no thrash);
source/path switch returns to lock within X; asymmetry stays bounded within Y.
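The hysteresis and "window, not single sample" rules above can be sketched as a small trigger. This is an illustrative sketch only: the class name, the consecutive-window counting scheme, and the default counts are assumptions, and the input is expected to be a per-window p95 value computed elsewhere.

```python
class BreachTrigger:
    """Trips after `trip_count` consecutive breached windows and
    clears only after `clear_count` consecutive clean windows."""

    def __init__(self, threshold_ns, trip_count=3, clear_count=5):
        self.threshold_ns = threshold_ns
        self.trip_count = trip_count
        self.clear_count = clear_count
        self.tripped = False
        self._breaches = 0
        self._clean = 0

    def update(self, window_p95_ns):
        """Feed one window's p95 offset; returns the current trigger state."""
        if window_p95_ns > self.threshold_ns:
            self._breaches += 1
            self._clean = 0
        else:
            self._clean += 1
            self._breaches = 0
        if not self.tripped and self._breaches >= self.trip_count:
            self.tripped = True
        elif self.tripped and self._clean >= self.clear_count:
            self.tripped = False
        return self.tripped


trigger = BreachTrigger(threshold_ns=1.0, trip_count=3, clear_count=2)
states = [trigger.update(v) for v in [2.0, 0.5, 2.0, 2.0, 2.0, 0.5, 0.5]]
```

A single breached window followed by a clean one does not trip the trigger, and a single clean window does not clear it. The asymmetric counts are what prevent flapping on marginal links.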
Coexistence with TSN (strict boundary, no GCL detail)
Responsibility split
TSN: deterministic forwarding windows and controlled queuing behavior.
Timing measurement windows must avoid queue-dominated latency regimes.
Timestamp paths must remain deterministic across TSN load patterns.
Path changes must be treated as calibration-binding events.
Acceptance gate (placeholder): under TSN high-load windows, offset p95 degradation ≤ X and no periodic wander amplification.
Common deployment pitfalls (avoid silent loss of sub-ns)
Path changes without recalibration binding (stale tables applied to new routes).
Failover triggers based on single samples (thrash and oscillation).
“Stable but biased” offsets due to measuring at congested/edge windows.
Using averages only; ignoring p95/peak/reacquire behaviors.
Diagram · Topology map (line · star · ring with timing overlay)
Use the same timing-overlay lens across line, star, and ring: path changes are calibration-binding events; redundancy must be engineered
to avoid thrash and biased “stable” offsets.
H2-10 · Verification, Monitoring & Field Service (make it measurable)
Intent
Turn “it works” into measurable acceptance, diagnosis, and reproducibility: verification ladder, minimal KPI set, black-box evidence,
and field-service mechanisms (loopback/self-test/remote update and rollback) without management-protocol details.
Verification ladder (bench → system → environmental)
Bench
Establish noise floor and baseline lock/reacquire behaviors.
Validate timestamp path determinism under controlled traffic.
Record baseline distributions (p50/p95/peak), not just averages.
System
Stress under realistic switching and throughput steps.
Detect “stable but biased” offsets caused by queuing regimes.
Verify redundancy events (path/source switch) and recovery gates.
Environmental
Temperature sweeps/steps, power disturbances, and mechanical vibration scenarios.
Measure wander slope, holdover drift, and reacquire stability across ΔT.
Confirm calibration binding rules remain correct under component swaps.
Event type: link flap, failover, lock change, calibration update.
Event timestamp + state snapshot timestamp (explicit).
Environment: temperature, supply, airflow/fan state (as available).
Window metadata: start/end, denominators, and thresholds in force.
Why it matters
Without event-aligned evidence, a clean-looking offset plot cannot be reproduced or used to isolate path changes,
table invalidation, or measurement-window artifacts.
Acceptance gate (placeholder): every anomaly includes a matching event record and state snapshot; evidence bundle supports replay-style analysis.
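One possible shape for such an event record is sketched below. All field names are hypothetical and only mirror the evidence items listed above (event type, explicit event and snapshot timestamps, environment, window metadata); a real system would define its own schema and units.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class TimingEvent:
    event_type: str        # "link_flap", "failover", "lock_change", "calibration_update"
    event_time_ns: int     # when the event occurred
    snapshot_time_ns: int  # when the state snapshot was taken (kept explicit)
    window_start_ns: int
    window_end_ns: int
    sample_count: int      # denominator behind the window statistics
    threshold_ns: float    # threshold in force for this window
    temperature_c: Optional[float] = None  # environment fields are "as available"
    supply_v: Optional[float] = None
    fan_state: Optional[str] = None


def to_log_line(event: TimingEvent) -> str:
    """Serialize one record as a single JSON line with stable key order."""
    return json.dumps(asdict(event), sort_keys=True)


def snapshot_lag_ns(event: TimingEvent) -> int:
    """Delay between the event and its state snapshot."""
    return event.snapshot_time_ns - event.event_time_ns
```

Keeping the snapshot timestamp separate from the event timestamp makes it possible to tell later whether a state snapshot describes the system before or after a transition.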
Field service mechanisms (loopback · self-test · remote update & rollback)
Loopback / self-test
Fast split: link path vs timebase vs calibration binding issues.
Run with known windows and export KPI snapshots and event markers.
Remote update & rollback
Version binding: calibration tables and servo policies must match firmware versions.
Rollback must restore prior known-good behavior with recorded gates.
Change log must capture timing-relevant parameter diffs.
Acceptance gate (placeholder): self-test returns KPIs to thresholds within X; rollback restores reacquire ≤ Y and offset p95 ≤ Z.
A reproducible flow ties defined measurement windows to event markers and an evidence bundle. Acceptance is based on distributions
and state-aligned gates, not on averages.
Purpose: convert “sub-ns timing” into auditable gates. Each gate is defined by measurable checks, evidence fields, and pass criteria placeholders (X/Y/Z) to prevent silent drift and non-repeatable latency.
Gate A · Design
Lock the architecture first
Goal
Freeze the clock tree, timestamp tap points, and “repeatable latency boundaries” so that later calibration/servo work is not forced to compensate for unstable hardware paths.
Checklist (tick-box actions)
Clock tree contract: define reference source → jitter cleaner/DPLL → distribution → PHY/FPGA/SoC domains; mark which nodes are “must-follow” vs “free-run”.
Timestamp tap invariance: ensure the timestamp capture point does not change across firmware paths, offloads, or switch forwarding modes.
Deterministic latency budget: identify every FIFO/CDC/queue that can introduce non-determinism; require fixed-depth or bounded behavior.
Thermal drift entry: reserve sensors/telemetry fields (temperature, supply, airflow states) and define where drift coefficients will be stored.
Redundancy boundary: define which links/modules invalidate calibration (e.g., swapping SFP/PHY), and what the safe degrade mode is.
Evidence to capture
Clock-tree block diagram version + net names + domain IDs.
Notes: choose temperature grade, package, and reference frequency (10 MHz / 25 MHz / 125 MHz) to match the servo bandwidth and distribution constraints.
Gate B · Bring-up
Make it stable under stress
Goal
Prove calibration + two-way measurement + servo locks remain measurable and repeatable across load, temperature steps, and link events (drop/reacquire/failover).
Load-step isolation: run low→high traffic; confirm queueing noise is excluded from the measurement window or bounded by filtering rules.
Lock/relock state machine: enforce holdover entry/exit criteria; prevent oscillation (thrash) when link flaps.
Injected disturbances: temperature step (T1→T2), supply ripple step, link down/up, redundant path switch; record event-stamped KPI traces.
Window + denominator consistency: ensure offset/wander metrics keep the same time window and denominator across tools and builds.
Evidence to capture
KPI streams: offset, wander, lock status, asymmetry estimate, holdover state (all with timestamps).
Event markers: load-step, temperature-step, link flap, failover switch, firmware revision.
Calibration snapshots: table version/hash before/after, plus applied coefficients.
Pass criteria (placeholders)
Reacquire time (link down→stable lock) ≤ X s.
Offset peak during load step ≤ Y ns (measured over window W).
Asymmetry estimate jump after path switch ≤ Z ns (otherwise force recalibration).
Example parts (bring-up enablers)
Microchip KSZ9477 (PTP-capable switch for test topologies)
Microchip LAN8840 (PHY timestamp signals + GPIO)
TI TMP117AIDRVR (temp step correlation)
Renesas 8A34001 (DCO/DPLL modes for servo experiments)
Winbond W25Q128JV (black-box traces + rollbacks)
Gate C · Production
Make it scalable and traceable
Goal
Ensure unit-to-unit consistency by binding calibration data to hardware identity and software revisions, with fast production tests that catch drift-sensitive failure modes.
Checklist (manufacturing control)
Calibration table governance: version, timestamp, units, and validity rules; store a hash and protect against mismatch.
Identity binding: bind module serial + port/path ID + firmware build ID + calibration hash.
Fast tests that matter: timestamp determinism quick-test, short holdover drift test, reacquire test, and a 2-point thermal spot-check.
Sampling plan: define lot sampling and escalation rules (rework/stop-ship) when drift-sensitive metrics shift.
Field forensics readiness: black-box logs must include event + KPI + environment; support rollback and “known-good” calibration restore.
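The identity-binding and hash-protection items in this checklist can be sketched as follows. This is a minimal illustration, assuming the calibration table is a JSON-serializable dictionary and that SHA-256 over a canonical JSON encoding is an acceptable hash; the field and function names are hypothetical.

```python
import hashlib
import json


def table_hash(table: dict) -> str:
    """SHA-256 over a canonical JSON encoding of the calibration table."""
    canonical = json.dumps(table, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def make_binding(module_serial: str, path_id: str, firmware_build: str, table: dict) -> dict:
    """Bind a calibration table to the hardware and software identity it was measured on."""
    return {
        "module_serial": module_serial,
        "path_id": path_id,
        "firmware_build": firmware_build,
        "calibration_hash": table_hash(table),
    }


def binding_is_valid(binding: dict, module_serial: str, path_id: str,
                     firmware_build: str, table: dict) -> bool:
    """True only if every bound identity field and the table hash still match."""
    return binding == make_binding(module_serial, path_id, firmware_build, table)
```

At startup or after any link event, a node would recompute the binding from what is actually installed and refuse to apply the table (or enter a safe degrade mode) when the check fails.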
Evidence to capture
Per-unit record: serials, calibration hash, firmware build ID, date codes.
Production KPI summary: p50/p95/peak for offset, lock time, holdover drift.
Failure artifacts: raw KPI traces + environment + link events (time-stamped).
Pass criteria (placeholders)
Unit-to-unit KPI spread (same fixture) ≤ X ns.
Lot drift (weekly/monthly) ≤ Y ns after normalization.
Forensics completeness: ≥ Z% of field incidents reproducible with logs + calibration restore.
Three gates with tick-box items and a shared “Pass criteria” block (X/Y/Z placeholders).
Use this gate structure to keep WR-style timing measurable from schematic to field incidents.
H2-12 · Applications (WR-style timing as a timebase, no stack deep-dive)
This section stays on “why sub-ns is mandatory” and how to validate it with a small KPI set. Industrial stacks and TSN configuration tables are intentionally out of scope.
Trigger skew directly maps into measurement error when multiple nodes sample the same event. “Stable but biased” offsets are unacceptable because calibration must survive temperature and link events.
KPIs (only two)
Offset p95 ≤ X ns
Wander slope ≤ Y (placeholder)
Design hooks
Event-stamped measurement windows: exclude queue bursts from the estimator.
Calibration validity rules: module/path swap must force recalibration or safe degrade.
Holdover policy: define how long the trigger system may trust time during outages.
Large facilities: racks / distributed labs / power-grid measurement
Why sub-ns
Multi-segment links introduce temperature-dependent path changes and asymmetry. The system must detect when the estimate is no longer valid and enforce recalibration or safe operation.
KPIs (only two)
Reacquire time ≤ X s
Asymmetry jump ≤ Y ns
Design hooks
Path ID and calibration validity: multi-segment topology requires explicit binding and invalidation rules.
Redundant sources and failover: define a protection window before “trusted time” is re-enabled.
Black-box readiness: store event + KPI + environment for incident reconstruction.
Example parts
Renesas 8A34001
Microchip KSZ9477
Winbond W25Q128JV
Use case D
Field service: measurable, recoverable, and traceable timing
Why sub-ns
Field failures are often intermittent. Sub-ns systems must expose internal states (lock, asymmetry, holdover) and support fast isolation without requiring protocol deep dives.
KPIs (only two)
Lock stability ≥ X% uptime
Offset peak after event ≤ Y ns
Design hooks
Self-test hooks: loopback/PRBS for link sanity + timestamp path health checks.
Immutable records: event logs with environment fields for reproducibility.
Safe recovery: calibration restore + firmware rollback to last known good set.
Scope rule: these FAQs only close long-tail troubleshooting within this page’s boundaries (two-way measurement, calibration, asymmetry, servo/holdover, timestamp/tap-path, topology/verification). No new protocol domains are introduced.
Offset is small but wander is large — is it loop bandwidth/filtering or missing thermal modeling?
Likely cause: a frequency/phase loop bandwidth that passes low-frequency drift, or an estimator window/filter that aliases temperature-driven delay into wander.
Quick check: correlate wander with temperature (ΔT) and traffic/load events; compare wander under a longer vs shorter measurement window.
Fix: tighten the inner frequency lock and re-tune filter/window to reject slow drift; enable drift coefficients in the calibration model for the active path.
Pass criteria: wander p95 ≤ X over Y minutes across ΔT ≤ Z°C, while offset p95 stays ≤ A ns.
Calibrated and OK at room temperature, but degrades at high temperature — thermal compensation model or group-delay drift?
Likely cause: thermal coefficients not applied (or applied to the wrong path), or hardware group delay changes exceeding the modeled range.
Quick check: run a 2–3 point temperature sweep and compare offset slope (ns/°C) against the stored coefficient; verify the “calibration validity” rule still marks the table as valid.
Fix: re-fit the temperature model for the deployed link, store coefficients per module/path, and enforce recalibration triggers when ΔT exceeds the qualified range.
Pass criteria: |offset slope| ≤ X ns/°C over T1–T2; offset peak ≤ Y ns after a temperature step ΔT ≤ Z°C.
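The quick check above can be approximated by applying the stored coefficient to a short temperature sweep and looking at what is left. The sketch assumes a linear model (offset changes by a fixed ns/°C around a reference temperature); the function names and the linear model itself are assumptions for illustration.

```python
def compensated_offsets(temps_c, offsets_ns, coeff_ns_per_c, t_ref_c):
    """Remove the modeled linear temperature term from each offset sample."""
    return [o - coeff_ns_per_c * (t - t_ref_c) for t, o in zip(temps_c, offsets_ns)]


def residual_span_ns(temps_c, offsets_ns, coeff_ns_per_c, t_ref_c):
    """Peak-to-peak residual after compensation; near zero if the coefficient fits."""
    residuals = compensated_offsets(temps_c, offsets_ns, coeff_ns_per_c, t_ref_c)
    return max(residuals) - min(residuals)


# Three-point sweep whose true slope is 0.02 ns/°C.
temps = [25.0, 45.0, 65.0]
offsets = [0.0, 0.4, 0.8]
good = residual_span_ns(temps, offsets, coeff_ns_per_c=0.02, t_ref_c=25.0)
stale = residual_span_ns(temps, offsets, coeff_ns_per_c=0.01, t_ref_c=25.0)
```

A residual span that grows with ΔT points at a wrong or misapplied coefficient. A residual that is flat but non-zero points at a fixed-delay bias instead.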
After swapping an optical/RJ module, the whole system shifts by a constant step — was the fixed-delay table invalidated or serial binding missed?
Likely cause: fixed TX/RX delays and/or asymmetry parameters are path-specific; the calibration table was reused after a module change, or identity binding did not trigger invalidation.
Quick check: compare module identifiers (serial/part) and calibration-table hash/version; verify the “validity rule” flags the table as invalid when module ID changes.
Fix: enforce module/path/firmware binding for calibration tables; require recalibration (or safe degrade mode) after a swap.
Pass criteria: module swap triggers recalibration within X minutes; post-swap offset step ≤ Y ns after calibration is applied.
After relock, the error shows a step change — relock state machine or calibration-parameter load timing?
Likely cause: calibration parameters applied late/early relative to timestamp domain alignment, or relock logic exits holdover before coefficients are stable.
Quick check: time-stamp the sequence: link-up → measurement-ready → coefficients-applied → “trusted-time”; check if the step aligns with a state transition.
Fix: gate “trusted-time” on (a) stable frequency lock, (b) coefficient application completion, and (c) measurement window stabilization; add a post-relock settling window.
Pass criteria: relock step magnitude ≤ X ns; time-to-trust ≤ Y s with no secondary steps over Z reboots/relocks.
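The three-condition gate in the fix above is simple enough to state as code. This is a sketch with hypothetical names; the number of stable windows required before trusting time is a placeholder to be set from the settling-window requirement.

```python
from dataclasses import dataclass


@dataclass
class RelockStatus:
    frequency_locked: bool      # (a) stable frequency lock
    coefficients_applied: bool  # (b) calibration coefficients fully applied
    stable_windows: int         # (c) consecutive in-threshold windows since relock


def trusted_time(status: RelockStatus, required_stable_windows: int = 5) -> bool:
    """Expose 'trusted-time' only when all three conditions hold."""
    return (
        status.frequency_locked
        and status.coefficients_applied
        and status.stable_windows >= required_stable_windows
    )
```

Logging the time at which each field first becomes true gives the link-up → measurement-ready → coefficients-applied → trusted-time sequence needed for the quick check.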
Short links are excellent, but long links suddenly degrade — asymmetry estimate or medium thermal coefficient?
Likely cause: long links amplify temperature-driven propagation changes and asymmetry; an estimate that is “stable” may be stable-but-wrong when conditions shift.
Quick check: compare offset vs temperature for short vs long link; look for slope change; verify asymmetry estimate stability (jump size) during load/temperature transitions.
Fix: recalibrate asymmetry for the deployed length, add temperature coefficients per link class, and enforce revalidation triggers when link length/class changes.
Pass criteria: long-link offset p95 ≤ X ns over Y minutes; asymmetry jump ≤ Z ns under ΔT ≤ A°C.
It looks “locked”, but triggers are still not synchronous — phase measurement resolution or inconsistent timestamp tap points?
Likely cause: the lock indicator reflects frequency/phase loop convergence, but trigger alignment is limited by timestamp/phase measurement granularity or a mismatched tap location.
Quick check: measure trigger skew at multiple repetition rates; if skew quantizes in steps, resolution is the limit; compare tap-path configuration hashes across nodes/ports.
Fix: unify tap points (same capture boundary and domain crossing), and ensure fine phase measurement is used where required; re-baseline trigger path after tap changes.
Pass criteria: trigger skew p95 ≤ X ns across Y nodes, with tap-path hashes identical and no quantization steps > Z ns.
Adding one switch/bridge segment makes the system unstable — queue jitter leakage or non-repeatable forwarding delay?
Likely cause: measurement windows include queueing bursts, or the added segment introduces variable forwarding latency that breaks repeatability assumptions.
Quick check: repeat the same test under low traffic and high traffic; if instability scales with load, queue jitter is leaking; check if forwarding latency distribution widens after insertion.
Fix: isolate timing measurement from traffic-induced queues (windowing/filtering), and require bounded/characterized forwarding latency on timing-critical paths.
Pass criteria: added segment increases offset p95 by ≤ X ns and does not increase wander p95 by > Y under load ≤ Z%.
Monitoring says aligned, but field data does not match — time-domain crossing or timestamp-domain mapping inconsistency?
Likely cause: KPIs are computed in a different domain/epoch than the application data timestamps, or a CDC/mapping layer applies the wrong reference when converting.
Quick check: trace one event end-to-end and verify domain identifiers (source clock ID, timebase ID, conversion stage); ensure the same window and denominator are used across tools.
Fix: standardize the timestamp-domain mapping contract (IDs, units, epoch), and enforce a single “source of truth” for conversions; record conversion metadata in logs.
Pass criteria: event-to-event alignment error ≤ X ns over Y trials, with domain ID consistency = 100%.
Holdover drifts too fast after a link outage — oscillator/PLL holdover strategy or environmental disturbance?
Likely cause: holdover mode does not preserve frequency stability (insufficient oscillator quality or poor holdover tuning), or environment (temperature/supply) shifts during outage.
Quick check: measure drift vs outage duration under controlled temperature and then under realistic airflow/supply; compare drift slope (ns/s or ppb equivalent).
Fix: improve holdover tuning (freeze/track strategy), validate oscillator class for the required outage window, and include temperature/supply compensation during holdover.
Pass criteria: holdover drift ≤ X ns over outage duration Y s at ΔT ≤ Z°C; reacquire time ≤ A s.
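The "ns/s or ppb equivalent" comparison rests on a simple identity: a fractional frequency error of 1 ppb accumulates 1 ns of time error per second. The sketch below turns that into a drift estimate, assuming a constant initial frequency error plus an optional linear frequency ramp (for example from a temperature change during the outage); the function names and the linear-ramp model are illustrative assumptions.

```python
def holdover_drift_ns(freq_error_ppb, duration_s, freq_ramp_ppb_per_s=0.0):
    """Accumulated time error in ns after `duration_s` of holdover.

    Integrates f(t) = freq_error_ppb + freq_ramp_ppb_per_s * t,
    using 1 ppb == 1 ns of drift per second.
    """
    return freq_error_ppb * duration_s + 0.5 * freq_ramp_ppb_per_s * duration_s ** 2


def max_holdover_s(budget_ns, freq_error_ppb):
    """Longest outage that stays inside the drift budget at constant frequency error."""
    return budget_ns / abs(freq_error_ppb)


drift_constant = holdover_drift_ns(0.1, 10.0)          # about 1 ns after 10 s
drift_with_ramp = holdover_drift_ns(0.1, 10.0, 0.01)   # ramp adds about 0.5 ns
```

Even a 0.1 ppb residual frequency error consumes a 1 ns budget in about ten seconds, which is why the holdover window and oscillator class have to be specified together.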
Sub-ns cannot be reached in a multi-hop topology — per-hop calibration strategy or the error budget is exhausted?
Likely cause: each hop adds fixed delay uncertainty and variable components; without per-hop calibration/validation, the end-to-end budget is consumed quickly.
Quick check: measure end-to-end and then measure hop-by-hop; identify which hop has the largest p95/peak contribution and whether it is stable or load/temperature dependent.
Fix: apply calibration and validity rules per hop/path, enforce bounded forwarding latency where required, and allocate an explicit budget table with acceptance thresholds.
Pass criteria: sum of per-hop p95 contributions ≤ X ns, and no single hop exceeds Y ns p95 under load ≤ Z%.
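The pass criterion above can be checked mechanically once hop-by-hop measurements exist. The sketch below uses integer picoseconds and hypothetical hop names and limits. Note that adding per-hop p95 values is not the same as the p95 of the end-to-end error; it is used here as a simple budget bound that tends to be pessimistic when hop errors are independent.

```python
def check_hop_budget(hop_p95_ps, total_limit_ps, per_hop_limit_ps):
    """Compare per-hop p95 contributions against total and per-hop limits."""
    total = sum(hop_p95_ps.values())
    worst_hop = max(hop_p95_ps, key=hop_p95_ps.get)
    return {
        "total_ps": total,
        "worst_hop": worst_hop,
        "worst_ps": hop_p95_ps[worst_hop],
        "passed": total <= total_limit_ps and hop_p95_ps[worst_hop] <= per_hop_limit_ps,
    }


hops = {"hop1": 200, "hop2": 500, "hop3": 100}
result = check_hop_budget(hops, total_limit_ps=1000, per_hop_limit_ps=400)
```

Here the total fits the budget but `hop2` exceeds the per-hop limit, so the check fails and identifies the hop to investigate first.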
Measurements look “stable”, but the absolute offset is consistently biased — stable-but-wrong fixed delays or asymmetry?
Likely cause: a fixed-delay baseline or asymmetry term is incorrect but consistent, producing a stable bias that does not show as wander.
Quick check: validate against a known reference path or swap direction (where possible) to see if bias changes sign/magnitude; verify calibration table units and applied coefficients.
Fix: re-run fixed-delay/asymmetry calibration with strict validity rules; lock table identity to module/path and re-verify at two temperatures.
Pass criteria: absolute offset ≤ X ns over Y minutes and remains within ±Z ns across ΔT ≤ A°C.
Timing shifts after a firmware update even though hardware is unchanged — timestamp path/tap change or calibration schema mismatch?
Likely cause: the update changed the timestamp capture boundary, CDC/FIFO behavior, or coefficient load order; or it reads an older calibration table with mismatched schema/units.
Quick check: compare pre/post update: tap-path hash, FIFO depth constraints, calibration table version/hash, and the state transition timeline around “trusted-time”.
Fix: pin the tap-path contract across releases, migrate calibration schema with explicit unit checks, and block trusted-time until coefficients are applied and stable.
Pass criteria: no update-induced offset step > X ns across Y reboots; calibration table validation success = 100%.