White-Rabbit-style timing achieves ps-level phase alignment by combining a deterministic (fixed-latency) data path with bidirectional link delay calibration and a closed-loop servo that tracks temperature and aging.
This page turns the concept into an engineerable workflow: delay/asymmetry modeling, factory→field→runtime calibration, verification criteria, and failover rules that keep time steps bounded.
What “White-Rabbit-Style Timing” Means (Definition + When It’s Worth It)
White-Rabbit-style timing targets phase alignment (ps-level) across distributed nodes, not just
time synchronization (ns–µs-level) of timestamps. The core engineering problem is turning
link delay—especially asymmetry and state-dependent latency—into quantities that are
measurable, calibratable, and verifiable.
“Style” = a set of hard engineering constraints (not a brand)
Bidirectional link calibration: isolate and compensate one-way delay components and asymmetry (ΔTx/ΔRx, fiber effects).
Fixed-latency data path: prevent queue/FIFO/state changes from moving the timestamp point and introducing variable delay.
Phase measurement + servo: measure phase error with a low noise floor and close the loop to keep nodes phase-aligned.
Frequency base (syntonization): the observable is frequency error and short-term stability; the action is to use a disciplined local oscillator or recovered frequency as the servo base.
When it’s worth paying the complexity cost
Coherent / phased measurements (distributed digitizers, arrays): phase error directly reduces coherent gain and spectral purity.
Distributed triggering and time tagging: uncalibrated asymmetry becomes a systematic error that does not average out.
Remote sampling clocks: for high-frequency inputs, ps-level phase error behaves like sampling-time uncertainty.
Scope guardrails (prevents cross-page overlap)
This page focuses on
bidirectional calibration, fixed-latency paths, phase measurement/servo closure, and verification hooks for ps-level alignment.
This page avoids
full PTP protocol stack history/details and SyncE class deep dives; only hardware-loop implications are referenced.
Diagram #1. “Time sync” aligns timestamps (offset), while WR-style timing enforces calibrated, fixed-latency behavior to close a phase servo loop for ps-level alignment.
System Architecture: Grandmaster, WR Switch, WR Node, and the Calibrated Link
A White-Rabbit-style system is a calibrated timing chain. It treats the link as a measurable object
(with known fixed components and tracked drift), then uses a phase-aware servo to keep all nodes aligned.
The key is to define where timestamps are taken and which parts of the data path are fixed-latency.
The four architectural primitives
Grandmaster (GM)
Defines the reference time/frequency domain. Quality is ultimately limited by its stability and noise, but ps-level success depends on calibrated links and deterministic paths.
WR Switch
Provides a timing-aware forwarding domain. Its timing-critical functions are maintaining fixed-latency behavior and propagating calibration parameters and status.
WR Node
Converts calibrated link measurements into a controlled local clock output (continuous clock phase, plus optional 1PPS/10 MHz). Contains the measurement engine and servo.
Calibrated Link
The link is treated as a delay model: measurable round-trip terms plus calibrated one-way components (ΔTx/ΔRx, asymmetry). Temperature and replacement events must be tracked.
“Two channels” concept: not necessarily two physical cables
Data + timestamp channel: packet exchange with timestamps taken at defined hardware points (TS points).
Delay/phase calibration channel: mechanisms and parameters that make one-way delay observable/compensable (asymmetry, ΔTx/ΔRx, drift).
Where ps-level alignment is won or lost (architecture-level checks)
Timestamp points must be stable: a timestamp that moves due to buffering/queue state converts network load into timing error.
Fixed-latency path must be defined: any hidden FIFO / variable retiming becomes “random delay,” which no calibration can fully remove.
Calibration parameters must be governed: binding parameters to modules (SFP/PHY), versioning, and re-calibration rules for replacements/temperature.
Diagram #2. The system map highlights fixed timestamp points (TS), fixed-latency behavior, and calibration parameters—the anchors required for ps-level alignment.
Link Delay Model: One-Way Delay, Asymmetry, and What Must Be Calibrated
Bidirectional measurements reveal a round-trip combination of delays. Ps-level alignment requires
turning that combination into a one-way model with explicit components: fiber propagation, endpoint
Tx/Rx chains, and an asymmetry term that does not average out.
The round-trip delay is measurable as a stable combination, but it is not sufficient to uniquely determine the one-way terms without calibration priors.
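As a minimal sketch of the one-way model described above, the following splits a measured round trip into the two one-way delays using calibrated Tx/Rx terms. The class and field names are hypothetical, and the fiber asymmetry coefficient alpha is assumed to be defined as fiber_ms / fiber_sm - 1; the numbers are illustrative, not measured values.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class LinkCal:
    """Calibrated fixed delays in ps plus fiber asymmetry (illustrative names)."""
    dtx_m_ps: float  # master Tx chain delay
    drx_m_ps: float  # master Rx chain delay
    dtx_s_ps: float  # slave Tx chain delay
    drx_s_ps: float  # slave Rx chain delay
    alpha: float     # assumed definition: fiber_ms / fiber_sm - 1


def one_way_delays(round_trip_ps: float, cal: LinkCal) -> Tuple[float, float]:
    """Return (master->slave, slave->master) delays in ps."""
    fixed = cal.dtx_m_ps + cal.drx_m_ps + cal.dtx_s_ps + cal.drx_s_ps
    fiber_rt = round_trip_ps - fixed         # fiber-only round trip
    fiber_sm = fiber_rt / (2.0 + cal.alpha)  # since fiber_ms = (1 + alpha) * fiber_sm
    fiber_ms = fiber_rt - fiber_sm
    delay_ms = cal.dtx_m_ps + fiber_ms + cal.drx_s_ps
    delay_sm = cal.dtx_s_ps + fiber_sm + cal.drx_m_ps
    return delay_ms, delay_sm


cal = LinkCal(dtx_m_ps=100.0, drx_m_ps=150.0, dtx_s_ps=50.0, drx_s_ps=200.0, alpha=0.0)
delay_ms, delay_sm = one_way_delays(10_500.0, cal)
naive_half = 10_500.0 / 2  # "half the round trip" ignores the endpoint chains
print(delay_ms, delay_sm, delay_ms - naive_half)
```

In this example the fiber itself is symmetric, yet the endpoint chains alone make the half-round-trip estimate wrong by 50 ps, a bias that no amount of averaging removes.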
Why “no calibration” gets stuck at ns–tens of ns
Asymmetry is a systematic error: Tx ≠ Rx latencies, wavelength/module differences, and thermal gradients create a one-way bias that does not vanish with averaging.
State-dependent latency is a random error: variable buffering/retiming (FIFO, queue state, clock-domain crossings) converts traffic and internal state into timing wander and non-repeatable offsets.
Ps-level demands explicit ownership of error terms: each major delay contributor must be either measured online, calibrated offline, or bounded with guardband.
Actionable breakdown: what can be measured online vs what must be calibrated
mechanical and thermal changes after replacement events
Diagram #3. Round-trip measurements are not enough for ps-level one-way accuracy without calibrated Tx/Rx chain terms and controlled asymmetry.
Bidirectional Link Calibration Workflow (Factory + Field + Temperature)
Ps-level alignment requires a calibration lifecycle, not a one-time measurement.
A practical workflow separates what must be learned at the factory (endpoint chain deltas), what must be established on installation (fiber baseline),
and what must be tracked in operation (thermal drift and replacement events).
Three-layer calibration (minimum viable process)
Factory (build-time)
Measure ΔTx/ΔRx for endpoint chains and bind them to module/port identity (serial-bound). Store with version and traceability.
Install (deployment-time)
Establish fiber delay baseline for the installed path (fiber + connectors + jumpers). Record link identity and baseline timestamp.
Runtime (operation-time)
Track thermal drift, detect non-repeatable steps, and raise alarms. Apply a temperature model (linear/LUT) and guardband; trigger re-calibration or failover when needed.
Temperature strategy: drift tracking must be explicit
SFP temperature is often the fastest-changing contributor and should be logged with phase wander.
Enclosure gradients create direction-dependent drift (asymmetry) and must be bounded with guardband when not fully modelable.
Model levels: threshold-only alarms → linear tempCoef → LUT (segmented). Choose based on ps budget and environment dynamics.
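The linear and LUT model levels can be sketched as follows. This is an illustration under assumptions: the function names, the reference temperature, and the table values are hypothetical, and the LUT variant simply interpolates linearly between segment points and clamps outside the calibrated range.

```python
from bisect import bisect_right


def linear_correction_ps(temp_c, ref_temp_c, temp_coef_ps_per_c):
    """Linear model: correction grows with distance from the calibration temperature."""
    return temp_coef_ps_per_c * (temp_c - ref_temp_c)


def lut_correction_ps(temp_c, lut):
    """Segmented model. lut is a list of (temp_c, correction_ps) sorted by temperature."""
    temps = [t for t, _ in lut]
    if temp_c <= temps[0]:
        return lut[0][1]   # clamp below the calibrated range
    if temp_c >= temps[-1]:
        return lut[-1][1]  # clamp above the calibrated range
    i = bisect_right(temps, temp_c)
    (t0, c0), (t1, c1) = lut[i - 1], lut[i]
    return c0 + (c1 - c0) * (temp_c - t0) / (t1 - t0)


def within_guardband(residual_ps, guardband_ps):
    """Residual is the phase error left after the model correction is applied."""
    return abs(residual_ps) <= guardband_ps


example_lut = [(0.0, -10.0), (25.0, 0.0), (50.0, 30.0)]  # illustrative values
print(linear_correction_ps(35.0, 25.0, 2.0))
print(lut_correction_ps(37.5, example_lut))
```

Clamping at the table ends is a design choice for the sketch; a deployed system might instead raise an alarm when the temperature leaves the calibrated range.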
Timing parameter governance (keeps calibration valid in the field)
Version + traceability
Store calVersion, fixture identity, timestamp, and measurement repeatability. Enable rollback when a new model increases drift.
Binding rules
Bind ΔTx/ΔRx and related constants to module/port serial IDs. Replacement of a bound item is a calibration invalidation event.
Re-cal triggers
Trigger re-calibration or failover on: module change, fiber path change, non-repeatable offset step, or drift slope exceeding guardband.
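The binding rules and re-cal triggers above can be expressed as a small check. The record layout and field names below are hypothetical; the four trigger conditions are the ones listed in the text, and the limits are placeholders to be set by the timing budget.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class CalRecord:
    """Stored calibration identity (illustrative fields)."""
    sfp_serial: str
    port: str
    fiber_path_id: str
    cal_version: int


@dataclass(frozen=True)
class LinkObservation:
    """What the runtime currently sees on the link (illustrative fields)."""
    sfp_serial: str
    port: str
    fiber_path_id: str
    offset_step_ps: float
    drift_slope_ps_per_s: float


def recal_reasons(rec: CalRecord, obs: LinkObservation,
                  step_limit_ps: float, slope_limit_ps_per_s: float) -> List[str]:
    """Return every reason the stored calibration should no longer be trusted."""
    reasons = []
    if obs.sfp_serial != rec.sfp_serial or obs.port != rec.port:
        reasons.append("module change")
    if obs.fiber_path_id != rec.fiber_path_id:
        reasons.append("fiber path change")
    if abs(obs.offset_step_ps) > step_limit_ps:
        reasons.append("non-repeatable offset step")
    if abs(obs.drift_slope_ps_per_s) > slope_limit_ps_per_s:
        reasons.append("drift slope exceeds guardband")
    return reasons


record = CalRecord("SFP-A1", "port0", "path-7", cal_version=3)
swapped = LinkObservation("SFP-B2", "port0", "path-7", 0.0, 0.0)
print(recal_reasons(record, swapped, step_limit_ps=20.0, slope_limit_ps_per_s=0.5))
```

An empty list means the calibration is still considered valid; any entry would feed the re-calibration or failover policy.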
Diagram #4. A three-stage calibration lifecycle separates factory constants, installation baselines, and runtime drift governance to keep ps-level alignment valid over time.
Fixed-Latency Data Path: Where Determinism Is Won or Lost
Ps-level alignment fails most often when the timestamp boundary crosses a state-dependent path.
“Fixed-latency” means the effective delay between timestamp points remains repeatable across load,
congestion, reset, and temperature—rather than only looking stable on average.
Fixed-latency boundary (practical definition)
TS1 → TS2 delay must not change with buffer depth, arbitration outcomes, or recovered-clock state.
The requirement is repeatability (single-cluster distribution), not merely a low average offset.
Any variable stage inside TS1–TS2 converts traffic/implementation state into timing error that calibration cannot remove.
Common sources of state-dependent delay (what breaks determinism)
Verification hooks: how to detect a non-deterministic path quickly
Load sweep
Sweep background traffic from idle to stressed and inspect whether offset/phase error distribution widens or becomes multi-cluster.
Priority injection
Introduce bursty traffic and check for step-like timing changes correlated with queue/arb events.
Restart repeatability
Repeat link bring-up and verify the resulting offset forms a single stable cluster; discrete “bands” indicate hidden state-dependent stages.
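One simple way to test for "single cluster" versus discrete bands is gap-based grouping of the recorded landing offsets. This is a sketch, not a prescribed method: the function name and the gap threshold are assumptions, and the threshold should sit well above the expected noise spread but below the smallest step of interest.

```python
def phase_clusters(landings_ps, max_gap_ps):
    """Group sorted landing offsets; a gap larger than max_gap_ps starts a new cluster."""
    if not landings_ps:
        raise ValueError("need at least one landing value")
    ordered = sorted(landings_ps)
    clusters = [[ordered[0]]]
    for value in ordered[1:]:
        if value - clusters[-1][-1] > max_gap_ps:
            clusters.append([value])    # discrete band detected
        else:
            clusters[-1].append(value)  # still inside the current cluster
    return clusters


restarts = [3.0, 1.0, 2.0, 102.0, 101.0]  # illustrative landing offsets in ps
clusters = phase_clusters(restarts, max_gap_ps=20.0)
print(len(clusters), clusters)
```

More than one cluster across load sweeps or restarts points to a state-dependent stage inside the timestamp boundary rather than to noise.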
Scope guardrail: this section focuses only on fixed-latency behavior and timestamp boundaries (not a full Ethernet QoS design guide).
Diagram #5. Any state-dependent stage inside TS1–TS2 converts traffic and internal state into timing error. Ps-level alignment requires a deterministic boundary.
Phase Measurement Engine: DDMTD/TDC and Why ps-Level Is Hard
Ps-level alignment requires a phase measurement engine with a noise floor below the target. The measurement must separate
true system phase error from measurement chain noise,
using repeatable statistics rather than relying on a visually “clean” 1PPS trace.
Two engineering routes to ps-resolution (selection view)
DDMTD (beat-frequency phase amplification)
uses a low-frequency beat to make tiny phase differences observable
excellent for long-window statistics and drift slope tracking
requires clean clock-domain management and repeatable sampling
TDC (direct time interval conversion)
quantizes time intervals into digital codes (event/edge friendly)
must manage quantization noise and nonlinearity across temperature
verification focuses on effective noise floor, not nominal resolution
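For the DDMTD route, the beat-frequency arithmetic can be sketched as below. It assumes the common arrangement where the helper clock runs at f_clk * N / (N + 1); the 125 MHz clock and N = 16384 are illustrative values, not parameters taken from this page. The result is nominal resolution only; the effective noise floor still has to be measured.

```python
def ddmtd_params(f_clk_hz, n):
    """Return (helper frequency, beat frequency, nominal resolution in ps)."""
    f_helper_hz = f_clk_hz * n / (n + 1)
    f_beat_hz = f_clk_hz - f_helper_hz  # equals f_clk / (n + 1)
    # Each helper-clock sample slides by T_clk / n relative to the input clock.
    resolution_ps = 1e12 / f_clk_hz / n
    return f_helper_hz, f_beat_hz, resolution_ps


def counts_to_phase_ps(counts, f_clk_hz, n):
    """Convert helper-clock counts between the two beat edges into a time offset."""
    return counts * 1e12 / f_clk_hz / n


f_helper, f_beat, res_ps = ddmtd_params(125e6, 16384)
print(round(f_beat, 2), res_ps, counts_to_phase_ps(100, 125e6, 16384))
```

With these example numbers an 8 ns clock period is stretched onto a beat of a few kHz, so one helper-clock count corresponds to well under 1 ps of input phase.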
Why ps-level is hard: measurement floor vs system jitter
Observed phase noise combines true system phase error with measurement chain noise; when the two are uncorrelated, their RMS values add in quadrature.
A clean 1PPS waveform does not prove ps-repeatability; ps validation requires windowed statistics and repeatable experiments.
If the measurement floor dominates, improvements in the actual link/servo will not appear in the results, masking real engineering progress.
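A short sketch of how the floor limits what can be claimed, assuming the system error and the measurement floor are uncorrelated so that their RMS values add in quadrature. The function name is hypothetical.

```python
import math


def system_rms_ps(observed_rms_ps, floor_rms_ps):
    """Estimate system RMS by quadrature subtraction; None if the floor dominates."""
    if floor_rms_ps >= observed_rms_ps:
        return None  # system contribution is not resolvable with this instrument
    return math.sqrt(observed_rms_ps ** 2 - floor_rms_ps ** 2)


print(system_rms_ps(5.0, 3.0))  # floor well below the observation
print(system_rms_ps(3.0, 3.0))  # floor dominates
```

When the function returns None, improving the link or servo cannot show up in the data until the measurement chain itself is improved.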
Verification hooks (to confirm the measurement reflects the system)
Noise-floor check
Feed both inputs from the same source to estimate measurement floor and confirm the estimator produces a tight, single-cluster distribution.
Windowed statistics
Use short windows to characterize jitter and longer windows to characterize drift/wander; compare slope changes under controlled perturbations.
A/B isolation
Change one variable (reference quality, fixed-latency stage, or traffic isolation) and confirm the phase distribution responds in the expected direction.
Load/Temp perturbation
Apply small load or temperature perturbations and verify the response matches the deterministic boundary assumptions defined in the fixed-latency data path section.
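The windowed-statistics hook can be sketched with plain least squares: a long-window fit gives the drift slope, and detrended RMS over short windows characterizes jitter. Function names, window length, and the synthetic data are assumptions for illustration.

```python
import math


def fit_slope(times_s, phase_ps):
    """Ordinary least-squares line fit; returns (slope, intercept)."""
    n = len(times_s)
    t_mean = sum(times_s) / n
    p_mean = sum(phase_ps) / n
    sxx = sum((t - t_mean) ** 2 for t in times_s)
    sxy = sum((t - t_mean) * (p - p_mean) for t, p in zip(times_s, phase_ps))
    slope = sxy / sxx
    return slope, p_mean - slope * t_mean


def detrended_rms(times_s, phase_ps):
    """RMS of the residual after removing the fitted line (jitter without drift)."""
    slope, intercept = fit_slope(times_s, phase_ps)
    residuals = [p - (slope * t + intercept) for t, p in zip(times_s, phase_ps)]
    return math.sqrt(sum(r * r for r in residuals) / len(residuals))


def windowed_rms(times_s, phase_ps, window):
    """Detrended RMS for each consecutive, non-overlapping window of samples."""
    return [detrended_rms(times_s[i:i + window], phase_ps[i:i + window])
            for i in range(0, len(times_s) - window + 1, window)]


times = [float(t) for t in range(100)]
phase = [2.0 * t + (1.0 if int(t) % 2 == 0 else -1.0) for t in times]  # drift + jitter
drift_slope, _ = fit_slope(times, phase)
short_rms = windowed_rms(times, phase, window=10)
print(round(drift_slope, 3), [round(r, 2) for r in short_rms])
```

Comparing the slope and the per-window RMS before and after a controlled perturbation separates a drift change from a jitter change.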
Scope guardrail: this section focuses on engineering selection and validation hooks (not circuit-level DDMTD/TDC implementation details).
Diagram #6. DDMTD makes tiny phase differences observable via a low-frequency beat, then estimates phase using filtering and windowed statistics.
Servo Design: Frequency Syntonization + Phase Alignment as Two Coupled Loops
Ps-level alignment is achieved by separating error sources into two control objectives:
frequency syntonization suppresses long-term drift, while
phase alignment corrects residual phase error without importing link or measurement noise.
The most common failure mode is a loop split that forces the phase loop to fight drift, or a bandwidth choice that injects noise into the actuator.
Two objectives (keep them distinct)
Frequency syntonization (inner loop)
tracks the recovered clock / SyncE / local OCXO to control drift slope
defines holdover behavior and long-window stability
should reject fast link noise and measurement noise
Phase alignment (outer loop)
uses phase estimate (DDMTD/TDC) to remove residual phase error
must avoid “pulling in” link jitter or estimator floor
should converge to a single-cluster phase distribution
Bandwidth selection: actionable tuning rules (no math)
Inner loop BWf (frequency)
too wide: imports link noise → short-window phase noise rises
too narrow: cannot follow thermal drift → drift slope increases
tune to control long-window drift while keeping short-window jitter near the oscillator floor
Outer loop BWp (phase)
too wide: injects estimator floor/link jitter → phase distribution broadens
too narrow: slow convergence → lock time grows, temp tracking lags
increase BWp only until lock time meets requirements without raising the phase noise floor
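To make the trade-off tangible, here is a toy discrete-time model, not the two-loop design itself: a single PI servo in which the integral state learns the frequency offset (the syntonization role) and the proportional term removes residual phase. The gains, step size, and function name are illustrative assumptions; real loops also carry measurement noise and actuator limits that this sketch omits.

```python
def simulate_servo(initial_phase_ps, freq_offset_ps_per_step, kp, ki, steps):
    """Return (phase history, learned frequency correction in ps per step)."""
    phase = initial_phase_ps
    integ = 0.0
    history = []
    for _ in range(steps):
        history.append(phase)
        correction = -(kp * phase + ki * integ)       # actuator command for this step
        integ += phase                                # integral state tracks frequency offset
        phase += freq_offset_ps_per_step + correction  # phase accumulates uncorrected drift
    return history, ki * integ


history, learned = simulate_servo(
    initial_phase_ps=1000.0, freq_offset_ps_per_step=5.0, kp=0.2, ki=0.01, steps=500)
print(history[0], abs(history[-1]) < 1e-6, round(learned, 6))
```

Raising kp and ki shortens lock time in this model, which mirrors the text: wider bandwidth converges faster but, in a real system, also passes more estimator and link noise to the actuator.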
Guardband and lock criteria (engineering-facing)
Guardband
Budget the residual uncertainty that is not fully calibratable (thermal gradients, aging, module swaps, environment changes).
Guardband defines how much phase change is acceptable during tuning, holdover, and failover.
Lock criteria
A valid lock is indicated by a single-cluster phase distribution, stable drift slope under steady conditions,
and repeatable convergence after restart—rather than a state flag alone.
Scope guardrail: this section provides practical loop-splitting and tuning principles (not PLL stability math or circuit-level actuator design).
Diagram #7. Separate the objectives: the frequency loop controls drift (BWf), while the phase loop removes residual phase error (BWp) without importing measurement or link noise.
Network Design Patterns: Topology, Redundancy, and Failover Without Time Steps
Timing-grade redundancy is not achieved by merely adding a second path. The goal is
hitless or bounded-step switching:
both paths must be calibrated, the failover state machine must be stable, and the holdover window must prevent transient link artifacts from becoming phase steps.
Topology patterns (timing view, not throughput view)
Star / shallow tree
Clear calibration boundaries and smaller error accumulation. Prefer fewer hops when ps-level repeatability is the top requirement.
Cascaded switches
Each hop must preserve fixed-latency behavior and propagate calibration parameters consistently. Verify determinism hop-by-hop under traffic.
Long chains
Higher risk of drift accumulation and multi-fault propagation. Use only when boundaries, monitoring, and failover constraints are strongly enforced.
Redundancy requirements for bounded-step switching
Calibrate both paths: standby must be pre-calibrated and continuously validated.
Stable state machine: avoid oscillatory switching; use explicit enter/exit criteria.
Holdover window: during transition, prevent transient artifacts from becoming phase corrections.
Guardband-aware policy: switching acceptance is defined by guardband and repeatability, not a status bit.
Switching criteria (timing observables)
Trigger
loss-of-lock alarms, phase error trend exceeding guardband, missing calibration validity, or drift slope outside the allowed envelope.
Acceptance
post-switch phase distribution returns to a single cluster and remains within guardband under load and temperature perturbations.
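The trigger and acceptance criteria can be sketched as two small functions. The state names, fields, and limits are hypothetical; the point is that the decision uses timing observables and calibration validity rather than a single status bit.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    STAY = "stay"
    SWITCH = "switch"
    HOLDOVER = "holdover"


@dataclass(frozen=True)
class PathStatus:
    locked: bool
    cal_valid: bool
    phase_error_ps: float
    drift_slope_ps_per_s: float


def path_healthy(p, guardband_ps, slope_limit_ps_per_s):
    """A path is usable only if lock, calibration, phase, and drift all pass."""
    return (p.locked and p.cal_valid
            and abs(p.phase_error_ps) <= guardband_ps
            and abs(p.drift_slope_ps_per_s) <= slope_limit_ps_per_s)


def decide(active, standby, guardband_ps, slope_limit_ps_per_s):
    """Trigger logic: stay if active is healthy, else switch only to a healthy standby."""
    if path_healthy(active, guardband_ps, slope_limit_ps_per_s):
        return Decision.STAY
    if path_healthy(standby, guardband_ps, slope_limit_ps_per_s):
        return Decision.SWITCH
    return Decision.HOLDOVER


def accept_switch(step_ps, post_switch_cluster_count, guardband_ps):
    """Acceptance: bounded step and a single-cluster phase distribution afterwards."""
    return abs(step_ps) <= guardband_ps and post_switch_cluster_count == 1


failed = PathStatus(locked=False, cal_valid=True, phase_error_ps=0.0, drift_slope_ps_per_s=0.0)
good = PathStatus(locked=True, cal_valid=True, phase_error_ps=12.0, drift_slope_ps_per_s=0.1)
print(decide(failed, good, guardband_ps=50.0, slope_limit_ps_per_s=0.5))
```

A production state machine would add explicit enter/exit hysteresis to avoid oscillatory switching; that is left out here for brevity.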
Scope guardrail: this section defines timing-layer switching logic and criteria (not general network redundancy protocols).
Diagram #8. Timing-grade failover requires both paths to be calibrated and validated. Switching must be governed by a stable state machine and accepted only if the resulting phase remains bounded and repeatable.
Verification & Metrology: How to Prove ps-Level Alignment (and Not Fool Yourself)
Ps-level claims must be backed by statistics and
controlled comparisons, not a clean-looking waveform or a lock flag.
A valid verification plan spans time, temperature, restart repeatability, and failover behavior with logged conditions and a reportable pass template.
What to measure (the minimum set)
Steady-state phase error
RMS / percentile phase error within a defined time window
single-cluster distribution (no multi-step modes)
Phase wander & drift slope
hour/day-scale trends and low-frequency excursions
correlation against temperature and configuration changes
Holdover drift
phase/time error growth during loss of link / reference
Restart repeatability
N restarts should converge to the same phase cluster
no mode-hopping or step-like re-landing behavior
Toolchain: use each instrument for what it can actually prove
Time interval counter (TIC)
Best for long-window drift, holdover growth, and event-to-event stability. Ensure fixed trigger conditions and report the time base reference used.
Phase noise / phase comparator
Best for phase noise floor and integration-window controlled jitter metrics. Always state integration bandwidth and averaging settings in the report.
DSO (oscilloscope)
Best for transient diagnosis (failover step timing, event correlation, glitches). Not sufficient to prove ps-level repeatability without statistics and controlled comparisons.
Scope guardrail: this section describes how to prove performance (not an instrument purchasing guide).
Scripted validation and portable pass templates
Logging dimensions
Log time (hours/days), temperature,
and configuration (SFP/port/firmware/servo params/cal version).
Require before-after comparisons for SFP swap, restart cycles, and failover events.
Pass criteria templates
steady-state RMS phase error ≤ X ps (window T, conditions C)
temperature slope |d(phase)/dT| ≤ Y ps/°C (range, soak rule)
failover step ≤ Z (ps or ns), recovery ≤ R, single-cluster afterward
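A pass template becomes portable once it is evaluated by a script against logged data. The sketch below assumes hypothetical limit keys X_ps, Y_ps_per_c, and Z_ps that mirror the placeholders above; the sample values are illustrative.

```python
import math


def rms(values):
    """Root-mean-square of a sequence of phase errors."""
    return math.sqrt(sum(v * v for v in values) / len(values))


def evaluate_pass(phase_ps, temp_slope_ps_per_c, failover_step_ps, limits):
    """Return a per-criterion pass/fail map for one logged test run."""
    return {
        "steady_state_rms": rms(phase_ps) <= limits["X_ps"],
        "temperature_slope": abs(temp_slope_ps_per_c) <= limits["Y_ps_per_c"],
        "failover_step": abs(failover_step_ps) <= limits["Z_ps"],
    }


limits = {"X_ps": 5.0, "Y_ps_per_c": 2.0, "Z_ps": 100.0}  # placeholder thresholds
result = evaluate_pass([3.0, -4.0, 3.0, -4.0], temp_slope_ps_per_c=1.5,
                       failover_step_ps=40.0, limits=limits)
print(result, all(result.values()))
```

The window T, conditions C, and recovery time R from the templates would be recorded alongside the result so the report stays reproducible.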
Diagram #9. A ps-grade verification plan must cover time, temperature, restart repeatability, and failover. Each cell pairs a metric with a method and a reportable pass template.
Engineering Checklist: Bring-up → Calibration → Production → Field Ops
A White-Rabbit-style timing system is deliverable only when it can be built, calibrated, screened, and operated with repeatable gates.
Each stage must produce logs and parameters tied to device identity and configuration version, then pass a defined stamp before proceeding.
Gate 1 — Bring-up
Fixed-latency sanity: verify timestamp points and remove variable-delay stages (FIFO/CDC/queue)
Baseline lock: confirm frequency lock first, then phase loop convergence
Baseline logs: record steady-state phase distribution and drift slope under known conditions
Fault grading: separate measurement-floor issues from link/cal/version issues
Failover policy: switch only when standby parameters are valid; accept only bounded-step outcomes
Outputs
long logs (time/temp/config), event records (failover/loss), rollback notes
Pass stamp
stable operation under temperature/traffic, bounded-step failover with acceptance logs
Diagram #10. Gate the lifecycle. Each stage produces logs/params and a pass stamp so ps-level performance is deliverable, auditable, and repeatable across build, calibration, production screening, and field operation.
Applications (WR-Style): Where ps-Level Actually Changes the System Design
Ps-level alignment is not “nicer timestamps.” It changes system constraints:
fixed-latency paths become mandatory, calibration data becomes a lifecycle asset, and redundancy must be designed for
bounded steps rather than best-effort continuity.
A) Coherent sampling / distributed ADC
Why ps-level matters
Channel-to-channel phase consistency determines whether a multi-node measurement can behave like one instrument.
Design impact
fixed-latency timestamp boundary must be stable across restart cycles
phase measurement floor must be lower than the claimed system error
calibration parameters must be versioned and tied to module identity
Common failure mode
“Lock = aligned” assumption: phase re-lands into multiple clusters after reboot due to hidden variable-latency stages.
B) Distributed trigger / event time-marking
Why ps-level matters
Tight event ordering and deterministic time-marking reduce “time-walk budget pressure” and simplify downstream timing margins.
Design impact
define a consistent “event reference point” (timestamp location and signal edge)
failover must be bounded-step (measure and accept, not guess)
holdover policy must match outage window and event criticality
Common failure mode
Redundancy designed for link uptime only; a switchover creates unbounded time steps that break event correlation.
C) Multi-chassis / multi-card synchronization
Why ps-level matters
Maintainability becomes the real win: repeatable phase alignment across chassis without manual re-tuning.
Design impact
cable/SFP replacement must trigger re-calibration rules (governance)
temperature gradients must be measured and guarded (not assumed away)
restart must return to the same phase cluster (production-grade repeatability)
Common failure mode
Calibration treated as a one-time factory step; field changes (airflow, fiber routing, module swap) silently invalidate phase alignment.
Scope guardrail
This section describes timing-driven system constraints. It does not expand into full domain architectures (radar/accelerators/broadcast stacks) or algorithm details.
Diagram #11. Three WR-style application patterns. Ps-level alignment “pays off” only when the system is designed for fixed latency, calibrated links, and lifecycle governance.
IC Selection Notes (WR-Style): What to Choose and What to Measure First
WR-style selection is a decision tree driven by targets and measurable risk.
Start from ps error / holdover / topology, then choose oscillator class, phase measurement engine,
fixed-latency path elements (PHY/SFP), and switch redundancy strategy. Prefer “measure-first” hooks before committing to a BOM.
Measure-first hooks (before selecting parts)
1) SFP temperature-driven phase drift
log phase vs SFP temperature (slope + hysteresis)
verify repeatability after airflow and module reseat
2) Link asymmetry repeatability
cycle link up/down and compare re-landing phase (single cluster expectation)
compare A/B transceivers of the same model to expose systematic offsets
3) Restart return-to-cluster
run N restart cycles and confirm phase returns to the same cluster
if multi-cluster appears, search for hidden variable-latency stages (FIFO/CDC/queues)
Selection logic (by functional blocks)
Node oscillator class (holdover & dynamics)
need for OCXO/TCXO is set by outage window and temperature gradients
need for VCXO is set by servo bandwidth, lock time, and tuning range
verify with long-window drift and temperature slope (not only short-window jitter)
Phase measurement engine (DDMTD/TDC floor)
ensure measurement noise floor is below the target ps error
ensure stable internal clocking and isolation (avoid self-injection)
use controlled comparisons to separate measurement floor vs system jitter
Example material numbers (for datasheet lookup & first measurements)
These part numbers are provided to accelerate verification and datasheet discovery. Final selection must be driven by the decision tree and the measure-first hooks above.
Always verify package/suffix, required features, and availability for the target build.
A) DPLL / jitter attenuators (clean + track)
Skyworks (formerly Silicon Labs timing): Si5345A, Si5341
Analog Devices: AD9545, AD9548
Texas Instruments: LMK04828, LMK05318
B) Clock distribution / fanout
Analog Devices: ADCLK948, ADCLK954
Texas Instruments: LMK01020, CDCLVP1216
Analog Devices: LTC6952 (low jitter distribution)
C) Programmable XO / clock source
Skyworks (formerly Silicon Labs timing): Si570 (programmable XO family)
Epson: SG-8002 (XO family)
NDK: NZ2520SD (XO family)
D) Time / phase measurement (TDC)
Texas Instruments: TDC7200, TDC7201
ScioSense (formerly ams/acam): TDC-GPX2
E) FPGA (implementation anchor examples)
AMD/Xilinx Artix-7: XC7A35T-1FTG256C
AMD/Xilinx Kintex-7: XC7K325T-2FFG900C
F) SFP transceivers (starting points for thermal/asymmetry tests)
Finisar: FTLF1318P3BTL (1G LX-type)
Broadcom/Avago: AFBR-5710PZ (1G SX-type, 850 nm multimode)
Diagram #12. Start from targets, then select oscillator/measurement/path/redundancy. Use measure-first hooks to expose temperature drift, asymmetry repeatability, and restart clustering before locking the BOM.
FAQs (WR-Style): Calibration, Asymmetry, Fixed Latency, Metrology, Failover, Temperature
Each answer follows a strict 4-line, measurable structure: Likely cause / Quick check / Fix / Pass criteria.
Placeholders X/Y/Z are acceptance thresholds to be set by the system timing budget.
Lock looks “OK” but phase slowly walks over hours—what should be checked first?
Quick check
Log phase_error(t) together with SFP case temperature and node board temperature, then compute phase slope (ps/°C) and hysteresis over ≥2 thermal cycles.
Fix
Add runtime temperature compensation (linear/LUT + guardband) and enforce a thermal management/placement rule to reduce gradients across transceivers.
Pass criteria
Steady-state RMS phase error ≤ X ps over window T (e.g., 100–1000 s) and temperature slope ≤ Y ps/°C across the specified ambient range.
After reboot, phase offset changes by a fixed step—what is the most common non-fixed-latency culprit?
Likely cause
A state-dependent latency element (FIFO depth, CDC elastic buffer, serdes deskew, or queueing) changes its internal state across boots.
Quick check
Run N reboot cycles and record phase landing points; multiple discrete clusters (not noise) indicate hidden variable-latency states.
Fix
Force deterministic pipeline configuration (fixed FIFO depth / bypass elastic buffers / lock timestamp point) and add a boot-time self-test that rejects multi-cluster landing.
Pass criteria
Across N reboots, landing spread ≤ X ps and the distribution remains single-cluster (no discrete steps larger than X ps).
Fiber link replaced with the same length, yet offset shifts—what parameter must be re-bound?
Likely cause
Calibration identity changed (SFP Tx/Rx latency deltas, connector pair, patch-cord dispersion), so “same length” does not preserve one-way delay.
Quick check
Compare pre/post replacement: SFP serials, stored ΔTx/ΔRx, and baseline fiberDelay0; if serials differ, assume re-binding is required.
Fix
Re-run the install calibration step and bind calibration parameters to the exact SFP serial + port + patch set; reject “unknown identity” links in runtime.
Pass criteria
After re-binding, one-way offset shift ≤ X ps compared to baseline and remains within X ps across a re-plug/restart sanity loop.
Calibration passes at room temperature but fails in a thermal sweep—what should be logged first?
Likely cause
Temperature-dependent latency/asymmetry drift exceeds the guardband because the calibration model does not include the dominant thermal driver.
Quick check
Log phase error, SFP temperature, board temperature, and ambient temperature at a fixed sampling interval, then fit phase vs temperature with slope + hysteresis.
Fix
Promote the strongest temperature proxy (often SFP case) into the runtime compensation model and enlarge guardband for unmodeled gradient terms.
Pass criteria
Thermal sweep slope ≤ Y ps/°C and max deviation from model ≤ X ps across the full sweep rate and stabilization rules used in production.
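The quick check for a thermal sweep (slope, deviation from model, hysteresis) can be sketched as follows. Function names and the sample data are illustrative assumptions; hysteresis is taken here as the largest difference between heating and cooling readings at matching temperature setpoints.

```python
def fit_line(x, y):
    """Ordinary least-squares line fit; returns (slope, intercept)."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, y_mean - slope * x_mean


def thermal_report(temps_c, phase_ps):
    """Return (slope in ps/degC, max deviation of the data from the linear model in ps)."""
    slope, intercept = fit_line(temps_c, phase_ps)
    max_dev = max(abs(p - (slope * t + intercept)) for t, p in zip(temps_c, phase_ps))
    return slope, max_dev


def hysteresis_ps(heating, cooling):
    """heating/cooling map temp_c -> phase_ps; compare only shared setpoints."""
    common = set(heating) & set(cooling)
    return max(abs(heating[t] - cooling[t]) for t in common)


slope, max_dev = thermal_report([20.0, 30.0, 40.0, 50.0], [0.0, 21.0, 39.0, 60.0])
hyst = hysteresis_ps({20.0: 0.0, 30.0: 21.0}, {20.0: 2.0, 30.0: 20.0})
print(round(slope, 3), round(max_dev, 3), hyst)
```

The slope is compared against Y, the maximum deviation against X, and a large hysteresis suggests that the chosen temperature sensor is not the dominant thermal driver.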
Bidirectional delay looks stable, but one-way is wrong—how can asymmetry be detected quickly?
Likely cause
RTT stability hides a constant one-way bias because Tx and Rx latencies differ and do not cancel without a valid asymmetry model.
Quick check
Swap A/B transceivers between ends (or reverse the fiber pair if available) and observe whether one-way offset changes by ~2× the suspected asymmetry term.
Fix
Calibrate and store ΔTx/ΔRx per module identity, and treat any identity change as invalidating one-way accuracy until re-calibrated.
Pass criteria
Across swap/reverse tests, one-way offset remains within ±X ps of the modeled value and does not jump by more than X ps per identity-preserving reconnect.
Failover causes a time step—how can switching be made bounded?
Likely cause
Standby path is not pre-calibrated/phase-tracked, so switchover applies an unknown delay/asymmetry state as an instantaneous step.
Quick check
Measure failover_step under controlled A→B→A cycles while logging standby calibration validity, holdover window, and phase continuity flags.
Fix
Continuously calibrate and phase-track standby, then switch only inside a verified holdover window and apply a bounded step acceptance policy (reject/rollback if exceeded).
Pass criteria
Failover step ≤ Z ps (or ≤ Z ns by system spec) and recovery to steady-state RMS ≤ X ps occurs within the specified lock time budget.
“ps-level” claim is not reproducible across instruments—what is the measurement trap?
Likely cause
Instrument noise floor, trigger jitter, or timebase discipline differences dominate the observed error instead of the system under test.
Quick check
Run a loopback/control measurement to estimate the instrument floor, and force common settings (same reference, same bandwidth limit, same gating window and statistics).
Fix
Standardize a metrology recipe (reference distribution, trigger method, cabling symmetry, statistics window) and publish the instrument floor alongside system results.
Pass criteria
Two independent instruments agree within ±X ps after subtracting/confirming floors, and the measured floor is ≤ (X/2) ps over the same gating window.
Cleaner/PLL change improves jitter but worsens alignment—what bandwidth interaction is likely?
Likely cause
Loop bandwidth shifts the jitter/wander trade-off: narrowing reduces random jitter but can increase low-frequency phase wander or slow temperature tracking.
Quick check
Measure phase error PSD or time-domain wander with two BW settings and compare: (a) short-window RMS jitter and (b) long-window drift/slope.
Fix
Set bandwidths by requirement split (random jitter vs wander) and add guardband so temperature tracking does not exceed alignment error budget.
Pass criteria
Random jitter meets the endpoint limit while long-window phase wander stays ≤ X ps over T and thermal slope stays ≤ Y ps/°C under the expected gradient.
Two nodes align well on the bench, but drift in the chassis—what mechanical/thermal gradient path dominates?
Likely cause
Chassis airflow and placement create persistent temperature gradients across transceivers and local clocking, producing repeatable phase drift not seen on an open bench.
Quick check
Map temperature at SFPs, nearby regulators, and FPGA/clock areas while logging phase; correlate drift with ΔT between ends, not just absolute ambient.
Fix
Reduce gradient (placement, airflow baffles, thermal coupling) and promote gradient-aware compensation (use the most correlated temperature sensors).
Pass criteria
Under worst-case airflow/heat load, drift stays within X ps over T and the slope vs ΔT stays within Y ps/°C with no hysteresis-induced runaway.
Packet load changes alignment—what does that imply about the timestamp point?
Likely cause
Timestamping is occurring after a variable-latency stage (queue, store-and-forward, DMA scheduling), so traffic modulates the effective timing reference.
Quick check
Sweep traffic load (idle → max) while logging phase error; if phase tracks load with repeatable slope, the timestamp point is not at a fixed-latency boundary.
Fix
Move timestamping to PHY/near-PHY fixed-latency points, isolate timing traffic from congested queues, and lock forwarding to deterministic paths.
Pass criteria
Across the traffic sweep, phase shift ≤ X ps and any load-correlated component ≤ (X/2) ps after the timestamp point is fixed.
SFP module change breaks calibration—what must be version-controlled?
Likely cause
Calibration parameters (ΔTx/ΔRx, temp coefficients, baseline fiberDelay0) are not bound to module identity, so a swap silently invalidates one-way accuracy.
Quick check
Compare SFP serial/vendor/DOM fields to the calibration record; if identity mismatch exists, treat the link as “uncalibrated” until re-certified.
Fix
Version-control calibration datasets (device serial, port, firmware, date, model) and enforce a replacement workflow that triggers recalibration and acceptance tests.
Pass criteria
After swap + recalibration, one-way error ≤ X ps and repeatability ≤ X ps across replug/restart, with calibration record traceable to the new identity.
Phase error is periodic—what is the first “beat frequency / DMTD” sanity check?
Likely cause
A small frequency offset or mixing/aliasing artifact produces a beat note that appears as periodic phase modulation.
Quick check
Measure the period P of the phase ripple and compare 1/P to known offsets (reference differences, DMTD beat frequency, PLL update rate, or servo sampling rate).
Fix
Eliminate the offset source (reference mismatch, wrong divisor, sampling alias) or move the beat out of band (adjust sampling/loop rates) without breaking stability margins.
Pass criteria
Periodic component amplitude ≤ X ps (peak or RMS as specified) and does not reappear across temperature, traffic load, and restart acceptance tests.
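The period comparison in the quick check is easy to script. The candidate names and frequencies below are made-up examples; in practice the list would hold the known rates of the system under test.

```python
def match_ripple(period_s, candidates_hz, rel_tol=0.02):
    """Return names of known frequencies within rel_tol of the ripple frequency 1/P."""
    ripple_hz = 1.0 / period_s
    return [name for name, f_hz in candidates_hz.items()
            if abs(ripple_hz - f_hz) <= rel_tol * f_hz]


candidates = {
    "servo_update_rate": 1000.0,  # illustrative
    "ddmtd_beat": 7629.0,         # illustrative
    "reference_offset": 12.5,     # illustrative
}
print(match_ripple(0.001, candidates))
```

An empty result means the ripple does not line up with any listed rate, which shifts suspicion toward aliasing or an unlisted offset source.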