Edge Timing & Sync (PTP Timestamps, Jitter Cleaner PLL, RTC Backup)
Edge Timing & Sync is an end-device hardware time subsystem that turns a reference (1PPS/10 MHz/SyncE or local XO) into usable, consistent timestamps. It does so by controlling jitter and holdover with a PLL, enforcing monotonic time (no rollback), and keeping time through power loss with an RTC plus backup energy. If any link in that chain is weak, the field symptom is almost always the same: long-tail timestamp spikes and time jumps. The fix therefore starts with evidence at the reference, the PLL events, and timestamp consistency.
Scope & Boundary: What “time sync” means inside an edge device
“Time sync” is not a single feature. In edge hardware, it is a timing subsystem that must keep time measurable, stable, and recoverable across link changes, reference loss, and power cycles. The focus here is the hardware path that turns a reference into a usable clock and a trustworthy timestamp.
What this page covers (hardware timing subsystem)
- Reference to clock: reference input qualification, muxing, jitter cleaning, and distribution to device clock domains.
- Clock to timestamp: where timestamping happens (PHY vs MAC/TSU), and how to keep timestamps consistent and monotonic.
- Loss and recovery: holdover behavior, RTC backup power, and “no time going backwards” recovery policies.
The three outcomes to guarantee (engineering deliverables)
- Timestamp correctness: timestamps represent the real event order; no unexpected jumps; no backward time after recovery.
- Jitter & wander under control: short-term jitter is bounded; long-term drift during holdover is predictable and budgeted.
- Holdover + RTC backup: time remains valid through reference loss and power cycles; recovery is repeatable and testable.
Out of scope (kept on sibling pages)
- TSN scheduling (Qbv/Qci, traffic shaping) — only the timing hardware is covered here.
- BMCA / grandmaster selection — treated as system-level control, not hardware implementation details.
- GNSS anti-jam / RF front-end — belongs to GNSS Timing / Positioning Module pages.
- Cloud / fleet time management — operational architecture, not device timing subsystem design.
Practical boundary test: if the problem can be verified by probing reference inputs, PLL/clock outputs, timestamp behavior, or RTC backup rail, it belongs here. Otherwise, it belongs to a system/network sibling page.
Turn “sync” into acceptance criteria: accuracy, jitter, and holdover
Successful timing designs start with testable acceptance criteria. “Better sync” is ambiguous; a device needs separate targets for timestamp accuracy, short-term jitter, and holdover drift. These targets interact, but they must be specified and validated independently to avoid false confidence.
Define the requirement in three questions (fast triage)
- Timestamp alignment: do multiple devices stamp the same event within the required bound (ns or µs)?
- Frequency / wander control: does the offset grow over minutes when the reference is lost (holdover)?
- Phase noise / jitter control: is short-term jitter low enough for sampling clocks and high-speed I/O margins?
| Dimension | What to specify | Why it matters (impact path) | Minimum verification |
|---|---|---|---|
| Timestamp accuracy | Max error bound + statistic (e.g., p99/p999), plus “no backward time” policy | Event correlation, logs, multi-sensor fusion, audit trails; errors show up as mis-ordered or mis-timed events | Same-event multi-device compare; distribution over time; check for jumps after link/restart |
| Short-term jitter | Allowed jitter budget at clock outputs (qualitative threshold if no jitter analyzer) | Sampling clocks (ADC/AFE), SERDES margins, control loops; too much jitter causes noise, BER, or instability | PPS/clock edge stability checks; compare “clean vs noisy” modes; correlate with error bursts |
| Long-term wander | Allowed drift rate over minutes/hours under holdover | Offset accumulates; a system that starts aligned can diverge steadily without clear alarms | Ref-off holdover curve; measure time error vs time; identify dominant drift contributors |
| Holdover | Duration + bound: “X minutes with ≤Y error” (separate steady vs temperature change cases) | Reference outage is common in the field; robustness requires predictable degradation, not surprises | Ref removal + temperature sweep; compare against the drift budget; record recovery behavior |
Convert application language into acceptance language (examples)
- Logging & event forensics: prioritize timestamp correctness + monotonic recovery; µs-class may be acceptable but jumps are not.
- DAQ synchronous sampling: prioritize jitter + phase stability; “time-of-day” alone is insufficient.
- Motion/control coordination: prioritize wander + holdover; steady divergence is the main failure pattern.
- Multi-sensor fusion: prioritize event-alignment statistics (p99/p999), not just average offset.
Minimal acceptance pack (recommended)
- Three numbers: max event alignment error, required holdover duration, and maximum allowed time jump on recovery.
- Three tests: steady-state timestamp distribution, ref-off holdover drift curve, and power-cycle recovery monotonicity check.
A device that “locks” but fails these acceptance checks is not synchronized. Lock indicators are status signals, not proof of time quality.
From reference to usable timestamps: the end-to-end timing chain
A high-quality time system is a chain, not a single block. Every link must be responsible for a specific output: reference quality → clock conditioning → timestamp capture → time-of-day continuity. When a design fails, the fastest diagnosis is to walk the chain in order and collect evidence at defined test points.
Chain overview (four layers)
- L1 (Reference): 1PPS / 10 MHz / SyncE-as-input / local XO. The job is to provide a stable time or frequency anchor and detect invalidity.
- L2 (Clock tree): qualify → mux → PLL/jitter cleaner → fanout. The job is to produce clean, continuous clocks for all timing domains.
- L3 (Timestamp path): capture point (PHY/MAC/TSU) → insert/report. The job is to stamp events deterministically with minimal uncertainty.
- L4 (ToD/RTC): write → keep → restore. The job is to preserve continuity across power/reference loss and prevent backward time.
Measurement points (evidence hooks used later in debugging)
- Reference input quality: missing pulses, glitches, edge stability, frequency offset trends.
- PLL status + events: lock/unlock, ref switch, holdover entry/exit, relock time.
- Clock output behavior: continuity during switchovers, jitter/wander indicators (direct or proxy).
- Timestamp consistency: outliers, jumps, monotonicity after restarts and ref changes.
- RTC drift + backup rail: actual retention time, drift vs temperature, recovery “jump limit”.
Engineering rule: do not treat “locked” as proof. Treat it as a hint, then validate time quality at the test points (TPs) with repeatable checks.
PHY timestamp vs MAC/TSU timestamp: where uncertainty is created
Two boards can both claim IEEE-1588 support yet deliver very different results. The main differentiator is the timestamp capture point and the uncertainty contributors between the wire and that capture point. The closer the capture point is to the physical interface, the fewer internal variability sources exist.
High-level contrast (what changes in practice)
- PHY timestamp: capture occurs near the wire; fewer internal path delays leak into the timestamp; better for tight tails (p99/p999).
- MAC/TSU timestamp: capture occurs deeper in the device; easier integration; more sensitive to internal clock-domain and path effects.
Uncertainty contributors (error terms) and how they appear
| Contributor | Why it happens | Typical symptom | Fast check |
|---|---|---|---|
| Clock-domain crossing (CDC) | Timestamp capture and reporting live in different clock domains; edge capture + transfer adds quantization and variability | Random outliers even at steady load; occasional “spikes” not correlated with traffic | Hold traffic constant; check if outliers persist |
| Variable latency (queue/IRQ/driver) | Software time capture or delayed reporting introduces load-dependent delay variability | Good average but poor tails; degrades under higher CPU/traffic load | Increase load; observe tail widening (p99/p999) |
| RX/TX asymmetry | Transmit and receive paths do not match (different pipeline delays, different corrections) | Direction-dependent bias; offset shifts when link/path changes | Compare behavior across directions and link states |
| Clock tree discontinuity | Reference switching or relock causes phase discontinuity that leaks into timestamps | Step changes (“jumps”) aligned with ref switch/relock events | Correlate jumps with PLL/ref event log |
Selection decision tree (practical)
- Need ns-class alignment or tight tails: prioritize PHY timestamp or a tightly integrated TSU with controlled CDC and event logging.
- Need µs-class log alignment: MAC/TSU timestamp can be acceptable if monotonicity, recovery jump limits, and tail behavior are validated.
- Any requirement level: require a way to detect and record ref switch / relock / holdover events; timestamps must be auditable.
A common failure mode is “reasonable average, unacceptable tail.” Always validate p99/p999 and correlate outliers with CDC, load, and ref events.
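The "reasonable average, unacceptable tail" pattern can be made concrete with a few lines of Python. This is a minimal sketch, assuming timestamp errors are already available as a list of nanosecond values; the function name `tail_stats`, the nearest-rank percentile method, and the synthetic data are illustrative choices, not part of any timing library.

```python
def tail_stats(errors_ns):
    """Summarize absolute timestamp errors: mean, p50, p99, p999, max."""
    data = sorted(abs(e) for e in errors_ns)
    if not data:
        raise ValueError("no samples")

    def nearest_rank(num, den):
        # Nearest-rank percentile with integer math: rank = ceil(n * num / den).
        rank = max(1, -(-len(data) * num // den))
        return data[rank - 1]

    return {
        "mean": sum(data) / len(data),
        "p50": nearest_rank(50, 100),
        "p99": nearest_rank(99, 100),
        "p999": nearest_rank(999, 1000),
        "max": data[-1],
    }


# Synthetic example (hypothetical numbers): 0.2% of samples are 5 us spikes.
samples = [20] * 9980 + [5000] * 20
stats = tail_stats(samples)
print(stats)
```

With this synthetic data the mean and p99 still look healthy while p999 exposes the spikes, which is why the tail metrics need to be reported alongside the average.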
Why “Locked” is not “Good”: loop bandwidth, jitter transfer, and switching transients
A jitter-cleaner PLL is a noise-shaping system, not a simple “clock lock” indicator. Lock status confirms a control loop is active, but it does not prove the output clock is quiet, continuous, or predictable during reference loss. Time quality depends on how the PLL is configured and how it behaves during reference switching and holdover.
The three engineering knobs (what actually determines time quality)
- Loop BW: decides how much reference noise leaks to the output and how fast the loop can track ref changes.
- Jitter transfer: describes what the PLL passes vs cleans; at offset frequencies inside the loop bandwidth the output follows the reference, and outside it the output is dominated by the VCO (local oscillator).
- Switching transient: ref switch/relock events can create phase steps or frequency bumps that show up as timestamp outliers.
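The interaction between loop bandwidth and jitter transfer can be illustrated with an idealized first-order loop model. This is a simplifying assumption: real jitter cleaners are higher order and can show peaking near the loop bandwidth, so the sketch below only shows the trend. The function names are hypothetical.

```python
import math


def ref_jitter_gain(offset_hz, loop_bw_hz):
    """First-order low-pass: share of reference jitter that reaches the output."""
    x = offset_hz / loop_bw_hz
    return 1.0 / math.sqrt(1.0 + x * x)


def vco_jitter_gain(offset_hz, loop_bw_hz):
    """Complementary high-pass: share of local-oscillator jitter that reaches the output."""
    x = offset_hz / loop_bw_hz
    return x / math.sqrt(1.0 + x * x)


# Compare a narrow and a wide loop at a few jitter offset frequencies.
for bw in (10.0, 1000.0):
    for offset in (1.0, 100.0, 10000.0):
        print(
            f"BW={bw:7.1f} Hz offset={offset:8.1f} Hz "
            f"ref={ref_jitter_gain(offset, bw):.3f} "
            f"vco={vco_jitter_gain(offset, bw):.3f}"
        )
```

In this model a narrower loop rejects more reference noise at a given offset but leans harder on the local oscillator, which matches the trade-off behind the "too wide" and "too narrow" failure patterns.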
Jitter-cleaner vs clock generator (practical boundary)
- Jitter-cleaner PLL: prioritizes clock cleanliness, holdover, and auditable ref switching for time-sensitive subsystems.
- Clock generator: prioritizes frequency synthesis and fanout; it can produce the right frequencies without guaranteeing time-quality tails.
Common failure patterns (symptom → likely cause → engineering action)
| Symptom | Likely cause | Engineering action | Evidence to collect |
|---|---|---|---|
| Locked, but tails are bad | Loop BW too wide; reference noise is passed into the output | Narrow BW or enable stronger cleaning; qualify reference inputs and record ref-quality events | TP: output TIE trend; correlate tail spikes with ref-quality changes |
| Slow lock / unstable during ref switch | Loop BW too narrow; loop cannot track ref changes quickly enough | Increase BW or use staged behavior (fast reacquire then settle); review switch policy | PLL event log: relock time, switch timestamps, unlock bursts |
| Time jumps after ref loss / recovery | Holdover mode mismatched to oscillator quality (freeze vs flywheel vs local ref switch) | Define holdover acceptance (X minutes ≤ Y error) and select holdover strategy accordingly | Holdover drift curve; recovery jump limit; monotonicity check |
| Periodic “spikes” even at steady load | Switching transient / CDC interactions or hidden ref toggling | Audit ref mux policy; ensure transient is bounded; log every ref switch/holdover event | Event log + timestamp outlier correlation; TP at ref and PLL output |
Proof strategy (status bits are insufficient)
- Required: PLL event history (lock/unlock, ref switch, holdover entry/exit, relock time).
- Required: output quality via proxy metrics when lab gear is limited (e.g., clock period stability, TIE trend, outlier rate).
- Optional: phase-noise / jitter analyzer confirmation for final sign-off (same targets, higher resolution).
Acceptance mindset: a PLL can remain locked while violating jitter tails, creating transient steps, or drifting beyond holdover limits. Time quality must be proven at the output and correlated with ref/PLL events.
XO vs TCXO vs OCXO vs CSAC: picking the holdover baseline without drifting into GNSS RF
Oscillator selection is best driven by holdover acceptance: when the primary reference is lost, the device must stay within a defined error budget for a defined duration. The oscillator defines the baseline for wander, temperature drift, and aging; the PLL/clock tree can only shape or distribute what the local reference can support.
Start with a holdover budget (turn requirements into a selection gate)
- Define: “Ref lost → within Y time error for X minutes” (include expected temperature change).
- Then choose: oscillator class that can plausibly meet the drift budget under temperature, aging, and vibration constraints.
- Finally verify: holdover drift curve + recovery jump limit (monotonic behavior).
Practical boundaries (what each oscillator class is good at)
- XO: lowest cost; larger temperature drift and aging. Suitable for low holdover demands or frequent re-discipline.
- TCXO: improved temperature stability with low power. Common choice for edge devices that need practical holdover.
- OCXO: strong short-term stability and phase-noise performance, but higher power, warm-up time, and volume.
- CSAC (when truly needed): strong long-term stability for extended reference outages, at higher cost and integration constraints.
Key parameters (how to read them as system impacts)
- Temperature drift: dominates holdover when ambient changes; focus on curve shape, not a single number.
- Aging: sets long-holdover baseline drift; critical for long outages and long calibration intervals.
- Phase noise: impacts jitter-sensitive domains (sampling clocks, high-speed links) and can drive tail behavior.
- g sensitivity: matters in vibration/portable environments; frequency shifts can appear as timing noise.
- Warm-up: can create “good only after minutes” behavior; treat as a requirement, not a surprise.
Qualitative comparison (use for early architecture choices)
| Type | Holdover drift | Phase-noise / jitter | Power / warm-up | Typical edge fit |
|---|---|---|---|---|
| XO | weak under temp/aging | varies; usually moderate | best power; no warm-up | low requirement, frequent discipline |
| TCXO | good practical drift | good enough for many | low power; minimal warm-up | common edge holdover baseline |
| OCXO | strong short-term | strong (quiet) | high power; warm-up required | high-end sync, DAQ, tight tails |
| CSAC | strong long-term | varies; often good | higher cost; integration tradeoffs | extended outages with strict drift limit |
Scope guard: GNSS anti-jamming and antenna/RF front-end design is out of scope here (covered by the GNSS Timing / Positioning Module page).
Holdover drift budgeting: how long time stays “within spec” after reference loss
Holdover is only meaningful when written as an acceptance statement that can be calculated and verified: after reference loss, time error stays within ±E for T minutes (under a defined temperature profile). This section turns that statement into a drift budget and a validation loop.
Holdover is a sum of error contributors (what must be budgeted)
- Oscillator stability: initial frequency offset and short-term wander define the starting slope of time error.
- Temperature trajectory: drift follows temperature over time; curve shape matters more than a single spec number.
- Aging: long-holdover baseline drift sets the floor for extended outages and long calibration intervals.
- Discipline strategy: pre-loss training reduces the residual frequency error at the moment holdover starts.
Write the budget in the same unit used for acceptance: time error TE(t)
The primary evidence is the time error curve TE(t). Treat it as the scorecard: if |TE(T)| ≤ E, the system passes for the defined temperature profile.
- Linear TE(t): constant frequency offset dominates (slope stays roughly constant).
- Curved TE(t): temperature drift or compensation changes the slope over time.
- Piecewise TE(t): mode transitions (holdover entry, relock, ref switching) introduce slope changes or steps.
Back-calculate the allowed frequency error from the acceptance statement
For an initial gate, convert time error to relative frequency error:
| Acceptance target | Derived gate | How it is used |
|---|---|---|
| After ref loss: ±E time error within T | Avg. \|Δf/f\| ≤ E/T (convert to ppm) | Filters oscillator class and sets the holdover margin before lab tests |
| Temperature changes during holdover | Reserve margin for temp curve + compensation limits | Ensures the “E/T” gate is not consumed by environmental drift |
| Extended outage or long intervals | Aging budget (ppm over time horizon) | Defines recalibration interval and required oscillator grade |
Evidence priority rule: first confirm TE(t) stays inside the acceptance envelope; only then use deeper metrics (e.g., Allan) to explain residuals.
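The E/T gate is simple arithmetic, sketched below in Python. The function names and the example targets (1 ms or 1 µs over 10 minutes) are hypothetical values chosen for illustration; the result is an average allowance and leaves no margin for temperature or aging unless that margin is subtracted separately.

```python
def allowed_avg_freq_error_ppm(max_time_error_s, holdover_s):
    """Average fractional frequency error allowed by 'within E after T', in ppm."""
    if holdover_s <= 0:
        raise ValueError("holdover duration must be positive")
    return max_time_error_s / holdover_s * 1e6


def time_error_s(freq_error_ppm, elapsed_s):
    """Time error accumulated by a constant frequency offset."""
    return freq_error_ppm * 1e-6 * elapsed_s


# Hypothetical targets: hold 1 ms, then 1 us, over a 10-minute outage.
for e_s in (1e-3, 1e-6):
    gate = allowed_avg_freq_error_ppm(e_s, 600.0)
    print(f"E={e_s:g} s over 600 s -> avg |df/f| <= {gate:.6f} ppm")
```

The two cases differ by three orders of magnitude in the required oscillator stability, which is why the gate is worth computing before choosing an oscillator class.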
Verification loop (ref cut + thermal sweep)
- Ref cut test: disconnect the reference, record TE(t) and holdover/relock events, then check |TE(T)| ≤ E.
- Thermal sweep: repeat with a controlled temperature trajectory; compare TE(t) envelopes and compensation effectiveness.
- Correlation: annotate TE(t) with event timestamps (ref loss, holdover entry, ref switch, relock) to attribute slope changes or steps.
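The ref-cut evaluation can be sketched as a small TE(t) analysis: fit the average slope, look for steps between consecutive samples, and check the envelope against E. This is a minimal sketch under stated assumptions: TE is sampled in nanoseconds at known times in seconds, and the step threshold and function names are hypothetical.

```python
def fit_slope_ns_per_s(times_s, te_ns):
    """Least-squares slope of TE(t). 1 ns/s equals 1 ppb of frequency offset."""
    n = len(times_s)
    mean_t = sum(times_s) / n
    mean_te = sum(te_ns) / n
    sxx = sum((t - mean_t) ** 2 for t in times_s)
    sxy = sum((t - mean_t) * (e - mean_te) for t, e in zip(times_s, te_ns))
    return sxy / sxx


def find_steps(times_s, te_ns, step_threshold_ns):
    """Return (time, delta) for sample-to-sample changes larger than the threshold."""
    steps = []
    for i in range(1, len(te_ns)):
        delta = te_ns[i] - te_ns[i - 1]
        if abs(delta) > step_threshold_ns:
            steps.append((times_s[i], delta))
    return steps


def holdover_passes(times_s, te_ns, limit_ns, step_threshold_ns):
    """Pass only if TE stays inside +/-limit and no step is detected."""
    inside = all(abs(e) <= limit_ns for e in te_ns)
    return inside and not find_steps(times_s, te_ns, step_threshold_ns)


# Synthetic 10-minute run sampled every 60 s with a constant 2 ppb offset.
times = list(range(0, 601, 60))
te = [2 * t for t in times]
print(fit_slope_ns_per_s(times, te), holdover_passes(times, te, 1500, 500))
```

The step threshold has to sit above the largest sample-to-sample change expected from pure drift at the chosen sampling interval; otherwise ordinary slope is reported as steps.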
RTC and supercap backup: keeping time through power loss without collapsing the main rail
RTC backup is a power-domain design: it must keep the RTC domain alive through power loss while preventing charging inrush, backfeed paths, and leakage from defeating the backup-time target. A correct design is defined by an effective voltage window, a total backup current, and an auditable startup recovery that avoids time rollback.
RTC selection checklist (what matters for holdover + recovery)
- Backup current: the dominant term in backup-time estimation; measure worst-case, not typical.
- Temperature behavior: drift across the expected temperature range; consider calibration/trim registers.
- Clock source: 32 kHz crystal vs integrated oscillator (power, drift curve, startup repeatability).
- Calibration registers: enables writing measured offset back into RTC for improved holdover alignment.
Backup chain blocks (charge limit → OR-ing → domain isolation)
- Charge limiter: prevents cold-start supercap inrush from drooping the main rail.
- Ideal diode / OR-ing: seamless switchover while blocking reverse current between main and backup.
- Domain isolation: prevents backfeed through IO/ESD structures and hidden rails.
Three common field failures (symptom → likely cause → fix + evidence)
| Symptom | Likely cause | Engineering action | Evidence |
|---|---|---|---|
| Backup time too short | Supercap leakage/ESR + underestimated total backup current | Budget I_total (RTC + leakage + OR-ing leakage); validate effective voltage window | TP-BACKUP_V curve; leakage isolation test |
| Main rail droops at cold start | Supercap behaves like a short; missing or weak inrush limiting | Add charge limiter/soft-start; stage charging if needed | TP-INRUSH current; rail dip waveform |
| Weird power paths / partial power | Backfeed through IO/ESD or OR-ing path into RTC domain | Audit isolation; ensure reverse blocking and domain separation | Reverse current check; unexpected “alive” rails |
Backup-time estimation (use the effective voltage window, not the full capacitor)
A practical first estimate uses the usable RTC voltage window: t ≈ C · (V_hi − V_lo) / I_total
- V_hi/V_lo: RTC domain usable range (depends on RTC + OR-ing drop + isolation elements).
- I_total: RTC backup current + supercap leakage + OR-ing leakage + board leakage (contamination can dominate).
- Reality check: always confirm with the power-off timer test and compare to the estimate to locate hidden leakage.
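The first-estimate formula above translates directly into code. The sketch below assumes a constant total current over the voltage window, as the formula does; the function name and all component values are hypothetical and only show the arithmetic.

```python
def backup_time_s(cap_f, v_hi, v_lo, i_total_a):
    """Estimate t = C * (V_hi - V_lo) / I_total for a constant-current discharge."""
    if v_hi <= v_lo:
        raise ValueError("V_hi must be greater than V_lo")
    if i_total_a <= 0:
        raise ValueError("I_total must be positive")
    return cap_f * (v_hi - v_lo) / i_total_a


# Hypothetical values: 0.22 F supercap, 3.0 V down to 1.8 V usable window.
i_rtc = 3e-6       # RTC backup current (A)
i_cap_leak = 2e-6  # supercap leakage (A)
i_other = 1e-6     # OR-ing plus board leakage (A)
t_s = backup_time_s(0.22, 3.0, 1.8, i_rtc + i_cap_leak + i_other)
print(f"estimated backup time: {t_s / 3600.0:.1f} h")
```

If the measured power-off time is much shorter than this estimate, the difference is a direct measure of leakage that the budget missed.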
Validation (power-off timer + cold-start recovery + monotonicity)
- Power-off timer: remove power, measure how long RTC stays valid and how much it drifts.
- Cold-start recovery: verify time reconstruction does not cause excessive jump on boot.
- Monotonicity check: confirm time does not go backward; bound the allowed correction step.
Acceptance mindset: the backup domain passes only if backup time meets target and recovery preserves monotonic time behavior.
Reference switching and relock recovery: ref mux, glitchless handover, and time-jump governance
Reference switching becomes unstable when three layers are mixed: reference qualification, clock-loop behavior, and time-of-day mapping. A robust design separates responsibilities: qualify inputs, execute a controlled handover, then govern time corrections with monotonic rules.
Typical switch scenarios (what triggers a handover)
- Ref loss: missing PPS pulses, missing 10 MHz, or SyncE lock loss events.
- Ref degrade: growing jitter, phase steps, or intermittent pulses that still look “present”.
- Anti-flap rule: apply hysteresis and minimum dwell time before switching again.
Ref qualification + ref mux (separate “decide” from “execute”)
- Qualification: detect loss, count missing pulses, track phase stability, and produce a coarse GOOD / WARN / BAD score.
- Decision: the policy selects a target reference using hysteresis and dwell time.
- Execution: the ref mux performs the handover and records the event timestamp.
Design intent: ref mux should not “hunt.” It follows a policy and produces auditable switch events.
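The separation of qualification, decision, and execution can be sketched as a small policy object. This is an illustrative sketch, not a vendor API: the class `RefSelector`, the reference names, and the defaults are hypothetical. Hysteresis is modeled here as "a better candidate must win several consecutive evaluations", combined with a minimum dwell time since the last switch.

```python
from dataclasses import dataclass, field
from enum import IntEnum
from typing import Dict, List, Optional, Tuple


class Quality(IntEnum):
    BAD = 0
    WARN = 1
    GOOD = 2


@dataclass
class RefSelector:
    active: str
    min_dwell_s: float = 10.0   # minimum time between switches
    confirm_count: int = 3      # consecutive wins required (hysteresis)
    last_switch_s: float = float("-inf")
    events: List[Tuple[float, str, str]] = field(default_factory=list)
    _candidate: Optional[str] = None
    _streak: int = 0

    def update(self, now_s: float, scores: Dict[str, Quality]) -> str:
        # Ties favor the active reference so the mux does not hunt.
        best = max(scores, key=lambda name: (scores[name], name == self.active))
        if scores[best] <= scores[self.active]:
            self._candidate, self._streak = None, 0
            return self.active
        if best == self._candidate:
            self._streak += 1
        else:
            self._candidate, self._streak = best, 1
        dwell_ok = now_s - self.last_switch_s >= self.min_dwell_s
        if self._streak >= self.confirm_count and dwell_ok:
            self.events.append((now_s, self.active, best))  # auditable switch event
            self.active = best
            self.last_switch_s = now_s
            self._candidate, self._streak = None, 0
        return self.active


sel = RefSelector(active="pps")
for t in range(3):
    sel.update(float(t), {"pps": Quality.WARN, "synce": Quality.GOOD})
print(sel.active, sel.events)
```

A real policy would also need a rule for a reference that is lost outright (entering holdover rather than waiting out the dwell time); that case is left out of this sketch.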
Three continuity layers (often confused, with different hardware requirements)
| Continuity goal | What it means | Engineering implications |
|---|---|---|
| Glitchless | No short pulses or missing edges during the switchover | Switch on a safe boundary; gate/align the mux control; verify with PPS/clock waveform |
| Frequency-continuous | Output frequency does not step abruptly at handover | DPLL slews or flywheels through transition; “lock” alone is not proof—settle window matters |
| Phase-continuous | Phase does not exhibit a step; hardest target | Requires phase alignment/phase accumulator continuity; stricter constraints and longer validation |
Relock recovery as a state machine (make transitions observable)
- LOCKED → HOLDOVER: reference quality drops below threshold; log holdover_enter.
- HOLDOVER → REF_SWITCH: policy selects the next best reference; log switch_event.
- REACQUIRE → SETTLE: DPLL relocks; output quality must pass a settle window before declaring stable.
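The transitions above can be made observable with a table-driven state machine that logs every state change. This is a minimal sketch: the event names `holdover_enter` and `switch_event` come from the list above, while the trigger names, the other event names, and the REF_SWITCH → REACQUIRE and SETTLE → LOCKED transitions are assumptions added to close the loop.

```python
from enum import Enum


class State(Enum):
    LOCKED = "LOCKED"
    HOLDOVER = "HOLDOVER"
    REF_SWITCH = "REF_SWITCH"
    REACQUIRE = "REACQUIRE"
    SETTLE = "SETTLE"


# (current state, trigger) -> (next state, logged event name)
TRANSITIONS = {
    (State.LOCKED, "ref_below_threshold"): (State.HOLDOVER, "holdover_enter"),
    (State.HOLDOVER, "ref_selected"): (State.REF_SWITCH, "switch_event"),
    (State.REF_SWITCH, "mux_done"): (State.REACQUIRE, "reacquire_start"),  # assumed
    (State.REACQUIRE, "dpll_locked"): (State.SETTLE, "settle_start"),
    (State.SETTLE, "settle_passed"): (State.LOCKED, "stable_declared"),    # assumed
}


class TimingFsm:
    def __init__(self):
        self.state = State.LOCKED
        self.log = []  # (time_s, from_state, to_state, event)

    def on_trigger(self, now_s, trigger):
        """Apply a trigger; triggers not valid in the current state are ignored."""
        key = (self.state, trigger)
        if key not in TRANSITIONS:
            return self.state
        next_state, event = TRANSITIONS[key]
        self.log.append((now_s, self.state.value, next_state.value, event))
        self.state = next_state
        return next_state


fsm = TimingFsm()
for t, trig in enumerate(
    ["ref_below_threshold", "ref_selected", "mux_done", "dpll_locked", "settle_passed"]
):
    fsm.on_trigger(float(t), trig)
for entry in fsm.log:
    print(entry)
```

Keeping the log inside the state machine means every later timestamp anomaly can be checked against a transition time instead of a status bit.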
Time-jump governance (hardware/firmware rules only)
- Monotonic rule: time must not go backward (no rollback), even during correction.
- Jump limit: cap the maximum correction step; large jumps must be explicitly marked.
- Step vs slew:
  - Step: fast alignment but produces a visible timestamp jump (must be recorded).
  - Slew: gradual convergence by controlled frequency offset (preferred for control/sampling continuity).
Boundary reminder: only time mapping and correction rules are covered here—no network selection algorithms are expanded.
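The governance rules can be sketched as a single decision function. The policy below is one possible reading, stated as an assumption: backward corrections are always slewed (never stepped), small forward corrections are slewed, larger forward corrections are stepped, and any step beyond the jump limit is marked. The names `plan_correction`, `Correction`, and all thresholds are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Correction:
    mode: str               # "slew" or "step"
    step_ns: int            # applied immediately; never negative
    slew_duration_s: float  # time to converge at the maximum slew rate
    large_jump: bool        # True when a step exceeds the jump limit


def plan_correction(offset_ns, slew_threshold_ns, jump_limit_ns, max_slew_ppm):
    """offset_ns = reference time minus local time (negative: local clock is ahead)."""
    rate_ns_per_s = max_slew_ppm * 1000.0  # 1 ppm corresponds to 1000 ns per second
    if offset_ns < 0 or offset_ns <= slew_threshold_ns:
        # Monotonic rule: a clock that is ahead is slowed down, not stepped back.
        return Correction("slew", 0, abs(offset_ns) / rate_ns_per_s, False)
    return Correction("step", offset_ns, 0.0, offset_ns > jump_limit_ns)


# Hypothetical limits: slew below 100 us, mark steps above 1 ms, slew at 100 ppm.
for offset in (-500_000, 50_000, 500_000, 2_000_000_000):
    print(offset, plan_correction(offset, 100_000, 1_000_000, 100.0))
```

Whatever policy is chosen, the returned record is what should be written to the event log so that every visible jump in the timestamps has a matching entry.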
Field triage: the three evidence classes to quickly isolate reference, PLL, or timestamp issues
Fast diagnosis starts with a strict evidence order. The goal is not to “tune PTP,” but to localize the failure to one of three hardware-visible layers: reference input, PLL/clock tree, or timestamp consistency. Each layer has a fastest tool and a minimal proof method.
The forced order (do not swap steps)
Check the reference input first, then the PLL/clock tree, then timestamp consistency. Reason: if the reference is unstable, downstream jitter and timestamp outliers are symptoms, not root causes.
Evidence class #1 — reference input (PPS / 10 MHz / SyncE)
- 1PPS: verify missing pulses, phase steps, and widened jitter (scope or logic analyzer).
- 10 MHz: verify continuity and gross stability (counter trend; avoid deep RF analysis here).
- SyncE: check lock/alarm events and correlate to observed time anomalies.
If the reference is not trustworthy, stop and fix the input path before analyzing timestamps.
Evidence class #2 — PLL / clock tree (lock is not enough)
- Must log: lock/unlock, holdover entry/exit, ref switch events, relock time, and settle window outcome.
- Must correlate: time anomalies that align with switch or relock boundaries point to loop transition behavior.
- Practical observation: when phase noise tools are unavailable, use time-error trends and outlier bursts as a substitute indicator.
Evidence class #3 — timestamp consistency (only after ref + PLL pass)
- Same event, multiple timestamps: compare capture points (e.g., PPS capture vs TSU record vs software log).
- Check monotonicity: detect any backward time step (hard failure).
- Check outliers: bursts of spikes suggest capture/CDC boundary issues.
- Check persistent offset: stable, repeatable offset indicates fixed path delay or capture-point mismatch.
Minimal tool mapping (within this page boundary)
| Tool | Fastest target | What it proves |
|---|---|---|
| Scope / logic analyzer | 1PPS stability, phase steps, missing pulses | Confirms reference presence and gross quality; catches switch-induced glitches |
| Counter / frequency trend | 10 MHz or derived clock drift | Shows frequency offset and slow drift that drives TE(t) slope during holdover |
| Software logger | events + timestamp comparisons | Auditable correlation: switch/relock boundaries vs timestamp outliers and monotonicity |
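
The software-logger correlation can be sketched as a simple time-window match between timestamp outliers and logged events. This is a minimal sketch: the function names, the event names, and the ±1 s window are hypothetical, and both logs are assumed to share one time base.

```python
def attribute_outliers(outlier_times_s, events, window_s):
    """For each outlier time, list the names of events within +/- window_s."""
    result = []
    for t in outlier_times_s:
        nearby = [name for (event_t, name) in events if abs(t - event_t) <= window_s]
        result.append((t, nearby))
    return result


def unexplained_outliers(outlier_times_s, events, window_s):
    """Outliers with no nearby event: candidates for capture/CDC investigation."""
    return [t for t, names in attribute_outliers(outlier_times_s, events, window_s) if not names]


# Hypothetical logs: two PLL events and three timestamp outliers.
events = [(100.0, "ref_switch"), (100.4, "relock"), (250.0, "holdover_enter")]
outliers = [100.2, 180.0, 250.5]
print(attribute_outliers(outliers, events, 1.0))
print(unexplained_outliers(outliers, events, 1.0))
```

Outliers that line up with a switch or relock point to loop transition behavior, while outliers with no nearby event push the investigation toward the timestamp path.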
Validation Test Plan: Turn “sync quality” into repeatable tests
The goal is to convert “correct timestamps, controlled jitter, and recoverable operation after reference loss/power events” into an executable test matrix. Every test case has input conditions, hardware-first observation points, a data logging template, and clear pass/fail criteria, so it can be used for R&D acceptance, production sampling, and fast field attribution (reference issues / PLL issues / timestamp path issues / backup power issues).
1) Test matrix (T1–T5) and deliverables
For each test, produce the same “evidence bundle”: raw logs (CSV/register/event logs), statistical summaries (p50/p99/p999/min/max), waveforms/screenshots, and a final decision (Pass/Fail + root-cause tag).
| Test | Stimulus / conditions | Observation taps (hardware-first) | Pass criteria (template) |
|---|---|---|---|
| T1 Steady-state | Stable reference, normal lock, room temperature | 1PPS phase jitter; output clock jitter (or proxy); timestamp error distribution | p99 TS_err ≤ X and p999 ≤ Y; no outlier spikes; stable lock state |
| T2 Holdover | Reference/link loss → enter holdover | Time error curve TE(t); frequency offset/phase drift; mode switch points | \|TE(T)\| ≤ E (T minutes/hours) and no steps; after recovery, no rollback |
| T3 Temperature | Chamber sweep (with ramp + dwell) | Thermal drift and compensation; lock margin; TE vs temperature | dTE/dT ≤ K; T1/T2 thresholds met across the specified temperature range |
| T4 Power disturbance | Brownout / hot-plug / reset disturbance | PLL unlock/relock time; TSU timestamp spikes; any time rollback | relock ≤ R; monotonic time (no rollback); complete event logs |
| T5 RTC+Supercap backup | Power-off → backup domain only → power-on | Backup duration; charge inrush; post-restore time continuity (step/rollback) | backup ≥ H; inrush does not droop the main rail; restore is continuous or controlled stepping |
2) Unified test architecture: fix the “stimuli + taps + logging format”
For repeatability, split the setup into two layers: the stimulus layer (reference/temperature/power/backup-off) and the observation layer (PPS/clocks/timestamps/RTC domain). When boards or components change, keep the stimulus layer unchanged and only swap the DUT.
- Stimulus layer: programmable reference input (1PPS/10MHz or SyncE), temperature chamber, controllable brownout/hot-plug fixture, and a power-off backup fixture (supercap domain).
- Observation layer: PPS phase (scope/counter), PLL lock state (registers + event logs), timestamp consistency (same event cross-point comparison), and RTC-domain voltage/current + backup duration.
- Logging format: a unified CSV schema + fixed statistics windows (e.g., 1 s / 10 s / 60 s) + tail metrics (p99/p999).
3) T1 — Normal reference: steady-state timestamp distribution and jitter baseline
T1 does one thing: build a statistical “healthy profile”. Every anomaly later should be compared against the T1 baseline (tail degradation, more spikes, or lock-state instability).
- Conditions: stable reference, lock complete, room temperature; fixed load/traffic (avoid uncertain queue behavior).
- Taps: PPS phase jitter; period jitter at key clock points (or a proxy); timestamp deltas for the same event observed at multiple points.
- Outputs: TS_err distribution (p50/p99/p999/max); outlier rate; lock/switch logs are empty or stable.
- Criteria template: p99 ≤ X and p999 ≤ Y; spike amplitude max ≤ Z; spike rate ≤ N/hour.
4) T2 — Reference loss: holdover drift curve TE(t)
The core evidence for holdover is the time error curve TE(t). Use TE(t) to capture the dominant drift first, then decide whether deeper phase-noise/Allan analysis is necessary.
- Stimulus: after steady lock, disconnect the reference input or simulate link loss; keep the DUT running.
- Logging: sample TE(t) every Δt; also log temperature, PLL mode, and frequency-offset estimates.
- Criteria template: |TE(T)| ≤ E (T minutes/hours); the curve must be continuous with no steps; after reference returns, no rollback is allowed.
5) T3 — Temperature sweep: drift/compensation, lock margin, and the degradation knee
The point is not merely “it still runs”, but to find the degradation knee: at what temperature or temperature slope does lock jitter rise, do timestamp tails worsen, or does relock slow down.
- Profile: dwell points + ramp sweeps; compare behavior before/after thermal stabilization.
- Evidence: TE vs T; lock state and relock time; whether TS_err p99/p999 degrade with temperature.
- Criteria template: dTE/dT ≤ K; T1/T2 thresholds still met across the target temperature range.
6) T4 — Power disturbance/hot-plug: unlock/relock and time monotonicity
Power events often create “random-looking” timestamp spikes and rollback risk. This test forces checks for: any rollback, predictable relock, and whether event logs can close the loop.
- Stimulus: controlled dips, brief interruptions, hot-plug, reset; cover both main power and clock/RTC-only rails.
- Taps: PLL lock→unlock→relock time; TSU timestamp spikes; whether system time rolls backward.
- Criteria template: relock ≤ R; no rollback; every anomaly maps to a logged reference/power/PLL/timestamp-path cause.
7) T5 — RTC + supercap: backup duration, charge inrush, and restore consistency
A backup path must both “last long enough” and “not collapse the main rail during charging”. T5 therefore validates backup duration, charge inrush, backfeed paths, and time consistency after restore.
- Power-off timing: disconnect the main rail and keep only the RTC domain; record backup duration (backup ≥ H).
- Inrush evidence: cold-start charge peak current and main-rail droop; verify no UV/PG false triggers.
- Restore consistency: on power return, verify no rollback/unexplained large jump; controlled stepping is acceptable, unexplained steps are not.
8) Logging template (CSV field suggestions) and final report structure
The more uniform the schema, the higher the cross-project reuse. Put “environment, reference, lock state, time error, timestamp distribution, and backup domain” on the same row so scripts can generate summary plots and decisions directly.
- Base: test_id, dut_rev, fw_rev, timestamp_utc, run_id.
- Environment: temp_c, vin_main_v, vin_rtc_v, load_state.
- Reference: ref_sel (PPS/10MHz/SyncE/XO), ref_ok, ref_loss_count.
- PLL: pll_lock, mode (lock/holdover/relock), relock_ms, alarm_flags.
- Time error: te_ns (instant), te_ns_max, te_ns_slope.
- Timestamp stats: ts_err_p50/p99/p999/max, outlier_rate, rollback_flag.
- Backup: backup_elapsed_s, charge_inrush_a (peak), rtc_drift_ppm_est.
Report structure: for every test case, keep four fixed sections: Condition → Evidence (waveforms/logs) → Stats (p99/p999/TE curve) → Decision (Pass/Fail + root-cause tag).
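One way to keep the schema uniform is to fix the column list in a single place and reject rows that carry unknown fields. The sketch below uses Python's standard `csv` module. Expanding `ts_err_p50/p99/p999/max` into four separate columns is an assumption about how the shorthand in the list should be read, and `write_rows` is a hypothetical helper name.

```python
import csv
import io

# Column order follows the field groups listed above.
FIELDS = [
    "test_id", "dut_rev", "fw_rev", "timestamp_utc", "run_id",      # base
    "temp_c", "vin_main_v", "vin_rtc_v", "load_state",              # environment
    "ref_sel", "ref_ok", "ref_loss_count",                          # reference
    "pll_lock", "mode", "relock_ms", "alarm_flags",                 # PLL
    "te_ns", "te_ns_max", "te_ns_slope",                            # time error
    "ts_err_p50", "ts_err_p99", "ts_err_p999", "ts_err_max",        # timestamp stats
    "outlier_rate", "rollback_flag",
    "backup_elapsed_s", "charge_inrush_a", "rtc_drift_ppm_est",     # backup
]


def write_rows(stream, rows):
    """Write dict rows with a fixed header.

    Missing fields are left empty; unknown fields raise ValueError
    (the DictWriter default), which keeps the schema from drifting.
    """
    writer = csv.DictWriter(stream, fieldnames=FIELDS, restval="")
    writer.writeheader()
    for row in rows:
        writer.writerow(row)


# Usage: one row per sample, all domains on the same line.
buf = io.StringIO()
write_rows(buf, [{"test_id": "T2", "mode": "holdover", "te_ns": 120}])
```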
9) Part numbers (examples) — common building blocks for reusable validation fixtures
The part numbers below are for building a validation platform or a reference “control group”. Final selection must consider availability, package, and system power/cost targets; if lifecycle changes, use an equivalent-class substitute.
| Module role | Example part numbers | Validation points covered |
|---|---|---|
| Jitter-cleaner / DPLL | Silicon Labs / Skyworks Si5341 (with Si5341-D-EVB); Analog Devices AD9545 (with AD9545-PCBZ); Renesas 8A34001; Microchip ZL30733 | T1 jitter baseline, T2 holdover, T4 unlock/relock behavior |
| 1588 timestamp PHY / switch | Texas Instruments DP83640 (IEEE 1588 PTP PHY); Microchip KSZ9477 (IEEE 1588v2-capable switch) | T1 timestamp distribution, T4 spike/rollback isolation (PHY TS vs TSU path) |
| RTC (temp-comp / low power) | Analog Devices / Maxim DS3231; NXP PCF2129 | T5 backup duration and restore consistency; thermal drift comparison |
| Supercap backup management | Analog Devices LTC3350 (supercap charger + backup control); Texas Instruments TPS61094 (ultra-low IQ approach with supercap management) | T5 backup path: duration, charge policy, power-off/power-on behavior |
| OR-ing / ideal diode | Analog Devices LTC4412 (PowerPath/ideal-diode controller) | T5 backup-domain isolation and backfeed prevention |
| Inrush limiting (eFuse) | Texas Instruments TPS2595 (adjustable current limit + adjustable soft-start) | T4/T5: cold-start charge inrush, rail droop, repeatable protection behavior |
| Supercapacitor (device) | Murata DMF series (5.5V EDLC); Panasonic EEC-F series (5.5V Gold Cap family, e.g., EEC-F5R5U / EEC-F5R5H families) | T5: impact of leakage/ESR on backup duration and inrush |
10) Production & field rollout: make T1/T2/T5 the minimal closed loop
If cost/time must be compressed, keep three “minimal loop” cases: T1 (baseline) + T2 (holdover) + T5 (backup). These three cover most “timing feels like black magic” field failures while staying within this page’s hardware boundary.
- T1: defines what “healthy” looks like—without it you cannot judge degradation.
- T2: reference loss is the most common fault injection; TE(t) is the most explanatory evidence.
- T5: power-off/cold-start is the highest-risk scenario for time jumps and rollback, so it must be verified explicitly rather than assumed.
FAQs: Edge Timing & Sync (Hardware Time Subsystem)
These FAQs stay strictly inside the device’s timing hardware boundary: reference input quality, ref mux switching, PLL/jitter-cleaner behavior, timestamp tap placement (PHY/MAC/TSU), ToD monotonicity, and RTC+supercap backup.
Should timestamp requirements use “average error” or p99/p999? Why?
Use p99/p999 when rare spikes can break control, multi-sensor alignment, or event reconstruction. Averages hide tail events caused by ref switching, PLL relock settling, clock-domain crossings, or timestamp tap uncertainty. Keep the average as a sanity check, but accept/reject with a fixed statistics window and tail metrics (plus an outlier rate).
- Acceptance template: p99 ≤ X, p999 ≤ Y, outlier_rate ≤ N/hour.
- Always log the window length and exclude warm-up (e.g., first 2 minutes after lock).
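The acceptance template can be evaluated directly from captured timestamp errors. The sketch below assumes samples arrive as `(seconds_since_lock, error_ns)` pairs and uses a nearest-rank percentile with integer arithmetic, which avoids floating-point rounding at the p999 boundary. The function names and the 120 s warm-up default (taken from the "first 2 minutes" example) are illustrative.

```python
def nearest_rank(sorted_vals, num, den):
    """Nearest-rank percentile num/den (e.g. 999/1000) of a pre-sorted list."""
    n = len(sorted_vals)
    rank = max(1, (num * n + den - 1) // den)   # integer ceil(num * n / den)
    return sorted_vals[rank - 1]


def drop_warmup(samples, warmup_s=120.0):
    """Keep only error values taken at or after the warm-up period."""
    return [err for t, err in samples if t >= warmup_s]


def tail_metrics(errs_ns, window_s, outlier_limit_ns):
    """Summarise |error| over a fixed window; window_s is logged with the result."""
    errs = sorted(abs(e) for e in errs_ns)
    if not errs or window_s <= 0:
        raise ValueError("need at least one sample and a positive window")
    outliers = sum(1 for e in errs if e > outlier_limit_ns)
    return {
        "window_s": window_s,
        "p99": nearest_rank(errs, 99, 100),
        "p999": nearest_rank(errs, 999, 1000),
        "outliers_per_hour": outliers * 3600.0 / window_s,
    }


def accept(metrics, x_ns, y_ns, n_per_hour):
    """p99 <= X, p999 <= Y, outlier_rate <= N/hour."""
    return (metrics["p99"] <= x_ns
            and metrics["p999"] <= y_ns
            and metrics["outliers_per_hour"] <= n_per_hour)
```

Nearest-rank is one of several percentile definitions. Whichever is used, it should stay fixed across builds so that results remain comparable.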
PLL shows “LOCKED” but timestamps still spike—what two evidence classes should be checked first?
A lock indicator only means the loop is closed, not that the output is “production-clean.” First, check PLL/clock-tree events (holdover entry, ref switch, relock time, settle gating) and correlate them with spike timestamps. Second, check timestamp consistency for the same event across tap points (PHY vs TSU vs software-readout) to separate “clock quality” from “tap/path issues.”
- Example jitter-cleaner/DPLL parts used in endpoints/gateways: SiLabs Si5341, ADI AD9545, Microchip ZL30733, Renesas 8A34001.
- Fast triage: spikes aligned with relock/switch events → PLL/switching/settling; spikes without events → tap/path/measurement chain.
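The fast-triage rule can be sketched as a time-alignment check between spike timestamps and PLL/clock-tree event timestamps. This assumes both are recorded on the same time base, and it uses a symmetric matching window whose width (`window_s`) is an assumption to be tuned per design: settling-related spikes may lag the event, so an asymmetric window could be more appropriate in practice.

```python
import bisect


def triage_spikes(spike_times_s, event_times_s, window_s=1.0):
    """Split spikes into those near a PLL/switch event and those with no event nearby."""
    events = sorted(event_times_s)
    result = {"pll_switching_settling": [], "tap_path_measurement": []}
    for t in spike_times_s:
        # First event at or after (t - window_s); it matches if it is also <= t + window_s.
        i = bisect.bisect_left(events, t - window_s)
        near_event = i < len(events) and events[i] <= t + window_s
        key = "pll_switching_settling" if near_event else "tap_path_measurement"
        result[key].append(t)
    return result
```

A spike landing in `tap_path_measurement` does not prove the PLL is clean. It only means no logged event explains the spike, which points the investigation at the timestamp tap, the readout path, or missing event logging.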
In the field, what is the most visible difference between PHY timestamping vs MAC/TSU timestamping?
The most visible difference is the tail behavior: PHY timestamping, taken closer to the wire, is less sensitive to internal timing uncertainty and usually produces a tighter p99/p999 distribution. MAC/TSU timestamping is easier to integrate but can inherit extra variation from internal latency drift and clock-domain boundaries.
- Symptom pattern: similar averages, but MAC/TSU has more outliers during bursts, switching, or thermal drift.
- Example IEEE-1588-capable devices often used for comparison: TI DP83640 (PTP PHY), Microchip KSZ9477 (PTP-aware switch).
When switching from external 1PPS to local XO, what is the most common root cause of a “time step” jump?
The most common cause is phase discontinuity at the switch boundary: the local oscillator phase is not aligned to the outgoing reference, and the PLL allows a step before the system applies a controlled slew policy. A second common cause is releasing “LOCKED” too early—output is not fully settled, so ToD mapping amplifies transient phase error into a visible time step.
- Mitigation: phase-/frequency-continuous switching where possible, plus settle-gate before enabling timestamps.
- Governance: enforce no rollback and a jump limit (step vs slew).
Should PLL loop bandwidth be larger or smaller? What field symptoms indicate the wrong choice?
A wider bandwidth tracks the reference faster but can import reference noise; a narrower bandwidth cleans noise better but reacts slowly. If bandwidth is too wide, timestamp tails worsen even when lock is stable (reference noise leaks through). If bandwidth is too narrow, relock takes longer and switching causes prolonged error windows or slow recovery.
- Validate with T1/T4: p999 and spike rate during switching/relock are the first indicators.
- Rule of thumb: choose bandwidth together with “settle time budget” and “switch frequency” constraints.
How to choose XO/TCXO/OCXO by back-calculating from a target holdover time (minutes)?
Start with an acceptance statement: “within T minutes after reference loss, time error stays within E.” Convert this into an allowable frequency error budget, then split it into temperature drift, aging, and short-term stability. XO fits short/low-risk holdover; TCXO is typical for edge devices; OCXO is used when tight holdover is needed but power/volume and warm-up are acceptable.
- Verification is mandatory: prove the budget using holdover TE(t) tests (T2) and temperature sweep (T3).
- Clock conditioning examples: SiLabs Si5341 or ADI AD9545 can combine holdover + switching policy control.
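The back-calculation can be sketched with a deliberately simple model: if the frequency offset is treated as constant during holdover, time error grows linearly, so the allowable fractional frequency error is E / T. The sketch below also sums the budget terms linearly as a worst case. Both simplifications are assumptions: real TE(t) includes drift terms that make it curve, which is why the T2/T3 measurements remain the final check. The example numbers are illustrative and not taken from any datasheet.

```python
def allowed_ppb(te_limit_s, holdover_s):
    """Allowable average fractional frequency error, in parts per billion."""
    if holdover_s <= 0:
        raise ValueError("holdover time must be positive")
    return te_limit_s / holdover_s * 1e9


def budget_ok(te_limit_s, holdover_s, terms_ppb):
    """Worst-case check: the linear sum of budget terms must fit the allowance."""
    total = sum(terms_ppb.values())
    limit = allowed_ppb(te_limit_s, holdover_s)
    return total <= limit, total, limit


# Illustrative target: stay within 1 microsecond for 10 minutes of holdover.
ok, total, limit = budget_ok(
    te_limit_s=1e-6,
    holdover_s=600,
    terms_ppb={"temp_drift": 1.0, "aging": 0.1, "short_term": 0.3},
)
```

With these illustrative inputs the allowance is about 1.67 ppb, which makes clear why a budget stated in minutes and microseconds quickly pushes the oscillator choice toward TCXO or OCXO class parts.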
In a holdover budget, which term is most often underestimated: temp drift, aging, or thermal sweep behavior?
The most underestimated term is usually real thermal behavior (ramp + non-steady-state), not the “25°C ppm” value. Many designs validate only at room temperature and ignore the drift during temperature transitions and settling, which directly stretches TE(t) and increases tail events. Aging matters over long time scales, but thermal transitions dominate many edge deployments.
- Action: include temperature ramp rates and dwell time in the test plan (T3), not just a single-point measurement.
- Evidence: TE(t) slope changes that correlate with temperature slope are the fastest signal.
Why does RTC + supercap backup often last “much shorter in reality” than theory suggests?
Theory assumes ideal capacitance and a clean backup load. Reality is dominated by supercap leakage, effective voltage window, ESR-related droop, and hidden loads or backfeed paths in the RTC domain. If the backup rail is not isolated, unexpected current drains can dwarf the RTC’s budget and collapse the supercap early.
- RTC examples: ADI/Maxim DS3231, NXP PCF2129 (calibration and low backup current options vary by design).
- Measure backup domain current and rail droop curve—do not estimate from capacitance alone.
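The gap between theory and reality can be illustrated with a first-order estimate. The sketch below assumes a constant-current discharge across the usable voltage window, t = C · ΔV / I, and treats leakage and hidden loads as extra constant currents. Real leakage varies with voltage, temperature, and time since charge, so this is a rough planning aid under those assumptions and not a substitute for measurement. All values are illustrative.

```python
def backup_seconds(cap_f, v_start, v_min, load_a,
                   leak_a=0.0, hidden_a=0.0, esr_ohm=0.0):
    """First-order backup time for a supercap feeding a constant-current load.

    cap_f     : capacitance in farads
    v_start   : supercap voltage when the main rail drops
    v_min     : minimum voltage at which the RTC domain still works
    load_a    : RTC backup current
    leak_a    : supercap leakage, modelled as a constant current (assumption)
    hidden_a  : other drains in the backup domain (pull-ups, backfeed paths)
    esr_ohm   : ESR, which removes I * ESR from the usable window
    """
    total_a = load_a + leak_a + hidden_a
    if total_a <= 0:
        raise ValueError("total discharge current must be positive")
    usable_v = (v_start - v_min) - total_a * esr_ohm
    return max(0.0, cap_f * usable_v / total_a)


# Illustrative comparison: RTC current alone vs. RTC plus leakage and hidden loads.
ideal_h = backup_seconds(0.22, 3.3, 1.8, load_a=1e-6) / 3600
real_h = backup_seconds(0.22, 3.3, 1.8, load_a=1e-6,
                        leak_a=2e-6, hidden_a=7e-6) / 3600
```

In this illustrative case the estimate drops by a factor of ten once leakage and hidden loads are included, which matches the pattern described above: the RTC's own current is often the smallest term in the backup domain.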
Supercap charging causes supply droop—should current limit be fixed first, or power sequencing?
Fix inrush control first because droop is usually driven by peak charge current. Power sequencing is often the second step to prevent sensitive rails from seeing the transient. A robust approach uses a controlled charge path plus isolation so the backup domain cannot pull down the main rail during cold start or brownouts.
- Examples: eFuse/soft-start TPS2595; ideal-diode/PowerPath LTC4412; supercap manager LTC3350.
- Acceptance: no repeated PLL unlocks during charge; no timestamp rollback after recovery.
How to quickly tell if the issue is “reference input quality” or “PLL output quality”?
Use a two-step split: verify the reference first, then verify what the PLL does with it. If 1PPS/10 MHz shows missing pulses, unstable amplitude, or phase steps, the input is suspect. If the reference is stable but output jitter proxies or spike bursts still appear, the PLL switching/settling policy or clock-tree distribution is the more likely cause.
- Reference evidence: PPS phase stability on scope/counter.
- PLL evidence: event timeline (switch/relock/settle) aligned to timestamp spikes.
After power restoration, how to ensure time does not go backward and logs do not reorder?
Enforce monotonic time as a hard rule: never allow rollback. After power-up, RTC provides a seed, but the system must apply a governance policy (step vs slew) with a jump limit and explicit “time-adjust” markers. For logging, keep an always-increasing sequence counter so ordering is preserved even when ToD is corrected within allowed bounds.
- Backup chain examples used in validation: RTC DS3231/PCF2129 + supercap control LTC3350 + isolation LTC4412.
- Pass criteria: rollback_count = 0; time adjustments are annotated and bounded.
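The governance policy can be sketched as a small clock wrapper. This is a minimal illustration under stated assumptions, not a reference implementation: forward corrections within the jump limit are stepped, everything else (backward corrections and oversized forward ones) is absorbed by slewing at a bounded rate, and every log record carries a sequence counter that is independent of ToD. The class name, the `slew_ppm` parameter, and the marker strings are all hypothetical.

```python
class GovernedClock:
    """ToD that never moves backward, with a jump limit and bounded slew."""

    def __init__(self, seed_ns, max_step_ns, slew_ppm=500):
        self.now_ns = seed_ns          # seed, e.g. restored from the RTC
        self.max_step_ns = max_step_ns
        self.slew_ppm = slew_ppm       # maximum slew rate (assumed value)
        self.pending_ns = 0            # correction still to be slewed in
        self.seq = 0                   # always-increasing log sequence

    def correct(self, target_ns):
        """Request a correction; return the time-adjust marker to log."""
        delta = target_ns - self.now_ns
        if 0 <= delta <= self.max_step_ns:
            self.now_ns = target_ns    # small forward jump: step is allowed
            self.pending_ns = 0
            return "step"
        self.pending_ns = delta        # backward or too large: slew instead
        return "slew"

    def tick(self, elapsed_ns):
        """Advance by elapsed_ns (>= 0), applying a bounded share of the pending slew."""
        bound = elapsed_ns * self.slew_ppm // 1_000_000
        adj = max(-bound, min(bound, self.pending_ns))
        self.now_ns += elapsed_ns + adj   # adj >= -bound, so time never decreases
        self.pending_ns -= adj
        return self.now_ns

    def log(self, message):
        """Return a record ordered by seq, even if ToD is corrected later."""
        self.seq += 1
        return (self.seq, self.now_ns, message)
```

Ordering logs by `seq` and not by ToD is what keeps event order intact across corrections. The `rollback_count = 0` criterion then reduces to checking that `now_ns` is non-decreasing across all samples.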
Minimum validation loop: which three tests catch ~80% of real-world timing pitfalls?
A strong minimum loop is T1 + T2 + T5. T1 establishes steady-state tail metrics (p99/p999 and outlier rate). T2 proves holdover drift TE(t) under reference loss and validates recovery behavior. T5 validates RTC+supercap backup and ensures charging transients do not cause supply droop, repeated PLL unlocks, or time rollback during cold start.
- T1: distribution + spikes; T2: TE(t) drift; T5: backup time + inrush + monotonicity.
- Require a uniform record template and fixed pass/fail criteria across builds.