Time Card (PTP/SyncE/GPSDO) for Data Center Servers
A data-center Time Card provides a provable timing foundation by combining disciplined frequency (SyncE/GPSDO) and precise time (PTP/ToD/1PPS) into a single, auditable hardware reference. It exists to keep timestamps deterministic through reference loss/switching and to make holdover, jitter, and time-step behavior measurable and acceptance-testable.
H2-1 · Scope & Boundary — What this page solves
This page focuses on time-card hardware behavior and how to measure, validate, and debug it. It treats timing as an engineering deliverable: a stable frequency reference, a controlled time-of-day output, and a deterministic hardware timestamp base.
- Stable frequency (SyncE/GPSDO) with explicit holdover behavior: expected drift over time, re-lock behavior, and measurement methods (phase error / time error / ADEV-style stability views).
- Accurate time (PTP / ToD / 1PPS) with controlled transitions: when to step vs slew, how to avoid time-jumps during reference loss/recovery, and what to validate before deployment.
- Auditable timestamp base (PHC + HW timestamp): where the capture point sits in the data path, what creates non-determinism, and how to prove tail performance (p99/p999 stability).
- Distributed event ordering: avoids cross-node log mis-ordering and “false causality” when time alignment degrades.
- Compliance timestamps: provides traceable, verifiable timing with explicit reference quality and state transitions.
- Cluster observability: reduces long-tail timestamp variance that breaks correlation across systems under load.
- PTP network algorithm deep dives (BMCA / transparent-clock internals) — use a dedicated timing-network page.
- OOB/BMC/KVM subsystems — see the Management pages for IPMI/Redfish/KVM topics.
- High-speed SI retimer design — see the PCIe Retimer/Switch pages for signal integrity depth.
Minimal mental model: reference signals enter the card, a disciplined oscillator plus DPLL/PLL creates a stable timebase, and the card exports PHC/time-of-day and deterministic hardware timestamps with observable states.
H2-2 · 1-Minute Answer — When a Time Card is truly needed
A time card becomes a necessity when time must be deterministic, traceable, and resilient to reference loss. It moves timing from “best-effort behavior” into an engineered, measurable subsystem—with explicit holdover, jitter control, and hardware timestamp guarantees.
A Time Card is a server timing module that disciplines an OCXO/TCXO using GNSS and/or SyncE, then exposes a stable PHC, ToD/1PPS/10MHz outputs, and IEEE 1588 hardware timestamps. It is built for verifiable accuracy, low jitter, and predictable holdover during reference loss.
1. Acquire references: GNSS (1PPS/ToD), SyncE recovered clock, or external 10MHz/1PPS, plus quality/status signals.
2. Discipline the oscillator: DPLL/servo controls OCXO/TCXO frequency and phase; lock/holdover states are explicitly managed.
3. Clean jitter and phase noise: jitter-cleaner PLL shapes noise across offset frequencies to stabilize outputs and timestamp base.
4. Drive PHC and time outputs: PHC/ToD/1PPS are aligned to the disciplined timebase with controlled step/slew behavior.
5. Timestamp with determinism: IEEE 1588 hardware timestamps are captured at a defined point, enabling stable tail performance (p99/p999).
- Tail stability matters: p99/p999 timestamp error is more important than average offset.
- Reference loss is realistic: GNSS can be blocked/jammed, or upstream timing can be disrupted—holdover must be bounded.
- Time must be auditable: compliance logs or forensic timelines require traceable states (lock/holdover/quality) and repeatable behavior.
- System correlation breaks today: cross-node logs, traces, or events fail correlation during load spikes or topology changes.
- Mixed timing inputs exist: SyncE for frequency plus PTP for time needs a clean, well-defined convergence point.
- Holdover: maximum time error after X minutes without GNSS/SyncE; drift curve should be bounded and repeatable.
- Determinism: timestamp stability across load (tail metrics), avoiding long-tail spikes caused by uncertain capture points.
- Jitter / phase noise: integrated jitter (10MHz) and phase-error behavior (1PPS) meet system budget.
- Observability: clear state reporting (lock/holdover/quality/alarms) plus event timestamps to support debugging and audits.
The comparison is intentionally “engineering-first”: it highlights bounded behavior, tail stability, and observable states—rather than average offset claims.
H2-3 · System Context — How a Time Card fits into a data-center server
A time card is the timing boundary inside a server: it takes one or more external references (GNSS, SyncE, or an external 10MHz/1PPS), disciplines an on-card oscillator, and then exposes a stable local timebase (PHC/ToD/1PPS/10MHz) plus deterministic hardware timestamps. This section explains practical integration—placement, I/O, redundancy, and the failure points that most often break real deployments.
- PCIe add-in card (AIC): easy to service and swap, typically offers front-panel connectors (SMA/MCX). Watch for panel cabling, grounding, and vibration/airflow sensitivity around the oscillator zone.
- OCP mezzanine: board-level integration is cleaner and repeatable at scale; management sideband is convenient. External reference connectors may be limited, so reference distribution strategy must be decided early.
- On-board module: best for fixed platforms that prioritize mechanical stability. Field replacement is harder, so reference redundancy and observability become even more important.
- Common risks: lightning/surge on the feeder, cable loss budget errors, multipath reflections near metal structures, intermittent sky obstruction/coverage holes.
- Minimum engineering actions: surge protection + compliant bonding, feeder loss budget, placement validation with GNSS quality indicators (C/N0, satellites, lock stability).
- Common risks: upstream quality level changes, reference switching that introduces wander, “looks locked” but tail timing degrades.
- Minimum engineering actions: monitor quality/alarms, define reference priority, verify behavior during source switching (no uncontrolled time jumps).
- Common risks: ground loops, poor termination/impedance match, noisy distribution that injects jitter into a “clean” reference.
- Minimum engineering actions: define cabling/connector standards, termination rules, and validation using phase-error measurements (not only average offset).
- Local PHC: the server’s auditable hardware timebase for time services and correlation.
- IEEE 1588 hardware timestamps: deterministic packet timestamping at a defined capture point for stable tail behavior.
- ToD / 1PPS / 10MHz: exported for measurement ports, lab validation, or cross-card timing relationships.
- Status / alarms / event logs: essential for proving reference quality and diagnosing holdover and switching events.
- N+1 references: combine at least two independent inputs (e.g., GNSS + SyncE, or GNSS + external 10MHz/1PPS) to avoid single-point dependency.
- Failure isolation: distinguish “reference quality degraded” vs “cable/connector fault” vs “surge/ground event” using explicit states and alarms.
- Observable transitions: lock → holdover → re-lock must be visible and bounded; uncontrolled time steps are treated as failures.
The diagram emphasizes the real deployment boundary: reference distribution and redundancy on the left, disciplined timing and observable states in the middle, and the server/measurement consumers on the right.
H2-4 · Reference & Oscillator — What OCXO/TCXO/GNSS specs actually mean
Timing products often advertise “excellent stability,” yet real systems fail on holdover, tail behavior, or uncontrolled re-lock steps. The practical goal is to translate oscillator and reference claims into acceptance language: measurable limits on time error, repeatable drift curves, and visible state transitions.
- OCXO: chosen when long, bounded holdover is required and thermal stability dominates error. It is typically more resilient to airflow/ambient swings, but demands careful thermal placement.
- TCXO: chosen when space/power/cost dominate and holdover requirements are shorter. It is more sensitive to gradients and board-level temperature dynamics.
- Practical boundary: select by required holdover time-error bound and environmental variability, not by a single ppm line item.
- Metric: short / mid / long windows (τ≈1s–10s, 100s–1000s, 10k s).
- Impact: short windows shape jitter-like behavior; mid windows reveal control and thermal dynamics; long windows dominate holdover drift and aging.
- Measure: record phase/time error vs a known reference; derive stability views per window; verify repeatability across runs.
- Metric: temperature coefficient, aging rate, supply sensitivity, mechanical/airflow sensitivity.
- Impact: changes the drift slope during holdover and can alter re-lock behavior after transients.
- Measure: controlled temperature sweep, long-duration holdover tests, and “airflow/placement A/B” checks in a representative chassis.
- Metric: lock stability, C/N0 trends, satellite visibility changes, and event frequency of reference loss.
- Impact: repeated micro-loss or multipath can inject steps or force frequent transitions into holdover.
- Measure: log quality indicators alongside time error; validate that loss/recovery transitions remain bounded.
- Bounded time error: after losing references, what is the maximum time error at 5/15/30/60 minutes (explicit curve, not a single adjective)?
- Test conditions: under what temperature and airflow profile were those bounds verified (stable lab vs chassis reality)?
- Re-lock behavior: does the card step or slew when references return, and what is the maximum step limit?
- Proof: are lock/holdover/quality states and event timestamps exported for audits and debugging?
This schematic is intentionally acceptance-oriented: it highlights bounded holdover and controlled re-lock behavior, which are the most common gaps in marketing-style specifications.
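The "bounded time error at 5/15/30/60 minutes" question can be turned into numbers with a common first-order holdover model, x(t) = x0 + y0·t + ½·D·t², where x0 is the phase offset at the moment of reference loss, y0 the residual fractional frequency offset, and D the linear frequency drift (aging) rate. The sketch below is illustrative only: the function names and any parameter values passed to them are assumptions, not vendor specifications, and the model ignores temperature and airflow effects, which often dominate in a real chassis.

```python
def holdover_time_error(t_s, x0_s=0.0, y0=0.0, drift_per_s=0.0):
    """Time error (seconds) after t_s seconds of holdover.

    x0_s: initial phase offset in seconds
    y0: initial fractional frequency offset (dimensionless)
    drift_per_s: linear frequency drift in fractional frequency per second
    """
    return x0_s + y0 * t_s + 0.5 * drift_per_s * t_s ** 2


def holdover_curve(minutes, **params):
    """Evaluate the model at each checkpoint (in minutes)."""
    return {m: holdover_time_error(m * 60.0, **params) for m in minutes}


def check_bounds(curve, limits_s):
    """Compare a curve against per-checkpoint limits (seconds)."""
    return {m: abs(curve[m]) <= limits_s[m] for m in limits_s}


# Hypothetical example values, chosen only to exercise the functions.
example_curve = holdover_curve(
    [5, 15, 30, 60], y0=1e-11, drift_per_s=1e-10 / 86400.0
)
```

A measured drift curve should replace the model before any acceptance decision; the model is mainly useful for sanity-checking whether a quoted aging rate and frequency accuracy are even compatible with the requested bound.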
H2-5 · Disciplining Loop — How GNSS/SyncE/external references are “tamed” by a DPLL
A time card is not “accurate” simply because a reference exists. Accuracy becomes usable only after the reference is translated into a bounded, observable, and controlled local timebase. This section explains the practical disciplining loop: the state machine, the phase-error → filter → frequency-correction core, and the failure modes that typically cause time steps, jitter growth, or unstable holdover behavior.
- Acquire: reference is detected and qualified. Entry/exit should be governed by quality thresholds plus a time window (debounce).
- Lock: phase error is pulled into a bounded range. The system must expose lock indicators and “quality stable” timers.
- Disciplined: the oscillator is continuously corrected. The key requirement is bounded short-term noise injection and bounded long-term drift.
- Holdover: reference is lost or rejected; the local oscillator provides continuity. Acceptance must be stated as a time-error bound vs time (not a vague “good holdover”).
- Re-lock: reference returns. The critical requirement is a controlled transition: define whether re-lock may step or must slew, and define the maximum allowed step/slew.
- Phase-error estimation: converts reference-vs-local differences into a controllable error signal (observable and logged).
- Filtering: decides which variations are trusted. Too much trust in a noisy reference injects noise; too little trust slows convergence.
- Frequency correction: adjusts the oscillator so the local timebase stays close to the reference without amplifying short-term noise.
- Engineering trade-off: a faster loop tracks short-term changes but can increase jitter; a slower loop reduces injected noise but can drift more during transients and take longer to settle.
- Symptom: occasional time steps or tail spikes even when “locked” appears true.
- Why it happens: intermittent quality changes (sky obstruction/multipath) drive the loop as if they were real time shifts.
- Acceptance: inject reference-quality disturbances and verify bounded transitions and logged state reasons.
- Symptom: short-term jitter rises; timestamp tails become worse (p99/p999 deteriorate).
- Why it happens: the loop passes reference noise into the local oscillator and timebase.
- Acceptance: compare configurations by tail metrics, not only average offset.
- Symptom: when the reference returns, time suddenly jumps; ordering/correlation breaks.
- Why it happens: uncontrolled phase correction during re-lock.
- Acceptance: specify step limit (or require slew) and verify during repeated loss/recovery cycles.
- Holdover bound: maximum time error at 5/15/30/60 minutes after reference loss (curve + limit).
- Re-lock bound: step vs slew policy and the maximum allowed step / maximum slew rate.
- Source switching: bounded transient error during reference priority changes; reason codes must be logged.
- Tail behavior: validate p99/p999 time error/timestamp error, not only average offset.
Left: define observable entry/exit conditions for each state. Right: bandwidth is an engineering choice; validate using tail metrics and bounded transitions.
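The state list above (Acquire, Lock, Disciplined, Holdover, Re-lock) can be sketched as a small state machine with debounced entry conditions and a logged reason for every transition. This is a minimal illustration, not any card's firmware: the class name, the threshold, the debounce count, and the reason strings are all hypothetical.

```python
from enum import Enum


class State(Enum):
    ACQUIRE = "acquire"
    LOCK = "lock"
    DISCIPLINED = "disciplined"
    HOLDOVER = "holdover"
    RELOCK = "re-lock"


class DiscipliningStateMachine:
    """Toy state machine: debounced lock entry, logged transitions."""

    def __init__(self, lock_threshold_s=100e-9, debounce_samples=5):
        self.state = State.ACQUIRE
        self.lock_threshold_s = lock_threshold_s
        self.debounce_samples = debounce_samples
        self.events = []      # (sample_index, old_state, new_state, reason)
        self._good = 0        # consecutive in-bound samples
        self._n = 0           # sample counter

    def _transition(self, new_state, reason):
        self.events.append((self._n, self.state.value, new_state.value, reason))
        self.state = new_state
        self._good = 0

    def update(self, ref_valid, phase_error_s):
        self._n += 1
        if not ref_valid:
            self._good = 0
            # Only a previously tracking loop can enter holdover.
            if self.state in (State.LOCK, State.DISCIPLINED, State.RELOCK):
                self._transition(State.HOLDOVER, "reference_lost")
            return self.state
        if self.state is State.HOLDOVER:
            self._transition(State.RELOCK, "reference_returned")
            return self.state
        in_bound = abs(phase_error_s) <= self.lock_threshold_s
        self._good = self._good + 1 if in_bound else 0
        if self._good >= self.debounce_samples:
            if self.state in (State.ACQUIRE, State.RELOCK):
                self._transition(State.LOCK, "phase_error_in_bound")
            elif self.state is State.LOCK:
                self._transition(State.DISCIPLINED, "quality_stable")
        return self.state
```

The point of the sketch is the event list: every state change carries an index and a reason, which is what makes a later offset jump attributable to a specific transition.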
H2-6 · IEEE 1588 on a Time Card — Hardware timestamps and the PHC datapath
Deterministic timing in a server depends on two things: a disciplined local hardware clock (PHC) and hardware timestamp capture at a well-defined point. Many systems look fine on “average offset” yet fail on ordering and correlation because tail errors and variable internal delays dominate real workloads. This section maps the interface-level datapath: where timestamps are captured, how PHC aligns with ToD/1PPS, and where determinism is gained or lost.
- PHC is a continuously running hardware counter used as the server’s local time reference.
- It is disciplined by the time card’s loop (frequency/phase corrections), with explicit constraints on step vs slew behavior.
- Acceptance requires observable states: lock/holdover/re-lock and event timestamps for each transition.
- Benefit: fewer variable delays between the physical arrival/departure and the capture point.
- Outcome: better determinism; tail behavior tends to be tighter.
- Risk: queueing, clock-domain crossings, and internal scheduling can add variable latency.
- Outcome: average offset can still look fine while p99/p999 timestamp error worsens.
- Tail spikes: rare timestamp error spikes can exceed the event spacing that applications rely on.
- Variable delay: nondeterministic internal latency corrupts correlation even when mean offset remains bounded.
- Transition steps: re-lock steps (or aggressive corrections) can invert ordering within short windows.
- 1PPS provides an edge reference for measurement and alignment checks.
- ToD formatting converts the internal timebase into exported time-of-day while maintaining bounded update behavior.
- Engineering boundary: define the policy for step vs slew so exported signals remain auditable and transitions remain bounded.
- Tail limits: p99/p999 timestamp error bounds (not only average offset).
- Fixed capture point: capture location must be explicit and consistent (deterministic path vs variable path).
- Transition bounds: maximum step / maximum slew rate during re-lock and reference switching.
- Consistency checks: PHC vs ToD vs 1PPS alignment logged over time.
The capture point defines determinism. When variable delay is introduced upstream of the capture, tail timestamp errors rise even if average offset stays small.
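The step-versus-slew boundary mentioned above can be expressed as a small decision rule: slew when the offset can be removed within an allowed window at the maximum slew rate, step only if stepping is permitted and the offset is within the step limit, and otherwise raise an alarm. The function below is a sketch under those assumptions; its name, defaults, and return labels are hypothetical and do not describe a specific product's policy.

```python
def plan_correction(offset_s, max_slew_rate=1e-6, slew_window_s=60.0,
                    allow_step=False, max_step_s=1e-3):
    """Decide how to remove a time offset.

    offset_s: measured offset in seconds (sign is ignored for planning)
    max_slew_rate: maximum slew in seconds per second (1e-6 = 1 ppm)
    slew_window_s: longest acceptable slewing duration in seconds
    allow_step / max_step_s: whether a step is permitted, and its limit

    Returns (action, slew_time_s) with action in
    {"none", "slew", "step", "alarm"}.
    """
    magnitude = abs(offset_s)
    if magnitude == 0.0:
        return ("none", 0.0)
    slew_time_s = magnitude / max_slew_rate
    if slew_time_s <= slew_window_s:
        return ("slew", slew_time_s)
    if allow_step and magnitude <= max_step_s:
        return ("step", 0.0)
    # Too large to slew in time and stepping is not allowed or out of limit.
    return ("alarm", slew_time_s)
```

Writing the policy down in this form makes the acceptance item testable: the same offsets can be replayed after each loss/recovery cycle and the chosen action compared against the logged behavior.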
H2-7 · SyncE Integration — Why many data centers use “SyncE for frequency, PTP for time”
SyncE and PTP solve different parts of the same timing problem. SyncE distributes a low-wander frequency foundation, while PTP aligns time/phase using packet-based measurements. In practice, separating “frequency stability” from “time alignment” tightens tail error, improves robustness under link quality changes, and makes acceptance criteria clearer for operators.
- Lower wander at the local timebase: a stable recovered clock reduces how hard the timing servo must work to maintain frequency continuity.
- Less noise injection into timing: when frequency is already “quiet,” PTP can focus on time/phase alignment rather than compensating for oscillator wander.
- Clear separation of concerns: frequency quality can be validated independently from time alignment, which improves troubleshooting and acceptance.
- QL communicates the expected reference quality in operational terms: “which reference is better” and “when to switch.”
- SSM/ESMC carry and coordinate these quality labels so equipment can choose a reference consistently.
- Stable selection requires policy: thresholds, hold-off, and hysteresis to prevent frequent flapping.
- Policy too sensitive: aggressive thresholds cause frequent reference changes for marginal quality variations.
- Quality propagation breaks: QL labels are not aligned end-to-end, leading to inconsistent selection.
- Unclear priority rules: multiple “best” references appear equal, so the system oscillates between them.
The acceptance goal is not “never switch,” but bounded switching: stable selection under normal conditions and controlled transitions with traceable reason codes.
- Frequency path: SyncE recovered clock → jitter cleaner → PHC discipline (reduces wander and stabilizes the local frequency foundation).
- Time path: PTP packets → hardware timestamps → PHC alignment (aligns time/phase using deterministic capture points).
- Operational requirement: the time card must expose reference state, QL/selection events, and bounded transition behavior during loss/recovery.
- Frequency foundation: low-wander behavior under normal operation; verify by time-error vs observation window (trend + bounds).
- Switching stability: switching rate limits, hold-off/hysteresis policy, and reason codes for each event.
- Time alignment tails: p99/p999 time error and timestamp error bounds (not only mean offset).
- Loss/recovery: bounded transitions during reference loss/recovery (step vs slew policy + maximum allowed step/slew).
Frequency is stabilized by SyncE → jitter cleaning → discipline, while time/phase is aligned using hardware timestamps. Acceptance focuses on bounded switching and tail metrics.
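The hold-off idea behind stable reference selection can be illustrated with a small selector: a lost active reference is replaced immediately, while an upgrade to a better-ranked reference must stay continuously valid for a hold-off period first. This is a simplified sketch; the class, the static priority ranks (standing in for QL comparison), and the reason strings are assumptions, and real ESMC/QL handling is considerably richer.

```python
class ReferenceSelector:
    """Toy reference selector with hold-off before switching to a better source."""

    def __init__(self, priorities, hold_off_s=10.0):
        self.priorities = priorities      # name -> rank, lower is preferred
        self.hold_off_s = hold_off_s
        self.active = None                # None means no reference (holdover)
        self.events = []                  # (time_s, old, new, reason)
        self._candidate = None
        self._candidate_since = None

    def _switch(self, now_s, new, reason):
        self.events.append((now_s, self.active, new, reason))
        self.active = new
        self._candidate = None

    def update(self, now_s, valid):
        """valid: dict name -> bool for the current sample."""
        usable = [n for n in self.priorities if valid.get(n, False)]
        best = min(usable, key=self.priorities.get) if usable else None
        if self.active is not None and not valid.get(self.active, False):
            # Active reference is gone: fail over at once (or enter holdover).
            self._switch(now_s, best, "active_reference_invalid")
        elif best != self.active:
            if self._candidate != best:
                # New candidate: start its qualification timer.
                self._candidate, self._candidate_since = best, now_s
            elif now_s - self._candidate_since >= self.hold_off_s:
                self._switch(now_s, best, "better_reference_stable")
        else:
            # Candidate disappeared before qualifying: discard it.
            self._candidate = None
        return self.active
```

With this shape, a reference that flaps faster than the hold-off never becomes active, and every switch that does happen leaves a timestamped reason behind.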
H2-8 · Jitter Cleaner PLL — Phase noise, jitter, and why “cleaning bandwidth” decides success
A jitter cleaner is not a magic box that always improves timing. Its loop bandwidth decides how much input-reference noise is passed through versus how much local-oscillator noise dominates the output. Selecting the right device and configuration requires measurable acceptance: phase noise, integrated jitter, spur behavior, and bounded switching under real reference disturbances.
- Too wide: passes reference noise → output short-term jitter worsens.
- Too narrow: slow tracking and longer recovery during switching or disturbances.
- Acceptance: compare tail jitter/time-error under the same disturbance profile.
- Phase noise L(f) must be stated at offset regions that matter (near vs far).
- Integrated jitter must declare the integration band, or the number is not comparable.
- Acceptance: verify both noise floor and tail behavior; do not rely on a single “typical” value.
- Spurs may appear from reference coupling, fractional synthesis artifacts, or power-domain interactions.
- Hitless/bounded switching requires controlled phase continuity and limited transient error.
- Acceptance: test switching events and degraded references; confirm bounded transients and logged reasons.
- Phase noise: measure L(f) and identify discrete spurs; evaluate near-offset and far-offset regions separately.
- Jitter integration: integrate the phase-noise curve over a declared band to obtain RMS jitter; compare against limits.
- 10MHz / 1PPS phase comparison: measure time error/TIE trends and tail behavior during disturbances and switching.
- Allan deviation (ADEV): use time-scale-dependent stability to validate holdover-like behavior at relevant τ windows (conceptual, not derivation).
- Bandwidth mismatch: the loop passes the wrong noise region; short-term jitter rises.
- Reference disturbances: switching or marginal references inject spurs or transient phase errors.
- System coupling: periodic spurs appear due to reference/power interactions; verify with controlled stimulus and repeatability.
A cleaner improves output only when the bandwidth is chosen to suppress the right noise region without passing the wrong one. Validate with declared integration bands, spur checks, and bounded switching tests.
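Integrated jitter only becomes comparable once the integration band is declared. The standard conversion integrates the single-sideband phase noise L(f) over the band, doubles it to obtain phase variance in rad², and divides the RMS phase by 2π·f0. The sketch below assumes the curve is given as a few (offset, dBc/Hz) points joined by straight lines on a log-log plot; the function names are hypothetical, and discrete spurs are not included and would have to be added separately.

```python
import math


def integrate_phase_noise(points):
    """Integrate L(f) (as a linear power ratio) over the given band.

    points: list of (offset_hz, dbc_per_hz), sorted by increasing offset.
    Each segment is treated as a power law (straight line on log-log axes).
    """
    total = 0.0
    for (f1, l1), (f2, l2) in zip(points, points[1:]):
        s1 = 10.0 ** (l1 / 10.0)
        s2 = 10.0 ** (l2 / 10.0)
        slope = math.log(s2 / s1) / math.log(f2 / f1)
        if abs(slope + 1.0) < 1e-12:
            # 1/f segment: the power-law integral degenerates to a logarithm.
            total += s1 * f1 * math.log(f2 / f1)
        else:
            total += s1 * f1 / (slope + 1.0) * ((f2 / f1) ** (slope + 1.0) - 1.0)
    return total


def rms_jitter_s(points, carrier_hz):
    """RMS jitter in seconds over the band spanned by `points`."""
    phase_variance_rad2 = 2.0 * integrate_phase_noise(points)
    return math.sqrt(phase_variance_rad2) / (2.0 * math.pi * carrier_hz)
```

Because the first and last points define the band, two jitter figures are only comparable when they were computed from curves that start and stop at the same offsets.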
H2-9 · Interfaces & Form Factors — PCIe cards, OCP modules, connectors, and electrical constraints
Time cards succeed or fail at the interfaces. The same oscillator and disciplining design can deliver very different results depending on cabling, grounding, shielding continuity, port direction, and how redundant references are routed. This section turns “it connects” into “it is stable, measurable, and maintainable.”
- Chassis grounding & shielding continuity become part of the timing path (front-panel openings, seams, and cable shields matter).
- EMI coupling risk increases with crowded slots and mixed high-speed cards; verify tail behavior under realistic load profiles.
- Serviceability is strong, but stability depends on disciplined cable routing and consistent port labeling (IN/OUT).
- Reference ground coupling to the baseboard is tighter; sideband signals are easier to consolidate.
- Front-panel ports (if present) must maintain shield-to-chassis continuity; avoid creating unintended return paths through the module.
- Redundancy strategy should be planned at the rack level (physical separation of reference feeds).
- Fewer external ports can reduce exposure, but system-level grounding and zoning become decisive.
- Validation points must be planned (where to observe 1PPS/10MHz/ToD behavior without disturbing the system).
- Maintainability tradeoff: less flexible field replacement requires stronger monitoring and event logging.
- Purpose: exposes the timing device to the host and provides the control/visibility surface.
- Constraint: platform reset/power behavior affects timing bring-up; acceptance must include realistic reboot and recovery scenarios.
- Operational need: maintain a clear mapping between physical card identity and observed timing state (slot/serial/event logs).
- Purpose: read reference state, temperature, alarms, and event counters that explain timing behavior.
- Constraint: prioritize stable, low-noise routing; avoid placing long, noisy sideband runs adjacent to sensitive reference wiring.
- Acceptance: telemetry must remain available during reference loss/recovery so the root cause is traceable.
- Purpose: simple “state truth” for lock/holdover/reference switching and critical alarms.
- Constraint: define direction and logic clearly; treat long GPIO runs as noise antennas unless properly referenced and shielded.
- Acceptance: status transitions must match logged events (no “silent switching”).
- Common roles: output to validation equipment or downstream distribution; input from an external reference when used.
- Direction discipline: label IN vs OUT explicitly; avoid accidental “loop feed” when redundant ports exist.
- Pitfalls: inconsistent shield/ground reference can show up as tail spikes during load changes or switching events.
- Common roles: input for external frequency foundation; output for distribution or measurement.
- Constraint: define termination expectations and preserve coax shield continuity to avoid common-mode pickup.
- Pitfalls: “works on average” but worsens integrated jitter due to imperfect shield return or mixed grounding.
- Common roles: output for systems that consume absolute time encoding outside of packet timing.
- Constraint: ensure a consistent reference ground model; mixed-ground attachments can cause intermittent decode instability.
- Pitfalls: long unshielded runs or unclear directionality increase the chance of sporadic time jumps.
- Feed routing: treat the feed as an external disturbance path; separate it physically from power bundles and noisy harnesses.
- Surge / lightning protection: protection must have a defined discharge path; avoid turning the shield into a ground-loop driver.
- Grounding & isolation: keep shield termination consistent across the installation; inconsistent shield bonding is a repeatable source of tail instability.
- Physical separation: route redundant reference cables separately; avoid long parallel runs that create a shared coupling path.
- Role clarity: label each port (IN/OUT, primary/secondary) and keep direction consistent with configuration.
- Protection symmetry: apply the same protection and grounding strategy on both paths, or switching behavior becomes asymmetric.
- Acceptance: verify bounded transitions and event traceability during reference switching under realistic disturbance profiles.
- All front-panel ports clearly labeled with signal type and direction (IN/OUT).
- Coax shield continuity verified end-to-end (no “floating shield” segments).
- 10MHz and 1PPS cabling separated from noisy harnesses and power bundles.
- Surge protection has a defined discharge path; no unintended ground loops created through shields.
- Redundant reference paths are physically separated and share consistent grounding/protection policy.
- Switching events produce bounded transient behavior and are traceable via alarms/telemetry.
Use explicit IN/OUT labeling, preserve shield continuity, and keep redundant reference routes physically separated. The diagram is conceptual—port naming and availability depend on the specific card.
Bring-up & Validation: a repeatable acceptance + production test plan
The goal is to turn “looks synchronized” into “provably within spec” with clear measurement points, statistics that expose tail risk (p99/p999), and logs that make every field incident reproducible.
How to structure the validation (3 layers)
Lab verification (what to measure, how to state pass/fail)
- Lock behavior: define start state and end state (e.g., “from cold boot to disciplined”), record time-to-lock and first-stable window (avoid averaging away transients).
- 1PPS phase & 10MHz phase compare: treat short-term noise and long-term drift as different outputs; use TIE trend plus distribution rather than a single “typical” number.
- ADEV windows: declare which τ values are acceptance-critical (e.g., τ=1s for short-term, τ=100s+ for holdover tendency), and keep the same τ set across builds.
- Loop response sanity: apply a controlled disturbance on the reference (small step or frequency offset), then verify bounded response (no large time step, no excessive settling time, no jitter blow-up).
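For the ADEV windows above, the overlapping Allan deviation can be computed directly from evenly spaced phase (time-error) samples using second differences. The sketch below is a plain-Python illustration with a hypothetical function name; it assumes gap-free data at a fixed sample interval and does not compute confidence intervals, which matter at long τ.

```python
import math


def overlapping_adev(phase_s, tau0_s, m):
    """Overlapping Allan deviation at tau = m * tau0_s.

    phase_s: time-error samples in seconds, evenly spaced by tau0_s
    m: averaging factor (integer >= 1)
    """
    terms = len(phase_s) - 2 * m
    if m < 1 or terms < 1:
        raise ValueError("need at least 2*m + 1 phase samples and m >= 1")
    tau = m * tau0_s
    acc = 0.0
    for i in range(terms):
        # Second difference of phase over the interval tau.
        d = phase_s[i + 2 * m] - 2.0 * phase_s[i + m] + phase_s[i]
        acc += d * d
    return math.sqrt(acc / (2.0 * tau * tau * terms))


def adev_table(phase_s, tau0_s, factors):
    """Evaluate a fixed tau set so results stay comparable across builds."""
    return {m * tau0_s: overlapping_adev(phase_s, tau0_s, m) for m in factors}
```

A constant frequency offset (linear phase ramp) contributes nothing to this statistic, while linear frequency drift shows up as a deviation that grows in proportion to τ, which is why the long-τ points are the ones that relate to holdover tendency.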
Rack validation (prove tails + switching are bounded)
- Reference switching: test main→backup→main (and GNSS loss/recovery). The acceptance item is not just “recovers” but bounded step and repeatable signature.
- Tail-aware statistics: report min / typical / p99 (optionally p999). Split measurements into steady-state vs recovery windows; never mix them into one histogram.
- Evidence-first logging: every switch/holdover entry must emit a timestamped event record and a reason code; this allows one-to-one correlation to offset jumps.
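The "never mix steady-state and recovery into one histogram" rule can be sketched as two small helpers: one that splits samples by known recovery windows, and one that reports tail percentiles per group. The names and the nearest-rank percentile choice are assumptions made for illustration; p999 only becomes meaningful with well over a thousand samples per group.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list (pct in 0..100)."""
    ordered = sorted(values)
    # round() guards against floating-point fuzz before taking the ceiling.
    rank = max(1, math.ceil(round(pct * len(ordered) / 100.0, 9)))
    return ordered[rank - 1]


def split_by_window(samples, recovery_windows):
    """Split (t_s, error_s) samples into steady-state and recovery groups.

    recovery_windows: list of (start_s, end_s) taken from logged events.
    Returns two lists of absolute errors: (steady, recovery).
    """
    steady, recovery = [], []
    for t, err in samples:
        in_recovery = any(a <= t <= b for a, b in recovery_windows)
        (recovery if in_recovery else steady).append(abs(err))
    return steady, recovery


def tail_report(values):
    """Summarize a group with tail-aware statistics."""
    return {
        "p50": percentile(values, 50),
        "p99": percentile(values, 99),
        "p999": percentile(values, 99.9),
        "max": max(values),
    }
```

The recovery windows should come from the event log rather than from the offset data itself, otherwise the split quietly hides unexplained spikes inside the "recovery" group.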
Production test (fast pass/fail + traceability)
- Fast functional: lock/holdover/alarm paths, output presence (1PPS/10MHz/ToD), and basic frequency sanity. Keep it deterministic.
- Config integrity: record a configuration hash (loop BW profile, reference priority list, switch policy) per serial number. Prevent “same hardware, different behavior” incidents.
- Calibration binding: tie any factory calibration (e.g., oscillator trim tables) to the serial number and the firmware/config set; store in a tamper-evident log if required.
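A configuration hash per serial number can be produced by serializing the configuration canonically and hashing the result. The sketch below uses SHA-256 over sorted-key JSON; the field names in the record and the example configuration keys are hypothetical, and a real deployment would define its own schema and storage.

```python
import hashlib
import json


def config_hash(config):
    """SHA-256 over a canonical JSON form, so key order does not matter."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def production_record(serial, firmware_version, config):
    """Bind serial number, firmware version, and configuration hash together."""
    return {
        "serial": serial,
        "firmware": firmware_version,
        "config_sha256": config_hash(config),
    }


# Hypothetical configuration used only to show the call shape.
example_config = {
    "loop_bw_profile": "narrow",
    "reference_priority": ["gnss", "synce", "ext_10mhz"],
    "switch_policy": {"hold_off_s": 10, "allow_step": False},
}
example_record = production_record("SN-EXAMPLE-0001", "0.0.0", example_config)
```

Comparing the stored hash against a hash recomputed in the field is a quick way to rule "same hardware, different behavior" in or out before any deeper timing analysis.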
Field Debug Playbook: symptom-first triage and root-cause paths
Field failures are rarely “PTP is bad.” Treat them as evidence problems: isolate whether the trigger is reference health, loop behavior, jitter-cleaner artifacts, or timestamp/ToD alignment.
Symptom → first checks (fast, high-signal)
- Sudden time jump / offset spikes: confirm a reference switch or holdover entry occurred at the same timestamp; if no event exists, treat logging/config integrity as a primary suspect.
- Holdover drift “way too big”: compare drift slope against declared ADEV/holdover limits; check temperature context and whether the loop incorrectly “chased” noisy GNSS before holdover.
- ToD wrong by a day / leap-second chaos: validate ToD formatting source and 1PPS alignment; confirm time-scale flags and the last update moment around the incident.
- SyncE flaps / jitter worsens: look for frequent QL/priority changes, switching transients, and new spurs on the output clock(s).
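The first check for a sudden time jump, whether it lines up with a logged event, can be automated with a short script: detect sample-to-sample offset changes above a threshold, then look for events within a small window around each one. The function names, the tuple layouts, and the one-second default window are assumptions; they would need to match whatever the card's telemetry actually exports.

```python
def find_jumps(samples, jump_threshold_s):
    """Find offset discontinuities.

    samples: list of (t_s, offset_s), sorted by time.
    Returns a list of (t_s, delta_s) for each change above the threshold.
    """
    jumps = []
    for (_, x0), (t1, x1) in zip(samples, samples[1:]):
        if abs(x1 - x0) > jump_threshold_s:
            jumps.append((t1, x1 - x0))
    return jumps


def correlate(jumps, events, window_s=1.0):
    """Attach nearby event reasons to each jump.

    events: list of (t_s, reason) from the card's event log.
    Returns a list of (jump_t_s, delta_s, [reasons]); an empty reason list
    marks a jump with no logged explanation.
    """
    result = []
    for t, delta in jumps:
        reasons = [r for te, r in events if abs(te - t) <= window_s]
        result.append((t, delta, reasons))
    return result
```

Jumps that come back with an empty reason list are the ones that point at logging or configuration integrity rather than at the reference itself.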
Root-cause bucket #1 — GNSS/reference health (engineering evidence)
- Evidence to capture: satellite count and C/N0 trend, lock/unlock counters, holdover entries per hour, antenna power/current anomalies, and “reference valid” flags.
- Actions: verify feedline continuity & grounding, surge protection path consistency, and whether an indoor distribution/re-radiation point is the true intermittent trigger.
- Relevant MPN examples: u-blox ZED-F9T-10B / ZED-F9T-20B, u-blox RCB-F9T-1 timing board.
Root-cause bucket #2 — Disciplining loop configuration (bounded steps vs noisy chasing)
- Evidence to capture: offset step size at re-lock, frequency correction saturation, settle time, and whether the recovery is “slew” or “step.”
- Actions: verify configuration profile/version consistency (loop BW, filter profiles, reference priority). Force a controlled switch to see if the signature repeats.
- Relevant MPN examples: ADI AD9545, Microchip ZL30772, Renesas 8A34001.
Root-cause bucket #3 — Jitter cleaner artifacts (spurs, switching transients)
- Evidence to capture: phase-noise curve snapshots before/after the incident, discrete spur appearance, integrated jitter increase within the declared band.
- Actions: correlate spur onset to reference switching or PSU/EMI changes; validate input jitter tolerance and “hitless switching” behavior if present.
- Relevant MPN examples: Skyworks/Silicon Labs Si5345 (jitter-attenuating clock family).
Root-cause bucket #4 — Timestamp/PHC/ToD alignment (tails and asymmetry)
- Evidence to capture: Rx/Tx asymmetry, long-tail offset distribution (p99/p999), ToD vs 1PPS edge alignment, and whether tails only occur during recovery windows.
- Actions: split statistics into steady vs recovery; validate ToD output policy (step vs slew boundaries) and port wiring reference for ToD/IRIG/1PPS.
- Useful models for time-interval/phase checks: Keysight 53230A, Pendulum CNT-91.
FAQs (12): common selection, validation, and field-debug questions
Each answer is written for engineering decisions: clear boundary, what to verify, and which measurements/logs make the conclusion defensible.
FAQ answers
1) If a NIC already has a PHC, why add a Time Card?
A Time Card is justified when the system needs multi-reference discipline, strong holdover, and auditable, bounded behavior. A NIC PHC may timestamp packets well, but it typically lacks external reference diversity (GNSS/SyncE/10MHz/1PPS), higher-grade oscillators, and a validation-grade event trail for reference switching and holdover transitions.
- Need signal: long-tail offset error (p99/p999) or intermittent time jumps during reference disturbances.
- Capability gap: no deterministic “bounded step/slew” policy across re-lock and reference switching.
- Operational gap: insufficient evidence (reason codes, timestamps, config hash) to reproduce incidents.
2) OCXO vs TCXO: where is the boundary, and which specs best predict holdover?
OCXO is usually chosen when holdover must remain tight across temperature and time; TCXO fits when cost/power/space dominate and holdover requirements are relaxed or the outage window is short. Predictive specs are those that track time error growth under loss of reference, not only “ppm at 25°C.”
- Most predictive: Allan deviation (ADEV) at relevant τ windows, aging rate, and temperature sensitivity (including gradients).
- Also important: close-in phase noise (for short-term stability) and g-sensitivity (mechanical stress/vibration).
- Verification: measure drift slope over a realistic holdover duration; split steady vs recovery windows.
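To make "ADEV at relevant τ windows" concrete, the sketch below computes the overlapping Allan deviation from phase (time-error) samples taken at a fixed interval. The function name and argument layout are assumptions for illustration; the estimator itself is the standard overlapping form based on second differences of phase.

```python
import math

def overlapping_adev(phase_s, tau0_s, m):
    """Overlapping Allan deviation at tau = m * tau0_s.

    phase_s: time-error samples in seconds, evenly spaced by tau0_s
    m:       averaging factor (integer >= 1)
    """
    n = len(phase_s)
    if m < 1 or n < 2 * m + 1:
        raise ValueError("need at least 2*m + 1 phase samples")
    tau = m * tau0_s
    total = 0.0
    for i in range(n - 2 * m):
        # Second difference of phase over a span of 2*tau.
        d = phase_s[i + 2 * m] - 2.0 * phase_s[i + m] + phase_s[i]
        total += d * d
    return math.sqrt(total / (2.0 * (n - 2 * m) * tau * tau))
```

A constant frequency offset (linear phase ramp) gives zero, while a linear frequency drift D gives D·τ/√2, so a rising ADEV at long τ is a sign that aging or temperature drift will dominate holdover at that outage length.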
3) GNSS is locked—why can offset still jump intermittently?
“Locked” can still be unhealthy. Intermittent offset jumps commonly come from reference quality fluctuation (multipath/interference), reference switching events, or a disciplining loop that tracks GNSS noise too aggressively. In practice, the fastest path is to prove whether the jump aligns with an event timestamp.
- Reference health: C/N0 or satellite-count changes, antenna feed anomalies, or interference bursts.
- Switching transient: reference priority changes or “valid” flag flapping trigger a phase hit.
- Loop behavior: too-wide loop bandwidth converts GNSS phase noise into PHC/ToD noise.
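Proving whether a jump aligns with an event timestamp can be sketched as a two-step check: find sample-to-sample offset jumps above a threshold, then list the logged events that fall within a small window around each jump. The record layouts, the event labels, and both function names are hypothetical; a real card's log format will differ.

```python
def find_jumps(samples, threshold_ns):
    """Return (t, delta_ns) for each sample-to-sample change >= threshold.

    samples: list of (t_seconds, offset_ns), sorted by time (assumed layout)
    """
    jumps = []
    for (_, prev_offset), (t, offset) in zip(samples, samples[1:]):
        delta = offset - prev_offset
        if abs(delta) >= threshold_ns:
            jumps.append((t, delta))
    return jumps

def correlate_jumps(jumps, events, window_s):
    """Attach the labels of events within +/- window_s of each jump.

    events: list of (t_seconds, label), e.g. (12.5, "ref_switch")
    """
    report = []
    for jump_t, delta in jumps:
        nearby = [label for event_t, label in events
                  if abs(event_t - jump_t) <= window_s]
        report.append((jump_t, delta, nearby))
    return report
```

A jump with an empty label list is the interesting case: it points toward reference quality or loop behavior rather than a logged switching event, assuming the offset log and the event log share one timebase.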
4) If GPSDO loop bandwidth is wrong, what “typical field symptoms” appear?
Wrong bandwidth shows up as a tradeoff failure: too wide makes the clock “follow noise,” too narrow makes it “recover too slowly.” The symptoms that matter are those visible in offset distribution, time steps during re-lock, and jitter/phase-noise snapshots before/after events.
- Too wide: higher short-term jitter, more offset spikes, time steps on reacquire, spurs coupling into outputs.
- Too narrow: slow convergence, long recovery windows, larger residual drift after disturbances.
- Evidence: repeated signature under a controlled reference loss/recovery test.
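The wide-versus-narrow tradeoff can be illustrated with a toy first-order tracking loop. This is deliberately not a model of any real GPSDO or DPLL: the single gain `alpha` stands in for loop bandwidth, and all names here are assumptions for the example. It only shows the qualitative behavior described above, namely that a high gain settles quickly but passes reference noise, while a low gain filters noise but recovers slowly.

```python
def first_order_loop(ref, alpha, start=0.0):
    """Track ref with y[n] = y[n-1] + alpha * (ref[n] - y[n-1]).

    alpha near 1 behaves like a wide loop; alpha near 0 like a narrow loop.
    """
    y = start
    out = []
    for r in ref:
        y += alpha * (r - y)
        out.append(y)
    return out

def settle_index(out, target, tol):
    """First index from which the output stays within tol of target, else None."""
    for i in range(len(out)):
        if all(abs(v - target) <= tol for v in out[i:]):
            return i
    return None

def ripple(out, tail):
    """Peak-to-peak value over the last `tail` samples."""
    segment = out[-tail:]
    return max(segment) - min(segment)
```

Feeding the same recorded reference disturbance through both settings and comparing `settle_index` and `ripple` mirrors the controlled loss/recovery test: the signature should repeat run to run if bandwidth is the cause.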
5) Why do many data centers use “SyncE for frequency, PTP for time”?
SyncE delivers a low-wander frequency foundation, while PTP distributes time/phase alignment. Combining them reduces the burden on the time servo: the PHC is disciplined by a cleaner, steadier frequency input, and PTP does not need to correct as much frequency error over the network.
- Frequency path: SyncE recovered clock → jitter cleaner/DPLL → stable local frequency.
- Time path: PTP packets → hardware timestamps → PHC/ToD phase alignment.
- Outcome: fewer tails and less sensitivity to packet timing noise.
6) If SSM/QL switches frequently, what happens—and how to tell upstream vs local card fault?
Frequent QL switching causes repeated phase hits, wander growth, and long-tail offset degradation—often without obvious “average offset” changes. To separate upstream instability from local policy, correlate QL/priority changes and card switch events on a single timeline.
- Upstream suspect: QL messages flap; multiple devices see the same reference instability window.
- Local suspect: local thresholds/policies trigger unnecessary switching; event counters spike without upstream evidence.
- Confirm: pin reference selection temporarily and replay the disturbance to see if tails vanish.
7) After jitter cleaning, spurs increase—what are the most common root causes?
Spurs often grow when the cleaner is configured with an unsuitable bandwidth/profile, when reference switching transients leak through, or when power/ground coupling modulates the PLL. The fastest proof is a “before/after” phase-noise snapshot plus a controlled switch experiment.
- Config-driven: wrong loop BW or fractional synthesis settings create deterministic spur families.
- Switch-driven: hitless switching not actually hitless under real reference quality.
- Coupling-driven: PSU ripple/ground bounce injects modulation into the PLL/VCO path.
8) How to turn phase noise L(f) and integrated jitter into an acceptance-ready spec?
An acceptance spec must declare measurement point and integration limits. L(f) is only comparable when the offset-frequency range, detector method, and bandwidth are fixed. Integrated jitter must state the RMS integration band (e.g., 12 kHz–20 MHz) and whether spurs are included.
- Define: output node, termination/load, integration band, averaging count or measurement time, and pass/fail threshold.
- Report: L(f) mask + RMS jitter + spur list (offset + amplitude) when relevant.
- Validate: repeatability across temperature and across reference switching states.
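As a worked example of why the integration band must be declared, the sketch below converts a tabulated L(f) curve into RMS jitter over a stated band. It assumes L(f) is single-sideband phase noise in dBc/Hz, interpolates linearly in dB versus log-frequency between table points, and excludes discrete spurs. The function names and the table layout are assumptions for illustration.

```python
import math

def _interp_db(points, f):
    """Interpolate L(f) in dB, linear in log10(f), between sorted table points."""
    for (f1, l1), (f2, l2) in zip(points, points[1:]):
        if f1 <= f <= f2:
            frac = math.log10(f / f1) / math.log10(f2 / f1)
            return l1 + frac * (l2 - l1)
    raise ValueError("frequency outside measured range")

def rms_jitter_s(points, carrier_hz, f_lo, f_hi):
    """RMS jitter in seconds from L(f) integrated over [f_lo, f_hi].

    points: list of (offset_hz, dbc_per_hz); spurs are NOT included.
    """
    pts = sorted(points)
    band = [(f_lo, _interp_db(pts, f_lo))]
    band += [(f, l) for f, l in pts if f_lo < f < f_hi]
    band.append((f_hi, _interp_db(pts, f_hi)))

    area = 0.0
    for (f1, l1), (f2, l2) in zip(band, band[1:]):
        s1 = 10.0 ** (l1 / 10.0)                      # linear power density
        slope = (l2 - l1) / 10.0 / math.log10(f2 / f1)  # power-law exponent
        if abs(slope + 1.0) < 1e-12:
            area += s1 * f1 * math.log(f2 / f1)
        else:
            area += s1 * f1 / (slope + 1.0) * ((f2 / f1) ** (slope + 1.0) - 1.0)

    phase_rad = math.sqrt(2.0 * area)  # factor 2 accounts for both sidebands
    return phase_rad / (2.0 * math.pi * carrier_hz)
```

Running the same table with two different bands (for example 12 kHz to 20 MHz versus 10 Hz to 20 MHz) generally produces different numbers, which is why a jitter figure without its band cannot serve as an acceptance limit.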
9) During leap second or ToD updates, how to avoid “time rollback / reordering” in applications?
Avoid large time steps into the system timebase. Prefer bounded slew policies for ToD/PHC alignment, and isolate “update moments” from critical sequencing logic. In validation, force leap-second-like ToD transitions and verify that monotonic ordering is preserved in logs and event timestamps.
- Policy: define when step is allowed vs forced slew; bound the maximum correction rate.
- Evidence: record update timestamps, offset distributions, and any step events with reason codes.
- Mitigation: segment statistics into steady-state and update/recovery windows; never mix them.
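A step-versus-slew policy with a bounded correction rate can be sketched as a small decision function. The threshold, the rate limit in ppb, and the `allow_step` flag are hypothetical parameters chosen for illustration, not settings of any particular card or servo. The arithmetic relies on one fact: a rate adjustment of 1 ppb removes 1 ns of offset per second.

```python
def plan_correction(offset_ns, step_threshold_ns, max_slew_ppb, allow_step):
    """Decide how to remove an offset.

    Returns (action, seconds_needed) where action is "none", "step" or "slew".
    A step is only chosen when explicitly allowed AND the offset exceeds
    the threshold; otherwise the offset is slewed at the bounded rate.
    """
    magnitude = abs(offset_ns)
    if magnitude == 0:
        return ("none", 0.0)
    if allow_step and magnitude > step_threshold_ns:
        return ("step", 0.0)
    # 1 ppb of frequency adjustment corrects 1 ns per second.
    return ("slew", magnitude / max_slew_ppb)

def is_monotonic(timestamps):
    """True if every timestamp is strictly greater than the previous one."""
    return all(b > a for a, b in zip(timestamps, timestamps[1:]))
```

For example, with `allow_step=False`, a 5000 ns offset at a 100 ppb limit takes 50 seconds to slew out. Logging the returned action together with a reason code gives the evidence trail listed above, and `is_monotonic` can be applied to application timestamps captured across a forced ToD transition.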
10) How long should holdover be tested, and which statistics avoid “short test looks good” traps?
Holdover testing must match outage risk and the oscillator’s relevant time constants. Short tests often miss temperature gradients and aging-like drift behavior. Use a duration long enough to reveal drift slope, then report both slope and tail risk (worst segments, not just averages).
- Measure: time error growth vs time; capture temperature context; repeat across thermal conditions.
- Stats: slope + max segment drift + p99/p999 within declared windows; split steady vs recovery.
- Interpret: connect results to τ windows (ADEV) that matter for the target outage length.
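The "slope + max segment drift" statistics can be sketched as below. The names and the `(times_s, te_ns)` layout are assumptions for illustration. Segment drift is defined here as the end-minus-start time-error change over a window of the declared length, which is a simplification; a peak-to-peak (MTIE-style) definition would be stricter.

```python
def linear_slope(times_s, te_ns):
    """Least-squares slope of time error vs time, in ns per second."""
    n = len(times_s)
    mean_t = sum(times_s) / n
    mean_e = sum(te_ns) / n
    num = sum((t - mean_t) * (e - mean_e) for t, e in zip(times_s, te_ns))
    den = sum((t - mean_t) ** 2 for t in times_s)
    return num / den

def max_segment_drift(times_s, te_ns, segment_s):
    """Largest |TE(end) - TE(start)| over any window of at most segment_s.

    times_s must be sorted ascending. Each sample is tried as a window start.
    """
    worst = 0.0
    n = len(times_s)
    j = 0
    for i in range(n):
        if j < i:
            j = i
        # Advance the window end as far as the segment length allows.
        while j + 1 < n and times_s[j + 1] - times_s[i] <= segment_s:
            j += 1
        worst = max(worst, abs(te_ns[j] - te_ns[i]))
    return worst
```

Reporting both numbers guards against the "short test looks good" trap: a modest average slope can hide a segment where drift is several times faster, for instance during a temperature ramp.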
11) What are the most common wiring/grounding pitfalls for 1PPS/10MHz/ToD?
The most frequent failures are not “bad clocks” but ground loops, shield discontinuities, and protection paths that inject noise. Treat 1PPS/10MHz/ToD as precision signals: control return paths, keep shielding continuous, and ensure surge currents do not share sensitive grounds.
- Ground loops: multiple chassis bonds or mixed return paths create low-frequency wander and spur-like modulation.
- Shield continuity: broken coax shield or adapters introduce pickup and edge timing noise.
- Protection path: surge/ESD devices must dump to the intended reference (chassis/earth) without polluting signal ground.
12) What is the minimum pre-deploy test set to quickly screen unstable cards?
A minimal screen should force the two fastest failure modes: switching transients and tail growth. Combine a controlled reference loss/recovery, a short p99 window measurement, a quick spur snapshot, and a traceability check (config hash + event log). This is intended to catch most “looks fine until field” escapes before deployment.
- Test 1: forced reference loss/re-lock; verify bounded step/slew and repeatable signature.
- Test 2: p99 offset/phase-error in a defined window (steady vs recovery split).
- Test 3: spur check / integrated jitter band snapshot; verify no new discrete spurs.
- Test 4: outputs present (1PPS/10MHz/ToD) + alarms + event logs + config hash record.