Clock Monitor / Missing-Pulse: Alarms & Auto Switchover
Clock monitoring is not about measuring “more accurately”—it is about keeping systems safe: detect missing or out-of-spec clocks early, raise the right alarms, and trigger the right protection or switchover action before endpoints fail.
This page shows how to define observables, thresholds, debounce/persistence, alarm routing, and verification criteria so monitoring becomes a reliable “Detect → Alarm → Protect → Switch” control loop.
What problem this page solves: “Detect → Alarm → Protect → Switch”
Clock monitoring is a reliability loop, not a “more accurate measurement” exercise. The objective is to detect clock health violations early enough to trigger the correct action (alarm, degrade, protect, switchover), while minimizing false alarms and avoiding self-inflicted instability.
Closed-loop questions
- Clock present?
- In-spec (within window)?
- Stable long enough (persistence)?
- Which alarm action?
- Switch/degrade/protect?
- When and how to recover?
Two core detector families
The two families used throughout this page are a missing-pulse watchdog (an edge-fed timeout that checks presence) and a frequency-window counter (a gate-time or period measurement that checks offset). The watchdog catches short dropouts; the counter catches drift and out-of-window behavior.
Engineering outputs (what “good” looks like)
- Flags: present / in-spec / suspect / fault
- Numbers: offset (ppm/%), event counters, timestamps
- State: persistence + hysteresis + clear policy
- Actions: alarm level, latch, switchover request, hold-off
Failure modes & false alarms: what is being prevented
This section constrains the problem space. Each failure class maps to the primary observable, the most common false-alarm mechanism, and the first diagnostic probe. The same mapping becomes the backbone for thresholding, debounce/persistence, and switchover policies later in this page.
Why false alarms happen (mechanisms, not symptoms)
Typical mechanisms, each addressed later on this page: measurement gates too short relative to edge jitter, thresholds without hysteresis, unqualified edges feeding the watchdog, and tap or probe loading that distorts the waveform the monitor actually sees.
What to measure: presence, frequency offset, pulse integrity
Monitoring should focus on observables that directly drive reliable actions. This page uses three: presence (liveness), frequency offset (in-window), and pulse integrity (min width / duty). Topics like jitter or phase noise only matter here as sources of measurement uncertainty and are handled elsewhere.
Detection architectures: counter, gate-time, window compare
The same observables can be estimated in multiple ways. Architecture selection is primarily a trade between latency (how fast an actionable decision is made) and robustness (how stable the decision is under noise, transients, and edge artifacts). The blocks below use a consistent pipeline: qualify edges → measure → compare to windows → apply hysteresis/persistence → drive alarms/actions.
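The measure → compare stage of this pipeline can be sketched as a gate-time counter classified against warn/fault windows. This is a minimal illustration, assuming example values for f_nom, gate time, and the ppm limits; hysteresis and persistence (the later pipeline stages) would wrap this classifier rather than live inside it.

```python
# Gate-time estimate of frequency offset, classified against warn/fault
# windows. f_nom, gate time, and ppm limits are illustrative assumptions.

def freq_state(count: int, gate_s: float, f_nom: float,
               warn_ppm: float, fault_ppm: float) -> str:
    """Classify one qualified-edge count over a gate interval."""
    f_meas = count / gate_s                      # measured frequency
    offset_ppm = (f_meas - f_nom) / f_nom * 1e6  # signed offset in ppm
    if abs(offset_ppm) > fault_ppm:
        return "FAULT"
    if abs(offset_ppm) > warn_ppm:
        return "WARN"
    return "OK"

# 10 MHz nominal, 100 ms gate: 1,000,000 edges is exactly nominal.
print(freq_state(1_000_000, 0.1, 10e6, 50, 100))  # OK
print(freq_state(1_000_080, 0.1, 10e6, 50, 100))  # +80 ppm -> WARN
```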
Windowing & guardband: thresholds that do not backfire
A reliable monitor does not start from a “nice-looking number”. It starts from a budget that can be reviewed: spec window (what endpoints require) + drift window (temperature/aging/supply effects) + measurement window (estimator error) + a controlled margin (unmodeled tail + manufacturing spread). This section turns that stack into actionable parameters for frequency, missing-pulse timeout, and pulse integrity.
- Nominal: f_nom and the reference timebase used by the monitor.
- Windows: warn ±W1, fault ±W2, clear ±Wc (Wc < W1 to avoid chatter).
- Estimator: gate time or period-avg count N, plus outlier reject policy.
- Persistence: N_fail to assert, N_ok to clear.
- Slowest period: use the minimum allowed frequency (worst-case drift).
- Blanking tolerance: maximum edge gap the system can tolerate without unsafe behavior.
- Timeout: period_max × N_miss + hold-off allowance.
- Edge qualify: ensure only valid edges can feed the watchdog.
- Receiver min: min-high/min-low from endpoint timing acceptance.
- Distortion: allocate for buffer + routing + termination effects.
- Measurement: include qualification uncertainty (glitch reject threshold).
- Guardband: add margin to cover manufacturing tails and aging.
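The budget stacks above can be turned into numbers with two small helpers: one for the missing-pulse timeout and one for a WARN window sitting inside the spec limit by the drift + measurement + margin stack. All values here are illustrative assumptions, not from any endpoint specification.

```python
# Budget-to-parameter sketch; limits are illustrative, not from a datasheet.

def missing_pulse_timeout_us(f_min_hz: float, n_miss: int,
                             holdoff_us: float) -> float:
    """timeout = period_max * N_miss + hold-off allowance."""
    period_max_us = 1e6 / f_min_hz   # slowest allowed period (worst-case drift)
    return period_max_us * n_miss + holdoff_us

def warn_window_ppm(spec_ppm: float, drift_ppm: float,
                    meas_ppm: float, margin_ppm: float) -> float:
    """WARN sits inside the spec limit by the drift + measurement + margin
    stack, so expected excursions do not cross it."""
    return spec_ppm - (drift_ppm + meas_ppm + margin_ppm)

print(missing_pulse_timeout_us(1e6, 3, 0.5))  # 3.5 (us)
print(warn_window_ppm(100, 30, 10, 10))       # 50 (ppm)
```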
Debounce & persistence: stable alarms, clean recovery
Most “false alarms” are not caused by the clock; they are caused by alarm logic that treats momentary outliers as faults. Debounce and persistence convert noisy measurements into action-grade events by controlling when to assert, when to clear, and whether to latch critical states.
- Time-based: require N consecutive measurement windows to fail.
- Event-based: require N consecutive failure events (edge gaps, width violations).
- Combine carefully to avoid delaying detection beyond system tolerance.
- Assert: fail_count ≥ N_fail (or fail_time ≥ T_fail).
- Clear: ok_count ≥ N_clear (or ok_time ≥ T_clear).
- Common pattern: N_clear > N_fail to prevent chatter after recovery.
- Latch faults that triggered protection or switchover.
- Auto-clear warnings that are only informative and do not drive actions.
- Latched alarms should require explicit clear criteria and logging.
- SUSPECT allows increased observation and logging without immediate disruptive actions.
- ALARM is reserved for persistent failures that justify protection or switchover.
- This separation prevents “alarm storms” from transient disturbances.
- After a switch or reset action, apply a short hold-off to ignore known transients.
- Hold-off should be scoped (e.g., presence only) and time-bounded with re-evaluation.
- Always log cause codes so “recovered” does not mean “forgotten”.
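The assert/clear rules above (N_fail consecutive failures to assert, N_clear consecutive passes to clear, optional latch) fit in a small state machine. This is a sketch with illustrative names; a real monitor would also carry cause codes and timestamps.

```python
# Debounce/persistence sketch: N_fail consecutive failures assert the
# alarm; N_clear consecutive passes clear it unless latched.

class Debounce:
    def __init__(self, n_fail: int, n_clear: int, latch: bool = False):
        self.n_fail, self.n_clear, self.latch = n_fail, n_clear, latch
        self.fail_count = self.ok_count = 0
        self.alarm = False

    def update(self, in_window: bool) -> bool:
        if in_window:
            self.ok_count += 1
            self.fail_count = 0          # any pass resets the failure run
            if self.alarm and not self.latch and self.ok_count >= self.n_clear:
                self.alarm = False       # clear only after sustained recovery
        else:
            self.fail_count += 1
            self.ok_count = 0
            if self.fail_count >= self.n_fail:
                self.alarm = True
        return self.alarm

d = Debounce(n_fail=3, n_clear=5)
seq = [False, False, True, False, False, False]  # one outlier, then a real fault
print([d.update(ok) for ok in seq])  # asserts only after 3 consecutive fails
```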
Alarm grading & routing: from detection to system actions
An alarm becomes useful only when it is graded (severity is unambiguous), routed (delivered through the right channel with the right latency), and action-mapped (each level triggers a defined response). This section defines a practical INFO/WARN/FAULT ladder, the routing paths (IRQ/GPIO, I²C status, reset chain, telemetry), and the minimum context required for diagnosis and safe switchover.
- INFO: minor deviation; log & trend only.
- WARN: near boundary; prepare actions, increase observation.
- FAULT: out-of-window or missing-pulse; protection or switchover allowed.
- Cause code (timeout / freq high / freq low / duty / width).
- First/last seen timestamp and duration or counts.
- Snapshot: recent measurement value(s) and window parameters.
- IRQ/GPIO: lowest latency trigger for policy/interrupt logic.
- I²C / status: rich context (cause + snapshot) for diagnosis.
- Reset chain: hard protection for high-risk modes.
- Telemetry: remote visibility and trend-based maintenance.
- INFO: log only, increment counters.
- WARN: raise observation, notify supervisor, arm switch logic.
- FAULT: switchover or protection stop, then latch and record.
- Hysteresis: clear window tighter than warn window.
- Rate limit: cap repeated actions over a time span.
- Hold-off: suppress known transients after switching.
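The rate-limit bullet above can be sketched as a rolling-window cap on repeated actions; when the cap is reached, the policy should escalate instead of acting again. Limits and names are illustrative.

```python
# Rolling-window rate limiter: at most max_actions within span_s seconds.

from collections import deque

class RateLimiter:
    def __init__(self, max_actions: int, span_s: float):
        self.max_actions, self.span_s = max_actions, span_s
        self.times: deque = deque()

    def allow(self, now_s: float) -> bool:
        while self.times and now_s - self.times[0] > self.span_s:
            self.times.popleft()          # drop actions outside the span
        if len(self.times) < self.max_actions:
            self.times.append(now_s)
            return True
        return False                      # cap reached: escalate, don't act

rl = RateLimiter(max_actions=2, span_s=60.0)
print([rl.allow(t) for t in (0.0, 10.0, 20.0, 75.0)])  # [True, True, False, True]
```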
Automatic switchover: hitless vs bounded-glitch strategies
“Automatic switchover” is safe only when it is qualified by persistence and bounded by explicit post-check criteria. This section separates trigger signals (missing-pulse, offset, LOS/LOL), qualification (evidence + rate limits + hold-off), and execution (hitless or bounded-glitch), followed by a stability observation window and a conservative return-to-primary strategy.
- Hard: missing-pulse timeout, LOS.
- Soft: frequency out-of-window, LOL (requires stronger persistence).
- WARN can arm the switch path; FAULT can execute it.
- Persistence: N_fail/T_fail met (debounced).
- Evidence: cause code + snapshot captured.
- Action gate: not in hold-off; rate limit not exceeded.
- Define disturbance bounds: glitch width, relock time, error burst.
- Prepare endpoints with a brief protection posture (mode freeze, degrade, or notify supervisor).
- Require a stability observation window before declaring success.
- Observe for T_stable: presence OK, offset back in-window.
- Confirm endpoint status: lock restored and error counters stop increasing.
- If stability fails, escalate to a latched fault or protection stop.
- Return is riskier than the first switch; intermittent failures are common.
- Prefer manual confirmation or stricter stability windows and rate limiting.
- Use anti-ping-pong: repeated switching should force an operator-visible latch.
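The qualification and anti-ping-pong rules above reduce to a small gate: a FAULT may execute a switch only when persistence is met, no hold-off is active, and the switch count has not exceeded its cap, in which case an operator-visible latch takes over. A sketch, with illustrative names and limits:

```python
# Switchover qualification gate, per the trigger/qualification/anti-ping-pong
# rules above. States and limits are illustrative.

def qualify_switch(persistence_met: bool, in_holdoff: bool,
                   switches_in_window: int, max_switches: int) -> str:
    if not persistence_met or in_holdoff:
        return "HOLD"     # keep observing; evidence or hold-off not satisfied
    if switches_in_window >= max_switches:
        return "LATCH"    # anti-ping-pong: force operator-visible latch
    return "SWITCH"       # execute, then start the stability window

print(qualify_switch(True, False, 0, 2))   # SWITCH
print(qualify_switch(True, False, 2, 2))   # LATCH
print(qualify_switch(False, False, 0, 2))  # HOLD
```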
Implementation hooks: board-level observation points that do not disturb clocks
A clock monitor becomes reliable only when its tap points are chosen for diagnosis, its input path is electrically harmless, and its reference strategy avoids common-mode failure. This section defines practical tap locations along the clock tree, isolation and level-compatibility rules, and multi-channel configuration patterns that keep alarms actionable.
- Meaning: verifies the source is present and near nominal.
- Best for: source stop, gross offset, duty collapse at origin.
- Risk: does not reveal downstream distribution failures.
- Meaning: validates the “conditioning stage” output.
- Best for: lock-loss, unlock chatter, offset after conditioning.
- Risk: still may not match what endpoints actually receive.
- Meaning: isolates issues between conditioner and distribution.
- Best for: link integrity before branching to multiple domains.
- Risk: cannot capture per-endpoint impairments.
- Meaning: observes what the endpoint truly sees.
- Best for: missing pulses after routing, local distortion, edge loss.
- Risk: highest chance of interaction; requires careful isolation.
- Isolation: buffer or high-impedance sampling path.
- Level OK: monitor thresholds and common-mode must match the standard.
- No “heavy filtering”: avoid masking narrow pulses by analog smoothing.
- Independent ref: a separate timebase for diagnosis.
- Cross-check: compare channels/domains for consistency.
- Per-channel thresholds: independent freq window / timeout / width limits.
- Per-channel cause: alarms must include channel/domain identifiers.
- Aggregated policy: channel → domain → system escalation (avoid one-line “kill switch”).
- Maintainability: expose minimal test points and keep parameters configurable.
Verification & production test: proving reliability with executable checks
Reliability is demonstrated by fault injection, measurable coverage targets (false positives, false negatives), and bounded timing (detection and recovery latency). The same logic must be verifiable on a production line with minimal equipment by testing the fastest alarm path (IRQ/GPIO) and reading cause codes via registers for diagnosis.
- Clock-off / gate-off to force missing-pulse.
- Periodic blank windows to validate debounce and persistence.
- Expected output: timeout cause code + bounded T_detect.
- Programmed offset (±ppm or ±%) around the window edges.
- Step changes across warn/fault boundaries to validate hysteresis.
- Expected output: freq-high / freq-low cause code + stable clear behavior.
- Duty distortion (high/low asymmetry).
- Minimum high/low width squeeze.
- Narrow pulse / glitch insertion to validate integrity detection (not analog masking).
- False positive: no FAULT in nominal conditions.
- False negative: injected faults must be caught.
- Latency: detection and recovery must be bounded.
- Use injection edge as t0 (gate control or offset step).
- Use IRQ/GPIO assertion as t1.
- T_detect = t1 − t0; recovery uses clear edge similarly.
- Prefer fastest path validation (IRQ/GPIO) for timing.
- Read registers for cause codes and counters (diagnosis).
- Use minimal injectors: brief gate-off + known offset mode + simple duty toggle.
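The latency check above (T_detect = t1 − t0 against a bound) is trivial but worth encoding so the production-line script asserts it rather than eyeballing it. Timestamps here are illustrative captured values, not real hardware reads.

```python
# T_detect check: t0 = injection edge, t1 = IRQ/GPIO assertion.

def t_detect_ok(t0_us: float, t1_us: float, bound_us: float) -> bool:
    """True when detection latency is non-negative and within the bound."""
    t_detect = t1_us - t0_us
    return 0.0 <= t_detect <= bound_us

print(t_detect_ok(100.0, 450.0, 500.0))  # True: detected in 350 us
print(t_detect_ok(100.0, 700.0, 500.0))  # False: 600 us exceeds the bound
```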
Field monitoring & health metrics: define logs that actually help
Clock monitoring becomes system health only when the logs can answer three questions fast: (1) slow drift vs sudden step, (2) risk trending up, (3) which channel/domain/tap point to investigate first. Keep this device-side and actionable—avoid turning it into a “big data platform” topic.
A) Minimum useful dataset (per channel, per domain)
Avoid “one counter forever.” Use fixed aggregation windows (e.g., 1 min / 10 min / 1 h) and always log window validity so missing samples cannot masquerade as stability.
B) Drift vs step: a simple, reliable triage rule
Field diagnosis fails when slow thermal/aging drift is treated like intermittent disconnection (or vice versa). Use minimal rules that firmware can implement deterministically.
- Offset trend slope keeps the same sign for N windows (e.g., N=6 at 10-min windows).
- WARN-level near-boundary alarms dominate; durations skew long.
- Offset correlates with temperature or predictable load cycles.
- One-window delta exceeds a fixed threshold (e.g., Δoffset > X ppm) and alarms cluster tightly in time.
- Missing-pulse or width/duty faults appear briefly (short duration bins).
- Switchover count spikes; return-to-primary becomes unstable.
Implementation note: compute slope from window medians (robust to outliers) and compute burst from alarm timestamps. Keep thresholds fixed and reviewed—do not “learn” the definition of a fault.
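The triage rules above can be implemented deterministically from window medians: a step is a one-window delta over a fixed threshold, and drift is a run of same-sign deltas at least N windows long. A sketch, assuming illustrative median values and thresholds:

```python
# Drift vs step triage from per-window medians (robust to outliers).
# Thresholds are fixed and reviewed, per the note above.

def classify(window_medians_ppm, n_drift: int, step_ppm: float) -> str:
    deltas = [b - a for a, b in zip(window_medians_ppm, window_medians_ppm[1:])]
    if any(abs(d) > step_ppm for d in deltas):
        return "STEP"                      # sudden one-window jump
    run = best = 0
    last_sign = 0
    for d in deltas:                       # longest same-sign run = drift
        sign = (d > 0) - (d < 0)
        run = run + 1 if sign and sign == last_sign else (1 if sign else 0)
        last_sign = sign
        best = max(best, run)
    return "DRIFT" if best >= n_drift else "STABLE"

print(classify([0, 1, 2, 3, 4, 5, 6], n_drift=5, step_ppm=10))  # DRIFT
print(classify([0, 0, 25, 25, 25], n_drift=5, step_ppm=10))     # STEP
```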
C) Fixed vs adaptive thresholds (safe self-tuning without risk)
- Missing-pulse timeout (presence watchdog).
- Hard frequency fault window (true out-of-spec).
- Minimum high/low width for pulse integrity (prevents silent edge loss).
- “Near-boundary” WARN level (early-warning band inside the spec limit).
- Baseline offset center (to detect abnormal deviation from normal behavior).
- Train only during known-good periods (no alarms, valid samples, normal temp/rails).
- Adaptive WARN must remain inside the fixed FAULT window and may only tighten, not relax.
- FAULT thresholds and missing-pulse logic never change dynamically.
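The "tighten only, stay inside FAULT" rule above can be enforced with a clamp: the learned WARN band never drops below a sanity floor and never approaches the fixed FAULT window. The 80% ceiling and the idea of a `learned_ppm` input from known-good training windows are illustrative assumptions.

```python
# Safe adaptive WARN: may only tighten, always inside the fixed FAULT window.

def adaptive_warn_ppm(learned_ppm: float, warn_floor_ppm: float,
                      fault_ppm: float) -> float:
    warn = max(learned_ppm, warn_floor_ppm)  # never collapse below a floor
    return min(warn, fault_ppm * 0.8)        # never approach FAULT (80% here)

print(adaptive_warn_ppm(30.0, 20.0, 100.0))  # 30.0
print(adaptive_warn_ppm(95.0, 20.0, 100.0))  # clamped to 80.0
```

FAULT thresholds and the missing-pulse timeout are never inputs to this function; they stay fixed.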
D) Event schema for traceability (post-mortem friendly)
A good event record is small, structured, and searchable. It must explain what happened, where, and what the system did—without requiring oscilloscope access.
- timestamp, channel_id, domain_id, tap_point
- alarm_level (INFO/WARN/FAULT), cause_code
- first_seen, last_seen, duration_ms
- offset_ppm (or %), min_width_ns (if measured), missing_timeout_us
- action_taken (log/degrade/switch/hold/reset), switched_to (if any)
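One possible shape for the record above is a frozen dataclass, so firmware and host tooling share a single schema. This sketch carries a representative subset of the fields listed (names are illustrative); duration is derived from first/last seen.

```python
# Structured, searchable event record for post-mortem analysis.

from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class ClockEvent:
    timestamp_ms: int
    channel_id: int
    domain_id: int
    tap_point: str        # e.g. "post_buffer", "endpoint"
    alarm_level: str      # INFO / WARN / FAULT
    cause_code: str       # timeout / freq_high / freq_low / duty / width
    first_seen_ms: int
    last_seen_ms: int
    offset_ppm: float
    action_taken: str     # log / degrade / switch / hold / reset

ev = ClockEvent(1000, 2, 0, "endpoint", "FAULT", "timeout",
                990, 1000, 0.0, "switch")
print(asdict(ev)["cause_code"])  # timeout
```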
E) A compact “health summary” output (ops + firmware friendly)
- p99 offset as a fraction of budget (e.g., p99 / fault_limit).
- alarm rate (count/day) and top 2 cause codes.
- switch rate and median time_to_stable.
- trend flag: drift / step / burst / stable.
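The first two summary lines above can be computed directly from the aggregation windows; the percentile method (nearest-rank) and field names here are illustrative assumptions.

```python
# Health-summary sketch: p99 offset as a fraction of the FAULT budget,
# plus a simple alarm rate per day.

def p99(values):
    s = sorted(values)
    return s[min(len(s) - 1, int(0.99 * len(s)))]  # nearest-rank style

def health_summary(offsets_ppm, fault_limit_ppm, alarms, days):
    return {
        "p99_fraction": p99([abs(v) for v in offsets_ppm]) / fault_limit_ppm,
        "alarm_rate_per_day": len(alarms) / days,
    }

s = health_summary(list(range(100)), 200.0, ["timeout", "freq_high"], 2.0)
print(s)  # p99_fraction 0.495, alarm_rate_per_day 1.0
```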
Applications & IC selection notes (monitoring-focused)
This section stays strictly on clock monitoring: loss-of-signal/lock awareness, frequency window checking, pulse integrity screening, alarm interfaces, debounce/latch resources, and switchover hooks. It does not attempt to replace the dedicated pages for mux/crosspoint/fanout/cleaners.
A) Monitoring-first application patterns (same template, different priorities)
- Goal: detect LOS/LOL early to prevent link drop escalation.
- Must measure: presence + frequency window; add width/duty only if edge integrity is fragile.
- Output: fast IRQ/GPIO + readable cause code for firmware routing.
- Action: log → degrade → switch (avoid ping-pong via persistence rules).
- Goal: prevent silent misalignment by catching missing SYSREF/refclk conditions.
- Must measure: presence (SYSREF is the strictest) + refclk frequency window.
- Output: deterministic status (pins/registers) and latch support for “fault happened” evidence.
- Action: latch fault → controlled re-sync flow (keep switchover bounded).
- Goal: trend-based early warning and clear out-of-window escalation.
- Must measure: frequency statistics (p95/p99) + drift/step flags.
- Output: readable status + counters; periodic summaries are often more useful than raw edges.
- Action: alarm → degrade/switch (severity-driven).
- Goal: reliable switchover + audit-grade event records.
- Must measure: presence + duration distribution + switch behavior.
- Output: latched faults + explicit switch reason codes.
- Action: switch with strong anti-oscillation rules; return-to-primary is stricter than failover.
B) Selection dimensions (monitoring features only)
- Input type and thresholds (LVCMOS/LVDS/HCSL/LVPECL/CML), common-mode range.
- Frequency span + measurement method limits (gate-time/period/watchdog).
- What can be detected: presence, frequency window, width/duty integrity.
- Programmable windows (freq_hi/lo, timeout, width/duty) and per-channel config.
- Debounce/persistence resources (N windows / N events) and latch/auto-clear behavior.
- Alarm interfaces: GPIO/IRQ + readable cause/status registers.
- Robustness: temperature, supply noise tolerance, and clean status semantics.
- Observability: counters/statistics availability (or at least clear fault flags).
- Multi-channel scaling: per-channel thresholds + aggregated alarm routing.
C) Concrete reference part numbers (starting points; verify suffix/package/availability)
The list below is intentionally monitoring-driven (LOS/LOL, status/interrupts, redundant reference support, readable fault flags). Final selection must be validated against input standard, frequency plan, debounce needs, and system-level switchover requirements.
- Si5341 (example ordering: Si5341B-D-GM) — fault monitoring and status via serial interface; useful when monitoring is coupled with clock generation.
- Si5345 / Si5344 / Si5342 — family provides LOL/LOS indicators and fault handling suited to robust reference monitoring.
- Si5391 — fault indicators for LOS/LOL; common in high-performance timing trees where fault status must be readable.
- Si5382 — reference monitoring features including LOS/invalid frequency awareness (fit when multi-input monitoring is needed).
- LMK03328 (Texas Instruments) — exposes loss-of-input-signal and loss-of-lock interrupt sources; useful when fault flags must feed an MCU/FPGA alarm router.
- 8T49N241 (Renesas / FemtoClock®NG) — monitors input clocks for loss-of-signal and can support redundant reference behavior.
- 8T49N283 (Renesas) — monitors inputs for LOS and supports hitless reference switching options (fit when automatic switchover is required inside the timing IC).
- AD9528 (Analog Devices) — provides status pins and status registers; commonly used where readable lock/fault status is needed in clocking subsystems.
- HMC7044 (Analog Devices) — clock generation/distribution platform used in converter clocking; integrate monitoring via readable status and controlled outputs.
- 8A34001 (Renesas ClockMatrix / SMU) — system timing/synchronization device where reference monitoring and timing paths are managed in a structured way.
- ZL30731–ZL30735 (Microchip) — network synchronizers with frequency measurement/monitoring hooks; useful when field health and reference monitoring are requirements.
FAQs (monitoring & missing-pulse)
Short, actionable troubleshooting answers that stay within this page boundary (presence / frequency window / pulse integrity / alarms / switchover / observability). Each answer uses the same 4-line, data-like structure for fast execution and verification.
1) Missing-pulse alarm triggers randomly, but the clock looks fine on scope—why?
Likely cause: Gate/window too short, so edge jitter/glitches are interpreted as “missing” inside the monitor logic.
Quick check: Increase gate time ×10 and enable event timestamps; correlate alarms vs temperature and supply (same timebase window).
Fix: Add persistence (N consecutive fails) + input conditioning at the tap (e.g., Schmitt buffer SN74LVC1G17 / NC7SZ17 for LVCMOS taps; for differential clocks, use a dedicated buffer/fanout output as the monitor input source).
Pass criteria: False-alarm rate < X/hour across worst-case V/T/noise for the qualification duration (e.g., 24 h), with valid_samples coverage > Y%.
2) Frequency-offset alarm never triggers even when a slow drift is forced—what’s first?
Likely cause: Wrong nominal reference (divider/PLL ratio mismatch) or wrong units (ppm vs %), so the programmed window is not the intended window.
Quick check: Read back nominal and thresholds from registers; compare with measured counts/period over the same gate interval (no mixed units).
Fix: Establish one “source of truth” for nominal (single configuration owner) + sanity clamps (min/max window). If a timing device is used for configuration centralization, ensure nominal is readable (e.g., Si5341 or LMK03328 as examples to look up; verify exact ordering code/features in datasheets).
Pass criteria: Alarm asserts within T_detect after injected offset ≥ threshold, and clears only after the defined recovery persistence window.
3) Alarm chatters near threshold—how to stop it without masking real faults?
Likely cause: No hysteresis and/or no separate enter/exit conditions, so boundary crossings toggle the state machine.
Quick check: Histogram measured offset around the boundary and count bidirectional crossings per window; confirm toggle rate vs persistence settings.
Fix: Add hysteresis (enter/exit thresholds) + recovery persistence. For LVCMOS taps, Schmitt conditioning (e.g., SN74LVC1G17) reduces boundary chatter caused by slow edges.
Pass criteria: Alarm toggles ≤ 1 time during steady operation near the boundary (defined test condition), while FAULT detection remains within T_detect.
4) Switchover happens, but endpoints still lose lock—what should be verified first?
Likely cause: Switch disturbance exceeds endpoint tolerance and/or the post-switch stability window is too short.
Quick check: Align endpoint lock-loss indication (pin/CSR) with switch-control timing; measure “time-to-stable” vs endpoint re-lock time.
Fix: If hitless behavior is required, use a dedicated hitless/controlled-reference switching approach (examples to look up for integrated timing platforms: Si5345, Si5341, 8T49N283; verify exact capabilities/order codes). Otherwise, widen the post-switch settle window and delay endpoint-dependent actions until stability qualifies.
Pass criteria: Endpoint re-lock < T_relock, and no repeated switching within T_guard after a successful switch.
5) After switching to backup, the system keeps switching back and forth—why?
Likely cause: Auto-revert is enabled without stable qualification, and the backup source is also marginal (or both share a common-mode issue).
Quick check: Log both sources’ health metrics (offset p95/p99, alarm duration bins, switch reasons) and verify the revert condition definition (enter/exit thresholds + persistence).
Fix: Require “stable-on-backup for T_hold” before revert; prefer manual revert for high-risk endpoints. If integrated reference switching is used, ensure revert rules are configurable/readable (example platforms to look up: 8T49N283, Si5345; verify datasheets).
Pass criteria: Max switch count ≤ N per day under worst-case conditions, with zero ping-pong sequences shorter than T_pp_min.
6) I²C reads show “mixed” status counters—what’s the first fix?
Likely cause: Non-atomic multi-byte reads during rollover (counter changes mid-transaction), producing inconsistent snapshots.
Quick check: Read twice and compare; if mismatch rate is non-zero, rollover is occurring during reads. Check whether a latch-on-read or snapshot mechanism exists.
Fix: Implement atomic snapshot in firmware (latch, read, then clear/ack) or use device-supported “freeze counters”/shadow registers when available.
Pass criteria: Repeated reads match within ≤ 2 attempts, and snapshot age < T_snap_max.
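The read-twice check from the fix above can be sketched in firmware-style pseudologic: read the counter block twice and accept only when both snapshots match, with bounded retries. `read_block` is a placeholder for the real I²C register read, not an actual driver API.

```python
# Atomic-snapshot sketch: accept a counter block only when two
# consecutive reads agree (no mid-read rollover).

def stable_snapshot(read_block, max_tries: int = 3):
    prev = read_block()
    for _ in range(max_tries - 1):
        cur = read_block()
        if cur == prev:
            return cur        # consistent snapshot
        prev = cur            # rollover occurred mid-read; retry
    return None               # still inconsistent: treat as read failure

# Simulated device whose counter rolls over during the first read pair.
vals = iter([{"cnt": 255}, {"cnt": 0}, {"cnt": 0}])
print(stable_snapshot(lambda: next(vals)))  # {'cnt': 0}
```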
7) Why does adding a probe change alarm behavior?
Likely cause: The monitoring tap is not isolated; probe capacitance loads the node, distorting edges or creating reflections that shift pulse integrity.
Quick check: Compare with a high-impedance active probe; measure edge rate and swing at the tap with/without the probe (same ground strategy).
Fix: Buffer/isolate the tap and feed the monitor from a dedicated buffer output (example LVCMOS buffer to look up: LMK1C1104; verify I/O standard and operating range). For CMOS taps, Schmitt conditioning (e.g., SN74LVC1G17) may reduce sensitivity to slow/loaded edges.
Pass criteria: Alarm behavior unchanged with/without probe (no new alarms), and measured edge rate change < Δ_edge_max under the defined test setup.
8) Loss-of-lock alarms appear only at high temperature—what to suspect first?
Likely cause: Guardband is too tight vs drift/aging/temperature, and/or the input buffer switching point shifts with temperature and supply noise.
Quick check: Log offset vs temperature and compare against the guardband stack (spec + drift + measurement + margin); confirm whether alarms align with threshold occupancy.
Fix: Widen window using drift budget (do not relax missing-pulse protection), improve tap conditioning, and remove borderline edge shapes (e.g., Schmitt buffer NC7SZ17 / SN74LVC1G17 for suitable CMOS taps; ensure monitor input standard matches the source).
Pass criteria: No alarms across the required temperature range when expected drift + margin is applied, and p99 offset remains ≤ K% of the FAULT limit.
9) Detection delay is met, but short dropouts are still missed—how?
Likely cause: Persistence/gate time is longer than the dropout width; the dropout ends before the next evaluation point.
Quick check: Inject controlled dropout widths and sweep from short to long; map detection probability vs dropout width and evaluation method.
Fix: Run a missing-pulse watchdog (edge-fed timeout) in parallel with a frequency-window counter; use the watchdog for short dropouts and the counter for drift/out-of-window behavior.
Pass criteria: Detect dropout ≥ T_drop_min with probability ≥ P_min, and assert alarm within ≤ T_detect_short.
10) Monitoring after the cleaner misses faults that endpoints see—why?
Likely cause: The fault occurs downstream (fanout output, connector, branch routing, endpoint termination), so upstream monitoring never observes the degraded waveform.
Quick check: Move the monitoring tap closer to the endpoint or monitor multiple tree points; correlate branch-specific alarms with endpoint lock-loss events.
Fix: Use per-branch monitors and correlation logic. If a timing platform is used, prefer variants with readable status/alarms that can be attributed to input/output domains (examples to look up: AD9528, HMC7044, Si5341; verify exact status granularity in datasheets).
Pass criteria: Fault localization points to the correct branch in ≥ X% of injected/field-reproduced cases, with unambiguous cause_code mapping.