Clock Monitor / Missing-Pulse: Alarms & Auto Switchover
Clock monitoring is not about measuring “more accurately”—it is about keeping systems safe: detect missing or out-of-spec clocks early, raise the right alarms, and trigger the right protection or switchover action before endpoints fail.
This page shows how to define observables, thresholds, debounce/persistence, alarm routing, and verification criteria so monitoring becomes a reliable “Detect → Alarm → Protect → Switch” control loop.
What problem this page solves: “Detect → Alarm → Protect → Switch”
Clock monitoring is a reliability loop, not a “more accurate measurement” exercise. The objective is to detect clock health violations early enough to trigger the correct action (alarm, degrade, protect, switchover), while minimizing false alarms and avoiding self-inflicted instability.
Closed-loop questions
- Clock present?
- In-spec (within window)?
- Stable long enough (persistence)?
- Which alarm action?
- Switch/degrade/protect?
- When and how to recover?
Two core detector families
The two families used throughout this page are a missing-pulse watchdog (an edge-fed timeout that checks presence) and a frequency-window counter (a gate-time or period measurement that checks offset). The watchdog catches short dropouts; the counter catches drift and out-of-window behavior.
Engineering outputs (what “good” looks like)
- Flags: present / in-spec / suspect / fault
- Numbers: offset (ppm/%), event counters, timestamps
- State: persistence + hysteresis + clear policy
- Actions: alarm level, latch, switchover request, hold-off
Failure modes & false alarms: what is being prevented
This section constrains the problem space. Each failure class maps to the primary observable, the most common false-alarm mechanism, and the first diagnostic probe. The same mapping becomes the backbone for thresholding, debounce/persistence, and switchover policies later in this page.
Why false alarms happen (mechanisms, not symptoms)
Typical mechanisms, each addressed later on this page: measurement gates too short relative to edge jitter, thresholds without hysteresis, unqualified edges feeding the watchdog, and tap or probe loading that distorts the waveform the monitor actually sees.
What to measure: presence, frequency offset, pulse integrity
Monitoring should focus on observables that directly drive reliable actions. This page uses three: presence (liveness), frequency offset (in-window), and pulse integrity (min width / duty). Topics like jitter or phase noise only matter here as sources of measurement uncertainty and are handled elsewhere.
Detection architectures: counter, gate-time, window compare
The same observables can be estimated in multiple ways. Architecture selection is primarily a trade between latency (how fast an actionable decision is made) and robustness (how stable the decision is under noise, transients, and edge artifacts). The blocks below use a consistent pipeline: qualify edges → measure → compare to windows → apply hysteresis/persistence → drive alarms/actions.
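The measure → compare stage of this pipeline can be sketched as a gate-time counter classified against warn/fault windows. This is a minimal illustration, assuming example values for f_nom, gate time, and the ppm limits; hysteresis and persistence (the later pipeline stages) would wrap this classifier rather than live inside it.

```python
# Gate-time estimate of frequency offset, classified against warn/fault
# windows. f_nom, gate time, and ppm limits are illustrative assumptions.

def freq_state(count: int, gate_s: float, f_nom: float,
               warn_ppm: float, fault_ppm: float) -> str:
    """Classify one qualified-edge count over a gate interval."""
    f_meas = count / gate_s                      # measured frequency
    offset_ppm = (f_meas - f_nom) / f_nom * 1e6  # signed offset in ppm
    if abs(offset_ppm) > fault_ppm:
        return "FAULT"
    if abs(offset_ppm) > warn_ppm:
        return "WARN"
    return "OK"

# 10 MHz nominal, 100 ms gate: 1,000,000 edges is exactly nominal.
print(freq_state(1_000_000, 0.1, 10e6, 50, 100))  # OK
print(freq_state(1_000_080, 0.1, 10e6, 50, 100))  # +80 ppm -> WARN
```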
Windowing & guardband: thresholds that do not backfire
A reliable monitor does not start from a “nice-looking number”. It starts from a budget that can be reviewed: spec window (what endpoints require) + drift window (temperature/aging/supply effects) + measurement window (estimator error) + a controlled margin (unmodeled tail + manufacturing spread). This section turns that stack into actionable parameters for frequency, missing-pulse timeout, and pulse integrity.
- Nominal: f_nom and the reference timebase used by the monitor.
- Windows: warn ±W1, fault ±W2, clear ±Wc (Wc < W1 to avoid chatter).
- Estimator: gate time or period-avg count N, plus outlier reject policy.
- Persistence: N_fail to assert, N_ok to clear.
- Slowest period: use the minimum allowed frequency (worst-case drift).
- Blanking tolerance: maximum edge gap the system can tolerate without unsafe behavior.
- Timeout: period_max × N_miss + hold-off allowance.
- Edge qualify: ensure only valid edges can feed the watchdog.
- Receiver min: min-high/min-low from endpoint timing acceptance.
- Distortion: allocate for buffer + routing + termination effects.
- Measurement: include qualification uncertainty (glitch reject threshold).
- Guardband: add margin to cover manufacturing tails and aging.
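The budget stacks above can be turned into numbers with two small helpers: one for the missing-pulse timeout and one for a WARN window sitting inside the spec limit by the drift + measurement + margin stack. All values here are illustrative assumptions, not from any endpoint specification.

```python
# Budget-to-parameter sketch; limits are illustrative, not from a datasheet.

def missing_pulse_timeout_us(f_min_hz: float, n_miss: int,
                             holdoff_us: float) -> float:
    """timeout = period_max * N_miss + hold-off allowance."""
    period_max_us = 1e6 / f_min_hz   # slowest allowed period (worst-case drift)
    return period_max_us * n_miss + holdoff_us

def warn_window_ppm(spec_ppm: float, drift_ppm: float,
                    meas_ppm: float, margin_ppm: float) -> float:
    """WARN sits inside the spec limit by the drift + measurement + margin
    stack, so expected excursions do not cross it."""
    return spec_ppm - (drift_ppm + meas_ppm + margin_ppm)

print(missing_pulse_timeout_us(1e6, 3, 0.5))  # 3.5 (us)
print(warn_window_ppm(100, 30, 10, 10))       # 50 (ppm)
```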
Debounce & persistence: stable alarms, clean recovery
Most “false alarms” are not caused by the clock; they are caused by alarm logic that treats momentary outliers as faults. Debounce and persistence convert noisy measurements into action-grade events by controlling when to assert, when to clear, and whether to latch critical states.
- Time-based: require N consecutive measurement windows to fail.
- Event-based: require N consecutive failure events (edge gaps, width violations).
- Combine carefully to avoid delaying detection beyond system tolerance.
- Assert: fail_count ≥ N_fail (or fail_time ≥ T_fail).
- Clear: ok_count ≥ N_clear (or ok_time ≥ T_clear).
- Common pattern: N_clear > N_fail to prevent chatter after recovery.
- Latch faults that triggered protection or switchover.
- Auto-clear warnings that are only informative and do not drive actions.
- Latched alarms should require explicit clear criteria and logging.
- SUSPECT allows increased observation and logging without immediate disruptive actions.
- ALARM is reserved for persistent failures that justify protection or switchover.
- This separation prevents “alarm storms” from transient disturbances.
- After a switch or reset action, apply a short hold-off to ignore known transients.
- Hold-off should be scoped (e.g., presence only) and time-bounded with re-evaluation.
- Always log cause codes so “recovered” does not mean “forgotten”.
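The assert/clear rules above (N_fail consecutive failures to assert, N_clear consecutive passes to clear, optional latch) fit in a small state machine. This is a sketch with illustrative names; a real monitor would also carry cause codes and timestamps.

```python
# Debounce/persistence sketch: N_fail consecutive failures assert the
# alarm; N_clear consecutive passes clear it unless latched.

class Debounce:
    def __init__(self, n_fail: int, n_clear: int, latch: bool = False):
        self.n_fail, self.n_clear, self.latch = n_fail, n_clear, latch
        self.fail_count = self.ok_count = 0
        self.alarm = False

    def update(self, in_window: bool) -> bool:
        if in_window:
            self.ok_count += 1
            self.fail_count = 0          # any pass resets the failure run
            if self.alarm and not self.latch and self.ok_count >= self.n_clear:
                self.alarm = False       # clear only after sustained recovery
        else:
            self.fail_count += 1
            self.ok_count = 0
            if self.fail_count >= self.n_fail:
                self.alarm = True
        return self.alarm

d = Debounce(n_fail=3, n_clear=5)
seq = [False, False, True, False, False, False]  # one outlier, then a real fault
print([d.update(ok) for ok in seq])  # asserts only after 3 consecutive fails
```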
Alarm grading & routing: from detection to system actions
An alarm becomes useful only when it is graded (severity is unambiguous), routed (delivered through the right channel with the right latency), and action-mapped (each level triggers a defined response). This section defines a practical INFO/WARN/FAULT ladder, the routing paths (IRQ/GPIO, I²C status, reset chain, telemetry), and the minimum context required for diagnosis and safe switchover.
- INFO: minor deviation; log & trend only.
- WARN: near boundary; prepare actions, increase observation.
- FAULT: out-of-window or missing-pulse; protection or switchover allowed.
- Cause code (timeout / freq high / freq low / duty / width).
- First/last seen timestamp and duration or counts.
- Snapshot: recent measurement value(s) and window parameters.
- IRQ/GPIO: lowest latency trigger for policy/interrupt logic.
- I²C / status: rich context (cause + snapshot) for diagnosis.
- Reset chain: hard protection for high-risk modes.
- Telemetry: remote visibility and trend-based maintenance.
- INFO: log only, increment counters.
- WARN: raise observation, notify supervisor, arm switch logic.
- FAULT: switchover or protection stop, then latch and record.
- Hysteresis: clear window tighter than warn window.
- Rate limit: cap repeated actions over a time span.
- Hold-off: suppress known transients after switching.
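The rate-limit bullet above can be sketched as a rolling-window cap on repeated actions; when the cap is reached, the policy should escalate instead of acting again. Limits and names are illustrative.

```python
# Rolling-window rate limiter: at most max_actions within span_s seconds.

from collections import deque

class RateLimiter:
    def __init__(self, max_actions: int, span_s: float):
        self.max_actions, self.span_s = max_actions, span_s
        self.times: deque = deque()

    def allow(self, now_s: float) -> bool:
        while self.times and now_s - self.times[0] > self.span_s:
            self.times.popleft()          # drop actions outside the span
        if len(self.times) < self.max_actions:
            self.times.append(now_s)
            return True
        return False                      # cap reached: escalate, don't act

rl = RateLimiter(max_actions=2, span_s=60.0)
print([rl.allow(t) for t in (0.0, 10.0, 20.0, 75.0)])  # [True, True, False, True]
```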
Automatic switchover: hitless vs bounded-glitch strategies
“Automatic switchover” is safe only when it is qualified by persistence and bounded by explicit post-check criteria. This section separates trigger signals (missing-pulse, offset, LOS/LOL), qualification (evidence + rate limits + hold-off), and execution (hitless or bounded-glitch), followed by a stability observation window and a conservative return-to-primary strategy.
- Hard: missing-pulse timeout, LOS.
- Soft: frequency out-of-window, LOL (requires stronger persistence).
- WARN can arm the switch path; FAULT can execute it.
- Persistence: N_fail/T_fail met (debounced).
- Evidence: cause code + snapshot captured.
- Action gate: not in hold-off; rate limit not exceeded.
- Define disturbance bounds: glitch width, relock time, error burst.
- Prepare endpoints with a brief protection posture (mode freeze, degrade, or notify supervisor).
- Require a stability observation window before declaring success.
- Observe for T_stable: presence OK, offset back in-window.
- Confirm endpoint status: lock restored and error counters stop increasing.
- If stability fails, escalate to a latched fault or protection stop.
- Return is riskier than the first switch; intermittent failures are common.
- Prefer manual confirmation or stricter stability windows and rate limiting.
- Use anti-ping-pong: repeated switching should force an operator-visible latch.
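The qualification and anti-ping-pong rules above reduce to a small gate: a FAULT may execute a switch only when persistence is met, no hold-off is active, and the switch count has not exceeded its cap, in which case an operator-visible latch takes over. A sketch, with illustrative names and limits:

```python
# Switchover qualification gate, per the trigger/qualification/anti-ping-pong
# rules above. States and limits are illustrative.

def qualify_switch(persistence_met: bool, in_holdoff: bool,
                   switches_in_window: int, max_switches: int) -> str:
    if not persistence_met or in_holdoff:
        return "HOLD"     # keep observing; evidence or hold-off not satisfied
    if switches_in_window >= max_switches:
        return "LATCH"    # anti-ping-pong: force operator-visible latch
    return "SWITCH"       # execute, then start the stability window

print(qualify_switch(True, False, 0, 2))   # SWITCH
print(qualify_switch(True, False, 2, 2))   # LATCH
print(qualify_switch(False, False, 0, 2))  # HOLD
```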
Implementation hooks: board-level observation points that do not disturb clocks
A clock monitor becomes reliable only when its tap points are chosen for diagnosis, its input path is electrically harmless, and its reference strategy avoids common-mode failure. This section defines practical tap locations along the clock tree, isolation and level-compatibility rules, and multi-channel configuration patterns that keep alarms actionable.
- Meaning: verifies the source is present and near nominal.
- Best for: source stop, gross offset, duty collapse at origin.
- Risk: does not reveal downstream distribution failures.
- Meaning: validates the “conditioning stage” output.
- Best for: lock-loss, unlock chatter, offset after conditioning.
- Risk: still may not match what endpoints actually receive.
- Meaning: isolates issues between conditioner and distribution.
- Best for: link integrity before branching to multiple domains.
- Risk: cannot capture per-endpoint impairments.
- Meaning: observes what the endpoint truly sees.
- Best for: missing pulses after routing, local distortion, edge loss.
- Risk: highest chance of interaction; requires careful isolation.
- Isolation: buffer or high-impedance sampling path.
- Level OK: monitor thresholds and common-mode must match the standard.
- No “heavy filtering”: avoid masking narrow pulses by analog smoothing.
- Independent ref: a separate timebase for diagnosis.
- Cross-check: compare channels/domains for consistency.
- Per-channel thresholds: independent freq window / timeout / width limits.
- Per-channel cause: alarms must include channel/domain identifiers.
- Aggregated policy: channel → domain → system escalation (avoid one-line “kill switch”).
- Maintainability: expose minimal test points and keep parameters configurable.
Verification & production test: proving reliability with executable checks
Reliability is demonstrated by fault injection, measurable coverage targets (false positives, false negatives), and bounded timing (detection and recovery latency). The same logic must be verifiable on a production line with minimal equipment by testing the fastest alarm path (IRQ/GPIO) and reading cause codes via registers for diagnosis.
- Clock-off / gate-off to force missing-pulse.
- Periodic blank windows to validate debounce and persistence.
- Expected output: timeout cause code + bounded T_detect.
- Programmed offset (±ppm or ±%) around the window edges.
- Step changes across warn/fault boundaries to validate hysteresis.
- Expected output: freq-high / freq-low cause code + stable clear behavior.
- Duty distortion (high/low asymmetry).
- Minimum high/low width squeeze.
- Narrow pulse / glitch insertion to validate integrity detection (not analog masking).
- False positive: no FAULT in nominal conditions.
- False negative: injected faults must be caught.
- Latency: detection and recovery must be bounded.
- Use injection edge as t0 (gate control or offset step).
- Use IRQ/GPIO assertion as t1.
- T_detect = t1 − t0; recovery uses clear edge similarly.
- Prefer fastest path validation (IRQ/GPIO) for timing.
- Read registers for cause codes and counters (diagnosis).
- Use minimal injectors: brief gate-off + known offset mode + simple duty toggle.
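The latency check above (T_detect = t1 − t0 against a bound) is trivial but worth encoding so the production-line script asserts it rather than eyeballing it. Timestamps here are illustrative captured values, not real hardware reads.

```python
# T_detect check: t0 = injection edge, t1 = IRQ/GPIO assertion.

def t_detect_ok(t0_us: float, t1_us: float, bound_us: float) -> bool:
    """True when detection latency is non-negative and within the bound."""
    t_detect = t1_us - t0_us
    return 0.0 <= t_detect <= bound_us

print(t_detect_ok(100.0, 450.0, 500.0))  # True: detected in 350 us
print(t_detect_ok(100.0, 700.0, 500.0))  # False: 600 us exceeds the bound
```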
Field monitoring & health metrics: define logs that actually help
Clock monitoring becomes system health only when the logs can answer three questions fast: (1) slow drift vs sudden step, (2) risk trending up, (3) which channel/domain/tap point to investigate first. Keep this device-side and actionable—avoid turning it into a “big data platform” topic.
A) Minimum useful dataset (per channel, per domain)
Avoid “one counter forever.” Use fixed aggregation windows (e.g., 1 min / 10 min / 1 h) and always log window validity so missing samples cannot masquerade as stability.
B) Drift vs step: a simple, reliable triage rule
Field diagnosis fails when slow thermal/aging drift is treated like intermittent disconnection (or vice versa). Use minimal rules that firmware can implement deterministically.
- Offset trend slope keeps the same sign for N windows (e.g., N=6 at 10-min windows).
- WARN-level near-boundary alarms dominate; durations skew long.
- Offset correlates with temperature or predictable load cycles.
- One-window delta exceeds a fixed threshold (e.g., Δoffset > X ppm) and alarms cluster tightly in time.
- Missing-pulse or width/duty faults appear briefly (short duration bins).
- Switchover count spikes; return-to-primary becomes unstable.
Implementation note: compute slope from window medians (robust to outliers) and compute burst from alarm timestamps. Keep thresholds fixed and reviewed—do not “learn” the definition of a fault.
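The triage rules above can be implemented deterministically from window medians: a step is a one-window delta over a fixed threshold, and drift is a run of same-sign deltas at least N windows long. A sketch, assuming illustrative median values and thresholds:

```python
# Drift vs step triage from per-window medians (robust to outliers).
# Thresholds are fixed and reviewed, per the note above.

def classify(window_medians_ppm, n_drift: int, step_ppm: float) -> str:
    deltas = [b - a for a, b in zip(window_medians_ppm, window_medians_ppm[1:])]
    if any(abs(d) > step_ppm for d in deltas):
        return "STEP"                      # sudden one-window jump
    run = best = 0
    last_sign = 0
    for d in deltas:                       # longest same-sign run = drift
        sign = (d > 0) - (d < 0)
        run = run + 1 if sign and sign == last_sign else (1 if sign else 0)
        last_sign = sign
        best = max(best, run)
    return "DRIFT" if best >= n_drift else "STABLE"

print(classify([0, 1, 2, 3, 4, 5, 6], n_drift=5, step_ppm=10))  # DRIFT
print(classify([0, 0, 25, 25, 25], n_drift=5, step_ppm=10))     # STEP
```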
C) Fixed vs adaptive thresholds (safe self-tuning without risk)
- Missing-pulse timeout (presence watchdog).
- Hard frequency fault window (true out-of-spec).
- Minimum high/low width for pulse integrity (prevents silent edge loss).
- “Near-boundary” WARN level (early-warning band inside the spec limit).
- Baseline offset center (to detect abnormal deviation from normal behavior).
- Train only during known-good periods (no alarms, valid samples, normal temp/rails).
- Adaptive WARN must remain inside the fixed FAULT window and may only tighten, not relax.
- FAULT thresholds and missing-pulse logic never change dynamically.
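The "tighten only, stay inside FAULT" rule above can be enforced with a clamp: the learned WARN band never drops below a sanity floor and never approaches the fixed FAULT window. The 80% ceiling and the idea of a `learned_ppm` input from known-good training windows are illustrative assumptions.

```python
# Safe adaptive WARN: may only tighten, always inside the fixed FAULT window.

def adaptive_warn_ppm(learned_ppm: float, warn_floor_ppm: float,
                      fault_ppm: float) -> float:
    warn = max(learned_ppm, warn_floor_ppm)  # never collapse below a floor
    return min(warn, fault_ppm * 0.8)        # never approach FAULT (80% here)

print(adaptive_warn_ppm(30.0, 20.0, 100.0))  # 30.0
print(adaptive_warn_ppm(95.0, 20.0, 100.0))  # clamped to 80.0
```

FAULT thresholds and the missing-pulse timeout are never inputs to this function; they stay fixed.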
D) Event schema for traceability (post-mortem friendly)
A good event record is small, structured, and searchable. It must explain what happened, where, and what the system did—without requiring oscilloscope access.
- timestamp, channel_id, domain_id, tap_point
- alarm_level (INFO/WARN/FAULT), cause_code
- first_seen, last_seen, duration_ms
- offset_ppm (or %), min_width_ns (if measured), missing_timeout_us
- action_taken (log/degrade/switch/hold/reset), switched_to (if any)
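One possible shape for the record above is a frozen dataclass, so firmware and host tooling share a single schema. This sketch carries a representative subset of the fields listed (names are illustrative); duration is derived from first/last seen.

```python
# Structured, searchable event record for post-mortem analysis.

from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class ClockEvent:
    timestamp_ms: int
    channel_id: int
    domain_id: int
    tap_point: str        # e.g. "post_buffer", "endpoint"
    alarm_level: str      # INFO / WARN / FAULT
    cause_code: str       # timeout / freq_high / freq_low / duty / width
    first_seen_ms: int
    last_seen_ms: int
    offset_ppm: float
    action_taken: str     # log / degrade / switch / hold / reset

ev = ClockEvent(1000, 2, 0, "endpoint", "FAULT", "timeout",
                990, 1000, 0.0, "switch")
print(asdict(ev)["cause_code"])  # timeout
```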
E) A compact “health summary” output (ops + firmware friendly)
- p99 offset as a fraction of budget (e.g., p99 / fault_limit).
- alarm rate (count/day) and top 2 cause codes.
- switch rate and median time_to_stable.
- trend flag: drift / step / burst / stable.
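The first two summary lines above can be computed directly from the aggregation windows; the percentile method (nearest-rank) and field names here are illustrative assumptions.

```python
# Health-summary sketch: p99 offset as a fraction of the FAULT budget,
# plus a simple alarm rate per day.

def p99(values):
    s = sorted(values)
    return s[min(len(s) - 1, int(0.99 * len(s)))]  # nearest-rank style

def health_summary(offsets_ppm, fault_limit_ppm, alarms, days):
    return {
        "p99_fraction": p99([abs(v) for v in offsets_ppm]) / fault_limit_ppm,
        "alarm_rate_per_day": len(alarms) / days,
    }

s = health_summary(list(range(100)), 200.0, ["timeout", "freq_high"], 2.0)
print(s)  # p99_fraction 0.495, alarm_rate_per_day 1.0
```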
Applications & IC selection notes (monitoring-focused)
This section stays strictly on clock monitoring: loss-of-signal/lock awareness, frequency window checking, pulse integrity screening, alarm interfaces, debounce/latch resources, and switchover hooks. It does not attempt to replace the dedicated pages for mux/crosspoint/fanout/cleaners.
A) Monitoring-first application patterns (same template, different priorities)
- Goal: detect LOS/LOL early to prevent link drop escalation.
- Must measure: presence + frequency window; add width/duty only if edge integrity is fragile.
- Output: fast IRQ/GPIO + readable cause code for firmware routing.
- Action: log → degrade → switch (avoid ping-pong via persistence rules).
- Goal: prevent silent misalignment by catching missing SYSREF/refclk conditions.
- Must measure: presence (SYSREF is the strictest) + refclk frequency window.
- Output: deterministic status (pins/registers) and latch support for “fault happened” evidence.
- Action: latch fault → controlled re-sync flow (keep switchover bounded).
- Goal: trend-based early warning and clear out-of-window escalation.
- Must measure: frequency statistics (p95/p99) + drift/step flags.
- Output: readable status + counters; periodic summaries are often more useful than raw edges.
- Action: alarm → degrade/switch (severity-driven).
- Goal: reliable switchover + audit-grade event records.
- Must measure: presence + duration distribution + switch behavior.
- Output: latched faults + explicit switch reason codes.
- Action: switch with strong anti-oscillation rules; return-to-primary is stricter than failover.
B) Selection dimensions (monitoring features only)
- Input type and thresholds (LVCMOS/LVDS/HCSL/LVPECL/CML), common-mode range.
- Frequency span + measurement method limits (gate-time/period/watchdog).
- What can be detected: presence, frequency window, width/duty integrity.
- Programmable windows (freq_hi/lo, timeout, width/duty) and per-channel config.
- Debounce/persistence resources (N windows / N events) and latch/auto-clear behavior.
- Alarm interfaces: GPIO/IRQ + readable cause/status registers.
- Robustness: temperature, supply noise tolerance, and clean status semantics.
- Observability: counters/statistics availability (or at least clear fault flags).
- Multi-channel scaling: per-channel thresholds + aggregated alarm routing.
C) Concrete reference part numbers (starting points; verify suffix/package/availability)
The list below is intentionally monitoring-driven (LOS/LOL, status/interrupts, redundant reference support, readable fault flags). Final selection must be validated against input standard, frequency plan, debounce needs, and system-level switchover requirements.
- Si5341 (example ordering: Si5341B-D-GM) — fault monitoring and status via serial interface; useful when monitoring is coupled with clock generation.
- Si5345 / Si5344 / Si5342 — family provides LOL/LOS indicators and fault handling suited to robust reference monitoring.
- Si5391 — fault indicators for LOS/LOL; common in high-performance timing trees where fault status must be readable.
- Si5382 — reference monitoring features including LOS/invalid frequency awareness (fit when multi-input monitoring is needed).
- LMK03328 (Texas Instruments) — exposes loss-of-input-signal and loss-of-lock interrupt sources; useful when fault flags must feed an MCU/FPGA alarm router.
- 8T49N241 (Renesas / FemtoClock®NG) — monitors input clocks for loss-of-signal and can support redundant reference behavior.
- 8T49N283 (Renesas) — monitors inputs for LOS and supports hitless reference switching options (fit when automatic switchover is required inside the timing IC).
- AD9528 (Analog Devices) — provides status pins and status registers; commonly used where readable lock/fault status is needed in clocking subsystems.
- HMC7044 (Analog Devices) — clock generation/distribution platform used in converter clocking; integrate monitoring via readable status and controlled outputs.
- 8A34001 (Renesas ClockMatrix / SMU) — system timing/synchronization device where reference monitoring and timing paths are managed in a structured way.
- ZL30731–ZL30735 (Microchip) — network synchronizers with frequency measurement/monitoring hooks; useful when field health and reference monitoring are requirements.
FAQs (monitoring & missing-pulse)
Short, actionable troubleshooting answers that stay within this page boundary (presence / frequency window / pulse integrity / alarms / switchover / observability). Each answer uses the same 4-line, data-like structure for fast execution and verification.
1) Missing-pulse alarm triggers randomly, but the clock looks fine on scope—why?
Likely cause: Gate/window too short, so edge jitter/glitches are interpreted as “missing” inside the monitor logic.
Quick check: Increase gate time ×10 and enable event timestamps; correlate alarms vs temperature and supply (same timebase window).
Fix: Add persistence (N consecutive fails) + input conditioning at the tap (e.g., Schmitt buffer SN74LVC1G17 / NC7SZ17 for LVCMOS taps; for differential clocks, use a dedicated buffer/fanout output as the monitor input source).
Pass criteria: False-alarm rate < X/hour across worst-case V/T/noise for the qualification duration (e.g., 24 h), with valid_samples coverage > Y%.
2) Frequency-offset alarm never triggers even when a slow drift is forced—what’s first?
Likely cause: Wrong nominal reference (divider/PLL ratio mismatch) or wrong units (ppm vs %), so the programmed window is not the intended window.
Quick check: Read back nominal and thresholds from registers; compare with measured counts/period over the same gate interval (no mixed units).
Fix: Establish one “source of truth” for nominal (single configuration owner) + sanity clamps (min/max window). If a timing device is used for configuration centralization, ensure nominal is readable (e.g., Si5341 or LMK03328 as examples to look up; verify exact ordering code/features in datasheets).
Pass criteria: Alarm asserts within T_detect after injected offset ≥ threshold, and clears only after the defined recovery persistence window.
3) Alarm chatters near threshold—how to stop it without masking real faults?
Likely cause: No hysteresis and/or no separate enter/exit conditions, so boundary crossings toggle the state machine.
Quick check: Histogram measured offset around the boundary and count bidirectional crossings per window; confirm toggle rate vs persistence settings.
Fix: Add hysteresis (enter/exit thresholds) + recovery persistence. For LVCMOS taps, Schmitt conditioning (e.g., SN74LVC1G17) reduces boundary chatter caused by slow edges.
Pass criteria: Alarm toggles ≤ 1 time during steady operation near the boundary (defined test condition), while FAULT detection remains within T_detect.
4) Switchover happens, but endpoints still lose lock—what should be verified first?
Likely cause: Switch disturbance exceeds endpoint tolerance and/or the post-switch stability window is too short.
Quick check: Align endpoint lock-loss indication (pin/CSR) with switch-control timing; measure “time-to-stable” vs endpoint re-lock time.
Fix: If hitless behavior is required, use a dedicated hitless/controlled-reference switching approach (examples to look up for integrated timing platforms: Si5345, Si5341, 8T49N283; verify exact capabilities/order codes). Otherwise, widen the post-switch settle window and delay endpoint-dependent actions until stability qualifies.
Pass criteria: Endpoint re-lock < T_relock, and no repeated switching within T_guard after a successful switch.
5) After switching to backup, the system keeps switching back and forth—why?
Likely cause: Auto-revert is enabled without stable qualification, and the backup source is also marginal (or both share a common-mode issue).
Quick check: Log both sources’ health metrics (offset p95/p99, alarm duration bins, switch reasons) and verify the revert condition definition (enter/exit thresholds + persistence).
Fix: Require “stable-on-backup for T_hold” before revert; prefer manual revert for high-risk endpoints. If integrated reference switching is used, ensure revert rules are configurable/readable (example platforms to look up: 8T49N283, Si5345; verify datasheets).
Pass criteria: Max switch count ≤ N per day under worst-case conditions, with zero ping-pong sequences shorter than T_pp_min.
6) I²C reads show “mixed” status counters—what’s the first fix?
Likely cause: Non-atomic multi-byte reads during rollover (counter changes mid-transaction), producing inconsistent snapshots.
Quick check: Read twice and compare; if mismatch rate is non-zero, rollover is occurring during reads. Check whether a latch-on-read or snapshot mechanism exists.
Fix: Implement atomic snapshot in firmware (latch, read, then clear/ack) or use device-supported “freeze counters”/shadow registers when available.
Pass criteria: Repeated reads match within ≤ 2 attempts, and snapshot age < T_snap_max.
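The read-twice check from the fix above can be sketched in firmware-style pseudologic: read the counter block twice and accept only when both snapshots match, with bounded retries. `read_block` is a placeholder for the real I²C register read, not an actual driver API.

```python
# Atomic-snapshot sketch: accept a counter block only when two
# consecutive reads agree (no mid-read rollover).

def stable_snapshot(read_block, max_tries: int = 3):
    prev = read_block()
    for _ in range(max_tries - 1):
        cur = read_block()
        if cur == prev:
            return cur        # consistent snapshot
        prev = cur            # rollover occurred mid-read; retry
    return None               # still inconsistent: treat as read failure

# Simulated device whose counter rolls over during the first read pair.
vals = iter([{"cnt": 255}, {"cnt": 0}, {"cnt": 0}])
print(stable_snapshot(lambda: next(vals)))  # {'cnt': 0}
```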
7) Why does adding a probe change alarm behavior?
Likely cause: The monitoring tap is not isolated; probe capacitance loads the node, distorting edges or creating reflections that shift pulse integrity.
Quick check: Compare with a high-impedance active probe; measure edge rate and swing at the tap with/without the probe (same ground strategy).
Fix: Buffer/isolate the tap and feed the monitor from a dedicated buffer output (example LVCMOS buffer to look up: LMK1C1104; verify I/O standard and operating range). For CMOS taps, Schmitt conditioning (e.g., SN74LVC1G17) may reduce sensitivity to slow/loaded edges.
Pass criteria: Alarm behavior unchanged with/without probe (no new alarms), and measured edge rate change < Δ_edge_max under the defined test setup.
8) Loss-of-lock alarms appear only at high temperature—what to suspect first?
Likely cause: Guardband is too tight vs drift/aging/temperature, and/or the input buffer switching point shifts with temperature and supply noise.
Quick check: Log offset vs temperature and compare against the guardband stack (spec + drift + measurement + margin); confirm whether alarms align with threshold occupancy.
Fix: Widen window using drift budget (do not relax missing-pulse protection), improve tap conditioning, and remove borderline edge shapes (e.g., Schmitt buffer NC7SZ17 / SN74LVC1G17 for suitable CMOS taps; ensure monitor input standard matches the source).
Pass criteria: No alarms across the required temperature range when expected drift + margin is applied, and p99 offset remains ≤ K% of the FAULT limit.
9) Detection delay is met, but short dropouts are still missed—how?
Likely cause: Persistence/gate time is longer than the dropout width; the dropout ends before the next evaluation point.
Quick check: Inject controlled dropout widths and sweep from short to long; map detection probability vs dropout width and evaluation method.
Fix: Run a missing-pulse watchdog (edge-fed timeout) in parallel with a frequency-window counter; use the watchdog for short dropouts and the counter for drift/out-of-window behavior.
Pass criteria: Detect dropout ≥ T_drop_min with probability ≥ P_min, and assert alarm within ≤ T_detect_short.
10) Monitoring after the cleaner misses faults that endpoints see—why?
Likely cause: The fault occurs downstream (fanout output, connector, branch routing, endpoint termination), so upstream monitoring never observes the degraded waveform.
Quick check: Move the monitoring tap closer to the endpoint or monitor multiple tree points; correlate branch-specific alarms with endpoint lock-loss events.
Fix: Use per-branch monitors and correlation logic. If a timing platform is used, prefer variants with readable status/alarms that can be attributed to input/output domains (examples to look up: AD9528, HMC7044, Si5341; verify exact status granularity in datasheets).
Pass criteria: Fault localization points to the correct branch in ≥ X% of injected/field-reproduced cases, with unambiguous cause_code mapping.