
Clock Monitor / Missing-Pulse: Alarms & Auto Switchover


Clock monitoring is not about measuring “more accurately”—it is about keeping systems safe: detect missing or out-of-spec clocks early, raise the right alarms, and trigger the right protection or switchover action before endpoints fail.

This page shows how to define observables, thresholds, debounce/persistence, alarm routing, and verification criteria so monitoring becomes a reliable “Detect → Alarm → Protect → Switch” control loop.

What problem this page solves: “Detect → Alarm → Protect → Switch”

Clock monitoring is a reliability loop, not a “more accurate measurement” exercise. The objective is to detect clock health violations early enough to trigger the correct action (alarm, degrade, protect, switchover), while minimizing false alarms and avoiding self-inflicted instability.

  • False positives (unnecessary actions)
  • False negatives (silent failures)
  • Detection latency (time-to-action)
  • Recovery semantics (clear vs latch)

Closed-loop questions

  • Clock present?
  • In-spec (within window)?
  • Stable long enough (persistence)?
  • Which alarm action?
  • Switch/degrade/protect?
  • When and how to recover?

Two core detector families

Missing-pulse (presence / liveness)
Detects stop/stuck/disconnect quickly. Often drives the fast path to prevent immediate endpoint collapse.
Frequency offset (in-window / in-spec)
Detects drift, wrong division, or loss-of-lock behavior. Often drives the graded path (warn → fault) with persistence.

Engineering outputs (what “good” looks like)

  • Flags: present / in-spec / suspect / fault
  • Numbers: offset (ppm/%), event counters, timestamps
  • State: persistence + hysteresis + clear policy
  • Actions: alarm level, latch, switchover request, hold-off
Diagram — Monitoring closed loop (Detect → Alarm → Protect → Switch): the clock source feeds CLK_IN to the monitor (presence / offset / integrity), which drives the alarm router (level, latch, clear), the switch/mux (A ↔ B, hold-off, reason), the endpoint (PHY / ADC / FPGA), and the logger/telemetry (timestamps, counters, reason codes, field health metrics).
Monitoring is a closed-loop pipeline: observe (presence/offset/integrity), decide (grading + persistence), act (protect/switch), and record (reason codes + timestamps).
Takeaway
A good clock monitor produces actionable outputs (flags, numbers, state, reason codes), not just measurements. Parallel detection (missing-pulse + frequency window + integrity) improves coverage and reduces “unknown unknowns”.

Failure modes & false alarms: what is being prevented

This section constrains the problem space. Each failure class maps to the primary observable, the most common false-alarm mechanism, and the first diagnostic probe. The same mapping becomes the backbone for thresholding, debounce/persistence, and switchover policies later in this page.

A) Stop / Stuck
Signature: edges stop, clock tree opens, or a domain freezes.
Primary observable: missing-pulse timeout (watchdog).
Common false positives: overly short timeout, glitch filtering that drops valid edges.
First probe: correlate alarm timestamp with endpoint LOL/LOS + supply droops.
B) Out-of-spec frequency
Signature: drift, wrong divider ratio, or loss-of-lock behavior.
Primary observable: frequency window (ppm/%), optionally graded (warn/fault).
Common false positives: nominal mismatch, unit mistakes, insufficient gate time.
First probe: read back programmed nominal/threshold + log offset histogram.
C) Pulse integrity issues
Signature: narrow pulses, duty distortion, occasional dropped edges.
Primary observable: min-high/min-low width, duty window, glitch reject.
Common false positives: probe loading, threshold noise, bandwidth-limited taps.
First probe: compare a high-Z active probe vs passive probe at the monitor tap.
D) Amplitude / edge degradation
Signature: monitor “looks OK”, but endpoints lose lock downstream.
Primary observable: multi-point monitoring + endpoint status correlation.
Common false positives: single-point monitoring after the cleaner misses branch faults.
First probe: move the monitoring tap closer to the failing endpoint branch.

Why false alarms happen (mechanisms, not symptoms)

1) Not enough statistics
Gate time too short → estimator noise near threshold → chatter.
2) Missing guardband
Spec + drift + measurement error not budgeted → thresholds sit on the cliff edge.
3) Self-inflicted switching
Switchover transient interpreted as a fault → repeated switching loops.
4) Threshold instability
Supply noise or buffer threshold jitter → extra/missed edges at the monitor input.
5) Common-mode blindness
Monitor uses the same bad reference as the clock under test → “both drift together”.
Diagram — Failure classes mapped to primary observables: stop/stuck → presence (missing-pulse), out-of-spec frequency → frequency window (±ppm / ±%), pulse integrity → minimum width; secondary links provide cross-checks.
The mapping prevents scope creep: each failure class has a primary observable (and optional secondary signals) that drive thresholds and persistence policies.
Takeaway
A robust monitor starts by classifying failures (stop/stuck, out-of-spec frequency, integrity degradation, downstream branch issues) and assigning each class a primary observable. This enables thresholds, debounce, and switchover logic to be engineered as a controlled system rather than an ad-hoc collection of alarms.

What to measure: presence, frequency offset, pulse integrity

Monitoring should focus on observables that directly drive reliable actions. This page uses three: presence (liveness), frequency offset (in-window), and pulse integrity (min width / duty). Topics like jitter or phase noise only matter here as sources of measurement uncertainty and are handled elsewhere.

Presence (liveness)
Edge activity continues and a watchdog is fed within a timeout window.
Catches: stop/stuck, disconnect, broken clock tree branches.
Key knobs: timeout, valid-edge qualification, persistence.
Frequency offset (window)
Offset from nominal frequency compared against ±ppm or ±% thresholds.
Catches: drift, wrong division, loss-of-lock behavior.
Key knobs: gate time, guardband budget, graded warn/fault windows.
Pulse integrity
Minimum high/low widths and duty window ensure edges are reliably counted and interpreted.
Catches: narrow pulses, duty distortion, edge dropouts that average frequency can hide.
Key knobs: min width thresholds, glitch reject, hysteresis + persistence.
Diagram — Three observables on a common time base: presence (STOP → timeout → fault), frequency offset (faster/slower against a ±ppm window), and pulse integrity (minimum width, duty, glitch reject).
Presence detects “no edges”. Frequency offset detects “edges exist but drift out of window”. Pulse integrity prevents miscount and misinterpretation caused by narrow pulses or duty distortion.
Takeaway
Using these three observables keeps monitoring actionable and bounded. Presence gives the fastest protection, frequency offset provides spec compliance tracking, and integrity prevents “clean numbers” hiding broken edge quality.

Detection architectures: counter, gate-time, window compare

The same observables can be estimated in multiple ways. Architecture selection is primarily a trade between latency (how fast an actionable decision is made) and robustness (how stable the decision is under noise, transients, and edge artifacts). The blocks below use a consistent pipeline: qualify edges → measure → compare to windows → apply hysteresis/persistence → drive alarms/actions.

Gate-time counter
Count edges within a fixed gate time to estimate frequency. Longer gates reduce estimator noise but slow response.
Best for: stable offset monitoring on mid/high-frequency clocks.
Reciprocal / period measure
Measure time between edges (or average over multiple periods). Sensitive at low frequency but more exposed to edge artifacts.
Best for: low-frequency clocks and fast drift detection with averaging.
Missing-pulse watchdog
Feed a watchdog on each valid edge; timeout indicates loss of clock activity. Fastest protection path.
Best for: stop/stuck detection and rapid switchover triggers.
Window + hysteresis
Compare measured values to warn/fault windows and separate clear thresholds to prevent chatter near boundaries.
Best for: stable alarms under slow drift and noisy measurements.
Diagram — Detection architecture comparison (2×2 grid), all sharing the pipeline input → qualify → measure → compare → state → alarm: gate-time counter (slow but stable), period measurement (fast, suited to low frequencies), missing-pulse watchdog (fastest path → protect/switch), and window + hysteresis (stable alarm behavior).
Use a consistent pipeline and swap only the measurement core. Windows, hysteresis, and persistence convert noisy estimates into stable, action-grade alarms.
Takeaway
Gate-time counters deliver stable offsets with slower response, period measurement improves low-frequency sensitivity, watchdogs provide the fastest loss detection, and window+hysteresis ensures alarms do not chatter near thresholds.
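As a concrete illustration of the gate-time measurement core, a minimal Python sketch (the function name `offset_ppm` and the fixed-gate assumption are illustrative, not a device API):

```python
def offset_ppm(edge_count: int, gate_s: float, f_nom_hz: float) -> float:
    """Estimate frequency offset in ppm from a gate-time edge count.

    edge_count : qualified edges counted during the gate window
    gate_s     : gate time in seconds (longer gate -> less estimator noise,
                 but slower response, as the comparison above notes)
    f_nom_hz   : nominal frequency the monitor compares against
    """
    f_meas_hz = edge_count / gate_s
    return (f_meas_hz - f_nom_hz) / f_nom_hz * 1e6

# Example: 10 000 050 edges in a 1 s gate against a 10 MHz nominal -> +5 ppm
print(offset_ppm(10_000_050, 1.0, 10e6))
```

The same estimate then feeds the window/hysteresis stage; only the measurement core changes between architectures.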

Windowing & guardband: thresholds that do not backfire

A reliable monitor does not start from a “nice-looking number”. It starts from a budget that can be reviewed: spec window (what endpoints require) + drift window (temperature/aging/supply effects) + measurement window (estimator error) + a controlled margin (unmodeled tail + manufacturing spread). This section turns that stack into actionable parameters for frequency, missing-pulse timeout, and pulse integrity.

Budget stack
Total threshold window should be decomposed into Spec + Drift + Meas + Margin.
Margin is added last; it must not hide a missing measurement-error estimate.
Frequency window
Express offset as ±ppm for timebases/RTC and ±% or ±ppm for high-speed refs, then grade into warn/fault plus a clear window (hysteresis).
Missing-pulse timeout
Timeout should be derived from the slowest expected period, the maximum tolerable blanking, and the system’s acceptable detection latency.
Duty / width limits
Set min-high/min-low and duty windows from the receiver’s minimum timing requirements, then allocate distortion and measurement error before adding margin.
Diagram — Threshold budget stack: total window = Spec + Drift + Measurement error + Margin; the same stack generates the RTC window (ppm), the refclk window (ppm / %), and the derived warn/fault/clear windows.
The stack makes threshold ownership explicit: endpoint requirements define the spec portion, environment defines drift, the chosen estimator defines measurement error, and margin covers the tail.
How to parameterize frequency alarms
  • Nominal: f_nom and the reference timebase used by the monitor.
  • Windows: warn ±W1, fault ±W2, clear ±Wc (Wc < W1 to avoid chatter).
  • Estimator: gate time or period-avg count N, plus outlier reject policy.
  • Persistence: N_fail to assert, N_ok to clear.
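The parameterization above can be sketched as a small helper. The policy choices here (fault window = full budget stack, warn at the spec + drift boundary, clear at a fixed ratio below warn) are illustrative assumptions, not mandated values:

```python
def build_windows(spec_ppm: float, drift_ppm: float, meas_ppm: float,
                  margin_ppm: float, clear_ratio: float = 0.8):
    """Derive fault/warn/clear windows (ppm) from a reviewable budget stack.

    Assumed policy: fault = Spec + Drift + Meas + Margin, warn = Spec + Drift,
    clear = clear_ratio * warn (so Wc < W1, avoiding chatter near the edge).
    """
    fault = spec_ppm + drift_ppm + meas_ppm + margin_ppm
    warn = spec_ppm + drift_ppm
    clear = clear_ratio * warn
    return warn, fault, clear

def grade(offset_ppm: float, warn: float, fault: float) -> str:
    """Map |offset| to a severity level (persistence is applied elsewhere)."""
    mag = abs(offset_ppm)
    if mag >= fault:
        return "FAULT"
    if mag >= warn:
        return "WARN"
    return "OK"
```

Because every term is named, a reviewer can challenge the drift or measurement allocation instead of a single opaque threshold.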
How to derive missing-pulse timeout
  • Slowest period: use the minimum allowed frequency (worst-case drift).
  • Blanking tolerance: maximum edge gap the system can tolerate without unsafe behavior.
  • Timeout: timeout = period_max × N_miss + hold-off allowance.
  • Edge qualify: ensure only valid edges can feed the watchdog.
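The timeout rule above translates directly into code; `f_min_hz`, `n_miss`, and `holdoff_us` are illustrative parameter names:

```python
def missing_pulse_timeout_us(f_min_hz: float, n_miss: int,
                             holdoff_us: float) -> float:
    """timeout = period_max * N_miss + hold-off allowance (rule above).

    f_min_hz   : minimum allowed frequency (slowest period, worst-case drift)
    n_miss     : consecutive missing edges tolerated before declaring loss
    holdoff_us : allowance for known transients (e.g. switchover blanking)
    """
    period_max_us = 1e6 / f_min_hz
    return period_max_us * n_miss + holdoff_us

# Example: 1 MHz minimum frequency, 3 missed edges, 5 us hold-off -> 8 us
print(missing_pulse_timeout_us(1e6, 3, 5.0))
```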
How to set duty / width limits
  • Receiver min: min-high/min-low from endpoint timing acceptance.
  • Distortion: allocate for buffer + routing + termination effects.
  • Measurement: include qualification uncertainty (glitch reject threshold).
  • Guardband: add margin to cover manufacturing tails and aging.
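The same allocation order can be sketched for width limits; the function and argument names are illustrative:

```python
def width_limits_ns(rx_min_high_ns: float, rx_min_low_ns: float,
                    distortion_ns: float, meas_ns: float,
                    margin_ns: float) -> tuple:
    """Receiver-driven min-width thresholds, built in the order above:
    start from the endpoint's timing acceptance, then allocate distortion
    (buffer + routing + termination), measurement uncertainty, and guardband.
    Returns (min_high_ns, min_low_ns) monitor thresholds.
    """
    allocation = distortion_ns + meas_ns + margin_ns
    return rx_min_high_ns + allocation, rx_min_low_ns + allocation
```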
Takeaway
Thresholds should be built from a reviewable budget. Once the stack is defined, it can consistently generate warn/fault/clear windows for frequency, safe timeouts for missing-pulse detection, and receiver-driven limits for duty and minimum widths.

Debounce & persistence: stable alarms, clean recovery

Most “false alarms” are not caused by the clock; they are caused by alarm logic that treats momentary outliers as faults. Debounce and persistence convert noisy measurements into action-grade events by controlling when to assert, when to clear, and whether to latch critical states.

Two debounce axes
  • Time-based: require N consecutive measurement windows to fail.
  • Event-based: require N consecutive failure events (edge gaps, width violations).
  • Combine carefully to avoid delaying detection beyond system tolerance.
Asymmetry: assert vs clear
  • Assert: fail_count ≥ N_fail (or fail_time ≥ T_fail).
  • Clear: ok_count ≥ N_clear (or ok_time ≥ T_clear).
  • Common pattern: N_clear > N_fail to prevent chatter after recovery.
Latch vs auto-clear
  • Latch faults that triggered protection or switchover.
  • Auto-clear warnings that are only informative and do not drive actions.
  • Latched alarms should require explicit clear criteria and logging.
SUSPECT as a buffer state
  • SUSPECT allows increased observation and logging without immediate disruptive actions.
  • ALARM is reserved for persistent failures that justify protection or switchover.
  • This separation prevents “alarm storms” from transient disturbances.
Hold-off to avoid oscillation
  • After a switch or reset action, apply a short hold-off to ignore known transients.
  • Hold-off should be scoped (e.g., presence only) and time-bounded with re-evaluation.
  • Always log cause codes so “recovered” does not mean “forgotten”.
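The debounce rules above can be sketched as a small state machine; the class name, the counter policy, and the single-failure OK → SUSPECT transition are illustrative assumptions, not a mandated design:

```python
class AlarmDebounce:
    """OK -> SUSPECT -> ALARM with asymmetric assert/clear counts and latch.

    Assert ALARM after n_fail consecutive failing windows; clear only after
    n_clear consecutive good windows (typically n_clear > n_fail); latch
    ALARM when the fault drove a protective action.
    """
    def __init__(self, n_fail: int = 3, n_clear: int = 8,
                 latch_on_alarm: bool = True):
        self.n_fail, self.n_clear = n_fail, n_clear
        self.latch_on_alarm = latch_on_alarm
        self.state = "OK"
        self.fail_count = self.ok_count = 0
        self.latched = False

    def update(self, in_window: bool) -> str:
        if in_window:
            self.ok_count += 1
            self.fail_count = 0
        else:
            self.fail_count += 1
            self.ok_count = 0

        if self.state == "OK" and self.fail_count >= 1:
            self.state = "SUSPECT"            # buffer state: observe + log
        elif self.state == "SUSPECT":
            if self.fail_count >= self.n_fail:
                self.state = "ALARM"          # persistent failure: act
                self.latched = self.latch_on_alarm
            elif self.ok_count >= self.n_clear:
                self.state = "OK"
        elif self.state == "ALARM" and not self.latched:
            if self.ok_count >= self.n_clear:
                self.state = "OK"             # latched faults need explicit clear
        return self.state
```

Because SUSPECT absorbs single-window outliers, a transient disturbance produces a log entry rather than a switchover.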
Diagram — Alarm state machine: OK → SUSPECT (observe + log) on N_fail ≥ k1 or T_fail ≥ t1; SUSPECT → ALARM (protect/switch, optionally LATCH) on N_fail ≥ k2 or missing-pulse; RECOVER via hold-off plus N_ok ≥ k3 or T_ok ≥ t3. Graded states and asymmetric clear criteria prevent chatter and alarm storms.
Separate “assert” and “clear” criteria. A SUSPECT stage buffers transient faults; ALARM should be reserved for persistent failures and may be latched when protection or switchover occurred.
Takeaway
Debounce and persistence must be explicit and asymmetric: assert on sustained failure, clear only after sustained stability. Latching should be used when an action was taken, so that “recovered” never means “unobserved”.

Alarm grading & routing: from detection to system actions

An alarm becomes useful only when it is graded (severity is unambiguous), routed (delivered through the right channel with the right latency), and action-mapped (each level triggers a defined response). This section defines a practical INFO/WARN/FAULT ladder, the routing paths (IRQ/GPIO, I²C status, reset chain, telemetry), and the minimum context required for diagnosis and safe switchover.

Severity ladder
  • INFO: minor deviation; log & trend only.
  • WARN: near boundary; prepare actions, increase observation.
  • FAULT: out-of-window or missing-pulse; protection or switchover allowed.
Levels should be tied to the window and persistence outputs (warn/fault/clear + N/T).
Minimum context
A routed alarm should carry:
  • Cause code (timeout / freq high / freq low / duty / width).
  • First/last seen timestamp and duration or counts.
  • Snapshot: recent measurement value(s) and window parameters.
Routing channels
  • IRQ/GPIO: lowest latency trigger for policy/interrupt logic.
  • I²C / status: rich context (cause + snapshot) for diagnosis.
  • Reset chain: hard protection for high-risk modes.
  • Telemetry: remote visibility and trend-based maintenance.
Action mapping
  • INFO: log only, increment counters.
  • WARN: raise observation, notify supervisor, arm switch logic.
  • FAULT: switchover or protection stop, then latch and record.
Avoid disruptive actions without persistence qualification and rate limiting.
Storm prevention
  • Hysteresis: clear window tighter than warn window.
  • Rate limit: cap repeated actions over a time span.
  • Hold-off: suppress known transients after switching.
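A minimal sketch of the rate-limit and hold-off gates (class and parameter names are illustrative; a real implementation would read a monotonic hardware timer instead of taking `now` as an argument):

```python
class ActionGate:
    """Storm prevention: rate-limit disruptive actions and suppress
    re-triggers during a post-switch hold-off.

    Policy assumed here: at most max_actions within window_s, and no
    action at all while a hold-off started by the last action is active.
    """
    def __init__(self, max_actions: int = 2, window_s: float = 60.0,
                 holdoff_s: float = 5.0):
        self.max_actions = max_actions
        self.window_s = window_s
        self.holdoff_s = holdoff_s
        self.history = []           # timestamps of executed actions
        self.holdoff_until = 0.0

    def allow(self, now: float) -> bool:
        if now < self.holdoff_until:
            return False            # known transient after switching: suppress
        self.history = [t for t in self.history if now - t < self.window_s]
        return len(self.history) < self.max_actions

    def record(self, now: float):
        """Call after executing an action: starts the hold-off window."""
        self.history.append(now)
        self.holdoff_until = now + self.holdoff_s
```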
Diagram — Alarm routing: clock monitor (presence, frequency offset, pulse integrity) → alarm grader (INFO/WARN/FAULT with codes, counts, snapshot) → routing channels (IRQ/GPIO, I²C/registers, reset chain, telemetry) → MCU/FPGA policy (action mapping, rate limit, hold-off), logger (cause + snapshot), and switch control protecting the endpoints.
Routing is not just wiring: it must preserve enough context (cause, counts, snapshots) for policy decisions and post-incident diagnosis.
Takeaway
Use graded alarms and the right routing channels: fast lines (IRQ/GPIO) for action triggers, rich registers for diagnosis, and logging/telemetry for long-term health and accountability.

Automatic switchover: hitless vs bounded-glitch strategies

“Automatic switchover” is safe only when it is qualified by persistence and bounded by explicit post-check criteria. This section separates trigger signals (missing-pulse, offset, LOS/LOL), qualification (evidence + rate limits + hold-off), and execution (hitless or bounded-glitch), followed by a stability observation window and a conservative return-to-primary strategy.

Triggers
  • Hard: missing-pulse timeout, LOS.
  • Soft: frequency out-of-window, LOL (requires stronger persistence).
  • WARN can arm the switch path; FAULT can execute it.
Qualification locks
  • Persistence: N_fail/T_fail met (debounced).
  • Evidence: cause code + snapshot captured.
  • Action gate: not in hold-off; rate limit not exceeded.
Hitless switching
Hitless behavior requires dedicated switching resources and system-level conditions (alignment/validity checks). This page focuses on the policy and criteria; device-level implementation belongs to the glitch-free switch pages.
Even when switching is transparent, the event should be logged and post-checked.
Bounded-glitch switching
  • Define disturbance bounds: glitch width, relock time, error burst.
  • Prepare endpoints with a brief protection posture (mode freeze, degrade, or notify supervisor).
  • Require a stability observation window before declaring success.
Post-check window
  • Observe for T_stable: presence OK, offset back in-window.
  • Confirm endpoint status: lock restored and error counters stop increasing.
  • If stability fails, escalate to a latched fault or protection stop.
Return-to-primary
  • Return is riskier than the first switch; intermittent failures are common.
  • Prefer manual confirmation or stricter stability windows and rate limiting.
  • Use anti-ping-pong: repeated switching should force an operator-visible latch.
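The post-check window can be expressed as one explicit predicate; the sample representation and names below are illustrative:

```python
def post_check(samples, t_stable: int, endpoint_locked: bool,
               errors_delta: int) -> str:
    """Declare switchover success only after a stability observation window.

    samples         : list of (present, in_window) booleans collected
                      over the observation window (assumed representation)
    t_stable        : required number of observation samples
    endpoint_locked : endpoint reports lock restored
    errors_delta    : error-counter increase during the window
    Returns 'STABLE', or 'ESCALATE' (latched fault / protection stop).
    """
    ok = (len(samples) >= t_stable
          and all(present and in_win for present, in_win in samples)
          and endpoint_locked
          and errors_delta == 0)
    return "STABLE" if ok else "ESCALATE"
```

Making the pass/fail criteria a single function keeps the switchover auditable: the same predicate runs in verification and in the field.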
Diagram — Switchover timing: Detect → Switch → Hold-off → Stable windows plotted across four tracks (monitor status, alarm level, switch control, endpoint lock); the disturbance is bounded and a stability observation window is required before declaring success.
The timing windows make switchover auditable: detection latency, switch execution window, hold-off, and a stability observation window with explicit pass/fail criteria.
Takeaway
Automatic switchover should be persistence-qualified, disturbance-bounded, and post-checked with a stability window. Conservative return-to-primary and anti-ping-pong rules prevent oscillation and hidden intermittent faults.

Implementation hooks: board-level observation points that do not disturb clocks

A clock monitor becomes reliable only when its tap points are chosen for diagnosis, its input path is electrically harmless, and its reference strategy avoids common-mode failure. This section defines practical tap locations along the clock tree, isolation and level-compatibility rules, and multi-channel configuration patterns that keep alarms actionable.

Tap @ Source
  • Meaning: verifies the source is present and near nominal.
  • Best for: source stop, gross offset, duty collapse at origin.
  • Risk: does not reveal downstream distribution failures.
Tap @ Post-cleaner
  • Meaning: validates the “conditioning stage” output.
  • Best for: lock-loss, unlock chatter, offset after conditioning.
  • Risk: still may not match what endpoints actually receive.
Tap @ Pre-fanout
  • Meaning: isolates issues between conditioner and distribution.
  • Best for: link integrity before branching to multiple domains.
  • Risk: cannot capture per-endpoint impairments.
Tap @ Near-endpoint
  • Meaning: observes what the endpoint truly sees.
  • Best for: missing pulses after routing, local distortion, edge loss.
  • Risk: highest chance of interaction; requires careful isolation.
Do-no-harm input rules
  • Isolation: buffer or high-impedance sampling path.
  • Level OK: monitor thresholds and common-mode must match the standard.
  • No “heavy filtering”: avoid masking narrow pulses by analog smoothing.
Avoid common-mode failure
If the monitor measures against the same failing reference, the entire system can “agree” while being wrong.
  • Independent ref: a separate timebase for diagnosis.
  • Cross-check: compare channels/domains for consistency.
Multi-channel configuration checklist
  • Per-channel thresholds: independent freq window / timeout / width limits.
  • Per-channel cause: alarms must include channel/domain identifiers.
  • Aggregated policy: channel → domain → system escalation (avoid one-line “kill switch”).
  • Maintainability: expose minimal test points and keep parameters configurable.
Diagram — Clock-tree taps for observability: source → cleaner → fanout → endpoints A/B/C, with tap points at each stage feeding the monitor + grader (presence, frequency, width, duty) through isolated, level-compatible, no-load paths, plus an independent reference / cross-check.
Choose tap points for diagnosis, keep the sampling path electrically harmless, and avoid a single shared reference that can fail in common mode.
Takeaway
Observation points define diagnosability. Isolation and level compatibility prevent the monitor from degrading the clock. Independent reference or cross-check avoids “everyone agrees” failures.

Verification & production test: proving reliability with executable checks

Reliability is demonstrated by fault injection, measurable coverage targets (false positives, false negatives), and bounded timing (detection and recovery latency). The same logic must be verifiable on a production line with minimal equipment by testing the fastest alarm path (IRQ/GPIO) and reading cause codes via registers for diagnosis.

Injection set (presence)
  • Clock-off / gate-off to force missing-pulse.
  • Periodic blank windows to validate debounce and persistence.
  • Expected output: timeout cause code + bounded T_detect.
Injection set (offset)
  • Programmed offset (±ppm or ±%) around the window edges.
  • Step changes across warn/fault boundaries to validate hysteresis.
  • Expected output: freq-high / freq-low cause code + stable clear behavior.
Injection set (integrity)
  • Duty distortion (high/low asymmetry).
  • Minimum high/low width squeeze.
  • Narrow pulse / glitch insertion to validate integrity detection (not analog masking).
Coverage targets
  • False positive: no FAULT in nominal conditions.
  • False negative: injected faults must be caught.
  • Latency: detection and recovery must be bounded.
Latency measurement
  • Use injection edge as t0 (gate control or offset step).
  • Use IRQ/GPIO assertion as t1.
  • T_detect = t1 − t0; recovery uses clear edge similarly.
Production-friendly checks
  • Prefer fastest path validation (IRQ/GPIO) for timing.
  • Read registers for cause codes and counters (diagnosis).
  • Use minimal injectors: brief gate-off + known offset mode + simple duty toggle.
Pass criteria template
Timing
Detection latency ≤ T_detect · Recovery clear latency ≤ T_clear · Post-switch relock ≤ T_relock
Rates (under specified V/T/noise)
False positive rate < R_fp · False negative rate < R_fn · Maximum switches within guard window ≤ N_sw_max
Traceability
Each event must log: channel_id, alarm level, cause code, first/last seen, counters, and measurement snapshots.
Diagram — Verification bench: fault injector (gate-off, offset step, duty/glitch) → DUT clock tree (source, cleaner, fanout → endpoints, with monitor taps) → monitor outputs (IRQ/GPIO; I²C registers with cause + counters) → time capture (t0, t1 → T_detect / T_clear) → metrics log (FP / FN, pass/fail).
A production-friendly setup validates timing on the fastest alarm path while still capturing cause codes and counters for traceability.
Takeaway
Prove reliability with fault injection and bounded metrics: FP/FN rates under defined conditions, and detection/clear/relock latency limits. Keep the workflow producible by testing the IRQ/GPIO path and logging cause codes.

Field monitoring & health metrics: define logs that actually help

Clock monitoring becomes system health only when the logs can answer three questions fast: (1) slow drift vs sudden step, (2) risk trending up, (3) which channel/domain/tap point to investigate first. Keep this device-side and actionable—avoid turning it into a “big data platform” topic.

A) Minimum useful dataset (per channel, per domain)

Avoid “one counter forever.” Use fixed aggregation windows (e.g., 1 min / 10 min / 1 h) and always log window validity so missing samples cannot masquerade as stability.

Frequency offset statistics
min / max / p95 / p99 / mean (ppm or %) per window, plus valid_samples / missing_samples.
Alarm accounting
Count alarms by cause_code (missing, freq_hi, freq_lo, width, duty, LOS/LOL if provided) and keep a duration histogram (e.g., <1 s / 1–10 s / 10–60 s / >60 s).
Switchover telemetry
switch_count, switch_reason, time_to_stable, and (if available) downstream re-lock time. Separate back_switch_count (return-to-primary) from forward switches.

B) Drift vs step: a simple, reliable triage rule

Field diagnosis fails when slow thermal/aging drift is treated like intermittent disconnection (or vice versa). Use minimal rules that firmware can implement deterministically.

Slow drift signature
  • Offset trend slope keeps the same sign for N windows (e.g., N=6 at 10-min windows).
  • WARN-level near-boundary alarms dominate; durations skew long.
  • Offset correlates with temperature or predictable load cycles.
Sudden step / burst signature
  • One-window delta exceeds a fixed threshold (e.g., Δoffset > X ppm) and alarms cluster tightly in time.
  • Missing-pulse or width/duty faults appear briefly (short duration bins).
  • Switchover count spikes; return-to-primary becomes unstable.

Implementation note: compute slope from window medians (robust to outliers) and compute burst from alarm timestamps. Keep thresholds fixed and reviewed—do not “learn” the definition of a fault.
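The triage rule can be implemented deterministically in a few lines; the thresholds and names below are illustrative and, as the note says, should be fixed and reviewed rather than learned:

```python
def classify_trend(window_medians, n_slope: int = 6,
                   step_ppm: float = 2.0) -> str:
    """Drift vs step triage from per-window median offsets (outlier-robust).

    window_medians : median offset (ppm) per aggregation window, oldest first
    n_slope        : consecutive same-sign deltas that signal slow drift
    step_ppm       : one-window delta magnitude that signals a sudden step
    """
    deltas = [b - a for a, b in zip(window_medians, window_medians[1:])]
    if any(abs(d) > step_ppm for d in deltas):
        return "step"                       # large one-window jump
    if len(deltas) >= n_slope:
        recent = deltas[-n_slope:]
        if all(d > 0 for d in recent) or all(d < 0 for d in recent):
            return "drift"                  # sustained same-sign slope
    return "stable"
```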

C) Fixed vs adaptive thresholds (safe self-tuning without risk)

Must stay fixed (high-risk)
  • Missing-pulse timeout (presence watchdog).
  • Hard frequency fault window (true out-of-spec).
  • Minimum high/low width for pulse integrity (prevents silent edge loss).
Can adapt (low-risk, WARN-only)
  • “Near-boundary” WARN level (early-warning band inside the spec limit).
  • Baseline offset center (to detect abnormal deviation from normal behavior).
Adaptive guardrails (non-negotiable)
  • Train only during known-good periods (no alarms, valid samples, normal temp/rails).
  • Adaptive WARN must remain inside the fixed FAULT window and may only tighten, not relax.
  • FAULT thresholds and missing-pulse logic never change dynamically.
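The guardrails can be enforced structurally, so an adaptive WARN level cannot violate the fixed FAULT window; the function name and the headroom-multiplier policy are illustrative assumptions:

```python
def adapt_warn(baseline_p99_ppm: float, headroom: float,
               fixed_warn_ppm: float, fault_ppm: float) -> float:
    """Adaptive WARN with the section's guardrails baked in:
    it may only tighten the fixed WARN band (never relax it), and it must
    stay strictly inside the fixed FAULT window.

    baseline_p99_ppm : p99 offset observed during a known-good period
    headroom         : hypothetical multiplier above the known-good p99
    """
    candidate = baseline_p99_ppm * headroom
    warn = min(candidate, fixed_warn_ppm)   # only tighten, never relax
    assert warn < fault_ppm, "adaptive WARN must stay inside fixed FAULT"
    return warn
```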

D) Event schema for traceability (post-mortem friendly)

A good event record is small, structured, and searchable. It must explain what happened, where, and what the system did—without requiring oscilloscope access.

Recommended fields (example)
  • timestamp, channel_id, domain_id, tap_point
  • alarm_level (INFO/WARN/FAULT), cause_code
  • first_seen, last_seen, duration_ms
  • offset_ppm (or %), min_width_ns (if measured), missing_timeout_us
  • action_taken (log/degrade/switch/hold/reset), switched_to (if any)
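A minimal sketch of the event record as a structured type (field names mirror the list above but are illustrative, not a mandated register map):

```python
from dataclasses import dataclass, asdict

@dataclass
class ClockEvent:
    """One traceable alarm event: what happened, where, what was done."""
    timestamp: float
    channel_id: int
    domain_id: int
    tap_point: str       # source / post_cleaner / pre_fanout / near_endpoint
    alarm_level: str     # INFO / WARN / FAULT
    cause_code: str      # missing / freq_hi / freq_lo / width / duty
    first_seen: float
    last_seen: float
    duration_ms: float
    offset_ppm: float
    action_taken: str    # log / degrade / switch / hold / reset
    switched_to: str = ""

ev = ClockEvent(1700000000.0, 2, 0, "near_endpoint", "FAULT", "missing",
                1700000000.0, 1700000000.4, 400.0, 0.0, "switch", "B")
record = asdict(ev)      # structured dict, ready for searchable logging
```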

E) A compact “health summary” output (ops + firmware friendly)

Daily summary (per channel)
  • p99 offset as a fraction of budget (e.g., p99 / fault_limit).
  • alarm rate (count/day) and top 2 cause codes.
  • switch rate and median time_to_stable.
  • trend flag: drift / step / burst / stable.
Do not hide risk behind a score
A “health score” may be used for dashboards, but always surface the underlying threshold occupancy, cause distribution, and switch behavior to avoid false confidence.
Diagram — Field health dashboard sketch (trend + histogram + top causes)
KPI cards (alarm count/day, switch count/day, p99 offset in ppm), an offset-vs-temperature trend panel, an alarm-duration histogram (<1 s / 1–10 s / 10–60 s / >60 s), and top-cause bars (missing, freq_hi/lo, width/duty) annotated with the fixed FAULT and adaptive WARN levels.

Applications & IC selection notes (monitoring-focused)

This section stays strictly on clock monitoring: loss-of-signal/lock awareness, frequency window checking, pulse integrity screening, alarm interfaces, debounce/latch resources, and switchover hooks. It does not attempt to replace the dedicated pages for mux/crosspoint/fanout/cleaners.

A) Monitoring-first application patterns (same template, different priorities)

SerDes / PCIe refclk
  • Goal: detect LOS/LOL early to prevent link drop escalation.
  • Must measure: presence + frequency window; add width/duty only if edge integrity is fragile.
  • Output: fast IRQ/GPIO + readable cause code for firmware routing.
  • Action: log → degrade → switch (avoid ping-pong via persistence rules).
JESD204 systems (SYSREF / refclk)
  • Goal: prevent silent misalignment by catching missing SYSREF/refclk conditions.
  • Must measure: presence (SYSREF is the strictest) + refclk frequency window.
  • Output: deterministic status (pins/registers) and latch support for “fault happened” evidence.
  • Action: latch fault → controlled re-sync flow (keep switchover bounded).
Telecom / SyncE timing chains
  • Goal: trend-based early warning and clear out-of-window escalation.
  • Must measure: frequency statistics (p95/p99) + drift/step flags.
  • Output: readable status + counters; periodic summaries are often more useful than raw edges.
  • Action: alarm → degrade/switch (severity-driven).
Industrial / avionics redundancy (traceable events)
  • Goal: reliable switchover + audit-grade event records.
  • Must measure: presence + duration distribution + switch behavior.
  • Output: latched faults + explicit switch reason codes.
  • Action: switch with strong anti-oscillation rules; return-to-primary is stricter than failover.
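The four patterns above share one control skeleton: persist a fault before acting, hold off after any switch, and make return-to-primary stricter than failover. A minimal sketch in Python (all class names, field names, and window counts are illustrative, not tied to any device's registers):

```python
from dataclasses import dataclass

@dataclass
class SwitchPolicy:
    # Illustrative knobs; size them against your endpoint's re-lock behavior
    fault_persist: int = 3     # consecutive bad windows before switching
    guard_windows: int = 10    # hold-off windows after any switch (anti-ping-pong)
    revert_persist: int = 20   # stable-primary windows required before revert

class SwitchController:
    """Minimal 'persist -> switch -> hold off -> strict revert' sketch."""
    def __init__(self, policy: SwitchPolicy):
        self.p = policy
        self.on_backup = False
        self.bad_streak = 0    # consecutive primary failures
        self.good_streak = 0   # consecutive primary recoveries (while on backup)
        self.guard = 0         # remaining hold-off windows

    def evaluate(self, primary_ok: bool, backup_ok: bool) -> str:
        """Call once per evaluation window; returns the action taken."""
        if self.guard > 0:     # freeze decisions right after a switch
            self.guard -= 1
            return "hold-off"
        if not self.on_backup:
            self.bad_streak = 0 if primary_ok else self.bad_streak + 1
            if self.bad_streak >= self.p.fault_persist and backup_ok:
                self.on_backup, self.bad_streak = True, 0
                self.guard = self.p.guard_windows
                return "switch-to-backup"
            return "monitor"
        # Return-to-primary is stricter than failover: require a long good streak
        self.good_streak = self.good_streak + 1 if primary_ok else 0
        if self.good_streak >= self.p.revert_persist:
            self.on_backup, self.good_streak = False, 0
            self.guard = self.p.guard_windows
            return "revert-to-primary"
        return "stay-on-backup"
```

The asymmetry (fault_persist small, revert_persist large) is what keeps failover fast while making ping-pong expensive.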

B) Selection dimensions (monitoring features only)

Tier-1: decide first
  • Input type and thresholds (LVCMOS/LVDS/HCSL/LVPECL/CML), common-mode range.
  • Frequency span + measurement method limits (gate-time/period/watchdog).
  • What can be detected: presence, frequency window, width/duty integrity.
Tier-2: avoid false alarms
  • Programmable windows (freq_hi/lo, timeout, width/duty) and per-channel config.
  • Debounce/persistence resources (N windows / N events) and latch/auto-clear behavior.
  • Alarm interfaces: GPIO/IRQ + readable cause/status registers.
Tier-3: field maintainability
  • Robustness: temperature, supply noise tolerance, and clean status semantics.
  • Observability: counters/statistics availability (or at least clear fault flags).
  • Multi-channel scaling: per-channel thresholds + aggregated alarm routing.
Common-mode risk checkpoint
If the monitor shares the failing reference, the same supply, or the same internal path as the clock it watches, both can fail together and the fault goes unreported. Prefer an independent reference or a cross-check in safety- or uptime-critical systems.
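One way to act on the common-mode checkpoint is to run a second monitor on an independent reference and compare verdicts: disagreement implicates a reference path rather than the clock under test. A hedged sketch (function and label names hypothetical):

```python
def classify(shared_ref_alarm: bool, indep_ref_alarm: bool) -> str:
    """Two monitors watch the same clock: one shares the system reference,
    one uses an independent reference. Disagreement implicates a reference
    path, not the clock under test."""
    if shared_ref_alarm and indep_ref_alarm:
        return "clock-fault-confirmed"
    if shared_ref_alarm:
        return "shared-reference-suspect"    # common-mode path likely at fault
    if indep_ref_alarm:
        return "independent-monitor-suspect"
    return "healthy"
```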

C) Concrete reference part numbers (starting points; verify suffix/package/availability)

The list below is intentionally monitoring-driven (LOS/LOL, status/interrupts, redundant reference support, readable fault flags). Final selection must be validated against input standard, frequency plan, debounce needs, and system-level switchover requirements.

Timing/cleaners with LOS/LOL monitoring
  • Si5341 (example ordering: Si5341B-D-GM) — fault monitoring and status via serial interface; useful when monitoring is coupled with clock generation.
  • Si5345 / Si5344 / Si5342 — family provides LOL/LOS indicators and fault handling suited to robust reference monitoring.
  • Si5391 — fault indicators for LOS/LOL; common in high-performance timing trees where fault status must be readable.
  • Si5382 — reference monitoring features including LOS/invalid frequency awareness (fit when multi-input monitoring is needed).
Clock generator with explicit LOS/LOL interrupts
  • LMK03328 (Texas Instruments) — exposes loss-of-input-signal and loss-of-lock interrupt sources; useful when fault flags must feed an MCU/FPGA alarm router.
Redundant reference switching + LOS alarms
  • 8T49N241 (Renesas / FemtoClock®NG) — monitors input clocks for loss-of-signal and can support redundant reference behavior.
  • 8T49N283 (Renesas) — monitors inputs for LOS and supports hitless reference switching options (fit when automatic switchover is required inside the timing IC).
Clock generation/distribution with status outputs
  • AD9528 (Analog Devices) — provides status pins and status registers; commonly used where readable lock/fault status is needed in clocking subsystems.
  • HMC7044 (Analog Devices) — clock generation/distribution platform commonly used for converter clocking; monitoring is integrated via readable status registers and controlled outputs.
Synchronizers where reference monitoring is central
  • 8A34001 (Renesas ClockMatrix / SMU) — system timing/synchronization device with built-in input reference monitoring and qualification feeding its timing paths.
  • ZL30731–ZL30735 (Microchip) — network synchronizers with frequency measurement/monitoring hooks; useful when field health and reference monitoring are requirements.
Procurement note (prevents BOM surprises)
Always lock the exact ordering code (package, temp grade, output standard options) and confirm whether alarms are pin-based, register-based, or both. For multi-source redundancy, confirm that LOS/LOL flags remain valid during power transitions and after reference switching.
Diagram — Monitoring-focused selection flow (input → range → interface → debounce → redundancy)
(Flow: Step 1, input type / threshold domain — CMOS, LVDS, HCSL, LVPECL → Step 2, frequency range + method limits (low/mid/high) → Step 3, what to detect — presence, freq window, width/duty → Step 4, alarm interface + readability — IRQ, GPIO, I²C registers → Step 5, debounce/latch resources → Decision: automatic switchover needed? If YES: bounded-glitch or hitless option, persist fault before switching. If NO: log + degrade only, keep FAULT thresholds fixed. Common-mode risk note: prefer independent reference / cross-check.)


H2-13. FAQs (monitoring & missing-pulse) + JSON-LD

Short, actionable troubleshooting answers that stay within this page boundary (presence / frequency window / pulse integrity / alarms / switchover / observability). Each answer follows the same four-line structure (likely cause / quick check / fix / pass criteria) for fast execution and verification.

1) Missing-pulse alarm triggers randomly, but the clock looks fine on scope—why?

Likely cause: Gate/window too short, so edge jitter/glitches are interpreted as “missing” inside the monitor logic.

Quick check: Increase gate time ×10 and enable event timestamps; correlate alarms vs temperature and supply (same timebase window).

Fix: Add persistence (N consecutive fails) + input conditioning at the tap (e.g., Schmitt buffer SN74LVC1G17 / NC7SZ17 for LVCMOS taps; for differential clocks, use a dedicated buffer/fanout output as the monitor input source).

Pass criteria: False-alarm rate < X/hour across worst-case V/T/noise for the qualification duration (e.g., 24 h), with valid_samples coverage > Y%.
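The persistence rule from the fix above can be sketched as an N-consecutive-fails debounce (a minimal illustration; the per-window "pulse seen" sampling model and the value of N are assumptions to be sized against your gate time):

```python
def debounce(samples, n_fail: int):
    """Yield True (alarm) only after n_fail consecutive windows report a
    missing pulse. `samples` is an iterable of booleans: True = pulse seen."""
    streak = 0
    for seen in samples:
        streak = 0 if seen else streak + 1   # any good window resets the streak
        yield streak >= n_fail
```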

2) Frequency-offset alarm never triggers even when a slow drift is forced—what’s first?

Likely cause: Wrong nominal reference (divider/PLL ratio mismatch) or wrong units (ppm vs %), so the programmed window is not the intended window.

Quick check: Read back nominal and thresholds from registers; compare with measured counts/period over the same gate interval (no mixed units).

Fix: Establish one “source of truth” for nominal (single configuration owner) + sanity clamps (min/max window). If a timing device is used for configuration centralization, ensure nominal is readable (e.g., Si5341 or LMK03328 as examples to look up; verify exact ordering code/features in datasheets).

Pass criteria: Alarm asserts within T_detect after injected offset ≥ threshold, and clears only after the defined recovery persistence window.
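A quick way to keep nominal, counts, and units consistent is to compute the offset in exactly one place, in ppm only (sketch; the count-over-gate measurement model and function names are illustrative). Note that 1 % equals 10,000 ppm, so a ppm/% mix-up shifts the window by four orders of magnitude:

```python
def offset_ppm(measured_count: int, expected_count: int) -> float:
    """Fractional frequency offset in ppm, from edge counts over one gate."""
    return (measured_count - expected_count) / expected_count * 1e6

def in_window(measured_count: int, expected_count: int, window_ppm: float) -> bool:
    """Single place where nominal, counts, and units meet (ppm only)."""
    return abs(offset_ppm(measured_count, expected_count)) <= window_ppm
```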

3) Alarm chatters near threshold—how to stop it without masking real faults?

Likely cause: No hysteresis and/or no separate enter/exit conditions, so boundary crossings toggle the state machine.

Quick check: Histogram measured offset around the boundary and count bidirectional crossings per window; confirm toggle rate vs persistence settings.

Fix: Add hysteresis (enter/exit thresholds) + recovery persistence. For LVCMOS taps, Schmitt conditioning (e.g., SN74LVC1G17) reduces boundary chatter caused by slow edges.

Pass criteria: Alarm toggles ≤ 1 time during steady operation near the boundary (defined test condition), while FAULT detection remains within T_detect.
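Hysteresis with separate enter/exit conditions can be sketched as a two-threshold state machine (threshold values illustrative); the exit threshold must sit strictly inside the enter threshold so boundary noise cannot toggle the state:

```python
def hysteresis_alarm(offsets_ppm, enter_ppm: float, exit_ppm: float):
    """Assert alarm when |offset| >= enter_ppm; clear only when |offset|
    <= exit_ppm. Requires exit_ppm < enter_ppm (true hysteresis band)."""
    assert exit_ppm < enter_ppm
    alarm = False
    for x in offsets_ppm:
        if not alarm and abs(x) >= enter_ppm:
            alarm = True
        elif alarm and abs(x) <= exit_ppm:
            alarm = False
        yield alarm
```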

4) Switchover happens, but endpoints still lose lock—what should be verified first?

Likely cause: Switch disturbance exceeds endpoint tolerance and/or the post-switch stability window is too short.

Quick check: Align endpoint lock-loss indication (pin/CSR) with switch-control timing; measure “time-to-stable” vs endpoint re-lock time.

Fix: If hitless behavior is required, use a dedicated hitless/controlled-reference switching approach (examples to look up for integrated timing platforms: Si5345, Si5341, 8T49N283; verify exact capabilities/order codes). Otherwise, widen the post-switch settle window and delay endpoint-dependent actions until stability qualifies.

Pass criteria: Endpoint re-lock < T_relock, and no repeated switching within T_guard after a successful switch.
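"Time-to-stable" can be qualified by requiring K consecutive in-window samples after the switch before endpoint-dependent actions resume (sketch; K and the window width are assumptions to be sized against the endpoint's re-lock time):

```python
def time_to_stable(post_switch_offsets_ppm, window_ppm: float, k_stable: int):
    """Return the sample index at which k_stable consecutive in-window
    samples have been observed after a switch, or None if never qualified."""
    streak = 0
    for i, x in enumerate(post_switch_offsets_ppm):
        streak = streak + 1 if abs(x) <= window_ppm else 0
        if streak >= k_stable:
            return i
    return None
```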

5) After switching to backup, the system keeps switching back and forth—why?

Likely cause: Auto-revert is enabled without stable qualification, and the backup source is also marginal (or both share a common-mode issue).

Quick check: Log both sources’ health metrics (offset p95/p99, alarm duration bins, switch reasons) and verify the revert condition definition (enter/exit thresholds + persistence).

Fix: Require “stable-on-backup for T_hold” before revert; prefer manual revert for high-risk endpoints. If integrated reference switching is used, ensure revert rules are configurable/readable (example platforms to look up: 8T49N283, Si5345; verify datasheets).

Pass criteria: Max switch count ≤ N per day under worst-case conditions, with zero ping-pong sequences shorter than T_pp_min.
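The T_pp_min pass criterion can be checked directly from logged switch timestamps: any two switches closer together than T_pp_min count as a ping-pong pair (sketch; units and the threshold are placeholders):

```python
def count_ping_pong(switch_times_s, t_pp_min_s: float) -> int:
    """Count adjacent switch events spaced closer than t_pp_min_s
    (each such pair is one ping-pong sequence)."""
    return sum(
        1 for a, b in zip(switch_times_s, switch_times_s[1:])
        if (b - a) < t_pp_min_s
    )
```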

6) I²C reads show “mixed” status counters—what’s the first fix?

Likely cause: Non-atomic multi-byte reads during rollover (counter changes mid-transaction), producing inconsistent snapshots.

Quick check: Read twice and compare; if mismatch rate is non-zero, rollover is occurring during reads. Check whether a latch-on-read or snapshot mechanism exists.

Fix: Implement atomic snapshot in firmware (latch, read, then clear/ack) or use device-supported “freeze counters”/shadow registers when available.

Pass criteria: Repeated reads match within ≤ 2 attempts, and snapshot age < T_snap_max.
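When no hardware freeze/shadow register exists, the firmware-side fallback is read-until-two-consecutive-reads-agree (a sketch, assuming the counter changes slowly relative to the read rate; `read_fn` is a placeholder for your I²C read routine):

```python
def read_counter_stable(read_fn, max_attempts: int = 4):
    """Re-read a multi-byte counter until two consecutive reads agree,
    rejecting snapshots torn by a rollover mid-transaction."""
    prev = read_fn()
    for _ in range(max_attempts - 1):
        cur = read_fn()
        if cur == prev:
            return cur       # consistent snapshot
        prev = cur
    raise RuntimeError("counter never stabilized; check latch/shadow support")
```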

7) Why does adding a probe change alarm behavior?

Likely cause: The monitoring tap is not isolated; probe capacitance loads the node, distorting edges or creating reflections that shift pulse integrity.

Quick check: Compare with a high-impedance active probe; measure edge rate and swing at the tap with/without the probe (same ground strategy).

Fix: Buffer/isolate the tap and feed the monitor from a dedicated buffer output (example LVCMOS buffer to look up: LMK1C1104; verify I/O standard and operating range). For CMOS taps, Schmitt conditioning (e.g., SN74LVC1G17) may reduce sensitivity to slow/loaded edges.

Pass criteria: Alarm behavior unchanged with/without probe (no new alarms), and measured edge rate change < Δ_edge_max under the defined test setup.

8) Loss-of-lock alarms appear only at high temperature—what to suspect first?

Likely cause: Guardband is too tight vs drift/aging/temperature, and/or the input buffer switching point shifts with temperature and supply noise.

Quick check: Log offset vs temperature and compare against the guardband stack (spec + drift + measurement + margin); confirm whether alarms align with threshold occupancy.

Fix: Widen window using drift budget (do not relax missing-pulse protection), improve tap conditioning, and remove borderline edge shapes (e.g., Schmitt buffer NC7SZ17 / SN74LVC1G17 for suitable CMOS taps; ensure monitor input standard matches the source).

Pass criteria: No alarms across the required temperature range when expected drift + margin is applied, and p99 offset remains ≤ K% of the FAULT limit.
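The guardband stack and the p99 occupancy criterion can be made explicit (sketch; a linear worst-case sum is a deliberately conservative assumption, and all numbers in the test are placeholders):

```python
def fault_window_ppm(spec_ppm, drift_ppm, aging_ppm, meas_err_ppm, margin_ppm):
    """Linear worst-case guardband stack -> FAULT window half-width in ppm."""
    return spec_ppm + drift_ppm + aging_ppm + meas_err_ppm + margin_ppm

def occupancy_ok(p99_offset_ppm, fault_limit_ppm, k_percent):
    """Pass criterion: |p99 offset| stays within K% of the FAULT limit."""
    return abs(p99_offset_ppm) <= fault_limit_ppm * (k_percent / 100.0)
```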

9) Detection delay is met, but short dropouts are still missed—how?

Likely cause: Persistence/gate time is longer than the dropout width; the dropout ends before the next evaluation point.

Quick check: Inject controlled dropout widths and sweep from short to long; map detection probability vs dropout width and evaluation method.

Fix: Run a missing-pulse watchdog (edge-fed timeout) in parallel with a frequency-window counter; use the watchdog for short dropouts and the counter for drift/out-of-window behavior.

Pass criteria: Detect dropout ≥ T_drop_min with probability ≥ P_min, and assert alarm within ≤ T_detect_short.
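The parallel missing-pulse watchdog is just an inter-edge timeout: fed with edge timestamps, it fires as soon as any gap exceeds the timeout, independent of the frequency counter's gate interval (sketch; timestamps in seconds, values illustrative):

```python
def watchdog_detect(edge_times_s, timeout_s: float):
    """Edge-fed watchdog: return the time at which the watchdog would fire
    on the first inter-edge gap exceeding timeout_s, or None if none."""
    for a, b in zip(edge_times_s, edge_times_s[1:]):
        if (b - a) > timeout_s:
            return a + timeout_s   # fires timeout_s after the last good edge
    return None
```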

10) Monitoring after the cleaner misses faults that endpoints see—why?

Likely cause: The fault occurs downstream (fanout output, connector, branch routing, endpoint termination), so upstream monitoring never observes the degraded waveform.

Quick check: Move the monitoring tap closer to the endpoint or monitor multiple tree points; correlate branch-specific alarms with endpoint lock-loss events.

Fix: Use per-branch monitors and correlation logic. If a timing platform is used, prefer variants with readable status/alarms that can be attributed to input/output domains (examples to look up: AD9528, HMC7044, Si5341; verify exact status granularity in datasheets).

Pass criteria: Fault localization points to the correct branch in ≥ X% of injected/field-reproduced cases, with unambiguous cause_code mapping.