
Timing & Power Panel at Edge (Redundant Clock + Hot-Swap)


An Edge Timing & Power Panel keeps edge sites stable by distributing clocks with controlled A/B switchover, protecting 48V feeds with ORing/hot-swap/eFuses, and aggregating PG/RESET so one glitch can’t reboot the whole rack—while logging every event as evidence for fast field debugging.

In practice, “good” means measurable jitter/phase-hit behavior, bounded failover decisions, selective power isolation, and forensics-ready logs that let operators prove what happened and fix it without guesswork.

H2-1 · What is an Edge Timing & Power Panel (and what it is NOT)

An Edge Timing & Power Panel is a site-level distribution and protection layer that fans out reference timing and DC power to multiple edge devices while providing redundant switchover, fault isolation, reset/alarm aggregation, and event evidence logging. It is designed to keep an edge rack stable through source failures, maintenance, and transient faults, without turning every glitch into a site reboot.

Continuity: A/B clock + A/B power · Isolation: hot-swap + branch eFuse · Control: PG/RESET fan-in policy · Proof: timestamped event logs

What it typically contains (scan-first checklist)

  • Inputs: A/B reference clocks (e.g., 1PPS / 10MHz / Sync reference), A/B DC feeds (e.g., 48V/12V), discrete fault/PG inputs, management/OOB link.
  • Outputs: multi-drop clock fan-out, protected DC branches to loads, reset/alarm outputs, telemetry export.
  • Protections: ORing and hot-swap for feed redundancy, inrush limiting, branch eFuse / high-side isolation, UV/OV/OT safeguards.
  • Alarms: clock loss/quality alarms, power fault alarms, reset asserted indicators, maintenance/service state indicators.
  • Logs: switchover events, protection trips, counters, and “what-action-was-taken” records with reliable timestamps.
  • Management: read-only observability is mandatory; remote control (enable/disable branches, force source select) is optional and must not break the protection path.
Engineering focus: this page treats the panel as a reliability boundary—it must (1) survive a bad input, (2) contain a bad branch, and (3) leave a clear forensic trail that explains why a switchover or shutdown happened.

Boundary: Panel vs Grandmaster/Time Hub vs Boundary Clock Switch

Component | Owns (primary responsibility) | Does NOT own (avoid confusion)
Timing & Power Panel | Physical distribution, redundant switchover policy, branch protection, PG/RESET fan-in/out, alarms, evidence logging | Protocol servo logic, network forwarding behavior, deep timing-source discipline algorithms
Grandmaster / Time Hub | Timebase generation/discipline and quality control of the timing source | Site power distribution and branch isolation; rack-level reset policy and maintenance containment
Boundary Clock Switch | Time forwarding behavior inside a switching system (timestamps, shaping, alarm integration) | Being the site reference source; being the power protection and reset aggregation authority
Figure F1 — Edge Timing & Power Panel overview (distribution + protection + evidence)
Block diagram: A/B clock inputs (1PPS / 10MHz / Ref) and A/B power feeds (48V/12V DC) feed the clock mux/cleaner, ORing + hot-swap stage, per-branch eFuses, PG/RESET aggregator, and event logger, then fan out to edge loads (O-RU, DU, switch).

H2-2 · System Use-Cases & Topologies at the Edge (where this panel sits)

The panel sits between site sources (timing references and DC feeds) and edge loads (O-RU/DU, aggregation switches, security/observability nodes, and micro edge racks). The goal is not just to fan out signals, but to ensure failures stay localized and maintenance actions are non-disruptive.

Topology A — Dual timing sources (GM + GPSDO) feeding a single panel

  • Why it exists: timing source quality can degrade without going fully “down”; redundancy prevents service-impacting re-lock storms.
  • What the panel must do: detect quality/LOS, apply lockout to stop ping-pong switching, and record each decision with timestamps.
  • What to observe: switchover counters, source quality flags, time-in-state, and “reason codes” for each switch event.
  • Commissioning action: simulate source loss and recovery; verify controlled switchover behavior and the expected event trail.

Topology B — Single timing source with dual distribution paths (A/B path)

  • Why it exists: connectors, cabling, and terminations fail more often than the reference itself; dual paths reduce site-level single points.
  • What the panel must do: isolate a bad path, alarm cleanly, and keep remaining outputs stable without triggering unnecessary resets.
  • What to observe: per-path LOS/quality flags, phase-hit indicators (if available), and output health per group.
  • Commissioning action: break path A at the panel input; verify alarms and continued service on path B with no cascading actions.

Topology C — Micro edge cabinet (timing + power in one panel with OOB observability)

  • Why it exists: compact deployments suffer from brownouts, inrush events, and thermal constraints; these cause nuisance resets and “mystery outages.”
  • What the panel must do: hot-swap feeds, contain branch faults via eFuse, aggregate PG/RESET with debounce and policy zones, and preserve event evidence.
  • What to observe: branch trip counters, PG/RESET assertions with root-cause tags, supply droop snapshots (if supported), and maintenance-mode markers.
  • Commissioning action: load-step and inrush tests; verify no full-cabinet reset on a non-critical branch fault.
A useful mental model: the panel is the site stability governor. It converts unpredictable field events (cable faults, droops, short circuits) into bounded actions (isolate a branch, switch a source, raise an alarm) with clear evidence that explains the outcome.
Figure F2 — Three common edge topologies (sources → panel → loads)
(A) Dual timing sources: grandmaster + GPSDO feed one panel serving O-RU, DU, and an edge switch. (B) Dual paths: a single source with A/B distribution paths to separate load groups. (C) Micro cabinet: combined timing and power (hot-swap, eFuse, PG/RESET) for cabinet loads, with OOB observability.

H2-3 · Requirements & Budgets: what “good distribution” means (before you design)

“Good distribution” at the edge is defined by bounded impact: a source fault, a branch short, or a maintenance action should trigger a controlled switchover or isolation—while leaving a clear evidence trail. This requires budgets expressed in the panel’s language: added jitter, wander, phase hit on switchover, alarm latency, and power droop / surge margins.

Clock: jitter · wander · phase hit | Power: inrush · ORing drop · turn-on | Reset: debounce · sequencing | Logs: resolution · retention
Budgeting is a distribution contract: each segment (source → panel → load) consumes part of the margin. Without an explicit split, later “tuning” becomes guesswork and field failures become non-repeatable.

Budget checklist (allocate margins and define verification evidence)

Budget item | Where margin is consumed (source → panel → load) | How to verify (evidence)
Clock additive jitter | Fan-out buffers, muxing, cleaner PLL (if enabled), output group loading and cabling | Compare input vs output stability indicators; record lock state and output health per group
Wander / long-term drift | Power noise coupling, temperature gradients, reference quality variations, holdover pass-through path | Trend alarms/counters over time; correlate drift flags with power/thermal telemetry
Phase hit on switchover | Switchover policy, break/make behavior, PLL re-lock behavior, output distribution group timing | Switchover event log (reason + time); phase-step flag (if available); time-in-state counters
Alarm latency | Detection thresholds, debounce windows, policy gating, discrete alarm fan-out or mgmt export | Inject LOS/LOL and verify alarm timestamps and export delay; confirm no alarm “storming”
ORing drop / power margin | ORing elements, hot-swap path, connector/wiring resistance, branch current peaks | Load-step test; minimum-bus snapshot or UV flag; correlate droop with branch current
Inrush & hot-swap profile | Hot-swap ramp, branch capacitance, parallel branch enable timing, retry behavior | Cold-start and hot-plug trials; inrush-limited behavior; retry counters and trip reasons
PG debounce / reset policy | PG/FAULT wiring, thresholding, debounce filters, zone policies (critical vs non-critical) | Pulse injection and brownout simulation; verify no nuisance resets; reset reason codes
Log resolution & retention | Timestamp source, buffering, storage retention, export path availability during faults | Confirm minimum time granularity; verify logs survive power events; validate export integrity

Common pitfall: treating switchover as “binary up/down.” Many edge outages come from quality degradation and oscillating decisions.

Common pitfall: verifying steady-state power only. Most site resets happen on turn-on, inrush, and fault response transients.
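Before any parts are chosen, the contract can be sanity-checked with plain arithmetic. A minimal sketch in Python, where the budget and per-segment allocations are placeholder numbers, not recommendations:

```python
# Minimal sketch: a budget split as explicit arithmetic. All numbers are
# illustrative placeholders; the point is that source + panel + load
# consumption must leave positive margin against what the load tolerates.
budget_ps = 500  # total additive jitter the most sensitive load tolerates (example)
consumed_ps = {
    "source": 150,      # reference quality variation
    "panel": 120,       # fan-out buffers, mux, cleaner PLL
    "load_side": 130,   # cabling, termination, loading
}
margin = budget_ps - sum(consumed_ps.values())
assert margin > 0, "no headroom left: re-split the budget before field tuning"
print(f"remaining margin: {margin} ps ({margin / budget_ps:.0%} of budget)")
```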

Figure F3 — Budget allocation view (source → panel → load)
Four budget rows (clock stability, power margin, reset response, logging/observability), each split across source, panel, and load segments with an observation point (OBS) near the panel. Each segment needs an observable indicator: alarm flags, counters, timestamps, and exportable logs; explicit splits prevent “tuning by guess” and make field incidents reproducible.

H2-4 · Clock Distribution Path: inputs, fan-out, isolation, and “cleaning vs passing through”

A timing panel succeeds or fails on the physical distribution path. Most real-world instability is introduced by termination mistakes, fan-out loading, ground coupling, and switchover transients, not by the label on the timing source. A robust design treats the path as two selectable lanes: pass-through (minimum processing) and clean (jitter-cleaning), with clear monitoring points.

Pass-through lane: lowest latency and simplest behavior; it propagates source quality (good or bad) to the outputs.

Clean lane: improves certain stability metrics but introduces lock state, holdover behavior, and potential phase hits on re-lock or switching.

Clock path breakdown (what can go wrong → how to contain it → what proves it)

  • Input conditioning & protection: control reflections (proper termination), limit ESD/over-voltage coupling, and avoid protection capacitance that distorts edges. Evidence: input LOS/quality flags and stable lock indicators.
  • Fan-out & isolation: group outputs by load class; isolate grounds to prevent coupling; keep each group observable. Evidence: per-group output health and fault counters.
  • Cleaning & muxing: define switchover rules and lockout to prevent oscillation; decide when to use pass-through vs clean. Evidence: switchover reason codes, time-in-state, and lock/holdover flags.
  • Output monitoring: detect LOS/LOL/phase-step and correlate to the exact output group. Evidence: timestamped alarms linked to a specific output path.
Practical rule: a panel should never “silently degrade.” If an output group becomes marginal due to cabling or loading, monitoring must surface it before the site enters repeated re-lock cycles.
Figure F4 — Clock distribution path (pass-through vs clean) with monitoring points
Inputs pass through protection (ESD/OV/termination) and a lane selector; the pass-through lane is a fan-out buffer plus mux, the clean lane a cleaner PLL with lock/holdover states. Per-group output monitors flag LOS, LOL, and phase steps so degradation is never silent.

H2-5 · Redundant Switchover: architectures, detection logic, and phase-hit containment

Redundant switchover is not “A/B exists.” It is a controlled mechanism that converts input degradation into bounded actions (switch, lockout, isolate) while leaving a verifiable evidence trail. A robust panel separates switchover into three engineering layers: Detect, Decide, and Act.

Detect: LOS · LOL · drift | Decide: priority · lockout | Act: switch style · containment | Proof: event fields

Layer 1 — Detect (hard faults + soft degradation, with debounce)

  • Hard faults: LOS (loss-of-signal), LOL (loss-of-lock), reference missing/out-of-range.
  • Soft degradation: phase drift rate beyond threshold, quality metric below threshold, intermittent instability flags.
  • Debounce & hysteresis: time-based confirmation prevents transient spikes from triggering site-wide switching.

Layer 2 — Decide (anti-ping-pong policy)

  • Priority: define preferred reference (A-first or B-first) and whether manual override is allowed.
  • Lockout timer: after switching, hold on the new source for a minimum time to avoid oscillation.
  • Revertive vs non-revertive: revertive returns to preferred source after recovery; non-revertive stays until the active source degrades.
  • Rate limit: cap the number of switches per time window; when exceeded, enter a safe alarm-only state.
  • Maintenance mode: freeze selection for service operations and mark the mode explicitly in logs.

Layer 3 — Act (switching style + phase-hit containment)

  • Switching style: break-before-make (avoid overlap) vs make-before-break (avoid gap) depending on allowed risk profile.
  • Containment by output groups: isolate impact to a defined group (critical vs non-critical) rather than site-wide disturbance.
  • Phase-step awareness: monitor and report phase-step / re-lock indicators so “hit events” are observable, not guessed.
A “good” switchover is one that does not oscillate, does not surprise critical loads, and can be explained after the fact using logs and counters.
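Put together, the three layers form a small state machine. The sketch below shows the Detect → Decide → Act loop with debounce, lockout, and rate limiting in Python; state names, thresholds, and the health interface are illustrative assumptions, not a vendor API:

```python
# Minimal sketch: Detect -> Decide -> Act with debounce, lockout, and a
# rate limit. Thresholds and state names are illustrative assumptions.
DEBOUNCE_S = 2.0            # degradation must persist this long (assumed)
LOCKOUT_S = 30.0            # minimum dwell on the new source (assumed)
MAX_SWITCHES_PER_HOUR = 4   # beyond this, stop switching and alarm

class SwitchoverPolicy:
    def __init__(self):
        self.active = "A"
        self.degraded_since = None    # when degradation was first observed
        self.lockout_until = 0.0
        self.switch_times = []        # timestamps of recent switch events

    def step(self, now, healthy):     # healthy: {"A": bool, "B": bool}
        # Detect: require the fault to persist past the debounce window.
        if healthy[self.active]:
            self.degraded_since = None
            return "NORMAL"
        if self.degraded_since is None:
            self.degraded_since = now
        if now - self.degraded_since < DEBOUNCE_S:
            return "DEBOUNCING"
        # Decide: honor lockout and the rate limit before acting.
        if now < self.lockout_until:
            return "HOLD_LOCKOUT"
        self.switch_times = [t for t in self.switch_times if now - t < 3600]
        if len(self.switch_times) >= MAX_SWITCHES_PER_HOUR:
            return "SAFE_ALARM_ONLY"      # storm: evidence instead of action
        standby = "B" if self.active == "A" else "A"
        if not healthy[standby]:
            return "HOLDOVER_ALARM"       # no qualified source: do not switch
        # Act: switch once, then open a fresh lockout window.
        self.active = standby
        self.switch_times.append(now)
        self.lockout_until = now + LOCKOUT_S
        self.degraded_since = None
        return "SWITCHED_TO_" + self.active
```

Calling step(now, {"A": ..., "B": ...}) once per scan tick yields exactly one bounded decision per tick, and every return value maps onto a loggable reason code.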

Switchover event record: minimum fields for forensic clarity

Field | Why it matters (what it proves)
Event ID / monotonic counter | Prevents ambiguity from log rollover; supports exact ordering across incidents.
Timestamp + timebase source | Enables correlation with alarms, resets, and maintenance windows.
Pre-state → post-state | Explains the transition path (e.g., NORMAL_A → SWITCHING → NORMAL_B).
Trigger reason code | Separates LOS/LOL from soft degradation (quality low, drift threshold exceeded).
Metrics snapshot | Captures the decision context (LOS/LOL flags, quality flag, drift flag) at trigger time.
Decision mode | Records priority, revertive mode, lockout remaining, and maintenance mode status.
Action taken | Documents break/make style, any output gating, and whether the clean lane was enabled.
Affected output groups | Proves impact containment scope (critical group vs non-critical group).
Outcome | Indicates success/fail/rollback and whether the safe alarm-only state was entered.
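For concreteness, here is how one switchover record carrying those minimum fields might serialize; field names and values are illustrative assumptions rather than a fixed schema:

```python
# One switchover event with the minimum forensic fields from the table.
# All names and values are illustrative assumptions, not an export format.
switch_event = {
    "event_id": 1042,                          # monotonic counter
    "time_utc": "2024-05-01T10:32:07.120Z",
    "timebase": "utc/good",                    # timestamp + timebase source
    "pre_state": "NORMAL_A",
    "post_state": "NORMAL_B",
    "reason_code": "QUALITY_LOW_A",            # soft degradation, not LOS
    "metrics": {"los_a": False, "lol_a": False, "drift_a": True},
    "decision": {"priority": "A_first", "revertive": False,
                 "lockout_remaining_s": 30, "maintenance": False},
    "action": {"style": "break_before_make", "clean_lane": True},
    "affected_groups": ["non_critical"],
    "outcome": "success",
}
```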

Anti-ping-pong checklist: enforce hysteresis, apply lockout after switch, rate-limit switches, and log every transition with reason codes.

Containment checklist: define output groups, gate only the affected group if needed, and export “affected outputs” as a first-class log field.

Figure F5 — Switchover state machine (with lockout and evidence points)
States NORMAL_A → DEGRADED_A → SWITCHING → NORMAL_B, with RECOVER_A once source A is restored and the lockout has expired; a lockout timer prevents ping-pong, and every transition is logged with a reason code.

H2-6 · Power Front-End in the Panel: ORing, hot-swap, eFuse, and inrush control

Power stability is a prerequisite for timing stability. Brownouts, inrush events, and branch faults frequently manifest as “timing issues” because downstream devices reset or enter unstable states. The panel power front-end is therefore designed to contain faults per branch, control transients, and export evidence that explains every shutdown or retry.

Redundancy: A/B ORing · Transient control: hot-swap · Isolation: branch eFuse · Proof: trip + retry logs

Power chain component map (inputs → protection → branches → telemetry)

  • Input protection: surge/ESD and polarity/backfeed containment at the feed entry.
  • Redundant ORing: selects/combines A/B feeds while preventing reverse current into a failed feed.
  • Hot-swap controller: controls turn-on ramp and limits inrush to protect the upstream bus.
  • Bus sense: observes bus droop/UV/OV and correlates events with branch actions.
  • Branch eFuse / high-side isolation: per-load current limit, shutdown, retry, and latching behavior.
  • Current sense + fault flags: per-branch observability for “fault → action → outcome.”
  • Telemetry & event logging: trip reasons, retry counters, and feed failover states exported via mgmt.

Fault → action → evidence (make every protection decision explainable)

Fault scenario | Panel action | Evidence fields to log/export
Branch short / overcurrent | Current limit → shutoff; optional retry or latch-off based on policy | Branch ID, trip reason, peak/avg current flag, retry count, latch status, timestamp
Inrush too high | Controlled ramp; inrush limiting; staged enable across branches | Turn-on profile flag, inrush-limited flag, enable sequence ID, bus droop flag
Feed failure (A or B) | ORing isolates the failed feed; continue on the surviving feed | Feed state (A/B), failover event, ORing status, bus minimum flag, time-in-state
Surge / transient | Clamp/contain without propagating to branches; protect the hot-swap path | OV/UV event, surge flag, affected branch list (if any), action taken, timestamp
Over-temperature | Derate or shut down the affected branch/front-end stage | OT flag, duration bucket, derate state, branch impact, recovery timestamp
Brownout / UV | Selective load shed for non-critical branches; preserve critical branches | UV flag, shed list, critical preserved list, reset prevention status, timestamp
A panel that only “turns off power” without exporting branch ID + reason + counters creates mystery outages. Evidence-driven protection reduces MTTR and prevents repeated site-wide resets.
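Turn-on deserves the same fault → action → evidence discipline. Below is a minimal sketch of staged branch enable with bounded retries, so inrush from parallel branches cannot stack and a shorted branch latches off instead of chattering; enable and tripped are hypothetical driver callbacks, and the delays and limits are assumptions:

```python
import time

MAX_RETRIES = 3   # bounded retries, then latch off (assumed policy)
STAGGER_S = 0.2   # delay between branch enables to de-stack inrush (assumed)

def bring_up(branches, enable, tripped):
    """Enable branches one at a time; list critical branches first."""
    log = []
    for branch in branches:
        for attempt in range(1, MAX_RETRIES + 1):
            enable(branch)
            time.sleep(STAGGER_S)          # let inrush settle before judging
            if not tripped(branch):
                log.append((branch, "on", attempt))
                break
        else:
            # Retries exhausted: latch off and record evidence, no chatter.
            log.append((branch, "latched_off", MAX_RETRIES))
    return log
```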
Figure F6 — Panel power path (A/B feeds → ORing + hot-swap → branch eFuses with sense + logging)
A/B feeds enter ORing and hot-swap stages, then a sensed bus (UV/OV + droop) fans out to per-branch eFuses with current sense and fault flags; critical and non-critical branches are separated, and each stage reports to the event logger.

H2-7 · PG/RESET Fan-In & Fan-Out: sequencing, debounce, and “don’t reboot the site”

Edge sites fail not only from real power loss, but from false reset cascades: a single branch glitch propagates into a site-wide reboot. A robust panel treats PG/FAULT handling as a controlled pipeline: Fan-In (clean + classify) → Decision (policy) → Fan-Out (timed outputs), with strict zoning so non-critical noise cannot trip critical reset paths.

Fan-In: debounce + classify · Policy: reset / alarm / isolate · Zoning: critical vs non-critical · Fan-Out: timing + recovery

Fan-In: multi-source PG/FAULT inputs (clean signals before acting)

  • Input classes: critical power-good, branch faults, thermal/environment alarms, and service/maintenance inputs.
  • Glitch reject: ignore short spikes caused by cable transients, ground bounce, and connector chatter.
  • Debounce windows: require a stable low/high duration before state change is accepted.
  • Hysteresis: recovery conditions must be stricter than trigger conditions to prevent bouncing (see the sketch after this list).
  • Per-input evidence: last-change timestamp, glitch counter, and stable-state counter enable forensic clarity.
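A minimal sketch of this fan-in cleaning in Python: asymmetric windows (short trip, longer recovery) implement the hysteresis rule, and a glitch counter preserves evidence for transients that were rejected. Window values are illustrative assumptions:

```python
# Minimal sketch: glitch rejection with asymmetric debounce windows.
class DebouncedInput:
    TRIP_S = 0.050     # PG must stay low this long to count as a fault
    RECOVER_S = 0.500  # and stay high this long to count as recovered

    def __init__(self):
        self.state = "OK"
        self.pending_since = None   # when a tentative state change began
        self.glitch_count = 0       # evidence: rejected transients

    def sample(self, now, pg_high):
        target = "OK" if pg_high else "FAULT"
        window = self.RECOVER_S if target == "OK" else self.TRIP_S
        if target == self.state:
            if self.pending_since is not None:
                self.glitch_count += 1      # transient reverted: log, no action
            self.pending_since = None
        else:
            if self.pending_since is None:
                self.pending_since = now
            elif now - self.pending_since >= window:
                self.state, self.pending_since = target, None
        return self.state
```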

Decision: choose “reset vs alarm vs isolate” (avoid collateral damage)

Trigger (after debounce) | Action | Evidence to export
Critical PG sustained low | System reset for the affected zone + high-severity alarm | Input ID, duration bucket, policy mode, reset assertion time, affected outputs
Non-critical branch fault | Isolate the branch (eFuse/high-side) + warning alarm | Branch ID, fault reason, retry/latch state, current-sense snapshot flag
Transient glitches | Alarm-only (or no action) while incrementing diagnostic counters | Glitch counter, last-glitch time, input classification, optional rate-limit state
Repeated triggers (storm) | Enter safe mode: rate-limit resets, prioritize evidence and alarms | Storm counter, rate-limit active flag, lockout time remaining, last N reasons
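The table above reduces to a small policy function. A minimal sketch, where zone names, fault kinds, and the storm threshold are illustrative assumptions:

```python
# Minimal sketch: the decision table as a zoning policy function.
def decide(zone, fault_kind, storm_count):
    if storm_count > 3:                      # repeated triggers: safe mode
        return ("SAFE_MODE", "rate-limit resets, export evidence")
    if zone == "critical" and fault_kind == "pg_sustained_low":
        return ("RESET_ZONE", "assert zoned reset + critical alarm")
    if zone == "non_critical" and fault_kind == "branch_fault":
        return ("ISOLATE_BRANCH", "open eFuse + warning alarm")
    return ("ALARM_ONLY", "increment diagnostic counters")
```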

Fan-Out: reset outputs (sequencing, hold time, and recovery rules)

  • Zoned reset fan-out: reset outputs are grouped by zones so one zone can recover without rebooting the whole site.
  • Assertion width: reset hold time must be long enough for deterministic restart, but never uncontrolled.
  • Release sequencing: critical rails/loads release in a defined order, with optional delay between groups.
  • Re-arm conditions: a reset output is released only after input PG stability and lockout conditions are met.
  • Maintenance mode: local service actions should freeze policy decisions and be recorded as a first-class event.
“Don’t reboot the site” is implemented by input cleaning, zoned policies, rate limiting, and evidence-first logging. Without these, resets become an outage amplifier.

Common false-reset root causes (symptom → fix direction)

Root cause | Typical symptom | Mitigation direction
Ground bounce / shared return | Short PG dips during switching or load steps | Glitch reject + hysteresis + zoning (non-critical cannot trip critical)
Cable/transient spikes | Reset storms coincide with door open/close or connector movement | Debounce windows + counters + service mode for maintenance
PG threshold too tight | PG toggles at borderline voltage conditions | Adjust thresholds/hysteresis; avoid “single threshold rules all”
Debounce too short | Site reboots on brief, non-repeatable events | Increase debounce; log glitch statistics instead of rebooting
Inrush-induced bus droop | PG drops right after enabling a branch | Staged enable + inrush control + per-branch isolation

Guardrail: treat non-critical PG/FAULT as “isolate + alarm,” not “reset.” Reserve resets for sustained, critical conditions only.

Guardrail: add rate limiting and safe mode so repeated triggers increase evidence quality instead of increasing downtime.

Figure F7 — PG/RESET logic pipeline (Fan-In → Policy → Fan-Out + alarms)
Classified PG/FAULT inputs (critical vs non-critical) pass through debounce/filtering (glitch reject, stable window, hysteresis) into a policy engine (reset / alarm / isolate, with zoning and rate limits), then to zoned reset fan-out and alarm outputs; evidence counters feed the event log.

H2-8 · Interfaces & Management: telemetry, alarms, and out-of-band control

The panel’s interfaces must serve operations: expose evidence, enable bounded actions, and support fast triage. However, management must never be in the real-time protection loop. The panel should keep protection and switchover decisions autonomous even if the management port is down.

Telemetry · Discrete alarms · OOB channel · Local service

Interface categories (what to expose and why)

Telemetry: export feed state, ORing/hot-swap states, branch trips, voltage/current/temperature, and clock-status flags (status only).

Purpose: converts “mystery resets” into time-correlated, measurable evidence.

Discrete alarms: dry contact / opto / relay outputs for critical vs warning categories.

Purpose: raises alarms even when IP management is unavailable.

OOB management channel: Ethernet or serial as a transport for reading logs, viewing counters, and updating policies.

Rule: OOB provides visibility and configuration only, not real-time protection decisions.

Local service: LEDs/LCD, buttons, and DIP switches for on-site triage and maintenance mode.

Rule: local actions should enter maintenance mode and be recorded as events.

Operational design rules (keep protection independent)

  • Mgmt down ≠ protection down: switchover, hot-swap, eFuse protection, and reset policy must continue autonomously.
  • Evidence-first: every trip/switch/reset should have a stable record accessible via telemetry or service UI.
  • Separation by function: ports and signals should be physically grouped to reduce miswiring and cross-coupling.
  • Bounded control: configuration changes should be applied with clear modes (normal vs maintenance) and logged.
Interfaces are for observability and bounded control. The protection pipeline must remain stable even when cables are unplugged, networks are congested, or management endpoints reboot.
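One way to honor “mgmt down ≠ protection down” is to make telemetry export strictly best-effort. A minimal sketch of the pattern, with hypothetical sample and act callbacks standing in for local sensing and protection actions:

```python
# Minimal sketch: protection acts locally and never blocks on management.
# `sample` and `act` are hypothetical callbacks; the bounded queue models
# a best-effort telemetry path that may be drained by a management task.
import queue
import time

telemetry_q = queue.Queue(maxsize=256)

def protection_loop(sample, act, period_s=0.01):
    while True:
        fault = sample()                 # local sensing only, no network I/O
        if fault is not None:
            act(fault)                   # isolate/switch/reset autonomously
            try:
                telemetry_q.put_nowait(fault)   # drop if mgmt is down or slow
            except queue.Full:
                pass   # losing a telemetry record must not stall protection
        time.sleep(period_s)
```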
Figure F8 — Front-panel interface zoning (Power / Clock / Alarm / Mgmt)
Feeds, clock I/O, alarm contacts, and management/service ports are physically grouped into separate zones. Two rules annotate the layout: management down ≠ protection down, and zone separation reduces blast radius.

H2-9 · Event Logging as Evidence: what to log, how to timestamp, how to debug from logs

A timing-and-power panel becomes operationally valuable when it can prove what happened: which input degraded first, which policy decision fired, which outputs were affected, and whether the local timebase was healthy at that moment. “Evidence-first” logs turn nuisance resets and frequent switchovers into diagnosable, repeatable cases instead of recurring mysteries.

Field-level schema · Time quality flags · Retention priorities · Replayable scripts

Event model (field-level schema)

Use a normalized schema so every alarm, switchover, isolation, and reset can be correlated on the same timeline. The model below is designed for filtering, trending, and forensic replay without relying on verbose text logs.

Field group | Recommended fields (examples)
Core identity | event_id, domain (clock/power/action/mgmt), severity (info/warn/critical), source (A/B/branch_id/zone_id), start_time, end_time, duration
Clock snapshot | clock_state (LOS/LOL/quality), selected_ref (A/B), holdover_flag, phase_hit_flag, quality_flag (good/degraded)
Power snapshot | power_state (OC/SC/OT/UV/inrush), feed_path (A/B/ORed), branch_state (on/off/tripped), retry_state (retry/latch), trip_reason
Policy / action | action_type (switch/reset/isolate), policy_mode (normal/safe/maintenance), lockout_active, affected_outputs, recovery_condition
Counters / stats | glitch_count, storm_count, switch_count, reset_count, brownout_count, branch_trip_count, retry_count
Prefer structured fields over free text · Always capture “policy mode” and “lockout” · Store last-N critical events with priority · Export counters even when no action is taken
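The schema above maps naturally onto a structured record. A minimal sketch, with field names and enumerations as illustrative assumptions:

```python
# Minimal sketch: the normalized event model as a dataclass, so clock,
# power, and policy events land on one comparable timeline.
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class PanelEvent:
    event_id: int                # monotonic; disambiguates log rollover
    domain: str                  # "clock" | "power" | "action" | "mgmt"
    severity: str                # "info" | "warn" | "critical"
    source: str                  # "A", "B", "branch_3", "zone_1", ...
    time_mono: float             # ordering and intervals
    time_utc: Optional[str]      # human correlation; may be absent
    time_quality: str            # "good" | "degraded" | "holdover"
    snapshot: dict = field(default_factory=dict)  # state at trigger time
    counters: dict = field(default_factory=dict)  # glitch/switch/reset...
```

Keeping snapshot and counters as nested maps lets new evidence fields appear later without breaking older exports.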

Timestamping: record time and time-quality (not just a number)

  • Dual time representations: keep event_time_mono (ordering/interval) and event_time_utc (if available) for human correlation.
  • Time-quality flag: include time_quality (good/degraded/holdover) so forensics can trust or discount wall-clock timestamps.
  • Event-class resolution: switching/reset events need finer timestamp resolution than slow thermal or trend events.
  • Snapshot-on-trigger: capture a compact state snapshot at event start, not minutes later, to preserve cause-first evidence.
A forensic record is not just “when something happened,” but also “how trustworthy the local timebase was” and “which policy rule produced the action.”

Forensic replay scripts (3 examples)

Script A — frequent A/B switchovers without hard LOS

1) Filter action_type=switch and check switch_count growth rate.
2) Compare clock_state (LOS/LOL) vs quality_flag (degraded).
3) Correlate with power_state=UV or inrush within the same window.
4) If glitch_count rises with no LOS, suspect threshold/hysteresis/termination issues.
5) Mitigation direction: add quality hysteresis, extend debounce, prefer holdover+alarm before switching.

Script B — intermittent device unlock that “looks like timing”

1) Search for repeated branch_state=tripped on a single branch_id.
2) Verify whether clock_state degradation happens after the branch trip (cause order).
3) Check retry_count and whether trips are latched vs auto-retry.
4) If trips precede unlocks, isolate the branch (do not reset the site) and retest with a known load.
5) Mitigation direction: adjust inrush/limit/retry policy; inspect connectors and thermal headroom.

Script C — nuisance reset storm (site reboot loop)

1) Filter action_type=reset and check storm_count/lockout_active behavior.
2) Identify the dominant source (which PG/FAULT input starts each cycle).
3) Inspect duration: very short low pulses suggest glitches, not real outages.
4) Enter safe mode (alarm-only + evidence) to stop reboot amplification.
5) Mitigation direction: increase glitch reject/debounce, add recovery hysteresis, re-classify critical vs non-critical.
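Each script reduces to a simple filter over exported events. A minimal sketch of Script A, assuming records shaped like the schema above (the field names are assumptions about the export format):

```python
# Minimal sketch: Script A ("frequent A/B switchovers without hard LOS")
# as a filter over an exported event list.
def replay_script_a(events, window_s=3600):
    switches = [e for e in events if e.get("action_type") == "switch"]
    per_hour = len(switches) / (window_s / 3600.0)
    hard = sum(1 for e in switches if e.get("clock_state") in ("LOS", "LOL"))
    soft = sum(1 for e in switches if e.get("quality_flag") == "degraded")
    uv = sum(1 for e in events if e.get("power_state") in ("UV", "inrush"))
    direction = ("suspect thresholds/hysteresis/termination"
                 if soft > hard and uv == 0
                 else "correlate with power droop before touching thresholds")
    return {"switches_per_hour": per_hour, "hard": hard,
            "soft": soft, "uv_events": uv, "direction": direction}
```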

Storage & export: keep critical evidence when conditions are worst

  • Priority retention: keep the last-N critical events and the last state snapshot even when ring buffers wrap (a minimal sketch follows this list).
  • Counter continuity: counters should survive reboots whenever feasible; at minimum, export them immediately when storms start.
  • Export channels: provide a local service readout and an out-of-band path for offsite retrieval (transport only).
  • Tamper-evident cues: log configuration changes and maintenance-mode transitions as first-class events.
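Priority retention itself is small to implement. A minimal sketch: a wrapping ring for routine events plus a pinned last-N store for critical ones, with sizes as illustrative assumptions:

```python
# Minimal sketch: routine events may be overwritten on wrap, but the
# last-N critical records are pinned in a separate store.
from collections import deque

ring = deque(maxlen=1024)     # routine events; oldest overwritten on wrap
critical = deque(maxlen=32)   # last-N critical events survive the wrap

def log_event(ev):
    ring.append(ev)
    if ev.get("severity") == "critical":
        critical.append(ev)   # pinned copy for post-incident forensics
```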
Figure F9 — Evidence pipeline (Inputs → Event builder → Timestamp → Retention → Export)
Clock status, power faults, policy actions, and service events feed an event builder that normalizes fields, applies severity, and attaches counters; timestamps come from the primary timebase or a monotonic clock with a time-quality flag; events land in a ring buffer with last-N critical retention and export via the local service readout and the OOB channel.

H2-10 · Failure Modes & Field Troubleshooting: symptoms → isolation steps → fixes

Troubleshooting should start with evidence, not guesses. Each symptom card below maps a field symptom to the quickest panel-side checks (LEDs, counters, and structured log fields), then to isolation steps that bound blast radius before applying fixes to thresholds, debounce, lockout, and protection parameters.

Symptom-first · Evidence checks · Isolation steps · Fix directions

Symptom cards (fast triage templates)

Symptom: frequent A/B switchover (flapping)

Quick check: switch_count, quality_flag, glitch_count, lockout_active
Isolation: enable holdover+alarm mode; lock out revertive switching; verify whether flapping stops
Likely causes: borderline quality threshold; poor termination/cable; power droop contaminating ref
Fix direction: add hysteresis + longer debounce; enforce lockout; refine critical/non-critical classification

Symptom: phase hit during switchover

Quick check: phase_hit_flag, switch event duration, time-quality at the moment
Isolation: force a single reference; reproduce under controlled switch; compare hit rate by source A vs B
Likely causes: switch timing misaligned; quality gating too permissive; unstable input edges
Fix direction: tighten quality gating; adjust switch sequence; require stability window before switch

Symptom: devices intermittently lose lock (but no site reset)

Quick check: clock_state quality trends vs branch_trip_count and power_state
Isolation: isolate the suspect branch; test with a known load; check if unlocks disappear
Likely causes: branch inrush/retry causing local droop; cable coupling; marginal clock distribution
Fix direction: tune inrush/limit/retry; improve cable/termination; separate zones for sensitive loads

Symptom: a branch repeatedly trips (eFuse/hot-swap)

Quick check: branch_state, trip_reason, retry_count, temperature warnings
Isolation: disconnect the branch; use a dummy load; verify trip persists (panel-side) vs load-side
Likely causes: short/overcurrent; thermal headroom; connector resistance; too aggressive retry
Fix direction: tune current limit and retry/latch policy; improve thermal path; check connectors and wiring

Symptom: site reset storm (reboot loop)

Quick check: reset_count, storm_count, dominant source, pulse duration
Isolation: enter safe mode (alarm-only + evidence); apply zoned reset; disable auto-reset on non-critical inputs
Likely causes: debounce too short; threshold too tight; cable glitches; inrush droop on enable
Fix direction: increase glitch reject/debounce; add recovery hysteresis; rate-limit resets; re-zone outputs

Symptom: alarms show “healthy,” but issues continue

Quick check: mismatch between severity and counters; missing events; time-quality degraded
Isolation: verify evidence pipeline: ensure event builder snapshots and export are functioning
Likely causes: unlogged paths; insufficient severity mapping; retention overwriting critical evidence
Fix direction: normalize event taxonomy; prioritize critical retention; export counters during storms
A practical rule: if a fix cannot be justified by logs + counters, treat it as a temporary mitigation. Permanent fixes close the loop by reducing the specific event types and counters that proved the fault.
Figure F10 — Troubleshooting fault tree (from “switch flapping” to evidence and fixes)
The tree starts at frequent A/B switching and branches into clock quality degradation, cable/termination glitches, power droop (UV/inrush), and policy threshold issues; each branch lists evidence fields to check (quality_flag, glitch_count, power_state, lockout_active) and bounded fix directions (hysteresis + longer debounce, lockout, holdover-before-switch, inrush tuning, re-zoning).

H2-11 · Validation & Commissioning Checklist: what proves the panel is done

Definition of done

A Timing & Power Panel is “done” only when clock distribution, A/B switchover behavior, power protection, PG/RESET policy, and event logging can be provoked, observed, and proven using repeatable on-site scripts. Every test below is written as Action → Expected → Evidence.

Evidence must be panel-native: front LEDs, telemetry snapshots, counters, event logs, and alarm I/O. External endpoint behavior (DU/switch/server internals) is intentionally out of scope for this page.

Clock: input loss/recovery, switchover quality, and alarm latency

Test C1 — Ref-A loss detection (LOS/LOL/quality drop)
Action: disconnect Ref-A (or force it into a degraded quality state).
Expected: panel flags Ref-A unhealthy without creating a switch storm.
Evidence: Ref-A LED/indicator → telemetry shows selected_ref / quality_flag → log entry with source=A and clock_state (LOS/LOL/quality).
Test C2 — Ref-A recovery (re-acquire + lock discipline)
Action: restore Ref-A after a stable interval.
Expected: recovery is gated by policy (lock/qualify window); no immediate flip-flop.
Evidence: telemetry transitions to “qualified” → counters show no rapid oscillation → logs include qualify/lock outcome and policy mode.
Test C3 — Controlled switchover (A→B) with phase-hit containment
Action: trigger a single switchover (manual trigger or policy-triggered).
Expected: switchover completes within defined window; phase hit is either absent or bounded; alarms reflect actual impact.
Evidence: switch_count increments by 1 → log records switching_start/end + phase_hit_flag (if any) + pre/post quality snapshot.
Test C4 — Lockout / rate-limit proof (storm prevention)
Action: repeatedly flap Ref-A near threshold (disconnect/reconnect cycle).
Expected: lockout timer and hysteresis prevent rapid A↔B toggling.
Evidence: switch_count slope remains bounded → logs show lockout_active / non-revertive (or revertive) policy decisions.
Test C5 — Alarm latency (clock fault → alarm output)
Action: create a crisp event (e.g., Ref-A LOS).
Expected: alarm I/O asserts within the configured latency window; alarm clears only after re-qualification.
Evidence: alarm output changes state → log contains event_id + start_time and alarm_action timestamp ordering.
Test C6 — Holdover path sanity (if supported)
Action: remove all external references briefly (A and B unavailable).
Expected: panel enters holdover/“degraded” mode without repeated resets; outputs remain deterministic per policy.
Evidence: telemetry holdover_flag asserted → logs show source=NONE (or INTERNAL) + time_quality indicator.
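Each test card translates directly into an executable check. A minimal sketch of Test C1 against a hypothetical panel driver object, whose method names are assumptions standing in for whatever telemetry interface the site exposes:

```python
# Minimal sketch: Test C1 as an Action -> Expected -> Evidence check.
# `panel` and its methods are hypothetical, not a real driver API.
def test_c1_ref_a_loss(panel):
    before = panel.read_counters()            # baseline switch_count
    panel.disconnect_ref("A")                 # Action
    event = panel.wait_for_event(domain="clock", timeout_s=5.0)
    assert event["source"] == "A"             # Expected: A flagged unhealthy
    assert event["clock_state"] in ("LOS", "LOL")
    after = panel.read_counters()             # ...without a switch storm
    assert after["switch_count"] - before["switch_count"] <= 1
    return event                              # Evidence: keep with the bundle
```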

Power: ORing, hot-swap/inrush, branch eFuse isolation, and thermal protection

Test P1 — A/B feed ORing behavior (no reverse feed)
Action: bring up Feed-A then Feed-B; then remove Feed-A.
Expected: seamless source handover per ORing priority; no reverse current into the removed feed.
Evidence: bus voltage stable trend → logs indicate feed_source transition (A→B) + reverse_block status.
Test P2 — Hot-swap insertion (inrush limiting)
Action: insert/enable a defined load step (or card-insert scenario).
Expected: inrush is limited; no brownout cascade to clock path; panel records the transient.
Evidence: power monitor captures peak/average → no clock quality storm → log shows inrush/UV flags and timestamped duration.
Test P3 — Branch short-circuit (selective isolation)
Action: inject short on one branch output.
Expected: only that branch trips/isolates (latch-off or retry per policy); other branches stay up.
Evidence: branch_id trip_reason=SC/OC in logs → branch_trip_count++ → system reset_count unchanged.
Test P4 — Overload retry policy (no endless chatter)
Action: set a marginal overload on a branch that triggers retries.
Expected: retry count and cooldown are bounded; alarm severity escalates if persistence is detected.
Evidence: retry_count increments predictably → logs show retry cadence + escalation event_id.
Test P5 — Surge/EFT robustness (panel-level behavior)
Action: apply a controlled surge/EFT test setup (per site practice).
Expected: protection clamps/blocks per design; event is recorded; no false switchover storm.
Evidence: transient logged with power_state=surge/EFT and severity → switch_count does not spike.
Test P6 — Thermal limit / derating alarms
Action: elevate temperature in a controlled manner (or simulate sensor threshold).
Expected: warning then protective action per policy; action is localized where possible.
Evidence: telemetry temperature trend + threshold crossing → log shows OT start/end and any shutoff action.

PG/RESET & Logs: debounce proof, zoning policy, retention, and export integrity

Test R1 — Non-critical PG glitch rejection
Action: inject a short glitch on a non-critical PG/FAULT input.
Expected: no site-wide reset; at most a warning/alarm or branch isolation per zoning policy.
Evidence: glitch_counter++ → logs show “filtered_glitch” without reset_action.
Test R2 — Critical PG sustained fault (debounce passes)
Action: hold a critical PG low longer than the configured debounce window.
Expected: policy engine triggers the defined action (reset / inhibit outputs / alarm-only).
Evidence: logs show zone=CRITICAL + duration + action_taken; reset outputs timing matches policy.
Test R3 — Reset sequencing correctness (fan-out timing)
Action: force a reset condition and then release it.
Expected: reset outputs assert/deassert in correct order with defined hold time; no bounce.
Evidence: reset_count++ → per-output reset_state in telemetry (if available) → log includes sequencing profile ID.
Test L1 — Event model completeness (field-forensics ready)
Action: trigger a representative set: one clock fault + one branch trip + one reset policy action.
Expected: each event contains minimal forensic fields and consistent timestamps.
Evidence: every record has event_id, source(A/B), severity, start/end, state snapshot, and counters.
Test L2 — Retention & power-loss safety (no “lost evidence”)
Action: create N events; then perform controlled power interruption and restore.
Expected: logs survive; counters resume consistently; any gap is explicitly marked.
Evidence: post-restore export includes pre-fault records; log sequence numbers monotonic or gap-tagged.
Test L3 — Export integrity (service port / OOB channel)
Action: export logs and telemetry snapshot at the end of commissioning.
Expected: export succeeds without affecting protection and switchover main paths.
Evidence: export artifact contains required fields; “export_done” event logged with checksum/tamper marker if supported.

Representative part numbers (example BOM anchors)

The panel is system-defined, so exact IC choices depend on output standards (LVCMOS/LVDS/PECL), port count, and voltage domain. The part numbers below are common, field-proven anchors for this class of panel.

  • Clock cleaning / synthesis (jitter attenuation): Si5345 (Skyworks/Silicon Labs) jitter-attenuating clock multiplier
  • Sync/timing management (multi-channel timing control): Renesas 8A34002 synchronization management unit
  • Network synchronizer / timing-card-class device: Microchip ZL30733 (PTP/SyncE network synchronizer family)
  • Clock fan-out (LVCMOS distribution): TI CDCLVC1104 low-jitter 1:4 fan-out buffer
  • A/B feed ORing (ideal diode controller): Analog Devices LTC4359 ideal diode controller (external N-MOSFET)
  • 48V-class hot-swap / inrush control: TI LM5069 9–80V hot-swap controller
  • Branch eFuse (power limiting + protection): TI TPS2663 4.5–60V industrial eFuse
  • Bus/branch telemetry (power monitor): TI INA238 85V digital power monitor (I²C)
  • PG/RESET supervision (multi-rail supervisor): TI TPS386000 quad-supply supervisor with programmable delay
  • Non-volatile event log storage: Fujitsu MB85RS2MT 2 Mbit SPI FRAM
  • Secure evidence / key storage (tamper-aware design): Microchip ATECC608B secure element
  • RTC for commissioning timestamps / backup timekeeping: NXP PCF2131 RTC with integrated TCXO
Note: ordering suffixes/package codes vary by distributor. Select package, temperature grade, and interface options per the panel’s environment and assembly rules.

Figure F11 — Commissioning evidence matrix (test × observation point)

Use this matrix as a printable acceptance sheet. Each row is a test; each column is a required observation point. Mark pass/fail only when the expected evidence is captured.

Matrix columns: LED · Telemetry · Counters · Logs · Alarm I/O · Export. Rows: C1 Ref-A LOS/LOL detect; C3 A→B switchover (phase hit); C4 lockout / storm prevention; C5 alarm latency (clock); P2 hot-swap insertion (inrush); P3 branch SC/OC isolation; P4 retry bounded + escalation; R1 non-critical PG glitch reject; R3 reset sequencing correctness; L2 retention across power loss. ✓ = required evidence point for acceptance; add site-specific rows as needed.
Tip: keep this sheet with the exported log bundle. If a future outage occurs, “switch_count / trip_count / reset_count” trends plus timestamped events often narrow root cause in minutes.


H2-12 · FAQs (field decisions + evidence-first answers)

Evidence-first FAQ

These FAQs target on-site decisions for an Edge Timing & Power Panel: redundant switchover, power protection, PG/RESET policy, and event logs as evidence. Each answer gives quick isolation steps and the minimum evidence points (LED / telemetry / counters / logs) to confirm the root cause.

Example BOM anchors (for procurement language): Si5345 (jitter cleaner), CDCLVC1104 (fan-out), LM5069 (hot-swap), TPS2663 (eFuse), LTC4359 (ideal-diode ORing), INA238 (power monitor), TPS386000 (supervisor), MB85RS2MT (FRAM), ATECC608B (secure element).
Q: Redundant switchover “happened”, but endpoints still lose lock. Why?
A switch event only proves the selector moved; it does not prove the new output quality is acceptable. Confirm selected_ref and check for phase-hit flags, per-output LOS/LOL indicators, and post-switch quality snapshots. If phase hits are present, tighten qualify/lockout logic and prefer a cleaner path (e.g., Si5345) instead of pure muxing. Also verify output termination/level before blaming endpoints.
Relevant sections: H2-5, H2-10
Q: Revertive vs non-revertive: when does it cause oscillating A↔B switching?
Oscillation is usually caused by quality flapping near thresholds combined with revertive behavior and insufficient lockout. Require a qualify window before accepting a recovered source, enforce minimum dwell time, and rate-limit switches. If the primary source is noisy or intermittent, use non-revertive plus alarms rather than automatic back-switching. Validate using switch_count slope and lockout_active logs.
Relevant sections: H2-5
Q: How do you define an acceptable “phase hit” and accept it on site?
Phase hit is the step disturbance introduced during switchover that can trigger downstream loss-of-lock. Acceptance is procedure-based: run controlled A→B switches, capture phase-hit flags/counters, compare pre/post switch quality snapshots, and verify alarm latency/clearing behavior. If hits appear, adjust switching mode (alignment strategy, dwell/lockout) and keep the evidence bundle (logs + counters) as commissioning proof.
Relevant sections: H2-3, H2-11
Q: Why is a simple clock mux not enough? Why add quality scoring and lock policy?
A mux can switch to a source that is present but poor (wander, noisy, unstable), producing “switched but worse” outcomes. A robust panel must score source health (LOS/LOL, stability/qualification, drift warnings) and only switch when the candidate is qualified. Otherwise, stay on the current source or enter controlled holdover/degraded mode. Evidence should include quality_state transitions and the reason_code for every decision.
Relevant sections: H2-5
Q: For eFuse nuisance trips, should you tune the current limit first or check inrush/cabling first?
Start with evidence: read trip_reason (OC/SC/OT) and correlate with inrush peaks and retry_count. Many “false trips” are actually inrush or cable/connector resistance causing droop and repeated retries. Measure inrush shape, then tune the turn-on profile (upstream hot-swap like LM5069) or per-branch eFuse behavior (e.g., TPS2663) before raising limits. Confirm fixes by repeating P2/P3 commissioning rows and checking trip_count stability.
Relevant sections: H2-6, H2-10
Q: ORing drop causes sporadic reboots: how do you prove it's supply-side vs load-side?
Correlate bus voltage and UV events with reset_count and switchover logs. Force single-feed operation and repeat the load step: if UV/reset disappears, the ORing path is suspect (controller, MOSFET Rds_on, wiring). Use an ideal-diode controller like LTC4359 with appropriate FETs, and measure at the correct sense point (near the load interface). If UV remains under single-feed, isolate downstream branch behavior using per-branch telemetry and trip evidence.
Relevant sections: H2-6, H2-10
Q: How do you prevent PG/RESET fan-in glitches from rebooting the whole site?
Use three layers: debounce, zoning (critical vs non-critical), and a policy engine. Short glitches should be filtered (glitch_counter++ only) while sustained faults in critical zones can trigger reset or inhibit outputs. A multi-rail supervisor (e.g., TPS386000) plus a clear zone map prevents one noisy input from becoming a site-wide reset cause. Validate by injecting short/long pulses and confirming “filtered_glitch” vs “action_taken” logs.
Relevant sections: H2-7
Q: Should reset be “gated action” or “alarm-only”? What's the engineering boundary?
Reset should be reserved for system-safety states: sustained critical-rail loss, repeated brownouts, or confirmed conditions that make continued operation unsafe. For non-critical faults or ambiguous transients, prefer alarm-only or branch isolation to avoid unnecessary downtime. Implement a severity-duration matrix, and require every reset to carry a reason_code in logs so postmortems can distinguish policy vs noise. Prove the boundary by replaying fault scripts and checking reset_count stability.
Relevant sections: H2-7, H2-9
Q: What logging granularity is “enough” for real fault reconstruction?
Minimum forensic set: event_id, source(A/B), severity, start/end time, selected_ref, clock_state (LOS/LOL/quality), power_state (UV/OC/OT), action_taken (switch/isolate/reset), and counters (switch_count, trip_count, reset_count). Add time_quality and monotonic sequence numbers to detect gaps. Store locally in non-volatile memory (e.g., MB85RS2MT FRAM) and export bundles after commissioning so future outages can be traced without guesswork.
Relevant sections: H2-9
Q: If the management port is down, will switchover and protection still work?
Correct designs keep protection and switchover local and autonomous; management is an observing/command channel, not the control loop. A management failure should only reduce visibility, not break hot-swap, eFuse isolation, or A/B switching. Validate by unplugging management, then repeating a key switch event and a branch trip: switch_count and trip evidence must continue, and “mgmt_down” should be a separate log event. Isolation can be reinforced with watchdog and strict privilege modes.
Relevant sections: H2-8
Q: How do you handle RU/DU/switch interface differences without crossing into endpoint design?
Treat outputs as port profiles: define signal type (1PPS/10MHz/Sync reference), electrical level, impedance/termination, isolation needs, and per-port monitoring. Avoid endpoint-specific tuning; instead, ensure the panel exposes profile_id and per-port status (LOS/LOL) plus clear labeling to prevent mispatching. Use generic distribution primitives (fan-out buffers like CDCLVC1104, optional cleaning stage like Si5345) and validate with per-port evidence points rather than endpoint internals.
Relevant sections: H2-2, H2-8
Q: How do you design a minimal-downtime maintenance flow (swap source/branch/panel)?
Use a repeatable flow: enter maintenance mode (freeze revertive behavior, extend lockout), snapshot telemetry/logs, switch to the known-good feed, then replace the target module (input, branch, or panel). After replacement, rerun a small acceptance subset: controlled switchover, hot-swap insertion, branch trip isolation, reset sequencing, and export integrity. Exit maintenance only when evidence is captured (logs/counters) and alarms are clean. This turns “maintenance” into a provable procedure rather than hope.
Relevant sections: H2-2, H2-11