
Timing & Power Panel at Edge (Redundant Clock + Hot-Swap)


An Edge Timing & Power Panel keeps edge sites stable by distributing clocks with controlled A/B switchover, protecting 48V feeds with ORing/hot-swap/eFuses, and aggregating PG/RESET so one glitch can’t reboot the whole rack—while logging every event as evidence for fast field debugging.

In practice, “good” means measurable jitter/phase-hit behavior, bounded failover decisions, selective power isolation, and forensics-ready logs that let operators prove what happened and fix it without guesswork.

H2-1 · What is an Edge Timing & Power Panel (and what it is NOT)

An Edge Timing & Power Panel is a site-level distribution and protection layer that fans out reference timing and DC power to multiple edge devices while providing redundant switchover, fault isolation, reset/alarm aggregation, and event evidence logging. It is designed to keep an edge rack stable through source failures, maintenance, and transient faults, without turning every glitch into a site reboot.

Continuity: A/B clock + A/B power · Isolation: hot-swap + branch eFuse · Control: PG/RESET fan-in policy · Proof: timestamped event logs

What it typically contains (scan-first checklist)

  • Inputs: A/B reference clocks (e.g., 1PPS / 10MHz / Sync reference), A/B DC feeds (e.g., 48V/12V), discrete fault/PG inputs, management/OOB link.
  • Outputs: multi-drop clock fan-out, protected DC branches to loads, reset/alarm outputs, telemetry export.
  • Protections: ORing and hot-swap for feed redundancy, inrush limiting, branch eFuse / high-side isolation, UV/OV/OT safeguards.
  • Alarms: clock loss/quality alarms, power fault alarms, reset asserted indicators, maintenance/service state indicators.
  • Logs: switchover events, protection trips, counters, and “what-action-was-taken” records with reliable timestamps.
  • Management: read-only observability is mandatory; remote control (enable/disable branches, force source select) is optional and must not break the protection path.
Engineering focus: this page treats the panel as a reliability boundary—it must (1) survive a bad input, (2) contain a bad branch, and (3) leave a clear forensic trail that explains why a switchover or shutdown happened.

Boundary: Panel vs Grandmaster/Time Hub vs Boundary Clock Switch

Component | Owns (primary responsibility) | Does NOT own (avoid confusion)
Timing & Power Panel | Physical distribution, redundant switchover policy, branch protection, PG/RESET fan-in/out, alarms, evidence logging | Protocol servo logic, network forwarding behavior, deep timing-source discipline algorithms
Grandmaster / Time Hub | Timebase generation/discipline and quality control of the timing source | Site power distribution and branch isolation; rack-level reset policy and maintenance containment
Boundary Clock Switch | Time forwarding behavior inside a switching system (timestamps, shaping, alarm integration) | Being the site reference source; being the power protection and reset aggregation authority
Figure F1 — Edge Timing & Power Panel overview (distribution + protection + evidence)
Block diagram: A/B clock inputs (1PPS / 10MHz / Ref) and A/B power feeds (48V/12V DC) feed the clock mux/cleaner, ORing + hot-swap stage, per-branch eFuses, PG/RESET aggregator, and event logger, then fan out to edge loads (O-RU, DU, switch).

H2-2 · System Use-Cases & Topologies at the Edge (where this panel sits)

The panel sits between site sources (timing references and DC feeds) and edge loads (O-RU/DU, aggregation switches, security/observability nodes, and micro edge racks). The goal is not just to fan out signals, but to ensure failures stay localized and maintenance actions are non-disruptive.

Topology A — Dual timing sources (GM + GPSDO) feeding a single panel

  • Why it exists: timing source quality can degrade without going fully “down”; redundancy prevents service-impacting re-lock storms.
  • What the panel must do: detect quality/LOS, apply lockout to stop ping-pong switching, and record each decision with timestamps.
  • What to observe: switchover counters, source quality flags, time-in-state, and “reason codes” for each switch event.
  • Commissioning action: simulate source loss and recovery; verify controlled switchover behavior and the expected event trail.

Topology B — Single timing source with dual distribution paths (A/B path)

  • Why it exists: connectors, cabling, and terminations fail more often than the reference itself; dual paths reduce site-level single points.
  • What the panel must do: isolate a bad path, alarm cleanly, and keep remaining outputs stable without triggering unnecessary resets.
  • What to observe: per-path LOS/quality flags, phase-hit indicators (if available), and output health per group.
  • Commissioning action: break path A at the panel input; verify alarms and continued service on path B with no cascading actions.

Topology C — Micro edge cabinet (timing + power in one panel with OOB observability)

  • Why it exists: compact deployments suffer from brownouts, inrush events, and thermal constraints; these cause nuisance resets and “mystery outages.”
  • What the panel must do: hot-swap feeds, contain branch faults via eFuse, aggregate PG/RESET with debounce and policy zones, and preserve event evidence.
  • What to observe: branch trip counters, PG/RESET assertions with root-cause tags, supply droop snapshots (if supported), and maintenance-mode markers.
  • Commissioning action: load-step and inrush tests; verify no full-cabinet reset on a non-critical branch fault.
A useful mental model: the panel is the site stability governor. It converts unpredictable field events (cable faults, droops, short circuits) into bounded actions (isolate a branch, switch a source, raise an alarm) with clear evidence that explains the outcome.
Figure F2 — Three common edge topologies (sources → panel → loads)
(A) Dual timing sources: grandmaster + GPSDO feed one panel serving O-RU, DU, and an edge switch. (B) Dual paths: a single source with A/B distribution paths to separate load groups. (C) Micro cabinet: combined timing and power (hot-swap, eFuse, PG/RESET) for cabinet loads, with OOB observability.

H2-3 · Requirements & Budgets: what “good distribution” means (before you design)

“Good distribution” at the edge is defined by bounded impact: a source fault, a branch short, or a maintenance action should trigger a controlled switchover or isolation—while leaving a clear evidence trail. This requires budgets expressed in the panel’s language: added jitter, wander, phase hit on switchover, alarm latency, and power droop / surge margins.

Clock: jitter · wander · phase hit | Power: inrush · ORing drop · turn-on | Reset: debounce · sequencing | Logs: resolution · retention
Budgeting is a distribution contract: each segment (source → panel → load) consumes part of the margin. Without an explicit split, later “tuning” becomes guesswork and field failures become non-repeatable.

Budget checklist (allocate margins and define verification evidence)

Budget item | Where margin is consumed (source → panel → load) | How to verify (evidence)
Clock additive jitter | Fan-out buffers, muxing, cleaner PLL (if enabled), output group loading and cabling | Compare input vs output stability indicators; record lock state and output health per group
Wander / long-term drift | Power noise coupling, temperature gradients, reference quality variations, holdover pass-through path | Trend alarms/counters over time; correlate drift flags with power/thermal telemetry
Phase hit on switchover | Switchover policy, break/make behavior, PLL re-lock behavior, output distribution group timing | Switchover event log (reason + time); phase-step flag (if available); time-in-state counters
Alarm latency | Detection thresholds, debounce windows, policy gating, discrete alarm fan-out or mgmt export | Inject LOS/LOL and verify alarm timestamps and export delay; confirm no alarm “storming”
ORing drop / power margin | ORing elements, hot-swap path, connector/wiring resistance, branch current peaks | Load-step test; minimum-bus snapshot or UV flag; correlate droop with branch current
Inrush & hot-swap profile | Hot-swap ramp, branch capacitance, parallel branch enable timing, retry behavior | Cold-start and hot-plug trials; inrush-limited behavior; retry counters and trip reasons
PG debounce / reset policy | PG/FAULT wiring, thresholding, debounce filters, zone policies (critical vs non-critical) | Pulse injection and brownout simulation; verify no nuisance resets; reset reason codes
Log resolution & retention | Timestamp source, buffering, storage retention, export path availability during faults | Confirm minimum time granularity; verify logs survive power events; validate export integrity

Common pitfall: treating switchover as “binary up/down.” Many edge outages come from quality degradation and oscillating decisions.

Common pitfall: verifying steady-state power only. Most site resets happen on turn-on, inrush, and fault response transients.
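Before any parts are chosen, the contract can be sanity-checked with plain arithmetic. A minimal sketch in Python, where the budget and per-segment allocations are placeholder numbers, not recommendations:

```python
# Minimal sketch: a budget split as explicit arithmetic. All numbers are
# illustrative placeholders; the point is that source + panel + load
# consumption must leave positive margin against what the load tolerates.
budget_ps = 500  # total additive jitter the most sensitive load tolerates (example)
consumed_ps = {
    "source": 150,      # reference quality variation
    "panel": 120,       # fan-out buffers, mux, cleaner PLL
    "load_side": 130,   # cabling, termination, loading
}
margin = budget_ps - sum(consumed_ps.values())
assert margin > 0, "no headroom left: re-split the budget before field tuning"
print(f"remaining margin: {margin} ps ({margin / budget_ps:.0%} of budget)")
```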

Figure F3 — Budget allocation view (source → panel → load)
Four budget rows (clock stability, power margin, reset response, logging/observability), each split across source, panel, and load segments with an observation point (OBS) near the panel. Each segment needs an observable indicator: alarm flags, counters, timestamps, and exportable logs; explicit splits prevent “tuning by guess” and make field incidents reproducible.

H2-4 · Clock Distribution Path: inputs, fan-out, isolation, and “cleaning vs passing through”

A timing panel succeeds or fails on the physical distribution path. Most real-world instability is introduced by termination mistakes, fan-out loading, ground coupling, and switchover transients, not by the label on the timing source. A robust design treats the path as two selectable lanes: pass-through (minimum processing) and clean (jitter-cleaning), with clear monitoring points.

Pass-through lane: lowest latency and simplest behavior; it propagates source quality (good or bad) to the outputs.

Clean lane: improves certain stability metrics but introduces lock state, holdover behavior, and potential phase hits on re-lock or switching.

Clock path breakdown (what can go wrong → how to contain it → what proves it)

  • Input conditioning & protection: control reflections (proper termination), limit ESD/over-voltage coupling, and avoid protection capacitance that distorts edges. Evidence: input LOS/quality flags and stable lock indicators.
  • Fan-out & isolation: group outputs by load class; isolate grounds to prevent coupling; keep each group observable. Evidence: per-group output health and fault counters.
  • Cleaning & muxing: define switchover rules and lockout to prevent oscillation; decide when to use pass-through vs clean. Evidence: switchover reason codes, time-in-state, and lock/holdover flags.
  • Output monitoring: detect LOS/LOL/phase-step and correlate to the exact output group. Evidence: timestamped alarms linked to a specific output path.
Practical rule: a panel should never “silently degrade.” If an output group becomes marginal due to cabling or loading, monitoring must surface it before the site enters repeated re-lock cycles.
Figure F4 — Clock distribution path (pass-through vs clean) with monitoring points
Inputs pass through protection (ESD/OV/termination) and a lane selector; the pass-through lane is a fan-out buffer plus mux, the clean lane a cleaner PLL with lock/holdover states. Per-group output monitors flag LOS, LOL, and phase steps so degradation is never silent.

H2-5 · Redundant Switchover: architectures, detection logic, and phase-hit containment

Redundant switchover is not “A/B exists.” It is a controlled mechanism that converts input degradation into bounded actions (switch, lockout, isolate) while leaving a verifiable evidence trail. A robust panel separates switchover into three engineering layers: Detect, Decide, and Act.

Detect: LOS · LOL · drift | Decide: priority · lockout | Act: switch style · containment | Proof: event fields

Layer 1 — Detect (hard faults + soft degradation, with debounce)

  • Hard faults: LOS (loss-of-signal), LOL (loss-of-lock), reference missing/out-of-range.
  • Soft degradation: phase drift rate beyond threshold, quality metric below threshold, intermittent instability flags.
  • Debounce & hysteresis: time-based confirmation prevents transient spikes from triggering site-wide switching.

Layer 2 — Decide (anti-ping-pong policy)

  • Priority: define preferred reference (A-first or B-first) and whether manual override is allowed.
  • Lockout timer: after switching, hold on the new source for a minimum time to avoid oscillation.
  • Revertive vs non-revertive: revertive returns to preferred source after recovery; non-revertive stays until the active source degrades.
  • Rate limit: cap the number of switches per time window; when exceeded, enter a safe alarm-only state.
  • Maintenance mode: freeze selection for service operations and mark the mode explicitly in logs.

Layer 3 — Act (switching style + phase-hit containment)

  • Switching style: break-before-make (avoid overlap) vs make-before-break (avoid gap) depending on allowed risk profile.
  • Containment by output groups: isolate impact to a defined group (critical vs non-critical) rather than site-wide disturbance.
  • Phase-step awareness: monitor and report phase-step / re-lock indicators so “hit events” are observable, not guessed.
A “good” switchover is one that does not oscillate, does not surprise critical loads, and can be explained after the fact using logs and counters.
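Put together, the three layers form a small state machine. The sketch below shows the Detect → Decide → Act loop with debounce, lockout, and rate limiting in Python; state names, thresholds, and the health interface are illustrative assumptions, not a vendor API:

```python
# Minimal sketch: Detect -> Decide -> Act with debounce, lockout, and a
# rate limit. Thresholds and state names are illustrative assumptions.
DEBOUNCE_S = 2.0            # degradation must persist this long (assumed)
LOCKOUT_S = 30.0            # minimum dwell on the new source (assumed)
MAX_SWITCHES_PER_HOUR = 4   # beyond this, stop switching and alarm

class SwitchoverPolicy:
    def __init__(self):
        self.active = "A"
        self.degraded_since = None    # when degradation was first observed
        self.lockout_until = 0.0
        self.switch_times = []        # timestamps of recent switch events

    def step(self, now, healthy):     # healthy: {"A": bool, "B": bool}
        # Detect: require the fault to persist past the debounce window.
        if healthy[self.active]:
            self.degraded_since = None
            return "NORMAL"
        if self.degraded_since is None:
            self.degraded_since = now
        if now - self.degraded_since < DEBOUNCE_S:
            return "DEBOUNCING"
        # Decide: honor lockout and the rate limit before acting.
        if now < self.lockout_until:
            return "HOLD_LOCKOUT"
        self.switch_times = [t for t in self.switch_times if now - t < 3600]
        if len(self.switch_times) >= MAX_SWITCHES_PER_HOUR:
            return "SAFE_ALARM_ONLY"      # storm: evidence instead of action
        standby = "B" if self.active == "A" else "A"
        if not healthy[standby]:
            return "HOLDOVER_ALARM"       # no qualified source: do not switch
        # Act: switch once, then open a fresh lockout window.
        self.active = standby
        self.switch_times.append(now)
        self.lockout_until = now + LOCKOUT_S
        self.degraded_since = None
        return "SWITCHED_TO_" + self.active
```

Calling step(now, {"A": ..., "B": ...}) once per scan tick yields exactly one bounded decision per tick, and every return value maps onto a loggable reason code.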

Switchover event record: minimum fields for forensic clarity

Field | Why it matters (what it proves)
Event ID / monotonic counter | Prevents ambiguity from log rollover; supports exact ordering across incidents.
Timestamp + timebase source | Enables correlation with alarms, resets, and maintenance windows.
Pre-state → post-state | Explains the transition path (e.g., NORMAL_A → SWITCHING → NORMAL_B).
Trigger reason code | Separates LOS/LOL from soft degradation (quality low, drift threshold exceeded).
Metrics snapshot | Captures the decision context (LOS/LOL flags, quality flag, drift flag) at trigger time.
Decision mode | Records priority, revertive mode, lockout remaining, and maintenance mode status.
Action taken | Documents break/make style, any output gating, and whether the clean lane was enabled.
Affected output groups | Proves impact containment scope (critical group vs non-critical group).
Outcome | Indicates success/fail/rollback and whether the safe alarm-only state was entered.
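For concreteness, here is how one switchover record carrying those minimum fields might serialize; field names and values are illustrative assumptions rather than a fixed schema:

```python
# One switchover event with the minimum forensic fields from the table.
# All names and values are illustrative assumptions, not an export format.
switch_event = {
    "event_id": 1042,                          # monotonic counter
    "time_utc": "2024-05-01T10:32:07.120Z",
    "timebase": "utc/good",                    # timestamp + timebase source
    "pre_state": "NORMAL_A",
    "post_state": "NORMAL_B",
    "reason_code": "QUALITY_LOW_A",            # soft degradation, not LOS
    "metrics": {"los_a": False, "lol_a": False, "drift_a": True},
    "decision": {"priority": "A_first", "revertive": False,
                 "lockout_remaining_s": 30, "maintenance": False},
    "action": {"style": "break_before_make", "clean_lane": True},
    "affected_groups": ["non_critical"],
    "outcome": "success",
}
```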

Anti-ping-pong checklist: enforce hysteresis, apply lockout after switch, rate-limit switches, and log every transition with reason codes.

Containment checklist: define output groups, gate only the affected group if needed, and export “affected outputs” as a first-class log field.

Figure F5 — Switchover state machine (with lockout and evidence points)
States NORMAL_A → DEGRADED_A → SWITCHING → NORMAL_B, with RECOVER_A once source A is restored and the lockout has expired; a lockout timer prevents ping-pong, and every transition is logged with a reason code.

H2-6 · Power Front-End in the Panel: ORing, hot-swap, eFuse, and inrush control

Power stability is a prerequisite for timing stability. Brownouts, inrush events, and branch faults frequently manifest as “timing issues” because downstream devices reset or enter unstable states. The panel power front-end is therefore designed to contain faults per branch, control transients, and export evidence that explains every shutdown or retry.

Redundancy: A/B ORing · Transient control: hot-swap · Isolation: branch eFuse · Proof: trip + retry logs

Power chain component map (inputs → protection → branches → telemetry)

  • Input protection: surge/ESD and polarity/backfeed containment at the feed entry.
  • Redundant ORing: selects/combines A/B feeds while preventing reverse current into a failed feed.
  • Hot-swap controller: controls turn-on ramp and limits inrush to protect the upstream bus.
  • Bus sense: observes bus droop/UV/OV and correlates events with branch actions.
  • Branch eFuse / high-side isolation: per-load current limit, shutdown, retry, and latching behavior.
  • Current sense + fault flags: per-branch observability for “fault → action → outcome.”
  • Telemetry & event logging: trip reasons, retry counters, and feed failover states exported via mgmt.

Fault → action → evidence (make every protection decision explainable)

Fault scenario | Panel action | Evidence fields to log/export
Branch short / overcurrent | Current limit → shutoff; optional retry or latch-off based on policy | Branch ID, trip reason, peak/avg current flag, retry count, latch status, timestamp
Inrush too high | Controlled ramp; inrush limiting; staged enable across branches | Turn-on profile flag, inrush-limited flag, enable sequence ID, bus droop flag
Feed failure (A or B) | ORing isolates the failed feed; continue on the surviving feed | Feed state (A/B), failover event, ORing status, bus minimum flag, time-in-state
Surge / transient | Clamp/contain without propagating to branches; protect the hot-swap path | OV/UV event, surge flag, affected branch list (if any), action taken, timestamp
Over-temperature | Derate or shut down the affected branch/front-end stage | OT flag, duration bucket, derate state, branch impact, recovery timestamp
Brownout / UV | Selective load shed for non-critical branches; preserve critical branches | UV flag, shed list, critical preserved list, reset prevention status, timestamp
A panel that only “turns off power” without exporting branch ID + reason + counters creates mystery outages. Evidence-driven protection reduces MTTR and prevents repeated site-wide resets.
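Turn-on deserves the same fault → action → evidence discipline. Below is a minimal sketch of staged branch enable with bounded retries, so inrush from parallel branches cannot stack and a shorted branch latches off instead of chattering; enable and tripped are hypothetical driver callbacks, and the delays and limits are assumptions:

```python
import time

MAX_RETRIES = 3   # bounded retries, then latch off (assumed policy)
STAGGER_S = 0.2   # delay between branch enables to de-stack inrush (assumed)

def bring_up(branches, enable, tripped):
    """Enable branches one at a time; list critical branches first."""
    log = []
    for branch in branches:
        for attempt in range(1, MAX_RETRIES + 1):
            enable(branch)
            time.sleep(STAGGER_S)          # let inrush settle before judging
            if not tripped(branch):
                log.append((branch, "on", attempt))
                break
        else:
            # Retries exhausted: latch off and record evidence, no chatter.
            log.append((branch, "latched_off", MAX_RETRIES))
    return log
```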
Figure F6 — Panel power path (A/B feeds → ORing + hot-swap → branch eFuses with sense + logging)
A/B feeds enter ORing and hot-swap stages, then a sensed bus (UV/OV + droop) fans out to per-branch eFuses with current sense and fault flags; critical and non-critical branches are separated, and each stage reports to the event logger.

H2-7 · PG/RESET Fan-In & Fan-Out: sequencing, debounce, and “don’t reboot the site”

Edge sites fail not only from real power loss, but from false reset cascades: a single branch glitch propagates into a site-wide reboot. A robust panel treats PG/FAULT handling as a controlled pipeline: Fan-In (clean + classify) → Decision (policy) → Fan-Out (timed outputs), with strict zoning so non-critical noise cannot trip critical reset paths.

Fan-In: debounce + classify · Policy: reset / alarm / isolate · Zoning: critical vs non-critical · Fan-Out: timing + recovery

Fan-In: multi-source PG/FAULT inputs (clean signals before acting)

  • Input classes: critical power-good, branch faults, thermal/environment alarms, and service/maintenance inputs.
  • Glitch reject: ignore short spikes caused by cable transients, ground bounce, and connector chatter.
  • Debounce windows: require a stable low/high duration before state change is accepted.
  • Hysteresis: recovery conditions must be stricter than trigger conditions to prevent bouncing (see the sketch after this list).
  • Per-input evidence: last-change timestamp, glitch counter, and stable-state counter enable forensic clarity.
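A minimal sketch of this fan-in cleaning in Python: asymmetric windows (short trip, longer recovery) implement the hysteresis rule, and a glitch counter preserves evidence for transients that were rejected. Window values are illustrative assumptions:

```python
# Minimal sketch: glitch rejection with asymmetric debounce windows.
class DebouncedInput:
    TRIP_S = 0.050     # PG must stay low this long to count as a fault
    RECOVER_S = 0.500  # and stay high this long to count as recovered

    def __init__(self):
        self.state = "OK"
        self.pending_since = None   # when a tentative state change began
        self.glitch_count = 0       # evidence: rejected transients

    def sample(self, now, pg_high):
        target = "OK" if pg_high else "FAULT"
        window = self.RECOVER_S if target == "OK" else self.TRIP_S
        if target == self.state:
            if self.pending_since is not None:
                self.glitch_count += 1      # transient reverted: log, no action
            self.pending_since = None
        else:
            if self.pending_since is None:
                self.pending_since = now
            elif now - self.pending_since >= window:
                self.state, self.pending_since = target, None
        return self.state
```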

Decision: choose “reset vs alarm vs isolate” (avoid collateral damage)

Trigger (after debounce) | Action | Evidence to export
Critical PG sustained low | System reset for the affected zone + high-severity alarm | Input ID, duration bucket, policy mode, reset assertion time, affected outputs
Non-critical branch fault | Isolate the branch (eFuse/high-side) + warning alarm | Branch ID, fault reason, retry/latch state, current-sense snapshot flag
Transient glitches | Alarm-only (or no action) while incrementing diagnostic counters | Glitch counter, last-glitch time, input classification, optional rate-limit state
Repeated triggers (storm) | Enter safe mode: rate-limit resets, prioritize evidence and alarms | Storm counter, rate-limit active flag, lockout time remaining, last N reasons
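The table above reduces to a small policy function. A minimal sketch, where zone names, fault kinds, and the storm threshold are illustrative assumptions:

```python
# Minimal sketch: the decision table as a zoning policy function.
def decide(zone, fault_kind, storm_count):
    if storm_count > 3:                      # repeated triggers: safe mode
        return ("SAFE_MODE", "rate-limit resets, export evidence")
    if zone == "critical" and fault_kind == "pg_sustained_low":
        return ("RESET_ZONE", "assert zoned reset + critical alarm")
    if zone == "non_critical" and fault_kind == "branch_fault":
        return ("ISOLATE_BRANCH", "open eFuse + warning alarm")
    return ("ALARM_ONLY", "increment diagnostic counters")
```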

Fan-Out: reset outputs (sequencing, hold time, and recovery rules)

  • Zoned reset fan-out: reset outputs are grouped by zones so one zone can recover without rebooting the whole site.
  • Assertion width: reset hold time must be long enough for deterministic restart, but never uncontrolled.
  • Release sequencing: critical rails/loads release in a defined order, with optional delay between groups.
  • Re-arm conditions: a reset output is released only after input PG stability and lockout conditions are met.
  • Maintenance mode: local service actions should freeze policy decisions and be recorded as a first-class event.
“Don’t reboot the site” is implemented by input cleaning, zoned policies, rate limiting, and evidence-first logging. Without these, resets become an outage amplifier.

Common false-reset root causes (symptom → fix direction)

Root cause | Typical symptom | Mitigation direction
Ground bounce / shared return | Short PG dips during switching or load steps | Glitch reject + hysteresis + zoning (non-critical cannot trip critical)
Cable/transient spikes | Reset storms coincide with door open/close or connector movement | Debounce windows + counters + service mode for maintenance
PG threshold too tight | PG toggles at borderline voltage conditions | Adjust thresholds/hysteresis; avoid “single threshold rules all”
Debounce too short | Site reboots on brief, non-repeatable events | Increase debounce; log glitch statistics instead of rebooting
Inrush-induced bus droop | PG drops right after enabling a branch | Staged enable + inrush control + per-branch isolation

Guardrail: treat non-critical PG/FAULT as “isolate + alarm,” not “reset.” Reserve resets for sustained, critical conditions only.

Guardrail: add rate limiting and safe mode so repeated triggers increase evidence quality instead of increasing downtime.

Figure F7 — PG/RESET logic pipeline (Fan-In → Policy → Fan-Out + alarms)
Classified PG/FAULT inputs (critical vs non-critical) pass through debounce/filtering (glitch reject, stable window, hysteresis) into a policy engine (reset / alarm / isolate, with zoning and rate limits), then to zoned reset fan-out and alarm outputs; evidence counters feed the event log.

H2-8 · Interfaces & Management: telemetry, alarms, and out-of-band control

The panel’s interfaces must serve operations: expose evidence, enable bounded actions, and support fast triage. However, management must never be in the real-time protection loop. The panel should keep protection and switchover decisions autonomous even if the management port is down.

Telemetry · Discrete alarms · OOB channel · Local service

Interface categories (what to expose and why)

Telemetry: export feed state, ORing/hot-swap states, branch trips, voltage/current/temperature, and clock-status flags (status only).

Purpose: converts “mystery resets” into time-correlated, measurable evidence.

Discrete alarms: dry contact / opto / relay outputs for critical vs warning categories.

Purpose: raises alarms even when IP management is unavailable.

OOB management channel: Ethernet or serial as a transport for reading logs, viewing counters, and updating policies.

Rule: OOB provides visibility and configuration only, not real-time protection decisions.

Local service: LEDs/LCD, buttons, and DIP switches for on-site triage and maintenance mode.

Rule: local actions should enter maintenance mode and be recorded as events.

Operational design rules (keep protection independent)

  • Mgmt down ≠ protection down: switchover, hot-swap, eFuse protection, and reset policy must continue autonomously.
  • Evidence-first: every trip/switch/reset should have a stable record accessible via telemetry or service UI.
  • Separation by function: ports and signals should be physically grouped to reduce miswiring and cross-coupling.
  • Bounded control: configuration changes should be applied with clear modes (normal vs maintenance) and logged.
Interfaces are for observability and bounded control. The protection pipeline must remain stable even when cables are unplugged, networks are congested, or management endpoints reboot.
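One way to honor “mgmt down ≠ protection down” is to make telemetry export strictly best-effort. A minimal sketch of the pattern, with hypothetical sample and act callbacks standing in for local sensing and protection actions:

```python
# Minimal sketch: protection acts locally and never blocks on management.
# `sample` and `act` are hypothetical callbacks; the bounded queue models
# a best-effort telemetry path that may be drained by a management task.
import queue
import time

telemetry_q = queue.Queue(maxsize=256)

def protection_loop(sample, act, period_s=0.01):
    while True:
        fault = sample()                 # local sensing only, no network I/O
        if fault is not None:
            act(fault)                   # isolate/switch/reset autonomously
            try:
                telemetry_q.put_nowait(fault)   # drop if mgmt is down or slow
            except queue.Full:
                pass   # losing a telemetry record must not stall protection
        time.sleep(period_s)
```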
Figure F8 — Front-panel interface zoning (Power / Clock / Alarm / Mgmt)
Feeds, clock I/O, alarm contacts, and management/service ports are physically grouped into separate zones. Two rules annotate the layout: management down ≠ protection down, and zone separation reduces blast radius.

H2-9 · Event Logging as Evidence: what to log, how to timestamp, how to debug from logs

A timing-and-power panel becomes operationally valuable when it can prove what happened: which input degraded first, which policy decision fired, which outputs were affected, and whether the local timebase was healthy at that moment. “Evidence-first” logs turn nuisance resets and frequent switchovers into diagnosable, repeatable cases instead of recurring mysteries.

Field-level schema · Time quality flags · Retention priorities · Replayable scripts

Event model (field-level schema)

Use a normalized schema so every alarm, switchover, isolation, and reset can be correlated on the same timeline. The model below is designed for filtering, trending, and forensic replay without relying on verbose text logs.

Field group | Recommended fields (examples)
Core identity | event_id, domain (clock/power/action/mgmt), severity (info/warn/critical), source (A/B/branch_id/zone_id), start_time, end_time, duration
Clock snapshot | clock_state (LOS/LOL/quality), selected_ref (A/B), holdover_flag, phase_hit_flag, quality_flag (good/degraded)
Power snapshot | power_state (OC/SC/OT/UV/inrush), feed_path (A/B/ORed), branch_state (on/off/tripped), retry_state (retry/latch), trip_reason
Policy / action | action_type (switch/reset/isolate), policy_mode (normal/safe/maintenance), lockout_active, affected_outputs, recovery_condition
Counters / stats | glitch_count, storm_count, switch_count, reset_count, brownout_count, branch_trip_count, retry_count
Prefer structured fields over free text · Always capture “policy mode” and “lockout” · Store last-N critical events with priority · Export counters even when no action is taken
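The schema above maps naturally onto a structured record. A minimal sketch, with field names and enumerations as illustrative assumptions:

```python
# Minimal sketch: the normalized event model as a dataclass, so clock,
# power, and policy events land on one comparable timeline.
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class PanelEvent:
    event_id: int                # monotonic; disambiguates log rollover
    domain: str                  # "clock" | "power" | "action" | "mgmt"
    severity: str                # "info" | "warn" | "critical"
    source: str                  # "A", "B", "branch_3", "zone_1", ...
    time_mono: float             # ordering and intervals
    time_utc: Optional[str]      # human correlation; may be absent
    time_quality: str            # "good" | "degraded" | "holdover"
    snapshot: dict = field(default_factory=dict)  # state at trigger time
    counters: dict = field(default_factory=dict)  # glitch/switch/reset...
```

Keeping snapshot and counters as nested maps lets new evidence fields appear later without breaking older exports.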

Timestamping: record time and time-quality (not just a number)

  • Dual time representations: keep event_time_mono (ordering/interval) and event_time_utc (if available) for human correlation.
  • Time-quality flag: include time_quality (good/degraded/holdover) so forensics can trust or discount wall-clock timestamps.
  • Event-class resolution: switching/reset events need finer timestamp resolution than slow thermal or trend events.
  • Snapshot-on-trigger: capture a compact state snapshot at event start, not minutes later, to preserve cause-first evidence.
A forensic record is not just “when something happened,” but also “how trustworthy the local timebase was” and “which policy rule produced the action.”

Forensic replay scripts (3 examples)

Script A — frequent A/B switchovers without hard LOS

1) Filter action_type=switch and check switch_count growth rate.
2) Compare clock_state (LOS/LOL) vs quality_flag (degraded).
3) Correlate with power_state=UV or inrush within the same window.
4) If glitch_count rises with no LOS, suspect threshold/hysteresis/termination issues.
5) Mitigation direction: add quality hysteresis, extend debounce, prefer holdover+alarm before switching.

Script B — intermittent device unlock that “looks like timing”

1) Search for repeated branch_state=tripped on a single branch_id.
2) Verify whether clock_state degradation happens after the branch trip (cause order).
3) Check retry_count and whether trips are latched vs auto-retry.
4) If trips precede unlocks, isolate the branch (do not reset the site) and retest with a known load.
5) Mitigation direction: adjust inrush/limit/retry policy; inspect connectors and thermal headroom.

Script C — nuisance reset storm (site reboot loop)

1) Filter action_type=reset and check storm_count/lockout_active behavior.
2) Identify the dominant source (which PG/FAULT input starts each cycle).
3) Inspect duration: very short low pulses suggest glitches, not real outages.
4) Enter safe mode (alarm-only + evidence) to stop reboot amplification.
5) Mitigation direction: increase glitch reject/debounce, add recovery hysteresis, re-classify critical vs non-critical.
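Each script reduces to a simple filter over exported events. A minimal sketch of Script A, assuming records shaped like the schema above (the field names are assumptions about the export format):

```python
# Minimal sketch: Script A ("frequent A/B switchovers without hard LOS")
# as a filter over an exported event list.
def replay_script_a(events, window_s=3600):
    switches = [e for e in events if e.get("action_type") == "switch"]
    per_hour = len(switches) / (window_s / 3600.0)
    hard = sum(1 for e in switches if e.get("clock_state") in ("LOS", "LOL"))
    soft = sum(1 for e in switches if e.get("quality_flag") == "degraded")
    uv = sum(1 for e in events if e.get("power_state") in ("UV", "inrush"))
    direction = ("suspect thresholds/hysteresis/termination"
                 if soft > hard and uv == 0
                 else "correlate with power droop before touching thresholds")
    return {"switches_per_hour": per_hour, "hard": hard,
            "soft": soft, "uv_events": uv, "direction": direction}
```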

Storage & export: keep critical evidence when conditions are worst

  • Priority retention: keep the last-N critical events and the last state snapshot even when ring buffers wrap (a minimal sketch follows this list).
  • Counter continuity: counters should survive reboots whenever feasible; at minimum, export them immediately when storms start.
  • Export channels: provide a local service readout and an out-of-band path for offsite retrieval (transport only).
  • Tamper-evident cues: log configuration changes and maintenance-mode transitions as first-class events.
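Priority retention itself is small to implement. A minimal sketch: a wrapping ring for routine events plus a pinned last-N store for critical ones, with sizes as illustrative assumptions:

```python
# Minimal sketch: routine events may be overwritten on wrap, but the
# last-N critical records are pinned in a separate store.
from collections import deque

ring = deque(maxlen=1024)     # routine events; oldest overwritten on wrap
critical = deque(maxlen=32)   # last-N critical events survive the wrap

def log_event(ev):
    ring.append(ev)
    if ev.get("severity") == "critical":
        critical.append(ev)   # pinned copy for post-incident forensics
```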
Figure F9 — Evidence pipeline (Inputs → Event builder → Timestamp → Retention → Export)
Clock status, power faults, policy actions, and service events feed an event builder that normalizes fields, applies severity, and attaches counters; timestamps come from the primary timebase or a monotonic clock with a time-quality flag; events land in a ring buffer with last-N critical retention and export via the local service readout and the OOB channel.

H2-10 · Failure Modes & Field Troubleshooting: symptoms → isolation steps → fixes

Troubleshooting should start with evidence, not guesses. Each symptom card below maps a field symptom to the quickest panel-side checks (LEDs, counters, and structured log fields), then to isolation steps that bound blast radius before applying fixes to thresholds, debounce, lockout, and protection parameters.

Symptom-first · Evidence checks · Isolation steps · Fix directions

Symptom cards (fast triage templates)

Symptom: frequent A/B switchover (flapping)

Quick check: switch_count, quality_flag, glitch_count, lockout_active
Isolation: enable holdover+alarm mode; lock out revertive switching; verify whether flapping stops
Likely causes: borderline quality threshold; poor termination/cable; power droop contaminating ref
Fix direction: add hysteresis + longer debounce; enforce lockout; refine critical/non-critical classification

Symptom: phase hit during switchover

Quick check: phase_hit_flag, switch event duration, time-quality at the moment
Isolation: force a single reference; reproduce under controlled switch; compare hit rate by source A vs B
Likely causes: switch timing misaligned; quality gating too permissive; unstable input edges
Fix direction: tighten quality gating; adjust switch sequence; require stability window before switch

Symptom: devices intermittently lose lock (but no site reset)

Quick check: clock_state quality trends vs branch_trip_count and power_state
Isolation: isolate the suspect branch; test with a known load; check if unlocks disappear
Likely causes: branch inrush/retry causing local droop; cable coupling; marginal clock distribution
Fix direction: tune inrush/limit/retry; improve cable/termination; separate zones for sensitive loads

Symptom: a branch repeatedly trips (eFuse/hot-swap)

Quick check: branch_state, trip_reason, retry_count, temperature warnings
Isolation: disconnect the branch; use a dummy load; verify trip persists (panel-side) vs load-side
Likely causes: short/overcurrent; thermal headroom; connector resistance; too aggressive retry
Fix direction: tune current limit and retry/latch policy; improve thermal path; check connectors and wiring

Symptom: site reset storm (reboot loop)

Quick check: reset_count, storm_count, dominant source, pulse duration
Isolation: enter safe mode (alarm-only + evidence); apply zoned reset; disable auto-reset on non-critical inputs
Likely causes: debounce too short; threshold too tight; cable glitches; inrush droop on enable
Fix direction: increase glitch reject/debounce; add recovery hysteresis; rate-limit resets; re-zone outputs

Symptom: alarms show “healthy,” but issues continue

Quick check: mismatch between severity and counters; missing events; time-quality degraded
Isolation: verify evidence pipeline: ensure event builder snapshots and export are functioning
Likely causes: unlogged paths; insufficient severity mapping; retention overwriting critical evidence
Fix direction: normalize event taxonomy; prioritize critical retention; export counters during storms
A practical rule: if a fix cannot be justified by logs + counters, treat it as a temporary mitigation. Permanent fixes close the loop by reducing the specific event types and counters that proved the fault.
Figure F10 — Troubleshooting fault tree (from “switch flapping” to evidence and fixes)
The tree starts at frequent A/B switching and branches into clock quality degradation, cable/termination glitches, power droop (UV/inrush), and policy threshold issues; each branch lists evidence fields to check (quality_flag, glitch_count, power_state, lockout_active) and bounded fix directions (hysteresis + longer debounce, lockout, holdover-before-switch, inrush tuning, re-zoning).

H2-11 · Validation & Commissioning Checklist: what proves the panel is done

Definition of done

A Timing & Power Panel is “done” only when clock distribution, A/B switchover behavior, power protection, PG/RESET policy, and event logging can be provoked, observed, and proven using repeatable on-site scripts. Every test below is written as Action → Expected → Evidence.

Evidence must be panel-native: front LEDs, telemetry snapshots, counters, event logs, and alarm I/O. External endpoint behavior (DU/switch/server internals) is intentionally out of scope for this page.

Clock: input loss/recovery, switchover quality, and alarm latency

Test C1 — Ref-A loss detection (LOS/LOL/quality drop)
Action: disconnect Ref-A (or force it into a degraded quality state).
Expected: panel flags Ref-A unhealthy without creating a switch storm.
Evidence: Ref-A LED/indicator → telemetry shows selected_ref / quality_flag → log entry with source=A and clock_state (LOS/LOL/quality).
Test C2 — Ref-A recovery (re-acquire + lock discipline)
Action: restore Ref-A after a stable interval.
Expected: recovery is gated by policy (lock/qualify window); no immediate flip-flop.
Evidence: telemetry transitions to “qualified” → counters show no rapid oscillation → logs include qualify/lock outcome and policy mode.
Test C3 — Controlled switchover (A→B) with phase-hit containment
Action: trigger a single switchover (manual trigger or policy-triggered).
Expected: switchover completes within defined window; phase hit is either absent or bounded; alarms reflect actual impact.
Evidence: switch_count increments by 1 → log records switching_start/end + phase_hit_flag (if any) + pre/post quality snapshot.
Test C4 — Lockout / rate-limit proof (storm prevention)
Action: repeatedly flap Ref-A near threshold (disconnect/reconnect cycle).
Expected: lockout timer and hysteresis prevent rapid A↔B toggling.
Evidence: switch_count slope remains bounded → logs show lockout_active / non-revertive (or revertive) policy decisions.
Test C5 — Alarm latency (clock fault → alarm output)
Action: create a crisp event (e.g., Ref-A LOS).
Expected: alarm I/O asserts within the configured latency window; alarm clears only after re-qualification.
Evidence: alarm output changes state → log contains event_id + start_time and alarm_action timestamp ordering.
Test C6 — Holdover path sanity (if supported)
Action: remove all external references briefly (A and B unavailable).
Expected: panel enters holdover/“degraded” mode without repeated resets; outputs remain deterministic per policy.
Evidence: telemetry holdover_flag asserted → logs show source=NONE (or INTERNAL) + time_quality indicator.
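Each test card translates directly into an executable check. A minimal sketch of Test C1 against a hypothetical panel driver object, whose method names are assumptions standing in for whatever telemetry interface the site exposes:

```python
# Minimal sketch: Test C1 as an Action -> Expected -> Evidence check.
# `panel` and its methods are hypothetical, not a real driver API.
def test_c1_ref_a_loss(panel):
    before = panel.read_counters()            # baseline switch_count
    panel.disconnect_ref("A")                 # Action
    event = panel.wait_for_event(domain="clock", timeout_s=5.0)
    assert event["source"] == "A"             # Expected: A flagged unhealthy
    assert event["clock_state"] in ("LOS", "LOL")
    after = panel.read_counters()             # ...without a switch storm
    assert after["switch_count"] - before["switch_count"] <= 1
    return event                              # Evidence: keep with the bundle
```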

Power: ORing, hot-swap/inrush, branch eFuse isolation, and thermal protection

Test P1 — A/B feed ORing behavior (no reverse feed)
Action: bring up Feed-A then Feed-B; then remove Feed-A.
Expected: seamless source handover per ORing priority; no reverse current into the removed feed.
Evidence: bus voltage stable trend → logs indicate feed_source transition (A→B) + reverse_block status.
Test P2 — Hot-swap insertion (inrush limiting)
Action: insert/enable a defined load step (or card-insert scenario).
Expected: inrush is limited; no brownout cascade to clock path; panel records the transient.
Evidence: power monitor captures peak/average → no clock quality storm → log shows inrush/UV flags and timestamped duration.
Test P3 — Branch short-circuit (selective isolation)
Action: inject short on one branch output.
Expected: only that branch trips/isolates (latch-off or retry per policy); other branches stay up.
Evidence: branch_id trip_reason=SC/OC in logs → branch_trip_count++ → system reset_count unchanged.
Test P4 — Overload retry policy (no endless chatter)
Action: set a marginal overload on a branch that triggers retries.
Expected: retry count and cooldown are bounded; alarm severity escalates if persistence is detected.
Evidence: retry_count increments predictably → logs show retry cadence + escalation event_id.
Test P5 — Surge/EFT robustness (panel-level behavior)
Action: apply a controlled surge/EFT test setup (per site practice).
Expected: protection clamps/blocks per design; event is recorded; no false switchover storm.
Evidence: transient logged with power_state=surge/EFT and severity → switch_count does not spike.
Test P6 — Thermal limit / derating alarms
Action: elevate temperature in a controlled manner (or simulate sensor threshold).
Expected: warning then protective action per policy; action is localized where possible.
Evidence: telemetry temperature trend + threshold crossing → log shows OT start/end and any shutoff action.

PG/RESET & Logs: debounce proof, zoning policy, retention, and export integrity

Test R1 — Non-critical PG glitch rejection
Action: inject a short glitch on a non-critical PG/FAULT input.
Expected: no site-wide reset; at most a warning/alarm or branch isolation per zoning policy.
Evidence: glitch_counter++ → logs show “filtered_glitch” without reset_action.
Test R2 — Critical PG sustained fault (debounce passes)
Action: hold a critical PG low longer than the configured debounce window.
Expected: policy engine triggers the defined action (reset / inhibit outputs / alarm-only).
Evidence: logs show zone=CRITICAL + duration + action_taken; reset outputs timing matches policy.
Test R3 — Reset sequencing correctness (fan-out timing)
Action: force a reset condition and then release it.
Expected: reset outputs assert/deassert in correct order with defined hold time; no bounce.
Evidence: reset_count++ → per-output reset_state in telemetry (if available) → log includes sequencing profile ID.
Test L1 — Event model completeness (field-forensics ready)
Action: trigger a representative set: one clock fault + one branch trip + one reset policy action.
Expected: each event contains minimal forensic fields and consistent timestamps.
Evidence: every record has event_id, source(A/B), severity, start/end, state snapshot, and counters.
Test L2 — Retention & power-loss safety (no “lost evidence”)
Action: create N events; then perform controlled power interruption and restore.
Expected: logs survive; counters resume consistently; any gap is explicitly marked.
Evidence: post-restore export includes pre-fault records; log sequence numbers monotonic or gap-tagged.
Test L3 — Export integrity (service port / OOB channel)
Action: export logs and telemetry snapshot at the end of commissioning.
Expected: export succeeds without affecting protection and switchover main paths.
Evidence: export artifact contains required fields; “export_done” event logged with checksum/tamper marker if supported.

Representative part numbers (example BOM anchors)

The panel is system-defined, so exact IC choices depend on output standards (LVCMOS/LVDS/PECL), port count, and voltage domain. The part numbers below are common, field-proven anchors for this class of panel.

  • Clock cleaning / synthesis (jitter attenuation): Si5345 (Skyworks/Silicon Labs) jitter-attenuating clock multiplier
  • Sync/timing management (multi-channel timing control): Renesas 8A34002 synchronization management unit
  • Network synchronizer / timing-card-class device: Microchip ZL30733 (PTP/SyncE network synchronizer family)
  • Clock fan-out (LVCMOS distribution): TI CDCLVC1104 low-jitter 1:4 fan-out buffer
  • A/B feed ORing (ideal diode controller): Analog Devices LTC4359 ideal diode controller (external N-MOSFET)
  • 48V-class hot-swap / inrush control: TI LM5069 9–80V hot-swap controller
  • Branch eFuse (power limiting + protection): TI TPS2663 4.5–60V industrial eFuse
  • Bus/branch telemetry (power monitor): TI INA238 85V digital power monitor (I²C)
  • PG/RESET supervision (multi-rail supervisor): TI TPS386000 quad-supply supervisor with programmable delay
  • Non-volatile event log storage: Fujitsu MB85RS2MT 2 Mbit SPI FRAM
  • Secure evidence / key storage (tamper-aware design): Microchip ATECC608B secure element
  • RTC for commissioning timestamps / backup timekeeping: NXP PCF2131 RTC with integrated TCXO
Note: ordering suffixes/package codes vary by distributor. Select package, temperature grade, and interface options per the panel’s environment and assembly rules.

Figure F11 — Commissioning evidence matrix (test × observation point)

Use this matrix as a printable acceptance sheet. Each row is a test; each column is a required observation point. Mark pass/fail only when the expected evidence is captured.

Matrix columns: LED · Telemetry · Counters · Logs · Alarm I/O · Export. Rows: C1 Ref-A LOS/LOL detect; C3 A→B switchover (phase hit); C4 lockout / storm prevention; C5 alarm latency (clock); P2 hot-swap insertion (inrush); P3 branch SC/OC isolation; P4 retry bounded + escalation; R1 non-critical PG glitch reject; R3 reset sequencing correctness; L2 retention across power loss. ✓ = required evidence point for acceptance; add site-specific rows as needed.
Tip: keep this sheet with the exported log bundle. If a future outage occurs, “switch_count / trip_count / reset_count” trends plus timestamped events often narrow root cause in minutes.


H2-12 · FAQs (field decisions + evidence-first answers)

Evidence-first FAQ

These FAQs target on-site decisions for an Edge Timing & Power Panel: redundant switchover, power protection, PG/RESET policy, and event logs as evidence. Each answer gives quick isolation steps and the minimum evidence points (LED / telemetry / counters / logs) to confirm the root cause.

Example BOM anchors (for procurement language): Si5345 (jitter cleaner), CDCLVC1104 (fan-out), LM5069 (hot-swap), TPS2663 (eFuse), LTC4359 (ideal-diode ORing), INA238 (power monitor), TPS386000 (supervisor), MB85RS2MT (FRAM), ATECC608B (secure element).
Q: Redundant switchover “happened”, but endpoints still lose lock. Why?
A switch event only proves the selector moved; it does not prove the new output quality is acceptable. Confirm selected_ref and check for phase-hit flags, per-output LOS/LOL indicators, and post-switch quality snapshots. If phase hits are present, tighten qualify/lockout logic and prefer a cleaner path (e.g., Si5345) instead of pure muxing. Also verify output termination/level before blaming endpoints.
Relevant sections: H2-5, H2-10
Q: Revertive vs non-revertive: when does it cause oscillating A↔B switching?
Oscillation is usually caused by quality flapping near thresholds combined with revertive behavior and insufficient lockout. Require a qualify window before accepting a recovered source, enforce minimum dwell time, and rate-limit switches. If the primary source is noisy or intermittent, use non-revertive plus alarms rather than automatic back-switching. Validate using switch_count slope and lockout_active logs.
Relevant sections: H2-5
Q: How do you define an acceptable “phase hit” and accept it on site?
Phase hit is the step disturbance introduced during switchover that can trigger downstream loss-of-lock. Acceptance is procedure-based: run controlled A→B switches, capture phase-hit flags/counters, compare pre/post switch quality snapshots, and verify alarm latency/clearing behavior. If hits appear, adjust switching mode (alignment strategy, dwell/lockout) and keep the evidence bundle (logs + counters) as commissioning proof.
Relevant sections: H2-3, H2-11
Q: Why is a simple clock mux not enough? Why add quality scoring and lock policy?
A mux can switch to a source that is present but poor (wander, noisy, unstable), producing “switched but worse” outcomes. A robust panel must score source health (LOS/LOL, stability/qualification, drift warnings) and only switch when the candidate is qualified. Otherwise, stay on the current source or enter controlled holdover/degraded mode. Evidence should include quality_state transitions and the reason_code for every decision.
Relevant sections: H2-5
Q: For eFuse nuisance trips, should you tune the current limit first or check inrush/cabling first?
Start with evidence: read trip_reason (OC/SC/OT) and correlate with inrush peaks and retry_count. Many “false trips” are actually inrush or cable/connector resistance causing droop and repeated retries. Measure inrush shape, then tune the turn-on profile (upstream hot-swap like LM5069) or per-branch eFuse behavior (e.g., TPS2663) before raising limits. Confirm fixes by repeating P2/P3 commissioning rows and checking trip_count stability.
Relevant sections: H2-6, H2-10
Q: ORing drop causes sporadic reboots: how do you prove it's supply-side vs load-side?
Correlate bus voltage and UV events with reset_count and switchover logs. Force single-feed operation and repeat the load step: if UV/reset disappears, the ORing path is suspect (controller, MOSFET Rds_on, wiring). Use an ideal-diode controller like LTC4359 with appropriate FETs, and measure at the correct sense point (near the load interface). If UV remains under single-feed, isolate downstream branch behavior using per-branch telemetry and trip evidence.
Relevant sections: H2-6, H2-10
Q: How do you prevent PG/RESET fan-in glitches from rebooting the whole site?
Use three layers: debounce, zoning (critical vs non-critical), and a policy engine. Short glitches should be filtered (glitch_counter++ only) while sustained faults in critical zones can trigger reset or inhibit outputs. A multi-rail supervisor (e.g., TPS386000) plus a clear zone map prevents one noisy input from becoming a site-wide reset cause. Validate by injecting short/long pulses and confirming “filtered_glitch” vs “action_taken” logs.
Relevant sections: H2-7
Q: Should reset be “gated action” or “alarm-only”? What's the engineering boundary?
Reset should be reserved for system-safety states: sustained critical-rail loss, repeated brownouts, or confirmed conditions that make continued operation unsafe. For non-critical faults or ambiguous transients, prefer alarm-only or branch isolation to avoid unnecessary downtime. Implement a severity-duration matrix, and require every reset to carry a reason_code in logs so postmortems can distinguish policy vs noise. Prove the boundary by replaying fault scripts and checking reset_count stability.
Relevant sections: H2-7, H2-9
Q: What logging granularity is “enough” for real fault reconstruction?
Minimum forensic set: event_id, source(A/B), severity, start/end time, selected_ref, clock_state (LOS/LOL/quality), power_state (UV/OC/OT), action_taken (switch/isolate/reset), and counters (switch_count, trip_count, reset_count). Add time_quality and monotonic sequence numbers to detect gaps. Store locally in non-volatile memory (e.g., MB85RS2MT FRAM) and export bundles after commissioning so future outages can be traced without guesswork.
Relevant sections: H2-9
Q: If the management port is down, will switchover and protection still work?
Correct designs keep protection and switchover local and autonomous; management is an observing/command channel, not the control loop. A management failure should only reduce visibility, not break hot-swap, eFuse isolation, or A/B switching. Validate by unplugging management, then repeating a key switch event and a branch trip: switch_count and trip evidence must continue, and “mgmt_down” should be a separate log event. Isolation can be reinforced with watchdog and strict privilege modes.
Relevant sections: H2-8
Q: How do you handle RU/DU/switch interface differences without crossing into endpoint design?
Treat outputs as port profiles: define signal type (1PPS/10MHz/Sync reference), electrical level, impedance/termination, isolation needs, and per-port monitoring. Avoid endpoint-specific tuning; instead, ensure the panel exposes profile_id and per-port status (LOS/LOL) plus clear labeling to prevent mispatching. Use generic distribution primitives (fan-out buffers like CDCLVC1104, optional cleaning stage like Si5345) and validate with per-port evidence points rather than endpoint internals.
Relevant sections: H2-2, H2-8
Q: How do you design a minimal-downtime maintenance flow (swap source/branch/panel)?
Use a repeatable flow: enter maintenance mode (freeze revertive behavior, extend lockout), snapshot telemetry/logs, switch to the known-good feed, then replace the target module (input, branch, or panel). After replacement, rerun a small acceptance subset: controlled switchover, hot-swap insertion, branch trip isolation, reset sequencing, and export integrity. Exit maintenance only when evidence is captured (logs/counters) and alarms are clean. This turns “maintenance” into a provable procedure rather than hope.
Relevant sections: H2-2, H2-11