Timing Switch (PTP/SyncE) for Network Time Distribution
A Timing Switch upgrades an Ethernet network from “connectivity” to a deterministic timebase by combining hardware timestamps, PTP/SyncE clock distribution, redundancy/holdover, and actionable alarms. It is validated not by throughput, but by controlled delay tails (PDV), stable lock/holdover behavior, and evidence-grade observability for operations.
What it is & boundary
A timing switch is an Ethernet switching node that combines hardware timestamps, PTP (BC/TC), SyncE frequency distribution, redundant timebases / holdover, and monitoring/alarms to turn a network from “connected” into “time-deterministic and verifiable”.
The scope is strictly the switch-side timing stack: how a timing switch measures packet time, disciplines its local clock, distributes frequency and time to ports, survives upstream reference loss, and proves health with alarms and telemetry. Topics such as GNSS receivers, GPSDO/atomic-clock internals, and general routing/switching protocols are intentionally excluded.
Focus (covered here):
- PTP time/phase: Boundary Clock vs Transparent Clock, residence-time correction, BMCA & servo behavior (system view).
- Hardware timestamping: where timestamps are taken (PHY/MAC), what makes them trustworthy, and how errors enter.
- SyncE frequency: EEC/DPLL lock, SSM/QL selection, and how frequency stability supports PTP.
- Clock distribution: DPLL output → clock tree → ports (1PPS/10 MHz/ToD as interfaces and check points).
- Reliability: redundancy, holdover entry/exit, hitless goals, plus alarms/monitoring that enable fast root-cause.
Not covered (linked elsewhere in the site):
- GNSS/GPSDO/atomic reference design (OCXO/CSAC/Rb internals, GNSS antenna/LNA chains).
- General switching/routing feature sets (L2/L3, EVPN, QoS design) unless timing-specific.
- Power front-ends (48 V hot-swap/eFuse) and PoE system design.
Practical boundary test: if a paragraph does not change how timestamps are generated/used, how SyncE locks and is selected, how holdover behaves, or how timing health is measured and alarmed, it does not belong on this page.
Deployment scenarios & roles
Timing switches appear anywhere the network must carry both data and a trustworthy timebase. The practical design question is always the same: which role (BC vs TC), which timing layers (PTP only vs PTP+SyncE), and what error/holdover targets are required.
Three deployment patterns dominate in telecom and networking equipment. Each differs primarily in packet delay variation (PDV), path asymmetry, and service expectations during reference loss. The table below maps scenario → engineering objective → recommended role/mode.
| Scenario | Primary objective | Recommended role | SyncE | Common timing risks | What to validate |
|---|---|---|---|---|---|
| Telco backbone (fronthaul/backhaul/core timing) | Stable frequency + controlled phase/time error across long chains; predictable holdover behavior. | BC for domain boundaries and policy control; TC inside controlled segments to reduce hop error. | Usually required (EEC/SSM/QL selection and protection switching). | QL loop/mis-selection; reference-switching phase hit; asymmetry across mixed media; PDV under bursts. | SSM/QL transitions; PTP lock/hitless targets; holdover entry/exit; alarm thresholds and escalation. |
| Data center (financial/distributed systems) | Low time error under congestion and microbursts; fast convergence after path changes. | TC is often preferred in dense fabrics to shrink residence-time impact; BC when administrative boundaries exist. | Optional but valuable where PDV stresses the servo (frequency stability improves robustness). | Queueing-induced PDV; traffic-pattern shifts; asymmetry from ECMP/path changes; timestamp-domain mismatch. | Offset distribution under load; PDV statistics; path-change recovery; “timestamp truth” (PHY/MAC HW TS). |
| Industrial / TSN-like (rings, cells, machine sync) | Deterministic coordination; bounded error budget across a small number of hops. | TC to reduce hop contribution; BC when segmentation or safety domains are required. | Often recommended for frequency stability; depends on profile and topology control. | Topology redundancy causing asymmetry; profile mismatch; insufficient measurement visibility; poor calibration discipline. | Error budget per hop; asymmetry calibration; failover timing behavior; alarm → action mapping. |
Role terminology is easiest to remember by responsibility: a T-GM (telecom grandmaster) provides time-of-day, a T-BC (telecom boundary clock) terminates and regenerates timing information, a T-TC (telecom transparent clock) corrects for its own residence time, and a T-TSC (telecom time slave clock) recovers time for the end application. The timing switch typically operates as T-BC, T-TC, or both (per segment), while tracking SyncE quality and redundancy state.
BC vs TC decision rule: Choose BC when policy, segmentation, or “clean handoff” is needed (domain boundary, controlled re-generation, easier fault isolation). Choose TC when the path is well controlled and hop-count is high (minimize cumulative residence-time error without creating new domains).
A deployment plan becomes robust only when it accounts for two non-idealities that dominate real networks: PDV (time error driven by queueing and contention) and asymmetry (forward/reverse delay mismatch). These are not “PTP configuration problems” — they are network physics. The role choice (BC/TC), timing layers (PTP vs PTP+SyncE), and monitoring strategy must be selected with these two factors in mind.
- PDV-heavy segments: prioritize accurate HW timestamps and TC behavior, and validate offset distribution under load (not just idle).
- Asymmetry-prone segments: enforce calibration discipline and detect drift; avoid assuming symmetry in mixed media or redundant paths.
- Strict service continuity: design explicit holdover targets and reference-switch policies; treat alarms as part of the control loop.
Implementation tip: keep this chapter strictly “timing-only” — scenario differences are explained via PDV/asymmetry/holdover expectations, not generic switch/router architecture.
Architecture blocks (hardware, software, and clock planes)
A timing switch becomes understandable (and debuggable) when it is decomposed into three coupled planes: packet forwarding, timestamp capture, and clock discipline & distribution. A fourth layer — management and observability — proves that the first three are actually healthy in the field.
Packet plane
PHY/MAC → switch fabric → egress. Timing relevance: queueing and contention create PDV that distorts delay measurements and stresses the servo.
Timestamp plane
PHY/MAC event → timestamp unit → TS FIFO → CPU/FPGA. Timing relevance: where the event is captured and how it is paired to a packet defines accuracy.
Clock plane
Reference in → EEC/DPLL → clock tree → ports (SyncE recovered clock) and interfaces (1PPS/ToD). Timing relevance: stability and holdover behavior.
The packet plane moves frames; the timestamp plane turns selected frames into time measurements; and the clock plane converts those measurements into a disciplined timebase that is then distributed back to ports. Without the management layer (profiles, KPIs, alarms, and event logs), a timing switch would be “configured” but not “proven”.
Key timing interfaces (what must be visible):
- Timestamp interfaces: per-port HW timestamp capability, timestamp FIFO depth/health, event pairing status.
- PTP health: offset trend and distribution, meanPathDelay, PDV statistics, lock state, servo mode.
- SyncE health: EEC lock status, selected input quality (SSM/QL), switching history, holdover state.
- Resilience signals: reference switch counters, holdover entry/exit, “phase hit” / step events (if tracked).
Design intent: keep packet forwarding performance and timing performance separate in validation. A device can forward at full bandwidth yet still fail timing requirements if timestamps, PDV control, or clock discipline are not engineered as first-class functions.
Hardware timestamping pipeline (where timestamps are taken and how they drive correction)
Timing accuracy is dominated by where the timestamp is captured, how it is associated to a PTP event, and what non-idealities distort the measured delay (queueing PDV, asymmetry, and multi-hop accumulation). A reliable pipeline turns “packets” into “measurements” and measurements into a disciplined clock.
Timestamp placement typically falls into three buckets: PHY, MAC, or deeper inside the switch core. The closer the capture point is to the wire, the fewer unknown delays exist between the physical event and the timestamp. However, correctness still requires deterministic event pairing (sequence identity), consistent clock domain handling, and sufficient buffering under load.
PHY vs MAC vs core timestamping (timing-centric view):
- PHY timestamp: closest to the wire; minimizes unmodeled latency. Best when PDV and path changes exist.
- MAC timestamp: often adequate, but can inherit MAC scheduling/latency variance if not truly hardware-stamped at the boundary.
- Switch-core timestamp: easiest to implement, but absorbs variable fabric/queue effects; commonly fails under congestion.
PTP event handling must be explicit: Sync/Follow_Up and Delay_Req/Delay_Resp are not “ordinary frames”. Each event requires consistent timestamp extraction, correct association to the message sequence, and a defined delay mechanism: E2E (end-to-end) estimates path delay at endpoints, while P2P (peer-to-peer) measures per-hop delay and relies more directly on per-device residence behavior. The chosen mechanism should match the network reality: PDV-heavy segments and asymmetric paths demand stricter discipline.
Common misconception: high throughput does not imply stable timing. Timing quality is governed by PDV tails (queueing under bursts) and timestamp truth (whether capture is truly hardware at PHY/MAC). Also, 1PPS is a frequency checkpoint — it does not guarantee time-of-day correctness.
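The E2E delay-request/response arithmetic behind these measurements is compact enough to sketch directly. The helper below is illustrative (the function name and the nanosecond values are mine, not a standard API), but the two formulas are the standard E2E estimates:

```python
def e2e_offset_and_delay(t1, t2, t3, t4):
    # t1: Sync TX (master), t2: Sync RX (slave),
    # t3: Delay_Req TX (slave), t4: Delay_Req RX (master) -- all in ns
    offset = ((t2 - t1) - (t4 - t3)) / 2.0            # slave clock minus master
    mean_path_delay = ((t2 - t1) + (t4 - t3)) / 2.0   # assumes a symmetric path
    return offset, mean_path_delay

# Symmetric path (10 us each way), slave ahead by 500 ns: exact recovery.
off, delay = e2e_offset_and_delay(0, 10_500, 20_000, 29_500)
print(off, delay)  # → 500.0 10000.0

# Add 2 us of one-way delay on the forward leg only: half of it (1 us)
# appears as a phantom time error that no amount of filtering removes.
off_biased, _ = e2e_offset_and_delay(0, 12_500, 20_000, 29_500)
print(off_biased)  # → 1500.0
```

The second call is why asymmetry is treated as a bias term later in this page: the symmetric-path assumption attributes exactly half the asymmetry to time error.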
Practical pipeline (from ingress event to correction):
1. Ingress event occurs (PTP message or SyncE-related state transition). Key check: which port and which event type is eligible for timestamping.
2. Hardware timestamp is captured at the selected point (PHY or MAC). Key check: capture-point consistency across ports; timestamp resolution and clock-domain alignment.
3. Event is paired to its message identity (e.g., sequenceId / port identity). Key check: correct pairing under reordering and burst traffic; avoid “timestamp without the right packet”.
4. Timestamp is queued into a TS FIFO (or per-port event buffer). Key check: FIFO depth and overflow behavior under high event rate; record drops as a first-class fault.
5. Host reads timestamps (CPU/FPGA) and produces measurements (offset and delay terms). Key check: host scheduling affects observability; it should not introduce measurement ambiguity.
6. Filters and outlier rejection handle PDV and transient anomalies. Key check: PDV tails; avoid “chasing bursts” with overly aggressive control gains.
7. Servo/DPLL updates the correction (slew/step policy) and the clock plane distributes the disciplined output. Key check: phase-hit behavior at mode changes and reference switching; controlled recovery back to lock.
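The pairing and FIFO-pressure steps of this pipeline can be sketched as a toy model. The class and field names below are illustrative, not a real timestamping driver API; the point is that drops must be counted, never hidden:

```python
from collections import deque

class TsFifo:
    """Toy per-port timestamp FIFO: pairs hardware timestamps with PTP
    message identities (sequenceId) and records overflow drops as a
    first-class fault. Illustrative only, not a real driver interface."""
    def __init__(self, depth=4):
        self.depth = depth
        self.entries = deque()
        self.drops = 0

    def capture(self, seq_id, ts_ns):
        if len(self.entries) == self.depth:
            self.entries.popleft()   # oldest timestamp is lost on overflow...
            self.drops += 1          # ...so count it and alarm on it

        self.entries.append((seq_id, ts_ns))

    def match(self, seq_id):
        """Pair a software-visible PTP message with its HW timestamp."""
        for entry in self.entries:
            if entry[0] == seq_id:
                self.entries.remove(entry)
                return entry[1]
        return None  # "packet without timestamp": a measurement hole

fifo = TsFifo(depth=4)
for seq in range(1, 6):                 # 5 events into a depth-4 FIFO
    fifo.capture(seq, 100 * seq)
print(fifo.drops, fifo.match(1), fifo.match(5))  # → 1 None 500
```

Note how the overflow produces both a counted drop and a later `None` pairing: exactly the “invisible holes in the measurement stream” that the debug list below warns about.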
Where timing error enters (what to highlight in debug):
- Queueing PDV: fabric/egress contention adds long-tail delay; distortions grow under microbursts and small packets.
- Asymmetry: forward/reverse delay mismatch breaks “symmetry assumptions” and biases delay estimation.
- Multi-hop accumulation: small residence-time errors add up across many devices; TC accuracy becomes critical.
- Buffer pressure: TS FIFO overflow or event drops create invisible holes in the measurement stream unless explicitly alarmed.
PTP control: BMCA, servo, and profiles
PTP timing quality depends on three control-plane decisions: which clock becomes the reference (BMCA), how measurements are converted into correction (servo), and what assumptions are used (profile, message rates, and delay mechanism).
A timing switch must make BMCA behavior explainable to operations. Rather than memorizing the full standard text, it is more useful to read the key dataset fields as engineering intent: priorities express policy, class expresses trust level, accuracy expresses static capability, and stability metrics indicate the “noise tendency” that affects lock behavior.
| BMCA field | Engineering meaning | Operational pitfall |
|---|---|---|
| priority1 / priority2 | Explicit policy override: which source is preferred when multiple are “good enough”. | Mis-set priorities can force a worse reference to win; frequent reconfiguration looks like “random GM flaps”. |
| clockClass | Trust and traceability tier: whether the clock should be treated as a stable root or only a fallback. | Mixing classes without a plan can create unexpected master changes during transient events. |
| accuracy | Static capability upper bound (not a real-time error measurement). | Assuming “accuracy” equals live offset leads to wrong alarm thresholds and false confidence. |
| offsetScaledLogVariance | Stability/noise tendency indicator: helps reason about how “quiet” or “noisy” the reference behaves. | Ignoring stability can yield a master that wins BMCA yet produces a hard-to-lock servo under PDV. |
After a master is selected, the servo converts measured offset and delay into clock correction. In practice, the servo is a control loop that must remain stable under measurement noise: filtering and outlier rejection protect against PDV tails, while the control law (PI / PLL-like behavior) determines how quickly the system acquires lock versus how much wander it allows in steady state.
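A minimal sketch of such a PI servo, with illustrative (untuned) gains and a deliberately simple plant model, shows how the integral term absorbs a constant frequency error while the proportional term drives the offset down:

```python
class PiServo:
    """Minimal PI servo sketch. Gains are illustrative, not tuned values;
    a real servo adds filtering and outlier rejection in front of this."""
    def __init__(self, kp=0.7, ki=0.3):
        self.kp, self.ki = kp, ki
        self.integral = 0.0

    def correction(self, offset_ns):
        self.integral += self.ki * offset_ns   # accumulates the frequency error
        return self.kp * offset_ns + self.integral

# Toy plant: 1000 ns initial offset plus 50 ns/interval of frequency drift.
servo, offset = PiServo(), 1000.0
for _ in range(50):
    offset += 50.0 - servo.correction(offset)
print(abs(offset) < 1e-3)  # → True: the integral term has absorbed the drift
```

In this toy loop the integral settles near the 50 ns/interval drift, which is the “frequency error memory” that makes a PI servo hold a small steady-state offset even under constant drift.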
Control-plane state machine is the operational contract: INIT (not ready) → ACQUIRE (converging) → LOCKED (steady) → HOLDOVER (reference lost) → FAULT (persistent abnormal). Each transition should be traceable to measurable triggers: message loss, offset thresholds, master change, or quality degradation.
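The state machine and its measurable triggers can be captured as a plain transition table. The trigger names below are illustrative; the states follow the contract above:

```python
# Legal transitions of the control-plane state machine, keyed by trigger.
TRANSITIONS = {
    ("INIT", "config_valid"): "ACQUIRE",
    ("ACQUIRE", "offset_within_threshold"): "LOCKED",
    ("LOCKED", "reference_lost"): "HOLDOVER",
    ("LOCKED", "quality_degrade"): "HOLDOVER",
    ("HOLDOVER", "reference_restored"): "ACQUIRE",   # re-acquire; never jump straight to LOCKED
    ("HOLDOVER", "drift_budget_exceeded"): "FAULT",
    ("ACQUIRE", "persistent_abnormal"): "FAULT",
}

def step(state, trigger):
    """Apply a measurable trigger; unknown triggers keep state (and should be logged)."""
    return TRANSITIONS.get((state, trigger), state)

print(step("LOCKED", "reference_lost"))        # → HOLDOVER
print(step("HOLDOVER", "reference_restored"))  # → ACQUIRE
```

Expressing transitions as data rather than scattered `if` statements is what makes every transition “traceable to measurable triggers”: the table itself is the audit artifact.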
Profiles shape expectations. A telecom-oriented profile typically favors stricter behavior and predictable recovery, while default profiles aim for broad interoperability. gPTP is commonly associated with tighter coordination requirements; the practical boundary is not “better or worse”, but differences in message rates, delay mechanism choices, and tolerance assumptions.
| Parameter | Why it matters | What to observe |
|---|---|---|
| Message rate (Sync / Announce / Delay) | Higher rates improve acquisition and tracking but increase event load and sensitivity to burst behavior. | Lock time vs CPU/FIFO pressure; offset tails under load; message loss counters. |
| Delay mechanism (E2E vs P2P) | Affects how delay is estimated and how multi-hop behavior accumulates; impacts TC/BC strategy. | meanPathDelay stability; bias under asymmetry; behavior across topology changes. |
| Filter / smoothing | Protects the servo from PDV tails; too aggressive can slow recovery, too weak can chase noise. | Offset distribution (mean, P95/P99); step/slew activity; recovery after bursts. |
| Kp / Ki (loop bandwidth) | Sets acquisition speed vs steady-state wander; wrong gains produce oscillation or sluggish lock. | Overshoot, ringing, or slow convergence; wander trend during “LOCKED”. |
| Outlier reject | Drops implausible samples caused by queue spikes, reordering, or transient asymmetry. | Rejected-sample counters; correlation with traffic bursts and queue telemetry. |
Field KPIs (timing-only): lock time, offset distribution (mean + tail percentiles), wander trend, master change count, holdover entry/exit events, and alarm threshold crossings (offset high, message loss, quality degrade).
SyncE frequency layer: EEC/DPLL and SSM/QL
SyncE provides frequency coherence across the network, while PTP provides time/phase. In practice, combining them is more robust: a stable frequency foundation reduces how hard the PTP servo must work under PDV and topology changes.
The SyncE layer is built around an EEC (Ethernet Equipment Clock) that locks to a selected input (often recovered from a port), and a DPLL that cleans jitter and manages holdover. A critical engineering tradeoff is loop bandwidth: wider bandwidth tracks input changes faster but can pass more jitter; narrower bandwidth smooths output but reacts more slowly to disruptions.
Boundary rule: SyncE does not deliver time-of-day. It stabilizes frequency so that PTP can converge with smaller corrections and better resilience. If ToD is wrong, SyncE cannot fix it; ToD correctness still depends on PTP path assumptions, timestamp truth, and control-plane health.
Quality distribution is controlled by SSM/QL signaling (commonly carried via ESMC). The engineering purpose is simple: every node advertises the quality of its timing source so downstream nodes can select the best input and avoid timing loops. A healthy network behaves like a “quality chain”: higher-quality sources win, switching is controlled, and every transition is logged.
Operational flow (timing-only): receive QL → update candidate table → apply selection policy (with loop guard) → execute protection switching (with hold-off / hysteresis) → advertise downstream QL → generate alarms and event logs.
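The selection policy with loop guard and hysteresis can be sketched as follows. The QL ranking is in the spirit of ITU-T G.781 (lower rank = better quality), but the exact mapping and the function shape are illustrative:

```python
# Illustrative QL ordering; real deployments follow the ITU-T G.781 option in use.
QL_RANK = {"QL-PRC": 1, "QL-SSU-A": 2, "QL-SSU-B": 3, "QL-EEC1": 4, "QL-DNU": 99}

def select_input(candidates, current=None, hysteresis=1):
    """candidates: {port: advertised QL}. QL-DNU ("do not use") is the loop
    guard: a port that would hear its own clock back is advertised DNU and
    never selected. Leave the current input only for a candidate that is
    better by at least `hysteresis` ranks, to avoid flapping."""
    usable = {p: QL_RANK[q] for p, q in candidates.items() if q != "QL-DNU"}
    if not usable:
        return None  # no valid reference: enter holdover
    best = min(usable, key=usable.get)
    if current in usable and usable[current] - usable[best] < hysteresis:
        return current  # not better enough: stay put
    return best

print(select_input({"p1": "QL-SSU-A", "p2": "QL-PRC", "p3": "QL-DNU"}, current="p1"))  # → p2
print(select_input({"p1": "QL-DNU", "p2": "QL-DNU"}))  # → None (holdover)
```

Note that hold-off timers and the stability window from the operational flow are deliberately omitted here; in a real implementation they gate when `select_input` is even consulted.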
Failure example: a misconfigured QL policy can cause the network to prefer a worse reference or form a timing loop. Symptoms often include frequent switching, increased wander, and PTP lock instability even when packet connectivity is normal.
Clock distribution & jitter-cleaning
A timing switch is only as good as its clock tree. After the DPLL produces a clean clock, the system still must distribute it to many PHY ports while keeping isolation, phase consistency, and verification points (1PPS/10MHz/ToD) intact.
The practical view is a three-layer clock tree: (1) source (reference mux and DPLL output), (2) distribution (fanout/buffers and isolation boundaries), and (3) loads (PHY port groups and timing outputs). Each layer can introduce or amplify jitter through power noise, temperature drift, reference switching transients, or high-speed digital crosstalk.
Noise sources
- Supply ripple on timing rails, shared return paths
- Temperature gradients and drift around timing components
- Reference switching transients (handovers, re-lock events)
- SerDes / switch-core activity coupling into nearby routes
Coupling paths
- Shared rails between DPLL/clock tree and noisy digital islands
- Fanout stages without sufficient isolation or segmentation
- Clock traces running too close to high-speed differential pairs
- Ground discontinuities that force long return loops
Victims & metrics
- Worse MTIE/TDEV trends (wander) and time-error tails
- Port-to-port phase alignment spread increases
- Lock becomes “fragile” under bursts even when connectivity is fine
- More frequent holdover transitions and larger recovery hits
Three hard system tips (timing-focused): (1) Partition timing rails and keep local decoupling loops tight around the DPLL and fanout blocks; (2) keep clock traces spaced from SerDes and maintain continuous return reference; (3) segment fanout by port groups so a noisy region does not pollute the entire clock domain.
Timing outputs are best treated as interfaces and verification points, not as a path to discuss external reference internals. 1PPS can validate phase continuity during switching; 10MHz helps validate frequency coherence and wander behavior; ToD provides a time interface that should remain consistent with the PTP lock state and alarm telemetry.
Redundancy & holdover
Real networks fail in messy ways: reference loss, quality degradation, and transient packet impairments. Redundancy and holdover define whether the timing system degrades in a controlled manner and recovers without large phase hits.
Redundancy should be described at concept level, focusing on independence and observable behavior: 1+1 reference provides primary/backup inputs, A/B timing planes reduce correlated failure exposure, and a dual DPLL concept expresses two discipline paths with independent selection and health telemetry.
Holdover is a policy-driven mode: it enters when the reference becomes unusable (lost lock, quality downgrade, persistent message loss, or offset out-of-range). During holdover, time error typically grows with duration and environmental sensitivity, so the operational goal is controlled drift with strict logging, rate limiting, and a safe re-lock procedure when reference quality returns.
Checklist for non-disruptive switching: require a stability window before switching; enforce hold-off and hysteresis to prevent flapping; limit slew rate to avoid large phase steps; keep a guard time before re-lock; and always record a reason code plus KPIs for every transition.
| Event / trigger | Immediate action | Evidence to log |
|---|---|---|
| Loss of lock | Enter holdover; freeze switching until hold-off expires; raise alarm. | Timestamp, affected input, last QL, last offset stats, lock state. |
| QL degrade | Evaluate backup; apply hysteresis; switch only after stability window. | QL transition history, candidate table snapshot, switch decision reason. |
| PTP message loss | Declare impaired reference; tighten outlier policy; escalate if persistent. | Loss counters, interval, traffic/queue correlation marker, state transitions. |
| Offset high | Limit slew; block immediate re-master; consider holdover if sustained. | Offset distribution (mean/P95/P99), applied slew limits, alarm crossings. |
| Reference switch | Apply guard time; aim for hitless or low-hit switching; monitor phase hit risk. | Switch time, old/new input, hold-off settings, observed phase step (if any). |
Hitless switching is best expressed as an objective: maintain phase continuity as much as the system allows. The practical risk is a phase hit at the moment of switching or during re-lock. Controls such as guard time, hysteresis, and slew limiting reduce the magnitude and frequency of phase disturbances while keeping recovery predictable.
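The step-vs-slew policy behind this can be sketched as a correction planner. The thresholds are illustrative policy knobs, not standardized values:

```python
def plan_correction(phase_error_ns, max_step_ns, max_slew_ns_per_s):
    """Policy sketch: errors small enough may be stepped at once; larger
    errors are slewed in bounded per-second chunks so downstream consumers
    never see a large phase hit. Thresholds are illustrative knobs."""
    if abs(phase_error_ns) <= max_step_ns:
        return [phase_error_ns]                    # single immediate step
    sign = 1 if phase_error_ns > 0 else -1
    remaining, steps = abs(phase_error_ns), []
    while remaining > 0:
        chunk = min(remaining, max_slew_ns_per_s)  # bounded adjustment per interval
        steps.append(sign * chunk)
        remaining -= chunk
    return steps

print(plan_correction(80, max_step_ns=100, max_slew_ns_per_s=1000))    # → [80]
print(plan_correction(2500, max_step_ns=100, max_slew_ns_per_s=1000))  # → [1000, 1000, 500]
```

The tradeoff is visible in the second call: slewing stretches recovery over three intervals, but each interval's phase disturbance stays within the configured bound.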
Timing-aware switching impacts
Timing accuracy is often limited by the network’s delay behavior rather than by the clock hardware itself. Queueing and congestion create packet delay variation (PDV), while asymmetry introduces bias that breaks delay estimation.
PDV is a noise term: it widens the delay distribution and makes offset measurements less stable. Asymmetry is a bias term: it shifts the estimated delay away from reality, producing a persistent time error even when the system appears locked. A timing switch needs both: (a) a measurement path that is robust to PDV tails and (b) an engineering workflow to detect and compensate asymmetry.
PDV (variation / tails)
- Sources: queueing, transient congestion, scheduling jitter, residence-time spread.
- Symptoms: offset jitter grows, lock becomes fragile, convergence time increases.
- Why bursts hurt: small packets and microbursts amplify tail behavior and outliers.
Asymmetry (directional bias)
- Sources: different uplink/downlink paths, different rates, direction-dependent processing.
- Symptoms: stable but wrong offset, step errors after path changes, “good” averages with bad accuracy.
- Key point: bias cannot be filtered away; it must be detected and compensated.
E2E vs P2P boundary
- P2P is preferred when per-hop variability dominates or path changes are frequent.
- E2E can be sufficient in stable, symmetric, low-hop environments that are validated.
- Choice should match observability: the ability to localize where delay uncertainty enters.
Practical troubleshooting pattern: describe the symptom first (jitter / unlock / slow convergence), then test the delay path (distribution tails and outliers), and finally validate symmetry (directional bias) before changing servo parameters.
The most productive way to present this chapter is a three-part “symptom → root cause → verification” flow. The goal is to make the reader’s next action obvious: capture delay distribution (not just the mean), correlate timing jitter with congestion windows, and validate symmetry under a controlled topology.
Symptom → cause → verify
- Offset jitter suddenly increases → suspect PDV tails under bursts → compare P95/P99 delay before/after load change.
- Locks but stays “consistently wrong” → suspect asymmetry bias → perform direction symmetry checks and apply compensation workflow.
- Locks, then drops during busy hours → suspect queue residence-time spread → align timing events with congestion/queue markers.
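Capturing the distribution rather than the mean is a one-liner with the standard library; the 1%-spike capture below is synthetic, to show why the tail is the right alarm signal:

```python
import statistics

def tail_stats(delays_ns):
    """Summarize a delay capture by mean and tail percentiles: queueing PDV
    shows up in P99 long before it moves the mean."""
    cuts = statistics.quantiles(delays_ns, n=100)  # 99 percentile cut points
    return {"mean": statistics.fmean(delays_ns), "p95": cuts[94], "p99": cuts[98]}

# 1% of samples hit a 100 us queue spike; the mean barely moves, P99 screams.
capture = [10_000] * 990 + [100_000] * 10
stats = tail_stats(capture)
print(round(stats["mean"]), stats["p95"], stats["p99"] > 50_000)
```

On this capture the mean rises only from 10.0 µs to about 10.9 µs while P99 jumps to nearly 100 µs, which is why acceptance gates later in this page are tail-based.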
Asymmetry compensation should be described as an engineering workflow rather than as a protocol tutorial. A concise, repeatable process is enough to avoid overfitting and to keep the system verifiable.
- Freeze topology and capture a baseline window Establish a stable path and record baseline delay and offset distributions under low and steady load.
- Stress with the target load profile Re-capture distributions under bursts or sustained load and compare tail growth (P95/P99) and outlier rate.
- Validate directional symmetry Confirm whether uplink and downlink delay behave similarly; identify persistent bias and its stability over time.
- Apply compensation and re-validate Apply the chosen bias compensation method, then re-run the baseline and stress windows to confirm improved accuracy and stability.
- Lock in observability Store the result as a reason-coded calibration record, including before/after KPI snapshots for auditability.
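The directional-symmetry step of this workflow reduces to comparing calibrated one-way captures. The delays below are synthetic and the helper name is illustrative; the division by two follows directly from the symmetric-path assumption in the E2E math:

```python
import statistics

def asymmetry_bias_ns(fwd_delays_ns, rev_delays_ns):
    """From calibrated directional captures (baseline windows against a
    trusted reference), estimate the constant bias that the symmetric-path
    assumption injects: half the forward/reverse delay difference."""
    asym = statistics.median(fwd_delays_ns) - statistics.median(rev_delays_ns)
    return asym / 2.0  # constant offset bias to compensate, then re-validate

fwd = [12_000, 12_100, 11_950, 12_050, 12_000]   # uplink one-way delays (ns)
rev = [10_000, 10_050, 9_980, 10_020, 10_000]    # downlink one-way delays (ns)
print(asymmetry_bias_ns(fwd, rev))  # → 1000.0
```

Medians rather than means keep the estimate robust against the PDV outliers that the stress window deliberately provokes; the result should be stored as a reason-coded calibration record per the last step above.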
Monitoring, alarms & observability
Timing networks are operated, not just configured. Observability must expose the right KPIs, preserve reason-coded events, and map alarms to actions so timing health can be audited and improved.
A useful monitoring model separates service KPIs (what defines timing quality) from diagnostic KPIs (what explains why quality changed) and event KPIs (what enables traceability). Interfaces such as SNMP, NETCONF/YANG, and gNMI are only transport; the real value is a consistent set of fields and reason codes.
Service KPIs
- Offset distribution (mean + tail), time error trends
- Lock state (ACQUIRE / LOCKED / HOLDOVER / FAULT)
- Holdover duration and entry count
- QL changes and reference switch count
Diagnostic KPIs
- meanPathDelay and PDV tail (P95/P99)
- Message loss / interval jitter counters
- Outlier rate and filter rejections
- Before/after snapshots around events
Event KPIs (reason-coded)
- Holdover entry reason (lost lock / QL degrade / msg loss / offset high)
- Reference switch decision reason and stability window result
- Re-lock outcome (slew limited / phase hit suspected)
- Operator actions and acknowledgements
Dashboard layout (operator-friendly): left = time error (offset + tail + thresholds), right = state (lock + ref + QL + holdover timer), bottom = event log with reason codes and KPI snapshots for every critical transition.
| Alarm type | Severity boundary | Default action | What must be logged |
|---|---|---|---|
| Offset tail ↑ | Warning if within limit; Critical if sustained beyond threshold. | Start a diagnostic window; correlate with PDV/queue markers. | Offset P95/P99, PDV tail, traffic/queue correlation marker. |
| QL degrade | Critical if quality drops below the operational minimum. | Evaluate backup input; switch after stability window + hysteresis. | Old/new QL, candidate list snapshot, switch decision reason code. |
| Loss of lock | Always Critical. | Enter holdover; freeze flapping with hold-off; notify operator. | Timestamp, last stable KPIs, reason code, holdover timer start. |
| Message loss | Warning for transient; Critical if persistent. | Tighten outlier policy; escalate and consider holdover if persistent. | Loss counters, intervals, state transitions, before/after snapshots. |
| Ref switch rate ↑ | Critical when flapping risk exists. | Apply hold-off/hysteresis; require stability window before re-switch. | Switch count, hold-off settings, per-switch KPI deltas. |
Closing the loop means every critical event should carry a reason code and a small “evidence bundle”: a short pre/post KPI snapshot window, the state transition, and the action taken. Without this, timing issues become unprovable and non-repeatable.
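A minimal shape for such an evidence bundle might look like the following; all field names are illustrative, not a standard schema:

```python
import json, time

def evidence_bundle(event, reason_code, kpis_pre, kpis_post, action):
    """Minimal reason-coded evidence bundle (illustrative field names):
    every critical transition carries pre/post KPI snapshots, a reason
    code, and the action taken."""
    return json.dumps({
        "ts": time.time(),
        "event": event,              # e.g. "HOLDOVER_ENTER"
        "reason_code": reason_code,  # e.g. "LOST_LOCK" / "QL_DEGRADE"
        "kpis_pre": kpis_pre,        # short pre-event snapshot window
        "kpis_post": kpis_post,      # short post-event snapshot window
        "action": action,            # e.g. "freeze switching; hold-off 10 s"
    }, sort_keys=True)

record = evidence_bundle(
    "HOLDOVER_ENTER", "LOST_LOCK",
    {"offset_p99_ns": 180, "lock": "LOCKED"},
    {"offset_p99_ns": None, "lock": "HOLDOVER"},
    "freeze switching; hold-off 10 s",
)
print(json.loads(record)["reason_code"])  # → LOST_LOCK
```

Serializing the bundle at the moment of the transition (rather than reconstructing it later from separate logs) is what keeps timing incidents provable and repeatable.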
Validation & selection checklist (project acceptance + BOM criteria)
This section converts “timing is correct” into a repeatable acceptance workflow. Each checklist item maps to evidence that can be archived (KPI snapshot, event log, reason code, and before/after time windows).
3-step acceptance workflow
- Step 1 — Define targets: set acceptance targets for offset (mean + tail), PDV tail, lock time, and holdover drift trend (qualitative or semi-quantitative).
- Step 2 — Run in three tiers: validate (A) functional correctness, (B) stress robustness under real traffic, (C) failure & recovery behavior.
- Step 3 — Attach an evidence bundle: for each critical event, archive reason code + KPI snapshot + state transition + time window (pre/post).
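The target-definition step and the PASS/FAIL discipline can be expressed as a tiny comparator; the KPI names and limits below are illustrative, not mandated thresholds:

```python
def evaluate_acceptance(measured, targets):
    """Compare measured KPIs against acceptance targets and return a
    per-item PASS/FAIL verdict. Missing measurements fail loudly instead
    of passing silently. KPI names are illustrative."""
    return {k: "PASS" if k in measured and measured[k] <= limit else "FAIL"
            for k, limit in targets.items()}

targets  = {"offset_p99_ns": 500, "pdv_p99_ns": 2000, "lock_time_s": 30}
measured = {"offset_p99_ns": 320, "pdv_p99_ns": 2600, "lock_time_s": 12}
print(evaluate_acceptance(measured, targets))
# → {'offset_p99_ns': 'PASS', 'pdv_p99_ns': 'FAIL', 'lock_time_s': 'PASS'}
```

Treating “not measured” as FAIL mirrors the evidence-bundle requirement: an acceptance item without archived data has not actually been validated.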
Notes on part numbers below: examples are provided to make BOM discussions concrete; lifecycle, firmware licensing, and vendor feature options should be re-checked before freezing a design.
Short, auditable statements. Each item should produce a PASS/FAIL result plus a saved evidence bundle.
Tier A — Lab: functional correctness (BC/TC/SyncE + observability)
- BMCA master selection follows configured priorities; unexpected master changes are explained by logged dataset changes.
- BC mode terminates and regenerates PTP correctly; downstream timebase remains stable after upstream disturbances.
- TC mode residence-time correction is effective; corrected PTP messages show expected behavior under controlled delay injection.
- Hardware timestamping is active at the intended point (PHY/MAC/core); software-only timestamps are not silently used.
- SyncE lock indication is visible; recovered-clock validity is gated (no propagation of invalid clocks).
- QL/SSM (ESMC) updates are logged with reason codes; quality changes trigger intended policy actions.
- Reference switching policy is deterministic (hysteresis/hold-off); no flapping is observed under marginal inputs.
- Key KPIs are readable and consistent across interfaces (CLI/telemetry): offset, meanPathDelay, lock state, QL, switch count.
Tier B — Stress: real traffic robustness (PDV tail + lock stability)
- Under microburst load, offset tail (P95/P99) stays within target or degrades in a predictable, explainable pattern.
- Under 64B small-packet stress, PDV tail increases but the servo remains stable (no oscillation or repeated loss-of-lock).
- Queue build-up events correlate with delay KPI changes; evidence windows capture before/after states automatically.
- Message loss / irregular intervals produce warnings without cascading into unstable switching unless thresholds are exceeded.
- Step-load transitions (low → high → low) show bounded lock recovery time; lock time target is met.
- Two ports with different utilization show expected asymmetry behavior; compensation strategy is validated by measurement.
- Long-duration stress (hours) shows no drift accumulation beyond expected limits; logs remain consistent and complete.
- High-throughput does not mask timing instability; acceptance gates are based on tail metrics, not throughput headlines.
Tier C — Failure & recovery: holdover + controlled re-lock
- Upstream ref loss triggers a clear transition into HOLDOVER with a reason code and timestamped event log entry.
- Holdover drift trend is measurable and predictable; alarming thresholds are set and validated.
- QL drop triggers intended re-selection; incorrect QL does not become the chosen reference.
- Reference switchback uses guard time and slew limiting; phase-hit risk is minimized and flagged if detected.
- Repeated ref instability triggers anti-flap protection (freeze switching / extend hold-off) and logs actions taken.
- Recovery to LOCKED is achieved without “panic switching”; re-lock time target is met and recorded.
- During failure drills, KPIs remain observable; telemetry does not drop exactly when it is needed most.
- Each drill produces a complete evidence bundle: KPI windows + state timeline + reason codes + configuration snapshot.
Criteria first, part numbers second. Part numbers below are representative examples used in timing-aware switches and synchronization subsystems.
| Category | What to check (engineering criteria) | Example part numbers (for BOM discussions) |
|---|---|---|
| Switch silicon | | |
| Ethernet PHY | | |
| DPLL / EEC (SyncE + jitter cleaning + ref switch) | | |
| Management plane (observability) | | |
Part numbers above are commonly discussed in timing-aware switching and synchronization designs. Actual feature enablement can vary by device option, firmware, and licensing. Always validate against the latest vendor datasheets and reference designs before final selection.
FAQs × 12 — Timing Switch (PTP/SyncE)
These FAQs focus on engineering boundaries, troubleshooting entry points, and acceptance evidence. Each answer ends with what to verify (KPIs/logs) and where to read deeper in the main chapters.