Edge Boundary Clock Switch (PTP/SyncE, HW Timestamping)
An Edge Boundary Clock Switch terminates upstream PTP and regenerates a cleaner downstream time base using hardware timestamping, timing-aware queue control, and a local servo/PLL—so congestion and upstream jitter don’t propagate across the edge site. Its real value is operational: redundant time inputs with alarms, logs, and measurable validation let timing remain stable, traceable, and field-debuggable under load.
What it is (and the strict boundary): why a “Boundary Clock Switch”
A boundary clock switch (BC) terminates upstream PTP, runs a local clock servo inside the switch, then regenerates time toward downstream ports. This makes timing a controlled subsystem (measurement + control + alarms), not a best-effort byproduct of packet forwarding.
This section locks the engineering boundary: BC vs transparent clock (TC), grandmaster (GM), and “QoS-only” switches.
| Item | Boundary Clock Switch (BC) | Transparent Clock (TC) | Grandmaster (GM) | Standard Switch (QoS-only) |
|---|---|---|---|---|
| Timing role | Clock participant: terminates + regenerates time | Forwarder: corrects residence time only | Time source: publishes absolute time | Forwarder: no timing control loop |
| What it controls | Local servo + jitter-cleaning clock path | Correction field only | Time-source disciplining (not covered here) | Queues only (timing is incidental) |
| When it becomes necessary | When upstream PDV / congestion is hard to control | When paths are stable and timing noise is low | When GNSS / absolute time is required | When timing is “nice-to-have” |
| Hardware timestamping | Required to make errors observable and bounded | Often supported but not sufficient alone | Used for source-grade timing | Usually missing or not verifiable |
| What it can isolate | Isolates upstream uncertainty from downstream distribution | Does not isolate upstream servo noise | Creates upstream reference quality | Cannot isolate timing from traffic bursts |
| Operations evidence | Alarms + logs + counters tied to timing state | Limited timing-state evidence | Source health evidence (not here) | Mostly link/traffic counters only |
Decision triggers (practical):
- Traffic bursts cause offset spikes: a timing system that shares “best-effort” queues with data will inherit PDV.
- Upstream paths change or quality varies: BC creates a controllable boundary that protects downstream timing.
- Acceptance must be provable: hardware timestamping + servo state + alarms enable measurable pass/fail evidence.
Where it sits: Timing distribution layers and BC placement at the edge
Placement is not about topology buzzwords (ring/leaf/spine). It is about creating three engineering boundaries: congestion domain (PDV containment), fault domain (blast-radius control), and redundancy domain (A/B time-base switchover with evidence).
Rule 1 — Congestion domain isolation
Place the BC at the boundary where timing queues can be made predictable. The goal is not “low latency”, but bounded delay variation for timing packets under real traffic bursts.
Rule 2 — Fault domain containment
A bad upstream path should not force every downstream client to chase noise. BC placement should prevent upstream instability from propagating into the local edge distribution layer.
Rule 3 — Redundancy domain closure
Dual upstream A/B is only meaningful if switchover is controlled, logged, and alarmed. If A/B switching happens outside the BC layer, troubleshooting becomes guesswork.
Scenario cards (common edge deployments):
Scenario A — Small site (one BC, a few clients)
Topology: GM or upstream timing feed → BC → 3–10 local clients.
- Placement: BC at the site entry boundary.
- Why: isolates local clients from upstream PDV and enables site-level evidence.
- Acceptance signal: traffic bursts no longer produce repeated offset spikes at clients.
Scenario B — Campus / factory edge (two-layer BC)
Topology: upstream timing → distribution BC → access BC / clients.
- Placement: one BC at distribution, optional BC at access for large fan-out.
- Why: splits congestion domains; prevents one segment’s burst traffic from contaminating others.
- Acceptance signal: segment-specific timing issues stay local; alarms map to a single boundary.
Scenario C — Micro edge DC (multiple BC zones)
Topology: redundant upstream A/B → timing zone BCs → compute/security/observability zones.
- Placement: BC per zone (or per timing island) with clear A/B switchover responsibility.
- Why: contains fault blast-radius and keeps evidence aligned with operational zones.
- Acceptance signal: switchover events correlate cleanly with a single zone’s logs/counters.
Timing packets & mechanisms: what a BC must care about (engineering-only)
A boundary clock switch only needs a minimal PTP loop: time transfer (Sync/Follow_Up) and delay measurement (Delay_Req/Resp or Pdelay). The engineering goal is to keep that measurement observable and bounded under traffic bursts—otherwise the servo may “correct” congestion noise as if it were clock drift.
E2E vs P2P: selection rules (practical)
E2E is appropriate when the path is relatively stable and the main uncertainty is the end-to-end delay estimate. P2P becomes attractive when hop behavior or topology changes make per-hop delay visibility necessary.
- Prefer E2E when timing paths are controlled, and per-hop visibility is not required for operations.
- Prefer P2P when hop-by-hop delay tracking helps isolate where variability enters the timing path.
- Avoid “mechanism by habit”: selection should be driven by PDV exposure and troubleshooting needs.
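Whichever mechanism is selected, the servo ultimately consumes the same two-way arithmetic. A minimal sketch of the E2E (delay request-response) math, assuming a symmetric path; the function name and nanosecond values are illustrative, not from any specific implementation:

```python
# t1: Sync TX time at master, t2: Sync RX time at slave,
# t3: Delay_Req TX time at slave, t4: Delay_Req RX time at master.
def e2e_offset_and_delay(t1, t2, t3, t4):
    """Standard two-way time-transfer estimates (symmetric-path assumption)."""
    offset = ((t2 - t1) - (t4 - t3)) / 2.0          # slave clock minus master clock
    mean_path_delay = ((t2 - t1) + (t4 - t3)) / 2.0
    return offset, mean_path_delay

# Example: slave clock runs 500 ns ahead, true one-way delay is 2000 ns each way.
true_offset_ns, delay_ns = 500.0, 2000.0
t1 = 0.0
t2 = t1 + delay_ns + true_offset_ns    # arrival stamped on the (fast) slave clock
t3 = t2 + 100.0                        # slave sends Delay_Req shortly after
t4 = t3 - true_offset_ns + delay_ns    # arrival stamped back on the master clock
off_ns, mpd_ns = e2e_offset_and_delay(t1, t2, t3, t4)
```

With clean timestamps the estimator recovers both quantities exactly; every later section is about what corrupts these four numbers (PDV, asymmetry, timestamp noise).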
Why congestion breaks timing: PDV is the real enemy
Packet Delay Variation (PDV) is not “lack of bandwidth.” It is the changing queueing time experienced by timing packets. Once PDV contaminates delay measurement, the servo can inject instability downstream—even if average throughput looks excellent.
- Root causes: bursty traffic, queue contention, shaping side-effects, transient congestion.
- Where it shows up: mean path delay jitter, offset spikes correlated with traffic bursts.
- Why it matters: the control loop acts on measurements; noisy measurements produce noisy time.
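The “averages hide PDV” point can be made concrete with a toy experiment: two delay populations with nearly identical means but very different tails. All numbers here are synthetic placeholders:

```python
# Two delay populations: a quiet path vs one where a small fraction of packets
# waits behind a microburst. Means stay close; the p99 tail diverges sharply.
import random

random.seed(7)
base_us = 10.0
quiet = [base_us + random.uniform(-0.2, 0.2) for _ in range(10_000)]
bursty = [base_us + (40.0 if random.random() < 0.02 else random.uniform(-0.6, 0.2))
          for _ in range(10_000)]

def percentile(samples, p):
    s = sorted(samples)
    return s[min(len(s) - 1, int(p / 100.0 * len(s)))]

quiet_mean = sum(quiet) / len(quiet)
bursty_mean = sum(bursty) / len(bursty)
quiet_p99 = percentile(quiet, 99)
bursty_p99 = percentile(bursty, 99)
```

A dashboard showing only `*_mean` would call these paths equivalent; a servo fed the `bursty` population sees a radically different measurement distribution.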
Asymmetry (high-level): an error amplifier
Upstream/downstream delay asymmetry introduces a systematic bias into delay estimates and can present as a persistent offset. When offset is biased but weakly correlated with traffic load, asymmetry is a primary suspect.
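The bias mechanism follows directly from the two-way formula: an asymmetric path fed into the symmetric-path estimator produces a constant offset of (d_fwd − d_rev)/2. A minimal sketch, assuming a hypothetical 600 ns directional imbalance and an otherwise perfect slave clock:

```python
# Symmetric-path estimator applied to an asymmetric path: the reported offset
# is a fixed bias, not jitter, which is why it correlates weakly with load.
def offset_estimate(t1, t2, t3, t4):
    return ((t2 - t1) - (t4 - t3)) / 2.0

d_fwd_ns, d_rev_ns = 2600.0, 2000.0   # illustrative 600 ns imbalance
t1, true_offset = 0.0, 0.0            # the slave clock is actually perfect
t2 = t1 + d_fwd_ns + true_offset
t3 = t2 + 100.0
t4 = t3 - true_offset + d_rev_ns
bias_ns = offset_estimate(t1, t2, t3, t4)   # nonzero despite a perfect clock
```

The bias persists at any traffic level, which is exactly the diagnostic signature described above.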
Misconception #1 — “High bandwidth means good timing”
Timing quality depends on bounded delay variation, not raw throughput. A fast switch can still produce poor timing if timing packets share volatile queues with burst traffic.
Misconception #2 — “Offset looks fine, so timing is fine”
Offset alone can hide risk. PDV, packet loss, and holdover transitions are leading indicators of timing fragility and should be tracked alongside offset.
Hardware timestamping path: PHY vs MAC, and 1-step vs 2-step
Hardware timestamping is valuable only when it makes the timing path deterministic and auditable. The timestamp capture point (PHY-side vs MAC-side) decides which internal variations are excluded from measurement, while 1-step vs 2-step decides how robustly transmit timestamps remain consistent under high throughput.
Pipeline view: where variability enters
A practical mental model is a forwarding pipeline: PHY → MAC → switch fabric → egress queue. Some delay is real (queueing), some is implementation-dependent (internal processing), and some is measurement noise (timestamp resolution).
- Egress queue is the dominant PDV source under burst traffic.
- Internal path variation (MAC/fabric) can leak into measurement depending on timestamp location.
- Timestamp resolution sets the short-term noise floor for delay/offset estimates.
PHY-side vs MAC-side timestamping: interpret it as an error map
“Closer to the line” is not a slogan: its value is that more internal variability is excluded from the timestamped event. MAC-side capture is easier to integrate, but may allow internal variability to appear as timing noise under changing load.
- PHY-side capture: better isolates internal processing variation from the measured event.
- MAC-side capture: simpler integration; sensitivity to internal path variation depends on implementation.
- Neither removes queueing PDV: queueing is real delay and must be bounded by design (later chapter).
1-step vs 2-step: choose for stability and evidence
1-step inserts the precise transmit timestamp (or correction) into the Sync frame as it leaves the port; 2-step sends a separate Follow_Up carrying the precise transmit timestamp. Under complex pipelines and high throughput, 2-step is often easier to keep consistent and traceable.
- Prefer 2-step when consistency and troubleshooting evidence are prioritized under heavy load.
- Prefer 1-step when transmit-time update is tightly controlled and proven stable in the implementation.
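The 2-step receive side reduces to a pairing problem: match each Sync with its Follow_Up by sequence ID, and treat orphans as anomalies worth counting. A minimal sketch; the message shapes and handler names are illustrative, not the on-wire 1588 format:

```python
# Slave-side 2-step pairing: Sync carries only an approximate origin timestamp;
# the precise TX timestamp (t1) arrives later in Follow_Up with the same seq id.
pending = {}   # sequenceId -> local RX timestamp (t2) of the matching Sync

def on_sync(seq_id, t2_rx):
    pending[seq_id] = t2_rx

def on_follow_up(seq_id, precise_t1):
    """Return the (t1, t2) pair the servo can consume, or None if Sync was lost."""
    t2 = pending.pop(seq_id, None)
    return None if t2 is None else (precise_t1, t2)

on_sync(41, t2_rx=1_000_250)
pair = on_follow_up(41, precise_t1=1_000_000)     # usable (t1, t2) pair
orphan = on_follow_up(42, precise_t1=2_000_000)   # Sync lost: log as an anomaly
```

In a real BC the `None` path should increment a loss/anomaly counter, since Sync/Follow_Up mismatch is one of the port-level truths the OAM chapter asks for.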
Error decomposition checklist (where it comes from, how to observe it, how to reduce it):
| Error source | Where it originates | What it looks like | Primary mitigation direction |
|---|---|---|---|
| Queueing PDV | Egress queue (contention, bursts) | Offset spikes correlated with traffic bursts | Timing QoS / shaping to bound PDV (later chapter) |
| Internal processing variation | MAC / fabric pipeline differences | Timing noise changes with load, even when queues are controlled | Timestamp closer to line; reduce internal variability exposure |
| Timestamp quantization | Timestamp unit resolution | Short-term noise floor; jitter-like measurement scatter | Higher resolution capture; cleaner clock path for timestamping |
| Asymmetry sensitivity | Path-level upstream/downstream mismatch | Persistent bias with weak traffic correlation | Topology / link consistency checks; treat as systematic error |
Queue shaping & timing QoS: why high-throughput switches can still sync poorly
Time sync fails in “fast” switches primarily because delay is not predictable. When PTP packets share volatile queues with burst traffic, PDV (queueing variation) contaminates delay measurement, and the servo may over-correct congestion noise as if it were clock drift—making downstream timing even worse.
Problem: timing packets are fragile to microsecond-scale queueing
PTP traffic is typically low-rate, but it is highly sensitive to short-term queueing variation. A network can meet high throughput targets while producing a delay distribution with a long tail—exactly what timing cannot tolerate.
- Throughput is an average; timing quality is driven by delay variation and tail events.
- Bursts matter: a short microburst can dominate PDV even when the link is “not saturated” on average.
Mechanism: a repeatable failure chain
Bursty traffic → queue contention → PDV increases → delay measurement becomes noisy → offset jitters → servo reacts too strongly → downstream time becomes more unstable.
- Key insight: noisy measurement leads to noisy time, even with “excellent” bandwidth.
- Engineering target: bound PDV so the measurement distribution stays tight and predictable.
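The “servo reacts too strongly” link in the chain can be shown with a toy phase-step servo fed noisy offset samples: an aggressive gain copies measurement noise into the clock, while a gentle gain filters it. Gains and noise levels are arbitrary illustration values, not tuning advice:

```python
# Toy servo: each step it measures offset (true error + PDV-like noise) and
# applies gain * measured as a phase correction. High gain chases the noise.
import random

def run_servo(gain, noise_us, steps=5000, seed=1):
    rng = random.Random(seed)
    clock_err = 5.0                     # start 5 us off to show convergence too
    errs = []
    for _ in range(steps):
        measured = clock_err + rng.gauss(0.0, noise_us)   # noisy measurement
        clock_err -= gain * measured                      # phase correction
        errs.append(clock_err)
    tail = errs[1000:]                  # discard the convergence transient
    mean = sum(tail) / len(tail)
    return (sum((e - mean) ** 2 for e in tail) / len(tail)) ** 0.5   # RMS jitter

aggressive_rms = run_servo(gain=0.9, noise_us=1.0)
gentle_rms = run_servo(gain=0.05, noise_us=1.0)
```

Both servos remove the initial 5 µs error; only the aggressive one converts measurement noise into output jitter, which is the “noisy measurement leads to noisy time” insight in miniature.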
Countermeasures: timing-focused QoS (concept level)
Timing QoS is about giving PTP packets a stable queueing model that remains consistent under bursts. The most effective actions are isolation, controlled bursts, and scheduling that protects timing flows.
- Dedicated / protected queue: avoid contention with bulk traffic.
- Strict priority (with safety bounds): timing packets should not wait behind long frames or bursts.
- Burst limiting for bulk traffic: reduce the “queue blow-up” events that create PDV tails.
- Congestion-domain isolation: keep unpredictable contention outside the timing domain boundary.
Timing QoS is not “more bandwidth”
The practical goal is predictable delay. If timing packets experience unstable queueing, a BC can regenerate time, but it cannot undo the noise already injected into measurement.
Operate with evidence, not assumptions
Treat PDV as a measurable system behavior. If offset spikes align with traffic bursts, the first fix is typically queue isolation + burst control, not servo aggressiveness.
Timing QoS checklist (deployment-ready structure)
- Classification: PTP packets can be reliably identified and mapped to a protected timing class.
- Queue isolation: PTP is not forced to share the same queue as bursty bulk traffic.
- Burst control: bulk traffic bursts are bounded so queue “blow-ups” do not create PDV tail events.
- Scheduling intent: timing packets are protected from waiting behind long frames and transient bursts.
- Regression test: under burst stress, the observed offset/PDV stays bounded (no burst-correlated spikes).
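A back-of-envelope calculation shows why strict priority matters at microsecond scale. On a 1 GbE egress port, a PTP frame stuck behind a modest microburst in a shared FIFO waits hundreds of microseconds, while a non-preemptive priority queue bounds the wait to roughly one maximum-size frame. The burst size is an illustrative assumption:

```python
# Worst-case queueing wait before a PTP frame starts transmitting,
# shared FIFO vs non-preemptive strict priority, on a 1 Gb/s link.
LINK_BPS = 1_000_000_000

def tx_time_us(nbytes):
    return nbytes * 8 / LINK_BPS * 1e6

# Shared FIFO: the PTP frame can land behind an entire microburst.
burst_backlog_bytes = 80 * 1500          # assume 80 MTU frames queued by a burst
shared_wait_us = tx_time_us(burst_backlog_bytes)

# Strict priority (non-preemptive): worst case is one max-size frame
# already on the wire when the PTP frame arrives.
prio_wait_us = tx_time_us(1500)
```

That is 960 µs versus 12 µs of worst-case wait: the same hardware, the same “bandwidth”, and an 80× difference in the queueing bound that actually drives PDV.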
Jitter-cleaning PLL & servo loop: what each layer actually does
Treat timing as three stacked layers: timestamping measures packet events, the servo converts measurements into phase/frequency corrections, and the jitter-cleaning PLL filters short-term noise to deliver a cleaner output. SyncE primarily supports frequency stability, while PTP primarily aligns time/phase; combining them requires clear loop boundaries.
Layer 1 — Timestamping (measurement)
Timestamping defines what the control loop “sees.” If measurements include queueing PDV or internal variability, the servo will respond to noise. The first priority is to keep measurement predictable (see timing QoS).
Layer 2 — Servo (control)
The servo turns measurements into phase/frequency corrections. It is responsible for tracking real drift without chasing transient noise. If the servo is driven by noisy measurements, it can amplify instability downstream.
Layer 3 — PLL clean-up (filtering)
A jitter-cleaning PLL reduces short-term phase noise and produces a cleaner output. Conceptually, it should pass what is “real” on the desired timescale and reject short-term noise that would otherwise leak into the output.
Short-term vs long-term: the practical interpretation
Short-term behavior looks like fast ripple and jitter; long-term behavior looks like slow drift and holdover quality. In a healthy design, the PLL dominates short-term cleaning while the servo maintains long-term alignment.
- Short-term: jitter, phase noise, fast disturbance components.
- Long-term: drift, slow offset trends, holdover transitions.
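The timescale split can be illustrated with a first-order low-pass standing in for the jitter-cleaning PLL: it should pass a slow, real phase trend while attenuating fast ripple. The filter constant and signal shapes are illustrative only:

```python
# One-pole low-pass ("PLL stand-in"): passes slow drift, rejects fast ripple.
import math

def lowpass(samples, alpha):
    out, y = [], samples[0]
    for x in samples:
        y += alpha * (x - y)
        out.append(y)
    return out

n = 2000
slow_drift = [0.001 * i for i in range(n)]                    # real long-term trend
fast_jitter = [0.5 * math.sin(2 * math.pi * i / 8) for i in range(n)]
raw = [d + j for d, j in zip(slow_drift, fast_jitter)]
clean = lowpass(raw, alpha=0.05)

def rms(xs):
    m = sum(xs) / len(xs)
    return (sum((x - m) ** 2 for x in xs) / len(xs)) ** 0.5

ripple_in = rms([r - d for r, d in zip(raw, slow_drift)])
ripple_out = rms([c - d for c, d in zip(clean, slow_drift)][200:])  # skip settling
```

The ripple shrinks by more than an order of magnitude while the drift line is still tracked (with a small fixed lag), matching the intended division of labor: PLL for short-term cleaning, servo for long-term alignment.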
SyncE + PTP: complement, not competition (concept level)
SyncE provides a frequency-stable base; PTP provides time/phase alignment. The integration goal is to avoid “two loops chasing the same noise” and to ensure disturbances are handled at the appropriate layer.
- SyncE supports frequency stability and reduces how hard the servo must work.
- PTP aligns time/phase, especially across packet networks.
- Clear boundaries prevent loop coupling that causes slow recovery or oscillation-like behavior.
Common configuration pitfalls (symptom → likely cause → direction)
- Offset spikes follow traffic bursts → PDV is entering measurement → fix queue isolation/burst control before changing servo behavior.
- Output becomes “nervous” (more short-term jitter) → clean-up filtering is not rejecting enough noise → ensure short-term cleaning is handled by the PLL layer.
- Slow recovery after disturbances → loops are coupled or roles are unclear → re-assert: measurement cleanliness, servo for long-term, PLL for short-term.
Redundant time-base & alarms: dual upstreams, switchover, and evidence-ready alerting
Frequent re-selection often happens not because the “best master” is wrong, but because short-lived disturbances (PDV bursts, timing-packet loss, link jitter) repeatedly trip sensitive triggers. A robust redundancy design targets three outcomes: unnoticed service impact, controlled switching (with hysteresis/debounce), and traceable decisions (reason codes + counters + event timeline).
Redundancy inputs (names only, practical scope)
A BC switch can accept multiple time and frequency references. The redundancy goal is not “more inputs,” but stable behavior under disturbances and clean operational evidence.
- PTP upstream A/B (packet time reference)
- SyncE A/B (frequency stability base)
- External reference: 1PPS / ToD (name only; used as an auxiliary reference or gate)
Switchover goals (what “good” looks like)
- Unnoticed: minimize downstream impact during transient faults and recovery.
- Controlled: avoid “ping-pong” switching with debounce/hysteresis and stability checks.
- Traceable: every transition is explainable with reason codes, counters, and a time-ordered event log.
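The “controlled” goal is essentially a small state machine: declare a reference bad only after N consecutive bad verdicts (debounce), and block another switch for a hold-down window (anti ping-pong), while recording a reason code per reselect. A minimal sketch with illustrative thresholds:

```python
# A/B reference selector with debounce + hold-down. Thresholds are placeholders;
# real values must come from the site's tolerance class and validation evidence.
class RefSelector:
    def __init__(self, bad_debounce=3, hold_down=5):
        self.bad_debounce, self.hold_down = bad_debounce, hold_down
        self.active, self.bad_streak, self.hold = "A", 0, 0
        self.events = []  # (tick, reason_code): minimal audit trail per reselect

    def tick(self, t, active_ok):
        """Feed one health verdict for the currently active reference."""
        if self.hold:
            self.hold -= 1                     # hold-down: no switch allowed yet
        self.bad_streak = 0 if active_ok else self.bad_streak + 1
        if self.bad_streak >= self.bad_debounce and not self.hold:
            self.active = "B" if self.active == "A" else "A"
            self.events.append((t, "RESELECT:debounced-loss-of-quality"))
            self.bad_streak, self.hold = 0, self.hold_down
        return self.active

sel = RefSelector()
ok_stream = [True, False, True,         # transient spike: debounced, no switch
             False, False, False,       # sustained loss: one controlled switch
             False, True, False, True]  # flapping inside hold-down: no ping-pong
trace = [sel.tick(t, ok) for t, ok in enumerate(ok_stream)]
```

One transient spike causes nothing, sustained loss causes exactly one logged switch, and post-switch flapping cannot ping-pong back, which is the behavior the three goals above describe.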
Alarm taxonomy (grouped for field usefulness)
Alarm design should separate “reference health,” “time quality,” “selection stability,” and “path suspicion,” so operators can stop the bleeding first and then isolate the cause.
- Lock / reference health: lock/loss, enter holdover, holdover out-of-range
- Time quality: offset threshold breach, persistent residual behavior
- Selection stability: frequent reselect, repeated relock cycles
- Path suspicion: asymmetry suspected, timing packets missing at a port
Evidence chain: what must be logged to make alarms actionable
Redundancy without evidence becomes guesswork. The minimum evidence chain is a timeline of states, counters that quantify packet health and switching frequency, and a reason code that explains why the system moved.
- Event timeline: state transitions and their timestamps (Locked/Degrade/Holdover/Switch/Re-lock)
- Counters: timing packet loss/anomaly counts, reselect count, relock attempts
- Reason codes: why a transition happened (loss-of-lock, threshold, holdover limit, etc.)
- Snapshots: key quality indicators captured at transition boundaries (pre/post switch)
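The evidence chain can be encoded as a simple record type: every transition carries a timestamp, endpoints, a reason code, and a boundary snapshot, so questions like “how long was holdover?” are answerable directly from the log. Field names are illustrative:

```python
# Minimal evidence record: timeline + reason codes + boundary snapshots.
from dataclasses import dataclass, field

@dataclass
class TransitionEvent:
    t: float                  # event time
    from_state: str           # e.g. "LOCKED"
    to_state: str             # e.g. "HOLDOVER"
    reason_code: str          # e.g. "LOSS_LOCK", "OFFSET_THRESHOLD"
    snapshot: dict = field(default_factory=dict)   # counters at the boundary

timeline = []
timeline.append(TransitionEvent(
    t=1000.0, from_state="LOCKED", to_state="HOLDOVER",
    reason_code="LOSS_LOCK",
    snapshot={"ptp_loss": 14, "reselects": 0, "pdv_p99_us": 42.0}))
timeline.append(TransitionEvent(
    t=1030.0, from_state="HOLDOVER", to_state="LOCKED",
    reason_code="RELOCK",
    snapshot={"ptp_loss": 14, "reselects": 0, "pdv_p99_us": 3.1}))

holdover_s = timeline[1].t - timeline[0].t   # answered directly from evidence
```

Note the pre/post `pdv_p99_us` pair in the snapshots: it lets a postmortem distinguish “holdover caused by PDV explosion” from “holdover caused by upstream loss” without any guessing.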
Common “false alarms” in the field (symptom → pitfall → first move)
- Single offset spike then recovery → misread as “bad upstream” → align spike with burst/loss evidence before switching policy changes.
- Repeated Degrade entries → misread as “device failure” → check whether triggers are too sensitive or missing debounce/hysteresis.
- A↔B ping-pong → misread as “both sources unstable” → enforce controlled switching (stability window) and log reason codes per reselect.
Alarm dictionary (name → trigger → likely cause → first action)
| Alarm name | Trigger condition | Likely causes | First action (direction only) |
|---|---|---|---|
| LOSS_LOCK | Reference lock lost beyond debounce window | Upstream instability, link issue, timing path interruption | Check event timeline + port health counters; confirm whether loss aligns with link jitter or packet anomalies |
| ENTER_HOLDOVER | Reference quality insufficient; system enters holdover state | Loss of upstream, degraded measurement due to PDV/loss | Validate upstream packet continuity; correlate with burst/loss evidence before changing selection policy |
| HOLDOVER_LIMIT | Holdover exceeds acceptable quality window | Extended upstream outage, drift accumulation, repeated disturbances | Review holdover duration timeline; identify why relock is not achieved (loss, unstable ref, repeated reselection) |
| OFFSET_THRESHOLD | Offset breach persists beyond a stability window | PDV tail events, asymmetry suspicion, unstable upstream reference | Check for burst-correlated spikes or packet loss; verify whether shift is steady (asymmetry) or spiky (PDV) |
| FREQ_RESELECT | Reselection occurs repeatedly within a short period | Over-sensitive triggers, unstable reference, missing hysteresis | Inspect reason codes for each reselect; verify debounce/hysteresis intent and correlate with disturbance evidence |
| PATH_ASYM_SUSPECT | Persistent bias behavior suggests asymmetry | Directional delay imbalance, path change, routing differences | Use A/B path comparison approach; check whether bias changes with path selection or topology change |
| PTP_PKT_MISS | Timing packets missing / irregular at a port | Congestion-domain issues, classification/QoS failure, link errors | Validate port counters and traffic correlation; confirm timing queue isolation and burst control are effective |
Note: triggers are expressed intentionally without numeric thresholds here; thresholds must be aligned with the site’s tolerance class and validation evidence.
Error budget: decomposing timing error so bottlenecks become obvious
Short offset “jumps” that recover quickly often indicate transient measurement degradation (PDV bursts, packet loss, brief degrade/holdover entry), not a stable clock drift. The practical way to locate bottlenecks is to decompose error into network contributions, measurement noise, control residual, and state/environment, then use observable evidence and controlled comparisons to identify the dominant term.
Error components (engineering-relevant decomposition)
Treat the observed offset as the sum of multiple contributors. If the “tail” dominates, focus on tail sources first (PDV/loss). If the distribution is tight but still biased, suspect asymmetry or persistent residuals.
- Queue PDV: variable waiting time under bursts
- Timing packet loss/anomalies: missing or irregular Sync/Delay messages
- Path asymmetry (suspected): directional delay imbalance creating bias
- Timestamp quantization/noise: measurement floor and short-term scatter
- Servo residual: persistent control error after PDV is managed
- PLL clean-up output noise: short-term output cleanliness
- Drift / brief unlock / holdover: state-driven excursions
Budgeting method (no numbers required to be useful)
Start from a tolerance class (coarse to strict), then allocate budget to the sources that are hardest to control (network tails), and only then validate whether internal residuals matter. This prevents chasing the servo when the true limiter is PDV.
- Step 1: define tolerance class for the site/application (coarse → strict).
- Step 2: reserve budget for network uncertainty (PDV/loss/asymmetry suspicion).
- Step 3: allocate remaining budget to internal measurement/control/clean-up residuals.
- Step 4: validate with stress + evidence (bursts, path comparison, state timeline alignment).
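Steps 2 and 3 become mechanical once the components are written down. A common convention (an assumption here, not mandated by the text) is to combine independent, zero-mean contributors root-sum-square while adding systematic bias terms like asymmetry linearly. All numbers are placeholders:

```python
# Check a candidate error-budget allocation against a tolerance class.
import math

tolerance_ns = 1000.0                        # hypothetical site tolerance class
random_terms_ns = {                          # assumed independent, zero-mean
    "queue_pdv_residual": 600.0,
    "timestamp_noise": 80.0,
    "servo_residual": 150.0,
    "pll_output_noise": 50.0,
}
bias_terms_ns = {"asymmetry_suspected": 200.0}   # systematic: add linearly

rss_ns = math.sqrt(sum(v ** 2 for v in random_terms_ns.values()))
total_ns = rss_ns + sum(bias_terms_ns.values())
within_budget = total_ns <= tolerance_ns
dominant = max(random_terms_ns, key=random_terms_ns.get)
```

With these placeholder numbers the budget closes, and `dominant` confirms the chapter’s point: the network tail (queue PDV) dwarfs the internal residuals, so tuning the servo first would be effort misdirected.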
Measurable vs inferable (what evidence can and cannot prove)
Some contributors are directly observable via counters and timelines; others require A/B comparisons or controlled disturbances. This boundary prevents misattribution and reduces “random tuning.”
Directly observable
Packet loss/anomalies counters, reselect frequency, holdover entry/exit, and burst-correlated offset spikes when aligned with the event timeline.
Inference required
Path asymmetry suspicion (bias shifts with path change), and separating control residual from measurement noise after PDV is constrained.
Practical troubleshooting priority (fastest path to root cause)
- 1) PDV and packet loss first: if tails dominate, nothing downstream can fully “fix” the measurement.
- 2) Asymmetry suspicion next: look for stable bias behavior that follows path selection/topology changes.
- 3) Servo / clean-up layers: only after network/measurement are constrained.
- 4) Hardware timing path and brief unlock evidence: confirm with state timeline and lock/holdover alerts.
Management / OAM: the counters that make a BC switch truly operable
“Stable sync” is not proven by a small average offset. It is proven by visible tails (PDV distribution), accountable gaps (timing packet loss/anomalies), and traceable state changes (reselect/holdover with reason codes and timelines). A device can be reachable (ping works) while time quality remains untrusted if monitoring is blind to these evidence paths.
Dashboard layout (what an operator must see first)
A practical OAM view should separate global health, per-port truth, events/causes, and trends/tails.
- Health: lock state, holdover state/duration, last reselect, offset overview
- Ports: per-port PTP Rx/Tx counts, loss/anomaly markers, delay mechanism mode, SyncE lock
- Events: alarm timeline + reason codes + snapshots at transitions
- Trends: offset trend, PDV tail trend, reselect frequency trend
Required telemetry (minimum set for evidence-based operations)
- Time quality: offset, mean path delay (as indicators, not as a single score)
- PDV statistics: distribution/tails visibility, not just averages
- Timing packet health: loss and anomaly counters
- Selection & state: reselect count, holdover duration, lock/holdover transitions
- Traceability: alarm timeline + reason codes (aligned with transitions)
Port-level truth (where many real issues appear first)
- PTP message counters: Rx/Tx counts for timing messages, anomaly intervals
- Delay mechanism state: consistent mode state and any mode mismatch indicators
- SyncE status: locked/unlocked events and per-port reference status
- Timing packet gaps: counters that quantify missing or irregular cadence
Evidence chain (operate like a recorder, not a guesser)
When an offset spike, degrade, holdover entry, or reselect happens, the OAM plane must reconstruct “what happened” with a repeatable evidence package.
- Timeline: state transitions (Locked → Degrade → Holdover → Switch → Re-lock)
- Reason codes: why each transition happened
- Snapshots: key counters and summary stats captured at transition boundaries
- Trends: time-series plots covering the disturbance window (offset + PDV tail + reselect)
Why “ping works” but time quality is still untrusted
Ping validates reachability, not timing evidence. If monitoring lacks PDV tails, timing packet gap counters, or reselect/holdover reason codes, the system can look “alive” while time quality silently degrades.
Minimal monitoring set (MVP) to avoid blind spots
Must-have (minimum closed loop)
- offset + mean path delay
- PDV statistics with tail visibility
- timing packet loss/anomaly counters
- reselect count + holdover duration
- lock/holdover state timeline
- alarm timeline + reason codes
Nice-to-have (speeds up isolation)
- per-port cadence gap indicators
- trend plots for PDV tail and reselect frequency
- explicit “asymmetry suspected” marker (as a hint, not a proof)
- configuration change audit aligned with event timeline
Security note (kept minimal): protect management authentication, time-domain configuration, and change auditing—without expanding into broader security architecture.
Validation checklist: stress tests, redundancy drills, and acceptance evidence
A BC switch is “validated” only when it stays predictable under full load and bursts, survives redundancy drills (drop A / jitter A / recover A) with controlled behavior, and produces a deliverable evidence package (logs, reason codes, counters, and trend snapshots). Standards can be referenced by family (IEEE 1588, ITU-T G.826x), but acceptance is proven by repeatable evidence.
1) Load & congestion injection (prove stability under pressure)
- Full throughput: saturate business traffic while keeping timing traffic active.
- Burst injection: create burst windows that stress queues and reveal PDV tails.
- Queue policy comparison: compare a “shared queue” baseline versus a “timing-priority” policy (concept level).
- Observe: PDV tail, timing packet loss/anomalies, offset excursions correlated to burst windows.
2) PDV / offset observation (principles, not a standards lecture)
- Do not trust only averages: distributions and tails matter most in the field.
- Align to windows: annotate burst periods and compare before/after behavior.
- Record evidence: trend snapshots + counters at window boundaries + event timeline markers.
Family references only: IEEE 1588 and ITU-T G.826x are relevant standards families; test evidence remains the acceptance driver here.
3) Redundancy drill (failover / holdover with traceability)
Use a scripted drill so outcomes are comparable across builds and sites.
- Step A — drop A: confirm Degrade/Holdover entry, controlled switch behavior, and complete reason codes.
- Step B — jitter A: confirm no “ping-pong” reselect under transient instability (debounce intent verified).
- Step C — recover A: confirm controlled re-lock and policy-consistent behavior, with snapshots and timelines.
- Evidence required: alarm timeline + reason codes + transition snapshots + trend plots covering the full drill window.
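The scripted drill can be expressed as data plus two injected hooks, so the same skeleton runs against any site’s tooling and every run yields a comparable report. The hook names and evidence keys are placeholders for site-specific integrations:

```python
# Repeatable redundancy drill: each step pairs an action with the evidence it
# must produce; a step passes only if all required evidence was captured.
DRILL = [
    ("drop_A",    "disable upstream A",        ["holdover_or_switch", "reason_codes"]),
    ("jitter_A",  "inject transient PDV on A", ["no_ping_pong_reselect"]),
    ("recover_A", "restore upstream A",        ["controlled_relock", "snapshots"]),
]

def run_drill(execute_step, collect_evidence):
    """Hooks are injected so the skeleton stays tooling-agnostic."""
    report = []
    for name, action, required in DRILL:
        execute_step(name, action)
        evidence = collect_evidence(name)
        missing = [k for k in required if k not in evidence]
        report.append({"step": name, "missing_evidence": missing,
                       "passed": not missing})
    return report

# Dry run with stub hooks that pretend all evidence was captured:
report = run_drill(
    execute_step=lambda name, action: None,
    collect_evidence=lambda name: {k: True for step, _, req in DRILL
                                   if step == name for k in req})
```

The point of the structure is the `missing_evidence` field: a drill where switching “worked” but no reason codes or snapshots exist fails explicitly, which closes the third delivery gap listed below.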
4) Environmental disturbance (temperature / power perturbation)
- Temperature steps: observe lock stability and any state transitions under controlled environmental change.
- Power perturbations: validate lock/holdover behavior under mild supply disturbances (no power-board deep dive).
- Evidence: lock/holdover timeline aligned with offset trends and event logs.
Acceptance matrix (Test item → Setup → Pass criteria → Evidence)
| Test item | Setup | Pass criteria (phenomenology) | Evidence package |
|---|---|---|---|
| Full load stability | Sustain high business throughput while timing traffic remains active | Offset behavior remains explainable; no unbounded excursions under steady load | Offset trend + PDV summary + port counters snapshot |
| Burst stress (PDV tail) | Inject burst windows that stress queues | PDV tail remains visible and bounded by policy intent; anomalies correlate to windows (not random) | PDV tail trend + window annotations + timing packet anomaly counters |
| Queue policy comparison | Baseline shared-queue vs timing-priority policy (conceptual) | Timing policy shows reduced PDV tail and fewer timing packet anomalies under bursts | Before/after snapshots + counters + short drill report |
| Failover: drop A | Disable upstream A during live timing | Controlled Degrade/Holdover behavior and policy-consistent switch; traceable cause | Alarm timeline + reason codes + transition snapshots |
| Reselect suppression: jitter A | Introduce transient instability on A | No uncontrolled ping-pong reselect; triggers behave as intended (debounced) | Reselect frequency trend + reason code history |
| Recovery: restore A | Recover upstream A after disturbances | Controlled re-lock and stable state; recovery is evidence-backed | Re-lock timeline + counters snapshot + trend plots |
| Temperature steps | Controlled ambient change (step or ramp) | No unexplained state oscillation; any transitions are traceable and consistent | Lock/holdover timeline + offset trend + event log |
| Power perturbation | Mild supply disturbance (without power-board detail) | State behavior remains controlled; no silent failures without logging | Event timeline + reason codes + port counters snapshot |
Three items most often missed at delivery
- Only averages, no tails: PDV tail under bursts is not measured, so field congestion breaks sync.
- Only drop tests, no jitter tests: drop A works, but jitter A triggers frequent reselect and false alarms.
- Only function, no evidence: switching happens, but no reason codes/snapshots/timelines exist for postmortem.
Field debug playbook: symptoms → fast isolation paths
Core rule: evidence first (timeline alignment) → then isolate the network contribution (PDV/loss/queues) → then isolate time-base state (reselect/holdover) → only last, suspect device internals.
Figure F4 is reused throughout this playbook as the evidence template.
A) Evidence-first entry: build a “proof packet” before changing anything
- Align timelines: mark the exact time window of the symptom (offset spike / reselect / holdover entry) and align it with traffic bursts, link flaps, and recent config changes.
- Capture two snapshots: one just before the window, one inside the window. Minimum fields: PDV tail summary, timing packet loss, port PTP Rx/Tx counters, time-base state, reselect reason code, holdover duration.
- Route the incident: If PDV/loss rises with bursts → go to “Symptom A path”. If frequent reselect / state ping-pong → go to “Symptom B path”. If holdover over-limit → go to “Symptom C path”.
- Do not “tune thresholds” first. Confirm if the spike is real or an alarm artifact via timeline + snapshots.
- Do not assume “high bandwidth” means stable time. PDV tail and queue behavior dominate microsecond-level errors.
- Do not blame hardware without a port-to-port or A/B input comparison experiment.
B) Symptom A — offset steps/spikes: isolate congestion → QoS/queues → asymmetry → state events
- Congestion. Evidence: the PDV tail thickens; timing packet loss increases; queue/egress delay indicators jump. Immediate action: export "PDV tail + loss counters + port Rx/Tx" for the same window; annotate the burst window.
- QoS/queues. Evidence: PTP frames share the best-effort queue; strict priority is not active; bursts cause long queue waits. Immediate action: run a controlled A/B compare: the same traffic burst with timing priority enabled and verified; compare the PDV tail before and after.
- Asymmetry. Evidence: mean path delay drifts with a consistent bias; errors follow a specific uplink/downlink path. Immediate action: port/path swap experiment: move the downstream client to a different port/path; check whether the bias follows the path.
- State events. Evidence: the spike occurs near Degrade/Holdover/Switch markers; reason codes appear. Immediate action: map the spike onto the Figure F4 states; confirm whether a brief loss-of-lock or reselect triggered the spike.
Minimum evidence packet for this symptom path:
- Window [t0, t1], symptom magnitude, count of impacted downstream nodes
- PDV tail summary (p95/p99 trend) + timing packet loss counters
- Port PTP Rx/Tx counters + queue isolation status (timing queue / strict priority)
- Time-base state transitions + reason codes (if any)
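The PDV tail summary in this packet can be computed directly from per-packet delay samples. A minimal sketch, using the min-referenced PDV convention (delay minus the minimum observed delay, as in RFC 5481):

```python
import statistics

def pdv_tail(delay_us):
    """Summarize packet delay variation (PDV) tail percentiles.

    delay_us: per-packet one-way delay samples in microseconds, e.g.
    hardware Rx timestamp minus the Sync message's origin timestamp.
    PDV is referenced to the minimum observed delay.
    """
    base = min(delay_us)
    pdv = sorted(d - base for d in delay_us)
    cuts = statistics.quantiles(pdv, n=100)   # 99 percentile cut points
    return {"p95": cuts[94], "p99": cuts[98], "max": pdv[-1]}
```

At microsecond scale the p99 and max matter far more than the mean: a clean average with a thick tail is exactly the "only averages, no tails" failure mode called out earlier.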
C) Symptom B — frequent reselect: compare A/B quality → alarm sensitivity → short loss-of-lock
- A/B quality compare. Evidence: A shows a higher PDV tail/loss while B stays clean (or vice versa). Experiment: temporarily pin to A, then to B, for a controlled interval; compare reselect frequency and stability metrics.
- Alarm sensitivity. Evidence: reselects align with very short offset spikes that also align with bursts/flaps. Experiment: do not tune first; prove the correlation with timeline alignment plus PDV/loss snapshots. If correlated, fix queue isolation or link stability first.
- Short loss-of-lock. Evidence: the lock state oscillates near thresholds; reason codes show brief loss-of-lock. Experiment: run a small "jitter the input" drill: slightly disturb upstream A and observe whether the device ping-pongs or transitions smoothly (record the Figure F4 markers).
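The "prove correlation before tuning" step can be made concrete. A sketch (the event and window representations are illustrative) that scores how many reselect events land inside, or within a guard interval of, known burst/flap windows:

```python
def burst_correlation(reselect_times, burst_windows, guard_s=1.0):
    """Fraction of reselect events that fall inside a burst window,
    allowing guard_s seconds of slack on each side.

    reselect_times: event timestamps (seconds) from the timeline.
    burst_windows: list of (start, end) burst/flap windows.
    A high score says: fix queue isolation or link stability first,
    before touching any reselect threshold.
    """
    if not reselect_times:
        return 0.0
    def in_window(t):
        return any(t0 - guard_s <= t <= t1 + guard_s
                   for t0, t1 in burst_windows)
    hits = sum(1 for t in reselect_times if in_window(t))
    return hits / len(reselect_times)
```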
D) Symptom C — holdover over-limit: confirm input quality → recovery behavior → verify “real” vs “false” exceed
- Input quality. Evidence: holdover entry is preceded by loss-of-lock, rising loss counters, or a PDV tail explosion. Immediate action: A/B input compare: disconnect or isolate one input and observe whether holdover duration and exceed events change.
- Recovery behavior. Evidence: after input restoration, the device stays in degrade too long or ping-pongs. Immediate action: run a controlled "restore A" drill; record relock time, reason codes, and whether a stable locked state is reached (use the Figure F4 markers).
- Real vs false exceed. Evidence: exceed events occur without any supporting state change, or with inconsistent counters. Immediate action: attach the proof packet (timeline + snapshots + trend) and confirm a "real exceed" before any parameter change.
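The "real vs false exceed" decision reduces to checking the event timeline for a supporting state event in the same window. A sketch with hypothetical event-kind names:

```python
def classify_exceed(exceed_t, state_events, support_window_s=30.0):
    """Label a holdover-exceed alarm as real or needing verification.

    state_events: list of (timestamp, kind) tuples from the event
    timeline; the kind strings here are illustrative placeholders.
    The exceed is 'real' only if a supporting event (loss-of-lock,
    holdover entry, PDV alarm) precedes it within support_window_s;
    otherwise treat it as a possible alarm artifact.
    """
    supporting = {"loss-of-lock", "holdover-entry", "pdv-alarm"}
    for t, kind in state_events:
        if kind in supporting and 0.0 <= exceed_t - t <= support_window_s:
            return "real"
    return "verify-first"
```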
E) Quick checks in 5 minutes (MVP field routine)
- Export event timeline (last 1h/24h) and mark the symptom window.
- Export PDV tail summary + timing packet loss counters for the same window.
- Check port-level PTP Rx/Tx counters (Sync/Follow_Up/Delay messages) for discontinuities.
- Compare A/B time inputs: lock state, reselect count, holdover duration.
- Run one controlled experiment: pin A then pin B, or move a client to another port/path.
- Package evidence: timeline + reason codes + two snapshots + trend screenshots.
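To make the routine auditable, the steps above can be checked as a completeness gate on the evidence bundle before escalation. The key names are illustrative, one per step of the routine:

```python
# Illustrative keys; rename to match your actual export artifacts.
REQUIRED = (
    "event_timeline",         # timeline with the symptom window marked
    "pdv_tail_summary",       # PDV tail (p95/p99) for the same window
    "timing_loss_counters",   # timing packet loss counters
    "port_ptp_counters",      # per-port PTP Rx/Tx counters
    "ab_input_compare",       # A/B lock state, reselects, holdover
    "controlled_experiment",  # pin-A/pin-B or port-move result
)

def missing_evidence(bundle):
    """Return the routine steps still missing from an evidence bundle,
    so an incomplete proof packet is caught before it is escalated."""
    return [k for k in REQUIRED if not bundle.get(k)]
```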
F) Debug-to-silicon mapping (example material numbers / MPNs)
| Function block | Example MPNs (material numbers) | Why it matters in field debug | Typical evidence signals |
|---|---|---|---|
| SyncE/IEEE1588 DPLL + jitter cleaning | Renesas 8A34001; Microchip ZL30732 (ZL3073x family, e.g., ZL30731–ZL30735); Skyworks/Silicon Labs Si5341 (example ordering codes: SI5341B-D08333-GMR, SI5341B-D11242-GMR) | Drives lock/holdover/relock behavior. Many "false" issues are actually input quality or DPLL state transitions. | Lock/loss, holdover entry/exit, phase/freq error trends, reselect reasons, input quality alarms |
| Timing-aware switch SoC / TSN switch | Microchip LAN9668/9MX (8-port TSN GigE switch) | Provides queueing/traffic shaping and timing features that dominate PDV under load. Useful for port/queue evidence. | Port counters, queue occupancy indicators, timing queue drops, scheduling/priority confirmation |
| Enterprise/edge switch ASIC with IEEE1588/PTP | Marvell Prestera 98DX83xx (P/N series) | Hardware timestamping + queue scheduling together explain “high throughput but bad time.” Look for per-port PTP stats and queue coupling. | PTP 1-step/2-step behavior, per-port PTP Rx/Tx counters, queue stats, congestion correlation |
| Ethernet PHY with low-latency 1-step PTP support | Marvell Alaska 88E1512P, 88E1510P | Helps localize timestamp path issues. PHY-side behavior can be isolated by port swaps and direction tests. | Link stability, PTP timestamp path confidence, port-to-port comparisons, direction-dependent anomalies |
- If the symptom is state-driven (reselect/holdover): start from DPLL/jitter-clean block evidence (lock/holdover/reason codes).
- If the symptom is load-driven (spikes under traffic): start from switch ASIC/TSN queue evidence (PDV tail + queue isolation).
- If the symptom follows a port/path: do port swap and A/B input tests to see whether it follows PHY/path or time-base state.
H2-12 · FAQs (Edge Boundary Clock Switch)
These FAQs are a fast index to the sections above. Each answer is short, actionable, and points back to the relevant H2 so the page stays deep without repeating the full chapter text.