High-Speed Imaging & Motion over TSN Ethernet
Deterministic imaging and motion over Industrial Ethernet/TSN is achieved by treating trigger jitter, multi-camera skew, and the packet delay variation (PDV) tail as measurable acceptance targets, then enforcing time sync, traffic isolation, and end-to-end verification so the system remains predictable under peak load.
The core deliverable is a practical workflow: budget → measure → prove → troubleshoot, using P95/P99/P99.9 tails as pass/fail criteria. Thresholds appear throughout as placeholders (X, Y) to be filled in per system.
H2-1 · Definition: Deterministic Imaging & Motion over Ethernet
Determinism is not “fast.” It is bounded, repeatable, and verifiable timing behavior across triggering, capture, and the motion loop. The baseline acceptance language for this topic is built on three measurable metrics: Trigger jitter, Capture skew, and E2E latency + PDV.
- Definition (trigger jitter): time error of a trigger event at the endpoint versus its ideal trigger time (Δt).
- Measurement point: lock the acceptance probe at the endpoint trigger pin / capture start signal (not at the controller output).
- Statistics: report RMS and P99.9 (avoid max–min as the only criterion).
- Pass criteria: jitter RMS ≤ X and jitter P99.9 ≤ Y (X/Y depend on exposure and motion-loop tolerance).
- Definition (capture skew): time spread among multiple cameras/inputs for the same event: skew = max(t_i) − min(t_i).
- Scope lock: define whether skew is measured at exposure start, frame start, or hardware timestamp (choose one and keep it consistent).
- Statistics: use P99.9 across a fixed observation window (e.g., N triggers or a production cycle).
- Pass criteria: skew P99.9 ≤ X (X depends on multi-view fusion tolerance).
- Latency: end-to-end time from event/capture to decision/actuation across sensor → network → compute → motion output.
- PDV (Packet Delay Variation): latency spread over time; recommended definition: PDV = P99.9(latency) − P50(latency).
- Decomposition: split into measurable segments (endpoint DMA/FIFO, switch residence, compute queue, actuation pipeline).
- Pass criteria: latency P50 ≤ X and PDV ≤ Y (Y is typically the determinism limiter).
- Budget method: an end-to-end worksheet to allocate jitter / skew / PDV per segment.
- Validation loop: a minimal measurement sequence to confirm timing closure on real hardware.
- Troubleshooting path: symptom → first metric → likely layer → corrective action.
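The three metric definitions in this section can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names are hypothetical, the percentile uses the nearest-rank method (one of several common definitions), and jitter P99.9 is taken over the absolute value of Δt, which is an assumption the source does not state.

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: smallest sample with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    # The small epsilon guards against float rounding pushing the rank up by one.
    rank = max(1, math.ceil(p * len(ordered) / 100.0 - 1e-9))
    return ordered[rank - 1]

def jitter_stats(trigger_errors_ns):
    """Return (RMS, P99.9) of trigger time errors (delta t) measured at the endpoint pin."""
    rms = math.sqrt(sum(e * e for e in trigger_errors_ns) / len(trigger_errors_ns))
    # Assumption: the tail is evaluated on |delta t| so early and late errors both count.
    p999 = percentile([abs(e) for e in trigger_errors_ns], 99.9)
    return rms, p999

def capture_skew(timestamps_ns):
    """Skew for one event: max(t_i) - min(t_i) across cameras/inputs."""
    return max(timestamps_ns) - min(timestamps_ns)

def pdv(latencies_us):
    """PDV as recommended above: P99.9(latency) - P50(latency)."""
    return percentile(latencies_us, 99.9) - percentile(latencies_us, 50)
```

Whichever percentile definition is chosen, it should be fixed once and reused for every run, because different interpolation rules give different P99.9 values on small sample sets.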
H2-2 · Use Cases & Constraints: Vision, Motion, Triggering
High-speed imaging and motion systems fail determinism in different ways. The fastest path to closure is to match each deployment to a primary metric and a first measurement, then proceed to budgeting and verification with consistent acceptance language.
- Goal: aligned frames for fusion / triangulation / metrology.
- Primary metric: Capture skew (P99.9), not raw throughput.
- Failure signature: fusion drift, depth error, inconsistent frame-to-frame alignment.
- First measurement: same physical event (LED/strobe) across cameras → skew distribution.
- Pass criteria: skew P99.9 ≤ X and stable across temperature / load.
- Goal: repeatable trigger arrival at endpoints under real traffic.
- Primary metric: Trigger jitter (RMS + P99.9).
- Failure signature: uneven illumination timing, missed exposure windows, inconsistent time-of-flight gating.
- First measurement: probe endpoint trigger pin vs reference clock → jitter vs load correlation.
- Pass criteria: jitter P99.9 ≤ X and no periodic spikes tied to burst traffic.
- Goal: stable control loop with bounded delay and predictable phase margin.
- Primary metric: E2E latency + PDV (PDV is often the limiter).
- Failure signature: oscillation, overshoot, position ripple, intermittent “good then bad” stability.
- First measurement: segment latency decomposition (endpoint → switch → compute → actuation).
- Pass criteria: latency P50 ≤ X and PDV ≤ Y under worst-case load.
- Goal: predictable arrival times for inspection decisions and actuator timing.
- Primary metric: PDV (P99.9) over long windows (captures periodic micro-bursts).
- Failure signature: rare misses, periodic spikes, “clean counters but timing breaks.”
- First measurement: latency distribution + queue residence correlation during video bursts.
- Pass criteria: PDV ≤ X and no burst-induced tail growth under worst-case traffic.
H2-3 · Reference Architecture: Endpoints → TSN Switch → Compute
A consistent reference architecture prevents ambiguous discussions about “where determinism breaks.” This topic uses two coordinated planes: a Time plane (synchronization and timestamps) and a Data plane (video, trigger, and control traffic). The critical engineering task is to lock each metric to a specific interface boundary: timestamp taps, queue ingress, DMA/FIFO, and endpoint scheduling.
- Time plane: reference clock → TSN switch time base → endpoint time base → hardware timestamp taps.
- Data plane: Video (burst, high bandwidth) + Trigger (small, high priority) + Control/telemetry (steady, observable).
- Coupling points: timestamp taps, queue ingress, DMA/FIFO boundaries, interrupt scheduling, and Energy-Efficient Ethernet (EEE) state transitions.
H2-4 · Timing Loop: Sync Time → Deterministic Trigger → Aligned Capture
Deterministic triggering is a closed-loop engineering task: sync a common time base, schedule traffic, deliver triggers, measure outcomes, and tune budgets until the acceptance metrics remain bounded under worst-case load.
- Queueing & mixing: burst traffic inflates PDV tails; evidence is residence spikes aligned to bursts.
- Shaping/window alignment: periodic jitter spikes; evidence is spikes repeating at gate-cycle cadence.
- Timestamp tap error: skew drift despite “lock”; evidence is offset steps with temperature/link events.
- Endpoint scheduling: jitter tail grows with load; evidence is jitter correlates with CPU/IRQ rate.
- Clock wander/PLL behavior: long-window drift; evidence is P99.9 degrades while short RMS stays clean.
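The "spikes repeating at gate-cycle cadence" evidence can be checked numerically. The sketch below is one possible approach, under stated assumptions: timestamps are integers in microseconds, a spike is any sample whose absolute jitter exceeds a chosen threshold, and cadence is judged from the median interval between spikes. The function names and the 5% tolerance are hypothetical.

```python
from statistics import median

def spike_times(times_us, jitter_ns, threshold_ns):
    """Timestamps of samples whose absolute jitter exceeds the threshold."""
    return [t for t, j in zip(times_us, jitter_ns) if abs(j) > threshold_ns]

def matches_cadence(spikes_us, cycle_us, tolerance=0.05):
    """True if spikes are evenly spaced at an integer multiple of the gate cycle.

    tolerance is a fraction of one cycle (hypothetical default of 5%).
    """
    if len(spikes_us) < 3:
        return False  # too few spikes to judge periodicity
    intervals = [b - a for a, b in zip(spikes_us, spikes_us[1:])]
    typical = median(intervals)
    slack = tolerance * cycle_us
    # Spikes must be regular, not just coincidentally near a multiple.
    regular = all(abs(iv - typical) <= slack for iv in intervals)
    multiple = round(typical / cycle_us)
    on_cycle = multiple >= 1 and abs(typical - multiple * cycle_us) <= slack
    return regular and on_cycle
```

A match points toward shaping/window alignment; irregular spikes that instead line up with video bursts point toward queueing and mixing. The result is only an indicator and should be confirmed against switch counters.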
H2-5 · TSN Mechanisms You Actually Use (Application View)
This topic uses a small set of TSN mechanisms only when they directly bound trigger jitter, capture skew, and PDV tails. Each mechanism is specified as: Mechanism → Purpose → How to use in this scenario → Pass criteria. Standard clauses and full table semantics are intentionally out of scope here.
- Purpose: bound trigger class residence to prevent burst-induced PDV tails.
- How to use: allocate a short periodic window for Trigger, keep Video outside that window, and avoid window edge overlap with burst peaks.
- Pass criteria: Trigger jitter P99.9 ≤ X, and trigger-queue residence P99.9 ≤ X under worst-case video bursts.
- Purpose: reduce queue fill events that elongate PDV tails and starve trigger/control.
- How to use: cap video peak rate and burst size; keep Trigger/Control in isolated queues so “priority only” is not relied on.
- Pass criteria: PDV tail (e.g., P99.9 − P50 ≤ X) and video-queue watermark does not saturate during peak frames.
- Purpose: keep deterministic budgets stable as new flows appear.
- How to use: reserve Trigger/Control headroom; reject or downgrade additional video flows instead of silently increasing jitter tails.
- Pass criteria: Trigger jitter and PDV remain within X under worst-case load; new flows trigger explicit reject/alert behaviors.
- Purpose: avoid shared-queue effects where video bursts drag out trigger tails.
- How to use: assign Trigger/Control to dedicated queues; shape video in its own queue; validate that queue ingress mapping is correct.
- Pass criteria: Trigger residence P99.9 ≤ X and shows weak correlation to video watermark under burst tests.
- Traffic class mapping: Trigger / Control / Video → class IDs (X). Quick check: verify per-class counters increase as expected.
- Cycle time: deterministic cycle period ≤ X. Quick check: jitter spikes do not align to cycle edges.
- Trigger window: offset + width ≤ X. Quick check: trigger residence P99.9 remains bounded.
- Queue assignment: each class → queue ID (X). Quick check: trigger never shares the video queue.
- Video shaping limits: peak rate / burst size ≤ X. Quick check: watermark stays below saturation under bursts.
- Admission limits: max offered load / reserved headroom ≤ X. Quick check: overload triggers reject/alert, not silent tail growth.
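The quick checks above can be partly automated before any traffic is sent. The following sketch validates a plan held in a plain dictionary; the dictionary layout and field names are hypothetical and do not correspond to any particular switch API. It assumes the trigger window must fit inside one cycle and must be at least as long as one trigger frame on the wire, where wire time includes the 20 bytes of preamble, start-of-frame delimiter, and inter-frame gap.

```python
WIRE_OVERHEAD_BYTES = 20  # preamble (7) + SFD (1) + inter-frame gap (12)

def frame_time_us(frame_bytes, link_mbps):
    """Time on the wire for one Ethernet frame, in microseconds."""
    return (frame_bytes + WIRE_OVERHEAD_BYTES) * 8 / link_mbps

def check_plan(plan):
    """Return a list of problems found in a (hypothetical) TSN plan dictionary."""
    problems = []
    queues = plan["queue_of_class"]
    if queues["trigger"] == queues["video"]:
        problems.append("trigger shares the video queue")
    window = plan["trigger_window"]
    if window["offset_us"] + window["width_us"] > plan["cycle_us"]:
        problems.append("trigger window extends past the cycle")
    needed = frame_time_us(plan["trigger_frame_bytes"], plan["link_mbps"])
    if window["width_us"] < needed:
        problems.append("trigger window shorter than one trigger frame")
    return problems

# Example plan: all numbers are illustrative only.
example_plan = {
    "queue_of_class": {"trigger": 7, "control": 6, "video": 2},
    "trigger_window": {"offset_us": 0, "width_us": 10},
    "cycle_us": 1000,
    "trigger_frame_bytes": 64,
    "link_mbps": 1000,
}
```

A static check like this only catches configuration mistakes; the pass criteria above still have to be proven by measurement under burst load.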
H2-6 · Time Sync in Practice: PTP Timestamping, Asymmetry, Calibration
Time sync “looks locked” but still misses accuracy when timestamp tap points are inconsistent, link asymmetry is unaccounted for, or queue residence and temperature drift move the effective delay model. This section keeps an engineering-only view: tap location, error sources, and calibration + acceptance checks.
- Endpoint MAC tap: sensitive to DMA/driver/interrupt tails; useful only when software path variability is controlled.
- Endpoint PHY tap: closer to the physical boundary; reduces software-induced timing ambiguity.
- Switch ingress/egress taps: expose residence variation and per-hop delay steps during congestion or link events.
- Asymmetry: uplink/downlink delays differ; evidence: stable bias that changes with path or direction.
- Residence variation: queueing treated as propagation delay; evidence: offset/skew correlates with congestion and bursts.
- Oscillator + temperature: slow drift dominates long windows; evidence: P99.9 degrades while short RMS stays clean.
- Link retrain events: delay steps after renegotiation; evidence: offset step + re-convergence phase.
- Asymmetry calibration: define a baseline bias and validate it remains stable across temperature and link events (X).
- Convergence behavior: power-up settle time ≤ X; post-flap recovery time ≤ X.
- Holdover behavior: drift rate ≤ X over a defined duration; re-lock does not overshoot for extended periods (X).
- Offset stability: RMS ≤ X and P99.9 ≤ X (fixed window).
- Recovery: power-up convergence ≤ X; link-event recovery ≤ X.
- Holdover drift: drift ≤ X for a defined time span.
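The effect of uncalibrated asymmetry follows directly from the standard two-way PTP exchange. The sketch below uses the usual four timestamps (t1 master send, t2 slave receive, t3 slave send, t4 master receive) and a small simulation helper, which is hypothetical, to show that a delay difference between the two directions appears as an offset bias of half that difference even when the clocks are perfectly aligned.

```python
def ptp_offset_and_delay(t1, t2, t3, t4):
    """Offset and mean path delay from one two-way exchange (symmetric-path assumption)."""
    offset = ((t2 - t1) - (t4 - t3)) / 2
    delay = ((t2 - t1) + (t4 - t3)) / 2
    return offset, delay

def simulate_exchange(true_offset_ns, delay_ms_ns, delay_sm_ns, turnaround_ns=1000):
    """Hypothetical helper: build t1..t4 for known one-way delays and a known clock offset."""
    t1 = 0
    t2 = t1 + delay_ms_ns + true_offset_ns   # slave clock reads ahead by true_offset
    t3 = t2 + turnaround_ns
    t4 = t3 - true_offset_ns + delay_sm_ns   # back on the master clock
    return ptp_offset_and_delay(t1, t2, t3, t4)

# Clocks perfectly aligned, but master->slave is 40 ns slower than slave->master.
measured_offset, measured_delay = simulate_exchange(0, 500, 460)
# measured_offset is 20 ns: half the 40 ns asymmetry shows up as a stable bias.
```

This is why the bias is stable and changes with path or direction, as noted in the error-source list: the servo can be fully locked and still hold a constant error until a per-link correction is applied.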
H2-7 · Bandwidth, Latency & Jitter Budget Worksheet (E2E)
This worksheet converts deterministic imaging and motion requirements into measurable budgets. It specifies inputs, a segment-by-segment E2E latency breakdown, and a consistent PDV percentile definition so trigger jitter, capture skew, and latency tails can be validated under burst load.
- Imaging: resolution, fps, bit-depth, ROI, compression (CBR/VBR), key-frame behavior (placeholder).
- Trigger: trigger rate, target trigger-to-exposure window (X), message size/cycle (X).
- Motion: control period, update point, maximum allowed control delay (X).
- Network: link speed (1G/2.5G/5G/10G), hop count, aggregation points, uplink bottlenecks.
- Compute pipeline: stage count (ISP/encode/AI/fusion), buffering vs. batching flags (placeholder).
- Throughput: define average rate, peak rate, and burst size separately (X).
- Packetization: MTU/packet size and packets-per-frame drive queue fill and interrupt pressure.
- Burst shape: frame-boundary micro-bursts can saturate shared resources even at low average utilization.
- Evidence: per-class counters and queue watermark spikes align with frame boundaries under peak load.
- Video shaping target: peak/burst limits ≤ X to prevent queue saturation.
- PDV tail target: P99.9 − P50 (or another fixed definition) ≤ X under burst tests.
- Sensor exposure: exposure window and trigger alignment (measure at sensor timing pins or device timestamp tap).
- ISP/encode: stage latency and buffering/batching flags (measure per-stage timestamps).
- NIC/DMA: DMA/FIFO watermark and interrupt mitigation tails (measure driver/NIC counters + HW timestamps).
- Switch: per-hop queue residence and shaping/window boundary effects (measure per-class counters and residence proxies).
- Compute: scheduling, queue depth, batch window (measure queue depth + stage timestamps).
- Actuation: output update point, phase alignment, output jitter (measure at actuator edge/timestamp tap).
- Sampling window: fixed observation time span (X) and fixed load profile (peak + mixed flows).
- Latency distribution: report P95 / P99 / P99.9 for each segment and the full chain.
- PDV tail (choose one and lock it): (P99.9 − P50) ≤ X or (P99.9 − P95) ≤ X.
- Trigger jitter: define reference time base (synced clock or master reference) and report P99.9 ≤ X.
- Capture skew: define “same trigger event” alignment error and report P99.9 ≤ X.
- Flow ID / class: Trigger / Control / Video mapping (X). Check: per-class counters match traffic.
- Queue ID: dedicated queues for determinism (X). Check: trigger never shares video queue.
- Window fields: cycle, offset, width (X). Check: residence P99.9 bounded.
- Admission limits: reserved headroom / max offered load (X). Check: overload rejects/alerts.
- Frame size: average/max (X) and fps (X). Check: compute peak packets-per-second.
- MTU / packet size: packets-per-frame (X). Check: frame-boundary burst appears in counters.
- Peak / burst limits: peak rate and burst size (X). Check: watermark stays below saturation.
- Aggregation: number of sources and uplink bottleneck rate (X). Check: bottleneck port never overloads.
- Exposure / ISP / encode: P50/P99.9 stage latency (X). Check: batching flags recorded.
- NIC / DMA: FIFO depth/watermark and IRQ mitigation (X). Check: tails correlate with watermark.
- Switch: per-hop residence P99.9 (X). Check: residence rises under bursts.
- Compute: queue depth and scheduling period (X). Check: queue depth does not grow unbounded.
- Actuation: output edge jitter/skew P99.9 (X). Check: edge timing stable at load.
- Tap points: TS tap / pin / counter location list (placeholder). Check: taps are consistent.
- Test duration: observation time (X) and load profile ID. Check: reproducible results.
- Percentiles: P95/P99/P99.9 definitions are fixed. Check: every percentile is computed over the same sample count and window.
- Pass targets: jitter/skew/PDV thresholds (X). Check: worst-case load still passes.
- Trigger condition: stable at lower rate, tails explode at max rate. Evidence: error/retrain counters align with PDV spikes.
- Trigger condition: only certain harness/connector builds fail. Evidence: retransmits rise while queues are not saturated.
- Trigger condition: temperature/humidity changes correlate with failures. Evidence: link events + tail growth move with environment.
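The throughput and packetization inputs in this worksheet reduce to simple arithmetic. The sketch below assumes uncompressed video and a fixed payload size per packet; the function name, the 1440-byte payload, and the 78-byte per-packet overhead are hypothetical values chosen for illustration and depend on the actual protocol stack.

```python
import math

def video_load(width, height, bits_per_pixel, fps, payload_bytes, overhead_bytes, link_mbps):
    """Per-frame packet count, average rate, and frame-boundary burst duration."""
    frame_bytes = width * height * bits_per_pixel // 8
    packets = math.ceil(frame_bytes / payload_bytes)
    wire_bytes = frame_bytes + packets * overhead_bytes
    avg_mbps = wire_bytes * 8 * fps / 1e6
    return {
        "packets_per_frame": packets,
        "avg_mbps": avg_mbps,
        "utilization": avg_mbps / link_mbps,
        # Time the link is fully occupied if a whole frame is sent back-to-back.
        "burst_us": wire_bytes * 8 / link_mbps,
        "frame_period_us": 1e6 / fps,
    }

# 1920x1080, 8 bits per pixel, 60 fps on a 10G link (illustrative numbers).
load = video_load(1920, 1080, 8, 60, payload_bytes=1440, overhead_bytes=78, link_mbps=10000)
```

With these example inputs the average utilization is roughly 10%, yet each unshaped frame occupies the link at line rate for about 1.75 ms. That gap between average and burst is what the separate average / peak / burst-size fields are meant to capture.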
H2-8 · Engineering Checklist: Design → Bring-up → Production
This checklist turns the budget and determinism mechanisms into executable steps. Each gate includes the required checks, the measurement points, and placeholder pass thresholds (X) to keep results reproducible from design through bring-up and production.
- Clock tree isolation: define reference distribution and isolation boundaries. Pass: offset drift ≤ X.
- Timestamp tap strategy: choose MAC/PHY/switch taps and document the accuracy boundary. Pass: tap definition fixed for validation.
- Queue plan + class isolation: Trigger/Control separated from Video. Pass: residence P99.9 ≤ X.
- VLAN/QoS segmentation: define traffic compartments and counter visibility. Pass: mixed load does not violate X.
- Budget worksheet readiness: H2-7 field list included in deliverables. Pass: fields complete + tap points listed.
- Time sync convergence: power-up settle ≤ X, post-event recovery ≤ X.
- Link sanity (PRBS/loopback): error counters = 0 or ≤ X under sustained load.
- Timestamp consistency: multi-tap delta remains stable under load. Pass: delta P99.9 ≤ X.
- Trigger jitter under burst load: run peak video + trigger concurrently. Pass: jitter P99.9 ≤ X.
- PDV tail characterization: P95/P99/P99.9 fixed definitions. Pass: PDV tail ≤ X.
- Black-box logging hooks: counters + temp/power/link events recorded. Pass: schema complete (X).
- Calibration field lock: calibration bias fields fixed across builds. Pass: schema/version consistent.
- Version lock (FW/config): deterministic parameters are auditable and frozen. Pass: no silent drift.
- Golden stress script: repeatable peak + mixed flow tests. Pass: P99.9 metrics stable within X.
- Field log schema: per-class counters + link events + temperature + power events. Pass: supports forensic attribution.
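Each gate in this checklist is a set of measured values compared against placeholder thresholds. A minimal sketch of that comparison is shown below, assuming every metric is "lower is better" and that a metric with no measurement must count as a failure. The function name and metric keys are hypothetical.

```python
def evaluate_gate(measured, limits):
    """Return (passed, failures) for one gate.

    measured: metric name -> measured value
    limits:   metric name -> maximum allowed value (the X placeholders)
    """
    failures = []
    for name, limit in limits.items():
        value = measured.get(name)
        if value is None:
            # An unmeasured metric is treated as a failure, not a silent pass.
            failures.append(f"{name}: not measured")
        elif value > limit:
            failures.append(f"{name}: {value} > {limit}")
    return (not failures, failures)

# Illustrative bring-up gate; the numbers are not recommendations.
bringup_limits = {"trigger_jitter_p999_us": 2.0, "pdv_tail_us": 20.0}
result = evaluate_gate({"trigger_jitter_p999_us": 1.4}, bringup_limits)
```

Treating a missing measurement as a failure keeps a gate from passing simply because a tap point or counter was never wired up.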
H2-9 · Validation & Troubleshooting: What to Measure First
Field troubleshooting must start with a minimal evidence loop. Measure time sync health, per-hop latency, and endpoint scheduling jitter first, then map the symptom to the most likely layer before changing parameters.
- Measure: offset stability, holdover behavior, and convergence time (X).
- Evidence: offset steps or drift correlated with load changes.
- Next: if unhealthy, verify timestamp tap definition and asymmetry calibration first.
- Measure: per-hop queue residence proxy (P99/P99.9), port counters, and queue watermark.
- Evidence: a single hop dominates the tail growth under bursts.
- Next: if a hop is abnormal, confirm mixed traffic, window boundaries, and admission limits.
- Measure: ISR latency distribution, DMA/FIFO watermark, and interrupt rate (X).
- Evidence: jitter spikes align with CPU/IRQ peaks or DMA backpressure.
- Next: if endpoint is abnormal, verify interrupt mitigation, batching, and power-state transitions.
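Several of the evidence lines above rest on correlation, for example jitter spikes aligning with CPU/IRQ peaks. A plain Pearson coefficient over time-aligned samples is one simple way to quantify that. The sketch below implements it directly; the sample arrays in the usage line are invented for illustration.

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length, time-aligned sample lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length series with at least 2 samples")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant series carries no correlation information
    return sxy / math.sqrt(sxx * syy)

# Illustrative only: per-interval jitter P99 (ns) against interrupt rate (kHz).
r = pearson([120, 150, 400, 130, 390], [10, 12, 40, 11, 38])
```

A coefficient near 1 supports the endpoint-scheduling hypothesis, but it is evidence rather than proof: both series can rise together simply because video load drives both, so the per-hop residence check should still be run.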
Likely layer: time plane or endpoint scheduling.
Next action: confirm tap definition and ISR/DMA watermark correlation.
Likely layer: time plane asymmetry or hop-specific queue residence.
Next action: check asymmetry calibration and identify the dominant hop.
Likely layer: switch queues / shaping / admission control.
Next action: find the over-subscribed class and confirm burst limits.
Likely layer: overload, recovery/retransmission behavior, or physical/environment triggers.
Next action: separate “queue-full drops” from “link-error recovery.”
Likely layer: environment causing time/PHY instability.
Next action: correlate tail growth with temp/power/link event timestamps.
Next: enforce admission limits and peak/burst shaping (X).
Next: lock a single tap authority and re-validate skew P99.9 (X).
Next: reduce mitigation-induced delay and ensure DMA headroom (X).
Next: pin deterministic classes away from power transitions (X).
Next: separate “physical errors” from “queue drops” using counters and timestamps.
- drop counters: separate queue drops vs. other drops.
- CRC/error counters: detect physical or recovery-driven tails.
- link events: retrain / down-up / renegotiation timestamps.
- per-class queue depth: observe compartment behavior under bursts.
- watermark: confirm headroom and near-saturation events.
- residence proxy: isolate the hop that dominates tails.
- policer drops: identify over-offered traffic or mis-sized admission.
- DMA/FIFO watermark: detect backpressure tails.
- ISR latency (histogram): quantify scheduling jitter (X).
- interrupt rate: correlate bursts with CPU/IRQ pressure.
- temperature: correlate tail growth with thermal drift.
- power events: brownout/rail dips aligned with jitter steps.
- clock events: lock/unlock or reference changes (placeholder).
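The counters and events listed above can be gathered into a single record type so that every snapshot has the same shape. The sketch below is one possible layout using a Python dataclass; the class name, field names, units, and the snapshot rule are all hypothetical, and a real schema would follow whatever the platform's counters actually expose.

```python
import json
from dataclasses import asdict, dataclass

@dataclass
class BlackBoxRecord:
    timestamp_ns: int
    config_version: str
    queue_drops: int = 0
    other_drops: int = 0
    crc_errors: int = 0
    link_events: int = 0
    policer_drops: int = 0
    queue_watermark_pct: float = 0.0
    residence_p999_us: float = 0.0
    dma_watermark_pct: float = 0.0
    isr_latency_p999_us: float = 0.0
    irq_rate_hz: float = 0.0
    temperature_c: float = 0.0
    power_events: int = 0
    clock_events: int = 0

def should_snapshot(record, limits):
    """True if any monitored field exceeds its (placeholder) limit."""
    return any(getattr(record, name) > limit for name, limit in limits.items())

def to_json_line(record):
    """Serialize one record as a single JSON line for append-only logging."""
    return json.dumps(asdict(record), sort_keys=True)
```

Keeping queue drops, other drops, and CRC errors as separate fields preserves the distinction between queue-full loss and link-error recovery that the troubleshooting steps rely on.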
H2-10 · Applications & Deployment Patterns
Deployment patterns should keep determinism measurable. This section summarizes topology trade-offs, trigger distribution options, and edge-compute time alignment so systems can be validated with clear acceptance metrics (jitter, skew, and PDV tails).
Risk points: shared segments can amplify burst interference.
Primary metric: per-hop residence tail and PDV tail (X).
Risk points: uplink bottleneck and central switch oversubscription.
Primary metric: aggregation port watermark and tail percentiles (X).
Risk points: misconfiguration can create loops or asymmetric timing paths.
Primary metric: skew stability + post-event recovery time (X).
- Pros: one authoritative event source simplifies auditing.
- Risks: downstream queue pressure can inflate jitter tails.
- Acceptance: trigger jitter P99.9 ≤ X under peak video load.
- Pros: local scheduling can reduce transport-induced jitter.
- Risks: time sync quality and tap consistency become critical.
- Acceptance: capture skew P99.9 ≤ X and holdover stability within X.
- Time base: define a single authoritative time reference for all devices (placeholder).
- Skew definition: “same event ID” alignment error across cameras must be fixed and audited.
- Percentiles: skew tail and PDV tail must share consistent windows (P95/P99/P99.9).
- Validation first: correlate skew tail expansion with hop residence tail under burst load.
H2-11 · IC Selection Logic (Materials & Platforms)
Part selection is centralized here to prevent model numbers from scattering across the page. The selection flow is: requirements → capabilities → device class → verification. Use tail metrics (P95/P99/P99.9) for acceptance; avoid tuning without evidence.
- Rule 1: select by tail (PDV P99/P99.9), not by average latency.
- Rule 2: select by timestamp tap consistency + observability, not by “PTP supported” labels.
- Rule 3: select by verification hooks (counters, loopback/PRBS, histograms) before parameter tuning.
Device class: timestamping endpoint + TSN switch.
Verify: trigger jitter histogram (P99.9) + hop residence tail (P99.9).
Device class: endpoint time engine + clock/jitter cleaner.
Verify: offset stability + holdover drift + event-ID alignment error.
Device class: TSN switch + endpoint observability.
Verify: per-hop residence decomposition + class isolation under burst stress.
- Timestamp tap: fixed tap location and consistent definition across endpoints.
- Scheduling control: bounded ISR/DMA behavior; measurable jitter histogram (X).
- Queue interaction: deterministic path for trigger/control classes (no hidden buffering).
- Observability: counters and watermarks usable for field forensics.
- Tap mismatch: mixing MAC-tap and PHY-tap timing references across devices.
- IRQ mitigation tails: interrupt coalescing improves CPU load but inflates jitter tails.
- Power states: EEE/sleep transitions add periodic wake latency spikes (tail growth).
- Sync health: offset stability and holdover behavior within acceptance (X).
- Endpoint jitter: ISR latency histogram and DMA watermark correlation under load.
- Event alignment: trigger event-ID alignment across cameras (skew P99.9 ≤ X).
- Intel Ethernet Controller I210-IS / I210-IT
- Microchip LAN7430 (PCIe to GbE controller)
- Microchip LAN7431 (PCIe to GbE controller)
- NXP Layerscape LS1028A
- Texas Instruments Sitara AM6548
- Texas Instruments Sitara AM6442
- AMD Xilinx XC7Z020 (Zynq-7000)
- AMD Xilinx XCZU3EG (Zynq UltraScale+ MPSoC)
- Class isolation: trigger/control vs video must be separated (queues/VLAN/QoS).
- Deterministic scheduling: time-windowing support and bounded queue residence behavior.
- Admission & shaping: controls to prevent oversubscription and micro-burst tails.
- Telemetry: per-port counters and queue/watermark visibility for field diagnosis.
- Mixed traffic pollution: trigger/control shares queues with video bursts.
- Oversubscription: admission is missing or configured as “best effort,” inflating PDV tail.
- Visibility gap: no usable residence proxy or queue depth telemetry during field failures.
- Per-hop residence: isolate the hop that dominates PDV tail (P99.9 ≤ X).
- Class isolation: validate trigger latency stability while video load is saturated.
- Micro-burst stress: confirm watermark behavior and policer drops (if used).
- NXP SJA1105T (TSN Ethernet switch family example)
- NXP SJA1110 (TSN switch family example)
- Microchip LAN9662 (TSN Ethernet switch example)
- Microchip LAN9668 (TSN Ethernet switch example)
- Holdover: bounded drift during link loss or topology events (X).
- Jitter filtering: low added jitter and stable phase under load transitions.
- Inputs: reference input options and controlled switching behavior (no large phase steps).
- Auditability: measurable convergence time and stable lock status indicators.
- Holdover surprises: “offset looks good” until link loss triggers rapid drift.
- Input switching steps: reference change introduces phase discontinuities.
- Thermal sensitivity: drift grows with enclosure temperature gradient without detection.
- Convergence: cold-start to stable offset within X in a fixed window.
- Holdover test: disconnect reference; measure drift over time vs acceptance (X).
- Thermal stress: repeat under temperature gradient to validate worst-case drift.
- Analog Devices AD9545 (network synchronizer / DPLL example)
- Analog Devices AD9548 (multi-channel timing / DPLL example)
- Renesas (IDT) 8A34001 (sync / jitter attenuator DPLL example)
- Renesas (IDT) 8A34002 (sync / jitter attenuator DPLL example)
- Silicon Labs Si5345 (jitter attenuator example)
- Microchip ZL30772 (timing / DPLL family example)
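The holdover test described in this section (disconnect the reference, then measure drift over time) reduces to estimating a slope from logged offsets. The sketch below fits a least-squares line to offset samples; since 1 ns of drift per second equals 1 part per billion, the slope in ns/s can be read directly as ppb. The function name and the sample data are hypothetical, and a linear fit is itself an assumption that will not capture thermal or aging curvature.

```python
def drift_rate_ppb(times_s, offsets_ns):
    """Least-squares slope of offset versus time; ns per second equals ppb."""
    n = len(times_s)
    if n != len(offsets_ns) or n < 2:
        raise ValueError("need two equal-length series with at least 2 samples")
    mean_t = sum(times_s) / n
    mean_o = sum(offsets_ns) / n
    sxx = sum((t - mean_t) ** 2 for t in times_s)
    if sxx == 0:
        raise ValueError("all samples share the same timestamp")
    sxy = sum((t - mean_t) * (o - mean_o) for t, o in zip(times_s, offsets_ns))
    return sxy / sxx

# Illustrative log: offset grows by 50 ns every 10 s after the reference is removed.
rate = drift_rate_ppb([0, 10, 20, 30], [0, 50, 100, 150])
```

Repeating the same fit during the thermal stress run gives a drift figure per temperature point that can be compared against the holdover acceptance threshold (X).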
H2-12 · FAQs (Deterministic Imaging & Motion)
This FAQ is strictly for field troubleshooting and acceptance criteria. Each answer is fixed to 4 lines: Likely cause / Quick check / Fix / Pass criteria (threshold placeholders: X, Y, N).
PTP shows locked, but capture skew still drifts (sync ≠ acceptance)
Likely cause: timestamp tap mismatch (MAC vs PHY), uncalibrated asymmetry, or holdover drift under thermal/load changes.
Quick check: record offset + skew histograms (P99.9) and compare one-way delay estimates per link; confirm all endpoints use the same timestamp domain.
Fix: unify tap location/driver settings, calibrate asymmetry (store per-link correction), and enforce “no-step” time discipline (slew only).
Pass criteria: offset stays within ±X ns and multi-camera capture skew P99.9 ≤ X µs over Y minutes (N cameras, full load).
Trigger jitter is fine on bench, worse on full system load
Likely cause: endpoint scheduling tails (ISR latency, DMA contention), IRQ coalescing, or CPU preemption inflating jitter percentiles.
Quick check: capture ISR latency histogram + DMA watermark vs time; correlate trigger jitter spikes with CPU load and interrupt mitigation counters.
Fix: pin IRQs/threads to isolated cores, reduce/disable coalescing for deterministic classes, and separate trigger/control queues from video queues end-to-end.
Pass criteria: trigger arrival jitter P99.9 ≤ X µs with peak video throughput sustained for Y minutes; no watermark underrun events.
Video stream bursts cause missed trigger windows
Likely cause: micro-bursts exceeding reserved service, shaping missing, or gate/window guard band too small for worst-case queue drain time.
Quick check: inspect per-port queue depth/watermark and burst size (pps/bytes) during misses; verify trigger class is mapped to the intended queue/window.
Fix: apply shaping/policing on video, reserve a deterministic window for trigger/control, and add admission control to prevent oversubscription.
Pass criteria: missed trigger count = 0 for N triggers; trigger queue watermark ≤ X% and E2E PDV P99.9 ≤ X µs over Y minutes.
Switch latency looks constant, but PDV spikes every few minutes
Likely cause: periodic background activity (management polling, statistics flush), EEE/LPI transitions, or intermittent link events creating tail spikes.
Quick check: compute spike periodicity and correlate with logs (NMS polls, link power state changes, counters snapshot times).
Fix: isolate management traffic (separate VLAN/queue), rate-limit telemetry bursts, and disable EEE on deterministic ports/classes.
Pass criteria: PDV spike rate ≤ X per minute and PDV P99.9 ≤ X µs sustained for Y minutes under full load.
Enabling EEE saves power but breaks determinism—what first?
Likely cause: LPI entry/exit adds variable wake latency, inflating jitter/PDV tails even when average latency looks unchanged.
Quick check: compare jitter/PDV histograms with EEE on vs off (focus on P99/P99.9); inspect LPI transition counters around spike timestamps.
Fix: disable EEE on deterministic path/ports, or restrict EEE to best-effort classes while keeping trigger/control always-active.
Pass criteria: enabling power-save does not change deterministic metrics beyond X%: trigger jitter P99.9 ≤ X µs and PDV P99.9 ≤ X µs over Y minutes.
One camera node causes everyone’s skew—endpoint scheduling or cable?
Likely cause: misclassification of traffic classes, timestamp domain mismatch, link errors/retries, or endpoint scheduling stalls causing system-wide alignment drift.
Quick check: compare per-node skew deltas, CRC/retry counters, and endpoint ISR/DMA tails; confirm the node uses the same time source/tap as others.
Fix: lock the node’s QoS/class mapping, replace/shorten the drop cable if errors exist, and normalize timestamp configuration across endpoints.
Pass criteria: with the node restored, global capture skew P99.9 ≤ X µs and CRC/retry counters remain at 0 over Y minutes (full load).
After link renegotiation, timestamps shift by a constant offset
Likely cause: PHC/time engine reset, reinitialized servo, or path delay recalculation applied as a step instead of a controlled slew.
Quick check: detect time-step events (offset discontinuity), record pre/post PHC values, and confirm link-speed/duplex changes at the renegotiation moment.
Fix: enforce “slew-only” discipline, persist calibration across link events, and lock link parameters where possible to avoid repeated retraining.
Pass criteria: max time step = 0; post-event offset returns within ±X ns in ≤ X s and skew P99.9 ≤ X µs over Y minutes.
Two segments individually pass, cascaded topology fails jitter budget
Likely cause: PDV tails compound across hops; schedule phases misalign; oversubscription occurs at the junction hop only when end-to-end traffic mixes.
Quick check: decompose latency by hop (residence tail per switch/port) and verify class mapping + gate phase consistency across segments.
Fix: align gate schedules end-to-end, add admission limits, and reserve bandwidth/guard bands for trigger/control across every hop.
Pass criteria: E2E PDV P99.9 ≤ X µs and each hop residence tail P99.9 ≤ X µs over Y minutes (full load).
VLAN/QoS configured, but trigger traffic still queues behind video
Likely cause: wrong PCP/DSCP mapping, ingress classification not applied, shared egress queue, or head-of-line blocking.
Quick check: capture packets to confirm markings, then verify switch ingress→egress queue mapping and per-queue counters during bursts.
Fix: correct class mapping tables, dedicate an egress queue/window for trigger/control, and apply shaping to video to protect deterministic queues.
Pass criteria: trigger/control queue occupancy ≤ X% and trigger latency distribution is unchanged when video load increases to peak (Y minutes).
Thermal change increases jitter—PLL/clock tree or PHY edge rate?
Likely cause: clock drift/PLL wander under thermal gradient, temperature-driven equalization changes, or link retraining events shifting delay.
Quick check: correlate jitter/skew tails with temperature (sensor logs) and link events; validate holdover drift vs temperature over a controlled soak.
Fix: harden clock distribution (clean reference, isolation), add thermal guard bands, and reduce aggressive retraining/retune behaviors on deterministic links.
Pass criteria: trigger jitter and capture skew P99.9 ≤ X µs across Tmin..Tmax after Y-minute soak; no retrain events during acceptance run.
Sync recovers after drop, then flaps—retry storm or bad holdover criteria?
Likely cause: overly aggressive lock thresholds, missing hysteresis, repeated role changes, or management/traffic bursts destabilizing the time loop.
Quick check: count lock/unlock events, role changes, and offset sawtooth amplitude; correlate flaps with traffic bursts and recovery timers.
Fix: add hysteresis and backoff, stabilize master selection, isolate management traffic, and gate recovery actions on measured offset/holdover quality.
Pass criteria: re-lock ≤ X seconds after a drop and ≤ X flap events per 24 hours; offset stays within ±X ns during steady state.
Field failures with “no errors” counters—what black-box fields are missing?
Likely cause: insufficient observability; the failure is a tail event (jitter/PDV spike) not captured by simple error counters.
Quick check: confirm logging includes (time, offset, PDV percentiles, per-hop residence proxy, queue watermark, CPU/ISR tails, temp/power events).
Fix: add a black-box schema with threshold-triggered snapshots and version fields; enable per-class counters and queue/watermark telemetry.
Pass criteria: every incident has a complete record set (timestamp + versions + offset + PDV P99.9 + queue watermark + CRC/drops + temp/power) enabling root-cause within X iterations.
Measurement rule: prioritize P99/P99.9 tails for jitter/skew/PDV and keep the same time window (Y minutes) across test runs.