Latency & Determinism for TSN: Jitter Budgeting & Shaping
Determinism is not “low average latency” — it is a provable bound on the tail (p99/p999/max). This page turns switch delay models and Qbv/Qci shaping into an end-to-end jitter budget that can be measured, versioned, and accepted against clear pass criteria.
H2-1 · Scope & Non-Scope (Boundary Contract)
This page is strictly scoped to avoid overlap with neighboring pages: it standardizes the latency/jitter “accounting” and focuses only on deterministic delay control via switch delay models, Qbv/Qci shaping, and end-to-end jitter budgeting with measurable pass criteria.
Typical symptoms this page addresses:
- Average latency looks fine, but sporadic “late” events or p99/p999 spikes still occur.
- Load is “only 20%”, yet real-time traffic still stalls during bursts or contention windows.
- TSN features are enabled, but determinism does not improve because the tail is not budgeted, bounded, or verified.
In scope:
- Switch delays: per-hop model = fixed pipeline + queueing + shaping / gate-wait.
- Qbv / Qci: time-aware gating (Qbv) and per-stream policing / admission (Qci) as tail-control levers.
- End-to-end jitter budgeting: bound each hop, sum bounds, define test matrix, and write pass/fail criteria.
Deliverables:
- Latency/Jitter budget sheet template: hop-by-hop fixed delay + queue bound + gate-wait bound + timebase error input (ε) + notes.
- Qbv GCL design record: cycle time, slot plan, guard-band logic, and “max gate-wait” bound per class.
- Qci policing record: per-stream burst/rate parameters, violation actions, and counters to audit “bad flows”.
- Verification matrix: empty-load / full-load / burst / mixed-class / fault-injection, with p99/p999 and late-rate pass criteria.
Out of scope (routed to neighboring pages):
- PTP / BMCA / topology calibration → go to PTP Hardware Timestamping. (Timebase is treated here as an input error ε only.)
- SyncE / White-Rabbit frequency lock & holdover → go to the SyncE / White-Rabbit-Style Timing pages.
- PROFINET / EtherCAT / CIP business models & certification → go to Industrial Ethernet Stacks pages.
- PHY SI / EMC / TVS / CMC / grounding → go to PHY Co-Design & Protection pages. (This page focuses on switch scheduling/policing bounds.)
H2-2 · Determinism = Latency Distribution (Beyond the Average)
Determinism is not “lower average latency”. Determinism is a tight and bounded latency distribution under defined load patterns. The engineering goal is tail control: convert unpredictable spikes into bounded wait times that can be budgeted and verified.
Common anti-patterns:
- Optimizing mean latency while ignoring p99/p999.
- Using “average utilization” as a proxy for real-time readiness.
- Declaring success because throughput is high, even when control traffic misses deadlines.
What a deterministic design requires:
- Bounded tail: p99/p999 and max latency stay within a known upper bound.
- Defined load patterns: empty-load, full-load, burst, and mixed-class cases are explicitly tested.
- Deadline integrity: late-rate (deadline miss events) is controlled and audited.
Working definitions and tail metrics:
- Latency: end-to-end time from transmit event to receive event under a defined measurement tap-point.
- Jitter (this page): short-term spread of the latency distribution within the same configuration and test window.
- Wander: long-term drift (not expanded here). If wander dominates, route to PTP/SyncE pages.
- Percentiles: p50 (median), p95, p99, p999 — prioritize p99/p999 for tail control.
- Max–min: highlights rare spikes that percentiles may hide in short windows.
- Late-rate: “deadline miss events per N packets / per minute” (requires a clear denominator).
- Burst sensitivity: tail increase under burst injection and mixed traffic classes.
Measurement rules:
- Keep window length explicit (e.g., Y minutes or N frames) and do not mix windows across runs.
- Do not compare percentiles across different tap points or timestamp definitions.
- Always log load pattern metadata (burst size, mix ratio, class mapping) together with results.
Where the tail comes from:
- Queueing under contention: bursts and shared resources create non-linear tail growth.
- Gate-wait (Qbv): bounded but non-zero waiting until the next open window; the bound must be budgeted.
- Unregulated flows (Qci missing/weak): abnormal traffic can reintroduce tail spikes.
- Observability mismatch: counters/timestamps look clean because the accounting definition is inconsistent (tap-point/denominator/window).
H2-3 · Latency Taxonomy (End-to-End Ledger Canon)
A deterministic design requires a single accounting canon. This section defines the end-to-end segments, the only three allowed delay categories, and the minimum timestamp tap-point rules so every budget line uses the same meaning.
Use this chain as the fixed “row order” for budgets and reports: TX endpoint (EP) → SW1 → SW2 → … → RX endpoint (EP).
The only three allowed delay categories:
- Fixed latency: deterministic baseline from pipeline/serialization; weakly dependent on mode and frame length.
- Load-dependent queueing: contention/bursts create non-linear tail growth; must be bounded with assumptions.
- Time-dependent gate-wait: Qbv windows convert random contention into bounded waiting; the bound must be budgeted.
Timestamp tap-point rules (minimum):
- Tap-point must be declared for each measurement run (endpoint tap vs switch ingress/egress tap).
- Do not mix tap-points when comparing p99/p999; otherwise the distribution is not comparable.
- Window/denominator must be logged (Y minutes or N frames) together with load pattern metadata.
- Timebase error is an input ε; if drift/asymmetry dominates, route to timing pages (PTP/SyncE).
Minimum budget-sheet columns:
- Segment / hop name
- Fixed (min/typ) and notes (mode, frame length dependency)
- Queue bound (max) with stated burst assumptions
- Gate-wait bound (max) from GCL + guard-band
- Source: Measured / Datasheet / Calibrated
- Tap-point + window length + load pattern ID
H2-4 · Switch Internal Delays (Fixed-Latency Model)
Per-hop fixed latency becomes traceable when it is decomposed into pipeline stages. This section explains store-and-forward vs cut-through at a modeling level, identifies common fixed-delay contributors, and classifies each contributor by the most reliable source: measured, datasheet, or bench-calibrated.
Forwarding modes (modeling level):
- Store-and-forward: forwarding starts after full frame buffering; fixed latency includes frame-length-dependent serialization terms.
- Cut-through: forwarding starts earlier; fixed latency can be lower, but mode constraints and feature hooks may add deterministic stages.
- Ledger requirement: fixed latency must be recorded with mode + port speed + representative frame-length assumptions.
Common fixed-delay contributors:
- Ingress parse and classification
- Lookup and forwarding decision
- Buffer write/read and internal fabric traversal
- Rewrite / mirroring hooks (deterministic adders)
- Egress scheduling baseline and MAC serialization
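A back-of-envelope model of the per-hop fixed term, assuming standard Ethernet framing overhead (preamble/SFD + inter-frame gap) and a single lumped `pipeline_us` value obtained from bench calibration. The min-frame proxy for cut-through buffering is an assumption, since real devices differ in how early forwarding starts:

```python
def serialization_us(frame_bytes, link_mbps):
    """Time to clock a frame onto the wire, in microseconds.
    Adds preamble/SFD (8 B) and inter-frame gap (12 B) per Ethernet framing."""
    wire_bytes = frame_bytes + 8 + 12
    return wire_bytes * 8 / link_mbps  # bits / Mbps == microseconds

def fixed_hop_latency_us(frame_bytes, link_mbps, pipeline_us, mode):
    """Per-hop fixed term: lumped pipeline stages + mode-dependent buffering.
    pipeline_us lumps parse/lookup/fabric/egress stages (bench-calibrated)."""
    if mode == "store_and_forward":
        # the whole frame must be buffered before forwarding starts
        return pipeline_us + serialization_us(frame_bytes, link_mbps)
    if mode == "cut_through":
        # forwarding may start after the header; modeled here with a
        # min-frame proxy (assumption, not a device spec)
        return pipeline_us + serialization_us(64, link_mbps)
    raise ValueError(mode)
```

Record the mode, port speed, and representative frame length with every fixed-latency value, as the ledger requirement above demands.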
H2-5 · Queueing Delay (Why the Tail Explodes)
Determinism fails most often because queueing delay is not linear: bursts and contention create tail growth that is invisible in long-window averages. This section maps the top three triggers to observable symptoms and to the minimum accounting fields required for budgets and logs.
Trigger 1 — microbursts hidden by long-window averaging:
- Observable: average utilization looks safe, but p99/p999 latency spikes in short windows.
- Accounting fields: burst size (B), burst interval, peak rate, service rate (μ), peak queue depth.
- First check: compare 1 ms vs 1 s windows; confirm whether peaks are hidden by averaging.
Trigger 2 — cross-class interference:
- Observable: a “noisy” class inflates tail latency of a “well-behaved” class.
- Accounting fields: class→queue mapping, per-queue depth/occupancy, per-class counters.
- First check: inspect per-queue counters, not port-wide totals; identify which queue hits saturation.
Trigger 3 — shared egress blocking (priority inversion):
- Observable: higher-priority traffic still experiences stalls because it is blocked by a shared egress dependency.
- Accounting fields: priority→class mapping, scheduler policy, egress congestion markers.
- First check: verify mapping consistency end-to-end; confirm which egress is the true bottleneck.
Why averages mislead:
- Window mismatch: a safe 1 s average can hide 1–10 ms peaks that dominate tail latency.
- Peak vs mean: queue growth is driven by short-term peak arrival exceeding service rate.
- Effective service loss: drops/retries or backpressure events reduce effective throughput (treat as a queueing amplifier).
Minimum accounting fields:
- Q(t): queue depth over time (peak, threshold crossings, recovery time).
- μ: service rate (configured vs observed) per egress queue.
- Burst: size (B), peak rate, burst period / inter-burst gap.
- Mapping: class→queue and priority→class consistency across endpoints and switches.
- Counters: drops, backpressure/pause events, and per-queue occupancy.
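The window-mismatch effect is easy to reproduce with a discrete-time fluid queue, Q[t+1] = max(0, Q[t] + A[t] − S): the example below runs at a “safe” 20% average utilization yet builds a large burst backlog. All values are illustrative:

```python
def queue_trace(arrivals_bits, service_bits_per_slot):
    """Discrete-time fluid queue; returns (peak depth, Q(t) trace).
    Illustrative model, not device-accurate buffering behavior."""
    q, peak, trace = 0.0, 0.0, []
    for a in arrivals_bits:
        q = max(0.0, q + a - service_bits_per_slot)
        peak = max(peak, q)
        trace.append(q)
    return peak, trace

# 1000 slots: all traffic arrives as one 2000-bit burst every 100 slots,
# service rate mu = 100 bits/slot
service = 100.0
arrivals = [2000.0 if t % 100 == 0 else 0.0 for t in range(1000)]

avg_util = sum(arrivals) / (service * len(arrivals))  # 0.2 -> "looks safe"
peak_depth, _ = queue_trace(arrivals, service)        # burst backlog in bits
```

The long-window average says 20% load; the per-slot trace shows a 1900-bit backlog after each burst — exactly the peak-vs-mean gap the accounting fields above are meant to capture.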
H2-6 · Qbv Time-Aware Shaping (Turn Uncertainty into a Schedule)
Qbv replaces a portion of random contention with a time table. A gate-control list (GCL) defines repeating windows, and the resulting delay becomes deterministic but bounded: traffic waits until the next open window. This section describes the minimum GCL fields, the role of guard bands, and how timebase error enters the budget without expanding into configuration ecosystems.
A repeating schedule opens/closes gates per traffic class so the worst-case waiting time becomes a bounded gate-wait term in the latency ledger.
Minimum GCL fields:
- Cycle time: schedule period that repeats.
- Slots: windows with fixed start/length inside the cycle.
- Gate state: open/closed per traffic class per slot.
- Repeat: deterministic repetition used to compute bounded wait.
- Class mapping: which class uses which window (keep mapping stable).
- Guard band: a protective slice near window boundaries to prevent boundary-crossing interference.
- Gate-wait term: traffic may wait until the next open slot; the wait is bounded and schedulable.
- Timebase input: treat timebase misalignment as an input error ε; if it dominates, route to timing pages (PTP).
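Under the assumption of a sorted, non-overlapping slot list per class, the worst-case gate-wait is the largest gap between a window close and the next window open (wrapping into the next cycle). A minimal sketch:

```python
def max_gate_wait_us(cycle_us, open_slots):
    """Worst-case wait until the next open gate for ONE traffic class.
    open_slots: sorted [(start_us, length_us), ...] within one cycle.
    A frame arriving the instant its window closes waits until the next
    open start. Guard bands should already be subtracted from usable
    window length by the caller (assumption of this sketch)."""
    worst = 0.0
    for i, (start, length) in enumerate(open_slots):
        close = start + length
        nxt = open_slots[(i + 1) % len(open_slots)][0]
        if i + 1 == len(open_slots):
            nxt += cycle_us  # wrap into the next cycle
        worst = max(worst, nxt - close)
    return worst
```

For example, one 200 µs window in a 1000 µs cycle yields an 800 µs gate-wait bound; splitting the same capacity into two windows halves the bound — the kind of trade-off the GCL design record should capture.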
H2-7 · Qci Per-Stream Policing (Keep “Bad Traffic” Out)
Deterministic networking collapses when input traffic has no bound. Qci-style per-stream policing enforces rate and burst constraints at admission, isolating abnormal flows so tail latency remains budgetable. This section focuses on engineering goals, ledger fields, and counters — without expanding into configuration ecosystems.
Engineering goals:
- Bound the arrival process: cap burst size and sustained rate per stream.
- Protect determinism: prevent a single abnormal stream from inflating p99/p999 for others.
- Make budgets credible: queue and gate-wait bounds assume input is already constrained.
Behavior and cost:
- Conformance decision: each stream is judged as conforming or violating against rate/burst limits.
- Actions: allow, drop, or mark (policy-driven) to block “bad traffic” from consuming shared resources.
- Tail reduction: fewer microburst-driven queue spikes → tighter latency distribution.
- Cost: policing can reduce throughput or increase drop/mark events (must be included in acceptance).
Ledger fields (per stream):
- Stream ID: a stable flow identifier (source/destination/port/class tuple).
- Meter parameters: rate limit + burst limit (token-bucket parameter set).
- Action policy: allow / drop / mark (and any severity level).
- Violation counters: count, duration, peak violation intensity.
- Mapping reference: class→queue reference used by the stream (for audit and drift checks).
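A single-rate token-bucket meter captures the conformance decision and the violation counter described above. The class layout and action strings are illustrative; real Qci/PSFP implementations add per-stream filtering and gating on top:

```python
class TokenBucketPolicer:
    """Single-rate token-bucket meter (Qci-style, illustrative sketch).
    rate_bps:   sustained rate limit for the stream
    burst_bits: bucket depth, i.e. the burst limit"""

    def __init__(self, rate_bps, burst_bits):
        self.rate, self.depth = rate_bps, burst_bits
        self.tokens, self.last_t = burst_bits, 0.0
        self.violations = 0  # ledger counter used to audit "bad flows"

    def admit(self, t, frame_bits):
        # refill proportionally to elapsed time, capped at bucket depth
        self.tokens = min(self.depth,
                          self.tokens + (t - self.last_t) * self.rate)
        self.last_t = t
        if frame_bits <= self.tokens:
            self.tokens -= frame_bits
            return "allow"
        self.violations += 1
        return "drop"  # or "mark", depending on the action policy
```

The rate/burst pair must match the burst assumptions declared in the budget ledger; otherwise policing is either too tight (drops) or too loose (tail returns).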
H2-8 · End-to-End Jitter Budgeting (From “Feeling” to Acceptance)
This section is the core deliverable: a copyable budgeting method that turns end-to-end latency and jitter into a ledger with explicit bounds, assumptions, and verification hooks. Every row must be traceable to a source (measured, datasheet, or calibrated), and the acceptance criteria must include tail metrics, not only averages.
Budget terms:
- fixed: per-hop deterministic pipeline/serialization baseline (feature/mode dependent).
- queue bound: worst-case queueing term under declared burst/service assumptions.
- gate-wait bound: worst-case waiting until next open window (plus guard band).
- ε_timebase: timebase input error (treated as a parameter; details belong to timing pages).
- ε_impl: implementation residual (tap-point limitations, host scheduling variance, unmodeled adders).
Ledger columns:
- Hop / Segment: EP, SW1, SW2…
- Item: fixed / queue bound / gate-wait bound / ε
- Min / Max: or typ/max (explicit bounds, not only typical)
- Notes: mode, features, frame length, mapping, burst window, load pattern ID
- Source: measured / datasheet / calibrated
- Tap-point: measurement tap declaration aligned with the page’s taxonomy
- Verification: plan + counters used to validate this row in bring-up
Acceptance criteria:
- Tail metrics: p99/p999 (and a max bound where applicable) under declared load patterns.
- Bound compliance: verify that fixed + bounded terms explain measured results within ε_impl.
- Counter sanity: violation/drop/mark counters remain within declared limits (no hidden trade-offs).
- Version lock: budget must bind to configuration versions (GCL/mapping/policing).
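The ledger arithmetic itself is simple; what matters is that each row carries its source and assumptions. A minimal sketch of the row structure and the acceptance check (upper bound = Σ per-hop bounded terms + ε terms), using hypothetical field names:

```python
from dataclasses import dataclass

@dataclass
class HopRow:
    name: str            # EP, SW1, SW2, ...
    fixed_us: float      # pipeline + serialization baseline
    queue_us: float      # bound under declared burst assumptions
    gate_wait_us: float  # bound from GCL + guard band
    source: str          # "measured" / "datasheet" / "calibrated"

def e2e_bound_us(rows, eps_timebase_us, eps_impl_us):
    """Upper bound = sum of per-hop bounded terms + input error terms."""
    return (sum(r.fixed_us + r.queue_us + r.gate_wait_us for r in rows)
            + eps_timebase_us + eps_impl_us)

def accept(rows, eps_tb, eps_impl, measured_p999_us, target_p999_us):
    """Acceptance: measured tail meets the target AND is explained
    by the budget (fixed + bounded terms + epsilon)."""
    bound = e2e_bound_us(rows, eps_tb, eps_impl)
    return measured_p999_us <= target_p999_us and measured_p999_us <= bound
```

A measured p999 that exceeds the computed bound means a ledger term is wrong or an assumption was violated — attribute the deviation to a term before tuning anything.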
H2-9 · Parameterization Workflow (Requirements → Config → Feedback)
Determinism becomes repeatable only when parameters are managed as a closed loop. This workflow turns requirements into a budget, produces shaping/policing parameters, verifies results with consistent metrics, and feeds deviations back into the ledger. The focus is on reusable artifacts and version control — not vendor tool details.
Step 1 — Capture requirements:
- Cycle / control loop: cycle time, deadline, update rate.
- Tail constraint: E2E p99/p999 limit (not only average).
- Topology: hop count, critical paths, fan-out / aggregation points.
- Load pattern ID: idle / full / burst / mixed / fault-injection profile.
Step 2 — Allocate the budget:
- Per-hop allocation: assign fixed, queue bound, and gate-wait bound per hop.
- Hard boundaries: identify terms that cannot be “optimized away” (e.g., gate-wait bound, ε_timebase).
- Assumption lock: freeze mapping, burst window, and frame-size assumptions in the ledger.
Step 3 — Generate parameters:
- GCL (Qbv): cycle time, slot lengths, guard band, and a stable class→window plan.
- Queue mapping: class→queue mapping aligned across endpoints and switches.
- Policing (Qci-style): per-stream rate/burst limits and violation actions.
- Config identity: attach a config version tag to each parameter set.
Step 4 — Verify and feed back:
- Verify with consistent metrics: per-hop delay, E2E p99/p999, queue watermark, gate misses, drops/late.
- Deviation attribution: map every failure to a ledger term (fixed / queue / gate-wait / ε).
- Closed-loop rule: if acceptance fails, return to Step 2 (reallocate bounds) before tuning ad-hoc.
Version control:
- Rule: parameter sheet version = system configuration source-of-truth.
- Bind: ledger version ↔ GCL version ↔ mapping version ↔ policing version ↔ test matrix version.
- Rollback path: every release must have a tested rollback tag with known acceptance results.
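One way to make the version binding mechanical is to hash the whole parameter bundle and gate deployment on a zero readback diff. The bundle layout below is a hypothetical example:

```python
import hashlib
import json

def bundle_hash(gcl, mapping, policing):
    """Hash the config bundle so ledger <-> GCL <-> mapping <-> policing
    versions can be bound and diffed. Canonical JSON (sorted keys) keeps
    the hash stable across serializations."""
    blob = json.dumps({"gcl": gcl, "map": mapping, "qci": policing},
                      sort_keys=True).encode()
    return hashlib.sha256(blob).hexdigest()[:16]

def readback_ok(intended, readback):
    """Deployment gate: block release unless readback diff == 0."""
    return bundle_hash(*intended) == bundle_hash(*readback)
```

Attaching the resulting tag to every test report also gives the rollback path a concrete identity: a rollback target is a bundle hash with known acceptance results.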
H2-10 · Validation & Measurement (Make Determinism Measurable)
Determinism cannot be accepted without consistent measurement contracts. This section defines a minimum metric set, a scenario matrix that exposes tail failures, a tap-point consistency gate, and a pass-criteria template with threshold placeholders. It avoids tool-specific details and focuses on repeatable acceptance.
Minimum metric set:
- Per-hop delay: hop baseline + deviations (use declared tap-points).
- E2E tail: p99 and p999 latency under declared scenario IDs.
- Drops / late: drop counters and late-arrival events for bounded windows.
- Queue watermark: per-queue peak occupancy (not port-wide only).
- Gate miss: gate misses / window violations (per class/queue where applicable).
- Policing health: violation counters (per stream) and action rates.
Scenario matrix:
- Idle: baseline fixed + measurement noise floor.
- Full load: sustained saturation risks and scheduling drift.
- Burst: microburst tail exposure and queue bound validation.
- Mixed: class interaction and mapping correctness under concurrency.
- Fault injection: abnormal traffic / violations to test policing and isolation.
Tap-point consistency gate:
- Declare tap-points: every per-hop delay must name the timestamp tap position.
- Reject mixed baselines: do not attribute differences to queue/gate if tap-points differ.
- Timebase note: if timebase dominates, treat it as ε_timebase input (route to timing pages).
Pass-criteria template fields:
- Metric: E2E p99 / E2E p999 / per-hop delay / queue watermark / gate miss
- Scenario ID: idle / full / burst / mixed / fault
- Window: X ms / X s (must be declared)
- Threshold: ≤ X (units) + counters within X
- Evidence: counters + percentile report + config version tag
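A pass-criteria template can be evaluated mechanically once real thresholds replace the `X` placeholders. The metric keys below are hypothetical stand-ins for the template fields above:

```python
def evaluate_run(results, criteria):
    """Compare one scenario run against upper-bound pass criteria.
    results/criteria keys mirror the template fields (illustrative names);
    every threshold is a '<= limit' check."""
    failures = []
    for metric, limit in criteria.items():
        value = results[metric]
        if value > limit:
            failures.append(f"{metric}: {value} > {limit}")
    return (len(failures) == 0, failures)

# example thresholds (placeholders filled with made-up numbers)
criteria = {"e2e_p99_us": 150, "e2e_p999_us": 250,
            "gate_miss": 0, "late_per_million": 1}
ok, why = evaluate_run({"e2e_p99_us": 120, "e2e_p999_us": 240,
                        "gate_miss": 0, "late_per_million": 0}, criteria)
```

Store the evidence (counters, percentile report, config version tag) next to the verdict so every pass/fail is auditable.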
H2-11 · Design Hooks & Pitfalls (Where Tail Latency Explodes)
This section captures only determinism-relevant pitfalls: time windows (Qbv), queue/mapping mistakes, policing drift, and measurement contract mismatches. Each pitfall is expressed as a repeatable triage path: trigger → symptom → first counter check. Topics such as PHY/EMC/SI or protocol stack business models are intentionally excluded.
Pitfall 1 — Qbv cycle misalignment:
- Trigger: mismatched cycle lengths or phase drift.
- Symptom: periodic p999 spikes (beating / phase-locked bursts).
- First check: p99/p999 sliced by time phase + GCL cycle/slot audit.
- Fix direction: align cycles/phase, then re-balance slot allocation.
Pitfall 2 — Guard band underestimated:
- Trigger: guard band computed with optimistic frame/serialization assumptions.
- Symptom: late events or “window miss” bursts under real payloads.
- First check: gate-miss / late counters aligned to the same window length.
- Fix direction: widen guard band or adjust slot boundaries and queue service.
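Guard-band sizing follows directly from worst-case serialization of an interfering frame that starts just before the window boundary; the margin term is an explicit assumption to be recorded in the ledger:

```python
def guard_band_us(max_frame_bytes, link_mbps, margin_us=0.5):
    """Guard band must cover worst-case serialization of an interfering
    frame that begins transmission just before the window boundary.
    Includes preamble/SFD (8 B) + inter-frame gap (12 B); margin_us is
    an assumption, not a derived value."""
    worst_serialize_us = (max_frame_bytes + 8 + 12) * 8 / link_mbps
    return worst_serialize_us + margin_us
```

For a 1522-byte VLAN-tagged max frame at 1 Gbps this gives roughly 12.3 µs plus margin — an "optimistic" 64-byte assumption would undersize the band by an order of magnitude, producing exactly the window-edge late bursts described above.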
Pitfall 3 — Priority class in a shared queue:
- Trigger: high-priority class mapped into shared BE queue.
- Symptom: high-priority tail rises (HOL blocking), even at “low average load”.
- First check: per-queue watermark + mapping table version audit.
- Fix direction: isolate queues and re-validate allocation with burst scenario ID.
Pitfall 4 — Policing mismatched to the burst window:
- Trigger: token bucket set without matching the declared burst window.
- Symptom: drops (too tight) or tail returns (too loose) under burst stress.
- First check: violations/drops counters aligned with p99/p999 in the same scenario window.
- Fix direction: re-derive rate/burst from the ledger assumptions, then re-run matrix.
Pitfall 5 — Metric scope mixing:
- Trigger: mixing per-port, per-queue, and per-flow metrics without stating scope.
- Symptom: “utilization looks fine” while p999 and watermarks explode.
- First check: declare window length + denominator + object (port/queue/flow) before comparing.
- Fix direction: standardize a metric contract, then re-baseline acceptance.
H2-12 · Engineering Checklist (Design → Bring-up → Production)
This checklist is determinism-specific. It binds the budget ledger, shaping/policing parameters, measurement contracts, and the regression matrix into three gates: Design, Bring-up, and Production. Each gate requires versioned artifacts, consistency checks, and acceptance-ready evidence.
Design gate:
- Ledger done: per-hop fixed + queue bound + gate-wait bound + ε terms recorded.
- Worst-case defined: scenario IDs + burst assumptions + window length locked.
- Params frozen: GCL + mapping + policing tables frozen with version tag.
- Hard bounds tagged: identify non-negotiable terms and margin ownership.
- Pass template ready: thresholds as “≤ X” with evidence fields defined.
Bring-up gate:
- Per-hop calibration: measured vs inferred vs datasheet terms labeled.
- Tap contract: tap-points declared and consistent across all measurements.
- GCL readback: downloaded schedule matches readback and version tag.
- Counters aligned: window/denominator/object scopes standardized.
- Tail validated: p99/p999 verified under burst & mixed scenario IDs.
- Version binding: ledger ver ↔ config ver ↔ test ver are linked.
Production gate:
- Observability: watermark, gate miss, violations, drops, late, event fields logged.
- Regression set: minimum matrix (idle/full/burst/mixed/fault) is automated.
- Change triggers: topology/mapping/window changes force re-budget + re-accept.
- Rollback tag: a tested rollback version with known acceptance evidence exists.
H2-13 · Applications & IC Selection (Determinism-First)
This section converts latency/jitter modeling into a practical selection method: define determinism targets, map them to TSN mechanisms (Qbv/Qci/observability), then shortlist devices by measurable evidence.
- Applications: determinism targets expressed as p99/p999 + cycle + hop (no protocol deep-dive).
- Selection axes: fixed per-hop latency, tail control, policing protection, observability, and resource boundaries.
- Shortlist examples: TSN switches / TSN MPUs / industrial switch silicon with concrete orderable PNs.
A) Applications (expressed as determinism targets)
Keep the application description strictly measurable: cycle, percentile bounds, “late” rate, and protection against abnormal traffic.
Profile 1 — cyclic control traffic:
- Targets: cycle = X µs, E2E p99 ≤ Y µs, p999 ≤ Z µs, late ≤ A / 10^6 frames.
- Dominant risks: mixed-load bursts, gate schedule drift, priority mapping mistakes.
- Selection must-have: TAS/Qbv, per-queue watermark + gate-miss/late counters.
Profile 2 — shared networks needing per-stream protection:
- Targets: bounded jitter under contention; abnormal traffic must not inflate tail.
- Dominant risks: “bad” flows (misconfigured burst/rate), queue hogging, retry storms.
- Selection must-have: PSFP/Qci (token bucket + violation counters) and stable policing behavior.
Profile 3 — hard p999-bounded scheduling:
- Targets: p999 bound dominates; small periodic tail spikes are unacceptable.
- Dominant risks: schedule beat frequency, guard-band underestimation, egress serialization.
- Selection must-have: deterministic schedule update, strong time-window tooling, per-hop calibration hooks.
Profile 4 — mixed TSN / non-TSN domains:
- Targets: deterministic forwarding inside TSN domain + measurable degradation outside.
- Dominant risks: mixed policies, counter definition drift, silent tail inflation.
- Selection must-have: rich counters (per-queue watermark, late/drop, policing violations) + black-box logging hooks.
B) IC Selection Axes (determinism-first, measurable)
Axis 1 — Fixed per-hop latency (predictable baseline)
- Prefer architectures that keep forwarding path stable under feature enablement.
- Require a clear method to obtain fixed latency: datasheet value, bench calibration, or per-hop measurement.
Axis 2 — Tail control (queue bound + gate-wait bound)
- Qbv/TAS converts uncertainty into a bounded, schedulable wait term.
- Queue tail must be bounded by design: burst assumptions, service rate, queue isolation, and shaping policy.
Axis 3 — Domain protection (Qci / per-stream policing)
- PSFP/Qci prevents abnormal flows from inflating tail for everyone else.
- Must-have evidence: token bucket parameters + violation counters + deterministic drop/mark policy.
Axis 4 — Observability (field-proof determinism)
- Require counters that directly map to the budget terms: queue watermark, gate-miss/late, drops, policing violations.
- Prefer designs that can run PRBS/loopback/traffic tests without changing the determinism path.
Axis 5 — Resource boundaries (scale without “schedule jitter”)
- Check table sizes, number of queues per port, GCL length, and schedule update behavior.
- Operational rule: configuration table version is a controlled artifact (diffable, rollbackable).
C) Example Part Numbers (shortlist by determinism needs)
The part numbers below are orderable examples. Feature subsets vary by variant and software stack; verify Qbv/Qci/observability in the official documentation.
Bucket 1 — Compact TSN switch (fast path + integrated CPU)
- Microchip: LAN9662-I/9MX (TSN switch + CPU), LAN9668/9MX (8-port TSN switch + CPU).
- Why (determinism): TSN scheduling/policing options + rich counters are typically available for field validation.
- Best fit: remote I/O, compact cells, gateway-class TSN islands.
Bucket 2 — Higher port-count / higher bandwidth industrial switching silicon
- Microchip (SparX-5i family): VSC7546TSN-V/5CC (industrial switch class).
- Why (determinism): scale (ports/bandwidth) + TSN mechanisms enable tight p999 targets across more hops.
- Best fit: TSN backbone switches, multi-line aggregation, high-throughput deterministic cells.
Bucket 3 — TSN-capable MPU (integrated L2 switch for endpoints/gateways)
- Renesas: R9A07G084M04GBG#AC0 (RZ/N2L group), R9A07G084M04GBG#BC0 (variant).
- Why (determinism): integrated switching + TSN support is useful when endpoint control and deterministic forwarding are packaged together.
- Best fit: drives, gateways, remote I/O, industrial endpoints requiring tight timing behavior.
Bucket 4 — Automotive-grade TSN switches (when safety/security constraints dominate)
- NXP: SJA1105QELY (AVB/TSN switch family), SJA1110CEL/0Y (SJA1110 family variant).
- Why (determinism): TSN scheduling features + strong platform support can help stabilize per-hop behavior at scale.
- Best fit: automotive/industrial cross-over gateways, harsh environment edge boxes.
Selection rule (determinism-first): choose by measurable evidence — per-hop delay method, queue/gate counters, policing violation counters, and a repeatable test matrix. If any of these are missing, determinism becomes un-auditable in the field.
Diagram — Determinism-First Selection Tree
Inputs: p99/p999 target, cycle time, hop count, mixed-load risk. Outputs: required TSN mechanisms and shortlist buckets.
H2-14 · FAQs (Field Troubleshooting + Acceptance Criteria)
Scope: only long-tail troubleshooting and acceptance wording for latency/determinism (p99/p999/tail), Qbv/Qci, and budget/measurement consistency. Format is fixed per question: Likely cause / Quick check / Fix / Pass criteria.
Average latency looks OK, but p99 blows up — burst or gate miss first?
Likely cause: micro-bursts inflate queue tail, or Qbv gate miss creates periodic late frames.
Quick check: correlate p99/p999 spikes with queue_watermark vs gate_miss/late in the same window (same denominator).
Fix: cap burst (Qci token bucket) and/or adjust GCL slots + guard band until gate misses disappear.
Pass criteria: E2E p99 ≤ X, p999 ≤ X over X minutes; gate_miss = 0; late ≤ X / 10^6 frames.
Qbv enabled but jitter gets worse — GCL beat frequency or guard band too small?
Likely cause: GCL period misaligned with traffic cycle (beat), or guard band underestimates worst-case serialization.
Quick check: look for periodic p999 spikes; check gate_miss clustered at window edges; validate max-frame serialization assumption.
Fix: align GCL period to cycle; increase guard band by X (worst-case serialize + margin).
Pass criteria: no periodic p999 spikes over X minutes; gate_miss = 0; window-edge late = 0.
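The beat check above can be sketched numerically: two periodic processes with periods T1 and T2 realign every T1·T2 / |T1 − T2|, so if the observed p999 spike period matches that value, period alignment (not a wider guard band) is the fix. A minimal sketch with illustrative units:

```python
def beat_period_us(gcl_cycle_us, traffic_cycle_us):
    """Period at which the phase between the GCL and the traffic cycle
    repeats; a slow beat shows up as periodic p999 spikes."""
    if gcl_cycle_us == traffic_cycle_us:
        # phase-locked: no beat; align phases instead of periods
        return float("inf")
    return (gcl_cycle_us * traffic_cycle_us
            / abs(gcl_cycle_us - traffic_cycle_us))
```

A 1 µs mismatch between a 1000 µs GCL cycle and a 1001 µs traffic cycle produces a spike roughly every second — easy to miss in short windows, obvious once the test window exceeds the beat period.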
Low load still “stalls” — window closed too long or priority mapping wrong?
Likely cause: gate-closed duration blocks a critical class, or QoS mapping reset misroutes traffic to BE queue.
Quick check: compare stall timing vs gate state; verify class→queue counters (per-class increments must match expectation).
Fix: shorten closed windows; restore mapping matrix; pin configuration by version + readback diff.
Pass criteria: worst-case wait ≤ X; E2E p99 ≤ X; mapping counters stable within X%.
Same config, different switch silicon worsens tail — fixed latency delta or queue behavior?
Likely cause: different per-hop fixed pipeline latency and/or buffer sharing inflates queue tail under burst.
Quick check: measure per-hop idle baseline vs mixed-load tail; compare queue_watermark vs dequeue stability at same input burst.
Fix: re-calibrate fixed term; re-budget queue bound; adjust queue isolation/shaping assumptions for this silicon.
Pass criteria: updated budget closes with ≥ X margin; per-hop fixed repeatability within ±X; p999 ≤ X.
Field shows occasional “late”, but counters look clean — window definition or tap mismatch?
Likely cause: counter window/denominator hides events, or measurement tap points differ across devices/tools.
Quick check: force same observation window + denominator; run tap-consistency check (same event, same tap definition).
Fix: standardize KPI definition (window/denominator/tap); log raw late events until stable.
Pass criteria: tool-to-tool late rate within ±X; late ≤ X / 10^6 frames under standardized window.
One abnormal node slows the whole network — Qci not blocking or shared-queue HOL?
Likely cause: policing missing/weak for that stream, or shared buffering causes head-of-line blocking across classes.
Quick check: isolate offender via per-stream counters; check policing_violations and cross-queue watermark coupling.
Fix: tighten Qci token bucket; increase isolation (dedicated queue/class) for critical traffic.
Pass criteria: violations ≤ X per X minutes; critical class p99/p999 stays within X; watermark ≤ X%.
Worse only after maintenance — GCL version drift or readback mismatch?
Likely cause: schedule/config changed silently, or readback differs from intended GCL/mapping bundle.
Quick check: compare config version hash vs golden; read back GCL + mapping and diff.
Fix: enforce config-as-code; block deployment if readback diff ≠ 0; add rollback path.
Pass criteria: version hash matches; readback diff = 0; KPIs stable within X over X reboots.
Preemption still appears inside a window — guard band missed serialization time?
Likely cause: guard band omitted worst-case serialization/egress drain time, or edge behavior adds extra fixed delay.
Quick check: compute max-frame serialization; check edge-aligned gate_miss/late and egress occupancy at window open.
Fix: extend guard band by X; if needed, cap max frame size for interfering class.
Pass criteria: window-edge late events = 0; gate_miss = 0; guard-band margin ≥ X.
Port watermark is not high, but tail is bad — congestion is downstream?
Likely cause: true bottleneck is downstream (next hop/egress), so local watermark does not reflect final queueing.
Quick check: compare per-hop percentiles; check downstream utilization bursts + counters; identify where queueing accumulates.
Fix: apply shaping at the true bottleneck hop; re-budget per-hop queue bound; adjust schedule or split traffic.
Pass criteria: bottleneck hop confirmed; downstream watermark ≤ X%; E2E p999 ≤ X.
p99 meets spec, but max occasionally violates — beat frequency or burst injection?
Likely cause: periodic beat creates rare peaks, or occasional out-of-model bursts appear (maintenance scans/retries).
Quick check: test max-violation periodicity; correlate to event logs + burst counters; validate burst assumptions used in bounds.
Fix: align periods to remove beat; tighten admission (Qci) for out-of-model bursts; add regression scenario.
Pass criteria: max ≤ X over X minutes; no periodic max spikes; blocked bursts ≤ X.
After VLAN/QoS changes, determinism collapses — mapping matrix reset?
Likely cause: priority→queue mapping or policer bindings reset, breaking class isolation and bounds.
Quick check: verify mapping + policer binding tables post-change; compare per-class counters before/after.
Fix: apply mapping/policers as an atomic versioned bundle; enforce readback validation after any VLAN/QoS change.
Pass criteria: mapping diff = 0; per-class p99/p999 stable within X; no queue cross-talk events > X.
Enabling mirroring/telemetry makes tail worse — fixed path increase or queue contention?
Likely cause: extra pipeline stages add fixed latency, or telemetry shares resources and increases contention.
Quick check: A/B test telemetry on/off; compare per-hop idle baseline and queue_watermark; see whether p99 shifts or tail expands.
Fix: move telemetry off the critical class, reduce sampling, or isolate it in a dedicated queue; re-budget fixed term if unavoidable.
Pass criteria: telemetry-on still meets p99 ≤ X & p999 ≤ X; baseline delta ≤ X; critical watermark change ≤ X%.