TSN Parameterization: GCLs, Time Slots, PTP Calibration, Guards
TSN parameterization turns “deterministic” from a claim into deployable artifacts: per-port GCLs, slot tables, calibration bindings, and storm guards that all activate at a common T0.
The outcome is measurable: bounded end-to-end latency and jitter within X over Y, with version-consistent schedules and guardrails that prevent misconfiguration from collapsing the network.
H2-1 · Scope Lock & Deliverables
Core idea
Determinism comes from a verifiable parameterization chain: unify the time domain, compile time-slots into GCLs,
validate end-to-end, and keep failures contained with storm guards.
In-scope (this page covers)
Output-driven
- GCL generation: compile gate timelines from flow requirements and topology constraints.
- Time-slot tables: map flows → classes/queues → windows/gate indices with budgets.
- PTP topology calibration: define correction tables, measurement loop, and drift checks.
- Storm guards: prevent mis-parameterization and bursts from collapsing the network.
Out-of-scope (handoff only, no deep dive)
No expansion
- TSN standards deep explainers (Qbv/Qci/etc. as theory) → handoff to TSN Switch/Bridge page.
- PTP protocol internals (servo/BMCA/profile details) → handoff to Timing & Sync / PTP pages.
- PHY/EMC/layout/protection (TVS/CMC/cabling) → handoff to PHY Co-Design & Protection pages.
- Industrial stack certifications (PROFINET/EtherCAT/CIP) → handoff to Industrial Ethernet Stacks pages.
Deliverables (what gets produced)
- Flow inventory: FlowID, period, payload, deadline, priority/class, path hops, endpoints.
- Time-slot table: flow → queue/class → reserved windows → gate index → budget fields.
- Per-port GCL: gate mask + duration + repeat + activation time (schedule ID bound).
- Calibration sheet: per-link/port correction values, method, baseline, drift guard.
- Validation checklist: bring-up ladder + pass/fail criteria (metrics definition locked).
- Storm guard policy: thresholds, actions, logging fields, and recovery time targets.
Pass criteria (placeholders X / definition-first)
Use strict statistics (e.g., P99.9 / P999) instead of averages. Keep the definition identical across lab, factory, and field logs.
- E2E latency bound: P99.9 ≤ X, max ≤ Y (window length and sample period specified).
- Window miss rate: ≤ X per minute (late/early hits split reported).
- Queue safety: peak occupancy ≤ X% and no overflow for critical classes.
- Time-domain health: offset/drift ≤ X under defined measurement window.
- Storm resilience: storm events ≤ X and recovery ≤ Y seconds.
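The strict-statistics rule above can be made concrete with a small sketch. This is a minimal Python illustration, not a normative tool: the function names and the sample values are hypothetical, and the percentile definition (smallest value covering the fraction) must match whatever definition the lab, factory, and field logs actually lock in.

```python
import math

def percentile(samples, p):
    """Strict empirical percentile: smallest value covering fraction p of samples."""
    xs = sorted(samples)
    k = math.ceil(p * len(xs)) - 1  # 0-based rank that covers fraction p
    return xs[max(k, 0)]

def latency_pass(samples_us, p999_bound_us, max_bound_us):
    """Pass criteria: P99.9 <= X and max <= Y over one fixed observation window."""
    return (percentile(samples_us, 0.999) <= p999_bound_us
            and max(samples_us) <= max_bound_us)

# 1000 samples: 999 at 80 us, one outlier at 120 us
samples = [80.0] * 999 + [120.0]
print(latency_pass(samples, p999_bound_us=100.0, max_bound_us=150.0))  # True
```

Note how a single outlier can pass the P99.9 bound but still fail the max bound if Y is tight; both bounds must be reported per window, never averaged across windows.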
Recompile triggers (any of these changes invalidates schedules)
- Topology or link speed changed (added hop, replaced switch, new path).
- Flow list changed (period/payload/deadline/priority/class).
- Time-domain roles/paths changed (GM/BC/TC placement changes).
- Device firmware/timer granularity changed (gate timing resolution, clock accuracy).
Handoff templates (link-only, no repetition)
- Need TSN feature matrices and Qbv/Qci capability mapping → TSN Switch / Bridge.
- Need PTP protocol internals and profiles → Timing & Sync / PTP Hardware Timestamping.
- Need PHY SI/EMC, TVS/CMC, cabling and grounding → PHY Co-Design & Protection.
- Need PROFINET/EtherCAT/CIP certification and stack behavior → Industrial Ethernet Stacks.
Scope Map (outputs at center, handoffs at the edge)
H2-2 · Problem Model (What must be “deterministic”)
Determinism is not a slogan. It is a set of measurable bounds that must map directly to parameters (slots/GCL/calibration/guards)
and to observation points (counters/logs) with pass criteria.
Latency bound (E2E maximum latency)
Slots + GCL
Meaning: critical flows must have a strict upper bound, not just a good average.
Breakers: microbursts, window gaps, gate misalignment, hop-count changes.
Parameterize: slot sizes, window placement, guard bands, per-hop processing assumptions (X).
Observe: E2E latency histogram (P99.9/P999), per-class queue residence (if available).
Pass criteria: E2E P99.9 ≤ X; max ≤ Y (statistics window and duration defined).
Jitter bound (variation control)
Calib + Guard band
Meaning: delay variation must stay within a budget so scheduled windows remain valid.
Breakers: time-domain drift, inconsistent timestamp points, switch processing jitter.
Parameterize: correction tables, drift guard, gate timing granularity, guard bands (X).
Observe: late/early window hits, window-miss counters, offset/drift trend under defined windows.
Pass criteria: miss ≤ X per minute; drift ≤ Y per 10 minutes (definition locked).
Isolation (critical flows are not starved)
Mapping + Policing
Meaning: best-effort traffic must not consume resources reserved for scheduled traffic.
Breakers: wrong class/queue mapping, missing policing, shared windows, burst amplification.
Parameterize: flow→class→queue mapping, per-class limits, background drop/limit policy (X).
Observe: per-queue occupancy, drops by class, deadline-miss counters for critical flows.
Pass criteria: critical miss ≤ X; background traffic limited within Y (same definition everywhere).
Recovery behavior (no collapse under faults)
Storm guards
Meaning: misconfigurations and bursts must be contained so the system degrades gracefully.
Breakers: broadcast storms, loops, retry storms, aggressive re-announce behaviors after maintenance.
Parameterize: storm thresholds (pps/ratio), loop actions, rate limits, logging fields (X).
Observe: storm counters, loop events, recovery time, and class-level drops.
Pass criteria: recovery ≤ X seconds; critical SLA resumes without repeated flapping (Y cycles).
E2E Determinism Box (where slots, gates, calibration, and guards act)
H2-3 · Parameterization Pipeline (Requirements → Deployable Schedules)
Parameterization must behave like a reusable production line: stable inputs produce deployable schedule packages with
explicit self-checks, verification gates, and monitoring/guard hooks.
Inputs → Outputs (data contract)
Versioned
- Inputs: flow inventory (period, payload, deadline, priority/class, endpoints), topology (hops, link rates), time accuracy grade.
- Outputs: per-port GCL + schedule ID/activation time, per-flow slot/queue mapping, calibration sheet, storm guard policy.
Stage A — Requirements (Flow inventory lock)
Inventory vX
Purpose: translate requirements into a computable flow list (not narrative statements).
Inputs: FlowID, period, payload, deadline, class/priority, endpoints, burst allowance (X).
Outputs: Flow inventory vX (immutable during compilation).
Checks: deadline < period; payload < MTU; endpoints fully specified; critical flows flagged.
Failure mode: unlocked requirements cause schedule drift and “whack-a-mole” tuning.
Stage B — Time Model (Scheduling time contract)
Hyper-cycle
Purpose: define usable time for scheduling (cycle, hyper-cycle, guard band, drift budget).
Inputs: time accuracy grade, device clock error budget (X), gate granularity (X), hop processing assumptions (X).
Outputs: time model sheet (hyper-cycle choice + guard band definition).
Checks: guard band ≥ total error budget; hyper-cycle covers all critical periods.
Failure mode: mismatched metric definitions lead to “offset looks small but windows still miss.”
Stage C — Slotting (Flow → Queue → Window mapping)
Slot table
Purpose: allocate windows per hop and bind each flow to a class/queue and gate index.
Inputs: flow inventory, topology hops/link rates, time model (guard band).
Outputs: time-slot table (reserved window, queue, gate index, per-hop budget fields).
Checks: window ≥ serialization + margin; no critical starvation; microburst sensitivity flagged.
Failure mode: low average utilization but periodic stalls due to burst + window gaps.
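The Stage C check “window ≥ serialization + margin” can be sketched as follows. This is an illustrative Python fragment under stated assumptions: the 42-byte overhead (header + FCS + preamble + IPG + VLAN tag) is one common accounting choice, not a standard constant, and the margin term stands in for the guard-band “X” from the time model.

```python
def serialization_time_us(payload_bytes, link_rate_mbps, overhead_bytes=42):
    """Frame-on-wire time. overhead_bytes (assumption): Eth header 14 + FCS 4
    + preamble 8 + IPG 12 + VLAN tag 4 = 42 bytes."""
    bits = (payload_bytes + overhead_bytes) * 8
    return bits / link_rate_mbps  # Mbit/s cancels to microseconds

def window_ok(window_us, payload_bytes, link_rate_mbps, margin_us):
    """Stage C self-check: reserved window >= serialization time + margin."""
    return window_us >= serialization_time_us(payload_bytes, link_rate_mbps) + margin_us

# 100-byte payload on 100 Mbit/s: (100 + 42) * 8 / 100 = 11.36 us on the wire
print(window_ok(window_us=20.0, payload_bytes=100, link_rate_mbps=100, margin_us=5.0))
```

The same function also exposes the Stage C failure mode: a window sized from average load can still fail this worst-case check once the margin includes burst-induced queue push.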
Stage D — GCL Compile (Per-port executable timeline)
Schedule ID
Purpose: translate slot tables into gate masks and durations for each egress port.
Inputs: slot table, queue model, gate granularity (X), activation policy.
Outputs: per-port GCL + schedule ID + activation time.
Checks: overlap/holes, critical window coverage, queue-to-gate consistency.
Failure mode: deployment succeeds but determinism does not improve due to mapping/activation mismatch.
Stage E — Deploy (Atomic switch-over + rollback)
Two-phase
Purpose: ensure slot/GCL/calibration/guards become active together under a single version.
Inputs: schedule package (same version for slot+GCL+calib+guards), activation time.
Outputs: active state readback (active schedule ID, next activation, health flags).
Checks: version convergence, activation convergence, rollback triggers defined.
Failure mode: partial deployment creates random jitter and intermittent deadline misses.
Stage F — Verify → Monitor/Guard (Close the loop)
Field-safe
Purpose: validate pass criteria and prevent failures from turning into storms and collapse.
Inputs: metric definitions (fixed), counters/log fields, storm policies, calibration drift checks.
Outputs: pass/fail report, monitoring baseline, guard actions with recovery targets.
Checks: P99.9 latency, miss rate, queue peaks, drift envelope, storm recovery time.
Failure mode: lab-only success fails in field due to missing guards and version discipline.
Config Pipeline (from requirements to monitored schedules)
H2-4 · Time Domain Basics for Scheduling (Usable Time, Not Ideal Time)
Scheduling requires a unified time vocabulary and a usable-time contract: cycles and hyper-cycles must include guard bands
derived from measurable clock error and drift budgets.
Time reference roles (impact on scheduling)
Impact-only
- GM (Grandmaster): defines the network time reference; GM changes can shift phase and trigger schedule re-validation.
- BC (Boundary Clock): segments the time domain; error budgets accumulate per segment and must be guarded per hop.
- TC (Transparent Clock): exposes path delay corrections; calibration must bind corrections to the actual forwarding path.
Scheduling vocabulary (must be consistent across tools and logs)
- Cycle T: the repeating control period used to reserve windows for critical flows.
- Hyper-cycle: the smallest repeat interval that covers all critical periods (chosen to avoid schedule drift).
- Guard band: reserved margin that absorbs time-domain error and per-hop jitter so windows remain valid.
- Local clock error (X): a measurable budget used to size guard bands; definition must match measurement tools.
Usable-time contract (how error becomes schedule margin)
Guard band must cover the worst-case envelope of: (1) offset/drift within the chosen measurement window, (2) gate timing granularity (X),
(3) per-hop processing variation (X), and (4) burst-induced queueing that can push frames toward window boundaries.
- Define: what counts as late vs. early window hits (split counters, not one mixed number).
- Bind: guard band to hyper-cycle and to the schedule version.
- Revalidate: when topology/roles change or drift escapes the envelope.
Recalibrate / Recompile triggers (scheduling safety)
- GM relocation or role changes (GM/BC/TC placement or path changes).
- Topology changes (hop count, link speed, forwarding path) affecting delay and correction binding.
- Temperature/aging changes pushing drift outside the defined envelope.
- Firmware changes altering clock accuracy, gate timing resolution, or timestamp tap locations.
Global Time & Local Clocks (offset + guard band envelope)
H2-5 · GCL Design Method (From Windows to Gate Timelines)
Treat the GCL as an executable time program. Start from window intent (who must pass, and when), then compile it into per-port gate
masks and durations that remain valid under the defined guard-band envelope.
GCL minimal unit (engineering object)
Atomic step
- Gate mask (open/close): which queues are allowed to transmit in this step.
- Duration: how long the mask remains active (aligned to gate granularity X).
- Repeat binding: how the step repeats inside the cycle / hyper-cycle.
Two hard constraints: (1) durations must respect the implementable granularity (X), and (2) gate
changeover variation must be absorbed by the guard band (X) to avoid window-boundary misses.
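The minimal unit above can be represented directly as data. A hedged Python sketch, vendor-neutral: the 8 ns granularity is an assumed device timer step standing in for the “X” placeholder, and the queue numbering (bit i = queue i) is one common convention, not mandated.

```python
from dataclasses import dataclass

GATE_GRANULARITY_NS = 8  # assumption: device timer step (the granularity "X")

@dataclass(frozen=True)
class GclEntry:
    gate_mask: int    # bit i set => queue i may transmit during this step
    duration_ns: int  # how long the mask stays active

    def __post_init__(self):
        # Hard constraint (1): duration must be an exact multiple of the
        # implementable gate granularity, otherwise the schedule drifts.
        if self.duration_ns % GATE_GRANULARITY_NS != 0:
            raise ValueError("duration not aligned to gate granularity")

cycle = [GclEntry(gate_mask=0b1000_0000, duration_ns=40_000),   # critical queue 7
         GclEntry(gate_mask=0b0111_1111, duration_ns=960_000)]  # best effort
print(sum(e.duration_ns for e in cycle))  # cycle length in ns -> 1000000
```

Rejecting misaligned durations at construction time keeps constraint (1) from ever reaching a device; constraint (2) is enforced later, when guard segments are budgeted into the timeline.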
Step 1 — Lock critical flows and cycles
Inputs: Flow inventory vX (period, payload, deadline, path, class).
Action: separate “must-pass” flows from “best-effort” flows; mark hard deadlines and
per-hop constraints.
Output: critical set + invariants (deadline, path, reserved bandwidth).
Self-check: deadline < period; payload fits MTU; endpoints and hops are explicit.
Step 2 — Choose a hyper-cycle that prevents schedule drift
Inputs: all critical periods (T1..Tn) and gate granularity (X).
Action: pick the smallest repeat interval that cleanly covers critical cycles and aligns
with implementable timing.
Output: hyper-cycle definition + rationale recorded in the time model sheet.
Self-check: all critical windows repeat deterministically without phase creep.
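Step 2 reduces to a least-common-multiple computation over the critical periods. A minimal sketch, assuming periods are already expressed in integer microseconds aligned to the gate granularity; the function name is illustrative.

```python
from math import gcd
from functools import reduce

def hyper_cycle_us(periods_us):
    """Smallest repeat interval that cleanly covers all critical periods (LCM).
    Choosing the LCM ensures every window repeats at a fixed phase, so no
    phase creep (schedule drift) accumulates across cycles."""
    return reduce(lambda a, b: a * b // gcd(a, b), periods_us)

print(hyper_cycle_us([250, 500, 1000]))  # -> 1000
print(hyper_cycle_us([300, 500]))        # -> 1500
```

A practical caution the math hides: nearly coprime periods (say 333 and 1000 µs) explode the hyper-cycle; the usual fix is to round requirements to harmonically related periods before locking the inventory.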
Step 3 — Reserve guard band from measurable error budgets
Inputs: clock error/drift envelope (X), gate changeover variation (X), per-hop variation (X).
Action: allocate guard band to protect window boundaries (late and early sides tracked separately).
Output: guard-band budget table bound to the schedule version.
Self-check: guard band ≥ total envelope for the chosen measurement window.
Step 4 — Allocate windows (critical first, background last)
Inputs: critical set + path hops + serialization time per hop.
Action: place critical windows with explicit guard separation; then fill remaining time for non-critical traffic.
Output: window plan (per-hop reserved windows + queue binding).
Self-check: no overlap across gates; background windows never intrude into protected envelopes.
Step 5 — Compute margins (bottlenecks, queue peaks, overflow risk)
Inputs: window plan + burst assumptions (X) + per-port buffer limits (X).
Action: identify the first congestion point; compute worst-case queue peaks near boundary windows.
Output: per-hop margin sheet (peak occupancy, spill risk, microburst sensitivity flags).
Self-check: critical queue peak ≤ threshold X; no sustained backlog across cycles.
Step 6 — Compile to deployable fields (vendor-neutral)
Field semantics
Inputs: final window plan + time model + guard-band budget.
Action: produce per-port lists of (gate mask, duration, repeat binding) plus schedule ID and activation time.
Output: per-port GCL blocks (mask + duration + binding) + activation metadata.
Self-check: schedule ID consistent across devices; activation time consistent; readback fields defined.
Acceptance (GCL success is measurable)
- Window miss: late and early counters tracked separately; ≤ X per Y minutes.
- Latency: P99.9 and max bounded; ≤ X / Y (system-defined).
- Queue peak: critical class peak occupancy ≤ X% (no overflow risk).
- Consistency: schedule ID and activation time converged network-wide.
- Recovery: after schedule switch, stable within Y seconds without retry storms.
GCL Timeline (0 → T with 3 queues + guard band)
H2-6 · Time-slot Tables (Flow → Window → Queue → Gate Index)
A slot table is the single source of truth that binds requirements to deployable schedules. It must carry input fields,
derived mapping fields, and acceptance fields so verification and monitoring align with the same identifiers.
Slot table field template (vendor-neutral)
Split by type
Input fields (immutable)
FlowID · Period · Deadline · Payload · Endpoints · Path hops · Link-rate class
Derived mapping fields (compile inputs)
Class · Queue · Reserved window (start/len) · Gate index · Hyper-cycle binding · Guard band applied (X)
Acceptance & monitoring fields (must match logs)
Expected hits/cycle (X) · Allowed miss rate (X) · Max queue peak (X) · Counter/log keys (FlowID-bound)
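The three field groups above map naturally onto one record type. A vendor-neutral Python sketch; every field name here mirrors the template, and the sample values are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SlotEntry:
    # Input fields (immutable requirements)
    flow_id: str
    period_us: int
    deadline_us: int
    payload_bytes: int
    # Derived mapping fields (compile outputs)
    traffic_class: int
    queue: int
    window_start_us: int
    window_len_us: int
    gate_index: int
    # Acceptance & monitoring fields (must match log/counter keys)
    allowed_miss_per_min: int
    counter_key: str  # FlowID-bound key used by black-box counters

e = SlotEntry("FLOW-001", 1000, 800, 128, 7, 7, 0, 40, 0, 1, "gate_miss.FLOW-001")
assert e.deadline_us < e.period_us  # Stage A invariant carried into the table
print(e.counter_key)
```

Keeping the counter key inside the same record is what makes the slot table the single source of truth: compilation, pass/fail reports, and field logs all resolve through the same FlowID-bound identifier.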
Scheduling order strategy (who gets placed first)
- 1) Critical control: hard deadlines and must-pass invariants.
- 2) Sync-sensitive flows: jitter-sensitive triggers that must align with the time contract.
- 3) Management with caps: essential but strictly budgeted.
- 4) Diagnostics/maintenance: placed late, protected by limits.
- 5) Video/high-bandwidth background: fill remaining time, never allowed to erode guarded windows.
Exception rules: microbursts that push frames into boundary windows require additional
policing or window splitting; path bottlenecks must never let background peaks bleed into critical envelopes.
Verification alignment (slot table → tests → logs)
- One FlowID: used in schedule compilation, pass/fail reports, and black-box counters.
- Two-sided misses: late and early window misses tracked separately for root-cause direction.
- Per-hop evidence: queue peaks and window hits recorded at the first congestion point.
Flow → Queue → Gate Mapping (3-layer engineering binding)
H2-7 · Guard Bands & Budgeting (Make Jitter Measurable and Additive)
Determinism fails in the field when guard bands are underspecified, budgets omit key contributors, or statistics use mixed
denominators and windows. This section turns guard bands into a versioned engineering budget that compiles into slot tables
and per-port GCL timelines.
Why a TSN network can still show jitter
Root cause pattern
- Budget gaps: only sync error is considered; switch variation and microbursts are ignored.
- Mixed statistics: averages for offset combined with max for delay; mismatched time windows.
- No compilation path: guard bands exist on paper but not as fields in slot tables and GCL.
Guard band sources (boundary-moving contributors)
Each item below must map to a budget field (X) and an evidence counter (late/early) to keep the schedule verifiable.
Time reference error
Sync/offset envelope that shifts windows early/late (field: sync_error_envelope = X).
Switch residence-time variation
Per-hop forwarding variation and queue boundary effects (field: switch_var_per_hop = X).
Serialization time
Frame-on-wire time that consumes window length (field: ser_time = X).
Microbursts / background push
Boundary queue push from bursty background traffic (field: burst_queue_push = X).
Gate granularity & changeover
Implementable timing step and switching overhead (fields: gate_granularity = X,
gate_changeover = X).
Additive worst-case budgeting (per hop → end-to-end)
Use one consistent statistics contract: the same observation window (Y) and the same bound type (envelope/max or P99.9).
Track early and late separately to preserve directionality.
Budget template (placeholders)
per_hop_envelope = sync_component(X) + switch_var(X) + gate_component(X) + burst_component(X) + ser_component(X)
end_to_end_envelope = Σ(per_hop_envelope) across hops
Guard band should be budgeted with an explicit cost field (capacity/time consumed) to prevent over-allocation that triggers new congestion.
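The budget template above is directly executable. A minimal sketch: the per-hop component names follow the template’s placeholders, and the sample magnitudes are illustrative, not measured values.

```python
def per_hop_envelope_us(sync_us, switch_var_us, gate_us, burst_us, ser_us):
    """One hop's worst-case boundary envelope, additive per the budget template."""
    return sync_us + switch_var_us + gate_us + burst_us + ser_us

def end_to_end_envelope_us(hops):
    """Sum per-hop envelopes across the path; the guard band must cover this."""
    return sum(per_hop_envelope_us(**h) for h in hops)

# 3 identical hops, each contributing 17.9 us in the worst case
hops = [dict(sync_us=1.0, switch_var_us=2.0, gate_us=0.5, burst_us=3.0, ser_us=11.4)] * 3
e2e = end_to_end_envelope_us(hops)
print(round(e2e, 1), e2e <= 60.0)  # envelope vs. an assumed 60 us guard budget
```

The one-statistics-contract rule applies here too: every component fed into `per_hop_envelope_us` must use the same window (Y) and the same bound type, or the sum is meaningless.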
Where guard bands must land (so schedules remain testable)
1) Time model sheet
hyper-cycle · gate_granularity (X) · sync_error_envelope (X) · global GB policy (versioned)
2) Slot table fields
GB_before_window (X) · GB_after_window (X) · envelope_source_refs · early/late counter keys
3) Per-port GCL compilation
guard segments encoded as explicit timeline gaps or safe-mask periods; schedule ID binds to the budget version.
Jitter Budget Waterfall (stacked blocks, additive per hop)
H2-8 · PTP Topology Calibration (Turn Time Into a Scheduling Baseline)
Calibration is not protocol theory. It is a closed-loop process that measures topology-dependent biases, computes a correction
table, applies it with schedule-version binding, validates residual error, and monitors drift so scheduling stays aligned over time.
Calibration targets (what must be corrected)
Link asymmetry
Direction-dependent delay that shifts window alignment; increases required guard band if not corrected.
Forwarding path differences
Topology- and configuration-dependent residence paths that create port-specific correction needs.
Timestamp tap bias
Bias from timestamp placement that can make offset look good while egress/ingress alignment remains wrong.
Correction table fields (per link / per port)
Versioned artifact
Identity & binding
LinkID · PortID · Direction · Path signature · schedule_version
Correction payload (placeholders)
asym_correction (X) · path_delay_correction (X) · timestamp_bias (X)
Validity & revalidation
valid_from_time · confidence (X) · next_check_interval · drift_threshold (X)
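The correction-table fields can be carried as one record with the drift guard attached. An illustrative Python sketch; the numeric values and the `needs_recalibration` helper are hypothetical, and the field names mirror the template above.

```python
from dataclasses import dataclass

@dataclass
class CorrectionEntry:
    link_id: str
    port_id: str
    direction: str          # "tx" or "rx": asymmetry is direction-dependent
    schedule_version: str   # binding: correction is only valid for this schedule
    asym_correction_ns: int
    timestamp_bias_ns: int
    drift_threshold_ns: int  # the drift guard "X": recalibrate beyond this

    def needs_recalibration(self, measured_drift_ns: int) -> bool:
        return abs(measured_drift_ns) > self.drift_threshold_ns

c = CorrectionEntry("L1", "sw0/p2", "tx", "v42", 120, -35, 200)
print(c.needs_recalibration(250), c.needs_recalibration(150))  # True False
```

Binding `schedule_version` into the record is what lets the deploy stage refuse a correction table compiled against a different schedule.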
Calibration SOP (baseline → correction → closed-loop validation)
- Baseline measure: repeatable per link/port evidence within window Y (stability ≤ X).
- Compute correction: generate correction table vX with confidence and applicability bounds.
- Apply: bind to schedule_version; record activation metadata and readback expectations.
- Validate: measure residual error and window-hit stability; ensure early/late miss drops.
- Drift monitor: alarm when drift exceeds threshold X; trigger recalibration and recompilation.
Calibration Loop (Measure → Correct → Apply → Validate → Drift Monitor)
H2-9 · TSN Deployment & Versioning (Consistency, Rollback, and Evidence)
Field failures frequently come from configuration inconsistency: half the network activates a new schedule while the rest runs an old one,
or GCL/slot/calibration artifacts do not share the same version binding. This section defines a versioned artifact contract and an atomic
two-phase activation process with minimal observability fields for audit and rollback.
Common failure pattern: partial or mismatched activation
Consistency risk
- Half-updated network: some devices activate schedule N while others stay on N-1.
- Broken version binding: GCL changes without slot table or calibration table alignment.
- No evidence fields: the running schedule cannot be proven from device readbacks.
Version binding contract (artifact set = one schedule)
Hard rule
One schedule_version must bind all artifacts as a single artifact_set_id:
per-port GCL, slot table, time model, and correction table. The binding enables deterministic checks before activation.
Artifact binding fields (placeholders)
schedule_version · artifact_set_id · gcl_hash (X) · slot_hash (X) · calib_hash (X) · build_time · issuer
A device should refuse activation when artifact_set_id or hashes do not match the staged bundle.
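The binding rule can be sketched as a hash-of-hashes over the artifact blobs. This is an illustrative construction under stated assumptions: SHA-256 and the 16-hex-digit truncation are arbitrary choices here, and real devices would compare full digests per artifact as well as the combined ID.

```python
import hashlib, json

def artifact_set_id(gcl_blob: bytes, slot_blob: bytes, calib_blob: bytes) -> str:
    """Bind all artifacts of one schedule_version into a single verifiable ID."""
    h = hashlib.sha256()
    for blob in (gcl_blob, slot_blob, calib_blob):
        h.update(hashlib.sha256(blob).digest())  # hash-of-hashes binding
    return h.hexdigest()[:16]

def refuse_if_mismatched(staged_id: str, declared_id: str) -> bool:
    """Device-side rule: activate only when the staged bundle matches."""
    return staged_id == declared_id

gcl = json.dumps({"port1": [[128, 40000]]}).encode()
slot, calib = b"slot-table-v42", b"calib-v42"
set_id = artifact_set_id(gcl, slot, calib)
# Swapping in a stale calibration table changes the set ID -> refuse
print(refuse_if_mismatched(artifact_set_id(gcl, slot, b"calib-v41"), set_id))  # False
```

Because any single stale artifact changes the combined ID, a half-updated bundle cannot silently activate.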
Two-phase activation SOP (shadow → validate → activate at T0)
Atomic activation is achieved by staging a shadow schedule, verifying cross-artifact consistency, and switching at a future activation time (T0)
expressed in the shared time base. This avoids mixed-version windows across the network.
Phase 1 — Stage (shadow)
Download artifacts into a non-active area; device returns: shadow_schedule_id · shadow_ready · shadow_hash_ok.
Phase 1.5 — Validate (pre-activation checks)
- GCL sanity: overlap, invalid indices, excessive gaps, critical starvation.
- Slot table sanity: deadline vs. window, queue class consistency, path binding.
- Calibration binding: schedule_version match and path signature match.
Phase 2 — Activate at T0
Switch at a future time point so all devices can align; activation must be provable via readbacks:
active_schedule_id · next_activation_time.
Engineering rule: choose T0 later than the slowest stage completion plus a safety margin X.
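The T0 rule and the convergence proof can be sketched in a few lines. Illustrative only: the time base is an abstract shared clock, and the readback field name follows the observability contract below, not any specific device API.

```python
def choose_t0(stage_done_times, safety_margin_s):
    """Engineering rule: T0 must be later than the slowest stage completion
    plus a safety margin (the "X" above), in the shared time base."""
    return max(stage_done_times) + safety_margin_s

def all_converged(readbacks, expected_schedule_id):
    """Activation is provable only if every device reads back the same ID."""
    return all(r["active_schedule_id"] == expected_schedule_id for r in readbacks)

t0 = choose_t0(stage_done_times=[100.0, 104.5, 102.0], safety_margin_s=5.0)
readbacks = [{"active_schedule_id": "v42"}, {"active_schedule_id": "v42"}]
print(t0, all_converged(readbacks, "v42"))  # 109.5 True
```

If `all_converged` is false after T0 plus a settle time, the network is in exactly the mixed-version state the rollback triggers below are meant to catch.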
Minimal observability fields + rollback conditions
Minimum fields (field-readback contract)
active_schedule_id · next_activation_time · shadow_schedule_id · shadow_ready · artifact_set_id · version_mismatch_counter · early_miss_counter · late_miss_counter · rollback_reason_code
Rollback triggers (placeholders X)
- version_mismatch_counter rises beyond X within Y minutes
- late_miss_counter exceeds X for critical flows during Y minutes
- critical queue drop/overrun exceeds X
- drift exceeds threshold X causing correction invalidation
Two-phase Activate (Stage → Validate → Activate@T0 → Monitor → Rollback)
H2-10 · Storm Guards (Make Wrong Parameters Non-Fatal)
Storm guards are the safety fuse for parameterization. When maintenance introduces a loop, when broadcast/multicast grows unexpectedly,
or when a schedule has a bad gap or starvation issue, guards keep deterministic traffic protected while logging evidence for forensics and alerting.
Storm threat model (field triggers)
- Loop introduction: miswiring or unintended bridging after maintenance.
- Broadcast/multicast growth: device fault, misconfig, or discovery storms.
- Unknown / non-critical flooding: tools, mirrors, or uncontrolled best-effort bursts.
- Schedule-induced storms: long gaps or starvation can amplify retries and queue push.
Guard mechanisms (engineering-deployable)
Broadcast / multicast thresholds
Dual-threshold triggers: pps_threshold = X and share_threshold = X% (per port / per class). Record hit counters for source localization.
Loop guard isolation
On loop signatures, isolate the suspicious port into a safe profile (critical queues only) or block with alert; log reason codes (X).
Rate limit / policing
Enforce hard ceilings on non-critical classes at ingress to prevent queue fill; preserve critical traffic by class-aware metering.
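The dual-threshold trigger above is simple enough to state exactly. A minimal sketch; the threshold values are the per-port/per-class “X” placeholders, and the function name is hypothetical.

```python
def storm_triggered(bc_pps, total_pps, pps_threshold, share_threshold_pct):
    """Dual-threshold trigger: absolute broadcast/multicast rate OR its share
    of port traffic. Either condition alone fires the guard."""
    share_pct = 100.0 * bc_pps / total_pps if total_pps else 0.0
    return bc_pps > pps_threshold or share_pct > share_threshold_pct

# 8000 broadcast pps out of 20000 total = 40% share:
# below the absolute threshold but above the share threshold -> triggered
print(storm_triggered(8000, 20000, pps_threshold=10000, share_threshold_pct=25.0))
```

The share condition is what catches storms on lightly loaded ports, where an absolute pps threshold alone would stay silent until the burst is already large.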
GCL sanity checks (pre-activation)
- window overlap and overflow beyond cycle
- too-long gaps (excessive empty windows)
- critical starvation (no service window)
- missing guard band segments
- invalid gate index references
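A subset of these pre-activation checks can be sketched against the (gate mask, duration) representation. Illustrative and simplified: it assumes sequential entries on one port (so overlap reduces to cycle coverage), eight queues, and violation codes that are named here for readability only.

```python
def gcl_sanity(entries, cycle_ns, critical_mask):
    """Pre-activation checks on one port's GCL.
    entries: list of (gate_mask, duration_ns) executed back-to-back.
    critical_mask: queues that must receive at least one service window.
    Returns a list of violation codes (empty = pass)."""
    issues = []
    total = sum(d for _, d in entries)
    if total > cycle_ns:
        issues.append("overflow_beyond_cycle")
    if total < cycle_ns:
        issues.append("gap_in_cycle")          # uncovered (empty) time
    if not any(mask & critical_mask for mask, _ in entries):
        issues.append("critical_starvation")   # no service window at all
    if any(mask >> 8 for mask, _ in entries):
        issues.append("invalid_gate_index")    # only queues 0..7 exist
    return issues

entries = [(0b1000_0000, 40_000), (0b0111_1111, 960_000)]
print(gcl_sanity(entries, cycle_ns=1_000_000, critical_mask=0b1000_0000))  # []
```

Consistent with the zero-tolerance rule in the regression section, a non-empty result must refuse activation rather than log a warning.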
Evidence & alert minimal fields (for field forensics)
Minimal logging fields (placeholders)
drop_reason_code (X) · top_talker_port (X) · alert_severity (X) · pps/share counters · police_drop_count · loop_guard_hits · active_schedule_id
Storm Guard Pipeline (Classify → Meter/Policer → Queue/Gate → Drop/Log → Alert)
H2-11 · Bring-up & Validation Checklist (Design → Bring-up → Production)
This checklist turns parameterization artifacts (GCL, slot tables, calibration tables, guard policies) into a measurable acceptance flow.
The goal is to prove: (1) version-consistent activation, (2) deterministic bounds,
and (3) non-fatal behavior under faults.
Design-time lock (do not enter bring-up without these)
Gate
A. Time model & hyper-cycle lock
- Hyper-cycle selection and scheduling granularity (X).
- Guard band policy version (gb_policy_id) and intended safety margin X.
- Activation time semantics: future T0 selection rule (T0 ≥ slowest-stage + margin X).
B. Critical flow manifest lock
FlowID · period · deadline · payload · class · path hops · reserved window · queue · gate index
Evidence: flow_manifest_hash (X), critical_flow_count, path_signature (X). Any manifest change must trigger regression.
C. Artifact binding (hard rule)
One schedule_version binds: per-port GCL + slot table + correction table + storm guard policy as a single artifact_set_id.
Evidence: artifact_set_id · gcl_hash (X) · slot_hash (X) · calib_hash (X). Mismatch → activation must be refused.
Bring-up ladder (single link → multi-hop → full load → fault injection)
Step 1 — Single link
Goal: prove the pipeline works on the smallest topology.
Evidence: active_schedule_id · next_activation_time · gate_miss_counter · early/late hits.
Pass: window misses ≤ X over Y minutes; active_schedule_id stable.
Step 2 — Single switch
Goal: validate per-port GCL correctness and activation alignment across ports.
Evidence: per-port gate_miss_counter · drops_by_class · version_mismatch_counter.
Pass: mismatch_counter does not grow; critical class drops ≤ X.
Step 3 — Multi-hop
Goal: confirm end-to-end envelope matches the additive budget.
Evidence: early/late hits per hop · budget_version · envelope (X).
Pass: worst-case envelope ≤ budget (X) with explainable margin.
Step 4 — Full load
Goal: prove isolation (critical flows cannot be starved by background traffic).
Evidence: queue_occupancy_peak · drops_by_class · policer_drop_count.
Pass: critical class drop = 0 (or ≤ X); background policing hits but does not impact critical windows.
Step 5 — Fault injection (parameterization scope only)
- Simulate mixed-version activation: force one node to stay on N-1; verify mismatch_counter triggers guard/rollback criteria.
- Increase background bursts: verify policer thresholds (X) and that critical windows remain served.
- Introduce loop-like storm signatures: verify loop guard isolation (safe profile) and reason-code logging.
Pass: network remains operational; evidence fields show the triggered guard and the active schedule remains provable.
Production acceptance & automated regression (triggered every change)
Regression triggers
Firmware version change · topology change (ports/hops) · schedule_version change · correction table update · guard policy update
Regression content (parameterization-only)
- Two-phase activate: stage → validate → activate@T0 → verify active_schedule_id alignment.
- GCL sanity: overlap/gaps/starvation/invalid index must be zero-tolerance.
- Slot mapping: critical deadlines must match assigned windows (X margin).
- Budget: additive envelope checks must pass with budget_version tied to artifact_set_id.
- Storm guards: policer thresholds and reason-coded drops must be observable and stable.
Reference BOM examples (concrete part numbers)
Examples
These are commonly used devices for TSN/1588 bring-up platforms and lab validation. Verify the required TSN profile (Qbv/Qci/Qbu/Qav), port count, and timestamp architecture for the target system.
TSN-capable switch / bridge IC examples
- NXP SJA1105 (TSN switch family, automotive/industrial references)
- NXP SJA1110 (TSN switch family, higher integration variants)
- Microchip LAN9662 (TSN switch class reference platform component)
- Microchip LAN9698 (multi-port switch family commonly used in managed designs)
- Marvell 88Q5050 (automotive Ethernet switch class used in TSN ecosystems)
Endpoint MAC/SoC examples used for TSN validation
- NXP LS1028A (ENETC-capable networking SoC often used for TSN gateways)
- TI AM65x / AM64x (industrial SoC families commonly used with TSN-capable Ethernet subsystems)
- Renesas RZ/N2 (industrial networking SoC class used in real-time Ethernet ecosystems)
Timing / clock IC examples for PTP/holdover lab setups
- Silicon Labs Si5341 (jitter attenuator / clock generator class used in sync designs)
- Silicon Labs Si5345 (high-performance clock generator family)
- Microchip ZL30772 (network synchronization clock IC family)
- Renesas (IDT) 82P33731 (sync/clock IC used in timing architectures)
- Renesas 8A34001 (timing/synchronization IC family used in timing systems)
Hardware timestamp NIC examples for capture & correlation
- Intel I210-AT (widely used NIC with hardware timestamping in timing labs)
- Intel I225 (multi-gig NIC family used in modern validation benches)
Bring-up Ladder (progressive validation path)
H2-12 · Monitoring & Field Service (Parameterization Scope Only)
Monitoring is restricted to schedule/slot/calibration/guard/budget layers. When physical-layer or EMC indicators dominate, hand off to the relevant PHY/EMC pages.
The purpose here is to define the minimum counters, the black-box snapshot, and the symptom-to-fix routes that stay within parameterization.
Must-have counters (collection baseline)
Queue & gate
queue_occupancy (avg/peak) · gate_miss_counter (per port/queue) · early_window_hits · late_window_hits
Drops & policing
drops_by_class · policer_drop_count · drop_reason_code (X) · top_talker_port (X)
Consistency & activation evidence
active_schedule_id · next_activation_time · artifact_set_id · version_mismatch_counter · rollback_reason_code
Black-box snapshot fields (for correlation and replay)
- temperature (correlation only) · CPU_load · memory_pressure (optional)
- schedule_version · artifact_set_id · last_activation_time · next_activation_time
- drift_indicator (X) · correction_table_version · budget_version
- topology_signature (X) (port map + hop count summary)
Symptom → Counters → Suspect layer → Fix action (no cross-page expansion)
1) Latency spikes only under load
Counters: queue_occupancy_peak · drops_by_class · policer_drop_count → Suspect: slot table / queue mapping →
Fix: re-slot critical flows first, move background to later windows, and enforce ingress policing (X).
2) Jitter grows after a topology change
Counters: late_window_hits ↑ · drift_indicator (X) → Suspect: calibration table binding →
Fix: re-measure baseline, update correction table, bind to schedule_version, re-activate at a future T0.
3) Network behaves as if running mixed versions
Counters: version_mismatch_counter ↑ · active_schedule_id differs across nodes → Suspect: deployment/activation →
Fix: enforce two-phase activation (stage→validate→activate@T0), refuse mismatched artifact_set_id.
4) Periodic misses at stable load
Counters: early/late hits oscillate with a period → Suspect: guard band budget →
Fix: increase guard band (X), unify budget accounting across hops, recompile GCL with updated windows.
5) Critical flow starvation appears
Counters: critical queue occupancy ↑ · critical drops > 0 → Suspect: GCL window allocation →
Fix: detect starvation in GCL sanity checks, re-allocate windows, enforce a minimum service window for critical class.
6) Broadcast/multicast storm symptoms
Counters: pps/share thresholds hit · drops_by_class spikes → Suspect: storm guard thresholds →
Fix: apply policers on non-critical traffic, enable loop guard isolation, log drop_reason_code (X) and alert severity (X).
7) After maintenance, schedule flaps
Counters: next_activation_time changes unexpectedly · rollback_reason_code present → Suspect: activation governance →
Fix: lock activation to controlled T0 windows; require staged validation; freeze schedule unless an approved change ticket exists.
8) Counters clean, but physical errors dominate
If schedule evidence is consistent and misses are not rising, hand off to the PHY/EMC pages for physical-layer acceptance.
This page remains limited to slot/GCL/calibration/guard/budget actions.
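Several of the fixes above (symptoms 4 and 5 in particular) depend on a GCL sanity pass that rejects overlapping windows and critical-class starvation before activation. A minimal sketch, assuming windows are expressed as (start, duration, queue) tuples in nanoseconds within one cycle; `gcl_sanity` is a hypothetical helper, not a vendor API:

```python
def gcl_sanity(windows, cycle_ns, critical_queue, min_service_ns):
    """Check one port's gate windows before activation.

    windows: list of (start_ns, duration_ns, queue) tuples.
    Returns a list of human-readable issues; empty means pass.
    """
    issues = []
    # Sort by start time and check adjacent windows for overlap.
    spans = sorted((s, s + d, q) for s, d, q in windows)
    for (s1, e1, _), (s2, e2, _) in zip(spans, spans[1:]):
        if s2 < e1:
            issues.append(f"overlap at t={s2}ns")
    # Enforce a minimum service window for the critical class.
    critical_total = sum(d for s, d, q in windows if q == critical_queue)
    if critical_total < min_service_ns:
        issues.append(f"starvation: critical service {critical_total}ns "
                      f"< {min_service_ns}ns")
    # Windows must fit inside the cycle.
    if spans and spans[-1][1] > cycle_ns:
        issues.append("window exceeds cycle length")
    return issues
```

Running this check at compile time (not after download) keeps a starving schedule from ever reaching the staged state.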
Field service & observability kit (concrete part numbers)
Example components often used to implement counter collection, timestamp capture, and controlled traffic generation in the field or lab. Select equivalents as needed.
Switch for mirroring/telemetry & schedule evidence
Microchip LAN9662 / LAN9698 · NXP SJA1105 / SJA1110 · Marvell 88Q5050
Timestamp capture NIC (PC/edge probe)
Intel I210-AT · Intel I225
Timing reference / holdover clock IC (for controlled sync tests)
Silicon Labs Si5341 / Si5345 · Microchip ZL30772 · Renesas 82P33731 / 8A34001
H2-13 · FAQs (Field Troubleshooting — Parameterization Scope)
Scope lock: these FAQs only cover schedule activation, GCL/slot mapping, guard bands & budgeting, calibration binding, and storm guards.
If physical-layer/EMC indicators dominate while schedule evidence is consistent, hand off to the PHY/EMC acceptance pages.
X = threshold
Y = time window
Z = consecutive activations
1) GCL download reports success, but determinism does not improve — check activation time or queue mapping first?
Likely cause: the schedule was staged but never activated at a consistent T0, or slot/queue mapping does not match the intended class-to-window plan.
Quick check: compare active_schedule_id + artifact_set_id across nodes; verify next_activation_time is in the future and aligned; confirm critical flows map to the expected queue and gate index.
Fix: enforce two-phase activation (stage → validate → activate@T0); regenerate slot/queue mapping for critical flows and recompile per-port GCL with the corrected mapping.
Pass criteria: active_schedule_id identical on all nodes for Z activations; gate_miss_counter ≤ X over Y.
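The two-phase activation gate described in the fix can be sketched as a fleet-wide precondition check, assuming each node reports its staged artifact_set_id and its staged T0 (the `can_activate` helper and the dict shape are hypothetical):

```python
def can_activate(nodes, now_ns, min_lead_ns):
    """Two-phase activation gate.

    nodes: list of dicts with 'staged_artifact_set_id' and 'staged_t0_ns'.
    All nodes must have staged the identical artifact set, agree on T0,
    and T0 must lie far enough in the future to distribute the command.
    """
    ids = {n["staged_artifact_set_id"] for n in nodes}
    if len(ids) != 1:
        return False, "refuse: mismatched artifact_set_id"
    t0s = {n["staged_t0_ns"] for n in nodes}
    if len(t0s) != 1:
        return False, "refuse: T0 not aligned across nodes"
    if next(iter(t0s)) - now_ns < min_lead_ns:
        return False, "refuse: T0 too close (insufficient lead time)"
    return True, "activate at T0"
```

Refusing on a mismatched artifact_set_id here is what prevents the "mixed versions" symptom from ever entering the active state.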
2) Critical flow times out sporadically though bandwidth looks sufficient — guard band too small or calibration drift?
Likely cause: late window hits from insufficient guard band, or correction table drift/binding mismatch after topology/temperature changes.
Quick check: trend late_window_hits vs drift_indicator; verify correction_table_version is bound to the current schedule_version; confirm budget_version matches artifact_set_id.
Fix: increase guard band by worst-case additive terms (X); re-measure baseline, update correction table, bind it to the same schedule_version, and re-activate at a future T0.
Pass criteria: late_window_hits ≤ X and gate_miss_counter ≤ X over Y, with correction_table_version stable for Z activations.
3) After an upgrade, random jitter increases — inconsistent versions or slot table recompiled differently?
Likely cause: mixed schedule_version/artifact_set_id in the fleet, or a non-deterministic build path changed slot allocation (gate indices/windows) across targets.
Quick check: audit artifact_set_id distribution; compare slot_hash + gcl_hash vs the expected release; check whether early/late hits changed after activation.
Fix: enforce atomic rollout (stage → validate → activate@T0) and refuse mismatched artifact_set_id; rebuild slot/GCL from a locked manifest and deterministic compiler inputs (same hyper-cycle, granularity, policies).
Pass criteria: version_mismatch_counter = 0 over Y; early/late hits within X per Y.
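The locked-manifest audit in the fix can be sketched as a hash comparison, assuming slot_hash and gcl_hash are truncated SHA-256 digests of the compiled slot table and per-port GCL bytes (helper names are hypothetical):

```python
import hashlib

def artifact_hashes(slot_table_bytes: bytes, gcl_bytes: bytes) -> dict:
    """Compute short content hashes for the compiled artifacts."""
    def short(b: bytes) -> str:
        return hashlib.sha256(b).hexdigest()[:16]
    return {"slot_hash": short(slot_table_bytes), "gcl_hash": short(gcl_bytes)}

def audit(nodes: dict, manifest: dict) -> list:
    """Return node IDs whose deployed hashes deviate from the locked manifest.

    nodes: {node_id: {"slot_hash": ..., "gcl_hash": ...}}
    manifest: the expected hash pair from the release manifest.
    """
    return [nid for nid, hashes in nodes.items() if hashes != manifest]
```

Hashing the compiled bytes (rather than comparing version strings) also catches the non-deterministic-build case, where two nodes carry the same version label but different gate indices.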
4) At high traffic, periodic congestion appears — window overlap or background not rate-limited?
Likely cause: GCL windows overlap or leave harmful gaps, and background traffic is not policed, creating microbursts that collide with critical windows.
Quick check: run GCL sanity (overlap/gaps/starvation); correlate queue_occupancy_peak with policer_drop_count; review drops_by_class for non-critical classes.
Fix: recompile GCL to remove overlap and enforce minimum critical service; apply ingress policers for non-critical classes and relocate background windows away from critical windows.
Pass criteria: queue_occupancy_peak ≤ X and critical drops = 0 over Y; policer_drop_count present only on non-critical classes.
5) One port consistently “lags behind” — per-port correction differs or gate index misaligned?
Likely cause: per-port correction table entry is inconsistent, or the port is running a different gate index set than expected (wrong per-port GCL).
Quick check: compare per-port gate_miss_counter + late_window_hits; verify per-port correction_table_version and GCL hash; confirm the port’s queue-to-gate mapping indices.
Fix: re-bind correction entries to schedule_version; redeploy the correct per-port GCL; add a pre-activation consistency check that rejects ports with mismatched hashes.
Pass criteria: port-to-port delta of late_window_hits ≤ X over Y, and per-port hashes match the target artifact_set_id.
6) PTP offset looks small, but slots still miss — inconsistent timestamp tap point or measurement definition mismatch?
Likely cause: the timestamp tap point differs across nodes/ports, or offset is computed with a window/denominator that hides local phase error relevant to gate timing.
Quick check: compare early_window_hits/late_window_hits against “small offset”; verify the tap-point configuration is consistent; validate correction_table_version and drift_indicator behavior during the same capture window.
Fix: standardize the tap point (single chosen reference) and update the correction table; align measurement windows used for offset vs gate misses; re-activate at a future T0 with the unified definition.
Pass criteria: early_window_hits + late_window_hits ≤ X over Y, with tap-point config identical on all nodes.
7) After maintenance/hot work, broadcast storms start — loop introduced or storm thresholds too permissive?
Likely cause: loop-like forwarding behavior plus insufficient storm guard thresholds; non-critical traffic is not policed and consumes gated service windows.
Quick check: confirm storm signatures in drop_reason_code; check policer_drop_count vs drops_by_class; verify loop guard state and whether safe profile was engaged.
Fix: enable loop guard isolation strategy; apply ingress storm policers for broadcast/multicast; tighten thresholds and ensure reason-coded logging and alerting are enabled.
Pass criteria: storm-related drop_reason_code count ≤ X over Y, and critical class drops remain 0.
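The threshold policy above can be sketched as a per-interval decision function, assuming per-port samples of packet rate and broadcast/multicast share. The function name, action strings, and reason codes are illustrative placeholders for the policy's X thresholds:

```python
def storm_guard(pps: float, broadcast_share: float,
                pps_limit: float, share_limit: float):
    """Evaluate one sampling interval for a port.

    Returns (action, reason_code). Reason codes feed drop_reason_code
    logging; actions mirror the escalation ladder in the fix above.
    """
    if pps > pps_limit and broadcast_share > share_limit:
        # Both signatures at once: treat as loop-like, isolate the port.
        return "isolate_port", "STORM_PPS_AND_SHARE"
    if pps > pps_limit:
        return "police_noncritical", "STORM_PPS"
    if broadcast_share > share_limit:
        return "police_broadcast", "STORM_SHARE"
    return "none", ""
```

Escalating from policing to isolation only when both signatures fire keeps a single noisy talker from tripping loop isolation unnecessarily.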
8) Diagnostic traffic starves control traffic — class mapping wrong or policing missing?
Likely cause: incorrect class-to-queue mapping or missing ingress policing allows diagnostic bursts into the same or higher priority window/queue.
Quick check: compare drops_by_class and queue_occupancy_peak between control and diagnostic classes; confirm policer counters exist for diagnostic class; validate slot table maps control flows to reserved windows.
Fix: remap diagnostic class to a lower queue/window; apply policers (pps/bitrate) on diagnostics; re-slot control windows earlier and enforce minimum service for control class.
Pass criteria: control class drops = 0 and control queue_occupancy_peak ≤ X over Y, while policer_drop_count increases only for diagnostic class.
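The mapping and policing preconditions in the fix can be checked mechanically before activation. A minimal sketch, assuming the slot table records a queue number per class and that higher queue numbers mean higher priority (a common but not universal convention; the helper name is hypothetical):

```python
def check_class_isolation(slot_table: dict, policers: dict,
                          control_class: str, diag_class: str) -> list:
    """Verify control traffic is isolated from diagnostic bursts.

    slot_table: {class_name: {"queue": int, ...}}
    policers:   {class_name: policer config} for classes with ingress policing.
    Returns a list of problems; empty means the preconditions hold.
    """
    problems = []
    # Diagnostics must sit strictly below control in queue priority.
    if slot_table[diag_class]["queue"] >= slot_table[control_class]["queue"]:
        problems.append("diagnostic class mapped at/above control queue")
    # Diagnostic bursts must be rate-limited at ingress.
    if diag_class not in policers:
        problems.append("no ingress policer configured for diagnostic class")
    return problems
```

Run this as part of the same pre-activation gate as the version checks, so a remapping mistake is refused rather than discovered as starvation in the field.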
9) Same configuration becomes unstable on a different switch — processing latency differs or gate granularity mismatched?
Likely cause: the schedule assumes a gate granularity or per-hop envelope that does not hold on the new switch; guard band/budget is under-modeled for its processing profile.
Quick check: compare per-hop late_window_hits and gate_miss_counter before/after swap; verify configured granularity and hyper-cycle; confirm budget_version and additive terms.
Fix: increase guard band and re-slot windows to tolerate the new per-hop envelope; adjust gate granularity/hyper-cycle to a legal, supported step size; recompile GCL and re-activate at T0.
Pass criteria: worst-case envelope ≤ X and gate_miss_counter ≤ X over Y on the new switch.
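Re-slotting to a supported step size can be sketched as a quantization pass over the window list, assuming windows are (start, duration) pairs in nanoseconds. `snap_to_granularity` is a hypothetical helper; the rounding directions are chosen so reserved service never shrinks:

```python
def snap_to_granularity(windows, step_ns: int):
    """Quantize windows to the switch's supported gate step size.

    Starts round down and durations round up, so each quantized window
    fully contains the original one (service is never reduced).
    """
    out = []
    for start, dur in windows:
        snapped_start = (start // step_ns) * step_ns       # floor
        snapped_dur = -(-dur // step_ns) * step_ns         # ceil
        out.append((snapped_start, snapped_dur))
    return out
```

After snapping, the result must go back through GCL sanity checks, since widened windows can introduce overlaps that the original schedule did not have.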
10) At higher temperature, misses increase — clock drift budget insufficient or correction not updated?
Likely cause: drift terms were not fully budgeted into guard band, or correction table is stale for the current drift/temperature regime.
Quick check: correlate late_window_hits with temperature and drift_indicator; verify correction_table_version and last_activation_time; check whether budget_version includes drift terms.
Fix: extend guard band by explicit drift envelope (X); refresh correction table after baseline measurement; bind and redeploy at T0 with consistent artifact_set_id.
Pass criteria: temperature-correlated late_window_hits slope ≤ X over Y, and gate_miss_counter ≤ X.
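The additive guard-band extension in the fix can be expressed directly, assuming drift is specified in ppb (1 ppb accumulates roughly 1 ns of error per second) and corrections arrive every `resync_interval_s` seconds. Function and parameter names are illustrative:

```python
def guard_band_ns(sync_error_ns: float, drift_ppb: float,
                  resync_interval_s: float, margin_ns: float) -> float:
    """Worst-case additive guard band.

    sync_error_ns:     residual offset after PTP correction
    drift_ppb:         worst-case clock drift (e.g. over temperature)
    resync_interval_s: time between corrections, over which drift accumulates
    margin_ns:         fixed safety margin for unmodeled terms
    """
    drift_term_ns = drift_ppb * resync_interval_s  # ppb * s == ns
    return sync_error_ns + drift_term_ns + margin_ns
```

Because the drift term scales with the resync interval, shortening the correction interval is an alternative to widening the guard band when slot capacity is tight.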
11) Multi-hop end-to-end latency exceeds the bound — per-hop window accumulation or queueing upper bound missed?
Likely cause: worst-case per-hop terms were not fully additive in the budget, or the schedule allows queueing beyond the assumed bound at one hop.
Quick check: locate the hop with the largest queue_occupancy_peak and late_window_hits; confirm slot table deadlines vs reserved windows; verify budget_version includes serialization + processing + burst terms.
Fix: tighten the per-hop queueing bound via re-slotting; recompile GCL with earlier critical windows; update the additive budget and bind it to the artifact_set_id.
Pass criteria: e2e envelope ≤ X with the max-hop queue_occupancy_peak ≤ X over Y.
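The additive budget in the fix can be sketched as a per-hop sum, assuming each hop contributes serialization, processing, bounded queueing, and burst terms in nanoseconds (the dict keys and helper name are hypothetical):

```python
def e2e_budget_ns(hops) -> int:
    """Sum worst-case per-hop terms into an end-to-end envelope.

    hops: list of dicts, one per hop, each with 'serialization',
    'processing', 'queueing_bound', and 'burst' values in ns. If any
    term is omitted at a hop, the budget silently under-models that
    hop, which is exactly the failure mode this FAQ describes.
    """
    return sum(h["serialization"] + h["processing"]
               + h["queueing_bound"] + h["burst"]
               for h in hops)
```

Binding the resulting number (and the per-hop terms it was built from) to the artifact_set_id keeps the budget auditable against the schedule that assumed it.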
12) The network “feels stuck” though average utilization is low — microbursts + window holes or guard mis-trigger?
Likely cause: microbursts land in harmful window gaps or collide with guard bands; storm guards/policers may mis-trigger and amplify drops for non-critical traffic.
Quick check: inspect queue_occupancy_peak spikes vs gate windows; review GCL sanity for long empty windows; check drop_reason_code distribution and policer hits during “stuck” intervals.
Fix: close harmful window holes (recompile GCL) and schedule background windows away from critical service; tune policer thresholds and enable safe profile on storm triggers; log reason-coded events for forensics.
Pass criteria: queue_occupancy_peak spikes ≤ X and drop_reason_code “guard trigger” ≤ X over Y, with critical drops = 0.