CBTC/ETCS Onboard Unit: Safety Compute, Positioning & Security
The CBTC/ETCS Onboard Unit (OBU/EVC) is a safety-critical evidence engine: it turns speed/position, radio sessions, and time into trusted proof, then drives braking/limits deterministically when any evidence becomes unreliable. In practice, stable operation depends on measurable fields (reason codes, counters, latency/timestamps) and a BOM built around safety compute, secure boot/keys, isolation, supervision/holdup, and tamper-evident logging.
H2-1. What it is: OBU/EVC role and safety boundary
Engineering role: turning movement authority into audited safety actions
A CBTC Onboard Unit (OBU) / ETCS European Vital Computer (EVC) is the train-borne safety decision point that converts position & speed evidence plus received authority into supervision limits and vital actuation (e.g., brake demand / traction restriction), while producing an evidentiary record of why each safety action occurred.
What the onboard unit must do (written as verifiable responsibilities)
- Fuse evidence: combine odometry-based motion evidence with discrete position anchors (e.g., balise) and authority constraints to produce a safe state estimate and confidence.
- Supervise motion: compute and enforce supervision limits (speed ceilings, braking curves, approach constraints) derived from the currently valid authority.
- Trigger safety actions: when evidence indicates a boundary violation or insufficient confidence, force the system back into a safe envelope (restriction or braking), with clear trigger criteria.
- Prove decisions: record the minimum set of fields needed to reconstruct “what was known, what was decided, and why” at the time of each event.
This page treats the onboard unit as a supervisor (safety boundary enforcement), not a traction controller or HMI system.
Safety boundary: vital vs non-vital inputs/outputs (a practical classification)
The safety boundary is defined by the consequences of being wrong. An interface becomes vital if an error can directly cause unsafe permission (false release) or prevent required protection (missed braking).
- Typical vital inputs: dual-channel odometry evidence (speed/distance), discrete position anchors (balise-derived), authority validity fields (sequence/time validity), essential brake feedback.
- Typical non-vital aiding: GNSS for optional aiding, non-critical maintenance telemetry, driver convenience indications.
- Typical vital outputs: brake demand / traction restriction with feedback confirmation and fault-latched states.
- Typical non-vital outputs: informational displays and diagnostics that must never block safety cycles.
Not covered here: wayside controller internals (Zone Controller / RBC), interlocking logic, or full RF front-end design. Those belong to separate pages.
Figure (H2-1): Safety boundary map for an onboard unit
H2-2. System block diagram: sensing → safety compute → vital outputs
Reading method: follow the evidence arrow, not the module list
A useful onboard architecture diagram is not a “box pile.” It is an evidence pipeline: sensing produces measurements, integrity checks turn measurements into evidence, safety compute turns evidence into supervised limits, and vital outputs enforce the safe envelope. Each stage must emit the minimum fields required to prove correctness during audits and post-event analysis.
Inputs: each must state (1) what it proves, (2) how health is proven
- Odometry (wheel tach / encoder / radar aiding): proves motion (speed & distance). Health evidence: dual-channel consistency, jump detection, plausibility vs acceleration limits, sensor timeout.
- Balise-derived anchors: proves discrete position constraints. Health evidence: ID/sequence consistency, validity windows, missed/duplicate discrimination flags.
- Radio authorization/session: proves what is currently permitted. Health evidence: message sequencing, freshness/time validity, session state transitions, reason codes for reconnects.
- Train integrity & essential discretes: proves “allowed to move” constraints. Health evidence: input echo/feedback consistency, debounced state with timeout discipline.
The point is not to list buses; it is to define the evidence required to safely use each input.
Safety compute: four defenses that keep decisions deterministic
- Parallelism & comparison: lockstep / dual execution prevents single-event faults from silently changing safety decisions.
- Partitioning: safety tasks run with guaranteed timing; non-safety tasks (UI, bulk logging, comms) cannot starve supervision cycles.
- Diagnostics coverage: ECC/Parity, clock monitors, watchdogs, LBIST/MBIST turn hardware trust into auditable evidence.
- Fault state machine: each abnormal condition maps to a defined restriction/brake behavior and a defined recovery criterion.
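The fault state machine described above can be sketched as a table-driven lookup: each abnormal condition maps to a defined restriction/brake behavior and a recovery criterion, and anything unrecognized fails closed. All names here (fault classes, gate strings) are illustrative assumptions, not a production design.

```python
from enum import Enum

class FaultClass(Enum):
    LOCKSTEP_MISMATCH = "lockstep_mismatch"
    ECC_UNCORRECTABLE = "ecc_uncorrectable"
    SENSOR_TIMEOUT = "sensor_timeout"

class Action(Enum):
    SERVICE_BRAKE = "service_brake"
    TRACTION_RESTRICTION = "traction_restriction"

# Each abnormal condition maps to (restriction/brake behavior, recovery criterion).
FAULT_TABLE = {
    FaultClass.LOCKSTEP_MISMATCH: (Action.SERVICE_BRAKE, "power_cycle_and_bist_pass"),
    FaultClass.ECC_UNCORRECTABLE: (Action.SERVICE_BRAKE, "memory_retest_pass"),
    FaultClass.SENSOR_TIMEOUT: (Action.TRACTION_RESTRICTION, "sensor_fresh_for_n_cycles"),
}

def react(fault: FaultClass) -> dict:
    """Deterministic lookup: unknown faults fail closed to the most restrictive action."""
    action, recovery_gate = FAULT_TABLE.get(
        fault, (Action.SERVICE_BRAKE, "manual_inspection"))
    return {"fault_class": fault.value, "action_taken": action.value,
            "recovery_gate": recovery_gate}
```

The point of the table form is auditability: the mapping itself is reviewable data, not logic scattered across handlers.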
Outputs: vital actuation must be closed-loop and reconstructable
Vital outputs are not “control signals.” They are enforcement actions driven by supervision decisions. Each vital output must have feedback confirmation (or a deterministic fault-latched behavior), plus an event record containing: trigger class, evidence summary, timestamp status, and the resulting actuation state.
- Vital outputs: brake demand, traction restriction, fault latch states (with feedback/echo where applicable).
- Non-vital outputs: indications and maintenance telemetry that must never block the safety cycle.
Evidence checklist (used later by validation + FAQs)
- Sensor health: consistency flags, plausibility counters, timeout markers.
- Session health: sequence counters, freshness checks, reconnect reason codes.
- Timestamp health: drift/holdover flags, time validity, monotonic counters.
- Decision state: current supervision mode, restriction/brake rationale code.
- Actuation feedback: output echo/feedback consistency, latch state, recovery gate.
- Audit integrity: signature status, event chain continuity, anti-rollback indicators.
This checklist is intentionally compact so each later chapter can point back to a named field class.
Figure (H2-2): Evidence-chain system block diagram (single-column, mobile-safe)
H2-3. Safety compute platform: SoC/MCU choices & partitioning
Why safety compute looks this way
A CBTC/ETCS onboard unit cannot rely on “correct software” alone. The compute platform must turn random faults (EMI, soft errors, clock anomalies, transient brownouts) into detectable, controlled, and auditable events. This is why modern safety SoCs/MCUs combine execution comparison, domain partitioning, data-path integrity, and built-in diagnostics.
Defense Layer A — execution comparison (lockstep / dual-core compare / 1oo2D)
Comparison is not for performance. It exists to convert silent computation faults into explicit “mismatch” evidence that can force deterministic restriction or braking.
- Lockstep: tightly synchronized cores execute the same instructions; mismatches raise immediate fault evidence.
- Dual-core compare: two independently scheduled domains compute equivalent safety results and compare within a defined window.
- 1oo2D concept: dual-channel with diagnostics that can identify which channel is unhealthy (without exposing voter implementation details).
Design output of this layer: mismatch flags, comparison window violations, channel health counters.
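A dual-channel comparison step can be sketched as follows: two channel results are compared within a bounded window, and any disagreement becomes explicit evidence (the counter names mirror this layer's design outputs; the function signature is an assumption for illustration).

```python
def compare_channels(result_a: int, result_b: int,
                     t_a_ms: float, t_b_ms: float,
                     window_ms: float, state: dict) -> bool:
    """Compare two safety-channel results; disagreement or a comparison-window
    violation becomes explicit evidence instead of a silent fault."""
    if abs(t_a_ms - t_b_ms) > window_ms:
        state["comparison_window_violations"] = state.get("comparison_window_violations", 0) + 1
        return False
    if result_a != result_b:
        state["mismatch_flags"] = state.get("mismatch_flags", 0) + 1
        return False
    state["channel_healthy_cycles"] = state.get("channel_healthy_cycles", 0) + 1
    return True
```

In real lockstep silicon this comparison happens in hardware per instruction; the sketch only shows how the outcome should surface as countable evidence.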
Defense Layer B — partitioning (safety island + high-performance domain)
Partitioning separates deterministic safety cycles from non-safety workloads. Safety tasks (supervision, fault state machine, vital output gating) must keep timing guarantees even when the non-safety domain is busy (comms stacks, UI, bulk logs, maintenance tools).
The safety argument becomes simpler when safety-critical behavior is contained and timing is provable.
Defense Layer C — data-path integrity (memory ECC, bus parity)
Safety decisions depend on data integrity as much as compute integrity. ECC and parity mechanisms provide measurable evidence that stored state and in-flight transfers are trustworthy.
- ECC (correctable): corrects single-bit faults and increments counters that signal environmental stress or aging.
- ECC (uncorrectable): triggers deterministic fault states when corruption cannot be repaired.
- Bus parity/CRC: surfaces transfer corruption as explicit error flags tied to time and address context.
Evidence fields: correctable_count, uncorrectable_flag, parity_error_count, fault_address (if available).
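A minimal sketch of how those evidence fields might be maintained, assuming a handler that is invoked per ECC event (field names follow the dictionary above; the return strings are illustrative):

```python
def on_ecc_event(evidence: dict, correctable: bool, address=None) -> str:
    """Correctable errors are counted (trend evidence for stress/aging);
    uncorrectable errors force a deterministic fault state rather than
    continuing on corrupted data."""
    if correctable:
        evidence["correctable_count"] = evidence.get("correctable_count", 0) + 1
        return "CONTINUE"
    evidence["uncorrectable_flag"] = True
    if address is not None:
        evidence["fault_address"] = address
    return "ENTER_FAULT_STATE"
```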
Defense Layer D — diagnostics & clock supervision (LBIST/MBIST, WDT, clock monitor)
Built-in tests and monitors provide the coverage material required in safety cases: what faults are detected, how fast they are detected, and how the system reacts.
- LBIST/MBIST: detects logic and memory faults at startup and/or periodic intervals; failures must map to a defined restriction/brake path.
- Watchdogs (WDT): detect stalls or runaway execution; safe reaction must be independent of non-safety workloads.
- Clock monitors: detect frequency drift, stop, or instability; time validity is an evidence primitive for logs and supervision.
Evidence fields: bist_status, wdt_reset_reason, clock_valid_flag, fault_reaction_timestamp.
ASIL/SIL evidence points (without turning into a textbook)
Safety cases require defensible evidence that faults are detected and handled within defined reaction times. The compute platform contributes by producing measurable artifacts rather than marketing claims.
- Diagnostic coverage artifacts: which failure modes are detected by compare/ECC/BIST/WDT/clock monitors.
- Fault reaction timing: detection latency and transition time to restrictive or braking states.
- Containment proof: non-safety crashes must not override safety island gating or timing.
- Audit continuity: event chains that show mismatch/fault → decision → enforcement → recorded evidence.
Figure (H2-3): Safety compute “four defense layers” with partitioning
H2-4. Vital I/O & isolation strategy (inputs/outputs you must trust)
Signal credibility engineering: trust is measured, not assumed
Vital I/O is not an interface list. It is a trust loop that ensures every safety-relevant input and output can answer two questions: is it credible now, and what deterministic action occurs if it is not? The onboard unit must maintain closed-loop evidence: a command does not equal actuation until feedback confirms it.
Vital inputs: what they influence and how credibility is proven
- Speed/odometry: drives supervision limits and braking curves. Credibility: dual-channel agreement, jump detection, plausibility bounds, timeout discipline.
- Brake feedback: confirms enforcement outcomes and detects stuck paths. Credibility: command-vs-feedback alignment, response window timing, fault-latched mismatch flags.
- Essential discretes (e-stop chain, critical interlocks): gate permission and force conservative states. Credibility: debounced state, line monitoring (open/short), deterministic timeout actions.
Each vital input must emit explicit health evidence; otherwise it cannot be safely consumed by the safety island.
Vital output gating: only safety domain can authorize enforcement
Vital outputs must be controlled through a safety gate that is independent of non-safety workloads. The non-safety domain may request actions, but it cannot directly drive vital enforcement. When evidence is insufficient or fault states are active, the safety gate must force restrictive behavior.
- Deterministic gating: fixed rules map evidence/fault states to allowed outputs.
- Fail-safe default: loss of credible evidence leads to restriction or braking paths, not silent continuation.
- Audit linkage: each gating decision produces a reason code and timestamp validity marker.
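The three gating rules above can be sketched as one deterministic function: fixed rules map evidence and fault state to the allowed output, loss of credible evidence defaults to braking, and every decision carries a reason code. Signature and reason strings are assumptions for illustration.

```python
def vital_output_gate(requested: str, evidence_ok: bool, fault_latched: bool,
                      timestamp_valid: bool) -> dict:
    """Deterministic gating: the non-safety domain may request an output,
    but only credible evidence allows anything other than the fail-safe
    default. Every decision emits a reason code for the audit chain."""
    if fault_latched or not evidence_ok or not timestamp_valid:
        reason = ("fault_latched" if fault_latched
                  else "evidence_insufficient" if not evidence_ok
                  else "timestamp_invalid")
        return {"output": "BRAKE_DEMAND", "reason_code": reason}
    return {"output": requested, "reason_code": "evidence_ok"}
```

Note the ordering of the reason checks is itself part of the deterministic contract: the same input state must always yield the same reason code.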
Vital outputs: redundancy + feedback (closed-loop enforcement)
Redundancy exists to make failures observable and controllable. Feedback/echo closes the loop so that enforcement is confirmed, timed, and recorded. A vital path should be able to detect stuck-on, stuck-off, delayed response, and intermittent wiring faults through deterministic checks.
- Outputs: brake demand, traction restriction, fault latch states.
- Feedback: output echo/feedback consistency, response latency windows, latch confirmation.
- Record: event reason, evidence snapshot, action taken, recovery gate condition.
Isolation strategy: manage common-mode paths, not just components
Isolation is not solved by adding parts. It is solved by controlling where common-mode current flows. Digital isolators, isolated transceivers, and isolated power must be chosen and placed so that high dv/dt and surge energy do not corrupt sensing, comparison, or vital gating.
- Isolation boundary: define which signals cross, which reference they use, and how health evidence crosses the boundary.
- Common-mode suppression: provide a controlled return path so noise does not enter sensitive measurement/comparison nodes.
- Deterministic comms: isolated links must still provide CRC/sequence/freshness evidence for safety consumption.
Evidence fields dictionary (used later by validation + FAQs)
- input_consistency_flag: dual-channel disagreement markers and counters.
- input_timeout_flag: stale data detection with bounded timeout rules.
- crc_fail_count / seq_jump: integrity and ordering evidence on safety-relevant messages.
- output_echo_mismatch: commanded vs feedback mismatch with latch and clear conditions.
- actuation_latency_ms: response timing evidence (window-based, not best-effort).
- fault_latch_state: current enforced restriction/brake state and recovery gate reason.
These names act as anchors: each later diagnostic question can point back to a specific field class instead of vague “check logs.”
Figure (H2-4): Vital I/O trust loop (inputs → gate → outputs → feedback → audit)
Scope cut (to avoid overlap)
This page defines the onboard unit’s vital I/O gating, confirmation, and evidence fields. Detailed actuator physics (valve/pump drivers, pressure control loops) belongs to the Brake Control Unit page.
H2-5. Speed & odometry AFE chain (how speed becomes evidence)
Goal: speed is only usable when it becomes safety-grade evidence
A safety supervisor does not consume a raw “speed number.” It consumes speed evidence: speed + validity window + credibility flags + health counters + timestamp status. The AFE chain and plausibility checks must turn physical pulses and analog signals into a deterministic evidence stream that can be audited after incidents.
Typical sensing stack (primary + optional aiding)
- Tach/encoder: pulse-based distance and speed evidence; strong resolution, sensitive to missing teeth and mechanical looseness.
- Hall/MR: robust in harsh environments; sensitive to air-gap and magnetic conditions; used as primary or redundant channel.
- Doppler radar (optional): independent motion evidence that can help during wheel slip; installation and multipath must be considered.
- IMU aiding (non-vital aiding): used for short-window plausibility and continuity, not as a sole speed authority.
The onboard unit should treat aiding sources as credibility checks unless they can produce independent integrity evidence.
AFE conditioning: turning noisy physics into stable edges
The AFE does not “improve accuracy” by magic. It prevents noise, bounce, and wiring faults from becoming false motion evidence.
- Threshold + hysteresis: stable switching with margin; prevents edge chatter near noise floors.
- Debounce / glitch reject: blocks micro-spikes from EMI or mechanical bounce; produces reject counters as evidence.
- Open/short detection: converts dead sensors and wiring faults into explicit fault flags.
- Bandwidth shaping: ensures out-of-band noise does not produce “valid” pulses.
- Latency discipline: detects abnormal capture delay that would break supervision timing assumptions.
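The threshold-with-hysteresis and debounce ideas above can be sketched together: an edge counts only after it stays beyond a threshold for a minimum number of consecutive samples, and spikes that never stabilize increment a reject counter (exactly the debounce_reject_count evidence named later). The thresholds and sample-count logic here are illustrative assumptions.

```python
def qualify_edges(samples, high_th, low_th, min_stable):
    """Threshold + hysteresis + debounce: count a rising edge only after the
    signal stays beyond a threshold for min_stable consecutive samples;
    rejected spikes are counted as evidence instead of silently dropped."""
    state, stable, edges, rejects = 0, 0, 0, 0
    pending = None
    for v in samples:
        # Hysteresis: between low_th and high_th the logical state holds.
        want = 1 if v >= high_th else 0 if v <= low_th else state
        if want != state:
            stable = stable + 1 if pending == want else 1
            pending = want
            if stable >= min_stable:
                state = want
                edges += 1 if want == 1 else 0
                pending, stable = None, 0
        else:
            if pending is not None:
                rejects += 1  # a spike that reverted before stabilizing
            pending, stable = None, 0
    return edges, rejects
```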
Error models that matter (and how they appear in evidence fields)
- Slip/spin: wheel speed diverges from true train motion; surfaces as channel disagreement and slip_detect_flag.
- Missing/damaged teeth: periodic pulse dropouts; surface as pulse_period_outlier_count.
- EMI spikes / contact bounce: false edges near the noise floor; surface as debounce_reject_count.
- Wiring faults: dead or shorted sensor lines; surface as sensor_open_short_flag.
Control actions for sanding/anti-slip are out of scope; this chapter focuses on evidence generation and credibility.
What must be recorded to prove speed credibility (field dictionary)
These fields turn “speed reading” into audit-grade evidence that can justify supervision decisions.
- wheel_speed_chA / wheel_speed_chB: raw or normalized channel speeds.
- wheel_speed_delta: channel disagreement magnitude (with thresholds).
- accel_plausibility_flag: non-physical acceleration/jump detection marker.
- slip_detect_flag: slip/spin marker driven by consistency and aiding checks.
- sensor_open_short_flag: explicit wiring/sensor failure markers.
- debounce_reject_count: count of rejected spikes/bounce events.
- pulse_period_outlier_count: outlier timing that indicates tooth loss or noise.
- health_counter / health_score: rolling credibility metric for supervision gating.
- timestamp_valid_flag: timebase validity for evidence traceability.
Acceptance rules: when speed can be treated as evidence
- Consistency: channel delta remains within limit for a bounded window (N cycles) with no open/short faults.
- Plausibility: acceleration/jump checks pass and slip flag is not active (or the system enters a defined conservative mode).
- Timing: timestamp validity is true and evidence is not stale (timeout discipline holds).
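The three acceptance rules above can be sketched as one predicate over the field dictionary. The field names follow this chapter; consistent_cycles and stale_flag are assumed helper fields (a bounded-window counter and a timeout marker) introduced here for illustration.

```python
def speed_is_evidence(f: dict, delta_limit: float, n_cycles: int) -> bool:
    """Speed is usable as evidence only when consistency, plausibility,
    and timing checks all hold simultaneously."""
    consistent = (f["wheel_speed_delta"] <= delta_limit
                  and f["consistent_cycles"] >= n_cycles
                  and not f["sensor_open_short_flag"])
    plausible = not f["accel_plausibility_flag"] and not f["slip_detect_flag"]
    timely = f["timestamp_valid_flag"] and not f["stale_flag"]
    return consistent and plausible and timely
```

Failing this predicate should not be an error path; it should route the system into the defined conservative mode named in the rules above.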
Figure (H2-5): Speed evidence pipeline (sensor → AFE → checks → evidence)
H2-6. Position referencing: balise interface and map constraints (onboard side)
Balise as a discrete anchor: shrinking uncertainty, not just “reading a tag”
Onboard odometry is an integration process and will drift. A balise provides discrete position anchors that allow the onboard unit to correct drift, tighten uncertainty bounds, and validate map constraints. The onboard side must treat a balise read as evidence only when identity and timing integrity are provable.
Onboard reception chain (abstracted to “data enters OBU”)
- Antenna/interface: provides the physical receive point; the onboard concern is signal presence and interface health.
- Demod output: produces decoded balise data units; onboard logic consumes identity and integrity markers.
- Safety intake: validates data within time windows and integrity rules before allowing anchor use in the safety state estimate.
RF front-end internals and transponder implementation are out of scope; this chapter focuses on onboard validation, correction, and evidence logging.
How anchors constrain the onboard position state
The onboard position state is best treated as estimate + uncertainty bound + validity window. When an anchor is accepted, the system applies a bounded correction and tightens the uncertainty. When an anchor is suspicious or missing, the system widens uncertainty and enforces conservative behavior through supervision.
- Correct drift: reduce accumulated odometry bias using accepted anchor evidence.
- Trigger map states: anchors can confirm region boundaries or constraints already present in the onboard map.
- Protect safety: uncertainty growth and time validity directly influence supervision conservatism.
Evidence checks: identity match + timing + anomaly discrimination
- Identity consistency: anchor ID must match an expected set for the current corridor/time window.
- Timestamp validity: acceptance requires a valid timebase and bounded freshness.
- Duplicate detection: same anchor appears again within an impossible distance/time window.
- Missed detection: expected anchor window passes with no read; uncertainty must widen deterministically.
- Misread detection: unexpected ID or integrity failure; anchor is rejected and logged as evidence.
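The identity, duplicate, and misread checks above can be sketched as a single classification step that runs before an anchor may tighten the position uncertainty. The spacing-based duplicate test and the return labels are illustrative assumptions; real acceptance windows would come from the map.

```python
def validate_anchor(balise_id: str, expected: set, last_seen_m: dict,
                    odo_pos_m: float, min_spacing_m: float,
                    timestamp_valid: bool) -> str:
    """Classify a balise read as ACCEPT / DUPLICATE / MISREAD / REJECT_TIME
    before anchor correction is allowed."""
    if not timestamp_valid:
        return "REJECT_TIME"          # timebase not credible: no anchor use
    if balise_id not in expected:
        return "MISREAD"              # unexpected ID for this segment
    prev = last_seen_m.get(balise_id)
    if prev is not None and (odo_pos_m - prev) < min_spacing_m:
        return "DUPLICATE"            # same anchor within impossible distance
    last_seen_m[balise_id] = odo_pos_m
    return "ACCEPT"
```

Missed-anchor detection is the complementary case: it is triggered by the expected window expiring with no read at all, so it lives in the supervision cycle rather than in this per-read classifier.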
Balise evidence fields (field dictionary)
- balise_id: decoded anchor identity.
- expected_set_hit: whether the ID belongs to the expected set for the current segment.
- balise_timestamp: capture time used for evidence ordering.
- timestamp_valid_flag: timebase validity marker.
- balise_duplicate_flag: duplicate/too-soon occurrence marker.
- balise_missed_flag: expected-window miss marker.
- balise_misread_flag: unexpected ID or integrity failure marker.
- correction_applied_flag: whether anchor correction was applied.
- uncertainty_bound_changed_flag: whether uncertainty tightened/widened after the event.
Figure (H2-6): Onboard balise anchor + map constraint flow
Scope cut (to avoid overlap)
This chapter covers onboard reception, validation, correction, and evidence logging. Transponder RF front-end design and wayside placement strategy belong to the Balise/Transponder page.
H2-7. Radio/session interface for CBTC/ETCS (safety comms without RF deep dive)
Focus: trusted session + movement authority integrity (not RF)
The onboard unit does not need RF implementation details to be safe. It needs a trusted session interface that turns an unreliable bearer into deterministic safety inputs. Movement authority and restriction curves must be accepted only when identity, integrity, freshness, and timing are provable and auditable.
Session boundary: what the OBU must expose to safety logic
Treat the radio path as a bearer. The safety boundary is the session layer that outputs: (1) trusted messages, (2) session health, and (3) explicit reasons when messages are rejected or when the system must degrade.
- Inputs: authenticated/checked authorization updates, bounded timing metadata, and link statistics with timestamps.
- Outputs: session_state, message_accept/reject decisions, and deterministic degrade triggers.
- Rule: non-safety processing can request actions, but the safety gate decides whether updates are usable.
CBTC side: session state + authorization curve update integrity
CBTC authorization updates must be treated as versioned evidence. A valid update requires integrity checks, sequence/freshness acceptance, and a bounded activation policy so that network jitter cannot silently shift supervision.
The exact ground-side logic is out of scope; the onboard view is a deterministic session gate with auditable acceptance rules.
ETCS side: Euroradio/RBC session (concept-level, acceptance is still testable)
Safety communication must prevent silent corruption and stale reuse. Even at a conceptual level, each safety attribute maps to observable fields.
- Integrity: message checks must fail closed; failures increment explicit counters and reason codes.
- Anti-replay: stale or repeated messages are rejected by sequence/freshness windows; violations are logged.
- Session binding: session changes require explicit reason codes; acceptance must be tied to the current session context.
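A fail-closed acceptance step combining sequence and freshness checks might look like the sketch below. This is a conceptual illustration only; actual Euroradio/EN 50159 acceptance rules are richer (MAC verification, session binding) and are not reproduced here. Field names follow the dictionary later in this chapter.

```python
def accept_message(seq: int, sent_ms: int, now_ms: int,
                   state: dict, freshness_ms: int = 2000) -> bool:
    """Reject stale or out-of-order messages and record each violation as
    evidence; only then advance the accepted-sequence state."""
    last = state.get("msg_seq_last", -1)
    if seq <= last:                      # replay or reorder
        state["msg_seq_jump_count"] = state.get("msg_seq_jump_count", 0) + 1
        return False
    if now_ms - sent_ms > freshness_ms:  # stale: freshness window expired
        state["freshness_expired_count"] = state.get("freshness_expired_count", 0) + 1
        return False
    state["msg_seq_last"] = seq
    return True
```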
Unreliable network, deterministic behavior: turning jitter into bounded inputs
Packet loss and delay are not “annoyances.” They are safety inputs. The session layer must quantify them and feed deterministic gating so supervision never depends on best-effort timing.
- Loss window: sustained loss raises session health degradation and may freeze authorization advancement.
- Latency distribution: p50/p99 windowing prevents rare long delays from being misinterpreted as valid immediacy.
- Seq anomalies: jumps and reorder events become explicit evidence and rejection triggers.
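Turning raw per-window samples into the bounded metrics above can be sketched as follows (windowing policy and percentile method are assumptions; a deployed system would use fixed-point windowed estimators, not lists):

```python
import statistics

def window_metrics(latencies_ms, received, expected):
    """Produce the bounded per-window metrics supervision consumes:
    loss rate plus p50/p99 latency (p99 captures the tail that a mean hides)."""
    loss_rate = 1.0 - (received / expected) if expected else 0.0
    lat_sorted = sorted(latencies_ms)
    p50 = statistics.median(lat_sorted)
    p99 = lat_sorted[min(len(lat_sorted) - 1, int(0.99 * len(lat_sorted)))]
    return {"packet_loss_rate": loss_rate,
            "latency_ms_p50": p50, "latency_ms_p99": p99}
```

One rare 500 ms delay barely moves p50 but dominates p99, which is exactly why the gate keys on the tail metric.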
Evidence fields dictionary (used later by validation + FAQs)
- session_state: CONNECTED / DEGRADED / RECONNECTING / EXPIRED.
- reconnect_reason_code: auth_fail • seq_violation • timeout • bearer_loss (examples).
- msg_seq_last & msg_seq_jump_count: ordering and replay evidence.
- auth_version_id: current movement authority / curve version accepted by the gate.
- freshness_expired_count: stale update rejections.
- crc_or_mac_fail_count: integrity failures (fail closed).
- packet_loss_rate: windowed loss metric.
- latency_ms_p50 & latency_ms_p99: bounded timing metrics.
- rx_stale_flag: stale data detection flag used by supervision.
Figure (H2-7): Trusted session gate for authorization updates
Scope cut (to avoid overlap)
This chapter defines the safety session interface and message trust evidence. RF chain design, antennas, power amplifiers, and bearer-specific engineering belong to the Train Radio page.
H2-8. Time sync & deterministic behavior (why timestamps matter)
Time is part of the evidence chain
In train control, timestamps are not a convenience. They are a core evidence primitive. If time is not credible, event ordering becomes disputable, freshness windows become meaningless, and sensor alignment can silently corrupt supervision decisions. The onboard unit must expose time credibility as an explicit input to safety logic.
Single-unit timebase: dual sources + credibility monitoring
The onboard unit should treat its own time as a monitored subsystem: multiple sources, continuous comparison, and explicit quality states.
- Dual sources: primary clock and backup clock are comparable and can be selected with bounded rules.
- Offset/drift monitoring: source-to-source offset and drift rate expose degradation before it becomes a safety incident.
- Jump detection: sudden time steps must raise explicit flags and trigger deterministic reactions.
- Holdover policy: when external reference is unavailable, time remains usable but with a declared quality level.
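The dual-source comparison, jump detection, and quality declaration above can be sketched as one per-cycle assessment (limits and the simple source-selection rule are illustrative assumptions):

```python
def assess_timebase(primary_ms: float, backup_ms: float, prev_primary_ms: float,
                    dt_ms: float, offset_limit_ms: float, jump_limit_ms: float) -> dict:
    """Dual-source time credibility: compare sources, detect step changes
    against the expected cycle advance, and declare an explicit quality
    state instead of silently trusting one clock."""
    offset = primary_ms - backup_ms
    step = primary_ms - prev_primary_ms
    jump = abs(step - dt_ms) > jump_limit_ms   # sudden time step vs expected advance
    valid = abs(offset) <= offset_limit_ms and not jump
    return {"clock_offset_ms": offset,
            "clock_jump_flag": jump,
            "timestamp_valid_flag": valid,
            "time_source": "primary" if valid else "backup"}
```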
Where timestamps are consumed (and why each needs time credibility)
- Authorization freshness windows: stale updates must be rejected by time validity, not by luck.
- Event ordering in audit logs: post-event reconstruction depends on credible, monotonic timestamps.
- Sensor evidence alignment: odometry, anchor, and session evidence must share a consistent timebase, or fusion silently degrades.
When time is abnormal: deterministic degrade and diagnostic triggers
A “bad clock” must not create ambiguous safety behavior. The onboard unit should map time credibility states to deterministic restrictions and diagnostic actions.
- timestamp_invalid: block immediate application of new authorization updates; enter conservative curves where required.
- clock_jump_detected: freeze or reset bounded state machines, annotate logs, and raise explicit reason codes.
- holdover_level degraded: widen uncertainty, tighten supervision margins, and prioritize re-establishing credible time.
Full train-wide PTP architecture is out of scope; this chapter focuses on onboard time credibility and deterministic reactions.
Evidence fields dictionary (time credibility)
- time_source: primary • backup • holdover.
- timestamp_valid_flag: whether timestamps are credible for evidence ordering.
- clock_offset_ms: source-to-source offset evidence.
- drift_rate_ppm: monitored drift estimate.
- clock_jump_flag & jump_magnitude_ms: sudden step evidence.
- holdover_level: declared quality tier when reference is missing.
- freshness_window_violation_count: stale-use prevention evidence.
- reaction_timestamp: time-stamped fault reaction evidence.
Figure (H2-8): Timestamp as an evidence spine (sources → monitor → consumers → actions)
Scope cut (to avoid overlap)
This chapter defines onboard time credibility, monitored fields, and deterministic reactions. Train-wide PTP topology, TSN distribution, and backbone synchronization belong to the Timing & Sync and Backbone Gateway pages.
H2-11. Validation & field diagnostics (bench → track)
Why this chapter exists: make safety evidence reproducible
Validation is not a checklist. It is a reproducible loop that turns each evidence chain (speed/odometry, balise anchors, trusted session, time credibility, secure boot, fault state machine) into injectable, measurable, and auditable pass/fail behavior. Bench injection creates controlled repeatability; track scenarios confirm real-world boundaries using the same evidence gates.
Evidence map (what to validate, tied to earlier chapters)
This map keeps the chapter vertical: every test points to specific evidence fields and deterministic actions.
- Speed/Odometry evidence (H2-5): wheel_speed_delta • accel_plausibility_flag • debounce_reject_count • sensor_health_counter
- Balise anchor evidence (H2-6): balise_missed_flag • expected_set_hit • misread_flag • anchor_timestamp
- Trusted session evidence (H2-7): session_state • reconnect_reason_code • latency_ms_p99 • packet_loss_rate • seq_jump_count • freshness_expired_count
- Time credibility evidence (H2-8): timestamp_valid_flag • clock_offset_ms • drift_rate_ppm • clock_jump_flag • holdover_level
- Secure boot & key evidence (H2-9): boot_measurement_digest • signature_fail_reason_code • key_version_id • rollback_attempt_count
- Fault state machine evidence (H2-10): fault_class • state_transition_reason • action_taken • recovery_window_count
Bench injection principle (a single template for every test)
Every bench test must define the same four components — injection knobs, observables (waveforms and log fields), pass/fail gates, and repro parameters. This prevents “hand-wavy” validation and guarantees deterministic behavior under identical injections.
Bench: sensor injection (speed/odometry chain)
The purpose is not to “break the sensor,” but to prove that plausibility logic and fault handling remain deterministic and auditable.
First corrective step (bench diagnosis): before replacing sensors, check debounce_reject_count and open/short indicators to confirm the failure is not a filtering/reference issue.
MPN examples (AFE / isolation / safety compute commonly used in safety-critical designs):
- Safety MCU (lockstep-class): Infineon AURIX TC3xx (e.g., TC397 family), Renesas RH850 family, ST SPC58 family.
- Digital isolators: Analog Devices ADuM141E, Texas Instruments ISO7741.
- Isolated RS-485 (for rugged links around cabinets): Analog Devices ADM2587E, TI ISO1410 + transceiver pairing.
These MPNs are provided as concrete examples for documentation and test-fixture planning; exact selection depends on SIL/rail qualification constraints and design rules.
Bench: session latency/loss injection (trusted session gate)
The session gate must convert an unreliable bearer into bounded evidence. Validation proves that jitter and loss translate into predictable gating, not silent timing drift.
MPN examples (platform/communications building blocks frequently referenced in safety systems):
- Secure element (for device identity & credential operations): NXP SE050, Microchip ATECC608B, STMicroelectronics STSAFE-A110.
- Ethernet PHY (industrial/harsh environments often use robust PHY families): TI DP83867 family (example PHY class), NXP TJA110x family (example rugged PHY class).
PHY selection is system-dependent (train backbone/TSN constraints belong to the backbone page); this chapter validates the session gate behavior and evidence fields.
Bench: clock drift/jump injection + signature failure injection (time & secure boot evidence)
MPN examples (time/clocking and crypto hardware building blocks):
- Jitter-cleaning / timing PLL (system time conditioning examples): Renesas 8A34001 (class), Texas Instruments LMK04828 (class).
- Hardware root-of-trust / TPM-style modules (example class): Infineon OPTIGA™ TPM family (class), NXP EdgeLock secure element families (class).
Track: scenario validation (real conditions, same evidence gates)
Track scenarios validate the boundaries in real dynamics. The goal is not “coverage reporting,” but proving that the same gates and fields produce deterministic state transitions under real disturbances.
- Slip/slide conditions: confirm wheel_speed_delta + plausibility flags cause conservative behavior with clear reason codes.
- Weak coverage zones: confirm session_state + latency_ms_p99/packet_loss_rate drive stable gating (freeze auth_version advancement where required).
- Balise missing segments: confirm balise_missed_flag + expected_set_hit tighten uncertainty and trigger bounded degrade (no silent drift).
- Fast station entry/exit switching: confirm action timing remains explainable (reaction_timestamp) and time validity is enforced.
Decision template (waveforms + fields + gates + repro parameters)
Use one template for every bench/track test so failures become actionable and FAQ-ready (2 evidence checks + 1 first fix).
Test ID: BENCH-SESSION-LAT-01
Injection knob:
- latency_tail_ms, loss_burst_len, reorder_rate, duration_s, seed
Observe (waveforms):
- bearer_rx_timestamps vs session_gate_accept_pulses
Observe (log fields):
- session_state, latency_ms_p99, packet_loss_rate, seq_jump_count
- reconnect_reason_code, auth_version_id, action_taken, state_transition_reason
Pass/Fail gate:
- if latency_ms_p99 exceeds window_limit for N windows → DEGRADED within T
- if session_state becomes EXPIRED → freeze auth_version_id advancement, log reason_code
Expected deterministic reaction:
- NORMAL → DEGRADED → RECOVERY_PENDING (bounded), no silent acceptance
Repro parameters (recorded in log):
- duration_s, window_len, threshold_id, seed, build_id, config_hash
Keep thresholds as IDs (threshold_id) so field feedback can update limits without rewriting test procedures.
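The pass/fail gate in the template can be sketched as a small state holder. This is a bench-side illustration under assumed defaults (window_limit, N): state names mirror the template, but bounded RECOVERY_PENDING handling is deliberately left out to keep the sketch short.

```python
# Minimal sketch of the BENCH-SESSION-LAT-01 gate: N consecutive windows
# with latency_ms_p99 over the limit force DEGRADED; an EXPIRED session
# freezes authority advancement immediately. Defaults are illustrative.
class SessionGate:
    def __init__(self, window_limit_ms=500.0, n_windows=3):
        self.window_limit_ms = window_limit_ms
        self.n_windows = n_windows
        self.bad_windows = 0
        self.state = "NORMAL"

    def on_window(self, latency_ms_p99, session_state):
        """Return (state, reason_code) after one evaluation window."""
        if session_state == "EXPIRED":
            # Caller must not accept a newer auth_version_id while expired.
            self.state = "DEGRADED"
            return self.state, "SESSION_EXPIRED"
        if latency_ms_p99 > self.window_limit_ms:
            self.bad_windows += 1
        else:
            self.bad_windows = 0
        if self.bad_windows >= self.n_windows:
            self.state = "DEGRADED"
            return self.state, "LATENCY_TAIL_EXCEEDED"
        return self.state, "OK"
```

Because every transition returns a reason code, the bench fixture can assert on (state, reason) pairs instead of scraping waveforms.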
Figure (H2-11): Bench → evidence gates → track → feedback loop
Scope cut (to avoid overlap)
This chapter defines OBU-level injection knobs, observable evidence fields, deterministic state expectations, and reproducible decision templates. Line-test organization, project management process, and wayside acceptance workflows are intentionally excluded.
H2-12. Reference architecture BOM hotspots (IC classes + example MPN buckets)
How to use this chapter (evidence → hardware buckets)
This BOM view is organized by evidence responsibilities: compute credibility, key custody, trusted I/O across isolation, speed/position credibility, power/reset determinism, and durable logging. The MPNs below are example buckets for documentation and test-fixture planning; final selection must follow the project’s safety, environment, and lifecycle constraints.
Bucket 1 — Safety MCU / SoC (lockstep / safety island)
Evidence role: deterministic supervision under faults (H2-3/H2-10/H2-11). Key selection gates: lockstep/compare support, ECC/parity visibility, LBIST/MBIST reason codes, watchdog/reset attribution, partitioning support for safety vs non-safety workloads.
Example MPNs (typical safety MCU families):
- Infineon AURIX™ TC3xx: TC397, TC387, TC377
- NXP S32 Safety MCU: S32S247 (family example), S32K3 (safety-capable variants)
- Renesas RH850 (functional safety families): RH850/P1x (family example), RH850/U2A (family example)
- ST SPC58 (Stellar/Power Architecture safety families): SPC58EC (family example), SPC58NH (family example)
- Texas Instruments safety MCU class: TMS570LS (family example)
Suggested log fields: lockstep_mismatch_count, ecc_correctable_rate, bist_fail_reason, wdt_reset_reason, sched_jitter_hist.
Bucket 2 — HSM / Secure Element (keys + secure boot + anti-rollback)
Evidence role: prevent silent safety-logic replacement (H2-9) and provide anti-rollback proof for audits (H2-11). Selection gates: non-exportable keys, TRNG, monotonic counters, attestation/measurement support, service/OTA policy hooks.
Example MPNs (SE/HSM-class components):
- Microchip CryptoAuthentication™: ATECC608B
- NXP EdgeLock: SE050
- STMicroelectronics secure element: STSAFE-A110
- Infineon OPTIGA™ Trust: OPTIGA Trust M (family example)
- NXP secure element family example: A71CH
- TPM-style module class (example): Infineon OPTIGA TPM (family example)
Suggested log fields: key_version_id, anti_rollback_counter, signature_fail_reason_code, rollback_attempt_count, boot_measurement_digest.
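The anti-rollback gate behind these fields can be sketched as a pure acceptance check. This is an assumption-level illustration: real designs keep the monotonic counter inside the SE/HSM, and the reason strings here simply mirror Bucket 2's suggested log vocabulary.

```python
# Hypothetical update-acceptance check: signature first, then monotonic
# counter, then key version. Rejections map to Bucket 2 log fields
# (rollback_attempt_count, signature_fail_reason_code).
def check_update(stored_counter, stored_key_version,
                 image_counter, image_key_version, signature_ok):
    """Return (accept, reason_code) for a candidate image."""
    if not signature_ok:
        return False, "SIGNATURE_FAIL"
    if image_counter < stored_counter:
        return False, "ROLLBACK_ATTEMPT"        # increments rollback_attempt_count
    if image_key_version < stored_key_version:
        return False, "KEY_VERSION_DOWNGRADE"
    return True, "ACCEPTED"
```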
Bucket 3 — Isolated communications (CAN / RS-485 / generic isolators)
Evidence role: preserve trust across harsh common-mode and surge environments (H2-4/H2-7/H2-10). Selection gates: isolation rating, CMTI robustness, ESD, predictable fail mode, and isolation power integrity.
Example MPNs (isolated comms building blocks):
- Isolated CAN transceiver (TI): ISO1050, ISO1042
- Isolated CAN transceiver (ADI): ADM3055E
- Isolated RS-485 transceiver (ADI): ADM2587E
- Isolated RS-485 transceiver (TI): ISO1410
- Digital isolators (TI): ISO7741, ISO7721
- Digital isolators (ADI): ADuM140E, ADuM141E
Suggested log fields: comm_crc_err_count, seq_jump_count, link_reset_reason, latency_ms_p99, isolation_fault_flag.
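The seq_jump_count field above can be produced by a tiny per-link monitor. A minimal sketch, assuming a 16-bit wrapping sequence number (the width is an assumption, not part of any named protocol here):

```python
# Sketch of sequence-jump accounting on an isolated link. Any frame whose
# sequence number is not the expected successor increments seq_jump_count;
# classifying forward jumps (loss) vs backward jumps (replay/reorder) is
# left to the caller.
SEQ_MOD = 1 << 16   # assumed 16-bit sequence counter

class SeqMonitor:
    def __init__(self):
        self.expected = None
        self.seq_jump_count = 0

    def on_frame(self, seq):
        if self.expected is not None and seq != self.expected:
            self.seq_jump_count += 1
        self.expected = (seq + 1) % SEQ_MOD
```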
Bucket 4 — Speed / position AFE (encoder / Hall / MR + ΣΔ / ADC examples)
Evidence role: turn raw edges/fields into credible speed evidence with diagnosable failure modes (H2-5/H2-10). Selection gates: input protection, thresholds/hysteresis, open/short diagnostics, noise tolerance, and observable health counters.
Example MPNs (sensor interface + conversion):
- Hall/MR sensor interface examples: Maxim/ADI MAX9926, MAX9927 (speed sensor front-end class)
- Magnetic encoder IC examples: ams AS5047P, Infineon TLE5012B
- Hall sensor IC examples (signal source): TI DRV5055 (linear Hall class), Allegro A1332 (angle sensor class)
- ΣΔ modulators (for isolated measurement patterns): ADI AD7403, AD7405
- Isolated amplifier / modulator class examples: TI AMC1301, AMC1311
- High-speed precision ADC class examples: ADI AD7380 (class), TI ADS131M family (class)
Suggested log fields: debounce_reject_count, open_short_flag, sensor_health_counter, wheel_speed_delta, slip_flag.
Bucket 5 — Power & supervision (wide-VIN, supervisor, watchdog, holdup)
Evidence role: prevent “mystery resets” and preserve last-gasp logging (H2-10/H2-11). Selection gates: reset reason observability, brownout thresholds, watchdog independence, holdup monitoring, and deterministic shutdown / commit windows.
Example MPNs (power front-end + supervision + protection):
- Supervisors / reset ICs: TI TPS3890, Maxim/ADI MAX16052, ADI LTC2937
- Watchdog timers: TI TPS3431, Maxim/ADI MAX6369
- Wide-VIN buck regulator class examples: TI LM5009 (class), ADI LT8609S (class)
- eFuse / protection: TI TPS25982, ADI LTC4368
- Power-path / power-mux controller class: TI TPS2121 (power mux class)
Suggested log fields: brownout_count, reset_reason_code, holdup_voltage_trace, last_gasp_write_status, wdt_reset_reason.
Bucket 6 — NVM & logging (durable storage + evidentiary logs)
Evidence role: preserve explainable timelines and tamper-resistant event history (H2-8/H2-9/H2-11). Selection gates: endurance strategy, power-fail behavior, commit granularity, and integrity tags (hash/signature) aligned with timestamps.
Example MPNs (NVM buckets commonly used for robust logging):
- SPI FRAM (Fujitsu): MB85RS64V, MB85RS256
- SPI FRAM (Cypress/Infineon): FM25V10 (family example)
- SPI NOR Flash (Winbond): W25Q128JV, W25Q256JV
- SPI NOR Flash (Micron): MT25QL128ABA (family example)
- Serial NAND Flash (Winbond): W25N01GV (family example)
- eMMC class (example bucket): JEDEC eMMC devices (use vendor-qualified part per lifecycle policy)
Suggested log fields: log_commit_latency, log_drop_count, time_valid_flag, hash_chain_head, signature_status, storage_wear_index.
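The hash_chain_head field above implies a chained commit scheme, which can be sketched in a few lines. SHA-256 and the JSON canonicalization are assumptions for illustration; the property that matters is that editing any record changes every later head.

```python
# Minimal hash-chain append for tamper-evident logging: each new head
# binds the current record to all prior ones via the previous head.
import hashlib
import json

def append_record(chain_head: bytes, record: dict) -> bytes:
    """Return the new chain head covering this record."""
    payload = json.dumps(record, sort_keys=True).encode()
    return hashlib.sha256(chain_head + payload).digest()

def chain_heads(records, genesis=b"\x00" * 32):
    """Recompute all heads; verification compares against stored heads."""
    head, heads = genesis, []
    for rec in records:
        head = append_record(head, rec)
        heads.append(head)
    return heads
```

In practice the current head (hash_chain_head) is committed with each segment, and signature_status covers a periodic signature over that head rather than over every record.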
H2-13. FAQs (Accordion ×12; each = conclusion + 2 evidence checks + 1 first fix)
How these FAQs are engineered (no scope creep)
Each answer is constrained to: one-sentence conclusion, two evidence checks (field-level), and one first fix (actionable). Every item maps back to H2-4 through H2-12, so the troubleshooting path is mechanically verifiable.
Train occasionally triggers emergency braking, but the wireless link looks “up” — session freshness or timestamp anomaly?
Conclusion: This pattern usually comes from an evidence gate expiring (freshness/time validity), not a total link drop.
Evidence checks: (1) Compare freshness_expired_count against session_state to see if messages are accepted but stale. (2) Check timestamp_valid_flag and clock_jump_flag around the brake trigger time.
First fix: Inspect reconnect_reason_code and time-source switch logs before tuning thresholds.
Speed suddenly spikes and triggers overspeed — sensor glitch or slip/slide misclassification?
Conclusion: A true spike must be separated from slip/slide plausibility conflicts using two-layer evidence.
Evidence checks: (1) Review debounce_reject_count and open_short_flag for transient edge artifacts. (2) Correlate wheel_speed_delta with accel_plausibility_flag (and slip_flag if present) to confirm whether dynamics are physically plausible.
First fix: Tighten capture of raw edge timing and filtered speed in the same window before replacing sensors.
After passing a balise, position still drifts — balise misread or odometry drift model not converging?
Conclusion: Drift after an anchor is usually an anchor credibility issue or a slow odometry bias that remains unbounded.
Evidence checks: (1) Validate misread_flag and expected_set_hit for anchor identity consistency. (2) Compare anchor_timestamp to timestamp_valid_flag and track post-anchor wheel_speed_delta trend for bias.
First fix: Require “anchor accepted” events to include a stable time-valid window and a bias-reset marker in logs.
Startup occasionally enters degraded mode — secure boot rejection or safety self-test coverage trigger?
Conclusion: Degraded-at-boot is normally either integrity failure (fail-closed) or compute credibility failing self-tests.
Evidence checks: (1) Read signature_fail_reason_code with boot_stage_id to locate where verification failed. (2) Check bist_fail_reason and ecc_correctable_rate to see if self-test or memory corrections forced safety downgrade.
First fix: Freeze the boot measurement digest and key version into the incident record before retrying or re-flashing.
Log timeline is scrambled and accountability fails — RTC drift or logging pipeline latency?
Conclusion: Timeline issues are rarely “just RTC”; they often come from a time-validity collapse or late commits.
Evidence checks: (1) Inspect clock_offset_ms, drift_rate_ppm, and timestamp_valid_flag across the suspect window. (2) Compare log_commit_latency and log_drop_count to see whether writes are delayed or lost.
First fix: Record time-valid transitions as explicit events and tag each log segment with a monotonic sequence number.
Frequent reconnects in weak coverage — latency tail out-of-window or replay/sequence protection triggering?
Conclusion: Reconnect storms typically come from tail latency and sequence/freshness gates, not average RSSI.
Evidence checks: (1) Use latency_ms_p99 and packet_loss_rate to confirm tail behavior vs thresholds. (2) Check seq_jump_count and freshness_expired_count to identify sequence/replay gating vs transport loss.
First fix: Start from reconnect_reason_code distribution to separate auth/seq/timeouts before any RF investigation.
After maintenance, “input mismatch” alarms increase — wiring/common-mode across isolation or thresholds too tight?
Conclusion: Maintenance often changes reference/return paths; isolation and common-mode behavior can create false mismatches.
Evidence checks: (1) Compare input_consistency_fail_count (or equivalent) with isolation_fault_flag/link_reset_reason to detect common-mode events. (2) Check clock_offset_ms or timing validity if mismatch correlates with timestamp anomalies.
First fix: Re-validate the isolation power/return layout and then re-baseline threshold IDs, not raw numeric thresholds.
Same line, different trains show big behavior differences — wheel/gear variation or calibration parameter version drift?
Conclusion: Cross-train variance is often parameter/version drift plus mechanical differences that the model does not absorb.
Evidence checks: (1) Compare calibration_param_version (or config hash) and threshold_id across vehicles. (2) Track long-term bias signals via wheel_speed_delta trend and sensor_health_counter to separate mechanics from electronics.
First fix: Lock parameter bundles under signed configuration and log config_hash at every trip start.
Compute has enough performance but shows periodic jitter — safety partition preemption or ECC correction “storm”?
Conclusion: Periodic jitter is usually a determinism problem (scheduling/ECC), not raw compute shortage.
Evidence checks: (1) Compare sched_jitter_hist against safety task periods to confirm preemption patterns. (2) Inspect ecc_correctable_rate and any lockstep_mismatch_count spikes that correlate with missed deadlines or degraded transitions.
First fix: Separate safety and non-safety bus/memory contention and cap ECC handling bursts to bounded service windows.
Authority updates look normal, but braking curve execution lags — I/O path delay or state machine stuck in conservative branch?
Conclusion: “Update OK but action late” is either vital I/O latency or a state machine holding conservative mode.
Evidence checks: (1) Compare reaction_timestamp and io_path_latency (or equivalent) from command to actuation confirmation readback. (2) Check state_transition_reason and action_taken to see whether RESTRICTED/DEGRADED policies block fast ramp-in.
First fix: Log the full I/O readback chain with sequence numbers and enforce explicit exit criteria from conservative branches.
Field suspects “firmware rollback” — how to prove anti-rollback is actually working?
Conclusion: Anti-rollback proof requires monotonic counters and version gating evidence, not verbal assurance.
Evidence checks: (1) Verify anti_rollback_counter_value and key_version_id are strictly increasing across updates. (2) Confirm rollback_attempt_count and signature_fail_reason_code capture any downgrade attempts with timestamps and boot stage attribution.
First fix: Ensure boot measurement digest and counter values are exported into signed incident reports for audits.
Critical events are lost during power-off — insufficient holdup or incorrect write policy?
Conclusion: Event loss at power-off is often policy/commit timing, even when holdup energy exists.
Evidence checks: (1) Compare holdup_voltage_trace against last_gasp_write_status to see whether voltage remained sufficient but commit started too late. (2) Inspect log_commit_latency and log_drop_count to detect oversized transactions or blocked I/O.
First fix: Implement a bounded “last-gasp” journal (small, fixed-size) to capture the final critical record before bulk writes.
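The bounded last-gasp record from this first fix can be sketched as a fixed-size packed structure. The layout, field widths, and CRC32 integrity tag are illustrative assumptions; the design point is that the record is small and constant-size, so it can be committed before any bulk write starts.

```python
# Hypothetical fixed-size last-gasp record: reason_code, brownout_count,
# microsecond timestamp, CRC32 tail. 20 bytes total, committed first.
import struct
import time
import zlib

LAST_GASP_FMT = "<IIQI"                           # reason, count, ts_us, crc32
LAST_GASP_SIZE = struct.calcsize(LAST_GASP_FMT)   # fixed and small (20 bytes)

def build_last_gasp(reason_code: int, brownout_count: int) -> bytes:
    """Pack one record; commit this FIRST on a power-fail warning."""
    body = struct.pack("<IIQ", reason_code, brownout_count,
                       int(time.time() * 1e6))
    return body + struct.pack("<I", zlib.crc32(body) & 0xFFFFFFFF)

def parse_last_gasp(rec: bytes):
    """Return (reason_code, brownout_count, ts_us), or None on CRC mismatch."""
    reason, count, ts, crc = struct.unpack(LAST_GASP_FMT, rec)
    if zlib.crc32(rec[:-4]) & 0xFFFFFFFF != crc:
        return None
    return reason, count, ts
```

A constant LAST_GASP_SIZE also lets the holdup budget be verified on the bench: the commit window only has to cover one write of known length.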