
Train Control & Monitoring System (TCMS)


TCMS is the train’s control-and-evidence backbone: it coordinates vehicle functions through safety compute, isolated I/O, and redundant networks while producing time-aligned logs that make faults explainable and auditable. The core goal is to turn every incident into defensible evidence—so decisions, degraded modes, and field fixes can be verified, replayed, and improved over time.

H2-1. TCMS in the Rail Architecture

Core takeaway: TCMS is not a “network box”. It is the vehicle-level control-and-evidence hub that makes actions accountable, states trustworthy, and incidents reconstructable.

What TCMS must guarantee

  • Command (control authority): who can issue an action, how it is accepted, executed, and confirmed.
  • State (single source of truth): how distributed measurements become one validated vehicle state with freshness/quality.
  • Evidence (replayable history): how events are time-aligned, signed/verified, and exported for maintenance and audits.

Typical topology (why it exists)

  • Vehicle controller: arbitration + decision + policy + aggregation.
  • Remote I/O modules: distributed sensing/actuation with isolation across long cables and ground offsets.
  • Gateways: bridge legacy buses (e.g., MVB/WTB) and Ethernet/TSN domains without losing determinism.
  • Recorder/maintenance access: ensures “one-time faults” are still captured with the right context.
Why TCMS is a reliability & safety carrier: it defines how failures are classified, how the system degrades, and what evidence is recorded for every switch, trip, or recovery condition.
Figure F1. TCMS is the vehicle-level hub that governs command authority, consolidates validated state, and produces time-aligned evidence logs across distributed onboard domains.

H2-2. Functional Partitioning: Control, Monitoring, Safety, and Evidence

Core takeaway: A TCMS problem becomes diagnosable only after it is routed to the right plane. Each plane has its own boundaries, failure modes, and evidence fields.

Control plane

  • Boundary: command generation, arbitration, execution confirmation.
  • Typical failures: duplicate/late commands, missing ACK, wrong target, inconsistent action across cars.
  • Evidence fields: command_id, source, target, accept/reject code, ACK latency, retry count.
  • First check: command_id → ACK → latency triad to confirm whether the action loop is intact.

Monitoring plane

  • Boundary: acquisition, state fusion, state freshness/quality, alarms.
  • Typical failures: false alarms, stale values, contradictory sources, wrong root cause labeling.
  • Evidence fields: timestamp, validity flag, stale_age, sensor_quality, counter deltas.
  • First check: validity + stale_age before trusting any displayed status.

Safety plane

  • Boundary: interlocks, emergency policies, voting/consistency, diagnostic triggers.
  • Typical failures: false trip, missed trip, inconsistent degrade decisions, unsafe recovery.
  • Evidence fields: safety_state, trip_cause, vote_result, diag_coverage hint, recovery_condition.
  • First check: trip_cause + vote_result to verify the decision is evidence-backed.

Maintenance plane

  • Boundary: diagnostics, log export, versioning, configuration control, updates.
  • Typical failures: missing context, untraceable versions, configuration drift, update-induced outages.
  • Evidence fields: fw_version, config_hash, export status, audit trail, update result code.
  • First check: fw_version + config_hash to prevent “mixed baselines” in field debugging.
Minimum evidence set (must exist): time-stamp quality, command/ACK, validated state freshness, reset/brownout causes, link health counters, self-test/watchdog outcomes.
Figure F2. Functional partitioning turns “complex system behavior” into diagnosable planes. Each plane has boundaries, failure signatures, and mandatory evidence fields.
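As a minimal sketch of the control-plane "first check", the command_id → ACK → latency triad can be expressed as a validator over the evidence fields listed above. The record type and the 50 ms latency budget are illustrative assumptions, not part of any TCMS standard:

```python
from dataclasses import dataclass

@dataclass
class CommandRecord:
    # Control-plane evidence fields (names follow the lists above; illustrative)
    command_id: int
    acked: bool
    ack_latency_ms: float
    retry_count: int

def action_loop_intact(rec: CommandRecord, latency_budget_ms: float = 50.0) -> bool:
    """First check: the action loop is intact only if the command was
    acknowledged, within budget, and without retries."""
    return rec.acked and rec.ack_latency_ms <= latency_budget_ms and rec.retry_count == 0
```

Routing an incident starts with this triad: if it fails, the problem is in the control plane before any monitoring or safety data needs to be examined.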

H2-3. Compute Core: Safety MCU/SoC Architecture Choices

The TCMS compute core is judged by its ability to detect and contain silent failures (wrong results without a crash), while preserving traceable evidence for every mismatch, correction, and recovery action.

Design objective: prevent “runs but wrong” behavior by combining redundancy choices, data integrity protection, and measurable diagnostic coverage.

Redundancy options (trade-offs + evidence)

  • Lockstep: detects instruction-level divergence; requires mismatch flags and defined safe reactions.
  • Dual-channel: independent computation paths; requires voting rules and correlated evidence to avoid false decisions.
  • Heterogeneous redundancy: different cores/implementations; improves common-cause resistance; demands interface-level consistency checks.
  • Verification focus: every strategy must output a decision record (what diverged, when, and how it was handled).

Memory & data integrity (from bits to evidence)

  • ECC: corrected/uncorrectable counters turn random corruption into explainable events.
  • E2E protection: CRC + sequence counters detect corruption, loss, duplication, and reordering across modules.
  • State validity: each critical state carries validity, age, and origin to prevent stale truth.
  • Evidence fields: corrected_count, uncorrectable_count, crc_fail_count, seq_gap_count, last_pass_ts.
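The E2E protection described above can be sketched in a few lines, assuming CRC32 and a big-endian 32-bit sequence counter. Both are illustrative choices; real safety profiles define their own polynomials and framing:

```python
import struct
import zlib

def frame(payload: bytes, seq: int) -> bytes:
    """E2E protection sketch: prepend a sequence counter, append CRC32 over seq+payload."""
    body = struct.pack(">I", seq) + payload
    return body + struct.pack(">I", zlib.crc32(body))

def check(framed: bytes, expected_seq: int):
    """Return (payload, events); events name the detected fault class."""
    body, (crc,) = framed[:-4], struct.unpack(">I", framed[-4:])
    events = []
    if zlib.crc32(body) != crc:
        events.append("crc_fail")                        # corruption
        return None, events
    (seq,) = struct.unpack(">I", body[:4])
    if seq < expected_seq:
        events.append("duplicate_or_reorder")
    elif seq > expected_seq:
        events.append(f"seq_gap:{seq - expected_seq}")   # loss
    return body[4:], events
```

Note how each detection maps to a named evidence event (crc_fail, seq_gap) rather than a silent drop, which is exactly what makes the counters above trendable.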

How to prove it “did not compute wrong”

  • Boot-time self-test: CPU/RAM/flash checks with itemized results and durations.
  • Periodic tests: scheduled coverage refresh for safety mechanisms and critical paths.
  • Runtime monitors: watchdogs, deadline budgets, lockstep mismatch, ECC rate alarms.
  • Mandatory outcome: each detection must map to a defensible action (degrade, reset, isolate, alert).

Engineering acceptance checklist

  • Can the system separate crash vs silent wrong vs degradation?
  • Does each safety decision generate a time-aligned record (trigger → evidence → action → recovery)?
  • Are integrity counters trendable (rate-of-change), not only instantaneous flags?
  • Is there a defined policy for uncorrectable events and repeated corrections?
Practical rule: if a safety architecture cannot output evidence fields for mismatch, integrity failures, and recovery actions, the system will be hard to audit and harder to debug in one-time field incidents.
Safety Compute Core (TCMS) Partitioning + monitoring loop to detect silent failures Application Domain Control • Fusion • Networking Safety Domain Interlocks • Policy Monitor / Diagnostics Watchdog • Deadlines • BIST Integrity Engine ECC • CRC • Seq Counters Redundancy Lockstep / Dual-channel Evidence Output WD Count ECC Rate CRC Fails Mismatch Degrade / Reset
Figure F3. A safety compute core is defined by partitioning and a measurable monitoring loop that turns silent faults into evidence-backed actions.

H2-4. Isolated I/O Modules: DI/DO/AI/AO and the Isolation Boundary

In TCMS, many field failures originate at the I/O boundary: long harnesses, ground offsets, transients, and EMI can turn inputs/outputs into untrustworthy signals unless isolation, protection, and diagnostics are designed as a single chain.

Design objective: make each I/O channel locatable (fault side identification), survivable (transient containment), and auditable (evidence fields).

I/O types & field failure signatures

  • DI jitter: bounce/noise creates false transitions; verify debounce rejects and pulse-width stats.
  • DO stuck: command differs from feedback; check overcurrent trips and stuck-detect flags.
  • AI drift: slow bias shift or saturation; correlate with common-mode events and temperature.
  • AO open-load: actuator missing or wiring open; verify compliance flags and open-load detect.
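A minimal sketch of a DI debounce that also produces the evidence fields named below (bounce_count, min_pulse_width). The stability-window model is an assumed simplification of real channel hardware:

```python
class DebouncedInput:
    """DI debounce emitting evidence counters: bounce_count (rejected narrow
    pulses) and min_pulse_width_ms (trendable cable-health indicator)."""
    def __init__(self, min_stable_ms: float):
        self.min_stable_ms = min_stable_ms
        self.state = 0                      # validated (debounced) level
        self._raw = 0                       # last raw sample level
        self._raw_since_ms = 0.0            # when the raw level last changed
        self.bounce_count = 0
        self.min_pulse_width_ms = float("inf")

    def sample(self, level: int, now_ms: float) -> int:
        if level != self._raw:
            # Raw edge: if the previous raw level never stabilized, count a bounce.
            if self._raw != self.state:
                width = now_ms - self._raw_since_ms
                self.bounce_count += 1
                self.min_pulse_width_ms = min(self.min_pulse_width_ms, width)
            self._raw = level
            self._raw_since_ms = now_ms
        elif level != self.state and now_ms - self._raw_since_ms >= self.min_stable_ms:
            self.state = level              # accepted: stable long enough
        return self.state
```

The rejected-pulse statistics are what make a noisy harness locatable: a rising bounce_count with a shrinking min_pulse_width points to the cable side, not the logic side.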

Isolation strategy (define the boundary)

  • Signal isolation: prevents ground-offset coupling into logic-domain thresholds and ADC references.
  • Isolated power: keeps AFE supply stable under harness transients and ground shifts.
  • Common-mode suppression: reduces CM energy that otherwise leaks through parasitics.
  • Protection chain: containment path for surge/ESD before it reaches AFE and isolator.

Evidence fields (make faults locatable)

  • Open/short detect: open_load, short_to_batt, short_to_gnd, overcurrent_trip_count.
  • Cable diagnostics: bounce_count, min_pulse_width, intermittent_count, cm_event_count.
  • Consistency checks: redundant input mismatch flags and voting outcomes.
  • Quality tagging: validity flag, stale age, saturation flag, out-of-range markers.

Engineering acceptance checklist

  • Can a failure be located to cable side vs isolation side vs logic side?
  • Does each channel expose trendable counters (not only a one-shot alarm)?
  • Are fault flags time-aligned with supply and network health indicators?
  • Does the design define the safe default state for DO/AO when diagnostics fail?
Practical rule: without boundary-aware evidence (which side failed), field repair becomes trial-and-error replacement rather than deterministic diagnosis.
Figure F4. The I/O boundary is engineered as a chain. Isolation and diagnostics exist to contain transients and to localize faults to cable-side, barrier-side, or logic-side.

H2-5. Redundant Communications: Vehicle Networks and Gateway Strategy

Redundant communications must be engineered as a verifiable system: measurable link health, defined switching triggers, bounded switchover time budgets, and evidence logs that explain every failover decision.

Design objective: keep command authority and state truth stable during network degradation, while preserving a defensible record of triggers, decisions, and outcomes.

Common roles on vehicle networks

  • TCMS master: command arbitration and consistency across cars/domains.
  • Remote I/O nodes: field-state acquisition with diagnostics counters and quality flags.
  • Gateways: determinism boundary between legacy bus segments and Ethernet/TSN domains.
  • HMI: operator loop for commands, alarms, and confirmation paths.
  • Recorder: evidence sink for network events, state snapshots, and recovery actions.

Redundancy modes (conceptual)

  • Cold-standby: backup is inactive; switchover relies on re-acquiring state and authority.
  • Hot-standby: backup syncs critical state; only one side publishes authoritative outputs.
  • Dual-active: both are active; requires arbitration to prevent dual-master conflicts.
  • Evidence requirement: each mode must log who was authoritative and why it changed.

Fault criteria (measurable triggers)

  • Loss rate: packet drop percentage within a defined time window.
  • Latency & jitter: budget exceedance and variance expansion (control-impact signature).
  • Link flap: repeated up/down transitions and route instability.
  • Switch triggers: graded states (degraded → failed) with hysteresis to avoid flapping.
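The graded states with hysteresis can be sketched as follows; the loss-rate thresholds are placeholder values, not normative figures:

```python
class LinkHealth:
    """Graded link state (ok -> degraded -> failed) with hysteresis so the
    state cannot flap around a single threshold. Thresholds are illustrative."""
    def __init__(self, degrade_at=0.02, fail_at=0.10, recover_at=0.01):
        self.degrade_at, self.fail_at, self.recover_at = degrade_at, fail_at, recover_at
        self.state = "ok"

    def update(self, loss_rate: float) -> str:
        if loss_rate >= self.fail_at:
            self.state = "failed"
        elif loss_rate >= self.degrade_at:
            if self.state != "failed":      # failed only recovers via the low band
                self.state = "degraded"
        elif loss_rate <= self.recover_at:
            # hysteresis: recover only below a lower bound, one grade at a time
            self.state = {"failed": "degraded", "degraded": "ok", "ok": "ok"}[self.state]
        return self.state
```

Loss rates in the band between recover_at and degrade_at intentionally hold the current grade, which is what suppresses oscillation near the switch trigger.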

Degrade strategy (minimum functionality)

  • Maintain command safety: preserve an unambiguous command authority path.
  • Maintain essential state: keep validated status for the minimum set of safety-relevant signals.
  • Protect evidence: ensure failover decisions and link-health context are time-stamped and exportable.
  • Recovery conditions: define when normal mode may resume (stability window + counters).
Practical rule: a redundant network is only “real” if the system can explain (with evidence fields) what degraded, what triggered the switch, who became authoritative, and how long it took.
Mode comparison: trigger (health criteria + window), switch budget, state handover rule, and evidence to log.

Cold-standby
  • Trigger (health criteria + window): link down / sustained loss; confirmed by flap counter + windowed stats.
  • Switch budget: longer; requires re-acquire and re-validate.
  • State handover rule: state rebuild baseline; verify freshness + validity before authority.
  • Evidence to log: trigger stats, last-known authority, re-sync duration, missed data window.

Hot-standby
  • Trigger (health criteria + window): degraded → failed threshold crossed; hysteresis to avoid oscillation.
  • Switch budget: bounded; backup already synced.
  • State handover rule: last common state ID (seq counter) + integrity check (CRC).
  • Evidence to log: authority change, seq gap, sync latency, decision code, outcome.

Dual-active
  • Trigger (health criteria + window): arbitration conflict / inconsistent link view / split-brain indicators.
  • Switch budget: tight; must prevent dual-master outputs.
  • State handover rule: token/vote result; deterministic tie-break + rollback rule.
  • Evidence to log: vote/arbitration result, conflict count, resolved leader, time-to-stable.
Figure F5. Redundant communications are verified through measurable link health, bounded failover, and evidence logs that record triggers, authority changes, and outcomes.

H2-6. Time Base & Event Correlation: PTP/Local Clock/Monotonic Logging

Event correlation is where TCMS becomes reconstructable. A unified time base ensures I/O edges, network degradations, fault codes, and power-health signals can be assembled into a defensible sequence of cause-and-effect.

Design objective: provide time alignment across domains while preserving a monotonic ordering anchor and a timestamp-quality record when synchronization degrades.

Time sources (roles)

  • PTP: cross-domain alignment for networked modules and gateways.
  • RTC: human-readable wall-clock context across long downtime windows.
  • Monotonic counter: non-decreasing ordering anchor that prevents “time jump” reordering.
  • Rule: incident logs must include monotonic time even when wall-clock is invalid.

What must be correlated

  • I/O events: edges, output feedback, open/short and quality flags.
  • Network events: loss/jitter spikes, route changes, failover transitions.
  • Fault codes: assert/clear with context snapshots.
  • Power health: rail minima, brownout markers, reset causes and counters.

Common pitfalls (detectable)

  • Time step / jump: wall-clock changes suddenly; detect via monotonic vs wall-clock delta.
  • Holdover drift: offset trends while external sync is lost.
  • Mixed granularity: hardware vs software timestamp points create inconsistent skew.
  • Action: downgrade timestamp_quality and preserve ordering with monotonic time.
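Time-jump detection via the monotonic vs wall-clock delta can be sketched as follows; the tolerance value and field names are assumptions matching the evidence lists in this chapter:

```python
def classify_timestamp(prev_wall_s, prev_mono_s, wall_s, mono_s, tol_s=0.1):
    """Detect a wall-clock step by comparing the wall delta to the monotonic
    delta between two records. On a jump, downgrade timestamp_quality;
    ordering always relies on monotonic time."""
    skew = (wall_s - prev_wall_s) - (mono_s - prev_mono_s)
    quality = "good" if abs(skew) <= tol_s else "degraded"
    return {"ts_mono": mono_s, "ts_wall": wall_s,
            "timestamp_quality": quality, "wall_step_s": skew}
```

Because the monotonic delta is immune to NTP/PTP steps, any large residual skew is attributable to the wall clock, and the record carries that verdict with it.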

Mandatory evidence fields

  • sync_state: locked / holdover / free-run / invalid
  • offset: relative deviation to the selected master
  • holdover_time: duration since last valid sync
  • timestamp_quality: good / degraded / bad
  • ts_source: HW timestamp / SW timestamp / derived
Practical rule: a timestamp without a quality indicator is not auditable. Log correlation depends on both time values and time credibility.
Figure F6. Time evidence is a chain: sources and synchronization create timestamps; timestamps gain audit value only when logged with quality fields and replayed as a correlated timeline.

H2-7. Logging & Black-Box Evidence: What to Record and Why

Logging becomes defensible only when it is treated as an evidence product: it must reconstruct what happened, why it happened, what the system decided, and whether recovery was successful—using time-aligned, integrity-protected records.

Design objective: turn incidents into repeatable “evidence bundles” (event + snapshot + trend slice) with integrity markers and timestamp quality.

Three-layer logging model

  • Event: high-density triggers (failover, threshold crossing, fault assert/clear).
  • Snapshot: the system’s complete state around an incident (pre/post window).
  • Trend: counters and statistics that reveal slow degradation (rates and deltas).
  • Rule: events must reference the snapshot and the relevant trend slice via correlation IDs.

Trigger mechanisms (incident bundles)

  • Threshold triggers: power/thermal/offset budget exceedance.
  • Integrity triggers: CRC fail, sequence gaps, invalid state transitions.
  • Network triggers: loss/jitter spikes, flap, route changes, failover.
  • Manual triggers: operator/maintenance request with reason code.
  • Bundle: capture pre/post windows and include timestamp_quality and sync_state.

Integrity & power-loss robustness

  • Record integrity: record CRC/signature + monotonic sequence number.
  • Atomic commit: avoid half-written records that corrupt evidence.
  • Power-loss marker: mark incomplete commits and preserve last critical bundle.
  • Audit value: logs remain explainable even when a crash happens mid-write.
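A minimal sketch of integrity-marked, commit-flagged records; the JSON-plus-CRC32 encoding and in-memory store are illustrative stand-ins for a real binary journal:

```python
import json
import zlib

class EvidenceLog:
    """Append-only record store: each record carries record_seq + record_crc,
    plus a commit flag so half-written records stay detectable after power loss."""
    def __init__(self):
        self.records = []
        self.seq = 0

    def append(self, payload: dict, simulate_power_loss: bool = False):
        self.seq += 1
        body = json.dumps({"record_seq": self.seq, **payload}, sort_keys=True)
        rec = {"body": body, "record_crc": zlib.crc32(body.encode()), "committed": False}
        self.records.append(rec)       # data is persisted first...
        if simulate_power_loss:
            return                     # ...crash before the commit marker lands
        rec["committed"] = True        # commit marker written last (atomic commit)

    def replay(self):
        """Skip uncommitted or corrupt records; count them as power_fail markers."""
        good, power_fail_markers = [], 0
        for rec in self.records:
            ok = rec["committed"] and zlib.crc32(rec["body"].encode()) == rec["record_crc"]
            if ok:
                good.append(json.loads(rec["body"]))
            else:
                power_fail_markers += 1
        return good, power_fail_markers
```

The design choice is commit-marker-last: a crash anywhere before the marker leaves an explainable incomplete record instead of a silently corrupt one.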

Engineering acceptance checklist

  • Can an incident answer: trigger → context → decision → outcome?
  • Do events include time quality and a correlation_id?
  • Are counters trendable (deltas/rates), not only absolute values?
  • Can evidence survive a one-time crash via atomic commits and markers?
Practical rule: “one-line logs” rarely reconstruct incidents. Evidence requires a bundle: event + snapshot + trend slice, all time-aligned and integrity-marked.

Evidence field checklist (Ctrl+F friendly)

LOG-CHECKLIST:EVENT
  • event_id — unique incident record identifier
  • event_type — threshold / integrity / network / manual
  • event_ts_mono — monotonic timestamp for ordering
  • event_ts_wall — wall-clock timestamp (optional but recommended)
  • timestamp_quality — good / degraded / bad
  • source_module — gateway / io / compute
  • severity — info / warn / fault
  • decision_code — degrade / switch / reset / isolate
  • decision_outcome — success / failed / partial
  • correlation_id — links event ↔ snapshot ↔ trend slice
LOG-CHECKLIST:SNAPSHOT
  • snapshot_id — snapshot record identifier
  • operating_state — ready / operate / degraded / fault
  • authority_owner — current authority (A / B / peer)
  • network_health_summary — loss/jitter/flap summary
  • power_health_summary — rail minima + reset causes + counters
  • io_quality_summary — invalid/saturation/open-short summary
  • active_faults_list — asserted fault list
  • last_transition_reason — why the last state changed
LOG-CHECKLIST:TREND
  • trend_window_start / trend_window_end — statistics window
  • loss_rate_avg / loss_rate_peak — loss stats
  • jitter_p95 — jitter percentile indicator
  • crc_fail_rate / seq_gap_rate — integrity rates
  • watchdog_count_delta — watchdog delta since last window
  • ecc_corrected_delta / ecc_uncorrectable_delta — ECC deltas
  • flap_count_delta — link flap delta
  • holdover_time — time sync holdover duration
LOG-CHECKLIST:INTEGRITY
  • record_crc — record-level checksum
  • record_seq — record sequence counter
  • write_result — commit result code
  • atomic_commit_id — atomic transaction identifier
  • power_fail_marker — incomplete-write marker
Figure F7. Evidence logging is a pipeline: triggers capture multi-layer data, bundles preserve context, integrity markers protect audit value, and replay reconstructs the incident timeline.

H2-8. Self-Test & Diagnostics Coverage: Boot-Time vs Run-Time

Self-test is the mechanism that turns safety claims into measurable coverage. It must define when tests run, what fault classes they detect, what evidence is produced, and what the system does when checks fail.

Design objective: combine boot-time assurance (POST/BIST) with continuous run-time monitors to detect silent faults, degradations, and intermittent failures—then record defensible evidence.

Boot-time self-test (POST/BIST)

  • Compute & memory: CPU/RAM/flash checks with itemized results and durations.
  • Test vectors: versioned vectors to prevent “test mismatch” after software updates.
  • I/O path loopback: verify acquisition → processing → output feedback chain.
  • Outcome: pass → ready; fail → controlled degraded/fault entry with evidence logs.

Run-time monitoring (continuous)

  • Liveness & deadlines: watchdog, task alive, deadline miss counters.
  • Comms health: loss/jitter/flap, CRC fail, sequence gaps.
  • Integrity: ECC trends, E2E CRC/seq consistency, mismatch counts.
  • Consistency: redundant inputs or channels mismatch and voting outcomes.

Diagnostics coverage mindset

  • Start from fault classes: define what must be detectable (silent wrong, intermittent, degradation).
  • Map detection → evidence: each detection emits fields that explain “why” and “where”.
  • Define reaction semantics: degrade vs fault entry is evidence-driven, not arbitrary.
  • Recovery conditions: stability windows and retest rules gate a return to normal.

Failure handling (strategy semantics)

  • Fail-safe: stop high-risk outputs while preserving alarms and evidence flow.
  • Limp-home: keep a minimum function set with degraded quality flags and strict limits.
  • Evidence-first: every transition must record trigger, checks performed, and outcome.
  • Non-negotiable: do not restore normal mode until monitors confirm stability.
Practical rule: coverage is defined by detected fault classes and recorded evidence—not by the number of tests listed.
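The transition-as-evidence idea can be sketched as a small state machine; the transition table and reason codes below are illustrative, not a normative TCMS state model:

```python
class DiagStateMachine:
    """boot -> ready -> operate -> degraded -> fault, with an evidence record
    emitted on every transition and illegal transitions rejected outright."""
    ALLOWED = {
        ("boot", "ready"), ("boot", "fault"),
        ("ready", "operate"),
        ("operate", "degraded"), ("operate", "fault"),
        ("degraded", "operate"),   # retest pass
        ("degraded", "fault"),
    }

    def __init__(self):
        self.state = "boot"
        self.evidence = []         # one record per accepted transition

    def transition(self, target: str, reason_code: str) -> bool:
        if (self.state, target) not in self.ALLOWED:
            return False           # no silent or implicit state changes
        self.evidence.append({"from": self.state, "to": target,
                              "reason_code": reason_code})
        self.state = target
        return True
```

Because a transition either produces a record or does not happen, the evidence list is a complete replayable history of the machine's mode changes.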
Figure F8. Diagnostics becomes auditable when state transitions are evidence-producing actions, not implicit behavior. Boot-time tests establish baseline; run-time monitors preserve coverage during operation.

H2-9. Fault Handling & Degraded Operation: Decision Logic You Can Defend

Fault handling becomes defensible when it is written as auditable, testable rules: clear fault classes, explicit voting and arbitration semantics, evidence-backed transitions into degraded operation, and objective recovery gates.

Design objective: every degrade/switch/isolate action must be explainable as trigger evidence → decision → action → recovery condition, recorded with correlation IDs.

Fault classification (policy router)

  • Transient vs persistent: windowed spikes vs sustained violations drive action strength and recovery barriers.
  • Single-point vs multi-point: local anomalies vs cross-domain correlation (systemic risk signature).
  • Recoverable vs non-recoverable: automatic restore requires stability windows and retest pass; otherwise manual clear.
  • Rule: classification dictates what evidence must be collected and which actions are allowed.

Voting & consistency (mismatch handling)

  • Mismatch semantics: quantify disagreement by count + duration, not by a single sample.
  • Arbitration outputs: vote_result and confidence_level must be logged.
  • Anti-flap: hysteresis + cooldown prevents oscillation near thresholds.
  • Isolation scope: isolate the faulty channel while preserving stable authority paths where possible.

Evidence requirements (every transition)

  • Trigger context: window statistics and source module ID.
  • Decision trace: reason code, classification, and arbitration summary.
  • Action trace: degrade/switch/isolate/limit plus result status.
  • Recovery gate: stability window, retest status, and explicit restore decision.

Degraded operation (strategy semantics)

  • Degraded: keep a minimum validated control/monitor set with quality flags.
  • Fault: stop high-risk outputs; preserve alarms, evidence, and maintenance access.
  • Restore: never automatic without objective stability + retest criteria.
  • Audit value: transitions must be replayable using correlated evidence bundles.
Practical rule: without anti-flap gates (window + hysteresis + cooldown), “degraded operation” can become a new fault generator.
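The anti-flap gate (window + hysteresis + cooldown) can be sketched as follows; the window lengths are illustrative tuning parameters:

```python
class AntiFlapGate:
    """Enter degraded only after a sustained violation window; leave only after
    a clean stability window plus a cooldown margin, so brief calm cannot
    restore normal mode."""
    def __init__(self, enter_after=3, exit_after=5, cooldown=2):
        self.enter_after = enter_after
        self.exit_after = exit_after
        self.cooldown = cooldown
        self.bad_streak = 0
        self.good_streak = 0
        self.degraded = False

    def update(self, violation: bool) -> bool:
        if violation:
            self.bad_streak += 1
            self.good_streak = 0
            if self.bad_streak >= self.enter_after:
                self.degraded = True
        else:
            self.good_streak += 1
            self.bad_streak = 0
            # exit gate = stability window + cooldown
            if self.degraded and self.good_streak >= self.exit_after + self.cooldown:
                self.degraded = False
        return self.degraded
```

The asymmetry (fast entry, slow exit) is deliberate: a threshold that is equally easy to cross in both directions is exactly the fault generator the rule above warns about.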

IF/THEN Rule Block A — Network degradation escalation


IF network loss/jitter exceeds the defined budget for a sustained window (with flap count rising), THEN classify as persistent network degradation, DO switch authority path and limit non-essential traffic, EXIT only after a stability window + retest pass confirms healthy metrics.

  • Record: event_id, event_type, network_health_summary, decision_code, decision_outcome, recovery_gate

IF/THEN Rule Block B — Redundant input mismatch arbitration


IF redundant channels disagree beyond mismatch thresholds (count + duration) and integrity checks are failing, THEN classify as single-point channel fault unless cross-domain correlation indicates multi-point risk, DO isolate the disagreeing channel and publish the arbitration result with confidence, EXIT only after cooldown + retest validates consistency over a stability window.

  • Record: mismatch_count, mismatch_duration, disagreeing_channel, vote_result, confidence_level

IF/THEN Rule Block C — Multi-point anomaly escalation


IF multiple domains degrade simultaneously (I/O quality drops + network instability + time-quality degrades), THEN classify as multi-point/systemic risk, DO enter minimum validated function set and increase evidence capture density, EXIT only after the root cause indicators clear and post-checks pass.

  • Record: timestamp_quality, sync_state, io_quality_summary, network_health_summary, operating_state

IF/THEN Rule Block D — Restore to normal mode


IF health metrics remain within budget for the required stability window and all retests pass, THEN classify as recoverable, DO restore normal mode with a logged authority confirmation, EXIT by recording the restore event and a post-restore snapshot.

  • Record: recovery_gate, decision_code, authority_owner, snapshot_id, decision_outcome
Figure F9. Defensible fault handling is a decision pipeline that links evidence inputs to classification, voting, actions, and explicit recovery gates—then logs the full chain for audit and replay.

H2-10. Power Integrity & Resilience for TCMS Nodes

TCMS resilience depends on preventing “silent wrong computation” during supply disturbances and on producing evidence that explains resets, degraded behavior, and write-protection outcomes—without expanding into traction or auxiliary power topologies.

Design objective: convert random resets and intermittent logic faults into explainable events using rail monitoring, brownout policies, and correlated evidence fields.

Brownout & reset behavior (closed loop)

  • Brownout policy: define UV threshold + hysteresis + response budget.
  • Reset cause: distinguish brownout vs watchdog vs software reset using reason codes.
  • Write protection: enforce atomic commits and markers during power instability windows.
  • Audit value: a reset must be explainable using rail_min and reset_cause context.
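The UV-with-hysteresis policy and its evidence fields can be sketched as follows; the threshold voltages are placeholders, not design values:

```python
class RailMonitor:
    """Undervoltage detection with hysteresis; records rail_min and a brownout
    counter so a later reset can be explained (reset_cause = brownout)."""
    def __init__(self, uv_threshold_v=4.5, recover_v=4.7):
        self.uv_threshold_v = uv_threshold_v
        self.recover_v = recover_v           # hysteresis: recover above a higher bound
        self.rail_min = float("inf")
        self.brownout_count = 0
        self.in_brownout = False

    def sample(self, rail_v: float) -> bool:
        self.rail_min = min(self.rail_min, rail_v)
        if not self.in_brownout and rail_v < self.uv_threshold_v:
            self.in_brownout = True
            self.brownout_count += 1         # one event per excursion, not per sample
        elif self.in_brownout and rail_v >= self.recover_v:
            self.in_brownout = False
        return self.in_brownout

    def reset_evidence(self) -> dict:
        cause = "brownout" if self.brownout_count else "unknown"
        return {"reset_cause": cause, "rail_min": self.rail_min,
                "brownout_count": self.brownout_count}
```

Without hysteresis, a rail hovering near the UV threshold would inflate brownout_count with every noisy sample and destroy its value as a trend indicator.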

Power health monitoring strategy

  • PG (power-good): gating “allowed to operate” conditions.
  • Rail monitor/ADC: capture rail minima/maxima and windowed statistics.
  • Sampling policy: measure during critical phases (boot, mode switch, write commit).
  • Trendability: store brownout_count and rail_min as deltas over windows.

Core evidence fields (correlatable)

  • reset_cause — brownout / watchdog / software
  • rail_min / rail_max — pre/post incident extrema
  • brownout_count — trendable frequency indicator
  • watchdog_count_delta — liveness consequence indicator
  • timestamp_quality — time credibility during the incident

Make resets explainable

  • Event linking: power events must create incident bundles (event + snapshot + trend slice).
  • State linkage: include operating_state transition reason and post-check results.
  • Write linkage: include power_fail_marker and atomic_commit_id.
  • Outcome: logs prove whether a reset was an expected protection action or an uncontrolled crash.
Practical rule: rail extrema (rail_min) and reset_cause are root-cause anchors; without them, “random reset” remains non-actionable.
Figure F10. Power resilience for TCMS nodes is an evidence loop: monitoring and protection define behavior during brownouts, while correlated fields (reset cause, rail minima, counters) explain outcomes and protect audit value.

H2-11. Verification & Field Feedback Loop for TCMS

A TCMS improves fastest when field incidents are converted into repeatable evidence packages, aligned on a single time base, narrowed to a responsible domain, reproduced under controlled conditions, and then used to update thresholds, decision rules, and logging fields with versioned compatibility.

Design objective: turn each incident into a measurable upgrade: collect → align → narrow → reproduce → update → deploy → monitor, with evidence proving improvement.

What this loop produces

  • Audit-ready incident bundles: event + snapshot + trend slice linked by correlation_id.
  • Defensible root-cause claims: hypotheses that can be proven or falsified by evidence fields.
  • Upgradeable schemas: new fields added with schema_version and backward compatibility rules.
  • Predictive maintenance assets: trend indicators that move “intermittent” into measurable frequency and degradation.

Evidence prerequisites (links to earlier chapters)

  • Time credibility: timestamp_quality, sync_state, holdover markers (from time-base chapter).
  • Incident bundle structure: event/snapshot/trend and integrity markers (from logging chapter).
  • State transitions: operate/degraded/fault transitions with reasons (from self-test/diagnostics chapter).
  • Decision trace: degrade/switch/isolate with recorded triggers and recovery gates (from fault-handling chapter).
Practical rule: field data that cannot be time-aligned and schema-identified becomes “archived” rather than “evidentiary.”

Step 1 — Collect (incident package)


Input: incident bundle and raw records. Output: a single “incident package” with integrity checks and versions.

  • Must include: schema_version, record_seq, record_crc, timestamp_quality.
  • Acceptance: package integrity can be validated offline with deterministic results.
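The acceptance criterion above can be sketched as a deterministic offline check. A minimal sketch, assuming the evidence fields named in this chapter (schema_version, record_seq, record_crc, timestamp_quality); the CRC canonicalization over a sorted-key JSON payload is a hypothetical convention, not a fixed TCMS format:

```python
import binascii
import json

def validate_package(records):
    """Offline integrity check for an incident package.

    Each record is a dict with the fields this chapter requires:
    schema_version, record_seq, record_crc, timestamp_quality, payload.
    Returns (ok, problems) so repeated runs give deterministic results.
    """
    problems = []
    expected_seq = None
    for rec in records:
        # Required identity fields must be present (Step 1 "must include").
        for field in ("schema_version", "record_seq", "record_crc", "timestamp_quality"):
            if field not in rec:
                problems.append((rec.get("record_seq"), f"missing {field}"))
        # Sequence continuity: a gap means lost evidence, not just noise.
        seq = rec.get("record_seq")
        if expected_seq is not None and seq != expected_seq:
            problems.append((seq, f"sequence gap: expected {expected_seq}"))
        expected_seq = (seq + 1) if isinstance(seq, int) else None
        # CRC over a canonicalized payload must match the stored value.
        payload = json.dumps(rec.get("payload", {}), sort_keys=True).encode()
        if binascii.crc32(payload) != rec.get("record_crc"):
            problems.append((seq, "crc mismatch"))
    return (len(problems) == 0, problems)
```

Because every failure is reported as a (record_seq, reason) pair, the validation result itself becomes audit evidence rather than a pass/fail bit.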

Step 2 — Time Align (single timeline)


Input: PTP/RTC/monotonic stamps. Output: a unified timeline (monotonic primary, wall-clock secondary).

  • Must include: sync_state, holdover markers, time-jump detection.
  • Acceptance: cross-module events can be ordered without ambiguity.
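The "monotonic primary, wall-clock secondary" rule can be sketched as follows; the event fields (mono_s, wall_s, sync_state) are hypothetical names for the stamps this chapter assumes, and the jump threshold is an illustrative parameter:

```python
def unify_timeline(events, jump_threshold_s=1.0):
    """Build one ordered timeline from mixed time stamps.

    Each event carries: mono_s (monotonic seconds, primary sort key),
    wall_s (wall clock, secondary), sync_state. Wall-clock jumps are
    detected where wall_s diverges from the monotonic delta, so
    'ordered by wall clock' mistakes become visible instead of silent.
    """
    ordered = sorted(events, key=lambda e: e["mono_s"])  # monotonic is primary
    jumps = []
    for prev, cur in zip(ordered, ordered[1:]):
        mono_dt = cur["mono_s"] - prev["mono_s"]
        wall_dt = cur["wall_s"] - prev["wall_s"]
        # A wall-clock step (sync correction, holdover exit) shows up as
        # wall_dt diverging from mono_dt; record it as a time-jump marker.
        if abs(wall_dt - mono_dt) > jump_threshold_s:
            jumps.append({"at_mono_s": cur["mono_s"],
                          "jump_s": wall_dt - mono_dt,
                          "sync_state": cur.get("sync_state")})
    return ordered, jumps
```

The returned jump markers satisfy the "time-jump detection" requirement: any cross-module ordering claim can cite whether a jump occurred inside the analyzed window.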

Step 3 — Domain Narrow (responsibility)


Input: network/I/O/integrity/power evidence summaries. Output: a responsible domain with supporting fields.

  • Examples: network_health_summary, io_quality_summary, reset_cause, mismatch counters.
  • Acceptance: domain choice must cite evidence fields, not inference.
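"Domain choice must cite evidence fields, not inference" can be enforced mechanically: each attribution rule names the fields it reads, so the output is always (domain, cited fields). A minimal sketch with hypothetical rule thresholds; the summary field names follow this chapter:

```python
def narrow_domain(evidence):
    """Attribute an incident to a domain only when evidence fields cite it.

    `evidence` maps summary fields to values, e.g. network_health_summary,
    io_quality_summary, reset_cause, mismatch_count. Returns
    (domain, cited_fields) or ("unattributed", []) -- never a guess.
    """
    # Each rule carries the list of fields it cites, so the choice is auditable.
    rules = [
        ("power", lambda e: e.get("reset_cause") in ("brownout", "watchdog"),
         ["reset_cause"]),
        ("network", lambda e: e.get("network_health_summary", {}).get("flap_count", 0) > 0,
         ["network_health_summary.flap_count"]),
        ("io", lambda e: e.get("io_quality_summary", {}).get("invalid_ratio", 0.0) > 0.01,
         ["io_quality_summary.invalid_ratio"]),
        ("integrity", lambda e: e.get("mismatch_count", 0) > 0,
         ["mismatch_count"]),
    ]
    for domain, predicate, fields in rules:
        if predicate(evidence):
            return domain, fields
    return "unattributed", []
```

"Unattributed" is a legitimate outcome: it signals an evidence gap to fix in Step 6, rather than forcing a guessed domain into the record.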

Step 4 — Hypothesize (falsifiable)


Input: aligned timeline + domain. Output: IF/THEN hypotheses mapped to expected field changes.

  • Rule: each hypothesis must be falsifiable by a defined set of fields.
  • Acceptance: expected observations are written as field deltas over time windows.
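"Expected observations as field deltas over time windows" can be made executable: a hypothesis is a set of per-field delta ranges, and evaluation returns supported, falsified, or inconclusive. A minimal sketch; the hypothesis structure is an assumed representation, not a TCMS-defined format:

```python
def check_hypothesis(hypothesis, window_fields):
    """Evaluate an IF/THEN hypothesis as expected field deltas.

    hypothesis: {"name": ..., "expected_deltas": {field: (min, max)}}
    window_fields: {field: (value_before, value_after)} over the window.
    Returns "supported", "falsified", or "inconclusive" (field missing,
    so the hypothesis cannot be tested -- an evidence gap, not a verdict).
    """
    verdict = "supported"
    for field, (lo, hi) in hypothesis["expected_deltas"].items():
        if field not in window_fields:
            return "inconclusive"  # cannot falsify without the field
        before, after = window_fields[field]
        delta = after - before
        if not (lo <= delta <= hi):
            verdict = "falsified"  # one out-of-range delta rejects it
    return verdict
```

The "inconclusive" branch is the important one: it converts a missing evidence field into a concrete logging-schema action item instead of an unprovable story.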

Step 5 — Reproduce (compare bundles)


Input: reproduction conditions and injection plan. Output: reproduced incident bundles with comparable schemas.

  • Rule: reproduction must emit the same evidence fields set (or a version-mapped superset).
  • Acceptance: “field-to-field” comparison supports or rejects the hypothesis.
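The "field-to-field" acceptance test can be sketched as a diff between the field bundle and the reproduced bundle, with an optional version map for renamed fields (the "version-mapped superset" rule). Field names in the example are hypothetical:

```python
def compare_bundles(field_bundle, repro_bundle, field_map=None):
    """Field-to-field comparison between a field incident bundle and a
    reproduced one, allowing a version map for renamed or moved fields.

    Both bundles are flat dicts of evidence fields. Returns per-field
    outcomes: 'match', 'differs', or 'missing_in_repro'.
    """
    field_map = field_map or {}
    outcome = {}
    for field, value in field_bundle.items():
        repro_field = field_map.get(field, field)  # version-mapped name
        if repro_field not in repro_bundle:
            outcome[field] = "missing_in_repro"    # schema gap: fix before comparing
        elif repro_bundle[repro_field] == value:
            outcome[field] = "match"
        else:
            outcome[field] = "differs"
    return outcome
```

A 'missing_in_repro' result blocks the hypothesis verdict: the reproduction setup must first emit the same evidence field set, exactly as the rule above demands.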

Step 6 — Update (thresholds, rules, fields)


Input: validated root cause and evidence gaps. Output: versioned updates with backward compatibility.

  • Update classes: thresholds, decision rules, capture density, and schema fields.
  • Acceptance: mixed fleet (N and N-1) remains diagnosable (time align + domain narrow still works).
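The mixed-fleet acceptance rule implies a resolution step: an N-1 node must be able to pick a threshold table it understands rather than fail on unknown fields. A minimal sketch, assuming threshold tables are keyed by schema_version (the table contents are illustrative):

```python
def resolve_thresholds(tables, node_schema_version):
    """Pick the newest threshold table a node can interpret.

    `tables` maps schema_version -> thresholds dict. A mixed fleet stays
    diagnosable because an N-1 node falls back to the highest version at
    or below its own instead of rejecting the whole update.
    """
    usable = [v for v in tables if v <= node_schema_version]
    if not usable:
        raise ValueError("no compatible threshold table for this node")
    return tables[max(usable)]
```

This keeps the Step 3 machinery working fleet-wide: both N and N-1 nodes emit comparable fields, just at different schema versions.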

Step 7 — Prove & Monitor (window stats)


Input: post-deploy trend windows. Output: measurable improvement backed by windowed statistics.

  • Examples: reduced flap rate, fewer brownouts, lower mismatch duration, improved time-quality ratio.
  • Acceptance: improvement is demonstrated over time windows, not single samples.
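"Demonstrated over time windows, not single samples" can be sketched as window statistics plus a comparison rule. The nearest-rank p95 below is one conventional percentile definition; the metric choice is an assumption:

```python
import math

def window_stats(samples):
    """Summarize one trend window: mean, peak, and nearest-rank p95."""
    ordered = sorted(samples)
    rank = math.ceil(0.95 * len(ordered))  # nearest-rank percentile
    return {"mean": sum(samples) / len(samples),
            "peak": ordered[-1],
            "p95": ordered[rank - 1]}

def improved(before, after, metric="p95"):
    """An improvement claim compares whole windows, never single samples."""
    return window_stats(after)[metric] < window_stats(before)[metric]
```

Comparing p95 rather than the mean is deliberate: flap storms and brownout bursts live in the tail, and a mean can improve while the tail gets worse.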

Field add strategy (versioned + backward compatible)

  • Version everything: include schema_version and producer FW version on every record.
  • Additive changes first: add new fields without changing old semantics; allow “unknown” + valid flags.
  • Minimum disruption: add lightweight counters/quality bits before heavy snapshots; add indexing fields before payload fields.
  • Mixed-fleet acceptance: N-1 and N must still support time align and domain narrowing.
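The "unknown + valid flags" rule above can be sketched as a single accessor: every new field arrives with a paired validity flag, and an N-1 record that never emits the field still parses cleanly. The field names in the example are hypothetical:

```python
def read_field(record, field, default=None):
    """Backward-compatible field access for additive schema changes.

    New fields carry a paired '<field>_valid' flag; old producers simply
    never emit them. Returns (value, valid): a missing field yields
    (default, False), so an N-1 record stays parseable instead of
    failing schema checks.
    """
    if field in record and record.get(f"{field}_valid", True):
        return record[field], True
    return default, False
```

Consumers then branch on the validity bit, not on schema_version comparisons scattered through the code, which keeps old semantics untouched when fields are added.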

Predictive trend indicators (high ROI)

  • Degradation: temperature peaks, ECC corrected deltas, jitter percentiles.
  • Frequency: brownout_count_delta, watchdog_count_delta, flap_count_delta.
  • Evidence quality: ratio of degraded timestamp_quality, invalid I/O ratio.
  • Rule: store window statistics (rate/peak/p95), not only absolute values.
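The frequency indicators above (brownout_count_delta and friends) come from differencing absolute counters per window. A minimal sketch of that conversion, assuming counters are sampled once per trend window and only ever increase:

```python
def counter_deltas(snapshots, counters):
    """Turn absolute counters into per-window deltas.

    snapshots: list of dicts sampled once per window, each holding
    monotonically increasing counters (e.g. brownout_count). The delta
    series is what makes 'intermittent' a measurable frequency.
    """
    deltas = []
    for prev, cur in zip(snapshots, snapshots[1:]):
        deltas.append({f"{c}_delta": cur[c] - prev[c] for c in counters})
    return deltas
```

Storing the delta series alongside the absolutes means a reset-and-restart (counter back to zero) is also visible, because a negative delta is itself evidence of a missed reset report.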

Example material part numbers (MPNs) often used to enable this loop

  • Safety compute (examples): Infineon AURIX TC397, TI TMS570LC4357, NXP S32G274A (class varies by platform).
  • TSN / Ethernet switching (examples): NXP SJA1105 (TSN switch family).
  • PTP-capable PHY (examples): TI DP83640 (IEEE-1588 timestamping PHY).
  • Clock / jitter cleaning (examples): Skyworks Si5341 (jitter attenuating clock), Renesas 8A34001 (timing/clock generator family).
  • GNSS timing (examples): u-blox ZED-F9T (timing-grade GNSS module) for redundant time sources.
  • Isolation for field I/O and comms (examples): TI ISO7741 (digital isolator), TI ISO1050 (isolated CAN).
  • Power monitoring / telemetry (examples): TI INA226 (power monitor), Analog Devices LTC2991 (multi-rail monitor family).
  • Supervisors / reset monitoring (examples): TI TPS3839 (supervisor family) to make reset_cause explainable.
  • Non-volatile evidence / counters (examples): Microchip ATECC608B (secure element for device identity), Infineon OPTIGA TPM SLB 9670 (TPM option), Fujitsu MB85RS64V (FRAM for robust counters/markers).
Note: MPNs above are representative examples used in safety/industrial-grade designs. Final selection must match rail standards, temperature class, lifecycle, and approved vendor lists.
[Figure: Verification & Field Feedback Loop. Incident → Evidence → Reproduce → Update → Deploy → Monitor (repeat): COLLECT incident_bundle → ALIGN time axis → NARROW domain → REPRODUCE compare → UPDATE rules/fields → DEPLOY → MONITOR trend windows. Schema evolution: vN and vN-1 both remain valid. Evidence fields drive the updates: thresholds + decision rules + logging schema.]
Figure F11. A TCMS verification loop turns field incidents into reproducible evidence, then upgrades thresholds, decision rules, and logging fields with versioned compatibility—proving improvement through trend windows.


H2-12. FAQs (Evidence-Driven Accordion ×12)

Each answer follows a strict, testable pattern: a 1-sentence conclusion, 2 evidence checks, and 1 first fix. Evidence checks reference concrete fields so issues can be replayed on a single timeline.

Rule: Evidence must be loggable/exportable and time-alignable (e.g., timestamp_quality, reset_cause, rail_min, flap_count, mismatch_duration).
1) TCMS outputs wrong results without reboot — brownout edge or watchdog coverage gap?
Conclusion Prioritize a supply-edge event if rail minima or brownout counters drift near thresholds; otherwise suspect runtime monitoring that fails to detect stalled tasks.
  • Evidence #1 (power): correlate rail_min, brownout_count_delta, and any near-miss UV status around the symptom window. MPN examples: INA226, LTC2991, TPS3839.
  • Evidence #2 (liveness): check watchdog_count_delta, task-alive gaps, and whether operating_state changed (operate→degraded) when the wrong output occurred.
  • First Fix: raise capture density for “near-UV” and liveness events into the incident bundle (pre/post snapshot) and log reset_cause / “near-reset” reason codes even when no reboot occurs.
Maps to: H2-8 / H2-10
2) After switchover, recovery is slow — redundancy mode choice or logging/time-sync blockage?
Conclusion If network health remains within budget but switchover time inflates, suspect log or sync bottlenecks during the transition rather than redundancy mode selection alone.
  • Evidence #1 (network): examine loss_rate, latency_p95, and flap_count across the switchover window. MPN examples: SJA1105, DP83640.
  • Evidence #2 (logging/sync pressure): compare log queue depth / write-rate spikes and timestamp_quality degradation during switchover (event storm vs stable time base).
  • First Fix: enforce a “transition logging profile” (keep critical evidence fields, throttle non-critical trend writes) and record a dedicated switch_phase marker so delays can be attributed.
Maps to: H2-5 / H2-7
3) Remote I/O reports input jitter but the field signal is stable — cable EMI or debounce/threshold policy?
Conclusion If I/O diagnostics show good line integrity yet frequent short glitches appear, prioritize debounce/window policy before blaming cable interference.
  • Evidence #1 (I/O integrity): verify open/short indicators, input quality flags, and whether glitches correlate with known noise windows. MPN examples for isolation robustness: ISO7741, ISO1050.
  • Evidence #2 (decision stats): inspect mismatch_count, mismatch_duration, and “transient vs persistent” classification outcomes.
  • First Fix: switch from single-sample decisions to windowed statistics (count+duration) plus cooldown, and log the chosen classification and thresholds for audit.
Maps to: H2-4 / H2-9 / H2-11
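The first fix in FAQ 3 (windowed statistics plus cooldown instead of single-sample decisions) can be sketched as follows; the count and duration limits are hypothetical tuning values:

```python
def classify_input(glitches, count_limit, duration_limit_s):
    """Windowed transient-vs-persistent classification for a DI channel.

    glitches: list of (start_s, end_s) intervals observed inside one
    decision window. 'persistent' requires BOTH the count and the
    accumulated duration to exceed their limits; anything else is
    'transient'. Both numbers are returned so the decision is auditable.
    """
    count = len(glitches)
    total_s = sum(end - start for start, end in glitches)
    persistent = count >= count_limit and total_s >= duration_limit_s
    return {"verdict": "persistent" if persistent else "transient",
            "mismatch_count": count,
            "mismatch_duration_s": round(total_s, 6)}
```

Logging the counts and thresholds alongside the verdict is what makes the debounce policy itself reviewable during an audit, not just its outcome.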
4) Two channels disagree on the same input — isolation-side drift or sampling window misalignment?
Conclusion If disagreement clusters around edges and timing-critical moments, suspect sampling misalignment; if bias persists across steady states, suspect drift/offset on one side of the isolation boundary.
  • Evidence #1 (time alignment): compare channel-to-channel offset and timestamp_quality during mismatches. MPN examples: DP83640, Si5341.
  • Evidence #2 (value pattern): evaluate whether mismatch is edge-only (timing) or steady bias (drift), and record disagreeing_channel + confidence.
  • First Fix: tighten a shared sampling window (or align to a monotonic timestamp boundary) and publish a logged arbitration result (vote_result, confidence_level).
Maps to: H2-4 / H2-6
5) Network latency spikes cause control lag — TSN/PTP time-base issue or link flap?
Conclusion If time quality collapses (holdover/time-jump), prioritize sync issues; if time stays credible but link stability degrades, prioritize flap/congestion.
  • Evidence #1 (time quality): check sync_state, holdover flags, and timestamp_quality in the spike window. MPN examples: Si5341, ZED-F9T.
  • Evidence #2 (link stability): correlate flap_count, loss_rate, and queue/latency percentiles per segment. MPN examples: SJA1105.
  • First Fix: gate critical control timing on a “time-quality OK” condition and log time-quality transitions so TSN tuning isn’t performed blindly under broken time.
Maps to: H2-5 / H2-6
6) Log event order looks wrong — time jump or multi-domain timestamp granularity mismatch?
Conclusion If the wall-clock jumps or holdover toggles, prioritize time issues; otherwise fix ordering anchors (monotonic + record sequence) across domains.
  • Evidence #1 (time jump): verify time-jump markers, holdover events, and timestamp_quality transitions. MPN examples: Si5341, ZED-F9T.
  • Evidence #2 (ordering anchors): confirm record_seq, atomic commit markers, and consistent timestamp granularity across producers.
  • First Fix: enforce monotonic ordering as the primary sort key and add/verify record_seq + atomic-commit IDs for cross-domain replay.
Maps to: H2-6 / H2-7
7) Self-test passes, but runtime still deadlocks — monitoring coverage gap or fault-handling path bug?
Conclusion If liveness evidence is missing during the hang, suspect coverage gaps; if liveness triggers but decisions loop, suspect fault-handling rules entering an unstable path.
  • Evidence #1 (runtime coverage): check task-alive heartbeats, watchdog_count_delta, and whether the scheduler shows gaps before the lock. MPN examples for reset supervision: TPS3839.
  • Evidence #2 (decision trace): inspect decision_code, decision_outcome, and state transitions (operate↔degraded) for oscillation.
  • First Fix: add explicit “reason codes” for state transitions and apply cooldown/hysteresis on repeated decision loops, so deadlock becomes diagnosable rather than opaque.
Maps to: H2-8 / H2-9
8) Exported maintenance logs can’t replay — which minimum evidence set is missing?
Conclusion If logs lack schema/time identity and correlation anchors, they become archival rather than evidentiary and cannot support root-cause replay.
  • Evidence #1 (identity): verify presence of schema_version, producer FW version, and timestamp_quality.
  • Evidence #2 (anchors): verify correlation_id, record_seq, decision trace fields, and atomic commit markers. MPN examples for robust counters/markers: MB85RS64V.
  • First Fix: implement a versioned “minimum evidence checklist” and add backward-compatible fields (do not rename old fields; add valid flags and new IDs).
Maps to: H2-7 / H2-11
9) After a subsystem drops, TCMS degrades repeatedly — recovery gate too strict or fault classification wrong?
Conclusion If degrade triggers repeat with short intervals, classification/cooldown is likely wrong; if recovery never clears despite stable conditions, the recovery gate is too strict or poorly measured.
  • Evidence #1 (frequency): inspect degrade_count, dwell time in degraded state, and trigger intervals (anti-flap effectiveness).
  • Evidence #2 (gate failures): log “recovery gate failed reason” (e.g., stability window not met, retest fail) and correlate with network/I/O health summaries.
  • First Fix: add cooldown + windowed stability checks, and version the recovery gate criteria so improvements can be proven via trend windows post-deploy.
Maps to: H2-9 / H2-11
10) Power-up sometimes never reaches OPERATE — boot self-test sequencing or I/O loopback failure?
Conclusion If boot stalls at a consistent step, suspect test sequencing/timing; if failures correlate with specific I/O groups, suspect loopback/diagnostic failures across the isolation boundary.
  • Evidence #1 (boot step): check boot_step, boot_fail_reason, and state machine transitions (boot→ready→operate).
  • Evidence #2 (I/O diag): check loopback results, open/short flags, and per-channel quality bits. MPN examples: ISO7741 (digital isolation), ISO1050 (isolated CAN if used for diagnostics transport).
  • First Fix: add a deterministic boot-step reason code and a post-fail snapshot so intermittent boot failures become reproducible evidence rather than “won’t start.”
Maps to: H2-8 / H2-4
11) Remote I/O resets occasionally but the master doesn’t notice — missing link monitoring or reset-cause not reported?
Conclusion If the master’s link health shows no drop while remote resets, reset-cause reporting is missing; if link events exist without reset context, monitoring is present but incomplete.
  • Evidence #1 (remote reset context): verify remote reset_cause, reset counters, and any rail minima captured locally. MPN examples: TPS3839 (supervisor), INA226 (monitor).
  • Evidence #2 (master link health): correlate link_down/reconnect markers, flap_count, and heartbeats for missing intervals.
  • First Fix: implement a mandatory “remote reset report” packet after reconnect and store it inside the same incident bundle as master link events (correlation ID shared).
Maps to: H2-5 / H2-10
12) “Happened once, never repeats” — how to design triggers + snapshots so next time it is captured?
Conclusion One-off failures become capturable when triggers are layered (lightweight always-on + burst snapshot on weak signals) and when pre/post windows are guaranteed.
  • Evidence #1 (trigger coverage): verify whether triggers are single-threshold only, or include sequence anomalies (repeated retries, flap bursts, time-quality drops) and manual triggers.
  • Evidence #2 (snapshot depth): verify that incident bundles include pre-trigger context and post-trigger aftermath with record_seq continuity and atomic commit markers. MPN examples for robust markers/counters: MB85RS64V; for device identity/integrity: ATECC608B, SLB 9670.
  • First Fix: introduce tiered triggering: always-on counters (cheap) + burst snapshots when weak indicators accumulate, then validate capture-rate improvement via trend windows after deployment.
Maps to: H2-7 / H2-11
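The tiered triggering described in FAQ 12 can be sketched as a cheap scoring pass over always-on counters: each weak indicator over its limit scores one point, and a burst snapshot fires only when points accumulate. Counter names and limits are hypothetical:

```python
def tiered_trigger(counters, weak_limits, accumulate_limit):
    """Tiered triggering: always-on counters stay cheap, and a burst
    snapshot fires only when weak indicators accumulate.

    counters: current window deltas, e.g. retry_delta, flap_delta,
    tq_drop_delta (hypothetical names). Returns the decision plus the
    list of tripped indicators so the trigger itself is auditable.
    """
    tripped = sorted(name for name, limit in weak_limits.items()
                     if counters.get(name, 0) >= limit)
    return {"snapshot": len(tripped) >= accumulate_limit,
            "score": len(tripped),
            "tripped": tripped}
```

Because the tripped indicators are recorded with the snapshot, the next "happened once" incident arrives with its own explanation of why capture fired.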