Train Control & Monitoring System (TCMS)
TCMS is the train’s control-and-evidence backbone: it coordinates vehicle functions through safety compute, isolated I/O, and redundant networks while producing time-aligned logs that make faults explainable and auditable. The core goal is to turn every incident into defensible evidence—so decisions, degraded modes, and field fixes can be verified, replayed, and improved over time.
H2-1. TCMS in the Rail Architecture
What TCMS must guarantee
- Command (control authority): who can issue an action, how it is accepted, executed, and confirmed.
- State (single source of truth): how distributed measurements become one validated vehicle state with freshness/quality.
- Evidence (replayable history): how events are time-aligned, signed/verified, and exported for maintenance and audits.
Typical topology (why it exists)
- Vehicle controller: arbitration + decision + policy + aggregation.
- Remote I/O modules: distributed sensing/actuation with isolation across long cables and ground offsets.
- Gateways: bridge legacy buses (e.g., MVB/WTB) and Ethernet/TSN domains without losing determinism.
- Recorder/maintenance access: ensures “one-time faults” are still captured with the right context.
H2-2. Functional Partitioning: Control, Monitoring, Safety, and Evidence
Control plane
- Boundary: command generation, arbitration, execution confirmation.
- Typical failures: duplicate/late commands, missing ACK, wrong target, inconsistent action across cars.
- Evidence fields: command_id, source, target, accept/reject code, ACK latency, retry count.
- First check: command_id → ACK → latency triad to confirm whether the action loop is intact.
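The "command_id → ACK → latency" first check can be sketched as a small filter over command records; the record fields (`issue_ts`, `ack_ts`) and the 50 ms budget are illustrative assumptions, not values defined in this document.

```python
# Hypothetical sketch of the command-loop first check: flag commands whose
# ACK is missing or arrives past the latency budget.
def check_command_loop(records, ack_budget_s=0.050):
    """Return (command_id, reason) pairs for broken action loops."""
    suspect = []
    for rec in records:
        ack_ts = rec.get("ack_ts")
        if ack_ts is None:                               # missing ACK
            suspect.append((rec["command_id"], "no_ack"))
        elif ack_ts - rec["issue_ts"] > ack_budget_s:    # late ACK
            suspect.append((rec["command_id"], "late_ack"))
    return suspect

records = [
    {"command_id": 1, "issue_ts": 0.000, "ack_ts": 0.012},
    {"command_id": 2, "issue_ts": 0.100, "ack_ts": None},
    {"command_id": 3, "issue_ts": 0.200, "ack_ts": 0.400},
]
print(check_command_loop(records))  # [(2, 'no_ack'), (3, 'late_ack')]
```

The output immediately tells an investigator whether the action loop is intact before any deeper root-cause work starts.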
Monitoring plane
- Boundary: acquisition, state fusion, state freshness/quality, alarms.
- Typical failures: false alarms, stale values, contradictory sources, wrong root cause labeling.
- Evidence fields: timestamp, validity flag, stale_age, sensor_quality, counter deltas.
- First check: validity + stale_age before trusting any displayed status.
Safety plane
- Boundary: interlocks, emergency policies, voting/consistency, diagnostic triggers.
- Typical failures: false trip, missed trip, inconsistent degrade decisions, unsafe recovery.
- Evidence fields: safety_state, trip_cause, vote_result, diag_coverage hint, recovery_condition.
- First check: trip_cause + vote_result to verify the decision is evidence-backed.
Maintenance plane
- Boundary: diagnostics, log export, versioning, configuration control, updates.
- Typical failures: missing context, untraceable versions, configuration drift, update-induced outages.
- Evidence fields: fw_version, config_hash, export status, audit trail, update result code.
- First check: fw_version + config_hash to prevent “mixed baselines” in field debugging.
H2-3. Compute Core: Safety MCU/SoC Architecture Choices
The TCMS compute core is judged by its ability to detect and contain silent failures (wrong results without a crash), while preserving traceable evidence for every mismatch, correction, and recovery action.
Redundancy options (trade-offs + evidence)
- Lockstep: detects instruction-level divergence; requires mismatch flags and defined safe reactions.
- Dual-channel: independent computation paths; requires voting rules and correlated evidence to avoid false decisions.
- Heterogeneous redundancy: different cores/implementations; improves common-cause resistance; demands interface-level consistency checks.
- Verification focus: every strategy must output a decision record (what diverged, when, and how it was handled).
Memory & data integrity (from bits to evidence)
- ECC: corrected/uncorrectable counters turn random corruption into explainable events.
- E2E protection: CRC + sequence counters detect corruption, loss, duplication, and reordering across modules.
- State validity: each critical state carries validity, age, and origin to prevent stale truth.
- Evidence fields: corrected_count, uncorrectable_count, crc_fail_count, seq_gap_count, last_pass_ts.
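A minimal sketch of E2E protection as described above: a CRC32 plus an 8-bit sequence counter wrapped around a payload. The framing is an assumption for illustration, not a standardized E2E profile.

```python
import struct
import zlib

# Illustrative E2E wrapper: [seq byte][payload][CRC32 of seq+payload].
def e2e_wrap(payload: bytes, seq: int) -> bytes:
    frame = struct.pack("B", seq & 0xFF) + payload
    return frame + struct.pack(">I", zlib.crc32(frame))

def e2e_check(frame: bytes, expected_seq: int) -> str:
    body, crc = frame[:-4], struct.unpack(">I", frame[-4:])[0]
    if zlib.crc32(body) != crc:
        return "crc_fail"                 # corruption
    if body[0] != expected_seq & 0xFF:
        return "seq_gap"                  # loss, duplication, or reordering
    return "ok"

f = e2e_wrap(b"speed=42", seq=7)
print(e2e_check(f, expected_seq=7))   # ok
print(e2e_check(f, expected_seq=8))   # seq_gap
```

A real deployment would feed `crc_fail` and `seq_gap` outcomes into the `crc_fail_count` and `seq_gap_count` counters listed above.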
How to prove it “did not compute wrong”
- Boot-time self-test: CPU/RAM/flash checks with itemized results and durations.
- Periodic tests: scheduled coverage refresh for safety mechanisms and critical paths.
- Runtime monitors: watchdogs, deadline budgets, lockstep mismatch, ECC rate alarms.
- Mandatory outcome: each detection must map to a defensible action (degrade, reset, isolate, alert).
Engineering acceptance checklist
- Can the system separate crash vs silent wrong vs degradation?
- Does each safety decision generate a time-aligned record (trigger → evidence → action → recovery)?
- Are integrity counters trendable (rate-of-change), not only instantaneous flags?
- Is there a defined policy for uncorrectable events and repeated corrections?
H2-4. Isolated I/O Modules: DI/DO/AI/AO and the Isolation Boundary
In TCMS, many field failures originate at the I/O boundary: long harnesses, ground offsets, transients, and EMI can turn inputs/outputs into untrustworthy signals unless isolation, protection, and diagnostics are designed as a single chain.
I/O types & field failure signatures
- DI jitter: bounce/noise creates false transitions; verify debounce rejects and pulse-width stats.
- DO stuck: command differs from feedback; check overcurrent trips and stuck-detect flags.
- AI drift: slow bias shift or saturation; correlate with common-mode events and temperature.
- AO open-load: actuator missing or wiring open; verify compliance flags and open-load detect.
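The DI-jitter signature lends itself to a small debounce sketch that also keeps the trendable counters (`bounce_count`, `min_pulse_width`) named later in this chapter; the edge format and the 5 ms minimum width are assumptions.

```python
# Illustrative DI debounce: reject pulses shorter than a minimum width and
# keep the stats that make the rejection auditable.
def debounce(edges, min_width_ms=5.0):
    """edges: list of (timestamp_ms, level). Returns (accepted_edges, stats)."""
    accepted, bounce_count, min_pw = [], 0, None
    last_ts = None
    for ts, level in edges:
        if last_ts is not None:
            width = ts - last_ts
            min_pw = width if min_pw is None else min(min_pw, width)
            if width < min_width_ms:
                bounce_count += 1      # too-short pulse: count it, reject it
                last_ts = ts
                continue
        accepted.append((ts, level))
        last_ts = ts
    return accepted, {"bounce_count": bounce_count, "min_pulse_width": min_pw}

accepted, stats = debounce([(0, 1), (2, 0), (3, 1), (50, 0)])
print(accepted, stats)  # [(0, 1), (50, 0)] {'bounce_count': 2, 'min_pulse_width': 1}
```

Keeping the rejected-pulse statistics is what later lets a maintainer separate cable EMI from a mis-tuned debounce policy.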
Isolation strategy (define the boundary)
- Signal isolation: prevents ground-offset coupling into logic-domain thresholds and ADC references.
- Isolated power: keeps AFE supply stable under harness transients and ground shifts.
- Common-mode suppression: reduces CM energy that otherwise leaks through parasitics.
- Protection chain: containment path for surge/ESD before it reaches AFE and isolator.
Evidence fields (make faults locatable)
- Open/short detect: open_load, short_to_batt, short_to_gnd, overcurrent_trip_count.
- Cable diagnostics: bounce_count, min_pulse_width, intermittent_count, cm_event_count.
- Consistency checks: redundant input mismatch flags and voting outcomes.
- Quality tagging: validity flag, stale age, saturation flag, out-of-range markers.
Engineering acceptance checklist
- Can a failure be located to cable side vs isolation side vs logic side?
- Does each channel expose trendable counters (not only a one-shot alarm)?
- Are fault flags time-aligned with supply and network health indicators?
- Does the design define the safe default state for DO/AO when diagnostics fail?
H2-5. Redundant Communications: Vehicle Networks and Gateway Strategy
Redundant communications must be engineered as a verifiable system: measurable link health, defined switching triggers, bounded switchover time budgets, and evidence logs that explain every failover decision.
Common roles on vehicle networks
- TCMS master: command arbitration and consistency across cars/domains.
- Remote I/O nodes: field-state acquisition with diagnostics counters and quality flags.
- Gateways: determinism boundary between legacy bus segments and Ethernet/TSN domains.
- HMI: operator loop for commands, alarms, and confirmation paths.
- Recorder: evidence sink for network events, state snapshots, and recovery actions.
Redundancy modes (conceptual)
- Cold-standby: backup is inactive; switchover relies on re-acquiring state and authority.
- Hot-standby: backup syncs critical state; only one side publishes authoritative outputs.
- Dual-active: both are active; requires arbitration to prevent dual-master conflicts.
- Evidence requirement: each mode must log who was authoritative and why it changed.
Fault criteria (measurable triggers)
- Loss rate: packet drop percentage within a defined time window.
- Latency & jitter: budget exceedance and variance expansion (control-impact signature).
- Link flap: repeated up/down transitions and route instability.
- Switch triggers: graded states (degraded → failed) with hysteresis to avoid flapping.
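The graded states with hysteresis can be sketched as a small state function; the loss-rate thresholds here are illustrative, not recommended budgets.

```python
# Illustrative graded link health (ok -> degraded -> failed) with a hysteresis
# gap (recover_at < degrade_at) so the state does not flap near a threshold.
def grade_link(state, loss_rate, degrade_at=0.02, fail_at=0.10, recover_at=0.005):
    if state == "failed":
        return "degraded" if loss_rate < fail_at else "failed"
    if loss_rate >= fail_at:
        return "failed"
    if state == "degraded":
        return "ok" if loss_rate < recover_at else "degraded"
    return "degraded" if loss_rate >= degrade_at else "ok"

state = "ok"
for sample in [0.01, 0.03, 0.015, 0.004, 0.12]:
    state = grade_link(state, sample)
print(state)  # prints "failed"
```

Note how the 0.015 sample keeps the link in `degraded` rather than bouncing back to `ok`: that gap between `degrade_at` and `recover_at` is the anti-flap hysteresis.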
Degrade strategy (minimum functionality)
- Maintain command safety: preserve an unambiguous command authority path.
- Maintain essential state: keep validated status for the minimum set of safety-relevant signals.
- Protect evidence: ensure failover decisions and link-health context are time-stamped and exportable.
- Recovery conditions: define when normal mode may resume (stability window + counters).
| Mode | Trigger (health criteria + window) | Switch budget | State handover rule | Evidence to log |
|---|---|---|---|---|
| Cold-standby | link down / sustained loss; confirmed by flap counter + windowed stats | longer; requires re-acquire and re-validate state | rebuild baseline; verify freshness + validity before authority | trigger stats, last-known authority, re-sync duration, missed data window |
| Hot-standby | degraded → failed threshold crossed; hysteresis to avoid oscillation | bounded; backup already synced | last common state ID (seq counter) + integrity check (CRC) | authority change, seq gap, sync latency, decision code, outcome |
| Dual-active | arbitration conflict / inconsistent link view / split-brain indicators | tight; must prevent dual-master outputs | token/vote result; deterministic tie-break + rollback rule | vote/arbitration result, conflict count, resolved leader, time-to-stable |
H2-6. Time Base & Event Correlation: PTP/Local Clock/Monotonic Logging
Event correlation is where TCMS becomes reconstructable. A unified time base ensures I/O edges, network degradations, fault codes, and power-health signals can be assembled into a defensible sequence of cause-and-effect.
Time sources (roles)
- PTP: cross-domain alignment for networked modules and gateways.
- RTC: human-readable wall-clock context across long downtime windows.
- Monotonic counter: non-decreasing ordering anchor that prevents “time jump” reordering.
- Rule: incident logs must include monotonic time even when wall-clock is invalid.
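The dual-stamping rule above can be sketched with Python's two clocks: the monotonic clock is the ordering anchor, and the wall clock is context only and may be invalid. Field names follow this document's evidence checklist; the helper itself is hypothetical.

```python
import time

# Illustrative event stamping: monotonic time is always present; wall-clock
# time may be absent, and its absence degrades timestamp_quality.
def stamp_event(event_type, wall_clock_valid=True):
    return {
        "event_type": event_type,
        "event_ts_mono": time.monotonic(),   # never steps backwards
        "event_ts_wall": time.time() if wall_clock_valid else None,
        "timestamp_quality": "good" if wall_clock_valid else "degraded",
    }

a = stamp_event("fault_assert")
time.sleep(0.001)                            # ensure distinct stamps for the demo
b = stamp_event("fault_clear", wall_clock_valid=False)
# Ordering always uses the monotonic field, even when the wall clock is invalid:
ordered = sorted([b, a], key=lambda e: e["event_ts_mono"])
print([e["event_type"] for e in ordered])    # ['fault_assert', 'fault_clear']
```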
What must be correlated
- I/O events: edges, output feedback, open/short and quality flags.
- Network events: loss/jitter spikes, route changes, failover transitions.
- Fault codes: assert/clear with context snapshots.
- Power health: rail minima, brownout markers, reset causes and counters.
Common pitfalls (detectable)
- Time step / jump: wall-clock changes suddenly; detect via monotonic vs wall-clock delta.
- Holdover drift: offset trends while external sync is lost.
- Mixed granularity: hardware vs software timestamp points create inconsistent skew.
- Action: downgrade timestamp_quality and preserve ordering with monotonic time.
Mandatory evidence fields
- sync_state: locked / holdover / free-run / invalid
- offset: relative deviation to the selected master
- holdover_time: duration since last valid sync
- timestamp_quality: good / degraded / bad
- ts_source: HW timestamp / SW timestamp / derived
H2-7. Logging & Black-Box Evidence: What to Record and Why
Logging becomes defensible only when it is treated as an evidence product: it must reconstruct what happened, why it happened, what the system decided, and whether recovery was successful—using time-aligned, integrity-protected records.
Three-layer logging model
- Event: high-density triggers (failover, threshold crossing, fault assert/clear).
- Snapshot: the system’s complete state around an incident (pre/post window).
- Trend: counters and statistics that reveal slow degradation (rates and deltas).
- Rule: events must reference the snapshot and the relevant trend slice via correlation IDs.
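The correlation rule above can be sketched as a factory that stamps the same `correlation_id` on all three layers; the record shapes are illustrative, not a schema defined by this document.

```python
import itertools

# One correlation_id links event <-> snapshot <-> trend slice for replay.
_corr = itertools.count(1)

def make_incident(trigger, state, trend_slice):
    cid = next(_corr)
    event = {"correlation_id": cid, "event_type": trigger}
    snapshot = {"correlation_id": cid, "operating_state": state}
    trend = {"correlation_id": cid, "window": trend_slice}
    return event, snapshot, trend

e, s, t = make_incident("failover", "degraded", {"loss_rate_avg": 0.04})
print(e["correlation_id"] == s["correlation_id"] == t["correlation_id"])  # True
```

With this shape, an offline tool can join all three layers on `correlation_id` without guessing from timestamps.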
Trigger mechanisms (incident bundles)
- Threshold triggers: power/thermal/offset budget exceedance.
- Integrity triggers: CRC fail, sequence gaps, invalid state transitions.
- Network triggers: loss/jitter spikes, flap, route changes, failover.
- Manual triggers: operator/maintenance request with reason code.
- Bundle: capture pre/post windows and include timestamp_quality and sync_state.
Integrity & power-loss robustness
- Record integrity: record CRC/signature + monotonic sequence number.
- Atomic commit: avoid half-written records that corrupt evidence.
- Power-loss marker: mark incomplete commits and preserve last critical bundle.
- Audit value: logs remain explainable even when a crash happens mid-write.
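The atomic-commit and power-loss-marker ideas can be sketched as a record format in which a half-written tail is detectable on replay; the pipe-delimited framing and field names are purely illustrative.

```python
import json
import zlib

# Illustrative log record: JSON body with record_seq, then "|", CRC32 hex,
# then newline. A record counts as committed only if its CRC trailer verifies.
def encode_record(seq, payload):
    body = json.dumps({"record_seq": seq, **payload}).encode()
    return body + b"|" + format(zlib.crc32(body), "08x").encode() + b"\n"

def replay(log_bytes):
    """Return (valid_records, incomplete_count); the latter acts as a
    power_fail_marker for records interrupted mid-write."""
    good, incomplete = [], 0
    for line in log_bytes.split(b"\n"):
        if not line:
            continue
        body, _, crc = line.rpartition(b"|")
        if body and crc and int(crc, 16) == zlib.crc32(body):
            good.append(json.loads(body))
        else:
            incomplete += 1
    return good, incomplete

log = encode_record(1, {"event": "trip"}) + encode_record(2, {"event": "clear"})
log += b'{"record_seq": 3'          # crash mid-write: no CRC trailer, no newline
good, incomplete = replay(log)
print(len(good), incomplete)  # 2 1
```

The half-written third record is flagged rather than silently dropped, which is exactly the audit property the checklist below asks for.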
Engineering acceptance checklist
- Can an incident answer: trigger → context → decision → outcome?
- Do events include time quality and a correlation_id?
- Are counters trendable (deltas/rates), not only absolute values?
- Can evidence survive a one-time crash via atomic commits and markers?
Evidence field checklist (Ctrl+F friendly)
LOG-CHECKLIST:EVENT
- event_id — unique incident record identifier
- event_type — threshold / integrity / network / manual
- event_ts_mono — monotonic timestamp for ordering
- event_ts_wall — wall-clock timestamp (optional but recommended)
- timestamp_quality — good / degraded / bad
- source_module — gateway / io / compute
- severity — info / warn / fault
- decision_code — degrade / switch / reset / isolate
- decision_outcome — success / failed / partial
- correlation_id — links event ↔ snapshot ↔ trend slice
Snapshot fields
- snapshot_id — snapshot record identifier
- operating_state — ready / operate / degraded / fault
- authority_owner — current authority (A / B / peer)
- network_health_summary — loss/jitter/flap summary
- power_health_summary — rail minima + reset causes + counters
- io_quality_summary — invalid/saturation/open-short summary
- active_faults_list — asserted fault list
- last_transition_reason — why the last state changed
Trend fields
- trend_window_start / trend_window_end — statistics window
- loss_rate_avg / loss_rate_peak — loss stats
- jitter_p95 — jitter percentile indicator
- crc_fail_rate / seq_gap_rate — integrity rates
- watchdog_count_delta — watchdog delta since last window
- ecc_corrected_delta / ecc_uncorrectable_delta — ECC deltas
- flap_count_delta — link flap delta
- holdover_time — time sync holdover duration
Integrity fields
- record_crc — record-level checksum
- record_seq — record sequence counter
- write_result — commit result code
- atomic_commit_id — atomic transaction identifier
- power_fail_marker — incomplete-write marker
H2-8. Self-Test & Diagnostics Coverage: Boot-Time vs Run-Time
Self-test is the mechanism that turns safety claims into measurable coverage. It must define when tests run, what fault classes they detect, what evidence is produced, and what the system does when checks fail.
Boot-time self-test (POST/BIST)
- Compute & memory: CPU/RAM/flash checks with itemized results and durations.
- Test vectors: versioned vectors to prevent “test mismatch” after software updates.
- I/O path loopback: verify acquisition → processing → output feedback chain.
- Outcome: pass → ready; fail → controlled degraded/fault entry with evidence logs.
Run-time monitoring (continuous)
- Liveness & deadlines: watchdog, task alive, deadline miss counters.
- Comms health: loss/jitter/flap, CRC fail, sequence gaps.
- Integrity: ECC trends, E2E CRC/seq consistency, mismatch counts.
- Consistency: redundant inputs or channels mismatch and voting outcomes.
Diagnostics coverage mindset
- Start from fault classes: define what must be detectable (silent wrong, intermittent, degradation).
- Map detection → evidence: each detection emits fields that explain “why” and “where”.
- Define reaction semantics: degrade vs fault entry is evidence-driven, not arbitrary.
- Recovery conditions: stability windows and retest rules gate a return to normal.
Failure handling (strategy semantics)
- Fail-safe: stop high-risk outputs while preserving alarms and evidence flow.
- Limp-home: keep a minimum function set with degraded quality flags and strict limits.
- Evidence-first: every transition must record trigger, checks performed, and outcome.
- Non-negotiable: do not restore normal mode until monitors confirm stability.
H2-9. Fault Handling & Degraded Operation: Decision Logic You Can Defend
Fault handling becomes defensible when it is written as auditable, testable rules: clear fault classes, explicit voting and arbitration semantics, evidence-backed transitions into degraded operation, and objective recovery gates.
Fault classification (policy router)
- Transient vs persistent: windowed spikes vs sustained violations drive action strength and recovery barriers.
- Single-point vs multi-point: local anomalies vs cross-domain correlation (systemic risk signature).
- Recoverable vs non-recoverable: automatic restore requires stability windows and retest pass; otherwise manual clear.
- Rule: classification dictates what evidence must be collected and which actions are allowed.
Voting & consistency (mismatch handling)
- Mismatch semantics: quantify disagreement by count + duration, not by a single sample.
- Arbitration outputs: vote_result and confidence_level must be logged.
- Anti-flap: hysteresis + cooldown prevents oscillation near thresholds.
- Isolation scope: isolate the faulty channel while preserving stable authority paths where possible.
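The count + duration mismatch semantics above can be sketched as a sliding-window monitor; the window size and limits are illustrative assumptions, not recommended values.

```python
from collections import deque

# Illustrative mismatch monitor: a channel fault is declared only when
# disagreement accumulates in both count and duration, never from one sample.
class MismatchMonitor:
    def __init__(self, window_s=1.0, count_limit=5, duration_limit_s=0.5):
        self.samples = deque()            # (ts, mismatched) pairs
        self.window_s = window_s
        self.count_limit = count_limit
        self.duration_limit_s = duration_limit_s

    def update(self, ts, a, b):
        self.samples.append((ts, a != b))
        while self.samples and ts - self.samples[0][0] > self.window_s:
            self.samples.popleft()        # slide the window
        hits = [t for t, m in self.samples if m]
        count = len(hits)
        duration = hits[-1] - hits[0] if count > 1 else 0.0
        if count >= self.count_limit and duration >= self.duration_limit_s:
            return {"vote_result": "channel_fault", "mismatch_count": count}
        return {"vote_result": "consistent", "mismatch_count": count}

m = MismatchMonitor()
result = None
for i in range(6):                        # six mismatched samples over ~0.5 s
    result = m.update(0.1 * i, a=1, b=0)
print(result["vote_result"])  # channel_fault
```

Both `vote_result` and `mismatch_count` are emitted on every update, so the decision trace required above comes for free.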
Evidence requirements (every transition)
- Trigger context: window statistics and source module ID.
- Decision trace: reason code, classification, and arbitration summary.
- Action trace: degrade/switch/isolate/limit plus result status.
- Recovery gate: stability window, retest status, and explicit restore decision.
Degraded operation (strategy semantics)
- Degraded: keep a minimum validated control/monitor set with quality flags.
- Fault: stop high-risk outputs; preserve alarms, evidence, and maintenance access.
- Restore: never automatic without objective stability + retest criteria.
- Audit value: transitions must be replayable using correlated evidence bundles.
IF/THEN Rule Block A — Network degradation escalation
IF network loss/jitter exceeds the defined budget for a sustained window (with flap count rising), THEN classify as persistent network degradation, DO switch authority path and limit non-essential traffic, EXIT only after a stability window + retest pass confirms healthy metrics.
- Record: event_id, event_type, network_health_summary, decision_code, decision_outcome, recovery_gate
IF/THEN Rule Block B — Redundant input mismatch arbitration
IF redundant channels disagree beyond mismatch thresholds (count + duration) and integrity checks are failing, THEN classify as single-point channel fault unless cross-domain correlation indicates multi-point risk, DO isolate the disagreeing channel and publish the arbitration result with confidence, EXIT only after cooldown + retest validates consistency over a stability window.
- Record: mismatch_count, mismatch_duration, disagreeing_channel, vote_result, confidence_level
IF/THEN Rule Block C — Multi-point anomaly escalation
IF multiple domains degrade simultaneously (I/O quality drops + network instability + time-quality degrades), THEN classify as multi-point/systemic risk, DO enter minimum validated function set and increase evidence capture density, EXIT only after the root cause indicators clear and post-checks pass.
- Record: timestamp_quality, sync_state, io_quality_summary, network_health_summary, operating_state
IF/THEN Rule Block D — Restore to normal mode
IF health metrics remain within budget for the required stability window and all retests pass, THEN classify as recoverable, DO restore normal mode with a logged authority confirmation, EXIT by recording the restore event and a post-restore snapshot.
- Record: recovery_gate, decision_code, authority_owner, snapshot_id, decision_outcome
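Rule Block D's EXIT condition can be sketched as a recovery-gate function over health samples and retest results; the field names and the 30 s stability window are assumptions for illustration.

```python
# Illustrative recovery gate: restore is allowed only if every retest passed
# AND the metrics have stayed within budget for the full stability window.
def recovery_gate(health_samples, retest_results, stability_window_s=30.0):
    """health_samples: list of (ts, within_budget). Returns (ok, reason)."""
    if not all(retest_results.values()):
        failed = [k for k, v in retest_results.items() if not v]
        return False, f"retest_failed:{failed}"
    # Stability counts from the most recent out-of-budget sample.
    last_bad = max((ts for ts, ok in health_samples if not ok), default=None)
    last_ts = health_samples[-1][0]
    start = last_bad if last_bad is not None else health_samples[0][0]
    stable_for = last_ts - start
    if stable_for < stability_window_s:
        return False, f"stability_window_not_met:{stable_for:.0f}s"
    return True, "restore_allowed"

samples = [(0, False), (5, True), (20, True), (40, True)]
print(recovery_gate(samples, {"io_loopback": True, "crc_selftest": True}))
```

Because the function returns an explicit reason string, the gate outcome itself becomes a loggable evidence field rather than a silent decision.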
H2-10. Power Integrity & Resilience for TCMS Nodes
TCMS resilience depends on preventing “silent wrong computation” during supply disturbances and on producing evidence that explains resets, degraded behavior, and write-protection outcomes—without expanding into traction or auxiliary power topologies.
Brownout & reset behavior (closed loop)
- Brownout policy: define UV threshold + hysteresis + response budget.
- Reset cause: distinguish brownout vs watchdog vs software reset using reason codes.
- Write protection: enforce atomic commits and markers during power instability windows.
- Audit value: a reset must be explainable using rail_min and reset_cause context.
Power health monitoring strategy
- PG (power-good): gating “allowed to operate” conditions.
- Rail monitor/ADC: capture rail minima/maxima and windowed statistics.
- Sampling policy: measure during critical phases (boot, mode switch, write commit).
- Trendability: store brownout_count and rail_min as deltas over windows.
Core evidence fields (correlatable)
- reset_cause — brownout / watchdog / software
- rail_min / rail_max — pre/post incident extrema
- brownout_count — trendable frequency indicator
- watchdog_count_delta — liveness consequence indicator
- timestamp_quality — time credibility during the incident
Make resets explainable
- Event linking: power events must create incident bundles (event + snapshot + trend slice).
- State linkage: include operating_state transition reason and post-check results.
- Write linkage: include power_fail_marker and atomic_commit_id.
- Outcome: logs prove whether a reset was an expected protection action or an uncontrolled crash.
Rail minima (rail_min) and reset_cause are root-cause anchors; without them, a “random reset” remains non-actionable.
H2-11. Verification & Field Feedback Loop for TCMS
A TCMS improves fastest when field incidents are converted into repeatable evidence packages, aligned on a single time base, narrowed to a responsible domain, reproduced under controlled conditions, and then used to update thresholds, decision rules, and logging fields with versioned compatibility.
What this loop produces
- Audit-ready incident bundles: event + snapshot + trend slice linked by correlation_id.
- Defensible root-cause claims: hypotheses that can be proven or falsified by evidence fields.
- Upgradeable schemas: new fields added with schema_version and backward compatibility rules.
- Predictive maintenance assets: trend indicators that move “intermittent” into measurable frequency and degradation.
Evidence prerequisites (links to earlier chapters)
- Time credibility: timestamp_quality, sync_state, holdover markers (from time-base chapter).
- Incident bundle structure: event/snapshot/trend and integrity markers (from logging chapter).
- State transitions: operate/degraded/fault transitions with reasons (from self-test/diagnostics chapter).
- Decision trace: degrade/switch/isolate with recorded triggers and recovery gates (from fault-handling chapter).
Step 1 — Collect (incident package)
- Input: incident bundle and raw records.
- Output: a single “incident package” with integrity checks and versions.
- Must include: schema_version, record_seq, record_crc, timestamp_quality.
- Acceptance: package integrity can be validated offline with deterministic results.
Step 2 — Time Align (single timeline)
- Input: PTP/RTC/monotonic stamps.
- Output: a unified timeline (monotonic primary, wall-clock secondary).
- Must include: sync_state, holdover markers, time-jump detection.
- Acceptance: cross-module events can be ordered without ambiguity.
Step 3 — Domain Narrow (responsibility)
- Input: network/I/O/integrity/power evidence summaries.
- Output: a responsible domain with supporting fields.
- Examples: network_health_summary, io_quality_summary, reset_cause, mismatch counters.
- Acceptance: domain choice must cite evidence fields, not inference.
Step 4 — Hypothesize (falsifiable)
- Input: aligned timeline + domain.
- Output: IF/THEN hypotheses mapped to expected field changes.
- Rule: each hypothesis must be falsifiable by a defined set of fields.
- Acceptance: expected observations are written as field deltas over time windows.
Step 5 — Reproduce (compare bundles)
- Input: reproduction conditions and injection plan.
- Output: reproduced incident bundles with comparable schemas.
- Rule: reproduction must emit the same evidence fields set (or a version-mapped superset).
- Acceptance: “field-to-field” comparison supports or rejects the hypothesis.
Step 6 — Update (thresholds, rules, fields)
- Input: validated root cause and evidence gaps.
- Output: versioned updates with backward compatibility.
- Update classes: thresholds, decision rules, capture density, and schema fields.
- Acceptance: mixed fleet (N and N-1) remains diagnosable (time align + domain narrow still works).
Step 7 — Prove & Monitor (window stats)
- Input: post-deploy trend windows.
- Output: measurable improvement backed by windowed statistics.
- Examples: reduced flap rate, fewer brownouts, lower mismatch duration, improved time-quality ratio.
- Acceptance: improvement is demonstrated over time windows, not single samples.
Field add strategy (versioned + backward compatible)
- Version everything: include schema_version and producer FW version on every record.
- Additive changes first: add new fields without changing old semantics; allow “unknown” + valid flags.
- Minimum disruption: add lightweight counters/quality bits before heavy snapshots; add indexing fields before payload fields.
- Mixed-fleet acceptance: N-1 and N must still support time align and domain narrowing.
Predictive trend indicators (high ROI)
- Degradation: temperature peaks, ECC corrected deltas, jitter percentiles.
- Frequency: brownout_count_delta, watchdog_count_delta, flap_count_delta.
- Evidence quality: ratio of degraded timestamp_quality, invalid I/O ratio.
- Rule: store window statistics (rate/peak/p95), not only absolute values.
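The window-statistics rule can be sketched directly; the nearest-rank p95 convention used here is one common choice, not something this document mandates.

```python
import math

# Illustrative per-window statistics: rate, peak, and p95 instead of a single
# absolute value, matching the trend fields used in this chapter.
def window_stats(samples, window_s):
    """samples: measurements (e.g., jitter in ms) collected in one window."""
    ordered = sorted(samples)
    # Nearest-rank p95 index, clamped to the last element for tiny windows.
    p95_idx = min(len(ordered) - 1, math.ceil(0.95 * len(ordered)) - 1)
    return {
        "rate": len(samples) / window_s,   # events per second in this window
        "peak": ordered[-1],
        "p95": ordered[p95_idx],
    }

stats = window_stats([1.2, 0.8, 3.5, 1.1, 0.9, 2.0, 1.0, 1.3, 0.7, 9.9], 10.0)
print(stats)  # {'rate': 1.0, 'peak': 9.9, 'p95': 9.9}
```

Storing these three numbers per window keeps trends comparable across fleets even when raw sample logs are discarded.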
Example material part numbers (MPNs) often used to enable this loop
- Safety compute (examples): Infineon AURIX TC397, TI TMS570LC4357, NXP S32G274A (class varies by platform).
- TSN / Ethernet switching (examples): NXP SJA1105 (TSN switch family).
- PTP-capable PHY (examples): TI DP83640 (IEEE 1588 timestamping PHY).
- Clock / jitter cleaning (examples): Skyworks Si5341 (jitter-attenuating clock), Renesas 8A34001 (timing/clock generator family).
- GNSS timing (examples): u-blox ZED-F9T (timing-grade GNSS module) for redundant time sources.
- Isolation for field I/O and comms (examples): TI ISO7741 (digital isolator), TI ISO1050 (isolated CAN).
- Power monitoring / telemetry (examples): TI INA226 (power monitor), Analog Devices LTC2991 (multi-rail monitor family).
- Supervisors / reset monitoring (examples): TI TPS3839 (supervisor family) to make reset_cause explainable.
- Non-volatile evidence / counters (examples): Microchip ATECC608B (secure element for device identity), Infineon OPTIGA TPM SLB 9670 (TPM option), Fujitsu MB85RS64V (FRAM for robust counters/markers).
H2-12. FAQs (Evidence-Driven Accordion ×12)
Each answer follows a strict, testable pattern: 1-sentence conclusion → 2 evidence checks → 1 first fix. Evidence checks reference concrete fields (e.g., timestamp_quality, reset_cause, rail_min, flap_count, mismatch_duration) so issues can be replayed on a single timeline.
1) TCMS outputs wrong results without reboot — brownout edge or watchdog coverage gap?
- Evidence #1 (power): correlate rail_min, brownout_count_delta, and any near-miss UV status around the symptom window. MPN examples: INA226, LTC2991, TPS3839.
- Evidence #2 (liveness): check watchdog_count_delta, task-alive gaps, and whether operating_state changed (operate→degraded) when the wrong output occurred.
- First Fix: raise capture density for “near-UV” and liveness events into the incident bundle (pre/post snapshot) and log reset_cause / “near-reset” reason codes even when no reboot occurs.
2) After switchover, recovery is slow — redundancy mode choice or logging/time-sync blockage?
- Evidence #1 (network): examine loss_rate, latency_p95, and flap_count across the switchover window. MPN examples: SJA1105, DP83640.
- Evidence #2 (logging/sync pressure): compare log queue depth / write-rate spikes and timestamp_quality degradation during switchover (event storm vs stable time base).
- First Fix: enforce a “transition logging profile” (keep critical evidence fields, throttle non-critical trend writes) and record a dedicated switch_phase marker so delays can be attributed.
3) Remote I/O reports input jitter but the signal looks stable in the field — cable EMI or debounce/threshold policy?
- Evidence #1 (I/O integrity): verify open/short indicators, input quality flags, and whether glitches correlate with known noise windows. MPN examples for isolation robustness: ISO7741, ISO1050.
- Evidence #2 (decision stats): inspect mismatch_count, mismatch_duration, and “transient vs persistent” classification outcomes.
- First Fix: switch from single-sample decisions to windowed statistics (count + duration) plus cooldown, and log the chosen classification and thresholds for audit.
4) Two channels disagree on the same input — isolation-side drift or sampling window misalignment?
- Evidence #1 (time alignment): compare channel-to-channel offset and timestamp_quality during mismatches. MPN examples: DP83640, Si5341.
- Evidence #2 (value pattern): evaluate whether mismatch is edge-only (timing) or steady bias (drift), and record disagreeing_channel + confidence.
- First Fix: tighten a shared sampling window (or align to a monotonic timestamp boundary) and publish a logged arbitration result (vote_result, confidence_level).
5) Network latency spikes cause control lag — TSN/PTP time-base issue or link flap?
- Evidence #1 (time quality): check sync_state, holdover flags, and timestamp_quality in the spike window. MPN examples: Si5341, ZED-F9T.
- Evidence #2 (link stability): correlate flap_count, loss_rate, and queue/latency percentiles per segment. MPN example: SJA1105.
- First Fix: gate critical control timing on a “time-quality OK” condition and log time-quality transitions so TSN tuning isn’t performed blindly under broken time.
6) Log event order looks wrong — time jump or multi-domain timestamp granularity mismatch?
- Evidence #1 (time jump): verify time-jump markers, holdover events, and timestamp_quality transitions. MPN examples: Si5341, ZED-F9T.
- Evidence #2 (ordering anchors): confirm record_seq, atomic commit markers, and consistent timestamp granularity across producers.
- First Fix: enforce monotonic ordering as the primary sort key and add/verify record_seq + atomic-commit IDs for cross-domain replay.
7) Self-test passes, but runtime still deadlocks — monitoring coverage gap or fault-handling path bug?
- Evidence #1 (runtime coverage): check task-alive heartbeats, watchdog_count_delta, and whether the scheduler shows gaps before the lock. MPN example for reset supervision: TPS3839.
- Evidence #2 (decision trace): inspect decision_code, decision_outcome, and state transitions (operate↔degraded) for oscillation.
- First Fix: add explicit “reason codes” for state transitions and apply cooldown/hysteresis on repeated decision loops, so deadlock becomes diagnosable rather than opaque.
8) Exported maintenance logs can’t replay — which minimum evidence set is missing?
- Evidence #1 (identity): verify presence of schema_version, producer FW version, and timestamp_quality.
- Evidence #2 (anchors): verify correlation_id, record_seq, decision trace fields, and atomic commit markers. MPN example for robust counters/markers: MB85RS64V.
- First Fix: implement a versioned “minimum evidence checklist” and add backward-compatible fields (do not rename old fields; add valid flags and new IDs).
9) After a subsystem drops, TCMS degrades repeatedly — recovery gate too strict or fault classification wrong?
- Evidence #1 (frequency): inspect degrade_count, dwell time in degraded state, and trigger intervals (anti-flap effectiveness).
- Evidence #2 (gate failures): log “recovery gate failed reason” (e.g., stability window not met, retest fail) and correlate with network/I/O health summaries.
- First Fix: add cooldown + windowed stability checks, and version the recovery gate criteria so improvements can be proven via trend windows post-deploy.
10) Power-up sometimes never reaches OPERATE — boot self-test sequencing or I/O loopback failure?
- Evidence #1 (boot step): check boot_step, boot_fail_reason, and state machine transitions (boot→ready→operate).
- Evidence #2 (I/O diag): check loopback results, open/short flags, and per-channel quality bits. MPN examples: ISO7741 (digital isolation), ISO1050 (isolated CAN if used for diagnostics transport).
- First Fix: add a deterministic boot-step reason code and a post-fail snapshot so intermittent boot failures become reproducible evidence rather than “won’t start.”
11) Remote I/O resets occasionally but the master doesn’t notice — missing link monitoring or reset-cause not reported?
- Evidence #1 (remote reset context): verify remote reset_cause, reset counters, and any rail minima captured locally. MPN examples: TPS3839 (supervisor), INA226 (monitor).
- Evidence #2 (master link health): correlate link_down/reconnect markers, flap_count, and heartbeats for missing intervals.
12) “Happened once, never repeats” — how to design triggers + snapshots so next time it is captured?
- Evidence #1 (trigger coverage): verify whether triggers are single-threshold only, or include sequence anomalies (repeated retries, flap bursts, time-quality drops) and manual triggers.
- Evidence #2 (snapshot depth): verify that incident bundles include pre-trigger context and post-trigger aftermath with record_seq continuity and atomic commit markers. MPN examples for robust markers/counters: MB85RS64V; for device identity/integrity: ATECC608B, SLB 9670.
- First Fix: introduce tiered triggering: always-on counters (cheap) + burst snapshots when weak indicators accumulate, then validate capture-rate improvement via trend windows after deployment.