Train Control & Monitoring System (TCMS)
TCMS is the train’s control-and-evidence backbone: it coordinates vehicle functions through safety compute, isolated I/O, and redundant networks while producing time-aligned logs that make faults explainable and auditable. The core goal is to turn every incident into defensible evidence—so decisions, degraded modes, and field fixes can be verified, replayed, and improved over time.
H2-1. TCMS in the Rail Architecture
What TCMS must guarantee
- Command (control authority): who can issue an action, how it is accepted, executed, and confirmed.
- State (single source of truth): how distributed measurements become one validated vehicle state with freshness/quality.
- Evidence (replayable history): how events are time-aligned, signed/verified, and exported for maintenance and audits.
Typical topology (why it exists)
- Vehicle controller: arbitration + decision + policy + aggregation.
- Remote I/O modules: distributed sensing/actuation with isolation across long cables and ground offsets.
- Gateways: bridge legacy buses (e.g., MVB/WTB) and Ethernet/TSN domains without losing determinism.
- Recorder/maintenance access: ensures “one-time faults” are still captured with the right context.
H2-2. Functional Partitioning: Control, Monitoring, Safety, and Evidence
Control plane
- Boundary: command generation, arbitration, execution confirmation.
- Typical failures: duplicate/late commands, missing ACK, wrong target, inconsistent action across cars.
- Evidence fields: command_id, source, target, accept/reject code, ACK latency, retry count.
- First check: command_id → ACK → latency triad to confirm whether the action loop is intact.
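The "command_id → ACK → latency" first check can be sketched as a small filter over command records; the record fields (`issue_ts`, `ack_ts`) and the 50 ms budget are illustrative assumptions, not values defined in this document.

```python
# Hypothetical sketch of the command-loop first check: flag commands whose
# ACK is missing or arrives past the latency budget.
def check_command_loop(records, ack_budget_s=0.050):
    """Return (command_id, reason) pairs for broken action loops."""
    suspect = []
    for rec in records:
        ack_ts = rec.get("ack_ts")
        if ack_ts is None:                               # missing ACK
            suspect.append((rec["command_id"], "no_ack"))
        elif ack_ts - rec["issue_ts"] > ack_budget_s:    # late ACK
            suspect.append((rec["command_id"], "late_ack"))
    return suspect

records = [
    {"command_id": 1, "issue_ts": 0.000, "ack_ts": 0.012},
    {"command_id": 2, "issue_ts": 0.100, "ack_ts": None},
    {"command_id": 3, "issue_ts": 0.200, "ack_ts": 0.400},
]
print(check_command_loop(records))  # [(2, 'no_ack'), (3, 'late_ack')]
```

The output immediately tells an investigator whether the action loop is intact before any deeper root-cause work starts.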
Monitoring plane
- Boundary: acquisition, state fusion, state freshness/quality, alarms.
- Typical failures: false alarms, stale values, contradictory sources, wrong root cause labeling.
- Evidence fields: timestamp, validity flag, stale_age, sensor_quality, counter deltas.
- First check: validity + stale_age before trusting any displayed status.
Safety plane
- Boundary: interlocks, emergency policies, voting/consistency, diagnostic triggers.
- Typical failures: false trip, missed trip, inconsistent degrade decisions, unsafe recovery.
- Evidence fields: safety_state, trip_cause, vote_result, diag_coverage hint, recovery_condition.
- First check: trip_cause + vote_result to verify the decision is evidence-backed.
Maintenance plane
- Boundary: diagnostics, log export, versioning, configuration control, updates.
- Typical failures: missing context, untraceable versions, configuration drift, update-induced outages.
- Evidence fields: fw_version, config_hash, export status, audit trail, update result code.
- First check: fw_version + config_hash to prevent “mixed baselines” in field debugging.
H2-3. Compute Core: Safety MCU/SoC Architecture Choices
The TCMS compute core is judged by its ability to detect and contain silent failures (wrong results without a crash), while preserving traceable evidence for every mismatch, correction, and recovery action.
Redundancy options (trade-offs + evidence)
- Lockstep: detects instruction-level divergence; requires mismatch flags and defined safe reactions.
- Dual-channel: independent computation paths; requires voting rules and correlated evidence to avoid false decisions.
- Heterogeneous redundancy: different cores/implementations; improves common-cause resistance; demands interface-level consistency checks.
- Verification focus: every strategy must output a decision record (what diverged, when, and how it was handled).
Memory & data integrity (from bits to evidence)
- ECC: corrected/uncorrectable counters turn random corruption into explainable events.
- E2E protection: CRC + sequence counters detect corruption, loss, duplication, and reordering across modules.
- State validity: each critical state carries validity, age, and origin to prevent stale truth.
- Evidence fields: corrected_count, uncorrectable_count, crc_fail_count, seq_gap_count, last_pass_ts.
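A minimal sketch of E2E protection as described above: a CRC32 plus an 8-bit sequence counter wrapped around a payload. The framing is an assumption for illustration, not a standardized E2E profile.

```python
import struct
import zlib

# Illustrative E2E wrapper: [seq byte][payload][CRC32 of seq+payload].
def e2e_wrap(payload: bytes, seq: int) -> bytes:
    frame = struct.pack("B", seq & 0xFF) + payload
    return frame + struct.pack(">I", zlib.crc32(frame))

def e2e_check(frame: bytes, expected_seq: int) -> str:
    body, crc = frame[:-4], struct.unpack(">I", frame[-4:])[0]
    if zlib.crc32(body) != crc:
        return "crc_fail"                 # corruption
    if body[0] != expected_seq & 0xFF:
        return "seq_gap"                  # loss, duplication, or reordering
    return "ok"

f = e2e_wrap(b"speed=42", seq=7)
print(e2e_check(f, expected_seq=7))   # ok
print(e2e_check(f, expected_seq=8))   # seq_gap
```

A real deployment would feed `crc_fail` and `seq_gap` outcomes into the `crc_fail_count` and `seq_gap_count` counters listed above.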
How to prove it “did not compute wrong”
- Boot-time self-test: CPU/RAM/flash checks with itemized results and durations.
- Periodic tests: scheduled coverage refresh for safety mechanisms and critical paths.
- Runtime monitors: watchdogs, deadline budgets, lockstep mismatch, ECC rate alarms.
- Mandatory outcome: each detection must map to a defensible action (degrade, reset, isolate, alert).
Engineering acceptance checklist
- Can the system separate crash vs silent wrong vs degradation?
- Does each safety decision generate a time-aligned record (trigger → evidence → action → recovery)?
- Are integrity counters trendable (rate-of-change), not only instantaneous flags?
- Is there a defined policy for uncorrectable events and repeated corrections?
H2-4. Isolated I/O Modules: DI/DO/AI/AO and the Isolation Boundary
In TCMS, many field failures originate at the I/O boundary: long harnesses, ground offsets, transients, and EMI can turn inputs/outputs into untrustworthy signals unless isolation, protection, and diagnostics are designed as a single chain.
I/O types & field failure signatures
- DI jitter: bounce/noise creates false transitions; verify debounce rejects and pulse-width stats.
- DO stuck: command differs from feedback; check overcurrent trips and stuck-detect flags.
- AI drift: slow bias shift or saturation; correlate with common-mode events and temperature.
- AO open-load: actuator missing or wiring open; verify compliance flags and open-load detect.
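The DI-jitter signature lends itself to a small debounce sketch that also keeps the trendable counters (`bounce_count`, `min_pulse_width`) named later in this chapter; the edge format and the 5 ms minimum width are assumptions.

```python
# Illustrative DI debounce: reject pulses shorter than a minimum width and
# keep the stats that make the rejection auditable.
def debounce(edges, min_width_ms=5.0):
    """edges: list of (timestamp_ms, level). Returns (accepted_edges, stats)."""
    accepted, bounce_count, min_pw = [], 0, None
    last_ts = None
    for ts, level in edges:
        if last_ts is not None:
            width = ts - last_ts
            min_pw = width if min_pw is None else min(min_pw, width)
            if width < min_width_ms:
                bounce_count += 1      # too-short pulse: count it, reject it
                last_ts = ts
                continue
        accepted.append((ts, level))
        last_ts = ts
    return accepted, {"bounce_count": bounce_count, "min_pulse_width": min_pw}

accepted, stats = debounce([(0, 1), (2, 0), (3, 1), (50, 0)])
print(accepted, stats)  # [(0, 1), (50, 0)] {'bounce_count': 2, 'min_pulse_width': 1}
```

Keeping the rejected-pulse statistics is what later lets a maintainer separate cable EMI from a mis-tuned debounce policy.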
Isolation strategy (define the boundary)
- Signal isolation: prevents ground-offset coupling into logic-domain thresholds and ADC references.
- Isolated power: keeps AFE supply stable under harness transients and ground shifts.
- Common-mode suppression: reduces CM energy that otherwise leaks through parasitics.
- Protection chain: containment path for surge/ESD before it reaches AFE and isolator.
Evidence fields (make faults locatable)
- Open/short detect: open_load, short_to_batt, short_to_gnd, overcurrent_trip_count.
- Cable diagnostics: bounce_count, min_pulse_width, intermittent_count, cm_event_count.
- Consistency checks: redundant input mismatch flags and voting outcomes.
- Quality tagging: validity flag, stale age, saturation flag, out-of-range markers.
Engineering acceptance checklist
- Can a failure be located to cable side vs isolation side vs logic side?
- Does each channel expose trendable counters (not only a one-shot alarm)?
- Are fault flags time-aligned with supply and network health indicators?
- Does the design define the safe default state for DO/AO when diagnostics fail?
H2-5. Redundant Communications: Vehicle Networks and Gateway Strategy
Redundant communications must be engineered as a verifiable system: measurable link health, defined switching triggers, bounded switchover time budgets, and evidence logs that explain every failover decision.
Common roles on vehicle networks
- TCMS master: command arbitration and consistency across cars/domains.
- Remote I/O nodes: field-state acquisition with diagnostics counters and quality flags.
- Gateways: determinism boundary between legacy bus segments and Ethernet/TSN domains.
- HMI: operator loop for commands, alarms, and confirmation paths.
- Recorder: evidence sink for network events, state snapshots, and recovery actions.
Redundancy modes (conceptual)
- Cold-standby: backup is inactive; switchover relies on re-acquiring state and authority.
- Hot-standby: backup syncs critical state; only one side publishes authoritative outputs.
- Dual-active: both are active; requires arbitration to prevent dual-master conflicts.
- Evidence requirement: each mode must log who was authoritative and why it changed.
Fault criteria (measurable triggers)
- Loss rate: packet drop percentage within a defined time window.
- Latency & jitter: budget exceedance and variance expansion (control-impact signature).
- Link flap: repeated up/down transitions and route instability.
- Switch triggers: graded states (degraded → failed) with hysteresis to avoid flapping.
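The graded states with hysteresis can be sketched as a small state function; the loss-rate thresholds here are illustrative, not recommended budgets.

```python
# Illustrative graded link health (ok -> degraded -> failed) with a hysteresis
# gap (recover_at < degrade_at) so the state does not flap near a threshold.
def grade_link(state, loss_rate, degrade_at=0.02, fail_at=0.10, recover_at=0.005):
    if state == "failed":
        return "degraded" if loss_rate < fail_at else "failed"
    if loss_rate >= fail_at:
        return "failed"
    if state == "degraded":
        return "ok" if loss_rate < recover_at else "degraded"
    return "degraded" if loss_rate >= degrade_at else "ok"

state = "ok"
for sample in [0.01, 0.03, 0.015, 0.004, 0.12]:
    state = grade_link(state, sample)
print(state)  # prints "failed"
```

Note how the 0.015 sample keeps the link in `degraded` rather than bouncing back to `ok`: that gap between `degrade_at` and `recover_at` is the anti-flap hysteresis.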
Degrade strategy (minimum functionality)
- Maintain command safety: preserve an unambiguous command authority path.
- Maintain essential state: keep validated status for the minimum set of safety-relevant signals.
- Protect evidence: ensure failover decisions and link-health context are time-stamped and exportable.
- Recovery conditions: define when normal mode may resume (stability window + counters).
| Mode | Trigger (health criteria + window) | Switch budget | State handover rule | Evidence to log |
|---|---|---|---|---|
| Cold-standby | link down / sustained loss; confirmed by flap counter + windowed stats | longer; requires re-acquire and re-validate state | rebuild baseline; verify freshness + validity before authority | trigger stats, last-known authority, re-sync duration, missed data window |
| Hot-standby | degraded → failed threshold crossed; hysteresis to avoid oscillation | bounded; backup already synced | last common state ID (seq counter) + integrity check (CRC) | authority change, seq gap, sync latency, decision code, outcome |
| Dual-active | arbitration conflict / inconsistent link view / split-brain indicators | tight; must prevent dual-master outputs | token/vote result; deterministic tie-break + rollback rule | vote/arbitration result, conflict count, resolved leader, time-to-stable |
H2-6. Time Base & Event Correlation: PTP/Local Clock/Monotonic Logging
Event correlation is where TCMS becomes reconstructable. A unified time base ensures I/O edges, network degradations, fault codes, and power-health signals can be assembled into a defensible sequence of cause-and-effect.
Time sources (roles)
- PTP: cross-domain alignment for networked modules and gateways.
- RTC: human-readable wall-clock context across long downtime windows.
- Monotonic counter: non-decreasing ordering anchor that prevents “time jump” reordering.
- Rule: incident logs must include monotonic time even when wall-clock is invalid.
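The dual-stamping rule above can be sketched with Python's two clocks: the monotonic clock is the ordering anchor, and the wall clock is context only and may be invalid. Field names follow this document's evidence checklist; the helper itself is hypothetical.

```python
import time

# Illustrative event stamping: monotonic time is always present; wall-clock
# time may be absent, and its absence degrades timestamp_quality.
def stamp_event(event_type, wall_clock_valid=True):
    return {
        "event_type": event_type,
        "event_ts_mono": time.monotonic(),   # never steps backwards
        "event_ts_wall": time.time() if wall_clock_valid else None,
        "timestamp_quality": "good" if wall_clock_valid else "degraded",
    }

a = stamp_event("fault_assert")
time.sleep(0.001)                            # ensure distinct stamps for the demo
b = stamp_event("fault_clear", wall_clock_valid=False)
# Ordering always uses the monotonic field, even when the wall clock is invalid:
ordered = sorted([b, a], key=lambda e: e["event_ts_mono"])
print([e["event_type"] for e in ordered])    # ['fault_assert', 'fault_clear']
```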
What must be correlated
- I/O events: edges, output feedback, open/short and quality flags.
- Network events: loss/jitter spikes, route changes, failover transitions.
- Fault codes: assert/clear with context snapshots.
- Power health: rail minima, brownout markers, reset causes and counters.
Common pitfalls (detectable)
- Time step / jump: wall-clock changes suddenly; detect via monotonic vs wall-clock delta.
- Holdover drift: offset trends while external sync is lost.
- Mixed granularity: hardware vs software timestamp points create inconsistent skew.
- Action: downgrade timestamp_quality and preserve ordering with monotonic time.
Mandatory evidence fields
- sync_state: locked / holdover / free-run / invalid
- offset: relative deviation to the selected master
- holdover_time: duration since last valid sync
- timestamp_quality: good / degraded / bad
- ts_source: HW timestamp / SW timestamp / derived
H2-7. Logging & Black-Box Evidence: What to Record and Why
Logging becomes defensible only when it is treated as an evidence product: it must reconstruct what happened, why it happened, what the system decided, and whether recovery was successful—using time-aligned, integrity-protected records.
Three-layer logging model
- Event: high-density triggers (failover, threshold crossing, fault assert/clear).
- Snapshot: the system’s complete state around an incident (pre/post window).
- Trend: counters and statistics that reveal slow degradation (rates and deltas).
- Rule: events must reference the snapshot and the relevant trend slice via correlation IDs.
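The correlation rule above can be sketched as a factory that stamps the same `correlation_id` on all three layers; the record shapes are illustrative, not a schema defined by this document.

```python
import itertools

# One correlation_id links event <-> snapshot <-> trend slice for replay.
_corr = itertools.count(1)

def make_incident(trigger, state, trend_slice):
    cid = next(_corr)
    event = {"correlation_id": cid, "event_type": trigger}
    snapshot = {"correlation_id": cid, "operating_state": state}
    trend = {"correlation_id": cid, "window": trend_slice}
    return event, snapshot, trend

e, s, t = make_incident("failover", "degraded", {"loss_rate_avg": 0.04})
print(e["correlation_id"] == s["correlation_id"] == t["correlation_id"])  # True
```

With this shape, an offline tool can join all three layers on `correlation_id` without guessing from timestamps.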
Trigger mechanisms (incident bundles)
- Threshold triggers: power/thermal/offset budget exceedance.
- Integrity triggers: CRC fail, sequence gaps, invalid state transitions.
- Network triggers: loss/jitter spikes, flap, route changes, failover.
- Manual triggers: operator/maintenance request with reason code.
- Bundle: capture pre/post windows and include timestamp_quality and sync_state.
Integrity & power-loss robustness
- Record integrity: record CRC/signature + monotonic sequence number.
- Atomic commit: avoid half-written records that corrupt evidence.
- Power-loss marker: mark incomplete commits and preserve last critical bundle.
- Audit value: logs remain explainable even when a crash happens mid-write.
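The atomic-commit and power-loss-marker ideas can be sketched as a record format in which a half-written tail is detectable on replay; the pipe-delimited framing and field names are purely illustrative.

```python
import json
import zlib

# Illustrative log record: JSON body with record_seq, then "|", CRC32 hex,
# then newline. A record counts as committed only if its CRC trailer verifies.
def encode_record(seq, payload):
    body = json.dumps({"record_seq": seq, **payload}).encode()
    return body + b"|" + format(zlib.crc32(body), "08x").encode() + b"\n"

def replay(log_bytes):
    """Return (valid_records, incomplete_count); the latter acts as a
    power_fail_marker for records interrupted mid-write."""
    good, incomplete = [], 0
    for line in log_bytes.split(b"\n"):
        if not line:
            continue
        body, _, crc = line.rpartition(b"|")
        if body and crc and int(crc, 16) == zlib.crc32(body):
            good.append(json.loads(body))
        else:
            incomplete += 1
    return good, incomplete

log = encode_record(1, {"event": "trip"}) + encode_record(2, {"event": "clear"})
log += b'{"record_seq": 3'          # crash mid-write: no CRC trailer, no newline
good, incomplete = replay(log)
print(len(good), incomplete)  # 2 1
```

The half-written third record is flagged rather than silently dropped, which is exactly the audit property the checklist below asks for.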
Engineering acceptance checklist
- Can an incident answer: trigger → context → decision → outcome?
- Do events include time quality and a correlation_id?
- Are counters trendable (deltas/rates), not only absolute values?
- Can evidence survive a one-time crash via atomic commits and markers?
Evidence field checklist (Ctrl+F friendly)
LOG-CHECKLIST:EVENT
- event_id — unique incident record identifier
- event_type — threshold / integrity / network / manual
- event_ts_mono — monotonic timestamp for ordering
- event_ts_wall — wall-clock timestamp (optional but recommended)
- timestamp_quality — good / degraded / bad
- source_module — gateway / io / compute
- severity — info / warn / fault
- decision_code — degrade / switch / reset / isolate
- decision_outcome — success / failed / partial
- correlation_id — links event ↔ snapshot ↔ trend slice
Snapshot fields
- snapshot_id — snapshot record identifier
- operating_state — ready / operate / degraded / fault
- authority_owner — current authority (A / B / peer)
- network_health_summary — loss/jitter/flap summary
- power_health_summary — rail minima + reset causes + counters
- io_quality_summary — invalid/saturation/open-short summary
- active_faults_list — asserted fault list
- last_transition_reason — why the last state changed
Trend fields
- trend_window_start / trend_window_end — statistics window
- loss_rate_avg / loss_rate_peak — loss stats
- jitter_p95 — jitter percentile indicator
- crc_fail_rate / seq_gap_rate — integrity rates
- watchdog_count_delta — watchdog delta since last window
- ecc_corrected_delta / ecc_uncorrectable_delta — ECC deltas
- flap_count_delta — link flap delta
- holdover_time — time sync holdover duration
Integrity fields
- record_crc — record-level checksum
- record_seq — record sequence counter
- write_result — commit result code
- atomic_commit_id — atomic transaction identifier
- power_fail_marker — incomplete-write marker
H2-8. Self-Test & Diagnostics Coverage: Boot-Time vs Run-Time
Self-test is the mechanism that turns safety claims into measurable coverage. It must define when tests run, what fault classes they detect, what evidence is produced, and what the system does when checks fail.
Boot-time self-test (POST/BIST)
- Compute & memory: CPU/RAM/flash checks with itemized results and durations.
- Test vectors: versioned vectors to prevent “test mismatch” after software updates.
- I/O path loopback: verify acquisition → processing → output feedback chain.
- Outcome: pass → ready; fail → controlled degraded/fault entry with evidence logs.
Run-time monitoring (continuous)
- Liveness & deadlines: watchdog, task alive, deadline miss counters.
- Comms health: loss/jitter/flap, CRC fail, sequence gaps.
- Integrity: ECC trends, E2E CRC/seq consistency, mismatch counts.
- Consistency: redundant inputs or channels mismatch and voting outcomes.
Diagnostics coverage mindset
- Start from fault classes: define what must be detectable (silent wrong, intermittent, degradation).
- Map detection → evidence: each detection emits fields that explain “why” and “where”.
- Define reaction semantics: degrade vs fault entry is evidence-driven, not arbitrary.
- Recovery conditions: stability windows and retest rules gate a return to normal.
Failure handling (strategy semantics)
- Fail-safe: stop high-risk outputs while preserving alarms and evidence flow.
- Limp-home: keep a minimum function set with degraded quality flags and strict limits.
- Evidence-first: every transition must record trigger, checks performed, and outcome.
- Non-negotiable: do not restore normal mode until monitors confirm stability.
H2-9. Fault Handling & Degraded Operation: Decision Logic You Can Defend
Fault handling becomes defensible when it is written as auditable, testable rules: clear fault classes, explicit voting and arbitration semantics, evidence-backed transitions into degraded operation, and objective recovery gates.
Fault classification (policy router)
- Transient vs persistent: windowed spikes vs sustained violations drive action strength and recovery barriers.
- Single-point vs multi-point: local anomalies vs cross-domain correlation (systemic risk signature).
- Recoverable vs non-recoverable: automatic restore requires stability windows and retest pass; otherwise manual clear.
- Rule: classification dictates what evidence must be collected and which actions are allowed.
Voting & consistency (mismatch handling)
- Mismatch semantics: quantify disagreement by count + duration, not by a single sample.
- Arbitration outputs: vote_result and confidence_level must be logged.
- Anti-flap: hysteresis + cooldown prevents oscillation near thresholds.
- Isolation scope: isolate the faulty channel while preserving stable authority paths where possible.
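The count + duration mismatch semantics above can be sketched as a sliding-window monitor; the window size and limits are illustrative assumptions, not recommended values.

```python
from collections import deque

# Illustrative mismatch monitor: a channel fault is declared only when
# disagreement accumulates in both count and duration, never from one sample.
class MismatchMonitor:
    def __init__(self, window_s=1.0, count_limit=5, duration_limit_s=0.5):
        self.samples = deque()            # (ts, mismatched) pairs
        self.window_s = window_s
        self.count_limit = count_limit
        self.duration_limit_s = duration_limit_s

    def update(self, ts, a, b):
        self.samples.append((ts, a != b))
        while self.samples and ts - self.samples[0][0] > self.window_s:
            self.samples.popleft()        # slide the window
        hits = [t for t, m in self.samples if m]
        count = len(hits)
        duration = hits[-1] - hits[0] if count > 1 else 0.0
        if count >= self.count_limit and duration >= self.duration_limit_s:
            return {"vote_result": "channel_fault", "mismatch_count": count}
        return {"vote_result": "consistent", "mismatch_count": count}

m = MismatchMonitor()
result = None
for i in range(6):                        # six mismatched samples over ~0.5 s
    result = m.update(0.1 * i, a=1, b=0)
print(result["vote_result"])  # channel_fault
```

Both `vote_result` and `mismatch_count` are emitted on every update, so the decision trace required above comes for free.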
Evidence requirements (every transition)
- Trigger context: window statistics and source module ID.
- Decision trace: reason code, classification, and arbitration summary.
- Action trace: degrade/switch/isolate/limit plus result status.
- Recovery gate: stability window, retest status, and explicit restore decision.
Degraded operation (strategy semantics)
- Degraded: keep a minimum validated control/monitor set with quality flags.
- Fault: stop high-risk outputs; preserve alarms, evidence, and maintenance access.
- Restore: never automatic without objective stability + retest criteria.
- Audit value: transitions must be replayable using correlated evidence bundles.
IF/THEN Rule Block A — Network degradation escalation
IF network loss/jitter exceeds the defined budget for a sustained window (with flap count rising), THEN classify as persistent network degradation, DO switch authority path and limit non-essential traffic, EXIT only after a stability window + retest pass confirms healthy metrics.
- Record: event_id, event_type, network_health_summary, decision_code, decision_outcome, recovery_gate
IF/THEN Rule Block B — Redundant input mismatch arbitration
IF redundant channels disagree beyond mismatch thresholds (count + duration) and integrity checks are failing, THEN classify as single-point channel fault unless cross-domain correlation indicates multi-point risk, DO isolate the disagreeing channel and publish the arbitration result with confidence, EXIT only after cooldown + retest validates consistency over a stability window.
- Record: mismatch_count, mismatch_duration, disagreeing_channel, vote_result, confidence_level
IF/THEN Rule Block C — Multi-point anomaly escalation
IF multiple domains degrade simultaneously (I/O quality drops + network instability + time-quality degrades), THEN classify as multi-point/systemic risk, DO enter minimum validated function set and increase evidence capture density, EXIT only after the root cause indicators clear and post-checks pass.
- Record: timestamp_quality, sync_state, io_quality_summary, network_health_summary, operating_state
IF/THEN Rule Block D — Restore to normal mode
IF health metrics remain within budget for the required stability window and all retests pass, THEN classify as recoverable, DO restore normal mode with a logged authority confirmation, EXIT by recording the restore event and a post-restore snapshot.
- Record: recovery_gate, decision_code, authority_owner, snapshot_id, decision_outcome
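Rule Block D's EXIT condition can be sketched as a recovery-gate function over health samples and retest results; the field names and the 30 s stability window are assumptions for illustration.

```python
# Illustrative recovery gate: restore is allowed only if every retest passed
# AND the metrics have stayed within budget for the full stability window.
def recovery_gate(health_samples, retest_results, stability_window_s=30.0):
    """health_samples: list of (ts, within_budget). Returns (ok, reason)."""
    if not all(retest_results.values()):
        failed = [k for k, v in retest_results.items() if not v]
        return False, f"retest_failed:{failed}"
    # Stability counts from the most recent out-of-budget sample.
    last_bad = max((ts for ts, ok in health_samples if not ok), default=None)
    last_ts = health_samples[-1][0]
    start = last_bad if last_bad is not None else health_samples[0][0]
    stable_for = last_ts - start
    if stable_for < stability_window_s:
        return False, f"stability_window_not_met:{stable_for:.0f}s"
    return True, "restore_allowed"

samples = [(0, False), (5, True), (20, True), (40, True)]
print(recovery_gate(samples, {"io_loopback": True, "crc_selftest": True}))
```

Because the function returns an explicit reason string, the gate outcome itself becomes a loggable evidence field rather than a silent decision.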
H2-10. Power Integrity & Resilience for TCMS Nodes
TCMS resilience depends on preventing “silent wrong computation” during supply disturbances and on producing evidence that explains resets, degraded behavior, and write-protection outcomes—without expanding into traction or auxiliary power topologies.
Brownout & reset behavior (closed loop)
- Brownout policy: define UV threshold + hysteresis + response budget.
- Reset cause: distinguish brownout vs watchdog vs software reset using reason codes.
- Write protection: enforce atomic commits and markers during power instability windows.
- Audit value: a reset must be explainable using rail_min and reset_cause context.
Power health monitoring strategy
- PG (power-good): gating “allowed to operate” conditions.
- Rail monitor/ADC: capture rail minima/maxima and windowed statistics.
- Sampling policy: measure during critical phases (boot, mode switch, write commit).
- Trendability: store brownout_count and rail_min as deltas over windows.
Core evidence fields (correlatable)
- reset_cause — brownout / watchdog / software
- rail_min / rail_max — pre/post incident extrema
- brownout_count — trendable frequency indicator
- watchdog_count_delta — liveness consequence indicator
- timestamp_quality — time credibility during the incident
Make resets explainable
- Event linking: power events must create incident bundles (event + snapshot + trend slice).
- State linkage: include operating_state transition reason and post-check results.
- Write linkage: include power_fail_marker and atomic_commit_id.
- Outcome: logs prove whether a reset was an expected protection action or an uncontrolled crash.
Rail minima (rail_min) and reset_cause are root-cause anchors; without them, a “random reset” remains non-actionable.
H2-11. Verification & Field Feedback Loop for TCMS
A TCMS improves fastest when field incidents are converted into repeatable evidence packages, aligned on a single time base, narrowed to a responsible domain, reproduced under controlled conditions, and then used to update thresholds, decision rules, and logging fields with versioned compatibility.
What this loop produces
- Audit-ready incident bundles: event + snapshot + trend slice linked by correlation_id.
- Defensible root-cause claims: hypotheses that can be proven or falsified by evidence fields.
- Upgradeable schemas: new fields added with schema_version and backward compatibility rules.
- Predictive maintenance assets: trend indicators that move “intermittent” into measurable frequency and degradation.
Evidence prerequisites (links to earlier chapters)
- Time credibility: timestamp_quality, sync_state, holdover markers (from time-base chapter).
- Incident bundle structure: event/snapshot/trend and integrity markers (from logging chapter).
- State transitions: operate/degraded/fault transitions with reasons (from self-test/diagnostics chapter).
- Decision trace: degrade/switch/isolate with recorded triggers and recovery gates (from fault-handling chapter).
Step 1 — Collect (incident package)
- Input: incident bundle and raw records.
- Output: a single “incident package” with integrity checks and versions.
- Must include: schema_version, record_seq, record_crc, timestamp_quality.
- Acceptance: package integrity can be validated offline with deterministic results.
Step 2 — Time Align (single timeline)
- Input: PTP/RTC/monotonic stamps.
- Output: a unified timeline (monotonic primary, wall-clock secondary).
- Must include: sync_state, holdover markers, time-jump detection.
- Acceptance: cross-module events can be ordered without ambiguity.
Step 3 — Domain Narrow (responsibility)
- Input: network/I/O/integrity/power evidence summaries.
- Output: a responsible domain with supporting fields.
- Examples: network_health_summary, io_quality_summary, reset_cause, mismatch counters.
- Acceptance: domain choice must cite evidence fields, not inference.
Step 4 — Hypothesize (falsifiable)
- Input: aligned timeline + domain.
- Output: IF/THEN hypotheses mapped to expected field changes.
- Rule: each hypothesis must be falsifiable by a defined set of fields.
- Acceptance: expected observations are written as field deltas over time windows.
Step 5 — Reproduce (compare bundles)
- Input: reproduction conditions and injection plan.
- Output: reproduced incident bundles with comparable schemas.
- Rule: reproduction must emit the same evidence fields set (or a version-mapped superset).
- Acceptance: “field-to-field” comparison supports or rejects the hypothesis.
Step 6 — Update (thresholds, rules, fields)
- Input: validated root cause and evidence gaps.
- Output: versioned updates with backward compatibility.
- Update classes: thresholds, decision rules, capture density, and schema fields.
- Acceptance: mixed fleet (N and N-1) remains diagnosable (time align + domain narrow still works).
Step 7 — Prove & Monitor (window stats)
- Input: post-deploy trend windows.
- Output: measurable improvement backed by windowed statistics.
- Examples: reduced flap rate, fewer brownouts, lower mismatch duration, improved time-quality ratio.
- Acceptance: improvement is demonstrated over time windows, not single samples.
Field add strategy (versioned + backward compatible)
- Version everything: include schema_version and producer FW version on every record.
- Additive changes first: add new fields without changing old semantics; allow “unknown” + valid flags.
- Minimum disruption: add lightweight counters/quality bits before heavy snapshots; add indexing fields before payload fields.
- Mixed-fleet acceptance: N-1 and N must still support time align and domain narrowing.
Predictive trend indicators (high ROI)
- Degradation: temperature peaks, ECC corrected deltas, jitter percentiles.
- Frequency: brownout_count_delta, watchdog_count_delta, flap_count_delta.
- Evidence quality: ratio of degraded timestamp_quality, invalid I/O ratio.
- Rule: store window statistics (rate/peak/p95), not only absolute values.
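The window-statistics rule can be sketched directly; the nearest-rank p95 convention used here is one common choice, not something this document mandates.

```python
import math

# Illustrative per-window statistics: rate, peak, and p95 instead of a single
# absolute value, matching the trend fields used in this chapter.
def window_stats(samples, window_s):
    """samples: measurements (e.g., jitter in ms) collected in one window."""
    ordered = sorted(samples)
    # Nearest-rank p95 index, clamped to the last element for tiny windows.
    p95_idx = min(len(ordered) - 1, math.ceil(0.95 * len(ordered)) - 1)
    return {
        "rate": len(samples) / window_s,   # events per second in this window
        "peak": ordered[-1],
        "p95": ordered[p95_idx],
    }

stats = window_stats([1.2, 0.8, 3.5, 1.1, 0.9, 2.0, 1.0, 1.3, 0.7, 9.9], 10.0)
print(stats)  # {'rate': 1.0, 'peak': 9.9, 'p95': 9.9}
```

Storing these three numbers per window keeps trends comparable across fleets even when raw sample logs are discarded.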
Example material part numbers (MPNs) often used to enable this loop
- Safety compute (examples): Infineon AURIX TC397, TI TMS570LC4357, NXP S32G274A (class varies by platform).
- TSN / Ethernet switching (examples): NXP SJA1105 (TSN switch family).
- PTP-capable PHY (examples): TI DP83640 (IEEE 1588 timestamping PHY).
- Clock / jitter cleaning (examples): Skyworks Si5341 (jitter-attenuating clock), Renesas 8A34001 (timing/clock generator family).
- GNSS timing (examples): u-blox ZED-F9T (timing-grade GNSS module) for redundant time sources.
- Isolation for field I/O and comms (examples): TI ISO7741 (digital isolator), TI ISO1050 (isolated CAN).
- Power monitoring / telemetry (examples): TI INA226 (power monitor), Analog Devices LTC2991 (multi-rail monitor family).
- Supervisors / reset monitoring (examples): TI TPS3839 (supervisor family) to make reset_cause explainable.
- Non-volatile evidence / counters (examples): Microchip ATECC608B (secure element for device identity), Infineon OPTIGA TPM SLB 9670 (TPM option), Fujitsu MB85RS64V (FRAM for robust counters/markers).
H2-12. FAQs (Evidence-Driven Accordion ×12)
Each answer follows a strict, testable pattern: 1-sentence conclusion → 2 evidence checks → 1 first fix. Evidence checks reference concrete fields (e.g., timestamp_quality, reset_cause, rail_min, flap_count, mismatch_duration) so issues can be replayed on a single timeline.
1) TCMS outputs wrong results without reboot — brownout edge or watchdog coverage gap?
- Evidence #1 (power): correlate rail_min, brownout_count_delta, and any near-miss UV status around the symptom window. MPN examples: INA226, LTC2991, TPS3839.
- Evidence #2 (liveness): check watchdog_count_delta, task-alive gaps, and whether operating_state changed (operate→degraded) when the wrong output occurred.
- First Fix: raise capture density for “near-UV” and liveness events into the incident bundle (pre/post snapshot) and log reset_cause / “near-reset” reason codes even when no reboot occurs.
2) After switchover, recovery is slow — redundancy mode choice or logging/time-sync blockage?
- Evidence #1 (network): examine loss_rate, latency_p95, and flap_count across the switchover window. MPN examples: SJA1105, DP83640.
- Evidence #2 (logging/sync pressure): compare log queue depth / write-rate spikes and timestamp_quality degradation during switchover (event storm vs stable time base).
- First Fix: enforce a “transition logging profile” (keep critical evidence fields, throttle non-critical trend writes) and record a dedicated switch_phase marker so delays can be attributed.
3) Remote I/O reports input jitter but the signal looks stable in the field — cable EMI or debounce/threshold policy?
- Evidence #1 (I/O integrity): verify open/short indicators, input quality flags, and whether glitches correlate with known noise windows. MPN examples for isolation robustness: ISO7741, ISO1050.
- Evidence #2 (decision stats): inspect mismatch_count, mismatch_duration, and “transient vs persistent” classification outcomes.
- First Fix: switch from single-sample decisions to windowed statistics (count + duration) plus cooldown, and log the chosen classification and thresholds for audit.
4) Two channels disagree on the same input — isolation-side drift or sampling window misalignment?
- Evidence #1 (time alignment): compare channel-to-channel offset and timestamp_quality during mismatches. MPN examples: DP83640, Si5341.
- Evidence #2 (value pattern): evaluate whether mismatch is edge-only (timing) or steady bias (drift), and record disagreeing_channel + confidence.
- First Fix: tighten a shared sampling window (or align to a monotonic timestamp boundary) and publish a logged arbitration result (vote_result, confidence_level).
5) Network latency spikes cause control lag — TSN/PTP time-base issue or link flap?
- Evidence #1 (time quality): check sync_state, holdover flags, and timestamp_quality in the spike window. MPN examples: Si5341, ZED-F9T.
- Evidence #2 (link stability): correlate flap_count, loss_rate, and queue/latency percentiles per segment. MPN example: SJA1105.
- First Fix: gate critical control timing on a “time-quality OK” condition and log time-quality transitions so TSN tuning isn’t performed blindly under broken time.
6) Log event order looks wrong — time jump or multi-domain timestamp granularity mismatch?
- Evidence #1 (time jump): verify time-jump markers, holdover events, and timestamp_quality transitions. MPN examples: Si5341, ZED-F9T.
- Evidence #2 (ordering anchors): confirm record_seq, atomic commit markers, and consistent timestamp granularity across producers.
- First Fix: enforce monotonic ordering as the primary sort key and add/verify record_seq + atomic-commit IDs for cross-domain replay.
7) Self-test passes, but runtime still deadlocks — monitoring coverage gap or fault-handling path bug?
- Evidence #1 (runtime coverage): check task-alive heartbeats, watchdog_count_delta, and whether the scheduler shows gaps before the lock. MPN example for reset supervision: TPS3839.
- Evidence #2 (decision trace): inspect decision_code, decision_outcome, and state transitions (operate↔degraded) for oscillation.
- First Fix: add explicit “reason codes” for state transitions and apply cooldown/hysteresis on repeated decision loops, so deadlock becomes diagnosable rather than opaque.
8) Exported maintenance logs can’t replay — which minimum evidence set is missing?
- Evidence #1 (identity): verify presence of schema_version, producer FW version, and timestamp_quality.
- Evidence #2 (anchors): verify correlation_id, record_seq, decision trace fields, and atomic commit markers. MPN example for robust counters/markers: MB85RS64V.
- First Fix: implement a versioned “minimum evidence checklist” and add backward-compatible fields (do not rename old fields; add valid flags and new IDs).
9) After a subsystem drops, TCMS degrades repeatedly — recovery gate too strict or fault classification wrong?
- Evidence #1 (frequency): inspect degrade_count, dwell time in degraded state, and trigger intervals (anti-flap effectiveness).
- Evidence #2 (gate failures): log “recovery gate failed reason” (e.g., stability window not met, retest fail) and correlate with network/I/O health summaries.
- First Fix: add cooldown + windowed stability checks, and version the recovery gate criteria so improvements can be proven via trend windows post-deploy.
10) Power-up sometimes never reaches OPERATE — boot self-test sequencing or I/O loopback failure?
- Evidence #1 (boot step): check boot_step, boot_fail_reason, and state machine transitions (boot→ready→operate).
- Evidence #2 (I/O diag): check loopback results, open/short flags, and per-channel quality bits. MPN examples: ISO7741 (digital isolation), ISO1050 (isolated CAN if used for diagnostics transport).
- First Fix: add a deterministic boot-step reason code and a post-fail snapshot so intermittent boot failures become reproducible evidence rather than “won’t start.”
11) Remote I/O resets occasionally but the master doesn’t notice — missing link monitoring or reset-cause not reported?
- Evidence #1 (remote reset context): verify remote reset_cause, reset counters, and any rail minima captured locally. MPN examples: TPS3839 (supervisor), INA226 (monitor).
- Evidence #2 (master link health): correlate link_down/reconnect markers, flap_count, and heartbeats for missing intervals.
12) “Happened once, never repeats” — how to design triggers + snapshots so next time it is captured?
- Evidence #1 (trigger coverage): verify whether triggers are single-threshold only, or include sequence anomalies (repeated retries, flap bursts, time-quality drops) and manual triggers.
- Evidence #2 (snapshot depth): verify that incident bundles include pre-trigger context and post-trigger aftermath with record_seq continuity and atomic commit markers. MPN examples for robust markers/counters: MB85RS64V; for device identity/integrity: ATECC608B, SLB 9670.
- First Fix: introduce tiered triggering: always-on counters (cheap) + burst snapshots when weak indicators accumulate, then validate capture-rate improvement via trend windows after deployment.