Interlocking & Vital Logic (Rail) — Redundant MCUs & Voter Design
Interlocking & Vital Logic is built to fail safe by design: redundant channels and voters grant "permit" only when inputs, timing, and configuration are provably consistent; otherwise, outputs de-energize and faults latch. The core value is evidentiary operation: every trip, mismatch, reset, and recovery is measured and logged with integrity so audits and field replay can reproduce the decision path.
H2-1. System Role & Safety Boundary
What this module actually “does”
Interlocking & vital logic exists to make a single hard promise: permission is granted only when the evidence chain is complete. When inputs are stale, inconsistent, or not provably trustworthy, the only valid outcome is a safe default—typically deny and de-energize. This reframes the design goal from “keep running” to “always remain provably safe.”
Boundary definition: three layers that prevent scope creep
- Functional boundary: adjudicates permission/lock/deny based on validated conditions; it does not perform positioning fusion, radio link management, or track-circuit demodulation.
- Signal boundary: distinguishes vital signals (must be provable) from non-vital signals (diagnostics/monitoring that cannot directly grant permission).
- Safety boundary: defines where isolation exists, what is dual-channel, and what constitutes a safe state when anything is uncertain.
Vital I/O taxonomy (the engineering reason “safe” is enforceable)
Treating every wire as equal is how safety systems fail in the field. A strict taxonomy prevents accidental “permission by convenience”:
- Vital Inputs (dual-channel): must pass freshness, consistency, and channel agreement checks before they can influence a permit decision.
- Non-vital Inputs: informative only; never sufficient to grant permission.
- Vital Outputs: must be designed so that loss of energy implies safety (de-energize-to-safe), not “unknown.”
- Proof / Feedback: output feedback is mandatory for detecting welded relays, broken lines, or mismatched actuation.
Evidence chain: what must be logged to make decisions auditable
Safety reviews and field forensics are won or lost on evidence. The boundary is not “real” unless the system records what it relied on. A minimal evidence chain for each decision includes:
Input trust proof (per channel):
input_valid, input_age_ms, input_seq_gap, input_crc_err_cnt, input_contract_hash
Decision proof (per decision):
decision_id, permit_state, decision_reason_code, channel_agreement
Output proof (command vs feedback):
vital_out_cmd, vital_out_feedback, feedback_mismatch_cnt, relay_weld_suspect
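To make the chain concrete, the per-channel input trust proof can be carried as one record plus a single gating predicate. This is a minimal C sketch: the field names follow the lists above, while the limit values and the expected contract hash are illustrative assumptions, not values from this document.

```c
#include <stdbool.h>
#include <stdint.h>

/* Per-channel input trust proof (names follow the evidence chain above). */
typedef struct {
    bool     input_valid;
    uint32_t input_age_ms;
    uint32_t input_seq_gap;
    uint32_t input_crc_err_cnt;
    uint32_t input_contract_hash;
} input_proof_t;

/* Illustrative policy limits -- real values come from the safety case. */
#define MAX_AGE_MS    50u
#define MAX_SEQ_GAP   1u
#define EXPECTED_HASH 0xA5A5A5A5u

/* An input may influence a permit decision only when every check passes. */
static bool input_trusted(const input_proof_t *p)
{
    return p->input_valid
        && p->input_age_ms <= MAX_AGE_MS
        && p->input_seq_gap <= MAX_SEQ_GAP
        && p->input_crc_err_cnt == 0u
        && p->input_contract_hash == EXPECTED_HASH;
}
```

The point of the predicate is that permission logic never reads raw inputs directly; it only sees inputs that have already passed the evidence gate.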
Acceptance criteria (what “done” means for this chapter)
- Safe state is explicitly defined and reachable under any uncertainty: deny + de-energize.
- Every boundary crossing (isolation, dual-channel inputs, vital output chain) is visible in the diagram and referenced in text.
- At least one complete decision evidence chain is described (inputs → checks → decision → outputs → feedback → logs).
H2-2. Vital State Model & “De-energize-to-trip” Philosophy
Why a state machine is mandatory (not optional)
“Fail-safe” becomes real only when behavior is deterministic under uncertainty. A vital state model turns safety from a slogan into an auditable contract: entry conditions define what must be true, allowed outputs constrain what the system may do, and evidence logs preserve why a transition occurred. Without a state model, field failures often degrade into oscillation (reboot loops, chatter, intermittent permission) that is unsafe and hard to diagnose.
Core states and what makes them provable
A rail-grade state model should make two outcomes always measurable: (1) how quickly the system reaches the safe state after a trigger, and (2) whether recovery is allowed, rate-limited, and evidence-backed. The following is a practical minimum set:
- INIT: outputs forced OFF; integrity + configuration + timing stability verified before any RUN transition.
- RUN: permission decisions allowed only when inputs are fresh and channels agree.
- DEGRADED: conservative behavior enforced (e.g., deny-only or restricted permits) when a non-fatal constraint is detected.
- TRIP: immediate de-energize; safe state reached within a measured latency budget.
- LATCHED: manual intervention or proof-test required; prevents “auto-recover oscillation.”
- RECOVERY: controlled checklist; back-off timers avoid repeated transitions in noisy conditions.
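The state set above can be expressed as a deterministic next-state function. This is a C sketch under the stated entry/exit rules; the condition flags and the exact transition choices (e.g., TRIP re-entering only via RECOVERY) are simplifying assumptions.

```c
#include <stdbool.h>

typedef enum { ST_INIT, ST_RUN, ST_DEGRADED, ST_TRIP, ST_LATCHED, ST_RECOVERY } vital_state_t;

typedef struct {
    bool integrity_ok;      /* POST + configuration + timing stability verified */
    bool inputs_fresh;      /* freshness contract holds on all vital inputs */
    bool channels_agree;    /* cross-channel agreement within tolerance */
    bool severe_fault;      /* weld suspect, output-proof failure, ... */
    bool recovery_evidence; /* full recovery checklist re-validated */
} vital_cond_t;

/* Deterministic next-state function: uncertainty always resolves toward
 * TRIP/LATCHED, never toward RUN. */
static vital_state_t vital_next(vital_state_t s, const vital_cond_t *c)
{
    if (c->severe_fault)
        return ST_LATCHED;                        /* latch severe fault classes */
    switch (s) {
    case ST_INIT:     return c->integrity_ok ? ST_RUN : ST_INIT;
    case ST_RUN:      return (c->inputs_fresh && c->channels_agree) ? ST_RUN : ST_TRIP;
    case ST_TRIP:     return c->recovery_evidence ? ST_RECOVERY : ST_TRIP;
    case ST_DEGRADED: return c->recovery_evidence ? ST_RECOVERY : ST_DEGRADED;
    case ST_RECOVERY: return c->recovery_evidence ? ST_RUN : ST_DEGRADED;
    case ST_LATCHED:  return ST_LATCHED;          /* manual intervention required */
    }
    return ST_TRIP;                               /* defensive default: safe state */
}
```

Because the function is pure (state + conditions in, state out), every transition can be exercised as a test vector and logged with its inputs as evidence.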
De-energize-to-trip: what it means in engineering terms
De-energize-to-trip is not simply “turn it off.” It is a design rule that makes safety enforceable: the safe state is reached by removing energy from the vital output chain, and the removal is confirmed by feedback. This approach is robust under supply anomalies and supports clean forensic evidence (command vs feedback).
Minimum transition evidence fields (transition = a safety event):
state_id, transition_reason, latched_fault_id, recovery_condition_met,
safe_state_latency_ms, output_drop_confirmed, recovery_inhibit_timer_ms
How to prevent unsafe oscillation (the hidden field killer)
A large fraction of safety incidents stem from “half recoveries” that repeat under EMI or borderline supply conditions. A vital state model must explicitly block oscillation by combining: (1) latching rules for severe fault classes, (2) back-off timers, and (3) a recovery checklist that re-validates evidence inputs and output feedback before RUN.
- Latching: mismatch, output feedback abnormal, critical diagnostics failure → LATCHED.
- Back-off: rate-limit recovery attempts; record correlation with brownout/EMI bursts.
- Checklist: config hash match, timing stable, inputs fresh, channels agree, output feedback sane.
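A hedged C sketch of the recovery gate, combining the back-off timer with the checklist above. The checklist fields mirror the bullet list; the timer arithmetic assumes a monotonic millisecond counter.

```c
#include <stdbool.h>
#include <stdint.h>

/* Recovery checklist from the text: every item must re-validate before RUN. */
typedef struct {
    bool config_hash_match;
    bool timing_stable;
    bool inputs_fresh;
    bool channels_agree;
    bool output_feedback_sane;
} recovery_checklist_t;

static bool recovery_allowed(const recovery_checklist_t *c,
                             uint32_t now_ms, uint32_t last_attempt_ms,
                             uint32_t backoff_ms)
{
    /* Back-off timer: block rapid re-attempts under noisy conditions. */
    if ((now_ms - last_attempt_ms) < backoff_ms)
        return false;
    return c->config_hash_match && c->timing_stable && c->inputs_fresh
        && c->channels_agree && c->output_feedback_sane;
}
```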
Acceptance criteria (what “done” means for this chapter)
- Every state includes: entry conditions, allowed outputs, evidence fields, and exit conditions.
- TRIP behavior is measurable: safe_state_latency_ms and output_drop_confirmed are defined.
- Recovery cannot chatter: rate-limit + checklist + latching rules are explicit.
H2-3. Redundancy Architectures: Lockstep vs Dual-Channel vs TMR
What this chapter is for
Redundancy is not “more MCUs.” It is a deliberate trade between fault models, safety behavior, availability targets, and the evidence that can be produced during audit and field forensics. A correct architecture choice starts by matching the dominant fault types (transient, random hardware, common-cause, systematic) to the redundancy pattern that can detect and contain them with measurable latency.
Lockstep (cycle-by-cycle compare)
- Best at: transient upsets and random hardware faults with fast detection (tight compare window).
- Weak at: common-cause failures (shared supply, reset, EMI injection) and systematic errors (same code/config → same wrong result).
- Evidence that must exist: mismatch counts and classification, correlated with reset/brownout/EMI indicators.
Dual-channel (1oo2 / 2oo2 variants)
- Best at: increasing independence (separate supplies/clocks/resets) and producing stronger decision evidence (value + timing + freshness contracts).
- Weak at: systematic errors if both channels share identical software/config and the same wrong assumption.
- Key choice: 2oo2 (permit only on agreement) favors safety; 1oo2 can favor availability but demands stricter “confidence + health” rules.
2oo3 / TMR (majority vote)
- Best at: availability under a single-channel random failure; the system can continue while isolating a faulty channel.
- Weak at: complexity and diagnostic coverage requirements; a weak health model can let majority voting hide a degrading channel.
- Non-negotiable: per-channel health scoring + “kick-out” evidence must be logged.
Minimum evidence fields (audit + field debugging)
Core: mismatch_counter, voter_decision, channel_health, crc_error_rate
Recommended (for depth): mismatch_class, reset_reason, brownout_events, decision_reason_code
Acceptance criteria
- The dominant fault types are explicitly mapped to each architecture’s strengths and limits.
- Safety vs availability is stated as a design intent (fail-silent vs fail-operational positioning).
- Evidence fields are not listed as nouns only—each is tied to a reason it is needed.
H2-4. Voter Design Deep Dive: 1oo2 / 2oo3 Decision Logic
Voter purpose: compare evidence, not only values
A robust voter does not simply compare numerical results. It compares evidence consistency: value agreement must be supported by freshness, alignment (sequence/time), and trust indicators (confidence/health/CRC). When evidence is incomplete, the correct decision is conservative (deny, trip, or degraded), with a reason code that makes the outcome auditable.
Decision inputs (the minimum contract)
- Value evidence: value, valid_flag
- Timing evidence: timestamp, seq_no
- Trust evidence: confidence, channel_health, crc_error_rate
Four-step decision flow (auditable and testable)
- Freshness gate: reject stale inputs (age exceeds policy) → reason code indicates STALE, action becomes deny/degraded.
- Alignment gate: enforce window_ms and seq_gap limits → reason code indicates ALIGN_FAIL (prevents “same value, different moment” mistakes).
- Agreement gate: compare value_delta to tolerance → AGREE or VALUE_CONFLICT.
- Conflict resolver: select conservative action or bias toward a higher-trust channel (confidence + health) and log the justification.
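The gate sequence can be sketched as a small C function that returns a reason code. Only the gate order comes from the text; the thresholds (MAX_AGE_MS, MAX_SEQ_GAP, TOLERANCE) and enum names are illustrative assumptions.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

typedef enum { VOTE_AGREE, VOTE_STALE, VOTE_ALIGN_FAIL, VOTE_VALUE_CONFLICT } vote_reason_t;

typedef struct {
    int32_t  value;
    bool     valid_flag;
    uint32_t age_ms;   /* now - timestamp, precomputed by the caller */
    uint32_t seq_no;
} chan_sample_t;

/* Illustrative policy knobs; real values come from the safety case. */
#define MAX_AGE_MS  20u
#define MAX_SEQ_GAP 1u
#define TOLERANCE   5

/* Gates run in a fixed order, so every outcome has exactly one reason code. */
static vote_reason_t vote_gates(const chan_sample_t *a, const chan_sample_t *b)
{
    /* 1. Freshness gate */
    if (!a->valid_flag || !b->valid_flag ||
        a->age_ms > MAX_AGE_MS || b->age_ms > MAX_AGE_MS)
        return VOTE_STALE;
    /* 2. Alignment gate (seq_gap shown; a window_ms check would sit here too) */
    uint32_t gap = (a->seq_no > b->seq_no) ? a->seq_no - b->seq_no
                                           : b->seq_no - a->seq_no;
    if (gap > MAX_SEQ_GAP)
        return VOTE_ALIGN_FAIL;
    /* 3. Agreement gate */
    if (abs((int)(a->value - b->value)) > TOLERANCE)
        return VOTE_VALUE_CONFLICT;
    return VOTE_AGREE;
}
```

The conflict resolver then maps every non-AGREE code to a conservative action (deny/degraded/trip) and logs the justification; because the gates are ordered and single-valued, test vectors can be generated mechanically.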
Fail-silent vs fail-operational (expressed as actions)
- Fail-silent (safety-first): any critical conflict → deny/trip/latch; outputs de-energize quickly and remain OFF until evidence is restored.
- Fail-operational (availability-first): continue only in constrained mode with stricter evidence gates; recovery attempts are rate-limited and logged.
Evidence fields that make decisions reviewable
Core: window_ms, seq_gap, value_delta, confidence, vote_reason_code
Recommended: vote_action (permit/deny/degraded/trip), input_age_ms, alignment_skew_ms
Acceptance criteria
- Decision flow is specified as gates + resolver (can generate test vectors directly).
- Every non-permit outcome has a reason code; “unknown cause” is not acceptable.
- Alignment failures (timestamp/sequence) are explicitly handled to avoid intermittent field misvotes.
H2-5. Cross-Monitoring & Data Consistency Contracts
Goal: prove each channel is processing the same event
Cross-monitoring is not only about matching results. It must prove that each channel computed on the same inputs, under the same configuration and calibration, within the same time/sequence context. Without consistency contracts, two channels can appear to “agree” while operating on different samples, different parameter sets, or stale data—creating an unsafe false confidence.
Contract stack (inputs → configuration → output proof)
- Input contract: mirrored inputs must be aligned (sequence/time) and represent the same sample set. Alignment failures should be categorized (skew vs missing frames) to avoid unnecessary trips during brief reordering.
- Configuration contract: each channel must share the same policy parameters (windows/tolerances/modes) and the same calibration identity. Configuration drift is a safety boundary violation and must be logged and blocked from RUN.
- Output proof contract: vital output commands must be confirmed by independent feedback and line monitoring. Command/feedback disagreements must be treated as severe because they imply welded relays, broken wiring, or monitoring faults.
Input mirroring & alignment (what “same data” means)
- Sampling alignment: enforce the same window_ms bucket and bounded skew (timestamp or tick-based).
- Sequence continuity: detect reordering or missing frames with seq_gap-style checks.
- Quantization & calibration consistency: differences are allowed only within defined tolerances, and calibration identity must match.
Output readback & line monitoring (proof of actuation)
- Vital relay feedback: compare command vs feedback to detect welded contacts or missing energization.
- Line monitor: detect open-load/short-to-supply/short-to-ground states and log the inferred wiring fault class.
- Severity rule: cmd=OFF but feedback=ON is typically latch-worthy; cmd=ON but feedback=OFF is deny + diagnose + conservative transition.
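The severity rule reduces to a three-way classification of command vs feedback. A minimal C sketch (the enum names are assumptions):

```c
#include <stdbool.h>

typedef enum { OUT_OK, OUT_WELD_SUSPECT, OUT_NO_ENERGIZE } out_proof_t;

/* Severity rule from the text: cmd=OFF but feedback=ON is latch-worthy
 * (welded contacts or feedback short); cmd=ON but feedback=OFF is
 * deny + diagnose (broken line, driver fault, or monitoring fault). */
static out_proof_t classify_output_proof(bool cmd_on, bool fb_on)
{
    if (cmd_on == fb_on)
        return OUT_OK;
    if (!cmd_on && fb_on)
        return OUT_WELD_SUSPECT;   /* latch */
    return OUT_NO_ENERGIZE;        /* deny + diagnose + conservative transition */
}
```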
Minimum evidence fields
Core: input_hash, config_hash, calibration_id, output_feedback_state
Recommended (for field clarity): alignment_skew_ms, contract_fail_code, line_monitor_state
Acceptance criteria
- Contracts explicitly state “must match” fields vs “allowed to differ” fields.
- Contract failures are categorized (alignment vs hash vs config vs output proof) with deterministic actions.
- Output proof chain includes both feedback and line monitoring, not command-only assumptions.
H2-6. Diagnostic Coverage & Self-Test Strategy
What auditors and field teams need
Diagnostic coverage is not a checklist. It is a measurable mechanism that connects failure modes to detection methods, detection intervals, and evidence logs. The objective is to prevent unsafe operation under latent faults and to produce traceable proof that self-tests ran, what they tested, and what action was taken.
POST (Power-On Self-Test): block RUN without integrity proof
- Memory integrity: RAM test, Flash/ROM CRC verification.
- CPU critical path: register/ALU sanity, exception vectors, control-flow baseline.
- Clock & watchdog: clock presence/stability checks, watchdog behavior sanity.
- I/O loopback: verify the safety I/O path is observable (loopback where feasible).
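The Flash/ROM CRC step is the most mechanical part of POST and is easy to show. The sketch below uses a bitwise CRC-32 (reflected polynomial 0xEDB88320); in a real build the reference CRC is computed at build time and stored alongside the image.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-32 (reflected, polynomial 0xEDB88320) over the program image. */
static uint32_t crc32_calc(const uint8_t *data, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++) {
        crc ^= data[i];
        for (int b = 0; b < 8; b++)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return crc ^ 0xFFFFFFFFu;
}

/* POST gate: the RUN transition is blocked unless the stored reference
 * CRC matches the recomputed image CRC. */
static bool post_flash_ok(const uint8_t *image, size_t len, uint32_t ref_crc)
{
    return crc32_calc(image, len) == ref_crc;
}
```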
Online diagnostics: detect drift, corruption, and intermittent faults
Online diagnostics should be scheduled at multiple rates to catch both fast transients and slow degradation. Failures must map to deterministic system states (degraded/trip/latched), consistent with the vital state model.
- High-rate: task timing watchdogs, heartbeat consistency, fast mismatch monitors.
- Mid-rate: periodic CRC, logic consistency checks, contract checks (input/config/output proof).
- Low-rate: trend analysis of error rates and health scores to surface latent faults.
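Multi-rate scheduling can be as simple as a due-time table driven from a 1 ms tick. A hedged C sketch; the placeholder diagnostics and the task-table layout are assumptions for illustration.

```c
#include <stdint.h>

typedef void (*diag_fn)(void);

typedef struct {
    uint32_t period_ms;    /* maps to the diag_interval_ms evidence field */
    uint32_t next_due_ms;
    diag_fn  run;
} diag_task_t;

/* Called once per 1 ms tick: each diagnostic class runs at its own rate. */
static void diag_scheduler(diag_task_t *tasks, int n, uint32_t now_ms)
{
    for (int i = 0; i < n; i++) {
        if ((int32_t)(now_ms - tasks[i].next_due_ms) >= 0) { /* wrap-safe compare */
            tasks[i].run();
            tasks[i].next_due_ms = now_ms + tasks[i].period_ms;
        }
    }
}

/* Placeholder diagnostics with counters, for illustration only. */
static int high_rate_runs, low_rate_runs;
static void high_rate_diag(void) { high_rate_runs++; } /* e.g. timing watchdog */
static void low_rate_diag(void)  { low_rate_runs++; }  /* e.g. trend analysis  */
```

Each completed run would also emit diag_id, diag_pass_fail, and diag_duration_ms so coverage stays measurable rather than assumed.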
Proof test: expose faults that normal operation can hide
- Trigger: maintenance window, operating-hours threshold, or post-event policy.
- Mechanism: controlled channel isolation/swap, injected stimuli, verification of voter and feedback paths.
- Evidence: timestamped proof-test record that ties outcomes to diagnostic IDs and latent fault counters.
Minimum evidence fields (self-test as auditable data)
Core: diag_id, diag_pass_fail, fault_latent_counter, proof_test_timestamp
Recommended (for measurable coverage): diag_interval_ms, diag_duration_ms, reset_reason
Acceptance criteria
- Every diagnostic class states: what it detects, how often it runs, and what evidence it logs.
- Latent fault handling is explicit (fault_latent_counter behavior and clearing rules).
- Proof tests have defined triggers and produce timestamped evidence.
H2-7. Vital I/O: Isolated Inputs/Outputs, Relay Drivers, Feedback
Vital I/O is an energy chain with proof
A safety output is not a GPIO state. It is a controlled energy path that must be proven end-to-end: decision logic drives an output stage, the electromechanical element changes state, the load is affected, and independent feedback confirms the action. Any “command-only” assumption is insufficient for a vital function.
Vital inputs: dual DI + wiring fault classification
- Dual-channel DI: treat A/B as evidence sources; enforce agreement windows and deterministic conflict actions.
- Open/short detection: classify wiring states (open-load, short-to-batt, short-to-gnd) instead of “input flaky.”
- Debounce & sampling windows: debounce is a measurable policy (stable samples + time window), not a magic delay.
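“Debounce as a measurable policy” can be captured as N consecutive identical samples, where N times the sample period is the logged window. A minimal C sketch; stable_samples is the policy parameter that would be covered by the configuration contract.

```c
#include <stdbool.h>
#include <stdint.h>

/* Debounce policy: the reported state changes only after stable_samples
 * consecutive identical raw samples (stable_samples * sample period = window). */
typedef struct {
    bool    debounced;      /* reported, debounced state */
    bool    last_raw;
    uint8_t count;          /* consecutive identical samples seen */
    uint8_t stable_samples; /* policy parameter, logged with the config hash */
} debounce_t;

static bool debounce_step(debounce_t *d, bool raw)
{
    if (raw == d->last_raw) {
        if (d->count < d->stable_samples)
            d->count++;
        if (d->count >= d->stable_samples)
            d->debounced = raw;
    } else {
        d->count = 1;        /* restart the stability window */
        d->last_raw = raw;
    }
    return d->debounced;
}
```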
Vital outputs: driver, relay/contactor, and readback
- Driver stage: high-side / relay / contactor drivers should provide observable diagnostics (fault flags, current/voltage cues where applicable).
- Feedback readback: compare command state vs feedback state within a defined window; mismatches must be logged and classified.
- Weld detect: cmd=OFF but feedback=ON is typically treated as severe because it suggests welded contacts or feedback short.
- Line monitoring: open-load and short faults on the output path should be detectable and mapped to action policies.
Isolation: separate domains, suppress common-mode coupling
- Digital isolators: keep vital logic domain independent from noisy field wiring while preserving timing integrity.
- Isolated power: avoid shared disturbances that collapse both domains; record reset/brownout evidence for correlation.
- Common-mode suppression: route return paths intentionally to prevent EMI-induced false transitions and feedback errors.
Minimum evidence fields
Core: io_open_load, short_to_batt, short_to_gnd, relay_weld_detect, feedback_mismatch
Recommended (for field clarity): feedback_age_ms, weld_suspect_counter, line_monitor_state
Acceptance criteria
- Output behavior is defined as an energy-chain with proof, not a command-only control.
- Wiring faults and weld conditions are classified and logged (not treated as generic “noise”).
- Isolation boundary is explicit and tied to diagnostic evidence and recovery actions.
H2-8. Event Recording: Evidentiary Logs, Trusted Time, Tamper Signals
From “logs exist” to “logs are evidentiary”
Evidentiary logging is designed for reconstruction and audit. It must preserve event order, capture critical context (reset reasons, firmware/config identities), and prevent silent loss during power interruptions. A reliable pipeline turns field anomalies into records that can be verified and replayed.
Event taxonomy (what must be recorded)
- Critical: trip/latched, output proof failures, weld suspects, persistent contract violations.
- Near-miss: mismatch spikes, transient contract failures, CRC bursts, recoveries that barely passed.
- Change control: config changes, firmware updates, calibration identity changes.
- Platform: reset_reason, brownout, watchdog, clock faults.
Trusted time: monotonic first, external time as an aid
- monotonic_counter: guarantees order and supports reconstruction even if wall time jumps.
- External time (if present): used for correlation, not as the sole evidence source.
- Time anomalies: time jumps or time-source loss should generate explicit records.
Power-loss integrity: commit without silent loss
- Holdup window: detect power fail and preserve enough energy to finalize the critical commit.
- Journal / double-write: avoid half-written records and ensure replayable recovery after reset.
- CRC chain: each record carries CRC and links to the previous record to detect missing segments.
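The CRC-chain idea can be sketched as records that each carry their own CRC plus the previous record's CRC. The record layout and the CRC-32 inner loop are illustrative assumptions; the property being demonstrated is that a missing or tampered record breaks replay verification.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint32_t event_id;
    uint32_t monotonic_counter;
    uint32_t prev_record_crc;  /* link to the previous record */
    uint32_t record_crc;       /* CRC over all fields above */
} log_record_t;

/* CRC-32 over everything before the record_crc field. */
static uint32_t rec_crc(const log_record_t *r)
{
    const uint8_t *p = (const uint8_t *)r;
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < offsetof(log_record_t, record_crc); i++) {
        crc ^= p[i];
        for (int b = 0; b < 8; b++)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return crc ^ 0xFFFFFFFFu;
}

static void append_record(log_record_t *r, const log_record_t *prev)
{
    r->prev_record_crc = prev ? prev->record_crc : 0u;
    r->record_crc = rec_crc(r);
}

/* Replay check: a broken chain reveals missing or corrupted segments. */
static bool chain_ok(const log_record_t *recs, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (rec_crc(&recs[i]) != recs[i].record_crc)
            return false;
        if (i > 0 && recs[i].prev_record_crc != recs[i - 1].record_crc)
            return false;
    }
    return true;
}
```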
Tamper signals: record the evidence of abnormal access
- Enclosure/debug signals: cover open, debug enable, unexpected mode changes.
- Identity changes: firmware_hash or config_version changes must be logged as change-control events.
Minimum evidence fields
Core: event_id, monotonic_counter, reset_reason, firmware_hash, config_version
Recommended (for integrity): commit_result, record_crc, prev_record_crc, power_fail_flag
Acceptance criteria
- Every critical event carries identity (firmware/config) and platform context (reset/brownout).
- Ordering is reconstructable even without wall time (monotonic_counter).
- Power-loss behavior is fail-safe: no silent loss; commit failures are recorded and replayed.
H2-9. EMC & Common-Cause Failure Hardening (Rail Reality)
Goal: prevent one disturbance from collapsing all channels
In rail environments, the key EMC risk is not a single-channel upset. The real hazard is common-cause failure: the same disturbance (power dip, ground bounce, EMI injection, or shared logic defect) impacts multiple channels at once, removing the redundancy that 1oo2/2oo3 architectures rely on. Hardening must therefore reduce correlation and create measurable evidence when correlation still happens.
Common-cause catalog (what tends to hit channels together)
- Power CCF: shared front-end dips, undervoltage thresholds aligned across channels, simultaneous brownouts.
- Reference CCF: ground bounce and reference shifts that move logic thresholds together.
- EMI CCF: common-mode injection through cables, connectors, and shield return paths, causing bursts of errors.
- Logic/Config CCF: shared firmware defects or configuration mistakes producing consistent-but-wrong decisions.
Hardening strategy (partition, suppress, de-correlate)
- Partition power domains: avoid a single supply disturbance resetting every channel; use distinct filtering paths and independent supervision where practical.
- Independent reset & supervision: separate reset decision logic per channel to avoid “one reset line resets all.”
- Suppress injection at the boundary: input filtering and common-mode suppression close to the cable entry, before the disturbance becomes “logic-level.”
- Control shield/ground return paths: define where shield currents return so common-mode energy does not couple into every reference.
- De-correlate where safe: avoid identical coupling paths (routing symmetry, identical harness grouping) and use bounded timing diversity without breaking data consistency contracts.
Turning “rail constraints” into circuit requirements
Environmental and EMC constraints should be translated into explicit circuit-level requirements: undervoltage behavior (no unsafe toggling), hold-up thresholds, reset qualification, input filter time constants relative to sampling windows, and a defined common-mode return path strategy. The design is considered hardened only if correlation between channels is measurably reduced and diagnosable when it occurs.
Evidence fields (correlation is measurable)
Core: brownout_events, watchdog_rate, cm_noise_level, reset_correlation
Interpretation pattern: high reset_correlation + rising brownout_events ⇒ power CCF suspect; high cm_noise_level + error bursts ⇒ EMI CCF suspect.
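One way to make reset_correlation a number rather than a narrative: the fraction of channel-A resets that have a channel-B reset within a bounded time window. This metric definition is an assumption, offered only as one example of "correlation is measurable".

```c
#include <stddef.h>
#include <stdint.h>

/* reset_correlation: fraction of channel-A resets with a channel-B reset
 * within +/- window_ms. Values near 1.0 suggest common cause (shared supply,
 * shared reset line); values near 0 suggest independent faults. */
static double reset_correlation(const uint32_t *a_ms, size_t na,
                                const uint32_t *b_ms, size_t nb,
                                uint32_t window_ms)
{
    if (na == 0)
        return 0.0;
    size_t correlated = 0;
    for (size_t i = 0; i < na; i++) {
        for (size_t j = 0; j < nb; j++) {
            uint32_t d = (a_ms[i] > b_ms[j]) ? a_ms[i] - b_ms[j]
                                             : b_ms[j] - a_ms[i];
            if (d <= window_ms) { correlated++; break; }
        }
    }
    return (double)correlated / (double)na;
}
```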
Acceptance criteria
- Hardening actions are mapped to specific common-cause types (power/reference/EMI/logic).
- Correlation is quantified and logged; simultaneous resets are not treated as “random.”
- Mitigation placement is explicit (where filtering, shielding return, and CM suppression occur).
H2-10. Fault Handling & Reset/Recovery Policy (No “Oscillating” Failures)
Goal: deterministic actions, never restart oscillations
Many rail field incidents are not a single trip, but a repeated cycle of reset → partial recovery → failure again. This oscillation creates unstable outputs, destroys diagnostic clarity, and can mask common-cause issues. A robust policy classifies faults, caps recovery attempts, introduces backoff, and forces stable end states (latched trip or controlled degraded mode) when recovery is not evidenced.
Fault classes (latch vs degrade vs attempt)
- Class A — Trip-latch: output proof failure, weld suspect, repeated diagnostic failures, persistent contract violations.
- Class B — Degraded: single-channel health decline, isolated I/O path issues that still allow safe deny behavior.
- Class C — Attempt recovery: transient error bursts that clear and do not violate evidence gates.
Reset policy: cold vs warm, caps, and backoff
- Warm restart: for transient software timing faults; must preserve evidence logs and avoid rapid state flips.
- Cold restart: full re-initialization when state integrity is suspect; re-validate configuration and contracts before RUN.
- Attempt cap + backoff: recovery attempts are limited; retry intervals increase to prevent oscillation.
- Boot-loop detection: detect “fails shortly after boot” and force a stable terminal state.
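Attempt cap, exponential backoff, and boot-loop detection compose into one small policy function. A C sketch; the cap, base delay, and boot-loop uptime threshold are illustrative assumptions.

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint32_t recovery_attempts;
    uint32_t attempt_cap;        /* attempts allowed before a terminal state */
    uint32_t base_backoff_ms;
    uint32_t uptime_at_fault_ms; /* input to "fails shortly after boot" check */
    bool     boot_loop_detect;
    bool     latched;            /* stable terminal state */
} recovery_policy_t;

#define BOOT_LOOP_UPTIME_MS 5000u   /* illustrative threshold */

/* Returns the delay before the next recovery attempt, or latches at the cap.
 * Backoff doubles per attempt: base, 2*base, 4*base, ... */
static uint32_t recovery_backoff(recovery_policy_t *p)
{
    if (p->uptime_at_fault_ms < BOOT_LOOP_UPTIME_MS)
        p->boot_loop_detect = true;
    if (p->boot_loop_detect || p->recovery_attempts >= p->attempt_cap) {
        p->latched = true;       /* force a stable end state */
        return 0u;
    }
    uint32_t delay = p->base_backoff_ms << p->recovery_attempts;
    p->recovery_attempts++;
    return delay;
}
```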
Channel-isolated recovery: keep redundancy meaningful
- Isolate the suspect channel: based on health indicators, mismatch counters, and diagnostic results.
- Controlled degraded mode: proceed only if safety can be maintained conservatively; otherwise deny and latch.
- Evidence gate to exit degraded: return to normal only when diagnostics and correlation indicators are stable.
Evidence fields
Core: reset_counter, boot_loop_detect, degraded_mode_reason, recovery_attempts
Evidence gate examples: increasing reset_counter with positive boot_loop_detect ⇒ force latch; repeated recovery_attempts ⇒ backoff then terminal state.
Acceptance criteria
- Faults are classified with deterministic actions (latch/degrade/attempt) and consistent evidence logs.
- Recovery attempts are bounded and backoff prevents fast oscillation.
- Exit from degraded mode requires evidence gates, not “time passed.”
H2-11. Validation & Audit Evidence Pack (Bench → Track)
Purpose: “verifiable” means deliverable evidence
Validation for vital logic is judged by closed-loop evidence: faults can be injected repeatably, detection and safe-state latencies are measured with unambiguous time markers, and logs can be replayed without silent gaps. The output of this chapter is a checklist-style evidence pack that serves audit, bench re-test, and track investigations consistently.
Evidence fields (must be measurable, not narrative)
Core: fault_injection_case_id, detect_latency_ms, safe_state_latency_ms, log_integrity_ok
Measurement rule: use an external marker (pulse/trigger) as time-zero; measure detection from “marker asserted” to “diagnostic event recorded”; measure safe state from “marker asserted” to “output proof indicates safe.”
Validation map (Bench → HIL → Track)
- Bench (board/subsystem): isolate single-channel diagnostics, vital I/O proof chain, log commit integrity (power-fail scenarios).
- HIL/SIL (system behavior): redundancy/voter behavior under mismatch, window/sequence alignment, recovery policy anti-oscillation gates.
- Track/field (rail reality): correlation under EMC/power events, simultaneous reset correlation, safe-state determinism under disturbances.
Reference parts (MPN examples) to build repeatable bench/HIL setups
- Pulse/Glitch Injection: Keysight 81160A (pulse function arbitrary generator), Tektronix AFG31000 series (AFG31021 / AFG31022 as common variants)
- Scope/Acquisition: Tektronix MDO3000 series (e.g., MDO3054), Keysight InfiniiVision 3000T series (e.g., DSOX3054T)
- Logic/Bus decode: Saleae Logic Pro 16
Notes: instrument models above are examples for repeatable markers, waveform capture, and evidence screenshots. Equivalent tools are acceptable if marker timing and capture fidelity are maintained.
Fault injection matrix (minimum set)
Each case is defined as: injection point (where), profile (shape/level/duration), expected action (latch/degrade/deny/recover), observables (what to measure), and artifacts (what to export into the evidence pack). Use stable case IDs (e.g., FI-PWR-03).
FI-CH (Channel mismatch injection)
- Injection: force one channel’s computed result to drift or drop validity while inputs are nominal.
- Pass/Fail: detect_latency_ms within bound; voter reason codes consistent; no unsafe output toggling.
- Artifacts: voter decision logs + mismatch counter snapshots + timing marker screenshot.
MPN examples: TI TPS3431 (window watchdog as a controlled “liveness” stress), TI TPS3850 (supervisor with independent reset behavior examples).
FI-CLK (Clock drift / jitter excursion)
- Injection: introduce bounded drift/jitter on one channel’s clock source or clock clean-up path.
- Pass/Fail: window/sequence alignment flags trigger; system enters safe or degraded deterministically; no oscillation.
- Artifacts: drift profile record + event logs + measured alignment error vs time.
MPN examples: TI CDCE913 (clock buffer/PLL family example), Renesas 5P49V60 (programmable clock generator example).
FI-GLI (Input glitch / burst)
- Injection: apply controlled glitches/bursts at the cable-entry side (before isolation/filter where possible).
- Pass/Fail: false trigger rate bounded; debounce windows behave as specified; valid faults still detected in time.
- Artifacts: injected waveform screenshots + input state traces + false-trigger statistics.
MPN examples: ADI ADuM141E (digital isolator example), TI ISO7741 (digital isolator example), Littelfuse SMBJ58A (TVS diode example).
FI-PWR (Power dip / brownout pattern)
- Injection: step/ramp dips on the shared front-end and on per-channel rails (separately), with external marker.
- Pass/Fail: safe-state latency bounded; boot-loop prevented; correlation is measurable and explainable.
- Artifacts: rail waveform + reset timeline + correlation summary + log continuity check.
MPN examples: TI TPS3839 (supervisor), ADI LTC2962 (multi-rail supervisor family example), TI TPS25982 (eFuse / surge-stop style front-end example).
FI-ISO (Isolation transient / common-mode coupling)
- Injection: transient common-mode stress at the boundary while observing both logic and field domains.
- Pass/Fail: disturbances become evidence (events + counters) rather than silent corruption; safe action is deterministic.
- Artifacts: CM stress profile + log_integrity_ok proof + mismatch burst trace.
MPN examples: ADI ADuM5020 (isolated DC/DC example), TI SN6505 (isolated supply driver example), Murata NXE1 (isolated converter series example).
Criteria and metrics (measured + reported)
- Detection: report detect_latency_ms as max and 95th percentile per case ID (not only average).
- Safety response: report safe_state_latency_ms as max and 95th percentile; confirm no unsafe output chatter.
- False triggers: express as “false trips per injected burst count” per profile level.
- Log integrity: log_integrity_ok requires: no silent gaps, CRC/chain check passes, and power-fail commits leave evidence.
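Reporting max and 95th percentile instead of averages is straightforward with a nearest-rank percentile. A C sketch of that reduction step (the nearest-rank convention is an assumption; any consistent percentile definition works if it is stated in the test plan):

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

static int cmp_u32(const void *a, const void *b)
{
    uint32_t x = *(const uint32_t *)a, y = *(const uint32_t *)b;
    return (x > y) - (x < y);
}

/* Max and 95th percentile (nearest-rank). Averages hide exactly the
 * outliers that matter for detect/safe-state latency bounds. */
static void latency_stats(const uint32_t *samples, size_t n,
                          uint32_t *max_out, uint32_t *p95_out)
{
    uint32_t *s = malloc(n * sizeof *s);
    if (n == 0 || s == NULL) { *max_out = *p95_out = 0u; free(s); return; }
    memcpy(s, samples, n * sizeof *s);
    qsort(s, n, sizeof *s, cmp_u32);
    *max_out = s[n - 1];
    size_t rank = (95u * n + 99u) / 100u;   /* ceil(0.95 * n) */
    *p95_out = s[rank - 1];
    free(s);
}
```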
MPN examples for evidentiary logging integrity (optional building blocks)
- Non-volatile event storage: Infineon FM25V10 (FRAM example), Microchip 25LC1024 (SPI EEPROM example)
- Secure/time reference building blocks (mechanism only): NXP PCA2129 (RTC example), Microchip ATECC608B (crypto element example for signed manifests; keep usage bounded to “identity evidence”)
- Power-fail hold-up control: ADI LTC4040 (backup/holdup controller family example)
Notes: the evidence pack does not require these exact components. The requirement is that event order and integrity can be proven under power interruptions and that firmware/config identities are recorded.
Evidence pack (deliverables) and traceability
The evidence pack should be a fixed directory template so audits and field replays can locate artifacts quickly. Each injection case ID must map to waveforms, logs, and metric summaries. Any deviations require a recorded waiver with clear justification and bounded risk.
Minimum evidence pack tree:
/EvidencePack/
/TestPlan/ (matrix + fault_injection_case_id)
/RunRecords/ (env, firmware/config identities)
/Waveforms/ (screenshots + markers)
/Logs/ (samples + integrity proof)
/Results/ (latency stats + false-trigger rates)
/Traceability/ (case → artifacts mapping)
/Deviations/ (waivers + actions)
H2-12. FAQs (Field Troubleshooting Index)
Each answer follows a fixed pattern: 1-sentence conclusion + 2 evidence checks + 1 first fix, and maps back to the evidence chain in H2-3…H2-11.
Lockstep reports mismatch, but inputs look normal — transient fault or input de-sync?
Answer: This is most often input de-synchronization rather than a true core fault, unless mismatch bursts correlate with rail/EMI events. Check (1) input_hash and seq_gap alignment across channels at the mismatch timestamp, and (2) whether cm_noise_level rises with the same time window. First fix: enforce deterministic sampling alignment (bounded window + sequence gating) before voting.
Evidence: input_hash, seq_gap, cm_noise_level, mismatch_counter
First fix: widen and log the alignment window; gate voting on aligned sequence IDs
MPN examples: TI TPS3850 (supervisor), TI ISO7741 (digital isolator)
2oo3 occasionally “votes wrong”, but no fault is logged — missing granularity or lost commit?
Answer: If the decision is not explainable, the issue is usually missing “near-miss” events or a log commit gap. Check (1) whether voter_decision/vote_reason_code and “near-miss” events exist around the time of the anomaly, and (2) whether log_integrity_ok ever fails (power-fail, reset, or storage errors) near that window. First fix: add near-miss logging plus commit-result events.
Evidence: voter_decision, vote_reason_code, event_id, log_integrity_ok
First fix: log “near-miss” + commit status; fail closed if commit cannot be proven
MPN examples: Infineon FM25V10 (FRAM), ADI LTC4040 (holdup/backup)
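Near-miss logging plus commit proof can be sketched in a few lines: a 2-of-3 agreement with one dissenter emits an explicit near-miss event instead of passing silently, and a commit is only acknowledged after read-back. The event fields are illustrative; a real system would persist to power-fail-safe storage:

```python
from collections import Counter

def vote_2oo3(values, log):
    """Majority vote over three channels; every outcome carries a
    vote_reason_code, and a 2oo3 dissent is logged as a near-miss."""
    value, n = Counter(values).most_common(1)[0]
    if n == 3:
        log.append({"event": "decision", "value": value,
                    "vote_reason_code": "unanimous"})
        return value
    if n == 2:
        log.append({"event": "near_miss", "value": value,
                    "vote_reason_code": "2oo3_dissent"})
        return value
    log.append({"event": "decision", "value": None,
                "vote_reason_code": "no_majority_inhibit"})
    return None  # fail closed: no majority means inhibit

def commit(log, record, storage):
    """Fail closed unless the commit can be proven by read-back."""
    storage.append(record)
    ok = storage[-1] == record  # stand-in for NVM read-back proof
    log.append({"event": "commit", "log_integrity_ok": ok})
    return ok
```

The point is that "the voter was right but we can't show why" becomes impossible: dissent and commit results are first-class events.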
During EMI testing it trips, then keeps rebooting — brownout threshold or reset policy oscillation?
Answer: Reboot loops are usually policy oscillation triggered by marginal supply behavior, not a single “bad threshold.” Check (1) brownout_events together with reset_correlation to see if channels reset in the same time window, and (2) whether boot_loop_detect and recovery_attempts show rapid repeated recoveries without backoff. First fix: enforce attempt caps with exponential backoff, then revisit UVLO margins only if correlation persists.
Evidence: brownout_events, reset_correlation, boot_loop_detect, recovery_attempts
First fix: add backoff + cap; force stable end state (latch or controlled degraded)
MPN examples: TI TPS3839 (supervisor), TI TPS25982 (eFuse/surge-stop example)
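The backoff-and-cap policy fits in a few lines. An illustrative Python sketch; the base delay and cap values are assumptions, not normative figures:

```python
def recovery_policy(attempt, base_ms=100, cap=5):
    """Exponential backoff between recovery attempts with a hard cap;
    past the cap the system must latch into a stable terminal state
    instead of oscillating through resets."""
    if attempt >= cap:
        return ("latch_safe", None)  # stable end state, no more retries
    return ("retry", base_ms * (2 ** attempt))
```

Because the delay grows and the attempt count is bounded, a marginal supply can no longer drive a fast reboot loop; it drives the system into a single, loggable terminal state.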
A channel looks “healthy” but is still isolated — voter window too short or timestamp/sequence mismatch?
Answer: “Healthy” does not mean “contract-compatible”; isolation commonly happens due to window/sequence misalignment. Check (1) the voter’s window_ms and the observed seq_gap at the time of isolation, and (2) whether input_hash/config_hash match across channels while the channel is being excluded. First fix: parameterize and log window/sequence tolerances and require aligned sequence IDs before counting a mismatch.
Evidence: window_ms, seq_gap, input_hash, config_hash
First fix: tighten “what must match” and log vote_reason_code for exclusions
MPN examples: Renesas 5P49V60 (clock generator example), TI TPS3431 (window watchdog)
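One way to make every exclusion explainable is to return a vote_reason_code from the compatibility check itself. A sketch under assumed tolerances (window_ms=10, max_seq_gap=1; the record fields mirror the evidence names used above):

```python
def contract_compatible(ch, ref, window_ms=10, max_seq_gap=1):
    """A channel is counted into the vote only when it matches the
    contract; every exclusion carries a reason code so 'healthy but
    isolated' is always explainable from the log."""
    if ch["config_hash"] != ref["config_hash"]:
        return (False, "config_hash_mismatch")
    if ch["input_hash"] != ref["input_hash"]:
        return (False, "input_hash_mismatch")
    if abs(ch["seq"] - ref["seq"]) > max_seq_gap:
        return (False, "seq_gap_exceeded")
    if abs(ch["t_ms"] - ref["t_ms"]) > window_ms:
        return (False, "window_exceeded")
    return (True, "aligned")
```

Parameterizing window_ms and max_seq_gap (rather than hard-coding them) is what lets the audit compare the configured tolerance against the observed seq_gap at the moment of isolation.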
Output is de-energized, but feedback still shows relay engaged — weld or feedback short?
Answer: Treat this as an output-proof contradiction until proven otherwise. Check (1) relay_weld_detect and any line-monitor indicators (io_open_load, short flags) to separate mechanical weld from wiring short, and (2) run a controlled FI-I/O case and measure safe_state_latency_ms using an external marker plus feedback waveform. First fix: enable and require dual-path feedback plausibility checks before allowing recovery.
Evidence: relay_weld_detect, feedback_mismatch, io_open_load, safe_state_latency_ms
First fix: add plausibility + cross-check; refuse recovery on proof contradiction
MPN examples: ADI ADuM141E (isolator), Littelfuse SMBJ58A (TVS)
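The dual-path plausibility check can be sketched as a small classifier over the command and two independent feedback paths. The names `fb_contact` and `fb_line` are hypothetical feedback sources, not a specific wiring standard:

```python
def output_proof(commanded_energized, fb_contact, fb_line):
    """Cross-check two independent feedback paths against the command.
    A de-energize command with feedback still showing 'engaged' is a
    proof contradiction: suspect a welded relay and refuse recovery."""
    if fb_contact != fb_line:
        return "feedback_path_fault"       # wiring short / broken line
    if not commanded_energized and fb_contact:
        return "weld_suspect_no_recovery"  # contradiction: latch, no auto-recovery
    if commanded_energized and not fb_contact:
        return "actuation_failure"
    return "consistent"
```

Separating "the two feedback paths disagree" from "feedback contradicts the command" is what distinguishes a wiring fault from a mechanical weld before anyone opens the cabinet.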
Self-tests pass, but audit says diagnostic coverage is insufficient — what “provable” fields are missing?
Answer: Audits fail when diagnostics are not traceable, not when they are absent. Check (1) whether every diagnostic has an explicit diag_id with pass/fail, periodicity, and latent-fault counters, and (2) whether the evidence pack traceability links each fault_injection_case_id to logs, waveforms, and firmware/config identity proofs. First fix: add a traceability map and enforce “no ID, no claim” for diagnostic coverage.
Evidence: diag_id, diag_pass_fail, fault_latent_counter, fault_injection_case_id
First fix: publish traceability (case → artifacts); record diagnostic periodicity
MPN examples: Infineon FM25V10 (FRAM), NXP PCA2129 (RTC)
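A minimal “no ID, no claim” filter might look like this: a diagnostic counts toward coverage only if it has an ID, a recorded period, pass/fail results, and at least one fault-injection case mapped to artifacts. The record fields are illustrative:

```python
def coverage_claims(diagnostics, trace_map):
    """Enforce 'no ID, no claim': return only the diag_ids whose
    claims are fully traceable to artifacts in the evidence pack."""
    claimed = []
    for d in diagnostics:
        if not d.get("diag_id") or d.get("period_ms") is None:
            continue  # no identity or no recorded periodicity
        if "pass_fail" not in d:
            continue  # result never recorded
        cases = trace_map.get(d["diag_id"], [])
        if any(c.get("artifacts") for c in cases):
            claimed.append(d["diag_id"])
    return claimed
```

Running this over the declared diagnostics produces exactly the list an auditor will accept; everything it drops is a diagnostic that exists but cannot be proven.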
False trips cluster at the same time of day — common-cause (power/ground/EMI) or shared software defect?
Answer: Common-cause issues usually show timing correlation across channels and telemetry, while software defects correlate with version/config boundaries. Check (1) whether reset_correlation, cm_noise_level, or brownout_events spike in the same time window across channels, and (2) whether the cluster aligns with a config_version change or specific recovery path (degraded_mode_reason). First fix: generate a correlation report and treat high-correlation windows as CCF suspects by default.
Evidence: reset_correlation, cm_noise_level, brownout_events, config_version
First fix: create a correlation dashboard; prioritize de-correlation mitigations
MPN examples: TI TPS3850 (supervisor), ADI ADuM5020 (isolated supply)
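Timing correlation can be screened with simple window bucketing: any window in which multiple channels fire together is a common-cause suspect by default. A sketch; the window size and channel threshold are assumptions:

```python
def ccf_suspect_windows(events, window_ms=50, min_channels=2):
    """Bucket per-channel events into fixed time windows and flag the
    windows where several channels fired together as CCF suspects.
    events: iterable of (channel_id, timestamp_ms) pairs."""
    buckets = {}
    for ch, t_ms in events:
        buckets.setdefault(t_ms // window_ms, set()).add(ch)
    return [w * window_ms for w, chans in sorted(buckets.items())
            if len(chans) >= min_channels]
```

Fed with reset_correlation or brownout_events timestamps per channel, the output is a candidate list for the correlation report; software-defect clusters instead align with config_version boundaries, not with these windows.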
After configuration update, false alarms rise — contract break or calibration ID mismatch?
Answer: Post-update false alarms are typically contract violations (inconsistent config/calibration) rather than real plant changes. Check (1) config_hash across channels and confirm the update produced a single consistent contract snapshot, and (2) calibration_id alignment plus any “config change” events in the evidentiary log. First fix: block RUN entry unless config_hash and calibration_id match across channels and are recorded as a single atomic change.
Evidence: config_hash, calibration_id, event_id, config_version
First fix: enforce atomic config contract; log the accepted contract snapshot
MPN examples: Microchip 25LC1024 (SPI EEPROM), Microchip ATECC608B (identity/signing example)
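The RUN-entry gate can be sketched as one atomic check-and-log step: either all channels present the same config_hash and calibration_id and the accepted snapshot is logged, or RUN entry is blocked with the disagreement recorded. Event names are illustrative:

```python
def run_entry_allowed(channels, event_log):
    """Block RUN entry unless config_hash and calibration_id agree
    across all channels; log the outcome either way as one event."""
    hashes = {c["config_hash"] for c in channels}
    cals = {c["calibration_id"] for c in channels}
    if len(hashes) != 1 or len(cals) != 1:
        event_log.append({"event": "run_entry_blocked",
                          "config_hashes": sorted(hashes),
                          "calibration_ids": sorted(cals)})
        return False
    event_log.append({"event": "contract_accepted",
                      "config_hash": hashes.pop(),
                      "calibration_id": cals.pop()})
    return True
```

Because the accepted snapshot is itself an event, a later false-alarm cluster can be checked against exactly which contract was in force.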
Trip recovers, but immediately trips again — weak latched-fault definition or over-permissive recovery gate?
Answer: Immediate re-trip usually means the system is allowed to recover without evidence that the initiating condition cleared. Check (1) latched_fault_id and transition_reason to see whether the original fault should have been non-recoverable, and (2) whether recovery_attempts and exit gates require a stable window (no mismatch, no brownout, no proof contradictions). First fix: tighten recovery gates with a stability window and cap attempts, forcing a stable terminal state when gates fail.
Evidence: latched_fault_id, transition_reason, recovery_attempts, boot_loop_detect
First fix: require stability window + attempt cap; refuse repeated fast recoveries
MPN examples: TI TPS3431 (window watchdog), TI TPS3839 (supervisor)
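A sketch of the stability-window gate (the window length and attempt budget are assumptions, not normative values); `history` is a list of booleans, True meaning a fault indication was seen in that observation window:

```python
def recovery_gate(history, stable_windows=3, max_attempts=3, attempts=0):
    """Permit recovery only after a stability window: the last N
    observation windows must be clean (no mismatch, no brownout,
    no proof contradiction) and the attempt budget not exhausted."""
    if attempts >= max_attempts:
        return "latch_terminal"  # stop fast re-trip loops for good
    recent = history[-stable_windows:]
    if len(recent) < stable_windows or any(recent):
        return "hold"            # evidence not yet clean long enough
    return "recover"
```

The gate makes "recover" a claim that must be earned with evidence of a cleared condition, which is exactly what an immediate re-trip proves was missing.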
Logs are challenged as non-evidentiary — missing trusted time or missing integrity chain?
Answer: Evidentiary logs require both ordering (trusted time) and tamper-evident integrity. Check (1) whether monotonic_counter advances monotonically across resets and is bound to critical events (trip, config change), and (2) whether integrity checks (CRC/chain/commit read-back) are recorded as explicit events, not assumed. First fix: add commit read-back with integrity result fields and require monotonic counters to be stored in power-fail-safe NVM before acknowledging a record.
Evidence: monotonic_counter, reset_reason, log_integrity_ok, firmware_hash
First fix: commit read-back + integrity event; persist monotonic counter safely
MPN examples: NXP PCA2129 (RTC), Infineon FM25V10 (FRAM)
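A minimal hash-chain with monotonic counters and commit read-back, for illustration only: a vital implementation would bind the counter to power-fail-safe NVM (e.g., FRAM) and a trusted time source rather than an in-memory list:

```python
import hashlib
import json

def append_record(log, counter, payload):
    """Hash-chain each record to its predecessor, bind a monotonic
    counter, and acknowledge only after a read-back verification."""
    prev = log[-1]["chain"] if log else "genesis"
    rec = {"counter": counter, "payload": payload, "prev": prev}
    rec["chain"] = hashlib.sha256(
        json.dumps(rec, sort_keys=True).encode()).hexdigest()
    log.append(rec)
    ok = log[-1]["chain"] == rec["chain"]  # stand-in for NVM read-back
    return ok, counter + 1

def verify_chain(log):
    """Recompute the chain; any tamper, reorder, or counter rollback
    breaks verification."""
    prev, last = "genesis", -1
    for rec in log:
        body = {k: rec[k] for k in ("counter", "payload", "prev")}
        body_hash = hashlib.sha256(
            json.dumps(body, sort_keys=True).encode()).hexdigest()
        if rec["prev"] != prev or rec["chain"] != body_hash:
            return False
        if rec["counter"] <= last:
            return False  # monotonic counter must advance
        prev, last = rec["chain"], rec["counter"]
    return True
```

Ordering (counter) and integrity (chain) are checked together, which is the pair an evidentiary challenge attacks.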
Mismatch detection is too slow — comparison point wrong or diagnostic task period too long?
Answer: Long mismatch latency is usually caused by late comparison (after normalization) or a slow diagnostic cadence. Check (1) where the comparison is performed relative to the contract boundary (input_hash, sequence alignment, intermediate results) and whether earlier comparison would reveal divergence sooner, and (2) whether the diagnostic period is consistent with the measured detect_latency_ms distribution (max/95p) in the evidence pack. First fix: move comparison earlier and record diagnostic cadence as a first-class evidence field.
Evidence: detect_latency_ms, input_hash, seq_gap, diag_id
First fix: compare earlier + increase diagnostic cadence; log cadence with results
MPN examples: Saleae Logic Pro 16 (decode/correlation), Tektronix AFG31000 (marker/glitch injection)
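The latency budget in the answer can be stated as a simple bound: worst-case detect latency is the pipeline delay from divergence to the comparison point, plus one full diagnostic period (divergence can occur just after a comparison ran). An illustrative model, not a measured figure:

```python
def worst_case_detect_latency(divergence_to_compare_ms, diag_period_ms):
    """Moving the comparison earlier shrinks the first term;
    raising the diagnostic cadence shrinks the second."""
    return divergence_to_compare_ms + diag_period_ms
```

The measured detect_latency_ms distribution (max/95p) in the evidence pack should sit under this bound; if it does not, either the comparison point or the recorded cadence is wrong.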
Field reports: “cannot see what’s wrong” — which three waveforms/fields should be captured first?
Answer: Start with a minimal triad that separates common-cause from logic/contract faults. Capture (1) power/reset/clock waveforms with an external marker to align events (supports brownout_events and boot-loop analysis), (2) contract alignment evidence (seq_gap, input_hash, window_ms) at the decision boundary, and (3) evidentiary log integrity proof (monotonic_counter, commit read-back, log_integrity_ok). First fix: add a standardized “marker + triad capture” procedure to every FI case and field replay.
Evidence: brownout_events, seq_gap, input_hash, log_integrity_ok
First fix: enforce a triad capture SOP (marker + waveforms + log bundle)
MPN examples: Keysight 81160A (pulse marker), Keysight DSOX3054T (scope)
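The triad SOP can be enforced mechanically at intake: a capture bundle is accepted only when all three legs are present with their minimum fields, and a rejection names exactly what is missing. The leg names and minimum field sets below are assumptions for illustration:

```python
REQUIRED_TRIAD = {
    "power_reset_clock":  {"waveform", "marker_t_ms"},
    "contract_alignment": {"seq_gap", "input_hash", "window_ms"},
    "log_integrity":      {"monotonic_counter", "log_integrity_ok"},
}

def triad_complete(bundle):
    """Validate one field-capture bundle against the triad SOP;
    return (accepted, list of (leg, missing_fields))."""
    missing = []
    for leg, fields in REQUIRED_TRIAD.items():
        got = set(bundle.get(leg, {}))
        if not fields <= got:
            missing.append((leg, sorted(fields - got)))
    return missing == [], missing
```

Run at upload time, this turns “cannot see what’s wrong” into a concrete checklist before the hardware leaves the site.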