Interlocking & Vital Logic (Rail) — Redundant MCUs & Voter Design
Interlocking & Vital Logic is built to fail safe by design: redundant channels and voters grant "permit" only when inputs, timing, and configuration are provably consistent; otherwise, outputs de-energize and faults latch. The core value is evidentiary operation: every trip, mismatch, reset, and recovery is measured and logged with integrity so audits and field replay can reproduce the decision path.
H2-1. System Role & Safety Boundary
What this module actually “does”
Interlocking & vital logic exists to make a single hard promise: permission is granted only when the evidence chain is complete. When inputs are stale, inconsistent, or not provably trustworthy, the only valid outcome is a safe default—typically deny and de-energize. This reframes the design goal from “keep running” to “always remain provably safe.”
Boundary definition: three layers that prevent scope creep
- Functional boundary: adjudicates permission/lock/deny based on validated conditions; it does not perform positioning fusion, radio link management, or track-circuit demodulation.
- Signal boundary: distinguishes vital signals (must be provable) from non-vital signals (diagnostics/monitoring that cannot directly grant permission).
- Safety boundary: defines where isolation exists, what is dual-channel, and what constitutes a safe state when anything is uncertain.
Vital I/O taxonomy (the engineering reason “safe” is enforceable)
Treating every wire as equal is how safety systems fail in the field. A strict taxonomy prevents accidental “permission by convenience”:
- Vital Inputs (dual-channel): must pass freshness, consistency, and channel agreement checks before they can influence a permit decision.
- Non-vital Inputs: informative only; never sufficient to grant permission.
- Vital Outputs: must be designed so that loss of energy implies safety (de-energize-to-safe), not “unknown.”
- Proof / Feedback: output feedback is mandatory for detecting welded relays, broken lines, or mismatched actuation.
Evidence chain: what must be logged to make decisions auditable
Safety reviews and field forensics are won or lost on evidence. The boundary is not “real” unless the system records what it relied on. A minimal evidence chain for each decision includes:
Input trust proof (per channel):
input_valid, input_age_ms, input_seq_gap, input_crc_err_cnt, input_contract_hash
Decision proof (per decision):
decision_id, permit_state, decision_reason_code, channel_agreement
Output proof (command vs feedback):
vital_out_cmd, vital_out_feedback, feedback_mismatch_cnt, relay_weld_suspect
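To make the chain concrete, the per-channel input trust proof can be carried as one record plus a single gating predicate. This is a minimal C sketch: the field names follow the lists above, while the limit values and the expected contract hash are illustrative assumptions, not values from this document.

```c
#include <stdbool.h>
#include <stdint.h>

/* Per-channel input trust proof (names follow the evidence chain above). */
typedef struct {
    bool     input_valid;
    uint32_t input_age_ms;
    uint32_t input_seq_gap;
    uint32_t input_crc_err_cnt;
    uint32_t input_contract_hash;
} input_proof_t;

/* Illustrative policy limits -- real values come from the safety case. */
#define MAX_AGE_MS    50u
#define MAX_SEQ_GAP   1u
#define EXPECTED_HASH 0xA5A5A5A5u

/* An input may influence a permit decision only when every check passes. */
static bool input_trusted(const input_proof_t *p)
{
    return p->input_valid
        && p->input_age_ms <= MAX_AGE_MS
        && p->input_seq_gap <= MAX_SEQ_GAP
        && p->input_crc_err_cnt == 0u
        && p->input_contract_hash == EXPECTED_HASH;
}
```

The point of the predicate is that permission logic never reads raw inputs directly; it only sees inputs that have already passed the evidence gate.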
Acceptance criteria (what “done” means for this chapter)
- Safe state is explicitly defined and reachable under any uncertainty: deny + de-energize.
- Every boundary crossing (isolation, dual-channel inputs, vital output chain) is visible in the diagram and referenced in text.
- At least one complete decision evidence chain is described (inputs → checks → decision → outputs → feedback → logs).
H2-2. Vital State Model & “De-energize-to-trip” Philosophy
Why a state machine is mandatory (not optional)
“Fail-safe” becomes real only when behavior is deterministic under uncertainty. A vital state model turns safety from a slogan into an auditable contract: entry conditions define what must be true, allowed outputs constrain what the system may do, and evidence logs preserve why a transition occurred. Without a state model, field failures often degrade into oscillation (reboot loops, chatter, intermittent permission) that is unsafe and hard to diagnose.
Core states and what makes them provable
A rail-grade state model should make two outcomes always measurable: (1) how quickly the system reaches the safe state after a trigger, and (2) whether recovery is allowed, rate-limited, and evidence-backed. The following is a practical minimum set:
- INIT: outputs forced OFF; integrity + configuration + timing stability verified before any RUN transition.
- RUN: permission decisions allowed only when inputs are fresh and channels agree.
- DEGRADED: conservative behavior enforced (e.g., deny-only or restricted permits) when a non-fatal constraint is detected.
- TRIP: immediate de-energize; safe state reached within a measured latency budget.
- LATCHED: manual intervention or proof-test required; prevents “auto-recover oscillation.”
- RECOVERY: controlled checklist; back-off timers avoid repeated transitions in noisy conditions.
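The state set above can be expressed as a deterministic next-state function. This is a C sketch under the stated entry/exit rules; the condition flags and the exact transition choices (e.g., TRIP re-entering only via RECOVERY) are simplifying assumptions.

```c
#include <stdbool.h>

typedef enum { ST_INIT, ST_RUN, ST_DEGRADED, ST_TRIP, ST_LATCHED, ST_RECOVERY } vital_state_t;

typedef struct {
    bool integrity_ok;      /* POST + configuration + timing stability verified */
    bool inputs_fresh;      /* freshness contract holds on all vital inputs */
    bool channels_agree;    /* cross-channel agreement within tolerance */
    bool severe_fault;      /* weld suspect, output-proof failure, ... */
    bool recovery_evidence; /* full recovery checklist re-validated */
} vital_cond_t;

/* Deterministic next-state function: uncertainty always resolves toward
 * TRIP/LATCHED, never toward RUN. */
static vital_state_t vital_next(vital_state_t s, const vital_cond_t *c)
{
    if (c->severe_fault)
        return ST_LATCHED;                        /* latch severe fault classes */
    switch (s) {
    case ST_INIT:     return c->integrity_ok ? ST_RUN : ST_INIT;
    case ST_RUN:      return (c->inputs_fresh && c->channels_agree) ? ST_RUN : ST_TRIP;
    case ST_TRIP:     return c->recovery_evidence ? ST_RECOVERY : ST_TRIP;
    case ST_DEGRADED: return c->recovery_evidence ? ST_RECOVERY : ST_DEGRADED;
    case ST_RECOVERY: return c->recovery_evidence ? ST_RUN : ST_DEGRADED;
    case ST_LATCHED:  return ST_LATCHED;          /* manual intervention required */
    }
    return ST_TRIP;                               /* defensive default: safe state */
}
```

Because the function is pure (state + conditions in, state out), every transition can be exercised as a test vector and logged with its inputs as evidence.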
De-energize-to-trip: what it means in engineering terms
De-energize-to-trip is not simply “turn it off.” It is a design rule that makes safety enforceable: the safe state is reached by removing energy from the vital output chain, and the removal is confirmed by feedback. This approach is robust under supply anomalies and supports clean forensic evidence (command vs feedback).
Minimum transition evidence fields (transition = a safety event):
state_id, transition_reason, latched_fault_id, recovery_condition_met,
safe_state_latency_ms, output_drop_confirmed, recovery_inhibit_timer_ms
How to prevent unsafe oscillation (the hidden field killer)
A large fraction of safety incidents stem from “half recoveries” that repeat under EMI or borderline supply conditions. A vital state model must explicitly block oscillation by combining: (1) latching rules for severe fault classes, (2) back-off timers, and (3) a recovery checklist that re-validates evidence inputs and output feedback before RUN.
- Latching: mismatch, output feedback abnormal, critical diagnostics failure → LATCHED.
- Back-off: rate-limit recovery attempts; record correlation with brownout/EMI bursts.
- Checklist: config hash match, timing stable, inputs fresh, channels agree, output feedback sane.
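A hedged C sketch of the recovery gate, combining the back-off timer with the checklist above. The checklist fields mirror the bullet list; the timer arithmetic assumes a monotonic millisecond counter.

```c
#include <stdbool.h>
#include <stdint.h>

/* Recovery checklist from the text: every item must re-validate before RUN. */
typedef struct {
    bool config_hash_match;
    bool timing_stable;
    bool inputs_fresh;
    bool channels_agree;
    bool output_feedback_sane;
} recovery_checklist_t;

static bool recovery_allowed(const recovery_checklist_t *c,
                             uint32_t now_ms, uint32_t last_attempt_ms,
                             uint32_t backoff_ms)
{
    /* Back-off timer: block rapid re-attempts under noisy conditions. */
    if ((now_ms - last_attempt_ms) < backoff_ms)
        return false;
    return c->config_hash_match && c->timing_stable && c->inputs_fresh
        && c->channels_agree && c->output_feedback_sane;
}
```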
Acceptance criteria (what “done” means for this chapter)
- Every state includes: entry conditions, allowed outputs, evidence fields, and exit conditions.
- TRIP behavior is measurable: safe_state_latency_ms and output_drop_confirmed are defined.
- Recovery cannot chatter: rate-limit + checklist + latching rules are explicit.
H2-3. Redundancy Architectures: Lockstep vs Dual-Channel vs TMR
What this chapter is for
Redundancy is not “more MCUs.” It is a deliberate trade between fault models, safety behavior, availability targets, and the evidence that can be produced during audit and field forensics. A correct architecture choice starts by matching the dominant fault types (transient, random hardware, common-cause, systematic) to the redundancy pattern that can detect and contain them with measurable latency.
Lockstep (cycle-by-cycle compare)
- Best at: transient upsets and random hardware faults with fast detection (tight compare window).
- Weak at: common-cause failures (shared supply, reset, EMI injection) and systematic errors (same code/config → same wrong result).
- Evidence that must exist: mismatch counts and classification, correlated with reset/brownout/EMI indicators.
Dual-channel (1oo2 / 2oo2 variants)
- Best at: increasing independence (separate supplies/clocks/resets) and producing stronger decision evidence (value + timing + freshness contracts).
- Weak at: systematic errors if both channels share identical software/config and the same wrong assumption.
- Key choice: 2oo2 (permit only on agreement) favors safety; 1oo2 can favor availability but demands stricter “confidence + health” rules.
2oo3 / TMR (majority vote)
- Best at: availability under a single-channel random failure; the system can continue while isolating a faulty channel.
- Weak at: complexity and diagnostic coverage requirements; a weak health model can let majority voting hide a degrading channel.
- Non-negotiable: per-channel health scoring + “kick-out” evidence must be logged.
Minimum evidence fields (audit + field debugging)
Core: mismatch_counter, voter_decision, channel_health, crc_error_rate
Recommended (for depth): mismatch_class, reset_reason, brownout_events, decision_reason_code
Acceptance criteria
- The dominant fault types are explicitly mapped to each architecture’s strengths and limits.
- Safety vs availability is stated as a design intent (fail-silent vs fail-operational positioning).
- Evidence fields are not listed as nouns only—each is tied to a reason it is needed.
H2-4. Voter Design Deep Dive: 1oo2 / 2oo3 Decision Logic
Voter purpose: compare evidence, not only values
A robust voter does not simply compare numerical results. It compares evidence consistency: value agreement must be supported by freshness, alignment (sequence/time), and trust indicators (confidence/health/CRC). When evidence is incomplete, the correct decision is conservative (deny, trip, or degraded), with a reason code that makes the outcome auditable.
Decision inputs (the minimum contract)
- Value evidence: value, valid_flag
- Timing evidence: timestamp, seq_no
- Trust evidence: confidence, channel_health, crc_error_rate
Four-step decision flow (auditable and testable)
- Freshness gate: reject stale inputs (age exceeds policy) → reason code indicates STALE, action becomes deny/degraded.
- Alignment gate: enforce window_ms and seq_gap limits → reason code indicates ALIGN_FAIL (prevents “same value, different moment” mistakes).
- Agreement gate: compare value_delta to tolerance → AGREE or VALUE_CONFLICT.
- Conflict resolver: select conservative action or bias toward a higher-trust channel (confidence + health) and log the justification.
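The gate sequence can be sketched as a small C function that returns a reason code. Only the gate order comes from the text; the thresholds (MAX_AGE_MS, MAX_SEQ_GAP, TOLERANCE) and enum names are illustrative assumptions.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

typedef enum { VOTE_AGREE, VOTE_STALE, VOTE_ALIGN_FAIL, VOTE_VALUE_CONFLICT } vote_reason_t;

typedef struct {
    int32_t  value;
    bool     valid_flag;
    uint32_t age_ms;   /* now - timestamp, precomputed by the caller */
    uint32_t seq_no;
} chan_sample_t;

/* Illustrative policy knobs; real values come from the safety case. */
#define MAX_AGE_MS  20u
#define MAX_SEQ_GAP 1u
#define TOLERANCE   5

/* Gates run in a fixed order, so every outcome has exactly one reason code. */
static vote_reason_t vote_gates(const chan_sample_t *a, const chan_sample_t *b)
{
    /* 1. Freshness gate */
    if (!a->valid_flag || !b->valid_flag ||
        a->age_ms > MAX_AGE_MS || b->age_ms > MAX_AGE_MS)
        return VOTE_STALE;
    /* 2. Alignment gate (seq_gap shown; a window_ms check would sit here too) */
    uint32_t gap = (a->seq_no > b->seq_no) ? a->seq_no - b->seq_no
                                           : b->seq_no - a->seq_no;
    if (gap > MAX_SEQ_GAP)
        return VOTE_ALIGN_FAIL;
    /* 3. Agreement gate */
    if (abs((int)(a->value - b->value)) > TOLERANCE)
        return VOTE_VALUE_CONFLICT;
    return VOTE_AGREE;
}
```

The conflict resolver then maps every non-AGREE code to a conservative action (deny/degraded/trip) and logs the justification; because the gates are ordered and single-valued, test vectors can be generated mechanically.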
Fail-silent vs fail-operational (expressed as actions)
- Fail-silent (safety-first): any critical conflict → deny/trip/latch; outputs de-energize quickly and remain OFF until evidence is restored.
- Fail-operational (availability-first): continue only in constrained mode with stricter evidence gates; recovery attempts are rate-limited and logged.
Evidence fields that make decisions reviewable
Core: window_ms, seq_gap, value_delta, confidence, vote_reason_code
Recommended: vote_action (permit/deny/degraded/trip), input_age_ms, alignment_skew_ms
Acceptance criteria
- Decision flow is specified as gates + resolver (can generate test vectors directly).
- Every non-permit outcome has a reason code; “unknown cause” is not acceptable.
- Alignment failures (timestamp/sequence) are explicitly handled to avoid intermittent field misvotes.
H2-5. Cross-Monitoring & Data Consistency Contracts
Goal: prove each channel is processing the same event
Cross-monitoring is not only about matching results. It must prove that each channel computed on the same inputs, under the same configuration and calibration, within the same time/sequence context. Without consistency contracts, two channels can appear to “agree” while operating on different samples, different parameter sets, or stale data—creating an unsafe false confidence.
Contract stack (inputs → configuration → output proof)
- Input contract: mirrored inputs must be aligned (sequence/time) and represent the same sample set. Alignment failures should be categorized (skew vs missing frames) to avoid unnecessary trips during brief reordering.
- Configuration contract: each channel must share the same policy parameters (windows/tolerances/modes) and the same calibration identity. Configuration drift is a safety boundary violation and must be logged and blocked from RUN.
- Output proof contract: vital output commands must be confirmed by independent feedback and line monitoring. Command/feedback disagreements must be treated as severe because they imply welded relays, broken wiring, or monitoring faults.
Input mirroring & alignment (what “same data” means)
- Sampling alignment: enforce the same window_ms bucket and bounded skew (timestamp or tick-based).
- Sequence continuity: detect reordering or missing frames with seq_gap-style checks.
- Quantization & calibration consistency: differences are allowed only within defined tolerances, and calibration identity must match.
Output readback & line monitoring (proof of actuation)
- Vital relay feedback: compare command vs feedback to detect welded contacts or missing energization.
- Line monitor: detect open-load/short-to-supply/short-to-ground states and log the inferred wiring fault class.
- Severity rule: cmd=OFF but feedback=ON is typically latch-worthy; cmd=ON but feedback=OFF is deny + diagnose + conservative transition.
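The severity rule reduces to a three-way classification of command vs feedback. A minimal C sketch (the enum names are assumptions):

```c
#include <stdbool.h>

typedef enum { OUT_OK, OUT_WELD_SUSPECT, OUT_NO_ENERGIZE } out_proof_t;

/* Severity rule from the text: cmd=OFF but feedback=ON is latch-worthy
 * (welded contacts or feedback short); cmd=ON but feedback=OFF is
 * deny + diagnose (broken line, driver fault, or monitoring fault). */
static out_proof_t classify_output_proof(bool cmd_on, bool fb_on)
{
    if (cmd_on == fb_on)
        return OUT_OK;
    if (!cmd_on && fb_on)
        return OUT_WELD_SUSPECT;   /* latch */
    return OUT_NO_ENERGIZE;        /* deny + diagnose + conservative transition */
}
```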
Minimum evidence fields
Core: input_hash, config_hash, calibration_id, output_feedback_state
Recommended (for field clarity): alignment_skew_ms, contract_fail_code, line_monitor_state
Acceptance criteria
- Contracts explicitly state “must match” fields vs “allowed to differ” fields.
- Contract failures are categorized (alignment vs hash vs config vs output proof) with deterministic actions.
- Output proof chain includes both feedback and line monitoring, not command-only assumptions.
H2-6. Diagnostic Coverage & Self-Test Strategy
What auditors and field teams need
Diagnostic coverage is not a checklist. It is a measurable mechanism that connects failure modes to detection methods, detection intervals, and evidence logs. The objective is to prevent unsafe operation under latent faults and to produce traceable proof that self-tests ran, what they tested, and what action was taken.
POST (Power-On Self-Test): block RUN without integrity proof
- Memory integrity: RAM test, Flash/ROM CRC verification.
- CPU critical path: register/ALU sanity, exception vectors, control-flow baseline.
- Clock & watchdog: clock presence/stability checks, watchdog behavior sanity.
- I/O loopback: verify the safety I/O path is observable (loopback where feasible).
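The Flash/ROM CRC step is the most mechanical part of POST and is easy to show. The sketch below uses a bitwise CRC-32 (reflected polynomial 0xEDB88320); in a real build the reference CRC is computed at build time and stored alongside the image.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-32 (reflected, polynomial 0xEDB88320) over the program image. */
static uint32_t crc32_calc(const uint8_t *data, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++) {
        crc ^= data[i];
        for (int b = 0; b < 8; b++)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return crc ^ 0xFFFFFFFFu;
}

/* POST gate: the RUN transition is blocked unless the stored reference
 * CRC matches the recomputed image CRC. */
static bool post_flash_ok(const uint8_t *image, size_t len, uint32_t ref_crc)
{
    return crc32_calc(image, len) == ref_crc;
}
```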
Online diagnostics: detect drift, corruption, and intermittent faults
Online diagnostics should be scheduled at multiple rates to catch both fast transients and slow degradation. Failures must map to deterministic system states (degraded/trip/latched), consistent with the vital state model.
- High-rate: task timing watchdogs, heartbeat consistency, fast mismatch monitors.
- Mid-rate: periodic CRC, logic consistency checks, contract checks (input/config/output proof).
- Low-rate: trend analysis of error rates and health scores to surface latent faults.
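Multi-rate scheduling can be as simple as a due-time table driven from a 1 ms tick. A hedged C sketch; the placeholder diagnostics and the task-table layout are assumptions for illustration.

```c
#include <stdint.h>

typedef void (*diag_fn)(void);

typedef struct {
    uint32_t period_ms;    /* maps to the diag_interval_ms evidence field */
    uint32_t next_due_ms;
    diag_fn  run;
} diag_task_t;

/* Called once per 1 ms tick: each diagnostic class runs at its own rate. */
static void diag_scheduler(diag_task_t *tasks, int n, uint32_t now_ms)
{
    for (int i = 0; i < n; i++) {
        if ((int32_t)(now_ms - tasks[i].next_due_ms) >= 0) { /* wrap-safe compare */
            tasks[i].run();
            tasks[i].next_due_ms = now_ms + tasks[i].period_ms;
        }
    }
}

/* Placeholder diagnostics with counters, for illustration only. */
static int high_rate_runs, low_rate_runs;
static void high_rate_diag(void) { high_rate_runs++; } /* e.g. timing watchdog */
static void low_rate_diag(void)  { low_rate_runs++; }  /* e.g. trend analysis  */
```

Each completed run would also emit diag_id, diag_pass_fail, and diag_duration_ms so coverage stays measurable rather than assumed.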
Proof test: expose faults that normal operation can hide
- Trigger: maintenance window, operating-hours threshold, or post-event policy.
- Mechanism: controlled channel isolation/swap, injected stimuli, verification of voter and feedback paths.
- Evidence: timestamped proof-test record that ties outcomes to diagnostic IDs and latent fault counters.
Minimum evidence fields (self-test as auditable data)
Core: diag_id, diag_pass_fail, fault_latent_counter, proof_test_timestamp
Recommended (for measurable coverage): diag_interval_ms, diag_duration_ms, reset_reason
Acceptance criteria
- Every diagnostic class states: what it detects, how often it runs, and what evidence it logs.
- Latent fault handling is explicit (fault_latent_counter behavior and clearing rules).
- Proof tests have defined triggers and produce timestamped evidence.
H2-7. Vital I/O: Isolated Inputs/Outputs, Relay Drivers, Feedback
Vital I/O is an energy chain with proof
A safety output is not a GPIO state. It is a controlled energy path that must be proven end-to-end: decision logic drives an output stage, the electromechanical element changes state, the load is affected, and independent feedback confirms the action. Any “command-only” assumption is insufficient for a vital function.
Vital inputs: dual DI + wiring fault classification
- Dual-channel DI: treat A/B as evidence sources; enforce agreement windows and deterministic conflict actions.
- Open/short detection: classify wiring states (open-load, short-to-batt, short-to-gnd) instead of “input flaky.”
- Debounce & sampling windows: debounce is a measurable policy (stable samples + time window), not a magic delay.
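“Debounce as a measurable policy” can be captured as N consecutive identical samples, where N times the sample period is the logged window. A minimal C sketch; stable_samples is the policy parameter that would be covered by the configuration contract.

```c
#include <stdbool.h>
#include <stdint.h>

/* Debounce policy: the reported state changes only after stable_samples
 * consecutive identical raw samples (stable_samples * sample period = window). */
typedef struct {
    bool    debounced;      /* reported, debounced state */
    bool    last_raw;
    uint8_t count;          /* consecutive identical samples seen */
    uint8_t stable_samples; /* policy parameter, logged with the config hash */
} debounce_t;

static bool debounce_step(debounce_t *d, bool raw)
{
    if (raw == d->last_raw) {
        if (d->count < d->stable_samples)
            d->count++;
        if (d->count >= d->stable_samples)
            d->debounced = raw;
    } else {
        d->count = 1;        /* restart the stability window */
        d->last_raw = raw;
    }
    return d->debounced;
}
```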
Vital outputs: driver, relay/contactor, and readback
- Driver stage: high-side / relay / contactor drivers should provide observable diagnostics (fault flags, current/voltage cues where applicable).
- Feedback readback: compare command state vs feedback state within a defined window; mismatches must be logged and classified.
- Weld detect: cmd=OFF but feedback=ON is typically treated as severe because it suggests welded contacts or feedback short.
- Line monitoring: open-load and short faults on the output path should be detectable and mapped to action policies.
Isolation: separate domains, suppress common-mode coupling
- Digital isolators: keep vital logic domain independent from noisy field wiring while preserving timing integrity.
- Isolated power: avoid shared disturbances that collapse both domains; record reset/brownout evidence for correlation.
- Common-mode suppression: route return paths intentionally to prevent EMI-induced false transitions and feedback errors.
Minimum evidence fields
Core: io_open_load, short_to_batt, short_to_gnd, relay_weld_detect, feedback_mismatch
Recommended (for field clarity): feedback_age_ms, weld_suspect_counter, line_monitor_state
Acceptance criteria
- Output behavior is defined as an energy-chain with proof, not a command-only control.
- Wiring faults and weld conditions are classified and logged (not treated as generic “noise”).
- Isolation boundary is explicit and tied to diagnostic evidence and recovery actions.
H2-8. Event Recording: Evidentiary Logs, Trusted Time, Tamper Signals
From “logs exist” to “logs are evidentiary”
Evidentiary logging is designed for reconstruction and audit. It must preserve event order, capture critical context (reset reasons, firmware/config identities), and prevent silent loss during power interruptions. A reliable pipeline turns field anomalies into records that can be verified and replayed.
Event taxonomy (what must be recorded)
- Critical: trip/latched, output proof failures, weld suspects, persistent contract violations.
- Near-miss: mismatch spikes, transient contract failures, CRC bursts, recoveries that barely passed.
- Change control: config changes, firmware updates, calibration identity changes.
- Platform: reset_reason, brownout, watchdog, clock faults.
Trusted time: monotonic first, external time as an aid
- monotonic_counter: guarantees order and supports reconstruction even if wall time jumps.
- External time (if present): used for correlation, not as the sole evidence source.
- Time anomalies: time jumps or time-source loss should generate explicit records.
Power-loss integrity: commit without silent loss
- Holdup window: detect power fail and preserve enough energy to finalize the critical commit.
- Journal / double-write: avoid half-written records and ensure replayable recovery after reset.
- CRC chain: each record carries CRC and links to the previous record to detect missing segments.
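The CRC-chain idea can be sketched as records that each carry their own CRC plus the previous record's CRC. The record layout and the CRC-32 inner loop are illustrative assumptions; the property being demonstrated is that a missing or tampered record breaks replay verification.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint32_t event_id;
    uint32_t monotonic_counter;
    uint32_t prev_record_crc;  /* link to the previous record */
    uint32_t record_crc;       /* CRC over all fields above */
} log_record_t;

/* CRC-32 over everything before the record_crc field. */
static uint32_t rec_crc(const log_record_t *r)
{
    const uint8_t *p = (const uint8_t *)r;
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < offsetof(log_record_t, record_crc); i++) {
        crc ^= p[i];
        for (int b = 0; b < 8; b++)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return crc ^ 0xFFFFFFFFu;
}

static void append_record(log_record_t *r, const log_record_t *prev)
{
    r->prev_record_crc = prev ? prev->record_crc : 0u;
    r->record_crc = rec_crc(r);
}

/* Replay check: a broken chain reveals missing or corrupted segments. */
static bool chain_ok(const log_record_t *recs, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (rec_crc(&recs[i]) != recs[i].record_crc)
            return false;
        if (i > 0 && recs[i].prev_record_crc != recs[i - 1].record_crc)
            return false;
    }
    return true;
}
```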
Tamper signals: record the evidence of abnormal access
- Enclosure/debug signals: cover open, debug enable, unexpected mode changes.
- Identity changes: firmware_hash or config_version changes must be logged as change-control events.
Minimum evidence fields
Core: event_id, monotonic_counter, reset_reason, firmware_hash, config_version
Recommended (for integrity): commit_result, record_crc, prev_record_crc, power_fail_flag
Acceptance criteria
- Every critical event carries identity (firmware/config) and platform context (reset/brownout).
- Ordering is reconstructable even without wall time (monotonic_counter).
- Power-loss behavior is fail-safe: no silent loss; commit failures are recorded and replayed.
H2-9. EMC & Common-Cause Failure Hardening (Rail Reality)
Goal: prevent one disturbance from collapsing all channels
In rail environments, the key EMC risk is not a single-channel upset. The real hazard is common-cause failure: the same disturbance (power dip, ground bounce, EMI injection, or shared logic defect) impacts multiple channels at once, removing the redundancy that 1oo2/2oo3 architectures rely on. Hardening must therefore reduce correlation and create measurable evidence when correlation still happens.
Common-cause catalog (what tends to hit channels together)
- Power CCF: shared front-end dips, undervoltage thresholds aligned across channels, simultaneous brownouts.
- Reference CCF: ground bounce and reference shifts that move logic thresholds together.
- EMI CCF: common-mode injection through cables, connectors, and shield return paths, causing bursts of errors.
- Logic/Config CCF: shared firmware defects or configuration mistakes producing consistent-but-wrong decisions.
Hardening strategy (partition, suppress, de-correlate)
- Partition power domains: avoid a single supply disturbance resetting every channel; use distinct filtering paths and independent supervision where practical.
- Independent reset & supervision: separate reset decision logic per channel to avoid “one reset line resets all.”
- Suppress injection at the boundary: input filtering and common-mode suppression close to the cable entry, before the disturbance becomes “logic-level.”
- Control shield/ground return paths: define where shield currents return so common-mode energy does not couple into every reference.
- De-correlate where safe: avoid identical coupling paths (routing symmetry, identical harness grouping) and use bounded timing diversity without breaking data consistency contracts.
Turning “rail constraints” into circuit requirements
Environmental and EMC constraints should be translated into explicit circuit-level requirements: undervoltage behavior (no unsafe toggling), hold-up thresholds, reset qualification, input filter time constants relative to sampling windows, and a defined common-mode return path strategy. The design is considered hardened only if correlation between channels is measurably reduced and diagnosable when it occurs.
Evidence fields (correlation is measurable)
Core: brownout_events, watchdog_rate, cm_noise_level, reset_correlation
Interpretation pattern: high reset_correlation + rising brownout_events ⇒ power CCF suspect; high cm_noise_level + error bursts ⇒ EMI CCF suspect.
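One way to make reset_correlation a number rather than a narrative: the fraction of channel-A resets that have a channel-B reset within a bounded time window. This metric definition is an assumption, offered only as one example of "correlation is measurable".

```c
#include <stddef.h>
#include <stdint.h>

/* reset_correlation: fraction of channel-A resets with a channel-B reset
 * within +/- window_ms. Values near 1.0 suggest common cause (shared supply,
 * shared reset line); values near 0 suggest independent faults. */
static double reset_correlation(const uint32_t *a_ms, size_t na,
                                const uint32_t *b_ms, size_t nb,
                                uint32_t window_ms)
{
    if (na == 0)
        return 0.0;
    size_t correlated = 0;
    for (size_t i = 0; i < na; i++) {
        for (size_t j = 0; j < nb; j++) {
            uint32_t d = (a_ms[i] > b_ms[j]) ? a_ms[i] - b_ms[j]
                                             : b_ms[j] - a_ms[i];
            if (d <= window_ms) { correlated++; break; }
        }
    }
    return (double)correlated / (double)na;
}
```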
Acceptance criteria
- Hardening actions are mapped to specific common-cause types (power/reference/EMI/logic).
- Correlation is quantified and logged; simultaneous resets are not treated as “random.”
- Mitigation placement is explicit (where filtering, shielding return, and CM suppression occur).
H2-10. Fault Handling & Reset/Recovery Policy (No “Oscillating” Failures)
Goal: deterministic actions, never restart oscillations
Many rail field incidents are not a single trip, but a repeated cycle of reset → partial recovery → failure again. This oscillation creates unstable outputs, destroys diagnostic clarity, and can mask common-cause issues. A robust policy classifies faults, caps recovery attempts, introduces backoff, and forces stable end states (latched trip or controlled degraded mode) when recovery is not evidenced.
Fault classes (latch vs degrade vs attempt)
- Class A — Trip-latch: output proof failure, weld suspect, repeated diagnostic failures, persistent contract violations.
- Class B — Degraded: single-channel health decline, isolated I/O path issues that still allow safe deny behavior.
- Class C — Attempt recovery: transient error bursts that clear and do not violate evidence gates.
Reset policy: cold vs warm, caps, and backoff
- Warm restart: for transient software timing faults; must preserve evidence logs and avoid rapid state flips.
- Cold restart: full re-initialization when state integrity is suspect; re-validate configuration and contracts before RUN.
- Attempt cap + backoff: recovery attempts are limited; retry intervals increase to prevent oscillation.
- Boot-loop detection: detect “fails shortly after boot” and force a stable terminal state.
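Attempt cap, exponential backoff, and boot-loop detection compose into one small policy function. A C sketch; the cap, base delay, and boot-loop uptime threshold are illustrative assumptions.

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint32_t recovery_attempts;
    uint32_t attempt_cap;        /* attempts allowed before a terminal state */
    uint32_t base_backoff_ms;
    uint32_t uptime_at_fault_ms; /* input to "fails shortly after boot" check */
    bool     boot_loop_detect;
    bool     latched;            /* stable terminal state */
} recovery_policy_t;

#define BOOT_LOOP_UPTIME_MS 5000u   /* illustrative threshold */

/* Returns the delay before the next recovery attempt, or latches at the cap.
 * Backoff doubles per attempt: base, 2*base, 4*base, ... */
static uint32_t recovery_backoff(recovery_policy_t *p)
{
    if (p->uptime_at_fault_ms < BOOT_LOOP_UPTIME_MS)
        p->boot_loop_detect = true;
    if (p->boot_loop_detect || p->recovery_attempts >= p->attempt_cap) {
        p->latched = true;       /* force a stable end state */
        return 0u;
    }
    uint32_t delay = p->base_backoff_ms << p->recovery_attempts;
    p->recovery_attempts++;
    return delay;
}
```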
Channel-isolated recovery: keep redundancy meaningful
- Isolate the suspect channel: based on health indicators, mismatch counters, and diagnostic results.
- Controlled degraded mode: proceed only if safety can be maintained conservatively; otherwise deny and latch.
- Evidence gate to exit degraded: return to normal only when diagnostics and correlation indicators are stable.
Evidence fields
Core: reset_counter, boot_loop_detect, degraded_mode_reason, recovery_attempts
Evidence gate examples: increasing reset_counter with positive boot_loop_detect ⇒ force latch; repeated recovery_attempts ⇒ backoff then terminal state.
Acceptance criteria
- Faults are classified with deterministic actions (latch/degrade/attempt) and consistent evidence logs.
- Recovery attempts are bounded and backoff prevents fast oscillation.
- Exit from degraded mode requires evidence gates, not “time passed.”
H2-11. Validation & Audit Evidence Pack (Bench → Track)
Purpose: “verifiable” means deliverable evidence
Validation for vital logic is judged by closed-loop evidence: faults can be injected repeatably, detection and safe-state latencies are measured with unambiguous time markers, and logs can be replayed without silent gaps. The output of this chapter is a checklist-style evidence pack that serves audit, bench re-test, and track investigations consistently.
Evidence fields (must be measurable, not narrative)
Core: fault_injection_case_id, detect_latency_ms, safe_state_latency_ms, log_integrity_ok
Measurement rule: use an external marker (pulse/trigger) as time-zero; measure detection from “marker asserted” to “diagnostic event recorded”; measure safe state from “marker asserted” to “output proof indicates safe.”
Validation map (Bench → HIL → Track)
- Bench (board/subsystem): isolate single-channel diagnostics, vital I/O proof chain, log commit integrity (power-fail scenarios).
- HIL/SIL (system behavior): redundancy/voter behavior under mismatch, window/sequence alignment, recovery policy anti-oscillation gates.
- Track/field (rail reality): correlation under EMC/power events, simultaneous reset correlation, safe-state determinism under disturbances.
Reference parts (MPN examples) to build repeatable bench/HIL setups
- Pulse/Glitch Injection: Keysight 81160A (pulse function arbitrary generator), Tektronix AFG31000 series (AFG31021 / AFG31022 as common variants)
- Scope/Acquisition: Tektronix MDO3000 series (e.g., MDO3054), Keysight InfiniiVision 3000T series (e.g., DSOX3054T)
- Logic/Bus decode: Saleae Logic Pro 16
Notes: instrument models above are examples for repeatable markers, waveform capture, and evidence screenshots. Equivalent tools are acceptable if marker timing and capture fidelity are maintained.
Fault injection matrix (minimum set)
Each case is defined as: injection point (where), profile (shape/level/duration), expected action (latch/degrade/deny/recover), observables (what to measure), and artifacts (what to export into the evidence pack). Use stable case IDs (e.g., FI-PWR-03).
FI-CH (Channel mismatch injection)
- Injection: force one channel’s computed result to drift or drop validity while inputs are nominal.
- Pass/Fail: detect_latency_ms within bound; voter reason codes consistent; no unsafe output toggling.
- Artifacts: voter decision logs + mismatch counter snapshots + timing marker screenshot.
MPN examples: TI TPS3431 (window watchdog as a controlled “liveness” stress), TI TPS3850 (supervisor with independent reset behavior examples).
FI-CLK (Clock drift / jitter excursion)
- Injection: introduce bounded drift/jitter on one channel’s clock source or clock clean-up path.
- Pass/Fail: window/sequence alignment flags trigger; system enters safe or degraded deterministically; no oscillation.
- Artifacts: drift profile record + event logs + measured alignment error vs time.
MPN examples: TI CDCE913 (clock buffer/PLL family example), Renesas 5P49V60 (programmable clock generator example).
FI-GLI (Input glitch / burst)
- Injection: apply controlled glitches/bursts at the cable-entry side (before isolation/filter where possible).
- Pass/Fail: false trigger rate bounded; debounce windows behave as specified; valid faults still detected in time.
- Artifacts: injected waveform screenshots + input state traces + false-trigger statistics.
MPN examples: ADI ADuM141E (digital isolator example), TI ISO7741 (digital isolator example), Littelfuse SMBJ58A (TVS diode example).
FI-PWR (Power dip / brownout pattern)
- Injection: step/ramp dips on the shared front-end and on per-channel rails (separately), with external marker.
- Pass/Fail: safe-state latency bounded; boot-loop prevented; correlation is measurable and explainable.
- Artifacts: rail waveform + reset timeline + correlation summary + log continuity check.
MPN examples: TI TPS3839 (supervisor), ADI LTC2962 (multi-rail supervisor family example), TI TPS25982 (eFuse / surge-stop style front-end example).
FI-ISO (Isolation transient / common-mode coupling)
- Injection: transient common-mode stress at the boundary while observing both logic and field domains.
- Pass/Fail: disturbances become evidence (events + counters) rather than silent corruption; safe action is deterministic.
- Artifacts: CM stress profile + log_integrity_ok proof + mismatch burst trace.
MPN examples: ADI ADuM5020 (isolated DC/DC example), TI SN6505 (isolated supply driver example), Murata NXE1 (isolated converter series example).
Criteria and metrics (measured + reported)
- Detection: report detect_latency_ms as max and 95th percentile per case ID (not only average).
- Safety response: report safe_state_latency_ms as max and 95th percentile; confirm no unsafe output chatter.
- False triggers: express as “false trips per injected burst count” per profile level.
- Log integrity: log_integrity_ok requires: no silent gaps, CRC/chain check passes, and power-fail commits leave evidence.
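Reporting max and 95th percentile instead of averages is straightforward with a nearest-rank percentile. A C sketch of that reduction step (the nearest-rank convention is an assumption; any consistent percentile definition works if it is stated in the test plan):

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

static int cmp_u32(const void *a, const void *b)
{
    uint32_t x = *(const uint32_t *)a, y = *(const uint32_t *)b;
    return (x > y) - (x < y);
}

/* Max and 95th percentile (nearest-rank). Averages hide exactly the
 * outliers that matter for detect/safe-state latency bounds. */
static void latency_stats(const uint32_t *samples, size_t n,
                          uint32_t *max_out, uint32_t *p95_out)
{
    uint32_t *s = malloc(n * sizeof *s);
    if (n == 0 || s == NULL) { *max_out = *p95_out = 0u; free(s); return; }
    memcpy(s, samples, n * sizeof *s);
    qsort(s, n, sizeof *s, cmp_u32);
    *max_out = s[n - 1];
    size_t rank = (95u * n + 99u) / 100u;   /* ceil(0.95 * n) */
    *p95_out = s[rank - 1];
    free(s);
}
```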
MPN examples for evidentiary logging integrity (optional building blocks)
- Non-volatile event storage: Infineon FM25V10 (FRAM example), Microchip 25LC1024 (SPI EEPROM example)
- Secure/time reference building blocks (mechanism only): NXP PCA2129 (RTC example), Microchip ATECC608B (crypto element example for signed manifests; keep usage bounded to “identity evidence”)
- Power-fail hold-up control: ADI LTC4040 (backup/holdup controller family example)
Notes: the evidence pack does not require these exact components. The requirement is that event order and integrity can be proven under power interruptions and that firmware/config identities are recorded.
Evidence pack (deliverables) and traceability
The evidence pack should be a fixed directory template so audits and field replays can locate artifacts quickly. Each injection case ID must map to waveforms, logs, and metric summaries. Any deviations require a recorded waiver with clear justification and bounded risk.
Minimum evidence pack tree:
/EvidencePack/
/TestPlan/ (matrix + fault_injection_case_id)
/RunRecords/ (env, firmware/config identities)
/Waveforms/ (screenshots + markers)
/Logs/ (samples + integrity proof)
/Results/ (latency stats + false-trigger rates)
/Traceability/ (case → artifacts mapping)
/Deviations/ (waivers + actions)
H2-12. FAQs (Field Troubleshooting Index)
Each answer follows a fixed pattern: 1-sentence conclusion + 2 evidence checks + 1 first fix, and maps back to the evidence chain in H2-3…H2-11.
Lockstep reports mismatch, but inputs look normal — transient fault or input de-sync?
Answer: This is most often input de-synchronization rather than a true core fault, unless mismatch bursts correlate with rail/EMI events. Check (1) input_hash and seq_gap alignment across channels at the mismatch timestamp, and (2) whether cm_noise_level rises with the same time window. First fix: enforce deterministic sampling alignment (bounded window + sequence gating) before voting.
Evidence: input_hash, seq_gap, cm_noise_level, mismatch_counter
First fix: widen and log the alignment window; gate voting on aligned sequence IDs
MPN examples: TI TPS3850 (supervisor), TI ISO7741 (digital isolator)
2oo3 occasionally “votes wrong”, but no fault is logged — missing granularity or lost commit?
Answer: If the decision is not explainable, the issue is usually missing “near-miss” events or a log commit gap. Check (1) whether voter_decision/vote_reason_code and “near-miss” events exist around the time of the anomaly, and (2) whether log_integrity_ok ever fails (power-fail, reset, or storage errors) near that window. First fix: add near-miss logging plus commit-result events.
Evidence: voter_decision, vote_reason_code, event_id, log_integrity_ok
First fix: log “near-miss” + commit status; fail closed if commit cannot be proven
MPN examples: Infineon FM25V10 (FRAM), ADI LTC4040 (holdup/backup)
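Near-miss logging plus commit proof can be sketched in a few lines: a 2-of-3 agreement with one dissenter emits an explicit near-miss event instead of passing silently, and a commit is only acknowledged after read-back. The event fields are illustrative; a real system would persist to power-fail-safe storage:

```python
from collections import Counter

def vote_2oo3(values, log):
    """Majority vote over three channels; every outcome carries a
    vote_reason_code, and a 2oo3 dissent is logged as a near-miss."""
    value, n = Counter(values).most_common(1)[0]
    if n == 3:
        log.append({"event": "decision", "value": value,
                    "vote_reason_code": "unanimous"})
        return value
    if n == 2:
        log.append({"event": "near_miss", "value": value,
                    "vote_reason_code": "2oo3_dissent"})
        return value
    log.append({"event": "decision", "value": None,
                "vote_reason_code": "no_majority_inhibit"})
    return None  # fail closed: no majority means inhibit

def commit(log, record, storage):
    """Fail closed unless the commit can be proven by read-back."""
    storage.append(record)
    ok = storage[-1] == record  # stand-in for NVM read-back proof
    log.append({"event": "commit", "log_integrity_ok": ok})
    return ok
```

The point is that "the voter was right but we can't show why" becomes impossible: dissent and commit results are first-class events.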
During EMI testing it trips, then keeps rebooting — brownout threshold or reset policy oscillation?
Answer: Reboot loops are usually policy oscillation triggered by marginal supply behavior, not a single “bad threshold.” Check (1) brownout_events together with reset_correlation to see if channels reset in the same time window, and (2) whether boot_loop_detect and recovery_attempts show rapid repeated recoveries without backoff. First fix: enforce attempt caps with exponential backoff, then revisit UVLO margins only if correlation persists.
Evidence: brownout_events, reset_correlation, boot_loop_detect, recovery_attempts
First fix: add backoff + cap; force stable end state (latch or controlled degraded)
MPN examples: TI TPS3839 (supervisor), TI TPS25982 (eFuse/surge-stop example)
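The backoff-and-cap policy fits in a few lines. An illustrative Python sketch; the base delay and cap values are assumptions, not normative figures:

```python
def recovery_policy(attempt, base_ms=100, cap=5):
    """Exponential backoff between recovery attempts with a hard cap;
    past the cap the system must latch into a stable terminal state
    instead of oscillating through resets."""
    if attempt >= cap:
        return ("latch_safe", None)  # stable end state, no more retries
    return ("retry", base_ms * (2 ** attempt))
```

Because the delay grows and the attempt count is bounded, a marginal supply can no longer drive a fast reboot loop; it drives the system into a single, loggable terminal state.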
A channel looks “healthy” but is still isolated — voter window too short or timestamp/sequence mismatch?
Answer: “Healthy” does not mean “contract-compatible”; isolation commonly happens due to window/sequence misalignment. Check (1) the voter’s window_ms and the observed seq_gap at the time of isolation, and (2) whether input_hash/config_hash match across channels while the channel is being excluded. First fix: parameterize and log window/sequence tolerances and require aligned sequence IDs before counting a mismatch.
Evidence: window_ms, seq_gap, input_hash, config_hash
First fix: tighten “what must match” and log vote_reason_code for exclusions
MPN examples: Renesas 5P49V60 (clock generator example), TI TPS3431 (window watchdog)
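One way to make every exclusion explainable is to return a vote_reason_code from the compatibility check itself. A sketch under assumed tolerances (window_ms=10, max_seq_gap=1; the record fields mirror the evidence names used above):

```python
def contract_compatible(ch, ref, window_ms=10, max_seq_gap=1):
    """A channel is counted into the vote only when it matches the
    contract; every exclusion carries a reason code so 'healthy but
    isolated' is always explainable from the log."""
    if ch["config_hash"] != ref["config_hash"]:
        return (False, "config_hash_mismatch")
    if ch["input_hash"] != ref["input_hash"]:
        return (False, "input_hash_mismatch")
    if abs(ch["seq"] - ref["seq"]) > max_seq_gap:
        return (False, "seq_gap_exceeded")
    if abs(ch["t_ms"] - ref["t_ms"]) > window_ms:
        return (False, "window_exceeded")
    return (True, "aligned")
```

Parameterizing window_ms and max_seq_gap (rather than hard-coding them) is what lets the audit compare the configured tolerance against the observed seq_gap at the moment of isolation.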
Output is de-energized, but feedback still shows relay engaged — weld or feedback short?
Answer: Treat this as an output-proof contradiction until proven otherwise. Check (1) relay_weld_detect and any line-monitor indicators (io_open_load, short flags) to separate mechanical weld from wiring short, and (2) run a controlled FI-I/O case and measure safe_state_latency_ms using an external marker plus feedback waveform. First fix: enable and require dual-path feedback plausibility checks before allowing recovery.
Evidence: relay_weld_detect, feedback_mismatch, io_open_load, safe_state_latency_ms
First fix: add plausibility + cross-check; refuse recovery on proof contradiction
MPN examples: ADI ADuM141E (isolator), Littelfuse SMBJ58A (TVS)
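The dual-path plausibility check can be sketched as a small classifier over the command and two independent feedback paths. The names `fb_contact` and `fb_line` are hypothetical feedback sources, not a specific wiring standard:

```python
def output_proof(commanded_energized, fb_contact, fb_line):
    """Cross-check two independent feedback paths against the command.
    A de-energize command with feedback still showing 'engaged' is a
    proof contradiction: suspect a welded relay and refuse recovery."""
    if fb_contact != fb_line:
        return "feedback_path_fault"       # wiring short / broken line
    if not commanded_energized and fb_contact:
        return "weld_suspect_no_recovery"  # contradiction: latch, no auto-recovery
    if commanded_energized and not fb_contact:
        return "actuation_failure"
    return "consistent"
```

Separating "the two feedback paths disagree" from "feedback contradicts the command" is what distinguishes a wiring fault from a mechanical weld before anyone opens the cabinet.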
Self-tests pass, but audit says diagnostic coverage is insufficient — what “provable” fields are missing?
Answer: Audits fail when diagnostics are not traceable, not when they are absent. Check (1) whether every diagnostic has an explicit diag_id with pass/fail, periodicity, and latent-fault counters, and (2) whether the evidence pack traceability links each fault_injection_case_id to logs, waveforms, and firmware/config identity proofs. First fix: add a traceability map and enforce “no ID, no claim” for diagnostic coverage.
Evidence: diag_id, diag_pass_fail, fault_latent_counter, fault_injection_case_id
First fix: publish traceability (case → artifacts); record diagnostic periodicity
MPN examples: Infineon FM25V10 (FRAM), NXP PCA2129 (RTC)
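A minimal “no ID, no claim” filter might look like this: a diagnostic counts toward coverage only if it has an ID, a recorded period, pass/fail results, and at least one fault-injection case mapped to artifacts. The record fields are illustrative:

```python
def coverage_claims(diagnostics, trace_map):
    """Enforce 'no ID, no claim': return only the diag_ids whose
    claims are fully traceable to artifacts in the evidence pack."""
    claimed = []
    for d in diagnostics:
        if not d.get("diag_id") or d.get("period_ms") is None:
            continue  # no identity or no recorded periodicity
        if "pass_fail" not in d:
            continue  # result never recorded
        cases = trace_map.get(d["diag_id"], [])
        if any(c.get("artifacts") for c in cases):
            claimed.append(d["diag_id"])
    return claimed
```

Running this over the declared diagnostics produces exactly the list an auditor will accept; everything it drops is a diagnostic that exists but cannot be proven.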
False trips cluster at the same time of day — common-cause (power/ground/EMI) or shared software defect?
Answer: Common-cause issues usually show timing correlation across channels and telemetry, while software defects correlate with version/config boundaries. Check (1) whether reset_correlation, cm_noise_level, or brownout_events spike in the same time window across channels, and (2) whether the cluster aligns with a config_version change or specific recovery path (degraded_mode_reason). First fix: generate a correlation report and treat high-correlation windows as CCF suspects by default.
Evidence: reset_correlation, cm_noise_level, brownout_events, config_version
First fix: create a correlation dashboard; prioritize de-correlation mitigations
MPN examples: TI TPS3850 (supervisor), ADI ADuM5020 (isolated supply)
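Timing correlation can be screened with simple window bucketing: any window in which multiple channels fire together is a common-cause suspect by default. A sketch; the window size and channel threshold are assumptions:

```python
def ccf_suspect_windows(events, window_ms=50, min_channels=2):
    """Bucket per-channel events into fixed time windows and flag the
    windows where several channels fired together as CCF suspects.
    events: iterable of (channel_id, timestamp_ms) pairs."""
    buckets = {}
    for ch, t_ms in events:
        buckets.setdefault(t_ms // window_ms, set()).add(ch)
    return [w * window_ms for w, chans in sorted(buckets.items())
            if len(chans) >= min_channels]
```

Fed with reset_correlation or brownout_events timestamps per channel, the output is a candidate list for the correlation report; software-defect clusters instead align with config_version boundaries, not with these windows.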
After configuration update, false alarms rise — contract break or calibration ID mismatch?
Answer: Post-update false alarms are typically contract violations (inconsistent config/calibration) rather than real plant changes. Check (1) config_hash across channels and confirm the update produced a single consistent contract snapshot, and (2) calibration_id alignment plus any “config change” events in the evidentiary log. First fix: block RUN entry unless config_hash and calibration_id match across channels and are recorded as a single atomic change.
Evidence: config_hash, calibration_id, event_id, config_version
First fix: enforce atomic config contract; log the accepted contract snapshot
MPN examples: Microchip 25LC1024 (SPI EEPROM), Microchip ATECC608B (identity/signing example)
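The RUN-entry gate can be sketched as one atomic check-and-log step: either all channels present the same config_hash and calibration_id and the accepted snapshot is logged, or RUN entry is blocked with the disagreement recorded. Event names are illustrative:

```python
def run_entry_allowed(channels, event_log):
    """Block RUN entry unless config_hash and calibration_id agree
    across all channels; log the outcome either way as one event."""
    hashes = {c["config_hash"] for c in channels}
    cals = {c["calibration_id"] for c in channels}
    if len(hashes) != 1 or len(cals) != 1:
        event_log.append({"event": "run_entry_blocked",
                          "config_hashes": sorted(hashes),
                          "calibration_ids": sorted(cals)})
        return False
    event_log.append({"event": "contract_accepted",
                      "config_hash": hashes.pop(),
                      "calibration_id": cals.pop()})
    return True
```

Because the accepted snapshot is itself an event, a later false-alarm cluster can be checked against exactly which contract was in force.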
Trip recovers, but immediately trips again — weak latched-fault definition or over-permissive recovery gate?
Answer: Immediate re-trip usually means the system is allowed to recover without evidence that the initiating condition cleared. Check (1) latched_fault_id and transition_reason to see whether the original fault should have been non-recoverable, and (2) whether recovery_attempts and exit gates require a stable window (no mismatch, no brownout, no proof contradictions). First fix: tighten recovery gates with a stability window and cap attempts, forcing a stable terminal state when gates fail.
Evidence: latched_fault_id, transition_reason, recovery_attempts, boot_loop_detect
First fix: require stability window + attempt cap; refuse repeated fast recoveries
MPN examples: TI TPS3431 (window watchdog), TI TPS3839 (supervisor)
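A sketch of the stability-window gate (the window length and attempt budget are assumptions, not normative values); `history` is a list of booleans, True meaning a fault indication was seen in that observation window:

```python
def recovery_gate(history, stable_windows=3, max_attempts=3, attempts=0):
    """Permit recovery only after a stability window: the last N
    observation windows must be clean (no mismatch, no brownout,
    no proof contradiction) and the attempt budget not exhausted."""
    if attempts >= max_attempts:
        return "latch_terminal"  # stop fast re-trip loops for good
    recent = history[-stable_windows:]
    if len(recent) < stable_windows or any(recent):
        return "hold"            # evidence not yet clean long enough
    return "recover"
```

The gate makes "recover" a claim that must be earned with evidence of a cleared condition, which is exactly what an immediate re-trip proves was missing.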
Logs are challenged as non-evidentiary — missing trusted time or missing integrity chain?
Answer: Evidentiary logs require both ordering (trusted time) and tamper-evident integrity. Check (1) whether monotonic_counter advances monotonically across resets and is bound to critical events (trip, config change), and (2) whether integrity checks (CRC/chain/commit read-back) are recorded as explicit events, not assumed. First fix: add commit read-back with integrity result fields and require monotonic counters to be stored in power-fail-safe NVM before acknowledging a record.
Evidence: monotonic_counter, reset_reason, log_integrity_ok, firmware_hash
First fix: commit read-back + integrity event; persist monotonic counter safely
MPN examples: NXP PCA2129 (RTC), Infineon FM25V10 (FRAM)
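A minimal hash-chain with monotonic counters and commit read-back, for illustration only: a vital implementation would bind the counter to power-fail-safe NVM (e.g., FRAM) and a trusted time source rather than an in-memory list:

```python
import hashlib
import json

def append_record(log, counter, payload):
    """Hash-chain each record to its predecessor, bind a monotonic
    counter, and acknowledge only after a read-back verification."""
    prev = log[-1]["chain"] if log else "genesis"
    rec = {"counter": counter, "payload": payload, "prev": prev}
    rec["chain"] = hashlib.sha256(
        json.dumps(rec, sort_keys=True).encode()).hexdigest()
    log.append(rec)
    ok = log[-1]["chain"] == rec["chain"]  # stand-in for NVM read-back
    return ok, counter + 1

def verify_chain(log):
    """Recompute the chain; any tamper, reorder, or counter rollback
    breaks verification."""
    prev, last = "genesis", -1
    for rec in log:
        body = {k: rec[k] for k in ("counter", "payload", "prev")}
        body_hash = hashlib.sha256(
            json.dumps(body, sort_keys=True).encode()).hexdigest()
        if rec["prev"] != prev or rec["chain"] != body_hash:
            return False
        if rec["counter"] <= last:
            return False  # monotonic counter must advance
        prev, last = rec["chain"], rec["counter"]
    return True
```

Ordering (counter) and integrity (chain) are checked together, which is the pair an evidentiary challenge attacks.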
Mismatch detection is too slow — comparison point wrong or diagnostic task period too long?
Answer: Long mismatch latency is usually caused by late comparison (after normalization) or a slow diagnostic cadence. Check (1) where the comparison is performed relative to the contract boundary (input_hash, sequence alignment, intermediate results) and whether earlier comparison would reveal divergence sooner, and (2) whether the diagnostic period is consistent with the measured detect_latency_ms distribution (max/95p) in the evidence pack. First fix: move comparison earlier and record diagnostic cadence as a first-class evidence field.
Evidence: detect_latency_ms, input_hash, seq_gap, diag_id
First fix: compare earlier + increase diagnostic cadence; log cadence with results
MPN examples: Saleae Logic Pro 16 (decode/correlation), Tektronix AFG31000 (marker/glitch injection)
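The latency budget in the answer can be stated as a simple bound: worst-case detect latency is the pipeline delay from divergence to the comparison point, plus one full diagnostic period (divergence can occur just after a comparison ran). An illustrative model, not a measured figure:

```python
def worst_case_detect_latency(divergence_to_compare_ms, diag_period_ms):
    """Moving the comparison earlier shrinks the first term;
    raising the diagnostic cadence shrinks the second."""
    return divergence_to_compare_ms + diag_period_ms
```

The measured detect_latency_ms distribution (max/95p) in the evidence pack should sit under this bound; if it does not, either the comparison point or the recorded cadence is wrong.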
Field reports: “cannot see what’s wrong” — which three waveforms/fields should be captured first?
Answer: Start with a minimal triad that separates common-cause from logic/contract faults. Capture (1) power/reset/clock waveforms with an external marker to align events (supports brownout_events and boot-loop analysis), (2) contract alignment evidence (seq_gap, input_hash, window_ms) at the decision boundary, and (3) evidentiary log integrity proof (monotonic_counter, commit read-back, log_integrity_ok). First fix: add a standardized “marker + triad capture” procedure to every FI case and field replay.
Evidence: brownout_events, seq_gap, input_hash, log_integrity_ok
First fix: enforce a triad capture SOP (marker + waveforms + log bundle)
MPN examples: Keysight 81160A (pulse marker), Keysight DSOX3054T (scope)
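The triad SOP can be enforced mechanically at intake: a capture bundle is accepted only when all three legs are present with their minimum fields, and a rejection names exactly what is missing. The leg names and minimum field sets below are assumptions for illustration:

```python
REQUIRED_TRIAD = {
    "power_reset_clock":  {"waveform", "marker_t_ms"},
    "contract_alignment": {"seq_gap", "input_hash", "window_ms"},
    "log_integrity":      {"monotonic_counter", "log_integrity_ok"},
}

def triad_complete(bundle):
    """Validate one field-capture bundle against the triad SOP;
    return (accepted, list of (leg, missing_fields))."""
    missing = []
    for leg, fields in REQUIRED_TRIAD.items():
        got = set(bundle.get(leg, {}))
        if not fields <= got:
            missing.append((leg, sorted(fields - got)))
    return missing == [], missing
```

Run at upload time, this turns “cannot see what’s wrong” into a concrete checklist before the hardware leaves the site.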