SIS Logic Solver: Voting, Lockstep Safety MCU, Isolated I/O
A SIS Logic Solver is the decision core of a safety loop: it validates input health, applies deterministic 1oo2/2oo3 voting, and drives a defined safe-state sequence. Its credibility comes from isolation-aware signal integrity plus traceable evidence (snapshots, timestamps, cause codes) that makes every trip predictable, reproducible, and auditable.
H2-1. Role of the SIS Logic Solver in the Safety Loop
A SIS Logic Solver is the decision core of a Safety Instrumented Function (SIF): it validates safety-related inputs, applies deterministic voting/decision rules, and drives a defined safe output action while producing audit-ready evidence (reason codes, timestamps, and state snapshots). It does not measure the physical world and does not provide final power actuation.
- Position in the SIF chain: Sensor → Logic Solver → Final Element. The Logic Solver is where “uncertain inputs” become a “certain decision.”
- Boundary of responsibility: input credibility and consistency are handled here; sensor calibration/physics and actuator sizing are not.
- Auditability requirement: identical validated inputs must yield identical outputs, and each decision must be explainable via a logged cause.
Practical intent: each later chapter (voting, lockstep, isolation, diagnostics, logging) must map back to at least one of the evidence fields named above (reason codes, timestamps, state snapshots).
| This page covers (Logic Solver scope) | Out of scope (handled elsewhere) |
|---|---|
| Voting / comparison rules, decision determinism, safe-state output logic, diagnostic flags, event evidence | SIL math and certification workflow, sensor physics/calibration, actuator power stage sizing, PLC/DCS network architecture |
Transition to next chapter: because inputs can be wrong, late, stuck, or drifting, the Logic Solver must start from explicit failure assumptions and measurable targets.
H2-2. Failure Assumptions and Design Targets
Voting and redundancy are not “features”—they are responses to explicit assumptions about how channels can fail. Without a written failure assumption set, it is impossible to justify a voting rule, set a detection time budget, or produce audit evidence that a safe state will be reached when needed.
- Random failures: device wear-out, intermittent connections, environmental stress. Mitigation often relies on redundancy, voting, and online detection.
- Systematic failures: design, software, requirements, or configuration errors that can repeat. Mitigation relies on deterministic rules, versioned configuration, and controlled change management; lockstep does not catch a defect that both cores execute identically.
Practical implication: voting primarily counters random channel faults; lockstep and diagnostics detect random internal faults (divergent execution, invalid internal states); systematic faults are constrained by deterministic rules and lifecycle controls.
- Fault assumption list: what can fail, how it manifests, and which channel(s) it affects.
- Detection latency: the maximum allowed time to detect each fault type before it can become dangerous.
- Safe state definition: the required output action once a dangerous or unknown state is detected (including latch/reset policy).
| Input problem | Safety risk | Design direction (links to later chapters) | Evidence field to log |
|---|---|---|---|
| Wrong (incorrect state) | False permit or false trip | Voting consistency rules; mismatch tolerance | Mismatch reason code + channel snapshot |
| Late (timing skew) | Transient disagreement → wrong vote | Validation window; alignment policy | Timestamp delta + decision window ID |
| Stuck (no change) | Danger masked by stale signal | Stale-data detection; periodic self-test hooks | Stale counter + last-change timestamp |
| Drift (slow bias) | Long-term vote bias / nuisance trips | Window/trend compare; tolerance bands | Trend metric + tolerance threshold ID |
- Assumption registry: fault categories with symptoms and affected channels (versioned).
- Latency budget: per fault type, the maximum detection time and the rationale.
- Safe-state contract: output behavior on detection (de-energize/energize-to-trip, latch, reset conditions).
- Fault codes: unique reason codes enabling reconstruction of “what happened” from logs.
Next chapter setup: once assumptions and targets are explicit, voting (1oo2/2oo3), lockstep, isolation, diagnostics, and logging become implementation choices to satisfy the latency and safe-state contract.
H2-3. Voting Logic Fundamentals: 1oo1, 1oo2, 2oo3
Voting only becomes deterministic when each channel’s state has a fixed meaning. This page uses a Trip vote convention: 1 = Trip request (danger detected), 0 = No trip request. Any “invalid/stale” channel must be handled as an explicit state by policy (covered later under diagnostics and logging).
| Mode | Trip rule (boolean form) | Engineering interpretation |
|---|---|---|
| 1oo1 | Trip = A | Single channel decides. Simplest, but a single wrong channel can dominate. |
| 1oo2 | Trip = A OR B | Any channel can force Trip. Reduces missed-trip risk when hazards can appear in only one channel, but increases nuisance-trip sensitivity. |
| 2oo3 | Trip = (A + B + C) ≥ 2 | Majority required. More robust to a single wrong Trip signal, but depends more on channel coherency (timing and thresholds) to avoid split votes. |
- Nuisance trip sensitivity: 1oo2 trips on any single asserted channel; 2oo3 rejects isolated single-channel assertions unless a second channel corroborates.
- Hazard visibility assumption: if a real hazard can be observed by only one channel (coverage gaps), 1oo2 may trip earlier than 2oo3.
- Timing sensitivity: majority voting can misbehave when channels are asynchronous; validation windows and coherency flags become mandatory controls.
2oo3 and 1oo2 optimize different failure assumptions. 1oo2 prioritizes “trip if any credible channel requests trip,” which can be desirable when hazards may be partially observed. 2oo3 prioritizes “trip on corroborated majority,” which can reduce nuisance trips from single-channel noise or drift. Safety impact depends on the fault model and the coherency controls (thresholds, tolerances, windows).
- vote_mode: 1oo1 / 1oo2 / 2oo3
- input_snapshot: A,B,(C) states captured at decision time
- decision: Trip / Safe
- reason_code: ANY_TRIP / MAJORITY_TRIP / INVALID_INPUT / WINDOW_FAIL
- window_id + timestamp: tie decision to a validation window and time base
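The vote rules and the evidence fields listed above can be sketched together as one function. This is a minimal illustration, not a certified implementation: the names `ChannelState`, `vote`, and the `NO_TRIP_REQUEST` reason code are hypothetical, and the policy "any INVALID channel forces Trip" is an assumption chosen for the example (the actual INVALID policy must come from the safe-state contract).

```python
from enum import Enum


class ChannelState(Enum):
    NO_TRIP = 0
    TRIP = 1
    INVALID = 2


# (number of channels, Trip requests needed) per vote mode, per the table above.
VOTE_RULES = {"1oo1": (1, 1), "1oo2": (2, 1), "2oo3": (3, 2)}


def vote(vote_mode, channels, window_id, timestamp):
    """Apply the vote rule and return a decision record with evidence fields."""
    n_channels, needed = VOTE_RULES[vote_mode]
    if len(channels) != n_channels:
        raise ValueError(f"{vote_mode} expects {n_channels} channel(s)")

    if any(c is ChannelState.INVALID for c in channels):
        # ASSUMPTION: an INVALID channel forces the safe action.
        decision, reason_code = "Trip", "INVALID_INPUT"
    else:
        trips = sum(c is ChannelState.TRIP for c in channels)
        if trips >= needed:
            decision = "Trip"
            reason_code = "MAJORITY_TRIP" if vote_mode == "2oo3" else "ANY_TRIP"
        else:
            # Hypothetical reason code for the no-trip case.
            decision, reason_code = "Safe", "NO_TRIP_REQUEST"

    return {
        "vote_mode": vote_mode,
        "input_snapshot": tuple(c.name for c in channels),
        "decision": decision,
        "reason_code": reason_code,
        "window_id": window_id,
        "timestamp": timestamp,
    }
```

Returning the full record, rather than a bare boolean, keeps the decision and its evidence inseparable: the same inputs always produce the same record.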
Chapter linkage: once the vote rule is fixed, the implementation must guarantee that channel states are comparable (thresholds, tolerances, and validation windows).
H2-4. Voting Comparator Architectures
Voting rules assume discrete channel states, but real inputs are noisy, delayed, and drifting. Comparator architectures convert ambiguous signals into vote-ready states (0/1/INVALID) under controlled error bounds. The controls are not optional: threshold definition, mismatch tolerance, and debounce/validation windows determine whether voting behaves predictably.
- Analog comparator voting: hardware thresholds with hysteresis and debounce to prevent chatter near trip points; deterministic latency is a key advantage.
- Digital comparator + MCU voting: sampled/filtered signals mapped to states using window rules; enables richer diagnostics and drift-aware policies.
- Window / trend compare (drift-aware): band checks flag out-of-range behavior; trend metrics detect slow bias that can poison long-term voting.
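The drift-aware comparison in the last bullet can be illustrated with a pairwise mismatch check against an allowed delta. This is a sketch under stated assumptions: the function names, the readings dictionary, and the "outlier is the channel that disagrees with all others" heuristic are illustrative, not a prescribed method.

```python
from collections import Counter
from itertools import combinations


def pairwise_mismatch(readings, allowed_delta):
    """Return channel pairs whose readings differ by more than allowed_delta."""
    return [
        (a, b)
        for a, b in combinations(sorted(readings), 2)
        if abs(readings[a] - readings[b]) > allowed_delta
    ]


def suspect_channel(readings, allowed_delta):
    """Return the single channel that disagrees with every other channel.

    Returns None when there is no mismatch or when the outlier is ambiguous
    (for example, two channels only, or all channels disagreeing).
    """
    counts = Counter(ch for pair in pairwise_mismatch(readings, allowed_delta) for ch in pair)
    suspects = [ch for ch, n in counts.items() if n == len(readings) - 1]
    return suspects[0] if len(suspects) == 1 else None
```

With three channels, a slowly drifting channel is flagged once it leaves the tolerance band relative to both peers; with two channels the check can only report that a mismatch exists, not which channel is wrong.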
Asynchronous channels can disagree briefly even when all channels are correct. Without a validation window and coherency rules, majority voting can oscillate: early edges create split votes; late edges “correct” them after the decision is already made. This is controlled by windowing (time alignment), debounce, and explicit INVALID handling.
- Noise-induced chatter: signals near threshold flip rapidly; hysteresis and debounce convert chatter into a stable state transition.
- Edge timing bounce: short-lived edges create disagreement across channels; validation windows prevent single-edge artifacts from becoming a vote input.
Requirement: each stability mechanism must be traceable (which threshold/band/window was active during the decision).
| Control | Minimum fields to record | Purpose |
|---|---|---|
| threshold definition | threshold_id, threshold_value, hysteresis_band | Proves what “Trip” means and prevents chatter near the trip point. |
| mismatch tolerance | tolerance_id, allowed_delta, channel_pair | Controls how much disagreement is permitted before marking INVALID or forcing a safe action. |
| debounce / validation window | debounce_ms, window_ms, window_id, timestamp_delta | Aligns asynchronous inputs and prevents transient edges from corrupting a vote decision. |
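The threshold, hysteresis, and debounce controls in the table can be sketched as a small sampled comparator. The class name and the choice to express debounce as a sample count (rather than `debounce_ms`) are assumptions for illustration; a real design would derive the count from the sample period and the latency budget.

```python
class HysteresisDebounce:
    """Sampled comparator: hysteresis band around a threshold, then debounce."""

    def __init__(self, threshold, hysteresis_band, debounce_samples):
        self.upper = threshold + hysteresis_band / 2
        self.lower = threshold - hysteresis_band / 2
        self.debounce_samples = debounce_samples
        self.state = 0    # debounced output: 1 = Trip request
        self._raw = 0     # comparator output after hysteresis
        self._count = 0   # consecutive samples where raw differs from state

    def update(self, value):
        # Hysteresis: inside the band the raw output holds its last value.
        if value >= self.upper:
            self._raw = 1
        elif value <= self.lower:
            self._raw = 0

        # Debounce: accept a change only after N consecutive differing samples.
        if self._raw != self.state:
            self._count += 1
            if self._count >= self.debounce_samples:
                self.state = self._raw
                self._count = 0
        else:
            self._count = 0
        return self.state
```

Usage: with `HysteresisDebounce(10.0, 1.0, 3)`, samples chattering between 9.8 and 10.2 never change the output, while three consecutive samples at or above 10.5 assert a Trip request.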
Next chapter linkage: lockstep safety MCUs can internalize comparison and diagnosability, but the same evidence controls (windowing, reason codes, and state snapshots) remain mandatory.
H2-5. Lockstep Safety MCUs as Logic Solvers
A lockstep safety MCU acts like an internal voter: two execution paths run the same instruction stream in tightly aligned cycles, and a cycle-by-cycle compare checks whether computed states remain identical. When a mismatch is detected, the device asserts a fault flag and transitions into a defined safety response (e.g., force safe outputs, latch fault state, and record evidence).
- Core-domain coverage: lockstep is strongest at catching internal random faults that cause divergent computation (register/ALU/control-flow divergence).
- Peripheral-domain boundary: many peripherals are shared. A peripheral fault can feed identical wrong data to both cores, producing consistent but wrong execution that may not trigger a mismatch.
- Practical control: peripheral health flags (timeouts, CRCs, overruns) and input plausibility rules complement lockstep by turning “shared wrong” into an explicit invalid state.
Internal lockstep comparison and external multi-channel voting solve different problems. Lockstep focuses on execution integrity inside the MCU. External 1oo2/2oo3 voting focuses on input-channel consistency and hazard visibility across channels. A robust Logic Solver typically uses both: lockstep to detect internal divergence and external voting/comparison to validate safety inputs.
| Layer | Primary protection | Minimum evidence |
|---|---|---|
| Internal (lockstep) | Detects divergent execution and invalid internal states (random faults causing mismatch) | compare_mismatch_flag, fault_reason_code |
| External (voting) | Validates channel consistency and resolves disagreement across inputs | vote_mode, input_snapshot, window_id |
Lockstep improves fault detection but does not automatically create independence. Shared resources (clock, power, memory paths) can introduce common-cause failures that affect both cores similarly. In addition, a systematic software defect can produce identical wrong behavior in both cores. Therefore, lockstep must be paired with explicit diagnostics, coherency controls, and evidence logging to maintain auditability.
| Category | Typical outcome | Control / evidence |
|---|---|---|
| Caught | Random faults that cause divergent results (bit flips, transient execution anomalies) → mismatch triggers fault | compare_mismatch_flag, fault_latch_status |
| Not caught | Systematic defects that produce identical wrong logic in both cores (wrong threshold, wrong rule) | config_id, software_build_id, change control |
| Conditionally caught | Input sampling or timing differences can create divergence even without true faults (asynchronous edges) | window_id, timestamp_delta, coherency_flag |
- lockstep_state: enabled / degraded / disabled
- compare_mismatch_flag + fault_reason_code: why divergence was declared
- fault_latch_status: whether the fault is latched and requires a controlled reset
- decision_origin: internal fault vs external vote decision
- software_build_id + config_id: ties behavior to a versioned configuration baseline
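The coverage boundary described in this chapter can be made concrete with a software emulation. Real lockstep is a hardware cycle-by-cycle compare; the sketch below only mimics the behavior (run twice, compare, latch) to show what is and is not caught. `LockstepEmulation`, `trip_rule`, the `flip_checker` fault-injection flag, and the `LOCKSTEP_MISMATCH` code are hypothetical names.

```python
def trip_rule(value, threshold):
    """Example safety rule: 1 = Trip request, 0 = No trip request."""
    return 1 if value >= threshold else 0


class LockstepEmulation:
    """Run one function on two emulated paths and latch on any mismatch."""

    def __init__(self, fn):
        self.fn = fn
        self.fault_latch_status = False
        self.fault_reason_code = None

    def step(self, *args, flip_checker=False):
        primary = self.fn(*args)
        checker = self.fn(*args)
        if flip_checker:
            checker ^= 1  # simulate a random bit flip in one path only

        mismatch = primary != checker
        if mismatch:
            self.fault_latch_status = True
            self.fault_reason_code = "LOCKSTEP_MISMATCH"

        if self.fault_latch_status:
            # Latched fault: force the safe output (Trip) until a controlled reset.
            return {"output": 1, "compare_mismatch_flag": mismatch,
                    "decision_origin": "INTERNAL_FAULT"}
        return {"output": primary, "compare_mismatch_flag": False,
                "decision_origin": "VOTE"}
```

Note what the emulation does not catch: if a shared peripheral delivers the same wrong `value` to both paths, or `threshold` is misconfigured, both paths agree and no mismatch is raised. That gap is covered by plausibility checks and change control, not by the compare.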
H2-6. Discrete vs MCU-Based Logic Solver Trade-offs
Discrete and MCU-based logic solvers can both implement voting, but they differ in diagnosability, timing determinism, audit evidence depth, and lifecycle cost. Selection should be driven by measurable requirements: diagnostic expectations, maximum decision latency, evidence obligations, and change-management maturity.
| Dimension | Discrete logic solver | Lockstep safety MCU | Evidence focus |
|---|---|---|---|
| Diagnostic coverage | Strong for clear threshold/line faults; depends on added monitors for deeper visibility | Richer self-tests and state diagnostics; depends on software/config discipline | health flags, reason codes, diagnostic inventory |
| Latency determinism | Short, predictable paths; minimal scheduling uncertainty | Depends on sampling windows, task timing, and rule execution budget | latency budget, window_id, timestamp_delta |
| Auditability | Often needs external logging to explain “why Trip happened” | Can record snapshots, reason codes, and state traces internally | input_snapshot, state_snapshot, log_sequence |
| Lifecycle management | Low change frequency; simpler maintenance but limited feature evolution | Updatable and extensible; requires versioning, regression, and controlled rollout | software_build_id, config_id, change_log_ref |
Reading tip: each dimension should be tied to an explicit requirement. For example, “maximum detection latency” and “minimum evidence fields” determine whether windowing and logging are mandatory.
- Choose discrete-first when deterministic latency and minimal complexity dominate, and evidence needs can be satisfied by external logging.
- Choose MCU-first when richer diagnostics, configurable rules, and audit-grade traces are required, and lifecycle controls (versioning, regression) are feasible.
- Hybrid reality is common: discrete front-end comparators for clean thresholds plus MCU logic for windowing, voting, and evidence logging.
H2-7. Isolated I/O for Safety Signal Integrity
A Logic Solver does not measure the physical world, but it must guarantee that safety inputs remain clean, comparable, and independent. Isolation is not only an electrical barrier: it protects the assumptions behind voting by limiting common-mode coupling, ground shifts, and transient injection that can move multiple channels together and invalidate independence.
| Isolation location | Primary purpose (logic-solver view) | Evidence focus |
|---|---|---|
| Input isolation | Preserves vote credibility by limiting shared reference errors and transient injection into comparator/threshold decisions across channels. | channel_health, CM transient logs, coherency flags |
| Output isolation | Preserves the ability to execute a safe action without back-injection from high-energy domains; prevents output disturbances from polluting inputs. | output_health, isolation fault flags, trip path status |
- Threshold reference shift: a common-mode (CM) event moves the effective threshold in multiple channels at once, producing “consistent” but wrong vote inputs.
- Transient injection inside the decision window: a spike coincides with validation windows, causing multiple channels to sample the same wrong state.
- Recovery mismatch: isolation elements may saturate or recover at different rates, creating short split votes and oscillating decisions if windowing is weak.
Practical implication: voting assumes independence; CM coupling can collapse independence into a single shared failure mode.
An open failure typically removes a channel and is often detected by timeouts or out-of-range checks. An isolation failure can be more dangerous: signals may still appear plausible while channel independence is degraded. This means “value looks normal” is not sufficient evidence—explicit isolation health and CM event awareness are required.
- channel_health: OK / Degraded / Invalid (per channel)
- isolation_fault_flags: isolation element health and fault latches
- CM_transient_logs: timestamp + severity + affected channels
- coherency_flag: whether inputs were comparable during the decision window
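One way to combine these four evidence fields is to derive per-channel health for each decision window before the vote is accepted. The sketch below is illustrative: the function name, the log-entry layout, the numeric severity scale, and the `severity_limit` default are all assumptions.

```python
def assess_window(channels, isolation_fault_flags, cm_transient_logs,
                  window, severity_limit=3):
    """Derive channel_health and coherency_flag for one decision window.

    cm_transient_logs entries are dicts with keys:
    'timestamp', 'severity', 'affected_channels' (assumed layout).
    """
    start, end = window
    channel_health = {}
    for ch in channels:
        if isolation_fault_flags.get(ch, False):
            channel_health[ch] = "Invalid"  # isolation fault latched
            continue
        hits = [e for e in cm_transient_logs
                if start <= e["timestamp"] <= end and ch in e["affected_channels"]]
        if any(e["severity"] >= severity_limit for e in hits):
            channel_health[ch] = "Invalid"
        elif hits:
            channel_health[ch] = "Degraded"
        else:
            channel_health[ch] = "OK"

    # Inputs count as comparable only if every channel was clean in the window.
    coherency_flag = all(h == "OK" for h in channel_health.values())
    return channel_health, coherency_flag
```

The point of the gate is that a plausible-looking value is not trusted on its own: a CM event or isolation fault inside the window downgrades the channel even when its reading looks normal.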
H2-8. Diagnostic Coverage and Fault Detection
Diagnostics are not “alarms” for operators. The primary safety purpose is to reduce dangerous undetected failures by converting them into detected faults that trigger defined safe actions, degraded modes, or maintenance interventions. Evidence must distinguish detected versus undetected fault classes and show how each is handled.
| Layer | Role in fault detection | Evidence focus |
|---|---|---|
| Startup diagnostics | Blocks “faulty-at-boot” conditions by validating critical integrity before enabling safety functions. | startup_diag_result, config_id, integrity flags |
| Online diagnostics | Detects faults during operation (execution divergence, I/O integrity loss, window/coherency failures) and drives safe/degraded behavior. | fault_reason_code, window_id, channel_health |
| Periodic proof-test support | Complements online coverage by enabling verification of fault classes that are otherwise difficult to detect continuously. | proof_test_records, counters, test hooks |
Diagnostic coverage categories (often described as Low/Medium/High) reflect how much of the relevant dangerous fault space is converted from undetected to detected. The key is not a label: it is a maintained mapping between fault classes and the diagnostics that detect them, plus proof that detection triggers the intended safe response.
| Fault class | Detected by | Evidence fields (minimum) |
|---|---|---|
| Execution divergence | Online (lockstep compare) | compare_mismatch_flag, fault_reason_code, fault_latch_status |
| I/O integrity loss | Online (channel health + coherency) | channel_health, coherency_flag, window_id, timestamp_delta |
| Isolation degradation | Online (isolation flags) + event logs | isolation_fault_flags, CM_transient_logs, affected_channels |
| Systematic rule/config error | Not reliably detected online (requires lifecycle controls) | config_id, software_build_id, change_log_ref |
| Blind-spot fault class | Periodic (proof-test support) | proof_test_records, counters, last_test_timestamp |
Requirement: “undetected” categories must be explicitly listed and mapped to compensating controls (proof-test support, maintenance actions, or design constraints).
- DC% categories: mapped to fault classes and diagnostics (inventory-style)
- detected_fault_list: detected class → diagnostic source → response
- undetected_fault_list: blind-spot class → compensating control
- diag_event_logs: reason_code + timestamp + affected channels
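The inventory-style mapping above lends itself to an automated completeness check. The following sketch assumes a simple dictionary layout for `detected_fault_list` and `undetected_fault_list`; the function name and the wording of the findings are hypothetical.

```python
def audit_inventory(fault_classes, detected_fault_list, undetected_fault_list):
    """Return a list of findings; an empty list means the inventory is complete.

    detected_fault_list:   {fault_class: {"diagnostic": str, "response": str}}
    undetected_fault_list: {fault_class: compensating_control (str)}
    """
    findings = []
    for fc in fault_classes:
        in_detected = fc in detected_fault_list
        in_undetected = fc in undetected_fault_list
        if in_detected and in_undetected:
            findings.append(f"{fc}: listed as both detected and undetected")
        elif not in_detected and not in_undetected:
            findings.append(f"{fc}: not classified")

    for fc, entry in detected_fault_list.items():
        if not entry.get("diagnostic") or not entry.get("response"):
            findings.append(f"{fc}: missing diagnostic source or response")

    for fc, control in undetected_fault_list.items():
        if not control:
            findings.append(f"{fc}: no compensating control")
    return findings
```

Running a check like this on every configuration change keeps the "undetected" list explicit instead of letting blind spots disappear from the documentation.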
H2-9. Safe State Handling and Output Actions
A trip decision is only half of safety. The Logic Solver must enforce a deterministic safe-state policy that defines what outputs do after a detected hazard, how long they persist, and what evidence is required before any recovery. The goal is predictable, repeatable, and auditable behavior under real fault and transient conditions.
| Policy | What “Trip” means (logic view) | Evidence focus |
|---|---|---|
| De-energize-to-trip | Trip commands outputs into a de-energized safe state; loss-of-energy tends to align with the safe direction. | safe_state_policy, output_sequence_id |
| Energize-to-trip | Trip commands an energized state to enforce protection; correctness relies on output consistency and health under disturbances. | output_path_health, trip_output_command |
Key engineering point: the difference is not “high vs low” but the defined safe direction under loss-of-energy and disturbances.
Safe-state persistence must be explicit. Latched handling keeps the system in safe outputs until reset preconditions are proven. Auto-reset allows recovery when preconditions are satisfied, but requires validation windows to prevent oscillation during noisy boundaries.
- Reset preconditions: channel health is stable, coherency holds, and recent CM transient severity is acceptable for recovery.
- Reset traceability: every reset must be logged with timestamp, actor/mode, and reason.
- Anti-oscillation: recovery windows and throttling prevent rapid trip-reset cycles under marginal conditions.
Safe-state behavior should be modeled as a sequence, not a single output assignment. A robust policy freezes decision context, commits evidence, drives safe outputs, then manages latching and recovery. Determinism ensures the same event leads to the same sequence, while recoverability ensures the system returns only when evidence-based conditions are met.
- safe_state_policy: DE_ENERGIZE / ENERGIZE_TO_TRIP
- latch_state + reset_mode: latched behavior and reset control mode
- reset_preconditions_met: boolean + reason code (why recovery was allowed)
- output_sequence_id + sequence_step: ties behavior to a deterministic output sequence
- reset_event_log: timestamp + mode/actor + reason
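The "sequence, not a single assignment" idea can be sketched as a small controller that walks fixed steps on trip and refuses reset until preconditions hold. The step names, the class, and the three boolean preconditions are assumptions drawn from the bullets above, not a mandated design.

```python
SEQUENCE = ("FREEZE_CONTEXT", "COMMIT_EVIDENCE", "DRIVE_SAFE_OUTPUTS", "LATCHED")


class SafeStateController:
    def __init__(self, safe_state_policy="DE_ENERGIZE"):
        self.safe_state_policy = safe_state_policy
        self.latch_state = False
        # Normal (non-tripped) output is the opposite of the trip direction.
        self.output_energized = safe_state_policy == "DE_ENERGIZE"
        self.log = []

    def trip(self, context, timestamp):
        frozen = dict(context)  # freeze decision context before acting
        for step in SEQUENCE:
            if step == "DRIVE_SAFE_OUTPUTS":
                self.output_energized = self.safe_state_policy == "ENERGIZE_TO_TRIP"
            elif step == "LATCHED":
                self.latch_state = True
            self.log.append({"timestamp": timestamp, "sequence_step": step,
                             "context": frozen})

    def request_reset(self, timestamp, actor, health_ok, coherency_ok, cm_quiet):
        met = health_ok and coherency_ok and cm_quiet
        # Every reset attempt is logged, whether or not it is allowed.
        self.log.append({"timestamp": timestamp, "event_type": "RESET",
                         "actor": actor, "reset_preconditions_met": met})
        if met and self.latch_state:
            self.latch_state = False
            self.output_energized = self.safe_state_policy == "DE_ENERGIZE"
        return met
```

Because the step order is a constant, the same trip cause always produces the same logged sequence, which is what makes the behavior auditable.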
H2-10. Diagnostic Logging and Event Traceability
Logging is not a “nice to have.” It is the mechanism that turns safety decisions into verifiable evidence. A traceable Logic Solver records a time-ordered sequence of events, vote snapshots, and health context so that post-trip analysis can reconstruct what happened, why it happened, and whether the decision was consistent with the configured rules.
| Field | Meaning | Example |
|---|---|---|
| timestamp | Orders events and enables reconstruction of causality and decision windows. | t=123.456s |
| event_type | Declares the semantic class of the event (trip, reset, CM transient, channel invalid, mismatch). | TRIP |
| cause_code | Explains why the event occurred (rule result, mismatch, coherency failure, isolation fault). | VOTE_2OO3 |
| affected_channels | Lists channels implicated in the event and the decision context. | A,B |
| decision_origin | Distinguishes external voting decisions from internal integrity faults. | VOTE |
| log_sequence | Detects missing entries and supports integrity checks across the event stream. | #004218 |
Every trip should bind to a snapshot that captures inputs, voting state, and health context at the decision boundary. Without snapshots, logs show that a trip occurred but cannot prove why it was inevitable under the configured rules.
| Snapshot group | Contents | Key fields |
|---|---|---|
| Inputs snapshot | Per-channel state and quality flags (0/1/INVALID + plausibility/coherency). | input_snapshot, channel_health |
| Voting snapshot | Vote mode, decision window ID, window state, and vote result at the boundary. | vote_mode, window_id, vote_result |
| Health snapshot | Lockstep and isolation context plus recent CM event references. | lockstep_state, isolation_fault_flags, CM_log_ref |
- Build the timeline: sort by timestamp and mark key events (CM transient, channel invalid, vote snapshot, trip, safe output sequence, reset attempts).
- Bind decisions to snapshots: verify that each trip maps to exactly one snapshot and that snapshot fields match the configured vote rules and windows.
- Explain causality: determine whether the trip was driven by input disagreement, integrity faults, or coherency/CM events; confirm output sequence and latching followed policy.
Result: a defensible evidence chain that supports audits and root-cause analysis without relying on subjective interpretation.
- timestamp + log_sequence: ordered events and missing-entry detection
- event_type + cause_code + affected_channels
- input_snapshot + vote_mode + vote_result + window_id
- channel_health + isolation_fault_flags + lockstep_state
- output_sequence_id + sequence_step + latch_state + reset_event_log
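The reconstruction steps (build the timeline, detect missing entries, bind trips to snapshots) can be sketched as a single audit pass over the event stream. The event layout, the `VOTE_SNAPSHOT` event type, and the use of a `snapshot_id` key to bind a trip to its snapshot are assumptions for illustration.

```python
from collections import Counter


def audit_log(events):
    """Order events, find gaps in log_sequence, and flag trips without exactly one snapshot."""
    timeline = sorted(events, key=lambda e: (e["timestamp"], e["log_sequence"]))

    seen = {e["log_sequence"] for e in events}
    missing = [n for n in range(min(seen), max(seen) + 1) if n not in seen] if seen else []

    snapshot_counts = Counter(
        e["snapshot_id"] for e in events if e["event_type"] == "VOTE_SNAPSHOT"
    )
    # A trip is "unbound" unless exactly one snapshot carries its snapshot_id.
    unbound_trips = [
        e["log_sequence"]
        for e in timeline
        if e["event_type"] == "TRIP" and snapshot_counts[e.get("snapshot_id")] != 1
    ]
    return {"timeline": timeline, "missing_sequences": missing,
            "unbound_trips": unbound_trips}
```

A clean audit has empty `missing_sequences` and `unbound_trips`; anything else points to lost evidence before causality is even discussed.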
H2-11. Integration Considerations in SIS Architectures
Integration is defined by contracts: what inputs mean, what outputs guarantee, and how diagnostics are acknowledged. This chapter covers interface semantics and evidence fields only; it does not describe PLC/DCS network topologies or plant-wide control architectures.
A voting logic solver depends on inputs that are comparable. A usable input contract defines three layers: value semantics (0/1/INVALID), timing semantics (freshness and maximum age), and quality semantics (OK/DEGRADED/INVALID) that governs how each channel participates in voting.
| Contract item | Definition (logic view) | Evidence fields |
|---|---|---|
| Value semantics | Define allowed states and explicit INVALID behavior (e.g., stale/illegal/out-of-window becomes INVALID). | input_state |
| Timing semantics | Define update expectation and max-age; specify how stale values affect vote participation. | input_timestamp, age_ms |
| Quality semantics | Quality flag drives whether a channel is eligible for voting and how it is weighted/ignored. | input_quality, channel_health |
| Contract versioning | Inputs must be traceable to a specific contract revision to prevent silent semantic drift. | input_contract_id |
Practical rule: without a defined INVALID state and freshness rule, voting can mistake “data” for “truth.”
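The three contract layers (value, timing, quality) can be enforced at the boundary before any vote sees the input. The sketch below assumes integer millisecond timestamps and hypothetical reason strings; `qualify_input` and `eligible_for_vote` are illustrative names.

```python
def qualify_input(raw_state, input_timestamp_ms, now_ms, max_age_ms,
                  input_quality="OK"):
    """Map a raw input to 0, 1, or INVALID according to the input contract."""
    age_ms = now_ms - input_timestamp_ms

    if raw_state not in (0, 1):
        input_state, reason = "INVALID", "ILLEGAL_VALUE"   # value semantics
    elif age_ms > max_age_ms:
        input_state, reason = "INVALID", "STALE"           # timing semantics
    elif input_quality == "INVALID":
        input_state, reason = "INVALID", "QUALITY_INVALID" # quality semantics
    else:
        input_state, reason = raw_state, "OK"

    return {
        "input_state": input_state,
        "age_ms": age_ms,
        "input_quality": input_quality,
        "reason": reason,
        "eligible_for_vote": input_state != "INVALID",
    }
```

In this sketch a DEGRADED channel still participates but carries its quality flag into the snapshot; whether it should be excluded or down-weighted is a policy decision for the contract, not something the code settles.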
Outputs must be specified as a policy, not a pin toggle: what “Trip” commands (de-energize vs energize-to-trip), whether the state is latched, and what evidence allows recovery. A robust output contract also defines determinism (same cause → same sequence) and verifiability (readback or acknowledgement of output state).
| Contract item | Definition (logic view) | Evidence fields |
|---|---|---|
| Safe-state policy | Define the safe direction and how it is commanded (de-energize vs energize-to-trip). | safe_state_policy |
| Sequence determinism | Trip drives a defined ordered sequence (freeze → log → outputs) with stable steps. | output_sequence_id, sequence_step |
| Latching & reset | Specify latch behavior and the required evidence for reset (manual/auto/supervised). | latch_state, reset_mode, reset_preconditions_met |
| Verifiability | Define how output state is confirmed (readback/ack); log if confirmation is missing. | output_state_readback, output_ack, ack_status |
| Contract versioning | Outputs are traceable to a specific contract revision and policy set. | output_contract_id |
Diagnostics must form a closed loop. A “fault event” becomes actionable only when it is acknowledged and linked to a record ID that supports audits and proof tests. Handshakes should define: fault announcement fields, acknowledgement timing, escalation rules for missing ACK, and proof-test hooks (start/end markers and record IDs).
| Handshake step | Requirement | Evidence fields |
|---|---|---|
| Fault announcement | Emit event with cause code, severity, channels, and decision origin (vote vs integrity). | event_type, cause_code, severity, affected_channels, decision_origin |
| Acknowledgement | Define ACK semantics and timeouts; missing ACK must be visible and policy-driven. | diag_request_id, ack_id, ack_timeout_ms, ack_status |
| Escalation | Specify what changes when ACK is absent (latch hold, degraded mode, or recovery blocked). | escalation_state, latch_state, recovery_blocked |
| Proof-test support | Provide interface markers: enter test mode, record start/end, produce test record ID. | proof_test_record_id, test_start_ts, test_end_ts |
| Versioning | Handshake semantics must be versioned to prevent “same bits, different meaning.” | diag_handshake_version |
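The acknowledgement and escalation rows of the table can be sketched as a pure function over timestamps. The status values (`ACKED`, `ACKED_LATE`, `TIMEOUT`, `PENDING`) and the rule that recovery stays blocked until a timely ACK arrives are assumptions chosen for the example.

```python
def evaluate_ack(announce_ts_ms, ack_ts_ms, now_ms, ack_timeout_ms):
    """Classify the handshake state for one announced fault event.

    ack_ts_ms is None while no acknowledgement has been received.
    """
    if ack_ts_ms is not None:
        in_time = (ack_ts_ms - announce_ts_ms) <= ack_timeout_ms
        ack_status = "ACKED" if in_time else "ACKED_LATE"
    elif now_ms - announce_ts_ms > ack_timeout_ms:
        ack_status = "TIMEOUT"
    else:
        ack_status = "PENDING"

    escalated = ack_status in ("TIMEOUT", "ACKED_LATE")
    return {
        "ack_status": ack_status,
        "escalation_state": "ESCALATED" if escalated else "NORMAL",
        # ASSUMPTION: recovery is allowed only after a timely ACK.
        "recovery_blocked": ack_status != "ACKED",
    }
```

Keeping the evaluation stateless makes a missing ACK visible on every poll and lets the same inputs be replayed from the log during an audit.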
Example parts frequently used around Logic Solver interfaces (not system networking):
| Function | Example MPN | Why it fits this chapter |
|---|---|---|
| Industrial digital input | ISO1211 (TI) | Encodes input state/threshold behavior into a predictable logic-level contract. |
| Digital isolation (multi-ch) | ISO7741 (TI), ADuM141E (ADI), Si8642 (Skyworks/Silabs) | Supports channel independence and clean handshakes across isolation boundaries. |
| Comparator building blocks | TLV1704 (TI), LM339 (multi-vendor) | Implements window/threshold checks that feed explicit value semantics (0/1/INVALID). |
| Isolated CAN | ISO1042 (TI), ADM3053 (ADI) | Useful for isolated diagnostic/status interfaces without discussing network topology. |
| Isolated RS-485 | ISO3082 (TI), ADM2587E (ADI) | Provides isolated diagnostic/ACK signaling across domain boundaries. |
| Nonvolatile event log | MB85RS64V (Fujitsu FRAM), CY15B104Q (Infineon F-RAM) | Supports traceability records (sequence IDs, snapshots) with robust retention. |
| Watchdog supervisor | TPS3435 (TI), MAX6369 (Maxim/ADI) | Supports “handshake timeout / escalation” supervision at the interface-policy level. |
Note: MPNs are examples to anchor integration discussions; final selection depends on isolation rating, channels, data rate, and system constraints.
H2-12. FAQs
Each answer follows a fixed format: 1 conclusion sentence + 2 evidence checks + 1 first fix. (Maps back to H2-3…H2-11.)
FAQ 01 2oo3 still trips unexpectedly—logic threshold or channel drift?
Conclusion: Unexpected 2oo3 trips usually come from a boundary condition (threshold/window) or a slowly diverging channel that crosses the vote rule.
- Evidence to check: vote snapshot at decision time (A/B/C states, vote_mode, vote_result, window_id/window_state).
- Evidence to check: drift-aware comparison signals (mismatch tolerance, trend/window comparator flags) versus the configured threshold definition.
- First fix: tighten the validation window with drift-aware gating (flag channel as DEGRADED/INVALID before voting) and re-baseline thresholds with an explicit mismatch tolerance.
FAQ 02 1oo2 never trips when one channel is stuck—INVALID handling or debounce window?
Conclusion: A stuck channel that never triggers 1oo2 often indicates the stuck state is being treated as “valid” or the debounce/validation window never reaches a decisive state.
- Evidence to check: channel health flags and whether “stale/constant” behavior transitions to INVALID (age_ms, input_quality, channel_health).
- Evidence to check: debounce/validation window telemetry (window_state transitions, window duration, and any suppression due to asynchronous arrivals).
- First fix: enforce a freshness/age rule that forces a stuck channel to INVALID and make the vote rule explicitly count INVALID as ineligible for 1oo2 participation.
FAQ 03 Vote result flips near the boundary—hysteresis missing or async inputs?
Conclusion: Boundary flip-flopping is typically caused by insufficient hysteresis or unsynchronized inputs arriving in different windows.
- Evidence to check: comparator/window thresholds and hysteresis configuration (threshold definition, hysteresis value, chatter counters).
- Evidence to check: input alignment vs validation window (per-channel timestamp skew, window_id consistency across channels).
- First fix: add or increase hysteresis and require window-aligned sampling so inputs are compared within one coherent validation window.
FAQ 04 Lockstep MCU flags mismatch but inputs look normal—core fault or I/O timing skew?
Conclusion: A lockstep mismatch with “normal-looking” inputs usually points to internal execution divergence or a timing/latency skew between sampling and comparison boundaries.
- Evidence to check: lockstep compare records (compare_mismatch_flag, lockstep_state, fault timestamp and persistence).
- Evidence to check: event timeline alignment between input capture and decision (timestamp ordering, snapshot_id binding to the mismatch event).
- First fix: freeze and log a synchronized snapshot on mismatch, then tighten the sampling-to-compare schedule so both cores and I/O capture share the same decision boundary.
FAQ 05 Lockstep is “healthy,” but voting disagrees—peripheral fault coverage gap or input contract violation?
Conclusion: Voting disagreement with a healthy lockstep core often indicates a peripheral/I/O fault not covered by lockstep or an input contract (value/timing/quality) mismatch.
- Evidence to check: core vs peripheral coverage indicators (lockstep_state OK while I/O timing/quality flags degrade).
- Evidence to check: input contract compliance (input_state validity, age_ms freshness, input_quality transitions into DEGRADED/INVALID).
- First fix: enforce the input contract at the boundary (reject stale/low-quality inputs before voting) and add explicit peripheral self-check/handshake flags into channel_health.
FAQ 06 Discrete voting works in the lab but fails in the plant—CM transient or isolation fault flags?
Conclusion: Field-only failures usually trace to common-mode (CM) events or isolation-domain anomalies that corrupt apparent input agreement.
- Evidence to check: CM transient logs (severity/occurrence near trip, correlation with channel invalidations).
- Evidence to check: isolation health/fault flags (isolation_fault_flags, channel_health transitions, and whether an isolation fault is reported as a fault rather than read as an open input, i.e. “fault ≠ open” semantics).
- First fix: gate voting with isolation/CM-aware health checks so a CM event forces channels to DEGRADED/INVALID before the vote decision is accepted.
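One way to express this gating is a per-channel health function that is evaluated before the vote result is accepted. The sketch assumes a fixed hold-off (`CM_GUARD_MS`) after each CM transient and treats any non-zero `isolation_fault_flags` as INVALID; the guard time, `ChannelStatus`, `channel_health`, and `vote_accepted` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional

CM_GUARD_MS = 20  # hypothetical hold-off after a CM transient


@dataclass
class ChannelStatus:
    isolation_fault_flags: int       # non-zero = isolator reports a fault
    last_cm_event_ms: Optional[int]  # timestamp of the latest CM transient, if any


def channel_health(status: ChannelStatus, now_ms: int) -> str:
    """Map isolation and CM evidence to OK / DEGRADED / INVALID."""
    if status.isolation_fault_flags != 0:
        return "INVALID"
    if (status.last_cm_event_ms is not None
            and now_ms - status.last_cm_event_ms < CM_GUARD_MS):
        return "DEGRADED"
    return "OK"


def vote_accepted(statuses: List[ChannelStatus], now_ms: int) -> bool:
    """Accept the vote decision only if every participating channel is OK."""
    return all(channel_health(s, now_ms) == "OK" for s in statuses)
```

The point of the design is ordering: health is derived first, so apparent agreement between channels during a CM event cannot be accepted as a valid vote.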
FAQ 07 Repeated spurious trips after maintenance—voting window or reset policy changed?
Conclusion: Post-maintenance spurious trips commonly come from altered validation windows/debounce or a reset policy that allows recovery before stability is proven.
- Evidence to check: configured debounce/validation window parameters and window_state history around each trip.
- Evidence to check: latch/reset evidence (reset_mode, reset_preconditions_met, recovery_window_ms, reset_event_log).
- First fix: require a supervised reset or stricter reset preconditions, plus a recovery window, and re-validate the voting window configuration against expected input timing.
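A reset gate of this kind can be sketched as a function that returns both the decision and the reasons for a refusal, so the result can be written to reset_event_log. The field names `reset_mode` and `recovery_window_ms` follow the evidence list above; `ResetRequest`, the reason strings, and the 5000 ms window are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple

RECOVERY_WINDOW_MS = 5000  # hypothetical minimum stable time before reset


@dataclass
class ResetRequest:
    reset_mode: str         # "SUPERVISED" or "AUTO"
    operator_ack: bool      # operator confirmation received
    channels_healthy: bool  # all voting channels currently healthy
    stable_since_ms: int    # time at which inputs last became healthy
    now_ms: int


def reset_preconditions_met(req: ResetRequest) -> Tuple[bool, List[str]]:
    """Return (allowed, blocking reasons); an empty reason list means allowed."""
    reasons = []
    if req.reset_mode == "SUPERVISED" and not req.operator_ack:
        reasons.append("NO_OPERATOR_ACK")
    if not req.channels_healthy:
        reasons.append("CHANNEL_UNHEALTHY")
    if req.now_ms - req.stable_since_ms < RECOVERY_WINDOW_MS:
        reasons.append("RECOVERY_WINDOW_NOT_ELAPSED")
    return (not reasons, reasons)
```

Logging the reason list with each refused reset makes a changed reset policy visible after maintenance, because the same request would previously have produced different reasons.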
FAQ 08 Trip happens, but outputs don’t match the expected sequence—output contract or latch state mismatch?
Conclusion: Output misbehavior after a valid trip is usually a contract mismatch (semantics changed) or an unverified output path (no readback/ACK) that breaks determinism.
- Evidence to check: output sequence trace (output_sequence_id, sequence_step progression, trip_output_command snapshot).
- Evidence to check: output confirmation (output_state_readback/output_ack and whether missing ACK triggered escalation/latch hold).
- First fix: enforce the output contract version and require readback/ACK for each critical step, blocking auto-reset when confirmation is missing.
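The readback requirement can be illustrated with a sequence runner that drives every step, records which steps were not confirmed, and blocks auto-reset when any confirmation is missing. The `drive` and `readback` callables stand in for a hardware abstraction that this page does not define, and the result keys are hypothetical.

```python
from typing import Callable, Dict, List, Tuple


def run_trip_sequence(
    steps: List[Tuple[str, int]],
    drive: Callable[[str, int], None],
    readback: Callable[[str], int],
) -> Dict[str, object]:
    """Drive each output to its safe value and verify it by readback.

    All steps are always driven; an unconfirmed step does not stop the
    sequence, it only blocks auto-reset (assumption for this sketch).
    """
    unconfirmed = []
    for sequence_step, (output_name, safe_value) in enumerate(steps):
        drive(output_name, safe_value)
        if readback(output_name) != safe_value:
            unconfirmed.append(sequence_step)
    return {
        "unconfirmed_steps": unconfirmed,
        "auto_reset_allowed": not unconfirmed,
    }
```

A stuck output then shows up as an entry in `unconfirmed_steps` rather than as a silent mismatch between the commanded and the actual sequence.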
FAQ 09 Diagnostic coverage looks high, but audit still fails—what evidence is missing?
Conclusion: Audits fail when coverage claims are not backed by traceable evidence—especially “detected vs undetected” lists and decision snapshots tied to timestamps.
- Evidence to check: DC categories and the detected/undetected fault inventory (what is covered online vs only by proof test).
- Evidence to check: evidence chain completeness (log_sequence gaps, snapshot presence for each trip, cause_code consistency).
- First fix: publish a bounded fault coverage matrix and ensure every trip binds to a vote snapshot and ordered event records with missing-entry detection.
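Missing-entry detection and snapshot binding can be checked offline with a small audit pass over the event log. The record layout below (dictionaries with `log_sequence`, `event_type`, and `snapshot_id`) is an assumed format for illustration, as are the finding strings.

```python
from typing import Dict, List


def audit_event_log(records: List[Dict]) -> List[str]:
    """Report sequence gaps and trips without a bound snapshot.

    Assumes records are already ordered by log_sequence and that
    consecutive entries differ by exactly 1.
    """
    findings = []
    previous = None
    for rec in records:
        seq = rec["log_sequence"]
        if previous is not None and seq != previous + 1:
            findings.append(f"GAP after {previous}: next is {seq}")
        if rec["event_type"] == "TRIP" and rec.get("snapshot_id") is None:
            findings.append(f"TRIP {seq} has no snapshot_id")
        previous = seq
    return findings
```

An empty result is the evidence an auditor is looking for: every trip has a snapshot, and no record between trips is missing.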
FAQ 10 Proof test passes, yet undetected faults remain—startup vs online diagnostics gap?
Conclusion: Passing proof tests does not eliminate undetected faults when the diagnostic plan leaves gaps between startup checks and online monitoring.
- Evidence to check: which fault classes are only covered at startup versus continuously online (DC category mapping and intervals).
- Evidence to check: proof-test records tied to events (proof_test_record_id, test_start/end timestamps, and linked cause_code results).
- First fix: add targeted online diagnostics for the remaining gap classes and tie proof-test execution to immutable log records and evidence snapshots.
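The startup-versus-online gap can be made explicit by recording, for each fault class, which diagnostic mechanisms cover it, and then listing the classes that have no online coverage. The fault class names and their coverage sets below are invented placeholders, not a real coverage matrix; `coverage_gaps` is a hypothetical helper.

```python
from typing import Dict, List, Set

# Placeholder entries for illustration only; a real matrix comes from
# the project's own fault analysis.
COVERAGE: Dict[str, Set[str]] = {
    "ram_bit_flip": {"startup", "online"},
    "adc_reference_drift": {"proof_test"},
    "output_driver_stuck": {"startup"},
    "isolator_degradation": {"proof_test", "online"},
}


def coverage_gaps(coverage: Dict[str, Set[str]]) -> List[str]:
    """Fault classes that no continuous online diagnostic covers."""
    return sorted(
        fault for fault, mechanisms in coverage.items()
        if "online" not in mechanisms
    )
```

The returned list is the candidate set for targeted online diagnostics; anything left on it after the design review should appear in the undetected-fault inventory.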
FAQ 11 A channel shows “OK,” but behaves stale—timestamp/age_ms contract or logging granularity?
Conclusion: “OK but stale” indicates a contract violation (freshness not enforced) or insufficient logging granularity to expose age and window alignment.
- Evidence to check: input freshness fields (input_timestamp, age_ms) and the rule that transitions stale inputs into DEGRADED/INVALID.
- Evidence to check: timeline resolution and snapshot binding (timestamp order, window_id consistency, snapshot_id presence per decision).
- First fix: enforce a hard age_ms threshold in the input contract and log age/window_id at every decision boundary so stale behavior becomes visible and vote-ineligible.
FAQ 12 After a CM event, the system recovers too quickly—recovery window or reset preconditions too weak?
Conclusion: Fast recovery after a CM event is a policy weakness: reset is allowed before channel health and coherency have been stable for long enough.
- Evidence to check: CM transient severity and proximity to reset (CM log entries correlated to reset_event_log and timeline).
- Evidence to check: reset gate conditions (reset_preconditions_met reasons, recovery_window_ms, latch_state transitions).
- First fix: require a CM-aware recovery window and strengthen reset preconditions so CM events force a minimum stability period before outputs can exit safe state.
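A CM-aware recovery window can be modeled as a stability timer that restarts whenever a CM event or an unhealthy sample is seen. This is a sketch under the assumption that timestamps are monotonic milliseconds; `StabilityTimer` and its method names are hypothetical.

```python
from typing import Optional


class StabilityTimer:
    """Tracks how long channels have been continuously healthy since the last CM event."""

    def __init__(self, recovery_window_ms: int):
        self.recovery_window_ms = recovery_window_ms
        self.stable_since_ms: Optional[int] = None

    def on_cm_event(self, now_ms: int) -> None:
        # A CM event invalidates accumulated stability; it must be re-proven.
        self.stable_since_ms = None

    def on_sample(self, channels_healthy: bool, now_ms: int) -> None:
        if not channels_healthy:
            self.stable_since_ms = None
        elif self.stable_since_ms is None:
            self.stable_since_ms = now_ms

    def may_exit_safe_state(self, now_ms: int) -> bool:
        return (self.stable_since_ms is not None
                and now_ms - self.stable_since_ms >= self.recovery_window_ms)
```

Because the timer restarts on every CM event, a burst of transients keeps the outputs in safe state until a full, uninterrupted window of healthy samples has elapsed.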