
Flight Control Computer (FCC): Safety Computing for Flight Control


A Flight Control Computer (FCC) is a safety-critical, deterministic control node that must detect faults, contain them, and keep outputs bounded under real-world disturbances. Lockstep compute, ECC memory, supervised power/reset, watchdog/FDIR, redundancy voting, and evidence-grade logging together turn “it runs” into “it can be proven safe and diagnosable in the field.”

H2-1 · What an FCC is (and what it is not)

Chapter goal

A Flight Control Computer (FCC) is the safety-critical compute node inside the flight-control closed loop. It must turn time-bounded inputs into actuator commands with deterministic timing, and it must handle faults in a controlled, traceable way (detect → contain → command-safe → log evidence).

Scope boundary (Ctrl+F verifiable)

In-scope (FCC owns)

  • Closed-loop compute: sample → compute → command update, with a fixed control period.
  • Determinism: bounded latency, jitter control, timeouts, and safe degradation rules.
  • Safety mechanisms: lockstep compare, ECC (detect/correct), watchdog supervision.
  • Power/reset control: sequencing, PG/RESET tree, brownout behavior, safe bring-up.
  • Fault handling: detection → isolation/containment → output inhibit/degrade → recovery.
  • Evidence chain: event logging (reset cause, ECC counters, watchdog trips, mismatch flags).

Out-of-scope (handled by other pages)

  • Sensor front ends: IMU/AHRS AFE/ADC choices, sensor fusion algorithms, calibration math.
  • Network infrastructure: AFDX/ARINC 664 switch architecture, bus protocol deep dives.
  • Display pipelines: HUD/cockpit video interfaces, rendering/graphics processing.
  • RF payload chains: radar/EW Tx/Rx, channelizers, anti-jam RF front-end details.
  • Timing distribution: PTP/SyncE system-wide clock network design.

Practical rule: if a topic does not change the FCC’s determinism, fault reaction, or evidence logging, it does not belong on this page.

The three FCC jobs

1) Sample the control-relevant inputs at a defined moment and validate freshness/consistency.

2) Compute commands within a bounded time budget; handle overruns as a safety event, not a “software glitch”.

3) Drive actuator command outputs, including a safe inhibit/degrade path that is independent and predictable.
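The three jobs above can be sketched as a single guarded cycle. The types, constants, and the `run_cycle` helper below are illustrative assumptions, not from any real FCC codebase; the point is the ordering of the checks, not the numbers.

```c
#include <stdint.h>
#include <stdbool.h>

#define LOOP_BUDGET_US   2000u   /* illustrative compute budget */
#define MAX_INPUT_AGE_US 1000u   /* illustrative freshness window */

typedef struct { uint32_t timestamp_us; bool valid; } input_frame_t;
typedef struct { int16_t surface_cmd; } cmd_frame_t;

typedef enum { CYCLE_OK, CYCLE_STALE_INPUT, CYCLE_OVERRUN } cycle_result_t;

/* Classify one sample -> compute -> publish cycle; anything but CYCLE_OK
 * means the caller must command the safe output path instead. */
cycle_result_t run_cycle(const input_frame_t *in, uint32_t now_us,
                         uint32_t compute_elapsed_us, cmd_frame_t *out)
{
    /* 1) Validate freshness: late or invalid data is treated as unsafe data. */
    if (!in->valid || (now_us - in->timestamp_us) > MAX_INPUT_AGE_US) {
        return CYCLE_STALE_INPUT;
    }
    /* 2) Compute must fit the budget: an overrun is a safety event,
     *    not a reason to "keep trying". */
    if (compute_elapsed_us > LOOP_BUDGET_US) {
        return CYCLE_OVERRUN;
    }
    /* 3) Publish the command frame (placeholder computation). */
    out->surface_cmd = 0;
    return CYCLE_OK;
}
```

Note that freshness is checked before compute time: stale inputs must never reach the controller, regardless of how fast the loop ran.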

Figure F1 (diagram): FCC boundary and dataflow. Inputs (sensors, pilot commands, health/status) feed the safety-critical compute; outputs are actuator commands plus inhibit/degrade status; a separate monitor + event log path feeds maintenance readout; the control loop period has bounded jitter.
Figure F1 shows the FCC’s boundary: deterministic input→compute→output flow, plus a separate fault/evidence path (monitor + event log) for traceability.

H2-2 · Safety goals drive the architecture (DAL mindset → design constraints)

Chapter goal

Safety requirements should not remain abstract labels. They translate into non-negotiable architectural constraints: independence, diagnostic coverage, bounded reaction time, and a verifiable evidence chain. In an FCC, those constraints directly justify lockstep, ECC, independent watchdog/power supervision, and structured event logging.

Constraint → mechanism → evidence (engineering proof chain)
Safety constraint (generic) → FCC mechanism (what must exist) → Verifiable evidence (how to prove)

  • Constraint: No single fault may produce a hazardous command; faults must be detected and contained before unsafe output.
    Mechanism: Lockstep compare + controlled output inhibit/degrade path; independent fault handler entry.
    Evidence: Inject a mismatch / illegal state → output inhibit occurs within bounded time; event log records cause + action.
  • Constraint: Latent faults must become observable; silent corruption must turn into counters/events.
    Mechanism: ECC on critical memories; scrubbing strategy; health counters exposed to logging.
    Evidence: ECC corrected/uncorrected counters change under injection; thresholds trigger an event with a context snapshot.
  • Constraint: Reaction time must be bounded; overruns and stalls are safety events.
    Mechanism: Windowed watchdog; time budget monitors; overrun detection hooks tied to safe handling.
    Evidence: Force a compute overrun → watchdog/monitor triggers within the defined window; log includes loop ID and timing.
  • Constraint: Power anomalies must not create ambiguous states; brownout is riskier than a clean reset.
    Mechanism: Safety PMIC/supervisor with sequencing, PG/RESET tree, brownout detection, deterministic reset path.
    Evidence: Brownout profile test → consistent reset behavior; logs show reset cause + rails that dropped (PG history).
  • Constraint: Redundancy must avoid hidden coupling; common-cause paths must be minimized.
    Mechanism: Independent supervision domains (power/reset/clock monitoring concept), cross-monitor checks, voting-ready outputs.
    Evidence: Remove one channel / fault one supervisor → the other channel maintains bounded behavior; the mismatch is recorded.
  • Constraint: Certification-grade traceability; decisions must be explainable after the fact.
    Mechanism: Event IDs, counters, state snapshots, and a consistent log readout interface (concept).
    Evidence: Any safety action produces an auditable record: what happened, when, what was commanded, and why.

Writing rule for the rest of the page: every “important” block must answer three questions — what constraint it serves, how it works, and how to verify it.

Figure F2 (diagram): Requirement-to-mechanism map. Safety constraints (single-fault safe output, latent fault detection, bounded reaction time, traceable evidence) map to FCC mechanisms (lockstep compare, ECC + scrub counters, watchdog + time monitor, safety PMIC/supervisor, event log); verification is fault injection + counters + logs.
Figure F2 ties abstract safety constraints to concrete FCC mechanisms. Keeping the mapping explicit prevents “feature dumping” and makes verification measurable.

H2-3 · Compute core: lockstep MCU/SoC and fault containment

Why this matters

Lockstep is not a “performance feature.” It is a measurable safety mechanism that turns silent compute faults into an observable mismatch, forcing the FCC into a controlled reaction path: detect → enter safety handler → inhibit/degrade outputs → log evidence. The value is only real when the reaction time is bounded and the output path is contained.

Keywords: compare latency · bounded reaction time · IRQ/NMI routing · output containment · evidence logging
Lockstep types (concept-level)

Dual-core lockstep

Two compute cores run the same flow and their results are compared. A mismatch is treated as a safety event and triggers a fault reaction path. The critical design question is not “how fast,” but how quickly and reliably the mismatch forces a safe output state.

Delayed lockstep

The second core runs with an intentional delay. This can change coverage for certain transient behaviors, but it requires clean synchronization points and disciplined buffering so that the comparison remains meaningful and does not introduce ambiguity in the control loop timeline.

This page stays at the safety path level (compare → exception → safe output). Micro-architecture internals are intentionally out-of-scope.

Fault containment (what “contained” means)

1) Time containment

  • Overrun is a safety event: if control computation exceeds its time budget, the system must transition to a defined safe action, not “keep trying.”
  • Reaction time is bounded: mismatch detection must lead to a safety handler within a predictable upper bound.

2) State containment

  • Normal → Fault-handling is a deliberate state transition. Partial / ambiguous states must be avoided.
  • Fault handling should rely on minimal, deterministic code paths and avoid complex dependencies.

3) Output containment

  • The output chain needs a safe output mode (inhibit or degraded commands) that does not depend on “application tasks being healthy.”
  • When mismatch is detected, unsafe actuator commands must stop before any recovery attempts are made.
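The containment ordering described above (stop unsafe output first, record evidence second) can be sketched in miniature. The types and the `lockstep_compare` name are hypothetical; a real comparator is hardware, and this only models the reaction ordering.

```c
#include <stdint.h>
#include <stdbool.h>

typedef enum { OUT_NORMAL, OUT_INHIBITED } output_state_t;

typedef struct {
    output_state_t out;        /* current output authority state */
    uint32_t mismatch_events;  /* evidence counter for logging */
} fcc_state_t;

/* Mismatch reaction path: compare -> inhibit -> record, in that order.
 * Output containment must not wait for logging or recovery attempts. */
bool lockstep_compare(fcc_state_t *s, uint32_t result_a, uint32_t result_b)
{
    if (result_a != result_b) {
        s->out = OUT_INHIBITED;   /* stop unsafe commands first */
        s->mismatch_events++;     /* then record the evidence */
        return false;
    }
    return true;
}
```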
Key engineering metrics (no numbers required)
Metric → what it controls in an FCC → what to check / prove

  • Compare latency: how late a compute mismatch can be detected after it occurs; defines the "latest possible" detection time inside the control period. Prove: the mismatch becomes observable quickly enough to preserve safe output behavior within the loop budget.
  • Error reaction time: time from mismatch detection to the FCC reaching a defined safe output state (inhibit/degrade). Prove: a worst-case mismatch triggers a bounded reaction that is independent of normal application scheduling.
  • Exception routing (IRQ/NMI): whether a mismatch can bypass "normal software" and reliably enter a safety handler even under overload. Prove: the fault path remains effective even when the main control task is stalled, overloaded, or in a bad state.
  • Fault capture fidelity: ability to log actionable evidence (mismatch type, timing, state snapshot) without adding unstable jitter. Prove: the event log records cause + action with consistent ordering, and logging overhead does not break determinism.

Practical rule: a safety mechanism is only as strong as its fault reaction path and output containment.

Lockstep quick card (FAQ-style)

What it catches

  • Random compute faults that lead to different results on Core A vs Core B.
  • Transient upsets that flip internal state and produce a mismatch during the compare window.

What it misses

  • Common-mode failures (shared power/reset/clock disturbance) that can make both cores fail similarly.
  • Same-software defects where both cores produce the same wrong result.
  • Faults in non-lockstep peripherals unless those are independently monitored.

What to verify

  • Injected mismatch triggers IRQ/NMI and enters the safety handler without relying on normal tasks.
  • Outputs reach a defined safe state within a bounded time.
  • Event logging captures cause + action + timing without violating the loop budget.
Figure F3 (diagram): Lockstep compare and fault reaction path. Core A and Core B results feed a comparator; a mismatch raises a fault IRQ/NMI with bounded reaction time into the safety handler, which inhibits/degrades outputs and writes cause + action to the event log; non-lockstep peripherals (timers, IO, DMA, interfaces) are monitored separately.
Figure F3 separates the lockstep mismatch detection path (compare → IRQ/NMI → safety handler) from non-lockstep peripheral risks that require independent monitoring.

H2-4 · Memory system: ECC, buffers, and deterministic data handling

Why this matters

ECC is valuable in an FCC because it changes memory faults from “silent corruption” into an observable and auditable health signal. A practical FCC memory design treats ECC as a detect → correct → record pipeline and then enforces determinism: the worst-case correction/refresh/bus contention behavior must still stay inside the loop’s latency and jitter budgets.

Keywords: detect → correct → log · scrub strategy · frame consistency · determinism guardrails
Where ECC matters most (risk-aware view)

SRAM (on-chip state)

  • Holds control states and critical tables that influence the loop immediately.
  • ECC/parity prevents one-bit upsets from becoming a silent wrong command; counters provide early warning.

External memory (bandwidth + refresh + contention)

  • Large buffers may experience contention and refresh-related jitter; ECC is necessary but not sufficient.
  • Determinism requires access discipline: priority, bandwidth budgeting, and predictable refresh impact.

Nonvolatile storage (configuration integrity)

  • Corruption can turn into a wrong parameter set rather than a crash; verification must focus on detectability and traceability.
  • This page stays on the FCC evidence chain and determinism impact; storage security details are out-of-scope.
Three cards: ECC coverage · scrub strategy · determinism guardrails

ECC covers what

  • Control-relevant states: mode/state machine variables and safety-reaction flags.
  • Command buffers: the data structure that feeds outputs must not silently corrupt.
  • Health counters: corrected/uncorrected counts must remain consistent and readable.

Verification: inject bit errors and confirm detection/correction plus consistent counter reporting.

Scrub strategy

  • When: scrub in a dedicated maintenance time slice or controlled slack windows, not inside the tightest compute section.
  • What first: prioritize memory regions holding long-lived safety states and lookup tables.
  • Observable output: corrected/uncorrected counters must feed thresholds and generate events with context.

Verification: scrubbing produces measurable counter movement under injection; threshold crossing generates a structured log event.

Determinism guardrails (budget/latency)

  • Worst-case correction latency must be accounted for: ECC correction/retry cannot silently push the loop beyond deadline.
  • Bus contention must be bounded: background tasks (logging, scrubbing) must not starve control data access.
  • Refresh impact must be predictable: refresh-related jitter must not create timeout storms or unsafe mode thrashing.

Verification: stress the memory subsystem while running the control loop; demonstrate bounded jitter and no unsafe timeouts.
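The counter-and-threshold idea behind ECC observability can be sketched as a small classifier. The threshold value and all names below are illustrative placeholders, not real budgets.

```c
#include <stdint.h>

typedef struct {
    uint32_t corrected;    /* trend signal: rising = early warning */
    uint32_t uncorrected;  /* any nonzero value = safety event */
} ecc_counters_t;

typedef enum { ECC_OK, ECC_TREND_WARN, ECC_SAFETY_EVENT } ecc_verdict_t;

#define CORRECTED_WARN_THRESHOLD 16u  /* illustrative threshold */

/* Turn raw counter movement into an auditable verdict: corrected errors
 * feed a trend threshold, any uncorrected error escalates immediately. */
ecc_verdict_t ecc_evaluate(const ecc_counters_t *c)
{
    if (c->uncorrected > 0) {
        return ECC_SAFETY_EVENT;   /* block publishing, enter policy */
    }
    if (c->corrected >= CORRECTED_WARN_THRESHOLD) {
        return ECC_TREND_WARN;     /* generate event with context */
    }
    return ECC_OK;
}
```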

Deterministic data handling: double buffer + frame consistency

Real-time safety depends on consistent snapshots. Double-buffering prevents the controller from reading “half-updated” inputs or states. Each control cycle should consume a single coherent frame (snapshot) and produce a command frame that can be validated for freshness and integrity.

Practical patterns (concept-level)

  • Input snapshot: validate age/freshness, then lock a frame for the cycle.
  • Compute on stable data: no mid-cycle partial updates from background traffic.
  • Command frame: publish atomically; if faults occur, the safe output mode must override the frame.
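One common way to realize the snapshot pattern is a two-slot buffer with an atomic index. This is a single-writer sketch with hypothetical names, not a production lock-free design; it only illustrates "fill the inactive slot, then flip atomically."

```c
#include <stdint.h>
#include <stdatomic.h>

typedef struct { uint32_t seq; int16_t cmd; } frame_t;

/* Two frame slots plus an atomic index: the writer fills the inactive
 * slot, then publishes it with a single atomic store, so a reader always
 * sees a complete frame, never a half-updated one. */
typedef struct {
    frame_t slot[2];
    _Atomic int active;   /* index of the slot readers may consume */
} double_buffer_t;

void db_publish(double_buffer_t *db, frame_t f)
{
    int next = 1 - atomic_load(&db->active);
    db->slot[next] = f;                /* fill the inactive slot */
    atomic_store(&db->active, next);   /* atomic flip = publish */
}

frame_t db_snapshot(double_buffer_t *db)
{
    return db->slot[atomic_load(&db->active)];
}
```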
Figure F4 (diagram): ECC → scrub → counters → thresholds → event log loop (memory with ECC detect/correct, scheduled scrub windows, corrected-trend and uncorrected-event counters, thresholds feeding the event log and maintenance readout), alongside deterministic data handling (double buffer, atomic publish, snapshot frame consistency, bounded jitter).
Figure F4 shows two parallel requirements: (1) ECC makes memory health observable via counters and thresholds that feed the event log, and (2) double-buffered snapshots preserve frame consistency and deterministic timing.

H2-5 · Power, sequencing, reset: safety PMIC + supervised startup/shutdown

Why this matters

The most dangerous FCC failures often originate from power, reset, and brownout behavior, not from a lack of compute. A safety PMIC/supervisor makes power-up and power-down predictable and traceable: each transition follows a defined sequence, outputs remain inhibited until health is confirmed, and every anomaly becomes an auditable event.

Keywords: multi-rail domains · POR/BOD · PG + reset tree · output gating · reset-cause logging
FCC internal rails (scope-limited)

Core domain

  • Must reach a stable operating region before releasing control computation.
  • Reset release must be consistent to avoid early lockstep mismatch behavior.

IO / interface domain

  • Should remain inhibited until the core is stable and health checks are complete.
  • Prevents “ghost outputs” during partial startup or brownout recovery.

Aux / supervision domain

  • Supports supervision and safe-state controls that must remain meaningful during faults.
  • Conceptually enables independence for monitoring and evidence capture.

This section stays inside the FCC module: it does not expand into aircraft-wide distribution or surge front-end design.

Key pitfalls (what actually breaks systems)

Brownout “half-alive” state

  • Some logic may keep partial state while other blocks collapse, creating undefined state machines.
  • A clean reset is safer than allowing ambiguous mid-voltage operation.

Inconsistent reset release

  • If reset is released unevenly across compute and memory, lockstep can diverge before software has control.
  • The reset tree must be designed and verified as a system, not as “one pin.”

Ghost outputs during startup

  • IO rails can become active while the core is not stable, producing unintended actuator signaling.
  • Use output gating and safe-state control to prevent unsafe commands until health is confirmed.
Supervised startup/shutdown in three phases

Phase 1: Pre-check

  • Monitor: brownout status, previous reset cause, rail readiness flags (as available), supervisor status.
  • Action: hold reset, keep outputs gated, block command publishing.
  • Evidence: log reset cause and last known power anomaly markers.

Phase 2: Ramp / sequence

  • Monitor: rail order, PG validity, timeouts, and sequencing state.
  • Action: if PG fails or times out, force safe-state and re-assert reset rather than “continuing partially.”
  • Evidence: log which rail did not reach PG and whether the transition was aborted or retried.

Phase 3: Health confirm

  • Monitor: stable rail flags, supervision domain alive, basic self-check readiness signals (concept-level).
  • Action: release reset in a controlled order; enable outputs only after explicit “health confirmed.”
  • Evidence: log a “startup complete” marker with key health counters snapshot (minimal set).

Shutdown should be symmetrical: remove output authority first (safe-state), then collapse rails in a defined order, and record the reason.
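The three phases above can be modeled as a tiny state machine in which any failed check falls back to the safe state; all names and signals here are illustrative, standing in for real supervisor status lines.

```c
#include <stdbool.h>

typedef enum { PH_PRECHECK, PH_RAMP, PH_CONFIRM, PH_RUN, PH_SAFE } phase_t;

/* One step of a supervised startup: a failed check at any phase
 * re-asserts the safe state (reset held, outputs gated) rather than
 * "continuing partially". */
phase_t startup_step(phase_t p, bool brownout_clear, bool pg_all_good,
                     bool health_confirmed)
{
    switch (p) {
    case PH_PRECHECK: return brownout_clear   ? PH_RAMP    : PH_SAFE;
    case PH_RAMP:     return pg_all_good      ? PH_CONFIRM : PH_SAFE;
    case PH_CONFIRM:  return health_confirmed ? PH_RUN     : PH_SAFE;
    default:          return p;   /* RUN and SAFE are handled elsewhere */
    }
}

/* Outputs are enabled only after explicit health confirmation. */
bool outputs_enabled(phase_t p) { return p == PH_RUN; }
```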

Figure F5 (diagram): Multi-rail sequencing and reset tree. A safety PMIC/supervisor applies sequencing, POR/brownout detection, and PG + RESET logic to the core, IO, and aux rails; the reset tree drives the MCU/SoC, memories, and output gate, with an independent watchdog supply so supervision remains valid.
Figure F5 shows how a safety PMIC/supervisor enforces sequencing and uses PG/RESET as a system-level control to prevent brownout ambiguity and ghost outputs.

H2-6 · Watchdogs, monitors, and fault handling (FDIR flow)

Why this matters

A watchdog is not “one pin that resets the board.” In an FCC it is part of a monitoring and fault-handling system that detects timing failures, isolates unsafe behavior, triggers controlled recovery or degradation, and records evidence for certification and field diagnosis. The key is to cover wrong time failures with bounded actions and clear logs.

Keywords: windowed watchdog · external independent watchdog · heartbeat · fault classification · output inhibit · evidence chain
Watchdog types (concept-level)

Software heartbeat

  • Shows a task is progressing, but may not prove overall timing health.
  • Best used as a layer, not as the last line of defense.

Windowed watchdog

  • Must be serviced within a defined window, catching both “too slow” and “too fast” behavior.
  • Turns timing drift into a bounded fault path rather than a silent performance issue.

External independent watchdog

  • Conceptually independent in clocking/supply, so it can act when internal timing is compromised.
  • Provides a predictable response when software and internal supervision are no longer trustworthy.
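The windowed watchdog's "too fast is also a fault" rule can be sketched as an acceptance window check. The names are hypothetical; real windowed watchdogs are hardware peripherals, and this only models the servicing rule.

```c
#include <stdint.h>
#include <stdbool.h>

/* A service (kick) is valid only inside [min_us, max_us] after the
 * previous kick; arriving too early is as suspicious as arriving late. */
typedef struct {
    uint32_t last_kick_us;
    uint32_t min_us;
    uint32_t max_us;
} wwdg_t;

bool wwdg_kick(wwdg_t *w, uint32_t now_us)
{
    uint32_t dt = now_us - w->last_kick_us;
    if (dt < w->min_us || dt > w->max_us) {
        return false;   /* timing fault: route to the fault reaction path */
    }
    w->last_kick_us = now_us;
    return true;
}
```

A plain timeout-only watchdog would accept the "too fast" case, which is exactly how a runaway task spinning through its loop can look healthy.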
Fault tree cards (source → detector → action → evidence)
Fault source → detector → action → evidence

  • Control loop overrun (deadline missed). Detector: time monitor / windowed watchdog. Action: enter degraded mode or inhibit output authority; prevent repeated unsafe scheduling. Evidence: overrun marker + loop ID + timing context.
  • Lockstep mismatch (compute divergence). Detector: comparator → IRQ/NMI. Action: immediate safety handler; inhibit outputs; optional controlled reset depending on policy. Evidence: mismatch cause + action + ordering relative to output gating.
  • Watchdog timeout (stalled execution). Detector: external/independent watchdog. Action: force a deterministic reset or safe-state transition; avoid "half-alive" continuation. Evidence: reset cause and last valid heartbeat timestamp (if available).
  • ECC uncorrected (data integrity failure). Detector: ECC logic + thresholds. Action: isolate the affected region or transition to safe mode; block publishing of corrupted frames. Evidence: uncorrected count snapshot + threshold crossing event.
  • Brownout detected (voltage unsafe). Detector: supervisor BOD / PG. Action: hold reset and gate outputs; restart with a supervised sequence, not partial recovery. Evidence: PG history markers and brownout cause flag.
  • Repeated transient faults (trend indicates worsening). Detector: counters + rate thresholds. Action: escalate from retry to degrade; avoid endless reboot loops. Evidence: fault rate, escalation decision, and final operating mode.

FDIR should treat “reset” as one tool among several. The primary goal is to keep outputs safe and make every transition explainable.

Fault handling policy (concept-level)

Latched vs retryable

  • Latched faults: remain active until explicit conditions are met; protects against repeated unsafe oscillations.
  • Retryable faults: allow controlled restart attempts with a bounded retry budget and escalation on repetition.

Degraded mode vs safe output

  • Degraded mode: reduces authority and limits outputs while continuing operation with stricter supervision.
  • Safe output: inhibits or forces a safe command state when correct behavior cannot be guaranteed.
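The latched/retryable split and bounded escalation might look like this in miniature; the retry budget and all names are illustrative.

```c
#include <stdint.h>

typedef enum { ACT_RETRY, ACT_DEGRADE, ACT_SAFE } action_t;

#define RETRY_BUDGET 3u   /* illustrative bounded retry budget */

typedef struct {
    uint32_t retries;   /* retry attempts consumed so far */
    uint8_t  latched;   /* nonzero = fault stays active until cleared */
} fault_policy_t;

/* Escalation rule: a latched fault goes straight to safe output; a
 * retryable fault consumes its bounded budget, then escalates to
 * degraded mode instead of looping on resets forever. */
action_t fault_decide(fault_policy_t *p)
{
    if (p->latched) {
        return ACT_SAFE;
    }
    if (p->retries < RETRY_BUDGET) {
        p->retries++;
        return ACT_RETRY;
    }
    return ACT_DEGRADE;
}
```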
Figure F6 (diagram): FDIR state machine. Normal ("in budget") → Suspect (overrun) → Isolate (mismatch) → Recover/Degrade (retry/reduced) → Safe (output inhibit), with "recovered" and "escalate" transitions and evidence (event ID, counters, reset cause) at every step.
Figure F6 models FDIR as a controlled state machine. Each transition is triggered by a short, verifiable condition and produces evidence for traceability.

H2-7 · Redundancy & voting: dual/triple FCC channels without hidden coupling

Why this matters

Redundancy is only effective when failures remain independent. The fastest way to defeat a dual/triple FCC architecture is hidden coupling: shared power, clock, reset, or software commonality that turns independent channels into a single common-cause failure. This section focuses on decoupling, cross-monitoring, and voting/selection behavior that can be verified.

Keywords: common-cause failure · hidden coupling · cross-monitoring · voter/selector · evidence logging
Voting modes (concept-level only)

2oo2 (two-out-of-two)

  • Both channels must agree before authority is granted.
  • Strong at preventing a single wrong channel from driving outputs.
  • Highly sensitive to hidden coupling (both can fail together).

2oo3 (two-out-of-three)

  • Majority vote tolerates one disagreeing channel.
  • Can maintain operation while isolating a suspected channel.
  • Still requires independence; coupled failures can defeat majority logic.

1oo2 (one-out-of-two)

  • One channel can provide outputs while the other monitors.
  • Useful for controlled degradation strategies with tight supervision.
  • Requires clear rules for authority transfer and fault evidence.

The purpose here is to explain behavior and verification targets, not to define a specific aircraft architecture.
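A 2oo3 voter on exact agreement can be sketched as below. Real voters typically compare within tolerance bands rather than on exact equality, so treat this as a behavioral illustration only; the types and the `vote_2oo3` name are hypothetical.

```c
#include <stdint.h>
#include <stdbool.h>

typedef struct {
    int32_t value;   /* selected output when agreed */
    bool    agreed;  /* false = no majority, authority withheld */
    int     faulty;  /* index of the disagreeing channel, -1 if none */
} vote_t;

/* 2oo3 majority selection: two matching channels win and the
 * disagreeing channel index is reported for isolation; full
 * disagreement inhibits authority entirely. */
vote_t vote_2oo3(int32_t a, int32_t b, int32_t c)
{
    if (a == b && b == c) return (vote_t){ a, true, -1 };
    if (a == b)           return (vote_t){ a, true,  2 };
    if (a == c)           return (vote_t){ a, true,  1 };
    if (b == c)           return (vote_t){ b, true,  0 };
    return (vote_t){ 0, false, -1 };   /* no majority: inhibit authority */
}
```

Note how the voter produces evidence (the isolated channel index) as a side effect of the decision, matching the "isolate on disagreement" behavior in Figure F7.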

Hidden coupling checklist (risk → decouple → prove)
Coupling type → why it breaks redundancy → decouple methods (concept) → evidence to prove

  • Power: a rail disturbance can push all channels into the same fault state at the same time. Decouple: independent monitoring domains; separate supervision paths; avoid a single shared "health truth." Prove: a single-channel power anomaly does not synchronize failure across channels; logs remain distinguishable.
  • Clock: a shared clock failure can create simultaneous timing collapse and watchdog storms. Decouple: independent clock sources or independent clock validation per channel. Prove: one channel's clock anomaly is detected and isolated without pulling other channels into the same timing fault.
  • Reset: a shared reset tree can cause simultaneous reboot/undefined state across all channels. Decouple: reset domain separation; controlled authority gating; channel-local reset policy. Prove: a channel reset does not force peer resets; reset-cause logs remain channel-specific.
  • Software commonality: the same defect can produce the same wrong result on every channel, defeating agreement checks. Decouple: design for detectability via cross-checking, independent monitors, and bounded outputs on anomalies. Prove: a wrong-time or inconsistent-state condition triggers isolation/degradation rather than silent agreement.

The goal is not “perfect independence,” but eliminating hidden shared dependencies that can collapse all channels together.

Cross-monitoring (what to check and why)

Liveness cross-check

  • Peer heartbeat freshness (time-since-last-update) to detect stalled peers.
  • Detects “not running” and “not updating on time” before voting grants authority.

Consistency cross-check

  • Compare a minimal state summary: operating mode, health flags, and authority readiness.
  • Detects divergence early and routes disagreements into isolation/degradation logic.

Cross-monitoring is used to identify which channel is no longer trustworthy, not to “prove correctness” of complex application logic.

Figure F7 (diagram): Redundant FCC channels with voting. FCC-A, FCC-B, and optional FCC-C report channel health into a voter/selector (majority vote, authority control, isolate on disagreement) with cross-check links between channels; power, clock, and reset are kept decoupled, and actuator command authority is granted only by the voter.
Figure F7 highlights redundant channels feeding a voter/selector while cross-check links provide liveness and consistency monitoring without assuming any specific protocol.

H2-8 · I/O integrity and timing budget (determinism end-to-end)

Why this matters

End-to-end determinism is not created by a bus name. It comes from a timing budget, monitor points, and a timeout policy that turns “late or inconsistent data” into bounded actions. This section uses a generic chain: sample → compute → publish → confirm, and focuses on budgeting methods rather than protocol details.

Keywords: end-to-end budget · monitor points · jitter window · timeout threshold · CRC/sequence/timeout
I/O integrity signals (concept-level)

CRC / integrity check

  • Turns silent bit corruption into a detectable event.
  • Failed integrity should block command publishing or force degradation.

Sequence counter

  • Detects drops, repeats, and out-of-order updates without protocol detail.
  • Enables “freshness” decisions in the control loop.

Timeout / freshness window

  • Late data is treated as unsafe data.
  • Timeout transitions should be logged and tied to a defined output policy.

Examples like ARINC/AFDX can carry these signals, but the determinism method is independent of any specific bus stack.
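The three signals can be combined into one acceptance gate. The message layout, the CRC stand-in, and the freshness window below are assumptions for illustration; a real implementation would compute the CRC over the payload rather than receive an expected value.

```c
#include <stdint.h>
#include <stdbool.h>

typedef struct {
    uint32_t seq;           /* sequence counter from the sender */
    uint32_t timestamp_us;  /* receive timestamp */
    uint16_t crc;           /* integrity check value */
} msg_t;

#define FRESHNESS_US 1000u  /* illustrative freshness window */

/* Accept a message only if integrity, sequence progress, and freshness
 * all hold; any single failure blocks the data from the control loop. */
bool msg_accept(const msg_t *m, uint16_t crc_expected,
                uint32_t last_seq, uint32_t now_us)
{
    if (m->crc != crc_expected)                  return false; /* corrupted */
    if (m->seq <= last_seq)                      return false; /* drop/repeat */
    if (now_us - m->timestamp_us > FRESHNESS_US) return false; /* too old */
    return true;
}
```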

Budget bar card (cap + monitor + action)

1) Sample window

  • Cap: sampling must complete within a bounded acquisition window.
  • Monitor: timestamp + sequence freshness.
  • Action: mark stale, enter suspect or degrade; record reason.

2) Compute window

  • Cap: control compute must finish before the output publish deadline.
  • Monitor: loop overrun monitor; watchdog window compliance.
  • Action: degrade authority or inhibit publishing on repeated overruns.

3) Output publish window

  • Cap: outputs must be published atomically and at a bounded time.
  • Monitor: integrity check on command frame + publish timestamp.
  • Action: block publish on integrity failure; keep safe-state output.

4) Confirm / feedback window

  • Cap: confirmation must arrive before a timeout threshold.
  • Monitor: confirm freshness + sequence progress.
  • Action: escalate to degrade or safe mode on missing/late confirmation.

The budget is complete only when each segment has a monitor point, a bounded action, and evidence logged for traceability.

Figure F8 (diagram): End-to-end latency budget timeline. t0 sample → t1 compute → t2 publish → t3 confirm, with a jitter window and timeout controls; each segment has its own monitor (timestamp, overrun, CRC/sequence, timeout).
Figure F8 illustrates a generic end-to-end budget. Each segment has a bounded cap, an explicit monitor point, and a timeout threshold that triggers controlled actions.

H2-9 · Event logging & evidence: what to record to prove safety and catch latent faults

Purpose

An FCC log is not a “debug trace.” It is an evidence chain that explains what happened, why it triggered, and what action was taken. The most valuable field outcome is the ability to reconstruct an incident from a small set of consistent records and prove the fix with repeatable evidence.

Keywords: evidence chain · correlation · append-only · sequence number · power-fail-safe write
Must-record events (FCC-focused)

Power / Reset evidence

  • Reset cause (e.g., supervised restart, brownout marker, watchdog)
  • Rail PG drops / sequencing abort markers
  • Thermal warnings / overtemp trips

Timing / determinism evidence

  • Task overrun / deadline misses
  • Loop timing anomalies (jitter out-of-window, concept)
  • Timeout escalations (from suspect to isolate/degrade)

Integrity / redundancy evidence

  • ECC corrected counter increases and threshold crossings
  • ECC uncorrected events (must be explicit)
  • Voting mismatch and channel isolation/authority transfer
  • Watchdog trip and recovery path taken

These events directly support proof of mechanisms described earlier: power supervision, watchdog policy, ECC observability, timing budgets, and redundant channel voting.

Log field template card (schema that engineers and buyers can read)
Field → what it means → why it is required (evidence value)

  • EventID: event type enum (reset, watchdog, ECC, PG drop, overrun, vote mismatch). Makes incidents searchable and classifiable; enables consistent reporting across builds.
  • Severity: Info / Warn / Fault classification. Separates noise from safety-relevant signals and supports trend thresholds.
  • Timestamp: monotonic time or synchronized time tag (concept-level). Enables causal ordering and timing budget proof (late vs on-time).
  • ChannelID: channel A/B/C (when redundant). Prevents "merged truth"; proves independence and helps isolate common-cause patterns.
  • Mode / FDIR state: Normal / Suspect / Isolate / Degrade / Safe. Links detection to controlled action; demonstrates the fault-handling state machine.
  • Context snapshot: small snapshot of counters + state flags relevant to the EventID. Turns "an event happened" into "why it happened" without requiring full debug traces.
  • ActionTaken: output inhibit, supervised restart, channel isolation, degrade, safe mode entry. Proves the system reacted in a bounded, defined way, not an uncontrolled one.
  • CorrelationID: incident chain identifier linking related events. Reconstructs multi-step chains (e.g., PG drop → brownout marker → reset → startup checks).
  • SeqNo: append-only record sequence number. Detects missing records and wrap-around issues, and preserves ordering under stress.

A practical rule: each safety-relevant event must be explainable using EventID + Timestamp + Mode + ActionTaken, with Context snapshot for fast root cause.
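The schema above could be captured as a record type along these lines. Every enum value and field width here is illustrative, not a real log format definition.

```c
#include <stdint.h>

/* Illustrative enums for the schema fields; real systems would define
 * these in an interface control document. */
typedef enum { EV_RESET, EV_WATCHDOG, EV_ECC, EV_PG_DROP,
               EV_OVERRUN, EV_VOTE_MISMATCH } event_id_t;
typedef enum { SEV_INFO, SEV_WARN, SEV_FAULT } severity_t;
typedef enum { MODE_NORMAL, MODE_SUSPECT, MODE_ISOLATE,
               MODE_DEGRADE, MODE_SAFE } fdir_mode_t;

typedef struct {
    event_id_t  event_id;        /* what kind of event */
    severity_t  severity;        /* info / warn / fault */
    uint64_t    timestamp;       /* monotonic time tag */
    uint8_t     channel_id;      /* A=0, B=1, C=2 */
    fdir_mode_t mode;            /* FDIR state at the time of the event */
    uint32_t    context[4];      /* small counter/flag snapshot */
    uint16_t    action_taken;    /* inhibit, restart, isolate, ... */
    uint32_t    correlation_id;  /* links related events into one incident */
    uint32_t    seq_no;          /* append-only ordering */
} log_record_t;
```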

Consistency principles (concept-level, storage-agnostic)

Append-only

  • Records are added, not rewritten.
  • Preserves incident truth and avoids silent “history edits.”

Sequence number

  • Each record increments SeqNo.
  • Makes drops and ordering faults detectable.

Power-fail safe write

  • Critical events maximize retention under power loss.
  • Prevents the worst case: no evidence after a power incident.

These are design principles that can be validated without tying the discussion to any specific storage device or file system.
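Append-only writes and SeqNo gap detection can be sketched with a small ring; the slot count and record layout are placeholders, and a real log would also handle power-fail-safe commits.

```c
#include <stdint.h>
#include <stdbool.h>

#define LOG_SLOTS 8u   /* illustrative ring size */

typedef struct { uint32_t seq_no; uint32_t payload; } rec_t;

typedef struct {
    rec_t    slot[LOG_SLOTS];
    uint32_t next_seq;   /* monotonically increasing sequence number */
} log_ring_t;

/* Append-only write: records are never rewritten in place; SeqNo grows
 * monotonically even when the ring wraps, so readers can detect that
 * older records were overwritten rather than silently losing them. */
uint32_t log_append(log_ring_t *lg, uint32_t payload)
{
    uint32_t seq = lg->next_seq++;
    lg->slot[seq % LOG_SLOTS] = (rec_t){ seq, payload };
    return seq;
}

/* A gap between consecutive SeqNo values means records were lost. */
bool seq_contiguous(uint32_t prev_seq, uint32_t seq)
{
    return seq == prev_seq + 1u;
}
```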

Field incident replay: 3-step closed loop

Step 1 — Grab

  • Input: maintenance readout + last incident window by Timestamp / CorrelationID.
  • Output: a compact chain of related records (not a raw dump).

Step 2 — Attribute

  • Input: chain pattern (e.g., PG drop → reset cause → startup incomplete).
  • Output: root category (power/reset vs timing vs integrity vs voting) with a trigger hypothesis.

Step 3 — Fix & prove

  • Input: corrective change (policy threshold, supervision rule, bounded action).
  • Output: repeated scenario produces a safer outcome (degrade/safe) and a cleaner evidence chain.

The goal is reproducible evidence: the system should show a different, controlled timeline after the fix.
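Step 1 (“Grab”) can be sketched as a simple filter over the readout; the record layout and function name below are hypothetical:

```c
/* Filter a maintenance readout by CorrelationID into a compact
 * chain, preserving readout order (which follows SeqNo). */
#include <stdint.h>

typedef struct {
    uint32_t seq_no;
    uint32_t correlation_id;
    uint16_t event_id;
} rec_t;

/* Copies records matching `cid` into `out` (capacity `cap`).
 * Returns the chain length — a compact chain, not a raw dump. */
static uint32_t grab_chain(const rec_t *recs, uint32_t n,
                           uint32_t cid, rec_t *out, uint32_t cap)
{
    uint32_t m = 0;
    for (uint32_t i = 0; i < n && m < cap; i++)
        if (recs[i].correlation_id == cid)
            out[m++] = recs[i];
    return m;
}
```

Steps 2 and 3 then operate on the extracted chain, which is why the CorrelationID field has to be written at capture time, not reconstructed afterwards.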

Figure F9
Event log pipeline from event sources to persistent evidence and maintenance readout.
Figure F9 shows a storage-agnostic log pipeline: safety-relevant sources are aggregated, correlated, buffered with sequence order, retained across power events, and read out as incident chains.

H2-10 · Verification checklist: what proves an FCC design is done

Purpose

“Done” means evidence closure: under faults and boundary conditions, the FCC detects issues, enters a controlled state, and produces logs that prove both the trigger and the action. The checklist below is layered to match real workflows: development validation, production test, and field self-check.

Layer 1 — Development validation (fault drills + proof)
Inject lockstep mismatch → isolate or safe/degrade entry → log includes mismatch + action taken + ChannelID.
Inject ECC corrected events → counters rise and threshold behavior triggers → log includes corrected count + CorrelationID.
Inject ECC uncorrected event → block publish and enter controlled policy → log includes uncorrected + Mode transition.
Force watchdog timeout → bounded recovery (restart/degrade/safe) → log shows watchdog trip + reset cause chain.
Create task overrun → suspect/degrade escalation on repeats → log shows overrun + timing window breach.
Brownout drill → prevent half-alive state; supervised restart path → log shows PG drop + restart sequence markers.
Voting mismatch scenario → isolate disagreeing channel and maintain defined authority rules → log shows mismatch + isolation.
Cross-monitor liveness loss → remove authority from stale channel → log includes liveness fault + action taken.
Startup sequencing abort → outputs remain gated; evidence recorded → log shows sequencing fail + safe gating.

Each line follows the same structure: condition → expected action → evidence output. This makes verification results easy to trace and compare.
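One way to keep that condition → action → evidence structure machine-checkable is a table-driven drill list; the EventIDs and action names below are illustrative assumptions:

```c
/* Table-driven Layer-1 drills: each entry names an injected
 * condition, the required bounded action, and the evidence
 * record that must appear in the log. */
#include <stdint.h>

typedef enum { ACT_NONE, ACT_ISOLATE, ACT_INHIBIT,
               ACT_RESTART, ACT_DEGRADE, ACT_SAFE } action_t;

typedef struct {
    const char *condition;       /* injected fault              */
    action_t    expected_action; /* bounded reaction required   */
    uint16_t    evidence_event;  /* EventID that must be logged */
} drill_t;

static const drill_t drills[] = {
    { "lockstep mismatch",   ACT_ISOLATE, 0x0101 },
    { "ECC uncorrected",     ACT_INHIBIT, 0x0202 },
    { "watchdog timeout",    ACT_RESTART, 0x0303 },
    { "task overrun repeat", ACT_DEGRADE, 0x0404 },
};

/* A drill passes only if both the observed action and the logged
 * EventID match: condition -> action -> evidence, nothing less. */
static int drill_pass(const drill_t *d, action_t observed, uint16_t logged)
{
    return observed == d->expected_action && logged == d->evidence_event;
}
```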

Layer 2 — Production test (each unit is consistent)
BIT/BIST run → pass/fail recorded → readout provides trace ID and summary.
Memory test → detects gross faults → log or test record captures result and unit identity.
Rail sequencing check → PG/RESET behavior matches policy → sequencing results are recorded.
Watchdog basic check → independent path can trigger policy action → result is recorded.
Log readback → schema fields present (EventID/Timestamp/ActionTaken/SeqNo) → sequence continuity verified.
Reset cause sanity → causes are distinguishable in records → no “unknown reset” blind spots.
Config identity capture → build/version/channel identity recorded → enables fleet correlation later.
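The reset-cause sanity item can be sketched as a decode over hypothetical status bits (real register layouts are device-specific); the goal is that “unknown” can never appear in a production readback:

```c
/* Map assumed reset-status bits to distinguishable causes.
 * Bit positions are illustrative, not any specific MCU register. */
#include <stdint.h>

#define RST_POR (1u << 0)  /* power-on reset      */
#define RST_BOR (1u << 1)  /* brownout reset      */
#define RST_WDT (1u << 2)  /* watchdog reset      */
#define RST_SW  (1u << 3)  /* supervised software */

static const char *reset_cause(uint32_t status)
{
    /* Brownout is checked before power-on because both flags
     * can latch together and brownout is the safety-relevant one. */
    if (status & RST_BOR) return "brownout";
    if (status & RST_POR) return "power-on";
    if (status & RST_WDT) return "watchdog";
    if (status & RST_SW)  return "software";
    return "unknown";  /* must never appear in production readback */
}
```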

Layer 3 — Field self-check (long-term observability)
Startup self-check → grant authority only after health confirms → log includes gating decision.
Periodic health counters → overrun/ECC corrected rate/thermal events tracked → trend points recorded.
Trend alert thresholds → rising anomaly rates trigger warnings or degrade policy → evidence chain is preserved.
Redundancy health summary → cross-check/voting health reported → channel isolation history retained.
Incident-chain export → maintenance tool extracts recent CorrelationID chains → supports fast attribution.
Post-fault proof → controlled recovery always produces a readable incident timeline → no silent resets.

Field checks focus on catching latent faults early and proving behavior over time, not only at the moment of a lab test.

Figure F10
Verification stack: development, production, and field evidence layers.
Figure F10 summarizes a practical verification stack: development drills prove mechanisms under faults, production tests enforce unit consistency, and field checks ensure long-term observability with incident evidence.


H2-11 · FAQs (Flight Control Computer · FCC)

These FAQs clarify what the FCC safety mechanisms really cover, what can still escape, and how to verify behavior using bounded actions and evidence (counters, state transitions, and event logs).

1. What does “lockstep” really detect in an FCC, and what can still slip through?
Lockstep primarily detects compute divergence: if two execution paths that should match do not, a compare mechanism can raise a fault and force a bounded safety response. It helps catch random faults that corrupt processing, but it does not automatically cover common-cause issues (bad inputs, shared power/clock/reset problems) or non-lockstepped peripherals. Treat it as one detector in a broader containment plan.
Verification hint: prove the full reaction path (mismatch → fault → handler → output inhibit/degrade) and confirm the event log captures the mismatch plus the action taken.
2. How does ECC help safety beyond “fewer crashes”?
ECC turns silent memory corruption into observable evidence. “Corrected” events can be counted and trended; “uncorrected” events can trigger controlled escalation. This shifts memory reliability from a vague hope to a policy: detect, correct when possible, and record what happened so field incidents can be attributed. ECC is most useful when paired with counters, thresholds, and logging that link errors to system mode and response.
Verification hint: validate corrected/uncorrected handling separately and confirm logs include counters, severity, and the resulting mode transition (if any).
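A hedged sketch of such a corrected-event policy, with illustrative thresholds and no specific ECC controller assumed:

```c
/* Corrected-ECC trend policy: count every corrected event, warn
 * past one threshold, request degrade past another. Thresholds
 * and the single-window model are simplifying assumptions. */
#include <stdint.h>

typedef enum { ECC_OK, ECC_WARN, ECC_DEGRADE } ecc_policy_t;

typedef struct {
    uint32_t corrected_count;   /* running counter -> log evidence */
    uint32_t warn_threshold;    /* trend-alert level               */
    uint32_t degrade_threshold; /* bounded-action level            */
} ecc_monitor_t;

static ecc_policy_t ecc_on_corrected(ecc_monitor_t *m)
{
    m->corrected_count++;  /* counted first, so every event is loggable */
    if (m->corrected_count >= m->degrade_threshold) return ECC_DEGRADE;
    if (m->corrected_count >= m->warn_threshold)    return ECC_WARN;
    return ECC_OK;
}
```

Uncorrected events would bypass this trend path entirely and go straight to the publish-blocking policy described above.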
3. Why is brownout behavior more dangerous than a clean reset?
Brownout is dangerous because it can leave logic in a half-alive state: timing, state machines, and cross-channel alignment can become inconsistent without a clear restart boundary. A clean reset re-establishes known initial conditions, while brownout can create “undefined middle states” that look like software bugs. The safe strategy is supervised undervoltage detection, controlled reset gating, and a startup health confirmation before outputs are allowed.
Verification hint: run brownout drills and confirm the system never continues with partial state; logs should show rail/PG drops, reset cause, and supervised startup markers.
4. External watchdog vs. internal watchdog: what’s the practical difference?
The practical difference is independence. An internal watchdog can be affected by the same fault domain as the software or core it is supervising. An external watchdog can use an independent clock and reset path, making it more credible when the CPU is misbehaving. In safety use, a windowed policy (too fast or too slow is wrong) and an explicit action plan (reset, degrade, or output inhibit) matter as much as the watchdog itself.
Verification hint: demonstrate window behavior and confirm that a watchdog trip produces a deterministic action and a traceable reset-cause chain in the log.
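The windowed policy can be sketched as a single validity check; the time units and bounds are illustrative:

```c
/* Windowed watchdog sketch: a kick is valid only inside
 * [min, max] since the previous kick. Too fast and too slow
 * are both faults, which catches runaway loops that would
 * satisfy a plain timeout-only watchdog. */
#include <stdint.h>

typedef struct {
    uint32_t last_kick_us;
    uint32_t window_min_us;  /* kicking earlier is a fault */
    uint32_t window_max_us;  /* kicking later is a fault   */
} wdog_t;

/* Returns 1 if the kick is inside the window, 0 on a trip. */
static int wdog_kick(wdog_t *w, uint32_t now_us)
{
    uint32_t dt = now_us - w->last_kick_us;
    int ok = (dt >= w->window_min_us) && (dt <= w->window_max_us);
    w->last_kick_us = now_us;
    return ok;
}
```

In hardware the window and the resulting action live in an external device with its own clock and reset path; this sketch only illustrates the policy shape.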
5. How do you avoid hidden coupling between redundant FCC channels?
Hidden coupling comes from shared dependencies: power, clock, reset, and common software/config. Avoid it by designing for observable independence: separate supervision paths, clear channel identity, and cross-monitoring that can isolate a single channel without collapsing the others. The goal is not only redundancy on paper, but evidence that one channel’s fault does not silently propagate through shared resources.
Verification hint: inject a fault into one channel and verify the other channel(s) remain stable; logs must distinguish ChannelID and capture isolation/authority changes.
6. What’s a reasonable way to build a timing budget without over-optimizing?
Build the budget as an end-to-end chain: sample → compute → publish outputs → confirm. For each segment, set a cap, define a monitor point (timer/counter), and specify what happens on overrun (suspect, degrade, safe). This avoids chasing micro-optimizations while still guaranteeing determinism. A timing budget is successful when it is enforceable by monitors and produces clear evidence when violated.
Verification hint: validate worst-case paths and confirm overrun events trigger the defined escalation path and are recorded as timing evidence.
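A per-segment monitor that makes the budget enforceable might look like this; the caps and the suspect → degrade ladder are assumptions:

```c
/* Segment budget monitor: measure each segment against its cap
 * and escalate on repeated overruns. Escalation depth is policy. */
#include <stdint.h>

typedef struct {
    uint32_t cap_us;         /* budget for this segment */
    uint32_t overrun_count;  /* trend evidence          */
} segment_t;

typedef enum { TB_OK, TB_SUSPECT, TB_DEGRADE } tb_verdict_t;

static tb_verdict_t segment_check(segment_t *s, uint32_t measured_us,
                                  uint32_t degrade_after)
{
    if (measured_us <= s->cap_us) return TB_OK;
    s->overrun_count++;  /* recorded, so violations leave evidence */
    return (s->overrun_count >= degrade_after) ? TB_DEGRADE : TB_SUSPECT;
}
```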
7. Which events must be logged to support field diagnosis and certification evidence?
At minimum, log events that prove detection and response: reset cause, watchdog trips, ECC corrected/uncorrected, rail PG drops, overtemp, task overruns, and voting mismatches. Each record should include a timestamp, severity, channel identity, and the action taken. This set is small enough to be practical but rich enough to reconstruct incident chains and demonstrate bounded behavior.
Verification hint: confirm logs include EventID, Timestamp, ChannelID, Mode/FDIR state, ActionTaken, and a CorrelationID to link multi-step incident chains.
8. How do you log faults without causing extra jitter or overruns?
Logging must be treated as a real-time workload. Use tiered capture: record minimal “fast path” metadata immediately (EventID, time tag, counters), then aggregate or persist later in a lower-impact context. Prefer bounded operations (ring buffer, fixed-size records) over unbounded work. The system should also monitor whether logging itself correlates with overruns, so evidence collection does not become a new failure mode.
Verification hint: compare jitter/overrun counters with logging enabled vs stressed; confirm the design keeps logging work bounded and observable.
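A sketch of tiered capture under these assumptions: the fast path copies fixed-size metadata and never blocks, overflow is counted rather than handled inline, and a lower-priority drain moves records onward:

```c
/* Two-stage logging sketch: bounded fast-path capture into a
 * staging ring, deferred drain in a low-priority context. */
#include <stdint.h>

#define STAGE_SLOTS 4u

typedef struct { uint16_t event_id; uint32_t time_us; } fast_rec_t;

typedef struct {
    fast_rec_t slot[STAGE_SLOTS];
    uint32_t   head, tail;  /* head: fast-path writer, tail: drain */
    uint32_t   dropped;     /* overflow is counted, never blocking */
} stage_t;

/* Fast path: fixed-size copy, no formatting, no allocation. */
static void fast_capture(stage_t *st, uint16_t id, uint32_t t_us)
{
    if (st->head - st->tail >= STAGE_SLOTS) { st->dropped++; return; }
    fast_rec_t *r = &st->slot[st->head % STAGE_SLOTS];
    r->event_id = id;
    r->time_us  = t_us;
    st->head++;
}

/* Drain one record in a low-impact context; returns 1 if consumed. */
static int drain_one(stage_t *st, fast_rec_t *out)
{
    if (st->tail == st->head) return 0;
    *out = st->slot[st->tail % STAGE_SLOTS];
    st->tail++;
    return 1;
}
```

The `dropped` counter is itself evidence: if logging pressure ever causes loss, that fact is observable instead of silent.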
9. What’s the safest response when a mismatch is detected: reset, degrade, or inhibit outputs?
The safest response depends on what can be proven. If an error implies outputs may be unsafe, inhibit outputs is the most conservative. If the system can remain controlled with reduced authority, degrade can preserve availability. Reset is appropriate when the restart boundary is supervised and does not create unsafe transients. The key is a clear decision rule: severity, confidence, and recovery path must be logged and repeatable.
Verification hint: define explicit trigger conditions for each response and validate they produce consistent mode transitions and logs (not ad-hoc behavior).
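One way to make the decision rule explicit and repeatable is a small fixed mapping; the matrix below is illustrative, not a recommended policy:

```c
/* Explicit mismatch-response rule: severity and confidence map to
 * a bounded action. The mapping is fixed and loggable, so the same
 * inputs always produce the same response — never ad-hoc handling. */
typedef enum { CONF_LOW, CONF_HIGH } conf_t;  /* fault confidence */
typedef enum { SEV_MINOR, SEV_MAJOR } sev_t;  /* fault severity   */
typedef enum { RSP_DEGRADE, RSP_INHIBIT, RSP_RESET } resp_t;

static resp_t mismatch_response(sev_t sev, conf_t conf,
                                int restart_supervised)
{
    if (sev == SEV_MAJOR && conf == CONF_HIGH)
        return RSP_INHIBIT;   /* outputs may be unsafe: most conservative */
    if (sev == SEV_MAJOR && restart_supervised)
        return RSP_RESET;     /* only with a supervised restart boundary  */
    return RSP_DEGRADE;       /* stay controlled with reduced authority   */
}
```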
10. How do you validate fault handling without risking unsafe actuator commands?
Validation should separate detection from authority. Use output gating so fault-handling paths can be exercised while actuator commands are inhibited or routed to a safe sink. Tests should verify that the same monitors, thresholds, and state transitions used in operation are active during validation. A good plan proves “condition → action → evidence” without relying on risky live outputs.
Verification hint: confirm tests can trigger FDIR transitions while outputs remain gated, and the event log still records the full incident chain and actions taken.
11. How should periodic self-tests be scheduled to avoid affecting determinism?
Periodic self-tests should run in defined windows with bounded cost, ideally at lower priority than the control loop. Tests must be preemptible or partitioned so they cannot steal time from critical deadlines. The system should also measure and record whether self-tests correlate with jitter, overruns, or mode changes. Done well, self-tests improve confidence without becoming a hidden timing dependency.
Verification hint: measure timing counters with self-tests enabled and confirm no deadline misses; record self-test start/stop markers for attribution.
12. What are the top 5 FCC bring-up mistakes that look like “software bugs” but are power/reset issues?
Common bring-up failures often come from power/reset sequencing, not code: (1) brownout leaves partial state, (2) PG signals are ignored or misinterpreted, (3) reset release is not synchronized across domains, (4) watchdog power/clock is not truly independent, and (5) startup health checks are skipped before outputs are enabled. These issues create intermittent, hard-to-reproduce symptoms. The fix is supervised startup and logging that ties resets and rail drops to the observed misbehavior.
Verification hint: add explicit reset-cause and rail/PG markers, then reproduce the symptom; attribution should move from “mystery software bug” to a clean power/reset incident chain.