
Mission Computer for Avionics: PCIe, HSM Secure Boot, NVMe


A mission computer is the avionics compute core that routes high-rate data through a designed PCIe fabric, protects integrity with ECC and NVMe evidence logging, and stays upgradeable via secure boot with an HSM. It is considered “done” only when power-fail behavior, health monitoring, and rugged thermal/vibration stability are verifiable with traceable records—not just when the system boots.

H2-1 · Mission Computer: Scope & System Boundary

Goal: define what an avionics mission computer is responsible for, and—equally important—what it is not. This prevents scope creep into sibling topics (switching, timing, and power front-end design) while keeping the page engineering-useful for both procurement and implementation.

  • System object: compute + storage + assurance
  • Focus: PCIe fabric, secure boot (HSM), NVMe integrity
  • Proof: logs, BIT, power-fail evidence
  • Boundary: interfaces only for timing/switching/power

What a mission computer is (in-scope)

  • Mission application host: sensor fusion, mission planning, payload control, and deterministic data processing pipelines.
  • High-speed fabric hub: a PCIe-based compute/backplane fabric tying CPU/SoC, FPGA/accelerators, network I/O, and NVMe storage into a controlled topology.
  • Assurance anchor: secure boot with a hardware root-of-trust (HSM/TPM-class device), plus audit-ready versioning and event evidence.
  • Recorder & evidence loop: flight/mission logs, BIT artifacts, and “what happened when” timelines that survive resets and maintenance actions.

What a mission computer is not (out-of-scope, but interfaces are defined)

  • Not the aircraft network switch (e.g., AFDX/TSN switching): only bandwidth/port/redundancy requirements and health indicators are referenced.
  • Not the timing master (GPSDO/PTP/SyncE internals): only timing inputs/accuracy targets and “loss-of-sync” behavior requirements are referenced.
  • Not the 28 V power front end (surge/spike chain): only the power-fail/brownout signaling seen by the mission computer is referenced.

Typical hardware forms and where the design pressure comes from

  • Compute blade / SBC (VPX/OpenVPX, rugged COM-based modules): favors modularity and maintainability; stresses connector loss and SI margins.
  • Backplane-centric chassis: favors scalable I/O and accelerators; stresses PCIe topology depth and reset/clock distribution discipline.
  • Integrated mission box: favors size/weight/power; stresses thermals, serviceability, and controlled upgrade flows.

Boundary handoffs (the practical “contracts” with sibling domains)

Neighbor domain | What the mission computer consumes | What the mission computer must produce
Switching / network | Declared throughput targets, traffic classes, redundancy policy, link-health counters (high-level) | Deterministic compute + logging of link events (drops, resets, degraded modes) with timestamps
Timing | Time input quality constraints (accuracy/stability targets), holdover expectations, loss-of-time alarms | Time-stamped logs; graceful degradation rules when time quality falls below threshold
Power (front end) | Power-fail/brownout indicator(s), “last-gasp” window requirement, reset reason codes (if available) | Power-fail response plan: flush/commit, safe-state entry, and a verifiable shutdown record

Acceptance criteria (how “done” can be proven later)

  • Determinism: critical task latencies are bounded and repeatable across temperature and load corners.
  • Fault containment: a single endpoint failure does not collapse the fabric into repeated cascade resets.
  • Auditability: boot chain, versions, and security policy changes leave tamper-evident records.
  • Power-loss behavior: power-fail triggers a controlled sequence that preserves NVMe consistency and creates a post-event evidence packet.
Figure F1. Mission Computer placed within an avionics box: clear boundary to switching/timing/power domains, while emphasizing compute, PCIe fabric, secure boot (HSM), storage integrity, and evidence logging.

H2-2 · Data Paths & I/O Map: How the System Connects

Goal: turn “interfaces” into a flow map that separates mission-critical traffic from management and maintenance. This section establishes the structure used later for PCIe topology, NVMe behavior under faults, and audit-ready upgrade paths.

Step 1 — Classify traffic into three planes

  • Data plane (mission traffic): high throughput and bounded latency. Typical endpoints include accelerators, capture/processing cards, and NVMe record/playback.
  • Control plane (management traffic): observability, policy, and lifecycle actions. Typical endpoints include BMC/management MCU, health sensors, event logs, and signed update mechanisms.
  • Maintenance plane (service tools): restricted access for debugging and depot maintenance. Kept minimal to avoid accidental coupling to mission determinism.

Step 2 — Build a flow map, not a port list

A port list explains what exists; a flow map explains what must be protected. Each critical flow should be described by: source → processing → sink, plus the failure symptom that indicates the flow is compromised.

  • Ingest flows: sensor/payload inputs → compute pipeline → decision outputs.
  • Record flows: compute pipeline → NVMe (records) → export/maintenance access.
  • Evidence flows: events/counters → log store → extraction tooling (BIT packages).
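As a minimal sketch (names and fields are illustrative, not from the source), the flow map above can be captured as records so each critical flow carries its failure symptom alongside its route:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Flow:
    """One protected flow: its route plus the symptom that flags compromise."""
    name: str
    source: str
    processing: str
    sink: str
    failure_symptom: str

# Illustrative flow map mirroring the ingest/record/evidence split above.
FLOW_MAP = [
    Flow("ingest", "sensor/payload", "compute pipeline", "decision outputs",
         "jitter / missed deadlines"),
    Flow("record", "compute pipeline", "NVMe (records)", "export/maintenance",
         "record gaps / corruption"),
    Flow("evidence", "events/counters", "log store", "BIT packages",
         "missing audit trail"),
]

def symptom_for(flow_name: str) -> str:
    """Look up the failure symptom to alarm on for a named flow."""
    return next(f.failure_symptom for f in FLOW_MAP if f.name == flow_name)
```

Treating the flow map as data, rather than a diagram, lets the same structure drive both monitoring configuration and review checklists.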

Step 3 — Enforce isolation (logic, bandwidth, fault containment)

  • Logic isolation: updates and maintenance actions must not execute in the same privilege and timing context as mission tasks.
  • Bandwidth isolation: reserve headroom so large transfers (e.g., record export) cannot starve mission pipelines.
  • Fault containment: a misbehaving endpoint must not trigger repeated cascade resets across unrelated links.
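Bandwidth isolation is often enforced with an admission mechanism on the bulk side. A token-bucket sketch (all rates and sizes are illustrative assumptions) shows the idea: bulk export can only consume its capped share, so mission pipelines keep their reserved headroom:

```python
class TokenBucket:
    """Reserve bandwidth headroom: bulk transfers draw from a capped bucket
    so they cannot starve mission pipelines (all numbers illustrative)."""
    def __init__(self, rate_mb_s: float, burst_mb: float):
        self.rate = rate_mb_s      # refill rate allowed for bulk traffic
        self.capacity = burst_mb   # maximum burst allowance
        self.tokens = burst_mb

    def refill(self, dt_s: float) -> None:
        self.tokens = min(self.capacity, self.tokens + self.rate * dt_s)

    def try_send(self, size_mb: float) -> bool:
        """Admit a bulk transfer only if tokens cover it; mission traffic
        never passes through this bucket and keeps its reserved share."""
        if size_mb <= self.tokens:
            self.tokens -= size_mb
            return True
        return False

# Example: a 400 MB/s link reserves 300 MB/s for mission flows by capping
# bulk export at 100 MB/s with a 50 MB burst allowance.
bulk = TokenBucket(rate_mb_s=100.0, burst_mb=50.0)
```

The same cap can be realized in hardware (arbitration weights) or software (queue scheduling); the sketch only illustrates the invariant.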

Typical bottlenecks and the evidence that exposes them

  • DMA congestion: symptoms include tail-latency spikes and periodic stalls; evidence includes queue depth, dropped frames, and latency histograms.
  • Excessive fabric depth: symptoms include intermittent enumeration and frequent link retraining; evidence includes link-down/retrain counters and negotiated speed changes.
  • Shared reset/refclk coupling: symptoms include “one fault resets many”; evidence includes correlated timestamps across reset causes and link events.

Throughput & latency budget template (copy/paste friendly)

Flow | Target rate | Latency bound | Failure symptom | Evidence to log
Ingest → Compute | ____ MB/s | avg ____ / p99 ____ | jitter / missed deadlines | timestamps, queue depth, drop counters
Compute → NVMe record | ____ MB/s | write p99 ____ | record gaps / corruption | flush time, write errors, power-fail markers
Health → Evidence package | low / periodic | bounded (non-critical) | missing audit trail | reset causes, version IDs, event hashes
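Once the blanks in the template are filled with project numbers, the budget can be checked mechanically against measured statistics. A small sketch (keys and thresholds are illustrative placeholders):

```python
def check_flow_budget(name, measured, budget):
    """Compare measured flow statistics against a filled-in budget row.
    Returns a list of violation strings (empty list == flow passes)."""
    violations = []
    if measured["rate_mb_s"] < budget["rate_mb_s"]:
        violations.append(f"{name}: rate below target")
    if measured["p99_latency_ms"] > budget["p99_latency_ms"]:
        violations.append(f"{name}: p99 latency exceeds bound")
    return violations

# Example numbers, chosen only to demonstrate the check.
budget = {"rate_mb_s": 800.0, "p99_latency_ms": 5.0}
measured = {"rate_mb_s": 850.0, "p99_latency_ms": 6.2}
issues = check_flow_budget("Compute -> NVMe record", measured, budget)
```

Running this check per flow in CI or during soak tests turns the budget table into an executable acceptance gate rather than a document.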
Figure F2. Two-plane connection model: the data plane carries mission throughput; the control plane carries observability, policy, and signed maintenance actions. Both feed an evidence package.

H2-3 · Compute Architecture Choices: CPU, FPGA, Accelerators

Goal: explain why mission computing is rarely “just a bigger CPU.” The architecture must produce deterministic timing, fault containment, and an auditable lifecycle (updates, logs, and evidence) under harsh avionics constraints.

  • CPU = decision + orchestration
  • FPGA = deterministic pipeline
  • Accel = throughput (optional)
  • Separation = determinism + auditability

Role split that maps to mission outcomes (not to marketing terms)

CPU / SoC (general compute)
  • Best for: mission state machines, task orchestration, complex control logic, routing decisions, record/index management.
  • Engineering win: behavior remains explainable (why a decision was made) and maintainable (signed updates, clear versioning).
  • Evidence to log: deadline miss counters, queue depths, restart causes, and “safe-state entered” reasons.

FPGA (deterministic pipeline)
  • Best for: fixed-latency ingest → preprocess → align → DMA scheduling, trigger alignment, and repeatable throughput paths.
  • Engineering win: deterministic upper bounds can be proven (latency ceilings do not depend on OS scheduling noise).
  • Evidence to log: pipeline underrun/overrun, DMA stall reasons, alignment/trigger health flags.

Accelerator (GPU / domain accelerator, optional)
  • Best for: massively parallel workloads where throughput matters more than microsecond-level determinism.
  • Engineering win: peak performance without overloading the CPU timing budget.
  • Guardrails: define a degraded-mode plan when the accelerator is unavailable (the mission must fail gracefully, not silently).

Selection criteria that matter in avionics mission computing

  • Determinism: key paths must have measurable and repeatable latency bounds (avg and tail). If tail latency cannot be bounded, the workload does not belong solely on a CPU.
  • Certifiability / verifiability: isolate what must be proven (critical pipelines and safety monitors) from what evolves frequently (maintenance tools and analytics).
  • Power & thermal envelope: peak compute must not trigger link instability, storage throttling, or unpredictable timing drift under worst-case temperature.
  • Lifecycle supply: replacements should preserve the architecture contract (domain boundaries and interfaces), avoiding “recertify everything” events.

Domain partitioning: mission / management / safety monitor

Partitioning turns reliability and security from “features” into a system rule: maintenance actions and policy changes cannot steal time or privilege from mission execution.

  • Mission domain: real-time workloads, deterministic pipelines, critical data paths.
  • Management domain: secure boot enforcement, signed updates, telemetry collection, log packaging, and controlled configuration changes.
  • Safety monitor domain: watchdog/health verdicts and safe-state requests; operates independently enough to remain credible when the mission domain misbehaves.
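The safety-monitor rule can be sketched in a few lines. The sketch below (class name, timing values, and return strings are illustrative assumptions) services the watchdog only while mission-domain heartbeats stay fresh, so a hung mission domain produces a safe-state request instead of a silent stall:

```python
class SafetyMonitor:
    """Independent safety-monitor sketch: pets the hardware watchdog only
    while mission heartbeats are fresh; otherwise requests safe state and
    leaves a reason in the evidence log. Timing values are illustrative."""
    def __init__(self, heartbeat_timeout_s: float):
        self.timeout = heartbeat_timeout_s
        self.last_heartbeat_s = 0.0
        self.safe_state_requested = False

    def heartbeat(self, now_s: float) -> None:
        """Called by the mission domain to prove liveness."""
        self.last_heartbeat_s = now_s

    def tick(self, now_s: float) -> str:
        """Called periodically from the monitor domain."""
        if now_s - self.last_heartbeat_s > self.timeout:
            self.safe_state_requested = True
            return "request-safe-state"  # record the reason in evidence
        return "pet-watchdog"

monitor = SafetyMonitor(heartbeat_timeout_s=0.5)
monitor.heartbeat(0.0)
```

The key design choice is that the monitor's verdict path does not depend on mission-domain scheduling, which is what keeps the verdict credible.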

Common pitfalls (with symptoms and evidence)

  • Everything on CPU → determinism cannot be proven: symptoms include tail-latency spikes and missed deadlines under load; evidence includes latency histograms, queue depth growth, and correlated drop events.
  • Management mixed into mission → upgrades disturb operations: symptoms include anomalies during updates (retraining bursts, storage hiccups); evidence is a timeline overlap between maintenance events and mission faults.
Figure F3. Partitioning isolates mission execution from management actions and keeps safety verdicts credible, enabling deterministic performance and auditable maintenance.

H2-4 · PCIe Fabric Topology: Switches, Segments, Redundancy

Goal: treat PCIe as a designable fabric, not a wire. The mission computer must scale endpoints (NVMe, accelerators, high-speed I/O) while preventing single-device faults from collapsing the entire system into repeated retraining and resets.

  • Topology = fault boundaries
  • Switching = scale + isolation
  • Redundancy = graceful degradation
  • Observability = prove behavior

Topology templates (choose by failure modes, not by aesthetics)

  • Single-root, mostly direct: simplest; acceptable when endpoints are few and the physical channel is short and stable.
  • Single-root with a switch segment: common for scalable mission computers; creates a clean subtree for endpoints and improves manageability.
  • Dual-fabric / two switch domains (concept level): used when mission objectives require continued operation after a fabric-side fault. The key is defining what stays alive in degraded mode.

Switch selection criteria (engineering-useful, protocol-minimal)

  • Ports and expansion headroom: cover today’s endpoints plus growth; avoid “forced daisy chains” that deepen the topology.
  • Non-blocking bandwidth: critical flows should not be starved during simultaneous record + processing + export operations.
  • Isolation capability (subtree containment): faults and traffic should be containable within a branch, instead of rippling across unrelated endpoints.
  • Error reporting + manageability: link training events, retrain counts, error counters, and (when available) temperature/power telemetry should be readable.

Redundancy strategy (define the boundary and the proof)

  • Boundary: identify which endpoints are “must survive” (e.g., critical record path or core compute) versus “optional in degraded mode.”
  • Containment: ensure that a failing endpoint causes a local recovery (branch isolation), not system-wide churn.
  • Proof: create logs that show: fault detected → branch isolated → system stabilized → degraded-mode entered (if required).
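The proof requirement above is testable: the recovery log must contain the milestone events in order. A minimal checker sketch (event names are illustrative assumptions):

```python
# Expected recovery milestones from the containment requirement above.
EXPECTED = ["fault-detected", "branch-isolated", "system-stabilized"]

def recovery_proven(log_events):
    """Check that the log shows the recovery milestones in order;
    unrelated events in between are allowed. Names are illustrative."""
    it = iter(e["event"] for e in log_events)
    return all(any(e == want for e in it) for want in EXPECTED)

# Illustrative log from a contained endpoint fault.
log = [
    {"t": 10.0, "event": "fault-detected"},
    {"t": 10.1, "event": "link-retrain"},
    {"t": 10.2, "event": "branch-isolated"},
    {"t": 11.0, "event": "system-stabilized"},
    {"t": 11.1, "event": "degraded-mode-entered"},
]
```

An automated check like this turns "the system recovered" from an anecdote into a pass/fail artifact that can be attached to test reports.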

Observability: the minimum evidence set to log

  • Link training timeline: up/down events, negotiated speed changes, and retrain bursts.
  • Error counters: correctable vs uncorrectable trends, plus time correlation to mission anomalies.
  • Thermal correlation (optional): hotspots that correlate with retraining or storage throttling.
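Retrain bursts are easiest to detect with a sliding window over retrain timestamps. A sketch (window size and threshold are illustrative assumptions):

```python
from collections import deque

class RetrainBurstDetector:
    """Sliding-window detector: flag a burst when too many link retrains
    land inside a short window. Threshold and window are illustrative."""
    def __init__(self, window_s: float = 10.0, max_retrains: int = 3):
        self.window_s = window_s
        self.max_retrains = max_retrains
        self.events = deque()

    def on_retrain(self, t_s: float) -> bool:
        """Record a retrain timestamp; return True if it completes a burst."""
        self.events.append(t_s)
        while self.events and t_s - self.events[0] > self.window_s:
            self.events.popleft()
        return len(self.events) > self.max_retrains

det = RetrainBurstDetector(window_s=10.0, max_retrains=3)
```

In practice the burst flag would be logged with a timestamp so it can be correlated against thermal and load events from the same timeline.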
Figure F4. A practical PCIe fabric map: root complex feeding two switch domains with optional retimers and multiple endpoints. Redundancy and observability are shown as design requirements.

H2-5 · Retimers & Signal Integrity: “Boots” Is Not “Flight-Ready”

Goal: explain why retimers/redrivers become an engineering requirement in avionics mission computers, and how to validate the PCIe channel under temperature, vibration, and long-duration stress—not just a single successful boot.

  • Loss budget drives placement
  • EQ + manageability matter
  • SI/PI shows up as retrains
  • Validation = corners + statistics

When a retimer becomes necessary (trigger conditions)

  • Channel loss exceeds margin: long traces, multiple connectors, and backplane insertion loss reduce eye opening.
  • Cross-board / backplane crossings: each connector/backplane segment consumes budget and increases sensitivity to environment.
  • Higher generation targets: moving to faster link speeds tightens timing and loss margins, increasing “works sometimes” risk.
  • Harsh corners: temperature drift, vibration micro-motion, and EMI can turn a marginal channel into frequent retrains.

Selection criteria (engineering-level, not a protocol lesson)

  • Generation support: the part should support the intended link speed with adequate margin in worst-case channels.
  • Equalization capability: EQ range should cover the channel loss envelope with stable training across corners.
  • Latency/jitter impact: insertion should not break end-to-end latency budgets or amplify jitter into unstable links.
  • Manageability: status and counters should be readable to build an evidence trail (training, errors, retrain bursts).

SI/PI coupling: how power noise turns into link instability

Observed symptom | Likely system-level cause | Evidence to capture
Retrain bursts during load steps | Power-rail noise modulates clock/PLL behavior and reduces timing margin | Retrain timestamps correlated to load/thermal events; error counters rising with activity
Speed downshift at hot corner | Eye margin collapses with temperature (loss increases, jitter grows) | Negotiated-speed history vs temperature; stable at cold but degraded at hot
Intermittent endpoint drops | Marginal channel + connector micro-motion under vibration | Retrain statistics during vibration windows; endpoint link-down events with tight clustering

Validation checklist (flight-ready evidence)

  • Eye/BER concept checks: verify the channel margin on representative worst-case paths (not only “best lane”).
  • Temperature corners: run sustained load at cold and hot corners; record retrain/error statistics.
  • Vibration/handling stress: measure retrains per hour under vibration windows (statistics, not anecdotes).
  • Long-duration soak: observe error counters and link stability over time; capture “bursty” failure patterns.
  • Evidence logging: store training events, negotiated speed changes, retrain counts, and correlation timestamps.
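The "statistics, not anecdotes" requirement can be reduced to a per-corner rate check. A sketch (corner names, run durations, and the retrains/hour limit are illustrative assumptions):

```python
def retrains_per_hour(retrain_timestamps_s, duration_s):
    """Normalize one corner run into a retrains-per-hour figure."""
    return len(retrain_timestamps_s) * 3600.0 / duration_s

def corner_report(runs, limit_per_hour):
    """runs: {corner_name: (timestamps, duration_s)} -> pass/fail per corner.
    The limit is illustrative; a real program derives it from channel margin."""
    return {
        corner: retrains_per_hour(ts, dur) <= limit_per_hour
        for corner, (ts, dur) in runs.items()
    }

# Illustrative corner runs: one retrain in a 2 h cold soak, four in a 1 h hot soak.
runs = {
    "cold-soak": ([120.0], 7200.0),
    "hot-soak": ([60.0, 300.0, 900.0, 1500.0], 3600.0),
}
report = corner_report(runs, limit_per_hour=1.0)
```

A report like this makes a marginal channel visible as a failed corner rather than an occasional mystery.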
Practical rule: channel success must be demonstrated by corner-stable statistics (retrain/error trends) and auditable logs—not by a single successful enumeration.
Figure F5. Retimer placement is driven by loss budget and corner stability. The objective is stable training and low error/retrain statistics under temperature and vibration—not merely a successful boot.

H2-6 · Memory & Storage: ECC + NVMe Control as a Data-Integrity Loop

Goal: make “data integrity” measurable and operational: detect errors, trend them, trigger defined actions, and preserve auditable evidence. This includes ECC memory health signals and NVMe behavior (consistency and tail latency) under real mission load.

  • ECC = trendable health signal
  • NVMe = consistency + QoS
  • Tier data: system / records / evidence
  • Protect evidence from bulk traffic

ECC memory: why it matters and how it becomes health telemetry

  • Engineering value: ECC reduces silent corruption risk and provides measurable health signals for long-duration operation.
  • Correctable vs uncorrectable: correctable events (CE) are trend indicators; uncorrectable events (UE) trigger defined fault handling and evidence capture.
  • Scrub (concept): periodic maintenance reduces accumulated risk; success/fail counts become part of health records.
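Treating CE counts as a trend means checking their slope, not their absolute value. A sketch (sample window and slope limit are illustrative assumptions):

```python
def ce_trend_verdict(ce_counts, window=4, slope_limit=2.0):
    """Sketch of a correctable-error (CE) trend check: compare the average
    CE increment over the last `window` samples against a slope limit.
    Samples are cumulative CE counts at fixed intervals; limits illustrative."""
    if len(ce_counts) < window + 1:
        return "insufficient-data"
    recent = ce_counts[-(window + 1):]
    slope = (recent[-1] - recent[0]) / window
    return "schedule-maintenance" if slope > slope_limit else "healthy"
```

The verdict string would feed the action tiers described later: an accelerating CE trend schedules service before a UE event forces a harder response.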

NVMe control: the practical metrics that matter

  • Write consistency: records and evidence should not be left in an ambiguous “half-written” state after resets or power events.
  • Tail latency (QoS): bulk recording/export must not destroy mission determinism; evaluate p99/p999 write behavior under sustained load.
  • Power-fail marker (concept): when a power-fail indicator is asserted, the system should perform controlled flush/commit steps and store an auditable shutdown record.

Data tiering: separate what must be preserved from what can be large

  • System area: bootable images and controlled configurations (signed updates only).
  • Evidence / logs area: event logs, BIT packages, version audits, CE/UE trends (prioritize integrity and accessibility).
  • Records area: high-throughput mission records (allowed to follow defined degraded rules, but must remain explainable).

Common pitfall: records and mission tasks share the same queue

  • Symptom: export or heavy recording causes unpredictable tail latency and deadline misses.
  • Root cause: queue contention and background storage behavior introduce non-deterministic stalls.
  • Fix: partition/namespaces (concept), protect the evidence/logs area, and enforce bandwidth/priority separation.
Practical rule: evidence must remain readable and consistent even when record traffic is heavy; integrity must be measured via counters and event markers, not assumed.
Figure F6. Data integrity is an engineering loop: ECC events become health telemetry, NVMe behavior is evaluated by consistency and tail latency, and data is tiered so evidence/logs remain protected from bulk record traffic.

H2-7 · Secure Boot with HSM: Chain of Trust + Maintainable Updates

Goal: treat secure boot as an engineering chain: clear responsibilities, observable failure modes, and verifiable evidence. The objective is a boot-and-update flow that remains secure while staying serviceable in the field.

  • Verify at every stage
  • Record evidence points
  • Keys must stay protected
  • Rollback must be blocked

Typical chain of trust (what is verified, and when)

  • ROM (Root-of-Trust start): validates the first-stage boot component before executing it.
  • Bootloader: validates the OS or hypervisor image and the platform policy bundle.
  • OS / Hypervisor: validates mission applications and enforces runtime policy gates.
  • Application stage: loads only approved modules and exports verifiable version/state into logs.
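The "verify, record, then execute" chain can be sketched compactly. Note the hedge: real secure boot verifies signatures inside an HSM/RoT boundary; a bare SHA-256 digest against an approved manifest stands in here only to keep the sketch self-contained, and all names are illustrative:

```python
import hashlib

# Illustrative approved-image manifest (stage name -> expected digest).
APPROVED = {
    "bootloader": hashlib.sha256(b"bootloader-v2").hexdigest(),
    "os": hashlib.sha256(b"os-v7").hexdigest(),
    "mission-app": hashlib.sha256(b"app-v13").hexdigest(),
}

def boot_chain(images, evidence_log):
    """Verify each stage before 'executing' it, logging an evidence point
    either way; refuse to continue on any mismatch (enforced gate)."""
    for stage, blob in images:
        digest = hashlib.sha256(blob).hexdigest()
        ok = APPROVED.get(stage) == digest
        evidence_log.append({"stage": stage, "verified": ok})
        if not ok:
            return "safe-state"  # do not execute unverified code
    return "mission-running"

evidence = []
result = boot_chain(
    [("bootloader", b"bootloader-v2"), ("os", b"os-v7"),
     ("mission-app", b"app-v13")],
    evidence,
)
```

The structural point is that every stage emits an evidence entry even on success, which is what makes the boot timeline auditable later.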

HSM / TPM / Secure Element: the practical responsibilities

  • Protected key usage: cryptographic keys are used inside a protected boundary (avoid key export paths).
  • Version counters / monotonic state: provide the anchor for rollback protection (anti-downgrade).
  • Attestable measurements (optional concept): record “what booted” as evidence that can be audited later.

Measured boot vs secure boot (boundary definition)

  • Secure boot: blocks execution if verification fails (enforced gate).
  • Measured boot: allows boot but records measurements for audit (evidence-first).
  • Engineering rule: use secure boot where unsafe code must never run; use measured boot to strengthen auditability and forensics.

Key provisioning and update serviceability (factory → field)

Process step | Control objective | Evidence to record
Factory key provisioning | Prevent key leakage and batch-wide exposure; keep keys protected and traceable | Device identity + key IDs (not raw keys), operator/workstation ID, timestamp, pass/fail
Policy-gated updates | Only signed images may install; enforce update windows and policy constraints | From-version → to-version, signature status, policy check result, update outcome
Rollback protection (concept) | Block downgrades to vulnerable older images even if they are signed | Monotonic counter/value used, downgrade attempt detected, reason code
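The rollback-protection row reduces to one invariant: the version floor only moves forward. A sketch (class name, version numbering, and outcome strings are illustrative; on real hardware the floor is a monotonic counter anchored in RoT state):

```python
class RollbackGuard:
    """Anti-downgrade sketch: a monotonic version floor blocks installs
    below it, even for correctly signed images, and every attempt leaves
    an audit entry. Version numbering is illustrative."""
    def __init__(self, floor_version: int):
        self.floor = floor_version
        self.audit = []

    def try_install(self, signed_ok: bool, to_version: int) -> str:
        if not signed_ok:
            outcome = "rejected-signature"
        elif to_version < self.floor:
            outcome = "rejected-downgrade"
        else:
            self.floor = to_version  # the counter only moves forward
            outcome = "installed"
        self.audit.append({"to": to_version, "outcome": outcome})
        return outcome

guard = RollbackGuard(floor_version=7)
```

Logging rejected attempts, not just successes, is what makes a downgrade attack visible in the field audit trail.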

Common pitfalls (symptoms and fixes)

  • Signed updates without rollback control: allows downgrade windows. Fix by enforcing monotonic version rules anchored in RoT state and making failures auditable.
  • Unclear factory key handling: risks “same key across many units.” Fix with per-unit identity materials, batch isolation, least-privilege access, and complete provisioning audits.
Rule: verification alone is not enough—each stage must emit a durable evidence point, and updates must be both signed and policy-gated with rollback blocking.
Figure F7. Secure boot is a chain with evidence points. Each stage verifies the next, logs outcomes, and enforces policy-gated updates with rollback protection anchored in RoT state.

H2-8 · Fault Tolerance & Health Monitoring: From WDT to Evidence Chains

Goal: make reliability operational: observable signals become deterministic verdicts, actions are executed in defined tiers, and every action leaves an auditable evidence record for maintenance and post-event review.

  • Observe → decide → act
  • Prefer degrade over reboot
  • Trends prevent hard faults
  • Evidence packs enable forensics

What to monitor (layered signals)

  • Fabric/link layer: link up/down, retrain bursts, negotiated speed changes, error counter trends.
  • Memory layer: ECC correctable (CE) trends and uncorrectable (UE) events.
  • Thermal & power events: temperature limit flags, reset causes, power-fail markers (event-level only).
  • Storage health (concept): NVMe health indicators and error trends, correlated to workload/temperature.

Verdicts: turn signals into actions

Signal | Verdict (engineering) | Action tier
Retrain burst rate rises | Channel margin degraded; risk of endpoint drops | Degrade bandwidth → isolate branch if persistent
ECC CE trend accelerates | Health risk increasing; maintenance window needed | Raise monitoring level → schedule service action; preserve evidence snapshot
ECC UE event | Integrity compromised | Isolate/recover path → safe-state if required; store evidence pack with reason codes
Thermal limit sustained | Overstress risk; stability may degrade | Degrade compute/record load → safe-state if not recoverable
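A verdict table like this typically becomes an ordered rule evaluation, with the most severe condition checked first. A sketch mirroring the rows above (field names, thresholds, and action strings are illustrative assumptions):

```python
def action_for(signal: dict) -> str:
    """Map a snapshot of health signals to an action tier, most severe
    condition first. Thresholds and names are illustrative placeholders."""
    if signal.get("ecc_ue", 0) > 0:
        return "isolate-and-evidence"      # integrity compromised
    if signal.get("retrain_per_min", 0.0) > 2.0:
        return "degrade-bandwidth"         # channel margin degraded
    if signal.get("ce_slope", 0.0) > 1.0:
        return "schedule-maintenance"      # health risk trending up
    if signal.get("thermal_limit", False):
        return "degrade-load"              # sustained overstress risk
    return "nominal"
```

Ordering matters: a UE event must win over a concurrent retrain burst, since the integrity response subsumes the bandwidth response.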

Action tiers (degrade first, reboot last)

  • Degrade: limit throughput, pause non-critical recording/export, reduce concurrency.
  • Isolate: quarantine a device/branch to prevent cascaded failures.
  • Recover: retrain links or restart a subsystem if isolation is insufficient.
  • Switch redundancy (concept): move to redundant paths/domains and keep mission in a defined degraded mode.

BIT/BIST in the mission computer (MC-only)

  • PBIT (power-on): verify key links, storage tiers, and evidence-log readiness before mission start.
  • CBIT (continuous): monitor counters/trends with minimal mission impact and produce periodic health records.
  • MBIT (maintenance): deeper diagnostics during scheduled service windows; export evidence packs for review.
Rule: every verdict and action must produce a durable evidence record (time, version, counters, action taken) so field events are traceable and fixable.
Figure F8. Health monitoring is a closed loop: signals become verdicts, actions are executed in tiers, and evidence packs enable remote review and continuous improvement.

H2-9 · Power Holdup & Graceful Shutdown: “Write Completes, Evidence Remains”

Goal: define what the mission computer requires during a power-fail window and how to verify it. The focus is a controlled sequence (detect → flush → seal evidence → safe state), not holdup hardware details.

  • Detect power-fail early
  • Flush in priority order
  • Seal evidence first
  • Prove by injection tests

Power-fail detect: what triggers the controlled shutdown flow

  • Power-fail indicator: a hardware or management-domain event that asserts when input power is collapsing.
  • Interrupt-to-action path: the signal must reach a domain capable of pausing non-critical traffic and starting the flush sequence.
  • Timestamped marker: record when the power-fail was detected so post-event analysis can reconstruct timing.

Holdup budgeting: convert “milliseconds needed” into a budget table

Budget the usable window from power-fail detect to the point where storage writes are no longer reliable. The core requirement is an auditable P × t budget tied to a defined flush sequence.

Load group | Priority | Budget item | Engineering intent
Evidence path (logs + markers) | Must | W × ms | Seal an evidence pack that explains the event and the system state
Metadata flush (consistency anchors) | Must | W × ms | Guarantee boot explainability and avoid ambiguous “half-written” states
Bulk record traffic | Stop | 0 | Pause high-throughput streams so the window is reserved for integrity actions
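The budget table is auditable precisely because it is arithmetic: the "must" allocations have to fit inside the usable window with margin. A sketch of that check (window length, margin factor, and item names are illustrative assumptions):

```python
def holdup_budget_ok(window_ms: float, items_ms: dict) -> bool:
    """Check that the flush allocations fit inside the usable holdup window,
    keeping a derating margin for corners. Numbers are illustrative."""
    margin = 0.8  # reserve 20% of the window for worst-case corners
    return sum(items_ms.values()) <= window_ms * margin

# Example: a 40 ms usable window must cover evidence sealing and metadata
# flush; bulk records are stopped, so their allocation is 0 ms.
budget_items_ms = {"evidence-seal": 12.0, "metadata-flush": 15.0,
                   "bulk-records": 0.0}
fits = holdup_budget_ok(40.0, budget_items_ms)
```

The margin factor is where corner behavior (cold-start flush times, worn storage) enters the budget instead of being discovered in the field.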

Data consistency strategy (engineering order of operations)

  • Stop non-critical writes first: halt bulk recording/export to prevent queue contention.
  • Flush metadata with a bounded time: commit the minimum set needed for explainable restart.
  • Seal evidence before optional records: preserve the “what happened” timeline even if some record data is dropped.
  • Enter safe state: reduce activity to minimize further corruption risk as voltage continues to fall.
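The order of operations above can be sketched as a fixed sequence, with each step committing before the next starts. The `io` interface and its method names are illustrative assumptions, not a real API:

```python
def graceful_shutdown(io):
    """Ordered power-fail sequence from the steps above. `io` is an
    illustrative storage/power interface; each call returns only when
    its step commits. The order encodes the priority: integrity actions
    before bulk data."""
    trace = []
    io.stop_bulk_writes();   trace.append("stopped-bulk")
    io.flush_metadata();     trace.append("metadata-flushed")
    io.seal_evidence_pack(); trace.append("evidence-sealed")
    io.enter_safe_state();   trace.append("safe-state")
    return trace

class _FakeIO:
    """Stand-in for the real interface, for demonstration only."""
    def stop_bulk_writes(self): pass
    def flush_metadata(self): pass
    def seal_evidence_pack(self): pass
    def enter_safe_state(self): pass

trace = graceful_shutdown(_FakeIO())
```

Returning the trace (and persisting it with timestamps) is what produces the power-fail record that the injection tests below check for.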

Verification: power-cut injection tests (matrix + pass criteria)

  • Injection phases: idle, high write load, mixed mission + record load, and during maintenance export.
  • Corners: cold and hot temperature corners; include representative vibration windows when applicable.
  • Pass criteria: evidence pack present and consistent; restart is explainable; failures produce clear reason codes.
  • Evidence output: each injection produces a power-fail record: time, version, flush result, and shutdown stage.
Rule: the power-fail window is reserved for integrity actions. Evidence and metadata must complete first; bulk records are allowed to degrade in a controlled, explainable way.
Figure F9 — Power-fail timeline: detect → stop bulk writes → flush metadata → seal evidence → safe state. The usable window starts at power-fail detect and ends when writes are no longer reliable.
Figure F9. The mission computer treats power loss as a controlled state machine. The holdup window is budgeted and verified by injection tests, with evidence sealed before power-off.

H2-10 · Thermal, EMI, and Ruggedization: Full-Load Stability You Can Reproduce

Goal: translate avionics environmental reality into board-level and chassis-level constraints for a mission computer. The focus is sustained full-load behavior, controlled degradation, and reproducible evidence for intermittent faults.

Key points: the conduction path is primary · hotspots must be tracked · return paths stay continuous · vibration faults need statistics.

Thermal path: board hotspots to chassis conduction

  • Primary path: device → PCB → thermal interface → conduction plate / cold wall → chassis.
  • Typical hotspots: PCIe switch, retimers, NVMe devices, and high-load compute zones.
  • Engineering requirement: define allowable hotspot temperatures and a controlled derating policy when limits are approached.
  • Evidence: log temperature peaks with timestamps and correlate them to link stability and throughput behavior.
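A minimal sketch of the derating policy described above; the warn/limit thresholds and sensor names are assumptions, not qualified values.

```python
# Derating policy sketch: map a hotspot reading to a controlled action plus
# a timestamped evidence row. Thresholds are illustrative assumptions.

def derate_action(temp_c: float, warn_c: float = 95.0, limit_c: float = 105.0) -> str:
    """Below warn: full performance. Between warn and limit: throttle bulk
    throughput. At or above limit: shed non-critical load."""
    if temp_c < warn_c:
        return "full_performance"
    if temp_c < limit_c:
        return "throttle_bulk"
    return "shed_noncritical"

def log_peak(timestamp: str, sensor: str, temp_c: float) -> dict:
    # Evidence row: peaks are timestamped so they can later be correlated
    # with link retrains and throughput dips.
    return {"t": timestamp, "sensor": sensor, "temp_c": temp_c,
            "action": derate_action(temp_c)}

entry = log_peak("2024-01-01T00:00:00Z", "pcie_switch", 98.5)
```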

EMI/EMC (system-level, minimal theory)

  • Return-path continuity: high-speed SERDES stability depends on clean reference/return paths across connectors and shields.
  • Symptom-based evidence: retrain bursts, rising error counters, and speed shifts correlated to specific operating modes or test exposure.
  • Action intent: isolate noisy domains, keep return paths continuous, and maintain consistent shield/ground boundaries at the chassis level.

Vibration/shock: intermittent faults that must be made repeatable

  • Connector micro-motion: vibration can create intermittent impedance changes that trigger retrains and endpoint drops.
  • Correlation requirement: record retrain and link-down events during defined vibration windows and compute “events per hour.”
  • Fix validation: changes are accepted only when statistics improve, not when the issue “seems gone.”
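The "statistics must improve" rule can be made concrete as an events-per-hour comparison; window lengths, event counts, the threshold, and the improvement factor below are illustrative assumptions.

```python
# Retrain-statistics sketch: turn intermittent faults into events/hour over
# defined vibration windows, and accept a fix only when the rate measurably
# improves. Numbers are illustrative assumptions.

def events_per_hour(event_count: int, window_minutes: float) -> float:
    return event_count / (window_minutes / 60.0)

def fix_accepted(before_rate: float, after_rate: float,
                 threshold: float, min_improvement: float = 0.5) -> bool:
    """A fix passes only when the rate is below threshold AND at least
    min_improvement (fractional) better than before; 'seems gone' fails."""
    return after_rate <= threshold and after_rate <= before_rate * (1 - min_improvement)

before = events_per_hour(event_count=18, window_minutes=30)  # 36 events/hour
after = events_per_hour(event_count=2, window_minutes=30)    # 4 events/hour
```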

Rugged validation metrics (what “done” looks like)

  • Thermal soak: full-load stability without uncontrolled resets; controlled derating is logged and explainable.
  • EMI exposure: no unexplained link instability; counters and speed states remain within expected bounds.
  • Vibration window: retrain/event rate below threshold; evidence packs generated for any excursions.
Rule: rugged design is proven by reproducible stress tests and evidence records. Intermittent faults must be turned into statistics that improve after fixes.
Figure F10 — Rugged box view: conduction path, shielding layers, connector zone, and component hotspots.
Figure F10. Ruggedization is expressed as layers: conduction cooling into the chassis, a consistent shielding boundary for return paths, and a connector zone whose intermittent faults are validated by retrain statistics under vibration.

H2-11 · Validation & Production Checklist (What “Done” Means)

This section defines release acceptance as a checkable contract for engineering, manufacturing, and field teams. Each item includes purpose, method, pass criteria, stored evidence, and example part numbers to align with procurement.

Evidence artifacts: test report (R&D) · factory record (per unit) · field export package · trace/release ID.
Evidence rule: every PASS must be tied to a durable Trace ID (device P/N + serial + build/policy version + timestamp). FAIL requires a hold record (NCR/issue ID) with the same Trace ID.
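A minimal sketch of the Trace ID binding rule; the separator and field order are assumptions, not a mandated format.

```python
# Trace ID sketch: one durable identifier binding device P/N, serial,
# build/policy version, and timestamp. Layout is an assumed convention.

def make_trace_id(part_number: str, serial: str, build: str,
                  policy: str, timestamp: str) -> str:
    fields = [part_number, serial, build, policy, timestamp]
    # Reject empty fields or ones containing the separator itself.
    assert all(fields) and not any("|" in f for f in fields)
    return "|".join(fields)

tid = make_trace_id("MC-1000", "SN00042", "build-3.2.1", "pol-7", "20240101T000000Z")
# Every PASS row and every FAIL/NCR hold record carries this same string.
```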

Layer 1 — R&D Validation (prove stability under stress)

Focus on system risks that must be closed before manufacturing: fabric stability, storage integrity, secure boot enforcement, and health evidence chains.

  • PCIe stability under corners (fabric · retrain)
      Purpose: prevent intermittent link drops and cascaded endpoint failures.
      Method: stress workload + temperature corners; record retrain bursts, link-down events, and negotiated speed changes.
      Pass criteria: retrain/hour below threshold; no unexplained link-down; any degradation is logged and explainable.
      Evidence to store: stress report + counter snapshots + corner coverage proof.
      Related parts (examples): PCIe switch: Broadcom/PLX PEX97xx family; retimer: TI DS160PR810 / DS160PR412 (family examples).
  • NVMe consistency + power-cut injection (storage · integrity)
      Purpose: ensure “boot is explainable” and evidence survives power loss scenarios.
      Method: power-cut injection matrix (phase × load × temp); verify metadata flush ordering and evidence sealing.
      Pass criteria: evidence pack present and consistent; key metadata consistent; failures return reason codes.
      Evidence to store: injection record table + sample evidence pack + reason code map.
      Related parts (examples): industrial NVMe SSD: Micron 7450 / Kioxia CD6 / Samsung PM9A3.
  • ECC behavior and health thresholds (memory · evidence)
      Purpose: turn CE trends into maintenance signals; treat UE as integrity events with traceable actions.
      Method: fault injection (software triggers / controlled stress); confirm CE/UE counters are captured and exported.
      Pass criteria: CE trend alarms are generated; UE triggers a defined isolation/recovery action + evidence pack.
      Evidence to store: counter snapshots + action logs + exported field package sample.
      Related parts (examples): ECC DRAM: Micron ECC DDR4/DDR5 (family placeholder); supervisor/monitor: TI TPS38xx / ADI LTC29xx (family examples).
  • Secure boot chain rehearsal (security · policy)
      Purpose: prove “verify + record” at each stage, and enforce policy gates with rollback blocking.
      Method: inject invalid signature, wrong version, and policy-disallowed package; validate reject paths and logs.
      Pass criteria: correct rejection with stage + reason code; audit log includes version and policy identifiers.
      Evidence to store: reject logs + policy version snapshot + boot stage evidence markers.
      Related parts (examples): TPM/HSM/SE: Infineon OPTIGA TPM (e.g., SLB 9670) / Microchip ATECC608B.
  • Health monitoring loop closes (BIT/BIST · ops)
      Purpose: ensure faults become verdicts, actions, and evidence packs retrievable for review.
      Method: run PBIT/CBIT/MBIT flows; verify export path and trace ID binding.
      Pass criteria: BIT results are consistent; exports succeed; any action (degrade/isolate/recover) produces evidence.
      Evidence to store: BIT logs + export file naming convention + trace ID rule.
      Related parts (examples): watchdog/safety monitor (placeholder family): NXP/S32 safety MCU class; TI watchdog supervisors.

Notes: part numbers above are example BOM anchors for procurement alignment and do not mandate a single vendor; final selections must match program qualification rules.

Layer 2 — Production Test (repeatable per-unit release proof)

Production items prioritize identity correctness, interface sanity, and baseline health records that enable field comparison.

  • Serial / certificate injection verification (provisioning)
      Purpose: guarantee per-unit identity and traceability; avoid batch-wide exposure risks.
      Method: inject identifiers, then immediately verify chain/IDs (no secret export); bind to Trace ID.
      Pass criteria: ID match; certificate/key IDs present; audit fields complete (station/operator/time/result).
      Evidence to store: factory record row: serial + device P/N + key/cert IDs (non-secret) + station ID + timestamp.
      Related parts (examples): TPM/SE families as qualified (Infineon OPTIGA TPM, Microchip ATECC608B).
  • Interface self-test, bring-up sanity (I/O)
      Purpose: catch assembly issues and gross faults early with fast, repeatable tests.
      Method: power-on self-test script; endpoint enumeration; minimal loopback/signature checks where applicable.
      Pass criteria: endpoints present; counters at baseline; no abnormal resets; self-test report generated.
      Evidence to store: auto-generated production self-test report attached to Trace ID.
      Related parts (examples): qualified PCIe switch/retimer families.
  • Storage health baseline snapshot (NVMe)
      Purpose: create an out-of-box baseline used to detect degradation over life.
      Method: record firmware version, health summary, and initial error counters; store as baseline record.
      Pass criteria: baseline record saved; counters within expected new-unit range.
      Evidence to store: baseline file/row: drive ID + firmware + health summary + timestamp + Trace ID.
      Related parts (examples): industrial NVMe SSD (Micron 7450 / Kioxia CD6 / Samsung PM9A3).

Layer 3 — Field Self-Test (fast go/no-go + evidence export)

Field checks confirm the system is explainable and supportable: BIT status, exportable logs, and auditable version/policy state.

  • BIT status summary export (PBIT/CBIT)
      Purpose: quickly determine if the unit is fit for mission or requires maintenance.
      Method: run BIT summary command; export result with Trace ID binding.
      Pass criteria: BIT passes or shows a controlled, explainable degraded mode; export completes.
      Evidence to store: BIT summary file + timestamp + unit serial + software/policy version.
  • Event log / evidence pack export (forensics)
      Purpose: enable post-event reconstruction without lab access.
      Method: export evidence pack; verify integrity (length/hash) and naming convention.
      Pass criteria: package is readable and complete; includes counters + recent events + reason codes.
      Evidence to store: export package + checksum + export tool version.
  • Version & policy audit snapshot (audit)
      Purpose: ensure upgrade state is known and rollback is not silently allowed.
      Method: read software version, policy version, and last-update record summary.
      Pass criteria: versions match expected release; policy gate status valid; anomalies logged with reason codes.
      Evidence to store: audit snapshot attached to Trace ID; last-update summary record.
      Related parts (examples): TPM/SE presence as the program requires.

Release Gate (PASS/FAIL with Trace ID)

All three layers feed a single release decision. PASS creates a release ID; FAIL creates a hold record that remains traceable.
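The gate logic can be sketched directly from that rule; layer names and record shapes below are assumptions.

```python
# Release-gate sketch: PASS requires complete evidence from all three layers;
# FAIL yields a hold record bound to the same Trace ID. Shapes are assumptions.

REQUIRED_LAYERS = ["rnd", "factory", "field"]

def release_gate(trace_id: str, layers: dict) -> dict:
    """layers maps layer name -> evidence-complete flag."""
    missing = [k for k in REQUIRED_LAYERS if not layers.get(k, False)]
    if not missing:
        return {"trace_id": trace_id, "verdict": "PASS",
                "release_id": f"REL-{trace_id}"}
    # FAIL stays traceable: hold record carries the same Trace ID.
    return {"trace_id": trace_id, "verdict": "FAIL",
            "hold": {"ncr": f"NCR-{trace_id}", "missing": missing}}

ok = release_gate("MC-1000|SN00042", {"rnd": True, "factory": True, "field": True})
held = release_gate("MC-1000|SN00043", {"rnd": True, "factory": False, "field": True})
```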

Figure F11 — Release gate: inputs → rules → PASS/FAIL, all bound to Trace ID.
Figure F11. Release is a gated decision fed by R&D results, factory records, and field self-tests. Both PASS and FAIL remain traceable through a consistent Trace ID.


H2-12 · FAQs (Mission Computer)

These FAQs focus on mission-computer engineering decisions: boundary definition, PCIe fabric design, integrity evidence, controlled power-fail behavior, maintainable secure boot, and rugged verification.

Topics: boundary · PCIe fabric · integrity · power-fail · security · rugged · production.
1) What is the boundary between a Mission Computer and a Flight Control Computer (FCC)?

A mission computer is optimized for mission applications, sensor fusion, recording, and maintainable upgrades, while an FCC prioritizes closed-loop control determinism and safety certification. The key is not the CPU type, but who owns the safety-critical control authority and its verification evidence.

  • Mission Computer: high-throughput compute + storage + evidence logs for mission functions and post-event analysis.
  • FCC boundary: the control loop owner and the safety case scope (kept as an interface requirement here).
  • Interfaces: the mission computer consumes/produces data and health status; it does not define the flight-control loop behavior.
  • Evidence focus: mission logs, BIT results, and traceable version/policy records.
2) Why is PCIe more like a “network” than a “bus” in mission computing?

In mission computers, PCIe behaves like a designed fabric with topology, isolation boundaries, and observability metrics—not a simple point-to-point cable. Stability depends on how paths are routed, how faults are contained, and how retrains and speed changes are measured over stress corners.

  • Topology matters: root-to-switch depth, fanout, and endpoint grouping shape performance and fault domains.
  • Isolation matters: faults should be contained to a device/path instead of cascading across the fabric.
  • Observability matters: retrain rate, link-down events, and negotiated speed changes become “fabric health” signals.
  • Verification: prove stability under temperature and workload stress, not just “enumerates and runs.”
3) When is a PCIe switch required instead of direct connections?

A PCIe switch becomes necessary when endpoint count, bandwidth aggregation, maintainability, or fault isolation can’t be met with direct lanes. The decision should be justified by a port/bandwidth budget and a fault-domain map, not by “it boots today.”

  • Scaling: many endpoints (NVMe, capture cards, FPGA, NICs) exceed root-complex lane/port constraints.
  • Aggregation: multiple sources must share an upstream link while preserving deterministic priorities.
  • Isolation: a misbehaving endpoint should not stall unrelated devices; topology should define fault boundaries.
  • Serviceability: modular replaceable units benefit from a stable fabric interface behind a switch.
4) How do you choose between a retimer and a redriver, and what is the most common mistake?

A redriver mainly compensates channel loss, while a retimer re-establishes signal timing to regain margin across longer, lossier paths. The most common mistake is validating only at room temperature with short runs, then discovering retrains and dropouts under thermal/vibration corners.

  • Use a redriver when loss is moderate and the main goal is equalization without full timing recovery.
  • Use a retimer when link margin is insufficient across backplanes/connectors/long routes and timing must be restored.
  • Watch for latency and management needs; treat them as system requirements, not afterthoughts.
  • Verification: track error/retrain statistics across temperature and vibration windows.
5) For intermittent retrains or “drive disappears,” which evidence counters should be checked first?

Start by classifying whether the symptom is fabric-level, storage-level, or system-level, then correlate with timestamps. The fastest path is: link events (retrain/link-down/speed shifts), storage health snapshots, and system markers (reset cause, temperature peaks, power-fail flags).

  • Fabric evidence: retrain bursts/hour, link-down count, negotiated speed changes with timestamps.
  • Storage evidence: health summary snapshots and error summaries taken before/after the incident.
  • System evidence: reset reason, thermal peak markers, and any power-fail detect events.
  • Rule: if the timeline cannot be reconstructed, evidence is insufficient even if the fault is rare.
6) When should ECC correction counts trigger degradation or maintenance?

ECC “corrected errors” are not harmless if they trend upward or cluster with stress conditions. A practical policy is to treat corrected-error rate as a trend alarm (maintenance signal) and uncorrectable errors as integrity events that must trigger controlled isolation/recovery with evidence.

  • Trend-based: sustained increases in corrected errors should raise a maintenance flag.
  • Event-based: any uncorrectable error is an integrity incident requiring deterministic action and traceable logs.
  • Context: thresholds depend on mission criticality and whether redundancy can absorb degradation.
  • Evidence: store counts, timestamps, temperature context, and the chosen action (degrade/isolate/recover).
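The trend/event split can be sketched as a small policy function; the CE rate limit and the interval scheme are illustrative assumptions.

```python
# ECC policy sketch: corrected-error counts drive a trend alarm, while any
# uncorrectable error is an integrity event. Limits are assumptions.

def ecc_verdict(ce_history: list, ue_count: int, ce_rate_limit: int = 10) -> str:
    """ce_history holds cumulative corrected-error counts per monitoring interval."""
    if ue_count > 0:
        return "integrity_event"   # deterministic isolate/recover + evidence pack
    if len(ce_history) >= 2 and ce_history[-1] - ce_history[0] > ce_rate_limit:
        return "maintenance_flag"  # sustained upward CE trend
    return "healthy"
```

In practice the verdict would be stored with timestamps and temperature context, per the evidence bullet above.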
7) What are the most common NVMe data corruption paths during power loss?

The most common corruption paths involve unfinished write queues and metadata updates that do not complete in a controlled order. Mission computers should prioritize “explainability” by flushing critical metadata and sealing an evidence pack before attempting to preserve bulk records.

  • Queue not drained: outstanding writes never reach a durable state before voltage falls too far.
  • Metadata ordering: a partial metadata update can make restart ambiguous even if data blocks exist.
  • Priority inversion: bulk record traffic steals the limited window needed for integrity actions.
  • Verification: power-cut injection tests must confirm evidence presence and explainable restart.
8) How does a power-holdup requirement turn from “milliseconds” into a verifiable budget?

Define the usable window from power-fail detect to “writes no longer reliable,” then budget that window by priority groups using P×t. A verifiable budget includes a load list (must/stop), a flush sequence, and an injection test matrix that proves the sequence completes.

  • Window definition: start at power-fail detect; end at the minimum voltage for reliable writes.
  • Priority groups: stop bulk records, flush critical metadata, seal evidence, then enter safe state.
  • Budget format: per-group power × time with a clear “must complete” list.
  • Verification: injection tests across phase/load/temperature with pass criteria tied to evidence and explainability.
9) How can secure boot with an HSM remain field-upgradeable and auditable?

Keep security as a chain of verified stages with explicit policy gates and audit points, then design upgrade flow to produce logs rather than bypass controls. Field upgradeability comes from signed packages, policy versioning, and traceable rejection reasons—not from weakening verification.

  • Chain of trust: each stage verifies the next and records a marker (stage, version, outcome).
  • Policy gates: upgrade permissions and version rules are enforced and logged.
  • Audits: store package identity, policy version, and decision reason codes.
  • Verification: rehearse failure cases (bad signature, disallowed version) and confirm correct reject logs.
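The chain-of-trust bullets can be sketched as a stage walker that records a marker per stage and stops at the first rejection; verification itself is mocked here, and all names and reason codes are assumptions.

```python
# Chain-of-trust sketch: each stage is checked in order; the walk stops at
# the first failure and records a reason code instead of booting past it.

def verify_chain(stages: list, allowed_versions: set) -> list:
    """stages: ordered [{'name', 'version', 'signature_ok'}] dicts."""
    markers = []
    for s in stages:
        if not s["signature_ok"]:
            markers.append({"stage": s["name"], "outcome": "REJECT",
                            "reason": "BAD_SIGNATURE"})
            break
        if s["version"] not in allowed_versions:
            markers.append({"stage": s["name"], "outcome": "REJECT",
                            "reason": "POLICY_DISALLOWED_VERSION"})
            break
        markers.append({"stage": s["name"], "outcome": "OK",
                        "version": s["version"]})
    return markers

good = verify_chain([{"name": "bootloader", "version": "2.0", "signature_ok": True},
                     {"name": "os", "version": "5.1", "signature_ok": True}],
                    allowed_versions={"2.0", "5.1"})
bad = verify_chain([{"name": "bootloader", "version": "2.0", "signature_ok": True},
                    {"name": "os", "version": "4.0", "signature_ok": True}],
                   allowed_versions={"2.0", "5.1"})
```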
10) How can anti-rollback be enforced without blocking maintenance module replacement?

Separate “replaceable module identity” from “software version floor,” then use policy-driven authorization and audit logs for maintenance actions. The goal is to allow replacement while preventing silent downgrades: every exception must be explicit, traceable, and policy-bound.

  • Version floor: enforce a minimum allowed version/policy state for operation.
  • Replacement flow: allow new modules if identity and policy checks pass; record a maintenance audit entry.
  • No silent exceptions: any special recovery process must produce a traceable log with reason codes.
  • Verification: test replacement scenarios and confirm audit completeness and enforced version rules.
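A minimal sketch of the version-floor plus authorized-exception policy; the tuple version scheme and field names are assumptions.

```python
# Anti-rollback sketch: a version floor blocks silent downgrades, while a
# policy-authorized maintenance exception still produces an audit entry.

def check_update(current: tuple, candidate: tuple, floor: tuple,
                 maintenance_auth: bool = False) -> dict:
    """Versions are (major, minor, patch) tuples; tuple comparison is ordered."""
    if candidate < floor:
        return {"allowed": False, "reason": "BELOW_VERSION_FLOOR"}
    if candidate < current and not maintenance_auth:
        return {"allowed": False, "reason": "ROLLBACK_WITHOUT_AUTH"}
    entry = {"allowed": True, "from": current, "to": candidate}
    if candidate < current:
        entry["audit"] = "AUTHORIZED_ROLLBACK"  # explicit, traceable exception
    return entry
```

The key property is that no path returns "allowed" for a downgrade without leaving an audit marker.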
11) Why doesn't “full performance” guarantee “flight readiness,” and how do heat and vibration trigger SERDES issues?

Heat reduces link margin and changes timing, while vibration can cause micro-motion at connectors and backplanes—both can manifest as retrains, speed drops, or intermittent endpoint loss. Flight readiness requires proving stability under combined stress and turning intermittent faults into measurable statistics.

  • Thermal effect: higher temperatures can reduce margin and increase error/retrain probability.
  • Vibration effect: connector micro-motion can cause transient impedance changes and link instability.
  • Evidence metric: retrain/hour, link-down counts, speed shifts correlated to temperature peaks or vibration windows.
  • Verification: fixes are accepted only when statistics improve and evidence remains traceable.
12) In production, which identity/certificate/log baselines are most commonly missed?

Production gaps often come from incomplete traceability: serial and certificate IDs not verified per unit, missing baseline health snapshots, and missing audit versions for software and policy. A robust release process binds every record to a Trace ID and stores baselines needed for field comparison.

  • Identity verification: per-unit serial/device P/N consistency plus non-secret certificate/key IDs.
  • Baseline snapshots: storage health and firmware summary captured at ship time.
  • Audit state: software version, policy version, and last-update summary stored and exportable.
  • Evidence rule: every PASS/FAIL decision must be reconstructable from stored records.