Mission Computer for Avionics: PCIe, HSM Secure Boot, NVMe
A mission computer is the avionics compute core that routes high-rate data through a designed PCIe fabric, protects integrity with ECC and NVMe evidence logging, and stays upgradeable via secure boot with an HSM. It is considered “done” only when power-fail behavior, health monitoring, and rugged thermal/vibration stability are verifiable with traceable records—not just when the system boots.
H2-1 · Mission Computer: Scope & System Boundary
Goal: define what an avionics mission computer is responsible for, and—equally important—what it is not. This prevents scope creep into sibling topics (switching, timing, and power front-end design) while keeping the page engineering-useful for both procurement and implementation.
What a mission computer is (in-scope)
- Mission application host: sensor fusion, mission planning, payload control, and deterministic data processing pipelines.
- High-speed fabric hub: a PCIe-based compute/backplane fabric tying CPU/SoC, FPGA/accelerators, network I/O, and NVMe storage into a controlled topology.
- Assurance anchor: secure boot with a hardware root-of-trust (HSM/TPM-class device), plus audit-ready versioning and event evidence.
- Recorder & evidence loop: flight/mission logs, BIT artifacts, and “what happened when” timelines that survive resets and maintenance actions.
What a mission computer is not (out-of-scope, but interfaces are defined)
- Not the aircraft network switch (e.g., AFDX/TSN switching): only bandwidth/port/redundancy requirements and health indicators are referenced.
- Not the timing master (GPSDO/PTP/SyncE internals): only timing inputs/accuracy targets and “loss-of-sync” behavior requirements are referenced.
- Not the 28 V power front end (surge/spike chain): only the power-fail/brownout signaling seen by the mission computer is referenced.
Typical hardware forms and where the design pressure comes from
- Compute blade / SBC (VPX/OpenVPX, rugged COM-based modules): favors modularity and maintainability; stresses connector loss and SI margins.
- Backplane-centric chassis: favors scalable I/O and accelerators; stresses PCIe topology depth and reset/clock distribution discipline.
- Integrated mission box: favors size/weight/power; stresses thermals, serviceability, and controlled upgrade flows.
Boundary handoffs (the practical “contracts” with sibling domains)
| Neighbor domain | What the mission computer consumes | What the mission computer must produce |
|---|---|---|
| Switching / network | Declared throughput targets, traffic classes, redundancy policy, link-health counters (high-level). | Deterministic compute + logging of link events (drops, resets, degraded modes) with timestamps. |
| Timing | Time input quality constraints (accuracy/stability targets), holdover expectations, loss-of-time alarms. | Time-stamped logs, graceful degradation rules when time quality falls below threshold. |
| Power (front end) | Power-fail/brownout indicator(s), “last-gasp” window requirement, reset reason codes (if available). | Power-fail response plan: flush/commit, safe-state entry, and a verifiable shutdown record. |
Acceptance criteria (how “done” can be proven later)
- Determinism: critical task latencies are bounded and repeatable across temperature and load corners.
- Fault containment: a single endpoint failure does not collapse the fabric into repeated cascade resets.
- Auditability: boot chain, versions, and security policy changes leave tamper-evident records.
- Power-loss behavior: power-fail triggers a controlled sequence that preserves NVMe consistency and creates a post-event evidence packet.
H2-2 · Data Paths & I/O Map: How the System Connects
Goal: turn “interfaces” into a flow map that separates mission-critical traffic from management and maintenance. This section establishes the structure used later for PCIe topology, NVMe behavior under faults, and audit-ready upgrade paths.
Step 1 — Classify traffic into three planes
- Data plane (mission traffic): high throughput and bounded latency. Typical endpoints include accelerators, capture/processing cards, and NVMe record/playback.
- Control plane (management traffic): observability, policy, and lifecycle actions. Typical endpoints include BMC/management MCU, health sensors, event logs, and signed update mechanisms.
- Maintenance plane (service tools): restricted access for debugging and depot maintenance. Kept minimal to avoid accidental coupling to mission determinism.
Step 2 — Build a flow map, not a port list
A port list explains what exists; a flow map explains what must be protected. Each critical flow should be described by: source → processing → sink, plus the failure symptom that indicates the flow is compromised.
- Ingest flows: sensor/payload inputs → compute pipeline → decision outputs.
- Record flows: compute pipeline → NVMe (records) → export/maintenance access.
- Evidence flows: events/counters → log store → extraction tooling (BIT packages).
Step 3 — Enforce isolation (logic, bandwidth, fault containment)
- Logic isolation: updates and maintenance actions must not execute in the same privilege and timing context as mission tasks.
- Bandwidth isolation: reserve headroom so large transfers (e.g., record export) cannot starve mission pipelines.
- Fault containment: a misbehaving endpoint must not trigger repeated cascade resets across unrelated links.
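The bandwidth-isolation rule above can be sketched as a reservation check: bulk transfers only get the capacity left over after the mission reservation, even when mission traffic is momentarily light. A minimal illustration in Python; the link capacity, reservation figure, and function name are hypothetical, not from any real product.

```python
# Minimal sketch of bandwidth isolation: bulk export traffic may only
# consume link capacity left over after the mission reservation.
# All names and numbers are illustrative, not from a real product.
LINK_CAPACITY_MBPS = 4000      # hypothetical fabric link budget
MISSION_RESERVE_MBPS = 2500    # headroom reserved for mission pipelines

def admit_export_burst(current_mission_mbps: float, requested_mbps: float) -> float:
    """Return the export rate actually granted (may be 0)."""
    # Export may never eat into the mission reservation, even if mission
    # traffic is momentarily below it -- that headroom is the guarantee.
    free = LINK_CAPACITY_MBPS - max(current_mission_mbps, MISSION_RESERVE_MBPS)
    return max(0.0, min(requested_mbps, free))

# Under light mission load the reservation still caps the export rate:
print(admit_export_burst(current_mission_mbps=800, requested_mbps=3000))   # grants 1500.0
print(admit_export_burst(current_mission_mbps=3200, requested_mbps=3000))  # grants 800.0
```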
Typical bottlenecks and the evidence that exposes them
- DMA congestion: symptoms include tail-latency spikes and periodic stalls; evidence includes queue depth, dropped frames, and latency histograms.
- Excessive fabric depth: symptoms include intermittent enumeration and frequent link retraining; evidence includes link-down/retrain counters and negotiated speed changes.
- Shared reset/refclk coupling: symptoms include “one fault resets many”; evidence includes correlated timestamps across reset causes and link events.
Throughput & latency budget template (copy/paste friendly)
| Flow | Target rate | Latency bound | Failure symptom | Evidence to log |
|---|---|---|---|---|
| Ingest → Compute | ____ MB/s | avg ____ / p99 ____ | jitter / missed deadlines | timestamps, queue depth, drop counters |
| Compute → NVMe record | ____ MB/s | write p99 ____ | record gaps / corruption | flush time, write errors, power-fail markers |
| Health → Evidence package | low / periodic | bounded (non-critical) | missing audit trail | reset causes, version IDs, event hashes |
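Once the blanks in the template are filled in, each row can be checked mechanically against measured latencies. A sketch with a simple nearest-rank percentile; all bounds and sample values are placeholders, like the blanks above.

```python
# Sketch: turn the budget-table row "avg ____ / p99 ____" into a pass/fail
# check over measured latencies (microseconds). Bounds are illustrative.
def percentile(samples, p):
    """Nearest-rank percentile; adequate for a budget-check sketch."""
    s = sorted(samples)
    k = max(0, min(len(s) - 1, int(round(p / 100.0 * len(s))) - 1))
    return s[k]

def check_flow_budget(latencies_us, avg_bound_us, p99_bound_us):
    avg = sum(latencies_us) / len(latencies_us)
    p99 = percentile(latencies_us, 99)
    return {"avg_us": avg, "p99_us": p99,
            "pass": avg <= avg_bound_us and p99 <= p99_bound_us}

# 1000 samples: mostly fast with a small tail
samples = [100] * 990 + [900] * 10
print(check_flow_budget(samples, avg_bound_us=150, p99_bound_us=500))
```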
H2-3 · Compute Architecture Choices: CPU, FPGA, Accelerators
Goal: explain why mission computing is rarely “just a bigger CPU.” The architecture must produce deterministic timing, fault containment, and an auditable lifecycle (updates, logs, and evidence) under harsh avionics constraints.
Role split that maps to mission outcomes (not to marketing terms)
CPU (host processor)
- Best for: mission state machines, task orchestration, complex control logic, routing decisions, record/index management.
- Engineering win: behavior remains explainable (why a decision was made) and maintainable (signed updates, clear versioning).
- Evidence to log: deadline miss counters, queue depths, restart causes, and “safe-state entered” reasons.
FPGA
- Best for: fixed-latency ingest → preprocess → align → DMA scheduling, trigger alignment, and repeatable throughput paths.
- Engineering win: deterministic upper bounds can be proven (latency ceilings do not depend on OS scheduling noise).
- Evidence to log: pipeline underrun/overrun, DMA stall reasons, alignment/trigger health flags.
GPU / accelerator
- Best for: massively parallel workloads where throughput matters more than microsecond-level determinism.
- Engineering win: peak performance without overloading the CPU timing budget.
- Guardrails: define a degraded-mode plan when the accelerator is unavailable (the mission must fail gracefully, not silently).
Selection criteria that matter in avionics mission computing
- Determinism: key paths must have measurable and repeatable latency bounds (avg and tail). If tail latency cannot be bounded, the workload does not belong solely on a CPU.
- Certifiability / verifiability: isolate what must be proven (critical pipelines and safety monitors) from what evolves frequently (maintenance tools and analytics).
- Power & thermal envelope: peak compute must not trigger link instability, storage throttling, or unpredictable timing drift under worst-case temperature.
- Lifecycle supply: replacements should preserve the architecture contract (domain boundaries and interfaces), avoiding “recertify everything” events.
Domain partitioning: mission / management / safety monitor
Partitioning turns reliability and security from “features” into a system rule: maintenance actions and policy changes cannot steal time or privilege from mission execution.
- Mission domain: real-time workloads, deterministic pipelines, critical data paths.
- Management domain: secure boot enforcement, signed updates, telemetry collection, log packaging, and controlled configuration changes.
- Safety monitor domain: watchdog/health verdicts and safe-state requests; operates independently enough to remain credible when the mission domain misbehaves.
Common pitfalls (with symptoms and evidence)
- Everything on CPU → determinism cannot be proven: symptoms include tail-latency spikes and missed deadlines under load; evidence includes latency histograms, queue depth growth, and correlated drop events.
- Management mixed into mission → upgrades disturb operations: symptoms include anomalies during updates (retraining bursts, storage hiccups); evidence is a timeline overlap between maintenance events and mission faults.
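The “timeline overlap” evidence named above can be computed directly: intersect maintenance windows with mission fault timestamps. A minimal sketch with made-up epoch-second values and hypothetical field names:

```python
# Sketch: correlate maintenance windows with mission fault timestamps to
# produce the "timeline overlap" evidence described above. Times are
# illustrative epoch seconds.
def faults_during_maintenance(maintenance_windows, fault_times):
    """Return fault timestamps that fall inside any maintenance window."""
    return [t for t in fault_times
            if any(start <= t <= end for start, end in maintenance_windows)]

windows = [(1000, 1060), (5000, 5120)]       # signed-update sessions
faults = [950, 1010, 1055, 3000, 5100]       # mission anomaly timestamps
overlap = faults_during_maintenance(windows, faults)
print(overlap)  # [1010, 1055, 5100] -> upgrades are disturbing operations
```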
H2-4 · PCIe Fabric Topology: Switches, Segments, Redundancy
Goal: treat PCIe as a designable fabric, not a wire. The mission computer must scale endpoints (NVMe, accelerators, high-speed I/O) while preventing single-device faults from collapsing the entire system into repeated retraining and resets.
Topology templates (choose by failure modes, not by aesthetics)
- Single-root, mostly direct: simplest; acceptable when endpoints are few and the physical channel is short and stable.
- Single-root with a switch segment: common for scalable mission computers; creates a clean subtree for endpoints and improves manageability.
- Dual-fabric / two switch domains (concept level): used when mission objectives require continued operation after a fabric-side fault. The key is defining what stays alive in degraded mode.
Switch selection criteria (engineering-useful, protocol-minimal)
- Ports and expansion headroom: cover today’s endpoints plus growth; avoid “forced daisy chains” that deepen the topology.
- Non-blocking bandwidth: critical flows should not be starved during simultaneous record + processing + export operations.
- Isolation capability (subtree containment): faults and traffic should be containable within a branch, instead of rippling across unrelated endpoints.
- Error reporting + manageability: link training events, retrain counts, error counters, and (when available) temperature/power telemetry should be readable.
Redundancy strategy (define the boundary and the proof)
- Boundary: identify which endpoints are “must survive” (e.g., critical record path or core compute) versus “optional in degraded mode.”
- Containment: ensure that a failing endpoint causes a local recovery (branch isolation), not system-wide churn.
- Proof: create logs that show: fault detected → branch isolated → system stabilized → degraded-mode entered (if required).
Observability: the minimum evidence set to log
- Link training timeline: up/down events, negotiated speed changes, and retrain bursts.
- Error counters: correctable vs uncorrectable trends, plus time correlation to mission anomalies.
- Thermal correlation (optional): hotspots that correlate with retraining or storage throttling.
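The “retrain bursts” in this evidence set can be derived from the raw link-event timeline by clustering events in time. A sketch; the window and count thresholds are illustrative assumptions, not qualified limits:

```python
# Sketch: flag "retrain bursts" -- clusters of link-retrain events packed
# into a short window. Thresholds are illustrative assumptions.
def find_retrain_bursts(event_times_s, window_s=10.0, min_events=3):
    """Return (start_time, count) for each burst of retrains."""
    bursts, i, events = [], 0, sorted(event_times_s)
    while i < len(events):
        j = i
        while j < len(events) and events[j] - events[i] <= window_s:
            j += 1
        if j - i >= min_events:
            bursts.append((events[i], j - i))
            i = j              # skip past this burst
        else:
            i += 1
    return bursts

# Isolated retrains are normal wear; tight clusters are a margin problem.
print(find_retrain_bursts([5.0, 100.0, 101.0, 103.0, 108.0, 500.0]))  # [(100.0, 4)]
```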
H2-5 · Retimers & Signal Integrity: “Boots” Is Not “Flight-Ready”
Goal: explain why retimers/redrivers become an engineering requirement in avionics mission computers, and how to validate the PCIe channel under temperature, vibration, and long-duration stress—not just a single successful boot.
When a retimer becomes necessary (trigger conditions)
- Channel loss exceeds margin: long traces, multiple connectors, and backplane insertion loss reduce eye opening.
- Cross-board / backplane crossings: each connector/backplane segment consumes budget and increases sensitivity to environment.
- Higher generation targets: moving to faster link speeds tightens timing and loss margins, increasing “works sometimes” risk.
- Harsh corners: temperature drift, vibration micro-motion, and EMI can turn a marginal channel into frequent retrains.
Selection criteria (engineering-level, not a protocol lesson)
- Generation support: the part should support the intended link speed with adequate margin in worst-case channels.
- Equalization capability: EQ range should cover the channel loss envelope with stable training across corners.
- Latency/jitter impact: insertion should not break end-to-end latency budgets or amplify jitter into unstable links.
- Manageability: status and counters should be readable to build an evidence trail (training, errors, retrain bursts).
SI/PI coupling: how power noise turns into link instability
| Observed symptom | Likely system-level cause | Evidence to capture |
|---|---|---|
| Retrain bursts during load steps | Power rail noise modulates clock/PLL behavior and reduces timing margin. | Retrain timestamp correlation to load/thermal events; error counters rising with activity. |
| Speed downshift at hot corner | Eye margin collapses with temperature (loss increases, jitter grows). | Negotiated speed history vs temperature; stable at cold but degraded at hot. |
| Intermittent endpoint drops | Marginal channel + connector micro-motion under vibration. | Vibration window retrain statistics; endpoint link-down events with tight clustering. |
Validation checklist (flight-ready evidence)
- Eye/BER concept checks: verify the channel margin on representative worst-case paths (not only “best lane”).
- Temperature corners: run sustained load at cold and hot corners; record retrain/error statistics.
- Vibration/handling stress: measure retrains per hour under vibration windows (statistics, not anecdotes).
- Long-duration soak: observe error counters and link stability over time; capture “bursty” failure patterns.
- Evidence logging: store training events, negotiated speed changes, retrain counts, and correlation timestamps.
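“Statistics, not anecdotes” can be made concrete by reducing each stress corner to a retrains-per-hour figure with a pass threshold. A sketch; the corner names and the threshold value are hypothetical:

```python
# Sketch: reduce corner testing to "retrains per hour" statistics, as the
# checklist requires. Corner names and the threshold are assumptions.
def retrain_rate_report(corner_logs, max_per_hour=2.0):
    """corner_logs: {corner_name: (retrain_count, duration_hours)}"""
    report = {}
    for corner, (count, hours) in corner_logs.items():
        rate = count / hours
        report[corner] = {"per_hour": rate, "pass": rate <= max_per_hour}
    return report

logs = {"cold_-40C": (1, 4.0), "hot_+71C": (12, 4.0), "vib_window": (3, 2.0)}
print(retrain_rate_report(logs))
# hot corner fails at 3.0/h -> channel margin degrades with temperature
```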
H2-6 · Memory & Storage: ECC + NVMe Control as a Data-Integrity Loop
Goal: make “data integrity” measurable and operational: detect errors, trend them, trigger defined actions, and preserve auditable evidence. This includes ECC memory health signals and NVMe behavior (consistency and tail latency) under real mission load.
ECC memory: why it matters and how it becomes health telemetry
- Engineering value: ECC reduces silent corruption risk and provides measurable health signals for long-duration operation.
- Correctable vs uncorrectable: correctable events (CE) are trend indicators; uncorrectable events (UE) trigger defined fault handling and evidence capture.
- Scrub (concept): periodic maintenance reduces accumulated risk; success/fail counts become part of health records.
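Treating CE counts as a trend rather than per-event alarms can be sketched as a baseline-versus-recent rate comparison. The window sizes and acceleration factor below are illustrative, not a qualified policy:

```python
# Sketch: treat ECC correctable errors (CE) as a trend, not an alarm per
# event. Flags maintenance when the recent CE rate accelerates past a
# multiple of the long-run baseline. All thresholds are illustrative.
def ce_trend_verdict(daily_ce_counts, recent_days=3, accel_factor=3.0):
    baseline = sum(daily_ce_counts[:-recent_days]) / max(1, len(daily_ce_counts) - recent_days)
    recent = sum(daily_ce_counts[-recent_days:]) / recent_days
    if baseline == 0:
        return "schedule-maintenance" if recent > 0 else "healthy"
    return "schedule-maintenance" if recent >= accel_factor * baseline else "healthy"

print(ce_trend_verdict([1, 0, 2, 1, 1, 1, 1]))    # steady trickle -> healthy
print(ce_trend_verdict([1, 0, 2, 1, 9, 14, 22]))  # accelerating -> maintenance
```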
NVMe control: the practical metrics that matter
- Write consistency: records and evidence should not be left in an ambiguous “half-written” state after resets or power events.
- Tail latency (QoS): bulk recording/export must not destroy mission determinism; evaluate p99/p999 write behavior under sustained load.
- Power-fail marker (concept): when a power-fail indicator is asserted, the system should perform controlled flush/commit steps and store an auditable shutdown record.
Data tiering: separate what must be preserved from what can be large
- System area: bootable images and controlled configurations (signed updates only).
- Evidence / logs area: event logs, BIT packages, version audits, CE/UE trends (prioritize integrity and accessibility).
- Records area: high-throughput mission records (allowed to follow defined degraded rules, but must remain explainable).
Common pitfall: records and mission tasks share the same queue
- Symptom: export or heavy recording causes unpredictable tail latency and deadline misses.
- Root cause: queue contention and background storage behavior introduce non-deterministic stalls.
- Fix: partition/namespaces (concept), protect the evidence/logs area, and enforce bandwidth/priority separation.
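The priority-separation fix can be sketched as strict-priority queues: evidence and metadata writes always drain before bulk records. The queue names and classes here are hypothetical, not any real NVMe driver API:

```python
# Sketch: enforce "protect the evidence/logs area" by scheduling storage
# writes from separate queues with strict priority for evidence and
# metadata over bulk records. Queue names are illustrative.
import collections

queues = {"evidence": collections.deque(), "metadata": collections.deque(),
          "records": collections.deque()}
PRIORITY = ["evidence", "metadata", "records"]  # strict order

def submit(cls, payload):
    queues[cls].append(payload)

def next_write():
    """Pick the next write: bulk records only drain when higher tiers are empty."""
    for cls in PRIORITY:
        if queues[cls]:
            return cls, queues[cls].popleft()
    return None

submit("records", "chunk-0"); submit("evidence", "event-17"); submit("records", "chunk-1")
print(next_write())  # ('evidence', 'event-17') -- jumps ahead of bulk data
print(next_write())  # ('records', 'chunk-0')
```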
H2-7 · Secure Boot with HSM: Chain of Trust + Maintainable Updates
Goal: treat secure boot as an engineering chain: clear responsibilities, observable failure modes, and verifiable evidence. The objective is a boot-and-update flow that remains secure while staying serviceable in the field.
Typical chain of trust (what is verified, and when)
- ROM (Root-of-Trust start): validates the first-stage boot component before executing it.
- Bootloader: validates the OS or hypervisor image and the platform policy bundle.
- OS / Hypervisor: validates mission applications and enforces runtime policy gates.
- Application stage: loads only approved modules and exports verifiable version/state into logs.
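The staged handoff above can be sketched as a verify-before-execute loop. Real chains verify signatures anchored in the root of trust; a bare SHA-256 digest comparison stands in for that here, and the stage and manifest names are illustrative:

```python
# Sketch of the staged handoff: each stage refuses to execute the next
# image unless its digest matches the manifest. A plain SHA-256 compare
# stands in for RoT-anchored signature verification.
import hashlib

def digest(image: bytes) -> str:
    return hashlib.sha256(image).hexdigest()

def verify_and_boot(manifest, images):
    """manifest: stage -> expected digest; images: stage -> bytes. Returns boot log."""
    log = []
    for stage in ["bootloader", "os", "application"]:
        if digest(images[stage]) != manifest[stage]:
            log.append((stage, "REJECT"))     # enforced gate: stop the chain
            return log
        log.append((stage, "VERIFIED"))
    return log

good = {s: bytes(s, "ascii") for s in ["bootloader", "os", "application"]}
manifest = {s: digest(b) for s, b in good.items()}
print(verify_and_boot(manifest, good))
tampered = dict(good, os=b"patched-os")
print(verify_and_boot(manifest, tampered))  # chain stops at the os stage
```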
HSM / TPM / Secure Element: the practical responsibilities
- Protected key usage: cryptographic keys are used inside a protected boundary (avoid key export paths).
- Version counters / monotonic state: provide the anchor for rollback protection (anti-downgrade).
- Attestable measurements (optional concept): record “what booted” as evidence that can be audited later.
Measured boot vs secure boot (boundary definition)
- Secure boot: blocks execution if verification fails (enforced gate).
- Measured boot: allows boot but records measurements for audit (evidence-first).
- Engineering rule: use secure boot where unsafe code must never run; use measured boot to strengthen auditability and forensics.
Key provisioning and update serviceability (factory → field)
| Process step | Control objective | Evidence to record |
|---|---|---|
| Factory key provisioning | Prevent key leakage and batch-wide exposure; keep keys protected and traceable. | Device identity + key IDs (not raw keys), operator/workstation ID, timestamp, pass/fail. |
| Policy-gated updates | Only signed images may install; enforce update windows and policy constraints. | From-version → to-version, signature status, policy check result, update outcome. |
| Rollback protection (concept) | Block downgrades to vulnerable older images even if they are signed. | Monotonic counter/value used, downgrade attempt detected, reason code. |
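The rollback-protection row can be sketched as a monotonic-version gate: even a correctly signed image is rejected when its version is below the counter anchored in RoT state. The class name and reason codes are illustrative:

```python
# Sketch: rollback protection anchored in a monotonic version value.
# An update is admitted only if signed AND not below the counter.
# Version numbers and reason codes are illustrative.
class RotState:
    def __init__(self, min_version: int):
        self.min_version = min_version        # monotonic: never decreases

    def check_update(self, signed_ok: bool, to_version: int) -> str:
        if not signed_ok:
            return "REJECT:bad-signature"
        if to_version < self.min_version:
            return "REJECT:rollback-blocked"   # even a valid signature fails
        self.min_version = max(self.min_version, to_version)
        return "ACCEPT"

rot = RotState(min_version=7)
print(rot.check_update(signed_ok=True, to_version=8))   # ACCEPT
print(rot.check_update(signed_ok=True, to_version=6))   # REJECT:rollback-blocked
print(rot.check_update(signed_ok=False, to_version=9))  # REJECT:bad-signature
```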
Common pitfalls (symptoms and fixes)
- Signed updates without rollback control: allows downgrade windows. Fix by enforcing monotonic version rules anchored in RoT state and making failures auditable.
- Unclear factory key handling: risks “same key across many units.” Fix with per-unit identity materials, batch isolation, least-privilege access, and complete provisioning audits.
H2-8 · Fault Tolerance & Health Monitoring: From WDT to Evidence Chains
Goal: make reliability operational: observable signals become deterministic verdicts, actions are executed in defined tiers, and every action leaves an auditable evidence record for maintenance and post-event review.
What to monitor (layered signals)
- Fabric/link layer: link up/down, retrain bursts, negotiated speed changes, error counter trends.
- Memory layer: ECC correctable (CE) trends and uncorrectable (UE) events.
- Thermal & power events: temperature limit flags, reset causes, power-fail markers (event-level only).
- Storage health (concept): NVMe health indicators and error trends, correlated to workload/temperature.
Verdicts: turn signals into actions
| Signal | Verdict (engineering) | Action tier |
|---|---|---|
| Retrain burst rate rises | Channel margin degraded; risk of endpoint drops. | Degrade bandwidth → isolate branch if persistent. |
| ECC CE trend accelerates | Health risk increasing; maintenance window needed. | Raise monitoring level → schedule service action; preserve evidence snapshot. |
| ECC UE event | Integrity compromised. | Isolate/recover path → safe-state if required; store evidence pack with reason codes. |
| Thermal limit sustained | Overstress risk; stability may degrade. | Degrade compute/record load → safe-state if not recoverable. |
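The verdict table can be implemented as a dispatch rule, so every signal maps to a fixed verdict and action tier instead of ad-hoc handling. A sketch whose wording mirrors the table; real thresholds and actions come from program requirements:

```python
# Sketch: the verdict table as a dispatch rule. Signal names, verdicts,
# and tiers mirror the table above; all strings are illustrative.
RULES = {
    "retrain_burst":     ("channel margin degraded", "degrade->isolate"),
    "ecc_ce_trend":      ("health risk increasing",  "monitor->schedule-service"),
    "ecc_ue":            ("integrity compromised",   "isolate->safe-state"),
    "thermal_sustained": ("overstress risk",         "degrade->safe-state"),
}

def dispatch(signal: str):
    verdict, tier = RULES[signal]
    # In a real system, every dispatch is also an evidence event.
    return {"signal": signal, "verdict": verdict, "action_tier": tier}

print(dispatch("ecc_ue"))
```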
Action tiers (degrade first, reboot last)
- Degrade: limit throughput, pause non-critical recording/export, reduce concurrency.
- Isolate: quarantine a device/branch to prevent cascaded failures.
- Recover: retrain links or restart a subsystem if isolation is insufficient.
- Switch redundancy (concept): move to redundant paths/domains and keep mission in a defined degraded mode.
BIT/BIST in the mission computer (MC-only)
- PBIT (power-on): verify key links, storage tiers, and evidence-log readiness before mission start.
- CBIT (continuous): monitor counters/trends with minimal mission impact and produce periodic health records.
- MBIT (maintenance): deeper diagnostics during scheduled service windows; export evidence packs for review.
H2-9 · Power Holdup & Graceful Shutdown: “Write Completes, Evidence Remains”
Goal: define what the mission computer requires during a power-fail window and how to verify it. The focus is a controlled sequence (detect → flush → seal evidence → safe state), not holdup hardware details.
Power-fail detect: what triggers the controlled shutdown flow
- Power-fail indicator: a hardware or management-domain event that asserts when input power is collapsing.
- Interrupt-to-action path: the signal must reach a domain capable of pausing non-critical traffic and starting the flush sequence.
- Timestamped marker: record when the power-fail was detected so post-event analysis can reconstruct timing.
Holdup budgeting: convert “milliseconds needed” into a budget table
Budget the usable window from power-fail detect to the point where storage writes are no longer reliable. The core requirement is an auditable power-times-time (P × t) budget tied to a defined flush sequence.
| Load group | Priority | Budget item | Engineering intent |
|---|---|---|---|
| Evidence path (logs + markers) | Must | W × ms | Seal an evidence pack that explains the event and the system state. |
| Metadata flush (consistency anchors) | Must | W × ms | Guarantee boot explainability and avoid ambiguous “half-written” states. |
| Bulk record traffic | Stop | 0 | Pause high-throughput streams so the window is reserved for integrity actions. |
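The P × t budget reduces to simple arithmetic: watts times milliseconds gives millijoules, summed over the “must” rows and compared against the usable holdup energy with a margin. A sketch; every number is a placeholder, like the blanks in the table:

```python
# Sketch of the P x t budget: sum the "must" items (power in W times time
# in ms gives millijoules) and compare against usable holdup energy.
# Numbers are placeholders, like the blanks in the table above.
def holdup_ok(budget_items, holdup_mj, margin=1.25):
    """budget_items: [(name, watts, ms)]; require margin x need <= holdup."""
    need_mj = sum(w * ms for _, w, ms in budget_items)  # W * ms = mJ
    return {"need_mj": need_mj, "have_mj": holdup_mj,
            "pass": need_mj * margin <= holdup_mj}

items = [("seal evidence pack", 6.0, 12.0),   # 72 mJ
         ("flush metadata",     9.0, 20.0)]   # 180 mJ
print(holdup_ok(items, holdup_mj=400.0))      # need 252 mJ, 315 with margin
```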
Data consistency strategy (engineering order of operations)
- Stop non-critical writes first: halt bulk recording/export to prevent queue contention.
- Flush metadata with a bounded time: commit the minimum set needed for explainable restart.
- Seal evidence before optional records: preserve the “what happened” timeline even if some record data is dropped.
- Enter safe state: reduce activity to minimize further corruption risk as voltage continues to fall.
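The order of operations above can be sketched as a fixed sequence run against the holdup window: if time runs out, the system jumps straight to safe state so whatever evidence was already sealed stays consistent. Step durations below are illustrative:

```python
# Sketch: the flush order as a fixed sequence with a time budget. If the
# window is exhausted, skip to safe state. Durations are illustrative.
def run_shutdown(window_ms: float, step_costs_ms: dict):
    order = ["stop-noncritical-writes", "flush-metadata",
             "seal-evidence-pack", "enter-safe-state"]
    log, elapsed = [], 0.0
    for step in order[:-1]:
        if elapsed + step_costs_ms[step] > window_ms:
            log.append((step, "SKIPPED:window-exhausted"))
            break
        elapsed += step_costs_ms[step]
        log.append((step, "DONE"))
    log.append(("enter-safe-state", "DONE"))   # always reachable
    return log

costs = {"stop-noncritical-writes": 1.0, "flush-metadata": 8.0,
         "seal-evidence-pack": 5.0}
print(run_shutdown(window_ms=20.0, step_costs_ms=costs))
print(run_shutdown(window_ms=6.0,  step_costs_ms=costs))  # metadata skipped
```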
Verification: power-cut injection tests (matrix + pass criteria)
- Injection phases: idle, high write load, mixed mission + record load, and during maintenance export.
- Corners: cold and hot temperature corners; include representative vibration windows when applicable.
- Pass criteria: evidence pack present and consistent; restart is explainable; failures produce clear reason codes.
- Evidence output: each injection produces a power-fail record: time, version, flush result, and shutdown stage.
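The injection matrix can be expanded mechanically into per-cell test IDs so pass criteria are tracked for every phase-by-corner combination. The naming convention and repeat count below are assumptions:

```python
# Sketch: expand the injection matrix (phase x corner x repeat) into
# concrete test IDs so pass criteria are tracked per cell.
# Names and repeat count are illustrative.
import itertools

PHASES = ["idle", "high-write", "mixed-mission", "maintenance-export"]
CORNERS = ["cold", "hot"]

def injection_matrix(repeats=3):
    """One test ID per (phase, corner, repeat) power-cut injection."""
    return [f"pwrcut_{p}_{c}_r{r}"
            for p, c, r in itertools.product(PHASES, CORNERS, range(1, repeats + 1))]

matrix = injection_matrix()
print(len(matrix))   # 4 phases x 2 corners x 3 repeats = 24 injections
print(matrix[0])     # pwrcut_idle_cold_r1
```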
H2-10 · Thermal, EMI, and Ruggedization: Full-Load Stability You Can Reproduce
Goal: translate avionics environmental reality into board-level and chassis-level constraints for a mission computer. The focus is sustained full-load behavior, controlled degradation, and reproducible evidence for intermittent faults.
Thermal path: board hotspots to chassis conduction
- Primary path: device → PCB → thermal interface → conduction plate / cold wall → chassis.
- Typical hotspots: PCIe switch, retimers, NVMe devices, and high-load compute zones.
- Engineering requirement: define allowable hotspot temperatures and a controlled derating policy when limits are approached.
- Evidence: log temperature peaks with timestamps and correlate them to link stability and throughput behavior.
EMI/EMC (system-level, minimal theory)
- Return-path continuity: high-speed SERDES stability depends on clean reference/return paths across connectors and shields.
- Symptom-based evidence: retrain bursts, rising error counters, and speed shifts correlated to specific operating modes or test exposure.
- Action intent: isolate noisy domains, keep return paths continuous, and maintain consistent shield/ground boundaries at the chassis level.
Vibration/shock: intermittent faults that must be made repeatable
- Connector micro-motion: vibration can create intermittent impedance changes that trigger retrains and endpoint drops.
- Correlation requirement: record retrain and link-down events during defined vibration windows and compute “events per hour.”
- Fix validation: changes are accepted only when statistics improve, not when the issue “seems gone.”
Rugged validation metrics (what “done” looks like)
- Thermal soak: full-load stability without uncontrolled resets; controlled derating is logged and explainable.
- EMI exposure: no unexplained link instability; counters and speed states remain within expected bounds.
- Vibration window: retrain/event rate below threshold; evidence packs generated for any excursions.
H2-11 · Validation & Production Checklist (What “Done” Means)
This section defines release acceptance as a checkable contract for engineering, manufacturing, and field teams. Each item includes purpose, method, pass criteria, stored evidence, and example part numbers to align with procurement.
Layer 1 — R&D Validation (prove stability under stress)
Focus on system risks that must be closed before manufacturing: fabric stability, storage integrity, secure boot enforcement, and health evidence chains.
| Check | Check item | Purpose | Method | Pass criteria | Evidence to store | Related parts (examples) |
|---|---|---|---|---|---|---|
| ☐ | PCIe stability under corners (fabric / retrain) | Prevent intermittent link drops and cascaded endpoint failures. | Stress workload + temperature corners; record retrain bursts, link-down events, negotiated speed changes. | Retrain/hour below threshold; no unexplained link-down; any degradation is logged and explainable. | Stress report + counter snapshots + corner coverage proof. | PCIe switch: Broadcom/PLX PEX97xx family; retimer: TI DS160PR810 / DS160PR412 (family examples) |
| ☐ | NVMe consistency + power-cut injection (storage / integrity) | Ensure “boot is explainable” and evidence survives power loss scenarios. | Power-cut injection matrix (phase × load × temp); verify metadata flush ordering and evidence sealing. | Evidence pack present and consistent; key metadata consistent; failures return reason codes. | Injection record table + sample evidence pack + reason-code map. | Industrial NVMe SSD: Micron 7450 / Kioxia CD6 / Samsung PM9A3 (examples) |
| ☐ | ECC behavior and health thresholds (memory / evidence) | Turn CE trends into maintenance signals; treat UE as integrity events with traceable actions. | Fault injection (software triggers / controlled stress); confirm CE/UE counters are captured and exported. | CE trend alarms are generated; UE triggers defined isolation/recovery action + evidence pack. | Counter snapshots + action logs + exported field package sample. | ECC DRAM: Micron ECC DDR4/DDR5 (family placeholder); supervisor/monitor: TI TPS38xx / ADI LTC29xx (family examples) |
| ☐ | Secure boot chain rehearsal (security / policy) | Prove “verify + record” at each stage, and enforce policy gates with rollback blocking. | Inject invalid signature, wrong version, and policy-disallowed package; validate reject paths and logs. | Correct rejection with stage + reason code; audit log includes version and policy identifiers. | Reject logs + policy version snapshot + boot-stage evidence markers. | TPM/HSM/SE: Infineon OPTIGA TPM (e.g., SLB 9670) / Microchip ATECC608B (examples) |
| ☐ | Health monitoring loop closes (BIT/BIST / ops) | Ensure faults become verdicts, actions, and evidence packs retrievable for review. | Run PBIT/CBIT/MBIT flows; verify export path and Trace ID binding. | BIT results are consistent; exports succeed; any action (degrade/isolate/recover) produces evidence. | BIT logs + export file naming convention + Trace ID rule. | Watchdog/safety monitor (placeholder family): NXP S32 safety MCU class; TI watchdog supervisors (examples) |
Notes: part numbers above are example BOM anchors for procurement alignment and do not mandate a single vendor; final selections must match program qualification rules.
Layer 2 — Production Test (repeatable per-unit release proof)
Production items prioritize identity correctness, interface sanity, and baseline health records that enable field comparison.
| Check | Check item | Purpose | Method | Pass criteria | Evidence to store | Related parts (examples) |
|---|---|---|---|---|---|---|
| ☐ | Serial / certificate injection verification (provisioning) | Guarantee per-unit identity and traceability; avoid batch-wide exposure risks. | Inject identifiers, then immediately verify chain/IDs (no secret export); bind to Trace ID. | ID match; certificate/key IDs present; audit fields complete (station/operator/time/result). | Factory record row: serial + device P/N + key/cert IDs (non-secret) + station ID + timestamp. | TPM/SE families as qualified (examples: Infineon OPTIGA TPM, Microchip ATECC608B) |
| ☐ | Interface self-test (bring-up sanity) (I/O) | Catch assembly issues and gross faults early with fast, repeatable tests. | Power-on self-test script; endpoint enumeration; minimal loopback/signature checks where applicable. | Endpoints present; counters at baseline; no abnormal resets; self-test report generated. | Auto-generated production self-test report attached to Trace ID. | PCIe switch/retimer qualified families (examples) |
| ☐ | Storage health baseline snapshot (NVMe) | Create an out-of-box baseline used to detect degradation over life. | Record firmware version, health summary, and initial error counters; store as baseline record. | Baseline record saved; counters within expected new-unit range. | Baseline file/row: drive ID + firmware + health summary + timestamp + Trace ID. | Industrial NVMe SSD (examples: Micron 7450 / Kioxia CD6 / Samsung PM9A3) |
Layer 3 — Field Self-Test (fast go/no-go + evidence export)
Field checks confirm the system is explainable and supportable: BIT status, exportable logs, and auditable version/policy state.
| Check | Check item | Purpose | Method | Pass criteria | Evidence to store | Related parts (examples) |
|---|---|---|---|---|---|---|
| ☐ | BIT status summary export (PBIT/CBIT) | Quickly determine whether the unit is fit for mission or requires maintenance. | Run BIT summary command; export result with Trace ID binding. | BIT passes or shows controlled, explainable degraded mode; export completes. | BIT summary file + timestamp + unit serial + software/policy version. | — |
| ☐ | Event log / evidence pack export (forensics) | Enable post-event reconstruction without lab access. | Export evidence pack; verify integrity (length/hash) and naming convention. | Package is readable and complete; includes counters + recent events + reason codes. | Export package + checksum + export tool version. | — |
| ☐ | Version & policy audit snapshot (audit) | Ensure upgrade state is known and rollback is not silently allowed. | Read software version, policy version, and last-update record summary. | Versions match expected release; policy gate status valid; anomalies logged with reason codes. | Audit snapshot attached to Trace ID; last-update summary record. | TPM/SE presence as program requires (examples) |
Release Gate (PASS/FAIL with Trace ID)
All three layers feed a single release decision. PASS creates a release ID; FAIL creates a hold record that remains traceable.
H2-12 · FAQs (Mission Computer)
These FAQs focus on mission-computer engineering decisions: boundary definition, PCIe fabric design, integrity evidence, controlled power-fail behavior, maintainable secure boot, and rugged verification.
1) What is the boundary between a Mission Computer and a Flight Control Computer (FCC)?
A mission computer is optimized for mission applications, sensor fusion, recording, and maintainable upgrades, while an FCC prioritizes closed-loop control determinism and safety certification. The key is not the CPU type, but who owns the safety-critical control authority and its verification evidence.
- Mission Computer: high-throughput compute + storage + evidence logs for mission functions and post-event analysis.
- FCC boundary: the control loop owner and the safety case scope (kept as an interface requirement here).
- Interfaces: the mission computer consumes/produces data and health status; it does not define the flight-control loop behavior.
- Evidence focus: mission logs, BIT results, and traceable version/policy records.
2) Why is PCIe more like a “network” than a “bus” in mission computing?
In mission computers, PCIe behaves like a designed fabric with topology, isolation boundaries, and observability metrics—not a simple point-to-point cable. Stability depends on how paths are routed, how faults are contained, and how retrains and speed changes are measured over stress corners.
- Topology matters: root-to-switch depth, fanout, and endpoint grouping shape performance and fault domains.
- Isolation matters: faults should be contained to a device/path instead of cascading across the fabric.
- Observability matters: retrain rate, link-down events, and negotiated speed changes become “fabric health” signals.
- Verification: prove stability under temperature and workload stress, not just “enumerates and runs.”
3) When is a PCIe switch required instead of direct connections?
A PCIe switch becomes necessary when endpoint count, bandwidth aggregation, maintainability, or fault isolation can’t be met with direct lanes. The decision should be justified by a port/bandwidth budget and a fault-domain map, not by “it boots today.”
- Scaling: many endpoints (NVMe, capture cards, FPGA, NICs) exceed root-complex lane/port constraints.
- Aggregation: multiple sources must share an upstream link while preserving deterministic priorities.
- Isolation: a misbehaving endpoint should not stall unrelated devices; topology should define fault boundaries.
- Serviceability: modular replaceable units benefit from a stable fabric interface behind a switch.
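A port/lane budget of the kind described above can be checked mechanically; the endpoint list and root-complex limits below are illustrative assumptions.

```python
# Hypothetical lane/port budget used to justify a PCIe switch: a switch
# is warranted when total endpoint lanes or endpoint count exceed what
# the root complex offers directly.

def needs_switch(endpoints, root_lanes, root_ports):
    lanes_needed = sum(lanes for _, lanes in endpoints)
    return lanes_needed > root_lanes or len(endpoints) > root_ports

endpoints = [("nvme0", 4), ("nvme1", 4), ("fpga", 8), ("nic", 4), ("capture", 4)]
# 24 lanes across 5 endpoints vs. a 16-lane / 3-port root complex.
print(needs_switch(endpoints, root_lanes=16, root_ports=3))  # True
```

A real budget would also weigh aggregated bandwidth per upstream link and the fault-domain map, but even this minimal check replaces "it boots today" with a number.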
4) How do you choose between a retimer and a redriver, and what is the most common mistake?
A redriver mainly compensates for channel loss, while a retimer re-establishes signal timing to regain margin across longer, lossier paths. The most common mistake is validating only at room temperature with short runs, then discovering retrains and dropouts under thermal/vibration corners.
- Use a redriver when loss is moderate and the main goal is equalization without full timing recovery.
- Use a retimer when link margin is insufficient across backplanes/connectors/long routes and timing must be restored.
- Watch for latency and management needs; treat them as system requirements, not afterthoughts.
- Verification: track error/retrain statistics across temperature and vibration windows.
5) For intermittent retrains or “drive disappears,” which evidence counters should be checked first?
Start by classifying whether the symptom is fabric-level, storage-level, or system-level, then correlate with timestamps. The fastest path is: link events (retrain/link-down/speed shifts), storage health snapshots, and system markers (reset cause, temperature peaks, power-fail flags).
- Fabric evidence: retrain bursts/hour, link-down count, negotiated speed changes with timestamps.
- Storage evidence: health summary snapshots and error summaries taken before/after the incident.
- System evidence: reset reason, thermal peak markers, and any power-fail detect events.
- Rule: if the timeline cannot be reconstructed, evidence is insufficient even if the fault is rare.
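The classification step above can be sketched as a first-pass triage; the counter names and thresholds are assumptions for illustration, not a real tool's schema.

```python
# Illustrative triage for "intermittent retrain / drive disappears":
# decide which evidence layer to dig into first.

def classify_incident(ev: dict) -> str:
    # System-level causes (power fail, brownout reset) can explain the
    # fabric and storage symptoms, so rule them out first.
    if ev.get("power_fail_events", 0) or ev.get("reset_cause") == "brownout":
        return "system"
    if ev.get("link_down_count", 0) or ev.get("retrains_per_hour", 0) > 10:
        return "fabric"
    if ev.get("media_errors_delta", 0):
        return "storage"
    return "inconclusive"

print(classify_incident({"retrains_per_hour": 40}))                 # fabric
print(classify_incident({"reset_cause": "brownout",
                         "link_down_count": 2}))                    # system
```

Note the ordering encodes the rule of thumb from the list above: a system-level event is checked first because it can produce the other two symptom classes.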
6) When should ECC correction counts trigger degradation or maintenance?
ECC “corrected errors” are not harmless if they trend upward or cluster with stress conditions. A practical policy is to treat corrected-error rate as a trend alarm (maintenance signal) and uncorrectable errors as integrity events that must trigger controlled isolation/recovery with evidence.
- Trend-based: sustained increases in corrected errors should raise a maintenance flag.
- Event-based: any uncorrectable error is an integrity incident requiring deterministic action and traceable logs.
- Context: thresholds depend on mission criticality and whether redundancy can absorb degradation.
- Evidence: store counts, timestamps, temperature context, and the chosen action (degrade/isolate/recover).
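The trend/event split above can be expressed as a small policy function; the 2× trend threshold is an illustrative assumption, since real thresholds depend on mission criticality and redundancy.

```python
# Sketch of the ECC policy: uncorrectable errors are integrity events;
# corrected errors raise a maintenance flag only when they trend upward.

def ecc_policy(corrected_history, uncorrectable_count, trend_threshold=2.0):
    """corrected_history: per-interval corrected-error counts, oldest first."""
    if uncorrectable_count > 0:
        return "integrity_event"   # deterministic isolate/recover + evidence
    if len(corrected_history) >= 2:
        recent, baseline = corrected_history[-1], corrected_history[0]
        if baseline and recent / baseline >= trend_threshold:
            return "maintenance_flag"
    return "ok"

print(ecc_policy([1, 2, 5], uncorrectable_count=0))  # maintenance_flag
print(ecc_policy([3, 3, 3], uncorrectable_count=0))  # ok
```

Whatever the thresholds, the action taken (degrade/isolate/recover) should be stored alongside the counts and temperature context so the decision is reconstructable later.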
7) What are the most common NVMe data corruption paths during power loss?
The most common corruption paths involve unfinished write queues and metadata updates that do not complete in a controlled order. Mission computers should prioritize “explainability” by flushing critical metadata and sealing an evidence pack before attempting to preserve bulk records.
- Queue not drained: outstanding writes never reach a durable state before voltage falls too far.
- Metadata ordering: a partial metadata update can make restart ambiguous even if data blocks exist.
- Priority inversion: bulk record traffic steals the limited window needed for integrity actions.
- Verification: power-cut injection tests must confirm evidence presence and explainable restart.
8) How does a power-holdup requirement turn from “milliseconds” into a verifiable budget?
Define the usable window from power-fail detect to “writes no longer reliable,” then budget that window by priority groups using P×t. A verifiable budget includes a load list (must/stop), a flush sequence, and an injection test matrix that proves the sequence completes.
- Window definition: start at power-fail detect; end at the minimum voltage for reliable writes.
- Priority groups: stop bulk records, flush critical metadata, seal evidence, then enter safe state.
- Budget format: per-group power × time with a clear “must complete” list.
- Verification: injection tests across phase/load/temperature with pass criteria tied to evidence and explainability.
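The P×t budget format above can be made concrete with a worked example; the 30 ms window and the per-group power/time figures are illustrative assumptions, not program requirements.

```python
# Worked holdup budget: check that the "must complete" priority groups
# fit inside the usable window, and total the energy (1 W * 1 ms = 1 mJ).

HOLDUP_MS = 30.0  # power-fail detect -> minimum voltage for reliable writes

GROUPS = [
    # (name, power_W, time_ms) -- executed in this order
    ("stop_bulk_records",  2.0,  1.0),
    ("flush_metadata",     5.0,  8.0),
    ("seal_evidence_pack", 4.0, 10.0),
    ("enter_safe_state",   1.0,  2.0),
]

def budget(groups, window_ms):
    total_ms = sum(t for _, _, t in groups)
    energy_mj = sum(p * t for _, p, t in groups)
    return total_ms <= window_ms, total_ms, energy_mj

ok, total_ms, energy_mj = budget(GROUPS, HOLDUP_MS)
print(ok, total_ms, energy_mj)  # True 21.0 84.0
```

The budget says the sequence should fit with 9 ms of margin; the injection test matrix then has to prove that it actually does across phase, load, and temperature corners.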
9) How can secure boot with an HSM remain field-upgradeable and auditable?
Keep security as a chain of verified stages with explicit policy gates and audit points, then design upgrade flow to produce logs rather than bypass controls. Field upgradeability comes from signed packages, policy versioning, and traceable rejection reasons—not from weakening verification.
- Chain of trust: each stage verifies the next and records a marker (stage, version, outcome).
- Policy gates: upgrade permissions and version rules are enforced and logged.
- Audits: store package identity, policy version, and decision reason codes.
- Verification: rehearse failure cases (bad signature, disallowed version) and confirm correct reject logs.
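The stage-marker logging shape above can be sketched minimally; note this uses a SHA-256 digest as a stand-in for signature verification, whereas a real chain uses asymmetric signatures anchored in the HSM.

```python
import hashlib

# Minimal chain-of-trust logging sketch: each stage verifies the next
# payload and records a marker, and a rejection is itself a log entry.

def verify_stage(stage: str, payload: bytes, expected: str, log: list) -> bool:
    ok = hashlib.sha256(payload).hexdigest() == expected
    log.append({"stage": stage, "outcome": "pass" if ok else "reject"})
    return ok

log = []
app = b"mission-app-v2"
expected = hashlib.sha256(app).hexdigest()
verify_stage("bootloader->app", app, expected, log)          # pass
verify_stage("bootloader->app", b"tampered", expected, log)  # reject, logged
print(log)
```

The design point is that the reject path produces evidence instead of silence, which is what makes the failure-case rehearsal in the last bullet verifiable.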
10) How can anti-rollback be enforced without blocking maintenance module replacement?
Separate “replaceable module identity” from “software version floor,” then use policy-driven authorization and audit logs for maintenance actions. The goal is to allow replacement while preventing silent downgrades: every exception must be explicit, traceable, and policy-bound.
- Version floor: enforce a minimum allowed version/policy state for operation.
- Replacement flow: allow new modules if identity and policy checks pass; record a maintenance audit entry.
- No silent exceptions: any special recovery process must produce a traceable log with reason codes.
- Verification: test replacement scenarios and confirm audit completeness and enforced version rules.
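The "version floor + policy-bound replacement" flow above can be sketched as an authorization check with a mandatory audit entry; the record shapes and version tuples are illustrative assumptions.

```python
# Sketch: accept a replacement module only when identity checks pass and
# its software meets the version floor; log every decision with a reason.

def authorize_module(version, floor, identity_ok, audit: list) -> bool:
    if not identity_ok:
        decision, reason = "reject", "identity_check_failed"
    elif version < floor:
        decision, reason = "reject", "below_version_floor"
    else:
        decision, reason = "accept", "policy_checks_passed"
    audit.append({"version": version, "floor": floor,
                  "decision": decision, "reason": reason})
    return decision == "accept"

audit = []
authorize_module((2, 1), floor=(2, 0), identity_ok=True, audit=audit)  # accept
authorize_module((1, 9), floor=(2, 0), identity_ok=True, audit=audit)  # silent downgrade blocked
```

Because the audit entry is written on every path, a maintenance exception can only exist as an explicit, traceable record, never as a silent bypass.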
11) Why doesn’t “full performance” guarantee “flight readiness,” and how do heat and vibration trigger SERDES issues?
Heat reduces link margin and changes timing, while vibration can cause micro-motion at connectors and backplanes—both can manifest as retrains, speed drops, or intermittent endpoint loss. Flight readiness requires proving stability under combined stress and turning intermittent faults into measurable statistics.
- Thermal effect: higher temperatures can reduce margin and increase error/retrain probability.
- Vibration effect: connector micro-motion can cause transient impedance changes and link instability.
- Evidence metric: retrain/hour, link-down counts, speed shifts correlated to temperature peaks or vibration windows.
- Verification: fixes are accepted only when statistics improve and evidence remains traceable.
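The "intermittent fault into measurable statistics" step above can be sketched by comparing retrain rates inside and outside a thermal window; the timestamps below are synthetic.

```python
# Turn scattered link events into a statistic: retrains per hour inside
# vs. outside a hot-soak window, so a fix can be judged by the numbers.

def retrains_per_hour(event_times_s, start_s, end_s):
    hours = (end_s - start_s) / 3600.0
    return sum(start_s <= t < end_s for t in event_times_s) / hours

# Retrains cluster inside the hot soak (3600..7200 s).
events = [100, 3700, 3900, 4100, 5000, 6500, 8000]
cold = retrains_per_hour(events, 0, 3600)      # 1.0 per hour
hot = retrains_per_hour(events, 3600, 7200)    # 5.0 per hour
print(cold, hot)
```

A 5× rate increase correlated with the hot window is exactly the kind of evidence metric the bullets above call for: a fix is accepted when the hot-window rate drops, not when the bench "looks stable."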
12) In production, which identity/certificate/log baselines are most commonly missed?
Production gaps often come from incomplete traceability: serial and certificate IDs not verified per unit, missing baseline health snapshots, and missing audit versions for software and policy. A robust release process binds every record to a Trace ID and stores baselines needed for field comparison.
- Identity verification: per-unit serial/device P/N consistency plus non-secret certificate/key IDs.
- Baseline snapshots: storage health and firmware summary captured at ship time.
- Audit state: software version, policy version, and last-update summary stored and exportable.
- Evidence rule: every PASS/FAIL decision must be reconstructable from stored records.