Secure OTA Module for IoT Devices: Secure Boot & Safe Updates
A Secure OTA Module is a device-side closed loop that ensures every firmware update is authentic (signed), fresh (anti-rollback), and safe against bricking under power loss (A/B slots plus atomic commit), with deterministic recovery and a minimal evidence log for fast field debugging.
Boundary definition: what a Secure OTA Module is (and is not)
The goal is to lock the page scope early, so the design discussion stays on device-side update safety instead of drifting into cloud platforms, gateways, or fleet operations.
A Secure OTA Module is the combination of: Secure Boot + Signed Update + Power-loss Safety (Atomicity) + A/B Recovery & Rescue.
- Secure Boot: verifies the next-stage code before execution (chain-of-trust).
- Signed Update: verifies the update package before commit (signature + hash + manifest).
- Atomicity / PLP: ensures unexpected power loss cannot corrupt boot-critical metadata.
- A/B & Rescue: guarantees a path back to a known-good image after failure.
- Not covered here, Edge Security Probe: that topic focuses on identity/attestation/forensics; this page keeps only update-related evidence and minimal audit.
- Not covered here, Edge Power & Backup: that topic focuses on system hold-up and power architecture; this page defines only the write/commit atomicity requirements under power loss.
If a paragraph requires backend rollout policies, gateway aggregation, or management-console workflows, it is out of scope.
| Question (field reality) | Why it changes the design | Recommended level |
|---|---|---|
| Remote updates are required? | Assume update content can be tampered/replayed. Strong signature + integrity gates become mandatory. | Industrial High-security |
| Physical access is realistic? | Assume local storage can be copied/rewritten. Anti-rollback + protected trust anchor are required. | Industrial High-security |
| Frequent power loss / brownouts? | Assume mid-write interruption. A/B + atomic metadata is required to avoid “brick-on-update”. | Basic Industrial |
A Secure OTA Module is “done” only when each capability has observable evidence: verify results, version/counter state, slot state transitions, and power-loss-safe commit behavior.
Threat model & measurable security goals: what must be prevented
“Security” becomes actionable only when each goal maps to a control point and a minimum evidence set. This chapter defines the OTA-relevant attack surfaces and the device-side goals that can be verified in the field.
- Supply chain / manufacturing: trust anchor or key material is not unique or is substituted.
- Transport: update package is replaced, replayed, or partially corrupted in transit.
- Local storage: offline rewrite of flash/eMMC, forced downgrade, metadata tampering.
- Boot chain: bootloader stage is modified to bypass verification.
- Debug interface: post-production write access remains possible.
- Power anomalies (brownout): mid-write interruption causes inconsistent state; in extreme cases, undervoltage can corrupt instruction execution, so a verification branch may be skipped or taken incorrectly.
Any paragraph that requires fleet rollout policies, gateway routing, or cloud console workflows is out of scope.
| Goal | Device-side control point | Minimum evidence (field-debug friendly) |
|---|---|---|
| Authenticity | Signature verification at boot and pre-commit stages (trusted key anchor). | verify-result code, key ID, verified stage ID (ROM/BL/update). |
| Integrity | Hash validation during download and after write (chunk-hash and/or full-image hash). | hash mismatch counters, final image hash, manifest hash result. |
| Freshness | Anti-rollback gate (monotonic counter or version floor) enforced before boot/commit. | counter value, version floor, rollback reason (reject/update/boot). |
| Recoverability | A/B slots + confirm mechanism; failure routes to known-good slot/rescue. | slot state (pending/confirmed), boot attempt count, fallback path taken. |
| Auditability | Minimal event log with append-only intent (local), covering state transitions and reboot causes. | event sequence, state transitions, reboot reason flags (incl. brownout). |
- Authenticity → Secure Boot + Signed Update pipeline (next chapters).
- Integrity → Manifest + hash strategy (next chapters).
- Freshness → Anti-rollback (next chapters).
- Recoverability → A/B + rescue strategy (later chapters).
- Auditability → Evidence chain & logs (later chapters).
The next chapters will implement these goals with concrete mechanisms (verify points, manifest fields, anti-rollback gates, A/B confirmation, and power-loss-safe commits). Backend rollout policies remain out of scope for this page.
Chain of Trust: verified boot from ROM to application
Verified boot is a hard gate: code must be authenticated before execution. Measured boot can improve auditability, but it does not replace verified boot for minimum “do-not-run-unknown-code” guarantees.
- Verified boot: verification blocks execution on failure (signature/hash/version gates).
- Measured boot: measurements are recorded for later decisions; it can support auditing but does not prevent execution by itself.
| Stage | Loads / executes | Verify point (minimum) | Failure outcome |
|---|---|---|---|
| ROM | Loads 1st stage from internal flash / external memory. | Checks trusted anchor (public key hash / key index) and verifies 1st stage. | Hard fail or enter rescue entry (if designed). |
| 1st stage | Initializes minimal clocks/memory and loads bootloader. | Verifies bootloader signature; applies version gate hook (anti-rollback input). | Fallback to rescue slot / fail-safe mode (no unverified jump). |
| Bootloader | Selects A/B slot and loads OS/app image + manifest. | Verifies manifest + hash + signature; enforces freshness checks before commit/boot. | Switch slot; if both invalid → rescue image. |
| OS/App | Runs post-boot health checks (for confirmation). | Optional: confirms “known-good” after self-test (supports A/B confirm later). | On failure: watchdog/reboot triggers rollback policy. |
- Trust anchor placement: ROM/OTP anchors maximize tamper resistance; a secure element can add flexibility (rotation/counters) but adds BOM and integration surface.
- Signature choice: prioritize verification time, code footprint, and hardware acceleration so boot remains within the reset/watchdog budget.
- Failure policy: “refuse to boot” is correct for revoked/rollback attempts; “rescue slot” is appropriate for corruption or interrupted writes.
- PASS → jump to next stage; record verify code and stage ID.
- FAIL → do not execute; select alternate slot (A↔B) if available.
- Both invalid → enter rescue image; expose a deterministic error reason (signature/hash/version/metadata).
Field evidence should always include: verify-result code, selected slot, and reboot reason (including brownout flags).
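The PASS / FAIL / both-invalid rules above can be sketched as a small host-side model. This is a minimal illustration rather than bootloader code: the names `Verify`, `BootDecision`, and `select_boot_target` are hypothetical, and the per-slot verify results are assumed to have been produced by an earlier verification step.

```python
from dataclasses import dataclass
from enum import Enum


class Verify(Enum):
    """Deterministic verify-result codes (hypothetical enum values)."""
    OK = 0
    BAD_SIGNATURE = 1
    BAD_HASH = 2
    BELOW_FLOOR = 3
    BAD_METADATA = 4


@dataclass(frozen=True)
class BootDecision:
    target: str          # "A", "B", or "RESCUE"
    verify_code: Verify  # result for the slot that was evaluated last
    reboot_reason: str   # carried through so the evidence record is complete


def select_boot_target(preferred: str, results: dict, reboot_reason: str) -> BootDecision:
    """Try the preferred slot, then the alternate; never jump to a failed slot."""
    alternate = "B" if preferred == "A" else "A"
    last = Verify.BAD_METADATA
    for slot in (preferred, alternate):
        # A slot with no recorded result is treated as invalid metadata.
        last = results.get(slot, Verify.BAD_METADATA)
        if last is Verify.OK:
            return BootDecision(slot, last, reboot_reason)
    # Both slots failed: enter rescue and keep the last error as the reason.
    return BootDecision("RESCUE", last, reboot_reason)
```

The returned record carries the three evidence fields named above (verify-result code, selected slot, reboot reason), so the same structure could feed the event log.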
The next chapter will anchor the trust chain with concrete key storage and revocation rules so “verified boot” remains enforceable across device lifetime.
Signatures & keys: storing the trust root and enforcing revocation on-device
A secure boot chain is only as strong as the trust anchor and the acceptance rules. This chapter defines device-side key hierarchy, key IDs, and a practical revocation model that remains enforceable without backend deep dives.
- Root public key (anchor): immutable or strongly protected; defines “who is trusted to sign”.
- Intermediate key (optional): enables rotation without replacing the anchor; can be used to sign image-signing keys.
- Image signing key: signs firmware images/manifests; private key is never stored on the device.
The device verifies a chain of signatures and then enforces policy: allowed key IDs, revoked key IDs, and freshness gates (anti-rollback comes next).
| Element | What it stores | Enforcement rule (device-side) |
|---|---|---|
| Key ID | Compact identifier bound to a public key (or public key hash). | Verification output must include Key ID for audit and policy decisions. |
| Allow list | Set of acceptable Key IDs + public key hashes (can be versioned). | If Key ID not in allow list → reject (no fallback to “unknown but valid”). |
| Revoke list | Set of revoked Key IDs (optionally tagged with “revoked since version”). | Revoke list has higher priority than allow list; if revoked → reject even if signature verifies. |
| Acceptance policy | Decision order: revoked? allowed? version gate? integrity? | Reject on revoked/rollback attempts; allow rescue only for corruption/power-loss cases. |
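As a rough sketch of the decision order in the last row (revoked, then allowed, then version gate, then integrity), the policy can be written as an ordered chain of checks. The function and enum names are hypothetical, and signature verification is assumed to have already produced `key_id` and `integrity_ok`.

```python
from enum import Enum


class Decision(Enum):
    ACCEPT = "accept"
    REJECT_REVOKED = "reject_revoked"
    REJECT_UNKNOWN_KEY = "reject_unknown_key"
    REJECT_ROLLBACK = "reject_rollback"
    REJECT_INTEGRITY = "reject_integrity"


def evaluate(key_id, build, integrity_ok, *, allow, revoked, floor):
    """Apply the acceptance policy in its fixed order; first failure wins."""
    if key_id in revoked:        # revoke list outranks the allow list
        return Decision.REJECT_REVOKED
    if key_id not in allow:      # no "unknown but valid" fallback
        return Decision.REJECT_UNKNOWN_KEY
    if build < floor:            # freshness / version gate
        return Decision.REJECT_ROLLBACK
    if not integrity_ok:         # hash or signature check failed
        return Decision.REJECT_INTEGRITY
    return Decision.ACCEPT


def rescue_permitted(decision: Decision) -> bool:
    """Rescue is reserved for corruption/power-loss style failures."""
    return decision is Decision.REJECT_INTEGRITY
```

Checking the revoke list first means a key that appears in both lists is rejected, which matches the priority rule in the table.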
- ROM anchor: strongest immutability; minimal flexibility. Best for high-integrity boot roots.
- OTP/eFuse anchor: strong protection with limited update patterns (e.g., key index, hash slots).
- Secure element: strong isolation and useful features (counters, protected keys); adds integration cost and supply considerations.
- Plain flash: highest flexibility but highest tamper risk; requires additional protections and is not suitable as a sole root anchor.
Anti-rollback complements revocation: revocation answers “who may sign”, freshness answers “is it new enough”.
- Dev mode: allows debug for bring-up; must be visibly and logically separated from production trust (test keys, reduced restrictions).
- Production mode: debug access is locked or constrained; secure boot is mandatory; only production keys are accepted.
- Hard rule: test keys and relaxed gates must not remain reachable in production acceptance policy.
The next chapter (anti-rollback) completes freshness enforcement so a valid signature cannot be used to downgrade to an older, vulnerable image.
Anti-rollback: versions, counters, and enforceable freshness
A valid signature only proves origin and integrity. Freshness proves the image is not an older, revoked, or below-floor build. Anti-rollback must be enforced as a hard gate before execution and before committing an update.
- Gate condition: candidate_build ≥ device_floor (monotonic integer compare).
- Where enforced: at boot (ROM/1st stage/bootloader) and at update commit (before setting “pending”).
- What is recorded: device_floor, candidate_build, decision_code, selected_slot.
| Method | Strength | Common failure mode | Engineering mitigation |
|---|---|---|---|
| OTP/eFuse monotonic version | High tamper resistance; “only increases”. | Incorrect programming or wrong floor bump can block valid images. | Use staged floor bump (post-boot confirm), verify write success, and tie floor to signed manifest. |
| Secure element monotonic counter | High protection plus flexible features (counter/secure storage). | Integration/BOM complexity; counter access failures must fail-safe. | Define deterministic fallback (recovery only), cache read-only floor for boot, and log SE status codes. |
| Protected flash counter | Cost-effective but weaker against physical tampering. | Counter rollback via rewrite; power-loss during update can desync state. | Use redundancy (dual copies + voting), journal/COW updates, bind counter to signed metadata, and treat downgrade detection as recovery-only. |
- Recommended: use a monotonic build number (uint32/uint64) as the freshness key.
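- SemVer: suitable for display and compatibility labels, but not as the sole freshness gate: ordering requires multi-field parsing, pre-release tags sort below the release they precede, and build metadata is ignored for precedence.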
- Policy: gate on build number; optionally display SemVer separately as a human-facing string.
- Rollback detected (candidate_build < floor) → reject boot and enter recovery/rescue.
- Known-good fallback allowed: switch to the other slot only if it is not below floor and not revoked.
- Revoked overrides “known-good”: a revoked version must never become bootable again.
Freshness gate and key revocation gate must both pass. One cannot compensate for the other.
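A minimal sketch of the integer freshness gate and the fallback rule follows, together with a deliberately naive string comparison that shows why version labels make a poor gate. All function names here are illustrative assumptions.

```python
def freshness_gate(candidate_build: int, device_floor: int) -> bool:
    """Hard gate: the candidate must not be below the device floor."""
    return candidate_build >= device_floor


def fallback_allowed(slot_build: int, device_floor: int, revoked: bool) -> bool:
    """The other slot is a valid fallback only if it passes both gates."""
    return freshness_gate(slot_build, device_floor) and not revoked


def naive_label_is_older(candidate: str, installed: str) -> bool:
    """Deliberately naive: compares version labels as plain strings."""
    return candidate < installed


# String comparison wrongly reports 1.10.0 as older than 1.9.0,
# because "1" sorts before "9" at the first differing character.
print(naive_label_is_older("1.10.0", "1.9.0"))  # True (wrong answer)
print(freshness_gate(110, 109))                 # True (correct with build numbers)
```

A single monotonic integer avoids the parsing step entirely, which is one reason to keep the human-facing version string separate from the gate.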
Freshness enforcement becomes actionable only when the update package carries unambiguous version fields and signed metadata—defined next.
Firmware images & manifest: minimum fields for verification and recovery
Treat the OTA package as an engineering interface: a signed manifest that drives verification, storage writes, commit points, and recovery—without relying on transport-protocol details.
- Verify manifest first, then trust payload. All critical fields must be covered by the manifest signature.
- Manifest drives the state machine: download → verify → write → re-verify → commit → reboot → confirm.
- Evidence is mandatory: each step produces a small, stable record (hash/sig/slot/code) for field debugging.
| Group | Field | Why it exists (device-side) |
|---|---|---|
| Identity & integrity | image_hash (final), signature, signer_key_id | Proves origin and the exact bytes that must be installed and booted. |
| Freshness & policy | build_number (monotonic), hw_compat (board/SoC ID) | Prevents downgrade and prevents installing an image built for different hardware. |
| Install target | slot_target (A/B), image_type (BL/OS/App), dependencies | Ensures the correct partition/slot is written and version-coupled components stay compatible. |
| Chunking & recovery | chunk_map (offset/len/order), chunk_hashes, compression | Enables resume/partial verification and power-loss recovery without accepting reordered or replaced chunks. |
| Storage interface | payload_size, chunk_size, staging_hint | Allows deterministic write planning and prevents overrun/partial state confusion. |
If chunk hashes exist, the chunk map and ordering must be signature-covered; final image hash should still be verified after write.
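To illustrate the chunk-hash plus final-hash rule, here is a small sketch using SHA-256 from the Python standard library. It assumes the manifest signature has already been verified, so the chunk list and its ordering can be trusted. The dictionary keys reuse field names from the table; `build_manifest` is a hypothetical host-side helper used only to create demo data.

```python
import hashlib


def build_manifest(image: bytes, chunk_size: int) -> dict:
    """Host-side helper: split the image and record per-chunk and final hashes."""
    chunks = [image[i:i + chunk_size] for i in range(0, len(image), chunk_size)]
    return {
        "payload_size": len(image),
        "chunk_size": chunk_size,
        "chunk_hashes": [hashlib.sha256(c).hexdigest() for c in chunks],
        "image_hash": hashlib.sha256(image).hexdigest(),
    }


def verify_chunk(manifest: dict, index: int, data: bytes) -> bool:
    """Check one received chunk against the hash at its signed position."""
    hashes = manifest["chunk_hashes"]
    if not 0 <= index < len(hashes):
        return False  # index outside the signed chunk map
    return hashlib.sha256(data).hexdigest() == hashes[index]


def verify_final_image(manifest: dict, written: bytes) -> bool:
    """Post-write check over the bytes read back from storage."""
    return (len(written) == manifest["payload_size"]
            and hashlib.sha256(written).hexdigest() == manifest["image_hash"])
```

Because each hash is bound to an index, a valid chunk presented at the wrong position fails `verify_chunk`, which is the reordering case the signed chunk map is meant to catch.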
- Full image: simplest trust boundary; failure recovery is straightforward (retry download/write on inactive slot).
- Delta: validation is harder; a “valid patch” does not guarantee a correct final image unless the final hash is checked.
- Recommendation: prioritize full images for high reliability; if delta is used, require final-image verification and robust staging.
- Whole-image hash: simplest, but resume requires re-download or local caching.
- Chunk hashes: enable resume and partial verification; require signed chunk map to prevent reordering/substitution.
- Post-write rule: verify final image hash after storage writes before setting any boot flag.
- Staging first: manifest and progress metadata must be written to a staging area that does not break the current bootable slot.
- Write inactive slot: never overwrite the currently confirmed slot during normal OTA.
- Commit point: only after verify passes → set boot flag to pending (atomic write).
- Confirm point: after post-boot self-check → set confirmed; otherwise rollback per attempt counter.
Later chapters can reuse this evidence stream to build deterministic field-debug workflows without exposing platform or transport details.
Encrypted storage: partitioning and device-side key strategy
Encryption-at-rest is a storage policy, not a replacement for signature verification. Treat firmware, configuration, and secrets as different asset classes with different write patterns and recovery constraints.
| Asset | Write pattern | Primary risks | Minimum device-side controls |
|---|---|---|---|
| Firmware (slots) | Rare, large writes | IP exposure, offline analysis, tampered images | Signed images + anti-rollback gate; encryption is optional and must not break recovery. |
| Config | Medium, frequent updates | Privacy leakage, policy manipulation, replay of old config | AEAD (encrypt + authenticate), versioning/journal, optional freshness rules for critical fields. |
| Secrets | Small objects | Key extraction, cloning, credential reuse | Per-device keys, key wrapping, secure storage policy, revocation/rotation readiness. |
- Full-disk: strongest coverage but increases boot-chain and rescue constraints; only suitable when early-stage decryption is guaranteed.
- Partition encryption: practical default—encrypt Config and Secrets first; encrypt firmware slots only if recovery remains deterministic.
- Secrets-only: lowest cost; acceptable only when config is not sensitive and does not enable high-impact policy manipulation.
- Root: hardware-anchored device unique secret (OTP/eFuse/PUF/secure element) used as the trust anchor.
- KEK: key-encryption key derived from the root; used to wrap data keys.
- DEK: data-encryption key used for a specific partition/object (Config/Secrets, optional for firmware slots).
- Rule: DEK must never be stored in plaintext; store wrapped_DEK + keyslot_id + policy.
Key rotation is a device-side rule: new KeyID must be accepted only when authorized, and old KeyID must become unbootable/unusable when revoked.
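The Root to KEK step of the hierarchy above could be sketched with HKDF (RFC 5869) built from the standard-library `hmac` and `hashlib` modules. The source only says the KEK is derived from the root, so the choice of HKDF, the `derive_kek` helper, and the salt/info layout are assumptions made for illustration. Wrapping the DEK would additionally need an AEAD or key-wrap primitive from a vetted crypto library, which is not shown here.

```python
import hashlib
import hmac


def hkdf_sha256(ikm: bytes, salt: bytes, info: bytes, length: int = 32) -> bytes:
    """HKDF (RFC 5869) extract-then-expand using HMAC-SHA-256."""
    if length > 255 * 32:
        raise ValueError("length too large for HKDF-SHA-256")
    # Extract: an empty salt defaults to a block of zero bytes (hash length).
    prk = hmac.new(salt or b"\x00" * 32, ikm, hashlib.sha256).digest()
    # Expand: T(i) = HMAC(PRK, T(i-1) | info | i)
    okm, block, counter = b"", b"", 1
    while len(okm) < length:
        block = hmac.new(prk, block + info + bytes([counter]), hashlib.sha256).digest()
        okm += block
        counter += 1
    return okm[:length]


def derive_kek(root_secret: bytes, device_id: bytes, partition: str) -> bytes:
    """Hypothetical layout: device ID as salt, partition label as context."""
    return hkdf_sha256(root_secret, salt=device_id, info=b"kek:" + partition.encode())
```

Binding the partition label into `info` gives Config and Secrets independent KEKs from the same root, so compromise of one wrapped DEK does not directly expose the other.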
- AEAD requirement: encryption without authentication is insufficient for Config/Secrets.
- Nonce/IV rule: never reuse a nonce with the same key; monotonic or random nonce must be stored consistently with the ciphertext.
- Write amplification: hot config keys should be journaled as small records; avoid re-encrypting large regions on every write.
- Crash consistency: nonce/counter state must be updated with journal/COW so a brownout cannot desynchronize metadata and ciphertext.
- Rescue not encrypted (but signed): maximizes recoverability; expose only minimal functionality.
- Rescue encrypted + signed: only safe when early-stage decryption is guaranteed and key access is deterministic under fault conditions.
- Hard requirement: recovery entry must remain possible even when Config/Secrets are corrupted or decryption fails.
Encryption policies must be compatible with crash consistency; nonce/metadata correctness becomes a first-class requirement under power loss.
Power-loss-safe updates: A/B + atomic metadata + write gating
“Not bricking under power loss” is achieved by invariants: keep one confirmed bootable slot, make metadata updates recoverable, and gate writes as brownout approaches. PLP is treated as an interface requirement (rails/time/trigger), not a sizing exercise.
- Interrupted writes → inactive image corruption → invariant: never overwrite the confirmed slot during OTA.
- Metadata corruption → unknown bootable slot → invariant: metadata must be journaled/COW with validation (CRC/version).
- Counter/nonce desync → false rollback/AEAD failure → invariant: counters/nonces advance only with atomic commit records.
- slot_state: CONFIRMED / PENDING / INVALID
- attempt_count: limits repeated boot trials of a pending slot
- active_slot + next_slot: deterministic selection source of truth
- confirm flag: written only after post-boot health checks
- floor: monotonic freshness threshold (must not regress)
- Dual copy + CRC + generation: read the newest valid record; write the other copy on updates.
- Append-only journal: write small, ordered records; recover by replaying to the last valid entry.
- Boot selection rule: if pending metadata is inconsistent, fall back to the last confirmed slot and enter recovery policy.
- Commit bytes: worst-case bytes written for one state transition (meta + journal + counters).
- Commit latency: worst-case time for flush/verify of commit bytes (includes erase/write/flush behavior).
- Brownout threshold: below this, new erase/write must stop; only allow safe finalization of the current atomic record.
- Rails held: compute + storage + required IO for completing atomic metadata write.
- Hold time: commit latency + safety margin for deterministic finalization.
- Write stop trigger: brownout interrupt / PMIC PG / ADC threshold; must immediately block new erase/write operations.
- Safe action: complete or abort the current atomic record; never leave half-written metadata as the latest record.
PLP is expressed as “rails held + hold time + write-stop trigger”; the OTA design remains valid regardless of the underlying energy source.
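The "dual copy + CRC + generation" rule described above can be modelled in a few lines. This is a sketch under stated assumptions: the 16-byte record layout, the field order, and the function names are hypothetical, and the two flash copies are represented by a two-element Python list.

```python
import struct
import zlib

# generation (u32) | active_slot (u8) | slot_state (u8) | attempt_count (u8) | pad | floor (u32)
RECORD = struct.Struct("<IBBBxI")


def encode(generation, active_slot, slot_state, attempt_count, floor) -> bytes:
    body = RECORD.pack(generation, active_slot, slot_state, attempt_count, floor)
    return body + struct.pack("<I", zlib.crc32(body))


def decode(raw: bytes):
    """Return the record tuple, or None if the copy is torn or corrupted."""
    if len(raw) != RECORD.size + 4:
        return None
    body = raw[:RECORD.size]
    (crc,) = struct.unpack("<I", raw[RECORD.size:])
    if zlib.crc32(body) != crc:
        return None
    return RECORD.unpack(body)


def read_newest(copies):
    """Return (index, record) for the valid copy with the highest generation."""
    valid = [(i, rec) for i, rec in ((i, decode(c)) for i, c in enumerate(copies))
             if rec is not None]
    return max(valid, key=lambda item: item[1][0]) if valid else None


def write_update(copies, active_slot, slot_state, attempt_count, floor) -> int:
    """Write the new record to the copy that is NOT the newest valid one."""
    newest = read_newest(copies)
    if newest is not None and floor < newest[1][4]:
        raise ValueError("floor must not regress")
    generation = newest[1][0] + 1 if newest else 1
    target = 1 - newest[0] if newest else 0
    copies[target] = encode(generation, active_slot, slot_state, attempt_count, floor)
    return target
```

If power is lost while `copies[target]` is being written, the torn copy fails the length or CRC check and `read_newest` falls back to the previous generation, which is the recoverable-metadata invariant.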
Recovery design: rollback, rescue modes, and the last lifeline
A secure OTA system must remain recoverable. Recovery is defined as a layered plan with a bounded confirmation window, deterministic slot selection rules, and field-operable rescue entry points.
- Bootloader rescue: minimal dependencies; must always be reachable even when filesystem/config decryption fails.
- Minimal OS rescue: adds drivers and tooling for validation/export, while staying under secure boot policy.
- App-level safe mode: provides degraded service and operator feedback, but must not bypass signature or rollback gates.
- Purpose: promote a pending slot to confirmed only after it proves basic health.
- Checks: watchdog feed path, 1–3 critical services, storage readability, and a bounded timer window.
- Rule: unconfirmed equals failure; a pending slot must not remain pending indefinitely.
- Outputs: CONFIRMED flag (atomic) + small reason codes on failure (enum, not long strings).
| Condition (hard evidence) | Action | Notes |
|---|---|---|
| PENDING exists and attempt_count < N | Boot PENDING slot (test run) | Health check must confirm within the window. |
| Health check fails or attempt_count reaches N | Mark PENDING invalid (or lower priority), roll back to last CONFIRMED | Record rollback_reason and last error_code. |
| No CONFIRMED slot is bootable | Enter Bootloader rescue | Rescue must still enforce signature/floor rules. |
| Candidate version < floor (revoked/too old) | Disallow boot even if image verifies | Rollback is allowed only to known-good versions ≥ floor. |
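The decision table can be read as a single boot-time function. The sketch below is a host-side model with hypothetical names (`SlotMeta`, `decide_boot`, `confirm_pending`); a failed health check is modelled simply as a reboot without confirmation, which consumes one attempt.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class SlotMeta:
    confirmed: Optional[str]        # last CONFIRMED slot, if any
    pending: Optional[str]          # slot awaiting confirmation, if any
    floor: int                      # anti-rollback floor
    builds: Dict[str, int]          # slot name -> build number
    attempt_count: int = 0
    rollback_reason: Optional[str] = None


def decide_boot(meta: SlotMeta, max_attempts: int) -> str:
    """Return the slot to boot, or "RESCUE"; updates meta as a bootloader would."""
    if meta.pending is not None:
        below_floor = meta.builds[meta.pending] < meta.floor
        if not below_floor and meta.attempt_count < max_attempts:
            meta.attempt_count += 1          # bounded test run of the PENDING slot
            return meta.pending
        # Attempts exhausted or below floor: invalidate PENDING and roll back.
        meta.rollback_reason = "below_floor" if below_floor else "attempts_exhausted"
        meta.pending = None
        meta.attempt_count = 0
    if meta.confirmed is not None and meta.builds[meta.confirmed] >= meta.floor:
        return meta.confirmed
    return "RESCUE"                          # no bootable CONFIRMED slot


def confirm_pending(meta: SlotMeta) -> None:
    """Called after the post-boot health check passes within the window."""
    if meta.pending is not None:
        meta.confirmed, meta.pending, meta.attempt_count = meta.pending, None, 0
```

With `max_attempts=3`, a pending slot that never confirms is booted three times and then abandoned in favour of the last confirmed slot, so the pending state cannot persist indefinitely.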
- Physical entry: button or GPIO strap at power-on to force rescue mode.
- Console entry: UART/USB commands to inspect slot state, export minimal logs, and trigger rollback/rescue.
- Production guard: development vs production mode gating to prevent rescue paths from bypassing verification.
A rescue path is a controlled interface, not a backdoor.
- Keep at least one CONFIRMED bootable slot at all times; updates modify only the inactive slot.
- Make metadata truth recoverable: journal/COW with validation and deterministic fallback rules.
- Bound trials: pending has an attempt limit; failure triggers automatic rollback or rescue.
Recovery stays within device boundaries: deterministic slot policy + bounded confirmation + controlled rescue interfaces.
Hard evidence and observability: what to collect when updates fail
Debugging must start from hard evidence: first locate where the state machine stopped, then classify the failure as cryptographic/package, storage/atomicity, or power/brownout related.
| Event | Fields (keep it structured) | Why it matters |
|---|---|---|
| EVT_VERIFY_SIG | result, key_id, error_code | Separates “image authenticity” failures from all other classes. |
| EVT_PARSE_MANIFEST | result, version, target_slot, compat | Explains why a valid signature still cannot be accepted. |
| EVT_SLOT_TRANSITION | from_state → to_state, attempt_count | Locates exactly where the OTA flow stopped. |
| EVT_REBOOT_REASON | wdt / brownout / software, brownout_flag | Distinguishes logic bugs from power integrity failures. |
| EVT_ROLLBACK | reason, floor, candidate_version (redacted) | Proves whether anti-rollback rules blocked a boot attempt. |
Use event codes + small fields; avoid long text logs in critical paths.
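One way to keep records small and structured is a fixed-size binary layout. The sketch below packs each event into 12 bytes with `struct`; the numeric event codes and the field layout are assumptions for illustration, not a defined log format.

```python
import struct
from enum import IntEnum


class Evt(IntEnum):
    """Hypothetical numeric codes for the events in the table above."""
    VERIFY_SIG = 1
    PARSE_MANIFEST = 2
    SLOT_TRANSITION = 3
    REBOOT_REASON = 4
    ROLLBACK = 5


# seq (u32) | event code (u8) | result/reason enum (u8) | aux (u16) | value (u32)
EVENT = struct.Struct("<IBBHI")


def encode_event(seq: int, evt: Evt, result: int, aux: int = 0, value: int = 0) -> bytes:
    """Pack one event into a fixed 12-byte record (no strings, no secrets)."""
    return EVENT.pack(seq, int(evt), result, aux, value)


def decode_event(raw: bytes) -> dict:
    seq, code, result, aux, value = EVENT.unpack(raw)
    return {"seq": seq, "evt": Evt(code), "result": result, "aux": aux, "value": value}
```

Fixed-size records make it straightforward to find record boundaries after an interrupted write, and the `aux`/`value` fields can hold a key ID, attempt count, or floor without storing free text.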
- Append-only: records are appended, not overwritten; keeps the latest failure evidence intact.
- Tamper-evident digest (optional): periodic digest over recent records to detect easy edits.
- Redaction: never print secrets; prefer KeyID/counter/error enums instead of sensitive content.
- Step 1: read the last EVT_SLOT_TRANSITION to locate the stuck state (DOWNLOAD / STAGING / PENDING / BOOT_TEST).
- Step 2: check EVT_VERIFY_SIG and EVT_PARSE_MANIFEST to classify crypto/package issues.
- Step 3: check EVT_REBOOT_REASON + brownout flags to classify power-loss / brownout behavior.
- Step 4: check EVT_ROLLBACK (floor, candidate_version) to confirm anti-rollback blocks.
- Crypto/package: signature failure, incompatible manifest, version below floor.
- Storage/atomicity: meta CRC failure, journal replay issues, AEAD decrypt failure due to nonce/metadata mismatch.
- Power/brownout: repeated brownout flags, reboot reasons pointing to power, transitions stopping during writes.
- Last N event records + current slot snapshot (active/next/attempt/floor).
- Current firmware version + candidate version + last rollback reason.
- Manifest header digest (hash only) for correlation without leaking content.
This chapter stays device-local: append-only evidence + small structured fields + a deterministic triage order.
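The triage order can be sketched as a function over exported event records. Events are modelled as plain dictionaries; `EVT_META_VALIDATE` is a hypothetical event name added here to represent a metadata CRC or journal check, since the event table above does not define one.

```python
def triage(events: list) -> tuple:
    """Return (stuck_state, failure_class) following the four-step order."""
    # Step 1: the last slot transition tells where the state machine stopped.
    stuck = None
    for event in reversed(events):
        if event["evt"] == "EVT_SLOT_TRANSITION":
            stuck = event["to_state"]
            break

    def failed(name: str) -> bool:
        return any(e["evt"] == name and e.get("result") == "FAIL" for e in events)

    # Step 2: signature or manifest failures.
    if failed("EVT_VERIFY_SIG") or failed("EVT_PARSE_MANIFEST"):
        return stuck, "crypto/package"
    # Step 3: brownout evidence in reboot reasons.
    if any(e["evt"] == "EVT_REBOOT_REASON" and e.get("brownout_flag") for e in events):
        return stuck, "power/brownout"
    # Step 4: anti-rollback blocks (version below floor).
    if any(e["evt"] == "EVT_ROLLBACK" for e in events):
        return stuck, "crypto/package"
    # Remaining evidence: metadata validation failures (hypothetical event).
    if failed("EVT_META_VALIDATE"):
        return stuck, "storage/atomicity"
    return stuck, "unclassified"
```

Returning the stuck state alongside the class keeps the "where" and the "why" separate, which matches the idea of locating the stop point before classifying the failure.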
Validation & testing: proving OTA is secure and non-bricking
A Secure OTA Module is verified by repeatable, device-side tests that produce hard evidence: deterministic state transitions, reject reasons for malicious inputs, and recovery guarantees under power interruption.
- Authorization: only signed-and-allowed images can reach PENDING / CONFIRMED.
- Freshness: any image below rollback floor is blocked even if signature verifies.
- Atomicity: metadata truth is recoverable after any interruption (journal/COW + validation).
- Recoverability: failure always ends in CONFIRMED boot or controlled rescue entry (bounded attempts).
- Observability: failure is classifiable by device-local evidence (state vs crypto vs power).
These part numbers are common reference options for secure boot + signed update + local evidence. Final selection must match required security level, lifecycle, and availability.
| Block | Example part numbers | Typical role in OTA proof |
|---|---|---|
| Secure MCU / SoC | NXP LPC55S69 · NXP i.MX RT1062 · ST STM32U585 · ST STM32H573 · Renesas RA6M5 · Infineon PSoC 64 | Secure boot root, signature verification, slot state machine, evidence events. |
| Secure Element | Microchip ATECC608B · NXP SE050 · Infineon OPTIGA™ Trust M (SLS32AIA010MS) · ST STSAFE-A110 | Protected key storage, monotonic counter (freshness), key rotation/revocation anchors. |
| Discrete TPM (optional) | Infineon OPTIGA™ TPM SLB 9670 / SLB 9672 · Nuvoton NPCT75x (family) | Hardware trust anchor / measured evidence (optional), policy-controlled key use. |
| External NOR Flash | Winbond W25Q128JV · Macronix MX25L128 / MX25U128 · Micron MT25Q (family) | Slot A/B images, staging areas, metadata journal; supports power-cut sweep tests. |
| eMMC storage (if used) | Micron eMMC (MTFC… family) · Kioxia eMMC (THG… family) | High-capacity slots/delta chunks; validate atomic commit + filesystem resilience. |
| Reset / watchdog supervisor | TI TPS3431 · TI TPS3850 · Maxim/ADI MAX16054 | Enforces bounded boot-test window; provides reboot reason evidence and recovery determinism. |
| Voltage monitor / brownout detect | TI TPS3703 · Microchip MCP1316 (family) | Gates writes near brownout and records brownout events for power-cut/boundary validation. |
Practical validation tip: at least one reference platform should include a Secure Element (e.g., ATECC608B / SE050 / STSAFE-A110) to test monotonic counters and key revocation behavior under power interruptions.
- Full update path: IDLE → DOWNLOADING → STAGING → PENDING → BOOT_TEST → CONFIRMED.
- Resume: interrupt download and restart; chunk verification resumes without nonce/counter inconsistency (when AEAD is used).
- A/B alternation: two consecutive releases alternate inactive slot; one CONFIRMED slot remains bootable at all times.
- Auto-rollback: fail boot-test and confirm automatic rollback to last CONFIRMED within bounded attempts.
- Rescue entry: force rescue via button/strap/UART policy and verify signature/floor gates still apply.
Suggested evidence hooks: EVT_SLOT_TRANSITION + attempt_count + active/next slot snapshot.
- Forged signature: image must be rejected during verification (no PENDING state set).
- Tampered manifest: hash/field mismatch must be rejected with a structured error code.
- Replay: an older valid package must be blocked by freshness rules (counter/floor).
- Rollback: candidate version below floor must be blocked even if signature verifies.
- Key revoke: revoked KeyID must remain unusable across normal and rescue paths.
Reference platforms to cover this class: Secure Element-based monotonic counters (e.g., ATECC608B / SE050 / STSAFE-A110) and at least one MCU with ROM/OTP trust anchor support (e.g., LPC55S69 / STM32U585 / STM32H573).
- Power-cut sweep: repeatedly cut power at multiple phases (download / staging / meta commit / pending set / boot-test / confirm).
- Pass criterion: after any interruption, the system must reach CONFIRMED boot or controlled rescue entry (never stuck in unknown).
- Brownout boundary: verify “write gating” near brownout; metadata truth remains recoverable (journal/COW validation works).
- Wear-out: stress metadata/journal updates and confirm stable recovery rules (no dual-slot loss).
Recommended evidence hooks: EVT_REBOOT_REASON + brownout_flag + meta validation result + last rollback reason. Suggested supporting parts for repeatability: watchdog supervisor (TPS3431/TPS3850/MAX16054) and voltage monitor (TPS3703/MCP1316 family).
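A power-cut sweep can be prototyped on a host before it is run on hardware. The sketch below is a simplified model, not a test rig: each durable step of an update is a generator step, a power cut is simulated by stopping after a chosen number of steps, and metadata writes are assumed to be atomic single assignments. All names are hypothetical.

```python
import hashlib
from itertools import islice

OLD_IMAGE = b"old-firmware" * 64
NEW_IMAGE = b"new-firmware" * 64
VALID = {hashlib.sha256(OLD_IMAGE).digest(), hashlib.sha256(NEW_IMAGE).digest()}


def fresh_device() -> dict:
    return {"slots": {"A": OLD_IMAGE, "B": b""}, "confirmed": "A", "pending": None}


def image_ok(dev: dict, slot: str) -> bool:
    return hashlib.sha256(dev["slots"][slot]).digest() in VALID


def update_steps(dev: dict):
    """Each yield marks a point where power could be cut."""
    half = len(NEW_IMAGE) // 2
    dev["slots"]["B"] = NEW_IMAGE[:half]          # partial write to inactive slot
    yield "write_first_half"
    dev["slots"]["B"] = NEW_IMAGE                 # write completes
    yield "write_second_half"
    if not image_ok(dev, "B"):                    # re-verify before commit
        return
    dev["pending"] = "B"                          # commit point (assumed atomic)
    yield "commit_pending"
    dev["confirmed"], dev["pending"] = "B", None  # confirm point (assumed atomic)
    yield "confirm"


def boot_after_cut(dev: dict) -> str:
    pending = dev["pending"]
    if pending is not None and image_ok(dev, pending):
        return pending
    dev["pending"] = None
    confirmed = dev["confirmed"]
    if confirmed is not None and image_ok(dev, confirmed):
        return confirmed
    return "RESCUE"


def power_cut_sweep() -> list:
    """Cut power after 0..4 completed steps and record where the device boots."""
    outcomes = []
    for completed in range(5):
        dev = fresh_device()
        list(islice(update_steps(dev), completed))  # run only `completed` steps
        outcomes.append(boot_after_cut(dev))
    return outcomes
```

In this model every cut point ends in a boot of a slot whose image validates, because the confirmed slot is never written during the update. A hardware sweep would apply the same pass criterion with real cut timing.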
Adversarial coverage is validated by confirming that safety policies trigger and recovery remains deterministic, without describing fault-injection procedures.
| # | Must-pass item | Expected hard evidence |
|---|---|---|
| 1 | Normal update reaches CONFIRMED | EVT_SLOT_TRANSITION sequence complete; CONFIRMED set atomically |
| 2 | Resume works after forced reboot during download | Chunk verification consistent; no counter/nonce mismatch error |
| 3 | A/B alternation across two releases | Inactive slot updated; at least one bootable CONFIRMED always remains |
| 4 | Boot-test failure rolls back within bounded attempts | attempt_count increments; rollback_reason recorded; last CONFIRMED boots |
| 5 | Rescue entry is reachable and policy-controlled | Rescue flag/strap/button recorded; verification gates still enforced |
| 6 | Forged signature package is rejected | EVT_VERIFY_SIG=FAIL; PENDING not set |
| 7 | Tampered manifest is rejected | EVT_PARSE_MANIFEST=FAIL (structured error_code) |
| 8 | Replay of older signed package is blocked | floor/counter check blocks; rollback/reject reason captured |
| 9 | Rollback below floor is blocked | candidate_version < floor; boot refused; reason captured |
| 10 | Revoked key cannot install or boot | KeyID flagged revoked; verification/policy blocks in normal and rescue |
| 11 | Power-cut sweep passes across all phases | After each cut: CONFIRMED boot or RESCUE; never “unknown” slot truth |
| 12 | Metadata truth is recoverable after interruption | Journal replay OK; meta validation OK; deterministic slot selection |
| 13 | Brownout boundary does not corrupt truth | brownout_flag set; write gating prevents new writes; recovery succeeds |
| 14 | Attempt limit forces rollback/rescue (no endless pending) | attempt_count reaches N; auto rollback or rescue entry occurs |
| 15 | Failure is classifiable by device-local evidence | Each failure maps to at least one class (state / crypto / power) that is provable from events |
The chapter remains device-side: tests prove deterministic recovery and policy enforcement using structured evidence, while avoiding cloud/fleet architecture and avoiding fault-injection procedures.
Component & architecture selection: MCU/SoC, Secure Element, storage, supervisors
This section turns Secure OTA mechanisms into concrete device-side capabilities. The goal is a closed loop: verified boot + signed updates + anti-rollback + atomic commit under power loss + recoverability + minimum evidence.
1) Capability-to-mechanism mapping (what each “ability” pays for)
| Hardware capability (device-side) | Enables which mechanism | Primary failure prevented (field symptom) |
|---|---|---|
| Immutable boot root (ROM verified boot, or equivalent) | Chain-of-trust verification point (see: Chain of Trust) | Unsigned image boot; “bootloader can be replaced” |
| Secure storage anchor (OTP/eFuse/secure NVM) | Root public-key hash / KeyID allowlist / policy floor (see: Signatures & keys; Anti-rollback) | “Signature passes but a revoked key still works” |
| Crypto acceleration (SHA-256/384, ECC verify, AES-GCM/CCM) | Fast verify, chunk validation, at-rest encryption (see: Chain of Trust; Firmware images & manifest; Encrypted storage) | Long verify windows → higher power-cut brick rate |
| TRNG / reliable entropy | Nonce/IV correctness, per-device key derivation (see: Encrypted storage) | Repeated IV/nonce; fragile encryption correctness |
| Isolation primitives (MPU/TrustZone-M/privilege) | Separates updater/boot policy from application (see: Chain of Trust; Encrypted storage; Hard evidence and observability) | App bugs overwrite policy/logs; evidence becomes unreliable |
| Dual-image boot support (A/B slots + selection logic) | Recoverability and bounded attempts (see: Power-loss-safe updates; Recovery design) | One bad update bricks device; no safe rollback path |
| Brownout detect + write gating | Atomic metadata commit under power loss (see: Power-loss-safe updates; Validation & testing) | “Meta truth lost” after dips; stuck in unknown boot state |
| Reset/watchdog supervision | Boot-test window, deterministic rollback/rescue (see: Recovery design; Validation & testing) | Endless reboot loops; no controlled rollback |
Practical guideline: prioritize abilities that shorten verification time, harden rollback floors, and keep metadata updates atomic.
2) MCU/SoC “must-have” abilities for a minimum Secure OTA loop
- Verified boot root: an immutable verify point (ROM or equivalent) that cannot be bypassed in production mode.
- Trust anchor storage: OTP/eFuse/secure NVM to pin root public-key hash and policy identifiers (KeyID allowlist / floor).
- Signature verification: ECC verify + hashing (software is possible; hardware accel reduces brick risk by shrinking time windows).
- Isolation: at least MPU/privilege separation so the updater/policy cannot be overwritten by normal application code.
- Boot slot selection: supports A/B slot rules and a bounded-attempt boot-test gate (confirm-or-rollback behavior).
Example MCU/SoC references (part numbers): NXP LPC55S69, ST STM32U585, ST STM32H573, Renesas RA6M5, Infineon PSoC 64.
3) Secure Element: when it is worth it (and when it is not)
A Secure Element is most valuable when it provides device-side features that the main MCU cannot reliably emulate: key isolation and monotonic counters that enforce anti-rollback floors under physical access and power interruptions.
- Strong fit: devices with physical exposure, strict anti-rollback, key revocation requirements, or a need for monotonic counters.
- Neutral fit: moderate threat model, MCU already has robust OTP/TrustZone; SE is an optional safety margin.
- Poor fit: ultra-cost-sensitive designs where supply/lifecycle risk dominates and threat model is low.
Example Secure Element references (part numbers): Microchip ATECC608B, NXP SE050, ST STSAFE-A110, Infineon OPTIGA™ Trust M (SLS32AIA010MS). Optional discrete TPM references: Infineon SLB 9670 / SLB 9672.
4) Storage choice for OTA: NOR vs eMMC/NAND (update-write behavior + PLP risk points)
- QSPI NOR: deterministic erase/write units; excellent for A/B images + small metadata journal. Risk is sector erase latency → needs strict staging and verified commit.
- eMMC: high capacity; good for full images and chunk maps. Risk is internal translation layers and power-loss sensitivity → requires stronger two-phase commit + write gating near brownout.
- NAND: capacity-focused; bad blocks/ECC/translation layers increase the burden on “truth recovery” rules. Keep metadata/journal strategy conservative.
Example storage references (part numbers): Winbond W25Q128JV / W25Q256JV, Macronix MX25L128 / MX25U128, Micron MT25Q (family); eMMC examples: Micron MTFC… (family), Kioxia THG… (family).
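The "dual meta copies" and "truth recovery" ideas can be illustrated with a minimal sketch. It assumes a hypothetical record layout (sequence number, active slot, attempt count, CRC32) and ignores sequence wraparound; the function names are illustrative and not taken from any vendor SDK.

```python
import struct
import zlib
from typing import Optional

# Hypothetical layout: seq (u32), active slot (0 = A, 1 = B), attempt count,
# followed by a CRC32 computed over those three fields.
_BODY = struct.Struct("<IBB")
_CRC = struct.Struct("<I")


def encode_meta(seq: int, active_slot: int, attempts: int) -> bytes:
    body = _BODY.pack(seq, active_slot, attempts)
    return body + _CRC.pack(zlib.crc32(body))


def decode_meta(raw: bytes) -> Optional[tuple]:
    """Return (seq, active_slot, attempts), or None if the copy is torn or corrupt."""
    if len(raw) != _BODY.size + _CRC.size:
        return None
    body, crc = raw[:_BODY.size], raw[_BODY.size:]
    if _CRC.unpack(crc)[0] != zlib.crc32(body):
        return None
    return _BODY.unpack(body)


def select_meta(copy0: bytes, copy1: bytes) -> Optional[tuple]:
    """Pick the valid copy with the highest sequence number; None means no truth survives."""
    valid = [m for m in (decode_meta(copy0), decode_meta(copy1)) if m is not None]
    return max(valid, key=lambda m: m[0]) if valid else None
```

Writing the two copies one after the other, never both at once, means a power cut can tear at most one of them, so the selection stays deterministic. A `None` result would correspond to entering the rescue path.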
5) Supervisors & power monitoring (device-side): brownout reset, write gating, watchdog policy
- Brownout detect: capture brownout flags as evidence and block starting new erase/write when voltage is near the unsafe region.
- Write gating threshold: define a “no new writes” threshold (and respect it in updater state machine) to protect metadata truth.
- Watchdog + boot-test window: a bounded window prevents endless pending states; exceed window → rollback/rescue deterministically.
Example supervisor/monitor references (part numbers): TI TPS3431, TI TPS3850, Maxim/ADI MAX16054; voltage monitor examples: TI TPS3703, Microchip MCP1316 (family).
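The "no new writes" threshold can be modeled as a small gate that the updater consults before starting any erase or write. This is a sketch under assumptions: the millivolt thresholds are placeholders, and the hysteresis band (`resume_mv`) is an added design choice that the text above does not prescribe.

```python
from dataclasses import dataclass


@dataclass
class WriteGate:
    # Hypothetical thresholds in millivolts; real values depend on the storage
    # datasheet and the measured supply hold-up.
    no_new_writes_mv: int
    resume_mv: int  # assumed to be higher than no_new_writes_mv (hysteresis)
    blocked: bool = False

    def allow_new_write(self, supply_mv: int) -> bool:
        """Return True only if a new erase/write may start at this supply level."""
        if supply_mv < self.no_new_writes_mv:
            self.blocked = True
        elif supply_mv >= self.resume_mv:
            self.blocked = False
        # Between the two thresholds the previous decision is kept.
        return not self.blocked
```

The gate only decides whether a new operation may start; a write that is already in flight still has to complete within the available hold-up time.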
6) Three tier recipes (5-line capability table each) — entry / industrial / high-security
| Tier | MCU/SoC | Trust anchor | Storage | Supervision |
|---|---|---|---|---|
| Entry (cost-focused) | NXP LPC55S69 / ST STM32U585 / Renesas RA6M5; ROM verify + OTP anchor + ECC/SHA accel preferred | MCU OTP/eFuse pins root public-key hash + KeyID list; no external SE; keep policy minimal but strict | QSPI NOR: Winbond W25Q128JV; A/B + meta journal on NOR | WDT: TI TPS3431 + monitor: TI TPS3703; boot-test window + brownout write gating |
| Industrial (field resilience) | ST STM32H573 / NXP LPC55S69; stronger isolation + faster verify reduces brick window | Secure Element: ATECC608B / SE050 / STSAFE-A110; monotonic counter + key isolation + revoke anchors | Larger NOR (e.g., W25Q256JV) or eMMC (Micron MTFC…); chunk map + staged write + strict commit | TI TPS3850 + WDT policy; deterministic rollback, brownout-safe metadata |
| High-security (high threat) | Security-oriented platform (MCU/SoC with strong isolation domain); policy separation + hardened verify path | SE + optional TPM: Infineon OPTIGA Trust M (SLS32AIA010MS) + SLB 9670/9672 (optional); strict revoke, stronger attestation hooks (device-side) | eMMC (Micron MTFC… / Kioxia THG…); two-phase commit + journal + dual meta copies | Supervisor + monitor combo (e.g., MAX16054 + TPS3703); tight boot-test enforcement + write gating |
Each tier still requires the same “non-negotiables”: verified boot, anti-rollback floor, atomic metadata, bounded attempts, and minimum evidence events.
The selection logic is device-side: the goal is deterministic verification, atomic commits under power dips, and controlled recovery—without relying on backend orchestration.
FAQs (12) — device-side secure OTA, answers + mappings
Each answer stays within the Secure OTA Module boundary: signature verification, anti-rollback, manifest correctness, encryption-at-rest, power-loss atomicity, recovery rules, evidence logs, and validation matrices.
Q1. Why can an “approved signature” still allow downgrading to old firmware?
Signature verification proves authenticity, not freshness. Anti-rollback requires a device-side floor enforced by a monotonic value (OTP/eFuse version, Secure Element counter, or a protected counter with tamper detection). The boot policy must reject any candidate image with version < floor, even if the signature verifies. Floor updates must be atomic and survive power loss.
Maps to: H2-5
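The separation between authenticity and freshness in Q1 can be sketched as two independent gates. The class below is a model only: in a real device the floor would live in OTP/eFuse or a Secure Element counter, and raising the floor only after a confirmed boot is an assumed policy, not one the answer mandates.

```python
class RollbackFloor:
    """Models a monotonic version floor (hypothetical helper, not a real API)."""

    def __init__(self, floor: int = 0) -> None:
        self._floor = floor

    @property
    def value(self) -> int:
        return self._floor

    def accepts(self, signature_ok: bool, version: int) -> bool:
        # A valid signature alone is not enough; the version must also be fresh.
        return signature_ok and version >= self._floor

    def raise_to(self, confirmed_version: int) -> None:
        # The floor can only move up, mirroring a one-way counter or fuse bank.
        self._floor = max(self._floor, confirmed_version)
```

For example, with a floor of 5, a correctly signed version 4 image is rejected while a correctly signed version 5 image is accepted.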
Q2. What if both A/B slots are corrupted—where should a rescue image live?
A reliable “last life” should sit outside the normal updatable slots: a ROM-resident minimal rescue (ideal), or a protected rescue region that is verified by the immutable boot root and written rarely. The rescue path must still enforce signature and rollback-floor rules, then re-provision a known-good slot. Avoid placing rescue in the same metadata domain that can be atomically compromised.
Maps to: H2-9
Q3. During updates, where does power loss brick devices most often, and how does two-phase commit prevent it?
The highest-risk moments are metadata “truth changes”: setting a pending boot flag, switching active slot pointers, or updating rollback counters. Two-phase commit avoids bricking by separating write/verify from activation: write to staging/inactive slot, verify hashes and signature, then atomically set PENDING with journaled metadata. Only after a successful boot-test does CONFIRMED get committed.
Maps to: H2-8, H2-6
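The two-phase commit in Q3 can be sketched as a transition table in which activation is unreachable without a prior verify step. PENDING and CONFIRMED come from the answer above; the other state and event names are hypothetical labels chosen for this illustration.

```python
from enum import Enum


class SlotState(Enum):
    EMPTY = "EMPTY"
    STAGED = "STAGED"
    VERIFIED = "VERIFIED"
    PENDING = "PENDING"
    CONFIRMED = "CONFIRMED"
    ROLLED_BACK = "ROLLED_BACK"


# Phase 1 (write + verify) and phase 2 (activate + confirm) are separate edges.
ALLOWED = {
    (SlotState.EMPTY, "write_done"): SlotState.STAGED,
    (SlotState.STAGED, "verify_ok"): SlotState.VERIFIED,
    (SlotState.STAGED, "verify_fail"): SlotState.EMPTY,
    (SlotState.VERIFIED, "commit_pending"): SlotState.PENDING,
    (SlotState.PENDING, "boot_test_ok"): SlotState.CONFIRMED,
    (SlotState.PENDING, "boot_test_fail"): SlotState.ROLLED_BACK,
}


def step(state: SlotState, event: str) -> SlotState:
    """Apply one event; any transition not in the table is refused."""
    try:
        return ALLOWED[(state, event)]
    except KeyError:
        raise ValueError(f"illegal transition: {state.name} on {event}") from None
```

Because every state change is a single table lookup, each one maps naturally onto one journaled metadata record.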
Q4. Which manifest fields are mandatory, which are optional, and what breaks if they are missing?
Mandatory fields usually include image hash(es), signature, version/build number, target component/slot, and hardware compatibility constraints. Without compatibility and targeting, devices can install the wrong image and fail boot-test. Optional fields include dependencies, compression, and delta parameters. If chunk maps are missing in interrupted environments, resume becomes unsafe or inefficient, increasing brick risk during retries.
Maps to: H2-6
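A mandatory-field gate for Q4 could look like the sketch below. The field names are hypothetical; the answer lists categories of information, not a concrete schema.

```python
# Hypothetical field names standing in for: image hash, signature, version,
# target component/slot, and hardware compatibility constraints.
MANDATORY_FIELDS = ("image_hash", "signature", "version", "target_slot", "hw_compat")


def missing_mandatory(manifest: dict) -> list:
    """Return the mandatory fields that are absent or None, in a fixed order."""
    return [name for name in MANDATORY_FIELDS if manifest.get(name) is None]
```

An updater would reject the package before any write if this list is non-empty, so a malformed manifest never reaches the staging phase.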
Q5. After key revocation, how can “old keys still work,” and what is the safest device-side fix?
Old keys remain effective when the device only checks “signature valid” but does not bind acceptance to an allowlist and a revocation state. The safest device-side fix is to anchor a KeyID allowlist (or a small revocation bitmap) in immutable/secure storage, and to require that the signing KeyID is both present and not revoked. Pair revocation with an updated rollback floor to block old signed packages permanently.
Maps to: H2-4, H2-5
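The allowlist-plus-revocation rule in Q5 can be sketched with a set and a bitmap. Modeling revocation as bits that can only be set is an assumption that mirrors eFuse-style storage; the function names are illustrative.

```python
def key_accepted(key_id: int, allowlist: frozenset, revoked_bitmap: int) -> bool:
    """A signing key is usable only if it is allowlisted and its revoke bit is clear."""
    if key_id not in allowlist:
        return False
    return (revoked_bitmap >> key_id) & 1 == 0


def revoke(revoked_bitmap: int, key_id: int) -> int:
    """Set the revoke bit for key_id; bits are never cleared in this model."""
    return revoked_bitmap | (1 << key_id)
```

This check runs in addition to signature verification. Without an accompanying floor update, packages that were signed before revocation by a still-valid key remain installable, which is why the answer pairs the two.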
Q6. Does storage encryption make rollback/recovery harder, and how can designs stay secure and recoverable?
Encryption-at-rest can complicate recovery if the rescue path cannot access keys or if metadata becomes unrecoverable after brownouts. A practical pattern is to encrypt secrets/config strongly while keeping boot-critical metadata minimal and journaled. If rescue must read encrypted images, ensure the trust anchor can unwrap keys in rescue mode. Avoid designs where encryption keys depend on mutable state that can be lost mid-update.
Maps to: H2-7, H2-9
Q7. Delta updates save bandwidth—what new security and reliability pitfalls appear?
Delta updates increase complexity because correctness depends on the exact base version and patch application order. Security pitfalls include inadequate verification granularity (only verifying the final image) and replaying a delta against an unexpected base. Reliability pitfalls include harder recovery after interruptions and increased write amplification. A robust approach verifies chunk-level hashes, validates base-version binding, and supports safe resume with deterministic commit rules.
Maps to: H2-6
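The two checks named in Q7, base-version binding and chunk-level verification, can be sketched with SHA-256 from the standard library. The helper names and the idea of carrying the expected base hash in the manifest are assumptions made for illustration.

```python
import hashlib
from typing import Optional, Sequence


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def delta_applicable(installed_image: bytes, expected_base_hash: str) -> bool:
    """A delta may only be applied to the exact base image it was built against."""
    return sha256_hex(installed_image) == expected_base_hash


def first_bad_chunk(chunks: Sequence[bytes], expected: Sequence[str]) -> Optional[int]:
    """Return the index of the first chunk whose hash mismatches, or None if all match."""
    if len(chunks) != len(expected):
        raise ValueError("chunk count does not match manifest")
    for index, (chunk, digest) in enumerate(zip(chunks, expected)):
        if sha256_hex(chunk) != digest:
            return index
    return None
```

Knowing the first bad chunk index is what makes resume safe: the updater can restart from that chunk instead of trusting everything written before the interruption.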
Q8. In frequent power-loss environments, how can the minimum PLP window be estimated for atomic metadata commits?
The minimum PLP window is driven by the worst-case “atomic commit payload”: journal record write + CRC/validation + pointer/epoch update (often duplicated). Estimate the maximum write latency of that payload on the target storage (NOR page program plus any required sector erase, eMMC page + internal housekeeping), then add margin for brownout detection and firmware response time. Gate new writes near the brownout threshold so commits never start in unsafe voltage regions.
Maps to: H2-8
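The estimate in Q8 reduces to a sum of worst-case latencies with a safety margin. The sketch below assumes the whole commit payload is duplicated `copies` times and applies the margin to the total; both are modeling choices, and every number passed in should come from datasheet worst-case figures or measurement, not from this example.

```python
def min_plp_window_us(journal_write_us: float,
                      pointer_update_us: float,
                      copies: int,
                      detect_us: float,
                      response_us: float,
                      margin: float = 0.25) -> float:
    """Estimate the minimum hold-up window, in microseconds, for one atomic commit.

    All inputs are worst-case latencies; margin is a fractional safety factor.
    """
    commit_us = copies * (journal_write_us + pointer_update_us)
    return (commit_us + detect_us + response_us) * (1.0 + margin)
```

With purely illustrative inputs of 800 µs journal write, 200 µs pointer update, 2 copies, 50 µs detection, and 150 µs firmware response, the result is (2 × 1000 + 200) × 1.25 = 2750 µs.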
Q9. If Secure Boot fails, should devices refuse to boot or enter a safe mode—how to choose without false positives?
Refusing to boot is correct when authenticity cannot be established (signature/policy failure). A controlled safe/rescue mode is appropriate when the system can still enforce trust rules but needs a recovery path (e.g., slot corruption or boot-test failure). The policy should be deterministic: signature/policy failures → REJECT/RESCUE; boot-test failures → bounded retries then ROLLBACK/RESCUE. Supervisors/watchdogs help avoid endless loops.
Maps to: H2-3, H2-9
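The deterministic policy in Q9 can be written as a single pure function. This is one possible reading of "REJECT/RESCUE" and "ROLLBACK/RESCUE": the sketch assumes the second option applies when no known-good slot is available, and the parameter names are hypothetical.

```python
from enum import Enum


class Action(Enum):
    CONFIRM = "CONFIRM"
    RETRY = "RETRY"
    ROLLBACK = "ROLLBACK"
    RESCUE = "RESCUE"
    REJECT = "REJECT"


def decide(trust_ok: bool, boot_test_ok: bool, attempt_count: int,
           max_attempts: int, known_good_available: bool) -> Action:
    """Map the boot outcome to exactly one action; no input combination is undefined."""
    if not trust_ok:
        # Signature or policy failure: never execute the candidate image.
        return Action.REJECT if known_good_available else Action.RESCUE
    if boot_test_ok:
        return Action.CONFIRM
    if attempt_count < max_attempts:
        return Action.RETRY
    return Action.ROLLBACK if known_good_available else Action.RESCUE
```

Because the function has no hidden state, the same evidence always yields the same action, which is what makes field behavior reproducible in the lab.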
Q10. When OTA fails, what evidence should be checked first, and which 5 log fields narrow root cause fastest?
Start with the updater state machine step, then separate crypto vs storage vs power. The five fields that narrow root cause fastest are: (1) last state-transition step, (2) signature verification result/error code, (3) manifest parse/hash mismatch code, (4) reboot reason + brownout flags, and (5) slot state snapshot (active/next/attempt_count) plus rollback reason and floor/counter values (appropriately redacted).
Maps to: H2-10
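The evidence fields in Q10 can be captured in one fixed record with a first-pass classifier. The field names, the "0 means OK" code convention, and the precedence order (power, then crypto, then state) are assumptions for this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OtaEvidence:
    last_state_step: str   # last updater state transition reached
    sig_verify_code: int   # 0 = OK (hypothetical convention)
    manifest_code: int     # 0 = OK, otherwise parse/hash mismatch code
    reboot_reason: str
    brownout_flag: bool
    active_slot: str
    next_slot: str
    attempt_count: int


def classify(evidence: OtaEvidence) -> str:
    """Return a coarse failure class: 'power', 'crypto', or 'state'."""
    if evidence.brownout_flag:
        return "power"
    if evidence.sig_verify_code != 0 or evidence.manifest_code != 0:
        return "crypto"
    return "state"
```

Checking the brownout flag first reflects the idea that a power event can cause secondary crypto or state errors, so it is the more likely root cause when both appear.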
Q11. How can “real security” be demonstrated—what is the smallest test matrix that avoids blind spots?
A minimum matrix crosses (A) phases (download/verify/staging/meta/boot-test/confirm) with (B) negative cases (bad signature, tampered manifest, replay, rollback below floor) and (C) power interruption points across those phases. Pass criteria are deterministic outcomes: REJECT for policy failures, ROLLBACK for boot-test failures, and RESCUE when recovery is required. After any interruption, at least one known-good CONFIRMED boot path must remain available.
Maps to: H2-11
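The matrix in Q11 can be enumerated mechanically so that no combination is skipped. The sketch below assumes every negative case is tried at every phase and that a power cut is one additional fault per phase; the dictionary keys and outcome labels are illustrative.

```python
from itertools import product

PHASES = ("download", "verify", "staging", "meta", "boot-test", "confirm")
NEGATIVE_CASES = ("bad_signature", "tampered_manifest", "replay", "rollback_below_floor")


def build_matrix() -> list:
    """Enumerate phase x fault combinations with the expected deterministic outcome."""
    cases = [
        {"phase": phase, "fault": fault, "expect": "REJECT"}
        for phase, fault in product(PHASES, NEGATIVE_CASES)
    ]
    cases += [
        {"phase": phase, "fault": "power_cut", "expect": "known_good_boot_path"}
        for phase in PHASES
    ]
    return cases
```

Six phases crossed with four negative cases give 24 entries, plus six power-cut entries, for 30 test cases in total.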
Q12. Without a Secure Element, can security be “good enough,” and what must the MCU/SoC provide?
Many systems can be robust without a Secure Element if the MCU/SoC provides an immutable verify point, a strong secure storage anchor (OTP/eFuse), fast signature verification, isolation primitives (MPU/TrustZone), and a reliable anti-rollback floor mechanism. The risk increases when physical access is likely or when monotonic counters and hard revocation must be guaranteed under power loss. In those cases, a Secure Element materially improves device-side guarantees.
Maps to: H2-12, H2-4