Secure OTA Module for IoT Devices: Secure Boot & Safe Updates

A Secure OTA Module is a device-side closed loop that ensures every firmware update is authentic (signed), fresh (anti-rollback), and non-bricking under power loss through A/B slots + atomic commit, with deterministic recovery and minimum evidence logs for fast field debugging.

H2-1 · Boundary

Boundary definition: what a Secure OTA Module is (and is not)

The goal is to lock the page scope early, so the design discussion stays on device-side update safety instead of drifting into cloud platforms, gateways, or fleet operations.

Working definition (device-side capability set)

A Secure OTA Module is the combination of: Secure Boot + Signed Update + Power-loss Safety (Atomicity) + A/B Recovery & Rescue.

  • Secure Boot: verifies the next-stage code before execution (chain-of-trust).
  • Signed Update: verifies the update package before commit (signature + hash + manifest).
  • Atomicity / PLP: ensures unexpected power loss cannot corrupt boot-critical metadata.
  • A/B & Rescue: guarantees a path back to a known-good image after failure.
Scope boundaries vs sibling pages (mention only, do not expand)
  • Edge Security Probe: focuses on identity/attestation/forensics; this page only keeps update-related evidence and minimal audit.
  • Edge Power & Backup: focuses on system hold-up and power architecture; this page only defines write/commit atomicity requirements under power loss.

If a paragraph requires backend rollout policies, gateway aggregation, or management-console workflows, it is out of scope.

Three-question decision gate (fast “do you need it?” check)
Question (field reality) | Why it changes the design | Recommended level
Remote updates are required? | Assume update content can be tampered/replayed. Strong signature + integrity gates become mandatory. | Industrial · High-security
Physical access is realistic? | Assume local storage can be copied/rewritten. Anti-rollback + protected trust anchor are required. | Industrial · High-security
Frequent power loss / brownouts? | Assume mid-write interruption. A/B + atomic metadata is required to avoid “brick-on-update”. | Basic · Industrial

A Secure OTA Module is “done” only when each capability has observable evidence: verify results, version/counter state, slot state transitions, and power-loss-safe commit behavior.

Figure F1 — Secure OTA boundary (4 capability blocks, evidence-driven)
[Figure: four device-side capability blocks. Secure Boot (verify-before-exec, fail → rescue); Signed Update (verify-before-commit, manifest + hash); PLP / Atomicity (atomic metadata, power-loss safe); A/B & Rescue (two slots + confirm, rollback reason). Evidence-driven scope: verify results · version/counter · slot state · reboot reason.]

H2-2 · Threat Model

Threat model & measurable security goals: what must be prevented

“Security” becomes actionable only when each goal maps to a control point and a minimum evidence set. This chapter defines the OTA-relevant attack surfaces and the device-side goals that can be verified in the field.

OTA-relevant attack surfaces (device-centric, no backend deep dive)
  • Supply chain / manufacturing: trust anchor or key material is not unique or is substituted.
  • Transport: update package is replaced, replayed, or partially corrupted in transit.
  • Local storage: offline rewrite of flash/eMMC, forced downgrade, metadata tampering.
  • Boot chain: bootloader stage is modified to bypass verification.
  • Debug interface: post-production write access remains possible.
  • Power anomalies (brownout): a mid-write interruption leaves storage in an inconsistent state; in extreme cases an undervoltage glitch can also corrupt instruction execution, such as a verification branch.

Any paragraph that requires fleet rollout policies, gateway routing, or cloud console workflows is out of scope.

Five measurable goals (each must produce evidence)
Goal | Device-side control point | Minimum evidence (field-debug friendly)
Authenticity | Signature verification at boot and pre-commit stages (trusted key anchor). | verify-result code, key ID, verified stage ID (ROM/BL/update).
Integrity | Hash validation during download and after write (chunk-hash and/or full-image hash). | hash mismatch counters, final image hash, manifest hash result.
Freshness | Anti-rollback gate (monotonic counter or version floor) enforced before boot/commit. | counter value, version floor, rollback reason (reject/update/boot).
Recoverability | A/B slots + confirm mechanism; failure routes to known-good slot/rescue. | slot state (pending/confirmed), boot attempt count, fallback path taken.
Auditability | Minimal event log with append-only intent (local), covering state transitions and reboot causes. | event sequence, state transitions, reboot reason flags (incl. brownout).
Chapter map (where each goal is implemented)
  • Authenticity → Secure Boot + Signed Update pipeline (next chapters).
  • Integrity → Manifest + hash strategy (next chapters).
  • Freshness → Anti-rollback (next chapters).
  • Recoverability → A/B + rescue strategy (later chapters).
  • Auditability → Evidence chain & logs (later chapters).
Figure F2 — Threats → controls → minimum evidence (device-side)
[Figure: three columns. Threats (supply chain, transport, local storage, boot chain, debug access, power loss) → Controls (signature verify at boot + pre-commit, hash verify per chunk + image, anti-rollback counter / floor, A/B confirm with fallback path, atomic metadata via journal / COW) → Evidence (verify code with key ID and stage, hash result and mismatch count, counter value and rollback reason, slot state pending/ok, reboot reason with brownout flag). If evidence cannot be collected, the goal is not implemented.]

The next chapters will implement these goals with concrete mechanisms (verify points, manifest fields, anti-rollback gates, A/B confirmation, and power-loss-safe commits). Backend rollout policies remain out of scope for this page.

H2-3 · Chain of Trust

Chain of Trust: verified boot from ROM to application

Verified boot is a hard gate: code must be authenticated before execution. Measured boot can improve auditability, but it does not replace verified boot for minimum “do-not-run-unknown-code” guarantees.

Verified boot vs measured boot (practical boundary)
  • Verified boot: verification blocks execution on failure (signature/hash/version gates).
  • Measured boot: measurements are recorded for later decisions; it can support auditing but does not prevent execution by itself.
Goal: do-not-run · Evidence: verify code · Evidence: slot state
Boot chain layers (each stage has one responsibility + one verify point)
Stage | Loads / executes | Verify point (minimum) | Failure outcome
ROM | Loads 1st stage from internal flash / external memory. | Checks trusted anchor (public key hash / key index) and verifies 1st stage. | Hard fail or enter rescue entry (if designed).
1st stage | Initializes minimal clocks/memory and loads bootloader. | Verifies bootloader signature; applies version gate hook (anti-rollback input). | Fallback to rescue slot / fail-safe mode (no unverified jump).
Bootloader | Selects A/B slot and loads OS/app image + manifest. | Verifies manifest + hash + signature; enforces freshness checks before commit/boot. | Switch slot; if both invalid → rescue image.
OS/App | Runs post-boot health checks (for confirmation). | Optional: confirms “known-good” after self-test (supports A/B confirm later). | On failure: watchdog/reboot triggers rollback policy.
Key design decisions (engineered tradeoffs)
  • Trust anchor placement: ROM/OTP anchors maximize tamper resistance; a secure element can add flexibility (rotation/counters) but adds BOM and integration surface.
  • Signature choice: prioritize verification time, code footprint, and hardware acceleration so boot remains within the reset/watchdog budget.
  • Failure policy: “refuse to boot” is correct for revoked/rollback attempts; “rescue slot” is appropriate for corruption or interrupted writes.
Minimal state machine (for later A/B recovery alignment)
  • PASS → jump to next stage; record verify code and stage ID.
  • FAIL → do not execute; select alternate slot (A↔B) if available.
  • Both invalid → enter rescue image; expose a deterministic error reason (signature/hash/version/metadata).

Field evidence should always include: verify-result code, selected slot, and reboot reason (including brownout flags).
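
The minimal state machine above can be modelled on a host before it is ported to a bootloader. The following is a sketch only: the `VerifyCode` values, the `BootDecision` record, and the `select_slot` function are hypothetical names introduced for illustration, and the per-slot verify results are assumed to come from the verify points described in the table.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, Optional


class VerifyCode(Enum):
    # Hypothetical reason codes; they mirror the error classes named above.
    OK = 0
    BAD_SIGNATURE = 1
    BAD_HASH = 2
    BAD_VERSION = 3
    BAD_METADATA = 4


@dataclass
class BootDecision:
    action: str                      # "boot" or "rescue"
    slot: Optional[str]              # slot that will be executed, if any
    reasons: Dict[str, VerifyCode]   # verify code per inspected slot (field evidence)


def select_slot(preferred: str, verify: Dict[str, VerifyCode]) -> BootDecision:
    """Try the preferred slot, then the alternate; never jump to a failed slot."""
    alternate = "B" if preferred == "A" else "A"
    reasons: Dict[str, VerifyCode] = {}
    for slot in (preferred, alternate):
        code = verify[slot]
        reasons[slot] = code
        if code is VerifyCode.OK:
            return BootDecision("boot", slot, reasons)
    # Both slots failed verification: expose the reasons and enter rescue.
    return BootDecision("rescue", None, reasons)
```

Calling `select_slot("A", {"A": VerifyCode.BAD_HASH, "B": VerifyCode.OK})` boots slot B and keeps the slot A failure code as evidence; when both slots fail, the decision is rescue with both reasons recorded.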

Figure F1 — Chain of Trust & verify points (minimal labels, box-diagram style)
[Figure: verified boot chain ROM → 1st stage → Bootloader → OS/App. Each stage: load → verify → version gate → jump; any FAIL routes to rescue or an A↔B switch, with no unverified jump. Evidence: verify code · stage ID · slot state · reboot reason.]

The next chapter will anchor the trust chain with concrete key storage and revocation rules so “verified boot” remains enforceable across device lifetime.

H2-4 · Keys

Signatures & keys: storing the trust root and enforcing revocation on-device

A secure boot chain is only as strong as the trust anchor and the acceptance rules. This chapter defines device-side key hierarchy, key IDs, and a practical revocation model that remains enforceable without backend deep dives.

Device-side key hierarchy (minimum model)
  • Root public key (anchor): immutable or strongly protected; defines “who is trusted to sign”.
  • Intermediate key (optional): enables rotation without replacing the anchor; can be used to sign image-signing keys.
  • Image signing key: signs firmware images/manifests; private key is never stored on the device.

The device verifies a chain of signatures and then enforces policy: allowed key IDs, revoked key IDs, and freshness gates (anti-rollback comes next).

Rotation & revocation (on-device data structures and rules)
Element | What it stores | Enforcement rule (device-side)
Key ID | Compact identifier bound to a public key (or public key hash). | Verification output must include Key ID for audit and policy decisions.
Allow list | Set of acceptable Key IDs + public key hashes (can be versioned). | If Key ID not in allow list → reject (no fallback to “unknown but valid”).
Revoke list | Set of revoked Key IDs (optionally tagged with “revoked since version”). | Revoke list has higher priority than allow list; if revoked → reject even if signature verifies.
Acceptance policy | Decision order: revoked? allowed? version gate? integrity? | Reject on revoked/rollback attempts; allow rescue only for corruption/power-loss cases.
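
The decision order in the table can be written as a single policy function. This is a sketch under assumptions: `Decision`, `accept_image`, and the integer key IDs are hypothetical, and `integrity_ok` stands for the signature and hash results that the verify engine is assumed to have produced before the policy runs.

```python
from enum import Enum
from typing import Set


class Decision(Enum):
    # Hypothetical decision codes, one per gate in the table's decision order.
    ACCEPT = "accept"
    REJECT_REVOKED = "reject_revoked"
    REJECT_UNKNOWN_KEY = "reject_unknown_key"
    REJECT_ROLLBACK = "reject_rollback"
    REJECT_INTEGRITY = "reject_integrity"


def accept_image(key_id: int, build: int, floor: int, integrity_ok: bool,
                 allow: Set[int], revoke: Set[int]) -> Decision:
    """Apply the gates in the order: revoked? allowed? version gate? integrity?"""
    if key_id in revoke:
        # Revocation wins even when the key is also on the allow list.
        return Decision.REJECT_REVOKED
    if key_id not in allow:
        # No fallback to "unknown but valid".
        return Decision.REJECT_UNKNOWN_KEY
    if build < floor:
        return Decision.REJECT_ROLLBACK
    if not integrity_ok:
        return Decision.REJECT_INTEGRITY
    return Decision.ACCEPT
```

A key that appears in both lists is rejected as revoked, which is the priority rule stated for the revoke list.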
Where to store the trust anchor (ROM / OTP / secure element)
  • ROM anchor: strongest immutability; minimal flexibility. Best for high-integrity boot roots.
  • OTP/eFuse anchor: strong protection with limited update patterns (e.g., key index, hash slots).
  • Secure element: strong isolation and useful features (counters, protected keys); adds integration cost and supply considerations.
  • Plain flash: highest flexibility but highest tamper risk; requires additional protections and is not suitable as a sole root anchor.

Anti-rollback complements revocation: revocation answers “who may sign”, freshness answers “is it new enough”.

Debug interface policy (dev vs production without foot-guns)
  • Dev mode: allows debug for bring-up; must be visibly and logically separated from production trust (test keys, reduced restrictions).
  • Production mode: debug access is locked or constrained; secure boot is mandatory; only production keys are accepted.
  • Hard rule: test keys and relaxed gates must not remain reachable in production acceptance policy.
Figure F2 — Key hierarchy, storage locations, allow/revoke enforcement, and mode gates
[Figure: key hierarchy (root public key → optional intermediate → image signing key, identified by KeyID); anchor storage options (ROM, OTP, SE, Flash); policy and enforcement (allow KeyIDs, revoke priority, verify engine: Sig · Hash · Gate) with DEV/PROD mode gates. Revoke overrides allow · KeyID must be logged · PROD must reject test paths.]

The next chapter (anti-rollback) completes freshness enforcement so a valid signature cannot be used to downgrade to an older, vulnerable image.

H2-5 · Anti-rollback

Anti-rollback: versions, counters, and enforceable freshness

A valid signature only proves origin and integrity. Freshness proves the image is not an older, revoked, or below-floor build. Anti-rollback must be enforced as a hard gate before execution and before committing an update.

Freshness gate (minimum rule)
  • Gate condition: candidate_build ≥ device_floor (monotonic integer compare).
  • Where enforced: at boot (ROM/1st stage/bootloader) and at update commit (before setting “pending”).
  • What is recorded: device_floor, candidate_build, decision_code, selected_slot.
Evidence: floor · Evidence: candidate build · Evidence: decision code · Evidence: slot
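
As an illustration of the gate and its evidence record, the sketch below compares monotonic integers and returns the four recorded fields together with the decision. The names `FreshnessEvidence`, `freshness_gate`, and `floor_after_confirm` are hypothetical, and raising the floor only after a confirmed boot is one possible policy, not a requirement stated here.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class FreshnessEvidence:
    device_floor: int
    candidate_build: int
    decision_code: str   # hypothetical codes: "PASS" / "REJECT_BELOW_FLOOR"
    selected_slot: str


def freshness_gate(candidate_build: int, device_floor: int,
                   selected_slot: str) -> Tuple[bool, FreshnessEvidence]:
    """Gate condition: candidate_build >= device_floor (plain integer compare)."""
    passed = candidate_build >= device_floor
    code = "PASS" if passed else "REJECT_BELOW_FLOOR"
    return passed, FreshnessEvidence(device_floor, candidate_build, code, selected_slot)


def floor_after_confirm(device_floor: int, confirmed_build: int) -> int:
    """Assumed policy: bump the floor only after confirmation; it never regresses."""
    return max(device_floor, confirmed_build)
```

Because the evidence record is produced on both the pass and the reject path, a field log can show why an image was refused without re-running the comparison.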
Three common implementations (practical tradeoffs)
Method | Strength | Common failure mode | Engineering mitigation
OTP/eFuse monotonic version | High tamper resistance; “only increases”. | Incorrect programming or wrong floor bump can block valid images. | Use staged floor bump (post-boot confirm), verify write success, and tie floor to signed manifest.
Secure element monotonic counter | High protection plus flexible features (counter/secure storage). | Integration/BOM complexity; counter access failures must fail-safe. | Define deterministic fallback (recovery only), cache read-only floor for boot, and log SE status codes.
Protected flash counter | Cost-effective but weaker against physical tampering. | Counter rollback via rewrite; power-loss during update can desync state. | Use redundancy (dual copies + voting), journal/COW updates, bind counter to signed metadata, and treat downgrade detection as recovery-only.
Version encoding (avoid ambiguous compares)
  • Recommended: use a monotonic build number (uint32/uint64) as the freshness key.
  • SemVer: suitable for display and compatibility labels, but not as the sole freshness gate, because comparison has edge cases (pre-release tags, build metadata, string versus numeric ordering).
  • Policy: gate on build number; optionally display SemVer separately as a human-facing string.
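
One ordering pitfall with version strings is easy to reproduce. In the sketch below, `naive_semver_is_newer` is a deliberately flawed, hypothetical helper that compares strings lexicographically, and `build_is_newer` is the integer compare recommended above; the build numbers 109 and 110 are made-up values.

```python
def naive_semver_is_newer(candidate: str, current: str) -> bool:
    """Deliberately flawed: compares version strings character by character."""
    return candidate > current


def build_is_newer(candidate_build: int, current_build: int) -> bool:
    """Monotonic build numbers compare unambiguously as integers."""
    return candidate_build > current_build


# "1.10.0" is the newer release, but the string compare stops at '1' < '9'.
assert naive_semver_is_newer("1.10.0", "1.9.0") is False
# The same two releases with assumed build numbers 110 and 109 compare correctly.
assert build_is_newer(110, 109) is True
```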
Failure policy + A/B interaction (non-negotiable rules)
  • Rollback detected (candidate_build < floor) → reject boot and enter recovery/rescue.
  • Known-good fallback allowed: switch to the other slot only if it is not below floor and not revoked.
  • Revoked overrides “known-good”: a revoked version must never become bootable again.

Freshness gate and key revocation gate must both pass. One cannot compensate for the other.

Figure F3 — Version state + A/B decision matrix (freshness + revocation gates)
[Figure: anti-rollback gate and A/B decision matrix. Version state: candidate_build >= floor → PASS (continue); candidate_build < floor → REJECT (recovery); key revoked or version revoked → REJECT (recovery). A/B decision (simplified): boot an OK slot only if it also passes floor + revoke; downgrade or revoked → recovery. Evidence: floor · candidate_build · revoke_hit · slot_selected · reason_code.]

Freshness enforcement becomes actionable only when the update package carries unambiguous version fields and signed metadata—defined next.

H2-6 · Manifest

Firmware images & manifest: minimum fields for verification and recovery

Treat the OTA package as an engineering interface: a signed manifest that drives verification, storage writes, commit points, and recovery—without relying on transport-protocol details.

Manifest is the single source of truth
  • Verify manifest first, then trust payload. All critical fields must be covered by the manifest signature.
  • Manifest drives the state machine: download → verify → write → re-verify → commit → reboot → confirm.
  • Evidence is mandatory: each step produces a small, stable record (hash/sig/slot/code) for field debugging.
Minimum manifest fields (grouped by purpose)
Group | Field | Why it exists (device-side)
Identity & integrity | image_hash (final), signature, signer_key_id | Proves origin and the exact bytes that must be installed and booted.
Freshness & policy | build_number (monotonic), hw_compat (board/SoC ID) | Prevents downgrade and prevents installing an image built for different hardware.
Install target | slot_target (A/B), image_type (BL/OS/App), dependencies | Ensures the correct partition/slot is written and version-coupled components stay compatible.
Chunking & recovery | chunk_map (offset/len/order), chunk_hashes, compression | Enables resume/partial verification and power-loss recovery without accepting reordered or replaced chunks.
Storage interface | payload_size, chunk_size, staging_hint | Allows deterministic write planning and prevents overrun/partial state confusion.

If chunk hashes exist, the chunk map and ordering must be signature-covered; final image hash should still be verified after write.
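
A minimal sketch of chunk-then-image verification follows. It assumes the manifest signature has already been verified, uses SHA-256 as the hash (the text does not name an algorithm), and carries only a subset of the fields from the table; `ChunkEntry`, `Manifest`, `verify_image`, and the error strings are hypothetical.

```python
import hashlib
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ChunkEntry:
    offset: int
    length: int
    sha256: bytes


@dataclass(frozen=True)
class Manifest:
    build_number: int
    payload_size: int
    image_hash: bytes             # hash of the final, fully written image
    chunk_map: List[ChunkEntry]   # order is part of the signed data


def verify_image(manifest: Manifest, image: bytes) -> str:
    """Check size, chunk order/coverage, chunk hashes, then the final image hash."""
    if len(image) != manifest.payload_size:
        return "ERR_SIZE"
    expected_offset = 0
    for entry in manifest.chunk_map:
        if entry.offset != expected_offset:
            return "ERR_CHUNK_ORDER"      # gap, overlap, or reordered map
        chunk = image[entry.offset:entry.offset + entry.length]
        if len(chunk) != entry.length or hashlib.sha256(chunk).digest() != entry.sha256:
            return "ERR_CHUNK_HASH"
        expected_offset += entry.length
    if expected_offset != manifest.payload_size:
        return "ERR_CHUNK_COVERAGE"       # map does not cover the whole payload
    if hashlib.sha256(image).digest() != manifest.image_hash:
        return "ERR_IMAGE_HASH"
    return "OK"
```

The final image hash is checked even when every chunk passes, which matches the rule that chunk hashes do not replace post-write verification.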

Full image vs delta (security and consistency costs)
  • Full image: simplest trust boundary; failure recovery is straightforward (retry download/write on inactive slot).
  • Delta: validation is harder; a “valid patch” does not guarantee a correct final image unless the final hash is checked.
  • Recommendation: prioritize full images for high reliability; if delta is used, require final-image verification and robust staging.
Hash structure (whole vs chunked) and anti-tamper requirements
  • Whole-image hash: simplest, but resume requires re-download or local caching.
  • Chunk hashes: enable resume and partial verification; require signed chunk map to prevent reordering/substitution.
  • Post-write rule: verify final image hash after storage writes before setting any boot flag.
Storage + commit interface (where devices typically fail)
  • Staging first: manifest and progress metadata must be written to a staging area that does not break the current bootable slot.
  • Write inactive slot: never overwrite the currently confirmed slot during normal OTA.
  • Commit point: only after verify passes → set boot flag to pending (atomic write).
  • Confirm point: after post-boot self-check → set confirmed; otherwise rollback per attempt counter.
Figure F4 — Update package flow (download→verify→write→commit→reboot→confirm) with evidence outputs
[Figure: pipeline Download → Verify (chunk hashes, signature) → Write (staging slot) → Verify (image hash) → Set pending → Reboot → Confirm (post-boot). Failure branches: if any step fails, stay on the confirmed slot and log code + stage; if the pending slot fails to boot, roll back to known-good while still enforcing floor + revoke. Evidence per step: manifest hash · sig result · chunk/image hash · slot state · error code.]

Later chapters can reuse this evidence stream to build deterministic field-debug workflows without exposing platform or transport details.

H2-7 · Storage encryption

Encrypted storage: partitioning and device-side key strategy

Encryption-at-rest is a storage policy, not a replacement for signature verification. Treat firmware, configuration, and secrets as different asset classes with different write patterns and recovery constraints.

Asset classes (what is protected and why)
Asset | Write pattern | Primary risks | Minimum device-side controls
Firmware (slots) | Rare, large writes | IP exposure, offline analysis, tampered images | Signed images + anti-rollback gate; encryption is optional and must not break recovery.
Config | Medium, frequent updates | Privacy leakage, policy manipulation, replay of old config | AEAD (encrypt + authenticate), versioning/journal, optional freshness rules for critical fields.
Secrets | Small objects | Key extraction, cloning, credential reuse | Per-device keys, key wrapping, secure storage policy, revocation/rotation readiness.
Firmware: signed · Config: AEAD · Secrets: wrap · Recovery-safe
Encryption-at-rest options (how to choose)
  • Full-disk: strongest coverage but increases boot-chain and rescue constraints; only suitable when early-stage decryption is guaranteed.
  • Partition encryption: practical default—encrypt Config and Secrets first; encrypt firmware slots only if recovery remains deterministic.
  • Secrets-only: lowest cost; acceptable only when config is not sensitive and does not enable high-impact policy manipulation.
Device-side keys: per-device root + key wrapping (minimal model)
  • Root: hardware-anchored device unique secret (OTP/eFuse/PUF/secure element) used as the trust anchor.
  • KEK: key-encryption key derived from the root; used to wrap data keys.
  • DEK: data-encryption key used for a specific partition/object (Config/Secrets, optional for firmware slots).
  • Rule: DEK must never be stored in plaintext; store wrapped_DEK + keyslot_id + policy.

Key rotation is a device-side rule: a new KeyID must be accepted only when authorized, and anything that depends on a revoked KeyID must become unbootable or unusable.

Nonce/IV and power-loss consistency (typical real-world failure modes)
  • AEAD requirement: encryption without authentication is insufficient for Config/Secrets.
  • Nonce/IV rule: never reuse a nonce with the same key; monotonic or random nonce must be stored consistently with the ciphertext.
  • Write amplification: hot config keys should be journaled as small records; avoid re-encrypting large regions on every write.
  • Crash consistency: nonce/counter state must be updated with journal/COW so a brownout cannot desynchronize metadata and ciphertext.
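
One common way to keep a counter-based nonce unique across brownouts is to reserve a block of values by persisting the block ceiling before any nonce in that block is used. This is a swapped-in technique for illustration, not something prescribed above. In the sketch, `NonceAllocator` is hypothetical and the `store` dictionary stands in for journaled persistent metadata.

```python
from typing import Dict


class NonceAllocator:
    """Hands out nonces from a block whose ceiling was persisted before first use."""

    def __init__(self, store: Dict[str, int], block: int = 16) -> None:
        self._store = store          # stand-in for journaled persistent metadata
        self._block = block
        # After a restart, resume at the persisted ceiling: nonces that were
        # reserved but unused are skipped, so no value is ever handed out twice.
        self._next = store.get("nonce_ceiling", 0)
        self._ceiling = self._next

    def next_nonce(self) -> int:
        if self._next >= self._ceiling:
            new_ceiling = self._ceiling + self._block
            # On a real device this must be an atomic, durable write that
            # completes before the nonce is used for encryption.
            self._store["nonce_ceiling"] = new_ceiling
            self._ceiling = new_ceiling
        value = self._next
        self._next += 1
        return value
```

The cost is a gap in the nonce sequence after each unclean restart, which is acceptable as long as the counter width leaves enough headroom.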
Security vs recoverability: what “rescue” can access
  • Rescue not encrypted (but signed): maximizes recoverability; expose only minimal functionality.
  • Rescue encrypted + signed: only safe when early-stage decryption is guaranteed and key access is deterministic under fault conditions.
  • Hard requirement: recovery entry must remain possible even when Config/Secrets are corrupted or decryption fails.
Figure F5 — Partition map + key wrapping ladder (firmware/config/secrets)
[Figure: partition map (BOOT / ROM: verify; SLOT A and SLOT B: signed, optionally encrypted; CONFIG: AEAD; SECRETS: wrapped; META + JOURNAL: atomic) beside the key ladder (device root in OTP / SE → KEK derived from root → unwrap DEK per partition / object; wrapped_DEK stored with KeyID slot). Rules: 1) Firmware stays signed 2) Config/Secrets use AEAD 3) No plaintext DEK 4) Recovery remains possible.]

Encryption policies must be compatible with crash consistency; nonce/metadata correctness becomes a first-class requirement under power loss.

H2-8 · PLP & atomic update

Power-loss-safe updates: A/B + atomic metadata + write gating

“Not bricking under power loss” is achieved by invariants: keep one confirmed bootable slot, make metadata updates recoverable, and gate writes as brownout approaches. PLP is treated as an interface requirement (rails/time/trigger), not a sizing exercise.

Power-loss risks (root causes → invariants)
  • Interrupted writes → inactive image corruption → invariant: never overwrite the confirmed slot during OTA.
  • Metadata corruption → unknown bootable slot → invariant: metadata must be journaled/COW with validation (CRC/version).
  • Counter/nonce desync → false rollback/AEAD failure → invariant: counters/nonces advance only with atomic commit records.
A/B + two-phase commit (state bits that must exist)
  • slot_state: CONFIRMED / PENDING / INVALID
  • attempt_count: limits repeated boot trials of a pending slot
  • active_slot + next_slot: deterministic selection source of truth
  • confirm flag: written only after post-boot health checks
  • floor: monotonic freshness threshold (must not regress)
Metadata journal / copy-on-write (atomicity in practice)
  • Dual copy + CRC + generation: read the newest valid record; write the other copy on updates.
  • Append-only journal: write small, ordered records; recover by replaying to the last valid entry.
  • Boot selection rule: if pending metadata is inconsistent, fall back to the last confirmed slot and enter recovery policy.
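
The dual-copy scheme can be sketched with a fixed-size record protected by CRC-32. Everything here is an assumption made for illustration: the record layout, the field widths, and the function names are hypothetical, and generation wraparound is ignored.

```python
import struct
import zlib
from typing import List, Optional, Tuple

# Hypothetical layout: generation, active_slot, slot_state, attempt_count, pad, floor
_BODY = struct.Struct("<IBBBxI")
_CRC = struct.Struct("<I")


def encode(generation: int, active_slot: int, slot_state: int,
           attempt_count: int, floor: int) -> bytes:
    body = _BODY.pack(generation, active_slot, slot_state, attempt_count, floor)
    return body + _CRC.pack(zlib.crc32(body))


def decode(raw: bytes) -> Optional[Tuple[int, int, int, int, int]]:
    """Return the fields, or None if the record is truncated or fails its CRC."""
    if len(raw) != _BODY.size + _CRC.size:
        return None
    body = raw[:_BODY.size]
    (crc,) = _CRC.unpack(raw[_BODY.size:])
    if zlib.crc32(body) != crc:
        return None
    return _BODY.unpack(body)


def load_newest(copies: List[bytes]):
    """Return (index, fields) of the valid copy with the highest generation."""
    best = None
    for index, raw in enumerate(copies):
        fields = decode(raw)
        if fields is not None and (best is None or fields[0] > best[1][0]):
            best = (index, fields)
    return best


def store_update(copies: List[bytes], active_slot: int, slot_state: int,
                 attempt_count: int, floor: int) -> int:
    """Write the next generation into the copy that is NOT the newest valid one."""
    newest = load_newest(copies)
    generation = 1 if newest is None else newest[1][0] + 1
    target = 0 if newest is None else 1 - newest[0]
    copies[target] = encode(generation, active_slot, slot_state, attempt_count, floor)
    return target
```

If power is lost while `copies[target]` is being written, the torn record fails its CRC and `load_newest` falls back to the previous generation, so the last consistent state stays readable.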
Power-loss window (interface metrics, not component sizing)
  • Commit bytes: worst-case bytes written for one state transition (meta + journal + counters).
  • Commit latency: worst-case time for flush/verify of commit bytes (includes erase/write/flush behavior).
  • Brownout threshold: below this, new erase/write must stop; only allow safe finalization of the current atomic record.
PLP requirements (device-side interface)
  • Rails held: compute + storage + required IO for completing atomic metadata write.
  • Hold time: commit latency + safety margin for deterministic finalization.
  • Write stop trigger: brownout interrupt / PMIC PG / ADC threshold; must immediately block new erase/write operations.
  • Safe action: complete or abort the current atomic record; never leave half-written metadata as the latest record.
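
The interface requirements above reduce to one inequality and one gate. The sketch below is illustrative only: `WriteGate`, `required_hold_time_ms`, and `plp_budget_ok` are hypothetical names, and the millisecond figures in the usage lines are made-up numbers, not measurements.

```python
def required_hold_time_ms(commit_latency_ms: float, margin_ms: float) -> float:
    """Hold time = worst-case commit latency + safety margin."""
    return commit_latency_ms + margin_ms


def plp_budget_ok(available_hold_ms: float, commit_latency_ms: float,
                  margin_ms: float) -> bool:
    """True when the held rails outlast one worst-case atomic commit."""
    return available_hold_ms >= required_hold_time_ms(commit_latency_ms, margin_ms)


class WriteGate:
    """Blocks NEW erase/write operations once the brownout trigger has fired."""

    def __init__(self) -> None:
        self._blocked = False

    def on_brownout(self) -> None:
        # Called from the brownout interrupt / PMIC PG / ADC threshold handler.
        self._blocked = True

    def may_start_write(self) -> bool:
        return not self._blocked


# Illustrative numbers only: 12 ms worst-case commit, 8 ms margin, 25 ms held.
assert required_hold_time_ms(12.0, 8.0) == 20.0
assert plp_budget_ok(25.0, 12.0, 8.0)
```

The gate only refuses operations that have not started; finishing or aborting the record already in flight is the job of the atomic metadata layer.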
Figure F6 — A/B + atomic commit under power loss (layout + state machine + 3 power-cut points)
[Figure: flash / eMMC layout (SLOT A confirmed, SLOT B pending, dual-copy META with CRC + generation, append-only JOURNAL, monotonic FLOOR / COUNTER) and a state machine with states IDLE, DOWNLOADING, STAGING, PENDING, BOOT TEST, CONFIRMED, FAILED, ROLLBACK, with three marked power-loss points. Guarantees: confirmed slot intact · metadata recoverable · writes stopped on brownout.]

PLP is expressed as “rails held + hold time + write-stop trigger”; the OTA design remains valid regardless of the underlying energy source.

H2-9 · Recovery

Recovery design: rollback, rescue modes, and the last lifeline

A secure OTA system must remain recoverable. Recovery is defined as a layered plan with a bounded confirmation window, deterministic slot selection rules, and field-operable rescue entry points.

Recovery ladder (from most reliable to most feature-rich)
  • Bootloader rescue: minimal dependencies; must always be reachable even when filesystem/config decryption fails.
  • Minimal OS rescue: adds drivers and tooling for validation/export, while staying under secure boot policy.
  • App-level safe mode: provides degraded service and operator feedback, but must not bypass signature or rollback gates.
Bootloader rescue · Min-OS rescue · App safe
Confirmation gate (post-boot health check)
  • Purpose: promote a pending slot to confirmed only after it proves basic health.
  • Checks: watchdog feed path, 1–3 critical services, storage readability, and a bounded timer window.
  • Rule: unconfirmed equals failure; a pending slot must not remain pending indefinitely.
  • Outputs: CONFIRMED flag (atomic) + small reason codes on failure (enum, not long strings).
Deterministic rollback policy (A/B selection rules)
Condition (hard evidence) | Action | Notes
PENDING exists and attempt_count < N | Boot PENDING slot (test run) | Health check must confirm within the window.
Health check fails or attempt_count reaches N | Mark PENDING invalid (or lower priority), roll back to last CONFIRMED | Record rollback_reason and last error_code.
No CONFIRMED slot is bootable | Enter Bootloader rescue | Rescue must still enforce signature/floor rules.
Candidate version < floor (revoked/too old) | Disallow boot even if image verifies | Rollback is allowed only to known-good versions ≥ floor.
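
The table rows translate into a small decision function. This is a sketch under assumptions: `Slot`, `bootable`, `choose_boot`, and the action strings are hypothetical, and `verifies` stands for a signature and hash result computed elsewhere.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Slot:
    name: str
    state: str       # "CONFIRMED", "PENDING", or "INVALID"
    build: int
    verifies: bool   # signature/hash result from the verify engine (assumed input)


def bootable(slot: Slot, floor: int) -> bool:
    """A slot may run only if it verifies, is not invalid, and is not below floor."""
    return slot.verifies and slot.state != "INVALID" and slot.build >= floor


def choose_boot(slots: List[Slot], attempt_count: int, max_attempts: int,
                floor: int) -> Tuple[str, Optional[str]]:
    pending = next((s for s in slots if s.state == "PENDING"), None)
    if pending is not None:
        if attempt_count < max_attempts and bootable(pending, floor):
            return ("BOOT_PENDING", pending.name)
        # Attempts exhausted or a gate failed: the pending slot loses its chance.
        pending.state = "INVALID"
    confirmed = next((s for s in slots
                      if s.state == "CONFIRMED" and bootable(s, floor)), None)
    if confirmed is not None:
        action = "ROLLBACK" if pending is not None else "BOOT_CONFIRMED"
        return (action, confirmed.name)
    return ("RESCUE", None)
```

Note that the floor check applies to the rollback target as well, so a confirmed slot that has fallen below the floor leads to rescue rather than to a downgrade.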
Field operability (engineering interfaces only)
  • Physical entry: button or GPIO strap at power-on to force rescue mode.
  • Console entry: UART/USB commands to inspect slot state, export minimal logs, and trigger rollback/rescue.
  • Production guard: development vs production mode gating to prevent rescue paths from bypassing verification.

A rescue path is a controlled interface, not a backdoor.

Non-negotiable “last lifeline” invariants
  • Keep at least one CONFIRMED bootable slot at all times; updates modify only the inactive slot.
  • Make metadata truth recoverable: journal/COW with validation and deterministic fallback rules.
  • Bound trials: pending has an attempt limit; failure triggers automatic rollback or rescue.
Figure F7 — Recovery ladder + confirm gate + field entry points
[Figure: recovery ladder (bootloader rescue: minimal dependencies; min-OS rescue: tools; app safe mode: degraded), confirm gate (PENDING → health checks: watchdog, critical services → CONFIRMED, attempts ≤ N), A/B rollback (slot A good, slot B on trial), and field entry points (button, strap, UART). Guarantees: one confirmed slot · bounded trials · deterministic rescue entry.]

Recovery stays within device boundaries: deterministic slot policy + bounded confirmation + controlled rescue interfaces.

H2-10 · Hard evidence

Hard evidence and observability: what to collect when updates fail

Debugging must start from hard evidence: first locate where the state machine stopped, then classify the failure as cryptographic/package, storage/atomicity, or power/brownout related.

Minimal event set (recommended as mandatory)
Event | Fields (keep it structured) | Why it matters
EVT_VERIFY_SIG | result, key_id, error_code | Separates “image authenticity” failures from all other classes.
EVT_PARSE_MANIFEST | result, version, target_slot, compat | Explains why a valid signature still cannot be accepted.
EVT_SLOT_TRANSITION | from_state → to_state, attempt_count | Locates exactly where the OTA flow stopped.
EVT_REBOOT_REASON | wdt / brownout / software, brownout_flag | Distinguishes logic bugs from power integrity failures.
EVT_ROLLBACK | reason, floor, candidate_version (redacted) | Proves whether anti-rollback rules blocked a boot attempt.

Use event codes + small fields; avoid long text logs in critical paths.

Local log integrity (no external SIEM assumed)
  • Append-only: records are appended, not overwritten; keeps the latest failure evidence intact.
  • Tamper-evident digest (optional): periodic digest over recent records to detect easy edits.
  • Redaction: never print secrets; prefer KeyID/counter/error enums instead of sensitive content.
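
A compact way to combine append-only records with a tamper-evident digest is a per-record hash chain. This is a variant of the periodic digest mentioned above, shown as a sketch: the record layout, the `EventLog` class, and the use of SHA-256 are assumptions. Without a protected key or anchor, the chain only exposes casual edits, since anyone with write access can recompute it.

```python
import hashlib
import struct
from typing import List, Tuple

# Hypothetical fixed-size record: sequence, event_code, error_code, value
_EVENT = struct.Struct("<IHHI")
_GENESIS = b"\x00" * 32


class EventLog:
    def __init__(self) -> None:
        self.records: List[Tuple[bytes, bytes]] = []   # (record, chain digest)
        self._digest = _GENESIS

    def append(self, event_code: int, error_code: int, value: int) -> None:
        """Append one record; the digest covers this record and all earlier ones."""
        record = _EVENT.pack(len(self.records), event_code, error_code, value)
        self._digest = hashlib.sha256(self._digest + record).digest()
        self.records.append((record, self._digest))

    def verify(self) -> bool:
        """Recompute the chain; any edited or reordered record breaks it."""
        digest = _GENESIS
        for record, stored in self.records:
            digest = hashlib.sha256(digest + record).digest()
            if digest != stored:
                return False
        return True
```

Records hold only codes and small integers, in line with the redaction rule: no key material or payload content is ever logged.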
Field triage order (hard evidence first)
  • Step 1: read the last EVT_SLOT_TRANSITION to locate the stuck state (DOWNLOAD / STAGING / PENDING / BOOT_TEST).
  • Step 2: check EVT_VERIFY_SIG and EVT_PARSE_MANIFEST to classify crypto/package issues.
  • Step 3: check EVT_REBOOT_REASON + brownout flags to classify power-loss / brownout behavior.
  • Step 4: check EVT_ROLLBACK (floor, candidate_version) to confirm anti-rollback blocks.
Classification by evidence (what it usually indicates)
  • Crypto/package: signature failure, incompatible manifest, version below floor.
  • Storage/atomicity: meta CRC failure, journal replay issues, AEAD decrypt failure due to nonce/metadata mismatch.
  • Power/brownout: repeated brownout flags, reboot reasons pointing to power, transitions stopping during writes.
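
The triage order can be expressed as a function over the last recorded events. This sketch represents events as plain dictionaries with a hypothetical `type` key and the field names from the event table; the `"ok"` result value and the `triage` function are assumptions. Storage/atomicity is not modelled, because the minimal event set has no dedicated metadata-CRC or journal event to key on.

```python
from typing import Dict, List, Optional, Tuple


def _last(events: List[Dict], event_type: str) -> Optional[Dict]:
    """Return the most recent event of the given type, if any."""
    return next((e for e in reversed(events) if e["type"] == event_type), None)


def triage(events: List[Dict]) -> Tuple[str, str]:
    """Return (stuck_state, failure_class) following steps 1 to 4."""
    # Step 1: locate the stuck state.
    transition = _last(events, "EVT_SLOT_TRANSITION")
    stuck = transition["to_state"] if transition else "UNKNOWN"

    # Step 2: crypto/package issues.
    for event_type in ("EVT_VERIFY_SIG", "EVT_PARSE_MANIFEST"):
        event = _last(events, event_type)
        if event is not None and event["result"] != "ok":
            return stuck, "crypto/package"

    # Step 3: power-loss / brownout behavior.
    reboot = _last(events, "EVT_REBOOT_REASON")
    if reboot is not None and reboot.get("brownout_flag"):
        return stuck, "power/brownout"

    # Step 4: anti-rollback blocks (version below floor).
    rollback = _last(events, "EVT_ROLLBACK")
    if rollback is not None and rollback["candidate_version"] < rollback["floor"]:
        return stuck, "crypto/package"

    return stuck, "unclassified"
```

A log that ends in a STAGING transition followed by a brownout reboot reason is classified as power/brownout with STAGING as the stuck state.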
Minimal export bundle (hand-off friendly)
  • Last N event records + current slot snapshot (active/next/attempt/floor).
  • Current firmware version + candidate version + last rollback reason.
  • Manifest header digest (hash only) for correlation without leaking content.
Figure F8 — Debug decision flow: locate state → classify by evidence → next action
[Figure: debug decision flow. On failure, read the last EVT_SLOT_TRANSITION, then branch: STATE (EVT_SLOT_TRANSITION, attempt_count, stuck_state → next: fix flow); CRYPTO (EVT_VERIFY_SIG, EVT_PARSE_MANIFEST, floor / version → next: fix package); POWER (EVT_REBOOT_REASON, brownout_flag, write_window → next: fix window).]

This chapter stays device-local: append-only evidence + small structured fields + a deterministic triage order.

H2-11 · Validation

Validation & testing: proving OTA is secure and non-bricking

A Secure OTA Module is verified by repeatable, device-side tests that produce hard evidence: deterministic state transitions, reject reasons for malicious inputs, and recovery guarantees under power interruption.

What “secure and non-bricking” means in measurable assertions
  • Authorization: only signed-and-allowed images can reach PENDING / CONFIRMED.
  • Freshness: any image below rollback floor is blocked even if signature verifies.
  • Atomicity: metadata truth is recoverable after any interruption (journal/COW + validation).
  • Recoverability: failure always ends in CONFIRMED boot or controlled rescue entry (bounded attempts).
  • Observability: failure is classifiable by device-local evidence (state vs crypto vs power).
Reference BOM (example material numbers for device-side Secure OTA)

These part numbers are common reference options for secure boot + signed update + local evidence. Final selection must match required security level, lifecycle, and availability.

Block | Example material numbers | Typical role in OTA proof
Secure MCU / SoC | NXP LPC55S69 · NXP i.MX RT1062 · ST STM32U585 · ST STM32H573 · Renesas RA6M5 · Infineon PSoC 64 | Secure boot root, signature verification, slot state machine, evidence events.
Secure Element | Microchip ATECC608B · NXP SE050 · Infineon OPTIGA™ Trust M (SLS32AIA010MS) · ST STSAFE-A110 | Protected key storage, monotonic counter (freshness), key rotation/revocation anchors.
Discrete TPM (optional) | Infineon OPTIGA™ TPM SLB 9670 / SLB 9672 · Nuvoton NPCT75x (family) | Hardware trust anchor / measured evidence (optional), policy-controlled key use.
External NOR Flash | Winbond W25Q128JV · Macronix MX25L128 / MX25U128 · Micron MT25Q (family) | Slot A/B images, staging areas, metadata journal; supports power-cut sweep tests.
eMMC storage (if used) | Micron eMMC (MTFC… family) · Kioxia eMMC (THG… family) | High-capacity slots/delta chunks; validate atomic commit + filesystem resilience.
Reset / watchdog supervisor | TI TPS3431 · TI TPS3850 · Maxim/ADI MAX16054 | Enforces bounded boot-test window; provides reboot reason evidence and recovery determinism.
Voltage monitor / brownout detect | TI TPS3703 · Microchip MCP1316 (family) | Gates writes near brownout and records brownout events for power-cut/boundary validation.

Practical validation tip: at least one reference platform should include a Secure Element (e.g., ATECC608B / SE050 / STSAFE-A110) to test monotonic counters and key revocation behavior under power interruptions.

Functional tests (state machine evidence)
  • Full update path: IDLE → DOWNLOADING → STAGING → PENDING → BOOT_TEST → CONFIRMED.
  • Resume: interrupt download and restart; chunk verification resumes without nonce/counter inconsistency (when AEAD is used).
  • A/B alternation: two consecutive releases alternate inactive slot; one CONFIRMED slot remains bootable at all times.
  • Auto-rollback: fail boot-test and confirm automatic rollback to last CONFIRMED within bounded attempts.
  • Rescue entry: force rescue via button/strap/UART policy and verify signature/floor gates still apply.

Suggested evidence hooks: EVT_SLOT_TRANSITION + attempt_count + active/next slot snapshot.
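A functional test can check a captured EVT_SLOT_TRANSITION sequence against an allowed-transition table. In this sketch only the happy path (IDLE → DOWNLOADING → STAGING → PENDING → BOOT_TEST → CONFIRMED) comes from the list above; the ROLLBACK and RESCUE states and every failure or return edge are assumptions to adapt to the real state machine.

```python
# Allowed next states per state. Only the happy path is taken from the text;
# ROLLBACK / RESCUE and the failure/return edges are illustrative assumptions.
ALLOWED = {
    "IDLE": {"DOWNLOADING"},
    "DOWNLOADING": {"STAGING", "IDLE"},
    "STAGING": {"PENDING", "IDLE"},
    "PENDING": {"BOOT_TEST"},
    "BOOT_TEST": {"CONFIRMED", "ROLLBACK", "RESCUE"},
    "ROLLBACK": {"IDLE", "RESCUE"},
    "CONFIRMED": {"IDLE"},
    "RESCUE": {"IDLE"},
}


def first_illegal_transition(states):
    """Return the index of the first state not reachable from its predecessor, or None."""
    for i in range(1, len(states)):
        if states[i] not in ALLOWED.get(states[i - 1], set()):
            return i
    return None


FULL_PATH = ["IDLE", "DOWNLOADING", "STAGING", "PENDING", "BOOT_TEST", "CONFIRMED"]
```

A sequence that jumps from DOWNLOADING straight to PENDING is flagged at the index of PENDING, which is the evidence that staging and verification were skipped.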

Security tests (reject conditions, not attack tutorials)
  • Forged signature: image must be rejected during verification (no PENDING state set).
  • Tampered manifest: hash/field mismatch must be rejected with a structured error code.
  • Replay: an older valid package must be blocked by freshness rules (counter/floor).
  • Rollback: candidate version below floor must be blocked even if signature verifies.
  • Key revoke: revoked KeyID must remain unusable across normal and rescue paths.
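The reject conditions above reduce to a single admission gate that returns a structured reason. This is a sketch, not a cryptographic implementation: `signature_ok` is a boolean stand-in for the real signature check, and the enum names, parameter names, and check order are assumptions.

```python
from enum import Enum


class Reject(Enum):
    OK = 0
    KEY_NOT_ALLOWED = 1
    KEY_REVOKED = 2
    BAD_SIGNATURE = 3
    BELOW_FLOOR = 4


def admit(candidate_version, key_id, signature_ok, floor, allowlist, revoked):
    """Decide whether a candidate image may reach PENDING (sketch).

    signature_ok stands in for the result of real signature verification.
    The same gate is meant to run on both the normal and the rescue path.
    """
    if key_id not in allowlist:
        return Reject.KEY_NOT_ALLOWED
    if key_id in revoked:
        return Reject.KEY_REVOKED
    if not signature_ok:
        return Reject.BAD_SIGNATURE
    # Freshness: a valid signature never overrides the rollback floor.
    if candidate_version < floor:
        return Reject.BELOW_FLOOR
    return Reject.OK
```

With `floor=7`, an older package at version 5 returns `Reject.BELOW_FLOOR` even when `signature_ok` is true, which is the behavior the replay and rollback tests assert.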

Reference platforms to cover this class: Secure Element-based monotonic counters (e.g., ATECC608B / SE050 / STSAFE-A110) and at least one MCU with ROM/OTP trust anchor support (e.g., LPC55S69 / STM32U585 / STM32H573).

Reliability tests (power-cut sweep, brownout boundary, wear-out)
  • Power-cut sweep: repeatedly cut power at multiple phases (download / staging / meta commit / pending set / boot-test / confirm).
  • Pass criterion: after any interruption, the system must reach CONFIRMED boot or controlled rescue entry (never stuck in unknown).
  • Brownout boundary: verify “write gating” near brownout; metadata truth remains recoverable (journal/COW validation works).
  • Wear-out: stress metadata/journal updates and confirm stable recovery rules (no dual-slot loss).

Recommended evidence hooks: EVT_REBOOT_REASON + brownout_flag + meta validation result + last rollback reason. Suggested supporting parts for repeatability: watchdog supervisor (TPS3431/TPS3850/MAX16054) and voltage monitor (TPS3703/MCP1316 family).
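The pass criterion of the power-cut sweep can be rehearsed on a host model before running it on hardware. The sketch below assumes a layout that is not specified in the text: two metadata copies, each a 6-byte record (sequence number, active slot, CRC32), with the new record written over the non-current copy. A cut is modeled as "only the first k bytes were programmed, the rest reads as erased 0xFF", which is a simplification of real flash behavior.

```python
import zlib

RECORD_LEN = 6
ERASED = b"\xff" * RECORD_LEN


def encode(seq, active_slot):
    body = bytes([seq, active_slot])
    return body + zlib.crc32(body).to_bytes(4, "little")


def decode(raw):
    """Return (seq, active_slot) if the CRC matches, else None."""
    body, crc = raw[:2], raw[2:]
    if len(raw) != RECORD_LEN or zlib.crc32(body).to_bytes(4, "little") != crc:
        return None
    return body[0], body[1]


def recover(copies):
    """Deterministic slot truth: the valid copy with the highest sequence number."""
    valid = [rec for rec in map(decode, copies) if rec is not None]
    return max(valid) if valid else None


def commit_with_cut(copies, new_slot, cut_after):
    """Write the next record over the non-current copy; power is lost after cut_after bytes."""
    current = recover(copies)
    target = 0 if decode(copies[0]) != current else 1
    record = encode(current[0] + 1, new_slot)
    copies[target] = record[:cut_after] + ERASED[cut_after:]


def sweep():
    """Cut at every byte boundary of the commit and collect the recovered truth."""
    outcomes = []
    for cut_after in range(RECORD_LEN + 1):
        copies = [encode(1, 0), encode(0, 0)]   # copy 0 is current: seq 1, slot 0
        commit_with_cut(copies, new_slot=1, cut_after=cut_after)
        outcomes.append(recover(copies))
    return outcomes
```

Every outcome is either the old truth `(1, 0)` or the new truth `(2, 1)`; no cut point leaves the model without a valid copy, because the current copy is never the one being overwritten. Sequence-number wraparound is not handled in this sketch.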

Adversarial checks (coverage only) + pre-release acceptance checklist (15 items)

Adversarial coverage is validated by confirming that safety policies trigger and recovery remains deterministic, without describing fault-injection procedures.

# | Must-pass item | Expected hard evidence
1 | Normal update reaches CONFIRMED | EVT_SLOT_TRANSITION sequence complete; CONFIRMED set atomically
2 | Resume works after forced reboot during download | Chunk verification consistent; no counter/nonce mismatch error
3 | A/B alternation across two releases | Inactive slot updated; at least one bootable CONFIRMED always remains
4 | Boot-test failure rolls back within bounded attempts | attempt_count increments; rollback_reason recorded; last CONFIRMED boots
5 | Rescue entry is reachable and policy-controlled | Rescue flag/strap/button recorded; verification gates still enforced
6 | Forged signature package is rejected | EVT_VERIFY_SIG=FAIL; PENDING not set
7 | Tampered manifest is rejected | EVT_PARSE_MANIFEST=FAIL (structured error_code)
8 | Replay of older signed package is blocked | floor/counter check blocks; rollback/reject reason captured
9 | Rollback below floor is blocked | candidate_version < floor; boot refused; reason captured
10 | Revoked key cannot install or boot | KeyID flagged revoked; verification/policy blocks in normal and rescue
11 | Power-cut sweep passes across all phases | After each cut: CONFIRMED boot or RESCUE; never “unknown” slot truth
12 | Metadata truth is recoverable after interruption | Journal replay OK; meta validation OK; deterministic slot selection
13 | Brownout boundary does not corrupt truth | brownout_flag set; write gating prevents new writes; recovery succeeds
14 | Attempt limit forces rollback/rescue (no endless pending) | attempt_count reaches N; auto rollback or rescue entry occurs
15 | Failure is classifiable by device-local evidence | At least one of the state / crypto / power classes is provable via events
Figure F9 — Test matrix: phase × cut point × expected outcome
[Figure F9: Secure OTA validation matrix. Phases: DOWNLOAD, VERIFY, STAGING, META, BOOT_TEST, CONFIRM. Test classes: FUNCTION, SECURITY, RELIABILITY. Matrix cells are cut points / negative cases. Expected outcomes: REJECT, ROLLBACK, RESCUE. Hard evidence: EVT_SLOT_TRANSITION, EVT_VERIFY_SIG, EVT_REBOOT_REASON, floor/version.]

The chapter remains device-side: tests prove deterministic recovery and policy enforcement using structured evidence, while avoiding cloud/fleet architecture and avoiding fault-injection procedures.

H2-12 · Selection

Component & architecture selection: MCU/SoC, Secure Element, storage, supervisors

This section turns Secure OTA mechanisms into concrete device-side capabilities. The goal is a closed loop: verified boot + signed updates + anti-rollback + atomic commit under power loss + recoverability + minimum evidence.

1) Capability-to-mechanism mapping (what each “ability” pays for)

Hardware capability (device-side) | Enables which mechanism | Primary failure prevented (field symptom)
Immutable boot root (ROM verified boot, or equivalent) | Chain-of-trust verification point (H2-3) | Unsigned image boot; “bootloader can be replaced”
Secure storage anchor (OTP/eFuse/secure NVM) | Root public-key hash / KeyID allowlist / policy floor (H2-4/H2-5) | “Signature passes but a revoked key still works”
Crypto acceleration (SHA-256/384, ECC verify, AES-GCM/CCM) | Fast verify, chunk validation, at-rest encryption (H2-3/H2-6/H2-7) | Long verify windows → higher power-cut brick rate
TRNG / reliable entropy | Nonce/IV correctness, per-device key derivation (H2-7) | Repeated IV/nonce; fragile encryption correctness
Isolation primitives (MPU/TrustZone-M/privilege) | Separates updater/boot policy from application (H2-3/H2-7/H2-10) | App bugs overwrite policy/logs; evidence becomes unreliable
Dual-image boot support (A/B slots + selection logic) | Recoverability and bounded attempts (H2-8/H2-9) | One bad update bricks device; no safe rollback path
Brownout detect + write gating | Atomic metadata commit under power loss (H2-8/H2-11) | “Meta truth lost” after dips; stuck in unknown boot state
Reset/watchdog supervision | Boot-test window, deterministic rollback/rescue (H2-9/H2-11) | Endless reboot loops; no controlled rollback

Practical guideline: prioritize abilities that shorten verification time, harden rollback floors, and keep metadata updates atomic.

2) MCU/SoC “must-have” abilities for a minimum Secure OTA loop

  • Verified boot root: an immutable verify point (ROM or equivalent) that cannot be bypassed in production mode.
  • Trust anchor storage: OTP/eFuse/secure NVM to pin root public-key hash and policy identifiers (KeyID allowlist / floor).
  • Signature verification: ECC verify + hashing (software is possible; hardware accel reduces brick risk by shrinking time windows).
  • Isolation: at least MPU/privilege separation so the updater/policy cannot be overwritten by normal application code.
  • Boot slot selection: supports A/B slot rules and a bounded-attempt boot-test gate (confirm-or-rollback behavior).

Example MCU/SoC references (material numbers): NXP LPC55S69, ST STM32U585, ST STM32H573, Renesas RA6M5, Infineon PSoC 64.

3) Secure Element: when it is worth it (and when it is not)

A Secure Element is most valuable when it provides device-side features that the main MCU cannot reliably emulate: key isolation and monotonic counters that enforce anti-rollback floors under physical access and power interruptions.

  • Strong fit: devices with physical exposure, strict anti-rollback, key revocation requirements, or a need for monotonic counters.
  • Neutral fit: moderate threat model, MCU already has robust OTP/TrustZone; SE is an optional safety margin.
  • Poor fit: ultra-cost-sensitive designs where supply/lifecycle risk dominates and threat model is low.

Example Secure Element references (material numbers): Microchip ATECC608B, NXP SE050, ST STSAFE-A110, Infineon OPTIGA™ Trust M (SLS32AIA010MS). Optional discrete TPM references: Infineon SLB 9670 / SLB 9672.

4) Storage choice for OTA: NOR vs eMMC/NAND (update-write behavior + PLP risk points)

  • QSPI NOR: deterministic erase/write units; excellent for A/B images + small metadata journal. Risk is sector erase latency → needs strict staging and verified commit.
  • eMMC: high capacity; good for full images and chunk maps. Risk is internal translation layers and power-loss sensitivity → requires stronger two-phase commit + write gating near brownout.
  • NAND: capacity-focused; bad blocks/ECC/translation layers increase the burden on “truth recovery” rules. Keep metadata/journal strategy conservative.

Example storage references (material numbers): Winbond W25Q128JV / W25Q256JV, Macronix MX25L128 / MX25U128, Micron MT25Q (family); eMMC examples: Micron MTFC… (family), Kioxia THG… (family).

5) Supervisors & power monitoring (device-side): brownout reset, write gating, watchdog policy

  • Brownout detect: capture brownout flags as evidence and block starting new erase/write when voltage is near the unsafe region.
  • Write gating threshold: define a “no new writes” threshold (and respect it in updater state machine) to protect metadata truth.
  • Watchdog + boot-test window: a bounded window prevents endless pending states; exceed window → rollback/rescue deterministically.
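The write-gating rule and the bounded boot-test policy are both small, deterministic decisions. The sketch below is illustrative only: the function names, the millivolt units, and the example threshold are assumptions, and the real threshold must come from the storage and supervisor datasheets.

```python
def may_start_write(vdd_mv, gate_mv):
    """Write gating: refuse to START a new erase/program below the gate threshold.

    Writes already in progress are not covered here; they rely on the
    hold-up window of the power design.
    """
    return vdd_mv >= gate_mv


def next_boot_action(attempt_count, max_attempts, has_confirmed_slot):
    """Bounded boot-test policy: retry, then roll back, then fall to rescue."""
    if attempt_count < max_attempts:
        return "BOOT_TEST"
    return "ROLLBACK" if has_confirmed_slot else "RESCUE"
```

With a hypothetical gate of 3000 mV, a reading of 2950 mV blocks a new metadata commit, and a pending image that has used all of its attempts rolls back to the last CONFIRMED slot instead of looping.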

Example supervisor/monitor references (material numbers): TI TPS3431, TI TPS3850, Maxim/ADI MAX16054; voltage monitor examples: TI TPS3703, Microchip MCP1316 (family).

6) Three-tier recipes (five-column capability table): entry / industrial / high-security

Tier | MCU/SoC | Trust anchor | Storage | Supervision
Entry (cost-focused) | NXP LPC55S69 / ST STM32U585 / Renesas RA6M5; ROM verify + OTP anchor + ECC/SHA accel preferred | MCU OTP/eFuse pins root public-key hash + KeyID list; no external SE, keep policy minimal but strict | QSPI NOR: Winbond W25Q128JV; A/B + meta journal on NOR | WDT: TI TPS3431 + monitor: TI TPS3703; boot-test window + brownout write gating
Industrial (field resilience) | ST STM32H573 / NXP LPC55S69; stronger isolation + faster verify reduces brick window | Secure Element: ATECC608B / SE050 / STSAFE-A110; monotonic counter + key isolation + revoke anchors | Larger NOR (e.g., W25Q256JV) or eMMC (Micron MTFC…); chunk map + staged write + strict commit | TI TPS3850 + WDT policy; deterministic rollback, brownout-safe metadata
High-security (high threat) | Security-oriented platform (MCU/SoC with strong isolation domain); policy separation + hardened verify path | SE + optional TPM: Infineon OPTIGA Trust M (SLS32AIA010MS) + SLB 9670/9672 (optional); strict revoke, stronger attestation hooks (device-side) | eMMC (Micron MTFC… / Kioxia THG…); two-phase commit + journal + dual meta copies | Supervisor + monitor combo (e.g., MAX16054 + TPS3703); tight boot-test enforcement + write gating

Each tier still requires the same “non-negotiables”: verified boot, anti-rollback floor, atomic metadata, bounded attempts, and minimum evidence events.

Figure F10 — BOM blocks → mechanisms → outcomes (device-side Secure OTA)
[Figure F10: capabilities that make Secure OTA non-bricking (device-side blocks only, no cloud orchestration). BOM blocks: MCU/SoC, Secure Element, Storage, Supervisor → Mechanisms: verified boot, key policy, anti-rollback, manifest, chunk verify, at-rest encryption, two-phase commit, recovery, evidence events → Outcomes: REJECT, ROLLBACK, RESCUE, CONFIRMED. Design rule: keep at least one known-good CONFIRMED path bootable after any interruption.]

The selection logic is device-side: the goal is deterministic verification, atomic commits under power dips, and controlled recovery—without relying on backend orchestration.

H2-13 · FAQs

FAQs (12) — device-side secure OTA, answers + mappings

Each answer stays within the Secure OTA Module boundary: signature verification, anti-rollback, manifest correctness, encryption-at-rest, power-loss atomicity, recovery rules, evidence logs, and validation matrices.

Figure F11 — FAQ map: questions → chapters
[Figure F11: FAQ coverage map. Q1–Q12 map back to H2 mechanisms (no cross-topic expansion). Chapters: H2-3 Boot chain, H2-4 Keys, H2-5 Anti-rollback, H2-6 Manifest, H2-7 Storage encryption, H2-8 PLP & atomicity, H2-9 Recovery, H2-10 Evidence, H2-11 Validation, H2-12 Selection.]
Q1. Why can an “approved signature” still allow downgrading to old firmware?

Signature verification proves authenticity, not freshness. Anti-rollback requires a device-side floor enforced by a monotonic value (OTP/eFuse version, Secure Element counter, or a protected counter with tamper detection). The boot policy must reject any candidate image with version < floor, even if the signature verifies. Floor updates must be atomic and survive power loss.

Maps to: H2-5

Q2. What if both A/B slots are corrupted—where should a rescue image live?

A reliable “last life” should sit outside the normal updatable slots: a ROM-resident minimal rescue (ideal), or a protected rescue region that is verified by the immutable boot root and written rarely. The rescue path must still enforce signature and rollback-floor rules, then re-provision a known-good slot. Avoid placing rescue in the same metadata domain as the A/B slots, where a single corrupted metadata record could take out the slots and the rescue path together.

Maps to: H2-9

Q3. During updates, where does power loss brick devices most often, and how does two-phase commit prevent it?

The highest-risk moments are metadata “truth changes”: setting a pending boot flag, switching active slot pointers, or updating rollback counters. Two-phase commit avoids bricking by separating write/verify from activation: write to staging/inactive slot, verify hashes and signature, then atomically set PENDING with journaled metadata. Only after a successful boot-test does CONFIRMED get committed.

Maps to: H2-8, H2-6

Q4. Which manifest fields are mandatory, which are optional, and what breaks if they are missing?

Mandatory fields usually include image hash(es), signature, version/build number, target component/slot, and hardware compatibility constraints. Without compatibility and targeting, devices can install the wrong image and fail boot-test. Optional fields include dependencies, compression, and delta parameters. If chunk maps are missing in interrupted environments, resume becomes unsafe or inefficient, increasing brick risk during retries.
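A manifest gate that returns a structured error code, rather than a bare pass/fail, could look like the sketch below. The field names (`image_sha256`, `signature`, `version`, `target_slot`, `hw_compat`) are hypothetical stand-ins for the mandatory fields listed above, and signature verification itself is out of scope for this fragment.

```python
import hashlib
from enum import Enum


class ManifestError(Enum):
    OK = 0
    MISSING_FIELD = 1
    HW_MISMATCH = 2
    HASH_MISMATCH = 3


# Hypothetical names for the mandatory fields described in the answer.
MANDATORY = ("image_sha256", "signature", "version", "target_slot", "hw_compat")


def check_manifest(manifest, image, device_hw):
    """Return (error_code, offending_field). Signature verification is done elsewhere."""
    for field in MANDATORY:
        if field not in manifest:
            return ManifestError.MISSING_FIELD, field
    if device_hw not in manifest["hw_compat"]:
        return ManifestError.HW_MISMATCH, "hw_compat"
    if hashlib.sha256(image).hexdigest() != manifest["image_sha256"]:
        return ManifestError.HASH_MISMATCH, "image_sha256"
    return ManifestError.OK, None
```

Returning the offending field alongside the code gives EVT_PARSE_MANIFEST something specific to log without exposing manifest content.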

Maps to: H2-6

Q5. After key revocation, how can “old keys still work,” and what is the safest device-side fix?

Old keys remain effective when the device only checks “signature valid” but does not bind acceptance to an allowlist and a revocation state. The safest device-side fix is to anchor a KeyID allowlist (or a small revocation bitmap) in immutable/secure storage, and to require that the signing KeyID is both present and not revoked. Pair revocation with an updated rollback floor to block old signed packages permanently.

Maps to: H2-4, H2-5

Q6. Does storage encryption make rollback/recovery harder, and how can designs stay secure and recoverable?

Encryption-at-rest can complicate recovery if the rescue path cannot access keys or if metadata becomes unrecoverable after brownouts. A practical pattern is to encrypt secrets/config strongly while keeping boot-critical metadata minimal and journaled. If rescue must read encrypted images, ensure the trust anchor can unwrap keys in rescue mode. Avoid designs where encryption keys depend on mutable state that can be lost mid-update.

Maps to: H2-7, H2-9

Q7. Delta updates save bandwidth—what new security and reliability pitfalls appear?

Delta updates increase complexity because correctness depends on the exact base version and patch application order. Security pitfalls include inadequate verification granularity (only verifying the final image) and replaying a delta against an unexpected base. Reliability pitfalls include harder recovery after interruptions and increased write amplification. A robust approach verifies chunk-level hashes, validates base-version binding, and supports safe resume with deterministic commit rules.
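Base-version binding and chunk-level verification can be sketched as one check that runs before any patch data is applied. The manifest keys `base_sha256` and `chunk_sha256` and the string result codes are assumptions for illustration.

```python
import hashlib


def verify_delta(manifest, installed_base_sha256, chunks):
    """Check base binding first, then every chunk hash in order (sketch).

    manifest["base_sha256"]  : digest of the exact image the delta was built against
    manifest["chunk_sha256"] : list of per-chunk digests, in application order
    """
    if manifest["base_sha256"] != installed_base_sha256:
        return "BASE_MISMATCH"          # delta replayed against an unexpected base
    expected = manifest["chunk_sha256"]
    if len(chunks) != len(expected):
        return "CHUNK_COUNT_MISMATCH"
    for i, (chunk, digest) in enumerate(zip(chunks, expected)):
        if hashlib.sha256(chunk).hexdigest() != digest:
            return f"CHUNK_{i}_MISMATCH"
    return "OK"
```

Because each chunk is checked independently, a resume after interruption can re-verify only the chunks already staged instead of restarting the whole transfer.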

Maps to: H2-6

Q8. In frequent power-loss environments, how can the minimum PLP window be estimated for atomic metadata commits?

The minimum PLP window is driven by the worst-case “atomic commit payload”: journal record write + CRC/validation + pointer/epoch update (often duplicated). Estimate the maximum write latency of that payload on the target storage (NOR page program, plus sector erase if the commit needs a fresh sector; eMMC page write + internal housekeeping), then add margin for brownout detection and firmware response time. Gate new writes near the brownout threshold so commits never start in unsafe voltage regions.
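The estimate can be written as a one-line budget. All numbers in the example are hypothetical placeholders, not datasheet values, and applying a single multiplicative margin to the whole sum is an assumption; worst-case latencies must come from the storage datasheet and measurement.

```python
def min_plp_window_us(journal_write_us, validate_us, pointer_update_us,
                      copies, detect_us, response_us, margin=1.5):
    """Worst-case hold-up time needed for one atomic metadata commit (sketch).

    commit  = (journal write + CRC/validation + pointer/epoch update) x copies
    window  = (brownout detection + firmware response + commit) x margin
    """
    commit_us = (journal_write_us + validate_us + pointer_update_us) * copies
    return (detect_us + response_us + commit_us) * margin


# Hypothetical figures for illustration only.
window = min_plp_window_us(journal_write_us=700, validate_us=50, pointer_update_us=250,
                           copies=2, detect_us=50, response_us=100)
```

With these placeholder inputs the commit payload is 2000 µs, and the required window is (50 + 100 + 2000) × 1.5 = 3225 µs.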

Maps to: H2-8

Q9. If Secure Boot fails, should devices refuse to boot or enter a safe mode—how to choose without false positives?

Refusing to boot is correct when authenticity cannot be established (signature/policy failure). A controlled safe/rescue mode is appropriate when the system can still enforce trust rules but needs a recovery path (e.g., slot corruption or boot-test failure). The policy should be deterministic: signature/policy failures → REJECT/RESCUE; boot-test failures → bounded retries then ROLLBACK/RESCUE. Supervisors/watchdogs help avoid endless loops.

Maps to: H2-3, H2-9

Q10. When OTA fails, what evidence should be checked first, and which 5 log fields narrow root cause fastest?

Start with the updater state machine step, then separate crypto vs storage vs power. The fastest five fields are: (1) state transition last step, (2) signature verification result/error code, (3) manifest parse/hash mismatch code, (4) reboot reason + brownout flags, and (5) slot state snapshot (active/next/attempt_count) plus rollback reason and floor/counter values (appropriately redacted).

Maps to: H2-10

Q11. How can “real security” be demonstrated—what is the smallest test matrix that avoids blind spots?

A minimum matrix crosses (A) phases (download/verify/staging/meta/boot-test/confirm) with (B) negative cases (bad signature, tampered manifest, replay, rollback below floor) and (C) power interruption points across those phases. Pass criteria are deterministic outcomes: REJECT for policy failures, ROLLBACK for boot-test failures, RESCUE when recovery is required, and always keeping at least one known-good CONFIRMED boot path available after any interruption.
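The matrix can be generated rather than hand-written, so that no phase or negative case is silently skipped. The row layout and the lowercase labels below are assumptions; cells that do not apply to a given design can be pruned after generation, but pruning should be an explicit decision.

```python
from itertools import product

PHASES = ("download", "verify", "staging", "meta", "boot_test", "confirm")
NEGATIVE_CASES = ("bad_signature", "tampered_manifest", "replay", "rollback_below_floor")


def build_matrix():
    """Return rows of (test_class, phase, case, expected_outcome)."""
    rows = []
    # (A) x (B): every negative case presented at every phase must end in REJECT.
    for phase, case in product(PHASES, NEGATIVE_CASES):
        rows.append(("security", phase, case, "REJECT"))
    # (C): a power cut in every phase must end in a CONFIRMED boot or controlled RESCUE.
    for phase in PHASES:
        rows.append(("reliability", phase, "power_cut", "CONFIRMED_OR_RESCUE"))
    # Boot-test failure is the one functional negative path: bounded retries, then ROLLBACK.
    rows.append(("functional", "boot_test", "boot_test_failure", "ROLLBACK"))
    return rows


matrix = build_matrix()
```

This yields 6 × 4 security rows, 6 reliability rows, and 1 functional row, 31 in total.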

Maps to: H2-11

Q12. Without a Secure Element, can security be “good enough,” and what must the MCU/SoC provide?

Many systems can be robust without a Secure Element if the MCU/SoC provides an immutable verify point, a strong secure storage anchor (OTP/eFuse), fast signature verification, isolation primitives (MPU/TrustZone), and a reliable anti-rollback floor mechanism. The risk increases when physical access is likely or when monotonic counters and hard revocation must be guaranteed under power loss. In those cases, a Secure Element materially improves device-side guarantees.

Maps to: H2-12, H2-4