Secure OTA Module for IoT Devices: Secure Boot & Safe Updates

Q: Why can an “approved signature” still allow downgrading to old firmware?

Signature verification proves authenticity, not freshness. Anti-rollback requires a device-side floor backed by a monotonic value (OTP/eFuse version, Secure Element counter, or a protected counter with tamper detection). The boot policy must reject any candidate image whose version is lower than the floor, even if the signature verifies. Floor updates must be atomic and survive power loss.

Q: What if both A/B slots are corrupted—where should a rescue image live?

A reliable last-resort image should sit outside normal updatable slots: a ROM-resident minimal rescue (ideal), or a protected rescue region verified by the immutable boot root and written rarely. The rescue path must still enforce signature and rollback-floor rules, then re-provision a known-good slot. Avoid placing rescue in the same metadata domain that can be compromised during atomic updates.

Q: During updates, where does power loss brick devices most often, and how does two-phase commit prevent it?

The highest-risk moments are changes to the metadata source of truth: setting pending boot flags, switching active slot pointers, or updating rollback counters. Two-phase commit separates write/verify from activation: write to staging or the inactive slot, verify hashes and signature, then atomically set PENDING with journaled metadata. CONFIRMED is committed only after a successful boot-test.

Q: Which manifest fields are mandatory, which are optional, and what breaks if they are missing?

Mandatory fields usually include image hashes, signature, version or build number, target component or slot, and hardware compatibility constraints. Without compatibility and targeting, devices can install the wrong image and fail boot-test. Optional fields include dependencies, compression, and delta parameters. Where downloads are frequently interrupted, a missing chunk map makes resume unsafe or inefficient and raises brick risk during retries.

Q: After key revocation, how can old keys still work, and what is the safest device-side fix?

Old keys remain effective when a device only checks signature validity but does not bind acceptance to an allowlist and a revocation state. The safest device-side fix is to anchor a KeyID allowlist (or a small revocation bitmap) in immutable or secure storage and require that the signing KeyID is present and not revoked. Pair revocation with an updated rollback floor to block old signed packages permanently.

Q: Does storage encryption make rollback and recovery harder, and how can designs stay secure and recoverable?

Encryption-at-rest can complicate recovery if the rescue path cannot access keys or if metadata becomes unrecoverable after brownouts. A practical pattern is to encrypt secrets and configuration strongly while keeping boot-critical metadata minimal and journaled. If rescue must read encrypted images, ensure the trust anchor can unwrap keys in rescue mode. Avoid designs where encryption keys depend on mutable state that can be lost mid-update.

Q: Delta updates save bandwidth—what new security and reliability pitfalls appear?

Delta updates increase complexity because correctness depends on the exact base version and patch application order. Pitfalls include insufficient verification granularity and replaying a delta against an unexpected base. Reliability risks include harder recovery after interruptions and increased write amplification. A robust approach verifies chunk-level hashes, validates base-version binding, and supports safe resume with deterministic commit rules.

Q: In frequent power-loss environments, how can the minimum power-loss protection (PLP) window be estimated for atomic metadata commits?

Minimum PLP is driven by the worst-case atomic commit payload: journal record write, validation, and pointer or epoch update (often duplicated). Estimate maximum write latency for that payload on the target storage, then add margin for brownout detection and firmware response time. Gate new writes near the brownout threshold so commits never start in unsafe voltage regions.

Q: If Secure Boot fails, should devices refuse to boot or enter a safe mode—how to choose without false positives?

Refusing to boot is correct when authenticity cannot be established (signature or policy failure). A controlled safe or rescue mode is appropriate when trust rules can still be enforced but a recovery path is needed, such as slot corruption or boot-test failure. Use deterministic policy: signature or policy failures lead to reject or rescue; boot-test failures lead to bounded retries then rollback or rescue. Supervisors and watchdogs prevent endless loops.

Q: When OTA fails, what evidence should be checked first, and which 5 log fields narrow root cause fastest?

Start with the updater state machine step, then separate crypto versus storage versus power. The fastest five fields are: last state transition step, signature verification result or error code, manifest parse or hash mismatch code, reboot reason plus brownout flags, and slot state snapshot (active, next, attempt count) plus rollback reason and floor or counter values with appropriate redaction.


A Secure OTA Module is a device-side closed loop that ensures every firmware update is authentic (signed), fresh (anti-rollback), and non-bricking under power loss through A/B slots + atomic commit, with deterministic recovery and minimum evidence logs for fast field debugging.

H2-1 · Boundary

Boundary definition: what a Secure OTA Module is (and is not)

The goal is to lock the page scope early, so the design discussion stays on device-side update safety instead of drifting into cloud platforms, gateways, or fleet operations.

Working definition (device-side capability set)

A Secure OTA Module is the combination of: Secure Boot + Signed Update + Power-loss Safety (Atomicity) + A/B Recovery & Rescue.

Secure Boot: verifies the next-stage code before execution (chain-of-trust).
Signed Update: verifies the update package before commit (signature + hash + manifest).
Atomicity / PLP: ensures unexpected power loss cannot corrupt boot-critical metadata.
A/B & Rescue: guarantees a path back to a known-good image after failure.

Scope boundaries vs sibling pages (mention only, do not expand)

Edge Security Probe: focuses on identity/attestation/forensics; this page only keeps update-related evidence and minimal audit.
Edge Power & Backup: focuses on system hold-up and power architecture; this page only defines write/commit atomicity requirements under power loss.

If a paragraph requires backend rollout policies, gateway aggregation, or management-console workflows, it is out of scope.

Three-question decision gate (fast “do you need it?” check)

Question (field reality)	Why it changes the design	Recommended level
Remote updates are required?	Assume update content can be tampered/replayed. Strong signature + integrity gates become mandatory.	Industrial / High-security
Physical access is realistic?	Assume local storage can be copied/rewritten. Anti-rollback + protected trust anchor are required.	Industrial / High-security
Frequent power loss / brownouts?	Assume mid-write interruption. A/B + atomic metadata is required to avoid “brick-on-update”.	Basic / Industrial

A Secure OTA Module is “done” only when each capability has observable evidence: verify results, version/counter state, slot state transitions, and power-loss-safe commit behavior.

Figure F1 — Secure OTA boundary (4 capability blocks, evidence-driven)

H2-2 · Threat Model

Threat model & measurable security goals: what must be prevented

“Security” becomes actionable only when each goal maps to a control point and a minimum evidence set. This chapter defines the OTA-relevant attack surfaces and the device-side goals that can be verified in the field.

OTA-relevant attack surfaces (device-centric, no backend deep dive)

Supply chain / manufacturing: trust anchor or key material is not unique or is substituted.
Transport: update package is replaced, replayed, or partially corrupted in transit.
Local storage: offline rewrite of flash/eMMC, forced downgrade, metadata tampering.
Boot chain: bootloader stage is modified to bypass verification.
Debug interface: post-production write access remains possible.
Power anomalies (brownout): mid-write interruption causes inconsistent state; in extreme cases, undervoltage can disturb instruction execution so that a verification branch is taken incorrectly.

Any paragraph that requires fleet rollout policies, gateway routing, or cloud console workflows is out of scope.

Five measurable goals (each must produce evidence)

Goal	Device-side control point	Minimum evidence (field-debug friendly)
Authenticity	Signature verification at boot and pre-commit stages (trusted key anchor).	verify-result code, key ID, verified stage ID (ROM/BL/update).
Integrity	Hash validation during download and after write (chunk-hash and/or full-image hash).	hash mismatch counters, final image hash, manifest hash result.
Freshness	Anti-rollback gate (monotonic counter or version floor) enforced before boot/commit.	counter value, version floor, rollback reason (reject/update/boot).
Recoverability	A/B slots + confirm mechanism; failure routes to known-good slot/rescue.	slot state (pending/confirmed), boot attempt count, fallback path taken.
Auditability	Minimal event log with append-only intent (local), covering state transitions and reboot causes.	event sequence, state transitions, reboot reason flags (incl. brownout).

Chapter map (where each goal is implemented)

Authenticity → Secure Boot + Signed Update pipeline (next chapters).
Integrity → Manifest + hash strategy (next chapters).
Freshness → Anti-rollback (next chapters).
Recoverability → A/B + rescue strategy (later chapters).
Auditability → Evidence chain & logs (later chapters).

Figure F2 — Threats → controls → minimum evidence (device-side)

The next chapters will implement these goals with concrete mechanisms (verify points, manifest fields, anti-rollback gates, A/B confirmation, and power-loss-safe commits). Backend rollout policies remain out of scope for this page.

H2-3 · Chain of Trust

Chain of Trust: verified boot from ROM to application

Verified boot is a hard gate: code must be authenticated before execution. Measured boot can improve auditability, but it does not replace verified boot for minimum “do-not-run-unknown-code” guarantees.

Verified boot vs measured boot (practical boundary)

Verified boot: verification blocks execution on failure (signature/hash/version gates).
Measured boot: measurements are recorded for later decisions; it can support auditing but does not prevent execution by itself.

Goal: do-not-run · Evidence: verify code · Evidence: slot state

Boot chain layers (each stage has one responsibility + one verify point)

Stage	Loads / executes	Verify point (minimum)	Failure outcome
ROM	Loads 1st stage from internal flash / external memory.	Checks trusted anchor (public key hash / key index) and verifies 1st stage.	Hard fail or enter rescue entry (if designed).
1st stage	Initializes minimal clocks/memory and loads bootloader.	Verifies bootloader signature; applies version gate hook (anti-rollback input).	Fallback to rescue slot / fail-safe mode (no unverified jump).
Bootloader	Selects A/B slot and loads OS/app image + manifest.	Verifies manifest + hash + signature; enforces freshness checks before commit/boot.	Switch slot; if both invalid → rescue image.
OS/App	Runs post-boot health checks (for confirmation).	Optional: confirms “known-good” after self-test (supports A/B confirm later).	On failure: watchdog/reboot triggers rollback policy.

Key design decisions (engineered tradeoffs)

Trust anchor placement: ROM/OTP anchors maximize tamper resistance; a secure element can add flexibility (rotation/counters) but adds BOM and integration surface.
Signature choice: prioritize verification time, code footprint, and hardware acceleration so boot remains within the reset/watchdog budget.
Failure policy: “refuse to boot” is correct for revoked/rollback attempts; “rescue slot” is appropriate for corruption or interrupted writes.

Minimal state machine (for later A/B recovery alignment)

PASS → jump to next stage; record verify code and stage ID.
FAIL → do not execute; select alternate slot (A↔B) if available.
Both invalid → enter rescue image; expose a deterministic error reason (signature/hash/version/metadata).

Field evidence should always include: verify-result code, selected slot, and reboot reason (including brownout flags).

Figure F1 — Chain of Trust & verify points (minimal labels, box-diagram style)

The next chapter will anchor the trust chain with concrete key storage and revocation rules so “verified boot” remains enforceable across device lifetime.

H2-4 · Keys

Signatures & keys: storing the trust root and enforcing revocation on-device

A secure boot chain is only as strong as the trust anchor and the acceptance rules. This chapter defines device-side key hierarchy, key IDs, and a practical revocation model that remains enforceable without backend deep dives.

Device-side key hierarchy (minimum model)

Root public key (anchor): immutable or strongly protected; defines “who is trusted to sign”.
Intermediate key (optional): enables rotation without replacing the anchor; can be used to sign image-signing keys.
Image signing key: signs firmware images/manifests; private key is never stored on the device.

The device verifies a chain of signatures and then enforces policy: allowed key IDs, revoked key IDs, and freshness gates (anti-rollback comes next).

Rotation & revocation (on-device data structures and rules)

Element	What it stores	Enforcement rule (device-side)
Key ID	Compact identifier bound to a public key (or public key hash).	Verification output must include Key ID for audit and policy decisions.
Allow list	Set of acceptable Key IDs + public key hashes (can be versioned).	If Key ID not in allow list → reject (no fallback to “unknown but valid”).
Revoke list	Set of revoked Key IDs (optionally tagged with “revoked since version”).	Revoke list has higher priority than allow list; if revoked → reject even if signature verifies.
Acceptance policy	Decision order: revoked? allowed? version gate? integrity?	Reject on revoked/rollback attempts; allow rescue only for corruption/power-loss cases.
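The decision order in the table above can be sketched as a single function. This is a minimal illustration, not a definitive implementation: the `Candidate` fields, the `Decision` names, and the use of plain Python sets for the allow and revoke lists are assumptions made for the example.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    ACCEPT = "accept"
    REJECT_REVOKED = "reject_revoked"
    REJECT_UNKNOWN_KEY = "reject_unknown_key"
    REJECT_ROLLBACK = "reject_rollback"
    REJECT_INTEGRITY = "reject_integrity"


@dataclass(frozen=True)
class Candidate:
    """Hypothetical summary of a candidate image after parsing and verification."""
    key_id: int
    build_number: int
    signature_ok: bool
    hash_ok: bool


def decide(candidate, allow_list, revoke_list, floor):
    """Apply the gates in the table's order: revoked? allowed? version gate? integrity?"""
    if candidate.key_id in revoke_list:
        # Revocation wins even if the signature itself verifies.
        return Decision.REJECT_REVOKED
    if candidate.key_id not in allow_list:
        # No fallback to "unknown but valid".
        return Decision.REJECT_UNKNOWN_KEY
    if candidate.build_number < floor:
        return Decision.REJECT_ROLLBACK
    if not (candidate.signature_ok and candidate.hash_ok):
        return Decision.REJECT_INTEGRITY
    return Decision.ACCEPT
```

Returning an enum rather than a boolean keeps the reject reason available for the evidence log, so a revoked key and a corrupted image are never reported as the same failure.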

Where to store the trust anchor (ROM / OTP / secure element)

ROM anchor: strongest immutability; minimal flexibility. Best for high-integrity boot roots.
OTP/eFuse anchor: strong protection with limited update patterns (e.g., key index, hash slots).
Secure element: strong isolation and useful features (counters, protected keys); adds integration cost and supply considerations.
Plain flash: highest flexibility but highest tamper risk; requires additional protections and is not suitable as a sole root anchor.

Anti-rollback complements revocation: revocation answers “who may sign”, freshness answers “is it new enough”.

Debug interface policy (dev vs production without foot-guns)

Dev mode: allows debug for bring-up; must be visibly and logically separated from production trust (test keys, reduced restrictions).
Production mode: debug access is locked or constrained; secure boot is mandatory; only production keys are accepted.
Hard rule: test keys and relaxed gates must not remain reachable in production acceptance policy.

Figure F2 — Key hierarchy, storage locations, allow/revoke enforcement, and mode gates

The next chapter (anti-rollback) completes freshness enforcement so a valid signature cannot be used to downgrade to an older, vulnerable image.

H2-5 · Anti-rollback

Anti-rollback: versions, counters, and enforceable freshness

A valid signature only proves origin and integrity. Freshness proves the image is not an older, revoked, or below-floor build. Anti-rollback must be enforced as a hard gate before execution and before committing an update.

Freshness gate (minimum rule)

Gate condition: candidate_build ≥ device_floor (monotonic integer compare).
Where enforced: at boot (ROM/1st stage/bootloader) and at update commit (before setting “pending”).
What is recorded: device_floor, candidate_build, decision_code, selected_slot.

Evidence: floor · candidate build · decision code · slot
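A minimal sketch of the gate and its evidence record follows. The record layout, the decision-code strings, and the `bump_floor` helper are assumptions for illustration; on real hardware the floor would live in OTP, a secure element, or a protected counter rather than in a Python integer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FreshnessRecord:
    """Evidence fields named in the text; the layout itself is hypothetical."""
    device_floor: int
    candidate_build: int
    decision_code: str
    selected_slot: str
    allowed: bool


def freshness_gate(candidate_build, device_floor, selected_slot):
    """Monotonic integer compare: candidate_build >= device_floor."""
    allowed = candidate_build >= device_floor
    code = "FRESH_OK" if allowed else "FRESH_REJECT_BELOW_FLOOR"
    return FreshnessRecord(device_floor, candidate_build, code, selected_slot, allowed)


def bump_floor(current_floor, confirmed_build):
    """Raise the floor after a confirmed boot; the result never regresses."""
    return max(current_floor, confirmed_build)
```

The same function can be called at boot and before setting "pending", so both enforcement points produce identical evidence fields.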

Three common implementations (practical tradeoffs)

Method	Strength	Common failure mode	Engineering mitigation
OTP/eFuse monotonic version	High tamper resistance; “only increases”.	Incorrect programming or wrong floor bump can block valid images.	Use staged floor bump (post-boot confirm), verify write success, and tie floor to signed manifest.
Secure element monotonic counter	High protection plus flexible features (counter/secure storage).	Integration/BOM complexity; counter access failures must fail-safe.	Define deterministic fallback (recovery only), cache read-only floor for boot, and log SE status codes.
Protected flash counter	Cost-effective but weaker against physical tampering.	Counter rollback via rewrite; power-loss during update can desync state.	Use redundancy (dual copies + voting), journal/COW updates, bind counter to signed metadata, and treat downgrade detection as recovery-only.

Version encoding (avoid ambiguous compares)

Recommended: use a monotonic build number (uint32/uint64) as the freshness key.
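SemVer: suitable for display and compatibility labels, but not as the sole freshness gate: string comparison orders 1.10.0 before 1.9.0, and pre-release tags and build metadata do not map cleanly to a single integer compare.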
Policy: gate on build number; optionally display SemVer separately as a human-facing string.
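The ambiguity can be shown in a few lines. This sketch assumes a hypothetical `semver_core` helper that keeps only the numeric MAJOR.MINOR.PATCH part; it is not a full SemVer parser.

```python
def semver_core(text):
    """Return (major, minor, patch), dropping pre-release and build metadata."""
    core = text.split("+", 1)[0].split("-", 1)[0]
    major, minor, patch = (int(part) for part in core.split("."))
    return (major, minor, patch)


def naive_string_is_older(a, b):
    """Lexicographic compare, shown only to illustrate the failure mode."""
    return a < b


def build_is_fresh(candidate_build, device_floor):
    """The recommended gate: one unsigned integer compare."""
    return candidate_build >= device_floor
```

With these helpers, `naive_string_is_older("1.10.0", "1.9.0")` is true even though 1.10.0 is the newer release, and `semver_core("2.0.0-rc.1")` equals `semver_core("2.0.0")`, so a release candidate and the final build become indistinguishable. A monotonic build number avoids both problems.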

Failure policy + A/B interaction (non-negotiable rules)

Rollback detected (candidate_build < floor) → reject boot and enter recovery/rescue.
Known-good fallback allowed: switch to the other slot only if it is not below floor and not revoked.
Revoked overrides “known-good”: a revoked version must never become bootable again.

Freshness gate and key revocation gate must both pass. One cannot compensate for the other.

Figure F3 — Version state + A/B decision matrix (freshness + revocation gates)

Freshness enforcement becomes actionable only when the update package carries unambiguous version fields and signed metadata—defined next.

H2-6 · Manifest

Firmware images & manifest: minimum fields for verification and recovery

Treat the OTA package as an engineering interface: a signed manifest that drives verification, storage writes, commit points, and recovery—without relying on transport-protocol details.

Manifest is the single source of truth

Verify manifest first, then trust payload. All critical fields must be covered by the manifest signature.
Manifest drives the state machine: download → verify → write → re-verify → commit → reboot → confirm.
Evidence is mandatory: each step produces a small, stable record (hash/sig/slot/code) for field debugging.

Minimum manifest fields (grouped by purpose)

Group	Field	Why it exists (device-side)
Identity & integrity	image_hash (final), signature, signer_key_id	Proves origin and the exact bytes that must be installed and booted.
Freshness & policy	build_number (monotonic), hw_compat (board/SoC ID)	Prevents downgrade and prevents installing an image built for different hardware.
Install target	slot_target (A/B), image_type (BL/OS/App), dependencies	Ensures the correct partition/slot is written and version-coupled components stay compatible.
Chunking & recovery	chunk_map (offset/len/order), chunk_hashes, compression	Enables resume/partial verification and power-loss recovery without accepting reordered or replaced chunks.
Storage interface	payload_size, chunk_size, staging_hint	Allows deterministic write planning and prevents overrun/partial state confusion.

If chunk hashes exist, the chunk map and ordering must be signature-covered; final image hash should still be verified after write.
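As a rough sketch, a device-side parser could check the fields above before any payload byte is trusted. The choice of which fields count as mandatory, the error-code strings, and the representation of `hw_compat` as a list of board IDs are assumptions for this example; signature verification over the manifest is assumed to have happened already.

```python
MANDATORY_FIELDS = (
    "image_hash",
    "signature",
    "signer_key_id",
    "build_number",
    "hw_compat",
    "slot_target",
    "image_type",
)


def validate_manifest(manifest, device_hw_id):
    """Return a list of error codes; an empty list means the structure is acceptable."""
    errors = [
        "MISSING_" + name.upper()
        for name in MANDATORY_FIELDS
        if manifest.get(name) is None
    ]
    hw_compat = manifest.get("hw_compat")
    if hw_compat is not None and device_hw_id not in hw_compat:
        errors.append("HW_INCOMPATIBLE")
    # Chunk hashes are only meaningful with a signature-covered chunk map.
    if manifest.get("chunk_hashes") is not None and manifest.get("chunk_map") is None:
        errors.append("CHUNK_MAP_REQUIRED")
    return errors
```

Collecting every error instead of stopping at the first one gives the evidence log a complete picture of why a package was refused.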

Full image vs delta (security and consistency costs)

Full image: simplest trust boundary; failure recovery is straightforward (retry download/write on inactive slot).
Delta: validation is harder; a “valid patch” does not guarantee a correct final image unless the final hash is checked.
Recommendation: prioritize full images for high reliability; if delta is used, require final-image verification and robust staging.

Hash structure (whole vs chunked) and anti-tamper requirements

Whole-image hash: simplest, but resume requires re-download or local caching.
Chunk hashes: enable resume and partial verification; require signed chunk map to prevent reordering/substitution.
Post-write rule: verify final image hash after storage writes before setting any boot flag.
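The three rules can be sketched with the standard library. SHA-256 and fixed-size chunks are assumptions here; the text does not prescribe a hash function or chunking scheme, and the chunk-hash list is assumed to come from an already verified manifest.

```python
import hashlib


def build_chunk_hashes(image, chunk_size):
    """Packaging side: one SHA-256 digest per fixed-size chunk, in order."""
    return [
        hashlib.sha256(image[offset:offset + chunk_size]).digest()
        for offset in range(0, len(image), chunk_size)
    ]


def verify_chunk(index, data, chunk_hashes):
    """Device side: a chunk is accepted only at its signed position."""
    if not 0 <= index < len(chunk_hashes):
        return False
    return hashlib.sha256(data).digest() == chunk_hashes[index]


def verify_final_image(image, image_hash):
    """Post-write rule: re-hash what was written before setting any boot flag."""
    return hashlib.sha256(image).digest() == image_hash
```

Binding each digest to an index is what prevents reordering: a genuine chunk presented at the wrong position fails `verify_chunk` even though its bytes are authentic.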

Storage + commit interface (where devices typically fail)

Staging first: manifest and progress metadata must be written to a staging area that does not break the current bootable slot.
Write inactive slot: never overwrite the currently confirmed slot during normal OTA.
Commit point: only after verify passes → set boot flag to pending (atomic write).
Confirm point: after post-boot self-check → set confirmed; otherwise rollback per attempt counter.

Figure F4 — Update package flow (download→verify→write→commit→reboot→confirm) with evidence outputs

Later chapters can reuse this evidence stream to build deterministic field-debug workflows without exposing platform or transport details.

H2-7 · Storage encryption

Encrypted storage: partitioning and device-side key strategy

Encryption-at-rest is a storage policy, not a replacement for signature verification. Treat firmware, configuration, and secrets as different asset classes with different write patterns and recovery constraints.

Asset classes (what is protected and why)

Asset	Write pattern	Primary risks	Minimum device-side controls
Firmware (slots)	Rare, large writes	IP exposure, offline analysis, tampered images	Signed images + anti-rollback gate; encryption is optional and must not break recovery.
Config	Medium, frequent updates	Privacy leakage, policy manipulation, replay of old config	AEAD (encrypt + authenticate), versioning/journal, optional freshness rules for critical fields.
Secrets	Small objects	Key extraction, cloning, credential reuse	Per-device keys, key wrapping, secure storage policy, revocation/rotation readiness.

Firmware: signed · Config: AEAD · Secrets: wrap · Recovery-safe

Encryption-at-rest options (how to choose)

Full-disk: strongest coverage but increases boot-chain and rescue constraints; only suitable when early-stage decryption is guaranteed.
Partition encryption: practical default—encrypt Config and Secrets first; encrypt firmware slots only if recovery remains deterministic.
Secrets-only: lowest cost; acceptable only when config is not sensitive and does not enable high-impact policy manipulation.

Device-side keys: per-device root + key wrapping (minimal model)

Root: hardware-anchored device unique secret (OTP/eFuse/PUF/secure element) used as the trust anchor.
KEK: key-encryption key derived from the root; used to wrap data keys.
DEK: data-encryption key used for a specific partition/object (Config/Secrets, optional for firmware slots).
Rule: DEK must never be stored in plaintext; store wrapped_DEK + keyslot_id + policy.
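The derivation step of the ladder might look like the sketch below. It assumes HKDF with SHA-256 (RFC 5869) as the derivation function, which the text does not specify, and it assumes the root secret is readable by software; on many devices the derivation happens inside hardware instead. Wrapping the DEK itself needs an authenticated key-wrap primitive that the Python standard library does not provide, so only the derivation and the stored record shape are shown.

```python
import hashlib
import hmac
from dataclasses import dataclass


def hkdf_sha256(root_secret, salt, info, length=32):
    """HKDF extract-then-expand (RFC 5869) built from HMAC-SHA256."""
    prk = hmac.new(salt, root_secret, hashlib.sha256).digest()
    okm, block, counter = b"", b"", 1
    while len(okm) < length:
        block = hmac.new(prk, block + info + bytes([counter]), hashlib.sha256).digest()
        okm += block
        counter += 1
    return okm[:length]


@dataclass(frozen=True)
class KeyslotRecord:
    """What is stored at rest: never the plaintext DEK (hypothetical layout)."""
    keyslot_id: int
    wrapped_dek: bytes
    policy: str


def derive_kek(root_secret, keyslot_id):
    """Derive one KEK per keyslot so partitions do not share wrapping keys."""
    info = b"kek:" + keyslot_id.to_bytes(4, "big")
    return hkdf_sha256(root_secret, salt=b"ota-storage", info=info)
```

Binding the keyslot ID into the `info` string is a design choice for the example: it keeps the Config and Secrets partitions cryptographically separate even though both descend from the same root.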

Key rotation is a device-side rule: new KeyID must be accepted only when authorized, and old KeyID must become unbootable/unusable when revoked.

Nonce/IV and power-loss consistency (typical real-world failure modes)

AEAD requirement: encryption without authentication is insufficient for Config/Secrets.
Nonce/IV rule: never reuse a nonce with the same key; monotonic or random nonce must be stored consistently with the ciphertext.
Write amplification: hot config keys should be journaled as small records; avoid re-encrypting large regions on every write.
Crash consistency: nonce/counter state must be updated with journal/COW so a brownout cannot desynchronize metadata and ciphertext.
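One way to satisfy the nonce rule under power loss is to reserve the counter value before it is used. The sketch below assumes a 96-bit nonce built from a 32-bit keyslot ID and a 64-bit write counter; both the layout and the function names are illustrative, and the actual AEAD call is out of scope.

```python
import struct

NONCE_LEN = 12  # 96-bit nonce, a common size for AEAD modes (assumed here)


def make_nonce(keyslot_id, write_counter):
    """Pack keyslot ID and counter big-endian; unique while the counter never repeats."""
    if not 0 <= keyslot_id < 2**32:
        raise ValueError("keyslot_id out of range")
    if not 0 <= write_counter < 2**64:
        raise ValueError("write_counter out of range")
    return struct.pack(">IQ", keyslot_id, write_counter)


def reserve_nonce(keyslot_id, persisted_counter):
    """Return (nonce, next_counter).

    The caller must journal next_counter BEFORE writing ciphertext. If power
    is lost in between, the reserved value is skipped on the next boot rather
    than reused.
    """
    nonce = make_nonce(keyslot_id, persisted_counter)
    return nonce, persisted_counter + 1
```

Skipping a counter value after a crash wastes one nonce but is harmless; reusing one with the same key is not.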

Security vs recoverability: what “rescue” can access

Rescue not encrypted (but signed): maximizes recoverability; expose only minimal functionality.
Rescue encrypted + signed: only safe when early-stage decryption is guaranteed and key access is deterministic under fault conditions.
Hard requirement: recovery entry must remain possible even when Config/Secrets are corrupted or decryption fails.

Figure F5 — Partition map + key wrapping ladder (firmware/config/secrets)

Encryption policies must be compatible with crash consistency; nonce/metadata correctness becomes a first-class requirement under power loss.

H2-8 · PLP & atomic update

Power-loss-safe updates: A/B + atomic metadata + write gating

“Not bricking under power loss” is achieved by invariants: keep one confirmed bootable slot, make metadata updates recoverable, and gate writes as brownout approaches. PLP is treated as an interface requirement (rails/time/trigger), not a sizing exercise.

Power-loss risks (root causes → invariants)

Interrupted writes → inactive image corruption → invariant: never overwrite the confirmed slot during OTA.
Metadata corruption → unknown bootable slot → invariant: metadata must be journaled/COW with validation (CRC/version).
Counter/nonce desync → false rollback/AEAD failure → invariant: counters/nonces advance only with atomic commit records.

A/B + two-phase commit (state bits that must exist)

slot_state: CONFIRMED / PENDING / INVALID
attempt_count: limits repeated boot trials of a pending slot
active_slot + next_slot: deterministic selection source of truth
confirm flag: written only after post-boot health checks
floor: monotonic freshness threshold (must not regress)

Metadata journal / copy-on-write (atomicity in practice)

Dual copy + CRC + generation: read the newest valid record; write the other copy on updates.
Append-only journal: write small, ordered records; recover by replaying to the last valid entry.
Boot selection rule: if pending metadata is inconsistent, fall back to the last confirmed slot and enter recovery policy.
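The dual-copy pattern can be sketched as follows. The record layout (generation, active slot, slot state, attempt count, floor) and the use of CRC-32 from `zlib` are assumptions for the example; a real bootloader would read the two copies from fixed flash offsets.

```python
import struct
import zlib

# generation, active_slot, slot_state, attempt_count, 1 pad byte, floor
_BODY = struct.Struct("<IBBBxI")


def encode_record(generation, active_slot, slot_state, attempt_count, floor):
    """Serialize a metadata record and append its CRC-32."""
    body = _BODY.pack(generation, active_slot, slot_state, attempt_count, floor)
    return body + struct.pack("<I", zlib.crc32(body))


def decode_record(raw):
    """Return the field tuple, or None if the length or CRC check fails."""
    if len(raw) != _BODY.size + 4:
        return None
    body = raw[:_BODY.size]
    (stored_crc,) = struct.unpack("<I", raw[_BODY.size:])
    if zlib.crc32(body) != stored_crc:
        return None
    return _BODY.unpack(body)


def load_newest(copy0, copy1):
    """Read rule: the valid record with the highest generation wins."""
    valid = [rec for rec in (decode_record(copy0), decode_record(copy1)) if rec]
    return max(valid, key=lambda rec: rec[0]) if valid else None


def write_target(copy0, copy1):
    """Write rule: overwrite the invalid copy, else the older one."""
    rec0, rec1 = decode_record(copy0), decode_record(copy1)
    if rec0 is None:
        return 0
    if rec1 is None:
        return 1
    return 0 if rec0[0] < rec1[0] else 1
```

Because an update only ever touches the copy that is not the current source of truth, a write torn by power loss fails its CRC and the reader falls back to the previous generation.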

Power-loss window (interface metrics, not component sizing)

Commit bytes: worst-case bytes written for one state transition (meta + journal + counters).
Commit latency: worst-case time for flush/verify of commit bytes (includes erase/write/flush behavior).
Brownout threshold: below this, new erase/write must stop; only allow safe finalization of the current atomic record.

PLP requirements (device-side interface)

Rails held: compute + storage + required IO for completing atomic metadata write.
Hold time: commit latency + safety margin for deterministic finalization.
Write stop trigger: brownout interrupt / PMIC PG / ADC threshold; must immediately block new erase/write operations.
Safe action: complete or abort the current atomic record; never leave half-written metadata as the latest record.
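The hold-time requirement can be turned into a small worked estimate. The formula below is an illustrative assumption that combines the interface metrics named above (commit bytes, commit latency, detection and response delay, safety margin); the parameter names and the example numbers are hypothetical and must be replaced with worst-case values measured on the target storage.

```python
def commit_latency_ms(commit_bytes, page_size, page_program_ms, erase_ms=0.0):
    """Worst-case time to write one atomic commit, in whole pages plus any erase."""
    pages = -(-commit_bytes // page_size)  # ceiling division
    return erase_ms + pages * page_program_ms


def required_hold_time_ms(commit_bytes, page_size, page_program_ms,
                          detect_ms, erase_ms=0.0, margin=0.5):
    """Hold time = (detection/response delay + commit latency) * (1 + margin)."""
    latency = commit_latency_ms(commit_bytes, page_size, page_program_ms, erase_ms)
    return (detect_ms + latency) * (1.0 + margin)
```

For example, with a 96-byte commit, 256-byte pages, 3 ms worst-case page program, a pre-erased journal, 1 ms detection delay, and a 50% margin, the estimate is (1 + 3) × 1.5 = 6 ms. If the journal needs an erase inside the commit path, that erase time dominates, which is one reason to keep the journal pre-erased.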

Figure F6 — A/B + atomic commit under power loss (layout + state machine + 3 power-cut points)

PLP is expressed as “rails held + hold time + write-stop trigger”; the OTA design remains valid regardless of the underlying energy source.

H2-9 · Recovery

Recovery design: rollback, rescue modes, and the last lifeline

A secure OTA system must remain recoverable. Recovery is defined as a layered plan with a bounded confirmation window, deterministic slot selection rules, and field-operable rescue entry points.

Recovery ladder (from most reliable to most feature-rich)

Bootloader rescue: minimal dependencies; must always be reachable even when filesystem/config decryption fails.
Minimal OS rescue: adds drivers and tooling for validation/export, while staying under secure boot policy.
App-level safe mode: provides degraded service and operator feedback, but must not bypass signature or rollback gates.

Bootloader rescue · Min-OS rescue · App safe

Confirmation gate (post-boot health check)

Purpose: promote a pending slot to confirmed only after it proves basic health.
Checks: watchdog feed path, 1–3 critical services, storage readability, and a bounded timer window.
Rule: unconfirmed equals failure; a pending slot must not remain pending indefinitely.
Outputs: CONFIRMED flag (atomic) + small reason codes on failure (enum, not long strings).

Deterministic rollback policy (A/B selection rules)

Condition (hard evidence)	Action	Notes
PENDING exists and attempt_count < N	Boot PENDING slot (test run)	Health check must confirm within the window.
Health check fails or attempt_count reaches N	Mark PENDING invalid (or lower priority), roll back to last CONFIRMED	Record rollback_reason and last error_code.
No CONFIRMED slot is bootable	Enter Bootloader rescue	Rescue must still enforce signature/floor rules.
Candidate version < floor (revoked/too old)	Disallow boot even if image verifies	Rollback is allowed only to known-good versions ≥ floor.
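The table above can be read as a deterministic selection function. In this sketch the `Slot` fields and the returned action names are assumptions; `verified` stands for the combined result of signature, hash, and key-policy checks, and the in-memory state changes stand in for atomic metadata writes.

```python
from dataclasses import dataclass


@dataclass
class Slot:
    name: str
    state: str          # "CONFIRMED", "PENDING", or "INVALID"
    build: int
    verified: bool      # signature + hash + key policy all passed
    attempt_count: int = 0


def select_boot(slots, floor, max_attempts):
    """Return (action, slot_name) following the rollback policy table."""
    def bootable(slot):
        return slot.verified and slot.build >= floor

    for slot in slots:
        if slot.state == "PENDING":
            if bootable(slot) and slot.attempt_count < max_attempts:
                slot.attempt_count += 1      # bounded trial boot
                return ("BOOT_PENDING", slot.name)
            slot.state = "INVALID"           # attempts exhausted or gate failed
    for slot in slots:
        if slot.state == "CONFIRMED" and bootable(slot):
            return ("BOOT_CONFIRMED", slot.name)
    return ("RESCUE", None)
```

Note that the floor check applies to the fallback slot as well: a CONFIRMED image that has fallen below the floor is not bootable, so the function routes to rescue instead.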

Field operability (engineering interfaces only)

Physical entry: button or GPIO strap at power-on to force rescue mode.
Console entry: UART/USB commands to inspect slot state, export minimal logs, and trigger rollback/rescue.
Production guard: development vs production mode gating to prevent rescue paths from bypassing verification.

A rescue path is a controlled interface, not a backdoor.

Non-negotiable “last lifeline” invariants

Keep at least one CONFIRMED bootable slot at all times; updates modify only the inactive slot.
Make metadata truth recoverable: journal/COW with validation and deterministic fallback rules.
Bound trials: pending has an attempt limit; failure triggers automatic rollback or rescue.

Figure F7 — Recovery ladder + confirm gate + field entry points

Recovery stays within device boundaries: deterministic slot policy + bounded confirmation + controlled rescue interfaces.

H2-10 · Hard evidence

Hard evidence and observability: what to collect when updates fail

Debugging must start from hard evidence: first locate where the state machine stopped, then classify the failure as cryptographic/package, storage/atomicity, or power/brownout related.

Minimal event set (recommended as mandatory)

Event	Fields (keep it structured)	Why it matters
EVT_VERIFY_SIG	result, key_id, error_code	Separates “image authenticity” failures from all other classes.
EVT_PARSE_MANIFEST	result, version, target_slot, compat	Explains why a valid signature still cannot be accepted.
EVT_SLOT_TRANSITION	from_state → to_state, attempt_count	Locates exactly where the OTA flow stopped.
EVT_REBOOT_REASON	wdt / brownout / software, brownout_flag	Distinguishes logic bugs from power integrity failures.
EVT_ROLLBACK	reason, floor, candidate_version (redacted)	Proves whether anti-rollback rules blocked a boot attempt.

Use event codes + small fields; avoid long text logs in critical paths.

Local log integrity (no external SIEM assumed)

Append-only: records are appended, not overwritten; keeps the latest failure evidence intact.
Tamper-evident digest (optional): periodic digest over recent records to detect easy edits.
Redaction: never print secrets; prefer KeyID/counter/error enums instead of sensitive content.
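One simple way to get append-only, tamper-evident behaviour is a hash chain, where each record's digest covers the previous digest. This is a swapped-in variant of the periodic digest described above, not the only option; the JSON encoding and SHA-256 are assumptions for the sketch, and a firmware implementation would use fixed-size binary records.

```python
import hashlib
import json

GENESIS = b"\x00" * 32


def _payload(event):
    """Canonical byte encoding of an event so digests are reproducible."""
    return json.dumps(event, sort_keys=True, separators=(",", ":")).encode()


def append_event(log, event):
    """Append a record whose digest chains to the previous record."""
    prev = bytes.fromhex(log[-1]["digest"]) if log else GENESIS
    digest = hashlib.sha256(prev + _payload(event)).hexdigest()
    log.append({"event": event, "digest": digest})


def verify_log(log):
    """Recompute the chain; any edited or removed middle record breaks it."""
    prev = GENESIS
    for record in log:
        expected = hashlib.sha256(prev + _payload(record["event"])).hexdigest()
        if expected != record["digest"]:
            return False
        prev = bytes.fromhex(record["digest"])
    return True
```

An unkeyed chain only detects easy edits: an attacker who can rewrite the whole log can recompute it. Anchoring the latest digest in protected storage would raise the bar, at the cost of one more write per event.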

Field triage order (hard evidence first)

Step 1: read the last EVT_SLOT_TRANSITION to locate the stuck state (DOWNLOAD / STAGING / PENDING / BOOT_TEST).
Step 2: check EVT_VERIFY_SIG and EVT_PARSE_MANIFEST to classify crypto/package issues.
Step 3: check EVT_REBOOT_REASON + brownout flags to classify power-loss / brownout behavior.
Step 4: check EVT_ROLLBACK (floor, candidate_version) to confirm anti-rollback blocks.

Classification by evidence (what it usually indicates)

Crypto/package: signature failure, incompatible manifest, version below floor.
Storage/atomicity: meta CRC failure, journal replay issues, AEAD decrypt failure due to nonce/metadata mismatch.
Power/brownout: repeated brownout flags, reboot reasons pointing to power, transitions stopping during writes.
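The triage order and the classification can be combined into one small routine. The event dictionaries below reuse the event names from the table earlier in this chapter, but the exact field values (`"ok"`, boolean `brownout_flag`) and the returned class labels are assumptions for illustration.

```python
def last_event(events, name):
    """Most recent event with the given name, or None (events are oldest-first)."""
    for event in reversed(events):
        if event["evt"] == name:
            return event
    return None


def triage(events):
    """Return (stuck_state, failure_class) following steps 1 to 4."""
    transition = last_event(events, "EVT_SLOT_TRANSITION")
    stuck_state = transition["to_state"] if transition else "UNKNOWN"

    sig = last_event(events, "EVT_VERIFY_SIG")
    manifest = last_event(events, "EVT_PARSE_MANIFEST")
    reboot = last_event(events, "EVT_REBOOT_REASON")
    rollback = last_event(events, "EVT_ROLLBACK")

    if sig and sig["result"] != "ok":
        return stuck_state, "crypto/package"
    if manifest and manifest["result"] != "ok":
        return stuck_state, "crypto/package"
    if reboot and reboot.get("brownout_flag"):
        return stuck_state, "power/brownout"
    if rollback:
        # Version below floor is listed under crypto/package above.
        return stuck_state, "crypto/package"
    return stuck_state, "unclassified"
```

Storage and atomicity failures fall through to "unclassified" here because the minimal event set has no dedicated metadata event; a design that logs metadata CRC or journal replay errors could add a branch for them.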

Minimal export bundle (hand-off friendly)

Last N event records + current slot snapshot (active/next/attempt/floor).
Current firmware version + candidate version + last rollback reason.
Manifest header digest (hash only) for correlation without leaking content.

Figure F8 — Debug decision flow: locate state → classify by evidence → next action

This chapter stays device-local: append-only evidence + small structured fields + a deterministic triage order.

H2-11 · Validation

Validation & testing: proving OTA is secure and non-bricking

A Secure OTA Module is verified by repeatable, device-side tests that produce hard evidence: deterministic state transitions, reject reasons for malicious inputs, and recovery guarantees under power interruption.

What “secure and non-bricking” means in measurable assertions

Authorization: only signed-and-allowed images can reach PENDING / CONFIRMED.
Freshness: any image below rollback floor is blocked even if signature verifies.
Atomicity: metadata truth is recoverable after any interruption (journal/COW + validation).
Recoverability: failure always ends in CONFIRMED boot or controlled rescue entry (bounded attempts).
Observability: failure is classifiable by device-local evidence (state vs crypto vs power).

Reference BOM (example material numbers for device-side Secure OTA)

These part numbers are common reference options for secure boot + signed update + local evidence. Final selection must match required security level, lifecycle, and availability.

Block	Example material numbers	Typical role in OTA proof
Secure MCU / SoC	NXP LPC55S69 · NXP i.MX RT1062 · ST STM32U585 · ST STM32H573 · Renesas RA6M5 · Infineon PSoC 64	Secure boot root, signature verification, slot state machine, evidence events.
Secure Element	Microchip ATECC608B · NXP SE050 · Infineon OPTIGA™ Trust M (SLS32AIA010MS) · ST STSAFE-A110	Protected key storage, monotonic counter (freshness), key rotation/revocation anchors.
Discrete TPM (optional)	Infineon OPTIGA™ TPM SLB 9670 / SLB 9672 · Nuvoton NPCT75x (family)	Hardware trust anchor / measured evidence (optional), policy-controlled key use.
External NOR Flash	Winbond W25Q128JV · Macronix MX25L128 / MX25U128 · Micron MT25Q (family)	Slot A/B images, staging areas, metadata journal; supports power-cut sweep tests.
eMMC storage (if used)	Micron eMMC (MTFC… family) · Kioxia eMMC (THG… family)	High-capacity slots/delta chunks; validate atomic commit + filesystem resilience.
Reset / watchdog supervisor	TI TPS3431 · TI TPS3850 · Maxim/ADI MAX16054	Enforces bounded boot-test window; provides reboot reason evidence and recovery determinism.
Voltage monitor / brownout detect	TI TPS3703 · Microchip MCP1316 (family)	Gates writes near brownout and records brownout events for power-cut/boundary validation.

Practical validation tip: at least one reference platform should include a Secure Element (e.g., ATECC608B / SE050 / STSAFE-A110) to test monotonic counters and key revocation behavior under power interruptions.

Functional tests (state machine evidence)

Full update path: IDLE → DOWNLOADING → STAGING → PENDING → BOOT_TEST → CONFIRMED.
Resume: interrupt download and restart; chunk verification resumes without nonce/counter inconsistency (when AEAD is used).
A/B alternation: two consecutive releases each install into the then-inactive slot; one CONFIRMED slot remains bootable at all times.
Auto-rollback: fail boot-test and confirm automatic rollback to last CONFIRMED within bounded attempts.
Rescue entry: force rescue via button/strap/UART policy and verify signature/floor gates still apply.

Suggested evidence hooks: EVT_SLOT_TRANSITION + attempt_count + active/next slot snapshot.
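A functional test harness needs a reference model of the state machine to compare the device trace against. The sketch below is one possible model: the transition table, the `ROLLBACK` state, and the class name `SlotStateMachine` are assumptions layered on the update path listed above.

```python
# Allowed transitions; anything else is a test failure.
ALLOWED = {
    "IDLE": {"DOWNLOADING"},
    "DOWNLOADING": {"STAGING", "IDLE"},
    "STAGING": {"PENDING", "IDLE"},
    "PENDING": {"BOOT_TEST"},
    "BOOT_TEST": {"CONFIRMED", "PENDING", "ROLLBACK"},
    "CONFIRMED": {"IDLE"},
    "ROLLBACK": {"IDLE"},
}

class SlotStateMachine:
    def __init__(self, max_attempts=3):
        self.state = "IDLE"
        self.attempt_count = 0
        self.max_attempts = max_attempts
        self.trace = []  # stand-in for EVT_SLOT_TRANSITION records

    def advance(self, new_state):
        """Move to new_state if the table allows it, recording the transition."""
        if new_state not in ALLOWED[self.state]:
            raise ValueError(f"illegal transition {self.state} -> {new_state}")
        self.trace.append((self.state, new_state, self.attempt_count))
        self.state = new_state

    def boot_test_result(self, ok):
        """Confirm on success; otherwise retry until the bound, then roll back."""
        if self.state != "BOOT_TEST":
            raise ValueError("boot-test result outside BOOT_TEST")
        if ok:
            self.advance("CONFIRMED")
            return
        self.attempt_count += 1
        exhausted = self.attempt_count >= self.max_attempts
        self.advance("ROLLBACK" if exhausted else "PENDING")
```

Comparing `trace` with the device's recorded transitions gives a direct pass/fail check for the full update path and the auto-rollback case.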

Security tests (reject conditions, not attack tutorials)

Forged signature: image must be rejected during verification (no PENDING state set).
Tampered manifest: hash/field mismatch must be rejected with a structured error code.
Replay: an older valid package must be blocked by freshness rules (counter/floor).
Rollback: candidate version below floor must be blocked even if signature verifies.
Key revoke: revoked KeyID must remain unusable across normal and rescue paths.

Reference platforms to cover this class: Secure Element-based monotonic counters (e.g., ATECC608B / SE050 / STSAFE-A110) and at least one MCU with ROM/OTP trust anchor support (e.g., LPC55S69 / STM32U585 / STM32H573).
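The reject conditions above map naturally onto a single admission gate that returns a structured reason. In this sketch the `Manifest` fields, the `Reject` codes, and the injected `verify_signature` callable (standing in for the platform's ECC verify) are all hypothetical names chosen for illustration.

```python
import hashlib
from dataclasses import dataclass
from enum import Enum

class Reject(Enum):
    OK = 0
    KEY_REVOKED = 1
    BAD_SIGNATURE = 2
    HASH_MISMATCH = 3
    INCOMPATIBLE_HW = 4
    BELOW_FLOOR = 5

@dataclass(frozen=True)
class Manifest:
    key_id: int
    version: int
    hw_compat: str
    image_sha256: bytes
    signature: bytes

def admit(manifest, image, *, floor, hw_id, revoked_key_ids, verify_signature):
    """Return Reject.OK only if every gate passes; PENDING may be set only then."""
    if manifest.key_id in revoked_key_ids:
        return Reject.KEY_REVOKED
    if not verify_signature(manifest):       # placeholder for ECC verification
        return Reject.BAD_SIGNATURE
    if hashlib.sha256(image).digest() != manifest.image_sha256:
        return Reject.HASH_MISMATCH
    if manifest.hw_compat != hw_id:
        return Reject.INCOMPATIBLE_HW
    if manifest.version < floor:             # freshness: valid signature is not enough
        return Reject.BELOW_FLOOR
    return Reject.OK
```

Because the floor check runs after signature verification, a replayed older package yields `BELOW_FLOOR` rather than `BAD_SIGNATURE`, which keeps the recorded reject reason accurate. Calling the same gate from the rescue path is one way to keep normal and rescue behavior identical.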

Reliability tests (power-cut sweep, brownout boundary, wear-out)

Power-cut sweep: repeatedly cut power at multiple phases (download / staging / meta commit / pending set / boot-test / confirm).
Pass criterion: after any interruption, the system must reach CONFIRMED boot or controlled rescue entry (never stuck in unknown).
Brownout boundary: verify “write gating” near brownout; metadata truth remains recoverable (journal/COW validation works).
Wear-out: stress metadata/journal updates and confirm stable recovery rules (no dual-slot loss).

Recommended evidence hooks: EVT_REBOOT_REASON + brownout_flag + meta validation result + last rollback reason. Suggested supporting parts for repeatability: watchdog supervisor (TPS3431/TPS3850/MAX16054) and voltage monitor (TPS3703/MCP1316 family).
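A power-cut sweep can be rehearsed on a host model before it is run on hardware. The sketch below assumes a dual-copy metadata scheme (two CRC-protected copies with a sequence number) and models a power cut as a program operation that stops after N bytes; the layout and function names are illustrative, and a cut during the erase itself is not modeled.

```python
import struct
import zlib

COPY_SIZE = 12  # seq (u32) + active_slot (u8) + state (u8) + pad (u16) + crc32 (u32)
ERASED = b"\xff" * COPY_SIZE

def pack_meta(seq, active_slot, state):
    body = struct.pack("<IBBH", seq, active_slot, state, 0)
    return body + struct.pack("<I", zlib.crc32(body))

def read_valid(copy):
    """Return (seq, active_slot, state), or None if erased or CRC-invalid."""
    if copy == ERASED:
        return None
    body = copy[:8]
    (crc,) = struct.unpack("<I", copy[8:])
    if zlib.crc32(body) != crc:
        return None
    seq, active_slot, state, _pad = struct.unpack("<IBBH", body)
    return seq, active_slot, state

def _copies(flash):
    return [read_valid(bytes(flash[i:i + COPY_SIZE])) for i in (0, COPY_SIZE)]

def recover(flash):
    """Pick the valid copy with the highest sequence number."""
    valid = [c for c in _copies(flash) if c is not None]
    return max(valid) if valid else None

def commit(flash, new_meta, cut_after=None):
    """Rewrite the invalid or older copy; stop after cut_after bytes to model a cut."""
    first, second = _copies(flash)
    older_is_first = first is None or (second is not None and first < second)
    base = 0 if older_is_first else COPY_SIZE
    flash[base:base + COPY_SIZE] = ERASED       # erase, then program
    n = COPY_SIZE if cut_after is None else cut_after
    flash[base:base + n] = new_meta[:n]

def sweep(old_meta, new_meta):
    """Cut power after every byte count and collect what recovery reports."""
    outcomes = set()
    for cut in range(COPY_SIZE + 1):
        flash = bytearray(old_meta + ERASED)
        commit(flash, new_meta, cut_after=cut)
        outcomes.add(recover(flash))
    return outcomes
```

The pass criterion from the list above becomes a set check: every outcome of `sweep` must be either the old or the new metadata, and `None` (unknown truth) must never appear.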

Adversarial checks (coverage only) + pre-release acceptance checklist (15 items)

Adversarial coverage is validated by confirming that safety policies trigger and recovery remains deterministic, without describing fault-injection procedures.

#	Must-pass item	Expected hard evidence
1	Normal update reaches CONFIRMED	EVT_SLOT_TRANSITION sequence complete; CONFIRMED set atomically
2	Resume works after forced reboot during download	Chunk verification consistent; no counter/nonce mismatch error
3	A/B alternation across two releases	Inactive slot updated; at least one bootable CONFIRMED always remains
4	Boot-test failure rolls back within bounded attempts	attempt_count increments; rollback_reason recorded; last CONFIRMED boots
5	Rescue entry is reachable and policy-controlled	Rescue flag/strap/button recorded; verification gates still enforced
6	Forged signature package is rejected	EVT_VERIFY_SIG=FAIL; PENDING not set
7	Tampered manifest is rejected	EVT_PARSE_MANIFEST=FAIL (structured error_code)
8	Replay of older signed package is blocked	floor/counter check blocks; rollback/reject reason captured
9	Rollback below floor is blocked	candidate_version < floor; boot refused; reason captured
10	Revoked key cannot install or boot	KeyID flagged revoked; verification/policy blocks in normal and rescue
11	Power-cut sweep passes across all phases	After each cut: CONFIRMED boot or RESCUE; never “unknown” slot truth
12	Metadata truth is recoverable after interruption	Journal replay OK; meta validation OK; deterministic slot selection
13	Brownout boundary does not corrupt truth	brownout_flag set; write gating prevents new writes; recovery succeeds
14	Attempt limit forces rollback/rescue (no endless pending)	attempt_count reaches N; auto rollback or rescue entry occurs
15	Failure is classifiable by device-local evidence	Each failure maps to at least one class (state / crypto / power) that is provable from events

Figure F9 — Test matrix: phase × cut point × expected outcome

The chapter remains device-side: tests prove deterministic recovery and policy enforcement using structured evidence, while avoiding cloud/fleet architecture and avoiding fault-injection procedures.

H2-12 · Selection

Component & architecture selection: MCU/SoC, Secure Element, storage, supervisors

This section turns Secure OTA mechanisms into concrete device-side capabilities. The goal is a closed loop: verified boot + signed updates + anti-rollback + atomic commit under power loss + recoverability + minimum evidence.

1) Capability-to-mechanism mapping (what each “ability” pays for)

Hardware capability (device-side)	Enables which mechanism	Primary failure prevented (field symptom)
Immutable boot root (ROM verified boot, or equivalent)	Chain-of-trust verification point (H2-3)	Unsigned image boot; “bootloader can be replaced”
Secure storage anchor (OTP/eFuse/secure NVM)	Root public-key hash / KeyID allowlist / policy floor (H2-4/H2-5)	“Signature passes but a revoked key still works”
Crypto acceleration (SHA-256/384, ECC verify, AES-GCM/CCM)	Fast verify, chunk validation, at-rest encryption (H2-3/H2-6/H2-7)	Long verify windows → higher power-cut brick rate
TRNG / reliable entropy	Nonce/IV correctness, per-device key derivation (H2-7)	Repeated IV/nonce; fragile encryption correctness
Isolation primitives (MPU/TrustZone-M/privilege)	Separates updater/boot policy from application (H2-3/H2-7/H2-10)	App bugs overwrite policy/logs; evidence becomes unreliable
Dual-image boot support (A/B slots + selection logic)	Recoverability and bounded attempts (H2-8/H2-9)	One bad update bricks device; no safe rollback path
Brownout detect + write gating	Atomic metadata commit under power loss (H2-8/H2-11)	“Meta truth lost” after dips; stuck in unknown boot state
Reset/watchdog supervision	Boot-test window, deterministic rollback/rescue (H2-9/H2-11)	Endless reboot loops; no controlled rollback

Practical guideline: prioritize abilities that shorten verification time, harden rollback floors, and keep metadata updates atomic.

2) MCU/SoC “must-have” abilities for a minimum Secure OTA loop

Verified boot root: an immutable verify point (ROM or equivalent) that cannot be bypassed in production mode.
Trust anchor storage: OTP/eFuse/secure NVM to pin root public-key hash and policy identifiers (KeyID allowlist / floor).
Signature verification: ECC verify + hashing (software is possible; hardware accel reduces brick risk by shrinking time windows).
Isolation: at least MPU/privilege separation so the updater/policy cannot be overwritten by normal application code.
Boot slot selection: supports A/B slot rules and a bounded-attempt boot-test gate (confirm-or-rollback behavior).

Example MCU/SoC references (material numbers): NXP LPC55S69, ST STM32U585, ST STM32H573, Renesas RA6M5, Infineon PSoC 64.

3) Secure Element: when it is worth it (and when it is not)

A Secure Element is most valuable when it provides device-side features that the main MCU cannot reliably emulate: key isolation and monotonic counters that enforce anti-rollback floors under physical access and power interruptions.

Strong fit: devices with physical exposure, strict anti-rollback, key revocation requirements, or a need for monotonic counters.
Neutral fit: moderate threat model, MCU already has robust OTP/TrustZone; SE is an optional safety margin.
Poor fit: ultra-cost-sensitive designs where supply/lifecycle risk dominates and threat model is low.

Example Secure Element references (material numbers): Microchip ATECC608B, NXP SE050, ST STSAFE-A110, Infineon OPTIGA™ Trust M (SLS32AIA010MS). Optional discrete TPM references: Infineon SLB 9670 / SLB 9672.

4) Storage choice for OTA: NOR vs eMMC/NAND (update-write behavior + PLP risk points)

QSPI NOR: deterministic erase/write units; excellent for A/B images + small metadata journal. Risk is sector erase latency → needs strict staging and verified commit.
eMMC: high capacity; good for full images and chunk maps. Risk is internal translation layers and power-loss sensitivity → requires stronger two-phase commit + write gating near brownout.
NAND: capacity-focused; bad blocks/ECC/translation layers increase the burden on “truth recovery” rules. Keep metadata/journal strategy conservative.

Example storage references (material numbers): Winbond W25Q128JV / W25Q256JV, Macronix MX25L128 / MX25U128, Micron MT25Q (family); eMMC examples: Micron MTFC… (family), Kioxia THG… (family).

5) Supervisors & power monitoring (device-side): brownout reset, write gating, watchdog policy

Brownout detect: capture brownout flags as evidence and block starting new erase/write when voltage is near the unsafe region.
Write gating threshold: define a “no new writes” threshold (and respect it in updater state machine) to protect metadata truth.
Watchdog + boot-test window: a bounded window prevents endless pending states; exceed window → rollback/rescue deterministically.

Example supervisor/monitor references (material numbers): TI TPS3431, TI TPS3850, Maxim/ADI MAX16054; voltage monitor examples: TI TPS3703, Microchip MCP1316 (family).
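Write gating can be reduced to a small predicate that the updater consults before starting any erase or program operation. The sketch below is an assumption-laden example: the millivolt thresholds are placeholders, and the hysteresis band is an added design choice rather than something the article prescribes.

```python
from dataclasses import dataclass

@dataclass
class WriteGate:
    block_below_mv: int = 3000    # hypothetical "no new writes" threshold
    resume_above_mv: int = 3150   # hypothetical release level (hysteresis)
    blocked: bool = False

    def may_start_write(self, supply_mv: int) -> bool:
        """Return True only if a new erase/program may begin at this voltage."""
        if supply_mv < self.block_below_mv:
            self.blocked = True
        elif supply_mv >= self.resume_above_mv:
            self.blocked = False
        # Between the two levels the previous decision is kept.
        return not self.blocked
```

The gate only prevents new writes from starting; a commit already in flight still has to complete within the hold-up time, which is why the commit payload is kept small.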

6) Three tier recipes (5-line capability table each) — entry / industrial / high-security

Tier	MCU/SoC	Trust anchor	Storage	Supervision
Entry (cost-focused)	NXP LPC55S69 / ST STM32U585 / Renesas RA6M5. ROM verify + OTP anchor + ECC/SHA accel preferred.	MCU OTP/eFuse pins root public-key hash + KeyID list. No external SE; keep policy minimal but strict.	QSPI NOR: Winbond W25Q128JV. A/B + meta journal on NOR.	WDT: TI TPS3431 + monitor: TI TPS3703. Boot-test window + brownout write gating.
Industrial (field resilience)	ST STM32H573 / NXP LPC55S69. Stronger isolation + faster verify reduces brick window.	Secure Element: ATECC608B / SE050 / STSAFE-A110. Monotonic counter + key isolation + revoke anchors.	Larger NOR (e.g., W25Q256JV) or eMMC (Micron MTFC…). Chunk map + staged write + strict commit.	TI TPS3850 + WDT policy. Deterministic rollback, brownout-safe metadata.
High-security (high threat)	Security-oriented platform (MCU/SoC with strong isolation domain). Policy separation + hardened verify path.	SE + optional TPM: Infineon OPTIGA Trust M (SLS32AIA010MS) + SLB 9670/9672 (optional). Strict revoke, stronger attestation hooks (device-side).	eMMC (Micron MTFC… / Kioxia THG…). Two-phase commit + journal + dual meta copies.	Supervisor + monitor combo (e.g., MAX16054 + TPS3703). Tight boot-test enforcement + write gating.

Each tier still requires the same “non-negotiables”: verified boot, anti-rollback floor, atomic metadata, bounded attempts, and minimum evidence events.

Figure F10 — BOM blocks → mechanisms → outcomes (device-side Secure OTA)

The selection logic is device-side: the goal is deterministic verification, atomic commits under power dips, and controlled recovery—without relying on backend orchestration.

H2-13 · FAQs

FAQs (12) — device-side secure OTA, answers + mappings

Each answer stays within the Secure OTA Module boundary: signature verification, anti-rollback, manifest correctness, encryption-at-rest, power-loss atomicity, recovery rules, evidence logs, and validation matrices.

Figure F11 — FAQ map: questions → chapters

Q1. Why can an “approved signature” still allow downgrading to old firmware?

Signature verification proves authenticity, not freshness. Anti-rollback requires a device-side floor enforced by a monotonic value (OTP/eFuse version, Secure Element counter, or a protected counter with tamper detection). The boot policy must reject any candidate image with version < floor, even if the signature verifies. Floor updates must be atomic and survive power loss.

Maps to: H2-5

Q2. What if both A/B slots are corrupted—where should a rescue image live?

A reliable last-resort image should sit outside the normal updatable slots: a ROM-resident minimal rescue (ideal), or a protected rescue region that is verified by the immutable boot root and written rarely. The rescue path must still enforce signature and rollback-floor rules, then re-provision a known-good slot. Avoid placing rescue in the same metadata domain as the A/B slots, where a single failed or malicious metadata commit could take it down along with them.

Maps to: H2-9

Q3. During updates, where does power loss brick devices most often, and how does two-phase commit prevent it?

The highest-risk moments are metadata “truth changes”: setting a pending boot flag, switching active slot pointers, or updating rollback counters. Two-phase commit avoids bricking by separating write/verify from activation: write to staging/inactive slot, verify hashes and signature, then atomically set PENDING with journaled metadata. Only after a successful boot-test does CONFIRMED get committed.

Maps to: H2-8, H2-6

Q4. Which manifest fields are mandatory, which are optional, and what breaks if they are missing?

Mandatory fields usually include image hash(es), signature, version/build number, target component/slot, and hardware compatibility constraints. Without compatibility and targeting, devices can install the wrong image and fail boot-test. Optional fields include dependencies, compression, and delta parameters. If chunk maps are missing in interrupted environments, resume becomes unsafe or inefficient, increasing brick risk during retries.

Maps to: H2-6

Q5. After key revocation, how can “old keys still work,” and what is the safest device-side fix?

Old keys remain effective when the device only checks “signature valid” but does not bind acceptance to an allowlist and a revocation state. The safest device-side fix is to anchor a KeyID allowlist (or a small revocation bitmap) in immutable/secure storage, and to require that the signing KeyID is both present and not revoked. Pair revocation with an updated rollback floor to block old signed packages permanently.

Maps to: H2-4, H2-5
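The allowlist-plus-revocation idea in Q5 might look like the following sketch. It models the revocation bitmap as OTP-style bits that can only be set, and raises the rollback floor together with each revocation; the class name `KeyPolicy` and the 32-entry bitmap size are assumptions.

```python
class KeyPolicy:
    """KeyID allowlist with a one-way revocation bitmap and a monotonic floor."""

    BITS = 32  # hypothetical bitmap width

    def __init__(self, allowed_key_ids, floor=0):
        ids = frozenset(allowed_key_ids)
        if any(not 0 <= k < self.BITS for k in ids):
            raise ValueError("key_id outside the bitmap range")
        self._allowed = ids
        self._revoked_bits = 0
        self.floor = floor

    def revoke(self, key_id, new_floor):
        """Set the revocation bit (never cleared) and raise the floor (never lowered)."""
        if not 0 <= key_id < self.BITS:
            raise ValueError("key_id outside the bitmap range")
        self._revoked_bits |= 1 << key_id
        self.floor = max(self.floor, new_floor)

    def is_usable(self, key_id) -> bool:
        """A key must be on the allowlist and not revoked."""
        if key_id not in self._allowed:
            return False
        return not (self._revoked_bits >> key_id) & 1
```

On real hardware both the bitmap and the floor would live in OTP/eFuse or a Secure Element so that neither can be rolled back by rewriting flash.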

Q6. Does storage encryption make rollback/recovery harder, and how can designs stay secure and recoverable?

Encryption-at-rest can complicate recovery if the rescue path cannot access keys or if metadata becomes unrecoverable after brownouts. A practical pattern is to encrypt secrets/config strongly while keeping boot-critical metadata minimal and journaled. If rescue must read encrypted images, ensure the trust anchor can unwrap keys in rescue mode. Avoid designs where encryption keys depend on mutable state that can be lost mid-update.

Maps to: H2-7, H2-9

Q7. Delta updates save bandwidth—what new security and reliability pitfalls appear?

Delta updates increase complexity because correctness depends on the exact base version and patch application order. Security pitfalls include inadequate verification granularity (only verifying the final image) and replaying a delta against an unexpected base. Reliability pitfalls include harder recovery after interruptions and increased write amplification. A robust approach verifies chunk-level hashes, validates base-version binding, and supports safe resume with deterministic commit rules.

Maps to: H2-6
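Base-version binding and chunk-level verification from Q7 can be sketched with plain SHA-256 hashes. The function names and the idea of storing one hash per fixed-size chunk in a chunk map are illustrative assumptions; patch application itself is out of scope here.

```python
import hashlib

def chunk_hashes(data: bytes, chunk_size: int):
    """Build a chunk map: one SHA-256 digest per fixed-size chunk."""
    return [hashlib.sha256(data[i:i + chunk_size]).digest()
            for i in range(0, len(data), chunk_size)]

def check_base(installed_image: bytes, expected_base_sha256: bytes) -> bool:
    """Base binding: the delta applies only to the exact image it was built from."""
    return hashlib.sha256(installed_image).digest() == expected_base_sha256

def resume_point(staged: bytes, chunk_map, chunk_size: int) -> int:
    """Return the index of the first chunk that is missing or fails its hash."""
    for index, expected in enumerate(chunk_map):
        chunk = staged[index * chunk_size:(index + 1) * chunk_size]
        if hashlib.sha256(chunk).digest() != expected:
            return index
    return len(chunk_map)
```

After an interruption, the updater would restart writing at `resume_point(...)` instead of trusting a byte offset that may have been recorded just before the power cut.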

Q8. In frequent power-loss environments, how can the minimum PLP window be estimated for atomic metadata commits?

The minimum power-loss-protection (PLP) window is driven by the worst-case “atomic commit payload”: journal record write + CRC/validation + pointer/epoch update (often duplicated). Estimate the maximum write latency of that payload on the target storage (NOR page program plus any sector erase the commit requires; eMMC page write + internal housekeeping), then add margin for brownout detection and firmware response time. Gate new writes near the brownout threshold so commits never start in unsafe voltage regions.

Maps to: H2-8
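The estimate in Q8 is simple arithmetic once the worst-case latencies are known. The helper below is a sketch: the parameter names and the proportional safety margin are assumptions, and the numbers in the usage note are placeholders rather than datasheet values.

```python
def plp_window_us(journal_write_us, pointer_update_us, copies, validate_us,
                  detect_us, response_us, margin=0.25):
    """Estimate the hold-up time needed to finish one atomic metadata commit.

    All inputs are worst-case latencies in microseconds; `copies` is how many
    times the journal record and pointer/epoch update are duplicated.
    """
    commit_us = copies * (journal_write_us + pointer_update_us) + validate_us
    return (detect_us + response_us + commit_us) * (1 + margin)
```

With placeholder inputs of 700 µs per journal write, 300 µs per pointer update, two copies, 50 µs validation, 20 µs detection, and 30 µs firmware response, the commit takes 2050 µs, the total before margin is 2100 µs, and a 25% margin gives a 2625 µs window. Real inputs should come from the storage datasheet maxima and measured firmware latency.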

Q9. If Secure Boot fails, should devices refuse to boot or enter a safe mode—how to choose without false positives?

Refusing to boot is correct when authenticity cannot be established (signature/policy failure). A controlled safe/rescue mode is appropriate when the system can still enforce trust rules but needs a recovery path (e.g., slot corruption or boot-test failure). The policy should be deterministic: signature/policy failures → REJECT/RESCUE; boot-test failures → bounded retries then ROLLBACK/RESCUE. Supervisors/watchdogs help avoid endless loops.

Maps to: H2-3, H2-9

Q10. When OTA fails, what evidence should be checked first, and which 5 log fields narrow root cause fastest?

Start with the updater state machine step, then separate crypto vs storage vs power. The fastest five fields are: (1) state transition last step, (2) signature verification result/error code, (3) manifest parse/hash mismatch code, (4) reboot reason + brownout flags, and (5) slot state snapshot (active/next/attempt_count) plus rollback reason and floor/counter values (appropriately redacted).

Maps to: H2-10

Q11. How can “real security” be demonstrated—what is the smallest test matrix that avoids blind spots?

A minimum matrix crosses (A) phases (download/verify/staging/meta/boot-test/confirm) with (B) negative cases (bad signature, tampered manifest, replay, rollback below floor) and (C) power interruption points across those phases. Pass criteria are deterministic outcomes: REJECT for policy failures, ROLLBACK for boot-test failures, RESCUE when recovery is required, and always keeping at least one known-good CONFIRMED boot path available after any interruption.

Maps to: H2-11
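The three axes in Q11 can be crossed mechanically so that no combination is skipped. In this sketch the phase and case labels follow the answer above, while the `expected_outcome` rules are a deliberately simplified assumption (any negative case must end in REJECT; a power cut on a valid package must end in CONFIRMED or RESCUE).

```python
from itertools import product

PHASES = ["download", "verify", "staging", "meta", "boot-test", "confirm"]
NEGATIVE_CASES = ["none", "bad_signature", "tampered_manifest",
                  "replay", "rollback_below_floor"]

def expected_outcome(case, power_cut):
    """Simplified pass criteria; a real matrix would refine these per phase."""
    if case != "none":
        return "REJECT"
    if power_cut:
        return "CONFIRMED_OR_RESCUE"
    return "CONFIRMED"

def build_matrix():
    """Cross phases x negative cases x power cut into a list of test rows."""
    return [
        {"phase": phase, "case": case, "power_cut": cut,
         "expect": expected_outcome(case, cut)}
        for phase, case, cut in product(PHASES, NEGATIVE_CASES, (False, True))
    ]
```

Generating the rows rather than listing them by hand makes gaps visible: six phases, five cases, and two power conditions give 60 rows, and each row needs recorded evidence before release.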

Q12. Without a Secure Element, can security be “good enough,” and what must the MCU/SoC provide?

Many systems can be robust without a Secure Element if the MCU/SoC provides an immutable verify point, a strong secure storage anchor (OTP/eFuse), fast signature verification, isolation primitives (MPU/TrustZone), and a reliable anti-rollback floor mechanism. The risk increases when physical access is likely or when monotonic counters and hard revocation must be guaranteed under power loss. In those cases, a Secure Element materially improves device-side guarantees.

Maps to: H2-12, H2-4
