Error Handling & Recovery for I²C, SPI, and UART

Q: I²C bus stuck low after a brown-out — first check line vs state-machine?

Likely cause: SDA/SCL is physically held low by a slave or clamp, or the controller/slave FSM is wedged mid-transaction.
Quick check: Sample SDA/SCL as GPIOs and compare to controller status (BUSY/START/STOP). If lines are low with controller disabled → line-held; if lines float high but BUSY persists → FSM wedge.
Fix: Disable I²C block → GPIO bus-clear (SCL pulses) → generate STOP if possible → re-init controller; then run a read-only health probe before writes.
Pass criteria: SDA/SCL high within T_idle; health probe passes N_ok times; no duplicate write events (0) in window W.

Q: Clock stretching happens occasionally — where to set master timeout without breaking slow devices?

Likely cause: A legitimate slow slave stretches occasionally, or a fault makes SCL never release.
Quick check: Log stretch duration distribution (max/p95) per device address; unbounded or always hitting the cap indicates hung stretch.
Fix: Use a two-tier policy (T_byte + T_txn); allow longer caps only for known-slow addresses, and escalate to bus-clear after K consecutive cap hits.
Pass criteria: No transaction exceeds T_txn; stretch p95 < T_stretch_p95; false timeouts on known-slow devices < X per window W.
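
The per-address cap and the "K consecutive cap hits" rule can be sketched as a small policy object. This is a minimal host-side model, not a driver: the class name `StretchPolicy`, its method names, and the return strings are assumptions made for illustration, and it covers only the stretch cap, not the separate T_byte/T_txn timers.

```python
class StretchPolicy:
    """Per-address stretch caps with escalation after K consecutive cap hits."""

    def __init__(self, default_cap_us, slow_caps_us, k_escalate):
        self.default_cap_us = default_cap_us
        self.slow_caps_us = dict(slow_caps_us)  # longer caps for known-slow addresses only
        self.k_escalate = k_escalate
        self.cap_hits = {}                      # addr -> consecutive cap hits

    def cap_for(self, addr):
        return self.slow_caps_us.get(addr, self.default_cap_us)

    def observe(self, addr, stretch_us):
        """Return 'ok', 'timeout', or 'bus_clear' for one observed stretch."""
        if stretch_us < self.cap_for(addr):
            self.cap_hits[addr] = 0             # a normal stretch resets the streak
            return "ok"
        self.cap_hits[addr] = self.cap_hits.get(addr, 0) + 1
        if self.cap_hits[addr] >= self.k_escalate:
            return "bus_clear"
        return "timeout"
```

Keeping the streak per address means one hung device cannot push a healthy slow device toward bus-clear.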

Q: SPI CRC bursts after one glitch — how to resync without resetting the whole system?

Likely cause: A glitch caused bit-slip so frame boundaries no longer align.
Quick check: After CS deassert, run a short read-only sync header probe; persistent wrong headers imply boundary shift.
Fix: CS fence with guard time → dummy clocks (N_dummy_bits) using safe MOSI fill → header probe gate; flush driver/DMA queues if polluted; reset only the target device if probe never locks.
Pass criteria: Header locks within M attempts; CRC error rate < R_crc in window W; TTR_p95 < X ms without system reset.

Q: CS toggling doesn’t recover — what’s the quickest dummy clocks sanity check?

Likely cause: CS toggling ends the session but the target remains inside a partial command/stream state.
Quick check: CS ↑ (guard) → CS ↓ → dummy clocks for one frame-length/byte multiple → read-only known header; header movement vs dummy length confirms desync.
Fix: Standardize N_dummy_bits with safe MOSI fill and require header probe before resuming; add a Hi-Z check if multiple slaves share MISO.
Pass criteria: Probe passes N_ok consecutive reads; no contention events in W; no-progress remains 0 in steady state.

Q: UART shows framing errors only in bursts — noise coupling or baud drift?

Likely cause: Burst-coupled noise corrupts edges, or baud mismatch accumulates until sampling slips.
Quick check: Correlate FE/PE bursts to system events vs temperature/clock changes; noise correlates to events, drift correlates to sustained conditions.
Fix: Bursts: flush RX FIFO + clear flags → wait idle window T_idle_uart → first-valid-frame gate. Drift: re-acquire baud (or calibrated divisor) then re-validate with idle gate.
Pass criteria: FE/PE rate < R_fe over window W; first-valid-frame locks within TTR; no FIFO overruns in N frames.

Q: Recovery works on bench but fails in chassis — what telemetry field is usually missing?

Likely cause: Missing progress and density telemetry hides escalation triggers under real coupling.
Quick check: Ensure logs include last_progress_ts, error density in W, burst length, action level, and reset cause.
Fix: Add progress counters + ring-buffer events (bus/device/op/duration/error/action/attempt) and drive escalation deterministically from these signals.
Pass criteria: ≥ P% of field failures attributable to a root bucket; escalation reproducible; TTR_p95 and false reset rate measurable from logs.

Q: Retries make it worse — how to detect retry storms and apply backoff?

Likely cause: Fast retries amplify error density and prevent stability, creating a storm.
Quick check: Track retries per second and consecutive failures per device; storm exists when retry rate rises while progress stays flat.
Fix: Exponential backoff with jitter and a hard cap; rate-limit per device and globally; enter DEGRADED (probe-only) if density exceeds threshold.
Pass criteria: Retry rate < R_retry; progress resumes within T_recover; error density < D_ok in window W.
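
A minimal sketch of the backoff and storm check described above. It uses the "full jitter" variant (delay drawn uniformly below an exponentially growing, capped ceiling), which is one common choice rather than something the text prescribes. The function names, the default `base_ms`/`cap_ms` values, and the "zero completions" storm test are assumptions.

```python
import random

def backoff_delay_ms(attempt, base_ms=10, cap_ms=1000, rng=random.random):
    """Full-jitter exponential backoff: uniform in [0, min(cap, base * 2**attempt))."""
    ceiling = min(cap_ms, base_ms * (2 ** attempt))
    return rng() * ceiling

def is_retry_storm(retries_in_window, completions_in_window, r_retry):
    """Storm: retry rate above R_retry while forward progress stays flat."""
    return retries_in_window > r_retry and completions_in_window == 0
```

Passing `rng` as a parameter keeps the function deterministic under test; firmware would substitute its own random source.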

Q: After recovery, device config is wrong — how to design idempotent writes?

Likely cause: Timeout during a non-idempotent write followed by blind retry causes side effects or partial state.
Quick check: Identify non-idempotent writes and whether code retries them without confirm/readback gates.
Fix: Use version/sequence tokens or write-then-readback compare; after timeout block blind retry and reconcile via read-only probe before re-applying changes.
Pass criteria: Duplicate side effects = 0 in window W; readback/compare passes after recovery; config hash matches within T_reconcile.

Q: Watchdog resets too often — how to separate comm watchdog vs system watchdog?

Likely cause: One watchdog handles temporary comm faults and system deadlocks, causing over-resets.
Quick check: Log reset cause and last comm progress; if tasks are healthy but comm stalls, system watchdog is too aggressive.
Fix: Split task watchdog, comm watchdog (no-progress/escalation), and system watchdog (fail-safe only); require comm ladder attempts before system reset with rate limits.
Pass criteria: System resets/day Y%; reset causes attributable in ≥ P% logs.

Q: Need fast TTR but no false resets — what escalation thresholds are most effective?

Likely cause: Escalation is based on raw error counts instead of progress and density, producing false resets.
Quick check: Confirm triggers use no-progress for T_idle and density in W rather than only N errors.
Fix: Layered gates: retry/backoff (K1) → resync/reinit on no-progress → device reset only if probe fails → segment power-cycle after K2 repeats with rate limit.
Pass criteria: TTR_p95 < X ms and false_reset_rate < Z; escalation distribution stable; hard resets only after gates.

Reliable recovery is not “reset and hope”—it is a measurable ladder: detect no-progress, resync safely, escalate with bounded timeouts, and verify configuration consistency after recovery.

This page turns I²C/SPI/UART failures into an auditable playbook with telemetry, pass/fail criteria, and soft-to-hard controls that prevent duplicate writes and false resets.

H2-1 · Definition & Recovery Goals

Intent

Establish a strict vocabulary for transient, desync, and hung/wedged states, then define what “recovery success” means in measurable, production-ready terms.

Scope guardrail (to avoid overlap)

Covers: recovery definitions, “no forward progress” criteria, measurable KPIs (TTR/loss/consistency/false-reset).
Does not cover: topology/capacitance budgeting, SI/termination, or protocol deep-dives (those belong to sibling pages).

Three failure classes (strict definitions)

1) Transient errors

Errors occur, but forward progress continues: retries succeed, counters increment, transactions complete intermittently.

Signal: error counters rise, but completion events still appear.
Risk: throughput/latency degradation; occasional data loss.

2) Protocol desync

The bus is active, but endpoints disagree on frame boundary / phase / session state, producing repeated invalid frames until re-aligned.

Signal: persistent CRC/header/parity/framing failures with toggling activity.
Risk: repeated writes/reads may target wrong state unless fenced.

3) Bus wedged / hung

No forward progress beyond a defined timeout window: a line may be held, a controller may be stuck, or a queue may never complete.

Signal: “time since last success” exceeds threshold; completion interrupts stop.
Risk: system-level cascading failures; recovery must be staged and safe.
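
The three definitions above can be turned into a classifier that checks forward progress first and error patterns second. This is a sketch under stated assumptions: the names `FaultClass` and `classify`, the `invalid_streak` input (consecutive invalid frames), and the `k_desync` threshold are illustrative and do not come from a specific driver.

```python
from enum import Enum

class FaultClass(Enum):
    HEALTHY = "healthy"
    TRANSIENT = "transient"
    DESYNC = "desync"
    HUNG = "hung"

def classify(now_ms, last_progress_ms, t_idle_ms,
             errors_in_window, invalid_streak, k_desync):
    """Classify by forward progress first, then by error pattern."""
    if now_ms - last_progress_ms > t_idle_ms:
        return FaultClass.HUNG       # no forward progress beyond the window
    if invalid_streak >= k_desync:
        return FaultClass.DESYNC     # bus active, frames persistently invalid
    if errors_in_window > 0:
        return FaultClass.TRANSIENT  # errors occur, completions still appear
    return FaultClass.HEALTHY
```

The ordering matters: a large error count never yields HUNG on its own, which matches the rule that "hung" is defined by missing progress.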

Recovery success = three parallel pass criteria

Data consistency: no unintended duplicate writes; configuration remains coherent (read-back match where applicable).
Device availability: critical transactions complete repeatedly after recovery (not a one-off “lucky” success).
System safety state: recovery does not trigger unsafe actions (avoid over-resetting, avoid “half-applied” control changes).

Outputs to report (KPIs with threshold placeholders)

Time to recover (TTR)

Define TTR_p95 and TTR_max per application. Example placeholder: TTR_p95 < X ms.

Loss / retry burden

Track loss and retry density (burstiness matters): loss_per_hour < Y, retry_burst_p99 < N.

Duplicate-write risk

For non-idempotent writes, require a fence (read-back, sequence number, or safe “commit” step). Placeholder: dup_write_rate < Z.

False resets

Stage recovery to minimize unnecessary resets. Placeholder: false_reset_per_day < W.

Diagram · Fault-state funnel and action escalation

Reading tip: “Hung” is defined by no forward progress (not just “many errors”). Escalation should be staged to protect consistency and minimize false resets.

H2-2 · Failure Taxonomy Across I²C / SPI / UART

Intent

Replace guessing with a symptom → cause class → first check map. Each symptom includes one discriminator to quickly separate line-level wedges from state-machine wedges.

Two primary trunks (fast classification)

A) Physical-level wedge

A line is held (stuck-low, forced drive, or electrical contention). Progress stops because the signal cannot return to a valid idle or edge sequence.

Discriminator: pin level remains invalid even when the controller is idle.

B) State-machine wedge

The physical signals may toggle, but a controller/driver/device state machine (or DMA/queue) never completes. Progress stops because the system is “stuck waiting”.

Discriminator: pin levels look plausible, but completion events and “last-good” timestamps freeze.
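
The two discriminators reduce to a short decision function. A minimal sketch, assuming the line levels are sampled as GPIO inputs while the controller is idle or disabled; the function name and the returned strings are illustrative.

```python
def wedge_kind(sda_high, scl_high, controller_busy, progress_frozen):
    """Separate a physical wedge from a state-machine wedge.

    sda_high / scl_high must be sampled with the controller idle or disabled.
    """
    if not (sda_high and scl_high):
        return "physical"        # line invalid even though nobody should drive it
    if controller_busy or progress_frozen:
        return "state-machine"   # lines look fine, completion never arrives
    return "none"
```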

Symptom cards (no large tables; mobile-safe)

I²C — common “hung” signatures

Symptom: SDA stuck low

Likely cause class: physical wedge (slave holds SDA, contention, or pin latch).

First check: read SDA level with controller disabled; log “last-good” timestamp freeze.

Discriminator: SDA remains low even after STOP attempt or peripheral disable.

Symptom: SCL stuck low

Likely cause class: physical wedge or clock source/driver latch.

First check: verify SCL can be driven high in GPIO mode (safe window); confirm pull-up rail present.

Discriminator: SCL low persists across controller reset.

Symptom: clock stretching never releases

Likely cause class: state-machine wedge (slow or stuck slave, timeout policy missing).

First check: measure stretch duration vs master timeout; confirm “no-progress” criteria trips.

Discriminator: pins toggle up to a point, then wait forever without completion.

Symptom: arbitration loss → no recovery

Likely cause class: state-machine wedge (controller state not cleared, retry storm).

First check: confirm driver clears arbitration flag; track retry burst density and backoff presence.

Discriminator: line is idle/high, but transactions never complete.

SPI — common “hung / desync” signatures

Symptom: CS stuck asserted

Likely cause class: state-machine wedge (session boundary not fenced; driver waiting).

First check: confirm CS deassert occurs on timeout; verify “end-of-transaction” interrupt path.

Discriminator: SCLK toggles may stop, but CS remains low beyond the budget window.

Symptom: DMA never completes

Likely cause class: state-machine wedge (interrupt masked, queue stuck, descriptor issue).

First check: verify DMA progress counter; check if last descriptor completed timestamp stops updating.

Discriminator: pins may be quiet, but software shows “busy” indefinitely.

Symptom: MISO driven when it should be Hi-Z

Likely cause class: physical wedge (contention, miswired CS, faulty device).

First check: sample MISO level with CS high; look for forced-level behavior across resets.

Discriminator: line level stays strong even when the bus is idle.

Symptom: header/CRC wrong after one glitch

Likely cause class: protocol desync (bit-slip or phase/session mismatch).

First check: verify a resync primitive exists (CS fence + known header probe); count consecutive invalid frames.

Discriminator: activity continues, but validity never returns without a resync step.

UART — common “noise lock / stuck receive” signatures

Symptom: framing/parity errors in bursts

Likely cause class: protocol desync or noise-driven false start bits.

First check: correlate FE/PE bursts with idle gaps; verify “wait-for-idle” resync window exists.

Discriminator: errors cluster; stable periods exist between bursts.

Symptom: FIFO overrun repeats

Likely cause class: state-machine wedge (ISR starvation, flow control missing, buffer sizing).

First check: log service latency vs FIFO depth; confirm error flags are cleared and counters advance.

Discriminator: overrun rate tracks CPU load/ISR latency, not cable events.

Symptom: autobaud drifts over time

Likely cause class: protocol desync (clock mismatch accumulation; wrong reacquire policy).

First check: compare measured bit time vs expected; validate reacquire trigger and guard window.

Discriminator: errors increase gradually, not abruptly, and improve after reacquire.

Diagram · 3-bus symptom tree (fast triage map)

Usage rule: each symptom card intentionally stops at the first check. Deeper electrical design and protocol timing details belong to sibling pages to avoid content overlap.

H2-3 · Observability & Telemetry Hooks

Intent

Without observability, recovery becomes blind resets. Use a minimal, low-overhead telemetry set to prove forward progress, tune timeouts, and control escalation.

Minimal telemetry set (must-have)

Counters

Error counters: NAK / CRC / FE / PE (map by bus type; keep a unified field name).
Timeout counters: byte/step, frame, transaction timeouts (do not merge).
Retry counters: retry_total, retry_burst_max, backoff_applied_count.

Time signals

last_good_ts: last completed successful transaction.
last_progress_ts: last observable forward progress (DMA advance, FIFO drain, state step).
Duration distribution: p50/p95/p99 per operation type (not just averages).

Recovery action counts

Soft: retry/backoff, resync, flush.
Hard: driver re-init, device reset.
Power: segment power-cycle (should be rare; track false resets).

Rule: “hung” should be triggered by no forward progress (last_progress_ts frozen), not by “many errors” alone.

Event log schema (minimal fields)

Identity

bus_id (controller instance / channel)
device_addr (address/CS index/port id)

Operation

op_type (read/write/config/stream)
duration_us (end-to-end elapsed)

Outcome

error_code (NAK/CRC/FE/PE/timeout)
phase (byte/step, frame, txn)

Recovery

recovery_action (retry/resync/reinit/reset/power)
attempt_id (groups repeated attempts for the same request)
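
The schema above maps naturally onto a fixed record plus a bounded ring buffer. The sketch below is a host-side model with assumed names (`BusEvent`, `EventRing`); the field names are taken from the schema, while the field types are a guess.

```python
from collections import deque
from dataclasses import dataclass

@dataclass(frozen=True)
class BusEvent:
    bus_id: int
    device_addr: int
    op_type: str          # read / write / config / stream
    duration_us: int
    error_code: str       # NAK / CRC / FE / PE / timeout
    phase: str            # byte/step, frame, txn
    recovery_action: str  # retry / resync / reinit / reset / power
    attempt_id: int

class EventRing:
    """Bounded event log: the oldest entries are dropped when full."""

    def __init__(self, capacity):
        self._events = deque(maxlen=capacity)

    def record(self, event):
        self._events.append(event)

    def by_attempt(self, attempt_id):
        """All retained events that belong to one request."""
        return [e for e in self._events if e.attempt_id == attempt_id]

    def __len__(self):
        return len(self._events)
```

A bounded buffer keeps memory use fixed, at the cost of losing the oldest events during long bursts, so capacity should be sized against the expected burst length.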

Alert tiers (bind alerts to action)

Warning

Errors rise but progress continues. Action: increase logging granularity; track burstiness; avoid escalation.

Degraded

p95/p99 latency or error bursts exceed budget. Action: apply backoff, reduce load, and prepare resync primitives.

Fail-safe

No forward progress meets the criteria. Action: enter staged recovery; quarantine the device; protect consistency.

Pass criteria (placeholders)

Telemetry coverage: 100% of recovery actions logged with schema fields populated.
Progress detection: hung classification uses last_progress_ts (not error count only).
False reset: false_reset_per_day < W (target placeholder).

Diagram · Telemetry pipeline (low overhead → actionable decision)

Design note: keep ISR work minimal (increment counters only). Use the health task for classification and escalation to reduce false resets.

H2-4 · Timeout Budgeting (Transaction / Byte / Idle)

Intent

Timeouts should be budgeted, not guessed. Use layered timeouts to detect “stuck steps” early while avoiding false resets and uncontrolled TTR.

Timeout layers (strict meanings)

Byte / step timeout (T_byte)

The smallest wait budget for a step to advance (byte, phase, descriptor progress). Triggers early when “progress stalls” inside a transaction.

Frame timeout (T_frame)

A complete frame/burst window (UART frame, SPI burst block, I²C phase group). Useful for “resync boundary” decisions.

Transaction timeout (T_txn)

End-to-end budget for a full transaction. Limits total waiting and bounds TTR when combined with retries/backoff.

Idle / no-progress timeout (T_idle)

A “hung” classifier based on last_progress_ts. Not the same as T_txn; it detects zero progress even if signals toggle.

How to set thresholds (distribution-based)

Use p95/p99 from telemetry (H2-3), not averages.
Set: T_byte = p99(step) + margin, T_frame = p99(frame) + margin, T_txn = p99(txn) + margin.
Set: T_idle using last_progress_ts stalls (progress-free window), not “transaction time”.
Define placeholders: T_byte, T_frame, T_txn, N_retry, Backoff (tuned by burst statistics).
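
A small worked example of the "percentile plus margin" rule. The nearest-rank percentile used here is one of several percentile definitions and is an assumption, as are the function names; the additive margin follows the form given above.

```python
def percentile(samples, pct):
    """Nearest-rank percentile for an integer pct in 1..100; samples must be non-empty."""
    ordered = sorted(samples)
    rank = (pct * len(ordered) + 99) // 100   # integer ceil(pct * n / 100)
    return ordered[max(rank, 1) - 1]

def timeout_from_samples(samples_us, pct=99, margin_us=0):
    """T = p(pct) of the observed durations plus a fixed margin."""
    return percentile(samples_us, pct) + margin_us
```

For example, with step durations of 1..100 µs and a 20 µs margin, `timeout_from_samples(range(1, 101), 99, 20)` gives 119 µs, whereas the mean of the same samples (about 50 µs) would time out roughly half of all healthy steps.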

Key knobs by bus type (timeout vocabulary)

I²C

T_stretch_max: maximum tolerated clock-stretching window.
Master timeout policy: define when to stop waiting and classify no-progress.

SPI

DMA block timeout: bound waiting for descriptor completion.
CS assert + frame gap: fence sessions; detect stalls between bursts.

UART

Inter-byte timeout: detect stalled streams and rebuild boundaries.
Frame timeout + idle window: define resync windows after bursts.

Integrate retries/backoff into the TTR budget

Total waiting should be bounded: TTR ≈ (T_txn × N_retry) + Σ(backoff) (placeholder form).
Use burst telemetry to pick Backoff (avoid retry storms).
Escalation should require timeout + no-progress before hard resets.
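
The budget formula can be evaluated directly to check whether a retry policy fits the TTR target. A minimal sketch, assuming capped exponential backoff with the jitter taken at its worst case, so the result is an upper bound; the function name and parameters are illustrative.

```python
def worst_case_ttr_ms(t_txn_ms, n_retry, base_ms, cap_ms):
    """Upper bound: TTR ≈ (T_txn × N_retry) + Σ(backoff)."""
    backoffs = [min(cap_ms, base_ms * 2 ** i) for i in range(n_retry)]
    return t_txn_ms * n_retry + sum(backoffs)
```

With T_txn = 50 ms, N_retry = 3, a 10 ms base, and a 1000 ms cap, the bound is 3 × 50 + (10 + 20 + 40) = 220 ms. If that exceeds the TTR_p95 target, reduce N_retry or the cap rather than shortening T_txn below its measured p99.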

Pass criteria (placeholders)

False timeouts: false_timeout_rate < X (target placeholder).
Recovery speed: TTR_p95 < Y ms (target placeholder).
Safety: no duplicate-write incidents under fault injection (policy verified).

Diagram · Timeout budgeting timelines (I²C / SPI / UART)

Implementation note: classify “hung” only when timeouts coincide with no-progress (last_progress_ts frozen). This reduces false resets while bounding recovery time.

H2-5 · Recovery Ladder (Soft → Hard) & When to Escalate

Intent

Make recovery auditable with a staged ladder. Escalation should be driven by timeouts + no-forward-progress, not by a single error spike.

Recovery ladder (6 levels, unified vocabulary)

Level 1 · Retry same op (with backoff)

Trigger: transient errors while progress continues.
Action: retry + exponential backoff + jitter (avoid retry storms).
Exit: one success + stability window (N_ok placeholder).

Level 2 · Re-sync (flush / idle wait / boundary rebuild)

Trigger: suspected protocol desync (boundary mismatch).
Action: flush queues, wait idle window, rebuild session boundary.
Exit: read-only health probe passes.

Level 3 · Re-init peripheral block (driver reset)

Trigger: progress stalls inside controller/driver (DMA/ISR/state).
Action: controller reset + context rebuild + config replay/verify.
Exit: controller self-check + health probe passes.

Level 4 · Reset target device (GPIO reset / command reset)

Trigger: failures localize to one device; safe to isolate.
Action: device reset, then read-only probe and configuration reconciliation.
Exit: device online + probe stable for N_ok.

Level 5 · Bus-level clear (I²C bus-clear / SPI dummy clocks)

Trigger: bus wedge indicators (timeouts + no-progress + line/session stuck).
Action: bus clear primitive after freezing traffic and acquiring bus lock.
Exit: bus returns to idle + read-only probe passes.

Level 6 · Power-cycle segment / system reset (last resort)

Trigger: fail-safe requirements or repeated ladder failure.
Action: segment power-cycle first; full system reset only when required.
Exit: boot self-test + probes confirm stable operation.

Escalation policy (combine gates, avoid false resets)

Gate A — Timeout budget hit: T_byte / T_frame / T_txn exceeded (placeholders from H2-4).
Gate B — No forward progress: now − last_progress_ts > T_idle (placeholder).
Gate C — Persistence: fail_streak ≥ K or err_rate ≥ E within a window W (placeholders).
Gate D — Operation class: non-idempotent writes require a read-only probe before retrying; critical ops may skip to a safer level.
Gate E — Safety: enter safe-state before hard actions (device reset / bus clear / power-cycle).
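
One way to make the gate combination auditable is to evaluate every gate explicitly and return both the decision and the individual results for logging. A sketch with assumed names; Gate D (operation class) is left out because it depends on the request being retried rather than on bus state.

```python
def hard_action_allowed(timeout_hit, now_ms, last_progress_ms, t_idle_ms,
                        fail_streak, k, err_rate, e, safe_state_entered):
    """Return (allowed, gates); a hard action needs every gate to be true."""
    gates = {
        "A_timeout": timeout_hit,
        "B_no_progress": (now_ms - last_progress_ms) > t_idle_ms,
        "C_persistence": fail_streak >= k or err_rate >= e,
        "E_safe_state": safe_state_entered,
    }
    return all(gates.values()), gates
```

Logging the `gates` dictionary with each recovery action shows afterwards which gate blocked or permitted an escalation.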

Degraded operation (contain impact while recovering)

Read-only mode

Allow health probes and status reads only; block side-effect writes until stability is proven.

Quarantine device

Isolate one address/CS while keeping the bus usable for other devices; log quarantine entry/exit.

Bypass path

Switch to a redundant channel when available; keep telemetry continuity to compare both paths.

Auditability & pass criteria (placeholders)

Action record: level, trigger gates, duration, result, attempt_id always logged.
Recovery speed: TTR_p95 < Y ms (placeholder).
False resets: false_reset_per_day < W (placeholder).
Consistency: zero duplicate-write incidents in fault injection for non-idempotent ops.

Diagram · Staged recovery ladder (cost/time + escalation gates)

Rule: prefer the lowest level that restores forward progress. Hard actions require the combined gates and an audit trail.

H2-6 · I²C Recovery Playbook (Hung Bus)

Intent

Most I²C hangs are either lines held (SDA/SCL stuck low) or a stalled state machine. Recovery should be repeatable, verifiable, and safe for non-idempotent writes.

Detect & classify (line-level vs state-level)

Hung classifier

No-progress: now − last_progress_ts > T_idle (placeholder).
Timeout evidence: repeated T_byte/T_txn hits (from H2-4).

Line state

SDA stuck low: common “bus wedged” signature.
SCL stuck low: treat cautiously; may require reinit or segment power control.

State-level wedge

Stretch timeout: stretch_duration > T_stretch_max (placeholder).
Arbitration loss stall: after arbitration loss, progress does not resume.

Safe preconditions (protect consistency)

Freeze traffic: stop new I²C transactions; acquire a bus lock/mutex.
Block side-effect writes: during recovery, allow read-only probes only.
Snapshot: record SDA/SCL levels, controller status, and counters before action.

Recovery actions (repeatable sequence)

1) Bus clear (SCL pulses)

Use when: SDA stuck low (or strong suspicion).
Action: drive N_pulses clocks (placeholder) to release SDA.
Observe: SDA returns high; bus returns to idle.

2) Generate STOP (if supported)

Use when: SDA/SCL are high and a clean boundary is needed.
Goal: return bus to a known idle state before probes.

3) Re-init controller (driver/peripheral)

Use when: both lines high but no-progress persists.
Action: reset controller, rebuild context, replay config, verify status.

4) Optional: re-enumerate targets

Use when: devices may have reset/power-cycled.
Action: presence checks via read-only probes (avoid side effects).
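
Steps 1 and 2 can be sketched with the pins passed in as callables, which keeps the sequence testable on a host. The callables `set_scl`, `set_sda`, `read_sda`, and `half_period` are hypothetical stand-ins for open-drain GPIO access and a delay; the default of nine pulses for N_pulses is a common choice (the I²C-bus specification describes a nine-clock bus clear), not a value taken from this page.

```python
def i2c_bus_clear(set_scl, set_sda, read_sda, half_period, max_pulses=9):
    """Pulse SCL until SDA is released, then issue a STOP.

    Returns the number of SCL pulses used, or -1 if SDA is still low
    after max_pulses (escalate to re-init or device reset in that case).
    """
    set_sda(1)                 # release SDA so only a slave can hold it low
    pulses = 0
    while not read_sda() and pulses < max_pulses:
        set_scl(0)
        half_period()
        set_scl(1)             # each clock lets the slave advance toward release
        half_period()
        pulses += 1
    if not read_sda():
        return -1
    # STOP condition: SDA rises while SCL is high.
    set_scl(0)
    half_period()
    set_sda(0)
    half_period()
    set_scl(1)
    half_period()
    set_sda(1)
    half_period()
    return pulses
```

The traffic freeze, bus lock, and pre-action snapshot from the preconditions are assumed to be done by the caller, and a read-only probe should follow before any write is allowed.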

Risk control & health probe (read-only, verifiable)

Probe: read a non-side-effect status/ID register to confirm ACK + STOP behavior.
Consistency: after recovery, config writes must be reconciled by read-back (policy).
Pass criteria: probe passes N_ok times; error counters remain stable in window W (placeholders).

Diagram · I²C hung recovery (bus-clear → STOP → probe → escalate)

Safety note: the recovery path should freeze traffic and use read-only probes to prevent duplicate writes during uncertainty.

H2-7 · SPI Recovery Playbook (Desync / Stuck Session)

Intent

Most SPI “hangs” are session-boundary failures (CS) or flow-control stalls (DMA/queues). Recovery should enforce session fencing, rebuild framing, and restore forward progress with minimal side effects.

Detection (symptoms → first check)

CS asserted too long

Likely causes: missing end-of-transfer, device busy, CS control stuck.
First check: CS level vs transfer_done flag and guard time policy.

MISO strong-driven unexpectedly

Likely causes: multi-slave contention, wrong CS decode, device stuck in data phase.
First check: only one CS active; confirm idle Hi-Z expectation.

DMA never completes / queue stalls

Likely causes: lost interrupt, descriptor stuck, driver state mismatch.
First check: last_progress_ts updates; DMA state and IRQ pending flags.

CRC / header continuously wrong

Likely causes: desync (bit-slip), partial frame, CS boundary violation.
First check: enforce CS fence; then header probe after re-sync.

Session fencing (do not start a new transaction until the previous one is closed)

Fence conditions (placeholders): transfer_done = true, CS deasserted, guard time ≥ T_guard.
Queue discipline: flush/abort must fully terminate the active descriptor before re-arming DMA.
Audit fields: cs_low_time, dma_state, attempt_id, last_progress_ts always captured.

Recovery actions (repeatable sequence)

1) CS deassert + guard + reassert

Goal: close the prior session boundary.
Exit: header probe becomes stable after the next step.

2) Dummy clocks (re-align shift boundary)

Use when: header/CRC persistently wrong (suspected bit-slip).
Parameter: N_dummy_bits placeholder; ensure MOSI fill is side-effect safe.

3) Sync header probe (read-only)

Goal: confirm known header/ID framing before resuming normal traffic.
Exit: probe passes N_ok times (placeholder).

4) Flush DMA / driver queues

Use when: descriptors do not advance, or queued frames are polluted.
Exit: DMA state returns to idle and last_progress_ts updates.

5) Device reset (only if required)

Use when: re-sync cannot restore a stable header.
Exit: probe stable; error density below threshold in window W.
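
Steps 1 to 3 can be sketched as a bounded loop: fence, dummy clocks, then the read-only header gate. Everything hardware-facing is passed in as a hypothetical callable (`set_cs`, `clock_bytes`, `read_header`, `sleep_ms`); active-low CS, whole-byte dummy clocking, and the 0xFF fill are assumptions that must be checked against the target device's command set.

```python
def spi_resync(set_cs, clock_bytes, read_header, sleep_ms, expected_header,
               t_guard_ms=1, n_dummy_bytes=4, fill=0xFF, max_attempts=3):
    """Return the attempt number on which the header locked, or 0 if it never did."""
    for attempt in range(1, max_attempts + 1):
        set_cs(1)                                   # deassert (active-low CS assumed)
        sleep_ms(t_guard_ms)                        # guard time closes the prior session
        set_cs(0)
        clock_bytes(bytes([fill]) * n_dummy_bytes)  # dummy clocks with safe MOSI fill
        set_cs(1)
        sleep_ms(t_guard_ms)
        if read_header() == expected_header:        # read-only probe gate
            return attempt
    return 0
```

A return value of 0 is the signal to move on to step 4 or 5 (queue flush or device reset) instead of looping further.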

Pass criteria (placeholders)

Framing: sync header probe passes N_ok times.
Stability: error rate < X within window W.
Speed: TTR_p95 < Y ms; avoid false hard resets.

Diagram · SPI session fence (CS toggle → dummy clocks → header probe → resume)

Emphasis: close the previous session and re-validate framing before normal traffic resumes. Queue flush must fully terminate the active DMA state.

H2-8 · UART Recovery Playbook (Noise Burst / Framing Lock)

Intent

UART recovery focuses on rebuilding frame boundaries and clearing error states, especially under noise bursts. The goal is a verifiable return to valid frames, not blind resets.

Detection (signals that indicate boundary loss)

FE/PE burst: continuous framing/parity errors; burst_max exceeds baseline.
Idle missing: no reliable idle detect window within W (placeholder).
FIFO overrun: overrun_count rises; RX backlog does not drain.
Autobaud mismatch (optional): autobaud lock/estimate deviates from expected range.

Recovery actions (boundary rebuild)

1) Flush RX FIFO + clear error flags

Goal: remove polluted bytes and exit sticky error state.
Audit: dropped_bytes_count recorded for traceability.

2) Wait-for-idle window (rebuild boundary)

Rule: accept a new frame only after idle ≥ T_idle_uart (placeholder).
Benefit: filters noise-induced false start bits.

3) Break / idle resync (if allowed)

Use when: protocol supports a resync marker (break/idle sequence).
Exit: first valid frame passes checks (parity/length/CRC per stack).

4) Autobaud reacquire (optional)

Use when: autobaud lock is suspected to be wrong.
Exit: estimated baud within tolerance window (placeholder).
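
Steps 1 and 2 can be modeled as a small gate in front of the framer. The sketch below is a host-side Python illustration under assumptions: the class name `IdleGate`, the microsecond timestamps, and the rule that any byte seen while the gate is closed restarts the idle measurement are all illustrative choices, not taken from a specific UART driver.

```python
class IdleGate:
    """Pass RX bytes to the framer only after an idle window has been observed."""

    def __init__(self, t_idle_us):
        self.t_idle_us = t_idle_us
        self.open = True              # normal traffic: gate open
        self.last_activity_us = 0
        self.dropped_bytes_count = 0  # audit counter for discarded bytes

    def flush(self, now_us):
        """Step 1: RX FIFO flushed and error flags cleared; close the gate."""
        self.open = False
        self.last_activity_us = now_us

    def on_byte(self, now_us):
        """Return True if the byte arriving at now_us may be framed."""
        if not self.open and now_us - self.last_activity_us >= self.t_idle_us:
            self.open = True          # step 2: idle >= T_idle_uart, boundary rebuilt
        self.last_activity_us = now_us
        if not self.open:
            self.dropped_bytes_count += 1
        return self.open
```

Bytes that arrive before the idle window elapses are counted rather than silently discarded, which keeps dropped_bytes_count available for traceability.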

Pass criteria (placeholders)

N-frame rule: within next N frames, error frames < X.
Error density: FE/PE rate < R in window W.
Recovery time: TTR_p95 < Y ms; avoid unnecessary hard resets.

Diagram · UART resync timeline (noise → flush → idle window → first valid frame → N-frame check)

Rule: flush and wait for a reliable idle window before accepting the first valid frame; then verify stability over an N-frame window.

H2-9 · Firmware Architecture Patterns (State Machine, Idempotency, Fencing)

Intent

Recovery must be a structure, not a pile of actions. Firmware architecture should prevent duplicate writes, half-transactions, and infinite recovery loops while remaining auditable.

Recovery state machine (core contract)

States: INIT → IDLE → TXN → ERROR → RECOVER → DEGRADED.
Progress definition: last_progress_ts advances (DMA descriptors move, FIFO drains, or a transaction completes).
Exit criteria: every state has a pass condition; RECOVER cannot loop without an escalation gate.
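
A minimal Python model of this contract is sketched below. The event names ("init_ok", "start", "done", "fault", "recover", "probe_ok", "probe_fail", "stable") and the class name `RecoveryFsm` are hypothetical; only the state names and the bounded-RECOVER rule come from the text above.

```python
from enum import Enum

State = Enum("State", "INIT IDLE TXN ERROR RECOVER DEGRADED")

# Allowed transitions; each event stands for the pass condition of its state.
TRANSITIONS = {
    (State.INIT, "init_ok"): State.IDLE,
    (State.IDLE, "start"): State.TXN,
    (State.TXN, "done"): State.IDLE,
    (State.TXN, "fault"): State.ERROR,
    (State.ERROR, "recover"): State.RECOVER,
    (State.RECOVER, "probe_ok"): State.IDLE,
    (State.DEGRADED, "stable"): State.IDLE,
}


class RecoveryFsm:
    def __init__(self, k_max):
        self.state = State.INIT
        self.k_max = k_max            # escalation gate for RECOVER
        self.recover_attempts = 0

    def step(self, event):
        if self.state is State.RECOVER and event == "probe_fail":
            self.recover_attempts += 1
            if self.recover_attempts >= self.k_max:
                self.state = State.DEGRADED   # bounded: RECOVER cannot loop forever
            return self.state
        nxt = TRANSITIONS.get((self.state, event))
        if nxt is None:
            raise ValueError(f"no transition from {self.state.name} on {event!r}")
        if nxt is State.IDLE:
            self.recover_attempts = 0
        self.state = nxt
        return nxt
```

Rejecting undefined transitions with an exception is a design choice for the sketch; firmware would more likely log the event and hold state.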

Idempotency rules (safe retry vs protected write)

Retry-safe operations

Reads and status probes.
Write-same-value (explicitly bounded) with readback optional.
Versioned writes (seq/version token; compare-and-set semantics).

Non-idempotent writes (protected)

Increment / append semantics (duplicate write is harmful).
Trigger writes (start/arm/commit operations).
Multi-step writes (page/program sequences).
Policy: no blind retry after timeout; require readback/confirmation gates.
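
One way to make an increment retry-safe is a sequence token with compare-and-set on the target side, plus a readback gate on the host side. The Python sketch below is a model under assumptions: `Device` stands in for a target that stores the last accepted sequence number, and `increment_once` is a hypothetical host helper; real devices may expose this differently or not at all.

```python
class Device:
    """Hypothetical target model: accepts a write only for the next sequence number."""

    def __init__(self):
        self.counter = 0
        self.last_seq = 0

    def apply(self, seq, delta):
        if seq == self.last_seq + 1:   # compare-and-set on the sequence token
            self.counter += delta
            self.last_seq = seq
        return self.last_seq


def increment_once(send, read_seq, seq, delta, max_attempts=3):
    """Apply `delta` exactly once; return the number of attempts used.

    send(seq, delta) may raise TimeoutError with the outcome unknown.
    read_seq() is a read-only probe of the last sequence the device accepted.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            send(seq, delta)
        except TimeoutError:
            pass                       # outcome unknown: go to the confirmation gate
        if read_seq() == seq:          # readback gate: the write landed
            return attempt
    raise RuntimeError("write not confirmed within max_attempts")
```

A resend is harmless here only because the device ignores a sequence number it has already accepted; without that property the retry must stay blocked until state is reconciled.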

Fencing & context cleanup (invariants)

Transaction identity: txn_id identifies the logical transaction and attempt_id identifies each retry of it; the (txn_id, attempt_id) pair must be unique and logged for every attempt.
Completion token: do not start a new TXN until the previous one provides a completion/ack token.
Timeout cleanup: queue_depth → 0, dma_state → IDLE, session boundary closed (CS deasserted / idle window observed), error flags cleared.
No infinite loops: RECOVER has a bounded iteration counter; escalation occurs after threshold K (placeholder).

Backoff + error code layering (actionable telemetry)

Backoff: exponential + jitter; bound by overall TTR budget (placeholders: base, max, jitter).
Driver-level codes: dma_stuck, irq_missing, timeout_phase (byte/frame/txn).
Device-level codes: NAK/CRC/header_mismatch, FE/PE burst, busy_stuck.
System-level codes: no_progress, degraded_entered, safety_state_entered.
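
The backoff rule can be sketched as a generator that stops once the cumulative delay would exceed the TTR budget. This is an illustrative Python model; the parameter names (`base_s`, `max_s`, `jitter`, `ttr_budget_s`) map to the placeholders above, and symmetric multiplicative jitter is an assumed choice.

```python
import random


def backoff_schedule(base_s, max_s, jitter, ttr_budget_s, rng=None):
    """Yield retry delays: exponential, capped at max_s, jittered by +/- jitter.

    Stops before the cumulative delay would exceed ttr_budget_s.
    """
    rng = rng or random.Random()
    spent = 0.0
    attempt = 0
    while True:
        delay = min(max_s, base_s * (2 ** attempt))   # exponential with hard cap
        delay *= 1.0 + rng.uniform(-jitter, jitter)   # spread retries across devices
        if spent + delay > ttr_budget_s:
            return                                    # budget exhausted: escalate instead
        spent += delay
        attempt += 1
        yield delay
```

With base 10 ms, cap 80 ms, no jitter, and a 200 ms budget, the schedule is 10, 20, 40, 80 ms and then ends, because a further 80 ms would exceed the budget.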

Diagram · Recovery state machine (bounded escalation, no infinite loops)

Key properties: bounded recovery loops, explicit fencing invariants, and separation of retry-safe operations from protected writes.

H2-10 · Hardware Assist: Watchdogs, Reset Lines, Bus Guardians

Intent

Hardware assists turn recovery into a controlled soft+hard collaboration: layered watchdogs detect no-progress and enforce bounded resets; reset lines and power segmentation isolate faults before system-wide reset.

Watchdog layering (who monitors what)

Task watchdog: health-task heartbeat; catches soft deadlocks early.
Comm watchdog: forward progress (last_progress_ts) and error density; triggers peripheral re-init or bus clear.
System watchdog: last resort; only after fail-safe conditions and audit logging.
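
The comm watchdog layer can be modeled as a poll on last_progress_ts with a strike counter. The Python sketch below is an assumption-laden illustration: the class name, the "ok"/"reinit"/"escalate" return values, and the choice to restart the window after each strike are not from a specific platform.

```python
class CommWatchdog:
    """Detect no-progress and escalate after k_escalate consecutive strikes."""

    def __init__(self, t_no_progress_ms, k_escalate):
        self.t_no_progress_ms = t_no_progress_ms
        self.k_escalate = k_escalate
        self.last_progress_ts = 0
        self.window_start = 0
        self.strikes = 0

    def note_progress(self, now_ms):
        """Call when a transaction completes, a FIFO drains, or DMA advances."""
        self.last_progress_ts = now_ms
        self.window_start = now_ms
        self.strikes = 0

    def poll(self, now_ms):
        """Return 'ok', 'reinit' (peripheral re-init / bus clear) or 'escalate'."""
        if now_ms - self.window_start < self.t_no_progress_ms:
            return "ok"
        self.strikes += 1
        self.window_start = now_ms      # give the recovery action a full window
        return "escalate" if self.strikes >= self.k_escalate else "reinit"
```

Only the "escalate" result should ever reach the system watchdog path; "reinit" stays inside the comm layer.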

Reset & power resources (prefer segment isolation)

Target reset pin: isolate a single device without disturbing the bus segment.
Segment load switch: power-cycle only the affected bus segment; keeps unrelated domains alive.
Reset supervisor: enforces reset pulse width and ordering; exports reset_cause for audit.

Bus guardian concept + minimum anti-ghost-powering rule

Bus guardian: latches a fault and isolates/disconnects the dragging node or segment when no-progress persists.
Firmware contract: guardian trip forces DEGRADED mode and probe-only traffic until stability returns.
Anti-ghost-powering (minimum): disable/Hi-Z bus drivers before removing segment power; re-enable I/O only after rail is stable (ordering placeholders).
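
The anti-ghost-powering ordering can be written down as a short sequence. The Python sketch below uses injected callables (`set_bus_hiz`, `set_power`, `rail_is_stable`), all hypothetical stand-ins for platform GPIO and power-good hooks; timings are placeholders.

```python
import time


def power_cycle_segment(set_bus_hiz, set_power, rail_is_stable,
                        t_off_s=0.0, max_polls=100, poll_s=0.0, sleep=time.sleep):
    """Power-cycle one bus segment; return True if I/O was safely re-enabled."""
    set_bus_hiz(True)            # 1) drivers Hi-Z before power is removed
    set_power(False)             # 2) remove segment power
    sleep(t_off_s)               #    hold-off so the rail fully discharges
    set_power(True)              # 3) restore segment power
    for _ in range(max_polls):   # 4) wait for the rail, bounded
        if rail_is_stable():
            set_bus_hiz(False)   # 5) re-enable I/O only after the rail is stable
            return True
        sleep(poll_s)
    return False                 # rail never stabilised: leave the bus Hi-Z
```

If the rail does not come up, the bus drivers stay in Hi-Z so the unpowered segment is not back-fed through its I/O pins.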

Diagram · Soft + hard recovery stack (watchdog → supervisor → load switch → bus segment)

Principle: isolate and recover at the smallest scope first (device → segment → system), with watchdog decisions always logged and rate-limited.

H2-11 · Engineering Checklist (Design → Bring-up → Production)

Intent

Convert recovery ideas into an auditable, executable checklist and acceptance criteria across design, bring-up, and production.

Design gate (define budgets + rules before code)

Timeout budgeting: placeholders fixed for T_byte, T_frame, T_txn, T_idle; retry/backoff bounded by overall TTR.
Recovery ladder: retry → resync (bus-clear / dummy clocks) → reinit → device reset → segment power-cycle → system reset; escalation gates defined (K tries, error density window W).
Idempotency rules: retry-safe ops listed; non-idempotent writes require “probe + readback/confirm” gates (no blind retry after timeout).
Minimum event log schema: bus_id, device, op_type, duration, error_code, action_level, txn_id, attempt_id.
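
The minimum schema maps directly onto a fixed record plus a bounded ring buffer. A Python sketch follows; the field types, the `duration_us` unit, and the class names are assumptions, while the field set is the one listed above.

```python
from collections import deque
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class RecoveryEvent:
    bus_id: str
    device: str
    op_type: str
    duration_us: int      # unit is an assumption; the schema only says "duration"
    error_code: str
    action_level: int
    txn_id: int
    attempt_id: int


class EventLog:
    """Fixed-capacity ring buffer: the oldest events are overwritten first."""

    def __init__(self, capacity):
        self._buf = deque(maxlen=capacity)

    def record(self, event):
        self._buf.append(event)

    def export(self):
        """Return events oldest-first as plain dicts, ready for field export."""
        return [asdict(e) for e in self._buf]
```

A bounded buffer keeps logging cost constant during error bursts, at the price of losing the oldest entries.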

Bring-up gate (fault injection validates the ladder)

Inject & observe (minimum set)

I²C: force SDA low / SCL low; stretch timeout; arbitration-loss → no-progress.
SPI: CS stuck asserted; persistent header/CRC mismatch (desync); DMA never completes; MISO contention symptom.
UART: FE/PE burst; FIFO overrun; idle-detect missing; optional autobaud mismatch.

Bring-up pass checks (close the loop)

Detect: no-progress triggers within T_idle and classification is stable.
Recover: ladder steps execute in order; escalation only after gates (K, density in W).
Verify: post-recovery probe passes (N_ok times) and key configuration remains consistent (readback/compare).

Production gate (BIST + counters + field log collection)

BIST / loopback: acceptance routines prove the RX/TX path and basic framing without relying on a “perfect” environment.
Counter policy: define reset/rollover rules for error counters; preserve key “lifetime” counters for root-cause trend.
Reset rate limiting: cap device reset / segment power-cycle attempts; enter DEGRADED mode if exceeded.
Field log readiness: minimal schema is always available and exportable (bus/device/op/error/action/TTR).
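
Reset rate limiting can be sketched as a sliding-window counter that refuses further resets and latches DEGRADED once the cap is hit. This Python model is illustrative only; `ResetLimiter`, its parameters, and the latched `degraded` flag are assumptions.

```python
from collections import deque


class ResetLimiter:
    """Allow at most max_resets device resets / power-cycles per window_s seconds."""

    def __init__(self, max_resets, window_s):
        self.max_resets = max_resets
        self.window_s = window_s
        self._stamps = deque()
        self.degraded = False     # latched; cleared by a separate stability policy

    def request(self, now_s):
        """Return True if a reset may be issued at now_s."""
        while self._stamps and now_s - self._stamps[0] >= self.window_s:
            self._stamps.popleft()            # forget resets outside the window
        if len(self._stamps) >= self.max_resets:
            self.degraded = True              # cap exceeded: enter DEGRADED
            return False
        self._stamps.append(now_s)
        return True
```

Refused requests are not recorded as resets, so a storm of requests cannot extend the lock-out indefinitely.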

Pass criteria (placeholders)

TTR_p95 < X ms and recovery_success_rate > Y%.
false_reset_rate < Z (per hour/day; define window W).
After N injections, recovery remains repeatable and configuration consistency is preserved (readback/compare OK).
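
These criteria can be evaluated straight from injection logs. The Python sketch below assumes each result is a dict with hypothetical keys `recovered` and `ttr_ms`, and uses the nearest-rank definition of the 95th percentile; other percentile definitions give slightly different values on small samples.

```python
def ttr_p95(ttr_ms):
    """Nearest-rank 95th percentile of a non-empty list of recovery times."""
    ordered = sorted(ttr_ms)
    rank = (95 * len(ordered) + 99) // 100    # integer ceil(0.95 * n)
    return ordered[rank - 1]


def production_gate(results, x_ms, y_pct):
    """Pass if TTR_p95 < x_ms and recovery_success_rate > y_pct."""
    if not results:
        return False
    ttrs = [r["ttr_ms"] for r in results if r["recovered"]]
    if not ttrs:
        return False
    success_pct = 100.0 * len(ttrs) / len(results)
    return ttr_p95(ttrs) < x_ms and success_pct > y_pct
```

TTR is computed only over recovered cases here; failed recoveries are penalized through the success rate rather than the percentile.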

Diagram · 3-gate flow (Design gate → Bring-up gate → Production gate)

H2-12 · Applications & IC Selection Notes

Intent

Map recovery requirements to application buckets and selection checkpoints without becoming a product page. Example material numbers are provided for design reference—always verify package, suffix, and availability.

Application buckets (recovery-driven requirements)

Industrial chassis / long harness

Common failures: no-progress, stuck session, burst errors.
Recovery focus: bounded escalation + segment isolation + probe-only degraded mode.
Verification: repeated fault injection across temperature and cable changes; audit TTR and false resets.

High-noise environment

Common failures: CRC/header mismatch (SPI), FE/PE bursts (UART), NAK storms (I²C).
Recovery focus: resync + backoff+jitter; avoid retry storms that keep the bus unstable.
Verification: error density windows and rate limiting demonstrate stability recovery.

Hot-plug boundary

Common failures: session boundary corruption, ghost-power symptoms, partial power domains.
Recovery focus: fencing + controlled reset/power sequencing; probe before enabling writes.
Verification: repeated plug cycles; ensure no duplicate write and no false system reset.

Low-power wake / intermittent duty

Common failures: first-frame failure after wake, idle-detect ambiguity, timeout mismatches.
Recovery focus: idle windows, safe resync, and predictable watchdog behavior.
Verification: wake loops with telemetry; ensure TTR stays within budget without power-cycling storms.

Mass production & field service

Common failures: rare corner cases that require attribution, not guesswork.
Recovery focus: consistent error codes + event logs + bounded reset counters.
Verification: pass/fail gates and field log exportability are mandatory.

Selection notes + example material numbers (verify package/suffix/availability)

Bus helpers (timeouts, buffering, isolation, stuck-bus handling)

I²C hot-swap / stuck-bus assist: TCA4307 (TI), PCA9511A (NXP).
I²C mux / hub (channel isolation): TCA9548A (TI), PCA9548A (NXP).
I²C buffer / rise-time accelerator: TCA4311A (TI), PCA9515A (NXP).
I²C isolator: ISO1540/ISO1541 (TI), ADuM1250/ADuM1251 (Analog Devices).
isoSPI / long-chain SPI-style link: LTC6820 (Analog Devices).

Watchdogs, supervisors, reset lines (hard recovery stack)

Reset supervisor: TPS3808 (TI), ADM809/ADM810 (Analog Devices), MCP130 (Microchip).
External watchdog timer: TPS3430 (TI), MAX6369 (Analog Devices/Maxim legacy), MCP1316 (Microchip).
Reset/power sequencing aid (platform-dependent): combine supervisor + load switch to isolate segments (see below).

Segment power & isolation primitives (keep resets local)

Load switch (segment power-cycle): TPS22910A, TPS22965 (TI), FPF2123 (onsemi/Fairchild legacy family).
High-side switch (higher current, platform-dependent): TPS1H100 (TI) example class; verify current/diagnostics.
Digital isolator for SPI/UART-class signals: ISO7741 (TI), ADuM140x family (Analog Devices) as example classes; verify channel count/direction.

Bridges & service ports (field observability and controlled access)

USB ↔ UART bridge: CP2102N (Silicon Labs), FT232R (FTDI), CH340 (WCH) as common classes; verify driver strategy.
RS-485 transceiver (UART layering): SN65HVD72 (TI), MAX3485 (Analog Devices/Maxim) as common classes.
I²C expander for recovery GPIOs: PCA9555 (NXP), TCA9555 (TI) as common classes.

Capability matrix

Controller block: byte/txn/idle timeouts + error flags + abort/flush path + “no-progress” telemetry.
Target device: reset pin and safe-state behavior + read-only probe register + write-readback capability.
Bridge/expander: stuck-bus timeout and counters + channel isolation/disable + predictable reset behavior.

Diagram · Selection flow (requirements → hardware hooks → firmware structure → verification plan) + capability stacks

Example ICs above are reference points. Selection should start from recovery requirements and verification plan, then back-propagate to hardware hooks and firmware structure.

H2-13 · FAQs (Error Handling & Recovery)

Intent

Close long-tail troubleshooting without expanding the main body. Each answer uses a fixed 4-line structure with measurable pass criteria placeholders.

I²C bus stuck low after a brown-out — first check line vs state-machine?

Likely cause: SDA/SCL is physically held low by a slave or clamp, or the controller/slave FSM is wedged mid-transaction.

Quick check: Sample SDA/SCL as GPIOs (inputs) and compare to controller status (BUSY/START seen/STOP seen). If lines are low even with controller disabled → line-held; if lines float high but BUSY persists → FSM wedge.

Fix: Disable the I²C block → GPIO bus-clear (pulse SCL, up to 9 clocks, until SDA releases) → generate STOP if possible → re-init controller; then run a read-only health probe before any writes.

Pass criteria: SDA/SCL high within T_idle; health probe passes N_ok times; no duplicate write events after recovery (0 occurrences in window W).
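
The quick check and the bus-clear step can be sketched in Python with injected pin accessors. `read_sda`, `set_scl`, and `delay` are hypothetical callables standing in for GPIO access; the nine-pulse bound follows the bus-clear procedure in the I²C specification. STOP generation and controller re-init are left out of the sketch.

```python
def classify_hang(sda_high, scl_high, controller_busy):
    """Classify using pins sampled as GPIO inputs with the controller disabled.

    controller_busy is the BUSY status captured before the block was disabled.
    """
    if not (sda_high and scl_high):
        return "line_held"      # a slave or clamp is physically holding a line low
    if controller_busy:
        return "fsm_wedge"      # lines are free but the controller FSM is stuck
    return "idle"


def bus_clear(read_sda, set_scl, delay, max_pulses=9):
    """Pulse SCL until SDA is released; return pulses used, or -1 if still held."""
    pulses = 0
    while not read_sda():
        if pulses == max_pulses:
            return -1           # still held after the bounded attempt: escalate
        set_scl(False)
        delay()
        set_scl(True)
        delay()
        pulses += 1
    return pulses
```

A return of -1 is the escalation trigger for the next ladder step (device reset or segment power-cycle).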

Clock stretching happens occasionally — where to set master timeout without breaking slow devices?

Likely cause: A legitimate slow slave stretches occasionally, or a fault makes SCL “never release” (hung stretch).

Quick check: Log stretch duration distribution (max/p95) per device address; compare to “normal” device behavior. If duration is unbounded or always hits the cap → treat as hung stretch.

Fix: Use a two-tier policy: T_byte cap for individual steps + T_txn cap for whole transaction; allow longer caps only for specific known-slow addresses, and escalate to bus-clear after K consecutive cap hits.

Pass criteria: No transaction exceeds T_txn; stretch p95 stays below T_stretch_p95; false timeouts on known-slow devices < X per window W.
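
The two-tier policy can be sketched as a per-address check. This Python model is illustrative: `StretchPolicy`, the microsecond units, and the example addresses in the usage note are assumptions, not values from any device datasheet.

```python
class StretchPolicy:
    """Two-tier timeout: per-byte cap (per address) plus a whole-transaction cap."""

    def __init__(self, t_byte_us, t_txn_us, slow_caps=None, k_escalate=3):
        self.t_byte_us = t_byte_us
        self.t_txn_us = t_txn_us
        self.slow_caps = dict(slow_caps or {})  # longer byte caps for known-slow addresses
        self.k_escalate = k_escalate
        self.hits = {}                          # consecutive cap hits per address

    def byte_cap(self, addr):
        return self.slow_caps.get(addr, self.t_byte_us)

    def check(self, addr, byte_us, txn_us):
        """Return 'ok', 'timeout', or 'bus_clear' after K consecutive cap hits."""
        if byte_us <= self.byte_cap(addr) and txn_us <= self.t_txn_us:
            self.hits[addr] = 0
            return "ok"
        self.hits[addr] = self.hits.get(addr, 0) + 1
        return "bus_clear" if self.hits[addr] >= self.k_escalate else "timeout"
```

For example, with a 1000 µs default byte cap and a 5000 µs cap for address 0x50, a 4000 µs stretch is accepted from 0x50 but counted as a cap hit from any other address.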

SPI CRC bursts after one glitch — how to resync without resetting the whole system?

Likely cause: A single edge glitch caused bit-slip so host/device frame boundaries no longer align (desync), even though the bus still toggles.

Quick check: After forcing CS deassert, send a short “sync header probe” (read-only pattern). If header stays wrong across attempts → boundary is shifted.

Fix: Apply a session fence: CS ↑ guard time → CS ↓ → transmit N_dummy_bits of dummy clocks with safe MOSI fill → read-only header probe → flush driver/DMA queue if descriptors are polluted; reset only the target device if probe never locks.

Pass criteria: Header probe locks within M attempts; CRC error rate < R_crc in window W; TTR_p95 < X ms without system reset.

CS toggling doesn’t recover — what’s the quickest “dummy clocks” sanity check?

Likely cause: CS toggling ends the session, but the target remains inside a partial command/stream state and requires boundary re-alignment.

Quick check: Use a minimal sequence: CS ↑ (guard) → CS ↓ → dummy clocks for exactly one frame length (or a whole number of bytes) → issue a read-only “known header” read. If the returned header changes, or moves closer to the expected value, as the dummy length is varied → desync confirmed.

Fix: Standardize N_dummy_bits (safe MOSI fill) and require a header probe gate before resuming traffic; if multiple slaves share MISO, add a “Hi-Z check” phase (verify no contention) before reasserting CS.

Pass criteria: Header probe passes N_ok consecutive reads; no MISO contention events in W; no-progress counter remains 0 during steady state.

UART shows framing errors only in bursts — noise coupling or baud drift?

Likely cause: Burst-coupled noise corrupts edges (FE/PE spikes), or baud mismatch accumulates until sampling slips (drift).

Quick check: Compare FE/PE burst timing to system events (motors, relays, radio TX) and to temperature/clock changes. Noise bursts correlate to events; drift correlates to sustained operating conditions and persists until re-sync/autobaud.

Fix: For bursts: flush RX FIFO + clear flags → wait for an idle window T_idle_uart → accept first valid frame gate. For drift: re-acquire baud (or switch to a calibrated divisor) and re-validate with an idle window gate.

Pass criteria: After recovery, FE/PE rate < R_fe over window W; first-valid-frame locks within TTR; no FIFO overruns in N frames.

Recovery works on bench but fails in chassis — what telemetry field is usually missing?

Likely cause: The failure is not just “an error” but “no forward progress” under real coupling; missing telemetry hides the escalation triggers, so the wrong action is chosen.

Quick check: Confirm logs include last_progress_ts, error density (errors per W), burst length, action level, and reset cause. If only raw error codes exist, the ladder cannot be audited.

Fix: Add “progress” counters (transaction complete increments) + ring-buffer events (bus/device/op/duration/error/action/attempt) and classify warning/degraded/fail-safe. Use these to drive escalation deterministically.

Pass criteria: ≥ P% of field failures have an attributable root bucket; escalation steps are reproducible; TTR_p95 and false reset rate are measurable from logs.

Retries make it worse — how to detect retry storms and apply backoff?

Likely cause: Fast retries amplify error density and prevent the bus/device from returning to a stable state (self-sustaining storm).

Quick check: Track retries per second and consecutive failures per device. If retry rate rises while progress stays flat (no completions), a storm is present.

Fix: Enforce exponential backoff with jitter and a hard cap; rate-limit per device and globally; if density exceeds threshold, enter DEGRADED (probe-only) until stability returns.

Pass criteria: Retry rate bounded below R_retry; progress resumes within T_recover; error density falls below D_ok in window W.
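
The quick check (retry rate rising while completions stay flat) can be sketched as a sliding-window detector. This Python model is a hypothetical illustration; the class name, millisecond timestamps, and the count-per-window threshold are assumptions.

```python
from collections import deque


class StormDetector:
    """Flag a retry storm: many retries in the window and no completions."""

    def __init__(self, window_ms, max_retries):
        self.window_ms = window_ms
        self.max_retries = max_retries
        self.retries = deque()
        self.completions = deque()

    def _trim(self, stamps, now_ms):
        while stamps and now_ms - stamps[0] >= self.window_ms:
            stamps.popleft()

    def on_retry(self, now_ms):
        self.retries.append(now_ms)

    def on_completion(self, now_ms):
        self.completions.append(now_ms)

    def storm(self, now_ms):
        """True if retries exceed the cap while progress is flat in the window."""
        self._trim(self.retries, now_ms)
        self._trim(self.completions, now_ms)
        return len(self.retries) > self.max_retries and not self.completions
```

A positive result is the point at which backoff should widen or the device should drop to DEGRADED (probe-only) traffic.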

After recovery, device config is wrong — how to design idempotent writes?

Likely cause: A timeout occurred during a non-idempotent write (increment/append/trigger), and an automatic retry re-applied a side effect or wrote partial state.

Quick check: Identify which writes are non-idempotent and whether a timeout can happen mid-write (page write, multi-register sequence). Check if the code retries these without a “confirm/readback” gate.

Fix: Make writes idempotent via version/sequence tokens or “write-then-readback compare”; after timeout, block blind retry and switch to read-only probe + reconcile state before re-applying changes.

Pass criteria: Duplicate side effects = 0 in window W; readback/compare passes after recovery; config hash matches expected within T_reconcile.

Watchdog resets too often — how to separate comm watchdog vs system watchdog?

Likely cause: A single watchdog is covering both “temporary comm faults” and “system deadlock,” causing over-resets and masking root cause.

Quick check: Log reset cause and the last comm progress timestamp. If resets happen while other tasks are healthy and only comm stalls, the system watchdog is triggered too aggressively.

Fix: Split into task watchdog (task heartbeat), comm watchdog (no-progress/escalation), and system watchdog (fail-safe only). Require comm ladder attempts before system reset, with rate limits.

Pass criteria: System resets per day < Z; comm watchdog resolves stalls within TTR in > Y% of cases; reset causes are attributable in ≥ P% of logs.

Need fast TTR but no false resets — what escalation thresholds are most effective?

Likely cause: Escalation gates are based on raw error count rather than progress and error density, so transient bursts trigger hard resets.

Quick check: Evaluate triggers: do they use “no-progress for T_idle” and “density in W” or just “N errors”? Raw-N-only is prone to false resets.

Fix: Use layered gates: (1) retry/backoff until K1 failures, (2) resync/reinit if no-progress persists, (3) device reset only if probe fails, (4) segment power-cycle only after K2 repeats with rate limit.

Pass criteria: TTR_p95 < X ms while false_reset_rate < Z; escalation level distribution stable across runs; hard resets occur only after gate conditions are met.

One bad device wedges the bus — how to quarantine without losing the whole segment?

Likely cause: A single target holds a shared resource (line, MISO, or power domain), causing repeated no-progress and forcing global recovery.

Quick check: Identify the “dominant offender” by per-device no-progress/timeout counters and which address/CS correlates with stalls; confirm that the rest of the segment is healthy via probe-only traffic.

Fix: Quarantine by disabling that channel/device (mux/expander channel disable, CS inhibit) or power-cycling only that sub-branch; keep the system in DEGRADED mode with read-only probes for the remaining devices.

Pass criteria: Non-offender devices maintain progress (no-progress=0) for window W; quarantined branch stays isolated; system-wide resets drop below Z per day.

Production wants a pass/fail — what’s a minimal fault-injection test set?

Likely cause: Production tests validate normal traffic but do not exercise recovery paths, so latent “hung” corner cases escape.

Quick check: Confirm whether the test includes at least one injected no-progress condition and one desync/burst condition, and whether it records TTR + action levels.

Fix: Minimal set: (1) I²C SDA low (bus-clear path), (2) SPI forced desync (dummy clocks + header probe), (3) UART FE burst (flush + idle window), plus reset-rate limit verification; require logs for each step.

Pass criteria: All injections recover within TTR; success rate > Y%; false reset rate < Z; post-recovery probes pass N_ok times.
