Firmware Robustness: Timeouts, Retries, CRC & Recovery
Firmware robustness means making serial-bus failures predictable, measurable, and self-healing.
This page delivers practical rules for timeouts, retries, CRC/framing, power-fail-safe writes, and recovery state machines—each with clear pass/fail criteria for bring-up and production.
Definition & Scope Guard (Firmware Robustness for Serial Buses)
Robustness is not “zero errors.” It is predictable failures, deterministic self-healing, and measurable production acceptance.
- Predictable failures: consistent counters + consistent denominators + consistent time windows.
- Deterministic recovery: bounded steps, bounded time, and explicit escalation rules (no infinite loops).
- Production acceptance: pass/fail thresholds defined per metric (e.g., errors per 1k, p99 latency).
What this page provides:
- Strategy: staged timeouts, retry ladder with backoff, integrity framing.
- Templates: counters/log fields, recovery state machine skeleton, atomic-write skeleton.
- Checklist: bring-up → production verification steps with fault injection.
- Pass criteria: measurable thresholds and acceptance gates (placeholders for X/Y).
- In-scope: timeout/retry/CRC policies, power-fail write safety, recovery state machines, observability, and acceptance definitions.
- Out-of-scope: electrical/timing/SI/layout tutorials for I²C/SPI/UART (handled in protocol subpages). Only minimal firmware-relevant conclusions appear here.
Failure Model & Symptom Taxonomy (What actually goes wrong)
Robustness begins with classification: the same symptom can come from different root causes. Each bucket below must be tied to observable evidence (counters, timing phase, correlations), not assumptions.
Each symptom must map to: root-cause bucket → first check → action type. This prevents random timeout/retry tuning.
Symptom: scattered NAK/address errors on specific endpoints.
- First check: split by address + op + time window; compare NAK per 1k across endpoints.
- Action type: retry with backoff; escalate to reinit if repeated at the same phase.
Symptom: transfers stall with no forward progress.
- First check: phase tagging in logs (queue/transfer/peer); verify the stall repeats at one phase.
- Action type: reinit bus controller → bus-clear / peer reset → reset domain (bounded ladder).
Symptom: CRC/framing failures that track load.
- First check: correlate CRC fails with queue depth, under-run counters, and CPU load spikes.
- Action type: throttle bursts / add backpressure; only then tune retries.
Symptom: latency spikes and retry storms.
- First check: p95/p99 latency during error windows + retry_count; confirm whether latency spikes follow retry bursts.
- Action type: add backoff + global rate limiting + circuit-breaker isolation.
Symptom: configuration corruption after power events.
- First check: integrity markers (version/CRC/commit flag) and last-write timestamps on boot.
- Action type: fallback slot + atomic commit; fail-safe mode if integrity fails.
Recoverable pattern (transient faults):
- Errors are scattered and succeed after bounded retries.
- Failure phase varies; no single “stuck point” dominates.
- Integrity markers remain valid; state can return to a known baseline.
Escalation pattern (structural faults):
- Failures repeat at the same phase with no forward progress (stall pattern).
- Integrity checks fail (version/CRC/commit flag mismatch), indicating corrupted state.
- Recovery exceeds a bounded time budget or repeats beyond a bounded step count.
The result is a consistent taxonomy that routes any symptom into the sections that follow: timeout design, retry/backoff, CRC/framing, and recovery-state-machine escalation.
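The routing above can be kept as a data table instead of scattered if-else, which makes the policy auditable. A minimal C sketch; the bucket and action names are illustrative, not from a specific codebase:

```c
#include <assert.h>
#include <string.h>

/* Root-cause buckets from the taxonomy above. */
typedef enum {
    BUCKET_ADDR_NAK, BUCKET_PHASE_STALL, BUCKET_LOAD_CRC,
    BUCKET_RETRY_STORM, BUCKET_CONFIG_CORRUPT, BUCKET_COUNT
} fault_bucket_t;

/* One row per bucket: the action type the taxonomy routes to. */
static const char *const bucket_action[BUCKET_COUNT] = {
    [BUCKET_ADDR_NAK]       = "retry_backoff",
    [BUCKET_PHASE_STALL]    = "reinit_ladder",
    [BUCKET_LOAD_CRC]       = "throttle_backpressure",
    [BUCKET_RETRY_STORM]    = "rate_limit_breaker",
    [BUCKET_CONFIG_CORRUPT] = "fallback_failsafe",
};
```

Keeping the mapping as data also lets the evidence pack export it, so logs and policy can be diffed per firmware build.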
Observability & Metrics (Counters, logs, and pass/fail definitions)
Robustness requires consistent measurement. Each counter must define: trigger point, denominator, and grouping dimensions. Without this contract, dashboards disagree and tuning becomes random.
Core counters:
- timeout_count (prefer phase tags: queue/transfer/peer)
- retry_count (include attempt index and escalation level)
- crc_fail, framing_err
- bus_reset (and optional: reinit/bus-clear/peer-reset)
- recovery_success, recovery_fail
Derived metrics:
- Error rate (/1k): (errors ÷ transactions) × 1000
- MTBF: mean time between severe events (recovery_fail / fail-safe)
- Latency p95/p99: sampled at a fixed start/end point (define once)
- Recovery duration: detect → back-to-normal (not just reset issued)
- Recovery effectiveness: success ÷ (success + fail)
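The two ratio metrics above reduce to small pure functions; making the zero-denominator rule explicit is what keeps dashboards from disagreeing on empty windows. A minimal C sketch (function names are illustrative):

```c
#include <assert.h>
#include <stdint.h>

/* Error rate per 1k transactions: (errors ÷ transactions) × 1000.
 * Empty windows report 0 rather than faulting on a zero denominator. */
static double error_rate_per_1k(uint32_t errors, uint32_t transactions) {
    if (transactions == 0)
        return 0.0;
    return (double)errors * 1000.0 / (double)transactions;
}

/* Recovery effectiveness: success ÷ (success + fail); 1.0 when untested. */
static double recovery_effectiveness(uint32_t success, uint32_t fail) {
    uint32_t total = success + fail;
    return (total == 0) ? 1.0 : (double)success / (double)total;
}
```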
Logging policy:
- Event-driven: log only state transitions, timeouts, escalations, integrity failures, and fail-safe entry.
- Rate-limited: keep one representative record per window and aggregate the rest into counters.
- Minimal context: addr/op/len/phase/state/power_state + attempt_index + escalation_level.
Acceptance thresholds (placeholders):
- timeout_rate < X per 1k transactions
- crc_fail_rate < X per 1k transactions
- recovery_fail = 0 (or < X per hour, if allowed)
- p99_latency < Y ms (steady-state)
Measurement window:
- Over N transactions or T minutes (define once)
- Include worst-case modes: peak throughput, cold/hot starts, power transitions (if applicable)
Bucketing:
- Per port/bus, per peer/address, per operation type (read/write), and per power state
- Report includes firmware version/build id and configuration profile
Timeout Design (Budgeting time, not guessing)
A single “giant timeout” hides where time is spent and makes recovery unpredictable. Use layered timeouts to keep progress measurable and escalation deterministic.
Segment budgets produce two outcomes: (1) phase-level diagnosis and (2) phase-specific actions. Fill X/Y/Z and keep the structure.
- T_queue_max = X ms
- T_transfer_max = Y ms
- T_peer_max = Z ms
- T_retry_window = policy-based (backoff + jitter)
- T_deadline = X + Y + Z + margin (tail latency margin is explicit)
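The budget can live in one struct so the deadline is always derived from the segments, never a second hand-tuned constant. A hedged C sketch; field names and the example values stand in for X/Y/Z:

```c
#include <assert.h>
#include <stdint.h>

/* Layered timeout budget (fill from measurement, not guesswork). */
typedef struct {
    uint32_t t_queue_max_ms;     /* X: wait in the submit queue   */
    uint32_t t_transfer_max_ms;  /* Y: bytes on the wire          */
    uint32_t t_peer_max_ms;      /* Z: peer processing window     */
    uint32_t t_margin_ms;        /* explicit tail-latency margin  */
} timeout_budget_t;

/* T_deadline = X + Y + Z + margin; never extended implicitly. */
static uint32_t deadline_ms(const timeout_budget_t *b) {
    return b->t_queue_max_ms + b->t_transfer_max_ms +
           b->t_peer_max_ms + b->t_margin_ms;
}
```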
Soft timeout (slow but progressing):
- Use when progress exists but the phase is slow (peer busy, batch DMA).
- Action: log + throttle + extend within the total deadline (no infinite extension).
- Escalate only if repeated slowdowns violate p99 latency targets.
Hard timeout (no forward progress):
- Use when no forward progress is observed (stall pattern, repeated phase failure).
- Action: abort → retry ladder (retry → reinit → bus-clear → reset domain → isolate).
- Stop conditions: bounded step count and bounded recovery time budget.
- Queue timeout: reduce contention, apply global rate limiting, verify scheduling and backpressure.
- Transfer timeout: abort safely, retry with backoff, then reinit controller if repeated.
- Peer timeout: classify busy vs stall; soft timeout for busy, hard timeout for stall; escalate to peer reset only when bounded.
- Deadline violation: enter fail-safe/isolation path and record a production-grade incident entry.
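The per-phase map above is mechanical enough to encode directly. A sketch of phase-tagged dispatch; the enum names are placeholders to wire to your own phase tags:

```c
#include <assert.h>

typedef enum { PH_QUEUE, PH_TRANSFER, PH_PEER, PH_DEADLINE } timeout_phase_t;
typedef enum {
    ACT_RATE_LIMIT,      /* queue: reduce contention, backpressure  */
    ACT_RETRY_BACKOFF,   /* transfer: abort safely, retry, reinit   */
    ACT_CLASSIFY_BUSY,   /* peer: split busy (soft) vs stall (hard) */
    ACT_FAILSAFE         /* deadline: isolate + incident record     */
} timeout_action_t;

/* First action per timeout phase, per the escalation map above. */
static timeout_action_t on_timeout(timeout_phase_t ph) {
    switch (ph) {
    case PH_QUEUE:    return ACT_RATE_LIMIT;
    case PH_TRANSFER: return ACT_RETRY_BACKOFF;
    case PH_PEER:     return ACT_CLASSIFY_BUSY;
    default:          return ACT_FAILSAFE;
    }
}
```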
Retry Strategy & Backoff (Make retries help, not harm)
Good retries reduce transient failures without creating congestion collapse. The policy must be bounded, phase-aware, and escalating. Use backoff and jitter to prevent synchronized retry storms and keep recovery deterministic.
Not all commands are safe to retry. Classify operations by side effect risk and apply explicit guards (sequence, transaction id, readback verify, and commit rules).
Safe to retry (idempotent):
- Read / query operations
- Status polling with bounded rate
- Requests with sequence number and stateless response
Retry only with verification guards:
- Register writes with readback verify
- Writes with expected-version / compare-before-write
- Updates that are safe only with transaction id + commit flag
Never blind-retry:
- EEPROM / flash / page write without atomic commit
- One-shot actions (erase, trigger, arm)
- Stateful operations without sequence or replay protection
Required guard fields:
- attempt_index + escalation_level for observability and policy enforcement
- seq / transaction_id to detect replay or duplicate processing
- readback verify to confirm the intended value, not just a completed transfer
- commit rule (two-phase where needed) to prevent partial writes from becoming permanent
Exit criteria from degrade mode:
- A single success unblocks a pipeline but does not prove health.
- N consecutive successes are required to exit degrade mode (prevents oscillation).
- The observation window must confirm error/1k and p99 latency remain within thresholds.
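A bounded exponential-backoff-with-jitter helper, sketched in C. The base, cap, and the cheap LCG jitter source are illustrative choices for the placeholders above, not a prescribed policy:

```c
#include <assert.h>
#include <stdint.h>

#define RETRY_MAX_ATTEMPTS 5   /* bounded: beyond this, escalate */
#define BACKOFF_BASE_MS    2
#define BACKOFF_CAP_MS     64

/* Delay grows as base << attempt up to the cap; jitter then spreads each
 * delay over [backoff/2, backoff] so peers desynchronize and retry storms
 * do not line up. The LCG is a deliberately cheap jitter source. */
static uint32_t retry_delay_ms(uint32_t attempt, uint32_t *jitter_seed) {
    uint32_t backoff = BACKOFF_BASE_MS << (attempt < 16 ? attempt : 16);
    if (backoff > BACKOFF_CAP_MS)
        backoff = BACKOFF_CAP_MS;
    *jitter_seed = *jitter_seed * 1103515245u + 12345u;   /* LCG step */
    return backoff / 2 + (*jitter_seed >> 16) % (backoff / 2 + 1);
}
```

The caller enforces `RETRY_MAX_ATTEMPTS` and the idempotency classification; this helper only shapes the spacing.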
Data Integrity: CRC, Framing, and Sequence Control
CRC detects corruption, but robust systems also need framing and sequence control. Use the smallest layer that can provide reliable detection and actionable recovery.
To handle I²C/SPI/UART uniformly, wrap the bus transfer with a higher-layer frame: header (type/len/seq/flags), payload, and CRC. Sequence control prevents replay and duplicate processing, not just corruption.
Frame fields:
- type: parsing safety and operation intent
- len: boundary control (prevents truncation/overrun and UART “glue” errors)
- seq: replay/duplicate detection and de-bounce for retry ladders
- flags: optional fields, capability bits, and downgrade mode
- CRC: integrity verification with clear “where to check” and “what to log”
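As a concrete sketch, here is a minimal header layout plus a CRC-16/CCITT-FALSE routine (poly 0x1021, init 0xFFFF), a common choice for this kind of framing; the field widths are illustrative, not mandated:

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* Wrapper-frame header per the field list above. */
typedef struct {
    uint8_t  type;   /* operation intent, parsing safety       */
    uint8_t  flags;  /* capability bits / downgrade mode       */
    uint16_t len;    /* payload boundary, checked before parse */
    uint16_t seq;    /* replay / duplicate detection           */
} frame_hdr_t;

/* CRC-16/CCITT-FALSE: poly 0x1021, init 0xFFFF, no reflection. */
static uint16_t crc16_ccitt(const uint8_t *data, size_t n) {
    uint16_t crc = 0xFFFF;
    for (size_t i = 0; i < n; i++) {
        crc ^= (uint16_t)((uint16_t)data[i] << 8);
        for (int b = 0; b < 8; b++)
            crc = (crc & 0x8000) ? (uint16_t)((crc << 1) ^ 0x1021)
                                 : (uint16_t)(crc << 1);
    }
    return crc;
}
```

The receiver checks `len` before touching the payload and the CRC before acting on it; that ordering is what turns corruption into a countable event rather than a parse fault.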
CRC failure response ladder:
- Single sparse CRC fail: immediate retry (bounded).
- Clustered CRC fails: backoff + jitter; consider throughput throttling.
- CRC fails correlated with load: verify buffer underrun/overrun counters and scheduling.
- Persistent CRC fails: escalate (reinit → bus-clear → reset domain → isolate).
Escalation thresholds:
- Consecutive CRC fails ≥ N → escalation +1
- CRC_fail_rate ≥ X / 1k → enable storm controls
- Recovery effectiveness drops below Y → quarantine peer
Capability downgrade (peers without CRC support):
- Use flags for optional CRC/seq fields and downgrade mode.
- Attempt CRC-enabled mode first; downgrade only after a capability mismatch is detected.
- In no-CRC mode, keep length/type strict and tighten timeouts and retry limits.
Power-Fail Safety (Write-protection, brown-out, and atomicity)
Power interruptions are not “rare corner cases” in production. Robust firmware treats configuration writes as transactions that remain recoverable across brown-out, dropout, and bounce (power cycling near thresholds).
The goal is not “never lose power,” but to ensure storage always contains at least one valid configuration and the newest valid one can be selected unambiguously at boot.
Two-phase commit sequence:
- Write new data into staging region.
- Verify CRC before writing commit flag.
- Commit is written only when data is complete and verifiable.
Integrity markers:
- CRC validates content integrity.
- Version/seq prevents replay and selects the newest valid slot.
- Boot selection uses commit + CRC + newest version.
Dual-slot rotation:
- Dual-slot (A/B) update rotation.
- Always keep one last-known-good slot.
- Rollback if newest slot fails validation.
Write guards:
- Inhibit writes when power state is not stable (brown-out or bouncing).
- Prefer compare-before-write to avoid unnecessary wear.
- Require readback verify for writes with side effects.
Write-retry policy:
- Retry only when the failure type is transient and side effects are guarded.
- Keep retry count lower than read operations and log attempt index + power state.
- After repeated failures, stop writes and enter a safe mode rather than grinding storage.
Boot recovery procedure:
- Scan slot A/B and validate commit + CRC + version.
- Select newest valid slot; rollback if newest is invalid.
- If commit exists but CRC fails, enter fail-safe and preserve diagnostics.
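The selection rule (commit + CRC + newest version) is worth writing as a pure function so it can be unit-tested against injected corruption. A sketch with an illustrative slot-metadata layout:

```c
#include <assert.h>
#include <stdint.h>

/* Slot metadata; layout is illustrative. The commit flag is written last,
 * only after the staged CRC verifies. */
typedef struct {
    uint32_t version;     /* monotonically increasing write sequence */
    uint16_t stored_crc;  /* CRC recorded during the staging phase   */
    uint8_t  commit;      /* set only after CRC verification          */
} slot_meta_t;

/* Valid only when the commit flag is set AND the stored CRC matches
 * the CRC recomputed over the slot contents at boot. */
static int slot_valid(const slot_meta_t *s, uint16_t computed_crc) {
    return s->commit == 1 && s->stored_crc == computed_crc;
}

/* Returns 0 (slot A), 1 (slot B), or -1 → fail-safe, preserve diagnostics. */
static int select_boot_slot(const slot_meta_t *a, uint16_t crc_a,
                            const slot_meta_t *b, uint16_t crc_b) {
    int va = slot_valid(a, crc_a), vb = slot_valid(b, crc_b);
    if (va && vb) return (a->version >= b->version) ? 0 : 1;
    if (va) return 0;
    if (vb) return 1;
    return -1;
}
```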
Fault-injection pass criteria:
- Power-cut injection: X events across write phases; the system must boot with a valid config every time.
- Config integrity: CRC mismatches must remain < Y per N updates.
- Recovery: rollback must complete within T ms and produce a stable “active slot” decision.
Recovery State Machines (Deterministic self-healing)
Replace ad-hoc if-else recovery with a deterministic state machine. Each transition must be observable, bounded, and driven by consistent triggers (timeouts, integrity failures, and “no-progress” detection).
- Re-entrant: repeated triggers do not corrupt state.
- Interruptible: power events and watchdog constraints remain respected.
- Observable: log state, phase, attempt index, escalation level.
- Finite: maximum steps and maximum recovery time budget.
- Escalating: local → link → system, based on thresholds and effectiveness.
L1 (local) actions:
- clear FIFO
- flush DMA
- restart transaction
L2 (link) actions:
- reinit driver
- bus-clear / bus reset
- re-scan devices
- re-sync configuration
L3 (system) actions:
- toggle peer reset GPIO
- reset domain
- enter safe-mode
- limit writes and throughput
- preserve diagnostics
Escalation triggers:
- no progress in a hard-timeout window
- consecutive failures ≥ N
- recovery effectiveness below Y
- integrity failures persist across reinit/bus-clear
Stop rules:
- max attempts per peer/address
- max recovery time budget
- circuit breaker cooldown + half-open probe
- enter fail-safe when thresholds are exceeded
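A bounded escalation step can be this small; the attempt limits below are placeholders for N, and a real machine would attach the concrete L1/L2 actions listed above to each state:

```c
#include <assert.h>
#include <stdint.h>

typedef enum {
    SM_NORMAL, SM_RECOVER_L1, SM_RECOVER_L2, SM_FAILSAFE
} recovery_state_t;

#define MAX_L1_ATTEMPTS 3   /* placeholder for N */
#define MAX_L2_ATTEMPTS 2

/* One bounded escalation step: attempts 1..3 run L1 (local) actions,
 * 4..5 run L2 (link) actions, beyond that the machine parks in FAILSAFE
 * (no infinite loops). A success resets the counter. */
static recovery_state_t escalate(uint32_t *attempts, int last_succeeded) {
    if (last_succeeded) { *attempts = 0; return SM_NORMAL; }
    (*attempts)++;
    if (*attempts <= MAX_L1_ATTEMPTS)
        return SM_RECOVER_L1;
    if (*attempts <= MAX_L1_ATTEMPTS + MAX_L2_ATTEMPTS)
        return SM_RECOVER_L2;
    return SM_FAILSAFE;
}
```

A wall-clock recovery budget would sit alongside the attempt counter; whichever bound trips first forces the FAILSAFE transition.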
Bus-Specific Firmware Hooks (I²C vs SPI vs UART)
Electrical and timing details are intentionally out of scope. This section focuses on firmware levers that turn “random glitches” into measurable, bounded behaviors: detection signals, recovery actions, prevention policies, and logging fields that keep teams aligned.
Cross-bus stop rules:
- max attempts per peer / address
- max recovery time budget
- circuit breaker cooldown + half-open probe
- fail-safe for repeated integrity failures
Common log fields:
- bus_id, peer_id (addr / cs / port), op, len
- phase/state, attempt_index, escalation_level
- power_state, timestamp, latency bucket
- result_code (timeout / integrity / overflow)
I²C — detection signals:
- hung-bus: SCL-low vs SDA-low + duration window
- no-progress: phase not advancing / counters not changing
- arbitration loss: treat as recoverable event
- stretch seen: classify into peer-process budget
I²C — recovery actions:
- bus-clear: SCL pulses → STOP → controller reinit
- success rule: bus returns to idle + START is accepted
- controller reset if internal state is stuck
- escalate to quarantine if repeated stalls persist
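The bus-clear rung (up to nine SCL pulses, then STOP) can be written against injected pin operations so the policy is host-testable. The pin-op names below are hypothetical; on hardware they map to open-drain GPIO toggles in your HAL:

```c
#include <assert.h>

/* Injected pin operations (hypothetical names — adapt to your HAL). */
typedef struct {
    void (*scl_pulse)(void *ctx);  /* one SCL low→high cycle          */
    int  (*sda_high)(void *ctx);   /* 1 when SDA reads released        */
    void (*send_stop)(void *ctx);  /* SDA low→high while SCL held high */
    void *ctx;
} i2c_pins_t;

/* Up to 9 SCL pulses, stopping early once SDA releases, then a STOP.
 * Returns the pulses used on success, -1 if SDA never released
 * (next rung: controller reinit / peer reset). */
static int i2c_bus_clear(const i2c_pins_t *p) {
    for (int i = 0; i <= 9; i++) {
        if (p->sda_high(p->ctx)) {
            p->send_stop(p->ctx);
            return i;
        }
        if (i < 9)
            p->scl_pulse(p->ctx);
    }
    return -1;
}

/* Host-side fake for testing: SDA releases after three pulses. */
typedef struct { int pulses; } fake_bus_t;
static void fake_pulse(void *c) { ((fake_bus_t *)c)->pulses++; }
static int  fake_sda(void *c)   { return ((fake_bus_t *)c)->pulses >= 3; }
static void fake_stop(void *c)  { (void)c; }
```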
I²C — prevention policies:
- soft vs hard timeouts for stretching vs no-progress
- backoff + jitter after arbitration loss
- bounded retries; avoid infinite reinit loops
- write operations use stricter retry constraints
I²C — log fields:
- addr, op, phase, attempt, escalation_level
- stretch_seen, arbitration_lost, bus_clear_used
- hung_type (SCL/SDA), stall_duration
- result_code + recovery_time
SPI — detection signals:
- CS framing mismatch: expected_len vs actual_len
- mode mismatch: pattern test / known-header readback
- DMA underrun/overrun masquerading as data/CRC errors
- sync loss: header/type/seq inconsistent
SPI — recovery actions:
- flush FIFO/DMA → reinit controller → sync pattern
- success rule: N consecutive headers/seq valid
- escalate: bus-domain reset if no-progress persists
- quarantine noisy peer to protect the system
SPI — prevention policies:
- transaction framing: type/len/seq in the payload
- rate-limit retries to avoid DMA queue avalanches
- guard writes with readback verify + idempotency
- explicit resync entry point after reset
SPI — log fields:
- cs_id, mode_id, expected_len, actual_len
- dma_underrun, dma_overrun, fifo_level_peak
- sync_pattern_fail, header_fail, seq_gap
- recovery_action + recovery_time
UART — detection signals:
- framing vs parity classification (separate counters)
- break_seen as a state-machine event (wake/resync)
- overrun / watermark peaks (buffer-avalanche signals)
- flow_drop: RTS/CTS missing or mishandled
UART — recovery actions:
- flush RX → wait idle → resync header/preamble
- success rule: N frames with valid length/type/seq
- backoff on clustered errors; avoid tight loops
- fail-safe when integrity remains unstable
UART — prevention policies:
- deglitch/filters only with latency budget control
- watermark-based backpressure + drop policy
- flow-control health checks (periodic sanity)
- bounded retries to prevent buffer snowballing
UART — log fields:
- baud, frame_cfg, framing_err, parity_err
- overrun, rx_watermark_peak, flow_drop
- break_seen, resync_count, seq_gap
- recovery_action + recovery_time
Watchdog, Reset Domains, and “Don’t brick the system”
Watchdogs are not timers; they are progress proofs. Feeding should be tied to state-machine milestones, not to periodic loops that might keep running while the system is stuck.
Task-level watchdog:
- guards bus tasks and recovery loops
- fires on no-progress within the local budget
- triggers L1/L2 recovery escalation
System-level watchdog:
- guards scheduler deadlocks and global collapse
- fires when the recovery time budget is exceeded
- enters fail-safe or system reset as a last resort
Progress proofs (feed only on):
- phase advances, bytes/frames committed, queue drains
- recovery step counter increments (bounded)
- health probe passes N consecutive transactions
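Milestone-gated feeding can be as small as the sketch below. `hw_watchdog_kick()` is a hypothetical HAL call standing in for the MCU-specific feed; the milestone can be frames committed, a state-machine phase counter, or a queue-drain count:

```c
#include <assert.h>
#include <stdint.h>

/* Feed the watchdog only when a milestone moved, never from a bare loop
 * that might keep spinning while the system is stuck. */
typedef struct {
    uint32_t last_milestone;
    uint32_t kicks;            /* observability: feeds actually issued */
} progress_wd_t;

static int wd_feed_if_progress(progress_wd_t *wd, uint32_t milestone_now) {
    if (milestone_now == wd->last_milestone)
        return 0;              /* no forward progress: let the dog bite */
    wd->last_milestone = milestone_now;
    wd->kicks++;               /* hw_watchdog_kick() would go here */
    return 1;
}
```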
Reset acceptance criteria:
- after any reset action: N consecutive probes pass (N = X)
- p99 recovery time stays below T ms
- no reset storms: resets per window stay below Y
Post-reset resynchronization:
- re-enumerate devices and rebuild the live topology
- re-apply configuration using atomic-write rules
- flush stale queues, reset seq, and align state
Engineering Checklist (Bring-up → Production) + Engineering Pack
- Required counters: timeout_count, retry_count, crc_fail, framing_err, bus_reset, recovery_success, recovery_fail
- Derived metrics: error_rate (per X/1k transactions), MTBF, p95/p99 latency, recovery_time (Detect→Normal)
- Minimum log fields: bus_id, peer_id (addr/cs/port), op, len, phase/state, attempt_index, escalation_level, power_state, result_code
- Bucket rules: always slice by peer_id + op + window (no mixed denominators)
- Power events: brownout flag simulation, write-inhibit toggling, commit-interrupt cases
- Peer behavior: busy response, forced stalls, dropped ACK/frames, delayed service windows
- Integrity: CRC/framing error injection at protocol layer, seq gaps, length mismatch
- Topology: peer disappear/reappear, re-enumeration, quarantine + half-open probe
- State coverage: Normal→Detect, Detect→Recover_L1, Recover_L1→Recover_L2, Recover_L2→FailSafe, Quarantine→Half-open→Normal
- Evidence: edge_id hit_count, last_seen_time, avg_recovery_ms, p99_recovery_ms
- Stop rules: max attempts per peer, max recovery budget, circuit breaker cooldown
Production acceptance gates:
- Error rate: ≤ X / 1k transactions (per peer/op bucket)
- Latency: p99 ≤ Y ms (define window and load level)
- Recovery: avg ≤ A ms, p99 ≤ B ms
- Reset storm guard: resets/window ≤ C
Release verification:
- Known-pattern probe: header/type/len/seq checks before enabling full traffic
- Version consistency: FW/config/protocol capability alignment (no mixed versions)
- Rollback verification: dual-slot selection rule (commit + CRC + newest version)
- Write safety: write-inhibit must engage during power-unstable / fail-safe states
- Fail-safe frequency: must stay below D per hour (or per test run)
- Quarantine: bad peers isolated; the system remains operational with health probes
- Evidence pack: logs + counters + coverage report exported per unit
Fault-injection test card template:
- Name: ______________________________
- Setup: bus_id / peer_id / load profile / window length
- Injection: power event / busy stall / CRC cluster / disconnect
- Observation fields: counters + logs (phase/state, attempt, escalation, power_state)
- Pass criteria: error_rate ≤ X/1k, p99 latency ≤ Y ms, recovery p99 ≤ B ms, resets/window ≤ C
Examples only. Verify package, temperature grade, suffix, and availability for the exact BOM.
Watchdogs & supervisors:
- TI TPS3430 (external watchdog timer)
- TI TPS3890 / TPS3891 (voltage supervisor)
- Analog Devices / Maxim MAX809 / MAX810 (reset supervisor family)
- Microchip MCP1316 / MCP130 (reset supervisor family)
I²C muxes (segmentation & isolation):
- TI TCA9548A / TCA9546A (I²C mux for address conflicts & isolation)
- NXP PCA9548A (I²C mux)
Bus extenders:
- NXP PCA9615 (differential I²C-bus extender)
- Analog Devices LTC4331 / LTC4332 (I²C/SMBus and SPI bus extenders)
Isolators:
- TI ISO1540 / ISO1541 (I²C isolator)
- Analog Devices ADuM1250 / ADuM1251 (I²C isolator)
Bridges:
- NXP SC16IS740 (UART over I²C/SPI, FIFO)
- NXP SC16IS750 (UART over I²C/SPI, FIFO)
- NXP SC16IS752 (dual UART over I²C/SPI, FIFO)
- NXP SC18IS602B (I²C-to-SPI bridge)
- Silicon Labs CP2102N (USB-to-UART bridge)
- FTDI FT232R (USB-to-UART bridge)
Nonvolatile memories:
- Microchip 24LC256 (I²C EEPROM, write-control pin variants)
- ST M24C64 / M24C128 / M24C256 (I²C EEPROM family)
- Infineon FM24C256 (I²C FRAM family; fast writes, high endurance)
- Fujitsu MB85RC256V (I²C FRAM)
- Winbond W25Q64JV (SPI NOR flash family)
- Macronix MX25L series (SPI NOR flash family)
Applications & Selection Notes (when protocol/bridge features are required)
Long/noisy links (clustered errors, latency spikes):
- Feature set: CRC+seq framing, backoff+jitter, circuit breaker, quarantine + half-open probe
- Common enabling parts: NXP PCA9615 (diff I²C), TI ISO1540 (I²C isolation), ADuM1250 (I²C isolation)
Multi-node topologies (one bad node can stall the system):
- Feature set: per-node buckets, escalation ladder, isolation of bad nodes, bounded retries
- Common enabling parts: TI TCA9548A (I²C mux isolation), NXP PCA9548A (I²C mux), NXP SC16IS752 (dual UART bridge with FIFO)
Field updates under unstable power:
- Feature set: atomic config writes (dual-slot), safe-mode + diagnostics channel, write-inhibit under unstable power
- Common enabling parts: Infineon FM24C256 (FRAM), Fujitsu MB85RC256V (FRAM), TI TPS3430 (watchdog), TI TPS3890 (supervisor)
Safe configuration writes:
- Feature set: compare-before-write, readback verify, commit flag, CRC+version, rollback slot
- Common enabling parts: Microchip 24LC256 (I²C EEPROM), ST M24C256 (EEPROM), Winbond W25Q64JV (SPI NOR)
Selection rule: choose parts that reduce “retry storms” and make failures observable (FIFO depth, isolation, segmentation, and stable nonvolatile behavior).
Decision questions:
- Is the link long or noisy (clustered errors, latency spikes)?
- Is the topology multi-node (one bad node can stall the system)?
- Is field update required (remote recovery must succeed)?
- Is power-fail write risk present (brownout, bounce, sudden drop)?
- Is isolation used (added delay affects timeout segmentation)?
- Is buffering needed (deep FIFO or rate shaping to avoid avalanches)?
Firmware feature baseline:
- CRC + seq framing: enforce type/len/seq/crc at the protocol layer
- Timeout segmentation: queue / transfer / peer_process / retry_window / deadline
- Retry policy: bounded + backoff + jitter + circuit breaker
- Recovery SM: L1→L2→FailSafe with quarantine + half-open probe
- Atomic writes: dual-slot + commit flag + version + rollback
- Diagnostics path: minimum channel + counters export
I²C muxes & I/O expanders:
- TI TCA9548A / TCA9546A
- NXP PCA9548A
- TI TCA9535 (I/O expander used as controlled reset/enable lines)
- NXP PCA9555 (I/O expander used as controlled reset/enable lines)
Extenders & isolators:
- NXP PCA9615 (differential I²C extender)
- Analog Devices LTC4331 / LTC4332 (I²C/SMBus and SPI bus extenders)
- TI ISO1540 / ISO1541 (I²C isolation)
- Analog Devices ADuM1250 / ADuM1251 (I²C isolation)
- Analog Devices ADuM3151 / ADuM4151 (SPI isolation families)
UART & USB bridges:
- NXP SC16IS740 / SC16IS750 (UART bridge with FIFO)
- NXP SC16IS752 (dual UART bridge with FIFO)
- Silicon Labs CP2102N (USB-to-UART)
- FTDI FT232R (USB-to-UART)
Nonvolatile memories:
- Infineon FM24C256 (FRAM)
- Fujitsu MB85RC256V (FRAM)
- Microchip 24LC256 (EEPROM)
- ST M24C256 (EEPROM)
- Winbond W25Q64JV (SPI NOR flash)
- Macronix MX25L series (SPI NOR flash)
Supervisors & watchdogs:
- TI TPS3430 (watchdog timer)
- TI TPS3890 / TPS3891 (supervisor)
- Microchip MCP1316 / MCP130 (supervisor)
- Analog Devices / Maxim MAX809 / MAX810 (supervisor)
FAQs (Firmware Robustness)
Scope: firmware-only robustness (timeouts, retries, CRC/framing, power-fail safety, recovery state machines). Each answer is fixed to 4 lines to avoid expanding the main text. Thresholds use placeholders (X/Y/A/B/C) and must keep the same denominator/window definitions across the team.
Retries increased, but stability got worse — retry storm or idempotency issue?
Likely cause: synchronized retries (no backoff/jitter) amplify congestion, or non-idempotent commands create repeated side effects (writes/commits).
Quick check: bucket retry_count by peer_id + op; compare read vs write ops; check if failures cluster right after identical retry intervals.
Fix: add bounded retries + exponential backoff + jitter + circuit breaker; mark write/commit ops as “single-shot” or require safe idempotency token/seq.
Pass criteria: retry storms eliminated (no simultaneous spikes across peers), and error_rate ≤ X/1k transactions per peer/op over Y minutes.
CRC fails occasionally, but the scope looks “fine” — buffer/DMA underrun or the link?
Likely cause: buffer boundary bugs, DMA underrun/overrun, or framing/length mismatch masquerading as link corruption.
Quick check: correlate crc_fail with DMA underrun/overrun counters, FIFO watermark peaks, and (type,len,seq) parse errors in logs.
Fix: enforce frame header (type/len/seq) + CRC; reject partial frames; add “no-progress” detection; throttle bursts when watermark stays high.
Pass criteria: crc_fail ≤ X/1k and DMA underrun/overrun = 0 under defined load for Y minutes (per peer bucket).
Timeout too large causes latency explosion; too small causes false kills — what is the first budgeting step?
Likely cause: a single “flat timeout” mixes queueing, transfer, peer processing, and retry windows; decisions become guesswork.
Quick check: log segmented timestamps: queue_start, tx_start, peer_wait_start, retry_start, deadline_hit; verify forward progress (bytes/frames) per segment.
Fix: set soft timeout for peer_busy windows + hard timeout for no-progress; allocate a bounded retry window; keep an absolute deadline for the whole transaction/session.
Pass criteria: p99 latency ≤ Y ms with error_rate ≤ X/1k, and false-kill rate ≤ A/hour in the same workload window.
After power loss, configuration is occasionally corrupted — check commit flag or version/CRC first?
Likely cause: incomplete write accepted as valid, or boot selects the wrong slot due to weak commit/CRC/version rules.
Quick check: inspect boot decision logs: slot_id, commit_flag, stored_crc, computed_crc, version; confirm boot always rejects “commit=0” and CRC mismatch.
Fix: implement atomic two-phase commit (staging→verify→set_commit→verify) + dual-slot rollback; enable write-inhibit under unstable power states.
Pass criteria: across N power-cut injections, boot selection is 100% correct (commit+CRC+newest version), and corrupted-config incidence = 0.
Recovery logic enters a loop — how to add a circuit breaker and escalation ladder?
Likely cause: missing stop rules (max steps/time), or the state machine lacks a “fail-safe” terminal state and keeps retrying the same action.
Quick check: log (state, edge_id, attempt_index, elapsed_ms, escalation_level); verify whether transitions repeat without forward progress or cooldown.
Fix: add bounded attempts + max time budget; escalate L1→L2→FailSafe; quarantine the peer and probe with half-open checks after cooldown.
Pass criteria: infinite loops eliminated (max recovery time ≤ B ms), and recovery_fail ≤ C/hour with resets/window ≤ D.
Production passes, but field units are unstable — what counters/log context are usually missing?
Likely cause: metrics lack consistent denominators/buckets, or logs miss critical context (phase/state/power state), hiding the true failure mode.
Quick check: verify per-peer/per-op bucketing, fixed windows, and presence of (phase/state, attempt, escalation_level, power_state, result_code) in logs.
Fix: standardize counter definitions; add minimal event logs with rate limits; export an evidence pack (counters + key events + recovery edges hit) per unit/build.
Pass criteria: all KPIs computed from identical bucket/window rules, and field faults are classifiable (≥ P% mapped to a known root-cause bucket).
Peer “busy” causes many timeouts — wrong policy, or a non-reentrant state machine?
Likely cause: busy windows treated as hard timeouts, or recovery actions interrupt in-flight transactions without reentrancy protection.
Quick check: check phase/state transitions during busy periods; confirm whether “soft timeout” path exists and whether locks/guards prevent reentering recover while tx is active.
Fix: define soft timeout for busy/stretched phases + bounded backoff; make recovery SM reentrant-safe (idempotent actions, guarded critical sections, cancel/rollback semantics).
Pass criteria: timeout_count during peer_busy drops by ≥ Q%, and recovery_success rate ≥ R% without increased p99 latency beyond Y ms.
SPI high-throughput CRC spikes — DMA underrun first, or sampling window first?
Likely cause: DMA underrun/overrun or transaction framing mismatch is the most common firmware-side cause under burst load.
Quick check: correlate crc_fail with DMA underrun/overrun, FIFO watermark, and (expected_len vs actual_len) per transaction; run a known-header/pattern probe.
Fix: enforce strict framing (type/len/seq) + CRC, throttle bursts when watermark is high, and treat CRC clusters as “degrade” → backoff → resync path.
Pass criteria: DMA underrun/overrun = 0 and crc_fail ≤ X/1k at target throughput for Y minutes (per cs_id bucket).
UART occasional framing errors — baud error budget first, or noise deglitch first?
Likely cause: combined clock mismatch (baud budget) or bursty noise causing edge glitches; both appear as framing_err without proper context.
Quick check: compare error patterns: discrete vs clustered; log measured baud/clock trim, framing_err vs parity_err ratio, and whether errors drop when enabling deglitch/oversampling.
Fix: keep combined baud error within typical budget (target ≤ ±2% unless proven otherwise); add optional deglitch/oversampling with bounded latency impact and clear enable conditions.
Pass criteria: framing_err ≤ X/1k frames with p99 latency ≤ Y ms, and errors do not form clusters above K events in T seconds.
After recovery, things look “OK” but performance drops — what post-recovery health metrics should be logged?
Likely cause: the system stays in a degraded mode (extra retries/backoff, quarantine, reduced throughput) without explicit visibility.
Quick check: log a “recovery stamp” and compare pre/post metrics: p99 latency, retry rate, error_rate, quarantine_count, throughput, and resets/window over the same bucket/window.
Fix: enforce an observation window after recovery; only exit degrade/quarantine after N consecutive clean transactions; reset policy knobs to nominal in a controlled, logged step.
Pass criteria: post-recovery metrics return to baseline within W seconds (p99 ≤ Y ms; error_rate ≤ X/1k), and degrade time ratio ≤ S% per hour.