A UART ↔ SPI/I²C bridge turns a fragile byte stream into verifiable, recoverable bus transactions with buffering, back-pressure, timeouts, and evidence-rich observability—so remote console/debug and production tooling stay deterministic in the field.
Scope & what this bridge really solves
A UART↔SPI/I²C bridge is not a “wire adapter.” It is a transaction engine that converts a UART byte stream into
verifiable peripheral operations, so remote console/debug stays stable under bursts, slow devices, resets, and noisy links.
What “bridge” means in engineering terms
Stream → Transaction: UART carries frames/bytes; SPI/I²C requires atomic operations (read/write/burst). The bridge must make each command decidable: success or a specific failure.
Back-pressure: When the target is slow (busy, long write/erase, long chain), the bridge must slow the host safely (flow control or credits) instead of dropping bytes or overflowing buffers.
Recovery: After timeout, brown-out, or partial transactions, the bridge must return to a known state using a defined recovery sequence—not “power-cycle and hope.”
Minimum contract (must-have behaviors)
A reliable bridge can be evaluated as a contract. If any item below is missing, remote debug usually degrades into “random stalls.”
Framing + integrity: explicit frame boundaries, length, and CRC so corruption becomes detectable (not silent mis-writes).
Transaction semantics: every command maps to one atomic bus action set, with a clear completion event and error code.
Flow control/back-pressure: hardware flow control (preferred) or protocol-level credits/window with queue watermarks.
Layered timeouts: frame timeout, per-transaction timeout, and end-to-end timeout; failures must be surfaced, not hang forever.
Deterministic recovery path: a state machine that drains/aborts in-flight ops, re-probes the target, and resumes safely.
Observability hooks: counters (retries/NAK/CRC/timeout), queue depth, and “last-N transactions” trace for root-cause evidence.
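The framing + integrity item above can be sketched minimally. The frame layout below (SOF byte, 16-bit length, sequence ID, CRC-16/CCITT trailer) is a hypothetical example, not a layout this contract mandates; byte-stuffing/escaping of SOF inside the payload is deliberately omitted to keep the sketch short.

```python
import struct

SOF = 0x7E
CRC_POLY = 0x1021  # CRC-16/CCITT-FALSE polynomial

def crc16_ccitt(data: bytes, crc: int = 0xFFFF) -> int:
    """Bitwise CRC-16/CCITT-FALSE over the given bytes."""
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            crc = ((crc << 1) ^ CRC_POLY) & 0xFFFF if crc & 0x8000 else (crc << 1) & 0xFFFF
    return crc

def build_frame(seq_id: int, payload: bytes) -> bytes:
    """SOF | len(2) | seq(1) | payload | crc(2): corruption becomes detectable."""
    body = struct.pack(">HB", len(payload), seq_id) + payload
    return bytes([SOF]) + body + struct.pack(">H", crc16_ccitt(body))

def parse_frame(frame: bytes):
    """Return (seq_id, payload) or raise ValueError — a decidable outcome."""
    if len(frame) < 6 or frame[0] != SOF:
        raise ValueError("framing error")
    body, rx_crc = frame[1:-2], struct.unpack(">H", frame[-2:])[0]
    if crc16_ccitt(body) != rx_crc:
        raise ValueError("CRC mismatch")
    length, seq_id = struct.unpack(">HB", body[:3])
    payload = body[3:]
    if len(payload) != length:
        raise ValueError("length mismatch")
    return seq_id, payload
```

The point is the contract, not the layout: a corrupted frame becomes a specific `ValueError`, never a silent mis-write.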
Pass criteria placeholders (fill during implementation)
Examples: “No deadlock in X hours soak”; “P95 transaction latency < X ms under burst load”;
“All failures return error code within X ms and recover automatically.”
In-scope goals (typical targets)
Remote console/debug over constrained links (stable under bursts and slow devices)
Field maintenance access to registers, sensors, EEPROM/flash commands (with verification)
Production fixture port with repeatable scripts and clear pass/fail outcomes
System health monitoring: retries, timeouts, queue pressure, latency percentiles
Out-of-scope (avoid misusing a bridge)
Hard real-time closed loops requiring < X µs jitter-free latency
Phase-aligned bus actions tied to strict sampling windows
Unretryable operations without verification/rollback semantics (risk of irreversible writes)
A bridge is the right tool when the goal is reliable remote access to SPI/I²C devices, not deterministic microsecond control.
This section defines practical use-case patterns and hard “do-not-use” gates.
Common use-case patterns (and what the bridge must provide)
Pattern: Remote console / Field service
Why: stable access for diagnosis, configuration, and safe bring-up without high-bandwidth infrastructure.
Must have: back-pressure (RTS/CTS or credits), layered timeouts, explicit recovery sequence.
Evidence: counters + last-N trace to prove what happened when it stalls.
Pattern: Register access / Debug scripts
Why: a consistent API for reads/writes/bursts across devices and boards.
Safety: write-verify (read-back) and guarded “dangerous write” paths.
Pattern: Timestamped event replay / Logging over UART
Why: capture “what happened right before failure” without a full network stack.
Must have: deep ring buffers, controlled drop policy, and back-pressure to avoid silent truncation.
Evidence: sequence continuity + latency percentiles (P50/P95) for correlation.
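The "deep ring buffers, controlled drop policy, no silent truncation" requirements can be sketched in a few lines. This is a hypothetical host-side sketch: a drop-oldest ring where every eviction is counted and sequence numbers make gaps provable.

```python
from collections import deque

class EventRing:
    """Bounded log ring with a drop-oldest policy and explicit drop
    accounting, so truncation is visible evidence, never silent loss."""
    def __init__(self, depth: int):
        self.buf = deque(maxlen=depth)
        self.next_seq = 0   # monotonically increasing event sequence
        self.dropped = 0    # evidence counter: how many events were lost

    def push(self, event):
        if len(self.buf) == self.buf.maxlen:
            self.dropped += 1  # oldest entry is about to be evicted
        self.buf.append((self.next_seq, event))
        self.next_seq += 1

    def drain(self):
        """Return surviving events plus the oldest surviving sequence
        number: a nonzero gap at the front proves (and sizes) the drops."""
        items = list(self.buf)
        oldest_seq = items[0][0] if items else self.next_seq
        return items, oldest_seq
```

Sequence continuity (the "Evidence" item above) falls out for free: any hole in the sequence numbers is a drop that back-pressure failed to prevent.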
Pattern: Production fixture / Acceptance tests
Why: repeatable, scriptable access with explicit pass/fail codes.
Must have: loopback/BIST, stable protocol versioning, and exportable statistics.
Pass criteria: bounded failures and automatic recovery within X retries.
Hard “don’t use a bridge if…” gates (fail any gate → choose another path)
Determinism gate: The system requires microsecond-level deterministic response (e.g., P95 jitter < X µs). A bridge adds queueing, retries, and flow control, which breaks determinism by design.
Phase alignment gate: Peripheral actions must be phase-aligned to strict sampling windows. Transaction engines ensure correctness, not phase coupling.
Unretryable-link gate: The UART path is loss-prone and retries are not allowed (no idempotency, no verification). Without retry/verify semantics, transaction correctness cannot be guaranteed.
Irreversible-write gate: Operations are irreversible (one-time programming, fuses, sensitive calibration) and cannot be verified. Use a safer dedicated method or enforce strict write-guard + read-back policy first.
Alternatives (keep it simple: select the right class)
Use Extender (physical reach is the main issue)
Choose when distance/common-mode noise dominates but native bus semantics must remain intact. Focus is signal transport, not transaction semantics.
Use Native High-speed Link (USB/Ethernet)
Choose when throughput, concurrency, tooling ecosystem, and long-term maintainability matter more than minimal pin count.
Use Dedicated Debug Port (JTAG/SWD)
Choose when low-level debug access, deterministic control, or silicon bring-up requires a purpose-built interface.
Most common selection mistakes
Treating a bridge like a “wire adapter” and skipping transaction semantics (no CRC, no explicit completion).
Ignoring back-pressure; buffers overflow under bursts and failures become intermittent.
Using a single global timeout; long operations cause false failures, short failures cause hangs.
No observability; without counters and traces, “random stall” cannot be proven or fixed.
Diagram focus: hard determinism gate first; then select bridge/extender/native based on distance/noise and maintainability needs.
Architecture taxonomy
UART↔SPI/I²C bridges differ mainly by where the transaction engine lives and how it implements
buffering, back-pressure, recovery, and observability. This taxonomy helps map requirements to an architecture class
before picking specific parts or firmware designs.
Quick self-check (maps directly to architecture choice)
Throughput target: low (interactive), medium (scripts), high (bulk/batch).
Robustness target: manual recovery acceptable vs. self-healing required.
Control model: simple single ops vs. batching/triggers/versioned capabilities.
Best for: long cables, harsh EMC environments, safety/functional isolation boundaries.
Strengths: improved immunity, controlled reset domains, safer field access.
Weaknesses: added latency and reset-domain complexity; must budget delay and recovery behavior across domains.
Proof: show bounded end-to-end timeout and recovery across power/iso boundaries with traceable counters.
Topology overlay (multi-device scaling without turning stalls into chaos)
Multi-device setups (one UART controlling multiple I²C channels or many SPI chip-selects) are not a new bridge class; they add a
scheduler layer. The main risk is queue head-of-line blocking: one slow target can stall everyone.
Per-target queues: separate queues per I²C channel/device or per SPI CS to isolate slow operations.
In-flight limits: cap outstanding transactions per target and globally (credits/window).
Priority lanes: keep interactive debug commands responsive while bulk transfers run in the background.
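The three mitigations above (per-target queues, in-flight caps, priority lanes) can be combined into one small scheduler. This is an illustrative sketch, not a prescribed design; class and method names are invented.

```python
from collections import deque

class TargetScheduler:
    """Per-target queues with per-target and global in-flight caps plus a
    fast lane, so one slow target cannot head-of-line-block everyone."""
    def __init__(self, per_target_inflight=1, global_inflight=4):
        self.queues = {}    # target -> deque of pending transactions
        self.inflight = {}  # target -> outstanding count
        self.per_cap = per_target_inflight
        self.global_cap = global_inflight
        self.total_inflight = 0

    def submit(self, target, txn, interactive=False):
        q = self.queues.setdefault(target, deque())
        # Fast lane: interactive debug commands jump ahead of bulk work.
        q.appendleft(txn) if interactive else q.append(txn)

    def next_txn(self):
        """Pick a runnable transaction, skipping targets at their cap."""
        if self.total_inflight >= self.global_cap:
            return None
        for target, q in self.queues.items():
            if q and self.inflight.get(target, 0) < self.per_cap:
                self.inflight[target] = self.inflight.get(target, 0) + 1
                self.total_inflight += 1
                return target, q.popleft()
        return None

    def complete(self, target):
        self.inflight[target] -= 1
        self.total_inflight -= 1
```

With a per-target cap of 1, a slow EEPROM write occupies only its own lane; reads queued behind a different target still dispatch.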
Diagram focus: architecture choice is a throughput vs robustness trade, driven by where the transaction engine runs and how recovery/observability are implemented.
Transaction model design
The transaction model is the maintainability core: it defines how a UART command becomes an atomic SPI/I²C operation with
decidable completion, safe retries, and actionable error reporting. This section focuses on how a bridge
expresses bus behaviors (not how SPI/I²C protocols work).
Model goals (hard requirements)
Decidable: every command returns OK or FAIL(code) within X time; no infinite hangs.
Traceable: each command has an ID (seq/txn) to correlate UART logs with bus waveforms and counters.
Recoverable: failures trigger a defined recovery path that returns the bridge to a known-good state.
Command frame fields (layered for clarity)
Identity
seq_id / txn_id to detect duplicates, correlate traces, and support safe retries.
SPI phase shape: cmd / addr / dummy / data are modeled as flags and length classes; CS behavior (toggle/hold) is a policy choice.
I²C sequencing: repeated-start and stop policy are explicit flags; page-write is modeled as a transaction type with a policy timeout class.
Atomic mapping: one UART command maps to one bus transaction sequence with a single completion outcome (OK or FAIL(code)).
Atomicity, retries, and idempotency (avoid “successful but no change”)
Atomicity: a command must either complete and return OK, or fail with a code. Partial completion must be surfaced (e.g., batch step index).
Idempotency strategy: retries must not cause repeated side effects. Use one or more of:
write-verify,
guarded writes (unlock token),
duplicate-detect (same txn_id).
Completion guarantees: define layered timeouts: frame timeout, per-transaction timeout, and end-to-end timeout (all bounded by X).
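The layered-timeout rule can be expressed as nested deadlines: frame < transaction < end-to-end, each bounded. The threshold values below are illustrative placeholders (the document's X), not recommendations, and the clock is injectable so the logic is testable.

```python
import time

class TimeoutBudget:
    """Layered deadlines: frame < per-transaction < end-to-end.
    Every layer is bounded, so a stall surfaces as a named timeout code
    instead of hanging forever. Thresholds are illustrative placeholders."""
    def __init__(self, frame_s=0.05, txn_s=0.5, e2e_s=2.0, now=time.monotonic):
        assert frame_s < txn_s < e2e_s, "layers must nest"
        self.now = now
        self.start = now()  # command's end-to-end clock starts here
        self.frame_s, self.txn_s, self.e2e_s = frame_s, txn_s, e2e_s

    def check(self, layer_start: float, layer: str):
        """Return a timeout code name, or None if within budget."""
        t = self.now()
        if t - self.start > self.e2e_s:
            return "TIMEOUT_E2E"   # outermost bound always wins
        if layer == "txn" and t - layer_start > self.txn_s:
            return "TIMEOUT_TXN"
        if layer == "frame" and t - layer_start > self.frame_s:
            return "TIMEOUT_FRAME"
        return None
```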
Pass criteria placeholders
Examples: “command completes within X ms”; “retry count ≤ X before FAIL(code)”;
“write operations verified within X ms or rejected with guard policy.”
Error code taxonomy (actionable, not vague)
Errors should identify which layer failed and include minimal context. This makes remote debug evidence-based.
Link layer: framing error, CRC mismatch, overflow.
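One way to make the taxonomy actionable is a single code space partitioned by layer. The code names below are drawn from failure modes named in this document (framing, CRC, overflow, NAK, BUSY, timeout, verify); the numbering scheme itself is invented for illustration.

```python
from enum import IntEnum

class ErrCode(IntEnum):
    """One code space, partitioned by layer, so FAIL(code) names the layer
    that failed. The numbering is illustrative; the layering is the point."""
    OK            = 0x00
    # Link layer (0x1x): the UART frame itself was bad.
    LINK_FRAMING  = 0x10
    LINK_CRC      = 0x11
    LINK_OVERFLOW = 0x12
    # Transaction layer (0x2x): the bus operation failed.
    TXN_NAK       = 0x20
    TXN_BUSY      = 0x21
    TXN_TIMEOUT   = 0x22
    TXN_VERIFY    = 0x23   # write-verify read-back mismatch

def layer_of(code: ErrCode) -> str:
    """High nibble identifies the failing layer."""
    return {0x0: "ok", 0x1: "link", 0x2: "transaction"}[code >> 4]
```

A host script can then route remediation by layer (resync framing vs. busy-aware retry) without parsing free-text messages.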
Diagram focus: one frame maps to one atomic bus sequence; flags shape the sequence and policy governs timeouts, retries, and verification.
Firmware/Host Integration
A bridge becomes a reliable tool only when the host stack enforces a stable contract: structured results,
bounded concurrency, versioned configuration, and
minimal safety guardrails. Without this, behavior drifts across scripts, versions, and operators.
Host API contract (stable semantics, not just function names)
Register-oriented calls
read_reg, write_reg, burst_read, burst_write, scan.
Each call must map to a single transaction outcome: OK or FAIL(code) with context.
Required return structure
status (OK/FAIL), code/subcode,
txn_id, latency_ms,
retries, queue_depth,
plus compact context (bus, target, phase).
Diagnostics endpoints
get_stats/clear_stats,
get_version, get_capabilities,
and (optional) get_trace_last_n for field correlation.
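Assuming Python host tooling, the required return structure and one register call might look like the sketch below. Field names follow the contract above; `bridge.exchange` and its reply dict are hypothetical transport details, not a real API.

```python
from dataclasses import dataclass, field

@dataclass
class TxnResult:
    """The required return structure as one record: every call yields this,
    never a bare bool (field names follow the contract; shapes are illustrative)."""
    status: str            # "OK" or "FAIL"
    code: int              # error code/subcode; 0 on success
    txn_id: int
    latency_ms: float
    retries: int = 0
    queue_depth: int = 0
    context: dict = field(default_factory=dict)  # bus, target, phase

    @property
    def ok(self) -> bool:
        return self.status == "OK"

def read_reg(bridge, target: int, reg: int) -> TxnResult:
    """Thin wrapper: a raw transport exchange becomes a structured outcome."""
    raw = bridge.exchange(target, reg)  # hypothetical transport call
    return TxnResult(status="OK" if raw["err"] == 0 else "FAIL",
                     code=raw["err"], txn_id=raw["txn_id"],
                     latency_ms=raw["t_ms"],
                     context={"bus": "i2c", "target": target, "phase": "read"})
```

Because every call returns the same record, field logs, automation, and pass/fail tooling consume one shape across scripts and versions.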
Sync vs async (controlled concurrency, not uncontrolled parallelism)
Asynchronous submission improves throughput only when in-flight work is bounded.
The host must obey bridge-advertised limits and maintain deterministic matching of responses.
Futures/promises: each submitted transaction returns a handle keyed by txn_id.
max_inflight: global bound from capability exchange (host must throttle).
per-target inflight: prevents a slow device from dominating queueing (reduces HOL).
window/credits: inflight budget aligns with credit return and watermarks to keep completion bounded.
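The bounded-concurrency rules above reduce to a small window object: one Future per txn_id, submissions refused past the advertised limit, and completions matched deterministically by txn_id. A hypothetical single-threaded sketch:

```python
from concurrent.futures import Future

class InflightWindow:
    """Credit-style submission: hands out a Future per txn_id and refuses
    submissions beyond max_inflight (host must throttle, per the contract)."""
    def __init__(self, max_inflight: int):
        self.max_inflight = max_inflight  # from capability exchange
        self.pending = {}                 # txn_id -> Future

    def submit(self, txn_id: int) -> Future:
        if len(self.pending) >= self.max_inflight:
            raise RuntimeError("inflight window full: throttle")
        fut = Future()
        self.pending[txn_id] = fut
        return fut

    def complete(self, txn_id: int, result):
        """Deterministic matching: the reply's txn_id selects the Future,
        so out-of-order completions cannot cross-wire responses."""
        self.pending.pop(txn_id).set_result(result)
```

Completing a transaction returns a credit (slot) implicitly: the next `submit` succeeds again.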
Configuration determinism (versioned profiles)
Behavior drift is usually configuration drift. A bridge should expose a profile as a versioned, hashable contract
so that field logs and lab reproductions use the same settings.
Profile fields: SPI mode/bit order/dummy class/CS gap class; I²C speed class/stop policy; common endianness/address-width classes.
Profile identity: expose profile_id and config_hash with each response.
Guarded write: unlock token + time window (valid for X s), then auto-lock.
Audit fields: dangerous-write counter + last target + result code for traceability.
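The profile-identity idea is simple to implement: canonicalize the profile, then hash it, so the same settings always yield the same config_hash regardless of host, key order, or release. A minimal sketch (the field names and hash truncation are illustrative):

```python
import hashlib
import json

def profile_hash(profile: dict) -> str:
    """Canonicalize (sorted keys, fixed separators) then hash, so identical
    settings yield an identical config_hash across hosts and releases."""
    canon = json.dumps(profile, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canon.encode()).hexdigest()[:16]  # short id for logs
```

Stamping this hash into every response is what lets a lab reproduction prove it ran under the same settings as the field log.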
Pass criteria placeholders
Examples: “capability exchange completes within X ms”; “profile hash included in 100% of responses”;
“max_inflight enforced; host never exceeds X in-flight transactions”.
Diagram focus: capability/version negotiation + profile hashing prevent configuration drift; structured results make automation and field correlation reliable.
Robustness & Recovery
Field readiness requires bounded failure modes and deterministic recovery. The bridge must convert “hung bus” and “partial progress”
into explicit FAIL results and a state-machine recovery path that returns to a known-good profile.
Common failure patterns (layered taxonomy)
Link/frame level: CRC failures, framing loss, RX overflow → resync framing and clear stale bytes.
Verify: re-probe target(s) and confirm health before resuming traffic.
SPI recovery flow (bridge-level)
Trigger: verify failures, repeated timeouts, or no-progress counter under stable SCLK.
Sequence: CS reset pattern → re-align via known safe read (e.g., ID/status read).
Verify: read-back check passes under the active profile (mode/dummy class).
Escalation rule
If recovery fails X times within Y minutes, escalate to a stronger reset class
(bus reset → bridge core reset → full system reset) and record the last stage reached.
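The escalation rule above is naturally a small state machine: count recovery failures inside a sliding window, climb one reset class when the window overflows, and record the last stage reached as evidence. A hypothetical sketch with injectable timestamps:

```python
class RecoveryLadder:
    """Escalation as a state machine: X failures within Y seconds climb one
    reset class; the last stage reached is recorded as evidence."""
    STAGES = ["bus_reset", "bridge_core_reset", "full_system_reset"]

    def __init__(self, max_failures=3, window_s=60.0):
        self.max_failures, self.window_s = max_failures, window_s
        self.failures = []            # timestamps of recent recovery failures
        self.stage = 0                # index of the next reset class to try
        self.last_stage_reached = None

    def recovery_failed(self, now: float) -> str:
        """Record a failed recovery attempt; escalate when the window fills."""
        self.failures = [t for t in self.failures if now - t < self.window_s]
        self.failures.append(now)
        if len(self.failures) >= self.max_failures and self.stage < len(self.STAGES):
            self.failures.clear()
            action = self.STAGES[self.stage]
            self.stage += 1
            self.last_stage_reached = action  # evidence for the field log
            return action
        return "retry_recovery"
```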
Watchdog and transaction consistency (never “looks OK but did not commit”)
Watchdog trigger: no completion for X ms, queue not draining, or state stuck.
Consistency rule: every write is either FAIL(code) or verified by read-back (write-verify class).
Recovery to known-good: reload profile + re-run capability exchange after brown-out/reset.
Pass criteria placeholders
Examples: “recovery returns to Normal within X ms”; “no silent drops; every timeout returns FAIL(code) within X ms”;
“watchdog stage and last reason always logged”.
Diagram focus: recovery is a state machine with bounded timeouts, escalation, and evidence logging; every failure returns FAIL(code) rather than silent stalls.
Hardware Design Hooks
Bridge reliability depends on a minimal, repeatable hardware stack: a clean port boundary,
short protection return paths, correct level/PHY mapping,
and a deliberate choice of single-ended vs differential vs isolation for long or noisy environments.
This section stays bridge-specific and avoids turning into a generic hardware course.
Voltage domains and reference domains (map first, then protect)
UART side: often LVCMOS (1.8–5 V) on short runs, or already a PHY layer (RS-232/RS-485) for field cabling.
Bus side: I²C/SPI are typically short, local logic nets; treat them as “inside the enclosure”.
Rule: bring port energy to ground locally (short return) before level shifting and before the bridge core.
Port protection stack (purpose + placement)
Low-capacitance ESD
Best for high-edge-rate lines where extra capacitance worsens thresholds. Place closest to the connector with a short, controlled return path.
Series-R / RC damping
Used when overshoot/ringing drives false edges or framing noise. Place near the receiver/PHY/bridge pins to slow edges and reduce reflections.
Over-voltage clamps
Applied when cable faults or miswiring can exceed IO rails. Place at the port boundary, and ensure clamp currents do not inject noise into logic reference.
CMTI headroom: common-mode transients can look like false edges and frame corruption; observe CRC/framing counters when switching loads.
Reset domains: brown-out and partial reset must re-run capability exchange and reload a known-good profile before resuming traffic.
Isolation component selection details belong in the dedicated Isolation section; this chapter focuses on bridge-level hooks.
Long-line decision gate (when not to “force UART”)
Single-ended UART over long cables becomes fragile when common-mode motion dominates. Upgrade strategy should be explicit instead of “increase retries”.
Symptom-driven trigger: framing/CRC bursts correlate with load switching, motors, relays, or ESD events.
Cable sensitivity: error rate changes sharply with cable length/route/grounding changes.
Upgrade paths: differential PHY (e.g., RS-485), isolation boundary, or a purpose-built field link.
Verification hooks (minimal acceptance checks)
Port boundary: protection parts placed at the connector with short return paths and no “long injection loop”.
Domain sanity: verify IO rails and clamp behavior prevent ghost-powering during partial power states.
Brown-out recovery: after brown-out/reset, capability exchange and profile reload complete before any bus operations resume.
Pass criteria placeholders
Examples: “framing error rate < X/hour on the target cable”; “brown-out recovery < X ms”;
“no latch-up or ghost-powering under the defined fault conditions”.
Diagram focus: a repeatable port boundary stack. Protection is placed at the connector; isolation is optional but requires latency/reset-domain hooks.
Debug & Observability
Debugging becomes deterministic when every transaction carries evidence: queue state, error counters,
and latency distributions. The goal is to correlate UART frames with SPI/I²C operations and to capture the last N transactions on triggers.
Must-have counters and distributions (minimum set)
Goal: every spike in counters corresponds to concrete transaction evidence and a reproducible target.
Correlating UART frames to SPI/I²C operations (time alignment)
Anchor: txn_id appears in both host logs and bridge trace.
Bus trigger: align to CS falling edge (SPI) or START condition (I²C).
Outcome: separate queueing delay from bus time to avoid “mystery slow” conclusions.
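Separating queueing delay from bus time is arithmetic once a trace record carries the right timestamps. The document already names bus_start/bus_end; the submit/done field names are assumed for illustration.

```python
def decompose_latency(trace: dict) -> dict:
    """Split end-to-end latency into queueing delay, bus time, and reply
    delivery, using four per-transaction timestamps (in ms). The
    submit/done names are assumed; bus_start/bus_end come from the trace."""
    queueing = trace["bus_start"] - trace["submit"]   # time spent waiting
    bus_time = trace["bus_end"] - trace["bus_start"]  # actual SPI/I²C time
    reply    = trace["done"] - trace["bus_end"]       # completion delivery
    return {"queueing_ms": queueing, "bus_ms": bus_time, "reply_ms": reply,
            "e2e_ms": trace["done"] - trace["submit"]}
```

A "mystery slow" transaction with 4 ms queueing and 1.5 ms bus time is a scheduling problem, not a bus problem; without this split the two are indistinguishable.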
Production test hooks (minimal set)
Loopback: UART self-check to validate the host/PHY path before touching bus devices.
Golden command set: a fixed sequence of reads/writes with deterministic pass/fail criteria.
Record: counters + P95 latency at test exit to catch marginal stations early.
Pass criteria placeholders
Examples: “P95 end-to-end latency < X ms under the golden set”; “timeout_count = 0 in X transactions”;
“dump-on-trigger produces the last N records with txn_id and profile hash”.
Diagram focus: six instrumentation points provide a complete evidence chain to correlate UART framing, queueing behavior, bus timing, and response delivery.
This gate-style checklist turns a UART↔SPI/I²C bridge from “works on a bench” into a fieldable, testable, and regression-safe tool.
Each gate freezes a contract, requires evidence, and sets pass criteria placeholders (X) to prevent silent behavior drift across firmware, scripts, and stations.
Design Gate — Freeze the behavior contract
Transaction model frozen: frame fields, atomic completion semantics, idempotency rules, and commit/abort behavior. Evidence: protocol version + frame examples + compatibility statement.
Example reference implementations (material numbers)
UART→I²C bridge IC: NXP SC18IM704 (UART host to I²C controller bridge; command-to-transaction conversion).
MCU bridge (UART→SPI / UART→I²C): TI MSPM0G3507 (SDK examples exist for UART→SPI and UART→I²C bridge-style packet translation).
Low-cost UART→SPI conversion (MCU): TI MSP430FR2000 (UART-to-SPI bridge app-note-class implementations exist).
MCU UART→I²C bridge example family: Microchip PIC16F15244 (UART↔I²C bridge example projects exist; family variants may apply).
Note: verify speed, FIFO/DMAs, packages (QFN/TSSOP), and temperature grades per project needs.
Bring-up Gate — Measure baseline and inject faults
Baseline performance captured: throughput + end-to-end latency P50/P95 under a defined profile (baud rate, framing, SPI mode, I²C speed). Evidence: one-page benchmark report.
Timeout stack validated: frame timeout / transaction timeout / end-to-end timeout are consistent and map to a stable error code. Evidence: threshold table + triggered dump of last N transactions.
No head-of-line collapse: long/slow transactions do not starve interactive commands (fast lane or per-target queues). Evidence: P95 stays < X ms during long-transaction injection.
Recovery is bounded: every failure mode reaches “resume” or escalates to a defined reset in < X ms. Evidence: watchdog stage count + last_reason.
Bring-up target set (material numbers)
I²C long-line stress (dI²C): NXP PCA9615 (differential I²C buffer) or ADI LTC4331 (rugged differential I²C extender).
I²C isolation sanity: ADI ADuM1250 (bidirectional I²C isolator class).
SPI isolation sanity: TI ISO7741 family (digital isolator for SPI signal directions; verify channel-direction needs).
Production Gate — Lock versions and automate evidence
Stats export enabled: queue high-water hits, timeouts, retries/NAKs, CRC/framing errors, throughput, and latency P95. Evidence: station log captures a report artifact per unit.
Version & profile locked: protocol version + profile hash + capability hash are traceable and immutable per release. Evidence: printed in every station report.
Golden command set: deterministic read/write/burst/scan + fault-trigger cases are executed every build. Evidence: pass/fail + counter deltas are archived.
These deployments focus on bridge mechanics (buffering, back-pressure, timeouts, recovery, observability, guarded writes). They intentionally avoid general UART/I²C/SPI “application encyclopedias”.
Minimum bridge hooks: RTS/CTS back-pressure (preferred), a fast lane for interactive commands, triggered dump of the last N transactions on timeout/queue-full, stable profile hash + capability exchange, and bounded recovery.
Typical building blocks (material numbers):
MSPM0G3507 (MCU bridge core option; UART→SPI/UART→I²C style implementations), or NXP SC18IM704 (UART→I²C bridge IC for register access).
I²C multi-drop maintenance routing (if required): TI TCA9548A / NXP PCA9548A (I²C mux; use as a channel gate, not as “more concurrency”).
Isolation where ground potential differences exist: ADI ADuM1250 (I²C isolation class), TI ISO7741 (SPI signal isolation class; confirm channel direction).
Debug over Constrained Links (low-pin-count expansion)
Goal: provide a stable read_reg/write_reg/burst tool surface over a slow UART without turning scripts into timing guesses.
Minimum bridge hooks: batching/coalesce for small transactions, credit/window control, an idempotent write strategy (write-verify for critical params), and timeout classes per target behavior (slow EEPROM page write, I²C clock stretching, SPI flash dummy changes).
Typical building blocks (material numbers):
MCU bridge: TI MSPM0G3507 (packet→transaction bridge class), or Microchip PIC16F15244 (UART↔I²C bridge example family).
Dedicated UART→I²C bridge IC for register access: NXP SC18IM704.
Long noisy runs (upgrade path): NXP PCA9615 (dI²C) or ADI LTC4331 (rugged differential I²C extender).
Multi-drop Maintenance Port (UART → multi-channel I²C mux)
Minimum bridge hooks: per-target queueing (avoid one slow channel stalling the whole system), scan + capability reporting per channel, bounded recovery per branch, and strict max-inflight/credit limits advertised to the host.
Typical building blocks (material numbers):
I²C mux: TI TCA9548A or NXP PCA9548A.
Bridge core: TI MSPM0G3507 (MCU bridge) or NXP SC18IM704 (UART→I²C bridge IC).
Operational rule: treat mux channels as fault containment boundaries; on repeated timeouts, isolate the failing branch and keep the console responsive.
Production Tooling (register R/W, barcode, calibration parameters)
Minimum bridge hooks: guarded writes (unlock window), mandatory write-verify for nonvolatile payloads, immutable version+profile hash in station reports, and golden command set regression at every station.
Typical building blocks (material numbers):
UART→I²C: NXP SC18IM704 (compact transaction conversion for I²C peripherals).
UART→SPI (MCU bridge): TI MSPM0G3507 or TI MSP430FR2000 class implementations (validate SPI mode/dummy policies in the profile).
Station pass criteria placeholders: P95 < X ms; recover < X ms; no deadlock over X hours; write-verify mismatch = 0; audit counters in-range.
These FAQs close long-tail failure modes without expanding new chapters. Each answer is a measurable, evidence-first checklist:
Likely cause → Quick check → Fix → Pass criteria (threshold X placeholders).
UART is stable at low baud, but at high baud it shows massive retries/timeouts — prove baud error first or queue watermark policy first?
Likely cause: (1) combined clock/baud error causes framing/CRC bursts; (2) RX/transaction queues hit high-water, triggering credit stalls and cascading timeouts.
Quick check: compare framing_err/crc_err vs queue_depth, high_water_hits, credit_stall_time, and latency_p95; confirm measured baud vs configured baud with a logic analyzer at the UART pins.
Fix: lock UART clock source (XTAL/PLL), enable RTS/CTS and verify polarity, raise RX ring + transaction queue depth, tune high/low watermarks, and isolate long transactions into a separate queue (“fast lane” for interactive commands).
Pass criteria: framing_err = 0 over X minutes; timeouts ≤ X/hour; latency_p95 ≤ X ms at the target baud under the golden command set.
I²C reads occasionally NAK, but the scope looks “OK” — check bridge transaction timeout first or target busy/page-write first?
Likely cause: (1) target is legitimately busy (EEPROM page write / internal erase), returning NAK; (2) bridge uses an overly aggressive transaction timeout that converts slow ACK into “NAK/timeout”.
Quick check: log per-target NAK_count, timeouts, and latency_p95; correlate NAK bursts with “write just happened” moments via txn_id; if NAK coincides with busy windows, it is not a signal-integrity proof.
Fix: add a “busy-aware” retry class (backoff + max retries X), separate page-write flows from normal reads, and ensure the bridge exposes a distinct error code for BUSY vs NAK vs TIMEOUT.
Pass criteria: NAK bursts are bounded and classified (BUSY vs NAK); timeouts ≤ X/hour; page-write sequences complete within X ms with deterministic retry limits.
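The busy-aware retry class from the Fix above can be sketched as follows. The `classify` callback stands in for the bridge's distinct BUSY/NAK/TIMEOUT codes; the backoff delay between BUSY attempts is noted but omitted so the sketch stays testable.

```python
def busy_aware_retry(op, classify, max_retries=4):
    """Retry only BUSY outcomes; NAK and TIMEOUT surface immediately as
    distinct failures instead of being retried into ambiguity (sketch).
    A real bridge inserts a backoff delay between BUSY attempts."""
    for attempt in range(max_retries + 1):
        result = op()
        kind = classify(result)
        if kind == "OK":
            return result, attempt
        if kind in ("NAK", "TIMEOUT"):
            raise RuntimeError(f"FAIL({kind}) at attempt {attempt}")
        # kind == "BUSY": a legitimate busy window (e.g., EEPROM page write)
    raise RuntimeError(f"FAIL(BUSY) after {max_retries} retries")
```

Because only BUSY is retried, the NAK counter stays meaningful: a NAK burst is evidence, not noise absorbed by a generic retry loop.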
SPI write returns “success” but the register did not change — check CS timing/hold first or bridge commit/busy semantics first?
Likely cause: (1) chip-select (CS) timing/phase violates target expectations (mode, setup/hold, inter-byte gaps); (2) the bridge reports “accepted” instead of “committed” (queued but not executed, or dropped on queue/full).
Quick check: require a read-back verify on the same txn_id; inspect queue_depth at submit/commit; confirm SPI profile (mode, dummy, cs_gap) via profile_hash logged in the response.
Fix: define “success = executed + optional verify” (not “queued”); add explicit BUSY/QUEUE_FULL responses; lock SPI mode/dummy policies in a versioned profile; for fragile targets, enforce CS reset sequence before verify.
Pass criteria: write-verify mismatch = 0; “queued vs committed” states are observable; SPI profile remains stable (profile_hash constant) across resets/releases.
Throughput collapses with many small transactions — do batching first or increase in-flight depth first?
Likely cause: (1) per-transaction overhead (framing, turnaround, completion reporting) dominates small payloads; (2) the in-flight window is too small, so the link idles between transactions.
Quick check: measure latency_p95 per transaction and compute “payload/overhead” ratio; check inflight and credit_stall_time; if bus time is small vs total time, batching wins first.
Fix: implement command batching/coalesce (burst reads/writes), add a fixed maximum in-flight window (credit-based), and prioritize interactive commands over bulk transfers.
Pass criteria: throughput improves by ≥ X% on the small-transaction mix; credit_stall_time ≤ X% of runtime; interactive command P95 ≤ X ms under bulk load.
The queue occasionally deadlocks and only a reboot recovers — is the reply path blocked or is head-of-line blocking not isolating long transactions?
Likely cause: (1) TX path is back-pressured (credits never returned / CTS stuck), preventing completion notifications; (2) a long transaction blocks the front of a shared queue (HOL), starving short operations.
Quick check: look for no_progress_counter, tx_backlog, credit_stall_time, and a flat-lined bus_start/bus_end timestamp stream; inspect whether one txn_id stays inflight far longer than others.
Fix: separate queues by class (interactive vs bulk vs slow targets), add watchdog “drain/reset” stages for stuck inflight, and force bounded completion reporting (timeout → fail code → resume).
Pass criteria: no deadlock over X hours; recovery from “stuck inflight” completes in < X ms; P95 for interactive commands remains ≤ X ms under slow-target injection.
Field long cable increases framing errors — prove common-mode injection first or UART sampling-point drift (oversampling/filtering) first?
Likely cause: (1) common-mode noise/ground shift moves the UART threshold, creating burst framing errors; (2) sampling margin is too small at the chosen baud (clock mismatch + edge degradation), and oversampling/filter settings are insufficient.
Quick check: track framing_err vs environmental events (motors, relays) and vs baud changes; if errors drop sharply when baud is reduced, sampling margin dominates; if errors correlate with load/ground events, common-mode dominates.
Fix: upgrade the physical layer for long runs (differential/isolated transport), add input protection/edge control, and tighten UART clock accuracy; if staying single-ended, enforce conservative baud and robust filtering.
Pass criteria: framing_err ≤ X/hour under defined cable/noise conditions; no error bursts during worst-case field events; stability holds at the target baud.
After reset, the first transaction always fails — check power-up defaults (SPI mode/I²C speed) first or bridge state not cleared first?
Likely cause: (1) target peripherals need a post-reset settle window (busy/boot time), but the host sends immediately; (2) bridge reuses stale profile/state (SPI mode, dummy policy, I²C speed) until explicitly re-initialized.
Quick check: log boot_ts vs first txn_id, and compare profile_hash/capability_hash before and after reset; if the first failure disappears when delaying X ms, target settle dominates.
Fix: enforce a deterministic reset sequence: capability exchange → profile load → probe (read ID) → enable normal traffic; add a post-reset guard delay (X ms) and clear all queues/state on reset.
Pass criteria: first-transaction failure rate = 0 across X resets; profile_hash is stable and matches expected; probe succeeds within X ms after reset.
After transaction retries, the device enters an abnormal state — how to design idempotent writes and write-verify?
Likely cause: retries re-apply non-idempotent writes (stateful commands, counters, partial commits), or the bridge cannot distinguish “executed but reply lost” from “not executed”.
Quick check: identify commands that are not safe to retry; check whether retries share the same txn_id semantics (dedupe vs re-execute); confirm post-write state using a read-back verify on a known invariant register/marker.
Fix: classify operations: idempotent (safe retry) vs non-idempotent (no retry; require guarded sequence); use write-verify for critical writes; add a “commit token” or sequence counter to detect partial/duplicate commits.
Pass criteria: retry does not change device state unexpectedly (verified by read-back); write-verify mismatch = 0; non-idempotent ops are blocked unless explicitly unlocked.
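The two Fix ingredients above (dedupe by txn_id for the "executed but reply lost" case, plus write-verify before reporting OK) combine into one guarded write path. A hypothetical host-side sketch; the bus object and its read/write methods are stand-ins:

```python
def guarded_write(bus, target, reg, value, txn_id, seen_txns, max_retries=2):
    """Idempotent write: a duplicate txn_id is deduped instead of re-executed
    (covers 'executed but reply lost'), and every attempt must pass a
    read-back verify before OK is reported."""
    if txn_id in seen_txns:                 # duplicate submission: no side effect
        return "OK_DUPLICATE"
    for _ in range(max_retries + 1):
        bus.write(target, reg, value)
        if bus.read(target, reg) == value:  # write-verify class
            seen_txns.add(txn_id)
            return "OK"
    return "FAIL(VERIFY)"
```

Note the commit marker (`seen_txns`) is only set after verify passes, so a retry of a half-applied write re-executes, while a retry of a committed write is a no-op.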
RTS/CTS is wired but packets still drop — check flow-control polarity/timing first or bridge RX ring depth first?
Likely cause: (1) RTS/CTS polarity or assertion timing is wrong (CTS asserted too late); (2) RX ring is too shallow for bursty traffic and overflows before back-pressure takes effect.
Quick check: confirm rx_overflow events; monitor CTS line vs RX bursts with a logic analyzer; compare drop moments with queue_depth and high_water_hits to see whether CTS reacts before overflow.
Fix: validate RTS/CTS configuration end-to-end (polarity, enable, timing), assert CTS earlier (lower watermark), increase RX ring and parser buffering, and consider credit-based acknowledgements for higher-layer bursts.
Pass criteria: rx_overflow = 0 under worst-case bursts; queue_high_water_hits do not cause drops; sustained throughput meets target with stable latency P95.
Concurrent multi-device access sometimes “cross-writes” — check command correlation ID first or out-of-order reply handling first?
Likely cause: (1) missing/ambiguous txn_id mapping causes host to associate responses with the wrong request; (2) the bridge executes out-of-order completions but the host assumes in-order replies, or per-target routing is not isolated.
Quick check: enable forced out-of-order tests (two targets, different response times) and verify every reply carries txn_id, target_id, and profile_hash; check whether any response lacks these fields or is duplicated.
Fix: enforce strict request/response correlation (txn_id required), define the ordering contract (in-order only or out-of-order allowed), isolate per-target queues, and add dedupe protection for repeated txn_id submissions.
Pass criteria: cross-write rate = 0 across X concurrency stress runs; every reply includes txn_id/target_id; ordering behavior is deterministic and regression-tested.