Bus Health & Stats for I2C, SPI, and UART
Turn I²C/SPI/UART problems into numbers: define a metric dictionary, log with a fixed schema, and use percentiles + per-endpoint buckets to drive dashboards, alerts, and pass/fail gates.
The goal is actionable evidence—retries/NAKs/CRC, throughput, tail latency, and recovery success—measured with consistent units and windows so bring-up, field debugging, and production decisions stay comparable.
Scope & Outputs (What this page owns)
Bus health is measurable. This page defines a consistent observability system to track reliability, performance, and recoverability across I²C, SPI, and UART, and then turns those numbers into actionable guardrails for bring-up, production, and field reliability.
What “health” means (definition used throughout this page)
- Reliability: low error rates (retries / NAKs / CRC / framing), and predictable behavior under stress (temperature, load, hot-plug).
- Performance: sustained throughput and stable utilization without “hidden” costs (e.g., retries masking the true payload rate).
- Recoverability: fast, repeatable recovery from stalls and lockups (timeouts, bus reset, re-init), with measurable success rate and time-to-recover.
Deliverables (copy-paste outputs for engineering teams)
- Metrics Dictionary — a single source of truth for every counter, rate, percentile, window, and unit.
  Minimum fields: metric_id, name, type, numerator, denominator, unit, window, sampling_point, tags, interpretation, scope_guard_link, pass_criteria
- Logging Schema — structured, low-overhead event records that survive production and field constraints.
  Default policy: log metadata (timing, sizes, outcomes, context) and avoid payload by default to reduce volume and risk.
- Dashboard Layout — consistent overview + drilldown so different devices/projects stay comparable.
  Required widgets: Error rate, p95/p99 latency, Throughput, Recovery success, plus "Top endpoints" and "Notable events".
- Alert Rules — thresholds + trends + anomalies, tied to clear actions (not just alarms).
  - Absolute: "CRC fail rate > X per 1M frames"
  - Trend: "NAK rate slope > X/min for Y minutes"
  - Anomaly: "Endpoint deviates from its baseline by > Zσ"
- Bring-up → Production Checklist — measurable pass criteria for lab validation, stress, and manufacturing gates.
  Every checklist item should end with: Pass criteria (X/Y/Z placeholders) so teams can validate objectively.
Scope guard (anti-overlap rule)
This page owns
- Metric definitions: units, denominators, windows, percentiles, tags, and interpretation.
- Instrumentation points and timebase rules for consistent latency and throughput measurement.
- Dashboard + alert patterns that turn numbers into actions (triage, regression detection, gates).
This page does NOT own
- Protocol fundamentals and tutorials (I²C timing, SPI mode walkthroughs, UART baud math).
- Detailed timing derivations and electrical/SI deep dives (termination, return paths, eye shaping).
- Long troubleshooting playbooks beyond metric-driven triage (only short “symptom→metric→link” pointers belong here).
Metrics Taxonomy (Define the metric dictionary)
Metrics only help when definitions are consistent. Every metric must specify its unit, numerator/denominator, time window, sampling point, and tags—otherwise different teams will measure different realities and cannot compare results.
Core rules (keep metrics comparable across projects)
- Rate beats raw counts: prefer “per 1k transactions / per 1M frames / per minute” so results scale across workloads.
- Percentiles beat averages: use p50/p95/p99 to expose tail latency and rare stalls hidden by mean values.
- Tags are mandatory: metrics without endpoint + context tags cannot drive root-cause isolation.
- Sampling point must be named: driver-level vs DMA-level vs application-level counters can disagree unless explicitly defined.
Metric dictionary template (recommended fields)
- metric_id: stable identifier for dashboards, alerts, and regression comparisons.
- type: error / performance / latency / stability.
- numerator & denominator: explicit “what is counted” and “what it is normalized by”.
- unit: %, count, microseconds, bytes per second.
- window: 1s / 10s / 60s / 5min (choose by use case).
- sampling_point: driver / ISR / DMA completion / bridge / application boundary.
- tags: bus_type, instance, endpoint_id, op_type, fw_ver, board_rev, temp_bin, power_state, load_bin.
- interpretation: what “high” or “spiky” typically suggests (in categories), and where to drill down.
- pass_criteria: X/Y/Z thresholds for bring-up, production gate, and field alerts.
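As an illustration, one dictionary entry can be sketched as a plain record plus a normalization helper. The field names follow the template above; the metric_id, tag set, and thresholds are placeholder values, not recommendations:

```python
# Hypothetical metric-dictionary entry following the template above.
# All pass_criteria values are placeholders (X/Y/Z), not recommendations.
I2C_NAK_RATE = {
    "metric_id": "i2c.nak_rate",
    "name": "I2C NAK rate",
    "type": "error",
    "numerator": "nak_count",
    "denominator": "addressed_txn_count",
    "unit": "per_1k_txn",
    "window": "10s",
    "sampling_point": "driver",
    "tags": ["bus_type", "instance", "endpoint_id", "op_type", "fw_ver", "temp_bin"],
    "interpretation": "spikes suggest device busy, addressing conflict, or timeout policy mismatch",
    "pass_criteria": {"bring_up": "<= X/1k", "production": "<= Y/1k", "field_alert": "> Z/1k"},
}

def rate_per_1k(numerator: int, denominator: int) -> float:
    """Normalize a raw count into a per-1k rate; zero traffic yields 0, not a division error."""
    return 1000.0 * numerator / denominator if denominator else 0.0
```

The helper makes the "rate beats raw counts" rule mechanical: every error counter passes through the same normalization before it reaches a dashboard.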
Taxonomy (four metric families)
Error metrics
- Retry rate (retries / transactions)
- NAK rate (NAKs / address transactions)
- CRC fail rate (CRC fails / frames)
- UART framing/parity/overrun rate (errors / received frames)
Performance metrics
- Throughput (payload bytes / time) with clear “payload vs line” definition
- Utilization (busy time / window)
- Inter-frame gap and queue depth percentiles (p50/p95)
Latency metrics
- Transaction latency p50/p95/p99 (microseconds)
- Queue wait vs service time separation (to isolate “scheduling” vs “bus/device”)
- Max stall time and timeout rate (rare events that dominate user experience)
Stability metrics
- Bus reset count / hour (how often recovery is needed)
- Bus stuck duration (total + worst-case)
- Recovery success rate and time-to-recover (TTR) p95/p99
Scope guard (keep this chapter clean)
This chapter defines how metrics are named, normalized, windowed, and tagged. It does not expand into protocol math, electrical details, or waveform tutorials. When a metric hints at a root cause, it should be expressed as a short category pointer and routed to the correct owner page.
Event Model & Logging Schema (Make stats durable)
Durable stats require durable events. A consistent event model makes logs comparable across I²C, SPI, and UART, across devices, and across time—so dashboards, alerts, and regression checks can share the same language.
Choose event granularity (3 durable levels)
Per-transaction event
Best for root-cause isolation and top-endpoint ranking. Use sampling or rate limits if volume is high.
Per-burst / per-batch event
Best for high-throughput transfers (DMA bursts, long frames). Preserve aggregate outcomes without logging every transaction.
Per-second (or per-window) rollup
Best for dashboards and field telemetry. Store counts, rates, percentiles, and worst-case stalls to enable trends and alerts.
Recommended structured log fields (grouped for consistency)
Identity
bus_type, bus_instance, direction, endpoint_id, op_type
Endpoint conventions: addr0x50 (I²C), cs2 (SPI), uart3 (UART).
Timing
t_start, t_end, latency_us, timeout_ms, clock_domain_id
Store both t_start and t_end (not only derived latency) to support correlation and auditing.
Result
result, err_code, retries, crc_ok, sampled
Keep result (OK/RETRY/FAIL) separate from err_code (NAK/TIMEOUT/CRC/OVERRUN) for clean aggregation.
Context
fw_ver, board_id, power_state, temp_bin, firmware_state, load_bin
Prefer binned tags (e.g., temp_bin) to reduce volume while preserving drilldown power.
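A minimal sketch of binned tagging (the bin edges below are illustrative; real projects should pick edges that match their thermal and load envelopes):

```python
def temp_bin(temp_c: float) -> str:
    """Map a raw temperature to a coarse bin tag (edges are illustrative)."""
    edges = [(-20, "cold"), (0, "cool"), (45, "nominal"), (70, "warm")]
    for upper, label in edges:
        if temp_c < upper:
            return label
    return "hot"

def load_bin(cpu_load_pct: float) -> str:
    """Bucket CPU load into quartile-style bins for tagging."""
    if cpu_load_pct < 25:
        return "low"
    if cpu_load_pct < 50:
        return "mid"
    if cpu_load_pct < 75:
        return "high"
    return "saturated"
```

Binning at log time keeps tag cardinality bounded, which is what makes per-context drilldown cheap later.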
Rollup records (dashboards and field telemetry)
Minimal window fields: window_s, txn_count, err_count, p95_latency_us, throughput_Bps
Recommended additions for durability: err_rate, p50_latency_us, p99_latency_us, stall_max_us, endpoint_topN, recovery_attempts, recovery_success
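The rollup record above can be sketched as a small builder. The nearest-rank percentile here is a simplification; production code would typically reconstruct percentiles from histogram bins instead of sorting raw samples:

```python
def percentile(sorted_vals, p):
    """Nearest-rank percentile on a pre-sorted list (simple and deterministic)."""
    if not sorted_vals:
        return 0
    k = max(0, min(len(sorted_vals) - 1, int(round(p / 100.0 * (len(sorted_vals) - 1)))))
    return sorted_vals[k]

def make_rollup(window_s, latencies_us, err_count, payload_bytes):
    """Build one per-window rollup record with the minimal fields listed above."""
    lat = sorted(latencies_us)
    txn = len(lat)
    return {
        "window_s": window_s,
        "txn_count": txn,
        "err_count": err_count,
        "err_rate": err_count / txn if txn else 0.0,
        "p50_latency_us": percentile(lat, 50),
        "p95_latency_us": percentile(lat, 95),
        "p99_latency_us": percentile(lat, 99),
        "throughput_Bps": payload_bytes / window_s,
    }
```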
Scope guard: this chapter defines event structure and normalization fields; protocol and electrical root causes should be routed to the corresponding owner pages.
Instrumentation Points & Timebases (Where to measure)
Accurate health metrics depend on consistent measurement points and a trustworthy timebase. This chapter defines where to tap signals and counters, how to record time, and how to keep overhead under control in production and field deployments.
Tap points (what each point can validate)
Firmware counters
- Driver boundaries: request start/end, queue entry/exit.
- ISR timestamps: interrupt arrival and servicing delay.
- DMA completion: transfer completion and FIFO underrun/overrun hooks.
- Retry loop decision points: retry counts and backoff behavior.
Bridge / expander stats
- FIFO depth and drop counts (congestion signatures).
- Internal counters for retries/CRC (if supported).
- Queue-watermark events (sustained pressure vs bursts).
Analyzer correlation
- Truthing boundaries: verify start/end definitions for latency and ordering.
- Detect sampling gaps and dropped events during stress runs.
- Cross-check top endpoints and error bursts with capture triggers.
Timebases (recording time that remains comparable)
Single MCU
Use a monotonic clock for t_start and t_end. Derive latency_us from the same clock domain.
Multi MCU / multi board
Define an explicit sync method and tag events with clock_domain_id. If full sync is unavailable, compare rollups within the same domain and use correlation windows for cross-domain analysis.
Requirement: store t_start, t_end, and clock_domain_id so disagreements can be audited instead of debated.
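The requirement can be sketched as follows, using Python's monotonic clock as a stand-in for an MCU timer; the clock_domain_id string is a hypothetical naming convention:

```python
import time

# Hypothetical domain name: one string per independent clock source.
CLOCK_DOMAIN_ID = "mcu0.monotonic"

def timed_transaction(do_transfer):
    """Wrap a transfer, recording t_start/t_end from one monotonic clock.

    Both raw timestamps are stored (not only derived latency) so that
    cross-tool disagreements can be audited later.
    """
    t_start = time.monotonic_ns()
    result = do_transfer()
    t_end = time.monotonic_ns()
    return {
        "t_start_ns": t_start,
        "t_end_ns": t_end,
        "latency_us": (t_end - t_start) / 1000.0,
        "clock_domain_id": CLOCK_DOMAIN_ID,
        "result": result,
    }
```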
Overhead control (production-safe logging)
Sampling strategy
- Always keep window rollups (low volume, high value).
- Transaction events: sample by rate or trigger on anomalies (timeouts, CRC bursts).
- Attach sampled to every event for honest interpretation.
Ring buffer + retention
- Keep last N seconds of events for post-mortem correlation.
- Prioritize error events; under memory pressure, drop debug-level events first.
- Emit a compact “notable events” list on failures.
Compression + rate limits
- Prefer histogram bins for latency (p95/p99 from bins) over raw traces.
- Rate limit per endpoint to prevent “one bad device” flooding logs.
- Batch uploads; apply back-pressure and drop policies deterministically.
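A fixed-bin latency histogram can be sketched as follows. The bin edges are illustrative, and the percentile is approximate by design: it returns the upper edge of the bin containing the target rank:

```python
class LatencyHistogram:
    """Fixed bins storing counts only; reconstructs approximate percentiles."""

    def __init__(self):
        # Bin upper edges in microseconds (illustrative log-style spacing).
        self.edges = [10, 20, 50, 100, 200, 500, 1000, 2000, 5000, 10000]
        self.counts = [0] * (len(self.edges) + 1)  # last bin = overflow

    def record(self, latency_us):
        for i, edge in enumerate(self.edges):
            if latency_us <= edge:
                self.counts[i] += 1
                return
        self.counts[-1] += 1

    def percentile_us(self, p):
        """Return the upper edge of the bin containing the p-th percentile."""
        total = sum(self.counts)
        if total == 0:
            return 0
        target = p / 100.0 * total
        running = 0
        for i, count in enumerate(self.counts):
            running += count
            if running >= target:
                return self.edges[i] if i < len(self.edges) else float("inf")
        return float("inf")
```

Twenty or so counters per endpoint replace an unbounded raw trace, which is what makes p95/p99 affordable in field telemetry.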
Scope guard: this chapter focuses on where to measure and how to keep measurements trustworthy, not on protocol waveforms or electrical troubleshooting.
I²C Health Signals (Stats-only, not protocol tutorial)
I²C health becomes actionable when counters are normalized, bucketed by endpoint and operation type, and correlated with context tags. This chapter focuses on interpretation and drilldown paths, not waveform or pull-up analysis.
Key counters (with durable normalization and buckets)
NAK_count (bucket by address + op_type)
Normalize as NAK_rate = NAK_count / addressed_txn_count to keep comparisons stable across traffic levels.
Required tags: addr, op_type, bus_instance, fw_ver, temp_bin, power_state.
arbitration_lost_count
Track as a rate: arb_lost_rate = arbitration_lost / master_txn_attempts. Use endpoint and firmware state to locate concurrency patterns.
Deep dive: multi-master policies and bus-stall recovery belong to the owner pages; this page only defines the drilldown signals.
stretch_timeout_count
Treat as a policy-triggered event: stretch_timeout_rate = stretch_timeout / txn_count. Correlate with temp_bin and load_bin.
Deep dive: clock stretching behavior and timeout tuning are covered in the Clock Stretching owner page.
bus_stuck_detected_count + stuck duration
Monitor both frequency and severity: stuck_events/hr, stuck_total_ms/window, and stuck_max_ms (tail risk).
Bucket by power_state and board_id to separate sequencing-related incidents from endpoint-specific faults.
recovery_attempts + success rate (and TTR)
Track reliability of recovery hooks: success_rate = success / attempts. Add time_to_recover_ms p95/p99 to expose tail failures.
Deep dive: recovery mechanisms and reset sequences belong to the Recovery owner page.
Meaning mapping (symptom → metric → fast split → next action)
NAK spikes (looks intermittent)
Metric to check: NAK_rate (1s/10s windows). Prefer per-endpoint normalization.
Fast split: addr → op_type → power_state / fw_ver.
Likely buckets: device busy vs addressing conflict vs timeout policy mismatch. Deep dive: Page Write / Addressing / Timeout policy.
Stuck bus events (hard stalls)
Metric to check: stuck_max_ms + recovery_success_rate (tail + effectiveness).
Fast split: power_state → board_id → last-active addr.
Likely buckets: hung endpoint vs sequencing/ghost-power category. Deep dive: Recovery page.
Scope guard: all waveform, pull-up sizing, and timing deep dives should be routed to the owner pages (Clock Stretching / Pull-up Network / Recovery).
SPI Health Signals (Stats-only)
Many SPI failures originate from throughput pressure and timing boundaries rather than protocol semantics. Health stats should be bucketable by SCLK band, load, temperature, board revision, and firmware version to reveal repeatable patterns.
Key counters (organized by bottleneck layer)
crc_fail_count (if available)
Normalize as crc_fail_rate = crc_fail / frames. Treat CRC bursts as an integrity signature that should be stratified by context.
Required tags: sclk_band, temp_bin, load_bin, board_id, fw_ver.
underrun/overrun_count (DMA/FIFO)
Normalize by bursts or time: underrun_rate = underrun / bursts. Pair with queue metrics to distinguish congestion from integrity issues.
Companion metrics: queue_depth_p95, isr_latency_p95, throughput_Bps.
cs_glitch / sync_error_count (if detectable)
Track as per 1k transactions and bucket by cs_id and power_state to isolate boundary-state incidents.
This counter is a detector input; timing and electrical explanations should be routed to the owner pages.
retry_count (application-level)
Normalize as retry_rate = retry / transactions. Use retry changes to validate that “fixes” improve health without hiding errors via slowdowns.
Always view retry_rate alongside throughput and tail latency to avoid misleading wins.
Symptom-to-metric (bucket first, then attribute)
CRC fails appear only at certain speeds
Primary metric: crc_fail_rate (10s window).
First split: sclk_band → temp_bin → load_bin → board_id.
Confirm with: throughput dip + retry_rate burst. Deep dive: Long-Trace SI / SCLK quality (owner pages).
Data drops under load (looks like “random failures”)
Primary metric: underrun_rate (1s window).
First split: load_bin → fw_state → isr_latency_p95.
Confirm with: queue_depth_p95 and burst size. Deep dive: DMA & High Throughput (owner page).
Scope guard: this chapter maps symptoms to bucketable metrics; protocol-mode explanations and electrical details belong to the SPI owner pages.
UART Health Signals (Stats-only)
UART health statistics should separate integrity errors (noise/clock category) from buffering and flow-control pressure (queue/strategy category). The goal is durable counters, stable normalization, and bucketable tags that enable repeatable attribution.
Key counters (normalized, bucketed, and cross-validated)
Integrity errors (Noise / Clock category)
- framing_error_count → normalize as framing_error_rate = framing_error / rx_frames
- parity_error_count → normalize as parity_error_rate = parity_error / rx_frames
Required tags: uart_instance, peer_id, baud_setting_id, clock_source_id, temp_bin, power_state, board_id, fw_ver.
Buffer / Flow-control pressure (Queue / Strategy category)
- overrun_count → overrun_rate = overrun / rx_frames
- rx_drop_count (buffer overflow) → rx_drop_rate = rx_drop / rx_frames
- flow_control_asserted_time → flow_control_ratio = asserted_time / window_s
Companion metrics (cross-validation): rx_queue_depth_p95/p99, queue_wait_p95/p99, isr_latency_p95, throughput_payload_Bps.
Power and wake context (phase attribution)
- break_detect_count → correlate errors by before_wake / after_wake phase tags
- idle_wake_count → validate whether error bursts align with state transitions
These counters are primarily bucket keys for time alignment; protocol explanations should be routed to the owner pages.
Latency lens (RX → processing, split into actionable components)
Timestamp edges (monotonic clock)
- t_rx_irq: RX interrupt/callback arrival (or DMA completion)
- t_enqueue: enqueue into RX ring/buffer
- t_dequeue: application stack dequeue
- t_done (optional): processing completion
Derived latency components (report percentiles)
- service_time = t_enqueue − t_rx_irq (driver/ISR pressure)
- queue_wait = t_dequeue − t_enqueue (congestion)
- processing_time = t_done − t_dequeue (upper stack)
- rx_to_process = t_dequeue − t_rx_irq (end-to-end)
Use p95/p99 and max to expose tail risks; avoid relying on mean latency.
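The decomposition above is simple arithmetic on the timestamp edges; as a sketch, assuming all timestamps come from the same monotonic clock and are in microseconds:

```python
def latency_components(t_rx_irq, t_enqueue, t_dequeue, t_done=None):
    """Split RX-to-processing latency into the components defined above (µs)."""
    comps = {
        "service_time": t_enqueue - t_rx_irq,    # driver/ISR pressure
        "queue_wait": t_dequeue - t_enqueue,     # congestion
        "rx_to_process": t_dequeue - t_rx_irq,   # end-to-end
    }
    if t_done is not None:
        comps["processing_time"] = t_done - t_dequeue  # upper stack
    return comps
```

Each component should feed its own percentile series so a p99 spike can be attributed to the ISR, the queue, or the application without re-instrumenting.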
Scope guard: baud-rate math, sampling theory, and physical-layer explanations belong to UART owner pages; this chapter defines counters, normalization, and attribution paths.
Throughput & Latency Measurement (How to compute correctly)
Correct health metrics require consistent transaction boundaries, tail-aware latency reporting, and explicit separation of queueing from transfer and processing. Throughput should report both application payload and line-usage estimates to avoid misleading “fast” readings.
Define transaction boundaries (per bus type, stable across tooling)
I²C boundary
One driver-submitted message/combined transaction as a single txn. Use t_start/t_end from the event schema to keep latency comparable across firmware versions.
SPI boundary
One CS-active transfer segment or one DMA burst as a txn (choose one and keep it consistent). Record bytes and burst_id so rollups do not mix different batch definitions.
UART boundary
One application packet or fixed-size chunk as a txn. This prevents per-byte noise from dominating latency and makes drops and backpressure comparable across workloads.
Avoid misleading averages (tail-aware reporting)
Latency
- Report p50 / p95 / p99 plus max (tail risk).
- Track timeout_rate separately from latency to avoid silent failure masking.
- Use multi-window views (1s/10s/60s) to capture bursts and trends.
Throughput
- Report median and p05 (low-tail reveals congestion).
- Always pair throughput with error and tail-latency trends to avoid “slow but stable” misreads.
Compute with components (queue vs transfer vs service vs processing)
Latency components
- queue_wait: time waiting for CPU/locks/queue capacity
- on_wire: transfer segment duration (event boundary-defined)
- device_service: peripheral response/ready time (as observed)
- firmware_processing: local processing time after receipt
Percentiles should be computed per component so tail latency is not misattributed.
Throughput definitions
- throughput_payload_Bps = payload_bytes / window_s
- throughput_line_Bps = (payload_bytes + overhead_bytes_equiv) / window_s
- overhead_factor = throughput_line_Bps / throughput_payload_Bps
Overhead is a reported factor rather than a protocol tutorial; bus-specific overhead details should be routed to the owner pages.
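The three definitions reduce to a few lines. Here overhead_bytes_equiv is whatever line-overhead estimate the project chooses to report (addressing, framing, ACK bits expressed as byte-equivalents), which is why this page treats it as an input rather than deriving it per protocol:

```python
def throughput_metrics(payload_bytes, overhead_bytes_equiv, window_s):
    """Compute payload vs line throughput and the overhead factor defined above."""
    payload_bps = payload_bytes / window_s
    line_bps = (payload_bytes + overhead_bytes_equiv) / window_s
    return {
        "throughput_payload_Bps": payload_bps,
        "throughput_line_Bps": line_bps,
        # > 1.0 always; a rising factor means more of the bus is non-payload.
        "overhead_factor": line_bps / payload_bps if payload_bps else float("inf"),
    }
```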
Bus Health Dashboard & Alerts (Turn stats into action)
A production-grade dashboard is a closed loop: consistent KPIs, deterministic drilldowns, rule-based alerts, and a notable-events stream that captures first occurrences, regressions, and context-correlated spikes.
Dashboard layout (fixed information architecture)
Overview (4 KPIs)
- Health score (optional) for ranking and triage only.
- Error rate (normalized, e.g., per 1k or per 1M).
- p95 latency (tail-aware primary KPI; p99 as secondary).
- Throughput (payload_Bps; line_Bps optional).
Keep error, latency, and throughput on the same screen to prevent “stable by slowing down” misreads.
Drilldown (endpoint → context)
- By endpoint: addr / cs / port (top talkers ranked by error and tail latency).
- By operation: op_type (read/write/control) to avoid mixing semantics.
- By context: temp_bin, fw_ver, load_bin, power_state, board_id.
The drilldown order should remain fixed to preserve cross-team comparability.
Notable events (structured, not raw logs)
- First occurrence: first threshold crossing for an endpoint or context bucket.
- Regression: baseline shift after fw_ver change.
- Correlation spike: error/latency spikes aligned with temp_bin or load_bin.
- Recovery risk: repeated recovery attempts, reduced success, or long stuck_max_ms.
Alert rules (cookbook structure)
Rule template (mandatory fields)
- rule_id, metric, scope (global / per-endpoint / per-context)
- window (e.g., 1s/10s/60s/5min) and normalization unit
- condition (threshold / slope / baseline deviation)
- severity (warn/crit), suppression (debounce + cooldown)
- action (snapshot/dump, degrade, reset, protect-mode)
- pass_criteria reference (production gating hook)
Rule types (3 categories)
- Absolute thresholds: CRC/timeout/stuck_max_ms > X per window (with units).
- Trend / rate-of-change: fast worsening before thresholds cross.
- Anomaly / baseline deviation: endpoint-specific normals differ; alert on deviation.
Add debounce (N consecutive windows) and cooldown (X minutes) to prevent alert storms.
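Debounce and cooldown can be sketched as a tiny per-rule state machine. This models an absolute-threshold rule only; threshold and window counts are placeholders:

```python
class DebouncedAlert:
    """Absolute-threshold rule with debounce (N consecutive windows) and cooldown."""

    def __init__(self, threshold, debounce_windows, cooldown_windows):
        self.threshold = threshold
        self.debounce = debounce_windows
        self.cooldown = cooldown_windows
        self._streak = 0
        self._cooldown_left = 0

    def observe(self, value):
        """Feed one window's metric value; return True only when the alert fires."""
        if self._cooldown_left > 0:
            self._cooldown_left -= 1
            return False
        self._streak = self._streak + 1 if value > self.threshold else 0
        if self._streak >= self.debounce:
            self._streak = 0
            self._cooldown_left = self.cooldown
            return True
        return False
```

One sustained breach produces exactly one alert per cooldown period instead of one per window, which is the storm-prevention property the rule template asks for.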
Burn-in / production gating (quality gates)
- Define a test phase (bring-up/burn-in/production) and duration (minutes/cycles).
- Specify gate metrics: error_rate, timeout_rate, recovery_success, latency_p99.
- Define outcomes: pass / retest / quarantine, plus notable-event capture on failures.
Scope guard: this chapter defines dashboard structure and alert logic; root-cause mechanisms belong to bus-specific owner pages.
Field Logging Strategy (Storage, privacy, and reliability)
Field logging must be survivable and bounded: tiered levels with sampling, ring buffers with crash snapshots, rollup-first aggregation, transport back-pressure, and privacy-by-default metadata-only records.
Logging levels (with sampling policy)
DEBUG
Bring-up only; off by default. Enable per endpoint and for a limited time. Use probabilistic sampling to prevent log storms.
INFO
Rollup-first telemetry: counts per window, latency histograms, throughput summaries. Stable across firmware releases.
WARN
Threshold crossings and anomalies with minimal context. Triggered sampling can temporarily increase detail for top endpoints.
ERROR
Failures and recovery breakdowns. Always captured. Used to freeze buffers and produce crash-adjacent snapshots.
Sampling modes (bounded yet diagnostic)
- Probabilistic: keep a small percentage of transaction events under high traffic.
- Triggered: after an alert, increase sampling for the affected endpoint(s) for a limited duration.
- Top-K focus: only escalate detail for top talkers to prevent system-wide amplification.
Ring buffer + crash dump (last N seconds, metadata-only)
Ring buffer goals
- Capacity is defined by time horizon (keep last N seconds), not by event count.
- Metadata-only event schema; payload is scrubbed by default.
- Freeze-on-trigger prevents overwrite of pre-failure evidence.
Triggers (examples)
- panic / watchdog / fatal fault
- repeated recovery failures
- stuck episodes exceeding a time threshold
- consecutive alert windows (debounced)
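A minimal sketch of a time-horizon ring buffer with freeze-on-trigger; timestamps are assumed to be monotonic seconds, and eviction is by age rather than by event count:

```python
from collections import deque

class EventRingBuffer:
    """Keep the last N seconds of metadata events; freeze on a trigger."""

    def __init__(self, horizon_s):
        self.horizon_s = horizon_s
        self.events = deque()
        self.frozen = False

    def append(self, t_s, event):
        if self.frozen:
            return  # preserve pre-failure evidence; drop new events
        self.events.append((t_s, event))
        # Evict anything older than the time horizon.
        while self.events and self.events[0][0] < t_s - self.horizon_s:
            self.events.popleft()

    def freeze(self):
        """Called by a trigger (panic, repeated recovery failure, long stuck episode)."""
        self.frozen = True

    def snapshot(self):
        return list(self.events)
```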
Aggregation, transport, and privacy baseline
Compression & aggregation
- Counts per window: txn_count, err_count, timeout_rate, recovery_success.
- Latency histograms: bins that can reconstruct p50/p95/p99 approximately.
- Top talkers sketch: keep only top-K endpoints per window.
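The top-K sketch can be as simple as sorting per-window counts. A true streaming sketch (for example Space-Saving) would bound memory further; this version assumes the per-window endpoint map already fits in RAM:

```python
from collections import Counter

def top_k_talkers(endpoint_err_counts, k=3):
    """Keep only the top-K endpoints per window, ties broken by endpoint name."""
    counts = Counter(endpoint_err_counts)
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:k]
```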
Telemetry transport
- Batch upload and retry with back-pressure awareness.
- Drop policy priority: ERROR → WARN → rollup → DEBUG/txn events.
- Store-and-forward with explicit retention caps (time and size).
Privacy & security baseline
- Scrub payload by default; keep metadata only.
- Optional endpoint hashing for shared telemetry environments.
- Whitelist/blacklist fields, and protect uploads with authenticated transport.
Scope guard: this chapter covers bounded telemetry engineering (levels, buffers, aggregation, transport, privacy). Protocol parsing and payload analysis belong elsewhere.
Engineering Checklist (Bring-up → Stress → Production)
This checklist turns bus statistics into acceptance: measurable baselines, repeatable stress evidence, and production gates with correlation across stations and tools.
Bring-up checklist (make observability real)
- Enable counters: per-endpoint buckets (addr/cs/port) plus rollups (window_s).
- Validate timebase: monotonic timestamps; defined mapping if multiple MCUs exist.
- Normalize units: error_rate per 1k/1M, latency_us p50/p95/p99, throughput_Bps.
- Baseline run: known-good setup; record temp_bin, load_bin, power_state, fw_ver.
- Truthing check: compare firmware counters vs analyzer/bridge stats for direction + magnitude.
- Rollup integrity: window records exist (txn_count, err_count, p95, throughput, timeout_rate).
- Notable events: first occurrence can be emitted and stored locally.
- Alert suppression: debounce (N windows) + cooldown (X minutes) + Top-K focus.
Stress checklist (prove stability under disturbance)
Temperature sweep
- Track error_rate vs temp_bin (and per endpoint).
- Verify tail latency (p99) does not drift across bins.
- Capture correlation events (temp spikes → error bursts).
Load sweep
- Bind queue_depth_p95/p99 and isr_latency_p95 to overruns/drops.
- Confirm throughput stays stable while error_rate stays bounded.
- Identify Top-K endpoints that dominate tail behavior.
Long-run soak
- Watch p99 latency creep (memory pressure / queue accumulation signals).
- Ensure rollups remain consistent (no schema drift).
- Keep notable events for first seen/regression markers.
Hot-plug / brown-out
- Measure stuck_detected and max_stall_time_ms.
- Measure recovery_attempts and success_rate + time_to_recover p95.
- Freeze ring buffer on failure and export crash-adjacent snapshots.
Production checklist (gates + correlation)
- Gating thresholds: fixed units + windows; pass/fail actions (retest/quarantine) defined.
- Golden unit: compare each station to a known-good reference under the same script.
- Station-to-station correlation: ensure KPI deltas stay within an allowed band (error_rate, p95/p99, throughput).
- Tool correlation: analyzer/bridge counters agree with firmware counters (direction + order-of-magnitude).
- Regression hooks: baseline deviation checks tied to fw_ver updates.
Pass criteria template (fill-in, reproducible)
- error_rate ≤ X per 1k (window=Y s, per-endpoint + global)
- p95_latency_us ≤ X (and p99 guardrail ≤ Y)
- recovery_success_rate ≥ X% (sample count ≥ N)
- max_stall_time_ms ≤ X (and time_to_recover p95 ≤ Y)
Evidence artifacts (required)
- KPI snapshot (same window, same normalization).
- Top talkers list (endpoint ranking).
- Notable events (first seen / regression / correlation).
- Ring buffer snapshot on failures (metadata-only).
Applications & IC Selection Logic (Evidence-first, no shopping list)
Applications show where bus statistics save time. Selection logic specifies capability requirements (counters, FIFOs, timestamp hooks, deterministic latency support) and provides concrete material examples to validate feasibility.
Applications (how the stats close the loop)
Manufacturing test
- Acceptance thresholds: error_rate, timeout_rate, recovery_success, max_stall_time.
- Quick triage: Top talkers by endpoint and notable events on first occurrence.
- Evidence output: one-page KPI snapshot with identical windows and normalization.
Field reliability
- Early warning: trend + baseline deviation alerts (debounced + cooldown).
- Remote debugging: ring buffer snapshots + context tags (fw_ver/temp/load/power).
- Regression defense: fw_ver change detection tied to notable events.
System performance
- Tuning by evidence: throughput_payload_Bps vs p95/p99 latency and error_rate together.
- Component lens: queue_wait vs on-wire vs device_service vs processing.
- Before/after compare: same workload script and endpoint bucket order.
IC selection logic (criteria + capability checklist)
Must-have capabilities
- Deep FIFOs / readable queue depth (prevents silent drops).
- Hardware/driver error counters with per-endpoint bucketing.
- Timestamp hooks at tap points (IRQ/DMA completion/enqueue/dequeue).
- Deterministic retry/timeout controls (consistent windows + gates).
- Bounded telemetry support (rollups + histogram bins + Top-K).
Nice-to-have capabilities
- Hardware timestamping (lower overhead, better correlation).
- Deterministic latency hooks for station-to-station correlation.
- Configurable sampling modes and freeze-on-trigger buffers.
- Isolation delay budgeting awareness (logged as metadata).
Concrete material examples (for feasibility checks; verify package/suffix/availability)
USB ↔ UART bridges
- FTDI FT232R, FT231XQ, FT2232HL
- Silicon Labs CP2102N-A02-GQFN24
- WCH CH340C
USB ↔ I²C / SPI bridges
- Microchip MCP2221A-I/SL (USB↔I²C/UART)
- Microchip MCP2210-I/SL (USB↔SPI)
- FTDI FT4222H (USB↔I²C/SPI)
I²C mux / buffer / extender (bucketing by segment)
- TI TCA9548A (I²C mux), TCA9517 (buffer)
- NXP PCA9548A (I²C mux), PCA9517A (buffer)
- NXP P82B96 (long-reach I²C buffer concept)
I²C isolators / SPI digital isolators (delay-aware)
- Analog Devices ADuM1250ARZ (I²C), ADuM3151BRZ (SPI)
- Texas Instruments ISO1540DWR / ISO1541DWR (I²C)
- Texas Instruments ISO7741FQDWRQ1 (SPI-style multi-channel isolation example)
isoSPI / long-chain comms (stats hooks by node)
- Analog Devices (Linear Tech) LTC6820 (isoSPI)
- Analog Devices (Linear Tech) LTC6821 (isoSPI)
Local storage for ring buffers (telemetry survivability)
- Winbond W25Q64JVSSIQ (SPI NOR)
- Infineon/Cypress FM25V02-G (SPI FRAM example)
- Microchip 23LC1024-I/SN (SPI SRAM example)
Low-cap ESD protection (ports + analyzer headers)
- Nexperia PESD5V0S1UL (low-cap TVS example)
- Texas Instruments TPD1E10B06 (single-line ESD)
- Semtech RClamp0524P (multi-line ESD array example)
These part numbers are examples to validate counters/FIFOs/timestamp and survivable logging paths. Final selection should match voltage, package, temperature grade, and availability constraints.
Selection flow (metrics → taps → hardware → firmware cost → decision)
- Required metrics: error_rate, p95/p99, throughput, recovery success, max stall.
- Tap points: driver IRQ/DMA completion, enqueue/dequeue, bridge/analyzer counters.
- Hardware support: FIFO depth, counter readout, timestamp hooks, isolation delay awareness.
- Firmware effort: CPU overhead, storage budget, bandwidth/telemetry constraints.
- Decision: pass criteria, alert cookbook, evidence artifacts, production gates.
FAQs (4-line answers + measurable pass criteria)
Each FAQ maps a symptom to a first accounting check, a fix that standardizes observability, and a pass criterion with units + window + sample context.
NAK rate jumps but the scope looks “OK”—what is the first accounting check?
Likely cause: window/denominator mismatch, or mixed endpoints collapsing into one number.
Quick check: split by endpoint (addr/cs/port) + read/write + fixed window_s (1s/10s/60s).
Fix: standardize metric definition (per 1k frames/txns) and require per-endpoint rollups as the primary view.
Pass criteria: NAK_rate ≤ X/1k over Y minutes, with Top-K endpoints stable (no single endpoint contributes > Z%).
CRC fails only at certain throughput—SI issue or buffer underrun?
Likely cause: FIFO underrun / scheduling jitter masquerading as link errors under load.
Quick check: correlate CRC_fail_count with underrun/overrun counters + CPU/load_bin + queue_depth_p99 at the same window.
Fix: increase FIFO/DMA burst, raise service priority, reduce IRQ latency; keep the same throughput target during verification.
Pass criteria: CRC_fail < X/1M over Y minutes at load Z, and underrun_count = 0 for the same run.
Latency average is fine, but users see “random stalls”—why?
Likely cause: tail latency spikes hidden by averages; rare max stalls dominate perceived quality.
Quick check: plot p95/p99 + max_stall_time_ms; split latency into queue_wait vs on-wire vs device_service vs processing.
Fix: add percentile alerts + decomposition rollups; gate releases on p99 and stall ceilings, not averages.
Pass criteria: p99_latency ≤ X ms and max_stall ≤ Y ms over N transactions (timeout_rate ≤ Z/1k).
Same firmware, different boards show different error rates—what to log first?
Likely cause: missing context (temp/power/load/board_rev) causing apples-to-oranges comparisons.
Quick check: add context tags and compare per-endpoint Top talkers; stratify by temp_bin + power_state + board_rev.
Fix: require context tags in every rollup; define baseline per board_rev and compare deltas within the same bin.
Pass criteria: after stratification, error_rate variance ≤ X% across boards (same bins, same window_s, same workload).
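Stratified comparison amounts to aggregating per (board_rev, temp_bin) before computing any rate. A minimal sketch with invented rollup records (field names follow the context tags named above but are otherwise assumptions):

```python
from collections import defaultdict

# Hypothetical rollups tagged with context bins.
rollups = [
    {"board_rev": "A", "temp_bin": "25C", "errors": 2,  "txns": 1000},
    {"board_rev": "B", "temp_bin": "25C", "errors": 3,  "txns": 1500},
    {"board_rev": "B", "temp_bin": "60C", "errors": 40, "txns": 1000},
]

def error_rate_by_bin(rollups):
    """Errors per 1k txns per (board_rev, temp_bin), so boards are
    only ever compared inside the same bin."""
    agg = defaultdict(lambda: [0, 0])
    for r in rollups:
        key = (r["board_rev"], r["temp_bin"])
        agg[key][0] += r["errors"]
        agg[key][1] += r["txns"]
    return {k: 1000.0 * e / t for k, (e, t) in agg.items()}

rates = error_rate_by_bin(rollups)
```

In this toy data, boards A and B match within the 25C bin; the apparent "B is worse" signal lives entirely in the 60C bin, which an unstratified global rate would have blamed on the board.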
I²C “stuck” events recover, but keep recurring—what is the fastest triage?
Likely cause: repeated hung endpoint or sequencing/ghost-power pattern that keeps re-triggering.
Quick check: rank stuck events by last_active_endpoint + time_since_power_transition; compute stuck_duration histogram.
Fix: add per-endpoint isolation/reset hooks; enforce recovery escalation ladder (soft reset → segment isolate → power cycle).
Pass criteria: stuck_frequency < X/day and recovery_success ≥ Y% with max_stuck_duration ≤ Z ms.
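The escalation ladder above can be encoded as an ordered list of steps that is tried until one reports success, so time-to-recover and the step reached are both loggable. The `try_step` interface and the `FakeBus` stand-in are hypothetical:

```python
def recover(bus, ladder=("soft_reset", "segment_isolate", "power_cycle")):
    """Try recovery steps in escalating order; return the step that
    succeeded, or None if the whole ladder failed. `bus` is any object
    exposing try_step(step) -> bool."""
    for step in ladder:
        if bus.try_step(step):
            return step
    return None

class FakeBus:
    """Hypothetical bus that only recovers on a power cycle."""
    def try_step(self, step):
        return step == "power_cycle"

result = recover(FakeBus())
```

Logging which rung succeeded (not just that recovery happened) is what lets you spot the recurring pattern: a bus that always needs the top rung points at sequencing or ghost power, not a transient glitch.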
UART framing errors spike after enabling low-power—noise or wake timing?
Likely cause: wake/clock settling window too short, causing early frames to be sampled incorrectly.
Quick check: correlate errors with wake events; bucket the first N frames after wake vs steady-state frames.
Fix: add guard time, re-sync, discard first frames, and tighten flow-control policy during wake transitions.
Pass criteria: framing/parity errors = 0 in first N frames across M wake cycles (same workload, same bins).
Retry count drops after a “fix”, but throughput also drops—did the system just slow down?
Likely cause: reliability improved only because timing/rate became conservative, masking the real capacity.
Quick check: compare error_rate + throughput_payload_Bps + p95 latency together, under the same workload and window.
Fix: tune guardrails with a 3-metric gate (reliability + throughput + latency); validate with Top-K endpoints, not only global averages.
Pass criteria: error_rate ≤ X while throughput ≥ Y and p95 ≤ Z (same bins, same load, same window_s).
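The 3-metric gate is just a conjunction over reliability, throughput, and latency, evaluated on the same window. The threshold values below stand in for the project-specific X/Y/Z:

```python
def three_metric_gate(error_per_1k, payload_bps, p95_ms,
                      max_err=2.0, min_bps=100_000, max_p95=5.0):
    """Pass only if reliability AND throughput AND latency all hold.
    Thresholds are hypothetical project limits, not recommendations."""
    return (error_per_1k <= max_err
            and payload_bps >= min_bps
            and p95_ms <= max_p95)

# A "fix" that trades throughput for reliability fails the gate:
healthy = three_metric_gate(1.0, 120_000, 4.0)
slowed = three_metric_gate(0.1, 50_000, 4.0)
```

The second call shows the trap this FAQ describes: error rate improved tenfold, but the gate still fails because throughput collapsed under the same workload.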
Metrics disagree between firmware counters and a logic analyzer—who is right?
Likely cause: different transaction boundaries or tap points (IRQ/DMA completion vs CS-active vs analyzer triggers), plus sampling/drop effects.
Quick check: align endpoint + fixed window_s + the same boundary definition; do a short manual ledger (txn_count, bytes, fails) for one capture.
Fix: encode boundary definitions in the schema (t_start/t_end meaning, txn_id, tool_id, timebase_id); treat rollups as primary and events as sampled truthing.
Pass criteria: within the same script/window, txn_count/bytes differ ≤ X%, and failed endpoints match ≥ Y% of the time.
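The short manual ledger can be mechanized as a per-field percent-difference check between the two tools for one capture. Field names and the 2% default tolerance are assumptions:

```python
def reconcile(fw, la, max_pct=2.0):
    """Compare firmware vs logic-analyzer ledgers for one capture.
    Each ledger: {'txn_count': int, 'bytes': int}. Returns per-field
    percent differences and an overall ok flag."""
    diffs = {}
    for k in ("txn_count", "bytes"):
        base = max(fw[k], la[k], 1)  # avoid division by zero
        diffs[k] = 100.0 * abs(fw[k] - la[k]) / base
    return diffs, all(d <= max_pct for d in diffs.values())

# Hypothetical capture: small disagreement from boundary definitions.
fw = {"txn_count": 1000, "bytes": 64000}
la = {"txn_count": 995,  "bytes": 63680}
diffs, ok = reconcile(fw, la)
```

A small, stable disagreement (0.5% here) usually means a boundary-definition offset you can encode in the schema; a large or drifting one means one tap point is dropping events.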
Alerts keep firing but the issue cannot be reproduced—too sensitive thresholds or wrong windows?
Likely cause: alert conditions ignore burstiness and endpoint mix; missing debounce/cooldown creates alert storms.
Quick check: compare 1s vs 10s vs 60s windows; check Top-K endpoints; measure alert trigger rate-of-change vs absolute thresholds.
Fix: add debounce (N windows) + cooldown (T minutes) + endpoint scoping; use anomaly/baseline deviation per endpoint when distributions differ.
Pass criteria: false-alert rate ≤ X/day while detection sensitivity is preserved (confirmed incidents still trigger within Y windows).
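The debounce-plus-cooldown policy can be sketched as a tiny state machine fed one breach flag per window. The window counts below are placeholders for the N/T values above:

```python
class DebouncedAlert:
    """Fire only after the breach condition holds for debounce_n
    consecutive windows, then stay silent for cooldown_n windows."""
    def __init__(self, debounce_n=3, cooldown_n=2):
        self.debounce_n = debounce_n
        self.cooldown_n = cooldown_n
        self.streak = 0
        self.cooldown = 0

    def update(self, breached):
        """Feed one window's breach flag; return True iff the alert fires."""
        if self.cooldown > 0:
            self.cooldown -= 1
            return False
        self.streak = self.streak + 1 if breached else 0
        if self.streak >= self.debounce_n:
            self.streak = 0
            self.cooldown = self.cooldown_n
            return True
        return False

alert = DebouncedAlert(debounce_n=3, cooldown_n=2)
fires = [alert.update(True) for _ in range(6)]
```

Six consecutive breached windows produce exactly one firing instead of six, which is the storm-suppression behavior the fix asks for; endpoint scoping then keeps one noisy endpoint from masking the rest.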
Field logs always miss the critical segment—ring buffer too small or drop policy too aggressive?
Likely cause: ring buffer retention is sized by count (not time), or low-priority drops erase pre-fault context.
Quick check: measure effective retention time at real traffic; confirm whether WARN/ERROR events survive uplink back-pressure conditions.
Fix: size buffers by seconds (last N seconds), freeze-on-trigger, and prioritize ERROR/WARN + rollups over raw DEBUG events.
Pass criteria: on each confirmed incident, last N seconds snapshot is available ≥ Y% of the time (with complete endpoint/context tags).
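Sizing by seconds rather than count, plus freeze-on-trigger, looks roughly like this (timestamps and retention value are illustrative; a real firmware buffer would be fixed-allocation rather than a Python deque):

```python
from collections import deque

class TimeRing:
    """Ring buffer that retains the last retain_s seconds of events
    (not a fixed event count) and can freeze a snapshot on a trigger."""
    def __init__(self, retain_s=10.0):
        self.retain_s = retain_s
        self.buf = deque()
        self.frozen = None

    def log(self, t, event):
        self.buf.append((t, event))
        # Evict by age, so bursty traffic cannot shrink the time window.
        while self.buf and self.buf[0][0] < t - self.retain_s:
            self.buf.popleft()

    def freeze(self):
        """Snapshot the pre-fault window when an incident triggers."""
        self.frozen = list(self.buf)
        return self.frozen

ring = TimeRing(retain_s=10.0)
for t in (0.0, 5.0, 10.0, 15.0, 20.0):
    ring.log(t, f"e{int(t)}")
snap = ring.freeze()
```

Because eviction is age-based, the "last N seconds" guarantee survives traffic spikes; a count-based ring would have silently shrunk the retained time exactly when the fault made traffic burst.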
Top talkers change every run—unstable endpoint IDs or wrong transaction boundaries?
Likely cause: endpoint_id is not canonical (aliases, dynamic mapping), or txn boundaries drift across implementations.
Quick check: audit endpoint_id generation; verify the same physical target maps to one ID; compare txn_count and bytes under the same script.
Fix: define canonical endpoint_id rules (addr/cs/port + channel/segment), encode boundary definitions in schema, and validate with a short golden script.
Pass criteria: under the same workload, Top-K endpoints overlap ≥ X% across runs, and rollup totals differ ≤ Y%.
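The canonical-ID rule can be centralized in one constructor so every logger emits the same string for the same physical target. The ID layout below (`bus:chN:segN:core`) is one plausible convention, not a standard:

```python
def canonical_endpoint_id(bus, addr=None, cs=None, port=None,
                          channel=0, segment=0):
    """Build one stable ID per physical target so Top-K rankings and
    rollup joins are comparable across runs and tools."""
    if bus == "i2c":
        core = f"addr=0x{addr:02X}"   # 7-bit address, zero-padded hex
    elif bus == "spi":
        core = f"cs={cs}"             # chip-select index
    elif bus == "uart":
        core = f"port={port}"         # device/port name
    else:
        raise ValueError(f"unknown bus: {bus}")
    return f"{bus}:ch{channel}:seg{segment}:{core}"

i2c_id = canonical_endpoint_id("i2c", addr=0x3C)
spi_id = canonical_endpoint_id("spi", cs=1, channel=2)
```

Routing every log site through one constructor like this kills the aliasing problem (`0x3c` vs `60` vs `3C`) that makes Top-K lists churn between runs; the golden-script validation then only has to compare totals.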
Note: X/Y/Z/N/M are project-specific thresholds. Keep units, windows, and bucket keys fixed to preserve comparability.