
Bus Health & Stats for I2C, SPI, and UART


Turn I²C/SPI/UART problems into numbers: define a metric dictionary, log with a fixed schema, and use percentiles + per-endpoint buckets to drive dashboards, alerts, and pass/fail gates.

The goal is actionable evidence—retries/NAKs/CRC, throughput, tail latency, and recovery success—measured with consistent units and windows so bring-up, field debugging, and production decisions stay comparable.

Scope & Outputs (What this page owns)

Bus health is measurable. This page defines a consistent observability system to track reliability, performance, and recoverability across I²C, SPI, and UART, and then turns those numbers into actionable guardrails for bring-up, production, and field reliability.

What “health” means (definition used throughout this page)

  • Reliability: low error rates (retries / NAKs / CRC / framing), and predictable behavior under stress (temperature, load, hot-plug).
  • Performance: sustained throughput and stable utilization without “hidden” costs (e.g., retries masking the true payload rate).
  • Recoverability: fast, repeatable recovery from stalls and lockups (timeouts, bus reset, re-init), with measurable success rate and time-to-recover.

Deliverables (copy-paste outputs for engineering teams)

  1. Metrics Dictionary — a single source of truth for every counter, rate, percentile, window, and unit.

    Minimum fields: metric_id, name, type, numerator, denominator, unit, window, sampling_point, tags, interpretation, scope_guard_link, pass_criteria

  2. Logging Schema — structured, low-overhead event records that survive production and field constraints.

    Default policy: log metadata (timing, sizes, outcomes, context) and avoid payload by default to reduce volume and risk.

  3. Dashboard Layout — consistent overview + drilldown so different devices/projects stay comparable.

    Required widgets: Error rate, p95/p99 latency, Throughput, Recovery success, plus “Top endpoints” and “Notable events”.

  4. Alert Rules — thresholds + trends + anomalies, tied to clear actions (not just alarms).
    • Absolute: “CRC fail rate > X per 1M frames”
    • Trend: “NAK rate slope > X/min for Y minutes”
    • Anomaly: “Endpoint deviates from its baseline by > Zσ”
  5. Bring-up → Production Checklist — measurable pass criteria for lab validation, stress, and manufacturing gates.

    Every checklist item should end with: Pass criteria (X/Y/Z placeholders) so teams can validate objectively.

Scope guard (anti-overlap rule)

This page owns

  • Metric definitions: units, denominators, windows, percentiles, tags, and interpretation.
  • Instrumentation points and timebase rules for consistent latency and throughput measurement.
  • Dashboard + alert patterns that turn numbers into actions (triage, regression detection, gates).

This page does NOT own

  • Protocol fundamentals and tutorials (I²C timing, SPI mode walkthroughs, UART baud math).
  • Detailed timing derivations and electrical/SI deep dives (termination, return paths, eye shaping).
  • Long troubleshooting playbooks beyond metric-driven triage (only short “symptom→metric→link” pointers belong here).
[Figure] Observability pipeline (bus health → measurable actions): Signals → Counters → Logs → Metrics → Dashboards → Alerts → Actions. Context tags for drilldown: endpoint, bus, fw_ver, temp, power_state, load. Feedback loop: alerts drive actions (timeouts / retries / resets / tuning).
A stable health system is built from consistent counters, structured logs, comparable metrics, and action-oriented alerts.

Metrics Taxonomy (Define the metric dictionary)

Metrics only help when definitions are consistent. Every metric must specify its unit, numerator/denominator, time window, sampling point, and tags—otherwise different teams will measure different realities and cannot compare results.

Core rules (keep metrics comparable across projects)

  • Rate beats raw counts: prefer “per 1k transactions / per 1M frames / per minute” so results scale across workloads.
  • Percentiles beat averages: use p50/p95/p99 to expose tail latency and rare stalls hidden by mean values.
  • Tags are mandatory: metrics without endpoint + context tags cannot drive root-cause isolation.
  • Sampling point must be named: driver-level vs DMA-level vs application-level counters can disagree unless explicitly defined.

Metric dictionary template (recommended fields)

  • metric_id: stable identifier for dashboards, alerts, and regression comparisons.
  • type: error / performance / latency / stability.
  • numerator & denominator: explicit “what is counted” and “what it is normalized by”.
  • unit: %, count, microseconds, bytes per second.
  • window: 1s / 10s / 60s / 5min (choose by use case).
  • sampling_point: driver / ISR / DMA completion / bridge / application boundary.
  • tags: bus_type, instance, endpoint_id, op_type, fw_ver, board_rev, temp_bin, power_state, load_bin.
  • interpretation: what “high” or “spiky” typically suggests (in categories), and where to drill down.
  • pass_criteria: X/Y/Z thresholds for bring-up, production gate, and field alerts.
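To make the template concrete, here is one hedged example entry for a hypothetical `i2c.nak_rate` metric, expressed as a Python dict with a minimal validator; all field values (thresholds, windows) are placeholders, not recommendations:

```python
# Example metric dictionary entry as a plain dict, so it can live in
# version control and be validated in CI. X/Y/Z are placeholders.
i2c_nak_rate = {
    "metric_id": "i2c.nak_rate",
    "name": "I2C NAK rate",
    "type": "error",
    "numerator": "NAK_count",
    "denominator": "addressed_txn_count",
    "unit": "per 1k transactions",
    "window": "10s",
    "sampling_point": "driver",
    "tags": ["bus_type", "instance", "endpoint_id", "op_type", "fw_ver"],
    "interpretation": "High rate suggests device busy, addressing conflict, "
                      "or timeout policy mismatch; drill down by addr + op_type.",
    "scope_guard_link": "owner page: I2C addressing / timeout policy",
    "pass_criteria": "bring-up <= X, production gate <= Y, field alert > Z",
}

REQUIRED_FIELDS = {
    "metric_id", "name", "type", "numerator", "denominator", "unit",
    "window", "sampling_point", "tags", "interpretation",
    "scope_guard_link", "pass_criteria",
}

def validate_metric(entry: dict) -> list:
    """Return the sorted list of missing required fields (empty = valid)."""
    return sorted(REQUIRED_FIELDS - entry.keys())
```

A CI step that runs `validate_metric` over every dictionary entry keeps the single source of truth honest as metrics are added.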

Taxonomy (four metric families)

Error metrics

  • Retry rate (retries / transactions)
  • NAK rate (NAKs / address transactions)
  • CRC fail rate (CRC fails / frames)
  • UART framing/parity/overrun rate (errors / received frames)

Performance metrics

  • Throughput (payload bytes / time) with clear “payload vs line” definition
  • Utilization (busy time / window)
  • Inter-frame gap and queue depth percentiles (p50/p95)

Latency metrics

  • Transaction latency p50/p95/p99 (microseconds)
  • Queue wait vs service time separation (to isolate “scheduling” vs “bus/device”)
  • Max stall time and timeout rate (rare events that dominate user experience)

Stability metrics

  • Bus reset count / hour (how often recovery is needed)
  • Bus stuck duration (total + worst-case)
  • Recovery success rate and time-to-recover (TTR) p95/p99

Scope guard (keep this chapter clean)

This chapter defines how metrics are named, normalized, windowed, and tagged. It does not expand into protocol math, electrical details, or waveform tutorials. When a metric hints at a root cause, it should be expressed as a short category pointer and routed to the correct owner page.

[Figure] Metric family map (define → normalize → tag → compare): Error (retry rate, NAK rate, CRC fail rate, UART errors) · Performance (throughput, utilization, queue depth p95, IFG) · Latency (txn p95/p99, max stall, queue vs service, timeouts) · Stability (bus resets/hr, stuck time, recovery success, TTR p99). Legend: normalize by workload (per 1k / per 1M) · use percentiles (p95/p99) · tag by endpoint + context for drilldown.
Four families keep dashboards readable: error, performance, latency, and stability. Tags and normalization make results comparable.

Event Model & Logging Schema (Make stats durable)

Durable stats require durable events. A consistent event model makes logs comparable across I²C, SPI, and UART, across devices, and across time—so dashboards, alerts, and regression checks can share the same language.

Choose event granularity (3 durable levels)

Per-transaction event

Best for root-cause isolation and top-endpoint ranking. Use sampling or rate limits if volume is high.

Per-burst / per-batch event

Best for high-throughput transfers (DMA bursts, long frames). Preserve aggregate outcomes without logging every transaction.

Per-second (or per-window) rollup

Best for dashboards and field telemetry. Store counts, rates, percentiles, and worst-case stalls to enable trends and alerts.

Recommended structured log fields (grouped for consistency)

Identity

bus_type, bus_instance, direction, endpoint_id, op_type

Endpoint conventions: addr=0x50 (I²C), cs=2 (SPI), uart=3 (UART).

Timing

t_start, t_end, latency_us, timeout_ms, clock_domain_id

Store both t_start and t_end (not only derived latency) to support correlation and auditing.

Result

result, err_code, retries, crc_ok, sampled

Keep result (OK/RETRY/FAIL) separate from err_code (NAK/TIMEOUT/CRC/OVERRUN) for clean aggregation.

Context

fw_ver, board_id, power_state, temp_bin, firmware_state, load_bin

Prefer binned tags (e.g., temp_bin) to reduce volume while preserving drilldown power.

Rollup records (dashboards and field telemetry)

Minimal window fields: window_s, txn_count, err_count, p95_latency_us, throughput_Bps

Recommended additions for durability: err_rate, p50_latency_us, p99_latency_us, stall_max_us, endpoint_topN, recovery_attempts, recovery_success
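The rollup fields above can be produced by a small aggregator; the following Python sketch assumes per-transaction events with `latency_us`, `result`, and `bytes` fields, and uses a simple nearest-rank percentile (real systems may prefer histogram bins):

```python
# Hypothetical rollup builder: aggregate per-transaction events into one
# window record with the minimal + recommended fields listed above.
def build_rollup(window_s: float, events: list) -> dict:
    lat = sorted(e["latency_us"] for e in events)

    def pct(p):
        # Nearest-rank percentile over this window's latencies.
        return lat[min(len(lat) - 1, int(p / 100 * len(lat)))] if lat else 0

    err = sum(1 for e in events if e["result"] != "OK")
    payload = sum(e.get("bytes", 0) for e in events)
    return {
        "window_s": window_s,
        "txn_count": len(events),
        "err_count": err,
        "err_rate": err / len(events) if events else 0.0,
        "p50_latency_us": pct(50),
        "p95_latency_us": pct(95),
        "p99_latency_us": pct(99),
        "stall_max_us": lat[-1] if lat else 0,
        "throughput_Bps": payload / window_s,
    }
```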

Scope guard: this chapter defines event structure and normalization fields; protocol and electrical root causes should be routed to the corresponding owner pages.

[Figure] Event record card (schema pattern): one transaction event (sampled=true, clock_domain, payload=meta) grouped into Identity (bus_type, instance, endpoint, op_type), Timing (t_start, t_end, latency, timeout), Result (result, err_code, retries, crc_ok), and Context (fw_ver, board_id, temp_bin, power_state). Principle: stable IDs + explicit boundaries + durable tags → comparable stats.
A structured event schema keeps dashboards consistent across bus types and across firmware revisions.

Instrumentation Points & Timebases (Where to measure)

Accurate health metrics depend on consistent measurement points and a trustworthy timebase. This chapter defines where to tap signals and counters, how to record time, and how to keep overhead under control in production and field deployments.

Tap points (what each point can validate)

Firmware counters

  • Driver boundaries: request start/end, queue entry/exit.
  • ISR timestamps: interrupt arrival and servicing delay.
  • DMA completion: transfer completion and FIFO underrun/overrun hooks.
  • Retry loop decision points: retry counts and backoff behavior.

Bridge / expander stats

  • FIFO depth and drop counts (congestion signatures).
  • Internal counters for retries/CRC (if supported).
  • Queue-watermark events (sustained pressure vs bursts).

Analyzer correlation

  • Truthing boundaries: verify start/end definitions for latency and ordering.
  • Detect sampling gaps and dropped events during stress runs.
  • Cross-check top endpoints and error bursts with capture triggers.

Timebases (recording time that remains comparable)

Single MCU

Use a monotonic clock for t_start and t_end. Derive latency_us from the same clock domain.

Multi MCU / multi board

Define an explicit sync method and tag events with clock_domain_id. If full sync is unavailable, compare rollups within the same domain and use correlation windows for cross-domain analysis.

Requirement: store t_start, t_end, and clock_domain_id so disagreements can be audited instead of debated.
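On a host, the single-clock rule looks like this sketch using Python's monotonic clock; the wrapper name `timed_txn` and the `clock_domain_id` value are illustrative:

```python
import time

# Sketch of the single-MCU rule: take t_start and t_end from the SAME
# monotonic clock, derive latency from them, and store all three fields
# plus the clock domain so disagreements can be audited later.
def timed_txn(fn, clock_domain_id="host-mono-0"):
    t_start = time.monotonic_ns()
    result = fn()
    t_end = time.monotonic_ns()
    return {
        "t_start": t_start,
        "t_end": t_end,
        "latency_us": (t_end - t_start) / 1000.0,
        "clock_domain_id": clock_domain_id,  # tag for cross-domain analysis
        "result": result,
    }
```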

Overhead control (production-safe logging)

Sampling strategy

  • Always keep window rollups (low volume, high value).
  • Transaction events: sample by rate or trigger on anomalies (timeouts, CRC bursts).
  • Attach sampled to every event for honest interpretation.

Ring buffer + retention

  • Keep last N seconds of events for post-mortem correlation.
  • Prioritize error events; under buffer pressure, degrade DEBUG-level detail first.
  • Emit a compact “notable events” list on failures.

Compression + rate limits

  • Prefer histogram bins for latency (p95/p99 from bins) over raw traces.
  • Rate limit per endpoint to prevent “one bad device” flooding logs.
  • Batch uploads; apply back-pressure and drop policies deterministically.
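The histogram-bin idea above can be sketched as follows, assuming fixed microsecond bin edges; the percentile is an upper-edge estimate, which is the usual trade for bounded memory:

```python
import bisect

# Latency histogram with fixed bin edges (microseconds). Percentiles are
# reconstructed from bin counts instead of raw traces.
class LatencyHist:
    def __init__(self, edges_us):
        self.edges = list(edges_us)                 # upper edge of each bin
        self.counts = [0] * (len(self.edges) + 1)   # last bin = overflow

    def record(self, latency_us):
        self.counts[bisect.bisect_left(self.edges, latency_us)] += 1

    def percentile(self, p):
        """Upper-edge estimate of percentile p (0-100)."""
        total = sum(self.counts)
        if total == 0:
            return 0
        target = p / 100.0 * total
        seen = 0
        for i, c in enumerate(self.counts):
            seen += c
            if seen >= target:
                return self.edges[i] if i < len(self.edges) else float("inf")
        return float("inf")
```

Coarse edges keep memory flat while still exposing p95/p99 movement between windows.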

Scope guard: this chapter focuses on where to measure and how to keep measurements trustworthy, not on protocol waveforms or electrical troubleshooting.

[Figure] Tap points map (where measurements originate): MCU (driver, ISR, DMA completion, retry loop), bridge/expander (FIFO depth, counters, drops, watermarks), peripheral (status pin, IRQ/ready, busy hints), and analyzer (capture, trigger, timing correlation) all feed the metrics pipeline (Counters → Logs → Metrics → Dashboards → Alerts); timebase + tags make stats comparable.
Use firmware counters for boundaries, bridges for congestion signatures, and analyzers for correlation and truthing.

I²C Health Signals (Stats-only, not protocol tutorial)

I²C health becomes actionable when counters are normalized, bucketed by endpoint and operation type, and correlated with context tags. This chapter focuses on interpretation and drilldown paths, not waveform or pull-up analysis.

Key counters (with durable normalization and buckets)

NAK_count (bucket by address + op_type)

Normalize as NAK_rate = NAK_count / addressed_txn_count to keep comparisons stable across traffic levels.

Required tags: addr, op_type, bus_instance, fw_ver, temp_bin, power_state.
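A sketch of per-bucket NAK accounting keyed by `(addr, op_type)`, so a single busy device cannot hide behind a healthy global average; class and method names are illustrative:

```python
from collections import defaultdict

# Per-bucket NAK counters and normalized rate (per 1k addressed txns).
class NakStats:
    def __init__(self):
        self.naks = defaultdict(int)
        self.txns = defaultdict(int)

    def record(self, addr, op_type, nak: bool):
        key = (addr, op_type)
        self.txns[key] += 1
        if nak:
            self.naks[key] += 1

    def nak_rate_per_1k(self, addr, op_type):
        key = (addr, op_type)
        txns = self.txns[key]
        return 1000.0 * self.naks[key] / txns if txns else 0.0
```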

arbitration_lost_count

Track as a rate: arb_lost_rate = arbitration_lost / master_txn_attempts. Use endpoint and firmware state to locate concurrency patterns.

Deep dive: multi-master policies and bus-stall recovery belong to the owner pages; this page only defines the drilldown signals.

stretch_timeout_count

Treat as a policy-triggered event: stretch_timeout_rate = stretch_timeout / txn_count. Correlate with temp_bin and load_bin.

Deep dive: clock stretching behavior and timeout tuning are covered in the Clock Stretching owner page.

bus_stuck_detected_count + stuck duration

Monitor both frequency and severity: stuck_events/hr, stuck_total_ms/window, and stuck_max_ms (tail risk).

Bucket by power_state and board_id to separate sequencing-related incidents from endpoint-specific faults.

recovery_attempts + success rate (and TTR)

Track reliability of recovery hooks: success_rate = success / attempts. Add time_to_recover_ms p95/p99 to expose tail failures.

Deep dive: recovery mechanisms and reset sequences belong to the Recovery owner page.

Meaning mapping (symptom → metric → fast split → next action)

NAK spikes (looks intermittent)

Metric to check: NAK_rate (1s/10s windows). Prefer per-endpoint normalization.

Fast split: addr → op_type → power_state / fw_ver.

Likely buckets: device busy vs addressing conflict vs timeout policy mismatch. Deep dive: Page Write / Addressing / Timeout policy.

Stuck bus events (hard stalls)

Metric to check: stuck_max_ms + recovery_success_rate (tail + effectiveness).

Fast split: power_state → board_id → last-active addr.

Likely buckets: hung endpoint vs sequencing/ghost-power category. Deep dive: Recovery page.

Scope guard: all waveform, pull-up sizing, and timing deep dives should be routed to the owner pages (Clock Stretching / Pull-up Network / Recovery).

[Figure] I²C health dashboard widget (trend + top talkers + recovery): NAK-rate trend with spike markers, top talkers by addr (e.g., 0x50, 0x68, 0x3C, 0x2A), and recovery panels (attempts per window, success rate ≥ target, tail-focused TTR p95/p99). Buckets: addr, op_type, power_state, fw_ver, temp_bin.
Dashboards should pair endpoint drilldown with recovery effectiveness so intermittent spikes do not hide tail risks.

SPI Health Signals (Stats-only)

Many SPI failures originate from throughput pressure and timing boundaries rather than protocol semantics. Health stats should be bucketable by SCLK band, load, temperature, board revision, and firmware version to reveal repeatable patterns.

Key counters (organized by bottleneck layer)

crc_fail_count (if available)

Normalize as crc_fail_rate = crc_fail / frames. Treat CRC bursts as an integrity signature that should be stratified by context.

Required tags: sclk_band, temp_bin, load_bin, board_id, fw_ver.

underrun/overrun_count (DMA/FIFO)

Normalize by bursts or time: underrun_rate = underrun / bursts. Pair with queue metrics to distinguish congestion from integrity issues.

Companion metrics: queue_depth_p95, isr_latency_p95, throughput_Bps.

cs_glitch / sync_error_count (if detectable)

Track as per 1k transactions and bucket by cs_id and power_state to isolate boundary-state incidents.

This counter is a detector input; timing and electrical explanations should be routed to the owner pages.

retry_count (application-level)

Normalize as retry_rate = retry / transactions. Use retry changes to validate that “fixes” improve health without hiding errors via slowdowns.

Always view retry_rate alongside throughput and tail latency to avoid misleading wins.
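One way to encode that rule is a before/after check across rollup windows; the helper name and its 5% tolerance are illustrative, not prescriptive:

```python
# Honest-win check (hypothetical helper): a change counts as an improvement
# only if retry_rate drops WITHOUT throughput or p99 latency regressing
# beyond the tolerance -- i.e., errors were fixed, not hidden by slowdown.
def honest_improvement(before: dict, after: dict, tol: float = 0.05) -> bool:
    retry_better = after["retry_rate"] < before["retry_rate"]
    tput_ok = after["throughput_Bps"] >= before["throughput_Bps"] * (1 - tol)
    tail_ok = after["p99_latency_us"] <= before["p99_latency_us"] * (1 + tol)
    return retry_better and tput_ok and tail_ok
```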

Symptom-to-metric (bucket first, then attribute)

CRC fails appear only at certain speeds

Primary metric: crc_fail_rate (10s window).

First split: sclk_band → temp_bin → load_bin → board_id.

Confirm with: throughput dip + retry_rate burst. Deep dive: Long-Trace SI / SCLK quality (owner pages).

Data drops under load (looks like “random failures”)

Primary metric: underrun_rate (1s window).

First split: load_bin → fw_state → isr_latency_p95.

Confirm with: queue_depth_p95 and burst size. Deep dive: DMA & High Throughput (owner page).

Scope guard: this chapter maps symptoms to bucketable metrics; protocol-mode explanations and electrical details belong to the SPI owner pages.

[Figure] SPI bottleneck triangle (bucketable stats): SCLK (crc_fail_rate, sync_error), DMA/FIFO (underrun_rate, fifo_depth_p95), CPU scheduling (isr_latency_p95, queue_depth_p95). Tags: sclk_band, load_bin, temp_bin, board_id, fw_ver. Bucket first → attribute.
SPI health becomes repeatable when integrity, buffering, and scheduling signals are tracked together and bucketed by context tags.

UART Health Signals (Stats-only)

UART health statistics should separate integrity errors (noise/clock category) from buffering and flow-control pressure (queue/strategy category). The goal is durable counters, stable normalization, and bucketable tags that enable repeatable attribution.

Key counters (normalized, bucketed, and cross-validated)

Integrity errors (Noise / Clock category)

  • framing_error_count → normalize as framing_error_rate = framing_error / rx_frames
  • parity_error_count → normalize as parity_error_rate = parity_error / rx_frames

Required tags: uart_instance, peer_id, baud_setting_id, clock_source_id, temp_bin, power_state, board_id, fw_ver.

Buffer / Flow-control pressure (Queue / Strategy category)

  • overrun_count → overrun_rate = overrun / rx_frames
  • rx_drop_count (buffer overflow) → rx_drop_rate = rx_drop / rx_frames
  • flow_control_asserted_time → flow_control_ratio = asserted_time / window_s

Companion metrics (cross-validation): rx_queue_depth_p95/p99, queue_wait_p95/p99, isr_latency_p95, throughput_payload_Bps.

Power and wake context (phase attribution)

  • break_detect_count → correlate errors by before_wake / after_wake phase tags
  • idle_wake_count → validate whether error bursts align with state transitions

These counters are primarily bucket keys for time alignment; protocol explanations should be routed to the owner pages.

Latency lens (RX → processing, split into actionable components)

Timestamp edges (monotonic clock)

  • t_rx_irq: RX interrupt/callback arrival (or DMA completion)
  • t_enqueue: enqueue into RX ring/buffer
  • t_dequeue: application stack dequeue
  • t_done (optional): processing completion

Derived latency components (report percentiles)

  • service_time = t_enqueue − t_rx_irq (driver/ISR pressure)
  • queue_wait = t_dequeue − t_enqueue (congestion)
  • processing_time = t_done − t_dequeue (upper stack)
  • rx_to_process = t_dequeue − t_rx_irq (end-to-end)

Use p95/p99 and max to expose tail risks; avoid relying on mean latency.
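The component math follows directly from the edges; this sketch assumes all timestamps come from one monotonic clock, in microseconds:

```python
# Derive the component latencies from the four timestamp edges defined
# above. t_done is optional, matching the schema.
def latency_components(t_rx_irq, t_enqueue, t_dequeue, t_done=None):
    comps = {
        "service_time_us": t_enqueue - t_rx_irq,     # driver/ISR pressure
        "queue_wait_us": t_dequeue - t_enqueue,      # congestion
        "rx_to_process_us": t_dequeue - t_rx_irq,    # end-to-end
    }
    if t_done is not None:
        comps["processing_time_us"] = t_done - t_dequeue  # upper stack
    return comps
```

Feeding each component into its own percentile tracker (rather than only the end-to-end value) is what lets tail latency be attributed instead of argued about.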

Scope guard: baud-rate math, sampling theory, and physical-layer explanations belong to UART owner pages; this chapter defines counters, normalization, and attribution paths.

[Figure] UART error fork: Noise/Clock branch (framing_error_rate, parity_error_rate, break_detect_rate; tags: baud_setting, clock_source, temp_bin) vs Buffer/Flow-control branch (overrun_rate, rx_drop_rate, flow_control_ratio; tags: load_bin, fw_state, isr_latency). Shared tags: uart_instance, peer_id, fw_ver, power_state, board_id.
A forked view prevents confusion between integrity errors and backpressure-driven loss; counters remain comparable through stable normalization and bucket tags.

Throughput & Latency Measurement (How to compute correctly)

Correct health metrics require consistent transaction boundaries, tail-aware latency reporting, and explicit separation of queueing from transfer and processing. Throughput should report both application payload and line-usage estimates to avoid misleading “fast” readings.

Define transaction boundaries (per bus type, stable across tooling)

I²C boundary

One driver-submitted message/combined transaction as a single txn. Use t_start/t_end from the event schema to keep latency comparable across firmware versions.

SPI boundary

One CS-active transfer segment or one DMA burst as a txn (choose one and keep it consistent). Record bytes and burst_id so rollups do not mix different batch definitions.

UART boundary

One application packet or fixed-size chunk as a txn. This prevents per-byte noise from dominating latency and makes drops and backpressure comparable across workloads.

Avoid misleading averages (tail-aware reporting)

Latency

  • Report p50 / p95 / p99 plus max (tail risk).
  • Track timeout_rate separately from latency to avoid silent failure masking.
  • Use multi-window views (1s/10s/60s) to capture bursts and trends.

Throughput

  • Report median and p05 (low-tail reveals congestion).
  • Always pair throughput with error and tail-latency trends to avoid “slow but stable” misreads.

Compute with components (queue vs transfer vs service vs processing)

Latency components

  • queue_wait: time waiting for CPU/locks/queue capacity
  • on_wire: transfer segment duration (event boundary-defined)
  • device_service: peripheral response/ready time (as observed)
  • firmware_processing: local processing time after receipt

Percentiles should be computed per component so tail latency is not misattributed.

Throughput definitions

  • throughput_payload_Bps = payload_bytes / window_s
  • throughput_line_Bps = (payload_bytes + overhead_bytes_equiv) / window_s
  • overhead_factor = throughput_line_Bps / throughput_payload_Bps

Overhead is reported as a factor, not derived here; bus-specific overhead details should be routed to the owner pages.
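The three definitions translate directly to code; how `overhead_bytes_equiv` is estimated per bus is owner-page material, so it is just an input here:

```python
# Payload vs line throughput for one window, per the definitions above.
# overhead_bytes_equiv = protocol overhead expressed in byte-times.
def throughput_stats(payload_bytes, overhead_bytes_equiv, window_s):
    payload_bps = payload_bytes / window_s
    line_bps = (payload_bytes + overhead_bytes_equiv) / window_s
    return {
        "throughput_payload_Bps": payload_bps,
        "throughput_line_Bps": line_bps,
        "overhead_factor": (line_bps / payload_bps
                            if payload_bps else float("inf")),
    }
```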

[Figure] Latency decomposition (component percentiles, tail-aware): Wait → Transfer → Device → Process, each reported at p95/p99, with timeout_rate tracked separately as tail risk. Measurement taps: t_start, t_enqueue, t_dequeue, t_end. Use percentiles per component; avoid mean-only reporting.
A component view prevents tail latency from being misattributed; throughput should report both payload and line-usage estimates.

Bus Health Dashboard & Alerts (Turn stats into action)

A production-grade dashboard is a closed loop: consistent KPIs, deterministic drilldowns, rule-based alerts, and a notable-events stream that captures first occurrences, regressions, and context-correlated spikes.

Dashboard layout (fixed information architecture)

Overview (4 KPIs)

  • Health score (optional) for ranking and triage only.
  • Error rate (normalized, e.g., per 1k or per 1M).
  • p95 latency (tail-aware primary KPI; p99 as secondary).
  • Throughput (payload_Bps; line_Bps optional).

Keep error, latency, and throughput on the same screen to prevent “stable by slowing down” misreads.

Drilldown (endpoint → context)

  • By endpoint: addr / cs / port (top talkers ranked by error and tail latency).
  • By operation: op_type (read/write/control) to avoid mixing semantics.
  • By context: temp_bin, fw_ver, load_bin, power_state, board_id.

The drilldown order should remain fixed to preserve cross-team comparability.

Notable events (structured, not raw logs)

  • First occurrence: first threshold crossing for an endpoint or context bucket.
  • Regression: baseline shift after fw_ver change.
  • Correlation spike: error/latency spikes aligned with temp_bin or load_bin.
  • Recovery risk: repeated recovery attempts, reduced success, or long stuck_max_ms.

Alert rules (cookbook structure)

Rule template (mandatory fields)

  • rule_id, metric, scope (global / per-endpoint / per-context)
  • window (e.g., 1s/10s/60s/5min) and normalization unit
  • condition (threshold / slope / baseline deviation)
  • severity (warn/crit), suppression (debounce + cooldown)
  • action (snapshot/dump, degrade, reset, protect-mode)
  • pass_criteria reference (production gating hook)

Rule types (3 categories)

  • Absolute thresholds: CRC/timeout/stuck_max_ms > X per window (with units).
  • Trend / rate-of-change: fast worsening before thresholds cross.
  • Anomaly / baseline deviation: endpoint-specific normals differ; alert on deviation.

Add debounce (N consecutive windows) and cooldown (X minutes) to prevent alert storms.
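A minimal debounce-plus-cooldown evaluator for an absolute-threshold rule; the counts are placeholders for the N/X values above:

```python
# Debounce + cooldown around a threshold condition: fire only after
# debounce_n consecutive breaching windows, then stay silent for
# cooldown_windows windows to prevent alert storms.
class DebouncedAlert:
    def __init__(self, threshold, debounce_n=3, cooldown_windows=10):
        self.threshold = threshold
        self.debounce_n = debounce_n
        self.cooldown_windows = cooldown_windows
        self.streak = 0
        self.cooldown = 0

    def update(self, value) -> bool:
        """Feed one window's metric value; return True when the alert fires."""
        if self.cooldown > 0:
            self.cooldown -= 1
            return False
        self.streak = self.streak + 1 if value > self.threshold else 0
        if self.streak >= self.debounce_n:
            self.streak = 0
            self.cooldown = self.cooldown_windows
            return True
        return False
```

Trend and baseline-deviation rules can reuse the same wrapper by swapping the breach condition.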

Burn-in / production gating (quality gates)

  • Define a test phase (bring-up/burn-in/production) and duration (minutes/cycles).
  • Specify gate metrics: error_rate, timeout_rate, recovery_success, latency_p99.
  • Define outcomes: pass / retest / quarantine, plus notable-event capture on failures.

Scope guard: this chapter defines dashboard structure and alert logic; root-cause mechanisms belong to bus-specific owner pages.

[Figure] Dashboard wireframe (KPIs → Drilldown → Notable events): KPI row (health score, error rate, p95 latency, throughput), top talkers by endpoint (addr/cs/port, err · p95), context chips (temp_bin, fw_ver, load_bin, power, board_id), and notable events (first occurrence, regression, temp spike, recovery risk).
A fixed layout ensures comparability: KPIs remain stable, drilldowns follow a deterministic order, and notable events support postmortems and regression detection.

Field Logging Strategy (Storage, privacy, and reliability)

Field logging must be survivable and bounded: tiered levels with sampling, ring buffers with crash snapshots, rollup-first aggregation, transport back-pressure, and privacy-by-default metadata-only records.

Logging levels (with sampling policy)

DEBUG

Bring-up only. Default off. Enable per endpoint and time-limited. Use probabilistic sampling to prevent log storms.

INFO

Rollup-first telemetry: counts per window, latency histograms, throughput summaries. Stable across firmware releases.

WARN

Threshold crossings and anomalies with minimal context. Triggered sampling can temporarily increase detail for top endpoints.

ERROR

Failures and recovery breakdowns. Always captured. Used to freeze buffers and produce crash-adjacent snapshots.

Sampling modes (bounded yet diagnostic)

  • Probabilistic: keep a small percentage of transaction events under high traffic.
  • Triggered: after an alert, increase sampling for the affected endpoint(s) for a limited duration.
  • Top-K focus: only escalate detail for top talkers to prevent system-wide amplification.

Ring buffer + crash dump (last N seconds, metadata-only)

Ring buffer goals

  • Capacity is defined by time horizon (keep last N seconds), not by event count.
  • Metadata-only event schema; payload is scrubbed by default.
  • Freeze-on-trigger prevents overwrite of pre-failure evidence.

Triggers (examples)

  • panic / watchdog / fatal fault
  • repeated recovery failures
  • stuck episodes exceeding a time threshold
  • consecutive alert windows (debounced)
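A time-horizon ring with freeze-on-trigger can be sketched as follows; eviction by timestamp (not event count) implements the "last N seconds" rule:

```python
from collections import deque

# Time-horizon ring buffer with freeze-on-trigger: events older than
# horizon_s are evicted on insert; freeze() stops insertion and eviction
# so pre-failure evidence survives until the snapshot is exported.
class EventRing:
    def __init__(self, horizon_s: float):
        self.horizon_s = horizon_s
        self.buf = deque()        # (t_s, event) pairs, oldest first
        self.frozen = False

    def push(self, t_s: float, event: dict):
        if self.frozen:
            return                # preserve pre-trigger evidence
        self.buf.append((t_s, event))
        while self.buf and self.buf[0][0] < t_s - self.horizon_s:
            self.buf.popleft()

    def freeze(self):
        self.frozen = True

    def snapshot(self) -> list:
        return [e for _, e in self.buf]
```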

Aggregation, transport, and privacy baseline

Compression & aggregation

  • Counts per window: txn_count, err_count, timeout_rate, recovery_success.
  • Latency histograms: bins that can reconstruct p50/p95/p99 approximately.
  • Top talkers sketch: keep only top-K endpoints per window.

Telemetry transport

  • Batch upload and retry with back-pressure awareness.
  • Drop policy (keep priority, highest first): ERROR → WARN → rollup → DEBUG/txn events.
  • Store-and-forward with explicit retention caps (time and size).

Privacy & security baseline

  • Scrub payload by default; keep metadata only.
  • Optional endpoint hashing for shared telemetry environments.
  • Whitelist/blacklist fields, and protect uploads with authenticated transport.

Scope guard: this chapter covers bounded telemetry engineering (levels, buffers, aggregation, transport, privacy). Protocol parsing and payload analysis belong elsewhere.

[Figure] Telemetry pipeline (bounded logs, reliable evidence): device events → ring buffer (last N sec, freeze) → aggregator (counts, histogram bins, top-K) → uplink (batch, back-pressure, drop policy) → server dashboard, with crash-dump snapshots on triggers. Privacy baseline: metadata-only, scrub payload.
Field telemetry should prefer rollups and histograms, keep last-seconds evidence via ring buffers, and enforce back-pressure and drop priorities under constrained storage and networks.

Engineering Checklist (Bring-up → Stress → Production)

This checklist turns bus statistics into acceptance: measurable baselines, repeatable stress evidence, and production gates with correlation across stations and tools.

Bring-up checklist (make observability real)

  • Enable counters: per-endpoint buckets (addr/cs/port) plus rollups (window_s).
  • Validate timebase: monotonic timestamps; defined mapping if multiple MCUs exist.
  • Normalize units: error_rate per 1k/1M, latency_us p50/p95/p99, throughput_Bps.
  • Baseline run: known-good setup; record temp_bin, load_bin, power_state, fw_ver.
  • Truthing check: compare firmware counters vs analyzer/bridge stats for direction + magnitude.
  • Rollup integrity: window records exist (txn_count, err_count, p95, throughput, timeout_rate).
  • Notable events: first occurrence can be emitted and stored locally.
  • Alert suppression: debounce (N windows) + cooldown (X minutes) + Top-K focus.

Stress checklist (prove stability under disturbance)

Temperature sweep

  • Track error_rate vs temp_bin (and per endpoint).
  • Verify tail latency (p99) does not drift across bins.
  • Capture correlation events (temp spikes → error bursts).

Load sweep

  • Correlate queue_depth_p95/p99 and isr_latency_p95 with overruns/drops.
  • Confirm throughput stays stable while error_rate stays bounded.
  • Identify Top-K endpoints that dominate tail behavior.

Long-run soak

  • Watch p99 latency creep (memory pressure / queue accumulation signals).
  • Ensure rollups remain consistent (no schema drift).
  • Keep notable events for first seen/regression markers.

Hot-plug / brown-out

  • Measure stuck_detected and max_stall_time_ms.
  • Measure recovery_attempts and success_rate + time_to_recover p95.
  • Freeze ring buffer on failure and export crash-adjacent snapshots.

Production checklist (gates + correlation)

  • Gating thresholds: fixed units + windows; pass/fail actions (retest/quarantine) defined.
  • Golden unit: compare each station to a known-good reference under the same script.
  • Station-to-station correlation: ensure KPI deltas stay within an allowed band (error_rate, p95/p99, throughput).
  • Tool correlation: analyzer/bridge counters agree with firmware counters (direction + order-of-magnitude).
  • Regression hooks: baseline deviation checks tied to fw_ver updates.

Pass criteria template (fill-in, reproducible)

  • error_rate ≤ X per 1k (window=Y s, per-endpoint + global)
  • p95_latency_us ≤ X (and p99 guardrail ≤ Y)
  • recovery_success_rate ≥ X% (sample count ≥ N)
  • max_stall_time_ms ≤ X (and time_to_recover p95 ≤ Y)
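The template maps to a small gate evaluator; limit values correspond to the X/Y/N placeholders above and are supplied per project:

```python
# Gate evaluation against the fill-in template. Returns pass/fail plus
# the list of failed criteria, to drive the retest/quarantine decision.
def evaluate_gate(kpis: dict, limits: dict):
    checks = {
        "error_rate": kpis["error_rate_per_1k"] <= limits["error_rate_per_1k"],
        "p95_latency": kpis["p95_latency_us"] <= limits["p95_latency_us"],
        "p99_latency": kpis["p99_latency_us"] <= limits["p99_latency_us"],
        "recovery": (kpis["recovery_success_rate"]
                     >= limits["recovery_success_rate"]),
        "max_stall": kpis["max_stall_time_ms"] <= limits["max_stall_time_ms"],
    }
    failed = sorted(name for name, ok in checks.items() if not ok)
    return (not failed, failed)
```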

Evidence artifacts (required)

  • KPI snapshot (same window, same normalization).
  • Top talkers list (endpoint ranking).
  • Notable events (first seen / regression / correlation).
  • Ring buffer snapshot on failures (metadata-only).
[Figure] Checklist ladder (Bring-up → Stress → Production): Bring-up (counters, timebase, baseline, truthing) → Stress (temp, load, soak, hot-plug; tail + correlation) → Production (gates, golden unit, station and tool correlation, regression hooks), with pass criteria on error, p95, recovery, and stall.
The ladder enforces maturity: instrumentation first, then stress evidence, then production gates with station correlation and regression defenses.

Applications & IC Selection Logic (Evidence-first, no shopping list)

Applications show where bus statistics save time. Selection logic specifies capability requirements (counters, FIFOs, timestamp hooks, deterministic latency support) and provides concrete material examples to validate feasibility.

Applications (how the stats close the loop)

Manufacturing test

  • Acceptance thresholds: error_rate, timeout_rate, recovery_success, max_stall_time.
  • Quick triage: Top talkers by endpoint and notable events on first occurrence.
  • Evidence output: one-page KPI snapshot with identical windows and normalization.

Field reliability

  • Early warning: trend + baseline deviation alerts (debounced + cooldown).
  • Remote debugging: ring buffer snapshots + context tags (fw_ver/temp/load/power).
  • Regression defense: fw_ver change detection tied to notable events.

System performance

  • Tuning by evidence: throughput_payload_Bps vs p95/p99 latency and error_rate together.
  • Component lens: queue_wait vs on-wire vs device_service vs processing.
  • Before/after compare: same workload script and endpoint bucket order.

IC selection logic (criteria + capability checklist)

Must-have capabilities

  • Deep FIFOs / readable queue depth (prevents silent drops).
  • Hardware/driver error counters with per-endpoint bucketing.
  • Timestamp hooks at tap points (IRQ/DMA completion/enqueue/dequeue).
  • Deterministic retry/timeout controls (consistent windows + gates).
  • Bounded telemetry support (rollups + histogram bins + Top-K).
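The "bounded telemetry" requirement above (rollups + histogram bins + Top-K) can be illustrated with a constant-memory collector. This is a sketch under assumptions: the bin edges, the K value, and the endpoint keys are illustrative, not prescribed by any specific IC.

```python
# Sketch of bounded telemetry: fixed histogram bins plus a Top-K endpoint
# ranking, so memory stays constant regardless of traffic volume.
from collections import Counter
import bisect

LAT_BIN_EDGES_US = [100, 250, 500, 1000, 5000]   # 6 bins incl. overflow

class BoundedTelemetry:
    def __init__(self, k: int = 3):
        self.k = k
        self.lat_bins = [0] * (len(LAT_BIN_EDGES_US) + 1)
        self.err_by_endpoint = Counter()

    def record(self, endpoint: str, latency_us: float, ok: bool):
        # bisect_right picks the bin whose upper edge exceeds the sample.
        self.lat_bins[bisect.bisect_right(LAT_BIN_EDGES_US, latency_us)] += 1
        if not ok:
            self.err_by_endpoint[endpoint] += 1

    def top_talkers(self):
        """Endpoint error ranking, bounded to the K worst offenders."""
        return self.err_by_endpoint.most_common(self.k)

t = BoundedTelemetry()
for ep, lat, ok in [("0x48", 120, True), ("0x48", 2600, False),
                    ("0x50", 90, False), ("0x48", 300, False)]:
    t.record(ep, lat, ok)
```

Shipping the bins and the Top-K list instead of raw events is what keeps telemetry affordable on constrained links.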

Nice-to-have capabilities

  • Hardware timestamping (lower overhead, better correlation).
  • Deterministic latency hooks for station-to-station correlation.
  • Configurable sampling modes and freeze-on-trigger buffers.
  • Isolation delay budgeting awareness (logged as metadata).

Concrete part-number examples (for feasibility checks; verify package/suffix/availability)

USB ↔ UART bridges

  • FTDI FT232R, FT231XQ, FT2232HL
  • Silicon Labs CP2102N-A02-GQFN24
  • WCH CH340C

USB ↔ I²C / SPI bridges

  • Microchip MCP2221A-I/SL (USB↔I²C/UART)
  • Microchip MCP2210-I/SL (USB↔SPI)
  • FTDI FT4222H (USB↔I²C/SPI)

I²C mux / buffer / extender (bucketing by segment)

  • TI TCA9548A (I²C mux), TCA9517 (buffer)
  • NXP PCA9548A (I²C mux), PCA9517A (buffer)
  • NXP P82B96 (long-reach I²C buffer concept)

I²C isolators / SPI digital isolators (delay-aware)

  • Analog Devices ADuM1250ARZ (I²C), ADuM3151BRZ (SPI)
  • Texas Instruments ISO1540DWR / ISO1541DWR (I²C)
  • Texas Instruments ISO7741FQDWRQ1 (SPI-style multi-channel isolation example)

isoSPI / long-chain comms (stats hooks by node)

  • Analog Devices (Linear Tech) LTC6820 (isoSPI)
  • Analog Devices (Linear Tech) LTC6821 (isoSPI)

Local storage for ring buffers (telemetry survivability)

  • Winbond W25Q64JVSSIQ (SPI NOR)
  • Infineon/Cypress FM25V02-G (SPI FRAM example)
  • Microchip 23LC1024-I/SN (SPI SRAM example)

Low-cap ESD protection (ports + analyzer headers)

  • Nexperia PESD5V0S1UL (low-cap TVS example)
  • Texas Instruments TPD1E10B06 (single-line ESD)
  • Semtech RClamp0524P (multi-line ESD array example)

These part numbers are examples for validating counter, FIFO, and timestamp-hook support along with survivable logging paths. Final selection should match voltage, package, temperature grade, and availability constraints.

Selection flow (metrics → taps → hardware → firmware cost → decision)

  1. Required metrics: error_rate, p95/p99, throughput, recovery success, max stall.
  2. Tap points: driver IRQ/DMA completion, enqueue/dequeue, bridge/analyzer counters.
  3. Hardware support: FIFO depth, counter readout, timestamp hooks, isolation delay awareness.
  4. Firmware effort: CPU overhead, storage budget, bandwidth/telemetry constraints.
  5. Decision: pass criteria, alert cookbook, evidence artifacts, production gates.
Selection flow diagram (requirements → evidence-ready implementation): metrics (error, p95/p99) → tap points (IRQ/DMA, enqueue) → hardware (FIFO, counters) → firmware cost (CPU, storage) → decision (gates, alerts). Definition of done (evidence package): dashboard, alerts, rollups, ring snapshot. All artifacts share the same windows, units, and endpoint buckets.
Selection should be driven by measurable requirements: capture points and hardware capabilities must support rollups, tail latency, and survivable evidence under faults.


FAQs (4-line answers + measurable pass criteria)

Each FAQ maps a symptom to a first accounting check, a fix that standardizes observability, and a pass criterion with units + window + sample context.

Triage flow diagram (stats-first, not a protocol tutorial): symptom (one line) → quick check (split / correlate) → fix (standardize) → pass criteria (units + window).
The goal is durable evidence: consistent denominators, windows, endpoint buckets, and reproducible acceptance gates.
NAK rate jumps but the scope looks “OK”—what is the first accounting check?

Likely cause: window/denominator mismatch, or mixed endpoints collapsing into one number.

Quick check: split by endpoint (addr/cs/port) + read/write + fixed window_s (1s/10s/60s).

Fix: standardize metric definition (per 1k frames/txns) and require per-endpoint rollups as the primary view.

Pass criteria: NAK_rate ≤ X/1k over Y minutes, with Top-K endpoints stable (no single endpoint contributes > Z%).
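The first accounting check can be expressed as a small rollup. As a sketch, with illustrative event tuples and a fixed 10 s window: split by (endpoint, direction), keep one shared window as the denominator, and normalize per 1k transactions.

```python
# Sketch: per-endpoint NAK rate with a fixed window and a fixed denominator.
# Event shape (endpoint, direction, t_s, nak) is an illustrative assumption.
from collections import defaultdict

def nak_rate_per_1k(events, window_s=(0.0, 10.0)):
    txns = defaultdict(int)
    naks = defaultdict(int)
    for endpoint, direction, t_s, nak in events:
        if not (window_s[0] <= t_s < window_s[1]):
            continue                      # fixed window: same denominator for all
        key = (endpoint, direction)
        txns[key] += 1
        naks[key] += int(nak)
    return {k: 1000.0 * naks[k] / txns[k] for k in txns}

events = [("0x48", "W", 1.0, False), ("0x48", "W", 2.0, True),
          ("0x48", "R", 3.0, False), ("0x50", "W", 4.0, False)]
rates = nak_rate_per_1k(events)
```

A global number would report 250/1k here; the split shows all of it comes from writes to 0x48, which is the triage signal the global view hides.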

CRC fails only at certain throughput—SI issue or buffer underrun?

Likely cause: FIFO underrun / scheduling jitter masquerading as link errors under load.

Quick check: correlate CRC_fail_count with underrun/overrun counters + CPU/load_bin + queue_depth_p99 at the same window.

Fix: increase FIFO/DMA burst, raise service priority, reduce IRQ latency; keep the same throughput target during verification.

Pass criteria: CRC_fail < X/1M over Y minutes at load Z, and underrun_count = 0 for the same run.

Latency average is fine, but users see “random stalls”—why?

Likely cause: tail latency spikes hidden by averages; rare max stalls dominate perceived quality.

Quick check: plot p95/p99 + max_stall_time_ms; split latency into queue_wait vs on-wire vs device_service vs processing.

Fix: add percentile alerts + decomposition rollups; gate releases on p99 and stall ceilings, not averages.

Pass criteria: p99_latency ≤ X ms and max_stall ≤ Y ms over N transactions (timeout_rate ≤ Z/1k).
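Why averages hide stalls is easy to demonstrate numerically. The sketch below uses the nearest-rank percentile convention and made-up latencies (49 fast transactions plus one 80 ms stall); both are illustrative assumptions.

```python
# Sketch: the mean looks healthy while p99 and max expose the stall.
import math

def percentile(samples, p):
    """Nearest-rank percentile (ceil convention)."""
    s = sorted(samples)
    idx = max(0, math.ceil(p / 100.0 * len(s)) - 1)
    return s[idx]

# 49 quick transactions plus one 80 ms stall.
latencies_ms = [2.0] * 49 + [80.0]
mean = sum(latencies_ms) / len(latencies_ms)    # 3.56 ms: looks "fine"
p95 = percentile(latencies_ms, 95)              # still 2.0 ms
p99 = percentile(latencies_ms, 99)              # 80.0 ms: the stall shows up
max_stall = max(latencies_ms)
```

This is why the fix above gates on p99 and a stall ceiling rather than on the mean.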

Same firmware, different boards show different error rates—what to log first?

Likely cause: missing context (temp/power/load/board_rev) causing apples-to-oranges comparisons.

Quick check: add context tags and compare per-endpoint Top talkers; stratify by temp_bin + power_state + board_rev.

Fix: require context tags in every rollup; define baseline per board_rev and compare deltas within the same bin.

Pass criteria: after stratification, error_rate variance ≤ X% across boards (same bins, same window_s, same workload).

I²C “stuck” events recover, but keep recurring—what is the fastest triage?

Likely cause: repeated hung endpoint or sequencing/ghost-power pattern that keeps re-triggering.

Quick check: rank stuck events by last_active_endpoint + time_since_power_transition; compute stuck_duration histogram.

Fix: add per-endpoint isolation/reset hooks; enforce recovery escalation ladder (soft reset → segment isolate → power cycle).

Pass criteria: stuck_frequency < X/day and recovery_success ≥ Y% with max_stuck_duration ≤ Z ms.
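The recovery escalation ladder (soft reset → segment isolate → power cycle) can be sketched as an ordered attempt loop that logs each rung, so recovery_success and time_to_recover roll up per step. The step functions here are stand-ins, not real drivers.

```python
# Sketch: escalation ladder with per-step attempt logging.
# The action callables are simulated stand-ins for real bus-recovery hooks.

ESCALATION = ["soft_reset", "segment_isolate", "power_cycle"]

def recover(actions, log):
    """Try each rung in order; log attempts; return the rung that worked."""
    for step in ESCALATION:
        log.append(("attempt", step))
        if actions[step]():              # True = bus healthy again
            log.append(("recovered", step))
            return step
    log.append(("failed", None))
    return None

# Simulated bus: soft reset fails, isolating the hung segment works.
actions = {"soft_reset": lambda: False,
           "segment_isolate": lambda: True,
           "power_cycle": lambda: True}
log = []
rung = recover(actions, log)
```

Ranking incidents by which rung finally succeeded is itself a useful triage signal: endpoints that always need a power cycle point at ghost-power or sequencing faults rather than transient glitches.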

UART framing errors spike after enabling low-power—noise or wake timing?

Likely cause: wake/clock settling window too short, causing early frames to be sampled incorrectly.

Quick check: correlate errors with wake events; bucket the first N frames after wake vs steady-state frames.

Fix: add guard time, re-sync, discard first frames, and tighten flow-control policy during wake transitions.

Pass criteria: framing/parity errors = 0 in first N frames across M wake cycles (same workload, same bins).
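The "first N frames after wake vs steady-state" bucketing can be sketched as a counter that resets at each wake event. The frame/wake tuples and N=3 are illustrative assumptions.

```python
# Sketch: bucket post-wake frames separately so wake-timing errors are
# not diluted by steady-state traffic. Inputs are illustrative.

def bucket_by_wake(frames, wake_times, n_first=3):
    """frames: (t, error) tuples. Returns {bucket: [frame_count, errors]}."""
    buckets = {"post_wake": [0, 0], "steady": [0, 0]}
    wakes = sorted(wake_times)
    frames_since_wake = None          # None until the first wake event
    wi = 0
    for t, error in sorted(frames):
        while wi < len(wakes) and wakes[wi] <= t:
            frames_since_wake = 0     # reset the counter at each wake
            wi += 1
        if frames_since_wake is not None and frames_since_wake < n_first:
            bucket = "post_wake"
            frames_since_wake += 1
        else:
            bucket = "steady"
        buckets[bucket][0] += 1
        buckets[bucket][1] += int(error)
    return buckets

# One wake at t=10; the two errors land in the first frames after it.
frames = [(5.0, False), (11.0, True), (12.0, True), (13.0, False),
          (14.0, False), (15.0, False)]
buckets = bucket_by_wake(frames, wake_times=[10.0])
```

If the post-wake bucket carries the errors while steady-state is clean, the fix is guard time and re-sync rather than chasing line noise.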

Retry count drops after a “fix”, but throughput also drops—did the system just slow down?

Likely cause: reliability improved only because timing/rate became conservative, masking the real capacity.

Quick check: compare error_rate + throughput_payload_Bps + p95 latency together, under the same workload and window.

Fix: tune guardrails with a 3-metric gate (reliability + throughput + latency); validate with Top-K endpoints, not only global averages.

Pass criteria: error_rate ≤ X while throughput ≥ Y and p95 ≤ Z (same bins, same load, same window_s).

Metrics disagree between firmware counters and a logic analyzer—who is right?

Likely cause: different transaction boundaries or tap points (IRQ/DMA completion vs CS-active vs analyzer triggers), plus sampling/drop effects.

Quick check: align endpoint + fixed window_s + the same boundary definition; do a short manual ledger (txn_count, bytes, fails) for one capture.

Fix: encode boundary definitions in the schema (t_start/t_end meaning, txn_id, tool_id, timebase_id); treat rollups as primary and events as sampled truthing.

Pass criteria: within the same script/window, txn_count/bytes differ ≤ X%, and failed endpoints match ≥ Y% of the time.
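The short manual ledger can be automated as a tolerance comparison between the firmware and analyzer views of the same capture. In this sketch the 2% tolerance, field names, and sample numbers are illustrative assumptions.

```python
# Sketch: ledger agreement check between firmware counters and an
# analyzer capture taken over the same script and window.

def ledgers_agree(fw, analyzer, rel_tol=0.02):
    """True if txn_count/bytes match within rel_tol and failed endpoints match."""
    def close(a, b):
        return abs(a - b) <= rel_tol * max(a, b, 1)
    counts_ok = close(fw["txn_count"], analyzer["txn_count"])
    bytes_ok = close(fw["bytes"], analyzer["bytes"])
    endpoints_ok = set(fw["failed_endpoints"]) == set(analyzer["failed_endpoints"])
    return counts_ok and bytes_ok and endpoints_ok

fw = {"txn_count": 10_000, "bytes": 640_000, "failed_endpoints": ["0x48"]}
an = {"txn_count": 10_050, "bytes": 643_100, "failed_endpoints": ["0x48"]}
agree = ledgers_agree(fw, an)
```

When the check fails, the first suspects are the boundary definitions (t_start/t_end meaning, what counts as one transaction), not the hardware.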

Alerts keep firing but the issue cannot be reproduced—too sensitive thresholds or wrong windows?

Likely cause: alert conditions ignore burstiness and endpoint mix; missing debounce/cooldown creates alert storms.

Quick check: compare 1s vs 10s vs 60s windows; check Top-K endpoints; measure alert trigger rate-of-change vs absolute thresholds.

Fix: add debounce (N windows) + cooldown (T minutes) + endpoint scoping; use anomaly/baseline deviation per endpoint when distributions differ.

Pass criteria: false-alert rate ≤ X/day while detection sensitivity remains (confirmed incidents still trigger within Y windows).
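The debounce + cooldown fix can be sketched as a tiny state machine: fire only after N consecutive over-threshold windows, then suppress repeats for a cooldown period. All parameter values below are illustrative.

```python
# Sketch: debounced alert with cooldown, fed one window's metric at a time.
# threshold / debounce_n / cooldown_windows are illustrative placeholders.

class DebouncedAlert:
    def __init__(self, threshold, debounce_n=3, cooldown_windows=10):
        self.threshold = threshold
        self.debounce_n = debounce_n
        self.cooldown = cooldown_windows
        self.over = 0                # consecutive windows over threshold
        self.cooldown_left = 0

    def update(self, value) -> bool:
        """Feed one window's metric; return True when an alert should fire."""
        if self.cooldown_left > 0:
            self.cooldown_left -= 1
            return False
        self.over = self.over + 1 if value > self.threshold else 0
        if self.over >= self.debounce_n:
            self.over = 0
            self.cooldown_left = self.cooldown
            return True
        return False

a = DebouncedAlert(threshold=2.0, debounce_n=3, cooldown_windows=5)
fired = [a.update(v) for v in [3.0, 3.1, 1.0, 2.5, 2.6, 2.7, 2.8]]
```

Note how the single clean window at index 2 resets the streak: a two-window burst never fires, which is exactly the storm suppression the fix calls for.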

Field logs always miss the critical segment—ring buffer too small or drop policy too aggressive?

Likely cause: ring buffer retention is sized by count (not time), or low-priority drops erase pre-fault context.

Quick check: measure effective retention time at real traffic; confirm whether WARN/ERROR events survive uplink back-pressure conditions.

Fix: size buffers by seconds (last N seconds), freeze-on-trigger, and prioritize ERROR/WARN + rollups over raw DEBUG events.

Pass criteria: on each confirmed incident, last N seconds snapshot is available ≥ Y% of the time (with complete endpoint/context tags).
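Sizing the buffer by seconds with freeze-on-trigger can be sketched as a deque that evicts by timestamp instead of by count. The record shape and the 10 s retention below are illustrative assumptions.

```python
# Sketch: time-retained ring buffer with freeze-on-trigger, so the last
# N seconds of pre-fault context survive for export.
from collections import deque

class TimedRing:
    def __init__(self, retain_s=30.0):
        self.retain_s = retain_s
        self.buf = deque()
        self.frozen = False

    def log(self, t_s, record):
        if self.frozen:
            return                       # frozen: preserve pre-fault context
        self.buf.append((t_s, record))
        # Evict by age, not by count: retention is measured in seconds.
        while self.buf and self.buf[0][0] < t_s - self.retain_s:
            self.buf.popleft()

    def freeze(self):
        """Stop logging and eviction when a fault trigger fires."""
        self.frozen = True

ring = TimedRing(retain_s=10.0)
for t in range(25):
    ring.log(float(t), {"event": "txn", "t": t})
ring.freeze()                            # fault detected at t=24
snapshot = list(ring.buf)
```

Because eviction is by age, bursty traffic shortens the record count but never the retained time span, which is what keeps the critical pre-fault segment available.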

Top talkers change every run—unstable endpoint IDs or wrong transaction boundaries?

Likely cause: endpoint_id is not canonical (aliases, dynamic mapping), or txn boundaries drift across implementations.

Quick check: audit endpoint_id generation; verify the same physical target maps to one ID; compare txn_count and bytes under the same script.

Fix: define canonical endpoint_id rules (addr/cs/port + channel/segment), encode boundary definitions in schema, and validate with a short golden script.

Pass criteria: under the same workload, Top-K endpoints overlap ≥ X% across runs, and rollup totals differ ≤ Y%.
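A canonical endpoint_id rule can be sketched as a pure function over stable physical coordinates, so aliases (e.g. decimal vs hex I²C addresses) collapse to one ID. The exact field set (bus, port, segment, target) is an illustrative assumption.

```python
# Sketch: canonical endpoint_id from physical coordinates, so the same
# target always maps to one ID regardless of how callers spell it.

def canonical_endpoint_id(bus, port, segment, target):
    """I²C: target is the 7-bit address; SPI: the CS index; UART: the port."""
    if bus == "i2c":
        addr = int(target, 0) if isinstance(target, str) else int(target)
        tgt = f"addr=0x{addr:02x}"       # normalize decimal/hex spellings
    elif bus == "spi":
        tgt = f"cs={int(target)}"
    else:
        tgt = f"port={target}"
    return f"{bus}/{port}/seg{segment}/{tgt}"

# Two aliases of the same sensor (decimal vs hex address) map to one ID.
a = canonical_endpoint_id("i2c", 1, 0, 72)
b = canonical_endpoint_id("i2c", 1, 0, "0x48")
```

With IDs stabilized this way, Top-K overlap across runs measures real behavior changes instead of naming drift.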

Note: X/Y/Z/N/M are project-specific thresholds. Keep units, windows, and bucket keys fixed to preserve comparability.