CAN/LIN/FlexRay Diagnostics & Logging: Counters, Utilization, Wake
Turn “it happened in the field” into a replayable evidence chain: standardize counters, utilization, and wake attribution, then capture black-box snapshots with trustworthy timestamps. Ship service-ready reports with debounced thresholds and measurable pass/fail criteria—so issues are searchable, comparable, and fixable at scale.
Scope Guard & Definitions
Establish one non-negotiable contract for this page: measure consistently, attribute with evidence, and report for serviceability. The scope guard prevents protocol/waveform detours and keeps all later chapters aligned.
In scope:
- Error counters — how to sample, trend, and correlate counters with events.
- Bus utilization — correct definitions, windows, and peak/burst interpretation.
- Wake-event black box — evidence-based wake attribution + snapshot + retention.
Out of scope:
- Waveform shaping, termination tuning, stub/harness SI/EMC details (handled by PHY/EMC sibling pages).
- UDS/DoIP/OTA protocol behavior (only logging interface fields are referenced here).
- Event: a discrete occurrence with a timestamp (e.g., entered bus-off, wake occurred).
- State: a sustained condition (e.g., error passive state, bus-off state).
- Counter sample: a point-in-time reading (e.g., TEC/REC at time t). Use deltas over a defined window.
- Average: % busy within a long window (e.g., 1 s / 10 s).
- Peak: maximum busy% in sliding windows (captures congestion episodes).
- Burst: short-window distribution (P95/P99 in 10–100 ms), required for “feels blocked” complaints.
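All three load-shape metrics can be derived from one stream of short-window busy% samples. A minimal sketch (function name, window sizes, and the nearest-rank percentile method are illustrative choices, not a normative definition):

```python
def burst_metrics(busy_pct_samples):
    """busy_pct_samples: one busy% value per short window (e.g. 10 ms each).
    Returns avg / peak plus the burst percentiles (P95/P99)."""
    s = sorted(busy_pct_samples)

    def pct(p):  # nearest-rank percentile over the sorted samples
        return s[min(len(s) - 1, int(p / 100.0 * len(s)))]

    return {
        "avg": sum(s) / len(s),   # long-window average
        "peak": s[-1],            # worst sliding window
        "p95": pct(95),           # burst distribution
        "p99": pct(99),
    }
```

The same capture thus yields both the average and the P99 needed for "feels blocked" complaints; only the reporting window differs.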
Attribution must include evidence fields (flags / filter hit / power state transitions). Pure inference is not acceptable for serviceability.
- Timestamp: monotonic ordering + human-readable time + sync quality (for cross-ECU correlation).
- Snapshot trigger: define event triggers + pre/post windows (Tpre/Tpost) for freeze-frame.
- Retention: RAM ring for high-rate context + NVM checkpoints for post-power-cycle evidence.
Every metric must declare window, denominator, and units. Counter values are meaningless without deltas over a window.
- Schema dictionary: minimal fields for counters, utilization, wake attribution, and snapshots.
- Service report template: summary → evidence → severity → next measurement.
- Wake black box MVP: ring buffer + checkpoint + freeze-frame triggers.
- Verification matrix: fault injection coverage + pass criteria placeholders.
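To make the schema-dictionary deliverable concrete, a minimal counter record might look like the following. All field names here are placeholders, not a normative schema:

```python
# Hypothetical minimal counter record: every value travels with its
# window, denominator, and state context so deltas stay comparable.
counter_record = {
    "metric_id": "can0.tec",       # bus-scoped metric identity
    "value_delta": 12,             # Δcounter over the window, never absolute-only
    "window_s": 10.0,              # window_Y
    "denominator": "per_second",   # denom_Z
    "ts_mono_ms": 123456,          # monotonic timestamp
    "state": "error_active",       # controller state at sample time
}
```

A record missing any of the window/denominator fields would violate the contract stated above and should be rejected at ingest.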
Failure Taxonomy for Serviceability
Convert field complaints into searchable, measurable, and triage-ready categories. A good taxonomy reduces log volume requirements and increases root-cause speed.
- Symptom — what service observes (bus-off, wake storm, intermittent errors).
- Signal — measurable evidence (counter deltas, state transitions, utilization peaks, wake flags).
- Suspected domain — attribution buckets guiding next verification (Topology / EMC / Node behavior / Policy).
Minimal field set is split into Identity (bus/node), Evidence (signals), and Context (timestamp quality / power state).
Each entry binds signals to a window + denominator, then maps to suspected domains. This avoids “high counter value” confusion and supports automated service reports.
Symptom · Bus-off (intermittent or frequent)
- State transition: entered bus-off (event) + recovery attempts.
- Counter deltas: ΔTEC / ΔREC over window_Y (not absolute values).
- Utilization snapshot: peak & burst around the event (optional but high value).
- Compute ΔTEC/ΔREC per window_Y (placeholder) and normalize per active_time or frames_sent.
- Freeze-frame: capture Tpre/Tpost context around bus-off (placeholder).
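The ΔTEC/ΔREC normalization step above can be sketched as follows (function and key names are hypothetical):

```python
def counter_delta(c_t0, c_t1, window_s, frames_sent=None):
    """ΔTEC/ΔREC over window_Y, normalized per time and per 1000 frames.
    Absolute values are kept out of the report on purpose."""
    delta = c_t1 - c_t0
    out = {"delta": delta, "per_second": delta / window_s}
    if frames_sent:  # normalize by activity when a frame count is available
        out["per_1000_frames"] = 1000.0 * delta / frames_sent
    return out
```

Normalizing both ways lets reports from a busy gateway and a mostly idle ECU be compared on the same scale.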
Symptom · Error passive entry / oscillation (flapping)
- State transitions count: active ↔ passive within window_Y.
- Counter deltas: ΔREC dominates vs ΔTEC dominates (direction hints where to look next).
- Utilization bursts: P95/P99 in short window (often correlates with error bursts).
- Normalize flap rate per minute or per 1000 frames (placeholder).
- Record whether flapping is periodic (suggests policy/threshold issues) or random (suggests EMC/topology).
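One way to separate periodic from random flapping is the spread of inter-flap intervals. A sketch, assuming at least two flap timestamps; the 0.2 coefficient-of-variation cutoff is an illustrative placeholder:

```python
def flap_signature(flap_ts_s, window_min):
    """Classify flapping by interval regularity: a tight interval spread
    suggests policy/threshold issues, a wide spread suggests EMC/topology."""
    rate_per_min = len(flap_ts_s) / window_min
    gaps = [b - a for a, b in zip(flap_ts_s, flap_ts_s[1:])]
    mean = sum(gaps) / len(gaps)
    var = sum((g - mean) ** 2 for g in gaps) / len(gaps)
    cv = (var ** 0.5) / mean  # coefficient of variation of flap intervals
    return {"rate_per_min": rate_per_min,
            "pattern": "periodic" if cv < 0.2 else "random"}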
Symptom · Intermittent CRC / frame errors (scope “looks OK”)
- Error deltas per short window: errors / 10 s and errors / 100 ms (placeholder).
- Correlation with utilization: spikes in burst P99 often expose contention/retry amplification.
- Environmental correlation: temperature bucket / power-state transitions (context fields only).
Symptom · False wake / wake storm (no obvious frames)
- Wake evidence: wake flags + filter-hit counters (if selective wake is used).
- Attribution confidence: prefer hardware flags over software inference.
- Pre/post snapshots: capture Tpre/Tpost around wake event (placeholder).
Symptom · Diagnostic session intermittent timeout (field-level only)
- Transport-facing counters: timeout_count per window_Y (placeholder).
- Bus utilization bursts around timeouts (queue starvation often appears as burst congestion).
- Reset/power transitions near timeout windows (to rule out power-induced dropouts).
- Always store deltas (Δcounter) with window_Y, not just absolute counters.
- Always declare the denominator (per time, per frames, per active-time) to avoid incomparable rates.
- Every attribution must include evidence and a confidence label (high/medium/low).
Observability Tap Points
Map where diagnostic evidence originates across Transceiver / Controller / MCU / Power-SBC / Monitor. The goal is consistent evidence capture and correlation—not protocol or RTOS tutorials.
- Hardware-latched flags (wake, dominant timeout, thermal) — highest confidence.
- Controller counters + state transitions (ΔTEC/ΔREC, passive/bus-off) — measurable trends.
- Software context (queue depth, ISR latency, log drops) — correlation and amplification clues.
Lower-level signals can support correlation, but root-cause direction requires higher-confidence evidence whenever available.
- Trends: periodic sampling (Δcounters, utilization average).
- Replay: event-trigger snapshots (bus-off, wake, reset).
- Short pulses: edge/latched capture (wake flags, dominant timeout).
Without these keys, logs become fragments and serviceability collapses under power cycles or multi-ECU correlation.
Each card declares what the layer can prove, what it cannot, and how to sample it for trend vs replay.
Cannot prove: exact waveform/termination root cause or which node generated a specific frame error (handled by PHY/EMC sibling pages).
- Flags: latched or interrupt-driven reads (avoid missing short pulses).
- Thermal/mode: low-rate periodic sampling (trend context, not high bandwidth).
- Reading flags clears them without first stamping ts_mono.
- Logging “flag occurred” without duration/recovery evidence.
Cannot prove: harness topology and EMC mechanisms directly; only measurable signatures and correlations are available here.
- Counters: store Δcounter per window_Y (never absolute-only).
- Transitions: event-trigger snapshots on passive/bus-off entry and recovery.
- Storing absolute counters without Δ and window leads to incomparable reports.
- Missing state transition history makes “counter jumps” uninterpretable.
Cannot prove: physical-layer faults; MCU signals explain amplification (starvation, backlog) but do not replace bus evidence.
- Periodic: queue depth and CPU load for trends.
- Event-trigger: timeout bursts, queue overflow, log-drop spikes.
- Missing log-drop counters creates false “no issue observed” narratives.
- Treating software timeouts as root-cause instead of correlating with bus/power evidence.
- Reset reason: capture at boot immediately (latched at startup).
- VBAT dips: event-trigger with threshold_X placeholder and persistence policy.
- No boot_count/event sequencing prevents correlation across power cycles.
- Storing only “reset happened” without reason classification reduces service value.
Low-rate trend sampling plus event-trigger capture for exceptions; store counts and durations to support severity ranking.
- Always-on (low rate): power_state, controller state, utilization average.
- Event-only: passive/bus-off entry, wake, reset, dominant timeout.
- Burst capture: short-window utilization P95/P99 and short-window Δcounters in a RAM ring buffer.
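The burst-capture tier can be as simple as a fixed-size RAM ring that is snapshotted when a trigger fires. A minimal sketch (class and method names are illustrative):

```python
from collections import deque

class SampleRing:
    """Overwrite-friendly RAM ring for short-window samples (Tpre context)."""

    def __init__(self, capacity):
        # deque with maxlen drops the oldest entry automatically when full
        self._buf = deque(maxlen=capacity)

    def push(self, sample):
        self._buf.append(sample)

    def freeze(self):
        """Snapshot the ring for a freeze-frame record (oldest first)."""
        return list(self._buf)
```

On a trigger, `freeze()` provides the Tpre slice; the same ring keeps running to collect the Tpost slice before the event record is committed.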
Error Counters Deep Dive
Turn counters into an evidence chain: Δcounters + window + denominator bound to state transitions. This prevents “looking at numbers without attribution”.
- Sample counters at t0, t1… then compute Δcounter / window_Y.
- Bind deltas to state transitions (active ↔ passive ↔ bus-off).
- Anchor with events (wake / reset / dominant timeout) when present.
- Correlate with utilization burst (P95/P99) to detect contention amplification.
- Report severity + suspected domain bucket (Topology / EMC / Node / Policy).
- Use ΔTEC/ΔREC over window_Y; absolute values alone are not actionable.
- State entry timestamps (passive/bus-off) must be captured as events and paired with freeze-frame snapshots.
- After recovery, track whether counters decay and stabilize or re-escalate quickly (policy vs burst signatures).
Record error counts by category (header/response/sync/timeout), then normalize by window_Y and denom_Z. The objective is consistent trend + event anchoring, not protocol walkthrough.
Split evidence by channel (A/B) and by error class; store deltas and event anchors. Channel asymmetry is a serviceable clue even without waveform deep dives.
Counters are most useful when categorized by time behavior; each pattern suggests different next measurements.
- Signature: Δcounter rises steadily across long windows.
- Quick check: correlation with temperature / power state transitions.
- Next: trend-first verification; avoid single-shot conclusions.
- Signature: large short-window Δcounter + utilization burst P99 spikes.
- Quick check: align bursts with bus-off/passive events.
- Next: capture freeze-frame around events; validate on real harness.
- Signature: repeating state flaps or recurrent bus-off with stable period.
- Quick check: align with wake/recovery thresholds and policy state changes.
- Next: audit debounce/filter/recovery criteria; require evidence fields.
Each counter definition must include source, units, sampling method, window_Y, denom_Z, threshold_X placeholders, and severity mapping.
Reporting rule: every counter report must include Δ + window + denominator + state + event anchor.
Bus Utilization Metrics
Convert “the bus is busy” into comparable, calculable metrics. Define windowed utilization, burst behavior, and overhead shares so bench and in-vehicle measurements can be explained without waveform-level discussion.
Every utilization number must declare window_Y and a clear denominator. Without this, averages cannot be compared across ECUs, tools, or drives.
Separate load shape (avg/peak/burst) from overhead (retransmissions/errors/arbitration loss). Mixing them hides the reason why real harness behavior diverges from bench.
Different payload sizes and modes change frame_time and therefore busy_time. This page standardizes the accounting path and avoids PHY-level explanations.
busy_time = Σ frame_time(i) + Σ retry_time(i) + overhead
util_% = busy_time / window_Y × 100
Reports must declare the source of frame_time (controller/sniffer/gateway) and keep windowing consistent.
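The accounting path above follows directly from the two formulas; a sketch with inputs in seconds (names are placeholders):

```python
def utilization(frame_times_s, retry_times_s, overhead_s, window_s):
    """busy_time = Σ frame_time + Σ retry_time + overhead;
    util_% = busy_time / window_Y × 100.
    Also returns the retransmission share that must be reported alongside."""
    busy = sum(frame_times_s) + sum(retry_times_s) + overhead_s
    return {
        "util_pct": 100.0 * busy / window_s,
        "retrans_share_pct": 100.0 * sum(retry_times_s) / busy,
    }
```

Keeping the retry and overhead terms explicit is what lets overhead shares be separated from load shape later in this chapter.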
- Strength: ties utilization to local state/counters and retry context.
- Bias: partial view (local ECU perspective) may not reflect network-wide busy time.
- Best for: accountability and local amplification analysis.
- Strength: closer to actual bus busy time and burst behavior.
- Bias: limited internal context (cannot see ECU queues or local retry intent).
- Best for: network health and peak/burst characterization on real harness.
- Strength: multi-segment coverage and system-level correlation.
- Bias: mirror path can introduce resampling and timestamp quality issues.
- Best for: serviceability and cross-domain correlation with wake/diagnostics.
- Quick check: always include peak and P99 for the same capture window.
- Fix: define short_window_s and report burst percentiles.
- Pass criteria: P99 < X% for Y minutes (placeholder).
- Quick check: compare window_Y average vs short-window peak.
- Fix: include rolling short windows and percentile reporting.
- Pass criteria: peak within X% above baseline (placeholder).
- Quick check: report error_frame_share_% and retrans_share_% alongside utilization.
- Fix: separate overhead shares from load shape metrics.
- Pass criteria: retrans_share_% < X% and error_frame_share_% < X% (placeholder).
A report without capture_point and windowing metadata is non-comparable and not serviceable.
Wake Event Attribution
Define a wake evidence chain with explicit source classification and confidence so false-wake and wake-storm issues become measurable and serviceable.
A wake event must be classified using one of the fixed sources below; the schema records evidence fields and confidence.
- Hardware flags (transceiver wake flags, power flags) — highest confidence.
- Controller state (bus state, counters/state transitions) — medium confidence.
- Software inference (queues, timers, application hints) — support only.
- If hardware flags conflict with inference, hardware wins and the conflict is recorded.
- If hardware evidence is missing, confidence is downgraded and the evidence_fields list must explain why.
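Both rules can be made deterministic in code. A sketch with hypothetical argument and field names:

```python
def classify_wake(hw_flag_source=None, ctrl_state_source=None, sw_hint=None):
    """Fixed priority: hardware flags > controller state > software inference.
    Conflicts with inference are recorded, never silently resolved."""
    if hw_flag_source:
        return {"wake_source": hw_flag_source, "confidence": "high",
                "conflict_recorded": bool(sw_hint and sw_hint != hw_flag_source)}
    if ctrl_state_source:
        return {"wake_source": ctrl_state_source, "confidence": "medium",
                "evidence_note": "no hardware flag latched"}
    # hardware evidence missing: downgrade and explain why
    return {"wake_source": sw_hint or "unknown", "confidence": "low",
            "evidence_note": "software inference only"}
```

Note that a hardware/inference conflict does not change the classification; it only sets `conflict_recorded` so the disagreement survives into the service report.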
Record filter identity and outcomes so false-wake rate becomes measurable without detailing the standard or filter algorithms.
The black-box capture must reserve a ring buffer covering Tpre and Tpost around wake. Without this, attribution becomes post-hoc inference.
The schema stores classification, confidence, and evidence fields; it also records debounce and policy state for service replay.
Wake-event Black Box Design
Provide a replayable black-box architecture: continuous sampling, event-triggered freeze-frame, and power-loss retention. The goal is service-grade reconstruction, not raw log dumping.
Separate continuous sampling from event records so storage and endurance remain controlled.
- High-rate, low-cost, overwrite-friendly samples covering Tpre/Tpost windows.
- Stores utilization snapshots, counter deltas, power flags, and minimal state changes.
- Allows decimation and delta encoding to keep bandwidth predictable.
- Low-rate, searchable event records with unique event_id and schema version.
- Each event carries freeze-frame fields and pointers/summaries of sampled slices.
- Supports event merging when multiple triggers occur in one anchor window.
- Periodic or trigger-driven commits to non-volatile storage for power-loss survival.
- Must track write budget and batch updates to avoid excessive wear.
- Stores event ring plus critical summaries for fast service extraction.
- Exports structured event records; avoids requiring manual log reading.
- Must include capture windows, time-quality fields, and filter identity where applicable.
- Allows sorting and searching by trigger, bus_id, node_id, and confidence.
Every trigger defines what gets frozen and how much pre/post context is extracted from the sampling ring.
- Freeze: state + counters + utilization snapshot + wake attribution.
- Capture: Tpre and Tpost slices (placeholders) from sampling ring.
- Freeze: power_state + reset_reason + brownout/thermal flags.
- Capture: immediate pre-reset slice and early-boot slice for correlation.
If multiple triggers occur within one correlation anchor window, merge into one composite record and store all trigger flags to avoid duplicated NVM writes.
- Prefer deltas for counters; prefer window summaries for utilization (avg/peak/p99/shares).
- Record state transitions instead of repeated identical states.
- Allow decimation under CPU pressure; preserve triggers and freeze-frames.
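Delta encoding of counter samples is one way to keep ring bandwidth predictable; a round-trip sketch (function names are illustrative):

```python
def delta_encode(samples):
    """Store the first value absolute, the rest as deltas (small, compressible)."""
    return [samples[0]] + [b - a for a, b in zip(samples, samples[1:])]

def delta_decode(encoded):
    """Reconstruct absolute samples from the delta stream."""
    out = [encoded[0]]
    for d in encoded[1:]:
        out.append(out[-1] + d)
    return out
```

Because most counter deltas are zero or small, the encoded stream packs far tighter than absolute values while staying exactly reversible for replay.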
Retention must track write frequency and byte volume. A black box that wears out storage is not production-safe.
- One sampling ring + one event ring.
- Triggers: wake, reset, bus-off, VBAT_dip.
- Retention: event ring + last-N event summaries.
- Dual sampling (fast ring + slow ring) with event slice references.
- Triggers include error_passive_entry, thermal, watchdog.
- Retention includes write budget stats and correlation anchors for cross-ECU replay.
Event records are the minimal searchable unit for service replay.
Timestamp & Correlation
Make logs correlatable across ECUs by recording time base and time quality. Many field failures are unsolved because time is not trusted or not aligned.
- Best for ordering events within one ECU.
- Requires boot_count and uptime_ms for post-reboot reasoning.
- Not directly comparable across ECUs without anchors.
- Human-readable and survives reboot if valid.
- Must record rtc_valid and drift_est to avoid false precision.
- Useful as a coarse cross-ECU reference when aligned time is unavailable.
- Primary method for cross-ECU correlation.
- Must record sync_status and offset_uncertainty to quantify reliability.
- Anchors event streams by generating correlation anchor IDs.
Each key event must carry time base and quality fields; correlation quality should be derived into a single ts_quality label.
- Gateway emits anchor_id with aligned time.
- ECUs record anchor_id + local ts_mono for replay.
- Best correlation strength for system-level reconstruction.
- Use monotonic event sequence for key triggers.
- Carry sequence through mirrored or forwarded reports.
- Works even when aligned time is unavailable.
- Match events by relative Tpre/Tpost patterns and trigger signatures.
- Always downgrade confidence and record matching uncertainty.
- Use as fallback when anchors and sequences are missing.
Diagnostics Reporting Workflow
Convert raw counters, utilization, and wake evidence into a service-grade report: concise, actionable, and confidence-tagged.
- Focus: thresholds, debouncing, severity, readability, and report structure.
- Not included: protocol tutorials, repair-manual encyclopedias, or EMC root-cause explanations.
- Output style: summary + evidence + next measurement direction.
Reporting quality depends on consistent normalization, explicit thresholds, and debouncing that prevents alert storms.
Use a consistent contract so every downstream rule is deterministic: metric_id, window, value, bus_id/node_id, time_ref, ts_quality.
Thresholds must be window-based and debounced to prevent noisy service outputs.
- Burst-type: N occurrences per W seconds (N and W are placeholders).
- Degradation-type: slope over T minutes; baseline and peak tracked per bus_id.
- Policy-type: periodic patterns (wake storms) tracked by recurrence interval.
- burst grouping: collapse multiple same-type events into one burst summary within short windows.
- hysteresis: separate enter/exit conditions to avoid threshold flapping.
- rate limiting: cap repeated reports per hour while keeping evidence counters.
- merge-by-anchor: within one correlation anchor window, merge triggers to minimize NVM writes.
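Hysteresis with separate enter/exit thresholds is the simplest of these debouncers. A sketch; the 80/60 thresholds in the test are placeholders:

```python
class Hysteresis:
    """Alert gate with distinct enter/exit thresholds (enter_at > exit_at)
    so a value hovering near one threshold cannot flap the alert."""

    def __init__(self, enter_at, exit_at):
        self.enter_at, self.exit_at = enter_at, exit_at
        self.active = False

    def update(self, value):
        if not self.active and value >= self.enter_at:
            self.active = True           # enter condition
        elif self.active and value <= self.exit_at:
            self.active = False          # separate, lower exit condition
        return self.active
```

A value oscillating between the two thresholds leaves the alert state unchanged, which is exactly the flapping this rule is meant to suppress.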
- critical: bus-off, persistent wake storm, repeated resets impacting availability.
- error: error passive entry, sustained error rate beyond threshold.
- warn: rising retransmission share or utilization peaks trending upward.
- info: isolated anomalies preserved for correlation and trending.
Service output must state confidence; do not imply certainty when evidence is incomplete.
- Priority: hardware flags > controller states > software inference.
- Penalize when ts_quality is poor or required evidence fields are missing.
- Expose confidence as a first-class report field.
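A minimal sketch of that scoring rule (tier names and penalty sizes are illustrative, not normative):

```python
LEVELS = ["low", "medium", "high"]

def report_confidence(evidence_tier, ts_quality_ok, missing_fields):
    """Start from the evidence priority (hw > ctrl > sw), then penalize
    poor time quality and missing required evidence fields."""
    level = {"hw": 2, "ctrl": 1, "sw": 0}[evidence_tier]
    if not ts_quality_ok:
        level -= 1                       # poor ts_quality penalty
    if missing_fields:
        level -= 1                       # incomplete evidence penalty
    return LEVELS[max(0, level)]
```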
Every alert must answer these questions in order, using short lines and explicit evidence references.
- What happened — event type + time + impacted bus/node.
- How severe — severity + threshold condition (placeholders X/Y).
- What evidence — event_id + key counters/util snapshots + ts_quality.
- What next to measure — measurement direction, not a repair manual.
This template is designed for fast triage: top events, trends, wake summary, and data-quality confidence.
- Rank by severity × confidence and show event_id for retrieval.
- Include bus_id, node_id, and time reference (mono/aligned).
- Break down by source: bus / local / timed / power.
- Show confidence and missing evidence fields to avoid guess-based attribution.
- ts_quality distribution and last_sync_age_s (placeholder).
- logger drop counters and overflow markers.
- evidence completeness rate for key event types.
Verification & Fault Injection Plan
Validate that logging, black-box capture, retention, and correlation remain trustworthy under worst-case conditions.
- Focus: evidence capture quality, time integrity, retention, and write endurance.
- Not included: detailed EMC methods or protocol-level conformance.
- Matrix style: injection → expected evidence → pass criteria.
Each category is evaluated by whether the expected event records and report outputs are produced, not by physical-layer explanations.
- Expected: state transitions, counter bursts, and correct severity classification.
- Must include event_id + freeze-frame + ts_quality fields.
- Expected: utilization peaks and recurrence patterns appear in trends.
- Wake attribution records evidence completeness and confidence levels.
- No missed triggers for injected conditions (allowing defined merge rules).
- Freeze-frame contains required fields and correct pre/post slice references.
- ts_mono remains monotonic; boot_count increments correctly.
- Cross-ECU correlation uses anchors when available; otherwise confidence is downgraded.
- Checkpoint survives power cycling at multiple cut points (during write, post-trigger, pre-export).
- Partial-write markers are detectable; recovery yields consistent event ring contents.
- Write budget remains within targets; write amplification remains bounded (placeholder).
- Under storms, sampling may decimate but key triggers and reports remain correct.
Use explicit metric definitions so results are comparable across benches, vehicles, and software revisions.
- miss_rate: fraction of injected triggers not captured (merge rules applied deterministically).
- false_rate: rate of critical/error reports under clean baselines (must stay below threshold).
- corr_rate: fraction of events that align across ECUs within target uncertainty tiers.
- write_amp: NVM bytes/writes per event (must remain bounded; batch commits effective).
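Under those definitions, the per-run metrics can be computed mechanically from a fault-injection run. A sketch covering the single-ECU metrics (corr_rate needs two event streams and is omitted; X thresholds stay placeholders):

```python
def run_metrics(injected_ids, captured_ids, false_alerts, clean_runs,
                nvm_bytes, event_count):
    """Comparable pass/fail metrics for one fault-injection run."""
    return {
        # share of injected triggers that never produced an event record
        "miss_rate": 1.0 - len(set(captured_ids) & set(injected_ids))
                           / len(injected_ids),
        # critical/error reports per clean baseline run
        "false_rate": false_alerts / clean_runs,
        # NVM bytes consumed per committed event record
        "write_amp": nvm_bytes / event_count,
    }
```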
Represent each test as a compact, scannable card: injection, expected evidence, expected report, pass criteria.
- Injection: short/open or disturbance (abstract).
- Expected evidence: state change + counter burst + event_id + ts_quality.
- Expected report: severity and 4Q output populated.
- Pass: miss_rate < X; false_rate < X.
- Injection: power cycling at multiple cut points.
- Expected evidence: checkpoint consistency + recoverable event ring.
- Expected report: top events remain retrievable by event_id.
- Pass: retention success rate > X; write_amp < X.
Engineering Checklist (Design → Bring-up → Production → Service)
Turn counters, utilization, wake attribution, black-box capture, and time correlation into gate-by-gate actions with measurable pass criteria.
- Focus: schema/thresholds/triggers, evidence completeness, exportability, retention, endurance.
- Not included: protocol tutorials, repair-manual encyclopedias, or EMC root-cause explanations.
Freeze the logging contract: field dictionary, thresholds with rationale, trigger coverage, time contract, and export format.
- Schema freeze: required/optional fields tagged; schema_version exported. Pass: required missing rate < X%.
- Metric semantics locked: event vs state vs counter sample, utilization windows, wake-source priority. Pass: definition↔implementation checks = X/X.
- Threshold rationale: each rule has window + count/rate + source (bench/vehicle/history). Pass: rationale coverage > X%.
- Trigger coverage map: bus-off, error-passive entry, wake, reset, VBAT dip, thermal, watchdog. Pass: critical triggers covered = 100%.
- Time contract: mono/RTC/aligned recorded with ts_quality, boot_count, time_source. Pass: monotonic violations = 0.
- Correlation anchors: anchor_id / gateway marks / sequence IDs defined. Pass: correlation-ready events > X%.
- NVM wear budget: write amplification bounded; storm policy defined. Pass: write_amp < X.
- Export contract: min reproducible package defined (summary + dump + schema + time quality). Pass: parse errors = 0.
- Fail-safe logging: logging cannot block safety-critical control. Pass: bounded CPU/ISR time < X.
- CAN FD controller (SPI): Microchip MCP2517FD / MCP2518FD (for external logging taps).
- CAN FD transceiver: TI TCAN1042-Q1; NXP TJA1044GT; Microchip MCP2562FD.
- Selective wake (PN capable): NXP TJA1145 (ISO 11898-6 class); TI TCAN1145-Q1 (family example).
- Non-volatile “black box” storage: Infineon/Cypress F-RAM CY15B104QSN (SPI); Fujitsu FRAM MB85RS2MT (SPI).
- Isolated CAN (when required): TI ISO1042-Q1; Analog Devices ADM3055E.
Validate on real harness and real loads: peak windows, burst grouping, wake evidence priority, export-and-parse loop.
- Real-harness re-measure: bench vs harness deltas captured in the same window definition. Pass: window semantics unchanged.
- Controller vs sniffer correlation: utilization and error shares cross-checked. Pass: difference < X%.
- Peak/burst coverage: peak_window_id and burst grouping validated. Pass: burst detection recall > X%.
- Retrans/error share sanity: retrans_share and error_share match observed symptoms. Pass: share trend aligns with events.
- Wake evidence priority: hardware flag > controller state > inference. Pass: mis-attribution < X%.
- Pre/Post capture: Tpre/Tpost slices present for key triggers. Pass: slice completeness > X%.
- Overflow markers visible: drop_count and overflow flags never silent. Pass: silent loss = 0.
- Export+parse loop: the min package is exported and parsed by tooling. Pass: parse errors = 0.
- CAN FD transceiver with diagnostics: Infineon TLE9255W (family example); TI TCAN1042-Q1.
- LIN transceiver (if wake correlation crosses LIN): TI TLIN1029-Q1; NXP TJA1021.
- FlexRay transceiver (if mixed networks): NXP TJA1080.
- Low-cap TVS for bus ports (SI-friendly): Nexperia PESD2CANFD; Littelfuse SM24CANB.
Ensure the production policy is durable: rate limits, decimation, retention correctness under power cuts, and NVM endurance.
- Policy tiers: normal / diagnostic / factory modes defined. Pass: mode separation verified.
- Rate limit & cooldown: repeated events do not cause alert storms. Pass: max reports/hour < X.
- Decimation under storms: sampling may drop, triggers must remain. Pass: critical trigger miss_rate < X.
- Wear validation: worst-case write budget validated across temperature. Pass: write_amp < X.
- Power-cut recovery: checkpoint consistency proven at multiple cut points. Pass: recovery success > X%.
- Version discipline: schema and thresholds are traceable by version. Pass: version missing = 0.
- Export footprint cap: maximum package size bounded. Pass: package < X MB.
- Data quality counters: drop_count and overflow markers exported. Pass: visibility rate = 100%.
- SBC w/ CAN (policy + reset reasons): NXP UJA1169 (CAN FD + LIN SBC family example); Infineon TLE9471-3ES (SBC family example).
- Watchdog supervisor (if discrete): TI TPS3431-Q1 (watchdog timer family example).
- FRAM for high-cycle logging: Infineon/Cypress CY15B104QSN; Fujitsu MB85RS2MT.
Make the exported package reproducible and readable: event retrieval by ID, confidence tagging, and 4Q reporting.
- Min reproducible package: summary + event dump + schema_version + time quality. Pass: completeness > X%.
- 4Q readability: what / severity / evidence / next measurement included per alert. Pass: 4Q compliance > X%.
- Evidence completeness: required fields present for key triggers. Pass: missing required < X%.
- Confidence mandatory: confidence downgraded when evidence is missing. Pass: untagged alerts = 0.
- Event retrieval by ID: event_id maps to freeze-frame and slices. Pass: retrieval success > X%.
- Cross-ECU notes: anchor-aware correlation guidance included. Pass: correlation rate > X.
- Integrity check: package structure self-check (hash/CRC placeholder). Pass: integrity failures = 0.
- Rate-limited summaries: burst events collapsed but counts preserved. Pass: counts retained = 100%.
- CAN transceiver w/ wake flags: NXP TJA1145; TI TCAN1145-Q1 (examples).
- Isolated CAN for HV boundary evidence: TI ISO1042-Q1; ADI ADM3055E.
- FRAM for robust dumps: Infineon/Cypress CY15B104QSN; Fujitsu MB85RS2MT.
Applications (What to log, and why it matters)
Avoid generic use-cases. Each bucket defines the highest-value logs, the minimal field set, and trigger/windows for serviceability.
- Only: what to record + why it is valuable for triage.
- Not included: isolation theory, EMC mechanisms, or protocol internals.
Fast detection of harness/assembly issues using burst errors and utilization peaks (not just averages).
- Triggers: error bursts, bus-off, utilization peak exceed.
- Pass: burst capture recall > X%; peak-window consistency < X% drift.
- CAN FD transceiver: TI TCAN1042-Q1; NXP TJA1044GT; Microchip MCP2562FD.
- Port protection: Nexperia PESD2CANFD; Littelfuse SM24CANB.
Reproduce sporadic bus-off or false wakes using a black-box evidence chain (pre/post slices + attribution + time quality).
- Package: summary + event ring dump + schema_version.
- Wake: wake_source + confidence + evidence_fields_present.
- Slices: Tpre/Tpost refs for key triggers.
- Time: time_source + ts_quality + boot_count.
- Power: reset_reason + power_state (+ VBAT dip marker if present).
- Network: state transitions + counters_delta + util snapshot.
- Triggers: wake, bus-off, reset, VBAT dip, thermal/watchdog.
- Pass: reproducible package decode success > X%; 4Q compliance > X%.
- Selective wake transceiver: NXP TJA1145; TI TCAN1145-Q1 (examples).
- Black-box NVM: Infineon/Cypress CY15B104QSN; Fujitsu MB85RS2MT.
Congestion and retransmissions must be quantified with consistent windows and queue/drop observability.
- Utilization: avg/peak/p95 + window.
- Shares: retrans_share + error_share (+ arbitration_loss_share if available).
- Queue health: queue_depth + drop_count (record only; no RTOS tutorial).
- Correlation: anchor_id or sequence_id + ts_quality.
- Bridge context: flow_id (abstract) + bus_id.
- Triggers: util_peak exceed, retrans surge, drop_count increase.
- Pass: congestion alerts rate-limited; correlation rate > X.
- CAN FD controller (SPI): Microchip MCP2517FD / MCP2518FD (for mirror taps or auxiliary buses).
- SBC (reset reasons + policy): NXP UJA1169 (CAN FD + LIN SBC family example).
Intermittent errors across ground offsets require evidence fields (flags + power markers + time quality), not isolation theory.
- Transceiver flags: fault/wake/thermal indicators (abstracted fields).
- Power markers: VBAT dip markers + reset_reason + power_state.
- Network evidence: counters_delta + state transitions + event_id.
- Time evidence: ts_quality + boot_count + time_source.
- Retention: freeze-frame refs survive power cuts.
- Triggers: power events, error bursts, reset, thermal flags.
- Pass: evidence completeness > X%; monotonic violations = 0.
- Isolated CAN transceiver: TI ISO1042-Q1; Analog Devices ADM3055E.
- Black-box NVM: Infineon/Cypress CY15B104QSN; Fujitsu MB85RS2MT.
FAQs (Diagnostics & Logging)
Each FAQ is a strict 4-line triage closure: Likely cause / Quick check / Fix / Pass criteria (threshold placeholders: X/Y/Tpre/Tpost).
TEC/REC not high, but bus-off happens frequently — sampling semantics or recovery policy first?
Likely cause: TEC/REC is sampled at the wrong moment (post-recovery / wrong hook) or bus-off recovery policy forces repeated transitions without a large visible counter trend.
Quick check: log state_change timestamps + TEC/REC snapshot at bus-off entry/exit; compare counters_delta around each transition; verify sampling is triggered by the same ISR/callback as the state change.
Fix: move sampling to the state-transition hook; add bus_off_reason + recovery_reason; enforce recovery cooldown and explicit retry limits (policy, not waveform).
Pass criteria: > X% bus-off events contain {state_change + TEC/REC snapshot + counters_delta}; unexpected bus-off rate < X per hour over Y operating hours.
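The Fix above (sample inside the state-transition hook, keep a per-transition delta) can be sketched as follows; the class and field names are illustrative, not a particular controller driver's API:

```python
from dataclasses import dataclass

@dataclass
class StateChangeEvidence:
    ts: int          # timestamp of the state change
    state: str       # e.g. "error_passive", "bus_off"
    tec: int         # TEC snapshot taken in the same hook
    rec: int         # REC snapshot taken in the same hook
    tec_delta: int   # change since the previous transition
    rec_delta: int

class StateChangeSampler:
    """Snapshot TEC/REC inside the state-change hook, not on a periodic timer."""

    def __init__(self):
        self._last_tec = 0
        self._last_rec = 0
        self.events = []

    def on_state_change(self, ts: int, state: str, tec: int, rec: int):
        self.events.append(StateChangeEvidence(
            ts, state, tec, rec,
            tec - self._last_tec, rec - self._last_rec))
        self._last_tec, self._last_rec = tec, rec
```

Because the snapshot rides the same callback as the transition, a post-recovery read can no longer hide the counter trend.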
Bus utilization shows 20%, but the field feels “blocked” — peak window or retransmission share first?
Likely cause: average window is too long (peaks hidden) and/or retrans_share and error frames are excluded from the utilization denominator.
Quick check: recompute util_peak over a short peak_window (e.g., 1–10 ms) and compare to avg; include retrans/error contributions; inspect p95/p99 and top peak_window_id buckets.
Fix: report utilization as {avg + p95 + peak} with the window definition embedded; always publish retrans_share and error_share; drive congestion alerts from peak/p95 (not avg).
Pass criteria: util_peak < X% and retrans_share < X% in window Y; peak windows are reproducible within ±X% across repeated runs.
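The {avg + p95 + peak} report with an embedded window definition could look like this sketch; p95 uses the nearest-rank method, and the input is one busy-fraction sample per short peak_window:

```python
def util_report(busy_fractions, window_ms):
    """Summarize per-window busy fractions as {avg, p95, peak}; p95 is nearest-rank."""
    s = sorted(busy_fractions)
    n = len(s)
    rank = (95 * n + 99) // 100  # ceil(0.95 * n) in integer math
    return {"window_ms": window_ms,   # window definition travels with the numbers
            "avg": sum(s) / n,
            "p95": s[rank - 1],
            "peak": s[-1]}
```

A single congested window stands out in peak while barely moving the average, which is exactly the "20% but feels blocked" symptom.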
False wakes are frequent, but logs show no frames — wake-flag timing or power-event wake first?
Likely cause: wake is not bus-driven (local/timed/power event) or wake_flag is read after it is cleared (timing/ordering bug).
Quick check: capture wake flags at the earliest wake ISR; record power_state, reset_reason, and VBAT_dip_marker; ensure the black box includes Tpre/Tpost slices around wake (Tpre/Tpost placeholders).
Fix: latch and persist flags (read-once → store); explicitly classify wake_source as bus/local/timed/power with evidence; add “no-bus-frame” wake category with required evidence fields.
Pass criteria: > X% wakes have wake_source + evidence fields present; “no-frame wakes” are classified with confidence ≥ X (not “unknown”).
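The explicit wake_source classification from the Fix can be sketched as an ordered, hardware-flag-first decision; all flag names are placeholders for latched (read-once → stored) evidence:

```python
def classify_wake(evidence: dict) -> str:
    """Hardware-flag-first wake attribution; branch order encodes evidence priority."""
    if evidence.get("bus_wake_flag"):
        return "bus"
    if evidence.get("local_wake_flag"):
        return "local"
    if evidence.get("timer_wake_flag"):
        return "timed"
    if evidence.get("vbat_dip_marker") or evidence.get("reset_reason"):
        return "power"
    return "unattributed"   # must stay rare per the pass criteria
```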
Black box captured a wake, but the wake source is unknown — which 3 evidence fields to add first?
Likely cause: attribution lacks minimum evidence (hardware flag snapshot, controller state snapshot, power marker) and confidence is computed without inputs.
Quick check: verify presence of wake_flag, controller_wake_status, reset_reason/VBAT_dip_marker, and ts_quality for the wake event.
Fix: add these three evidence fields first: (1) transceiver wake flag + capture timestamp, (2) controller state snapshot at wake, (3) power marker (reset_reason or VBAT dip marker); implement confidence scoring that downgrades when any evidence is missing.
Pass criteria: unknown wake_source < X% of wakes; confidence ≥ X for attributed wakes; evidence completeness ≥ X%.
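A confidence score that "downgrades when any evidence is missing" can be as simple as a weighted sum over the three fields; the weights below are placeholder values to be tuned per platform:

```python
# Placeholder weights for the three minimum evidence fields.
WEIGHTS = {"wake_flag": 0.4, "controller_state": 0.3, "power_marker": 0.3}

def wake_confidence(present: dict) -> float:
    """Confidence = sum of weights of evidence fields actually captured."""
    return sum(w for k, w in WEIGHTS.items() if present.get(k))
```

Missing any one field caps the score below 1.0, so confidence is never computed without inputs.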
After changing the harness, error counters rise but the waveform looks “OK” — what trend comparison first?
Likely cause: intermittent events are visible in counters_delta trends but not in a single snapshot; baseline windows differ (apples-to-oranges) or the “rate” metric is not normalized.
Quick check: compare counters as rate per hour (or per 1k frames) under matched windows; slice by node_id and by peak_window; correlate with util_peak and retrans_share.
Fix: standardize the trend window and publish it with every report; store a baseline profile (same window, same normalization) and produce “before/after” delta by node buckets.
Pass criteria: post-change error-rate returns within X% of baseline over Y hours (or Y thermal cycles); top contributing node(s) explain > X% of the delta.
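The normalized before/after comparison can be sketched as below; per-node rates are compared only under matched windows and the same normalization (here, errors per 1k frames):

```python
def error_rate_per_1k(counters_delta: int, frames: int) -> float:
    """Normalize an error-counter delta to errors per 1000 frames."""
    return 1000.0 * counters_delta / frames

def before_after(baseline: dict, current: dict) -> dict:
    """Per-node rate delta; both dicts map node_id -> normalized rate, same window."""
    return {node: current[node] - baseline.get(node, 0.0) for node in current}
```

A node missing from the baseline counts from zero, so newly added talkers show their full contribution.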
False wakes only in winter/dry conditions — log EMC-event counters or ground/power dip markers first?
Likely cause: environment-correlated transients show up as wake flags/power markers rather than as bus frames; missing markers make the event look “frame-less”.
Quick check: record wake_flag transitions + VBAT_dip_marker + reset_reason around the wake; if available, also log an EMC_event_counter (abstract counter, not waveform).
Fix: add wake debounce + cooldown tuned for transient storms; enforce hardware-flag-first attribution; optionally add environment tags (temperature/humidity placeholder) strictly as metadata.
Pass criteria: false-wake rate < X per day under the condition; > X% of wakes include at least one concrete marker {wake_flag/power_dip/reset_reason}.
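The "hardware-flag-first attribution plus debounce/cooldown" policy can be sketched as a marker-gated acceptance function; marker names and the cooldown default are placeholders:

```python
def accept_wake(ts_ms, markers, last_accept_ms, cooldown_ms=500):
    """Accept a wake only with a concrete marker, and at most once per cooldown.

    Returns (accepted, new_last_accept_ms).
    """
    has_marker = any(markers.get(k) for k in ("wake_flag", "power_dip", "reset_reason"))
    if not has_marker:
        return False, last_accept_ms   # frame-less, marker-less: reject outright
    if ts_ms - last_accept_ms < cooldown_ms:
        return False, last_accept_ms   # transient storm: suppress repeats
    return True, ts_ms
```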
Utilization is stable on bench, but vehicle peaks spike — group by node first or by message priority first?
Likely cause: peak bursts come from a small set of talkers or from priority-driven bursts; averaging hides the spike mechanism.
Quick check: compute util_peak per short peak_window; produce Top-5 “peak contributors” by node_id and separately by “priority/class” bucket (abstract); track retrans_share during those peaks.
Fix: always export both breakdowns: per-node and per-priority (or per-message-class) in the same report; gate “congestion” alerts on peak/p95 and attach the Top-5 contributors list.
Pass criteria: Top contributor(s) explain > X% of peak utilization; peak utilization < X% for Y consecutive peak windows (or peaks are correctly attributed with confidence ≥ X).
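The Top-5 contributor breakdown can be computed the same way for node_id and for the priority/class bucket; the sketch below takes (contributor_id, busy_bits) pairs from one set of peak windows:

```python
from collections import Counter

def top_contributors(samples, k=5):
    """Return the top-k contributors as (id, share-of-total) pairs."""
    totals = Counter()
    for cid, bits in samples:
        totals[cid] += bits
    grand = sum(totals.values())
    return [(cid, bits / grand) for cid, bits in totals.most_common(k)]
```

Running it once keyed by node_id and once keyed by priority bucket yields both breakdowns the Fix asks to export in the same report.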
Log timestamps don’t align; multi-ECU events cannot be correlated — unify monotonic time or add gateway anchors first?
Likely cause: mixed time bases without time_source/ts_quality metadata, or clock steps/resets between events (including boot boundaries).
Quick check: inspect time_source, ts_quality, boot_count; search for monotonic violations; check whether anchor_id (gateway marks) exists for cross-ECU correlation.
Fix: use monotonic time for ordering within an ECU and add gateway anchors for cross-ECU alignment; record drift_est/sync_status (abstract) in ts_quality.
Pass criteria: correlation success > X% across ECUs within window Y; monotonic violations = 0 for key events; anchor coverage > X%.
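Cross-ECU alignment via gateway anchors reduces to estimating the offset between two local monotonic clocks from shared anchor_ids; a robust minimal sketch (median over per-anchor offsets, assuming each dict maps anchor_id to a local timestamp):

```python
def align_offset(anchors_a, anchors_b):
    """Estimate clock offset (b - a) from anchors seen by both ECUs."""
    shared = anchors_a.keys() & anchors_b.keys()
    offsets = sorted(anchors_b[i] - anchors_a[i] for i in shared)
    # Median tolerates a single corrupted or late-captured anchor.
    return offsets[len(offsets) // 2]
```

The residual spread of the per-anchor offsets is a natural input for the drift_est/sync_status fields in ts_quality.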
The black box gets overwritten; key events are missing — change triggers first or move to a dual-layer buffer first?
Likely cause: single ring buffer without reservation + high trigger rate leads to ring_overrun; key triggers compete with low-value spam events.
Quick check: log event_rate, ring_overrun counters, and Top trigger types; verify whether key triggers reserve slots and whether freeze-frames exist (freeze_frame_ref).
Fix: implement dual-layer capture (event ring + sample ring) with reserved freeze-frames for critical triggers; tighten triggers and add cooldown; prioritize critical event IDs in retention policy.
Pass criteria: critical event retention ≥ X events (or ≥ X hours); key trigger miss_rate < X%; ring_overrun rate < X per day.
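The dual-layer capture idea — critical triggers never compete with low-value spam for ring slots — can be sketched with two independent rings; capacities and the criticality flag are placeholders:

```python
from collections import deque

class DualRing:
    """Event ring plus a reserved ring for critical triggers (sketch)."""

    def __init__(self, normal_cap, critical_cap):
        self.normal = deque(maxlen=normal_cap)     # high-rate, overwritable
        self.critical = deque(maxlen=critical_cap) # reserved; spam cannot evict these

    def log(self, event_id, critical=False):
        (self.critical if critical else self.normal).append(event_id)
```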
Logging volume causes flash wear — decimation first or event compression first?
Likely cause: frequent checkpoints to flash + no decimation/compression leads to high write_amp (write amplification) and early endurance exhaustion.
Quick check: estimate writes/day and write_amp from event rate + checkpoint interval; identify “hot” event IDs; verify whether the design logs periodic samples at full rate during storms.
Fix: apply decimation to periodic samples and compress/aggregate repeated events; keep flash for sparse checkpoints and consider high-cycle NVM for dumps (e.g., FRAM CY15B104QSN / MB85RS2MT) for high-write paths.
Pass criteria: write_amp < X; projected endurance > X years at worst-case duty cycle; storm mode retains all critical triggers with miss_rate < X%.
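The writes/day estimate from the Quick check can be put in one formula; it assumes every decimated periodic sample and every checkpoint causes one flash write, which ignores write amplification and is therefore a lower bound:

```python
def writes_per_day(event_rate_hz, checkpoint_interval_s, decimation=1):
    """Lower-bound flash writes per day: decimated samples plus checkpoints."""
    sample_writes = event_rate_hz * 86400 / decimation
    checkpoint_writes = 86400 / checkpoint_interval_s
    return sample_writes + checkpoint_writes
```

At 10 Hz samples with 100:1 decimation and a 60 s checkpoint interval this is 8640 + 1440 writes/day, a concrete input for the endurance projection.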
Too many false alarms; service cannot use the report — debounce window or severity mapping first?
Likely cause: thresholds fire on noisy samples (not on state/event transitions), missing cooldown, and severity escalation ignores evidence completeness.
Quick check: compute false-positive rate on a labeled set; inspect debounce_window + cooldown; verify mapping of bus-off / wake storm to severity; confirm evidence_fields_present is required for high severity.
Fix: apply N-in-window debounce plus cooldown; split info/warn/error/critical rules; require confidence ≥ X and evidence completeness ≥ X% before raising critical alarms.
Pass criteria: false alarm rate < X%; alert volume/day < X while recall > X%; 4Q report completeness ≥ X%.
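The N-in-window debounce from the Fix can be sketched with a timestamp queue; N and the window are the tunables the pass criteria measure against:

```python
from collections import deque

class NInWindowDebounce:
    """Raise only when at least n qualifying events land within window_ms."""

    def __init__(self, n, window_ms):
        self.n, self.window_ms = n, window_ms
        self.ts = deque()

    def feed(self, ts_ms):
        self.ts.append(ts_ms)
        # Drop events older than the window before counting.
        while self.ts and ts_ms - self.ts[0] > self.window_ms:
            self.ts.popleft()
        return len(self.ts) >= self.n
```

Severity mapping then sits on top: the debounced output may raise warn, while critical additionally requires confidence and evidence-completeness thresholds.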
Wake happens then the ECU sleeps immediately; logs are fragmented — policy transition point or dump timing first?
Likely cause: power policy cuts logging before the dump/checkpoint completes; dump trigger occurs after the sleep decision; retention window is too short.
Quick check: record power_state transitions around wake; measure time from wake ISR to dump start; verify checkpoint_seq/checkpoint_done fields; confirm Tpre/Tpost slices exist.
Fix: capture minimal freeze-frame immediately at wake; enforce a hold-off timer before sleep decision; checkpoint critical events before sleep; make dump timing part of the policy contract.
Pass criteria: > X% wake events include contiguous Tpre/Tpost slices; dump success > X%; fragmented logs < X% over Y cycles.
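The hold-off-before-sleep contract can be reduced to a single gating predicate; the hold-off default is a placeholder and checkpoint_done stands for the checkpoint_seq/checkpoint_done fields:

```python
def may_sleep(now_ms, wake_ts_ms, checkpoint_done, holdoff_ms=200):
    """Sleep is allowed only after the hold-off elapses AND the checkpoint commits."""
    return checkpoint_done and (now_ms - wake_ts_ms) >= holdoff_ms
```

Because both conditions are required, neither a fast sleep decision nor a slow dump can fragment the Tpre/Tpost slices on its own.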