
CAN/LIN/FlexRay Diagnostics & Logging: Counters, Utilization, Wake


Turn “it happened in the field” into a replayable evidence chain: standardize counters, utilization, and wake attribution, then capture black-box snapshots with trustworthy timestamps. Ship service-ready reports with debounced thresholds and measurable pass/fail criteria, so issues are searchable, comparable, and fixable at scale.

Scope Guard & Definitions

Intent

Establish one non-negotiable contract for this page: measure consistently, attribute with evidence, and report for serviceability. The scope guard prevents protocol/waveform detours and keeps all later chapters aligned.

Scope in 30 seconds
Covers (three measurement objects)
  • Error counters — how to sample, trend, and correlate counters with events.
  • Bus utilization — correct definitions, windows, and peak/burst interpretation.
  • Wake-event black box — evidence-based wake attribution + snapshot + retention.
Out of scope (link out only)
  • Waveform shaping, termination tuning, stub/harness SI/EMC details (handled by PHY/EMC sibling pages).
  • UDS/DoIP/OTA protocol behavior (only logging interface fields are referenced here).
Sibling pages (titles only; no duplication)
  • HS CAN Transceiver
  • CAN FD / SIC / CAN XL PHY
  • Selective Wake / Partial Networking
  • LIN Transceivers
  • FlexRay PHY / Controller
  • EMC / Protection Co-design
Key definitions (must be consistent)
Error: event vs state vs counter sample
  • Event: a discrete occurrence with a timestamp (e.g., entered bus-off, wake occurred).
  • State: a sustained condition (e.g., error passive state, bus-off state).
  • Counter sample: a point-in-time reading (e.g., TEC/REC at time t). Use deltas over a defined window.
Bus utilization: average vs peak vs burst
  • Average: % busy within a long window (e.g., 1 s / 10 s).
  • Peak: maximum busy% in sliding windows (captures congestion episodes).
  • Burst: short-window distribution (P95/P99 in 10–100 ms), required for “feels blocked” complaints.
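The three utilization views above can be sketched in a few lines. This is a minimal illustration, not a tool's API: window lengths, sample values, and field names are placeholders, and P99 uses a simple nearest-rank estimate.

```python
# Sketch: derive average, peak, and P99 utilization from short-window busy%
# samples (e.g. one sample per 100 ms slice). All values are illustrative.

def utilization_stats(busy_pct_samples):
    """busy_pct_samples: busy% per short window."""
    s = sorted(busy_pct_samples)
    avg = sum(s) / len(s)
    peak = s[-1]
    # Nearest-rank P99 over the short-window distribution.
    p99 = s[min(len(s) - 1, int(0.99 * len(s)))]
    return {"avg_util_pct": avg, "peak_util_pct": peak, "p99_util_pct": p99}

# A long-window average can look calm while short windows show congestion:
samples = [5.0] * 95 + [60.0] * 5   # 5% baseline with a 60% burst episode
stats = utilization_stats(samples)
```

Here the average stays under 8% while peak and P99 sit at 60%, which is exactly the “feels blocked” signature that average-only reporting hides.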
Wake source attribution
bus_wake local_wake timed_wake power_event_wake

Attribution must include evidence fields (flags / filter hit / power state transitions). Pure inference is not acceptable for serviceability.

Timestamp, snapshot trigger, retention
  • Timestamp: monotonic ordering + human-readable time + sync quality (for cross-ECU correlation).
  • Snapshot trigger: define event triggers + pre/post windows (Tpre/Tpost) for freeze-frame.
  • Retention: RAM ring for high-rate context + NVM checkpoints for post-power-cycle evidence.
Measurement contract (report must include)

Every metric must declare window, denominator, and units. Counter values are meaningless without deltas over a window.

bus_id node_id event_id ts_mono ts_quality window_Y denom_Z
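The contract can be made concrete with a small sketch: a counter report carries the delta, the declared window, and the declared denominator. Function and field names here are illustrative placeholders matching this page's conventions, not a specific logger schema.

```python
# Sketch of the measurement contract: a counter value is only meaningful as a
# delta over a declared window, normalized by a declared denominator.

def counter_delta_report(sample_t0, sample_t1, window_s, denom, denom_value):
    """sample_t0/t1: absolute counter readings taken window_s apart."""
    delta = sample_t1 - sample_t0
    return {
        "delta": delta,
        "window_s": window_s,          # window_Y: must always be declared
        "denom": denom,                # e.g. "frames_sent" or "active_time_s"
        "rate": delta / denom_value,   # normalized, comparable across ECUs
    }

# TEC rose from 24 to 40 while 2000 frames were sent in a 10 s window:
r = counter_delta_report(24, 40, window_s=10, denom="frames_sent",
                         denom_value=2000)
```

A report built this way stays comparable across tools: the delta (16) and the rate (0.008 errors per frame) carry their window and denominator with them.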
Deliverables (what later chapters produce)
  • Schema dictionary: minimal fields for counters, utilization, wake attribution, and snapshots.
  • Service report template: summary → evidence → severity → next measurement.
  • Wake black box MVP: ring buffer + checkpoint + freeze-frame triggers.
  • Verification matrix: fault injection coverage + pass criteria placeholders.
Diagram · Coverage map (what this page covers vs links out)
Coverage map: a central Diagnostics & Logging block (“one contract: measure → attribute → report”) connects to three branches: Error Counters (read · trend · correlate), Bus Utilization (define · measure · peaks), and the Wake-event Black Box (attribute · snapshot · retain). Out-of-scope items (waveform · termination · UDS) appear as grey tags.

Failure Taxonomy for Serviceability

Intent

Convert field complaints into searchable, measurable, and triage-ready categories. A good taxonomy reduces log volume requirements and increases root-cause speed.

Three-layer model (enables fast triage)
  1. Symptom — what service observes (bus-off, wake storm, intermittent errors).
  2. Signal — measurable evidence (counter deltas, state transitions, utilization peaks, wake flags).
  3. Suspected domain — attribution buckets guiding next verification (Topology / EMC / Node behavior / Policy).
Rule: every symptom must define a minimal field set

Minimal field set is split into Identity (bus/node), Evidence (signals), and Context (timestamp quality / power state).

Taxonomy dictionary (field-ready)

Each entry binds signals to a window + denominator, then maps to suspected domains. This avoids “high counter value” confusion and supports automated service reports.

Symptom · Bus-off (intermittent or frequent)
Primary signals
  • State transition: entered bus-off (event) + recovery attempts.
  • Counter deltas: ΔTEC / ΔREC over window_Y (not absolute values).
  • Utilization snapshot: peak & burst around the event (optional but high value).
Minimal field set
bus_id node_id event_busoff tec_rec_snapshot delta_window_Y ts_mono + ts_quality power_state
Window & denominator
  • Compute ΔTEC/ΔREC per window_Y (placeholder) and normalize per active_time or frames_sent.
  • Freeze-frame: capture Tpre/Tpost context around bus-off (placeholder).
Suspected domains (guide next verification)
Topology Node behavior EMC Policy
Symptom · Error passive entry / oscillation (flapping)
Primary signals
  • State transitions count: active ↔ passive within window_Y.
  • Counter deltas: ΔREC dominates vs ΔTEC dominates (direction hints where to look next).
  • Utilization bursts: P95/P99 in short window (often correlates with error bursts).
Minimal field set
event_passive_enter state_flap_count delta_tec_rec util_burst_P99 ts_mono + window_Y
Window & denominator
  • Normalize flap rate per minute or per 1000 frames (placeholder).
  • Record whether flapping is periodic (suggests policy/threshold issues) or random (suggests EMC/topology).
Suspected domains
Topology EMC Policy
Symptom · Intermittent CRC / frame errors (scope “looks OK”)
Primary signals
  • Error deltas per short window: errors / 10 s and errors / 100 ms (placeholder).
  • Correlation with utilization: spikes in burst P99 often expose contention/retry amplification.
  • Environmental correlation: temperature bucket / power-state transitions (context fields only).
Minimal field set
err_delta_short util_burst_P99 ts_quality temp_bucket power_state
Suspected domains
Topology EMC Node behavior
Symptom · False wake / wake storm (no obvious frames)
Primary signals
  • Wake evidence: wake flags + filter-hit counters (if selective wake is used).
  • Attribution confidence: prefer hardware flags over software inference.
  • Pre/post snapshots: capture Tpre/Tpost around wake event (placeholder).
Minimal field set
event_wake wake_source wake_evidence wake_rate power_state
Suspected domains
Policy EMC Topology
Symptom · Diagnostic session intermittent timeout (field-level only)
Primary signals
  • Transport-facing counters: timeout_count per window_Y (placeholder).
  • Bus utilization bursts around timeouts (queue starvation often appears as burst congestion).
  • Reset/power transitions near timeout windows (to rule out power-induced dropouts).
Minimal field set
timeout_delta util_burst_P99 ts_mono + window_Y power_state
Suspected domains
Node behavior Topology Policy
Normalization rules (prevents misleading logs)
  • Always store deltas (Δcounter) with window_Y, not just absolute counters.
  • Always declare the denominator (per time, per frames, per active-time) to avoid incomparable rates.
  • Every attribution must include evidence and a confidence label (high/medium/low).
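The denominator rule matters because the same delta yields very different rates under different normalizations. A minimal sketch, with placeholder numbers, shows why the denominator must travel with the record:

```python
# Sketch: one counter delta, three denominators. Without the denominator field,
# these rates are incomparable across ECUs and tools.

def normalized_record(delta, window_s, active_time_s, frames):
    return {
        "delta": delta,
        "window_s": window_s,
        "per_second": delta / window_s,
        "per_active_second": delta / active_time_s,
        "per_1000_frames": 1000 * delta / frames,
    }

rec = normalized_record(delta=12, window_s=60, active_time_s=20, frames=4800)
```

Here per_second is 0.2 but per_active_second is 0.6: the bus was only active for a third of the window, so a wall-clock-only rate understates the stress threefold. That is the confusion the normalization rules prevent.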
Diagram · Attribution funnel (symptoms → signals → suspected domains)
Attribution funnel: symptoms (bus-off / passive, intermittent errors, false wake / storm, timeout complaints) flow into three evidence lanes (counters: ΔTEC/ΔREC · states; utilization: avg · peak · burst; wake evidence: flags · filter hit), which feed the suspected-domain buckets Topology, EMC, Node behavior, and Policy. Next steps: verify on real harness, correlate with environment, audit tables.

Observability Tap Points

Intent

Map where diagnostic evidence originates across Transceiver / Controller / MCU / Power-SBC / Monitor. The goal is consistent evidence capture and correlation—not protocol or RTOS tutorials.

Evidence hierarchy (prevents “guessing”)
  1. Hardware-latched flags (wake, dominant timeout, thermal) — highest confidence.
  2. Controller counters + state transitions (ΔTEC/ΔREC, passive/bus-off) — measurable trends.
  3. Software context (queue depth, ISR latency, log drops) — correlation and amplification clues.
Rule: attribution must cite evidence fields, not inference

Lower-level signals can support correlation, but root-cause direction requires higher-confidence evidence whenever available.

Sampling modes (choose by intent)
Periodic Event-trigger Edge / Latched
  • Trends: periodic sampling (Δcounters, utilization average).
  • Replay: event-trigger snapshots (bus-off, wake, reset).
  • Short pulses: edge/latched capture (wake flags, dominant timeout).
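The edge/latched rule has one ordering trap worth spelling out: if reading the register clears the latch, the timestamp must be taken before the read. A minimal sketch with a mocked transceiver register (real driver APIs differ):

```python
# Sketch of the edge/latched sampling rule: stamp ts_mono FIRST, then perform
# the read-clears-flag access, so short pulses keep usable timestamps.
# MockTransceiver is a stand-in, not a real driver API.

import time

class MockTransceiver:
    def __init__(self):
        self.latched_wake = True        # hardware latched a short wake pulse
    def read_and_clear_wake_flag(self):
        v, self.latched_wake = self.latched_wake, False  # read clears latch
        return v

def sample_wake_flag(xcvr, events):
    ts_mono = time.monotonic()           # stamp first ...
    if xcvr.read_and_clear_wake_flag():  # ... then read (which clears)
        events.append({"event": "wake_flag", "ts_mono": ts_mono})

events = []
x = MockTransceiver()
sample_wake_flag(x, events)
sample_wake_flag(x, events)              # latch already cleared: no duplicate
```

Reversing the order (read, then stamp) is exactly the pitfall flagged later for PHY flags: the flag is consumed before any timestamp exists.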
Correlation keys (must exist across layers)

Without these keys, logs become fragments and serviceability collapses under power cycles or multi-ECU correlation.

bus_id node_id ts_mono ts_quality event_id power_state reset_reason
Data sources (cards, not tables)

Each card declares what the layer can prove, what it cannot, and how to sample it for trend vs replay.

PHY / Transceiver
Provides
error_flags wake_flags dominant_timeout thermal_flag mode_state
Cannot prove

Exact waveform/termination root-cause or which node generated a specific frame error (handled by PHY/EMC sibling pages).

Sampling rule
  • Flags: latched or interrupt-driven reads (avoid missing short pulses).
  • Thermal/mode: low-rate periodic sampling (trend context, not high bandwidth).
Correlation keys
bus_id transceiver_id ts_mono
Common pitfalls
  • Reading flags clears them without first stamping ts_mono.
  • Logging “flag occurred” without duration/recovery evidence.
Controller
Provides
TEC/REC error_state state_transitions filter_stats ts_hook
Cannot prove

Harness topology and EMC mechanisms directly; only measurable signatures and correlations are available here.

Sampling rule
  • Counters: store Δcounter per window_Y (never absolute-only).
  • Transitions: event-trigger snapshots on passive/bus-off entry and recovery.
Correlation keys
bus_id node_id ts_mono window_Y
Common pitfalls
  • Storing absolute counters without Δ and window leads to incomparable reports.
  • Missing state transition history makes “counter jumps” uninterpretable.
MCU / RTOS (fields to record only)
Provides
queue_depth ISR_latency CPU_load log_drop
Cannot prove

Physical-layer faults; MCU signals explain amplification (starvation, backlog) but do not replace bus evidence.

Sampling rule
  • Periodic: queue depth and CPU load for trends.
  • Event-trigger: timeout bursts, queue overflow, log-drop spikes.
Correlation keys
ts_mono event_id
Common pitfalls
  • Missing log-drop counters creates false “no issue observed” narratives.
  • Treating software timeouts as root-cause instead of correlating with bus/power evidence.
Power / SBC
Provides
reset_reason VBAT_dip_events wake_policy_state ign_state
Sampling rule
  • Reset reason: capture at boot immediately (latched at startup).
  • VBAT dips: event-trigger with threshold_X placeholder and persistence policy.
Common pitfalls
  • No boot_count/event sequencing prevents correlation across power cycles.
  • Storing only “reset happened” without reason classification reduces service value.
Optional Monitor IC
Provides
open_short_events overvoltage_count imbalance_flag
Sampling rule

Low-rate trend sampling plus event-trigger capture for exceptions; store counts and durations to support severity ranking.

Rate budget (avoid flooding flash)
  • Always-on (low rate): power_state, controller state, utilization average.
  • Event-only: passive/bus-off entry, wake, reset, dominant timeout.
  • Burst capture: short-window utilization P95/P99 and short-window Δcounters in a RAM ring buffer.
Diagram · Tap-point block map (Bus → Transceiver → Controller → MCU → Logger → NVM)
Tap-point block map: blocks run from the bus (CAN/LIN/FR) through the transceiver (flags · wake · thermal · dominant timeout), controller (TEC/REC · state · filters), MCU (queue · latency · drops), and logger (trend · snapshot) to NVM checkpoints; side blocks for Power/SBC (reset · VBAT · policy) and the monitor IC (open/short · OV) feed the logger. Solid arrows carry event/state evidence; side arrows carry power/monitor evidence.

Error Counters Deep Dive

Intent

Turn counters into an evidence chain: Δcounters + window + denominator bound to state transitions. This prevents “looking at numbers without attribution”.

Counter reading pipeline (use this every time)
  1. Sample counters at t0, t1… then compute Δcounter / window_Y.
  2. Bind deltas to state transitions (active ↔ passive ↔ bus-off).
  3. Anchor with events (wake / reset / dominant timeout) when present.
  4. Correlate with utilization burst (P95/P99) to detect contention amplification.
  5. Report severity + suspected domain bucket (Topology / EMC / Node / Policy).
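The five steps above can be sketched as one triage function. Thresholds, field names, and the severity mapping are placeholders; the point is the shape of the record, which always carries delta, window, state transitions, event anchors, and burst context together.

```python
# Sketch of the counter reading pipeline: delta per window, bound to state
# transitions and event anchors, correlated with burst P99, then bucketed.
# The 1.0/s and 80% thresholds below are illustrative placeholders.

def triage(tec_t0, tec_t1, window_s, state_transitions, events, util_p99):
    d_tec = tec_t1 - tec_t0                          # 1. delta / window
    record = {"d_tec": d_tec, "window_s": window_s,
              "transitions": state_transitions,      # 2. bind to states
              "anchors": events,                     # 3. event anchors
              "util_p99": util_p99}                  # 4. burst correlation
    # 5. severity placeholder mapping (suspected-domain hint comes next)
    if "bus_off" in state_transitions:
        record["severity"] = "critical"
    elif d_tec / window_s > 1.0 or util_p99 > 80.0:
        record["severity"] = "error"
    else:
        record["severity"] = "info"
    return record

r = triage(10, 40, window_s=10, state_transitions=["active->passive"],
           events=["wake"], util_p99=92.0)
```

With ΔTEC = 30 over 10 s and a 92% burst P99, this lands at "error" rather than "critical", because no bus-off transition was captured.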
CAN (bind counters to state)
Key fields to log
TEC REC error_state state_transition event_busoff recovery_count
Interpretation rules
  • Use ΔTEC/ΔREC over window_Y; absolute values alone are not actionable.
  • State entry timestamps (passive/bus-off) must be captured as events and paired with freeze-frame snapshots.
  • After recovery, track whether counters decay and stabilize or re-escalate quickly (policy vs burst signatures).
LIN & FlexRay (abstract, field-oriented)
LIN

Record error counts by category (header/response/sync/timeout), then normalize by window_Y and denom_Z. The objective is consistent trend + event anchoring, not protocol walkthrough.

lin_err_delta timeout_delta window_Y denom_Z
FlexRay

Split evidence by channel (A/B) and by error class; store deltas and event anchors. Channel asymmetry is a serviceable clue even without waveform deep dives.

fr_err_A_delta fr_err_B_delta sync_issue_delta window_Y
Pattern classifier (fast triage)

Counters are most useful when categorized by time behavior; each pattern suggests different next measurements.

Slow degradation
  • Signature: Δcounter rises steadily across long windows.
  • Quick check: correlation with temperature / power state transitions.
  • Next: trend-first verification; avoid single-shot conclusions.
Burst event
  • Signature: large short-window Δcounter + utilization burst P99 spikes.
  • Quick check: align bursts with bus-off/passive events.
  • Next: capture freeze-frame around events; validate on real harness.
Periodic policy error
  • Signature: repeating state flaps or recurrent bus-off with stable period.
  • Quick check: align with wake/recovery thresholds and policy state changes.
  • Next: audit debounce/filter/recovery criteria; require evidence fields.
Deliverable · Counter dictionary (field-ready)

Each counter definition must include source, units, sampling method, window_Y, denom_Z, threshold_X placeholders, and severity mapping.

counter_name scope source_layer unit read_method window_Y denom_Z threshold_X severity_map

Reporting rule: every counter report must include Δ + window + denominator + state + event anchor.

Diagram · State + counters binding (from numbers to attribution)
State and counters binding: state bubbles for active (baseline Δ), passive (ΔREC/ΔTEC rising), and bus-off (event anchor) connect via enter/escalate/recover arrows; ΔTEC per window_Y and ΔREC per denom_Z feed event anchors (wake/reset/DTO) and the pattern classifier (slow · burst · periodic). Use Δcounters + state transitions + event anchors to drive suspected-domain next steps; no waveform deep dives here.

Bus Utilization Metrics

Intent

Convert “the bus is busy” into comparable, calculable metrics. Define windowed utilization, burst behavior, and overhead shares so bench and in-vehicle measurements can be explained without waveform-level discussion.

Definitions (must be fixed for comparability)

Every utilization number must declare window_Y and a clear denominator. Without this, averages cannot be compared across ECUs, tools, or drives.

window_Y short_window_s busy_time denom_Z capture_point ts_quality
Metric set (load shape + overhead shares)

Separate load shape (avg/peak/burst) from overhead (retransmissions/errors/arbitration loss). Mixing them hides the reason why real harness behavior diverges from bench.

Load shape
avg_util_% peak_util_% p95_util_% p99_util_%
Overhead shares
retrans_share_% error_frame_share_% arb_loss_share_% (optional)
Impact of frame size / rate modes (calculation only)

Different payload sizes and modes change frame_time and therefore busy_time. This page standardizes the accounting path and avoids PHY-level explanations.

Accounting model

busy_time = Σ frame_time(i) + Σ retry_time(i) + overhead

util_% = busy_time / window_Y × 100

Rule

Reports must declare the source of frame_time (controller/sniffer/gateway) and keep windowing consistent.
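The accounting model above is simple enough to sketch directly; the frame and retry times below are illustrative, and a real report would also record where frame_time came from (controller, sniffer, or gateway).

```python
# Sketch of the accounting model:
#   busy_time = sum(frame_time) + sum(retry_time) + overhead
#   util_%    = busy_time / window_Y * 100

def utilization_pct(frame_times_s, retry_times_s, overhead_s, window_s):
    busy = sum(frame_times_s) + sum(retry_times_s) + overhead_s
    return 100.0 * busy / window_s

# 4000 frames of 100 us, 200 retries of 100 us, 20 ms overhead in a 1 s window:
u = utilization_pct([100e-6] * 4000, [100e-6] * 200, 0.020, 1.0)
# busy_time = 0.4 + 0.02 + 0.02 = 0.44 s, i.e. 44% utilization
```

Note that retries and overhead contribute roughly 4 percentage points here; dropping them is the "ignoring retransmissions" pitfall listed below.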

Collection methods (bias sources)
Controller statistics
  • Strength: ties utilization to local state/counters and retry context.
  • Bias: partial view (local ECU perspective) may not reflect network-wide busy time.
  • Best for: accountability and local amplification analysis.
External sniffer
  • Strength: closer to actual bus busy time and burst behavior.
  • Bias: limited internal context (cannot see ECU queues or local retry intent).
  • Best for: network health and peak/burst characterization on real harness.
Gateway mirror
  • Strength: multi-segment coverage and system-level correlation.
  • Bias: mirror path can introduce resampling and timestamp quality issues.
  • Best for: serviceability and cross-domain correlation with wake/diagnostics.
Common pitfalls (and how to avoid them)
Average-only reporting
  • Quick check: always include peak and P99 for the same capture window.
  • Fix: define short_window_s and report burst percentiles.
  • Pass criteria: P99 < X% for Y minutes (placeholder).
Window too large hides bursts
  • Quick check: compare window_Y average vs short-window peak.
  • Fix: include rolling short windows and percentile reporting.
  • Pass criteria: peak within X% above baseline (placeholder).
Ignoring retransmissions/errors
  • Quick check: report error_frame_share_% and retrans_share_% alongside utilization.
  • Fix: separate overhead shares from load shape metrics.
  • Pass criteria: retrans_share_% < X% and error_frame_share_% < X% (placeholder).
Deliverable · Utilization report template

A report without capture_point and windowing metadata is non-comparable and not serviceable.

window_Y short_window_s avg/peak/P95/P99 error_frame_share_% retrans_share_% threshold_X capture_point ts_quality
Diagram · Timeline utilization accounting (frames, gaps, retransmissions, windows)
Timeline utilization accounting Frames appear as blocks on a timeline with idle gaps and retransmission markers; windows for utilization calculation are shown below. Timeline gap gap gap gap gap Retry Retry window_Y avg_util_% = busy_time / window_Y short_window_s peak/p99 from rolling windows Outputs Avg Peak P99 Overhead shares: retrans + error frames

Wake Event Attribution

Intent

Define a wake evidence chain with explicit source classification and confidence so false-wake and wake-storm issues become measurable and serviceable.

Wake classes (fixed taxonomy)

A wake event must be classified using one of the fixed sources below; the schema records evidence fields and confidence.

bus_wake local_wake timed_wake power_event_wake
Attribution priority (prevents “guessing”)
  1. Hardware flags (transceiver wake flags, power flags) — highest confidence.
  2. Controller state (bus state, counters/state transitions) — medium confidence.
  3. Software inference (queues, timers, application hints) — support only.
Rules
  • If hardware flags conflict with inference, hardware wins and the conflict is recorded.
  • If hardware evidence is missing, confidence is downgraded and the evidence_fields list must explain why.
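The priority and both rules can be captured in a small classifier sketch. Field names follow this page's schema placeholders; the wake-class strings match the fixed taxonomy above.

```python
# Sketch of attribution priority: hardware flags win, conflicts with software
# inference are recorded (not masked), and missing hardware evidence forces a
# confidence downgrade with an explanation.

def attribute_wake(hw_flags, sw_inference):
    if hw_flags:                                   # 1. hardware: highest
        rec = {"source": hw_flags[0], "confidence": "high",
               "evidence_fields": list(hw_flags)}
        if sw_inference and sw_inference != hw_flags[0]:
            rec["conflict_with_inference"] = sw_inference  # record conflict
        return rec
    if sw_inference:                               # 3. inference: support only
        return {"source": sw_inference, "confidence": "low",
                "evidence_fields": [],
                "note": "no hardware flags latched"}
    return {"source": "unknown", "confidence": "low", "evidence_fields": []}

r1 = attribute_wake(hw_flags=["bus_wake"], sw_inference="timed_wake")
r2 = attribute_wake(hw_flags=[], sw_inference="timed_wake")
```

r1 keeps the hardware verdict (bus_wake, high confidence) while preserving the disagreeing inference for audit; r2 is attributed but explicitly downgraded and explained, which is what serviceability requires.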
Selective wake (logging/accounting only)

Record filter identity and outcomes so false-wake rate becomes measurable without detailing the standard or filter algorithms.

filter_id/hash sensitivity_profile match_count false_wake_rate (denom required)
Pre/Post windows (snapshot contract)

The black-box capture must reserve a ring buffer covering Tpre and Tpost around wake. Without this, attribution becomes post-hoc inference.

t_pre_s (placeholder) t_post_s (placeholder) ring_buffer_depth snapshot_trigger
Deliverable · Wake attribution schema

The schema stores classification, confidence, and evidence fields; it also records debounce and policy state for service replay.

event_id ts_mono bus_id node_id source confidence evidence_fields[] debounce policy_state
Diagram · Wake decision tree (detect → evidence → classify → snapshot → record)
Wake decision tree: wake detected (event_id + ts_mono) → read hardware flags (transceiver + power, highest confidence) → match filter (filter_id/hash) → classify source (bus_wake / local_wake / timed_wake / power_event) with confidence + evidence_fields[] → trigger snapshot (Tpre + Tpost ring buffer) → write record (source + confidence). Rule: hardware evidence overrides inference; missing flags must downgrade confidence and explain gaps.

Wake-event Black Box Design

Intent

Provide a replayable black-box architecture: continuous sampling, event-triggered freeze-frame, and power-loss retention. The goal is service-grade reconstruction, not raw log dumping.

Architecture (layers and responsibilities)

Separate continuous sampling from event records so storage and endurance remain controlled.

Sampling ring (RAM)
  • High-rate, low-cost, overwrite-friendly samples covering Tpre/Tpost windows.
  • Stores utilization snapshots, counter deltas, power flags, and minimal state changes.
  • Allows decimation and delta encoding to keep bandwidth predictable.
Event ring (RAM or NVM-backed)
  • Low-rate, searchable event records with unique event_id and schema version.
  • Each event carries freeze-frame fields and pointers/summaries of sampled slices.
  • Supports event merging when multiple triggers occur in one anchor window.
Checkpoint (retention)
  • Periodic or trigger-driven commits to non-volatile storage for power-loss survival.
  • Must track write budget and batch updates to avoid excessive wear.
  • Stores event ring plus critical summaries for fast service extraction.
Export interface
  • Exports structured event records; avoids requiring manual log reading.
  • Must include capture windows, time-quality fields, and filter identity where applicable.
  • Allows sorting and searching by trigger, bus_id, node_id, and confidence.
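The sampling-ring responsibility can be sketched with a fixed-depth buffer from which a trigger extracts its Tpre/Tpost slice by timestamp. Depth and window values below are placeholders.

```python
# Sketch of the sampling ring: overwrite-friendly fixed depth, with slice
# extraction around a trigger timestamp for the freeze-frame.

from collections import deque

class SamplingRing:
    def __init__(self, depth):
        self.buf = deque(maxlen=depth)   # old samples overwritten silently
    def push(self, ts_mono, sample):
        self.buf.append((ts_mono, sample))
    def slice(self, t_event, t_pre, t_post):
        return [(t, s) for t, s in self.buf
                if t_event - t_pre <= t <= t_event + t_post]

ring = SamplingRing(depth=100)
for t in range(200):                     # 200 pushes, only the last 100 kept
    ring.push(t, {"util_pct": t % 10})
freeze = ring.slice(t_event=190, t_pre=5, t_post=3)
```

The ring holds only samples 100..199 after the loop, and the trigger at t = 190 recovers its 9-sample Tpre/Tpost slice (185..193); everything older was overwritten cheaply, which is the intended behavior for the RAM layer.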
Triggers (trigger → freeze-frame → slice capture)

Every trigger defines what gets frozen and how much pre/post context is extracted from the sampling ring.

Network + wake
wake bus-off error_passive_entry
  • Freeze: state + counters + utilization snapshot + wake attribution.
  • Capture: Tpre and Tpost slices (placeholders) from sampling ring.
Power + safety
reset VBAT_dip thermal watchdog
  • Freeze: power_state + reset_reason + brownout/thermal flags.
  • Capture: immediate pre-reset slice and early-boot slice for correlation.
Event merge rule

If multiple triggers occur within one correlation anchor window, merge into one composite record and store all trigger flags to avoid duplicated NVM writes.
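A minimal sketch of that merge rule, assuming time-sorted triggers and a placeholder anchor window:

```python
# Sketch of the event merge rule: triggers within one correlation anchor
# window collapse into a single composite record carrying all trigger flags,
# avoiding duplicated NVM writes. anchor_window_s is a placeholder.

def merge_triggers(triggers, anchor_window_s):
    """triggers: list of (ts_mono, trigger_type), sorted by time."""
    merged = []
    for ts, kind in triggers:
        if merged and ts - merged[-1]["ts_first"] <= anchor_window_s:
            merged[-1]["trigger_flags"].add(kind)   # same anchor window
        else:
            merged.append({"ts_first": ts, "trigger_flags": {kind}})
    return merged

composite = merge_triggers(
    [(10.00, "wake"), (10.02, "VBAT_dip"), (10.05, "reset"),
     (25.00, "bus_off")],
    anchor_window_s=0.1)
```

A wake, a VBAT dip, and a reset within 50 ms become one composite record with three trigger flags; the bus-off 15 s later stays separate. One checkpoint write instead of three.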

Data volume and endurance (engineering trade-offs)
Sampling strategy
  • Prefer deltas for counters; prefer window summaries for utilization (avg/peak/p99/shares).
  • Record state transitions instead of repeated identical states.
  • Allow decimation under CPU pressure; preserve triggers and freeze-frames.
NVM write budget

Retention must track write frequency and byte volume. A black box that wears out storage is not production-safe.

writes_per_hour bytes_per_write daily_bytes lifetime_est (placeholder)
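The budget fields above reduce to simple arithmetic, sketched here with illustrative numbers (not datasheet values):

```python
# Sketch of the write-budget arithmetic: daily byte volume and a coarse
# lifetime estimate against a total endurance budget. All inputs are
# placeholders for the page's threshold fields.

def nvm_budget(writes_per_hour, bytes_per_write, budget_total_bytes):
    daily_bytes = writes_per_hour * 24 * bytes_per_write
    lifetime_days = budget_total_bytes / daily_bytes
    return daily_bytes, lifetime_days

daily, days = nvm_budget(writes_per_hour=12, bytes_per_write=512,
                         budget_total_bytes=10 * 1024**3)  # 10 GiB budget
# daily = 147456 bytes (~144 KiB/day); lifetime ~72800 days at this rate
```

The estimate scales linearly: doubling either write rate or record size halves the lifetime, which is why batch commits and event merging matter.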
Configurations (MVP vs Enhanced)
MVP black box
  • One sampling ring + one event ring.
  • Triggers: wake, reset, bus-off, VBAT_dip.
  • Retention: event ring + last-N event summaries.
Enhanced black box
  • Dual sampling (fast ring + slow ring) with event slice references.
  • Triggers include error_passive_entry, thermal, watchdog.
  • Retention includes write budget stats and correlation anchors for cross-ECU replay.
Deliverable · Event record field list

Event records are the minimal searchable unit for service replay.

event_id schema_ver bus_id node_id ts_mono ts_quality trigger_type merge_id (optional) state counters util_snapshot wake_attribution power_state reset_reason diag_hint
Diagram · Dual-ring black box (sampling ring + event ring + checkpoint)
Dual-ring black box architecture: triggers (bus-off, error_passive, wake, reset, VBAT_dip, thermal/WDT) feed the sampling ring (RAM, overwrite-friendly: util + counters + power) and the event ring (searchable records: event_id, freeze-frame, ts_quality, evidence); freeze-frames capture Tpre + Tpost, and the checkpoint commits batches to NVM under a write budget. Principle: sampling is overwrite-friendly; events are searchable; retention must track endurance.

Timestamp & Correlation

Intent

Make logs correlatable across ECUs by recording time base and time quality. Many field failures remain unsolved because time is not trusted or not aligned.

Time bases (what they can and cannot solve)
Monotonic time
  • Best for ordering events within one ECU.
  • Requires boot_count and uptime_ms for post-reboot reasoning.
  • Not directly comparable across ECUs without anchors.
RTC time
  • Human-readable and survives reboot if valid.
  • Must record rtc_valid and drift_est to avoid false precision.
  • Useful as a coarse cross-ECU reference when aligned time is unavailable.
Aligned time (gateway/system)
  • Primary method for cross-ECU correlation.
  • Must record sync_status and offset_uncertainty to quantify reliability.
  • Anchors event streams by generating correlation anchor IDs.
Deliverable · Time quality block (attach to every key event)

Each key event must carry time base and quality fields; correlation quality should be derived into a single ts_quality label.

time_source sync_status drift_est (placeholder) last_sync_age_s (placeholder) offset_uncertainty_ms (placeholder) boot_count uptime_ms ts_quality
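Deriving the single ts_quality label from those fields can be sketched as a small tiering function. The tiers and the 1 ms / 60 s limits below are illustrative placeholders for this page's threshold_X-style values.

```python
# Sketch: collapse the time-quality block into one ts_quality label.
# Thresholds are placeholders, not recommendations.

def ts_quality(sync_status, last_sync_age_s, offset_uncertainty_ms):
    if sync_status == "synced" and offset_uncertainty_ms <= 1.0:
        return "high"      # safe for cross-ECU ordering
    if sync_status == "synced" or last_sync_age_s <= 60:
        return "medium"    # usable with stated uncertainty
    return "low"           # local ordering only; downgrade correlation

q_anchor = ts_quality("synced", last_sync_age_s=2, offset_uncertainty_ms=0.3)
q_holdover = ts_quality("holdover", last_sync_age_s=30,
                        offset_uncertainty_ms=5.0)
q_drifted = ts_quality("never_synced", last_sync_age_s=9999,
                       offset_uncertainty_ms=50.0)
```

The important property is that the label is derived deterministically from recorded fields, so a service report can penalize low-quality time without re-reading raw sync state.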
Cross-ECU correlation (priority order)
A. Gateway anchors
  • Gateway emits anchor_id with aligned time.
  • ECUs record anchor_id + local ts_mono for replay.
  • Best correlation strength for system-level reconstruction.
B. Event sequence numbers
  • Use monotonic event sequence for key triggers.
  • Carry sequence through mirrored or forwarded reports.
  • Works even when aligned time is unavailable.
C. Relative window matching
  • Match events by relative Tpre/Tpost patterns and trigger signatures.
  • Always downgrade confidence and record matching uncertainty.
  • Use as fallback when anchors and sequences are missing.
Diagram · Multi-clock alignment (ECU A/B → gateway anchor → correlation timeline)
Multi-clock alignment and correlation: ECU A and ECU B each keep a local monotonic clock with boot_count + uptime; the gateway emits anchor_id with aligned time, and a time-quality block (time_source, sync_status, offset_uncertainty, ts_quality) qualifies each record on the correlation timeline. Rule: local monotonic time orders events within an ECU; gateway anchors enable cross-ECU replay; always attach time-quality fields.

Diagnostics Reporting Workflow

Intent

Convert raw counters, utilization, and wake evidence into a service-grade report: concise, actionable, and confidence-tagged.

Scope guard
  • Focus: thresholds, debouncing, severity, readability, and report structure.
  • Not included: protocol tutorials, repair-manual encyclopedias, or EMC root-cause explanations.
  • Output style: summary + evidence + next measurement direction.
Inputs (normalized sources)
counters states utilization wake_attribution ts_quality event_records
Workflow pipeline

Reporting quality depends on consistent normalization, explicit thresholds, and debouncing that prevents alert storms.

normalize thresholds debounce classify report
Normalized alert unit (field-level contract)

Use a consistent contract so every downstream rule is deterministic: metric_id, window, value, bus_id/node_id, time_ref, ts_quality.

Thresholds and debouncing

Thresholds must be window-based and debounced to prevent noisy service outputs.

Threshold definitions
N / window hold_time cooldown trend_slope
  • Burst-type: N occurrences per W seconds (both are placeholders, like threshold_X / window_Y).
  • Degradation-type: slope over T minutes; baseline and peak tracked per bus_id.
  • Policy-type: periodic patterns (wake storms) tracked by recurrence interval.
Debounce and suppression rules
  • burst grouping: collapse multiple same-type events into one burst summary within short windows.
  • hysteresis: separate enter/exit conditions to avoid threshold flapping.
  • rate limiting: cap repeated reports per hour while keeping evidence counters.
  • merge-by-anchor: within one correlation anchor window, merge triggers to minimize NVM writes.
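Hysteresis plus rate limiting can be sketched together in one small alert object. Thresholds and the cooldown below are placeholders; note that the evidence counter keeps incrementing even while reports are suppressed.

```python
# Sketch: enter/exit hysteresis prevents threshold flapping; a cooldown caps
# repeated reports per alert while evidence counters keep accumulating.

class DebouncedAlert:
    def __init__(self, enter_at, exit_at, cooldown_s):
        self.enter_at, self.exit_at = enter_at, exit_at
        self.cooldown_s = cooldown_s
        self.active, self.last_report, self.evidence_count = False, -1e9, 0

    def update(self, ts, value):
        reported = False
        if not self.active and value >= self.enter_at:
            self.active = True
        elif self.active and value < self.exit_at:   # lower exit threshold
            self.active = False
        if self.active:
            self.evidence_count += 1                 # evidence always kept
            if ts - self.last_report >= self.cooldown_s:
                self.last_report, reported = ts, True  # rate-limited report
        return reported

a = DebouncedAlert(enter_at=10, exit_at=5, cooldown_s=60)
reports = [a.update(t, v) for t, v in
           [(0, 12), (1, 8), (2, 11), (70, 12), (71, 4), (72, 8)]]
```

Six noisy samples produce exactly two reports (at t = 0 and t = 70): values of 8 do not re-trigger while active (hysteresis), the 60 s cooldown suppresses repeats, and the drop to 4 cleanly exits the condition. Four evidence counts survive for the report.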
Severity and confidence
Severity levels
info warn error critical
  • critical: bus-off, persistent wake storm, repeated resets impacting availability.
  • error: error passive entry, sustained error rate beyond threshold.
  • warn: rising retransmission share or utilization peaks trending upward.
  • info: isolated anomalies preserved for correlation and trending.
Confidence scoring (evidence priority)

Service output must state confidence; do not imply certainty when evidence is incomplete.

  • Priority: hardware flags > controller states > software inference.
  • Penalize when ts_quality is poor or required evidence fields are missing.
  • Expose confidence as a first-class report field.
Readability rule (mandatory 4 answers)

Every alert must answer these questions in order, using short lines and explicit evidence references.

  1. What happened — event type + time + impacted bus/node.
  2. How severe — severity + threshold condition (placeholders X/Y).
  3. What evidence — event_id + key counters/util snapshots + ts_quality.
  4. What next to measure — measurement direction, not a repair manual.
Deliverable · Service report template

This template is designed for fast triage: top events, trends, wake summary, and data-quality confidence.

Top 5 events
  • Rank by severity × confidence and show event_id for retrieval.
  • Include bus_id, node_id, and time reference (mono/aligned).
Counter trend
baseline slope peak window
Utilization trend
avg peak p95/p99 retrans_share error_share
Wake summary
  • Break down by source: bus / local / timed / power.
  • Show confidence and missing evidence fields to avoid guess-based attribution.
Confidence and data quality
  • ts_quality distribution and last_sync_age_s (placeholder).
  • logger drop counters and overflow markers.
  • evidence completeness rate for key event types.
Diagram · Report generation pipeline (raw signals → normalize → debounce → classify → service report)
Report generation workflow: raw signals (counters, states, utilization, wake evidence, ts_quality) are normalized (metric_id), thresholded (N / window), debounced (cooldown), classified (severity + confidence), and assembled into a service report (Top 5 events, trends, wake summary, confidence) under the four-question readability rule. Principle: service output must be concise, evidence-linked, and confidence-tagged.

Verification & Fault Injection Plan

Intent

Validate that logging, black-box capture, retention, and correlation remain trustworthy under worst-case conditions.

Scope guard
  • Focus: evidence capture quality, time integrity, retention, and write endurance.
  • Not included: detailed EMC methods or protocol-level conformance.
  • Matrix style: injection → expected evidence → pass criteria.
Fault injection categories (log-centric)

Each category is evaluated by whether the expected event records and report outputs are produced, not by physical-layer explanations.

Electrical fault class
short/open · undervoltage · disturbance
  • Expected: state transitions, counter bursts, and correct severity classification.
  • Must include event_id + freeze-frame + ts_quality fields.
System behavior class
sleep/wake · topology · temperature
  • Expected: utilization peaks and recurrence patterns appear in trends.
  • Wake attribution records evidence completeness and confidence levels.
Verification dimensions (end-to-end evidence chain)
A. Trigger coverage
  • No missed triggers for injected conditions (allowing defined merge rules).
  • Freeze-frame contains required fields and correct pre/post slice references.
B. Time integrity
  • ts_mono remains monotonic; boot_count increments correctly.
  • Cross-ECU correlation uses anchors when available; otherwise confidence is downgraded.
C. Retention under power loss
  • Checkpoint survives power cycling at multiple cut points (during write, post-trigger, pre-export).
  • Partial-write markers are detectable; recovery yields consistent event ring contents.
D. Endurance and performance
  • Write budget remains within targets; write amplification remains bounded (placeholder).
  • Under storms, sampling may decimate but key triggers and reports remain correct.
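Dimension B (time integrity) is cheap to verify offline against an exported dump. A minimal sketch, assuming records carry the illustrative fields ts_mono and boot_count; the check mirrors the contract stated above: ts_mono must be monotonic within a boot, and boot_count must never decrease:

```python
def check_time_integrity(records):
    """Scan an ordered event dump for time-contract violations (sketch).

    records: list of dicts with ts_mono (seconds since boot) and boot_count.
    Returns a list of (index, reason) tuples; an empty list means pass.
    """
    violations = []
    prev = None
    for i, rec in enumerate(records):
        if prev is not None:
            if rec["boot_count"] < prev["boot_count"]:
                violations.append((i, "boot_count went backwards"))
            elif (rec["boot_count"] == prev["boot_count"]
                  and rec["ts_mono"] < prev["ts_mono"]):
                violations.append((i, "ts_mono not monotonic within boot"))
        prev = rec
    return violations
```

The pass criterion maps directly: `check_time_integrity(dump) == []` is "monotonic violations = 0".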
Pass criteria (quantified KPIs)

Use explicit metric definitions so results are comparable across benches, vehicles, and software revisions.

miss_rate < X · false_rate < X · corr_rate > X · write_amp < X · 4Q_compliance > X
  • miss_rate: injected triggers are captured (merge rules applied deterministically).
  • false_rate: rate of critical/error reports raised under a clean baseline; must stay below threshold.
  • corr_rate: events align across ECUs within target uncertainty tiers.
  • write_amp: NVM bytes/writes per event remain bounded; batch commits effective.
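The four KPI definitions above reduce to simple ratios once the raw counts are collected. A sketch with illustrative parameter names, so results stay comparable across benches and software revisions:

```python
def compute_kpis(injected, captured, clean_alerts, clean_windows,
                 correlated, total_events, nvm_bytes, event_bytes):
    """Compute the pass-criteria KPIs from raw bench counts (sketch).

    injected/captured:          injected triggers vs captured trigger records
    clean_alerts/clean_windows: critical alerts raised vs observation windows
                                on a clean baseline
    correlated/total_events:    cross-ECU events aligned within tolerance vs total
    nvm_bytes/event_bytes:      physical NVM bytes written vs logical event bytes
    """
    return {
        "miss_rate":  1.0 - captured / injected,
        "false_rate": clean_alerts / clean_windows,
        "corr_rate":  correlated / total_events,
        "write_amp":  nvm_bytes / event_bytes,
    }
```

Keeping the denominators explicit in the function signature is the point: a KPI without its window or denominator definition is not comparable across runs.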
Deliverable · Test matrix (card-list format)

Represent each test as a compact, scannable card so the matrix stays readable on mobile and each test intent remains individually indexable.

Test ID T-01 (example)
  • Injection: short/open or disturbance (abstract).
  • Expected evidence: state change + counter burst + event_id + ts_quality.
  • Expected report: severity and 4Q output populated.
  • Pass: miss_rate < X; false_rate < X.
Test ID T-02 (example)
  • Injection: power cycling at multiple cut points.
  • Expected evidence: checkpoint consistency + recoverable event ring.
  • Expected report: top events remain retrievable by event_id.
  • Pass: retention success rate > X; write_amp < X.
Diagram · Verification bench (DUT + injector + analyzer + power cycling + logger dump + KPI scoring)
Verification bench for diagnostics and logging DUT is stimulated by fault injector and power cycling, observed by bus analyzer, exported via logger dump, and scored by KPIs. DUT ECU + logger black box Fault injector short/open undervoltage disturbance Power cycling cut points Bus analyzer mirror/observe Logger dump export records event_id list report output KPI scoring miss false corr write_amp Rule: validate evidence capture, time integrity, retention, and endurance—so reports stay trustworthy in the field.

Engineering Checklist (Design → Bring-up → Production → Service)

Intent

Turn counters, utilization, wake attribution, black-box capture, and time correlation into gate-by-gate actions with measurable pass criteria.

Scope guard
  • Focus: schema/thresholds/triggers, evidence completeness, exportability, retention, endurance.
  • Not included: protocol tutorials, repair-manual encyclopedias, or EMC root-cause explanations.
Design gate

Freeze the logging contract: field dictionary, thresholds with rationale, trigger coverage, time contract, and export format.

Checklist (8–12)
  • Schema freeze: required/optional fields tagged; schema_version exported. Pass: required missing rate < X%.
  • Metric semantics locked: event vs state vs counter sample, utilization windows, wake-source priority. Pass: definition↔implementation checks = X/X.
  • Threshold rationale: each rule has window + count/rate + source (bench/vehicle/history). Pass: rationale coverage > X%.
  • Trigger coverage map: bus-off, error-passive entry, wake, reset, VBAT dip, thermal, watchdog. Pass: critical triggers covered = 100%.
  • Time contract: mono/RTC/aligned recorded with ts_quality, boot_count, time_source. Pass: monotonic violations = 0.
  • Correlation anchors: anchor_id / gateway marks / sequence IDs defined. Pass: correlation-ready events > X%.
  • NVM wear budget: write amplification bounded; storm policy defined. Pass: write_amp < X.
  • Export contract: min reproducible package defined (summary + dump + schema + time quality). Pass: parse errors = 0.
  • Fail-safe logging: logging cannot block safety-critical control. Pass: bounded CPU/ISR time < X.
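The schema-freeze pass criterion ("required missing rate < X%") can be checked by a small validator. A sketch with a hypothetical two-entry field dictionary; the real required/optional tagging comes from the frozen schema, not from this example:

```python
# Illustrative frozen dictionary: required fields per event type (placeholder).
REQUIRED_FIELDS = {
    "bus_off": ["event_id", "ts_mono", "ts_quality", "tec", "rec", "state_change"],
    "wake":    ["event_id", "ts_mono", "ts_quality", "wake_source", "confidence"],
}

def missing_required_rate(records):
    """Fraction of records missing at least one required field for their type."""
    if not records:
        return 0.0
    bad = sum(
        1 for r in records
        if any(f not in r for f in REQUIRED_FIELDS.get(r.get("type"), []))
    )
    return bad / len(records)
```

Running this validator in CI against every schema_version keeps the design-gate pass criterion enforceable rather than aspirational.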
Example BOM parts (design-time choices)
  • CAN FD controller (SPI): Microchip MCP2517FD / MCP2518FD (for external logging taps).
  • CAN FD transceiver: TI TCAN1042-Q1; NXP TJA1044GT; Microchip MCP2562FD.
  • Selective wake (PN capable): NXP TJA1145 (ISO 11898-6 class); TI TCAN1145-Q1 (family example).
  • Non-volatile “black box” storage: Infineon/Cypress F-RAM CY15B104QSN (SPI); Fujitsu FRAM MB85RS2MT (SPI).
  • Isolated CAN (when required): TI ISO1042-Q1; Analog Devices ADM3055E.
Bring-up gate

Validate on real harness and real loads: peak windows, burst grouping, wake evidence priority, export-and-parse loop.

Checklist (8–12)
  • Real-harness re-measure: bench vs harness deltas captured in the same window definition. Pass: window semantics unchanged.
  • Controller vs sniffer correlation: utilization and error shares cross-checked. Pass: difference < X%.
  • Peak/burst coverage: peak_window_id and burst grouping validated. Pass: burst detection recall > X%.
  • Retrans/error share sanity: retrans_share and error_share match observed symptoms. Pass: share trend aligns with events.
  • Wake evidence priority: hardware flag > controller state > inference. Pass: mis-attribution < X%.
  • Pre/Post capture: Tpre/Tpost slices present for key triggers. Pass: slice completeness > X%.
  • Overflow markers visible: drop_count and overflow flags never silent. Pass: silent loss = 0.
  • Export+parse loop: the min package is exported and parsed by tooling. Pass: parse errors = 0.
Example BOM parts (bring-up instrumentation hooks)
  • CAN FD transceiver with diagnostics: Infineon TLE9255W (family example); TI TCAN1042-Q1.
  • LIN transceiver (if wake correlation crosses LIN): TI TLIN1029-Q1; NXP TJA1021.
  • FlexRay transceiver (if mixed networks): NXP TJA1080.
  • Low-cap TVS for bus ports (SI-friendly): Nexperia PESD2CANFD; Littelfuse SM24CANB.
Production gate

Ensure the production policy is durable: rate limits, decimation, retention correctness under power cuts, and NVM endurance.

Checklist (8–12)
  • Policy tiers: normal / diagnostic / factory modes defined. Pass: mode separation verified.
  • Rate limit & cooldown: repeated events do not cause alert storms. Pass: max reports/hour < X.
  • Decimation under storms: sampling may drop, triggers must remain. Pass: critical trigger miss_rate < X.
  • Wear validation: worst-case write budget validated across temperature. Pass: write_amp < X.
  • Power-cut recovery: checkpoint consistency proven at multiple cut points. Pass: recovery success > X%.
  • Version discipline: schema and thresholds are traceable by version. Pass: version missing = 0.
  • Export footprint cap: maximum package size bounded. Pass: package < X MB.
  • Data quality counters: drop_count and overflow markers exported. Pass: visibility rate = 100%.
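The storm-decimation rule ("sampling may drop, triggers must remain") can be expressed as a simple retention policy. A sketch under stated assumptions: the keep-every-rate-th scheme and the sample budget are illustrative policy choices, not the only valid ones:

```python
def decimate_under_storm(records, rate, budget):
    """Storm policy sketch: keep every critical trigger, decimate periodic samples.

    records: ordered list of dicts with a 'kind' key ("trigger" or "sample")
    rate:    keep every rate-th periodic sample during the storm
    budget:  hard cap on retained periodic samples (triggers are never dropped)
    """
    kept, sample_i, sample_kept = [], 0, 0
    for r in records:
        if r["kind"] == "trigger":
            kept.append(r)                     # critical triggers always survive
        else:
            if sample_i % rate == 0 and sample_kept < budget:
                kept.append(r)
                sample_kept += 1
            sample_i += 1
    return kept
```

Note that decimated samples should still increment a drop_count so the loss stays visible, per the data-quality-counters item above.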
Example BOM parts (production durability)
  • SBC w/ CAN (policy + reset reasons): NXP UJA1169 (CAN FD + LIN SBC family example); Infineon TLE9471-3ES (SBC family example).
  • Watchdog supervisor (if discrete): TI TPS3431-Q1 (watchdog timer family example).
  • FRAM for high-cycle logging: Infineon/Cypress CY15B104QSN; Fujitsu MB85RS2MT.
Service gate

Make the exported package reproducible and readable: event retrieval by ID, confidence tagging, and 4Q reporting.

Checklist (8–12)
  • Min reproducible package: summary + event dump + schema_version + time quality. Pass: completeness > X%.
  • 4Q readability: what / severity / evidence / next measure included per alert. Pass: 4Q compliance > X%.
  • Evidence completeness: required fields present for key triggers. Pass: missing required < X%.
  • Confidence mandatory: confidence downgraded when evidence is missing. Pass: untagged alerts = 0.
  • Event retrieval by ID: event_id maps to freeze-frame and slices. Pass: retrieval success > X%.
  • Cross-ECU notes: anchor-aware correlation guidance included. Pass: correlation rate > X.
  • Integrity check: package structure self-check (hash/CRC placeholder). Pass: integrity failures = 0.
  • Rate-limited summaries: burst events collapsed but counts preserved. Pass: counts retained = 100%.
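The integrity-check item (hash/CRC placeholder) can be prototyped with a CRC32 over a canonical serialization. A minimal sketch; a production package would likely use a stronger hash, but the seal/verify round trip is the same shape:

```python
import json
import zlib

def seal_package(package):
    """Append a CRC32 over the canonical JSON body (placeholder for a real hash)."""
    body = json.dumps(package, sort_keys=True).encode()
    return {"body": package, "crc32": zlib.crc32(body)}

def verify_package(sealed):
    """Self-check at import time: recompute the CRC and compare."""
    body = json.dumps(sealed["body"], sort_keys=True).encode()
    return zlib.crc32(body) == sealed["crc32"]
```

Canonical serialization (sort_keys=True) matters: without it, a semantically identical package can fail verification after a round trip through different tooling.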
Example BOM parts (service export reliability)
  • CAN transceiver w/ wake flags: NXP TJA1145; TI TCAN1145-Q1 (examples).
  • Isolated CAN for HV boundary evidence: TI ISO1042-Q1; ADI ADM3055E.
  • FRAM for robust dumps: Infineon/Cypress CY15B104QSN; Fujitsu MB85RS2MT.
Diagram · Four gates and their concrete outputs
Gate flow: Design (schema freeze · threshold map · trigger list · export contract) → Bring-up (real harness · peak windows · wake evidence · export+parse) → Production (policy tiers · rate limit · power-cut recovery · wear budget) → Service (min package · 4Q output · confidence · event_id map). Gate rule: each item must be measurable (evidence) and have a pass criterion (X/Y placeholders).

Applications (What to log, and why it matters)

Intent

Avoid generic use-cases. Each bucket defines the highest-value logs, the minimal field set, and trigger/windows for serviceability.

Scope guard
  • Only: what to record + why it is valuable for triage.
  • Not included: isolation theory, EMC mechanisms, or protocol internals.
Bucket A · Production / EOL

Fast detection of harness/assembly issues using burst errors and utilization peaks (not just averages).

Minimal field set
event_id · ts_quality · bus_id · node_id · state_change · counters_delta · util_peak · peak_window · retrans_share · error_share
  • Triggers: error bursts, bus-off, utilization peak exceed.
  • Pass: burst capture recall > X%; peak-window consistency < X% drift.
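"Not just averages" means summarizing per-window busy fractions into avg/peak/p95 rather than a single long-window mean. A sketch assuming pre-bucketed busy fractions per short peak_window (e.g., 10 ms buckets) and the nearest-rank percentile method:

```python
def utilization_stats(busy_by_window):
    """Summarize per-window busy fractions into avg / peak / p95 (sketch).

    busy_by_window: list of busy fractions (0..1), one per short peak_window.
    p95 uses the nearest-rank method on the sorted values.
    """
    s = sorted(busy_by_window)
    n = len(s)
    rank = max(0, int(0.95 * n + 0.5) - 1)   # nearest-rank index for P95
    return {"avg": sum(s) / n, "peak": s[-1], "p95": s[rank]}
```

On a trace that averages 16% busy but contains two congested windows, avg alone hides exactly the episode an EOL operator is chasing; peak and p95 surface it.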
Example BOM parts
  • CAN FD transceiver: TI TCAN1042-Q1; NXP TJA1044GT; Microchip MCP2562FD.
  • Port protection: Nexperia PESD2CANFD; Littelfuse SM24CANB.
Bucket B · Service returns / Field failures

Reproduce sporadic bus-off or false wakes using a black-box evidence chain (pre/post slices + attribution + time quality).

Minimal field set
  • Package: summary + event ring dump + schema_version.
  • Wake: wake_source + confidence + evidence_fields_present.
  • Slices: Tpre/Tpost refs for key triggers.
  • Time: time_source + ts_quality + boot_count.
  • Power: reset_reason + power_state (+ VBAT dip marker if present).
  • Network: state transitions + counters_delta + util snapshot.
  • Triggers: wake, bus-off, reset, VBAT dip, thermal/watchdog.
  • Pass: reproducible package decode success > X%; 4Q compliance > X%.
Example BOM parts
  • Selective wake transceiver: NXP TJA1145; TI TCAN1145-Q1 (examples).
  • Black-box NVM: Infineon/Cypress CY15B104QSN; Fujitsu MB85RS2MT.
Bucket C · Gateway / TCU (multi-bus bridging)

Congestion and retransmissions must be quantified with consistent windows and queue/drop observability.

Minimal field set
  • Utilization: avg/peak/p95 + window.
  • Shares: retrans_share + error_share (+ arbitration_loss_share if available).
  • Queue health: queue_depth + drop_count (record only; no RTOS tutorial).
  • Correlation: anchor_id or sequence_id + ts_quality.
  • Bridge context: flow_id (abstract) + bus_id.
  • Triggers: util_peak exceed, retrans surge, drop_count increase.
  • Pass: congestion alerts rate-limited; correlation rate > X.
Example BOM parts
  • CAN FD controller (SPI): Microchip MCP2517FD / MCP2518FD (for mirror taps or auxiliary buses).
  • SBC (reset reasons + policy): NXP UJA1169 (CAN FD + LIN SBC family example).
Bucket D · HV isolation boundary

Intermittent errors across ground offsets require evidence fields (flags + power markers + time quality), not isolation theory.

Minimal field set
  • Transceiver flags: fault/wake/thermal indicators (abstracted fields).
  • Power markers: VBAT dip markers + reset_reason + power_state.
  • Network evidence: counters_delta + state transitions + event_id.
  • Time evidence: ts_quality + boot_count + time_source.
  • Retention: freeze-frame refs survive power cuts.
  • Triggers: power events, error bursts, reset, thermal flags.
  • Pass: evidence completeness > X%; monotonic violations = 0.
Example BOM parts
  • Isolated CAN transceiver: TI ISO1042-Q1; Analog Devices ADM3055E.
  • Black-box NVM: Infineon/Cypress CY15B104QSN; Fujitsu MB85RS2MT.
Diagram · Application buckets and the “highest-value logs” icons
Bucket map: A · Production/EOL: burst errors + peak utilization (counters · util · timeQ · export). B · Service returns: black-box evidence chain (wake · timeQ · power · export). C · Gateway/TCU: congestion + retrans shares (util · counters · timeQ · export). D · HV boundary: flags + power markers (power · timeQ · counters · export).

Request a Quote

Accepted Formats

pdf, csv, xls, xlsx, zip

Attachment

Drag & drop files here or use the button below.

FAQs (Diagnostics & Logging)

Format (fixed)

Each FAQ is a strict 4-line triage closure: Likely cause / Quick check / Fix / Pass criteria (threshold placeholders: X/Y/Tpre/Tpost).

TEC/REC not high, but bus-off happens frequently — sampling semantics or recovery policy first?

Likely cause: TEC/REC is sampled at the wrong moment (post-recovery / wrong hook) or bus-off recovery policy forces repeated transitions without a large visible counter trend.

Quick check: log state_change timestamps + TEC/REC snapshot at bus-off entry/exit; compare counters_delta around each transition; verify sampling is triggered by the same ISR/callback as the state change.

Fix: move sampling to the state-transition hook; add bus_off_reason + recovery_reason; enforce recovery cooldown and explicit retry limits (policy, not waveform).

Pass criteria: > X% bus-off events contain {state_change + TEC/REC snapshot + counters_delta}; unexpected bus-off rate < X per hour over Y operating hours.

Bus utilization shows 20%, but the field feels “blocked” — peak window or retransmission share first?

Likely cause: average window is too long (peaks hidden) and/or retrans_share and error frames are excluded from the utilization denominator.

Quick check: recompute util_peak over a short peak_window (e.g., 1–10 ms) and compare to avg; include retrans/error contributions; inspect p95/p99 and top peak_window_id buckets.

Fix: report utilization as {avg + p95 + peak} with the window definition embedded; always publish retrans_share and error_share; drive congestion alerts from peak/p95 (not avg).

Pass criteria: util_peak < X% and retrans_share < X% in window Y; peak windows are reproducible within ±X% across repeated runs.

False wakes are frequent, but logs show no frames — wake-flag timing or power-event wake first?

Likely cause: wake is not bus-driven (local/timed/power event) or wake_flag is read after it is cleared (timing/ordering bug).

Quick check: capture wake flags at the earliest wake ISR; record power_state, reset_reason, and VBAT_dip_marker; ensure the black box includes Tpre/Tpost slices around wake (Tpre/Tpost placeholders).

Fix: latch and persist flags (read-once → store); explicitly classify wake_source as bus/local/timed/power with evidence; add “no-bus-frame” wake category with required evidence fields.

Pass criteria: > X% wakes have wake_source + evidence fields present; “no-frame wakes” are classified with confidence ≥ X (not “unknown”).
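The latch-and-persist fix (read-once → store) can be sketched as a tiny state holder. The flag bit assignments and source names here are illustrative, not a real transceiver register map:

```python
class WakeFlagLatch:
    """Read-once → store latch for transceiver wake flags (sketch).

    The earliest wake ISR calls capture() once with the raw flag register;
    later classification reads the stored snapshot instead of the live
    register, which may already have been cleared.
    """
    def __init__(self):
        self.snapshot = None

    def capture(self, raw_flags, ts_mono):
        if self.snapshot is None:            # latch only the first read
            self.snapshot = {"flags": raw_flags, "ts_mono": ts_mono}

    def classify(self):
        """Map latched flags to a wake_source; 'unknown' only if never latched."""
        if self.snapshot is None:
            return "unknown"
        f = self.snapshot["flags"]
        if f & 0x01:            # hypothetical bus-wake bit
            return "bus"
        if f & 0x02:            # hypothetical local-wake bit
            return "local"
        if f & 0x04:            # hypothetical timer-wake bit
            return "timed"
        return "power"          # no bus/local/timed evidence: attribute to power event
```

The ordering bug this fixes is exactly the FAQ symptom: classifying from the live register after clearing yields "no frames, no flags", i.e., an unattributable wake.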

Black box captured a wake, but the wake source is unknown — which 3 evidence fields to add first?

Likely cause: attribution lacks minimum evidence (hardware flag snapshot, controller state snapshot, power marker) and confidence is computed without inputs.

Quick check: verify presence of wake_flag, controller_wake_status, reset_reason/VBAT_dip_marker, and ts_quality for the wake event.

Fix: add these three evidence fields first: (1) transceiver wake flag + capture timestamp, (2) controller state snapshot at wake, (3) power marker (reset_reason or VBAT dip marker); implement confidence scoring that downgrades when any evidence is missing.

Pass criteria: unknown wake_source < X% of wakes; confidence ≥ X for attributed wakes; evidence completeness ≥ X%.

After changing the harness, error counters rise but the waveform looks “OK” — what trend comparison first?

Likely cause: intermittent events are visible in counters_delta trends but not in a single snapshot; baseline windows differ (apples-to-oranges) or the “rate” metric is not normalized.

Quick check: compare counters as rate per hour (or per 1k frames) under matched windows; slice by node_id and by peak_window; correlate with util_peak and retrans_share.

Fix: standardize the trend window and publish it with every report; store a baseline profile (same window, same normalization) and produce “before/after” delta by node buckets.

Pass criteria: post-change error-rate returns within X% of baseline over Y hours (or Y thermal cycles); top contributing node(s) explain > X% of the delta.
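The matched-window rate comparison can be sketched as a per-node delta in errors per hour. A minimal sketch assuming both profiles were collected with the same window definition; nodes absent from the baseline are compared against zero:

```python
def error_rate_delta(baseline, current):
    """Per-node error-rate change in errors/hour under matched windows (sketch).

    baseline/current: dicts of {node_id: (error_count, hours)} collected with
    the same window definition and normalization.
    """
    out = {}
    for node, (errors, hours) in current.items():
        cur = errors / hours
        base_errors, base_hours = baseline.get(node, (0, hours))
        out[node] = cur - base_errors / base_hours
    return out
```

Sorting this dict by magnitude gives the "top contributing node(s)" required by the pass criterion.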

False wakes only in winter/dry conditions — log EMC-event counters or ground/power dip markers first?

Likely cause: environment-correlated transients show up as wake flags/power markers rather than as bus frames; missing markers make the event look “frame-less”.

Quick check: record wake_flag transitions + VBAT_dip_marker + reset_reason around the wake; if available, also log an EMC_event_counter (abstract counter, not waveform).

Fix: add wake debounce + cooldown tuned for transient storms; enforce hardware-flag-first attribution; optionally add environment tags (temperature/humidity placeholder) strictly as metadata.

Pass criteria: false-wake rate < X per day under the condition; > X% of wakes include at least one concrete marker {wake_flag/power_dip/reset_reason}.

Utilization is stable on bench, but vehicle peaks spike — group by node first or by message priority first?

Likely cause: peak bursts come from a small set of talkers or from priority-driven bursts; averaging hides the spike mechanism.

Quick check: compute util_peak per short peak_window; produce Top-5 “peak contributors” by node_id and separately by “priority/class” bucket (abstract); track retrans_share during those peaks.

Fix: always export both breakdowns: per-node and per-priority (or per-message-class) in the same report; gate “congestion” alerts on peak/p95 and attach the Top-5 contributors list.

Pass criteria: Top contributor(s) explain > X% of peak utilization; peak utilization < X% for Y consecutive peak windows (or peaks are correctly attributed with confidence ≥ X).

Log timestamps don’t align; multi-ECU events cannot be correlated — unify monotonic time or add gateway anchors first?

Likely cause: mixed time bases without time_source/ts_quality metadata, or clock steps/resets between events (including boot boundaries).

Quick check: inspect time_source, ts_quality, boot_count; search for monotonic violations; check whether anchor_id (gateway marks) exists for cross-ECU correlation.

Fix: use monotonic time for ordering within an ECU and add gateway anchors for cross-ECU alignment; record drift_est/sync_status (abstract) in ts_quality.

Pass criteria: correlation success > X% across ECUs within window Y; monotonic violations = 0 for key events; anchor coverage > X%.

The black box gets overwritten; key events are missing — change triggers first or move to a dual-layer buffer first?

Likely cause: single ring buffer without reservation + high trigger rate leads to ring_overrun; key triggers compete with low-value spam events.

Quick check: log event_rate, ring_overrun counters, and Top trigger types; verify whether key triggers reserve slots and whether freeze-frames exist (freeze_frame_ref).

Fix: implement dual-layer capture (event ring + sample ring) with reserved freeze-frames for critical triggers; tighten triggers and add cooldown; prioritize critical event IDs in retention policy.

Pass criteria: critical event retention ≥ X events (or ≥ X hours); key trigger miss_rate < X%; ring_overrun rate < X per day.
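The dual-layer capture fix can be sketched with two rings, one of which is reserved for critical triggers so spam can never evict them. Slot counts are placeholders; freeze-frame reservation is reduced here to a second deque:

```python
from collections import deque

class DualRing:
    """Dual-layer capture sketch: a spam-tolerant sample ring plus a ring
    reserved for critical triggers, so storms cannot overwrite key events."""
    def __init__(self, sample_slots, critical_slots):
        self.samples = deque(maxlen=sample_slots)    # overwritten freely
        self.critical = deque(maxlen=critical_slots) # reserved for key triggers

    def push(self, event, critical=False):
        (self.critical if critical else self.samples).append(event)

    def dump(self):
        return {"critical": list(self.critical), "samples": list(self.samples)}
```

In a real implementation the critical ring would also hold freeze_frame_ref entries and increment ring_overrun when its own capacity is exceeded; both are omitted here for brevity.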

Logging volume causes flash wear — decimation first or event compression first?

Likely cause: frequent checkpoints to flash + no decimation/compression leads to high write_amp (write amplification) and early endurance exhaustion.

Quick check: estimate writes/day and write_amp from event rate + checkpoint interval; identify “hot” event IDs; verify whether the design logs periodic samples at full rate during storms.

Fix: apply decimation to periodic samples and compress/aggregate repeated events; keep flash for sparse checkpoints and consider high-cycle NVM for dumps (e.g., FRAM CY15B104QSN / MB85RS2MT) for high-write paths.

Pass criteria: write_amp < X; projected endurance > X years at worst-case duty cycle; storm mode retains all critical triggers with miss_rate < X%.
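The endurance projection behind "projected endurance > X years" is a short calculation. A sketch under a simplifying assumption of ideal wear leveling (writes spread evenly over the full capacity), which real NVM only approximates:

```python
def endurance_years(events_per_day, bytes_per_event, write_amp,
                    capacity_bytes, cycles):
    """Project NVM endurance in years at worst-case duty cycle (sketch).

    write_amp: physical bytes written per logical event byte (page padding,
               metadata, wear-leveling overhead).
    cycles:    rated erase/write cycles per cell for the chosen NVM.
    Assumes ideal wear leveling across the whole capacity.
    """
    daily_writes = events_per_day * bytes_per_event * write_amp
    total_budget = capacity_bytes * cycles   # total writable bytes over device life
    return total_budget / daily_writes / 365.0
```

The same formula shows why FRAM changes the picture for high-write paths: with cycle ratings many orders of magnitude above flash, the endurance term stops being the binding constraint and write_amp matters mainly for bandwidth, not wear.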

Too many false alarms; service cannot use the report — debounce window or severity mapping first?

Likely cause: thresholds fire on noisy samples (not on state/event transitions), missing cooldown, and severity escalation ignores evidence completeness.

Quick check: compute false-positive rate on a labeled set; inspect debounce_window + cooldown; verify mapping of bus-off / wake storm to severity; confirm evidence_fields_present is required for high severity.

Fix: apply N-in-window debounce plus cooldown; split info/warn/error/critical rules; require confidence ≥ X and evidence completeness ≥ X% before raising critical alarms.

Pass criteria: false alarm rate < X%; alert volume/day < X while recall > X%; 4Q report completeness ≥ X%.

Wake happens then the ECU sleeps immediately; logs are fragmented — policy transition point or dump timing first?

Likely cause: power policy cuts logging before the dump/checkpoint completes; dump trigger occurs after the sleep decision; retention window is too short.

Quick check: record power_state transitions around wake; measure time from wake ISR to dump start; verify checkpoint_seq/checkpoint_done fields; confirm Tpre/Tpost slices exist.

Fix: capture minimal freeze-frame immediately at wake; enforce a hold-off timer before sleep decision; checkpoint critical events before sleep; make dump timing part of the policy contract.

Pass criteria: > X% wake events include contiguous Tpre/Tpost slices; dump success > X%; fragmented logs < X% over Y cycles.