CAN/LIN/FlexRay Diagnostics & Logging: Counters, Utilization, Wake
Turn “it happened in the field” into a replayable evidence chain: standardize counters, utilization, and wake attribution, then capture black-box snapshots with trustworthy timestamps. Ship service-ready reports with debounced thresholds and measurable pass/fail criteria—so issues are searchable, comparable, and fixable at scale.
Scope Guard & Definitions
Establish one non-negotiable contract for this page: measure consistently, attribute with evidence, and report for serviceability. The scope guard prevents protocol/waveform detours and keeps all later chapters aligned.
In scope:
- Error counters — how to sample, trend, and correlate counters with events.
- Bus utilization — correct definitions, windows, and peak/burst interpretation.
- Wake-event black box — evidence-based wake attribution + snapshot + retention.
Out of scope:
- Waveform shaping, termination tuning, stub/harness SI/EMC details (handled by PHY/EMC sibling pages).
- UDS/DoIP/OTA protocol behavior (only logging interface fields are referenced here).
- Event: a discrete occurrence with a timestamp (e.g., entered bus-off, wake occurred).
- State: a sustained condition (e.g., error passive state, bus-off state).
- Counter sample: a point-in-time reading (e.g., TEC/REC at time t). Use deltas over a defined window.
- Average: % busy within a long window (e.g., 1 s / 10 s).
- Peak: maximum busy% in sliding windows (captures congestion episodes).
- Burst: short-window distribution (P95/P99 in 10–100 ms), required for “feels blocked” complaints.
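All three load-shape metrics can be derived from one stream of short-window busy% samples. A minimal sketch (function name, window sizes, and the nearest-rank percentile method are illustrative choices, not a normative definition):

```python
def burst_metrics(busy_pct_samples):
    """busy_pct_samples: one busy% value per short window (e.g. 10 ms each).
    Returns avg / peak plus the burst percentiles (P95/P99)."""
    s = sorted(busy_pct_samples)

    def pct(p):  # nearest-rank percentile over the sorted samples
        return s[min(len(s) - 1, int(p / 100.0 * len(s)))]

    return {
        "avg": sum(s) / len(s),   # long-window average
        "peak": s[-1],            # worst sliding window
        "p95": pct(95),           # burst distribution
        "p99": pct(99),
    }
```

The same capture thus yields both the average and the P99 needed for "feels blocked" complaints; only the reporting window differs.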
Attribution must include evidence fields (flags / filter hit / power state transitions). Pure inference is not acceptable for serviceability.
- Timestamp: monotonic ordering + human-readable time + sync quality (for cross-ECU correlation).
- Snapshot trigger: define event triggers + pre/post windows (Tpre/Tpost) for freeze-frame.
- Retention: RAM ring for high-rate context + NVM checkpoints for post-power-cycle evidence.
Every metric must declare window, denominator, and units. Counter values are meaningless without deltas over a window.
- Schema dictionary: minimal fields for counters, utilization, wake attribution, and snapshots.
- Service report template: summary → evidence → severity → next measurement.
- Wake black box MVP: ring buffer + checkpoint + freeze-frame triggers.
- Verification matrix: fault injection coverage + pass criteria placeholders.
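To make the schema-dictionary deliverable concrete, a minimal counter record might look like the following. All field names here are placeholders, not a normative schema:

```python
# Hypothetical minimal counter record: every value travels with its
# window, denominator, and state context so deltas stay comparable.
counter_record = {
    "metric_id": "can0.tec",       # bus-scoped metric identity
    "value_delta": 12,             # Δcounter over the window, never absolute-only
    "window_s": 10.0,              # window_Y
    "denominator": "per_second",   # denom_Z
    "ts_mono_ms": 123456,          # monotonic timestamp
    "state": "error_active",       # controller state at sample time
}
```

A record missing any of the window/denominator fields would violate the contract stated above and should be rejected at ingest.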
Failure Taxonomy for Serviceability
Convert field complaints into searchable, measurable, and triage-ready categories. A good taxonomy reduces log volume requirements and increases root-cause speed.
- Symptom — what service observes (bus-off, wake storm, intermittent errors).
- Signal — measurable evidence (counter deltas, state transitions, utilization peaks, wake flags).
- Suspected domain — attribution buckets guiding next verification (Topology / EMC / Node behavior / Policy).
Minimal field set is split into Identity (bus/node), Evidence (signals), and Context (timestamp quality / power state).
Each entry binds signals to a window + denominator, then maps to suspected domains. This avoids “high counter value” confusion and supports automated service reports.
Symptom · Bus-off (intermittent or frequent)
- State transition: entered bus-off (event) + recovery attempts.
- Counter deltas: ΔTEC / ΔREC over window_Y (not absolute values).
- Utilization snapshot: peak & burst around the event (optional but high value).
- Compute ΔTEC/ΔREC per window_Y (placeholder) and normalize per active_time or frames_sent.
- Freeze-frame: capture Tpre/Tpost context around bus-off (placeholder).
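The ΔTEC/ΔREC normalization step above can be sketched as follows (function and key names are hypothetical):

```python
def counter_delta(c_t0, c_t1, window_s, frames_sent=None):
    """ΔTEC/ΔREC over window_Y, normalized per time and per 1000 frames.
    Absolute values are kept out of the report on purpose."""
    delta = c_t1 - c_t0
    out = {"delta": delta, "per_second": delta / window_s}
    if frames_sent:  # normalize by activity when a frame count is available
        out["per_1000_frames"] = 1000.0 * delta / frames_sent
    return out
```

Normalizing both ways lets reports from a busy gateway and a mostly idle ECU be compared on the same scale.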
Symptom · Error passive entry / oscillation (flapping)
- State transitions count: active ↔ passive within window_Y.
- Counter deltas: ΔREC dominates vs ΔTEC dominates (direction hints where to look next).
- Utilization bursts: P95/P99 in short window (often correlates with error bursts).
- Normalize flap rate per minute or per 1000 frames (placeholder).
- Record whether flapping is periodic (suggests policy/threshold issues) or random (suggests EMC/topology).
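One way to separate periodic from random flapping is the spread of inter-flap intervals. A sketch, assuming at least two flap timestamps; the 0.2 coefficient-of-variation cutoff is an illustrative placeholder:

```python
def flap_signature(flap_ts_s, window_min):
    """Classify flapping by interval regularity: a tight interval spread
    suggests policy/threshold issues, a wide spread suggests EMC/topology."""
    rate_per_min = len(flap_ts_s) / window_min
    gaps = [b - a for a, b in zip(flap_ts_s, flap_ts_s[1:])]
    mean = sum(gaps) / len(gaps)
    var = sum((g - mean) ** 2 for g in gaps) / len(gaps)
    cv = (var ** 0.5) / mean  # coefficient of variation of flap intervals
    return {"rate_per_min": rate_per_min,
            "pattern": "periodic" if cv < 0.2 else "random"}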
Symptom · Intermittent CRC / frame errors (scope “looks OK”)
- Error deltas per short window: errors / 10 s and errors / 100 ms (placeholder).
- Correlation with utilization: spikes in burst P99 often expose contention/retry amplification.
- Environmental correlation: temperature bucket / power-state transitions (context fields only).
Symptom · False wake / wake storm (no obvious frames)
- Wake evidence: wake flags + filter-hit counters (if selective wake is used).
- Attribution confidence: prefer hardware flags over software inference.
- Pre/post snapshots: capture Tpre/Tpost around wake event (placeholder).
Symptom · Diagnostic session intermittent timeout (field-level only)
- Transport-facing counters: timeout_count per window_Y (placeholder).
- Bus utilization bursts around timeouts (queue starvation often appears as burst congestion).
- Reset/power transitions near timeout windows (to rule out power-induced dropouts).
- Always store deltas (Δcounter) with window_Y, not just absolute counters.
- Always declare the denominator (per time, per frames, per active-time) to avoid incomparable rates.
- Every attribution must include evidence and a confidence label (high/medium/low).
Observability Tap Points
Map where diagnostic evidence originates across Transceiver / Controller / MCU / Power-SBC / Monitor. The goal is consistent evidence capture and correlation—not protocol or RTOS tutorials.
- Hardware-latched flags (wake, dominant timeout, thermal) — highest confidence.
- Controller counters + state transitions (ΔTEC/ΔREC, passive/bus-off) — measurable trends.
- Software context (queue depth, ISR latency, log drops) — correlation and amplification clues.
Lower-level signals can support correlation, but root-cause direction requires higher-confidence evidence whenever available.
- Trends: periodic sampling (Δcounters, utilization average).
- Replay: event-trigger snapshots (bus-off, wake, reset).
- Short pulses: edge/latched capture (wake flags, dominant timeout).
Without these keys, logs become fragments and serviceability collapses under power cycles or multi-ECU correlation.
Each card declares what the layer can prove, what it cannot, and how to sample it for trend vs replay.
Cannot prove: exact waveform/termination root cause or which node generated a specific frame error (handled by PHY/EMC sibling pages).
- Flags: latched or interrupt-driven reads (avoid missing short pulses).
- Thermal/mode: low-rate periodic sampling (trend context, not high bandwidth).
- Reading flags clears them without first stamping ts_mono.
- Logging “flag occurred” without duration/recovery evidence.
Cannot prove: harness topology and EMC mechanisms directly; only measurable signatures and correlations are available here.
- Counters: store Δcounter per window_Y (never absolute-only).
- Transitions: event-trigger snapshots on passive/bus-off entry and recovery.
- Storing absolute counters without Δ and window leads to incomparable reports.
- Missing state transition history makes “counter jumps” uninterpretable.
Cannot prove: physical-layer faults; MCU signals explain amplification (starvation, backlog) but do not replace bus evidence.
- Periodic: queue depth and CPU load for trends.
- Event-trigger: timeout bursts, queue overflow, log-drop spikes.
- Missing log-drop counters creates false “no issue observed” narratives.
- Treating software timeouts as root-cause instead of correlating with bus/power evidence.
- Reset reason: capture at boot immediately (latched at startup).
- VBAT dips: event-trigger with threshold_X placeholder and persistence policy.
- No boot_count/event sequencing prevents correlation across power cycles.
- Storing only “reset happened” without reason classification reduces service value.
Low-rate trend sampling plus event-trigger capture for exceptions; store counts and durations to support severity ranking.
- Always-on (low rate): power_state, controller state, utilization average.
- Event-only: passive/bus-off entry, wake, reset, dominant timeout.
- Burst capture: short-window utilization P95/P99 and short-window Δcounters in a RAM ring buffer.
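The burst-capture tier can be as simple as a fixed-size RAM ring that is snapshotted when a trigger fires. A minimal sketch (class and method names are illustrative):

```python
from collections import deque

class SampleRing:
    """Overwrite-friendly RAM ring for short-window samples (Tpre context)."""

    def __init__(self, capacity):
        # deque with maxlen drops the oldest entry automatically when full
        self._buf = deque(maxlen=capacity)

    def push(self, sample):
        self._buf.append(sample)

    def freeze(self):
        """Snapshot the ring for a freeze-frame record (oldest first)."""
        return list(self._buf)
```

On a trigger, `freeze()` provides the Tpre slice; the same ring keeps running to collect the Tpost slice before the event record is committed.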
Error Counters Deep Dive
Turn counters into an evidence chain: Δcounters + window + denominator bound to state transitions. This prevents “looking at numbers without attribution”.
- Sample counters at t0, t1… then compute Δcounter / window_Y.
- Bind deltas to state transitions (active ↔ passive ↔ bus-off).
- Anchor with events (wake / reset / dominant timeout) when present.
- Correlate with utilization burst (P95/P99) to detect contention amplification.
- Report severity + suspected domain bucket (Topology / EMC / Node / Policy).
- Use ΔTEC/ΔREC over window_Y; absolute values alone are not actionable.
- State entry timestamps (passive/bus-off) must be captured as events and paired with freeze-frame snapshots.
- After recovery, track whether counters decay and stabilize or re-escalate quickly (policy vs burst signatures).
Record error counts by category (header/response/sync/timeout), then normalize by window_Y and denom_Z. The objective is consistent trend + event anchoring, not protocol walkthrough.
Split evidence by channel (A/B) and by error class; store deltas and event anchors. Channel asymmetry is a serviceable clue even without waveform deep dives.
Counters are most useful when categorized by time behavior; each pattern suggests different next measurements.
- Signature: Δcounter rises steadily across long windows.
- Quick check: correlation with temperature / power state transitions.
- Next: trend-first verification; avoid single-shot conclusions.
- Signature: large short-window Δcounter + utilization burst P99 spikes.
- Quick check: align bursts with bus-off/passive events.
- Next: capture freeze-frame around events; validate on real harness.
- Signature: repeating state flaps or recurrent bus-off with stable period.
- Quick check: align with wake/recovery thresholds and policy state changes.
- Next: audit debounce/filter/recovery criteria; require evidence fields.
Each counter definition must include source, units, sampling method, window_Y, denom_Z, threshold_X placeholders, and severity mapping.
Reporting rule: every counter report must include Δ + window + denominator + state + event anchor.
Bus Utilization Metrics
Convert “the bus is busy” into comparable, calculable metrics. Define windowed utilization, burst behavior, and overhead shares so bench and in-vehicle measurements can be explained without waveform-level discussion.
Every utilization number must declare window_Y and a clear denominator. Without this, averages cannot be compared across ECUs, tools, or drives.
Separate load shape (avg/peak/burst) from overhead (retransmissions/errors/arbitration loss). Mixing them hides the reason why real harness behavior diverges from bench.
Different payload sizes and modes change frame_time and therefore busy_time. This page standardizes the accounting path and avoids PHY-level explanations.
busy_time = Σ frame_time(i) + Σ retry_time(i) + overhead
util_% = busy_time / window_Y × 100
Reports must declare the source of frame_time (controller/sniffer/gateway) and keep windowing consistent.
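The accounting path above follows directly from the two formulas; a sketch with inputs in seconds (names are placeholders):

```python
def utilization(frame_times_s, retry_times_s, overhead_s, window_s):
    """busy_time = Σ frame_time + Σ retry_time + overhead;
    util_% = busy_time / window_Y × 100.
    Also returns the retransmission share that must be reported alongside."""
    busy = sum(frame_times_s) + sum(retry_times_s) + overhead_s
    return {
        "util_pct": 100.0 * busy / window_s,
        "retrans_share_pct": 100.0 * sum(retry_times_s) / busy,
    }
```

Keeping the retry and overhead terms explicit is what lets overhead shares be separated from load shape later in this chapter.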
- Strength: ties utilization to local state/counters and retry context.
- Bias: partial view (local ECU perspective) may not reflect network-wide busy time.
- Best for: accountability and local amplification analysis.
- Strength: closer to actual bus busy time and burst behavior.
- Bias: limited internal context (cannot see ECU queues or local retry intent).
- Best for: network health and peak/burst characterization on real harness.
- Strength: multi-segment coverage and system-level correlation.
- Bias: mirror path can introduce resampling and timestamp quality issues.
- Best for: serviceability and cross-domain correlation with wake/diagnostics.
- Quick check: always include peak and P99 for the same capture window.
- Fix: define short_window_s and report burst percentiles.
- Pass criteria: P99 < X% for Y minutes (placeholder).
- Quick check: compare window_Y average vs short-window peak.
- Fix: include rolling short windows and percentile reporting.
- Pass criteria: peak within X% above baseline (placeholder).
- Quick check: report error_frame_share_% and retrans_share_% alongside utilization.
- Fix: separate overhead shares from load shape metrics.
- Pass criteria: retrans_share_% < X% and error_frame_share_% < X% (placeholder).
A report without capture_point and windowing metadata is non-comparable and not serviceable.
Wake Event Attribution
Define a wake evidence chain with explicit source classification and confidence so false-wake and wake-storm issues become measurable and serviceable.
A wake event must be classified using one of the fixed sources below; the schema records evidence fields and confidence.
- Hardware flags (transceiver wake flags, power flags) — highest confidence.
- Controller state (bus state, counters/state transitions) — medium confidence.
- Software inference (queues, timers, application hints) — support only.
- If hardware flags conflict with inference, hardware wins and the conflict is recorded.
- If hardware evidence is missing, confidence is downgraded and the evidence_fields list must explain why.
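Both rules can be made deterministic in code. A sketch with hypothetical argument and field names:

```python
def classify_wake(hw_flag_source=None, ctrl_state_source=None, sw_hint=None):
    """Fixed priority: hardware flags > controller state > software inference.
    Conflicts with inference are recorded, never silently resolved."""
    if hw_flag_source:
        return {"wake_source": hw_flag_source, "confidence": "high",
                "conflict_recorded": bool(sw_hint and sw_hint != hw_flag_source)}
    if ctrl_state_source:
        return {"wake_source": ctrl_state_source, "confidence": "medium",
                "evidence_note": "no hardware flag latched"}
    # hardware evidence missing: downgrade and explain why
    return {"wake_source": sw_hint or "unknown", "confidence": "low",
            "evidence_note": "software inference only"}
```

Note that a hardware/inference conflict does not change the classification; it only sets `conflict_recorded` so the disagreement survives into the service report.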
Record filter identity and outcomes so false-wake rate becomes measurable without detailing the standard or filter algorithms.
The black-box capture must reserve a ring buffer covering Tpre and Tpost around wake. Without this, attribution becomes post-hoc inference.
The schema stores classification, confidence, and evidence fields; it also records debounce and policy state for service replay.
Wake-event Black Box Design
Provide a replayable black-box architecture: continuous sampling, event-triggered freeze-frame, and power-loss retention. The goal is service-grade reconstruction, not raw log dumping.
Separate continuous sampling from event records so storage and endurance remain controlled.
- High-rate, low-cost, overwrite-friendly samples covering Tpre/Tpost windows.
- Stores utilization snapshots, counter deltas, power flags, and minimal state changes.
- Allows decimation and delta encoding to keep bandwidth predictable.
- Low-rate, searchable event records with unique event_id and schema version.
- Each event carries freeze-frame fields and pointers/summaries of sampled slices.
- Supports event merging when multiple triggers occur in one anchor window.
- Periodic or trigger-driven commits to non-volatile storage for power-loss survival.
- Must track write budget and batch updates to avoid excessive wear.
- Stores event ring plus critical summaries for fast service extraction.
- Exports structured event records; avoids requiring manual log reading.
- Must include capture windows, time-quality fields, and filter identity where applicable.
- Allows sorting and searching by trigger, bus_id, node_id, and confidence.
Every trigger defines what gets frozen and how much pre/post context is extracted from the sampling ring.
- Freeze: state + counters + utilization snapshot + wake attribution.
- Capture: Tpre and Tpost slices (placeholders) from sampling ring.
- Freeze: power_state + reset_reason + brownout/thermal flags.
- Capture: immediate pre-reset slice and early-boot slice for correlation.
If multiple triggers occur within one correlation anchor window, merge into one composite record and store all trigger flags to avoid duplicated NVM writes.
- Prefer deltas for counters; prefer window summaries for utilization (avg/peak/p99/shares).
- Record state transitions instead of repeated identical states.
- Allow decimation under CPU pressure; preserve triggers and freeze-frames.
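Delta encoding of counter samples is one way to keep ring bandwidth predictable; a round-trip sketch (function names are illustrative):

```python
def delta_encode(samples):
    """Store the first value absolute, the rest as deltas (small, compressible)."""
    return [samples[0]] + [b - a for a, b in zip(samples, samples[1:])]

def delta_decode(encoded):
    """Reconstruct absolute samples from the delta stream."""
    out = [encoded[0]]
    for d in encoded[1:]:
        out.append(out[-1] + d)
    return out
```

Because most counter deltas are zero or small, the encoded stream packs far tighter than absolute values while staying exactly reversible for replay.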
Retention must track write frequency and byte volume. A black box that wears out storage is not production-safe.
- One sampling ring + one event ring.
- Triggers: wake, reset, bus-off, VBAT_dip.
- Retention: event ring + last-N event summaries.
- Dual sampling (fast ring + slow ring) with event slice references.
- Triggers include error_passive_entry, thermal, watchdog.
- Retention includes write budget stats and correlation anchors for cross-ECU replay.
Event records are the minimal searchable unit for service replay.
Timestamp & Correlation
Make logs correlatable across ECUs by recording time base and time quality. Many field failures are unsolved because time is not trusted or not aligned.
- Best for ordering events within one ECU.
- Requires boot_count and uptime_ms for post-reboot reasoning.
- Not directly comparable across ECUs without anchors.
- Human-readable and survives reboot if valid.
- Must record rtc_valid and drift_est to avoid false precision.
- Useful as a coarse cross-ECU reference when aligned time is unavailable.
- Primary method for cross-ECU correlation.
- Must record sync_status and offset_uncertainty to quantify reliability.
- Anchors event streams by generating correlation anchor IDs.
Each key event must carry time base and quality fields; correlation quality should be derived into a single ts_quality label.
- Gateway emits anchor_id with aligned time.
- ECUs record anchor_id + local ts_mono for replay.
- Best correlation strength for system-level reconstruction.
- Use monotonic event sequence for key triggers.
- Carry sequence through mirrored or forwarded reports.
- Works even when aligned time is unavailable.
- Match events by relative Tpre/Tpost patterns and trigger signatures.
- Always downgrade confidence and record matching uncertainty.
- Use as fallback when anchors and sequences are missing.
Diagnostics Reporting Workflow
Convert raw counters, utilization, and wake evidence into a service-grade report: concise, actionable, and confidence-tagged.
- Focus: thresholds, debouncing, severity, readability, and report structure.
- Not included: protocol tutorials, repair-manual encyclopedias, or EMC root-cause explanations.
- Output style: summary + evidence + next measurement direction.
Reporting quality depends on consistent normalization, explicit thresholds, and debouncing that prevents alert storms.
Use a consistent contract so every downstream rule is deterministic: metric_id, window, value, bus_id/node_id, time_ref, ts_quality.
Thresholds must be window-based and debounced to prevent noisy service outputs.
- Burst-type: N occurrences per W seconds (N and W are placeholders).
- Degradation-type: slope over T minutes; baseline and peak tracked per bus_id.
- Policy-type: periodic patterns (wake storms) tracked by recurrence interval.
- burst grouping: collapse multiple same-type events into one burst summary within short windows.
- hysteresis: separate enter/exit conditions to avoid threshold flapping.
- rate limiting: cap repeated reports per hour while keeping evidence counters.
- merge-by-anchor: within one correlation anchor window, merge triggers to minimize NVM writes.
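Hysteresis with separate enter/exit thresholds is the simplest of these debouncers. A sketch; the 80/60 thresholds in the test are placeholders:

```python
class Hysteresis:
    """Alert gate with distinct enter/exit thresholds (enter_at > exit_at)
    so a value hovering near one threshold cannot flap the alert."""

    def __init__(self, enter_at, exit_at):
        self.enter_at, self.exit_at = enter_at, exit_at
        self.active = False

    def update(self, value):
        if not self.active and value >= self.enter_at:
            self.active = True           # enter condition
        elif self.active and value <= self.exit_at:
            self.active = False          # separate, lower exit condition
        return self.active
```

A value oscillating between the two thresholds leaves the alert state unchanged, which is exactly the flapping this rule is meant to suppress.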
- critical: bus-off, persistent wake storm, repeated resets impacting availability.
- error: error passive entry, sustained error rate beyond threshold.
- warn: rising retransmission share or utilization peaks trending upward.
- info: isolated anomalies preserved for correlation and trending.
Service output must state confidence; do not imply certainty when evidence is incomplete.
- Priority: hardware flags > controller states > software inference.
- Penalize when ts_quality is poor or required evidence fields are missing.
- Expose confidence as a first-class report field.
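A minimal sketch of that scoring rule (tier names and penalty sizes are illustrative, not normative):

```python
LEVELS = ["low", "medium", "high"]

def report_confidence(evidence_tier, ts_quality_ok, missing_fields):
    """Start from the evidence priority (hw > ctrl > sw), then penalize
    poor time quality and missing required evidence fields."""
    level = {"hw": 2, "ctrl": 1, "sw": 0}[evidence_tier]
    if not ts_quality_ok:
        level -= 1                       # poor ts_quality penalty
    if missing_fields:
        level -= 1                       # incomplete evidence penalty
    return LEVELS[max(0, level)]
```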
Every alert must answer these questions in order, using short lines and explicit evidence references.
- What happened — event type + time + impacted bus/node.
- How severe — severity + threshold condition (placeholders X/Y).
- What evidence — event_id + key counters/util snapshots + ts_quality.
- What next to measure — measurement direction, not a repair manual.
This template is designed for fast triage: top events, trends, wake summary, and data-quality confidence.
- Rank by severity × confidence and show event_id for retrieval.
- Include bus_id, node_id, and time reference (mono/aligned).
- Break down by source: bus / local / timed / power.
- Show confidence and missing evidence fields to avoid guess-based attribution.
- ts_quality distribution and last_sync_age_s (placeholder).
- logger drop counters and overflow markers.
- evidence completeness rate for key event types.
Verification & Fault Injection Plan
Validate that logging, black-box capture, retention, and correlation remain trustworthy under worst-case conditions.
- Focus: evidence capture quality, time integrity, retention, and write endurance.
- Not included: detailed EMC methods or protocol-level conformance.
- Matrix style: injection → expected evidence → pass criteria.
Each category is evaluated by whether the expected event records and report outputs are produced, not by physical-layer explanations.
- Expected: state transitions, counter bursts, and correct severity classification.
- Must include event_id + freeze-frame + ts_quality fields.
- Expected: utilization peaks and recurrence patterns appear in trends.
- Wake attribution records evidence completeness and confidence levels.
- No missed triggers for injected conditions (allowing defined merge rules).
- Freeze-frame contains required fields and correct pre/post slice references.
- ts_mono remains monotonic; boot_count increments correctly.
- Cross-ECU correlation uses anchors when available; otherwise confidence is downgraded.
- Checkpoint survives power cycling at multiple cut points (during write, post-trigger, pre-export).
- Partial-write markers are detectable; recovery yields consistent event ring contents.
- Write budget remains within targets; write amplification remains bounded (placeholder).
- Under storms, sampling may decimate but key triggers and reports remain correct.
Use explicit metric definitions so results are comparable across benches, vehicles, and software revisions.
- miss_rate: fraction of injected triggers not captured (merge rules applied deterministically).
- false_rate: rate of critical/error reports under clean baselines (must stay below threshold).
- corr_rate: fraction of events that align across ECUs within target uncertainty tiers.
- write_amp: NVM bytes/writes per event (must remain bounded; batch commits effective).
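Under those definitions, the per-run metrics can be computed mechanically from a fault-injection run. A sketch covering the single-ECU metrics (corr_rate needs two event streams and is omitted; X thresholds stay placeholders):

```python
def run_metrics(injected_ids, captured_ids, false_alerts, clean_runs,
                nvm_bytes, event_count):
    """Comparable pass/fail metrics for one fault-injection run."""
    return {
        # share of injected triggers that never produced an event record
        "miss_rate": 1.0 - len(set(captured_ids) & set(injected_ids))
                           / len(injected_ids),
        # critical/error reports per clean baseline run
        "false_rate": false_alerts / clean_runs,
        # NVM bytes consumed per committed event record
        "write_amp": nvm_bytes / event_count,
    }
```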
Represent each test as a compact, scannable card: injection, expected evidence, expected report, pass criteria.
- Injection: short/open or disturbance (abstract).
- Expected evidence: state change + counter burst + event_id + ts_quality.
- Expected report: severity and 4Q output populated.
- Pass: miss_rate < X; false_rate < X.
- Injection: power cycling at multiple cut points.
- Expected evidence: checkpoint consistency + recoverable event ring.
- Expected report: top events remain retrievable by event_id.
- Pass: retention success rate > X; write_amp < X.
Engineering Checklist (Design → Bring-up → Production → Service)
Turn counters, utilization, wake attribution, black-box capture, and time correlation into gate-by-gate actions with measurable pass criteria.
- Focus: schema/thresholds/triggers, evidence completeness, exportability, retention, endurance.
- Not included: protocol tutorials, repair-manual encyclopedias, or EMC root-cause explanations.
Freeze the logging contract: field dictionary, thresholds with rationale, trigger coverage, time contract, and export format.
- Schema freeze: required/optional fields tagged; schema_version exported. Pass: required missing rate < X%.
- Metric semantics locked: event vs state vs counter sample, utilization windows, wake-source priority. Pass: definition↔implementation checks = X/X.
- Threshold rationale: each rule has window + count/rate + source (bench/vehicle/history). Pass: rationale coverage > X%.
- Trigger coverage map: bus-off, error-passive entry, wake, reset, VBAT dip, thermal, watchdog. Pass: critical triggers covered = 100%.
- Time contract: mono/RTC/aligned recorded with ts_quality, boot_count, time_source. Pass: monotonic violations = 0.
- Correlation anchors: anchor_id / gateway marks / sequence IDs defined. Pass: correlation-ready events > X%.
- NVM wear budget: write amplification bounded; storm policy defined. Pass: write_amp < X.
- Export contract: min reproducible package defined (summary + dump + schema + time quality). Pass: parse errors = 0.
- Fail-safe logging: logging cannot block safety-critical control. Pass: bounded CPU/ISR time < X.
- CAN FD controller (SPI): Microchip MCP2517FD / MCP2518FD (for external logging taps).
- CAN FD transceiver: TI TCAN1042-Q1; NXP TJA1044GT; Microchip MCP2562FD.
- Selective wake (PN capable): NXP TJA1145 (ISO 11898-6 class); TI TCAN1145-Q1 (family example).
- Non-volatile “black box” storage: Infineon/Cypress F-RAM CY15B104QSN (SPI); Fujitsu FRAM MB85RS2MT (SPI).
- Isolated CAN (when required): TI ISO1042-Q1; Analog Devices ADM3055E.
Validate on real harness and real loads: peak windows, burst grouping, wake evidence priority, export-and-parse loop.
- Real-harness re-measure: bench vs harness deltas captured in the same window definition. Pass: window semantics unchanged.
- Controller vs sniffer correlation: utilization and error shares cross-checked. Pass: difference < X%.
- Peak/burst coverage: peak_window_id and burst grouping validated. Pass: burst detection recall > X%.
- Retrans/error share sanity: retrans_share and error_share match observed symptoms. Pass: share trend aligns with events.
- Wake evidence priority: hardware flag > controller state > inference. Pass: mis-attribution < X%.
- Pre/Post capture: Tpre/Tpost slices present for key triggers. Pass: slice completeness > X%.
- Overflow markers visible: drop_count and overflow flags never silent. Pass: silent loss = 0.
- Export+parse loop: the min package is exported and parsed by tooling. Pass: parse errors = 0.
- CAN FD transceiver with diagnostics: Infineon TLE9255W (family example); TI TCAN1042-Q1.
- LIN transceiver (if wake correlation crosses LIN): TI TLIN1029-Q1; NXP TJA1021.
- FlexRay transceiver (if mixed networks): NXP TJA1080.
- Low-cap TVS for bus ports (SI-friendly): Nexperia PESD2CANFD; Littelfuse SM24CANB.
Ensure the production policy is durable: rate limits, decimation, retention correctness under power cuts, and NVM endurance.
- Policy tiers: normal / diagnostic / factory modes defined. Pass: mode separation verified.
- Rate limit & cooldown: repeated events do not cause alert storms. Pass: max reports/hour < X.
- Decimation under storms: sampling may drop, triggers must remain. Pass: critical trigger miss_rate < X.
- Wear validation: worst-case write budget validated across temperature. Pass: write_amp < X.
- Power-cut recovery: checkpoint consistency proven at multiple cut points. Pass: recovery success > X%.
- Version discipline: schema and thresholds are traceable by version. Pass: version missing = 0.
- Export footprint cap: maximum package size bounded. Pass: package < X MB.
- Data quality counters: drop_count and overflow markers exported. Pass: visibility rate = 100%.
- SBC w/ CAN (policy + reset reasons): NXP UJA1169 (CAN FD + LIN SBC family example); Infineon TLE9471-3ES (SBC family example).
- Watchdog supervisor (if discrete): TI TPS3431-Q1 (watchdog timer family example).
- FRAM for high-cycle logging: Infineon/Cypress CY15B104QSN; Fujitsu MB85RS2MT.
Make the exported package reproducible and readable: event retrieval by ID, confidence tagging, and 4Q reporting.
- Min reproducible package: summary + event dump + schema_version + time quality. Pass: completeness > X%.
- 4Q readability: what / severity / evidence / next measurement included per alert. Pass: 4Q compliance > X%.
- Evidence completeness: required fields present for key triggers. Pass: missing required < X%.
- Confidence mandatory: confidence downgraded when evidence is missing. Pass: untagged alerts = 0.
- Event retrieval by ID: event_id maps to freeze-frame and slices. Pass: retrieval success > X%.
- Cross-ECU notes: anchor-aware correlation guidance included. Pass: correlation rate > X.
- Integrity check: package structure self-check (hash/CRC placeholder). Pass: integrity failures = 0.
- Rate-limited summaries: burst events collapsed but counts preserved. Pass: counts retained = 100%.
- CAN transceiver w/ wake flags: NXP TJA1145; TI TCAN1145-Q1 (examples).
- Isolated CAN for HV boundary evidence: TI ISO1042-Q1; ADI ADM3055E.
- FRAM for robust dumps: Infineon/Cypress CY15B104QSN; Fujitsu MB85RS2MT.
Applications (What to log, and why it matters)
Avoid generic use-cases. Each bucket defines the highest-value logs, the minimal field set, and trigger/windows for serviceability.
- Only: what to record + why it is valuable for triage.
- Not included: isolation theory, EMC mechanisms, or protocol internals.
Fast detection of harness/assembly issues using burst errors and utilization peaks (not just averages).
- Triggers: error bursts, bus-off, utilization peak exceed.
- Pass: burst capture recall > X%; peak-window consistency < X% drift.
- CAN FD transceiver: TI TCAN1042-Q1; NXP TJA1044GT; Microchip MCP2562FD.
- Port protection: Nexperia PESD2CANFD; Littelfuse SM24CANB.
Reproduce sporadic bus-off or false wakes using a black-box evidence chain (pre/post slices + attribution + time quality).
- Package: summary + event ring dump + schema_version.
- Wake: wake_source + confidence + evidence_fields_present.
- Slices: Tpre/Tpost refs for key triggers.
- Time: time_source + ts_quality + boot_count.
- Power: reset_reason + power_state (+ VBAT dip marker if present).
- Network: state transitions + counters_delta + util snapshot.
- Triggers: wake, bus-off, reset, VBAT dip, thermal/watchdog.
- Pass: reproducible package decode success > X%; 4Q compliance > X%.
- Selective wake transceiver: NXP TJA1145; TI TCAN1145-Q1 (examples).
- Black-box NVM: Infineon/Cypress CY15B104QSN; Fujitsu MB85RS2MT.
Congestion and retransmissions must be quantified with consistent windows and queue/drop observability.
- Utilization: avg/peak/p95 + window.
- Shares: retrans_share + error_share (+ arbitration_loss_share if available).
- Queue health: queue_depth + drop_count (record only; no RTOS tutorial).
- Correlation: anchor_id or sequence_id + ts_quality.
- Bridge context: flow_id (abstract) + bus_id.
- Triggers: util_peak exceed, retrans surge, drop_count increase.
- Pass: congestion alerts rate-limited; correlation rate > X.
- CAN FD controller (SPI): Microchip MCP2517FD / MCP2518FD (for mirror taps or auxiliary buses).
- SBC (reset reasons + policy): NXP UJA1169 (CAN FD + LIN SBC family example).
Intermittent errors across ground offsets require evidence fields (flags + power markers + time quality), not isolation theory.
- Transceiver flags: fault/wake/thermal indicators (abstracted fields).
- Power markers: VBAT dip markers + reset_reason + power_state.
- Network evidence: counters_delta + state transitions + event_id.
- Time evidence: ts_quality + boot_count + time_source.
- Retention: freeze-frame refs survive power cuts.
- Triggers: power events, error bursts, reset, thermal flags.
- Pass: evidence completeness > X%; monotonic violations = 0.
- Isolated CAN transceiver: TI ISO1042-Q1; Analog Devices ADM3055E.
- Black-box NVM: Infineon/Cypress CY15B104QSN; Fujitsu MB85RS2MT.
FAQs (Diagnostics & Logging)
Each FAQ is a strict 4-line triage closure: Likely cause / Quick check / Fix / Pass criteria (threshold placeholders: X/Y/Tpre/Tpost).
TEC/REC not high, but bus-off happens frequently — sampling semantics or recovery policy first?
Likely cause: TEC/REC is sampled at the wrong moment (post-recovery / wrong hook) or bus-off recovery policy forces repeated transitions without a large visible counter trend.
Quick check: log state_change timestamps + TEC/REC snapshot at bus-off entry/exit; compare counters_delta around each transition; verify sampling is triggered by the same ISR/callback as the state change.
Fix: move sampling to the state-transition hook; add bus_off_reason + recovery_reason; enforce recovery cooldown and explicit retry limits (policy, not waveform).
Pass criteria: > X% bus-off events contain {state_change + TEC/REC snapshot + counters_delta}; unexpected bus-off rate < X per hour over Y operating hours.
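The Fix above (sample inside the state-transition hook, keep a per-transition delta) can be sketched as follows; the class and field names are illustrative, not a particular controller driver's API:

```python
from dataclasses import dataclass

@dataclass
class StateChangeEvidence:
    ts: int          # timestamp of the state change
    state: str       # e.g. "error_passive", "bus_off"
    tec: int         # TEC snapshot taken in the same hook
    rec: int         # REC snapshot taken in the same hook
    tec_delta: int   # change since the previous transition
    rec_delta: int

class StateChangeSampler:
    """Snapshot TEC/REC inside the state-change hook, not on a periodic timer."""

    def __init__(self):
        self._last_tec = 0
        self._last_rec = 0
        self.events = []

    def on_state_change(self, ts: int, state: str, tec: int, rec: int):
        self.events.append(StateChangeEvidence(
            ts, state, tec, rec,
            tec - self._last_tec, rec - self._last_rec))
        self._last_tec, self._last_rec = tec, rec
```

Because the snapshot rides the same callback as the transition, a post-recovery read can no longer hide the counter trend.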
Bus utilization shows 20%, but the field feels “blocked” — peak window or retransmission share first?
Likely cause: average window is too long (peaks hidden) and/or retrans_share and error frames are excluded from the utilization denominator.
Quick check: recompute util_peak over a short peak_window (e.g., 1–10 ms) and compare to avg; include retrans/error contributions; inspect p95/p99 and top peak_window_id buckets.
Fix: report utilization as {avg + p95 + peak} with the window definition embedded; always publish retrans_share and error_share; drive congestion alerts from peak/p95 (not avg).
Pass criteria: util_peak < X% and retrans_share < X% in window Y; peak windows are reproducible within ±X% across repeated runs.
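The {avg + p95 + peak} report with an embedded window definition could look like this sketch; p95 uses the nearest-rank method, and the input is one busy-fraction sample per short peak_window:

```python
def util_report(busy_fractions, window_ms):
    """Summarize per-window busy fractions as {avg, p95, peak}; p95 is nearest-rank."""
    s = sorted(busy_fractions)
    n = len(s)
    rank = (95 * n + 99) // 100  # ceil(0.95 * n) in integer math
    return {"window_ms": window_ms,   # window definition travels with the numbers
            "avg": sum(s) / n,
            "p95": s[rank - 1],
            "peak": s[-1]}
```

A single congested window stands out in peak while barely moving the average, which is exactly the "20% but feels blocked" symptom.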
False wakes are frequent, but logs show no frames — wake-flag timing or power-event wake first?
Likely cause: wake is not bus-driven (local/timed/power event) or wake_flag is read after it is cleared (timing/ordering bug).
Quick check: capture wake flags at the earliest wake ISR; record power_state, reset_reason, and VBAT_dip_marker; ensure the black box includes Tpre/Tpost slices around wake (Tpre/Tpost placeholders).
Fix: latch and persist flags (read-once → store); explicitly classify wake_source as bus/local/timed/power with evidence; add “no-bus-frame” wake category with required evidence fields.
Pass criteria: > X% wakes have wake_source + evidence fields present; “no-frame wakes” are classified with confidence ≥ X (not “unknown”).
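The explicit wake_source classification from the Fix can be sketched as an ordered, hardware-flag-first decision; all flag names are placeholders for latched (read-once → stored) evidence:

```python
def classify_wake(evidence: dict) -> str:
    """Hardware-flag-first wake attribution; branch order encodes evidence priority."""
    if evidence.get("bus_wake_flag"):
        return "bus"
    if evidence.get("local_wake_flag"):
        return "local"
    if evidence.get("timer_wake_flag"):
        return "timed"
    if evidence.get("vbat_dip_marker") or evidence.get("reset_reason"):
        return "power"
    return "unattributed"   # must stay rare per the pass criteria
```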
Black box captured a wake, but the wake source is unknown — which 3 evidence fields to add first?
Likely cause: attribution lacks minimum evidence (hardware flag snapshot, controller state snapshot, power marker) and confidence is computed without inputs.
Quick check: verify presence of wake_flag, controller_wake_status, reset_reason/VBAT_dip_marker, and ts_quality for the wake event.
Fix: add these three evidence fields first: (1) transceiver wake flag + capture timestamp, (2) controller state snapshot at wake, (3) power marker (reset_reason or VBAT dip marker); implement confidence scoring that downgrades when any evidence is missing.
Pass criteria: unknown wake_source < X% of wakes; confidence ≥ X for attributed wakes; evidence completeness ≥ X%.
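A confidence score that "downgrades when any evidence is missing" can be as simple as a weighted sum over the three fields; the weights below are placeholder values to be tuned per platform:

```python
# Placeholder weights for the three minimum evidence fields.
WEIGHTS = {"wake_flag": 0.4, "controller_state": 0.3, "power_marker": 0.3}

def wake_confidence(present: dict) -> float:
    """Confidence = sum of weights of evidence fields actually captured."""
    return sum(w for k, w in WEIGHTS.items() if present.get(k))
```

Missing any one field caps the score below 1.0, so confidence is never computed without inputs.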
After changing the harness, error counters rise but the waveform looks “OK” — what trend comparison first?
Likely cause: intermittent events are visible in counters_delta trends but not in a single snapshot; baseline windows differ (apples-to-oranges) or the “rate” metric is not normalized.
Quick check: compare counters as rate per hour (or per 1k frames) under matched windows; slice by node_id and by peak_window; correlate with util_peak and retrans_share.
Fix: standardize the trend window and publish it with every report; store a baseline profile (same window, same normalization) and produce “before/after” delta by node buckets.
Pass criteria: post-change error-rate returns within X% of baseline over Y hours (or Y thermal cycles); top contributing node(s) explain > X% of the delta.
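The normalized before/after comparison can be sketched as below; per-node rates are compared only under matched windows and the same normalization (here, errors per 1k frames):

```python
def error_rate_per_1k(counters_delta: int, frames: int) -> float:
    """Normalize an error-counter delta to errors per 1000 frames."""
    return 1000.0 * counters_delta / frames

def before_after(baseline: dict, current: dict) -> dict:
    """Per-node rate delta; both dicts map node_id -> normalized rate, same window."""
    return {node: current[node] - baseline.get(node, 0.0) for node in current}
```

A node missing from the baseline counts from zero, so newly added talkers show their full contribution.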
False wakes only in winter/dry conditions — log EMC-event counters or ground/power dip markers first?
Likely cause: environment-correlated transients show up as wake flags/power markers rather than as bus frames; missing markers make the event look “frame-less”.
Quick check: record wake_flag transitions + VBAT_dip_marker + reset_reason around the wake; if available, also log an EMC_event_counter (abstract counter, not waveform).
Fix: add wake debounce + cooldown tuned for transient storms; enforce hardware-flag-first attribution; optionally add environment tags (temperature/humidity placeholder) strictly as metadata.
Pass criteria: false-wake rate < X per day under the condition; > X% of wakes include at least one concrete marker {wake_flag/power_dip/reset_reason}.
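The "hardware-flag-first attribution plus debounce/cooldown" policy can be sketched as a marker-gated acceptance function; marker names and the cooldown default are placeholders:

```python
def accept_wake(ts_ms, markers, last_accept_ms, cooldown_ms=500):
    """Accept a wake only with a concrete marker, and at most once per cooldown.

    Returns (accepted, new_last_accept_ms).
    """
    has_marker = any(markers.get(k) for k in ("wake_flag", "power_dip", "reset_reason"))
    if not has_marker:
        return False, last_accept_ms   # frame-less, marker-less: reject outright
    if ts_ms - last_accept_ms < cooldown_ms:
        return False, last_accept_ms   # transient storm: suppress repeats
    return True, ts_ms
```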
Utilization is stable on bench, but vehicle peaks spike — group by node first or by message priority first?
Likely cause: peak bursts come from a small set of talkers or from priority-driven bursts; averaging hides the spike mechanism.
Quick check: compute util_peak per short peak_window; produce Top-5 “peak contributors” by node_id and separately by “priority/class” bucket (abstract); track retrans_share during those peaks.
Fix: always export both breakdowns: per-node and per-priority (or per-message-class) in the same report; gate “congestion” alerts on peak/p95 and attach the Top-5 contributors list.
Pass criteria: Top contributor(s) explain > X% of peak utilization; peak utilization < X% for Y consecutive peak windows (or peaks are correctly attributed with confidence ≥ X).
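The Top-5 contributor breakdown can be computed the same way for node_id and for the priority/class bucket; the sketch below takes (contributor_id, busy_bits) pairs from one set of peak windows:

```python
from collections import Counter

def top_contributors(samples, k=5):
    """Return the top-k contributors as (id, share-of-total) pairs."""
    totals = Counter()
    for cid, bits in samples:
        totals[cid] += bits
    grand = sum(totals.values())
    return [(cid, bits / grand) for cid, bits in totals.most_common(k)]
```

Running it once keyed by node_id and once keyed by priority bucket yields both breakdowns the Fix asks to export in the same report.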
Log timestamps don’t align; multi-ECU events cannot be correlated — unify monotonic time or add gateway anchors first?
Likely cause: mixed time bases without time_source/ts_quality metadata, or clock steps/resets between events (including boot boundaries).
Quick check: inspect time_source, ts_quality, boot_count; search for monotonic violations; check whether anchor_id (gateway marks) exists for cross-ECU correlation.
Fix: use monotonic time for ordering within an ECU and add gateway anchors for cross-ECU alignment; record drift_est/sync_status (abstract) in ts_quality.
Pass criteria: correlation success > X% across ECUs within window Y; monotonic violations = 0 for key events; anchor coverage > X%.
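Cross-ECU alignment via gateway anchors reduces to estimating the offset between two local monotonic clocks from shared anchor_ids; a robust minimal sketch (median over per-anchor offsets, assuming each dict maps anchor_id to a local timestamp):

```python
def align_offset(anchors_a, anchors_b):
    """Estimate clock offset (b - a) from anchors seen by both ECUs."""
    shared = anchors_a.keys() & anchors_b.keys()
    offsets = sorted(anchors_b[i] - anchors_a[i] for i in shared)
    # Median tolerates a single corrupted or late-captured anchor.
    return offsets[len(offsets) // 2]
```

The residual spread of the per-anchor offsets is a natural input for the drift_est/sync_status fields in ts_quality.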
The black box gets overwritten; key events are missing — change triggers first or move to a dual-layer buffer first?
Likely cause: single ring buffer without reservation + high trigger rate leads to ring_overrun; key triggers compete with low-value spam events.
Quick check: log event_rate, ring_overrun counters, and Top trigger types; verify whether key triggers reserve slots and whether freeze-frames exist (freeze_frame_ref).
Fix: implement dual-layer capture (event ring + sample ring) with reserved freeze-frames for critical triggers; tighten triggers and add cooldown; prioritize critical event IDs in retention policy.
Pass criteria: critical event retention ≥ X events (or ≥ X hours); key trigger miss_rate < X%; ring_overrun rate < X per day.
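The dual-layer capture idea — critical triggers never compete with low-value spam for ring slots — can be sketched with two independent rings; capacities and the criticality flag are placeholders:

```python
from collections import deque

class DualRing:
    """Event ring plus a reserved ring for critical triggers (sketch)."""

    def __init__(self, normal_cap, critical_cap):
        self.normal = deque(maxlen=normal_cap)     # high-rate, overwritable
        self.critical = deque(maxlen=critical_cap) # reserved; spam cannot evict these

    def log(self, event_id, critical=False):
        (self.critical if critical else self.normal).append(event_id)
```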
Logging volume causes flash wear — decimation first or event compression first?
Likely cause: frequent checkpoints to flash + no decimation/compression leads to high write_amp (write amplification) and early endurance exhaustion.
Quick check: estimate writes/day and write_amp from event rate + checkpoint interval; identify “hot” event IDs; verify whether the design logs periodic samples at full rate during storms.
Fix: apply decimation to periodic samples and compress/aggregate repeated events; keep flash for sparse checkpoints and consider high-cycle NVM for dumps (e.g., FRAM CY15B104QSN / MB85RS2MT) for high-write paths.
Pass criteria: write_amp < X; projected endurance > X years at worst-case duty cycle; storm mode retains all critical triggers with miss_rate < X%.
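The writes/day estimate from the Quick check can be put in one formula; it assumes every decimated periodic sample and every checkpoint causes one flash write, which ignores write amplification and is therefore a lower bound:

```python
def writes_per_day(event_rate_hz, checkpoint_interval_s, decimation=1):
    """Lower-bound flash writes per day: decimated samples plus checkpoints."""
    sample_writes = event_rate_hz * 86400 / decimation
    checkpoint_writes = 86400 / checkpoint_interval_s
    return sample_writes + checkpoint_writes
```

At 10 Hz samples with 100:1 decimation and a 60 s checkpoint interval this is 8640 + 1440 writes/day, a concrete input for the endurance projection.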
Too many false alarms; service cannot use the report — debounce window or severity mapping first?
Likely cause: thresholds fire on noisy samples (not on state/event transitions), missing cooldown, and severity escalation ignores evidence completeness.
Quick check: compute false-positive rate on a labeled set; inspect debounce_window + cooldown; verify mapping of bus-off / wake storm to severity; confirm evidence_fields_present is required for high severity.
Fix: apply N-in-window debounce plus cooldown; split info/warn/error/critical rules; require confidence ≥ X and evidence completeness ≥ X% before raising critical alarms.
Pass criteria: false alarm rate < X%; alert volume/day < X while recall > X%; 4Q report completeness ≥ X%.
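The N-in-window debounce from the Fix can be sketched with a timestamp queue; N and the window are the tunables the pass criteria measure against:

```python
from collections import deque

class NInWindowDebounce:
    """Raise only when at least n qualifying events land within window_ms."""

    def __init__(self, n, window_ms):
        self.n, self.window_ms = n, window_ms
        self.ts = deque()

    def feed(self, ts_ms):
        self.ts.append(ts_ms)
        # Drop events older than the window before counting.
        while self.ts and ts_ms - self.ts[0] > self.window_ms:
            self.ts.popleft()
        return len(self.ts) >= self.n
```

Severity mapping then sits on top: the debounced output may raise warn, while critical additionally requires confidence and evidence-completeness thresholds.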
Wake happens then the ECU sleeps immediately; logs are fragmented — policy transition point or dump timing first?
Likely cause: power policy cuts logging before the dump/checkpoint completes; dump trigger occurs after the sleep decision; retention window is too short.
Quick check: record power_state transitions around wake; measure time from wake ISR to dump start; verify checkpoint_seq/checkpoint_done fields; confirm Tpre/Tpost slices exist.
Fix: capture minimal freeze-frame immediately at wake; enforce a hold-off timer before sleep decision; checkpoint critical events before sleep; make dump timing part of the policy contract.
Pass criteria: > X% wake events include contiguous Tpre/Tpost slices; dump success > X%; fragmented logs < X% over Y cycles.
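The hold-off-before-sleep contract can be reduced to a single gating predicate; the hold-off default is a placeholder and checkpoint_done stands for the checkpoint_seq/checkpoint_done fields:

```python
def may_sleep(now_ms, wake_ts_ms, checkpoint_done, holdoff_ms=200):
    """Sleep is allowed only after the hold-off elapses AND the checkpoint commits."""
    return checkpoint_done and (now_ms - wake_ts_ms) >= holdoff_ms
```

Because both conditions are required, neither a fast sleep decision nor a slow dump can fragment the Tpre/Tpost slices on its own.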