CAN Controller & Bridge (TTCAN): Filtering, Remap & Gateway
A CAN controller/bridge is the traffic policy engine of in-vehicle networks: it filters, schedules, and routes messages so critical control frames stay deterministic while diagnostics and logging are shaped and audited.
Done right, it turns bus load, rule updates, and burst events into measurable budgets (latency/jitter/queue) with traceable rules (version/checksum) and serviceable logs (reason codes) instead of mystery failures.
Definition & Scope Guard
A CAN controller/bridge turns raw bus traffic into deterministic, policy-driven flows: receive → classify (filters) → optionally transform (remap) → queue/shape → forward across domains (CAN↔CAN, CAN↔Ethernet/DoIP), with counters and logs that make failures diagnosable.
- Typical outcomes: stable latency under burst load, controlled forwarding (no storms), explainable drops/timeouts, serviceable field logs.
- Primary value: forwarding becomes strategy (filtering, shaping, isolation, audit), not blind relay.
In scope:
- Controller datapath: mailboxes/FIFOs, acceptance filtering, timestamps, error states (bus-off policy).
- Gateway/bridge policy: filter → remap → rate-shape → queue isolation → forward, with loop prevention.
- Time-triggered messaging (TTCAN): schedule concepts, windows, determinism checks (no PHY deep-dive).
- Latency & buffer budgets: stage breakdown, p95/p99 targets, backpressure and drop policies.
- Diagnostics & serviceability: counters, drop reasons, black-box logging fields, audit hooks.
Out of scope:
- PHY/transceiver electrical behavior (waveforms, termination, EMC, TVS/CMC placement). See sibling pages: HS CAN / CAN FD / SIC / CAN XL PHY.
- Selective wake / partial networking (ISO 11898-6) filter tables and standby false-wake tuning. See: Selective Wake / PN.
- Ethernet/TSN PHY details. Only bridge touchpoints are discussed (mapping, latency, isolation).
When to use this page:
- Symptoms: drops with low bus load, timeouts only during diagnostics bursts, latency jitter, mis-forwarding, or bus-off recovery storms.
- Questions: “Which rules forward what?”, “Where is the queue peak?”, “What is the p99 latency?”, “Which domain caused backpressure?”, “Can the event be proven by counters/logs?”
Audience:
- System architects: choose bridge patterns and isolation boundaries; plan determinism and serviceability.
- Firmware/network stack: implement rule tables, queues, timestamps, bus-off recovery, and logging schemas.
- Diagnostics & test: define evidence fields (counters, drop reasons) and pass criteria for bursts and corner cases.
- Production/service: rely on black-box logs for fast root-cause attribution in the field.
System Architecture Patterns
All bridging can be expressed as a single pipeline: Ingress → Classify → Transform → Shape → Egress + Observe. Each architecture pattern is the same pipeline under different constraints (loop risk, jitter risk, burst risk, audit requirements).
- Ingress: identify source domain, message class, and burst behavior (steady vs diagnostics).
- Classify: hardware filters (coarse) + software rules (fine) + security/diagnostic allowlists (strict).
- Transform: ID remap and payload rewrite only when required; treat “modify” as a safety boundary decision.
- Shape: isolate queues per class; reserve bandwidth for control traffic; rate-limit diagnostics/logging.
- Egress: prevent loops and storms; enforce per-destination budgets; apply backpressure policies.
- Observe: counters + drop reasons + rule-id + timestamps; enable field reproduction (“black-box”).
Pattern: multi-CAN domain gateway (CAN↔CAN)
- Use when: multiple CAN domains require isolation plus controlled exchange (ID translation, domain boundaries).
- Design focus: rule table ownership, remap correctness, loop prevention, congestion containment.
- Common pitfall: “mirror forwarding” creates a silent loop; storms amplify across domains.
- Verify: no-forward-back rule, loop detection counters, and bounded latency at high load.
Pattern: CAN-to-Ethernet backbone aggregation
- Use when: CAN traffic is aggregated into a higher-bandwidth backbone for centralized compute.
- Design focus: latency/jitter budgets, queue isolation, prioritization of control vs bulk flows.
- Common pitfall: Ethernet-side bursts turn into CAN-side jitter (queue coupling).
- Verify: p95/p99 latency bounds with burst injection; queue peak and drop reason coverage.
Pattern: remote diagnostics gateway (DoIP)
- Use when: service tools reach CAN via IP tunnels and a secure gateway.
- Design focus: control-plane priority, diagnostics rate shaping, allowlist/audit, timeout attribution.
- Common pitfall: diagnostic bursts starve periodic control frames or trigger false “network unstable” symptoms.
- Verify: channel quotas, audit logs, and consistent timeouts correlated to queue and rule-id.
Pattern: black-box logging bridge
- Use when: field failures require post-event reconstruction (who sent what, when, and why it dropped).
- Design focus: ring buffers, minimal-but-sufficient fields, drop reason taxonomy, non-intrusive shaping.
- Common pitfall: logging traffic competes with critical traffic; missing drop reasons make logs useless.
- Verify: log cannot destabilize control traffic; events are reproducible from counters + timestamps.
Evidence fields (all patterns):
- Rule evidence: rule-id, rule-version/hash, hit-count, last-hit timestamp.
- Queue evidence: per-class queue peak, drop-count, drop-reason, backpressure time.
- Timing evidence: ingress timestamp, egress timestamp, derived latency histogram (p50/p95/p99).
- Fault evidence: bus-off events, recovery attempts, quarantine triggers, storm counters.
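As a minimal illustration of the timing evidence, p50/p95/p99 can be derived from per-frame ingress/egress timestamp pairs with a nearest-rank percentile. The function name and microsecond units are assumptions, not part of any specific tooling.

```python
def latency_percentiles(samples, percentiles=(50, 95, 99)):
    """samples: list of (ts_ingress_us, ts_egress_us) pairs; returns {'p50': ...}."""
    lat = sorted(te - ti for ti, te in samples)
    out = {}
    for p in percentiles:
        # nearest-rank: ceil(p * n / 100), converted to a 0-based index
        k = -(-p * len(lat) // 100) - 1
        out["p%d" % p] = lat[max(0, k)]
    return out
```

Nearest-rank is deliberate: it always returns an observed latency value, so a p99 figure in a field report corresponds to a real frame that can be looked up in the black-box log.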
Filtering / Remapping / Routing Strategy
Bridging succeeds when forwarding becomes policy: a versioned rule pipeline that is measurable, auditable, and safe under burst load. The goal is to define what is allowed, how it is shaped, and how it is proven by counters and logs.
- Hardware filter (coarse): reduce RX load early (banked masks/lists/ranges). Treat as a performance asset. Evidence: per-bank hit count, reject count (if available), RX FIFO watermark.
- Software rules (fine): domain isolation, forwarding decisions, remap, and shaping. Treat as a business/architecture asset. Evidence: rule-id hits, rule version/hash, drop reasons.
- Security & diagnostics allowlist (strict): minimal exposure surface (default deny, audited allow). Treat as a security asset. Evidence: allow/deny audit logs, quota usage, timeout attribution.
Keep hardware filters simple and stable. Put frequent-change logic into software rules with versioning and observability.
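The three coarse filter forms mentioned above (banked masks, lists, ranges) can be sketched as a single acceptance check. The bank layout here is illustrative and not tied to any specific controller's register model.

```python
def accept(can_id, banks):
    """banks: ("mask", code, mask) | ("list", id_set) | ("range", lo, hi)."""
    for bank in banks:
        kind = bank[0]
        if kind == "mask" and (can_id & bank[2]) == (bank[1] & bank[2]):
            return True
        if kind == "list" and can_id in bank[1]:
            return True
        if kind == "range" and bank[1] <= can_id <= bank[2]:
            return True
    return False   # rejected frames never reach the software rule stage
```

In hardware this check runs per frame before any interrupt fires, which is why keeping the banks coarse and stable pays off: the software rule stage only sees pre-thinned traffic.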
Prefer forward or drop. Modifying control/safety-critical semantics is not the default. If modification is required, it must be explicit, versioned, and auditable.
- ID translation: domain-specific ID plans require mapping across a boundary.
- Payload rewrite: only for strictly defined gateway transformations with audit fields.
- Signal-level mapping: only when virtualization/aggregation defines a clear “semantic owner”.
- rule-id + rule version/hash recorded at decision time.
- before/after tag (compact summary) for remap events.
- drop/deny reason taxonomy for blocked or shaped traffic.
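A minimal sketch of how those audit fields can be captured at a remap decision point. The function `remap_with_audit` and its field names are hypothetical; the point is that the rule hash and before/after tag are recorded at decision time, not reconstructed later.

```python
import hashlib
import time

def remap_with_audit(can_id, payload, rule_id, rule_text, new_id, audit_log):
    """Remap an ID and append the required audit record at decision time."""
    rule_hash = hashlib.sha256(rule_text.encode()).hexdigest()[:8]
    audit_log.append({
        "ts": time.time(),                               # decision timestamp
        "rule_id": rule_id,
        "rule_hash": rule_hash,                          # rule version/hash
        "before_after": "0x%X->0x%X" % (can_id, new_id), # compact summary
    })
    return new_id, payload
```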
Use class-based isolation and quotas. Priorities alone can fail under contention; independent queues reduce coupling and stabilize p99 latency.
- Control (periodic and safety-relevant flows): dedicated queue, guaranteed service, and a measurable jitter budget.
- Diagnostics (bursty traffic): enforce quotas (token bucket), allowlist, and audit logs. Timeouts must be attributable.
- Logging (lowest priority): drops are allowed, but every drop must carry an explainable reason and counters.
- per-class queue peak + drop count + drop reason
- token-bucket empty time + shaped count
- end-to-end latency histogram (p50/p95/p99)
- Static guard: forbid mirror rules that create unconditional A→B and B→A forwarding; require an explicit “return path” policy.
- Runtime guard: storm counters and short-window detection trigger quarantine; preserve control allowlist, throttle diagnostics/logging.
- Observability: loop-detected count, quarantine reason, burst histogram, per-rule drop reasons.
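The static guard can be approximated as a rule-table lint that flags mirrored forward rules with overlapping ID ranges, run at rule-review time before deployment. The rule tuple shape here is an assumption for illustration.

```python
def find_mirror_pairs(rules):
    """rules: list of (rule_id, src_bus, dst_bus, id_lo, id_hi) forward rules."""
    mirrors = []
    for i, (ra, sa, da, lo_a, hi_a) in enumerate(rules):
        for rb, sb, db, lo_b, hi_b in rules[i + 1:]:
            reversed_path = (sa, da) == (db, sb)          # A->B paired with B->A
            overlap = lo_a <= hi_b and lo_b <= hi_a       # ID ranges intersect
            if reversed_path and overlap:
                mirrors.append((ra, rb))
    return mirrors   # a non-empty result should fail the rule-table review gate
```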
Rule table fields:
- rule-id (unique)
- version/hash (rollback-proof)
- owner tag (control/diag/log/security)
- src bus / dst bus
- ID match (mask/list/range)
- direction (ingress/egress)
- action (drop / forward / remap)
- priority class (control/diag/log)
- logging flag (audit on/off)
- rate limit (token bucket/quota)
- burst allowance (peak control)
- drop policy (oldest/newest per class)
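The fields above can be carried as one record, with a checksum over the whole ruleset for rollback and field-report traceability. The record layout and checksum scheme here are illustrative assumptions.

```python
import hashlib
import json

def ruleset_checksum(rules):
    """Stable checksum of the whole rule table (for rollback and field reports)."""
    canonical = json.dumps(rules, sort_keys=True)   # stable serialization
    return hashlib.sha256(canonical.encode()).hexdigest()[:16]

rule = {
    "rule_id": "GW-017", "owner": "diag",                 # identity
    "src_bus": "CAN2", "dst_bus": "CAN0",                 # match
    "match": {"kind": "range", "lo": 0x700, "hi": 0x7FF},
    "direction": "ingress",
    "action": "forward", "class": "diag", "audit": True,  # action
    "rate_limit": {"tokens_per_s": 200, "burst": 50},     # shaping
    "drop_policy": "oldest",
}
```

Any field change produces a different checksum, so a field report that carries the checksum pins down exactly which ruleset was running.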
Pass criteria:
- Under diagnostic burst: control p99 forwarding latency < X ms (placeholder)
- No uncontrolled loop/storm: storm counters remain < X within Y seconds (placeholder)
- Every allow/deny decision is attributable: rule-id + reason + timestamps available
Bridging to Ethernet / DoIP
Only bridge touchpoints are covered: mapping points, isolation, shaping, audit, and latency evidence. Protocol tutorials (DoIP/UDS details, Ethernet PHY/TSN specifics) are intentionally out of scope.
Backbone aggregation:
- Why: domain controllers and centralized compute aggregate multiple CAN domains into a higher-bandwidth backbone.
- Primary risk: Ethernet-side bursts couple into CAN queues, producing p99 latency spikes and unexplained timeouts.
- Bridge focus: class isolation, rate shaping, and evidence fields that attribute delays to specific queues and rules.
Diagnostics/DoIP channel:
- Treat as an attack & congestion source: diagnostics is bursty and must not destabilize control traffic.
- Minimum controls: default deny, strict allowlist, quota/token bucket, and full audit logs.
- Timeout attribution: every timeout should map to a reason (deny, quota, queue peak, backpressure).
Separate planes with independent queues and quotas. Control traffic requires reserved service; diagnostics/logging are shaped and audited.
- Control: reserved queue + determinism targets (p99 latency/jitter placeholders).
- Diagnostics: token bucket + allowlist gate + audit logs (timeouts must be attributable).
- Logging: drop-tolerant; always record drop reasons and counters.
- Default deny: no rule means drop; drop includes reason + timestamps.
- Allowlist only: each allow has rule-id, version/hash, and owner tag.
- Rate limiting: per-channel quotas; record shaped events and token-empty durations.
- Audit: allow/deny decisions are logged and attributable (rule-id + timestamps + counters).
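A minimal token-bucket sketch for the rate-limiting gate above, including a shaped-event counter so that timeouts remain attributable. Rates, names, and units are placeholders.

```python
class TokenBucket:
    """Per-channel quota: refill at rate_per_s, cap at burst, spend 1 per frame."""
    def __init__(self, rate_per_s, burst):
        self.rate = rate_per_s
        self.burst = burst
        self.tokens = float(burst)
        self.last = 0.0
        self.shaped_count = 0        # evidence: how many frames were shaped away

    def allow(self, now):
        # refill proportionally to elapsed time, capped at the burst allowance
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        self.shaped_count += 1       # timeout attribution: reason = RATE
        return False
```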
Determinism, Latency & Buffer Budget
Determinism is a measurable contract: define a latency path, break it into stages, monitor each stage, and size buffers for burst worst-cases. When congestion appears, enter controlled degradation instead of unpredictable instability.
- Latency path: ingress timestamp → rule decision complete → enqueue → dequeue → TX submit (or egress).
- Jitter: p99 (or p99–p50 spread) on the same path; thresholds are placeholders until system targets are set.
- Backlog: queue depth peaks and time-above-watermark over a defined window.
- t_ISR (interrupt entry time): impacted by interrupt masking, priority inversions, and CPU contention. Monitor: IRQ latency histogram (or closest proxy).
- t_proc (processing time): rule matching, remap, and bookkeeping. Monitor: sampled per-message compute time; rule-eval time budget.
- t_queue (queueing time): growth under burst, shaping constraints, and backpressure. Monitor: queue depth peak, time-in-queue distribution, drop count.
- t_wait (send-wait time): arbitration opportunities and TX backlog effects. Monitor: TX backlog peak and time-above-watermark.
Determinism requires both isolation (queues/classes) and observability (timestamps/counters). Faster hardware alone cannot fix coupling-induced p99 spikes.
Common jitter sources:
- interrupt masking windows
- task preemption and priority inversions
- non-preemptive critical sections
- DMA contention and arbitration
- cache misses and memory bandwidth pressure
- shared bus contention across peripherals
- queue backpressure and coupling
- insufficient shaping for diagnostics/logging bursts
- output TX backlog under high bus utilization
1. Baseline: define the average ingress rate from bus utilization (placeholder X%) and average frame rate. Record the typical processing service rate (t_proc_typ).
2. Burst model: specify burst size N and burst duration T for diagnostic/OTA/logging sources. Identify whether the burst is gated by allowlist/quota or arrives unbounded.
3. Sizing: queue growth ≈ ingress_rate − service_rate during burst windows. Size FIFO/queue depth for the peak growth plus a safety factor (placeholder X). Add watermarks to trigger shaping and controlled degradation before overflow.
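The sizing rule can be made concrete with a small worked example. All rates below and the 1.5× safety factor are illustrative placeholders, not recommendations.

```python
import math

def required_depth(ingress_rate, service_rate, burst_duration_s, safety_factor=1.5):
    """Frames of depth needed to absorb a burst: (ingress - service) * T * margin."""
    growth = max(0.0, ingress_rate - service_rate) * burst_duration_s
    return math.ceil(growth * safety_factor)

# Example: a 2000 fps diagnostic burst for 0.5 s against a 1200 fps service rate
# grows the queue by ~400 frames; with a 1.5x margin, size the queue for 600.
```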
- Control: isolate and reserve service; avoid drops whenever possible.
- Diagnostics: allowlist + quota; shaping is mandatory under burst.
- Logging: drop-tolerant; prioritize recent-window observability (ring buffer).
- Drop-oldest typically preserves a “recent window” for black-box replay.
- Drop-newest can protect existing queued control sequences when bursts arrive.
- Every drop/deny must include drop reason and counters.
Controlled degradation (degraded mode):
- Entry triggers: queue depth over watermark, drop burst, error-rate spike.
- Behavior: keep control allowlist; throttle diagnostics/logging aggressively.
- Evidence: mode entry/exit logs with reason + duration.
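A sketch of that entry/exit logic with the required evidence log: watermark or drop-burst breaches enter the mode, and a stability window gates the exit so relapses restart the timer. Thresholds, the stability window, and reason codes are assumptions.

```python
class DegradedMode:
    def __init__(self, depth_watermark, drop_burst_limit, exit_stable_s):
        self.wm = depth_watermark
        self.drop_limit = drop_burst_limit
        self.exit_stable_s = exit_stable_s
        self.active = False
        self.stable_since = None
        self.log = []                        # evidence: entry/exit with reason

    def update(self, now, queue_depth, drops_in_window):
        if not self.active:
            if queue_depth > self.wm or drops_in_window > self.drop_limit:
                reason = "QDEPTH" if queue_depth > self.wm else "DROP_BURST"
                self.active, self.stable_since = True, None
                self.log.append((now, "enter", reason))
        else:
            if queue_depth <= self.wm and drops_in_window == 0:
                if self.stable_since is None:
                    self.stable_since = now   # start the stability gate
                elif now - self.stable_since >= self.exit_stable_s:
                    self.active = False
                    self.log.append((now, "exit", "STABLE"))
            else:
                self.stable_since = None      # relapse: restart the stability gate
        return self.active
```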
Diagnostics, Fault Handling & Logging
Serviceability requires evidence: counters, state transitions, audit fields, and a minimal black-box that can reproduce the last seconds before failure. Fault isolation must prevent a single-node storm from collapsing the entire domain.
Minimum counter set:
- error active / passive
- bus-off entry count + duration
- recovery attempts
- RX overflow + watermark peaks
- TX abort + backlog peaks
- drop burst histogram
- deny count by reason
- rule-id hits + version/hash
- quarantine entry/exit counts
- Detect: error-rate spikes, drop bursts, bus-off, loop/storm counters.
- Suspect: temporarily throttle diagnostics/logging, preserve control allowlist.
- Quarantine: isolate the suspected source; forward only the safe subset; keep full evidence logs.
- Recover: cooldown + stability gate; gradual re-enable prevents relapse storms.
Mode-transition log fields: state (Normal/Suspect/Quarantine/Recover) · entry reason · exit reason · duration · counters snapshot · rule-id context
Black-box record fields:
- timestamp (aligned to the latency contract)
- src bus / dst bus
- ID / DLC / flags
- action (forward/drop/remap)
- rule-id + rule version/hash
- queue depth snapshot (per class)
- drop/deny reason taxonomy
- quarantine state (if any)
Every deny/drop must be explainable by reason + counters + timestamps, enabling field triage without re-instrumentation.
- Ring buffer: retain the most recent N seconds or M events (placeholders).
- Tiered retention: control is preserved; diagnostics is limited; logging may be sampled or compressed.
- Trigger densification: bus-off, quarantine entry, and drop bursts increase capture density automatically.
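A minimal ring-buffer sketch with trigger densification: sampled capture in normal operation, full capture for a window after a trigger (bus-off, quarantine entry, drop burst). Capacity and the sampling ratio are placeholders.

```python
from collections import deque

class BlackBox:
    def __init__(self, capacity, sample_every=4):
        self.ring = deque(maxlen=capacity)   # oldest events fall out automatically
        self.sample_every = sample_every     # keep every Nth event when quiet
        self.seq = 0
        self.dense_until = -1.0

    def trigger(self, now, window_s):
        """Called on bus-off / quarantine entry / drop burst: densify capture."""
        self.dense_until = now + window_s

    def record(self, now, event):
        self.seq += 1
        if now <= self.dense_until or self.seq % self.sample_every == 0:
            self.ring.append((now, event))
```

The `deque(maxlen=...)` gives drop-oldest semantics for free, which matches the "recent window for replay" retention policy above.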
Safety & Security Hooks
A gateway becomes trustworthy through explicit hooks: safety-class protection (reserved service and no-modify), security policy gates (default deny, freshness, rate-limit), and auditable evidence (rule-id, reason, counters, timestamps).
- Covers: safety-class path contract, security policy gates, fault injection hooks, audit evidence fields, and measurable trade-offs.
- Not covered: ISO 26262 handbook content, cryptography deep dives, and full DoIP/UDS protocol tutorials.
Safety-class flows use isolated service resources: dedicated queue, reserved scheduling, and predictable p99 behavior under bursts. Evidence: safety queue peak, time-above-watermark, p99 latency.
Safety-class frames default to forward-only: filtering and protection are allowed; payload rewriting is prohibited unless explicitly justified and audited. Evidence: rule tags for “no-remap”, remap count = 0 for safety class.
When congestion triggers degraded mode, the allowed set is restricted to the safety allowlist, while diagnostics/logging are throttled or paused. Evidence: mode entry/exit logs with reason and duration.
- Drop: simulate loss to validate recovery and isolation.
- Delay: simulate scheduling/backpressure to validate p99 budgets.
- Replay: simulate re-injection to validate freshness and anti-replay policy.
- by traffic class (safety/diagnostic/logging)
- by rule-id / bus / time window
- bounded rate to avoid accidental “always broken” states
Pass criteria: under injected delay/drop, safety p99 latency < X and safety drop = 0; under replay injection, frames are denied with a freshness reason and audited.
Unruled traffic is denied by default. Allowed paths must be explicit and owned (rule-id + version/hash + owner tag). Evidence: deny reason taxonomy; rule version hash recorded.
Anti-replay is enforced via time or counter concepts: stale frames are denied and logged. The policy is implemented at the gate, not scattered in application code. Evidence: replay-detected count; stale-deny count.
Every allow/deny/drop must carry rule-id, reason, counters snapshot, and timestamp to enable field triage.
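A counter-based freshness gate can be sketched as follows, with the replay/stale evidence counters named above. The per-(bus, ID) monotonic counter scheme is an assumption; time-based freshness works analogously.

```python
class FreshnessGate:
    """Deny frames whose per-stream counter does not advance (replay or stale)."""
    def __init__(self):
        self.last_counter = {}          # per (bus, can_id): highest counter seen
        self.replay_detected = 0        # evidence counters
        self.stale_denied = 0

    def check(self, bus, can_id, counter):
        key = (bus, can_id)
        last = self.last_counter.get(key, -1)
        if counter <= last:
            if counter == last:
                self.replay_detected += 1   # exact re-injection
            else:
                self.stale_denied += 1      # older than an already-accepted frame
            return False                    # deny + audit at the gate
        self.last_counter[key] = counter
        return True
```

Placing this check at the gateway gate, rather than in each receiving application, keeps the policy centralized and the deny counters in one place.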
- False deny (mis-block): keep below X in validated diagnostic scenarios; all denies must be explainable.
- False allow (mis-pass): target 0 across restricted domain boundaries.
- Latency ceiling: safety p99 < X even under X% load + bursts; avoid heavy policy checks that inflate t_proc.
Engineering Checklist (Design → Bring-up → Production)
This checklist converts the architecture into executable gates. Each gate produces artifacts, counters, and pass criteria so that rule changes and load bursts remain explainable and repeatable.
Design:
- Latency contract: define t_ISR/t_proc/t_queue/t_wait measurement points and p99 targets (placeholders).
- Rule asset: src/dst/match/action/priority/rate/log + rule-id + version/hash.
- Safety class: isolated queue, reserved service, “no-remap” principle.
- Loop guard: block mirror rules; storm counter + quarantine logic defined.
- Watermarks: shaping thresholds + degraded mode behavior defined and auditable.
- Security minimum: default deny + allowlist + freshness + audit reason taxonomy.
Artifacts: rule table spec · rule version hash · latency budget sheet · mode transition spec · deny reason taxonomy
Bring-up:
- Observability: timestamps align to the latency contract; rule-id and queue snapshots are present.
- Baseline: establish p50/p95/p99 under idle, typical load, and diagnostic load.
- Burst reproduction: inject diagnostic/logging bursts; confirm shaping and isolation prevent coupling.
- Extreme load: run X% bus load + burst; verify safety p99 and safety drops remain within gate.
- Fault injection: drop/delay/replay tests produce expected deny reasons and state transitions.
Evidence: latency histograms · queue peak · token empty time · deny/drop reasons · quarantine entry/exit logs
Production:
- Audit completeness: allow/deny/drop coverage with reason + rule-id + version/hash.
- Black box: last N seconds ring buffer exportable; trigger densification on bus-off/quarantine/drop bursts.
- Storm statistics: quarantine rates, drop bursts, and false-deny metrics are tracked and reviewed.
- Regression gates: rule changes must pass typical, burst, and injection suites before rollout.
- Traceability: deployed ruleset checksum recorded for every build and for field reports.
Overall gate: under X% load + burst, safety drop = 0 and safety p99 < X. Non-safety drops are explainable. Rule changes are fully traceable (version + checksum) and auditable in the field.
Applications (Pattern Library — Bridge/Controller View)
Each bucket is described only by bridge/controller differences: determinism, filtering/routing, logging/diagnostics, and security hooks. PHY, EMC, and protocol tutorials are intentionally out of scope.
Examples below are representative. Always verify automotive grade (AEC-Q/ASIL context), CAN/CAN FD features (and TTCAN/time-trigger support if required), package/suffix, and long-term availability.
Safety-critical control domains (powertrain/chassis):
The dominant constraint is deterministic behavior under bursts: periodic control traffic must remain stable while diagnostics/logging is shaped and audited.
- Determinism: safety-class isolated queue + reserved service; p99 latency gate under diagnostic bursts.
- Filtering: allowlist-by-default across domains; loop/storm prevention with quarantine transitions.
- Logging: black-box ring buffer with reason codes (rule-id, drop/deny reason, queue snapshot).
- Security hooks: default deny + freshness concept + audit coverage (100% allow/deny/drop).
Representative parts:
- Infineon AURIX TC3xx (e.g., TC397) — multi-CAN, strong determinism ecosystem.
- NXP S32K3 (e.g., S32K344) — CAN FD heavy MCU class for ECUs.
- Renesas RH850 (e.g., RH850/U2A) — automotive MCU with robust comms/peripherals.
- Texas Instruments TMS570 (e.g., TMS570LS1224) — safety-oriented MCU class (check CAN feature set per variant).
Note: TTCAN/time-trigger capability is variant-dependent; confirm in the controller module feature table.
Body/comfort networks:
Node count is high and traffic is heterogeneous. The gateway value is rule scalability (filter/remap), shaped diagnostics, and serviceable logging under low-power policies.
- Determinism: explainable congestion behavior is more important than ultra-low jitter.
- Filtering: layered filtering (coarse → fine → policy) to avoid rule explosion and loops.
- Logging: event-trigger snapshots (wake/storm/drop bursts) to reproduce intermittent field issues.
- Security hooks: default deny + quotas for diagnostics/OTA/logging to control attack surface.
Representative parts:
- NXP S32K1 (e.g., S32K144) — body ECU MCU class (feature set varies).
- ST SPC58 (e.g., SPC58EC) — automotive MCU family used in body domains.
- Microchip MCP2517FD — external CAN FD controller (SPI) for channel expansion/offload.
- Microchip MCP2518FD — external CAN FD controller (SPI) alternative/variant class.
External controllers are often used to add channels or isolate timing/IRQ load; host interface buffering still needs budget gates.
Central gateway / domain controller:
The main problem is not forwarding, but isolation and auditability: diagnostics/OTA traffic must be rate-limited and fully attributable without degrading control traffic.
- Plane split: control plane vs diagnostic/logging plane with hard queue isolation.
- Rules: explicit src/dst/match/action with rule-id + version hash; deny reasons are mandatory.
- Rate shaping: token bucket/quota to prevent diagnostic bursts from starving control flows.
- Audit: 100% allow/deny/drop coverage with counters + timestamps + queue snapshots.
Representative parts:
- NXP S32G2 (e.g., S32G274A) — automotive gateway processor class.
- NXP S32G3 (e.g., S32G399A) — higher gateway performance class.
- Texas Instruments Jacinto (e.g., TDA4VM) — domain controller class (verify gateway feature mix).
- Renesas R-Car (e.g., R-Car V3H) — high-integration domain controller class (verify network peripheral mix).
DoIP/OTA details are intentionally not expanded here; only the bridge isolation/audit requirements are modeled.
Small distributed nodes (sensor/actuator subnets):
Small nodes require predictable shaping at the gateway. The gateway must provide service logs with stable reason codes to make intermittent failures diagnosable.
- Shaping: smooth bursts from many small nodes; keep control flows stable.
- Isolation: quarantine stormy endpoints to protect the rest of the bus.
- Service logs: drop/deny must carry reasons, rule-id, and queue snapshots.
- Security hooks: minimal policy gates at the bridge rather than scattered node firmware logic.
Representative parts:
- NXP S32K118 — compact MCU class (verify CAN feature set per SKU).
- ST SPC560 (e.g., SPC560B family) — compact automotive MCU family class.
- TI TCAN4550 — external CAN FD controller + integrated transceiver (PHY details out of scope; useful for integration density).
IC Selection Logic (Controller/Bridge View)
Select among three architectures: MCU-only, MCU + external CAN controller, or gateway SoC/domain controller. The decision is driven by throughput, determinism, filtering power, diagnostics evidence, and security hooks.
- Multi-bus scale: number of CAN/CAN FD channels and concurrent traffic (peak bursts included).
- Determinism: required safety p99 latency/jitter gate (placeholder threshold).
- Diagnostics strength: whether field triage needs black-box + reason codes + full audit.
- Security boundary: default deny + allowlist + rate-limit + freshness concepts required.
Throughput is end-to-end service capacity under bursts: RX/TX FIFO depth, DMA support, IRQ load, and queue isolation determine whether drop/overflow happens. Evidence: RX overflow, TX abort, queue peak, drop burst count.
Determinism depends on predictable scheduling and timestamps: p99 latency gates, time semantics (if required), and isolated service for safety-class flows. Evidence: p99/p95 histograms, time-in-queue, mode entry/exit logs.
Filtering power is not only “how many filters”, but how maintainable the rule asset is: layered filters (coarse → fine → policy), update cost, and loop prevention. Evidence: per-rule hit counts, deny/drop reason taxonomy, rule version hash.
Field serviceability requires explainable events: every allow/deny/drop must be attributable (ts, rule-id, reason, queue snapshot), backed by a black-box ring buffer. Evidence: audit coverage, ring-buffer export, quarantine transitions with reasons.
Security hooks are enforceable gates: default deny, allowlist, rate-limit, freshness concept, and mandatory audit. Prefer centralized gates over scattered application logic. Evidence: deny reasons, replay/stale counters, quotas triggered, ruleset checksum.
MCU-only: best when channel count and rule complexity are moderate and p99 gates can be met with careful queue isolation and observability.
MCU + external CAN controller: used to expand channels or offload message objects/filtering/timestamping. Budget the host interface (SPI) and ensure timestamp/queue evidence remains consistent.
TCAN4550 includes an integrated transceiver; PHY/EMC details remain out of scope for this page.
Gateway SoC / domain controller: preferred for multi-domain aggregation with strong audit/security boundaries. Requires strict plane split (control vs diag/log), rule asset management, and p99 budgets.
FAQs (Controller/Bridge Debug — Fixed 4-line Answers)
Scope: controller/bridge/scheduling/filtering/latency/diagnostics only. Each answer is data-driven and anchored to counters, logs, and pass gates.
Bus load is low but frames drop — check RX FIFO or ISR jitter first?
Likely cause: RX FIFO / host queue overflow triggered by ISR entry jitter, IRQ masking, or insufficient service rate under bursts.
Quick check: Correlate rx_overflow, queue_depth_peak, and isr_latency_p99 within the drop time window.
Fix: Increase FIFO depth/watermarks, enable/optimize DMA, reduce IRQ masking, and isolate control vs diagnostic queues.
Pass criteria: rx_overflow=0 and isr_latency_p99 < X µs for Y minutes under the same traffic replay.
Periodic messages are delayed only during diagnostic/OTA bursts — priority issue or missing shaping?
Likely cause: Diagnostic bursts share the same queue/service budget and starve periodic control flows (no plane split or quota).
Quick check: Compare control-flow p99_latency and queue_depth_peak when diag_burst_rate spikes; inspect token_empty_time if shaping exists.
Fix: Hard-split control vs diagnostic planes, enforce token bucket/quotas for diagnostics, reserve service for safety-class flows.
Pass criteria: Control p99_latency < X ms while diagnostics runs at Z msgs/s, and drop_count(control)=0.
After bridging, frames occasionally arrive out of order — how to validate merge ordering?
Likely cause: Multi-queue merge without stable ordering (timestamp domain mismatch or per-queue dequeue bias).
Quick check: Log ts_ingress and ts_egress per (src_bus, ID); compute reorder_count for the same stream.
Fix: Enforce per-stream FIFO ordering, unify timestamp source, or add sequence tags through the bridge pipeline.
Pass criteria: reorder_count=0 for Y frames at X% bus load.
Same rule table, new firmware makes latency drift — what to align first?
Likely cause: Timestamp capture point changed, rule-eval cost shifted, or counters/window definitions differ between builds.
Quick check: Verify ts_domain (ingress vs post-filter), compare t_proc proxy (rule-eval time), and ensure identical replay/time-window settings.
Fix: Standardize measurement points, log rule_ver + checksum, and run regression against the same traffic capture.
Pass criteria: New build matches baseline within ±X% on p99_latency and queue_depth_peak under identical replay.
CAN↔CAN bridging creates a storm — how to prove it’s a loop (not a noisy node)?
Likely cause: Forwarding loop (A→B and B→A) or mirrored rules reflecting traffic back, creating a storm signature.
Quick check: Detect repeated (ID + payload hash) crossing both buses with near-zero inter-arrival; track loop_detect_count and top rule_id hits.
Fix: Add loop-prevention tags/TTL, enforce one-way rules, and quarantine endpoints on storm signatures.
Pass criteria: loop_detect_count=0 and storm-induced drop_burst < X per Y minutes.
TTCAN startup jitter is high for a few seconds — sync settling or guard time too small?
Likely cause: Reference time acquisition is still settling, or schedule guard time is under-budgeted during the start phase.
Quick check: Plot jitter vs time since power-up; check window_miss_count and slot-boundary violations during the first T seconds.
Fix: Increase startup guard time, delay enabling non-critical traffic, and require sync lock before full schedule activation.
Pass criteria: After T seconds, window_miss_count=0 and periodic jitter p99 < X µs.
Diagnostic timeouts happen but CAN waveforms look fine — check DoIP→CAN backpressure first?
Likely cause: Diagnostic plane backpressure/queue coupling (DoIP ingress > CAN service), not a PHY-layer problem.
Quick check: Inspect diag queue_depth_peak, token_empty_time, and timeout-aligned drop_reason logs (reason + rule_id + q_depth snapshot).
Fix: Rate-limit DoIP requests, enforce per-target quotas, and hard-split control/diag queues with caps and priority.
Pass criteria: Timeout rate < X/hour and diag queue_depth_peak < Y under the same tester script.
After a rule update, legitimate traffic is blocked — how to do canary/version check/rollback?
Likely cause: Rule mismatch (mask/action), inconsistent ruleset distribution, or non-atomic update without version pinning.
Quick check: Verify rule_ver and ruleset_checksum on all nodes; compare deny hits grouped by rule_id before/after rollout.
Fix: Canary rollout + atomic swap, maintain last-known-good ruleset, and enable one-click rollback keyed by checksum.
Pass criteria: False-deny rate < X% during canary, and rollback restores baseline within T minutes.
After bus-off recovery, the gateway “self-excites” — how to decouple recovery from forwarding?
Likely cause: Recovery triggers burst retransmissions or rule-state resets, feeding back into forwarding queues and amplifying traffic.
Quick check: Align bus_off_count/recovery_events with tx_rate, tx_abort, and queue spikes (queue_depth_peak).
Fix: Gate forwarding during recovery, ramp TX rate, and keep ruleset stable across recovery transitions.
Pass criteria: Post-recovery tx_rate < X msgs/s and control p99_latency < Y ms with no sustained queue saturation.
Logs show drops, but field capture can’t reproduce — which “drop reason” field is missing?
Likely cause: Drops are recorded without the causal dimension (queue full vs rate-limit vs policy deny vs quarantine/loop guard).
Quick check: Ensure every drop has reason + rule_id + q_depth + state captured at decision time.
Fix: Standardize a reason taxonomy (QFULL/RATE/POLICY/QUAR/LOOP) and log it consistently across all drop paths.
Pass criteria: Drop attribution coverage = 100% (no “unknown”), and replay using the same reasons reproduces the drop signature.
Multi-channel CAN FD pushes CPU high — how to tell DMA vs IRQ bottleneck fast?
Likely cause: IRQ storm (small batch size) or DMA/memory contention causing long service gaps and FIFO pressure.
Quick check: Compare irq_rate, isr_latency_p99, DMA completion latency, and RX watermark behavior (rx_fifo_highwater).
Fix: Increase batching, prioritize DMA paths, reduce per-frame IRQs, and isolate bridge tasks to avoid cache/priority inversion.
Pass criteria: CPU headroom > X% and rx_overflow=0 at Y% bus load with Z channels concurrent.
Allowlist enabled, functions fail intermittently — allowlist miss or rate-limit collateral damage?
Likely cause: Legitimate frames are blocked by missing allowlist entries or shaped away by quotas/rate-limit (false deny).
Quick check: Group denies by reason (POLICY vs RATE) and top-hit rule_id; validate rule coverage for required IDs and rates.
Fix: Patch allowlist coverage, add per-function quotas, and reserve service/priority for safety-class/control messages.
Pass criteria: False-deny rate < X% and control drop=0 under worst-case diagnostic burst + normal operation.