
Diagnostics / Gateway / TCU: Multi-bus, DoIP, OTA, Security


A Diagnostics/Gateway/TCU is the system “traffic + policy” core that bridges multi-bus vehicle networks to Ethernet/external services, while keeping diagnostics and OTA stable, recoverable, and secure.

This page provides an engineering path from architecture and bridging rules to serviceability logs, OTA state machines, security enforcement, and measurable pass criteria—so failures are contained and systems remain operable in real vehicles.

Definition & Boundary: What “Diagnostics / Gateway / TCU” Means Here

This page defines a system-level gateway/TCU as a network boundary node that bridges in-vehicle buses to Automotive Ethernet and external services, while enforcing diagnostic/OTA/security policies and producing service-grade observability.

Role separation (no PHY deep-dive)
Gateway
  • Traffic boundary: filtering, rate limiting, prioritization, and fault containment.
  • Policy execution point: access control, routing rules, and session gates.
  • Serviceability anchor: consistent logging, counters, and traceability IDs.
Domain Controller
  • Domain compute: coordinates domain functions (body/chassis/powertrain/infotainment).
  • May host partial gateway functions, but priority is domain feature execution.
  • Interfaces are treated as abstract ports; physical-layer tuning belongs to bus-specific pages.
TCU
  • External termination: cellular/Wi-Fi/VPN/TLS endpoints and cloud connectivity.
  • OTA & remote diagnostics orchestration with recoverable state machines.
  • Often doubles as the secure gateway boundary for access and update authorization.
Page I/O contract (what enters, what must exit)
  • Inputs: in-vehicle frames (CAN/LIN/FlexRay/Ethernet), diagnostic sessions (DoIP/service tool), OTA campaigns, security policies, timing base.
  • Outputs: policy-compliant forwarding/bridging, bounded latency/throughput behavior, auditable security events, traceable diagnostic logs, OTA state transitions with recovery.
  • Deliverables: architecture boundary map, rule tables (filter/limit/route), minimal observability schema (fields + counters), verification hooks.
Boundary ledger (anti-overlap guard)
This page covers
  • Multi-bus to Ethernet bridging logic (filter/limit/queue/fault containment).
  • DoIP/diagnostic path engineering (session gates, address mapping, serviceability logging).
  • OTA lifecycle reliability (state machine, rollback, dependency control, power-loss recovery).
  • Secure gateway integration (trust chain, keys, policy enforcement, auditability).
This page does NOT cover (use sibling pages)
  • Physical-layer and EMC deep dives: waveform shaping, termination, CMC/TVS component tuning, and CMTI details belong on the bus-specific and PHY/EMC pages.
Linking rule: if the topic requires waveform/termination/CMC/TVS/CMTI deep details, provide one sentence of context and link out—do not expand inside this page.
System map (Zone → Backbone → External)
Figure: Zones (Body / Chassis / Powertrain / Infotainment) → Automotive Ethernet backbone (routing & policy, logging & metrics) → External (cloud / OTA, service tool). The secure edge GW/TCU sits at the boundary with CAN / LIN / FlexRay / Eth ports, handling DoIP · OTA · policy · logs and TLS/VPN · auth · audit.

Diagram intent: show the gateway/TCU as the boundary control point between zonal buses, Ethernet backbone, and external diagnostics/OTA services.

System Architecture: Data Plane / Control Plane / Management Plane

A robust gateway/TCU architecture separates fast-path forwarding (data plane), decision-making and policy (control plane), and long-horizon operations (management plane). This prevents diagnostic/OTA/security features from destabilizing real-time traffic.

Plane responsibilities (engineering-grade split)
Data plane
  • Ingress parsing → classification → policy match → queueing → scheduling → egress shaping.
  • Hard requirements: bounded latency, bounded queue growth, and controlled loss under congestion.
  • Fault containment: isolate noisy ports/sessions before they impact the rest of the vehicle network.
Control plane
  • Diagnostic sessions: admission control, authorization levels, and timeout policies.
  • Routing & filtering rules: versioned distribution, rollback, and safe defaults on mismatch.
  • Error handling: degrade modes, reset boundaries, and safe recovery sequencing.
Management plane
  • Observability: structured logs, counters, traces, and health snapshots for field triage.
  • Configuration & versions: rule packs, certificates, OTA campaigns, and policy baselines.
  • Timebase consistency: unified timestamp source for auditability and cross-ECU correlation.
Data plane: decisions that prevent “diagnostics storms”
  • Classification keys: bus type, source ECU, service class (control / diagnostics / OTA / logging), and safety criticality.
  • Queue model: dedicate queues per service class; protect real-time control from diagnostic bursts via strict priority or minimum service.
  • Rate limiting: enforce per-session/per-source caps; apply backoff to misbehaving testers to avoid starvation and watchdog resets.
  • Congestion policy: define drop order (e.g., bulk logs before control), and record drops with reasons for field analysis.
  • Fault containment: circuit-breaker rules for repeated timeouts/resets; isolate the port rather than rebooting the entire gateway.
Artifact: maintain a “Bridge Policy Table” (Ingress → Class → Action → Queue → Limit → Exception).
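The row-level decision above can be sketched as a first-match lookup with a default-deny fallback. The table contents, field names, and reason codes below are illustrative assumptions, not a production schema:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PolicyRow:
    ingress: str      # ingress port or bus identifier
    svc_class: str    # control / diagnostics / ota / logs
    action: str       # forward / shape / proxy / drop
    queue: str        # target queue id
    reason_code: str  # stable code emitted with every decision

# Hypothetical Bridge Policy Table: first matching row wins.
POLICY_TABLE = [
    PolicyRow("can_a", "control",     "forward", "Q1", "RC_CTRL_FWD"),
    PolicyRow("doip",  "diagnostics", "shape",   "Q2", "RC_DIAG_SHAPE"),
    PolicyRow("cloud", "ota",         "shape",   "Q3", "RC_OTA_SHAPE"),
]
# Safe default: unknown traffic is denied (never broadcast) and audited.
DEFAULT_ROW = PolicyRow("*", "*", "drop", "-", "RC_DEFAULT_DENY")

def match_policy(ingress: str, svc_class: str) -> PolicyRow:
    """Attribute a frame to exactly one policy row."""
    for row in POLICY_TABLE:
        if row.ingress == ingress and row.svc_class == svc_class:
            return row
    return DEFAULT_ROW
```

Because every decision resolves to exactly one row, the emitted reason code is always attributable, which is the property the serviceability logging relies on.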
Control & management: minimum fields for traceability

Diagnostic/OTA incidents are rarely reproducible without a consistent schema. Define a minimal field set and enforce it across sessions.

  • Session: session_id, tester_id, auth_level, start/stop timestamp, timeout reason.
  • Routing: rule_pack_version, mapping_id, source_port, dest_port, action (forward/drop/shape).
  • Performance: queue_id, queue_depth_peak, drop_count, p99_latency (X ms), throughput (X Mbps).
  • Security: cert_version, key_id, policy_decision, violation_type, audit_id.
  • Reliability: reboot_cause, watchdog_stage, power_event_flag, recovery_state.
Pass criteria placeholders: p99 latency ≤ X ms, drop rate ≤ X/1k frames, false rejects ≤ X/day, recovery success ≥ X%.
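One way to enforce the minimal field set is to reject log records that omit required fields at emission time. This sketch assumes a hypothetical JSON event format, with field names taken from the list above:

```python
import json, time

# Required schema fields for a session event (subset of the lists above).
REQUIRED_FIELDS = {"session_id", "tester_id", "auth_level",
                   "rule_pack_version", "reason_code", "timestamp"}

def make_session_event(**fields) -> str:
    """Serialize a diagnostic session event; reject schema violations early."""
    missing = REQUIRED_FIELDS - fields.keys()
    if missing:
        raise ValueError(f"missing required log fields: {sorted(missing)}")
    return json.dumps(fields, sort_keys=True)

# Hypothetical usage: every session timeout emits a complete record.
event = make_session_event(
    session_id="S-0001", tester_id="T-42", auth_level="read",
    rule_pack_version="rp-1.3.0", reason_code="RC_TIMEOUT",
    timestamp=time.time())
```

Enforcing the schema at the producer keeps field incidents correlatable across sessions instead of discovering missing fields during triage.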
Cross-page guard (keep interfaces abstract)
  • Allowed: “port type, throughput class, error model, wake/sleep impact, diagnostics session behavior”.
  • Forbidden: “sample point, termination components, SIC waveform symmetry, TVS/CMC parasitic tuning”.
  • Action: when forbidden terms appear, provide a one-line context and link to the dedicated PHY/EMC page.
Three-plane architecture (fast path + policy + operations)
Figure: Ingress (CAN / CAN FD, LIN, FlexRay, Ethernet) → Gateway/TCU stack (security domain) with Data Plane (Classify · Queue · Schedule), Control Plane (Session · ACL · Policy), and Management Plane (Logs · Metrics · Versions) → Egress (Ethernet backbone, DoIP service, OTA cloud, audit). The fast path is isolated from sessions and operations: queueing and limits protect control traffic from diagnostics/OTA bursts.

Diagram intent: enforce a clean split—data plane stays deterministic; control plane decides; management plane records and operates.

Multi-bus ↔ Ethernet Bridging Fundamentals

Bridging is a data-path engineering problem: classify traffic, apply policy, protect control flows with queueing and limits, and contain faults so one noisy endpoint cannot destabilize the entire vehicle network.

Bridge modes: L2 forwarding vs message routing vs proxy termination
L2 forwarding
  • Fit: Ethernet-to-Ethernet segments where L2 boundaries are explicitly controlled.
  • Risk: broadcast/unknown-unicast storms and uncontrolled fan-out under misconfiguration.
  • Rule: require storm control and strict isolation policies; do not rely on “best effort” forwarding.
Message routing
  • Fit: CAN/LIN/FlexRay ↔ Ethernet where traffic is mapped by ID/address/service class.
  • Risk: rule-table growth, rule conflicts, and “works on bench, fails in field” version drift.
  • Rule: version rule packs and default-safe behavior on mismatch (deny/shape + audit).
Proxy termination
  • Fit: diagnostics/OTA/security where the gateway must enforce authorization and produce audit trails.
  • Risk: state explosion and resource exhaustion if proxy logic leaks into the fast path.
  • Rule: keep proxy decisions in the control plane; keep the data plane deterministic.
Filtering & rate limiting: prevent storms, false diagnostics, and DoS-like overload
  • Filter keys: port, source ECU, service class (control / diagnostics / OTA / logs), and session identity.
  • Admission control: bound concurrent diagnostic sessions; reject excess sessions with a reason code and audit log.
  • Rate caps: apply per-session and per-source limits; add backoff when repeated timeouts indicate retry storms.
  • Congestion policy: define drop order that protects control traffic; record drop reason for field triage.
  • Safe defaults: unknown traffic is shaped or denied (never broadcasted) and always audited.
Verification placeholders: storm trigger time ≤ X s, queue depth peak ≤ X, drop rate ≤ X/1k frames, session reject is always logged.
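Per-session caps of this kind are commonly implemented as token buckets, which bound the sustained rate while tolerating short bursts. The parameters below are arbitrary illustrations:

```python
# Hypothetical per-session token-bucket limiter: caps sustained rate while
# allowing short bursts; a misbehaving session simply runs out of tokens.
class TokenBucket:
    def __init__(self, rate_per_s: float, burst: float):
        self.rate = rate_per_s     # sustained refill rate (frames/s)
        self.capacity = burst      # maximum burst size
        self.tokens = burst
        self.last = 0.0

    def allow(self, now: float) -> bool:
        # Refill proportionally to elapsed time, capped at burst size.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False  # caller emits the drop reason code and audit entry
```

A denied `allow()` call is the point where the bridge would record the drop with its reason, keeping throttling explainable rather than silent.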
Queueing: protect control flows while keeping diagnostics explainable
  • Service-class queues: Control / Diagnostics / OTA / Logs as the primary split.
  • Fairness knobs: per-ECU or per-session sub-queues to prevent single-source starvation.
  • Scheduling: strict priority or minimum-service guarantees for control; diagnostics are shaped, not “randomly dropped”.
  • Explainability: every throttle/drop should map to a policy rule and emit a reason code.
Pass criteria placeholders: control traffic p99 latency ≤ X ms under full diagnostic load; diagnostics p99 latency ≤ X ms in service mode.
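A minimal sketch of "strict priority with a minimum-service guarantee", assuming just two classes (control and diagnostics) and a hypothetical 1-in-N diagnostics slot:

```python
from collections import deque

# Hypothetical two-class scheduler: control is strictly prioritized, but
# diagnostics gets a minimum service share, so it is shaped, not starved.
class ClassScheduler:
    def __init__(self, min_diag_every: int = 4):
        self.queues = {"control": deque(), "diagnostics": deque()}
        self.min_diag_every = min_diag_every  # serve diagnostics 1-in-N
        self._since_diag = 0

    def enqueue(self, svc_class: str, frame):
        self.queues[svc_class].append(frame)

    def dequeue(self):
        ctrl, diag = self.queues["control"], self.queues["diagnostics"]
        # Minimum-service guarantee: force a diagnostics slot periodically.
        if diag and self._since_diag >= self.min_diag_every:
            self._since_diag = 0
            return diag.popleft()
        if ctrl:                       # strict priority otherwise
            self._since_diag += 1
            return ctrl.popleft()
        if diag:
            self._since_diag = 0
            return diag.popleft()
        return None
```

Under full control load, diagnostics still receives one slot in N, which keeps service tools responsive without letting a tester burst starve control traffic.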
Fault containment: isolate, circuit-break, and degrade instead of rebooting everything
  • Isolation domains: by port, by session, and by service class (preferred for service stability).
  • Circuit breaker states: Closed → Open → Half-open; transitions require explicit reasons and timers.
  • Degrade modes: keep control + restrict diagnostics + pause OTA; or service-only mode for recovery.
  • Auditability: every isolation event produces an audit_id and correlates with queue and session metrics.
Recovery placeholders: isolation duration ≤ X s, half-open success ≥ X%, system-level reboot rate ≤ X/day.
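The Closed → Open → Half-open containment logic can be sketched as a small state machine. Thresholds, timers, and reason strings below are placeholders:

```python
# Hypothetical port circuit breaker: Closed -> Open -> Half-open, with
# explicit reasons and timers, so a noisy port is isolated instead of
# rebooting the whole gateway.
class CircuitBreaker:
    def __init__(self, fail_threshold: int, open_secs: float):
        self.state = "closed"
        self.failures = 0
        self.fail_threshold = fail_threshold
        self.open_secs = open_secs
        self.opened_at = None
        self.last_reason = None

    def record(self, ok: bool, now: float, reason: str = ""):
        if self.state == "open" and now - self.opened_at >= self.open_secs:
            self.state = "half_open"                 # probe traffic allowed
        if ok:
            if self.state == "half_open":
                self.state, self.failures = "closed", 0  # recovery confirmed
            elif self.state == "closed":
                self.failures = 0
        else:
            self.failures += 1
            self.last_reason = reason
            if self.state == "half_open" or self.failures >= self.fail_threshold:
                self.state, self.opened_at = "open", now  # isolate the port

    def allows_traffic(self) -> bool:
        return self.state != "open"
```

Every transition carries a reason and a timestamp, so isolation events can be correlated with the queue and session metrics required by the audit rules above.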
Output artifact: Bridge Policy Table (Ingress → Class → Action → Exception)
Ingress | Class | Match Keys | Action | Queue & Rate | Exception | Audit
CAN Port A | Control | ECU / ID range | Forward | Q1 · min service | Service mode | rule_pack_version
Tester / DoIP | Diagnostics | session_id / auth_level | Proxy / Shape | Q2 · cap X | Factory mode | reason_code
Cloud / OTA | OTA | campaign_id / version | Shape / Pause | Q3 · cap X | Degrade mode | audit_id

Implementation rule: every forward/drop/shape decision must be attributable to a single policy row and emit a stable reason code.

Typical bridging path (CAN → policy → queue/limit → Ethernet)
Figure: Ingress (CAN, LIN, FlexRay, Ethernet) → bridge pipeline: Parse → Classify (service / ECU) → Policy match (action · reason) → Queues (Q1 / Q2 / Q3) → Limit / circuit-breaker open → Egress (Ethernet backbone, DoIP service, OTA cloud). Key points: classify by service/ECU, apply policy with reason codes, queue by class, limit per session, and isolate on repeated faults.

Diagram intent: show where decisions happen (policy), where protection happens (queues/limits), and where containment happens (circuit breaker).

Diagnostics Path: DoIP Session, Addressing, and Serviceability

Stable diagnostics requires three things: a deterministic admission gate, a versioned address-mapping contract, and a minimal logging schema that makes field failures explainable without reproducing the exact harness setup.

End-to-end chain (Tester → DoIP → Gateway → Target ECU) and first failure checks
  • Session overload: too many concurrent sessions or retries can starve control traffic; check admission counters and queue watermarks first.
  • Mapping drift: address-table version mismatch can look like random timeouts; verify mapping_id and rule_pack_version alignment.
  • Policy mismatch: an auth-level downgrade may cause silent rejects; require explicit reason codes and audit IDs.
  • Resource collapse: CPU/memory spikes or encryption overhead can trigger watchdog resets; correlate session events with health logs.
Fast triage order: admission/queues → mapping/version → auth gate → system health.
Address mapping: an engineering contract (not a spec reprint)
  • Goal: translate DoIP-side logical addressing into stable target identities without ambiguity.
  • Versioning: every mapping must carry mapping_id and rule_pack_version for field correlation and rollback.
  • Conflicts: duplicates and gaps must resolve to safe defaults (reject/shape + audit) rather than “best-effort forward”.
  • Self-check: validate coverage, uniqueness, and default actions before enabling service mode in production.
Artifact: “Mapping Pack” = {mapping_id, rule_pack_version, default_action, audit_id policy}.
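The self-check step can be automated before service mode is enabled. This sketch assumes a hypothetical Mapping Pack dictionary whose keys match the artifact fields above:

```python
# Hypothetical Mapping Pack validator: checks coverage, uniqueness, and a
# safe default action before service mode is enabled in production.
def validate_mapping_pack(pack: dict, required_targets: set) -> list:
    """Return a list of violations; an empty list means the pack is usable."""
    violations = []
    mappings = pack.get("mappings", {})        # logical_addr -> target id
    if pack.get("default_action") not in ("reject", "shape"):
        violations.append("default_action must be reject or shape")
    targets = list(mappings.values())
    if len(targets) != len(set(targets)):
        violations.append("duplicate target identities")
    missing = required_targets - set(targets)
    if missing:
        violations.append(f"uncovered targets: {sorted(missing)}")
    for key in ("mapping_id", "rule_pack_version"):
        if key not in pack:
            violations.append(f"missing version field: {key}")
    return violations
```

Running this check at rule-pack load time turns "works on bench, fails in field" version drift into an explicit, logged rejection.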
Authentication gate: read vs write vs programming must be explicit and audited
  • Read-only: lowest risk, still requires session identity and rate limits.
  • Write: requires elevated auth_level and strict per-target quotas; reject is never silent.
  • Programming/flash: highest risk; enforce strong authorization, maintenance mode constraints, and mandatory audit trails.
  • Reject behavior: always return a stable reason_code and record an audit_id with timestamps and latency.
Pass criteria placeholders: false rejects ≤ X/day, unauthorized writes = 0, programming attempts always logged with audit_id.
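A gate of this shape makes reject behavior explicit: every decision returns a stable reason code and appends an audit record. The auth level names, reason codes, and audit sink are illustrative assumptions:

```python
# Hypothetical auth gate: read/write/flash map to minimum auth levels, and
# every reject returns a stable reason_code plus an audit entry.
AUTH_RANK = {"none": 0, "read": 1, "write": 2, "programming": 3}
REQUIRED = {"read": "read", "write": "write", "flash": "programming"}

audit_log = []  # stand-in for the gateway's structured audit sink

def gate(op: str, session_auth: str, session_id: str) -> tuple:
    """Return (allowed, reason_code); rejects are never silent."""
    needed = REQUIRED.get(op)
    if needed is None:
        decision = (False, "RC_UNKNOWN_OP")
    elif AUTH_RANK[session_auth] >= AUTH_RANK[needed]:
        decision = (True, "RC_OK")
    else:
        decision = (False, "RC_AUTH_TOO_LOW")
    audit_log.append({"session_id": session_id, "op": op,
                      "auth_level": session_auth,
                      "reason_code": decision[1]})
    return decision
```

Because the audit append happens before the return, even allowed operations leave a trail, which is what makes programming attempts always traceable.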
Serviceability: minimal logging fields that make failures explainable
  • Time: timestamp (unified timebase), duration/latency_ms.
  • Session: session_id, tester_id, auth_level, start/stop reason.
  • Target: target_ecu, logical_addr, physical_addr (or stable target ID).
  • Operation: service_id, payload_len, status (ok/fail/timeout/reject), reason_code.
  • Resources: queue_id, queue_depth, drop_count_delta, throttle_events.
  • Versions: rule_pack_version, mapping_id, cert_version (if applicable).
Rule: every timeout/reject must be attributable to one gate (admission / mapping / auth / resource) and must emit a reason code.
DoIP session flow (state + auth gate + audit points)
Figure: Tester (DoIP client) → Gateway (DoIP server) → Target ECU (diagnostic service). Session flow: Idle → Session up → Auth Gate → Read / Write / Flash, with an audit point at each operation. Key points: versioned mapping, explicit auth levels, admission limits, and audit IDs for every reject/timeout.

Diagram intent: show the auth gate as a mandatory step and mark audit points that enable field debugging without reproducing the entire setup.

OTA Lifecycle: State Machine, Rollback, and Dependency Control

Automotive-grade OTA requires recoverability. The lifecycle must define durable state per step, strict activation gates, and rollback behavior that keeps the vehicle serviceable under power loss, weak links, or reboots.

Layered OTA model: Campaign → Download → Verify → Install → Activate → Confirm
  • Campaign: define target set, allowed windows, rollout groups, and dependency rules as a versioned contract.
  • Download: chunked transfer with resume; rate caps protect control traffic under weak links.
  • Verify: signature and hash checks; reject on mismatch with explicit reason codes and audit IDs.
  • Install: write to staging/A-B slot; keep the current image untouched until activation is safe.
  • Activate: switch pointers/slots via an atomic flag; ensure a deterministic boot path.
  • Confirm: commit only after health signals pass; otherwise trigger rollback or safe service mode.
Principle: activation is separated from installation; confirmation is the only commit point.
Recoverable state machine: power loss, reboots, and weak links
  • Durable progress: each step persists minimal fields to resume or roll back without guessing.
  • Atomic boundaries: define where interruption is allowed (download, staged install) vs guarded (activate switch).
  • Restart rules: on reboot, recover from the last durable state and follow deterministic transitions.
  • Failure semantics: verification failures never activate; install failures keep the old image bootable.
Pass criteria placeholders: reboot-resume success ≥ X%, activation switch is always atomic, bricks = 0 over X cycles.
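The restart rule above ("recover from the last durable state, deterministic transitions") can be sketched as a pure lookup on durable fields. The state names follow the lifecycle described in this section, while the field shapes are assumptions:

```python
# Hypothetical reboot-recovery routine: the next state after a power loss is
# derived only from durable fields, never from heuristics.
SAFE_NEXT = {
    "download": "download",   # resume from received_ranges
    "verify":   "verify",     # re-run checks; verification is idempotent
    "install":  "install",    # staging slot is re-writable; old image intact
    "activate": "confirm",    # atomic flag already switched: boot new image
    "confirm":  "confirm",    # still inside the confirm window
}

def recover_after_reboot(durable: dict) -> str:
    state = durable.get("state")
    if state == "activate" and not durable.get("activation_flag"):
        return "install"      # switch never committed: old slot boots
    if state == "confirm" and durable.get("confirm_result") == "fail":
        return "rollback"
    return SAFE_NEXT.get(state, "idle")  # unknown state falls to safe idle
```

The key property is that the function is total: every combination of durable fields maps to exactly one next state, so no reboot leaves the OTA engine guessing.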
A/B (dual image) rollback: triggers, window, and non-rollback exceptions
  • Triggers: boot-loop counters, health-check failure, missing critical services, or explicit negative confirmation.
  • Confirm window: commit only after stable operation across defined cycles/time; otherwise rollback automatically.
  • Non-rollback cases: if rollback is disallowed (e.g., mandatory security update), fail into a restricted safe mode with full diagnostics.
  • Auditability: every rollback records audit_id, reason_code, and the last known durable state.
Recovery placeholders: rollback completion ≤ X s, safe-mode entry ≤ X s, post-rollback serviceability preserved.
Multi-ECU dependency control: version matrix, ordering, and damage containment
  • Version matrix: define compatible sets and minimum versions; reject activation when dependencies are not satisfied.
  • Ordering: enforce explicit sequences per domain role (gateway services, targets, then optional modules) with rollback points.
  • Stop-loss: if any critical ECU fails, pause the campaign and keep the vehicle in a known serviceable mode.
  • Confirm scope: confirmation checks both ECU health and dependency satisfaction as a whole.
Metric placeholders: dependency-violation activations = 0, partial-complete campaigns ≤ X%, recovery success ≥ X%.
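Version-matrix enforcement reduces to a comparison per ECU. This sketch assumes simple dotted numeric versions and a hypothetical matrix of minimums:

```python
# Hypothetical dependency check: activation is rejected unless every ECU in
# the compatible set meets its minimum version.
def version_ok(installed: str, minimum: str) -> bool:
    """Compare dotted numeric versions component-wise."""
    return (tuple(map(int, installed.split(".")))
            >= tuple(map(int, minimum.split("."))))

def activation_allowed(fleet_versions: dict, matrix: dict) -> tuple:
    """matrix: ecu -> minimum version; returns (allowed, violations)."""
    violations = [ecu for ecu, min_v in matrix.items()
                  if not version_ok(fleet_versions.get(ecu, "0.0.0"), min_v)]
    return (len(violations) == 0, violations)
```

An ECU absent from the reported fleet versions counts as a violation, which implements the "reject activation when dependencies are not satisfied" rule rather than assuming the best.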
Output artifacts: OTA state machine table + durable fields checklist
State | Entry | Do | Durable Fields | Exit | Fail → | Safety Note
Download | Campaign accepted | Chunked fetch + resume | package_id, received_ranges, bytes_done, hash_state | All chunks received | Retry / Pause | Old image untouched
Verify | Download complete | Signature + hash checks | signature_ok, hash_ok, verified_version, audit_id | Verified | Abort | Never activate on fail
Install | Verify ok | Write to staging slot | staging_slot, write_offset, progress, result | Installed | Retry / Abort | Old boot slot preserved
Activate | Install ok | Atomic slot switch | next_boot_slot, activation_flag, time | Boot new image | Rollback | Switch must be atomic
Confirm | Boot ok | Health checks + commit | confirm_deadline, signals, result, reason_code | Committed | Rollback / Safe | Vehicle stays serviceable

Implementation rule: durable fields must be sufficient to determine the next state after reboot without heuristic guesses.

Durable fields checklist (minimum)
  • Campaign: campaign_id, target_set_hash, rollout_group, policy_window
  • Download: package_id, received_ranges, bytes_done, chunk_hash_state
  • Verify: signature_ok, hash_ok, verified_version, audit_id
  • Install: staging_slot, write_offset, install_progress, install_result
  • Activate: next_boot_slot, activation_flag, activation_time
  • Confirm: confirm_deadline, health_signals, confirm_result, reason_code
  • Audit: audit_id, mapping_id, rule_pack_version (for correlation with gateway policies)
OTA state machine (main path + rollback branch)
Figure: Main path Campaign → Download → Verify → Install → Activate → Confirm, with a persist point after each step; failures branch to Rollback (old slot) or restricted Service Mode. Commit happens only after Confirm; failures route to rollback or restricted service mode with full audit logs.

Diagram intent: highlight durable persistence points and show rollback/safe-mode paths without dense text.

Secure Gateway Integration: Trust Chain, Keys, and Policy Enforcement

Secure gateways are operational systems: a trust chain anchors identity, key management keeps credentials alive across the vehicle lifecycle, and policy enforcement produces auditable decisions for diagnostics and OTA.

Trust chain: secure boot → HSM/root key → runtime identity → policy
  • Secure boot: establishes a trusted software identity for the gateway/TCU runtime.
  • HSM/root key: anchors cryptographic operations; private roots remain non-exportable.
  • Runtime identity: produces stable device_id/cert_id used by sessions and policy decisions.
  • Policy tie-in: identity + session attributes map to allow/deny/shape decisions with audit IDs.
Rule: policy enforcement must never depend on unauthenticated identity claims.
Key management: lifecycle, rotation, revocation, factory injection, service updates
  • Lifecycle: issue → activate → rotate → revoke/expire with explicit ownership and audit trails.
  • Rotation: time-based and event-based rotation; support rollback of configuration but not of root identity.
  • Revocation: define behavior on expired/revoked credentials (restricted mode vs deny-all) by policy.
  • Factory injection: bind identity to hardware roots; record injection batch and provisioning version.
  • Service updates: update credentials in service mode without breaking diagnostics access.
Pass criteria placeholders: rotation success ≥ X%, expired-credential outages ≤ X/day, revocation effects are predictable and logged.
Secure comms boundary: where TLS/VPN terminates and who owns certificates
  • Termination point: terminate at the gateway for fine-grained audit/policy, or upstream for simplified roles; make it explicit.
  • Certificate ownership: define who rotates and who revokes; enforce a single source of truth for cert versions.
  • Failure semantics: handshake failure routes to restricted mode or deny by policy; never fall back silently.
  • Audit: record peer_id, cert_id, channel, and reason_code for every deny or downgrade.
Policy placeholder: “no silent downgrade”; all downgrades emit audit_id and reason_code.
Policy enforcement: allowlists, diagnostics privilege, OTA authorization, domain isolation
  • Network allowlist: permitted peers, ports, and services; default deny for unknown traffic.
  • Diagnostics privilege: read/write/flash mapped to auth_level; enforce quotas per session and per target.
  • OTA authorization: campaign must be signed/approved; activation depends on dependency satisfaction and policy windows.
  • Domain isolation: cross-domain flows require explicit permits; log every cross-boundary decision.
Correlation fields: rule_id, audit_id, reason_code, mapping_id, rule_pack_version.
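Default-deny allowlisting with the correlation fields above can be sketched as a keyed lookup. The peer names and rule identifiers are hypothetical:

```python
# Hypothetical network allowlist: permitted (peer, service) pairs map to a
# rule_id; anything else falls through to default deny with an audit reason.
ALLOWLIST = {
    ("ota_backend",  "https"): "rule_ota_01",
    ("service_tool", "doip"):  "rule_diag_02",
}

def decide(peer: str, service: str) -> dict:
    rule_id = ALLOWLIST.get((peer, service))
    return {
        "action": "allow" if rule_id else "deny",  # unknown -> default deny
        "rule_id": rule_id or "rule_default_deny",
        "reason_code": "RC_ALLOW" if rule_id else "RC_NOT_ALLOWLISTED",
    }
```

Every decision, including an allow, carries a rule_id and reason_code, so cross-boundary flows remain attributable in the audit trail.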
IDS / anomaly (minimal viable): rate, session, and replay-like indicators
  • Rate anomalies: sudden spikes per peer/service; link to throttling and circuit-breaker triggers.
  • Session anomalies: abnormal failures, timeouts, or concurrency patterns; tie to admission gates.
  • Replay-like patterns: repeated identical requests in short windows; enforce policy-based rejection and auditing.
  • Actionability: anomaly signals must trigger degrade/isolation/audit escalation instead of being passive dashboards.
Output placeholder: anomaly_id + counter_id + threshold + outcome + audit_id.
Trust chain and enforcement flow (Boot → HSM → Keys → TLS → Policy → Logging)
Figure: Trust and enforcement chain Boot → HSM → Keys → TLS/VPN → Policy → Logging, with runtime identity anchored by the HSM and anomaly signals feeding policy. Key points: identity anchored by HSM, explicit comms termination, policy with reason codes, and audit logs for every decision.

Diagram intent: show a closed loop from trust anchoring to policy enforcement and auditability, with anomaly inputs driving action.

Functional Safety & Reliability: Fail-Operational vs Fail-Silent

A gateway/TCU failure must have a defined outcome. Safety objectives grade impacts by function class, health monitoring closes the loop from detection to action, and degraded modes preserve serviceability without allowing fault propagation.

Fail-Operational vs Fail-Silent (engineering meaning)
  • Fail-Operational: preserve a minimal set of essential functions under fault, typically via controlled degradation.
  • Fail-Silent: stop emitting potentially harmful effects, isolate external connectivity, and block cross-domain propagation.
  • Per-function decision: control, diagnostics, OTA, and external connectivity do not share the same allowed outcome.
  • Evidence requirement: every degrade/isolate decision must be explainable with audit_id and reason_code.
Design rule: “fail-operational” is scoped and explicit; everything else defaults to “fail-silent” across trust boundaries.
Safety objectives: grade impacts by function class and required behavior
Function Class | Failure Impact | Required Outcome | Recovery Target | Audit Minimum
Control (critical) | High | Fail-operational (minimal set) or isolate to local domain | ≤ X s | mode_id, detector_id, action_id, audit_id
Diagnostics | Medium | Degraded (service-only) with strict privileges | ≤ X s | session_id, reason_code, audit_id
OTA | Medium | Fail-safe pause + recover/rollback (no partial activation) | ≤ X min | campaign_id, state, audit_id
External connectivity | High | Fail-silent by default (isolate) unless explicitly allowed | ≤ X s | peer_id, channel, rule_id, reason_code

Implementation note: safety objectives must map to explicit degraded modes; “undefined behavior” is treated as a design failure.

Watchdog & health monitoring: detect, classify, and recover
  • Deadlock / stalls: heartbeat gaps, scheduling delay spikes, and unresponsive service endpoints.
  • Memory pressure: heap high-watermark, allocation failures, handle growth, and leak-rate estimates.
  • Queue blockage: queue depth saturation, tail latency, drops, and backpressure trigger counts.
  • Session health: abnormal timeouts, failed handshakes, and runaway concurrency.
Actions (layered recovery)
  • Soft recovery: restart a service, clear a stuck queue, re-load rule packs, re-open sessions.
  • Containment: circuit breaker, rate limiting, deny-by-policy, cross-domain blocking.
  • Hard recovery: controlled reboot, revert to last known-good config, enter service-only mode.
Pass criteria placeholders: detection time ≤ X ms, recovery time ≤ X s, false triggers ≤ X/day, audit completeness = 100%.
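The layered recovery order (soft, then containment, then hard) can be expressed as an escalation function keyed on repeat count. Fault names and thresholds are illustrative:

```python
# Hypothetical escalation policy: soft recovery first, containment on
# repetition, hard recovery (service-only mode) as the last resort.
def choose_action(fault: str, repeat_count: int) -> str:
    soft = {"stalled_service": "restart_service",
            "stuck_queue":     "clear_queue",
            "rule_pack_error": "reload_rule_pack"}
    if repeat_count <= 1 and fault in soft:
        return soft[fault]            # soft recovery first
    if repeat_count <= 3:
        return "circuit_break"        # contain before rebooting anything
    return "service_only_mode"        # hard recovery, preserves diagnostics
```

Tracking `repeat_count` per fault within a time window (not shown) is what prevents a flapping service from oscillating between soft restarts forever.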
Degraded modes: explicit policy matrix
Mode | Allowed | Denied | Enter | Exit | Required Logs
Mode 1 (Control-preserve) | Minimal control flows, critical routing, bounded queues | External access, non-essential cross-domain traffic | Resource stress, repeated stalls | Health passes Y cycles | mode_id, detector_id, action_id
Mode 2 (Service-only) | Diagnostics + logging, strict auth_level gates | OTA activation, broad routing, external sessions | Policy failure, rule-pack mismatch | Service confirmation | session_id, rule_id, reason_code
Mode 3 (Silent / Isolated) | Local safe logging only, bounded watchdog recovery | External connectivity, cross-domain flows | Untrusted state, repeated failed recoveries | Manual service reset | audit_id, fault_id, last_state
Pass criteria placeholders: wrong-mode entries ≤ X/1k hours, isolation time ≤ X ms, recovery success ≥ X%.
Fault injection hooks: verify detection and actions
Injection | Expected Detector | Expected Action | Pass Criteria
CPU saturation (X%) | sched-delay / heartbeat timeout | rate limit + degrade to Mode 1 | detect ≤ X ms; action ≤ X ms
Memory allocation failures | heap watermark / alloc-fail counter | restart service; if repeated → Mode 2 | recovery ≤ X s; logs complete
Queue blockage / flood | queue depth + tail latency | circuit breaker + drop policy | no cross-domain collapse
Policy engine stalls | policy heartbeat + rule-pack integrity | Mode 2 or Mode 3 depending on trust | deny-by-default is enforced
Evidence fields (minimum): timestamp, fault_id, detector_id, threshold_id, action_id, mode_id, audit_id, reason_code.
Health monitoring closed loop (Detect → Decide → Act → Record → Recover/Isolate)
Figure: Signals (CPU, memory, queues, sessions) → decision (rules, thresholds) → actions (rate limit, breaker, restart, isolate) → record (audit log, Mode 1/2/3). Closed loop: detect anomalies → decide thresholds → act (contain/recover) → record evidence → recover or isolate.

Diagram intent: present a complete reliability loop with explicit actions and audit evidence, without protocol-level details.

Physical Integration Envelope: EMC & Protection Boundaries for Gateways

Gateway/TCU ports define the system boundary. Protection must be layered from the connector inward, return paths must be planned, and configurable drive/slew must follow a system policy that balances emission, margin, and robustness.

Port boundary layering: connector → surge → ESD → common-mode → clean domain
  • Connector / shield: define the physical entry and shield bonding point(s).
  • Surge layer: manage energy and return paths; keep surge loops out of sensitive signal ground.
  • ESD layer: clamp fast events close to the entry; minimize inductive distance to the return path.
  • Common-mode layer: suppress radiated/common-mode energy before the clean IC domain.
  • Clean domain: keep protocol ICs inside a clearly defined “quiet zone” behind the protection stack.
Placement rule: protection components stay near the connector; clean-domain routing starts after the last boundary element.
Return paths & ground: chassis, shield, and surge loops
  • Keep “dirty return” away: ESD/surge return must not traverse sensitive digital/analog reference regions.
  • Shield continuity: bonding must be explicit; floating or intermittent shield connections often amplify emissions.
  • Single controlled tie: define where clean reference and chassis/body ground connect (if required), and keep it deterministic.
  • Black-box logging: record port, trigger type, and the resulting action (reset/isolate) for service correlation.
Pass criteria placeholders: post-event resets ≤ X/100 tests, false wake ≤ X/day, recover ≤ X s with port attribution.
Configurable slew/drive: system policy, not per-board guesswork
  • Slower edges: reduce emissions but shrink timing margin and increase sensitivity to noise and loading.
  • Stronger drive: improves robustness but can increase crosstalk and radiated energy.
  • Policy table: define profiles by harness class (length, node count, environment) and validate against a fixed checklist.
Policy placeholders (example profiles)
  • Harness A: low emission profile (slew low, drive medium)
  • Harness B: balanced profile (slew medium, drive medium)
  • Harness C: high robustness profile (slew high, drive high with strict containment)
When isolation is required: decision criteria (system-level)
  • Uncontrolled ground potential differences: domains with unpredictable reference offsets across operating conditions.
  • High disturbance interfaces: external-facing ports or long harness segments with strong coupling risk.
  • Security partitioning: boundaries where external connectivity must not influence safety-critical domains.
  • Fault containment: when a single-port event must not impact the rest of the network.
Validation placeholder: isolation decision documented, test evidence stored, and service diagnostics remain available.
Output artifacts: port protection stack + return-path checklist
Port Type | Layer Stack | Placement | Return Target | Risk Notes
Ethernet / External | Surge → ESD → CM → Clean | PCB edge / connector-side | Chassis / body ground | Reset, false wake, session drops
Diagnostics port | ESD → CM → Clean | Connector-side | Controlled tie point | False diagnostics triggers
Power entry | Surge → ESD → Filtering | At entry + tight loop | Chassis/body + power return | Brownout / reset storms
Return-path checklist (minimum)
  • “Dirty return” paths do not cross clean reference regions.
  • Shield bonding is explicit and mechanically reliable.
  • Protection components are connector-close with short return loops.
  • Clean/dirty zone boundary is drawn and enforced in layout review.
Connector protection “onion” (Surge → ESD → Common-mode → Clean domain + return arrows)
Figure: Connector and shield feed the protection layers Surge → ESD → Common-mode → Clean IC; signal flows inward to the clean domain, surge/ESD return paths route to chassis/body ground, and a configurable slew/drive policy applies at the port.

Diagram intent: visualize protection layering at the connector and emphasize return-path direction without waveform-level details.

Performance Budgeting: Latency, Throughput, CPU/Memory, and Congestion

Budgeting turns performance into engineering contracts: an end-to-end latency model with measurable boundaries, a throughput-to-resource accounting view, congestion protections that preserve serviceability, and pass criteria that can be accepted in production.

End-to-end latency decomposition (measurable boundaries)
  • Ingress: first observable entry point until classification begins (driver scheduling included).
  • Classify: policy match (routing/ACL/session) and rule-pack lookup decision.
  • Queue: waiting time under contention; often the dominant term in P99 latency.
  • Process: forwarding/encapsulation, copy count, crypto checks, and log field assembly.
  • Egress: shaping, rate limiting, and transmission scheduling to the next hop.
Measurement rule: each stage must expose time markers (t0…tN) so P50/P95/P99 can be reported per stage, not just end-to-end.
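The stage decomposition above can be sketched as a per-stage percentile report. This is a minimal illustration, not a shipped tool: the stage names and the `[t0..t5]` sample layout are assumptions taken from the text, and `stage_percentiles` is a hypothetical helper name.

```python
# Sketch: per-stage latency percentiles from the t0..t5 markers described
# above. Stage names and the sample layout are illustrative assumptions.
from statistics import quantiles

STAGES = ["ingress", "classify", "queue", "process", "egress"]

def stage_percentiles(samples):
    """samples: list of [t0..t5] marker lists in seconds for one run.
    Returns {stage: {"p50", "p95", "p99"}} in milliseconds."""
    report = {}
    for i, stage in enumerate(STAGES):
        # Stage latency = difference of adjacent markers, reported in ms.
        deltas = sorted((s[i + 1] - s[i]) * 1000.0 for s in samples)
        cuts = quantiles(deltas, n=100, method="inclusive")  # cuts[k-1] ~ Pk
        report[stage] = {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}
    return report
```

Reporting per stage (not only end-to-end) is what makes the Queue term visible as the dominant P99 contributor under contention.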
Throughput-to-resource accounting (DoIP / OTA)
  • CPU budget: packet/session management, policy evaluation, crypto checks, and logging pipeline overhead.
  • Memory budget: session state, buffers, queue depths, retry windows, and log staging buffers.
  • Copy budget: copy count is a primary multiplier for CPU and memory bandwidth under OTA payloads.
  • Storage budget: write/flush policies can dominate tail latency; measure it explicitly as a stage.
Practical budgeting targets (placeholders)
  • Peak throughput: ≥ X Mbps for Y seconds
  • Sustained throughput: ≥ X Mbps for Y minutes
  • CPU headroom: peak CPU ≤ X% with P99 latency within budget
  • Memory headroom: peak memory ≤ X% with stable queue depths
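As a minimal sketch of how these placeholder targets become an automated accept/reject step, the check below compares one run's metrics against targets; the key names and concrete limits are illustrative stand-ins for the program's X/Y values.

```python
# Sketch: accept/reject one run against the placeholder budget targets above.
# Key names and thresholds are illustrative, not a fixed schema.
HIGHER_IS_BETTER = {"peak_mbps", "sustained_mbps"}

def check_headroom(metrics, targets):
    """metrics/targets: dicts sharing keys such as peak_mbps, peak_cpu_pct,
    peak_mem_pct, p99_ms. Returns (ok, sorted list of violated keys)."""
    violations = []
    for key, limit in targets.items():
        if key in HIGHER_IS_BETTER:
            if metrics[key] < limit:   # throughput must meet or exceed target
                violations.append(key)
        elif metrics[key] > limit:     # CPU/memory/latency must stay under cap
            violations.append(key)
    return (not violations, sorted(violations))
```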
Congestion control: backpressure, watermarks, and drop policy (preserve serviceability)
  • Traffic classes: Control / Diagnostics / OTA / Telemetry must not share a single “best-effort” queue.
  • Watermarks: define low/high thresholds per queue to trigger shaping and admission control.
  • Backpressure: reject new sessions, delay non-critical transfers, and rate-limit at the boundary.
  • Drop rules: drop-by-class and drop-by-session to prevent one client from starving critical services.
  • Containment: circuit breakers for floods; avoid retry/log storms that amplify congestion.
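The watermark/backpressure behavior can be illustrated with a single queue using high/low thresholds as hysteresis: admission stops at the high watermark and resumes only below the low watermark, so the system does not oscillate. The class name and thresholds are assumptions for illustration.

```python
# Sketch: high/low watermark hysteresis for admission control on one queue.
# Capacity and thresholds are illustrative placeholders.
class WatermarkQueue:
    def __init__(self, capacity=64, low=16, high=48):
        self.items, self.capacity = [], capacity
        self.low, self.high = low, high
        self.admitting = True          # backpressure flag seen by callers

    def offer(self, item):
        """Admit item unless backpressure is active or the queue is full."""
        if len(self.items) >= self.high:
            self.admitting = False     # assert backpressure at high watermark
        if not self.admitting or len(self.items) >= self.capacity:
            return False               # caller must shape, retry, or drop-by-class
        self.items.append(item)
        return True

    def take(self):
        item = self.items.pop(0)
        if len(self.items) <= self.low:
            self.admitting = True      # release backpressure below low watermark
        return item
```

In a real gateway each traffic class gets its own watermarked queue, so a telemetry flood trips backpressure on Q3 without touching the diagnostics queue.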
Policy table skeleton (engineering contract)
Traffic Class | Queue | Priority | Rate Limit | Drop Rule | Protection
Control | Q0 | Highest | Min guarantee + cap | Never drop by default | Admission control
Diagnostics | Q1 | High | Cap per session | Drop by session on flood | Breaker + audit
OTA | Q2 | Medium | Windowed throttle | Drop non-critical chunks | Resume + persist
Telemetry | Q3 | Low | Aggressive cap | Drop-first | No starvation
Pass criteria placeholders: Q0 starvation = 0; Q1 timeouts ≤ X/1k; Q2 throughput ≥ X with P99 within budget; breaker trips ≤ X/day.
Metrics definition (make results comparable)
  • Latency: P50/P95/P99 per stage and end-to-end, plus tail-spike counts (above X ms).
  • Throughput: sustained (Y minutes) and peak (Z seconds) with resource headroom recorded.
  • Sessions: max concurrent sessions with handshake success and timeout rate thresholds.
  • Congestion: queue depth distribution, drop rate, and backpressure trigger counts.
  • Stability: degrade/recover counts and recovery time after congestion or stress.
Measurement rule: every pass criterion must reference a measurement method (tool, window, sampling, and normalization).
Pass criteria placeholders (acceptance contract)
  • Latency: end-to-end P99 ≤ X ms; stage-level P99 ≤ X ms (Queue must remain bounded).
  • Sessions: max concurrent sessions ≥ X with timeout rate ≤ Y% over Z minutes.
  • Throughput: peak ≥ X Mbps and sustained ≥ X Mbps while CPU ≤ Y% and memory ≤ Z%.
  • Loss/timeout: drop ≤ X/1k and timeout ≤ X/1k per traffic class.
  • Recovery: congestion recovery ≤ X s and no mode oscillation above X/hour.
Evidence fields (minimum): run_id, workload_id, timestamps, stage metrics, queue stats, session stats, resource stats, config_version.
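The minimum evidence fields listed above lend themselves to a mechanical completeness check before a run is accepted. The record layout below is an assumption; the field names mirror the text.

```python
# Sketch: completeness check for the minimum evidence fields listed above.
# The dict-based record layout is an assumption for illustration.
REQUIRED = {"run_id", "workload_id", "timestamps", "stage_metrics",
            "queue_stats", "session_stats", "resource_stats", "config_version"}

def missing_evidence_fields(record):
    """Return the required keys that are absent (or None) in a record,
    sorted so reports are stable and comparable across runs."""
    return sorted(k for k in REQUIRED if record.get(k) is None)
```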
Output artifact: performance budget template (stage / metric / target / method / margin)
Stage | Metric | Target | Measurement Method | Margin
Ingress | P99 latency | ≤ X ms | timestamp t0→t1 (fixed window) | X%
Classify | P99 latency | ≤ X ms | timestamp t1→t2 (rule eval) | X%
Queue | P99 wait | ≤ X ms | queue timestamp t2→t3 | X%
Process | CPU / copies | ≤ X% / ≤ X | profiling + counters | X%
Egress | P99 latency / drop | ≤ X ms / ≤ X | timestamp t4→t5 + stats | X%
Budget waterfall (stage bars + margin)
Total budget = Ingress + Classify + Queue + Process + Egress + Margin (all P99 terms). Reserve margin for worst-case contention, logging spikes, and recovery actions. Rule: budget by stage, validate by workload, accept by pass criteria (P99 + sessions + throughput + loss).

Engineering Checklist: Design → Bring-up → Production

Gate-based checklists turn complex gateway/TCU programs into repeatable execution. Each gate defines required inputs, checks, evidence artifacts, and pass criteria placeholders so cross-team delivery remains consistent from first bring-up to production.

Design Gate (freeze boundaries and contracts)
  • Inputs: architecture boundary, policy tables, minimum log fields, key/cert policy, state machines, performance budget.
  • Checks: deny-by-default rules, explicit exceptions, audit fields completeness, persistence points, budget measurability.
  • Outputs: frozen config_version / rule_pack_version, acceptance draft (pass criteria placeholders).
  • Evidence: review record, signed tables, baseline workload definition (workload_id).
Pass criteria placeholders: boundary completeness = 100%, policy coverage = 100%, audit fields present in all flows, budgets defined for P99.
Bring-up Gate (stability, stress, and recovery)
  • Inputs: test tools, logging pipeline, workload profiles, congestion tests, fault-injection matrix.
  • Checks: session stability, pressure and congestion behavior, P99 within budget, recovery actions verified end-to-end.
  • Outputs: baseline performance report (run_id), known-good config, validated degraded-mode behavior.
  • Evidence: P99 report, queue stats report, fault-injection report, recovery timing report.
Pass criteria placeholders: max sessions ≥ X, P99 latency ≤ X, timeouts ≤ X/1k, recovery ≤ X s, evidence artifacts complete.
Production Gate (consistency and traceability)
  • Inputs: release bundle, cert injection flow, station self-test, traceability schema.
  • Checks: config/cert version alignment, self-test coverage, consistent pass criteria validation, audit attribution fields.
  • Outputs: trace bundle (device_id, config_version, cert_version), factory pass report.
  • Evidence: station logs, regression comparison report, sampling audit report.
Pass criteria placeholders: version match = 100%, self-test pass = 100%, trace fields present, production drift ≤ X per batch.
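The three gates above share one evaluation shape: a gate passes only when every check is true and every evidence slot is filled. The sketch below models that contract; the gate and field names are illustrative, not a fixed schema.

```python
# Sketch: gate evaluation as an explicit data check, mirroring the
# inputs → checks → outputs + evidence structure above.
def gate_passes(gate):
    """gate: {"checks": {name: bool}, "evidence": {name: artifact-or-None}}.
    Passes only if every check is True and every evidence slot is filled."""
    checks_ok = all(gate["checks"].values())
    evidence_ok = all(v is not None for v in gate["evidence"].values())
    return checks_ok and evidence_ok

def first_blocked_gate(gates):
    """gates: ordered list of (name, gate). Returns the first failing gate
    name, or None when the program may proceed to production."""
    for name, gate in gates:
        if not gate_passes(gate):
            return name
    return None
```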
Checklist table (items + owner + method + pass criteria + evidence)
Gate | Item | Owner | Method | Pass Criteria | Evidence
Design | Boundary + policy tables complete | System / Security | Review + diff | Coverage = 100% | Signed tables
Bring-up | P99 within budget under stress | Test | Load test | P99 ≤ X ms | run_id report
Bring-up | Congestion protections verified | System | Queue stats | No starvation | queue report
Production | Version + cert alignment | Factory | Station self-test | Match = 100% | station logs
3-gate flow (Inputs → Checks → Outputs + Evidence)
Rule: each gate has explicit inputs, checks, outputs, and evidence with pass criteria placeholders.

H2-11 · Applications: Diagnostics / Gateway / TCU Patterns

This section maps common system shapes to practical constraints, required modules, and serviceability expectations. Each pattern keeps boundaries clear: it focuses on gateway/TCU system integration (bridging, DoIP/OTA, security policy, logging), and avoids PHY-level deep dives that belong to sibling pages.

Pattern → Constraints → Modules → Verification (includes example material numbers)
A) Central Gateway (centralized)
  • Typical scene: Multiple CAN/LIN domains converge to one gateway; DoIP diagnostics and OTA coordination are centralized.
  • Key constraints: High session concurrency, queue isolation (diagnostics vs control vs logging), strict fault containment, predictable P99 latency.
  • Serviceability minimum: session-id, tester-id, target ECU, service id, result code, duration, drop reason, policy decision.
  • Related sibling pages to link: CAN FD transceiver / Selective wake / Ethernet PHY & switch / EMC & port protection.
Example BOM candidates (material numbers)
  • Gateway compute / SoC: NXP S32G274AABK0CUCT, Renesas R8A779F0, Infineon SAK-TC397XX-256F300S-BD
  • Automotive Ethernet switch: NXP SJA1105TEL, NXP SJA1105EL
  • Automotive Ethernet PHY: NXP TJA1100, TI DP83TC811R-Q1
  • CAN FD transceiver / controller: TI TCAN1044-Q1, TI TCAN4550
  • LIN transceiver: TI TLIN1029-Q1, TI TLIN1021-Q1
  • Secure element / TPM: NXP SE050A2HQ1/Z01SHZ, Infineon OPTIGA TPM SLB 9672 FW16
  • Safety PMIC / SBC: NXP MFS2633HMBA0AD, Infineon TLF35584QVVS1
  • Port ESD (recommended-for-new): Nexperia PESD2ETH100T-Q, Nexperia PESD2CANFD24LT-Q
B) Zonal Gateway + Ethernet Backbone
  • Typical scene: Many LIN/CAN nodes aggregated per zone; zonal gateways uplink to an Ethernet backbone; DoIP/OTA policy is shared or centralized.
  • Key constraints: Broadcast storm containment, per-zone rate limiting, deterministic forwarding under congestion, isolation of “noisy” zones.
  • Pitfall to guard: Unbounded retries across zones turning into global queue collapse; missing per-zone “circuit breaker”.
  • Related sibling pages to link: Selective wake / SBC with CAN/LIN / CAN FD transceiver / Ethernet PHY & switch.
Example BOM candidates (material numbers)
  • Zonal SBC / CAN: NXP UJA1169A, TI TCAN4550
  • CAN FD transceiver: TI TCAN1044-Q1
  • Ethernet backbone switch: NXP SJA1105TEL
  • Ethernet PHY: NXP TJA1100, TI DP83TC811R-Q1
  • Power & safety monitor: NXP MFS2633HMBA0AD, Infineon TLF35584QVVS1
C) TCU-as-Gateway (TCU terminates external link + policies)
  • Typical scene: TCU is the external termination point (TLS/VPN), and also enforces gateway policies for DoIP/OTA.
  • Key constraints: Trust chain clarity (boot→keys→runtime policy), strong logging/forensics, robust rollback, strict separation between external and in-vehicle domains.
  • Pitfall to guard: “Security termination” placed too deep (policy after bridging), causing untrusted traffic to consume internal queues.
  • Related sibling pages to link: Secure gateway / selective wake / DoIP diagnostics / Ethernet PHY.
Example BOM candidates (material numbers)
  • Compute: NXP S32G274AABK0CUCT, Renesas R8A779F0
  • External trust anchor: Infineon OPTIGA TPM SLB 9672 FW16, NXP SE050A2HQ1/Z01SHZ
  • In-vehicle networking: NXP SJA1105TEL, TI TCAN1044-Q1, TI TLIN1029-Q1
  • Safety PMIC: Infineon TLF35584QVVS1
D) Service Tool / Factory Mode (diagnostics throughput + traceability)
  • Typical scene: Factory flashing, EOL test, service station diagnostics, controlled bypass policies, strong traceability fields.
  • Key constraints: High throughput, strict access levels, deterministic test time, audit-ready logs (who/what/when/result).
  • Pitfall to guard: “Test-only” backdoors leaking into field images; missing policy attestation tags.
  • Related sibling pages to link: DoIP diagnostics / OTA lifecycle / security policy enforcement.
Example BOM candidates (material numbers)
  • Compute / safety MCU option: Infineon SAK-TC397XX-256F300S-BD
  • DoIP-facing Ethernet PHY: TI DP83TC811R-Q1
  • CAN FD access: TI TCAN4550
  • Trust anchor for station auth: Infineon OPTIGA TPM SLB 9672 FW16
Figure 11 — Four system patterns (mobile-stacked)
Four stacked block-diagram patterns: A) Central Gateway (zone CAN/LIN buses into a policy/routing core with queues and rate limiting, facing the DoIP tester and cloud/OTA), B) Zonal Gateway + Ethernet backbone (zonal GWs with per-zone backpressure uplinked through a QoS switch to central DoIP/OTA/logging services), C) TCU-as-Gateway (TLS/VPN terminates at the TCU policy core with trust, logging, and limiting before the in-vehicle buses), and D) Factory/Service mode (service tool → gateway auth + audit log → target ECUs/flash).

Use the pattern choice to drive: (1) queue isolation boundaries, (2) rate-limit placement, (3) trust termination location, and (4) minimum logs for field diagnosis.

H2-12 · IC Selection Logic: What to Choose and Why (with material numbers)

Selection is driven by policy boundaries and serviceability requirements first, then by compute/network/security/power modules. Material numbers below are illustrative references; always confirm AEC grade, package, longevity, and safety documentation for the target program.

1) Gateway compute (MCU/SoC)
  • Choose by: session concurrency, copy budget (DMA/zero-copy), security acceleration, real-time partitioning, safety concept (ASIL targets), and I/O count.
  • Validation hook: CPU headroom at peak DoIP + OTA + logging; verify P99 latency with congestion + encryption enabled.
  • Example parts: NXP S32G274AABK0CUCT, Renesas R8A779F0, Renesas R8A779G0, Infineon SAK-TC397XX-256F300S-BD, TI TDA4VM-Q1
2) In-vehicle networking (Ethernet / CAN / LIN / FlexRay)
  • Ethernet topology: port count, QoS/TSN needs, mirroring for diagnostics, and storm control placement.
  • Bus access strategy: discrete transceivers vs integrated SBC/controller; decide by wake policy, SPI bandwidth, and failure isolation.
  • Example parts — Switch: NXP SJA1105TEL, NXP SJA1105EL; Ethernet PHY: NXP TJA1100, TI DP83TC811R-Q1; CAN FD transceiver/controller: TI TCAN1044-Q1, TI TCAN4550; LIN transceiver: TI TLIN1029-Q1, TI TLIN1021-Q1; FlexRay (if required): NXP TJA1080A, Infineon TLE9221SX
3) Security (trust anchor + policy enforcement)
  • Choose by: root-of-trust availability, key storage capacity, crypto throughput, update/rotation method, and auditability.
  • Validation hook: attested boot chain + policy versioning + log integrity (tamper evidence).
  • Example parts: NXP SE050A2HQ1/Z01SHZ, Infineon OPTIGA TPM SLB 9672 FW16
4) Power, reset, and safety monitoring
  • Choose by: rail count, fail-safe outputs, watchdog concept, wake sources, and required safety diagnostics coverage.
  • Validation hook: brownout/ignition cranking recovery + OTA power-loss recovery + watchdog-induced safe state.
  • Example parts: NXP MFS2633HMBA0AD, Infineon TLF35584QVVS1
5) Port protection (keep SI while meeting ESD)
  • Choose by: capacitance budget, surge model assumptions, and placement feasibility (return path length dominates).
  • Validation hook: post-ESD link stability + insertion loss / reflection sanity checks on the real harness.
  • Example parts (recommended-for-new): Nexperia PESD2ETH100T-Q, Nexperia PESD2CANFD24LT-Q, Nexperia PESD2CANFD36UU-Q
Figure 12 — Decision flow (scenario → module combo → verification)
Four-lane flow showing how each scenario (Central Gateway, Zonal Gateway, TCU-as-Gateway, Factory/Service) maps to a module combination, its critical specs, and its verification hooks for gateway and TCU systems.

Practical acceptance placeholders (replace X/Y with program targets): P99 latency, max concurrent sessions, peak OTA throughput, drop/timeout rate, and post-ESD stability.


H2-13 · FAQs (Diagnostics / Gateway / TCU)

Long-tail troubleshooting only. Each answer is a fixed 4-line engineering path: Likely cause → Quick check → Fix → Pass criteria (threshold placeholder X).

DoIP connects, but diagnostic services intermittently time out — check concurrency or queue starvation first?
Likely cause: session concurrency spikes and control/telemetry traffic steals queue time, starving diagnostic request/response processing.
Quick check: correlate timeouts with active_sessions, queue watermarks, and per-class drop/latency counters at ingress→classify→egress.
Fix: reserve a diagnostics class (queue + minimum service rate) and enforce per-source rate limiting; cap non-diagnostic bursts via backpressure/circuit-breaker.
Pass criteria: timeout rate ≤ X/1k requests and P99 diag latency ≤ X ms at X concurrent sessions; diag queue watermark < X%.
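The reserved-diagnostics-class fix can be illustrated with a quantum-based scheduler: each round, every class gets its configured quantum of service, so a telemetry flood cannot reduce the diagnostics share below its guarantee. Class names and weights are illustrative assumptions.

```python
# Sketch: quantum-per-round scheduling that guarantees the diagnostics class
# a minimum service rate even when another class floods. Weights are examples.
def schedule(queues, weights, rounds):
    """queues: {cls: list of pending items}; weights: {cls: quantum per round}.
    Returns the number of items served per class after `rounds` rounds."""
    served = {cls: 0 for cls in queues}
    for _ in range(rounds):
        for cls, quantum in weights.items():
            for _ in range(quantum):       # serve at most `quantum` items
                if queues[cls]:
                    queues[cls].pop(0)
                    served[cls] += 1
    return served
```

With weight 2 for diagnostics and 1 for telemetry, diagnostics is served twice as fast regardless of how deep the telemetry backlog grows.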
OTA download completes, but installation fails — missing persisted fields or dependency matrix mismatch?
Likely cause: install step lacks required persisted state (resume tokens, verified hash, selected slot) or dependency rules reject the target set.
Quick check: inspect state store for required fields per step (campaign_id, artifact_id, hash_ok, slot, precheck_result) and compare actual ECU versions to the declared dependency matrix.
Fix: make install preconditions explicit and persisted; validate dependencies before install; emit a single “reject reason code” for serviceability.
Pass criteria: install success ≥ X% across X cycles; dependency rejects always carry reason_code; no “missing_field” failures in logs over X installs.
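A minimal sketch of the explicit-precondition fix: required persisted fields are named once, checked before install, and every rejection carries a single reason code. Field names follow the quick-check above; the exact reason-code strings are assumptions.

```python
# Sketch: explicit, persisted install preconditions with a single reject
# reason code. Reason-code strings are illustrative assumptions.
REQUIRED_STATE = ("campaign_id", "artifact_id", "hash_ok", "slot", "precheck_result")

def install_precheck(state, installed_versions, dependency_matrix):
    """Return (True, None) when install may start, else (False, reason_code)."""
    for field in REQUIRED_STATE:
        if state.get(field) is None:
            return False, f"missing_field:{field}"
    if state["hash_ok"] is not True:
        return False, "hash_mismatch"
    for ecu, needed in dependency_matrix.items():
        if installed_versions.get(ecu) != needed:
            return False, f"dependency_mismatch:{ecu}"
    return True, None
```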
After update, diagnostics wake-up intermittently fails — certificate/policy rotation or routing table version mismatch?
Likely cause: policy or certificate set is updated without synchronized versioning across wake filters and routing, causing valid wake frames to be dropped.
Quick check: compare policy_version, cert_bundle_version, and route_table_version in the wake event log; verify rejection reason (auth vs filter vs route-miss).
Fix: enforce atomic rollout (version bundle + monotonic policy) and add compatibility window; fail closed on external access but fail open for local service-mode (if allowed by safety concept).
Pass criteria: wake success ≥ X% within X s; all wake drops tagged with reason_code; version skew events = 0 over X days.
Stress test OK, but in-vehicle hot / low-voltage resets increase — watchdog source or power policy?
Likely cause: watchdog triggers due to scheduler stalls under thermal throttling or brownout/reset policy causes repeated reboot loops.
Quick check: read reset_reason histogram (wdt/brownout/thermal/exception) and correlate with temperature and supply monitors; check whether queue deadlocks precede WDT.
Fix: add health monitor + staged degraded modes (shed non-critical traffic, preserve diag) and tune power-up/down policy; ensure watchdog window matches worst-case stall budget.
Pass criteria: unexpected resets ≤ X/24h at X°C and low-voltage profiles; recovery time ≤ X s; reset_reason always populated.
IDS generates too many alerts and hurts availability — tune thresholds first or build whitelist/baseline first?
Likely cause: thresholds are applied without traffic baseline per mode (factory/service/road), causing normal bursts to be flagged as anomalies.
Quick check: split alerts by mode, source, service type, and rate bucket; compare against baseline percentiles (P50/P95/P99) rather than raw averages.
Fix: establish whitelist + per-mode baselines, then set thresholds to target false-positive rate; add alert rate limiting so IDS cannot starve core routing.
Pass criteria: false-positive ≤ X/hour and IDS CPU ≤ X%; alert storms capped to ≤ X/min without impacting P99 routing latency.
Diagnostic write is denied but read works — permission tiering or security access unlock flow?
Likely cause: access control differentiates read vs write (role/level), or the unlock step is missing/expired for this session/policy version.
Quick check: log the decision tuple: (session_id, tester_id, requested_service, access_level, policy_version, deny_reason); verify unlock state and TTL for that ECU/service group.
Fix: define explicit levels (read / write / flash) and bind unlock to session + target ECU; make denies deterministic with a single reason code and guidance.
Pass criteria: authorized writes succeed ≥ X%; unauthorized writes always denied with deny_reason; no ambiguous denies (reason missing) over X attempts.
With multi-ECU diagnostics, one ECU gets “starved” — how to adjust queue classification and rate limiting?
Likely cause: queues are keyed only by service type (diag) and not by ECU/tenant, so a “noisy” ECU consumes the shared budget.
Quick check: break down throughput and drops per ECU address (logical/physical) and per-source tester; verify fairness metrics (per-ECU share, queue wait time).
Fix: add per-ECU (or per-tenant) sub-queues with weighted fairness; apply token-bucket limits per ECU and per tester to prevent domination.
Pass criteria: minimum per-ECU service rate ≥ X req/s; starvation time ≤ X ms; fairness index ≥ X during X-ECU load.
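The per-ECU token-bucket fix can be sketched as follows: each ECU gets its own bucket created on first request, so a noisy ECU exhausts only its own budget. Rates, capacities, and the `admit` helper name are illustrative assumptions.

```python
# Sketch: per-ECU token buckets so one noisy ECU cannot dominate the shared
# diagnostics budget. Rates and capacities are illustrative placeholders.
class TokenBucket:
    def __init__(self, rate, capacity):
        self.rate, self.capacity = rate, capacity
        self.tokens, self.last = float(capacity), 0.0

    def allow(self, now):
        """Refill by elapsed time, then spend one token if available."""
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

buckets = {}  # ecu_id -> TokenBucket, created lazily on first request

def admit(ecu_id, now, rate=10.0, capacity=5):
    bucket = buckets.setdefault(ecu_id, TokenBucket(rate, capacity))
    return bucket.allow(now)
```

The same structure applies per tester: key the bucket map by (tester_id, ecu_id) to bound both dimensions.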
OTA repeats installation after power-loss recovery — where is idempotency missing?
Likely cause: state machine transitions are not atomic; the “completed” marker is not persisted (or persisted after side effects), so reboot re-enters install.
Quick check: inspect persisted state ordering: (downloaded→verified→installed→activated→confirmed) and confirm each step writes a monotonic marker before executing irreversible actions.
Fix: implement idempotent guards (operation_id + step checkpoint), use monotonic state with write-ahead markers, and ensure activation/confirm are strictly separated.
Pass criteria: after forced power loss at any step, resume never regresses more than X state; duplicate installs = 0 over X fault-injection runs.
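The monotonic-marker fix can be sketched with a checkpointed step runner: a completed step is never re-run, and resume after reboot starts exactly at the first unfinished step. Persistence is modeled as a dict here; a real unit would back it with flash.

```python
# Sketch: monotonic OTA steps with write-ahead intent and completion markers,
# so resume after power loss never re-runs a checkpointed step. The dict-based
# store stands in for flash-backed persistence.
STEPS = ["downloaded", "verified", "installed", "activated", "confirmed"]

def resume_point(store):
    """Index of the next step to run after a reboot."""
    return store.get("checkpoint", -1) + 1

def run_step(store, index, action):
    """Execute step `index` at most once; returns True if work actually ran."""
    if index <= store.get("checkpoint", -1):
        return False               # idempotent guard: already completed
    store["intent"] = index        # write-ahead marker: step is starting
    action()                       # the step body must itself be re-runnable
    store["checkpoint"] = index    # monotonic completion marker
    return True
```

If power is lost mid-step, `intent` exceeds `checkpoint` on reboot, signaling exactly which step to re-run; activation and confirm stay separate steps so each gets its own marker.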
Enabling encryption drops throughput a lot — copy count or crypto offload capability first?
Likely cause: extra memory copies dominate, or crypto runs on CPU without acceleration, turning throughput into a compute-bound pipeline.
Quick check: measure bytes-copied per payload, CPU cycles per byte, and DMA usage; compare “crypto on/off” with identical queue/traffic to isolate overhead source.
Fix: reduce copies (zero-copy buffers, scatter/gather), enable acceleration/offload where available, and pre-size queues to avoid encryption-induced backpressure.
Pass criteria: sustained throughput ≥ X Mbps with encryption; CPU ≤ X%; copy count ≤ X per payload; P99 latency ≤ X ms.
Unstable external link slows in-vehicle buses — how to set isolation domains and circuit breakers?
Likely cause: external retries and reconnect storms consume shared CPU/queues, leaking failure across domains due to missing isolation and breaker policy.
Quick check: correlate external link state changes with internal queue growth, CPU spikes, and diag/control latency; confirm whether domain boundaries share the same limiter/worker pool.
Fix: isolate external domain (separate queues + worker budget), apply circuit breaker on repeated failures, and degrade gracefully (keep control/diag alive while external is throttled).
Pass criteria: external flaps do not change internal P99 latency by more than X%; breaker trips within X failures; internal diag/control timeout rate ≤ X/1k.
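A minimal sketch of the breaker policy: after N consecutive failures the external domain is shed for a cooldown window, so reconnect storms stop consuming shared CPU and queues. The threshold and cooldown values are illustrative.

```python
# Sketch: failure-count circuit breaker isolating the external link so
# reconnect storms cannot leak into internal domains. Thresholds are examples.
class CircuitBreaker:
    def __init__(self, trip_after=3, cooldown=30.0):
        self.trip_after, self.cooldown = trip_after, cooldown
        self.failures, self.open_until = 0, 0.0

    def allow(self, now):
        """External work is permitted only while the breaker is closed."""
        return now >= self.open_until

    def record(self, success, now):
        if success:
            self.failures = 0
        else:
            self.failures += 1
            if self.failures >= self.trip_after:
                self.open_until = now + self.cooldown  # shed external work
                self.failures = 0
```

While the breaker is open, control and diagnostics keep their full worker budget; external retries resume only after the cooldown elapses.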
Field complaint “intermittent and not reproducible” — which minimum black-box fields are missing?
Likely cause: missing correlation keys prevent joining events across planes (session, policy decision, route version, queue state, reset reasons), so failures cannot be reconstructed.
Quick check: verify whether each failure log has: timestamp, session_id, tester_id, ecu_id, service_id, result/deny_reason, policy_version, route_table_version, queue_watermark, reset_reason (if any).
Fix: enforce a minimal schema and unique run_id per diagnostic/OTA operation; add sampling for high-rate counters and keep last-N ring buffer for pre-failure context.
Pass criteria: ≥ X% of incidents are explainable within X minutes using logs alone; missing-required-field rate ≤ X%.
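The minimal-schema plus last-N ring buffer fix can be sketched directly: every event is checked against the required fields, and gaps are recorded so the missing-field rate is measurable. The field list mirrors the quick-check above; the `BlackBox` class itself is an assumption.

```python
# Sketch: last-N ring buffer plus required-field check for black-box logs.
# The field list mirrors the quick-check above; the class is illustrative.
from collections import deque

REQUIRED_FIELDS = ("timestamp", "session_id", "tester_id", "ecu_id",
                   "service_id", "result", "policy_version",
                   "route_table_version", "queue_watermark")

class BlackBox:
    def __init__(self, last_n=256):
        self.ring = deque(maxlen=last_n)   # bounded pre-failure context

    def record(self, event):
        """Store the event and return the list of missing required fields."""
        missing = [f for f in REQUIRED_FIELDS if f not in event]
        self.ring.append(dict(event, missing_fields=missing))
        return missing
```

On an incident, the exported ring holds the last N events with their gaps tagged, so "intermittent and not reproducible" becomes a join over correlation keys instead of guesswork.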
Factory flashing passes, but service fails — which injection step (cert/version) is inconsistent?
Likely cause: factory and service pipelines use different key/cert bundles, policy versions, or metadata stamping, causing auth/compatibility mismatch in the field.
Quick check: compare artifacts and stamps: firmware version, policy_version, cert_bundle_id, provisioning_profile, and signing chain across factory vs service; confirm that logs report the active bundle IDs.
Fix: unify provisioning profiles, make injection steps deterministic and audited, and add a “configuration attestation” record that must match before allowing service operations.
Pass criteria: provisioning mismatches = 0 over X vehicles; service auth success ≥ X%; every unit exports attestation with bundle IDs and version stamps.