Safety Watchdog for ASIL/SIL
What It Solves
Independent WDT clock + window constraints + auditable logs = a safety loop that detects deadlocks, timing faults, and clock stalls, then enforces fail-safe actions with evidence for audits.
Jitter & ISR latency
Size T_min/T_max with guard bands so task jitter and interrupt latency do not cause false trips.
Main clock stalls
An independent clock/power domain with stall detection exposes failures even if the SoC PLL halts.
Audit-ready evidence
Structured logs capture timestamp, violation type, detect latency, action, and evidence hash for coverage reports.
- Application deadlock/disorder → abnormal feeding (early/late/missing/too-often)
- Window violation detected in watchdog’s own clock domain
- Fail-safe action: RESET_SAFE pulse/hold or graded degrade
- Evidence logged → auditable coverage report
Safety Watchdog Architecture
Independent clock/rail feeds a window comparator that classifies early/late/missing, too-often, and clock-stall faults, driving RESET_SAFE, FAULT_LATCH, and DIAG channels. Key traps: t_pw(min), polarity, re-arm and debounce conditions.
Clock domain
XTAL/independent RC with stall detection; do not derive from main PLL.
Window logic
Compare T_feed ∈ [T_min, T_max] with guard bands; classify early/late/missing/too-often.
Outputs & diagnostics
RESET_SAFE pulse/hold, FAULT_LATCH sticky, and DIAG/I²C/PMBus/SPI logging.
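The window classification described above can be sketched in a few lines. This is a minimal illustration, not a vendor's comparator: it assumes feed-to-feed intervals are already measured in the watchdog's own clock domain, and the names `classify_feeds`, `t_missing`, and `burst_limit` are hypothetical. Treating "too-often" as a run of consecutive early feeds is also an assumption, since the text does not define the threshold.

```python
def classify_feeds(intervals, t_min, t_max, t_missing, burst_limit=3):
    """Label each feed-to-feed interval against the window [t_min, t_max].

    t_missing is an assumed cutoff beyond which a late feed counts as missing.
    burst_limit is an assumed count of consecutive early feeds that is
    reported as 'too_often' instead of 'early'.
    """
    labels = []
    early_run = 0
    for dt in intervals:
        if dt < t_min:
            early_run += 1
            labels.append("too_often" if early_run >= burst_limit else "early")
            continue
        early_run = 0  # any non-early feed ends the burst
        if dt <= t_max:
            labels.append("ok")
        elif dt < t_missing:
            labels.append("late")
        else:
            labels.append("missing")
    return labels
```

Clock-stall detection is left out here because it needs a second reference clock rather than feed intervals.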
Timing Windows & Jitter Budget
Under task jitter, ISR latency, and DVFS transitions, size the watchdog as a valid feed band with an auditable derivation, not just two hard limits.
Copyable formula
Given baseline period T_task, max jitter ±J, worst ISR/scheduler latency L, and safety margin Δguard:
T_min = T_task − J − L − Δguard
T_max = T_task + J + L + Δguard
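Valid band: [T_min + Δguard, T_max − Δguard] = [T_task − J − L, T_task + J + L]. Feeds are expected inside this band; the Δguard strips on either side absorb drift before a hard limit trips.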
Multi-task feed: release a token only after critical tasks complete to prevent “empty feeds”.
1) Sample task period distribution (≥10k events across temp/voltage/load).
2) Estimate J and L (extreme + P95/P99).
3) Set Δguard = 5–15% per risk and aging margin.
4) Sweep parameters to measure false-trip and miss rates.
5) Freeze to registers/OTP with versioned evidence.
- Mask window during DVFS transitions and enable a first-feed delay after sleep.
- At low-frequency modes, relax T_max and log configuration changes.
- All feed validation must occur in the watchdog’s own clock domain.
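The formula and sizing steps above can be expressed as a small helper. This is a sketch under stated assumptions: Δguard is taken as a fraction of T_task (the text gives 5–15% without naming the base), J is estimated from the extreme deviation only, and the names `size_window`, `estimate_jitter`, and `guard_frac` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Window:
    t_min: float    # hard lower limit
    t_max: float    # hard upper limit
    band_lo: float  # lower edge of the valid feed band
    band_hi: float  # upper edge of the valid feed band


def estimate_jitter(periods, t_task):
    """Extreme-value jitter estimate: largest deviation from the nominal period."""
    return max(abs(p - t_task) for p in periods)


def size_window(t_task, jitter, latency, guard_frac=0.10):
    """Apply T_min/T_max = T_task -/+ (J + L + guard), guard = guard_frac * T_task."""
    guard = guard_frac * t_task  # assumption: guard scales with T_task
    t_min = t_task - jitter - latency - guard
    t_max = t_task + jitter + latency + guard
    if t_min <= 0:
        raise ValueError("window collapses: T_min must stay positive")
    return Window(t_min, t_max, t_min + guard, t_max - guard)
```

For example, `size_window(10.0, 0.5, 0.3)` gives hard limits of about 8.2 and 11.8 with a valid band of about 9.2 to 10.8, in whatever time unit the inputs use.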
Fault-Injection & Self-Test
Build fault-injection scripts and measure diagnostic coverage C_diag%. Separate power-on self-test, online periodic, and maintenance mode. Define pass criteria and log fields for audits.
Types: Early feed / Late feed / Missing feed / Too-often
Clock: Stall / frequency skew
Boundary: Window drift / threshold perturbation
Pass criteria
- Late/Missing: trigger RESET_SAFE ≥ t_pw within T_max + Δdetect.
- Early/Too-often: set FAULT_LATCH within N violations and degrade.
- Clock-stall: detect within T_stall_max and enter safe state.
C_diag% = (1 − undetected / total injections) × 100
Log fields: timestamp, injection_type, detect_latency, action_result, evidence_hash / log_ptr, config_version, tester_id
- Power-on self-test: cover primary failures at boot.
- Online periodic: low-duty rolling injections with time-window avoidance.
- Maintenance mode: factory/service full matrix + exportable report.
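Computing C_diag% from injection logs could look like the sketch below. It assumes each log record carries the `injection_type` field listed above plus a boolean `detected` flag; that flag and the function name `diag_coverage` are hypothetical and would in practice be derived from `action_result`.

```python
from collections import Counter


def diag_coverage(records):
    """Return (overall C_diag%, per-class C_diag%) from injection records.

    Each record is a dict with 'injection_type' and a boolean 'detected'
    (an assumed field, not one of the log fields named in the text).
    """
    total = Counter()
    missed = Counter()
    for rec in records:
        total[rec["injection_type"]] += 1
        if not rec["detected"]:
            missed[rec["injection_type"]] += 1
    n = sum(total.values())
    if n == 0:
        raise ValueError("no injections recorded")
    per_class = {k: 100.0 * (1 - missed[k] / total[k]) for k in total}
    overall = 100.0 * (1 - sum(missed.values()) / n)
    return overall, per_class
```

Reporting the per-class figures next to the overall number keeps a weak class, such as early feeds, from hiding behind a strong one.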
Fail-Safe Policies
Policies must be auditable tables: Reset (pulse ≥ t_pw(min)), Degrade (limit power/disable non-critical/read-only), and Hold-off (manual/remote unlock). Guard against mis-trips during power ramp, DVFS, and sleep-to-wake transitions.
Level 1 · Reset
Pulse width ≥ t_pw(min); for Late/Missing or clock-stall criticals.
Level 2 · Degrade
Disable non-critical blocks, limit power, or enter read-only after Early/Too-often or boundary drift.
Level 3 · Hold-off
Keep power removed until authorized; for persistent/severe violations or trust failures.
- Power-ramp mask of the window counter.
- DVFS suppression during frequency transitions; restore with delay.
- Wake-up compensation for the first feed after sleep.
Table columns to freeze in BOM/validation: fault_class · preconditions · action · parameters (t_pw, N, T_stall_max, re-arm, polarity) · logging (timestamp, latency, action_result, evidence_hash, config_version).
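One way to keep such a policy table auditable is to hold it as data and look actions up from it. The sketch below is illustrative only: the fault-class keys, the violation counts, and the names `Policy`, `POLICY_TABLE`, and `decide` are assumptions, and real values belong in the frozen validation record.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Policy:
    action: str        # "reset", "degrade", or "hold_off"
    n_violations: int  # violations required before the action fires


# Placeholder values for illustration; not taken from any datasheet.
POLICY_TABLE = {
    "late": Policy("reset", 1),
    "missing": Policy("reset", 1),
    "clock_stall": Policy("reset", 1),
    "early": Policy("degrade", 3),
    "too_often": Policy("degrade", 3),
    "boundary_drift": Policy("degrade", 3),
    "persistent": Policy("hold_off", 1),
    "trust_failure": Policy("hold_off", 1),
}


def decide(fault_class, count, mask_active=False):
    """Return the action for a fault, honoring ramp/DVFS/wake masks."""
    if mask_active:
        return "masked"  # still expected to be logged by the caller
    policy = POLICY_TABLE[fault_class]
    return policy.action if count >= policy.n_violations else "count_only"
```

Because the table is plain data, it can be versioned and diffed alongside `config_version`.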
Integration Patterns
Minimal interlocks among lock-step MCU, PMIC safety state machine, and reset tree. When multiple sources fault simultaneously, choose the most conservative path.
Lock-Step MCU
Dual-token feeds only after each core completes critical tasks; out-of-phase feeds to avoid common-mode misses; any-side violation triggers policy.
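A dual-token gate can be modeled as a set of completion tokens that must all be present before a feed is released. This is a simplified software sketch with hypothetical names (`DualTokenGate`, `core_a`, `core_b`); it does not model out-of-phase timing or the hardware comparison between lock-step cores.

```python
class DualTokenGate:
    """Release a watchdog feed only after every core reports completion."""

    def __init__(self, cores=("core_a", "core_b")):
        self._required = frozenset(cores)
        self._tokens = set()

    def complete(self, core):
        """Record that a core finished its critical tasks for this cycle."""
        if core not in self._required:
            raise ValueError(f"unknown core: {core}")
        self._tokens.add(core)

    def try_feed(self):
        """Return True and consume the tokens only if all cores reported."""
        if self._tokens == self._required:
            self._tokens.clear()
            return True
        return False
```

A feed attempt with only one token returns False, which is the "empty feed" case the gate is meant to block.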
PMIC Interlock
WDT FAIL → PMIC safety state (limit/hold-off). PMIC PG → enables WDT window to prevent early mis-trips. Reverse path elevates policy level when PMIC faults.
Reset Tree
Fan-out buffers and level compatibility; avoid back-powering and glitches; ensure RESET_SAFE pulse width/hold-time survives the tree.
Arbitration: when MCU and PMIC report faults together, select Hold-off > Reset > Degrade. Log reason codes and source IDs.
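The arbitration rule above reduces to picking the highest-priority action among simultaneous requests while keeping every source in the log. A minimal sketch, with hypothetical names (`arbitrate`, the reason-code strings) and a tuple layout chosen only for illustration:

```python
# Higher number = more conservative, per Hold-off > Reset > Degrade.
PRIORITY = {"degrade": 0, "reset": 1, "hold_off": 2}


def arbitrate(requests):
    """Pick the most conservative action from (source_id, action, reason_code) tuples."""
    if not requests:
        return None
    source_id, action, reason_code = max(requests, key=lambda r: PRIORITY[r[1]])
    return {
        "action": action,
        "source_id": source_id,
        "reason_code": reason_code,
        "all_sources": [r[0] for r in requests],  # kept for the audit log
    }
```

For example, a simultaneous MCU reset request and PMIC hold-off request resolves to hold-off, with both source IDs retained.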
Selection Guide
Focus on differences & availability rather than long catalogs. Prioritize independent clock, window watchdog, and auditable diagnostics. Use the axes below as hard selection criteria; freeze decisions into BOM notes.
Clock independence: XTAL / independent RC / isolated divider; include clock-stall detect.
Window granularity & interface: T_min/T_max step, via OTP / pins / I²C / PMBus / SPI / one-wire.
Diagnostics: DIAG/TLOG/I²C readouts for fault_class, latency, config_version; exportable JSON parity with UI text.
Safety grade & temp: AEC-Q100 level, ASIL/SIL collateral (Safety Manual/FMEDA), −40…125/150 °C range.
RESET/FAULT polarity & pulse: selectable polarity, t_pw(min), re-arm, debounce; verify at reset tree endpoints.
IQ & standby: sleep keep-alive strategy, first-feed compensation after wake.
Package & pins: SOT-23/DFN/QFN pinouts; EN/SET/WDI position and power-up order.
Representative parts & why (engineering-oriented)
TI: TPS3430-Q1 — independent window WDT, programmable reset delay; good for tight window + external feed path.
TPS3813-Q1 — UV/OV supervise + WDT + reset; when power and WDT supervision must be combined.
ST: STWD100 — simple external WDI, low IQ; fits cost/area-constrained basic window/timeout monitoring.
NXP: FS26 Safety SBC — integrates challenge-response watchdog + multi-rail safety outputs; suits ASIL B–D partitioning.
MC33FS6526 — fail-safe outputs + challenger watchdog; tight PMIC + WDT coupling.
Renesas: RAA271000 safety PMIC — challenge-response WDT, reset generator, safety shut-off; pairs with high-compute SoC.
onsemi: NCV8668 — LDO + window WDT + reset, low IQ; compact supply+WDT merge.
NCV97400 — multi-output PMIC with WDT/monitoring for ADAS power trees.
Microchip: MCP1317 / MCP1320 (AEC-Q) — supervisor with WDI & configurable reset polarity; pair with MCU internal WDT for dual-channel.
Melexis: MLX81124 / MLX80051 (LIN SBIC) — does not include an independent window WDT; typical strategy is Melexis SBIC + external window WDT for dual-channel safety.
- Polarity/pulse/order: RESET/FAULT polarity, t_pw(min), re-arm vary across brands; verify at reset-tree endpoints.
- Interface semantics: SBC/PMIC families (NXP FS / Renesas RAA / onsemi NCV97xxx) differ in challenge-response and safety state machines.
- Safety collateral: require Safety Manual/FMEDA and diagnostic coverage guidance.
- Pin/footprint: SOT-23/DFN/QFN pin swaps on EN/SET/WDI can cause empty-feed or false reset at power-up.
Validation & Coverage Report
Acceptance is by C_diag%. Pass criteria: Late/Missing → assert RESET_SAFE for ≥ t_pw(min) within T_max + Δdetect; Early/Too-often → set FAULT_LATCH within N violations; Clock-stall → detect within T_stall_max and enter safe state.
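These pass criteria can be checked mechanically per injection record. The sketch below assumes `detect_latency` is measured from the last valid feed and that records carry `pulse_width`, `fault_latch`, and `violations_to_latch`; those three fields, the action strings, and the function name are hypothetical additions to the log fields named in the text.

```python
def meets_pass_criteria(rec, *, t_max, delta_detect, t_pw_min, n_limit, t_stall_max):
    """Return True if one injection record satisfies its class's pass criterion."""
    kind = rec["injection_type"]
    if kind in ("late", "missing"):
        # Reset must fire in time and hold for at least t_pw(min).
        return (rec["action_result"] == "reset"
                and rec["pulse_width"] >= t_pw_min
                and rec["detect_latency"] <= t_max + delta_detect)
    if kind in ("early", "too_often"):
        # Latch must be set within N violations.
        return bool(rec["fault_latch"]) and rec["violations_to_latch"] <= n_limit
    if kind == "clock_stall":
        return (rec["action_result"] == "safe_state"
                and rec["detect_latency"] <= t_stall_max)
    raise ValueError(f"unknown injection type: {kind}")
```

All times must share one unit; the limits themselves come from the frozen configuration, not from this code.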
Laboratory
- Window boundary sweep (T_min/T_max) and jitter/latency scanning (J/L/Δguard).
- Feed storm / no-feed; period drift; clock skew/stall injection.
- Temp/voltage drift; cold/hot start; power-ramp + DVFS + first-feed compensation.
In-vehicle / System
- Supply disturbance (crank/load-dump emulation).
- Task crash injections (early/late/miss/too-often).
- Wake scenarios (RTC / PG-OR / WDT-IRQ) and mis-trip rate statistics.
Device-level cues (align with the parts above)
TI: TPS3430-Q1 / TPS3813-Q1 — verify window step/tolerance and reset delay; record program version.
ST: STWD100 — confirm timeout & reset pulse variation across temp; check AEC-Q100 grade.
NXP: FS26 / MC33FS6526 — run challenge-response + fail-safe linkage; log escalation paths.
Renesas: RAA271000 — check reset generator transparency through reset tree (polarity/hold-time).
onsemi: NCV8668 / NCV97400 — LDO+WDT bench (low IQ standby) and multi-rail monitor regression for ADAS trees.
Microchip: MCP1317 / MCP1320 — dual-channel with MCU internal WDT; verify WDI polarity/pulse.
Melexis: MLX81124 / MLX80051 — pair with external window WDT; align LIN/SBIC diagnostics with WDT logs.
Report fields (minimum): timestamp, injection_type, expected_action, detect_latency, action_result, evidence_hash/log_ptr, tester_id, config_version.
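A report record with these minimum fields might be assembled as follows. This is a sketch under assumptions: SHA-256 is used for the evidence hash because the text does not name an algorithm, only the `evidence_hash` variant (not `log_ptr`) is shown, and `make_record` and `REPORT_FIELDS` are hypothetical names.

```python
import hashlib
import json

REPORT_FIELDS = (
    "timestamp", "injection_type", "expected_action", "detect_latency",
    "action_result", "evidence_hash", "tester_id", "config_version",
)


def make_record(raw_log: bytes, **fields):
    """Build one JSON report line; the hash ties the record to its raw log."""
    record = dict(fields)
    record["evidence_hash"] = hashlib.sha256(raw_log).hexdigest()  # assumed algorithm
    missing = [name for name in REPORT_FIELDS if name not in record]
    if missing:
        raise ValueError(f"missing report fields: {missing}")
    return json.dumps(record, sort_keys=True)
```

Sorting keys gives a stable serialization, so two exports of the same record compare byte for byte.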
Small-Batch Procurement Hooks
Ship the first prototype without rework. Freeze watchdog math, reset timing, diagnostics export, and safety collateral before PO. Use the copy-paste BOM note and submit only the minimum required fields so sourcing can deliver second-source options within 48 hours.
BOM Notes (copy & paste)
Safety WDT: independent clock domain; window feed = T_task ± J (incl. ISR latency L); T_min/T_max frozen; RESET_SAFE ≥ X ms; DIAG log export required; AEC-Q100 Grade X; pin/polarity must match reset tree; provide Safety Manual & FIT.
Replace variables (X, T_min/T_max, J, L, Grade) with your values before you paste into PLM.
Task timing: T_task, jitter ±J, ISR/scheduler latency L
Safety target: ASIL/SIL level; operating temp range
Reset semantics: polarity, t_pw(min), re-arm requirement
Diagnostics preference: I²C / PMBus / SPI / DIAG
Second source: mandatory? acceptable package/pin swaps?
Second-source policy — Provide a pin/polarity/timing delta sheet; if polarity or power-up order differs, run reset-tree passthrough tests and log results. Include Safety Manual/FMEDA excerpt and one short-lead-time supply proof per candidate.
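The delta sheet in this policy can be generated by comparing two parameter sets and flagging when a passthrough test is required. A minimal sketch: the key names and the function `delta_sheet` are hypothetical, and the values in any real comparison must come from the respective datasheets.

```python
COMPARE_KEYS = ("reset_polarity", "t_pw_min_ms", "wdi_edge", "power_up_order", "package")


def delta_sheet(primary, candidate, keys=COMPARE_KEYS):
    """Return (deltas, needs_passthrough) for a second-source candidate.

    deltas maps each differing key to (primary value, candidate value).
    needs_passthrough is True when polarity or power-up order differs,
    following the policy stated above.
    """
    deltas = {
        key: (primary.get(key), candidate.get(key))
        for key in keys
        if primary.get(key) != candidate.get(key)
    }
    needs_passthrough = any(k in deltas for k in ("reset_polarity", "power_up_order"))
    return deltas, needs_passthrough
```

A shorter t_pw(min) shows up as a delta but does not by itself set the passthrough flag in this sketch; whether it should is a project decision.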
FAQs
Why must the watchdog clock be independent for ASIL/SIL?
Independence avoids common-cause failures tied to the main PLL or shared oscillators. If the system clock stalls, the watchdog must still advance, detect lateness, and trigger a safe action. Standards expect clear independence plus an audit trail: timestamps, configuration version, and results linked to evidence.
How to size T_min/T_max with jitter and ISR latency?
Start from the task period T_task and characterize jitter ±J and worst-case latency L. Pick a guard Δguard, then compute T_min = T_task − J − L − Δguard and T_max = T_task + J + L + Δguard. Validate with boundary sweeps and freeze values in OTP or registers with change control.
Early vs late feeds—reset or degrade?
Early or too-often feeds usually indicate logic disorder or feed spoofing, so prefer degrade or locked modes that restrict non-critical functions. Late or missing feeds often reflect task stalls or timebase faults, so prioritize a deterministic reset with guaranteed pulse width and audited action results.
How to run online self-tests without service hits?
Use low-duty rolling injections that avoid service windows, shifting test timing with workload. Escalate only on confirmed violations, falling back to degrade rather than immediate resets when continuity matters. Log detect latency, classification, and evidence hashes so audits can verify both coverage and impact boundaries.
What belongs to an audit-ready coverage report?
Include timestamp, injection type, expected action, detect latency, action result, and an evidence hash or log pointer. Add tester ID and configuration version to anchor repeatability. Summaries should present per-class detection ratios and boundary cases, with raw logs retained for traceable re-analysis when required.
How to avoid false trips during power ramp/DVFS/sleep?
Mask the window counter during power ramps, suppress window evaluation through DVFS transitions, and apply a first-feed delay after sleep. Add threshold hysteresis and a restore delay so transient frequency or voltage shifts do not contaminate timing. Every mask event must be logged to preserve a complete and auditable record.
Lock-step MCUs: single or dual tokens?
Dual tokens are harder to spoof because each core must complete its critical slice before a feed is allowed. Out-of-phase feeding reduces common-mode misses. Single-token schemes require extra anti-spoof checks and arbitration rules to ensure a single compromised path cannot maintain a healthy watchdog illusion.
Can PMIC-integrated WDT meet independence?
It can, if its clock and safety paths are demonstrably isolated from the main timebase and regulators they supervise. Review the safety manual for independence claims and coupling analysis. Where doubt remains, add an external window watchdog to provide a second, separately powered timing domain.
Safe policy when the watchdog clock stalls?
Detect stall quickly using a reference or window comparator and promote to the most conservative policy. Prefer hold-off when the timebase trust is lost, with optional degrade for limited diagnosis. Redundant sources and stall thresholds should be validated across temperature and voltage corners with logs retained.
Map faults to PG/DIAG for fast RCA?
Use a unified codebook and define edge-versus-level semantics. Time-align PG, DIAG, and reset causes with a common timestamp base, and filter glitches at the reset-tree fan-out. Provide short, machine-readable records that link to raw logs so root-cause analysis is fast, repeatable, and reviewable.
Second-source swaps—pin/polarity/timing traps?
Reset polarity reversals, shorter t_pw(min), or different WDI sampling edges often break compatibility. Power-up order mismatches can also create false resets or empty feeds. Always run a reset-tree passthrough test at endpoints and update the BOM notes and validation plan before approving any substitution.
Minimal BOM notes for first-time pass?
State window parameters with jitter and latency, reset pulse width and polarity, diagnostics export capability, temperature grade, safety level, and second-source policy. Keep wording identical between the visible note and any JSON exports so audits, sourcing portals, and PLM workflows remain perfectly synchronized.