What It Solves
In multi-core and multi-rail systems, a single watchdog cannot localize faults. Partitioned WDTs isolate failures per domain, escalate deterministically from bark to partition bite to conditional system reset, and align events with PG/FAULT semantics for auditability.
Problem Matrix
- Concurrent domains drag each other on failure
- Common-cause risk from shared clocks
- Feed storms and bus contention
- Leftover debug modes create backdoors
- No first-fault traceability
Capability Mapping
- Partitioned WDT with per-domain windows
- Bark → bite ladder with programmable delays
- Reset matrix with K-of-N system reset
- Independent WDT clock and fail detection
- Telemetry, first-fault, and PG/FAULT tags
Commitments
- Partition bite precedes system reset
- No cross-feeding between domains
- Production OTP lock; unlock-write-confirm sequence
Architecture
The architecture connects an independent WDT clock to N partition watchdogs with window comparators and bark/bite ladders, feeding a reset matrix that asserts per-partition resets and a conditional system reset. Events align with PG/FAULT classes and are logged in telemetry registers. Cross-feeding between domains is explicitly blocked.
| Block | Registers / Fields | Notes |
|---|---|---|
| WDTx_CTRL | EN, MODE(window), RATIO, TIMEOUT, UNLOCK_SEQ | Windowed feed; unlock-write-confirm; OTP lock in production |
| WDTx_STAT | BARK_CNT, BITE_CNT, LAST_CAUSE, TS_REF | First-fault record with timestamp (RTC or relative) |
| RST_MATRIX | PART_MASK[N], SYS_CONDITION(K-of-N), DELAY | Partition bite precedes system reset; conditional promotion |
| PG_FAULT_MAP | SRC, CLASS, TAG | Power and watchdog events share unified semantics |
Timing & Windowing
Define a reproducible windowed-feed policy: select window ratio r, set granularity g, and stage bark/bite delays. Manufacturing may allow soft relaxation under a guarded mode, while production values are OTP-locked. Provide jitter and clock-drift budgets so early/late feeds are unambiguous and auditable.
Core Rules
- Start with r = [0.25, 0.75]
- Granularity g ≤ (r_hi − r_lo)·T/4
- Early/late ⇒ bark (delayable) ⇒ bite (partition)
- Manufacturing relax only; production OTP lock
- Use tolerance ε to desensitize edges
Jitter & Drift Budget
Effective window bounds:
r_lo_eff = r_lo + δ_sched + |δ_clk|
r_hi_eff = r_hi − δ_sched − |δ_clk|
Feasible if r_hi_eff − r_lo_eff ≥ 4g/T.
Unlock & Lock Policy
- Unlock → write(window, timeout) → confirm
- Enable bark then bite ladder
- Switch to Production → OTP lock
Worked Example
Given T=200 ms, r=[0.25,0.75], δ_clk=±0.15, Δt_jitter=10 ms (δ_sched=0.05):
r_lo_eff = 0.25 + 0.05 + 0.15 = 0.45
r_hi_eff = 0.75 − 0.05 − 0.15 = 0.55
Effective window = 0.10·T = 20 ms
Choose g = 5 ms ⇒ 4g = 20 ms ≤ effective window ⇒ feasible, but with zero margin
Recommendation: widen the span, e.g., r = [0.20, 0.85], for margin (merely shifting to [0.35, 0.85] keeps the same 0.5 width and thus the same budget)
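The budget above can be checked mechanically. A minimal Python sketch (the function names are illustrative, not a device API); a small eps desensitizes the comparison against floating-point edges, mirroring the tolerance rule ε above:

```python
def effective_window(r_lo, r_hi, delta_sched, delta_clk):
    """Shrink the nominal window [r_lo, r_hi] by scheduling jitter and clock drift."""
    return r_lo + delta_sched + abs(delta_clk), r_hi - delta_sched - abs(delta_clk)

def feasible(r_lo, r_hi, delta_sched, delta_clk, g, T, eps=1e-9):
    """Feasible if the effective window still spans >= 4 feed chances of size g."""
    lo, hi = effective_window(r_lo, r_hi, delta_sched, delta_clk)
    return (hi - lo) + eps >= 4 * g / T

# Worked example: T = 200 ms, r = [0.25, 0.75], delta_sched = 0.05, delta_clk = 0.15
lo, hi = effective_window(0.25, 0.75, 0.05, 0.15)          # ~0.45, ~0.55 -> 20 ms
tight = feasible(0.25, 0.75, 0.05, 0.15, g=5.0, T=200.0)   # holds with zero margin
```

Running the same check with g = 6 ms fails, which is the kind of result that should feed the audit record rather than a silent tweak.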
Isolation & Reset Tree
Enforce partition-first resets and a conditional system reset via a reset matrix (K-of-N or weighted). Model cross-domain dependencies to avoid cascade failures and define a safe subset that brings the system back deterministically. Include protection for fanout, level compatibility, and anti-backfeed.
Reset Priority
- Partition bite precedes system reset
- Record first-fault before promotion
- System reset requires an explicit condition
Promotion Policy
- K-of-N: count(bite) ≥ K
- Weighted: Σ(wᵢ·biteᵢ) ≥ Θ
- Class gate: Safety/Security/Power-critical only
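The three gates can be expressed as one decision function. A hedged Python sketch (the dict-based interface and parameter names are assumptions for illustration, not a register map):

```python
def promote(bites, K=2, weights=None, theta=None, classes=None,
            critical=("safety", "security", "power-critical")):
    """Decide system-reset promotion from per-partition bite flags.

    bites:   dict partition -> bool      (bite asserted this cycle)
    weights: dict partition -> weight    (enables the weighted rule with theta)
    classes: dict partition -> class str (enables the class gate)
    """
    if classes is not None:
        # Class gate: only Safety/Security/Power-critical bites may promote
        bites = {p: b for p, b in bites.items() if classes.get(p) in critical}
    if weights is not None and theta is not None:
        return sum(weights[p] for p, b in bites.items() if b) >= theta
    return sum(bool(b) for b in bites.values()) >= K  # plain K-of-N
```

For example, `promote({"A": True, "B": True, "C": False}, K=2)` promotes, while a single bite does not; adding a class map silently drops best-effort domains from the count.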
Dependency Matrix
Define D[i,j]=1 if partition i depends on j. After a system reset, bring domains up by a topological order; compute a safe subset to maintain core functions during partial outages.
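The bring-up order falls out of D via a topological sort; a sketch using Python's standard graphlib (the nested-dict encoding of D is an assumption):

```python
from graphlib import TopologicalSorter  # stdlib, Python >= 3.9

def bringup_order(D):
    """D[i][j] = 1 means partition i depends on j, so j must come up first.

    Returns a full bring-up order; raises graphlib.CycleError on circular
    dependencies, which indicates the dependency matrix needs rework.
    """
    ts = TopologicalSorter()
    for i, row in D.items():
        ts.add(i, *(j for j, dep in row.items() if dep))
    return list(ts.static_order())

# A and B both depend on C, so C must come up first after a system reset
order = bringup_order({"A": {"C": 1}, "B": {"C": 1}, "C": {}})
```

A safe subset is then any prefix of this order whose members jointly cover the core functions.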
Fanout & Protection
- Level compatibility for reset fanout
- Use open-drain + pull-up across voltage islands
- Add anti-backfeed/ideal-diode where needed
Review Checklist
- Partition-first policy and explicit promotion condition
- Dependency matrix defines a safe subset
- Fanout level compatibility and anti-backfeed covered
- System reset logs first-fault, bite vector, and timestamp
Clocking & Failover
The watchdog clock must be independent. Detect stuck/stop/too-slow behaviors, then degrade in stages: bark (limit power / downclock / isolate) → conditional bite (partition) → optional system promotion. Lock risky clock muxing in production (OTP) and audit for common-cause risks.
Clock Sources
- Internal RC, external crystal, or discrete ref
- Decouple from CPU PLL/DVFS domains
- Expose health bits for audit
Failure Detection
- stuck: no edge over probe window
- stop: zero count vs reference
- too-slow / too-fast: thresholded drift
Degradation Ladder
- bark: limit power, downclock, isolate
- partition bite with logging
- Promote to system reset via K-of-N / class gate
Production Locks
- OTP lock: CLK_SRC / THRESH / PROBE_PERIOD
- Disable risky clock mux in production
- Read-only counters for field audit
Formulas & Recommendations
Thresholds:
too-slow: f_meas < (1 − δ_clk_max)·f_nom
too-fast: f_meas > (1 + δ_clk_max)·f_nom
Detection bound:
T_detect ≤ M·T_probe + t_readout (M≥3 for safety)
Window pairing with drift:
|δ_clk| ≤ 5% → r=[0.30,0.80]
5–15% → r=[0.35,0.85] + more samples
>15% → external crystal or discrete ref
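The thresholds above translate into a small classifier; a sketch under the stated definitions (the probe-counter interface is an assumption, not a specific device's register set):

```python
def classify_clock(f_meas, f_nom, delta_clk_max, edges_seen, count, ref_count):
    """Map one probe-window measurement of the WDT clock to a failure class."""
    if edges_seen == 0:
        return "stuck"          # no edge over the probe window
    if count == 0 and ref_count > 0:
        return "stop"           # zero count against the reference
    if f_meas < (1 - delta_clk_max) * f_nom:
        return "too-slow"
    if f_meas > (1 + delta_clk_max) * f_nom:
        return "too-fast"
    return "ok"

def detection_bound(T_probe, t_readout, M=3):
    """Worst-case detection latency: T_detect <= M*T_probe + t_readout (M >= 3)."""
    return M * T_probe + t_readout
```

A `too-slow` or `too-fast` verdict should enter the degradation ladder at bark, not bite.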
Telemetry & Counters
Define auditable telemetry: bark/bite counters, last-cause, and timestamps; deterministic first-fault recording; unified PG/FAULT class/tag. Provide a JSON event model for edge/cloud. Use handshaked clears and persist through power loss.
Counters & Clears
- BARK_CNT / BITE_CNT / CLK_FAIL_CNT
- Unlock → readback → clear → confirm
- Production restricts clear permissions
Timestamps
- RTC (UTC) > free-running ticks > boot count
- Persist last N events with CRC
- Upload minimal packet on power-fail
First-Fault
- Record first crossing with {part, evt, cause, ts}
- Subsequent events increment counters only
- Never overwrite first-fault in session
PG/FAULT Tags
- class: power | wdt | thermal | comm
- tag: vin_drop | rail_glitch | none
- Align with counters for auditability
JSON Event Model
{
"ts": "<utc|relative>",
"part": "A|B|C",
"evt": "bark|bite|clk_fail",
"cause": "early|late|miss|stuck_clk",
"pg_tag": "vin_drop|rail_glitch|none",
"counter": { "bark": 123, "bite": 5 },
"first_fault": true
}
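The first-fault and counter rules can be sketched as a small recorder that emits events in the schema above (the class name and in-memory store are illustrative; a real device would persist through power loss with CRC):

```python
import json

class TelemetryLog:
    """Emits events per the JSON model; the first fault is latched per session."""

    def __init__(self):
        self.first_fault = None
        self.counters = {"bark": 0, "bite": 0}

    def record(self, part, evt, cause, ts, pg_tag="none"):
        if evt in self.counters:
            self.counters[evt] += 1        # later events only bump counters
        event = {"ts": ts, "part": part, "evt": evt, "cause": cause,
                 "pg_tag": pg_tag, "counter": dict(self.counters),
                 "first_fault": self.first_fault is None}
        if self.first_fault is None:
            self.first_fault = event       # never overwritten this session
        return json.dumps(event)
```

Only the first event carries `"first_fault": true`; subsequent events raise counters, matching the rules above.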
Policies & Interlocks
Define system policies for interlocks, priorities, and anti-cross-feeding. Disallow one domain from feeding another via hardware ACL + firmware capabilities. Govern dogstorm (feed storms) with throttling/backpressure and light window randomization. Provide auditable Manufacturing → Production mode switch with OTP lock.
Hardware Interlocks
- Bus firewall (ACL by master_id/part_id)
- Register islanding per power/voltage domain
- Write-only WDTx_UNLOCK valid only in the local domain
Firmware Interlocks
- Capability token: {part_id, nonce, ttl}
- One-shot, non-transferable, replay-protected
- Cross-partition proxy disabled in Production
Priority Model
- bark (warn) ≺ partition bite (reset)
- System promotion via K-of-N / weight / class gate
- First-fault must be recorded before promotion
Anti-Dogstorm
- Throttler (token bucket) per partition API
- Backpressure: return THROTTLED, delay next feed
- Window randomization within ε to de-sync edges
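The throttler and randomization combine naturally in one token bucket; a sketch where the rate parameters and return convention are illustrative assumptions:

```python
import random

class FeedThrottle:
    """Per-partition token bucket; THROTTLED signals backpressure to the caller."""

    def __init__(self, rate_per_s, burst, epsilon):
        self.rate, self.burst = rate_per_s, burst
        self.tokens = float(burst)
        self.epsilon = epsilon             # window randomization bound
        self.last = 0.0

    def try_feed(self, now):
        # Refill proportionally to elapsed time, capped at the burst size
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens < 1.0:
            return ("THROTTLED", 0.0)      # caller must delay the next feed
        self.tokens -= 1.0
        # Randomize the next feed offset within +/- epsilon to de-sync edges
        return ("OK", random.uniform(-self.epsilon, self.epsilon))
```

Draining the bucket faster than the refill rate returns THROTTLED until time passes, which is exactly the storm behavior the policy wants to dampen.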
Mode Switch & Audit
- One-time OTP: lock window/timeout/unlock_seq
- Persist POLICY_VER, POLICY_HASH, SIGNER_ID
- Audit log: who/which/what/when/result
Review Checklist
- Cross-feeding technically impossible (HW+FW evidence)
- Promotion gated (K-of-N / weighted / class)
- Throttling + backpressure + randomization in place
- Mode switch auditable (version, hash, signer, ts)
Design Rules & Sizing
Engineering rules that drop into BOM/specs: window ratio and granularity, timeout layering, counter overflow policy, write-protect sequence, and temperature/voltage drift countermeasures. Every rule must be testable and tied to action & record.
Window & Granularity
- Start: r = [0.25, 0.75]
- If |δ_clk| > 15% → r ≥ [0.35, 0.85] + random ε
- g ≤ (r_hi − r_lo)·T/4 (≥4 hit chances)
Timeout & Cadence
- T ≥ 5×max(task_jitter) (safety: 10×)
- Layered: T_part and T_sys, with T_sys ≥ T_part
- Manufacturing-only edits; production locked
Counters & Retention
- 16/32-bit with saturation + event on overflow
- Keep last N=8 bites: {ts, cause, part}
- Power-loss persistence with CRC
Write-Protect Sequence
- unlock
- write(window, timeout)
- readback-confirm
- lock (Production-only feed/clear)
Temp/Voltage Drift
- Grade table: 5% / 15% / 30% → r widen + probe
- On vin_drop: enlarge ε, delay bark
- Link PG/FAULT tags into logs
Expressions & Examples
Feasible window: (r_hi_eff − r_lo_eff) ≥ 4g/T
r_lo_eff = r_lo + δ_sched + |δ_clk|
r_hi_eff = r_hi − δ_sched − |δ_clk|
Example:
δ_clk = ±30% → use r=[0.40,0.90], g ≤ 0.125·T, RANDOM_EPSILON on
task_jitter = 8 ms → choose T = 80–120 ms; T_sys = 150–200 ms
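The sizing rules above can be packaged into a small helper; a sketch in which the 1.5× system margin is an illustrative assumption consistent with the example ranges:

```python
def size_timeouts(task_jitter_ms, safety=False, sys_margin=1.5):
    """Layered timeout sizing: T_part >= 5x max task jitter (10x for safety),
    with the system timeout sized above the partition timeout."""
    k = 10 if safety else 5
    T_part = k * task_jitter_ms
    T_sys = sys_margin * T_part
    return T_part, T_sys

# task_jitter = 8 ms, safety profile -> T_part = 80 ms, T_sys = 120 ms minimum;
# the text's 80-120 ms / 150-200 ms picks add further slack on top of this floor
```

The helper returns floors, not final values; production numbers still need validation against worst-case workloads before the OTP lock.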
Procurement Notes
- OTP-lock window/timeout; deny cross-partition writes via bus firewall
- Keep last 8 bites in NVRAM with CRC; overflow raises an event
- Enable RANDOM_EPSILON for clock error ≥15% or rail glitches
Validation Matrix
A reusable admission test: FMEA → fault injection → expected action → auditable records → scoring. Automate with Robot/pytest; require repeatability and time-accuracy bands; bind results to reject rules.
Fault Injection Vectors
- Feed violations: early / late / miss
- Feed line: stuck-H / stuck-L
- Clock: stop / stuck / slow
- bark ignored (masked IRQ)
- cross-feed attempt (B→A)
- debug override in PROD
Expected Actions
- bark IRQ visible within Δt; BARK_CNT++; LAST_CAUSE set
- partition bite when gate met; event persisted
- system reset only via K-of-N / weight / class-gate
- JSON event matches Ch6 schema; first-fault preserved
Scoring & Reject
- Repeatability: each vector ≥ M passes (M≥5)
- Timing error ≤ spec ±10%
- 0 cross-feed success; 0 PROD unlock
- Reject if: bark lost, counters not atomic, promotion ungated
Vector → Expectation → Record (Three-Table Link)
T1 (vectors): id, part, kind{early|late|miss|stuckH|stuckL|clk_stop|clk_stuck|clk_slow|bark_ignored|crossfeed|priv_escal}, params{ε,r,Δt,ratio}, repeats
T2 (expected): id → bark{yes,Δt≤X}, bite{gate,K}, sysreset{gate}, telemetry{fields}, reject_if{...}
T3 (records): id → bark_cntΔ, bite_cntΔ, last_cause, ts, first_fault, pg_tag, verdict{PASS|FAIL}, notes
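The three-table link can be reconciled with a short join; a pytest-style sketch (field names follow T1–T3 above; the reject rules shown are a representative subset, not the full scoring):

```python
def reconcile(vectors, expected, records, M=5):
    """Join T1/T2/T3 by id and apply the scoring rules: M clean repeats per
    vector, every record PASS, and first-fault preserved where expected."""
    verdicts = {}
    for vid in vectors:
        exp = expected[vid]
        recs = records.get(vid, [])
        ok = (len(recs) >= M
              and all(r["verdict"] == "PASS" for r in recs)
              and (not exp.get("first_fault")
                   or all(r["first_fault"] for r in recs)))
        verdicts[vid] = "PASS" if ok else "REJECT"
    return verdicts
```

Exporting the verdict dict as CSV/JSON gives the artifact that attaches to procurement records.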
Procurement Hooks
BOM Remarks (Copy-ready)
- Partitioned WDT with independent clock per domain; no cross-feeding (bus firewall ACL by master_id/part_id).
- Windowed feed with r_lo ≥ 0.25 (raise to ≥ 0.35 if |δ_clk| ≥ 15%); Production values OTP-locked; mandatory unlock → write → confirm.
- Bark logs cause+timestamp; partition bite precedes system reset; promotion via K-of-N / weight / class-gate.
- Alternatives limited to TI / ST / NXP / Renesas / onsemi / Microchip / Melexis meeting: channels≥N, independent WDT clock, per-partition reset, auditable counters & first-fault.
Acceptance Criteria
- Pass Chapter 9 matrix: M repeats, ±10% timing accuracy
- Provide register map, clock specs (temp/aging), JSON event mapping
- OTP locks enforced in Production
Disqualifiers
- Any cross-feed success or missing ACL proof
- Production allows window/timeout/unlock_seq edits
- No first_fault or silent counter rollover
- No K-of-N / weighted / class-gated promotion path
IC Buckets & Cross-Brand Mapping
Use function buckets to constrain equivalence. Substitutes must match the acceptance fields below. If any critical field is unmet, the part is non-equivalent. Each card lists concrete part numbers and the reason they fit (or do not fit) the bucket.
Acceptance Fields (must match)
Channels · Window · Timeout Range/Step · Bark/Bite Ladders · Per-Partition Reset · Clocking (independent WDT clock) · Interface (WDO/PMBus/I²C, etc.) · Telemetry (counters/first-fault) · Iq · AEC-Q100 · Package family.
- Rule 1: Missing any critical field ⇒ Non-Equivalent.
- Rule 2: If the watchdog shares the CPU clock, declare clocking risk and require mitigation (independent WDT clock or proven failover).
- Rule 3: Windowed policy must be configurable and OTP-locked in production if specified by the BOM notes.
Bucket A — External Multi-Channel Supervisors + Watchdog
Multi-rail supervisors with an integrated or companion watchdog, supporting windowing, per-partition reset lines, and (preferably) an independent WDT clock.
TI
TPS386000 / TPS386040 — 4-rail supervision + watchdog, programmable delays; suitable for mapping to partition resets.
- Fits: Channels, windowed WDT, per-partition reset, tight thresholds.
- Risk: Check watchdog clock independence vs system clocking.
ST
STM6719 (multi-rail supervisor) + STWD100 (window WDT) — combined to meet A-bucket requirements.
- Fits (as a combo): Channels via STM6719 + windowed WDT via STWD100.
- Non-equivalent alone: STM6719 without WDT companion.
NXP
FS6500 (SBC) — multi-rail power management with integrated WDT (automotive).
- Fits: Automotive grade, integrated WDT, external reset lines.
- Risk: Confirm per-partition reset mapping and clock independence.
Renesas
ISL88001/2/3 (SVS) + a WDT µP supervisor (series dependent) — paired to satisfy window and reset ladder.
- Fits (as a combo): Threshold precision, reset routing via hub MCU/CPLD.
- Risk: Check lifecycle status and WDT window programmability.
onsemi
NCP300/301 (SVS) + external windowed WDT — required to meet bucket criteria.
- Non-equivalent alone: NCP30x lacks WDT/window.
- Fits (as a combo): When paired with a window WDT + reset mapping.
Microchip
MCP1316/1318 (Supervisor+WDT) or MIC826 families; combine with multi-rail sensing to form A-bucket.
- Fits: Windowed WDT, WDO/RESET outputs for partition mapping.
- Risk: Verify timeout step granularity and OTP lock options.
Melexis
No native multi-rail supervisor+WDT. Non-equivalent to A-bucket unless combined with external multi-rail SVS + window WDT.
- Migration: Prefer B-bucket (SBC-style) plus external SVS if multi-rail is mandatory.
Migration risks (A-bucket): missing independent WDT clock; single reset line only; no window mode; telemetry absent. These break equivalence and must be called out in the BOM.
Bucket B — MCU/PMIC/SBC with Exposed Partitioned WDT
Integrated window watchdogs in PMIC/SBC/MCU subsystems that expose bark/bite, counters, and reset lines for partition mapping.
TI
TPS3851 — Supervisor with programmable WDT and WDO; can provide bark/bite ladder externally.
- Fits: Windowed WDT, external WDO/RESET.
- Risk: Ensure per-partition reset fanout via reset matrix.
ST
STWD100 (window WDT) with board-level reset routing; pair with STM6719 for PG.
- Fits (as part of B-style platform): window watchdog outward facing.
- Risk: Add counters/telemetry via MCU if device lacks them.
NXP
FS6500 — SBC with integrated window WDT and automotive features; exposes reset outputs.
- Fits: Bark/bite policy + car-grade.
- Risk: Validate clock source independence from CPU.
Renesas
µP Supervisors with WDT (series-dependent) + PMIC; export WDO/RESET to partitions.
- Fits: Externalized bark/bite; counters via MCU/SoC.
- Risk: Ensure window programmability and OTP locks.
onsemi
PMIC + external window WDT (or MCU WDT) exposing WDO; B-bucket when signals are routed per partition.
- Non-equivalent if only NCP30x SVS is used.
Microchip
MIC826 / MCP1316/1318 families — WDT + reset outputs; telemetry via host MCU.
- Fits: Window WDT, WDO/RESET signals.
Melexis
MLX80051 (LIN SBC) — integrated window watchdog, RESET out; good for body LIN nodes as a partition WDT endpoint.
- Fits: Exposed WDT/RESET.
- Risk: Not a multi-rail supervisor; add external SVS if needed.
Migration risks (B-bucket): watchdog clock tied to CPU clock; no counters/first-fault; single reset output only; no OTP lock. Add external SVS/WDT or reset-matrix logic as mitigation.
Bucket C — Supervisor Hub (PG Aggregation + WDT + Reset Matrix)
A hub that aggregates PG/FAULT, implements bark→bite promotion (K-of-N/weighted/class), and fans out per-partition resets. Realized by a single highly-integrated device or by SVS + window WDT + small MCU/CPLD.
TI
TPS386000 / TPS3851 as SVS+WDT sources; add small MCU/CPLD to implement reset-matrix and telemetry store.
- Fits: PG aggregation, bark/bite ladder, K-of-N via firmware.
ST
STM6719 + STWD100 + small MCU — PG collection + window WDT + promotion logic.
- Fits: Hub architecture when MCU logs counters/first-fault.
NXP
FS6500 as the power/WDT anchor; add CPLD/MCU for K-of-N gating and event schema.
Renesas
ISL88001/2/3 + WDT supervisor + MCU/CPLD — Hub with programmable promotion and counters.
onsemi
NCP30x family as PG inputs; add window WDT + MCU for matrix and logs.
Microchip
MCP1316/1318 or MIC826 as WDT sources; SAM/PIC MCU holds counters and promotion rules.
Melexis
MLX80051 (LIN SBC) provides WDT/RESET; add external multi-rail SVS to qualify as a full hub.
Field Alignment Matrix (Bucketed Equivalence)
✓ = meets natively (or with documented combo) · ⚠ = meets with mitigation (e.g., add SVS/WDT, confirm clock) · ✗ = non-equivalent.
| Field | TI | ST | NXP | Renesas | onsemi | Microchip | Melexis |
|---|---|---|---|---|---|---|---|
| Channels (multi-rail) | ✓ (TPS386000) | ⚠ (STM6719 + STWD100) | ✓ (FS6500) | ⚠ (ISL88xx + WDT) | ⚠ (NCP30x + WDT) | ⚠ (MCP131x + PG front-end) | ✗ (use external SVS) |
| Windowed WDT | ✓ (TPS3851, etc.) | ✓ (STWD100) | ✓ (FS6500) | ⚠ (series-dependent) | ⚠ (external WDT required) | ✓ (MIC826/MCP1316) | ✓ (MLX80051) |
| Per-Partition Reset | ✓/⚠ (via matrix) | ⚠ (board-level matrix) | ⚠ (validate mapping) | ⚠ (via MCU/CPLD) | ⚠ (via MCU/CPLD) | ⚠ (via MCU/CPLD) | ⚠ (RESET out; add matrix) |
| Independent WDT Clock | ⚠ (verify source) | ⚠ (add RC/XTAL if needed) | ⚠ (SBC-dependent) | ⚠ (device-dependent) | ✗/⚠ (external WDT) | ⚠ (confirm oscillator) | ⚠ (SBC-style; confirm) |
| Telemetry (counters/first-fault) | ⚠ (via MCU log) | ⚠ (add MCU log) | ✓/⚠ (SBC events) | ⚠ (host-logged) | ⚠ (host-logged) | ⚠ (host-logged) | ⚠ (host-logged) |
| AEC-Q100 (if required) | ✓/⚠ (variant-dependent) | ✓/⚠ | ✓ (FS series) | ✓/⚠ | ✓/⚠ | ✓/⚠ | ✓/⚠ |
Decision: If any cell is ✗ for the target requirement, the candidate is non-equivalent and must not be approved.
Copy-Ready BOM Clauses
- Use bucketed equivalence only. Alternatives must match Channels/Window/Timeout step/Bark→Bite ladder/Per-Partition Reset/Independent WDT clock/Telemetry/AEC-Q100/Package. Any missing field ⇒ Non-Equivalent.
- Production policy: window/timeout are OTP-locked; changes allowed in Manufacturing only via unlock→write→confirm.
- System promotion requires K-of-N/weighted/class gate; partition bite precedes system reset; first-fault must be recorded.
- Cross-brand scope limited to TI / ST / NXP / Renesas / onsemi / Microchip / Melexis parts that meet the acceptance fields above.
FAQ
Answers are 40–70 words, procurement-friendly, and identical to the JSON-LD block below.
Why choose partitioned WDT over a single long timeout on multicore SoCs?
Partitioned watchdogs confine faults to their domains, reducing blast radius and preventing unnecessary system resets. A single long timeout increases blind time and masks first-fault evidence. Use per-partition reset, a bark→bite ladder, and auditable counters to speed root cause without halting healthy domains. Add “per-partition reset” and “first-fault logging” to the BOM.
How do I prevent a “feed storm” with windowed watchdogs?
Introduce throttling and backpressure in the scheduler and bus, capping feed attempts per interval. Apply slight window randomization to desynchronize tasks. Prioritize critical channels and monitor a bark counter threshold. On storm signals, temporarily widen the window and log the event. Require “throttling and bark threshold alarm” as enforceable acceptance items in the BOM.
How should I set the conjunction threshold between partition bites and system reset?
Use a K-of-N rule with domain weighting. Safety-critical partitions carry higher weight than best-effort domains. Choose K so mean time to recovery meets targets while false system resets stay below your acceptable rate. Document the rule, test K against worst-case injections, and store decisions with timestamps for auditability and field replication.
Must the WDT clock be independent? What about RC drift?
Yes—an independent watchdog clock is mandatory to avoid common-cause failure. If RC drift is significant, increase the window ratio, budget timing error explicitly, and enable clock-health detection. On drift or failure, degrade first: limit power, reduce frequency, or isolate the domain before biting. Lock production windows and timeouts via OTP to prevent silent relaxations.
How do we switch from manufacturing debug to production without backdoors?
Use a dedicated mode bit: soft-relaxed windows during manufacturing, then permanently lock via OTP. Enforce an unlock→write→confirm sequence, version the configuration, and store a signed change log. Acceptance testing must verify that production images reject debug unlocks and that window/timeout cannot be altered post-seal without explicit re-provisioning evidence.
How can I timestamp events without an RTC?
Use a free-running counter or boot-count index for relative timing, then align with cloud time when available. Persist the last N bite events with counter values and causes so ordering survives power loss. Include a boolean “first-fault” marker. When an RTC later appears, reconcile relative records by anchoring the first subsequent synchronized event.
How do I prove domain A cannot feed domain B’s watchdog?
Combine address-space isolation, key-based feed handshakes, and physical signal separation. Any cross-domain feed attempt must fail and be logged. In validation, inject cross-feeds from A to B and require zero successful attempts over M trials. Treat any success as a rejection-class failure and block procurement approval until the interlock closes the gap.
How should PG/FAULT align with bark and bite semantics?
Define unified event classes and tags so power anomalies precede processor symptoms in the causal chain. Map PG to bark elevations and promote to bite only when policy conditions are satisfied. Persist class, tag, and timestamps in the same schema used for watchdog events so auditing and fleet analysis remain consistent across domains and resets.
How do I choose between window ratios 0.25–0.75 and 0.4–0.8?
Base the ratio on timing jitter, task density, and clock error budget. With high drift or noisy scheduling, widen the safe region—raising the lower bound to ≥0.35 is common. Keep feed granularity ≤ one quarter of the window width. Finalize your ratio in production and OTP-lock it after validating worst-case workloads and injected timing noise.
On clock failure, should we bite immediately or limit power first?
Degrade before destruction. On suspect or failed watchdog clock, raise a bark and apply mitigations: limit power, reduce frequency, or isolate the partition. Promote to bite only when policy thresholds are met. Always record a first-fault marker and the precise failure class so post-mortem analysis can distinguish clock incidents from scheduling or software faults.
Which parameters are mandatory across brands, and what differences are tolerable?
Mandatory: channels, window capability, timeout range and step, per-partition reset, independent watchdog clock, and telemetry fields. Tolerable: electrical interface style and package family, provided you map signals and verify reset-matrix behavior. State these as acceptance criteria in the BOM. Any missing mandatory field makes the candidate non-equivalent by definition.
How can small-batch validation be automated and scored?
Script fault injection vectors (early, late, miss, stuck feed, clock stop/slow/stuck), expected actions, and logged evidence as three reconciled tables. Require M clean repetitions per vector and timing error within ±10% of spec. Any cross-feed success or missing logs triggers rejection. Export CSV or JSON to attach to procurement records and RFQ responses.