Telemetry & Ward Gateway (BLE/Wi-Fi/Cellular, ULP Power)
Core takeaway
A ward telemetry gateway saves power by running on a strict state machine (sleep → batch → transmit) and stays reliable by buffering data and enforcing bounded retries under weak RF. Power-loss hold-up is sized from a clear “must-finish” task list so critical data commits and graceful shutdown still complete during brownouts and outages.
H2-1 · What is a Ward Telemetry Gateway (and what it is not)
Practical definition (useful in design reviews)
A ward telemetry gateway is a ward-level aggregator that collects short-range wireless data (often BLE) from many bedside/wearable nodes,
batches and buffers it, and then performs reliable uplink over Wi-Fi and/or cellular—while meeting 24/7 uptime,
low-power operation, and power-loss data protection.
System boundary (prevents topic overlap)
This page covers
- Ward topology: many nodes → gateway → Wi-Fi/cellular uplink, including aggregation/buffering and retry logic.
- Gateway constraints: coverage & roaming symptoms, uptime, energy budget, data integrity, serviceability.
- Low-power architecture: Always-On (AON) vs Radio domains, wake sources, duty-cycle scheduling.
- Power-loss strategy: brownout detect → flush buffer → safe shutdown (hold-up concept).
This page does not cover (link out only)
- Bedside wired comms / time sync (PTP/TSN) architectures (see “Bedside / ICU Monitor Comms”).
- Hospital core network design and IT policy details (only interface expectations are mentioned here).
- Imaging data paths (frame grabbers, PCIe/DMA, recorder pipelines).
- Security deep dive (secure boot/HSM/TRNG) and EMC/isolation handbooks (only tests & boundaries are referenced).
Design targets (what must be true to call it a “gateway”)
- Aggregation: can manage many leaf nodes without scan/connect storms; supports grouping and scheduled collection windows.
- Buffering: absorbs uplink outages using RAM queue + persistent spool (with watermarks and backpressure rules).
- Reliability: retries are bounded; acknowledgements are explicit; duplicate detection is deterministic.
- Low power by state: average current is controlled by a state machine (sleep/sense/batch/transmit/confirm).
- Serviceability: logs and counters exist for field triage (reconnect counts, RSSI stats, outage time, buffer watermarks, brownout events).
- Power-loss protection: detects impending brownout early enough to flush critical records and mark last-known state.
Common failure patterns (symptoms → likely cause → quick check)
| Symptom | Most likely cause | Fast check |
|---|---|---|
| Frequent “offline/online” flips across many nodes | Scan/connect window too aggressive; RF congestion; retry storm | Plot connection attempts/min vs RSSI distribution; cap retries and add randomized backoff |
| Data gaps after uplink outages | No persistent spool; incorrect queue watermarks; overwrite without accounting | Force uplink down for N minutes; verify monotonic sequence IDs and spool watermark behavior |
| Random reboots during peaks (TX bursts) | Supply droop; insufficient peak current; brownout threshold too high/late | Capture rail droop with scope during uplink bursts; log brownout reasons and peak current |
| Data corruption after power loss | No “flush & mark” sequence; hold-up energy too small; non-atomic metadata updates | Perform randomized power-cut tests; verify journal/commit markers; check hold-up time margin |
Recommended links (no duplication)
- Bedside / ICU Monitor Comms (wired interfaces, time sync details)
- Compliance & EMC Subsystem (test items and mitigation patterns)
- Medical PSU & Isolation (isolation, leakage, PSU architecture)
- Image Compression & Security (security primitives and key storage)
H2-2 · Link Options: BLE vs Wi-Fi vs Cellular (selection matrix)
Key idea (how to pick without reading protocol textbooks)
Link selection is primarily driven by deployment control (hospital Wi-Fi access vs independent uplink),
payload pattern (small periodic vs bursty), and reconnect tail energy (how long the radio stays expensive after each transmit).
BLE is typically the leaf access layer; Wi-Fi/cellular are the uplink layers.
Selection matrix (engineering factors that change outcomes)
| Factor | BLE (leaf) | Wi-Fi (uplink) | Cellular (uplink) |
|---|---|---|---|
| Deployment dependency | Low (gateway-controlled) | Medium–High (hospital IT access) | Low (independent uplink) |
| Payload pattern fit | Small periodic / event bursts | Bursty uploads; local backhaul | Low-frequency periodic is ideal |
| Reconnect behavior | Scan/connect storms if mis-tuned | Roaming + retries can dominate energy | Weak coverage causes long retry tail |
| Tail energy (after each TX) | Usually short, tunable by intervals | Can be significant with keep-alives | Often dominant; mitigated by PSM/eDRX |
| Cost & operations | Low BOM; gateway complexity | Low recurring cost; IT coordination | SIM/data ops; coverage validation |
Practical combination patterns (avoid false either/or)
- Pattern A (common): BLE leaf access → gateway batching → Wi-Fi primary uplink when hospital Wi-Fi access is stable.
- Pattern B (independent deployment): BLE leaf access → gateway batching → cellular primary uplink for sites with limited IT integration.
- Pattern C (highest availability): Wi-Fi primary uplink + cellular fallback triggered by outage counters and queue watermarks.
Rule of thumb: prioritize deployment control first, then optimize tail energy via batching and bounded retries.
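Pattern C's fallback trigger can be sketched as a small decision function. The thresholds below (120 s outage, 80% queue fill) are illustrative assumptions, not recommended values; real gateways would tune them per site.

```python
def select_uplink(wifi_outage_s: float, queue_fill: float,
                  outage_threshold_s: float = 120.0,
                  watermark: float = 0.8) -> str:
    """Sketch of Pattern C: fail over to cellular when the Wi-Fi outage
    persists too long or the spool approaches its high watermark.
    Thresholds are illustrative, not recommended values."""
    if wifi_outage_s >= outage_threshold_s or queue_fill >= watermark:
        return "cellular"
    return "wifi"

# Healthy Wi-Fi with a small backlog stays on the primary uplink.
primary = select_uplink(wifi_outage_s=0.0, queue_fill=0.3)
```

Driving the decision from counters the gateway already logs (outage time, queue watermarks) keeps the fallback deterministic and auditable.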
Pitfalls to preempt (what usually breaks the plan)
- Choosing Wi-Fi without control: access credentials and captive portal policies can turn into months of deployment delay.
- Underestimating tail energy: frequent tiny uploads can consume more energy than rare batched uploads due to post-TX “radio expensive time.”
- Roaming surprises: intermittent weak coverage creates retries and reconnections that look like “software bugs” but are RF realities.
- Cellular coverage edge cases: indoor penetration and weak signal can create long retry tails and brownout-like resets.
- Overloading BLE: too many simultaneous connections cause scan/connect storms—aggregation must be scheduled and bounded.
Output of this chapter (what the reader should take away)
- BLE is a strong leaf access choice for many endpoints; uplink is selected by deployment control and tail energy.
- Wi-Fi works best when hospital access is stable; cellular is strongest when independent deployment is required.
- Battery life and stability improve when uploads are batched and retries are bounded.
H2-3 · Power-State Architecture (Always-On domain, wake sources, duty cycle)
Engineering takeaway
Ultra-low power is achieved by an auditable state machine, not by “adding a larger battery.” Each state must have
clear entry/exit rules, a maximum dwell time, and a measurable current bucket. Wake sources must be gated and rate-limited
to avoid wake storms that silently dominate average power.
Always-On (AON) domain: what must remain alive
- RTC + time base: defines sensing/upload/maintenance windows and guarantees periodic housekeeping.
- Wake arbitration: resolves multiple wake sources with priorities (e.g., brownout warning beats OTA).
- Voltage monitor + early warning: detects impending brownout early enough to flush critical records.
- Minimal bookkeeping: wake reason, reset cause, outage counters, buffer watermarks (for field triage).
- Optional ULP co-processor: performs tiny “pre-check” tasks (threshold, scheduling) to reduce main-domain wakeups.
Boundary: this section describes responsibilities and interfaces (not MCU/RTOS tutorials).
Wake sources (gated): source → gate → rate limit → target state
| Wake source | Gate (must be true) | Rate limit | Target state |
|---|---|---|---|
| Timer tick | within scheduled window; not in cooldown | fixed cadence; drift monitored | SENSE |
| Event flag (alarm/threshold) | event confirmed; debounce passed | burst allowed; then cooldown | AGGREGATE → TRANSMIT |
| Connection request (leaf join) | only during join/scan window; allowlist hit | cap attempts/min; randomized backoff | SENSE (short) or AGGREGATE |
| User button | debounce; long-press for costly actions | lockout against chatter | MAINTENANCE |
| Charger insert | stable input detected; thermal OK | one-shot until removed | MAINTENANCE (safe window) |
| Brownout warning | pre-warning threshold hit; hold-up present | no rate limit (highest priority) | CONFIRM (flush) → SLEEP |
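The source → gate → rate limit → target pattern in the table can be sketched as a tiny admission check. The class and its limits are illustrative, not a real firmware API; note that the brownout warning carries no rate limit.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class WakeSource:
    """One row of the wake table: gate check, rate limit, target state.
    Names and limits here are illustrative, not a real firmware API."""
    name: str
    target: str
    max_per_min: int          # 0 = no rate limit (e.g. brownout warning)
    count_this_min: int = 0

    def admit(self, gate_ok: bool) -> Optional[str]:
        """Return the target state if the wake is admitted, else None."""
        if not gate_ok:
            return None       # gate failed: wake is dropped
        if self.max_per_min and self.count_this_min >= self.max_per_min:
            return None       # rate limit hit: defer to the next window
        self.count_this_min += 1
        return self.target

brownout = WakeSource("brownout_warning", "CONFIRM", max_per_min=0)
join_req = WakeSource("leaf_join", "SENSE", max_per_min=3)

# The fourth join attempt within the same minute is rejected by the rate limit.
join_results = [join_req.admit(True) for _ in range(4)]
```

Because every admitted wake goes through the same function, logging the wake reason and the reject reason becomes a one-line addition, which is what makes wake storms visible in field triage.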
Duty-cycle windows (why scheduling beats “more battery”)
- Sensing window: short and predictable; powers only what is needed to collect and pre-check.
- Upload window: expensive radio time; batch records and bound retries to minimize tail energy.
- Maintenance window: infrequent; only allowed when energy/thermal/network gates are satisfied (logs/updates).
Average current model (for validation): I_avg ≈ Σ(I_state × t_state) / T.
The goal is to keep TRANSMIT short and infrequent by batching, and keep unexpected wakeups near zero by gating.
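The average-current model is easy to validate numerically. The currents and dwell times below are hypothetical placeholders for one 60 s duty cycle, chosen only to show that TRANSMIT dominates even at sub-second dwell.

```python
# Hypothetical duty-cycle budget: (current in mA, dwell time in s) per cycle.
states = {
    "SLEEP":     (0.005, 58.0),   # ~5 µA baseline
    "SENSE":     (8.0,    1.0),
    "AGGREGATE": (6.0,    0.5),
    "TRANSMIT":  (180.0,  0.4),   # radio burst: short but dominant
    "CONFIRM":   (6.0,    0.1),
}

cycle_s = sum(t for _, t in states.values())
# I_avg = sum(I_state * t_state) / T
i_avg_ma = sum(i * t for i, t in states.values()) / cycle_s
```

With these numbers the 0.4 s TRANSMIT burst contributes most of the cycle energy, which is why batching (longer cycles, same burst) lowers I_avg faster than shaving µA off SLEEP.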
State machine checklist (entry/exit/timeout + observability)
| State | Entry | Exit | Timeout + current bucket | Must log |
|---|---|---|---|---|
| SLEEP | no pending work; gates satisfied | wake source triggers | indefinite; ~µA | wake reason + timestamp |
| SENSE | timer tick; join window | records collected or budget reached | bounded; mA | scan/connect attempts |
| AGGREGATE | new data queued | batch formed; watermark reached | bounded; mA | queue depth + watermark |
| TRANSMIT | upload window; energy OK | sent or retry budget exhausted | strict; burst | outage time + retries |
| CONFIRM | ACK or brownout warning | commit markers written | bounded; mA | last-ack + commit ID |
| MAINTENANCE | manual/charger; gates pass | task done; time budget hit | strict; burst | gates + outcome |
Verification checklist (quick, practical)
- Log wake reasons and state transitions; confirm unexpected wakeups are near zero during idle.
- Measure state currents and dwell times; reconcile with the average-current model (I_avg).
- Force uplink outage; verify buffering continues without raising average power uncontrollably.
- Run brownout tests; confirm pre-warning triggers flush/commit before reset.
H2-4 · ULP PMIC & Power Tree (rails, retention, sequencing)
Engineering takeaway
The power tree is not just “power delivery.” It is a domain control system that ensures only the required rails are on at the right time.
The critical retention path (AON + monitoring + minimal state) must be independent and verifiable. Sequencing with PG/EN must protect
data consistency during both normal shutdown and brownout events.
Multi-rail domains (domain → typical loads → power-off consequence)
- AON: RTC, wake arbiter, voltage monitor. Off = cannot wake or record last state.
- MCU: main compute. Off = cold restart; retention optional depending on boot time budget.
- RADIO: BLE/Wi-Fi/cellular. Off = no uplink; must be hard-gated to eliminate idle tail.
- SENSOR: sensor and front-end rails. Off = no sampling; best controlled by sensing windows.
- STORAGE: flash/spool and metadata. Off during write = corruption risk; must follow sequencing rules.
Boundary: isolation/leakage standards are handled on the dedicated PSU & Isolation page.
Power components (role → why it matters)
- Buck vs LDO: select by light-load efficiency and noise needs; light-load behavior dominates average power in 24/7 systems.
- Load switch: enforces domain off, reduces leakage, limits inrush, and prevents “half-on” failure modes.
- Ideal diode / OR-ing: enables seamless switchover between main input and hold-up source with low drop and no backfeed.
- Fuel gauge (if battery present): enables gating (allow maintenance windows only when energy margin is safe).
- PG/EN signals: turn power sequencing into a hardware-enforced dependency graph (no guessing in firmware).
Sequencing & data consistency (why PG/EN is part of reliability)
- Power-up: AON → MCU → STORAGE → RADIO. Reason: record state first, then safely write, then connect.
- Power-down: stop RADIO → flush STORAGE → enter safe shutdown. Reason: avoid high radio peaks during writes.
- Brownout event: pre-warning triggers a short “flush & mark” routine; commit markers ensure deterministic recovery.
- Reset gating: MCU reset release should depend on PG of critical rails (especially STORAGE and AON).
The retention path must survive long enough to: log reset reason → flush essential records → mark last-ack/commit.
Verification checklist (quick, practical)
- Measure peak current during radio bursts; confirm no brownout resets under worst-case uplink retries.
- Validate sequencing: MCU reset release depends on PG of critical rails (AON + storage readiness).
- Power-cut test during storage writes; confirm commit markers prevent corruption and recovery is deterministic.
- Confirm retention path remains alive during hold-up long enough to log reset reason and flush essentials.
H2-5 · BLE Low-Power Playbook (advertising, connection params, scanning)
Engineering takeaway
BLE average power is dominated by radio on-time (scan duty + connection-event rate) and by retry/reconnect frequency.
Savings come from windowing (short, scheduled scan/join windows), batching (fewer, denser connection events),
and gating (bounded retries + cooldown) to prevent reconnection storms in crowded wards.
Where BLE power really goes: advertising vs scanning vs connection events
| Phase | Main drivers | Hidden drain | Practical control |
|---|---|---|---|
| Advertising | adv interval, PHY, TX power | too-fast adv forces more gateway scanning | separate “join adv” from “presence adv” |
| Scanning | scan window/interval (scan duty) | continuous scan creates “always-on” radio | scheduled scan bursts + allowlist filters |
| Connection | conn interval, slave latency, event length | retries + reconnect storms in RF congestion | batch payloads + bounded retries + cooldown |
Connection parameters (engineering meaning, not textbook definitions)
- Connection interval: sets the “heartbeat” of connection events. Shorter intervals increase responsiveness but multiply radio wakeups and tail energy.
- Slave latency: allows skipping events without dropping the connection. It is a power lever for stable signals, but it increases worst-case report latency.
- Supervision timeout: defines when the link is declared dead. Too short creates false death → reconnect storms; too long delays failure detection and grows buffers.
Practical target: keep event frequency low enough for average power, while bounding worst-case latency and preventing false disconnect.
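The three parameters interact through two simple relations: worst-case report latency is the connection interval times (slave latency + 1), and the Bluetooth spec requires the supervision timeout to exceed twice that product. A minimal sanity check, with example values:

```python
def ble_link_budget(conn_interval_ms: float, slave_latency: int,
                    supervision_timeout_ms: float):
    """Illustrative BLE parameter sanity check (example values, not a stack API).
    Worst-case report latency: the peripheral may skip `slave_latency` events.
    The spec requires supervision_timeout > 2 * conn_interval * (1 + latency)."""
    worst_latency_ms = conn_interval_ms * (slave_latency + 1)
    timeout_ok = supervision_timeout_ms > 2 * worst_latency_ms
    return worst_latency_ms, timeout_ok

# 200 ms interval with latency 4: connection events may be 1 s apart at worst.
worst_ms, timeout_ok = ble_link_budget(200.0, 4, supervision_timeout_ms=4000.0)
```

Running this check at design time catches the classic mistake of raising slave latency for power without widening the supervision timeout, which manifests as false disconnects and reconnect storms.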
Multi-device aggregation (how to avoid collisions and reconnect storms)
- Stagger connection-event start times: distribute devices across time so the gateway is not hit by synchronized bursts.
- Group-and-window uploads: use short “batch windows” per group (bed/zone) and keep joining separate from reporting.
- Bounded retries: cap retries per record and per device, then enter a cooldown to avoid tail-dominated power.
- Admission control: in congestion, prioritize stability for already-connected devices; postpone new joins to a later join window.
Rule of thumb: prefer “fewer wakeups with larger batches” over “many tiny packets,” because the radio tail dominates.
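The bounded-retries-plus-cooldown rule can be sketched as a retry plan with randomized exponential backoff. The budget, base backoff, and cooldown values are illustrative; the jitter is what breaks synchronized reconnection storms across many nodes.

```python
import random

def bounded_retry_plan(max_retries: int, base_backoff_s: float,
                       cooldown_s: float, seed: int = 0):
    """Sketch of 'bounded retries + cooldown' (parameters are illustrative).
    Returns randomized backoff delays for each retry, then the cooldown
    before the next attempt window. Jitter de-synchronizes devices."""
    rng = random.Random(seed)
    delays = []
    for attempt in range(max_retries):
        backoff = base_backoff_s * (2 ** attempt)   # exponential growth
        delays.append(backoff * rng.uniform(0.5, 1.5))  # +/-50% jitter
    return delays, cooldown_s

delays, cooldown = bounded_retry_plan(max_retries=3, base_backoff_s=1.0,
                                      cooldown_s=60.0)
```

The key property is that total retry energy is bounded by construction: after `max_retries` attempts the device goes quiet for `cooldown_s` regardless of link state, so a crowded ward cannot drive the tail unbounded.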
Verification metrics (prove the playbook works)
- Power: scan duty on-time, connection-event rate, retry tail duration, average current across a 24/7 trace.
- RF health: packet error rate, retransmissions, disconnect frequency, time-to-reconnect distribution.
- System health: queue watermarks, batch sizes, join success rate under high device density.
- Fault injection: force weak RSSI and interference; confirm bounded retries + cooldown prevents storms.
H2-6 · Wi-Fi Low-Power & Reliability (DTIM, keep-alives, roaming traps)
Engineering takeaway
Wi-Fi power is often stolen by “staying online”: DTIM-driven wakeups, keep-alives, and network-stack retries.
Real savings come from windowed uplink (batch transfers inside an upload window), bounded retries (avoid tail storms),
and roaming control that prioritizes stable connectivity over frequent AP switching.
DTIM and power save (why cadence dominates average current)
- DTIM cadence: defines how often the client must wake to receive buffered traffic. More wakeups create a visible “comb” in current traces.
- Windowed behavior: place expensive uplinks inside a scheduled upload window, then allow the radio to return to deep sleep outside that window.
- Downlink tolerance: if the gateway is primarily uplink-driven, it can tolerate delayed downlink and keep wake cadence low.
Boundary: this is an engineering view of symptoms and controls, not an enterprise Wi-Fi design guide.
Keep-alives: the most common “power thief”
- Why tiny packets can be expensive: waking up, contending for airtime, transmitting, waiting for ACK, and settling back creates tail energy.
- Batch heartbeats: merge multiple status items into one report aligned to the upload window.
- Gate costly actions: maintain “always-on” connectivity only when queue watermark or alarm class requires it; otherwise allow disconnect/sleep.
- Weak-signal behavior: decrease keep-alive frequency and prefer local buffering to avoid repeated handshake tails.
Power tail traps (handshake, DHCP/DNS retries, weak-signal retransmissions)
Typical field symptoms → likely cause → practical strategy
- Frequent current spikes + delayed uploads → repeated association/handshake or DHCP/DNS loops → cap retries and enter cooldown; buffer locally.
- High power with low throughput → weak RSSI causing retransmissions → measure link quality first; upload only when above a minimum margin.
- Random long reconnect times → congestion or unstable AP → prefer stability; avoid aggressive roaming and avoid rapid reconnect loops.
Reliability rule: bounded retries + deterministic buffering is better than “try forever” because tail energy will dominate.
Roaming traps (symptoms and control strategy)
- Symptom: periodic dropouts, latency spikes, or packet bursts after AP switching.
- Control: roam only when metrics degrade beyond thresholds; avoid “ping-pong” switching under marginal RSSI.
- Fallback: if roaming fails, apply backoff and rely on buffering rather than repeated fast re-association loops.
- Operational view: stable uplink with bounded delay often beats peak throughput for ward telemetry.
Verification metrics (power + network + system)
- Power: DTIM comb amplitude/frequency, burst TX tail duration, reconnect/handshake energy cost.
- Network: association time, DHCP/DNS failures, retry counts, roaming attempts and failures.
- System: upload-window completion ratio, queue watermarks, backlog drain speed after outage recovery.
H2-7 · Cellular Power Strategy (PSM/eDRX, modem states, coverage pain)
Engineering takeaway
Cellular power is rarely dominated by “one payload.” It is dominated by connection and signaling tails and by
repeated failures under weak coverage. The strategy is to keep the modem in low-cost states as long as possible
(PSM/eDRX), transmit in scheduled bursts, and enforce bounded retries + cooldown to prevent runaway attach/TAU loops.
Modem state ladder (why average current looks like “steps”)
| State class | What triggers it | Power signature | Common pitfall |
|---|---|---|---|
| PSM / deep sleep | no immediate downlink need | near-zero baseline | waking too often defeats PSM |
| Idle with eDRX | periodic paging listen | comb-like periodic spikes | too-frequent cadence steals power |
| Connected | uplink burst / session | high steps + long tail | tiny frequent sends keep it alive |
| Attach / TAU loops | weak coverage / loss of registration | repeating spikes (storm pattern) | “try forever” destroys battery |
PSM vs eDRX (configuration logic for low-rate telemetry)
- Prefer PSM when uplink is periodic and downlink can be delayed until the next uplink window (lowest baseline).
- Use eDRX when occasional downlink reachability is needed, but seconds-to-minutes latency is acceptable.
- Keep “connected” short by batching: send multiple records in one burst, then return to idle/PSM.
Design goal: maximize time in low-cost states and make uplink energy predictable with scheduled bursts.
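State occupancy translates directly into daily energy. The per-state power figures below are rough illustrative assumptions (not modem datasheet values); the point is how strongly connected-time fraction dominates the total.

```python
def daily_energy_mwh(occupancy: dict, power_mw: dict) -> float:
    """Estimate daily modem energy from state occupancy (fractions of a day)
    and an average power per state class. All numbers are illustrative."""
    assert abs(sum(occupancy.values()) - 1.0) < 1e-9
    hours = 24.0
    return sum(occupancy[s] * power_mw[s] * hours for s in occupancy)

power_mw = {"psm": 0.02, "edrx_idle": 1.5, "connected": 250.0}
mostly_psm = daily_energy_mwh(
    {"psm": 0.98, "edrx_idle": 0.015, "connected": 0.005}, power_mw)
chatty = daily_energy_mwh(
    {"psm": 0.80, "edrx_idle": 0.15, "connected": 0.05}, power_mw)
```

Moving from 0.5% to 5% connected time raises the daily budget by roughly an order of magnitude in this sketch, which is why "keep connected short by batching" is the single highest-leverage lever.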
Weak coverage: symptom → detect → mitigate (to prevent power runaway)
Symptoms commonly seen in wards with dead zones
- Average current climbs with frequent spikes; uploads become jittery or stall.
- Repeated registration/attach attempts; reconnect time distribution widens dramatically.
- Backlog grows even though the modem appears “busy.”
Detect (log what matters)
- Signal quality trend: RSRP/RSRQ/SINR (trend + thresholds), not single snapshots.
- Failure counters: attach/registration failures, retry counts, time-to-connect percentiles.
- Radio on-time: total connected time per hour; tail duration per burst.
Mitigate (actions that save both power and data integrity)
- Gate uplink by coverage: if quality is below a minimum margin, switch to store-and-forward instead of forcing a burst.
- Bound retries: cap retries per burst and per time window; then enter a cooldown before the next attempt.
- De-rate “keep-alive”: reduce non-critical heartbeats under poor coverage; prioritize alarms only.
- Batch larger, less often: fewer sessions reduce repeated tails and signaling overhead.
Verification (what to measure to prove savings)
- State occupancy: percent of day in PSM/eDRX idle vs connected vs attach/TAU loops.
- Burst energy cost: energy per upload window (and its tail) under normal vs weak coverage.
- Storm prevention: after injecting weak coverage, confirm bounded retries and cooldown stop repeated spikes.
H2-8 · Data Pipeline: batching, buffering, and “store-and-forward”
Engineering takeaway
A ward gateway must assume link dropouts. Data integrity comes from priority classes, batch windows,
and a store-and-forward loop with sequence/ack and controlled flash wear. The goal is to avoid both “lost events”
and “flash death by tiny writes.”
Data classes (QoS): alarms vs trends vs debug
- Alarm (highest): small, urgent, may break the upload window; must be deduplicated and rate-limited during storms.
- Trend (medium): periodic samples; designed for batching; tolerant to short delays; ideal for store-and-forward.
- Debug (lowest): maintenance-only; strictly gated; uploaded in a service window with bandwidth and power limits.
Principle: separate paths by priority so an alarm cannot be blocked by trend backlogs or debug logs.
Batching (reduce session count to reduce tail energy)
- Upload window: aggregate trend points and non-urgent events, then transmit in one burst session.
- Alarm override (gated): alarms can transmit immediately, but enforce a cap and a cooldown to prevent power storms.
- Bundle framing: send one header for many records; avoid per-record handshake behavior.
Buffering: RAM ring buffer + Flash spool (roles and boundaries)
- RAM ring buffer: absorbs short outages and reduces flash writes by collecting records into batches.
- Flash spool: protects against long outages and power loss; stores append-only segments for replay.
- Spool trigger: move from RAM to flash when backlog exceeds a watermark, or when link quality gates uplink.
Boundary: flash is a durability tool, not a substitute for good batching. Tiny writes are the enemy.
Flash wear control (avoid “writing the flash to death”)
- Append-only segments: write sequentially; avoid random overwrites that amplify wear.
- Batch-to-flash: persist only after reaching a minimum batch size or after a timeout boundary.
- Minimal metadata churn: keep pointers/watermarks compact and update at controlled intervals.
- GC gating: reclaim only after confirmed ACK watermark; never delete “maybe delivered” data.
Delivery integrity: sequence → ACK watermark → de-dup → replay
- Sequence IDs: every record or bundle carries an increasing ID to support replay and ordering.
- ACK watermark: server acknowledges up to an ID; the gateway advances the durable watermark.
- De-dup: replays are allowed; server must ignore duplicates to avoid double-counting.
- Replay loop: on reconnection, send from flash spool starting at the last unacked watermark.
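The sequence → ACK watermark → replay loop can be sketched with a minimal in-memory spool. A real spool persists append-only segments to flash; the Python list here is a stand-in, and the record payloads are invented examples.

```python
class Spool:
    """Minimal append-only spool with an ACK watermark (illustrative sketch;
    a real spool would persist segments to flash, not a Python list)."""
    def __init__(self):
        self.records = []          # (seq_id, payload), append-only
        self.next_seq = 1
        self.ack_watermark = 0     # highest seq confirmed by the server

    def append(self, payload):
        self.records.append((self.next_seq, payload))
        self.next_seq += 1

    def replay_from_watermark(self):
        """On reconnection, resend everything above the durable watermark."""
        return [(s, p) for s, p in self.records if s > self.ack_watermark]

    def ack(self, seq_id):
        """Server acknowledges up to seq_id; GC may reclaim records at or
        below the watermark, never above it ('maybe delivered' is kept)."""
        self.ack_watermark = max(self.ack_watermark, seq_id)
        self.records = [(s, p) for s, p in self.records
                        if s > self.ack_watermark]

spool = Spool()
for v in ("hr=72", "hr=74", "spo2=97"):
    spool.append(v)
spool.ack(2)                       # server has records 1-2
pending = spool.replay_from_watermark()
```

Because replays start at the watermark rather than at "what the gateway thinks was sent," duplicates are possible by design, which is why server-side de-dup on sequence IDs must be deterministic.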
Power-fail behavior (fast, bounded, predictable)
- On power-fail warning: stop low-priority ingestion, flush a bounded critical batch to flash, and persist the current watermark.
- No “big work”: avoid compaction, long hashing, or re-indexing inside the hold-up window.
- On next boot: resume replay from durable watermarks; log the event for service visibility.
H2-9 · Power-Loss Hold-Up Sizing (supercap/battery/bulk caps) and budget math
Engineering goal (hold-up contract)
When a power-loss warning occurs, the gateway must complete a bounded “critical sequence”:
freeze ingress → persist minimal state → shed high loads → enter safe state.
Hold-up sizing is therefore an energy window problem (Vstart to Vend), not a “bigger capacitor is always better” problem.
Critical energy budget: define what must finish
Must finish (critical)
- Persist minimal metadata: ACK watermark, spool pointer, monotonic sequence stamp, and a power-fail reason code.
- Bounded flash commit: write the smallest durable record that makes replay deterministic after reboot.
- Shed high loads: stop RF transmit and disable non-critical rails to reduce Pcritical immediately.
- Enter safe state: keep RTC / always-on logic and store the last shutdown stage for diagnostics.
Nice-to-have (only if budget allows)
- Send a single power-fail notice only when link quality gates pass and the transmit tail is predictable.
- Persist a short diagnostic summary (not full logs, not compaction).
Forbidden during hold-up
- Any long network handshake, reconnect, or waiting for server response.
- GC/compaction/re-index work that can turn into unbounded flash writes.
Budget math: energy window + critical power
Step 1 — define the usable voltage window (Vstart to Vend)
- Vstart: the rail voltage at the moment the early warning triggers (before the system becomes unstable).
- Vend: the lowest voltage where flash commit and RTC/AON still behave deterministically (including regulator headroom).
Step 2 — compute energy available from the storage element
Capacitor energy window:
E_cap = 1/2 · C · (Vstart² − Vend²)
Hold-up time estimate (bounded critical sequence):
t ≈ E_usable / P_critical
Step 3 — size C from the time budget (useful design form)
C ≈ 2 · P_critical · t / ( η · (Vstart² − Vend²) )
Where:
- P_critical = only the rails that stay on during hold-up
- η accounts for conversion losses and real-world inefficiencies
- t is the required completion time (typically 50–200 ms for graceful shutdown)
Practical tip: the fastest way to shrink C is to reduce P_critical early (load-shedding) and make flash writes bounded.
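The sizing formula above is a one-liner to evaluate. The load, time budget, voltage window, efficiency, and derating factor below are example assumptions, not component recommendations; the derating term reserves headroom for temperature and aging per the corrections discussed later.

```python
def holdup_cap_farads(p_critical_w: float, t_s: float,
                      v_start: float, v_end: float,
                      eta: float = 0.85, derate: float = 0.7) -> float:
    """Size the hold-up capacitor from the page's formula
    (all numbers here are example assumptions):
        C = 2 * P_critical * t / (eta * (Vstart^2 - Vend^2))
    `derate` reserves margin for temperature/aging capacitance loss."""
    c_ideal = 2.0 * p_critical_w * t_s / (eta * (v_start**2 - v_end**2))
    return c_ideal / derate

# 0.5 W critical load, 150 ms budget, 5.0 V -> 3.0 V usable window.
c_needed = holdup_cap_farads(0.5, 0.150, 5.0, 3.0)   # roughly 16 mF
```

Re-running the function with a halved P_critical shows why early load-shedding is the cheapest capacitance you can buy: the required C scales linearly with the critical power.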
Option trade-offs: supercap vs small battery vs bulk caps (ward gateway scale)
Supercap
- Best for: short, deterministic hold-up to finish writes and shut down cleanly.
- Strength: high pulse current capability; long cycle life.
- Watch-outs: leakage/self-discharge, ESR at cold temperature, inrush limiting on recharge.
Small battery
- Best for: longer survival time and extended logging when mains can be absent for minutes.
- Strength: higher energy density; supports more extensive safe-state functions.
- Watch-outs: charger/BMS complexity, aging, and maintenance expectations.
Bulk capacitors
- Best for: very short hold-up and smoothing; often enough for fast metadata commits only.
- Strength: low cost; simple integration.
- Watch-outs: limited usable window and higher risk of brownout timing variability.
OR-ing and ideal diode devices are part of the hold-up system: they enforce one-way energy flow and prevent reverse discharge paths.
Real-world corrections (why margin is mandatory)
- Temperature: effective capacitance and ESR change with temperature; derate to the worst expected condition.
- Aging: capacitance fade and leakage drift over life; reserve extra energy headroom.
- Leakage: supercap self-discharge can dominate if “hold-up” must be available after long idle times.
- Recharge inrush: uncontrolled recharge can cause dips and resets; limit current and sequence rails.
H2-10 · Brownout Detection & Graceful Shutdown (what must happen in 50–200 ms)
Engineering goal (bounded response)
Brownout handling is a time-budgeted state machine. The response must be deterministic within a bounded window:
detect early → shed loads → persist minimal state → enter safe state. Unbounded actions (reconnect, long writes, compaction)
must be gated or skipped.
Power-loss detection chain (two-level triggers)
- Level-1 (early warning): PG de-assertion, ADC threshold crossing, or bus droop detector that interrupts early enough for flash commit.
- Level-2 (imminent brownout): hard supervisor/comparator threshold that forces minimal actions only (protect correctness, skip extras).
Design intent: Level-1 enables graceful shutdown; Level-2 protects against corruption when time is nearly gone.
Graceful shutdown sequence (strict order)
- Freeze ingress: stop adding new records; snapshot current queue watermarks.
- Load-shed: disable RF transmit and non-critical rails first to collapse Pcritical quickly.
- Bounded commit: persist minimal metadata and a power-fail stage marker (small, deterministic write).
- Reason code: store brownout cause and counters for service visibility.
- Safe state: enter a low-power mode that preserves RTC/AON and blocks heavy peripherals.
Skip policy: if voltage drops below the safe margin, skip network activity and any non-essential flash work.
Data consistency with a tiny two-phase commit (action-level, not file-system theory)
- Pre-commit marker: write a short “intent” record that a shutdown commit is starting.
- Payload + pointers: write the minimal durable watermarks (ACK level, spool pointer, sequence stamp).
- Commit marker: write a short “done” record. On next boot, missing “done” triggers replay/rollback safely.
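The intent/payload/done sequence can be sketched action-by-action. The journal list here is a stand-in for an append-only flash segment, and the marker names are illustrative:

```python
def write_shutdown_commit(journal: list, watermarks: dict):
    """Tiny two-phase commit sketch (journal stands in for a flash segment).
    Order matters: intent -> payload -> done."""
    journal.append(("INTENT",))
    journal.append(("PAYLOAD", dict(watermarks)))
    journal.append(("DONE",))

def recover(journal: list):
    """On boot: trust the payload only if a DONE marker follows it;
    otherwise fall back to replay from the previous durable watermark."""
    if journal and journal[-1] == ("DONE",):
        return journal[-2][1]      # committed watermarks
    return None                    # incomplete commit: replay/rollback

journal = []
write_shutdown_commit(journal, {"ack": 42, "spool_ptr": 7})
committed = recover(journal)
torn = journal[:-1]                # simulate power loss before DONE
```

A torn write (power lost between PAYLOAD and DONE) recovers to `None`, which forces the safe path: resume replay from the previous watermark rather than trusting a half-written state.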
Avoiding reset storms (brownout → reboot → brownout loops)
- Minimum voltage gate: do not enable high-load rails (RF/flash heavy writes) until voltage exceeds a safe threshold with margin.
- Cooldown timer: after a brownout, wait a minimum bounded time before retrying network-heavy actions.
- Retry counter: if brownouts repeat N times, enter a protective mode (RTC + minimal logging only) until power stabilizes.
- WDT policy: ensure watchdog behavior does not create extra resets during the brownout window; keep the shutdown path deterministic.
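The three storm guards (voltage gate, cooldown, retry counter) compose into one admission check for high-load rails. The thresholds below are illustrative assumptions:

```python
class BrownoutGuard:
    """Sketch of the reset-storm guard: voltage gate + cooldown + retry
    counter. Thresholds and counts are illustrative, not recommendations."""
    def __init__(self, v_safe: float, cooldown_s: float, max_brownouts: int):
        self.v_safe = v_safe
        self.cooldown_s = cooldown_s
        self.max_brownouts = max_brownouts
        self.brownouts = 0
        self.cooldown_until = 0.0

    def record_brownout(self, now_s: float):
        self.brownouts += 1
        self.cooldown_until = now_s + self.cooldown_s

    def may_enable_high_loads(self, v_rail: float, now_s: float) -> bool:
        if self.brownouts >= self.max_brownouts:
            return False           # protective mode: RTC + minimal logging
        if now_s < self.cooldown_until:
            return False           # still in post-brownout cooldown
        return v_rail >= self.v_safe   # voltage gate with margin

guard = BrownoutGuard(v_safe=4.8, cooldown_s=30.0, max_brownouts=3)
guard.record_brownout(now_s=0.0)
blocked_early = guard.may_enable_high_loads(v_rail=5.0, now_s=10.0)
allowed_later = guard.may_enable_high_loads(v_rail=5.0, now_s=31.0)
```

Because the check runs before any RF or heavy flash activity is enabled, a marginal supply cannot re-enter the brownout → reboot → brownout loop faster than the cooldown allows.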
H2-11 · Validation Checklist: power profiling, RF stress, outage drills, field telemetry
Definition of “done”
Validation is complete only when the gateway shows bounded energy per state, bounded retries under weak RF,
deterministic data consistency under power loss, and field counters that close the loop in production.
A) Power profiling by state machine (average, peaks, and “tails”)
Measure current as a segmented profile (SLEEP → SENSE → AGGREGATE → TRANSMIT → CONFIRM/RETRY),
not as a single average number. The goal is to verify both energy per event and upper bounds under worst-case retries.
What to record
- SLEEP/AON: Iavg, periodic wake spikes, RTC/AON stability across hours.
- Wake + compute: peak current and duration for parsing, batching, encryption (if enabled), queue ops.
- Transmit: peak current, burst duration, and the power tail energy (retries, DHCP/DNS, attach/TAU, scanning).
- Confirm/Retry: energy per retry, maximum retries allowed by policy gates.
Pass criteria (engineering-grade)
- Each state meets its budget: E(state) ≤ E_budget × (1 + margin) across normal and stress runs.
- Transmit “tail” is explainable and bounded (no unbounded reconnect loops).
- Energy per report remains bounded when RF is degraded (bounded retry policy is enforced).
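The per-state pass criterion is mechanical enough to automate in the test harness. The 20% margin and the millijoule figures below are example assumptions, not fixed requirements:

```python
def state_energy_ok(e_measured_mj: float, e_budget_mj: float,
                    margin: float = 0.2) -> bool:
    """Pass criterion from the checklist: E(state) <= E_budget * (1 + margin).
    The 20% margin is an example policy, not a fixed requirement."""
    return e_measured_mj <= e_budget_mj * (1.0 + margin)

# Hypothetical per-state measurements in millijoules.
checks = {
    "TRANSMIT": state_energy_ok(110.0, 100.0),   # within the 20% margin
    "SENSE":    state_energy_ok(15.0, 10.0),     # 50% over budget: fail
}
```

Encoding the criterion this way lets the same check run against lab traces and against field telemetry counters, keeping the lab and production definitions aligned.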
B) RF stress: weak signal, congestion, roaming traps, cellular edge coverage
RF validation must connect reliability metrics with energy cost. The same “bad RF” condition should produce
consistent signatures in reconnect counters, retry rates, and energy per event.
Stress stimuli (examples)
- Weak signal: controlled attenuation / obstructed path; verify retry gates and fallback behavior.
- Congestion: busy channel / high AP load; verify latency P95 and packet loss behavior.
- AP switch / roam: forced reassociation; verify bounded reconnection logic (no energy runaway).
- Cellular edge: poor RSRP/RSRQ; verify attach/TAU and retry pacing remain bounded.
Metrics to log (minimum set)
- Reconnect count, retry count, failure reasons (DNS/DHCP/auth/timeout), and RSSI/RSRP distributions.
- Packet loss and retransmissions; end-to-end latency (P50/P95).
- Energy per report under each RF stress profile.
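"No energy runaway" is easiest to verify when the retry policy itself is explicit: a gate that caps both the attempt count and the cumulative energy per report, with clamped exponential backoff for pacing. A minimal sketch with hypothetical caps (`MAX_RETRIES`, `MAX_ENERGY_MJ`) and helper names:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MAX_RETRIES        5u
#define MAX_ENERGY_MJ   500.0   /* cumulative energy cap per report */
#define BASE_BACKOFF_S     2u
#define MAX_BACKOFF_S    120u

typedef struct {
    uint32_t attempts;
    double   energy_spent_mj;
} retry_gate_t;

/* Exponential backoff, clamped so retry pacing stays bounded. */
static uint32_t next_backoff_s(const retry_gate_t *g) {
    uint32_t delay = BASE_BACKOFF_S << g->attempts;   /* 2, 4, 8, ... */
    return delay > MAX_BACKOFF_S ? MAX_BACKOFF_S : delay;
}

/* Allow another attempt only while both caps hold. */
static bool retry_allowed(const retry_gate_t *g) {
    return g->attempts < MAX_RETRIES && g->energy_spent_mj < MAX_ENERGY_MJ;
}

/* Account one failed attempt and its measured energy cost. */
static void retry_record(retry_gate_t *g, double attempt_energy_mj) {
    g->attempts++;
    g->energy_spent_mj += attempt_energy_mj;
}
```

Under RF stress, the same gate counters (`attempts`, `energy_spent_mj`) double as the log signature that ties a bad-RF condition to its energy cost.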
C) Outage drills: random cuts, cold derating, supercap aging assumptions
A power-loss drill is successful only when data remains consistent and the device avoids reset storms.
Drills should be run across different RF states (idle / transmitting / retrying) to validate the worst-case “tail” behavior.
Drill set (recommended)
- Random cut: remove input power at random phases of the operating cycle; repeat across thousands of cycles.
- Cold derating: reduced usable window (simulate higher ESR / lower C); verify hold-up still meets the minimal contract.
- Aging assumption: shrink Vstart→Vend window / increase leakage assumption; verify bounded commit still succeeds.
Pass criteria
- After reboot, ACK watermark / spool pointer / sequence stamp are valid and monotonic (no duplicate or missing critical records beyond defined policy).
- “Brownout → reboot → brownout” loops do not occur (reset-storm guard works: voltage gate + cooldown + retry counter).
- Critical shutdown stage markers show the device reached safe state when budget allowed, and degraded cleanly when not.
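The "valid and monotonic" criterion can be turned into an automated post-reboot check: the spool must resume exactly one past the ACK watermark (no duplicates, no gap) and hold a non-inverted sequence range. A minimal sketch with hypothetical field names; an at-least-once replay policy would relax the duplicate check accordingly:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint32_t ack_watermark;   /* last sequence confirmed by the server */
    uint32_t spool_head_seq;  /* oldest unsent record in the spool */
    uint32_t spool_tail_seq;  /* newest record in the spool */
} recovery_state_t;

/* After reboot, the spool must resume immediately after the ACK watermark
 * and hold a monotonic, non-inverted sequence range. */
static bool recovery_consistent(const recovery_state_t *r) {
    if (r->spool_head_seq > r->spool_tail_seq) return false;      /* inverted */
    if (r->spool_head_seq != r->ack_watermark + 1) return false;  /* dup/gap */
    return true;
}
```

Running this check on every drill reboot, and logging the three fields on failure, makes a thousands-of-cycles random-cut campaign scriptable.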
D) Field telemetry: counters that close the loop in production
Field observability should separate failures by domain (RF, power, storage, policy) without requiring invasive debugging.
The same metrics used in lab stress tests should exist in field telemetry with stable definitions.
Minimum counter dictionary
- RF: reconnect_count, retry_count, last_fail_reason, avg_RSSI/RSRP, roaming_events, time_to_attach.
- Power: brownout_count, early_warn_count, hold_up_entries, last_shutdown_stage.
- Storage: spool_high_watermark, commit_fail_count, replay_events, wear_estimate (at least erase/write counters).
- Performance: report_latency_P95, queue_delay, drops_by_policy (intentional drops vs corruption).
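One way to keep the lab and field definitions stable is to freeze the counter dictionary as a single telemetry record that both builds share. A minimal sketch mirroring the dictionary above; the field names and widths are illustrative, not a fixed schema:

```c
#include <assert.h>
#include <stdint.h>

/* Minimum counter dictionary as one telemetry record; grouped by domain
 * (RF / power / storage / performance) to match the dictionary above. */
typedef struct {
    /* RF */
    uint32_t reconnect_count;
    uint32_t retry_count;
    uint16_t last_fail_reason;      /* enum: DNS/DHCP/auth/timeout/... */
    int16_t  avg_rssi_dbm;          /* or RSRP for cellular builds */
    uint32_t roaming_events;
    uint32_t time_to_attach_ms;
    /* Power */
    uint32_t brownout_count;
    uint32_t early_warn_count;
    uint32_t hold_up_entries;
    uint8_t  last_shutdown_stage;
    /* Storage */
    uint32_t spool_high_watermark;
    uint32_t commit_fail_count;
    uint32_t replay_events;
    uint32_t erase_write_count;     /* coarse wear estimate */
    /* Performance */
    uint32_t report_latency_p95_ms;
    uint32_t queue_delay_ms;
    uint32_t drops_by_policy;       /* intentional drops, not corruption */
} field_counters_t;
```

Because the same struct feeds both the stress-test logs and the production uplink, a field anomaly can be reproduced in the lab by matching counter signatures rather than by invasive debugging.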
EMC note: this page lists only what to test (ESD/EFT/surge/radiated immunity) and what to record (symptoms + counters); mitigation details belong on the Compliance & EMC page.
Reference parts (example material numbers used in validation fixtures)
These part numbers are commonly used to make validation repeatable (accurate current/energy logging, precise power-fail triggers,
and measurable hold-up behavior). Actual selection depends on the chosen rails and current ranges.
- Power/energy profiling monitor: TI INA228 (digital power monitor; useful for per-state energy profiling).
- Rail supervisor / reset: TI TPS3839 (ultra-low-power supervisor for deterministic brownout triggers).
- Window supervisor (early warning + hard threshold): TI TPS3703 (dual-threshold monitoring for two-level triggers).
- Supercap backup controller (hold-up system reference): Analog Devices LTC3350 (supercap backup supply controller).
- Supercap state/health monitor (aging/derating evidence): TI BQ33100 (supercap monitor / health estimation).
- External flash for spool validation (example): Winbond W25Q64 (used widely for log/spool endurance exercises).
- BLE SoC platform (example): Nordic nRF52840 (for BLE stress + low-power parameter verification).
- Wi-Fi platform (example): Espressif ESP32-C3 (for DTIM/tail profiling and congestion stress).
- Cellular module platform (example): Quectel BG95 (Cat-M/NB family commonly used for edge-coverage stress).
Tip: keep the validation fixture BOM stable so “before/after” firmware changes can be compared with high confidence.
H2-12 · FAQs
These FAQs focus on low-power telemetry backhaul, store-and-forward reliability, and power-loss hold-up behavior for ward gateways.