
Vision Gateway / Edge Box for Multi-Camera Edge AI


Core idea: A Vision Gateway/Edge Box turns multi-camera streams into provable outcomes—bounded P99 latency, controlled drop rate, tight skew limits, and an auditable evidence chain—by managing bandwidth, queues, timestamps, QoS, storage consistency, security updates, and power/thermal/EMC determinism.


What readers get: A practical, measurement-first playbook to isolate where drops/jitter come from (network vs compute/memory vs storage vs power/thermal) and apply the first fix fast—without scope creep into sensors, Timing Hub internals, or cloud platforms.

H2-1. Role & boundary: where a Vision Gateway sits in the vision chain

Definition in one sentence

A Vision Gateway / Edge Box is the system integration point that aggregates multiple camera inputs, runs deterministic edge inference, aligns frames/events on a single time basis, and delivers evidence (metadata + event clips + health telemetry) over network/storage with field-debug visibility.

This page is intentionally system-level: it focuses on determinism + evidence chain, not on pixel/ISP internals, PHY deep specs, or cloud platform tutorials.

Keywords: fan-in / fan-out · P99 latency · drop counters · time alignment · event clips · watchdog & reset reason

What it must deliver (4 deliverables that prevent scope creep)

  • Aggregate: bring N camera streams + triggers into a single compute & switching fabric with known bandwidth headroom.
  • Infer: convert frames into structured outputs (detections, tracks, quality flags) within a bounded P99 latency.
  • Align: time-correlate frames, triggers, and inference results so “event ordering” remains stable under load and across reboots.
  • Deliver evidence: output metadata + optional pre/post event clips + health logs in a way that remains verifiable after power loss or restart.
Non-goals (hard boundary): cloud/MLOps/training, deep ISP math, PHY/SerDes spec derivations, PTP grandmaster design, frame grabber board architecture, storage-controller/ECC deep dive.

Typical I/O (describe engineering objects, not marketing words)

Inputs

  • Video payloads: multiple camera streams (uncompressed or compressed) arriving via CSI/USB/Ethernet-class links.
  • Timing & events: trigger-in, encoder pulses, strobe/actuator coordination signals, embedded frame counters.
  • Time references: hardware/network timestamps or embedded timestamps (used for alignment and audit logs).
  • Power & environment: input power quality, temperature, link integrity indicators (CRC/retries).

Outputs

  • Inference metadata: detections/tracks/quality flags with timestamps and sequence counters.
  • Event evidence: pre-roll/post-roll clips or frame sets that can reconstruct “what happened.”
  • Health telemetry: drop counters, queue depth, P99 latency, thermal throttle, reset cause, storage integrity stats.
  • Network egress: deterministic classes for time/event/control vs bulk payloads (QoS/TSN usage-level).

Acceptance metrics (what “good” means in the field)

  • P99 latency: end-to-end (capture → infer → decision → output) P99 stays within the target bound under full load.
  • Drop rate: stage-local drops are measurable (ingress/queue/egress/storage) and remain below an agreed threshold.
  • Skew bound: cross-camera time alignment remains within an agreed bound, and drift is detectable before it breaks decisions.
  • Evidence chain integrity: events remain reconstructable (timestamps + sequence counters + checksums + logs), including after reboot/power loss.

Practical tip: keep metrics per-stage. “One global FPS number” hides the root cause.

Evidence pledge (how every chapter will stay “engineering”)

Each chapter ends with: (1) two evidence signals to capture, (2) one discriminator to separate root causes, (3) one first fix that changes the system behavior measurably.

[Figure: system boundary. Pipeline: Cameras (video payload, frame counter, timestamp) → Ingress (DMA, trigger-in, HW counters) → Aggregation fabric (queue depth, backpressure, deterministic flow) → AI SoC (preprocess, NPU/DSP/GPU, postprocess) → Buffer/Storage (RAM ring, NVMe scratch, event clips) → Egress (GbE/TSN classes, metadata + alarms, optional bulk payload), plus Mgmt/Security (secure boot, audit logs, watchdog). Callout, the three enemies of determinism: Latency (P99 spikes from queues/copies), Jitter (skew drift, reorder, retries), Power (brownout/throttle → drops).]
F1: Boundary view. The page stays vertical by tracking how latency/jitter/power create measurable drops and misalignment.
Chapter-close (H2-1) — Evidence → Discriminator → First fix
  • First 2 evidence signals: (1) per-stage drop counter (ingress/queue/egress/storage), (2) end-to-end P99 latency trace with timestamps.
  • Discriminator: if P99 grows while drops are zero, it is queueing/backpressure; if drops rise with stable P99, it is hard capacity (link/DDR/SSD) or errors (CRC/retries).
  • First fix: define explicit targets (P99 bound + per-stage drop budget + skew bound) and attach a counter to each stage before tuning anything else.

H2-2. Requirement decomposition: turn “multi-camera + AI” into computable constraints

Why decomposition matters (the failure pattern)

Most gateway failures look like “random drops” or “unstable latency,” but the root cause is usually a requirement that was never quantified: aggregate camera payload + inference load + storage writes + network egress + thermal/power limits.

This chapter converts intentions into an engineering input sheet: a set of parameters that can be budgeted and verified later.

Card 1 — The parameter sheet (copy/paste checklist)

Fill these before choosing SoC/NIC/SSD or tuning QoS.

  • Cameras (N): resolution, fps, pixel format / compression, expected burst behavior.
  • Inference target: model latency (P50/P99), precision mode, per-frame preprocessing cost.
  • Eventing: event rate, clip length (pre/post seconds), metadata size, retention policy.
  • Egress: uplink speed, traffic classes (time/event/control vs bulk), acceptable packet loss/jitter.
  • Environment: ambient temperature, enclosure airflow, cable length/quality, EMI context, input power stability.
  • Resilience: allowed reboot frequency, power-loss behavior, update/rollback requirement, evidence integrity requirement.

The page stays within scope by keeping this sheet hardware/system-oriented (no cloud platform or model training details).

Card 2 — Three “must-calc” formulas (simple, engineering-grade)

Formula A — Aggregate camera payload (bandwidth ceiling)
Camera Mbps ≈ (W × H × fps × bits_per_pixel) / 1e6 (add protocol/guardband margin)
This bounds: NIC/Switch egress, DDR bandwidth, and queue depth.
Formula B — Inference feasibility (real-time gate)
P99_infer_per_frame × streams ≤ frame_interval (or implement deterministic drop/decimation policy)
If violated, queues grow → P99 explodes → events mis-order. Fix by reducing copies, lowering input rate, ROI/tiling, or strict admission control.
Formula C — Evidence storage budget (event clip write load)
Write MB/s ≈ event_rate × clip_seconds × payload_MB/s (+ index/metadata/journal overhead)
This bounds: NVMe sustained write, power-loss safety window, and integrity strategy.

These formulas are intentionally “simple but decisive”: they reveal which subsystem becomes the first bottleneck (link, DDR, compute, or SSD).
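The three formulas can be combined into a small worksheet. A minimal Python sketch; the guardband and overhead factors are illustrative assumptions, not spec values:

```python
def camera_payload_mbps(width, height, fps, bits_per_pixel, guardband=1.25):
    """Formula A: per-camera payload in Mbps, with protocol/guardband margin (assumed 25%)."""
    return width * height * fps * bits_per_pixel / 1e6 * guardband

def inference_feasible(p99_infer_s, streams, frame_interval_s):
    """Formula B: real-time gate, assuming streams are serviced serially within one frame interval."""
    return p99_infer_s * streams <= frame_interval_s

def evidence_write_mbps(event_rate_hz, clip_seconds, payload_mb_s, overhead=1.15):
    """Formula C: sustained event-clip write load, with index/metadata/journal overhead (assumed 15%)."""
    return event_rate_hz * clip_seconds * payload_mb_s * overhead

# Worked example: four raw 1080p30 streams at 16 bits per pixel
per_cam = camera_payload_mbps(1920, 1080, 30, 16)   # ~1244 Mbps per camera incl. guardband
aggregate = 4 * per_cam                              # first bottleneck check vs NIC/DDR capacity
```

Formula C's units are events/s for the event rate and MB/s for the recorded payload; the overhead factor stands in for index, metadata, and journal cost.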

Card 3 — Acceptance targets (turn requirements into measurable tests)

  • Latency: define end-to-end P50/P99; measure per-stage timestamps (ingress → queue → infer → output).
  • Drop budget: define drop budget per stage; log counters for ingress, queue overflow, egress drops, storage commit failures.
  • Alignment: define cross-camera skew bound and drift alarm threshold; report skew histogram under full load.
  • Reboot resilience: after power interruption, evidence index remains consistent (no reordering, no missing last-N-seconds beyond spec).
  • Thermal resilience: maintain P99 bounds across the temperature range; throttle events are observable and correlate to performance shifts.
[Figure: "Requirements → Resources → Acceptance" mapping. Requirements (camera payload: N, resolution, fps, format; inference target: P50/P99 per frame; event evidence: rate, clip seconds; environment: temp, EMI, power) → Resources/budgets (NIC/switch egress Mbps + headroom; DDR bandwidth for copies + NPU traffic; NPU P99 infer feasibility; SSD write MB/s + integrity window; thermal/power: throttle + brownout risk) → Acceptance (P99 latency via stage timestamps; per-stage drop-budget counters; skew bound histogram + drift; clip + index integrity across power loss). Rule: if a requirement cannot be turned into a budget + counter, it will become a field failure.]
F2: Requirement decomposition. The goal is to decide budgets first, then choose SoC/NIC/SSD and tune QoS around measurable acceptance.
Chapter-close (H2-2) — Evidence → Discriminator → First fix
  • First 2 evidence signals: (1) aggregate input payload rate vs egress capacity (measured, not assumed), (2) inference P99 per frame vs frame interval under full load.
  • Discriminator: if payload headroom exists but P99 grows, it is compute/DDR/queueing; if P99 is stable but drops rise, it is hard capacity or integrity errors (CRC/retries / storage commits).
  • First fix: lock an input sheet with explicit budgets (NIC, DDR, NPU, SSD, thermal) and attach counters for each budget before implementation tuning.

H2-3. Dataflow architecture: Fan-in → Compute → Fan-out (make copies/queues/stalls visible)

What creates latency inside a gateway (the non-obvious sources)

Gateway latency is rarely “just compute.” It is usually the sum of DMA timing, hidden copies, queueing and lock contention, and storage commit backpressure. A stable average can hide unstable tails, so the goal is to explain—and instrument—where P99 spikes come from.

Keywords you should be able to measure in logs/counters: DMA completion · copy bytes · queue depth · drop-on-admission · commit latency

Card 1 — The staged data path (each stage must expose a counter)

Stage A: Ingress DMA

  • Purpose: move payload into memory without CPU copying.
  • Tail risk: interrupt storms, small-packet overhead, mis-sized buffers.
  • Minimum counters: DMA completions/s, ingress drop, RX errors/CRC (if available).

Stage B: Frame pool (memory lifetime control)

  • Purpose: allocate/recycle frames predictably (avoid fragmentation and stalls).
  • Tail risk: pool depletion → blocking allocation → burst drops.
  • Minimum counters: free pool size, alloc-fail count, recycle latency.

Stage C: Capture queue → Inference queue

  • Purpose: decouple ingress from compute, then enforce admission.
  • Tail risk: queue growth hides overload until P99 explodes.
  • Minimum counters: queue depth over time, drop-on-admission, service time P50/P99.

Stage D: Metadata / event queue

  • Purpose: build “evidence objects” (timestamp + seq + result + optional clip pointers).
  • Tail risk: event backlog causes reorder, stale alerts, or clip loss.
  • Minimum counters: event backlog, reorder count, timestamp discontinuity alarms.

Stage E: Fan-out (Egress / Storage)

  • Purpose: deliver metadata and optional evidence clips deterministically.
  • Tail risk: storage commit stalls → backpressure → upstream queue growth.
  • Minimum counters: TX queue length, commit latency P99, commit-fail count.

Rule: for each stage, you need at least one rate counter and one queue/backlog metric. If a stage cannot be measured, it will be blamed incorrectly in field debugging.
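The per-stage rule (one rate counter plus one queue/backlog metric) can be sketched as a counter object attached to each stage A–E. A minimal sketch; the field names and nearest-rank P99 estimate are illustrative assumptions:

```python
from dataclasses import dataclass, field

@dataclass
class StageCounters:
    """Minimum per-stage instrumentation: one rate counter + one backlog metric."""
    name: str
    processed: int = 0                 # rate counter (sample per interval to derive a rate)
    dropped: int = 0                   # stage-local drops
    queue_depth: int = 0               # backlog metric
    service_times: list = field(default_factory=list)

    def record(self, service_time_s: float) -> None:
        self.processed += 1
        self.service_times.append(service_time_s)

    def p99(self) -> float:
        """Nearest-rank P99 of observed service times."""
        if not self.service_times:
            return 0.0
        s = sorted(self.service_times)
        return s[min(len(s) - 1, int(0.99 * len(s)))]

# One counter object per stage of the A-E path
stages = {n: StageCounters(n) for n in
          ("ingress_dma", "frame_pool", "capture_q", "infer_q", "event_q", "fanout")}
```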

Card 2 — Common traps (why P99 spikes even when averages look fine)

  • Hidden copies: format conversion, stride re-pack, debug capture, cross-thread handoff. Symptom: DDR load rises, cache miss rises, inference queue depth grows.
  • Queue explosion without admission control: letting queues absorb overload delays the failure, then breaks determinism. Symptom: stable throughput, rapidly increasing P99 and backlog.
  • Lock contention in shared structures: one “global mutex” in the hot path turns microbursts into tail latency. Symptom: CPU time in waiting, uneven service time distribution.
  • Backpressure propagation: slow storage commits block event completion, which blocks frame recycling, which forces drops at ingress. Symptom: commit P99 spikes appear first, then event queue grows, then frame pool depletes.
  • Unpredictable drops: random dropping ruins evidence integrity (missing the exact frames you need). Symptom: event clips incomplete even though average FPS “looks fine.”
Design rule: drops must be policy-driven (predictable), not accidental. If overload happens, you choose what to sacrifice first (clip length, clip FPS, inference rate, input FPS), and log it.
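The policy-driven drop rule can be pinned down as an explicit, logged degrade order. A minimal sketch; the knob names are hypothetical:

```python
# Degrade order from the design rule: sacrifice clip cost first, input FPS last.
DEGRADE_ORDER = ["clip_length", "clip_fps", "inference_rate", "input_fps"]

def apply_overload_policy(overload_level: int, log: list) -> list:
    """Return the knobs to degrade for overload level 0..4, logging every action
    so drops stay policy-driven and auditable (never accidental)."""
    actions = DEGRADE_ORDER[:max(0, min(overload_level, len(DEGRADE_ORDER)))]
    for action in actions:
        log.append(f"degrade:{action}")
    return actions
```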

Card 3 — Minimal observability set (the “small dashboard” that solves most field cases)

Stage             | Must-have counters                              | Tail-latency indicator
Ingress DMA       | DMA completions/s, ingress drops, RX errors/CRC | RX microburst → queue growth
Frame pool        | free pool size, alloc-fail, recycle latency     | pool depletion → blocking alloc
Capture queue     | depth, drops-on-admission, service time P99     | depth rising → overload
Inference queue   | depth, per-frame infer P99, throttle flags      | infer P99 drift → tail spikes
Event queue       | backlog, reorder count, timestamp discontinuity | backlog rising → stale events
Fan-out / Storage | TX queue length, commit P99, commit-fail count  | commit spikes → backpressure
[Figure: fan-in → compute → fan-out data path with queueing points. Ingress DMA (completions, RX errors/drops) → frame pool (free/recycle latency, alloc-fail alarms) → capture/inference/event queues (depth counters) → inference (per-frame P99, throttle flags) → metadata/event (timestamps + seq, reorder alarms) → egress (TX queue length) and storage (commit P99), with hidden COPY/LOCK hazards and the backpressure path marked. Policy rule: overload must trigger predictable admission + drop behavior, not random loss.]
F3: Data path and queueing points. Tail latency usually comes from hidden copies, queue growth, lock contention, and storage-driven backpressure.
Chapter-close (H2-3) — Evidence → Discriminator → First fix
  • First 2 evidence signals: (1) time-series of queue depth (capture/inference/event) around P99 spikes, (2) storage commit P99 or egress TX queue length during the same window.
  • Discriminator: if commit P99 spikes precede queue growth, the root is fan-out backpressure; if inference queue grows while commit is stable, the root is compute/DDR/copies or admission control.
  • First fix: add explicit admission control and a predictable drop policy (clip reduction → input decimation → inference rate), then validate by observing reduced queue growth and stabilized P99.

H2-4. Bandwidth & latency budgets: simple math to avoid “drops after deployment”

Budgeting mindset (what to budget and why averages are not enough)

A gateway is stable only when every critical resource has (1) a budget, (2) a guardband, and (3) a counter tied to acceptance. Budget with P99 in mind: averages hide burst, retries, GC, and thermal drift.

This chapter budgets four resources (NIC, DDR, NPU, SSD) and then caps per-stage P99.

Card 1 — The budget steps (a repeatable worksheet)

  1. Input payload: compute aggregate camera payload (include protocol overhead + burst margin).
  2. Egress/uplink: compare total input vs available uplink, then reserve capacity for control/event traffic.
  3. DDR headroom: estimate copy pressure; reserve DDR bandwidth for NPU + system activity.
  4. SSD write load: sum steady writes + event clip bursts; account for write amplification/GC windows.
  5. Stage P99 caps: set max P99 per stage (capture → infer → decision → output/commit).
  6. Attach counters: each budget must have one counter and one alarm threshold (before field rollout).
Rule: if you cannot define budget + guardband + counter, you are not budgeting—you are hoping.

Card 2 — Guardband (why 20–30% headroom is engineering, not pessimism)

  • Burst & microburst: instantaneous load can exceed the average by multiples, creating queueing and P99 spikes.
  • Retries/CRC: error-driven retransmission consumes bandwidth and adds tail latency without changing averages much.
  • Thermal drift: compute throughput drops under throttling; without headroom, queues grow and determinism collapses.
  • SSD GC windows: background management increases commit latency; guardband prevents backpressure runaway.

Practical guideline: budget to 70–80% of capacity for sustained operation, and treat the rest as “shock absorber.”
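The 70–80% guideline can be checked mechanically against each budget. A minimal sketch; the capacity and load numbers are invented for illustration:

```python
def guardband_ok(load, capacity, sustained_fraction=0.75):
    """True if sustained load stays below the guardband line (here: 75% of capacity)."""
    return load <= capacity * sustained_fraction

# Hypothetical measured loads vs raw capacities for the four budgets
budgets = {
    "nic_mbps": (2600, 10000),   # aggregate camera payload vs 10 GbE uplink
    "ddr_gbs":  (18.0, 25.6),    # copies + NPU traffic vs DDR bandwidth
    "npu_fps":  (110, 120),      # required vs sustained inference throughput
    "ssd_mbs":  (380, 450),      # clip writes vs sustained NVMe write
}
violations = [name for name, (load, cap) in budgets.items()
              if not guardband_ok(load, cap)]
```

With these invented numbers, the NPU and SSD budgets sit above the 75% line, flagging where a burst or GC window would first break determinism.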

Card 3 — If a budget fails: the downgrade order (keep determinism first)

When overload happens, the goal is to keep time ordering and evidence integrity predictable. Sacrifice “volume” before sacrificing “truth.”

  • Step 1: reduce evidence clip cost (shorter pre/post, lower clip FPS, fewer streams recorded).
  • Step 2: reduce input load (lower camera FPS, ROI/tiling, controlled decimation with logs).
  • Step 3: reduce inference load (lower inference rate, lighter model path, deterministic batching).
  • Step 4: only as last resort, relax end-to-end targets (and record the policy change as an auditable event).
[Figure: budget overlay of capacity vs load with a guardband line for the four budgets: NIC/switch (burst/retries), DDR bandwidth (copy/cache), NPU compute (throttle), SSD write (GC/write amplification), plus per-stage P99 caps (capture → infer → decision → output/commit): set a max P99 per stage; if one exceeds it, downgrade by policy and log it. If load touches the guardband line during normal operation, determinism is at risk.]
F4: Budget overlay. Keep sustained operation below the guardband line; use downgrade order to preserve determinism when shocks occur.
Chapter-close (H2-4) — Evidence → Discriminator → First fix
  • First 2 evidence signals: (1) resource load vs guardband for NIC/DDR/NPU/SSD, (2) per-stage P99 latency (capture→infer→output/commit).
  • Discriminator: if P99 rises while all loads stay below guardband, suspect queueing/locks/hidden copies (H2-3); if loads approach guardband and drops appear, it is a capacity/headroom problem.
  • First fix: apply a 20–30% guardband, reserve deterministic classes for time/event traffic, and implement a logged downgrade order (clip reduction → decimation → inference rate) to stabilize P99.

H2-5. Multi-camera time alignment: “provable sync” the gateway must guarantee

Why “provable” matters (not just “looks aligned”)

In a multi-camera system, alignment must survive load changes, retries, and reboots. “Provable sync” means the gateway can produce an auditable record showing what timestamp source was used, how it was mapped to a unified timeline, and how skew/drift behaved over time.

This chapter focuses on gateway-layer mapping and acceptance. It does not describe Timing Hub/PTP implementation.

Card 1 — Three timestamp inputs (what each is used for)

Input A: Camera timestamp

  • Used for: frame ordering per camera, detecting cadence changes (e.g., exposure-driven frame interval shifts).
  • Risk: not a cross-camera truth source by itself; drift exists between cameras.
  • Log fields: cam_id, frame_id/seq, t_cam, epoch

Input B: Hardware trigger timestamp

  • Used for: cross-camera “same-event” alignment anchor (trigger correlation).
  • Risk: missing/duplicated triggers must be detected and flagged.
  • Log fields: trigger_id, t_trig, edge_count, epoch

Input C: NIC hardware timestamp

  • Used for: stable reference for arrival-time behavior under congestion/load (transport sensitivity).
  • Risk: arrival time is not exposure time; use it to study network/jitter impact, not to fake alignment.
  • Log fields: port, t_nic, rx_queue, epoch

Rule: each evidence record must include timestamp source and epoch so reboots/time jumps cannot silently contaminate correlation.

Card 2 — Alignment workflow (engineering-grade, gateway-layer)

  1. Create a unified timeline: define t_global as the gateway’s reporting axis, and version it by epoch.
  2. Key every frame/event: build stable keys such as cam_id + frame_seq and/or trigger_id.
  3. Map timestamps: produce t_global = map(t_source, epoch) with source + confidence.
  4. Monitor drift slope: estimate drift over a sliding window (e.g., slope of skew vs time) to detect slow degradation.
  5. Detect time steps: when restart/link-recover happens, detect discontinuity and start a new epoch.
  6. Validate under load: repeat skew/drift evaluation at baseline, under inference/record load, and under network congestion.

Outcome: the gateway can state, for any evidence clip/metadata item, “which epoch, which source, and what skew/drift envelope applied.”
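Steps 1–4 of the workflow can be sketched as an epoch-versioned mapping plus a sliding-window drift slope. A minimal sketch; the linear map and field names are illustrative assumptions:

```python
from dataclasses import dataclass

@dataclass
class TimeMapping:
    """Epoch-versioned linear map from one source clock to the gateway's t_global axis."""
    epoch: int
    offset_s: float
    rate: float = 1.0              # drift-corrected rate estimate

    def to_global(self, t_source: float) -> dict:
        return {"t_global": self.rate * t_source + self.offset_s,
                "epoch": self.epoch, "rate": self.rate}

def drift_slope(times, skews):
    """Least-squares slope of skew vs time over a sliding window (s/s);
    alarm when it exceeds the agreed drift envelope."""
    n = len(times)
    mt, ms = sum(times) / n, sum(skews) / n
    num = sum((t - mt) * (s - ms) for t, s in zip(times, skews))
    den = sum((t - mt) ** 2 for t in times)
    return num / den if den else 0.0
```

A restart would allocate a new `TimeMapping` with an incremented epoch, so pre- and post-reboot records can never be silently correlated.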

Card 3 — Acceptance checklist (measure distributions, not anecdotes)

  • Skew distribution: histogram across cameras for identical keys (report P50/P95/P99 and worst-case).
  • Drift slope: track drift vs time; alert when slope exceeds a defined envelope.
  • Trigger correlation: for each trigger_id, verify multi-camera alignment and detect misses/duplicates.
  • Load sensitivity: verify skew does not collapse when bulk transfer, recording, or inference load changes.
  • Reboot integrity: epoch step detection works; evidence records can be filtered by epoch reliably.
[Figure: clock domains and timestamp path. Sources (camera TS: t_cam, frame_seq, per-camera order; trigger TS: t_trig, trigger_id, same-event anchor; NIC HW TS: t_nic, port/queue, load sensitivity) → ingestion (frame/event keys, source tagging, quality flags) → unified mapping (t_global = map(t_source), epoch versioning, step detection, confidence output) → skew monitor (skew histogram, drift slope, load correlation, epoch alarms) → auditable logs (epoch, source, frame_id/trigger_id, t_global, skew, confidence; filterable proof for field debugging). Acceptance must report skew distribution and drift sensitivity under changing load and after reboots.]
F5: Timestamp sources → ingestion → unified mapping (epoch) → skew/drift monitoring → auditable logs.
Chapter-close (H2-5) — Evidence → Discriminator → First fix
  • First 2 measurements: (1) skew histogram (baseline vs load), (2) drift slope trend + epoch step events around reboot/recover.
  • Discriminator: if skew worsens only under congestion, focus on timestamp ingestion path and deterministic delivery (H2-6); if step/drift anomalies appear after restart, focus on epoch versioning and discontinuity detection.
  • First fix: enforce epoch-tagged mapping for every evidence item and output source + confidence, then verify improvement via stable skew distribution under load.

H2-6. GbE/TSN switching & deterministic delivery: QoS to tame congestion jitter

Why the gateway needs traffic classes (critical vs bulk)

A vision gateway carries both tiny critical flows (trigger/time-sync/metadata) and massive bulk flows (video, logs). Without traffic classes, bulk microbursts and head-of-line blocking can convert “plenty of bandwidth” into P99 latency spikes and missed triggers. Determinism comes from isolating and policing bulk, while guaranteeing priority service for critical lanes.

Focus: engineering usage of VLAN / priority queues / shaping / policing. No PHY deep dive.

Card 1 — Traffic classification template (copy/paste into requirements)

Define four logical lanes and bind them to queue/guarantee/drop behavior.

Lanes: (1) Time-sync / Trigger · (2) Metadata / Events · (3) Bulk Video · (4) Logs / Telemetry
  • Lane 1 (Trigger): highest priority, minimal jitter target, “do not drop” unless explicitly declared and logged.
  • Lane 2 (Metadata): priority service, bounded P99, drops must be policy-driven (never silent loss).
  • Lane 3 (Bulk Video): shaped/policed; allowed to degrade first (resolution/FPS/clip cost) under overload.
  • Lane 4 (Logs): lower priority; keep key alarms/events, allow batching and delayed export.
Design rule: determinism is achieved by separating critical from bulk and forcing bulk to behave via shaping/policing, not by hoping “average bandwidth is enough.”
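The four-lane template can be expressed as a classification table with a safe default. A minimal sketch; the lane and flow-type names are hypothetical, and the key property is that unknown traffic defaults to bulk, never to a critical lane:

```python
# priority 0 is served first; the policy string names each lane's drop behavior
LANES = {
    "time_sync_trigger": (0, "no_drop_unless_declared_and_logged"),
    "metadata_events":   (1, "policy_drop_only"),
    "bulk_video":        (2, "shape_and_degrade_first"),
    "logs_telemetry":    (3, "batch_and_delay_ok"),
}

def classify(flow_type: str) -> str:
    """Map a flow to a lane; misclassification is itself a counted failure mode."""
    mapping = {"trigger": "time_sync_trigger", "time_sync": "time_sync_trigger",
               "metadata": "metadata_events", "event": "metadata_events",
               "video": "bulk_video", "log": "logs_telemetry"}
    return mapping.get(flow_type, "bulk_video")   # default to bulk, never to critical
```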

Card 2 — Typical failures → counters → first actions

  • Microburst causes trigger jitter: Counters: per-port queue depth spikes, burst drop counters, trigger P99 distribution. First action: shape bulk video, reserve priority queue for Lane 1/2.
  • HOL blocking delays metadata: Counters: egress queue head occupancy, per-class latency histogram, metadata late count. First action: strict priority for critical lanes, limit bulk burst size.
  • P99 spikes during “everything looks fine”: Counters: P99 latency vs baseline, CRC/retry indicators, queue depth correlation. First action: validate classification tags (VLAN/priority), then add policing for non-critical classes.
  • Critical flow drop without visible congestion: Counters: per-class drop counters, misclassification count, rule-hit counters. First action: fix classification mapping; verify Lane 1/2 never traverse bulk lane.

Card 3 — Minimal stress test (bulk + trigger storm)

  1. Baseline: measure Lane 1/2 latency histogram at idle (record P50/P99).
  2. Add bulk: run sustained bulk video near target load, measure again.
  3. Add storm: inject trigger burst and metadata burst concurrently.
  4. Acceptance: Lane 1/2 P99 stays within bound; drops are zero or explicitly policy-logged.
  5. Fix loop: apply shaping/policing to Lane 3/4 and re-run until histograms stabilize.
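The acceptance gate in step 4 reduces to a percentile check plus a drop-accounting check. A minimal sketch using a nearest-rank P99 estimate:

```python
def p99(samples):
    """Nearest-rank P99 of a latency sample set."""
    s = sorted(samples)
    return s[min(len(s) - 1, int(0.99 * len(s)))]

def lane_acceptance(latencies_s, p99_bound_s, policy_logged_drops, total_drops):
    """Step 4: critical-lane P99 within bound AND every drop accounted for by policy."""
    return p99(latencies_s) <= p99_bound_s and total_drops == policy_logged_drops
```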
[Figure: deterministic delivery through four traffic lanes. Cameras (video + triggers) and gateway apps (metadata + logs) feed a GbE/TSN switch fabric (VLAN, priority queues, shaping/policing): Lane 1 time-sync/trigger (priority), Lane 2 metadata/events (guarantee), Lane 3 bulk video (shaping), Lane 4 logs/telemetry (policing), toward the uplink (PLC/server), storage (evidence clips), and management (health + logs). Failure modes to watch: microburst → queue spike → P99 jitter; HOL blocking; misclassification. Acceptance: Lane 1/2 P99 stays bounded while Lane 3 runs near peak load (bulk + trigger storm test).]
F6: Four logical traffic lanes. Critical lanes get priority/guarantee; bulk lanes are shaped/policed to prevent jitter.
Chapter-close (H2-6) — Evidence → Discriminator → First fix
  • First 2 measurements: (1) Lane 1/2 latency histogram (P50/P99) under baseline + bulk load, (2) per-port queue depth and per-class drop counters during microbursts.
  • Discriminator: if Lane 1/2 P99 worsens when bulk starts, QoS isolation is missing or misconfigured; if drops occur without congestion, traffic is misclassified into the wrong lane.
  • First fix: enforce classification tags (VLAN/priority), reserve priority service for Lane 1/2, and apply shaping/policing to Lane 3/4; re-run the bulk + trigger storm test until histograms stabilize.

H2-7. AI inference compute subsystem: selecting & scheduling NPU/DSP/GPU (inference-only)

What this chapter guarantees

In a vision gateway, “compute” is only useful when it becomes stable throughput with a bounded P99 inference latency across multiple camera streams and across temperature/load changes. This chapter stays strictly on inference: selection, scheduling, and proof under thermal throttling.

Out of scope: training, MLOps, platform pipelines, and compiler deep tutorials.

Card 1 — Selection dimensions (each with a validation method)

1) Latency profile (P50/P99)

  • What matters: tail spikes, not just average latency.
  • Validate: run a fixed input set for ≥10–30 minutes; log P50/P95/P99 and spike counts.

2) Power-per-frame (energy stability)

  • What matters: steady-state watts and transient peaks that trigger throttling.
  • Validate: measure power rails or platform power counters vs FPS; correlate with throttling flags.

3) Precision support (INT8/FP16/Hybrid)

  • What matters: stable outputs at target thresholds (avoid “edge-case flips”).
  • Validate: compare inference outputs on a fixed validation set; track confidence drift and decision flips.

4) Memory movement (pre/post overhead)

  • What matters: preprocess/postprocess can dominate latency even if TOPS is high.
  • Validate: split timing into pre, infer, post; monitor DDR bandwidth usage.

5) Concurrency isolation (multi-stream interference)

  • What matters: P99 must stay bounded when N streams run concurrently.
  • Validate: compare 1-stream vs N-stream P99, queue growth, and drop/degenerate counters.

Practical reading: selection is not “NPU vs GPU” as a slogan; it is a repeatable measurement plan tied to P99 stability.

Card 2 — Scheduling strategies (optimize for stability)

Goal: keep inference predictable under multi-camera fan-in. Peak throughput is secondary to bounded tails.

  • Job model: each camera produces inference jobs keyed by cam_id + frame_seq and optional trigger_id.
  • Queues: maintain an inference queue with explicit depth limits; expose queue_depth as a first-class counter.
  • Priorities: triggered events and safety-critical ROIs get priority over routine frames.
  • Batching trade-off: batching improves throughput but increases single-frame latency; enforce a max wait / deadline to prevent tail explosions.
  • Admission control: when compute cannot keep up, reject or downgrade jobs by policy (never silently). Typical order: reduce ROI → lower model size → drop FPS → drop evidence richness.
  • Proof logging: every degrade action increments counters and tags outputs (e.g., degrade_level) to keep decisions auditable.
Rule of thumb: a stable gateway is the one that can explain why it degraded and how it kept P99 bounded, not the one with the highest peak FPS.
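The queue-depth limit, deadline-bounded batching, and drop-on-admission counter from the strategy list can be sketched together. A minimal sketch; parameter values are illustrative:

```python
import heapq

class InferenceScheduler:
    """Priority queue with a depth limit + max-wait batching (stability over peak FPS)."""
    def __init__(self, max_depth=32, max_batch=4, max_wait_s=0.005):
        self.q, self.max_depth = [], max_depth
        self.max_batch, self.max_wait_s = max_batch, max_wait_s
        self.rejected = 0                    # drop-on-admission counter (never silent)

    def submit(self, priority, t_enqueue, job):
        if len(self.q) >= self.max_depth:
            self.rejected += 1               # admission control: reject by policy
            return False
        heapq.heappush(self.q, (priority, t_enqueue, job))
        return True

    def next_batch(self, now):
        """Release a batch when it is full OR the oldest job hits its wait deadline,
        so batching never turns into single-frame tail explosions."""
        if not self.q:
            return []
        oldest_enqueue = self.q[0][1]
        if len(self.q) < self.max_batch and (now - oldest_enqueue) < self.max_wait_s:
            return []                        # keep waiting; deadline not reached yet
        return [heapq.heappop(self.q)[2]
                for _ in range(min(self.max_batch, len(self.q)))]
```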

Card 3 — Thermal determinism collapse (how to prove it)

  • Causal chain: temperature rises → thermal governor throttles → compute frequency drops → infer P99 drifts → inference queue grows → frame/event deadlines miss.
  • Evidence bundle (must co-plot): temperature, throttle_flag/freq_state, infer_P99, queue_depth, drop/degrade.
  • Discriminator: if infer P99 spikes correlate with throttling state transitions, the failure is thermal-driven (not “random compute”).
  • First fixes: tighten admission control, pre-emptively degrade under rising temp, and ensure throttling events are logged alongside inference tails.
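The discriminator above (infer P99 spikes correlating with throttle transitions) can be sketched as a simple coincidence count over the co-plotted series. A minimal sketch; thresholds and field names are illustrative:

```python
def thermal_discriminator(throttle_flags, p99_series, spike_threshold_s):
    """Count P99 spikes that coincide with throttling vs those that do not.
    Spikes mostly under throttle -> thermal-driven; otherwise suspect
    scheduling/admission, DDR pressure, or hidden copies."""
    with_throttle = sum(1 for flag, p in zip(throttle_flags, p99_series)
                        if p > spike_threshold_s and flag)
    without = sum(1 for flag, p in zip(throttle_flags, p99_series)
                  if p > spike_threshold_s and not flag)
    return {"spikes_with_throttle": with_throttle,
            "spikes_without": without,
            "thermal_driven": with_throttle > without}
```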
[Figure: inference compute island (stability-first). Input frames (N cameras, frame_seq/trigger_id) → preprocess (resize/normalize, ROI select) → compute (NPU/GPU/DSP, infer P50/P99, batch/concurrency) → postprocess (NMS/track, threshold) → metadata out (events/boxes, confidence + tags), governed by a thermal governor (temperature → throttle, freq_state changes, P99 drift correlation). Perf counters that must be logged: infer_P50, infer_P99, queue_depth, drop_count, degrade_level, throttle_flag, freq_state, temperature, power. Proof = distributions + correlations, not single numbers. Acceptance: P99 inference latency remains bounded across multi-stream load and thermal steady-state.]
F7: Compute island + thermal governor + counters needed to prove stable P99 latency.
Chapter-close (H2-7) — Evidence → Discriminator → First fix
  • First 2 measurements: (1) infer latency histogram (P50/P95/P99) at 1-stream vs N-stream, (2) temperature + throttle_flag + infer_P99 trend to capture drift.
  • Discriminator: infer P99 drifting with throttle transitions indicates thermal-driven determinism collapse; infer P99 worsening with concurrency at stable temp indicates scheduling/admission issues.
  • First fix: tighten admission control and policy-based degradation (ROI/model/FPS), and log throttle events alongside P99 distributions for auditable proof.

H2-8. Local buffering & evidence storage: ring buffers, event clips, crash consistency

Why storage is part of the gateway’s value

A gateway is not only a router of pixels. It is the system that can retain evidence around events and prove that clips are not lost, not reordered, and not corrupted after crashes or power cuts. This chapter stays at the system level: buffer tiers, commit order, journal + atomic index, and minimal power-cut validation.

Out of scope: NVMe controller internals, FTL/ECC algorithm deep dive.

Card 1 — Buffer pyramid (time-scale tiers)

Tier 1: RAM ring (seconds)

  • Role: absorb short bursts and provide pre-roll.
  • Key counters: ring fill level, overwrite rate, capture-to-clip latency.

Tier 2: NVMe scratch (minutes / hours)

  • Role: store event clips and batch commits under sustained load.
  • Key counters: sustained write MB/s, commit P99 latency, journal replay time.

Tier 3: Export (delivery)

  • Role: export evidence clips + metadata (uplink or removable storage).
  • Key counters: export backlog, retry count, integrity verification results.

Design rule: tier boundaries must be explicit so overload handling is policy-driven (not random loss).
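Tier 1's pre-roll role can be sketched with a fixed-capacity ring. A minimal sketch with illustrative names (FrameRing is not a real library class); frames are treated as opaque objects:

```python
from collections import deque

class FrameRing:
    """Tier-1 RAM ring: keeps the last N frames so an event can pull
    pre-roll without touching NVMe. Overwrites are counted so the
    'overwrite rate' counter can be exported as health telemetry."""
    def __init__(self, capacity):
        self.buf = deque(maxlen=capacity)
        self.overwrites = 0

    def push(self, frame):
        if len(self.buf) == self.buf.maxlen:
            self.overwrites += 1  # oldest frame dropped: expected in steady state
        self.buf.append(frame)

    def preroll(self, n):
        """Return the newest n frames (oldest first) as the event pre-roll window."""
        return list(self.buf)[-n:]
```

The deque with `maxlen` gives policy-driven overwrite at the tier boundary: the drop point is explicit and counted, never random.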

Card 2 — Event clips: commit order + index integrity

Goal: never allow the index to point to a partially written clip (no “ghost evidence”).

  1. Select window: on event, choose pre-roll + post-roll segments from RAM ring.
  2. Write chunks: append clip chunks to NVMe scratch (include seq and per-chunk length).
  3. Write integrity fields: store CRC/hash and final length markers to detect truncation.
  4. Append journal record: write a journal entry that describes the clip and its chunk list (still not visible to index).
  5. Atomic index update (last): update the searchable index to point only to completed clip records.
  6. Export & retention: export according to policy and reclaim storage without breaking index consistency.
Critical invariant: data → journal → atomic index. The index must be updated last.
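The six-step commit order can be sketched end-to-end. This is a minimal sketch, not a production clip store: the function and file names (commit_clip, journal.log, index.json) are illustrative, and a real implementation should also fsync the containing directory after the rename:

```python
import json, os, zlib

def commit_clip(base_dir, clip_id, chunks):
    """Commit an event clip with the invariant data -> journal -> atomic index.
    The index is updated last, via atomic rename, so it can never point
    at a partially written clip."""
    clip_path = os.path.join(base_dir, f"{clip_id}.clip")
    # Steps 1-3: write chunks with seq, length, CRC; fsync before journaling.
    with open(clip_path, "wb") as f:
        for seq, chunk in enumerate(chunks):
            f.write(seq.to_bytes(4, "big"))
            f.write(len(chunk).to_bytes(4, "big"))
            f.write(zlib.crc32(chunk).to_bytes(4, "big"))
            f.write(chunk)
        f.flush(); os.fsync(f.fileno())
    # Step 4: append a journal record describing the clip (not yet visible).
    with open(os.path.join(base_dir, "journal.log"), "a") as j:
        j.write(json.dumps({"clip_id": clip_id, "chunks": len(chunks)}) + "\n")
        j.flush(); os.fsync(j.fileno())
    # Step 5: atomic index update: temp file, fsync, rename over the index.
    index_path = os.path.join(base_dir, "index.json")
    index = json.load(open(index_path)) if os.path.exists(index_path) else {}
    index[clip_id] = {"path": clip_path, "chunks": len(chunks)}
    tmp = index_path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(index, f); f.flush(); os.fsync(f.fileno())
    os.replace(tmp, index_path)  # atomic: readers see old or new index, never a mix
```

`os.replace` is the atomicity primitive here: a power cut before the rename leaves the old index intact, so the new clip is simply absent rather than half-published.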

Card 3 — Crash consistency & power-cut acceptance (how to prove “not broken”)

  • Crash consistency goal: after reboot, the system replays the journal and exposes only fully committed clips.
  • Power-cut test (minimal): run mixed workload (sustained write + frequent events), then cut power at random times for multiple trials.
  • Post-reboot checks:
    • No ghost clips: index entries never point to missing/truncated data.
    • No reordering: clip time/seq monotonicity holds per camera/event key.
    • Clear discontinuities: if a clip is incomplete, it is absent (or explicitly flagged), not silently corrupted.
  • PLP validation: confirm that journal + index atomic update succeed within the hold-up window under worst-case load.
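The replay rule ("expose only fully committed clips; flag incomplete ones explicitly") can be sketched as follows, assuming a newline-delimited JSON journal where each record names its clip file; layout and names are illustrative:

```python
import json, os

def replay_journal(base_dir):
    """After reboot: rebuild the visible clip set from the journal,
    exposing only clips whose data is fully present on disk.
    Incomplete entries are reported explicitly, never silently trusted."""
    visible, incomplete = [], []
    journal = os.path.join(base_dir, "journal.log")
    if not os.path.exists(journal):
        return visible, incomplete
    for line in open(journal):
        line = line.strip()
        if not line:
            continue
        try:
            rec = json.loads(line)  # a torn final record fails to parse
        except json.JSONDecodeError:
            incomplete.append(line)  # truncated tail: flag it, don't guess
            continue
        clip = os.path.join(base_dir, f"{rec['clip_id']}.clip")
        if os.path.exists(clip):
            visible.append(rec["clip_id"])
        else:
            incomplete.append(rec["clip_id"])
    return visible, incomplete
```

A torn last journal line (power cut mid-append) parses as invalid JSON and lands in the incomplete list, which is exactly the "absent or explicitly flagged, never silently corrupted" acceptance behavior.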
[Figure: local buffering and evidence commit. Buffer pyramid: Tier 1 RAM ring (seconds, pre-roll, ring fill level) → Tier 2 NVMe scratch (minutes/hours, clip chunks, commit P99) → Tier 3 export (deliver evidence, backlog + retry). Commit pipeline (order matters): event detect (select pre/post-roll) → write chunks (seq + length + CRC) → journal append (describes the clip) → atomic index update (publish completed clips) → export / retention (backlog, retry, reclaim without breaking index). Invariant: data → journal → atomic index, which prevents ghost evidence after crashes and power cuts. Acceptance: power-cut tests must show no ghost clips and no reordering after journal replay.]
F8: Buffer pyramid + commit order (journal then atomic index) to guarantee crash consistency.
Chapter-close (H2-8) — Evidence → Discriminator → First fix
  • First 2 measurements: (1) commit P99 latency and index update success counters under mixed workload, (2) power-cut test results (ghost clips / truncation / reorder counts after reboot).
  • Discriminator: ghost evidence implies index published before completion (ordering/journal issue); reordering implies missing monotonic keys (seq/epoch) or incomplete replay rules.
  • First fix: enforce data → journal → atomic index, version the index with epoch/seq, and expose only fully committed clips after journal replay.

H2-9. Security & lifecycle: secure boot, keys, signed updates, rollback (system-level)

Why the gateway is a security boundary

A vision gateway is an enforcement point for firmware trust, evidence traceability, and non-bricking updates. The goal is not to list security features, but to define a verifiable chain: boot verification, auditable failures, encrypted evidence at rest, and A/B updates with rollback and power-cut tests.

Out of scope: crypto algorithm derivations, TPM command tutorials, cloud PKI/MLOps platforms.

Card 1 — Threat model (3 items only)

1) Unauthorized firmware

  • Asset: boot chain and runtime images.
  • Impact: unsafe behavior, untrustworthy metadata.
  • Guarantee: signed images + auditable verify failures.

2) Evidence leakage

  • Asset: event clips and forensic metadata.
  • Impact: privacy/IP exposure.
  • Guarantee: data-at-rest encryption + key protection.

3) Update bricking

  • Asset: availability (field uptime).
  • Impact: device becomes unrecoverable after interruptions.
  • Guarantee: A/B slots + rollback protection + power-cut update tests.

Card 2 — Boot trust chain and update chain (must be verifiable)

Root of Trust (RoT / TPM / Secure Element): purpose only

  • Identity & keys: store device identity and protect signing/unwrap keys.
  • Measurements: optionally provide boot measurements (hashes) for audit and forensics.
  • System requirement: keys never leave the protected boundary in plaintext.

Secure boot chain: ROM → BL1 → BL2 → OS → App

  • Each stage verifies the next: signature check before execution.
  • Failure must be recorded: no silent black-screen. Emit a structured failure record.
  • Suggested audit keys: boot_stage, image_hash, sig_check, verify_fail_code, rollback_index, boot_reason.
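The "failure must be recorded" rule can be sketched as a structured audit emitter. A minimal sketch only: real secure boot verifies an asymmetric signature in ROM/bootloader code, so the hash compare below stands in for the check, while the audit record uses the suggested keys above:

```python
import hashlib, json

def verify_stage(boot_stage, image_bytes, expected_hash, audit_log):
    """Verify one boot stage and ALWAYS emit a structured audit record.
    The signature check is stubbed as a SHA-256 compare for illustration;
    the point is that failure produces evidence, not a silent black screen."""
    image_hash = hashlib.sha256(image_bytes).hexdigest()
    ok = (image_hash == expected_hash)
    audit_log.append(json.dumps({
        "boot_stage": boot_stage,
        "image_hash": image_hash,
        "sig_check": "pass" if ok else "fail",
        "verify_fail_code": None if ok else "HASH_MISMATCH",
    }))
    return ok  # caller refuses to execute the next stage on False
```

Note the asymmetry: the audit append happens on both paths, so a field unit that refuses to boot still leaves a record keyed by boot_stage and verify_fail_code.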

Signed update chain: verify → write(B) → verify → switch → first-boot → commit

  • A/B slots: keep a known-good slot available until the new slot is committed.
  • Anti-rollback: reject downgrade images by rollback index policy (and record it).
  • Suggested update keys: update_id, update_version, slot_active, slot_pending, update_state, verify_result, rollback_reason, power_loss_flag.
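The update chain's state model can be sketched as a tiny state machine. Names are illustrative; a real implementation persists the state and rollback index in tamper-resistant storage so an interrupted update resumes or rolls back deterministically:

```python
STATES = ["idle", "verify", "write_b", "verify_b", "switch", "first_boot", "commit"]

class ABUpdater:
    """Minimal A/B update state machine: Slot A stays known-good until the
    new slot is committed; any failed step rolls back to A with an
    auditable rollback_reason."""
    def __init__(self):
        self.slot_active, self.slot_pending = "A", None
        self.state, self.rollback_reason = "idle", None

    def step(self, ok=True, reason=None):
        """Advance one step; on failure, abandon the pending slot."""
        if not ok:
            self.state, self.slot_pending = "idle", None
            self.rollback_reason = reason or "step_failed"
            return self.slot_active  # still the known-good slot
        i = STATES.index(self.state)
        self.state = STATES[min(i + 1, len(STATES) - 1)]
        if self.state == "write_b":
            self.slot_pending = "B"
        if self.state == "commit":  # only now does B become active
            self.slot_active, self.slot_pending = "B", None
        return self.slot_active
```

A power cut at any point before `commit` leaves `slot_active == "A"`, which is the non-bricking guarantee the interrupted-update test in Card 3 must demonstrate.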

Data-at-rest (evidence clips): system-level requirement

  • Encrypt evidence clips and indices: clips remain confidential when storage is removed.
  • Key linkage: store data_key_id and audit_log_seq to support traceability and replay.
  • Rotation policy: rotate keys by version/epoch without breaking readability of historical evidence.
[Figure: chain of trust (verifiable): RoT → boot stages → storage key → signed update → audit log. RoT / TPM / SE (identity + key root, measure / verify) → secure boot stages (ROM → BL1 → BL2 → OS → App; each stage verifies the next, and failures must emit audit records with sig_check and fail_code) → storage key (data-at-rest encryption for clips; key_id + epoch / rotation) → signed update (verify → write(B) → verify → switch → first-boot → commit; A/B slots with rollback index: Slot A known-good, Slot B pending/new) → audit log (seq, event, result: boot verify, update state, rollback reason, power-loss). Acceptance: verify failures are auditable; updates survive interruptions; evidence at rest is encrypted and traceable.]
F9: Chain of trust across boot, storage encryption keys, signed A/B updates, and audit logging.

Card 3 — Field acceptance: interrupted update test + rollback verification

  • Interrupted update (power-cut): cut power at random times during write(B), switch, and first-boot. Expected outcome: recover to Slot A or complete safely—no permanent bricking.
  • Rollback verification: present invalid signatures, corrupted images, or forbidden downgrade versions. Expected outcome: refuse activation and emit rollback_reason with an auditable record.
  • Audit export: export a compact audit bundle keyed by audit_log_seq so field incidents remain traceable.
Chapter-close (H2-9) — Evidence → Discriminator → First fix
  • First 2 measurements: (1) boot verification logs per stage, (2) interrupted update trials with recovery rates and rollback reasons.
  • Discriminator: missing failure records indicates an audit-chain gap; bricking under interruptions indicates update state machine/slot switch weakness.
  • First fix: enforce structured audit keys everywhere and a strict state model: verify → write(B) → verify → switch → first-boot → commit, with rollback on any failure.

H2-10. Reliability as determinism: power, thermal, EMC evidence for drops and jitter

Why “reliability” directly changes latency and drop rates

In a vision gateway, power/thermal/EMC issues are not vague “environment problems”. They manifest as measurable counters and waveforms: rail droop, thermal throttling, CRC/retries, and watchdog reasons. Reliability is part of determinism because it changes P99 latency tails and frame/event integrity.

Out of scope: power topology derivations and EMC certification walkthroughs.

Card 1 — Three causal chains (power / thermal / EMC)

Power droop → reboot / drops

Inference bursts, SSD write bursts, or fan start → rail droop → reset or drop bursts.

Thermal throttle → P99 drift → queue pileup

Temperature rise → throttle → infer P99 drifts → queues grow → deadlines missed.

EMI CRC → retries → latency spikes

CRC increases → retransmits/backoff → congestion → P99 latency spikes and jitter.

Card 2 — Each chain: 2 measurements + discriminator + first fix

(A) Power chain

  • First 2 measurements: (1) key rails waveform (SoC/DDR/SSD/PoE DC-DC out) for droop/UVLO margin, (2) reset/watchdog reason + drop counters aligned to burst timestamps.
  • Discriminator: droop events time-align with resets/drops across repeated trials.
  • First fix: limit burst concurrency (stagger SSD flush vs inference peaks), add policy-based throttles, and log rail events into audit evidence.

(B) Thermal chain

  • First 2 measurements: (1) temperature + throttle_flag/freq_state, (2) infer_P99 + queue_depth trends through thermal steady-state.
  • Discriminator: infer P99 drift correlates with throttle transitions and drives queue monotonic growth.
  • First fix: pre-emptive degradation (ROI/model/FPS) when temperature slope rises; keep “critical flows” protected.

(C) EMC chain

  • First 2 measurements: (1) CRC/error counters + retries, (2) P99 latency histogram and spike timestamps (P99 tails).
  • Discriminator: CRC/retry bursts line up with latency spikes even when average bandwidth is not maxed.
  • First fix: prioritize and isolate critical flows (QoS/TSN policies) and record interference windows for field reproducibility.
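The "line up" test in chain (C) can be made concrete: instead of comparing averages, compute the fraction of latency spikes that fall inside a window around CRC/retry bursts. A minimal sketch; the window size and function name are illustrative:

```python
def aligned_fraction(burst_times, spike_times, window_s=1.0):
    """Fraction of P99 latency spikes occurring within window_s seconds
    of a CRC/retry burst. Only conclude EMC causation when this stays
    high across repeated trials: time alignment, not averages."""
    if not spike_times:
        return 0.0
    hits = sum(
        1 for t in spike_times
        if any(abs(t - b) <= window_s for b in burst_times)
    )
    return hits / len(spike_times)
```

A value near 1.0 across trials supports the EMI hypothesis; a low value with spikes still present pushes the investigation back toward congestion or storage commit blocking.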

Card 3 — Degradation policy (preserve determinism)

Under reliability stress, degradation must be explicit and auditable, not random loss.

  • Priority 1: time-sync / trigger + metadata (critical flows).
  • Priority 2: event evidence clips (shorter clips first, lower bitrate/resolution as needed).
  • Priority 3: bulk video and background logs.
  • Audit keys: always emit degrade_level and root_cause_hint (power/thermal/emc) alongside outcomes.
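The priority ladder can be sketched as a policy function; flow names and the stress-level mapping (0 = nominal, 1 = shed bulk, 2 = evidence clips also shed) are illustrative, not a standard:

```python
def degrade_plan(stress_level, root_cause_hint):
    """Explicit, auditable degradation: shed lowest-priority flows first
    and always emit the audit record alongside the decision."""
    priorities = [
        (1, "time_sync_trigger_metadata"),   # critical flows: always protected
        (2, "event_evidence_clips"),
        (3, "bulk_video_background_logs"),
    ]
    # stress 0 keeps priorities 1-3; stress 1 keeps 1-2; stress 2 keeps only 1.
    keep = [name for prio, name in priorities if prio <= 3 - stress_level]
    audit = {"degrade_level": stress_level, "root_cause_hint": root_cause_hint}
    return keep, audit
```

Because the shed order is a function of a single stress level, the same overload always produces the same loss pattern, which is what makes the degradation auditable rather than random.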
[Figure: reliability = determinism (evidence-based); cause → measurement → effect → evidence keys. Power chain: power bursts (infer / SSD / fan) → rail droop (UVLO margin) → reboot / drops (frame loss burst); evidence: watchdog_reason. Thermal chain: temperature (steady-state rise) → throttle (freq_state) → P99 drift (infer_P99 up); evidence: queue_depth. EMC chain: interference (EMI / grounding) → CRC / retries (error counters) → latency spikes (P99 tails); evidence: crc, retries. Degrade policy (explicit + auditable): critical flows (sync/trigger + metadata) → evidence clips → bulk video/logs; always log degrade_level + root_cause_hint (power/thermal/emc). Acceptance: causes must align to measurements and evidence keys; determinism is preserved by explicit degradation.]
F10: Evidence-first reliability: power/thermal/EMC chains that create drops and P99 latency spikes.
Chapter-close (H2-10) — Evidence → Discriminator → First fix
  • First 2 measurements: choose per symptom: rail droop + watchdog_reason, temperature/throttle + infer_P99, CRC/retries + latency histogram.
  • Discriminator: only conclude after time-alignment between counters/waveforms and P99 spikes/drops.
  • First fix: protect critical flows and make degradation auditable (degrade_level, root_cause_hint), then iterate hardware mitigation.

H2-11. Validation & field debug SOP: symptom → evidence → isolate → fix

How to use this SOP (fast + repeatable)

Each symptom card is a three-line checklist designed for field speed: First 2 measurements → Discriminator → First fix. Only gateway-layer evidence is used.

Evidence domains at a glance:

  • Network: CRC / retries / egress_drop
  • Compute: infer_P99 / NPU busy
  • Memory: queue_depth / DDR_BW
  • Storage: NVMe_P99 / journal_commit_fail
  • Stability: rail_droop / watchdog_reason
  • Thermal: throttle_flag / temp
  • Policy: degrade_level

Minimal evidence pack to keep always-on: CRC, retries, egress_drop, capture_queue_depth, infer_queue_depth, infer_P99, DDR_BW_pressure, NVMe_write_latency_P99, journal_commit_fail, watchdog_reason, rail_droop_event, throttle_flag, degrade_level.

Symptom: Frame drops / stutter

  • First 2 measurements: (1) capture_queue_depth (or ingress ring occupancy), (2) egress_drop or infer_queue_depth (pick the dominant downstream stage).
  • Discriminator: queue grows first → downstream bottleneck (compute/memory/storage); egress_drop rises first → congestion / shaping deficiency.
  • First fix: enforce explicit degradation (protect metadata/trigger first), cap queue depth with predictable drop policy, isolate bulk video from critical flows.

Example MPNs (gateway-layer):

  • TSN/QoS switch: NXP SJA1105, Microchip LAN9662
  • GbE PHY: TI DP83867, Microchip KSZ9131RNX
  • Power monitor (correlate bursts vs droop): TI INA226, ADI LTC2947

Symptom: P99 latency spikes (tail latency)

  • First 2 measurements: (1) end_to_end_P99 (or infer_P99), (2) NVMe_write_latency_P99 or CRC/retries (pick storage vs link hypothesis).
  • Discriminator: NVMe_P99 aligns with spikes → commit/flush blocking; CRC/retries align → retransmit/backoff causing tail.
  • First fix: move evidence writes to async/batched commit (bounded), rate-limit bulk streams, keep critical lanes guaranteed (QoS/TSN).

Example MPNs:

  • TSN switch (critical lane isolation): NXP SJA1110, Microchip LAN9662
  • Hold-up / backup rail for safe commits: ADI LTC3350 (supercap backup controller)
  • eFuse / inrush limiting (reduce write-burst droop coupling): TI TPS25947

Symptom: Camera-to-camera misalignment (skew / drift)

  • First 2 measurements: (1) skew_histogram_P99, (2) drift_slope (or time_jump_detect_count after reboot).
  • Discriminator: drift_slope increases under load → time mapping not stable; time jumps after reboot without alarm → mapping rebuild is missing.
  • First fix: enforce unified timeline mapping at ingestion; alarm on jump; export skew quality as health telemetry (no deep dive into Timing Hub).

Example MPNs:

  • TPM / secure logging anchor (timestamped audit integrity): Infineon SLB9670, Nuvoton NPCT750
  • Watchdog/supervisor (reboot correlation evidence): TI TPS3436, Maxim MAX16052
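drift_slope, used in the measurements above, is just the least-squares slope of camera-offset samples against gateway time. A minimal pure-stdlib sketch; the sample format (time_s, offset_us) pairs is an assumption for illustration:

```python
def drift_slope(samples):
    """Least-squares slope of camera time offset vs. gateway time,
    e.g. microseconds of skew per second. A slope that grows under
    load is the 'time mapping not stable' signature."""
    n = len(samples)
    if n < 2:
        return 0.0
    mean_t = sum(t for t, _ in samples) / n
    mean_o = sum(o for _, o in samples) / n
    num = sum((t - mean_t) * (o - mean_o) for t, o in samples)
    den = sum((t - mean_t) ** 2 for t, _ in samples)
    return num / den if den else 0.0
```

Comparing the slope computed at idle against the slope computed under bulk load gives the discriminator directly: load-dependent growth implicates the time mapping, not the clocks themselves.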

Symptom: Trigger missing / event loss

  • First 2 measurements: (1) trigger_in_count vs frame_received_count (delta), (2) ingress_drop or timestamp_monotonic_violation.
  • Discriminator: trigger_in present but frames missing → ingestion/backpressure; trigger_in missing → evidence points upstream (record this fact, do not expand).
  • First fix: place trigger/metadata on highest-priority lane, cap bulk microbursts, tighten queue thresholds to avoid random loss.

Example MPNs:

  • TSN/QoS switch (critical lane guarantee): NXP SJA1105, Microchip LAN9662
  • GbE PHY (stable link under stress): TI DP83867, Marvell 88E1512

Symptom: Reboots / watchdog resets

  • First 2 measurements: (1) watchdog_reason / reset_reason, (2) rail_droop_event on SoC/DDR rails during inference or SSD write bursts.
  • Discriminator: droop aligns with reset → power transient; watchdog aligns with queue stall (no droop) → scheduling/thermal or storage blocking.
  • First fix: stagger burst domains (inference vs flush), enforce brownout-safe degrade policy, add transient protection and inrush control at the gateway level.

Example MPNs:

  • eFuse / inrush limiting: TI TPS25947
  • Surge stopper (front-end protection): ADI LTC4365
  • Watchdog/supervisor: TI TPS3436, Maxim MAX16052
  • Power monitor: TI INA226

Symptom: Storage corruption / missing evidence clips

  • First 2 measurements: (1) journal_commit_fail / atomic_index_fail, (2) power_loss_flag (or unclean_shutdown_count).
  • Discriminator: failures cluster after power-loss → commit order/atomicity gap; failures without power-loss → storage latency/health issue (report, do not speculate).
  • First fix: enforce strict write order (data → journal → atomic index), add bounded commit windows, validate with scripted power-cut tests.

Example MPNs:

  • Hold-up controller for safe commit: ADI LTC3350
  • eFuse / controlled shutdown path: TI TPS25947
  • TPM / secure element for evidence keying: Infineon SLB9670, Microchip ATECC608B

Symptom: Link jitter / CRC bursts / retries

  • First 2 measurements: (1) CRC_error_count, (2) retries (or egress_drop if retries are not visible).
  • Discriminator: CRC+retries bursts → physical/EMI susceptibility; egress_drop without CRC → congestion/shaping.
  • First fix: isolate critical lanes (VLAN/priority queues/shaping), log interference windows, and re-run with a bulk-stress + trigger-storm test.

Example MPNs:

  • TSN switch (traffic shaping): NXP SJA1110, Microchip LAN9662
  • GbE PHY: TI DP83867, Microchip KSZ9131RNX

Symptom: AI output lag (decisions arrive late)

  • First 2 measurements: (1) infer_queue_depth, (2) throttle_flag (or freq_state) plus infer_time drift.
  • Discriminator: infer_queue grows steadily → compute throughput below input; lag worsens after throttle transitions → thermal determinism break.
  • First fix: disable or reduce batching, reduce input cost (ROI/resolution/FPS), enforce thermal guardrails and explicit degrade_level reporting.

Example MPNs:

  • Temperature sensor (predict throttle onset): TI TMP117, ADI ADT7410
  • Fan controller (closed-loop cooling evidence): Microchip EMC2101
  • Watchdog/supervisor (stall capture): TI TPS3436

F11 — Decision tree: symptom → 2 measurements → isolate → first fix

Use this flow when time is limited. It forces a two-measurement split into network / compute+memory / storage / power+thermal.

[Figure: field debug decision tree (2 measurements); symptom → measure (2) → isolate domain → first fix. Symptoms: frame drops / stutter, P99 latency spikes, misalignment (skew), trigger missing, reboot / watchdog, storage corruption, CRC / retries bursts, AI output lag. First 2 measurements: (A) network evidence (CRC, retries, egress_drop); (B) compute/memory (infer_P99, queue_depth, DDR_BW); (C) storage evidence (NVMe_P99, journal_commit_fail). Isolate → first fix: network / congestion (QoS/TSN lanes, shaping, policing); compute / memory (limit batching, cap queue_depth); storage / commit (async commit, journal → atomic index); power / thermal (burst staggering, throttle guardrails). If watchdog_reason / rail_droop / throttle_flag dominates, take the power/thermal branch.]
F11: Two-measurement decision tree to isolate the dominant domain and apply the first fix without scope creep.
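The same branching can run as code over the always-on evidence pack. A minimal sketch: key names mirror the evidence pack listed earlier in this chapter, and every threshold is an illustrative placeholder to calibrate per deployment, not a normative value:

```python
def isolate_domain(ev):
    """First-pass F11 branch from a dict of evidence counters.
    Power/thermal evidence overrides the other branches, matching the
    decision tree's 'dominates -> power/thermal branch' rule."""
    if ev.get("watchdog_reason") or ev.get("rail_droop_event") or ev.get("throttle_flag"):
        return "power/thermal: stagger bursts, add throttle guardrails"
    if ev.get("crc", 0) > 0 or ev.get("retries", 0) > 0 or ev.get("egress_drop", 0) > 0:
        return "network: QoS/TSN lanes, shaping and policing"
    if ev.get("journal_commit_fail", 0) > 0 or ev.get("nvme_p99_ms", 0) > 50:
        return "storage: async commit, journal then atomic index"
    return "compute/memory: limit batching, cap queue_depth"
```

Running this on each evidence snapshot, then re-measuring the same counters after the first fix, closes the symptom → evidence → isolate → fix loop the SOP describes.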

MPN quick index (gateway-centric, non-exhaustive)

  • TSN/QoS switches: NXP SJA1105, NXP SJA1110, Microchip LAN9662
  • GbE PHY: TI DP83867, Microchip KSZ9131RNX, Marvell 88E1512
  • Watchdog/supervisor: TI TPS3436, Maxim MAX16052
  • Power protection/limits: TI TPS25947 (eFuse), ADI LTC4365 (surge stopper)
  • Hold-up / backup: ADI LTC3350 (supercap backup controller)
  • Power monitors: TI INA226, ADI LTC2947
  • Temperature / fan control: TI TMP117, ADI ADT7410, Microchip EMC2101
  • TPM / secure element: Infineon SLB9670, Nuvoton NPCT750, Microchip ATECC608B, NXP SE050
  • DDR4 examples (if DDR evidence is needed): Micron MT40A512M16, Samsung K4A8G165WC, SK hynix H5AN8G6NCJR


H2-12. FAQs ×12 (evidence-based; no scope creep)

Each answer is constrained to the gateway evidence chain: bandwidth, queue_depth, infer_P99, skew/drift, CRC/retries, NVMe_P99, journal_commit_fail, secure boot/update, rail_droop, throttle_flag, SOP.

Q1 Inference starts → frames drop: check compute first, or DDR bandwidth first?

Answer: Start with one compute indicator and one memory-pressure indicator to avoid guessing.

First 2 measurements: infer_queue_depth and DDR_BW_pressure.

Discriminator: Queue grows while DDR is fine → compute/scheduling bound; DDR saturates or fluctuates → hidden copies/preprocess/memory contention.

First fix: Reduce batching, enforce zero-copy path, lower input cost (ROI/resolution/FPS), and report an explicit degrade_level.

Related: H2-4, H2-7

Q2 Average latency is stable, but P99 suddenly spikes: which two queues to watch first?

Answer: Pick one “compute-side” queue and one “delivery-side” queue to locate where the tail starts.

First 2 measurements: infer_queue_depth and event_queue_depth (or egress/output queue).

Discriminator: Infer queue spikes first → compute/DDR tail; event/egress queue spikes first → storage commit or network shaping causing blocking.

First fix: Cap queue depth with predictable backpressure, add headroom in the budget, and move evidence writes to bounded async commit.

Related: H2-3, H2-4

Q3 Two cameras slowly drift apart: timestamp drift or trigger-chain issue?

Answer: Separate “continuous drift” from “step-like misalignment under load.”

First 2 measurements: skew_P99 (or skew histogram P99) and drift_slope.

Discriminator: Rising drift_slope → time mapping/drift monitoring gap; stable drift_slope but skew jumps under stress → event/trigger correlation loss (often congestion-driven).

First fix: Enforce unified timeline mapping at ingestion, alarm on time-jumps, and export skew quality as health telemetry.

Related: H2-5

Q4 Large file upload causes trigger loss: QoS misconfigured or CPU/IRQ overwhelmed?

Answer: Decide whether the loss is happening on the wire (priority lane) or inside the host processing path.

First 2 measurements: priority_lane_egress_drop (or per-queue drops) and IRQ/softirq_rate (or CPU interrupt counters).

Discriminator: Drops without IRQ spikes → QoS/shaping gap (microbursts/HOL); IRQ spikes aligned with misses → CPU path overwhelmed.

First fix: Isolate trigger/metadata with VLAN+priority queues and shaping, rate-limit bulk transfers, and keep the critical path bounded.

Related: H2-6, H2-11

Q5 After power loss, the last seconds of an event clip are missing: PLP insufficient or index not atomic?

Answer: Treat this as a “commit-window proof” problem: power-cut + atomicity evidence.

First 2 measurements: power_loss_flag (or unclean shutdown count) and journal_commit_fail/atomic_index_fail.

Discriminator: Failures cluster only after cuts → PLP/hold-up or commit window too long; failures without cuts → write order/atomic index gap.

First fix: Enforce data→journal→atomic index order, shorten commit windows, and validate with scripted power-cut tests.

Related: H2-8

Q6 SSD health drops fast: which write pattern destroys endurance most easily?

Answer: The worst offender is high-frequency small random writes with frequent sync/commit, especially for metadata/index churn.

First 2 measurements: bytes_written_per_event (and write rate) plus SMART wear (e.g., media_wearout / host writes).

Discriminator: Wear accelerates with fragmented clips + frequent index updates → write amplification driven by tiny commits.

First fix: Coalesce writes, batch commits, keep metadata compact, and tier buffering (RAM ring → NVMe scratch → export).

Related: H2-8
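The endurance arithmetic behind the Q6 answer can be sketched: when every tiny metadata commit is synced individually, the device rounds each sync up to whole flash pages, so coalescing index updates cuts device-level writes. All numbers below (64 KiB clip chunk, 200 B index record, 16 KiB page) are illustrative, not a model of any specific SSD:

```python
def effective_bytes_per_event(payload_bytes, metadata_bytes, page_bytes=16384):
    """Device-level bytes written per event when each part is synced
    separately: every sync rounds up to whole flash pages, so a tiny
    metadata commit still costs a full page."""
    pages = -(-payload_bytes // page_bytes) + -(-metadata_bytes // page_bytes)
    return pages * page_bytes

# Un-batched: 100 events, each with its own 200-byte index sync.
unbatched = 100 * effective_bytes_per_event(65536, 200)
# Batched: same 100 clips, one coalesced index commit for all of them.
batched = 100 * (-(-65536 // 16384)) * 16384 + effective_bytes_per_event(0, 100 * 200)
```

With these illustrative numbers the un-batched pattern writes roughly 24% more bytes at the device for identical evidence, which is exactly the write-amplification signature the discriminator looks for in SMART wear counters.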

Q7 Random reboot under full multi-port load: suspect input droop first, or core-rail transient first?

Answer: Split “front-end UVLO” from “local rail transient during bursts.”

First 2 measurements: reset_reason/watchdog_reason and rail_droop_event (core/DDR) or input_uvlo_flag (pick one).

Discriminator: UVLO/input droop aligns → front-end power issue; core droop aligns with inference/SSD bursts → transient/decoupling or burst scheduling.

First fix: Stagger bursts (infer vs flush), enforce inrush limiting and bounded load steps, and apply an explicit degrade policy before reset.

Related: H2-10, H2-11

Q8 Performance degrades after ~20 minutes: thermal throttling or queue/resource leak?

Answer: Use one thermal state marker and one latency/queue trend to avoid over-instrumentation.

First 2 measurements: throttle_flag (or temperature+freq state) and infer_P99_trend (or queue_depth_trend).

Discriminator: P99 step-change after throttle transitions → thermal determinism break; no throttle but steadily rising queues → resource leak/scheduling drift.

First fix: Add thermal guardrails (cap concurrency, early degrade), and enforce bounded queue policies with periodic health checks.

Related: H2-7, H2-10

Q9 Link often jitters in one cabinet: EMC-driven CRC, or switch congestion?

Answer: CRC/retries indicate physical-layer integrity issues; drops without CRC indicate congestion/shaping.

First 2 measurements: CRC_error_count and retries (or egress_drop if retries are hidden).

Discriminator: CRC+retries bursts → susceptibility to interference/cabling/grounding; drops without CRC → microburst/HOL congestion.

First fix: Isolate critical lanes, shape bulk traffic, log the interference window, and rerun a bulk+trigger-storm stress test for proof.

Related: H2-6, H2-10

Q10 How to prove the gateway is trustworthy: secure boot → signed updates → rollback acceptance tests?

Answer: Trustworthiness is an acceptance-test checklist, not a theory lesson.

First 2 measurements: boot_verification_log (good vs bad signature) and A/B_update_state with rollback protection outcome.

Discriminator: If a bad image boots or leaves no audit log → trust chain broken; if power-cut during update bricks the device → update chain broken.

First fix: Enforce signed boot stages, A/B partitions, rollback protection, and immutable audit logging for failures.

Related: H2-9

Q11 Timestamps look fine, but event order is wrong: buffer index bug or time jump/remap?

Answer: Prove “time monotonicity” versus “commit/index continuity.”

First 2 measurements: timestamp_monotonic_violation and journal_gap (or atomic_index_fail).

Discriminator: Monotonic violations → time jump/remap missing; journal gaps/index failures → buffer commit order or atomic index update problem.

First fix: Enable time-jump detection and remap rebuild, or enforce journal→atomic index sequencing with bounded commit windows.

Related: H2-3, H2-5, H2-8

Q12 What are the 3 most valuable field counters?

Answer: Choose counters that isolate the dominant domain in seconds: network, compute, and storage.

First 3 counters: (1) priority_lane_egress_drop (plus CRC/retries as secondary), (2) infer_queue_depth, (3) NVMe_write_latency_P99 (or journal_commit_fail).

Discriminator: The first tells wire-level determinism, the second tells compute headroom, and the third tells evidence commit blocking risk.

First fix: Apply the F11 decision tree to branch and execute the first fix, then re-measure the same three counters.

Related: H2-11