Smart Camera with Edge AI: Sensor+ISP SoC, PoE, GbE/USB3
← Back to: Imaging / Camera / Machine Vision
Core idea: A smart camera with edge AI is a self-contained vision node that turns pixels into decisions on-device—by tightly coupling sensor/ISP, NPU/DSP, memory, I/O, power, thermal control, and telemetry so real-time performance stays deterministic and field issues are debuggable with evidence.
What this page answers: It shows how to budget latency, avoid DDR bottlenecks, distinguish network backpressure vs compute/power/thermal limits, and log the minimum signals needed to isolate root causes fast.
H2-1. Positioning & System Boundary: What “Smart Camera with Edge AI” Owns
Intent
Prevent misdiagnosis by locking the ownership boundary between a smart camera and adjacent modules (sensor-only camera, ISP tuning, vision gateway aggregation, frame grabbers, interface PHY deep-dives). The goal is fast root-cause isolation using evidence that a smart camera can produce on its own.
Boundary rule: this page goes deep on the on-device pipeline, latency/bandwidth, PoE power tree evidence, thermal throttling signatures, and telemetry for field debug.
One-sentence definition (extractable answer block)
A smart camera with edge AI is a self-contained vision node that couples a sensor ingress and ISP pipeline with NPU/DSP inference, deterministic buffering, and device-side I/O (GbE/USB3), plus its own power, thermal, and logging hooks—so the camera can produce actionable outputs (frames, features, detections, masks, events) under measurable constraints on latency, bandwidth, power, and robustness.
Where it fits (adjacent pages, kept as one-line boundaries)
- Vision Gateway / Edge Box: multi-camera aggregation and switching/storage; this page stays on single-camera closure and device evidence.
- Machine-Vision Interfaces: PHY/retimer/CDR and link specs; this page only covers output modes + backpressure symptoms.
- Image Signal Processor (ISP): detailed algorithm tuning; this page only uses ISP as a pipeline stage affecting latency and DDR traffic.
- Compression / Codec: codec internals; this page treats encoding as an output profile and focuses on budget/pacing.
Internal linking tip (WP): keep the above as short lines + links, and avoid repeating their technical depth here.
The “7-piece kit” a real smart camera must expose (and how it maps to debugging)
- 1) Sensor ingress evidence: frame counter continuity, exposure/gain metadata, ingress error counters.
- 2) ISP stage placement: where format converts happen (RAW→YUV/RGB) and where stats/timestamps attach.
- 3) NPU/DSP accountability: utilization, queue depth, deadline misses per model stage.
- 4) DDR reality: bandwidth utilization, read/write amplification proxies, buffer occupancy.
- 5) I/O behavior: GbE/USB3 pacing, socket/URB error counters, TX queue depth (backpressure).
- 6) Power tree evidence: PoE input droop, rail PG/reset cause, brownout counters.
- 7) Thermal/log hooks: hotspot temps, throttling state, event snapshots for “last 3 seconds” replay.
Fast boundary discriminator (what to measure first)
| Symptom (what is observed) | First 2 measurements (minimum tools) | Pass/Fail discriminator (one-line decision) | First fix (lowest cost) |
|---|---|---|---|
| Dropped frames / missing detections | frame_id continuity + queue_depth | frame_id jumps with low queue_depth → ingress/buffer overrun; queue_depth grows → backpressure/DDR | Increase buffers, remove copies, add ingress counters |
| Latency spikes (p95/p99), but average looks fine | p95 stage times + DDR util | DDR util peaks aligned with spikes → bandwidth/copy; NPU util pegged → compute bottleneck | Zero-copy discipline, stride alignment, QoS/priorities |
| Reboots when inference starts | PoE input droop + core rail droop | droop + reset cause → power; no droop + watchdog → software hang | Inrush/hold-up tuning, rail sequencing/PG logging |
Metric discipline: always track p50/p95/p99 latency and drop rate (ppm)—field pain lives in tails, not averages.
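The tail-focused metric discipline above can be sketched in a few lines — a minimal illustration (the nearest-rank percentile method and the sample numbers are assumptions, not from this page):

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile (p in 0..100) over a list of latency samples."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100.0 * len(ordered)))
    return ordered[rank - 1]

def drop_rate_ppm(dropped, total):
    """Drop rate in parts-per-million, the unit field pain is quoted in."""
    return 0 if total == 0 else dropped * 1_000_000 // total

# Illustrative stage latencies (ms): the average looks fine, the tail does not.
lat = [10] * 90 + [12] * 5 + [25, 30, 40, 80, 120]
p50, p95, p99 = percentile(lat, 50), percentile(lat, 95), percentile(lat, 99)
ppm = drop_rate_ppm(3, 100_000)
```

Here p50 stays at 10 ms while p99 reaches 80 ms — exactly the "field pain lives in tails" signature the text warns about.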
H2-2. Dataflow Pipeline: Pixels → ISP → AI → Output (Zero-copy mindset)
Intent
Make the smart camera’s core differentiation measurable: the on-device pipeline and its hard limits. This chapter treats the pipeline as a closed-loop system governed by latency budgets, DDR bandwidth, and backpressure. The “zero-copy mindset” is used to avoid invisible DDR amplification that causes tail-latency spikes and frame drops.
Key promise: every stage must be observable (timestamps, queue depth, counters), so symptoms can be attributed to compute, memory, I/O, power, or thermal effects.
Pipeline map (stages as interfaces, not algorithm deep-dives)
- Sensor ingress: RAW frames enter the SoC (e.g., MIPI CSI-2/SLVS-EC) with per-frame metadata (frame_id, exposure, gain).
- ISP stage: format conversion and stats placement (RAW→YUV/RGB), producing a stable “AI-ready” surface without deep tuning content.
- Pre-processing: resize/normalize/ROI extraction; designed to be streaming and to minimize intermediate copies.
- Inference: NPU/DSP execution; measured by utilization and queue latency (not by model training theory).
- Post-processing: NMS/tracking/quality gates; produces compact outputs (bbox/mask/event) and attaches timestamps.
- Output: images and/or results to GbE/USB3 with pacing and backpressure visibility.
Latency budget (end-to-end, tail-aware)
Smart cameras fail in the tails. A robust latency budget uses p50/p95/p99 stage timing and identifies where jitter is injected. Treat exposure and readout as “physics time,” then budget compute and I/O as “design time.”
- Exposure window: dominates minimum latency under low light and flicker constraints.
- Readout + ISP buffering: line buffers and conversions add deterministic cost; spikes often indicate buffer contention.
- Pre-process: becomes jittery when stride/alignment is poor or when cache coherency causes hidden flush/invalidate.
- NPU inference: stable when input is regular; jitter appears with dynamic shapes, batching, or power/thermal throttle.
- Post-process + output packetization: becomes the tail driver under backpressure (socket/URB queues fill).
Practical rule: if p50 looks fine but p95/p99 degrades, suspect backpressure or DDR amplification before blaming raw compute.
Bandwidth math (pixels → DDR traffic → failure point)
DDR bandwidth is the most common hidden bottleneck because each extra copy multiplies traffic. Use a simple worksheet to estimate whether the design is operating with margin.
- Pixel rate: pixels/s = width × height × FPS.
- Frame bytes (approx): RAW10/12 vs YUV vs RGB determine baseline DDR load.
- Traffic multiplier: each stage that reads/writes full frames adds a read+write term; each copy adds another full-frame read+write.
- Failure signature: DDR utilization peaks align with queue growth, then frame drops occur when buffer pools exhaust.
Engineering focus: reduce “full-frame touches” (copies, format conversions, re-reads). Prefer streaming transforms and zero-copy buffer sharing.
Common traps (and how to prove each one)
- Copy storm: the same frame is duplicated for ISP, AI, and output; prove via rising DDR util + increased per-frame memory transactions.
- Stride/alignment mismatch: pre-processing reads become non-contiguous; prove via higher p95 pre-process time and cache miss spikes (or CPU load).
- Cache coherency tax: hidden flush/invalidate when buffers cross CPU/NPU/DSP domains; prove via timing spikes at handoff boundaries.
- Under-sized buffer pools: minor jitter becomes drops; prove via buffer occupancy hitting “full” before drops.
- Output backpressure: network/USB pacing stalls upstream; prove via growing TX queue depth + stable upstream compute time.
Fast isolation table (symptom → evidence → first fix)
| Symptom | First evidence to log | Discriminator | First fix |
|---|---|---|---|
| Frame drops during high motion scenes | buffer_occupancy, DDR_util | occupancy hits max before drop → insufficient buffers or copy amplification | Increase buffers, remove intermediate copies, simplify format converts |
| Latency spikes only under network load | TX_queue_depth, queue_depth | TX queue grows then pipeline queue grows → output backpressure | Output pacing, bitrate caps, queue sizing, drop policy for non-critical frames |
| Stable FPS but inference results delayed | TS_attach_point, NPU_queue_latency | NPU queue latency grows with stable DDR → compute scheduling/concurrency issue | Partition workloads, cap concurrency, make stages streaming |
H2-3. Sensor Ingress & Control Hooks: Exposure/Gain/Sync Hooks Without Going Full ISP
Intent
A smart camera must be controllable and falsifiable. That means the sensor path exposes a minimal set of hooks and logs so field symptoms (banding, dropped frames, motion artifacts, sync drift) can be proven or disproven with evidence—without turning this page into ISP tuning.
This chapter stays on: sensor control, frame timing evidence, embedded metadata, and fail patterns & discriminators.
Control chain (what changes what, and where evidence must exist)
- Control plane: I2C/SPI writes to exposure, gain, frame length, mode, ROI—every change must be traceable.
- Timing plane: frame-start/line-valid/frame-end events (interrupts or counters) define “what actually happened”.
- Evidence plane: embedded metadata lines and driver counters bind “what was commanded” to “what was produced”.
Practical boundary: the page focuses on hooks and evidence; it does not explain AE/AWB/denoise algorithms or tuning recipes.
Rolling vs Global Shutter (system impact only, not sensor deep theory)
Rolling shutter
- Motion artifacts: skew/wobble grows with readout time.
- Exposure window: per-line sampling means “trigger tolerance” is narrower in fast motion.
- Debug tip: compare artifact severity vs readout/fps mode changes—evidence should correlate.
Global shutter
- Motion robustness: reduced skew; artifacts often shift to noise/lighting constraints.
- Trigger behavior: generally more tolerant for external capture timing (implementation-dependent).
- Debug tip: verify frame-start timestamps and exposure register snapshots match the capture moment.
Hooks list (10 hooks to expose to the application layer)
Each hook should answer: “What symptom can this prove/disprove?”
- 1) frame_id / monotonic counter: proves true frame continuity vs silent drops/repeats.
- 2) exposure_time snapshot: disproves “lighting flicker” myths when exposure is actually stepping.
- 3) analog_gain snapshot: correlates noise increase with gain changes (not ISP guesswork).
- 4) digital_gain snapshot: helps separate sensor noise from downstream scaling artifacts.
- 5) frame_rate / frame_length (blanking): explains banding risk and timing headroom during mode switches.
- 6) sensor_mode tag: identifies HDR/ROI/binning state without dumping full register maps.
- 7) trigger mode status: distinguishes free-run vs external trigger behavior at capture time.
- 8) readout type tag: rolling/global mode for interpreting motion artifacts and trigger tolerance.
- 9) temperature + overtemp flag: correlates drift/noise with sensor thermal state (field reality).
- 10) ingress error counters: line drop/overflow/CRC counters separate link issues from compute bottlenecks.
Implementation note: hooks can be exposed via local API/telemetry without revealing proprietary tuning logic.
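As a sketch, the hook values can be carried as a compact per-frame record; the field names below are hypothetical, and the continuity check shows how hook 1 (frame_id) proves or disproves silent drops:

```python
from dataclasses import dataclass

@dataclass
class FrameMeta:
    """Minimal per-frame evidence record (hypothetical field names)."""
    frame_id: int          # monotonic counter: proves continuity vs silent drops
    exposure_us: int       # exposure snapshot: disproves "flicker" myths
    analog_gain: float     # correlates noise with gain, not ISP guesswork
    sensor_mode: str       # HDR/ROI/binning tag
    temp_c: float          # thermal state for drift correlation
    ingress_err_cnt: int   # link errors vs compute bottlenecks

def check_continuity(prev: FrameMeta, cur: FrameMeta) -> int:
    """Return the number of silently dropped frames between two records."""
    return cur.frame_id - prev.frame_id - 1

a = FrameMeta(100, 8000, 2.0, "linear", 45.0, 0)
b = FrameMeta(103, 8000, 2.0, "linear", 45.1, 2)
```

With these two records, two frames vanished between 100 and 103 while the ingress error counter rose by two — pointing at the link, not the compute path.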
Evidence to log (frame-level schema that makes RMAs diagnosable)
- Per-frame header: frame_id, timestamp_attach_point, exposure, gain, fps_mode, temp.
- Ingress health: ingress_err_cnt, line_count, overrun_flag.
- Context (optional but high value): queue_depth, drop_reason, mode_switch_seq (a small enum).
Recommended practice: when a drop or banding event is detected, capture a short “evidence window” (e.g., N frames before/after) so mode switches and exposure stepping are visible without full video dumps.
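A minimal sketch of the evidence-window practice, assuming a small in-memory ring (the window sizes and trigger condition below are illustrative):

```python
from collections import deque

class EvidenceWindow:
    """Keep the last `pre` frame records; on a trigger, collect `post` more,
    then emit a frozen pre+post window instead of a full video dump."""
    def __init__(self, pre=5, post=5):
        self.post = post
        self.ring = deque(maxlen=pre)
        self.pending = None          # frames still to collect after a trigger

    def push(self, record, triggered=False):
        if triggered and self.pending is None:
            self.pending = {"frames": list(self.ring), "left": self.post}
        self.ring.append(record)
        if self.pending is not None:
            self.pending["frames"].append(record)
            self.pending["left"] -= 1
            if self.pending["left"] == 0:
                window, self.pending = self.pending["frames"], None
                return window        # frozen evidence window
        return None

ew = EvidenceWindow(pre=3, post=2)
windows = []
for fid in range(10):
    w = ew.push({"frame_id": fid}, triggered=(fid == 5))
    if w:
        windows.append(w)
```

The trigger at frame 5 freezes frames 2–6: enough context to see a mode switch or exposure step without storing video.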
Fail patterns (symptom → evidence → discriminator → first fix)
- Banding under flicker lighting: evidence = exposure steps + fps mode + banding periodicity. Discriminator: if exposure_time changes align with banding severity, the cause is control/constraints; not “random ISP”. First fix: lock exposure/fps combinations that avoid flicker aliasing; verify with stable frame_id continuity.
- Drops during shutter/fps/mode switch: evidence = frame_id jumps + ingress error counters + switch sequence markers. Discriminator: error counters spike at switch → ingress/timing transient; counters stable but queue grows → pipeline backpressure (handled later). First fix: stop stream → apply register set → wait stable frames → restart; pre-warm buffers.
- External trigger “sometimes misses”: evidence = trigger timestamp vs frame-start timestamp and exposure window tag. Discriminator: frame-start jitter correlates with trigger timing → sensor timing tolerance; jitter absent → downstream scheduling/backpressure. First fix: adjust trigger-to-exposure delay policy and enforce minimum exposure window margin.
H2-4. NPU/DSP Partitioning: What Runs Where (and Why It Matters)
Intent
Move from “the model runs” to “the system is stable, real-time, and maintainable”. Correct partitioning across CPU, DSP, and NPU controls latency tails, DDR pressure, thermal throttling risk, and debuggability.
This chapter stays on: roles & interfaces, queue discipline, multi-model concurrency, and the system impact of INT8/FP16.
Responsibility map (CPU vs DSP vs NPU)
- NPU: backbone inference (conv/attention heavy paths). Target: high utilization with bounded queue latency.
- DSP: streaming pre-processing (resize/normalize/colorspace) and lightweight classical CV. Target: reduce full-frame copies.
- CPU: scheduling, I/O, logging/telemetry, policies (rate limit, drop policy, failover). Target: avoid full-frame data movement.
Maintainability rule: every cross-domain boundary should have a queue, a timestamp, and counters (depth, wait time, misses).
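The maintainability rule can be sketched as an instrumented handoff queue — a minimal illustration with the three required evidences (depth, wait time, miss counter); all names and thresholds are assumptions:

```python
from collections import deque

class BoundaryQueue:
    """Cross-domain handoff queue exposing depth, per-item wait, and misses."""
    def __init__(self, maxdepth, deadline_s):
        self.q = deque()
        self.maxdepth = maxdepth
        self.deadline_s = deadline_s
        self.depth_peak = 0
        self.miss_cnt = 0
        self.drop_cnt = 0

    def put(self, item, now):
        if len(self.q) >= self.maxdepth:
            self.drop_cnt += 1           # explicit drop policy, never silent
            return False
        self.q.append((item, now))
        self.depth_peak = max(self.depth_peak, len(self.q))
        return True

    def get(self, now):
        item, enq_ts = self.q.popleft()
        wait = now - enq_ts
        if wait > self.deadline_s:
            self.miss_cnt += 1           # deadline miss is a first-class counter
        return item, wait

q = BoundaryQueue(maxdepth=2, deadline_s=0.010)
q.put("f1", now=0.000)
q.put("f2", now=0.001)
ok = q.put("f3", now=0.002)       # queue full → counted drop
item, wait = q.get(now=0.015)     # waited 15 ms → counted deadline miss
```

Every symptom in the later isolation tables (queue growth, deadline misses, drop reasons) maps directly onto counters like these.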
Multi-model concurrency (detector + classifier + quality model)
A practical smart camera rarely runs one model. Concurrency should be designed as a pipeline, not a competition for DDR and NPU time. Typical pattern: detector produces ROIs → classifier runs on ROIs → quality model gates false positives and triggers events.
- Pipeline benefits: bounded latency per stage, clear queue ownership, easier backpressure handling.
- Common failure: “parallel everything” creates bursty NPU demand and DDR amplification, worsening p95/p99.
- Evidence to confirm: NPU queue latency rises while utilization oscillates (bursty scheduling signature).
Quantization choice (INT8 vs FP16) — system impact only
- INT8: typically higher throughput and lower energy per inference → more thermal margin and fewer throttle events.
- FP16: often safer for accuracy on some models → but can raise power/thermal load and tighten latency margins.
- Decision principle: choose the format that meets pass/fail targets for p95 latency, drop rate, and thermal stability under worst-case scenes.
This page does not cover calibration/training workflows; it only treats quantization as a deployment knob with measurable system consequences.
Partition rules of thumb (6 rules that are directly actionable)
- Rule 1: If DDR utilization peaks align with tail latency, reduce full-frame touches before changing the model.
- Rule 2: If NPU utilization is low but end-to-end latency is high, fix scheduling/queues (not “more NPU”).
- Rule 3: Keep pre-processing streaming on DSP (or fixed-function) to avoid CPU copies and cache coherency penalties.
- Rule 4: Prefer pipeline over batching when real-time behavior matters; batching improves throughput but hurts latency tails.
- Rule 5: Add an explicit drop policy under backpressure (e.g., drop non-critical frames, keep event metadata).
- Rule 6: Thermal constraints must be part of scheduling (DVFS/throttle states must be visible to the scheduler).
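Rule 5 can be made concrete with a one-function sketch (the frame kinds and watermark value are hypothetical labels, not from this page):

```python
def admit(frame_kind, queue_depth, high_watermark):
    """Explicit backpressure drop policy: below the watermark admit everything;
    above it, shed video frames but always keep events/metadata."""
    if queue_depth < high_watermark:
        return True
    return frame_kind in ("event", "metadata")
```

The point is that the drop decision is a named, testable policy with a loggable reason, rather than an implicit queue overflow.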
Scheduling patterns (when to use which)
Pipeline
- Use when: deterministic latency is required.
- Risk: sensitive to buffer sizing and backpressure.
- Evidence: stable per-stage times with bounded queue depth.
Batch
- Use when: throughput matters more than latency (non-real-time).
- Risk: p95/p99 latency increases and becomes bursty.
- Evidence: utilization rises but queue latency spikes in bursts.
Event-driven
- Use when: power is constrained and triggers are reliable.
- Risk: trigger noise causes missed events or overwork.
- Evidence: event rate correlates with workload and thermal state.
Hybrid (common)
- Pattern: pipeline for detection + event-driven for secondary models.
- Risk: poor priority rules lead to starvation.
- Evidence: deadline misses cluster in one queue stage.
First measurements (minimal dashboard for real-time confidence)
- NPU utilization: average + peak; watch for oscillation (bursty scheduling).
- Queue depth per stage: PreQ / NPUQ / PostQ; depth growth is the earliest backpressure indicator.
- Deadline miss count: per fixed time window; a single number that flags real-time failure.
- p95 stage times: pre / inference / post; tails point to the true bottleneck.
- Thermal state: throttle flag + temperature; correlate with latency drift and accuracy drift.
- Drop reason histogram: “backpressure drop” vs “timeout” vs “error” to avoid guessing.
Maintainability principle: if a symptom cannot be attributed using the above metrics, add instrumentation before changing architecture.
H2-5. Memory & DDR Architecture: The Real Bottleneck
Intent
“Dropped frames”, “latency jitter”, and “random stalls” often trace back to DDR contention—not because average bandwidth is too low, but because read/write amplification and arbitration create tail latency (p95/p99). This chapter ties symptoms to measurable DDR signals.
Focus: R/W amplification, buffer sizing, QoS concepts, and util/latency/stall counters.
Why DDR gets “amplified” in a smart camera
- Multiple masters, same memory: ISP, NPU, encoder, MAC (GbE), CPU, and DMA all compete for DDR cycles.
- R/W amplification: one input frame can be read/written multiple times (RAW ingest → ISP output → NPU tiles/ROIs → encode → TX).
- Tail latency dominates: real-time failures often happen when one master is temporarily starved by arbitration.
Key signals: DDR read latency (p95/p99), stall cycles per master, and utilization + queue depth.
Bandwidth worksheet (copyable math + example direction)
Goal: estimate “DDR pressure” quickly. The numbers do not need to be perfect; the structure must be repeatable.
- Frame bytes: FrameBytes = W × H × Bpp (rough Bpp: RAW≈2, YUV420≈1.5).
- Input bandwidth: In_BW = FrameBytes × FPS.
- DDR total: DDR_BW ≈ In_BW × (R_amp + W_amp), where amplification comes from ISP+AI+ENC+TX passes.
Example direction (for intuition): a single 1080p@60 stream can look “small” at the sensor input, but becomes several times larger at DDR once ISP read/write, NPU tiling, encoder reads, and network TX reads are accounted for. If p95 DDR read latency spikes while queues grow, the system will stutter even if the “average” bandwidth seems fine.
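The worksheet can be run as a few lines; the touch counts below are illustrative assumptions for intuition, not measured values:

```python
def ddr_traffic_mb_s(width, height, bpp, fps, full_frame_touches):
    """Estimated DDR traffic (MB/s). full_frame_touches counts every
    full-frame read or write across ISP/AI/encode/TX; each extra copy
    adds one read AND one write."""
    return width * height * bpp * fps * full_frame_touches / 1e6

# 1080p@60 YUV420 (Bpp ≈ 1.5): the sensor-input number looks small...
ingress = ddr_traffic_mb_s(1920, 1080, 1.5, 60, 1)
# ...but ISP write + AI read + one extra copy (1R+1W) + encoder read ≈ 5 touches
amplified = ddr_traffic_mb_s(1920, 1080, 1.5, 60, 5)
```

About 187 MB/s at ingress becomes roughly 930 MB/s at DDR under five full-frame touches — the "several times larger" effect described above, before arbitration losses are even counted.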
| Stage | Typical DDR action | Amplification hint | What to watch |
|---|---|---|---|
| ISP | Read RAW → Write YUV/RAW | ≈ 1R + 1W | ISP stall cycles, output queue depth |
| NPU preprocess | Read tiles/ROIs → Write tensors | depends on tiling | DDR read latency + NPUQ wait |
| Encoder | Read YUV → Write bitstream | ≈ 1R + 0.xW | ENC backlog, bitrate spikes |
| Network TX | Read bitstream → DMA to MAC | ≈ 0.xR | TX depth, drops, resend/timeout |
Buffer sizing (compute the minimum buffers from “in-flight time”)
Buffer counts should be derived from a maximum allowable in-flight time, not guessed. Use a simple bound and then add safety for transients.
- Define: T_inflight = max acceptable pipeline delay (e.g., 50–120 ms depending on application).
- Minimum buffers: N ≥ ceil(FPS × T_inflight) + safety (typical safety: +2 or +3).
- Stage buffers: size each ring separately (ISP out, NPU in/out, encoder in, TX queue). A single “big buffer” rarely stabilizes tails.
Debug hint: if buffers are too small, drops happen during short bursts; if buffers are too large, latency becomes unbounded and “feels random”.
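The sizing bound above, as a copyable helper (the example numbers are illustrative):

```python
import math

def min_buffers(fps, t_inflight_s, safety=2):
    """N ≥ ceil(FPS × T_inflight) + safety, per the sizing bound above."""
    return math.ceil(fps * t_inflight_s) + safety

# 60 fps with an 80 ms in-flight budget: ceil(4.8) + 2 = 7 buffers per ring.
n = min_buffers(60, 0.080)
```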
Symptom mapping (DDR bound vs compute bound)
| Field symptom | More likely DDR-bound when… | More likely compute-bound when… |
|---|---|---|
| Stutter / jitter | DDR util near peak + read latency spikes + master stalls rise; NPU util is not consistently high | NPU util pinned high + inference p95 expands; DDR util moderate |
| Drops in bursts | Queue depth grows suddenly across multiple masters (ENC/TX/ISP) after a short bandwidth burst | One stage queue grows steadily (often NPUQ), then misses deadlines regularly |
| Lowering resolution helps a lot | Yes (DDR pressure scales strongly with pixels and copies) | Sometimes limited (if model/compute is the bottleneck) |
Fix ladder (lowest cost first, hardware last)
- 1) Reduce copies: ensure DMA/zero-copy paths; avoid CPU touching full frames.
- 2) Fix stride/alignment/tiling: improve burst efficiency and reduce “wasted reads”.
- 3) Reduce writeback: keep intermediates on-chip (SRAM/cache) when possible.
- 4) Apply QoS/priority: protect the real-time chain (video/AI) and rate-limit non-critical masters.
- 5) Step down the mode: reduce FPS/resolution/encode complexity; shift to ROI/event-first outputs.
- 6) Hardware upgrade: more DDR channels, higher frequency, bigger caches—only after evidence proves DDR is the ceiling.
Every step should have a “win signal”: p95 read latency down, stall cycles down, deadline misses to zero, drop histogram collapses.
H2-6. Outputs & I/O Modes: GbE / USB3 as a Device, Not a Gateway
Intent
The output side must be treated as a deterministic device interface: select an output form (RAW / YUV / Encoded + metadata), choose a transport (GbE / USB3), and manage backpressure so stalls do not propagate upstream into ISP/AI/DDR.
Focus: output forms, GbE/USB3 positioning, pacing/MTU, and backpressure evidence.
Output matrix (what you want → what to output)
This is a practical selection matrix, not a spec deep dive.
| Transport | RAW | ISP (YUV) | Encoded + Metadata / Events |
|---|---|---|---|
| GbE | High bandwidth demand; sensitive to pacing; use only when downstream needs raw frames. | Positioning: “video over network” cases; can map to GigE Vision (positioning only). | Best bandwidth efficiency; adds encode latency; keep metadata/events decoupled when possible (side-channel). |
| USB3 | Possible but host-dependent; avoid if host is shared/bursty. | Common via UVC; easiest host integration; watch for host-side scheduling jitter. | Bulk endpoint for metadata/events pairs well with UVC video; preserves events under video backpressure. |
Determinism knobs (pacing, MTU/packetization, and drop policy)
- Output pacing: prefer smooth TX pacing over bursty “send as fast as possible” behavior; bursts inflate queues and tail latency.
- MTU / packetization: too small → overhead and CPU pressure; too large → higher loss impact. Choose for stability, not peak throughput.
- Backpressure policy: define what can drop first. Common safe rule: keep events/metadata even when video frames drop.
A deterministic output is an explicit system feature: counters + pacing + drop reasons, not “a faster cable”.
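Smooth pacing is often implemented as a token bucket; a minimal sketch (all names, rates, and tick granularity are illustrative — real TX pacing lives in the MAC/driver):

```python
def pace_tx(frame_sizes, rate_bytes_per_s, tick_ms=1):
    """Token-bucket pacing: each tick adds send budget; a frame goes out only
    when enough budget exists, smoothing bursts into an even TX cadence."""
    per_tick = rate_bytes_per_s * tick_ms // 1000
    tokens, t_ms, send_times = 0, 0, []
    for size in frame_sizes:
        while tokens < size:
            t_ms += tick_ms
            tokens += per_tick
        tokens -= size
        send_times.append(t_ms)      # ms at which this frame is released
    return send_times

# Three 100 kB frames on a 10 MB/s paced link: sent ~10 ms apart, no burst.
times = pace_tx([100_000] * 3, 10_000_000)
```

Instead of three back-to-back bursts that inflate queues downstream, the frames leave evenly spaced — the behavior the "output pacing" knob above asks for.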
Backpressure symptoms (network/host congestion vs internal congestion)
| Observation | More likely external congestion | More likely internal congestion |
|---|---|---|
| TX queue depth | TX depth rises first; MAC/socket drops increase; resend/timeout grows | TX depth stays moderate; upstream queues grow first (ISP/NPU/ENC) |
| Latency jitter | Correlates with link load spikes and packet loss | Correlates with DDR read latency spikes, NPUQ waits, or encoder backlog |
| What “fixes” it | Pacing/MTU tuning, link isolation, reduce bitrate bursts | Reduce copies, adjust QoS, lower mode, stabilize buffer sizing |
First probes (minimum counters to check before changing architecture)
- GbE: socket drop counters, TX queue depth, MAC error counters, resend/timeout indicators (positioning only).
- USB3: URB errors, underrun/overrun symptoms, host scheduling jitter indicators.
- Always correlate with internal queues: ISP out depth, NPUQ wait, ENCQ backlog, DDR latency/stalls.
Key evidence: TX depth + drop histogram, URB/socket errors, and internal queue growth.
H2-7. Shutter / Iris / ND Mechanism Control Patterns (Profiles, Debounce, Jam, Backlash)
Focus: treat shutter/iris/ND as a mechanical control system with endpoints, friction, backlash, and aging. Control success is proven by current, position, and derived speed.
This section stays at mechanism/control level (no ISP algorithm content).
Motion profiles: fast enough, but not violent
- Why profiles matter: abrupt acceleration excites resonance and creates hard endpoint impacts that reduce repeatability.
- Jerk-limited ramps: reduce rebound and “buzz” by lowering impulse energy at direction changes and near endpoints.
- Endpoint cushioning: slow down and reduce drive near the end zone to avoid false jam detection and mechanical shock.
Debounce & endpoint decisions (avoid false “arrived”)
Time-window debounce
- Simple and robust against bounce/spikes.
- Cost: adds delay (consumes stability budget and timing margin).
Consistency-based debounce
- N-of-M samples, hysteresis thresholds, or stable-slope checks.
- Lower delay than long windows, but requires clean sampling rules.
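An N-of-M consistency debounce can be sketched in a few lines (the window sizes and sample traces below are illustrative):

```python
from collections import deque

def debounced_arrival(samples, n=3, m=5):
    """N-of-M debounce: declare 'arrived' at the first sample index where at
    least n of the last m endpoint-switch readings are True."""
    window = deque(maxlen=m)
    for i, s in enumerate(samples):
        window.append(bool(s))
        if sum(window) >= n:
            return i
    return None

# Bouncy endpoint switch: isolated spikes do not trigger; a stable run does.
idx = debounced_arrival([1, 0, 0, 0, 0, 1, 0, 1, 1, 1])
```

Isolated bounce samples never accumulate to 3-of-5, so the early spikes are rejected, while the stable run at the end qualifies with less delay than a long fixed time window.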
Backlash / hysteresis compensation (direction change is the trap)
- Root cause: gear lash, elastic preload, and friction create a “dead zone” when direction reverses.
- Preload step: intentionally overshoot a small amount, then settle back to the target from a consistent direction.
- Bi-direction tables: use separate calibration maps for forward vs reverse approach to the same setpoint.
- Practical rule: for critical aperture/ND positions, approach from one direction whenever possible.
Jam / stiction detection (make it a production-grade discriminator)
- Signature: current rises while position does not move (or speed collapses).
- Endpoint exception: endpoints can also show high current + no motion; use a separate threshold within the endpoint zone.
- Decision policy: apply time qualification (persist > T ms), then act (reduce drive, back off, retry, and log).
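The three-part discriminator (current signature, endpoint exception, time qualification) can be sketched as one function; thresholds, units, and the sample trace are hypothetical:

```python
def detect_jam(samples, i_thresh, v_thresh, persist_n, endpoint_zone=None):
    """Time-qualified jam detection: current high AND derived speed collapsed
    for more than persist_n consecutive samples, excluding the endpoint zone
    (which legitimately shows high current with no motion)."""
    run = 0
    for k in range(1, len(samples)):
        pos, cur = samples[k]
        speed = abs(pos - samples[k - 1][0])          # derived speed per sample
        in_endpoint = (endpoint_zone is not None
                       and endpoint_zone[0] <= pos <= endpoint_zone[1])
        jamlike = cur > i_thresh and speed < v_thresh and not in_endpoint
        run = run + 1 if jamlike else 0
        if run > persist_n:
            return k          # sample index where the jam is qualified
    return None

# Normal move, then stall: position stops while current rises.
trace = [(0, 0.2), (5, 0.3), (10, 0.3), (12, 0.9), (12, 1.1), (12, 1.2), (12, 1.2)]
jam_at = detect_jam(trace, i_thresh=0.8, v_thresh=1, persist_n=2)
```

With an endpoint zone covering the stall position, the same trace is classified as normal endpoint behavior rather than a jam — the separate-threshold exception the text calls for.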
Lifetime & consistency (engineering metrics, not materials theory)
What to track
- Cycle count (group by motion type: small frequent vs large occasional).
- Jam/retry counters and endpoint timeout counters.
- Temperature bins (cold/ambient/hot) for drift vs environment.
Repeatability statistics
- Run N repeats to the same endpoint/setpoint.
- Record final position distribution (mean, spread, max deviation).
- Compare distributions across temperature and after aging.
Evidence chain (minimum captures)
- I(t) vs position: strongest discriminator for jam vs normal motion.
- Position vs time step tests: extract settle time, rebound near endpoints, and backlash dead-zone behavior.
- Endpoint repeatability: N-run distribution is the simplest “consistency KPI”.
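The N-run repeatability KPI reduces to three numbers; a minimal sketch with illustrative positions:

```python
def repeatability(final_positions):
    """Endpoint repeatability over N runs: mean, spread (max - min),
    and max deviation from the mean."""
    n = len(final_positions)
    mean = sum(final_positions) / n
    spread = max(final_positions) - min(final_positions)
    max_dev = max(abs(p - mean) for p in final_positions)
    return {"mean": mean, "spread": spread, "max_dev": max_dev}

# Five repeats to the same setpoint (arbitrary position units).
stats = repeatability([100.0, 100.2, 99.9, 100.1, 99.8])
```

Comparing these distributions across temperature bins and after aging turns "consistency" into a trackable engineering metric.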
H2-8. Noise, Vibration & EMI Coupling (Image Jitter, Instability, Banding)
Purpose: turn “image jitter / unstable actuation / banding-like artifacts” into measurable coupling paths: mechanical vibration, conducted noise (rail/ground), and radiated/common-mode coupling.
Only the physical chain is covered here. No ISP mitigation algorithms.
Three coupling paths (one symptom can come from different physics)
Mechanical path
- Force ripple or resonance → micro-vibration.
- Micro-vibration reduces optical stability and can degrade image sharpness consistency.
- Strongly linked to current ripple spectrum and motion profile choices.
Conducted + ground path
- High di/dt → rail ripple and ground bounce.
- Noise couples into sensitive references/sense lines and changes feedback quality.
- Often visible as “instability only when motor moves”.
Radiated / common-mode path
- Switching edges + harness length → antenna behavior.
- Common-mode currents excite nearby cables and modules.
- Reproduced by near-field A/B comparisons and cable routing changes.
First measurements (start here)
- Drive current ripple (Icoil or phase currents).
- Rail ripple / ground bounce near driver and near feedback ADC reference.
- Then compare: change one knob (chop freq, microstep, profile) and observe deltas.
Why audible noise often correlates with vibration and instability
- Chopping/microstepping: ripple and sidebands can land in the audible band or excite structural resonance.
- Too-aggressive tuning: control output u(t) “chatters” at high frequency, converting quantization/noise into motion.
- Engineering tactic: move ripple spectrum away from resonant bands and reduce edge aggression where possible.
EMI injection mechanisms (keep the discussion actionable)
- Return-path coupling: shared ground impedance converts motor current into reference modulation (ground bounce).
- Harness coupling: long leads + fast edges increase common-mode radiation.
- Capacitive paths: dv/dt from switching nodes capacitively injects noise into nearby sensing lines.
Mitigations matched to the path (avoid “laundry list” fixes)
For mechanical vibration
- Jerk-limited profiles; avoid resonant speed bands.
- Adjust chopping/microstep settings to reduce force ripple.
- Verify with POS jitter and current ripple deltas.
For conducted / ground noise
- Minimize high di/dt loop area; keep noisy returns away from ADC reference returns.
- Local decoupling near driver; tame edges (within efficiency/thermal limits).
- Verify via Vrail ripple and ground bounce reduction.
For radiated / common-mode
- Twist/shield harness segments; control shield termination strategy consistently.
- Reduce dv/dt where feasible; use physical separation from sensitive lines.
- Verify via near-field A/B scans and cable reroute experiments.
Minimal A/B method
- Fix the motion script; change one knob per run.
- Record I ripple + Vrail/ground + POS jitter.
- Optional: near-field probe before/after (relative comparison only).
H2-9. Reliability: Watchdogs, Logging, On-device Telemetry (Make field issues debuggable)
Intent
Field issues are only fixable when the system can explain itself. The goal is not “more logs,” but a compact, structured evidence chain that reconstructs what happened and where it started. This chapter defines a minimal reliability stack: watchdogs, ring logs, frame/stage telemetry, and fault snapshots that can be exported over GbE/USB.
Focus: HW/SW watchdog, ring log, snapshot, frame id + timestamps, queue depth, counters.
Reliability mechanisms (lightweight, field-first)
- HW watchdog: guarantees recovery when the system is stuck (power/clock is alive but progress is not).
- SW watchdog: catches partial stalls (one pipeline thread stops advancing while others still run).
- Crash snapshot: capture a small “context window” (counters, queues, stage latencies, recent events) without heavy dumps.
- Ring log: continuous, bounded logging that always keeps the last N seconds/minutes.
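A bounded ring log is small enough to sketch directly (a minimal illustration; a real implementation would persist the frozen snapshot):

```python
from collections import deque

class RingLog:
    """Bounded ring log: always retains the last `capacity` entries;
    freeze() returns an immutable snapshot for an incident export."""
    def __init__(self, capacity):
        self.buf = deque(maxlen=capacity)
        self.total = 0            # everything ever logged (shows how much aged out)

    def log(self, entry):
        self.buf.append(entry)
        self.total += 1

    def freeze(self):
        return tuple(self.buf)

rl = RingLog(capacity=4)
for i in range(10):
    rl.log(f"evt{i}")
snap = rl.freeze()
```

Memory stays bounded no matter how long the device runs, yet the last N events are always available when a watchdog or crash snapshot fires.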
Telemetry schema (a minimal field list that closes the loop)
The schema is designed to reconstruct an incident without protocol deep dives. Keep it compact, consistent, and joinable.
| Layer | Recommended fields | Why it matters |
|---|---|---|
| Frame-level | frame_id, sensor_ts, soc_ts, drop_reason, output_mode | Joins everything across threads and stages; enables “what changed at the first bad frame”. |
| Stage-level | stage_latency_us (p50/p95), queue_depth, backpressure_flag, deadline_miss_cnt | Separates compute stalls from downstream congestion; shows the first stage that starts slipping. |
| System-level | ddr_util, dvfs_state, fps_cap_state, temp_T1..T4, brownout_warn_cnt | Proves throttling/memory pressure vs “random drops”; ties performance drift to thermal/power signals. |
| I/O counters | tx_queue_depth, reconnect_cnt, socket_drop_cnt / usb_urb_err_cnt | Distinguishes output-side backpressure from internal pipeline issues without protocol-level detail. |
| Recovery | watchdog_reason, reset_cause, fw_version, config_hash | Makes incidents traceable and comparable across builds; avoids “fixed in one version, reappears later”. |
Rule of thumb: if a field cannot help decide “compute vs memory vs output vs power/thermal,” remove it.
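To make "joinable" concrete, the frame-level and stage-level rows can share frame_id as the join key. The sketch below uses the field names from the table above; the FrameRecord class and join_by_frame helper are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass, asdict

# Hypothetical frame-level record mirroring the schema table above.
@dataclass
class FrameRecord:
    frame_id: int
    sensor_ts: int         # sensor timestamp, us
    soc_ts: int            # SoC monotonic timestamp, us
    output_mode: str       # e.g. "full", "roi", "events"
    drop_reason: str = ""  # empty string = frame was delivered

def join_by_frame(frame_rows, stage_rows):
    """Join stage-level rows onto frame-level rows by frame_id."""
    stages = {}
    for row in stage_rows:
        stages.setdefault(row["frame_id"], []).append(row)
    return [
        {**asdict(f), "stages": stages.get(f.frame_id, [])}
        for f in frame_rows
    ]

# Demo: one healthy frame with an NPU stage row, one dropped frame.
frames = [
    FrameRecord(1, 100, 150, "full"),
    FrameRecord(2, 200, 260, "full", drop_reason="queue_full"),
]
stage_rows = [{"frame_id": 1, "stage": "npu", "latency_us": 900}]
joined = join_by_frame(frames, stage_rows)
```

Keeping frame_id in every row is what lets the incident-timeline procedure below align cross-thread events without protocol deep dives.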
Incident timeline (reconstruct “the 3 seconds before the drop”)
Field debugging gets fast when every incident produces a joinable window. A practical method is to keep a rolling window in memory and freeze it on trigger.
- Trigger: drop_cnt increments, deadline_miss_cnt jumps, reconnect_cnt increments, or brownout_warn_cnt changes.
- Freeze window: lock ring-buffer head/tail pointers and attach an incident id.
- Join by frame: align all stage events by frame_id, then use soc_ts to align cross-thread ordering.
- Find first deviation: identify the first stage where latency p95 grows or queue_depth starts climbing.
- Export summary: store the top 3 abnormal metrics + the "first failing stage" tag for fast triage.
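The "find first deviation" step can be sketched as a scan over per-stage latency series in pipeline order. The baseline length and growth factor below are illustrative assumptions, not fixed thresholds.

```python
import statistics

def first_failing_stage(stage_series, baseline_n=20, factor=1.5):
    """Return the first stage (in pipeline order) whose recent latency p95
    exceeds `factor` x its own baseline p95.

    stage_series: list of (stage_name, latency_samples_us), ordered
    ingress -> egress. Heuristic sketch; thresholds are illustrative.
    """
    for stage, samples in stage_series:
        if len(samples) <= baseline_n + 1:
            continue  # not enough data to split baseline vs recent
        # statistics.quantiles with n=20 yields 19 cut points; index 18 ~ p95.
        baseline_p95 = statistics.quantiles(samples[:baseline_n], n=20)[18]
        recent_p95 = statistics.quantiles(samples[baseline_n:], n=20)[18]
        if recent_p95 > factor * baseline_p95:
            return stage
    return None
```

Because stages are checked in pipeline order, the first match is the earliest stage that started slipping, which is exactly the tag the export step stores.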
First fix (why counters beat guessing)
- Start with classification: implement drop_reason and queue_depth per stage before adding more verbose logs.
- Add latency tails: p95 is often more informative than average FPS in real-time pipelines.
- Unify recovery evidence: write watchdog_reason and reset_cause into the same incident stream.
- Make builds traceable: always include fw_version + config_hash in exported incidents.
Win signals: time-to-root-cause shrinks, incidents become comparable across builds, and “cannot reproduce” becomes rare.
H2-10. Validation Plan: Performance, Accuracy, Robustness (a repeatable test matrix)
Intent
A smart camera is deliverable only when performance, accuracy, and robustness are verified with a repeatable matrix. This chapter defines a test plan that scales across models and deployments: conditions × metrics × pass/fail, using telemetry fields that already exist for reliability.
Focus: FPS · lat p95 · jitter · drop rate · mAP/IoU · power/thermal/network stress · soak.
Metrics (tie validation to telemetry)
Use the same identifiers across all tests: fw_version, config_hash, output_mode, model_set.
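A config_hash only stays joinable across reports if it is computed deterministically. One minimal way (an illustrative sketch, assuming the configuration is representable as a JSON-serializable dict; real firmware may hash the serialized config blob instead) is canonical JSON plus a truncated SHA-256:

```python
import hashlib
import json

def config_hash(config: dict) -> str:
    """Stable short hash of the active configuration, so every exported
    incident and test report joins to an exact config.

    Sketch with illustrative choices: canonical key order makes the hash
    independent of dict construction order; 12 hex chars is an arbitrary
    truncation for log readability.
    """
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]
```

The sort_keys=True canonicalization is the important part: two devices running the same settings must emit the same hash regardless of how the config was assembled.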
Repeatable test matrix (conditions × metrics × pass/fail)
| Condition (scenario) | What to measure (metrics) | How to write acceptance (example style) |
|---|---|---|
| Nominal load (steady state) | FPS avg, lat p95, jitter, drop_rate, max queue depth | p95 < X ms, drop < Y ppm, max_queue < Q |
| Multi-model (detector + classifier) | Stage p95 (preproc/NPU/post), deadline_miss_cnt, queue growth | no sustained backlog > N s, miss_cnt stable |
| Domain shift (lighting/material/angle) | mAP/IoU, FP/FN, confidence drift vs baseline | mAP ≥ A, FP ≤ B, drift ≤ Δ |
| Power stress (droop / cold-start) | Reset cause, brownout_warn_cnt, drops under stress | no reboot, or reboot ≤ R, warn ≤ K |
| Thermal step (temp ramp/step) | T1–T4, DVFS state, lat p95 drift, quality indicators drift | DVFS stable, p95 within band, drift converges |
| Network congestion (rate limit/packet loss) | TX queue depth, drop reasons, reconnect count | no internal backlog runaway, reconnect ≤ R |
| Soak (long run) | Drop histogram stability, memory/queue boundedness, reconnect trend | no degradation trend, bounded queues, stable counters |
Acceptance is intentionally written as a style template; plug in the project-specific thresholds and keep the matrix structure unchanged.
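One matrix row's pass/fail can be evaluated directly from exported telemetry. The sketch below checks the nominal-load row style (p95 < X ms, drop < Y ppm, max_queue < Q); threshold names mirror the template above and all values are project-specific placeholders.

```python
import statistics

def check_acceptance(lat_us, drops, frames, max_queue,
                     p95_limit_us, drop_ppm_limit, queue_limit):
    """Evaluate one matrix row's pass/fail from exported telemetry.

    lat_us: per-frame latency samples in microseconds.
    Illustrative sketch: the metric set matches the 'nominal load' row;
    other rows would swap in their own metrics.
    """
    # statistics.quantiles with n=20 yields 19 cut points; index 18 ~ p95.
    p95 = statistics.quantiles(lat_us, n=20)[18]
    drop_ppm = drops * 1_000_000 / frames
    results = {
        "p95_ok": p95 < p95_limit_us,
        "drop_ok": drop_ppm < drop_ppm_limit,
        "queue_ok": max_queue < queue_limit,
    }
    results["pass"] = all(results.values())
    return results
```

Running the same function over every firmware build, keyed by fw_version + config_hash, is what makes reports comparable across builds.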
Minimal gear (to cover performance + accuracy + robustness)
- Performance: host capture + telemetry export; ability to compute p95 latency and queue bounds from logs.
- Accuracy: repeatable scene setup + labeled evaluation set; consistent camera mounting and lighting steps.
- Power stress: oscilloscope for input/PG evidence (or at least droop capture at the supply boundary).
- Thermal: temperature chamber or controlled heat/cool step; log temperature points T1–T4.
- Network: traffic shaping/rate limiting to force backpressure and validate stability.
Win signal: the same matrix catches regressions before deployment and yields comparable reports across firmware builds.
H2-11. Field Debug Playbook: Symptom → Evidence → Isolate → First Fix
How to use this playbook
Each symptom is reduced to a short decision path: First 2 measurements → 3 discriminators → First-fix ladder. The goal is a correct first isolation: power transient, DDR saturation, thermal throttling, output backpressure, or scheduler stall.
Telemetry references: frame_id, soc_ts, stage_latency_p95, queue_depth,
ddr_util, tx_queue_depth, reconnect_cnt, reset_cause, brownout_warn_cnt,
T1..T4, dvfs_state.
Top 6 symptoms (quick table)
| Symptom | First 2 measurements | Fast discriminator | First fix (first action) |
|---|---|---|---|
| Dropped frames / latency jitter | stage_latency_p95 + queue_depth; ddr_util | Queue grows first (internal) vs TX queue grows first (output) | Reduce copies + cap output mode (lower bandwidth tier) |
| Random reboot | Input V/I or PoE rail; reset_cause / brownout_warn_cnt | Rail droop + BOR signature vs watchdog signature | Check inrush/hold-up + supervisor/watchdog reasons |
| Output freeze | tx_queue_depth / reconnect_cnt; pipeline queue_depth | TX queue pinned vs upstream stage pinned | Add pacing/backpressure policy + increase buffering |
| Noise rises with temperature | T1..T4; noise proxy trend (per-frame stats) | DVFS throttling signature vs sensor/analog drift signature | Thermal path improvement + reduce hot load (model/FPS) |
| False positives change with lighting | FP trend + confidence drift; exposure state marker | Exposure/flicker correlation vs backlog/time-misalignment | Stabilize exposure/anti-flicker + enforce latency bound |
| PoE power-up fails | PoE startup V/I; PG timing events | Inrush limit vs UVLO from long cable drop | Soft-start/inrush shaping + hold-up verification |
This table is designed to be “one-screen usable”; detailed decision bullets follow in symptom cards.
Symptom 1 — Dropped frames / latency jitter
Typical manifestation: stable average FPS but periodic stalls; p95 latency spikes; occasional drops.
1) stage_latency_p95 + queue_depth per stage (ISP / preproc / NPU / post / TX)
2) ddr_util (or memory pressure) + drop reason histogram
Discriminator: ddr_util saturates during spikes and multiple stages show latency tail growth → DDR contention / copy storm
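The discriminator for this symptom can be written down as an order-of-checks function over the telemetry fields above. This is a triage heuristic sketch; the pinned-depth and DDR-utilization thresholds are illustrative assumptions (queue depths normalized to capacity).

```python
def classify_drop_cause(internal_depths, tx_depths, ddr_util,
                        depth_cap=0.9, ddr_cap=0.85):
    """Classify dropped-frame cause from normalized (0..1) samples.

    internal_depths: per-stage queue_depth samples (ISP/preproc/NPU/post).
    tx_depths: tx_queue_depth samples. ddr_util: DDR utilization samples.
    Sketch of the Symptom 1 discriminator; thresholds are illustrative.
    """
    tx_pinned = max(tx_depths) >= depth_cap
    internal_pinned = max(internal_depths) >= depth_cap
    if tx_pinned and not internal_pinned:
        # TX fills first while upstream stays bounded -> output side.
        return "output_backpressure"
    if internal_pinned and max(ddr_util) >= ddr_cap:
        # Internal queues pin together with memory pressure -> DDR.
        return "ddr_contention"
    if internal_pinned:
        return "compute_stall"
    return "inconclusive"
```

The check order encodes the playbook's logic: rule out output backpressure first, then use ddr_util to split memory contention from a pure compute stall.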
Symptom 2 — Random reboot (often during burst load)
Typical manifestation: reboots coincide with inference burst, link renegotiation, or power events; logs show reset.
1) reset_cause + brownout_warn_cnt + watchdog reason (from incident window)
2) Input V/I (or PoE rail) capture during the burst that precedes the reset
Symptom 3 — Output freeze (no frames/events leaving, pipeline may still run)
Typical manifestation: pipeline counters continue, but host stops receiving; or both pipeline and TX stall after hours.
1) tx_queue_depth + reconnect_cnt / USB error counter
2) Upstream queue_depth (preproc/NPU/post) and drop_reason histogram
Discriminator: frame_id stops incrementing → scheduler stall; watchdog/snapshot must trigger
Symptom 4 — Image noise rises with temperature (quality drift)
Typical manifestation: noise/black-level drift increases as enclosure warms; model confidence becomes unstable.
1) T1..T4 temperature points + dvfs_state
2) Noise proxy trend (per-frame simple stats) + stage latency drift (p95)
Symptom 5 — False positives change with lighting (domain/illumination sensitivity)
Typical manifestation: FP spikes under flicker/LED lighting, glare, backlight, or fast transitions.
1) FP trend + confidence drift + exposure/gain state marker
2) backlog indicators (queue_depth, stage_latency_p95)
Symptom 6 — PoE power-up fails (af/at/bt negotiation or startup collapse)
Typical manifestation: some PSEs fail, long cables fail, or repeated start/stop cycles.
1) PoE startup V/I capture at the PD input
2) PG (power-good) timing events across the rail sequence
H2-12. FAQs ×12 (Evidence-first; no scope creep)
Each answer stays inside the smart-camera evidence chain (pipeline / DDR / output / power / thermal / logging). Every response includes two evidence points, a discriminator, and a first fix, with links back to chapters.
Q1 Dropped frames: network congestion or DDR saturation?
Check tx_queue_depth (or socket drops) and ddr_util with per-stage stage_latency_p95.
If TX queue pins first while upstream queues stay bounded, it’s output backpressure. If ddr_util saturates and
multiple stages show tail-latency growth, it’s DDR contention/copy storm. First fix: reduce copies (zero-copy DMA buffers),
then cap the output tier (resolution/FPS/encoding) with a bounded drop policy.
Q2 Latency spikes only after 20 minutes: thermal throttle or queue buildup?
Correlate T1..T4 and dvfs_state with a slow trend of queue_depth.
If DVFS/throttle events start before latency tails grow, the spike is thermal-driven. If temperatures are stable but
queue depth slowly climbs over time, it’s backlog accumulation (pacing, bounded queues, or a leak in buffer recycling).
First fix: export an incident window (±3s) around spikes, then cap peak load (lower model concurrency or FPS) while enforcing
bounded queues to prevent stale-frame inference.
Q3 PoE powers up but reboots on inference start: which 2 rails first?
Measure (1) the PoE-side power after PD/isolated DC-DC (input energy stability) and (2) the SoC core rail (the fastest failure
signature) while logging reset_cause/brownout_warn_cnt. If the upstream rail collapses at load step,
suspect inrush/hold-up or eFuse limiting. If upstream stays stable but core droops, it’s secondary rail transient/sequence.
First fix: shape inrush and verify PG sequencing; then tighten brownout supervisor thresholds for clean recovery.
Q4 Model accuracy worse at night: sensor gain noise or preprocessing mismatch?
Compare night vs day runs using two fields: exposure/gain metadata (e.g., gain state, exposure time) and a preprocessing signature (input resize/normalize path ID or config hash). If false positives rise when gain ramps and noise proxies worsen, it’s input SNR limitation. If gain is stable but accuracy shifts when preprocessing mode changes (ROI, normalization, color space), it’s a preprocessing mismatch. First fix: lock exposure for A/B validation and freeze preprocessing config; then run a minimal validation matrix across lighting conditions to quantify the boundary.
Q5 USB3 works, GbE stutters: pacing or buffer size?
Look at tx_queue_depth on GbE and internal ring-buffer occupancy (frame queue depth). If TX queue periodically
hits the ceiling while internal buffers remain moderate, pacing/MTU scheduling is the likely cause. If internal buffers empty
and refill in bursts, buffer count is too shallow for the latency/jitter budget. First fix: enforce steady pacing with bounded
TX queues and a deterministic drop policy, then increase buffers only to the minimum that keeps p95 latency stable.
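The "bounded TX queues with a deterministic drop policy" fix can be sketched as a fixed-capacity queue that drops the oldest frame instead of letting upstream stages back up. This is a minimal illustration (drop-oldest is one reasonable policy; a real driver would also pace dequeues to the link rate).

```python
from collections import deque

class BoundedTxQueue:
    """Bounded output queue with a deterministic drop-oldest policy,
    so internal pipeline stages never back up when the link throttles.

    Hypothetical sketch; class name and capacity are illustrative.
    """
    def __init__(self, capacity=8):
        self.q = deque()
        self.capacity = capacity
        self.drop_cnt = 0  # feeds the drop_reason histogram

    def enqueue(self, frame):
        if len(self.q) >= self.capacity:
            self.q.popleft()   # drop oldest: the freshest frames win
            self.drop_cnt += 1
        self.q.append(frame)

    def dequeue(self):
        """Called by the paced TX path; None when the queue is empty."""
        return self.q.popleft() if self.q else None
```

Drop-oldest keeps output latency bounded (stale frames are the first casualty), and the explicit drop_cnt makes the drops visible as evidence rather than silent loss.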
Q6 Why does enabling ISP features reduce NPU FPS?
Measure ddr_util and per-stage stage_latency_p95 for ISP and NPU. If NPU utilization drops while DDR
utilization rises, the root cause is usually DDR contention: extra ISP stages add read/write amplification or trigger additional
format conversions/copies. If DDR stays low but ISP stage latency grows, ISP compute is the limiter. First fix: remove intermediate
copies (keep a single producer buffer), align stride/tiling, and disable non-essential ISP branches for the AI path.
Q7 Occasional green/purple frames: sensor link or memory corruption?
Use two proofs: frame continuity (frame_id jumps, line/frame error counters) and memory stress context
(ddr_util peaks or buffer reuse hot spots). If color corruption aligns with frame counter discontinuity or link error
flags, suspect ingress/link integrity first. If it appears only during high DDR pressure and disappears when output tier is capped,
suspect buffer reuse/copy bugs or DDR saturation artifacts. First fix: enable per-frame CRC/tagging for buffers and reduce DDR
pressure by removing copies and capping bandwidth tier.
Q8 Power is stable, but stream freezes: software deadlock or backpressure?
Check frame_id progression and tx_queue_depth/reconnect_cnt. If frame_id
continues to increment while TX stays pinned or reconnect counters rise, it’s backpressure and missing pacing/bounded queues.
If frame_id stops incrementing (no progress) and watchdog does not trigger, it’s a scheduler stall/deadlock.
First fix: add a “progress counter” at each pipeline node and force a snapshot on stall (incident window + counters) before changing
parameters.
Q9 p95 latency looks good, but jitter is high: where to log timestamps?
Jitter needs multi-point timestamps: attach ts_ingress at sensor/ingress, ts_pre_npu before inference,
and ts_egress at output enqueue. Compare delta variance across these segments. If ingress deltas vary most, the source
is exposure/ingress pacing; if egress deltas vary most, output backpressure dominates. First fix: log these three timestamps plus
queue_depth in the same frame record, and capture ±3s incident windows for spikes.
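Comparing delta variance across the three timestamp points can be sketched as follows; field names follow the answer above (ts_ingress, ts_pre_npu, ts_egress), units are microseconds, and the helper name is hypothetical.

```python
import statistics

def arrival_jitter(records):
    """Inter-frame jitter at each timestamp point: population stdev of
    consecutive frame-to-frame deltas for ts_ingress / ts_pre_npu /
    ts_egress. The point with the largest value localizes the jitter
    source. Sketch; assumes records are ordered by frame and complete.
    """
    out = {}
    for key in ("ts_ingress", "ts_pre_npu", "ts_egress"):
        ts = [r[key] for r in records]
        deltas = [b - a for a, b in zip(ts, ts[1:])]
        out[key] = statistics.pstdev(deltas)
    return out
```

In the example below, ingress is perfectly paced at 16.7 ms while egress intervals wander, so the jitter shows up only at the output point, i.e., backpressure dominates.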
Q10 How many buffers are “enough” for 60fps?
Start from two numbers: frame period (16.7ms at 60fps) and worst-case stage latency tail (stage_latency_p95).
If p95 stage latency exceeds one frame period, buffers must absorb the tail without growing unbounded. Validate by measuring peak
queue_depth during stress (network throttle + thermal ramp). If drops happen with low queue depth, buffers are too few;
if latency grows while queues expand, buffers are too many without pacing. First fix: set bounded queues with a consistent drop
policy, then size buffers to the smallest value that keeps p95 stable.
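The sizing rule above can be written as a small formula: enough frames in flight to absorb the worst stage's p95 tail, plus one buffer being filled, plus a margin. This is an illustrative starting-point formula under the assumptions in this answer, not a vendor rule; the result still needs validating against measured peak queue_depth under stress.

```python
import math

def min_buffers(frame_period_us, stage_p95_us, safety=1):
    """Smallest buffer count to try first at a given frame rate.

    frame_period_us: e.g. 16_700 at 60 fps.
    stage_p95_us: per-stage latency p95 values; the slowest stage sets
    how many frames can be in flight at once.
    Sketch: ceil(worst_p95 / period) in-flight frames, +1 being filled,
    + a safety margin (illustrative default of 1).
    """
    in_flight = math.ceil(max(stage_p95_us) / frame_period_us)
    return in_flight + 1 + safety
```

For example, at 60 fps with stage p95 values of 9 ms, 25 ms, and 12 ms, the slowest stage spans two frame periods, so the formula suggests starting from four buffers and then stress-testing downward.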
Q11 Thermal fix: heatsink or reduce model?
Decide with two proofs: T1..T4 + throttle events and the improvement from a controlled load reduction (lower model
concurrency/FPS). If reducing load immediately removes DVFS events and stabilizes latency/quality, start with workload shaping.
If load reduction barely helps and temperatures remain high, the thermal path (contact/spreader/enclosure) is the primary limiter.
First fix: cap peak load to restore deterministic latency, then validate a thermal step test to quantify how much heatsink/contact
improvements move the throttle boundary.
Q12 What minimum telemetry fields make RMA diagnosable?
A minimal RMA-ready set must reconstruct “what happened in the last 3 seconds.” Include frame_id, soc_ts,
stage latencies (p95 or per-frame), per-stage queue_depth, ddr_util, output counters (tx_queue_depth,
drops, reconnects), and platform health (reset_cause, brownout_warn_cnt, T1..T4, dvfs_state).
Discriminator: if the incident window can separate power vs DDR vs thermal vs backpressure, the RMA loop becomes actionable.
First fix: implement a ring buffer that exports ±3s snapshots on triggers.