Local Buffering & Storage for Machine Vision
Local buffering & storage in machine vision is about absorbing burst data safely and committing it with verifiable integrity. The goal is zero frame loss and zero evidence loss by controlling tail latency, atomic commits (commit markers), power-fail hold-up, and integrity counters (seq/CRC/ECC) that prove what happened.
H2-1. What “Local Buffering & Storage” Means in Machine Vision
This chapter pins down the engineering boundary: the buffer is a real-time shock absorber, storage is a persistent sink with a bounded worst-case latency, and evidence retention is the rule set that makes captured data provable after faults (including power loss).
Three terms, three responsibilities:
Buffer (DDR/LPDDR) absorbs burst rate and write-stall jitter (watermark-controlled).
Storage (SD/eMMC/NAND) provides persistence; success is defined by sustained throughput + tail latency, not peak spec.
Evidence is payload + metadata (sequence/time/CRC/commit) so gaps and corruption are detectable, and recovery can resume at the last consistent point.
System block covered on this page (interfaces only):
- DDR/LPDDR ring buffer → buffers pixel/packet bursts, exports watermarks and drop counters.
- Storage media (SD/eMMC/raw NAND) → sinks sustained writes and exposes worst-case stall behavior.
- Integrity (ECC/CRC + sequence/timestamp) → distinguishes correctable vs uncorrectable events, makes gaps measurable.
- Power-fail protection (PLP hold-up + commit marker) → ensures “all-or-nothing” finalization under power loss.
Three dominant machine-vision scenarios (each implies a different failure mode and measurement priority):
- Pre-trigger retention (capture-before-event): risk is overwrite/ordering; prove continuity with sequence + timestamp monotonicity.
- Burst capture (short, extreme data rate): risk is buffer overflow driven by tail write stalls; protect with headroom and watermarks.
- Edge logging (long-duration recording): risk is QoS drift over time (GC/erase cycles); prove stability with throughput vs time and error trends.
Evidence to measure (minimum set) — choose metrics that answer both “did it break?” and “why did it break?”
3-way quick classifier (pick one):
- Pre-trigger: prioritize sequence gaps, timestamp monotonicity, and window coverage (seconds).
- Burst capture: prioritize buffer headroom, p99.9 write latency, and overflow watermark hit rate.
- Edge logging: prioritize sustained MB/s over hours, QoS drift (latency percentiles), and integrity counters trend.
H2-2. Data Rate Budget & Buffering Targets
This chapter turns “buffer size” and “storage speed” into a budget problem. The design target must satisfy peak burst (instant demand) and sustained write (long-term sink), while surviving tail latency (worst-case stalls).
Core rule: Peak spec is not an acceptance criterion.
Acceptance is defined by sustained throughput over time + tail write latency (p99.9 / p99.99) such that
the buffer never crosses overflow watermark and evidence continuity remains provable (no seq gaps without explicit policy).
Budget inputs (keep them explicit and measurable):
- Resolution (W×H), FPS, bit depth (including packing), stream count (multi-camera aggregation).
- Overhead (headers, alignment, chunk metadata) — treat as a conservative percentage.
- Burst duration (ms or frames) and any pre-trigger window requirement (seconds).
Two targets that must both pass:
- Peak burst rate (MB/s): defines how fast the buffer fills during the worst segment.
- Sustained storage rate (MB/s): defines long-run drain capability, not peak marketing speed.
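As a quick sanity check, the budget inputs above can be turned into numbers with a short sketch (the function name and the 10 % overhead default are illustrative, not from this document):

```python
def payload_rate_mb_s(width, height, bits_per_px, fps, streams=1, overhead_frac=0.10):
    """Raw and effective (overhead-inflated) payload rate in MB/s.

    overhead_frac is a placeholder for headers, alignment, and chunk
    metadata; treat it as a conservative percentage, as the text advises.
    """
    raw = width * height * bits_per_px / 8 * fps * streams / 1e6
    effective = raw * (1.0 + overhead_frac)
    return raw, effective

# Example: two 1920x1080 streams at 60 fps, 12-bit packed pixels.
raw, eff = payload_rate_mb_s(1920, 1080, 12, 60, streams=2)
```

The effective rate, not the raw rate, is what the sustained-storage target must beat.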
“Buffer absorbs jitter” — engineering definition:
- Throughput jitter: storage write rate dips temporarily (GC/erase/program pacing).
- Latency jitter (tail latency): some write operations take far longer than average (p99.9+), causing sudden queue growth.
- Buffer target: provide enough headroom so that worst-case stalls only create watermark excursions, not data loss.
| Inputs (spec) | Derived budget | Targets (acceptance) |
|---|---|---|
| Resolution, FPS, bits/px; stream count | Raw payload rate (MB/s); effective rate after overhead | Peak burst MB/s (worst segment); sustained MB/s (hours-scale) |
| Burst duration (ms / frames); pre-trigger window (s) | Buffer coverage time (s); required buffer capacity (MB/GB) | Buffer high-water hit rate below limit; overflow watermark hits: 0 |
| Storage candidate (SD/eMMC/NAND); workload pattern (append / mixed) | Write latency distribution; queue depth evolution | Tail latency p99.9 / p99.99 within budget; no unexplained seq gaps / CRC fails |
What to measure (minimum instrumentation):
- DDR bandwidth utilization and buffer fill watermark events (rate and duration).
- Write queue depth and tail write latency (p99.9/p99.99), not only average MB/s.
- Drop counters + reason codes, plus sequence/timestamp continuity checks.
Practical sizing heuristic (no heavy math):
Treat storage as a device with occasional “stall windows”.
Size buffer headroom so that peak MB/s × worst stall time stays below overflow watermark,
and prove it with watermark and tail-latency logs under real workloads.
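A minimal sketch of that heuristic, assuming the drain stops entirely during a stall (the 1.5 safety margin is a hypothetical starting point to tune from watermark logs):

```python
def required_headroom_mb(peak_mb_s, worst_stall_s, margin=1.5):
    """Headroom needed to ride out the worst stall at full ingest rate."""
    # Conservative: assume the storage drain is zero for the whole stall window.
    return peak_mb_s * worst_stall_s * margin

def passes_watermark(buffer_mb, overflow_frac, peak_mb_s, worst_stall_s):
    """True if the worst measured stall stays below the overflow watermark."""
    return required_headroom_mb(peak_mb_s, worst_stall_s) <= buffer_mb * overflow_frac

# Example: 400 MB/s peak, 50 ms worst stall, 256 MB buffer, watermark at 25 %.
ok = passes_watermark(256, 0.25, 400, 0.05)
```

The check is only as good as the stall measurement; worst_stall_s must come from tail-latency logs under the real workload, not from a datasheet.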
H2-3. Buffer Architecture Patterns
This chapter defines buffer behavior as a policy, not a guess: watermarks, trigger windows, and backpressure rules decide whether loss is prevented, or (if unavoidable) becomes explicit and diagnosable via reason codes and continuity evidence.
Target function: absorb peak bursts and tail write stalls so the system remains predictable. If any data must be dropped, it must be dropped by policy (not overflow chaos), and recorded with a drop reason code.
Three reusable buffer patterns (choose by scenario and failure mode):
| Pattern | Best for | Typical risk | First things to measure |
|---|---|---|---|
| Ring buffer | Pre-trigger retention; continuous streams with event capture | Overwrite creates silent window gaps if markers/seq are not enforced | Fill level + high-water hits, seq gap, overwrite count |
| Double buffer (ping-pong) | Fixed-size chunking; deterministic block flush | Tail stalls break swap timing; whole blocks can miss deadlines | Swap stall time, queue depth, block drop count |
| Tile buffer | Very large frames; incremental flush; multi-stream aggregation | Replay breaks if tile index/order is not written into metadata | Tile seq/index continuity, missing tile count, metadata CRC |
Pre-trigger / Post-trigger window implementation (continuity is provable only when these exist):
- Sequence counter (detect missing segments precisely).
- Monotonic timestamp (order and timing proof; supports replay alignment).
- Trigger marker + commit marker (window boundary is explicit and recoverable).
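A minimal pre-trigger ring sketch showing how sequence counters make overwrite explicit and gaps measurable (the class and method names are illustrative):

```python
from collections import deque

class PreTriggerRing:
    """Hypothetical pre-trigger ring: keeps the last `capacity` frames with
    sequence + timestamp so window continuity is provable on export."""

    def __init__(self, capacity):
        self.ring = deque(maxlen=capacity)  # oldest entries overwritten by design
        self.overwrite_count = 0

    def push(self, seq, ts, payload):
        if len(self.ring) == self.ring.maxlen:
            self.overwrite_count += 1       # counted overwrite, not silent loss
        self.ring.append((seq, ts, payload))

    def freeze_window(self):
        """On trigger: snapshot the window and report any sequence gaps."""
        frames = list(self.ring)
        gaps = [(a[0], b[0]) for a, b in zip(frames, frames[1:]) if b[0] != a[0] + 1]
        return frames, gaps
```

Overwrite is legal here (it is the ring's job); what must never happen is an unexplained gap inside the frozen window.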
Backpressure policy (buffer-side only) — when watermarks rise, act in a defined priority order:
- Policy drop: drop non-critical segments first (record reason code).
- Rate reduction request: request reduced input if supported (do not explain ISP/codec internals here).
- Hard drop: last resort; still recorded with reason code and watermark snapshot.
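The priority order above can be encoded as a tiny policy function (the threshold values are hypothetical and should come from measured watermark behavior):

```python
def backpressure_action(fill_frac, high_water=0.70, rate_req=0.80, critical=0.90):
    """Return the policy action for the current buffer fill fraction.

    Escalation: policy drop of non-critical segments first, then a rate
    reduction request, then hard drop as the last resort; every non-"none"
    action should be logged with a reason code and a watermark snapshot.
    """
    if fill_frac >= critical:
        return "hard_drop"
    if fill_frac >= rate_req:
        return "rate_reduction"
    if fill_frac >= high_water:
        return "policy_drop"
    return "none"
```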
Minimum evidence and counters (these make loss diagnosable):
- Fill level plus high-water/overflow watermark hit counts and durations.
- Drop count with reason code; pattern-specific counters (overwrite count for rings, block drop count for ping-pong, missing tile count for tile buffers).
- Sequence gap and timestamp monotonicity checks per capture window.
H2-4. Choosing DDR/LPDDR for Buffering
This chapter prevents a common failure: “theoretical bandwidth is enough, yet the system still stutters.” The real limiter is often tail latency driven by refresh, bank conflicts, and arbitration under multi-master contention.
Selection principle: choose memory and controller/QoS observability so that p99.9 access latency and watermark hit behavior stay within the buffer headroom budget under real multi-master workload.
DDR vs LPDDR (engineering trade-offs, kept intentionally brief):
- LPDDR: optimized for power states and self-refresh behavior; commonly used where thermal/power headroom is limited.
- DDR: commonly used where sustained bandwidth and mature ecosystem are prioritized; easier to scale channels/width in some designs.
- Either can fail if tail latency is not controlled or measured.
Three bottlenecks that create “stalls” despite sufficient average bandwidth:
- Burst access: long bursts can starve other masters and cause queue build-up (watch max wait time).
- Bank conflicts: concurrent streams land on conflicting banks/rows; throughput becomes unstable and latency spikes.
- Refresh: periodic refresh pauses create recurring latency peaks; can align with watermark excursions.
Multi-master contention (FPGA ingest + CPU logging + DMA flush) — QoS/arbitration principles (buffer-side view):
- Protect the real-time writer: guaranteed service or bounded wait for the ingest stream.
- Cap burst length: prevent one master from monopolizing the bus during critical windows.
- Expose arbitration stats: grant ratio, max wait, and starvation events must be observable for acceptance and field debug.
Evidence to capture (acceptance-grade):
- Latency histogram vs time, so refresh-aligned peaks become visible.
- Watermark hits, queue depth, arbitration grant ratio, max wait time, and starvation events under the target multi-master workload.
Selection checklist (7 items) — use as a “go/no-go” gate:
- Under target workload, p99.9 memory access latency remains inside the buffer headroom budget.
- Refresh behavior does not create periodic watermark excursions (check latency histogram vs time).
- Arbitration shows no starvation; max wait time is bounded for real-time master.
- Burst length/queue limits are configurable and measurable (not a black box).
- With multi-master concurrency, queue depth remains stable (no runaway growth).
- Instrumentation exists: watermark hits, queue depth, grant ratio, max wait, reason codes.
- Across batches/configs, tail latency distributions are consistent under the same workload.
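The checklist items about p99.9 latency imply percentile math over raw samples, not averages; a nearest-rank sketch (the helper name is illustrative):

```python
import math

def latency_percentile(samples, q):
    """Nearest-rank percentile (q in percent) over raw latency samples.

    Averages hide the stalls that overflow buffers; acceptance should be
    stated on tail percentiles computed from logs like this.
    """
    s = sorted(samples)
    rank = max(1, math.ceil(len(s) * q / 100.0))
    return s[rank - 1]
```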
H2-5. Storage Media Options: SD / eMMC / Raw NAND
Media selection is a predictability decision. For machine vision evidence capture, the critical constraints are sustained write plus worst-case (tail) write latency, then lifetime and power-fail consistency. Peak “MB/s” alone does not prevent stalls or data gaps.
Selection axis: who owns FTL / ECC / bad-block, how observable health/lifetime signals are, and whether p99.9/p99.99 write latency stays within the buffering headroom budget during long-run recording.
Core concepts that impact sustained + tail latency:
| Media | Controller boundary | Predictability & risk | Best fit | No-fit (red flags) |
|---|---|---|---|---|
| SD / microSD | Controller/FTL/ECC inside card; behavior can be a black box | High variance across brands/batches; fake/relabeled cards; tail latency spikes during internal GC | Removable logging where swap matters; prototypes; non-critical retention | Hard real-time capture; strict QoS; “must-be-identical” mass production |
| eMMC | Controller/FTL/ECC in package; generally more consistent than SD | Better batch stability; still needs tail latency verification under workload | Embedded cameras/edge devices needing predictable long-run recording | When full custom FTL policy is required or health signals are mandatory but unavailable |
| Raw NAND | FTL/ECC/bad-block handled by system controller or software (you own policy) | Highest controllability; engineering effort is higher; can be optimized for stable tail latency | High-duty continuous capture; strict lifetime control; determinism requirements | When engineering resources are limited or qualification time is short |
What to measure (evidence-based acceptance) — use the same workload across candidates:
- Long-run sustained curve: MB/s vs time over hours (detect drift and GC impact).
- Write latency distribution: p50 / p99 / p99.9 (tail spikes predict buffer overflows).
- Trend signals: retries/errors if available; “bad block growth” as a concept/trend for raw NAND.
- Batch consistency: repeat across multiple samples; compare distributions, not peak numbers.
Quick fit rules
- SD fits when removable media matters and tail spikes are tolerable.
- eMMC fits when mass-production stability is needed with moderate engineering effort.
- Raw NAND fits when policy control and lifetime determinism dominate the requirements.
No-fit (red flags)
- “Peak MB/s looks good” but p99.9 latency is unknown or unstable.
- Long-run curve drops over time (GC pressure) without predictable bounds.
- Power-fail consistency is required but the responsibility boundary is unclear.
H2-6. Wear Leveling, FTL, and Lifetime Control
“Gets slower over time” is usually not a bandwidth mystery; it is an FTL behavior story. Mapping, garbage collection (GC), wear leveling, and bad-block handling create write amplification and tail latency spikes, which can push the buffer over its watermarks and produce evidence gaps unless the recording strategy shapes the write pattern.
Goal: keep WA trend and p99.9 write latency inside a known envelope by controlling write shape: chunked sequential, append-only logging, and batched commit (storage-side strategies only).
What FTL does (the minimum model):
- Mapping: logical → physical translation (LBA→PBA).
- GC: reclaim invalid pages by moving valid data (copy/erase cycles).
- Wear leveling: distribute erase cycles to avoid early failure.
- Bad-block: detect, retire, and remap failing blocks.
Why write amplification (WA) grows (common sources):
- Small random writes (forces read-modify-write and increases GC pressure).
- Frequent metadata updates (turns sequential workload into fragmented churn).
- Mixed workloads (simultaneous logging + indexing/maintenance creates unpredictable peaks).
Recording strategies (storage-side) — shape the workload to reduce WA and tail spikes:
- Chunked sequential writes: write in larger aligned chunks (reduces fragmentation).
- Append-only log: avoid in-place updates; treat records as immutable append.
- Batched commit: commit metadata/index updates in batches to avoid constant small syncs.
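The three strategies combine naturally in one writer: accumulate records into a large aligned chunk and commit once per chunk (the sizes, alignment, and CRC-based commit record here are hypothetical choices):

```python
import io
import struct
import zlib

class AppendLog:
    """Sketch: chunked sequential + append-only + batched commit."""

    CHUNK = 1 << 20   # accumulate ~1 MiB before touching the media
    ALIGN = 4096      # pad writes to a page-sized boundary

    def __init__(self, sink):
        self.sink = sink           # file-like object opened for append
        self.buf = bytearray()
        self.commits = 0

    def append(self, record: bytes):
        # Records are immutable appends; no in-place updates.
        self.buf += struct.pack("<I", len(record)) + record
        if len(self.buf) >= self.CHUNK:
            self.flush()

    def flush(self):
        """One batched commit: aligned payload, then a single CRC record."""
        if not self.buf:
            return
        pad = (-len(self.buf)) % self.ALIGN
        self.sink.write(bytes(self.buf) + b"\x00" * pad)
        self.sink.write(struct.pack("<I", zlib.crc32(self.buf)))
        self.commits += 1
        self.buf.clear()

log = AppendLog(io.BytesIO())
```

The media then sees few large sequential writes and one small commit per chunk instead of a constant stream of tiny syncs.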
Top 3 lifetime killers
- Small writes + immediate sync → WA spikes
- Near-full media (low OP) → frequent GC
- High-duty continuous capture → tail latency spikes become denser
First fixes (fastest wins)
- Switch to chunk + append-only + batch commit
- Increase OP / enable pSLC (when supported)
- Verify tail latency envelope under real workload
Evidence and measurable trends (even when “media writes” are not directly visible):
- Long-run drift: sustained MB/s vs time decreases as GC pressure rises.
- Tail density: frequency and height of p99.9 latency spikes increase over time.
- GC proxy: recurring latency-peak patterns indicate GC/relocation bursts.
- If health counters exist: treat as strong evidence, but do not rely on them exclusively.
H2-7. Data Integrity: ECC, CRC, Metadata & Atomicity
Integrity in evidence logging is not “no errors”; it is clear boundaries between recoverable and non-recoverable cases, plus a traceable record structure that can locate gaps, isolate damage, and replay to a proven last-consistent point.
Three evidence chains: (1) Media health via ECC corrected/uncorrectable trends, (2) Record validity via CRC + sequence counters, (3) Consistency via atomic commit markers (power-fail safe replay).
Recoverable (system continues with evidence)
- ECC corrected: bit flips corrected; record remains valid.
- CRC fail but isolated: the damaged chunk can be skipped and the gap is measurable.
- Sequence gap: missing region is located and quantified (frames/time span).
Non-recoverable (must fail safe / fall back)
- ECC uncorrectable: stored bits cannot be reconstructed; data is lost.
- Atomicity broken: no commit marker; replay must roll back to last committed point.
- Header unreadable: without a valid header, the chunk is not trustworthy.
ECC (BCH/LDPC concept) protects stored bits at the media/controller level. It does not prove that a record is complete or ordered—those guarantees come from record structure and commit semantics.
- Scrubbing (concept): periodic read/refresh reduces retention risk and can surface weak blocks early.
- Read disturb / retention (concept): heavy reads or long storage time raise raw bit error risk.
- Evidence trend: rising corrected counts predict aging; uncorrectable triggers containment.
CRC + sequence counters make logs auditable:
- CRC fail count identifies corrupted chunks (silent damage becomes explicit).
- Sequence gap quantifies missing range (missing N chunks or Δt window).
- Chunk header enables fast scanning and precise localization of damage.
Atomicity (all-or-nothing) is enforced by a commit marker: payload and metadata are written first; the commit marker is written last to prove the chunk (or batch) is complete. On next boot, replay accepts only committed chunks and discards any uncommitted tail.
Recommended record chunk format (fields only, no code)
- Magic + Version (fast identification + compatibility)
- Chunk type (frame / metadata / marker)
- Sequence counter (gap detection)
- Timestamp (time-localization of missing region)
- Payload length (bounds damage and scanning)
- Header CRC (keep header trustworthy for recovery)
- Payload CRC (detect payload corruption)
- Commit marker (atomicity proof; last write)
- Optional: Session ID / Device ID (traceability across deployments)
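One possible serialization of the field list above (the magic and commit-marker values, and the exact packing, are hypothetical):

```python
import struct
import zlib

MAGIC = 0x4D564C47      # hypothetical magic value
COMMIT = 0xC0DEC0DE     # hypothetical commit-marker value
# magic u32, version u16, chunk type u8, pad u8, seq u64, timestamp u64, payload len u32
HDR = struct.Struct("<IHBBQQI")

def encode_chunk(seq, ts_ns, payload, chunk_type=0, version=1):
    """Header + header CRC, then payload + payload CRC, commit marker last.

    The commit marker must be the final bytes written so that an
    interrupted chunk is detectably uncommitted on replay.
    """
    hdr = HDR.pack(MAGIC, version, chunk_type, 0, seq, ts_ns, len(payload))
    hdr += struct.pack("<I", zlib.crc32(hdr))   # header stays independently checkable
    body = payload + struct.pack("<I", zlib.crc32(payload))
    return hdr + body + struct.pack("<I", COMMIT)
```

Keeping the header CRC separate from the payload CRC is what lets recovery scan headers quickly even when a payload is damaged.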
What to measure (acceptance evidence):
- ECC counters: corrected / uncorrectable events (trend over time).
- CRC fail count: corrupted chunk frequency under real workload.
- Sequence gap: missing chunk count and reconstructed time span.
- Recovery point: last committed sequence after restart.
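Replay acceptance reduces to a scan over per-chunk verdicts (the tuple shape is an illustrative stand-in for whatever a header scan produces):

```python
def recovery_point(chunks):
    """Return (last accepted seq, gap locations) from an ordered chunk scan.

    chunks: iterable of (seq, crc_ok, committed) tuples. The uncommitted
    tail stops the scan by design; isolated CRC damage is skipped but
    recorded so the missing range stays measurable.
    """
    last, gaps = None, []
    for seq, crc_ok, committed in chunks:
        if not committed:
            break                 # uncommitted tail: discard, not corruption
        if not crc_ok:
            gaps.append(seq)      # isolated damage: skip, keep scanning
            continue
        if last is not None and seq != last + 1:
            gaps.append(seq)      # sequence gap located precisely
        last = seq
    return last, gaps
```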
H2-8. PLP Hold-up & Power-Fail Safe Logging (Storage-side View)
Power-loss protection (PLP) is treated here purely as a storage consistency mechanism: detect power-fail, stop accepting new writes, flush pending data, write the commit marker, and guarantee replay returns to the last committed sequence. No full system power topology is required.
Last-gasp objective: hold-up window must cover T_flush + T_commit + margin. If commit cannot be reached, replay must roll back to the previous commit point.
Power-fail safe sequence (SOP):
- PF detect: latch a power-fail event (PF# or voltage-fall detector).
- Freeze: disable new writes; capture the current “freeze seq”.
- Flush: drain queues / write out in-flight chunks (or discard uncommittable tail).
- Commit: write commit marker as the last write; record “commit seq”.
- Power lost: after commit (ideal) or before commit (handled by rollback rule).
- Recover: next boot replays to last committed seq only.
Hold-up sizing (concept + steps) — without deep power derivations:
- Measure T_flush: worst-case time to drain the write path under the real workload.
- Measure T_commit: time to write and finalize the commit marker.
- Add margin for scheduling jitter, temperature, and media tail latency spikes.
- Acceptance: commit succeeds for repeated power-cut tests across operating corners.
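The sizing rule is one inequality; a sketch (the 5 ms default margin is a hypothetical placeholder, not a recommendation):

```python
def holdup_ok(t_holdup_ms, t_flush_ms, t_commit_ms, margin_ms=5.0):
    """Hold-up window must cover worst-case flush + commit + margin."""
    return t_holdup_ms >= t_flush_ms + t_commit_ms + margin_ms
```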
Power-fail consistency state machine (storage-side)
- RUN → normal logging with seq/CRC.
- PF_DETECT → power-fail latched; start last-gasp window.
- FREEZE → stop accepting new writes; record freeze seq.
- FLUSH → drain queues; finish in-flight chunks where possible.
- COMMIT → write commit marker; output last committed seq.
- RECOVER → on next boot, replay to last committed seq; discard tail.
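The state list above maps directly onto a transition table (the event names are illustrative):

```python
# Storage-side power-fail states; unknown events leave the state unchanged.
TRANSITIONS = {
    ("RUN", "pf_detect"): "PF_DETECT",
    ("PF_DETECT", "latched"): "FREEZE",       # last-gasp window started
    ("FREEZE", "frozen"): "FLUSH",            # freeze seq recorded, no new writes
    ("FLUSH", "drained"): "COMMIT",           # in-flight chunks finished or discarded
    ("COMMIT", "marker_written"): "RECOVER",  # last committed seq is now the truth
}

def step(state, event):
    """Advance the state machine; out-of-order events cannot skip a state."""
    return TRANSITIONS.get((state, event), state)
```

Making illegal transitions impossible (rather than merely unlikely) is what makes the recovery behavior deterministic under repeated power-cut tests.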
What to measure (power-fail evidence):
- PF detect timestamp (PF# or detector event time).
- Flush duration (T_flush) under worst-case queue depth.
- Commit completion time (T_commit) and commit marker presence.
- Recovery point: last committed sequence after restart.
H2-9. Performance & QoS Under Real Workloads
“Fast in the lab” is not “stable in the field.” Real workloads expose tail latency, jitter, background management (e.g., garbage collection), and temperature-driven rate drops. The engineering target is: sustained throughput over time while keeping p99/p99.9 write latency low enough that buffering never overflows.
What matters: (1) sustained throughput vs time, (2) write latency percentiles (p99/p99.9), (3) spike density (how often big delays happen), (4) stability across OP level and thermal steady state.
Workload A — Continuous sequential write
- Goal: long-run stability (minutes → hours).
- Watch: throughput drift, p99.9 growth, spike density.
- Risk: tail spikes accumulate → buffer watermark → frame/evidence loss.
Workload B — Burst write + idle
- Goal: burst repeatability across cycles.
- Watch: first burst vs nth burst symmetry; idle “recovery.”
- Risk: background work shifts into burst windows → sudden p99.9 blowups.
Workload C — Write + read replay (contention)
- Goal: predictable QoS under concurrent access.
- Watch: latency jump when replay starts; throughput flattening.
- Risk: arbitration/queue contention magnifies tail latency.
Thermal rate drop (phenomenon only): when the controller/media reaches a protection state, throughput can step down while tail latency spikes become more frequent. QoS scoring should be performed at thermal steady state (after the temperature stabilizes), not only at cold start.
Why OP (over-provision) stabilizes QoS:
- Low OP reduces free blocks → background management becomes frequent.
- More background work → tail latency spikes become denser and taller.
- Denser spikes → buffer headroom is consumed → overflow risk rises.
- Therefore OP is not only endurance-related; it is a latency stability control.
What to measure (field-grade evidence):
- Write latency percentiles: p50, p95, p99, p99.9 (before/after replay starts).
- Sustained throughput vs time: steady-state average and drift slope.
- Spike density: count of latency spikes above a chosen threshold per minute.
- Stability vs OP: compare the same workload at multiple free-space levels.
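Spike density is easy to compute from a latency log (the threshold is workload-specific and the helper is illustrative):

```python
def spike_density_per_min(latencies_us, timestamps_s, threshold_us):
    """Count latency spikes above threshold per minute of wall-clock time."""
    spikes = sum(1 for lat in latencies_us if lat > threshold_us)
    span_min = max(timestamps_s[-1] - timestamps_s[0], 1e-9) / 60.0
    return spikes / span_min
```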
H2-10. Validation Plan: How to Prove “No Frame Loss + No Evidence Loss”
A validation plan must prove two things simultaneously: no frame loss at the system boundary and no evidence loss at the storage boundary. “Evidence” is defined by sequence continuity, CRC validity, and replay to last committed sequence after power-fail events.
Acceptance core: drop counter = 0 (or strictly defined allowed drops), seq gap = 0 (or fully explainable), CRC fail = 0 (excluding uncommitted tail by design), recovery reports last committed seq and replays consistently.
1) Throughput stress
- Drive sustained write to the real upper bound.
- Score at thermal steady state.
- Capture p99.9 latency vs time and buffer watermark events.
2) Power-fail injection
- Random cut times across workloads.
- Vary buffer fill levels (low/mid/high).
- Verify replay returns to last committed seq only.
3) Aging / endurance
- Long-run write + periodic burst + periodic replay.
- Repeat at multiple OP levels.
- Track trends: spike density, throughput drift, error counters.
Test matrix (workload × duration × criteria):
| Workload | Duration | Power-fail injection | Pass criteria (hard) | Recorded evidence |
|---|---|---|---|---|
| A Continuous sequential write | 30 min → 2 h → 8 h | Optional (baseline), then random cuts | Throughput stable vs time; p99.9 within budget; drop=0; gap=0 | Throughput/time; p99.9/time; drop; seq gap; CRC fail; last commit |
| B Burst + idle cycles | 100+ cycles | Cuts at: burst start / mid-burst / idle | Cycle-to-cycle stability; no increasing spike density; recovery to last commit | Burst p99.9; cycle drift; gap/CRC near cut; recovery point |
| C Write + read replay | 60 min | Cuts with replay ON and OFF | Replay does not destabilize writes beyond budget; no hidden gaps | Latency delta (replay on/off); throughput flattening; counters |
| OP sweep (free space levels) | Per level: 60–120 min | At least 10 cuts per level | QoS remains within criteria above; low OP must not introduce unbounded spikes | p99.9 vs OP; spike density vs OP; throughput drift vs OP |
| Aging / endurance mix | Multi-hour / multi-day | Periodic random cuts | Trend remains bounded; error counters do not accelerate unexpectedly | Error trends; throughput drift; spike density trend; bad-block trend (concept) |
SOP checklist (Prepare → Run → Judge → Record)
- Prepare: lock workload parameters (fps/bit depth/burst length), lock OP level, enable counters (drop/CRC/gap/last commit).
- Run: execute A/B/C workloads; reach thermal steady state before scoring; inject power-fail at random times and defined states.
- Judge: verify drop=0 (or allowed policy), gap=0, CRC fail=0, recovery reports last committed seq and replays consistently.
- Record: store time series (throughput, p99.9 latency), event timeline (PF detect/flush/commit), and final counters.
Evidence definition (must be explicit):
- Frame loss: drop counter must be zero (or strictly defined allowed drops).
- Evidence loss: seq gap must be zero, or any gap must be fully attributable (e.g., uncommitted tail after PF cut).
- Corruption: CRC fail must be zero for committed chunks.
- Recovery: last committed seq is the single source of truth after restart.
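The four definitions above can be enforced mechanically at the end of each run (the counter names are illustrative; feed whatever your logger actually exports):

```python
def judge_run(drops, seq_gaps, crc_fails_committed, recovered_seq, expected_seq,
              allowed_drops=0):
    """Hard pass/fail verdict per the evidence definition above."""
    return {
        "no_frame_loss": drops <= allowed_drops,
        "no_evidence_loss": seq_gaps == 0,
        "no_corruption": crc_fails_committed == 0,   # committed chunks only
        "recovery_ok": recovered_seq == expected_seq,
    }

def passed(verdict):
    return all(verdict.values())
```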
H2-11. Field Debug Playbook: Symptom → Evidence → Isolate → Fix
Goal: fastest on-site isolation with minimum tools. Every symptom is resolved by the same structure: First 2 measurements → Discriminator → First fix. Fixes stay within the storage/buffer boundary: tail latency, buffering headroom, commit/atomicity, PLP last-gasp, and integrity counters.
Keep the evidence chain tight: drop counter / buffer watermark / p99.9 latency / seq gap / CRC fail / last committed seq / ECC corrected & uncorrectable trends.
Symptom 1 — Occasional frame drops
- Buffer evidence: buffer fill level, overflow watermark hits, drop reason code (if available).
- QoS evidence: p99.9 write latency (or spike density) aligned to the drop timestamp.
- Watermark peaks just before drops → buffer overflow (burst or spikes exceed headroom).
- Average throughput looks fine but p99.9 spikes align with drops → tail latency kills the buffer.
- Drops appear only when read-replay starts → write/read contention or arbitration starvation.
- Add headroom: raise buffer safety margin and enforce conservative high-water policies (freeze new writes earlier).
- De-spike writes: switch to larger sequential appends (reduce small writes/metadata churn).
- Isolate replay: restrict read-replay to idle windows or lower replay rate so it cannot starve writes.
- Component examples (MPN):
Notes: DRAM parts are representative families; pick density/speed-bin per your bandwidth budget. PCIe switch example applies when frame grabber / edge box aggregates multiple streams.
Symptom 2 — Slower / stuttering after running for a while
- Throughput vs time: sustained write throughput over 30–120 minutes (thermal steady state included).
- Latency vs time: p99.9 write latency trend + free-space/OP level in the same timeline.
- Gradual throughput decay + spikes become denser as free space drops → GC/low OP pressure.
- Step-down throughput plateau that persists → controller/media throttling (phenomenon).
- Nth burst gets worse and idle does not recover → background management overlaps the burst window.
- Raise OP: reserve more free space to bound worst-case latency (stability control, not only endurance).
- Make writes append-only: log-structured sequential writes; reduce small random updates and metadata churn.
- Score at steady state: only accept QoS after warm-up; avoid “cold-start wins” that fail later.
- Component examples (MPN):
Notes: use industrial-grade media when “worst-case latency” matters. Consumer cards often have unstable QoS.
Symptom 3 — Corruption or replay gaps after power loss
- Recovery point: last committed seq reported after reboot / replay scan.
- Gap locality: whether seq gaps / CRC fails cluster around the power-cut timestamp.
- Only the uncommitted tail disappears and commit marker is absent → expected by design (not random corruption).
- CRC fails appear earlier than the cut or in multiple regions → true corruption path or media issue.
- Commit often missing even when cuts happen “late” → PF detect too late or hold-up window too short.
- Enforce commit semantics: replay trusts only the last commit marker (single source of truth).
- Shorten last-gasp path: on PF detect → freeze new writes → flush buffer → write commit marker.
- Power-fail matrix: random cuts at low/mid/high buffer fill levels to confirm deterministic recovery behavior.
- Component examples (MPN):
Notes: device choices depend on rail voltage and hold-up energy target; focus is PF detect + deterministic flush/commit ordering.
Symptom 4 — ECC counters surge / bad blocks increase
- ECC trend: corrected vs uncorrectable counts over time (trend beats a single snapshot).
- Record-layer integrity: CRC fail and seq gap correlation with ECC trend (does it affect evidence?).
- Corrected rises but uncorrectable stays 0 and CRC/gap stays 0 → aging signal, still controllable.
- Uncorrectable appears or CRC/gap rises with ECC → risk of unrecoverable loss is now real.
- ECC surge correlates with small random writes → write amplification / GC pressure is the root driver.
- Reduce write amplification: append-only chunking, fewer metadata updates, larger commit units.
- Increase OP: lower relocation frequency and bound worst-case latency.
- Scrub / health checkpoints (concept): detect weak regions earlier and isolate before uncorrectables appear.
- Component examples (MPN):
Notes: raw NAND requires an FTL/ECC strategy; FPGA example is for implementing/accelerating ECC/CRC pipelines in a grabber/edge design.
H2-12. FAQs (12) — Evidence-based, no scope creep
Each answer maps back to the evidence chain (counters / latency percentiles / seq-gaps / commit markers / ECC stats), and points to the relevant chapter(s) for deeper context.
1 The rated write speed looks sufficient, but frames still drop sometimes—DDR bandwidth or storage tail latency?
Most “sporadic drops” are caused by tail-latency spikes that the buffer cannot absorb, not by average DDR bandwidth.
- Buffer watermark vs drop timestamp: ring fill level / overflow watermark hits aligned to the drop.
- p99.9 write latency (or spike density) aligned to the same timeline; average throughput is not enough.
- Increase buffer headroom and enforce append-only larger chunks to reduce latency spikes (avoid tiny random writes/metadata churn).
Map: H2-2 / H2-3 / H2-9
2 Drops happen only during burst capture, never in steady mode—which two ring-buffer thresholds matter most?
Use two watermarks: a conservative “freeze new writes” high-water mark and a hard overflow watermark for root-cause evidence.
- High-water vs overflow watermark hits and the fill-level slope (how fast the ring climbs during bursts).
- Drop reason code (if available) plus “time-in-high-water” per burst window.
- Freeze/shape burst intake earlier (high-water), then flush/commit in deterministic order; do not wait for overflow.
Map: H2-3
3 After one hour it becomes stuttery—garbage-collection/wear leveling, or thermal throttling?
If tail latency gets denser as free space shrinks, it is GC/OP pressure; if throughput steps down and stays, it’s throttling.
- Sustained throughput vs time across warm-up to steady state, plus free-space/OP level.
- p99.9 latency vs time; look for “spike density” growth rather than average speed.
- Raise OP and switch to log-structured sequential appends (larger chunks, fewer tiny updates) to bound worst-case latency.
Map: H2-6 / H2-9
4 After a power cut, the file opens but a segment is missing—atomicity break or a seq-gap?
Seq-gaps clustered near the cut usually indicate an uncommitted tail; atomicity problems show up as partial/invalid commit markers.
- last committed seq after reboot/replay scan, and where the first gap appears relative to that point.
- CRC fail count and whether failures appear only at the end (tail) or earlier in the log.
- Replay must trust only commit markers and stop at last commit; never treat uncommitted tail as “corruption.”
Map: H2-7 / H2-8
5 PLP is added but the last few frames still vanish—insufficient hold-up energy or wrong flush ordering?
Most PLP “still loses tail” issues are ordering or PF-detect timing problems; hold-up energy is only useful if flush+commit is deterministic.
- PF detect time and flush duration (from PF# trigger to commit marker written).
- Recovery evidence: last committed seq after reboot and whether commit marker is present.
- On PF detect: freeze new writes → flush buffer → write commit marker; shrink the last-gasp write volume.
Map: H2-8
6 eMMC is steadier than SD, but one batch still jitters—what QoS metric should qualify suppliers?
Qualify with tail latency and stability over time (p99/p99.9 + spike density), not peak MB/s.
- Write latency percentiles (p99 and p99.9) under the real workload for ≥60 minutes.
- Sustained throughput vs time at a fixed OP level; record any step-down or drift.
- Define acceptance on p99.9 latency ceiling and allowed spike density, then enforce industrial-grade media for evidence capture.
Map: H2-5 / H2-9
7 Is a slowly increasing “ECC corrected” counter normal? When should it trigger an alarm?
Corrected errors can be an aging signal; the red line is any uncorrectable growth or any corrected trend that correlates with CRC/seq failures.
- Corrected vs uncorrectable ECC trend (per hour/day), not a single snapshot.
- Record-layer health: CRC fail count and seq-gap count aligned to ECC trend.
- Reduce write amplification (append-only, fewer small updates) and keep higher OP to slow deterioration and stabilize QoS.
Map: H2-7
8 Switching to pSLC stabilizes QoS but capacity is too small—how can log format reduce write amplification?
WA is driven by tiny random updates; log-structured large chunks and fewer metadata touches often recover both endurance and stability.
- Chunk size and commit cadence (how often metadata is updated) vs p99.9 latency spikes.
- Throughput drift over time at the same OP; watch for spike density increases as the device fills.
- Increase commit unit (bigger chunks), append-only writes, and batch metadata so the media sees fewer small random writes.
Map: H2-6 / H2-7
9 Pre-trigger replay is always missing a few frames—timestamp/sequence issue or ring overwrite policy?
If seq numbers are continuous but frames are absent, overwrite policy or window sizing is the culprit; true timestamp/seq issues show gaps.
- Sequence continuity: seq-gap around the pre-trigger window boundary.
- Ring overwrite evidence: overwrite watermark hits and the time spent near high-water during the pre-trigger period.
- Increase pre-trigger window headroom (or reduce burst intensity) and freeze intake earlier so overwrite cannot erase evidence frames.
Map: H2-3 / H2-7
10 After long recording, replay shows blocky artifacts—more likely CRC failure or uncorrectable ECC?
CRC failures indicate record-layer integrity/atomicity issues; uncorrectable ECC points to media-level data loss beyond recovery.
- CRC fail count aligned to the artifact location (by seq index / time offset).
- ECC uncorrectable count aligned to the same region; corrected-only events without CRC failures or seq gaps usually produce no visible artifacts.
- If CRC dominates: tighten atomic commit ordering; if uncorrectable appears: reduce WA, raise OP, and isolate weak regions early.
Map: H2-7
11 Same workload, but endurance varies wildly across devices—over-provisioning or logging strategy is wrong?
Both matter: low OP amplifies GC/WA, while small random writes multiply media writes; stability and endurance usually improve together when WA drops.
- OP/free-space policy and whether latency spikes grow as the device fills.
- Latency percentiles over time (p99/p99.9) under the same append/commit scheme.
- Standardize OP and switch to append-only chunked logging with fewer metadata updates to reduce WA across all units.
Map: H2-6 / H2-9
12 On-site, how can we quickly prove it’s a storage problem (not ISP / interface)?
If drops correlate with buffer watermarks and storage p99.9 spikes (or PF-commit behavior is repeatable), the evidence points to storage-side QoS/atomicity.
- Correlation: drop timestamps vs buffer watermark hits and p99.9 write-latency spikes on the same timeline.
- Repeatable PF behavior: last committed seq after controlled power cuts at different buffer fill levels.
- Change only storage knobs (OP, append chunk size, freeze→flush→commit ordering) and confirm symptom moves accordingly.
Map: H2-11