
Frame Grabber (PCIe, CoaXPress, GigE Vision)


A frame grabber’s job is to turn high-speed camera data into provable, loss-free host/GPU frames: capture, buffer, timestamp, DMA, and sync/trigger termination. When drops or drift happen, the fix starts with evidence—stage counters, two key waveforms, and a minimal log bundle that pinpoints whether the root cause is link margin, buffering, DMA service, timing, or thermal.

H2-1. What a Frame Grabber Owns in the Vision Pipeline (and what it doesn’t)

Definition that matches the engineering boundary

A frame grabber is the hardware+firmware boundary that converts high-speed camera links (e.g., CoaXPress or GigE Vision) into host-consumable frames with provable integrity: link health accounting, packet/frame assembly, buffering, timestamps, deterministic trigger/genlock termination, and PCIe DMA into CPU or GPU memory.

The differentiator is not “capture,” but accountability: every drop, corruption, or timing skew must be attributable to a stage using counters + logs.

Owns vs Interfaces vs Not owned (scope lock)

Owns — must be measurable & attributable
  • Link Rx health: CRC/BER, lock/retrain, lane/deskew status
  • Packet/frame correctness: sequence continuity, frame CRC, reorder windows
  • Buffer headroom: FIFO/DDR watermarks, overflow/drop stage IDs
  • Timestamp provenance: local timebase, offset/drift counters, resync events
  • DMA integrity: ring depth, completions, timeouts, backpressure behavior
  • Trigger/Genlock termination: edge capture, delay programming, jitter budget
  • Observability: counters, snapshots, flight-recorder logs, version traceability
Interfaces — what it must expose cleanly
  • Camera link ports: CXP lanes / Ethernet MAC / GVSP streams
  • Discrete I/O: Trigger In/Out, Encoder In, Genlock/Ref In
  • Host link: PCIe GenX, DMA queues, MSI-X interrupts / polling modes
  • Software API: stream config, counters, timestamp packets, log export
  • Diagnostics: link margin view, buffer watermark trace, DMA stall reason
Not owned — mention once, then link out
  • Sensor pixel / ISP algorithms (demosaic, denoise, HDR fusion, etc.)
  • Compression/codec internals (H.26x/JPEG engine details)
  • Lighting current drivers & strobe power stages
  • Full system timing hub architecture (beyond signals & proof points)
  • Camera PoE / isolated power tree design

Scope discipline rule: when a topic stops being provable by grabber counters/logs, it belongs to another page.

Five failure points the grabber must make non-mysterious

Failure point (symptom) | First evidence to collect | What it proves (responsibility)
• Link margin collapse (CRC spikes / re-lock events)
  Evidence: CRC/BER window · CDR lock/retrain · lane/deskew. Correlate with cable length, EMI events, and temperature.
  Proves a PHY/Rx problem (not “host software”). If CRC rises with temperature, it often indicates margin shrink or retimer/SerDes drift.
• Reorder/resend overflow (GigE: bursts cause disorder)
  Evidence: sequence gaps · resend requested/received · reorder overflow. Track inter-packet gap variance; note switch microbursts.
  Proves network loss + recovery pressure. If resend storms precede drops, the root is congestion or reorder window sizing.
• Buffer headroom exhausted (drops at watermark)
  Evidence: FIFO/DDR watermark · overflow counter · drop stage ID. Record arrival rate vs DMA drain rate during faults.
  Proves the failure is inside the grabber pipeline (burst absorption, arbitration, or drain capacity), not the camera.
• DMA starvation / timeout (ring stuck)
  Evidence: DMA timeout · completion queue depth · IRQ rate · IOMMU faults. Capture a “freeze snapshot” of ring indices and the last descriptor.
  Proves a host interface contract issue: mapping, queue sizing, interrupt policy, or backpressure. It separates “PCIe/DMA” from “link errors.”
• Sync skew / drift (multi-camera misalignment)
  Evidence: timestamp drift · offset/resync · genlock loss · skew histogram. Tag every frame with provenance: local vs PTP vs ref-locked.
  Proves whether the culprit is timebase discipline or an upstream sync source. The grabber must expose the timebase state.
Figure F1. System boundary map + evidence taps (where proof is collected).

H2-2. Interfaces & Link Behaviors (CoaXPress vs GigE Vision) — What the Grabber Must Guarantee

CoaXPress “capture contract” (serial over coax, margin-driven)

CoaXPress capture is governed by PHY margin and clock recovery stability. The grabber must hold a stable CDR lock, maintain lane alignment, and provide a low-noise path from recovered data to assembled frames. When failures occur, they usually present as CRC bursts, re-lock events, or deskew faults that correlate with cable length, EMI, or temperature drift.

Minimum proof set (CXP):
CDR lock/retrain count · CRC/line errors vs time/temp · lane/deskew error counter · link training state snapshots
Interpretation rule: if CRC rises before any buffer/DMA alarm, the fault is link margin, not host software.

GigE Vision “capture contract” (Ethernet/UDP, loss-and-recovery-driven)

GigE Vision capture is inherently best-effort: packet loss, reordering, and congestion are normal stressors. The grabber must implement robust sequence tracking, reorder buffering, and resend accounting so that transient network events do not silently corrupt frames. Under load, failure commonly appears as resend storms, reorder window overflow, or latency spikes from switch microbursts and host scheduling contention.

Minimum proof set (GigE):
GVSP sequence gaps · resend requested/received · reorder overflow drops · inter-packet gap variance · host-side drop counters
Interpretation rule: if resend counters spike before buffer overflow, the root is network congestion or window sizing, not camera timing.

Common metrics checklist (works for both links) — the grabber’s “health certificate”

Metric class | What to log continuously | Why it matters (what it proves)
• Link health: CRC/BER · lock/retrain · lane/deskew · training state. Separates margin problems from higher-layer symptoms; if link health is clean, drops must be downstream.
• Flow integrity: sequence gaps · resend stats · reorder occupancy · frame CRC. Proves whether corruption/drops are caused by loss/recovery pressure or by internal assembly logic.
• Buffer headroom: watermarks (high/avg) · overflow count · drop stage ID. Converts “dropped frames” into a specific stage: arrival bursts vs drain capacity vs arbitration.
• DMA health: completion rate · timeout count · ring depth · IRQ/poll rate. Proves host interface stability; a clean link + clean buffers + DMA timeouts implies host queue/mapping policy.
• Sync proof: timestamp provenance · offset/drift · skew histogram · ref-loss events. Proves whether alignment errors come from timebase discipline or upstream sync sources.
• Thermal correlation: board temp · SerDes temp · CRC vs temp · drops vs temp. Explains “works cold, fails hot”: temperature-linked CRC suggests margin shrink; temperature-linked DMA faults suggest throttling or instability.

Practical rule: keep these metrics timestamped and persistent. Without them, a field report becomes opinion; with them, it becomes a reproducible bug.
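The practical rule above can be sketched as a minimal flight-recorder logger. This is a generic illustration, not a vendor API: `read_counters` is a hypothetical stand-in for whatever counter-export call your grabber SDK provides, and the field names are assumptions.

```python
import json
import time

def read_counters():
    # Hypothetical stand-in for the vendor SDK's counter query.
    return {"crc_errors": 0, "seq_gaps": 0, "ddr_watermark_pct": 37.5,
            "dma_timeouts": 0, "ts_drift_ppm": 0.8, "board_temp_c": 52.0}

def log_health(path, cadence_s=1.0, duration_s=60.0):
    """Append one timestamped JSON line per cadence tick (persistent evidence)."""
    t_end = time.monotonic() + duration_s
    with open(path, "a") as f:
        while time.monotonic() < t_end:
            snap = {"t": time.time(), **read_counters()}
            f.write(json.dumps(snap) + "\n")
            f.flush()  # keep the trace on disk even if the run dies mid-fault
            time.sleep(cadence_s)
```

Because every line is timestamped and self-describing, two runs can be diffed directly, which is exactly what turns a field report into a reproducible bug.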

Figure F2. Side-by-side link behaviors → unified capture abstraction → buffer/timestamp/DMA.

H2-3. Rx Front-End: SerDes/CDR/Retimer, Link Margin, and Error Accounting

Physical-layer truth: prove margin first, or everything above becomes noise

The Rx front-end (SerDes + CDR + retimer/equalization) defines whether the link is fundamentally trustworthy. When margin is weak, higher-layer tuning (buffers, DMA, software) can only mask symptoms; it cannot prevent CRC bursts, re-lock events, and intermittent corruption that follows temperature, cable stress, or EMI coupling.

Rule of thumb: If CRC/BER and lock stability are not clean, do not diagnose “dropped frames” at the buffer/DMA layer yet.

Card 1 — What to measure (minimum evidence pack, vendor-neutral)

A) Lock / training events (is the link stable?)
CDR lock/unlock events · retrain count · last retrain reason · training state snapshot
  • Lock/unlock proves clock recovery stability under real noise.
  • Retrain reason separates “margin collapse” from “lane alignment” style faults.
B) Error statistics (is it running “dirty”?)
CRC error rate · BER estimate window · error burst length · frame integrity flags
  • CRC rate vs time reveals bursty interference and thermal drift patterns.
  • BER windows expose short “micro-failures” that averages can hide.
C) Lane/deskew alignment (multi-lane coherence)
deskew errors · lane active map · lane resync events · alignment status
  • Deskew errors point to differential lane delay, connector mismatch, or retimer behavior under heat.
D) Environment correlation (why does it fail “only sometimes”?)
board/SerDes temperature · CRC vs temp · cable/connector record · EMI event markers
  • If CRC rises with temperature, suspect margin shrink or retimer/SerDes drift.
  • If failures align with machine events, suspect coupled EMI, not random software.

Practical logging tip: record counters with timestamps at a fixed cadence, and also capture a short “burst log” around a fault event (pre/post window).
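The pre/post “burst log” in the tip above can be sketched as a small ring-buffer flight recorder. This is an assumed design for illustration, not a vendor feature: keep the last `pre` samples in a ring, and when a fault marker fires, freeze them and record `post` more samples around the event.

```python
from collections import deque

class BurstLog:
    """Ring-buffer flight recorder: preserves a pre/post window around a fault."""
    def __init__(self, pre=50, post=50):
        self.ring = deque(maxlen=pre)  # rolling pre-event window
        self.post = post
        self.capture = None            # becomes a list once a fault fires
        self.remaining = 0

    def sample(self, counters):
        # While a capture is active, extend it with post-event samples.
        if self.capture is not None and self.remaining > 0:
            self.capture.append(counters)
            self.remaining -= 1
        self.ring.append(counters)

    def fault(self):
        """Snapshot the pre-window and start collecting the post-window."""
        self.capture = list(self.ring)
        self.remaining = self.post

    def done(self):
        return self.capture is not None and self.remaining == 0
```

Call `sample()` at the fixed logging cadence and `fault()` from whatever detects the event (CRC burst, overflow, timeout); `capture` then holds the contiguous pre+post evidence window.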

Card 2 — Symptom → likely PHY cause (with first action)

Symptom pattern | First evidence to check | Most likely PHY cause | First action (generic)
• CRC bursts after warm-up (works cold, fails hot)
  Evidence: CRC vs temp · retrain events · lane status snapshot. Likely cause: thermal margin shrink (connector/contact, retimer drift, EQ sensitivity).
  First action: improve airflow/heatsinking; verify connectors; re-evaluate EQ strength (avoid over/under-equalization).
• Frequent retrain at start (unstable from power-up)
  Evidence: training state dwell · retrain reason · lock/unlock. Likely cause: CDR lock instability or insufficient initial margin (cable, termination, connector).
  First action: try a short known-good cable; reseat/replace connectors; reduce sources of coupled noise during start.
• Intermittent corruption without drops (bad frames pass through)
  Evidence: frame CRC flags · CRC rate · error burst length. Likely cause: running “dirty” at the PHY; error detection works but integrity gating is loose upstream.
  First action: tighten the integrity policy (drop/mark bad frames); investigate margin and EMI coupling.
• Multi-lane only: occasional artifacts (single-lane OK)
  Evidence: deskew errors · lane resync · lane map. Likely cause: lane-to-lane skew drift (length mismatch, connector variance, retimer lane behavior).
  First action: normalize lane paths; verify connectors; confirm deskew tolerance across temperature.
• Failures correlate with machine events (motors/relays)
  Evidence: EMI markers · CRC bursts · lock events. Likely cause: coupled EMI causing short margin collapse and CDR disturbance.
  First action: improve shielding/cable routing; add separation; validate margin under the same EMI profile.
Two-signal “first check”: (1) CRC/lock timeline, (2) temperature or event markers. If they correlate, root cause is rarely higher-layer software.
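The correlation half of the two-signal check can be automated over the logged timelines. A minimal sketch using Pearson correlation; the 0.7 threshold is illustrative, not a standard value.

```python
# Correlate a CRC-error timeline with temperature (or any event marker series).
# High positive correlation points at margin, not higher-layer software.

def pearson(xs, ys):
    """Pearson correlation coefficient over two aligned sample series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs) ** 0.5
    vy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (vx * vy) if vx and vy else 0.0

def margin_suspect(crc_per_s, temp_c, threshold=0.7):
    """True when CRC errors track temperature closely enough to implicate margin."""
    return pearson(crc_per_s, temp_c) >= threshold
```

Feed it the fixed-cadence counter log: if `margin_suspect` fires, fix margin first before touching buffers or DMA.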
Figure F3. PHY state machine (Lock → Train → Run) with evidence taps feeding counters and snapshots.

H2-4. Buffering Architecture: Line/Frame Buffers, DDR Bandwidth, and Worst-Case Bursts

Buffering is an inequality: burst absorption + worst-case service time

Buffering is not “add more memory and hope.” A frame grabber must absorb burst arrivals (microbursts, resend storms, multi-camera alignment, trigger bursts) while the host/DMA side experiences finite and sometimes delayed service. Drops become unavoidable when the worst-case incoming burst minus worst-case drain exceeds available buffering headroom.

Goal: Keep watermarks comfortably below critical levels in the worst observed burst, and log exactly where overflow would occur.

Card 1 — A sizing recipe (variables, not long math)

Define the five rates/windows
  • Rin: average ingest rate (bytes/s or pixels/s into the grabber)
  • Bin: peak burst rate (microburst / resend storm / aligned capture)
  • Tburst: burst duration (how long peak persists)
  • Rout: effective drain rate (DMA to host/GPU under real load)
  • Tservice: worst service gap (host scheduling/queue stalls)
Capacity rule (written, not formula-heavy)
  • During a burst window, buffer must hold: incoming burst volume minus what can be drained in the same window.
  • Add headroom so the high watermark stays below critical across temperature and worst-case scenes.
  • Use per-stage watermarks (FIFO + DDR) to locate the real choke point.
arrival burst · drain rate · watermark high · overflow count
Burst sources to plan for: GigE resend storms · switch microbursts · CXP multi-link alignment · trigger burst capture · multi-camera interleaving
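The capacity rule above can be written as a back-of-envelope calculation using the five variables just defined. The numbers and the 1.5× headroom factor are illustrative, not a specification.

```python
# Worst-case buffering requirement from the inequality in this section:
# backlog = burst volume beyond drain capacity, plus arrivals during a
# worst-case service stall, times a headroom factor.

def required_buffer_bytes(b_in, t_burst, r_out, t_service, headroom=1.5):
    """b_in: peak burst rate (B/s), t_burst: burst duration (s),
    r_out: effective drain rate (B/s), t_service: worst service gap (s)."""
    burst_backlog = max(0.0, (b_in - r_out) * t_burst)  # burst exceeds drain
    stall_backlog = b_in * t_service                    # drain fully stalled
    return headroom * (burst_backlog + stall_backlog)

# Example: a 10 GB/s burst for 2 ms against a 6 GB/s drain, 1 ms worst stall.
need = required_buffer_bytes(b_in=10e9, t_burst=2e-3, r_out=6e9, t_service=1e-3)
# ≈ 2.7e7 bytes (27 MB) with the 1.5x headroom factor
```

Run it with measured values of Bin/Tburst/Rout/Tservice from the watermark and trigger logs; if the result exceeds installed FIFO+DDR at the choke point, drops are not a mystery.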

Card 2 — Buffer overflow signatures (what the counters “look like”)

Signature | What rises first (evidence order) | What it implies | First corrective direction
• Watermark hits high, then drops: watermark high → overflow count → drop stage ID. Burst absorption is insufficient (buffer too small or burst too large). Increase buffering at the true choke point; reduce burstiness upstream.
• DMA slows first, watermark climbs later: completion rate down → queue depth up → watermark rising. Drain capacity is constrained (host service gaps or queue policy). Treat as a drain problem; confirm worst-case service gaps before resizing buffers.
• Resend spikes, reorder overflows (GigE): resend requested → reorder occupancy → reorder overflow. Congestion/loss creates recovery pressure beyond the reorder window. Improve network conditions or increase reorder capacity; log loss markers.
• Aligned trigger causes instant peak: trigger marker → FIFO watermark → DDR watermark. Synchronous capture creates short, very high peaks. Add fast FIFO headroom; stagger capture where allowed; validate burst duration.
• DDR bandwidth wall (multi-camera + readback): DDR busy → bank conflict markers → steep watermark slope. Arbitration/bank conflicts reduce effective bandwidth. Rebalance read/write scheduling; simplify access patterns; isolate streams.

Logging requirement: always record (1) watermark timeline, (2) drop stage ID, and (3) a timestamp marker for trigger/resend events.

Figure F4. Arrival bursts → FIFO/DDR → DMA drain, with watermarks and drop points.

H2-5. PCIe & DMA-to-Host: Descriptor Rings, Interrupt Strategy, Zero-Copy

Frames-to-host is a pipeline, not a black box

A frame grabber does not “send frames to the PC.” It runs a measurable pipeline: host buffers are allocated and mapped, descriptors are queued, the device performs PCIe writes, completions are generated, and the application consumes frames. Reliability at scale comes from keeping descriptor supply stable, preventing completion backlog, and choosing an event strategy that controls tail latency without wasting CPU.

Minimum evidence pack: ring occupancy · completion backlog · interrupt/poll rate · PCIe throughput · drop counters

Card 1 — DMA pipeline in 6 steps (each step has a proof signal)

Step | What happens | What can break | Proof signal (what to log)
1. Host allocates receive buffers (pool). Pool exhaustion → frames arrive with nowhere to land. Log: pool depth · alloc fail · time-to-refill.
2. Pin/map memory for DMA (IOMMU mapping where used). Mapping churn, faults, or slow setup → intermittent stalls. Log: map errors · fault markers · setup time.
3. Build descriptors (scatter-gather list) into a descriptor ring. Ring starvation → the DMA engine has nothing to do. Log: desc underrun · ring occupancy · head/tail gap.
4. Doorbell notifies the device that new descriptors exist. Doorbell/completion mismatch → bursty progress and jitter. Log: doorbell rate · completion rate · cadence drift.
5. DMA writes frames into host memory over PCIe. PCIe contention/backpressure → throughput drop, stalls. Log: PCIe bytes/s · DMA busy · DMA errors.
6. Completion events + app consume (event-driven or polling). Completion backlog → tail-latency spikes and drops upstream. Log: CQ depth · interrupts/sec · P99 latency.
Interpretation shortcut: desc underrun usually means a producer problem (host not feeding), while CQ backlog often means a consumer problem (events or app not draining).
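The interpretation shortcut can be expressed as a small ordered classifier over the ring and completion-queue counters. Counter names and the 80% backlog threshold are assumptions for illustration, not a driver contract.

```python
# Producer vs consumer vs device: classify a DMA stall from ring/CQ state.

def classify_dma_stall(ring_occupancy, ring_depth, cq_depth, cq_limit):
    if ring_occupancy == 0:
        # Nothing queued for the engine: the host stopped feeding descriptors.
        return "producer: descriptor ring starved (host not feeding)"
    if cq_depth >= 0.8 * cq_limit:
        # Completions pile up: events or the application are not draining.
        return "consumer: completion backlog (events/app not draining)"
    if ring_occupancy >= ring_depth:
        # Ring full but not advancing: look at DMA errors / PCIe backpressure.
        return "device: ring full but not advancing (check DMA errors/PCIe)"
    return "healthy"
```

The check order matters: an empty ring is diagnosed before a full CQ, mirroring the “desc underrun → producer, CQ backlog → consumer” rule in the text.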

Card 2 — Latency vs CPU tradeoffs (interrupt, polling, batching)

Strategy | What improves | Evidence to confirm | Failure signature | First tuning direction
• Interrupt-driven (MSI-X style): lower average latency at moderate rates. Evidence: interrupts/sec · CQ depth · P95/P99. Failure: interrupt storm → CPU high → completion jitter. Tuning: reduce the event rate via coalescing or limited batching.
• Polling (busy/periodic): stable tail latency (if the CPU budget exists). Evidence: CPU% · CQ depth · cadence. Failure: CPU burn without throughput gain. Tuning: use bounded polling windows or a hybrid mode.
• Batching (process N completions): lower CPU overhead per frame. Evidence: batch size · interrupts/sec · P99. Failure: P99 latency grows with batch size. Tuning: cap the batch size; target P99 rather than max throughput.
• Hybrid (interrupt + short poll): a good compromise, with CPU controlled and the tail improved. Evidence: CQ depth · interrupts/sec · P99. Failure: mode thrash under bursty load. Tuning: add hysteresis on mode switching; log transitions.
Zero-copy patterns (conceptual, within grabber scope): pinned buffers · hugepages · stable IOMMU mapping · TLB-pressure risk if too much memory stays pinned

Pitfall signatures: if throughput is acceptable but P99 latency spikes with high CPU activity, suspect copy/cache pressure or event strategy instability before resizing buffers.

Figure F5. DMA ring pipeline: host alloc → map → descriptors → doorbell → DMA write → completion → app consume.

H2-6. GPU / Accelerator Ingest (Optional Path): GPUDirect, RDMA Concepts, When It Helps

Decision scope: only pursue direct-to-GPU when evidence demands it

Direct-to-GPU ingest matters when the “default path” is measurably limited by CPU copy work, cache pressure, or PCIe saturation. This chapter stays within frame-grabber scope: identify bottleneck signatures, choose a path (host staging vs direct), and prove value with an A/B validation plan.

Proof-first approach: compare end-to-end latency percentiles, CPU utilization, PCIe throughput, and drop/backpressure events under identical scenes and load.

Card 1 — Decision checklist: “Do you need GPU-direct?”

Strong reasons to consider a direct path
  • CPU memcpy dominates: CPU utilization stays high even when capture is stable.
  • Tail latency spikes: P99 grows with bursty copy work or event cadence jitter.
  • PCIe is near saturation: throughput is close to the platform’s practical ceiling.
  • Multi-stream pressure: per-camera queues show HOL symptoms during heavy inference.
P50/P95/P99 · CPU% · PCIe bytes/s · CQ backlog
Reasons to postpone GPU-direct
  • Capture drops are already explained by PHY margin (H2-3) or buffering inequality (H2-4).
  • Completion backlog is caused by event strategy instability (H2-5) rather than copy work.
  • Your workload is not latency sensitive; staging overhead is acceptable.
Fix the root cause first, then optimize the path.
Conceptual approaches (platform-dependent): host staging buffers · reduced-copy pipeline · direct DMA to GPU memory · RDMA-like flow

Card 2 — Validation plan: A/B metrics that prove benefit (no guesswork)

Item | Path A (Host staging) | Path B (Direct-to-GPU / reduced-copy)
• Pipeline: A: DMA → Host RAM → memcpy → GPU. B: DMA → GPU (or minimal staging).
• Latency: A: measure end-to-end (camera → inference input) P50/P95/P99. B: same measurement, same scenes, same load; compare percentile shifts.
• CPU load: A: CPU% + copy-pressure signatures (spikes, cadence jitter). B: expect reduced CPU copy load; verify no new completion backlog.
• PCIe: A: throughput average + peak; confirm headroom in burst windows. B: confirm throughput is not capped by a different choke point.
• Reliability: A: dropped frames / backpressure events / queue overflow counters. B: must not increase drops; investigate any new failure signatures.
• Conclusion: if P99 and CPU stay high on A, staging is the bottleneck; adopt B only if P99 improves and drops do not worsen.
A/B test hygiene: lock camera settings and scene complexity; keep the same number of streams and the same inference workload; log counters with timestamps for both runs.
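The adoption rule in the table can be reduced to a short verdict function over the two runs' latency samples and drop counts. The nearest-rank percentile and the verdict strings are illustrative choices, not part of any SDK.

```python
# A/B verdict: adopt the direct path only if P99 improves AND drops do not worsen.

def percentile(samples, p):
    """Nearest-rank percentile over a non-empty sample list."""
    s = sorted(samples)
    k = max(0, min(len(s) - 1, int(round(p / 100.0 * (len(s) - 1)))))
    return s[k]

def ab_verdict(lat_a_ms, lat_b_ms, drops_a, drops_b):
    p99_a, p99_b = percentile(lat_a_ms, 99), percentile(lat_b_ms, 99)
    if p99_b < p99_a and drops_b <= drops_a:
        return f"adopt B (P99 {p99_a:.2f} -> {p99_b:.2f} ms, drops ok)"
    return "keep A (no proven benefit or reliability regressed)"
```

Feed it the timestamped latency logs from two runs captured under identical scenes, stream counts, and inference load, per the hygiene rule above.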
Figure F6. Compare two ingest pipelines: (A) host staging + memcpy vs (B) direct-to-GPU; highlight CPU copy pressure and tail latency.

H2-7. Trigger / Encoder / Strobe I/O: Determinism, Debounce, and Latency Budget

Why field failures often start at the trigger, not the bandwidth

Many “dropped frame” incidents are actually trigger integrity issues: noisy edges, double-triggering, encoder miscounts, or strobe timing drift. A frame grabber owns the boundary signals (Trigger In/Out, Encoder In, Status) and must make them deterministic and provable by conditioning inputs, latching timestamps at a defined point, applying programmable delay/pulse control, and logging reject/overflow events.

First evidence pack: scope: Trigger In · scope: Strobe Out · edge/reject counters · timestamp latch · encoder invalid transitions

Card 1 — Signals owned by the grabber (and what “determinism” means)

Signal | Grabber responsibility | What can go wrong | Proof signal (log)
• Trigger In: threshold/edge conditioning (concept), glitch reject / debounce, timestamp latch at a defined edge point. Risks: double-triggering, missed triggers on slow/noisy edges, false triggers from ground bounce. Log: edge count · reject count · TS latch marker.
• Encoder In (quadrature): capture A/B edges, decode direction/count, enforce a maximum-rate boundary, record invalid transitions. Risks: miscounts at high frequency, phase jitter → invalid transitions, direction flips. Log: invalid transitions · overflow/overrate · count delta.
• Strobe Out (control-level): programmable delay, pulse-width control, timing repeatability; the output is a timing-control boundary (not a power driver). Risks: delay drift, width error at short pulses, load-induced edge deformation (beyond the boundary). Log: delay readback · width config · measured Δt.
• Status I/O: expose “armed/busy/drop” style state so timing problems are observable. Risk: silent failure (the system looks fine but was never armed, or is backpressured). Log: armed/busy · drop stage · state transitions.
Latency budget (measurable segments): Trigger edge → TS latch → delay → Strobe out → (camera exposure) → Frame arrival → (DMA completion). The grabber must prove its own segments with scope + counters.

Card 2 — “Two measurements first” (fast discriminator)

Measurement A (Scope)
  • Probe Trigger In and Strobe Out on the same timebase.
  • Measure Δt distribution: edge-to-strobe delay and pulse width repeatability.
  • Look for: double edges, ringing, slow threshold crossings, jitter growth with temperature/load.
Δt mean · Δt jitter · pulse width error
Measurement B (Log)
  • Record timestamp latch values for each accepted trigger.
  • Track reject counters and invalid transition counters (encoder).
  • Log state: armed/busy and any drop/backpressure markers.
TS counter · reject count · invalid transitions
Discriminator: if the scope shows clean timing but the logs show drift or a growing backlog, suspect the timestamp domain / scheduling (H2-8/H2-5). If the scope shows timing jitter or double edges, fix trigger conditioning/debounce first (H2-7).
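The post-processing for Measurement A can be sketched in a few lines: pair each trigger edge with its strobe edge and summarize Δt. The timestamps are assumed to come from a scope export or the grabber's TS log; units here are microseconds.

```python
import statistics

def delta_t_stats(trigger_ts_us, strobe_ts_us):
    """Pair trigger/strobe edges and report delay mean, jitter, and peak-to-peak."""
    deltas = [s - t for t, s in zip(trigger_ts_us, strobe_ts_us)]
    return {
        "dt_mean_us": statistics.mean(deltas),      # programmed delay, as measured
        "dt_jitter_us": statistics.pstdev(deltas),  # spread = timing jitter
        "dt_p2p_us": max(deltas) - min(deltas),     # peak-to-peak excursion
    }
```

A growing `dt_jitter_us` with temperature or load reproduces the scope observation numerically, so it can be trended over long runs alongside the counter logs.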
Figure F7. Trigger integrity + timestamp latch + programmable delay → strobe out. Shows debounce/reject and jitter measurement points.

H2-8. Genlock / PTP / Timestamping: Aligning Multi-Camera Frames Without Guessing

Alignment is provable when timestamp provenance is explicit

Multi-camera alignment should not rely on “looks synchronized.” It becomes provable when every frame carries a timestamp with known provenance, clock-domain conversions are tracked with offset/drift counters, and resync/holdover events are logged. Genlock and PTP are tools to discipline the grabber timebase so that skew and drift can be measured as distributions.

Alignment evidence (must exist): skew histogram · drift/min · offset · resync events · holdover state

Card 1 — Timestamp provenance ladder (most trustworthy → least)

Timestamp domain | Why it is useful | Main risk | What must be logged
• Hardware edge / link-adjacent: closest to the real event; minimal OS influence. Risk: still depends on timebase discipline and calibration. Log: TS source ID · offset · drift.
• FPGA local timebase: stable monotonic counter; can tag every frame consistently. Risk: drifts without a reference; conversion to other domains must be tracked. Log: drift/min · resync · holdover.
• Host software time: convenient for applications and log aggregation. Risk: scheduling jitter; not reliable for microsecond-level alignment. Log: queue delay · timestamp mapping.
Conversion principle (conceptual): domain-to-domain alignment requires offset and drift. Without both, “timestamps” cannot justify skew histograms or long-run stability.
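The conversion principle can be written as a minimal affine mapping. This is the common offset-plus-rate model stated as an assumption, not any standard's definition; the offset and drift terms are the quantities the resync log must update.

```python
# Domain-to-domain timestamp conversion needs BOTH an offset and a drift term.

def to_host_time(local_ts, offset, drift_ppm):
    """host_ts = local_ts * (1 + drift) + offset, with drift expressed in ppm.
    `offset` and `drift_ppm` are re-estimated at each resync event."""
    return local_ts * (1.0 + drift_ppm * 1e-6) + offset
```

With only an offset, skew grows linearly between resyncs (the drift-rate failure signature in the table below); logging both terms is what lets skew histograms be justified over long runs.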

Card 2 — Pass/fail criteria for multi-camera alignment (measurable)

Metric | How to measure | Failure signature | First suspect
• Skew distribution (P50/P95/P99): histogram of inter-camera TS deltas for matched events/frames. Failure: wide tails or bimodal peaks; some cameras jump in steps. Suspect: resync events, unstable reference, wrong domain mapping.
• Drift rate (per minute): track the TS-delta slope over time (Δskew / minute). Failure: skew slowly grows; alignment degrades predictably. Suspect: undisciplined timebase, holdover too long.
• Offset stability: monitor the offset counter; correlate with skew changes. Failure: offset “steps” coincide with alignment jumps. Suspect: PTP step/resync; reference interruptions.
• Resync frequency: log resync events/hour and their magnitude. Failure: alignment appears fine, then is suddenly wrong for a window. Suspect: clock-domain switching, ref-signal integrity.
• Holdover state: record when the reference is lost and how long holdover lasts. Failure: drift accelerates during holdover. Suspect: ref-in loss, PLL unlock, unstable environment.
Rule of thumb: adopt genlock/PTP only when alignment evidence improves (tighter P99 skew and fewer resync steps) without increasing drop/backpressure events.
Figure F8. Clock tree: Ref-in → jitter-clean PLL → FPGA timebase → timestamp tags → host mapping, with offset/drift/resync/holdover counters.

H2-9. Reliability Under Load: Drop Frames, Reorder, Resend, and Backpressure

Make “why frames drop” diagnosable with minimal tools

Dropped frames are not a single failure mode. Under load, the pipeline can fail at distinct stages: Rx/link errors, reorder/resend pressure (especially GigE), buffer/DDR overflow, DMA starvation, or host scheduling stalls that create backpressure. Diagnosis becomes repeatable when each stage exposes counters and watermarks so the “drop point” is proven before any fix is attempted.

Minimal toolset: golden counters · watermark logs · DMA timeouts · link/lane snapshots

Card 1 — Decision tree (Symptom → check counters in order)

Rule: Always localize the drop stage first. Check in this order: Rx/Link → Reorder/Resend → DDR/Assembler → DMA/PCIe → Host backlog.
Symptom (field language) | Check these counters first | What proves the drop stage | First action (within grabber scope)
• Drops start only at peak throughput (multi-cam / burst triggers)
  Counters: DDR watermark high · assembler drop · DMA timeout. Proof: watermark hits precede drops; drop counters increment at the buffer/DMA stage while Rx CRC stays low.
  Action: increase buffering headroom, reduce burst concurrency (controlled degradation), verify drain-rate stability.
• GigE: stutter/lag, then drops (often with reordering)
  Counters: seq gaps · reorder depth · reorder overflow/timeout · resend requested. Proof: reorder depth grows, then overflow/timeout increments; drops occur without DDR overflow.
  Action: resize the reorder window/timeouts; ensure resend accounting is logged and bounded under loss bursts.
• CXP: corrupted/invalid frames (not “missing,” but unusable)
  Counters: lane status · deskew fail · frame corrupt/CRC. Proof: deskew/corrupt counters rise; frame validity fails at the Rx stage even when buffers are not saturated.
  Action: capture lane snapshots; correlate errors with temperature and retrain events; treat as margin-loss evidence.
• Latency spikes but few drops (tail latency collapses determinism)
  Counters: DMA completion backlog · descriptor starvation · host queue depth. Proof: completion backlog climbs first; host queue depth grows; watermark rises without link errors.
  Action: stabilize DMA servicing (ring depth, batching strategy, conceptually); verify backlog returns to baseline.
• Only one camera drops (same host, same load)
  Counters: per-port CRC/seq gaps · per-port reorder overflow · per-port watermark. Proof: drops correlate to a single port’s counters, not global backpressure.
  Action: use per-port accounting; isolate the stage that is port-specific (Rx vs reorder vs assemble).
Key distinction: GigE often fails as “loss → resend → reorder pressure,” while CXP often fails as “lane/deskew → invalid frame.” Both should be proven by stage counters, not guessed.
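The check order in the rule above can be expressed as a small lookup. This is a minimal sketch, assuming hypothetical counter names (crc_errors, reorder_overflow, ddr_watermark_hi, and so on); map them to whatever counters your grabber's API actually exposes.

```python
# Sketch: localize the drop stage from per-stage counter deltas, checking
# stages in pipeline order (Rx/link -> reorder -> DDR/assembler -> DMA -> host).
# Counter names are hypothetical placeholders, not a real grabber API.

STAGE_CHECKS = [
    ("rx_link", ["crc_errors", "deskew_fails", "seq_gaps"]),
    ("reorder", ["reorder_overflow", "reorder_timeout"]),
    ("ddr",     ["ddr_watermark_hi", "assembler_drop"]),
    ("dma",     ["dma_timeout", "cq_backlog"]),
    ("host",    ["host_queue_backlog"]),
]

def localize_drop_stage(deltas: dict) -> str:
    """Return the first (lowest) stage whose counters moved in the window."""
    for stage, counters in STAGE_CHECKS:
        if any(deltas.get(c, 0) > 0 for c in counters):
            return stage
    return "unattributed"  # "no anonymous drops": this outcome is itself a finding

# Example: DDR watermark hit while link stayed clean -> buffering stage.
print(localize_drop_stage({"ddr_watermark_hi": 3, "dma_timeout": 1}))  # ddr
```

Note the deliberate bias: a lower stage that moved always wins, because it can explain symptoms at every stage above it.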

Card 2 — Golden metrics (Top 10 counters to always log)

Each entry: counter/metric, what it means, typical failure signature, first suspect.
  • Rx CRC / corrupt frame: data arrived but failed integrity at the Rx/assemble stage. Signature: corrupt count rises before drops; valid frames become sporadic. First suspect: margin loss / thermal drift.
  • Seq gaps (GigE): missing packets detected in the capture stream. Signature: seq gaps spike → resend increases → reorder depth grows. First suspect: loss bursts / congestion.
  • Resend requested/received: reconstruction pressure under loss. Signature: requested rises faster than received; recovery fails. First suspect: recovery budget exceeded.
  • Reorder depth watermark: peak occupancy of the reorder buffer. Signature: watermark drifts upward over time; bursts trigger overflow. First suspect: window too small / timeouts.
  • Reorder overflow/timeout: reorder cannot complete frames within budget. Signature: drops occur without DDR overflow. First suspect: resend storms / budget.
  • Lane/deskew fails (CXP): multi-lane alignment cannot be maintained. Signature: step-like increase with temperature; retrain events. First suspect: lane margin / ref drift.
  • Assembler incomplete/drop: frame assembly failed due to missing/corrupt inputs or pressure. Signature: assembler drops coincide with reorder timeouts or DDR watermark hits. First suspect: upstream pressure.
  • DDR watermark high: burst absorption is approaching its limits. Signature: watermark high precedes overflow and frame drops. First suspect: burst peak > drain rate.
  • DMA timeout / CQ backlog: the host transfer path cannot complete within its service budget. Signature: latency spikes, then drops, without link errors. First suspect: service starvation.
  • Host backlog time: the downstream consumer is not keeping pace (observable backpressure). Signature: queue grows; drops happen later at the buffer stage. First suspect: consumer stalls.
Logging habit: Always log per-port/per-camera counters. Global totals hide “one bad link” failures.
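To illustrate why per-port accounting matters, here is a minimal sketch: with only a global total, the single bad link below would hide behind two healthy ports. The field names are illustrative, not a real grabber API.

```python
# Sketch: per-port counter accounting. Global totals hide "one bad link";
# logging per port makes the bad port stand out immediately.
# Counter fields are hypothetical placeholders.
from dataclasses import dataclass

@dataclass
class PortCounters:
    crc_errors: int = 0
    seq_gaps: int = 0
    reorder_overflow: int = 0
    ddr_watermark_hi: int = 0

def worst_port(ports: dict) -> str:
    """Pick the port with the highest total error count."""
    return max(ports, key=lambda p: sum(vars(ports[p]).values()))

ports = {
    "cam0": PortCounters(),
    "cam1": PortCounters(crc_errors=41, seq_gaps=12),  # the "one bad link"
    "cam2": PortCounters(crc_errors=1),
}
print(worst_port(ports))  # cam1
```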
Figure F9. Reliability pipeline with red X drop points and counters at each stage (Rx → reorder/resend → DDR → DMA → host).

H2-10. Thermal, Power, and Throttling: Keeping Determinism Across Temperature

Thermal drift turns “margin” into “random drops” unless it is correlated and controlled

Temperature affects link margin and timing repeatability. As devices heat up, error counters often rise gradually before frame drops become visible: CRC/corrupt events, lane/deskew failures, reorder pressure, and jitter proxies can all worsen with temperature. Determinism is preserved when telemetry (on-die + board temperatures) is logged alongside reliability counters, and a controlled degradation strategy prevents hard failure.
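One way to make "controlled degradation" concrete is a small state machine over example states (NORMAL / WARN / THROTTLE / RECOVER). This is a sketch only; the temperature and error-rate thresholds are placeholder assumptions to be tuned per board and link rate.

```python
# Sketch: controlled-degradation state machine driven by temperature and
# error rate. Thresholds (75/85 degC, 1/5/20 errors per minute) are invented
# examples; a real design derives them from measured margin data.

def next_state(state: str, temp_c: float, err_per_min: float) -> str:
    if state == "NORMAL":
        return "WARN" if temp_c > 75 or err_per_min > 5 else "NORMAL"
    if state == "WARN":
        if temp_c > 85 or err_per_min > 20:
            return "THROTTLE"  # reduce fps / burst density deterministically
        return "NORMAL" if temp_c < 70 and err_per_min < 1 else "WARN"
    if state == "THROTTLE":
        return "RECOVER" if temp_c < 75 and err_per_min < 1 else "THROTTLE"
    if state == "RECOVER":
        return "NORMAL" if err_per_min < 1 else "WARN"
    raise ValueError(f"unknown state: {state}")
```

Every transition should be logged with its inputs (temperature, error rate) so recovery is provable, not assumed.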

Correlation must be explicit: Temp · CRC/deskew · drops · watermarks · throttle state

Card 1 — Thermal symptoms (what to look for)

Symptoms that usually indicate thermal-related margin loss
  • Errors ramp with time-to-steady-state: CRC/deskew rises after minutes of load.
  • Alignment/stability degrades: tail latency grows, then drops start.
  • One hot zone dominates: a single port or lane group becomes error-heavy first.
  • Jitter proxies widen: trigger-to-strobe Δt distribution tails get wider at high temperature.
Signatures: CRC ↑ with Temp · deskew ↑ with Temp · drops after warm-up
Major heat sources (grabber-side)
  • PHY/SerDes: margin-sensitive; errors often correlate strongly with temperature.
  • FPGA: timing slack shrink → stability/jitter proxies worsen.
  • DDR: sustained bandwidth and refresh overhead make watermarks rise.
  • PCIe block: completion backlog can increase under heat + load.

Card 2 — Correlation method (log temp + counters; prove causality)

Goal: prove thermal coupling by showing monotonic degradation with temperature and recovery under controlled degradation.
Each step: what to do, what to log, and what “proof” looks like.
  1. Hold a fixed load profile (ports, fps, trigger rate) until thermal steady state. Log: temp (die/board), CRC/deskew, drops. Proof: counters worsen gradually as temperature rises, not as random spikes.
  2. Compute the correlation between temperature and error rate / drop rate on the same time axis. Log: error rate, drops/min, watermark. Proof: errors track temperature; drop onset aligns with a counter threshold.
  3. Apply controlled degradation instead of hard failure (reduce fps / burst concurrency). Log: throttle state, applied fps/rate, errors. Proof: counters recover and drops stop while behavior stays observable and deterministic.
  4. Verify repeatability (warm-up → degrade → recover) across runs. Log: resync events, temp slope, recovery time. Proof: the same temperature band triggers the same failure signature; fixes are stable.
Controlled degradation boundary: reduce load deterministically (rate/fps/burst density) rather than allowing uncontrolled drops or timestamp instability.
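Step 2's correlation can be computed in a few lines. A minimal sketch, assuming per-minute samples of temperature and CRC error rate taken on the same time axis; the data values below are invented for illustration.

```python
# Sketch: prove thermal coupling by correlating a temperature series with an
# error-rate series sampled on the same time axis. A clearly positive
# coefficient plus a monotonic onset supports "thermal margin loss" over
# "random spikes". Sample data is invented.

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

temp_c   = [45, 52, 60, 68, 74, 79, 83]  # warm-up to steady state
crc_rate = [0,  0,  1,  3,  7,  14, 25]  # errors/min, ramping with temp

r = pearson(temp_c, crc_rate)
print(f"r = {r:.2f}")  # strongly positive here: consistent with thermal coupling
```

Correlation alone is not causality; the recovery-under-throttle check in step 3 is what closes the argument.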
Figure F10. Thermal zones (PHY/FPGA/DDR/PCIe), airflow direction, temperature sensors, and a logger correlating temp with CRC/drops and throttle state.

H2-11. Observability & Logging: What to Record So Field Bugs Become Fixable

Why this chapter exists

Field failures become fixable only when every “drop” is attributable to a specific pipeline stage: Rx/link → reorder/resend → buffer/DDR → DMA/PCIe → host backlog. Observability is the design of evidence artifacts (events, counters, snapshots) that prove the drop stage before any remediation is attempted.

Golden rule: no anonymous drops. Evidence artifacts: events · counters · snapshots · version digest

Card 1 — Minimal diagnostic payload (uploadable “evidence bundle”)

Always include these (minimum)
  • Identity: board rev, serial, firmware/bitstream hash, driver version, config digest.
  • Link: per-port link state, retrain count, lane/deskew status (if applicable).
  • Integrity: CRC/corrupt/seq-gap counters + “burst summary” for last 60s.
  • Reorder/Resend: reorder depth watermark, overflow/timeout count, resend requested/received.
  • Buffer: DDR watermark high/high-high hits, assembler drops, overflow events.
  • DMA: timeout count, completion backlog watermark, descriptor starvation events.
  • Sync (if enabled): offset/drift summary + resync/holdover events (no protocol deep dive).
  • Thermal: T-sensors max/avg, temperature slope, temp-at-fault.
Recommended “flight recorder” attachments
  • Pre/Post window: ring-buffer dump around trigger (e.g., 3s pre + 8s post).
  • Counters delta: per-stage counter deltas over the snapshot window.
  • State snapshots: link/lane snapshot, reorder window state, DMA queue depths.
  • Timestamp provenance: which clock domain stamped each record (FPGA/local/host).
bundle.json · timeline.csv · snapshot.bin · counters_delta.csv
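The pre/post ring-buffer idea can be sketched as follows, using a hypothetical FlightRecorder with a 3 s pre-window; a real implementation would also capture the post-window and serialize the snapshot into the bundle files listed above.

```python
# Sketch: always-on flight recorder. Counter/event records stream into a
# bounded ring buffer; at a trigger (e.g., first DMA timeout) the pre-window
# is frozen as the snapshot. Durations and record contents are examples.
from collections import deque

class FlightRecorder:
    def __init__(self, pre_seconds=3.0):
        self.pre_seconds = pre_seconds
        self.ring = deque()  # (timestamp, record) pairs

    def record(self, ts, record):
        self.ring.append((ts, record))
        # Evict anything older than the pre-window.
        while self.ring and ts - self.ring[0][0] > self.pre_seconds:
            self.ring.popleft()

    def snapshot(self):
        """Freeze the pre-window at trigger time."""
        return list(self.ring)

fr = FlightRecorder(pre_seconds=3.0)
for t in range(10):  # 1 Hz counter samples, 10 s of history
    fr.record(float(t), {"ddr_watermark": t})
pre = fr.snapshot()  # only the last ~3 s survive
print(len(pre))
```

Keeping the ring small and always on is the point: the evidence exists before anyone knows a bug will happen.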
Interpretation rule: A drop is “explained” only when a stage counter increments first (or a watermark crosses threshold) and the preceding stages remain healthy in the same time window.
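The interpretation rule above can be encoded directly. A sketch, assuming hypothetical stage names and timestamped counter-increment records.

```python
# Sketch of the interpretation rule: attribute a drop to the first stage whose
# counter moved in the window, and only if no earlier stage also moved.
# Stage order and record shapes are illustrative.

STAGE_ORDER = ["rx_link", "reorder", "ddr", "dma", "host"]

def explain_drop(events):
    """events: list of (timestamp, stage) counter-increment records."""
    if not events:
        return "anonymous drop: reject, request a snapshot window"
    first_ts, first_stage = min(events)  # earliest increment wins
    earlier = STAGE_ORDER[:STAGE_ORDER.index(first_stage)]
    moved_earlier = {s for _, s in events if s in earlier}
    if moved_earlier:
        return f"ambiguous: {sorted(moved_earlier)} also moved"
    return f"drop explained at stage: {first_stage}"

print(explain_drop([(10.2, "ddr"), (10.9, "dma")]))  # explained at ddr
```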

Card 2 — When the user says “random”, what to ask for (in order)

  1. Time scale: seconds vs minutes vs “after warm-up”? (points to burst vs thermal coupling)
  2. Load dependency: only at peak throughput / multi-cam bursts / high trigger density?
  3. First counter that moved in the 30s before the drop: CRC/deskew/seq gaps, reorder overflow, DDR watermark, DMA timeout.
  4. Port specificity: only one camera/port? Provide per-port counters.
  5. Snapshot requirement: flight recorder dump (pre/post) + version/config digest.
  6. Thermal context: temp-at-fault + slope + whether errors rise monotonically with temp.
EEAT artifact: The request is the same every time: minimal bundle + snapshot window. That consistency is what turns “random” into “reproducible evidence”.

MPN examples — Common hardware hooks that make evidence reliable

Examples only (not a recommendation). Choose equivalents based on link rates, environment, and lifetime.

Function Example MPN Why it helps observability Evidence it enables
CoaXPress Rx/Tx (CDR/EQ) Microchip EQCO125X40 family; legacy: EQCO62R20 Exposes link lock / equalization behavior; supports CDR-centric margin evidence retrain/lock events, margin-linked CRC/deskew signatures
GigE controller (IEEE1588 capable) Intel I210 (e.g., I210-IS / I210-AT) Hardware timestamp support (platform dependent); enables provable time tagging timestamped packet/stream evidence; drift/outlier correlation
PTP-capable PHY (example) TI DP83640 Hardware timestamping at PHY; useful when time evidence must be close to the wire offset/drift logs; resync event provenance
SPI NOR for versioned firmware Winbond W25Q128JV (example class) Stable firmware storage with readable IDs/hashes for traceability firmware hash, rollback correlation
EEPROM for board ID/config Microchip 24LC256 (example class) Stores serial/rev/calibration IDs used in evidence bundles identity + config digest reproducibility
High-accuracy temperature sensor TI TMP117 (example class) Correlates error counters with temperature reliably (not “hand-wavy thermal”) temp-at-fault, monotonic error-vs-temp proof
Clock jitter cleaner (example) Silicon Labs Si5341 (example class) Stabilizes local timebase; improves timestamp consistency and skew distributions narrower skew histograms; fewer resync anomalies
Multi-fan controller (example) Microchip EMC2305 (example class) Controls airflow deterministically; logs tach faults for thermal root-cause airflow/fan fault evidence; thermal recovery proofs
Figure F11. “Flight recorder”: triggers → ring-buffer pre/post window → snapshot bundle → persistent export.

H2-12. Validation & Field Debug Playbook: Evidence → Isolate → Fix (No Scope Creep)

Card 1 — Test matrix (minimal coverage of worst-case corners)

Targets below are examples (not absolute standards). Use them to define pass/fail for a specific system.

Each corner: how to stress it, evidence to log (must-have), example target (non-absolute).
  • Bandwidth. Stress: single-port max rate, then multi-port max aggregate, then add burst triggers. Log: DDR watermarks, assembler drops, DMA timeouts/CQ backlog, per-port integrity. Target: no drops at steady state; burst drops must be attributable and bounded.
  • Cable / SI. Stress: short vs typical vs longest deployment cable; re-seat connectors. Log: CRC/corrupt, lane/deskew, retrain events; error-vs-temp overlay. Target: no rising error trend; retrains rare and explainable.
  • Thermal. Stress: cold start → warm-up → thermal steady state; restricted vs nominal airflow. Log: T sensors, CRC/deskew, drops/min, throttle state. Target: controlled degradation prevents uncontrolled drops at high temperature.
  • Sync / multi-cam. Stress: multi-camera capture; measure the skew distribution over time; add resync events. Log: skew histogram summary, drift/offset summary, resync timestamps. Target: skew is a stable distribution; outliers must correlate to logged events.
Pass/fail must be stage-based: if a failure is observed, the evidence bundle must identify which stage first exceeded its budget (errors, reorder pressure, watermark, DMA service, sync integrity, thermal coupling).
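Stage-based pass/fail can be mechanized as a budget check over per-run counter deltas. A sketch with invented per-run budgets; as the note above says, these are examples to be defined per system, not standards.

```python
# Sketch: stage-based pass/fail for the test matrix. A run fails only with an
# attributable violation; budgets below are invented per-run examples.

BUDGETS = {                  # max allowed counter delta per run
    "crc_errors": 0,
    "reorder_overflow": 0,
    "ddr_watermark_hi_hi": 0,
    "dma_timeout": 0,
    "skew_outliers": 2,      # bounded, and must correlate to logged events
}

def judge_run(deltas):
    """Return ("PASS", {}) or ("FAIL", {counter: delta}) naming the stage evidence."""
    violations = {k: v for k, v in deltas.items() if v > BUDGETS.get(k, 0)}
    return ("PASS", {}) if not violations else ("FAIL", violations)

verdict, why = judge_run({"crc_errors": 0, "dma_timeout": 3})
print(verdict, why)
```

The returned violation dict is the start of the evidence bundle: a FAIL without a named counter should be treated as an invalid test run, not a pass.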

Card 2 — Field debug SOP (symptom → evidence → isolate → first fix)

Order matters: start at the lowest layer that can explain everything above it: link → reorder → buffer → DMA → sync → thermal.
  1. Confirm link margin: CRC/corrupt (and lane/deskew if applicable) must be stable.
    Proof: errors rise before any buffer/DMA symptom → treat as margin/cable/eq domain.
  2. Confirm reorder/resend budget (GigE path): reorder depth watermark and overflow/timeout must stay within bounds.
    Proof: seq gaps → resend pressure → reorder overflow increments before DDR watermark.
  3. Confirm buffering headroom: DDR watermark high/high-high hits must not precede drops.
    Proof: watermark crosses threshold first → burst absorption budget exceeded.
  4. Confirm DMA health: DMA timeout and completion backlog must be absent or bounded.
    Proof: CQ backlog grows first while link counters stay clean → service starvation.
  5. Confirm sync integrity: skew/drift outliers must correlate to resync/holdover events.
    Proof: outliers without sync events → investigate timestamp domain mismatch within capture chain.
  6. Correlate with thermal: overlay temperature with error rate and drops/min.
    Proof: monotonic error-vs-temp relationship + recovery under controlled degradation.
  7. Attach evidence: minimal diagnostic payload + flight recorder pre/post window dump.
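For step 7, one way to make evidence bundles comparable across runs is a stable config digest alongside the version identifiers. A minimal sketch; the hashing scheme and field names are assumptions, not a defined bundle format.

```python
# Sketch: version/config identity for the evidence bundle, so SOP results are
# comparable across runs. Hashing a canonicalized config gives a stable
# "config digest"; identity fields are illustrative.
import hashlib
import json

def config_digest(cfg: dict) -> str:
    canon = json.dumps(cfg, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canon.encode()).hexdigest()[:16]

identity = {
    "board_rev": "B2",
    "fw_hash": "3fa9c1d2",          # placeholder firmware/bitstream hash
    "driver": "1.4.7",
    "config_digest": config_digest({"fps": 120, "ports": 4, "mtu": 9000}),
}
print(identity["config_digest"])
```

Sorting keys before hashing means two runs with the same settings always produce the same digest, regardless of how the config dict was built.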

What to change first (grabber-side), with concrete MPN examples

Examples only. The “first change” should match the stage proven by the counters.

Each entry: proven stage, evidence signature, first change (within scope), example MPN(s).
  • CXP link margin. Signature: deskew/lock events and corrupt/CRC bursts rise, often with temperature. First change: treat as margin; check cable/connector seating and equalization/retiming strategy, and capture retrain evidence. MPNs: Microchip EQCO125X40 family; legacy: EQCO62R20.
  • GigE timestamp evidence. Signature: skew outliers without buffer/DMA failures; timebase provenance unclear. First change: make timestamps provable via a hardware timestamp path plus explicit domain tagging in logs. MPNs: Intel I210; TI DP83640.
  • Buffer headroom. Signature: DDR watermark high-high hits precede drops; assembler drops follow bursts. First change: reduce burst concurrency, increase headroom, and log watermarks and drop points per stage. MPNs: (DDR device varies); sensor hook: TI TMP117.
  • DMA service starvation. Signature: CQ backlog grows, DMA timeouts increment, link counters remain clean. First change: increase the service budget (ring depth / batching, conceptually) and prove backlog recovery in logs. MPNs: SPI FW trace: Winbond W25Q128JV; ID EEPROM: Microchip 24LC256.
  • Thermal coupling. Signature: errors rise monotonically with temperature; recovery under throttle. First change: controlled degradation plus deterministic airflow; log tach faults and throttle-state transitions. MPNs: Silicon Labs Si5341 (clock); Microchip EMC2305 (fan).
Stop rule: do not “try fixes” until a stage is proven by counters + snapshot window evidence.
Figure F12. Debug decision tree: symptom → stage counters → isolate → first action (minimal text).


H2-13. FAQs ×12 (evidence-based; no scope creep)

Each answer points to specific counters / waveforms / logs and maps back to H2-1…H2-12. Evidence first, fixes second.

H2-2 / H2-5 / H2-9 Dropped frames only when enabling jumbo frames—MTU, resend, or host DMA starvation?
Jumbo frames usually expose a stage budget issue. First check (1) GigE counters: packet-loss and resend requested/received, plus reorder depth/overflow; and (2) DMA health: completion-backlog watermark or timeouts. If resend spikes first and reorder overflows, it’s a receive-side reassembly budget problem. If link counters stay clean but CQ backlog grows, it’s host service/DMA starvation. First fix: prove which counter moves first, then adjust capture-side buffering/reorder and DMA batching budgets.
Evidence anchor: pkt_loss/resend, reorder_overflow, CQ_backlog/timeout + host arrival-interval jitter trace. MPN (evidence hook): Intel I210 (HW timestamp capable path, platform-dependent).
H2-3 / H2-10 CRC errors rise after 30 minutes—cable margin or thermal drift?
Treat late-onset CRC as “margin changing over time.” First check (1) integrity counters: CRC bursts, retrain/lock-loss events, lane/deskew errors; and (2) temperature telemetry logged at the same cadence. If CRC ramps monotonically with temperature (and retrain events cluster near a threshold), it’s thermal-coupled margin. If CRC changes with cable motion/connector reseat while temperature is stable, it’s mechanical/cable margin. First fix: capture a flight-recorder window around the first CRC burst, then correlate error-vs-temp before changing hardware.
Evidence anchor: CRC_burst, retrain/lock, lane_status + temp-at-fault & slope log. MPN: TI TMP117 (high-accuracy temp evidence sensor, example).
H2-8 / H2-11 Multi-camera frames drift apart over time—PTP offset or local timebase drift?
Make timestamp provenance explicit before blaming algorithms. First check (1) sync logs: offset/drift counters and resync/holdover events, and (2) a skew histogram over time (not a single skew number). If skew jumps coincide with resync/offset steps, the alignment error is sync-path related. If offset looks stable but skew drifts smoothly, suspect local timebase drift or a cross-domain conversion error between “link time,” “FPGA time,” and “host time.” First fix: tag each timestamp with its domain and verify conversion consistency in the evidence bundle.
Evidence anchor: offset/drift, resync_events, skew histogram + timestamp-domain tags. MPN: TI DP83640 (PTP-capable PHY example for HW timestamp evidence).
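The distinction drawn in this answer (resync-aligned skew steps vs smooth timebase drift) can be sketched as a classifier over a skew time series. Units and the 5 µs jump threshold are invented examples.

```python
# Sketch: distinguish a sync-path step (skew jump aligned with a resync event)
# from smooth local-timebase drift. Sample cadence, units (microseconds), and
# thresholds are illustrative assumptions.

def classify_skew(skew_us, resync_times, jump_thresh=5.0):
    """skew_us: per-sample inter-camera skew; resync_times: sample indices."""
    jumps = [i for i in range(1, len(skew_us))
             if abs(skew_us[i] - skew_us[i - 1]) > jump_thresh]
    if jumps and all(any(abs(i - r) <= 1 for r in resync_times) for i in jumps):
        return "sync-path: skew steps align with resync events"
    if not jumps and abs(skew_us[-1] - skew_us[0]) > jump_thresh:
        return "timebase drift: smooth monotonic skew growth"
    return "inconclusive: extend window / check timestamp domains"

# Smooth growth with no jump events: drift signature.
print(classify_skew([0, 0.5, 1.1, 1.6, 2.2, 2.8, 3.5, 4.1], resync_times=[]))
```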
H2-7 / H2-5 Trigger seems “late” sometimes—debounce/conditioning or interrupt batching?
“Late trigger” must be split into input integrity vs service latency. First measure (1) the Trigger-In edge quality at the connector (glitches, bounce, threshold noise) and (2) the grabber’s timestamp-latch time (FPGA counter) versus when the host reports the event. If the edge itself is unstable, conditioning/debounce is the root. If the edge is clean but host-visible timestamps slip in bursts, the cause is service batching/backlog (DMA/interrupt/polling strategy). First fix: keep timestamping at the edge (hardware) and use logs to prove whether latency is pre- or post-latch.
Evidence anchor: scope Trigger-In + FPGA_latch_ts vs host event time + CQ backlog watermark.
H2-4 / H2-5 Works at 1 camera, fails at 4 cameras—DDR arbitration or PCIe throughput?
Multi-camera failures are usually “burst absorption vs drain rate.” First check (1) DDR/buffer watermarks and overflow/drop counters during the failure window, and (2) DMA throughput plus completion queue backlog. If watermarks hit high-high before any DMA timeout, the buffer/arbitration budget is the first limiter (arrival bursts exceed absorption). If watermarks stay safe but CQ backlog grows and DMA timeouts appear, the drain side (PCIe/DMA service) is limiting. First fix: reproduce with a controlled test matrix (same frame rates) and prove which watermark/counter crosses first.
Evidence anchor: DDR_watermark_hi-hi, assembler_drop, CQ_backlog, DMA_timeout + per-camera frame interval trace.
H2-2 / H2-9 Resend storms on GigE—switch buffering or receiver reorder overflow?
Don’t guess; attribute the storm. First check (1) packet-loss + resend requested/received counters, and (2) reorder depth watermark and overflow/timeout. If packet loss and resend requests spike first, then reorder overflows follow, the receiver is reacting to upstream loss bursts (network-side behavior), even if you won’t tune the switch here. If reorder overflows without a clear resend/loss signature, your reorder window/budget is too small for the observed jitter and out-of-order pattern. First fix: log resend burst timing and reorder depth together to prove causal order.
Evidence anchor: pkt_loss, resend_req/rcv + reorder_depth/overflow + burst timeline export.
H2-5 / H2-11 DMA timeouts but link counters are clean—driver mapping/IOMMU issue?
Clean link counters plus DMA timeouts usually indicate a host-side service or mapping failure, not a capture problem. First check (1) DMA timeout events, completion queue backlog, and descriptor starvation counters; and (2) a snapshot containing version/bitstream hash plus a config digest (so the mapping state is comparable across runs). If CQ backlog grows before the timeout, it’s service starvation. If timeouts occur without backlog growth and correlate with map/unmap failures in logs, suspect a mapping path regression (evidence-only here). First fix: capture a flight-recorder dump around the first timeout and compare deltas across versions.
Evidence anchor: DMA_timeout, CQ_backlog, desc_starvation + snapshot bundle (version+config digest).
H2-8 Genlock present but jitter still high—PLL cleanup or ref integrity?
“Genlock present” is not proof of reference integrity. First check (1) PLL/jitter-cleaner state: lock/holdover events and any ref-loss/glitch indicators, and (2) the skew/jitter distribution of timestamps over time. If jitter spikes align with ref anomalies or holdover entries, the reference integrity is the root. If ref looks stable but PLL state shows frequent relock or poor phase stability, the cleanup path is insufficient for the required determinism. First fix: log ref events and PLL state alongside skew histograms; treat “lock” as a time series, not a boolean.
Evidence anchor: PLL_lock/holdover, ref_loss/glitch + skew histogram timeline. MPN: Silicon Labs Si5341 (jitter cleaner example for observable PLL state).
H2-3 / H2-9 Random corrupt frames without drops—CRC vs frame assembly boundary?
Corruption without drops demands stage attribution between “wire integrity” and “assembly boundary.” First check (1) CRC and lane/deskew error counters and (2) frame-assembler integrity signals: invalid frame markers, sequence discontinuities, or “error frame capture” records. If CRC stays clean while corrupt/invalid frames increase, corruption is happening at or after assembly/reformatting (descriptor framing, interleave/merge boundary, or metadata stitching). If CRC bursts precede corruption, treat as link margin. First fix: preserve the first corrupted frame header + timestamp and correlate with per-stage counters in the same snapshot window.
Evidence anchor: CRC_clean? + frame_invalid/corrupt + saved error-frame header + snapshot window.
H2-6 / H2-5 GPU pipeline is fast in lab, slow in deployment—zero-copy broken or memcpy fallback?
The fastest diagnosis is to prove which ingest path is active. First check (1) host CPU utilization and memcpy-related counters/telemetry (bytes copied per second, if exposed) and (2) PCIe throughput plus queue backpressure (CQ backlog or DMA service time). If CPU rises sharply while throughput stays similar, you likely fell back to a memcpy/staging path. If CPU is stable but PCIe utilization and backlog grow, the bottleneck is bus/service contention rather than copy. First fix: log a “path marker” (direct-to-GPU vs staging) in the evidence bundle and A/B under the same workload.
Evidence anchor: path_marker, CPU% + PCIe_throughput + CQ_backlog. MPN note: zero-copy benefit depends on platform support; prove the active path in logs.
H2-7 Encoder count mismatches at high speed—input capture limit or noise?
Separate “capture bandwidth limit” from “signal integrity.” First measure (1) encoder A/B waveforms at the grabber input: edge rate, ringing, glitches, threshold crossings; and (2) capture health counters: edge reject/glitch detect (if available) and overflow/missed-edge indicators. If waveforms are clean but overflow/missed edges appear above a certain frequency, you hit an input capture/processing limit. If glitches and bounce are visible, conditioning is required and timestamps will be unreliable. First fix: validate the maximum stable edge rate with a controlled sweep and log the first failure frequency with the corresponding waveform.
Evidence anchor: scope Encoder A/B + capture_overflow/missed + frequency sweep log.
H2-11 / H2-12 After firmware update, drop rate changed—what logs prove regression?
A regression is proven only with comparable evidence bundles. First require (1) version identifiers: firmware/bitstream hash plus a config digest, and (2) the same validation corner that triggers the issue (bandwidth/cable/thermal/sync) with per-stage counter deltas. If a specific stage counter begins incrementing earlier (CRC bursts, reorder overflow, watermark hi-hi, DMA timeouts), that stage is the regression boundary. First fix: capture a flight-recorder window around the first drop on both versions and compare “first-moving counter” plus timestamp provenance—this isolates regressions without scope creep.
Evidence anchor: fw_hash/bitstream_hash, config_digest + per-stage counters_delta + flight recorder window.
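The "first-moving counter" comparison in this answer can be sketched as a diff over two per-stage counter bundles captured at the same validation corner. Counter names and values are illustrative.

```python
# Sketch: compare per-stage counter deltas from two evidence bundles (old vs
# new firmware) under the same validation corner; the counters that worsened
# mark the regression boundary. Names and values are invented examples.

def regression_boundary(old: dict, new: dict) -> list:
    """Return counters that worsened under the new firmware, worst first."""
    worse = {k: new.get(k, 0) - old.get(k, 0)
             for k in set(old) | set(new)
             if new.get(k, 0) > old.get(k, 0)}
    return sorted(worse, key=worse.get, reverse=True)

old_run = {"crc_errors": 0, "reorder_overflow": 1, "dma_timeout": 0}
new_run = {"crc_errors": 0, "reorder_overflow": 14, "dma_timeout": 2}
print(regression_boundary(old_run, new_run))  # reorder worsened most
```

This only isolates the stage; confirming the regression still requires the flight-recorder windows from both versions, per the answer above.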