Frame Grabber (PCIe, CoaXPress, GigE Vision)
H2-1. What a Frame Grabber Owns in the Vision Pipeline (and what it doesn’t)
Definition that matches the engineering boundary
A frame grabber is the hardware+firmware boundary that converts high-speed camera links (e.g., CoaXPress or GigE Vision) into host-consumable frames with provable integrity: link health accounting, packet/frame assembly, buffering, timestamps, deterministic trigger/genlock termination, and PCIe DMA into CPU or GPU memory.
Owns vs Interfaces vs Not owned (scope lock)
Owns:
- Link Rx health: CRC/BER, lock/retrain, lane/deskew status
- Packet/frame correctness: sequence continuity, frame CRC, reorder windows
- Buffer headroom: FIFO/DDR watermarks, overflow/drop stage IDs
- Timestamp provenance: local timebase, offset/drift counters, resync events
- DMA integrity: ring depth, completions, timeouts, backpressure behavior
- Trigger/Genlock termination: edge capture, delay programming, jitter budget
- Observability: counters, snapshots, flight-recorder logs, version traceability
Interfaces:
- Camera link ports: CXP lanes / Ethernet MAC / GVSP streams
- Discrete I/O: Trigger In/Out, Encoder In, Genlock/Ref In
- Host link: PCIe GenX, DMA queues, MSI-X interrupts / polling modes
- Software API: stream config, counters, timestamp packets, log export
- Diagnostics: link margin view, buffer watermark trace, DMA stall reason
Not owned:
- Sensor pixel / ISP algorithms (demosaic, denoise, HDR fusion, etc.)
- Compression/codec internals (H.26x/JPEG engine details)
- Lighting current drivers & strobe power stages
- Full system timing hub architecture (beyond signals & proof points)
- Camera PoE / isolated power tree design
Scope discipline rule: when a topic stops being provable by grabber counters/logs, it belongs to another page.
Five failure points the grabber must make non-mysterious
| Failure point (symptom) | First evidence to collect | What it proves (responsibility) |
|---|---|---|
| Link margin collapse (CRC spikes / re-lock events) | CRC, BER window, CDR lock/retrain, lane/deskew. Correlate with cable length, EMI events, and temperature. | Proves a PHY/Rx problem (not “host software”). If CRC rises with temperature, it often indicates margin shrink or retimer/SerDes drift. |
| Reorder/resend overflow (GigE: bursts cause disorder) | Sequence gaps, resend requested/received, reorder overflow. Track inter-packet gap variance; note switch microbursts. | Proves network loss + recovery pressure. If resend storms precede drops, the root is congestion or reorder window sizing. |
| Buffer headroom exhausted (drops at watermark) | FIFO/DDR watermark, overflow counter, drop stage ID. Record arrival rate vs DMA drain rate during faults. | Proves the failure is inside the grabber pipeline (burst absorption, arbitration, or drain capacity), not the camera. |
| DMA starvation / timeout (ring stuck) | DMA timeout, completion queue depth, IRQ rate, IOMMU faults. Capture a “freeze snapshot” of ring indices and the last descriptor. | Proves a host interface contract issue: mapping, queue sizing, interrupt policy, or backpressure. It separates “PCIe/DMA” from “link errors.” |
| Sync skew / drift (multi-camera misalignment) | Timestamp drift, offset/resync, genlock loss, skew histogram. Tag every frame with provenance: local vs PTP vs ref-locked. | Proves whether the culprit is timebase discipline or an upstream sync source. The grabber must expose the timebase state. |
H2-2. Interfaces & Link Behaviors (CoaXPress vs GigE Vision) — What the Grabber Must Guarantee
CoaXPress “capture contract” (serial over coax, margin-driven)
CoaXPress capture is governed by PHY margin and clock recovery stability. The grabber must hold a stable CDR lock, maintain lane alignment, and provide a low-noise path from recovered data to assembled frames. When failures occur, they usually present as CRC bursts, re-lock events, or deskew faults that correlate with cable length, EMI, or temperature drift.
GigE Vision “capture contract” (Ethernet/UDP, loss-and-recovery-driven)
GigE Vision capture is inherently best-effort: packet loss, reordering, and congestion are normal stressors. The grabber must implement robust sequence tracking, reorder buffering, and resend accounting so that transient network events do not silently corrupt frames. Under load, failure commonly appears as resend storms, reorder window overflow, or latency spikes from switch microbursts and host scheduling contention.
Common metrics checklist (works for both links) — the grabber’s “health certificate”
| Metric class | What to log continuously | Why it matters (what it proves) |
|---|---|---|
| Link health | CRC/BER, lock/retrain, lane/deskew, training state | Separates margin problems from higher-layer symptoms. If link health is clean, drops must be downstream. |
| Flow integrity | sequence gaps, resend stats, reorder occupancy, frame CRC | Proves whether corruption/drop is caused by loss/recovery pressure or by internal assembly logic. |
| Buffer headroom | watermarks (high/avg), overflow count, drop stage ID | Converts “dropped frames” into a specific stage: arrival bursts vs drain capacity vs arbitration. |
| DMA health | completion rate, timeout count, ring depth, IRQ/poll rate | Proves host interface stability. Clean link + clean buffers + DMA timeouts implies a host queue/mapping policy issue. |
| Sync proof | timestamp provenance, offset/drift, skew histogram, ref-loss events | Proves whether alignment errors come from timebase discipline vs upstream sync sources. |
| Thermal correlation | board temp, SerDes temp, CRC vs temp, drops vs temp | Explains “works cold, fails hot.” Temperature-linked CRC suggests margin shrink; temp-linked DMA faults suggest throttling or instability. |
Practical rule: keep these metrics timestamped and persistent. Without them, a field report becomes opinion; with them, it becomes a reproducible bug.
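The “timestamped and persistent” rule above can be sketched as a minimal periodic snapshot logger. This is a sketch, not a vendor API: `read_counters` and the counter names are hypothetical stand-ins for whatever the grabber SDK actually exposes.

```python
import io
import json
import time

def log_health_snapshots(read_counters, out, period_s=1.0, n=3):
    """Append timestamped counter snapshots as JSON lines, one per cadence tick.

    read_counters: callable returning a dict of 'health certificate' metrics
    (CRC, seq gaps, watermarks, DMA stats, temperatures, ...).
    out: any writable text stream (file, socket wrapper, in-memory buffer).
    """
    for _ in range(n):
        snap = {"ts": time.time(), **read_counters()}
        out.write(json.dumps(snap) + "\n")
        time.sleep(period_s)

# usage with a stand-in counter source (a real source would read device registers)
buf = io.StringIO()
log_health_snapshots(lambda: {"crc_err": 0, "ddr_watermark": 0.42},
                     buf, period_s=0.0, n=2)
```

JSON lines keep each snapshot independently parseable, so a truncated log from a crashed field system still yields usable evidence.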
H2-3. Rx Front-End: SerDes/CDR/Retimer, Link Margin, and Error Accounting
Physical-layer truth: prove margin first, or everything above becomes noise
The Rx front-end (SerDes + CDR + retimer/equalization) defines whether the link is fundamentally trustworthy. When margin is weak, higher-layer tuning (buffers, DMA, software) can only mask symptoms; it cannot prevent CRC bursts, re-lock events, and intermittent corruption that follows temperature, cable stress, or EMI coupling.
Card 1 — What to measure (minimum evidence pack, vendor-neutral)
- Lock/unlock proves clock recovery stability under real noise.
- Retrain reason separates “margin collapse” from “lane alignment” style faults.
- CRC rate vs time reveals bursty interference and thermal drift patterns.
- BER windows expose short “micro-failures” that averages can hide.
- Deskew errors point to differential lane delay, connector mismatch, or retimer behavior under heat.
- If CRC rises with temperature, suspect margin shrink or retimer/SerDes drift.
- If failures align with machine events, suspect coupled EMI, not random software.
Practical logging tip: record counters with timestamps at a fixed cadence, and also capture a short “burst log” around a fault event (pre/post window).
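The “burst log around a fault event” tip above is essentially a flight recorder: a ring buffer of recent samples that freezes a pre/post window when a fault trigger fires. A minimal sketch (window sizes and the counter dict shape are illustrative assumptions):

```python
from collections import deque

class FlightRecorder:
    """Keep the last pre_n samples continuously; after trigger(), record
    post_n more samples, then freeze the combined pre/post window."""

    def __init__(self, pre_n=300, post_n=800):
        self.pre = deque(maxlen=pre_n)  # rolling pre-fault history
        self.post_n = post_n
        self.post = None                # None = fault not triggered yet
        self.frozen = None              # final pre+post evidence window

    def sample(self, counters):
        if self.frozen is not None:
            return                      # window captured; ignore new samples
        if self.post is not None:
            self.post.append(counters)
            if len(self.post) >= self.post_n:
                self.frozen = list(self.pre) + self.post
        else:
            self.pre.append(counters)

    def trigger(self):
        """Call when a fault event (e.g., CRC burst) is detected."""
        if self.post is None and self.frozen is None:
            self.post = []

# usage: 3 pre samples retained, 2 post samples after the fault
rec = FlightRecorder(pre_n=3, post_n=2)
for i in range(5):
    rec.sample({"crc": i})
rec.trigger()
rec.sample({"crc": 5})
rec.sample({"crc": 6})
```

Freezing (rather than wrapping forever) is deliberate: the evidence window around the first fault is usually the most diagnostic one.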
Card 2 — Symptom → likely PHY cause (with first action)
| Symptom pattern | First evidence to check | Most likely PHY cause | First action (generic) |
|---|---|---|---|
| CRC bursts after warm-up (works cold, fails hot) | CRC vs temp, retrain events, lane status snapshot | Thermal margin shrink (connector/contact, retimer drift, EQ sensitivity) | Improve airflow/heatsinking; verify connectors; re-evaluate EQ strength (avoid over/under-equalization) |
| Frequent retrain at start (unstable from power-up) | training state dwell, retrain reason, lock/unlock | CDR lock instability or insufficient initial margin (cable, termination, connector) | Short known-good cable; reseat/replace connectors; reduce sources of coupled noise during start |
| Intermittent corruption without drops (bad frames pass through) | frame CRC flags, CRC rate, error burst length | Running “dirty” at the PHY; error detection works but integrity gating is loose upstream | Tighten integrity policy (drop/mark bad frames); investigate margin and EMI coupling |
| Multi-lane only: occasional artifacts (single-lane OK) | deskew errors, lane resync, lane map | Lane-to-lane skew drift (length mismatch, connector variance, retimer lane behavior) | Normalize lane paths; verify connectors; confirm deskew tolerance across temperature |
| Failures correlate with machine events (motors/relays) | EMI markers, CRC bursts, lock events | Coupled EMI causing short margin collapse and CDR disturbance | Improve shielding/cable routing; add separation; validate margin under the same EMI profile |
H2-4. Buffering Architecture: Line/Frame Buffers, DDR Bandwidth, and Worst-Case Bursts
Buffering is an inequality: burst absorption + worst-case service time
Buffering is not “add more memory and hope.” A frame grabber must absorb burst arrivals (microbursts, resend storms, multi-camera alignment, trigger bursts) while the host/DMA side experiences finite and sometimes delayed service. Drops become unavoidable when the worst-case incoming burst minus worst-case drain exceeds available buffering headroom.
Card 1 — A sizing recipe (variables, not long math)
- Rin: average ingest rate (bytes/s or pixels/s into the grabber)
- Bin: peak burst rate (microburst / resend storm / aligned capture)
- Tburst: burst duration (how long peak persists)
- Rout: effective drain rate (DMA to host/GPU under real load)
- Tservice: worst service gap (host scheduling/queue stalls)
- During a burst window, buffer must hold: incoming burst volume minus what can be drained in the same window.
- Add headroom so the high watermark stays below critical across temperature and worst-case scenes.
- Use per-stage watermarks (FIFO + DDR) to locate the real choke point.
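One conservative way to combine the variables above (an assumption for illustration, not a vendor sizing formula): the buffer must absorb the burst excess over the drain rate, plus whatever keeps arriving during a worst-case service gap, with headroom on top.

```python
def required_buffer_bytes(r_in, b_in, t_burst, r_out, t_service, headroom=1.5):
    """Worst-case buffering estimate (sketch; variables match the recipe above).

    r_in:      average ingest rate (bytes/s)
    b_in:      peak burst rate (bytes/s)
    t_burst:   burst duration (s)
    r_out:     effective drain rate to host/GPU (bytes/s)
    t_service: worst host service gap (s)
    headroom:  multiplier so the high watermark stays below critical
    """
    burst_excess = max(0.0, b_in - r_out) * t_burst   # burst volume minus drain
    stall_accumulation = r_in * t_service             # arrivals during a stall
    return headroom * (burst_excess + stall_accumulation)

# example: 0.5 GB/s average, 2 GB/s microburst for 5 ms,
# 1.2 GB/s drain, 2 ms worst host service gap
need = required_buffer_bytes(0.5e9, 2.0e9, 5e-3, 1.2e9, 2e-3)  # ~7.5 MB
```

The point of the sketch is the shape of the inequality: if measured watermarks exceed this estimate, either the burst model or the drain model is wrong, and the counters tell you which.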
Card 2 — Buffer overflow signatures (what the counters “look like”)
| Signature | What rises first (evidence order) | What it implies | First corrective direction |
|---|---|---|---|
| Watermark hits high, then drops | watermark high → overflow count → drop stage ID | Burst absorption is insufficient (buffer too small or burst too large) | Increase buffering at the true choke point; reduce burstiness upstream |
| DMA slows first, watermark climbs later | completion rate down → queue depth up → watermark rising | Drain capacity is constrained (host service gaps or queue policy) | Treat as a drain problem; confirm worst-case service gaps before resizing buffers |
| Resend spikes, reorder overflows (GigE) | resend requested → reorder occupancy → reorder overflow | Congestion/loss creates recovery pressure beyond the reorder window | Improve network conditions or increase reorder capacity; log loss markers |
| Aligned trigger causes instant peak | trigger marker → FIFO watermark → DDR watermark | Synchronous capture creates short, very high peaks | Add fast FIFO headroom; stagger capture where allowed; validate burst duration |
| DDR bandwidth wall (multi-camera + readback) | DDR busy → bank conflict markers → steep watermark slope | Arbitration/bank conflicts reduce effective bandwidth | Rebalance read/write scheduling; simplify access patterns; isolate streams |
Logging requirement: always record (1) watermark timeline, (2) drop stage ID, and (3) a timestamp marker for trigger/resend events.
H2-5. PCIe & DMA-to-Host: Descriptor Rings, Interrupt Strategy, Zero-Copy
Frames-to-host is a pipeline, not a black box
A frame grabber does not “send frames to the PC.” It runs a measurable pipeline: host buffers are allocated and mapped, descriptors are queued, the device performs PCIe writes, completions are generated, and the application consumes frames. Reliability at scale comes from keeping descriptor supply stable, preventing completion backlog, and choosing an event strategy that controls tail latency without wasting CPU.
Card 1 — DMA pipeline in 6 steps (each step has a proof signal)
| Step | What happens | What can break | Proof signal (what to log) |
|---|---|---|---|
| 1 | Host allocates receive buffers (pool) | Pool exhaustion → frames arrive with nowhere to land | pool depth, alloc fail, time-to-refill |
| 2 | Pin / map memory for DMA (IOMMU mapping where used) | Mapping churn, faults, or slow setup → intermittent stalls | map errors, fault markers, setup time |
| 3 | Build descriptors (scatter-gather list) into a descriptor ring | Ring starvation → DMA engine has nothing to do | desc underrun, ring occupancy, head/tail gap |
| 4 | Doorbell notifies the device that new descriptors exist | Doorbell/completion mismatch → bursty progress and jitter | doorbell rate, completion rate, cadence drift |
| 5 | DMA writes frames into host memory over PCIe | PCIe contention/backpressure → throughput drop, stalls | PCIe bytes/s, DMA busy, DMA errors |
| 6 | Completion events + app consume (event-driven or polling) | Completion backlog → tail latency spikes and drops upstream | CQ depth, interrupts/sec, P99 latency |
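The “ring occupancy / head-tail gap” proof signal in step 3 is simple modular arithmetic. A sketch, assuming the common convention that the host advances `tail` when queuing descriptors and the device advances `head` when consuming them (real drivers differ in index conventions):

```python
def ring_occupancy(head, tail, size):
    """Descriptors currently queued for the device in a circular ring.

    head: next index the device will consume.
    tail: next index the host will fill.
    Occupancy 0 means descriptor starvation: the DMA engine has nothing to do.
    """
    return (tail - head) % size

def starved(head, tail, size, low_watermark=4):
    """True when occupancy is below a refill watermark (threshold illustrative)."""
    return ring_occupancy(head, tail, size) < low_watermark
```

Logging this gap at a fixed cadence turns “the ring got stuck” into a measurable underrun event with a timestamp.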
Card 2 — Latency vs CPU tradeoffs (interrupt, polling, batching)
| Strategy | What improves | Evidence to confirm | Failure signature | First tuning direction |
|---|---|---|---|---|
| Interrupt-driven (MSI-X style) | Lower average latency at moderate rates | interrupts/sec, CQ depth, P95/P99 | Interrupt storm → CPU high → completion jitter | Reduce event rate via coalescing or limited batching |
| Polling (busy/periodic) | Stable tail latency (if CPU budget exists) | CPU%, CQ depth, cadence | CPU burn without throughput gain | Use bounded polling windows or hybrid mode |
| Batching (process N completions) | Lower CPU overhead per frame | batch size, interrupts/sec, P99 | P99 latency grows with batch size | Cap batch size; target P99 rather than max throughput |
| Hybrid (interrupt + short poll) | Good compromise: CPU controlled, tail improved | CQ depth, interrupts/sec, P99 | Mode thrash under bursty load | Add hysteresis on mode switching; log transitions |
Pitfall signatures: if throughput is acceptable but P99 latency spikes with high CPU activity, suspect copy/cache pressure or event strategy instability before resizing buffers.
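The “hysteresis on mode switching” advice for the hybrid strategy can be sketched as a two-threshold state machine. The thresholds and the CQ-depth input are illustrative assumptions; the point is the gap between enter and exit levels, which prevents mode thrash under bursty load.

```python
def next_mode(mode, cq_depth, enter_poll=32, exit_poll=8):
    """Hybrid interrupt/poll mode switch with hysteresis (sketch).

    Switch to polling only when completion backlog is clearly high, and
    back to interrupts only when it is clearly low. Depths between the
    two thresholds keep the current mode, absorbing short bursts.
    """
    if mode == "interrupt" and cq_depth >= enter_poll:
        return "poll"
    if mode == "poll" and cq_depth <= exit_poll:
        return "interrupt"
    return mode

# usage: a burst raises backlog, a lull in the middle band does not thrash
mode = "interrupt"
for depth in (4, 40, 20, 20, 5):
    mode = next_mode(mode, depth)
```

Logging every transition (as the table suggests) makes mode thrash directly visible as a transition-rate counter.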
H2-6. GPU / Accelerator Ingest (Optional Path): GPUDirect, RDMA Concepts, When It Helps
Decision scope: only pursue direct-to-GPU when evidence demands it
Direct-to-GPU ingest matters when the “default path” is measurably limited by CPU copy work, cache pressure, or PCIe saturation. This chapter stays within frame-grabber scope: identify bottleneck signatures, choose a path (host staging vs direct), and prove value with an A/B validation plan.
Card 1 — Decision checklist: “Do you need GPU-direct?”
Signals that direct-to-GPU may help:
- CPU memcpy dominates: CPU utilization stays high even when capture is stable.
- Tail latency spikes: P99 grows with bursty copy work or event cadence jitter.
- PCIe is near saturation: throughput is close to the platform’s practical ceiling.
- Multi-stream pressure: per-camera queues show head-of-line (HOL) symptoms during heavy inference.
Signals that it will not help (fix the proven stage first):
- Capture drops are already explained by PHY margin (H2-3) or the buffering inequality (H2-4).
- Completion backlog is caused by event strategy instability (H2-5) rather than copy work.
- Your workload is not latency-sensitive, and staging overhead is acceptable.
Card 2 — Validation plan: A/B metrics that prove benefit (no guesswork)
| Item | Path A (Host staging) | Path B (Direct-to-GPU / reduced-copy) |
|---|---|---|
| Pipeline | DMA → Host RAM → memcpy → GPU | DMA → GPU (or minimal staging) |
| Latency | Measure end-to-end (camera → inference input) P50/P95/P99 | Same measurement, same scenes, same load; compare percentile shifts |
| CPU load | CPU% + copy pressure signatures (spikes, cadence jitter) | Expect reduced CPU copy load; verify no new completion backlog |
| PCIe | Throughput average + peak; confirm headroom in burst windows | Confirm throughput is not capped by a different choke point |
| Reliability | Dropped frames / backpressure events / queue overflow counters | Must not increase drops; investigate any new failure signatures |
| Conclusion | If P99 and CPU stay high, staging is the bottleneck | Adopt only if P99 improves and drops do not worsen |
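The A/B verdict in the table reduces to two comparisons over measured distributions. A minimal sketch (nearest-rank percentiles; the latency samples and drop counts are whatever your measurement harness produces):

```python
import math

def percentiles(samples, ps=(50, 95, 99)):
    """Nearest-rank percentiles of end-to-end latency samples."""
    s = sorted(samples)
    n = len(s)
    return {p: s[min(n - 1, math.ceil(p / 100 * n) - 1)] for p in ps}

def ab_verdict(lat_a, lat_b, drops_a, drops_b):
    """Adopt Path B only if P99 improves AND drops do not worsen.

    lat_a / lat_b: end-to-end latency samples for host-staging vs direct path,
    measured on the same scenes under the same load.
    """
    pa, pb = percentiles(lat_a), percentiles(lat_b)
    return pb[99] < pa[99] and drops_b <= drops_a
```

Comparing whole percentile sets (P50/P95/P99) rather than averages is the point: staging bottlenecks often show up only in the tail.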
H2-7. Trigger / Encoder / Strobe I/O: Determinism, Debounce, and Latency Budget
Why field failures often start at the trigger, not the bandwidth
Many “dropped frame” incidents are actually trigger integrity issues: noisy edges, double-triggering, encoder miscounts, or strobe timing drift. A frame grabber owns the boundary signals (Trigger In/Out, Encoder In, Status) and must make them deterministic and provable by conditioning inputs, latching timestamps at a defined point, applying programmable delay/pulse control, and logging reject/overflow events.
Card 1 — Signals owned by the grabber (and what “determinism” means)
| Signal | Grabber responsibility | What can go wrong | Proof signal (log) |
|---|---|---|---|
| Trigger In | Threshold/edge conditioning (concept), glitch reject / debounce, timestamp latch at a defined edge point | Double-triggering, missed triggers on slow/noisy edges, false triggers from ground bounce | edge count, reject count, TS latch marker |
| Encoder In (quadrature) | Capture A/B edges, decode direction/count, enforce maximum rate boundary, record invalid transitions | Miscounts at high frequency, phase jitter → invalid transitions, direction flips | invalid transitions, overflow/overrate, count delta |
| Strobe Out (control-level) | Programmable delay, pulse width control, timing repeatability. Output is a timing control boundary (not a power driver). | Delay drift, width error at short pulses, load-induced edge deformation (beyond boundary) | delay readback, width config, measured Δt |
| Status I/O | Expose “armed/busy/drop” style state so timing problems are observable | Silent failure: system looks fine but was never armed or is backpressured | armed/busy, drop stage, state transitions |
Card 2 — “Two measurements first” (fast discriminator)
- Probe Trigger In and Strobe Out on the same timebase.
- Measure Δt distribution: edge-to-strobe delay and pulse width repeatability.
- Look for: double edges, ringing, slow threshold crossings, jitter growth with temperature/load.
- Record timestamp latch values for each accepted trigger.
- Track reject counters and invalid transition counters (encoder).
- Log state: armed/busy and any drop/backpressure markers.
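The glitch-reject / reject-counter behavior from Card 1 can be sketched as a simple time-window debounce. This is a conceptual model of what the grabber firmware does in hardware; the event tuples and window value are illustrative assumptions:

```python
def accept_edges(events, debounce_ns=200):
    """Time-window debounce sketch for Trigger In edges.

    events: iterable of (timestamp_ns, level) trigger edges, in time order.
    An edge arriving within debounce_ns of the previously accepted edge is
    rejected (the classic double-trigger signature from ringing edges).
    Returns (accepted_edges, reject_count), mirroring the
    'edge count / reject count' proof signals.
    """
    accepted, rejects, last_ts = [], 0, None
    for ts, level in events:
        if last_ts is not None and ts - last_ts < debounce_ns:
            rejects += 1          # glitch/double edge: count, don't trigger
            continue
        accepted.append((ts, level))
        last_ts = ts
    return accepted, rejects

# ringing produces a second edge 50 ns after the real one
edges, rejected = accept_edges([(0, 1), (50, 1), (300, 1)])
```

Exposing the reject counter is what turns “random double exposures” into a provable trigger-integrity finding.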
H2-8. Genlock / PTP / Timestamping: Aligning Multi-Camera Frames Without Guessing
Alignment is provable when timestamp provenance is explicit
Multi-camera alignment should not rely on “looks synchronized.” It becomes provable when every frame carries a timestamp with known provenance, clock-domain conversions are tracked with offset/drift counters, and resync/holdover events are logged. Genlock and PTP are tools to discipline the grabber timebase so that skew and drift can be measured as distributions.
Card 1 — Timestamp provenance ladder (most trustworthy → least)
| Timestamp domain | Why it is useful | Main risk | What must be logged |
|---|---|---|---|
| Hardware edge / link-adjacent | Closest to the real event; minimal OS influence | Still depends on timebase discipline and calibration | TS source ID, offset, drift |
| FPGA local timebase | Stable monotonic counter; can tag every frame consistently | Drifts without a reference; conversion to other domains must be tracked | drift/min, resync, holdover |
| Host software time | Convenient for applications and log aggregation | Scheduling jitter; not reliable for microsecond-level alignment | queue delay, timestamp mapping |
Card 2 — Pass/fail criteria for multi-camera alignment (measurable)
| Metric | How to measure | Failure signature | First suspect |
|---|---|---|---|
| Skew distribution (P50/P95/P99) | Histogram of inter-camera TS deltas for matched events/frames | Wide tails or bimodal peaks; some cameras jump by steps | resync events, unstable reference, wrong domain mapping |
| Drift rate (per minute) | Track TS delta slope over time (Δskew / minute) | Skew slowly grows; alignment degrades predictably | undisciplined timebase, holdover too long |
| Offset stability | Monitor offset counter; correlate with skew changes | Offset “steps” coincide with alignment jumps | PTP step/resync; reference interruptions |
| Resync frequency | Log resync events/hour and associated magnitude | Alignment appears fine, then suddenly wrong for a window | clock domain switching, ref signal integrity |
| Holdover state | Record when reference is lost and how long holdover lasts | Drift accelerates during holdover | ref-in loss, PLL unlock, unstable environment |
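The skew and drift metrics above are small computations over matched frame timestamps. A sketch (least-squares slope as the drift estimator; the timestamp arrays are assumed to be already matched per frame):

```python
def skew_stats(ts_a, ts_b):
    """Inter-camera skew deltas and drift slope (sketch for Card 2).

    ts_a, ts_b: matched frame timestamps (seconds) from two cameras.
    Returns (deltas, drift_per_min): per-frame skew values for the
    histogram, and the least-squares slope of skew over time scaled
    to seconds per minute.
    """
    deltas = [b - a for a, b in zip(ts_a, ts_b)]
    n = len(deltas)
    t_mean = sum(ts_a) / n
    d_mean = sum(deltas) / n
    num = sum((t - t_mean) * (d - d_mean) for t, d in zip(ts_a, deltas))
    den = sum((t - t_mean) ** 2 for t in ts_a)
    slope = num / den if den else 0.0
    return deltas, slope * 60.0

# skew starts at 1 ms and grows by 0.5 ms per minute
deltas, drift = skew_stats([0.0, 60.0, 120.0], [0.001, 60.0015, 120.002])
```

A steadily positive `drift_per_min` with narrow `deltas` spread points at an undisciplined timebase; wide or bimodal `deltas` with near-zero drift points at resync events or domain mapping instead.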
H2-9. Reliability Under Load: Drop Frames, Reorder, Resend, and Backpressure
Make “why frames drop” diagnosable with minimal tools
Dropped frames are not a single failure mode. Under load, the pipeline can fail at distinct stages: Rx/link errors, reorder/resend pressure (especially GigE), buffer/DDR overflow, DMA starvation, or host scheduling stalls that create backpressure. Diagnosis becomes repeatable when each stage exposes counters and watermarks so the “drop point” is proven before any fix is attempted.
Card 1 — Decision tree (Symptom → check counters in order)
| Symptom (field language) | Check these counters first | What proves the drop stage | First action (within grabber scope) |
|---|---|---|---|
| Drops start only at peak throughput (multi-cam / burst triggers) | DDR watermark high, assembler drop, DMA timeout | Watermark hits precede drops; drop counters increment at the buffer/DMA stage while Rx CRC stays low | Increase buffering headroom, reduce burst concurrency (controlled degradation), verify drain rate stability |
| GigE: stutter/lag then drops (often with reordering) | seq gaps, reorder depth, reorder overflow/timeout, resend requested | Reorder depth grows, then overflow/timeout increments; drops occur without DDR overflow | Resize reorder window/timeouts; ensure resend accounting is logged and bounded under loss bursts |
| CXP: corrupted/invalid frames (not “missing”, but unusable) | lane status, deskew fail, frame corrupt/CRC | Deskew/corrupt counters rise; frame validity fails at the Rx stage even when buffers are not saturated | Capture lane snapshots; correlate errors with temperature and retrain events; treat as margin loss evidence |
| Latency spikes but few drops (tail latency collapses determinism) | DMA completion backlog, descriptor starvation, host queue depth | Completion backlog climbs first; host queue depth grows; watermark rises without link errors | Stabilize DMA servicing (ring depth, batching strategy conceptually); verify backlog returns to baseline |
| Only one camera drops (same host, same load) | per-port CRC/seq gaps, per-port reorder overflow, per-port watermark | Drops correlate to a single port’s counters, not global backpressure | Use per-port accounting; isolate the stage that is port-specific (Rx vs reorder vs assemble) |
Card 2 — Golden metrics (Top 10 counters to always log)
| Counter / metric | What it means | Typical failure signature | First suspect |
|---|---|---|---|
| Rx CRC / corrupt frame | Data arrived but failed integrity at Rx/assemble stage | Corrupt rises before drops; valid frames become sporadic | margin loss / thermal drift |
| Seq gaps (GigE) | Missing packets detected in capture stream | Seq gaps spike → resend increases → reorder depth grows | loss bursts / congestion |
| Resend requested/received | Reconstruction pressure indicator under loss | Requested rises faster than received; recovery fails | recovery budget exceeded |
| Reorder depth watermark | Peak occupancy of reorder buffer | Watermark drifts upward over time; bursts trigger overflow | window too small / timeouts |
| Reorder overflow/timeout | Reorder cannot complete frames within budget | Drops occur without DDR overflow | resend storms / budget |
| Lane/deskew fails (CXP) | Multi-lane alignment cannot be maintained | Step-like increase with temperature; retrain events | lane margin / ref drift |
| Assembler incomplete/drop | Frame assembly failed due to missing/corrupt inputs or pressure | Assembler drops coincide with reorder timeouts or DDR watermark | upstream pressure |
| DDR watermark high | Burst absorption is approaching limits | Watermark high precedes overflow and frame drops | burst peak > drain |
| DMA timeout / CQ backlog | Host transfer path cannot complete within service budget | Latency spikes + eventual drops without link errors | service starvation |
| Host backlog time | Downstream consumer is not keeping pace (observable backpressure) | Queue grows; drops happen later at buffer stage | consumer stalls |
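The decision tree and golden counters above imply a fixed check order: prove the earliest pipeline stage first. A minimal classifier sketch (counter names and thresholds are illustrative, and `c` is a dict of counter deltas over the fault window):

```python
def classify_drop_stage(c):
    """Map golden-counter deltas to a first-suspect drop stage (sketch).

    Checks run in pipeline order: Rx/link -> reorder/resend ->
    buffer/DDR -> DMA/PCIe -> host backpressure. The first stage with
    evidence wins, because downstream symptoms are often secondary.
    """
    if c.get("rx_crc", 0) or c.get("deskew_fail", 0):
        return "rx/link margin"
    if c.get("reorder_overflow", 0) or \
       c.get("resend_requested", 0) > c.get("resend_received", 0):
        return "reorder/resend pressure"
    if c.get("ddr_watermark_high", 0) or c.get("assembler_drop", 0):
        return "buffer/DDR"
    if c.get("dma_timeout", 0) or c.get("cq_backlog", 0):
        return "dma/pcie service"
    if c.get("host_backlog_s", 0.0) > 0:
        return "host backpressure"
    return "unproven: collect evidence bundle"
```

Note the ordering consequence: a DDR watermark hit alongside CRC errors is classified as a link problem first, which matches the “prove margin before tuning buffers” rule from H2-3.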
H2-10. Thermal, Power, and Throttling: Keeping Determinism Across Temperature
Thermal drift turns “margin” into “random drops” unless it is correlated and controlled
Temperature affects link margin and timing repeatability. As devices heat up, error counters often rise gradually before frame drops become visible: CRC/corrupt events, lane/deskew failures, reorder pressure, and jitter proxies can all worsen with temperature. Determinism is preserved when telemetry (on-die + board temperatures) is logged alongside reliability counters, and a controlled degradation strategy prevents hard failure.
Card 1 — Thermal symptoms (what to look for)
- Errors ramp with time-to-steady-state: CRC/deskew rises after minutes of load.
- Alignment/stability degrades: tail latency grows, then drops start.
- One hot zone dominates: a single port or lane group becomes error-heavy first.
- Jitter proxies widen: trigger-to-strobe Δt distribution tails get wider at high temperature.
- PHY/SerDes: margin-sensitive; errors often correlate strongly with temperature.
- FPGA: timing slack shrink → stability/jitter proxies worsen.
- DDR: sustained bandwidth and refresh overhead make watermarks rise.
- PCIe block: completion backlog can increase under heat + load.
Card 2 — Correlation method (log temp + counters; prove causality)
| Step | What to do | What to log | What “proof” looks like |
|---|---|---|---|
| 1 | Hold a fixed load profile (ports, fps, trigger rate) until thermal steady-state | temp (die/board), CRC/deskew, drops | Counters worsen gradually as temperature rises; not random spikes |
| 2 | Compute correlation: temperature vs error rate and drop rate (same time axis) | error rate, drops/min, watermark | Errors track temperature; drop onset aligns with a counter threshold |
| 3 | Apply controlled degradation instead of hard failure (reduce fps / reduce burst concurrency) | throttle state, applied fps/rate, errors | Counters recover and drops stop while staying observable and deterministic |
| 4 | Verify repeatability (warm-up → degrade → recover) across runs | resync events, temp slope, recovery time | The same temperature band triggers the same failure signature; fixes are stable |
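Step 2’s “compute correlation” is an ordinary Pearson coefficient over the two aligned series. A self-contained sketch (the series are assumed to be sampled on the same time axis, as the table requires):

```python
def pearson(xs, ys):
    """Pearson correlation between temperature and an error-rate series.

    Both series must share the same time axis and length. Values near +1
    support 'errors track temperature'; values near 0 argue against a
    thermal root cause for this run.
    """
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy) if sx and sy else 0.0
```

A strong coefficient alone is not causality; the controlled-degradation step (reduce load, watch counters recover at the same temperature) is what closes the argument.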
H2-11. Observability & Logging: What to Record So Field Bugs Become Fixable
Why this chapter exists
Field failures become fixable only when every “drop” is attributable to a specific pipeline stage: Rx/link → reorder/resend → buffer/DDR → DMA/PCIe → host backlog. Observability is the design of evidence artifacts (events, counters, snapshots) that prove the drop stage before any remediation is attempted.
Card 1 — Minimal diagnostic payload (uploadable “evidence bundle”)
- Identity: board rev, serial, firmware/bitstream hash, driver version, config digest.
- Link: per-port link state, retrain count, lane/deskew status (if applicable).
- Integrity: CRC/corrupt/seq-gap counters + “burst summary” for last 60s.
- Reorder/Resend: reorder depth watermark, overflow/timeout count, resend requested/received.
- Buffer: DDR watermark high/high-high hits, assembler drops, overflow events.
- DMA: timeout count, completion backlog watermark, descriptor starvation events.
- Sync (if enabled): offset/drift summary + resync/holdover events (no protocol deep dive).
- Thermal: T-sensors max/avg, temperature slope, temp-at-fault.
- Pre/Post window: ring-buffer dump around trigger (e.g., 3s pre + 8s post).
- Counters delta: per-stage counter deltas over the snapshot window.
- State snapshots: link/lane snapshot, reorder window state, DMA queue depths.
- Timestamp provenance: which clock domain stamped each record (FPGA/local/host).
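The evidence bundle above is easiest to handle in the field as one self-describing blob with an integrity digest. A sketch of the assembly step (field names are illustrative, not a vendor schema):

```python
import hashlib
import json

def build_evidence_bundle(identity, counters_delta, snapshots, window):
    """Assemble the minimal diagnostic payload as one uploadable JSON object.

    identity:       board rev, serial, firmware hash, driver, config digest.
    counters_delta: per-stage counter deltas over the snapshot window.
    snapshots:      link/lane, reorder window, DMA queue state dumps.
    window:         pre/post seconds around the trigger event.
    A SHA-256 digest over the canonical JSON makes tampering or truncation
    of the uploaded bundle detectable.
    """
    bundle = {
        "identity": identity,
        "counters_delta": counters_delta,
        "snapshots": snapshots,
        "window": window,
    }
    blob = json.dumps(bundle, sort_keys=True)
    bundle["digest"] = hashlib.sha256(blob.encode()).hexdigest()
    return bundle

# usage: a tiny bundle with a 3 s pre / 8 s post window
b = build_evidence_bundle({"serial": "SN-001", "fw": "abc123"},
                          {"rx_crc": 1, "ddr_overflow": 0},
                          {"dma_ring": {"head": 10, "tail": 14}},
                          {"pre_s": 3, "post_s": 8})
```

Sorting keys before hashing makes the digest reproducible across Python runs, which matters when support and the field site compute it independently.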
Card 2 — When the user says “random”, what to ask for (in order)
- Time scale: seconds vs minutes vs “after warm-up”? (points to burst vs thermal coupling)
- Load dependency: only at peak throughput / multi-cam bursts / high trigger density?
- First counter that moved in the 30s before the drop: CRC/deskew/seq gaps, reorder overflow, DDR watermark, DMA timeout.
- Port specificity: only one camera/port? Provide per-port counters.
- Snapshot requirement: flight recorder dump (pre/post) + version/config digest.
- Thermal context: temp-at-fault + slope + whether errors rise monotonically with temp.
MPN examples — Common hardware hooks that make evidence reliable
Examples only (not a recommendation). Choose equivalents based on link rates, environment, and lifetime.
| Function | Example MPN | Why it helps observability | Evidence it enables |
|---|---|---|---|
| CoaXPress Rx/Tx (CDR/EQ) | Microchip EQCO125X40 family; legacy: EQCO62R20 | Exposes link lock / equalization behavior; supports CDR-centric margin evidence | retrain/lock events, margin-linked CRC/deskew signatures |
| GigE controller (IEEE1588 capable) | Intel I210 (e.g., I210-IS / I210-AT) | Hardware timestamp support (platform dependent); enables provable time tagging | timestamped packet/stream evidence; drift/outlier correlation |
| PTP-capable PHY (example) | TI DP83640 | Hardware timestamping at PHY; useful when time evidence must be close to the wire | offset/drift logs; resync event provenance |
| SPI NOR for versioned firmware | Winbond W25Q128JV (example class) | Stable firmware storage with readable IDs/hashes for traceability | firmware hash, rollback correlation |
| EEPROM for board ID/config | Microchip 24LC256 (example class) | Stores serial/rev/calibration IDs used in evidence bundles | identity + config digest reproducibility |
| High-accuracy temperature sensor | TI TMP117 (example class) | Correlates error counters with temperature reliably (not “hand-wavy thermal”) | temp-at-fault, monotonic error-vs-temp proof |
| Clock jitter cleaner (example) | Silicon Labs Si5341 (example class) | Stabilizes local timebase; improves timestamp consistency and skew distributions | narrower skew histograms; fewer resync anomalies |
| Multi-fan controller (example) | Microchip EMC2305 (example class) | Controls airflow deterministically; logs tach faults for thermal root-cause | airflow/fan fault evidence; thermal recovery proofs |
H2-12. Validation & Field Debug Playbook: Evidence → Isolate → Fix (No Scope Creep)
Card 1 — Test matrix (minimal coverage of worst-case corners)
Targets below are examples (not absolute standards). Use them to define pass/fail for a specific system.
| Corner | How to stress it | Evidence to log (must-have) | Example target (non-absolute) |
|---|---|---|---|
| Bandwidth | single-port max rate; then multi-port max aggregate; add burst triggers | DDR watermarks, assembler drops, DMA timeouts/CQ backlog, per-port integrity | steady-state: no drops; bursts: drops must be attributable & bounded |
| Cable / SI | short vs typical vs longest deployment cable; re-seat connectors | CRC/corrupt, lane/deskew, retrain events; error-vs-temp overlay | no rising error trend; retrains rare and explainable |
| Thermal | cold start → warm-up → thermal steady-state; airflow restricted vs nominal | T sensors + CRC/deskew + drops/min + throttle state | controlled degradation prevents uncontrolled drops at high temp |
| Sync / multi-cam | multi-camera capture; measure skew distribution over time; add resync events | skew histogram summary, drift/offset summary, resync timestamps | skew is a stable distribution; outliers must correlate to events |
Card 2 — Field debug SOP (symptom → evidence → isolate → first fix)
- Confirm link margin: CRC/corrupt (and lane/deskew if applicable) must be stable. Proof: errors rise before any buffer/DMA symptom → treat as margin/cable/EQ domain.
- Confirm reorder/resend budget (GigE path): reorder depth watermark and overflow/timeout must stay within bounds. Proof: seq gaps → resend pressure → reorder overflow increments before the DDR watermark.
- Confirm buffering headroom: DDR watermark high/high-high hits must not precede drops. Proof: watermark crosses threshold first → burst absorption budget exceeded.
- Confirm DMA health: DMA timeout and completion backlog must be absent or bounded. Proof: CQ backlog grows first while link counters stay clean → service starvation.
- Confirm sync integrity: skew/drift outliers must correlate to resync/holdover events. Proof: outliers without sync events → investigate timestamp domain mismatch within the capture chain.
- Correlate with thermal: overlay temperature with error rate and drops/min. Proof: monotonic error-vs-temp relationship + recovery under controlled degradation.
- Attach evidence: minimal diagnostic payload + flight-recorder pre/post window dump.
What to change first (grabber-side), with concrete MPN examples
Examples only. The “first change” should match the stage proven by the counters.
| Proven stage | Evidence signature | First change (within scope) | Example MPN(s) |
|---|---|---|---|
| CXP link margin | deskew/lock events + corrupt/CRC bursts rise (often with temperature) | treat as margin: cable/connector seating, equalization/retiming strategy, retrain evidence capture | Microchip EQCO125X40 family; legacy: EQCO62R20 |
| GigE timestamp evidence | skew outliers without buffer/DMA failures; timebase provenance unclear | make timestamps provable: hardware timestamp path + explicit domain tagging in logs | Intel I210; TI DP83640 |
| Buffer headroom | DDR watermark high-high hits precede drops; assembler drops follow bursts | reduce burst concurrency; increase headroom; log watermarks and drop points per stage | (DDR device varies) + sensor hook: TI TMP117 |
| DMA service starvation | CQ backlog grows; DMA timeout increments; link counters remain clean | increase service budget (ring depth / batching conceptually) and prove backlog recovery in logs | SPI FW trace: Winbond W25Q128JV; ID EEPROM: Microchip 24LC256 |
| Thermal coupling | errors rise monotonically with temperature; recovery under throttle | controlled degradation + deterministic airflow; log tach faults + throttle state transitions | Silicon Labs Si5341 (clock); Microchip EMC2305 (fan) |
H2-13. FAQs ×12 (evidence-based; no scope creep)
Each answer points to specific counters / waveforms / logs and maps back to H2-1…H2-12. Evidence first, fixes second.