Network TAP / Probe: Lossless Packet Capture, Timestamps & Replay
A Network TAP/Probe is a non-intrusive observation platform that copies traffic (and optionally records and replays it) without making forwarding or security blocking decisions. “Production-ready” means provably lossless under a declared profile (Gbps + 64B Mpps + microbursts + replication + storage), backed by time-aligned evidence: drop counters, buffer watermarks, timestamp stability, and IO tail latency.
H2-1 · What a Network TAP / Probe is, and where its boundary sits
A Network TAP / Probe is a visibility system: it copies traffic for capture, analysis, and replay without participating in forwarding or security enforcement decisions. The fastest way to avoid design and procurement confusion is to separate three roles—TAP, NPB, and Probe—by what each one does to packets.
Role TAP — “Copy traffic” (do not interpret)
- Produces a duplicate stream from a live link (passive split or electronic mirror).
- Goal is visibility fidelity: “what was on the wire” as closely as the deployment allows.
- Does not decide whether packets are good/bad, allowed/blocked, or routed.
Role Network Packet Broker (NPB) — “Organize traffic”
- Aggregates multiple inputs, filters/masks, replicates 1→N, and load-balances to tools.
- Operates as a visibility fabric (not a security gateway): the output is “what tools should observe”.
- Where “lossless” becomes conditional: fan-out, burst, and egress congestion determine drops.
Role Probe / Recorder / Analyzer — “Prove & replay”
- Captures packets to storage, adds timestamps + metadata, indexes for search, and supports replay.
- Focus is evidence quality: timestamp accuracy, integrity (checksums/hashes), and reproducibility.
- Often the bottleneck is write jitter (tail latency), not peak throughput.
Boundary What it is NOT
- Not a firewall/IPS/DDoS mitigator (no policy enforcement, no blocking responsibility).
- Not a router/switch for production forwarding decisions (no control plane, no routing roles).
- Inline products exist, but their purpose is still visibility; resilience comes from bypass design.
Three scoping questions:
- Wire truth required? (microburst, forensic evidence, consistent packet counts)
- Time-order matters? (hardware timestamps, multi-device time alignment)
- Long retention + query? (indexing, storage control, replay with shaping)
If the answer is “yes” to any of the above, the architecture must be treated as a pipeline (copy → organize → prove), not as a single “box with ports”.
H2-2 · Deployment patterns: passive optical TAP, SPAN, inline bypass—when each breaks
Deployment choice is less about marketing labels and more about failure modes. Each method can “look fine” during light load and quietly fail when burst, contention, or timing precision becomes critical. The goal is to choose a pattern whose breakpoints are understood, measurable, and acceptable for the use case.
SPAN Port mirroring — quick to enable, easiest to be misled by
- What it delivers: convenient mirrored frames for troubleshooting and coarse visibility.
- Where it breaks: mirror congestion (drops), prioritization differences, and timing skew.
- How to detect: compare ingress counters vs captured counts; test with 64B-heavy traffic and burst.
- Choose when: temporary triage, cost-sensitive visibility, and lossless is not a hard requirement.
Passive Optical TAP — highest fidelity, physical budgets matter
- What it delivers: traffic copy closest to wire behavior (independent of switch mirroring resources).
- Where it breaks: insertion loss / optical power margin, connector hygiene, module/port compatibility.
- How to detect: pre-check link margin; verify stable BER on production + identical packet counts on copy.
- Choose when: long-term monitoring, microburst sensitivity, forensics-grade evidence needs.
Inline Inline with bypass — visibility inline, resilience via fail-open
- What it delivers: inline insertion when physical constraints demand it, with controlled observation outputs.
- Where it breaks: bypass design (relay/optical switch) and unexpected link flap during failover.
- How to detect: power-cut and reboot tests; measure outage time; validate alarms/logs match switchovers.
- Choose when: inline insertion is unavoidable and production continuity must be protected by hardware bypass.
Decision A practical selection rule (keeps teams aligned)
- Need “wire truth” (microburst, strict counts, evidence chain) → prioritize Passive TAP.
- Need lowest operational risk with “good enough” visibility → SPAN with explicit limitations.
- Must be inline (physical path constraint) → Inline + bypass with rigorous fail-open tests.
“Lossless” should be treated as a tested condition: traffic profile + burst + replication + egress write stability. A deployment method is “correct” only if its breakpoints are measurable and controlled.
Minimal verification routine:
- Run a known traffic pattern and confirm packet counts match on source vs capture (not just throughput).
- Repeat with 64B-heavy traffic and with controlled bursts to expose Mpps/buffer limits.
- If timing matters, validate hardware timestamp stability across load and across reboot/holdover scenarios.
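The count check in the first step is simple enough to script. A minimal sketch, assuming you can snapshot a packets-received counter on the source link and on the capture side (SNMP, gNMI, or a CLI scrape; the reader function is deliberately left out):

```python
# Hedged sketch of the source-vs-capture count check. How counters are read
# is platform-specific; this only shows the comparison itself.

def lossless_check(src_delta: int, cap_delta: int) -> bool:
    """Compare packets seen on the wire vs packets captured for one run."""
    lost = src_delta - cap_delta
    status = "PASS" if lost == 0 else f"FAIL (lost {lost})"
    print(f"source={src_delta} captured={cap_delta} -> {status}")
    return lost == 0

# Repeat for large packets, a 64B-heavy profile, and controlled bursts;
# throughput parity alone does not prove count parity.
lossless_check(10_000_000, 10_000_000)   # PASS
lossless_check(10_000_000,  9_998_731)   # FAIL (lost 1269)
```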
H2-3 · What “lossless” really means: throughput vs Mpps vs burst, and where drops are born
“Lossless capture” is not a single checkbox. It is a tested condition defined by a traffic model (packet-size mix, burstiness, replication factor, and output constraints). A platform can meet line-rate in Gbps yet still drop packets when Mpps (small packets) or microbursts exceed the observation pipeline’s service budget.
Framework Three knobs that decide “lossless”
- Gbps (line-rate): sustained bandwidth capacity of ports and fabric.
- Mpps (per-packet budget): parser/match/replicate/schedule must finish within a fixed cycle budget—64B frames are the hardest case.
- Burst absorption: when instantaneous arrival rate exceeds service rate, buffer depth and arbitration decide whether drops occur.
Practical takeaway: a “100G/400G” label is a throughput statement, not a proof of lossless capture across packet sizes and bursts.
Mpps Why 64B traffic sets the hard floor
- On the wire, minimum Ethernet frames carry overhead; a common engineering approximation is 84B per packet (64B frame + 8B preamble/SFD + 12B IFG).
- Packets-per-second estimate: pps ≈ R / (84 bytes × 8 bits/byte) = R / 672 bits.
- Rule-of-thumb reference points (approx.): 10G ≈ 14.88 Mpps, 25G ≈ 37.2 Mpps, 100G ≈ 148.8 Mpps, 400G ≈ 595.2 Mpps (64B-heavy).
If a capture path is validated only with large packets, the real bottleneck may remain hidden until a 64B-heavy workload appears (telemetry storms, ACK-heavy flows, scanning, or bursty control traffic).
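The approximation turns directly into a small calculator. This is a sketch of the rule of thumb above, not a vendor formula; it reproduces the reference points just listed.

```python
# Worst-case packets-per-second for back-to-back 64B frames, using the
# 84B-per-packet wire approximation (64B frame + 8B preamble/SFD + 12B IFG).

WIRE_BYTES_64B = 64 + 8 + 12   # 84 bytes of wire time per minimum-size frame

def worst_case_mpps(line_rate_gbps: float) -> float:
    """Mpps implied by a line rate when every frame is 64B."""
    bits_per_packet = WIRE_BYTES_64B * 8            # 672 bits on the wire
    return line_rate_gbps * 1e9 / bits_per_packet / 1e6

for rate in (10, 25, 100, 400):
    print(f"{rate:>3}G -> {worst_case_mpps(rate):6.1f} Mpps")
# 10G -> 14.9, 25G -> 37.2, 100G -> 148.8, 400G -> 595.2
```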
Burst Microburst math that explains “average is low, yet it drops”
- When a short burst arrives faster than the observation path can drain, buffer fills by the rate gap.
- Useful sizing intuition: B_needed ≈ (R_in − R_out) × t_burst × Fanout.
- Fanout is a traffic multiplier (replication 1→N, oversubscription, or “one input feeds many tools”).
Microbursts punish systems that look stable at 5–10% average utilization; watermark-driven evidence is more reliable than averages.
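The sizing intuition is easy to make concrete. A minimal sketch with illustrative values (not a sizing tool): a 100G ingress drained at an effective 40G during a 500 µs microburst, replicated 1→4.

```python
# Sketch of B_needed ≈ (R_in − R_out) × t_burst × Fanout, with unit handling.
# Inputs and the worked example are illustrative, not a vendor sizing rule.

def buffer_needed_bytes(r_in_gbps: float, r_out_gbps: float,
                        t_burst_us: float, fanout: int = 1) -> float:
    """Buffer (bytes) needed to absorb one burst without drops."""
    rate_gap_bps = max(r_in_gbps - r_out_gbps, 0.0) * 1e9
    bits = rate_gap_bps * (t_burst_us * 1e-6) * fanout
    return bits / 8

# 100G in, 40G effective drain, 500 µs burst, replicated 1→4:
need = buffer_needed_bytes(100, 40, 500, fanout=4)
print(f"{need / 1e6:.1f} MB of absorption needed")   # 15.0 MB
```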
Drop taxonomy Where drops are born (and what to instrument)
- Ingress overflow: front-end FIFO/PCS/MAC cannot absorb burst → ingress overflow counters rise.
- Pipeline stall: parser/lookup/match exceeds per-packet budget → stage-stall/utilization counters spike (often worst on small packets).
- Replication & egress arbitration: fan-out creates congestion → egress queue watermarks hit max, arbitration drops appear.
- DMA/PCIe backpressure: host path cannot keep up → DMA ring overflow/backpressure flags asserted.
- Storage write jitter: tail latency (GC/flush/RAID events) causes capture backpressure → drops correlate with I/O latency spikes.
Correlation guide (symptom → likely origin):
- Drops explode with 64B-heavy traffic → parser/lookup/Mpps bottleneck (pipeline stall evidence).
- Drops occur only during bursts → buffer depth/watermark + arbitration (burst absorption evidence).
- Drops start after enabling extra tool outputs → replication fan-out + egress oversubscription.
- Drops correlate with host load → DMA/PCIe backpressure.
- Drops correlate with I/O latency spikes → storage tail-latency jitter (write-path evidence).
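The correlation guide can be captured as a first-pass triage function. Everything here is illustrative: real platforms expose different counter names, and the mapping is a heuristic starting point, not a diagnosis.

```python
# Illustrative first-pass triage for the correlations listed above.
# Signal names are hypothetical placeholders for your own telemetry.

def likely_drop_origin(signals: dict) -> str:
    if signals.get("worse_with_64b_traffic"):
        return "pipeline stall: parser/lookup exceeds the Mpps budget"
    if signals.get("only_during_bursts"):
        return "burst absorption: buffer depth / egress arbitration"
    if signals.get("started_after_new_tool_outputs"):
        return "replication fan-out / egress oversubscription"
    if signals.get("tracks_host_cpu_load"):
        return "DMA/PCIe backpressure on the host path"
    if signals.get("tracks_io_latency_spikes"):
        return "storage write jitter (tail latency)"
    return "unclassified: collect per-reason drop counters first"

print(likely_drop_origin({"only_during_bursts": True}))
```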
H2-4 · Data plane architecture: parse → classify → filter/mask → replicate → load-balance
High-speed TAP/NPB/probe platforms rely on a hardware pipeline (ASIC/FPGA or equivalent) because the observation plane must handle line-rate plus worst-case per-packet timing. The key is not “more throughput” but deterministic actions per packet: extracting keys, deciding actions, replicating, and delivering consistent tool feeds without hidden drops.
Why HW Why ASIC/FPGA-class pipelines are used
- 64B Mpps budget: header parsing + match + action must complete within a fixed cycle window.
- Deterministic replication: 1→N fan-out and egress arbitration require predictable scheduling and watermark evidence.
- Instrumentable correctness: stage counters, queue watermarks, and drop-reason telemetry provide proof of “lossless under conditions”.
Pipeline Stage-by-stage actions (visibility-focused, not enforcement)
- Parse: extract headers and tunnel context needed for observation decisions (outer/inner keys where applicable).
- Classify: compute flow key and select an action profile (fast match + counters).
- Filter: decide whether a packet is forwarded to tools (visibility selection, not security blocking).
- Mask / Slice: header-only, payload truncation, or field masking to reduce tool load and storage pressure.
- Replicate: copy traffic to multiple tools or recorders; fan-out becomes a multiplier on burst and queueing.
- Load-balance: hash-based distribution with flow consistency (same flow goes to the same tool node).
A reliable architecture makes every action measurable: rule-hit counters, per-output utilization, queue watermarks, and explicit “drop reason” telemetry.
Hashing Load-balance without breaking analysis
- Flow consistency is the default goal: splitting one conversation across multiple tools can break stateful analytics and distort timelines.
- Hash key selection determines outcome: 5-tuple vs inner-5-tuple (for encapsulated traffic) vs custom fields.
- Skew detection matters: report distribution across outputs (utilization + flow counts) to catch hot spots.
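A minimal sketch of the flow-consistency property, assuming a direction-agnostic 5-tuple key so both halves of a conversation land on the same tool. Real platforms compute this in hardware with their own hash functions; this only illustrates the property acceptance tests should check for.

```python
import hashlib

def flow_key(src_ip: str, dst_ip: str, sport: int, dport: int, proto: int) -> bytes:
    # Sort the endpoints so A→B and B→A produce the same key (symmetric hash).
    a, b = sorted([(src_ip, sport), (dst_ip, dport)])
    return f"{a}|{b}|{proto}".encode()

def pick_output(key: bytes, n_outputs: int) -> int:
    # Stable digest -> stable output; avoids Python's randomized hash().
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:8], "big") % n_outputs

fwd = flow_key("10.0.0.1", "192.0.2.7", 51324, 443, 6)
rev = flow_key("192.0.2.7", "10.0.0.1", 443, 51324, 6)
assert pick_output(fwd, 4) == pick_output(rev, 4)   # same flow, same tool
```

For encapsulated traffic the same property must hold on the inner 5-tuple, which is exactly why hash-key selection (outer vs inner) matters.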
Design takeaways:
- Every “feature” (filter, slice, replicate, hash) must map to a measurable counter or watermark; otherwise it cannot be proven lossless.
- Replicate and slice decisions must be evaluated under 64B-heavy traffic and burst, not only under large packets and average load.
H2-5 · Hardware timestamps: where accuracy is won or lost (PTP/SyncE only for marking)
For a probe or recorder, timestamps are the spine of usability: event ordering, burst correlation, and cross-box alignment all depend on them. Accuracy is “won” when timestamping happens early (near ingress) and the local timebase is stable; it is “lost” when queues, clock-domain crossings, or write-path backpressure inject variable delay. PTP/SyncE/GNSS are treated here only as timebase inputs for the probe (marking), not as a full network timing design topic.
Where to stamp Timestamp point decides how much queue delay leaks into time
- Best case: stamp at PHY/MAC ingress before shared queues—lowest sensitivity to traffic load.
- Later stamping: after parsing, replication, or buffering—timestamp includes variable queue wait and arbitration delay.
- A practical rule: if timestamp variance grows with utilization or fan-out, the stamp point is too far downstream.
Timebase chain Time source → cleaner/PLL → distribution → timestamp unit
- Time source: GNSS / PTP / SyncE provide a reference; stability during loss-of-source depends on holdover.
- Clock cleaner / PLL: reduces short-term jitter and isolates noise; bad settings convert phase noise into timestamp jitter.
- Clock distribution: introduces skew and potential clock-domain crossings; consistency matters across all timestamp domains.
- Timestamp unit (TSU): stamps packets and writes metadata; its resolution and domain crossing define quantization and uncertainty.
The most useful mental model is to track three quantities separately: offset (fixed bias), jitter (short-term variance), and drift (slow change over time).
Error sources Common ways accuracy is lost (and what to log)
- Queue-induced jitter: variable FIFO/queue wait (worse under burst and replication) → correlate timestamp spread with queue watermarks.
- Domain crossing uncertainty: clock-domain crossings add non-determinism → track CDC events/exceptions where available.
- Holdover drift: losing the external source increases drift → log holdover state and drift alarms.
- Cleaner/PLL noise: insufficient jitter cleaning inflates short-term variance → monitor phase-noise/jitter status indicators if exposed.
Cross-device alignment checklist:
- Same epoch and time scale: devices must reference the same notion of time before comparing capture timelines.
- Calibrate fixed offset: measure and compensate deterministic offsets introduced by distribution and stamp placement.
- Monitor drift: track drift over time and during holdover; alert when drift rate changes.
- Prefer ingress stamping: keep stamping upstream so traffic load does not rewrite time ordering.
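The three quantities (offset, jitter, drift) can be estimated from any (reference, probe) timestamp comparison you trust; the sampling method itself is left out here as an assumption. A minimal sketch:

```python
import statistics

def timebase_health(samples: list) -> dict:
    """samples: (reference_time_s, probe_time_s) pairs, in arrival order."""
    errors = [probe - ref for ref, probe in samples]
    offset = statistics.fmean(errors)                          # fixed bias
    jitter = statistics.stdev(errors) if len(errors) > 1 else 0.0
    # Drift: least-squares slope of error vs reference time (s per s).
    t = [ref for ref, _ in samples]
    t_mean = statistics.fmean(t)
    num = sum((ti - t_mean) * (e - offset) for ti, e in zip(t, errors))
    den = sum((ti - t_mean) ** 2 for ti in t) or 1.0
    return {"offset_s": offset, "jitter_s": jitter, "drift_s_per_s": num / den}
```

Alerting on a change in drift rate (holdover onset) is usually more useful than alerting on absolute offset alone.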
H2-6 · Buffering & burst absorption: sizing rules and the hidden cost of replication
Buffering is not a mystery feature. It is a capacity to absorb the gap between instantaneous arrival rate and the observation pipeline’s drain rate. The non-obvious part is that replication and write-path variability can turn a “localized slowdown” into a system-wide backpressure event. This section turns burst absorption into a small set of sizing variables and evidence signals.
Burst model A usable sizing intuition (with fan-out)
- When a burst arrives faster than packets can be drained, occupancy grows by the rate gap.
- Conservative intuition: B_needed ≈ (R_in − R_out) × t_burst × Fanout.
- Fanout is a multiplier: a 1→N replication tree increases effective egress demand and raises watermark peaks.
The correct question is not “how much buffer exists”, but “what burst profile and replication configuration can be absorbed with drop counters staying at zero”.
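A toy occupancy model makes that question testable on paper before the lab: given a buffer, a drain rate, and a declared burst, does the peak watermark stay below the buffer with zero overflow? All numbers below are illustrative.

```python
# Toy time-stepped queue model: burst in, fixed drain out, finite buffer.
# Units: Gbps for rates, µs for time; purely illustrative.

def watermark_trace(buffer_bytes: float, drain_gbps: float, burst_gbps: float,
                    burst_us: float, fanout: int = 1,
                    step_us: float = 10, total_us: float = 2000):
    occ = peak = dropped = 0.0
    t = 0.0
    while t < total_us:
        rate_in = burst_gbps * fanout if t < burst_us else 0.0
        occ += (rate_in - drain_gbps) * 1e9 * step_us * 1e-6 / 8   # bytes
        occ = max(occ, 0.0)
        if occ > buffer_bytes:                 # overflow becomes drops
            dropped += occ - buffer_bytes
            occ = buffer_bytes
        peak = max(peak, occ)
        t += step_us
    return peak, dropped

# An 8 MB buffer absorbs this burst at 1→1 but overflows once fan-out is 1→4:
print(watermark_trace(8e6, drain_gbps=40, burst_gbps=100, burst_us=500))
print(watermark_trace(8e6, drain_gbps=40, burst_gbps=100, burst_us=500, fanout=4))
```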
Buffer tiers FIFO vs DRAM vs host memory (what each is good at)
- On-chip FIFO/SRAM: fastest and most deterministic; limited depth; best for short microbursts and per-stage smoothing.
- External DRAM: deeper absorption for longer bursts; performance depends on arbitration patterns and contention.
- Host memory (DMA): seemingly deep but least deterministic; susceptible to PCIe scheduling, CPU pressure, and driver noise.
Arbitration Policies matter only if “drop reasons” are visible
- Queue arbitration chooses who drains first during contention (per-port or per-class priority).
- Drop behaviors are only diagnosable if the platform exports drop reason counters and watermark telemetry.
- Two common names: tail drop (drop at full) and early drop (drop before full under policy); the key is visibility.
Hidden costs Replication and write jitter amplify bursts
- Replication amplifies demand: each additional tool output consumes egress scheduling and buffer budget.
- One slow sink can backpressure many: a recorder with tail-latency write spikes can push back into shared queues.
- Rule changes can amplify bursts: changing filters or slicing can unintentionally increase fan-out or shift traffic to a single hot output.
Evidence signals to collect:
- Peak watermarks: per-stage and per-egress queue peaks, with timestamps for correlation.
- Drop reasons: ingress overflow vs egress queue drop vs DMA/storage backpressure.
- Output utilization: per-output bandwidth and skew; replication changes should move these numbers predictably.
- Write-path tail latency: capture backpressure aligns with P99/P999 latency spikes, not average throughput.
H2-7 · Capture formats & metadata: PCAP at scale, indexing, and “wire data vs derived data”
Storing packets is not the same as making them usable. At scale, capture must be split into two deliverables: wire data for faithful replay, and derived data (metadata) for fast search and triage. The engineering goal is simple: keep the write path sequential and predictable, while making queries hit an index first.
What is captured Wire data vs derived data (why both are needed)
- Wire data: full packets or truncated packets. Truncation saves write bandwidth and capacity, but reduces forensic depth and replay fidelity.
- Derived data (metadata): flow records, counters, events, labels, and file pointers. Metadata answers “where to look” in seconds.
- A scalable system avoids the trap of “only wire data” (slow search) and “only metadata” (no payload replay).
File format PCAP vs PCAPNG (engineering choice, not a spec lecture)
- PCAP: broad tool compatibility and simple pipelines—often the default when interoperability is the priority.
- PCAPNG: better suited for carrying extra capture context (interfaces, annotations/options) when scale and multi-port capture matter.
- The decisive question is: which format preserves the metadata needed for operations without forcing random writes.
Format choice should be paired with a chunk strategy and indexing strategy; otherwise, “better format” still produces an unsearchable archive.
Indexing Minimal index dimensions that make archives searchable
- Time: enables fast “minute/second window” retrieval and aligns with alarms and incident timelines.
- 5-tuple / flow key: enables “show this session” queries without scanning entire files.
- Interface / direction: makes multi-port, multi-link capture debuggable.
- Tags: connects capture to events (burst windows, anomaly IDs, maintenance windows) without touching payload.
Write path Chunking + sequential append (avoid random-write amplification)
- Split capture into chunks (by time window or size). Each chunk becomes an immutable object with a chunk ID and time range.
- Write wire data as sequential append with alignment; avoid in-place updates during capture.
- Build the index as a separate append-only stream that references chunk IDs and byte ranges.
Integrity controls:
- Per-chunk hash detects silent corruption and verifies replay fidelity.
- An optional hash chain makes tampering detectable by linking each chunk to the previous one.
- Integrity metadata should be stored alongside the index so audits do not require scanning payload files.
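A minimal sketch of this write-path pattern: immutable chunks written by sequential append, an append-only JSONL index, and an optional hash chain linking each chunk to the previous one. The layout and field names are illustrative, not a standard format.

```python
import hashlib, json, os

class ChunkWriter:
    """Append-only capture writer: immutable chunks + hash-chained index."""

    def __init__(self, directory: str, chunk_bytes: int = 256 << 20):
        # Assumes `directory` already exists.
        self.dir, self.limit = directory, chunk_bytes
        self.chunk_id, self.written = 0, 0
        self.prev_hash = "0" * 64                    # chain anchor
        self.t_first = self.t_last = None
        self.hasher = hashlib.sha256(self.prev_hash.encode())
        self.data = open(self._path(), "ab")
        self.index = open(os.path.join(directory, "index.jsonl"), "a")

    def _path(self) -> str:
        return os.path.join(self.dir, f"chunk-{self.chunk_id:06d}.bin")

    def append(self, record: bytes, ts: float) -> None:
        if self.written and self.written + len(record) > self.limit:
            self._seal()
        self.data.write(record)                      # sequential append only
        self.hasher.update(record)
        self.written += len(record)
        self.t_first = ts if self.t_first is None else self.t_first
        self.t_last = ts

    def _seal(self) -> None:
        self.data.close()
        digest = self.hasher.hexdigest()             # covers prev hash + payload
        entry = {"chunk": self.chunk_id, "bytes": self.written,
                 "t_first": self.t_first, "t_last": self.t_last,
                 "sha256": digest, "prev": self.prev_hash}
        self.index.write(json.dumps(entry) + "\n")   # index is never rewritten
        self.index.flush()
        self.prev_hash, self.chunk_id, self.written = digest, self.chunk_id + 1, 0
        self.t_first = self.t_last = None
        self.hasher = hashlib.sha256(self.prev_hash.encode())
        self.data = open(self._path(), "ab")

    def close(self) -> None:
        if self.written:
            self._seal()
        self.data.close()
        self.index.close()
```

Because each index entry carries a time range and a chunk ID, time-window queries can hit the index first and open only the chunks they need.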
H2-8 · Storage control: NVMe/RAID throughput, write jitter, and keeping capture truly lossless
“Lossless capture” is usually lost in storage, not in headline bandwidth. The dominant failure mode is tail-latency spikes (P99/P999) caused by internal housekeeping (GC), cache transitions, or rebuild activity. Those spikes create backpressure, watermarks climb, and drops appear upstream. This section provides a diagnosis-first view: prove causality by aligning timelines.
Core idea Peak throughput is not the metric; tail latency is
- Average write bandwidth can look fine while P99/P999 write latency stalls the write path long enough to overflow buffers.
- Capture systems fail on variance: a rare long stall is enough to trigger backpressure and drops during microbursts.
- Use latency distribution and queue depth, not only “GB/s”, to judge whether storage is safe for line-rate capture.
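A small sketch of why the distribution, not the mean, is the acceptance metric. Latency samples would come from your own instrumentation (eBPF, blktrace, or application timers); the numbers below are synthetic.

```python
import math

def percentile(sorted_vals: list, p: float) -> float:
    """Nearest-rank percentile over a pre-sorted list."""
    idx = math.ceil(p / 100 * len(sorted_vals)) - 1
    return sorted_vals[max(0, idx)]

def tail_report(latencies_ms: list) -> dict:
    s = sorted(latencies_ms)
    return {f"p{p}": percentile(s, p) for p in (50, 99, 99.9)}

# 0.2 ms writes with 20 stalls of 50 ms in 10,000 samples: the mean is a
# harmless-looking ~0.3 ms, P99 still looks clean, and only P99.9 exposes
# the stall that is long enough to overflow capture buffers.
lat = [0.2] * 9980 + [50.0] * 20
print(tail_report(lat))   # {'p50': 0.2, 'p99': 0.2, 'p99.9': 50.0}
```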
Causes Where write jitter comes from (capture-relevant only)
- NVMe GC / write amplification: background cleanup can steal service time and inflate tail latency.
- Cache transitions: when fast caching is exhausted, latency rises and becomes bursty.
- RAID rebuild or scrubbing: background IO changes latency distribution even if bandwidth remains high.
Controls Make the write path predictable
- Chunked, aligned, sequential writes: avoid random IO amplification and reduce metadata churn under load.
- Queue depth discipline: keep write queues within a stable operating region; persistent growth indicates backpressure.
- PLP (power-loss protection): stabilizes write semantics and reduces “partial write” risk for data + index consistency.
The goal is not maximal speed; the goal is minimal tail latency while sustaining the capture’s sequential append pattern.
RAID tradeoffs Bandwidth vs rebuild risk vs CPU overhead
- RAID can increase sustained throughput and parallelism, but rebuild/scrub windows can inject latency spikes.
- For capture, the decisive question is: what happens to P99/P999 latency during rebuild or background maintenance?
- Monitor CPU overhead and controller contention; storage “success” still fails if host-side scheduling adds jitter.
Evidence to time-align:
- IO latency distribution: log P50/P99/P999 and mark spike windows.
- IO queue depth: correlate sustained depth growth with spike windows.
- Backpressure events: writer backlog / DMA stalls / flush stalls must rise before drops.
- Drop reason counters: storage/DMA-related reasons should align with the same windows.
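The "prove causality by aligning timelines" step can be automated crudely: check whether each drop window overlaps an IO spike window. How windows are extracted from telemetry is an assumption here; this only shows the alignment test.

```python
# Sketch: attribute drop windows to storage when they overlap IO spike
# windows (epoch-second intervals). Inputs come from your own telemetry.

def overlaps(a, b, slack_s: float = 1.0) -> bool:
    # a, b are (start_s, end_s) intervals; slack tolerates clock fuzz.
    return a[0] - slack_s <= b[1] and b[0] - slack_s <= a[1]

def attribute_drops(drop_windows: list, io_spikes: list) -> list:
    return [(d, "storage tail latency"
                 if any(overlaps(d, s) for s in io_spikes)
                 else "look upstream (buffers, DMA, replication)")
            for d in drop_windows]

print(attribute_drops([(100.0, 101.0), (250.0, 250.5)], [(99.5, 100.8)]))
```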
H2-9 · Failure modes & troubleshooting: symptom → root cause → fix
Effective CGNAT troubleshooting starts with a short decision path: classify the symptom, confirm with minimal counters, then apply a targeted fix.
This chapter stays “CGNAT-local”: sessions/ports/table/drops/log pipeline signals are enough to narrow the failure class without pulling in external protocol detail.
Use the cards below like a field playbook: each is five lines: Symptom, Fast check, Likely root cause, Fix, Key counters.
Fault cards (field-usable) — five fixed lines per card
1) Port exhaustion / block hot spot
Symptom: new flows fail; only a subset of users/services degrade; failures cluster in time.
Fast check: port block utilization distribution becomes highly skewed; CPS spikes; drops show allocation/exhaustion reasons.
Likely root cause: hot blocks/pools saturate while averages look acceptable (skew hides risk).
Fix: reduce skew (rebalance blocks/pools), increase headroom where the skew concentrates, and verify skew flattening plus drop-reason recovery.
Key counters: port block p95/p99 utilization, pool headroom, CPS, drops-by-reason.
2) “Gbps is fine” but setup collapses
Symptom: throughput remains high, yet new sessions time out; setup rate falls off a cliff.
Fast check: CPS falls while sessions plateau; create-path drops increase; table occupancy stays high or churn spikes.
Likely root cause: create/update path is saturated (inserts, updates, or reclaim pressure), not the steady-state forwarding.
Fix: cut create-path cost (reduce churn drivers), keep occupancy below jitter threshold, and confirm CPS recovery during burst tests.
Key counters: CPS, create-path drops, occupancy, aging/churn, collision/chain depth.
3) Table jitter / early reclaim
Symptom: sessions are reclaimed too early; retransmissions rise; tail latency spikes periodically.
Fast check: aging/churn peaks align with latency spikes; collision/chain depth increases; occupancy remains high.
Likely root cause: aging/reclaim cycle becomes expensive and bursty; hot buckets amplify tail behavior.
Fix: tune aging to reduce churn, rebalance buckets to reduce collisions, and validate that churn peaks no longer trigger tail spikes.
Key counters: churn/aging rate, occupancy, collision/chain depth, tail indicators (if available).
4) Asymmetric path → return flow state miss
Symptom: one-way connectivity; intermittent “works then breaks”; failures are direction-dependent.
Fast check: state-miss drops rise for return-direction traffic; hit/miss balance becomes asymmetric during the incident window.
Likely root cause: forward and return packets do not hit the same state domain/shard, so return lookups miss.
Fix: enforce flow-to-state consistency (same flow lands in the same shard/state domain) and confirm miss drops disappear after change.
Key counters: state-miss drops by reason, per-direction hit/miss (or equivalent), shard imbalance indicators.
5) “Random” loss that is packet-size dependent (PMTU)
Symptom: small packets succeed but larger payloads fail; issues correlate with specific size ranges.
Fast check: drops spike in certain size bins; oversize/fragment-related counters increase during failures.
Likely root cause: path MTU constraints or size-dependent handling triggers drops that look random at flow level.
Fix: make size-dependent handling consistent and validate with controlled size sweeps until the spike disappears.
Key counters: packet-size histogram (if available), oversize/fragment counters, drops by reason.
6) Fragmentation / checksum inconsistency
Symptom: intermittent loss with no clear CPU spike; failures show weak correlation to throughput.
Fast check: fragment-related drops rise; checksum-related drops appear; issue reproduces only under specific packet patterns.
Likely root cause: fragmentation path or checksum update path diverges from the main translation path.
Fix: unify translation behavior for all packet paths and verify checksum/fragment drops return to baseline.
Key counters: fragment drops, checksum drops, drops by reason, packet pattern correlation.
7) Logging backpressure spillover
Symptom: throughput falls but CPU is not high; queue/watermark signals look abnormal.
Fast check: log backlog watermark rises first; export latency rises; log drops may appear; data-plane drops follow later.
Likely root cause: log pipeline cannot drain; backlog feeds back into data plane (backpressure).
Fix: reduce log pressure (record size/event rate), strengthen decoupling (buffers/batching), and confirm backlog leads no longer precede drops.
Key counters: backlog watermark, export latency, log drops, drops by reason, CPS over time.
8) Drops surge with no clear single “big” metric change
Symptom: drops increase suddenly; no single aggregate metric explains it; impact is uneven.
Fast check: drops by reason show one class dominating; distributions (blocks/buckets) worsen even if averages stay flat.
Likely root cause: localized hot spots (port blocks or hash buckets) create tail failures that aggregates mask.
Fix: switch to distribution-first view, mitigate hot spots, and verify the dominating drop-reason class returns to baseline.
Key counters: drops by reason, port block distribution, collision/chain depth distribution, occupancy.
H2-10 · High availability: state sync, failover, and keeping mappings consistent
HA for CGNAT is hard because state is large: replication must preserve enough mapping state for continuity without turning synchronization into a second data-plane bottleneck.
The practical trade-off is straightforward: stronger session continuity requires more replication load, which can reduce CPS and increase tail behavior if not isolated.
Success criteria: after failover, mapping consistency holds (no mass state misses) and replication load does not push CPS into a cliff during bursts.
What state must be replicated (minimal set vs optional)
Must replicate (minimal set)
Active session mapping identity (inside/outside address+port mapping) and enough lifecycle info to keep lookups consistent after takeover.
Goal: prevent mass state misses immediately after failover.
Optional replicate (only if justified)
Non-essential metadata that improves investigation or reporting but is not required for mapping continuity.
Rule: if it can be rebuilt, avoid replicating it under load.
Replication load vs data-plane health (how to avoid a CPS cliff)
Replication frequency and bandwidth
More frequent updates reduce continuity gaps but increase write amplification and contention risk.
Practical readout: CPS and create-path stability under burst should not degrade when replication becomes busy.
How to detect “sync is hurting the data plane”
If replication queue/backlog rises first and CPS drops next, synchronization load is likely spilling into the packet path.
Correlate replication backlog (if available) with CPS, drops by reason, and churn peaks.
Failover mapping consistency (avoid mass state misses)
Design goal: after takeover, existing flows should still resolve to the expected mapping state domain.
Operational test: during controlled failover, verify state-miss drops do not surge and that session continuity is preserved within expected limits.
Use the same “drops by reason + distribution view” approach: a short surge may be acceptable; a sustained state-miss plateau indicates broken consistency.
H2-11 · Validation & acceptance checklist: how to certify a TAP/Probe is production-ready
“Production-ready” is not a datasheet claim. It is a repeatable acceptance report: under a declared traffic profile (packet mix, bursts, replication, timestamp mode, and storage mode), drops stay at zero and the evidence (counters, watermarks, backpressure, utilization, IO tail latency, timestamp stability) aligns in the same time window.
Acceptance package What must be delivered with the test results
- Test profile: port rates, packet-size distribution (64B / mixed / IMIX), flow count, replication factor, timestamp mode, storage mode.
- Evidence bundle (time-aligned): drop counters (per-port/per-reason), key queue/FIFO watermarks, backpressure/DMA stall signals, per-port bps+pps utilization, and (if writing) IO p99/p999 + queue depth.
- Runbook: minimal reproducible steps (1→1 first, then enable replication, then enable filtering/masking, then enable storage last).
- Config snapshot: filter rules, masking/slicing, replication map, hash key, output gating state, timestamp source selection.
A clean acceptance result is reproducible by a second engineer without hidden knobs.
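One way to keep acceptance honest is to make the declared profile and evidence bundle explicit data rather than prose. A minimal sketch (field names are illustrative, not a standard schema):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TestProfile:
    port_rate_gbps: float
    packet_mix: str            # "64B", "IMIX", "mixed"
    flow_count: int
    replication_factor: int
    timestamp_mode: str        # e.g. "PTP-locked", "holdover"
    storage_mode: str          # e.g. "no-write", "sustained-write"

@dataclass(frozen=True)
class EvidenceBundle:
    drops_by_reason: dict      # per-reason drop counters over the run
    peak_watermarks: dict      # per-queue peak occupancy
    io_p999_ms: float          # 0.0 when storage mode is "no-write"

def accept(profile: TestProfile, ev: EvidenceBundle) -> bool:
    """Pass only if every drop counter stayed at zero for the declared run."""
    return all(v == 0 for v in ev.drops_by_reason.values())
```

Re-running acceptance then means reconstructing the same TestProfile and comparing bundles, which is exactly the "reproducible by a second engineer" requirement.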
A · Performance Line-rate + 64B Mpps + mixed packet sizes + multi-port concurrent
A bps-only claim can hide failures at 64B where Mpps and arbitration dominate.
B · Microburst Burst injection validates buffer headroom and “lossless under spikes”
C · Replication & load-balance 1→N fan-out, hash consistency, and explainable output balance
- Fan-out ladder: validate 1→2→4→8 (or declared maximum) with fixed traffic profile at each step.
- Hash consistency: the same flow must map to the same tool/output when “session consistency” is enabled.
- Output balance: verify no single “hot” port becomes the hidden choke (check per-port bps+pps + watermark).
D · Timebase & hardware timestamps Lock, holdover, temperature drift, and reboot convergence
- Lock: record offset/jitter baseline after locking to the declared time source.
- Holdover: disconnect time source and record drift vs time; define the acceptable holdover window.
- Thermal drift: observe drift under temperature variation in the declared operating envelope.
- Reboot convergence: measure time-to-stable timestamps after restart (cold/warm where applicable).
E · Storage Sustained write, tail latency, power-loss behavior, and index searchability
- Sustained write: run continuous capture for N hours (e.g., 4/8/24h steps) under the declared packet profile.
- Tail latency: monitor IO p99/p999 and queue depth; prove that spikes do not create capture backpressure.
- Power-loss behavior: validate recovery (where supported) and confirm index consistency for post-event search.
- Indexing: confirm queries by time window / interface / flow key return expected results (not just “files exist”).
Peak bandwidth alone is not sufficient; sustained behavior and IO tail latency decide whether capture stays lossless.
F · Reliability Fail-open behavior, upgrade rollback, and complete event logging
- Fail-open / bypass: validate that device faults (or loss of power where designed) do not disrupt the production link.
- Upgrade & rollback: prove a safe return to a known-good version without losing critical configuration.
- Logs: confirm that drop reasons, watermark events, backpressure states, and timebase status changes are recorded and exportable.
Reference BOM Example MPNs commonly used in TAP/Probe platforms (verify per SKU)
These are examples to anchor acceptance criteria to concrete building blocks. Exact fit depends on port speeds, timestamp needs, and storage architecture.
Hardware timestamp / capture NIC
Intel Ethernet Controller E810 — PTP-capable family often used when ingress timestamp quality matters.
Clock cleaner / DPLL (timebase conditioning)
Skyworks/SiLabs Si5345 — jitter-attenuating clock device family.
Analog Devices AD9545 — DPLL-centric time synchronization device family.
PCIe switch (multi-NVMe / multi-endpoint fabrics)
Broadcom PEX88096 — PCIe switch example SKU.
Microchip Switchtec PFX (e.g., PM4162) — PCIe switch family example SKU.
BMC / management controller
ASPEED AST2600 — widely used BMC SoC for OOB management, sensors, and lifecycle control.
NVMe SSD (capture storage)
Solidigm D7-P5520 — data center NVMe SSD family often evaluated for sustained write behavior and PLP options (SKU-dependent).
Switch silicon (merchant switching, when used)
Broadcom StrataXGS Tomahawk family — example class of switch silicon used in high-density port platforms (exact SKU varies).
Acceptance tests must be written against observables (drops, watermarks, tail latency, timestamp drift), not brand promises. Use the exact SKU datasheets to set the final thresholds.
H2-12 · FAQs (Network TAP / Probe)
Each answer is kept concise (roughly 40–70 words) and stays within this page’s scope: observation, capture, replay, timestamps, buffering, and storage.