Carrier-Grade NAT (CGNAT) Architecture, Scaling & Troubleshooting
Carrier-Grade NAT (CGNAT) enables large-scale IPv4 sharing in ISP networks by translating address/port pairs with a high-speed state (flow) table and the minimum logging/telemetry required for traceability. In practice, CGNAT success is decided by CPS/Mpps/session scaling, port-pool hotspot control, and log/HA backpressure isolation—not by headline Gbps alone.
What CGNAT is (and what it is NOT): boundary & placement
CGNAT is a high-scale translation system that shares scarce public IPv4 addresses across many private users by mapping private tuples to public IP:port pairs.
It sits at the private/shared address realm → public Internet boundary, where it must sustain massive state (sessions), fast new-flow creation (CPS), and stable forwarding under small-packet bursts.
Boundary sentence: CGNAT is primarily about state tables + address/port pools + provable logs/telemetry—not security policy inspection, attack detection, or access-side protocol stacks.
What readers should expect from this page
- Capacity reality: sizing beyond “Gbps” using sessions, CPS, Mpps, port utilization, and log pipeline limits.
- Architecture depth: how flow tables, hashing, timeouts/aging, and packet I/O paths determine real performance.
- Troubleshooting map: symptom → root cause patterns → the first counter to check and the first fix to try.
CGNAT’s engineering core (3 responsibilities)
- Translate: allocate and apply a public IP:port mapping (NAT44 or NAT64), including consistent return-path handling for mapped flows.
- Track: maintain a high-scale flow/state table (create / lookup / update / age-out) so packets hit a fast path after the first packet.
- Prove: generate logs and telemetry that allow reliable mapping trace-back and operations visibility—without destabilizing forwarding under load.
Design intent: the remaining chapters will repeatedly tie symptoms and sizing decisions back to the same three primitives: state table, address/port pools, and logs/telemetry.
Capacity KPIs that actually break CGNAT (not just “Gbps”)
A CGNAT platform fails when any one of five ceilings is hit: concurrent sessions, setup rate (CPS), Mpps under small packets, log pipeline rate, or port utilization hot spots.
Throughput-only testing mostly validates the “hit path” of existing flows. Real deployments also stress the “create path” (new flows), memory writes, and logging—where collapses happen first.
Rule of thumb: sizing must match a traffic model (flow length distribution + burstiness + packet size mix + log requirement), not a single Gbps number.
The KPI table to use for sizing, acceptance tests, and troubleshooting
| KPI | Typical symptom | Root-cause pattern | What to measure first | First fix to try |
|---|---|---|---|---|
| Concurrent sessions (state table occupancy) | New connections fail, “random” drops, short flows disappear | Table near full → higher collision/eviction; aging churn grows; memory footprint spikes | Flow-table occupancy, eviction/age-out rate, collision counters, memory watermarks | Re-tune timeouts/aging, improve sharding, increase table capacity, reduce per-flow state |
| Setup rate (CPS, new-flow create path) | Throughput looks fine but logins / short sessions time out | Create path saturates: hash insert cost, lock/atomic contention, write amplification, log-event bursts | Create-fail counters, miss-path latency, lock contention signals, per-bucket insert retries | Increase flow sharding, reduce shared locks, optimize hashing/buckets, isolate logging from create path |
| Mpps under small packets (64B vs IMIX) | IMIX test passes, 64B bursts collapse; drops rise quickly | Per-packet fixed cost dominates: queue/DMA overhead, cache misses, frequent state writes | RX/TX drops by reason, queue depth/watermarks, per-core/engine utilization, cache-miss trends | Rebalance queues, enable efficient batching, reduce per-packet writes, improve RX/TX distribution |
| Log pipeline rate (events/sec) | Latency jitter; performance sags without obvious packet-CPU saturation | Logging backpressure: buffers fill → sync stalls or drops → data plane slowed indirectly | Log backlog, buffer watermarks, dropped-log counters, write/ship latency | Decouple logging path, increase drain bandwidth, aggregate/compact events, apply safe sampling where allowed |
| Port utilization (hot spots) | Some users fail while averages look fine; “port exhausted” pockets | Uneven port-block allocation, hash skew, pool fragmentation, noisy-neighbor effects | Per-public-IP port usage distribution, per-block occupancy, skew/variance metrics | Rework block allocation strategy, add skew-aware hashing, rebalance pools, reserve headroom per shard |
- Trap: “Gbps is enough” sizing. Fix: require CPS, sessions, Mpps, and log-rate targets in every acceptance plan.
- Trap: average-only monitoring. Fix: track distributions (per pool / per shard / per block), because hot spots cause the first visible failures.
Address & port management: pools, blocks, overloading, deterministic mapping
CGNAT “public capacity” is not only the number of public IPv4 addresses—it is the distribution of available ports per public IP and how evenly those ports are consumed.
Port-block allocation turns a shared public IP into a manageable resource (chunks), but it can also create localized exhaustion (hot blocks) even when the global average still looks healthy.
Decision rule: “IP shortage” is a pool-wide problem; “port shortage” is often a hot-spot or skew problem visible only in per-IP / per-block distributions.
Glossary (to prevent common misinterpretations)
Pool
A managed set of public IPv4 addresses (often partitioned by region, shard, or capacity domain).
Block
A contiguous port range on a given public IP reserved for a subscriber bucket or processing shard.
Overloading
Multiple private users share the same public IP, separated by distinct public port allocations.
Deterministic mapping
A predictable rule maps users (or buckets) to specific public IPs and port ranges to simplify trace-back and reduce event volume.
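The deterministic-mapping rule above can be made concrete with a small sketch. All parameters (block size, port range, pool addresses, the index-based rule itself) are illustrative assumptions, not a standard algorithm; the point is that the mapping is computable in both directions, so trace-back needs only the rule plus a timestamp, not a per-session log.

```python
# Minimal sketch of deterministic NAT mapping: each subscriber index maps to a
# fixed public IP and a contiguous port block. All parameters are illustrative;
# a real deployment must size the pool so every subscriber index fits
# (index < len(public_ips) * BLOCKS_PER_IP), or blocks will collide.
PORT_MIN, PORT_MAX = 1024, 65535
BLOCK_SIZE = 512                                     # ports reserved per subscriber
BLOCKS_PER_IP = (PORT_MAX - PORT_MIN + 1) // BLOCK_SIZE  # 126 blocks per public IP

def deterministic_block(subscriber_index: int, public_ips: list):
    """Return (public_ip, first_port, last_port) for a subscriber bucket."""
    ip = public_ips[(subscriber_index // BLOCKS_PER_IP) % len(public_ips)]
    block = subscriber_index % BLOCKS_PER_IP
    first = PORT_MIN + block * BLOCK_SIZE
    return ip, first, first + BLOCK_SIZE - 1

# Example: subscriber bucket 130 on a two-address pool.
ip, lo, hi = deterministic_block(130, ["198.51.100.1", "198.51.100.2"])
```

Because the mapping is a pure function of the subscriber index, an abuse report carrying public IP + port + time can be resolved without consulting per-flow logs, which is exactly the event-volume reduction the glossary entry describes.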
Three allocation patterns (and what they optimize)
| Pattern | What it optimizes | Failure pattern | Key counters to watch | First tuning lever |
|---|---|---|---|---|
| Per-subscriber blocks | Predictable fairness and trace-back; stable mapping per bucket | Heavy users exhaust their blocks early; localized failures | Per-block occupancy distribution, block-exhaust events, new-flow failures by bucket | Resize blocks, allocate multiple blocks to heavy buckets, add headroom per shard |
| Per-shard / line-card blocks | Locality (state + ports stay near processing); easier scaling by shards | Skew between shards; one shard exhausts while others idle | Per-shard port usage variance, per-shard new-flow failure rate | Rebalance shards, move pools, add skew-aware assignment |
| Hash-based blocks | Load spreading under diverse traffic; simpler stateless choice | Hash hot spots (unlucky keys) → uneven block usage | Per-public-IP port usage tail, hot-IP list, block-level variance metrics | Improve hashing inputs, add rehash / fallback, pre-split pools into more buckets |
- Step 1 (pool view): check how many public IPs are active and whether new allocations are failing globally. A true IP shortage is typically pool-wide and monotonic.
- Step 2 (distribution view): inspect per-public-IP and per-block port occupancy. Hot spots show a heavy tail: a small set of IPs/blocks approach full utilization.
- Step 3 (symptom shape): hot spots often appear as “some users/buckets fail while the average looks fine” and are time-correlated with bursts or locality.
- First fixes: rebalance blocks/pools, add headroom per shard, reduce skew, and avoid designs that require global coordination during bursts.
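The distribution-first triage in Step 2 can be sketched as a small check that compares the average block occupancy against a high percentile. Thresholds and field names here are illustrative assumptions; the pattern to detect is “healthy average, near-full tail.”

```python
# Distribution-first triage sketch: a pool-wide IP shortage raises the average,
# while a hot-spot (skew) problem leaves the average healthy but pushes the
# high percentile toward full. Threshold values are illustrative.
def block_skew_report(occupancy: list, hot_threshold: float = 0.9):
    """occupancy: per-block port utilization ratios in [0, 1]."""
    s = sorted(occupancy)
    avg = sum(s) / len(s)
    p99 = s[min(len(s) - 1, int(0.99 * len(s)))]
    hot = [i for i, u in enumerate(occupancy) if u >= hot_threshold]
    return {
        "avg": avg,
        "p99": p99,
        "hot_blocks": hot,
        # Skew signature: tail near full while the average still looks safe.
        "hot_spot": p99 >= hot_threshold and avg < hot_threshold,
    }

# 100 blocks: most half-used, three nearly exhausted -> averages hide the risk.
occ = [0.5] * 97 + [0.95, 0.97, 0.99]
report = block_skew_report(occ)
```

A monitoring rule built this way fires on the `hot_spot` flag rather than on pool-average utilization, matching the decision rule stated earlier in this chapter.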
This chapter sets the resource model used later: flow creation (H2-4) depends on a reliable and evenly consumable port/address supply, and operations (telemetry/logging) must expose distributions, not only averages.
Flow table architecture: fast path lookup, hashing, aging, collisions
CGNAT performance is dominated by flow/state table behavior: how quickly packets hit the fast path (lookup) and how safely the platform handles bursts of misses (create path).
Throughput tests often stress the hit path. Real-world collapses frequently occur in the miss/create path where inserts, memory writes, and event generation spike together.
Engineering takeaway: collision rate, aging/timeout policy, and write/lock contention determine whether high CPS turns into timeouts.
Packet-to-flow lifecycle (the minimal model)
- Parse key: extract a stable flow key (commonly 5-tuple plus necessary direction context).
- Hash & bucket: map the key into a bucket/shard to keep lookups O(1) and reduce shared contention.
- Hit path: read state → apply translation → update small metadata (timeouts/counters) → forward.
- Miss/create path: allocate state + reserve ports (from H2-3) → insert into table → emit necessary events.
- Aging: retire inactive flows without storms; avoid policies that amplify churn during bursts.
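The lifecycle above can be sketched as a toy flow table: key → hash → bucket, a light hit path, a heavier miss/create path, and an aging pass. This is a single-threaded illustration only; real CGNAT tables are sharded and lock-free, and use timer wheels instead of the full-table scan shown here (which is exactly the “expensive reclaim” pattern the next chapters warn about). Names and the timeout value are assumptions.

```python
import time

IDLE_TIMEOUT = 120.0  # seconds; illustrative UDP-style idle timeout

class FlowTable:
    """Toy flow/state table illustrating hit path, create path, and aging."""

    def __init__(self, buckets: int = 1 << 16):
        self.buckets = [dict() for _ in range(buckets)]

    def _bucket(self, key):
        # Hash & bucket: keeps lookups O(1) and bounds shared contention.
        return self.buckets[hash(key) % len(self.buckets)]

    def lookup_or_create(self, key, create_state, now=None):
        now = time.monotonic() if now is None else now
        b = self._bucket(key)
        entry = b.get(key)
        if entry is None:
            # Miss/create path: allocate state (port reservation, events
            # would also happen here in a real system).
            entry = {"state": create_state(), "last_seen": now}
            b[key] = entry
        else:
            # Hit path: keep updates minimal (timestamp only).
            entry["last_seen"] = now
        return entry["state"]

    def age_out(self, now=None):
        # NOTE: full-table scan shown for clarity; production systems use
        # timer wheels / LRU lists to avoid periodic reclaim storms.
        now = time.monotonic() if now is None else now
        removed = 0
        for b in self.buckets:
            stale = [k for k, e in b.items() if now - e["last_seen"] > IDLE_TIMEOUT]
            for k in stale:
                del b[k]
                removed += 1
        return removed

t = FlowTable(buckets=64)
key = ("10.0.0.1", 1234, "203.0.113.5", 443, "udp")
state = t.lookup_or_create(key, lambda: {"out_port": 3072}, now=0.0)
```

The second lookup of the same key must hit the fast path and return the original mapping; a re-create there would be the “CPS amplification” failure described below.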
Four “silent killers” that look like random failures
Hash collisions
Symptoms: p99 latency spikes, sporadic new-flow failures. Watch bucket depth/variance and insert retries.
Aging / timeout churn
Too short: constant re-creates (CPS amplification). Too long: occupancy rises, collisions grow. Watch eviction/age-out rate.
Lock / atomic contention
Shared buckets or global structures cause stalls under bursty creates. Watch contention signals and miss-path latency.
Write amplification
Excessive per-packet state updates trigger cache misses and memory bandwidth pressure. Watch cache-miss trends and per-flow update cost.
This chapter intentionally stays at the data-plane mechanics level. The operational requirement is simple: keep the hit path light, keep create bursts safe, and tune aging to avoid churn storms.
Memory & state scaling: where sessions really live (SRAM/DRAM) and why it matters
A CGNAT session table is a memory system before it is an algorithm: performance depends on which state stays in the fast tier (near-cache/SRAM) versus which state spills into the slow tier (DRAM).
When the working set grows or per-packet writes increase, the platform shifts from lookup-dominated to movement-dominated behavior. That is when CPS falls off a cliff and tail latency explodes.
Practical triage: use hit rate, lookup latency (p99), and reclaim/evict time to confirm a memory bottleneck.
State tiers (what belongs where)
Fast tier (small & near)
Keep flow index pointers, bucket heads, hot-entry metadata, and minimal counters—optimize for fast-path hits and predictable latency.
Slow tier (large & far)
Store full session records, cold entries, aging lists, and large buffers. Treat buffering as a bandwidth consumer, not “free space.”
Common collapse patterns (what they look like)
Cache thrash
Working set spills; hit rate drops; lookup cost rises; p99 latency spikes even if average looks acceptable.
Bandwidth saturation
Reads and writes queue; create path becomes write-heavy; CPS collapses before line rate is reached.
Reclaim jitter storms
Eviction/aging becomes expensive and periodic; bursts trigger timeouts during reclaim peaks.
Write amplification
Too many per-packet state touches; cache lines churn; memory traffic rises without throughput gains.
Symptom → Observation → First action (memory edition)
| Symptom | What to observe | First action to try |
|---|---|---|
| CPS falls off a cliff | Fast-tier hit rate falling; lookup p99 rising; create-path latency rising | Reduce write amplification; improve bucket/shard locality; keep fast-tier updates minimal |
| Tail latency spikes (p99/p999) | Cache miss surge; bucket depth variance; slow-tier fallback rate | Tighten hashing/bucketing; cap per-bucket depth; reduce spill frequency |
| Periodic timeouts in bursts | Evict/reclaim time peaks; eviction scans; aging list churn | Adjust aging to avoid churn; smooth reclaim; reduce full-table scans |
| Throughput stable but new flows fail | Miss/create path becomes write-heavy; insert retries; reclaim backpressure | Protect create path from heavy background work; isolate buffers/events from inserts |
| Small packets make it worse | Per-packet update cost dominates; memory BW rises faster than Gbps | Minimize per-packet touches; move updates to sampling or per-burst accounting |
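The “move updates to sampling or per-burst accounting” fix in the last row can be sketched as follows. The idea is to replace one shared-state write per packet with one write per RX batch; class and field names are illustrative assumptions.

```python
# Per-burst accounting sketch: accumulate per-packet counts in a cheap local
# variable and flush once per batch, so shared (slow-tier) state sees one
# write per burst instead of one per packet. Names are illustrative.
class BatchedCounter:
    def __init__(self):
        self.total_bytes = 0   # shared/slow-tier state: updated once per batch
        self._pending = 0      # local/fast-tier accumulator: touched per packet

    def on_packet(self, nbytes: int):
        self._pending += nbytes      # single cache-friendly write per packet

    def flush(self):
        self.total_bytes += self._pending   # one shared write per burst
        self._pending = 0

c = BatchedCounter()
for size in (64, 64, 1500):   # one RX batch of three packets
    c.on_packet(size)
c.flush()
```

At 64B line rate the difference between per-packet and per-batch shared writes is exactly the memory-bandwidth headroom this chapter is about.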
The goal is not “more memory,” but a stable locality plan: keep the hit path predictable, keep creates from turning into bulk writes, and keep reclaim from becoming a periodic pause.
10G/100G packet I/O: PHY is not the bottleneck, the I/O pipeline is
Line rate describes what the port can carry. Packet rate (Mpps) and burst behavior describe what the processing pipeline must absorb.
IMIX can pass while 64-byte traffic collapses because fixed per-packet costs dominate: queueing, DMA descriptors, buffer churn, and flow-shard consistency are stressed first.
Debug method: walk from port → RX queues → processing units → flow shards and confirm where the distribution becomes uneven or backpressured.
From port to flow table (step-by-step bottleneck map)
| Stage | What breaks under burst / small packets | Counter signals to check |
|---|---|---|
| Ingress port | Microbursts create instant queue pressure; early drops can appear even when average Gbps is moderate | Port-level drops, pause/backpressure indicators, burst correlation with p99 latency |
| RX queues (RSS) | Skewed steering makes one queue hot; one hot queue can collapse the system “locally” | Per-queue occupancy, per-queue drop, hot-queue variance (tail vs average) |
| DMA + descriptor rings | Descriptors exhaust or recycle too slowly; bursts hit ring limits first at high Mpps | Ring full events, refill rate, DMA stalls, buffer allocation failures |
| Buffer strategy | Small packets churn buffers; refill overhead and cache pressure rise rapidly | Buffer pool watermark, allocation latency, cache-miss trends during bursts |
| Processing units | One unit saturates while total utilization seems fine; create bursts amplify the imbalance | Per-unit utilization, per-unit miss/create latency, imbalance ratio |
| Flow shards | If a flow is handled inconsistently across shards, overhead rises and misses increase | Cross-shard access rate, miss/create spikes, shard hot-spot list |
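For the “RX queues (RSS)” row, a hot-queue check can be sketched as a tail-vs-mean comparison: one saturated queue drops packets while aggregate utilization still looks fine. The imbalance threshold and metric names are illustrative assumptions.

```python
# Hot-queue detection sketch: flag a queue whose load exceeds the mean by a
# configurable factor. Useful because aggregate Gbps/CPU hide a single
# saturated RSS queue. Threshold value is illustrative.
def hot_queue_check(per_queue_pps: list, imbalance_limit: float = 2.0):
    mean = sum(per_queue_pps) / len(per_queue_pps)
    worst = max(per_queue_pps)
    ratio = worst / mean if mean else 0.0
    return {
        "mean": mean,
        "worst": worst,
        "imbalance": ratio,
        "hot_queue": per_queue_pps.index(worst) if ratio > imbalance_limit else None,
    }

# 8 queues in Mpps: one carries most of the load (e.g. a single elephant
# flow or unlucky RSS key), while the average still looks moderate.
stats = hot_queue_check([1.0, 1.1, 0.9, 1.0, 1.0, 9.0, 1.0, 1.0])
```

When the check fires, the next step in the bottleneck map applies: confirm per-queue drops on the hot queue before touching the processing units downstream.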
Telemetry & observability: the minimum counters that prevent blind operation
A stable CGNAT operation depends on visibility into sessions, port/blocks, table health, and drops by reason—not on average throughput alone.
Early warnings come from distribution signals (skew across pools/blocks/buckets) and backlog signals (log pipeline watermarks), often before outright drops appear.
The minimum dashboard below is designed to catch port exhaustion, hot spots, and table jitter before service impact.
Minimum dashboard (≤10 metrics) — each exists to detect one failure mode
| Metric | Look at it to detect | Early warning pattern |
|---|---|---|
| Concurrent sessions | State scale approaching capacity and higher reclaim cost | High plateau + rising churn indicators |
| CPS (setup rate) | Create-path stress and short-flow surges | Sharp rise + new-flow latency/drops follow |
| Drops by reason | Where the pipeline fails (queue/ring/table/alloc/backpressure) | One reason class dominates; shift over time |
| Flow table occupancy | Table pressure and jitter risk | Sustained high occupancy + rising churn |
| Collision rate / chain depth | Hash hot spots and lookup tail latency risk | Tail grows even when averages look stable |
| Aging rate / churn | Expensive reclaim cycles and periodic pauses | Peaks align with latency spikes or timeouts |
| Public IPv4 pool utilization | Address scarcity trend (pool depletion) | Pool headroom shrinking steadily |
| Port block utilization distribution | Port hot spots and block-level exhaustion | Skew worsens (high percentile near full) |
| Log backlog watermark | Backpressure conditions that will spill into data plane | Watermark rises first, then drops/timeouts |
- Port exhaustion risk: port block distribution skew worsens + CPS rises + drops show allocation/exhaustion reasons.
- Table jitter risk: occupancy stays high + collision/chain depth increases + aging/churn rises.
- Backpressure risk: log backlog watermark rises + log drops/queueing delays appear + data-plane drops follow.
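The three composite warnings above can be encoded as simple boolean rules over dashboard metrics. Field names and thresholds here are illustrative assumptions; the principle is that each risk fires on a *combination* of signals, never on a single metric.

```python
# Early-warning sketch: combine distribution, trend, and backlog signals into
# the three composite risks described above. All keys/thresholds illustrative.
def early_warnings(m: dict) -> list:
    w = []
    if m["port_block_p99"] > 0.9 and m["cps_trend_up"] and m["alloc_fail_drops"] > 0:
        w.append("port-exhaustion risk")
    if m["table_occupancy"] > 0.8 and m["chain_depth_trend_up"] and m["churn_trend_up"]:
        w.append("table-jitter risk")
    if m["log_backlog_trend_up"] and m["log_drops"] > 0:
        w.append("backpressure risk")
    return w

sample = {
    "port_block_p99": 0.95, "cps_trend_up": True, "alloc_fail_drops": 12,
    "table_occupancy": 0.6, "chain_depth_trend_up": False, "churn_trend_up": False,
    "log_backlog_trend_up": False, "log_drops": 0,
}
```

With these inputs only the port-exhaustion rule fires, which matches the intent: skew plus rising CPS plus allocation-failure drops, even though table and logging metrics are healthy.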
NAT logging at scale: volume, correlation, time, and storage backpressure
Logging is the second critical path in CGNAT: traceability and operations depend on it, and backlog can translate into data-plane impact.
Capacity planning must be event-driven (CPS and session events), not throughput-driven (Gbps). Backpressure often appears first as rising watermarks and delayed export.
Key test: if backlog rises before drops/timeouts, the performance issue is likely log-pipeline backpressure.
Minimum log record (write less, but make it searchable)
Time
Timestamp with clear time base and resolution for ordering, window queries, and correlation.
Mapping
Inside addr/port and outside addr/port (or equivalent mapping pair) for traceability.
Session identity
A compact session id or flow key hash to correlate create/close and detect duplicates.
Resource context
Public pool id and/or port block id to pinpoint hot spots and exhaustion patterns.
Back-of-the-envelope volume model (event-driven)
Estimation model
log_events_per_sec ≈ CPS × events_per_session
bytes_per_sec ≈ log_events_per_sec × bytes_per_record
storage_per_day ≈ bytes_per_sec × 86400
events_per_session is often 1 (create) or 2 (create + close). Adding periodic “update” events can multiply volume quickly and increase backpressure risk.
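The estimation model above can be run directly. The input values below are illustrative (200k CPS peak, create + close events, a compact 64-byte record), not a recommendation.

```python
# Event-driven log volume model from the text:
#   events/s = CPS x events_per_session
#   bytes/s  = events/s x bytes_per_record
#   GB/day   = bytes/s x 86400 / 1e9   (decimal GB)
def log_volume(cps: float, events_per_session: float, bytes_per_record: int):
    eps = cps * events_per_session
    bps = eps * bytes_per_record
    gb_per_day = bps * 86400 / 1e9
    return eps, bps, gb_per_day

# Illustrative peak: 200k CPS, create+close logging, ~64 B per record.
eps, bps, gb_day = log_volume(cps=200_000, events_per_session=2, bytes_per_record=64)
```

Even this modest configuration produces roughly 2.2 TB of raw records per day, which is why the text recommends compact records, aggregation, and (where policy allows) deterministic mapping to cut event volume at the source.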
How to recognize log backpressure (and stop blaming the data plane)
| Observed behavior | What to confirm in counters | First corrective direction |
|---|---|---|
| Backlog watermarks climb first | Buffer watermark rising; export latency rising; log drops start to appear | Reduce record size and frequency; batch/export smoothing; avoid blocking inserts on export |
| Data-plane drops follow later | Drops by reason shift toward resource/queue pressure after backlog climbs | Protect the data plane from logging pressure (decouple and isolate buffers) |
| CPS falls while Gbps looks fine | Create-path counters degrade concurrently with backlog peaks | Prevent backlog peaks (watermark control) and keep create-path lightweight |
| Correlation is weak | No backlog change; no log drop change during the incident window | Search elsewhere (avoid assuming logging is the cause) |
Failure modes & troubleshooting: symptom → root cause → fix
Effective CGNAT troubleshooting starts with a short decision path: classify the symptom, confirm with minimal counters, then apply a targeted fix.
This chapter stays “CGNAT-local”: sessions/ports/table/drops/log pipeline signals are enough to narrow the failure class without pulling in external protocol detail.
Use the cards below like a field playbook: each is four lines—Symptom, Fast check, Likely root cause, Fix.
Fault cards (field-usable) — four fixed lines per card
1) Port exhaustion / block hot spot
Symptom: new flows fail; only a subset of users/services degrade; failures cluster in time.
Fast check: port block utilization distribution becomes highly skewed; CPS spikes; drops show allocation/exhaustion reasons.
Likely root cause: hot blocks/pools saturate while averages look acceptable (skew hides risk).
Fix: reduce skew (rebalance blocks/pools), increase headroom where the skew concentrates, and verify skew flattening plus drop-reason recovery.
Key counters: port block p95/p99 utilization, pool headroom, CPS, drops-by-reason.
2) “Gbps is fine” but setup collapses
Symptom: throughput remains high, yet new sessions time out; setup rate falls off a cliff.
Fast check: CPS falls while sessions plateau; create-path drops increase; table occupancy stays high or churn spikes.
Likely root cause: create/update path is saturated (inserts, updates, or reclaim pressure), not the steady-state forwarding.
Fix: cut create-path cost (reduce churn drivers), keep occupancy below jitter threshold, and confirm CPS recovery during burst tests.
Key counters: CPS, create-path drops, occupancy, aging/churn, collision/chain depth.
3) Table jitter / early reclaim
Symptom: sessions are reclaimed too early; retransmissions rise; tail latency spikes periodically.
Fast check: aging/churn peaks align with latency spikes; collision/chain depth increases; occupancy remains high.
Likely root cause: aging/reclaim cycle becomes expensive and bursty; hot buckets amplify tail behavior.
Fix: tune aging to reduce churn, rebalance buckets to reduce collisions, and validate that churn peaks no longer trigger tail spikes.
Key counters: churn/aging rate, occupancy, collision/chain depth, tail indicators (if available).
4) Asymmetric path → return flow state miss
Symptom: one-way connectivity; intermittent “works then breaks”; failures are direction-dependent.
Fast check: state-miss drops rise for return-direction traffic; hit/miss balance becomes asymmetric during the incident window.
Likely root cause: forward and return packets do not hit the same state domain/shard, so return lookups miss.
Fix: enforce flow-to-state consistency (same flow lands in the same shard/state domain) and confirm miss drops disappear after change.
Key counters: state-miss drops by reason, per-direction hit/miss (or equivalent), shard imbalance indicators.
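One common implementation of the fix in card 4 is an order-independent shard key, so both directions of a flow resolve to the same state domain. The sketch below shows that technique; note it is illustrative and assumes both directions see the same address pair. In a real CGNAT the return packet carries the *translated* tuple, so deployments often derive the shard from the public port block instead (e.g. block ↔ shard ownership), which achieves the same consistency goal.

```python
# Symmetric shard-key sketch: sort the two endpoints before hashing so
# (A -> B) and (B -> A) select the same shard. Shard count is illustrative.
NUM_SHARDS = 16

def shard_for(src_ip, src_port, dst_ip, dst_port, proto):
    # Order-independent key: forward and return directions hash identically.
    a, b = sorted([(src_ip, src_port), (dst_ip, dst_port)])
    return hash((a, b, proto)) % NUM_SHARDS

fwd = shard_for("10.0.0.1", 1234, "203.0.113.5", 443, "tcp")
ret = shard_for("203.0.113.5", 443, "10.0.0.1", 1234, "tcp")
```

The operational validation matches the card: after the change, per-direction state-miss drops should converge to the same (low) baseline.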
5) “Random” loss that is packet-size dependent (PMTU)
Symptom: small packets succeed but larger payloads fail; issues correlate with specific size ranges.
Fast check: drops spike in certain size bins; oversize/fragment-related counters increase during failures.
Likely root cause: path MTU constraints or size-dependent handling triggers drops that look random at flow level.
Fix: make size-dependent handling consistent and validate with controlled size sweeps until the spike disappears.
Key counters: packet-size histogram (if available), oversize/fragment counters, drops by reason.
6) Fragmentation / checksum inconsistency
Symptom: intermittent loss with no clear CPU spike; failures show weak correlation to throughput.
Fast check: fragment-related drops rise; checksum-related drops appear; issue reproduces only under specific packet patterns.
Likely root cause: fragmentation path or checksum update path diverges from the main translation path.
Fix: unify translation behavior for all packet paths and verify checksum/fragment drops return to baseline.
Key counters: fragment drops, checksum drops, drops by reason, packet pattern correlation.
7) Logging backpressure spillover
Symptom: throughput falls but CPU is not high; queue/watermark signals look abnormal.
Fast check: log backlog watermark rises first; export latency rises; log drops may appear; data-plane drops follow later.
Likely root cause: log pipeline cannot drain; backlog feeds back into data plane (backpressure).
Fix: reduce log pressure (record size/event rate), strengthen decoupling (buffers/batching), and confirm backlog leads no longer precede drops.
Key counters: backlog watermark, export latency, log drops, drops by reason, CPS over time.
8) Drops surge with no clear single “big” metric change
Symptom: drops increase suddenly; no single aggregate metric explains it; impact is uneven.
Fast check: drops by reason show one class dominating; distributions (blocks/buckets) worsen even if averages stay flat.
Likely root cause: localized hot spots (port blocks or hash buckets) create tail failures that aggregates mask.
Fix: switch to distribution-first view, mitigate hot spots, and verify the dominating drop-reason class returns to baseline.
Key counters: drops by reason, port block distribution, collision/chain depth distribution, occupancy.
High availability: state sync, failover, and keeping mappings consistent
HA for CGNAT is hard because state is large: replication must preserve enough mapping state for continuity without turning synchronization into a second data-plane bottleneck.
The practical trade-off is straightforward: stronger session continuity requires more replication load, which can reduce CPS and worsen tail latency if not isolated.
Success criteria: after failover, mapping consistency holds (no mass state misses) and replication load does not push CPS off a cliff during bursts.
What state must be replicated (minimal set vs optional)
Must replicate (minimal set)
Active session mapping identity (inside/outside address+port mapping) and enough lifecycle info to keep lookups consistent after takeover.
Goal: prevent mass state misses immediately after failover.
Optional replicate (only if justified)
Non-essential metadata that improves investigation or reporting but is not required for mapping continuity.
Rule: if it can be rebuilt, avoid replicating it under load.
Replication load vs data-plane health (how to avoid a CPS cliff)
Replication frequency and bandwidth
More frequent updates reduce continuity gaps but increase write amplification and contention risk.
Practical readout: CPS and create-path stability under burst should not degrade when replication becomes busy.
How to detect “sync is hurting the data plane”
If replication queue/backlog rises first and CPS drops next, synchronization load is likely spilling into the packet path.
Correlate replication backlog (if available) with CPS, drops by reason, and churn peaks.
Failover mapping consistency (avoid mass state misses)
Design goal: after takeover, existing flows should still resolve to the expected mapping state domain.
Operational test: during controlled failover, verify state-miss drops do not surge and that session continuity is preserved within expected limits.
Use the same “drops by reason + distribution view” approach: a short surge may be acceptable; a sustained state-miss plateau indicates broken consistency.
Validation & sizing checklist: prove it before deployment
A CGNAT design is deployable only after it passes an acceptance checklist that covers performance, resource headroom, logging backpressure, and HA behavior.
Testing must be scenario-driven (IMIX and 64B, burst CPS and long-lived sessions, one-way and two-way), and every pass/fail decision must be tied to a minimal set of counters.
Outcome: sizing becomes a signed-off artifact, not a guess based on “Gbps” alone.
A) Acceptance criteria (what “pass” means)
Performance coverage
Must include: IMIX + 64B, one-way + two-way, burst CPS, and long-lived sessions.
Reason: different mixes shift the bottleneck among I/O, state updates, and logging.
Resource headroom
Must show: safe flow table occupancy, controlled collision/chain depth, and non-skewed port utilization.
Reason: averages hide hotspots; distributions decide stability under peaks.
Logging resilience
Must prove: peak log rate does not create sustained backlog and does not degrade CPS/Mpps.
Reason: log pipeline backpressure is a common “hidden” cause of data-plane drops.
HA behavior
Must measure: drop rate during failover window and recovery time to stable mapping behavior.
Reason: state sync load and takeover consistency can trigger mass state misses.
B) Sizing inputs (turn traffic into numbers)
Use peak-first inputs: subscriber peak concurrency, bursty CPS, packet-size mix, and peak logging events.
Sessions_peak ≈ Subscribers × Sessions_per_subscriber × Peak_factor
CPS_peak ≈ Subscribers × New_flows_per_subscriber_per_sec × Peak_factor
Ports_needed ≈ Concurrent_translated_flows (watch skew, not only average)
Log_rate_peak ≈ CPS_peak × Events_per_flow (create/delete/other required events)
Log_GB_day ≈ Log_rate_peak × Bytes_per_record × 86400 / 1e9
Sizing is considered stable only when the acceptance tests pass while these peaks are sustained long enough to expose tail behavior and backlog effects.
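The sizing formulas above can be combined into a single runnable sketch. The inputs below (500k subscribers, 60 sessions each, 0.5 new flows/s per subscriber, peak factor 1.5, create+close logging at 64 B/record) are illustrative assumptions for a worked example, not reference values.

```python
# Peak-first sizing sketch implementing the formulas in section B.
def size_cgnat(subs, sess_per_sub, new_flows_per_sub_s, peak_factor,
               events_per_flow, bytes_per_record):
    sessions_peak = subs * sess_per_sub * peak_factor
    cps_peak = subs * new_flows_per_sub_s * peak_factor
    log_rate_peak = cps_peak * events_per_flow
    log_gb_day = log_rate_peak * bytes_per_record * 86400 / 1e9
    return {
        "sessions_peak": sessions_peak,    # drives table capacity + memory tier plan
        "cps_peak": cps_peak,              # drives create-path and burst testing
        "log_rate_peak": log_rate_peak,    # drives log pipeline drain bandwidth
        "log_gb_day": log_gb_day,          # drives storage/retention planning
    }

plan = size_cgnat(subs=500_000, sess_per_sub=60, new_flows_per_sub_s=0.5,
                  peak_factor=1.5, events_per_flow=2, bytes_per_record=64)
```

Note what the model deliberately omits: port skew. Ports_needed must still be validated against per-IP/per-block distributions under the test plan in section C, because the averages produced here cannot reveal hot blocks.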
C) Test plan (methods, not slogans)
Traffic scenarios to run
- ☐ IMIX steady state to validate throughput under realistic mix.
- ☐ 64B small packets to validate Mpps, queueing, and create-path stability.
- ☐ Burst CPS to expose flow create/update contention and reclaim jitter.
- ☐ Long-lived sessions to validate table aging, occupancy, and tail stability.
- ☐ One-way + two-way to detect directional state-miss behavior.
Minimum counters to capture
- ☐ Sessions & CPS (including tail behavior under burst).
- ☐ Drops by reason (queue overflow vs state-miss vs allocation failures).
- ☐ Port utilization distribution by pool/block/bucket (skew and hotspots).
- ☐ Flow table: occupancy, collision/chain depth, aging/churn rate.
- ☐ Logging: backlog/watermark and export latency under peak log rate.
- ☐ HA (when enabled): sync/backlog indicators and post-takeover state-miss behavior.
D) Copy-paste checklist (pass/fail items)
Performance
- ☐ IMIX: stable throughput with drops-by-reason not increasing over time.
- ☐ 64B: Mpps stability without queue overflow and without CPS collapse.
- ☐ Burst CPS: CPS holds for a sustained window (no “cliff”), and create-path drops stay near baseline.
- ☐ One-way/two-way: no directional state-miss plateau during the run.
Resources
- ☐ Flow table headroom: occupancy stays below a defined safety threshold under peaks.
- ☐ Collision control: collision/chain depth does not surge with burst CPS.
- ☐ Port skew: pool/block/bucket utilization distributions remain bounded (no runaway hotspots).
Logging
- ☐ Peak log rate: backlog/watermark does not grow without bound.
- ☐ No spillover: enabling peak logging does not materially reduce CPS/Mpps.
- ☐ Drain check: after peak, backlog returns to a low steady watermark.
HA
- ☐ Failover window: drop rate and recovery time are measured and within target.
- ☐ Post-takeover: no sustained state-miss drop plateau; sessions stabilize quickly.
- ☐ Sync impact: replication indicators do not precede a CPS cliff under burst.
BOM / platform selection checklist (criteria + example part numbers)
Choose a CGNAT platform by scoring architecture criteria that directly control CPS, session scale, port skew behavior, logging backpressure isolation, and HA stability.
Part numbers are listed as reference examples to help procurement and engineering align on categories and verification steps; availability and lifecycle must be validated before purchase.
Outcome: options can be compared with a consistent scorecard instead of marketing specs.
A) Platform archetypes (what is being compared)
1) General-purpose CPU + user-space data plane
Strength: fast iteration and flexible tuning.
Risk: CPS and tail stability depend on memory bandwidth, cache behavior, and queueing under bursts.
Best when: control over tuning is high and traffic is well-characterized.
2) NPU/ASIC/DPU-oriented data plane
Strength: predictable per-packet costs and high efficiency.
Risk: feature flexibility and observability depth may vary by platform ecosystem.
Best when: sustained high CPS/Mpps is primary and platform integration is mature.
3) CPU + accelerator cards (selective offload)
Strength: isolates heavy compute paths and protects tail behavior.
Risk: adds PCIe and driver complexity; can become a hidden bottleneck if not sized.
Best when: burst CPS or logging/telemetry processing needs isolation.
How to compare
Method: score each archetype against the criteria below, then validate with the H2-11 test matrix.
Rule: a platform is “better” only if acceptance tests pass with headroom.
B) Scoring criteria (all items should be measurable)
| Criteria | Why it matters for CGNAT | How to verify (what to measure) |
|---|---|---|
| Per-flow update cost | High CPS depends on fast flow create/update; expensive updates trigger a CPS cliff under bursts. | Create-path drops, CPS sustainability window, tail behavior under burst CPS. |
| State entry size (bytes/entry) | Session scale is bounded by memory capacity; large entries reduce max sessions and increase bandwidth pressure. | Max stable sessions before jitter; occupancy vs latency; memory bandwidth headroom. |
| Memory bandwidth & latency | State lookup/update becomes bandwidth-bound; cache misses can dominate tail latency even when CPU looks fine. | Latency stability under load; collision/chain depth impact; bandwidth utilization headroom. |
| Queueing & multi-queue scaling | 64B/Mpps and bursts depend on queueing, RSS/affinity, and buffer behavior, not PHY line rate. | 64B test: drops by reason, queue overflow, per-queue imbalance under peaks. |
| Port utilization distributions | Hotspots and local exhaustion break users even when average utilization is safe. | Pool/block/bucket p95/p99 utilization; hotspot persistence under stress. |
| Logging backpressure isolation | Peak log rate can degrade data plane; decoupling is required to prevent spillover. | Backlog/watermark under peak log; CPS/Mpps delta with logging enabled. |
| HA state sync impact | Replication can become a second bottleneck; mapping consistency must hold after takeover. | Failover: state-miss behavior, recovery time, replication backlog correlation with CPS. |
| Thermal / power stability | Long-run throttling silently reduces CPS/Mpps; stable telemetry prevents blind operation. | Long-duration load: frequency stability, temperature/power telemetry, throughput drift. |
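The scoring method above can be sketched as a weighted scorecard. The criteria keys mirror the table; the weights and the 0-5 example scores are illustrative assumptions, not a standard — replace them with your own priorities and measured results from the acceptance matrix.

```python
# Sketch: weighted scorecard for comparing platform archetypes.
# Weights and example scores are illustrative assumptions only.

CRITERIA_WEIGHTS = {
    "per_flow_update_cost": 0.20,
    "state_entry_size": 0.15,
    "memory_bandwidth": 0.15,
    "queueing_scaling": 0.15,
    "port_skew_control": 0.10,
    "logging_isolation": 0.10,
    "ha_sync_impact": 0.10,
    "thermal_stability": 0.05,
}

def score(platform_scores: dict) -> float:
    """Weighted sum of measured 0-5 scores; higher is better."""
    return sum(CRITERIA_WEIGHTS[c] * platform_scores[c] for c in CRITERIA_WEIGHTS)

# Hypothetical measured scores for two archetypes (same key order as above):
cpu_dataplane = dict(zip(CRITERIA_WEIGHTS, [3, 3, 3, 3, 4, 4, 3, 4]))
npu_dataplane = dict(zip(CRITERIA_WEIGHTS, [5, 4, 4, 5, 3, 3, 4, 4]))

print(f"CPU data plane: {score(cpu_dataplane):.2f}")
print(f"NPU data plane: {score(npu_dataplane):.2f}")
```

A scorecard like this only ranks candidates; per the rule above, the winner still has to pass the acceptance tests with headroom.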
C) Example BOM (reference part numbers)
Important: part numbers below are reference examples for quoting and comparison. Validate availability, lifecycle, and platform compatibility before purchasing.
High-speed NIC / I/O
- Intel E810-XXVDA4 (25GbE class adapter example)
- NVIDIA ConnectX-6 Dx family (100/200GbE class SmartNIC family example)
- MCX623106AN-CDAT (example SKU style for ConnectX-6 Dx-class adapters)
Selection focus: queue count, RSS behavior, buffer model, and drops-by-reason visibility.
CPU / Infrastructure processor
- Intel Xeon D-2796NT (network-oriented SoC example)
- AMD EPYC 75F3 (high-frequency server CPU example)
- Marvell OCTEON TX2 (CN92xx/CN96xx/CN98xx) family (infrastructure processor example)
Selection focus: burst CPS stability, memory bandwidth headroom, and tail behavior under churn.
Logging storage (NVMe)
- Samsung PM9A3 family (datacenter NVMe example)
- MZQL23T8HCLS-00A07 (example NVMe part number format)
Selection focus: sustained write behavior at peak log rate and backlog drain speed after peaks.
Power / thermal telemetry (examples)
- INA238 (current/power monitor example)
- TPS25982 (eFuse / hot-swap protection example)
Selection focus: detect throttling early (temperature/power drift) and correlate with throughput/CPS drops.
Optional acceleration (example)
- Intel QAT 8970 (crypto/compression accelerator family example)
- IQA89701G2P5 (example adapter part number format)
Selection focus: isolate heavy paths without creating PCIe or driver bottlenecks.
Memory module (example)
- M393A4K40DB2-CTD (RDIMM example PN format; verify lifecycle)
Selection focus: capacity vs bandwidth; stable sessions and burst CPS both require bandwidth headroom.
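The capacity-vs-bandwidth trade-off can be made concrete with a back-of-envelope session estimate. The entry size and occupancy ceiling below are illustrative assumptions — measure the real per-entry footprint (entry plus hash bucket and timer references) on the target platform.

```python
# Sketch: session capacity from a memory budget.
# entry_bytes and the occupancy ceiling are illustrative assumptions.

def max_stable_sessions(mem_bytes: int, entry_bytes: int,
                        target_occupancy: float = 0.7) -> int:
    """Plan sessions below full occupancy: collisions and aging churn
    degrade tail latency well before the table is 100% full."""
    return int(mem_bytes * target_occupancy // entry_bytes)

# Example: 32 GiB reserved for state, 256 B/entry, 70% occupancy ceiling.
print(max_stable_sessions(32 * 2**30, 256))
```

Note that the same memory budget must also leave bandwidth headroom for lookup/update traffic at burst CPS, so the capacity number alone is not sufficient.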
CGNAT FAQs (Engineering-focused)
These answers stay inside CGNAT scope: performance (CPS/Mpps/sessions), address/port management, flow-table behavior, logging backpressure, telemetry counters, HA consistency, and platform selection criteria.
Group A — Boundary & performance bottlenecks
1. How do you describe the engineering boundary of CGNAT in one sentence?
CGNAT is a high-scale NAT44 function placed in the provider path to share limited public IPv4 addresses across many private users by translating address/port pairs while maintaining state and the minimum logs/telemetry for traceability. It is not a security policy engine, not an attack detector, and not an access-side protocol stack.
2. Why does "Gbps is fine" still fail when CPS spikes (new flows start failing)?
Throughput tests mostly stress the fast path, but CPS stresses the create/update path: allocating state, inserting into the flow table, and often emitting per-event logs/counters. Under bursts, contention (locks/atomics), memory churn, collision growth, or logging backpressure can cause a CPS cliff even when average CPU looks acceptable. Confirm with create-fail reasons, insert latency, collision/chain depth, and log backlog.
3. Why do 64-byte packets break CGNAT more easily than IMIX traffic?
64B traffic drives packet rate (Mpps) dramatically higher, so per-packet overhead in the I/O pipeline dominates: queueing, DMA descriptors, buffer recycling, and per-packet state touch. IMIX may pass because larger packets reach line-rate with far fewer packets. Validate with per-queue drops, queue watermarks, Mpps headroom, and whether small-packet load triggers create/update amplification and table churn.
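The Mpps gap between 64B and IMIX can be quantified from Ethernet framing: every frame carries 20 bytes of wire overhead (7B preamble + 1B SFD + 12B inter-frame gap), so small frames multiply the packet rate the pipeline must absorb. A minimal sketch:

```python
# Sketch: wire-rate packet rate for a given frame size.
# Per-frame Ethernet overhead: 7B preamble + 1B SFD + 12B inter-frame gap.

WIRE_OVERHEAD = 20  # bytes added to every frame on the wire

def wire_rate_mpps(link_gbps: float, frame_bytes: int) -> float:
    bits_per_frame = (frame_bytes + WIRE_OVERHEAD) * 8
    return link_gbps * 1e9 / bits_per_frame / 1e6

for size in (64, 512, 1518):
    print(f"{size:>5}B @ 100G: {wire_rate_mpps(100, size):7.2f} Mpps")
# 64B at 100G is ~148.8 Mpps, vs ~8.1 Mpps at 1518B -- an ~18x
# difference in per-packet work at the same Gbps.
```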
Group B — Address/port resources & state table behavior
4. How do you quickly distinguish "public IPv4 shortage" from "port exhaustion / hotspots"?
Address shortage looks like broad saturation of the public IP pool, while port exhaustion is usually skewed: specific IPs, port blocks, or buckets hit high utilization first and new translations fail for a subset of users. The fastest discriminator is distribution, not averages: check p95/p99 port utilization by pool/block, hotspot persistence, and failure reasons. Fix directions differ: expand pool vs rebalance blocks and reduce hotspots.
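The "distribution, not averages" check can be sketched in a few lines. The per-block utilization numbers are illustrative; in practice they come from the platform's pool/block counters.

```python
# Sketch: discriminate pool shortage from hotspot exhaustion using
# distribution, not averages. Data values are illustrative.

def percentile(values, p):
    """Nearest-rank percentile on a sorted copy."""
    s = sorted(values)
    k = max(0, min(len(s) - 1, round(p / 100 * len(s)) - 1))
    return s[k]

# Per-block port utilization (fraction of ports in use):
block_util = [0.42, 0.38, 0.45, 0.41, 0.97, 0.39, 0.95, 0.40]

avg = sum(block_util) / len(block_util)
p95 = percentile(block_util, 95)

print(f"avg={avg:.2f} p95={p95:.2f}")
if p95 > 0.90 and avg < 0.60:
    print("pattern: hotspot/skew -> rebalance blocks, not expand pool")
```

Here the average (~0.55) looks safe while p95 (0.97) exposes blocks near exhaustion, which is exactly the pattern that breaks a subset of users.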
5. How should port-block allocation avoid hotspots and imbalance?
Hotspots often appear when allocation granularity and processing sharding do not align. Port blocks should be assigned so that heavy users or heavy destination patterns do not concentrate into a small subset of blocks handled by the same shard/queue. Verify with utilization per block/bucket, per-queue load balance, and translation failures localized to specific blocks. Practical tuning focuses on block sizing, hashing/assignment policy, and keeping shard consistency under burst load.
6. What visible symptoms come from a rising flow-table collision rate?
Collision growth increases lookup/insert work, so the first visible symptom is often CPS degradation and rising tail latency rather than immediate Gbps loss. Drops may shift from queue overflow to state allocation/insert failures. In telemetry, collision rate, average chain depth, and lookup/insert latency trend up, while churn/aging may spike as the system struggles to reclaim entries. The practical fix is table sizing/sharding and collision control, validated by reduced insert latency under burst CPS.
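The occupancy-to-chain-depth relationship can be illustrated with a toy simulation of random bucket placement (a sketch only; real flow tables use specific hash functions, sharding, and sometimes cuckoo/open addressing, so measure on the actual platform):

```python
import random

# Sketch: worst-case chain depth vs load factor under random placement.
# Illustrates why collision growth hits insert/lookup tails before Gbps.

def max_chain_depth(num_buckets: int, num_entries: int, seed: int = 1) -> int:
    random.seed(seed)  # deterministic for repeatability
    chains = [0] * num_buckets
    for _ in range(num_entries):
        chains[random.randrange(num_buckets)] += 1
    return max(chains)

buckets = 1 << 20
for load in (0.5, 0.9, 1.5):
    depth = max_chain_depth(buckets, int(buckets * load))
    print(f"load={load}: max chain depth {depth}")
```

Each extra link in a chain is another dependent memory access on the lookup/insert path, which is why the symptom surfaces as tail latency and CPS degradation.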
Group C — Aging, asymmetry, logging, and observability
7. What "weird dropouts" happen when timeout/aging is wrong?
Incorrect aging can reclaim entries too early (causing sudden state misses and retransmission bursts) or too late (inflating occupancy until collision and allocation failures appear). The “weird” pattern is intermittent breakage that correlates with churn and table pressure rather than with raw throughput. Confirm with aging rate, churn, occupancy headroom, and state-miss drop reasons. Fix direction is differentiated timeouts and stable reclaim behavior under long-session plus burst-CPS mixed tests.
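Differentiated timeouts are usually expressed as a per-state table. The values below are a sketch anchored to the standards floors (RFC 5382 for TCP established, RFC 4787 for UDP, RFC 5508 for ICMP); the transitory value is an illustrative assumption to tune against churn tests.

```python
# Sketch: differentiated per-state idle timeouts (seconds).
# Floors from RFCs; tcp_transitory is an illustrative tuning value.
TIMEOUTS_S = {
    "tcp_established": 7440,  # RFC 5382 REQ-5: >= 2h4m
    "tcp_transitory": 240,    # SYN/FIN/RST states should age out fast
    "udp": 300,               # RFC 4787: >= 2 min, 5 min recommended
    "icmp": 60,               # RFC 5508 recommended query timeout
}
print(TIMEOUTS_S)
```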
8. Why does asymmetric pathing make NAT look "sometimes OK, sometimes broken"?
CGNAT state is not optional: return traffic must hit the same state view (same instance/shard or a consistent replicated state set). If packets arrive on a different path that does not share the same mapping/state, state misses occur and behavior becomes intermittent: one-way works, two-way fails, or failures correlate with load-balancing changes. Confirm with state-miss drops and shard/instance imbalance; mitigation is path/state consistency and HA takeover consistency rather than “more bandwidth.”
9. How do you estimate NAT log volume, and why can logging reduce forwarding performance?
Log rate scales with event rate: roughly CPS multiplied by required events per flow (create/delete and other mandated records). Volume is then log_rate × bytes_per_record. Performance loss happens when the log pipeline is not fully decoupled: backlog grows, buffers and I/O contend with the data plane, and create/update becomes slower. Confirm by correlating backlog/watermark and export latency with CPS/Mpps drops; a healthy design drains backlog after peaks.
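The estimate above can be sketched directly. The CPS, events-per-flow, and record-size values are illustrative; substitute measured CPS and whatever record set your traceability mandate requires.

```python
# Sketch: NAT log volume from the formula log_rate x bytes_per_record.
# Input numbers are illustrative assumptions.

def log_volume_gb_per_day(cps: float, events_per_flow: float,
                          bytes_per_record: int) -> float:
    log_rate = cps * events_per_flow              # records/second
    return log_rate * bytes_per_record * 86_400 / 1e9

# 500k CPS, create + delete records, ~150 B/record:
print(f"{log_volume_gb_per_day(500_000, 2, 150):.0f} GB/day")
```

Per-session logging at high CPS produces terabytes per day, which is one reason bulk/port-block allocation logging (one record per allocated block instead of per flow) is widely used to cut events per flow.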
10. What is the smallest set of counters that can warn of an upcoming meltdown?
A minimal dashboard should cover capacity, dynamics, outcomes, and backpressure: sessions, CPS, drops by reason, port utilization distribution (p95/p99 by pool/block), flow-table occupancy, collision/chain depth, churn/aging rate, and log backlog/watermark with export latency. The key is trending and distributions: hotspots and monotonic backlog growth predict failure earlier than average throughput. Use thresholding on skew and trend slopes, not only on absolute values.
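The "trend slopes and skew, not absolute values" rule can be sketched as two simple checks. The threshold values and sample data are illustrative assumptions, not recommended defaults.

```python
# Sketch: alert on trend slope and skew rather than absolute averages.
# Thresholds and sample values are illustrative assumptions.

def slope(samples):
    """Average per-interval change across a window of counter samples."""
    return (samples[-1] - samples[0]) / (len(samples) - 1)

def meltdown_warnings(log_backlog, p99_port_util, avg_port_util):
    warnings = []
    # Monotonic backlog growth predicts log backpressure spillover:
    if slope(log_backlog) > 0 and min(log_backlog) == log_backlog[0]:
        warnings.append("log backlog trending up")
    # Skew between p99 and average predicts local port exhaustion:
    if p99_port_util - avg_port_util > 0.35:
        warnings.append("port utilization skew")
    return warnings

print(meltdown_warnings([100, 180, 260, 390], 0.96, 0.48))
```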
Group D — HA, validation, and platform selection
11. How do you minimize the drop window during HA failover?
Failover is difficult because state volume is large and takeover consistency is fragile under peak churn. The shortest drop window comes from synchronizing only the minimal state required for mapping continuity, keeping replication from competing with create/update, and validating takeover under realistic burst CPS and logging load. Confirm with drop rate during the failover window, recovery time to stable CPS, and whether replication backlog precedes a CPS cliff. A “works on idle” failover is not sufficient.
12. What are the most important platform criteria (ignore part numbers)?
Five criteria decide real CGNAT stability: (1) create/update tail latency under burst CPS, (2) state bytes per entry plus memory bandwidth headroom for target sessions, (3) 64B/Mpps behavior of queues/buffers with drops-by-reason visibility, (4) logging backpressure isolation so backlog does not spill into the data plane, and (5) HA state sync impact that remains measurable and within acceptance limits. Platforms should be chosen by scoring these items and proving them in the acceptance test matrix.