Carrier-Grade NAT (CGNAT) Architecture, Scaling & Troubleshooting
Carrier-Grade NAT (CGNAT) enables large-scale IPv4 sharing in ISP networks by translating address/port pairs with a high-speed state (flow) table and the minimum logging/telemetry required for traceability. In practice, CGNAT success is decided by CPS/Mpps/session scaling, port-pool hotspot control, and log/HA backpressure isolation—not by headline Gbps alone.
What CGNAT is (and what it is NOT): boundary & placement
CGNAT is a high-scale translation system that shares scarce public IPv4 addresses across many private users by mapping private tuples to public IP:port pairs.
It sits at the private/shared address realm → public Internet boundary, where it must sustain massive state (sessions), fast new-flow creation (CPS), and stable forwarding under small-packet bursts.
Boundary sentence: CGNAT is primarily about state tables + address/port pools + provable logs/telemetry—not security policy inspection, attack detection, or access-side protocol stacks.
What readers should expect from this page
- Capacity reality: sizing beyond “Gbps” using sessions, CPS, Mpps, port utilization, and log pipeline limits.
- Architecture depth: how flow tables, hashing, timeouts/aging, and packet I/O paths determine real performance.
- Troubleshooting map: symptom → root cause patterns → the first counter to check and the first fix to try.
CGNAT’s engineering core (3 responsibilities)
- Translate: allocate and apply a public IP:port mapping (NAT44 or NAT64), including consistent return-path handling for mapped flows.
- Track: maintain a high-scale flow/state table (create / lookup / update / age-out) so packets hit a fast path after the first packet.
- Prove: generate logs and telemetry that allow reliable mapping trace-back and operations visibility—without destabilizing forwarding under load.
Design intent: the remaining chapters will repeatedly tie symptoms and sizing decisions back to the same three primitives: state table, address/port pools, and logs/telemetry.
Capacity KPIs that actually break CGNAT (not just “Gbps”)
A CGNAT platform fails when any one of five ceilings is hit: concurrent sessions, setup rate (CPS), Mpps under small packets, log pipeline rate, or port utilization hot spots.
Throughput-only testing mostly validates the “hit path” of existing flows. Real deployments also stress the “create path” (new flows), memory writes, and logging—where collapses happen first.
Rule of thumb: sizing must match a traffic model (flow length distribution + burstiness + packet size mix + log requirement), not a single Gbps number.
The KPI table to use for sizing, acceptance tests, and troubleshooting
| KPI | Typical symptom | Root-cause pattern | What to measure first | First fix to try |
|---|---|---|---|---|
| Concurrent sessions (state table occupancy) | New connections fail, “random” drops, short flows disappear | Table near full → higher collision/eviction; aging churn grows; memory footprint spikes | Flow-table occupancy, eviction/age-out rate, collision counters, memory watermarks | Re-tune timeouts/aging, improve sharding, increase table capacity, reduce per-flow state |
| Setup rate (CPS, new-flow create path) | Throughput looks fine but logins / short sessions time out | Create path saturates: hash insert cost, lock/atomic contention, write amplification, log-event bursts | Create-fail counters, miss-path latency, lock contention signals, per-bucket insert retries | Increase flow sharding, reduce shared locks, optimize hashing/buckets, isolate logging from create path |
| Mpps under small packets (64B vs IMIX) | IMIX test passes, 64B bursts collapse; drops rise quickly | Per-packet fixed cost dominates: queue/DMA overhead, cache misses, frequent state writes | RX/TX drops by reason, queue depth/watermarks, per-core/engine utilization, cache-miss trends | Rebalance queues, enable efficient batching, reduce per-packet writes, improve RX/TX distribution |
| Log pipeline rate (events/sec) | Latency jitter; performance sags without obvious packet-CPU saturation | Logging backpressure: buffers fill → sync stalls or drops → data plane slowed indirectly | Log backlog, buffer watermarks, dropped-log counters, write/ship latency | Decouple logging path, increase drain bandwidth, aggregate/compact events, apply safe sampling where allowed |
| Port utilization (hot spots) | Some users fail while averages look fine; “port exhausted” pockets | Uneven port-block allocation, hash skew, pool fragmentation, noisy-neighbor effects | Per-public-IP port usage distribution, per-block occupancy, skew/variance metrics | Rework block allocation strategy, add skew-aware hashing, rebalance pools, reserve headroom per shard |
- Trap: “Gbps is enough” sizing. Fix: require CPS, sessions, Mpps, and log-rate targets in every acceptance plan.
- Trap: average-only monitoring. Fix: track distributions (per pool / per shard / per block), because hot spots cause the first visible failures.
Address & port management: pools, blocks, overloading, deterministic mapping
CGNAT “public capacity” is not only the number of public IPv4 addresses—it is the distribution of available ports per public IP and how evenly those ports are consumed.
Port-block allocation turns a shared public IP into a manageable resource (chunks), but it can also create localized exhaustion (hot blocks) even when the global average still looks healthy.
Decision rule: “IP shortage” is a pool-wide problem; “port shortage” is often a hot-spot or skew problem visible only in per-IP / per-block distributions.
Glossary (to prevent common misinterpretations)
Pool
A managed set of public IPv4 addresses (often partitioned by region, shard, or capacity domain).
Block
A contiguous port range on a given public IP reserved for a subscriber bucket or processing shard.
Overloading
Multiple private users share the same public IP, separated by distinct public port allocations.
Deterministic mapping
A predictable rule maps users (or buckets) to specific public IPs and port ranges to simplify trace-back and reduce event volume.
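The deterministic-mapping rule above can be made concrete with a small sketch. All parameters (block size, port range, pool addresses, the index-based rule itself) are illustrative assumptions, not a standard algorithm; the point is that the mapping is computable in both directions, so trace-back needs only the rule plus a timestamp, not a per-session log.

```python
# Minimal sketch of deterministic NAT mapping: each subscriber index maps to a
# fixed public IP and a contiguous port block. All parameters are illustrative;
# a real deployment must size the pool so every subscriber index fits
# (index < len(public_ips) * BLOCKS_PER_IP), or blocks will collide.
PORT_MIN, PORT_MAX = 1024, 65535
BLOCK_SIZE = 512                                     # ports reserved per subscriber
BLOCKS_PER_IP = (PORT_MAX - PORT_MIN + 1) // BLOCK_SIZE  # 126 blocks per public IP

def deterministic_block(subscriber_index: int, public_ips: list):
    """Return (public_ip, first_port, last_port) for a subscriber bucket."""
    ip = public_ips[(subscriber_index // BLOCKS_PER_IP) % len(public_ips)]
    block = subscriber_index % BLOCKS_PER_IP
    first = PORT_MIN + block * BLOCK_SIZE
    return ip, first, first + BLOCK_SIZE - 1

# Example: subscriber bucket 130 on a two-address pool.
ip, lo, hi = deterministic_block(130, ["198.51.100.1", "198.51.100.2"])
```

Because the mapping is a pure function of the subscriber index, an abuse report carrying public IP + port + time can be resolved without consulting per-flow logs, which is exactly the event-volume reduction the glossary entry describes.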
Three allocation patterns (and what they optimize)
| Pattern | What it optimizes | Failure pattern | Key counters to watch | First tuning lever |
|---|---|---|---|---|
| Per-subscriber blocks | Predictable fairness and trace-back; stable mapping per bucket | Heavy users exhaust their blocks early; localized failures | Per-block occupancy distribution, block-exhaust events, new-flow failures by bucket | Resize blocks, allocate multiple blocks to heavy buckets, add headroom per shard |
| Per-shard / line-card blocks | Locality (state + ports stay near processing); easier scaling by shards | Skew between shards; one shard exhausts while others idle | Per-shard port usage variance, per-shard new-flow failure rate | Rebalance shards, move pools, add skew-aware assignment |
| Hash-based blocks | Load spreading under diverse traffic; simpler stateless choice | Hash hot spots (unlucky keys) → uneven block usage | Per-public-IP port usage tail, hot-IP list, block-level variance metrics | Improve hashing inputs, add rehash / fallback, pre-split pools into more buckets |
- Step 1 (pool view): check how many public IPs are active and whether new allocations are failing globally. A true IP shortage is typically pool-wide and monotonic.
- Step 2 (distribution view): inspect per-public-IP and per-block port occupancy. Hot spots show a heavy tail: a small set of IPs/blocks approach full utilization.
- Step 3 (symptom shape): hot spots often appear as “some users/buckets fail while the average looks fine” and are time-correlated with bursts or locality.
- First fixes: rebalance blocks/pools, add headroom per shard, reduce skew, and avoid designs that require global coordination during bursts.
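The distribution-first triage in Step 2 can be sketched as a small check that compares the average block occupancy against a high percentile. Thresholds and field names here are illustrative assumptions; the pattern to detect is “healthy average, near-full tail.”

```python
# Distribution-first triage sketch: a pool-wide IP shortage raises the average,
# while a hot-spot (skew) problem leaves the average healthy but pushes the
# high percentile toward full. Threshold values are illustrative.
def block_skew_report(occupancy: list, hot_threshold: float = 0.9):
    """occupancy: per-block port utilization ratios in [0, 1]."""
    s = sorted(occupancy)
    avg = sum(s) / len(s)
    p99 = s[min(len(s) - 1, int(0.99 * len(s)))]
    hot = [i for i, u in enumerate(occupancy) if u >= hot_threshold]
    return {
        "avg": avg,
        "p99": p99,
        "hot_blocks": hot,
        # Skew signature: tail near full while the average still looks safe.
        "hot_spot": p99 >= hot_threshold and avg < hot_threshold,
    }

# 100 blocks: most half-used, three nearly exhausted -> averages hide the risk.
occ = [0.5] * 97 + [0.95, 0.97, 0.99]
report = block_skew_report(occ)
```

A monitoring rule built this way fires on the `hot_spot` flag rather than on pool-average utilization, matching the decision rule stated earlier in this chapter.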
This chapter sets the resource model used later: flow creation (H2-4) depends on a reliable and evenly consumable port/address supply, and operations (telemetry/logging) must expose distributions, not only averages.
Flow table architecture: fast path lookup, hashing, aging, collisions
CGNAT performance is dominated by flow/state table behavior: how quickly packets hit the fast path (lookup) and how safely the platform handles bursts of misses (create path).
Throughput tests often stress the hit path. Real-world collapses frequently occur in the miss/create path where inserts, memory writes, and event generation spike together.
Engineering takeaway: collision rate, aging/timeout policy, and write/lock contention determine whether high CPS turns into timeouts.
Packet-to-flow lifecycle (the minimal model)
- Parse key: extract a stable flow key (commonly 5-tuple plus necessary direction context).
- Hash & bucket: map the key into a bucket/shard to keep lookups O(1) and reduce shared contention.
- Hit path: read state → apply translation → update small metadata (timeouts/counters) → forward.
- Miss/create path: allocate state + reserve ports (from H2-3) → insert into table → emit necessary events.
- Aging: retire inactive flows without storms; avoid policies that amplify churn during bursts.
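The lifecycle above can be sketched as a toy flow table: key → hash → bucket, a light hit path, a heavier miss/create path, and an aging pass. This is a single-threaded illustration only; real CGNAT tables are sharded and lock-free, and use timer wheels instead of the full-table scan shown here (which is exactly the “expensive reclaim” pattern the next chapters warn about). Names and the timeout value are assumptions.

```python
import time

IDLE_TIMEOUT = 120.0  # seconds; illustrative UDP-style idle timeout

class FlowTable:
    """Toy flow/state table illustrating hit path, create path, and aging."""

    def __init__(self, buckets: int = 1 << 16):
        self.buckets = [dict() for _ in range(buckets)]

    def _bucket(self, key):
        # Hash & bucket: keeps lookups O(1) and bounds shared contention.
        return self.buckets[hash(key) % len(self.buckets)]

    def lookup_or_create(self, key, create_state, now=None):
        now = time.monotonic() if now is None else now
        b = self._bucket(key)
        entry = b.get(key)
        if entry is None:
            # Miss/create path: allocate state (port reservation, events
            # would also happen here in a real system).
            entry = {"state": create_state(), "last_seen": now}
            b[key] = entry
        else:
            # Hit path: keep updates minimal (timestamp only).
            entry["last_seen"] = now
        return entry["state"]

    def age_out(self, now=None):
        # NOTE: full-table scan shown for clarity; production systems use
        # timer wheels / LRU lists to avoid periodic reclaim storms.
        now = time.monotonic() if now is None else now
        removed = 0
        for b in self.buckets:
            stale = [k for k, e in b.items() if now - e["last_seen"] > IDLE_TIMEOUT]
            for k in stale:
                del b[k]
                removed += 1
        return removed

t = FlowTable(buckets=64)
key = ("10.0.0.1", 1234, "203.0.113.5", 443, "udp")
state = t.lookup_or_create(key, lambda: {"out_port": 3072}, now=0.0)
```

The second lookup of the same key must hit the fast path and return the original mapping; a re-create there would be the “CPS amplification” failure described below.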
Four “silent killers” that look like random failures
Hash collisions
Symptoms: p99 latency spikes, sporadic new-flow failures. Watch bucket depth/variance and insert retries.
Aging / timeout churn
Too short: constant re-creates (CPS amplification). Too long: occupancy rises, collisions grow. Watch eviction/age-out rate.
Lock / atomic contention
Shared buckets or global structures cause stalls under bursty creates. Watch contention signals and miss-path latency.
Write amplification
Excessive per-packet state updates trigger cache misses and memory bandwidth pressure. Watch cache-miss trends and per-flow update cost.
This chapter intentionally stays at the data-plane mechanics level. The operational requirement is simple: keep the hit path light, keep create bursts safe, and tune aging to avoid churn storms.
Memory & state scaling: where sessions really live (SRAM/DRAM) and why it matters
A CGNAT session table is a memory system before it is an algorithm: performance depends on which state stays in the fast tier (near-cache/SRAM) versus which state spills into the slow tier (DRAM).
When the working set grows or per-packet writes increase, the platform shifts from lookup-dominated to movement-dominated behavior. That is when CPS falls off a cliff and tail latency explodes.
Practical triage: use hit rate, lookup latency (p99), and reclaim/evict time to confirm a memory bottleneck.
State tiers (what belongs where)
Fast tier (small & near)
Keep flow index pointers, bucket heads, hot-entry metadata, and minimal counters—optimize for fast-path hits and predictable latency.
Slow tier (large & far)
Store full session records, cold entries, aging lists, and large buffers. Treat buffering as a bandwidth consumer, not “free space.”
Common collapse patterns (what they look like)
Cache thrash
Working set spills; hit rate drops; lookup cost rises; p99 latency spikes even if average looks acceptable.
Bandwidth saturation
Reads and writes queue; create path becomes write-heavy; CPS collapses before line rate is reached.
Reclaim jitter storms
Eviction/aging becomes expensive and periodic; bursts trigger timeouts during reclaim peaks.
Write amplification
Too many per-packet state touches; cache lines churn; memory traffic rises without throughput gains.
Symptom → Observation → First action (memory edition)
| Symptom | What to observe | First action to try |
|---|---|---|
| CPS falls off a cliff | Fast-tier hit rate falling; lookup p99 rising; create-path latency rising | Reduce write amplification; improve bucket/shard locality; keep fast-tier updates minimal |
| Tail latency spikes (p99/p999) | Cache miss surge; bucket depth variance; slow-tier fallback rate | Tighten hashing/bucketing; cap per-bucket depth; reduce spill frequency |
| Periodic timeouts in bursts | Evict/reclaim time peaks; eviction scans; aging list churn | Adjust aging to avoid churn; smooth reclaim; reduce full-table scans |
| Throughput stable but new flows fail | Miss/create path becomes write-heavy; insert retries; reclaim backpressure | Protect create path from heavy background work; isolate buffers/events from inserts |
| Small packets make it worse | Per-packet update cost dominates; memory BW rises faster than Gbps | Minimize per-packet touches; move updates to sampling or per-burst accounting |
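The “move updates to sampling or per-burst accounting” fix in the last row can be sketched as follows. The idea is to replace one shared-state write per packet with one write per RX batch; class and field names are illustrative assumptions.

```python
# Per-burst accounting sketch: accumulate per-packet counts in a cheap local
# variable and flush once per batch, so shared (slow-tier) state sees one
# write per burst instead of one per packet. Names are illustrative.
class BatchedCounter:
    def __init__(self):
        self.total_bytes = 0   # shared/slow-tier state: updated once per batch
        self._pending = 0      # local/fast-tier accumulator: touched per packet

    def on_packet(self, nbytes: int):
        self._pending += nbytes      # single cache-friendly write per packet

    def flush(self):
        self.total_bytes += self._pending   # one shared write per burst
        self._pending = 0

c = BatchedCounter()
for size in (64, 64, 1500):   # one RX batch of three packets
    c.on_packet(size)
c.flush()
```

At 64B line rate the difference between per-packet and per-batch shared writes is exactly the memory-bandwidth headroom this chapter is about.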
The goal is not “more memory,” but a stable locality plan: keep the hit path predictable, keep creates from turning into bulk writes, and keep reclaim from becoming a periodic pause.
10G/100G packet I/O: PHY is not the bottleneck, the I/O pipeline is
Line rate describes what the port can carry. Packet rate (Mpps) and burst behavior describe what the processing pipeline must absorb.
IMIX can pass while 64-byte traffic collapses because fixed per-packet costs dominate: queueing, DMA descriptors, buffer churn, and flow-shard consistency are stressed first.
Debug method: walk from port → RX queues → processing units → flow shards and confirm where the distribution becomes uneven or backpressured.
From port to flow table (step-by-step bottleneck map)
| Stage | What breaks under burst / small packets | Counter signals to check |
|---|---|---|
| Ingress port | Microbursts create instant queue pressure; early drops can appear even when average Gbps is moderate | Port-level drops, pause/backpressure indicators, burst correlation with p99 latency |
| RX queues (RSS) | Skewed steering makes one queue hot; one hot queue can collapse the system “locally” | Per-queue occupancy, per-queue drop, hot-queue variance (tail vs average) |
| DMA + descriptor rings | Descriptors exhaust or recycle too slowly; bursts hit ring limits first at high Mpps | Ring full events, refill rate, DMA stalls, buffer allocation failures |
| Buffer strategy | Small packets churn buffers; refill overhead and cache pressure rise rapidly | Buffer pool watermark, allocation latency, cache-miss trends during bursts |
| Processing units | One unit saturates while total utilization seems fine; create bursts amplify the imbalance | Per-unit utilization, per-unit miss/create latency, imbalance ratio |
| Flow shards | If a flow is handled inconsistently across shards, overhead rises and misses increase | Cross-shard access rate, miss/create spikes, shard hot-spot list |
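For the “RX queues (RSS)” row, a hot-queue check can be sketched as a tail-vs-mean comparison: one saturated queue drops packets while aggregate utilization still looks fine. The imbalance threshold and metric names are illustrative assumptions.

```python
# Hot-queue detection sketch: flag a queue whose load exceeds the mean by a
# configurable factor. Useful because aggregate Gbps/CPU hide a single
# saturated RSS queue. Threshold value is illustrative.
def hot_queue_check(per_queue_pps: list, imbalance_limit: float = 2.0):
    mean = sum(per_queue_pps) / len(per_queue_pps)
    worst = max(per_queue_pps)
    ratio = worst / mean if mean else 0.0
    return {
        "mean": mean,
        "worst": worst,
        "imbalance": ratio,
        "hot_queue": per_queue_pps.index(worst) if ratio > imbalance_limit else None,
    }

# 8 queues in Mpps: one carries most of the load (e.g. a single elephant
# flow or unlucky RSS key), while the average still looks moderate.
stats = hot_queue_check([1.0, 1.1, 0.9, 1.0, 1.0, 9.0, 1.0, 1.0])
```

When the check fires, the next step in the bottleneck map applies: confirm per-queue drops on the hot queue before touching the processing units downstream.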
Telemetry & observability: the minimum counters that prevent blind operation
A stable CGNAT operation depends on visibility into sessions, port/blocks, table health, and drops by reason—not on average throughput alone.
Early warnings come from distribution signals (skew across pools/blocks/buckets) and backlog signals (log pipeline watermarks), often before outright drops appear.
The minimum dashboard below is designed to catch port exhaustion, hot spots, and table jitter before service impact.
Minimum dashboard (≤10 metrics) — each exists to detect one failure mode
| Metric | Look at it to detect | Early warning pattern |
|---|---|---|
| Concurrent sessions | State scale approaching capacity and higher reclaim cost | High plateau + rising churn indicators |
| CPS (setup rate) | Create-path stress and short-flow surges | Sharp rise + new-flow latency/drops follow |
| Drops by reason | Where the pipeline fails (queue/ring/table/alloc/backpressure) | One reason class dominates; shift over time |
| Flow table occupancy | Table pressure and jitter risk | Sustained high occupancy + rising churn |
| Collision rate / chain depth | Hash hot spots and lookup tail latency risk | Tail grows even when averages look stable |
| Aging rate / churn | Expensive reclaim cycles and periodic pauses | Peaks align with latency spikes or timeouts |
| Public IPv4 pool utilization | Address scarcity trend (pool depletion) | Pool headroom shrinking steadily |
| Port block utilization distribution | Port hot spots and block-level exhaustion | Skew worsens (high percentile near full) |
| Log backlog watermark | Backpressure conditions that will spill into data plane | Watermark rises first, then drops/timeouts |
- Port exhaustion risk: port block distribution skew worsens + CPS rises + drops show allocation/exhaustion reasons.
- Table jitter risk: occupancy stays high + collision/chain depth increases + aging/churn rises.
- Backpressure risk: log backlog watermark rises + log drops/queueing delays appear + data-plane drops follow.
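The three composite warnings above can be encoded as simple boolean rules over dashboard metrics. Field names and thresholds here are illustrative assumptions; the principle is that each risk fires on a *combination* of signals, never on a single metric.

```python
# Early-warning sketch: combine distribution, trend, and backlog signals into
# the three composite risks described above. All keys/thresholds illustrative.
def early_warnings(m: dict) -> list:
    w = []
    if m["port_block_p99"] > 0.9 and m["cps_trend_up"] and m["alloc_fail_drops"] > 0:
        w.append("port-exhaustion risk")
    if m["table_occupancy"] > 0.8 and m["chain_depth_trend_up"] and m["churn_trend_up"]:
        w.append("table-jitter risk")
    if m["log_backlog_trend_up"] and m["log_drops"] > 0:
        w.append("backpressure risk")
    return w

sample = {
    "port_block_p99": 0.95, "cps_trend_up": True, "alloc_fail_drops": 12,
    "table_occupancy": 0.6, "chain_depth_trend_up": False, "churn_trend_up": False,
    "log_backlog_trend_up": False, "log_drops": 0,
}
```

With these inputs only the port-exhaustion rule fires, which matches the intent: skew plus rising CPS plus allocation-failure drops, even though table and logging metrics are healthy.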
NAT logging at scale: volume, correlation, time, and storage backpressure
Logging is the second critical path in CGNAT: traceability and operations depend on it, and backlog can translate into data-plane impact.
Capacity planning must be event-driven (CPS and session events), not throughput-driven (Gbps). Backpressure often appears first as rising watermarks and delayed export.
Key test: if backlog rises before drops/timeouts, the performance issue is likely log-pipeline backpressure.
Minimum log record (write less, but make it searchable)
Time
Timestamp with clear time base and resolution for ordering, window queries, and correlation.
Mapping
Inside addr/port and outside addr/port (or equivalent mapping pair) for traceability.
Session identity
A compact session id or flow key hash to correlate create/close and detect duplicates.
Resource context
Public pool id and/or port block id to pinpoint hot spots and exhaustion patterns.
Back-of-the-envelope volume model (event-driven)
Estimation model
log_events_per_sec ≈ CPS × events_per_session
bytes_per_sec ≈ log_events_per_sec × bytes_per_record
storage_per_day ≈ bytes_per_sec × 86400
events_per_session is often 1 (create) or 2 (create + close). Adding periodic “update” events can multiply volume quickly and increase backpressure risk.
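The estimation model above can be run directly. The input values below are illustrative (200k CPS peak, create + close events, a compact 64-byte record), not a recommendation.

```python
# Event-driven log volume model from the text:
#   events/s = CPS x events_per_session
#   bytes/s  = events/s x bytes_per_record
#   GB/day   = bytes/s x 86400 / 1e9   (decimal GB)
def log_volume(cps: float, events_per_session: float, bytes_per_record: int):
    eps = cps * events_per_session
    bps = eps * bytes_per_record
    gb_per_day = bps * 86400 / 1e9
    return eps, bps, gb_per_day

# Illustrative peak: 200k CPS, create+close logging, ~64 B per record.
eps, bps, gb_day = log_volume(cps=200_000, events_per_session=2, bytes_per_record=64)
```

Even this modest configuration produces roughly 2.2 TB of raw records per day, which is why the text recommends compact records, aggregation, and (where policy allows) deterministic mapping to cut event volume at the source.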
How to recognize log backpressure (and stop blaming the data plane)
| Observed behavior | What to confirm in counters | First corrective direction |
|---|---|---|
| Backlog watermarks climb first | Buffer watermark rising; export latency rising; log drops start to appear | Reduce record size and frequency; batch/export smoothing; avoid blocking inserts on export |
| Data-plane drops follow later | Drops by reason shift toward resource/queue pressure after backlog climbs | Protect the data plane from logging pressure (decouple and isolate buffers) |
| CPS falls while Gbps looks fine | Create-path counters degrade concurrently with backlog peaks | Prevent backlog peaks (watermark control) and keep create-path lightweight |
| Correlation is weak | No backlog change; no log drop change during the incident window | Search elsewhere (avoid assuming logging is the cause) |
Failure modes & troubleshooting: symptom → root cause → fix
Effective CGNAT troubleshooting starts with a short decision path: classify the symptom, confirm with minimal counters, then apply a targeted fix.
This chapter stays “CGNAT-local”: sessions/ports/table/drops/log pipeline signals are enough to narrow the failure class without pulling in external protocol detail.
Use the cards below like a field playbook: each is four lines—Symptom, Fast check, Likely root cause, Fix.
Fault cards (field-usable) — four fixed lines per card
1) Port exhaustion / block hot spot
Symptom: new flows fail; only a subset of users/services degrade; failures cluster in time.
Fast check: port block utilization distribution becomes highly skewed; CPS spikes; drops show allocation/exhaustion reasons.
Likely root cause: hot blocks/pools saturate while averages look acceptable (skew hides risk).
Fix: reduce skew (rebalance blocks/pools), increase headroom where the skew concentrates, and verify skew flattening plus drop-reason recovery.
Key counters: port block p95/p99 utilization, pool headroom, CPS, drops-by-reason.
2) “Gbps is fine” but setup collapses
Symptom: throughput remains high, yet new sessions time out; setup rate falls off a cliff.
Fast check: CPS falls while sessions plateau; create-path drops increase; table occupancy stays high or churn spikes.
Likely root cause: create/update path is saturated (inserts, updates, or reclaim pressure), not the steady-state forwarding.
Fix: cut create-path cost (reduce churn drivers), keep occupancy below jitter threshold, and confirm CPS recovery during burst tests.
Key counters: CPS, create-path drops, occupancy, aging/churn, collision/chain depth.
3) Table jitter / early reclaim
Symptom: sessions are reclaimed too early; retransmissions rise; tail latency spikes periodically.
Fast check: aging/churn peaks align with latency spikes; collision/chain depth increases; occupancy remains high.
Likely root cause: aging/reclaim cycle becomes expensive and bursty; hot buckets amplify tail behavior.
Fix: tune aging to reduce churn, rebalance buckets to reduce collisions, and validate that churn peaks no longer trigger tail spikes.
Key counters: churn/aging rate, occupancy, collision/chain depth, tail indicators (if available).
4) Asymmetric path → return flow state miss
Symptom: one-way connectivity; intermittent “works then breaks”; failures are direction-dependent.
Fast check: state-miss drops rise for return-direction traffic; hit/miss balance becomes asymmetric during the incident window.
Likely root cause: forward and return packets do not hit the same state domain/shard, so return lookups miss.
Fix: enforce flow-to-state consistency (same flow lands in the same shard/state domain) and confirm miss drops disappear after change.
Key counters: state-miss drops by reason, per-direction hit/miss (or equivalent), shard imbalance indicators.
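One common implementation of the fix in card 4 is an order-independent shard key, so both directions of a flow resolve to the same state domain. The sketch below shows that technique; note it is illustrative and assumes both directions see the same address pair. In a real CGNAT the return packet carries the *translated* tuple, so deployments often derive the shard from the public port block instead (e.g. block ↔ shard ownership), which achieves the same consistency goal.

```python
# Symmetric shard-key sketch: sort the two endpoints before hashing so
# (A -> B) and (B -> A) select the same shard. Shard count is illustrative.
NUM_SHARDS = 16

def shard_for(src_ip, src_port, dst_ip, dst_port, proto):
    # Order-independent key: forward and return directions hash identically.
    a, b = sorted([(src_ip, src_port), (dst_ip, dst_port)])
    return hash((a, b, proto)) % NUM_SHARDS

fwd = shard_for("10.0.0.1", 1234, "203.0.113.5", 443, "tcp")
ret = shard_for("203.0.113.5", 443, "10.0.0.1", 1234, "tcp")
```

The operational validation matches the card: after the change, per-direction state-miss drops should converge to the same (low) baseline.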
5) “Random” loss that is packet-size dependent (PMTU)
Symptom: small packets succeed but larger payloads fail; issues correlate with specific size ranges.
Fast check: drops spike in certain size bins; oversize/fragment-related counters increase during failures.
Likely root cause: path MTU constraints or size-dependent handling triggers drops that look random at flow level.
Fix: make size-dependent handling consistent and validate with controlled size sweeps until the spike disappears.
Key counters: packet-size histogram (if available), oversize/fragment counters, drops by reason.
6) Fragmentation / checksum inconsistency
Symptom: intermittent loss with no clear CPU spike; failures show weak correlation to throughput.
Fast check: fragment-related drops rise; checksum-related drops appear; issue reproduces only under specific packet patterns.
Likely root cause: fragmentation path or checksum update path diverges from the main translation path.
Fix: unify translation behavior for all packet paths and verify checksum/fragment drops return to baseline.
Key counters: fragment drops, checksum drops, drops by reason, packet pattern correlation.
7) Logging backpressure spillover
Symptom: throughput falls but CPU is not high; queue/watermark signals look abnormal.
Fast check: log backlog watermark rises first; export latency rises; log drops may appear; data-plane drops follow later.
Likely root cause: log pipeline cannot drain; backlog feeds back into data plane (backpressure).
Fix: reduce log pressure (record size/event rate), strengthen decoupling (buffers/batching), and confirm backlog leads no longer precede drops.
Key counters: backlog watermark, export latency, log drops, drops by reason, CPS over time.
8) Drops surge with no clear single “big” metric change
Symptom: drops increase suddenly; no single aggregate metric explains it; impact is uneven.
Fast check: drops by reason show one class dominating; distributions (blocks/buckets) worsen even if averages stay flat.
Likely root cause: localized hot spots (port blocks or hash buckets) create tail failures that aggregates mask.
Fix: switch to distribution-first view, mitigate hot spots, and verify the dominating drop-reason class returns to baseline.
Key counters: drops by reason, port block distribution, collision/chain depth distribution, occupancy.
High availability: state sync, failover, and keeping mappings consistent
HA for CGNAT is hard because state is large: replication must preserve enough mapping state for continuity without turning synchronization into a second data-plane bottleneck.
The practical trade-off is straightforward: stronger session continuity requires more replication load, which can reduce CPS and worsen tail latency if not isolated.
Success criteria: after failover, mapping consistency holds (no mass state misses) and replication load does not push CPS off a cliff during bursts.
What state must be replicated (minimal set vs optional)
Must replicate (minimal set)
Active session mapping identity (inside/outside address+port mapping) and enough lifecycle info to keep lookups consistent after takeover.
Goal: prevent mass state misses immediately after failover.
Optional replicate (only if justified)
Non-essential metadata that improves investigation or reporting but is not required for mapping continuity.
Rule: if it can be rebuilt, avoid replicating it under load.
Replication load vs data-plane health (how to avoid a CPS cliff)
Replication frequency and bandwidth
More frequent updates reduce continuity gaps but increase write amplification and contention risk.
Practical readout: CPS and create-path stability under burst should not degrade when replication becomes busy.
How to detect “sync is hurting the data plane”
If replication queue/backlog rises first and CPS drops next, synchronization load is likely spilling into the packet path.
Correlate replication backlog (if available) with CPS, drops by reason, and churn peaks.
Failover mapping consistency (avoid mass state misses)
Design goal: after takeover, existing flows should still resolve to the expected mapping state domain.
Operational test: during controlled failover, verify state-miss drops do not surge and that session continuity is preserved within expected limits.
Use the same “drops by reason + distribution view” approach: a short surge may be acceptable; a sustained state-miss plateau indicates broken consistency.
Validation & sizing checklist: prove it before deployment
A CGNAT design is deployable only after it passes an acceptance checklist that covers performance, resource headroom, logging backpressure, and HA behavior.
Testing must be scenario-driven (IMIX and 64B, burst CPS and long-lived sessions, one-way and two-way), and every pass/fail decision must be tied to a minimal set of counters.
Outcome: sizing becomes a signed-off artifact, not a guess based on “Gbps” alone.
A) Acceptance criteria (what “pass” means)
Performance coverage
Must include: IMIX + 64B, one-way + two-way, burst CPS, and long-lived sessions.
Reason: different mixes shift the bottleneck among I/O, state updates, and logging.
Resource headroom
Must show: safe flow table occupancy, controlled collision/chain depth, and non-skewed port utilization.
Reason: averages hide hotspots; distributions decide stability under peaks.
Logging resilience
Must prove: peak log rate does not create sustained backlog and does not degrade CPS/Mpps.
Reason: log pipeline backpressure is a common “hidden” cause of data-plane drops.
HA behavior
Must measure: drop rate during failover window and recovery time to stable mapping behavior.
Reason: state sync load and takeover consistency can trigger mass state misses.
B) Sizing inputs (turn traffic into numbers)
Use peak-first inputs: subscriber peak concurrency, bursty CPS, packet-size mix, and peak logging events.
Sessions_peak ≈ Subscribers × Sessions_per_subscriber × Peak_factor
CPS_peak ≈ Subscribers × New_flows_per_subscriber_per_sec × Peak_factor
Ports_needed ≈ Concurrent_translated_flows (watch skew, not only average)
Log_rate_peak ≈ CPS_peak × Events_per_flow (create/delete/other required events)
Log_GB_day ≈ Log_rate_peak × Bytes_per_record × 86400 / 1e9
Sizing is considered stable only when the acceptance tests pass while these peaks are sustained long enough to expose tail behavior and backlog effects.
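The sizing formulas above can be combined into a single runnable sketch. The inputs below (500k subscribers, 60 sessions each, 0.5 new flows/s per subscriber, peak factor 1.5, create+close logging at 64 B/record) are illustrative assumptions for a worked example, not reference values.

```python
# Peak-first sizing sketch implementing the formulas in section B.
def size_cgnat(subs, sess_per_sub, new_flows_per_sub_s, peak_factor,
               events_per_flow, bytes_per_record):
    sessions_peak = subs * sess_per_sub * peak_factor
    cps_peak = subs * new_flows_per_sub_s * peak_factor
    log_rate_peak = cps_peak * events_per_flow
    log_gb_day = log_rate_peak * bytes_per_record * 86400 / 1e9
    return {
        "sessions_peak": sessions_peak,    # drives table capacity + memory tier plan
        "cps_peak": cps_peak,              # drives create-path and burst testing
        "log_rate_peak": log_rate_peak,    # drives log pipeline drain bandwidth
        "log_gb_day": log_gb_day,          # drives storage/retention planning
    }

plan = size_cgnat(subs=500_000, sess_per_sub=60, new_flows_per_sub_s=0.5,
                  peak_factor=1.5, events_per_flow=2, bytes_per_record=64)
```

Note what the model deliberately omits: port skew. Ports_needed must still be validated against per-IP/per-block distributions under the test plan in section C, because the averages produced here cannot reveal hot blocks.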
C) Test plan (methods, not slogans)
Traffic scenarios to run
- ☐ IMIX steady state to validate throughput under realistic mix.
- ☐ 64B small packets to validate Mpps, queueing, and create-path stability.
- ☐ Burst CPS to expose flow create/update contention and reclaim jitter.
- ☐ Long-lived sessions to validate table aging, occupancy, and tail stability.
- ☐ One-way + two-way to detect directional state-miss behavior.
Minimum counters to capture
- ☐ Sessions & CPS (including tail behavior under burst).
- ☐ Drops by reason (queue overflow vs state-miss vs allocation failures).
- ☐ Port utilization distribution by pool/block/bucket (skew and hotspots).
- ☐ Flow table: occupancy, collision/chain depth, aging/churn rate.
- ☐ Logging: backlog/watermark and export latency under peak log rate.
- ☐ HA (when enabled): sync/backlog indicators and post-takeover state-miss behavior.
D) Copy-paste checklist (pass/fail items)
Performance
- ☐ IMIX: stable throughput with drops-by-reason not increasing over time.
- ☐ 64B: Mpps stability without queue overflow and without CPS collapse.
- ☐ Burst CPS: CPS holds for a sustained window (no “cliff”), and create-path drops stay near baseline.
- ☐ One-way/two-way: no directional state-miss plateau during the run.
Resources
- ☐ Flow table headroom: occupancy stays below a defined safety threshold under peaks.
- ☐ Collision control: collision/chain depth does not surge with burst CPS.
- ☐ Port skew: pool/block/bucket utilization distributions remain bounded (no runaway hotspots).
Logging
- ☐ Peak log rate: backlog/watermark does not grow without bound.
- ☐ No spillover: enabling peak logging does not materially reduce CPS/Mpps.
- ☐ Drain check: after peak, backlog returns to a low steady watermark.
HA
- ☐ Failover window: drop rate and recovery time are measured and within target.
- ☐ Post-takeover: no sustained state-miss drop plateau; sessions stabilize quickly.
- ☐ Sync impact: replication indicators do not precede a CPS cliff under burst.
BOM / platform selection checklist (criteria + example part numbers)
Choose a CGNAT platform by scoring architecture criteria that directly control CPS, session scale, port skew behavior, logging backpressure isolation, and HA stability.
Part numbers are listed as reference examples to help procurement and engineering align on categories and verification steps; availability and lifecycle must be validated before purchase.
Outcome: options can be compared with a consistent scorecard instead of marketing specs.
A) Platform archetypes (what is being compared)
1) General-purpose CPU + user-space data plane
Strength: fast iteration and flexible tuning.
Risk: CPS and tail stability depend on memory bandwidth, cache behavior, and queueing under bursts.
Best when: control over tuning is high and traffic is well-characterized.
2) NPU/ASIC/DPU-oriented data plane
Strength: predictable per-packet costs and high efficiency.
Risk: feature flexibility and observability depth may vary by platform ecosystem.
Best when: sustained high CPS/Mpps is primary and platform integration is mature.
3) CPU + accelerator cards (selective offload)
Strength: isolates heavy compute paths and protects tail behavior.
Risk: adds PCIe and driver complexity; can become a hidden bottleneck if not sized.
Best when: burst CPS or logging/telemetry processing needs isolation.
How to compare
Method: score each archetype against the criteria below, then validate with the H2-11 test matrix.
Rule: a platform is “better” only if acceptance tests pass with headroom.
B) Scoring criteria (all items should be measurable)
| Criteria | Why it matters for CGNAT | How to verify (what to measure) |
|---|---|---|
| Per-flow update cost | High CPS depends on fast flow create/update; expensive updates trigger a CPS cliff under bursts. | Create-path drops, CPS sustainability window, tail behavior under burst CPS. |
| State entry size (bytes/entry) | Session scale is bounded by memory capacity; large entries reduce max sessions and increase bandwidth pressure. | Max stable sessions before jitter; occupancy vs latency; memory bandwidth headroom. |
| Memory bandwidth & latency | State lookup/update becomes bandwidth-bound; cache misses can dominate tail latency even when CPU looks fine. | Latency stability under load; collision/chain depth impact; bandwidth utilization headroom. |
| Queueing & multi-queue scaling | 64B/Mpps and bursts depend on queueing, RSS/affinity, and buffer behavior, not PHY line rate. | 64B test: drops by reason, queue overflow, per-queue imbalance under peaks. |
| Port utilization distributions | Hotspots and local exhaustion break users even when average utilization is safe. | Pool/block/bucket p95/p99 utilization; hotspot persistence under stress. |
| Logging backpressure isolation | Peak log rate can degrade data plane; decoupling is required to prevent spillover. | Backlog/watermark under peak log; CPS/Mpps delta with logging enabled. |
| HA state sync impact | Replication can become a second bottleneck; mapping consistency must hold after takeover. | Failover: state-miss behavior, recovery time, replication backlog correlation with CPS. |
| Thermal / power stability | Long-run throttling silently reduces CPS/Mpps; stable telemetry prevents blind operation. | Long-duration load: frequency stability, temperature/power telemetry, throughput drift. |
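The scoring method above can be sketched as a weighted scorecard. The criteria keys mirror the table; the weights and the 0-5 example scores are illustrative assumptions, not a standard — replace them with your own priorities and measured results from the acceptance matrix.

```python
# Sketch: weighted scorecard for comparing platform archetypes.
# Weights and example scores are illustrative assumptions only.

CRITERIA_WEIGHTS = {
    "per_flow_update_cost": 0.20,
    "state_entry_size": 0.15,
    "memory_bandwidth": 0.15,
    "queueing_scaling": 0.15,
    "port_skew_control": 0.10,
    "logging_isolation": 0.10,
    "ha_sync_impact": 0.10,
    "thermal_stability": 0.05,
}

def score(platform_scores: dict) -> float:
    """Weighted sum of measured 0-5 scores; higher is better."""
    return sum(CRITERIA_WEIGHTS[c] * platform_scores[c] for c in CRITERIA_WEIGHTS)

# Hypothetical measured scores for two archetypes (same key order as above):
cpu_dataplane = dict(zip(CRITERIA_WEIGHTS, [3, 3, 3, 3, 4, 4, 3, 4]))
npu_dataplane = dict(zip(CRITERIA_WEIGHTS, [5, 4, 4, 5, 3, 3, 4, 4]))

print(f"CPU data plane: {score(cpu_dataplane):.2f}")
print(f"NPU data plane: {score(npu_dataplane):.2f}")
```

A scorecard like this only ranks candidates; per the rule above, the winner still has to pass the acceptance tests with headroom.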
C) Example BOM (reference part numbers)
Important: part numbers below are reference examples for quoting and comparison. Validate availability, lifecycle, and platform compatibility before purchasing.
High-speed NIC / I/O
- Intel E810-XXVDA4 (25GbE class adapter example)
- NVIDIA ConnectX-6 Dx family (100/200GbE class SmartNIC family example)
- MCX623106AN-CDAT (example SKU style for ConnectX-6 Dx-class adapters)
Selection focus: queue count, RSS behavior, buffer model, and drops-by-reason visibility.
CPU / Infrastructure processor
- Intel Xeon D-2796NT (network-oriented SoC example)
- AMD EPYC 75F3 (high-frequency server CPU example)
- Marvell OCTEON TX2 (CN92xx/CN96xx/CN98xx) family (infrastructure processor example)
Selection focus: burst CPS stability, memory bandwidth headroom, and tail behavior under churn.
Logging storage (NVMe)
- Samsung PM9A3 family (datacenter NVMe example)
- MZQL23T8HCLS-00A07 (example NVMe part number format)
Selection focus: sustained write behavior at peak log rate and backlog drain speed after peaks.
Power / thermal telemetry (examples)
- INA238 (current/power monitor example)
- TPS25982 (eFuse / hot-swap protection example)
Selection focus: detect throttling early (temperature/power drift) and correlate with throughput/CPS drops.
Optional acceleration (example)
- Intel QAT 8970 (crypto/compression accelerator family example)
- IQA89701G2P5 (example adapter part number format)
Selection focus: isolate heavy paths without creating PCIe or driver bottlenecks.
Memory module (example)
- M393A4K40DB2-CTD (RDIMM example PN format; verify lifecycle)
Selection focus: capacity vs bandwidth; stable sessions and burst CPS both require bandwidth headroom.
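The capacity-vs-bandwidth trade-off can be made concrete with a back-of-envelope session estimate. The entry size and occupancy ceiling below are illustrative assumptions — measure the real per-entry footprint (entry plus hash bucket and timer references) on the target platform.

```python
# Sketch: session capacity from a memory budget.
# entry_bytes and the occupancy ceiling are illustrative assumptions.

def max_stable_sessions(mem_bytes: int, entry_bytes: int,
                        target_occupancy: float = 0.7) -> int:
    """Plan sessions below full occupancy: collisions and aging churn
    degrade tail latency well before the table is 100% full."""
    return int(mem_bytes * target_occupancy // entry_bytes)

# Example: 32 GiB reserved for state, 256 B/entry, 70% occupancy ceiling.
print(max_stable_sessions(32 * 2**30, 256))
```

Note that the same memory budget must also leave bandwidth headroom for lookup/update traffic at burst CPS, so the capacity number alone is not sufficient.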
CGNAT FAQs (Engineering-focused)
These answers stay inside CGNAT scope: performance (CPS/Mpps/sessions), address/port management, flow-table behavior, logging backpressure, telemetry counters, HA consistency, and platform selection criteria.
Group A — Boundary & performance bottlenecks
1. How do you describe the engineering boundary of CGNAT in one sentence?
CGNAT is a high-scale NAT44 function placed in the provider path to share limited public IPv4 addresses across many private users by translating address/port pairs while maintaining state and the minimum logs/telemetry for traceability. It is not a security policy engine, not an attack detector, and not an access-side protocol stack.
2. Why does "Gbps is fine" still fail when CPS spikes (new flows start failing)?
Throughput tests mostly stress the fast path, but CPS stresses the create/update path: allocating state, inserting into the flow table, and often emitting per-event logs/counters. Under bursts, contention (locks/atomics), memory churn, collision growth, or logging backpressure can cause a CPS cliff even when average CPU looks acceptable. Confirm with create-fail reasons, insert latency, collision/chain depth, and log backlog.
3. Why do 64-byte packets break CGNAT more easily than IMIX traffic?
64B traffic drives packet rate (Mpps) dramatically higher, so per-packet overhead in the I/O pipeline dominates: queueing, DMA descriptors, buffer recycling, and per-packet state touch. IMIX may pass because larger packets reach line-rate with far fewer packets. Validate with per-queue drops, queue watermarks, Mpps headroom, and whether small-packet load triggers create/update amplification and table churn.
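The Mpps gap between 64B and IMIX can be quantified from Ethernet framing: every frame carries 20 bytes of wire overhead (7B preamble + 1B SFD + 12B inter-frame gap), so small frames multiply the packet rate the pipeline must absorb. A minimal sketch:

```python
# Sketch: wire-rate packet rate for a given frame size.
# Per-frame Ethernet overhead: 7B preamble + 1B SFD + 12B inter-frame gap.

WIRE_OVERHEAD = 20  # bytes added to every frame on the wire

def wire_rate_mpps(link_gbps: float, frame_bytes: int) -> float:
    bits_per_frame = (frame_bytes + WIRE_OVERHEAD) * 8
    return link_gbps * 1e9 / bits_per_frame / 1e6

for size in (64, 512, 1518):
    print(f"{size:>5}B @ 100G: {wire_rate_mpps(100, size):7.2f} Mpps")
# 64B at 100G is ~148.8 Mpps, vs ~8.1 Mpps at 1518B -- an ~18x
# difference in per-packet work at the same Gbps.
```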
Group B — Address/port resources & state table behavior
4. How do you quickly distinguish "public IPv4 shortage" from "port exhaustion / hotspots"?
Address shortage looks like broad saturation of the public IP pool, while port exhaustion is usually skewed: specific IPs, port blocks, or buckets hit high utilization first and new translations fail for a subset of users. The fastest discriminator is distribution, not averages: check p95/p99 port utilization by pool/block, hotspot persistence, and failure reasons. Fix directions differ: expand pool vs rebalance blocks and reduce hotspots.
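The "distribution, not averages" check can be sketched in a few lines. The per-block utilization numbers are illustrative; in practice they come from the platform's pool/block counters.

```python
# Sketch: discriminate pool shortage from hotspot exhaustion using
# distribution, not averages. Data values are illustrative.

def percentile(values, p):
    """Nearest-rank percentile on a sorted copy."""
    s = sorted(values)
    k = max(0, min(len(s) - 1, round(p / 100 * len(s)) - 1))
    return s[k]

# Per-block port utilization (fraction of ports in use):
block_util = [0.42, 0.38, 0.45, 0.41, 0.97, 0.39, 0.95, 0.40]

avg = sum(block_util) / len(block_util)
p95 = percentile(block_util, 95)

print(f"avg={avg:.2f} p95={p95:.2f}")
if p95 > 0.90 and avg < 0.60:
    print("pattern: hotspot/skew -> rebalance blocks, not expand pool")
```

Here the average (~0.55) looks safe while p95 (0.97) exposes blocks near exhaustion, which is exactly the pattern that breaks a subset of users.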
5. How should port-block allocation avoid hotspots and imbalance?
Hotspots often appear when allocation granularity and processing sharding do not align. Port blocks should be assigned so that heavy users or heavy destination patterns do not concentrate into a small subset of blocks handled by the same shard/queue. Verify with utilization per block/bucket, per-queue load balance, and translation failures localized to specific blocks. Practical tuning focuses on block sizing, hashing/assignment policy, and keeping shard consistency under burst load.
6. What visible symptoms come from a rising flow-table collision rate?
Collision growth increases lookup/insert work, so the first visible symptom is often CPS degradation and rising tail latency rather than immediate Gbps loss. Drops may shift from queue overflow to state allocation/insert failures. In telemetry, collision rate, average chain depth, and lookup/insert latency trend up, while churn/aging may spike as the system struggles to reclaim entries. The practical fix is table sizing/sharding and collision control, validated by reduced insert latency under burst CPS.
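The occupancy-to-chain-depth relationship can be illustrated with a toy simulation of random bucket placement (a sketch only; real flow tables use specific hash functions, sharding, and sometimes cuckoo/open addressing, so measure on the actual platform):

```python
import random

# Sketch: worst-case chain depth vs load factor under random placement.
# Illustrates why collision growth hits insert/lookup tails before Gbps.

def max_chain_depth(num_buckets: int, num_entries: int, seed: int = 1) -> int:
    random.seed(seed)  # deterministic for repeatability
    chains = [0] * num_buckets
    for _ in range(num_entries):
        chains[random.randrange(num_buckets)] += 1
    return max(chains)

buckets = 1 << 20
for load in (0.5, 0.9, 1.5):
    depth = max_chain_depth(buckets, int(buckets * load))
    print(f"load={load}: max chain depth {depth}")
```

Each extra link in a chain is another dependent memory access on the lookup/insert path, which is why the symptom surfaces as tail latency and CPS degradation.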
Group C — Aging, asymmetry, logging, and observability
7. What "weird dropouts" happen when timeout/aging is wrong?
Incorrect aging can reclaim entries too early (causing sudden state misses and retransmission bursts) or too late (inflating occupancy until collision and allocation failures appear). The “weird” pattern is intermittent breakage that correlates with churn and table pressure rather than with raw throughput. Confirm with aging rate, churn, occupancy headroom, and state-miss drop reasons. Fix direction is differentiated timeouts and stable reclaim behavior under long-session plus burst-CPS mixed tests.
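Differentiated timeouts are usually expressed as a per-state table. The values below are a sketch anchored to the standards floors (RFC 5382 for TCP established, RFC 4787 for UDP, RFC 5508 for ICMP); the transitory value is an illustrative assumption to tune against churn tests.

```python
# Sketch: differentiated per-state idle timeouts (seconds).
# Floors from RFCs; tcp_transitory is an illustrative tuning value.
TIMEOUTS_S = {
    "tcp_established": 7440,  # RFC 5382 REQ-5: >= 2h4m
    "tcp_transitory": 240,    # SYN/FIN/RST states should age out fast
    "udp": 300,               # RFC 4787: >= 2 min, 5 min recommended
    "icmp": 60,               # RFC 5508 recommended query timeout
}
print(TIMEOUTS_S)
```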
8. Why does asymmetric pathing make NAT look "sometimes OK, sometimes broken"?
CGNAT state is not optional: return traffic must hit the same state view (same instance/shard or a consistent replicated state set). If packets arrive on a different path that does not share the same mapping/state, state misses occur and behavior becomes intermittent: one-way works, two-way fails, or failures correlate with load-balancing changes. Confirm with state-miss drops and shard/instance imbalance; mitigation is path/state consistency and HA takeover consistency rather than “more bandwidth.”
9. How do you estimate NAT log volume, and why can logging reduce forwarding performance?
Log rate scales with event rate: roughly CPS multiplied by required events per flow (create/delete and other mandated records). Volume is then log_rate × bytes_per_record. Performance loss happens when the log pipeline is not fully decoupled: backlog grows, buffers and I/O contend with the data plane, and create/update becomes slower. Confirm by correlating backlog/watermark and export latency with CPS/Mpps drops; a healthy design drains backlog after peaks.
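The estimate above can be sketched directly. The CPS, events-per-flow, and record-size values are illustrative; substitute measured CPS and whatever record set your traceability mandate requires.

```python
# Sketch: NAT log volume from the formula log_rate x bytes_per_record.
# Input numbers are illustrative assumptions.

def log_volume_gb_per_day(cps: float, events_per_flow: float,
                          bytes_per_record: int) -> float:
    log_rate = cps * events_per_flow              # records/second
    return log_rate * bytes_per_record * 86_400 / 1e9

# 500k CPS, create + delete records, ~150 B/record:
print(f"{log_volume_gb_per_day(500_000, 2, 150):.0f} GB/day")
```

Per-session logging at high CPS produces terabytes per day, which is one reason bulk/port-block allocation logging (one record per allocated block instead of per flow) is widely used to cut events per flow.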
10. What is the smallest set of counters that can warn of an upcoming meltdown?
A minimal dashboard should cover capacity, dynamics, outcomes, and backpressure: sessions, CPS, drops by reason, port utilization distribution (p95/p99 by pool/block), flow-table occupancy, collision/chain depth, churn/aging rate, and log backlog/watermark with export latency. The key is trending and distributions: hotspots and monotonic backlog growth predict failure earlier than average throughput. Use thresholding on skew and trend slopes, not only on absolute values.
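The "trend slopes and skew, not absolute values" rule can be sketched as two simple checks. The threshold values and sample data are illustrative assumptions, not recommended defaults.

```python
# Sketch: alert on trend slope and skew rather than absolute averages.
# Thresholds and sample values are illustrative assumptions.

def slope(samples):
    """Average per-interval change across a window of counter samples."""
    return (samples[-1] - samples[0]) / (len(samples) - 1)

def meltdown_warnings(log_backlog, p99_port_util, avg_port_util):
    warnings = []
    # Monotonic backlog growth predicts log backpressure spillover:
    if slope(log_backlog) > 0 and min(log_backlog) == log_backlog[0]:
        warnings.append("log backlog trending up")
    # Skew between p99 and average predicts local port exhaustion:
    if p99_port_util - avg_port_util > 0.35:
        warnings.append("port utilization skew")
    return warnings

print(meltdown_warnings([100, 180, 260, 390], 0.96, 0.48))
```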
Group D — HA, validation, and platform selection
11. How do you minimize the drop window during HA failover?
Failover is difficult because state volume is large and takeover consistency is fragile under peak churn. The shortest drop window comes from synchronizing only the minimal state required for mapping continuity, keeping replication from competing with create/update, and validating takeover under realistic burst CPS and logging load. Confirm with drop rate during the failover window, recovery time to stable CPS, and whether replication backlog precedes a CPS cliff. A “works on idle” failover is not sufficient.
12. What are the most important platform criteria (ignore part numbers)?
Five criteria decide real CGNAT stability: (1) create/update tail latency under burst CPS, (2) state bytes per entry plus memory bandwidth headroom for target sessions, (3) 64B/Mpps behavior of queues/buffers with drops-by-reason visibility, (4) logging backpressure isolation so backlog does not spill into the data plane, and (5) HA state sync impact that remains measurable and within acceptance limits. Platforms should be chosen by scoring these items and proving them in the acceptance test matrix.