
Ethernet Controllers / MAC-PHY: RGMII, SGMII, Offload & WOL


This page teaches how to select and bring up Ethernet controllers / MAC-PHYs from the MAC side: pick the right host interface, stabilize DMA/queues, validate offloads, and make Wake-on-LAN reliable with measurable pass criteria.

It focuses on fail-fast debugging and production-ready configuration (counters, loopback isolation, strap/EEPROM/driver precedence), without expanding into analog PHY, magnetics/ESD, or TSN switching.

Scope & System View: What this MAC-PHY Controller Page Solves

This page focuses on the controller-side Ethernet endpoint: host interfaces (RGMII/SGMII and similar), DMA/buffering, feature offloads, and Wake-on-LAN—so bring-up reaches “link + packets + stable throughput” without drifting into PHY analog, magnetics, or TSN switching.

Page boundary (hard rule)

This page covers

  • Host-side interfaces: RGMII/SGMII (timing, mode, status)
  • Packet path: DMA rings/descriptors, buffering, interrupts/polling strategy
  • Offloads: checksum, segmentation (if present), VLAN filtering hooks
  • Low-power: Wake-on-LAN arming, pattern/magic handling, wake signaling
  • Bring-up + field diagnostics: counters, loopback isolation (MAC/PCS-side)

This page does NOT cover

  • PHY analog line performance, eye/return-loss tuning, or cable/magnetics design
  • ESD/surge component selection and compliance test details
  • TSN switch scheduling (802.1AS/Qbv/Qbu) or industrial protocol stacks

Two common deployment patterns

1) SoC/MCU MAC + external PHY

Most failures are controller-side: RGMII/SGMII mode and timing, reset/strap correctness, DMA starvation, offload toggles, and WOL bring-up. Line-side analog and magnetics stay outside this page boundary.

2) External MAC-PHY controller (SPI/USB/PCIe ↔ Ethernet)

Critical factors shift toward host-bus throughput, driver maturity, EEPROM/strap configuration, WOL wake path, and counter-driven field isolation. The practical target is stable packets under load, not just “link up”.

Expected outcome (what “done” means)

  • Bring-up: link established, packets pass bidirectionally, counters stay clean
  • Performance: sustained throughput without ring overruns or CPU interrupt storms
  • Low-power: WOL arms reliably and wakes the host with bounded latency

Pass criteria (placeholders)

  • Link up within X ms after reset release
  • Sustained throughput ≥ X with overrun/underrun = 0
  • WOL wake latency ≤ X ms from packet arrival
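The link-up bound above can be enforced as a fail-fast poll loop rather than an open-ended wait. A minimal C sketch, assuming a hypothetical link-up bit (bit 0), an abstract status-read callback, and a 1 ms poll interval; the real register layout and delay primitive come from the datasheet/BSP:

```c
#include <stdint.h>

/* Hypothetical sketch: poll an abstract link-status read callback until the
 * link-up bit is set or a bounded number of 1 ms polls elapses. The register
 * layout (bit 0 = link up), the callback, and the poll interval are
 * assumptions for illustration, not a real driver API. */
#define LINK_UP_BIT 0x1u

typedef uint32_t (*status_read_fn)(void *ctx);

/* Returns the number of polls taken until link-up, or -1 on timeout. */
int wait_link_up(status_read_fn read_status, void *ctx, int timeout_ms)
{
    for (int elapsed = 0; elapsed < timeout_ms; elapsed++) {
        if (read_status(ctx) & LINK_UP_BIT)
            return elapsed;
        /* sleep_ms(1);  -- platform delay primitive elided in this sketch */
    }
    return -1; /* bound exceeded: fail bring-up instead of hanging */
}

/* Simulated status source for bench use: reports link up on the 3rd poll. */
static int sim_polls;
static uint32_t sim_status_read(void *ctx)
{
    (void)ctx;
    return (++sim_polls >= 3) ? LINK_UP_BIT : 0u;
}
static void sim_reset(void) { sim_polls = 0; }
```

Returning the elapsed poll count (not just pass/fail) lets bring-up logs record the measured link-start time against the tLINK_START placeholder.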
Diagram: MAC-PHY controller system chain and scope boundary. The host MCU/SoC connects via RGMII or SGMII to the MAC-PHY controller (DMA, offload, WOL filter); magnetics and the RJ45 jack are outside page scope; a Wake-on-LAN path runs to the host WAKE pin.

Diagram note: the dashed scope box intentionally stops before magnetics/RJ45 to prevent cross-topic expansion.

What is a MAC-PHY Controller: Practical Block Ownership (MAC vs PCS vs PHY)

A MAC-PHY controller is best treated as a host-facing Ethernet endpoint: it exposes a host interface and implements MAC-side packet handling (DMA, filtering, statistics) with optional tightly-coupled PCS/PHY functions. The engineering value comes from clear ownership: which block explains a symptom, and which register/counter is the first probe point.

Block ownership map (use for fast isolation)

MAC (packet plane)

  • Responsibilities: frame Tx/Rx, address filtering, CRC generation/check, flow control (PAUSE), statistics counters, DMA ownership
  • Typical symptoms: throughput collapses, CPU interrupt storms, drops under burst, Rx overrun/Tx underrun
  • First probe: descriptor/ring health + MAC counters (overrun/underrun, dropped, FCS error rate)

PCS (interface + link-status plane)

  • Responsibilities: interface coding/status (e.g., SGMII PCS), auto-negotiation status reporting, link mode mapping
  • Typical symptoms: link up but wrong speed/duplex, one-way traffic, unstable mode switching
  • First probe: PCS status (in-band link/speed/duplex) + host interface mode pins/straps

PHY (line-side) — boundary reminder

  • Responsibilities: cable/line interaction, energy detect, analog front-end behavior, line-side robustness
  • When to exit this page: magnetics/ESD/surge events, analog margining, link quality vs cable/EMI
  • Action: keep this page on controller-side checks; hand off to Ethernet PHY / protection pages for line-side analysis

Minimal glossary (only what is needed for this page)

  • MII family: host↔MAC/PHY interface families; this page focuses on RGMII/SGMII bring-up behavior
  • Offload: hardware assistance for checksum/segmentation/VLAN that changes “who computes what” in the packet path
  • Counters first: prefer counter-driven isolation before tuning complex features
Diagram: MAC vs PCS vs PHY ownership. Layered blocks show the MAC packet plane (DMA, filters, stats), the PCS link-status plane (AN / mode / in-band status), and the line-side PHY, shaded as out-of-scope (see Ethernet PHY / protection pages). First probe points: ring health (DMA), MAC counters (drops/FCS), PCS status (AN/mode).

Diagram note: ownership is the goal—map symptoms to MAC counters, ring state, and PCS status before changing features.

Host-Side Interfaces: RGMII vs SGMII Selection Logic (Bring-up Ready)

Interface selection is a board-level decision driven by timing margin, routing complexity, and debug visibility. This section stays on the digital boundary (MAC/PCS and host interface behavior) and avoids PHY analog and magnetics depth. The goal is a deterministic bring-up path: correct mode, correct status, and clean counters before performance tuning.

Decision output (what to decide here)

  • Choose RGMII when short routing and low complexity can preserve DDR timing margin with correct delay strategy
  • Choose SGMII when long routes, connectors, or noise risk requires a serial PCS-based interface with status-driven bring-up
  • Bring-up must confirm: speed/duplex/link-state consistency with the peer before any feature enablement

RGMII engineering model (source-synchronous DDR)

  • What it is: 4-bit DDR data + clock; sampling margin depends on clock-to-data skew control
  • Margin driver: the effective clock delay path (internal delay, external trace, or both) must match the device expectation
  • Common failure pattern: link appears up but CRC/FCS errors rise under load or temperature

Bring-up checks (RGMII)

  1. Confirm link partner expected mode (1G/100M and duplex) matches the controller configuration
  2. Verify TX/RX delay enable is set on the correct side (TX delay, RX delay, or both per datasheet)
  3. Monitor frame integrity counters: CRC/FCS errors should remain near zero over a fixed window
  4. If errors appear: reduce rate (or force a simpler mode) as an isolation step, then re-check delay settings

Pass criteria (placeholders): CRC/FCS error rate ≤ X over X seconds at target load.

SGMII engineering model (serial PCS + status alignment)

  • What it is: a serial lane to a PCS layer; routing is simpler, but mode/state must be consistent end-to-end
  • Bring-up axis: decide auto-negotiation vs fixed rate, and whether in-band status is required
  • Common failure pattern: link up but wrong speed/duplex, or one-way traffic due to strategy mismatch

Bring-up checks (SGMII)

  1. Confirm both ends agree on AN enabled or fixed 1G/100M
  2. If using in-band status: validate the PCS status fields reflect the negotiated speed/duplex consistently
  3. If fixing rate: force the same mode on the peer to avoid hidden negotiation asymmetry
  4. Validate counters: frame errors should not climb as throughput increases

Pass criteria (placeholders): reported speed/duplex matches peer, and frame error counters remain ≤ X.
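Check 2 above can be automated as a decode-and-compare step. A C sketch assuming the commonly referenced SGMII in-band code word layout (bit 15 = link, bit 12 = duplex, bits 11:10 = speed: 00 = 10M, 01 = 100M, 10 = 1000M); bit positions vary by vendor, so confirm against the controller datasheet before trusting this sketch:

```c
#include <stdint.h>
#include <stdbool.h>

/* Decode an assumed SGMII in-band status word and check it against the
 * expected link mode. Layout is the commonly referenced SGMII convention;
 * verify against the actual controller datasheet. */
typedef struct {
    bool link_up;
    bool full_duplex;
    int  speed_mbps;
} link_mode_t;

static link_mode_t sgmii_decode_status(uint16_t w)
{
    static const int speed_map[4] = { 10, 100, 1000, -1 }; /* 11 = reserved */
    link_mode_t m;
    m.link_up     = (w >> 15) & 1u;
    m.full_duplex = (w >> 12) & 1u;
    m.speed_mbps  = speed_map[(w >> 10) & 3u];
    return m;
}

/* Pass only when link is up and speed/duplex match the peer expectation. */
static bool sgmii_status_consistent(uint16_t w, int want_mbps, bool want_fd)
{
    link_mode_t m = sgmii_decode_status(w);
    return m.link_up && m.speed_mbps == want_mbps && m.full_duplex == want_fd;
}
```

Running this against the peer's reported mode catches the "link up but wrong speed/duplex" failure pattern before traffic testing starts.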

Selection rules (fast heuristic, not marketing)

  • Long routes / connectors / noisy environment: prefer SGMII to reduce multi-signal skew sensitivity
  • Short routes / minimal layers / cost pressure: RGMII is viable if delay strategy is explicitly designed and validated
  • Debug phase: lock configuration first (fixed mode), then introduce negotiation and power-save features later
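The heuristic above can be written down as an explicit decision function so the choice is recorded, not tribal knowledge. A C sketch; the 150 mm route-length threshold is a placeholder, not a standard limit, and should be replaced by the value from your own timing budget:

```c
#include <stdbool.h>

typedef enum { IF_RGMII, IF_SGMII } host_if_t;

/* Sketch of the selection heuristic from this section. The route-length
 * threshold is illustrative; derive the real limit from skew/timing budget. */
static host_if_t choose_host_interface(int route_len_mm, bool crosses_connector,
                                       bool noisy_environment)
{
    if (route_len_mm > 150 || crosses_connector || noisy_environment)
        return IF_SGMII;  /* serial lane: less multi-signal skew sensitivity */
    return IF_RGMII;      /* short, clean routing: DDR timing margin holds */
}
```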
Diagram: RGMII vs SGMII host-side interface model. Left: RGMII with 4-bit DDR data plus clock and a TX/RX delay block (probe: skew/edge margin). Right: SGMII with a single SERDES lane into a PCS with auto-negotiation and in-band status (probe: PCS status, speed/duplex/link alignment). Quick selection arrows: long routes / noisy → SGMII; simple / short / cost → RGMII.

Clocking & Reset Sequencing: Power Rails, Strap Sampling, and Link Start Order

Bring-up failures frequently originate before packet traffic exists: rails are not stable, clocks are not stable, or straps/EEPROM are sampled at the wrong time. The engineering objective is a deterministic sequence: rails stable → reference clock stable → strap sample window → reset release → link training starts.

Engineering sequence (minimum required ordering)

  1. Core rails reach regulation and remain stable (no brownout within the sampling window)
  2. I/O rails reach regulation and remain stable (interface pins are valid)
  3. Reference clock is present and stable (frequency stable; “clock-good” if available)
  4. Straps/EEPROM are sampled during the defined window (mode, default speed, optional WOL/LED modes)
  5. Reset is deasserted after all above conditions are met; link training should begin within a bounded time

Strap / EEPROM sampling (what it commonly decides)

  • Interface mode selection (e.g., RGMII vs SGMII)
  • Default speed/duplex policy (auto-negotiation enabled vs fixed mode)
  • Optional behavior pins (LED/WOL mode selection) — only verify sampling here; detailed WOL is handled elsewhere

Reset pitfalls (common root causes)

  • Reset deasserted too early: straps sampled incorrectly → wrong mode/speed policy
  • Clock not stable at start: MAC/PCS does not initialize cleanly → unstable state or no link start
  • Multiple reset sources unsynchronized: POR, external reset, watchdog reset cause partial-domain mismatch

Pass criteria (placeholders; define per datasheet)

  • tCLK_STABLE > X ms (measure at clock-good or clock output)
  • tRESET_DEASSERT after rails stable > X ms (measure at reset_n pin)
  • tSTRAP_SAMPLE window = X µs (confirm strap pins are valid during this window)
  • tLINK_START < X ms after reset (observe PCS/link status transition)
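The minimum ordering above can be encoded as a fail-fast gate chain so that a violated precondition is reported by step number instead of surfacing later as a wrong strap or dead link. A C sketch; the predicate callbacks (rail monitor, clock-good flag, strap window done) are illustrative, not a vendor API:

```c
#include <stdbool.h>

/* Sketch: run the required bring-up ordering as fail-fast gates. Returns 0
 * on success or the number of the first failing step, matching the numbered
 * engineering sequence in this section. Callback names are illustrative. */
typedef struct {
    bool (*core_rail_stable)(void);
    bool (*io_rail_stable)(void);
    bool (*refclk_stable)(void);
    bool (*straps_sampled)(void);
} bringup_gates_t;

static int run_reset_sequence(const bringup_gates_t *g)
{
    if (!g->core_rail_stable()) return 1;
    if (!g->io_rail_stable())   return 2;
    if (!g->refclk_stable())    return 3;
    if (!g->straps_sampled())   return 4;
    /* All gates passed: only now is it safe to deassert reset_n. */
    return 0;
}

/* Trivial simulated gates for demonstration. */
static bool yes(void) { return true; }
static bool no(void)  { return false; }
```

The step number that comes back maps directly onto the reset pitfalls list: a return of 3 means reset would have been released with an unstable clock.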
Diagram: clock and reset sequencing timing (time →). Core rail and I/O rail ramp to stable, clock becomes good, strap sampling window completes, reset_n deasserts, then link training starts. Early reset release before the clock is stable and strap sampling completes is flagged as forbidden. Placeholders: tCLK_STABLE > X ms; tRESET_DEASSERT > X ms.

Packet Path & DMA Model: Root Causes of Low Throughput, Drops, and High CPU

Throughput, latency, and loss behavior are dominated by ownership handoff (HW↔SW control of descriptors and buffers) and by backpressure points (where rings fill or starve). The fastest isolation path is counter-driven: determine whether the bottleneck is Rx ring saturation, Tx ring starvation, or interrupt/poll scheduling overhead.

Packet path (controller-side view)

  • Rx: DMA writes frames into Rx buffers → flips descriptor ownership → SW consumes → returns buffers to the ring
  • Tx: SW posts descriptors → DMA reads Tx buffers → transmits → returns completion to free ring slots
  • Backpressure: drops happen when Rx buffers are not replenished fast enough, or when Tx descriptors cannot be posted/recycled
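The ownership handoff above can be modeled in a few lines. A minimal C sketch of a 4-deep Rx ring with an OWN bit (hw_owned = true means hardware may write the slot); descriptor fields and names are illustrative, not a specific controller's layout:

```c
#include <stdint.h>
#include <stdbool.h>

/* Sketch of descriptor ownership handoff on a tiny Rx ring. OWN=1: hardware
 * owns the slot and may write a frame; OWN=0: software owns it and must
 * consume, then return it. Names are illustrative. */
#define RING_DEPTH 4

typedef struct {
    bool     hw_owned;   /* the OWN bit */
    uint16_t frame_len;  /* filled by "DMA" when a frame lands */
} rx_desc_t;

typedef struct {
    rx_desc_t desc[RING_DEPTH];
    unsigned  hw_head;   /* next slot hardware will fill */
    unsigned  sw_head;   /* next slot software will consume */
} rx_ring_t;

static void ring_init(rx_ring_t *r)
{
    *r = (rx_ring_t){0};
    for (unsigned i = 0; i < RING_DEPTH; i++)
        r->desc[i].hw_owned = true;       /* all buffers posted to HW */
}

/* "DMA side": land a frame, flip OWN to SW. Returns false when no buffer is
 * available -- the moment a real rx_overrun counter would increment. */
static bool dma_rx_frame(rx_ring_t *r, uint16_t len)
{
    rx_desc_t *d = &r->desc[r->hw_head];
    if (!d->hw_owned)
        return false;
    d->frame_len = len;
    d->hw_owned = false;
    r->hw_head = (r->hw_head + 1) % RING_DEPTH;
    return true;
}

/* "Driver side": consume one frame and replenish the buffer back to HW. */
static int sw_consume_frame(rx_ring_t *r)
{
    rx_desc_t *d = &r->desc[r->sw_head];
    if (d->hw_owned)
        return -1;                        /* nothing to consume */
    int len = d->frame_len;
    d->hw_owned = true;                   /* return slot to HW */
    r->sw_head = (r->sw_head + 1) % RING_DEPTH;
    return len;
}

/* Burst demonstration: fill the ring, hit the overrun case, drain, refill. */
static bool ring_selftest(void)
{
    rx_ring_t r;
    ring_init(&r);
    for (int i = 0; i < RING_DEPTH; i++)
        if (!dma_rx_frame(&r, (uint16_t)(64 + i))) return false;
    if (dma_rx_frame(&r, 64)) return false;       /* 5th frame must drop */
    if (sw_consume_frame(&r) != 64) return false; /* drain first frame */
    return dma_rx_frame(&r, 128);                 /* replenished slot reused */
}
```

The self-test reproduces Bottleneck A in miniature: a burst one frame deeper than the ring drops exactly when software has not replenished.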

Interrupt vs polling (engineering decision)

  • Prefer interrupts for sparse traffic, low average rate, and power-first systems (avoid constant CPU wakeups)
  • Prefer polling/batching for sustained high throughput, latency stability, and to avoid interrupt storms under burst load
  • Symptom hint: “low throughput + high CPU” often indicates excessive interrupt rate or too-small batching thresholds

Three dominant bottlenecks (actionable patterns)

Bottleneck A — Rx ring too small (burst drop)

  • Likely cause: Rx buffers/descriptors exhausted during bursts; SW cannot replenish fast enough
  • Quick check: rx_dropped / overrun increments correlated with traffic spikes
  • Fix: increase ring depth (or buffer pool), enable batching/polling, reduce per-packet overhead
  • Pass criteria: burst test shows rx_dropped = 0 over X minutes at target load

Bottleneck B — DMA coherency / cache-line alignment mismatch

  • Likely cause: incorrect cacheability attributes, missing invalidate/clean operations, misaligned buffers/descriptors
  • Quick check: “random” corruption/drops without strong link counter evidence; behavior changes with CPU load
  • Fix: enforce cache-line alignment, correct DMA mapping, use non-cacheable or coherent memory region as required
  • Pass criteria: sustained test shows zero data integrity errors and stable counters at X throughput

Bottleneck C — interrupt storm (CPU high, rings not serviced)

  • Likely cause: per-packet interrupts, too-low moderation, or insufficient batching thresholds
  • Quick check: CPU spikes with high IRQ rate while throughput stays low; tx_underrun may appear during heavy Tx
  • Fix: enable polling/batching, tune interrupt moderation, enlarge rings to absorb bursts
  • Pass criteria: CPU utilization ≤ X% at target throughput with stable ring occupancy

Quick check (5-minute isolation checklist)

  1. Snapshot counters at idle, then after a fixed load window: rx_dropped, overrun, tx_underrun
  2. Check ring health: descriptor recycle rate and occupancy trend (flat vs saturating)
  3. Correlate with IRQ/poll rate: confirm whether CPU time is spent servicing events or moving payload
  4. Only after rings are stable: toggle offloads one-by-one (covered in the next section)
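The checklist reduces to a snapshot-delta classifier. A C sketch with illustrative counter names and thresholds; note that Bottleneck B (coherency) deliberately has no branch here, matching its "no strong counter evidence" symptom:

```c
#include <stdint.h>

/* Sketch: classify the dominant bottleneck from two counter snapshots taken
 * around a fixed load window. Counter names and the IRQ-per-packet threshold
 * are illustrative placeholders. */
typedef struct {
    uint64_t rx_dropped;
    uint64_t rx_overrun;
    uint64_t tx_underrun;
    uint64_t irq_count;
    uint64_t pkts;
} counters_t;

typedef enum {
    BN_NONE,        /* counters clean (coherency bugs may still hide here) */
    BN_RX_RING,     /* Bottleneck A: burst drops, ring/buffers too small */
    BN_IRQ_STORM    /* Bottleneck C: too many interrupts per packet */
} bottleneck_t;

static bottleneck_t classify_window(counters_t before, counters_t after)
{
    uint64_t drops = (after.rx_dropped - before.rx_dropped)
                   + (after.rx_overrun - before.rx_overrun);
    uint64_t irqs  = after.irq_count - before.irq_count;
    uint64_t pkts  = after.pkts - before.pkts;

    if (drops > 0)
        return BN_RX_RING;
    /* More than 1 IRQ per 2 packets suggests missing moderation/batching. */
    if (pkts > 0 && irqs * 2 > pkts)
        return BN_IRQ_STORM;
    return BN_NONE;
}
```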

Boundary note: this section isolates controller-side packet handling. Line-side analog/magnetics issues belong to PHY/protection pages.

Diagram: packet path and DMA ring model (Rx/Tx). Rx and Tx rings with descriptors (ownership/recycle), cache-line-aligned buffers, the DMA engine, and ISR/poll scheduling up to the stack and application. Probe points: ring occupancy, counters, IRQ rate.

Offload Features: When to Enable Checksum, TSO/LRO, and VLAN

Offloads improve throughput by shifting work from software to hardware, but they also change who computes or edits packet fields and where packets become observable. A reliable strategy is to stabilize the ring model first, then enable offloads one-by-one with counter-based pass criteria.

Toggle strategy (stabilize → optimize)

  1. Stabilize: disable complex offloads; confirm ring health and counters remain stable under load
  2. Enable checksum: verify rx_csum_err == 0 over a fixed window
  3. Enable segmentation: TSO/GSO only after checksum passes; confirm throughput improves without new drops
  4. Enable aggregation/filtering: LRO/VLAN filter last; validate latency/observability trade-offs

Pass criteria (placeholders): iperf3 throughput > X and rx_csum_err == 0.
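The stabilize-then-enable ladder above can be driven by a small harness so that the first dirty stage is unambiguous. A C sketch; the feature ladder and the caller-supplied validation callback (load test plus counter check) are illustrative, not a driver API:

```c
#include <stdbool.h>

/* Sketch: enable offloads one stage at a time and stop at the first stage
 * whose validation (load test + counter check) fails, so the culprit feature
 * is unambiguous. Names are illustrative. */
typedef enum { F_NONE, F_CSUM, F_TSO, F_VLAN } feature_t;

typedef bool (*validate_fn)(feature_t enabled_up_to, void *ctx);

/* Returns the last feature level that validated cleanly. */
static feature_t enable_offloads_staged(validate_fn validate, void *ctx)
{
    static const feature_t ladder[] = { F_NONE, F_CSUM, F_TSO, F_VLAN };
    feature_t good = F_NONE;
    for (unsigned i = 0; i < sizeof ladder / sizeof ladder[0]; i++) {
        if (!validate(ladder[i], ctx))
            break;              /* first dirty stage: stop, keep previous */
        good = ladder[i];
    }
    return good;
}

/* Simulated validator: pretend TSO introduces drops, so the ladder should
 * stop with checksum as the last clean level. */
static bool sim_validate(feature_t f, void *ctx)
{
    (void)ctx;
    return f < F_TSO;
}
```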

Feature blocks (what each changes)

Checksum offload (Rx/Tx)

  • Scope: may cover IPv4/IPv6 + TCP/UDP depending on implementation
  • Common pitfall: capture points may show “bad checksum” when hardware fills fields late
  • Quick check: A/B toggle checksum offload and compare rx_csum_err trend
  • Pass criteria: rx_csum_err == 0 at target load for X minutes

TSO/GSO and LRO (throughput vs observability)

  • TSO/GSO: shifts segmentation to hardware/software layers to reduce CPU per-byte cost
  • LRO: aggregates received packets for efficiency; can affect latency profiles and fine-grain capture fidelity
  • When to disable: debugging, latency-sensitive traffic, or when precise packet-level observability is required
  • Pass criteria: throughput improves to ≥ X with no new drops and acceptable latency (≤ X)

VLAN tag handling and filtering

  • Tag operations: insert/strip/forward VLAN tags depending on mode
  • Common pitfall: filters can silently drop frames if rule ownership is misconfigured
  • Quick check: start with VLAN pass-through (no filtering), then add rules incrementally
  • Pass criteria: expected VLAN traffic passes with drop counters == 0 over X minutes
Diagram: offload feature blocks and toggle strategy. The MAC feeds an offload slot (CSUM, TSO, VLAN) inserted before the Rx/Tx DMA rings, with counters as the probe point (rx_csum_err). Toggle flow: stabilize (rings clean) → enable CSUM (watch rx_csum_err) → enable TSO (watch throughput) → LRO/VLAN last. Rule: change one toggle at a time; keep counter-based pass criteria.

Wake-on-LAN & Low-Power Path: Always-On Listen → Match → WAKE Assert

Wake-on-LAN is a chain, not a checkbox. Reliable wake requires an Always-On (AON) path that remains powered and (if required) clocked in the low-power state, a deterministic match engine (magic packet or pattern), and a clean PME/WAKE assertion that the host power manager accepts. Debug must prove which stage fails: Armed → Packet seen → Pattern hit → WAKE edge.

WOL chain (three segments, two must-have conditions)

  • Listen (low-power Rx): minimal receive path stays alive to observe relevant frames
  • Match: magic packet or pattern engine decides whether a wake event should be generated
  • Assert: PME/WAKE pin or internal wake event crosses domains and triggers host wake
  • Must-have #1: WOL logic must be in an AON power domain (and clocked if required)
  • Must-have #2: match rule must align with actual traffic format (VLAN/IPv6/encapsulation can break offsets)

Magic packet vs pattern match (engineering differences)

Magic packet (compatibility-first)

  • Simpler rule set; commonly supported across controllers
  • Lower risk of offset/mask mismatch
  • Best for early bring-up and cross-platform verification

Pattern match (flexibility-first)

  • Precise triggers (protocol/port/payload signature) but easy to misconfigure
  • Offsets can shift with VLAN tags, IPv6 headers, tunneling, or driver-side packet shaping
  • Must verify with hit counters and a known-good generator toolchain
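The magic-packet rule itself is fully specified: a sync stream of six 0xFF bytes followed by sixteen back-to-back copies of the target MAC, anywhere in the payload. Real match engines implement it in hardware; a software model like this C sketch is useful for generating known-good test vectors and cross-checking the hit counter:

```c
#include <stdint.h>
#include <stdbool.h>
#include <string.h>

/* Software model of the magic-packet match rule: six 0xFF sync bytes, then
 * sixteen consecutive copies of the target MAC, at any payload offset. */
static bool is_magic_packet(const uint8_t *p, size_t len, const uint8_t mac[6])
{
    const size_t body = 6 + 16 * 6;           /* sync + 16 MAC copies */
    for (size_t off = 0; off + body <= len; off++) {
        size_t i;
        for (i = 0; i < 6 && p[off + i] == 0xFF; i++)
            ;
        if (i < 6)
            continue;                         /* no sync stream here */
        for (i = 0; i < 16; i++)
            if (memcmp(p + off + 6 + i * 6, mac, 6) != 0)
                break;
        if (i == 16)
            return true;
    }
    return false;
}

/* Build a minimal valid payload for a given MAC (for bench/unit use). */
static size_t make_magic_payload(uint8_t *out, const uint8_t mac[6])
{
    memset(out, 0xFF, 6);
    for (int i = 0; i < 16; i++)
        memcpy(out + 6 + i * 6, mac, 6);
    return 6 + 16 * 6;
}

/* Round-trip demonstration: a generated payload matches; a one-bit
 * corruption in the first MAC copy does not. */
static bool magic_selftest(void)
{
    uint8_t mac[6] = { 0x00, 0x11, 0x22, 0x33, 0x44, 0x55 };
    uint8_t buf[128];
    size_t n = make_magic_payload(buf, mac);
    if (!is_magic_packet(buf, n, mac)) return false;
    buf[10] ^= 1;                       /* corrupt first MAC copy */
    return !is_magic_packet(buf, n, mac);
}
```

Because the rule scans at any offset, VLAN tags or UDP encapsulation do not break magic-packet matching the way they break fixed-offset pattern rules.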

Failure tree (prove which stage fails)

Step 1 — WOL armed?

  • Likely cause: driver did not enable WOL, or low-power entry cleared WOL state
  • Quick check: readback WOL_ARMED status (register/flag)
  • Fix: apply WOL enable in the final pre-sleep stage and confirm it persists into low-power mode

Step 2 — Packet seen in low-power?

  • Likely cause: low-power mode disabled the minimal Rx listen path
  • Quick check: PACKET_SEEN_COUNT increments (or low-power Rx event flag)
  • Fix: adjust low-power policy to keep the required listen path enabled (AON domain)

Step 3 — Pattern hit?

  • Likely cause: mismatch in rule/offset/mask; VLAN or encapsulation shifts headers
  • Quick check: PATTERN_HIT_COUNT increments; compare magic vs pattern behavior
  • Fix: start with magic packet; then add pattern rules incrementally and validate each hit counter

Step 4 — PME/WAKE asserted?

  • Likely cause: WAKE pin wiring/polarity/pull is incorrect; wake event not latched across domains
  • Quick check: observe WAKE edge at pin or a host GPIO sampling point
  • Fix: confirm pin mux, polarity, pull network, and wake latch configuration

Step 5 — Host wake accepted?

  • Likely cause: host sleep state is too deep or wake source is not mapped/enabled
  • Quick check: host power manager wake-source log (platform-specific)
  • Fix: verify wake-source mapping and supported sleep state for network wake

Quick check (minimum evidence fields) + pass criteria

  • Log fields: WOL_ARMED, PACKET_SEEN_COUNT, PATTERN_HIT_COUNT, PME/WAKE_EDGE_COUNT
  • Pass (latency): packet → wake latency < X ms (measure from packet_seen event to WAKE edge)
  • Pass (integrity): pattern_hit implies WAKE asserted within X ms (no lost wake)
  • Optional pass (robustness): false wake rate < X/day
Diagram: Always-On WOL chain. The AON domain contains a minimal Rx listen path (fed from the Ethernet link via PHY/PCS), the magic/pattern match engine, and a wake latch driving PME/WAKE into the host SoC power domain (sleep → wake). Probe points: armed, pattern hit, WAKE edge. Minimal evidence sequence: WOL armed → packet seen → pattern hit → WAKE edge.

Driver / EEPROM / Straps Configuration: Priority, Readback, and Stable Mode

Stable bring-up depends on knowing which configuration source wins and proving it via readback. The priority chain is typically: Strap (latched at boot) → EEPROM/NVM (loaded at init) → Driver override (runtime). Debug should treat “effective mode” as the single source of truth.

Configuration priority model (what overrides what)

  • Strap: sampled at boot/reset; defines default personality (interface mode, default policy)
  • EEPROM/NVM: loads persistent fields (MAC address, LED/WOL mode, interface options) if enabled
  • Driver override: last layer; can override defaults and must be audited for reproducibility

Typical fields to persist (keep within this page boundary)

Identity (must be stable)

  • MAC address source and programming policy (factory vs field)
  • Per-port allocation rule (placeholder) when multiple ports exist

Bring-up critical (wrong → no link / errors)

  • RGMII delay mode (TX/RX delay enable as required)
  • SGMII policy (auto-negotiation vs fixed rate; in-band status enable)

Behavior (verify enablement via readback)

  • LED mode selection (verify correct strap/NVM field applied)
  • WOL enable (verify effective WOL state persists into low power)

Debug flow (readback-first, reproducible)

  1. After boot: read strap-latched register(s) to confirm sampled personality
  2. After NVM load: read NVM/EEPROM status (load OK, CRC/signature OK if available)
  3. After driver init: read effective mode registers (RGMII/SGMII policy, WOL enable, LED mode)
  4. Record: store readback fields in bring-up logs for regression and production correlation

Pass criteria (placeholders)

  • MAC addr stable across reset == true
  • Effective mode stable across cold boot == true
  • Readback mismatch count ≤ X over X cycles
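The mismatch-count pass criterion above can be computed from a simple expected-vs-readback comparison. A C sketch; the effective-mode field set is illustrative and should mirror whatever registers your controller actually exposes:

```c
#include <stdbool.h>

/* Sketch of readback-first verification: compare expected effective-mode
 * fields against what the device reports and count mismatches for the
 * bring-up log. Field set is illustrative. */
typedef struct {
    int  interface_mode;   /* e.g. 0 = RGMII, 1 = SGMII */
    bool an_enabled;
    bool wol_enabled;
    int  led_mode;
} effective_mode_t;

static int mode_mismatch_count(effective_mode_t want, effective_mode_t got)
{
    int n = 0;
    n += want.interface_mode != got.interface_mode;
    n += want.an_enabled     != got.an_enabled;
    n += want.wol_enabled    != got.wol_enabled;
    n += want.led_mode       != got.led_mode;
    return n;   /* pass criterion: count stays <= X over X boot cycles */
}
```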
Diagram: configuration priority path and readback probes. Three sources converge into the register map (shadow + live): strap pins (latched at boot, base), EEPROM/NVM (load + CRC OK), and driver override (runtime, highest priority), producing the effective mode (RGMII/SGMII, WOL/LED, MAC address). Probe points: strap_latched, nvm_status, effective_regs. Rule: change one layer at a time; always validate via readback logs.

Reliability & Field Diagnostics: Counters → Rates → Loopback → Isolation

Field failures are solved fastest by evidence ordering: choose the right counter group, watch growth rate (delta per minute), then use loopback to isolate host-side vs link-side. This section stays on controller/PCS observability and avoids analog PHY deep dives.

Counter groups (map symptoms to evidence)

Group A — Frame integrity (are frames corrupted?)

  • FCS/CRC errors (FCS_err)
  • Alignment / malformed frame flags (if exposed)
  • Error rate rule: use delta per minute, not total count

Group B — Pressure & buffering (are rings/DMA overwhelmed?)

  • Missed frame / dropped
  • Overrun / no-buffer / ring overflow (rx_overrun)
  • Interpretation: high small-packet load + rising overrun often points to host-side pacing

Group C — Flow control (is PAUSE shaping throughput?)

  • PAUSE frames rx/tx (pause_rx, pause_tx)
  • Symptom mapping: sawtooth throughput + increasing pause counters → congestion policy is a prime suspect
  • Action rule: confirm rate vs mode changes before tuning offload features

Field logging (rate-first, correlation-friendly)

  • Sampling: snapshot counters every X seconds and compute delta per minute
  • Correlate: compare deltas across load (idle vs stress), temperature (ambient vs hot), and power policy transitions
  • Minimum context: throughput, packet size mix (large vs small), current power mode (sleep/EEE/WOL armed)
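The delta-per-minute rule normalizes snapshots taken at different intervals into comparable rates. A minimal C sketch of that conversion:

```c
/* Sketch: convert two counter snapshots into a per-minute rate so that logs
 * with different sampling intervals are directly comparable. */
static double delta_per_minute(unsigned long long prev, unsigned long long now,
                               double interval_s)
{
    if (interval_s <= 0.0)
        return 0.0;                      /* guard malformed log intervals */
    return (double)(now - prev) * 60.0 / interval_s;
}
```

For example, 60 new FCS errors over a 30-second window is a rate of 120/min regardless of whether the next station samples every 10 s or every 60 s.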

Loopback as an isolation knife (host-side vs link-side)

MAC loopback (controller-focused)

  • Cuts off external link while testing host ↔ MAC ↔ DMA integrity
  • If errors persist in MAC loopback, prioritize driver/DMA/ring pacing and memory mapping

PCS loopback (digital link logic, no analog deep dive)

  • Keeps digital link logic in path while excluding far-end traffic variability
  • If PCS loopback is clean but external link errors exist, escalate to link-side topics in sibling pages

Pass criteria (placeholders with windows)

  • FCS_err_rate < X / hour over X hours at throughput > X% line-rate
  • overrun_delta == 0 over X minutes during small-packet stress
  • pause_frame_rate within expected policy bounds (no unexplained bursts)
Diagram: field diagnostics flow (evidence ordering). From symptom (drop / error / jitter), pick a counter group (integrity/FCS, pressure/overrun, flow/pause), run the rate check (delta per minute), then loopback isolation (MAC loopback, PCS loopback) to decide host-side (driver / DMA / rings) vs link-side (escalate to sibling pages).

Bring-up Checklist: Design → First Light → Performance → Production

A controller bring-up should behave like a reproducible SOP. Each phase lists actions, evidence (readbacks/counters), and pass criteria (placeholders). Keep configuration changes single-layer at a time and log results for regression.

Phase 1 — Design review (avoid non-bring-upable boards)

  • Interface: lock decision (RGMII/SGMII) and required strap defaults
  • Clock/reset: ensure stable reference and reset timing windows (placeholders)
  • Config paths: strap/EEPROM/driver plan; define single source of truth per phase
  • Debug hooks: MDIO access, strap readback point, WAKE probe point, basic test pads

Phase 2 — First light (ID → mode → link up → basic traffic)

  1. Read ID/version: confirm management bus access and register map sanity
  2. Apply mode: set RGMII delays or SGMII policy; read back effective mode
  3. Link up: verify speed/duplex match the peer (avoid ambiguous AN states)
  4. Basic traffic: ARP/ping baseline; ensure no immediate FCS/overrun deltas

Phase 3 — Performance (run full rate without dirty counters)

  • Throughput: iperf stress with large packets, small packets, and bidirectional traffic
  • Evidence: counter deltas stay clean (FCS, overrun, missed, pause rate is explainable)
  • If failing: return to “counters → rates → loopback” isolation before changing features

Performance pass criteria (placeholders)

  • iperf3 throughput > X at X% line-rate
  • FCS_err_rate < X/hour over X hours
  • overrun_delta == 0 over X minutes

Phase 4 — Power/WOL + Production hooks (repeatable in factory)

  • Sleep/WOL: sleep → magic/pattern → wake → link restore; log armed/hit/wake edges
  • Identity: persist MAC address and serial policy; verify stable across cold boot
  • Self-test: loopback + counter snapshot in < X seconds (station-friendly)
  • Station output: store PASS/FAIL with counter deltas and effective-mode readback
Diagram: bring-up SOP phases. Four sequential phases with key checkpoints: design (interface, clock/reset, debug hooks), first light (read ID, set mode, link up), performance (iperf, packet mix, clean deltas), production (MAC persist, loopback, WOL test). An evidence log is captured in every phase: effective-mode readback, counter deltas (rate), loopback results. Rule: change one variable per iteration; keep logs reproducible for production correlation.

IC Selection Logic (controller/MAC-side only)

Selection is treated as a decision path, not a parameter dump: host interface gate → throughput/CPU gate → offload set → WOL path → software ecosystem. Scope is strictly the controller/MAC side (interfaces, DMA/queues, offload, WOL, driver/tooling). Analog PHY, magnetics, and TSN switching are excluded.

Selection boundary & expected outputs

  • Allowed dimensions: host interface, DMA/ring capability, offload set, WOL support path, driver/tool maturity, power states
  • Excluded: analog PHY tuning, magnetics selection, ESD/surge networks, TSN switching features
  • Output: one interface choice + must-have feature list + minimal bring-up/production evidence list

Gate checklist (fail-fast filters)

Gate 1 — Host interface compatibility

  • RGMII/SGMII: used when host already exposes MAC-side lanes; selection is dominated by bring-up risk and debug readiness
  • PCIe/USB/SPI: used when host lacks native Ethernet MAC lanes or needs an external network port module
  • Evidence: confirm driver support on target OS and confirm management/register access path exists

Gate 2 — Throughput & CPU budget

  • Target line-rate: 10/100 vs 1G vs 2.5G must match actual system need
  • Small-packet sensitivity: high PPS workloads demand stronger DMA/queue design (not just peak Mbps)
  • Evidence: iperf + counter deltas (FCS/overrun) stay clean at throughput > X% line-rate (placeholders)

Gate 3 — Must-have offload & WOL

  • Offload floor: checksum offload (Rx/Tx) is a baseline requirement for CPU containment
  • Optional accelerators: TSO/LRO/VLAN filtering become score items after stability is proven
  • WOL floor: confirm WOL arming + pattern hit + wake assertion observability (registers/status), not marketing labels

Scorecard (compare survivors without crossing PHY/TSN scope)

DMA / buffering strength (burst absorption)

  • Max descriptors: supports X Rx and X Tx descriptors (placeholder)
  • Queues: supports X queues (placeholder) if traffic separation is needed
  • Interrupt mitigation: supports coalescing/polling mode to avoid interrupt storms

Software ecosystem (risk & time-to-stable)

  • Driver maturity: stable on target OS/kernel versions; known issues are bounded
  • Config tooling: strap/EEPROM/OTP programming flow is scriptable for production
  • Observability: exposes counters + loopback + status readbacks for field diagnostics

Power & WOL readiness (placeholders)

  • Active: < X mW (placeholder) at target link speed
  • Idle: < X mW (placeholder) with link idle
  • WOL armed: < X mW (placeholder) + wake latency < X ms (placeholder)

Concrete part numbers (examples by host interface)

Part numbers below are commonly used reference ICs for this topic. Always verify package, temperature grade, suffix, and driver support on the exact target platform.

SPI Ethernet controllers (embedded/low-pin-count ports)

  • WIZnet W5500 — SPI Ethernet controller with an integrated hardwired TCP/IP stack (fit for MCU-class hosts)
  • WIZnet W5100S — SPI Ethernet controller family option (verify offload/feature set vs W5500)
  • Microchip ENC28J60 — 10BASE-T (10 Mbps only) Ethernet controller (SPI), widely used in cost-sensitive embedded designs
  • Microchip KSZ8851SNL — 10/100 Ethernet controller with SPI host interface (common in industrial embedded ports)
  • Davicom DM9051 — SPI Ethernet controller option (verify driver availability and feature coverage)

Selection note: SPI ports are often dominated by PPS/latency and driver overhead. Treat checksum/offload availability and buffer depth as primary score items.
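As a sanity check on "dominated by PPS/latency", a back-of-envelope throughput ceiling helps set expectations before benchmarking. A hedged sketch assuming single-bit SPI and a fixed per-frame command overhead (the 8-byte figure is illustrative, not any specific part's framing):

```python
# Back-of-envelope SPI Ethernet ceiling. Assumptions (illustrative only):
# single-bit SPI, back-to-back frames, fixed command/header overhead per
# frame, zero driver/CPU cost. Real parts land below this number.

def spi_ceiling_mbps(sclk_hz: float, frame_bytes: int, overhead_bytes: int) -> float:
    """Usable payload throughput in Mbps for back-to-back frames."""
    payload_fraction = frame_bytes / (frame_bytes + overhead_bytes)
    return sclk_hz * payload_fraction / 1e6

# A 30 MHz SPI clock moving 1500-byte frames with an assumed 8-byte
# command overhead yields a payload ceiling just under 30 Mbps:
print(round(spi_ceiling_mbps(30e6, 1500, 8), 2))  # 29.84
```

The same formula with 64-byte frames shows why small packets hurt disproportionately: the fixed overhead is amortized over far fewer payload bytes.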

USB to Ethernet controllers (external ports / dongles / gateways)

  • Microchip LAN7800 — USB 3.x to Gigabit Ethernet controller (commonly supports WoL; verify platform power policy)
  • Microchip LAN7850 — USB 2.0 to Gigabit Ethernet controller family option (verify tooling/OTP flow)
  • Realtek RTL8153B — USB 3.x to Gigabit Ethernet controller (widely deployed; WoL support depends on OS/driver policy)
  • ASIX AX88179 — USB 3.0 to Gigabit Ethernet controller (commonly offers WoL features; verify driver stack behavior)
  • ASIX AX88772B — USB 2.0 to 10/100 Ethernet controller option (cost/legacy fit)

Selection note: USB solutions are sensitive to small-packet PPS and host USB power management. Confirm suspend/resume stability and WoL behavior early.
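The small-packet sensitivity called out above is easy to quantify. Standard Ethernet framing adds 8 bytes of preamble/SFD and a 12-byte inter-frame gap per frame, which sets the PPS a path must sustain at line rate:

```python
# Line-rate PPS for a given frame size: the stress number that USB/SPI
# paths must be judged against (standard Ethernet framing math).

def line_rate_pps(link_bps: float, frame_bytes: int) -> float:
    # Each frame also occupies 8 bytes preamble/SFD + 12 bytes inter-frame
    # gap on the wire, regardless of payload size.
    wire_bytes = frame_bytes + 8 + 12
    return link_bps / (wire_bytes * 8)

# Minimum-size 64-byte frames at 1 Gbps: ~1.488 million packets/second.
print(int(line_rate_pps(1e9, 64)))  # 1488095
```

A controller that holds line rate with 1500-byte frames can still collapse at this worst-case PPS, which is why the scorecard treats small-packet behavior separately from peak Mbps.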

PCIe Ethernet controllers (PC-class hosts / higher PPS stability)

  • Intel I210-AT — PCIe Gigabit Ethernet controller (enterprise/industrial class; verify exact feature set per stepping)
  • Intel I211-AT — PCIe Gigabit Ethernet controller family option (platform-dependent integration)
  • Intel I225-V — PCIe 2.5GbE controller option (verify revision/driver compatibility on target OS)
  • Realtek RTL8111H — PCIe Gigabit Ethernet controller (common on embedded x86 boards; verify driver choice and WoL policy)
  • Realtek RTL8125B — PCIe 2.5GbE controller option (throughput headroom; verify thermal/power budget)

Selection note: PCIe controllers usually win on PPS and CPU containment. The dominant risk shifts to driver maturity, platform power states, and WoL integration.

Selection pass criteria (placeholders)

  • iperf3 throughput > X with clean counter deltas (FCS_err_rate < X/hour)
  • overrun_delta == 0 over X minutes under small-packet stress
  • WOL: armed==true, pattern_hit==true, wake_asserted==true; latency < X ms
  • Ecosystem: stable driver on target OS + scriptable configuration path (strap/EEPROM/OTP)
IC selection decision tree (controller/MAC-side)

[Figure: selection tree, controller/MAC-side only (excludes analog PHY / magnetics / TSN switch). Gates and scores flow: host interface (RGMII / SGMII / PCIe / USB / SPI) → line-rate gate (10/100/1G/2.5G) and PPS/CPU gate (small packets) → DMA strength (descriptors) and offload set (CSUM, TSO/LRO, VLAN) → WOL path (armed/hit/wake) plus software ecosystem (drivers, OS support, EEPROM/OTP tooling). Example ICs — SPI: W5500 / ENC28J60 / KSZ8851SNL; USB: LAN7800 / RTL8153B / AX88179; PCIe: I210-AT / RTL8111H / RTL8125B.]


FAQs (MAC-PHY controller scope): actionable checks with pass thresholds

Each answer is intentionally short and executable. Scope is controller/MAC-side only: host interface (RGMII/SGMII/PCIe/USB/SPI), DMA/rings, offloads, WOL, counters, loopback isolation, and config precedence.

RGMII link is up, but throughput is poor — TX/RX delay or rings too small?

Likely cause: RGMII TX/RX internal delay/edge selection is mismatched (silent retries/corruption), or DMA rings are too shallow causing burst drops.

Quick check: Read back effective RGMII delay mode; compare counter deltas FCS_err_rate vs rx_overrun_rate (per minute); run A/B with large vs small packets.

Fix: Correct the delay configuration first (exactly one stage per direction, whether MAC, PHY, or PCB trace, should add the roughly 2 ns RGMII clock skew), then increase RX/TX descriptors (one variable per iteration).

Pass criteria: iperf3_throughput > X and FCS_err_rate < X/hour and rx_overrun_rate < X/min over X minutes.
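The FCS-vs-overrun comparison in the quick check reduces to a small triage helper. A sketch with illustrative names and logic, not a specific driver's API:

```python
# Hypothetical triage helper for the RGMII FAQ: decide which fault class
# the per-minute counter deltas point at. Thresholds are illustrative.

def classify_rgmii_fault(fcs_per_min: float, overrun_per_min: float) -> str:
    if fcs_per_min > 0 and fcs_per_min >= overrun_per_min:
        # Corrupted frames on the interface: suspect TX/RX delay or edge.
        return "suspect RGMII delay/edge (corruption on the interface)"
    if overrun_per_min > 0:
        # Frames arrive intact but are dropped: suspect burst absorption.
        return "suspect shallow rings / DMA starvation (burst drops)"
    return "counters clean; look elsewhere in the throughput path"

print(classify_rgmii_fault(fcs_per_min=5.0, overrun_per_min=0.0))
```

The large-vs-small packet A/B sharpens the same split: delay faults corrupt regardless of size, while shallow rings hurt mostly under small-packet bursts.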

SGMII link is up, but intermittent one-way traffic — in-band status or PAUSE flow control?

Likely cause: In-band status / autoneg state interpretation is inconsistent, or PAUSE is throttling one direction under congestion.

Quick check: Read back SGMII mode (AN, in_band, fixed rate); log pause_rx/pause_tx delta and compare with unilateral throughput tests.

Fix: Force a known-good speed/duplex policy (fixed or consistent AN) for isolation; then enable PAUSE only if needed and understood.

Pass criteria: Unidirectional and bidirectional tests meet throughput > X, with pause_frames_rate stable and explainable (no unexplained bursts).

Enabling checksum offload makes captures show “bad checksum” — display artifact or real corruption?

Likely cause: Capture point observes packets before HW fills checksum (normal offload artifact), or offload coverage/mode mismatch causes real bad frames.

Quick check: A/B disable checksum offload and compare FCS_err_rate and retry symptoms; verify enabled offload types (IPv4/IPv6/TCP/UDP) match driver capability flags.

Fix: Use link evidence (FCS/counter deltas) as truth; keep checksum offload only after the A/B test proves no real corruption.

Pass criteria: FCS_err_rate < X/hour and (if exposed) rx_csum_err == 0 over X hours at throughput > X.
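The A/B logic above can be written down as a decision helper. A hedged sketch: counter names are illustrative, and an rx_csum_err counter is only exposed by some drivers:

```python
# Sketch of the checksum-offload verdict. A capture flagging "bad
# checksum" while link counters stay clean is the normal offload
# artifact: the capture taps frames before hardware fills the checksum.

def csum_verdict(capture_bad_csum: bool, fcs_delta: int, rx_csum_err: int) -> str:
    if fcs_delta > 0 or rx_csum_err > 0:
        return "real corruption: disable offload and re-run the A/B"
    if capture_bad_csum:
        return "display artifact: capture taps frames before HW fills checksum"
    return "clean: keep offload enabled"

print(csum_verdict(capture_bad_csum=True, fcs_delta=0, rx_csum_err=0))
```

Note the ordering: counter evidence outranks what the capture shows, which is exactly the "link evidence as truth" rule in the fix above.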

Jumbo frames (9k MTU) drop, but small packets are fine — buffers/descriptors or segmentation offload?

Likely cause: RX buffer sizing / descriptor chain is insufficient for MTU, or TSO/GSO/LRO configuration is incompatible with the MTU path.

Quick check: Enable 9k and watch rx_overrun/no_buffer deltas; read back effective MTU/offload toggles; A/B disable TSO/LRO while keeping MTU constant.

Fix: Stabilize jumbo path with complex offloads disabled, then increase buffers/descriptors and re-enable offloads one-by-one.

Pass criteria: With MTU=9000, drop_delta == 0 and rx_overrun_rate < X/min over X minutes.
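The buffer/descriptor arithmetic behind this failure is simple to check up front. A sketch assuming fixed-size RX buffers chained per frame (the 2 KiB buffer size is illustrative):

```python
# Illustrative buffer-chain math for the jumbo-frame case: how many RX
# descriptors a single large frame consumes when each RX buffer is a
# fixed size, as in many scatter/chained-descriptor designs.

import math

def descriptors_per_frame(mtu: int, rx_buf_bytes: int) -> int:
    return math.ceil(mtu / rx_buf_bytes)

# With 2 KiB buffers, one 9000-byte frame spans 5 descriptors, so a
# burst of N jumbo frames needs ~5*N free descriptors before
# rx_overrun/no_buffer counters start climbing.
print(descriptors_per_frame(9000, 2048))  # 5
```

This is why a ring depth that looked generous at MTU 1500 can starve at MTU 9000 under the same packet burst.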

CPU usage is high even at low traffic — interrupt storm or polling/NAPI thresholds?

Likely cause: Interrupt rate is excessive (no/poor coalescing), or polling thresholds cause frequent wake-ups with low work per wake.

Quick check: Log irq_rate (X/s) vs throughput; confirm whether rx_overrun rises (pressure) or CPU burns without drops (scheduling/IRQ).

Fix: Enable/tune interrupt coalescing first; then tune polling thresholds (one knob at a time) while keeping traffic profile fixed.

Pass criteria: CPU% < X and irq_rate < X/s with throughput > X and drop_delta == 0.
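The "work per wake" idea behind coalescing can be made concrete: packets handled per interrupt is the ratio to watch. A minimal sketch:

```python
# Sketch of the work-per-wake metric from the coalescing FAQ. A ratio
# near 1 at low traffic means every packet wakes the CPU: an IRQ storm
# or ineffective coalescing, even if no packets are being dropped.

def packets_per_irq(pps: float, irq_rate: float) -> float:
    return pps / irq_rate if irq_rate else float("inf")

# 10k pps at 10k IRQ/s is one packet per wake; the same traffic at
# 1k IRQ/s batches 10 packets per wake.
print(packets_per_irq(10_000, 10_000), packets_per_irq(10_000, 1_000))  # 1.0 10.0
```

Tuning should raise this ratio without letting rx_overrun climb, which is the latency-vs-batching trade coalescing knobs control.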

Wake-on-LAN is configured but does not wake — AON power domain or pattern/magic mismatch?

Likely cause: Always-on (AON) domain is not powered/clocked in sleep, or WOL pattern/magic packet configuration does not match the sender.

Quick check: Verify wol_armed; send packet then read pattern_hit (if available); observe PME/WAKE asserted (pin or status).

Fix: Guarantee AON rail and wake line integrity first; then align pattern/magic settings and re-test with a known-good generator.

Pass criteria: wol_armed==true, pattern_hit==true, wake_asserted==true, and wake_latency < X ms.
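For the "known-good generator" step, the Magic Packet payload format is standard: six 0xFF synchronization bytes followed by the target MAC repeated 16 times, typically carried in a UDP broadcast (port 9 is a common convention, not a requirement):

```python
# Known-good Magic Packet generator for the WOL re-test step. The payload
# layout (6 x 0xFF, then the target MAC repeated 16 times) is the
# standard Magic Packet format; UDP broadcast is a typical carrier.

import socket

def magic_packet(mac: str) -> bytes:
    mac_bytes = bytes.fromhex(mac.replace(":", "").replace("-", ""))
    if len(mac_bytes) != 6:
        raise ValueError("MAC must be 6 bytes")
    return b"\xff" * 6 + mac_bytes * 16

def send_wol(mac: str, bcast: str = "255.255.255.255", port: int = 9) -> None:
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
        s.sendto(magic_packet(mac), (bcast, port))

pkt = magic_packet("00:11:22:33:44:55")
print(len(pkt))  # 102 bytes: 6 sync bytes + 16 * 6-byte MAC repetitions
```

Generating the packet locally removes the sender as a variable, so a missing pattern_hit points at the AON rail, wake line, or pattern configuration rather than the test stimulus.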

After sleep/wake, ping works but link drops later — restore order or driver reinit missing?

Likely cause: Post-wake sequence leaves queues/descriptors partially stale, or mode/offload/flow-control is not restored to the pre-sleep effective state.

Quick check: Snapshot counters immediately after wake, then watch delta for X minutes; compare pre/post wake readbacks (mode, offloads, PAUSE policy).

Fix: Enforce a deterministic restore sequence: mode → ring/queue init → enable traffic; reapply only known-good offloads after stability is proven.

Pass criteria: sleep_wake_success_rate == 100% over X cycles and FCS_err_rate < X/hour.

Speed/duplex keeps renegotiating — force mode first or check peer consistency?

Likely cause: Autoneg advertisement mismatch, or driver state machine triggers repeated renegotiation under specific conditions.

Quick check: Log speed/duplex transitions with timestamps; A/B compare fixed mode vs AN while keeping the peer unchanged.

Fix: Lock a known-good speed/duplex for isolation; only re-enable AN after peer settings and driver behavior are verified stable.

Pass criteria: No unexpected renegotiation events for X hours; throughput remains within ±X%.

FCS errors increase, but “the eye looks OK” — how to isolate with MAC/PCS loopback?

Likely cause: Fault can be host-side (DMA/driver/mode) or link-side; visual inspection is not a pass criterion.

Quick check: Run MAC loopback (isolates host-side); then PCS loopback (digital link logic) and compare FCS_err_rate deltas.

Fix: If MAC loopback fails, focus on controller/driver/rings; if loopbacks pass but external link fails, escalate to sibling pages (PHY/line-side) without expanding scope here.

Pass criteria: In loopback modes, FCS_err_rate == 0 over X minutes at throughput > X.

MAC address changes after every reset — strap/EEPROM/driver precedence conflict?

Likely cause: Multiple sources compete (strap → EEPROM/OTP → driver override), or EEPROM programming/verification is unreliable across cold vs warm resets.

Quick check: Read back current MAC and (if available) MAC-source indicator; compare cold boot vs warm reset; verify EEPROM content and checksum/valid flag.

Fix: Define a single authoritative MAC source; disable or harmonize other override paths; add production write + readback + verify step.

Pass criteria: mac_addr_stable == true across X cold boots and X warm resets.
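The production verify step reduces to a single invariant: every boot cycle must report the same MAC. A sketch where the readback source is a placeholder for the platform's actual query:

```python
# Minimal sketch of the mac_addr_stable check: collect the MAC readback
# after each cold boot and warm reset, then assert one stable value.
# Normalization covers cosmetic differences in separator and case.

def mac_addr_stable(readbacks: list[str]) -> bool:
    """True when every boot cycle reported the same MAC address."""
    normalized = {m.lower().replace("-", ":") for m in readbacks}
    return len(normalized) == 1

# Five cold boots and five warm resets, same address in two notations:
cycles = ["00:11:22:33:44:55"] * 5 + ["00-11-22-33-44-55"] * 5
print(mac_addr_stable(cycles))  # True
```

Running this across both cold and warm cycles is what catches precedence conflicts, since strap, EEPROM/OTP, and driver overrides can win on different reset paths.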

VLAN enabled and some packets “disappear” — filter rules or tag direction is wrong?

Likely cause: VLAN filter is too strict (drops frames), or tag insert/strip direction is misconfigured so frames are not delivered to the intended path.

Quick check: A/B disable VLAN filtering while keeping tag processing constant; check for filtered_drop_counter (if exposed) and drop_delta deltas.

Fix: Stabilize VLAN as pure pass-through first, then add filter rules incrementally with A/B verification per rule.

Pass criteria: In the target VLAN scenario, drop_delta == 0 over X minutes and throughput > X.

Drops only at low/high temperature — log counter rates first or disable complex offloads for A/B?

Likely cause: Temperature shifts a marginal boundary where complex features (TSO/LRO/EEE/WOL-armed policies) become brittle, or ring/interrupt pacing fails under stress.

Quick check: Start with rate-based logging (FCS_err_rate, rx_overrun_rate, drop_delta) across temperature steps; then A/B disable complex offloads while holding traffic profile constant.

Fix: Use A/B to isolate the feature boundary first, then re-enable features one-by-one with counter-rate evidence at each temperature point.

Pass criteria: Over X~X°C, drop_delta == 0 and FCS_err_rate < X/hour at throughput > X.