Unmanaged & Smart Switch ICs for Industrial Ethernet

Q: VLAN configured and a device “disappeared” — wrong PVID or egress untag rule?

Likely cause: Port PVID mismatch, or egress changed from untag→tag (or tag→untag) and the endpoint cannot parse it.
Quick check: Mirror the port and confirm whether ARP/LLDP frames are tagged or untagged; read back PVID + egress rule.
Fix: Restore a management reachability invariant (rescue port/VLAN), then standardize access ports as untag + correct PVID; keep trunk ports tagged.
Pass criteria: Endpoint reachable for X_min; port CRC/drop increment ≤ X_cnt per X_min; no unexpected VLAN tag changes.

Q: Utilization looks low, but the cell feels “stuck” — unknown-unicast flood or queue starvation?

Likely cause: Unknown-unicast flooding (MAC table miss/aging) or strict-priority scheduling starving lower classes, triggering retries/backpressure.
Quick check: Read unknown-unicast counters and broadcast share; mirror suspect ports and look for spikes in repeated ARP requests, retries, or timeouts.
Fix: Enable unknown-unicast storm guard or policing; switch to WRR (or add minimum share) to prevent permanent starvation.
Pass criteria: Unknown-unicast share ≤ X_% (or ≤ X_pps) for X_min; control latency/jitter ≤ X_ms; drop increment ≤ X_cnt per X_min.

Q: Mirroring is enabled, but the “bad frames” never show up — wrong mirror tap point (ingress vs egress)?

Likely cause: Wrong mirror source (port vs VLAN), wrong tap (ingress vs egress), or drops caused by mirror-port oversubscription.
Quick check: Inject a known test flow; compare ingress vs egress tap (if supported); check mirror port drop counters.
Fix: Move the tap closer to the suspected stage; narrow mirrored scope; ensure the destination can sustain the rate.
Pass criteria: Target-frame hit rate ≥ X_%; mirror-port drop increment ≤ X_cnt per X_min.

Q: QoS enabled, but latency jitter got worse — strict priority starving lower classes and amplifying retransmits?

Likely cause: Strict priority starves lower classes until retries multiply; or the PCP→queue mapping is inverted.
Quick check: Create contention and measure control delay; mirror PCP markings and verify mapping.
Fix: Use WRR (or strict + minimum share); apply ingress policing on noisy ports.
Pass criteria: Control latency/jitter ≤ X_ms under stress; no sustained starvation (low-class drop increment ≤ X_cnt per X_min).

Q: Storm control “kills” discovery/heartbeats — threshold unit or measurement window mismatch?

Likely cause: Wrong threshold unit (pps vs %), or an averaging window so short that legitimate multicast/discovery frames are dropped.
Quick check: Measure baseline LLDP/ARP/heartbeat pps; correlate storm-drop with the outage moment.
Fix: Sweep thresholds to find a safe window; separate limits for broadcast/multicast/unknown-unicast.
Pass criteria: Heartbeats survive for X_min; storm-drop ≤ X_cnt during normal traffic; no repeat outage within X_min.

Q: A ring/loop connection causes a broadcast meltdown — loop not contained, dual uplinks, or threshold too high?

Likely cause: Physical loop (mispatch/dual uplinks) creates broadcast amplification; storm guard is disabled or its threshold is set too high.
Quick check: Break one link and observe immediate recovery; read broadcast share and storm-drop counters.
Fix: Enable loop containment (within what the smart switch supports) and set practical storm limits; enforce single-uplink rule.
Pass criteria: Broadcast share ≤ X_% for X_min; recovery time ≤ X_s; no cell-wide collapse in drills.

Q: Only one port shows high CRC/drop — board-level return path or local SI/power noise?

Likely cause: Local SI/return-path issue, connector/cable defect, or power/ground noise coupling into that port.
Quick check: Swap cables/ports to see whether the fault follows the port or the cable; correlate CRC ramps with event logs.
Fix: Isolate via VLAN to reduce blast radius; then address decoupling, ground continuity, routing/spacing, cable quality.
Pass criteria: CRC increment ≤ X_cnt per X_min under same load; link flaps = 0 for X_min.

Q: Link drops only at high temperature — rail derating, reset margin, or clock margin collapse?

Likely cause: Rail derating triggers brownout/reset, oscillator margin collapses, or temperature shifts margins until errors accumulate.
Quick check: Align temperature vs link events vs CRC/drop; verify reset-reason/undervoltage indicators.
Fix: Increase power/reset margin and thermal path; improve clock stability/noise isolation; apply protective lower-rate mode if needed.
Pass criteria: Stable for X_min at high temperature; link flaps = 0; CRC/drop increment ≤ X_cnt per X_min.

Q: After shipping, remote configuration is impossible — missing lockout prevention and rollback?

Likely cause: No rescue profile/port, no staged commit, or management VLAN can be isolated by policy.
Quick check: Read active profile version/CRC; confirm rescue path remains reachable after a forced misconfig test.
Fix: Dual-profile (active/rescue) + watchdog rollback; enforce management reachability invariant.
Pass criteria: Auto-recovery within X_s; success rate ≥ X_% across drills; reachability stable for X_min.

Q: Plugging in one device slows the whole network — uncontrolled burst, unknown-unicast flooding, or no policing?

Likely cause: A noisy endpoint generates bursts without limits; queues saturate and create global contention.
Quick check: Mirror that port and quantify broadcast/unknown share; check storm-drop and policing counters.
Fix: Apply ingress policing; enable storm control; isolate via VLAN if necessary.
Pass criteria: Global latency/drops stay within X_ms/X_cnt; abnormal traffic share ≤ X_% for X_min.

Smart/unmanaged switch ICs make small industrial networks predictable: isolate traffic with VLAN, protect control flows with QoS, and contain loops/storms before they become field outages.

This page focuses on the minimum set of features and verification steps that turn “it forwards packets” into repeatable, serviceable, and production-safe behavior.

H2-1. Definition, Positioning, and Scope (Unmanaged vs Smart Switch IC)

Unmanaged and smart Ethernet switch ICs enable multi-port Layer-2 forwarding. The practical difference is whether the design gains containment (VLAN/limits), predictability (QoS), and observability (mirroring/counters) to prevent field incidents and speed up root-cause analysis.

In-scope

Unmanaged vs smart positioning and decision cues
VLAN basics (port-based / 802.1Q entry-level rules)
QoS basics (PCP mapping, queue behavior, rate limiting)
Port mirroring and minimum counters for debugging
Storm control fundamentals and loop containment basics
Ring “basics” as concepts only (no standards deep-dive)

Out-of-scope

Fully managed switch deep features (L3 routing, ACL/TCAM details)
TSN scheduling (Qbv/Qci/Qav), time windows, admission control
Time sync architecture (PTP/SyncE/White Rabbit)
MACsec / crypto / secure boot frameworks
PHY electrical/SI deep-dive (return loss, EQ tuning, compliance labs)
Industrial protocol internals (PROFINET/EtherCAT/CIP standard details)

Unmanaged vs Smart: What Changes in Real Deployments

Unmanaged

Control: minimal or none; mostly default forwarding
Segmentation: typically none (no VLAN isolation)
Debug: limited observability; no mirroring in many designs
Field risk: faults can spread; troubleshooting becomes guesswork

Best when: small fan-out, controlled environment, low need for isolation and diagnostics.

Smart Switch IC

Control: straps/EEPROM and/or runtime config (I²C/SPI/SMI-like)
Containment: VLAN (port-based and/or 802.1Q) and traffic limits
Predictability: QoS mapping to queues; basic scheduling behavior
Observability: port mirroring + essential counters
Field safety: storm control to prevent “whole-network freeze” incidents

Best when: industrial cells need isolation, limiters, and fast RCA without full managed complexity.

Managed Switch IC (Out of scope)

Full L2/L3: advanced filtering, ACL/TCAM, multicast controls
Telemetry: richer stats, remote management stacks
Determinism: TSN scheduling features (separate topic)

Mentioned only to define boundaries and prevent scope creep.

A 30-Second Decision Rule

Need isolation? Multiple device classes or tenants on one box → choose Smart (VLAN required).
Need controlled failure blast radius? Risk of loops or broadcast storms → choose Smart (storm control required).
Need faster root-cause analysis? Field failures must be proven with captures/counters → choose Smart (mirroring required).
Only basic fan-out? Single-purpose, controlled network with minimal diagnostics needs → Unmanaged may be sufficient.
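
The decision rule above can be written down as a small helper so that the choice is recorded next to the requirements that drove it. This is only a sketch: the `Needs` fields and the function name are illustrative, not part of any vendor tool.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Needs:
    """Hypothetical requirement flags, one per question in the decision rule."""
    isolation: bool = False             # multiple device classes or tenants on one box
    blast_radius_control: bool = False  # risk of loops or broadcast storms
    fast_rca: bool = False              # field failures must be proven with captures/counters


def choose_switch_class(needs: Needs) -> Tuple[str, List[str]]:
    """Return the switch class and the features that forced the choice."""
    required = []
    if needs.isolation:
        required.append("VLAN")
    if needs.blast_radius_control:
        required.append("storm control")
    if needs.fast_rca:
        required.append("mirroring")
    # Any required feature pushes the design to a smart switch IC.
    return ("smart" if required else "unmanaged", required)
```

For example, `choose_switch_class(Needs(isolation=True, fast_rca=True))` returns `("smart", ["VLAN", "mirroring"])`, while an empty `Needs()` returns `("unmanaged", [])`.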

Diagram: capability ladder to keep scope clean—smart switch ICs add isolation, limits, and observability without full managed complexity.

H2-2. Internal Architecture You Actually Need (Blocks, Tables, and Bottlenecks)

The goal is not to explain Ethernet fundamentals. The goal is to map each internal block to a real factory failure mode, the observable evidence to collect, and the configuration knobs that contain the impact.

Minimal Forwarding Path (What Matters)

Ingress → Classify (VLAN/QoS) → Lookup (MAC learning) → Queues (schedule/drop) → Egress (tag/limit) → Port.

Classify (VLAN / Priority Mapping)

Typical symptom: link is up, but a device “disappears” after VLAN changes.
Evidence to check: ingress accept policy (tag/untag), PVID, egress tag/untag rules.
Action: define a safe default VLAN, keep management path on a known untagged VLAN, use rollback strategy if supported.

Pass criteria: expected endpoints reachable in VLAN X; no unintended tag stripping on uplink.

MAC Learning Table (Capacity / Aging)

Typical symptom: link utilization looks low, but the network feels clogged or intermittent.
Evidence to check: unknown-unicast flooding, MAC move/learn events, table overflow indications.
Action: constrain broadcast domains with VLANs; verify aging defaults; avoid topologies that amplify flooding.

Pass criteria: unknown-unicast and flooding counters remain within X (per window) under normal load.
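
To make the link between table capacity, aging, and flooding concrete, here is a minimal model of a learning table. It is a sketch under stated assumptions: entries are keyed per VLAN, aging is checked lazily on lookup (real silicon typically ages on a timer), and the class and counter names are hypothetical rather than taken from any datasheet.

```python
class MacTable:
    """Toy MAC learning table that counts unknown-unicast lookups."""

    def __init__(self, capacity, aging_s):
        self.capacity = capacity
        self.aging_s = aging_s
        self.entries = {}          # (vid, mac) -> (port, last_seen_time)
        self.unknown_unicast = 0   # lookups that would be flooded
        self.learn_failures = 0    # sources not learned because the table was full

    def learn(self, vid, src_mac, port, now):
        key = (vid, src_mac)
        if key not in self.entries and len(self.entries) >= self.capacity:
            self.learn_failures += 1   # table overflow: this source stays unknown
            return
        self.entries[key] = (port, now)  # new entry, refresh, or MAC move

    def lookup(self, vid, dst_mac, now):
        """Return the egress port, or None when the frame must be flooded."""
        key = (vid, dst_mac)
        entry = self.entries.get(key)
        if entry is not None and now - entry[1] > self.aging_s:
            del self.entries[key]      # entry aged out
            entry = None
        if entry is None:
            self.unknown_unicast += 1
            return None
        return entry[0]
```

In this model, a destination that has been silent for longer than `aging_s`, or one that could not be learned because the table was full, makes every frame sent to it a flood. That is the pattern behind the "low utilization but clogged" symptom.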

Queues & Scheduling (Where “QoS Made It Worse” Happens)

Typical symptom: enabling QoS increases jitter or timeouts during congestion.
Evidence to check: per-queue drop counters, queue depth/watermark indicators (if available), priority mapping sanity.
Action: reserve a high-priority path for control traffic; avoid starving low-priority traffic, because starvation triggers retries that add load.

Pass criteria: control traffic latency/jitter stays within X under stress, without runaway drops in lower queues.

Rate Limiters & Storm Filters (Contain the Blast Radius)

Typical symptom: a single misbehaving node slows down every port on the switch.
Evidence to check: broadcast/multicast/unknown-unicast rates, storm-drop counters, per-port ingress policing stats.
Action: enable storm control on edge ports, set conservative defaults, and validate that discovery/heartbeat frames are not cut.

Pass criteria: under induced storms, only the offending port experiences drops; uplink remains stable within X.

Mirroring Tap Points (Ingress vs Egress)

Typical symptom: captures “look clean” even though endpoints report drops.
Evidence to check: whether mirroring taps ingress or egress; whether filters exclude the problematic VLAN/port.
Action: start with ingress mirroring to find sources; use egress mirroring to confirm queue drops or tag rules.

Pass criteria: mirrored frames match the suspected flow set; counters and captures agree within X.

Minimum Observability Set (Do Not Skip)

Even “smart” switches vary widely. For field-proof designs, ensure the debug path can separate these root causes quickly:

Port health: CRC errors, alignment errors, drops, overruns
Congestion: per-queue drops (or at least total drops) per port
Flooding: broadcast/multicast/unknown-unicast counters and storm-drop counters
Learning events: MAC learn/move/age indications (if exposed)
Mirroring: configurable ingress/egress taps and source selection (port/VLAN)

Diagram: data-path view for engineering work—classify, learn, queue, and filter points map directly to symptoms, counters, and corrective actions.

H2-3. VLAN Basics for Smart Switches (Port-based vs 802.1Q, Ingress/Egress Rules)

VLAN is an engineering rule system: classify frames on ingress, enforce membership, then decide tag/untag behavior on egress. The goal is predictable isolation and a safe debug path—without accidental lockout.

Port-based VLAN

Use when: a small machine cell or I/O box needs clean separation with minimal complexity.
Why it’s safe: isolation is driven by port membership; fewer tag rules to misalign.
Watch-outs: limited flexibility when multiple VLANs must traverse an uplink between cabinets.

802.1Q Tag VLAN

Use when: an uplink must carry multiple VLANs (trunk) or VLAN identity must persist across cabinets.
What changes: ingress accept rules + PVID + egress tag/untag policy must match across links.
Primary risk: misaligned native/tag rules can isolate endpoints or cause cross-domain leakage.

Ingress Rules (Decision Flow)

Untagged frame → assign PVID (Port VLAN ID) and continue switching within that VLAN.
Tagged frame (VID=X) → accept only if the port allows tagged frames and VLAN X is a member; otherwise drop.
Drop evidence → VLAN policy drop counters (if available) and “silent reachability loss” symptoms.

Common pitfalls: (1) tagged frames arriving on an untag-only port, (2) wrong PVID for untagged endpoints, (3) overly broad VLAN membership.
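
The ingress decision flow above can be sketched as a single function. The `PortVlanConfig` fields are hypothetical names for the knobs most smart switches expose in some form (accept policy, PVID, membership); priority-tagged frames (VID 0) are not modeled.

```python
from dataclasses import dataclass, field
from typing import Optional, Set


@dataclass
class PortVlanConfig:
    """Hypothetical per-port ingress settings."""
    pvid: int
    accept_tagged: bool = False
    accept_untagged: bool = True
    member_vids: Set[int] = field(default_factory=set)


def classify_ingress(port: PortVlanConfig, frame_vid: Optional[int]) -> Optional[int]:
    """Return the VLAN the frame is switched in, or None if it is dropped.

    frame_vid is None for an untagged frame.
    """
    if frame_vid is None:
        # Untagged frame: assign the port's PVID.
        return port.pvid if port.accept_untagged else None
    # Tagged frame: the port must allow tags and be a member of that VLAN.
    if port.accept_tagged and frame_vid in port.member_vids:
        return frame_vid
    return None  # silent drop; only a policy-drop counter would show it
```

With an access port configured as `PortVlanConfig(pvid=10, member_vids={10})`, an untagged frame lands in VLAN 10, while a frame tagged with VID 10 is dropped. That is pitfall (1) above: tagged frames arriving on an untag-only port.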

Egress Rules (Tag/Untag + Native VLAN)

Access ports: typically egress untag for PLC/HMI/cameras that do not expect VLAN tags.
Trunk uplink: typically egress tag to carry multiple VLANs across a single link.
Native VLAN: one VLAN may be allowed to traverse the trunk as untagged; both ends must match the native VLAN definition.

Pass criteria: trunk shows expected VLAN tags (VID=X) and access ports remain untagged; cross-domain broadcast/ARP must not leak.
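
A short sketch of the egress rule shows why both trunk ends must agree on the native VLAN. It assumes the far-end port accepts untagged frames and assigns them to its own native VLAN, and that tagged VIDs are members on both ends; the function names are illustrative.

```python
from typing import Optional


def egress_tag(vid: int, mode: str, native_vid: Optional[int] = None) -> Optional[int]:
    """Return the VID to tag with on egress, or None to send the frame untagged."""
    if mode == "access":
        return None                      # endpoints do not expect VLAN tags
    if mode == "trunk":
        return None if vid == native_vid else vid
    raise ValueError(f"unknown port mode: {mode}")


def vlan_at_far_end(vid: int, near_native: Optional[int], far_native: Optional[int]) -> Optional[int]:
    """VLAN in which the far-end trunk port switches a frame sent in `vid`."""
    tag = egress_tag(vid, "trunk", near_native)
    # An untagged frame is classified into the far end's native VLAN.
    return far_native if tag is None else tag
```

With matching native VLANs, `vlan_at_far_end(10, 10, 10)` stays in VLAN 10. With a mismatch, `vlan_at_far_end(10, 10, 1)` returns 1: traffic from VLAN 10 silently lands in VLAN 1 on the other side, which is the cross-domain leakage the pass criteria are meant to catch.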

Minimal Usable Configuration Patterns

Pattern A — 3 Domains + Trunk Uplink

PLC/HMI/Camera ports: access, egress untag, PVID=V10/V20/V30
Uplink: trunk, allow V10/V20/V30, egress tag
Verification: trunk capture shows VID=10/20/30; domains isolated

Pattern B — Pure Port-based Isolation

No tagging; isolation enforced by port membership groups
Best for single-cabinet segmentation without VLAN continuity needs
Verification: broadcasts/ARP must stay within each group

Pattern C — Trunk with Native VLAN

One VLAN is untagged on trunk (native); others remain tagged
Use only when legacy untag requirements exist
Verification: both ends agree on native VLAN; no cross-domain leakage

Diagram: three VLAN domains use untagged access ports for endpoints, while the uplink carries tagged VLANs on a trunk.

H2-4. QoS in the Small: Priorities, Queues, and Rate Limiting (What Matters in Factories)

Factory QoS is a minimal set of choices: classify traffic, map it into a few queues, and apply policing/shaping so a single endpoint cannot destabilize the cell. The proof comes from queue drops and counters—never from “it feels faster”.

Minimal Correct Set

Classification: 802.1p PCP for L2 priority (primary); DSCP remarking applies only when an L3 gateway exists and is not covered on this page.
Mapping: assign traffic classes into a small number of queues (commonly 4).
Scheduling: strict vs WRR is chosen by starvation tolerance—strict can starve low queues; WRR preserves minimum service.
Policing: cap abnormal ingress behavior (misbehaving endpoint, storms, unknown unicast floods).
Shaping: protect uplinks so local bursts do not overload the core network.

Strict Priority

Benefit: minimizes delay for the top queue when load is high.
Risk: lower queues can be starved; starvation can trigger retries that amplify congestion.
Use when: control traffic is small but must remain stable under stress.

WRR / Weighted Scheduling

Benefit: guarantees minimum service to lower queues.
Trade-off: top queue latency can increase slightly versus strict priority.
Use when: video/log flows must not collapse during sustained congestion.
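
The starvation difference between the two schedulers is easy to see in a slot-based toy model. This is a simplified sketch, not a description of any specific switch: each slot sends one frame, all frames are the same size, and queue index 3 is treated as the highest priority.

```python
def strict_priority(backlog, slots):
    """Serve one frame per slot from the highest-index non-empty queue."""
    backlog = list(backlog)
    served = [0] * len(backlog)
    for _ in range(slots):
        for q in reversed(range(len(backlog))):
            if backlog[q] > 0:
                backlog[q] -= 1
                served[q] += 1
                break
    return served


def wrr(backlog, weights, slots):
    """Weighted round robin: each round, queue q may send up to weights[q] frames."""
    if any(w <= 0 for w in weights):
        raise ValueError("weights must be positive")
    backlog = list(backlog)
    served = [0] * len(backlog)
    remaining = slots
    while remaining > 0 and any(backlog):
        for q in reversed(range(len(backlog))):
            burst = min(weights[q], backlog[q], remaining)
            backlog[q] -= burst
            served[q] += burst
            remaining -= burst
    return served
```

With every queue holding 100 frames and 100 slots available, `strict_priority([100] * 4, 100)` serves `[0, 0, 0, 100]`: the three lower queues get nothing. `wrr([100] * 4, [1, 2, 3, 4], 100)` serves `[10, 20, 30, 40]`, so the lowest queue still makes progress while the top queue gives up some share.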

Practical QoS Recipes (2–3 Minimal Policies)

Recipe 1 — Control First

Control → highest queue (PCP=6/7)
I/O → next queue
Video/log → low queue
Best-effort → lowest queue
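
Recipe 1 can be captured as a PCP-to-queue table plus a sanity check for the inverted-mapping failure mode. Only PCP 6/7 → highest queue comes from the recipe; the assignment of the other PCP values to the I/O, video/log, and best-effort queues is an assumption for illustration. The check also treats PCP as simply increasing in importance, which is a simplification (some deployments rank PCP 1 below PCP 0).

```python
# Queue 3 = highest priority. PCP 0-5 placement is an illustrative assumption.
PCP_TO_QUEUE = {
    0: 0, 1: 0,   # best-effort → lowest queue
    2: 1, 3: 1,   # video/log → low queue
    4: 2, 5: 2,   # I/O → next queue
    6: 3, 7: 3,   # control → highest queue
}


def mapping_problems(pcp_to_queue, num_queues=4):
    """Return a list of human-readable problems; an empty list means the map looks sane."""
    problems = []
    for pcp in range(8):
        queue = pcp_to_queue.get(pcp)
        if queue is None:
            problems.append(f"PCP {pcp} is unmapped")
        elif not 0 <= queue < num_queues:
            problems.append(f"PCP {pcp} maps to invalid queue {queue}")
    if problems:
        return problems
    for pcp in range(1, 8):
        if pcp_to_queue[pcp] < pcp_to_queue[pcp - 1]:
            problems.append(f"PCP {pcp} maps below PCP {pcp - 1} (possible inversion)")
    return problems
```

Running the check against the register values read back from the device, rather than against the intended table, is what catches an inverted mapping before it shows up as jitter under load.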

Recipe 2 — Misbehavior Containment

Enable storm control on edge ports (broadcast/multicast/unknown unicast).
Apply ingress policing to suspicious ports or traffic classes.
Track storm-drop and unknown-unicast counters for proof.

Recipe 3 — Uplink Protection

Apply egress shaping on the uplink to cap burst impact.
Prefer limiting on uplink rather than cutting endpoints blindly.
Use queue drops + uplink utilization as verification evidence.

Verification Evidence (Stress + Proof)

Stress: introduce congestion using video/bulk transfers while control traffic continues.
Observe: per-queue drops (low queues drop first), storm-drop counters, and policing counters.
Pass criteria: control jitter/latency stays within X under stress; top queue drops remain ≤ X; storm effects remain local to offending ports.

Diagram: 4-queue QoS shows where policing and queue drops occur under congestion, and how scheduling affects starvation behavior.

H2-5. Port Mirroring & Observability (How to Debug Without a Managed Switch)

Port mirroring is a field-proof evidence channel: it turns “suspicions” into packets and counters. The objective is to isolate the failing port/VLAN/direction quickly, then verify fixes with repeatable captures.

What Mirroring CAN Solve

Reproduce: drops, duplicates, bursts, broadcast floods, and topology “flapping” symptoms.
Localize: identify the suspect port, VLAN, and direction (ingress vs egress).
Prove: validate configuration changes by comparing before/after captures and counters.

What Mirroring CANNOT Solve

No per-queue truth: it cannot precisely expose each egress queue state (managed telemetry scope).
No guaranteed capture: oversubscription can drop mirrored packets on the mirror port.
No hardware precision: it complements, not replaces, line-rate counters and PHY diagnostics.

Mirror Setup Choices (Source, Destination, Direction)

Source: start with the port that shows the highest error/storm counters, or the port on the failing path.
Destination: use a dedicated mirror port to a laptop; avoid sharing it with production devices.
Ingress mirror: best when the endpoint is suspected (abnormal bursts, wrong VLAN tags, malformed traffic).
Egress mirror: best when configuration/policy is suspected (unexpected tagging, policing, congestion effects).
VLAN-based mirror (if supported): focus on one VLAN to reduce noise when multiple domains share the switch.
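
Before trusting a capture, it helps to check whether the mirror destination can even carry the worst-case mirrored load. The sketch below assumes full-duplex source ports running at line rate and a `"both"` direction option (ingress plus egress), which many but not all devices offer; the function names are illustrative.

```python
def worst_case_mirror_load_mbps(source_port_speeds_mbps, direction):
    """Worst-case traffic offered to the mirror port, in Mb/s.

    direction is "ingress", "egress", or "both".
    """
    if direction not in ("ingress", "egress", "both"):
        raise ValueError(f"unknown direction: {direction}")
    # Mirroring both directions of a full-duplex port can double the load.
    per_port_factor = 2 if direction == "both" else 1
    return per_port_factor * sum(source_port_speeds_mbps)


def mirror_can_drop(source_port_speeds_mbps, direction, dest_speed_mbps):
    """True if the mirror port can be oversubscribed in the worst case."""
    return worst_case_mirror_load_mbps(source_port_speeds_mbps, direction) > dest_speed_mbps
```

Mirroring both directions of a single 1000 Mb/s port into a 1000 Mb/s destination can offer up to 2000 Mb/s, so `mirror_can_drop([1000], "both", 1000)` is `True`. In that case a "clean" capture proves little until the mirror-port drop counter is confirmed flat.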

Field Debug Workflow (Repeatable, Evidence-Driven)

Freeze the baseline: record wiring, VLAN/PVID, QoS rules, and storm thresholds before changing anything.
Pick a mirror source: use the port with the strongest symptom (drops, CRC, storm counters) or the suspected segment.
Pick direction: start with ingress, then confirm with egress if policy effects are suspected.
Capture windows: 30–120 seconds for burst faults; longer windows for periodic flaps.
Three fast checks: broadcast ratio, retransmit/duplicate patterns, and ARP/LLDP/discovery behavior.
Align with counters: correlate packet evidence with CRC/drop/overrun/storm counters on the same port.
Form one testable hypothesis: e.g., “wrong egress tag policy” or “unknown-unicast flooding”.
Change one variable: modify a single rule/threshold, then re-capture to validate improvement.

Black-Box Counter Checklist (Fast Signal & Next Action)

Integrity / Link

CRC/FCS errors: SI/EMI/cabling/termination; verify grounding and connector path.
Link up/down: intermittent cable/power/EMI events; correlate with time and environment.

Congestion / Buffer

RX overruns / FIFO drops: ingress overload or internal contention; inspect storms and uplink shaping.
Egress drops (if available): queue congestion or shaping; align with QoS policy and rate limits.

Storm / Flood Evidence

Storm-drop counters: thresholds are triggering; confirm whether these are true storms or false positives.
Unknown-unicast indicators: learning instability or topology issues; confirm with capture patterns.
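
Most pass criteria on this page are phrased as "counter increment ≤ X per window", so counter readings need to be turned into deltas. The sketch below assumes free-running counters that wrap at a fixed width (32 bits here); clear-on-read counters need different handling, and the function names are illustrative.

```python
def counter_delta(prev, curr, width_bits=32):
    """Increment between two readings of a free-running counter, tolerating one wrap."""
    return (curr - prev) % (1 << width_bits)


def within_budget(samples, max_increment, width_bits=32):
    """Check consecutive readings (one per window boundary) against a per-window budget.

    Returns (passed, per_window_deltas).
    """
    deltas = [counter_delta(a, b, width_bits) for a, b in zip(samples, samples[1:])]
    return all(d <= max_increment for d in deltas), deltas
```

For readings `[100, 102, 102, 110]` and a budget of 5 per window, `within_budget` returns `(False, [2, 0, 8])`: the third window breaks the budget, and that is the window to line up with captures and event logs.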

Diagram: mirror a suspect source to a dedicated destination port for laptop captures while production traffic continues on the uplink.

H2-6. Storm Control & Loop Containment (Broadcast/Multicast/Unknown-Unicast Guards)

Storm control is not about saving bandwidth; it is a fuse that prevents a single fault from collapsing the entire cell. Correct limits create local containment and measurable evidence (storm-drop counters) for diagnosis.

Three Storm Classes to Guard

Broadcast: ARP/discovery bursts can amplify rapidly under loops and miswiring.
Multicast: without proper control, multicast can behave like broadcast at the edge.
Unknown-unicast: when learning is unstable, traffic floods; this often feels like “low utilization but the network is stuck”.

Typical Loop Triggers in the Field

Accidental patching between two ports during maintenance.
Dual uplinks connected without loop protection or consistent ring configuration.
Ring ports treated as ordinary ports (protocol not enabled or mismatched).
One device bridges two connections unexpectedly (IPC/gateway wired twice).
Learning instability that turns unknown-unicast into persistent flooding.

Threshold Templates (Placeholders for Standardization)

PPS-based

Broadcast limit: X pps
Multicast limit: X pps
Unknown-unicast limit: X pps

% Line-rate

Broadcast cap: X% of port rate
Multicast cap: X% of port rate
Unknown-unicast cap: X% of port rate

Guardrails: start conservative to avoid false kills, then tighten using storm-drop counters and mirror captures as proof.
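
Because the two templates use different units, it is worth converting between them before setting a threshold. The sketch below uses standard Ethernet framing overhead (8 bytes of preamble/SFD plus a 12-byte inter-frame gap per frame); how a particular switch actually meters "% of port rate" is device-specific, so treat the result as an estimate.

```python
PREAMBLE_SFD_BYTES = 8     # preamble + start-of-frame delimiter
INTERFRAME_GAP_BYTES = 12  # minimum gap between frames


def max_frames_per_second(link_bps, frame_bytes=64):
    """Theoretical maximum frame rate for a given frame size (including FCS)."""
    wire_bits = (frame_bytes + PREAMBLE_SFD_BYTES + INTERFRAME_GAP_BYTES) * 8
    return link_bps / wire_bits


def percent_to_pps(percent, link_bps, frame_bytes=64):
    """Frame rate that corresponds to a given percentage of line rate."""
    return max_frames_per_second(link_bps, frame_bytes) * percent / 100
```

On a 100 Mb/s port, line rate is about 148,810 minimum-size (64-byte) frames per second, so a 1% cap corresponds to roughly 1,488 pps. The same 1% expressed in 1518-byte frames is only about 81 pps. A threshold copied between units without this conversion can be off by more than an order of magnitude.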

False-Kill Risks & How to Avoid Them

Threshold too low: discovery/heartbeats fail and devices appear “offline”.
Wrong target: limiting the uplink first can hide the root cause and increase retransmissions.
Mitigation: enable on edge ports first, keep protocol lifelines safe, and validate with counters + captures.

Incident Playbook (Contain → Identify → Fix)

Confirm containment evidence: check storm-drop / unknown-unicast counters for growth.
Find the hottest port: locate the port with the fastest counter increase.
Mirror ingress: capture and confirm whether broadcast/unknown-unicast dominates.
Isolate quickly: unplug the suspect cable or disable the suspect port to stop the flood.
Recover carefully: restore wiring stepwise, validating counters remain stable.
Harden config: keep strict guards on maintenance-prone ports to prevent recurrence.

Diagram: a storm can saturate the cell quickly; a guard limiter contains the impact and creates measurable storm-drop evidence.

H2-7. Ring Protocol Basics (MRP/HSR/PRP — What to Know Without Becoming a Spec Expert)

“Ring basics” focuses on outcomes and selection decisions: recovery time, zero-loss expectations, and the real boundary between smart-switch assists and full redundancy protocol stacks.

Target Outcomes

Fast switchover: a link break triggers reroute within X ms (short interruption may exist).
Zero-loss concept: dual-active paths or dual networks reduce disruption (often requires endpoint support).
Containment: prevent a loop from collapsing the entire cell (often via guards and rapid isolation).

Concept Map (Minimal)

MRP: ring recovery behavior and switchover time are primary concerns.
HSR: redundancy via duplicated delivery around a ring; receiver de-duplicates.
PRP: redundancy via two independent networks; receiver de-duplicates.

Smart Switch Boundary (What to Expect)

Commonly Available

Basic loop containment and storm guards.
Simple ring assist / fixed-port behaviors (vendor-dependent).
Evidence counters (link events, storm-drop, unknown-unicast indicators).

Often Out of Reach

Full MRP/HSR/PRP stacks with cross-vendor interoperability guarantees.
Strict zero-loss guarantees without endpoint / system-level support.
Complex topology determinism (typically a managed / dedicated redundancy scope).

Selection Questions (Answer Before Buying)

Goal: ms switchover or “zero-loss” expectation?
Endpoint support: do PLC/I/O/cameras support redundancy modes natively (or require an external adapter)?
Roles: are manager/client or port roles required, and can the field maintain consistent configuration?
Certification fit: does the system require protocol certification alignment with controllers/devices?
Load headroom: can duplicated traffic or reroute bursts exceed uplink or buffer limits?
Failure modes: is the main risk a link break, or miswiring that creates a loop?
Evidence: can the switch expose link events and storm/flood counters to validate the root cause?

Minimal Acceptance Checks (Keep It Practical)

Break test: unplug one ring segment and confirm recovery within X ms (target-dependent).
Business impact: verify critical control traffic stays within X loss/timeout limits during reroute.
Containment: simulate mispatch/loop and confirm guards keep the cell responsive with measurable counters.

Out of scope: full standard details, certification workflows, and cross-vendor interoperability rules belong on a dedicated “Ring Redundancy (MRP/HSR/PRP)” page. This section keeps only the decision-critical basics.

Diagram: line topologies isolate downstream nodes on a break; ring topologies can reroute (outcome depends on redundancy method and support).

H2-8. Hardware Co-Design Hooks (Power, Clocks, PHY-integration, and Layout Traps)

These board-level hooks prevent “protocol-looking” failures that are actually power, strap, clock, or return-path issues. The focus is on layout rules and bring-up evidence, not PHY textbooks.

Power Rails & Reset (The #1 “Half-Alive” Root Cause)

Typical Symptoms

Link flaps after boot, then stabilizes “randomly”.
Intermittent drops that correlate with load steps.
Configuration appears inconsistent across boots.

Design Hooks

Multi-rail sequencing: ensure all rails reach stable levels before releasing reset.
UVLO/POR clarity: use a supervisor; avoid “borderline” brownout behavior.
Reset timing: release reset X ms after rails and clock are confirmed stable.

Clocking Hooks (Stability vs. “Speed”)

Jitter sensitivity: poor clock quality can surface as instability, renegotiation, or elevated error counters.
Noise coupling: avoid routing clocks through switching regulator hot zones and high-current loops.
Bring-up evidence: align clock-related hypotheses with link events + CRC/overrun counter behavior under load.

Straps & Boot Pins (Configuration That Depends on Timing)

Sampling moment: strap pins are often latched at reset release; unstable rails can create unstable config.
Shared pins: avoid external devices driving strap pins during boot; keep pull-ups/pull-downs unambiguous.
Field symptom: “same board, different behavior” can be a strap + reset timing problem, not a protocol bug.

Integrated PHY & Layout Rules (Placement, Return Path, ESD)

Do (Green)

Keep diff-pair reference plane continuous under the entire path.
Place connector-entry protection with short, direct return paths.
Keep magnetics/connector path compact to reduce loop area.

Avoid (Red)

Diff-pairs crossing plane cuts/slots or return-path discontinuities.
Clock sources placed next to DC/DC noise or adjacent to the PHY path.
Long “detours” for TVS returns that create large discharge loops.

Board-Level Evidence to Log (Fast Root-Cause Signals)

Link events vs power events: link drops aligned with brownouts or load steps indicate power/return-path issues.
CRC/overrun patterns: correlate with temperature, motor switching, or DC/DC mode transitions.
Before/after layout change: a large delta from routing/placement changes indicates SI/return-path dominance.

Diagram: compact PHY path, short protection returns, and continuous reference planes prevent many “protocol-looking” failures.

H2-9. Configuration & Bring-up Flow (Straps/I²C/SPI, Default Safe Modes, Field-Safe Updates)

Configuration must be designed as a gated, field-safe workflow: keep reachability first, then enforce VLAN/QoS, and only commit changes after evidence proves stability.

A. Default Safe Modes (Pick One Baseline Per Product)

Mode A All-pass (Factory)

Best for: first power-on and recovery.
Risk: storms/loops can spread.
Must-have guard: conservative storm limits enabled by default.

Mode B Split VLAN (Commissioning)

Best for: early domain isolation (PLC / HMI / Camera).
Risk: wrong PVID/egress rules can cause remote loss.
Must-have guard: a “rescue port/VLAN” always reachable.

Mode C Uplink trunk (Deployment)

Best for: multi-VLAN uplink on a single port.
Risk: native VLAN mismatch → upstream “silent failures”.
Must-have guard: mirror proof of tagging before commit.

B. Configuration Entry Points (Straps → NVM → Host Bus)

Straps (Boot-latched)

Use: default mode, interface enable.
Rule: stable at reset release.
Failure signature: inconsistent behavior across boots.

EEPROM / NVM (Profile)

Use: baseline VLAN/QoS/port roles.
Rule: version + CRC + fallback copy.
Failure signature: profile corruption that shows up as “works until brownout”.

I²C / SPI Host (Runtime)

Use: staged policy apply, diagnostics, update control.
Rule: detect loss-of-control and re-enter safe profile.
Failure signature: policy changed but not provable by evidence.
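
The "version + CRC + fallback copy" rule for NVM profiles can be sketched as follows. The byte layout (16-bit version, 32-bit length, payload, CRC-32) is a made-up example, not a format defined by any switch vendor; the point is that a corrupt primary copy is detected and skipped rather than loaded.

```python
import struct
import zlib

HEADER = struct.Struct("<HI")  # hypothetical layout: version (uint16), payload length (uint32)


def pack_profile(version, payload):
    """Serialize a profile as header + payload + CRC-32 over both."""
    body = HEADER.pack(version, len(payload)) + payload
    return body + struct.pack("<I", zlib.crc32(body))


def unpack_profile(blob):
    """Return (version, payload), or None if the blob is truncated or fails its CRC."""
    if len(blob) < HEADER.size + 4:
        return None
    version, length = HEADER.unpack_from(blob)
    end = HEADER.size + length
    if len(blob) < end + 4:
        return None
    (stored_crc,) = struct.unpack_from("<I", blob, end)
    if zlib.crc32(blob[:end]) != stored_crc:
        return None
    return version, blob[HEADER.size:end]


def select_profile(primary, fallback):
    """Prefer the primary copy, then the fallback, then a built-in safe default."""
    for name, blob in (("primary", primary), ("fallback", fallback)):
        parsed = unpack_profile(blob)
        if parsed is not None:
            return name, parsed
    return "safe-default", None
```

If a brownout corrupts the primary copy mid-write, `select_profile` falls through to the fallback copy; if both are bad, the caller is told to boot the safe default instead of applying garbage.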

C. Field-Safe Updates (Anti-Lockout Pattern)

Common Failure

Wrong VLAN/PVID/egress rules remove the only management path and cause remote lockout.

Anti-Lockout Hooks

Dual profile: Active + Rescue (known-reachable baseline).
Staged commit: apply → observe → commit to NVM.
Rollback gate: revert if no heartbeat/ack within X s.
Rescue access: dedicated port/VLAN not affected by user policy.
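
The apply → observe → commit pattern with a rollback gate can be sketched as below. It is a simplified model: the observation window is counted in polls rather than seconds, `heartbeat_ok` stands in for whatever reachability check the host uses, and the requirement of several consecutive good polls before committing is an added assumption.

```python
def staged_update(committed, candidate, heartbeat_ok, max_polls, required_ok=3):
    """Apply a candidate config at runtime and commit it only after proof of reachability.

    committed    -- config currently stored in NVM (known reachable)
    candidate    -- new config to try
    heartbeat_ok -- callable(config) -> bool, hypothetical reachability check
    """
    running = dict(candidate)            # apply to runtime only; NVM untouched
    streak = 0
    for _ in range(max_polls):
        streak = streak + 1 if heartbeat_ok(running) else 0
        if streak >= required_ok:
            # Evidence collected: persist the candidate.
            return {"running": running, "nvm": dict(candidate), "outcome": "committed"}
    # Rollback gate: no stable heartbeat within the window, revert to the committed profile.
    return {"running": dict(committed), "nvm": dict(committed), "outcome": "rolled_back"}
```

The important property is that NVM is never written before the heartbeat check passes, so a power cycle during a bad update always comes back on a reachable profile.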

D. Bring-up Minimal Closed Loop (Prove Then Advance)

Link up: stable link for X min (no flapping).
Forwarding: unicast flows pass; MAC learning stable (no persistent flooding).
VLAN: PVID + tag/untag verified by mirror capture; management remains reachable.
QoS: queue mapping validated under contention; control stays responsive.
Mirror: captures confirm tagging/priority behavior; abnormal ratios visible.
Storm: thresholds contain storms without killing discovery/heartbeats.

Diagram: each bring-up step has a pass/fail gate to prevent “remote lockout by policy”.

H2-10. Verification Plan (Throughput/Latency, Loss Under Stress, Storm & Loop Tests)

Verification must be executable: define stimulus, capture observable evidence, and declare pass criteria. Coverage must include line-rate, latency behavior, burst loss, storm threshold sweep, loop containment, and corner stress.

Coverage (What Must Be Proven)

Throughput: line-rate definition by frame size and direction (uni/bi).
Latency: measure distributions under load; observe mode signature (store-and-forward vs cut-through behavior).
Stress loss: burst + contention + mixed priorities; verify critical class survivability.
Storm sweep: find the safe window between “no containment” and “false-kill”.
Loop/ring: mispatch loop containment and break recovery observation points.
Industrial corners: temperature and brownout/load-step alignment with counters/events.

Executable Checklist (Stimulus → Observable → Pass)

Throughput

Stimulus: min/typ/max frame sizes; uni/bi; full load.
Observable: drops/overruns; link events; mirror proof.
Pass: ≤ X drops over Y minutes at defined load.

Latency

Stimulus: controlled load points + contention.
Observable: latency distribution shift and jitter growth.
Pass: latency/jitter within X under defined traffic mix.

Stress Loss

Stimulus: bursts + mixed classes + uplink contention.
Observable: which class drops first; responsiveness.
Pass: critical class loss ≤ X while best-effort absorbs pressure.

Storm Sweep

Stimulus: broadcast/multicast/unknown-unicast sweep.
Observable: storm-drop counters vs discovery/heartbeats.
Pass: safe window identified and documented.

Loop / Ring

Stimulus: mispatch loop; break one segment (if ring).
Observable: containment evidence; measured recovery time (target X ms).
Pass: cell remains responsive; recovery meets target.

Temp / Voltage Corners

Stimulus: low/high temp; brownout/load-step events.
Observable: link events + counters aligned to logs.
Pass: no persistent link flapping; error counters (CRC/drop) within X.

Required capability: controllable load and bursts, class-aware traffic generation, capture + counters for evidence, and synchronized power/temperature/event logging for correlation.
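
One way to keep each checklist entry executable is to store it as data and evaluate counter snapshots against its threshold. The sketch below is illustrative only: the `CheckItem` type, the counter names (`rx_drop`, `tx_drop`, `rx_overrun`), the frame sizes, and the zero-drop threshold are assumptions, not values taken from any specific switch IC.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CheckItem:
    name: str
    stimulus: str
    counters: tuple      # counter names used as the observable (hypothetical names)
    max_increment: int   # stands in for the "X" of the pass criterion


def counter_delta(before, after, names):
    """Sum of increments for the named counters between two snapshots."""
    return sum(after[n] - before[n] for n in names)


def evaluate(item, before, after):
    """Return (passed, delta) for one checklist item."""
    delta = counter_delta(before, after, item.counters)
    return delta <= item.max_increment, delta


throughput = CheckItem(
    name="Throughput",
    stimulus="64/512/1518-byte frames, bidirectional, full load",  # assumed sizes
    counters=("rx_drop", "tx_drop", "rx_overrun"),
    max_increment=0,
)

# Snapshots taken before and after the stimulus window (made-up values).
before = {"rx_drop": 10, "tx_drop": 0, "rx_overrun": 2}
after = {"rx_drop": 10, "tx_drop": 0, "rx_overrun": 2}

passed, delta = evaluate(throughput, before, after)
print(throughput.name, "PASS" if passed else "FAIL", "delta =", delta)
```

Evaluating increments rather than absolute counter values keeps the result independent of whatever accumulated before the test window started.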

Diagram: verification items are expressed as card-matrix entries to stay mobile-safe while remaining executable.

H2-11. Engineering Checklist (Design → Bring-up → Production)

A smart/unmanaged switch design is production-ready only when each gate has a repeatable verification action, objective evidence, and explicit pass criteria (threshold X).

Gate 1: Design Gate

Power rails + reset predictability

Action: brownout/load-step injection while logging reset causes.
Evidence: reset-reason log + link-event timeline aligned to power.
Pass criteria: no “hang/no-recover” in X disturbances.
MPN examples: TI TPS3808G01 (supervisor), ADI ADM809 (reset), TI TLV75533P (LDO), TI TPS62177 (buck).

Strap latch robustness (boot mode is deterministic)

Action: repeat power cycles across slow/fast ramps and temperature corners.
Evidence: boot mode readback consistency + interface reachability check.
Pass criteria: mode mismatch rate ≤ X / 1000 boots.
MPN examples: N/A (logic-level straps); use stable pull-up/pull-down resistors such as Yageo RC0603FR-0710KL (10 kΩ).

Default safe mode + rescue path (anti-lockout)

Action: define “Rescue profile” + dedicated rescue port/VLAN and verify it survives user policy.
Evidence: reachability proof after intentionally wrong VLAN/PVID settings.
Pass criteria: remote access recovered within X seconds by rollback.
MPN examples: SPI flash: Winbond W25Q32JV; I²C EEPROM: Microchip 24LC02B or 24AA025E48 (EEPROM with EUI-48).

VLAN plan (PVID + ingress/egress rules are explicit)

Action: define per-port PVID, acceptable frame types (tag/untag), and egress tag policy.
Evidence: mirror capture showing correct tag/untag on trunk/access paths.
Pass criteria: “management reachable” invariant holds under X reconfig cycles.
MPN examples (switch IC): Microchip KSZ8795 (5-port smart), Microchip KSZ8863 (3-port), Microchip KSZ8895 (5-port).
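
The three parts of the VLAN plan (PVID, accepted frame types, egress tag policy) can be written down as a small model before any registers are touched. This is a simplified sketch of generic 802.1Q behavior, not the register model of any listed device; the `PortVlan` type and its field names are hypothetical, and priority-tagged frames (VID 0) are ignored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PortVlan:
    pvid: int                   # VLAN assigned to untagged ingress frames
    accept_untagged: bool
    accept_tagged: bool
    members: frozenset          # VLANs this port belongs to
    egress_untagged: frozenset  # VLANs transmitted without a tag on this port


def classify_ingress(port, frame_vid):
    """Return the internal VLAN for a frame, or None if it is dropped.

    frame_vid is None for an untagged frame.
    """
    if frame_vid is None:
        return port.pvid if port.accept_untagged else None
    if not port.accept_tagged or frame_vid not in port.members:
        return None
    return frame_vid


def egress_form(port, vid):
    """Return 'untagged', 'tagged', or None (not forwarded) for a VLAN."""
    if vid not in port.members:
        return None
    return "untagged" if vid in port.egress_untagged else "tagged"


# Access port: untagged in, untagged out, PVID 10.
access = PortVlan(10, True, False, frozenset({10}), frozenset({10}))
# Trunk port: tagged only, carries VLANs 10 and 20.
trunk = PortVlan(1, False, True, frozenset({10, 20}), frozenset())

print(classify_ingress(access, None), egress_form(trunk, 10))
```

Writing the plan this way makes the lockout cases visible: an access port whose PVID is not the VLAN the endpoint expects, or a port whose egress rule flips between tagged and untagged, shows up as a changed return value rather than a vanished device.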

QoS skeleton (queues + scheduling + basic shaping)

Action: map classes to queues (Control/IO/Video/BE) and define rate-limit granularity.
Evidence: under contention, control class stays responsive; lower classes absorb loss.
Pass criteria: control latency/jitter ≤ X under defined stress.
MPN examples (clocking): Epson SG-210STF (25 MHz osc), Abracon ASE-25.000MHZ (XO).
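
The difference between strict priority and WRR under contention can be illustrated with a toy scheduler. The sketch below is a frame-count model only; real devices schedule in hardware and may weight by bytes. The class names follow the Control/IO/Video/BE split above, while the weights (4/3/2/1), queue depths, and budget are arbitrary assumptions.

```python
from collections import Counter, deque

CLASSES = ("control", "io", "video", "be")


def make_queues(depth=20):
    """One backlog of `depth` frames per class."""
    return {c: deque((c, i) for i in range(depth)) for c in CLASSES}


def strict_schedule(queues, order, budget):
    """Always drain the highest-priority non-empty queue first."""
    sent = []
    for name in order:
        while queues[name] and len(sent) < budget:
            sent.append(queues[name].popleft())
    return sent


def wrr_schedule(queues, weights, budget):
    """Each round, a queue may send up to its weight in frames."""
    sent = []
    while len(sent) < budget and any(queues.values()):
        for name, weight in weights.items():
            for _ in range(weight):
                if len(sent) >= budget or not queues[name]:
                    break
                sent.append(queues[name].popleft())
    return sent


weights = {"control": 4, "io": 3, "video": 2, "be": 1}  # assumed weights

strict = Counter(c for c, _ in strict_schedule(make_queues(), CLASSES, 20))
wrr = Counter(c for c, _ in wrr_schedule(make_queues(), weights, 20))
print("strict:", dict(strict))
print("wrr:   ", dict(wrr))
```

With every queue backlogged, strict priority spends the whole budget on the control class, while WRR still favors control but leaves a guaranteed share for the lower classes.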

Storm guard defaults (containment without false-kill)

Action: decide threshold unit (pps or % line-rate) for broadcast/multicast/unknown-unicast.
Evidence: threshold sweep shows a “safe window” where discovery/heartbeats survive.
Pass criteria: no global outage under mispatch/loop; false-kill rate ≤ X.
MPN examples (ESD protection, board-level): choose ultra-low-cap Ethernet ESD arrays per line-rate (example families: Semtech RClamp series / Littelfuse SP305x series).

Gate 2: Bring-up Gate

Minimal closed loop (prove then advance)

Action: Link → Forward → VLAN → QoS → Mirror → Storm (in order).
Evidence: per-step capture + counter snapshot baseline.
Pass criteria: each step stable for X minutes, no link flaps.
MPN examples (switch IC): Microchip KSZ8795 / KSZ8895 (smart features), Microchip KSZ8863 (compact).

Mirror capture proof (tag/priority behavior is visible)

Action: mirror source by port/VLAN and select ingress/egress tap if supported.
Evidence: pcap shows correct 802.1Q tag/untag and 802.1p PCP mapping.
Pass criteria: no unexplained flooding; abnormal traffic ratios detectable within X minutes.
MPN examples (field tool): no dedicated capture hardware is required; a mirror port plus a laptop NIC is sufficient. Optional USB NIC: adapters based on the Realtek RTL8153 (common).
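
To check tag/untag and PCP behavior in a capture without relying on a GUI, the 802.1Q fields can be decoded directly from the raw frame bytes. This is a minimal sketch using only the standard library; it handles a single C-tag (TPID 0x8100) and does not handle stacked or S-tagged (0x88A8) frames. The example frames are hand-built test data.

```python
import struct

TPID_8021Q = 0x8100


def parse_vlan(frame: bytes):
    """Return (pcp, dei, vid) for an 802.1Q-tagged frame, or None if untagged."""
    if len(frame) < 14:
        raise ValueError("frame shorter than an Ethernet header")
    # Bytes 12..13 hold the EtherType, or the TPID when a tag is present.
    (ethertype,) = struct.unpack_from("!H", frame, 12)
    if ethertype != TPID_8021Q:
        return None
    if len(frame) < 18:
        raise ValueError("truncated 802.1Q tag")
    # Tag Control Information: PCP (3 bits), DEI (1 bit), VID (12 bits).
    (tci,) = struct.unpack_from("!H", frame, 14)
    return tci >> 13, (tci >> 12) & 1, tci & 0x0FFF


dst = bytes.fromhex("ffffffffffff")
src = bytes.fromhex("020000000001")
arp = struct.pack("!H", 0x0806)

tagged = dst + src + struct.pack("!HH", TPID_8021Q, (5 << 13) | 10) + arp
untagged = dst + src + arp

print(parse_vlan(tagged))    # (5, 0, 10)
print(parse_vlan(untagged))  # None
```

Note that some capture NICs or drivers strip VLAN tags before the frame reaches the capture tool, so an "untagged" result should be cross-checked against the capture setup before concluding the switch is at fault.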

Counter baseline (the “black-box” is usable)

Action: record idle/nominal/stress baselines for CRC/drop/overrun/storm-drop/link events.
Evidence: baseline snapshots stored with timestamp + temperature + power cycle count.
Pass criteria: baseline drift ≤ X over Y hours.
MPN examples (storage): Winbond W25Q32JV (SPI flash), Microchip 24LC02B (I²C EEPROM).

Storm threshold sweep (find safe window)

Action: sweep broadcast/multicast/unknown-unicast thresholds from loose to strict.
Evidence: storm-drop vs LLDP/ARP/heartbeat survival charted per port.
Pass criteria: safe window documented; false-kill at normal traffic ≤ X.
MPN examples (switch IC): Microchip KSZ8795 (storm controls present in smart class; device-dependent).
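
Once the sweep has been run, the safe window is simply the range of thresholds where the storm is contained and discovery/heartbeat traffic still survives. The sketch below assumes each sweep point has been reduced to two booleans; the threshold values are invented for illustration.

```python
def safe_window(results):
    """Return (lowest, highest) safe threshold, or None if no point is safe.

    results: iterable of (threshold_pps, storm_contained, discovery_survived).
    """
    safe = sorted(t for t, contained, survived in results if contained and survived)
    return (safe[0], safe[-1]) if safe else None


# Hypothetical sweep from strict to loose on one port.
sweep = [
    (100, True, False),    # too strict: heartbeats are dropped (false-kill)
    (500, True, True),
    (1000, True, True),
    (5000, True, True),
    (20000, False, True),  # too loose: storm is not contained
]

print(safe_window(sweep))  # (500, 5000)
```

A `None` result is itself a finding: it means no single threshold separates legitimate traffic from the storm on that port, and the limits probably need to be split by traffic type.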

Rollback drill (intentional misconfig)

Action: intentionally apply a lockout VLAN/PVID; verify watchdog rollback to rescue profile.
Evidence: rollback event log + recovered reachability + restored mirror proof.
Pass criteria: recovery time ≤ X_s; success rate ≥ X_% across repeated drills.
MPN examples: TI TPS3808G01 (supervisor), Winbond W25Q32JV (dual profile storage).
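
A common way to implement this kind of rollback is a commit-confirm timer: a new profile becomes active immediately but reverts to the rescue profile unless it is confirmed over the management path within a timeout. The sketch below shows the logic only; the `ConfigManager` class, its method names, and the 30-second timeout are assumptions, and time is passed in explicitly so the drill can be replayed deterministically.

```python
class ConfigManager:
    """Commit-confirm rollback: unconfirmed profiles revert to rescue."""

    def __init__(self, rescue, confirm_timeout_s):
        self.rescue = rescue
        self.active = rescue
        self.timeout = confirm_timeout_s
        self.pending_since = None  # time the unconfirmed profile was applied
        self.log = []

    def apply(self, profile, now):
        self.active = profile
        self.pending_since = now
        self.log.append((now, "apply", profile))

    def confirm(self, now):
        """Called only when the management path has proven reachable."""
        if self.pending_since is not None:
            self.pending_since = None
            self.log.append((now, "confirm", self.active))

    def tick(self, now):
        """Periodic watchdog check; rolls back if confirmation never arrived."""
        if self.pending_since is not None and now - self.pending_since >= self.timeout:
            self.active = self.rescue
            self.pending_since = None
            self.log.append((now, "rollback", self.rescue))


mgr = ConfigManager(rescue="rescue-v1", confirm_timeout_s=30)
mgr.apply("lockout-vlan", now=0)   # intentional misconfig, never confirmed
mgr.tick(now=10)
mgr.tick(now=30)
print(mgr.active, mgr.log[-1])
```

The event log doubles as the rollback evidence named above: the time between the `apply` and `rollback` entries is the measured recovery time.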

Gate 3: Production Gate

Configuration lockdown (version + CRC + dual image)

Action: implement active/rescue profile with versioning and integrity checks.
Evidence: profile readback + CRC status + audit log entry.
Pass criteria: a failed update always falls back to the rescue profile within X seconds.
MPN examples: Winbond W25Q32JV (SPI flash), Microchip 24LC02B (EEPROM).
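
The version-plus-CRC check can be sketched as a small container format around the profile payload. The layout below (16-bit version, 32-bit length, payload, CRC-32 trailer) is a hypothetical example, not a format defined by any of the listed parts; it uses `zlib.crc32` from the Python standard library.

```python
import struct
import zlib

HEADER = struct.Struct("<HI")  # version (uint16), payload length (uint32); assumed layout


def pack_profile(version: int, payload: bytes) -> bytes:
    body = HEADER.pack(version, len(payload)) + payload
    return body + struct.pack("<I", zlib.crc32(body))


def unpack_profile(blob: bytes):
    """Return (version, payload), or None if the blob fails any check."""
    if len(blob) < HEADER.size + 4:
        return None
    body = blob[:-4]
    (crc,) = struct.unpack("<I", blob[-4:])
    if zlib.crc32(body) != crc:
        return None
    version, length = HEADER.unpack_from(body)
    payload = body[HEADER.size:]
    return (version, payload) if len(payload) == length else None


def select_profile(active_blob: bytes, rescue_blob: bytes):
    """Prefer the active profile; fall back to rescue if it is invalid."""
    parsed = unpack_profile(active_blob)
    if parsed is not None:
        return "active", parsed
    return "rescue", unpack_profile(rescue_blob)


rescue = pack_profile(1, b"pvid=1;")
active = pack_profile(7, b"pvid=10;")

corrupted = bytearray(active)
corrupted[7] ^= 0xFF  # simulate a failed or interrupted update

print(select_profile(active, rescue))
print(select_profile(bytes(corrupted), rescue))
```

The rescue image should be written rarely and verified at production time, since this scheme has nothing left to fall back on if both blobs fail the check.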

Black-box logging fields (service-ready evidence)

Action: define a minimal field log schema (link flap, CRC/drop, storm-drop, reboot reason, temp/power).
Evidence: sample logs from burn-in + induced fault runs.
Pass criteria: event-to-root-cause coverage ≥ X%.
MPN examples (optional time base): Microchip MCP7940N (I²C RTC) for timestamping logs.
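
A minimal log schema covering the fields listed above might look like the following. The record type and field names are hypothetical; the point is that every record carries the same keys so that field logs can be compared across units. One JSON object per line is assumed as the storage format.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class BlackBoxRecord:
    uptime_s: int                        # or an RTC timestamp if one is fitted
    event: str                           # e.g. "link_flap", "storm_drop", "reboot"
    port: Optional[int] = None
    crc_delta: int = 0
    drop_delta: int = 0
    storm_drop_delta: int = 0
    reboot_reason: Optional[str] = None
    temp_c: Optional[float] = None
    vin_mv: Optional[int] = None

    def to_line(self) -> str:
        """Serialize as one compact JSON line."""
        return json.dumps(asdict(self), separators=(",", ":"))


rec = BlackBoxRecord(uptime_s=3720, event="link_flap", port=3,
                     crc_delta=12, temp_c=71.5, vin_mv=4870)
line = rec.to_line()
print(line)
```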

Serviceability (minimum on-site workflow)

Action: document mirror-port capture, counter readout, and factory-reset sequence.
Evidence: SOP + successful on-site drill records.
Pass criteria: on-site isolation time ≤ X minutes for common failures.
MPN examples: Microchip 24AA025E48 (EUI-48 for per-unit identity), Winbond W25Q32JV.

Corner robustness (temp + brownout alignment)

Action: run thermal/brownout tests while correlating events and counters.
Evidence: aligned timeline: temperature + voltage + link events + drops.
Pass criteria: no persistent flapping; error budget within X.
MPN examples (sensing, optional): TI TMP117 (temp sensor) for correlation logging.
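
Building the aligned timeline mostly comes down to looking up the most recent sensor sample for each link event. The sketch below assumes both logs share one time base and that samples are sorted by timestamp; the sample data is invented.

```python
from bisect import bisect_right


def sample_at(samples, t):
    """Return the most recent sample value at or before time t, or None.

    samples: list of (timestamp, value) sorted by timestamp.
    """
    times = [ts for ts, _ in samples]
    i = bisect_right(times, t)
    return samples[i - 1][1] if i else None


def align(events, temps, volts):
    """Attach the latest temperature and voltage reading to each event."""
    return [(t, name, sample_at(temps, t), sample_at(volts, t)) for t, name in events]


temps = [(0, 25.0), (60, 55.0), (120, 85.0)]  # seconds, degrees C
volts = [(0, 3300), (100, 3290), (125, 3050)]  # seconds, mV
events = [(30, "link_up"), (130, "link_down")]

for row in align(events, temps, volts):
    print(row)
```

In this made-up run, the link-down event lines up with both the highest temperature and a sagging rail, which is exactly the kind of coincidence the aligned timeline is meant to expose.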

Diagram: each gate is a kanban-style board with icon tiles to keep the checklist actionable and audit-friendly.

H2-12. Applications & IC Selection (Combined)

Applications are expressed as “capability vectors”. IC selection is expressed as a checklist + decision tree, not a brand catalog.

H3-12.1 Applications (Where Unmanaged/Smart Switch ICs Win)

Machine cell (star/line)

Topology / ports: 5–8 ports, mixed PLC/HMI/camera nodes.
Must-have: VLAN isolation + storm guard; optional mirror for field debug.
Common failure: broadcast storm collapses the whole cell.
Bring-up proof: mirror shows tag/untag correctness; storm window documented.
MPN examples: Microchip KSZ8795, Microchip KSZ8895.

Remote I/O fan-out

Topology / ports: 3–6 ports, deterministic control priority.
Must-have: QoS queues + basic policing; mirror for on-site capture.
Common failure: burst traffic causes control jitter.
Bring-up proof: contention test keeps control class within threshold X.
MPN examples: Microchip KSZ8863 (compact), Microchip KSZ8795.

Small gateway / edge box

Topology / ports: 2–5 ports + uplink.
Must-have: 802.1Q trunk uplink + VLAN/PVID discipline; optional QoS for control.
Common failure: native VLAN mismatch produces “silent upstream failure”.
Bring-up proof: mirror capture shows VLAN tags on trunk before committing.
MPN examples: Microchip KSZ8795, Microchip KSZ8895.

Lightweight ring / loop containment

Topology / ports: 2 ring ports + local fan-out.
Must-have: loop containment + storm guard; “ring assist” if supported.
Common failure: mispatch loop floods the network.
Bring-up proof: mispatch drill contains broadcast; recovery time ≤ X.
MPN examples: Microchip KSZ8795 (smart-class guards), Microchip KSZ8895.

H3-12.2 IC Selection Logic (A Checklist, Not a Shopping List)

Core requirement checklist

Ports/speed mix: FE/GE/2.5G as needed (avoid the extra heat of an over-specified speed grade).
VLAN: PVID + accept rules + egress tag policy + table/limit awareness.
QoS: queue count + scheduling (strict/WRR) + policing/shaping granularity.
Mirror: per-port/per-VLAN, ingress/egress tap position if available.
Storm: bcast/mcast/unknown-unicast types + threshold units + default behavior.
Config/rollback: straps + EEPROM/flash + watchdog rollback + rescue port/VLAN.
Industrial target: temperature + EMC evidence plan (metric-only, no spec tutorial).
Power/thermal: package dissipation + airflow/heatsinking plan.

MPN anchor examples (common)

Smart-class switch IC (VLAN/QoS/mirror/storm): Microchip KSZ8795, Microchip KSZ8895.
Compact switch IC (small fan-out): Microchip KSZ8863.
Identity / MAC storage: Microchip 24AA025E48 (EUI-48 EEPROM).
Config storage: Winbond W25Q32JV (SPI flash), Microchip 24LC02B (EEPROM).
Reset supervisor: TI TPS3808G01, ADI ADM809.
Clock source: Epson SG-210STF (25 MHz), Abracon ASE-25.000MHZ.
Power rails: TI TPS62177 (buck), TI TLV75533P (LDO).
Optional RTC for logs: Microchip MCP7940N.

Decision rule (engineering)

If any of VLAN isolation, QoS protection, mirror observability, or storm containment is required for field safety, select a smart-class switch IC. Otherwise, a basic unmanaged mode can be acceptable for the simplest cells.
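
The decision rule is simple enough to state as a function, which can be useful when the rule is applied across many cell designs in a script. The feature labels below are hypothetical names for the four capabilities in the rule.

```python
SMART_FEATURES = frozenset({
    "vlan_isolation",
    "qos_protection",
    "mirror_observability",
    "storm_containment",
})


def select_switch_class(required):
    """Return 'smart' if any field-safety feature is required, else 'basic unmanaged'."""
    return "smart" if SMART_FEATURES & set(required) else "basic unmanaged"


print(select_switch_class({"storm_containment"}))  # smart
print(select_switch_class(set()))                  # basic unmanaged
```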

Diagram: the decision tree stays inside the page boundary—only VLAN/QoS/Mirror/Storm and field-safety observability.

H2-13. FAQs (Troubleshooting Only: VLAN / QoS / Mirror / Storm / Ring Basics)

These FAQs close long-tail field issues without expanding the main content. Every answer is exactly four lines: Likely cause / Quick check / Fix / Pass criteria (X).

Data placeholders (consistent units)

X_min: minutes of stable operation window
X_s: seconds to recover (rollback / reconnect)
X_cnt: counter increment threshold (CRC/drop/storm-drop) within a window
X_%: ratio threshold (% of frames or % line-rate), used for broadcast/unknown-unicast share
X_pps: packets-per-second threshold for storm control
X_ms: latency/jitter bound for control-class traffic under contention

VLAN configured and a device “disappeared” — wrong PVID or egress untag rule?

Likely cause: Port PVID mismatch, or egress changed from untag→tag (or tag→untag) and the endpoint cannot parse it.
Quick check: Mirror the port and confirm whether ARP/LLDP frames are tagged or untagged; read back PVID + egress rule for that port.
Fix: Restore a “management reachability invariant” (rescue port/VLAN), then standardize access ports as untag + correct PVID; keep trunk ports tagged only.
Pass criteria: Endpoint reachable for X_min; port CRC/drop increment ≤ X_cnt per X_min; no unexpected VLAN tag changes in capture.

Utilization looks low, but the cell feels “stuck” — unknown-unicast flood or queue starvation?

Likely cause: Unknown-unicast flooding (MAC table miss/aging) or strict-priority scheduling starving lower classes, triggering retries/backpressure upstream.
Quick check: Read unknown-unicast counters and broadcast share; mirror suspect ports and look for spikes in repeated ARP requests, retries, or timeouts.
Fix: Enable unknown-unicast storm guard or policing; switch from strict-only to WRR (or add minimum share) so non-control traffic cannot be permanently starved.
Pass criteria: Unknown-unicast share ≤ X_% (or ≤ X_pps) for X_min; control latency/jitter ≤ X_ms; drops do not ramp (drop increment ≤ X_cnt per X_min).

Mirroring is enabled, but the “bad frames” never show up — wrong mirror tap point (ingress vs egress)?

Likely cause: Mirror source is wrong (port vs VLAN), or the tap point is before/after the event (ingress vs egress), or the mirror port is oversubscribed and drops.
Quick check: Inject a known test flow and verify it appears; compare ingress-tap vs egress-tap (if supported) and check mirror port drop counters.
Fix: Move the tap closer to the suspected failure stage; narrow the mirrored scope (single port/VLAN) and ensure the mirror destination can sustain the rate.
Pass criteria: Target-frame hit rate ≥ X_%; mirror-port drop increment ≤ X_cnt per X_min; captures contain consistent VLAN/PCP evidence.

QoS enabled, but latency jitter got worse — strict priority starving lower classes and amplifying retransmits?

Likely cause: Strict priority starves a queue until higher-layer retries explode; or PCP→queue mapping is inverted, pushing control into a congested class.
Quick check: Create controlled contention and measure control-frame delay; mirror PCP/DSCP markings and verify mapping aligns with the queue plan.
Fix: Use WRR (or strict + minimum share) to prevent starvation; apply ingress policing on “noisy” ports to cap burst damage.
Pass criteria: Control traffic latency/jitter ≤ X_ms under defined stress; no sustained starvation (drop increment on low class ≤ X_cnt per X_min).

Storm control “kills” discovery/heartbeats — threshold unit or measurement window mismatch?

Likely cause: Threshold is set in the wrong unit (pps vs %) or the averaging window is too short/too strict, causing legitimate multicast/discovery to be dropped.
Quick check: Measure baseline LLDP/ARP/heartbeat pps in normal operation; correlate storm-drop counters with the moment devices “vanish”.
Fix: Sweep thresholds to find a safe window; separate limits for broadcast/multicast/unknown-unicast and avoid a single “global” clamp.
Pass criteria: Heartbeats survive continuously for X_min; storm-drop is 0 (or ≤ X_cnt) during normal traffic; outage events do not recur within X_min.
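
The unit mismatch is easy to quantify. On Ethernet, each frame occupies its own length plus 20 bytes of overhead on the wire (8 bytes preamble/SFD and 12 bytes inter-frame gap), so the same percentage of line rate corresponds to very different packet rates depending on frame size. The sketch below assumes the percentage is defined against wire occupancy; how a specific device defines its percentage or its measurement window must be taken from its datasheet.

```python
WIRE_OVERHEAD_BYTES = 20  # 8 bytes preamble/SFD + 12 bytes inter-frame gap


def pct_to_pps(line_rate_bps, pct, frame_bytes):
    """Packets per second equal to `pct` percent of line rate.

    frame_bytes is the frame length including FCS (64 for a minimum frame).
    """
    bits_per_frame = (frame_bytes + WIRE_OVERHEAD_BYTES) * 8
    return line_rate_bps * (pct / 100.0) / bits_per_frame


for size in (64, 1518):
    print(size, "bytes:", round(pct_to_pps(100e6, 1, size)), "pps at 1% of 100 Mb/s")
```

At 100 Mb/s, 1% of line rate is about 1488 pps with 64-byte frames but only about 81 pps with 1518-byte frames, so a threshold copied from a pps-based plan into a percentage field (or the reverse) can be off by more than an order of magnitude.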

A ring/loop connection causes a broadcast meltdown — loop not contained, dual uplinks, or threshold too high?

Likely cause: Physical loop (mispatch / dual uplinks) creates a broadcast amplification loop; storm guard is disabled or set so high it never triggers.
Quick check: Break one link and see if the network immediately recovers; read broadcast share and storm-drop counters during the event.
Fix: Enable loop containment (within smart-switch capability) and set practical storm limits; enforce “single-uplink rule” where applicable.
Pass criteria: Mispatch drill does not collapse the cell; broadcast share ≤ X_% for X_min; recovery time ≤ X_s.

Only one port shows high CRC/drop — board-level return path or local SI/power noise?

Likely cause: Local signal integrity/return-path issue, connector/cable defect, or power/ground noise coupling into that port’s PHY/MAC interface.
Quick check: Swap cables/ports to see if the symptom follows the port or the device; correlate CRC ramps with load/temperature/time-of-event logs.
Fix: Isolate the port via VLAN to reduce blast radius; then address board-level causes (decoupling, ground return continuity, routing/spacing) and cable quality.
Pass criteria: CRC increment ≤ X_cnt per X_min under the same load; link flaps = 0 for X_min.

Link drops only at high temperature — rail derating, reset margin, or clock margin collapse?

Likely cause: Power rail derating triggers brownout/reset, oscillator margin collapses, or temperature shifts timing/noise margins until errors accumulate into a flap.
Quick check: Align temperature vs link events vs CRC/drop; verify reset-reason and undervoltage indicators when the drop occurs.
Fix: Increase power/reset margins and improve thermal path; use a more stable clock source or reduce noise coupling; apply a protective lower-rate mode if required.
Pass criteria: High-temp run stable for X_min; link flaps = 0; CRC/drop increment ≤ X_cnt per X_min.

After shipping, remote configuration is impossible — missing lockout prevention and rollback?

Likely cause: No rescue profile/port, configuration applied without staged commit, or management VLAN can be accidentally isolated by policy.
Quick check: Read back active profile version/CRC; confirm whether a rescue path (port/VLAN) remains reachable after a forced misconfig test.
Fix: Implement dual-profile (active/rescue) + watchdog rollback; enforce “management reachability invariant” that policy cannot break.
Pass criteria: Any bad config auto-recovers within X_s; success rate ≥ X_% across repeated drills; remote reachability stable for X_min.

Plugging in one device slows the whole network — uncontrolled burst, unknown-unicast flooding, or no policing?

Likely cause: A noisy endpoint generates bursts (broadcast/multicast/unknown-unicast) without limits; queues saturate and create global contention symptoms.
Quick check: Mirror that port and quantify broadcast/unknown share; check storm-drop and port policing counters during the slowdown window.
Fix: Apply ingress policing/rate limiting on that port; enable storm control for relevant types; isolate via VLAN if needed to contain blast radius.
Pass criteria: With the device connected, global latency/drops stay within X_ms/X_cnt; abnormal traffic share ≤ X_% for X_min.

Captures look “normal”, but failures still happen — event is egress-queue timing and the tap misses it?

Likely cause: The event is created at egress under congestion (queueing/drop), while capture is taken at ingress; or the mirror destination drops at peak.
Quick check: Switch to egress tap (if supported) or mirror a narrower scope; correlate counter baselines to the exact failure timestamp.
Fix: Reduce mirrored sources to avoid oversubscription; use counters + event logs as the primary trigger and capture around the event window.
Pass criteria: Reproduction hit rate ≥ X_%; event-aligned counters show a clear signature; mirror port drop increment ≤ X_cnt per X_min.