Unmanaged & Smart Switch ICs for Industrial Ethernet
Smart/unmanaged switch ICs make small industrial networks predictable: isolate traffic with VLAN, protect control flows with QoS, and contain loops/storms before they become field outages.
This page focuses on the minimum set of features and verification steps that turn “it forwards packets” into repeatable, serviceable, and production-safe behavior.
H2-1. Definition, Positioning, and Scope (Unmanaged vs Smart Switch IC)
Unmanaged and smart Ethernet switch ICs enable multi-port Layer-2 forwarding. The practical difference is whether the design gains containment (VLAN/limits), predictability (QoS), and observability (mirroring/counters) to prevent field incidents and speed up root-cause analysis.
In-scope
- Unmanaged vs smart positioning and decision cues
- VLAN basics (port-based / 802.1Q entry-level rules)
- QoS basics (PCP mapping, queue behavior, rate limiting)
- Port mirroring and minimum counters for debugging
- Storm control fundamentals and loop containment basics
- Ring “basics” as concepts only (no standards deep-dive)
Out-of-scope
- Fully managed switch deep features (L3 routing, ACL/TCAM details)
- TSN scheduling (Qbv/Qci/Qav), time windows, admission control
- Time sync architecture (PTP/SyncE/White Rabbit)
- MACsec / crypto / secure boot frameworks
- PHY electrical/SI deep-dive (return loss, EQ tuning, compliance labs)
- Industrial protocol internals (PROFINET/EtherCAT/CIP standard details)
Unmanaged vs Smart: What Changes in Real Deployments
Unmanaged
- Control: minimal or none; mostly default forwarding
- Segmentation: typically none (no VLAN isolation)
- Debug: limited observability; no mirroring in many designs
- Field risk: faults can spread; troubleshooting becomes guesswork
Best when: small fan-out, controlled environment, low need for isolation and diagnostics.
Smart Switch IC
- Control: straps/EEPROM and/or runtime config (I²C/SPI/SMI-like)
- Containment: VLAN (port-based and/or 802.1Q) and traffic limits
- Predictability: QoS mapping to queues; basic scheduling behavior
- Observability: port mirroring + essential counters
- Field safety: storm control to prevent “whole-network freeze” incidents
Best when: industrial cells need isolation, limiters, and fast RCA without full managed complexity.
Managed Switch IC (Out of scope)
- Full L2/L3: advanced filtering, ACL/TCAM, multicast controls
- Telemetry: richer stats, remote management stacks
- Determinism: TSN scheduling features (separate topic)
Mentioned only to define boundaries and prevent scope creep.
A 30-Second Decision Rule
- Need isolation? Multiple device classes or tenants on one box → choose Smart (VLAN required).
- Need controlled failure blast radius? Risk of loops or broadcast storms → choose Smart (storm control required).
- Need faster root-cause analysis? Field failures must be proven with captures/counters → choose Smart (mirroring required).
- Only basic fan-out? Single-purpose, controlled network with minimal diagnostics needs → Unmanaged may be sufficient.
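The decision rule above can be written down as a small function, which is handy when the same questions are asked for many product variants. This is only a sketch: the function name and its three boolean inputs are hypothetical labels for the three questions in the list, not part of any vendor tool.

```python
def choose_switch_class(need_isolation, need_storm_containment, need_capture_evidence):
    """Return (switch_class, required_features) following the 30-second rule.

    Any single "yes" pushes the design to a smart switch IC; only when all
    three answers are "no" is an unmanaged part considered sufficient.
    """
    required = []
    if need_isolation:
        required.append("VLAN")
    if need_storm_containment:
        required.append("storm control")
    if need_capture_evidence:
        required.append("mirroring")
    return ("smart" if required else "unmanaged", required)
```

For example, `choose_switch_class(True, False, True)` selects the smart class and lists VLAN and mirroring as the features to verify in the datasheet.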
H2-2. Internal Architecture You Actually Need (Blocks, Tables, and Bottlenecks)
The goal is not to explain Ethernet fundamentals. The goal is to map each internal block to a real factory failure mode, the observable evidence to collect, and the configuration knobs that contain the impact.
Minimal Forwarding Path (What Matters)
Ingress → Classify (VLAN/QoS) → Lookup (MAC learning) → Queues (schedule/drop) → Egress (tag/limit) → Port.
Classify (VLAN / Priority Mapping)
- Typical symptom: link is up, but a device “disappears” after VLAN changes.
- Evidence to check: ingress accept policy (tag/untag), PVID, egress tag/untag rules.
- Action: define a safe default VLAN, keep management path on a known untagged VLAN, use rollback strategy if supported.
Pass criteria: expected endpoints reachable in VLAN X; no unintended tag stripping on uplink.
MAC Learning Table (Capacity / Aging)
- Typical symptom: “link utilization looks low” but the network feels clogged or intermittent.
- Evidence to check: unknown-unicast flooding, MAC move/learn events, table overflow indications.
- Action: constrain broadcast domains with VLANs; verify aging defaults; avoid topologies that amplify flooding.
Pass criteria: unknown-unicast and flooding counters remain within X (per window) under normal load.
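To make the link between table capacity, aging, and flooding concrete, here is a minimal software model of a learning table. It is a sketch under simplifying assumptions (a single shared table keyed by VLAN and MAC, no hashing collisions, timestamps passed in by the caller); the class and counter names are hypothetical and do not correspond to any specific switch IC register map.

```python
class MacTable:
    """Toy model of MAC learning with capacity and aging limits."""

    def __init__(self, capacity, aging_s):
        self.capacity = capacity
        self.aging_s = aging_s
        self.entries = {}          # (vid, mac) -> (port, last_seen)
        self.flood_count = 0       # lookups that missed (unknown unicast)
        self.move_count = 0        # same MAC seen on a different port
        self.learn_fail_count = 0  # new MAC could not be learned (table full)

    def _expire(self, now):
        stale = [k for k, (_, seen) in self.entries.items()
                 if now - seen > self.aging_s]
        for key in stale:
            del self.entries[key]

    def forward(self, vid, src, dst, in_port, now):
        """Learn the source, then return the egress port or None for flood."""
        self._expire(now)
        key = (vid, src)
        known = self.entries.get(key)
        if known is None and len(self.entries) >= self.capacity:
            self.learn_fail_count += 1
        else:
            if known is not None and known[0] != in_port:
                self.move_count += 1
            self.entries[key] = (in_port, now)
        hit = self.entries.get((vid, dst))
        if hit is None:
            self.flood_count += 1
            return None
        return hit[0]
```

Feeding a capture or synthetic trace through such a model shows how quickly flooding rises once the number of active addresses exceeds the table capacity, or when the aging time is shorter than the quietest endpoint's transmit interval.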
Queues & Scheduling (Where “QoS Made It Worse” Happens)
- Typical symptom: enabling QoS increases jitter or timeouts during congestion.
- Evidence to check: per-queue drop counters, queue depth/watermark indicators (if available), priority mapping sanity.
- Action: reserve a high-priority path for control traffic; avoid starving low-priority traffic that triggers retries.
Pass criteria: control traffic latency/jitter stays within X under stress, without runaway drops in lower queues.
Rate Limiters & Storm Filters (Contain the Blast Radius)
- Typical symptom: a single misbehaving node slows down every port on the switch.
- Evidence to check: broadcast/multicast/unknown-unicast rates, storm-drop counters, per-port ingress policing stats.
- Action: enable storm control on edge ports, set conservative defaults, and validate that discovery/heartbeat frames are not cut.
Pass criteria: under induced storms, only the offending port experiences drops; uplink remains stable within X.
Mirroring Tap Points (Ingress vs Egress)
- Typical symptom: captures “look clean” even though endpoints report drops.
- Evidence to check: whether mirroring taps ingress or egress; whether filters exclude the problematic VLAN/port.
- Action: start with ingress mirroring to find sources; use egress mirroring to confirm queue drops or tag rules.
Pass criteria: mirrored frames match the suspected flow set; counters and captures agree within X.
Minimum Observability Set (Do Not Skip)
Even “smart” switches vary widely. For field-proof designs, ensure the debug path can separate these root causes quickly:
- Port health: CRC errors, alignment errors, drops, overruns
- Congestion: per-queue drops (or at least total drops) per port
- Flooding: broadcast/multicast/unknown-unicast counters and storm-drop counters
- Learning events: MAC learn/move/age indications (if exposed)
- Mirroring: configurable ingress/egress taps and source selection (port/VLAN)
H2-3. VLAN Basics for Smart Switches (Port-based vs 802.1Q, Ingress/Egress Rules)
VLAN is an engineering rule system: classify frames on ingress, enforce membership, then decide tag/untag behavior on egress. The goal is predictable isolation and a safe debug path—without accidental lockout.
Port-based VLAN
- Use when: a small machine cell or I/O box needs clean separation with minimal complexity.
- Why it’s safe: isolation is driven by port membership; fewer tag rules to misalign.
- Watch-outs: limited flexibility when multiple VLANs must traverse an uplink between cabinets.
802.1Q Tag VLAN
- Use when: an uplink must carry multiple VLANs (trunk) or VLAN identity must persist across cabinets.
- What changes: ingress accept rules + PVID + egress tag/untag policy must match across links.
- Primary risk: misaligned native/tag rules can isolate endpoints or cause cross-domain leakage.
Ingress Rules (Decision Flow)
- Untagged frame → assign PVID (Port VLAN ID) and continue switching within that VLAN.
- Tagged frame (VID=X) → accept only if the port allows tagged frames and VLAN X is a member; otherwise drop.
- Drop evidence → VLAN policy drop counters (if available) and “silent reachability loss” symptoms.
Common pitfalls: (1) tagged frames arriving on an untag-only port, (2) wrong PVID for untagged endpoints, (3) overly broad VLAN membership.
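The ingress decision flow above fits in a few lines of code. The sketch below assumes a per-port configuration with three fields (PVID, whether tagged frames are accepted, and the set of member VLANs); the names `PortVlanConfig` and `classify_ingress` are illustrative only.

```python
from dataclasses import dataclass


@dataclass
class PortVlanConfig:
    pvid: int                             # VLAN assigned to untagged frames
    accept_tagged: bool = False           # False models an untag-only port
    member_vids: frozenset = frozenset()  # VLANs allowed for tagged frames


def classify_ingress(cfg, frame_vid):
    """Return the VLAN the frame is switched in, or None if it is dropped.

    frame_vid is None for an untagged frame, otherwise the 802.1Q VID.
    """
    if frame_vid is None:
        return cfg.pvid
    if cfg.accept_tagged and frame_vid in cfg.member_vids:
        return frame_vid
    return None
```

Pitfall (1) from the list shows up directly: a tagged frame arriving on a port with `accept_tagged=False` returns `None`, which in the field looks like silent reachability loss unless a policy-drop counter exists.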
Egress Rules (Tag/Untag + Native VLAN)
- Access ports: typically egress untag for PLC/HMI/cameras that do not expect VLAN tags.
- Trunk uplink: typically egress tag to carry multiple VLANs across a single link.
- Native VLAN: one VLAN may be allowed to traverse the trunk as untagged; both ends must match the native VLAN definition.
Pass criteria: trunk shows expected VLAN tags (VID=X) and access ports remain untagged; cross-domain broadcast/ARP must not leak.
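When reading a trunk capture it helps to know exactly which bytes change on egress. The sketch below inserts or strips the 4-byte 802.1Q tag (TPID 0x8100 followed by PCP, DEI, and VID) after the destination and source MAC addresses. It is a simplified model: the FCS is not represented (real hardware recomputes it), DEI is left at zero, and the function names and the `mode` strings are assumptions made for illustration.

```python
import struct

TPID_8021Q = 0x8100


def add_dot1q_tag(frame, vid, pcp=0):
    """Insert an 802.1Q tag after the two 6-byte MAC addresses."""
    if not 1 <= vid <= 4094:
        raise ValueError("VID must be 1..4094")
    if not 0 <= pcp <= 7:
        raise ValueError("PCP must be 0..7")
    tci = (pcp << 13) | vid  # DEI bit (bit 12) left at 0
    return frame[:12] + struct.pack("!HH", TPID_8021Q, tci) + frame[12:]


def strip_dot1q_tag(frame):
    """Remove the 802.1Q tag if present; return the frame unchanged otherwise."""
    if frame[12:14] == b"\x81\x00":
        return frame[:12] + frame[16:]
    return frame


def egress_frame(untagged, vid, pcp, mode, native_vid=None):
    """Apply the access/trunk/native rules to an internally untagged frame."""
    if mode not in ("access", "trunk"):
        raise ValueError("mode must be 'access' or 'trunk'")
    if mode == "access" or vid == native_vid:
        return untagged
    return add_dot1q_tag(untagged, vid, pcp)
```

In a capture taken on the trunk, bytes 12 to 15 of a VLAN 10 frame with PCP 6 read `81 00 c0 0a`; the same frame on an access port shows the original EtherType at offset 12.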
Minimal Usable Configuration Patterns
Pattern A — 3 Domains + Trunk Uplink
- PLC/HMI/Camera ports: access, egress untag, PVID=V10/V20/V30
- Uplink: trunk, allow V10/V20/V30, egress tag
- Verification: trunk capture shows VID=10/20/30; domains isolated
Pattern B — Pure Port-based Isolation
- No tagging; isolation enforced by port membership groups
- Best for single-cabinet segmentation without VLAN continuity needs
- Verification: broadcasts/ARP must stay within each group
Pattern C — Trunk with Native VLAN
- One VLAN is untagged on trunk (native); others remain tagged
- Use only when legacy untag requirements exist
- Verification: both ends agree on native VLAN; no cross-domain leakage
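Patterns A and C both depend on the two ends of a trunk agreeing with each other. A configuration check along the following lines can run before any profile is committed. The dictionary layout (`native_vid`, `allowed_vids`) is a hypothetical representation of each end's trunk settings, not a real device schema.

```python
def trunk_mismatches(end_a, end_b):
    """Compare two trunk-end definitions and list disagreements.

    Each end is a dict with 'native_vid' (int or None) and
    'allowed_vids' (iterable of ints).
    """
    problems = []
    if end_a["native_vid"] != end_b["native_vid"]:
        problems.append(
            f"native VLAN mismatch: {end_a['native_vid']} vs {end_b['native_vid']}"
        )
    a, b = set(end_a["allowed_vids"]), set(end_b["allowed_vids"])
    if a != b:
        problems.append(f"allowed VLANs differ: {sorted(a ^ b)}")
    return problems
```

An empty result only proves that the intended configurations agree; the mirror capture on the trunk remains the evidence that the hardware actually behaves that way.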
H2-4. QoS in the Small: Priorities, Queues, and Rate Limiting (What Matters in Factories)
Factory QoS is a minimal set of choices: classify traffic, map it into a few queues, and apply policing/shaping so a single endpoint cannot destabilize the cell. The proof comes from queue drops and counters—never from “it feels faster”.
Minimal Correct Set
- Classification: 802.1p PCP for L2 priority (primary); DSCP remarking applies only when an L3 gateway exists and is not covered here.
- Mapping: assign traffic classes into a small number of queues (commonly 4).
- Scheduling: strict vs WRR is chosen by starvation tolerance—strict can starve low queues; WRR preserves minimum service.
- Policing: cap abnormal ingress behavior (misbehaving endpoint, storms, unknown unicast floods).
- Shaping: protect uplinks so local bursts do not overload the core network.
Strict Priority
- Benefit: minimizes delay for the top queue when load is high.
- Risk: lower queues can be starved; starvation can trigger retries that amplify congestion.
- Use when: control traffic is small but must remain stable under stress.
WRR / Weighted Scheduling
- Benefit: guarantees minimum service to lower queues.
- Trade-off: top queue latency can increase slightly versus strict priority.
- Use when: video/log flows must not collapse during sustained congestion.
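The starvation trade-off is easy to see in a toy model where every queue stays backlogged and each scheduling slot sends one frame. The simulation below is a deliberately simplified sketch (equal frame sizes, frame-count weights, no arrivals during the run); the function names and the example weights are assumptions, and real devices may implement weighted scheduling differently.

```python
from collections import deque


def backlog(n_queues, depth):
    """Build n_queues queues, each holding `depth` dummy frames."""
    return [deque(range(depth)) for _ in range(n_queues)]


def strict_priority(queues, slots):
    """Index 0 is the highest priority; always serve the first non-empty queue."""
    sent = [0] * len(queues)
    for _ in range(slots):
        for i, q in enumerate(queues):
            if q:
                q.popleft()
                sent[i] += 1
                break
    return sent


def weighted_round_robin(queues, weights, slots):
    """Serve up to weights[i] frames from queue i in each round."""
    sent = [0] * len(queues)
    used = 0
    while used < slots and any(queues):
        for i, q in enumerate(queues):
            for _ in range(weights[i]):
                if used >= slots or not q:
                    break
                q.popleft()
                sent[i] += 1
                used += 1
    return sent
```

With four queues of 100 frames each and 60 slots, strict priority sends all 60 frames from the top queue, while weights of 8/4/2/1 split the same 60 slots as 32/16/8/4. The lower queues keep moving under WRR, at the cost of the top queue waiting for its turn within each round.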
Practical QoS Recipes (2–3 Minimal Policies)
Recipe 1 — Control First
- Control → highest queue (PCP=6/7)
- I/O → next queue
- Video/log → low queue
- Best-effort → lowest queue
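Recipe 1 translates into a PCP-to-queue table. In the sketch below only the control assignment (PCP 6/7 to the highest queue) comes from the recipe; the PCP values chosen for I/O, video/log, and best-effort are assumptions for illustration and must be aligned with what the endpoints actually mark.

```python
# Queue 3 is the highest-priority queue in this 4-queue sketch.
PCP_TO_QUEUE = {
    7: 3, 6: 3,  # control (from the recipe)
    5: 2, 4: 2,  # I/O (assumed PCP values)
    3: 1, 2: 1,  # video/log (assumed PCP values)
    1: 0, 0: 0,  # best-effort (assumed PCP values)
}


def queue_for(pcp):
    """Return the egress queue index for an 802.1p priority value."""
    if pcp not in PCP_TO_QUEUE:
        raise ValueError("PCP must be 0..7")
    return PCP_TO_QUEUE[pcp]


def mapping_is_sane(mapping, control_pcps=(6, 7)):
    """Check that all eight PCPs are mapped and control lands in the top queue."""
    if set(mapping) != set(range(8)):
        return False
    top = max(mapping.values())
    return all(mapping[p] == top for p in control_pcps)
```

Running `mapping_is_sane` against the table read back from the device is a cheap way to catch a mapping that was only partially written.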
Recipe 2 — Misbehavior Containment
- Enable storm control on edge ports (broadcast/multicast/unknown unicast).
- Apply ingress policing to suspicious ports or traffic classes.
- Track storm-drop and unknown-unicast counters for proof.
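Ingress policing is commonly described with a token-bucket model, and a software version helps when choosing rate and burst values before touching hardware. This is a sketch of that general model, not a description of how any particular switch IC implements its limiter; the class name and the packets-per-second units are assumptions.

```python
class TokenBucketPolicer:
    """Packet-based token bucket: `rate_pps` sustained, `burst` packets of slack."""

    def __init__(self, rate_pps, burst):
        self.rate_pps = rate_pps
        self.burst = burst
        self.tokens = float(burst)
        self.last = 0.0
        self.dropped = 0  # plays the role of a policing/storm-drop counter

    def allow(self, now):
        """Return True if a frame arriving at time `now` (seconds) may pass."""
        elapsed = now - self.last
        self.last = now
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate_pps)
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        self.dropped += 1
        return False
```

Replaying the arrival times of a discovery or heartbeat burst through the model shows whether a candidate burst size would cut legitimate traffic, which is the false-kill case to avoid.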
Recipe 3 — Uplink Protection
- Apply egress shaping on the uplink to cap burst impact.
- Prefer limiting on uplink rather than cutting endpoints blindly.
- Use queue drops + uplink utilization as verification evidence.
Verification Evidence (Stress + Proof)
- Stress: introduce congestion using video/bulk transfers while control traffic continues.
- Observe: per-queue drops (low queues drop first), storm-drop counters, and policing counters.
- Pass criteria: control jitter/latency stays within X under stress; top queue drops remain ≤ X; storm effects remain local to offending ports.
H2-5. Port Mirroring & Observability (How to Debug Without a Managed Switch)
Port mirroring is a field-proof evidence channel: it turns “suspicions” into packets and counters. The objective is to isolate the failing port/VLAN/direction quickly, then verify fixes with repeatable captures.
What Mirroring CAN Solve
- Reproduce: drops, duplicates, bursts, broadcast floods, and topology “flapping” symptoms.
- Localize: identify the suspect port, VLAN, and direction (ingress vs egress).
- Prove: validate configuration changes by comparing before/after captures and counters.
What Mirroring CANNOT Solve
- No per-queue truth: it cannot precisely expose each egress queue state (that level of telemetry belongs to managed switches).
- No guaranteed capture: oversubscription can drop mirrored packets on the mirror port.
- No hardware precision: it complements, not replaces, line-rate counters and PHY diagnostics.
Mirror Setup Choices (Source, Destination, Direction)
- Source: start with the port that shows the highest error/storm counters, or the port on the failing path.
- Destination: use a dedicated mirror port to a laptop; avoid sharing it with production devices.
- Ingress mirror: best when the endpoint is suspected (abnormal bursts, wrong VLAN tags, malformed traffic).
- Egress mirror: best when configuration/policy is suspected (unexpected tagging, policing, congestion effects).
- VLAN-based mirror (if supported): focus on one VLAN to reduce noise when multiple domains share the switch.
Field Debug Workflow (Repeatable, Evidence-Driven)
- Freeze the baseline: record wiring, VLAN/PVID, QoS rules, and storm thresholds before changing anything.
- Pick a mirror source: use the port with the strongest symptom (drops, CRC, storm counters) or the suspected segment.
- Pick direction: start with ingress, then confirm with egress if policy effects are suspected.
- Capture windows: 30–120 seconds for burst faults; longer windows for periodic flaps.
- Three fast checks: broadcast ratio, retransmit/duplicate patterns, and ARP/LLDP/discovery behavior.
- Align with counters: correlate packet evidence with CRC/drop/overrun/storm counters on the same port.
- Form one testable hypothesis: e.g., “wrong egress tag policy” or “unknown-unicast flooding”.
- Change one variable: modify a single rule/threshold, then re-capture to validate improvement.
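The broadcast-ratio check in the workflow can be computed from the destination MAC addresses of a capture. The sketch below assumes the addresses have already been extracted as 6-byte values (for example by a capture tool's export); the function names are illustrative.

```python
from collections import Counter


def classify_dst(mac):
    """Classify a 6-byte destination MAC as broadcast, multicast, or unicast."""
    if mac == b"\xff" * 6:
        return "broadcast"
    if mac[0] & 0x01:  # group bit: least-significant bit of the first octet
        return "multicast"
    return "unicast"


def traffic_shares(dst_macs):
    """Return the fraction of frames per class, or {} for an empty capture."""
    counts = Counter(classify_dst(m) for m in dst_macs)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {kind: counts[kind] / total
            for kind in ("broadcast", "multicast", "unicast")}
```

Note that a capture alone cannot separate known unicast from unknown-unicast flooding; that distinction needs the switch counters or a mirror on a port that should not be receiving the flow.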
Black-Box Counter Checklist (Fast Signal & Next Action)
Integrity / Link
- CRC/FCS errors: SI/EMI/cabling/termination; verify grounding and connector path.
- Link up/down: intermittent cable/power/EMI events; correlate with time and environment.
Congestion / Buffer
- RX overruns / FIFO drops: ingress overload or internal contention; inspect storms and uplink shaping.
- Egress drops (if available): queue congestion or shaping; align with QoS policy and rate limits.
Storm / Flood Evidence
- Storm-drop counters: thresholds are triggering; validate if true storms or false positives.
- Unknown-unicast indicators: learning instability or topology issues; confirm with capture patterns.
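The checklist above maps naturally onto a comparison of two counter snapshots. The following sketch assumes counters are read into plain dictionaries; the counter names are hypothetical placeholders, since real register names differ per device.

```python
# Hypothetical counter names mapped to the next action from the checklist.
HINTS = {
    "crc_errors": "integrity: check SI/EMI/cabling/termination, grounding, connector path",
    "link_down_events": "link: correlate with cable, power, and EMI events over time",
    "rx_overruns": "congestion: inspect storms and uplink shaping",
    "egress_drops": "congestion: align with QoS policy and rate limits",
    "storm_drops": "storm: confirm true storm vs false positive",
    "unknown_unicast": "flood: check learning stability and topology with a capture",
}


def counter_hints(before, after, min_delta=1):
    """Return (counter, delta, hint) for every counter that grew by min_delta or more."""
    hints = []
    for name, hint in HINTS.items():
        delta = after.get(name, 0) - before.get(name, 0)
        if delta >= min_delta:
            hints.append((name, delta, hint))
    return hints
```

Taking the two snapshots over the same window as the mirror capture keeps the packet evidence and the counter evidence comparable.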
H2-6. Storm Control & Loop Containment (Broadcast/Multicast/Unknown-Unicast Guards)
Storm control is not about saving bandwidth; it is a fuse that prevents a single fault from collapsing the entire cell. Correct limits create local containment and measurable evidence (storm-drop counters) for diagnosis.
Three Storm Classes to Guard
- Broadcast: ARP/discovery bursts can amplify rapidly under loops and miswiring.
- Multicast: without proper control, multicast can behave like broadcast at the edge.
- Unknown-unicast: when learning is unstable, traffic floods; this often feels like “low utilization but the network is stuck”.
Typical Loop Triggers in the Field
- Accidental patching between two ports during maintenance.
- Dual uplinks connected without loop protection or consistent ring configuration.
- Ring ports treated as ordinary ports (protocol not enabled or mismatched).
- One device bridges two connections unexpectedly (IPC/gateway wired twice).
- Learning instability that turns unknown-unicast into persistent flooding.
Threshold Templates (Placeholders for Standardization)
PPS-based
- Broadcast limit: X pps
- Multicast limit: X pps
- Unknown-unicast limit: X pps
% Line-rate
- Broadcast cap: X% of port rate
- Multicast cap: X% of port rate
- Unknown-unicast cap: X% of port rate
Guardrails: start conservative to avoid false kills, then tighten using storm-drop counters and mirror captures as proof.
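Because some devices express storm thresholds in pps and others in percent of line rate, it is useful to convert between the two. The arithmetic below uses the standard Ethernet per-frame overhead of 20 bytes on the wire (8 bytes preamble/SFD plus 12 bytes inter-frame gap) on top of the frame size including FCS. The function names are illustrative, and the 64-byte default is an assumption that gives the worst-case (highest) packet rate.

```python
WIRE_OVERHEAD_BYTES = 20  # 8 bytes preamble/SFD + 12 bytes inter-frame gap


def line_rate_pps(link_bps, frame_bytes=64):
    """Maximum frames per second for a given frame size (FCS included)."""
    return link_bps / ((frame_bytes + WIRE_OVERHEAD_BYTES) * 8)


def percent_to_pps(link_bps, percent, frame_bytes=64):
    """Convert a %-of-line-rate cap into packets per second."""
    return line_rate_pps(link_bps, frame_bytes) * percent / 100.0


def pps_to_percent(link_bps, pps, frame_bytes=64):
    """Convert a pps limit into percent of line rate."""
    return 100.0 * pps / line_rate_pps(link_bps, frame_bytes)
```

For a 100 Mb/s port and 64-byte frames, line rate is about 148,810 frames per second, so a 1% cap corresponds to roughly 1,488 pps. The same percentage allows far fewer frames per second when the storm consists of large frames, which is why the threshold unit should be recorded alongside the value.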
False-Kill Risks & How to Avoid Them
- Threshold too low: discovery/heartbeats fail and devices appear “offline”.
- Wrong target: limiting the uplink first can hide the root cause and increase retransmissions.
- Mitigation: enable on edge ports first, keep protocol lifelines safe, and validate with counters + captures.
Incident Playbook (Contain → Identify → Fix)
- Confirm containment evidence: check storm-drop / unknown-unicast counters for growth.
- Find the hottest port: locate the port with the fastest counter increase.
- Mirror ingress: capture and confirm whether broadcast/unknown-unicast dominates.
- Isolate quickly: unplug the suspect cable or disable the suspect port to stop the flood.
- Recover carefully: restore wiring stepwise, validating counters remain stable.
- Harden config: keep strict guards on maintenance-prone ports to prevent recurrence.
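The "find the hottest port" step is a rate comparison between two snapshots of the same counter. A minimal sketch, assuming per-port counter values are available as dictionaries keyed by port number (the function name is hypothetical):

```python
def hottest_port(before, after, window_s):
    """Return (port, increase_per_second) for the fastest-growing counter.

    `before` and `after` map port -> counter value (for example storm-drop
    or unknown-unicast counts) read `window_s` seconds apart.
    """
    if window_s <= 0:
        raise ValueError("window_s must be positive")
    rates = {port: (after[port] - before.get(port, 0)) / window_s
             for port in after}
    port = max(rates, key=rates.get)
    return port, rates[port]
```

The port returned here is the candidate for ingress mirroring in the next step of the playbook. Counter wrap-around is not handled in this sketch.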
H2-7. Ring Protocol Basics (MRP/HSR/PRP — What to Know Without Becoming a Spec Expert)
“Ring basics” focuses on outcomes and selection decisions: recovery time, zero-loss expectations, and the real boundary between smart-switch assists and full redundancy protocol stacks.
Target Outcomes
- Fast switchover: a link break triggers reroute within X ms (short interruption may exist).
- Zero-loss concept: dual-active paths or dual networks reduce disruption (often requires endpoint support).
- Containment: prevent a loop from collapsing the entire cell (often via guards and rapid isolation).
Concept Map (Minimal)
- MRP: ring recovery behavior and switchover time are primary concerns.
- HSR: redundancy via duplicated delivery around a ring; receiver de-duplicates.
- PRP: redundancy via two independent networks; receiver de-duplicates.
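Both HSR and PRP rely on the receiver discarding the second copy of each frame. The idea can be sketched as a bounded memory of recently seen (source, sequence number) pairs. This is a conceptual illustration only: the actual duplicate-discard behavior is defined by the redundancy standard and its implementations, and the class name and window size here are assumptions.

```python
from collections import OrderedDict


class DuplicateDiscard:
    """Accept the first copy of each (source, sequence) pair, drop repeats."""

    def __init__(self, window=64):
        self.window = window       # how many recent pairs to remember
        self.seen = OrderedDict()  # insertion-ordered set of (src, seq)

    def accept(self, src, seq):
        key = (src, seq)
        if key in self.seen:
            return False  # second copy arrived via the other path
        self.seen[key] = True
        if len(self.seen) > self.window:
            self.seen.popitem(last=False)  # forget the oldest pair
        return True
```

The sketch also shows why endpoint support matters for zero-loss expectations: de-duplication state lives in the receiver, not in a generic smart switch.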
Smart Switch Boundary (What to Expect)
Commonly Available
- Basic loop containment and storm guards.
- Simple ring assist / fixed-port behaviors (vendor-dependent).
- Evidence counters (link events, storm-drop, unknown-unicast indicators).
Often Out of Reach
- Full MRP/HSR/PRP stacks with cross-vendor interoperability guarantees.
- Strict zero-loss guarantees without endpoint / system-level support.
- Complex topology determinism (typically a managed / dedicated redundancy scope).
Selection Questions (Answer Before Buying)
- Goal: ms switchover or “zero-loss” expectation?
- Endpoint support: do PLC/I/O/cameras support redundancy modes natively (or require an external adapter)?
- Roles: are manager/client or port roles required, and can the field maintain consistent configuration?
- Certification fit: does the system require protocol certification alignment with controllers/devices?
- Load headroom: can duplicated traffic or reroute bursts exceed uplink or buffer limits?
- Failure modes: is the main risk a link break, or miswiring that creates a loop?
- Evidence: can the switch expose link events and storm/flood counters to validate the root cause?
Minimal Acceptance Checks (Keep It Practical)
- Break test: unplug one ring segment and confirm recovery within X ms (target-dependent).
- Business impact: verify critical control traffic stays within X loss/timeout limits during reroute.
- Containment: simulate mispatch/loop and confirm guards keep the cell responsive with measurable counters.
Out of scope: full standard details, certification workflows, and cross-vendor interoperability rules belong on a dedicated “Ring Redundancy (MRP/HSR/PRP)” page. This section keeps only the decision-critical basics.
H2-8. Hardware Co-Design Hooks (Power, Clocks, PHY-integration, and Layout Traps)
These board-level hooks prevent “protocol-looking” failures that are actually power, strap, clock, or return-path issues. The focus is on layout rules and bring-up evidence, not PHY textbooks.
Power Rails & Reset (The #1 “Half-Alive” Root Cause)
Typical Symptoms
- Link flaps after boot, then stabilizes “randomly”.
- Intermittent drops that correlate with load steps.
- Configuration appears inconsistent across boots.
Design Hooks
- Multi-rail sequencing: ensure all rails reach stable levels before releasing reset.
- UVLO/POR clarity: use a supervisor; avoid “borderline” brownout behavior.
- Reset timing: release reset X ms after rails and clock are confirmed stable.
Clocking Hooks (Stability vs. “Speed”)
- Jitter sensitivity: poor clock quality can surface as instability, renegotiation, or elevated error counters.
- Noise coupling: avoid routing clocks through switching regulator hot zones and high-current loops.
- Bring-up evidence: align clock-related hypotheses with link events + CRC/overrun counter behavior under load.
Straps & Boot Pins (Configuration That Depends on Timing)
- Sampling moment: strap pins are often latched at reset release; unstable rails can create unstable config.
- Shared pins: avoid external devices driving strap pins during boot; keep pull-ups/pull-downs unambiguous.
- Field symptom: “same board, different behavior” can be a strap + reset timing problem, not a protocol bug.
Integrated PHY & Layout Rules (Placement, Return Path, ESD)
Do (Green)
- Keep diff-pair reference plane continuous under the entire path.
- Place connector-entry protection with short, direct return paths.
- Keep magnetics/connector path compact to reduce loop area.
Avoid (Red)
- Diff-pairs crossing plane cuts/slots or return-path discontinuities.
- Clock sources placed next to DC/DC noise or adjacent to the PHY path.
- Long “detours” for TVS returns that create large discharge loops.
Board-Level Evidence to Log (Fast Root-Cause Signals)
- Link events vs power events: link drops aligned with brownouts or load steps indicate power/return-path issues.
- CRC/overrun patterns: correlate with temperature, motor switching, or DC/DC mode transitions.
- Before/after layout change: a large delta from routing/placement changes indicates SI/return-path dominance.
H2-9. Configuration & Bring-up Flow (Straps/I²C/SPI, Default Safe Modes, Field-Safe Updates)
Configuration must be designed as a gated, field-safe workflow: keep reachability first, then enforce VLAN/QoS, and only commit changes after evidence proves stability.
A. Default Safe Modes (Pick One Baseline Per Product)
Flat forwarding (unmanaged-like)
- Best for: first power-on and recovery.
- Risk: storms/loops can spread.
- Must-have guard: conservative storm limits enabled by default.
VLAN isolation (access ports)
- Best for: early domain isolation (PLC / HMI / Camera).
- Risk: wrong PVID/egress rules can cause remote loss.
- Must-have guard: a “rescue port/VLAN” always reachable.
802.1Q trunk uplink
- Best for: multi-VLAN uplink on a single port.
- Risk: native VLAN mismatch → upstream “silent failures”.
- Must-have guard: mirror proof of tagging before commit.
B. Configuration Entry Points (Straps → NVM → Host Bus)
Straps (Boot-latched)
- Use: default mode, interface enable.
- Rule: stable at reset release.
- Failure signature: inconsistent behavior across boots.
EEPROM / NVM (Profile)
- Use: baseline VLAN/QoS/port roles.
- Rule: version + CRC + fallback copy.
- Failure signature: “works until brownout” profile corruption.
I²C / SPI Host (Runtime)
- Use: staged policy apply, diagnostics, update control.
- Rule: detect loss-of-control and re-enter safe profile.
- Failure signature: policy changed but not provable by evidence.
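The "version + CRC + fallback copy" rule for NVM profiles can be sketched as follows. The 8-byte header layout (version, payload length, CRC-32 of the payload) and the function names are assumptions chosen for illustration; a real product would define its own format and likely protect the header as well.

```python
import struct
import zlib

HEADER = struct.Struct("<HHI")  # version, payload length, CRC-32 of payload


def pack_profile(version, payload):
    """Serialize a profile payload with its header."""
    return HEADER.pack(version, len(payload), zlib.crc32(payload)) + payload


def unpack_profile(blob):
    """Return (version, payload), or None if the blob is truncated or corrupt."""
    if len(blob) < HEADER.size:
        return None
    version, length, crc = HEADER.unpack_from(blob)
    payload = blob[HEADER.size:HEADER.size + length]
    if len(payload) != length or zlib.crc32(payload) != crc:
        return None
    return version, payload


def load_profile(primary, fallback, rescue_payload):
    """Try the primary copy, then the fallback copy, then the built-in rescue."""
    for blob in (primary, fallback):
        result = unpack_profile(blob)
        if result is not None:
            return result
    return 0, rescue_payload
```

The "works until brownout" signature corresponds to the case where the primary copy fails its CRC after an interrupted write; with a fallback copy the device boots the previous profile instead of an undefined one.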
C. Field-Safe Updates (Anti-Lockout Pattern)
Common Failure
Wrong VLAN/PVID/egress rules remove the only management path and cause remote lockout.
Anti-Lockout Hooks
- Dual profile: Active + Rescue (known-reachable baseline).
- Staged commit: apply → observe → commit to NVM.
- Rollback gate: revert if no heartbeat/ack within X s.
- Rescue access: dedicated port/VLAN not affected by user policy.
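The staged-commit and rollback-gate hooks combine into one small control loop on the host. The sketch below takes the device-specific actions as callables, so every name in it (`apply`, `heartbeat_ok`, `commit`, `rollback`) is a hypothetical hook to be supplied by the product firmware; the clock and sleep functions are injectable so the logic can be tested without waiting.

```python
import time


def staged_apply(apply, heartbeat_ok, commit, rollback, timeout_s,
                 clock=time.monotonic, sleep=time.sleep, poll_s=0.5):
    """Apply a new policy, commit only after a heartbeat, otherwise roll back.

    Returns True if the change was committed, False if it was rolled back.
    """
    apply()                          # write the policy to runtime registers only
    deadline = clock() + timeout_s
    while clock() < deadline:
        if heartbeat_ok():           # management path still reachable
            commit()                 # now safe to persist to NVM
            return True
        sleep(poll_s)
    rollback()                       # no heartbeat in time: restore rescue profile
    return False
```

The key design choice is that nothing is written to NVM until reachability has been proven with the new policy active, so a power cycle during the observation window always returns to the last known-good profile.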
D. Bring-up Minimal Closed Loop (Prove Then Advance)
- Link up: stable link for X min (no flapping).
- Forwarding: unicast flows pass; MAC learning stable (no persistent flooding).
- VLAN: PVID + tag/untag verified by mirror capture; management remains reachable.
- QoS: queue mapping validated under contention; control stays responsive.
- Mirror: captures confirm tagging/priority behavior; abnormal ratios visible.
- Storm: thresholds contain storms without killing discovery/heartbeats.
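"Prove then advance" means a later step is never attempted while an earlier one is unproven. A gate runner can enforce that ordering. In this sketch each check is a callable returning True or False; the gate names mirror the list above, and the function name is hypothetical.

```python
def run_gates(gates):
    """Run (name, check) pairs in order and stop at the first failure.

    Returns (passed_names, failed_name), where failed_name is None
    when every gate passed.
    """
    passed = []
    for name, check in gates:
        if not check():
            return passed, name
        passed.append(name)
    return passed, None
```

A bring-up script would pass the six gates in the order listed (link, forwarding, VLAN, QoS, mirror, storm) and record the returned tuple together with the counter snapshot taken at the point of failure.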
H2-10. Verification Plan (Throughput/Latency, Loss Under Stress, Storm & Loop Tests)
Verification must be executable: define stimulus, capture observable evidence, and declare pass criteria. Coverage must include line-rate, latency behavior, burst loss, storm threshold sweep, loop containment, and corner stress.
Coverage (What Must Be Proven)
- Throughput: line-rate definition by frame size and direction (uni/bi).
- Latency: measure distributions under load; observe mode signature (store-and-forward vs cut-through behavior).
- Stress loss: burst + contention + mixed priorities; verify critical class survivability.
- Storm sweep: find the safe window between “no containment” and “false-kill”.
- Loop/ring: mispatch loop containment and break recovery observation points.
- Industrial corners: temperature and brownout/load-step alignment with counters/events.
Executable Checklist (Stimulus → Observable → Pass)
Throughput
- Stimulus: min/typ/max frame sizes; uni/bi; full load.
- Observable: drops/overruns; link events; mirror proof.
- Pass: ≤ X drops over Y minutes at defined load.
Latency
- Stimulus: controlled load points + contention.
- Observable: latency distribution shift and jitter growth.
- Pass: latency/jitter within X under defined traffic mix.
Stress Loss
- Stimulus: bursts + mixed classes + uplink contention.
- Observable: which class drops first; responsiveness.
- Pass: critical class loss ≤ X while best-effort absorbs pressure.
Storm Sweep
- Stimulus: broadcast/multicast/unknown-unicast sweep.
- Observable: storm-drop counters vs discovery/heartbeats.
- Pass: safe window identified and documented.
Loop / Ring
- Stimulus: mispatch loop; break one segment (if ring).
- Observable: containment evidence; recovery time X ms.
- Pass: cell remains responsive; recovery meets target.
Temp / Voltage Corners
- Stimulus: low/high temp; brownout/load-step events.
- Observable: link events + counters aligned to logs.
- Pass: no persistent flapping; error within X.
Required capability: controllable load and bursts, class-aware traffic generation, capture + counters for evidence, and synchronized power/temperature/event logging for correlation.
H2-11. Engineering Checklist (Design → Bring-up → Production)
A smart/unmanaged switch design is production-ready only when each gate has a repeatable verification action, objective evidence, and explicit pass criteria (threshold X).
Design Gate
Evidence: reset-reason log + link-event timeline aligned to power.
Pass criteria: no “hang/no-recover” in X disturbances.
MPN examples: TI TPS3808G01 (supervisor), ADI ADM809 (reset), TI TLV75533P (LDO), TI TPS62177 (buck).
Evidence: boot mode readback consistency + interface reachability check.
Pass criteria: mode mismatch rate ≤ X / 1000 boots.
MPN examples: N/A (logic-level straps); use stable pull-up/pull-down resistors such as Yageo RC0603FR-0710KL (10 kΩ).
Evidence: reachability proof after intentionally wrong VLAN/PVID settings.
Pass criteria: remote access recovered within X seconds by rollback.
MPN examples: SPI flash: Winbond W25Q32JV; I²C EEPROM: Microchip 24LC02B or 24AA025E48 (EEPROM with EUI-48).
Evidence: mirror capture showing correct tag/untag on trunk/access paths.
Pass criteria: “management reachable” invariant holds under X reconfig cycles.
MPN examples (switch IC): Microchip KSZ8795 (5-port smart), Microchip KSZ8863 (3-port), Microchip KSZ8895 (5-port).
Evidence: under contention, control class stays responsive; lower classes absorb loss.
Pass criteria: control latency/jitter ≤ X under defined stress.
MPN examples (clocking): Epson SG-210STF (25 MHz osc), Abracon ASE-25.000MHZ (XO).
Evidence: threshold sweep shows a “safe window” where discovery/heartbeats survive.
Pass criteria: no global outage under mispatch/loop; false-kill rate ≤ X.
MPN examples (ESD protection, board-level): choose ultra-low-cap Ethernet ESD arrays per line-rate (example families: Semtech RClamp series / Littelfuse SP305x series).
Bring-up Gate
Evidence: per-step capture + counter snapshot baseline.
Pass criteria: each step stable for X minutes, no link flaps.
MPN examples (switch IC): Microchip KSZ8795 / KSZ8895 (smart features), Microchip KSZ8863 (compact).
Evidence: pcap shows correct 802.1Q tag/untag and 802.1p PCP mapping.
Pass criteria: no unexplained flooding; abnormal ratio detectable within X minutes.
MPN examples (field tool): a passive setup is enough (mirror port + laptop NIC); optional USB NIC: adapters based on the Realtek RTL8153 (common).
Evidence: baseline snapshots stored with timestamp + temperature + power cycle count.
Pass criteria: baseline drift ≤ X over Y hours.
MPN examples (storage): Winbond W25Q32JV (SPI flash), Microchip 24LC02B (I²C EEPROM).
Evidence: storm-drop vs LLDP/ARP/heartbeat survival charted per port.
Pass criteria: safe window documented; false-kill at normal traffic ≤ X.
MPN examples (switch IC): Microchip KSZ8795 (storm controls present in smart class; device-dependent).
Evidence: rollback event log + recovered reachability + restored mirror proof.
Pass criteria: recovery time ≤ X seconds; success rate ≥ X%.
MPN examples: TI TPS3808G01 (supervisor), Winbond W25Q32JV (dual profile storage).
Production Gate
Evidence: profile readback + CRC status + audit log entry.
Pass criteria: failed update always falls back to rescue in X seconds.
MPN examples: Winbond W25Q32JV (SPI flash), Microchip 24LC02B (EEPROM).
Evidence: sample logs from burn-in + induced fault runs.
Pass criteria: event-to-root-cause coverage ≥ X%.
MPN examples (optional time base): Microchip MCP7940N (I²C RTC) for timestamping logs.
Evidence: SOP + successful on-site drill records.
Pass criteria: on-site isolation time ≤ X minutes for common failures.
MPN examples: Microchip 24AA025E48 (EUI-48 for per-unit identity), Winbond W25Q32JV.
Evidence: aligned timeline: temperature + voltage + link events + drops.
Pass criteria: no persistent flapping; error budget within X.
MPN examples (sensing, optional): TI TMP117 (temp sensor) for correlation logging.
H2-12. Applications & IC Selection (Combined)
Applications are expressed as “capability vectors”. IC selection is expressed as a checklist + decision tree, not a brand catalog.
H3-12.1 Applications (Where Unmanaged/Smart Switch ICs Win)
Must-have: VLAN isolation + storm guard; optional mirror for field debug.
Common failure: broadcast storm collapses the whole cell.
Bring-up proof: mirror shows tag/untag correctness; storm window documented.
MPN examples: Microchip KSZ8795, Microchip KSZ8895.
Must-have: QoS queues + basic policing; mirror for on-site capture.
Common failure: burst traffic causes control jitter.
Bring-up proof: contention test keeps control class within threshold X.
MPN examples: Microchip KSZ8863 (compact), Microchip KSZ8795.
Must-have: 802.1Q trunk uplink + VLAN/PVID discipline; optional QoS for control.
Common failure: native VLAN mismatch produces “silent upstream failure”.
Bring-up proof: mirror capture shows VLAN tags on trunk before committing.
MPN examples: Microchip KSZ8795, Microchip KSZ8895.
Must-have: loop containment + storm guard; “ring assist” if supported.
Common failure: mispatch loop floods the network.
Bring-up proof: mispatch drill contains broadcast; recovery time ≤ X.
MPN examples: Microchip KSZ8795 (smart-class guards), Microchip KSZ8895.
H3-12.2 IC Selection Logic (A Checklist, Not a Shopping List)
- Ports/speed mix: FE/GE/2.5G as needed (avoid “overkill” heat).
- VLAN: PVID + accept rules + egress tag policy + table/limit awareness.
- QoS: queue count + scheduling (strict/WRR) + policing/shaping granularity.
- Mirror: per-port/per-VLAN, ingress/egress tap position if available.
- Storm: bcast/mcast/unknown-unicast types + threshold units + default behavior.
- Config/rollback: straps + EEPROM/flash + watchdog rollback + rescue port/VLAN.
- Industrial target: temperature range + EMC evidence plan (metrics only).
- Power/thermal: package dissipation + airflow/heatsinking plan.
Compact switch IC (small fan-out): Microchip KSZ8863.
Identity / MAC storage: Microchip 24AA025E48 (EUI-48 EEPROM).
Config storage: Winbond W25Q32JV (SPI flash), Microchip 24LC02B (EEPROM).
Reset supervisor: TI TPS3808G01, ADI ADM809.
Clock source: Epson SG-210STF (25 MHz), Abracon ASE-25.000MHZ.
Power rails: TI TPS62177 (buck), TI TLV75533P (LDO).
Optional RTC for logs: Microchip MCP7940N.
H2-13. FAQs (Troubleshooting Only: VLAN / QoS / Mirror / Storm / Ring Basics)
These FAQs close long-tail field issues without expanding the main content. Every answer is exactly four lines: Likely cause / Quick check / Fix / Pass criteria (X). The X placeholders used in the pass criteria are project-specific thresholds:
- X_min: minutes of stable operation window
- X_s: seconds to recover (rollback / reconnect)
- X_cnt: counter increment threshold (CRC/drop/storm-drop) within a window
- X_%: ratio threshold (% of frames or % line-rate), used for broadcast/unknown-unicast share
- X_pps: packets-per-second threshold for storm control
- X_ms: latency/jitter bound for control-class traffic under contention
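The placeholders above can be turned into a mechanical pass/fail check. The following is a minimal sketch, assuming counters are read once at the start and once at the end of the observation window; the class names, field names, and example numbers are illustrative and do not come from any switch vendor API. Counter wrap-around is ignored for brevity.

```python
from dataclasses import dataclass

@dataclass
class Thresholds:
    x_min: float  # required stable-operation window, minutes
    x_cnt: int    # max counter increment (CRC/drop/storm-drop) in the window
    x_pct: float  # max share of flagged frames (broadcast/unknown-unicast), %
    x_ms: float   # max control-class latency/jitter, milliseconds

@dataclass
class Observation:
    window_min: float        # how long the system was observed, minutes
    counter_start: int       # counter value at window start
    counter_end: int         # counter value at window end
    flagged_frames: int      # broadcast or unknown-unicast frames seen
    total_frames: int        # all frames seen in the window
    worst_latency_ms: float  # worst control-class latency/jitter measured

def evaluate(obs: Observation, th: Thresholds) -> dict:
    """Return one pass/fail flag per criterion (hypothetical helper)."""
    share = 100.0 * obs.flagged_frames / obs.total_frames if obs.total_frames else 0.0
    return {
        "window": obs.window_min >= th.x_min,
        "counter": (obs.counter_end - obs.counter_start) <= th.x_cnt,
        "share": share <= th.x_pct,
        "latency": obs.worst_latency_ms <= th.x_ms,
    }

# Example values are placeholders, not recommended limits.
limits = Thresholds(x_min=30, x_cnt=5, x_pct=1.0, x_ms=2.0)
run = Observation(window_min=30, counter_start=100, counter_end=103,
                  flagged_frames=50, total_frames=10_000, worst_latency_ms=1.8)
result = evaluate(run, limits)
```

Recording the thresholds next to the measured values keeps each "Pass criteria" line auditable after the fact.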
VLAN configured and a device “disappeared” — wrong PVID or egress untag rule?
Likely cause: Port PVID mismatch, or egress changed from untag→tag (or tag→untag) and the endpoint cannot parse it.
Quick check: Mirror the port and confirm whether ARP/LLDP frames are tagged or untagged; read back PVID + egress rule for that port.
Fix: Restore a “management reachability invariant” (rescue port/VLAN), then standardize access ports as untag + correct PVID; keep trunk ports tagged only.
Pass criteria: Endpoint reachable for X_min; port CRC/drop increment ≤ X_cnt per X_min; no unexpected VLAN tag changes in capture.
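To see why a wrong PVID and a wrong egress rule both look like a "disappeared" device, the sketch below models a deliberately simplified 802.1Q forwarding decision. It assumes untagged ingress frames take the port PVID and that only egress membership is checked; the port dictionaries and the `forward` helper are hypothetical and do not correspond to any specific register map.

```python
DROP = "drop"  # sentinel: frame is not delivered on the egress port

def forward(frame_vid, in_port, out_port):
    """Return the VID seen on the wire at egress (None = untagged) or DROP.

    frame_vid: VID carried by the incoming frame, or None if untagged.
    Ports are dicts with 'pvid', 'members' (set of VIDs), 'untag' (set of VIDs).
    """
    # Ingress classification: untagged frames are assigned the port PVID.
    vid = in_port["pvid"] if frame_vid is None else frame_vid
    # Egress membership: the frame leaves only on ports that belong to the VLAN.
    if vid not in out_port["members"]:
        return DROP
    # Egress tag policy: strip the tag on access ports, keep it on trunks.
    return None if vid in out_port["untag"] else vid

# Two access ports in VLAN 10 (illustrative values).
plc_port = {"pvid": 10, "members": {10}, "untag": {10}}
hmi_port = {"pvid": 10, "members": {10}, "untag": {10}}

ok = forward(None, plc_port, hmi_port)                           # untagged delivery
wrong_pvid = forward(None, {**plc_port, "pvid": 20}, hmi_port)   # silently dropped
wrong_egress = forward(None, plc_port, {**hmi_port, "untag": set()})  # arrives tagged
```

In the third case the frame still leaves the switch, but with a VLAN 10 tag that a tag-unaware endpoint may discard, which is why the mirror capture in the quick check is more informative than a link LED.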
Utilization looks low, but the cell feels “stuck” — unknown-unicast flood or queue starvation?
Likely cause: Unknown-unicast flooding (MAC table miss/aging) or strict-priority scheduling starving lower classes, triggering retries/backpressure upstream.
Quick check: Read unknown-unicast counters and broadcast share; mirror suspect ports and look for spikes in repeated ARP requests, retries, or timeouts.
Fix: Enable unknown-unicast storm guard or policing; switch from strict-only to WRR (or add minimum share) so non-control traffic cannot be permanently starved.
Pass criteria: Unknown-unicast share ≤ X_% (or ≤ X_pps) for X_min; control latency/jitter ≤ X_ms; drops do not ramp (drop increment ≤ X_cnt per X_min).
Mirroring is enabled, but the “bad frames” never show up — wrong mirror tap point (ingress vs egress)?
Likely cause: Mirror source is wrong (port vs VLAN), or the tap point is before/after the event (ingress vs egress), or the mirror port is oversubscribed and drops.
Quick check: Inject a known test flow and verify it appears; compare ingress-tap vs egress-tap (if supported) and check mirror port drop counters.
Fix: Move the tap closer to the suspected failure stage; narrow the mirrored scope (single port/VLAN) and ensure the mirror destination can sustain the rate.
Pass criteria: Target-frame hit rate ≥ X_%; mirror-port drop increment ≤ X_cnt per X_min; captures contain consistent VLAN/PCP evidence.
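Mirror-port oversubscription is simple arithmetic: the copied traffic must fit the destination link. The sketch below assumes per-port average rates are already known from counters; the field names and rates are illustrative only, and short bursts can still overflow a link that looks safe on average.

```python
def mirror_demand_bps(sources):
    """Sum the traffic a mirror session copies to the destination port."""
    total = 0.0
    for s in sources:
        if s["mirror_rx"]:
            total += s["rx_bps"]
        if s["mirror_tx"]:
            total += s["tx_bps"]
    return total

def oversubscription(sources, dest_link_bps):
    """Ratio of mirrored demand to destination capacity (> 1.0 means drops)."""
    return mirror_demand_bps(sources) / dest_link_bps

# Two ports mirrored in both directions onto a 100 Mb/s destination.
wide = [
    {"rx_bps": 60e6, "tx_bps": 30e6, "mirror_rx": True, "mirror_tx": True},
    {"rx_bps": 60e6, "tx_bps": 30e6, "mirror_rx": True, "mirror_tx": True},
]
# Narrowed scope: one port, ingress only.
narrow = [{"rx_bps": 60e6, "tx_bps": 30e6, "mirror_rx": True, "mirror_tx": False}]

wide_ratio = oversubscription(wide, 100e6)
narrow_ratio = oversubscription(narrow, 100e6)
```

A ratio above 1.0 means the mirror destination must drop frames, so missing "bad frames" may be a capture artifact rather than evidence that they never occurred.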
QoS enabled, but latency jitter got worse — strict priority starving lower classes and amplifying retransmits?
Likely cause: Strict priority starves a lower-priority queue until higher-layer retries multiply; or PCP→queue mapping is inverted, pushing control into a congested class.
Quick check: Create controlled contention and measure control-frame delay; mirror PCP/DSCP markings and verify mapping aligns with the queue plan.
Fix: Use WRR (or strict + minimum share) to prevent starvation; apply ingress policing on “noisy” ports to cap burst damage.
Pass criteria: Control traffic latency/jitter ≤ X_ms under defined stress; no sustained starvation (drop increment on low class ≤ X_cnt per X_min).
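The starvation effect can be reproduced with a toy scheduler model. This is a sketch under simplifying assumptions (equal-size frames, one frame per transmit slot, both queues permanently backlogged); real switch ICs implement scheduling in hardware, and the function names here are illustrative.

```python
from collections import deque

def strict_priority(queues, slots):
    """Each slot serves the first non-empty queue; index 0 is highest priority."""
    sent = [0] * len(queues)
    for _ in range(slots):
        for i, q in enumerate(queues):
            if q:
                q.popleft()
                sent[i] += 1
                break
    return sent

def weighted_round_robin(queues, weights, slots):
    """Each round serves up to weights[i] frames from queue i."""
    sent = [0] * len(queues)
    served = 0
    while served < slots and any(queues):
        for i, q in enumerate(queues):
            for _ in range(weights[i]):
                if served == slots or not q:
                    break
                q.popleft()
                sent[i] += 1
                served += 1
    return sent

def backlog(frames=100):
    """Two queues, both holding more frames than the test will drain."""
    return [deque(range(frames)), deque(range(frames))]

strict_result = strict_priority(backlog(), slots=40)
wrr_result = weighted_round_robin(backlog(), weights=[3, 1], slots=40)
```

Under sustained high-class load, the strict scheduler sends nothing from the low class, while a 3:1 WRR still gives it one slot in four. That guaranteed share is what stops retries in the low class from compounding.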
Storm control “kills” discovery/heartbeats — threshold unit or measurement window mismatch?
Likely cause: Threshold is set in the wrong unit (pps vs %) or the averaging window is too short/too strict, causing legitimate multicast/discovery to be dropped.
Quick check: Measure baseline LLDP/ARP/heartbeat pps in normal operation; correlate storm-drop counters with the moment devices “vanish”.
Fix: Sweep thresholds to find a safe window; separate limits for broadcast/multicast/unknown-unicast and avoid a single “global” clamp.
Pass criteria: Heartbeats survive continuously for X_min; storm-drop is 0 (or ≤ X_cnt) during normal traffic; outage events do not recur within X_min.
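A percent-of-line-rate threshold and a pps threshold can be compared once both are expressed in frames per second. The sketch below assumes the percentage refers to the maximum frame rate for a given frame size, with 20 bytes of preamble, start-of-frame delimiter, and inter-frame gap per frame; some devices define the percentage in bytes per interval instead, so the datasheet definition should be confirmed. The baseline figure is a made-up example.

```python
def max_frame_rate_pps(link_bps, frame_bytes=64):
    """Maximum frames/second on the wire for a given frame size.

    Each frame occupies frame_bytes plus 8 bytes preamble/SFD and
    12 bytes inter-frame gap.
    """
    return link_bps / ((frame_bytes + 20) * 8)

def percent_to_pps(percent, link_bps, frame_bytes=64):
    """Convert a %-of-line-rate storm threshold into frames/second."""
    return max_frame_rate_pps(link_bps, frame_bytes) * percent / 100.0

def headroom(threshold_pps, baseline_pps):
    """How many times the measured baseline fits under the threshold."""
    return threshold_pps / baseline_pps

line_rate = max_frame_rate_pps(100_000_000)        # Fast Ethernet, 64-byte frames
one_percent = percent_to_pps(1.0, 100_000_000)     # a 1 % threshold in pps
margin = headroom(one_percent, baseline_pps=200)   # 200 pps baseline is hypothetical
```

On a 100 Mb/s link, 1 % of line rate at minimum frame size is roughly 1,488 pps, so a register that expects pps but is loaded with a "percent-style" value such as 1 or 5 would clamp legitimate discovery traffic almost immediately.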
A ring/loop connection causes a broadcast meltdown — loop not contained, dual uplinks, or threshold too high?
Likely cause: Physical loop (mispatch / dual uplinks) creates a broadcast amplification loop; storm guard is disabled or set so high it never triggers.
Quick check: Break one link and see if the network immediately recovers; read broadcast share and storm-drop counters during the event.
Fix: Enable loop containment (within smart-switch capability) and set practical storm limits; enforce “single-uplink rule” where applicable.
Pass criteria: Mispatch drill does not collapse the cell; broadcast share ≤ X_% for X_min; recovery time ≤ X_s.
Only one port shows high CRC/drop — board-level return path or local SI/power noise?
Likely cause: Local signal integrity/return-path issue, connector/cable defect, or power/ground noise coupling into that port’s PHY/MAC interface.
Quick check: Swap cables/ports to see if the symptom follows the port or the device; correlate CRC ramps with load/temperature/time-of-event logs.
Fix: Isolate the port via VLAN to reduce blast radius; then address board-level causes (decoupling, ground return continuity, routing/spacing) and cable quality.
Pass criteria: CRC increment ≤ X_cnt per X_min under the same load; link flaps = 0 for X_min.
Link drops only at high temperature — rail derating, reset margin, or clock margin collapse?
Likely cause: Power rail derating triggers brownout/reset, oscillator margin collapses, or temperature shifts timing/noise margins until errors accumulate into a flap.
Quick check: Correlate temperature with link events and CRC/drop counters on a common timeline; verify reset-reason and undervoltage indicators when the drop occurs.
Fix: Increase power/reset margins and improve thermal path; use a more stable clock source or reduce noise coupling; apply a protective lower-rate mode if required.
Pass criteria: High-temp run stable for X_min; link flaps = 0; CRC/drop increment ≤ X_cnt per X_min.
After shipping, remote configuration is impossible — missing lockout prevention and rollback?
Likely cause: No rescue profile/port, configuration applied without staged commit, or management VLAN can be accidentally isolated by policy.
Quick check: Read back active profile version/CRC; confirm whether a rescue path (port/VLAN) remains reachable after a forced misconfig test.
Fix: Implement dual-profile (active/rescue) + watchdog rollback; enforce “management reachability invariant” that policy cannot break.
Pass criteria: Any bad config auto-recovers within X_s; success rate ≥ X_% across repeated drills; remote reachability stable for X_min.
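The dual-profile idea can be sketched as a small state machine. This is an illustration only: the `ConfigManager` class, the profile dictionaries, and the reachability probe are hypothetical, and a real design would run the probe from a watchdog timer after writing the candidate to the switch rather than as an immediate function call.

```python
class ConfigManager:
    """Keeps an active profile plus a rescue profile that is never overwritten."""

    def __init__(self, rescue, reachable):
        self.rescue = rescue
        self.active = rescue
        self.reachable = reachable  # callable(profile) -> bool, assumed probe

    def apply(self, candidate):
        """Stage the candidate; keep it only if management stays reachable."""
        previous = self.active
        self.active = candidate
        if self.reachable(candidate):
            return "committed"
        # Watchdog path: the candidate broke the management invariant.
        # Fall back to the previous profile, or to rescue if that also fails.
        self.active = previous if self.reachable(previous) else self.rescue
        return "rolled_back"

def mgmt_reachable(profile):
    """Invariant: the management VLAN must stay allowed on the uplink."""
    return profile["mgmt_vlan"] in profile["uplink_vlans"]

rescue = {"mgmt_vlan": 1, "uplink_vlans": {1}}
good = {"mgmt_vlan": 1, "uplink_vlans": {1, 10, 20}}
bad = {"mgmt_vlan": 1, "uplink_vlans": {10, 20}}  # isolates management

mgr = ConfigManager(rescue, mgmt_reachable)
first = mgr.apply(good)
second = mgr.apply(bad)
```

The property that matters is that no sequence of `apply` calls can leave the device on a profile that fails the probe, which is the "management reachability invariant" named in the fix.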
Plugging in one device slows the whole network — uncontrolled burst, unknown-unicast flooding, or no policing?
Likely cause: A noisy endpoint generates bursts (broadcast/multicast/unknown-unicast) without limits; queues saturate and create global contention symptoms.
Quick check: Mirror that port and quantify broadcast/unknown share; check storm-drop and port policing counters during the slowdown window.
Fix: Apply ingress policing/rate limiting on that port; enable storm control for relevant types; isolate via VLAN if needed to contain blast radius.
Pass criteria: With the device connected, global latency/drops stay within X_ms/X_cnt; abnormal traffic share ≤ X_% for X_min.
Captures look “normal”, but failures still happen — event is egress-queue timing and the tap misses it?
Likely cause: The event is created at egress under congestion (queueing/drop), while capture is taken at ingress; or the mirror destination drops at peak.
Quick check: Switch to egress tap (if supported) or mirror a narrower scope; correlate counter baselines to the exact failure timestamp.
Fix: Reduce mirrored sources to avoid oversubscription; use counters + event logs as the primary trigger and capture around the event window.
Pass criteria: Reproduction hit rate ≥ X_%; event-aligned counters show a clear signature; mirror port drop increment ≤ X_cnt per X_min.