
Unmanaged & Smart Switch ICs for Industrial Ethernet


Smart/unmanaged switch ICs make small industrial networks predictable: isolate traffic with VLAN, protect control flows with QoS, and contain loops/storms before they become field outages.

This page focuses on the minimum set of features and verification steps that turn “it forwards packets” into repeatable, serviceable, and production-safe behavior.

H2-1. Definition, Positioning, and Scope (Unmanaged vs Smart Switch IC)

Both unmanaged and smart Ethernet switch ICs provide multi-port Layer-2 forwarding. The practical difference is whether the design gains the containment (VLAN/limits), predictability (QoS), and observability (mirroring/counters) that prevent field incidents and speed up root-cause analysis.

In-scope

  • Unmanaged vs smart positioning and decision cues
  • VLAN basics (port-based / 802.1Q entry-level rules)
  • QoS basics (PCP mapping, queue behavior, rate limiting)
  • Port mirroring and minimum counters for debugging
  • Storm control fundamentals and loop containment basics
  • Ring “basics” as concepts only (no standards deep-dive)

Out-of-scope

  • Fully managed switch deep features (L3 routing, ACL/TCAM details)
  • TSN scheduling (Qbv/Qci/Qav), time windows, admission control
  • Time sync architecture (PTP/SyncE/White Rabbit)
  • MACsec / crypto / secure boot frameworks
  • PHY electrical/SI deep-dive (return loss, EQ tuning, compliance labs)
  • Industrial protocol internals (PROFINET/EtherCAT/CIP standard details)

Unmanaged vs Smart: What Changes in Real Deployments

Unmanaged

  • Control: minimal or none; mostly default forwarding
  • Segmentation: typically none (no VLAN isolation)
  • Debug: limited observability; no mirroring in many designs
  • Field risk: faults can spread; troubleshooting becomes guesswork

Best when: small fan-out, controlled environment, low need for isolation and diagnostics.

Smart Switch IC

  • Control: straps/EEPROM and/or runtime config (I²C/SPI/SMI-like)
  • Containment: VLAN (port-based and/or 802.1Q) and traffic limits
  • Predictability: QoS mapping to queues; basic scheduling behavior
  • Observability: port mirroring + essential counters
  • Field safety: storm control to prevent “whole-network freeze” incidents

Best when: industrial cells need isolation, limiters, and fast RCA without full managed complexity.

Managed Switch IC (Out of scope)

  • Full L2/L3: advanced filtering, ACL/TCAM, multicast controls
  • Telemetry: richer stats, remote management stacks
  • Determinism: TSN scheduling features (separate topic)

Mentioned only to define boundaries and prevent scope creep.

A 30-Second Decision Rule

  • Need isolation? Multiple device classes or tenants on one box → choose Smart (VLAN required).
  • Need controlled failure blast radius? Risk of loops or broadcast storms → choose Smart (storm control required).
  • Need faster root-cause analysis? Field failures must be proven with captures/counters → choose Smart (mirroring required).
  • Only basic fan-out? Single-purpose, controlled network with minimal diagnostics needs → Unmanaged may be sufficient.
Diagram: capability ladder to keep scope clean—smart switch ICs add isolation, limits, and observability without full managed complexity.

H2-2. Internal Architecture You Actually Need (Blocks, Tables, and Bottlenecks)

The goal is not to explain Ethernet fundamentals. The goal is to map each internal block to a real factory failure mode, the observable evidence to collect, and the configuration knobs that contain the impact.

Minimal Forwarding Path (What Matters)

Ingress → Classify (VLAN/QoS) → Lookup (MAC learning) → Queues (schedule/drop) → Egress (tag/limit) → Port.

Classify (VLAN / Priority Mapping)

  • Typical symptom: link is up, but a device “disappears” after VLAN changes.
  • Evidence to check: ingress accept policy (tag/untag), PVID, egress tag/untag rules.
  • Action: define a safe default VLAN, keep management path on a known untagged VLAN, use rollback strategy if supported.

Pass criteria: expected endpoints reachable in VLAN X; no unintended tag stripping on uplink.

MAC Learning Table (Capacity / Aging)

  • Typical symptom: “bus utilization looks low” but the network feels clogged or intermittent.
  • Evidence to check: unknown-unicast flooding, MAC move/learn events, table overflow indications.
  • Action: constrain broadcast domains with VLANs; verify aging defaults; avoid topologies that amplify flooding.

Pass criteria: unknown-unicast and flooding counters remain within X (per window) under normal load.

Queues & Scheduling (Where “QoS Made It Worse” Happens)

  • Typical symptom: enabling QoS increases jitter or timeouts during congestion.
  • Evidence to check: per-queue drop counters, queue depth/watermark indicators (if available), priority mapping sanity.
  • Action: reserve a high-priority path for control traffic; avoid starving low-priority traffic that triggers retries.

Pass criteria: control traffic latency/jitter stays within X under stress, without runaway drops in lower queues.

Rate Limiters & Storm Filters (Contain the Blast Radius)

  • Typical symptom: a single misbehaving node slows down every port on the switch.
  • Evidence to check: broadcast/multicast/unknown-unicast rates, storm-drop counters, per-port ingress policing stats.
  • Action: enable storm control on edge ports, set conservative defaults, and validate that discovery/heartbeat frames are not cut.

Pass criteria: under induced storms, only the offending port experiences drops; uplink remains stable within X.

Mirroring Tap Points (Ingress vs Egress)

  • Typical symptom: captures “look clean” even though endpoints report drops.
  • Evidence to check: whether mirroring taps ingress or egress; whether filters exclude the problematic VLAN/port.
  • Action: start with ingress mirroring to find sources; use egress mirroring to confirm queue drops or tag rules.

Pass criteria: mirrored frames match the suspected flow set; counters and captures agree within X.

Minimum Observability Set (Do Not Skip)

Even “smart” switches vary widely. For field-proof designs, ensure the debug path can separate these root causes quickly:

  • Port health: CRC errors, alignment errors, drops, overruns
  • Congestion: per-queue drops (or at least total drops) per port
  • Flooding: broadcast/multicast/unknown-unicast counters and storm-drop counters
  • Learning events: MAC learn/move/age indications (if exposed)
  • Mirroring: configurable ingress/egress taps and source selection (port/VLAN)
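The observability set above becomes actionable once counters are snapshotted and compared as per-window deltas. A minimal sketch, assuming a hypothetical counter model (real devices expose vendor-specific registers or MIBs, and the counter names below are illustrative):

```python
def counter_delta(before, after):
    """Per-window counter growth from two snapshots (dicts of name -> count)."""
    return {k: after[k] - before[k] for k in before}

def triage(delta, limits):
    """Names whose per-window growth exceeds its limit (the threshold X)."""
    return sorted(k for k, v in delta.items() if v > limits.get(k, float("inf")))

# Example snapshots taken one observation window apart (values illustrative).
before = {"crc_err": 2, "queue_drops": 10, "bcast_rx": 1000, "storm_drops": 0}
after  = {"crc_err": 2, "queue_drops": 900, "bcast_rx": 1200, "storm_drops": 0}

d = counter_delta(before, after)
assert triage(d, {"crc_err": 0, "queue_drops": 100}) == ["queue_drops"]
# Queue drops grew while CRC stayed flat: points at congestion, not link integrity.
```

Separating "which counter grew" from "which limit it broke" keeps the same triage code usable for idle, nominal, and stress baselines.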
Diagram: data-path view for engineering work—classify, learn, queue, and filter points map directly to symptoms, counters, and corrective actions.

H2-3. VLAN Basics for Smart Switches (Port-based vs 802.1Q, Ingress/Egress Rules)

VLAN is an engineering rule system: classify frames on ingress, enforce membership, then decide tag/untag behavior on egress. The goal is predictable isolation and a safe debug path—without accidental lockout.

Port-based VLAN

  • Use when: a small machine cell or I/O box needs clean separation with minimal complexity.
  • Why it’s safe: isolation is driven by port membership; fewer tag rules to misalign.
  • Watch-outs: limited flexibility when multiple VLANs must traverse an uplink between cabinets.

802.1Q Tag VLAN

  • Use when: an uplink must carry multiple VLANs (trunk) or VLAN identity must persist across cabinets.
  • What changes: ingress accept rules + PVID + egress tag/untag policy must match across links.
  • Primary risk: misaligned native/tag rules can isolate endpoints or cause cross-domain leakage.

Ingress Rules (Decision Flow)

  • Untagged frame → assign PVID (Port VLAN ID) and continue switching within that VLAN.
  • Tagged frame (VID=X) → accept only if the port allows tagged frames and VLAN X is a member; otherwise drop.
  • Drop evidence → VLAN policy drop counters (if available) and “silent reachability loss” symptoms.

Common pitfalls: (1) tagged frames arriving on an untag-only port, (2) wrong PVID for untagged endpoints, (3) overly broad VLAN membership.
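The ingress decision flow above can be sketched as a small function. This is a hypothetical port model for illustration only; real switch ICs implement the decision in silicon with vendor-specific register layouts:

```python
# Hypothetical port model for the 802.1Q ingress decision (illustration only).

def ingress_decision(port, frame_vid):
    """Return the VLAN a frame is switched in, or None if dropped.

    port: dict with 'pvid' (int), 'accept_tagged' (bool), 'members' (set of VIDs).
    frame_vid: None for an untagged frame, else the 802.1Q VID.
    """
    if frame_vid is None:
        return port["pvid"]            # untagged: classify into the port VLAN
    if port["accept_tagged"] and frame_vid in port["members"]:
        return frame_vid               # accepted on the tagged VLAN
    return None                        # policy drop (check VLAN drop counters)

access = {"pvid": 10, "accept_tagged": False, "members": {10}}
trunk  = {"pvid": 1,  "accept_tagged": True,  "members": {10, 20, 30}}

assert ingress_decision(access, None) == 10   # untagged endpoint lands in V10
assert ingress_decision(access, 20) is None   # tagged frame on untag-only port: drop
assert ingress_decision(trunk, 30) == 30      # allowed tagged VLAN on the trunk
```

The three pitfalls map directly onto the three return paths: wrong PVID corrupts the first, untag-only ports trigger the third, and overly broad membership weakens the second.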

Egress Rules (Tag/Untag + Native VLAN)

  • Access ports: typically egress untag for PLC/HMI/cameras that do not expect VLAN tags.
  • Trunk uplink: typically egress tag to carry multiple VLANs across a single link.
  • Native VLAN: one VLAN may be allowed to traverse the trunk as untagged; both ends must match the native VLAN definition.

Pass criteria: trunk shows expected VLAN tags (VID=X) and access ports remain untagged; cross-domain broadcast/ARP must not leak.

Minimal Usable Configuration Patterns

Pattern A — 3 Domains + Trunk Uplink

  • PLC/HMI/Camera ports: access, egress untag, PVID=V10/V20/V30
  • Uplink: trunk, allow V10/V20/V30, egress tag
  • Verification: trunk capture shows VID=10/20/30; domains isolated

Pattern B — Pure Port-based Isolation

  • No tagging; isolation enforced by port membership groups
  • Best for single-cabinet segmentation without VLAN continuity needs
  • Verification: broadcasts/ARP must stay within each group

Pattern C — Trunk with Native VLAN

  • One VLAN is untagged on trunk (native); others remain tagged
  • Use only when legacy untag requirements exist
  • Verification: both ends agree on native VLAN; no cross-domain leakage
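Patterns A-C share the same commit-time sanity rules: every PVID must be a member VLAN, a port may only egress-tag VLANs it belongs to, and at most one VLAN may leave a port untagged (the native VLAN). A minimal pre-commit checker, using a hypothetical plan data model:

```python
def validate_vlan_plan(ports):
    """Return a list of problems in a per-port VLAN plan (hypothetical model).

    ports: {name: {'pvid': int, 'members': set, 'egress_tag': set}}.
    'egress_tag' lists VLANs sent tagged; member VLANs not in it leave untagged.
    """
    problems = []
    for name, p in ports.items():
        if p["pvid"] not in p["members"]:
            problems.append(f"{name}: PVID {p['pvid']} is not a member VLAN")
        if not p["egress_tag"] <= p["members"]:
            problems.append(f"{name}: egress-tags VLANs it is not a member of")
        if len(p["members"] - p["egress_tag"]) > 1:
            problems.append(f"{name}: more than one untagged (native) VLAN")
    return problems

# Pattern A: three untagged access domains plus a tagged trunk uplink.
plan = {
    "plc":    {"pvid": 10, "members": {10},            "egress_tag": set()},
    "hmi":    {"pvid": 20, "members": {20},            "egress_tag": set()},
    "camera": {"pvid": 30, "members": {30},            "egress_tag": set()},
    "uplink": {"pvid": 1,  "members": {1, 10, 20, 30}, "egress_tag": {10, 20, 30}},
}
assert validate_vlan_plan(plan) == []
```

Running such a check before writing the plan into the switch is one cheap way to catch the "remote lockout by wrong PVID" class of mistakes.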
Diagram: three VLAN domains use untagged access ports for endpoints, while the uplink carries tagged VLANs on a trunk.

H2-4. QoS in the Small: Priorities, Queues, and Rate Limiting (What Matters in Factories)

Factory QoS is a minimal set of choices: classify traffic, map it into a few queues, and apply policing/shaping so a single endpoint cannot destabilize the cell. The proof comes from queue drops and counters—never from “it feels faster”.

Minimal Correct Set

  • Classification: 802.1p PCP for L2 priority (primary); DSCP remark only when an L3 gateway exists (do not expand here).
  • Mapping: assign traffic classes into a small number of queues (commonly 4).
  • Scheduling: strict vs WRR is chosen by starvation tolerance—strict can starve low queues; WRR preserves minimum service.
  • Policing: cap abnormal ingress behavior (misbehaving endpoint, storms, unknown unicast floods).
  • Shaping: protect uplinks so local bursts do not overload the core network.
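The classification and mapping steps reduce to a small PCP-to-queue table. The sketch below follows the "control highest" convention used later in Recipe 1; the actual mapping table and its register layout are vendor-specific:

```python
# 802.1p PCP -> queue map for a 4-queue switch, aligned with Recipe 1:
# control on top, then I/O, then video/log, then best-effort.
PCP_TO_QUEUE = {
    0: 0, 1: 0,   # best-effort / background  -> Q0 (lowest)
    2: 1, 3: 1,   # video / logging           -> Q1
    4: 2, 5: 2,   # I/O                       -> Q2
    6: 3, 7: 3,   # control / network mgmt    -> Q3 (highest)
}

def classify(pcp):
    """Map a frame's PCP (0-7) to one of 4 egress queues."""
    return PCP_TO_QUEUE[pcp]

assert classify(6) == 3 and classify(4) == 2 and classify(0) == 0
```

Keeping the map small and explicit makes it easy to review against the traffic plan before it is burned into EEPROM or applied at runtime.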

Strict Priority

  • Benefit: minimizes delay for the top queue when load is high.
  • Risk: lower queues can be starved; starvation can trigger retries that amplify congestion.
  • Use when: control traffic is small but must remain stable under stress.

WRR / Weighted Scheduling

  • Benefit: guarantees minimum service to lower queues.
  • Trade-off: top queue latency can increase slightly versus strict priority.
  • Use when: video/log flows must not collapse during sustained congestion.
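The strict-vs-WRR trade-off can be made concrete with a toy dequeue simulation over always-backlogged queues. This is an illustration of the starvation behavior, not a model of any specific chip's scheduler:

```python
def service_shares(scheduler, slots=8):
    """Simulate `slots` dequeues from 4 always-backlogged queues (Q3 highest).

    scheduler: 'strict', or a per-queue WRR weight list [w0, w1, w2, w3].
    Returns frames sent per queue, showing the starvation trade-off.
    """
    sent = [0, 0, 0, 0]
    if scheduler == "strict":
        sent[3] = slots                      # backlogged Q3 takes every slot
        return sent
    remaining = slots
    while remaining:
        for q in (3, 2, 1, 0):               # one WRR round, highest first
            take = min(scheduler[q], remaining)
            sent[q] += take
            remaining -= take
            if not remaining:
                break
    return sent

assert service_shares("strict") == [0, 0, 0, 8]       # lower queues starve
assert service_shares([1, 1, 2, 4]) == [1, 1, 2, 4]   # minimum service preserved
```

Under sustained congestion, strict priority gives the top queue everything, which is exactly when starved low-priority endpoints start retrying and amplifying the load; WRR weights guarantee each queue a floor.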

Practical QoS Recipes (2–3 Minimal Policies)

Recipe 1 — Control First

  • Control → highest queue (PCP=6/7)
  • I/O → next queue
  • Video/log → low queue
  • Best-effort → lowest queue

Recipe 2 — Misbehavior Containment

  • Enable storm control on edge ports (broadcast/multicast/unknown unicast).
  • Apply ingress policing to suspicious ports or traffic classes.
  • Track storm-drop and unknown-unicast counters for proof.

Recipe 3 — Uplink Protection

  • Apply egress shaping on the uplink to cap burst impact.
  • Prefer limiting on uplink rather than cutting endpoints blindly.
  • Use queue drops + uplink utilization as verification evidence.

Verification Evidence (Stress + Proof)

  • Stress: introduce congestion using video/bulk transfers while control traffic continues.
  • Observe: per-queue drops (low queues drop first), storm-drop counters, and policing counters.
  • Pass criteria: control jitter/latency stays within X under stress; top queue drops remain ≤ X; storm effects remain local to offending ports.
Diagram: 4-queue QoS shows where policing and queue drops occur under congestion, and how scheduling affects starvation behavior.

H2-5. Port Mirroring & Observability (How to Debug Without a Managed Switch)

Port mirroring is a field-proof evidence channel: it turns “suspicions” into packets and counters. The objective is to isolate the failing port/VLAN/direction quickly, then verify fixes with repeatable captures.

What Mirroring CAN Solve

  • Reproduce: drops, duplicates, bursts, broadcast floods, and topology “flapping” symptoms.
  • Localize: identify the suspect port, VLAN, and direction (ingress vs egress).
  • Prove: validate configuration changes by comparing before/after captures and counters.

What Mirroring CANNOT Solve

  • No per-queue truth: it cannot precisely expose each egress queue state (managed telemetry scope).
  • No guaranteed capture: oversubscription can drop mirrored packets on the mirror port.
  • No hardware precision: it complements, not replaces, line-rate counters and PHY diagnostics.

Mirror Setup Choices (Source, Destination, Direction)

  • Source: start with the port that shows the highest error/storm counters, or the port on the failing path.
  • Destination: use a dedicated mirror port to a laptop; avoid sharing it with production devices.
  • Ingress mirror: best when the endpoint is suspected (abnormal bursts, wrong VLAN tags, malformed traffic).
  • Egress mirror: best when configuration/policy is suspected (unexpected tagging, policing, congestion effects).
  • VLAN-based mirror (if supported): focus on one VLAN to reduce noise when multiple domains share the switch.

Field Debug Workflow (Repeatable, Evidence-Driven)

  1. Freeze the baseline: record wiring, VLAN/PVID, QoS rules, and storm thresholds before changing anything.
  2. Pick a mirror source: use the port with the strongest symptom (drops, CRC, storm counters) or the suspected segment.
  3. Pick direction: start with ingress, then confirm with egress if policy effects are suspected.
  4. Capture windows: 30–120 seconds for burst faults; longer windows for periodic flaps.
  5. Three fast checks: broadcast ratio, retransmit/duplicate patterns, and ARP/LLDP/discovery behavior.
  6. Align with counters: correlate packet evidence with CRC/drop/overrun/storm counters on the same port.
  7. Form one testable hypothesis: e.g., “wrong egress tag policy” or “unknown-unicast flooding”.
  8. Change one variable: modify a single rule/threshold, then re-capture to validate improvement.
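The "three fast checks" in step 5 can be scripted over decoded frame records. The record fields below are hypothetical; in a real workflow you would decode the mirror-port pcap with a capture tool of your choice and feed the results into checks like these:

```python
def broadcast_ratio(frames):
    """Fraction of frames with a broadcast destination MAC."""
    bcast = sum(1 for f in frames if f["dst"] == "ff:ff:ff:ff:ff:ff")
    return bcast / len(frames)

def duplicate_count(frames):
    """Frames whose (src, seq) pair repeats: a crude duplicate/retransmit signal."""
    seen, dups = set(), 0
    for f in frames:
        key = (f["src"], f["seq"])
        dups += key in seen
        seen.add(key)
    return dups

# Tiny illustrative capture: one duplicated broadcast plus normal unicast.
frames = [
    {"src": "aa", "dst": "ff:ff:ff:ff:ff:ff", "seq": 1},
    {"src": "aa", "dst": "ff:ff:ff:ff:ff:ff", "seq": 1},   # duplicate broadcast
    {"src": "bb", "dst": "cc", "seq": 7},
    {"src": "bb", "dst": "cc", "seq": 8},
]
assert broadcast_ratio(frames) == 0.5
assert duplicate_count(frames) == 1
```

An abnormal broadcast ratio points toward storms or learning instability; rising duplicate counts under congestion point toward the retry amplification discussed in the QoS section.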

Black-Box Counter Checklist (Fast Signal & Next Action)

Integrity / Link

  • CRC/FCS errors: SI/EMI/cabling/termination; verify grounding and connector path.
  • Link up/down: intermittent cable/power/EMI events; correlate with time and environment.

Congestion / Buffer

  • RX overruns / FIFO drops: ingress overload or internal contention; inspect storms and uplink shaping.
  • Egress drops (if available): queue congestion or shaping; align with QoS policy and rate limits.

Storm / Flood Evidence

  • Storm-drop counters: thresholds are triggering; validate if true storms or false positives.
  • Unknown-unicast indicators: learning instability or topology issues; confirm with capture patterns.
Diagram: mirror a suspect source to a dedicated destination port for laptop captures while production traffic continues on the uplink.

H2-6. Storm Control & Loop Containment (Broadcast/Multicast/Unknown-Unicast Guards)

Storm control is not about saving bandwidth; it is a fuse that prevents a single fault from collapsing the entire cell. Correct limits create local containment and measurable evidence (storm-drop counters) for diagnosis.

Three Storm Classes to Guard

  • Broadcast: ARP/discovery bursts can amplify rapidly under loops and miswiring.
  • Multicast: without proper control, multicast can behave like broadcast at the edge.
  • Unknown-unicast: when learning is unstable, traffic floods; this often feels like “low utilization but the network is stuck”.

Typical Loop Triggers in the Field

  • Accidental patching between two ports during maintenance.
  • Dual uplinks connected without loop protection or consistent ring configuration.
  • Ring ports treated as ordinary ports (protocol not enabled or mismatched).
  • One device bridges two connections unexpectedly (IPC/gateway wired twice).
  • Learning instability that turns unknown-unicast into persistent flooding.

Threshold Templates (Placeholders for Standardization)

PPS-based

  • Broadcast limit: X pps
  • Multicast limit: X pps
  • Unknown-unicast limit: X pps

% Line-rate

  • Broadcast cap: X% of port rate
  • Multicast cap: X% of port rate
  • Unknown-unicast cap: X% of port rate

Guardrails: start conservative to avoid false kills, then tighten using storm-drop counters and mirror captures as proof.

False-Kill Risks & How to Avoid Them

  • Threshold too low: discovery/heartbeats fail and devices appear “offline”.
  • Wrong target: limiting the uplink first can hide the root cause and increase retransmissions.
  • Mitigation: enable on edge ports first, keep protocol lifelines safe, and validate with counters + captures.

Incident Playbook (Contain → Identify → Fix)

  1. Confirm containment evidence: check storm-drop / unknown-unicast counters for growth.
  2. Find the hottest port: locate the port with the fastest counter increase.
  3. Mirror ingress: capture and confirm whether broadcast/unknown-unicast dominates.
  4. Isolate quickly: unplug the suspect cable or disable the suspect port to stop the flood.
  5. Recover carefully: restore wiring stepwise, validating counters remain stable.
  6. Harden config: keep strict guards on maintenance-prone ports to prevent recurrence.
Diagram: a storm can saturate the cell quickly; a guard limiter contains the impact and creates measurable storm-drop evidence.

H2-7. Ring Protocol Basics (MRP/HSR/PRP — What to Know Without Becoming a Spec Expert)

“Ring basics” focuses on outcomes and selection decisions: recovery time, zero-loss expectations, and the real boundary between smart-switch assists and full redundancy protocol stacks.

Target Outcomes

  • Fast switchover: a link break triggers reroute within X ms (short interruption may exist).
  • Zero-loss concept: dual-active paths or dual networks reduce disruption (often requires endpoint support).
  • Containment: prevent a loop from collapsing the entire cell (often via guards and rapid isolation).

Concept Map (Minimal)

  • MRP: ring recovery behavior and switchover time are primary concerns.
  • HSR: redundancy via duplicated delivery around a ring; receiver de-duplicates.
  • PRP: redundancy via two independent networks; receiver de-duplicates.

Smart Switch Boundary (What to Expect)

Commonly Available

  • Basic loop containment and storm guards.
  • Simple ring assist / fixed-port behaviors (vendor-dependent).
  • Evidence counters (link events, storm-drop, unknown-unicast indicators).

Often Out of Reach

  • Full MRP/HSR/PRP stacks with cross-vendor interoperability guarantees.
  • Strict zero-loss guarantees without endpoint / system-level support.
  • Complex topology determinism (typically a managed / dedicated redundancy scope).

Selection Questions (Answer Before Buying)

  1. Goal: ms switchover or “zero-loss” expectation?
  2. Endpoint support: do PLC/I/O/cameras support redundancy modes natively (or require an external adapter)?
  3. Roles: are manager/client or port roles required, and can the field maintain consistent configuration?
  4. Certification fit: does the system require protocol certification alignment with controllers/devices?
  5. Load headroom: can duplicated traffic or reroute bursts exceed uplink or buffer limits?
  6. Failure modes: is the main risk a link break, or miswiring that creates a loop?
  7. Evidence: can the switch expose link events and storm/flood counters to validate the root cause?

Minimal Acceptance Checks (Keep It Practical)

  • Break test: unplug one ring segment and confirm recovery within X ms (target-dependent).
  • Business impact: verify critical control traffic stays within X loss/timeout limits during reroute.
  • Containment: simulate mispatch/loop and confirm guards keep the cell responsive with measurable counters.
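The break test reduces to measuring the largest heartbeat gap around the event. A minimal sketch, assuming you log receive timestamps of a periodic heartbeat (names and period are illustrative):

```python
def recovery_ms(heartbeat_times_ms, period_ms):
    """Estimate switchover time as the largest heartbeat gap minus the period.

    heartbeat_times_ms: receive timestamps (ms) of a periodic heartbeat.
    """
    gaps = [b - a for a, b in zip(heartbeat_times_ms, heartbeat_times_ms[1:])]
    return max(gaps) - period_ms

# 10 ms heartbeats; a link break opens one 10 ms gap into a 62 ms gap,
# so the estimated switchover is about 52 ms.
times = [0, 10, 20, 82, 92, 102]
assert recovery_ms(times, 10) == 52
```

Comparing this estimate against the "within X ms" target, across repeated break tests, gives the repeatable evidence the acceptance checks ask for.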

Out of scope: full standard details, certification workflows, and cross-vendor interoperability rules belong on a dedicated “Ring Redundancy (MRP/HSR/PRP)” page. This section keeps only the decision-critical basics.

Diagram: line topologies isolate downstream nodes on a break; ring topologies can reroute (outcome depends on redundancy method and support).

H2-8. Hardware Co-Design Hooks (Power, Clocks, PHY-integration, and Layout Traps)

These board-level hooks prevent “protocol-looking” failures that are actually power, strap, clock, or return-path issues. The focus is on layout rules and bring-up evidence, not PHY textbooks.

Power Rails & Reset (The #1 “Half-Alive” Root Cause)

Typical Symptoms

  • Link flaps after boot, then stabilizes “randomly”.
  • Intermittent drops that correlate with load steps.
  • Configuration appears inconsistent across boots.

Design Hooks

  • Multi-rail sequencing: ensure all rails reach stable levels before releasing reset.
  • UVLO/POR clarity: use a supervisor; avoid “borderline” brownout behavior.
  • Reset timing: release reset X ms after rails and clock are confirmed stable.

Clocking Hooks (Stability vs. “Speed”)

  • Jitter sensitivity: poor clock quality can surface as instability, renegotiation, or elevated error counters.
  • Noise coupling: avoid routing clocks through switching regulator hot zones and high-current loops.
  • Bring-up evidence: align clock-related hypotheses with link events + CRC/overrun counter behavior under load.

Straps & Boot Pins (Configuration That Depends on Timing)

  • Sampling moment: strap pins are often latched at reset release; unstable rails can create unstable config.
  • Shared pins: avoid external devices driving strap pins during boot; keep pull-ups/pull-downs unambiguous.
  • Field symptom: “same board, different behavior” can be a strap + reset timing problem, not a protocol bug.

Integrated PHY & Layout Rules (Placement, Return Path, ESD)

Do (Green)

  • Keep diff-pair reference plane continuous under the entire path.
  • Place connector-entry protection with short, direct return paths.
  • Keep magnetics/connector path compact to reduce loop area.

Avoid (Red)

  • Diff-pairs crossing plane cuts/slots or return-path discontinuities.
  • Clock sources placed next to DC/DC noise or adjacent to the PHY path.
  • Long “detours” for TVS returns that create large discharge loops.

Board-Level Evidence to Log (Fast Root-Cause Signals)

  • Link events vs power events: link drops aligned with brownouts or load steps indicate power/return-path issues.
  • CRC/overrun patterns: correlate with temperature, motor switching, or DC/DC mode transitions.
  • Before/after layout change: a large delta from routing/placement changes indicates SI/return-path dominance.
Diagram: compact PHY path, short protection returns, and continuous reference planes prevent many “protocol-looking” failures.

H2-9. Configuration & Bring-up Flow (Straps/I²C/SPI, Default Safe Modes, Field-Safe Updates)

Configuration must be designed as a gated, field-safe workflow: keep reachability first, then enforce VLAN/QoS, and only commit changes after evidence proves stability.

A. Default Safe Modes (Pick One Baseline Per Product)

Mode A: All-pass (Factory)
  • Best for: first power-on and recovery.
  • Risk: storms/loops can spread.
  • Must-have guard: conservative storm limits enabled by default.
Mode B: Split VLAN (Commissioning)
  • Best for: early domain isolation (PLC / HMI / Camera).
  • Risk: wrong PVID/egress rules can cause remote loss.
  • Must-have guard: a “rescue port/VLAN” always reachable.
Mode C: Uplink trunk (Deployment)
  • Best for: multi-VLAN uplink on a single port.
  • Risk: native VLAN mismatch → upstream “silent failures”.
  • Must-have guard: mirror proof of tagging before commit.

B. Configuration Entry Points (Straps → NVM → Host Bus)

Straps (Boot-latched)

  • Use: default mode, interface enable.
  • Rule: stable at reset release.
  • Failure signature: inconsistent behavior across boots.

EEPROM / NVM (Profile)

  • Use: baseline VLAN/QoS/port roles.
  • Rule: version + CRC + fallback copy.
  • Failure signature: “works until brownout” profile corruption.

I²C / SPI Host (Runtime)

  • Use: staged policy apply, diagnostics, update control.
  • Rule: detect loss-of-control and re-enter safe profile.
  • Failure signature: policy changed but not provable by evidence.

C. Field-Safe Updates (Anti-Lockout Pattern)

Common Failure

Wrong VLAN/PVID/egress rules remove the only management path and cause remote lockout.

Anti-Lockout Hooks

  • Dual profile: Active + Rescue (known-reachable baseline).
  • Staged commit: apply → observe → commit to NVM.
  • Rollback gate: revert if no heartbeat/ack within X s.
  • Rescue access: dedicated port/VLAN not affected by user policy.
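The anti-lockout hooks boil down to a commit gated on a liveness probe. A minimal sketch; all four callables are supplied by the integration and the names are hypothetical (apply stages the policy, heartbeat probes management reachability, commit writes to NVM, rollback restores the rescue profile):

```python
import time

def staged_commit(apply_fn, heartbeat_fn, commit_fn, rollback_fn,
                  timeout_s=10.0, poll_s=0.5):
    """Apply a config; commit only if a management heartbeat returns in time."""
    apply_fn()                                  # stage the new policy (volatile)
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if heartbeat_fn():
            commit_fn()                         # proven reachable: persist to NVM
            return True
        time.sleep(poll_s)
    rollback_fn()                               # no ack within X s: revert
    return False

# Simulated run: the heartbeat succeeds, so the change is committed.
log = []
ok = staged_commit(lambda: log.append("apply"), lambda: True,
                   lambda: log.append("commit"), lambda: log.append("rollback"),
                   timeout_s=1.0, poll_s=0.1)
assert ok and log == ["apply", "commit"]
```

The crucial property is that the staged policy is never written to NVM before the heartbeat proves the management path survived it; a power cycle during the observe window therefore also recovers the old profile.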

D. Bring-up Minimal Closed Loop (Prove Then Advance)

  1. Link up: stable link for X min (no flapping).
  2. Forwarding: unicast flows pass; MAC learning stable (no persistent flooding).
  3. VLAN: PVID + tag/untag verified by mirror capture; management remains reachable.
  4. QoS: queue mapping validated under contention; control stays responsive.
  5. Mirror: captures confirm tagging/priority behavior; abnormal ratios visible.
  6. Storm: thresholds contain storms without killing discovery/heartbeats.
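The closed loop above maps directly onto an ordered list of gates, each with a pass check; advancing stops at the first failure so evidence is fixed before the next layer of policy is applied. A minimal runner sketch with hypothetical check callables:

```python
def run_gates(gates):
    """Run ordered bring-up gates; stop at the first failure.

    gates: list of (name, check_fn) where check_fn() returns True on pass.
    Returns (passed_names, failed_name_or_None).
    """
    passed = []
    for name, check in gates:
        if not check():
            return passed, name        # stop: collect evidence before advancing
        passed.append(name)
    return passed, None

# Simulated run where the VLAN gate fails (e.g., wrong PVID cut management).
gates = [
    ("link",    lambda: True),
    ("forward", lambda: True),
    ("vlan",    lambda: False),
    ("qos",     lambda: True),
]
assert run_gates(gates) == (["link", "forward"], "vlan")
```

In practice each check would wrap a counter or capture comparison against its threshold X, so a failed gate also names the evidence to inspect.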
Diagram: each bring-up step has a pass/fail gate to prevent “remote lockout by policy”.

H2-10. Verification Plan (Throughput/Latency, Loss Under Stress, Storm & Loop Tests)

Verification must be executable: define stimulus, capture observable evidence, and declare pass criteria. Coverage must include line-rate, latency behavior, burst loss, storm threshold sweep, loop containment, and corner stress.

Coverage (What Must Be Proven)

  • Throughput: line-rate definition by frame size and direction (uni/bi).
  • Latency: measure distributions under load; observe mode signature (store-and-forward vs cut-through behavior).
  • Stress loss: burst + contention + mixed priorities; verify critical class survivability.
  • Storm sweep: find the safe window between “no containment” and “false-kill”.
  • Loop/ring: mispatch loop containment and break recovery observation points.
  • Industrial corners: temperature and brownout/load-step alignment with counters/events.

Executable Checklist (Stimulus → Observable → Pass)

Throughput

  • Stimulus: min/typ/max frame sizes; uni/bi; full load.
  • Observable: drops/overruns; link events; mirror proof.
  • Pass: ≤ X drops over Y minutes at defined load.

Latency

  • Stimulus: controlled load points + contention.
  • Observable: latency distribution shift and jitter growth.
  • Pass: latency/jitter within X under defined traffic mix.

Stress Loss

  • Stimulus: bursts + mixed classes + uplink contention.
  • Observable: which class drops first; responsiveness.
  • Pass: critical class loss ≤ X while best-effort absorbs pressure.

Storm Sweep

  • Stimulus: broadcast/multicast/unknown-unicast sweep.
  • Observable: storm-drop counters vs discovery/heartbeats.
  • Pass: safe window identified and documented.

Loop / Ring

  • Stimulus: mispatch loop; break one segment (if ring).
  • Observable: containment evidence; recovery time within X ms.
  • Pass: cell remains responsive; recovery meets target.

Temp / Voltage Corners

  • Stimulus: low/high temp; brownout/load-step events.
  • Observable: link events + counters aligned to logs.
  • Pass: no persistent flapping; error counts stay within X.

Required capability: controllable load and bursts, class-aware traffic generation, capture + counters for evidence, and synchronized power/temperature/event logging for correlation.
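The storm sweep in the checklist yields, per threshold, whether the storm stayed contained and whether discovery/heartbeats survived; the documented "safe window" is where both hold. A minimal sketch over hypothetical sweep results:

```python
def safe_window(sweep):
    """Thresholds where storms are contained AND lifeline traffic survives.

    sweep: list of (threshold_pps, contained, heartbeats_ok) tuples.
    Returns (low, high) bounds of the safe window, or None if it is empty.
    """
    ok = [t for t, contained, hb in sweep if contained and hb]
    return (min(ok), max(ok)) if ok else None

sweep = [
    (100,  True,  False),   # too low: discovery/heartbeats are cut (false kill)
    (500,  True,  True),
    (2000, True,  True),
    (8000, False, True),    # too high: the storm is no longer contained
]
assert safe_window(sweep) == (500, 2000)
```

Recording the sweep tuples themselves, not just the chosen threshold, preserves the evidence needed to re-derive the window when link speeds or traffic profiles change.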

Diagram: verification items are expressed as card-matrix entries to stay mobile-safe while remaining executable.

H2-11. Engineering Checklist (Design → Bring-up → Production)

A smart/unmanaged switch design is production-ready only when each gate has a repeatable verification action, objective evidence, and explicit pass criteria (threshold X).

Gate 1: Design Gate

Power rails + reset predictability
Action: brownout/load-step injection while logging reset causes.
Evidence: reset-reason log + link-event timeline aligned to power.
Pass criteria: no “hang/no-recover” in X disturbances.
MPN examples: TI TPS3808G01 (supervisor), ADI ADM809 (reset), TI TLV75533P (LDO), TI TPS62177 (buck).
Strap latch robustness (boot mode is deterministic)
Action: repeat power cycles across slow/fast ramps and temperature corners.
Evidence: boot mode readback consistency + interface reachability check.
Pass criteria: mode mismatch rate ≤ X / 1000 boots.
MPN examples: N/A (logic-level straps); use stable pull parts such as Yageo RC0603FR-0710KL (10 kΩ).
Default safe mode + rescue path (anti-lockout)
Action: define “Rescue profile” + dedicated rescue port/VLAN and verify it survives user policy.
Evidence: reachability proof after intentionally wrong VLAN/PVID settings.
Pass criteria: remote access recovered within X seconds by rollback.
MPN examples: SPI flash: Winbond W25Q32JV; I²C EEPROM: Microchip 24LC02B or 24AA025E48 (EEPROM with EUI-48).
VLAN plan (PVID + ingress/egress rules are explicit)
Action: define per-port PVID, acceptable frame types (tag/untag), and egress tag policy.
Evidence: mirror capture showing correct tag/untag on trunk/access paths.
Pass criteria: “management reachable” invariant holds under X reconfig cycles.
MPN examples (switch IC): Microchip KSZ8795 (5-port smart), Microchip KSZ8863 (3-port), Microchip KSZ8895 (5-port).
QoS skeleton (queues + scheduling + basic shaping)
Action: map classes to queues (Control/IO/Video/BE) and define rate-limit granularity.
Evidence: under contention, control class stays responsive; lower classes absorb loss.
Pass criteria: control latency/jitter ≤ X under defined stress.
MPN examples (clocking): Epson SG-210STF (25 MHz osc), Abracon ASE-25.000MHZ (XO).
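The class-to-queue map should exist as explicit data, not folklore. A sketch of an 802.1p PCP-to-queue map for a 4-queue device; the exact mapping is a design choice (this one is an illustrative assumption, not a standard mandate):

```python
# Sketch: explicit 802.1p PCP -> queue map for a 4-queue device
# (Best-effort=0 ... Control=3). Illustrative mapping.
PCP_TO_QUEUE = {
    7: 3, 6: 3,   # network/control class -> highest queue
    5: 2, 4: 2,   # IO / latency-sensitive
    3: 1, 2: 1,   # video / streaming
    1: 0, 0: 0,   # background / best effort
}

def queue_for(pcp: int) -> int:
    if not 0 <= pcp <= 7:
        raise ValueError("PCP is a 3-bit field (0-7)")
    return PCP_TO_QUEUE[pcp]

print(queue_for(6), queue_for(0))   # 3 0
```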
Storm guard defaults (containment without false-kill)
Action: decide threshold unit (pps or % line-rate) for broadcast/multicast/unknown-unicast.
Evidence: threshold sweep shows a “safe window” where discovery/heartbeats survive.
Pass criteria: no global outage under mispatch/loop; false-kill rate ≤ X.
MPN examples (ESD protection, board-level): choose ultra-low-cap Ethernet ESD arrays per line-rate (example families: Semtech RClamp series / Littelfuse SP305x series).
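The pps-vs-percent choice matters because the conversion between them depends on frame size. A worked sketch using standard Ethernet wire overheads (preamble+SFD 8 B, interframe gap 12 B):

```python
# Sketch: convert a storm threshold from % of line rate to pps.
# A "1%" clamp means very different pps for 64 B broadcasts vs 1518 B frames,
# which is why the threshold unit must be a deliberate decision.
def percent_to_pps(percent: float, line_rate_bps: int, frame_bytes: int) -> float:
    wire_bits = (frame_bytes + 8 + 12) * 8   # frame + preamble/SFD + IFG
    max_pps = line_rate_bps / wire_bits
    return max_pps * percent / 100.0

# 1% of a 100 Mbit/s line, minimum-size (64 B) frames:
print(round(percent_to_pps(1.0, 100_000_000, 64)))   # 1488
```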
Gate 2

Bring-up Gate

Minimal closed loop (prove then advance)
Action: Link → Forward → VLAN → QoS → Mirror → Storm (in order).
Evidence: per-step capture + counter snapshot baseline.
Pass criteria: each step stable for X minutes, no link flaps.
MPN examples (switch IC): Microchip KSZ8795 / KSZ8895 (smart features), Microchip KSZ8863 (compact).
Mirror capture proof (tag/priority behavior is visible)
Action: mirror source by port/VLAN and select ingress/egress tap if supported.
Evidence: pcap shows correct 802.1Q tag/untag and 802.1p PCP mapping.
Pass criteria: no unexplained flooding; abnormal ratio detectable within X minutes.
MPN examples (field tool): passive capture is enough (mirror port plus laptop NIC); optionally a common Realtek RTL8153-based USB NIC.
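Verifying tag/untag behavior in the capture reduces to decoding the 802.1Q tag (TPID 0x8100, then the 16-bit TCI holding PCP/DEI/VID). A stdlib-only sketch; the sample frame bytes are fabricated for illustration:

```python
# Sketch: extract the 802.1Q tag (PCP/DEI/VID) from a raw Ethernet frame
# captured on the mirror port. stdlib only; sample frame is fabricated.
import struct

def parse_dot1q(frame: bytes):
    """Return (pcp, dei, vid) if the frame carries a single 802.1Q tag
    (TPID 0x8100), else None (untagged or too short)."""
    if len(frame) < 18:
        return None
    tpid, tci = struct.unpack_from("!HH", frame, 12)  # after dst+src MACs
    if tpid != 0x8100:
        return None
    pcp = tci >> 13          # 3-bit priority
    dei = (tci >> 12) & 0x1  # drop eligible indicator
    vid = tci & 0x0FFF       # 12-bit VLAN ID
    return pcp, dei, vid

frame = bytes(12) + struct.pack("!HH", 0x8100, (6 << 13) | 10) + bytes(4)
print(parse_dot1q(frame))   # (6, 0, 10)
```

Running this over a pcap's frames gives exactly the "correct tag/untag and PCP mapping" evidence the gate asks for.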
Counter baseline (the “black-box” is usable)
Action: record idle/nominal/stress baselines for CRC/drop/overrun/storm-drop/link events.
Evidence: baseline snapshots stored with timestamp + temperature + power cycle count.
Pass criteria: baseline drift ≤ X over Y hours.
MPN examples (storage): Winbond W25Q32JV (SPI flash), Microchip 24LC02B (I²C EEPROM).
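The "drift ≤ X" criterion is a per-counter delta check between the stored baseline and a fresh snapshot. A sketch with illustrative counter names and values:

```python
# Sketch: compare a fresh counter snapshot against the stored baseline and
# flag drift beyond X counts. Counter names mirror the minimum set above.
BASELINE = {"crc": 0, "drop": 3, "overrun": 0, "storm_drop": 0, "link_events": 1}

def drift_ok(snapshot: dict, baseline: dict, x_cnt: int) -> bool:
    return all(snapshot[k] - baseline[k] <= x_cnt for k in baseline)

now = {"crc": 1, "drop": 4, "overrun": 0, "storm_drop": 0, "link_events": 1}
print(drift_ok(now, BASELINE, 2))   # True
now["crc"] = 50                     # CRC ramp -> investigate that port
print(drift_ok(now, BASELINE, 2))   # False
```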
Storm threshold sweep (find safe window)
Action: sweep broadcast/multicast/unknown-unicast thresholds from loose to strict.
Evidence: storm-drop vs LLDP/ARP/heartbeat survival charted per port.
Pass criteria: safe window documented; false-kill at normal traffic ≤ X.
MPN examples (switch IC): Microchip KSZ8795 (storm controls present in smart class; device-dependent).
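The sweep itself is a loop with two observables per trial: heartbeats/discovery survive, and the injected storm is contained. A sketch where `run_trial` is a hypothetical hook into the test rig (here faked for illustration):

```python
# Sketch: sweep storm thresholds from strict to loose and record the
# "safe window". run_trial(t) is a hypothetical rig hook returning
# (heartbeat_ok, storm_contained); both must hold for t to be safe.
def find_safe_window(thresholds_pps, run_trial):
    return [t for t in thresholds_pps if all(run_trial(t))]

# Fake rig: heartbeats need headroom above ~100 pps of legitimate traffic;
# the storm is only contained below 5000 pps.
def fake_trial(t):
    return (t >= 200, t <= 5000)

print(find_safe_window([50, 200, 1000, 5000, 20000], fake_trial))
# [200, 1000, 5000]
```

The documented output of this step is the list itself: the safe window per port, not a single magic number.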
Rollback drill (intentional misconfig)
Action: intentionally apply a lockout VLAN/PVID; verify watchdog rollback to rescue profile.
Evidence: rollback event log + recovered reachability + restored mirror proof.
Pass criteria: recovery time ≤ X seconds; success rate ≥ X%.
MPN examples: TI TPS3808G01 (supervisor), Winbond W25Q32JV (dual profile storage).
Gate 3

Production Gate

Configuration lockdown (version + CRC + dual image)
Action: implement active/rescue profile with versioning and integrity checks.
Evidence: profile readback + CRC status + audit log entry.
Pass criteria: failed update always falls back to rescue in X seconds.
MPN examples: Winbond W25Q32JV (SPI flash), Microchip 24LC02B (EEPROM).
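The active/rescue selection logic is small enough to state exactly. A sketch using CRC32 from `zlib` as a stand-in for whatever checksum the actual flash layout uses; the profile layout is an illustrative assumption:

```python
# Sketch: boot-time profile selection with dual-image integrity checks.
# zlib.crc32 stands in for the device's real checksum; layout is illustrative.
import zlib

def select_profile(active: bytes, active_crc: int,
                   rescue: bytes, rescue_crc: int) -> str:
    if zlib.crc32(active) == active_crc:
        return "active"
    if zlib.crc32(rescue) == rescue_crc:
        return "rescue"      # failed/corrupt update falls back here
    return "halt-safe"       # last resort: safe defaults, rescue port only

good = b"vlan+qos profile v7"
rescue = b"rescue profile v1"
print(select_profile(good, zlib.crc32(good), rescue, zlib.crc32(rescue)))           # active
print(select_profile(b"corrupted...", zlib.crc32(good), rescue, zlib.crc32(rescue)))  # rescue
```

The "halt-safe" branch is what keeps a double corruption from becoming a field lockout.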
Black-box logging fields (service-ready evidence)
Action: define a minimal field log schema (link flap, CRC/drop, storm-drop, reboot reason, temp/power).
Evidence: sample logs from burn-in + induced fault runs.
Pass criteria: event-to-root-cause coverage ≥ X%.
MPN examples (optional time base): Microchip MCP7940N (I²C RTC) for timestamping logs.
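The minimal field-log schema can be pinned down as a record type serialized one-JSON-line-per-event, which keeps it trivially scrapeable in the field. Field names below are illustrative assumptions matching the schema above:

```python
# Sketch: a minimal black-box log record (link flap, CRC/drop, storm-drop,
# reboot reason, temp/power), serialized as one JSON line per event.
import json
import time
from dataclasses import dataclass, asdict

@dataclass
class FieldLogEvent:
    ts: float            # epoch seconds (RTC-backed if available)
    event: str           # "link_flap" | "crc_ramp" | "storm_drop" | "reboot"
    port: str
    crc: int
    drop: int
    storm_drop: int
    reboot_reason: str   # "POR" | "BOR" | "WDT" | "SW" | "NONE"
    temp_c: float
    vrail_mv: int

e = FieldLogEvent(time.time(), "link_flap", "p3", 12, 0, 0, "NONE", 61.5, 3280)
line = json.dumps(asdict(e))
print(json.loads(line)["event"])   # link_flap
```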
Serviceability (minimum on-site workflow)
Action: document mirror-port capture, counter readout, and factory-reset sequence.
Evidence: SOP + successful on-site drill records.
Pass criteria: on-site isolation time ≤ X minutes for common failures.
MPN examples: Microchip 24AA025E48 (EUI-48 for per-unit identity), Winbond W25Q32JV.
Corner robustness (temp + brownout alignment)
Action: run thermal/brownout tests while correlating events and counters.
Evidence: aligned timeline: temperature + voltage + link events + drops.
Pass criteria: no persistent flapping; error budget within X.
MPN examples (sensing, optional): TI TMP117 (temp sensor) for correlation logging.
Three-Gate Checklist Board: Design (power/reset, straps, port plan, VLAN plan, QoS map), Bring-up (closed loop, mirror proof, counters, QoS stress, rollback), Production (lockdown, logs, service SOP, corners, recovery).
Diagram: each gate is a kanban-style board with icon tiles to keep the checklist actionable and audit-friendly.

H2-12. Applications & IC Selection (Combined)

Applications are expressed as “capability vectors”. IC selection is expressed as a checklist + decision tree, not a brand catalog.

H3-12.1 Applications (Where Unmanaged/Smart Switch ICs Win)

Machine cell (star/line)
Topology / ports: 5–8 ports, mixed PLC/HMI/camera nodes.
Must-have: VLAN isolation + storm guard; optional mirror for field debug.
Common failure: broadcast storm collapses the whole cell.
Bring-up proof: mirror shows tag/untag correctness; storm window documented.
MPN examples: Microchip KSZ8795, Microchip KSZ8895.
Remote I/O fan-out
Topology / ports: 3–6 ports, deterministic control priority.
Must-have: QoS queues + basic policing; mirror for on-site capture.
Common failure: burst traffic causes control jitter.
Bring-up proof: contention test keeps control class within threshold X.
MPN examples: Microchip KSZ8863 (compact), Microchip KSZ8795.
Small gateway / edge box
Topology / ports: 2–5 ports + uplink.
Must-have: 802.1Q trunk uplink + VLAN/PVID discipline; optional QoS for control.
Common failure: native VLAN mismatch produces “silent upstream failure”.
Bring-up proof: mirror capture shows VLAN tags on trunk before committing.
MPN examples: Microchip KSZ8795, Microchip KSZ8895.
Lightweight ring / loop containment
Topology / ports: 2 ring ports + local fan-out.
Must-have: loop containment + storm guard; “ring assist” if supported.
Common failure: mispatch loop floods the network.
Bring-up proof: mispatch drill contains broadcast; recovery time ≤ X.
MPN examples: Microchip KSZ8795 (smart-class guards), Microchip KSZ8895.

H3-12.2 IC Selection Logic (A Checklist, Not a Shopping List)

Core requirement checklist
  • Ports/speed mix: FE/GE/2.5G as needed (avoid “overkill” heat).
  • VLAN: PVID + accept rules + egress tag policy + table/limit awareness.
  • QoS: queue count + scheduling (strict/WRR) + policing/shaping granularity.
  • Mirror: per-port/per-VLAN, ingress/egress tap position if available.
  • Storm: bcast/mcast/unknown-unicast types + threshold units + default behavior.
  • Config/rollback: straps + EEPROM/flash + watchdog rollback + rescue port/VLAN.
  • Industrial target: temperature + EMC evidence plan (metric-only, no spec tutorial).
  • Power/thermal: package dissipation + airflow/heatsinking plan.
MPN anchor examples (common)
Smart-class switch IC (VLAN/QoS/mirror/storm): Microchip KSZ8795, Microchip KSZ8895.
Compact switch IC (small fan-out): Microchip KSZ8863.
Identity / MAC storage: Microchip 24AA025E48 (EUI-48 EEPROM).
Config storage: Winbond W25Q32JV (SPI flash), Microchip 24LC02B (EEPROM).
Reset supervisor: TI TPS3808G01, ADI ADM809.
Clock source: Epson SG-210STF (25 MHz), Abracon ASE-25.000MHZ.
Power rails: TI TPS62177 (buck), TI TLV75533P (LDO).
Optional RTC for logs: Microchip MCP7940N.
Decision rule (engineering)
If any of VLAN isolation, QoS protection, mirror observability, or storm containment is required for field safety, select a smart-class switch IC. Otherwise, a basic unmanaged mode can be acceptable for the simplest cells.
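The decision rule above can be stated as a function, which is useful when the requirement flags come out of a project checklist rather than a conversation. Flag names are illustrative:

```python
# Sketch: the smart-vs-unmanaged decision rule as code. Any field-safety-
# relevant capability (VLAN/QoS/mirror/storm) forces the smart class.
def select_class(need_vlan: bool, need_qos: bool,
                 need_mirror: bool, need_storm: bool) -> str:
    if any((need_vlan, need_qos, need_mirror, need_storm)):
        return "smart"
    return "unmanaged"

print(select_class(False, False, False, False))  # unmanaged
print(select_class(True, False, False, False))   # smart
```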
Selection Decision Tree (minimal text): Any field-safety need? If VLAN, QoS, mirror, or storm containment is required, the answer is a SMART switch IC; if all are NO, UNMANAGED is OK.
Diagram: the decision tree stays inside the page boundary—only VLAN/QoS/Mirror/Storm and field-safety observability.


H2-13. FAQs (Troubleshooting Only: VLAN / QoS / Mirror / Storm / Ring Basics)

These FAQs close long-tail field issues without expanding the main content. Every answer is exactly four lines: Likely cause / Quick check / Fix / Pass criteria (X).

Data placeholders (consistent units)
  • X_min: minutes of stable operation window
  • X_s: seconds to recover (rollback / reconnect)
  • X_cnt: counter increment threshold (CRC/drop/storm-drop) within a window
  • X_%: ratio threshold (% of frames or % line-rate), used for broadcast/unknown-unicast share
  • X_pps: packets-per-second threshold for storm control
  • X_ms: latency/jitter bound for control-class traffic under contention
VLAN configured and a device “disappeared” — wrong PVID or egress untag rule?

Likely cause: Port PVID mismatch, or egress changed from untag→tag (or tag→untag) and the endpoint cannot parse it.
Quick check: Mirror the port and confirm whether ARP/LLDP frames are tagged or untagged; read back PVID + egress rule for that port.
Fix: Restore a “management reachability invariant” (rescue port/VLAN), then standardize access ports as untag + correct PVID; keep trunk ports tagged only.
Pass criteria: Endpoint reachable for X_min; port CRC/drop increment ≤ X_cnt per X_min; no unexpected VLAN tag changes in capture.

Utilization looks low, but the cell feels “stuck” — unknown-unicast flood or queue starvation?

Likely cause: Unknown-unicast flooding (MAC table miss/aging) or strict-priority scheduling starving lower classes, triggering retries/backpressure upstream.
Quick check: Read unknown-unicast counters and broadcast share; mirror suspect ports and look for repeated ARP/retries/timeouts spikes.
Fix: Enable unknown-unicast storm guard or policing; switch from strict-only to WRR (or add minimum share) so non-control traffic cannot be permanently starved.
Pass criteria: Unknown-unicast share ≤ X_% (or ≤ X_pps) for X_min; control latency/jitter ≤ X_ms; drops do not ramp (drop increment ≤ X_cnt per X_min).
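The quick check above is two counter deltas and a ratio. A sketch with fabricated snapshot deltas (counter names and the 60 s window are illustrative):

```python
# Sketch: compute unknown-unicast share and rate from two counter snapshots
# over a measurement window, then compare against X_% and X_pps.
def traffic_share(delta_class: int, delta_total: int) -> float:
    """Class share in % of total frames over the window."""
    return 0.0 if delta_total == 0 else 100.0 * delta_class / delta_total

d_unknown, d_total, window_s = 18_000, 400_000, 60
share = traffic_share(d_unknown, d_total)   # 4.5 (%)
pps = d_unknown / window_s                  # 300.0 (pps)
print(share <= 5.0 and pps <= 500)          # True under X_% = 5, X_pps = 500
```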

Mirroring is enabled, but the “bad frames” never show up — wrong mirror tap point (ingress vs egress)?

Likely cause: Mirror source is wrong (port vs VLAN), or the tap point is before/after the event (ingress vs egress), or the mirror port is oversubscribed and drops.
Quick check: Inject a known test flow and verify it appears; compare ingress-tap vs egress-tap (if supported) and check mirror port drop counters.
Fix: Move the tap closer to the suspected failure stage; narrow the mirrored scope (single port/VLAN) and ensure the mirror destination can sustain the rate.
Pass criteria: Target-frame hit rate ≥ X_%; mirror-port drop increment ≤ X_cnt per X_min; captures contain consistent VLAN/PCP evidence.

QoS enabled, but latency jitter got worse — strict priority starving lower classes and amplifying retransmits?

Likely cause: Strict priority starves a queue until higher-layer retries explode; or PCP→queue mapping is inverted, pushing control into a congested class.
Quick check: Create controlled contention and measure control-frame delay; mirror PCP/DSCP markings and verify mapping aligns with the queue plan.
Fix: Use WRR (or strict + minimum share) to prevent starvation; apply ingress policing on “noisy” ports to cap burst damage.
Pass criteria: Control traffic latency/jitter ≤ X_ms under defined stress; no sustained starvation (drop increment on low class ≤ X_cnt per X_min).

Storm control “kills” discovery/heartbeats — threshold unit or measurement window mismatch?

Likely cause: Threshold is set in the wrong unit (pps vs %) or the averaging window is too short/too strict, causing legitimate multicast/discovery to be dropped.
Quick check: Measure baseline LLDP/ARP/heartbeat pps in normal operation; correlate storm-drop counters with the moment devices “vanish”.
Fix: Sweep thresholds to find a safe window; separate limits for broadcast/multicast/unknown-unicast and avoid a single “global” clamp.
Pass criteria: Heartbeats survive continuously for X_min; storm-drop is 0 (or ≤ X_cnt) during normal traffic; outage events do not recur within X_min.

A ring/loop connection causes a broadcast meltdown — loop not contained, dual uplinks, or threshold too high?

Likely cause: Physical loop (mispatch / dual uplinks) creates a broadcast amplification loop; storm guard is disabled or set so high it never triggers.
Quick check: Break one link and see if the network immediately recovers; read broadcast share and storm-drop counters during the event.
Fix: Enable loop containment (within smart-switch capability) and set practical storm limits; enforce “single-uplink rule” where applicable.
Pass criteria: Mispatch drill does not collapse the cell; broadcast share ≤ X_% for X_min; recovery time ≤ X_s.

Only one port shows high CRC/drop — board-level return path or local SI/power noise?

Likely cause: Local signal integrity/return-path issue, connector/cable defect, or power/ground noise coupling into that port’s PHY/MAC interface.
Quick check: Swap cables/ports to see if the symptom follows the port or the device; correlate CRC ramps with load/temperature/time-of-event logs.
Fix: Isolate the port via VLAN to reduce blast radius; then address board-level causes (decoupling, ground return continuity, routing/spacing) and cable quality.
Pass criteria: CRC increment ≤ X_cnt per X_min under the same load; link flaps = 0 for X_min.

Link drops only at high temperature — rail derating, reset margin, or clock margin collapse?

Likely cause: Power rail derating triggers brownout/reset, oscillator margin collapses, or temperature shifts timing/noise margins until errors accumulate into a flap.
Quick check: Align temperature vs link events vs CRC/drop; verify reset-reason and undervoltage indicators when the drop occurs.
Fix: Increase power/reset margins and improve thermal path; use a more stable clock source or reduce noise coupling; apply a protective lower-rate mode if required.
Pass criteria: High-temp run stable for X_min; link flaps = 0; CRC/drop increment ≤ X_cnt per X_min.

After shipping, remote configuration is impossible — missing lockout prevention and rollback?

Likely cause: No rescue profile/port, configuration applied without staged commit, or management VLAN can be accidentally isolated by policy.
Quick check: Read back active profile version/CRC; confirm whether a rescue path (port/VLAN) remains reachable after a forced misconfig test.
Fix: Implement dual-profile (active/rescue) + watchdog rollback; enforce “management reachability invariant” that policy cannot break.
Pass criteria: Any bad config auto-recovers within X_s; success rate ≥ X_% across repeated drills; remote reachability stable for X_min.

Plugging in one device slows the whole network — uncontrolled burst, unknown-unicast flooding, or no policing?

Likely cause: A noisy endpoint generates bursts (broadcast/multicast/unknown-unicast) without limits; queues saturate and create global contention symptoms.
Quick check: Mirror that port and quantify broadcast/unknown share; check storm-drop and port policing counters during the slowdown window.
Fix: Apply ingress policing/rate limiting on that port; enable storm control for relevant types; isolate via VLAN if needed to contain blast radius.
Pass criteria: With the device connected, global latency/drops stay within X_ms/X_cnt; abnormal traffic share ≤ X_% for X_min.

Captures look “normal”, but failures still happen — event is egress-queue timing and the tap misses it?

Likely cause: The event is created at egress under congestion (queueing/drop), while capture is taken at ingress; or the mirror destination drops at peak.
Quick check: Switch to egress tap (if supported) or mirror a narrower scope; correlate counter baselines to the exact failure timestamp.
Fix: Reduce mirrored sources to avoid oversubscription; use counters + event logs as the primary trigger and capture around the event window.
Pass criteria: Reproduction hit rate ≥ X_%; event-aligned counters show a clear signature; mirror port drop increment ≤ X_cnt per X_min.