Ring Redundancy (MRP/HSR/PRP): ms Failover & Zero-Loss Guide
Ring redundancy is an engineering promise with measurable acceptance: either ms-class switchover (MRP) or true zero-loss via frame duplication (HSR/PRP). This page turns that promise into deployable topology patterns, budgeted timing and overhead, and testable pass criteria (with X/Y placeholders) for power and rail networks.
H2-1. Definition & Boundary: What “Ring Redundancy” Really Means
Intent: Lock an acceptance-ready definition of “redundancy” before discussing protocols—fast failover (ms) versus zero-loss continuity.
Two acceptance targets (do not mix)
- Fast switchover (ms): after a single fault, traffic recovers within X ms. Short interruption may occur, but recovery time is contractually bounded.
- Zero-loss continuity: under a single fault, the receiver continues to see no missing frames, achieved by parallel delivery (HSR/PRP) plus correct duplicate discard.
Key point: “Link up” is not “service restored”. Acceptance must be defined at a measurement point (application, probe, or counters), with an explicit start/stop rule and a pass threshold.
Redundancy domain (fault domain) boundary
A redundancy domain is the smallest network region expected to survive a single fault without violating the chosen acceptance target. The domain definition must state what is protected and what is not.
Protected (single-fault guarantees apply):
- Single link break inside the ring
- Single port down event (including transient drop)
- Single node/switch power loss or reboot (if explicitly required)
Not protected (out of scope for the domain):
- Two simultaneous independent faults (e.g., dual link breaks)
- Common-cause failures (shared power, shared harness, shared upstream switch)
- PHY signal integrity / EMC / ESD root causes (handled in protection/PHY pages)
Deliverable: Acceptance target definition table
| Target | Measurement definition | Pass criteria (placeholders) | Typical fit |
|---|---|---|---|
| Fast switchover (MRP) | Measure at a defined point (application heartbeat, probe, or counters). Switchover time = fault detected → service restored (first valid packet after reconvergence). | Switchover ≤ X ms (P95), across Y injections. Post-recovery stability ≥ X min (no flap / no oscillation). | Control/automation networks that can tolerate a short gap, but need bounded recovery and stable convergence. |
| Zero-loss (HSR/PRP) | Prove continuity with sequence-aware traffic (frame ID/sequence) at the receiver. Acceptance requires no missing sequence during the fault, plus correct duplicate discard. | Missing sequence = 0 during Y injections. Duplicate rate ≤ X/1k. Stable ≥ X min. | Mission-critical traffic where even a short gap is unacceptable, often in power/rail protection and high-availability segments. |
Note: Threshold placeholders (X, Y) are intentionally open; values must be filled by the system-level business loop and certification constraints.
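The P95 acceptance rule from the table can be checked mechanically once injection results exist. A minimal sketch, assuming a nearest-rank percentile and hypothetical placeholder values X = 30 ms and Y = 20 injections (all names and numbers here are illustrative, not from a standard):

```python
# Sketch: evaluate "switchover <= X ms (P95) across Y injections".
import math

def p95(samples):
    """Nearest-rank 95th percentile: the ceil(0.95 * n)-th smallest sample."""
    ranked = sorted(samples)
    rank = math.ceil(0.95 * len(ranked))
    return ranked[rank - 1]

def accept_switchover(samples_ms, limit_ms, min_injections):
    """Pass only if enough injections were run AND the P95 is within the limit."""
    if len(samples_ms) < min_injections:
        return False  # not enough evidence to claim acceptance
    return p95(samples_ms) <= limit_ms

# Example: 20 injections, hypothetical X = 30 ms, Y = 20
samples = [12, 14, 11, 13, 18, 22, 15, 12, 16, 19,
           14, 13, 17, 21, 12, 15, 16, 14, 28, 13]
print(p95(samples), accept_switchover(samples, 30, 20))  # 22 True
```

The same shape works for any percentile the contract names; only the rank formula changes.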
H2-2. Requirements & Failure Model: What Must Survive (and What Need Not)
Intent: Convert “redundancy” into testable engineering clauses—fault injection, observable symptoms, required behavior, and stability after recovery.
“Recovered” and “stable” are different acceptance gates
- Recovery gate: traffic meets the chosen target (switchover ≤ X ms or missing sequence = 0).
- Stability gate: after recovery, the network does not oscillate (no repeated reconfigure / no port flap storm) for ≥ X minutes.
A system that “recovers quickly but keeps flapping” typically fails real plants: symptoms appear random, diagnostics become noisy, and root-cause cycles explode.
Requirement template (repeatable)
- Define the fault precisely (what counts as “flap”, what counts as “down”).
- Specify injection method (unplug, relay break, port disable, power cut).
- State observable symptom and pick a primary metric (drop, missing sequence, duplicate rate, latency spike).
- Declare required behavior with thresholds (X ms / X frames / X per 1k / X minutes stable).
Deliverable: Failure model table (fault → symptom → required behavior)
| Fault | Observable symptom | Required behavior (placeholders) | Primary metric |
|---|---|---|---|
| Single link break | Drop burst (MRP), or path continuity maintained (HSR/PRP). Possible latency spike during convergence. | MRP: switchover ≤ X ms. HSR/PRP: missing sequence = 0. Stability ≥ X min. | Switchover time / missing sequence |
| Port flap (transient up/down) | Repeated reconvergence; intermittent drops; oscillating ring state; bursty duplicates in zero-loss domains. | Define flap as ≥ N toggles in T seconds. System must not enter continuous oscillation; stability gate must pass. | Flap count / oscillation time |
| Switch/node power-off | Topology changes; potential traffic interruption (MRP) or continuity via alternate path (HSR/PRP) if domain is correctly defined. | If required: service must meet the selected target under a single node loss, then remain stable ≥ X min. | Missing sequence / recovery time |
| Node reboot / rejoin | Brief duplicate spikes; forwarding table churn; possible MAC learning instability if boundaries are wrong. | Rejoin must not cause storm; duplicate rate ≤ X/1k; no sustained oscillation during the stability window. | Duplicate rate / storm counters |
| Miswiring (extra branch / loop expansion) | Broadcast storm; duplicates explode; ring state never stabilizes; “works in lab, fails in plant”. | Domain boundary must prevent loop expansion; storm control must trip within X; network must return to stable state after correction. | Storm counters / stability time |
Explicitly out of scope (unless the contract says otherwise):
- Dual simultaneous independent faults
- Common-cause faults that bypass redundancy (shared power, shared upstream choke point)
- Out-of-domain configuration errors that expand loops beyond the designed boundary
Pass criteria summary (placeholders):
- MRP switchover ≤ X ms (P95) over Y injections
- HSR/PRP missing sequence = 0 during Y injections
- Duplicate rate ≤ X/1k, storm counters remain bounded
- Stability window ≥ X min (no oscillation / no reconfigure storm)
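The flap definition above ("≥ N toggles in T seconds") is easy to state but often left unimplemented, so teams argue about what counted as a flap. A minimal sliding-window classifier, with hypothetical thresholds N = 5 and T = 10 s:

```python
# Sketch: classify a port as "flapping" from link toggle timestamps,
# using the definition ">= N toggles within any T-second window".
from collections import deque

def is_flapping(toggle_times_s, n_toggles, window_s):
    """Slide a window over sorted toggle timestamps; True if any window holds >= N."""
    window = deque()
    for t in sorted(toggle_times_s):
        window.append(t)
        # Drop toggles older than the window relative to the newest one.
        while window[0] < t - window_s:
            window.popleft()
        if len(window) >= n_toggles:
            return True
    return False

# Hypothetical thresholds: N = 5 toggles within T = 10 s
print(is_flapping([0.0, 1.2, 2.5, 3.1, 4.0], 5, 10.0))      # True: 5 toggles in 4 s
print(is_flapping([0.0, 30.0, 60.0, 90.0, 120.0], 5, 10.0))  # False: toggles spread out
```

Feeding this from link-state event logs gives the stability gate an objective input instead of an operator impression.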
H2-3. Standards & Terminology Map (IEC 62439-2/-3 Essentials)
Intent: Align terms and roles to a single engineering vocabulary so requirements, configuration, and verification stay consistent across teams.
Where the terms come from (scope in one glance)
- IEC 62439-2 (MRP): ring management roles, blocked port behavior, and health checks for reconvergence.
- IEC 62439-3 (HSR/PRP): frame replication, sequence-based duplicate discard, and boundary devices (RedBox / LRE).
- This section focuses on engineering meaning (what a role does, what a frame implies), not certification profiles.
Terminology anchors (use these consistently)
Ring defines the redundancy loop; LAN A/B defines dual independent networks (PRP). Redundancy domain is the boundary where single-fault guarantees are claimed.
- MRM / MRC: manage ring state and the blocked port (loop prevention + reconvergence).
- DANH / DANP: end nodes that participate in HSR/PRP and perform duplicate discard using sequence context.
- RedBox / LRE: boundary device/function to connect redundancy domains to non-redundant LANs while keeping duplication rules correct.
- Health check / supervision: affects fault detection delay → part of the switchover budget.
- Sequence / LAN-ID: defines how duplicates are recognized → drives duplicate rate and “missing sequence = 0” proof.
Deliverable: Terminology crosswalk (one dictionary)
| Term | Engineering meaning | MRP / HSR / PRP | Why it matters (metrics / risks) |
|---|---|---|---|
| MRM / MRC | Ring manager / client roles that coordinate ring state and reconvergence behavior. | MRP | Drives reconvergence path and stability; impacts switchover time and post-fault oscillation risk. |
| Blocked port | A designated blocked edge to prevent loops in normal operation; becomes a recovery lever under fault. | MRP | Determines normal traffic direction and recovery path; wrong boundary causes persistent reconfigure. |
| Test frame / health check | Supervision mechanism that detects faults and triggers ring state change. | MRP (and supervision concepts in 62439-3) | Affects detection delay and false-trigger risk; directly touches the switchover budget. |
| DANH / DANP | End nodes that participate in redundancy; receive duplicates and keep one valid copy. | HSR / PRP | Enables the missing sequence = 0 proof; poor discard logic leads to storms or false loss. |
| RedBox | Boundary device connecting redundancy domains (HSR/PRP) to non-redundant Ethernet segments. | HSR / PRP | Defines domain boundary; misplacement causes loop expansion, broadcast storms, and MAC instability. |
| LRE | Link redundancy entity/function that handles replication context and duplicate discard behavior. | HSR / PRP | Controls duplicate rate and “zero-loss” proof quality; wrong window triggers discard failures. |
| Sequence / LAN-ID | Metadata used to correlate duplicates and identify which network copy was received first. | HSR / PRP | The basis for proving missing sequence = 0 and keeping duplicates bounded (≤ X/1k). |
Practice: keep term usage identical in requirements, configuration checklists, logs, and verification reports.
H2-4. How Each Protocol Works (MRP vs HSR vs PRP) — Mechanisms, Not Marketing
Intent: Explain data-plane actions (block, copy, discard) and map each mechanism to acceptance targets (ms switchover vs zero-loss).
Mechanism in three moves (block / copy / discard)
- MRP (block): normal operation prevents loops with a blocked port. Under a fault, the ring state changes and the blocked edge is used to restore a single forwarding path, so recovery includes detection and reconvergence, producing ms-class switchover.
- HSR (copy + discard, ring): each frame is replicated in two directions around the ring. If at least one copy survives the fault, continuity can be proven as missing sequence = 0, provided duplicate discard works and storms remain bounded.
- PRP (copy + discard, dual LANs): frames are transmitted over two independent networks. Under a single fault, one network remains available and the receiver keeps the first valid copy. The engineering risk is usually shared choke points and boundary mistakes (RedBox placement).
- MRP → prove switchover ≤ X ms + post-fault stability.
- HSR/PRP → prove missing sequence = 0 + bounded duplicates.
Deliverable: Mechanism comparison (MRP vs HSR vs PRP)
| Protocol | Copy point | Discard / block point | Overhead | Typical recovery behavior | Acceptance fit |
|---|---|---|---|---|---|
| MRP | No per-frame replication; normal traffic follows the unblocked path. | A designated blocked port prevents loops; under fault it becomes the recovery edge. | Low steady-state overhead; cost appears as reconvergence time and transient interruptions. | ms switchover after detection + reconvergence; stability must be verified. | Fast switchover contract |
| HSR | Per-frame replication in both directions around the ring. | Receiver/participant performs duplicate discard using sequence context. | Higher ring load (replication); requires bounded duplicates and storm control discipline. | Zero-loss under single fault if at least one copy arrives within the validity window. | Zero-loss continuity contract |
| PRP | Replication across LAN A and LAN B (dual networks). | Receiver discards duplicates; boundaries are enforced via RedBox / LRE where needed. | Two networks (or boundary devices); common-cause failures can erase redundancy if not architected out. | Zero-loss under single fault if LANs are truly independent and discard is correct. | Zero-loss continuity contract |
Design implication: “block” creates reconvergence time; “copy+discard” creates continuity but adds load and boundary discipline requirements.
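The "copy + discard" half of the table can be made concrete with a toy receiver. This is a deliberately simplified sketch of "first valid copy wins": a real IEC 62439-3 entity maintains per-source drop tables with sequence wrap-around and aging, which a plain set does not model:

```python
# Sketch: sequence-based duplicate discard for a single source (PRP/HSR style).
# Simplification: one traffic source, no sequence wrap-around, no entry aging.
def deliver_first_copies(received):
    """received: list of (seq, lan_id). Deliver each seq once; count discards."""
    seen = set()
    delivered, duplicates = [], 0
    for seq, lan_id in received:
        if seq in seen:
            duplicates += 1            # second copy of the same frame: discard
        else:
            seen.add(seq)
            delivered.append((seq, lan_id))
    return delivered, duplicates

# LAN A loses seq 2 during the fault; LAN B still delivers it,
# so the receiver sees no sequence gap.
rx = [(1, "A"), (1, "B"), (2, "B"), (3, "A"), (3, "B")]
out, dups = deliver_first_copies(rx)
print([s for s, _ in out], dups)   # [1, 2, 3] 2
```

The delivered list with no gaps is exactly the "missing sequence = 0" evidence the zero-loss contract asks for; the duplicate count feeds the "≤ X/1k" bound.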
H2-5. Architecture & Topology Patterns You Can Actually Deploy
Intent: Convert protocol choices into deployable topologies with explicit boundaries, fault domains, and observable acceptance points.
Topology building blocks (use consistent vocabulary)
- Redundancy domain: the boundary where single-fault guarantees are claimed (ring domain or LAN A/B domain).
- Boundary device: a controlled edge that prevents the domain from expanding (e.g., RedBox, interconnect boundary).
- Fault domain: the set of links/switches/nodes assumed to fail one at a time under the acceptance contract.
- Acceptance hooks: where to observe switchover time (MRP) or missing sequence / duplicates (HSR/PRP).
Deliverable: Topology selection checklist (deploy-first)
- Topology form: single ring / dual ring / interconnected rings / PRP dual LAN / HSR + RedBox.
- Acceptance target: switchover ≤ X ms, or missing sequence = 0 (zero-loss).
- Endpoint constraints: endpoints can participate (DAN) or must remain standard Ethernet.
- Boundary controllability: RedBox / interconnect boundary placement is controllable and testable.
- Independence: for PRP, LAN A/B have no shared choke points (power, uplinks, media, conduit).
- Scale headroom: node count, ring length, and traffic headroom can tolerate replication or reconvergence.
Failure-cost reminder: choosing a topology without a clear boundary can turn a local fault into a domain-wide flap or storm.
Layering pattern (power/rail style) as a topology meaning
A practical deployment often uses layering to keep fault domains contained: an aggregation layer (station control / edge compute), a ring layer (bays / segments), and an endpoint layer (IED / remote I/O). The goal is not protocol complexity, but fault containment and clear acceptance points per layer.
H2-6. Selection Logic: When to Choose MRP, HSR, or PRP
Intent: Convert requirements into an executable Yes/No decision tree and a red-line table that prevents invalid choices.
Executable selection logic (priority order)
- Zero-loss required? If missing sequence must remain 0 under a single fault, the choice must be HSR or PRP.
- Dual networks allowed? If LAN A/B independence is possible, PRP is usually the cleanest zero-loss architecture.
- Endpoints changeable? If endpoints cannot participate, a boundary (RedBox) becomes mandatory for domain control.
- Replication headroom available? If bandwidth is tight, replication-based choices require traffic shaping or smaller domains.
- Brief interruption acceptable? If short interruption is acceptable, MRP can satisfy an ms switchover contract with lower steady overhead.
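The priority-ordered questions above can be frozen into one function so the decision is reproducible across projects. A sketch with illustrative parameter names (the inputs are the yes/no answers captured during requirements review):

```python
# Sketch: the selection logic above as an executable decision tree.
# Order matters: zero-loss dominates, then independence, then boundary, then headroom.
def select_protocol(zero_loss_required, dual_lans_possible,
                    endpoints_changeable, replication_headroom):
    if zero_loss_required:
        if not replication_headroom:
            # Red-line: copy-based designs without headroom must be re-scoped first.
            return "reduce domain or shape traffic before choosing HSR/PRP"
        base = "PRP" if dual_lans_possible else "HSR"
        return base if endpoints_changeable else base + " + RedBox boundary"
    return "MRP (ms switchover contract)"

print(select_protocol(True, True, True, True))    # PRP
print(select_protocol(True, False, False, True))  # HSR + RedBox boundary
print(select_protocol(False, False, True, True))  # MRP (ms switchover contract)
```

Keeping the tree in code (or a checklist derived from it) prevents the common failure where each stakeholder remembers a different priority order.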
Deliverable: Red-line conditions (forces a choice)
| Red-line condition | Why it forces the choice | Forced choice | Verification hook |
|---|---|---|---|
| Missing sequence must remain 0 under a single link fault | A reconverging topology can introduce a brief gap; continuity requires duplicated delivery. | HSR or PRP | Missing-seq counter = 0; duplicates bounded ≤ X/1k; discard correctness. |
| Dual independent networks are available (LAN A/B) with no shared choke points | PRP benefits most from independence; shared points can erase redundancy under common-cause failures. | PRP preferred | Inject faults on LAN A and LAN B separately; verify continuity + independence evidence. |
| Endpoints cannot be modified, but the domain still needs zero-loss behavior | A boundary function is required to keep duplication semantics inside the redundancy domain. | HSR/PRP with RedBox boundary | Verify boundary placement and discard logic at the domain edge; check storm containment. |
| Brief interruption is acceptable, but recovery must be within X ms | A reconverging ring can satisfy ms recovery with low steady overhead if detection and stability are controlled. | MRP | Measure switchover ≤ X ms; verify no repeated reconvergence (flap) after recovery. |
| Replication headroom is insufficient at expected peak traffic | Copy-based designs can overload links and increase queuing; the domain must be reduced or traffic must be controlled. | Avoid HSR/PRP unless bounded | Validate link utilization under fault; verify duplicates and queueing remain within X. |
Red-line usage: treat each condition as a gating check before committing to topology and equipment capability.
H2-7. Performance & Timing Budget: Switchover, Zero-Loss, and Overhead
Intent: Turn “fast / stable / zero-loss” into measurable budgets, testable hooks, and acceptance criteria.
Acceptance vocabulary (avoid metric ambiguity)
- MRP switchover time: time from fault onset to stable-forwarding (no repeated reconvergence). Target ≤ X ms.
- Zero-loss (HSR/PRP): during a single fault, missing sequence = 0 at the receiver. This assumes both paths deliver a valid copy within an effective window.
- Overhead: additional bandwidth/queue pressure caused by frame duplication; must be budgeted to keep utilization under a conservative ceiling (≤ X%).
- Zero-loss ≠ zero-jitter: duplication and discard processing can still add latency variation; jitter is a separate acceptance item.
MRP switchover budget (three segments that can be measured)
Switchover time can be decomposed into three segments: Detection → Recovery → Stable-forwarding. The purpose of budgeting is to pin each segment to an observable start/end so teams do not argue about the stopwatch.
| Segment | Observable start | Observable end | Dominant drivers | Target (X) | Measurement hook |
|---|---|---|---|---|---|
| Detection | Fault is introduced (link down / cable pull / port flap start) | Redundancy domain confirms fault (health check misses / status change becomes certain) | Health-check interval, debounce, flap filtering, domain size | ≤ X ms | Event timestamp + ring-health counter transition |
| Recovery | Fault is confirmed (detection ends) | Forwarding resumes on the alternate path (traffic reappears) | Role convergence, port unblock timing, FDB/forwarding update latency | ≤ X ms | Packet gap measurement + port-state change log |
| Stable-forwarding | Forwarding returns (recovery ends) | No repeated reconvergence / no periodic flap under steady load | Flap suppression, stable health checks, queue stability under load | For Y sec | Counter stability + repeated event absence |
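Given the three observable timestamps from the table (fault introduced, fault confirmed, traffic back), the per-segment budget check is a small calculation. A sketch with illustrative timestamps and hypothetical X budgets:

```python
# Sketch: decompose one injection into Detection and Recovery per the 3-segment table.
def switchover_breakdown(t_fault, t_confirmed, t_traffic_back):
    """All timestamps in ms from a common clock (event log + packet-gap hooks)."""
    detection = t_confirmed - t_fault
    recovery = t_traffic_back - t_confirmed
    return {"detection_ms": detection, "recovery_ms": recovery,
            "switchover_ms": detection + recovery}

def within_budget(breakdown, det_budget_ms, rec_budget_ms):
    """Check each segment against its own X, so teams see WHICH segment dominates."""
    return (breakdown["detection_ms"] <= det_budget_ms
            and breakdown["recovery_ms"] <= rec_budget_ms)

b = switchover_breakdown(t_fault=1000.0, t_confirmed=1008.0, t_traffic_back=1019.0)
print(b["switchover_ms"], within_budget(b, det_budget_ms=10, rec_budget_ms=15))  # 19.0 True
```

Budgeting per segment, rather than only the total, shows whether tuning effort belongs in detection (health-check timing) or recovery (convergence/forwarding update).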
HSR/PRP zero-loss preconditions and overhead budget (engineering estimates)
Zero-loss precondition (effective arrival window)
Zero-loss requires that at least one valid copy of each frame arrives without sequence gaps. The second copy may arrive later, but it must still fall within a receiver-side valid window so discard logic can operate deterministically. If the path arrival delta (Δt) grows beyond the window, observed behavior can shift from “duplicates” to “missing-seq”.
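The "duplicates turn into missing-seq" effect can be illustrated with a simplified model. This is not the IEC 62439-3 drop algorithm; it only assumes the receiver remembers a sequence entry for a fixed window after the first copy, and that an entry expiring changes how a late copy is handled:

```python
# Simplified model: how the second copy of a frame is treated as the path
# arrival delta (delta_t) grows relative to the receiver-side discard window.
def classify_second_copy(delta_t, window, first_copy_arrived):
    if first_copy_arrived:
        # Normal case: inside the window the copy is a clean discard;
        # beyond it the entry has expired and behavior becomes implementation-defined.
        return "duplicate discarded" if delta_t <= window else "late copy mishandled"
    # First copy lost in the fault: the late copy is the only one left.
    # If it falls outside the window it may be rejected as stale -> visible loss.
    return "delivered (zero-loss holds)" if delta_t <= window else "missing-seq"

print(classify_second_copy(0.002, 0.010, True))    # duplicate discarded
print(classify_second_copy(0.050, 0.010, False))   # missing-seq: delta_t beyond window
```

The practical lesson matches the text: budget Δt against the window during design, and measure it during fault injection rather than assuming symmetric paths.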
Overhead budget (practical ceiling)
- HSR: duplication pressure shows up as higher utilization and queue contention within the redundancy domain; size the domain so peak load stays under a conservative ceiling (≤ X%).
- PRP: duplication is split across LAN A and LAN B; keep each LAN under a utilization ceiling (≤ X%) and prove independence to avoid common-cause collapse.
- Jitter sources: replication-induced queuing, discard buffering, and fault-time traffic redistribution can add latency variation even when missing-seq stays 0.
| Mechanism | Copy scope | Overhead driver | Utilization ceiling | Validation counters |
|---|---|---|---|---|
| HSR | Ring domain (dual direction copies) | Frame duplication + domain path length + peak traffic | ≤ X% | Port utilization, queue drop, duplicates, missing-seq, discard stats |
| PRP | LAN A and LAN B (parallel delivery) | Dual transmission per frame + independence constraints | Each LAN ≤ X% | Per-LAN utilization, duplicates, missing-seq, fault-injection continuity evidence |
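The PRP row of the budget table reduces to simple arithmetic: every frame is sent on both LANs, so each LAN carries the full application load. A sketch with a hypothetical ceiling of 50% standing in for the "≤ X%" placeholder:

```python
# Sketch: check PRP duplication overhead against a per-LAN utilization ceiling.
def prp_lan_utilization(app_mbps, link_mbps):
    """PRP transmits every frame on both LANs: each LAN carries the full app load."""
    return app_mbps / link_mbps

def passes_ceiling(app_mbps, link_mbps, ceiling=0.5):
    """ceiling stands in for the placeholder 'each LAN <= X%'."""
    return prp_lan_utilization(app_mbps, link_mbps) <= ceiling

# 30 Mbps of application traffic on 100 Mbps links: 30% per LAN -> within ceiling.
print(prp_lan_utilization(30.0, 100.0), passes_ceiling(30.0, 100.0, ceiling=0.5))
# 60 Mbps would exceed the hypothetical 50% ceiling on each LAN.
print(passes_ceiling(60.0, 100.0, ceiling=0.5))
```

For HSR the same check applies inside the ring domain, but the driver is duplication plus path length, so utilization should be measured per port under peak load rather than derived from the application rate alone.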
H2-8. Configuration & Interoperability Blueprint (Commissioning Without Chaos)
Intent: Prevent “works but unstable” outcomes by aligning roles, boundaries, and discard/counter definitions before acceptance testing.
Parameter alignment (the minimum set that prevents chaos)
MRP (concept-level alignment)
- Roles: Manager/Client assignment is consistent with the intended domain.
- Ring ports: ring membership is explicit; boundary ports are not accidentally included.
- Health check: interval and timeout definitions are consistent; flap filtering is defined.
- Boundary principle: avoid overlapping loop-control responsibilities on the same fault domain.
HSR / PRP (boundary and discard correctness)
- RedBox boundary: the redundancy domain must not silently expand into a standard LAN.
- Duplicate discard window: window/timeout definitions align with the expected Δt; counters must match the same definition.
- Broadcast/multicast sanity: check for amplification risk inside the domain; verify storm containment via counters and fault injections.
Deliverable: Commissioning parameter list (10–15 fields to confirm)
| Field | Role / scope | Default direction | Risk if wrong |
|---|---|---|---|
| MRP role (MRM/MRC) | Per ring domain | Single manager per domain; document boundaries | Split-brain behavior or unstable reconvergence |
| Ring ports (membership) | Switch ports inside domain | Make membership explicit; keep boundary ports excluded | Fault domain expands; unexpected flap / storm spread |
| Health-check interval/timeout definition | Per domain (MRP) | Start conservative; avoid false positives under load | False fault detection → reconvergence storms |
| RedBox boundary placement | HSR/PRP domain edge | Keep duplication semantics inside the defined domain | Domain “leaks” into LAN → duplicates and storm amplification |
| Discard window / timeout definition | Receiver / boundary discard logic | Align with expected Δt; keep counters consistent with the same definition | False missing-seq or inflated duplicates; misleading acceptance |
| Broadcast/multicast amplification check | Domain-wide sanity | Baseline counters without fault; verify bounded growth under fault | Storm risk, queue drop, and “works but unstable” symptoms |
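Several rows in the parameter list fail silently when two devices hold different values for a definition that must agree (health-check timing, discard window, domain identity). A sketch of a pre-injection diff over such must-agree fields (field names and values are illustrative):

```python
# Sketch: diff must-agree commissioning fields across devices before fault injection.
def mismatched_fields(devices):
    """devices: {name: {field: value}}. Return fields whose values disagree."""
    all_fields = set().union(*(d.keys() for d in devices.values()))
    bad = {}
    for field in sorted(all_fields):
        values = {name: d.get(field) for name, d in devices.items()}
        if len(set(values.values())) > 1:
            bad[field] = values        # keep the per-device values as evidence
    return bad

devs = {
    "sw1": {"health_check_ms": 20, "discard_window_ms": 400, "domain_id": "ringA"},
    "sw2": {"health_check_ms": 50, "discard_window_ms": 400, "domain_id": "ringA"},
}
print(sorted(mismatched_fields(devs)))   # ['health_check_ms']
```

Note this diff only covers fields that must be identical; role fields like MRM/MRC are expected to differ per device and need a separate "exactly one manager per domain" check.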
Commissioning blueprint (the order that prevents unstable rollouts)
- Confirm topology boundary: lock the redundancy domain and document the boundary ports / RedBox locations.
- Assign roles: apply MRP or HSR/PRP roles per domain; ensure endpoints match the intended participation model.
- Configure health/discard definitions: align health check timing (MRP) and discard window/counter definitions (HSR/PRP).
- Baseline counters (no fault): record utilization, duplicates, missing-seq, queue drops as the “clean” reference.
- Inject single faults: cable pull / port down / device reboot inside the domain; observe acceptance hooks.
- Accept: prove switchover ≤ X ms or missing-seq = 0, and verify post-fault stability over Y seconds.
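The "baseline, inject, accept" sequence above hinges on comparing counters against the clean reference. A sketch of the stability gate as a bounded-growth check (counter names and bounds are illustrative placeholders):

```python
# Sketch: post-fault stability gate as a bounded-growth check against the baseline.
def stability_gate(baseline, after, max_growth):
    """baseline/after: {counter: value}. Pass per counter if growth stays in bound."""
    return {k: (after.get(k, 0) - baseline.get(k, 0)) <= max_growth.get(k, 0)
            for k in baseline}

base = {"reconvergence_events": 0, "duplicates": 10, "missing_seq": 0}
post = {"reconvergence_events": 1, "duplicates": 14, "missing_seq": 0}
# Hypothetical bounds: one reconvergence allowed, duplicates may grow by 5, zero loss.
result = stability_gate(base, post, {"reconvergence_events": 1,
                                     "duplicates": 5, "missing_seq": 0})
print(result)   # every counter within its bound -> stability gate passes
```

Recording the bounds next to the baseline snapshot keeps the acceptance evidence self-describing: the report shows what was allowed, not only what was observed.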
H2-9. Failure Modes & Hardening: The Top 12 Ways Redundancy Fails in Real Plants
Intent: Convert common plant failures into a “check-first” playbook with symptoms, fast checks, and hardening actions.
Symptom vocabulary (what operators actually see)
- Drop / missing-seq: sequence gaps (HSR/PRP) or cyclic data loss windows (MRP).
- Duplicate burst: duplicates spike because discard fails or the redundancy domain “leaks” into a LAN.
- Latency spike: transient queuing and delay variance (duplication, reconvergence, or storm pressure).
- Flap / reconvergence loop: repeated role/state transitions after a fault or under link jitter.
Deliverable: Symptom → root cause → quick check → fix (Top 12)
This table is designed as the source pool for FAQs: each row can become a standalone four-line FAQ item later.
| Symptom (observable) | Likely root cause | Quick check (10–30 min) | Fix / hardening action |
|---|---|---|---|
| Flap: repeated reconvergence events every few seconds | Link jitter / port flap not filtered; health-check debounce too aggressive | Correlate link-state logs with reconvergence timestamps; check flap counters and health-check miss patterns | Add flap suppression; align health-check interval/timeout; re-accept with “stable-forwarding ≥ X sec” |
| Flap: recovers, then drops again under load | Queue instability after reconvergence; peak utilization too high during reroute | Compare utilization and queue-drop counters before/after fault; check latency spikes during recovery window | Lower peak load ceiling (≤ X%); isolate critical traffic; validate recovery under near-worst-case load |
| Drop: cyclic data “holes” during a single-link break (MRP) | Switchover budget dominated by detection; acceptance stopwatch start/end not aligned | Measure Detection vs Recovery separately using event timestamps + packet gap; check which segment dominates | Re-budget with a 3-segment table; tune health-check definition; accept switchover ≤ X ms |
| Duplicate storm: duplicates explode after adding a RedBox | Redundancy boundary leak into standard LAN; duplication semantics extended unintentionally | Confirm domain boundary ports; check whether duplicates appear outside the intended redundancy domain | Re-define and enforce domain edge; ensure discard/translation occurs at the boundary; re-baseline counters |
| Duplicate burst only during faults (HSR/PRP) | Discard window/timeout definition mismatched to path arrival delta (Δt) | Measure Δt with a controlled sequence stream; compare against receiver-side discard window definition | Align discard window/timeout; ensure counters follow the same definition; accept duplicates ≤ X/1k |
| MAC flapping: learning oscillates between two paths | Dual-path visibility causes unstable learning; boundary not well defined for “which side learns” | Observe MAC move events during faults; verify whether both directions are presented to learning in the same domain | Constrain learning scope at the boundary; validate with a fault-injection case for “MAC stable for Y min” |
| Miswiring: adding a spur creates a hidden loop | Ring domain accidentally expanded; boundary ports included by mistake | Validate physical cabling against the topology map; confirm “domain membership” port list matches reality | Implement wiring validation step; enforce boundary rules; add a miswiring acceptance test (optional) |
| Latency spike: control loop becomes unstable though missing-seq stays 0 | Path asymmetry (Δt) too large; queue contention during duplication/discard increases jitter | Record latency histogram during faults; correlate jitter spikes with utilization and discard events | Budget Δt and utilization ceiling; accept jitter within X; separate “zero-loss” from “timing stability” |
| Drop at high load only (HSR/PRP) | Duplication pushes ports over safe utilization; transient queue drops create missing-seq | Check queue drop and utilization during peak; verify missing-seq aligns with drop counters | Reduce peak load ceiling (≤ X%); validate under worst-case load with missing-seq = 0 |
| “Looks stable” but duplicates slowly grow over hours | Counter definition mismatch; duplicates measured with inconsistent windows/denominators | Standardize denominator (per 1k frames or per second); compare logs across nodes using the same window | Align metric definition; add a stability acceptance criterion for counters over Y hours |
| Fault injection passes in lab, fails in plant | Real topology differs (extra T-branch / boundary port); domain is larger than assumed | Compare plant cabling to topology map; verify “domain membership” and RedBox locations match lab model | Update topology assumptions; re-run acceptance with the same measurement hooks at plant scale |
| Interop: “same standard” but behavior differs after a fault | Parameter definition mismatch (role boundary, discard window, health timing) across vendors | Freeze and compare the 10–15 commissioning fields; validate counters use the same time window and denominator | Standardize definitions and acceptance hooks; re-accept with one common measurement plan |
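The "counter definition mismatch" and interop rows share one fix: normalize every node's duplicate count to the same denominator and window before comparing. A minimal sketch of the per-1k normalization:

```python
# Sketch: normalize duplicate counts to one definition (per 1k delivered frames)
# so counters from different nodes and vendors can be compared honestly.
def duplicates_per_1k(duplicates, delivered_frames):
    if delivered_frames == 0:
        return 0.0                     # no traffic in the window: nothing to rate
    return 1000.0 * duplicates / delivered_frames

# Two nodes with the same raw count but very different traffic volumes:
print(duplicates_per_1k(40, 1_000_000))   # 0.04 per 1k: healthy
print(duplicates_per_1k(40, 20_000))      # 2.0 per 1k: may breach a "<= X/1k" bound
```

The same discipline applies to the measurement window: compare counters sampled over identical intervals, or the slow-growth symptom in the table stays invisible.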
H2-10. Verification Plan: Fault Injection, Measurement, and Pass Criteria
Intent: Provide an executable verification loop: defined injections, measurement hooks, artifacts, and pass criteria placeholders.
Must-test cases (single-fault first)
- Single link break: validate switchover (MRP) or continuity (HSR/PRP).
- Port flap: prove stability filters prevent reconvergence loops.
- Switch power loss: verify domain re-forms and remains stable after recovery.
- Node reboot: validate re-join behavior does not trigger storms or long instability.
- Miswiring (optional): demonstrate boundary rules and wiring validation detect hidden loops.
Measurement hooks (cross-checking prevents false “passes”)
MRP measurement anchors
- Application-level gap: missing cycles or packet-gap timing at the receiver.
- Link-state events: port link-down/up timestamps at the injection point.
- Domain counters/logs: reconvergence count and “stable-forwarding” evidence over Y seconds.
HSR / PRP measurement anchors
- Sequence continuity: missing-seq = 0 under single faults.
- Duplicates/Discard: duplicates are bounded and discard counters match definitions.
- Utilization/Queue health: duplication overhead does not push ports beyond the X% ceiling.
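The sequence-continuity anchor can be automated from the receiver's delivered stream. A sketch assuming a monotonically increasing sequence with no wrap-around (real PRP/HSR sequence fields wrap and need modular handling):

```python
# Sketch: prove "missing-seq = 0" from the receiver's delivered sequence numbers.
# Assumption: monotonically increasing sequence, no wrap-around, for brevity.
def missing_sequences(delivered):
    """Return every sequence number absent between the lowest and highest seen."""
    seen = sorted(set(delivered))
    gaps = []
    for prev, cur in zip(seen, seen[1:]):
        gaps.extend(range(prev + 1, cur))
    return gaps

print(missing_sequences([1, 2, 3, 4, 5]))   # []      -> zero-loss proof holds
print(missing_sequences([1, 2, 5, 6]))      # [3, 4]  -> continuity broken
```

Running this over a capture taken across the injection window, and archiving both the capture and the empty gap list, gives the auditable "missing-seq = 0" evidence the pass criteria require.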
Deliverable: Test matrix (Test / Setup / Injection / Metrics / Pass criteria)
| Test | Setup | Injection | Metrics (hooks) | Pass criteria (X placeholders) |
|---|---|---|---|---|
| Single link break | Near-worst-case traffic; baseline counters recorded | Cable pull or forced port-down at defined point | MRP: switchover gap + link-state timestamps + reconvergence logs. HSR/PRP: missing-seq + duplicates + utilization | Switchover ≤ X ms; continuous loss ≤ X frames (MRP). missing-seq = 0; duplicates ≤ X/1k (HSR/PRP) |
| Port flap resilience | Baseline under steady load; flap counters cleared | Toggle port state for N cycles at a defined rate | Reconvergence count, stability window, latency histogram | No reconvergence loop; stable-forwarding ≥ X min; latency spike ≤ X |
| Switch power loss | Define the powered node; record steady-state counters and utilization | Power off/on the node; record event timestamps | Recovery time; post-recovery stability; duplicates/missing-seq (HSR/PRP) | Re-formation ≤ X ms; stable ≥ X min; duplicates bounded; missing-seq = 0 (as applicable) |
| Node reboot re-join | Clear counters; define “re-join success” evidence | Reboot a node while traffic runs; capture logs and pcap | Reconvergence events; duplicates; missing-seq; utilization and queue health | Re-join without storm; stability ≥ X min; duplicates ≤ X/1k; missing-seq = 0 |
Recommended evidence bundle per test run: event timestamps, pcap snippet, counter snapshots (before / during / after), and a signed pass/fail line with X thresholds.
H2-11. Applications (Power/Rail) — Deployment Patterns
Focus: turn redundancy into deployable domains (core/access/edge), with acceptance metrics that match control-loop vs monitoring sensitivity. Avoid expanding the redundancy domain into ordinary LAN segments.
A) Why redundancy is required in power/rail systems
- Control-loop links (protection, interlocking, motion/traction timing) are dominated by continuity and jitter spikes, not only packet-loss.
- Monitoring/forensics links (SCADA logs, event uploads, diagnostics) can sometimes tolerate brief interruption, but require stable recovery and auditability.
- Redundancy should protect a defined fault domain: a link cut, a port flap, or a device outage—without expanding loops outside the intended ring.
B) Acceptance metrics that match link criticality (X = project-specific threshold)
| Link type | What must be protected | Pass criteria (examples) |
|---|---|---|
| Control-loop / cyclic traffic | Continuity + bounded arrival-time spread (Δt), stable scheduling after a fault | missing-seq = 0 within window X; Δt(pathA-pathB) ≤ X; latency spike ≤ X; stable ≥ X min |
| Monitoring / diagnostics | Recover fast and stay stable; counters + timestamps for post-mortem | switchover ≤ X ms; consecutive loss ≤ X frames; duplicates ≤ X/1k; stable ≥ X min; event timestamps present |
Note: “zero packet loss” is necessary but not sufficient for stability—unbounded Δt or jitter spikes can still destabilize a control loop.
C) Layered deployment: core ring, access ring, edge nodes (keep redundancy domains bounded)
- Core ring: isolates major fault domains; demands stable recovery and clean event logging for post-fault analysis.
- Access ring: local ring for bays/areas; limits wiring variability; enables repeatable fault-injection acceptance.
- Edge IO: endpoints should not propagate redundancy semantics into the ordinary LAN; use a clear boundary (e.g., dual-homing/border device) to prevent storms.
Tip: keep “redundancy semantics” inside the ring domain; enforce boundaries at the entry point to prevent duplicate storms and MAC flapping.
H2-12. IC / Switch Selection Logic — Capability Checklist + Example Material Numbers
Objective: select silicon and switch platforms that can be proven to meet redundancy acceptance criteria. The checklist below is organized as a funnel: goal → topology → implementation point → observability → verification hooks.
1) Selection funnel (avoid “feature-list shopping”)
- Goal: zero-loss (HSR/PRP) vs ms switchover (MRP).
- Topology: ring / dual-ring / dual-LAN / boundary insertion (RedBox-like boundary).
- Implementation point: end node vs boundary device (where copy/discard happens).
- Observability: domain state + counters + event timestamps (must be measurable).
- Verification hooks: fault injection + pass criteria X (must be reproducible).
2) Capability checklist (what must be provable)
- MRP (ms switchover): ring role/state visibility, port event timestamps, deterministic reconvergence behavior, post-fault stability logging.
- HSR/PRP (zero-loss): duplication + discard placement is explicit (end node or boundary), duplicate counters, missing-seq counters, and bounded Δt behavior under load.
- Non-negotiable observability: domain state + drop/duplicate counters + event timestamps that can be exported as evidence.
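The "duplication + discard placement" item is easiest to reason about with the receiver-side algorithm in front of you. A minimal software sketch of PRP/HSR-style duplicate discard (real silicon uses a bounded hardware table; the window size and class shape here are illustrative assumptions):

```python
from collections import OrderedDict

class DuplicateDiscard:
    """Accept the first copy of each (source, sequence) pair, drop later copies.
    An LRU-evicted dict bounds memory the way a hardware discard table does."""
    def __init__(self, window: int = 1024):
        self.window = window
        self.seen: OrderedDict = OrderedDict()
        self.duplicates = 0   # the counter you would export as acceptance evidence

    def accept(self, source: str, seq: int) -> bool:
        key = (source, seq)
        if key in self.seen:
            self.duplicates += 1
            return False      # drop: a copy was already delivered upward
        self.seen[key] = True
        if len(self.seen) > self.window:
            self.seen.popitem(last=False)   # evict oldest entry
        return True
```

This also shows why the window size matters for acceptance: an undersized table forgets old entries and re-accepts late duplicates, which surfaces as duplicates > X/1k under load.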
3) Capability scoring table (Must / Should / Nice-to-have)
| Capability item | Evidence to request | Priority | Acceptance hook (X) |
|---|---|---|---|
| Domain state visibility (ring state / role) | register/counter list; CLI/API objects; event log field definitions | Must | stable ≥ X min after fault |
| Duplicate handling (where discard happens) | explicit block diagram; mode description (end node vs boundary); counter semantics | Must (HSR/PRP) | duplicates ≤ X/1k |
| Event timestamping (port up/down, fault window) | timestamp format, resolution, export method, monotonicity guarantees | Must | switchover ≤ X ms (MRP) |
| Line-rate behavior under duplication overhead | throughput vs load evidence; queue policy notes; counter saturation behavior | Should | port utilization ≤ X% |
A bare “supports protocol X” claim is not actionable: unless counters, timestamps, and domain state can be exported, the acceptance criteria above cannot be tested.
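The event-timestamping row implies a concrete computation: switchover time from the fault-detected timestamp to the first valid packet, aggregated as P95 over Y injections. A minimal sketch in Python (nearest-rank percentile; function names are assumptions):

```python
import math

def switchover_ms(fault_ts: float, packet_ts: list[float]) -> float:
    """Switchover time = fault detected -> first valid packet after reconvergence.
    Timestamps in seconds from the same (monotonic) clock; result in ms."""
    after = [t for t in packet_ts if t >= fault_ts]
    if not after:
        raise ValueError("no traffic observed after the fault")
    return (min(after) - fault_ts) * 1000.0

def p95(samples: list[float]) -> float:
    """P95 over Y fault injections (nearest-rank method)."""
    s = sorted(samples)
    return s[math.ceil(0.95 * len(s)) - 1]
```

This is also why the checklist demands monotonicity guarantees for exported timestamps: a clock step during the fault window silently corrupts the switchover measurement.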
4) Example material numbers (for engineering qualification & evidence collection)
The following are example part numbers that explicitly call out redundancy capabilities (MRP / HSR / PRP) in public documentation. Final selection must be validated against the project’s pass criteria (X) and certification requirements.
- Switch IC with redundancy managed modes: Microchip LAN9645xF family (example orderable PN: LAN96459ST-I/8NW)
- High-bandwidth switch family with HSR/PRP variants: Microchip LAN9694RED / LAN9696RED / LAN9698RED (TSN + HSR/PRP redundancy variants)
- Switch family that lists HSR/PRP in feature set: Microchip LAN9696 / LAN9698 (family pages describe HSR/PRP redundancy)
- Enterprise/industrial switching family that lists integrated HSR/PRP: Broadcom BCM53570 series (integrated HSR/PRP is called out in product information)
- Programmable PRP endpoint reference stack (CPU + PHY example): TI AM3359 + TI TLK110 (PRP reference design for substation automation)
Practical procurement rule: require a vendor-provided counter map (drop/duplicate/missing-seq), a state model (domain role/state), and a repeatable fault-injection demo that matches the project’s pass criteria (X).
H2-13. FAQs (MRP / HSR / PRP) — Field Troubleshooting Closure
Each FAQ is a measurable closure: symptom → counters/probes → fix → pass criteria. Use consistent windows and thresholds (X) to avoid “it feels better” conclusions.
Suggested metric buckets (use the same time window everywhere): W = X seconds for counters; p99 for latency/jitter; Δt = arrival-time difference between redundant paths; duplicates = redundant copies received and then discarded.
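Fixing the bucket shape in code is one way to enforce "the same time window everywhere" across FAQ closures. A minimal sketch; the window length, field names, and nearest-rank p99 are illustrative assumptions, not a mandated format:

```python
import math

WINDOW_S = 10.0   # W: one window length reused for every counter (project X)

def p99(samples: list[float]) -> float:
    """Nearest-rank p99; apply to latency/jitter samples from one window W."""
    s = sorted(samples)
    return s[math.ceil(0.99 * len(s)) - 1]

def metric_bucket(latency_ms: list[float], dup: int, frames: int) -> dict:
    """One FAQ metric bucket: same window W, p99 latency, duplicates per 1k."""
    return {
        "window_s": WINDOW_S,
        "p99_latency_ms": p99(latency_ms),
        "dup_per_1k": (dup / frames * 1000.0) if frames else 0.0,
    }
```

Emitting the same bucket before and after a fix turns "it feels better" into a before/after diff against the X thresholds.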