Ring Redundancy (MRP/HSR/PRP): ms Failover & Zero-Loss Guide
Ring redundancy is an engineering promise with measurable acceptance: either ms-class switchover (MRP) or true zero-loss via frame duplication (HSR/PRP). This page turns that promise into deployable topology patterns, budgeted timing and overhead, and testable pass criteria (with X/Y placeholders) for power and rail networks.
H2-1. Definition & Boundary: What “Ring Redundancy” Really Means
Intent: Lock an acceptance-ready definition of “redundancy” before discussing protocols—fast failover (ms) versus zero-loss continuity.
Two acceptance targets (do not mix)
- Fast switchover (ms): after a single fault, traffic recovers within X ms. Short interruption may occur, but recovery time is contractually bounded.
- Zero-loss continuity: under a single fault, the receiver continues to see no missing frames, achieved by parallel delivery (HSR/PRP) plus correct duplicate discard.
Key point: “Link up” is not “service restored”. Acceptance must be defined at a measurement point (application, probe, or counters), with an explicit start/stop rule and a pass threshold.
Redundancy domain (fault domain) boundary
A redundancy domain is the smallest network region expected to survive a single fault without violating the chosen acceptance target. The domain definition must state what is protected and what is not.
Protected (single-fault guarantees apply):
- Single link break inside the ring
- Single port down event (including transient drop)
- Single node/switch power loss or reboot (if explicitly required)
Not protected (out of scope for the domain):
- Two simultaneous independent faults (e.g., dual link breaks)
- Common-cause failures (shared power, shared harness, shared upstream switch)
- PHY signal integrity / EMC / ESD root causes (handled in protection/PHY pages)
Deliverable: Acceptance target definition table
| Target | Measurement definition | Pass criteria (placeholders) | Typical fit |
|---|---|---|---|
| Fast switchover (MRP) | Measure at a defined point (application heartbeat, probe, or counters). Switchover time = fault detected → service restored (first valid packet after reconvergence). | Switchover ≤ X ms (P95), across Y injections. Post-recovery stability ≥ X min (no flap / no oscillation). | Control/automation networks that can tolerate a short gap, but need bounded recovery and stable convergence. |
| Zero-loss (HSR/PRP) | Prove continuity with sequence-aware traffic (frame ID/sequence) at the receiver. Acceptance requires no missing sequence during the fault, plus correct duplicate discard. | Missing sequence = 0 during Y injections. Duplicate rate ≤ X/1k. Stable ≥ X min. | Mission-critical traffic where even a short gap is unacceptable, often in power/rail protection and high-availability segments. |
Note: Threshold placeholders (X, Y) are intentionally open; values must be filled by the system-level business loop and certification constraints.
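The P95 acceptance rule from the table can be checked mechanically once injection results exist. A minimal sketch, assuming a nearest-rank percentile and hypothetical placeholder values X = 30 ms and Y = 20 injections (all names and numbers here are illustrative, not from a standard):

```python
# Sketch: evaluate "switchover <= X ms (P95) across Y injections".
import math

def p95(samples):
    """Nearest-rank 95th percentile: the ceil(0.95 * n)-th smallest sample."""
    ranked = sorted(samples)
    rank = math.ceil(0.95 * len(ranked))
    return ranked[rank - 1]

def accept_switchover(samples_ms, limit_ms, min_injections):
    """Pass only if enough injections were run AND the P95 is within the limit."""
    if len(samples_ms) < min_injections:
        return False  # not enough evidence to claim acceptance
    return p95(samples_ms) <= limit_ms

# Example: 20 injections, hypothetical X = 30 ms, Y = 20
samples = [12, 14, 11, 13, 18, 22, 15, 12, 16, 19,
           14, 13, 17, 21, 12, 15, 16, 14, 28, 13]
print(p95(samples), accept_switchover(samples, 30, 20))  # 22 True
```

The same shape works for any percentile the contract names; only the rank formula changes.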
H2-2. Requirements & Failure Model: What Must Survive (and What Need Not)
Intent: Convert “redundancy” into testable engineering clauses—fault injection, observable symptoms, required behavior, and stability after recovery.
“Recovered” and “stable” are different acceptance gates
- Recovery gate: traffic meets the chosen target (switchover ≤ X ms or missing sequence = 0).
- Stability gate: after recovery, the network does not oscillate (no repeated reconfigure / no port flap storm) for ≥ X minutes.
A system that “recovers quickly but keeps flapping” typically fails real plants: symptoms appear random, diagnostics become noisy, and root-cause cycles explode.
Requirement template (repeatable)
- Define the fault precisely (what counts as “flap”, what counts as “down”).
- Specify injection method (unplug, relay break, port disable, power cut).
- State observable symptom and pick a primary metric (drop, missing sequence, duplicate rate, latency spike).
- Declare required behavior with thresholds (X ms / X frames / X per 1k / X minutes stable).
Deliverable: Failure model table (fault → symptom → required behavior)
| Fault | Observable symptom | Required behavior (placeholders) | Primary metric |
|---|---|---|---|
| Single link break | Drop burst (MRP), or path continuity maintained (HSR/PRP). Possible latency spike during convergence. | MRP: switchover ≤ X ms. HSR/PRP: missing sequence = 0. Stability ≥ X min. | Switchover time / missing sequence |
| Port flap (transient up/down) | Repeated reconvergence; intermittent drops; oscillating ring state; bursty duplicates in zero-loss domains. | Define flap as ≥ N toggles in T seconds. System must not enter continuous oscillation; stability gate must pass. | Flap count / oscillation time |
| Switch/node power-off | Topology changes; potential traffic interruption (MRP) or continuity via alternate path (HSR/PRP) if domain is correctly defined. | If required: service must meet the selected target under a single node loss, then remain stable ≥ X min. | Missing sequence / recovery time |
| Node reboot / rejoin | Brief duplicate spikes; forwarding table churn; possible MAC learning instability if boundaries are wrong. | Rejoin must not cause storm; duplicate rate ≤ X/1k; no sustained oscillation during the stability window. | Duplicate rate / storm counters |
| Miswiring (extra branch / loop expansion) | Broadcast storm; duplicates explode; ring state never stabilizes; “works in lab, fails in plant”. | Domain boundary must prevent loop expansion; storm control must trip within X; network must return to stable state after correction. | Storm counters / stability time |
Explicitly out of scope (unless the contract says otherwise):
- Dual simultaneous independent faults
- Common-cause faults that bypass redundancy (shared power, shared upstream choke point)
- Out-of-domain configuration errors that expand loops beyond the designed boundary
Pass criteria summary (placeholders):
- MRP switchover ≤ X ms (P95) over Y injections
- HSR/PRP missing sequence = 0 during Y injections
- Duplicate rate ≤ X/1k, storm counters remain bounded
- Stability window ≥ X min (no oscillation / no reconfigure storm)
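The flap definition above ("≥ N toggles in T seconds") is easy to state but often left unimplemented, so teams argue about what counted as a flap. A minimal sliding-window classifier, with hypothetical thresholds N = 5 and T = 10 s:

```python
# Sketch: classify a port as "flapping" from link toggle timestamps,
# using the definition ">= N toggles within any T-second window".
from collections import deque

def is_flapping(toggle_times_s, n_toggles, window_s):
    """Slide a window over sorted toggle timestamps; True if any window holds >= N."""
    window = deque()
    for t in sorted(toggle_times_s):
        window.append(t)
        # Drop toggles older than the window relative to the newest one.
        while window[0] < t - window_s:
            window.popleft()
        if len(window) >= n_toggles:
            return True
    return False

# Hypothetical thresholds: N = 5 toggles within T = 10 s
print(is_flapping([0.0, 1.2, 2.5, 3.1, 4.0], 5, 10.0))      # True: 5 toggles in 4 s
print(is_flapping([0.0, 30.0, 60.0, 90.0, 120.0], 5, 10.0))  # False: toggles spread out
```

Feeding this from link-state event logs gives the stability gate an objective input instead of an operator impression.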
H2-3. Standards & Terminology Map (IEC 62439-2/-3 Essentials)
Intent: Align terms and roles to a single engineering vocabulary so requirements, configuration, and verification stay consistent across teams.
Where the terms come from (scope in one glance)
- IEC 62439-2 (MRP): ring management roles, blocked port behavior, and health checks for reconvergence.
- IEC 62439-3 (HSR/PRP): frame replication, sequence-based duplicate discard, and boundary devices (RedBox / LRE).
- This section focuses on engineering meaning (what a role does, what a frame implies), not certification profiles.
Terminology anchors (use these consistently)
Ring defines the redundancy loop; LAN A/B defines dual independent networks (PRP). Redundancy domain is the boundary where single-fault guarantees are claimed.
- MRM / MRC: manage ring state and the blocked port (loop prevention + reconvergence).
- DANH / DANP: end nodes that participate in HSR/PRP and perform duplicate discard using sequence context.
- RedBox / LRE: boundary device/function to connect redundancy domains to non-redundant LANs while keeping duplication rules correct.
- Health check / supervision: affects fault detection delay → part of the switchover budget.
- Sequence / LAN-ID: defines how duplicates are recognized → drives duplicate rate and “missing sequence = 0” proof.
Deliverable: Terminology crosswalk (one dictionary)
| Term | Engineering meaning | MRP / HSR / PRP | Why it matters (metrics / risks) |
|---|---|---|---|
| MRM / MRC | Ring manager / client roles that coordinate ring state and reconvergence behavior. | MRP | Drives reconvergence path and stability; impacts switchover time and post-fault oscillation risk. |
| Blocked port | A designated blocked edge to prevent loops in normal operation; becomes a recovery lever under fault. | MRP | Determines normal traffic direction and recovery path; wrong boundary causes persistent reconfigure. |
| Test frame / health check | Supervision mechanism that detects faults and triggers ring state change. | MRP (and supervision concepts in 62439-3) | Affects detection delay and false-trigger risk; directly touches the switchover budget. |
| DANH / DANP | End nodes that participate in redundancy; receive duplicates and keep one valid copy. | HSR / PRP | Enables the missing sequence = 0 proof; poor discard logic leads to storms or false loss. |
| RedBox | Boundary device connecting redundancy domains (HSR/PRP) to non-redundant Ethernet segments. | HSR / PRP | Defines domain boundary; misplacement causes loop expansion, broadcast storms, and MAC instability. |
| LRE | Link redundancy entity/function that handles replication context and duplicate discard behavior. | HSR / PRP | Controls duplicate rate and “zero-loss” proof quality; wrong window triggers discard failures. |
| Sequence / LAN-ID | Metadata used to correlate duplicates and identify which network copy was received first. | HSR / PRP | The basis for proving missing sequence = 0 and keeping duplicates bounded (≤ X/1k). |
Practice: keep term usage identical in requirements, configuration checklists, logs, and verification reports.
H2-4. How Each Protocol Works (MRP vs HSR vs PRP) — Mechanisms, Not Marketing
Intent: Explain data-plane actions (block, copy, discard) and map each mechanism to acceptance targets (ms switchover vs zero-loss).
Mechanism in three moves (block / copy / discard)
- MRP (block): normal operation prevents loops with a blocked port. Under a fault, the ring state changes and the blocked edge is used to restore a single forwarding path, so recovery includes detection and reconvergence, producing ms-class switchover.
- HSR (copy + discard, ring): each frame is replicated in two directions around the ring. If at least one copy survives the fault, continuity can be proven as missing sequence = 0, provided duplicate discard works and storms remain bounded.
- PRP (copy + discard, dual LANs): frames are transmitted over two independent networks. Under a single fault, one network remains available and the receiver keeps the first valid copy. The engineering risk is usually shared choke points and boundary mistakes (RedBox placement).
- MRP → prove switchover ≤ X ms + post-fault stability.
- HSR/PRP → prove missing sequence = 0 + bounded duplicates.
Deliverable: Mechanism comparison (MRP vs HSR vs PRP)
| Protocol | Copy point | Discard / block point | Overhead | Typical recovery behavior | Acceptance fit |
|---|---|---|---|---|---|
| MRP | No per-frame replication; normal traffic follows the unblocked path. | A designated blocked port prevents loops; under fault it becomes the recovery edge. | Low steady-state overhead; cost appears as reconvergence time and transient interruptions. | ms switchover after detection + reconvergence; stability must be verified. | Fast switchover contract |
| HSR | Per-frame replication in both directions around the ring. | Receiver/participant performs duplicate discard using sequence context. | Higher ring load (replication); requires bounded duplicates and storm control discipline. | Zero-loss under single fault if at least one copy arrives within the validity window. | Zero-loss continuity contract |
| PRP | Replication across LAN A and LAN B (dual networks). | Receiver discards duplicates; boundaries are enforced via RedBox / LRE where needed. | Two networks (or boundary devices); common-cause failures can erase redundancy if not architected out. | Zero-loss under single fault if LANs are truly independent and discard is correct. | Zero-loss continuity contract |
Design implication: “block” creates reconvergence time; “copy+discard” creates continuity but adds load and boundary discipline requirements.
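The "copy + discard" half of the table can be made concrete with a toy receiver. This is a deliberately simplified sketch of "first valid copy wins": a real IEC 62439-3 entity maintains per-source drop tables with sequence wrap-around and aging, which a plain set does not model:

```python
# Sketch: sequence-based duplicate discard for a single source (PRP/HSR style).
# Simplification: one traffic source, no sequence wrap-around, no entry aging.
def deliver_first_copies(received):
    """received: list of (seq, lan_id). Deliver each seq once; count discards."""
    seen = set()
    delivered, duplicates = [], 0
    for seq, lan_id in received:
        if seq in seen:
            duplicates += 1            # second copy of the same frame: discard
        else:
            seen.add(seq)
            delivered.append((seq, lan_id))
    return delivered, duplicates

# LAN A loses seq 2 during the fault; LAN B still delivers it,
# so the receiver sees no sequence gap.
rx = [(1, "A"), (1, "B"), (2, "B"), (3, "A"), (3, "B")]
out, dups = deliver_first_copies(rx)
print([s for s, _ in out], dups)   # [1, 2, 3] 2
```

The delivered list with no gaps is exactly the "missing sequence = 0" evidence the zero-loss contract asks for; the duplicate count feeds the "≤ X/1k" bound.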
H2-5. Architecture & Topology Patterns You Can Actually Deploy
Intent: Convert protocol choices into deployable topologies with explicit boundaries, fault domains, and observable acceptance points.
Topology building blocks (use consistent vocabulary)
- Redundancy domain: the boundary where single-fault guarantees are claimed (ring domain or LAN A/B domain).
- Boundary device: a controlled edge that prevents the domain from expanding (e.g., RedBox, interconnect boundary).
- Fault domain: the set of links/switches/nodes assumed to fail one at a time under the acceptance contract.
- Acceptance hooks: where to observe switchover time (MRP) or missing sequence / duplicates (HSR/PRP).
Deliverable: Topology selection checklist (deploy-first)
- Topology form: single ring / dual ring / interconnected rings / PRP dual LAN / HSR + RedBox.
- Acceptance target: switchover ≤ X ms, or missing sequence = 0 (zero-loss).
- Endpoint constraints: endpoints can participate (DAN) or must remain standard Ethernet.
- Boundary controllability: RedBox / interconnect boundary placement is controllable and testable.
- Independence: for PRP, LAN A/B have no shared choke points (power, uplinks, media, conduit).
- Scale headroom: node count, ring length, and traffic headroom can tolerate replication or reconvergence.
Failure-cost reminder: choosing a topology without a clear boundary can turn a local fault into a domain-wide flap or storm.
Layering pattern (power/rail style) as a topology meaning
A practical deployment often uses layering to keep fault domains contained: an aggregation layer (station control / edge compute), a ring layer (bays / segments), and an endpoint layer (IED / remote I/O). The goal is not protocol complexity, but fault containment and clear acceptance points per layer.
H2-6. Selection Logic: When to Choose MRP, HSR, or PRP
Intent: Convert requirements into an executable Yes/No decision tree and a red-line table that prevents invalid choices.
Executable selection logic (priority order)
- Zero-loss required? If missing sequence must remain 0 under a single fault, the choice must be HSR or PRP.
- Dual networks allowed? If LAN A/B independence is possible, PRP is usually the cleanest zero-loss architecture.
- Endpoints changeable? If endpoints cannot participate, a boundary (RedBox) becomes mandatory for domain control.
- Replication headroom available? If bandwidth is tight, replication-based choices require traffic shaping or smaller domains.
- Brief interruption acceptable? If short interruption is acceptable, MRP can satisfy an ms switchover contract with lower steady overhead.
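The priority-ordered questions above can be frozen into one function so the decision is reproducible across projects. A sketch with illustrative parameter names (the inputs are the yes/no answers captured during requirements review):

```python
# Sketch: the selection logic above as an executable decision tree.
# Order matters: zero-loss dominates, then independence, then boundary, then headroom.
def select_protocol(zero_loss_required, dual_lans_possible,
                    endpoints_changeable, replication_headroom):
    if zero_loss_required:
        if not replication_headroom:
            # Red-line: copy-based designs without headroom must be re-scoped first.
            return "reduce domain or shape traffic before choosing HSR/PRP"
        base = "PRP" if dual_lans_possible else "HSR"
        return base if endpoints_changeable else base + " + RedBox boundary"
    return "MRP (ms switchover contract)"

print(select_protocol(True, True, True, True))    # PRP
print(select_protocol(True, False, False, True))  # HSR + RedBox boundary
print(select_protocol(False, False, True, True))  # MRP (ms switchover contract)
```

Keeping the tree in code (or a checklist derived from it) prevents the common failure where each stakeholder remembers a different priority order.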
Deliverable: Red-line conditions (forces a choice)
| Red-line condition | Why it forces the choice | Forced choice | Verification hook |
|---|---|---|---|
| Missing sequence must remain 0 under a single link fault | A reconverging topology can introduce a brief gap; continuity requires duplicated delivery. | HSR or PRP | Missing-seq counter = 0; duplicates bounded ≤ X/1k; discard correctness. |
| Dual independent networks are available (LAN A/B) with no shared choke points | PRP benefits most from independence; shared points can erase redundancy under common-cause failures. | PRP preferred | Inject faults on LAN A and LAN B separately; verify continuity + independence evidence. |
| Endpoints cannot be modified, but the domain still needs zero-loss behavior | A boundary function is required to keep duplication semantics inside the redundancy domain. | HSR/PRP with RedBox boundary | Verify boundary placement and discard logic at the domain edge; check storm containment. |
| Brief interruption is acceptable, but recovery must be within X ms | A reconverging ring can satisfy ms recovery with low steady overhead if detection and stability are controlled. | MRP | Measure switchover ≤ X ms; verify no repeated reconvergence (flap) after recovery. |
| Replication headroom is insufficient at expected peak traffic | Copy-based designs can overload links and increase queuing; the domain must be reduced or traffic must be controlled. | Avoid HSR/PRP unless bounded | Validate link utilization under fault; verify duplicates and queueing remain within X. |
Red-line usage: treat each condition as a gating check before committing to topology and equipment capability.
H2-7. Performance & Timing Budget: Switchover, Zero-Loss, and Overhead
Intent: Turn “fast / stable / zero-loss” into measurable budgets, testable hooks, and acceptance criteria.
Acceptance vocabulary (avoid metric ambiguity)
- MRP switchover time: time from fault onset to stable-forwarding (no repeated reconvergence). Target ≤ X ms.
- Zero-loss (HSR/PRP): during a single fault, missing sequence = 0 at the receiver. This assumes both paths deliver a valid copy within an effective window.
- Overhead: additional bandwidth/queue pressure caused by frame duplication; must be budgeted to keep utilization under a conservative ceiling (≤ X%).
- Zero-loss ≠ zero-jitter: duplication and discard processing can still add latency variation; jitter is a separate acceptance item.
MRP switchover budget (three segments that can be measured)
Switchover time can be decomposed into three segments: Detection → Recovery → Stable-forwarding. The purpose of budgeting is to pin each segment to an observable start/end so teams do not argue about the stopwatch.
| Segment | Observable start | Observable end | Dominant drivers | Target (X) | Measurement hook |
|---|---|---|---|---|---|
| Detection | Fault is introduced (link down / cable pull / port flap start) | Redundancy domain confirms fault (health check misses / status change becomes certain) | Health-check interval, debounce, flap filtering, domain size | ≤ X ms | Event timestamp + ring-health counter transition |
| Recovery | Fault is confirmed (detection ends) | Forwarding resumes on the alternate path (traffic reappears) | Role convergence, port unblock timing, FDB/forwarding update latency | ≤ X ms | Packet gap measurement + port-state change log |
| Stable-forwarding | Forwarding returns (recovery ends) | No repeated reconvergence / no periodic flap under steady load | Flap suppression, stable health checks, queue stability under load | For Y sec | Counter stability + repeated event absence |
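Given the three observable timestamps from the table (fault introduced, fault confirmed, traffic back), the per-segment budget check is a small calculation. A sketch with illustrative timestamps and hypothetical X budgets:

```python
# Sketch: decompose one injection into Detection and Recovery per the 3-segment table.
def switchover_breakdown(t_fault, t_confirmed, t_traffic_back):
    """All timestamps in ms from a common clock (event log + packet-gap hooks)."""
    detection = t_confirmed - t_fault
    recovery = t_traffic_back - t_confirmed
    return {"detection_ms": detection, "recovery_ms": recovery,
            "switchover_ms": detection + recovery}

def within_budget(breakdown, det_budget_ms, rec_budget_ms):
    """Check each segment against its own X, so teams see WHICH segment dominates."""
    return (breakdown["detection_ms"] <= det_budget_ms
            and breakdown["recovery_ms"] <= rec_budget_ms)

b = switchover_breakdown(t_fault=1000.0, t_confirmed=1008.0, t_traffic_back=1019.0)
print(b["switchover_ms"], within_budget(b, det_budget_ms=10, rec_budget_ms=15))  # 19.0 True
```

Budgeting per segment, rather than only the total, shows whether tuning effort belongs in detection (health-check timing) or recovery (convergence/forwarding update).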
HSR/PRP zero-loss preconditions and overhead budget (engineering estimates)
Zero-loss precondition (effective arrival window)
Zero-loss requires that at least one valid copy of each frame arrives without sequence gaps. The second copy may arrive later, but it must still fall within a receiver-side valid window so discard logic can operate deterministically. If the path arrival delta (Δt) grows beyond the window, observed behavior can shift from “duplicates” to “missing-seq”.
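The "duplicates turn into missing-seq" effect can be illustrated with a simplified model. This is not the IEC 62439-3 drop algorithm; it only assumes the receiver remembers a sequence entry for a fixed window after the first copy, and that an entry expiring changes how a late copy is handled:

```python
# Simplified model: how the second copy of a frame is treated as the path
# arrival delta (delta_t) grows relative to the receiver-side discard window.
def classify_second_copy(delta_t, window, first_copy_arrived):
    if first_copy_arrived:
        # Normal case: inside the window the copy is a clean discard;
        # beyond it the entry has expired and behavior becomes implementation-defined.
        return "duplicate discarded" if delta_t <= window else "late copy mishandled"
    # First copy lost in the fault: the late copy is the only one left.
    # If it falls outside the window it may be rejected as stale -> visible loss.
    return "delivered (zero-loss holds)" if delta_t <= window else "missing-seq"

print(classify_second_copy(0.002, 0.010, True))    # duplicate discarded
print(classify_second_copy(0.050, 0.010, False))   # missing-seq: delta_t beyond window
```

The practical lesson matches the text: budget Δt against the window during design, and measure it during fault injection rather than assuming symmetric paths.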
Overhead budget (practical ceiling)
- HSR: duplication pressure shows up as higher utilization and queue contention within the redundancy domain; size the domain so peak load stays under a conservative ceiling (≤ X%).
- PRP: duplication is split across LAN A and LAN B; keep each LAN under a utilization ceiling (≤ X%) and prove independence to avoid common-cause collapse.
- Jitter sources: replication-induced queuing, discard buffering, and fault-time traffic redistribution can add latency variation even when missing-seq stays 0.
| Mechanism | Copy scope | Overhead driver | Utilization ceiling | Validation counters |
|---|---|---|---|---|
| HSR | Ring domain (dual direction copies) | Frame duplication + domain path length + peak traffic | ≤ X% | Port utilization, queue drop, duplicates, missing-seq, discard stats |
| PRP | LAN A and LAN B (parallel delivery) | Dual transmission per frame + independence constraints | Each LAN ≤ X% | Per-LAN utilization, duplicates, missing-seq, fault-injection continuity evidence |
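The PRP row of the budget table reduces to simple arithmetic: every frame is sent on both LANs, so each LAN carries the full application load. A sketch with a hypothetical ceiling of 50% standing in for the "≤ X%" placeholder:

```python
# Sketch: check PRP duplication overhead against a per-LAN utilization ceiling.
def prp_lan_utilization(app_mbps, link_mbps):
    """PRP transmits every frame on both LANs: each LAN carries the full app load."""
    return app_mbps / link_mbps

def passes_ceiling(app_mbps, link_mbps, ceiling=0.5):
    """ceiling stands in for the placeholder 'each LAN <= X%'."""
    return prp_lan_utilization(app_mbps, link_mbps) <= ceiling

# 30 Mbps of application traffic on 100 Mbps links: 30% per LAN -> within ceiling.
print(prp_lan_utilization(30.0, 100.0), passes_ceiling(30.0, 100.0, ceiling=0.5))
# 60 Mbps would exceed the hypothetical 50% ceiling on each LAN.
print(passes_ceiling(60.0, 100.0, ceiling=0.5))
```

For HSR the same check applies inside the ring domain, but the driver is duplication plus path length, so utilization should be measured per port under peak load rather than derived from the application rate alone.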
H2-8. Configuration & Interoperability Blueprint (Commissioning Without Chaos)
Intent: Prevent “works but unstable” outcomes by aligning roles, boundaries, and discard/counter definitions before acceptance testing.
Parameter alignment (the minimum set that prevents chaos)
MRP (concept-level alignment)
- Roles: Manager/Client assignment is consistent with the intended domain.
- Ring ports: ring membership is explicit; boundary ports are not accidentally included.
- Health check: interval and timeout definitions are consistent; flap filtering is defined.
- Boundary principle: avoid overlapping loop-control responsibilities on the same fault domain.
HSR / PRP (boundary and discard correctness)
- RedBox boundary: the redundancy domain must not silently expand into a standard LAN.
- Duplicate discard window: window/timeout definitions align with the expected Δt; counters must match the same definition.
- Broadcast/multicast sanity: check for amplification risk inside the domain; verify storm containment via counters and fault injections.
Deliverable: Commissioning parameter list (10–15 fields to confirm)
| Field | Role / scope | Default direction | Risk if wrong |
|---|---|---|---|
| MRP role (MRM/MRC) | Per ring domain | Single manager per domain; document boundaries | Split-brain behavior or unstable reconvergence |
| Ring ports (membership) | Switch ports inside domain | Make membership explicit; keep boundary ports excluded | Fault domain expands; unexpected flap / storm spread |
| Health-check interval/timeout definition | Per domain (MRP) | Start conservative; avoid false positives under load | False fault detection → reconvergence storms |
| RedBox boundary placement | HSR/PRP domain edge | Keep duplication semantics inside the defined domain | Domain “leaks” into LAN → duplicates and storm amplification |
| Discard window / timeout definition | Receiver / boundary discard logic | Align with expected Δt; keep counters consistent with the same definition | False missing-seq or inflated duplicates; misleading acceptance |
| Broadcast/multicast amplification check | Domain-wide sanity | Baseline counters without fault; verify bounded growth under fault | Storm risk, queue drop, and “works but unstable” symptoms |
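Several rows in the parameter list fail silently when two devices hold different values for a definition that must agree (health-check timing, discard window, domain identity). A sketch of a pre-injection diff over such must-agree fields (field names and values are illustrative):

```python
# Sketch: diff must-agree commissioning fields across devices before fault injection.
def mismatched_fields(devices):
    """devices: {name: {field: value}}. Return fields whose values disagree."""
    all_fields = set().union(*(d.keys() for d in devices.values()))
    bad = {}
    for field in sorted(all_fields):
        values = {name: d.get(field) for name, d in devices.items()}
        if len(set(values.values())) > 1:
            bad[field] = values        # keep the per-device values as evidence
    return bad

devs = {
    "sw1": {"health_check_ms": 20, "discard_window_ms": 400, "domain_id": "ringA"},
    "sw2": {"health_check_ms": 50, "discard_window_ms": 400, "domain_id": "ringA"},
}
print(sorted(mismatched_fields(devs)))   # ['health_check_ms']
```

Note this diff only covers fields that must be identical; role fields like MRM/MRC are expected to differ per device and need a separate "exactly one manager per domain" check.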
Commissioning blueprint (the order that prevents unstable rollouts)
- Confirm topology boundary: lock the redundancy domain and document the boundary ports / RedBox locations.
- Assign roles: apply MRP or HSR/PRP roles per domain; ensure endpoints match the intended participation model.
- Configure health/discard definitions: align health check timing (MRP) and discard window/counter definitions (HSR/PRP).
- Baseline counters (no fault): record utilization, duplicates, missing-seq, queue drops as the “clean” reference.
- Inject single faults: cable pull / port down / device reboot inside the domain; observe acceptance hooks.
- Accept: prove switchover ≤ X ms or missing-seq = 0, and verify post-fault stability over Y seconds.
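The "baseline, inject, accept" sequence above hinges on comparing counters against the clean reference. A sketch of the stability gate as a bounded-growth check (counter names and bounds are illustrative placeholders):

```python
# Sketch: post-fault stability gate as a bounded-growth check against the baseline.
def stability_gate(baseline, after, max_growth):
    """baseline/after: {counter: value}. Pass per counter if growth stays in bound."""
    return {k: (after.get(k, 0) - baseline.get(k, 0)) <= max_growth.get(k, 0)
            for k in baseline}

base = {"reconvergence_events": 0, "duplicates": 10, "missing_seq": 0}
post = {"reconvergence_events": 1, "duplicates": 14, "missing_seq": 0}
# Hypothetical bounds: one reconvergence allowed, duplicates may grow by 5, zero loss.
result = stability_gate(base, post, {"reconvergence_events": 1,
                                     "duplicates": 5, "missing_seq": 0})
print(result)   # every counter within its bound -> stability gate passes
```

Recording the bounds next to the baseline snapshot keeps the acceptance evidence self-describing: the report shows what was allowed, not only what was observed.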
H2-9. Failure Modes & Hardening: The Top 12 Ways Redundancy Fails in Real Plants
Intent: Convert common plant failures into a “check-first” playbook with symptoms, fast checks, and hardening actions.
Symptom vocabulary (what operators actually see)
- Drop / missing-seq: sequence gaps (HSR/PRP) or cyclic data loss windows (MRP).
- Duplicate burst: duplicates spike because discard fails or the redundancy domain “leaks” into a LAN.
- Latency spike: transient queuing and delay variance (duplication, reconvergence, or storm pressure).
- Flap / reconvergence loop: repeated role/state transitions after a fault or under link jitter.
Deliverable: Symptom → root cause → quick check → fix (Top 12)
This table is designed as the source pool for FAQs: each row can become a standalone four-line FAQ item later.
| Symptom (observable) | Likely root cause | Quick check (10–30 min) | Fix / hardening action |
|---|---|---|---|
| Flap: repeated reconvergence events every few seconds | Link jitter / port flap not filtered; health-check debounce too aggressive | Correlate link-state logs with reconvergence timestamps; check flap counters and health-check miss patterns | Add flap suppression; align health-check interval/timeout; re-accept with “stable-forwarding ≥ X sec” |
| Flap: recovers, then drops again under load | Queue instability after reconvergence; peak utilization too high during reroute | Compare utilization and queue-drop counters before/after fault; check latency spikes during recovery window | Lower peak load ceiling (≤ X%); isolate critical traffic; validate recovery under near-worst-case load |
| Drop: cyclic data “holes” during a single-link break (MRP) | Switchover budget dominated by detection; acceptance stopwatch start/end not aligned | Measure Detection vs Recovery separately using event timestamps + packet gap; check which segment dominates | Re-budget with a 3-segment table; tune health-check definition; accept switchover ≤ X ms |
| Duplicate storm: duplicates explode after adding a RedBox | Redundancy boundary leak into standard LAN; duplication semantics extended unintentionally | Confirm domain boundary ports; check whether duplicates appear outside the intended redundancy domain | Re-define and enforce domain edge; ensure discard/translation occurs at the boundary; re-baseline counters |
| Duplicate burst only during faults (HSR/PRP) | Discard window/timeout definition mismatched to path arrival delta (Δt) | Measure Δt with a controlled sequence stream; compare against receiver-side discard window definition | Align discard window/timeout; ensure counters follow the same definition; accept duplicates ≤ X/1k |
| MAC flapping: learning oscillates between two paths | Dual-path visibility causes unstable learning; boundary not well defined for “which side learns” | Observe MAC move events during faults; verify whether both directions are presented to learning in the same domain | Constrain learning scope at the boundary; validate with a fault-injection case for “MAC stable for Y min” |
| Miswiring: adding a spur creates a hidden loop | Ring domain accidentally expanded; boundary ports included by mistake | Validate physical cabling against the topology map; confirm “domain membership” port list matches reality | Implement wiring validation step; enforce boundary rules; add a miswiring acceptance test (optional) |
| Latency spike: control loop becomes unstable though missing-seq stays 0 | Path asymmetry (Δt) too large; queue contention during duplication/discard increases jitter | Record latency histogram during faults; correlate jitter spikes with utilization and discard events | Budget Δt and utilization ceiling; accept jitter within X; separate “zero-loss” from “timing stability” |
| Drop at high load only (HSR/PRP) | Duplication pushes ports over safe utilization; transient queue drops create missing-seq | Check queue drop and utilization during peak; verify missing-seq aligns with drop counters | Reduce peak load ceiling (≤ X%); validate under worst-case load with missing-seq = 0 |
| “Looks stable” but duplicates slowly grow over hours | Counter definition mismatch; duplicates measured with inconsistent windows/denominators | Standardize denominator (per 1k frames or per second); compare logs across nodes using the same window | Align metric definition; add a stability acceptance criterion for counters over Y hours |
| Fault injection passes in lab, fails in plant | Real topology differs (extra T-branch / boundary port); domain is larger than assumed | Compare plant cabling to topology map; verify “domain membership” and RedBox locations match lab model | Update topology assumptions; re-run acceptance with the same measurement hooks at plant scale |
| Interop: “same standard” but behavior differs after a fault | Parameter definition mismatch (role boundary, discard window, health timing) across vendors | Freeze and compare the 10–15 commissioning fields; validate counters use the same time window and denominator | Standardize definitions and acceptance hooks; re-accept with one common measurement plan |
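The "counter definition mismatch" and interop rows share one fix: normalize every node's duplicate count to the same denominator and window before comparing. A minimal sketch of the per-1k normalization:

```python
# Sketch: normalize duplicate counts to one definition (per 1k delivered frames)
# so counters from different nodes and vendors can be compared honestly.
def duplicates_per_1k(duplicates, delivered_frames):
    if delivered_frames == 0:
        return 0.0                     # no traffic in the window: nothing to rate
    return 1000.0 * duplicates / delivered_frames

# Two nodes with the same raw count but very different traffic volumes:
print(duplicates_per_1k(40, 1_000_000))   # 0.04 per 1k: healthy
print(duplicates_per_1k(40, 20_000))      # 2.0 per 1k: may breach a "<= X/1k" bound
```

The same discipline applies to the measurement window: compare counters sampled over identical intervals, or the slow-growth symptom in the table stays invisible.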
H2-10. Verification Plan: Fault Injection, Measurement, and Pass Criteria
Intent: Provide an executable verification loop: defined injections, measurement hooks, artifacts, and pass criteria placeholders.
Must-test cases (single-fault first)
- Single link break: validate switchover (MRP) or continuity (HSR/PRP).
- Port flap: prove stability filters prevent reconvergence loops.
- Switch power loss: verify domain re-forms and remains stable after recovery.
- Node reboot: validate re-join behavior does not trigger storms or long instability.
- Miswiring (optional): demonstrate boundary rules and wiring validation detect hidden loops.
Measurement hooks (cross-checking prevents false “passes”)
MRP measurement anchors
- Application-level gap: missing cycles or packet-gap timing at the receiver.
- Link-state events: port link-down/up timestamps at the injection point.
- Domain counters/logs: reconvergence count and “stable-forwarding” evidence over Y seconds.
HSR / PRP measurement anchors
- Sequence continuity: missing-seq = 0 under single faults.
- Duplicates/Discard: duplicates are bounded and discard counters match definitions.
- Utilization/Queue health: duplication overhead does not push ports beyond the X% ceiling.
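The sequence-continuity anchor can be automated from the receiver's delivered stream. A sketch assuming a monotonically increasing sequence with no wrap-around (real PRP/HSR sequence fields wrap and need modular handling):

```python
# Sketch: prove "missing-seq = 0" from the receiver's delivered sequence numbers.
# Assumption: monotonically increasing sequence, no wrap-around, for brevity.
def missing_sequences(delivered):
    """Return every sequence number absent between the lowest and highest seen."""
    seen = sorted(set(delivered))
    gaps = []
    for prev, cur in zip(seen, seen[1:]):
        gaps.extend(range(prev + 1, cur))
    return gaps

print(missing_sequences([1, 2, 3, 4, 5]))   # []      -> zero-loss proof holds
print(missing_sequences([1, 2, 5, 6]))      # [3, 4]  -> continuity broken
```

Running this over a capture taken across the injection window, and archiving both the capture and the empty gap list, gives the auditable "missing-seq = 0" evidence the pass criteria require.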
Deliverable: Test matrix (Test / Setup / Injection / Metrics / Pass criteria)
| Test | Setup | Injection | Metrics (hooks) | Pass criteria (X placeholders) |
|---|---|---|---|---|
| Single link break | Near-worst-case traffic; baseline counters recorded | Cable pull or forced port-down at defined point | MRP: switchover gap + link-state timestamps + reconvergence logs. HSR/PRP: missing-seq + duplicates + utilization | Switchover ≤ X ms; continuous loss ≤ X frames (MRP). missing-seq = 0; duplicates ≤ X/1k (HSR/PRP) |
| Port flap resilience | Baseline under steady load; flap counters cleared | Toggle port state for N cycles at a defined rate | Reconvergence count, stability window, latency histogram | No reconvergence loop; stable-forwarding ≥ X min; latency spike ≤ X |
| Switch power loss | Define the powered node; record steady-state counters and utilization | Power off/on the node; record event timestamps | Recovery time; post-recovery stability; duplicates/missing-seq (HSR/PRP) | Re-formation ≤ X ms; stable ≥ X min; duplicates bounded; missing-seq = 0 (as applicable) |
| Node reboot re-join | Clear counters; define “re-join success” evidence | Reboot a node while traffic runs; capture logs and pcap | Reconvergence events; duplicates; missing-seq; utilization and queue health | Re-join without storm; stability ≥ X min; duplicates ≤ X/1k; missing-seq = 0 |
Recommended evidence bundle per test run: event timestamps, pcap snippet, counter snapshots (before / during / after), and a signed pass/fail line with X thresholds.
H2-11. Applications (Power/Rail) — Deployment Patterns
Focus: turn redundancy into deployable domains (core/access/edge), with acceptance metrics that match control-loop vs monitoring sensitivity. Avoid expanding the redundancy domain into ordinary LAN segments.
A) Why redundancy is required in power/rail systems
- Control-loop links (protection, interlocking, motion/traction timing) are dominated by continuity and jitter spikes, not only packet-loss.
- Monitoring/forensics links (SCADA logs, event uploads, diagnostics) can sometimes tolerate brief interruption, but require stable recovery and auditability.
- Redundancy should protect a defined fault domain: a link cut, a port flap, or a device outage—without expanding loops outside the intended ring.
B) Acceptance metrics that match link criticality (X = project-specific threshold)
| Link type | What must be protected | Pass criteria (examples) |
|---|---|---|
| Control-loop / cyclic traffic | Continuity + bounded arrival-time spread (Δt), stable scheduling after a fault | missing-seq = 0 within window X; Δt(pathA-pathB) ≤ X; latency spike ≤ X; stable ≥ X min |
| Monitoring / diagnostics | Recover fast and stay stable; counters + timestamps for post-mortem | switchover ≤ X ms; consecutive loss ≤ X frames; duplicates ≤ X/1k; stable ≥ X min; event timestamps present |
Note: “zero packet loss” is necessary but not sufficient for stability—unbounded Δt or jitter spikes can still destabilize a control loop.
C) Layered deployment: core ring, access ring, edge nodes (keep redundancy domains bounded)
- Core ring: isolates major fault domains; demands stable recovery and clean event logging for post-fault analysis.
- Access ring: local ring for bays/areas; limits wiring variability; enables repeatable fault-injection acceptance.
- Edge IO: endpoints should not propagate redundancy semantics into the ordinary LAN; use a clear boundary (e.g., dual-homing/border device) to prevent storms.
Tip: keep “redundancy semantics” inside the ring domain; enforce boundaries at the entry point to prevent duplicate storms and MAC flapping.
H2-12. IC / Switch Selection Logic — Capability Checklist + Example Material Numbers
Objective: select silicon and switch platforms that can be proven to meet redundancy acceptance criteria. The checklist below is organized as a funnel: goal → topology → implementation point → observability → verification hooks.
1) Selection funnel (avoid “feature-list shopping”)
- Goal: zero-loss (HSR/PRP) vs ms switchover (MRP).
- Topology: ring / dual-ring / dual-LAN / boundary insertion (RedBox-like boundary).
- Implementation point: end node vs boundary device (where copy/discard happens).
- Observability: domain state + counters + event timestamps (must be measurable).
- Verification hooks: fault injection + pass criteria X (must be reproducible).
2) Capability checklist (what must be provable)
- MRP (ms switchover): ring role/state visibility, port event timestamps, deterministic reconvergence behavior, post-fault stability logging.
- HSR/PRP (zero-loss): duplication + discard placement is explicit (end node or boundary), duplicate counters, missing-seq counters, and bounded Δt behavior under load.
- Non-negotiable observability: domain state + drop/duplicate counters + event timestamps that can be exported as evidence.
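The "duplication + discard placement" item is easiest to reason about with the receiver-side algorithm in front of you. A minimal software sketch of PRP/HSR-style duplicate discard (real silicon uses a bounded hardware table; the window size and class shape here are illustrative assumptions):

```python
from collections import OrderedDict

class DuplicateDiscard:
    """Accept the first copy of each (source, sequence) pair, drop later copies.
    An LRU-evicted dict bounds memory the way a hardware discard table does."""
    def __init__(self, window: int = 1024):
        self.window = window
        self.seen: OrderedDict = OrderedDict()
        self.duplicates = 0   # the counter you would export as acceptance evidence

    def accept(self, source: str, seq: int) -> bool:
        key = (source, seq)
        if key in self.seen:
            self.duplicates += 1
            return False      # drop: a copy was already delivered upward
        self.seen[key] = True
        if len(self.seen) > self.window:
            self.seen.popitem(last=False)   # evict oldest entry
        return True
```

This also shows why the window size matters for acceptance: an undersized table forgets old entries and re-accepts late duplicates, which surfaces as duplicates > X/1k under load.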
3) Capability scoring table (Must / Should / Nice-to-have)
| Capability item | Evidence to request | Priority | Acceptance hook (X) |
|---|---|---|---|
| Domain state visibility (ring state / role) | register/counter list; CLI/API objects; event log field definitions | Must | stable ≥ X min after fault |
| Duplicate handling (where discard happens) | explicit block diagram; mode description (end node vs boundary); counter semantics | Must (HSR/PRP) | duplicates ≤ X/1k |
| Event timestamping (port up/down, fault window) | timestamp format, resolution, export method, monotonicity guarantees | Must | switchover ≤ X ms (MRP) |
| Line-rate behavior under duplication overhead | throughput vs load evidence; queue policy notes; counter saturation behavior | Should | port utilization ≤ X% |
A bare “supports protocol X” claim is not actionable: unless counters, timestamps, and domain state can be exported, the acceptance criteria above cannot be tested.
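The event-timestamping row implies a concrete computation: switchover time from the fault-detected timestamp to the first valid packet, aggregated as P95 over Y injections. A minimal sketch in Python (nearest-rank percentile; function names are assumptions):

```python
import math

def switchover_ms(fault_ts: float, packet_ts: list[float]) -> float:
    """Switchover time = fault detected -> first valid packet after reconvergence.
    Timestamps in seconds from the same (monotonic) clock; result in ms."""
    after = [t for t in packet_ts if t >= fault_ts]
    if not after:
        raise ValueError("no traffic observed after the fault")
    return (min(after) - fault_ts) * 1000.0

def p95(samples: list[float]) -> float:
    """P95 over Y fault injections (nearest-rank method)."""
    s = sorted(samples)
    return s[math.ceil(0.95 * len(s)) - 1]
```

This is also why the checklist demands monotonicity guarantees for exported timestamps: a clock step during the fault window silently corrupts the switchover measurement.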
4) Example material numbers (for engineering qualification & evidence collection)
The following are example part numbers that explicitly call out redundancy capabilities (MRP / HSR / PRP) in public documentation. Final selection must be validated against the project’s pass criteria (X) and certification requirements.
- Switch IC with redundancy managed modes: Microchip LAN9645xF family (example orderable PN: LAN96459ST-I/8NW)
- High-bandwidth switch family with HSR/PRP variants: Microchip LAN9694RED / LAN9696RED / LAN9698RED (TSN + HSR/PRP redundancy variants)
- Switch family that lists HSR/PRP in feature set: Microchip LAN9696 / LAN9698 (family pages describe HSR/PRP redundancy)
- Enterprise/industrial switching family that lists integrated HSR/PRP: Broadcom BCM53570 series (integrated HSR/PRP is called out in product information)
- Programmable PRP endpoint reference stack (CPU + PHY example): TI AM3359 + TI TLK110 (PRP reference design for substation automation)
Practical procurement rule: require a vendor-provided counter map (drop/duplicate/missing-seq), a state model (domain role/state), and a repeatable fault-injection demo that matches the project’s pass criteria (X).
H2-13. FAQs (MRP / HSR / PRP) — Field Troubleshooting Closure
Each FAQ is a measurable closure: symptom → counters/probes → fix → pass criteria. Use consistent windows and thresholds (X) to avoid “it feels better” conclusions.
Suggested metric buckets (use the same time window everywhere): W = X seconds for counters; p99 for latency/jitter; Δt = arrival-time difference between redundant paths; duplicates = redundant copies received and then discarded.
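Fixing the bucket shape in code is one way to enforce "the same time window everywhere" across FAQ closures. A minimal sketch; the window length, field names, and nearest-rank p99 are illustrative assumptions, not a mandated format:

```python
import math

WINDOW_S = 10.0   # W: one window length reused for every counter (project X)

def p99(samples: list[float]) -> float:
    """Nearest-rank p99; apply to latency/jitter samples from one window W."""
    s = sorted(samples)
    return s[math.ceil(0.99 * len(s)) - 1]

def metric_bucket(latency_ms: list[float], dup: int, frames: int) -> dict:
    """One FAQ metric bucket: same window W, p99 latency, duplicates per 1k."""
    return {
        "window_s": WINDOW_S,
        "p99_latency_ms": p99(latency_ms),
        "dup_per_1k": (dup / frames * 1000.0) if frames else 0.0,
    }
```

Emitting the same bucket before and after a fix turns "it feels better" into a before/after diff against the X thresholds.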