Security Offload for Industrial Ethernet: MACsec & DTLS/TLS
This page turns “industrial Ethernet security” into an executable engineering decision: choose the right layer (MACsec vs DTLS/TLS), define key/trust anchoring, and verify determinism and observability with measurable pass criteria.
The outcome is a deployable checklist—integrate, validate, and operate security offload without breaking latency/jitter budgets or losing field forensics.
H2-1 · Definition & Page Boundary (What offload means)
Security offload turns “add encryption” into an engineering decision about where protection lives, what it costs in latency/jitter, and how it is verified and operated. This chapter fixes the page scope and the exact outputs expected from the rest of the guide.
Card A · Decisions this page enables
- Layer choice: pick MACsec (link) vs DTLS/TLS (transport) by trust boundary and determinism constraints.
- Placement choice: select integrated, coprocessor, or SoC acceleration based on throughput, jitter sensitivity, and observability needs.
- Key & trust plan: define what keys/certs exist, where they are stored, and when rotation must happen.
- Bring-up & validation: run a test ladder that includes negative tests (replay, expiry, rollover, resource exhaustion).
- Field operations: standardize telemetry (counters/events/log fields) for forensics and fast triage.
Offload comes in three deployment forms: integrated, coprocessor, and SoC acceleration.
Card B · Not covered here (to avoid cross-page overlap)
- Cryptography math or algorithm tutorials (this page stays engineering-focused).
- Full TSN parameter deep dives (Qbv/Qci/Qav, GCL tables) beyond “what changes latency/jitter”.
- PTP/SyncE standard details beyond “where timestamps are impacted”.
- Remote management protocol training (LLDP/NETCONF/etc.) beyond “telemetry hooks”.
- Compliance text walkthroughs; only test evidence and engineering checks are included.
Card C · Where to go next (sibling pages)
Use these pages for deeper specialization; this page only references their constraints.
- Timing: PTP hardware timestamping / SyncE holdover and jitter templates.
- Determinism: TSN switch features and TSN parameterization workflows.
- Operations: Remote management and link-health black-box telemetry.
30-second self-check (scope alignment)
- Need hop-by-hop protection across a trusted L2 domain → MACsec is usually the first candidate.
- Need end-to-end sessions across gateways or routed domains → DTLS/TLS is usually required.
- System is jitter-sensitive (TSN/controls) → placement and observability must be decided before implementation.
H2-2 · Threat Model & Security Goals (Industrial Ethernet reality)
The goal is not to teach cryptography. The goal is to translate real industrial threats into security goals and then into observable signals that can be verified in bring-up and diagnosed in the field.
Card A · Threats seen on factory floors (grouped by where they enter)
- Unknown device plugged into a spare port; traffic sniffing or injection attempts.
- Port mirroring (SPAN) or diagnostics features expose sensitive payloads if not isolated.
- Replay-style behavior: repeated frames cause duplicated control actions or state drift.
- Key material copied between devices; loss of uniqueness breaks the trust boundary.
- Rollback to older firmware re-enables known vulnerabilities or invalidates measurements.
- Resource exhaustion (sessions/queues/CPU) turns security into downtime.
- Certificate expiry or clock drift triggers reconnect storms and cyclic jitter.
- Configuration mismatch (cipher suites / MKA params) creates “link up but no traffic”.
- Key rotation windows not coordinated across a fleet cause network-wide flaps.
Card B · Security goals mapped to observables (what to measure, not what to believe)
Observable starter pack (minimum fields for bring-up + field triage)
- MACsec: mka_state, key_version, icv_fail_count, replay_drop_count, pn_discontinuity_count, sak_rollover_events
- DTLS/TLS: handshake_time_ms, alert_code, cipher_mismatch_count, session_reuse_rate, reconnect_rate, session_count_peak
- Trust/boot: firmware_version, measurement_id, rollback_event, secure_time_valid, key_store_health
- Determinism: latency_budget_delta, jitter_peak, queue_depth_peak, cpu_load_peak
H2-3 · Layer Choice: MACsec vs DTLS/TLS (Decision Tree)
The layer decision is driven by trust boundaries, scope of protection, and determinism constraints. The decision tree below maps common industrial questions to a recommended security layer and the first engineering checks to run.
Card A · Decision tree (Yes/No path → recommendation)
- Hop-by-hop protection inside a controlled L2 domain → MACsec is often the first candidate.
- End-to-end sessions across gateways/routed domains → DTLS/TLS is typically required.
- TSN/cyclic control jitter sensitivity → prefer solutions with stable datapaths and controlled session behavior.
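The decision path above can be written down as a small function so that the choice is reviewable and testable. This is a minimal sketch, not a complete policy: the `LinkContext` fields, the `recommend_layer` name, and the "undecided" outcome for an uncontrolled, non-routed segment are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class LinkContext:
    # Hypothetical inputs; names are illustrative, not from a standard.
    crosses_routed_domain: bool  # traffic passes gateways or routed hops
    l2_domain_controlled: bool   # every hop sits inside a managed L2 domain
    jitter_sensitive: bool       # TSN / cyclic control traffic is present


def recommend_layer(ctx: LinkContext) -> dict:
    """Return a first-candidate layer plus the first checks to run."""
    checks = []
    if ctx.crosses_routed_domain:
        layer = "DTLS/TLS"
        checks.append("handshake_time_ms and reconnect_rate under load")
    elif ctx.l2_domain_controlled:
        layer = "MACsec"
        checks.append("mka_state and SA install status on both ends")
    else:
        # Assumption: without a defined trust boundary, no layer is recommended yet.
        layer = "undecided"
        checks.append("define the trust boundary before choosing a layer")
    if ctx.jitter_sensitive:
        checks.append("decide placement and observability before implementation")
    return {"layer": layer, "first_checks": checks}
```

For example, `recommend_layer(LinkContext(False, True, True))` yields MACsec with two first checks. The output is a candidate only; the combination cases in Card B still apply.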
Card B · When to combine (division of labor)
H2-4 · Key Material & Trust Anchors (CAK/SAK/PSK/Cert) + Storage
Security offload succeeds only if key material is treated like a managed engineering asset: what exists, where it lives, when it rotates, and what evidence is logged. This chapter defines a key inventory and a lifecycle flow.
Card A · Key inventory (treat keys like a controlled BOM)
| Artifact | Scope | Update trigger | Storage location |
|---|---|---|---|
| CAK / CKN (MACsec) | L2 security domain | Periodic + compromise suspicion | Secure element / TPM preferred |
| SAK (MACsec) | Per secure association | Rollover events / policy | On-chip engine (volatile) + logs |
| PSK (TLS/DTLS option) | Per device / per service | Rotation window + fleet policy | Secure element strongly recommended |
| Certificate + private key | Per device identity | Expiry + rotation policy | TPM/SE; avoid exportable keys |
| Root trust anchor | System-wide policy | Rare; controlled updates only | ROM/OTP or protected storage |
| Secure time validity | Fleet behavior | Boot + periodic checks | Secure RTC / signed time source |
Card B · Rollover policy checklist (prevent fleet-wide flaps)
- Time sanity gate: define acceptable drift and failure behavior when secure time is invalid (do not “retry storm”).
- Staggering gate: avoid synchronized rotation across a fleet; enforce rolling windows per site/segment.
- Overlap gate: keep an overlap window where old and new credentials can both work during transition.
- Rollback gate: define a safe fallback if rotation fails (without reusing compromised material).
- Evidence gate: log key_version, rollover_reason, error_code, and peer identity mismatch for forensics.
- Rate gate: limit reconnection rate and handshake concurrency to protect determinism and availability.
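The staggering and overlap gates can be sketched in a few lines. This is one possible approach under stated assumptions: hashing `device_id` into the rolling window is an illustrative way to avoid synchronized rotation, and the "current or previous version" overlap rule is a simplification. The function names are hypothetical.

```python
import hashlib


def rollover_offset_s(device_id: str, window_s: int) -> int:
    """Map a device to a stable offset (seconds) inside a rolling window.

    A hash spreads devices across the window without central coordination.
    """
    digest = hashlib.sha256(device_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % window_s


def accepts_key_version(presented: int, current: int, in_overlap: bool) -> bool:
    """Overlap gate: the previous version is accepted only during overlap."""
    if presented == current:
        return True
    return in_overlap and presented == current - 1
```

Each device starts its rotation at `window_start + rollover_offset_s(device_id, window_s)`, so a site-wide window of one hour spreads rotations over that hour instead of firing them at the same instant.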
H2-5 · MACsec Offload Pipeline (TX/RX, latency, counters)
This chapter treats MACsec as an engineering pipeline: which blocks sit on TX/RX, what can break first, and which counters prove the root cause. The goal is fast bring-up triage: separate key-management issues from datapath/offload issues.
Card A · TX/RX pipeline (block → failure point → observable)
- Does: selects which frames must be secured and which may bypass.
- Breaks first: wrong VLAN/ethertype rules → “traffic disappears” or plaintext leaks.
- Observe: protected/bypass frame counts, policy hit-rate.
- Does: binds a frame to an SA and handles SCI for peer identity.
- Breaks first: SA not installed / wrong SCI mode → link up but no valid decrypt.
- Observe: SA installed flag, SCI mismatch, MKA state.
- Does: increments packet number (PN) for replay defense.
- Breaks first: PN desync / rollover policy mismatch → sudden drops after “working” period.
- Observe: PN high-watermark, rollover count, replay window drops.
- Does: encrypts payload (optional) and appends ICV for integrity.
- Breaks first: wrong key / wrong mode → ICV fail on RX, silent drops.
- Observe: ICV fail counters, decrypt fail counters.
- Does: drops frames outside the configured replay window.
- Breaks first: window too tight for burst/queueing → drops only under load.
- Observe: replay_drop, out_of_window, reorder stats.
- Does: moves frames between MAC, offload engine, and memory queues.
- Breaks first: descriptor starvation / FIFO backpressure → latency spikes or throughput collapse.
- Observe: DMA underrun/overrun, FIFO watermark, queue drops.
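The "window too tight for burst/queueing" failure is easier to reason about with a toy model of the replay check. The sketch below is a simplified illustration, not a faithful model of any engine: it keeps only the highest packet number seen and a window size, and it does not detect duplicates inside the window. Real hardware follows IEEE 802.1AE and vendor-specific behavior.

```python
class ReplayWindow:
    """Simplified replay check based on packet number (PN) and a window size."""

    def __init__(self, window: int):
        self.window = window          # 0 means strict in-order acceptance
        self.next_pn = 1              # one past the highest PN accepted so far
        self.replay_drop_count = 0    # mirrors the counter named in this page

    def accept(self, pn: int) -> bool:
        lowest_ok = max(1, self.next_pn - self.window)
        if pn < lowest_ok:
            # Frame is older than the window allows: count it and drop it.
            self.replay_drop_count += 1
            return False
        self.next_pn = max(self.next_pn, pn + 1)
        return True
```

With `window=2`, a frame that arrives two positions late is still accepted, while one that arrives three positions late is dropped. That is why reordering depth under load, not average traffic, should size the window.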
Card B · Counter-to-symptom map (symptom → first check → fix direction)
| Symptom | First check | Likely direction |
|---|---|---|
| Link up, traffic blocked | MKA state, SA installed | Key mgmt not ready or SA lookup/SCI mismatch |
| Traffic flows, but RX drops under load | replay_drop, out_of_window | Replay window too tight for queueing/bursts |
| One direction works, the other fails | ICV fail, SA selection per direction | Wrong key/SA bound for one lane or policy mismatch |
| Works for minutes, then collapses | PN high-watermark, rollover count | PN desync or rollover policy mismatch |
| Latency spikes / jitter increases | DMA underrun/overrun, FIFO watermark | Descriptor starvation or backpressure in queues |
| Plaintext leak suspected | bypass hit-rate, policy counters | Classification rules incomplete or wrong traffic tagging |
H2-6 · DTLS/TLS Offload: Handshake, Sessions, and Failure Modes
DTLS/TLS offload is operationally defined by handshake time, session stability, and failure evidence. The sections below focus on the fastest path to explain “connects but times out” and “periodic reconnect storms”.
Card A · Handshake critical path (time-budget view)
Cap first: chain length, validation policy, time sanity gate.
Observe: handshake_time split, alert_code, time_valid flag.
Cap first: concurrent handshakes, crypto engine queue depth.
Observe: crypto_queue depth, handshake_time, fallback-to-software indicator.
Cap first: cipher list, strict profile per segment, version pinning.
Observe: cipher mismatch counter, alert_code, negotiation retries.
Cap first: session count, cache size, resumption policy and timers.
Observe: session_reuse_rate, cache_hit, reconnect_rate peaks.
Card B · Failure modes (symptom → check → fix)
| Symptom | First check | Fix direction |
|---|---|---|
| Connects, but periodic timeouts | handshake_time, reconnect_rate peaks | Cap retry rate, stabilize timers, reduce concurrent handshakes |
| Reconnect storm after link blip | session_reuse_rate, cache_hit | Enable/repair resumption cache, enforce staggering and backoff |
| Handshake slow → control loop stalls | handshake_time split, crypto_queue depth | Limit concurrency, pin cipher profile, ensure hardware crypto is used |
| Works with one peer, fails with another | cipher mismatch, alert_code | Lock compatible cipher list, align versions and policy sets |
| Sudden failures after time event | time_valid flag, alert_code | Implement secure time gate; define safe degraded behavior |
| DTLS unstable on lossy links | retransmit count, handshake_time variance | Tune retry/backoff, enlarge reorder tolerance, reduce burstiness |
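Several fix directions in the table come down to two mechanisms: backing off retries and capping concurrent handshakes. The following is a minimal sketch of both, assuming capped exponential backoff with full jitter and a simple in-flight counter. The class and parameter names are hypothetical, and the default values are placeholders rather than recommendations.

```python
import random
from typing import Optional


def backoff_delay_s(attempt: int, base_s: float = 0.5, cap_s: float = 60.0,
                    rng: Optional[random.Random] = None) -> float:
    """Capped exponential backoff with full jitter.

    The delay is drawn uniformly from [0, min(cap_s, base_s * 2**attempt)].
    """
    ceiling = min(cap_s, base_s * (2 ** attempt))
    rng = rng or random.Random()
    return rng.uniform(0.0, ceiling)


class HandshakeLimiter:
    """Caps the number of handshakes in flight and counts rejections."""

    def __init__(self, max_concurrent: int):
        self.max_concurrent = max_concurrent
        self.in_flight = 0
        self.rejected = 0

    def try_start(self) -> bool:
        if self.in_flight >= self.max_concurrent:
            self.rejected += 1   # caller should back off instead of retrying now
            return False
        self.in_flight += 1
        return True

    def finish(self) -> None:
        self.in_flight = max(0, self.in_flight - 1)
```

Jitter matters after a link blip: without it, every peer retries on the same schedule and the reconnect storm repeats at each backoff step.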
H2-7 · Measured Boot & Secure Update Hooks (why security offload needs it)
Security offload must be gated by a provable device state. This chapter covers hooks and evidence only: measured-boot checkpoints, minimum secure-update requirements, and the “key-use gate” that prevents using MACsec/TLS keys in an untrusted or rolled-back state.
Card A · Boot chain checklist (evidence + gate + pass criteria placeholders)
- Evidence: boot_reason, root_id, secure_boot=1 (log fields).
- Pass criteria: secure_boot asserted; root_id matches allowlist (X).
- Fail action: maintenance-only mode; block key-use gate.
- Evidence: bl_version, bl_hash/manifest_id, verify_ok.
- Pass criteria: verify_ok=1; version monotonic (anti-rollback X).
- Fail action: refuse network enrollment; raise tamper flag.
- Evidence: policy_version, policy_hash, config_lock_state.
- Pass criteria: policy hash matches; version matches fleet baseline (X).
- Fail action: disable key-use; allow only recovery endpoint.
- Evidence: os_build_id, module_sign_ok, secure_time_ready.
- Pass criteria: critical modules verified; secure_time gate met (X).
- Fail action: keep TLS disabled; allow local service only.
- Evidence: app_version, net_stack_id, offload_profile_id.
- Pass criteria: approved profile loaded; debug bypass disabled (X).
- Fail action: quarantine VLAN / maintenance-only.
- Evidence: keystore_ok, key_slot_id, monotonic_counter.
- Pass criteria: keystore_ok=1; counter not rolled back (X).
- Fail action: block MACsec/TLS key material use.
- Gate condition: boot+policy+keystore evidence all “green”.
- Pass criteria: gate_open=1 before MACsec SA install / TLS handshake.
- Fail action: refuse SA install; refuse handshake; expose only recovery path.
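The key-use gate can be expressed as a single function that returns both the gate state and the first failing reason, so that `gate_fail_reason` is always logged. This is a sketch under assumptions: `policy_hash_match` and `min_counter` are hypothetical derived fields, and the check order is an illustrative choice.

```python
from typing import Optional, Tuple


def evaluate_key_use_gate(evidence: dict) -> Tuple[bool, Optional[str]]:
    """Return (gate_open, gate_fail_reason) from boot, policy, and keystore evidence."""
    checks = [
        ("secure_boot", evidence.get("secure_boot") == 1),
        ("verify_ok", evidence.get("verify_ok") == 1),
        # Hypothetical field: result of comparing policy_hash to the fleet baseline.
        ("policy_hash_match", evidence.get("policy_hash_match") is True),
        ("secure_time_ready", evidence.get("secure_time_ready") is True),
        ("keystore_ok", evidence.get("keystore_ok") == 1),
        # Hypothetical field: min_counter is the lowest acceptable counter value.
        ("counter_rolled_back",
         evidence.get("monotonic_counter", -1) >= evidence.get("min_counter", 0)),
    ]
    for reason, ok in checks:
        if not ok:
            return False, reason   # first failing check becomes gate_fail_reason
    return True, None
```

MACsec SA install and TLS handshakes would be attempted only when the first element is `True`; otherwise the device stays on the recovery path and logs the reason.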
Card B · Update failure playbook (after update: cannot connect / cert missing / rollback)
| Symptom | First check | Fix direction | Pass criteria |
|---|---|---|---|
| Update completed, cannot join secured network | gate_open, policy_version, keystore_ok | Restore baseline policy, re-provision keys if store changed | gate_open=1 within X s |
| Certificates missing or permission denied | cert_path, access_denied log, key_slot_id | Fix storage path/ACLs; re-bind to secure element slot | cert load OK in X tries |
| Silent rollback occurred | monotonic_counter, slot_id, rollback_reason | Reconcile version counters; block network until repaired | counter monotonic passes X boots |
| Time invalid after update → TLS fails | secure_time_ready, cert_not_yet_valid/expired | Apply secure time gate; define safe degraded behavior | handshake succeeds after X sync |
| Update half-failed → boot ok but network unstable | verify_ok flags, config_lock_state, error stamps | Force rollback and re-apply update; preserve evidence log | no verify errors across X boots |
H2-8 · Determinism & Timing Interaction (TSN/PTP without breaking it)
Industrial Ethernet cares about determinism. Turning on MACsec/TLS can change latency, increase queue jitter, and introduce event-driven bursts (rekey/handshake). This chapter shows how to budget and isolate those effects without expanding TSN/PTP parameterization details.
Card A · What changes when security is enabled (determinism view)
- Encrypt/ICV work adds fixed Δt; software fallback amplifies tails.
- Gate concurrency: cap parallel handshakes and rekey storms.
- Observe: per-stage p99 latency, crypto queue depth.
- DMA/FIFO backpressure changes queue delay distribution under load.
- Replay drops and resends can shift burst patterns.
- Observe: FIFO watermark, queue drops, replay_drop/ICV fail.
- Handshake/rekey/recovery can create periodic latency spikes.
- Isolate control plane from cyclic data plane; stagger renewals.
- Observe: handshake_time variance, reconnect_rate, rollover count.
- Timestamp tap and queue delay drift can inflate offset noise.
- CPU preemption from crypto events can skew timestamp handling.
- Observe: offset variance, tap-path counters, IRQ load markers.
Card B · Budget template (fields for latency/jitter budgeting)
| Segment | Δt_mean | Δt_p99 | Jitter source | Observables | Pass criteria |
|---|---|---|---|---|---|
| App scheduling | X | X | preemption | cpu load | p99 < X |
| Offload enqueue | X | X | DMA backpressure | watermark | no drops |
| Crypto/ICV | X | X | fallback | engine use | p99 < X |
| Queue / shaping | X | X | burst/retry | replay/alert | spikes < X |
| Timestamp tap | X | X | queue drift | offset var | var < X |
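Once the template is filled in, a first-pass check can add up the per-segment figures and flag the segment with the largest tail. This is a rough sketch: summing per-segment p99 values is only an approximation of the end-to-end p99 and should be confirmed with an end-to-end measurement. The field names (`mean_us`, `p99_us`) and the function name are assumptions.

```python
def check_budget(segments: list, budget_mean_us: float, budget_p99_us: float) -> dict:
    """Sum per-segment latency figures and compare them to the budget.

    Each segment is a dict with 'name', 'mean_us', and 'p99_us'.
    """
    total_mean = sum(s["mean_us"] for s in segments)
    total_p99 = sum(s["p99_us"] for s in segments)
    # The segment with the widest gap between p99 and mean is the first
    # place to look for a jitter source.
    worst = max(segments, key=lambda s: s["p99_us"] - s["mean_us"])
    return {
        "total_mean_us": total_mean,
        "total_p99_us": total_p99,
        "pass": total_mean <= budget_mean_us and total_p99 <= budget_p99_us,
        "largest_tail_segment": worst["name"],
    }
```

Running the same check with security off and security on gives the delta that the budget has to absorb.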
H2-9 · System Integration Patterns (End device / Gateway / Switch)
Deployment determines where security terminates, where trust boundaries are drawn, and how key material can spread. This section provides reusable patterns for end devices, gateways, and switches, including risks, minimum observables, and fail-safe behavior.
Card A · Pattern cards (use / risks / must-have observables / fail-safe)
- Use when: confidentiality/integrity is required across an untrusted path; endpoints can manage sessions and credentials.
- Top risks: certificate lifecycle mistakes; session spikes (reconnect storms) harming cyclic traffic; time gate failures.
- Must-have observables: key_version, handshake_time(p99), session_reuse_rate, alert_code, gate_fail_reason.
- Fail-safe: cap concurrent handshakes; degrade to maintenance-only channel if gate fails; block uncontrolled retries.
- Bring-up gate: stable session reuse within X minutes; no periodic reconnect spikes above X/hour.
- Use when: protect each Ethernet segment; industrial switch fabric is the primary exposure surface.
- Top risks: trust boundary mistakes across ports; limited observability if relying on plaintext mirroring; rekey/PN issues causing silent drops.
- Must-have observables: mka_state, sa_state, pn_tx/pn_rx, replay_drop_count, icv_fail_count, rekey_count.
- Fail-safe: controlled bypass only with audit; quarantine ports on SA mismatch; rate-limit rekey triggers.
- Bring-up gate: SA install success within X s; replay/ICV drops below X/1k over Y minutes.
- Use when: gateway bridges field to IT; end devices are resource-limited; internal segments still need protection.
- Top risks: key/CA sprawl into the gateway; policy drift between domains; gateway becomes high-value target.
- Must-have observables: per-domain policy_version, key_slot_id, rollover events, cross-domain session counts, gate_fail_reason.
- Fail-safe: strict domain isolation; staged rollouts; maintenance-only interface when trust gate fails.
- Bring-up gate: domain separation verified; rekey/handshake events do not create periodic spikes above X.
- Use when: central controller connects many ports; strict zone policies are required (cell/line/plant domains).
- Top risks: shared credentials across zones; noisy neighbor effect from concurrent handshakes; audit gaps across ports.
- Must-have observables: per-port security_profile_id, per-zone key_version, session caps, event stamps, alert_code.
- Fail-safe: hard caps per port/zone; isolate control-plane events from cyclic data-plane; quarantine misbehaving zones.
- Bring-up gate: no cross-zone credential reuse; per-zone event rates below X.
Card B · Do / Don’t (prevent key sprawl into untrusted domains)
- Separate credentials by domain (field / line / plant / IT).
- Use a key-use gate tied to provable device state before enabling security.
- Stagger rollovers and cap concurrent handshakes to avoid storms.
- Make maintenance modes auditable (reason codes + timestamps).
- Do not clone long-lived PSKs/certificates across zones and devices.
- Do not keep permanent bypass/diagnostic backdoors without audit hooks.
- Do not depend on plaintext mirroring as the primary long-term observability method.
- Do not concentrate all CA/private keys in a gateway without strict isolation.
H2-10 · Security Telemetry & Field Forensics (minimum observability)
Encryption reduces packet-level visibility. Field triage requires a minimal, structured security black-box record: key versions, rollovers, handshake statistics, alert codes, replay drops, and gate reasons, with consistent time windows and denominators.
Card A · Telemetry schema (minimum fields to log)
- device_id, port_id, role (client/server)
- firmware_build_id, security_profile_id, policy_version
- key_slot_id, key_version (or epoch), cert_serial (if applicable)
- macsec_sa_state, mka_state (MACsec deployments)
- tls_state / dtls_state, session_count, session_reuse_rate
- handshake_time_ms (p50/p99), reconnect_rate
- replay_drop_count, replay_window_hits
- icv_fail_count, auth_fail_count
- pn_tx, pn_rx, pn_jump_events
- rekey_count, rollover_event_count
- alert_code (TLS/DTLS), error_class
- gate_fail_reason, policy_mismatch_reason
- cert_fail_reason (not_yet_valid / expired / unknown_ca)
- timestamp_mono, timestamp_wall (convertible)
- temp_tag, power_event_tag (tag only)
- security_mode_tag (on/off/profile_id)
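A subset of the schema can be captured as a typed record that always carries both timestamps. The sketch below is illustrative only: it covers a handful of the fields listed above, and the `SecurityEvent` class, the `event` and `reason` fields, and the JSON line format are assumptions rather than a prescribed log format.

```python
import json
import time
from dataclasses import asdict, dataclass


@dataclass
class SecurityEvent:
    device_id: str
    port_id: int
    role: str                 # "client" or "server"
    policy_version: str
    key_version: int
    event: str                # e.g. "rekey", "gate_fail" (illustrative values)
    reason: str = ""          # e.g. gate_fail_reason or cert_fail_reason
    replay_drop_count: int = 0
    icv_fail_count: int = 0
    timestamp_mono: float = 0.0
    timestamp_wall: float = 0.0


def make_event(**fields) -> SecurityEvent:
    """Stamp the record with monotonic and wall-clock time at creation."""
    return SecurityEvent(timestamp_mono=time.monotonic(),
                         timestamp_wall=time.time(), **fields)


def to_log_line(ev: SecurityEvent) -> str:
    """Serialize with sorted keys so log lines diff cleanly across builds."""
    return json.dumps(asdict(ev), sort_keys=True)
```

Recording `timestamp_mono` alongside `timestamp_wall` keeps event ordering intact even when the wall clock jumps after a time sync.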
Card B · Correlation checklist (align time, windows, denominators)
- Timebase alignment: log timestamp_mono + timestamp_wall, with a stable conversion method.
- Window definition: define a fixed window length (X s) and whether it is sliding or fixed-bucket.
- Denominator: standardize “per 1k frames” vs “per session” vs “per port-minute”; document mappings.
- Direction: define ingress/egress consistently across ports and roles.
- Layer binding: bind MACsec SA and TLS sessions to the same device_id + port_id key.
- Event ordering: record rekey/handshake/rollback/gate-fail with monotonic ordering stamps.
- Baseline pairs: keep security-off vs security-on baselines for handshake_time and drop counters.
- Triage order: gate reasons → state → counters → alerts; avoid starting from “link looks fine”.
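The window and denominator items can be made concrete with a small normalization helper. This sketch assumes fixed (non-sliding) buckets keyed by `timestamp_mono` and a "per 1k frames" denominator; the function names and the sample tuple layout are hypothetical.

```python
def per_1k_frames(drops: int, frames: int) -> float:
    """Normalize a drop count to drops per 1,000 frames."""
    if frames == 0:
        return 0.0
    return 1000.0 * drops / frames


def bucket_rates(samples, window_s: float) -> dict:
    """Group (timestamp_mono, drops, frames) samples into fixed buckets.

    Returns {bucket_index: drops per 1k frames} for each bucket with data.
    """
    buckets = {}
    for ts, drops, frames in samples:
        idx = int(ts // window_s)
        d, f = buckets.get(idx, (0, 0))
        buckets[idx] = (d + drops, f + frames)
    return {idx: per_1k_frames(d, f) for idx, (d, f) in sorted(buckets.items())}
```

Two devices reporting raw `replay_drop_count` values cannot be compared until both are reduced to the same window and denominator, which is what this helper enforces.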
H2-11 · Validation & Negative Testing (interop + attack-surface sanity)
Validation must close the loop from functional interop to performance impact, negative tests, regression, and production sampling. Every failure path must produce observable evidence (reason codes, counters, timestamps) instead of silent instability.
Card A · Test ladder (from baseline to production sampling)
- Purpose: establish latency/stability baseline to avoid mis-attributing pre-existing issues to security.
- Minimum steps: run representative traffic and cyclic loads for X minutes.
- Observables: reconnect_rate (X/hour), resource headroom (CPU/mem, X%), baseline jitter envelope (X).
- Scope: cipher suite alignment, certificate chain compatibility, PSK identifiers; MACsec MKA parameter consistency.
- Goal: separate negotiation failures from datapath failures.
- Observables: alert_code / reason_code, policy_version mismatch, MKA state transitions, SA install status.
- Scope: added latency/jitter, handshake spikes, rollover interruptions.
- Minimum steps: capture p50/p99 handshake_time_ms and steady-state jitter under load.
- Observables: handshake_time_ms (p50/p99), session_reuse_rate, rekey_events, gate_fail_reason.
- Certificate errors: unknown CA, broken chain, wrong identity.
- Time gates: expired and not-yet-valid (must emit cert_fail_reason).
- Replay / reorder: replay injections; out-of-order conditions (must increment replay_drop_count or equivalent).
- Timeouts: forced handshake timeout (must emit alert_code and timeout bucket).
- Rollover interrupt: key rotation interruption (must emit rollover_event + reason).
- Rule: any change to cipher/keys/credentials/offload config triggers a fixed regression subset.
- Minimum subset: one interop case + one replay case + one expiry case + one timeout case + one rollover case.
- Observables: pass/fail must include reason_code, key_version, and timestamp evidence.
- Goal: prevent credential-injection and lifecycle failures from escaping to field deployments.
- Minimum checks: identity binding + expiry gate + one replay sanity + one rollback/rotation sanity.
- Evidence: serial bind, policy_version, key_version, and audit record completeness.
Card B · Pass criteria table (all thresholds use X placeholders)
| Category | Metric | Pass criteria | Window / denominator |
|---|---|---|---|
| Sessions | handshake_time_ms | p50 ≤ X ms; p99 ≤ X ms | window X min |
| Sessions | reconnect_rate | ≤ X / hour | per port_id + role |
| Sessions | session_reuse_rate | ≥ X % | window X min |
| Alerts | alert_rate | ≤ X / 1k sessions | by alert_code bucket |
| MACsec | replay_drop_count | ≤ X / 1k frames | window X min |
| MACsec | icv_fail_count | ≤ X / 1k frames | window X min |
| Rotation | rollover impact | drop increase ≤ X; duration ≤ X s | during key rollover window |
| Resources | session_count cap | hard cap ≤ X | enforced under stress |
| Forensics | evidence completeness | reason + key_version + timestamp present | per failure path |
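The handshake rows of the table can be evaluated mechanically once samples are collected. The sketch below uses the nearest-rank percentile definition, which is one of several common definitions, and takes the X thresholds as arguments. The function names are assumptions.

```python
import math


def percentile(samples, pct: int):
    """Nearest-rank percentile: the smallest value with at least pct% at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def check_handshake(samples_ms, p50_limit_ms: float, p99_limit_ms: float) -> dict:
    """Evaluate handshake_time_ms against p50 and p99 limits for one window."""
    p50 = percentile(samples_ms, 50)
    p99 = percentile(samples_ms, 99)
    return {
        "p50_ms": p50,
        "p99_ms": p99,
        "pass": p50 <= p50_limit_ms and p99 <= p99_limit_ms,
    }
```

The window length still has to be fixed separately: a p99 computed over too few handshakes mostly reflects the single slowest sample.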
H2-12 · Engineering Checklist (Design → Bring-up → Production)
This checklist compresses the page into hard quality gates. Each item must be verifiable with an artifact: a configuration record, a log field, a test result, or a pass/fail evidence stamp.
Card A · Design gate (must-pass)
- Termination points for TLS/DTLS and MACsec are explicitly documented per port and domain.
- Trusted domain inventory exists (which nodes may hold long-lived credentials).
- Cross-domain credential reuse is prohibited and verified by policy (evidence: unique binding rules).
- Key inventory is complete (type, scope, update trigger, storage location, key_version field name).
- Key-use gate is defined (preconditions + fail-safe behavior + logged gate_fail_reason).
- Anti-rollback behavior is specified (evidence: rollback attempt is rejected with reason).
- Security path is included in latency/jitter budget (fields defined; thresholds use X placeholders).
- Control-plane events are isolated from cyclic data-plane behavior (evidence: concurrency caps).
- Black-box schema is implemented (device_id/port_id/profile_id/policy_version/key_version).
- Failure paths emit reason codes + timestamps (mono + wall convertible).
- Security mode changes are auditable (who/when/why hooks exist).
Card B · Bring-up gate (must-pass)
- Baseline runs exist with identical traffic profile and window definitions.
- Delta impact is recorded (handshake_time_ms, reconnect_rate, replay/ICV drops where applicable).
- At least two endpoint variants are validated (different stacks/vendors/firmware builds).
- Negotiation failure is distinguishable from datapath failure via reason codes and states.
- Expired / not-yet-valid must produce cert_fail_reason and a clear rejection path.
- Replay must produce replay_drop_count increments and an auditable record.
- Timeout must produce alert_code/timeout bucket and stop uncontrolled retry storms.
- Rollover interrupt must produce rollover_event + reason and predictable recovery behavior.
- Regression subset is fixed and documented, triggered by security-related changes.
- Results include evidence stamps: key_version, policy_version, timestamps, and failure reasons.
Card C · Production & Field gate (must-pass)
- Credential injection is traceable (serial bind, key_version, policy_version are recorded).
- Uniqueness policy is enforced across units and zones (no silent cloning of long-lived secrets).
- Sampling includes at least one expiry gate and one replay sanity test per batch.
- Configuration integrity is protected (unaudited security profile changes are rejected or flagged).
- Rollover strategy is staged (caps on concurrency; no fleet-wide synchronized reconnect storms).
- Log collection and evidence bundle is defined for RMA (reason codes + key_version + timestamps).
- Recovery playbook exists for gate failures (maintenance-only mode, controlled rollback, quarantines).
H2-13 · Applications & IC Selection (Security Offload)
Convert “why security” into “what to deploy and what to buy”: map use-cases to MACsec vs DTLS/TLS, then score IC capabilities that make security operable, deterministic, and diagnosable.
A) Use-case mapping (deployment → recommended layer)
Each row stays within this page boundary: security goal, determinism sensitivity, recommended layer, and the IC capability class to look for.
| Deployment | Primary goal | Determinism sensitivity | Recommended layer | IC capability class (examples) |
|---|---|---|---|---|
| Machine cell / motion island | Prevent tap/spoof on the OT segment; keep diagnostics usable. | Very high (latency/jitter must be budgeted). | MACsec (hop-by-hop). Optional TLS for management plane only (keep control/data separation). | MACsec-capable PHY / switch silicon; deterministic forwarding preserved. Examples: Broadcom BCM54195, Microchip VSC8582, Microchip VSC8254, Marvell Prestera 98DX1508/98DX2508. |
| Edge gateway (field ↔ cloud) | End-to-end confidentiality/integrity with identity (cert/PSK) and auditable lifecycle. | Medium (handshake spikes must not stall control loops). | TLS / DTLS (end-to-end). Add MACsec internally only if the gateway terminates TLS and bridges domains. | Crypto acceleration + secure key storage (SE/TPM) + measurable boot hooks. Examples: NXP MIMXRT1176AVM8A, Renesas R7FA6M5AH2CBG#AC0, NXP EdgeLock SE050E2HQ1/Z01Z3, Infineon TPM SLB9670VQ20FW785XTMA1, Microchip ATECC608B-TCSM. |
| Automotive-style domain link (camera / zone) | Protect each hop on the in-vehicle segment; reduce exposure of intermediate taps. | High (bounded latency + low jitter). | MACsec at PHY. Use TLS only when crossing into IP/cloud domains. | MACsec-capable T1 PHYs (security close to the wire). Examples: NXP TJA1104, NXP TJA1121, TI DP83TC817S-Q1, Marvell 88Q120xM, Broadcom BCM89586M. |
| Multigig cabinet uplink (2.5G/5G/10G) | Segment protection with line-rate crypto; keep forensics signals (drops/ICV/replay) visible. | Medium to high (depends on TSN usage; keep budgets explicit). | MACsec on uplink. Optionally add TLS for control plane sessions. | Multigig/10G PHYs or retimers with integrated MACsec; validate timestamp tap impact. Examples: Microchip LAN8268, Microchip VSC8254, Realtek RTL822561. |
Selection rule of thumb: lock the layer (MACsec vs DTLS/TLS) → lock the trust anchor (SE/TPM + anti-rollback hooks) → budget latency/jitter → require minimum telemetry for field forensics.
B) IC selection scorecard (what to score, not what to guess)
A scorecard prevents “checkbox security”: each field is an engineering lever tied to determinism, operability, and lifecycle control.
- MACsec: 802.1AE support, optional XPN/256b, MKA offload, replay window behavior.
- TLS/DTLS: supported versions, session resumption, DTLS retransmit strategy, cipher suite coverage.
- Line-rate throughput: sustained encrypted traffic (no hidden “burst-only” limits).
- Handshake spikes: worst-case time and CPU contention; cap reconnection storms.
- Timestamp/tap point impact: verify PTP/latency tap stays consistent when crypto is enabled.
- Trust anchor: MCU internal vs Secure Element vs TPM (keys non-exportable).
- Anti-rollback hooks: keys usable only when measured boot status passes.
- Rollover readiness: key versioning, overlap windows, revocation and audit trail fields.
- MACsec counters: PN, replay drops, ICV fails, SA rollover counts, MKA state.
- TLS/DTLS stats: handshake time, alert codes, session reuse rate, cipher mismatch counts.
- Black-box minimum: key version + event stamps aligned to power/temp reset events.
- Host interfaces: RGMII/SGMII/QSGMII/PCIe/SPI/I²C; DMA/descriptor headroom.
- Fail-safe mode: defined behavior when authentication fails (safe bypass vs safe stop).
- Provisioning flow: factory injection method and traceability (serial-bound credentials).
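One way to turn the scorecard into a comparable number is a weighted average with hard must-have fields. This is a sketch under assumptions: the 0 to 5 scoring scale, the weights, and the rule that a must-have scored 0 disqualifies a candidate are illustrative choices, not part of any vendor methodology.

```python
def score_candidate(scores: dict, weights: dict, must_have: set) -> dict:
    """Weighted scorecard with disqualifying must-have fields.

    scores: field -> 0..5 (assumed scale); a missing field counts as 0.
    weights: field -> relative weight.
    must_have: fields that disqualify the candidate when scored 0.
    """
    missing = sorted(k for k in must_have if scores.get(k, 0) == 0)
    total_weight = sum(weights.values())
    weighted = sum(w * scores.get(k, 0) for k, w in weights.items()) / total_weight
    return {
        "weighted_score": weighted,
        "missing_must_have": missing,
        "eligible": not missing,
    }
```

Treating telemetry and trust-anchor fields as must-haves keeps a part with strong throughput but no usable counters from winning on the average alone.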
Representative material numbers (reference points)
Use these as capability anchors; exact feature sets vary by configuration and must be verified in datasheets and reference designs.
- TI DP83TC817S-Q1 (Automotive Ethernet PHY with MACsec reference capability)
- NXP TJA1104, TJA1121 (Automotive Ethernet PHY family entries with MACsec variants)
- Broadcom BCM54195 (GbE PHY family entry with integrated MACsec)
- Microchip VSC8582, VSC8254 (Ethernet PHY family entries with MACsec)
- Realtek RTL822561 (Multi-gig PHY entry listed with MACsec support in distributor summaries)
- Marvell Prestera examples: 98DX1508, 98DX2508, 98DX3508, 98DX7325 (MACsec-enabled entries depending on SKU)
- NXP EdgeLock SE050 example order code: SE050E2HQ1/Z01Z3
- Microchip CryptoAuthentication example: ATECC608B-TCSM
- Infineon OPTIGA™ Trust M example order code: OPTIGA-TRUST-M-MTR
- Infineon TPM 2.0 example order code: SLB9670VQ20FW785XTMA1
- NXP i.MX RT1170 example: MIMXRT1176AVM8A (gateway-class MCU reference point)
- Renesas RA6M5 example: R7FA6M5AH2CBG#AC0 (TRNG + crypto engine reference point)
H2-14 · FAQs (Field Triage, No New Scope)
Each answer is a fixed four-line, measurable playbook: likely root cause, the fastest falsifiable check, the smallest fix, and pass criteria using rate + window + denominator (X placeholders).
Q1 MACsec link is up, but traffic is black-holed — MKA state vs datapath SA binding?
Likely cause: MKA is up, but the controlled port / Secure Association (SA) is not bound to the datapath (bypass or SA-select mismatch).
Quick check: Compare MKA state to “SA in-use”; confirm controlled-port enable and TX/RX SA indices on both ends.
Fix: Align SA selection policy, enable controlled port, and verify SAK installed symmetrically (same key version and direction).
Pass criteria: Encrypted TX/RX counters increase; ICV_fail ≤ X per 10^6 frames and replay_drop = 0 over Y minutes.
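The Q1 pass criteria can be evaluated from two counter snapshots taken at the start and end of the window. This is a minimal sketch, assuming hypothetical counter names (`rx_frames`, `icv_fail`, `replay_drop`) that must be mapped to whatever the PHY or switch actually exposes; the threshold X stays a caller-supplied parameter, and counter wrap is not handled.

```python
from dataclasses import dataclass


@dataclass
class MacsecCounters:
    # Hypothetical per-port snapshot; real counter names vary by device.
    rx_frames: int
    icv_fail: int
    replay_drop: int


def rate_per_million(events: int, frames: int) -> float:
    """Normalize an event count to a 10^6-frame denominator."""
    if frames <= 0:
        # No encrypted frames in the window: the rate is undefined and the
        # "counters increase" part of the criterion has already failed.
        raise ValueError("no frames in window; rate is undefined")
    return events * 1_000_000 / frames


def macsec_window_passes(start: MacsecCounters, end: MacsecCounters,
                         x_per_million: float) -> bool:
    """True if ICV_fail <= X per 10^6 frames and replay_drop == 0."""
    frames = end.rx_frames - start.rx_frames
    icv = end.icv_fail - start.icv_fail
    replay = end.replay_drop - start.replay_drop
    return rate_per_million(icv, frames) <= x_per_million and replay == 0
```

A window with zero frame delta raises instead of passing, which keeps a black-holed link from being reported as clean.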
Q2 Works in lab, fails in plant after a few hours — key rollover mismatch or replay window too tight?
Likely cause: Rekey/rollover policy mismatch or replay window too tight under bursty traffic and reordering.
Quick check: Correlate failure time with rekey events; inspect PN continuity and replay-drop spikes around rollover windows.
Fix: Harmonize rollover timing + overlap, widen replay window if needed, and prevent PN reset on link flap/reset.
Pass criteria: rekey_success = 100% across N rollovers; replay_drop ≤ X per 10^6 frames over Y hours.
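To see why a replay window that is too tight drops legitimate frames under reordering, the receive check can be modeled offline against a captured packet-number (PN) sequence. This is a simplified sketch, not a device implementation: it assumes a frame is dropped when its PN falls below "next expected PN minus window", and it ignores duplicate tracking, PN exhaustion, and rekey. The function name and inputs are illustrative.

```python
from typing import Iterable


def count_replay_drops(pns: Iterable[int], window: int) -> int:
    """Count frames a receiver would drop for a given replay window.

    Simplified model: a frame is rejected when its PN is below
    (next expected PN - window). A window of 0 means strict ordering.
    """
    if window < 0:
        raise ValueError("window must be non-negative")
    next_pn = 1  # PNs are assumed to start at 1 for a fresh SA
    drops = 0
    for pn in pns:
        lowest_acceptable = max(1, next_pn - window)
        if pn < lowest_acceptable:
            drops += 1
            continue
        if pn >= next_pn:
            next_pn = pn + 1  # advance the expected PN on in-order progress
    return drops


# Frames 3 and 4 arrive after frame 5 (reordered in the network).
arrival_order = [1, 2, 5, 3, 4, 6]
drops_by_window = {w: count_replay_drops(arrival_order, w) for w in (0, 2, 3)}
```

Replaying a plant capture through this model for several candidate window sizes gives a quick estimate of the smallest window that keeps replay_drop within the X per 10^6 frames target.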
Q3 TLS handshake succeeds, but cyclic control messages jitter — CPU contention or queueing after offload?
Likely cause: Crypto/handshake bursts contend with cyclic traffic (CPU, DMA, or queue scheduling), adding variable service time.
Quick check: Compare jitter with security ON/OFF; log handshake_time_p95 and queue_depth_peak during cyclic intervals.
Fix: Separate control/data-plane queues, cap handshake concurrency, and reserve deterministic resources for cyclic traffic.
Pass criteria: jitter_pp99 ≤ X µs over Y minutes; handshake_time_p95 ≤ X ms at N concurrent sessions.
Q4 DTLS only fails on lossy links — retransmit timer vs MTU/fragmentation?
Likely cause: DTLS retransmit/timeout policy is too aggressive, or fragmentation/PMTU handling is inconsistent on the path.
Quick check: Track dtls_retransmits_per_handshake and mtu_fragment_events; compare handshake timeout distribution under loss=Y%.
Fix: Tune retransmit timers, enforce a safe MTU, and reduce record/handshake message sizes to avoid fragmentation.
Pass criteria: handshake_success ≥ X% at loss=Y%; dtls_retransmits ≤ X per session over N sessions.
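The two knobs in Q4 can be budgeted on paper before touching the stack. The sketch below assumes the exponential backoff that RFC 6347 recommends for DTLS 1.2 (initial timer of 1 s, doubling per retransmission, capped at 60 s) and an IPv4/UDP path; the per-record cipher overhead depends on the negotiated suite and is passed in as a parameter. Function names are illustrative, not a library API.

```python
from typing import List


def retransmit_schedule(initial_s: float = 1.0, max_s: float = 60.0,
                        retries: int = 5) -> List[float]:
    """Timeout used before each retransmission (doubling, capped at max_s)."""
    timeouts = []
    t = initial_s
    for _ in range(retries):
        timeouts.append(t)
        t = min(t * 2, max_s)
    return timeouts


def max_record_payload(path_mtu: int, cipher_overhead: int,
                       ip_header: int = 20, udp_header: int = 8,
                       record_header: int = 13) -> int:
    """Largest plaintext per DTLS 1.2 record that avoids IP fragmentation.

    Defaults assume IPv4 without options, UDP, and the 13-byte DTLS 1.2
    record header; cipher_overhead (nonce + tag) is suite-specific.
    """
    payload = path_mtu - ip_header - udp_header - record_header - cipher_overhead
    if payload <= 0:
        raise ValueError("path MTU too small for the given overheads")
    return payload


schedule = retransmit_schedule(retries=5)   # timeouts in seconds
worst_case_wait_s = sum(schedule)           # time spent before giving up
```

Comparing `worst_case_wait_s` with the application's handshake timeout shows whether the timer policy or the timeout is the limiting factor at loss=Y%.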
Q5 After firmware update, all peers reject the node — measured boot changed; cert/PSK binding mismatch?
Likely cause: Trust gate fails after update (measured state differs), or credential identity binding no longer matches (cert/PSK ↔ device ID).
Quick check: Compare key_version/cert_fingerprint before vs after; confirm attested_state flag and policy_version match expected values.
Fix: Restore accepted measurement policy or re-provision credentials bound to the updated device identity and policy version.
Pass criteria: auth_reject = 0 across N reboots; key_version matches policy and first-attempt auth succeeds over Y hours.
Q6 Only one switch port shows ICV failures — SCI/PN reset or mirror/SPAN side effect?
Likely cause: SCI/PN handling differs on that port, or mirroring/SPAN changes frame handling outside the secured path assumptions.
Quick check: Compare pn_reset_count and ICV_fail_rate per port; disable mirroring temporarily and re-measure ICV failures.
Fix: Correct SCI configuration, prevent PN reset across link flaps, and keep mirror traffic separated from controlled-port processing.
Pass criteria: ICV_fail ≤ X per 10^6 frames on that port over Y minutes; pn_reset_count = 0 across N link flaps.
Q7 Time sync drifts after enabling encryption — timestamp tap moved or added variable queue delay?
Likely cause: Timestamp tap point shifts, or encryption introduces variable queuing/serialization delay that was not budgeted.
Quick check: Measure offset_pp99 with security ON/OFF; log queue_delay_variance and timestamp_source_id across the same traffic profile.
Fix: Keep timestamps at a stable point, reserve deterministic queues, and include crypto path Δt/jitter in the latency budget.
Pass criteria: offset_pp99 ≤ X ns over Y minutes; ON/OFF delta ≤ X ns under N traffic mixes.
Q8 Certificate rotation caused a network-wide flap — coordinated window/clock validity mismatch?
Likely cause: Rotation overlap window is inconsistent across nodes, or time validity checks fail due to time-base mismatch.
Quick check: Compare notBefore/notAfter rejection logs and time_sync_status across nodes; identify overlap gaps and skew at rotation time.
Fix: Stagger rotation with overlap, enforce consistent time source, and roll out with a tested rollback plan.
Pass criteria: reconnect_rate ≤ X per hour during rotation; auth_reject = 0 across N nodes over Y hours.
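The overlap and skew check in Q8 reduces to interval arithmetic on notBefore/notAfter. This sketch is illustrative only: it assumes `max_skew` is the worst-case absolute clock error of any node, so the usable rotation window is the validity overlap minus twice that skew (one node may run early, another late). The function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def rejection_reason(not_before: datetime, not_after: datetime,
                     node_time: datetime) -> Optional[str]:
    """Why a node with local clock `node_time` would reject a certificate."""
    if node_time < not_before:
        return "not_yet_valid"
    if node_time > not_after:
        return "expired"
    return None


def usable_rotation_window(old_not_after: datetime, new_not_before: datetime,
                           max_skew: timedelta) -> timedelta:
    """Time span in which every node accepts both old and new certificates.

    A zero or negative result means some node can see a gap during rotation.
    """
    return (old_not_after - new_not_before) - 2 * max_skew


utc = timezone.utc
new_not_before = datetime(2025, 1, 8, tzinfo=utc)
old_not_after = datetime(2025, 1, 10, tzinfo=utc)

window = usable_rotation_window(old_not_after, new_not_before,
                                max_skew=timedelta(hours=12))
# A node whose clock runs 12 h behind at the true notBefore instant:
late_node = rejection_reason(new_not_before, datetime(2026, 1, 8, tzinfo=utc),
                             node_time=new_not_before - timedelta(hours=12))
```

Scheduling the rotation inside the usable window, rather than anywhere inside the nominal overlap, is what keeps skewed nodes from rejecting peers mid-rollout.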
Q9 Session count spikes then device reboots — resource exhaustion or session cache leak?
Likely cause: Session table/heap exhaustion under reconnection storms, or session cache leak that accumulates over time.
Quick check: Track session_table_usage_peak and heap_low_watermark; correlate reboot_count with reconnect bursts and handshake failures.
Fix: Cap sessions, enable eviction, harden timeouts, and throttle retries/renegotiation to prevent storms.
Pass criteria: session_table_usage_peak ≤ X% at load=N; reboots = 0 over Y hours; reconnect_rate ≤ X/hour.
Q10 MACsec counters look clean, but PLC still reports intermittent drops — metric window/denominator mismatch?
Likely cause: Accounting mismatch (window/denominator/ingress-egress direction) or drops occur outside MACsec-visible counters.
Quick check: Normalize by port + direction + window; compare plc_drop_rate vs device_drop_rate using the same denominator and time window.
Fix: Standardize telemetry schema (window_ms, denom_frames, direction) and align rollups; add an explicit “unknown_drop” bucket.
Pass criteria: Metrics agree within ±X% over Y minutes; unknown_drop = 0 across N windows.
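The normalization step in Q10 can be enforced in code so that two drop metrics are never compared across different windows, denominators, or directions. This is a sketch under stated assumptions: the schema fields follow the names used above (`window_ms`, `denom_frames`, `direction`), "agree within ±X%" is interpreted relative to the larger of the two counts, and `unknown_drop` is taken as drops the PLC reports that device counters do not explain.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class DropSample:
    window_ms: int
    denom_frames: int
    direction: str  # "ingress" or "egress"
    drops: int


def compare_drop_metrics(plc: DropSample, device: DropSample,
                         tolerance_pct: float) -> Tuple[bool, int]:
    """Return (metrics_agree, unknown_drop) for two comparable samples."""
    same_basis = (plc.window_ms, plc.denom_frames, plc.direction) == \
                 (device.window_ms, device.denom_frames, device.direction)
    if not same_basis:
        raise ValueError("samples differ in window, denominator, or direction")

    # Drops seen by the PLC that the device counters cannot account for.
    unknown_drop = max(0, plc.drops - device.drops)

    larger = max(plc.drops, device.drops)
    if larger == 0:
        return True, unknown_drop  # both clean: trivially in agreement
    diff_pct = abs(plc.drops - device.drops) * 100 / larger
    return diff_pct <= tolerance_pct, unknown_drop
```

Raising on mismatched samples is deliberate: a comparison across different windows or directions is the failure mode this question describes, so it should never produce a number.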
Q11 TLS works, but latency budget is blown — record size, fragmentation, or path MTU?
Likely cause: Large records trigger fragmentation/PMTU issues and bursty queueing, creating variable latency and jitter.
Quick check: Inspect tls_record_size_histogram, fragment_count, and pmtu_error_count; correlate latency spikes with large-record bursts.
Fix: Tune record sizes, enforce PMTU-safe settings, and prioritize cyclic traffic to bound queueing under load.
Pass criteria: latency_pp99 ≤ X µs over Y minutes; spike_rate ≤ X per minute at throughput=N.
Q12 Security event logs are empty during failures — telemetry not persisted or time-base not aligned?
Likely cause: Telemetry is not persisted across resets, or timestamps are not aligned so events cannot be correlated to failures.
Quick check: Power-cycle and verify log_retention_ratio; confirm time_base_id and boot_count are present and monotonic.
Fix: Persist a minimum security schema (key_version, alerts, rollover, handshake stats) and align time base (monotonic + sync status).
Pass criteria: log_retention_ratio ≥ X% across N resets; missing_required_fields = 0 over Y hours.
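The Q12 checks can be automated against logs read back after a power cycle. The sketch below assumes records are available as dictionaries and that each record carries a hypothetical unique `id` assigned before the reset; the required field set is a minimal example drawn from the fields named above and should be extended to the full schema in use.

```python
from typing import Dict, Iterable, List

# Minimal example set; extend with alerts, rollover, and handshake stats.
REQUIRED_FIELDS = ("key_version", "time_base_id", "boot_count")


def missing_required_fields(records: List[Dict]) -> int:
    """Total number of required fields absent across all records."""
    return sum(1 for r in records for f in REQUIRED_FIELDS if f not in r)


def boot_count_monotonic(records: List[Dict]) -> bool:
    """True if boot_count never decreases in log order."""
    counts = [r["boot_count"] for r in records if "boot_count" in r]
    return all(a <= b for a, b in zip(counts, counts[1:]))


def log_retention_ratio(written_ids: Iterable[int],
                        recovered_ids: Iterable[int]) -> float:
    """Fraction of records written before a reset that survive it."""
    written = set(written_ids)
    if not written:
        raise ValueError("no records were written before the reset")
    return len(written & set(recovered_ids)) / len(written)


recovered = [
    {"id": 1, "key_version": 7, "time_base_id": "mono", "boot_count": 41},
    {"id": 2, "key_version": 7, "time_base_id": "mono", "boot_count": 41},
    {"id": 4, "key_version": 8, "boot_count": 42},  # time_base_id missing
]
ratio = log_retention_ratio([1, 2, 3, 4], [r["id"] for r in recovered])
```

Running the three checks after each of the N resets yields the `log_retention_ratio` and `missing_required_fields` values that the pass criteria refer to.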