Remote Management turns “plug-and-see” discovery into an auditable, rollback-safe operations loop.
It uses LLDP/LLDP-MED for trustworthy inventory and NETCONF/YANG for transaction-based changes, enabling consistent onboarding, controlled batch configuration, verification, and evidence-linked troubleshooting at scale.
H2-1 · Definition & Scope: What “Remote Management” means here
In Industrial Ethernet, remote management is not generic IT administration. It is an engineering closed-loop:
plug-in discovery, auditable remote configuration,
and reproducible operations that can be verified and rolled back.
Intent (why this chapter exists)
Lock the scope boundary so later chapters do not expand into PoE/TSN/PTP/security details.
Define remote management using outputs and acceptance criteria, not protocol names.
Establish a single mental model: Discover → Describe → Configure.
Remote management vs. monitoring vs. provisioning (outputs-first)
This page defines management artifacts and workflows.
When detailed electrical, timing, or cryptography topics appear, they must be linked out and not expanded here:
MACsec/DTLS/TLS cryptography deep dive → Security page.
Acceptance criteria (what “done” looks like)
Neighbor Record remains stable across normal link churn (TTL-driven expiry, deterministic de-dup).
Endpoint Profile classification does not flip without a real physical or policy change.
Configuration changes are transactional (lock → commit) with read-back verification and rollback points.
Every change produces an audit diff: who/when/what/expected vs observed outcome.
Diagram: the management pipeline is defined by artifacts (Neighbor Record → Endpoint Profile → Config Transaction) and closed-loop verification (read-back + audit diff).
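The acceptance criteria above can be made concrete in code. The sketch below models a Neighbor Record with TTL-driven expiry and a deterministic de-dup key; all names (`NeighborRecord`, `dedup_key`, the tolerance default) are illustrative assumptions, not a prescribed schema.

```python
import time
from dataclasses import dataclass, field

@dataclass
class NeighborRecord:
    """Normalized view of one observed LLDP adjacency (field names are illustrative)."""
    chassis_id: str          # stable device identity
    port_id: str             # local port where the neighbor was seen
    ttl_s: int               # freshness window advertised by the neighbor
    last_seen: float = field(default_factory=time.time)

    @property
    def dedup_key(self) -> str:
        # The same physical link must always yield the same key.
        return f"{self.chassis_id}|{self.port_id}"

    def is_expired(self, now: float, tolerance_s: float = 5.0) -> bool:
        # TTL-driven expiry with a tolerance window, so ordinary
        # collector jitter does not cause see/not-see flapping.
        return now - self.last_seen > self.ttl_s + tolerance_s

rec = NeighborRecord(chassis_id="aa:bb:cc:00:00:01", port_id="ge-0/0/1", ttl_s=120)
assert not rec.is_expired(rec.last_seen + 120)   # within TTL
assert rec.is_expired(rec.last_seen + 126)       # past TTL plus tolerance
```

The tolerance window is the key design choice: expiry is still deterministic, but a single delayed refresh no longer flips the record.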
H2-2 · System Architecture: Who talks to whom (Device, Agent, NMS)
The architecture must align two worlds: on-wire packets (LLDP/LLDP-MED) and
system objects (inventory/config/audit). When “observed state” and “desired state”
are not separated, remote operations become unreproducible and hard to audit.
Roles (each role owns an artifact)
Endpoint / Device
Owns: port identity + device identity + config state. Emits: LLDP(+MED) advertisements + NETCONF state. Common failure: identity drift (port moves look like new assets).
Switch / Controller
Owns: neighbor table + port policy enforcement. Emits: link events + neighbor-change signals. Common failure: control policies accidentally suppress discovery packets.
Collector / Agent
Owns: normalization rules (TTL, de-dup, source priority). Emits: Neighbor Records into inventory pipeline. Common failure: inconsistent accounting (same link → different records).
Config Orchestrator + DB
Owns: desired state + versions + rollback points. Emits: transactions (lock/edit/commit) + diffs. Common failure: write conflicts or missing rollback anchors.
NMS / Ops Portal
Owns: the closed loop (discover → bind → change → verify → audit). Emits: tasks, reports, alarms, approvals. Common failure: alert storms (no debounce/cooldown policy).
Management channel: in-band vs out-of-band (selection logic)
Diagram: discovery creates observed facts (LLDP → inventory), while NETCONF drives desired state (transactions + audit). Separation enables verification and rollback.
H2-3 · LLDP Discovery: Neighbor table that you can trust
LLDP is the shortest path from “a cable is plugged in” to a usable topology. The goal is not a raw neighbor table,
but a trusted neighbor view that survives link churn, port moves, and aggregation,
and can be normalized into a database-ready record.
Key fields (what they mean in field operations)
Chassis ID
Use: stable device identity across time. Pitfall: duplicated defaults or virtualization drift. Pass: port moves do not create “new assets”.
Port ID
Use: physical or logical port localization. Pitfall: LAG/bridge ports mask real member ports. Pass: the record answers “which port to touch”.
TTL
Use: freshness window for observed facts. Pitfall: mixed defaults cause “see/not-see” flapping. Pass: deterministic expiry with a tolerance window.
System Name / Port Description
Use: human-friendly labeling for field service. Pitfall: strings change; must not drive identity. Pass: labels are display-only, tracked as history.
When LLDP transmits (engineering selection logic)
Periodic
Benefit: continuous refresh for inventory and topology.
Quality: dedup_key, confidence, flags (port_move / lag / virtual).
Pass criteria: the same physical link yields the same Neighbor Record across collectors, and does not oscillate under normal packet loss or link churn.
Diagram: LLDP fields become a normalized Neighbor Record when identity, timing, and quality rules are applied consistently.
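The LAG and virtual-port pitfalls above are best absorbed in normalization, before records reach the database. A minimal sketch, assuming a collector-side mapping of LAG member ports and a `veth`/`vnet` naming convention for virtual interfaces (both are assumptions, not standards):

```python
def normalize_port_id(raw_port_id: str, lag_members: dict[str, str]) -> tuple[str, list[str]]:
    """Map an observed port id to a stable logical identity plus quality flags.

    lag_members maps physical member ports to their LAG interface, so one
    aggregated link is accounted as one record instead of N member records.
    """
    flags: list[str] = []
    if raw_port_id in lag_members:
        flags.append("lag")
        return lag_members[raw_port_id], flags
    if raw_port_id.startswith(("veth", "vnet")):   # assumed virtual-port naming
        flags.append("virtual")
    return raw_port_id, flags

port, flags = normalize_port_id("ge-0/0/3", {"ge-0/0/3": "ae0", "ge-0/0/4": "ae0"})
assert (port, flags) == ("ae0", ["lag"])
```

Flags travel with the record rather than changing its identity, so downstream consumers can still answer "which port to touch".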
H2-4 · LLDP-MED: Endpoint classification + policy + power adverts
LLDP-MED upgrades discovery into operations: it adds an endpoint profile (what it is),
policy hints (what to apply), and power advertisements (what is declared), enabling automatic grouping and safe
template binding in the management system.
Endpoint class (why classification matters)
Camera / Imaging
Operational meaning: stable VLAN binding + predictable priority. Misclass risk: wrong template causes reachability or jitter regressions. Pass: class remains stable unless the physical endpoint changes.
I/O / Control
Operational meaning: conservative change windows and labeling. Misclass risk: unintended policy overrides or operational noise. Pass: templates apply with read-back verification.
Diagram: LLDP-MED turns discovery into operations by producing an Endpoint Profile used for automatic grouping, templating, and auditable change workflows.
H2-5 · Data Normalization: From TLVs to an inventory schema
Normalization is the operational bridge between on-wire fields and a durable inventory system.
The goal is to produce stable identities, consistent field semantics, and auditable changes that can drive safe automation.
Field rules (naming, units, versions, source priority)
Naming & types
Rule: stable snake_case names + explicit types and enums.
Why: prevents “same meaning, different field” drift across collectors.
Pass: the same field name is used end-to-end (collector → DB → API → UI).
Units & normalization
Rule: every numeric field has a unit semantics (W/mW, s/ms, dBm).
Why: vendor defaults and platform exports may not match units.
Pass: unit conversions are explicit and reproducible.
Schema versions
Rule: add schema_version for inventory objects and model_version for capabilities.
Why: keeps historical records interpretable after upgrades.
Pass: new fields never break old parsing; unknown fields degrade safely.
Source priority & conflict handling
Rule: define authoritative sources per field class.
Examples: identity from local inventory/CMDB; topology from LLDP observations; policy from NMS templates.
Pass: conflicts never silently overwrite; store evidence + winner rationale.
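The source-priority rule can be sketched as a small resolver that always records the evidence and the winner rationale; the priority table below is an example policy following the text, not a fixed standard:

```python
# Authoritative source per field class (example policy from the rules above).
SOURCE_PRIORITY = {
    "identity": ["cmdb", "lldp", "template"],
    "topology": ["lldp", "cmdb", "template"],
    "policy":   ["template", "cmdb", "lldp"],
}

def resolve(field_class: str, observations: dict[str, object]) -> dict:
    """Pick a winner by per-class priority, keeping every losing value as evidence."""
    for source in SOURCE_PRIORITY[field_class]:
        if source in observations:
            return {
                "value": observations[source],
                "winner": source,
                "rationale": f"{source} is authoritative for {field_class}",
                "evidence": observations,   # nothing is silently overwritten
            }
    raise KeyError(f"no observation available for {field_class}")

r = resolve("topology", {"cmdb": "sw1:ge-0/0/1", "lldp": "sw1:ge-0/0/2"})
assert r["winner"] == "lldp" and r["value"] == "sw1:ge-0/0/2"
```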
Identity stability (keep Asset ID stable under real-world changes)
Device replacement
Rule: location is not identity; treat new serial/asset_tag as a new device. Output: asset_replaced_event with old device retired state. Pass: history answers “what changed, when, and where”.
Cable move / port move
Rule: same device identity + new local_port becomes a move event, not a new asset. Output: port_move_event + topology_history records. Pass: Asset ID remains stable; topology change remains traceable.
Collector / switch migration
Rule: do not anchor identity to collector_id; store collector as evidence, not as primary key. Output: source_list + confidence scoring across collectors. Pass: NMS migration does not “rebuild” inventory from scratch.
power_advert_changed: advertised power fields change.
Required event payload: before/after diff, timestamps, source evidence, and correlation_id.
Apply debounce windows to avoid alarm storms from transient LLDP loss.
Pass criteria: transient packet loss does not generate repeated remove/add loops, and every merged change remains explainable via evidence and correlation.
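The debounce rule above can be sketched as a small state machine: an event is emitted only if its state survives a hold window, and an opposite event inside that window cancels the pending one. The class and its hold policy are illustrative assumptions:

```python
class Debouncer:
    """Suppress remove/add churn: emit an event only if it is still true
    after hold_s seconds (illustrative merge policy)."""
    def __init__(self, hold_s: float):
        self.hold_s = hold_s
        self.pending: dict[str, tuple[str, float]] = {}  # key -> (event, first_seen)

    def observe(self, key: str, event: str, now: float):
        prev = self.pending.get(key)
        if prev and prev[0] != event:
            # Opposite event arrived inside the hold window: transient, cancel both.
            del self.pending[key]
            return None
        if prev is None:
            self.pending[key] = (event, now)
            return None
        if now - prev[1] >= self.hold_s:
            del self.pending[key]
            return event   # stable: emit one merged, explainable event
        return None

d = Debouncer(hold_s=10.0)
assert d.observe("sw1:ge-0/0/1", "neighbor_removed", 0.0) is None
assert d.observe("sw1:ge-0/0/1", "neighbor_added", 3.0) is None     # cancels the removal
assert d.observe("sw1:ge-0/0/1", "neighbor_removed", 20.0) is None
assert d.observe("sw1:ge-0/0/1", "neighbor_removed", 31.0) == "neighbor_removed"
```

Transient LLDP loss then produces no remove/add loop, while a sustained removal still surfaces exactly once.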
Diagram: normalize on-wire facts into stable entities and attach all changes to audited Timestamp/Event records.
H2-6 · NETCONF: Transactional configuration with locking and audit
NETCONF enables transaction-style configuration with locking, predictable commit behavior, and audit-friendly workflows.
This section focuses on the engineering advantages: capability gating, datastore discipline, and verifiable RPC recipes.
Transport (secure channel requirement, bounded scope)
Rule: run NETCONF over a protected transport (SSH or TLS).
Operational focus: authentication, session lifecycle, and audit logs.
Pass: failures are classified (auth/channel/capability) and recorded.
Capabilities exchange (a hard gate before writing)
Goal: discover supported models and features before any edit-config.
Gate: refuse writes if required modules/datastores are missing.
Datastores (running / candidate / startup) — modes and risks
running
Use: small changes with immediate effect. Risk: partial writes can create intermediate states. Pass: always read-back verify after edits.
candidate + commit
Use: transactional staging with controlled commit. Risk: lock contention or forgotten discard/unlock. Pass: lock → edit → commit/discard → unlock is enforced.
startup
Use: baseline configuration at boot time. Risk: operational reboot windows and recovery policies. Pass: changes are tied to approved windows and rollback plans.
Core RPC recipe (minimal safe transaction)
hello: capability exchange and model gating.
lock: lock target datastore (candidate or running policy).
get-config: capture baseline snapshot for audit.
edit-config: stage changes with explicit intent.
commit: apply staged changes atomically.
get-config: read-back verification (no silent divergence).
unlock: release lock; ensure cleanup on all exits.
Failure branch: on commit failure, use discard-changes (candidate) or a defined rollback procedure, then unlock.
Pass criteria: every write has a pre/post snapshot, a deterministic lock discipline, and a verifiable outcome.
Diagram: a minimal safe NETCONF recipe enforces capability gating, lock discipline, atomic commit, and read-back verification.
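The recipe above can be sketched as an orchestration function over an abstract session. This is not a NETCONF client implementation; the session interface and the in-memory `FakeSession` are assumptions used to show the lock discipline, the failure branch, and read-back verification:

```python
def safe_transaction(sess, config: str) -> dict:
    """lock -> snapshot -> edit -> commit -> read-back -> unlock,
    with discard and unlock guaranteed on every failure path."""
    caps = sess.capabilities()
    if ":candidate" not in " ".join(caps):        # capability gate before any write
        raise RuntimeError("candidate datastore not supported; refusing to write")
    sess.lock("candidate")
    try:
        before = sess.get_config("running")        # pre snapshot for the audit diff
        sess.edit_config("candidate", config)
        sess.commit()                              # apply staged changes atomically
        after = sess.get_config("running")         # read-back verification
        if config not in after:
            raise RuntimeError("read-back diverged from intent")
        return {"before": before, "after": after}
    except Exception:
        sess.discard_changes()                     # failure branch: discard staged edits
        raise
    finally:
        sess.unlock("candidate")                   # cleanup on all exits

class FakeSession:
    """In-memory stand-in for a NETCONF session (illustration only)."""
    def __init__(self):
        self.running, self.candidate, self.locked = "", "", False
    def capabilities(self): return ["urn:ietf:params:netconf:capability:candidate:1.0"]
    def lock(self, ds): self.locked = True
    def unlock(self, ds): self.locked = False
    def get_config(self, ds): return self.running
    def edit_config(self, ds, cfg): self.candidate += cfg
    def commit(self): self.running += self.candidate; self.candidate = ""
    def discard_changes(self): self.candidate = ""

s = FakeSession()
result = safe_transaction(s, "<vlan>10</vlan>")
assert "<vlan>10</vlan>" in result["after"] and not s.locked
```

The `try/except/finally` shape is the point: discard only runs on failure, unlock runs on every exit, so no path leaves a stale lock or a dirty candidate.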
H2-7 · YANG Modeling (Just enough): Make configs machine-safe
YANG turns configuration from fragile text into a schema-driven, strongly typed structure.
The practical outcome is fewer operational mistakes: values become validated, constraints become enforceable, and diffs become auditable.
Why YANG reduces “CLI-style” failures
Schema first
Rule: structure is explicit (module → container → list → leaf). Prevents: wrong paths, misspelled nodes, and ambiguous fields. Pass: clients generate paths and validate shape before sending edits.
Constraints
Rule: enforce type/range/enum and cross-field rules (must/when). Prevents: out-of-range values and invalid combinations. Pass: invalid inputs fail at validation with a field-level reason.
Validation
Rule: validate intent before commit (client-side and/or server-side). Prevents: “committed but semantically wrong” configs. Pass: every write has a validation result linked to an audit record.
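The three cards above can be illustrated with a miniature, hand-rolled schema in the spirit of YANG leaf constraints (types, ranges, enums). A real deployment would use a YANG toolchain; this sketch only shows the failure modes that schema-first validation prevents:

```python
# Illustrative schema: two leaves with type/range/enum constraints.
SCHEMA = {
    "vlan-id":   {"type": int, "range": (1, 4094)},
    "port-mode": {"type": str, "enum": {"access", "trunk"}},
}

def validate(config: dict) -> list[str]:
    """Return field-level reasons; an empty list means the intent is valid."""
    errors = []
    for leaf, value in config.items():
        rule = SCHEMA.get(leaf)
        if rule is None:
            errors.append(f"{leaf}: unknown node (wrong path?)")
            continue
        if not isinstance(value, rule["type"]):
            errors.append(f"{leaf}: expected {rule['type'].__name__}")
        elif "range" in rule and not rule["range"][0] <= value <= rule["range"][1]:
            errors.append(f"{leaf}: {value} outside {rule['range']}")
        elif "enum" in rule and value not in rule["enum"]:
            errors.append(f"{leaf}: {value!r} not in {sorted(rule['enum'])}")
    return errors

assert validate({"vlan-id": 10, "port-mode": "access"}) == []
assert validate({"vlan-id": 5000}) == ["vlan-id: 5000 outside (1, 4094)"]
```

Every rejection carries a field-level reason, which is exactly the artifact the audit record needs.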
Model scope (bounded): what becomes machine-manageable
VLAN: membership and tagging mode (access/trunk semantics).
QoS policy: mappings and policy bindings (no scheduling algorithm details).
TSN parameters: treated as versioned, validated objects (no time-slot or gate-list expansion here).
Stop line: detailed TSN tables, scheduling math, or security mechanisms are out of scope for this page.
Version alignment: model_version ↔ firmware_version ↔ templates
Why it matters
Reality: firmware updates can change supported nodes, enums, and behavior. Risk: a previously valid template may become partially incompatible. Pass: capability gating blocks writes when model versions mismatch.
Operational rules
Record: store model_version with inventory for each device.
Declare: templates publish a compatibility window of model versions.
React: version diffs create an audit event and trigger re-validation.
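The compatibility-window rule can be sketched as a one-line gate; the `min_model`/`max_model` template metadata fields are hypothetical names for the declared window:

```python
def template_compatible(template: dict, device_model_version: tuple[int, int]) -> bool:
    """Gate writes on the template's declared compatibility window
    (min_model/max_model are assumed metadata fields)."""
    return template["min_model"] <= device_model_version <= template["max_model"]

tpl = {"name": "vlan-baseline-v3", "min_model": (2, 0), "max_model": (2, 9)}
assert template_compatible(tpl, (2, 4))
assert not template_compatible(tpl, (3, 0))   # firmware upgrade moved the model ahead
```

A failed gate should block the write and emit the audit event described above, never silently fall back to a best-effort push.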
Audit with candidate diffs (who changed what, with evidence)
Must capture: operator/tool/time, target device/port, before/after diff, outcome.
Link: attach validation results and correlation_id to group batch changes.
Pass: failed commits still produce a durable record for root-cause learning.
Diagram: YANG provides structure (schema), rules (constraints), and pre-commit validation so configuration becomes machine-safe.
H2-8 · Workflow: An executable SOP for onboarding, configuration, verification, and rollback
Remote management becomes reliable only when protocols are assembled into an executable SOP.
This workflow defines inputs, steps, outputs, and pass criteria for onboarding, configuration, verification, and rollback.
Per-port: explicit exceptions with traceable rationale.
Pass criteria: every active configuration line can be traced back to a layer and an exception record (no mystery drift).
Verify (read-after-write + reconciliation)
Read-after-write: commit is followed by get-config read-back.
Reconcile: compare inventory facts vs configured objects; classify mismatch causes.
Output: verification report + drift events linked to correlation_id.
Pass criteria: a change is accepted only when the verified actual state matches the desired intent within defined tolerances.
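The reconciliation step can be sketched as a diff that classifies mismatch causes rather than just flagging them; the three categories below are illustrative names for the classification:

```python
def reconcile(desired: dict, actual: dict) -> dict:
    """Classify mismatches between desired intent and verified actual state.
    missing: never applied; diverged: applied differently;
    unmanaged: present on the device but absent from intent."""
    report = {"missing": [], "diverged": [], "unmanaged": []}
    for key, want in desired.items():
        if key not in actual:
            report["missing"].append(key)
        elif actual[key] != want:
            report["diverged"].append(key)
    report["unmanaged"] = [k for k in actual if k not in desired]
    return report

r = reconcile({"vlan": 10, "mode": "access"}, {"vlan": 20, "mode": "access", "mtu": 9000})
assert r == {"missing": [], "diverged": ["vlan"], "unmanaged": ["mtu"]}
```

Each non-empty category maps naturally to a different drift event, which keeps the verification report actionable.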
Rollback (by port / by device / by batch)
By port: smallest blast radius for policy or classification mistakes.
By device: recover from role-level misconfiguration (e.g., VLAN baseline).
By batch: revert a template release that impacted a fleet.
Rule: rollback is followed by read-back verification and audit logging.
Diagram: a closed-loop SOP turns discovery and configuration into a verifiable, auditable operational cycle.
H2-9 · Monitoring & Telemetry: Events, counters, and black-box logging
Remote management is operationally useful only when it is observable and forensically reproducible.
This section defines event sources, metric definitions, black-box records, and alert gating rules that prevent noise while preserving evidence.
Event sources (management-plane evidence)
High-value events
LLDP neighbor change: add/remove/update of neighbor facts.
Link flap: repeated up/down transitions on a port.
Config commit: success/fail/rollback outcomes.
Auth fail: management-channel access failures (as an event class).
Minimum event fields
timestamp: collector time (plus device time if available).
device_id / port_id: stable identifiers for correlation.
event_type + reason_code: classification for routing.
correlation_id: batch change or incident grouping.
config_version: before/after context for commits.
operator/tool: provenance for audit and accountability.
Metrics and accounting (definitions matter)
State and counters
Port state: up/down + last-change time.
Neighbor state: neighbor_count and valid_neighbor_count.
Change counters: neighbor_change_count per port/device/site.
Commit counters: commit_success/fail and rollback_count.
Auth counters: auth_fail_count for management access attempts.
Accounting rules
Window: define rolling windows (e.g., 5 min / 1 h) consistently.
Denominator: normalize per port/device/site to avoid misleading totals.
Latency: separate reporting interval from aggregation interval.
Units: declare units and sampling cadence for every metric.
Pass criteria: every metric has a stable definition (unit, window, sampling, aggregation) and supports consistent alert thresholds.
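The accounting rules above (explicit window, declared unit) can be sketched as a rolling counter; the class name and window choice are illustrative:

```python
from collections import deque

class RollingCounter:
    """Events-per-window metric with an explicit window definition,
    so a threshold means the same thing on every collector."""
    def __init__(self, window_s: float):
        self.window_s = window_s
        self.events = deque()   # event timestamps, seconds

    def record(self, ts: float) -> None:
        self.events.append(ts)

    def rate(self, now: float) -> int:
        # Drop samples that fell out of the rolling window, then count.
        while self.events and now - self.events[0] > self.window_s:
            self.events.popleft()
        return len(self.events)

flaps = RollingCounter(window_s=300.0)       # 5 min window, unit: events/window
for ts in (0.0, 10.0, 350.0, 600.0):
    flaps.record(ts)
assert flaps.rate(now=600.0) == 2            # only 350.0 and 600.0 remain
```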
Black-box records (forensic minimum set)
Required fields
timestamps: collector + device (when available).
env events: temperature/power events as context fields.
config versions: desired vs actual + template/model versions.
operator/tool: provenance for every attempted change.
correlation_id: join events, alerts, and tickets into one chain.
Operational outputs
Verification trail: read-back evidence linked to commits.
Drift events: mismatches classified and routed.
Ticket context: payload that makes incidents reproducible without site return.
Pass criteria: any incident can be reconstructed using timestamps, versions, diffs, and snapshots.
Alert gating (noise suppression without losing evidence)
Three gates
Debounce: wait for stability before alerting on state changes.
Cooldown: coalesce identical alerts within a suppression window.
Rate-limit: cap alert volume; overflow becomes a summarized alert.
Severity mapping
Info: single change with no persistence.
Warn: repeated flaps or commit failures with auto-recovery.
Critical: sustained loss of manageability or fleet-wide correlation.
Pass criteria: alert volume stays bounded while critical sustained failures remain visible and traceable.
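The cooldown and rate-limit gates can be sketched together (debounce is assumed to have run upstream, as in the event pipeline); the class and its return labels are illustrative:

```python
class AlertGate:
    """Cooldown + rate-limit stage of the three-gate policy."""
    def __init__(self, cooldown_s: float, max_per_window: int, window_s: float):
        self.cooldown_s = cooldown_s
        self.max_per_window = max_per_window
        self.window_s = window_s
        self.last_sent = {}                     # alert_key -> last emit time
        self.window_start, self.window_count = 0.0, 0

    def admit(self, alert_key: str, now: float) -> str:
        # Cooldown: coalesce identical alerts inside the suppression window.
        last = self.last_sent.get(alert_key)
        if last is not None and now - last < self.cooldown_s:
            return "suppressed"
        # Rate-limit: overflow becomes a summarized alert; evidence is kept upstream.
        if now - self.window_start > self.window_s:
            self.window_start, self.window_count = now, 0
        self.window_count += 1
        if self.window_count > self.max_per_window:
            return "summarized"
        self.last_sent[alert_key] = now
        return "sent"

g = AlertGate(cooldown_s=60.0, max_per_window=2, window_s=300.0)
assert g.admit("sw1:flap", 0.0) == "sent"
assert g.admit("sw1:flap", 30.0) == "suppressed"   # inside cooldown
assert g.admit("sw2:flap", 40.0) == "sent"
assert g.admit("sw3:flap", 50.0) == "summarized"   # window budget exhausted
```

Note that "suppressed" and "summarized" still leave the underlying events in the evidence store; only the alert volume is bounded.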
Diagram: events are deduplicated and correlated, stored as evidence (including snapshots), then converted into gated alerts and tickets.
H2-10 · Failure Modes & Pitfalls: When discovery/config goes wrong
Field failures are best handled as a bounded decision tree: start from the symptom, run the fastest checks, and land on a minimal fix with a clear pass criterion.
This section stays within discovery/configuration boundaries and avoids cross-domain expansion.
Start points (symptoms)
No neighbor visible: discovery does not populate a neighbor table.
Neighbor table flaps: entries appear/disappear or fields oscillate.
Fleet incident: many devices fail together (aggregation required).
No neighbor visible (fast checks)
Case A — LLDP disabled
Likely cause: LLDP TX/RX is off on the port or device. Quick check: verify per-port LLDP enable state and LLDP counters. Fix: enable LLDP; ensure consistent TX interval and TTL policy. Pass criteria: neighbor appears and stays valid across at least one TTL window.
Case B — management-plane isolation
Likely cause: port isolation or policy blocks discovery frames. Quick check: confirm port policies allow LLDP and the device is in the expected management context. Fix: relax the blocking rule for discovery; keep changes scoped to the port/site policy. Pass criteria: neighbor table populates consistently after policy update and remains stable.
Case C — TTL/timer mismatch
Likely cause: TTL is too short or collector timing causes premature expiry. Quick check: compare TX interval, TTL, and ingestion/aggregation windows. Fix: align TTL and TX interval; ensure collector windows do not alias the TTL. Pass criteria: neighbors do not expire unexpectedly under steady link conditions.
Neighbor table flaps (do not guess; use evidence)
Case D — link flap drives churn
Likely cause: port state changes invalidate and re-learn neighbors repeatedly. Quick check: correlate neighbor changes with port up/down timestamps. Fix: apply alert debouncing and confirm stability at the management layer before escalating. Pass criteria: neighbor churn rate drops below the operational threshold within the observation window.
Case E — delayed discovery processing
Likely cause: delayed sending/processing makes the table appear unstable. Quick check: inspect event queue latency and “frame age” distributions in telemetry. Fix: adjust ingestion cadence and ensure TTL comfortably exceeds expected processing delays. Pass criteria: table stability returns without changing topology or templates.
Case F — timer aliasing
Likely cause: collection windows alias TX/TTL cadence, amplifying oscillations. Quick check: compare collector poll/aggregation periods with TX interval and TTL. Fix: de-synchronize windows (jittered polling) and use stable rolling aggregations. Pass criteria: neighbor presence does not oscillate on a fixed period.
Configuration fails (NETCONF failure modes)
Case G — lock contention
Likely cause: another session holds the lock or lock timeout is too aggressive. Quick check: check active sessions and lock owner/age where available. Fix: implement backoff and consistent unlock/discard cleanup for failed sessions. Pass criteria: commits succeed after deterministic retries without leaving stale locks.
Case H — candidate not committed
Likely cause: candidate changes exist but never reach commit or get discarded properly. Quick check: read candidate vs running; detect uncommitted diffs. Fix: enforce a transaction boundary: lock → edit → validate → commit (or discard). Pass criteria: candidate is empty after completion, and read-back matches desired intent.
Case I — model/version mismatch
Likely cause: templates target unsupported nodes or enums for the current model version. Quick check: inspect capabilities exchange and model_version compatibility gating. Fix: select a compatible template or update the model/template matrix and re-validate. Pass criteria: validation passes before commit and produces a clean audit trail.
Fleet-wide incidents (aggregation, not alert floods)
Case J — discovery suppressed at scale
Likely cause: suppression/policing rules reduce discovery visibility during storms. Quick check: correlate simultaneous neighbor loss across many ports/devices in telemetry. Fix: adjust suppression scope and verify LLDP event throughput in the pipeline. Pass criteria: discovery remains visible or degrades gracefully with summarized alerts.
Case K — slow convergence is misread as failure
Likely cause: event backlog or long aggregation windows delay visibility. Quick check: measure queue latency and time-to-index for event ingestion. Fix: reduce aggregation delay, add backpressure indicators, and use summarized fleet alerts. Pass criteria: time-to-visibility stays within the operational SLO during bursts.
Diagram: start from a bounded symptom, run the fastest checks, and land on a minimal fix with a clear pass criterion.
H2-11 · Summary & Checklist: From protocol capability to deployable outcomes
This chapter converts protocol capability into deployable outcomes: a 5-minute topology/inventory snapshot, safe batch changes with rollback,
and audit-ready operations. Selection is framed as a closed-loop capability checklist (Discover → Normalize → Configure → Verify → Audit).
Scope stop-line: LLDP / LLDP-MED / NETCONF / YANG, normalization, audit, rollback, alert de-noise.
No PHY/SI waveforms, no TSN tables, no PoE electrical deep-dive (PoE details belong to the PoE page).
Fixed answer format per item (4 lines): Likely cause / Quick check / Fix / Pass criteria (numeric thresholds use X/Y/N placeholders).
LLDP is enabled, but the NMS cannot see any neighbors.
Likely cause: LLDP frames are not reaching the collector (disabled on one side, filtered, wrong collection interface), or the NMS is not ingesting/displaying the received records.
Quick check: Compare local neighbor table vs adjacent switch neighbor table; verify LLDP Tx/Rx counters increase; confirm NMS collection job status and any LLDP filters.
Fix: Enable LLDP Tx/Rx on both ends; align TLV selection and collection path; ensure LLDP ethertype is not blocked; standardize ingestion and de-dup keys.
The neighbor table flaps: entries appear and disappear, or fields oscillate.
Likely cause: TTL/interval mismatch, unstable de-dup keys (port-id changes), link flaps, or LAG/virtual ports being treated as physical ports.
Quick check: Compute churn rate per port; check TTL-expiry ratio; verify chassis-id+port-id stability; correlate neighbor churn with link up/down events.
Fix: Normalize de-dup key (stable chassis-id + stable port identity); align LLDP interval/TTL; add debounce for short flaps; treat LAG as a single logical interface.
After moving the same device to another port, it is treated as a new device (asset ID jumps).
Likely cause: Identity is anchored to port-dependent fields, missing a stable device anchor, or merge rules are absent/incorrect in the inventory schema.
Quick check: Compare chassis-id, system-name, mgmt-address, serial/firmware fields; inspect the stable-id precedence rules and merge/audit logs.
Fix: Enforce stable device-id precedence (serial/cert-hash > chassis-id+vendor/model); keep port identity separate; add merge-with-history and a de-dup policy with correlation IDs.
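The stable-id precedence in the fix can be sketched as a small selector; the field names and the precedence order follow the fix line, and the final error is an assumed "queue for manual merge" policy:

```python
def stable_device_id(facts: dict) -> str:
    """Pick the device identity anchor by precedence:
    serial/cert-hash > chassis-id+vendor/model.
    Port-dependent fields are deliberately excluded from identity."""
    if facts.get("serial"):
        return f"serial:{facts['serial']}"
    if facts.get("cert_hash"):
        return f"cert:{facts['cert_hash']}"
    if facts.get("chassis_id") and facts.get("model"):
        return f"chassis:{facts['chassis_id']}:{facts['model']}"
    raise ValueError("no stable identity anchor; queue for manual merge")

before = stable_device_id({"serial": "SN123", "chassis_id": "aa:bb", "port_id": "ge-0/0/1"})
after  = stable_device_id({"serial": "SN123", "chassis_id": "aa:bb", "port_id": "ge-0/0/7"})
assert before == after == "serial:SN123"   # a port move does not mint a new asset
```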
LLDP-MED classifies the endpoint type incorrectly.
Likely cause: Endpoint class TLV is missing/ambiguous, the NMS mapping table is outdated, or vendor/OUI subtypes are misinterpreted.
Quick check: Inspect raw MED TLVs and OUI subtype; verify mapping-table version/hash; compare against the site’s expected endpoint template.
Fix: Update mapping rules; require explicit evidence for each class; prefer “unknown” over a wrong class; allow site overrides with versioned policies.
Pass criteria: classification_accuracy_pct ≥ X%; unknown_rate_pct ≤ X%; misclass_corrected_within_min ≤ Y.
LLDP-MED power TLV exists, but PoE power behavior is “wrong”.
Likely cause: Power TLV is an advertisement/accounting signal, not a guarantee of negotiated power; source-of-truth priority or units/classes are misapplied.
Quick check: Compare MED power fields vs management DB records (timestamp freshness + per-port budget record); confirm which source is treated as authoritative.
Fix: Treat MED power as a hint; reconcile with PoE telemetry/controls (see PoE page) using priority + staleness rules; log a single discrepancy event with evidence.
NETCONF connects, but edit-config fails every time.
Likely cause: Wrong target datastore, missing lock, insufficient authorization, or YANG constraints/type validation failures.
Quick check: Read the rpc-error details (error-tag, error-path, bad-element); confirm capabilities and supported datastores; verify current lock ownership.
Fix: Gate edits by capabilities; lock before edit; validate payload against the active YANG module-set; stage via candidate then commit (or discard on failure).
Pass criteria: edit_config_success_pct ≥ X%; uncategorized_errors_pct = 0%; triage_time_to_rootcause_min ≤ Y.
Commit succeeds, read-back shows the new config, but field behavior does not change.
Likely cause: Asynchronous apply, shadow vs effective runtime state, or a higher-priority template/override masks the intended change.
Quick check: Compare “config nodes” vs “operational/effective state nodes”; check last-change/apply-status timestamps; verify no higher-priority policy is re-applying old values.
Fix: Enforce read-after-write on effective operational state; record an effective-config snapshot; define an apply timeout that triggers rollback and evidence logging.
Candidate/lock conflicts happen frequently (multiple systems compete to write).
Likely cause: No single-writer policy, lock leases are too long, or synchronized retries create a write storm.
Quick check: Measure conflict rate and p95 lock hold time; identify top lock holders; correlate conflicts with batch windows or periodic automation runs.
Fix: Enforce single-writer per scope; use lease-based locks; add exponential backoff with jitter; schedule explicit change windows and failure-domain limits.
Pass criteria: lock_conflict_rate_pct ≤ X%; lock_hold_p95_s ≤ X; max_retry_attempts ≤ N.
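The "exponential backoff with jitter" part of the fix can be sketched in a few lines; full jitter (uniform over the capped exponential bound) is one common choice, not the only one:

```python
import random

def backoff_delays(attempts: int, base_s: float = 1.0, cap_s: float = 30.0):
    """Exponential backoff with full jitter: competing writers desynchronize
    instead of retrying in lockstep and re-creating the write storm."""
    for n in range(attempts):
        yield random.uniform(0, min(cap_s, base_s * 2 ** n))

delays = list(backoff_delays(5))
assert len(delays) == 5
assert all(0 <= d <= 30.0 for d in delays)
```

The cap bounds worst-case wait, and the randomization is what breaks the synchronized-retry pattern described in the quick check.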
After a model version update, old automation scripts stop working.
Likely cause: YANG schema paths/types changed, capabilities changed, or scripts are hard-coded to old module versions.
Quick check: Snapshot yang-library/module-set-id; diff module versions; run a dry-run validation against the active module-set before pushing changes.
Fix: Maintain a model×firmware matrix; compile templates per module-set; add capability gating and adapters for backward compatibility.
After rollback, behavior is still abnormal (cache or multiple datastores out of sync).
Likely cause: Rollback applied to the wrong datastore (candidate vs running vs startup), cached operational state is stale, or partial commits created inconsistent effective state.
Quick check: Compare running vs startup hashes; verify last-commit ID; check “effective-config” snapshot and any reboot-required flags.
Fix: Make rollback target explicit; refresh caches; reconcile startup and running (or schedule controlled reboot if required); clear staged configs and re-verify.
Pass criteria: running_startup_hash_match = true (within X s or next reboot); behavior_restored_within_min ≤ Y.
Alert storm: neighbor-change alerts are too frequent.
Likely cause: Link flaps or TTL mismatch generates frequent change events, and alerts are emitted per event without debounce, cooldown, or aggregation.
Quick check: Compute neighbor-change rate per port and correlate with link events; verify de-dup keys; inspect debounce/cooldown/rate-limit settings and aggregation keys.
Fix: Debounce neighbor changes; apply cooldown per port; aggregate alerts by device+port+reason; promote only sustained anomalies and preserve evidence links.