DMA & High Throughput on SPI/I2C/UART: Bursts, Framing, Latency
DMA high throughput is achieved by building a measurable end-to-end data pipeline—FIFO → DMA → RAM buffer → consumer—then tuning burst size, buffering, and cache coherency to maximize sustained payload while keeping worst-case latency and jitter bounded.
What “High Throughput” Really Means on Peripheral Buses
“High throughput” is not a single knob. A design that maximizes sustained payload rate can easily worsen first-byte latency, increase jitter, or burn CPU on interrupts. This section fixes the measurement vocabulary and the decision order, so that later tuning (DMA, buffering, framing) converges instead of thrashing, and avoids accidental scope creep into signal-integrity pages.
- Sustained throughput and peak burst are different goals; each implies different DMA batch sizes and interrupt cadence.
- End-to-end latency must specify a reference point (first-byte vs last-byte vs service start); “average latency” alone is not a real-time guarantee.
- Jitter (determinism) is driven by contention (DMA arbitration, memory, scheduling), not just bus frequency.
- CPU budget is dominated by interrupt rate and data copies; DMA can reduce CPU while increasing latency if batching is unbounded.
The 4 metrics (measurement-grade definitions)
Sustained throughput
Effective payload rate over a long window (e.g., 1–10 s), excluding startup transients; this is what “stable streaming” means.
Peak burst
Short-window maximum payload rate; often limited by peripheral FIFO depth, DMA burst length, and memory bandwidth.
End-to-end latency
Time from “data becomes valid at the producer” to “data is usable by the application”. Must state the exact tap points.
Latency jitter (determinism)
Spread of end-to-end latency across time and system load. Use percentiles (P99/P999) plus maximum, not only averages.
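As a concrete illustration of tail-based measurement, percentiles can be computed offline from collected latency samples. This is a minimal C sketch (not any vendor API) using the nearest-rank method; note that it sorts the sample array in place.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

static int cmp_u32(const void *a, const void *b) {
    uint32_t x = *(const uint32_t *)a, y = *(const uint32_t *)b;
    return (x > y) - (x < y);
}

/* Nearest-rank percentile of latency samples (e.g., microseconds).
   Call with pct = 0.99 for P99, 0.999 for P999; samples[n-1] after
   the call is the maximum. Sorts the array in place. */
uint32_t latency_percentile(uint32_t *samples, size_t n, double pct) {
    qsort(samples, n, sizeof *samples, cmp_u32);
    size_t idx = (size_t)(pct * (double)(n - 1) + 0.5);
    return samples[idx];
}
```

Report P99/P999 together with the maximum from the same run; the average alone hides exactly the tail this section is about.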
Latency vocabulary (avoid measurement confusion)
- First-byte latency: time until the first useful byte/sample can be consumed (most sensitive to batching).
- Last-byte latency: time until the entire frame/block is complete (most sensitive to transfer size).
- Service latency: time until the consumer actually starts processing (most sensitive to scheduling/locks/cache).
Bottleneck map (typical high-throughput chain)
Throughput and determinism are set by the weakest stage, not by bus frequency alone.
- Bus bandwidth (wire time vs gaps)
- Peripheral FIFO (watermarks, overruns/underruns)
- DMA arbitration (priority, contention, burst length)
- Memory/cache (bandwidth, cache-coherency overhead)
- ISR/scheduling (interrupt rate, critical sections)
- Consumer/application (back-pressure, processing budget)
When DMA is the right baseline (practical triggers)
- CPU copy dominates: data-move work persistently consumes a noticeable share of CPU time and displaces real tasks.
- Interrupt rate is the bottleneck: frequent per-byte/per-small-chunk interrupts cause a throughput “sawtooth” and missed deadlines.
- Real-time deadlines exist: worst-case service time and jitter must be bounded, not just “good on average”.
Measurement hooks (place these before tuning)
- Counters: FIFO overrun/underrun, DMA errors, retries, dropped frames, queue depth high-watermark.
- Timestamps: producer-valid, DMA-complete, consumer-start, app-commit (consistent tap points).
- CPU load: ISR time, copy time, lock wait time (separate “busy” from “blocked”).
Pass criteria (template)
- Sustained payload ≥ X (MB/s) over window T
- P99 end-to-end latency ≤ Y ms; max ≤ Z ms
- Data-move CPU load ≤ W% (ISR + copy + cache maintenance)
- Loss/overrun counters = 0 (or explicitly bounded by a stated drop policy)
Scope guard
This section focuses on throughput/latency/jitter/CPU definitions and end-to-end pipeline bottlenecks. Signal integrity, clock quality, and termination are handled in SCLK Quality & Skew and Long-Trace SI.
Throughput Budget Model: Bits on the Wire vs Useful Payload
If line rate is high but real payload throughput is low, the missing bandwidth is almost always consumed by overhead and gaps. A usable budget model must be decomposable: each loss term should be measurable and optimizable independently.
Practical payload model (decomposable into measurable factors)
Payload throughput = Line rate × Protocol efficiency × Continuity × (1 − Retry loss)
Each factor maps to a different class of fixes: framing structure, timing gaps, software cadence, or error recovery.
- Protocol efficiency: structural overhead per payload (headers/commands/turnaround/dummies).
- Continuity: how “continuous” transfer time is (burst continuity) versus being punctured by gaps.
- Retry loss: bandwidth eaten by retries/NAKs/CRC recovery; treat as a first-class budget term.
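The budget model above is trivial to encode, which makes it useful as a sanity check against measured numbers. A minimal C sketch with illustrative values (all factors are fractions in [0, 1]):

```c
/* Decomposable payload budget: each factor maps to a different fix
   (framing structure, timing gaps, software cadence, error recovery). */
double payload_throughput(double line_rate_mbps,
                          double protocol_efficiency,
                          double continuity,
                          double retry_loss) {
    return line_rate_mbps * protocol_efficiency * continuity
         * (1.0 - retry_loss);
}
```

For example, a 50 Mb/s line with 90% protocol efficiency, 90% continuity, and 10% retry loss yields about 36.45 Mb/s of payload; if measurement lands far below that, a loss term is missing from the model.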
The 3 overhead classes (separate them to avoid the wrong fix)
1) Frame / command overhead
Structural bytes and phases required per payload. Optimization usually means larger blocks, fewer boundaries, or better packing.
2) CS / turnaround gaps (hard gaps)
Mandatory timing holes between phases or devices. These show up clearly on a logic analyzer as wire-time with no payload.
3) Software gaps (soft gaps)
DMA re-arming delays, ISR work, cache misses, locks, and scheduling. These gaps disappear only when cadence is engineered.
Quick estimation (rule-of-thumb bands)
These bands are useful for sanity checks before deep optimization. If measured results land far below, a large gap term exists.
85–95% efficiency
Large blocks, few boundaries, minimal hard gaps, and DMA re-armed ahead of time (continuity is high).
70–85% efficiency
Periodic boundaries and moderate hard gaps; software cadence is acceptable but not fully pipeline-optimized.
60–70% (or lower)
Fragmented transfers with frequent turnarounds and visible software gaps; retries may amplify the loss dramatically.
How to prove the budget (measure → attribute → prioritize)
- Measure wire-time segmentation: payload vs overhead vs gaps (logic/protocol analyzer).
- Separate hard vs soft gaps: hard gaps persist even with a busy CPU; soft gaps correlate with ISR/scheduling stalls.
- Measure retry loss as a rate: retries per second or per megabyte; treat it as a throughput tax.
- Prioritize the biggest term: the largest gap/overhead term is the first optimization target.
Pass criteria (budget closure)
- Efficiency computed from measurement matches the model within ±X% (no hidden loss term).
- Hard-gap share and soft-gap share are separately quantified (each has a named root cause category).
- Retry loss is bounded and tracked (rate + worst-case bursts), not just “it seems fine”.
Scope guard
This section models payload loss using overhead and gaps. Electrical edge quality, termination, and routing are handled in Long-Trace SI.
Latency & Determinism: Why DMA Can Make Latency Worse
DMA often improves sustained throughput and reduces CPU time by batching transfers. However, batching can increase first-byte latency and widen the latency distribution under contention. Real-time designs should start from a latency budget (P99/P999 and maximum), then choose batch sizes and watermarks that keep worst-case service time bounded.
- Batching trades latency for CPU: larger batches reduce interrupts but can delay the first usable byte/sample.
- Determinism is about the tail: use P99/P999 and maximum, not only average latency.
- Jitter comes from contention: arbitration, memory/cache stalls, and priority inversion widen the latency distribution.
- Watermarks are a hard knob: they define when the system “starts caring” and strongly shape first-byte latency.
Mechanism: batching increases “wait-to-fill” time
When transfers are triggered at a watermark or a fixed batch size, the system must wait until enough data accumulates before a DMA completion event can wake the consumer. This reduces interrupt frequency and improves continuity, but increases the time until the first usable bytes become available.
- Small batch: lower first-byte latency, higher interrupt cadence, higher CPU overhead.
- Large batch: higher throughput and lower CPU, but first-byte latency increases and jitter can widen under load.
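The wait-to-fill trade-off can be made explicit with a back-of-envelope model. A minimal C sketch, ignoring bus gaps and assuming a steady producer rate (both simplifications):

```c
#include <stdint.h>

/* First-byte latency floor: the completion that wakes the consumer
   cannot fire until batch_bytes have accumulated (ceiling division). */
uint32_t wait_to_fill_ms(uint32_t batch_bytes, uint32_t bytes_per_ms) {
    return (batch_bytes + bytes_per_ms - 1) / bytes_per_ms;
}

/* The other side of the trade: completion interrupts per second
   fall as the batch grows. */
uint32_t irq_per_second(uint32_t bytes_per_ms, uint32_t batch_bytes) {
    return (bytes_per_ms * 1000u) / batch_bytes;
}
```

At 1 MB/s (1000 bytes/ms), a 4096-byte batch adds about 5 ms of wait-to-fill while cutting the interrupt rate to roughly 244/s; this pair of numbers is what the latency budget must bound.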
The three latency components (use consistent tap points)
First-byte latency
Producer-valid → consumer can read the first usable byte/sample. Most sensitive to batching and watermark thresholds.
Last-byte latency
Producer start → entire frame/block completed in memory. Dominated by transfer size and bus/memory bandwidth.
Service latency
DMA-complete → consumer actually starts processing. Dominated by scheduling, locks, cache refill, and interrupt masking.
Why jitter widens (root-cause categories)
DMA arbitration & contention
Multi-channel DMA and bus-matrix contention introduce variable queueing time. Worst-case wait time sets jitter tail.
Cache refill / maintenance
Cache-line refills and coherency operations can be non-uniform under load. Separate “copy time” from “cache time”.
Priority inversion & scheduling
Long critical sections and lock contention delay the consumer even after DMA completes. This often dominates P99/P999.
Watermark policy
Higher watermarks reduce interrupts but delay wakeups; if combined with contention, tails become much wider.
Real-time rule: define a latency budget, then choose batch size
- Set targets: P99 and max for end-to-end latency and service latency.
- Bound batching: set a maximum batch size and watermark so first-byte latency stays under budget.
- Control contention: assign DMA priority and limit burst length where worst-case wait time matters.
- Prove the tail: collect latency histograms under worst-case load (not only idle lab conditions).
Pass criteria
- P99 end-to-end latency ≤ X; max ≤ Y
- P99 service latency ≤ A (consumer wake and start bounded)
- First-byte latency ≤ B (batching bounded by watermark and max batch size)
- Latency histograms remain within bounds under worst-case contention (DMA + memory + CPU load)
Scope guard
This section covers batching, arbitration, cache, and scheduling as sources of latency and jitter. Electrical edge quality is handled in SCLK Quality & Skew and Long-Trace SI.
DMA Building Blocks (Channels, Requests, Descriptors, Scatter-Gather)
DMA is a small execution engine: requests trigger transfers, channels compete for shared bandwidth, and descriptors define the copy plan (where, how much, and when to interrupt). Understanding these building blocks makes throughput tuning predictable and prevents “mystery jitter” from arbitration and buffer policy.
DMA as a pipeline (what each part controls)
Request (trigger)
Defines when DMA runs (e.g., FIFO watermark). Too frequent wastes CPU; too sparse increases first-byte latency.
Channel (resource owner)
A channel competes for shared bandwidth. Its priority and burst policy shape worst-case wait time (jitter tail).
Descriptor (transfer plan)
Specifies source, destination, length, and next pointer. The interrupt flag controls cadence and CPU load.
Linked list / scatter-gather
Chains descriptors to reduce re-arming gaps, enable ring buffers, and support zero-copy staging across multiple blocks.
Transfer modes (choose by continuity and determinism)
Single-shot
- Best for fixed-size blocks and command-driven transfers.
- Risk: soft gaps between blocks if re-arming is slow.
Cyclic (ring)
- Best for continuous streams (high continuity, minimal soft gaps).
- Requirement: robust head/tail management and overrun detection.
Scatter-gather (linked descriptors)
- Best for zero-copy staging and long transfers without re-arming.
- Risks: alignment/length limits, chain integrity, error handling strategy.
Descriptor design hard rules (avoid silent corruption)
- Alignment: keep buffer base and length aligned to platform requirements (often cache-line and bus-burst aligned).
- Length limits: never exceed the per-descriptor maximum; split into multiple descriptors instead of “hoping it wraps”.
- Ring integrity: validate next pointers and wrap behavior; a broken chain becomes a “random freeze”.
- Interrupt policy: too frequent interrupts collapse throughput; too sparse interrupts inflate first-byte and service latency.
- Error strategy: define what happens on DMA error (log + reset channel + bounded retries + safe fallback).
Bring-up checklist
- Verify the request source toggles (FIFO watermark) before DMA is enabled.
- Verify one descriptor moves the expected byte count to the expected address.
- Enable chain mode and confirm wrap returns to Desc0 without stopping.
- Measure interrupt cadence and confirm it matches the intended watermark/batch policy.
- Inject an error (timeout/abort) and confirm recovery does not deadlock the pipeline.
Buffering Strategies: Double/Triple Buffer, Ring Buffer, Watermark Tuning
Buffering turns bursty transfers into a steady stream that software can consume without overruns or underruns. The critical knob is the watermark: it creates margin between “data arrives faster” and “data is consumed slower,” shaping interrupt cadence, first-byte latency, and worst-case service time.
- Double buffer favors determinism when the consumer can finish each block within one block period.
- Triple buffer absorbs short consumer stalls but increases queueing and the latency upper bound.
- Ring buffer maximizes continuity and is the foundation for frame slicing (see H2-6).
- Watermark too low: high callback rate → CPU overhead → service jitter. Too high: wait-to-fill → first-byte delay.
Common buffering patterns (choose by continuity and deadline risk)
Double buffer (Ping-pong)
- Best when: fixed blocks and bounded consumer time (tight deadlines).
- Risk: switch gaps if re-arming is late; immediate overrun if the consumer slips.
- Observe: P99 “DMA complete → consumer done” stays below one block period.
Triple buffer
- Best when: occasional consumer stalls exist but average throughput is sufficient.
- Trade-off: lower drop risk, higher queueing (latency upper bound grows).
- Observe: buffer occupancy rarely hits high-watermark under worst-case load.
Ring buffer
- Best when: continuous streams and minimal soft gaps are required.
- Requirement: robust head/tail accounting, wrap handling, and overrun recovery.
- Observe: fill-level distribution stays away from 0% and 100% for stability.
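The head/tail accounting that all three patterns depend on can be sketched in a few lines. A minimal single-producer/single-consumer ring in C, using free-running counters and a power-of-two size (a common embedded idiom, not a specific library):

```c
#include <stdbool.h>
#include <stdint.h>

#define RING_SIZE 256u         /* must be a power of two */

typedef struct {
    uint8_t  buf[RING_SIZE];
    uint32_t head, tail;       /* free-running; masked only on access */
    uint32_t overruns;         /* exposed for the pass criteria */
} ring_t;

static uint32_t ring_fill(const ring_t *r) { return r->head - r->tail; }

/* Producer side (DMA completion path): drop-new on overrun. */
bool ring_push(ring_t *r, const uint8_t *data, uint32_t n) {
    if (ring_fill(r) + n > RING_SIZE) { r->overruns++; return false; }
    for (uint32_t i = 0; i < n; i++)
        r->buf[(r->head + i) & (RING_SIZE - 1)] = data[i];
    r->head += n;
    return true;
}

/* Consumer side: returns how many bytes were actually available. */
uint32_t ring_pop(ring_t *r, uint8_t *out, uint32_t n) {
    uint32_t avail = ring_fill(r);
    if (n > avail) n = avail;
    for (uint32_t i = 0; i < n; i++)
        out[i] = r->buf[(r->tail + i) & (RING_SIZE - 1)];
    r->tail += n;
    return n;
}
```

Free-running counters make the fill level a single subtraction that stays correct across wrap, and the overrun counter feeds directly into the pass criteria below.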
Watermark tuning (two safe operating goals)
Determinism-first
- Use smaller batches and more frequent callbacks to bound first-byte latency.
- Keep callback work minimal (short path): move pointers/markers, defer heavy parsing.
- Set a max batch size so wait-to-fill cannot consume the latency budget.
Throughput-first
- Use larger batches to reduce ISR frequency and improve continuity.
- Ensure consumer capacity exceeds producer average rate; otherwise buffers only delay failure.
- Prefer ring buffers and linked descriptors to minimize “re-arming gaps”.
Common pitfalls (symptom → first check → fix → pass)
Switch-gap “holes”
Symptom: sawtooth throughput and periodic idle gaps.
First check: re-arming latency and callback critical sections.
Fix: link descriptors or pre-arm the next buffer before the switch point.
Pass: measured “DMA complete → next start” gap ≤ X.
Consumer falls behind
Symptom: sporadic overruns under load.
First check: P99 service latency vs buffer headroom at the chosen watermark.
Fix: shorten the consumer critical path; split “fast pointer move” and “slow parse”.
Pass: occupancy stays below high-watermark with worst-case workload.
Overrun causes misalignment
Symptom: after one drop, parsing stays wrong.
First check: whether a resync policy exists after overrun.
Fix: mark the drop, enter resync (scan next marker/boundary), then resume framing.
Pass: recovery completes within Y ms and error counters stop rising.
Pass criteria (buffering is stable)
- Fill-level stays within [L%, H%] under worst-case producer/consumer load (no sustained drift to 0% or 100%).
- Overrun/underrun counters remain 0 (or are bounded with a defined resync recovery).
- P99 service latency stays within the headroom implied by watermark and buffer depth.
- Callback/ISR cadence matches the intended policy (no unexpected over-frequency).
Framing & Bursts: Keeping Boundaries Without Killing Performance
Burst transfers maximize efficiency, but application data often has frame boundaries (messages, lines, packets). The core rule is burst ≠ frame. A robust design writes continuously into a ring buffer using DMA, then performs frame slicing (markers + pointers) without copying or blocking the DMA pipeline.
- Fixed length is DMA-friendly but needs resync after drops.
- Length-prefixed supports variable frames; must bound length and handle corruption.
- Delimiter/idle is easy to scan but needs escape/validation and recovery policy.
- Frame slicing in a ring uses markers (start/len/end) and span descriptors for wrap-around.
Framing methods (choose by recovery strength, not just convenience)
Fixed length
- Best for: constant-size frames and strict timing.
- Risk: one drop shifts alignment; define resync points and maximum drift time.
- Pass: bounded resync time after a forced drop.
Length-prefixed
- Best for: variable frames with bounded maximum size.
- Risk: corrupted length can “consume” the ring; enforce max length and validation.
- Pass: invalid lengths are rejected and recovery is bounded.
Delimiter / idle
- Best for: human-readable lines or simple message streams.
- Risk: noise/drops create false boundaries; use escaping and validation.
- Pass: bounded scan window and deterministic resync policy.
Ring slicing workflow (burst-friendly boundaries)
- DMA writes continuously into a ring buffer (maximize continuity, minimize soft gaps).
- Parser creates markers (start/len/end) without copying; markers reference ring offsets.
- Frames may wrap; represent as two spans (tail span + head span) instead of memmove.
- Partial frames keep state and wait for more data; never block DMA progress.
- On error (bad length/CRC/delimiter), enter resync mode with bounded scan window and timeout.
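The "two spans instead of memmove" step is the heart of zero-copy slicing. A minimal C sketch of the span computation (the `span_t` layout is illustrative):

```c
#include <stdint.h>

typedef struct { uint32_t off, len; } span_t;

/* A frame starting at ring offset `start` with `len` bytes becomes one
   span when it fits before the wrap point, or two spans (tail span +
   head span) when it crosses it. Returns the span count. */
unsigned frame_to_spans(uint32_t start, uint32_t len, uint32_t size,
                        span_t out[2]) {
    uint32_t first = size - start;
    if (len <= first) {
        out[0] = (span_t){ start, len };
        return 1;
    }
    out[0] = (span_t){ start, first };     /* tail span */
    out[1] = (span_t){ 0, len - first };   /* head span */
    return 2;
}
```

Consumers that can iterate spans never pay a copy; consumers that need contiguity copy only the wrapped frames (copy-on-demand), not the whole stream.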
Sticky failure modes (must be handled explicitly)
- Half packet: frame start present, end missing → hold state and wait; do not emit false frame.
- Sticky misalignment after drop: enter resync mode and bound recovery time.
- False delimiter: require validation/escape and enforce maximum scan distance.
- Corrupted length: clamp to a maximum and reject unreasonable values; avoid ring “runaway”.
- Parser overload: keep parsing lightweight; heavy work should run after boundary is established.
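The half-packet and corrupted-length cases above can be handled in one small decision function. A minimal C sketch for length-prefixed framing; the 1-byte length field and `MAX_FRAME` bound are illustrative assumptions:

```c
#include <stdint.h>

#define MAX_FRAME 64   /* assumed protocol maximum; clamp point */

/* Inspect buf[0..avail) for a complete frame (1-byte length prefix).
   Returns total frame length if a complete, valid frame starts at
   buf[0]; 0 if more data is needed (half packet: hold state, do not
   emit); -1 if the length is invalid (caller enters resync mode). */
int frame_peek(const uint8_t *buf, uint32_t avail) {
    if (avail < 1) return 0;
    uint8_t len = buf[0];
    if (len == 0 || len > MAX_FRAME) return -1;  /* corrupted length */
    if (avail < 1u + len) return 0;              /* half packet */
    return 1 + len;
}
```

The three return values map one-to-one onto the sticky failure modes: hold state, reject-and-resync, or consume; nothing here can "consume" the ring on a garbage length byte.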
Pass criteria (boundaries are preserved without killing throughput)
- DMA continuity remains high (soft gaps bounded; no parser-induced stalls).
- Frame extraction is zero-copy whenever possible (markers/spans, not memmove).
- Resync completes within X ms and does not cause ring runaway.
- Frame error counters stop rising after recovery; parser time stays bounded (P99).
CPU Interaction: Interrupt Rate, Polling, Zero-Copy, Back-Pressure
High throughput fails when CPU involvement becomes the bottleneck. A stable DMA pipeline treats interrupts as control-plane events (pointer/marker updates), keeps heavy work in a consumer task, and uses back-pressure to prevent queue runaway when the consumer cannot keep up.
- Interrupt rate too high fragments CPU time → cache churn, scheduler overhead, and lock contention.
- Zero-copy works only when ownership/lifetime is explicit; otherwise copy or copy-on-demand is safer.
- Back-pressure is mandatory for overload: drop policy, reduced burst, or hardware flow control (covered in UART RTS/CTS page).
- Producer/Consumer/Monitor is the stable pattern: DMA pushes, task consumes, stats closes the loop.
Interrupt rate & throughput collapse (control-plane vs data-plane)
- Fixed per-interrupt cost: entry/exit, cache pollution, scheduler bookkeeping, and deferred work.
- Failure mode: “more interrupts to be responsive” reduces effective data-plane time, causing soft gaps and jitter.
- Rule: ISR should only move pointers/markers, update counters, and wake the consumer task.
Polling vs interrupt (a robust mixed strategy)
Interrupt (event-driven)
- Best for: bounded first-byte latency and wake-up from idle.
- Use it for: “data available” signal, pointer advance, marker creation, counter updates.
- Avoid: parsing, copying, and long critical sections inside ISR.
Short-window polling (batch)
- Best for: high throughput after a wake event (consume in chunks).
- Use it for: bounded loops that drain N frames or M bytes, then yield.
- Benefit: fewer context switches and better cache locality than per-chunk interrupts.
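The mixed strategy reduces, in the consumer task, to a bounded drain loop: one wake event amortizes many frames, and the bound keeps the task from monopolizing the CPU. A minimal C sketch over a hypothetical frame-queue counter (a real system would pop frames from the ring here):

```c
typedef struct {
    unsigned pending;   /* frames ready (advanced by the ISR/marker path) */
    unsigned handled;   /* frames consumed by the task */
} frame_q_t;

/* Drain at most max_frames per wake, then return so the caller can
   re-arm the wake event and yield. The bound is the determinism knob. */
unsigned drain_bounded(frame_q_t *q, unsigned max_frames) {
    unsigned n = 0;
    while (n < max_frames && q->pending) {
        q->pending--;       /* stand-in for: slice + parse one frame */
        q->handled++;
        n++;
    }
    return n;
}
```

If `drain_bounded` keeps returning the full `max_frames`, the consumer is saturated and back-pressure should engage rather than letting the queue grow without bound.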
Zero-copy boundaries (ownership and lifetime decide)
Direct consume (safe zero-copy)
- Consumer reads without modifying; buffer lifetime ends after processing.
- Frames are referenced by markers/spans (ring offsets), not copied.
- Cache coherency is enforced at the ownership switch (see H2-8).
Copy-on-demand (partial copy)
- Copy only when a wrapped frame needs contiguity, or when upper layers cannot handle spans.
- Keep copy scope minimal: edges only, not whole streams.
- Preserve throughput by batching copies and avoiding per-byte operations.
Must copy (safety over speed)
- Data must be retained long-term or modified in-place.
- Buffer ownership is unclear or can be preempted by re-use.
- Security/sandbox boundaries require isolated copies.
Back-pressure policies (prevent queue runaway)
Drop policy (controlled loss)
- Choose drop-new, drop-old, or drop-to-boundary (resync-friendly).
- Define trigger and recovery thresholds (hysteresis) to avoid oscillation.
- Expose counters for monitoring and post-mortem.
Reduce load (graceful degradation)
- Reduce burst size or watermark to cap worst-case service latency.
- Lower source rate where possible (application-level throttling).
- Prefer deterministic caps over “unbounded buffering”.
Flow control (hardware support)
- Use hardware flow control when available (e.g., UART RTS/CTS) to stop the producer at the source.
- This page only defines the policy; electrical/timing details belong in the UART flow control subpage.
- Still keep drop/degrade as a failsafe for misbehaving sources.
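The trigger/recovery hysteresis mentioned under the drop policy can be captured in a tiny state machine. A minimal C sketch; the 80%/50% thresholds are illustrative, not recommendations:

```c
#include <stdbool.h>
#include <stdint.h>

#define BP_HIGH 80u   /* engage above this occupancy (%) */
#define BP_LOW  50u   /* release only below this occupancy (%) */

typedef struct {
    bool     dropping;
    uint32_t drop_events;   /* exposed for monitoring/post-mortem */
} backpressure_t;

/* Hysteresis: the gap between BP_HIGH and BP_LOW prevents oscillation
   when occupancy hovers near a single threshold. Returns whether the
   drop/degrade policy is currently active. */
bool backpressure_update(backpressure_t *bp, uint32_t occupancy_pct) {
    if (!bp->dropping && occupancy_pct >= BP_HIGH) {
        bp->dropping = true;
        bp->drop_events++;
    } else if (bp->dropping && occupancy_pct <= BP_LOW) {
        bp->dropping = false;
    }
    return bp->dropping;
}
```

The same pattern applies whether the response is dropping, reducing burst size, or asserting hardware flow control; only the action taken while `dropping` is true changes.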
Pass criteria (CPU stays out of the critical path)
- Interrupt rate under full load stays ≤ X and does not cause throughput collapse.
- ISR work remains bounded: pointer/marker updates only (no parsing, no long locks).
- Queue occupancy remains within [L%, H%] with a defined back-pressure response.
- Drop/degrade events are visible via counters and recovery is deterministic (no runaway).
Memory System Pitfalls: Cache Coherency, Alignment, DMA-Safe Regions
Many “protocol-looking” failures are memory-system failures: cache lines, alignment, and DMA-safe regions. DMA writes RAM directly while CPUs often read and write through caches. Without correct maintenance at ownership boundaries, software may see stale, repeated, or scrambled data.
- Device → Memory (DMA writes): CPU must not read stale cache lines (invalidate at handoff).
- Memory → Device (DMA reads): RAM must contain latest CPU writes (flush/clean at handoff).
- Cache-line granularity: maintenance ranges must be expanded to line boundaries.
- Alignment: start/length/boundary alignment reduces edge-case failures and jitter.
Device → Memory (DMA writes, CPU reads)
- Risk: CPU keeps old cache lines while DMA updates RAM.
- Rule: before CPU consumes DMA output, invalidate the covered cache-line range.
- Scope: expand start/end to cache-line boundaries to avoid “edge bytes” being stale.
Memory → Device (CPU writes, DMA reads)
- Risk: CPU updates cache but RAM still contains older data.
- Rule: before DMA reads CPU-prepared buffers, flush/clean cache lines to RAM.
- Scope: align to cache-line boundaries; avoid sharing cache lines with unrelated data.
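The boundary-expansion rule is mechanical and worth encoding once. A minimal C sketch; `CACHE_LINE` is an assumed platform constant, and the actual invalidate/clean calls are platform hooks (e.g., CMSIS-style cache maintenance functions) not shown here:

```c
#include <stdint.h>

#define CACHE_LINE 32u   /* assumed platform cache-line size */

/* Expand a maintenance range to cache-line boundaries before
   invalidate (device->mem) or clean (mem->device); partial-line
   maintenance is what corrupts the "edge bytes". */
uintptr_t cache_align_down(uintptr_t addr) {
    return addr & ~(uintptr_t)(CACHE_LINE - 1);
}

uintptr_t cache_align_up(uintptr_t addr) {
    return (addr + CACHE_LINE - 1) & ~(uintptr_t)(CACHE_LINE - 1);
}

/* Usage sketch: before the CPU consumes a DMA-written buffer
   [addr, addr+len), invalidate the expanded range
   [cache_align_down(addr), cache_align_up(addr + len)). */
```

Expanding the range is only safe when the buffer does not share its first and last cache lines with unrelated data, which is exactly why the alignment rules below exist.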
Alignment & DMA-safe regions (prevent “works on bench, fails under load”)
- Start alignment: align buffer start to cache-line and DMA burst-friendly boundaries.
- Length alignment: pad lengths to avoid partial-line maintenance and boundary crossings.
- Boundary rules: avoid crossing forbidden windows (platform bus/bridge limitations).
- DMA-safe memory: prefer regions defined for DMA (coherent or explicitly managed); do not assume all RAM is equivalent.
- Protection hints: MPU/IOMMU mapping errors can appear as silent corruption or hard faults; validate early in bring-up.
Symptoms → first checks → fix → pass
Stale / repeated data
Quick check: verify direction and cache maintenance at ownership handoff.
Fix: invalidate before CPU read (device→mem) or flush before DMA read (mem→device).
Pass: repeated patterns disappear under worst-case load and long runs.
Misalignment / offset shifts
Quick check: buffer start/length alignment and cache-line boundary expansion.
Fix: align start/size; avoid sharing cache lines; pad to boundaries.
Pass: no sporadic “one-byte shift” or partial-line artifacts.
Random corruption (rare)
Quick check: DMA-safe region constraints and forbidden boundary crossings.
Fix: move buffers to DMA-approved regions; enforce address window rules; validate MPU/IOMMU mapping.
Pass: corruption rate drops to 0 over extended soak tests.
Pass criteria (memory interactions are deterministic)
- Correct coherency operations occur at every ownership switch (direction-aware).
- Maintenance ranges are cache-line aligned; buffers do not share cache lines with unrelated data.
- Start/length alignment rules are enforced and verified during bring-up.
- No stale/repeated/scrambled symptoms during long-run soak under worst-case contention.
Bus-Specific Patterns (SPI / UART / I²C): What Changes, What Stays
DMA patterns share the same backbone across buses: FIFO → DMA → RAM buffer → consumer → stats. What changes is the boundary model (CS vs idle vs transactions) and the way overload is handled. Only DMA-relevant differences are covered here; bus electrical/timing details belong to the dedicated SPI/UART/I²C pages.
- Trigger model: which events generate DMA requests (FIFO watermark, RX ready, TX empty).
- Boundary model: how frames are marked (CS, IDLE/BREAK, or transaction limits).
- Overload model: what happens when the consumer is late (drop/degrade/flow-control).
- Metrics: sustained throughput, max latency, jitter (P99), and IRQ rate / CPU time.
SPI (DMA-specific)
- Full-duplex coupling: RX/TX progress often must stay synchronized; dummy TX bytes may be required just to clock RX data in.
- Boundary anchor: CS acts as a hard boundary; define whether a DMA burst must stay within one CS window.
- Off-page pointer: SCLK quality and mode settings affect sampling, but are not expanded here.
First checks
- Verify RX and TX DMA descriptors advance in lockstep when required.
- Confirm CS boundary policy matches the transaction framing expected by the consumer.
UART (DMA-specific)
- Continuous byte stream: boundaries are not inherent; DMA is best treated as a continuous ring writer.
- Boundary trigger: IDLE gaps or BREAK events can generate markers for framing and resync.
- Overload priority: back-pressure and drop-to-boundary policies prevent runaway when the consumer stalls.
- Off-page pointer: baud error and RTS/CTS details belong to UART subpages.
First checks
- Validate IDLE/BREAK markers are created at the correct handoff points.
- Verify ring occupancy has hysteresis and back-pressure triggers before overflow.
I²C (DMA-specific)
- Short transactions: per-transaction overhead is large; DMA is mainly for CPU offload and jitter reduction.
- Bursts are bounded: DMA descriptors typically map to short byte blocks per transaction, not long streams.
- Off-page pointer: clock stretching and pull-up constraints drive worst-case time, but are not expanded here.
First checks
- Confirm DMA is reducing CPU touch points rather than chasing theoretical peak throughput.
- Ensure transaction boundaries are preserved and error handling remains deterministic.
Cross-bus checklist (what must be re-confirmed when switching buses)
- DMA request source (RX/TX/FIFO watermark) and the chosen completion cadence.
- Boundary anchors (CS / IDLE-BREAK / transaction) and marker semantics in the buffer.
- Maximum burst size and queue depth caps to keep worst-case latency bounded.
- Back-pressure strategy (drop/degrade/flow-control) with trigger and recovery thresholds.
- Cache coherency and alignment rules at ownership handoffs (see H2-8).
- Observed metrics: sustained throughput, max latency, P99 jitter, IRQ rate.
Real-Time Playbook: Bounding Worst-Case Latency
Real-time is not “fast on average”. It is bounded worst-case service time. Start from a deadline, decompose end-to-end latency into bounded parts, then apply hard caps that hold under contention.
Definition (deadline vs worst-case service time)
- Deadline: the maximum time allowed from “byte arrives” to “byte is consumed”.
- Worst-case service time: the maximum observed under worst contention, not the average.
- Pass condition: max latency < X with margin.
Worst-case decomposition (sum of bounded parts)
- DMA wait: arbitration and priority queueing time.
- Transfer: bus time + memory bandwidth under contention.
- ISR/notify: completion signaling and wake latency (avoid long masking).
- Consume: bounded batch work in the consumer task (no unbounded loops).
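Because the end-to-end bound is the sum of per-stage worst cases, the pass/fail check is a one-liner worth keeping next to the measured numbers. A minimal C sketch with illustrative fields:

```c
#include <stdbool.h>
#include <stdint.h>

/* Worst-case figures per stage, each measured (or bounded) separately
   under worst contention — never averages. */
typedef struct {
    uint32_t dma_wait_us;    /* arbitration + priority queueing */
    uint32_t transfer_us;    /* bus + memory bandwidth under contention */
    uint32_t isr_notify_us;  /* completion signaling + wake latency */
    uint32_t consume_us;     /* bounded batch work in the consumer */
} latency_budget_t;

bool budget_meets_deadline(const latency_budget_t *b,
                           uint32_t deadline_us, uint32_t margin_us) {
    uint32_t worst = b->dma_wait_us + b->transfer_us
                   + b->isr_notify_us + b->consume_us;
    return worst + margin_us <= deadline_us;
}
```

Keeping the decomposition explicit also tells you which knob to turn when the check fails: a dominant `dma_wait_us` points at priority/burst caps, a dominant `consume_us` at the drain bound.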
1) DMA priority & arbitration
- Amplifier: long queueing behind bulk channels.
- Control: fixed priority for the critical channel and a capped burst length.
2) Bus contention
- Amplifier: competing DMA/CPU traffic stretches transfer time.
- Control: schedule bulk work outside critical windows; cap queue depth.
3) IRQ masking & long critical sections
- Amplifier: completion and wake-up are delayed unpredictably.
- Control: bound masking time; move work out of ISR; shorten lock holds.
4) Cache refill & memory effects
- Amplifier: unpredictable cache misses and coherence maintenance costs.
- Control: aligned buffers, fixed working sets, and no dynamic allocation on the critical path.
Practical rules (turn worst-case into hard caps)
- Fixed max burst length: cap transfer and completion intervals.
- Fixed max queue depth: cap queueing delay; avoid “unbounded buffering”.
- No dynamic memory: critical path uses pre-allocated buffers and fixed markers.
- Watermark + watchdog: detect stuck pipelines and force deterministic recovery.
- Back-pressure integration: when near deadline, prefer degrade/drop-to-boundary over collapse.
Pass criteria (examples; use project thresholds)
- Max latency: < X
- Jitter: < Y
- Drop rate: < Z (or only under a defined overload policy)
- Recovery: watchdog triggers within T and returns to stable operation
Debug & Validation: Prove Throughput, Prove Lossless, Prove Boundaries
A stable DMA pipeline is validated with three evidence chains: (1) throughput (sustained/peak), (2) loss/reorder (sequence/CRC/counters), and (3) boundaries (markers stay consistent across FIFO → DMA → consumer → app). Use one time base and fixed observation points to turn “it feels slow” into measurable proof.
1) Throughput (sustained vs peak)
- Peak: short-window burst limit (bus time + FIFO + DMA burst).
- Sustained: long-window payload rate (reveals software gaps and back-pressure).
- Must log: payload bytes / fixed window (e.g., 1 s) + IRQ rate.
2) Loss / reorder (prove “lossless”)
- Sequence number: detect gaps, repeats, and out-of-order consumption.
- CRC / checksum: detect corruption and buffer misalignment.
- Counters: FIFO overrun/underrun, DMA error flags, and drop-policy triggers.
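The sequence-number chain can be checked with a small, wrap-safe state machine whose counters feed the pass criteria directly. A minimal C sketch:

```c
#include <stdint.h>

typedef struct {
    uint32_t expected;   /* next sequence number we should see */
    uint32_t gaps;       /* total frames skipped (loss) */
    uint32_t backwards;  /* duplicates or out-of-order deliveries */
} seq_check_t;

/* Feed every consumed frame's sequence number. The signed cast makes
   the "ahead or behind" comparison correct across uint32 wrap. */
void seq_check(seq_check_t *c, uint32_t seq) {
    if (seq == c->expected) {
        c->expected = seq + 1;
    } else if ((int32_t)(seq - c->expected) > 0) {
        c->gaps += seq - c->expected;     /* loss: count skipped frames */
        c->expected = seq + 1;
    } else {
        c->backwards++;                   /* duplicate / reorder */
    }
}
```

"Lossless" then becomes a checkable claim: `gaps == 0 && backwards == 0` over the whole soak run, under worst-case contention.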
3) Boundaries (prove markers)
- Markers: boundary metadata written at the producer side must match what the consumer slices.
- Monotonicity: write pointer, read pointer, and marker indices must never “go backward”.
- Resync path: when overflow happens, recovery must drop-to-boundary and restart deterministically.
Recommended instrumentation (minimum set)
- Ring stats: bytes moved, IRQ count, occupancy histogram, max occupancy, and drop counters.
- Alarms: watermark high/low, overrun/underrun, and watchdog timeout events.
- DMA flags: bus error, address error, FIFO error, descriptor error (if supported).
- Unified timestamps: TS0/TS1/TS2/TS3 use one time base (avoid mixed clocks).
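The minimum instrumentation set above fits in one struct with a cheap update path. A sketch under stated assumptions: field names are hypothetical, and the occupancy histogram uses power-of-two buckets so sampling stays O(1) on the hot path.

```c
#include <assert.h>
#include <stdint.h>

/* Minimum ring-stats set; occupancy histogram buckets are 0-1, 2-3,
 * 4-7, 8-15, ... so updates cost a few shifts. */
#define OCC_BUCKETS 8u

typedef struct {
    uint64_t bytes_moved;
    uint32_t irq_count;
    uint32_t occ_hist[OCC_BUCKETS];   /* occupancy histogram */
    uint32_t max_occupancy;
    uint32_t drops;
    uint32_t overruns, underruns, wdog_timeouts;   /* alarm counters */
} ring_stats_t;

static uint32_t occ_bucket(uint32_t occ)
{
    uint32_t b = 0;
    while ((occ >>= 1) && b < OCC_BUCKETS - 1) b++;
    return b;
}

static void stats_sample(ring_stats_t *s, uint32_t occupancy,
                         uint32_t bytes, uint32_t irqs)
{
    s->bytes_moved += bytes;
    s->irq_count   += irqs;
    s->occ_hist[occ_bucket(occupancy)]++;
    if (occupancy > s->max_occupancy) s->max_occupancy = occupancy;
}
```

Dumping this struct once per window (alongside the unified TS0..TS3 timestamps) is usually enough evidence to separate "slow bus" from "late consumer".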
Concrete debug tools (examples; verify model/options)
- Logic/protocol analysis: Saleae Logic Pro 16, Total Phase Beagle I2C/SPI Protocol Analyzer.
- Embedded trace / profiling: SEGGER J-Link Plus (or J-Link Ultra+), Arm ULINKpro.
- High-speed oscilloscope (edge sanity): Tektronix MDO3 series (model per bandwidth), Keysight InfiniiVision series (model per bandwidth).
Note: these tool examples support the validation workflow; use project requirements to select bandwidth/options.
Typical failure tree (symptom → first checks)
- Low sustained throughput: check software gaps (DMA re-arm latency), interrupt overload (completion IRQs too frequent), and back-pressure triggers.

- High CPU while “DMA is on”: check ISR payload, polling loops, cache maintenance overhead, and lock contention.
- Occasional gaps / duplicates: check sequence counters, ring overwrite/overrun counters, and marker monotonicity.
- Only fails at high load: check DMA arbitration/priority, queue depth caps, and IRQ masking/critical sections.
- Looks like corruption: check cache coherency rules, alignment to cache-line, and DMA-safe memory regions.
Pass criteria (examples; fill with project thresholds)
- Sustained payload: ≥ X (window fixed and documented)
- Loss/reorder: sequence gaps = 0, CRC failures = 0 (or explicitly bounded and logged)
- Boundary errors: marker/frame-slice errors = 0; resync completes within T
- Latency/jitter: max latency < A, P99 jitter < B (single time base)
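The P99/max criteria imply offline percentile math over a captured latency window. A minimal nearest-rank sketch (not the hot path; function names are ours, not from any particular library):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Percentile over a captured latency window (offline analysis): sort the
 * sample array in place and index by nearest rank. */
static int cmp_u32(const void *a, const void *b)
{
    uint32_t x = *(const uint32_t *)a, y = *(const uint32_t *)b;
    return (x > y) - (x < y);
}

/* p in [0,100]; n must be > 0. Nearest-rank definition: ceil(p*n/100). */
static uint32_t percentile(uint32_t *samples, size_t n, unsigned p)
{
    qsort(samples, n, sizeof samples[0], cmp_u32);
    size_t rank = (p * n + 99) / 100;
    if (rank == 0) rank = 1;
    return samples[rank - 1];
}
```

`percentile(lat, n, 100)` gives the observed maximum, so one captured window yields both the "max latency < A" and "P99 jitter < B" checks against the same time base.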
Applications & IC Selection Notes (DMA-Friendly Peripherals)
“DMA-friendly” is a datasheet-visible feature set: descriptor modes (SG/cyclic), FIFO depth & watermarks, multi-channel arbitration, flexible request mapping, and strong error reporting/coherency guidance. The goal is high sustained throughput with bounded worst-case latency and predictable recovery.
Key DMA-related dimensions
- Descriptor modes: scatter-gather, linked-list, cyclic, half-transfer events.
- FIFO depth & watermarks: independent RX/TX watermarks, overrun flags, and programmable thresholds.
- Arbitration control: per-channel priority/weighting, burst caps, and interconnect QoS (if available).
- Request mapping: flexible peripheral request routing (avoid “fixed channel bottlenecks”).
- Debuggability: explicit error causes (bus/address/FIFO), counters, and coherency notes.
Where to look in the datasheet/manual
- DMA chapter: descriptor support, interrupt modes, cyclic/SG, error flags.
- Interconnect / bus matrix: arbitration, bandwidth notes, QoS/priority.
- Cache / coherency notes: DMA-safe regions, alignment requirements, maintenance rules.
- Peripheral FIFO section: depth, watermarks, overrun/underrun behavior.
Concrete part-number examples (verify package/suffix/availability)
MCUs / application processors with strong DMA ecosystems
- ST: STM32H743ZI, STM32H723VG (high-performance DMA + cache considerations).
- NXP: MIMXRT1062DVL6A, MIMXRT1176DVMAA (strong peripheral DMA + high throughput IO).
- Microchip: ATSAME70Q21B, ATSAMV71Q21B (DMA-centric peripheral set; check cache/coherency notes per design).
- Renesas: R7FA6M5BH2CB (RA6M5 family example; verify the exact part variant for memory/peripheral mix).
- TI: TM4C1294NCPDT (uDMA-style flows; verify FIFO depth and interrupt modes per peripheral).
Selection tip: prioritize parts with explicit DMA error reporting and coherent-memory guidance in the reference manual.
DMA-friendly bridge / expander ICs (help reduce CPU touch points)
- I²C/SPI-to-UART with FIFOs: NXP SC16IS752, SC16IS750 (use FIFO + interrupt/watermark to reduce CPU service rate).
- I²C-to-SPI bridge: NXP SC18IS602B (useful for turning short I²C transactions into buffered SPI accesses).
- I²C channel mux (address/fanout management): TI TCA9548A (helps segmentation and reduces “bus-wide” recovery impact).
- USB-to-serial (buffered links for host-side throughput): FTDI FT232H, Silicon Labs CP2102N (exact suffix depends on package/temperature).
These parts do not replace system DMA, but can shift boundary handling and buffering away from the CPU.
Common high-throughput DMA endpoints (SPI/QSPI/OSPI memories)
- Winbond: W25Q128JV, W25Q256JV (verify package and speed grade).
- Macronix: MX25L25645G (verify package/temperature suffix).
- Micron: MT25QU256ABA (verify exact density/IO mode support per variant).
Memory endpoints highlight DMA requirements: long bursts, boundary control, and consistent cache/coherency handling.
Minimal selection checklist (quick self-check)
- Supports scatter-gather or linked descriptors (or equivalent chaining).
- Supports cyclic mode (or stable ring-buffer writer behavior).
- Provides FIFO watermarks and explicit overrun/underrun flags.
- Provides priority/arbitration control and (ideally) burst caps for worst-case bounds.
- Provides request mapping flexibility (avoid fixed bottlenecks).
- Provides clear coherency guidance (DMA-safe memory, alignment, maintenance rules).
- Provides actionable error reporting (bus/address/FIFO/descriptor errors).
FAQs: DMA High Throughput (Troubleshooting Closure)
These FAQs close long-tail debug questions without expanding the main text. Each answer follows the fixed 4-line structure: Likely cause / Quick check / Fix / Pass criteria.
Throughput is high but CPU is also high — IRQ too frequent or cache thrash?
Likely cause: Completion IRQ rate is too high and/or cache maintenance is triggered excessively, causing scheduling and cache churn.
Quick check: Log IRQ rate and DMA re-arm gap per 1 s window; compare CPU time in ISR vs consumer; check cache maintenance time spikes.
Fix: Increase batch size (or enable half-transfer events), move heavy work out of ISR, and reduce cache ops by using DMA-safe regions and cache-line-aligned buffers.
Pass criteria: IRQ rate < X, sustained payload ≥ Y, CPU(data-move) < Z%, no sustained queue backlog.
After enabling DMA, occasional “old data” appears — flush or invalidate first?
Likely cause: Cache coherency maintenance is incorrect for the transfer direction, so CPU reads stale cache lines or DMA reads stale RAM.
Quick check: Determine direction: DMA writes RAM → CPU reads (invalidate before CPU use) vs CPU writes RAM → DMA reads (flush/clean before DMA). Verify maintenance range expands to cache-line boundaries.
Fix: Apply correct invalidate/flush at ownership handoff points (DMA done / buffer handoff) and enforce cache-line-aligned start/length for buffers.
Pass criteria: CRC fails = 0, repeat/ghost samples = 0 across stress; coherence actions are deterministic and documented.
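The cache-line expansion mentioned in the Quick check can be made explicit in code. A sketch assuming a 32-byte line size: the real clean/invalidate calls (for example CMSIS `SCB_CleanDCache_by_Addr` / `SCB_InvalidateDCache_by_Addr` on Cortex-M7) would consume the computed range; here only the range math is shown.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Cache-maintenance range helper: expands [addr, addr+len) to cache-line
 * boundaries, as required before clean (CPU writes -> DMA reads) or
 * invalidate (DMA writes -> CPU reads). LINE_SZ is illustrative. */
#define LINE_SZ 32u

typedef struct { uintptr_t start; size_t len; } cm_range_t;

static cm_range_t cache_align_range(uintptr_t addr, size_t len)
{
    uintptr_t start = addr & ~(uintptr_t)(LINE_SZ - 1);               /* round down */
    uintptr_t end = (addr + len + LINE_SZ - 1) & ~(uintptr_t)(LINE_SZ - 1); /* round up */
    return (cm_range_t){ start, (size_t)(end - start) };
}
```

Note that the expanded range touching neighboring data is exactly why buffers themselves should be cache-line-aligned: otherwise an invalidate can destroy unrelated CPU writes that happen to share a line.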
Larger bursts increase jitter — check watermark, priority, or arbitration first?
Likely cause: Larger bursts reduce IRQ overhead but increase worst-case service time due to DMA/bus arbitration delays and consumer wake-up granularity.
Quick check: Compare max latency and P99 jitter before/after increasing burst; log queue occupancy peaks and DMA wait time (if available).
Fix: Cap maximum burst length, tune watermark to trigger earlier service, and raise priority for the real-time DMA channel (or isolate it from bulk traffic).
Pass criteria: max latency < X and P99 jitter < Y under worst-case load; no sustained backlog at high watermark.
Frame boundaries become corrupted — ring slicing or delimiter strategy issue?
Likely cause: Boundary metadata (markers) is not consistent with buffer ownership, or delimiter-based framing loses sync after a gap/overflow.
Quick check: Validate marker monotonicity (write ptr/read ptr/index never goes backward); check SEQ/CRC across boundary crossings and wrap points; confirm overflow recovery does drop-to-boundary.
Fix: Prefer explicit length/marker slicing on the ring buffer; for delimiter schemes, add resync logic that searches next valid boundary after overflow or missing bytes.
Pass criteria: boundary errors = 0, SEQ gaps = 0 (or bounded with recovery), wrap-around never produces malformed frames.
RX occasional overrun — check FIFO watermark or consumer blocking first?
Likely cause: Consumer service cannot keep up with burst arrival, or watermark is set too late to absorb scheduling/arbitration delays.
Quick check: Inspect FIFO_OVR and ring occupancy peaks; correlate overrun timestamps with consumer blocked time (locks/critical sections/IRQ masking).
Fix: Lower the watermark (earlier wake-up), bound consumer critical sections, and cap burst length; add back-pressure or drop-to-boundary policy before hard overrun occurs.
Pass criteria: FIFO_OVR = 0 under stress; occupancy stays below high watermark with margin; recovery never corrupts boundaries.
SPI full-duplex reads 0xFF/0x00 — dummy/turnaround or CS timing? (DMA-focused)
Likely cause: RX/TX progress is not synchronized (insufficient dummy bytes or wrong pairing), or CS is not held across the intended DMA burst boundary.
Quick check: Compare TX byte count vs RX byte count per burst; verify descriptor boundaries match CS hold policy; confirm first-byte content is discarded when dummy clocks are expected.
Fix: Ensure paired TX/RX descriptors (or coupled lengths), insert explicit dummy phase when required, and keep CS asserted for the full transaction span (avoid unintended CS toggles between descriptors).
Pass criteria: device ID reads match expected value across N trials; SEQ/CRC stable; no “all 0xFF/0x00” bursts under load.
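The paired-length rule can be sketched as buffer arithmetic. This assumes a hypothetical device read transaction with a 1-byte command, 3-byte address, and 1 dummy byte; the exact header layout is device-specific and illustrative only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Full-duplex SPI pairing sketch: TX and RX DMA descriptors clock the
 * same number of bytes, and the first HDR_LEN RX bytes are discarded. */
#define CMD_LEN   1u
#define ADDR_LEN  3u
#define DUMMY_LEN 1u    /* device-dependent; illustrative value */
#define HDR_LEN   (CMD_LEN + ADDR_LEN + DUMMY_LEN)

/* Total clocked bytes for BOTH the TX and RX descriptors: equal lengths
 * keep the two directions in lockstep under CS. */
static size_t spi_read_xfer_len(size_t payload)
{
    return HDR_LEN + payload;
}

/* Copy only the valid payload out of the raw RX buffer. */
static size_t spi_extract_payload(const uint8_t *rx_raw, size_t xfer_len,
                                  uint8_t *out)
{
    size_t payload = xfer_len - HDR_LEN;
    for (size_t i = 0; i < payload; i++)
        out[i] = rx_raw[HDR_LEN + i];   /* skip cmd/addr/dummy echoes */
    return payload;
}
```

If CS must stay asserted across the whole `spi_read_xfer_len()` span, descriptor boundaries inside that span must not toggle CS; that is the usual cause of "all 0xFF" tails.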
UART high traffic shows occasional framing errors — sampling noise or buffer “holes”?
Likely cause: Consumer/ISR service creates buffer holes (gaps) so bytes are dropped, which looks like framing errors at the parser level under high load.
Quick check: Correlate framing-error events with FIFO_OVR, ring occupancy spikes, and DMA re-arm gaps; verify IDLE/BREAK markers align with buffer slices (not mid-gap).
Fix: Lower watermark for earlier service, cap burst/queue depth, and add drop-to-boundary resync when overflow happens; keep UART framing logic independent from DMA chunk size.
Pass criteria: overrun counters = 0 (or bounded with defined recovery), parser resync time < T, sustained traffic shows no growing SEQ gaps.
I²C with DMA becomes slower — transaction overhead vs batching, how to decide?
Likely cause: I²C performance is dominated by per-transaction overhead and gaps; DMA reduces CPU work but may not improve wire-time efficiency for short transfers.
Quick check: Measure payload efficiency: payload time vs overhead+gap time; compare CPU load reduction vs end-to-end latency change when switching to DMA.
Fix: Use DMA mainly for offload (lower CPU/jitter), combine small reads/writes when protocol allows, and avoid overly large DMA batches that increase first-byte latency without increasing wire efficiency.
Pass criteria: CPU load drops by ≥ X% while throughput/latency remain within targets; no increase in timeout/retry events.
Only fails at high load — first step for memory bandwidth / bus contention?
Likely cause: Under load, arbitration delays and memory contention expand worst-case service time, exposing hidden timing margins.
Quick check: Run A/B test: disable one bulk DMA stream and observe if jitter/overrun disappears; log DMA wait indicators (or latency segments TS0→TS1, TS1→TS2) to find which segment inflates.
Fix: Raise priority for the real-time channel, cap burst length, limit queue depth, and schedule bulk transfers outside the critical window (or isolate via QoS if supported).
Pass criteria: worst-case latency remains bounded (< X) with all background traffic enabled; no high-watermark saturation.
No data loss, but latency spikes — adjust queue depth or back-pressure strategy?
Likely cause: Queue depth is too large (hidden buffering), or back-pressure triggers too late, so service delay accumulates even without dropping bytes.
Quick check: Monitor queue occupancy histogram and max occupancy; compare P99 latency with different queue depth caps; check if back-pressure events occur only after near-saturation.
Fix: Reduce maximum queue depth, lower watermark for earlier wake-up, and implement explicit back-pressure (drop/degrade/flow-control) before saturation; keep recovery deterministic (drop-to-boundary when needed).
Pass criteria: max latency < X and P99 jitter < Y; queue occupancy stays below cap; drop rate (if enabled) < Z and bounded.
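The "explicit back-pressure before saturation" fix can be staged as a simple policy function. A sketch with illustrative thresholds (75% of the cap triggers degrade); the enum names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Staged back-pressure sketch: act before saturation rather than at it.
 * Thresholds are illustrative fractions of the queue cap. */
typedef enum { BP_ACCEPT, BP_DEGRADE, BP_DROP } bp_action_t;

static bp_action_t backpressure(uint32_t occupancy, uint32_t cap)
{
    if (occupancy >= cap)         return BP_DROP;     /* hard cap reached */
    if (occupancy * 4 >= cap * 3) return BP_DEGRADE;  /* >= 75%: shed load */
    return BP_ACCEPT;
}
```

Degrade can mean lower sample rate, coarser frames, or flow control toward the producer; the point is that the decision fires while there is still margin, so latency stays bounded without surprise drops.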
Periodic system “hang” — how to design DMA error flags and watchdog?
Likely cause: DMA enters an error state (bus/address/descriptor) or the consumer stops advancing pointers, causing silent deadlock without visible loss counters.
Quick check: Log DMA_ERR flags and last TS progression (TS0/TS1/TS2/TS3). If TS stops advancing while input continues, it is a service deadlock; if DMA_ERR asserts, it is a transfer fault.
Fix: Implement watchdog on pointer progress and watermark; on fault, stop DMA, reset descriptors/ring indices, drop-to-boundary, and restart with a recorded fault code for postmortem.
Pass criteria: recovery completes within T, no repeated fault storm, fault code is logged with TS snapshot and counters.
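The pointer-progress watchdog described in the Fix can be sketched as a periodic check: if input keeps arriving but the read pointer has not advanced for N consecutive checks, declare a service deadlock. `WDOG_LIMIT` and all names are illustrative.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Progress watchdog sketch: run wdog_check() at a fixed period. A stall
 * is only counted while input is active, so an idle link never faults. */
#define WDOG_LIMIT 3u   /* consecutive stalled checks before fault */

typedef struct {
    uint32_t last_rd;   /* read pointer at the previous check */
    uint32_t stalls;    /* consecutive checks with no progress */
    bool     fault;     /* caller: stop DMA, drop-to-boundary, restart */
} wdog_t;

static void wdog_check(wdog_t *w, uint32_t rd_now, bool input_active)
{
    if (!input_active || rd_now != w->last_rd) {
        w->last_rd = rd_now;   /* progress (or idle): reset the count */
        w->stalls  = 0;
        return;
    }
    if (++w->stalls >= WDOG_LIMIT)
        w->fault = true;
}
```

On fault, record a code plus a TS/counter snapshot before resetting, so the postmortem can distinguish a DMA transfer fault (DMA_ERR asserted) from a consumer deadlock (pointers frozen, no error flag).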
ATE/production passes, but field drops occur — which stats/alarm fields are usually missing?
Likely cause: Production tests validate functionality but do not log long-window sustained metrics, worst-case latency, and near-overrun precursors under realistic background load.
Quick check: Verify firmware logs include: bytes/window, IRQ rate, max occupancy, FIFO_OVR, DMA_ERR, drop events, and TS-based latency segments; compare ATE vs field log coverage.
Fix: Add ring stats + watermark alarms + fault snapshots (TS + counters). Include stress profiles (background DMA, cache pressure) in validation to expose contention-driven worst-case.
Pass criteria: field logs can reproduce root cause within one trace; pre-fault indicators (occupancy, IRQ, DMA wait) are captured before any drop event.