From: Sebastian Andrzej Siewior
Date: Thu, 27 Nov 2025 15:43:41 +0000 (+0100)
Subject: Documentation: Add some hardware hints for real-time
X-Git-Url: http://git.ipfire.org/cgi-bin/gitweb.cgi?a=commitdiff_plain;h=7548c69f5167a00d7bbd43be9b6521351f9bedc6;p=thirdparty%2Fkernel%2Flinux.git

Documentation: Add some hardware hints for real-time

Some thoughts on hardware that is used for real-time workloads.
Certainly not complete but should cover some of the important topics
such as:

- Main memory, caches and the possible control given by the hardware.
- What could happen by putting critical hardware behind USB or VirtIO.
- Allowing real-time tasks to consume the CPU entirely without giving
  the system some time to breathe.
- Networking with what the kernel provides.

Reviewed-by: Steven Rostedt (Google)
Reviewed-by: Randy Dunlap
Signed-off-by: Sebastian Andrzej Siewior
Signed-off-by: Jonathan Corbet
Message-ID: <20251127154343.292156-2-bigeasy@linutronix.de>
---

diff --git a/Documentation/core-api/real-time/hardware.rst b/Documentation/core-api/real-time/hardware.rst
new file mode 100644
index 0000000000000..19f9bb3786e03
--- /dev/null
+++ b/Documentation/core-api/real-time/hardware.rst
@@ -0,0 +1,132 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+====================
+Considering hardware
+====================
+
+:Author: Sebastian Andrzej Siewior
+
+The way a workload is handled can be influenced by the hardware it runs on.
+Key components include the CPU, memory, and the buses that connect them.
+These resources are shared among all applications on the system.
+As a result, heavy utilization of one resource by a single application
+can affect the deterministic handling of workloads in other applications.
+
+Below is a brief overview.
+
+System memory and cache
+-----------------------
+
+Main memory and the associated caches are the most common shared resources among
+tasks in a system. One task can dominate the available caches, forcing another
+task to wait until a cache line is written back to main memory before it can
+proceed. The impact of this contention varies based on write patterns and the
+size of the caches available. Larger caches may reduce stalls because more lines
+can be buffered before being written back. Conversely, certain write patterns
+may trigger the cache controller to flush many lines at once, causing
+applications to stall until the operation completes.
+
+This issue can be partly mitigated if applications do not share the same CPU
+cache. The kernel is aware of the cache topology and exports this information to
+user space. Tools such as **lstopo** from the Portable Hardware Locality (hwloc)
+project (https://www.open-mpi.org/projects/hwloc/) can visualize the hierarchy.
+
+Avoiding shared L2 or L3 caches is not always possible. Even when cache sharing
+is minimized, bottlenecks can still occur when accessing system memory. Memory
+is used not only by the CPU but also by peripheral devices via DMA, such as
+graphics cards or network adapters.
+
+In some cases, cache and memory bottlenecks can be controlled if the hardware
+provides the necessary support. On x86 systems, Intel offers Cache Allocation
+Technology (CAT), which enables cache partitioning among applications and
+provides control over the interconnect. AMD provides similar functionality under
+Platform Quality of Service (PQoS). On Arm64, the equivalent is Memory
+System Resource Partitioning and Monitoring (MPAM).
+
+These features can be configured through the Linux Resource Control interface.
+For details, see Documentation/filesystems/resctrl.rst.
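+
+As a rough illustration, consider a system where resctrl is already mounted
+at /sys/fs/resctrl and the L3 cache consists of a single domain (id 0). The
+following sketch creates a control group, limits it to a few ways of the L3
+cache and moves the calling task into it. The group name and the bit mask
+are examples only; valid masks depend on the hardware and are described in
+the resctrl documentation:
+
+.. code-block:: c
+
+   /* Sketch only: group name, cache domain and mask are made up. */
+   #include <errno.h>
+   #include <fcntl.h>
+   #include <stdio.h>
+   #include <string.h>
+   #include <sys/stat.h>
+   #include <unistd.h>
+
+   static int write_str(const char *path, const char *buf)
+   {
+           int fd = open(path, O_WRONLY);
+           ssize_t ret;
+
+           if (fd < 0)
+                   return -1;
+           ret = write(fd, buf, strlen(buf));
+           close(fd);
+           return ret < 0 ? -1 : 0;
+   }
+
+   int main(void)
+   {
+           char pid[16];
+
+           /* Create a new resource control group. */
+           if (mkdir("/sys/fs/resctrl/rt_group", 0755) && errno != EEXIST)
+                   return 1;
+
+           /* Allow this group only four ways of L3 cache domain 0. */
+           if (write_str("/sys/fs/resctrl/rt_group/schemata", "L3:0=0f\n"))
+                   return 1;
+
+           /* Move the current task into the group. */
+           snprintf(pid, sizeof(pid), "%d\n", (int)getpid());
+           if (write_str("/sys/fs/resctrl/rt_group/tasks", pid))
+                   return 1;
+
+           /* The cache-sensitive real-time work would start here. */
+           return 0;
+   }
+
+Note that this only restricts the new group. Other groups, including the
+default one, may still use overlapping cache ways unless their masks are
+adjusted as well.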
+
+The perf tool can be used to monitor cache behavior. It can analyze
+cache misses of an application and compare how they change under
+different workloads on a neighboring CPU. Even more powerful is the perf
+c2c tool, which can help identify cache-to-cache issues, where multiple CPU
+cores repeatedly access and modify data on the same cache line.
+
+Hardware buses
+--------------
+
+Real-time systems often need to access hardware directly to perform their work.
+Any latency in this process is undesirable, as it can affect the outcome of the
+task. For example, on an I/O bus, a changed output may not become immediately
+visible but instead appear with a variable delay depending on the latency of the
+bus used for communication.
+
+A bus such as PCI is relatively simple because register accesses are routed
+directly to the connected device. In the worst case, a read operation stalls the
+CPU until the device responds.
+
+A bus such as USB is more complex, involving multiple layers. A register read
+or write is wrapped in a USB Request Block (URB), which is then sent by the
+USB host controller to the device. Timing and latency are influenced by the
+underlying USB bus. Requests cannot be sent immediately; they must align with
+the next frame boundary according to the endpoint type and the host controller's
+scheduling rules. This can introduce delays and additional latency. For example,
+a network device connected via USB may still deliver sufficient throughput, but
+the added latency when sending or receiving packets may fail to meet the
+requirements of certain real-time use cases.
+
+Additional restrictions on bus latency can arise from power management. For
+instance, PCIe with Active State Power Management (ASPM) enabled can suspend
+the link between the device and the host. While this behavior is beneficial for
+power savings, it delays device access and adds latency to responses. This issue
+is not limited to PCIe; internal buses within a System-on-Chip (SoC) can also be
+affected by power management mechanisms.
+
+Virtualization
+--------------
+
+In a virtualized environment such as KVM, each guest CPU is represented as a
+thread on the host. If such a thread runs with real-time priority, the system
+should be tested to confirm it can sustain this behavior over extended periods.
+Because of its priority, the thread will not be preempted by lower-priority
+threads (such as SCHED_OTHER), which may then receive no CPU time. This can
+cause problems if a lower-priority thread is pinned to a CPU already occupied
+by a real-time task and is unable to make progress. Even if a CPU has been
+isolated, the system may still (accidentally) start a per-CPU thread on that
+CPU. Ensuring that a guest CPU goes idle is difficult, as it requires avoiding
+both task scheduling and interrupt handling. Furthermore, if the guest CPU does
+go idle but the guest system is booted with the option **idle=poll**, the guest
+CPU will never enter an idle state and will instead spin until an event
+arrives.
+
+Device handling introduces additional considerations. Emulated PCI devices or
+VirtIO devices require a counterpart on the host to complete requests. This
+adds latency because the host must intercept and either process the request
+directly or schedule a thread for its completion. These delays can be avoided if
+the required PCI device is passed directly through to the guest. Some devices,
+such as networking or storage controllers, support the PCIe SR-IOV feature.
+SR-IOV allows a single PCIe device to be divided into multiple virtual functions,
+which can then be assigned to different guests.
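+
+Coming back to scheduling: from the host's point of view, a guest CPU thread
+is configured like any other real-time task. A minimal sketch, assuming that
+CPU 3 has been set aside for the guest and that a SCHED_FIFO priority of 50
+fits the overall system design (both values are examples), could look like
+this:
+
+.. code-block:: c
+
+   /* Sketch only: CPU number and priority are placeholders. */
+   #define _GNU_SOURCE
+   #include <sched.h>
+   #include <stdio.h>
+
+   int main(void)
+   {
+           struct sched_param sp = { .sched_priority = 50 };
+           cpu_set_t set;
+
+           /* Pin the thread to the CPU reserved for the guest. */
+           CPU_ZERO(&set);
+           CPU_SET(3, &set);
+           if (sched_setaffinity(0, sizeof(set), &set)) {
+                   perror("sched_setaffinity");
+                   return 1;
+           }
+
+           /* Raise the thread to real-time priority. */
+           if (sched_setscheduler(0, SCHED_FIFO, &sp)) {
+                   perror("sched_setscheduler");
+                   return 1;
+           }
+
+           /*
+            * While this thread is runnable, no SCHED_OTHER task makes
+            * progress on CPU 3, apart from the slice granted by the
+            * kernel's real-time throttling.
+            */
+           return 0;
+   }
+
+Keep in mind that the kernel's default real-time throttling
+(/proc/sys/kernel/sched_rt_runtime_us) reserves a small share of CPU time
+for non-real-time tasks; disabling it removes this last bit of breathing
+room for the rest of the system.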
+
+Networking
+----------
+
+For low-latency networking, the full networking stack may be undesirable, as it
+can introduce additional sources of delay. In this context, XDP can be used
+as a shortcut to bypass much of the stack while still relying on the kernel's
+network driver.
+
+The requirements are that the network driver must support XDP, preferably using
+an "skb pool", and that the application must use an XDP socket. Additional
+configuration may involve BPF filters, tuning networking queues, or configuring
+qdiscs for time-based transmission. These techniques are often
+applied in Time-Sensitive Networking (TSN) environments.
+
+Documenting all required steps exceeds the scope of this text. For detailed
+guidance, see the TSN documentation at https://tsn.readthedocs.io.
+
+Another useful resource is the Linux Real-Time Communication Testbench at
+https://github.com/Linutronix/RTC-Testbench.
+The goal of this project is to validate real-time network communication. It can
+be thought of as a "cyclictest" for networking and also serves as a starting
+point for application development.
diff --git a/Documentation/core-api/real-time/index.rst b/Documentation/core-api/real-time/index.rst
index 7e14c4ea3d592..f08d2395a22c9 100644
--- a/Documentation/core-api/real-time/index.rst
+++ b/Documentation/core-api/real-time/index.rst
@@ -13,4 +13,5 @@ the required changes compared to a non-PREEMPT_RT configuration.
 
    theory
    differences
+   hardware
    architecture-porting