--- /dev/null
+.. SPDX-License-Identifier: GPL-2.0
+
+.. _ethernet-flow-control:
+
+=====================
+Ethernet Flow Control
+=====================
+
+This document is a practical guide to Ethernet Flow Control in Linux, covering
+what it is, how it works, and how to configure it.
+
+What is Flow Control?
+=====================
+
+Flow control is a mechanism to prevent a fast sender from overwhelming a
+slow receiver with data, which would cause buffer overruns and dropped packets.
+The receiver can signal the sender to temporarily stop transmitting, giving it
+time to process its backlog.
+
+Standards references
+====================
+
+Ethernet flow control mechanisms are specified across consolidated IEEE base
+standards; some originated as amendments:
+
+- Collision-based flow control is part of CSMA/CD in **IEEE 802.3**
+ (half-duplex).
+- Link-wide PAUSE is defined in **IEEE 802.3 Annex 31B**
+ (originally **802.3x**).
+- Priority-based Flow Control (PFC) is defined in **IEEE 802.1Q Clause 36**
+ (originally **802.1Qbb**).
+
+In the remainder of this document, the consolidated clause numbers are used.
+
+How It Works: The Mechanisms
+============================
+
+The method used for flow control depends on the link's duplex mode.
+
+.. note::
+ The user-visible ``ethtool`` pause API described in this document controls
+ **link-wide PAUSE** (IEEE 802.3 Annex 31B) only. It does not control the
+ collision-based behavior that exists on half-duplex links.
+
+1. Half-Duplex: Collision-Based Flow Control
+--------------------------------------------
+On half-duplex links, a device cannot send and receive simultaneously, so PAUSE
+frames are not used. Flow control is achieved by leveraging the CSMA/CD
+(Carrier Sense Multiple Access with Collision Detection) protocol itself.
+
+* **How it works**: To inhibit incoming data, a receiving device can force a
+ collision on the line. When the sending station detects this collision, it
+ terminates its transmission, sends a "jam" signal, and then executes the
+ "Collision backoff and retransmission" procedure as defined in IEEE 802.3,
+ Section 4.2.3.2.5. This algorithm makes the sender wait for a random
+ period before attempting to retransmit. By repeatedly forcing collisions,
+ the receiver can effectively throttle the sender's transmission rate.
+
+.. note::
+ While this mechanism is part of the IEEE standard, there is currently no
+ generic kernel API to configure or control it. Drivers should not enable
+ this feature until a standardized interface is available.
+
+.. warning::
+ On shared-medium networks (e.g. 10BASE2, or twisted-pair networks using a
+ hub rather than a switch) forcing collisions inhibits traffic **across the
+ entire shared segment**, not just a single point-to-point link. Enabling
+ such behavior is generally undesirable.
+
+2. Full-Duplex: Link-wide PAUSE (IEEE 802.3 Annex 31B)
+------------------------------------------------------
+On full-duplex links, devices can send and receive at the same time. Flow
+control is achieved by sending a special **PAUSE frame**, defined by IEEE
+802.3 Annex 31B. This mechanism pauses all traffic on the link and is therefore
+called *link-wide PAUSE*.
+
+* **What it is**: A standard Ethernet frame with a globally reserved
+ destination MAC address (``01-80-C2-00-00-01``). This address is in a range
+ that standard IEEE 802.1D-compliant bridges do not forward. However, some
+ unmanaged or misconfigured bridges have been reported to forward these
+ frames, which can disrupt flow control across a network.
+
+* **How it works**: The frame contains a MAC Control opcode for PAUSE
+ (``0x0001``) and a ``pause_time`` value, telling the sender how long to
+ wait before sending more data frames. This time is specified in units of
+ "pause quantum", where one quantum is the time it takes to transmit 512 bits.
+ For example, one pause quantum is 51.2 microseconds on a 10 Mbit/s link,
+ and 512 nanoseconds on a 1 Gbit/s link. A ``pause_time`` of zero indicates
+ that the transmitter can resume transmission, even if a previous non-zero
+ pause time has not yet elapsed.
+
+* **Who uses it**: Any full-duplex link, from 10 Mbit/s to multi-gigabit speeds.
+
+3. Full-Duplex: Priority-based Flow Control (PFC) (IEEE 802.1Q Clause 36)
+-------------------------------------------------------------------------
+Priority-based Flow Control is an enhancement to the standard PAUSE mechanism
+that allows flow control to be applied independently to different classes of
+traffic, identified by their priority level.
+
+* **What it is**: PFC allows a receiver to pause traffic for one or more of the
+ 8 standard priority levels without stopping traffic for other priorities.
+ This is critical in data center environments for protocols that cannot
+ tolerate packet loss due to congestion (e.g., Fibre Channel over Ethernet
+ or RoCE).
+
+* **How it works**: PFC uses a specific PAUSE frame format. It shares the same
+ globally reserved destination MAC address (``01-80-C2-00-00-01``) as legacy
+ PAUSE frames but uses a unique opcode (``0x0101``). The frame payload
+ contains two key fields:
+
+ - **``priority_enable_vector``**: An 8-bit mask where each bit corresponds to
+ one of the 8 priorities. If a bit is set to 1, it means the pause time
+ for that priority is active.
+ - **``time_vector``**: A list of eight 2-octet fields, one for each priority.
+ Each field specifies the ``pause_time`` for its corresponding priority,
+ measured in units of ``pause_quanta`` (the time to transmit 512 bits).
+
+.. note::
+ When PFC is enabled for at least one priority on a port, the standard
+ **link-wide PAUSE** (IEEE 802.3 Annex 31B) must be disabled for that port.
+ The two mechanisms are mutually exclusive (IEEE 802.1Q Clause 36).
+
+Configuring Flow Control
+========================
+
+Link-wide PAUSE and Priority-based Flow Control are configured with different
+tools.
+
+Configuring Link-wide PAUSE with ``ethtool`` (IEEE 802.3 Annex 31B)
+-------------------------------------------------------------------
+Use ``ethtool -a <interface>`` to view and ``ethtool -A <interface>`` to change
+the link-wide PAUSE settings.
+
+.. code-block:: bash
+
+ # View current link-wide PAUSE settings
+ ethtool -a eth0
+
+ # Enable RX and TX pause, with autonegotiation
+ ethtool -A eth0 autoneg on rx on tx on
+
+**Key Configuration Concepts**:
+
+* **Pause Autoneg vs Generic Autoneg**: ``ethtool -A ... autoneg {on,off}``
+ controls **Pause Autoneg** (Annex 31B) only. It is independent from the
+ **Generic link autonegotiation** configured with ``ethtool -s``. A device can
+ have Generic autoneg **on** while Pause Autoneg is **off**, and vice versa.
+
+* **If Pause Autoneg is off** (``-A ... autoneg off``): the device will **not**
+ advertise pause in the PHY. The MAC PAUSE state is **forced** according to
+ ``rx``/``tx`` and does not depend on partner capabilities or resolution.
+ Ensure the peer is configured complementarily for PAUSE to be effective.
+
+* **If generic autoneg is off** but **Pause Autoneg is on**, the pause policy
+ is **remembered** by the kernel and applied later when Generic autoneg is
+ enabled again.
+
+* **Autonegotiation Mode**: The PHY will *advertise* the ``rx`` and ``tx``
+ capabilities. The final active state is determined by what both sides of the
+ link agree on. See the "PHY (Physical Layer Transceiver)" section below,
+ especially the *Resolution* subsection, for details of the negotiation rules.
+
+* **Forced Mode**: This mode is necessary when autonegotiation is not used or
+ not possible. This includes links where one or both partners have
+ autonegotiation disabled, or in setups without a PHY (e.g., direct
+ MAC-to-MAC connections). The driver bypasses PHY advertisement and
+ directly forces the MAC into the specified ``rx``/``tx`` state. The
+ configuration on both sides of the link must be complementary. For
+ example, if one side is set to ``tx on`` ``rx off``, the link partner must be
+ set to ``tx off`` ``rx on`` for flow control to function correctly.
+
+Configuring PFC with ``dcb`` (IEEE 802.1Q Clause 36)
+----------------------------------------------------
+PFC is part of the Data Center Bridging (DCB) subsystem and is managed with the
+``dcb`` tool (iproute2). Some deployments use ``dcbtool`` (lldpad) instead; this
+document shows ``dcb(8)`` examples.
+
+**Viewing PFC Settings**:
+
+.. code-block:: text
+
+ $ dcb pfc show dev eth0
+ pfc-cap 8 macsec-bypass off delay 4096
+ prio-pfc 0:off 1:off 2:off 3:off 4:off 5:off 6:on 7:on
+
+This shows the PFC state (on/off) for each priority (0-7).
+
+**Changing PFC Settings**:
+
+.. code-block:: bash
+
+ # Enable PFC on priorities 6 and 7, leaving others as they are
+ $ dcb pfc set dev eth0 prio-pfc 6:on 7:on
+
+ # Disable PFC for all priorities except 6 and 7
+ $ dcb pfc set dev eth0 prio-pfc all:off 6:on 7:on
+
+Monitoring Flow Control
+=======================
+
+The standard way to check if flow control is actively being used is to view the
+pause-related statistics.
+
+**Monitoring Link-wide PAUSE**:
+Use ``ethtool --include-statistics -a <interface>``.
+
+.. code-block:: text
+
+ $ ethtool --include-statistics -a eth0
+ Pause parameters for eth0:
+ ...
+ Statistics:
+ tx_pause_frames: 0
+ rx_pause_frames: 0
+
+**Monitoring PFC**:
+PFC statistics (sent and received frames per priority) are available
+through the ``dcb`` tool.
+
+.. code-block:: text
+
+ $ dcb pfc show dev eth0 requests indications
+ requests 0:0 1:0 2:0 3:1024 4:2048 5:0 6:0 7:0
+ indications 0:0 1:0 2:0 3:512 4:4096 5:0 6:0 7:0
+
+The ``requests`` counters track transmitted PFC frames (TX), and the
+``indications`` counters track received PFC frames (RX).
+
+Link-wide PAUSE Autonegotiation Details
+=======================================
+
+The autonegotiation process for link-wide PAUSE is managed by the PHY and
+involves advertising capabilities and resolving the outcome.
+
+* Terminology (link-wide PAUSE):
+
+ - **Symmetric pause**: both directions are paused when requested (TX+RX
+ enabled).
+ - **Asymmetric pause**: only one direction is paused (e.g., RX-only or
+ TX-only).
+
+ In IEEE 802.3 advertisement/resolution, symmetric/asymmetric are encoded
+ using two bits (Pause/Asym) and resolved per the standard truth tables
+ below.
+
+* **Advertisement**: The PHY advertises the MAC's flow control capabilities.
+ This is done using two bits in the advertisement register: "Symmetric
+ Pause" (Pause) and "Asymmetric Pause" (Asym). These bits should be
+ interpreted as a combined value, not as independent flags. The kernel
+ converts the user's ``rx`` and ``tx`` settings into this two-bit value as
+ follows:
+
+ .. code-block:: text
+
+ tx rx | Pause Asym
+ -------+-------------
+ 0 0 | 0 0
+ 0 1 | 1 1
+ 1 0 | 0 1
+ 1 1 | 1 0
+
+* **Resolution**: After negotiation, the PHY reports the link partner's
+ advertised Pause and Asym bits. The final flow control mode is determined
+ by the combination of the local and partner advertisements, according to
+ the IEEE 802.3 standard:
+
+ .. code-block:: text
+
+ Local Device | Link Partner | Result
+ Pause Asym | Pause Asym |
+ -------------------+--------------------+---------
+ 0 X | 0 X | Disabled
+ 0 1 | 1 0 | Disabled
+ 0 1 | 1 1 | TX only
+ 1 0 | 0 X | Disabled
+ 1 X | 1 X | TX + RX
+ 1 1 | 0 1 | RX only
+
+ It is important to note that the advertised bits reflect the *current
+ configuration* of the MAC, which may not represent its full hardware
+ capabilities.
+
+Kernel Policy: "Set and Trust"
+==============================
+
+The ethtool pause API is defined as a **wish policy** for
+IEEE 802.3 link-wide PAUSE only. A user request is always accepted
+as the preferred configuration, but it may not be possible to apply
+it in all link states.
+
+Key constraints:
+
+- Link-wide PAUSE is not valid on half-duplex links.
+- Link-wide PAUSE cannot be used together with Priority-based Flow Control
+ (PFC, IEEE 802.1Q Clause 36).
+- If autonegotiation is active and the link is currently down, the future
+ mode is not yet known.
+
+Because of these constraints, the kernel stores the requested setting
+and applies it only when the link is in a compatible state.
+
+Implications for userspace:
+
+1. Set once (the "wish"): the requested Rx/Tx PAUSE policy is
+ remembered even if it cannot be applied immediately.
+2. Applied conditionally: when the link comes up, the kernel enables
+ PAUSE only if the active mode allows it.
+
+Component Roles in Flow Control
+===============================
+
+The configuration of flow control involves several components, each with a
+distinct role.
+
+The MAC (Media Access Controller)
+---------------------------------
+The MAC is the hardware component that actually sends and receives PAUSE
+frames. Its capabilities define the upper limit of what the driver can support.
+For link-wide PAUSE, MACs can vary in their support for symmetric (both
+directions) or asymmetric (independent TX/RX) flow control.
+
+For PFC, the MAC must be capable of generating and interpreting the
+priority-based PAUSE frames and managing separate pause states for each
+traffic class.
+
+Many MACs also implement automatic PAUSE frame transmission based on the fill
+level of their internal RX FIFO. This is typically configured with two
+thresholds:
+
+* **FLOW_ON (High Water Mark)**: When the RX FIFO usage reaches this
+ threshold, the MAC automatically transmits a PAUSE frame to stop the sender.
+
+* **FLOW_OFF (Low Water Mark)**: When the RX FIFO usage drops below this
+ threshold, the MAC transmits a PAUSE frame with a quantum of zero to tell
+ the sender it can resume transmission.
+
+The PHY (Physical Layer Transceiver)
+------------------------------------
+The PHY's role is distinct for each flow control mechanism:
+
+* **Link-wide PAUSE**: During the autonegotiation process, the PHY is
+ responsible for advertising the device's flow control capabilities. See the
+ "Link-wide PAUSE Autonegotiation Details" section for more information.
+
+* **Half-Duplex Collision-Based Flow Control**: The PHY is fundamental to the
+ CSMA/CD process. It performs carrier sensing (checking if the line is idle)
+ and collision detection, which is the mechanism leveraged to throttle the
+ sender.
+
+* **Priority-based Flow Control (PFC)**: The PHY is not directly involved in
+ negotiating PFC capabilities. Its role is to establish the physical link.
+ PFC negotiation happens at a higher layer via the Data Center Bridging
+ Capability Exchange Protocol (DCBX).
+
+User Space Interface
+====================
+The primary user space tools are ``ethtool`` for link-wide PAUSE and ``dcb`` for
+PFC. They communicate with the kernel to configure the network device driver
+and underlying hardware.
+
+**Link-wide PAUSE Netlink Interface (``ethtool``)**
+
+See the ethtool Netlink spec (``Documentation/netlink/specs/ethtool.yaml``)
+for the authoritative definition of the Pause control and Pause statistics
+attributes. The generated UAPI is in
+``include/uapi/linux/ethtool_netlink_generated.h``.
+
+**PFC Netlink Interface (``dcb``)**
+
+The authoritative definitions for DCB/PFC netlink attributes and commands are in
+``include/uapi/linux/dcbnl.h``. See also the ``dcb(8)`` manual page and the DCB
+subsystem documentation for userspace configuration details.
+