``ioctl()`` will have a name derived from the ``ioctl()`` command name. E.g., the
``VFIO_DEVICE_GET_INFO`` ``ioctl()`` command becomes a
``VFIO_USER_DEVICE_GET_INFO`` message. The purpose of this reuse is to share as
-much code as feasible with the ``ioctl()`` implementation``.
+much code as feasible with the ``ioctl()`` implementation.
Connection Initiation
^^^^^^^^^^^^^^^^^^^^^
1) one or more clients, and/or
2) one or more virtual devices, belonging to one or more clients.
-The current protocol specification requires a dedicated socket per
-client/server connection. It is a server-side implementation detail whether a
-single server handles multiple virtual devices from the same or multiple
-clients. The location of the socket is implementation-specific. Multiplexing
-clients, devices, and servers over the same socket is not supported in this
-version of the protocol.
+The current protocol specification requires dedicated sockets per
+client/server connection. Commands in the client-to-server direction are
+handled on the main communication socket which the client connects to, and
+replies to these commands are passed on the same socket. Commands sent in the
+other direction from the server to the client as well as their corresponding
+replies can optionally be passed across a separate socket, which is set up
+during negotiation (AF_UNIX servers just pass the file descriptor).
+
+Using separate sockets for each command channel avoids introducing an
+artificial point of synchronization between the channels. This simplifies
+implementations since it obviates the need to demultiplex incoming messages
+into commands and replies and interleave command handling and reply processing.
+Note that it is still illegal for implementations to stall command or reply
+processing indefinitely while waiting for replies on the other channel, as this
+may lead to deadlocks. However, since incoming commands and replies arrive on
+different sockets, it's possible to meet this requirement e.g. by running two
+independent request processing threads that can internally operate
+synchronously. It is expected that this is simpler to implement than fully
+asynchronous message handling code. Implementations may still choose a fully
+asynchronous, event-based design for other reasons, and the protocol fully
+supports it.
+
+It is a server-side implementation detail whether a single server handles
+multiple virtual devices from the same or multiple clients. The location of the
+socket is implementation-specific. Multiplexing clients, devices, and servers
+over the same socket is not supported in this version of the protocol.
Authentication
--------------
``VFIO_USER_DMA_WRITE`` 12 server -> client
``VFIO_USER_DEVICE_RESET`` 13 client -> server
``VFIO_USER_REGION_WRITE_MULTI`` 15 client -> server
+``VFIO_USER_DEVICE_FEATURE`` 16 client -> server
+``VFIO_USER_MIG_DATA_READ`` 17 client -> server
+``VFIO_USER_MIG_DATA_WRITE`` 18 client -> server
====================================== ========= =================
Header
Capabilities:
-+--------------------+---------+------------------------------------------------+
-| Name | Type | Description |
-+====================+=========+================================================+
-| max_msg_fds | number | Maximum number of file descriptors that can be |
-| | | received by the sender in one message. |
-| | | Optional. If not specified then the receiver |
-| | | must assume a value of ``1``. |
-+--------------------+---------+------------------------------------------------+
-| max_data_xfer_size | number | Maximum ``count`` for data transfer messages; |
-| | | see `Read and Write Operations`_. Optional, |
-| | | with a default value of 1048576 bytes. |
-+--------------------+---------+------------------------------------------------+
-| pgsizes | number | Page sizes supported in DMA map operations |
-| | | or'ed together. Optional, with a default value |
-| | | of supporting only 4k pages. |
-+--------------------+---------+------------------------------------------------+
-| max_dma_maps | number | Maximum number DMA map windows that can be |
-| | | valid simultaneously. Optional, with a |
-| | | value of 65535 (64k-1). |
-+--------------------+---------+------------------------------------------------+
-| migration | object | Migration capability parameters. If missing |
-| | | then migration is not supported by the sender. |
-+--------------------+---------+------------------------------------------------+
-| write_multiple | boolean | ``VFIO_USER_REGION_WRITE_MULTI`` messages |
-| | | are supported if the value is ``true``. |
-+--------------------+---------+------------------------------------------------+
-
-The migration capability contains the following name/value pairs:
-
-+-----------------+--------+--------------------------------------------------+
-| Name | Type | Description |
-+=================+========+==================================================+
-| pgsize | number | Page size of dirty pages bitmap. The smallest |
-| | | between the client and the server is used. |
-+-----------------+--------+--------------------------------------------------+
-| max_bitmap_size | number | Maximum bitmap size in ``VFIO_USER_DIRTY_PAGES`` |
-| | | and ``VFIO_DMA_UNMAP`` messages. Optional, |
-| | | with a default value of 256MB. |
-+-----------------+--------+--------------------------------------------------+
++--------------------+---------+-----------------------------------------------+
+| Name | Type | Description |
++====================+=========+===============================================+
+| max_msg_fds | number | Maximum number of file descriptors that can |
+| | | be received by the sender in one message. |
+| | | Optional. If not specified then the receiver |
+| | | must assume a value of ``1``. |
++--------------------+---------+-----------------------------------------------+
+| max_data_xfer_size | number | Maximum ``count`` for data transfer messages; |
+| | | see `Read and Write Operations`_. Optional, |
+| | | with a default value of 1048576 bytes. |
++--------------------+---------+-----------------------------------------------+
+| max_dma_maps | number | Maximum number of DMA map windows that can be |
+| | | valid simultaneously. Optional, with a |
+| | | default value of 65535 (64k-1). |
++--------------------+---------+-----------------------------------------------+
+| pgsizes | number | Page sizes supported in DMA map operations |
+| | | or'ed together. Optional, with a default |
+| | | value of supporting only 4k pages. |
++--------------------+---------+-----------------------------------------------+
+| twin_socket | object | Parameters for twin-socket mode, which |
+| | | handles server-to-client commands and their |
+| | | replies on a separate socket. Optional. |
++--------------------+---------+-----------------------------------------------+
+| write_multiple | boolean | ``VFIO_USER_REGION_WRITE_MULTI`` messages |
+| | | are supported if the value is ``true``. |
++--------------------+---------+-----------------------------------------------+
+
+The ``twin_socket`` capability object holds these name/value pairs:
+
++-----------+---------+--------------------------------------------------------+
+| Name | Type | Description |
++===========+=========+========================================================+
+| supported | boolean | Indicates whether the sender supports twin-socket |
+| | | mode. Optional, defaults to false. |
++-----------+---------+--------------------------------------------------------+
+| fd_index | number | Specifies an index in the file descriptor array |
+| | | included with the message. The designated file |
+| | | descriptor is a socket which is to be used for the |
+| | | server-to-client command channel. Optional, only valid |
+| | | in the reply message. |
++-----------+---------+--------------------------------------------------------+
Reply
^^^^^
The same message format is used in the server's reply with the semantics
described above.
+If and only if the client has indicated support for twin-socket mode by setting
+``twin_socket.supported`` to true in its capabilities, the server may optionally
+set up a separate command channel for server-to-client commands and their
+replies. The server enables twin-socket mode as follows:
+
+* Create a fresh socket pair.
+* Keep the server end of the socket pair and pass the client end in the file
+ descriptor array included with the reply message.
+* Set ``twin_socket.supported`` to true in the reply.
+* Indicate the index in the file descriptor array by the
+ ``twin_socket.fd_index`` capability field in the reply, so the client can
+ identify the correct file descriptor to use.
+
+A client requesting twin-socket mode must examine the ``twin_socket`` capability
+in the reply:
+
+* If ``twin_socket.supported`` is false, the field is missing, or the entire
+ ``twin_socket`` object is absent, the server does not support twin-socket mode
+ or decided not to enable it. The client can choose whether it wants to proceed
+ without twin-socket mode, or close the connection if not.
+* If ``twin_socket.supported`` is true and ``twin_socket.fd_index`` is present
+ and refers to a valid file descriptor, twin-socket mode negotiation has
+ succeeded. The client monitors the provided file descriptor for commands from
+ the server.
+* Otherwise, the reply from the server is inconsistent. The client must abort
+ and close the connection since it is potentially unable to receive commands
+ from the server.
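The three cases above can be sketched as a small classification routine. This is only an illustration: it assumes the reply's capabilities have already been parsed from JSON into a Python ``dict`` and that the number of received file descriptors is known. ``check_twin_socket_reply`` and ``TwinSocketResult`` are hypothetical names, not part of the protocol.

```python
from enum import Enum


class TwinSocketResult(Enum):
    DISABLED = "disabled"          # server did not enable twin-socket mode
    ENABLED = "enabled"            # use the passed socket for server commands
    INCONSISTENT = "inconsistent"  # malformed reply: close the connection


def check_twin_socket_reply(capabilities, num_fds):
    """Classify the server's reply according to the three cases above.

    capabilities: dict parsed from the reply's capabilities (assumed input).
    num_fds: number of file descriptors received with the reply message.
    Returns (result, fd_index or None).
    """
    twin = capabilities.get("twin_socket")
    if not isinstance(twin, dict) or twin.get("supported", False) is not True:
        return TwinSocketResult.DISABLED, None

    index = twin.get("fd_index")
    # bool is a subclass of int in Python, so reject it explicitly.
    valid = isinstance(index, int) and not isinstance(index, bool)
    if valid and 0 <= index < num_fds:
        return TwinSocketResult.ENABLED, index
    return TwinSocketResult.INCONSISTENT, None
```

A client that receives ``INCONSISTENT`` would close the connection; on ``DISABLED`` it is free to continue on the main socket or to disconnect.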
+
+The twin-socket feature is optional, so some servers may not support it.
+However, for server implementations that do send server-to-client commands it is
+strongly recommended to implement twin-socket support.
+
``VFIO_USER_DMA_MAP``
---------------------
-----------------------
If the client has not shared mappable memory, the server can use this message to
-read from guest memory.
+read from guest memory. This message and its reply are passed over the separate
+server-to-client socket if twin-socket mode has been negotiated during
+connection setup.
Request
^^^^^^^
-----------------------
If the client has not shared mappable memory, the server can use this message to
-write to guest memory.
+write to guest memory. This message and its reply are passed over the separate
+server-to-client socket if twin-socket mode has been negotiated during
+connection setup.
Request
^^^^^^^
* *wr_cnt* is the number of device writes completed.
+``VFIO_USER_DEVICE_FEATURE``
+----------------------------
+
+This command is analogous to ``VFIO_DEVICE_FEATURE``. It is used to get, set, or
+probe feature data of the device.
+
+Request
+^^^^^^^
+
+The request payload for this message is a structure of the following format.
+
++-------+--------+--------------------------------+
+| Name | Offset | Size |
++=======+========+================================+
+| argsz | 0 | 4 |
++-------+--------+--------------------------------+
+| flags | 4 | 4 |
++-------+--------+--------------------------------+
+| | +---------+---------------------------+ |
+| | | Bit | Definition | |
+| | +=========+===========================+ |
+| | | 0 to 15 | Feature index | |
+| | +---------+---------------------------+ |
+| | | 16 | VFIO_DEVICE_FEATURE_GET | |
+| | +---------+---------------------------+ |
+| | | 17 | VFIO_DEVICE_FEATURE_SET | |
+| | +---------+---------------------------+ |
+| | | 18 | VFIO_DEVICE_FEATURE_PROBE | |
+| | +---------+---------------------------+ |
++-------+--------+--------------------------------+
+| data | 8 | variable |
++-------+--------+--------------------------------+
+
+* *argsz* is the maximum size of the reply payload.
+
+* *flags* defines the action to be performed by the server and upon which
+ feature:
+
+ * The feature index consists of the least significant 16 bits of the flags
+ field, and can be accessed using the ``VFIO_DEVICE_FEATURE_MASK`` bit mask.
+
+ * ``VFIO_DEVICE_FEATURE_GET`` instructs the server to get the data for the
+ given feature.
+
+ * ``VFIO_DEVICE_FEATURE_SET`` instructs the server to set the feature data to
+ that given in the ``data`` field of the payload.
+
+ * ``VFIO_DEVICE_FEATURE_PROBE`` instructs the server to probe for feature
+ support. If ``VFIO_DEVICE_FEATURE_GET`` and/or ``VFIO_DEVICE_FEATURE_SET``
+ are also set, the probe will only return success if all of the indicated
+ methods are supported.
+
+ ``VFIO_DEVICE_FEATURE_GET`` and ``VFIO_DEVICE_FEATURE_SET`` are mutually
+ exclusive, except for use with ``VFIO_DEVICE_FEATURE_PROBE``.
+
+* *data* is specific to the particular feature. It is not used for probing.
+
+This part of the request is analogous to VFIO's ``struct vfio_device_feature``.
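As a rough illustration of the layout above, the following sketch packs and unpacks the ``argsz``/``flags`` header with Python's ``struct`` module. The flag values follow the bit table; little-endian byte order, the helper names, and the way ``argsz`` is derived from an expected reply size are assumptions made for the example.

```python
import struct

# Flag bits from the table above; the names mirror <linux/vfio.h>.
VFIO_DEVICE_FEATURE_MASK = 0xFFFF
VFIO_DEVICE_FEATURE_GET = 1 << 16
VFIO_DEVICE_FEATURE_SET = 1 << 17
VFIO_DEVICE_FEATURE_PROBE = 1 << 18

HEADER = struct.Struct("<II")  # argsz, flags (little-endian is an assumption)


def pack_feature_request(index, action, data=b"", reply_data_len=0):
    """Build a request payload (hypothetical helper, not a library API)."""
    if index & ~VFIO_DEVICE_FEATURE_MASK:
        raise ValueError("feature index must fit in 16 bits")
    get_and_set = VFIO_DEVICE_FEATURE_GET | VFIO_DEVICE_FEATURE_SET
    if (action & get_and_set) == get_and_set and not (action & VFIO_DEVICE_FEATURE_PROBE):
        raise ValueError("GET and SET are mutually exclusive without PROBE")
    # argsz is the maximum reply payload size: header plus the larger of the
    # request data and the expected reply data (assumed sizing policy).
    argsz = HEADER.size + max(len(data), reply_data_len)
    return HEADER.pack(argsz, index | action) + data


def unpack_feature_header(payload):
    """Return (argsz, feature index, action bits) from a payload."""
    argsz, flags = HEADER.unpack_from(payload)
    return argsz, flags & VFIO_DEVICE_FEATURE_MASK, flags & ~VFIO_DEVICE_FEATURE_MASK
```

For example, ``pack_feature_request(1, VFIO_DEVICE_FEATURE_GET, reply_data_len=8)`` yields an 8-byte request that leaves room for an 8-byte data field in the reply.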
+
+Reply
+^^^^^
+
+The reply payload must be the same as the request payload for setting or
+probing a feature. For getting a feature's data, the data is added in the data
+section and its length is added to ``argsz``.
+
+Device Features
+^^^^^^^^^^^^^^^
+
+The only device features supported by vfio-user are those related to migration,
+although this may change in the future. They are a subset of those supported in
+the VFIO implementation of the Linux kernel.
+
++----------------------------------------+---------------+
+| Name | Feature Index |
++========================================+===============+
+| VFIO_DEVICE_FEATURE_MIGRATION | 1 |
++----------------------------------------+---------------+
+| VFIO_DEVICE_FEATURE_MIG_DEVICE_STATE | 2 |
++----------------------------------------+---------------+
+| VFIO_DEVICE_FEATURE_DMA_LOGGING_START | 6 |
++----------------------------------------+---------------+
+| VFIO_DEVICE_FEATURE_DMA_LOGGING_STOP | 7 |
++----------------------------------------+---------------+
+| VFIO_DEVICE_FEATURE_DMA_LOGGING_REPORT | 8 |
++----------------------------------------+---------------+
+
+``VFIO_DEVICE_FEATURE_MIGRATION``
+"""""""""""""""""""""""""""""""""
+
+This feature indicates that the device can support the migration API through
+``VFIO_DEVICE_FEATURE_MIG_DEVICE_STATE``. If ``GET`` succeeds, the ``RUNNING``
+and ``ERROR`` states are always supported. Support for additional states is
+indicated via the flags field; at least ``VFIO_MIGRATION_STOP_COPY`` must be
+set.
+
+The request message has no data field.
+
+The data field of the reply message is structured as follows:
+
++-------+--------+---------------------------+
+| Name | Offset | Size |
++=======+========+===========================+
+| flags | 0 | 8 |
++-------+--------+---------------------------+
+| | +-----+--------------------------+ |
+| | | Bit | Definition | |
+| | +=====+==========================+ |
+| | | 0 | VFIO_MIGRATION_STOP_COPY | |
+| | +-----+--------------------------+ |
+| | | 1 | VFIO_MIGRATION_P2P | |
+| | +-----+--------------------------+ |
+| | | 2 | VFIO_MIGRATION_PRE_COPY | |
+| | +-----+--------------------------+ |
++-------+--------+---------------------------+
+
+These flags are interpreted in the same way as in VFIO.
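A client might decode the 8-byte ``flags`` field of the reply along these lines. This is a minimal sketch: the bit positions come from the table above, while little-endian byte order and the helper name ``parse_migration_flags`` are assumptions.

```python
import struct

VFIO_MIGRATION_STOP_COPY = 1 << 0
VFIO_MIGRATION_P2P = 1 << 1
VFIO_MIGRATION_PRE_COPY = 1 << 2


def parse_migration_flags(data):
    """Decode the reply data field (hypothetical helper)."""
    (flags,) = struct.unpack_from("<Q", data)  # byte order is an assumption
    if not flags & VFIO_MIGRATION_STOP_COPY:
        # The text above requires at least STOP_COPY to be set.
        raise ValueError("VFIO_MIGRATION_STOP_COPY must be set")
    return {
        "stop_copy": True,
        "p2p": bool(flags & VFIO_MIGRATION_P2P),
        "pre_copy": bool(flags & VFIO_MIGRATION_PRE_COPY),
    }
```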
+
+``VFIO_DEVICE_FEATURE_MIG_DEVICE_STATE``
+""""""""""""""""""""""""""""""""""""""""
+
+Upon ``VFIO_DEVICE_FEATURE_SET``, execute a migration state change on the VFIO
+device. The new state is supplied in ``device_state``. The state transition must
+fully complete before the reply is sent.
+
+The data field of the reply message, as well as the ``SET`` request message, is
+structured as follows:
+
++--------------+--------+------+
+| Name | Offset | Size |
++==============+========+======+
+| device_state | 0 | 4 |
++--------------+--------+------+
+| data_fd | 4 | 4 |
++--------------+--------+------+
+
+* *device_state* is the current state of the device (for ``GET``) or the
+ state to transition to (for ``SET``). It is defined by the
+ ``vfio_device_mig_state`` enum as detailed below. These states are the states
+ of the device migration Finite State Machine.
+
++--------------------------------+-------+---------------------------------------------------------------------+
+| Name | State | Description |
++================================+=======+=====================================================================+
+| VFIO_DEVICE_STATE_ERROR | 0 | The device has failed and must be reset. |
++--------------------------------+-------+---------------------------------------------------------------------+
+| VFIO_DEVICE_STATE_STOP | 1 | The device does not change the internal or external state. |
++--------------------------------+-------+---------------------------------------------------------------------+
+| VFIO_DEVICE_STATE_RUNNING | 2 | The device is running normally. |
++--------------------------------+-------+---------------------------------------------------------------------+
+| VFIO_DEVICE_STATE_STOP_COPY | 3 | The device internal state can be read out. |
++--------------------------------+-------+---------------------------------------------------------------------+
+| VFIO_DEVICE_STATE_RESUMING | 4 | The device is stopped and is loading a new internal state. |
++--------------------------------+-------+---------------------------------------------------------------------+
+| VFIO_DEVICE_STATE_RUNNING_P2P | 5 | (not used in vfio-user) |
++--------------------------------+-------+---------------------------------------------------------------------+
+| VFIO_DEVICE_STATE_PRE_COPY | 6 | The device is running normally but tracking internal state changes. |
++--------------------------------+-------+---------------------------------------------------------------------+
+| VFIO_DEVICE_STATE_PRE_COPY_P2P | 7 | (not used in vfio-user) |
++--------------------------------+-------+---------------------------------------------------------------------+
+
+* *data_fd* is unused in vfio-user, as the ``VFIO_USER_MIG_DATA_READ`` and
+ ``VFIO_USER_MIG_DATA_WRITE`` messages are used instead for migration data
+ transport.
+
+Direct State Transitions
+""""""""""""""""""""""""
+
+The device migration FSM is a Mealy machine, so actions are taken upon the arcs
+between FSM states. The following transitions need to be supported by the
+server, a subset of those defined in ``<linux/vfio.h>``
+(``enum vfio_device_mig_state``).
+
+* ``RUNNING -> STOP``, ``STOP_COPY -> STOP``: Stop the operation of the device.
+ The ``STOP_COPY`` arc terminates the data transfer session.
+
+* ``RESUMING -> STOP``: Terminate the data transfer session. Complete processing
+ of the migration data. Stop the operation of the device. If the delivered data
+ is found to be incomplete, inconsistent, or otherwise invalid, fail the
+ ``SET`` command and optionally transition to the ``ERROR`` state.
+
+* ``PRE_COPY -> RUNNING``: Terminate the data transfer session. The device is
+ now fully operational.
+
+* ``STOP -> RUNNING``: Start the operation of the device.
+
+* ``RUNNING -> PRE_COPY``, ``STOP -> STOP_COPY``: Begin the process of saving
+ the device state. The device operation is unchanged, but data transfer begins.
+ ``PRE_COPY`` and ``STOP_COPY`` are referred to as the "saving group" of
+ states.
+
+* ``PRE_COPY -> STOP_COPY``: Continue to transfer migration data, but stop
+ device operation.
+
+* ``STOP -> RESUMING``: Start the process of restoring the device state. The
+ internal device state may be changed to prepare the device to receive the
+ migration data.
+
+The ``STOP_COPY -> PRE_COPY`` transition is explicitly not allowed and should
+return an error if requested.
+
+``ERROR`` cannot be specified as a target device state. However, any transition
+request may fail and leave the device in ``ERROR`` if the server was unable to
+execute the requested arc and was also unable to restore the device to a valid
+state. To recover from ``ERROR``, ``VFIO_USER_DEVICE_RESET`` must be used to
+return to ``RUNNING``.
+
+If ``PRE_COPY`` is not supported, arcs touching it are removed.
+
+Complex State Transitions
+"""""""""""""""""""""""""
+
+The remaining possible transitions are to be implemented as combinations of the
+above FSM arcs. As there are multiple paths, the path should be selected based
+on the following rules:
+
+* Select the shortest path.
+
+* The path cannot pass through saving group states as interior states; they may
+ appear only as the start or end state.
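One way to read the two path-selection rules is as a breadth-first search over the direct arcs that never expands a saving group state. The sketch below assumes exactly the arcs listed under "Direct State Transitions"; ``find_path`` is a hypothetical helper, and treating ``STOP_COPY -> PRE_COPY`` as an error for any path is an interpretation of the text above.

```python
from collections import deque

# Direct arcs as listed under "Direct State Transitions".
DIRECT_ARCS = {
    "RUNNING": ["STOP", "PRE_COPY"],
    "STOP": ["RUNNING", "STOP_COPY", "RESUMING"],
    "STOP_COPY": ["STOP"],
    "RESUMING": ["STOP"],
    "PRE_COPY": ["RUNNING", "STOP_COPY"],
}
SAVING_GROUP = {"PRE_COPY", "STOP_COPY"}


def find_path(start, target):
    """Return the states visited from start to target, or None on error."""
    if target == "ERROR" or (start, target) == ("STOP_COPY", "PRE_COPY"):
        return None
    if start == target:
        return [start]
    queue = deque([[start]])
    seen = {start}
    while queue:  # breadth-first search finds a shortest path first
        path = queue.popleft()
        for nxt in DIRECT_ARCS.get(path[-1], []):
            if nxt == target:
                return path + [nxt]
            # Saving group states may only be the first or last state.
            if nxt not in seen and nxt not in SAVING_GROUP:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None
```

With these arcs, ``find_path("RUNNING", "STOP_COPY")`` goes through ``STOP`` rather than ``PRE_COPY``, because ``PRE_COPY`` is a saving group state and may not appear in the interior of a path. The sketch does not model removing the ``PRE_COPY`` arcs when that state is unsupported.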
+
+``VFIO_DEVICE_FEATURE_DMA_LOGGING_START`` / ``VFIO_DEVICE_FEATURE_DMA_LOGGING_STOP``
+""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
+
+Upon ``VFIO_DEVICE_FEATURE_SET``, start/stop DMA logging. These features can
+also be probed to determine whether the device supports DMA logging.
+
+When DMA logging is started, a range of IOVAs to monitor is provided and the
+device can optimize its logging to cover only the IOVA range given. Only DMA
+writes are logged.
+
+The data field of the ``SET`` request is structured as follows:
+
++------------+--------+----------+
+| Name | Offset | Size |
++============+========+==========+
+| page_size | 0 | 8 |
++------------+--------+----------+
+| num_ranges | 8 | 4 |
++------------+--------+----------+
+| reserved | 12 | 4 |
++------------+--------+----------+
+| ranges | 16 | variable |
++------------+--------+----------+
+
+* *page_size* hints what tracking granularity the device should try to achieve.
+ If the device cannot do the hinted page size then it's the driver's choice
+ which page size to pick based on its support. On output the device will return
+ the page size it selected.
+
+* *num_ranges* is the number of IOVA ranges to monitor. A value of zero
+ indicates that all writes should be logged.
+
+* *ranges* is an array of ``vfio_user_device_feature_dma_logging_range``
+ entries:
+
++--------+--------+------+
+| Name | Offset | Size |
++========+========+======+
+| iova | 0 | 8 |
++--------+--------+------+
+| length | 8 | 8 |
++--------+--------+------+
+
+ * *iova* is the base IO virtual address
+ * *length* is the length of the range to log
+
+Upon success, the response data field will be the same as the request, unless
+the page size was changed, in which case this will be reflected in the response.
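The ``SET`` data field for starting DMA logging can be serialized roughly as follows. This sketch follows the two tables above (a 16-byte control block followed by 16-byte range entries); little-endian byte order and the helper names are assumptions.

```python
import struct

CONTROL = struct.Struct("<QII")  # page_size, num_ranges, reserved
RANGE = struct.Struct("<QQ")     # iova, length


def pack_dma_logging_start(page_size, ranges):
    """Serialize the data field; ranges is a list of (iova, length) pairs."""
    body = CONTROL.pack(page_size, len(ranges), 0)
    for iova, length in ranges:
        body += RANGE.pack(iova, length)
    return body


def unpack_dma_logging_start(data):
    """Parse a data field, e.g. to read back the page size in the reply."""
    page_size, num_ranges, _reserved = CONTROL.unpack_from(data, 0)
    ranges = [
        RANGE.unpack_from(data, CONTROL.size + i * RANGE.size)
        for i in range(num_ranges)
    ]
    return page_size, ranges
```

A client would compare the ``page_size`` returned by ``unpack_dma_logging_start`` on the reply with the value it requested to learn whether the server chose a different granularity.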
+
+``VFIO_DEVICE_FEATURE_DMA_LOGGING_REPORT``
+""""""""""""""""""""""""""""""""""""""""""
+
+Upon ``VFIO_DEVICE_FEATURE_GET``, returns the dirty bitmap for a specific IOVA
+range. This operation is only valid if logging of dirty pages has been
+previously started by setting ``VFIO_DEVICE_FEATURE_DMA_LOGGING_START``.
+
+The data field of the request is structured as follows:
+
++-----------+--------+------+
+| Name | Offset | Size |
++===========+========+======+
+| iova | 0 | 8 |
++-----------+--------+------+
+| length | 8 | 8 |
++-----------+--------+------+
+| page_size | 16 | 8 |
++-----------+--------+------+
+
+* *iova* is the base IO virtual address.
+
+* *length* is the length of the range.
+
+* *page_size* is the unit of granularity of the bitmap, and must be a power of
+ two. It doesn't have to match the value given to
+ ``VFIO_DEVICE_FEATURE_DMA_LOGGING_START`` because the driver will format its
+ internal logging to match the reporting page size, possibly by replicating
+ bits if the internal page size is lower than requested.
+
+The data field of the response is identical, except with the bitmap added on
+the end at offset 24.
+
+The bitmap is an array of u64s that holds the output bitmap, with 1 bit
+reporting a *page_size* unit of IOVA. The bits outside of the requested range
+must be zero.
+
+The mapping of IOVA to bits is given by:
+
+``bitmap[(addr - iova)/page_size] & (1ULL << (addr % 64))``
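The sketch below shows one possible way to size and query the returned bitmap. It follows the prose description (one bit per *page_size* unit, packed into an array of u64 words) rather than a literal reading of the expression above; that packed-bit reading, little-endian word order, and the helper names are all assumptions.

```python
import struct


def bitmap_words(length, page_size):
    """Number of u64 words needed to cover `length` bytes of IOVA space."""
    pages = -(-length // page_size)  # ceiling division
    return -(-pages // 64)


def is_dirty(bitmap, iova, page_size, addr):
    """Test the bit for the page containing addr (assumed packed layout)."""
    page = (addr - iova) // page_size
    # Word page // 64 holds the bit; bit page % 64 within that word.
    (word,) = struct.unpack_from("<Q", bitmap, (page // 64) * 8)
    return bool(word & (1 << (page % 64)))
```

Under this reading, a 512 KiB range tracked at 4 KiB granularity needs 128 bits, i.e. two u64 words.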
+
+``VFIO_USER_MIG_DATA_READ``
+---------------------------
+
+This command is used to read data from the source migration server while it is
+in a saving group state (``PRE_COPY`` or ``STOP_COPY``).
+
+This command, and ``VFIO_USER_MIG_DATA_WRITE``, are used in place of the
+``data_fd`` file descriptor in ``<linux/vfio.h>``
+(``struct vfio_device_feature_mig_state``) to enable all data transport to use
+the single already-established UNIX socket. Hence, the migration data is
+treated like a stream, so the client must continue reading until no more
+migration data remains.
+
+Request
+^^^^^^^
+
+The request payload for this message is a structure of the following format.
+
++-------+--------+------+
+| Name | Offset | Size |
++=======+========+======+
+| argsz | 0 | 4 |
++-------+--------+------+
+| size | 4 | 4 |
++-------+--------+------+
+
+* *argsz* is the maximum size of the reply payload.
+
+* *size* is the size of the migration data to read.
+
+Reply
+^^^^^
+
+The reply payload for this message is a structure of the following format.
+
++-------+--------+----------+
+| Name | Offset | Size |
++=======+========+==========+
+| argsz | 0 | 4 |
++-------+--------+----------+
+| size | 4 | 4 |
++-------+--------+----------+
+| data | 8 | variable |
++-------+--------+----------+
+
+* *argsz* is the size of the above structure, including the size of the data.
+
+* *size* indicates the size of returned migration data. If this is less than the
+ requested size, there is no more migration data to read.
+
+* *data* contains the migration data.
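Because the migration data is treated as a stream, a client typically issues reads in a loop until a short read signals the end. The following sketch assumes a hypothetical ``send_command`` callable that sends a request payload and returns the reply payload. ``make_fake_server`` is only a stand-in used to exercise the loop, and little-endian byte order is an assumption.

```python
import struct

HEADER = struct.Struct("<II")  # argsz, size (little-endian is an assumption)


def read_migration_data(send_command, chunk_size=4096):
    """Read the whole migration stream via repeated read requests."""
    chunks = []
    while True:
        # argsz leaves room for the reply header plus one full chunk.
        request = HEADER.pack(HEADER.size + chunk_size, chunk_size)
        reply = send_command(request)
        _argsz, size = HEADER.unpack_from(reply)
        chunks.append(reply[HEADER.size:HEADER.size + size])
        if size < chunk_size:  # short read: no more migration data
            return b"".join(chunks)


def make_fake_server(blob):
    """Stand-in for a real server: serves `blob` as the migration stream."""
    remaining = bytearray(blob)

    def send_command(request):
        _argsz, size = HEADER.unpack(request)
        data = bytes(remaining[:size])
        del remaining[:size]
        return HEADER.pack(HEADER.size + len(data), len(data)) + data

    return send_command
```

Note that when the stream length is an exact multiple of the chunk size, the loop issues one extra request that returns zero bytes before it terminates.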
+
+``VFIO_USER_MIG_DATA_WRITE``
+----------------------------
+
+This command is used to write data to the destination migration server while it
+is in the ``RESUMING`` state.
+
+As above, this replaces the ``data_fd`` file descriptor for transport of
+migration data, and as such, the migration data is treated like a stream.
+
+Request
+^^^^^^^
+
+The request payload for this message is a structure of the following format.
+
++-------+--------+----------+
+| Name | Offset | Size |
++=======+========+==========+
+| argsz | 0 | 4 |
++-------+--------+----------+
+| size | 4 | 4 |
++-------+--------+----------+
+| data | 8 | variable |
++-------+--------+----------+
+
+* *argsz* is the maximum size of the reply payload.
+
+* *size* is the size of the migration data to be written.
+
+* *data* contains the migration data.
+
+Reply
+^^^^^
+
+There is no reply payload for this message.
Appendices
==========