1 ==================================
2 vfio-ccw: the basic infrastructure
3 ==================================
4
5 Introduction
6 ------------
7
8 Here we describe the vfio support for I/O subchannel devices for
9 Linux/s390. The motivation for vfio-ccw is to pass subchannels through to a
10 virtual machine; vfio is the means.
11
12 Unlike other hardware architectures, s390 defines a unified I/O access
13 method, the so-called Channel I/O. It has its own access
14 patterns:
15
16 - Channel programs run asynchronously on a separate (co)processor.
17 - The channel subsystem will access any memory designated by the caller
18 in the channel program directly, i.e. there is no iommu involved.
19
20 Thus, when we introduce vfio support for these devices, we implement it
21 as a mediated device (mdev). The vfio mdev is added to an iommu group
22 so that it can be managed by the vfio framework. We also add read/write
23 callbacks for special vfio I/O regions, which pass the channel programs
24 from the mdev to its parent device (the real I/O subchannel device).
25 The parent device then does further address translation and performs
26 the I/O instructions.
27
28 This document does not intend to explain the s390 I/O architecture in
29 every detail. Further information and references can be found here:
30
31 - A good starting point for Channel I/O in general:
32 https://en.wikipedia.org/wiki/Channel_I/O
33 - s390 architecture:
34 s390 Principles of Operation manual (IBM Form. No. SA22-7832)
35 - The existing QEMU code, which implements a simple emulated channel
36 subsystem, is also a good reference that makes it easier to follow
37 the flow:
38 qemu/hw/s390x/css.c
39
40 For the vfio mediated device framework:
41 - Documentation/driver-api/vfio-mediated-device.rst
42
43 Motivation of vfio-ccw
44 ----------------------
45
46 Typically, a guest virtualized via QEMU/KVM on s390 only sees
47 paravirtualized virtio devices via the "Virtio Over Channel I/O
48 (virtio-ccw)" transport. This makes virtio devices discoverable via
49 standard operating system algorithms for handling channel devices.
50
51 However, this is not enough. On s390, the majority of devices use the
52 standard Channel I/O based mechanism, and we also need to be able to
53 pass them through to a QEMU virtual machine.
54 This includes devices that don't have a virtio counterpart (e.g. tape
55 drives) or that have specific characteristics which guests want to
56 exploit.
57
58 For passing a device to a guest, we want to use the same interface as
59 everybody else, namely vfio. We implement this vfio support for channel
60 devices via the vfio mediated device framework and the subchannel device
61 driver "vfio_ccw".
62
63 Access patterns of CCW devices
64 ------------------------------
65
66 The s390 architecture implements a so-called channel subsystem that
67 provides a unified view of the devices physically attached to the
68 system. Though the s390 hardware platform knows about a huge variety of
69 different peripheral attachments like disk devices (aka DASDs), tapes,
70 communication controllers, etc., they can all be accessed by a
71 well-defined access method and they present I/O completion in a unified
72 way: I/O interruptions.
73
74 All I/O requires the use of channel command words (CCWs). A CCW is an
75 instruction to a specialized I/O channel processor. A channel program is
76 a sequence of CCWs which are executed by the I/O channel subsystem. To
77 issue a channel program to the channel subsystem, the operating system
78 builds an operation request block (ORB), which specifies the format of
79 the CCWs and other control information. The
80 operating system signals the I/O channel subsystem to begin executing
81 the channel program with an SSCH (START SUBCHANNEL) instruction. The
82 central processor is then free to proceed with non-I/O instructions
83 until interrupted. The I/O completion result is received by the
84 interrupt handler in the form of an interrupt response block (IRB).
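
To make these terms concrete, the following is a minimal sketch of a channel program as an array of format-1 CCWs. The field layout and flag values follow the architected format-1 CCW. The names `sk_ccw1`, `sk_cp_len()` and `sk_build_example_cp()`, as well as the particular two-command program, are illustrative assumptions and not part of vfio-ccw. The sketch ignores byte order (s390 is big-endian) and transfer-in-channel (TIC) CCWs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Format-1 CCW: command code, flags, byte count, 31-bit data address. */
struct sk_ccw1 {
	uint8_t  cmd_code;
	uint8_t  flags;
	uint16_t count;
	uint32_t cda;
} __attribute__((packed, aligned(8)));

#define SK_CCW_FLAG_DC          0x80 /* data chaining */
#define SK_CCW_FLAG_CC          0x40 /* command chaining */
#define SK_CCW_FLAG_SLI         0x20 /* suppress incorrect length */

#define SK_CCW_CMD_NOOP         0x03
#define SK_CCW_CMD_BASIC_SENSE  0x04

/*
 * Count the CCWs in a channel program. The chain continues as long as
 * a CCW has the command-chaining or data-chaining flag set; 'max'
 * bounds the walk so a malformed chain cannot run off the array.
 */
size_t sk_cp_len(const struct sk_ccw1 *cp, size_t max)
{
	size_t n = 0;

	assert(cp != NULL);
	while (n < max) {
		uint8_t flags = cp[n].flags;

		n++;
		if (!(flags & (SK_CCW_FLAG_CC | SK_CCW_FLAG_DC)))
			break;
	}
	return n;
}

/* Hypothetical program: a NOOP command-chained to a BASIC SENSE. */
void sk_build_example_cp(struct sk_ccw1 cp[2], uint32_t sense_buf,
			 uint16_t sense_len)
{
	assert(cp != NULL);
	memset(cp, 0, 2 * sizeof(cp[0]));

	cp[0].cmd_code = SK_CCW_CMD_NOOP;
	cp[0].flags = SK_CCW_FLAG_CC | SK_CCW_FLAG_SLI;
	cp[0].count = 1;

	cp[1].cmd_code = SK_CCW_CMD_BASIC_SENSE;
	cp[1].flags = SK_CCW_FLAG_SLI;   /* last CCW: no chaining flag */
	cp[1].count = sense_len;
	cp[1].cda = sense_buf;           /* guest physical address in a guest */
}
```

In a guest, the `cda` fields hold guest physical addresses, which is why the channel program cannot be handed to the hardware unchanged.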
85
86 Back to vfio-ccw, in short:
87
88 - ORBs and channel programs are built in the guest kernel (with guest
89 physical addresses).
90 - ORBs and channel programs are passed to the host kernel.
91 - The host kernel translates the guest physical addresses to real addresses
92 and starts the I/O by issuing a privileged Channel I/O instruction
93 (e.g. SSCH).
94 - Channel programs run asynchronously on a separate processor.
95 - I/O completion is signaled to the host with I/O interruptions.
96 The result is copied as an IRB to user space, which passes it back to the
97 guest.
98
99 Physical vfio ccw device and its child mdev
100 -------------------------------------------
101
102 As mentioned above, we realize vfio-ccw with an mdev implementation.
103
104 Channel I/O does not have IOMMU hardware support, so the physical
105 vfio-ccw device does not have an IOMMU level translation or isolation.
106
107 Subchannel I/O instructions are all privileged instructions. When
108 handling an I/O instruction interception, vfio-ccw polices and
109 translates the channel program in software before
110 it gets sent to hardware.
111
112 Within this implementation, we have two drivers for two types of
113 devices:
114
115 - The vfio_ccw driver for the physical subchannel device.
116 This is an I/O subchannel driver for the real subchannel device. It
117 implements a group of callbacks and registers with the mdev framework as a
118 parent (physical) device. As a consequence, mdev provides vfio_ccw a
119 generic interface (sysfs) to create mdev devices. A vfio mdev can then be
120 created by vfio_ccw and added to the mediated bus. It is the vfio
121 device that is added to an IOMMU group and a vfio group.
122 vfio_ccw also provides an I/O region to accept channel program
123 requests from user space and to store I/O interrupt results for user
124 space to retrieve. To notify user space of an I/O completion, it offers
125 an interface to set up an eventfd for asynchronous signaling.
126
127 - The vfio_mdev driver for the mediated vfio ccw device.
128 This is provided by the mdev framework. It is a vfio device driver for
129 the mdev that is created by vfio_ccw.
130 It implements a group of vfio device driver callbacks, adds itself to a
131 vfio group, and registers itself with the mdev framework as an mdev
132 driver.
133 It uses a vfio iommu backend that uses the existing map and unmap
134 ioctls, but rather than programming them into an IOMMU for a device,
135 it simply stores the translations for use by later requests. This
136 means that a device programmed in a VM with guest physical addresses
137 can have the vfio kernel convert that address to a process virtual
138 address, pin the page and program the hardware with the host physical
139 address in one step.
140 For an mdev, the vfio iommu backend will not pin the pages during the
141 VFIO_IOMMU_MAP_DMA ioctl. The mdev framework only maintains a database
142 of the iova<->vaddr mappings in this operation. The vfio iommu backend
143 exports the vfio_pin_pages and vfio_unpin_pages interfaces
144 for the physical devices to pin and unpin pages on demand.
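
The iova<->vaddr bookkeeping described in the last item can be pictured as a table of address ranges that is consulted only when a request needs a translation. The following is a toy sketch under that assumption. `sk_mapping` and `sk_lookup()` are hypothetical names, the real backend uses its own data structures, and pinning via vfio_pin_pages is not modeled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One entry as it might be recorded at map time (nothing is pinned yet). */
struct sk_mapping {
	uint64_t iova;   /* start of the guest physical (I/O virtual) range */
	uint64_t vaddr;  /* start of the matching process virtual range */
	uint64_t size;   /* length of the range in bytes */
};

/*
 * Translate 'iova' to a process virtual address using the stored
 * mappings. Returns 0 and sets *vaddr on success, -1 if the address
 * falls into no recorded range.
 */
int sk_lookup(const struct sk_mapping *db, size_t n, uint64_t iova,
	      uint64_t *vaddr)
{
	assert(db != NULL && vaddr != NULL);

	for (size_t i = 0; i < n; i++) {
		if (iova >= db[i].iova && iova - db[i].iova < db[i].size) {
			*vaddr = db[i].vaddr + (iova - db[i].iova);
			return 0;
		}
	}
	return -1;
}
```

Deferring the lookup and the pinning to request time means that only pages actually referenced by a channel program need to be pinned.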
145
146 Below is a high-level block diagram::
147
148  +-------------+
149  |             |
150  | +---------+ | mdev_register_driver() +--------------+
151  | |  Mdev   | +<-----------------------+              |
152  | |  bus    | |                        | vfio_mdev.ko |
153  | |  driver | +----------------------->+              |<-> VFIO user
154  | +---------+ |    probe()/remove()    +--------------+    APIs
155  |             |
156  |  MDEV CORE  |
157  |   MODULE    |
158  |   mdev.ko   |
159  | +---------+ | mdev_register_device() +--------------+
160  | |Physical | +<-----------------------+              |
161  | | device  | |                        |  vfio_ccw.ko |<-> subchannel
162  | |interface| +----------------------->+              |     device
163  | +---------+ |       callback         +--------------+
164  +-------------+
165
166 These components work together as follows:
167
168 1. vfio_ccw.ko drives the physical I/O subchannel, and registers the
169 physical device (with callbacks) with the mdev framework.
170 When vfio_ccw probes the subchannel device, it registers the device
171 pointer and callbacks with the mdev framework. Mdev-related file nodes
172 are then created under the device node in sysfs for the subchannel
173 device, namely 'mdev_create', 'mdev_destroy' and
174 'mdev_supported_types'.
175 2. Create a mediated vfio ccw device.
176 Using the 'mdev_create' sysfs file, we need to manually create one (and
177 only one for our case) mediated device.
178 3. vfio_mdev.ko drives the mediated ccw device.
179 vfio_mdev is also the vfio device driver. It will probe the mdev and
180 add it to an iommu_group and a vfio_group. Then we can pass the mdev
181 through to a guest.
182
183 vfio-ccw I/O region
184 -------------------
185
186 An I/O region is used to accept channel program requests from user
187 space and to store I/O interrupt results for user space to retrieve. The
188 definition of the region is::
189
190 struct ccw_io_region {
191 #define ORB_AREA_SIZE 12
192         __u8    orb_area[ORB_AREA_SIZE];
193 #define SCSW_AREA_SIZE 12
194         __u8    scsw_area[SCSW_AREA_SIZE];
195 #define IRB_AREA_SIZE 96
196         __u8    irb_area[IRB_AREA_SIZE];
197         __u32   ret_code;
198 } __packed;
199
200 When starting an I/O request, orb_area should be filled with the
201 guest ORB, and scsw_area should be filled with the SCSW of the virtual
202 subchannel.
203
204 irb_area stores the I/O result.
205
206 ret_code stores a return code for each access of the region.
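
As an illustration of how a user space program might prepare this region before a request, here is a minimal sketch. The struct repeats the layout shown above with `<stdint.h>` types in place of `__u8`/`__u32`. The helper `sk_prepare_request()` is a hypothetical name, and clearing irb_area and ret_code before a request is an assumption of this sketch rather than a documented requirement.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define ORB_AREA_SIZE  12
#define SCSW_AREA_SIZE 12
#define IRB_AREA_SIZE  96

/* User-space copy of the region layout (12 + 12 + 96 + 4 bytes). */
struct ccw_io_region {
	uint8_t  orb_area[ORB_AREA_SIZE];
	uint8_t  scsw_area[SCSW_AREA_SIZE];
	uint8_t  irb_area[IRB_AREA_SIZE];
	uint32_t ret_code;
} __attribute__((packed));

/*
 * Fill the region for a new request: copy in the guest ORB and the
 * SCSW of the virtual subchannel, and start with an empty result.
 */
void sk_prepare_request(struct ccw_io_region *region,
			const uint8_t orb[ORB_AREA_SIZE],
			const uint8_t scsw[SCSW_AREA_SIZE])
{
	assert(region != NULL && orb != NULL && scsw != NULL);

	memset(region, 0, sizeof(*region));
	memcpy(region->orb_area, orb, ORB_AREA_SIZE);
	memcpy(region->scsw_area, scsw, SCSW_AREA_SIZE);
}
```

Because the struct is packed, its size is exactly the sum of its members, so the same buffer can be written to the region in a single access.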
207
208 vfio-ccw operation details
209 --------------------------
210
211 vfio-ccw follows what vfio-pci did on the s390 platform and uses
212 vfio-iommu-type1 as the vfio iommu backend.
213
214 * CCW translation APIs
215 A group of APIs (prefixed with `cp_`) that do CCW translation. The CCWs
216 passed in by a user space program are organized with their guest
217 physical memory addresses. These APIs copy the CCWs into kernel
218 space, and assemble a runnable kernel channel program by replacing the
219 guest physical addresses with their corresponding host physical addresses.
220 Note that we have to use IDALs even for direct-access CCWs, as the
221 referenced memory can be located anywhere, including above 2G.
222
223 * vfio_ccw device driver
224 This driver utilizes the CCW translation APIs and introduces
225 vfio_ccw, which is the driver for the I/O subchannel devices you want
226 to pass through.
227 vfio_ccw implements the following vfio ioctls::
228
229 VFIO_DEVICE_GET_INFO
230 VFIO_DEVICE_GET_IRQ_INFO
231 VFIO_DEVICE_GET_REGION_INFO
232 VFIO_DEVICE_RESET
233 VFIO_DEVICE_SET_IRQS
234
235 This provides an I/O region, so that the user space program can pass a
236 channel program to the kernel, which does further CCW translation before
237 issuing it to a real device.
238 This also provides the VFIO_DEVICE_SET_IRQS ioctl to set up an event
239 notifier that notifies the user space program of the I/O completion
240 asynchronously.
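
The IDAL remark in the CCW translation item can be illustrated with a small sketch: a guest buffer that looks contiguous to the guest is described to the hardware as a list of per-block host addresses. This sketch assumes 64-bit IDAWs that each address a 4 KiB block. The names `sk_idaw_count()`, `sk_build_idal()` and `sk_toy_translate()` are hypothetical and are not the `cp_` APIs. The toy translation simply adds a fixed offset, whereas real host pages are in general not contiguous.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SK_IDA_BLOCK_SIZE 4096u /* assumption: 4 KiB IDA blocks */

/* Number of IDAWs needed to cover the buffer [addr, addr + len). */
size_t sk_idaw_count(uint64_t addr, uint32_t len)
{
	uint64_t offset = addr & (SK_IDA_BLOCK_SIZE - 1);

	return (size_t)((offset + len + SK_IDA_BLOCK_SIZE - 1) /
			SK_IDA_BLOCK_SIZE);
}

/* Caller-supplied guest-to-host translation; returns 0 on success. */
typedef int (*sk_translate_fn)(uint64_t guest_addr, uint64_t *host_addr);

/* Toy translation that places all guest memory above 4G. */
int sk_toy_translate(uint64_t guest_addr, uint64_t *host_addr)
{
	*host_addr = guest_addr + 0x100000000ULL;
	return 0;
}

/*
 * Build an IDAL for a guest buffer. The first IDAW keeps the offset
 * into its block; every following IDAW is block aligned. Returns the
 * number of IDAWs written, or -1 on error.
 */
int sk_build_idal(uint64_t guest_addr, uint32_t len, uint64_t *idal,
		  size_t idal_cap, sk_translate_fn translate)
{
	size_t n = sk_idaw_count(guest_addr, len);
	uint64_t cur = guest_addr;

	assert(idal != NULL && translate != NULL);
	if (n > idal_cap)
		return -1;

	for (size_t i = 0; i < n; i++) {
		uint64_t host;

		if (translate(cur, &host) != 0)
			return -1;
		idal[i] = host;
		/* Advance to the start of the next block. */
		cur = (cur & ~(uint64_t)(SK_IDA_BLOCK_SIZE - 1)) +
		      SK_IDA_BLOCK_SIZE;
	}
	return (int)n;
}
```

A direct-access CCW carries only a 31-bit address, so a list of 64-bit IDAWs is what lets the translated program reference host memory above 2G.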
241
242 The use of vfio-ccw is not limited to QEMU, although QEMU is definitely a
243 good example for understanding how these patches work. Here is a little
244 more detail on how an I/O request triggered by the QEMU guest is
245 handled (without error handling).
246
247 Explanation:
248
249 - Q1-Q7: QEMU side process.
250 - K1-K6: Kernel side process.
251
252 Q1.
253 Get I/O region info during initialization.
254
255 Q2.
256 Setup event notifier and handler to handle I/O completion.
257
258 ... ...
259
260 Q3.
261 Intercept a ssch instruction.
262 Q4.
263 Write the guest channel program and ORB to the I/O region.
264
265 K1.
266 Copy from guest to kernel.
267 K2.
268 Translate the guest channel program to a host kernel space
269 channel program, which becomes runnable for a real device.
270 K3.
271 With the necessary information contained in the orb passed in
272 by QEMU, issue the ccwchain to the device.
273 K4.
274 Return the ssch CC code.
275 Q5.
276 Return the CC code to the guest.
277
278 ... ...
279
280 K5.
281 Interrupt handler gets the I/O result and writes the result to
282 the I/O region.
283 K6.
284 Signal QEMU to retrieve the result.
285
286 Q6.
287 Receive the signal; the event handler reads out the result from the I/O
288 region.
289 Q7.
290 Update the irb for the guest.
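
The user space half of steps Q4 and Q6 comes down to one write and one read of the I/O region at its offset within the device file descriptor. The following is a sketch under stated assumptions. `sk_submit()` and `sk_fetch_irb()` are hypothetical names, the region offset would come from the region info obtained in Q1, and waiting on the event notifier from Q2 is left out. The region layout repeats the definition given earlier in this document with `<stdint.h>` types.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define ORB_AREA_SIZE  12
#define SCSW_AREA_SIZE 12
#define IRB_AREA_SIZE  96

struct ccw_io_region {
	uint8_t  orb_area[ORB_AREA_SIZE];
	uint8_t  scsw_area[SCSW_AREA_SIZE];
	uint8_t  irb_area[IRB_AREA_SIZE];
	uint32_t ret_code;
} __attribute__((packed));

/* Q4: write the prepared region (ORB and SCSW filled in) to the device. */
int sk_submit(int fd, off_t region_offset, const struct ccw_io_region *req)
{
	ssize_t n;

	assert(req != NULL);
	n = pwrite(fd, req, sizeof(*req), region_offset);
	if (n != (ssize_t)sizeof(*req)) {
		perror("pwrite");
		return -1;
	}
	return 0;
}

/* Q6: after being signaled, read the region back and extract the IRB. */
int sk_fetch_irb(int fd, off_t region_offset, uint8_t irb[IRB_AREA_SIZE])
{
	struct ccw_io_region region;
	ssize_t n;

	assert(irb != NULL);
	n = pread(fd, &region, sizeof(region), region_offset);
	if (n != (ssize_t)sizeof(region)) {
		fprintf(stderr, "short read from I/O region\n");
		return -1;
	}
	memcpy(irb, region.irb_area, IRB_AREA_SIZE);
	return 0;
}
```

Reading and writing the whole region in one access keeps the ORB, SCSW, IRB and return code consistent with each other for a given request.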
291
292 Limitations
293 -----------
294
295 The current vfio-ccw implementation focuses on supporting basic commands
296 needed to implement block device functionality (read/write) of DASD/ECKD
297 devices only. Some commands may need special handling in the future, for
298 example, anything related to path grouping.
299
300 DASD is a kind of storage device, while ECKD is a data recording format.
301 More information on DASD and ECKD can be found here:
302 https://en.wikipedia.org/wiki/Direct-access_storage_device
303 https://en.wikipedia.org/wiki/Count_key_data
304
305 Together with the corresponding work in QEMU, we can now bring the
306 passed-through DASD/ECKD device online in a guest and use it as a block
307 device.
308
309 While the current code allows the guest to start channel programs via
310 START SUBCHANNEL, support for HALT SUBCHANNEL or CLEAR SUBCHANNEL is
311 not yet implemented.
312
313 vfio-ccw supports classic (command mode) channel I/O only. Transport
314 mode (HPF) is not supported.
315
316 QDIO subchannels are currently not supported. Classic devices other than
317 DASD/ECKD might work, but have not been tested.
318
319 Reference
320 ---------
321 1. ESA/s390 Principles of Operation manual (IBM Form. No. SA22-7832)
322 2. ESA/390 Common I/O Device Commands manual (IBM Form. No. SA22-7204)
323 3. https://en.wikipedia.org/wiki/Channel_I/O
324 4. Documentation/s390/cds.rst
325 5. Documentation/driver-api/vfio.rst
326 6. Documentation/driver-api/vfio-mediated-device.rst