Commit | Line | Data |
---|---|---|
a2468cc9 AL |
1 | Automatically bind swap device to numa node |
2 | ------------------------------------------- | |
3 | ||
4 | If the system has more than one swap device and the swap devices have node | |
5 | information, we can make use of this information to decide which swap | |
6 | device to use in get_swap_pages() to get better performance. | |
7 | ||
8 | ||
9 | How to use this feature | |
10 | ----------------------- | |
11 | ||
12 | Each swap device has a priority, which decides the order in which devices are | |
13 | used. To make use of automatic binding, there is no need to manipulate the | |
14 | priority settings of swap devices. For example, on a 2-node machine, assume two | |
15 | swap devices, swapA attached to node 0 and swapB attached to node 1, are going | |
16 | to be swapped on. Simply swap them on by doing: | |
17 | # swapon /dev/swapA | |
18 | # swapon /dev/swapB | |
19 | ||
20 | Then node 0 will use the two swap devices in the order of swapA then swapB and | |
21 | node 1 will use the two swap devices in the order of swapB then swapA. Note | |
22 | that the order in which they are swapped on doesn't matter. | |
23 | ||
24 | Consider a more complex example on a 4-node machine. Assume 6 swap devices are | |
25 | going to be swapped on: swapA and swapB are attached to node 0, swapC is attached | |
26 | to node 1, swapD and swapE are attached to node 2 and swapF is attached to node 3. | |
27 | The way to swap them on is the same as above: | |
28 | # swapon /dev/swapA | |
29 | # swapon /dev/swapB | |
30 | # swapon /dev/swapC | |
31 | # swapon /dev/swapD | |
32 | # swapon /dev/swapE | |
33 | # swapon /dev/swapF | |
34 | ||
35 | Then node 0 will use them in the order of: | |
36 | swapA/swapB -> swapC -> swapD -> swapE -> swapF | |
37 | swapA and swapB will be used in a round robin mode before any other swap device. | |
38 | ||
39 | node 1 will use them in the order of: | |
40 | swapC -> swapA -> swapB -> swapD -> swapE -> swapF | |
41 | ||
42 | node 2 will use them in the order of: | |
43 | swapD/swapE -> swapA -> swapB -> swapC -> swapF | |
44 | Similarly, swapD and swapE will be used in a round robin mode before any | |
45 | other swap devices. | |
46 | ||
47 | node 3 will use them in the order of: | |
48 | swapF -> swapA -> swapB -> swapC -> swapD -> swapE | |
49 | ||
50 | ||
51 | Implementation details | |
52 | ---------------------- | |
53 | ||
54 | The current code uses a priority-based list, swap_avail_list, to decide | |
55 | which swap device to use; if multiple swap devices share the same | |
56 | priority, they are used round robin. This change replaces the single | |
57 | global swap_avail_list with a per-NUMA-node list, i.e. each NUMA node | |
58 | sees its own priority-based list of available swap devices. A swap | |
59 | device's priority can be promoted on its matching node's swap_avail_list. | |
60 | ||
61 | A swap device's priority is currently set as follows: the user can set a value | |
62 | >= 0, or the system will pick one starting from -1 and going downwards. The | |
63 | priority value in the swap_avail_list is the negated value of the swap device's | |
64 | priority, because plist is sorted from low to high. The new policy doesn't change | |
65 | the semantics for priority >= 0 cases. The previous "starting from -1 then | |
66 | downwards" now becomes "starting from -2 then downwards", and -1 is reserved | |
67 | as the promoted value. So if multiple swap devices are attached to the same | |
68 | node, they will all be promoted to priority -1 on that node's plist and will | |
69 | be used round robin before any other swap devices. |
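
The per-node ordering described above can be illustrated with a small Python model. This is only a sketch and not the kernel code. It assumes every device gets a system-assigned priority (no user-set values >= 0), and the names assign_priorities and node_order are hypothetical helpers introduced for illustration.

```python
from itertools import groupby

PROMOTED = -1    # priority reserved for devices local to a node
FIRST_AUTO = -2  # first system-assigned priority, then downwards


def assign_priorities(devices):
    """Map each device name to its system-assigned priority.

    devices is a list of (name, node) tuples in swapon order.
    """
    return {name: FIRST_AUTO - i for i, (name, _node) in enumerate(devices)}


def node_order(devices, node):
    """Return the round-robin groups in the order `node` would use them."""
    base = assign_priorities(devices)
    # Effective priority on this node's list: promoted if the device is local.
    effective = [
        (PROMOTED if dev_node == node else base[name], name)
        for name, dev_node in devices
    ]
    # The plist is sorted from low to high, so sort on the negated priority.
    effective.sort(key=lambda entry: -entry[0])
    # Devices sharing a priority form one round-robin group.
    return [
        [name for _prio, name in group]
        for _prio, group in groupby(effective, key=lambda entry: entry[0])
    ]


if __name__ == "__main__":
    devices = [("swapA", 0), ("swapB", 0), ("swapC", 1),
               ("swapD", 2), ("swapE", 2), ("swapF", 3)]
    for node in range(4):
        groups = node_order(devices, node)
        print(node, " -> ".join("/".join(group) for group in groups))
```

Each inner list is one round-robin group, so the 4-node example yields swapA/swapB first on node 0 and swapD/swapE first on node 2, matching the orders listed earlier.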