--- /dev/null
+=====================
+Restartable Sequences
+=====================
+
+Restartable Sequences allow to register a per thread userspace memory area
+to be used as an ABI between kernel and userspace for three purposes:
+
+ * userspace restartable sequences
+
+ * quick access to read the current CPU number, node ID from userspace
+
+ * scheduler time slice extensions
+
+Restartable sequences (per-cpu atomics)
+---------------------------------------
+
+Restartable sequences allow userspace to perform update operations on
+per-cpu data without requiring heavyweight atomic operations. The actual
+ABI is unfortunately only available in the code and selftests.
+
+Quick access to CPU number, node ID
+-----------------------------------
+
+Allows to implement per CPU data efficiently. Documentation is in code and
+selftests. :(
+
+Scheduler time slice extensions
+-------------------------------
+
+This allows a thread to request a time slice extension when it enters a
+critical section to avoid contention on a resource when the thread is
+scheduled out inside of the critical section.
+
+The prerequisites for this functionality are:
+
+ * Enabled in Kconfig
+
+ * Enabled at boot time (default is enabled)
+
+ * A rseq userspace pointer has been registered for the thread
+
+The thread has to enable the functionality via prctl(2)::
+
+ prctl(PR_RSEQ_SLICE_EXTENSION, PR_RSEQ_SLICE_EXTENSION_SET,
+ PR_RSEQ_SLICE_EXT_ENABLE, 0, 0);
+
+prctl() returns 0 on success or otherwise with the following error codes:
+
+========= ==============================================================
+Errorcode Meaning
+========= ==============================================================
+EINVAL Functionality not available or invalid function arguments.
+ Note: arg4 and arg5 must be zero
+ENOTSUPP Functionality was disabled on the kernel command line
+ENXIO Available, but no rseq user struct registered
+========= ==============================================================
+
+The state can be also queried via prctl(2)::
+
+ prctl(PR_RSEQ_SLICE_EXTENSION, PR_RSEQ_SLICE_EXTENSION_GET, 0, 0, 0);
+
+prctl() returns ``PR_RSEQ_SLICE_EXT_ENABLE`` when it is enabled or 0 if
+disabled. Otherwise it returns with the following error codes:
+
+========= ==============================================================
+Errorcode Meaning
+========= ==============================================================
+EINVAL Functionality not available or invalid function arguments.
+ Note: arg3 and arg4 and arg5 must be zero
+========= ==============================================================
+
+The availability and status is also exposed via the rseq ABI struct flags
+field via the ``RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE_BIT`` and the
+``RSEQ_CS_FLAG_SLICE_EXT_ENABLED_BIT``. These bits are read-only for user
+space and only for informational purposes.
+
+If the mechanism was enabled via prctl(), the thread can request a time
+slice extension by setting rseq::slice_ctrl::request to 1. If the thread is
+interrupted and the interrupt results in a reschedule request in the
+kernel, then the kernel can grant a time slice extension and return to
+userspace instead of scheduling out. The length of the extension is
+determined by the ``rseq_slice_extension_nsec`` sysctl.
+
+The kernel indicates the grant by clearing rseq::slice_ctrl::request and
+setting rseq::slice_ctrl::granted to 1. If there is a reschedule of the
+thread after granting the extension, the kernel clears the granted bit to
+indicate that to userspace.
+
+If the request bit is still set when the leaving the critical section,
+userspace can clear it and continue.
+
+If the granted bit is set, then userspace invokes rseq_slice_yield(2) when
+leaving the critical section to relinquish the CPU. The kernel enforces
+this by arming a timer to prevent misbehaving userspace from abusing this
+mechanism.
+
+If both the request bit and the granted bit are false when leaving the
+critical section, then this indicates that a grant was revoked and no
+further action is required by userspace.
+
+The required code flow is as follows::
+
+ rseq->slice_ctrl.request = 1;
+ barrier(); // Prevent compiler reordering
+ critical_section();
+ barrier(); // Prevent compiler reordering
+ rseq->slice_ctrl.request = 0;
+ if (rseq->slice_ctrl.granted)
+ rseq_slice_yield();
+
+As all of this is strictly CPU local, there are no atomicity requirements.
+Checking the granted state is racy, but that cannot be avoided at all::
+
+ if (rseq->slice_ctrl.granted)
+ -> Interrupt results in schedule and grant revocation
+ rseq_slice_yield();
+
+So there is no point in pretending that this might be solved by an atomic
+operation.
+
+If the thread issues a syscall other than rseq_slice_yield(2) within the
+granted timeslice extension, the grant is also revoked and the CPU is
+relinquished immediately when entering the kernel. This is required as
+syscalls might consume arbitrary CPU time until they reach a scheduling
+point when the preemption model is either NONE or VOLUNTARY and therefore
+might exceed the grant by far.
+
+The preferred solution for user space is to use rseq_slice_yield(2) which
+is side effect free. The support for arbitrary syscalls is required to
+support onion layer architectured applications, where the code handling the
+critical section and requesting the time slice extension has no control
+over the code within the critical section.
+
+The kernel enforces flag consistency and terminates the thread with SIGSEGV
+if it detects a violation.
};
enum rseq_cs_flags_bit {
+ /* Historical and unsupported bits */
RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT_BIT = 0,
RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL_BIT = 1,
RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE_BIT = 2,
+ /* (3) Intentional gap to put new bits into a separate byte */
+
+ /* User read only feature flags */
+ RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE_BIT = 4,
+ RSEQ_CS_FLAG_SLICE_EXT_ENABLED_BIT = 5,
};
enum rseq_cs_flags {
(1U << RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL_BIT),
RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE =
(1U << RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE_BIT),
+
+ RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE =
+ (1U << RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE_BIT),
+ RSEQ_CS_FLAG_SLICE_EXT_ENABLED =
+ (1U << RSEQ_CS_FLAG_SLICE_EXT_ENABLED_BIT),
};
/*
__u64 abort_ip;
} __attribute__((aligned(4 * sizeof(__u64))));
+/**
+ * rseq_slice_ctrl - Time slice extension control structure
+ * @all: Compound value
+ * @request: Request for a time slice extension
+ * @granted: Granted time slice extension
+ *
+ * @request is set by user space and can be cleared by user space or kernel
+ * space. @granted is set and cleared by the kernel and must only be read
+ * by user space.
+ */
+struct rseq_slice_ctrl {
+ union {
+ __u32 all;
+ struct {
+ __u8 request;
+ __u8 granted;
+ __u16 __reserved;
+ };
+ };
+};
+
/*
* struct rseq is aligned on 4 * 8 bytes to ensure it is always
* contained within a single cache-line.
*/
__u32 mm_cid;
+ /*
+ * Time slice extension control structure. CPU local updates from
+ * kernel and user space.
+ */
+ struct rseq_slice_ctrl slice_ctrl;
+
/*
* Flexible array member at end of structure, after last feature field.
*/
*/
SYSCALL_DEFINE4(rseq, struct rseq __user *, rseq, u32, rseq_len, int, flags, u32, sig)
{
+ u32 rseqfl = 0;
+
if (flags & RSEQ_FLAG_UNREGISTER) {
if (flags & ~RSEQ_FLAG_UNREGISTER)
return -EINVAL;
if (!access_ok(rseq, rseq_len))
return -EFAULT;
+ if (IS_ENABLED(CONFIG_RSEQ_SLICE_EXTENSION))
+ rseqfl |= RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE;
+
scoped_user_write_access(rseq, efault) {
/*
* If the rseq_cs pointer is non-NULL on registration, clear it to
* clearing the fields. Don't bother reading it, just reset it.
*/
unsafe_put_user(0UL, &rseq->rseq_cs, efault);
+ unsafe_put_user(rseqfl, &rseq->flags, efault);
/* Initialize IDs in user space */
unsafe_put_user(RSEQ_CPU_ID_UNINITIALIZED, &rseq->cpu_id_start, efault);
unsafe_put_user(RSEQ_CPU_ID_UNINITIALIZED, &rseq->cpu_id, efault);
unsafe_put_user(0U, &rseq->node_id, efault);
unsafe_put_user(0U, &rseq->mm_cid, efault);
+ unsafe_put_user(0U, &rseq->slice_ctrl.all, efault);
}
/*