From: Greg Kroah-Hartman Date: Wed, 27 Nov 2019 10:27:12 +0000 (+0100) Subject: 5.3-stable patches X-Git-Tag: v4.4.204~43 X-Git-Url: http://git.ipfire.org/?a=commitdiff_plain;h=1792d0c1079c198c9ef16741cc3c44c282458ada;p=thirdparty%2Fkernel%2Fstable-queue.git 5.3-stable patches added patches: alsa-usb-audio-fix-null-dereference-at-parsing-badd.patch futex-prevent-exit-livelock.patch futex-prevent-robust-futex-exit-race.patch gve-fix-dma-sync-bug-where-not-all-pages-synced.patch nbd-prevent-memory-leak.patch net-sysfs-fix-reference-count-leak-in-rx-netdev_queue_add_kobject.patch nfc-port100-handle-command-failure-cleanly.patch selftests-x86-mov_ss_trap-fix-the-sysenter-test.patch selftests-x86-sigreturn-32-invalidate-ds-and-es-when-abusing-the-kernel.patch x86-cpu_entry_area-add-guard-page-for-entry-stack-on-32bit.patch x86-doublefault-32-fix-stack-canaries-in-the-double-fault-handler.patch x86-entry-32-fix-fixup_espfix_stack-with-user-cr3.patch x86-entry-32-fix-iret-exception.patch x86-entry-32-fix-nmi-vs-espfix.patch x86-entry-32-move-fixup_frame-after-pushing-fs-in-save_all.patch x86-entry-32-unwind-the-espfix-stack-earlier-on-exception-entry.patch x86-entry-32-use-ss-segment-where-required.patch x86-pti-32-calculate-the-various-pti-cpu_entry_area-sizes-correctly-make-the-cpu_entry_area_pages-assert-precise.patch x86-pti-32-size-initial_page_table-correctly.patch x86-speculation-fix-incorrect-mds-taa-mitigation-status.patch x86-speculation-fix-redundant-mds-mitigation-message.patch x86-stackframe-32-repair-32-bit-xen-pv.patch x86-xen-32-make-xen_iret_crit_fixup-independent-of-frame-layout.patch x86-xen-32-simplify-ring-check-in-xen_iret_crit_fixup.patch --- diff --git a/queue-5.3/alsa-usb-audio-fix-null-dereference-at-parsing-badd.patch b/queue-5.3/alsa-usb-audio-fix-null-dereference-at-parsing-badd.patch new file mode 100644 index 00000000000..f328dae9860 --- /dev/null +++ b/queue-5.3/alsa-usb-audio-fix-null-dereference-at-parsing-badd.patch @@ -0,0 +1,39 @@ 
+From 9435f2bb66874a0c4dd25e7c978957a7ca2c93b1 Mon Sep 17 00:00:00 2001 +From: Takashi Iwai +Date: Fri, 22 Nov 2019 12:28:40 +0100 +Subject: ALSA: usb-audio: Fix NULL dereference at parsing BADD + +From: Takashi Iwai + +commit 9435f2bb66874a0c4dd25e7c978957a7ca2c93b1 upstream. + +snd_usb_mixer_controls_badd() that parses UAC3 BADD profiles misses a +NULL check for the given interfaces. When a malformed USB descriptor +is passed, this may lead to an Oops, as spotted by syzkaller. +Skip the iteration if the interface doesn't exist, to avoid the +crash. + +Fixes: 17156f23e93c ("ALSA: usb: add UAC3 BADD profiles support") +Reported-by: syzbot+a36ab65c6653d7ccdd62@syzkaller.appspotmail.com +Suggested-by: Dan Carpenter +Cc: +Link: https://lore.kernel.org/r/20191122112840.24797-1-tiwai@suse.de +Signed-off-by: Takashi Iwai +Signed-off-by: Greg Kroah-Hartman + +--- + sound/usb/mixer.c | 3 +++ + 1 file changed, 3 insertions(+) + +--- a/sound/usb/mixer.c ++++ b/sound/usb/mixer.c +@@ -2930,6 +2930,9 @@ static int snd_usb_mixer_controls_badd(s + continue; + + iface = usb_ifnum_to_if(dev, intf); ++ if (!iface) ++ continue; ++ + num = iface->num_altsetting; + + if (num < 2) diff --git a/queue-5.3/futex-prevent-exit-livelock.patch b/queue-5.3/futex-prevent-exit-livelock.patch new file mode 100644 index 00000000000..cfc6d7afc0c --- /dev/null +++ b/queue-5.3/futex-prevent-exit-livelock.patch @@ -0,0 +1,343 @@ +From 3ef240eaff36b8119ac9e2ea17cbf41179c930ba Mon Sep 17 00:00:00 2001 +From: Thomas Gleixner +Date: Wed, 6 Nov 2019 22:55:46 +0100 +Subject: futex: Prevent exit livelock + +From: Thomas Gleixner + +commit 3ef240eaff36b8119ac9e2ea17cbf41179c930ba upstream. 
+ +Oleg provided the following test case: + +int main(void) +{ + struct sched_param sp = {}; + + sp.sched_priority = 2; + assert(sched_setscheduler(0, SCHED_FIFO, &sp) == 0); + + int lock = vfork(); + if (!lock) { + sp.sched_priority = 1; + assert(sched_setscheduler(0, SCHED_FIFO, &sp) == 0); + _exit(0); + } + + syscall(__NR_futex, &lock, FUTEX_LOCK_PI, 0,0,0); + return 0; +} + +This creates an unkillable RT process spinning in futex_lock_pi() on a UP +machine or if the process is affine to a single CPU. The reason is: + + parent child + + set FIFO prio 2 + + vfork() -> set FIFO prio 1 + implies wait_for_child() sched_setscheduler(...) + exit() + do_exit() + .... + mm_release() + tsk->futex_state = FUTEX_STATE_EXITING; + exit_futex(); (NOOP in this case) + complete() --> wakes parent + sys_futex() + loop infinite because + tsk->futex_state == FUTEX_STATE_EXITING + +The same problem can happen just by regular preemption as well: + + task holds futex + ... + do_exit() + tsk->futex_state = FUTEX_STATE_EXITING; + + --> preemption (unrelated wakeup of some other higher prio task, e.g. timer) + + switch_to(other_task) + + return to user + sys_futex() + loop infinite as above + +Just for the fun of it the futex exit cleanup could trigger the wakeup +itself before the task sets its futex state to DEAD. + +To cure this, the handling of the exiting owner is changed so: + + - A refcount is held on the task + + - The task pointer is stored in a caller visible location + + - The caller drops all locks (hash bucket, mmap_sem) and blocks + on task::futex_exit_mutex. When the mutex is acquired then + the exiting task has completed the cleanup and the state + is consistent and can be reevaluated. + +This is not a pretty solution, but there is no choice other than returning +an error code to user space, which would break the state consistency +guarantee and open another can of problems including regressions. + +For stable backports the preparatory commits ac31c7ff8624 .. 
ba31c1a48538 +are required as well, but for anything older than 5.3.y the backports are +going to be provided when this hits mainline as the other dependencies for +those kernels are definitely not stable material. + +Fixes: 778e9a9c3e71 ("pi-futex: fix exit races and locking problems") +Reported-by: Oleg Nesterov +Signed-off-by: Thomas Gleixner +Reviewed-by: Ingo Molnar +Acked-by: Peter Zijlstra (Intel) +Cc: Stable Team +Link: https://lkml.kernel.org/r/20191106224557.041676471@linutronix.de +Signed-off-by: Greg Kroah-Hartman + +--- + kernel/futex.c | 106 ++++++++++++++++++++++++++++++++++++++++++++++++--------- + 1 file changed, 91 insertions(+), 15 deletions(-) + +--- a/kernel/futex.c ++++ b/kernel/futex.c +@@ -1171,6 +1171,36 @@ out_error: + return ret; + } + ++/** ++ * wait_for_owner_exiting - Block until the owner has exited ++ * @exiting: Pointer to the exiting task ++ * ++ * Caller must hold a refcount on @exiting. ++ */ ++static void wait_for_owner_exiting(int ret, struct task_struct *exiting) ++{ ++ if (ret != -EBUSY) { ++ WARN_ON_ONCE(exiting); ++ return; ++ } ++ ++ if (WARN_ON_ONCE(ret == -EBUSY && !exiting)) ++ return; ++ ++ mutex_lock(&exiting->futex_exit_mutex); ++ /* ++ * No point in doing state checking here. If the waiter got here ++ * while the task was in exec()->exec_futex_release() then it can ++ * have any FUTEX_STATE_* value when the waiter has acquired the ++ * mutex. OK, if running, EXITING or DEAD if it reached exit() ++ * already. Highly unlikely and not a problem. Just one more round ++ * through the futex maze. ++ */ ++ mutex_unlock(&exiting->futex_exit_mutex); ++ ++ put_task_struct(exiting); ++} ++ + static int handle_exit_race(u32 __user *uaddr, u32 uval, + struct task_struct *tsk) + { +@@ -1230,7 +1260,8 @@ static int handle_exit_race(u32 __user * + * it after doing proper sanity checks. 
+ */ + static int attach_to_pi_owner(u32 __user *uaddr, u32 uval, union futex_key *key, +- struct futex_pi_state **ps) ++ struct futex_pi_state **ps, ++ struct task_struct **exiting) + { + pid_t pid = uval & FUTEX_TID_MASK; + struct futex_pi_state *pi_state; +@@ -1270,7 +1301,19 @@ static int attach_to_pi_owner(u32 __user + int ret = handle_exit_race(uaddr, uval, p); + + raw_spin_unlock_irq(&p->pi_lock); +- put_task_struct(p); ++ /* ++ * If the owner task is between FUTEX_STATE_EXITING and ++ * FUTEX_STATE_DEAD then store the task pointer and keep ++ * the reference on the task struct. The calling code will ++ * drop all locks, wait for the task to reach ++ * FUTEX_STATE_DEAD and then drop the refcount. This is ++ * required to prevent a live lock when the current task ++ * preempted the exiting task between the two states. ++ */ ++ if (ret == -EBUSY) ++ *exiting = p; ++ else ++ put_task_struct(p); + return ret; + } + +@@ -1309,7 +1352,8 @@ static int attach_to_pi_owner(u32 __user + + static int lookup_pi_state(u32 __user *uaddr, u32 uval, + struct futex_hash_bucket *hb, +- union futex_key *key, struct futex_pi_state **ps) ++ union futex_key *key, struct futex_pi_state **ps, ++ struct task_struct **exiting) + { + struct futex_q *top_waiter = futex_top_waiter(hb, key); + +@@ -1324,7 +1368,7 @@ static int lookup_pi_state(u32 __user *u + * We are the first waiter - try to look up the owner based on + * @uval and attach to it. + */ +- return attach_to_pi_owner(uaddr, uval, key, ps); ++ return attach_to_pi_owner(uaddr, uval, key, ps, exiting); + } + + static int lock_pi_update_atomic(u32 __user *uaddr, u32 uval, u32 newval) +@@ -1352,6 +1396,8 @@ static int lock_pi_update_atomic(u32 __u + * lookup + * @task: the task to perform the atomic lock work for. This will + * be "current" except in the case of requeue pi. 
++ * @exiting: Pointer to store the task pointer of the owner task ++ * which is in the middle of exiting + * @set_waiters: force setting the FUTEX_WAITERS bit (1) or not (0) + * + * Return: +@@ -1360,11 +1406,17 @@ static int lock_pi_update_atomic(u32 __u + * - <0 - error + * + * The hb->lock and futex_key refs shall be held by the caller. ++ * ++ * @exiting is only set when the return value is -EBUSY. If so, this holds ++ * a refcount on the exiting task on return and the caller needs to drop it ++ * after waiting for the exit to complete. + */ + static int futex_lock_pi_atomic(u32 __user *uaddr, struct futex_hash_bucket *hb, + union futex_key *key, + struct futex_pi_state **ps, +- struct task_struct *task, int set_waiters) ++ struct task_struct *task, ++ struct task_struct **exiting, ++ int set_waiters) + { + u32 uval, newval, vpid = task_pid_vnr(task); + struct futex_q *top_waiter; +@@ -1434,7 +1486,7 @@ static int futex_lock_pi_atomic(u32 __us + * attach to the owner. If that fails, no harm done, we only + * set the FUTEX_WAITERS bit in the user space variable. + */ +- return attach_to_pi_owner(uaddr, newval, key, ps); ++ return attach_to_pi_owner(uaddr, newval, key, ps, exiting); + } + + /** +@@ -1852,6 +1904,8 @@ void requeue_pi_wake_futex(struct futex_ + * @key1: the from futex key + * @key2: the to futex key + * @ps: address to store the pi_state pointer ++ * @exiting: Pointer to store the task pointer of the owner task ++ * which is in the middle of exiting + * @set_waiters: force setting the FUTEX_WAITERS bit (1) or not (0) + * + * Try and get the lock on behalf of the top waiter if we can do it atomically. +@@ -1859,16 +1913,20 @@ void requeue_pi_wake_futex(struct futex_ + * then direct futex_lock_pi_atomic() to force setting the FUTEX_WAITERS bit. + * hb1 and hb2 must be held by the caller. + * ++ * @exiting is only set when the return value is -EBUSY. 
If so, this holds ++ * a refcount on the exiting task on return and the caller needs to drop it ++ * after waiting for the exit to complete. ++ * + * Return: + * - 0 - failed to acquire the lock atomically; + * - >0 - acquired the lock, return value is vpid of the top_waiter + * - <0 - error + */ +-static int futex_proxy_trylock_atomic(u32 __user *pifutex, +- struct futex_hash_bucket *hb1, +- struct futex_hash_bucket *hb2, +- union futex_key *key1, union futex_key *key2, +- struct futex_pi_state **ps, int set_waiters) ++static int ++futex_proxy_trylock_atomic(u32 __user *pifutex, struct futex_hash_bucket *hb1, ++ struct futex_hash_bucket *hb2, union futex_key *key1, ++ union futex_key *key2, struct futex_pi_state **ps, ++ struct task_struct **exiting, int set_waiters) + { + struct futex_q *top_waiter = NULL; + u32 curval; +@@ -1905,7 +1963,7 @@ static int futex_proxy_trylock_atomic(u3 + */ + vpid = task_pid_vnr(top_waiter->task); + ret = futex_lock_pi_atomic(pifutex, hb2, key2, ps, top_waiter->task, +- set_waiters); ++ exiting, set_waiters); + if (ret == 1) { + requeue_pi_wake_futex(top_waiter, key2, hb2); + return vpid; +@@ -2034,6 +2092,8 @@ retry_private: + } + + if (requeue_pi && (task_count - nr_wake < nr_requeue)) { ++ struct task_struct *exiting = NULL; ++ + /* + * Attempt to acquire uaddr2 and wake the top waiter. If we + * intend to requeue waiters, force setting the FUTEX_WAITERS +@@ -2041,7 +2101,8 @@ retry_private: + * faults rather in the requeue loop below. + */ + ret = futex_proxy_trylock_atomic(uaddr2, hb1, hb2, &key1, +- &key2, &pi_state, nr_requeue); ++ &key2, &pi_state, ++ &exiting, nr_requeue); + + /* + * At this point the top_waiter has either taken uaddr2 or is +@@ -2068,7 +2129,8 @@ retry_private: + * If that call succeeds then we have pi_state and an + * initial refcount on it. 
+ */ +- ret = lookup_pi_state(uaddr2, ret, hb2, &key2, &pi_state); ++ ret = lookup_pi_state(uaddr2, ret, hb2, &key2, ++ &pi_state, &exiting); + } + + switch (ret) { +@@ -2097,6 +2159,12 @@ retry_private: + hb_waiters_dec(hb2); + put_futex_key(&key2); + put_futex_key(&key1); ++ /* ++ * Handle the case where the owner is in the middle of ++ * exiting. Wait for the exit to complete otherwise ++ * this task might loop forever, aka. live lock. ++ */ ++ wait_for_owner_exiting(ret, exiting); + cond_resched(); + goto retry; + default: +@@ -2803,6 +2871,7 @@ static int futex_lock_pi(u32 __user *uad + { + struct hrtimer_sleeper timeout, *to; + struct futex_pi_state *pi_state = NULL; ++ struct task_struct *exiting = NULL; + struct rt_mutex_waiter rt_waiter; + struct futex_hash_bucket *hb; + struct futex_q q = futex_q_init; +@@ -2824,7 +2893,8 @@ retry: + retry_private: + hb = queue_lock(&q); + +- ret = futex_lock_pi_atomic(uaddr, hb, &q.key, &q.pi_state, current, 0); ++ ret = futex_lock_pi_atomic(uaddr, hb, &q.key, &q.pi_state, current, ++ &exiting, 0); + if (unlikely(ret)) { + /* + * Atomic work succeeded and we got the lock, +@@ -2846,6 +2916,12 @@ retry_private: + */ + queue_unlock(hb); + put_futex_key(&q.key); ++ /* ++ * Handle the case where the owner is in the middle of ++ * exiting. Wait for the exit to complete otherwise ++ * this task might loop forever, aka. live lock. ++ */ ++ wait_for_owner_exiting(ret, exiting); + cond_resched(); + goto retry; + default: diff --git a/queue-5.3/futex-prevent-robust-futex-exit-race.patch b/queue-5.3/futex-prevent-robust-futex-exit-race.patch new file mode 100644 index 00000000000..52c90b76c48 --- /dev/null +++ b/queue-5.3/futex-prevent-robust-futex-exit-race.patch @@ -0,0 +1,261 @@ +From ca16d5bee59807bf04deaab0a8eccecd5061528c Mon Sep 17 00:00:00 2001 +From: Yang Tao +Date: Wed, 6 Nov 2019 22:55:35 +0100 +Subject: futex: Prevent robust futex exit race + +From: Yang Tao + +commit ca16d5bee59807bf04deaab0a8eccecd5061528c upstream. 
+ +Robust futexes utilize the robust_list mechanism to allow the kernel to +release futexes which are held when a task exits. The exit can be voluntary +or caused by a signal or fault. This prevents waiters from blocking forever. + +The futex operations in user space store a pointer to the futex they are +either locking or unlocking in the op_pending member of the per task robust +list. + +After a lock operation has succeeded the futex is queued in the robust list +linked list and the op_pending pointer is cleared. + +After an unlock operation has succeeded the futex is removed from the +robust list linked list and the op_pending pointer is cleared. + +The robust list exit code checks for the pending operation and any futex +which is queued in the linked list. It carefully checks whether the futex +value is the TID of the exiting task. If so, it sets the OWNER_DIED bit and +tries to wake up a potential waiter. + +This is race free for the lock operation but unlock has two race scenarios +where waiters might not be woken up. These issues can be observed with +regular robust pthread mutexes. PI aware pthread mutexes are not affected. + +(1) Unlocking task is killed after unlocking the futex value in user space + before being able to wake a waiter. + + pthread_mutex_unlock() + | + V + atomic_exchange_rel (&mutex->__data.__lock, 0) + <------------------------killed + lll_futex_wake () | + | + |(__lock = 0) + |(enter kernel) + | + V + do_exit() + exit_mm() + mm_release() + exit_robust_list() + handle_futex_death() + | + |(__lock = 0) + |(uval = 0) + | + V + if ((uval & FUTEX_TID_MASK) != task_pid_vnr(curr)) + return 0; + + The sanity check which ensures that the user space futex is owned by + the exiting task prevents the wakeup of waiters which in consequence + block infinitely. + +(2) Waiting task is killed after a wakeup and before it can acquire the + futex in user space. 
+ + OWNER WAITER + futex_wait() + pthread_mutex_unlock() | + | | + |(__lock = 0) | + | | + V | + futex_wake() ------------> wakeup() + | + |(return to userspace) + |(__lock = 0) + | + V + oldval = mutex->__data.__lock + <-----------------killed + atomic_compare_and_exchange_val_acq (&mutex->__data.__lock, | + id | assume_other_futex_waiters, 0) | + | + | + (enter kernel)| + | + V + do_exit() + | + | + V + handle_futex_death() + | + |(__lock = 0) + |(uval = 0) + | + V + if ((uval & FUTEX_TID_MASK) != task_pid_vnr(curr)) + return 0; + + The sanity check which ensures that the user space futex is owned + by the exiting task prevents the wakeup of waiters, which seems to + be correct as the exiting task does not own the futex value, but + the consequence is that other waiters won't be woken up and block + infinitely. + +In both scenarios the following conditions are true: + + - task->robust_list->list_op_pending != NULL + - user space futex value == 0 + - Regular futex (not PI) + +If these conditions are met then it is reasonably safe to wake up a +potential waiter in order to prevent the above problems. + +As this might be a false positive it can cause spurious wakeups, but the +waiter side has to handle other types of unrelated wakeups, e.g. signals +gracefully anyway. So such a spurious wakeup will not affect the +correctness of these operations. + +This workaround must not touch the user space futex value and cannot set +the OWNER_DIED bit because the lock value is 0, i.e. uncontended. Setting +OWNER_DIED in this case would result in inconsistent state and subsequently +in malfunction of the owner died handling in user space. + +The rest of the user space state is still consistent as no other task can +observe the list_op_pending entry in the exiting task's robust list. + +The eventually woken up waiter will observe the uncontended lock value and +take it over. + +[ tglx: Massaged changelog and comment. 
Made the return explicit and not + depend on the subsequent check and added constants to hand into + handle_futex_death() instead of plain numbers. Fixed a few coding + style issues. ] + +Fixes: 0771dfefc9e5 ("[PATCH] lightweight robust futexes: core") +Signed-off-by: Yang Tao +Signed-off-by: Yi Wang +Signed-off-by: Thomas Gleixner +Reviewed-by: Ingo Molnar +Acked-by: Peter Zijlstra (Intel) +Cc: stable@vger.kernel.org +Link: https://lkml.kernel.org/r/1573010582-35297-1-git-send-email-wang.yi59@zte.com.cn +Link: https://lkml.kernel.org/r/20191106224555.943191378@linutronix.de +Signed-off-by: Greg Kroah-Hartman + +--- + kernel/futex.c | 58 ++++++++++++++++++++++++++++++++++++++++++++++++++------- + 1 file changed, 51 insertions(+), 7 deletions(-) + +--- a/kernel/futex.c ++++ b/kernel/futex.c +@@ -3454,11 +3454,16 @@ err_unlock: + return ret; + } + ++/* Constants for the pending_op argument of handle_futex_death */ ++#define HANDLE_DEATH_PENDING true ++#define HANDLE_DEATH_LIST false ++ + /* + * Process a futex-list entry, check whether it's owned by the + * dying task, and do notification if so: + */ +-static int handle_futex_death(u32 __user *uaddr, struct task_struct *curr, int pi) ++static int handle_futex_death(u32 __user *uaddr, struct task_struct *curr, ++ bool pi, bool pending_op) + { + u32 uval, uninitialized_var(nval), mval; + int err; +@@ -3471,6 +3476,42 @@ retry: + if (get_user(uval, uaddr)) + return -1; + ++ /* ++ * Special case for regular (non PI) futexes. The unlock path in ++ * user space has two race scenarios: ++ * ++ * 1. The unlock path releases the user space futex value and ++ * before it can execute the futex() syscall to wake up ++ * waiters it is killed. ++ * ++ * 2. A woken up waiter is killed before it can acquire the ++ * futex in user space. ++ * ++ * In both cases the TID validation below prevents a wakeup of ++ * potential waiters which can cause these waiters to block ++ * forever. 
++ * ++ * In both cases the following conditions are met: ++ * ++ * 1) task->robust_list->list_op_pending != NULL ++ * @pending_op == true ++ * 2) User space futex value == 0 ++ * 3) Regular futex: @pi == false ++ * ++ * If these conditions are met, it is safe to attempt waking up a ++ * potential waiter without touching the user space futex value and ++ * trying to set the OWNER_DIED bit. The user space futex value is ++ * uncontended and the rest of the user space mutex state is ++ * consistent, so a woken waiter will just take over the ++ * uncontended futex. Setting the OWNER_DIED bit would create ++ * inconsistent state and malfunction of the user space owner died ++ * handling. ++ */ ++ if (pending_op && !pi && !uval) { ++ futex_wake(uaddr, 1, 1, FUTEX_BITSET_MATCH_ANY); ++ return 0; ++ } ++ + if ((uval & FUTEX_TID_MASK) != task_pid_vnr(curr)) + return 0; + +@@ -3590,10 +3631,11 @@ void exit_robust_list(struct task_struct + * A pending lock might already be on the list, so + * don't process it twice: + */ +- if (entry != pending) ++ if (entry != pending) { + if (handle_futex_death((void __user *)entry + futex_offset, +- curr, pi)) ++ curr, pi, HANDLE_DEATH_LIST)) + return; ++ } + if (rc) + return; + entry = next_entry; +@@ -3607,9 +3649,10 @@ void exit_robust_list(struct task_struct + cond_resched(); + } + +- if (pending) ++ if (pending) { + handle_futex_death((void __user *)pending + futex_offset, +- curr, pip); ++ curr, pip, HANDLE_DEATH_PENDING); ++ } + } + + long do_futex(u32 __user *uaddr, int op, u32 val, ktime_t *timeout, +@@ -3786,7 +3829,8 @@ void compat_exit_robust_list(struct task + if (entry != pending) { + void __user *uaddr = futex_uaddr(entry, futex_offset); + +- if (handle_futex_death(uaddr, curr, pi)) ++ if (handle_futex_death(uaddr, curr, pi, ++ HANDLE_DEATH_LIST)) + return; + } + if (rc) +@@ -3805,7 +3849,7 @@ void compat_exit_robust_list(struct task + if (pending) { + void __user *uaddr = futex_uaddr(pending, futex_offset); + +- 
handle_futex_death(uaddr, curr, pip); ++ handle_futex_death(uaddr, curr, pip, HANDLE_DEATH_PENDING); + } + } + diff --git a/queue-5.3/gve-fix-dma-sync-bug-where-not-all-pages-synced.patch b/queue-5.3/gve-fix-dma-sync-bug-where-not-all-pages-synced.patch new file mode 100644 index 00000000000..bd78a6446db --- /dev/null +++ b/queue-5.3/gve-fix-dma-sync-bug-where-not-all-pages-synced.patch @@ -0,0 +1,43 @@ +From db96c2cb4870173ea9b08df130f1d1cc9b5dd53d Mon Sep 17 00:00:00 2001 +From: Adi Suresh +Date: Tue, 19 Nov 2019 08:02:47 -0800 +Subject: gve: fix dma sync bug where not all pages synced + +From: Adi Suresh + +commit db96c2cb4870173ea9b08df130f1d1cc9b5dd53d upstream. + +The previous commit had a bug where the last page in the memory range +could not be synced. This change fixes the behavior so that all the +required pages are synced. + +Fixes: 9cfeeb576d49 ("gve: Fixes DMA synchronization") +Signed-off-by: Adi Suresh +Reviewed-by: Catherine Sullivan +Signed-off-by: David S. Miller +Signed-off-by: Greg Kroah-Hartman + +--- + drivers/net/ethernet/google/gve/gve_tx.c | 9 +++++---- + 1 file changed, 5 insertions(+), 4 deletions(-) + +--- a/drivers/net/ethernet/google/gve/gve_tx.c ++++ b/drivers/net/ethernet/google/gve/gve_tx.c +@@ -393,12 +393,13 @@ static void gve_tx_fill_seg_desc(union g + static void gve_dma_sync_for_device(struct device *dev, dma_addr_t *page_buses, + u64 iov_offset, u64 iov_len) + { ++ u64 last_page = (iov_offset + iov_len - 1) / PAGE_SIZE; ++ u64 first_page = iov_offset / PAGE_SIZE; + dma_addr_t dma; +- u64 addr; ++ u64 page; + +- for (addr = iov_offset; addr < iov_offset + iov_len; +- addr += PAGE_SIZE) { +- dma = page_buses[addr / PAGE_SIZE]; ++ for (page = first_page; page <= last_page; page++) { ++ dma = page_buses[page]; + dma_sync_single_for_device(dev, dma, PAGE_SIZE, DMA_TO_DEVICE); + } + } diff --git a/queue-5.3/nbd-prevent-memory-leak.patch b/queue-5.3/nbd-prevent-memory-leak.patch new file mode 100644 index 00000000000..e697b0b0e29 --- 
/dev/null +++ b/queue-5.3/nbd-prevent-memory-leak.patch @@ -0,0 +1,42 @@ +From 03bf73c315edca28f47451913177e14cd040a216 Mon Sep 17 00:00:00 2001 +From: Navid Emamdoost +Date: Mon, 23 Sep 2019 15:09:58 -0500 +Subject: nbd: prevent memory leak + +From: Navid Emamdoost + +commit 03bf73c315edca28f47451913177e14cd040a216 upstream. + +In nbd_add_socket when krealloc succeeds, if nsock's allocation fails, the +reallocated memory is leaked. The correct behaviour is to assign the +reallocated memory to config->socks right after success. + +Reviewed-by: Josef Bacik +Signed-off-by: Navid Emamdoost +Signed-off-by: Jens Axboe +Signed-off-by: Greg Kroah-Hartman + +--- + drivers/block/nbd.c | 5 +++-- + 1 file changed, 3 insertions(+), 2 deletions(-) + +--- a/drivers/block/nbd.c ++++ b/drivers/block/nbd.c +@@ -995,14 +995,15 @@ static int nbd_add_socket(struct nbd_dev + sockfd_put(sock); + return -ENOMEM; + } ++ ++ config->socks = socks; ++ + nsock = kzalloc(sizeof(struct nbd_sock), GFP_KERNEL); + if (!nsock) { + sockfd_put(sock); + return -ENOMEM; + } + +- config->socks = socks; +- + nsock->fallback_index = -1; + nsock->dead = false; + mutex_init(&nsock->tx_lock); diff --git a/queue-5.3/net-sysfs-fix-reference-count-leak-in-rx-netdev_queue_add_kobject.patch b/queue-5.3/net-sysfs-fix-reference-count-leak-in-rx-netdev_queue_add_kobject.patch new file mode 100644 index 00000000000..18da03c6df8 --- /dev/null +++ b/queue-5.3/net-sysfs-fix-reference-count-leak-in-rx-netdev_queue_add_kobject.patch @@ -0,0 +1,106 @@ +From b8eb718348b8fb30b5a7d0a8fce26fb3f4ac741b Mon Sep 17 00:00:00 2001 +From: Jouni Hogander +Date: Wed, 20 Nov 2019 09:08:16 +0200 +Subject: net-sysfs: Fix reference count leak in rx|netdev_queue_add_kobject + +From: Jouni Hogander + +commit b8eb718348b8fb30b5a7d0a8fce26fb3f4ac741b upstream. + +kobject_init_and_add takes a reference even when it fails. This has +to be given up by the caller in error handling. 
Otherwise memory +allocated by kobject_init_and_add is never freed. Originally found +by Syzkaller: + +BUG: memory leak +unreferenced object 0xffff8880679f8b08 (size 8): + comm "netdev_register", pid 269, jiffies 4294693094 (age 12.132s) + hex dump (first 8 bytes): + 72 78 2d 30 00 36 20 d4 rx-0.6 . + backtrace: + [<000000008c93818e>] __kmalloc_track_caller+0x16e/0x290 + [<000000001f2e4e49>] kvasprintf+0xb1/0x140 + [<000000007f313394>] kvasprintf_const+0x56/0x160 + [<00000000aeca11c8>] kobject_set_name_vargs+0x5b/0x140 + [<0000000073a0367c>] kobject_init_and_add+0xd8/0x170 + [<0000000088838e4b>] net_rx_queue_update_kobjects+0x152/0x560 + [<000000006be5f104>] netdev_register_kobject+0x210/0x380 + [<00000000e31dab9d>] register_netdevice+0xa1b/0xf00 + [<00000000f68b2465>] __tun_chr_ioctl+0x20d5/0x3dd0 + [<000000004c50599f>] tun_chr_ioctl+0x2f/0x40 + [<00000000bbd4c317>] do_vfs_ioctl+0x1c7/0x1510 + [<00000000d4c59e8f>] ksys_ioctl+0x99/0xb0 + [<00000000946aea81>] __x64_sys_ioctl+0x78/0xb0 + [<0000000038d946e5>] do_syscall_64+0x16f/0x580 + [<00000000e0aa5d8f>] entry_SYSCALL_64_after_hwframe+0x44/0xa9 + [<00000000285b3d1a>] 0xffffffffffffffff + +Cc: David Miller +Cc: Lukas Bulwahn +Signed-off-by: Jouni Hogander +Signed-off-by: David S. 
Miller +Signed-off-by: Greg Kroah-Hartman + +--- + net/core/net-sysfs.c | 24 +++++++++++++----------- + 1 file changed, 13 insertions(+), 11 deletions(-) + +--- a/net/core/net-sysfs.c ++++ b/net/core/net-sysfs.c +@@ -923,21 +923,23 @@ static int rx_queue_add_kobject(struct n + error = kobject_init_and_add(kobj, &rx_queue_ktype, NULL, + "rx-%u", index); + if (error) +- return error; ++ goto err; + + dev_hold(queue->dev); + + if (dev->sysfs_rx_queue_group) { + error = sysfs_create_group(kobj, dev->sysfs_rx_queue_group); +- if (error) { +- kobject_put(kobj); +- return error; +- } ++ if (error) ++ goto err; + } + + kobject_uevent(kobj, KOBJ_ADD); + + return error; ++ ++err: ++ kobject_put(kobj); ++ return error; + } + #endif /* CONFIG_SYSFS */ + +@@ -1461,21 +1463,21 @@ static int netdev_queue_add_kobject(stru + error = kobject_init_and_add(kobj, &netdev_queue_ktype, NULL, + "tx-%u", index); + if (error) +- return error; ++ goto err; + + dev_hold(queue->dev); + + #ifdef CONFIG_BQL + error = sysfs_create_group(kobj, &dql_group); +- if (error) { +- kobject_put(kobj); +- return error; +- } ++ if (error) ++ goto err; + #endif + + kobject_uevent(kobj, KOBJ_ADD); + +- return 0; ++err: ++ kobject_put(kobj); ++ return error; + } + #endif /* CONFIG_SYSFS */ + diff --git a/queue-5.3/nfc-port100-handle-command-failure-cleanly.patch b/queue-5.3/nfc-port100-handle-command-failure-cleanly.patch new file mode 100644 index 00000000000..4c1aeb27c5e --- /dev/null +++ b/queue-5.3/nfc-port100-handle-command-failure-cleanly.patch @@ -0,0 +1,34 @@ +From 5f9f0b11f0816b35867f2cf71e54d95f53f03902 Mon Sep 17 00:00:00 2001 +From: Oliver Neukum +Date: Thu, 21 Nov 2019 11:37:10 +0100 +Subject: nfc: port100: handle command failure cleanly + +From: Oliver Neukum + +commit 5f9f0b11f0816b35867f2cf71e54d95f53f03902 upstream. + +If starting the transfer of a command succeeds but the transfer for the reply +fails, it is not enough to initiate killing the transfer for the +command may still be running. 
You need to wait for the killing to finish +before you can reuse URB and buffer. + +Reported-and-tested-by: syzbot+711468aa5c3a1eabf863@syzkaller.appspotmail.com +Signed-off-by: Oliver Neukum +Signed-off-by: David S. Miller +Signed-off-by: Greg Kroah-Hartman + +--- + drivers/nfc/port100.c | 2 +- + 1 file changed, 1 insertion(+), 1 deletion(-) + +--- a/drivers/nfc/port100.c ++++ b/drivers/nfc/port100.c +@@ -783,7 +783,7 @@ static int port100_send_frame_async(stru + + rc = port100_submit_urb_for_ack(dev, GFP_KERNEL); + if (rc) +- usb_unlink_urb(dev->out_urb); ++ usb_kill_urb(dev->out_urb); + + exit: + mutex_unlock(&dev->out_urb_lock); diff --git a/queue-5.3/selftests-x86-mov_ss_trap-fix-the-sysenter-test.patch b/queue-5.3/selftests-x86-mov_ss_trap-fix-the-sysenter-test.patch new file mode 100644 index 00000000000..ea812eaaf80 --- /dev/null +++ b/queue-5.3/selftests-x86-mov_ss_trap-fix-the-sysenter-test.patch @@ -0,0 +1,40 @@ +From 8caa016bfc129f2c925d52da43022171d1d1de91 Mon Sep 17 00:00:00 2001 +From: Andy Lutomirski +Date: Wed, 20 Nov 2019 12:59:13 -0800 +Subject: selftests/x86/mov_ss_trap: Fix the SYSENTER test + +From: Andy Lutomirski + +commit 8caa016bfc129f2c925d52da43022171d1d1de91 upstream. + +For reasons that I haven't quite fully diagnosed, running +mov_ss_trap_32 on a 32-bit kernel results in an infinite loop in +userspace. This appears to be because the hacky SYSENTER test +doesn't segfault as desired; instead it corrupts the program state +such that it infinite loops. + +Fix it by explicitly clearing EBP before doing SYSENTER. This will +give a more reliable segfault. 
+ +Fixes: 59c2a7226fc5 ("x86/selftests: Add mov_to_ss test") +Signed-off-by: Andy Lutomirski +Signed-off-by: Peter Zijlstra (Intel) +Cc: stable@kernel.org +Signed-off-by: Greg Kroah-Hartman + +--- + tools/testing/selftests/x86/mov_ss_trap.c | 3 ++- + 1 file changed, 2 insertions(+), 1 deletion(-) + +--- a/tools/testing/selftests/x86/mov_ss_trap.c ++++ b/tools/testing/selftests/x86/mov_ss_trap.c +@@ -257,7 +257,8 @@ int main() + err(1, "sigaltstack"); + sethandler(SIGSEGV, handle_and_longjmp, SA_RESETHAND | SA_ONSTACK); + nr = SYS_getpid; +- asm volatile ("mov %[ss], %%ss; SYSENTER" : "+a" (nr) ++ /* Clear EBP first to make sure we segfault cleanly. */ ++ asm volatile ("xorl %%ebp, %%ebp; mov %[ss], %%ss; SYSENTER" : "+a" (nr) + : [ss] "m" (ss) : "flags", "rcx" + #ifdef __x86_64__ + , "r11" diff --git a/queue-5.3/selftests-x86-sigreturn-32-invalidate-ds-and-es-when-abusing-the-kernel.patch b/queue-5.3/selftests-x86-sigreturn-32-invalidate-ds-and-es-when-abusing-the-kernel.patch new file mode 100644 index 00000000000..5d4b6d0c8f8 --- /dev/null +++ b/queue-5.3/selftests-x86-sigreturn-32-invalidate-ds-and-es-when-abusing-the-kernel.patch @@ -0,0 +1,46 @@ +From 4d2fa82d98d2d296043a04eb517d7dbade5b13b8 Mon Sep 17 00:00:00 2001 +From: Andy Lutomirski +Date: Wed, 20 Nov 2019 11:58:32 -0800 +Subject: selftests/x86/sigreturn/32: Invalidate DS and ES when abusing the kernel + +From: Andy Lutomirski + +commit 4d2fa82d98d2d296043a04eb517d7dbade5b13b8 upstream. + +If the kernel accidentally uses DS or ES while the user values are +loaded, it will work fine for sane userspace. In the interest of +simulating maximally insane userspace, make sigreturn_32 zero out DS +and ES for the nasty parts so that inadvertent use of these segments +will crash. 
+ +Signed-off-by: Andy Lutomirski +Signed-off-by: Peter Zijlstra (Intel) +Cc: stable@kernel.org +Signed-off-by: Greg Kroah-Hartman + +--- + tools/testing/selftests/x86/sigreturn.c | 13 +++++++++++++ + 1 file changed, 13 insertions(+) + +--- a/tools/testing/selftests/x86/sigreturn.c ++++ b/tools/testing/selftests/x86/sigreturn.c +@@ -451,6 +451,19 @@ static void sigusr1(int sig, siginfo_t * + ctx->uc_mcontext.gregs[REG_SP] = (unsigned long)0x8badf00d5aadc0deULL; + ctx->uc_mcontext.gregs[REG_CX] = 0; + ++#ifdef __i386__ ++ /* ++ * Make sure the kernel doesn't inadvertently use DS or ES-relative ++ * accesses in a region where user DS or ES is loaded. ++ * ++ * Skip this for 64-bit builds because long mode doesn't care about ++ * DS and ES and skipping it increases test coverage a little bit, ++ * since 64-bit kernels can still run the 32-bit build. ++ */ ++ ctx->uc_mcontext.gregs[REG_DS] = 0; ++ ctx->uc_mcontext.gregs[REG_ES] = 0; ++#endif ++ + memcpy(&requested_regs, &ctx->uc_mcontext.gregs, sizeof(gregset_t)); + requested_regs[REG_CX] = *ssptr(ctx); /* The asm code does this. 
*/ + diff --git a/queue-5.3/series b/queue-5.3/series index 124655a6e7a..92dde8db3a3 100644 --- a/queue-5.3/series +++ b/queue-5.3/series @@ -44,3 +44,27 @@ md-raid10-prevent-access-of-uninitialized-resync_pages-offset.patch mdio_bus-fix-init-if-config_reset_controller-n.patch arm-8904-1-skip-nomap-memblocks-while-finding-the-lowmem-highmem-boundary.patch x86-insn-fix-awk-regexp-warnings.patch +x86-speculation-fix-incorrect-mds-taa-mitigation-status.patch +x86-speculation-fix-redundant-mds-mitigation-message.patch +nbd-prevent-memory-leak.patch +gve-fix-dma-sync-bug-where-not-all-pages-synced.patch +x86-stackframe-32-repair-32-bit-xen-pv.patch +x86-xen-32-make-xen_iret_crit_fixup-independent-of-frame-layout.patch +x86-xen-32-simplify-ring-check-in-xen_iret_crit_fixup.patch +x86-doublefault-32-fix-stack-canaries-in-the-double-fault-handler.patch +x86-pti-32-size-initial_page_table-correctly.patch +x86-cpu_entry_area-add-guard-page-for-entry-stack-on-32bit.patch +x86-entry-32-fix-iret-exception.patch +x86-entry-32-use-ss-segment-where-required.patch +x86-entry-32-move-fixup_frame-after-pushing-fs-in-save_all.patch +x86-entry-32-unwind-the-espfix-stack-earlier-on-exception-entry.patch +x86-entry-32-fix-nmi-vs-espfix.patch +selftests-x86-mov_ss_trap-fix-the-sysenter-test.patch +selftests-x86-sigreturn-32-invalidate-ds-and-es-when-abusing-the-kernel.patch +x86-pti-32-calculate-the-various-pti-cpu_entry_area-sizes-correctly-make-the-cpu_entry_area_pages-assert-precise.patch +x86-entry-32-fix-fixup_espfix_stack-with-user-cr3.patch +futex-prevent-robust-futex-exit-race.patch +futex-prevent-exit-livelock.patch +alsa-usb-audio-fix-null-dereference-at-parsing-badd.patch +nfc-port100-handle-command-failure-cleanly.patch +net-sysfs-fix-reference-count-leak-in-rx-netdev_queue_add_kobject.patch diff --git a/queue-5.3/x86-cpu_entry_area-add-guard-page-for-entry-stack-on-32bit.patch b/queue-5.3/x86-cpu_entry_area-add-guard-page-for-entry-stack-on-32bit.patch new file mode 100644 
index 00000000000..4d41ec8c352 --- /dev/null +++ b/queue-5.3/x86-cpu_entry_area-add-guard-page-for-entry-stack-on-32bit.patch @@ -0,0 +1,41 @@ +From 880a98c339961eaa074393e3a2117cbe9125b8bb Mon Sep 17 00:00:00 2001 +From: Thomas Gleixner +Date: Thu, 21 Nov 2019 00:40:24 +0100 +Subject: x86/cpu_entry_area: Add guard page for entry stack on 32bit + +From: Thomas Gleixner + +commit 880a98c339961eaa074393e3a2117cbe9125b8bb upstream. + +The entry stack in the cpu entry area is protected against overflow by the +readonly GDT on 64-bit, but on 32-bit the GDT needs to be writeable and +therefore does not trigger a fault on stack overflow. + +Add a guard page. + +Fixes: c482feefe1ae ("x86/entry/64: Make cpu_entry_area.tss read-only") +Signed-off-by: Thomas Gleixner +Signed-off-by: Peter Zijlstra (Intel) +Cc: stable@kernel.org +Signed-off-by: Greg Kroah-Hartman + +--- + arch/x86/include/asm/cpu_entry_area.h | 6 +++++- + 1 file changed, 5 insertions(+), 1 deletion(-) + +--- a/arch/x86/include/asm/cpu_entry_area.h ++++ b/arch/x86/include/asm/cpu_entry_area.h +@@ -78,8 +78,12 @@ struct cpu_entry_area { + + /* + * The GDT is just below entry_stack and thus serves (on x86_64) as +- * a a read-only guard page. ++ * a read-only guard page. On 32-bit the GDT must be writeable, so ++ * it needs an extra guard page. 
+ */ ++#ifdef CONFIG_X86_32 ++ char guard_entry_stack[PAGE_SIZE]; ++#endif + struct entry_stack_page entry_stack_page; + + /* diff --git a/queue-5.3/x86-doublefault-32-fix-stack-canaries-in-the-double-fault-handler.patch b/queue-5.3/x86-doublefault-32-fix-stack-canaries-in-the-double-fault-handler.patch new file mode 100644 index 00000000000..b5f1d1fa894 --- /dev/null +++ b/queue-5.3/x86-doublefault-32-fix-stack-canaries-in-the-double-fault-handler.patch @@ -0,0 +1,33 @@ +From 3580d0b29cab08483f84a16ce6a1151a1013695f Mon Sep 17 00:00:00 2001 +From: Andy Lutomirski +Date: Thu, 21 Nov 2019 11:50:12 +0100 +Subject: x86/doublefault/32: Fix stack canaries in the double fault handler + +From: Andy Lutomirski + +commit 3580d0b29cab08483f84a16ce6a1151a1013695f upstream. + +The double fault TSS was missing GS setup, which is needed for stack +canaries to work. + +Signed-off-by: Andy Lutomirski +Signed-off-by: Peter Zijlstra (Intel) +Cc: stable@kernel.org +Signed-off-by: Greg Kroah-Hartman + +--- + arch/x86/kernel/doublefault.c | 3 +++ + 1 file changed, 3 insertions(+) + +--- a/arch/x86/kernel/doublefault.c ++++ b/arch/x86/kernel/doublefault.c +@@ -65,6 +65,9 @@ struct x86_hw_tss doublefault_tss __cach + .ss = __KERNEL_DS, + .ds = __USER_DS, + .fs = __KERNEL_PERCPU, ++#ifndef CONFIG_X86_32_LAZY_GS ++ .gs = __KERNEL_STACK_CANARY, ++#endif + + .__cr3 = __pa_nodebug(swapper_pg_dir), + }; diff --git a/queue-5.3/x86-entry-32-fix-fixup_espfix_stack-with-user-cr3.patch b/queue-5.3/x86-entry-32-fix-fixup_espfix_stack-with-user-cr3.patch new file mode 100644 index 00000000000..c1827fa2924 --- /dev/null +++ b/queue-5.3/x86-entry-32-fix-fixup_espfix_stack-with-user-cr3.patch @@ -0,0 +1,68 @@ +From 4a13b0e3e10996b9aa0b45a764ecfe49f6fcd360 Mon Sep 17 00:00:00 2001 +From: Andy Lutomirski +Date: Sun, 24 Nov 2019 08:50:03 -0800 +Subject: x86/entry/32: Fix FIXUP_ESPFIX_STACK with user CR3 + +From: Andy Lutomirski + +commit 4a13b0e3e10996b9aa0b45a764ecfe49f6fcd360 upstream. 
+ +UNWIND_ESPFIX_STACK needs to read the GDT, and the GDT mapping that +can be accessed via %fs is not mapped in the user pagetables. Use +SGDT to find the cpu_entry_area mapping and read the espfix offset +from that instead. + +Reported-and-tested-by: Borislav Petkov +Signed-off-by: Andy Lutomirski +Cc: Peter Zijlstra +Cc: Thomas Gleixner +Cc: Linus Torvalds +Cc: +Signed-off-by: Ingo Molnar +Signed-off-by: Greg Kroah-Hartman + +--- + arch/x86/entry/entry_32.S | 21 ++++++++++++++++++--- + 1 file changed, 18 insertions(+), 3 deletions(-) + +--- a/arch/x86/entry/entry_32.S ++++ b/arch/x86/entry/entry_32.S +@@ -415,7 +415,8 @@ + + .macro CHECK_AND_APPLY_ESPFIX + #ifdef CONFIG_X86_ESPFIX32 +-#define GDT_ESPFIX_SS PER_CPU_VAR(gdt_page) + (GDT_ENTRY_ESPFIX_SS * 8) ++#define GDT_ESPFIX_OFFSET (GDT_ENTRY_ESPFIX_SS * 8) ++#define GDT_ESPFIX_SS PER_CPU_VAR(gdt_page) + GDT_ESPFIX_OFFSET + + ALTERNATIVE "jmp .Lend_\@", "", X86_BUG_ESPFIX + +@@ -1147,12 +1148,26 @@ ENDPROC(entry_INT80_32) + * We can't call C functions using the ESPFIX stack. This code reads + * the high word of the segment base from the GDT and swiches to the + * normal stack and adjusts ESP with the matching offset. ++ * ++ * We might be on user CR3 here, so percpu data is not mapped and we can't ++ * access the GDT through the percpu segment. Instead, use SGDT to find ++ * the cpu_entry_area alias of the GDT. + */ + #ifdef CONFIG_X86_ESPFIX32 + /* fixup the stack */ +- mov GDT_ESPFIX_SS + 4, %al /* bits 16..23 */ +- mov GDT_ESPFIX_SS + 7, %ah /* bits 24..31 */ ++ pushl %ecx ++ subl $2*4, %esp ++ sgdt (%esp) ++ movl 2(%esp), %ecx /* GDT address */ ++ /* ++ * Careful: ECX is a linear pointer, so we need to force base ++ * zero. %cs is the only known-linear segment we have right now. 
++ */ ++ mov %cs:GDT_ESPFIX_OFFSET + 4(%ecx), %al /* bits 16..23 */ ++ mov %cs:GDT_ESPFIX_OFFSET + 7(%ecx), %ah /* bits 24..31 */ + shl $16, %eax ++ addl $2*4, %esp ++ popl %ecx + addl %esp, %eax /* the adjusted stack pointer */ + pushl $__KERNEL_DS + pushl %eax diff --git a/queue-5.3/x86-entry-32-fix-iret-exception.patch b/queue-5.3/x86-entry-32-fix-iret-exception.patch new file mode 100644 index 00000000000..7e77c37ab57 --- /dev/null +++ b/queue-5.3/x86-entry-32-fix-iret-exception.patch @@ -0,0 +1,45 @@ +From 40ad2199580e248dce2a2ebb722854180c334b9e Mon Sep 17 00:00:00 2001 +From: Peter Zijlstra +Date: Wed, 20 Nov 2019 13:05:06 +0100 +Subject: x86/entry/32: Fix IRET exception + +From: Peter Zijlstra + +commit 40ad2199580e248dce2a2ebb722854180c334b9e upstream. + +As reported by Lai, the commit 3c88c692c287 ("x86/stackframe/32: +Provide consistent pt_regs") wrecked the IRET EXTABLE entry by making +.Lirq_return not point at IRET. + +Fix this by placing IRET_FRAME in RESTORE_REGS, to mirror how +FIXUP_FRAME is part of SAVE_ALL. 
+ +Fixes: 3c88c692c287 ("x86/stackframe/32: Provide consistent pt_regs") +Reported-by: Lai Jiangshan +Signed-off-by: Peter Zijlstra (Intel) +Acked-by: Andy Lutomirski +Cc: stable@kernel.org +Signed-off-by: Greg Kroah-Hartman + +--- + arch/x86/entry/entry_32.S | 2 +- + 1 file changed, 1 insertion(+), 1 deletion(-) + +--- a/arch/x86/entry/entry_32.S ++++ b/arch/x86/entry/entry_32.S +@@ -357,6 +357,7 @@ + 2: popl %es + 3: popl %fs + POP_GS \pop ++ IRET_FRAME + .pushsection .fixup, "ax" + 4: movl $0, (%esp) + jmp 1b +@@ -1075,7 +1076,6 @@ restore_all: + /* Restore user state */ + RESTORE_REGS pop=4 # skip orig_eax/error_code + .Lirq_return: +- IRET_FRAME + /* + * ARCH_HAS_MEMBARRIER_SYNC_CORE rely on IRET core serialization + * when returning from IPI handler and when returning from diff --git a/queue-5.3/x86-entry-32-fix-nmi-vs-espfix.patch b/queue-5.3/x86-entry-32-fix-nmi-vs-espfix.patch new file mode 100644 index 00000000000..94832d6e85b --- /dev/null +++ b/queue-5.3/x86-entry-32-fix-nmi-vs-espfix.patch @@ -0,0 +1,126 @@ +From 895429076512e9d1cf5428181076299c90713159 Mon Sep 17 00:00:00 2001 +From: Peter Zijlstra +Date: Wed, 20 Nov 2019 15:02:26 +0100 +Subject: x86/entry/32: Fix NMI vs ESPFIX + +From: Peter Zijlstra + +commit 895429076512e9d1cf5428181076299c90713159 upstream. + +When the NMI lands on an ESPFIX_SS, we are on the entry stack and must +swizzle, otherwise we'll run do_nmi() on the entry stack, which is +BAD. + +Also, similar to the normal exception path, we need to correct the +ESPFIX magic before leaving the entry stack, otherwise pt_regs will +present a non-flat stack pointer. + +Tested by running sigreturn_32 concurrent with perf-record. 
+ +Fixes: e5862d0515ad ("x86/entry/32: Leave the kernel via trampoline stack") +Signed-off-by: Peter Zijlstra (Intel) +Acked-by: Andy Lutomirski +Cc: stable@kernel.org +Signed-off-by: Greg Kroah-Hartman + +--- + arch/x86/entry/entry_32.S | 53 +++++++++++++++++++++++++++++++++++----------- + 1 file changed, 41 insertions(+), 12 deletions(-) + +--- a/arch/x86/entry/entry_32.S ++++ b/arch/x86/entry/entry_32.S +@@ -205,6 +205,7 @@ + #define CS_FROM_ENTRY_STACK (1 << 31) + #define CS_FROM_USER_CR3 (1 << 30) + #define CS_FROM_KERNEL (1 << 29) ++#define CS_FROM_ESPFIX (1 << 28) + + .macro FIXUP_FRAME + /* +@@ -342,8 +343,8 @@ + .endif + .endm + +-.macro SAVE_ALL_NMI cr3_reg:req +- SAVE_ALL ++.macro SAVE_ALL_NMI cr3_reg:req unwind_espfix=0 ++ SAVE_ALL unwind_espfix=\unwind_espfix + + BUG_IF_WRONG_CR3 + +@@ -1526,6 +1527,10 @@ ENTRY(nmi) + ASM_CLAC + + #ifdef CONFIG_X86_ESPFIX32 ++ /* ++ * ESPFIX_SS is only ever set on the return to user path ++ * after we've switched to the entry stack. ++ */ + pushl %eax + movl %ss, %eax + cmpw $__ESPFIX_SS, %ax +@@ -1561,6 +1566,11 @@ ENTRY(nmi) + movl %ebx, %esp + + .Lnmi_return: ++#ifdef CONFIG_X86_ESPFIX32 ++ testl $CS_FROM_ESPFIX, PT_CS(%esp) ++ jnz .Lnmi_from_espfix ++#endif ++ + CHECK_AND_APPLY_ESPFIX + RESTORE_ALL_NMI cr3_reg=%edi pop=4 + jmp .Lirq_return +@@ -1568,23 +1578,42 @@ ENTRY(nmi) + #ifdef CONFIG_X86_ESPFIX32 + .Lnmi_espfix_stack: + /* +- * create the pointer to lss back ++ * Create the pointer to LSS back + */ + pushl %ss + pushl %esp + addl $4, (%esp) +- /* copy the iret frame of 12 bytes */ +- .rept 3 +- pushl 16(%esp) +- .endr +- pushl %eax +- SAVE_ALL_NMI cr3_reg=%edi ++ ++ /* Copy the (short) IRET frame */ ++ pushl 4*4(%esp) # flags ++ pushl 4*4(%esp) # cs ++ pushl 4*4(%esp) # ip ++ ++ pushl %eax # orig_ax ++ ++ SAVE_ALL_NMI cr3_reg=%edi unwind_espfix=1 + ENCODE_FRAME_POINTER +- FIXUP_ESPFIX_STACK # %eax == %esp ++ ++ /* clear CS_FROM_KERNEL, set CS_FROM_ESPFIX */ ++ xorl $(CS_FROM_ESPFIX | CS_FROM_KERNEL), 
PT_CS(%esp) ++ + xorl %edx, %edx # zero error code +- call do_nmi ++ movl %esp, %eax # pt_regs pointer ++ jmp .Lnmi_from_sysenter_stack ++ ++.Lnmi_from_espfix: + RESTORE_ALL_NMI cr3_reg=%edi +- lss 12+4(%esp), %esp # back to espfix stack ++ /* ++ * Because we cleared CS_FROM_KERNEL, IRET_FRAME 'forgot' to ++ * fix up the gap and long frame: ++ * ++ * 3 - original frame (exception) ++ * 2 - ESPFIX block (above) ++ * 6 - gap (FIXUP_FRAME) ++ * 5 - long frame (FIXUP_FRAME) ++ * 1 - orig_ax ++ */ ++ lss (1+5+6)*4(%esp), %esp # back to espfix stack + jmp .Lirq_return + #endif + END(nmi) diff --git a/queue-5.3/x86-entry-32-move-fixup_frame-after-pushing-fs-in-save_all.patch b/queue-5.3/x86-entry-32-move-fixup_frame-after-pushing-fs-in-save_all.patch new file mode 100644 index 00000000000..c488731c275 --- /dev/null +++ b/queue-5.3/x86-entry-32-move-fixup_frame-after-pushing-fs-in-save_all.patch @@ -0,0 +1,122 @@ +From 82cb8a0b1d8d07817b5d59f7fa1438e1fceafab2 Mon Sep 17 00:00:00 2001 +From: Andy Lutomirski +Date: Wed, 20 Nov 2019 09:56:36 +0100 +Subject: x86/entry/32: Move FIXUP_FRAME after pushing %fs in SAVE_ALL + +From: Andy Lutomirski + +commit 82cb8a0b1d8d07817b5d59f7fa1438e1fceafab2 upstream. + +This will allow us to get percpu access working before FIXUP_FRAME, +which will allow us to unwind ESPFIX earlier. + +Signed-off-by: Andy Lutomirski +Signed-off-by: Peter Zijlstra (Intel) +Cc: stable@kernel.org +Signed-off-by: Greg Kroah-Hartman + +--- + arch/x86/entry/entry_32.S | 66 ++++++++++++++++++++++++---------------------- + 1 file changed, 35 insertions(+), 31 deletions(-) + +--- a/arch/x86/entry/entry_32.S ++++ b/arch/x86/entry/entry_32.S +@@ -213,54 +213,58 @@ + * + * Be careful: we may have nonzero SS base due to ESPFIX. 
+ */ +- andl $0x0000ffff, 3*4(%esp) ++ andl $0x0000ffff, 4*4(%esp) + + #ifdef CONFIG_VM86 +- testl $X86_EFLAGS_VM, 4*4(%esp) ++ testl $X86_EFLAGS_VM, 5*4(%esp) + jnz .Lfrom_usermode_no_fixup_\@ + #endif +- testl $USER_SEGMENT_RPL_MASK, 3*4(%esp) ++ testl $USER_SEGMENT_RPL_MASK, 4*4(%esp) + jnz .Lfrom_usermode_no_fixup_\@ + +- orl $CS_FROM_KERNEL, 3*4(%esp) ++ orl $CS_FROM_KERNEL, 4*4(%esp) + + /* + * When we're here from kernel mode; the (exception) stack looks like: + * +- * 5*4(%esp) - +- * 4*4(%esp) - flags +- * 3*4(%esp) - cs +- * 2*4(%esp) - ip +- * 1*4(%esp) - orig_eax +- * 0*4(%esp) - gs / function ++ * 6*4(%esp) - ++ * 5*4(%esp) - flags ++ * 4*4(%esp) - cs ++ * 3*4(%esp) - ip ++ * 2*4(%esp) - orig_eax ++ * 1*4(%esp) - gs / function ++ * 0*4(%esp) - fs + * + * Lets build a 5 entry IRET frame after that, such that struct pt_regs + * is complete and in particular regs->sp is correct. This gives us +- * the original 5 enties as gap: ++ * the original 6 enties as gap: + * +- * 12*4(%esp) - +- * 11*4(%esp) - gap / flags +- * 10*4(%esp) - gap / cs +- * 9*4(%esp) - gap / ip +- * 8*4(%esp) - gap / orig_eax +- * 7*4(%esp) - gap / gs / function +- * 6*4(%esp) - ss +- * 5*4(%esp) - sp +- * 4*4(%esp) - flags +- * 3*4(%esp) - cs +- * 2*4(%esp) - ip +- * 1*4(%esp) - orig_eax +- * 0*4(%esp) - gs / function ++ * 14*4(%esp) - ++ * 13*4(%esp) - gap / flags ++ * 12*4(%esp) - gap / cs ++ * 11*4(%esp) - gap / ip ++ * 10*4(%esp) - gap / orig_eax ++ * 9*4(%esp) - gap / gs / function ++ * 8*4(%esp) - gap / fs ++ * 7*4(%esp) - ss ++ * 6*4(%esp) - sp ++ * 5*4(%esp) - flags ++ * 4*4(%esp) - cs ++ * 3*4(%esp) - ip ++ * 2*4(%esp) - orig_eax ++ * 1*4(%esp) - gs / function ++ * 0*4(%esp) - fs + */ + + pushl %ss # ss + pushl %esp # sp (points at ss) +- addl $6*4, (%esp) # point sp back at the previous context +- pushl 6*4(%esp) # flags +- pushl 6*4(%esp) # cs +- pushl 6*4(%esp) # ip +- pushl 6*4(%esp) # orig_eax +- pushl 6*4(%esp) # gs / function ++ addl $7*4, (%esp) # point sp back at the 
previous context ++ pushl 7*4(%esp) # flags ++ pushl 7*4(%esp) # cs ++ pushl 7*4(%esp) # ip ++ pushl 7*4(%esp) # orig_eax ++ pushl 7*4(%esp) # gs / function ++ pushl 7*4(%esp) # fs + .Lfrom_usermode_no_fixup_\@: + .endm + +@@ -308,8 +312,8 @@ + .if \skip_gs == 0 + PUSH_GS + .endif +- FIXUP_FRAME + pushl %fs ++ FIXUP_FRAME + pushl %es + pushl %ds + pushl \pt_regs_ax diff --git a/queue-5.3/x86-entry-32-unwind-the-espfix-stack-earlier-on-exception-entry.patch b/queue-5.3/x86-entry-32-unwind-the-espfix-stack-earlier-on-exception-entry.patch new file mode 100644 index 00000000000..192af024088 --- /dev/null +++ b/queue-5.3/x86-entry-32-unwind-the-espfix-stack-earlier-on-exception-entry.patch @@ -0,0 +1,117 @@ +From a1a338e5b6fe9e0a39c57c232dc96c198bb53e47 Mon Sep 17 00:00:00 2001 +From: Andy Lutomirski +Date: Wed, 20 Nov 2019 10:10:49 +0100 +Subject: x86/entry/32: Unwind the ESPFIX stack earlier on exception entry + +From: Andy Lutomirski + +commit a1a338e5b6fe9e0a39c57c232dc96c198bb53e47 upstream. + +Right now, we do some fancy parts of the exception entry path while SS +might have a nonzero base: we fill in regs->ss and regs->sp, and we +consider switching to the kernel stack. This results in regs->ss and +regs->sp referring to a non-flat stack and it may result in +overflowing the entry stack. The former issue means that we can try to +call iret_exc on a non-flat stack, which doesn't work. + +Tested with selftests/x86/sigreturn_32. + +Fixes: 45d7b255747c ("x86/entry/32: Enter the kernel via trampoline stack") +Signed-off-by: Andy Lutomirski +Signed-off-by: Peter Zijlstra (Intel) +Cc: stable@kernel.org +Signed-off-by: Greg Kroah-Hartman + +--- + arch/x86/entry/entry_32.S | 30 ++++++++++++++++-------------- + 1 file changed, 16 insertions(+), 14 deletions(-) + +--- a/arch/x86/entry/entry_32.S ++++ b/arch/x86/entry/entry_32.S +@@ -210,8 +210,6 @@ + /* + * The high bits of the CS dword (__csh) are used for CS_FROM_*. + * Clear them in case hardware didn't do this for us. 
+- * +- * Be careful: we may have nonzero SS base due to ESPFIX. + */ + andl $0x0000ffff, 4*4(%esp) + +@@ -307,12 +305,21 @@ + .Lfinished_frame_\@: + .endm + +-.macro SAVE_ALL pt_regs_ax=%eax switch_stacks=0 skip_gs=0 ++.macro SAVE_ALL pt_regs_ax=%eax switch_stacks=0 skip_gs=0 unwind_espfix=0 + cld + .if \skip_gs == 0 + PUSH_GS + .endif + pushl %fs ++ ++ pushl %eax ++ movl $(__KERNEL_PERCPU), %eax ++ movl %eax, %fs ++.if \unwind_espfix > 0 ++ UNWIND_ESPFIX_STACK ++.endif ++ popl %eax ++ + FIXUP_FRAME + pushl %es + pushl %ds +@@ -326,8 +333,6 @@ + movl $(__USER_DS), %edx + movl %edx, %ds + movl %edx, %es +- movl $(__KERNEL_PERCPU), %edx +- movl %edx, %fs + .if \skip_gs == 0 + SET_KERNEL_GS %edx + .endif +@@ -1153,18 +1158,17 @@ ENDPROC(entry_INT80_32) + lss (%esp), %esp /* switch to the normal stack segment */ + #endif + .endm ++ + .macro UNWIND_ESPFIX_STACK ++ /* It's safe to clobber %eax, all other regs need to be preserved */ + #ifdef CONFIG_X86_ESPFIX32 + movl %ss, %eax + /* see if on espfix stack */ + cmpw $__ESPFIX_SS, %ax +- jne 27f +- movl $__KERNEL_DS, %eax +- movl %eax, %ds +- movl %eax, %es ++ jne .Lno_fixup_\@ + /* switch to normal stack */ + FIXUP_ESPFIX_STACK +-27: ++.Lno_fixup_\@: + #endif + .endm + +@@ -1458,10 +1462,9 @@ END(page_fault) + + common_exception_read_cr2: + /* the function address is in %gs's slot on the stack */ +- SAVE_ALL switch_stacks=1 skip_gs=1 ++ SAVE_ALL switch_stacks=1 skip_gs=1 unwind_espfix=1 + + ENCODE_FRAME_POINTER +- UNWIND_ESPFIX_STACK + + /* fixup %gs */ + GS_TO_REG %ecx +@@ -1483,9 +1486,8 @@ END(common_exception_read_cr2) + + common_exception: + /* the function address is in %gs's slot on the stack */ +- SAVE_ALL switch_stacks=1 skip_gs=1 ++ SAVE_ALL switch_stacks=1 skip_gs=1 unwind_espfix=1 + ENCODE_FRAME_POINTER +- UNWIND_ESPFIX_STACK + + /* fixup %gs */ + GS_TO_REG %ecx diff --git a/queue-5.3/x86-entry-32-use-ss-segment-where-required.patch b/queue-5.3/x86-entry-32-use-ss-segment-where-required.patch new file mode 
100644 index 00000000000..b4fc43a826c --- /dev/null +++ b/queue-5.3/x86-entry-32-use-ss-segment-where-required.patch @@ -0,0 +1,75 @@ +From 4c4fd55d3d59a41ddfa6ecba7e76928921759f43 Mon Sep 17 00:00:00 2001 +From: Andy Lutomirski +Date: Wed, 20 Nov 2019 09:49:33 +0100 +Subject: x86/entry/32: Use %ss segment where required + +From: Andy Lutomirski + +commit 4c4fd55d3d59a41ddfa6ecba7e76928921759f43 upstream. + +When re-building the IRET frame we use %eax as an destination %esp, +make sure to then also match the segment for when there is a nonzero +SS base (ESPFIX). + +[peterz: Changelog and minor edits] +Fixes: 3c88c692c287 ("x86/stackframe/32: Provide consistent pt_regs") +Signed-off-by: Andy Lutomirski +Signed-off-by: Peter Zijlstra (Intel) +Cc: stable@kernel.org +Signed-off-by: Greg Kroah-Hartman + +--- + arch/x86/entry/entry_32.S | 19 ++++++++++++++----- + 1 file changed, 14 insertions(+), 5 deletions(-) + +--- a/arch/x86/entry/entry_32.S ++++ b/arch/x86/entry/entry_32.S +@@ -210,6 +210,8 @@ + /* + * The high bits of the CS dword (__csh) are used for CS_FROM_*. + * Clear them in case hardware didn't do this for us. ++ * ++ * Be careful: we may have nonzero SS base due to ESPFIX. + */ + andl $0x0000ffff, 3*4(%esp) + +@@ -263,6 +265,13 @@ + .endm + + .macro IRET_FRAME ++ /* ++ * We're called with %ds, %es, %fs, and %gs from the interrupted ++ * frame, so we shouldn't use them. Also, we may be in ESPFIX ++ * mode and therefore have a nonzero SS base and an offset ESP, ++ * so any attempt to access the stack needs to use SS. (except for ++ * accesses through %esp, which automatically use SS.) 
++ */ + testl $CS_FROM_KERNEL, 1*4(%esp) + jz .Lfinished_frame_\@ + +@@ -276,20 +285,20 @@ + movl 5*4(%esp), %eax # (modified) regs->sp + + movl 4*4(%esp), %ecx # flags +- movl %ecx, -4(%eax) ++ movl %ecx, %ss:-1*4(%eax) + + movl 3*4(%esp), %ecx # cs + andl $0x0000ffff, %ecx +- movl %ecx, -8(%eax) ++ movl %ecx, %ss:-2*4(%eax) + + movl 2*4(%esp), %ecx # ip +- movl %ecx, -12(%eax) ++ movl %ecx, %ss:-3*4(%eax) + + movl 1*4(%esp), %ecx # eax +- movl %ecx, -16(%eax) ++ movl %ecx, %ss:-4*4(%eax) + + popl %ecx +- lea -16(%eax), %esp ++ lea -4*4(%eax), %esp + popl %eax + .Lfinished_frame_\@: + .endm diff --git a/queue-5.3/x86-pti-32-calculate-the-various-pti-cpu_entry_area-sizes-correctly-make-the-cpu_entry_area_pages-assert-precise.patch b/queue-5.3/x86-pti-32-calculate-the-various-pti-cpu_entry_area-sizes-correctly-make-the-cpu_entry_area_pages-assert-precise.patch new file mode 100644 index 00000000000..e9c61efa937 --- /dev/null +++ b/queue-5.3/x86-pti-32-calculate-the-various-pti-cpu_entry_area-sizes-correctly-make-the-cpu_entry_area_pages-assert-precise.patch @@ -0,0 +1,199 @@ +From 05b042a1944322844eaae7ea596d5f154166d68a Mon Sep 17 00:00:00 2001 +From: Ingo Molnar +Date: Sun, 24 Nov 2019 11:21:44 +0100 +Subject: x86/pti/32: Calculate the various PTI cpu_entry_area sizes correctly, make the CPU_ENTRY_AREA_PAGES assert precise +MIME-Version: 1.0 +Content-Type: text/plain; charset=UTF-8 +Content-Transfer-Encoding: 8bit + +From: Ingo Molnar + +commit 05b042a1944322844eaae7ea596d5f154166d68a upstream. 
+ +When two recent commits that increased the size of the 'struct cpu_entry_area' +were merged in -tip, the 32-bit defconfig build started failing on the following +build time assert: + + ./include/linux/compiler.h:391:38: error: call to ‘__compiletime_assert_189’ declared with attribute error: BUILD_BUG_ON failed: CPU_ENTRY_AREA_PAGES * PAGE_SIZE < CPU_ENTRY_AREA_MAP_SIZE + arch/x86/mm/cpu_entry_area.c:189:2: note: in expansion of macro ‘BUILD_BUG_ON’ + In function ‘setup_cpu_entry_area_ptes’, + +Which corresponds to the following build time assert: + + BUILD_BUG_ON(CPU_ENTRY_AREA_PAGES * PAGE_SIZE < CPU_ENTRY_AREA_MAP_SIZE); + +The purpose of this assert is to sanity check the fixed-value definition of +CPU_ENTRY_AREA_PAGES arch/x86/include/asm/pgtable_32_types.h: + + #define CPU_ENTRY_AREA_PAGES (NR_CPUS * 41) + +The '41' is supposed to match sizeof(struct cpu_entry_area)/PAGE_SIZE, which value +we didn't want to define in such a low level header, because it would cause +dependency hell. + +Every time the size of cpu_entry_area is changed, we have to adjust CPU_ENTRY_AREA_PAGES +accordingly - and this assert is checking that constraint. + +But the assert is both imprecise and buggy, primarily because it doesn't +include the single readonly IDT page that is mapped at CPU_ENTRY_AREA_BASE +(which begins at a PMD boundary). + +This bug was hidden by the fact that by accident CPU_ENTRY_AREA_PAGES is defined +too large upstream (v5.4-rc8): + + #define CPU_ENTRY_AREA_PAGES (NR_CPUS * 40) + +While 'struct cpu_entry_area' is 155648 bytes, or 38 pages. So we had two extra +pages, which hid the bug. + +The following commit (not yet upstream) increased the size to 40 pages: + + x86/iopl: ("Restrict iopl() permission scope") + +... but increased CPU_ENTRY_AREA_PAGES only 41 - i.e. shortening the gap +to just 1 extra page. 
+ +Then another not-yet-upstream commit changed the size again: + + 880a98c33996: ("x86/cpu_entry_area: Add guard page for entry stack on 32bit") + +Which increased the cpu_entry_area size from 38 to 39 pages, but +didn't change CPU_ENTRY_AREA_PAGES (kept it at 40). This worked +fine, because we still had a page left from the accidental 'reserve'. + +But when these two commits were merged into the same tree, the +combined size of cpu_entry_area grew from 38 to 40 pages, while +CPU_ENTRY_AREA_PAGES finally caught up to 40 as well. + +Which is fine in terms of functionality, but the assert broke: + + BUILD_BUG_ON(CPU_ENTRY_AREA_PAGES * PAGE_SIZE < CPU_ENTRY_AREA_MAP_SIZE); + +because CPU_ENTRY_AREA_MAP_SIZE is the total size of the area, +which is 1 page larger due to the IDT page. + +To fix all this, change the assert to two precise asserts: + + BUILD_BUG_ON((CPU_ENTRY_AREA_PAGES+1)*PAGE_SIZE != CPU_ENTRY_AREA_MAP_SIZE); + BUILD_BUG_ON(CPU_ENTRY_AREA_TOTAL_SIZE != CPU_ENTRY_AREA_MAP_SIZE); + +This takes the IDT page into account, and also connects the size-based +define of CPU_ENTRY_AREA_TOTAL_SIZE with the address-subtraction based +define of CPU_ENTRY_AREA_MAP_SIZE. + +Also clean up some of the names which made it rather confusing: + + - 'CPU_ENTRY_AREA_TOT_SIZE' wasn't actually the 'total' size of + the cpu-entry-area, but the per-cpu array size, so rename this + to CPU_ENTRY_AREA_ARRAY_SIZE. + + - Introduce CPU_ENTRY_AREA_TOTAL_SIZE that _is_ the total mapping + size, with the IDT included. + + - Add comments where '+1' denotes the IDT mapping - it wasn't + obvious and took me about 3 hours to decode... + +Finally, because this particular commit is actually applied after +this patch: + + 880a98c33996: ("x86/cpu_entry_area: Add guard page for entry stack on 32bit") + +Fix the CPU_ENTRY_AREA_PAGES value from 40 pages to the correct 39 pages. + +All future commits that change cpu_entry_area will have to adjust +this value precisely. 
+ +As a side note, we should probably attempt to remove CPU_ENTRY_AREA_PAGES +and derive its value directly from the structure, without causing +header hell - but that is an adventure for another day! :-) + +Fixes: 880a98c33996: ("x86/cpu_entry_area: Add guard page for entry stack on 32bit") +Cc: Thomas Gleixner +Cc: Borislav Petkov +Cc: Peter Zijlstra (Intel) +Cc: Linus Torvalds +Cc: Andy Lutomirski +Cc: stable@kernel.org +Signed-off-by: Ingo Molnar +Signed-off-by: Greg Kroah-Hartman + +--- + arch/x86/include/asm/cpu_entry_area.h | 12 +++++++----- + arch/x86/include/asm/pgtable_32_types.h | 8 ++++---- + arch/x86/mm/cpu_entry_area.c | 4 +++- + 3 files changed, 14 insertions(+), 10 deletions(-) + +--- a/arch/x86/include/asm/cpu_entry_area.h ++++ b/arch/x86/include/asm/cpu_entry_area.h +@@ -98,7 +98,6 @@ struct cpu_entry_area { + */ + struct cea_exception_stacks estacks; + #endif +-#ifdef CONFIG_CPU_SUP_INTEL + /* + * Per CPU debug store for Intel performance monitoring. Wastes a + * full page at the moment. +@@ -109,11 +108,13 @@ struct cpu_entry_area { + * Reserve enough fixmap PTEs. 
+ */ + struct debug_store_buffers cpu_debug_buffers; +-#endif + }; + +-#define CPU_ENTRY_AREA_SIZE (sizeof(struct cpu_entry_area)) +-#define CPU_ENTRY_AREA_TOT_SIZE (CPU_ENTRY_AREA_SIZE * NR_CPUS) ++#define CPU_ENTRY_AREA_SIZE (sizeof(struct cpu_entry_area)) ++#define CPU_ENTRY_AREA_ARRAY_SIZE (CPU_ENTRY_AREA_SIZE * NR_CPUS) ++ ++/* Total size includes the readonly IDT mapping page as well: */ ++#define CPU_ENTRY_AREA_TOTAL_SIZE (CPU_ENTRY_AREA_ARRAY_SIZE + PAGE_SIZE) + + DECLARE_PER_CPU(struct cpu_entry_area *, cpu_entry_area); + DECLARE_PER_CPU(struct cea_exception_stacks *, cea_exception_stacks); +@@ -121,13 +122,14 @@ DECLARE_PER_CPU(struct cea_exception_sta + extern void setup_cpu_entry_areas(void); + extern void cea_set_pte(void *cea_vaddr, phys_addr_t pa, pgprot_t flags); + ++/* Single page reserved for the readonly IDT mapping: */ + #define CPU_ENTRY_AREA_RO_IDT CPU_ENTRY_AREA_BASE + #define CPU_ENTRY_AREA_PER_CPU (CPU_ENTRY_AREA_RO_IDT + PAGE_SIZE) + + #define CPU_ENTRY_AREA_RO_IDT_VADDR ((void *)CPU_ENTRY_AREA_RO_IDT) + + #define CPU_ENTRY_AREA_MAP_SIZE \ +- (CPU_ENTRY_AREA_PER_CPU + CPU_ENTRY_AREA_TOT_SIZE - CPU_ENTRY_AREA_BASE) ++ (CPU_ENTRY_AREA_PER_CPU + CPU_ENTRY_AREA_ARRAY_SIZE - CPU_ENTRY_AREA_BASE) + + extern struct cpu_entry_area *get_cpu_entry_area(int cpu); + +--- a/arch/x86/include/asm/pgtable_32_types.h ++++ b/arch/x86/include/asm/pgtable_32_types.h +@@ -44,11 +44,11 @@ extern bool __vmalloc_start_set; /* set + * Define this here and validate with BUILD_BUG_ON() in pgtable_32.c + * to avoid include recursion hell + */ +-#define CPU_ENTRY_AREA_PAGES (NR_CPUS * 40) ++#define CPU_ENTRY_AREA_PAGES (NR_CPUS * 39) + +-#define CPU_ENTRY_AREA_BASE \ +- ((FIXADDR_TOT_START - PAGE_SIZE * (CPU_ENTRY_AREA_PAGES + 1)) \ +- & PMD_MASK) ++/* The +1 is for the readonly IDT page: */ ++#define CPU_ENTRY_AREA_BASE \ ++ ((FIXADDR_TOT_START - PAGE_SIZE*(CPU_ENTRY_AREA_PAGES+1)) & PMD_MASK) + + #define LDT_BASE_ADDR \ + ((CPU_ENTRY_AREA_BASE - PAGE_SIZE) & 
PMD_MASK) +--- a/arch/x86/mm/cpu_entry_area.c ++++ b/arch/x86/mm/cpu_entry_area.c +@@ -178,7 +178,9 @@ static __init void setup_cpu_entry_area_ + #ifdef CONFIG_X86_32 + unsigned long start, end; + +- BUILD_BUG_ON(CPU_ENTRY_AREA_PAGES * PAGE_SIZE < CPU_ENTRY_AREA_MAP_SIZE); ++ /* The +1 is for the readonly IDT: */ ++ BUILD_BUG_ON((CPU_ENTRY_AREA_PAGES+1)*PAGE_SIZE != CPU_ENTRY_AREA_MAP_SIZE); ++ BUILD_BUG_ON(CPU_ENTRY_AREA_TOTAL_SIZE != CPU_ENTRY_AREA_MAP_SIZE); + BUG_ON(CPU_ENTRY_AREA_BASE & ~PMD_MASK); + + start = CPU_ENTRY_AREA_BASE; diff --git a/queue-5.3/x86-pti-32-size-initial_page_table-correctly.patch b/queue-5.3/x86-pti-32-size-initial_page_table-correctly.patch new file mode 100644 index 00000000000..066d934f4ad --- /dev/null +++ b/queue-5.3/x86-pti-32-size-initial_page_table-correctly.patch @@ -0,0 +1,62 @@ +From f490e07c53d66045d9d739e134145ec9b38653d3 Mon Sep 17 00:00:00 2001 +From: Thomas Gleixner +Date: Thu, 21 Nov 2019 00:40:23 +0100 +Subject: x86/pti/32: Size initial_page_table correctly + +From: Thomas Gleixner + +commit f490e07c53d66045d9d739e134145ec9b38653d3 upstream. + +Commit 945fd17ab6ba ("x86/cpu_entry_area: Sync cpu_entry_area to +initial_page_table") introduced the sync for the initial page table for +32bit. + +sync_initial_page_table() uses clone_pgd_range() which does the update for +the kernel page table. If PTI is enabled it also updates the user space +page table counterpart, which is assumed to be in the next page after the +target PGD. + +At this point in time 32-bit did not have PTI support, so the user space +page table update was not taking place. + +The support for PTI on 32-bit which was introduced later on, did not take +that into account and missed to add the user space counter part for the +initial page table. + +As a consequence sync_initial_page_table() overwrites any data which is +located in the page behing initial_page_table causing random failures, +e.g. 
by corrupting doublefault_tss and wrecking the doublefault handler +on 32bit. + +Fix it by adding a "user" page table right after initial_page_table. + +Fixes: 7757d607c6b3 ("x86/pti: Allow CONFIG_PAGE_TABLE_ISOLATION for x86_32") +Signed-off-by: Thomas Gleixner +Signed-off-by: Peter Zijlstra (Intel) +Reviewed-by: Joerg Roedel +Cc: stable@kernel.org +Signed-off-by: Greg Kroah-Hartman + +--- + arch/x86/kernel/head_32.S | 10 ++++++++++ + 1 file changed, 10 insertions(+) + +--- a/arch/x86/kernel/head_32.S ++++ b/arch/x86/kernel/head_32.S +@@ -571,6 +571,16 @@ ENTRY(initial_page_table) + # error "Kernel PMDs should be 1, 2 or 3" + # endif + .align PAGE_SIZE /* needs to be page-sized too */ ++ ++#ifdef CONFIG_PAGE_TABLE_ISOLATION ++ /* ++ * PTI needs another page so sync_initial_pagetable() works correctly ++ * and does not scribble over the data which is placed behind the ++ * actual initial_page_table. See clone_pgd_range(). ++ */ ++ .fill 1024, 4, 0 ++#endif ++ + #endif + + .data diff --git a/queue-5.3/x86-speculation-fix-incorrect-mds-taa-mitigation-status.patch b/queue-5.3/x86-speculation-fix-incorrect-mds-taa-mitigation-status.patch new file mode 100644 index 00000000000..19e90afe56f --- /dev/null +++ b/queue-5.3/x86-speculation-fix-incorrect-mds-taa-mitigation-status.patch @@ -0,0 +1,154 @@ +From 64870ed1b12e235cfca3f6c6da75b542c973ff78 Mon Sep 17 00:00:00 2001 +From: Waiman Long +Date: Fri, 15 Nov 2019 11:14:44 -0500 +Subject: x86/speculation: Fix incorrect MDS/TAA mitigation status + +From: Waiman Long + +commit 64870ed1b12e235cfca3f6c6da75b542c973ff78 upstream. + +For MDS vulnerable processors with TSX support, enabling either MDS or +TAA mitigations will enable the use of VERW to flush internal processor +buffers at the right code path. IOW, they are either both mitigated +or both not. However, if the command line options are inconsistent, +the vulnerabilities sysfs files may not report the mitigation status +correctly.
+ +For example, with only the "mds=off" option: + + vulnerabilities/mds:Vulnerable; SMT vulnerable + vulnerabilities/tsx_async_abort:Mitigation: Clear CPU buffers; SMT vulnerable + +The mds vulnerabilities file has wrong status in this case. Similarly, +the taa vulnerability file will be wrong with mds mitigation on, but +taa off. + +Change taa_select_mitigation() to sync up the two mitigation status +and have them turned off if both "mds=off" and "tsx_async_abort=off" +are present. + +Update documentation to emphasize the fact that both "mds=off" and +"tsx_async_abort=off" have to be specified together for processors that +are affected by both TAA and MDS to be effective. + + [ bp: Massage and add kernel-parameters.txt change too. ] + +Fixes: 1b42f017415b ("x86/speculation/taa: Add mitigation for TSX Async Abort") +Signed-off-by: Waiman Long +Signed-off-by: Borislav Petkov +Cc: Greg Kroah-Hartman +Cc: "H. Peter Anvin" +Cc: Ingo Molnar +Cc: Jiri Kosina +Cc: Jonathan Corbet +Cc: Josh Poimboeuf +Cc: linux-doc@vger.kernel.org +Cc: Mark Gross +Cc: +Cc: Pawan Gupta +Cc: Peter Zijlstra +Cc: Thomas Gleixner +Cc: Tim Chen +Cc: Tony Luck +Cc: Tyler Hicks +Cc: x86-ml +Link: https://lkml.kernel.org/r/20191115161445.30809-2-longman@redhat.com +Signed-off-by: Greg Kroah-Hartman + +--- + Documentation/admin-guide/hw-vuln/mds.rst | 7 +++++-- + Documentation/admin-guide/hw-vuln/tsx_async_abort.rst | 5 ++++- + Documentation/admin-guide/kernel-parameters.txt | 11 +++++++++++ + arch/x86/kernel/cpu/bugs.c | 17 +++++++++++++++-- + 4 files changed, 35 insertions(+), 5 deletions(-) + +--- a/Documentation/admin-guide/hw-vuln/mds.rst ++++ b/Documentation/admin-guide/hw-vuln/mds.rst +@@ -265,8 +265,11 @@ time with the option "mds=". The valid a + + ============ ============================================================= + +-Not specifying this option is equivalent to "mds=full". +- ++Not specifying this option is equivalent to "mds=full". 
For processors ++that are affected by both TAA (TSX Asynchronous Abort) and MDS, ++specifying just "mds=off" without an accompanying "tsx_async_abort=off" ++will have no effect as the same mitigation is used for both ++vulnerabilities. + + Mitigation selection guide + -------------------------- +--- a/Documentation/admin-guide/hw-vuln/tsx_async_abort.rst ++++ b/Documentation/admin-guide/hw-vuln/tsx_async_abort.rst +@@ -174,7 +174,10 @@ the option "tsx_async_abort=". The valid + CPU is not vulnerable to cross-thread TAA attacks. + ============ ============================================================= + +-Not specifying this option is equivalent to "tsx_async_abort=full". ++Not specifying this option is equivalent to "tsx_async_abort=full". For ++processors that are affected by both TAA and MDS, specifying just ++"tsx_async_abort=off" without an accompanying "mds=off" will have no ++effect as the same mitigation is used for both vulnerabilities. + + The kernel command line also allows to control the TSX feature using the + parameter "tsx=" on CPUs which support TSX control. MSR_IA32_TSX_CTRL is used +--- a/Documentation/admin-guide/kernel-parameters.txt ++++ b/Documentation/admin-guide/kernel-parameters.txt +@@ -2449,6 +2449,12 @@ + SMT on vulnerable CPUs + off - Unconditionally disable MDS mitigation + ++ On TAA-affected machines, mds=off can be prevented by ++ an active TAA mitigation as both vulnerabilities are ++ mitigated with the same mechanism so in order to disable ++ this mitigation, you need to specify tsx_async_abort=off ++ too. ++ + Not specifying this option is equivalent to + mds=full. + +@@ -4896,6 +4902,11 @@ + vulnerable to cross-thread TAA attacks. + off - Unconditionally disable TAA mitigation + ++ On MDS-affected machines, tsx_async_abort=off can be ++ prevented by an active MDS mitigation as both vulnerabilities ++ are mitigated with the same mechanism so in order to disable ++ this mitigation, you need to specify mds=off too. 
++ + Not specifying this option is equivalent to + tsx_async_abort=full. On CPUs which are MDS affected + and deploy MDS mitigation, TAA mitigation is not +--- a/arch/x86/kernel/cpu/bugs.c ++++ b/arch/x86/kernel/cpu/bugs.c +@@ -304,8 +304,12 @@ static void __init taa_select_mitigation + return; + } + +- /* TAA mitigation is turned off on the cmdline (tsx_async_abort=off) */ +- if (taa_mitigation == TAA_MITIGATION_OFF) ++ /* ++ * TAA mitigation via VERW is turned off if both ++ * tsx_async_abort=off and mds=off are specified. ++ */ ++ if (taa_mitigation == TAA_MITIGATION_OFF && ++ mds_mitigation == MDS_MITIGATION_OFF) + goto out; + + if (boot_cpu_has(X86_FEATURE_MD_CLEAR)) +@@ -339,6 +343,15 @@ static void __init taa_select_mitigation + if (taa_nosmt || cpu_mitigations_auto_nosmt()) + cpu_smt_disable(false); + ++ /* ++ * Update MDS mitigation, if necessary, as the mds_user_clear is ++ * now enabled for TAA mitigation. ++ */ ++ if (mds_mitigation == MDS_MITIGATION_OFF && ++ boot_cpu_has_bug(X86_BUG_MDS)) { ++ mds_mitigation = MDS_MITIGATION_FULL; ++ mds_select_mitigation(); ++ } + out: + pr_info("%s\n", taa_strings[taa_mitigation]); + } diff --git a/queue-5.3/x86-speculation-fix-redundant-mds-mitigation-message.patch b/queue-5.3/x86-speculation-fix-redundant-mds-mitigation-message.patch new file mode 100644 index 00000000000..e979867be76 --- /dev/null +++ b/queue-5.3/x86-speculation-fix-redundant-mds-mitigation-message.patch @@ -0,0 +1,81 @@ +From cd5a2aa89e847bdda7b62029d94e95488d73f6b2 Mon Sep 17 00:00:00 2001 +From: Waiman Long +Date: Fri, 15 Nov 2019 11:14:45 -0500 +Subject: x86/speculation: Fix redundant MDS mitigation message + +From: Waiman Long + +commit cd5a2aa89e847bdda7b62029d94e95488d73f6b2 upstream. 
+ +Since MDS and TAA mitigations are inter-related for processors that are +affected by both vulnerabilities, the following confusing messages can +be printed in the kernel log: + + MDS: Vulnerable + MDS: Mitigation: Clear CPU buffers + +To avoid the first incorrect message, defer the printing of MDS +mitigation until after the TAA mitigation selection has been done. However, +that has the side effect of printing TAA mitigation first before MDS +mitigation. + + [ bp: Check box is affected/mitigations are disabled first before + printing and massage. ] + +Suggested-by: Pawan Gupta +Signed-off-by: Waiman Long +Signed-off-by: Borislav Petkov +Cc: Greg Kroah-Hartman +Cc: "H. Peter Anvin" +Cc: Ingo Molnar +Cc: Josh Poimboeuf +Cc: Mark Gross +Cc: Peter Zijlstra +Cc: Thomas Gleixner +Cc: Tim Chen +Cc: Tony Luck +Cc: Tyler Hicks +Cc: x86-ml +Link: https://lkml.kernel.org/r/20191115161445.30809-3-longman@redhat.com +Signed-off-by: Greg Kroah-Hartman + +--- + arch/x86/kernel/cpu/bugs.c | 13 +++++++++++++ + 1 file changed, 13 insertions(+) + +--- a/arch/x86/kernel/cpu/bugs.c ++++ b/arch/x86/kernel/cpu/bugs.c +@@ -39,6 +39,7 @@ static void __init spectre_v2_select_mit + static void __init ssb_select_mitigation(void); + static void __init l1tf_select_mitigation(void); + static void __init mds_select_mitigation(void); ++static void __init mds_print_mitigation(void); + static void __init taa_select_mitigation(void); + + /* The base value of the SPEC_CTRL MSR that always has to be preserved. */ +@@ -108,6 +109,12 @@ void __init check_bugs(void) + mds_select_mitigation(); + taa_select_mitigation(); + ++ /* ++ * As MDS and TAA mitigations are inter-related, print MDS ++ * mitigation until after TAA mitigation selection is done.
++ */ ++ mds_print_mitigation(); ++ + arch_smt_update(); + + #ifdef CONFIG_X86_32 +@@ -245,6 +252,12 @@ static void __init mds_select_mitigation + (mds_nosmt || cpu_mitigations_auto_nosmt())) + cpu_smt_disable(false); + } ++} ++ ++static void __init mds_print_mitigation(void) ++{ ++ if (!boot_cpu_has_bug(X86_BUG_MDS) || cpu_mitigations_off()) ++ return; + + pr_info("%s\n", mds_strings[mds_mitigation]); + } diff --git a/queue-5.3/x86-stackframe-32-repair-32-bit-xen-pv.patch b/queue-5.3/x86-stackframe-32-repair-32-bit-xen-pv.patch new file mode 100644 index 00000000000..3c8e3bd5ab0 --- /dev/null +++ b/queue-5.3/x86-stackframe-32-repair-32-bit-xen-pv.patch @@ -0,0 +1,71 @@ +From 81ff2c37f9e5d77593928df0536d86443195fd64 Mon Sep 17 00:00:00 2001 +From: Jan Beulich +Date: Mon, 18 Nov 2019 16:21:12 +0100 +Subject: x86/stackframe/32: Repair 32-bit Xen PV + +From: Jan Beulich + +commit 81ff2c37f9e5d77593928df0536d86443195fd64 upstream. + +Once again RPL checks have been introduced which don't account for a 32-bit +kernel living in ring 1 when running in a PV Xen domain. The case in +FIXUP_FRAME has been preventing boot. + +Adjust BUG_IF_WRONG_CR3 as well to guard against future uses of the macro +on a code path reachable when running in PV mode under Xen; I have to admit +that I stopped at a certain point trying to figure out whether there are +present ones. + +Fixes: 3c88c692c287 ("x86/stackframe/32: Provide consistent pt_regs") +Signed-off-by: Jan Beulich +Signed-off-by: Thomas Gleixner +Cc: Stable Team +Link: https://lore.kernel.org/r/0fad341f-b7f5-f859-d55d-f0084ee7087e@suse.com +Signed-off-by: Greg Kroah-Hartman + +--- + arch/x86/entry/entry_32.S | 4 ++-- + arch/x86/include/asm/segment.h | 12 ++++++++++++ + 2 files changed, 14 insertions(+), 2 deletions(-) + +--- a/arch/x86/entry/entry_32.S ++++ b/arch/x86/entry/entry_32.S +@@ -172,7 +172,7 @@ + ALTERNATIVE "jmp .Lend_\@", "", X86_FEATURE_PTI + .if \no_user_check == 0 + /* coming from usermode? 
*/ +- testl $SEGMENT_RPL_MASK, PT_CS(%esp) ++ testl $USER_SEGMENT_RPL_MASK, PT_CS(%esp) + jz .Lend_\@ + .endif + /* On user-cr3? */ +@@ -217,7 +217,7 @@ + testl $X86_EFLAGS_VM, 4*4(%esp) + jnz .Lfrom_usermode_no_fixup_\@ + #endif +- testl $SEGMENT_RPL_MASK, 3*4(%esp) ++ testl $USER_SEGMENT_RPL_MASK, 3*4(%esp) + jnz .Lfrom_usermode_no_fixup_\@ + + orl $CS_FROM_KERNEL, 3*4(%esp) +--- a/arch/x86/include/asm/segment.h ++++ b/arch/x86/include/asm/segment.h +@@ -31,6 +31,18 @@ + */ + #define SEGMENT_RPL_MASK 0x3 + ++/* ++ * When running on Xen PV, the actual privilege level of the kernel is 1, ++ * not 0. Testing the Requested Privilege Level in a segment selector to ++ * determine whether the context is user mode or kernel mode with ++ * SEGMENT_RPL_MASK is wrong because the PV kernel's privilege level ++ * matches the 0x3 mask. ++ * ++ * Testing with USER_SEGMENT_RPL_MASK is valid for both native and Xen PV ++ * kernels because privilege level 2 is never used. ++ */ ++#define USER_SEGMENT_RPL_MASK 0x2 ++ + /* User mode is privilege level 3: */ + #define USER_RPL 0x3 + diff --git a/queue-5.3/x86-xen-32-make-xen_iret_crit_fixup-independent-of-frame-layout.patch b/queue-5.3/x86-xen-32-make-xen_iret_crit_fixup-independent-of-frame-layout.patch new file mode 100644 index 00000000000..bd54449ac67 --- /dev/null +++ b/queue-5.3/x86-xen-32-make-xen_iret_crit_fixup-independent-of-frame-layout.patch @@ -0,0 +1,180 @@ +From 29b810f5a5ec127d3143770098e05981baa3eb77 Mon Sep 17 00:00:00 2001 +From: Jan Beulich +Date: Mon, 11 Nov 2019 15:32:12 +0100 +Subject: x86/xen/32: Make xen_iret_crit_fixup() independent of frame layout + +From: Jan Beulich + +commit 29b810f5a5ec127d3143770098e05981baa3eb77 upstream. + +Now that SS:ESP always get saved by SAVE_ALL, this also needs to be +accounted for in xen_iret_crit_fixup(). 
Otherwise the old_ax value gets +interpreted as EFLAGS, and hence VM86 mode appears to be active all the +time, leading to random "vm86_32: no user_vm86: BAD" log messages alongside +processes randomly crashing. + +Since following the previous model (sitting after SAVE_ALL) would further +complicate the code _and_ retain the dependency of xen_iret_crit_fixup() on +frame manipulations done by entry_32.S, switch things around and do the +adjustment ahead of SAVE_ALL. + +Fixes: 3c88c692c287 ("x86/stackframe/32: Provide consistent pt_regs") +Signed-off-by: Jan Beulich +Signed-off-by: Thomas Gleixner +Reviewed-by: Juergen Gross +Cc: Stable Team +Link: https://lkml.kernel.org/r/32d8713d-25a7-84ab-b74b-aa3e88abce6b@suse.com +Signed-off-by: Greg Kroah-Hartman + +--- + arch/x86/entry/entry_32.S | 22 +++++--------- + arch/x86/xen/xen-asm_32.S | 70 +++++++++++++++++----------------------------- + 2 files changed, 35 insertions(+), 57 deletions(-) + +--- a/arch/x86/entry/entry_32.S ++++ b/arch/x86/entry/entry_32.S +@@ -1341,11 +1341,6 @@ END(spurious_interrupt_bug) + + #ifdef CONFIG_XEN_PV + ENTRY(xen_hypervisor_callback) +- pushl $-1 /* orig_ax = -1 => not a system call */ +- SAVE_ALL +- ENCODE_FRAME_POINTER +- TRACE_IRQS_OFF +- + /* + * Check to see if we got the event in the critical + * region in xen_iret_direct, after we've reenabled +@@ -1353,16 +1348,17 @@ ENTRY(xen_hypervisor_callback) + * iret instruction's behaviour where it delivers a + * pending interrupt when enabling interrupts: + */ +- movl PT_EIP(%esp), %eax +- cmpl $xen_iret_start_crit, %eax ++ cmpl $xen_iret_start_crit, (%esp) + jb 1f +- cmpl $xen_iret_end_crit, %eax ++ cmpl $xen_iret_end_crit, (%esp) + jae 1f +- +- jmp xen_iret_crit_fixup +- +-ENTRY(xen_do_upcall) +-1: mov %esp, %eax ++ call xen_iret_crit_fixup ++1: ++ pushl $-1 /* orig_ax = -1 => not a system call */ ++ SAVE_ALL ++ ENCODE_FRAME_POINTER ++ TRACE_IRQS_OFF ++ mov %esp, %eax + call xen_evtchn_do_upcall + #ifndef CONFIG_PREEMPT + call 
xen_maybe_preempt_hcall +--- a/arch/x86/xen/xen-asm_32.S ++++ b/arch/x86/xen/xen-asm_32.S +@@ -126,10 +126,9 @@ hyper_iret: + .globl xen_iret_start_crit, xen_iret_end_crit + + /* +- * This is called by xen_hypervisor_callback in entry.S when it sees ++ * This is called by xen_hypervisor_callback in entry_32.S when it sees + * that the EIP at the time of interrupt was between +- * xen_iret_start_crit and xen_iret_end_crit. We're passed the EIP in +- * %eax so we can do a more refined determination of what to do. ++ * xen_iret_start_crit and xen_iret_end_crit. + * + * The stack format at this point is: + * ---------------- +@@ -138,34 +137,23 @@ hyper_iret: + * eflags } outer exception info + * cs } + * eip } +- * ---------------- <- edi (copy dest) +- * eax : outer eax if it hasn't been restored + * ---------------- +- * eflags } nested exception info +- * cs } (no ss/esp because we're nested +- * eip } from the same ring) +- * orig_eax }<- esi (copy src) +- * - - - - - - - - +- * fs } +- * es } +- * ds } SAVE_ALL state +- * eax } +- * : : +- * ebx }<- esp ++ * eax : outer eax if it hasn't been restored + * ---------------- ++ * eflags } ++ * cs } nested exception info ++ * eip } ++ * return address : (into xen_hypervisor_callback) + * +- * In order to deliver the nested exception properly, we need to shift +- * everything from the return addr up to the error code so it sits +- * just under the outer exception info. This means that when we +- * handle the exception, we do it in the context of the outer +- * exception rather than starting a new one. ++ * In order to deliver the nested exception properly, we need to discard the ++ * nested exception frame such that when we handle the exception, we do it ++ * in the context of the outer exception rather than starting a new one. 
+ * +- * The only caveat is that if the outer eax hasn't been restored yet +- * (ie, it's still on stack), we need to insert its value into the +- * SAVE_ALL state before going on, since it's usermode state which we +- * eventually need to restore. ++ * The only caveat is that if the outer eax hasn't been restored yet (i.e. ++ * it's still on stack), we need to restore its value here. + */ + ENTRY(xen_iret_crit_fixup) ++ pushl %ecx + /* + * Paranoia: Make sure we're really coming from kernel space. + * One could imagine a case where userspace jumps into the +@@ -176,32 +164,26 @@ ENTRY(xen_iret_crit_fixup) + * jump instruction itself, not the destination, but some + * virtual environments get this wrong. + */ +- movl PT_CS(%esp), %ecx ++ movl 3*4(%esp), %ecx /* nested CS */ + andl $SEGMENT_RPL_MASK, %ecx + cmpl $USER_RPL, %ecx ++ popl %ecx + je 2f + +- lea PT_ORIG_EAX(%esp), %esi +- lea PT_EFLAGS(%esp), %edi +- + /* + * If eip is before iret_restore_end then stack + * hasn't been restored yet. 
+ */ +- cmp $iret_restore_end, %eax ++ cmpl $iret_restore_end, 1*4(%esp) + jae 1f + +- movl 0+4(%edi), %eax /* copy EAX (just above top of frame) */ +- movl %eax, PT_EAX(%esp) +- +- lea ESP_OFFSET(%edi), %edi /* move dest up over saved regs */ +- +- /* set up the copy */ +-1: std +- mov $PT_EIP / 4, %ecx /* saved regs up to orig_eax */ +- rep movsl +- cld +- +- lea 4(%edi), %esp /* point esp to new frame */ +-2: jmp xen_do_upcall +- ++ movl 4*4(%esp), %eax /* load outer EAX */ ++ ret $4*4 /* discard nested EIP, CS, and EFLAGS as ++ * well as the just restored EAX */ ++ ++1: ++ ret $3*4 /* discard nested EIP, CS, and EFLAGS */ ++ ++2: ++ ret ++END(xen_iret_crit_fixup) diff --git a/queue-5.3/x86-xen-32-simplify-ring-check-in-xen_iret_crit_fixup.patch b/queue-5.3/x86-xen-32-simplify-ring-check-in-xen_iret_crit_fixup.patch new file mode 100644 index 00000000000..826ebafdee5 --- /dev/null +++ b/queue-5.3/x86-xen-32-simplify-ring-check-in-xen_iret_crit_fixup.patch @@ -0,0 +1,56 @@ +From 922eea2ce5c799228d9ff1be9890e6873ce8fff6 Mon Sep 17 00:00:00 2001 +From: Jan Beulich +Date: Mon, 11 Nov 2019 15:32:59 +0100 +Subject: x86/xen/32: Simplify ring check in xen_iret_crit_fixup() + +From: Jan Beulich + +commit 922eea2ce5c799228d9ff1be9890e6873ce8fff6 upstream. + +This can be had with two instead of six insns, by just checking the high +CS.RPL bit. + +Also adjust the comment - there would be no #GP in the mentioned cases, as +there's no segment limit violation or alike. Instead there'd be #PF, but +that one reports the target EIP of said branch, not the address of the +branch insn itself. 
+ +Signed-off-by: Jan Beulich +Signed-off-by: Thomas Gleixner +Reviewed-by: Juergen Gross +Link: https://lkml.kernel.org/r/a5986837-01eb-7bf8-bf42-4d3084d6a1f5@suse.com +Signed-off-by: Greg Kroah-Hartman + +--- + arch/x86/xen/xen-asm_32.S | 15 ++++----------- + 1 file changed, 4 insertions(+), 11 deletions(-) + +--- a/arch/x86/xen/xen-asm_32.S ++++ b/arch/x86/xen/xen-asm_32.S +@@ -153,22 +153,15 @@ hyper_iret: + * it's still on stack), we need to restore its value here. + */ + ENTRY(xen_iret_crit_fixup) +- pushl %ecx + /* + * Paranoia: Make sure we're really coming from kernel space. + * One could imagine a case where userspace jumps into the + * critical range address, but just before the CPU delivers a +- * GP, it decides to deliver an interrupt instead. Unlikely? +- * Definitely. Easy to avoid? Yes. The Intel documents +- * explicitly say that the reported EIP for a bad jump is the +- * jump instruction itself, not the destination, but some +- * virtual environments get this wrong. ++ * PF, it decides to deliver an interrupt instead. Unlikely? ++ * Definitely. Easy to avoid? Yes. + */ +- movl 3*4(%esp), %ecx /* nested CS */ +- andl $SEGMENT_RPL_MASK, %ecx +- cmpl $USER_RPL, %ecx +- popl %ecx +- je 2f ++ testb $2, 2*4(%esp) /* nested CS */ ++ jnz 2f + + /* + * If eip is before iret_restore_end then stack