From: Greg Kroah-Hartman Date: Mon, 26 Feb 2018 19:58:11 +0000 (+0100) Subject: 4.9-stable patches X-Git-Tag: v3.18.97~4 X-Git-Url: http://git.ipfire.org/?a=commitdiff_plain;h=e6d3ec88b7f2f825c44d009e751593424fb8b1c5;p=thirdparty%2Fkernel%2Fstable-queue.git 4.9-stable patches added patches: device-dax-implement-split-to-catch-invalid-munmap-attempts.patch fs-dax.c-fix-inefficiency-in-dax_writeback_mapping_range.patch ib-core-disable-memory-registration-of-filesystem-dax-vmas.patch libnvdimm-dax-fix-1gb-aligned-namespaces-vs-physical-misalignment.patch libnvdimm-fix-integer-overflow-static-analysis-warning.patch mm-avoid-spurious-bad-pmd-warning-messages.patch mm-fail-get_vaddr_frames-for-filesystem-dax-mappings.patch mm-fix-devm_memremap_pages-collision-handling.patch mm-introduce-get_user_pages_longterm.patch v4l2-disable-filesystem-dax-mapping-support.patch x86-entry-64-clear-extra-registers-beyond-syscall-arguments-to-reduce-speculation-attack-surface.patch --- diff --git a/queue-4.9/binder-add-missing-binder_unlock.patch b/queue-4.9/binder-add-missing-binder_unlock.patch index 22f2fed2eca..1219aa2bd28 100644 --- a/queue-4.9/binder-add-missing-binder_unlock.patch +++ b/queue-4.9/binder-add-missing-binder_unlock.patch @@ -6,7 +6,6 @@ To: stable@vger.kernel.org, Greg Kroah-Hartman Cc: Guenter Roeck , Todd Kjos , Eric Biggers Message-ID: <20180226185645.241652-1-ebiggers3@gmail.com> - From: Eric Biggers When commit 4be5a2810489 ("binder: check for binder_thread allocation diff --git a/queue-4.9/device-dax-implement-split-to-catch-invalid-munmap-attempts.patch b/queue-4.9/device-dax-implement-split-to-catch-invalid-munmap-attempts.patch new file mode 100644 index 00000000000..a9a87a1caa9 --- /dev/null +++ b/queue-4.9/device-dax-implement-split-to-catch-invalid-munmap-attempts.patch @@ -0,0 +1,76 @@ +From foo@baz Mon Feb 26 20:55:53 CET 2018 +From: Dan Williams +Date: Fri, 23 Feb 2018 14:05:43 -0800 +Subject: device-dax: implement ->split() to catch invalid munmap 
attempts +To: gregkh@linuxfoundation.org +Cc: Jeff Moyer , Linus Torvalds , Andrew Morton , stable@vger.kernel.org, linux-kernel@vger.kernel.org +Message-ID: <151942354379.21775.5321017414392517094.stgit@dwillia2-desk3.amr.corp.intel.com> + +From: Dan Williams + +commit 9702cffdbf2129516db679e4467db81e1cd287da upstream. + +Similar to how device-dax enforces that the 'address', 'offset', and +'len' parameters to mmap() be aligned to the device's fundamental +alignment, the same constraints apply to munmap(). Implement ->split() +to fail munmap calls that violate the alignment constraint. + +Otherwise, we later fail VM_BUG_ON checks in the unmap_page_range() path +with crash signatures of the form: + + vma ffff8800b60c8a88 start 00007f88c0000000 end 00007f88c0e00000 + next (null) prev (null) mm ffff8800b61150c0 + prot 8000000000000027 anon_vma (null) vm_ops ffffffffa0091240 + pgoff 0 file ffff8800b638ef80 private_data (null) + flags: 0x380000fb(read|write|shared|mayread|maywrite|mayexec|mayshare|softdirty|mixedmap|hugepage) + ------------[ cut here ]------------ + kernel BUG at mm/huge_memory.c:2014! + [..] + RIP: 0010:__split_huge_pud+0x12a/0x180 + [..] + Call Trace: + unmap_page_range+0x245/0xa40 + ? __vma_adjust+0x301/0x990 + unmap_vmas+0x4c/0xa0 + unmap_region+0xae/0x120 + ? 
__vma_rb_erase+0x11a/0x230 + do_munmap+0x276/0x410 + vm_munmap+0x6a/0xa0 + SyS_munmap+0x1d/0x30 + +Link: http://lkml.kernel.org/r/151130418681.4029.7118245855057952010.stgit@dwillia2-desk3.amr.corp.intel.com +Fixes: dee410792419 ("/dev/dax, core: file operations and dax-mmap") +Signed-off-by: Dan Williams +Reported-by: Jeff Moyer +Cc: +Signed-off-by: Andrew Morton +Signed-off-by: Linus Torvalds +Signed-off-by: Greg Kroah-Hartman +--- + drivers/dax/dax.c | 12 ++++++++++++ + 1 file changed, 12 insertions(+) + +--- a/drivers/dax/dax.c ++++ b/drivers/dax/dax.c +@@ -453,9 +453,21 @@ static int dax_dev_pmd_fault(struct vm_a + return rc; + } + ++static int dax_dev_split(struct vm_area_struct *vma, unsigned long addr) ++{ ++ struct file *filp = vma->vm_file; ++ struct dax_dev *dax_dev = filp->private_data; ++ struct dax_region *dax_region = dax_dev->region; ++ ++ if (!IS_ALIGNED(addr, dax_region->align)) ++ return -EINVAL; ++ return 0; ++} ++ + static const struct vm_operations_struct dax_dev_vm_ops = { + .fault = dax_dev_fault, + .pmd_fault = dax_dev_pmd_fault, ++ .split = dax_dev_split, + }; + + static int dax_mmap(struct file *filp, struct vm_area_struct *vma) diff --git a/queue-4.9/fs-dax.c-fix-inefficiency-in-dax_writeback_mapping_range.patch b/queue-4.9/fs-dax.c-fix-inefficiency-in-dax_writeback_mapping_range.patch new file mode 100644 index 00000000000..bcb600695ab --- /dev/null +++ b/queue-4.9/fs-dax.c-fix-inefficiency-in-dax_writeback_mapping_range.patch @@ -0,0 +1,40 @@ +From foo@baz Mon Feb 26 20:55:53 CET 2018 +From: Dan Williams +Date: Fri, 23 Feb 2018 14:05:33 -0800 +Subject: fs/dax.c: fix inefficiency in dax_writeback_mapping_range() +To: gregkh@linuxfoundation.org +Cc: Jan Kara , linux-kernel@vger.kernel.org, stable@vger.kernel.org, Ross Zwisler , Linus Torvalds , Andrew Morton +Message-ID: <151942353293.21775.3589635231521871832.stgit@dwillia2-desk3.amr.corp.intel.com> + +From: Jan Kara + +commit 1eb643d02b21412e603b42cdd96010a2ac31c05f upstream. 
+ +dax_writeback_mapping_range() fails to update iteration index when +searching radix tree for entries needing cache flushing. Thus each +pagevec worth of entries is searched starting from the start which is +inefficient and prone to livelocks. Update index properly. + +Link: http://lkml.kernel.org/r/20170619124531.21491-1-jack@suse.cz +Fixes: 9973c98ecfda3 ("dax: add support for fsync/sync") +Signed-off-by: Jan Kara +Reviewed-by: Ross Zwisler +Cc: Dan Williams +Cc: +Signed-off-by: Andrew Morton +Signed-off-by: Linus Torvalds +Signed-off-by: Greg Kroah-Hartman +--- + fs/dax.c | 1 + + 1 file changed, 1 insertion(+) + +--- a/fs/dax.c ++++ b/fs/dax.c +@@ -785,6 +785,7 @@ int dax_writeback_mapping_range(struct a + if (ret < 0) + return ret; + } ++ start_index = indices[pvec.nr - 1] + 1; + } + return 0; + } diff --git a/queue-4.9/ib-core-disable-memory-registration-of-filesystem-dax-vmas.patch b/queue-4.9/ib-core-disable-memory-registration-of-filesystem-dax-vmas.patch new file mode 100644 index 00000000000..851672c857e --- /dev/null +++ b/queue-4.9/ib-core-disable-memory-registration-of-filesystem-dax-vmas.patch @@ -0,0 +1,54 @@ +From foo@baz Mon Feb 26 20:55:53 CET 2018 +From: Dan Williams +Date: Fri, 23 Feb 2018 14:06:00 -0800 +Subject: IB/core: disable memory registration of filesystem-dax vmas +To: gregkh@linuxfoundation.org +Cc: Sean Hefty , Jan Kara , Joonyoung Shim , linux-kernel@vger.kernel.org, Seung-Woo Kim , Jeff Moyer , stable@vger.kernel.org, Christoph Hellwig , Inki Dae , Doug Ledford , Jason Gunthorpe , Mel Gorman , Ross Zwisler , Kyungmin Park , Andrew Morton , Mauro Carvalho Chehab , Linus Torvalds , Hal Rosenstock , Vlastimil Babka +Message-ID: <151942356005.21775.11352557058864235434.stgit@dwillia2-desk3.amr.corp.intel.com> + +From: Dan Williams + +commit 5f1d43de54164dcfb9bfa542fcc92c1e1a1b6c1d upstream. 
+ +Until there is a solution to the dma-to-dax vs truncate problem it is +not safe to allow RDMA to create long standing memory registrations +against filesytem-dax vmas. + +Link: http://lkml.kernel.org/r/151068941011.7446.7766030590347262502.stgit@dwillia2-desk3.amr.corp.intel.com +Fixes: 3565fce3a659 ("mm, x86: get_user_pages() for dax mappings") +Signed-off-by: Dan Williams +Reported-by: Christoph Hellwig +Reviewed-by: Christoph Hellwig +Acked-by: Jason Gunthorpe +Acked-by: Doug Ledford +Cc: Sean Hefty +Cc: Hal Rosenstock +Cc: Jeff Moyer +Cc: Ross Zwisler +Cc: Inki Dae +Cc: Jan Kara +Cc: Joonyoung Shim +Cc: Kyungmin Park +Cc: Mauro Carvalho Chehab +Cc: Mel Gorman +Cc: Seung-Woo Kim +Cc: Vlastimil Babka +Cc: +Signed-off-by: Andrew Morton +Signed-off-by: Linus Torvalds +Signed-off-by: Greg Kroah-Hartman +--- + drivers/infiniband/core/umem.c | 2 +- + 1 file changed, 1 insertion(+), 1 deletion(-) + +--- a/drivers/infiniband/core/umem.c ++++ b/drivers/infiniband/core/umem.c +@@ -193,7 +193,7 @@ struct ib_umem *ib_umem_get(struct ib_uc + sg_list_start = umem->sg_head.sgl; + + while (npages) { +- ret = get_user_pages(cur_base, ++ ret = get_user_pages_longterm(cur_base, + min_t(unsigned long, npages, + PAGE_SIZE / sizeof (struct page *)), + gup_flags, page_list, vma_list); diff --git a/queue-4.9/libnvdimm-dax-fix-1gb-aligned-namespaces-vs-physical-misalignment.patch b/queue-4.9/libnvdimm-dax-fix-1gb-aligned-namespaces-vs-physical-misalignment.patch new file mode 100644 index 00000000000..23762ea7ede --- /dev/null +++ b/queue-4.9/libnvdimm-dax-fix-1gb-aligned-namespaces-vs-physical-misalignment.patch @@ -0,0 +1,87 @@ +From foo@baz Mon Feb 26 20:55:53 CET 2018 +From: Dan Williams +Date: Fri, 23 Feb 2018 14:06:05 -0800 +Subject: libnvdimm, dax: fix 1GB-aligned namespaces vs physical misalignment +To: gregkh@linuxfoundation.org +Cc: Jane Chu , linux-kernel@vger.kernel.org, stable@vger.kernel.org +Message-ID: 
<151942356576.21775.15139045279160411096.stgit@dwillia2-desk3.amr.corp.intel.com> + +From: Dan Williams + +commit 41fce90f26333c4fa82e8e43b9ace86c4e8a0120 upstream. + +The following namespace configuration attempt: + + # ndctl create-namespace -e namespace0.0 -m devdax -a 1G -f + libndctl: ndctl_dax_enable: dax0.1: failed to enable + Error: namespace0.0: failed to enable + + failed to reconfigure namespace: No such device or address + +...fails when the backing memory range is not physically aligned to 1G: + + # cat /proc/iomem | grep Persistent + 210000000-30fffffff : Persistent Memory (legacy) + +In the above example the 4G persistent memory range starts and ends on a +256MB boundary. + +We handle this case correctly when needing to handle cases that violate +section alignment (128MB) collisions against "System RAM", and we simply +need to extend that padding/truncation for the 1GB alignment use case. + +Cc: +Fixes: 315c562536c4 ("libnvdimm, pfn: add 'align' attribute...") +Reported-and-tested-by: Jane Chu +Signed-off-by: Dan Williams +Signed-off-by: Greg Kroah-Hartman +--- + drivers/nvdimm/pfn_devs.c | 15 ++++++++++++--- + include/linux/kernel.h | 1 + + 2 files changed, 13 insertions(+), 3 deletions(-) + +--- a/drivers/nvdimm/pfn_devs.c ++++ b/drivers/nvdimm/pfn_devs.c +@@ -563,6 +563,12 @@ static struct vmem_altmap *__nvdimm_setu + return altmap; + } + ++static u64 phys_pmem_align_down(struct nd_pfn *nd_pfn, u64 phys) ++{ ++ return min_t(u64, PHYS_SECTION_ALIGN_DOWN(phys), ++ ALIGN_DOWN(phys, nd_pfn->align)); ++} ++ + static int nd_pfn_init(struct nd_pfn *nd_pfn) + { + u32 dax_label_reserve = is_nd_dax(&nd_pfn->dev) ? 
SZ_128K : 0; +@@ -618,13 +624,16 @@ static int nd_pfn_init(struct nd_pfn *nd + start = nsio->res.start; + size = PHYS_SECTION_ALIGN_UP(start + size) - start; + if (region_intersects(start, size, IORESOURCE_SYSTEM_RAM, +- IORES_DESC_NONE) == REGION_MIXED) { ++ IORES_DESC_NONE) == REGION_MIXED ++ || !IS_ALIGNED(start + resource_size(&nsio->res), ++ nd_pfn->align)) { + size = resource_size(&nsio->res); +- end_trunc = start + size - PHYS_SECTION_ALIGN_DOWN(start + size); ++ end_trunc = start + size - phys_pmem_align_down(nd_pfn, ++ start + size); + } + + if (start_pad + end_trunc) +- dev_info(&nd_pfn->dev, "%s section collision, truncate %d bytes\n", ++ dev_info(&nd_pfn->dev, "%s alignment collision, truncate %d bytes\n", + dev_name(&ndns->dev), start_pad + end_trunc); + + /* +--- a/include/linux/kernel.h ++++ b/include/linux/kernel.h +@@ -46,6 +46,7 @@ + #define REPEAT_BYTE(x) ((~0ul / 0xff) * (x)) + + #define ALIGN(x, a) __ALIGN_KERNEL((x), (a)) ++#define ALIGN_DOWN(x, a) __ALIGN_KERNEL((x) - ((a) - 1), (a)) + #define __ALIGN_MASK(x, mask) __ALIGN_KERNEL_MASK((x), (mask)) + #define PTR_ALIGN(p, a) ((typeof(p))ALIGN((unsigned long)(p), (a))) + #define IS_ALIGNED(x, a) (((x) & ((typeof(x))(a) - 1)) == 0) diff --git a/queue-4.9/libnvdimm-fix-integer-overflow-static-analysis-warning.patch b/queue-4.9/libnvdimm-fix-integer-overflow-static-analysis-warning.patch new file mode 100644 index 00000000000..0eec2b8976b --- /dev/null +++ b/queue-4.9/libnvdimm-fix-integer-overflow-static-analysis-warning.patch @@ -0,0 +1,88 @@ +From foo@baz Mon Feb 26 20:55:53 CET 2018 +From: Dan Williams +Date: Fri, 23 Feb 2018 14:05:38 -0800 +Subject: libnvdimm: fix integer overflow static analysis warning +To: gregkh@linuxfoundation.org +Cc: stable@vger.kernel.org, Dan Carpenter , linux-kernel@vger.kernel.org +Message-ID: <151942353841.21775.10479863744600514056.stgit@dwillia2-desk3.amr.corp.intel.com> + +From: Dan Williams + +commit 58738c495e15badd2015e19ff41f1f1ed55200bc upstream. 
+ +Dan reports: + The patch 62232e45f4a2: "libnvdimm: control (ioctl) messages for + nvdimm_bus and nvdimm devices" from Jun 8, 2015, leads to the + following static checker warning: + + drivers/nvdimm/bus.c:1018 __nd_ioctl() + warn: integer overflows 'buf_len' + + From a casual review, this seems like it might be a real bug. On + the first iteration we load some data into in_env[]. On the second + iteration we read a use controlled "in_size" from nd_cmd_in_size(). + It can go up to UINT_MAX - 1. A high number means we will fill the + whole in_env[] buffer. But we potentially keep looping and adding + more to in_len so now it can be any value. + + It simple enough to change, but it feels weird that we keep looping + even though in_env is totally full. Shouldn't we just return an + error if we don't have space for desc->in_num. + +We keep looping because the size of the total input is allowed to be +bigger than the 'envelope' which is a subset of the payload that tells +us how much data to expect. For safety explicitly check that buf_len +does not overflow which is what the checker flagged. + +Cc: +Fixes: 62232e45f4a2: "libnvdimm: control (ioctl) messages for nvdimm_bus..." 
+Reported-by: Dan Carpenter +Signed-off-by: Dan Williams +Signed-off-by: Greg Kroah-Hartman +--- + drivers/nvdimm/bus.c | 11 ++++++----- + 1 file changed, 6 insertions(+), 5 deletions(-) + +--- a/drivers/nvdimm/bus.c ++++ b/drivers/nvdimm/bus.c +@@ -812,16 +812,17 @@ static int __nd_ioctl(struct nvdimm_bus + int read_only, unsigned int ioctl_cmd, unsigned long arg) + { + struct nvdimm_bus_descriptor *nd_desc = nvdimm_bus->nd_desc; +- size_t buf_len = 0, in_len = 0, out_len = 0; + static char out_env[ND_CMD_MAX_ENVELOPE]; + static char in_env[ND_CMD_MAX_ENVELOPE]; + const struct nd_cmd_desc *desc = NULL; + unsigned int cmd = _IOC_NR(ioctl_cmd); + void __user *p = (void __user *) arg; + struct device *dev = &nvdimm_bus->dev; +- struct nd_cmd_pkg pkg; + const char *cmd_name, *dimm_name; ++ u32 in_len = 0, out_len = 0; + unsigned long cmd_mask; ++ struct nd_cmd_pkg pkg; ++ u64 buf_len = 0; + void *buf; + int rc, i; + +@@ -882,7 +883,7 @@ static int __nd_ioctl(struct nvdimm_bus + } + + if (cmd == ND_CMD_CALL) { +- dev_dbg(dev, "%s:%s, idx: %llu, in: %zu, out: %zu, len %zu\n", ++ dev_dbg(dev, "%s:%s, idx: %llu, in: %u, out: %u, len %llu\n", + __func__, dimm_name, pkg.nd_command, + in_len, out_len, buf_len); + +@@ -912,9 +913,9 @@ static int __nd_ioctl(struct nvdimm_bus + out_len += out_size; + } + +- buf_len = out_len + in_len; ++ buf_len = (u64) out_len + (u64) in_len; + if (buf_len > ND_IOCTL_MAX_BUFLEN) { +- dev_dbg(dev, "%s:%s cmd: %s buf_len: %zu > %d\n", __func__, ++ dev_dbg(dev, "%s:%s cmd: %s buf_len: %llu > %d\n", __func__, + dimm_name, cmd_name, buf_len, + ND_IOCTL_MAX_BUFLEN); + return -EINVAL; diff --git a/queue-4.9/mm-avoid-spurious-bad-pmd-warning-messages.patch b/queue-4.9/mm-avoid-spurious-bad-pmd-warning-messages.patch new file mode 100644 index 00000000000..ff8d0d6e58f --- /dev/null +++ b/queue-4.9/mm-avoid-spurious-bad-pmd-warning-messages.patch @@ -0,0 +1,128 @@ +From foo@baz Mon Feb 26 20:55:53 CET 2018 +From: Dan Williams +Date: Fri, 23 Feb 2018 
14:05:27 -0800 +Subject: mm: avoid spurious 'bad pmd' warning messages +To: gregkh@linuxfoundation.org +Cc: Jan Kara , Eryu Guan , Xiong Zhou , linux-kernel@vger.kernel.org, Matthew Wilcox , Christoph Hellwig , stable@vger.kernel.org, Pawel Lebioda , Dave Hansen , Alexander Viro , Ross Zwisler , Dave Jiang , Andrew Morton , Linus Torvalds , "Darrick J. Wong" , "Kirill A . Shutemov" +Message-ID: <151942352781.21775.15841303754448120195.stgit@dwillia2-desk3.amr.corp.intel.com> + +From: Ross Zwisler + +commit d0f0931de936a0a468d7e59284d39581c16d3a73 upstream. + +When the pmd_devmap() checks were added by 5c7fb56e5e3f ("mm, dax: +dax-pmd vs thp-pmd vs hugetlbfs-pmd") to add better support for DAX huge +pages, they were all added to the end of if() statements after existing +pmd_trans_huge() checks. So, things like: + + - if (pmd_trans_huge(*pmd)) + + if (pmd_trans_huge(*pmd) || pmd_devmap(*pmd)) + +When further checks were added after pmd_trans_unstable() checks by +commit 7267ec008b5c ("mm: postpone page table allocation until we have +page to map") they were also added at the end of the conditional: + + + if (pmd_trans_unstable(fe->pmd) || pmd_devmap(*fe->pmd)) + +This ordering is fine for pmd_trans_huge(), but doesn't work for +pmd_trans_unstable(). This is because DAX huge pages trip the bad_pmd() +check inside of pmd_none_or_trans_huge_or_clear_bad() (called by +pmd_trans_unstable()), which prints out a warning and returns 1. So, we +do end up doing the right thing, but only after spamming dmesg with +suspicious looking messages: + + mm/pgtable-generic.c:39: bad pmd ffff8808daa49b88(84000001006000a5) + +Reorder these checks in a helper so that pmd_devmap() is checked first, +avoiding the error messages, and add a comment explaining why the +ordering is important. 
+ +Fixes: commit 7267ec008b5c ("mm: postpone page table allocation until we have page to map") +Link: http://lkml.kernel.org/r/20170522215749.23516-1-ross.zwisler@linux.intel.com +Signed-off-by: Ross Zwisler +Reviewed-by: Jan Kara +Cc: Pawel Lebioda +Cc: "Darrick J. Wong" +Cc: Alexander Viro +Cc: Christoph Hellwig +Cc: Dan Williams +Cc: Dave Hansen +Cc: Matthew Wilcox +Cc: "Kirill A . Shutemov" +Cc: Dave Jiang +Cc: Xiong Zhou +Cc: Eryu Guan +Cc: +Signed-off-by: Andrew Morton +Signed-off-by: Linus Torvalds +Signed-off-by: Greg Kroah-Hartman +--- + mm/memory.c | 40 ++++++++++++++++++++++++++++++---------- + 1 file changed, 30 insertions(+), 10 deletions(-) + +--- a/mm/memory.c ++++ b/mm/memory.c +@@ -2848,6 +2848,17 @@ static int __do_fault(struct fault_env * + return ret; + } + ++/* ++ * The ordering of these checks is important for pmds with _PAGE_DEVMAP set. ++ * If we check pmd_trans_unstable() first we will trip the bad_pmd() check ++ * inside of pmd_none_or_trans_huge_or_clear_bad(). This will end up correctly ++ * returning 1 but not before it spams dmesg with the pmd_clear_bad() output. ++ */ ++static int pmd_devmap_trans_unstable(pmd_t *pmd) ++{ ++ return pmd_devmap(*pmd) || pmd_trans_unstable(pmd); ++} ++ + static int pte_alloc_one_map(struct fault_env *fe) + { + struct vm_area_struct *vma = fe->vma; +@@ -2871,18 +2882,27 @@ static int pte_alloc_one_map(struct faul + map_pte: + /* + * If a huge pmd materialized under us just retry later. Use +- * pmd_trans_unstable() instead of pmd_trans_huge() to ensure the pmd +- * didn't become pmd_trans_huge under us and then back to pmd_none, as +- * a result of MADV_DONTNEED running immediately after a huge pmd fault +- * in a different thread of this mm, in turn leading to a misleading +- * pmd_trans_huge() retval. All we have to ensure is that it is a +- * regular pmd that we can walk with pte_offset_map() and we can do that +- * through an atomic read in C, which is what pmd_trans_unstable() +- * provides. 
++ * pmd_trans_unstable() via pmd_devmap_trans_unstable() instead of ++ * pmd_trans_huge() to ensure the pmd didn't become pmd_trans_huge ++ * under us and then back to pmd_none, as a result of MADV_DONTNEED ++ * running immediately after a huge pmd fault in a different thread of ++ * this mm, in turn leading to a misleading pmd_trans_huge() retval. ++ * All we have to ensure is that it is a regular pmd that we can walk ++ * with pte_offset_map() and we can do that through an atomic read in ++ * C, which is what pmd_trans_unstable() provides. + */ +- if (pmd_trans_unstable(fe->pmd) || pmd_devmap(*fe->pmd)) ++ if (pmd_devmap_trans_unstable(fe->pmd)) + return VM_FAULT_NOPAGE; + ++ /* ++ * At this point we know that our vmf->pmd points to a page of ptes ++ * and it cannot become pmd_none(), pmd_devmap() or pmd_trans_huge() ++ * for the duration of the fault. If a racing MADV_DONTNEED runs and ++ * we zap the ptes pointed to by our vmf->pmd, the vmf->ptl will still ++ * be valid and we will re-check to make sure the vmf->pte isn't ++ * pte_none() under vmf->ptl protection when we return to ++ * alloc_set_pte(). 
++ */ + fe->pte = pte_offset_map_lock(vma->vm_mm, fe->pmd, fe->address, + &fe->ptl); + return 0; +@@ -3456,7 +3476,7 @@ static int handle_pte_fault(struct fault + fe->pte = NULL; + } else { + /* See comment in pte_alloc_one_map() */ +- if (pmd_trans_unstable(fe->pmd) || pmd_devmap(*fe->pmd)) ++ if (pmd_devmap_trans_unstable(fe->pmd)) + return 0; + /* + * A regular pmd is established and it can't morph into a huge diff --git a/queue-4.9/mm-fail-get_vaddr_frames-for-filesystem-dax-mappings.patch b/queue-4.9/mm-fail-get_vaddr_frames-for-filesystem-dax-mappings.patch new file mode 100644 index 00000000000..4d4c2f930a6 --- /dev/null +++ b/queue-4.9/mm-fail-get_vaddr_frames-for-filesystem-dax-mappings.patch @@ -0,0 +1,66 @@ +From foo@baz Mon Feb 26 20:55:53 CET 2018 +From: Dan Williams +Date: Fri, 23 Feb 2018 14:06:16 -0800 +Subject: mm: fail get_vaddr_frames() for filesystem-dax mappings +To: gregkh@linuxfoundation.org +Cc: Jan Kara , Joonyoung Shim , linux-kernel@vger.kernel.org, Seung-Woo Kim , Jeff Moyer , stable@vger.kernel.org, Christoph Hellwig , Inki Dae , Doug Ledford , Jason Gunthorpe , Mel Gorman , Andrew Morton , Ross Zwisler , Kyungmin Park , Sean Hefty , Mauro Carvalho Chehab , Linus Torvalds , Hal Rosenstock , Vlastimil Babka +Message-ID: <151942357601.21775.3085470269801679738.stgit@dwillia2-desk3.amr.corp.intel.com> + +From: Dan Williams + +commit b7f0554a56f21fb3e636a627450a9add030889be upstream. + +Until there is a solution to the dma-to-dax vs truncate problem it is +not safe to allow V4L2, Exynos, and other frame vector users to create +long standing / irrevocable memory registrations against filesytem-dax +vmas. 
+ +[dan.j.williams@intel.com: add comment for vma_is_fsdax() check in get_vaddr_frames(), per Jan] + Link: http://lkml.kernel.org/r/151197874035.26211.4061781453123083667.stgit@dwillia2-desk3.amr.corp.intel.com +Link: http://lkml.kernel.org/r/151068939985.7446.15684639617389154187.stgit@dwillia2-desk3.amr.corp.intel.com +Fixes: 3565fce3a659 ("mm, x86: get_user_pages() for dax mappings") +Signed-off-by: Dan Williams +Reviewed-by: Jan Kara +Cc: Inki Dae +Cc: Seung-Woo Kim +Cc: Joonyoung Shim +Cc: Kyungmin Park +Cc: Mauro Carvalho Chehab +Cc: Mel Gorman +Cc: Vlastimil Babka +Cc: Christoph Hellwig +Cc: Doug Ledford +Cc: Hal Rosenstock +Cc: Jason Gunthorpe +Cc: Jeff Moyer +Cc: Ross Zwisler +Cc: Sean Hefty +Cc: +Signed-off-by: Andrew Morton +Signed-off-by: Linus Torvalds +Signed-off-by: Greg Kroah-Hartman +--- + mm/frame_vector.c | 12 ++++++++++++ + 1 file changed, 12 insertions(+) + +--- a/mm/frame_vector.c ++++ b/mm/frame_vector.c +@@ -52,6 +52,18 @@ int get_vaddr_frames(unsigned long start + ret = -EFAULT; + goto out; + } ++ ++ /* ++ * While get_vaddr_frames() could be used for transient (kernel ++ * controlled lifetime) pinning of memory pages all current ++ * users establish long term (userspace controlled lifetime) ++ * page pinning. Treat get_vaddr_frames() like ++ * get_user_pages_longterm() and disallow it for filesystem-dax ++ * mappings. 
++ */ ++ if (vma_is_fsdax(vma)) ++ return -EOPNOTSUPP; ++ + if (!(vma->vm_flags & (VM_IO | VM_PFNMAP))) { + vec->got_ref = true; + vec->is_pfns = false; diff --git a/queue-4.9/mm-fix-devm_memremap_pages-collision-handling.patch b/queue-4.9/mm-fix-devm_memremap_pages-collision-handling.patch new file mode 100644 index 00000000000..8a8d21c8d3d --- /dev/null +++ b/queue-4.9/mm-fix-devm_memremap_pages-collision-handling.patch @@ -0,0 +1,81 @@ +From foo@baz Mon Feb 26 20:55:53 CET 2018 +From: Dan Williams +Date: Fri, 23 Feb 2018 14:06:10 -0800 +Subject: mm: Fix devm_memremap_pages() collision handling +To: gregkh@linuxfoundation.org +Cc: +Message-ID: <151942357089.21775.3486425046348885247.stgit@dwillia2-desk3.amr.corp.intel.com> + +From: Jan H. Schönherr + +commit 77dd66a3c67c93ab401ccc15efff25578be281fd upstream. + +If devm_memremap_pages() detects a collision while adding entries +to the radix-tree, we call pgmap_radix_release(). Unfortunately, +the function removes *all* entries for the range -- including the +entries that caused the collision in the first place. + +Modify pgmap_radix_release() to take an additional argument to +indicate where to stop, so that only newly added entries are removed +from the tree. + +Cc: +Fixes: 9476df7d80df ("mm: introduce find_dev_pagemap()") +Signed-off-by: Jan H. 
Schönherr +Signed-off-by: Dan Williams +Signed-off-by: Greg Kroah-Hartman +--- + kernel/memremap.c | 13 ++++++++----- + 1 file changed, 8 insertions(+), 5 deletions(-) + +--- a/kernel/memremap.c ++++ b/kernel/memremap.c +@@ -194,7 +194,7 @@ void put_zone_device_page(struct page *p + } + EXPORT_SYMBOL(put_zone_device_page); + +-static void pgmap_radix_release(struct resource *res) ++static void pgmap_radix_release(struct resource *res, resource_size_t end_key) + { + resource_size_t key, align_start, align_size, align_end; + +@@ -203,8 +203,11 @@ static void pgmap_radix_release(struct r + align_end = align_start + align_size - 1; + + mutex_lock(&pgmap_lock); +- for (key = res->start; key <= res->end; key += SECTION_SIZE) ++ for (key = res->start; key <= res->end; key += SECTION_SIZE) { ++ if (key >= end_key) ++ break; + radix_tree_delete(&pgmap_radix, key >> PA_SECTION_SHIFT); ++ } + mutex_unlock(&pgmap_lock); + } + +@@ -255,7 +258,7 @@ static void devm_memremap_pages_release( + unlock_device_hotplug(); + + untrack_pfn(NULL, PHYS_PFN(align_start), align_size); +- pgmap_radix_release(res); ++ pgmap_radix_release(res, -1); + dev_WARN_ONCE(dev, pgmap->altmap && pgmap->altmap->alloc, + "%s: failed to free all reserved pages\n", __func__); + } +@@ -289,7 +292,7 @@ struct dev_pagemap *find_dev_pagemap(res + void *devm_memremap_pages(struct device *dev, struct resource *res, + struct percpu_ref *ref, struct vmem_altmap *altmap) + { +- resource_size_t key, align_start, align_size, align_end; ++ resource_size_t key = 0, align_start, align_size, align_end; + pgprot_t pgprot = PAGE_KERNEL; + struct dev_pagemap *pgmap; + struct page_map *page_map; +@@ -392,7 +395,7 @@ void *devm_memremap_pages(struct device + untrack_pfn(NULL, PHYS_PFN(align_start), align_size); + err_pfn_remap: + err_radix: +- pgmap_radix_release(res); ++ pgmap_radix_release(res, key); + devres_free(page_map); + return ERR_PTR(error); + } diff --git a/queue-4.9/mm-introduce-get_user_pages_longterm.patch 
b/queue-4.9/mm-introduce-get_user_pages_longterm.patch new file mode 100644 index 00000000000..91ffc2e8325 --- /dev/null +++ b/queue-4.9/mm-introduce-get_user_pages_longterm.patch @@ -0,0 +1,232 @@ +From foo@baz Mon Feb 26 20:55:53 CET 2018 +From: Dan Williams +Date: Fri, 23 Feb 2018 14:05:49 -0800 +Subject: mm: introduce get_user_pages_longterm +To: gregkh@linuxfoundation.org +Cc: Jan Kara , Joonyoung Shim , linux-kernel@vger.kernel.org, Seung-Woo Kim , Doug Ledford , stable@vger.kernel.org, Christoph Hellwig , Inki Dae , Jeff Moyer , Jason Gunthorpe , Mel Gorman , Andrew Morton , Ross Zwisler , Kyungmin Park , Sean Hefty , Mauro Carvalho Chehab , Linus Torvalds , Hal Rosenstock , Vlastimil Babka +Message-ID: <151942354920.21775.1595898555475851190.stgit@dwillia2-desk3.amr.corp.intel.com> + +From: Dan Williams + +commit 2bb6d2837083de722bfdc369cb0d76ce188dd9b4 upstream. + +Patch series "introduce get_user_pages_longterm()", v2. + +Here is a new get_user_pages api for cases where a driver intends to +keep an elevated page count indefinitely. This is distinct from usages +like iov_iter_get_pages where the elevated page counts are transient. +The iov_iter_get_pages cases immediately turn around and submit the +pages to a device driver which will put_page when the i/o operation +completes (under kernel control). + +In the longterm case userspace is responsible for dropping the page +reference at some undefined point in the future. This is untenable for +filesystem-dax case where the filesystem is in control of the lifetime +of the block / page and needs reasonable limits on how long it can wait +for pages in a mapping to become idle. + +Fixing filesystems to actually wait for dax pages to be idle before +blocks from a truncate/hole-punch operation are repurposed is saved for +a later patch series. 
+ +Also, allowing longterm registration of dax mappings is a future patch +series that introduces a "map with lease" semantic where the kernel can +revoke a lease and force userspace to drop its page references. + +I have also tagged these for -stable to purposely break cases that might +assume that longterm memory registrations for filesystem-dax mappings +were supported by the kernel. The behavior regression this policy +change implies is one of the reasons we maintain the "dax enabled. +Warning: EXPERIMENTAL, use at your own risk" notification when mounting +a filesystem in dax mode. + +It is worth noting the device-dax interface does not suffer the same +constraints since it does not support file space management operations +like hole-punch. + +This patch (of 4): + +Until there is a solution to the dma-to-dax vs truncate problem it is +not safe to allow long standing memory registrations against +filesytem-dax vmas. Device-dax vmas do not have this problem and are +explicitly allowed. + +This is temporary until a "memory registration with layout-lease" +mechanism can be implemented for the affected sub-systems (RDMA and +V4L2). 
+ +[akpm@linux-foundation.org: use kcalloc()] +Link: http://lkml.kernel.org/r/151068939435.7446.13560129395419350737.stgit@dwillia2-desk3.amr.corp.intel.com +Fixes: 3565fce3a659 ("mm, x86: get_user_pages() for dax mappings") +Signed-off-by: Dan Williams +Suggested-by: Christoph Hellwig +Cc: Doug Ledford +Cc: Hal Rosenstock +Cc: Inki Dae +Cc: Jan Kara +Cc: Jason Gunthorpe +Cc: Jeff Moyer +Cc: Joonyoung Shim +Cc: Kyungmin Park +Cc: Mauro Carvalho Chehab +Cc: Mel Gorman +Cc: Ross Zwisler +Cc: Sean Hefty +Cc: Seung-Woo Kim +Cc: Vlastimil Babka +Cc: +Signed-off-by: Andrew Morton +Signed-off-by: Linus Torvalds +Signed-off-by: Greg Kroah-Hartman +--- + include/linux/dax.h | 5 ---- + include/linux/fs.h | 20 ++++++++++++++++ + include/linux/mm.h | 13 ++++++++++ + mm/gup.c | 64 ++++++++++++++++++++++++++++++++++++++++++++++++++++ + 4 files changed, 97 insertions(+), 5 deletions(-) + +--- a/include/linux/dax.h ++++ b/include/linux/dax.h +@@ -61,11 +61,6 @@ static inline int dax_pmd_fault(struct v + int dax_pfn_mkwrite(struct vm_area_struct *, struct vm_fault *); + #define dax_mkwrite(vma, vmf, gb) dax_fault(vma, vmf, gb) + +-static inline bool vma_is_dax(struct vm_area_struct *vma) +-{ +- return vma->vm_file && IS_DAX(vma->vm_file->f_mapping->host); +-} +- + static inline bool dax_mapping(struct address_space *mapping) + { + return mapping->host && IS_DAX(mapping->host); +--- a/include/linux/fs.h ++++ b/include/linux/fs.h +@@ -18,6 +18,7 @@ + #include + #include + #include ++#include + #include + #include + #include +@@ -3033,6 +3034,25 @@ static inline bool io_is_direct(struct f + return (filp->f_flags & O_DIRECT) || IS_DAX(filp->f_mapping->host); + } + ++static inline bool vma_is_dax(struct vm_area_struct *vma) ++{ ++ return vma->vm_file && IS_DAX(vma->vm_file->f_mapping->host); ++} ++ ++static inline bool vma_is_fsdax(struct vm_area_struct *vma) ++{ ++ struct inode *inode; ++ ++ if (!vma->vm_file) ++ return false; ++ if (!vma_is_dax(vma)) ++ return false; ++ inode = 
file_inode(vma->vm_file); ++ if (inode->i_mode == S_IFCHR) ++ return false; /* device-dax */ ++ return true; ++} ++ + static inline int iocb_flags(struct file *file) + { + int res = 0; +--- a/include/linux/mm.h ++++ b/include/linux/mm.h +@@ -1288,6 +1288,19 @@ long __get_user_pages_unlocked(struct ta + struct page **pages, unsigned int gup_flags); + long get_user_pages_unlocked(unsigned long start, unsigned long nr_pages, + struct page **pages, unsigned int gup_flags); ++#ifdef CONFIG_FS_DAX ++long get_user_pages_longterm(unsigned long start, unsigned long nr_pages, ++ unsigned int gup_flags, struct page **pages, ++ struct vm_area_struct **vmas); ++#else ++static inline long get_user_pages_longterm(unsigned long start, ++ unsigned long nr_pages, unsigned int gup_flags, ++ struct page **pages, struct vm_area_struct **vmas) ++{ ++ return get_user_pages(start, nr_pages, gup_flags, pages, vmas); ++} ++#endif /* CONFIG_FS_DAX */ ++ + int get_user_pages_fast(unsigned long start, int nr_pages, int write, + struct page **pages); + +--- a/mm/gup.c ++++ b/mm/gup.c +@@ -982,6 +982,70 @@ long get_user_pages(unsigned long start, + } + EXPORT_SYMBOL(get_user_pages); + ++#ifdef CONFIG_FS_DAX ++/* ++ * This is the same as get_user_pages() in that it assumes we are ++ * operating on the current task's mm, but it goes further to validate ++ * that the vmas associated with the address range are suitable for ++ * longterm elevated page reference counts. For example, filesystem-dax ++ * mappings are subject to the lifetime enforced by the filesystem and ++ * we need guarantees that longterm users like RDMA and V4L2 only ++ * establish mappings that have a kernel enforced revocation mechanism. ++ * ++ * "longterm" == userspace controlled elevated page count lifetime. ++ * Contrast this to iov_iter_get_pages() usages which are transient. 
++ */ ++long get_user_pages_longterm(unsigned long start, unsigned long nr_pages, ++ unsigned int gup_flags, struct page **pages, ++ struct vm_area_struct **vmas_arg) ++{ ++ struct vm_area_struct **vmas = vmas_arg; ++ struct vm_area_struct *vma_prev = NULL; ++ long rc, i; ++ ++ if (!pages) ++ return -EINVAL; ++ ++ if (!vmas) { ++ vmas = kcalloc(nr_pages, sizeof(struct vm_area_struct *), ++ GFP_KERNEL); ++ if (!vmas) ++ return -ENOMEM; ++ } ++ ++ rc = get_user_pages(start, nr_pages, gup_flags, pages, vmas); ++ ++ for (i = 0; i < rc; i++) { ++ struct vm_area_struct *vma = vmas[i]; ++ ++ if (vma == vma_prev) ++ continue; ++ ++ vma_prev = vma; ++ ++ if (vma_is_fsdax(vma)) ++ break; ++ } ++ ++ /* ++ * Either get_user_pages() failed, or the vma validation ++ * succeeded, in either case we don't need to put_page() before ++ * returning. ++ */ ++ if (i >= rc) ++ goto out; ++ ++ for (i = 0; i < rc; i++) ++ put_page(pages[i]); ++ rc = -EOPNOTSUPP; ++out: ++ if (vmas != vmas_arg) ++ kfree(vmas); ++ return rc; ++} ++EXPORT_SYMBOL(get_user_pages_longterm); ++#endif /* CONFIG_FS_DAX */ ++ + /** + * populate_vma_page_range() - populate a range of pages in the vma. 
+ * @vma: target vma diff --git a/queue-4.9/series b/queue-4.9/series index ec070f0961f..0bfd576472f 100644 --- a/queue-4.9/series +++ b/queue-4.9/series @@ -26,3 +26,14 @@ drm-amdgpu-avoid-leaking-pm-domain-on-driver-unbind-v2.patch drm-amdgpu-add-new-device-to-use-atpx-quirk.patch binder-add-missing-binder_unlock.patch x.509-fix-null-dereference-when-restricting-key-with-unsupported_sig.patch +mm-avoid-spurious-bad-pmd-warning-messages.patch +fs-dax.c-fix-inefficiency-in-dax_writeback_mapping_range.patch +libnvdimm-fix-integer-overflow-static-analysis-warning.patch +device-dax-implement-split-to-catch-invalid-munmap-attempts.patch +mm-introduce-get_user_pages_longterm.patch +v4l2-disable-filesystem-dax-mapping-support.patch +ib-core-disable-memory-registration-of-filesystem-dax-vmas.patch +libnvdimm-dax-fix-1gb-aligned-namespaces-vs-physical-misalignment.patch +mm-fix-devm_memremap_pages-collision-handling.patch +mm-fail-get_vaddr_frames-for-filesystem-dax-mappings.patch +x86-entry-64-clear-extra-registers-beyond-syscall-arguments-to-reduce-speculation-attack-surface.patch diff --git a/queue-4.9/v4l2-disable-filesystem-dax-mapping-support.patch b/queue-4.9/v4l2-disable-filesystem-dax-mapping-support.patch new file mode 100644 index 00000000000..ea3a0d1c381 --- /dev/null +++ b/queue-4.9/v4l2-disable-filesystem-dax-mapping-support.patch @@ -0,0 +1,69 @@ +From foo@baz Mon Feb 26 20:55:53 CET 2018 +From: Dan Williams +Date: Fri, 23 Feb 2018 14:05:54 -0800 +Subject: v4l2: disable filesystem-dax mapping support +To: gregkh@linuxfoundation.org +Cc: Jan Kara , Joonyoung Shim , linux-kernel@vger.kernel.org, Seung-Woo Kim , Doug Ledford , stable@vger.kernel.org, Christoph Hellwig , Inki Dae , Jeff Moyer , Jason Gunthorpe , Mel Gorman , Andrew Morton , Ross Zwisler , Kyungmin Park , Sean Hefty , Mauro Carvalho Chehab , Linus Torvalds , Hal Rosenstock , Vlastimil Babka +Message-ID: <151942355435.21775.3892492011172127062.stgit@dwillia2-desk3.amr.corp.intel.com> + +From: Dan 
Williams + +commit b70131de648c2b997d22f4653934438013f407a1 upstream. + +V4L2 memory registrations are incompatible with filesystem-dax that +needs the ability to revoke dma access to a mapping at will, or +otherwise allow the kernel to wait for completion of DMA. The +filesystem-dax implementation breaks the traditional solution of +truncate of active file backed mappings since there is no page-cache +page we can orphan to sustain ongoing DMA. + +If v4l2 wants to support long lived DMA mappings it needs to arrange to +hold a file lease or use some other mechanism so that the kernel can +coordinate revoking DMA access when the filesystem needs to truncate +mappings. + +Link: http://lkml.kernel.org/r/151068940499.7446.12846708245365671207.stgit@dwillia2-desk3.amr.corp.intel.com +Fixes: 3565fce3a659 ("mm, x86: get_user_pages() for dax mappings") +Signed-off-by: Dan Williams +Reported-by: Jan Kara +Reviewed-by: Jan Kara +Cc: Mauro Carvalho Chehab +Cc: Christoph Hellwig +Cc: Doug Ledford +Cc: Hal Rosenstock +Cc: Inki Dae +Cc: Jason Gunthorpe +Cc: Jeff Moyer +Cc: Joonyoung Shim +Cc: Kyungmin Park +Cc: Mel Gorman +Cc: Ross Zwisler +Cc: Sean Hefty +Cc: Seung-Woo Kim +Cc: Vlastimil Babka +Cc: +Signed-off-by: Andrew Morton +Signed-off-by: Linus Torvalds +Signed-off-by: Greg Kroah-Hartman +--- + drivers/media/v4l2-core/videobuf-dma-sg.c | 5 +++-- + 1 file changed, 3 insertions(+), 2 deletions(-) + +--- a/drivers/media/v4l2-core/videobuf-dma-sg.c ++++ b/drivers/media/v4l2-core/videobuf-dma-sg.c +@@ -185,12 +185,13 @@ static int videobuf_dma_init_user_locked + dprintk(1, "init user [0x%lx+0x%lx => %d pages]\n", + data, size, dma->nr_pages); + +- err = get_user_pages(data & PAGE_MASK, dma->nr_pages, ++ err = get_user_pages_longterm(data & PAGE_MASK, dma->nr_pages, + flags, dma->pages, NULL); + + if (err != dma->nr_pages) { + dma->nr_pages = (err >= 0) ? 
err : 0; +- dprintk(1, "get_user_pages: err=%d [%d]\n", err, dma->nr_pages); ++ dprintk(1, "get_user_pages_longterm: err=%d [%d]\n", err, ++ dma->nr_pages); + return err < 0 ? err : -EINVAL; + } + return 0; diff --git a/queue-4.9/x.509-fix-null-dereference-when-restricting-key-with-unsupported_sig.patch b/queue-4.9/x.509-fix-null-dereference-when-restricting-key-with-unsupported_sig.patch index 6e055a6a23d..b75a76379e9 100644 --- a/queue-4.9/x.509-fix-null-dereference-when-restricting-key-with-unsupported_sig.patch +++ b/queue-4.9/x.509-fix-null-dereference-when-restricting-key-with-unsupported_sig.patch @@ -6,7 +6,6 @@ To: stable@vger.kernel.org, Greg Kroah-Hartman Cc: keyrings@vger.kernel.org, Eric Biggers , David Howells Message-ID: <20180226181715.194965-1-ebiggers3@gmail.com> - From: Eric Biggers commit 4b34968e77ad09628cfb3c4a7daf2adc2cefc6e8 upstream. diff --git a/queue-4.9/x86-entry-64-clear-extra-registers-beyond-syscall-arguments-to-reduce-speculation-attack-surface.patch b/queue-4.9/x86-entry-64-clear-extra-registers-beyond-syscall-arguments-to-reduce-speculation-attack-surface.patch new file mode 100644 index 00000000000..18d7e22c8a4 --- /dev/null +++ b/queue-4.9/x86-entry-64-clear-extra-registers-beyond-syscall-arguments-to-reduce-speculation-attack-surface.patch @@ -0,0 +1,78 @@ +From foo@baz Mon Feb 26 20:55:53 CET 2018 +From: Dan Williams +Date: Fri, 23 Feb 2018 14:06:21 -0800 +Subject: x86/entry/64: Clear extra registers beyond syscall arguments, to reduce speculation attack surface +To: gregkh@linuxfoundation.org +Cc: Andi Kleen , Denys Vlasenko , Peter Zijlstra , Brian Gerst , "H. Peter Anvin" , linux-kernel@vger.kernel.org, stable@vger.kernel.org, Borislav Petkov , Andy Lutomirski , Josh Poimboeuf , Thomas Gleixner , Linus Torvalds , Ingo Molnar +Message-ID: <151942358116.21775.14209781084277174517.stgit@dwillia2-desk3.amr.corp.intel.com> + +From: Dan Williams + +commit 8e1eb3fa009aa7c0b944b3c8b26b07de0efb3200 upstream. 
+ +At entry userspace may have (maliciously) populated the extra registers +outside the syscall calling convention with arbitrary values that could +be useful in a speculative execution (Spectre style) attack. + +Clear these registers to minimize the kernel's attack surface. + +Note, this only clears the extra registers and not the unused +registers for syscalls less than 6 arguments, since those registers are +likely to be clobbered well before their values could be put to use +under speculation. + +Note, Linus found that the XOR instructions can be executed with +minimized cost if interleaved with the PUSH instructions, and Ingo's +analysis found that R10 and R11 should be included in the register +clearing beyond the typical 'extra' syscall calling convention +registers. + +Suggested-by: Linus Torvalds +Reported-by: Andi Kleen +Signed-off-by: Dan Williams +Cc: +Cc: Andy Lutomirski +Cc: Borislav Petkov +Cc: Brian Gerst +Cc: Denys Vlasenko +Cc: H. Peter Anvin +Cc: Josh Poimboeuf +Cc: Peter Zijlstra +Cc: Thomas Gleixner +Link: http://lkml.kernel.org/r/151787988577.7847.16733592218894189003.stgit@dwillia2-desk3.amr.corp.intel.com +[ Made small improvements to the changelog and the code comments. ] +Signed-off-by: Ingo Molnar +Signed-off-by: Greg Kroah-Hartman +--- + arch/x86/entry/entry_64.S | 13 +++++++++++++ + 1 file changed, 13 insertions(+) + +--- a/arch/x86/entry/entry_64.S ++++ b/arch/x86/entry/entry_64.S +@@ -176,13 +176,26 @@ GLOBAL(entry_SYSCALL_64_after_swapgs) + pushq %r8 /* pt_regs->r8 */ + pushq %r9 /* pt_regs->r9 */ + pushq %r10 /* pt_regs->r10 */ ++ /* ++ * Clear extra registers that a speculation attack might ++ * otherwise want to exploit. 
Interleave XOR with PUSH ++ * for better uop scheduling: ++ */ ++ xorq %r10, %r10 /* nospec r10 */ + pushq %r11 /* pt_regs->r11 */ ++ xorq %r11, %r11 /* nospec r11 */ + pushq %rbx /* pt_regs->rbx */ ++ xorl %ebx, %ebx /* nospec rbx */ + pushq %rbp /* pt_regs->rbp */ ++ xorl %ebp, %ebp /* nospec rbp */ + pushq %r12 /* pt_regs->r12 */ ++ xorq %r12, %r12 /* nospec r12 */ + pushq %r13 /* pt_regs->r13 */ ++ xorq %r13, %r13 /* nospec r13 */ + pushq %r14 /* pt_regs->r14 */ ++ xorq %r14, %r14 /* nospec r14 */ + pushq %r15 /* pt_regs->r15 */ ++ xorq %r15, %r15 /* nospec r15 */ + + /* IRQs are off. */ + movq %rsp, %rdi
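Note on the get_user_pages_longterm() patch queued above: its control flow (pin every page, walk the returned vmas while skipping consecutive repeats, and drop every pin if any vma is filesystem-dax) can be modelled in plain userspace C. The following is only a rough sketch, not kernel code. The model_vma and model_page types, the is_fsdax flag, and both model_* functions are hypothetical stand-ins invented for illustration; the real function operates on struct vm_area_struct and struct page and uses kcalloc(), put_page() and vma_is_fsdax().

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-ins for kernel types; not the real definitions. */
struct model_vma {
	bool is_fsdax;		/* what vma_is_fsdax() would report */
};

struct model_page {
	int refcount;		/* models the elevated page reference count */
	struct model_vma *vma;	/* vma that maps this page */
};

/* Models get_user_pages(): pins every page and reports its vma. */
static long model_get_user_pages(struct model_page *range, long nr_pages,
				 struct model_page **pages,
				 struct model_vma **vmas)
{
	for (long i = 0; i < nr_pages; i++) {
		range[i].refcount++;
		pages[i] = &range[i];
		vmas[i] = range[i].vma;
	}
	return nr_pages;
}

/*
 * Models the validation in get_user_pages_longterm(). Returns the
 * number of pinned pages, or a negative errno with no pins held.
 */
long model_get_user_pages_longterm(struct model_page *range, long nr_pages,
				   struct model_page **pages)
{
	struct model_vma **vmas;
	struct model_vma *vma_prev = NULL;
	long rc, i;

	/* Rejecting nr_pages <= 0 is a simplification of this model. */
	if (!pages || nr_pages <= 0)
		return -EINVAL;

	vmas = calloc((size_t)nr_pages, sizeof(*vmas));
	if (!vmas)
		return -ENOMEM;

	rc = model_get_user_pages(range, nr_pages, pages, vmas);

	for (i = 0; i < rc; i++) {
		/* Consecutive pages usually share a vma: check it once. */
		if (vmas[i] == vma_prev)
			continue;
		vma_prev = vmas[i];
		if (vmas[i]->is_fsdax)
			break;
	}

	/* Loop stopped early: a filesystem-dax vma was found, unpin all. */
	if (i < rc) {
		for (i = 0; i < rc; i++)
			pages[i]->refcount--;
		rc = -EOPNOTSUPP;
	}

	free(vmas);
	return rc;
}
```

A caller in the style of videobuf_dma_init_user_locked() would treat any return value other than nr_pages as a failure; in this model, a range containing even one page from a vma with is_fsdax set yields -EOPNOTSUPP and leaves every refcount unchanged.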