Cc: Guenter Roeck <linux@roeck-us.net>, Todd Kjos <tkjos@android.com>, Eric Biggers <ebiggers@google.com>
Message-ID: <20180226185645.241652-1-ebiggers3@gmail.com>
-
From: Eric Biggers <ebiggers@google.com>
When commit 4be5a2810489 ("binder: check for binder_thread allocation
--- /dev/null
+From foo@baz Mon Feb 26 20:55:53 CET 2018
+From: Dan Williams <dan.j.williams@intel.com>
+Date: Fri, 23 Feb 2018 14:05:43 -0800
+Subject: device-dax: implement ->split() to catch invalid munmap attempts
+To: gregkh@linuxfoundation.org
+Cc: Jeff Moyer <jmoyer@redhat.com>, Linus Torvalds <torvalds@linux-foundation.org>, Andrew Morton <akpm@linux-foundation.org>, stable@vger.kernel.org, linux-kernel@vger.kernel.org
+Message-ID: <151942354379.21775.5321017414392517094.stgit@dwillia2-desk3.amr.corp.intel.com>
+
+From: Dan Williams <dan.j.williams@intel.com>
+
+commit 9702cffdbf2129516db679e4467db81e1cd287da upstream.
+
+Similar to how device-dax enforces that the 'address', 'offset', and
+'len' parameters to mmap() be aligned to the device's fundamental
+alignment, the same constraints apply to munmap(). Implement ->split()
+to fail munmap calls that violate the alignment constraint.
+
+Otherwise, we later fail VM_BUG_ON checks in the unmap_page_range() path
+with crash signatures of the form:
+
+ vma ffff8800b60c8a88 start 00007f88c0000000 end 00007f88c0e00000
+ next (null) prev (null) mm ffff8800b61150c0
+ prot 8000000000000027 anon_vma (null) vm_ops ffffffffa0091240
+ pgoff 0 file ffff8800b638ef80 private_data (null)
+ flags: 0x380000fb(read|write|shared|mayread|maywrite|mayexec|mayshare|softdirty|mixedmap|hugepage)
+ ------------[ cut here ]------------
+ kernel BUG at mm/huge_memory.c:2014!
+ [..]
+ RIP: 0010:__split_huge_pud+0x12a/0x180
+ [..]
+ Call Trace:
+ unmap_page_range+0x245/0xa40
+ ? __vma_adjust+0x301/0x990
+ unmap_vmas+0x4c/0xa0
+ unmap_region+0xae/0x120
+ ? __vma_rb_erase+0x11a/0x230
+ do_munmap+0x276/0x410
+ vm_munmap+0x6a/0xa0
+ SyS_munmap+0x1d/0x30
+
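+For illustration, a hypothetical userspace sketch of the now-rejected
+usage (assumes a device-dax instance with 2MB alignment at /dev/dax0.0;
+needs <fcntl.h> and <sys/mman.h>):
+
+	int fd = open("/dev/dax0.0", O_RDWR);
+	void *addr = mmap(NULL, 4UL << 20, PROT_READ | PROT_WRITE,
+			  MAP_SHARED, fd, 0);
+
+	/* splits the vma at addr + 1MB, which is not 2MB-aligned:
+	 * ->split() now rejects this with -EINVAL */
+	munmap(addr, 1UL << 20);
+
+	/* full, aligned unmap: no split needed, still succeeds */
+	munmap(addr, 4UL << 20);
+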
+Link: http://lkml.kernel.org/r/151130418681.4029.7118245855057952010.stgit@dwillia2-desk3.amr.corp.intel.com
+Fixes: dee410792419 ("/dev/dax, core: file operations and dax-mmap")
+Signed-off-by: Dan Williams <dan.j.williams@intel.com>
+Reported-by: Jeff Moyer <jmoyer@redhat.com>
+Cc: <stable@vger.kernel.org>
+Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
+Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
+Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
+---
+ drivers/dax/dax.c | 12 ++++++++++++
+ 1 file changed, 12 insertions(+)
+
+--- a/drivers/dax/dax.c
++++ b/drivers/dax/dax.c
+@@ -453,9 +453,21 @@ static int dax_dev_pmd_fault(struct vm_a
+ return rc;
+ }
+
++static int dax_dev_split(struct vm_area_struct *vma, unsigned long addr)
++{
++ struct file *filp = vma->vm_file;
++ struct dax_dev *dax_dev = filp->private_data;
++ struct dax_region *dax_region = dax_dev->region;
++
++ if (!IS_ALIGNED(addr, dax_region->align))
++ return -EINVAL;
++ return 0;
++}
++
+ static const struct vm_operations_struct dax_dev_vm_ops = {
+ .fault = dax_dev_fault,
+ .pmd_fault = dax_dev_pmd_fault,
++ .split = dax_dev_split,
+ };
+
+ static int dax_mmap(struct file *filp, struct vm_area_struct *vma)
--- /dev/null
+From foo@baz Mon Feb 26 20:55:53 CET 2018
+From: Dan Williams <dan.j.williams@intel.com>
+Date: Fri, 23 Feb 2018 14:05:33 -0800
+Subject: fs/dax.c: fix inefficiency in dax_writeback_mapping_range()
+To: gregkh@linuxfoundation.org
+Cc: Jan Kara <jack@suse.cz>, linux-kernel@vger.kernel.org, stable@vger.kernel.org, Ross Zwisler <ross.zwisler@linux.intel.com>, Linus Torvalds <torvalds@linux-foundation.org>, Andrew Morton <akpm@linux-foundation.org>
+Message-ID: <151942353293.21775.3589635231521871832.stgit@dwillia2-desk3.amr.corp.intel.com>
+
+From: Jan Kara <jack@suse.cz>
+
+commit 1eb643d02b21412e603b42cdd96010a2ac31c05f upstream.
+
+dax_writeback_mapping_range() fails to update the iteration index when
+searching the radix tree for entries needing cache flushing. Thus each
+pagevec worth of entries is searched starting from the beginning, which
+is inefficient and prone to livelocks. Update the index properly.
+
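+The loop in question follows the usual pagevec scan pattern; an
+abridged sketch of fs/dax.c (not the verbatim code):
+
+	while (!done) {
+		pvec.nr = find_get_entries_tag(mapping, start_index,
+				PAGECACHE_TAG_TOWRITE, PAGEVEC_SIZE,
+				pvec.pages, indices);
+		if (pvec.nr == 0)
+			break;
+
+		/* ... write back each entry via dax_writeback_one() ... */
+
+		/* the fix: advance past the entries just processed */
+		start_index = indices[pvec.nr - 1] + 1;
+	}
+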
+Link: http://lkml.kernel.org/r/20170619124531.21491-1-jack@suse.cz
+Fixes: 9973c98ecfda3 ("dax: add support for fsync/sync")
+Signed-off-by: Jan Kara <jack@suse.cz>
+Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
+Cc: Dan Williams <dan.j.williams@intel.com>
+Cc: <stable@vger.kernel.org>
+Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
+Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
+Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
+---
+ fs/dax.c | 1 +
+ 1 file changed, 1 insertion(+)
+
+--- a/fs/dax.c
++++ b/fs/dax.c
+@@ -785,6 +785,7 @@ int dax_writeback_mapping_range(struct a
+ if (ret < 0)
+ return ret;
+ }
++ start_index = indices[pvec.nr - 1] + 1;
+ }
+ return 0;
+ }
--- /dev/null
+From foo@baz Mon Feb 26 20:55:53 CET 2018
+From: Dan Williams <dan.j.williams@intel.com>
+Date: Fri, 23 Feb 2018 14:06:00 -0800
+Subject: IB/core: disable memory registration of filesystem-dax vmas
+To: gregkh@linuxfoundation.org
+Cc: Sean Hefty <sean.hefty@intel.com>, Jan Kara <jack@suse.cz>, Joonyoung Shim <jy0922.shim@samsung.com>, linux-kernel@vger.kernel.org, Seung-Woo Kim <sw0312.kim@samsung.com>, Jeff Moyer <jmoyer@redhat.com>, stable@vger.kernel.org, Christoph Hellwig <hch@lst.de>, Inki Dae <inki.dae@samsung.com>, Doug Ledford <dledford@redhat.com>, Jason Gunthorpe <jgg@mellanox.com>, Mel Gorman <mgorman@suse.de>, Ross Zwisler <ross.zwisler@linux.intel.com>, Kyungmin Park <kyungmin.park@samsung.com>, Andrew Morton <akpm@linux-foundation.org>, Mauro Carvalho Chehab <mchehab@kernel.org>, Linus Torvalds <torvalds@linux-foundation.org>, Hal Rosenstock <hal.rosenstock@gmail.com>, Vlastimil Babka <vbabka@suse.cz>
+Message-ID: <151942356005.21775.11352557058864235434.stgit@dwillia2-desk3.amr.corp.intel.com>
+
+From: Dan Williams <dan.j.williams@intel.com>
+
+commit 5f1d43de54164dcfb9bfa542fcc92c1e1a1b6c1d upstream.
+
+Until there is a solution to the dma-to-dax vs truncate problem it is
+not safe to allow RDMA to create long-standing memory registrations
+against filesystem-dax vmas.
+
+Link: http://lkml.kernel.org/r/151068941011.7446.7766030590347262502.stgit@dwillia2-desk3.amr.corp.intel.com
+Fixes: 3565fce3a659 ("mm, x86: get_user_pages() for dax mappings")
+Signed-off-by: Dan Williams <dan.j.williams@intel.com>
+Reported-by: Christoph Hellwig <hch@lst.de>
+Reviewed-by: Christoph Hellwig <hch@lst.de>
+Acked-by: Jason Gunthorpe <jgg@mellanox.com>
+Acked-by: Doug Ledford <dledford@redhat.com>
+Cc: Sean Hefty <sean.hefty@intel.com>
+Cc: Hal Rosenstock <hal.rosenstock@gmail.com>
+Cc: Jeff Moyer <jmoyer@redhat.com>
+Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
+Cc: Inki Dae <inki.dae@samsung.com>
+Cc: Jan Kara <jack@suse.cz>
+Cc: Joonyoung Shim <jy0922.shim@samsung.com>
+Cc: Kyungmin Park <kyungmin.park@samsung.com>
+Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
+Cc: Mel Gorman <mgorman@suse.de>
+Cc: Seung-Woo Kim <sw0312.kim@samsung.com>
+Cc: Vlastimil Babka <vbabka@suse.cz>
+Cc: <stable@vger.kernel.org>
+Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
+Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
+Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
+---
+ drivers/infiniband/core/umem.c | 2 +-
+ 1 file changed, 1 insertion(+), 1 deletion(-)
+
+--- a/drivers/infiniband/core/umem.c
++++ b/drivers/infiniband/core/umem.c
+@@ -193,7 +193,7 @@ struct ib_umem *ib_umem_get(struct ib_uc
+ sg_list_start = umem->sg_head.sgl;
+
+ while (npages) {
+- ret = get_user_pages(cur_base,
++ ret = get_user_pages_longterm(cur_base,
+ min_t(unsigned long, npages,
+ PAGE_SIZE / sizeof (struct page *)),
+ gup_flags, page_list, vma_list);
--- /dev/null
+From foo@baz Mon Feb 26 20:55:53 CET 2018
+From: Dan Williams <dan.j.williams@intel.com>
+Date: Fri, 23 Feb 2018 14:06:05 -0800
+Subject: libnvdimm, dax: fix 1GB-aligned namespaces vs physical misalignment
+To: gregkh@linuxfoundation.org
+Cc: Jane Chu <jane.chu@oracle.com>, linux-kernel@vger.kernel.org, stable@vger.kernel.org
+Message-ID: <151942356576.21775.15139045279160411096.stgit@dwillia2-desk3.amr.corp.intel.com>
+
+From: Dan Williams <dan.j.williams@intel.com>
+
+commit 41fce90f26333c4fa82e8e43b9ace86c4e8a0120 upstream.
+
+The following namespace configuration attempt:
+
+ # ndctl create-namespace -e namespace0.0 -m devdax -a 1G -f
+ libndctl: ndctl_dax_enable: dax0.1: failed to enable
+ Error: namespace0.0: failed to enable
+
+ failed to reconfigure namespace: No such device or address
+
+...fails when the backing memory range is not physically aligned to 1G:
+
+ # cat /proc/iomem | grep Persistent
+ 210000000-30fffffff : Persistent Memory (legacy)
+
+In the above example the 4G persistent memory range starts and ends on a
+256MB boundary.
+
+We already handle the case where a namespace violates section alignment
+(128MB) by padding/truncating it to avoid collisions against "System
+RAM"; we simply need to extend that padding/truncation to the 1GB
+alignment use case.
+
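+Worked through for the example range above (illustrative arithmetic,
+with nd_pfn->align == SZ_1G): start + size = 0x310000000, which is
+256MB- but not 1GB-aligned, so:
+
+	phys_pmem_align_down(nd_pfn, 0x310000000)
+		= min(PHYS_SECTION_ALIGN_DOWN(0x310000000), /* 0x310000000 */
+		      ALIGN_DOWN(0x310000000, SZ_1G))       /* 0x300000000 */
+		= 0x300000000
+
+	end_trunc = 0x310000000 - 0x300000000 = 256MB
+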
+Cc: <stable@vger.kernel.org>
+Fixes: 315c562536c4 ("libnvdimm, pfn: add 'align' attribute...")
+Reported-and-tested-by: Jane Chu <jane.chu@oracle.com>
+Signed-off-by: Dan Williams <dan.j.williams@intel.com>
+Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
+---
+ drivers/nvdimm/pfn_devs.c | 15 ++++++++++++---
+ include/linux/kernel.h | 1 +
+ 2 files changed, 13 insertions(+), 3 deletions(-)
+
+--- a/drivers/nvdimm/pfn_devs.c
++++ b/drivers/nvdimm/pfn_devs.c
+@@ -563,6 +563,12 @@ static struct vmem_altmap *__nvdimm_setu
+ return altmap;
+ }
+
++static u64 phys_pmem_align_down(struct nd_pfn *nd_pfn, u64 phys)
++{
++ return min_t(u64, PHYS_SECTION_ALIGN_DOWN(phys),
++ ALIGN_DOWN(phys, nd_pfn->align));
++}
++
+ static int nd_pfn_init(struct nd_pfn *nd_pfn)
+ {
+ u32 dax_label_reserve = is_nd_dax(&nd_pfn->dev) ? SZ_128K : 0;
+@@ -618,13 +624,16 @@ static int nd_pfn_init(struct nd_pfn *nd
+ start = nsio->res.start;
+ size = PHYS_SECTION_ALIGN_UP(start + size) - start;
+ if (region_intersects(start, size, IORESOURCE_SYSTEM_RAM,
+- IORES_DESC_NONE) == REGION_MIXED) {
++ IORES_DESC_NONE) == REGION_MIXED
++ || !IS_ALIGNED(start + resource_size(&nsio->res),
++ nd_pfn->align)) {
+ size = resource_size(&nsio->res);
+- end_trunc = start + size - PHYS_SECTION_ALIGN_DOWN(start + size);
++ end_trunc = start + size - phys_pmem_align_down(nd_pfn,
++ start + size);
+ }
+
+ if (start_pad + end_trunc)
+- dev_info(&nd_pfn->dev, "%s section collision, truncate %d bytes\n",
++ dev_info(&nd_pfn->dev, "%s alignment collision, truncate %d bytes\n",
+ dev_name(&ndns->dev), start_pad + end_trunc);
+
+ /*
+--- a/include/linux/kernel.h
++++ b/include/linux/kernel.h
+@@ -46,6 +46,7 @@
+ #define REPEAT_BYTE(x) ((~0ul / 0xff) * (x))
+
+ #define ALIGN(x, a) __ALIGN_KERNEL((x), (a))
++#define ALIGN_DOWN(x, a) __ALIGN_KERNEL((x) - ((a) - 1), (a))
+ #define __ALIGN_MASK(x, mask) __ALIGN_KERNEL_MASK((x), (mask))
+ #define PTR_ALIGN(p, a) ((typeof(p))ALIGN((unsigned long)(p), (a)))
+ #define IS_ALIGNED(x, a) (((x) & ((typeof(x))(a) - 1)) == 0)
--- /dev/null
+From foo@baz Mon Feb 26 20:55:53 CET 2018
+From: Dan Williams <dan.j.williams@intel.com>
+Date: Fri, 23 Feb 2018 14:05:38 -0800
+Subject: libnvdimm: fix integer overflow static analysis warning
+To: gregkh@linuxfoundation.org
+Cc: stable@vger.kernel.org, Dan Carpenter <dan.carpenter@oracle.com>, linux-kernel@vger.kernel.org
+Message-ID: <151942353841.21775.10479863744600514056.stgit@dwillia2-desk3.amr.corp.intel.com>
+
+From: Dan Williams <dan.j.williams@intel.com>
+
+commit 58738c495e15badd2015e19ff41f1f1ed55200bc upstream.
+
+Dan reports:
+ The patch 62232e45f4a2: "libnvdimm: control (ioctl) messages for
+ nvdimm_bus and nvdimm devices" from Jun 8, 2015, leads to the
+ following static checker warning:
+
+ drivers/nvdimm/bus.c:1018 __nd_ioctl()
+ warn: integer overflows 'buf_len'
+
+ From a casual review, this seems like it might be a real bug. On
+ the first iteration we load some data into in_env[]. On the second
+ iteration we read a use controlled "in_size" from nd_cmd_in_size().
+ It can go up to UINT_MAX - 1. A high number means we will fill the
+ whole in_env[] buffer. But we potentially keep looping and adding
+ more to in_len so now it can be any value.
+
+ It is simple enough to change, but it feels weird that we keep looping
+ even though in_env is totally full. Shouldn't we just return an
+ error if we don't have space for desc->in_num?
+
+We keep looping because the size of the total input is allowed to be
+bigger than the 'envelope', which is a subset of the payload that tells
+us how much data to expect. For safety, explicitly check that buf_len
+does not overflow, which is what the checker flagged.
+
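+A minimal sketch of the widened arithmetic (an illustration, not the
+patch itself):
+
+	u32 in_len = UINT_MAX, out_len = 2;
+	u64 buf_len;
+
+	/* a u32 sum would wrap to 1; the u64 sum cannot */
+	buf_len = (u64) out_len + (u64) in_len;	/* 0x100000001 */
+	if (buf_len > ND_IOCTL_MAX_BUFLEN)	/* now reliably caught */
+		return -EINVAL;
+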
+Cc: <stable@vger.kernel.org>
+Fixes: 62232e45f4a2 ("libnvdimm: control (ioctl) messages for nvdimm_bus and nvdimm devices")
+Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
+Signed-off-by: Dan Williams <dan.j.williams@intel.com>
+Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
+---
+ drivers/nvdimm/bus.c | 11 ++++++-----
+ 1 file changed, 6 insertions(+), 5 deletions(-)
+
+--- a/drivers/nvdimm/bus.c
++++ b/drivers/nvdimm/bus.c
+@@ -812,16 +812,17 @@ static int __nd_ioctl(struct nvdimm_bus
+ int read_only, unsigned int ioctl_cmd, unsigned long arg)
+ {
+ struct nvdimm_bus_descriptor *nd_desc = nvdimm_bus->nd_desc;
+- size_t buf_len = 0, in_len = 0, out_len = 0;
+ static char out_env[ND_CMD_MAX_ENVELOPE];
+ static char in_env[ND_CMD_MAX_ENVELOPE];
+ const struct nd_cmd_desc *desc = NULL;
+ unsigned int cmd = _IOC_NR(ioctl_cmd);
+ void __user *p = (void __user *) arg;
+ struct device *dev = &nvdimm_bus->dev;
+- struct nd_cmd_pkg pkg;
+ const char *cmd_name, *dimm_name;
++ u32 in_len = 0, out_len = 0;
+ unsigned long cmd_mask;
++ struct nd_cmd_pkg pkg;
++ u64 buf_len = 0;
+ void *buf;
+ int rc, i;
+
+@@ -882,7 +883,7 @@ static int __nd_ioctl(struct nvdimm_bus
+ }
+
+ if (cmd == ND_CMD_CALL) {
+- dev_dbg(dev, "%s:%s, idx: %llu, in: %zu, out: %zu, len %zu\n",
++ dev_dbg(dev, "%s:%s, idx: %llu, in: %u, out: %u, len %llu\n",
+ __func__, dimm_name, pkg.nd_command,
+ in_len, out_len, buf_len);
+
+@@ -912,9 +913,9 @@ static int __nd_ioctl(struct nvdimm_bus
+ out_len += out_size;
+ }
+
+- buf_len = out_len + in_len;
++ buf_len = (u64) out_len + (u64) in_len;
+ if (buf_len > ND_IOCTL_MAX_BUFLEN) {
+- dev_dbg(dev, "%s:%s cmd: %s buf_len: %zu > %d\n", __func__,
++ dev_dbg(dev, "%s:%s cmd: %s buf_len: %llu > %d\n", __func__,
+ dimm_name, cmd_name, buf_len,
+ ND_IOCTL_MAX_BUFLEN);
+ return -EINVAL;
--- /dev/null
+From foo@baz Mon Feb 26 20:55:53 CET 2018
+From: Dan Williams <dan.j.williams@intel.com>
+Date: Fri, 23 Feb 2018 14:05:27 -0800
+Subject: mm: avoid spurious 'bad pmd' warning messages
+To: gregkh@linuxfoundation.org
+Cc: Jan Kara <jack@suse.cz>, Eryu Guan <eguan@redhat.com>, Xiong Zhou <xzhou@redhat.com>, linux-kernel@vger.kernel.org, Matthew Wilcox <mawilcox@microsoft.com>, Christoph Hellwig <hch@lst.de>, stable@vger.kernel.org, Pawel Lebioda <pawel.lebioda@intel.com>, Dave Hansen <dave.hansen@intel.com>, Alexander Viro <viro@zeniv.linux.org.uk>, Ross Zwisler <ross.zwisler@linux.intel.com>, Dave Jiang <dave.jiang@intel.com>, Andrew Morton <akpm@linux-foundation.org>, Linus Torvalds <torvalds@linux-foundation.org>, "Darrick J. Wong" <darrick.wong@oracle.com>, "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
+Message-ID: <151942352781.21775.15841303754448120195.stgit@dwillia2-desk3.amr.corp.intel.com>
+
+From: Ross Zwisler <ross.zwisler@linux.intel.com>
+
+commit d0f0931de936a0a468d7e59284d39581c16d3a73 upstream.
+
+When the pmd_devmap() checks were added by 5c7fb56e5e3f ("mm, dax:
+dax-pmd vs thp-pmd vs hugetlbfs-pmd") to add better support for DAX huge
+pages, they were all added to the end of if() statements after existing
+pmd_trans_huge() checks. So, things like:
+
+ - if (pmd_trans_huge(*pmd))
+ + if (pmd_trans_huge(*pmd) || pmd_devmap(*pmd))
+
+When further checks were added after pmd_trans_unstable() checks by
+commit 7267ec008b5c ("mm: postpone page table allocation until we have
+page to map") they were also added at the end of the conditional:
+
+ + if (pmd_trans_unstable(fe->pmd) || pmd_devmap(*fe->pmd))
+
+This ordering is fine for pmd_trans_huge(), but doesn't work for
+pmd_trans_unstable(). This is because DAX huge pages trip the bad_pmd()
+check inside of pmd_none_or_trans_huge_or_clear_bad() (called by
+pmd_trans_unstable()), which prints out a warning and returns 1. So, we
+do end up doing the right thing, but only after spamming dmesg with
+suspicious-looking messages:
+
+ mm/pgtable-generic.c:39: bad pmd ffff8808daa49b88(84000001006000a5)
+
+Reorder these checks in a helper so that pmd_devmap() is checked first,
+avoiding the error messages, and add a comment explaining why the
+ordering is important.
+
+Fixes: 7267ec008b5c ("mm: postpone page table allocation until we have page to map")
+Link: http://lkml.kernel.org/r/20170522215749.23516-1-ross.zwisler@linux.intel.com
+Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
+Reviewed-by: Jan Kara <jack@suse.cz>
+Cc: Pawel Lebioda <pawel.lebioda@intel.com>
+Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
+Cc: Alexander Viro <viro@zeniv.linux.org.uk>
+Cc: Christoph Hellwig <hch@lst.de>
+Cc: Dan Williams <dan.j.williams@intel.com>
+Cc: Dave Hansen <dave.hansen@intel.com>
+Cc: Matthew Wilcox <mawilcox@microsoft.com>
+Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
+Cc: Dave Jiang <dave.jiang@intel.com>
+Cc: Xiong Zhou <xzhou@redhat.com>
+Cc: Eryu Guan <eguan@redhat.com>
+Cc: <stable@vger.kernel.org>
+Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
+Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
+Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
+---
+ mm/memory.c | 40 ++++++++++++++++++++++++++++++----------
+ 1 file changed, 30 insertions(+), 10 deletions(-)
+
+--- a/mm/memory.c
++++ b/mm/memory.c
+@@ -2848,6 +2848,17 @@ static int __do_fault(struct fault_env *
+ return ret;
+ }
+
++/*
++ * The ordering of these checks is important for pmds with _PAGE_DEVMAP set.
++ * If we check pmd_trans_unstable() first we will trip the bad_pmd() check
++ * inside of pmd_none_or_trans_huge_or_clear_bad(). This will end up correctly
++ * returning 1 but not before it spams dmesg with the pmd_clear_bad() output.
++ */
++static int pmd_devmap_trans_unstable(pmd_t *pmd)
++{
++ return pmd_devmap(*pmd) || pmd_trans_unstable(pmd);
++}
++
+ static int pte_alloc_one_map(struct fault_env *fe)
+ {
+ struct vm_area_struct *vma = fe->vma;
+@@ -2871,18 +2882,27 @@ static int pte_alloc_one_map(struct faul
+ map_pte:
+ /*
+ * If a huge pmd materialized under us just retry later. Use
+- * pmd_trans_unstable() instead of pmd_trans_huge() to ensure the pmd
+- * didn't become pmd_trans_huge under us and then back to pmd_none, as
+- * a result of MADV_DONTNEED running immediately after a huge pmd fault
+- * in a different thread of this mm, in turn leading to a misleading
+- * pmd_trans_huge() retval. All we have to ensure is that it is a
+- * regular pmd that we can walk with pte_offset_map() and we can do that
+- * through an atomic read in C, which is what pmd_trans_unstable()
+- * provides.
++ * pmd_trans_unstable() via pmd_devmap_trans_unstable() instead of
++ * pmd_trans_huge() to ensure the pmd didn't become pmd_trans_huge
++ * under us and then back to pmd_none, as a result of MADV_DONTNEED
++ * running immediately after a huge pmd fault in a different thread of
++ * this mm, in turn leading to a misleading pmd_trans_huge() retval.
++ * All we have to ensure is that it is a regular pmd that we can walk
++ * with pte_offset_map() and we can do that through an atomic read in
++ * C, which is what pmd_trans_unstable() provides.
+ */
+- if (pmd_trans_unstable(fe->pmd) || pmd_devmap(*fe->pmd))
++ if (pmd_devmap_trans_unstable(fe->pmd))
+ return VM_FAULT_NOPAGE;
+
++ /*
++ * At this point we know that our vmf->pmd points to a page of ptes
++ * and it cannot become pmd_none(), pmd_devmap() or pmd_trans_huge()
++ * for the duration of the fault. If a racing MADV_DONTNEED runs and
++ * we zap the ptes pointed to by our vmf->pmd, the vmf->ptl will still
++ * be valid and we will re-check to make sure the vmf->pte isn't
++ * pte_none() under vmf->ptl protection when we return to
++ * alloc_set_pte().
++ */
+ fe->pte = pte_offset_map_lock(vma->vm_mm, fe->pmd, fe->address,
+ &fe->ptl);
+ return 0;
+@@ -3456,7 +3476,7 @@ static int handle_pte_fault(struct fault
+ fe->pte = NULL;
+ } else {
+ /* See comment in pte_alloc_one_map() */
+- if (pmd_trans_unstable(fe->pmd) || pmd_devmap(*fe->pmd))
++ if (pmd_devmap_trans_unstable(fe->pmd))
+ return 0;
+ /*
+ * A regular pmd is established and it can't morph into a huge
--- /dev/null
+From foo@baz Mon Feb 26 20:55:53 CET 2018
+From: Dan Williams <dan.j.williams@intel.com>
+Date: Fri, 23 Feb 2018 14:06:16 -0800
+Subject: mm: fail get_vaddr_frames() for filesystem-dax mappings
+To: gregkh@linuxfoundation.org
+Cc: Jan Kara <jack@suse.cz>, Joonyoung Shim <jy0922.shim@samsung.com>, linux-kernel@vger.kernel.org, Seung-Woo Kim <sw0312.kim@samsung.com>, Jeff Moyer <jmoyer@redhat.com>, stable@vger.kernel.org, Christoph Hellwig <hch@lst.de>, Inki Dae <inki.dae@samsung.com>, Doug Ledford <dledford@redhat.com>, Jason Gunthorpe <jgg@mellanox.com>, Mel Gorman <mgorman@suse.de>, Andrew Morton <akpm@linux-foundation.org>, Ross Zwisler <ross.zwisler@linux.intel.com>, Kyungmin Park <kyungmin.park@samsung.com>, Sean Hefty <sean.hefty@intel.com>, Mauro Carvalho Chehab <mchehab@kernel.org>, Linus Torvalds <torvalds@linux-foundation.org>, Hal Rosenstock <hal.rosenstock@gmail.com>, Vlastimil Babka <vbabka@suse.cz>
+Message-ID: <151942357601.21775.3085470269801679738.stgit@dwillia2-desk3.amr.corp.intel.com>
+
+From: Dan Williams <dan.j.williams@intel.com>
+
+commit b7f0554a56f21fb3e636a627450a9add030889be upstream.
+
+Until there is a solution to the dma-to-dax vs truncate problem it is
+not safe to allow V4L2, Exynos, and other frame vector users to create
+long-standing / irrevocable memory registrations against filesystem-dax
+vmas.
+
+[dan.j.williams@intel.com: add comment for vma_is_fsdax() check in get_vaddr_frames(), per Jan]
+ Link: http://lkml.kernel.org/r/151197874035.26211.4061781453123083667.stgit@dwillia2-desk3.amr.corp.intel.com
+Link: http://lkml.kernel.org/r/151068939985.7446.15684639617389154187.stgit@dwillia2-desk3.amr.corp.intel.com
+Fixes: 3565fce3a659 ("mm, x86: get_user_pages() for dax mappings")
+Signed-off-by: Dan Williams <dan.j.williams@intel.com>
+Reviewed-by: Jan Kara <jack@suse.cz>
+Cc: Inki Dae <inki.dae@samsung.com>
+Cc: Seung-Woo Kim <sw0312.kim@samsung.com>
+Cc: Joonyoung Shim <jy0922.shim@samsung.com>
+Cc: Kyungmin Park <kyungmin.park@samsung.com>
+Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
+Cc: Mel Gorman <mgorman@suse.de>
+Cc: Vlastimil Babka <vbabka@suse.cz>
+Cc: Christoph Hellwig <hch@lst.de>
+Cc: Doug Ledford <dledford@redhat.com>
+Cc: Hal Rosenstock <hal.rosenstock@gmail.com>
+Cc: Jason Gunthorpe <jgg@mellanox.com>
+Cc: Jeff Moyer <jmoyer@redhat.com>
+Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
+Cc: Sean Hefty <sean.hefty@intel.com>
+Cc: <stable@vger.kernel.org>
+Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
+Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
+Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
+---
+ mm/frame_vector.c | 12 ++++++++++++
+ 1 file changed, 12 insertions(+)
+
+--- a/mm/frame_vector.c
++++ b/mm/frame_vector.c
+@@ -52,6 +52,18 @@ int get_vaddr_frames(unsigned long start
+ ret = -EFAULT;
+ goto out;
+ }
++
++ /*
++ * While get_vaddr_frames() could be used for transient (kernel
++ * controlled lifetime) pinning of memory pages all current
++ * users establish long term (userspace controlled lifetime)
++ * page pinning. Treat get_vaddr_frames() like
++ * get_user_pages_longterm() and disallow it for filesystem-dax
++ * mappings.
++ */
++ if (vma_is_fsdax(vma))
++ return -EOPNOTSUPP;
++
+ if (!(vma->vm_flags & (VM_IO | VM_PFNMAP))) {
+ vec->got_ref = true;
+ vec->is_pfns = false;
--- /dev/null
+From foo@baz Mon Feb 26 20:55:53 CET 2018
+From: Dan Williams <dan.j.williams@intel.com>
+Date: Fri, 23 Feb 2018 14:06:10 -0800
+Subject: mm: Fix devm_memremap_pages() collision handling
+To: gregkh@linuxfoundation.org
+Cc:
+Message-ID: <151942357089.21775.3486425046348885247.stgit@dwillia2-desk3.amr.corp.intel.com>
+
+From: Jan H. Schönherr <jschoenh@amazon.de>
+
+commit 77dd66a3c67c93ab401ccc15efff25578be281fd upstream.
+
+If devm_memremap_pages() detects a collision while adding entries
+to the radix-tree, we call pgmap_radix_release(). Unfortunately,
+the function removes *all* entries for the range -- including the
+entries that caused the collision in the first place.
+
+Modify pgmap_radix_release() to take an additional argument to
+indicate where to stop, so that only newly added entries are removed
+from the tree.
+
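+An illustration of the corrected error path (hypothetical section keys
+S0..S5):
+
+	/* pgmap A already owns sections S2..S3; a new request B spans
+	 * S0..S5, so insertion adds S0 and S1 and then collides at S2 */
+	pgmap_radix_release(res, key);	/* key == S2: deletes only the
+					 * freshly added S0 and S1; A's
+					 * entries at S2..S3 survive */
+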
+Cc: <stable@vger.kernel.org>
+Fixes: 9476df7d80df ("mm: introduce find_dev_pagemap()")
+Signed-off-by: Jan H. Schönherr <jschoenh@amazon.de>
+Signed-off-by: Dan Williams <dan.j.williams@intel.com>
+Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
+---
+ kernel/memremap.c | 13 ++++++++-----
+ 1 file changed, 8 insertions(+), 5 deletions(-)
+
+--- a/kernel/memremap.c
++++ b/kernel/memremap.c
+@@ -194,7 +194,7 @@ void put_zone_device_page(struct page *p
+ }
+ EXPORT_SYMBOL(put_zone_device_page);
+
+-static void pgmap_radix_release(struct resource *res)
++static void pgmap_radix_release(struct resource *res, resource_size_t end_key)
+ {
+ resource_size_t key, align_start, align_size, align_end;
+
+@@ -203,8 +203,11 @@ static void pgmap_radix_release(struct r
+ align_end = align_start + align_size - 1;
+
+ mutex_lock(&pgmap_lock);
+- for (key = res->start; key <= res->end; key += SECTION_SIZE)
++ for (key = res->start; key <= res->end; key += SECTION_SIZE) {
++ if (key >= end_key)
++ break;
+ radix_tree_delete(&pgmap_radix, key >> PA_SECTION_SHIFT);
++ }
+ mutex_unlock(&pgmap_lock);
+ }
+
+@@ -255,7 +258,7 @@ static void devm_memremap_pages_release(
+ unlock_device_hotplug();
+
+ untrack_pfn(NULL, PHYS_PFN(align_start), align_size);
+- pgmap_radix_release(res);
++ pgmap_radix_release(res, -1);
+ dev_WARN_ONCE(dev, pgmap->altmap && pgmap->altmap->alloc,
+ "%s: failed to free all reserved pages\n", __func__);
+ }
+@@ -289,7 +292,7 @@ struct dev_pagemap *find_dev_pagemap(res
+ void *devm_memremap_pages(struct device *dev, struct resource *res,
+ struct percpu_ref *ref, struct vmem_altmap *altmap)
+ {
+- resource_size_t key, align_start, align_size, align_end;
++ resource_size_t key = 0, align_start, align_size, align_end;
+ pgprot_t pgprot = PAGE_KERNEL;
+ struct dev_pagemap *pgmap;
+ struct page_map *page_map;
+@@ -392,7 +395,7 @@ void *devm_memremap_pages(struct device
+ untrack_pfn(NULL, PHYS_PFN(align_start), align_size);
+ err_pfn_remap:
+ err_radix:
+- pgmap_radix_release(res);
++ pgmap_radix_release(res, key);
+ devres_free(page_map);
+ return ERR_PTR(error);
+ }
--- /dev/null
+From foo@baz Mon Feb 26 20:55:53 CET 2018
+From: Dan Williams <dan.j.williams@intel.com>
+Date: Fri, 23 Feb 2018 14:05:49 -0800
+Subject: mm: introduce get_user_pages_longterm
+To: gregkh@linuxfoundation.org
+Cc: Jan Kara <jack@suse.cz>, Joonyoung Shim <jy0922.shim@samsung.com>, linux-kernel@vger.kernel.org, Seung-Woo Kim <sw0312.kim@samsung.com>, Doug Ledford <dledford@redhat.com>, stable@vger.kernel.org, Christoph Hellwig <hch@lst.de>, Inki Dae <inki.dae@samsung.com>, Jeff Moyer <jmoyer@redhat.com>, Jason Gunthorpe <jgg@mellanox.com>, Mel Gorman <mgorman@suse.de>, Andrew Morton <akpm@linux-foundation.org>, Ross Zwisler <ross.zwisler@linux.intel.com>, Kyungmin Park <kyungmin.park@samsung.com>, Sean Hefty <sean.hefty@intel.com>, Mauro Carvalho Chehab <mchehab@kernel.org>, Linus Torvalds <torvalds@linux-foundation.org>, Hal Rosenstock <hal.rosenstock@gmail.com>, Vlastimil Babka <vbabka@suse.cz>
+Message-ID: <151942354920.21775.1595898555475851190.stgit@dwillia2-desk3.amr.corp.intel.com>
+
+From: Dan Williams <dan.j.williams@intel.com>
+
+commit 2bb6d2837083de722bfdc369cb0d76ce188dd9b4 upstream.
+
+Patch series "introduce get_user_pages_longterm()", v2.
+
+Here is a new get_user_pages api for cases where a driver intends to
+keep an elevated page count indefinitely. This is distinct from usages
+like iov_iter_get_pages where the elevated page counts are transient.
+The iov_iter_get_pages cases immediately turn around and submit the
+pages to a device driver which will put_page when the i/o operation
+completes (under kernel control).
+
+In the longterm case userspace is responsible for dropping the page
+reference at some undefined point in the future. This is untenable for
+the filesystem-dax case, where the filesystem is in control of the lifetime
+of the block / page and needs reasonable limits on how long it can wait
+for pages in a mapping to become idle.
+
+Fixing filesystems to actually wait for dax pages to be idle before
+blocks from a truncate/hole-punch operation are repurposed is saved for
+a later patch series.
+
+Also, allowing longterm registration of dax mappings is a future patch
+series that introduces a "map with lease" semantic where the kernel can
+revoke a lease and force userspace to drop its page references.
+
+I have also tagged these for -stable to purposely break cases that might
+assume that longterm memory registrations for filesystem-dax mappings
+were supported by the kernel. The behavior regression this policy
+change implies is one of the reasons we maintain the "dax enabled.
+Warning: EXPERIMENTAL, use at your own risk" notification when mounting
+a filesystem in dax mode.
+
+It is worth noting the device-dax interface does not suffer the same
+constraints since it does not support file space management operations
+like hole-punch.
+
+This patch (of 4):
+
+Until there is a solution to the dma-to-dax vs truncate problem it is
+not safe to allow long-standing memory registrations against
+filesystem-dax vmas. Device-dax vmas do not have this problem and are
+explicitly allowed.
+
+This is temporary until a "memory registration with layout-lease"
+mechanism can be implemented for the affected sub-systems (RDMA and
+V4L2).
+
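+A hedged usage sketch (hypothetical driver code, not from this series):
+
+	long npinned;
+
+	/* pin pages for an indefinite, userspace-controlled lifetime */
+	npinned = get_user_pages_longterm(uaddr, npages, FOLL_WRITE,
+					  pages, NULL);
+	if (npinned == -EOPNOTSUPP)
+		return npinned;	/* a filesystem-dax vma was in the range */
+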
+[akpm@linux-foundation.org: use kcalloc()]
+Link: http://lkml.kernel.org/r/151068939435.7446.13560129395419350737.stgit@dwillia2-desk3.amr.corp.intel.com
+Fixes: 3565fce3a659 ("mm, x86: get_user_pages() for dax mappings")
+Signed-off-by: Dan Williams <dan.j.williams@intel.com>
+Suggested-by: Christoph Hellwig <hch@lst.de>
+Cc: Doug Ledford <dledford@redhat.com>
+Cc: Hal Rosenstock <hal.rosenstock@gmail.com>
+Cc: Inki Dae <inki.dae@samsung.com>
+Cc: Jan Kara <jack@suse.cz>
+Cc: Jason Gunthorpe <jgg@mellanox.com>
+Cc: Jeff Moyer <jmoyer@redhat.com>
+Cc: Joonyoung Shim <jy0922.shim@samsung.com>
+Cc: Kyungmin Park <kyungmin.park@samsung.com>
+Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
+Cc: Mel Gorman <mgorman@suse.de>
+Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
+Cc: Sean Hefty <sean.hefty@intel.com>
+Cc: Seung-Woo Kim <sw0312.kim@samsung.com>
+Cc: Vlastimil Babka <vbabka@suse.cz>
+Cc: <stable@vger.kernel.org>
+Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
+Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
+Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
+---
+ include/linux/dax.h | 5 ----
+ include/linux/fs.h | 20 ++++++++++++++++
+ include/linux/mm.h | 13 ++++++++++
+ mm/gup.c | 64 ++++++++++++++++++++++++++++++++++++++++++++++++++++
+ 4 files changed, 97 insertions(+), 5 deletions(-)
+
+--- a/include/linux/dax.h
++++ b/include/linux/dax.h
+@@ -61,11 +61,6 @@ static inline int dax_pmd_fault(struct v
+ int dax_pfn_mkwrite(struct vm_area_struct *, struct vm_fault *);
+ #define dax_mkwrite(vma, vmf, gb) dax_fault(vma, vmf, gb)
+
+-static inline bool vma_is_dax(struct vm_area_struct *vma)
+-{
+- return vma->vm_file && IS_DAX(vma->vm_file->f_mapping->host);
+-}
+-
+ static inline bool dax_mapping(struct address_space *mapping)
+ {
+ return mapping->host && IS_DAX(mapping->host);
+--- a/include/linux/fs.h
++++ b/include/linux/fs.h
+@@ -18,6 +18,7 @@
+ #include <linux/bug.h>
+ #include <linux/mutex.h>
+ #include <linux/rwsem.h>
++#include <linux/mm_types.h>
+ #include <linux/capability.h>
+ #include <linux/semaphore.h>
+ #include <linux/fiemap.h>
+@@ -3033,6 +3034,25 @@ static inline bool io_is_direct(struct f
+ return (filp->f_flags & O_DIRECT) || IS_DAX(filp->f_mapping->host);
+ }
+
++static inline bool vma_is_dax(struct vm_area_struct *vma)
++{
++ return vma->vm_file && IS_DAX(vma->vm_file->f_mapping->host);
++}
++
++static inline bool vma_is_fsdax(struct vm_area_struct *vma)
++{
++ struct inode *inode;
++
++ if (!vma->vm_file)
++ return false;
++ if (!vma_is_dax(vma))
++ return false;
++ inode = file_inode(vma->vm_file);
++ if (inode->i_mode == S_IFCHR)
++ return false; /* device-dax */
++ return true;
++}
++
+ static inline int iocb_flags(struct file *file)
+ {
+ int res = 0;
+--- a/include/linux/mm.h
++++ b/include/linux/mm.h
+@@ -1288,6 +1288,19 @@ long __get_user_pages_unlocked(struct ta
+ struct page **pages, unsigned int gup_flags);
+ long get_user_pages_unlocked(unsigned long start, unsigned long nr_pages,
+ struct page **pages, unsigned int gup_flags);
++#ifdef CONFIG_FS_DAX
++long get_user_pages_longterm(unsigned long start, unsigned long nr_pages,
++ unsigned int gup_flags, struct page **pages,
++ struct vm_area_struct **vmas);
++#else
++static inline long get_user_pages_longterm(unsigned long start,
++ unsigned long nr_pages, unsigned int gup_flags,
++ struct page **pages, struct vm_area_struct **vmas)
++{
++ return get_user_pages(start, nr_pages, gup_flags, pages, vmas);
++}
++#endif /* CONFIG_FS_DAX */
++
+ int get_user_pages_fast(unsigned long start, int nr_pages, int write,
+ struct page **pages);
+
+--- a/mm/gup.c
++++ b/mm/gup.c
+@@ -982,6 +982,70 @@ long get_user_pages(unsigned long start,
+ }
+ EXPORT_SYMBOL(get_user_pages);
+
++#ifdef CONFIG_FS_DAX
++/*
++ * This is the same as get_user_pages() in that it assumes we are
++ * operating on the current task's mm, but it goes further to validate
++ * that the vmas associated with the address range are suitable for
++ * longterm elevated page reference counts. For example, filesystem-dax
++ * mappings are subject to the lifetime enforced by the filesystem and
++ * we need guarantees that longterm users like RDMA and V4L2 only
++ * establish mappings that have a kernel enforced revocation mechanism.
++ *
++ * "longterm" == userspace controlled elevated page count lifetime.
++ * Contrast this to iov_iter_get_pages() usages which are transient.
++ */
++long get_user_pages_longterm(unsigned long start, unsigned long nr_pages,
++ unsigned int gup_flags, struct page **pages,
++ struct vm_area_struct **vmas_arg)
++{
++ struct vm_area_struct **vmas = vmas_arg;
++ struct vm_area_struct *vma_prev = NULL;
++ long rc, i;
++
++ if (!pages)
++ return -EINVAL;
++
++ if (!vmas) {
++ vmas = kcalloc(nr_pages, sizeof(struct vm_area_struct *),
++ GFP_KERNEL);
++ if (!vmas)
++ return -ENOMEM;
++ }
++
++ rc = get_user_pages(start, nr_pages, gup_flags, pages, vmas);
++
++ for (i = 0; i < rc; i++) {
++ struct vm_area_struct *vma = vmas[i];
++
++ if (vma == vma_prev)
++ continue;
++
++ vma_prev = vma;
++
++ if (vma_is_fsdax(vma))
++ break;
++ }
++
++ /*
++ * Either get_user_pages() failed, or the vma validation
++ * succeeded, in either case we don't need to put_page() before
++ * returning.
++ */
++ if (i >= rc)
++ goto out;
++
++ for (i = 0; i < rc; i++)
++ put_page(pages[i]);
++ rc = -EOPNOTSUPP;
++out:
++ if (vmas != vmas_arg)
++ kfree(vmas);
++ return rc;
++}
++EXPORT_SYMBOL(get_user_pages_longterm);
++#endif /* CONFIG_FS_DAX */
++
+ /**
+ * populate_vma_page_range() - populate a range of pages in the vma.
+ * @vma: target vma
drm-amdgpu-add-new-device-to-use-atpx-quirk.patch
binder-add-missing-binder_unlock.patch
x.509-fix-null-dereference-when-restricting-key-with-unsupported_sig.patch
+mm-avoid-spurious-bad-pmd-warning-messages.patch
+fs-dax.c-fix-inefficiency-in-dax_writeback_mapping_range.patch
+libnvdimm-fix-integer-overflow-static-analysis-warning.patch
+device-dax-implement-split-to-catch-invalid-munmap-attempts.patch
+mm-introduce-get_user_pages_longterm.patch
+v4l2-disable-filesystem-dax-mapping-support.patch
+ib-core-disable-memory-registration-of-filesystem-dax-vmas.patch
+libnvdimm-dax-fix-1gb-aligned-namespaces-vs-physical-misalignment.patch
+mm-fix-devm_memremap_pages-collision-handling.patch
+mm-fail-get_vaddr_frames-for-filesystem-dax-mappings.patch
+x86-entry-64-clear-extra-registers-beyond-syscall-arguments-to-reduce-speculation-attack-surface.patch
--- /dev/null
+From foo@baz Mon Feb 26 20:55:53 CET 2018
+From: Dan Williams <dan.j.williams@intel.com>
+Date: Fri, 23 Feb 2018 14:05:54 -0800
+Subject: v4l2: disable filesystem-dax mapping support
+To: gregkh@linuxfoundation.org
+Cc: Jan Kara <jack@suse.cz>, Joonyoung Shim <jy0922.shim@samsung.com>, linux-kernel@vger.kernel.org, Seung-Woo Kim <sw0312.kim@samsung.com>, Doug Ledford <dledford@redhat.com>, stable@vger.kernel.org, Christoph Hellwig <hch@lst.de>, Inki Dae <inki.dae@samsung.com>, Jeff Moyer <jmoyer@redhat.com>, Jason Gunthorpe <jgg@mellanox.com>, Mel Gorman <mgorman@suse.de>, Andrew Morton <akpm@linux-foundation.org>, Ross Zwisler <ross.zwisler@linux.intel.com>, Kyungmin Park <kyungmin.park@samsung.com>, Sean Hefty <sean.hefty@intel.com>, Mauro Carvalho Chehab <mchehab@kernel.org>, Linus Torvalds <torvalds@linux-foundation.org>, Hal Rosenstock <hal.rosenstock@gmail.com>, Vlastimil Babka <vbabka@suse.cz>
+Message-ID: <151942355435.21775.3892492011172127062.stgit@dwillia2-desk3.amr.corp.intel.com>
+
+From: Dan Williams <dan.j.williams@intel.com>
+
+commit b70131de648c2b997d22f4653934438013f407a1 upstream.
+
+V4L2 memory registrations are incompatible with filesystem-dax that
+needs the ability to revoke dma access to a mapping at will, or
+otherwise allow the kernel to wait for completion of DMA. The
+filesystem-dax implementation breaks the traditional solution of
+truncate of active file backed mappings since there is no page-cache
+page we can orphan to sustain ongoing DMA.
+
+If v4l2 wants to support long-lived DMA mappings it needs to arrange to
+hold a file lease or use some other mechanism so that the kernel can
+coordinate revoking DMA access when the filesystem needs to truncate
+mappings.
+
+Link: http://lkml.kernel.org/r/151068940499.7446.12846708245365671207.stgit@dwillia2-desk3.amr.corp.intel.com
+Fixes: 3565fce3a659 ("mm, x86: get_user_pages() for dax mappings")
+Signed-off-by: Dan Williams <dan.j.williams@intel.com>
+Reported-by: Jan Kara <jack@suse.cz>
+Reviewed-by: Jan Kara <jack@suse.cz>
+Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
+Cc: Christoph Hellwig <hch@lst.de>
+Cc: Doug Ledford <dledford@redhat.com>
+Cc: Hal Rosenstock <hal.rosenstock@gmail.com>
+Cc: Inki Dae <inki.dae@samsung.com>
+Cc: Jason Gunthorpe <jgg@mellanox.com>
+Cc: Jeff Moyer <jmoyer@redhat.com>
+Cc: Joonyoung Shim <jy0922.shim@samsung.com>
+Cc: Kyungmin Park <kyungmin.park@samsung.com>
+Cc: Mel Gorman <mgorman@suse.de>
+Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
+Cc: Sean Hefty <sean.hefty@intel.com>
+Cc: Seung-Woo Kim <sw0312.kim@samsung.com>
+Cc: Vlastimil Babka <vbabka@suse.cz>
+Cc: <stable@vger.kernel.org>
+Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
+Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
+Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
+---
+ drivers/media/v4l2-core/videobuf-dma-sg.c | 5 +++--
+ 1 file changed, 3 insertions(+), 2 deletions(-)
+
+--- a/drivers/media/v4l2-core/videobuf-dma-sg.c
++++ b/drivers/media/v4l2-core/videobuf-dma-sg.c
+@@ -185,12 +185,13 @@ static int videobuf_dma_init_user_locked
+ dprintk(1, "init user [0x%lx+0x%lx => %d pages]\n",
+ data, size, dma->nr_pages);
+
+- err = get_user_pages(data & PAGE_MASK, dma->nr_pages,
++ err = get_user_pages_longterm(data & PAGE_MASK, dma->nr_pages,
+ flags, dma->pages, NULL);
+
+ if (err != dma->nr_pages) {
+ dma->nr_pages = (err >= 0) ? err : 0;
+- dprintk(1, "get_user_pages: err=%d [%d]\n", err, dma->nr_pages);
++ dprintk(1, "get_user_pages_longterm: err=%d [%d]\n", err,
++ dma->nr_pages);
+ return err < 0 ? err : -EINVAL;
+ }
+ return 0;
Cc: keyrings@vger.kernel.org, Eric Biggers <ebiggers@google.com>, David Howells <dhowells@redhat.com>
Message-ID: <20180226181715.194965-1-ebiggers3@gmail.com>
-
From: Eric Biggers <ebiggers@google.com>
commit 4b34968e77ad09628cfb3c4a7daf2adc2cefc6e8 upstream.
--- /dev/null
+From foo@baz Mon Feb 26 20:55:53 CET 2018
+From: Dan Williams <dan.j.williams@intel.com>
+Date: Fri, 23 Feb 2018 14:06:21 -0800
+Subject: x86/entry/64: Clear extra registers beyond syscall arguments, to reduce speculation attack surface
+To: gregkh@linuxfoundation.org
+Cc: Andi Kleen <ak@linux.intel.com>, Denys Vlasenko <dvlasenk@redhat.com>, Peter Zijlstra <peterz@infradead.org>, Brian Gerst <brgerst@gmail.com>, "H. Peter Anvin" <hpa@zytor.com>, linux-kernel@vger.kernel.org, stable@vger.kernel.org, Borislav Petkov <bp@alien8.de>, Andy Lutomirski <luto@kernel.org>, Josh Poimboeuf <jpoimboe@redhat.com>, Thomas Gleixner <tglx@linutronix.de>, Linus Torvalds <torvalds@linux-foundation.org>, Ingo Molnar <mingo@kernel.org>
+Message-ID: <151942358116.21775.14209781084277174517.stgit@dwillia2-desk3.amr.corp.intel.com>
+
+From: Dan Williams <dan.j.williams@intel.com>
+
+commit 8e1eb3fa009aa7c0b944b3c8b26b07de0efb3200 upstream.
+
+At entry userspace may have (maliciously) populated the extra registers
+outside the syscall calling convention with arbitrary values that could
+be useful in a speculative execution (Spectre style) attack.
+
+Clear these registers to minimize the kernel's attack surface.
+
+Note, this only clears the extra registers and not the unused
+registers for syscalls with fewer than 6 arguments, since those registers are
+likely to be clobbered well before their values could be put to use
+under speculation.
+
+Note, Linus found that the XOR instructions can be executed with
+minimized cost if interleaved with the PUSH instructions, and Ingo's
+analysis found that R10 and R11 should be included in the register
+clearing beyond the typical 'extra' syscall calling convention
+registers.
+
+Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
+Reported-by: Andi Kleen <ak@linux.intel.com>
+Signed-off-by: Dan Williams <dan.j.williams@intel.com>
+Cc: <stable@vger.kernel.org>
+Cc: Andy Lutomirski <luto@kernel.org>
+Cc: Borislav Petkov <bp@alien8.de>
+Cc: Brian Gerst <brgerst@gmail.com>
+Cc: Denys Vlasenko <dvlasenk@redhat.com>
+Cc: H. Peter Anvin <hpa@zytor.com>
+Cc: Josh Poimboeuf <jpoimboe@redhat.com>
+Cc: Peter Zijlstra <peterz@infradead.org>
+Cc: Thomas Gleixner <tglx@linutronix.de>
+Link: http://lkml.kernel.org/r/151787988577.7847.16733592218894189003.stgit@dwillia2-desk3.amr.corp.intel.com
+[ Made small improvements to the changelog and the code comments. ]
+Signed-off-by: Ingo Molnar <mingo@kernel.org>
+Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
+---
+ arch/x86/entry/entry_64.S | 13 +++++++++++++
+ 1 file changed, 13 insertions(+)
+
+--- a/arch/x86/entry/entry_64.S
++++ b/arch/x86/entry/entry_64.S
+@@ -176,13 +176,26 @@ GLOBAL(entry_SYSCALL_64_after_swapgs)
+ pushq %r8 /* pt_regs->r8 */
+ pushq %r9 /* pt_regs->r9 */
+ pushq %r10 /* pt_regs->r10 */
++ /*
++ * Clear extra registers that a speculation attack might
++ * otherwise want to exploit. Interleave XOR with PUSH
++ * for better uop scheduling:
++ */
++ xorq %r10, %r10 /* nospec r10 */
+ pushq %r11 /* pt_regs->r11 */
++ xorq %r11, %r11 /* nospec r11 */
+ pushq %rbx /* pt_regs->rbx */
++ xorl %ebx, %ebx /* nospec rbx */
+ pushq %rbp /* pt_regs->rbp */
++ xorl %ebp, %ebp /* nospec rbp */
+ pushq %r12 /* pt_regs->r12 */
++ xorq %r12, %r12 /* nospec r12 */
+ pushq %r13 /* pt_regs->r13 */
++ xorq %r13, %r13 /* nospec r13 */
+ pushq %r14 /* pt_regs->r14 */
++ xorq %r14, %r14 /* nospec r14 */
+ pushq %r15 /* pt_regs->r15 */
++ xorq %r15, %r15 /* nospec r15 */
+
+ /* IRQs are off. */
+ movq %rsp, %rdi