ipc_addid() makes a new ipc identifier visible to everyone. New objects
start as locked, so that the caller can complete the initialization
after the call. Within struct sem_array, at least sma->sem_base and
sma->sem_nsems are accessed without any locks, therefore this approach
doesn't work.
Thus: Move the ipc_addid() to the end of the initialization.
Signed-off-by: Manfred Spraul <manfred@colorfullife.com> Reported-by: Rik van Riel <riel@redhat.com> Acked-by: Rik van Riel <riel@redhat.com> Acked-by: Davidlohr Bueso <dave@stgolabs.net> Acked-by: Rafael Aquini <aquini@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
Commit f8d960524328 ("sctp: Enforce retransmission limit during shutdown")
fixed a problem with excessive retransmissions in the SHUTDOWN_PENDING by not
resetting the association overall_error_count. This allowed the association
to better enforce assoc.max_retrans limit.
However, the same issue still exists when the association is in SHUTDOWN_RECEIVED
state. In this state, HB-ACKs will continue to reset the overall_error_count
for the association would extend the lifetime of association unnecessarily.
This patch solves this by resetting the overall_error_count whenever the current
state is small then SCTP_STATE_SHUTDOWN_PENDING. As a small side-effect, we
end up also handling SCTP_STATE_SHUTDOWN_ACK_SENT and SCTP_STATE_SHUTDOWN_SENT
states, but they are not really impacted because we disable Heartbeats in those
states.
Fixes: Commit f8d960524328 ("sctp: Enforce retransmission limit during shutdown") Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Acked-by: Vlad Yasevich <vyasevich@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
It was possible for an attacking user to trick root (or another user) into
writing his coredumps into an attacker-readable, pre-existing file using
rename() or link(), causing the disclosure of secret data from the victim
process' virtual memory. Depending on the configuration, it was also
possible to trick root into overwriting system files with coredumps. Fix
that issue by never writing coredumps into existing files.
Requirements for the attack:
- The attack only applies if the victim's process has a nonzero
RLIMIT_CORE and is dumpable.
- The attacker can trick the victim into coredumping into an
attacker-writable directory D, either because the core_pattern is
relative and the victim's cwd is attacker-writable or because an
absolute core_pattern pointing to a world-writable directory is used.
- The attacker has one of these:
A: on a system with protected_hardlinks=0:
execute access to a folder containing a victim-owned,
attacker-readable file on the same partition as D, and the
victim-owned file will be deleted before the main part of the attack
takes place. (In practice, there are lots of files that fulfill
this condition, e.g. entries in Debian's /var/lib/dpkg/info/.)
This does not apply to most Linux systems because most distros set
protected_hardlinks=1.
B: on a system with protected_hardlinks=1:
execute access to a folder containing a victim-owned,
attacker-readable and attacker-writable file on the same partition
as D, and the victim-owned file will be deleted before the main part
of the attack takes place.
(This seems to be uncommon.)
C: on any system, independent of protected_hardlinks:
write access to a non-sticky folder containing a victim-owned,
attacker-readable file on the same partition as D
(This seems to be uncommon.)
The basic idea is that the attacker moves the victim-owned file to where
he expects the victim process to dump its core. The victim process dumps
its core into the existing file, and the attacker reads the coredump from
it.
If the attacker can't move the file because he does not have write access
to the containing directory, he can instead link the file to a directory
he controls, then wait for the original link to the file to be deleted
(because the kernel checks that the link count of the corefile is 1).
A less reliable variant that requires D to be non-sticky works with link()
and does not require deletion of the original link: link() the file into
D, but then unlink() it directly before the kernel performs the link count
check.
On systems with protected_hardlinks=0, this variant allows an attacker to
not only gain information from coredumps, but also clobber existing,
victim-writable files with coredumps. (This could theoretically lead to a
privilege escalation.)
Signed-off-by: Jann Horn <jann@thejh.net> Cc: Kees Cook <keescook@chromium.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
The commit "drm/vmwgfx: Fix up user_dmabuf refcounting", while fixing a
kernel crash introduced a NULL pointer dereference on older hardware.
Fix this.
Signed-off-by: Thomas Hellstrom <thellstrom@vmware.com> Reviewed-by: Sinclair Yeh <syeh@vmware.com> Reviewed-by: Brian Paul <brianp@vmware.com> Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
Kernels after v3.9 use kmalloc_size(INDEX_NODE + 1) to get the next
larger cache size than the size index INDEX_NODE mapping. In kernels
3.9 and earlier we used malloc_sizes[INDEX_L3 + 1].cs_size.
However, sometimes we can't get the right output we expected via
kmalloc_size(INDEX_NODE + 1), causing a BUG().
The mapping table in the latest kernel is like:
index = {0, 1, 2 , 3, 4, 5, 6, n}
size = {0, 96, 192, 8, 16, 32, 64, 2^n}
The mapping table before 3.10 is like this:
index = {0 , 1 , 2, 3, 4 , 5 , 6, n}
size = {32, 64, 96, 128, 192, 256, 512, 2^(n+3)}
The problem on my mips64 machine is as follows:
(1) When configured DEBUG_SLAB && DEBUG_PAGEALLOC && DEBUG_LOCK_ALLOC
&& DEBUG_SPINLOCK, the sizeof(struct kmem_cache_node) will be "150",
and the macro INDEX_NODE turns out to be "2": #define INDEX_NODE
kmalloc_index(sizeof(struct kmem_cache_node))
(2) Then the result of kmalloc_size(INDEX_NODE + 1) is 8.
(3) Then "if(size >= kmalloc_size(INDEX_NODE + 1)" will lead to "size
= PAGE_SIZE".
(4) Then "if ((size >= (PAGE_SIZE >> 3))" test will be satisfied and
"flags |= CFLGS_OFF_SLAB" will be covered.
(5) if (flags & CFLGS_OFF_SLAB)" test will be satisfied and will go to
"cachep->slabp_cache = kmalloc_slab(slab_size, 0u)", and the result
here may be NULL while kernel bootup.
(6) Finally,"BUG_ON(ZERO_OR_NULL_PTR(cachep->slabp_cache));" causes the
BUG info as the following shows (may be only mips64 has this problem):
This patch fixes the problem of kmalloc_size(INDEX_NODE + 1) and removes
the BUG by adding 'size >= 256' check to guarantee that all necessary
small sized slabs are initialized regardless sequence of slab size in
mapping table.
Fixes: e33660165c90 ("slab: Use common kmalloc_index/kmalloc_size...") Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Reported-by: Liuhailong <liu.hailong6@zte.com.cn> Acked-by: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
In case we have less than maximum allowed channels (8) and autoconfiguration is
enabled the DWC_PARAMS read is wrong because it uses different arithmetic to
what is needed for channel priority setup.
Re-do the caclulations properly. This now works on AVR32 board well.
Fixes: fed2574b3c9f (dw_dmac: introduce software emulation of LLP transfers) Cc: yitian.bu@tangramtek.com Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Signed-off-by: Vinod Koul <vinod.koul@intel.com> Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
Unused space between the end of __ex_table and the start of
rodata can be left W+x in the kernel page tables. Extend the
setting of the NX bit to cover this gap by starting from
text_end rather than rodata_start.
Before:
---[ High Kernel Mapping ]---
0xffffffff80000000-0xffffffff81000000 16M pmd
0xffffffff81000000-0xffffffff81600000 6M ro PSE GLB x pmd
0xffffffff81600000-0xffffffff81754000 1360K ro GLB x pte
0xffffffff81754000-0xffffffff81800000 688K RW GLB x pte
0xffffffff81800000-0xffffffff81a00000 2M ro PSE GLB NX pmd
0xffffffff81a00000-0xffffffff81b3b000 1260K ro GLB NX pte
0xffffffff81b3b000-0xffffffff82000000 4884K RW GLB NX pte
0xffffffff82000000-0xffffffff82200000 2M RW PSE GLB NX pmd
0xffffffff82200000-0xffffffffa0000000 478M pmd
After:
---[ High Kernel Mapping ]---
0xffffffff80000000-0xffffffff81000000 16M pmd
0xffffffff81000000-0xffffffff81600000 6M ro PSE GLB x pmd
0xffffffff81600000-0xffffffff81754000 1360K ro GLB x pte
0xffffffff81754000-0xffffffff81800000 688K RW GLB NX pte
0xffffffff81800000-0xffffffff81a00000 2M ro PSE GLB NX pmd
0xffffffff81a00000-0xffffffff81b3b000 1260K ro GLB NX pte
0xffffffff81b3b000-0xffffffff82000000 4884K RW GLB NX pte
0xffffffff82000000-0xffffffff82200000 2M RW PSE GLB NX pmd
0xffffffff82200000-0xffffffffa0000000 478M pmd
Signed-off-by: Stephen Smalley <sds@tycho.nsa.gov> Acked-by: Kees Cook <keescook@chromium.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Link: http://lkml.kernel.org/r/1443704662-3138-1-git-send-email-sds@tycho.nsa.gov Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
I think I find a linux bug, I have the test cases is constructed. I
can stable recurring problems in fedora22(4.0.4) kernel version,
arch for x86_64. I construct transparent huge page, when the parent
and child process with MAP_SHARE, MAP_PRIVATE way to access the same
huge page area, it has the opportunity to lead to huge page copy on
write failure, and then it will munmap the child corresponding mmap
area, but then the child mmap area with VM_MAYSHARE attributes, child
process munmap this area can trigger VM_BUG_ON in set_vma_resv_flags
functions (vma - > vm_flags & VM_MAYSHARE).
There were a number of problems with the report (e.g. it's hugetlbfs that
triggers this, not transparent huge pages) but it was fundamentally
correct in that a VM_BUG_ON in set_vma_resv_flags() can be triggered that
looks like this
vma ffff8804651fd0d0 start 00007fc474e00000 end 00007fc475e00000
next ffff8804651fd018 prev ffff8804651fd188 mm ffff88046b1b1800
prot 8000000000000027 anon_vma (null) vm_ops ffffffff8182a7a0
pgoff 0 file ffff88106bdb9800 private_data (null)
flags: 0x84400fb(read|write|shared|mayread|maywrite|mayexec|mayshare|dontexpand|hugetlb)
------------
kernel BUG at mm/hugetlb.c:462!
SMP
Modules linked in: xt_pkttype xt_LOG xt_limit [..]
CPU: 38 PID: 26839 Comm: map Not tainted 4.0.4-default #1
Hardware name: Dell Inc. PowerEdge R810/0TT6JF, BIOS 2.7.4 04/26/2012
set_vma_resv_flags+0x2d/0x30
The VM_BUG_ON is correct because private and shared mappings have
different reservation accounting but the warning clearly shows that the
VMA is shared.
When a private COW fails to allocate a new page then only the process
that created the VMA gets the page -- all the children unmap the page.
If the children access that data in the future then they get killed.
The problem is that the same file is mapped shared and private. During
the COW, the allocation fails, the VMAs are traversed to unmap the other
private pages but a shared VMA is found and the bug is triggered. This
patch identifies such VMAs and skips them.
Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Reported-by: SunDong <sund_sky@126.com> Reviewed-by: Michal Hocko <mhocko@suse.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: David Rientjes <rientjes@google.com> Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
The cpu feature flags are not ever going to change, so warning
everytime can cause a lot of kernel log spam
(in our case more than 10GB/hour).
The warning seems to only occur when nested virtualization is
enabled, so it's probably triggered by a KVM bug. This is a
sensible and safe change anyway, and the KVM bug fix might not
be suitable for stable releases anyway.
Signed-off-by: Dirk Mueller <dmueller@suse.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
Beginning with UEFI v2.5 EFI_PROPERTIES_TABLE was introduced
that signals that the firmware PE/COFF loader supports splitting
code and data sections of PE/COFF images into separate EFI
memory map entries. This allows the kernel to map those regions
with strict memory protections, e.g. EFI_MEMORY_RO for code,
EFI_MEMORY_XP for data, etc.
Unfortunately, an unwritten requirement of this new feature is
that the regions need to be mapped with the same offsets
relative to each other as observed in the EFI memory map. If
this is not done crashes like this may occur,
Here 0xfffffffefe6086dd refers to an address the firmware
expects to be mapped but which the OS never claimed was mapped.
The issue is that included in these regions are relative
addresses to other regions which were emitted by the firmware
toolchain before the "splitting" of sections occurred at
runtime.
Needless to say, we don't satisfy this unwritten requirement on
x86_64 and instead map the EFI memory map entries in reverse
order. The above crash is almost certainly triggerable with any
kernel newer than v3.13 because that's when we rewrote the EFI
runtime region mapping code, in commit d2f7cbe7b26a ("x86/efi:
Runtime services virtual mapping"). For kernel versions before
v3.13 things may work by pure luck depending on the
fragmentation of the kernel virtual address space at the time we
map the EFI regions.
Instead of mapping the EFI memory map entries in reverse order,
where entry N has a higher virtual address than entry N+1, map
them in the same order as they appear in the EFI memory map to
preserve this relative offset between regions.
This patch has been kept as small as possible with the intention
that it should be applied aggressively to stable and
distribution kernels. It is very much a bugfix rather than
support for a new feature, since when EFI_PROPERTIES_TABLE is
enabled we must map things as outlined above to even boot - we
have no way of asking the firmware not to split the code/data
regions.
In fact, this patch doesn't even make use of the more strict
memory protections available in UEFI v2.5. That will come later.
Suggested-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> Reported-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> Signed-off-by: Matt Fleming <matt.fleming@intel.com> Cc: Borislav Petkov <bp@suse.de> Cc: Chun-Yi <jlee@suse.com> Cc: Dave Young <dyoung@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: James Bottomley <JBottomley@Odin.com> Cc: Lee, Chun-Yi <jlee@suse.com> Cc: Leif Lindholm <leif.lindholm@linaro.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Matthew Garrett <mjg59@srcf.ucam.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Jones <pjones@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Link: http://lkml.kernel.org/r/1443218539-7610-2-git-send-email-matt@codeblueprint.co.uk Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
Per-IRQ directories in procfs are created only when a handler is first
added to the irqdesc, not when the irqdesc is created. In the case of
a shared IRQ, multiple tasks can race to create a directory. This
race condition seems to have been present forever, but is easier to
hit with async probing.
As reported by Dmitry Vyukov, we really shouldn't do ipc_addid() before
having initialized the IPC object state. Yes, we initialize the IPC
object in a locked state, but with all the lockless RCU lookup work,
that IPC object lock no longer means that the state cannot be seen.
We already did this for the IPC semaphore code (see commit e8577d1f0329:
"ipc/sem.c: fully initialize sem_array before making it visible") but we
clearly forgot about msg and shm.
The CONFIG_MIPS_MT symbol can be selected by CONFIG_MIPS_VPE_LOADER in
addition to CONFIG_MIPS_MT_SMP. We only want MT code in the CPS SMP boot
vector if we're using MT for SMP. Thus switch the config symbol we ifdef
against to CONFIG_MIPS_MT_SMP.
Signed-off-by: Paul Burton <paul.burton@imgtec.com> Cc: Markos Chandras <markos.chandras@imgtec.com> Cc: James Hogan <james.hogan@imgtec.com> Cc: linux-mips@linux-mips.org Cc: linux-kernel@vger.kernel.org
Patchwork: https://patchwork.linux-mips.org/patch/10867/ Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
[ luis: backported to 3.16: adjusted context ] Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
The MT-specific code in mips_cps_boot_vpes can safely be omitted from
kernels which don't support MT, with the default VPE==0 case being used
as it would be after the has_mt (Config3.MT) check failed at runtime.
Discarding the code entirely will save us a few bytes & allow cleaner
handling of MT ASE instructions by later patches.
Signed-off-by: Paul Burton <paul.burton@imgtec.com> Cc: Markos Chandras <markos.chandras@imgtec.com> Cc: James Hogan <james.hogan@imgtec.com> Cc: linux-mips@linux-mips.org Cc: linux-kernel@vger.kernel.org
Patchwork: https://patchwork.linux-mips.org/patch/10866/ Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
[ luis: backported to 3.16: adjusted context ] Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
The has_mt macro ended with a branch, leaving its callers with a delay
slot that would be executed if Config3.MT is not set. However it would
not be executed if Config3 (or earlier Config registers) don't exist
which makes it somewhat inconsistent at best. Fill the delay slot in the
macro & fix the mips_cps_boot_vpes caller appropriately.
Signed-off-by: Paul Burton <paul.burton@imgtec.com> Cc: Markos Chandras <markos.chandras@imgtec.com> Cc: James Hogan <james.hogan@imgtec.com> Cc: linux-mips@linux-mips.org Cc: linux-kernel@vger.kernel.org
Patchwork: https://patchwork.linux-mips.org/patch/10865/ Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
[ luis: backported to 3.16: adjusted context ] Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
If there is a DMA zone (usually 24bit = 16MB I believe), but no DMA32
zone, as is the case for some 32-bit kernels, then massage_gfp_flags()
will cause DMA memory allocated for devices with a 32..63-bit
coherent_dma_mask to fall back to using __GFP_DMA, even though there may
only be 32-bits of physical address available anyway.
Correct that case to compare against a mask the size of phys_addr_t
instead of always using a 64-bit mask.
Signed-off-by: James Hogan <james.hogan@imgtec.com> Fixes: a2e715a86c6d ("MIPS: DMA: Fix computation of DMA flags from device's coherent_dma_mask.") Cc: Ralf Baechle <ralf@linux-mips.org> Cc: linux-mips@linux-mips.org
Patchwork: https://patchwork.linux-mips.org/patch/9610/ Signed-off-by: Ralf Baechle <ralf@linux-mips.org> Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
UBI: attaching mtd1 to ubi0
UBI: scanning is finished
UBI error: init_volumes: not enough PEBs, required 706, available 686
UBI error: ubi_wl_init: no enough physical eraseblocks (-20, need 1)
UBI error: ubi_attach_mtd_dev: failed to attach mtd1, error -12 <= NOT ENOMEM
UBI error: ubi_init: cannot attach mtd1
If available PEBs are not enough when initializing volumes, return -ENOSPC
directly. If available PEBs are not enough when initializing WL, return
-ENOSPC instead of -ENOMEM.
Signed-off-by: Sheng Yong <shengyong1@huawei.com> Signed-off-by: Richard Weinberger <richard@nod.at> Reviewed-by: David Gstir <david@sigma-star.at> Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
Make sure that data_size is less than LEB size.
Otherwise a handcrafted UBI image is able to trigger
an out of bounds memory access in ubi_compare_lebs().
Signed-off-by: Richard Weinberger <richard@nod.at> Reviewed-by: David Gstir <david@sigma-star.at>
[ luis: backported to 3.16:
- no ubi_device parameter for the ubi_err() macro in 3.16 ] Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
Make sure the compiler does not modify arguments of syscall functions.
This can happen if the compiler generates a tailcall to another
function. For example, without asmlinkage_protect sys_openat is compiled
into this function:
Note how the fourth argument is modified in place, modifying the register
%d4 that gets restored from this stack slot when the function returns to
user-space. The caller may expect the register to be unmodified across
system calls.
Signed-off-by: Andreas Schwab <schwab@linux-m68k.org> Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
A copy of /proc/kcore containing the kernel text can be made to the
buildid cache. e.g.
perf buildid-cache -v -k /proc/kcore
To workaround objdump limitations, a copy is also made when annotating
against /proc/kcore.
The copying process stops working from libelf about v1.62 onwards (the
problem was found with v1.63).
The cause is that a call to gelf_getphdr() in kcore__add_phdr() fails
because additional validation has been added to gelf_getphdr().
The use of gelf_getphdr() is a misguided attempt to get default
initialization of the Gelf_Phdr structure. That should not be
necessary because every member of the Gelf_Phdr structure is
subsequently assigned. So just remove the call to gelf_getphdr().
Similarly, a call to gelf_getehdr() in gelf_kcore__init() can be
removed also.
Committer notes:
Note to stable@kernel.org, from Adrian in the cover letter for this
patchkit:
The "Fix copying of /proc/kcore" problem goes back to v3.13 if you think
it is important enough for stable.
The kernel may delay interrupts for a long time which can result in timers
being delayed. If this occurs the intel_pstate driver will crash with a
divide by zero error:
The duration between reads of the APERF and MPERF registers overflowed a s32
sized integer in intel_pstate_get_scaled_busy()'s call to div_fp(). The result
is that int_tofp(duration_us) == 0, and the kernel attempts to divide by 0.
While the kernel shouldn't be delaying for a long time, it can and does
happen and the intel_pstate driver should not panic in this situation. This
patch changes the div_fp() function to use div64_s64() to allow for "long"
division. This will avoid the overflow condition on long delays.
[v2]: use div64_s64() in div_fp()
Signed-off-by: Prarit Bhargava <prarit@redhat.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Cc: Thomas Renninger <trenn@suse.de> Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
Signed-off-by: Chas Williams <3chas3@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Cc: Kamata, Munehisa <kamatam@amazon.com> Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
Roland Dreier [Mon, 5 Oct 2015 17:29:28 +0000 (10:29 -0700)]
fib_rules: Fix dump_rules() not to exit early
Backports of 41fc014332d9 ("fib_rules: fix fib rule dumps across
multiple skbs") introduced a regression in "ip rule show" - it ends up
dumping the first rule over and over and never exiting, because 3.19
and earlier are missing commit 053c095a82cf ("netlink: make
nlmsg_end() and genlmsg_end() void"), so fib_nl_fill_rule() ends up
returning skb->len (i.e. > 0) in the success case.
Fix this by checking the return code for < 0 instead of != 0.
Signed-off-by: Roland Dreier <roland@purestorage.com> Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
commit 3cc81d85ee01 ("asix: Don't reset PHY on if_up for ASIX 88772")
causes the ethernet on Arndale to no longer function. This appears to
be because the Arndale ethernet requires a full reset before it will
function correctly, however simply reverting the above patch causes
problems with ethtool settings getting reset.
It seems the problem is that the ethernet is not properly reset during
bind, and indeed the code in ax88772_bind that resets the device is a
very small subset of the actual ax88772_reset function. This patch uses
ax88772_reset in place of the existing reset code in ax88772_bind which
removes some code duplication and fixes the ethernet on Arndale.
It is still possible that the original patch causes some issues with
suspend and resume but that seems like a separate issue and I haven't
had a chance to test that yet.
Signed-off-by: Charles Keepax <ckeepax@opensource.wolfsonmicro.com> Tested-by: Riku Voipio <riku.voipio@linaro.org> Signed-off-by: David S. Miller <davem@davemloft.net> Cc: Ian Campbell <ian.campbell@citrix.com> Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
I've noticed every time the interface is set to 'up,', the kernel
reports that the link speed is set to 100 Mbps/Full Duplex, even
when ethtool is used to set autonegotiation to 'off', half
duplex, 10 Mbps.
It can be tested by:
ifconfig eth0 down
ethtool -s eth0 autoneg off speed 10 duplex half
ifconfig eth0 up
Then checking 'dmesg' for the link speed.
Signed-off-by: Michel Stam <m.stam@fugro.nl> Signed-off-by: David S. Miller <davem@davemloft.net> Cc: Ian Campbell <ian.campbell@citrix.com> Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
Commit 6f6a6fda2945 "jbd2: fix ocfs2 corrupt when updating journal
superblock fails" changed jbd2_cleanup_journal_tail() to return EIO
when the journal is aborted. That makes logic in
jbd2_log_do_checkpoint() bail out which is fine, except that
jbd2_journal_destroy() expects jbd2_log_do_checkpoint() to always make
a progress in cleaning the journal. Without it jbd2_journal_destroy()
just loops in an infinite loop.
Fix jbd2_journal_destroy() to cleanup journal checkpoint lists of
jbd2_log_do_checkpoint() fails with error.
Reported-by: Eryu Guan <guaneryu@gmail.com> Tested-by: Eryu Guan <guaneryu@gmail.com> Fixes: 6f6a6fda294506dfe0e3e0a253bb2d2923f28f0a Signed-off-by: Jan Kara <jack@suse.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
[ luis: backported to 3.16: used Jan's backport ] Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
Ben Hutchings pointed out this was not applicable to the 3.16-ckt kernel.
Cc: Ben Hutchings <ben@decadent.org.uk> Cc: Gregory CLEMENT <gregory.clement@free-electrons.com> Cc: Benjamin Cama <benoar@dolka.fr> Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
vxlan_setup is called when allocating the net_device, i.e. way before
vxlan_newlink (or vxlan_dev_configure) is called. This means
vxlan->default_dst is actually unset in vxlan_setup and the condition that
sets needed_headroom always takes the else branch.
Set the needed_headrom at the point when we have the information about
the address family available.
Fixes: e4c7ed415387c ("vxlan: add ipv6 support") Fixes: 2853af6a2ea1a ("vxlan: use dev->needed_headroom instead of dev->hard_header_len") CC: Cong Wang <cwang@twopensource.com> Signed-off-by: Jiri Benc <jbenc@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
[ luis: backported to 3.16:
- initialise needed_headrom in vxlan_newlink() instead of
vxlan_dev_configure() ] Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
The i2c5 pinctrl offsets are wrong. If the bootloader doesn't set the
pins up, communication with tca6424a doesn't work (controller timeouts)
and it is not possible to enable HDMI.
The previous fix of pxa library support, which was introduced to fix the
library dependency, broke the previous SoC behavior, where a machine
code binding pxa2xx-ac97 with a coded relied on :
- sound/soc/pxa/pxa2xx-ac97.c
- sound/soc/codecs/XXX.c
For example, the mioa701_wm9713.c machine code is currently broken. The
"select ARM" statement wrongly selects the soc/arm/pxa2xx-ac97 for
compilation, as per an unfortunate fate SND_PXA2XX_AC97 is both declared
in sound/arm/Kconfig and sound/soc/pxa/Kconfig.
Fix this by ensuring that SND_PXA2XX_SOC correctly triggers the correct
pxa2xx-ac97 compilation.
Fixes: 846172dfe33c ("ASoC: fix SND_PXA2XX_LIB Kconfig warning") Signed-off-by: Robert Jarzmik <robert.jarzmik@free.fr> Signed-off-by: Mark Brown <broonie@kernel.org> Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
Fix lookup of existing match/target structures in the corresponding list
by skipping the family check if NFPROTO_UNSPEC is used.
This is resulting in the allocation and insertion of one match/target
structure for each use of them. So this not only bloats memory
consumption but also severely affects the time to reload the ruleset
from the iptables-compat utility.
After this patch, iptables-compat-restore and iptables-compat take
almost the same time to reload large rulesets.
Fixes: 0ca743a55991 ("netfilter: nf_tables: add compatibility layer for x_tables") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
932c435caba8 ("PCI: Add dev_flags bit to access VPD through function 0")
added PCI_DEV_FLAGS_VPD_REF_F0. Previously, we set the flag on every
non-zero function of quirked devices. If a function turned out to be
different from function 0, i.e., it had a different class, vendor ID, or
device ID, the flag remained set but we didn't make VPD accessible at all.
Flip this around so we only set PCI_DEV_FLAGS_VPD_REF_F0 for functions that
are identical to function 0, and allow regular VPD access for any other
functions.
[bhelgaas: changelog, stable tag] Fixes: 932c435caba8 ("PCI: Add dev_flags bit to access VPD through function 0") Signed-off-by: Alex Williamson <alex.williamson@redhat.com> Signed-off-by: Bjorn Helgaas <helgaas@kernel.org> Acked-by: Myron Stowe <myron.stowe@redhat.com> Acked-by: Mark Rustad <mark.d.rustad@intel.com> Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
Commit 932c435caba8 ("PCI: Add dev_flags bit to access VPD through function
0") passes PCI_SLOT(devfn) for the devfn parameter of pci_get_slot().
Generally this works because we're fairly well guaranteed that a PCIe
device is at slot address 0, but for the general case, including
conventional PCI, it's incorrect. We need to get the slot and then convert
it back into a devfn.
Fixes: 932c435caba8 ("PCI: Add dev_flags bit to access VPD through function 0") Signed-off-by: Alex Williamson <alex.williamson@redhat.com> Signed-off-by: Bjorn Helgaas <helgaas@kernel.org> Acked-by: Myron Stowe <myron.stowe@redhat.com> Acked-by: Mark Rustad <mark.d.rustad@intel.com> Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
Fix potential null-pointer dereference at probe by making sure that the
required endpoints are present.
The whiteheat driver assumes there are at least five pairs of bulk
endpoints, of which the final pair is used for the "command port". An
attempt to bind to an interface with fewer bulk endpoints would
currently lead to an oops.
Fixes CVE-2015-5257.
Reported-by: Moein Ghasemzadeh <moein@istuary.com> Signed-off-by: Johan Hovold <johan@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
The VBT MIPI Sequence Block version 3 has forward incompatible changes:
First, the block size in the header has been specified reserved, and the
actual size is a separate 32-bit value within the block. The current
find_section() function to will only look at the size in the block
header, and, depending on what's in that now reserved size field,
continue looking for other sections in the wrong place.
Fix this by taking the new block size field into account. This will
ensure that the lookups for other sections will work properly, as long
as the new 32-bit size does not go beyond the opregion VBT mailbox size.
Second, the contents of the block have been completely
changed. Gracefully refuse parsing the yet unknown data version.
Cc: Deepak M <m.deepak@intel.com> Reviewed-by: Deepak M <m.deepak@intel.com> Signed-off-by: Jani Nikula <jani.nikula@intel.com>
[ luis: backported to 3.16: adjusted context ] Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
The order of the following three spinlocks should be:
dlm_domain_lock < dlm_ctxt->spinlock < dlm_lock_resource->spinlock
But dlm_dispatch_assert_master() is called while holding
dlm_ctxt->spinlock and dlm_lock_resource->spinlock, and then it calls
dlm_grab() which will take dlm_domain_lock.
Once another thread (for example, dlm_query_join_handler) has already
taken dlm_domain_lock, and tries to take dlm_ctxt->spinlock deadlock
happens.
Signed-off-by: Joseph Qi <joseph.qi@huawei.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Mark Fasheh <mfasheh@suse.com> Cc: "Junxiao Bi" <junxiao.bi@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
[ luis: backported to 3.16: adjusted context ] Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
PCM receive and transmit DMA requestor lines were reverted, breaking the
PCM playback interface for PXA platforms using the sound/soc/ variant
instead of the sound/arm variant.
The commit below shows the inversion in the requestor lines.
Fixes: d65a14587a9b ("ASoC: pxa: use snd_dmaengine_dai_dma_data") Signed-off-by: Robert Jarzmik <robert.jarzmik@free.fr> Signed-off-by: Mark Brown <broonie@kernel.org> Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
The NMI entry code that switches to the normal kernel stack needs to
be very careful not to clobber any extra stack slots on the NMI
stack. The code is fine under the assumption that SWAPGS is just a
normal instruction, but that assumption isn't really true. Use
SWAPGS_UNSAFE_STACK instead.
This is part of a fix for some random crashes that Sasha saw.
Fixes: 9b6e6a8334d5 ("x86/nmi/64: Switch stacks on userspace NMI entry") Reported-and-tested-by: Sasha Levin <sasha.levin@oracle.com> Signed-off-by: Andy Lutomirski <luto@kernel.org> Link: http://lkml.kernel.org/r/974bc40edffdb5c2950a5c4977f821a446b76178.1442791737.git.luto@kernel.org Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
[ luis: backported to 3.16:
- file rename: arch/x86/entry/entry_64.S -> arch/x86/kernel/entry_64.S
- adjusted context ] Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
That's a call through a function pointer to regular C function that
does nothing on native boots, but that function isn't protected
against kprobes, isn't marked notrace, and is certainly not
guaranteed to preserve any registers if the compiler is feeling
perverse. This is bad news for a CLBR_NONE operation.
Of course, if everything works correctly, once paravirt ops are
patched, it gets nopped out, but what if we hit this code before
paravirt ops are patched in? This can potentially cause breakage
that is very difficult to debug.
A more subtle failure is possible here, too: if _paravirt_nop uses
the stack at all (even just to push RBP), it will overwrite the "NMI
executing" variable if it's called in the NMI prologue.
The Xen case, perhaps surprisingly, is fine, because it's already
written in asm.
Fix all of the cases that default to paravirt_nop (including
adjust_exception_frame) with a big hammer: replace paravirt_nop with
an asm function that is just a ret instruction.
The Xen case may have other problems, so document them.
This is part of a fix for some random crashes that Sasha saw.
Reported-and-tested-by: Sasha Levin <sasha.levin@oracle.com> Signed-off-by: Andy Lutomirski <luto@kernel.org> Link: http://lkml.kernel.org/r/8f5d2ba295f9d73751c33d97fda03e0495d9ade0.1442791737.git.luto@kernel.org Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
[ luis: backported to 3.16:
- file rename: arch/x86/entry/entry_64.S -> arch/x86/kernel/entry_64.S
- adjusted context ] Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
Linux cifs mount with ntlmssp against an Mac OS X (Yosemite
10.10.5) share fails in case the clocks differ more than +/-2h:
digest-service: digest-request: od failed with 2 proto=ntlmv2
digest-service: digest-request: kdc failed with -1561745592 proto=ntlmv2
Fix this by (re-)using the given server timestamp for the
ntlmv2 authentication (as Windows 7 does).
A related problem was also reported earlier by Namjae Jaen (see below):
Windows machine has extended security feature which refuse to allow
authentication when there is time difference between server time and
client time when ntlmv2 negotiation is used. This problem is prevalent
in embedded enviornment where system time is set to default 1970.
Modern servers send the server timestamp in the TargetInfo Av_Pair
structure in the challenge message [see MS-NLMP 2.2.2.1]
In [MS-NLMP 3.1.5.1.2] it is explicitly mentioned that the client must
use the server provided timestamp if present OR current time if it is
not
Reported-by: Namjae Jeon <namjae.jeon@samsung.com> Signed-off-by: Peter Seiderer <ps.report@gmx.net> Signed-off-by: Steve French <smfrench@gmail.com> Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
leases (oplocks) were always requested for SMB2/SMB3 even when oplocks
disabled in the cifs.ko module.
Signed-off-by: Steve French <steve.french@primarydata.com> Reviewed-by: Chandrika Srinivasan <chandrika.srinivasan@citrix.com> Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
XTFPGA SPI controller has native endian registers.
Fix register acessors so that they work in big-endian configurations.
Signed-off-by: Max Filippov <jcmvbkbc@gmail.com> Signed-off-by: Mark Brown <broonie@kernel.org> Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
Don't check if timer is running with a timer_pending() before
deleting it with del_timer_sync(), this defies the whole point of
the sync part and can cause a possible race.
Instead we just want to make sure the timer is initialized early enough
before we have a chance to delete it.
Reported-by: Oliver Neukum <oneukum@suse.com> Signed-off-by: Mathias Nyman <mathias.nyman@linux.intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
Some changes between xhci 0.96 and xhci 1.0 specifications forced us to
check the hci version in code, some of these checks were implemented as
hci_version == 1.0, which will not work with new xhci 1.1 controllers.
xhci 1.1 behaves similar to xhci 1.0 in these cases, so change these
checks to hci_version >= 1.0
Don't set xhci->shared_hcd to NULL in xhci_stop() as we have
still not de-allocated it. It was resulting in a NULL pointer
de-reference if usb_add/remove_hcd() is called repeatedly.
We want repeated add/remove to work for the OTG use case.
Signed-off-by: Roger Quadros <rogerq@ti.com> Signed-off-by: Mathias Nyman <mathias.nyman@linux.intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
We want to give the command abortion an additional try to stop
the command ring before we completely hose xhci.
Tested-by: Vincent Pelletier <plr.vincent@gmail.com> Signed-off-by: Mathias Nyman <mathias.nyman@linux.intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ luis: backported to 3.16:
- xhci_handshake() has an extra 'xhci' parameter in 3.16 ] Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
These have roughly the same purpose as the SMRR, which we do not need
to implement in KVM. However, Linux accesses MSR_K8_TSEG_ADDR at
boot, which causes problems when running a Xen dom0 under KVM.
Just return 0, meaning that processor protection of SMRAM is not
in effect.
Reported-by: M A Young <m.a.young@durham.ac.uk> Acked-by: Borislav Petkov <bp@suse.de> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
[ luis: backported to 3.16:
- file rename: arch/x86/include/asm/msr-index.h ->
arch/x86/include/uapi/asm/msr-index.h
- adjusted context ] Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
Fixes: 83271f626 ("ion: hold reference to handle...") Signed-off-by: Shawn Lin <shawn.lin@rock-chips.com> Reviewed-by: Laura Abbott <labbott@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
As documented in iscsit_sequence_cmd:
/*
* Existing callers for iscsit_sequence_cmd() will silently
* ignore commands with CMDSN_LOWER_THAN_EXP, so force this
* return for CMDSN_MAXCMDSN_OVERRUN as well..
*/
We need to silently finish a command when it's in ISTATE_REMOVE.
This fixes an teardown hang we were seeing where a mis-behaved
initiator (triggered by allocation error injections) sent us a
cmdsn which was lower than expected.
Signed-off-by: Jenny Derzhavetz <jennyf@mellanox.com> Signed-off-by: Sagi Grimberg <sagig@mellanox.com> Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org> Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
This patch fixes a regression introduced by the commit a84e32894191
("net: mvneta: fix refilling for Rx DMA buffers"). Due to this commit
the newly allocated Rx buffers are DMA-unmapped in place of those passed
to the networking stack. Obviously, this causes data corruptions.
This patch fixes the issue by ensuring that the right Rx buffers are
DMA-unmapped.
Reported-by: Oren Laskin <oren@igneous.io> Signed-off-by: Simon Guinot <simon.guinot@sequanux.org> Fixes: a84e32894191 ("net: mvneta: fix refilling for Rx DMA buffers") Tested-by: Oren Laskin <oren@igneous.io> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
According to spec, there are functional and protocol stalls.
For functional stall, it is for bulk and interrupt endpoints,
below are cases for it:
- Host sends SET_FEATURE request for Set-Halt, the udc driver
needs to set stall, and return true unconditionally.
- The gadget driver may call usb_ep_set_halt to stall certain
endpoints, if there is a transfer in pending, the udc driver
should not set stall, and return -EAGAIN accordingly.
These two kinds of stall need to be cleared by host using CLEAR_FEATURE
request (Clear-Halt).
For protocol stall, it is for control endpoint, this stall will
be set if the control request has failed. This stall will be
cleared by next setup request (hardware will do it).
It fixed usbtest (drivers/usb/misc/usbtest.c) Test 13 "set/clear halt"
test failure, meanwhile, this change has been verified by
USB2 CV Compliance Test and MSC Tests.
Cc: Alan Stern <stern@rowland.harvard.edu> Cc: Felipe Balbi <balbi@ti.com> Signed-off-by: Peter Chen <peter.chen@freescale.com> Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
In btrfs_evict_inode, we properly truncate the page cache for evicted
inodes but then we call btrfs_wait_ordered_range for every inode as well.
It's the right thing to do for regular files but results in incorrect
behavior for device inodes for block devices.
filemap_fdatawrite_range gets called with inode->i_mapping which gets
resolved to the block device inode before getting passed to
wbc_attach_fdatawrite_inode and ultimately to inode_to_bdi. What happens
next depends on whether there's an open file handle associated with the
inode. If there is, we write to the block device, which is unexpected
behavior. If there isn't, we through normally and inode->i_data is used.
We can also end up racing against open/close which can result in crashes
when i_mapping points to a block device inode that has been closed.
Since there can't be any page cache associated with special file inodes,
it's safe to skip the btrfs_wait_ordered_range call entirely and avoid
the problem.
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=100911 Tested-by: Christoph Biedl <linux-kernel.bfrz@manchmal.in-ulm.de> Signed-off-by: Jeff Mahoney <jeffm@suse.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
If a file has a range pointing to a compressed extent, followed by
another range that points to the same compressed extent and a read
operation attempts to read both ranges (either completely or part of
them), the pages that correspond to the second range are incorrectly
filled with zeroes.
Consider the following example:
File layout
[0 - 8K] [8K - 24K]
| |
| |
points to extent X, points to extent X,
offset 4K, length of 8K offset 0, length 16K
If a readpages() call spans the 2 ranges, a single bio to read the extent
is submitted - extent_io.c:submit_extent_page() would only create a new
bio to cover the second range pointing to the extent if the extent it
points to had a different logical address than the extent associated with
the first range. This has a consequence of the compressed read end io
handler (compression.c:end_compressed_bio_read()) finish once the extent
is decompressed into the pages covering the first range, leaving the
remaining pages (belonging to the second range) filled with zeroes (done
by compression.c:btrfs_clear_biovec_end()).
So fix this by submitting the current bio whenever we find a range
pointing to a compressed extent that was preceded by a range with a
different extent map. This is the simplest solution for this corner
case. Making the end io callback populate both ranges (or more, if we
have multiple pointing to the same extent) is a much more complex
solution since each bio is tightly coupled with a single extent map and
the extent maps associated to the ranges pointing to the shared extent
can have different offsets and lengths.
The following test case for fstests triggers the issue:
seq=`basename $0`
seqres=$RESULT_DIR/$seq
echo "QA output created by $seq"
tmp=/tmp/$$
status=1 # failure is the default!
trap "_cleanup; exit \$status" 0 1 2 3 15
_cleanup()
{
rm -f $tmp.*
}
# get standard environment, filters and checks
. ./common/rc
. ./common/filter
# real QA test starts here
_need_to_be_root
_supported_fs btrfs
_supported_os Linux
_require_scratch
_require_cloner
rm -f $seqres.full
test_clone_and_read_compressed_extent()
{
local mount_opts=$1
# Create a test file with a single extent that is compressed (the
# data we write into it is highly compressible no matter which
# compression algorithm is used, zlib or lzo).
$XFS_IO_PROG -f -c "pwrite -S 0xaa 0K 4K" \
-c "pwrite -S 0xbb 4K 8K" \
-c "pwrite -S 0xcc 12K 4K" \
$SCRATCH_MNT/foo | _filter_xfs_io
# Now clone our extent into an adjacent offset.
$CLONER_PROG -s $((4 * 1024)) -d $((16 * 1024)) -l $((8 * 1024)) \
$SCRATCH_MNT/foo $SCRATCH_MNT/foo
# Same as before but for this file we clone the extent into a lower
# file offset.
$XFS_IO_PROG -f -c "pwrite -S 0xaa 8K 4K" \
-c "pwrite -S 0xbb 12K 8K" \
-c "pwrite -S 0xcc 20K 4K" \
$SCRATCH_MNT/bar | _filter_xfs_io
# Evicting the inode or clearing the page cache before reading
# again the file would also trigger the bug - reads were returning
# all bytes in the range corresponding to the second reference to
# the extent with a value of 0, but the correct data was persisted
# (it was a bug exclusively in the read path). The issue happened
# only if the same readpages() call targeted pages belonging to the
# first and second ranges that point to the same compressed extent.
_scratch_remount
echo "File digests after mounting filesystem again:"
# Must match the same digests we got before.
md5sum $SCRATCH_MNT/foo | _filter_scratch
md5sum $SCRATCH_MNT/bar | _filter_scratch
}
echo -e "\nTesting with zlib compression..."
test_clone_and_read_compressed_extent "-o compress=zlib"
_scratch_unmount
echo -e "\nTesting with lzo compression..."
test_clone_and_read_compressed_extent "-o compress=lzo"
status=0
exit
Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Qu Wenruo<quwenruo@cn.fujitsu.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
Let's fix pinmux address of gpio 170 used by tfp410 powerdown-gpio.
According to the OMAP35x Technical Reference Manual
CONTROL_PADCONF_I2C3_SDA[15:0] 0x480021C4 mode0: i2c3_sda
CONTROL_PADCONF_I2C3_SDA[31:16] 0x480021C4 mode4: gpio_170
the pinmux address of gpio 170 must be 0x480021C6.
The former wrong address broke i2c3 (used by hdmi ddc), resulting in
kernel message:
omap_i2c 48060000.i2c: controller timed out
Fixes: 8cecf52befd7 ("ARM: omap3-beagle.dts: add display information") Signed-off-by: Carl Frederik Werner <frederik@cfbw.eu> Signed-off-by: Tony Lindgren <tony@atomide.com> Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
If user space calls unreference on a user_dmabuf it will typically
kill the struct ttm_base_object member which is responsible for the
user-space visibility. However the dmabuf part may still be alive and
refcounted. In some situations, like for shared guest-backed surface
referencing/opening, the driver may try to reference the
struct ttm_base_object member again, causing an immediate kernel warning
and a later kernel NULL pointer dereference.
Fix this by always maintaining a reference on the struct
ttm_base_object member, in situations where it might subsequently be
referenced.
Signed-off-by: Thomas Hellstrom <thellstrom@vmware.com> Reviewed-by: Brian Paul <brianp@vmware.com> Reviewed-by: Sinclair Yeh <syeh@vmware.com>
[ luis: backported to 3.16: adjusted context ] Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
This is intended to add ZTE device PIDs on kernel.
Signed-off-by: Liu.Zhao <lzsos369@163.com>
[johan: sort the new entries ] Signed-off-by: Johan Hovold <johan@kernel.org> Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
Actually, spi_master_put() after spi_alloc_master() must _not_ be followed
by kfree(). The memory is already freed with the call to spi_master_put()
through spi_master_class, which registers a release function. Calling both
spi_master_put() and kfree() results in often nasty (and delayed) crashes
elsewhere in the kernel, often in the networking stack.
Link to patch and concerns: https://lkml.org/lkml/2012/9/3/269
or
http://lkml.iu.edu/hypermail/linux/kernel/1209.0/00790.html
Alexey Klimov: This revert becomes valid after 94c69f765f1b4a658d96905ec59928e3e3e07e6a when spi-imx.c
has been fixed and there is no need to call kfree() so comment
for spi_alloc_master() should be fixed.
Signed-off-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Alexey Klimov <alexey.klimov@linaro.org> Signed-off-by: Mark Brown <broonie@kernel.org> Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
On Intel Baytrail, there is case when interrupt handler get called, no SPI
message is captured. The RX FIFO is indeed empty when RX timeout pending
interrupt (SSSR_TINT) happens.
Use the BIOS version where both HSUART and SPI are on the same IRQ. Both
drivers are using IRQF_SHARED when calling the request_irq function. When
running two separate and independent SPI and HSUART application that
generate data traffic on both components, user will see messages like
below on the console:
pxa2xx-spi pxa2xx-spi.0: bad message state in interrupt handler
This commit will fix this by first checking Receiver Time-out Interrupt,
if it is disabled, ignore the request and return without servicing.
Signed-off-by: Tan, Jui Nee <jui.nee.tan@intel.com> Acked-by: Jarkko Nikula <jarkko.nikula@linux.intel.com> Signed-off-by: Mark Brown <broonie@kernel.org> Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
In rare cases a directory can be renamed out from under a bind mount.
In those cases without special handling it becomes possible to walk up
the directory tree to the root dentry of the filesystem and down
from the root dentry to every other file or directory on the filesystem.
Like division by zero .. from an unconnected path can not be given
a useful semantic as there is no predicting at which path component
the code will realize it is unconnected. We certainly can not match
the current behavior as the current behavior is a security hole.
Therefore when encounting .. when following an unconnected path
return -ENOENT.
- Add a function path_connected to verify path->dentry is reachable
from path->mnt.mnt_root. AKA to validate that rename did not do
something nasty to the bind mount.
To avoid races path_connected must be called after following a path
component to it's next path component.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
A rename can result in a dentry that by walking up d_parent
will never reach it's mnt_root. For lack of a better term
I call this an escaped path.
prepend_path is called by four different functions __d_path,
d_absolute_path, d_path, and getcwd.
__d_path only wants to see paths are connected to the root it passes
in. So __d_path needs prepend_path to return an error.
d_absolute_path similarly wants to see paths that are connected to
some root. Escaped paths are not connected to any mnt_root so
d_absolute_path needs prepend_path to return an error greater
than 1. So escaped paths will be treated like paths on lazily
unmounted mounts.
getcwd needs to prepend "(unreachable)" so getcwd also needs
prepend_path to return an error.
d_path is the interesting hold out. d_path just wants to print
something, and does not care about the weird cases. Which raises
the question what should be printed?
Given that <escaped_path>/<anything> should result in -ENOENT I
believe it is desirable for escaped paths to be printed as empty
paths. As there are not really any meaninful path components when
considered from the perspective of a mount tree.
So tweak prepend_path to return an empty path with an new error
code of 3 when it encounters an escaped path.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
The original patch introducing this header wrote the number of CPUs available
and online in one order and then swapped those values when reading, fix it.
zcomp_create() verifies the success of zcomp_strm_{multi,single}_create()
through comp->stream, which can potentially be pointing to memory that
was freed if these functions returned an error.
While at it, replace a 'ERR_PTR(-ENOMEM)' by a more generic
'ERR_PTR(error)' as in the future zcomp_strm_{multi,siggle}_create()
could return other error codes. Function documentation updated
accordingly.
Fixes: beca3ec71fe5 ("zram: add multi stream functionality") Acked-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Acked-by: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
Do not write initialize magic on systems that do not have
feature query 0xb. Fixes Bug #82451.
Redefine FEATURE_QUERY to align with 0xb and FEATURE2 with 0xd
for code clearity.
Add a new test function, hp_wmi_bios_2008_later() & simplify
hp_wmi_bios_2009_later(), which fixes a bug in cases where
an improper value is returned. Probably also fixes Bug #69131.
Add missing __init tag.
Signed-off-by: Kyle Evans <kvans32@gmail.com> Signed-off-by: Darren Hart <dvhart@linux.intel.com> Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
When running a guest with the architected timer disabled (with QEMU and
the kernel_irqchip=off option, for example), it is important to make
sure the timer gets turned off. Otherwise, the guest may try to
enable it anyway, leading to a screaming HW interrupt.
The fix is to unconditionally turn off the virtual timer on guest
exit.
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
[ luis: backported to 3.16: adjusted context ] Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
When running a guest with the architected timer disabled (with QEMU and
the kernel_irqchip=off option, for example), it is important to make
sure the timer gets turned off. Otherwise, the guest may try to
enable it anyway, leading to a screaming HW interrupt.
The fix is to unconditionally turn off the virtual timer on guest
exit.
Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org> Signed-off-by: Marc Zyngier <marc.zyngier@arm.com> Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
Cortex-A53 processors <= r0p4 are affected by erratum #843419 which can
lead to a memory access using an incorrect address in certain sequences
headed by an ADRP instruction.
There is a linker fix to generate veneers for ADRP instructions, but
this doesn't work for kernel modules which are built as unlinked ELF
objects.
This patch adds a new config option for the erratum which, when enabled,
builds kernel modules with the mcmodel=large flag. This uses absolute
addressing for all kernel symbols, thereby removing the use of ADRP as
a PC-relative form of addressing. The ADRP relocs are removed from the
module loader so that we fail to load any potentially affected modules.
Acked-by: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Will Deacon <will.deacon@arm.com> Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
When saving/restoring the VFP registers from a compat (AArch32)
signal frame, we rely on the compat registers forming a prefix of the
native register file and therefore make use of copy_{to,from}_user to
transfer between the native fpsimd_state and the compat_vfp_sigframe.
Unfortunately, this doesn't work so well in a big-endian environment.
Our fpsimd save/restore code operates directly on 128-bit quantities
(Q registers) whereas the compat_vfp_sigframe represents the registers
as an array of 64-bit (D) registers. The architecture packs the compat D
registers into the Q registers, with the least significant bytes holding
the lower register. Consequently, we need to swap the 64-bit halves when
converting between these two representations on a big-endian machine.
This patch replaces the __copy_{to,from}_user invocations in our
compat VFP signal handling code with explicit __put_user loops that
operate on 64-bit values and swap them accordingly.
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Will Deacon <will.deacon@arm.com> Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
In 2007, commit 07190a08eef36 ("Mark TSC on GeodeLX reliable")
bypassed verification of the TSC on Geode LX. However, this code
(now in the check_system_tsc_reliable() function in
arch/x86/kernel/tsc.c) was only present if CONFIG_MGEODE_LX was
set.
OpenWRT has recently started building its generic Geode target
for Geode GX, not LX, to include support for additional
platforms. This broke the timekeeping on LX-based devices,
because the TSC wasn't marked as reliable:
https://dev.openwrt.org/ticket/20531
By adding a runtime check on is_geode_lx(), we can also include
the fix if CONFIG_MGEODEGX1 or CONFIG_X86_GENERIC are set, thus
fixing the problem.
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com> Cc: Andres Salomon <dilinger@queued.net> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Marcelo Tosatti <marcelo@kvack.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1442409003.131189.87.camel@infradead.org Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
If we had secondary hash flag set, we ended up modifying hash value in
the updatepp code path. Hence with a failed updatepp we will be using
a wrong hash value for the following hash insert. Fix this by
recomputing hash before insert.
Without this patch we can end up with using wrong slot number in linux
pte. That can result in us missing an hash pte update or invalidate
which can cause memory corruption or even machine check.
Fixes: 6d492ecc6489 ("powerpc/THP: Add code to handle HPTE faults for hugepages") Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Reviewed-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
The kernel does it, not the boot wrapper, which breaks with some
cross compilers that still default to ABI v1.
Fixes: 147c05168fc8 ("powerpc/boot: Add support for 64bit little endian wrapper") Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
When a kernel is built covering ARMv6 to ARMv7, we omit to clear the
IT state when entering a signal handler. This can cause the first
few instructions to be conditionally executed depending on the parent
context.
In any case, the original test for >= ARMv7 is broken - ARMv6 can have
Thumb-2 support as well, and an ARMv6T2 specific build would omit this
code too.
Relax the test back to ARMv6 or greater. This results in us always
clearing the IT state bits in the PSR, even on CPUs where these bits
are reserved. However, they're reserved for the IT state, so this
should cause no harm.
Fixes: d71e1352e240 ("Clear the IT state when invoking a Thumb-2 signal handler") Acked-by: Tony Lindgren <tony@atomide.com> Tested-by: H. Nikolaus Schaller <hns@goldelico.com> Tested-by: Grazvydas Ignotas <notasas@gmail.com> Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk> Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
Currently, if we had a zero length mmio eventfd assigned on
KVM_MMIO_BUS. It will never be found by kvm_io_bus_cmp() since it
always compares the kvm_io_range() with the length that guest
wrote. This will cause e.g for vhost, kick will be trapped by qemu
userspace instead of vhost. Fixing this by using zero length if an
iodevice is zero length.
Cc: Gleb Natapov <gleb@kernel.org> Cc: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Jason Wang <jasowang@redhat.com> Reviewed-by: Cornelia Huck <cornelia.huck@de.ibm.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
We register wildcard mmio eventfd on two buses, once for KVM_MMIO_BUS
and once on KVM_FAST_MMIO_BUS but with a single iodev
instance. This will lead to an issue: kvm_io_bus_destroy() knows
nothing about the devices on two buses pointing to a single dev. Which
will lead to double free[1] during exit. Fix this by allocating two
instances of iodevs then registering one on KVM_MMIO_BUS and another
on KVM_FAST_MMIO_BUS.
Cc: Gleb Natapov <gleb@kernel.org> Cc: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Jason Wang <jasowang@redhat.com> Reviewed-by: Cornelia Huck <cornelia.huck@de.ibm.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
This patch factors out core eventfd assign/deassign logic and leaves
the argument checking and bus index selection to callers.
Cc: Gleb Natapov <gleb@kernel.org> Cc: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Jason Wang <jasowang@redhat.com> Reviewed-by: Cornelia Huck <cornelia.huck@de.ibm.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
We only want zero length mmio eventfd to be registered on
KVM_FAST_MMIO_BUS. So check this explicitly when arg->len is zero to
make sure this.
Cc: Gleb Natapov <gleb@kernel.org> Cc: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Jason Wang <jasowang@redhat.com> Reviewed-by: Cornelia Huck <cornelia.huck@de.ibm.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
When entering the kernel at EL2, we fail to initialise the MDCR_EL2
register which controls debug access and PMU capabilities at EL1.
This patch ensures that the register is initialised so that all traps
are disabled and all the PMU counters are available to the host. When a
guest is scheduled, KVM takes care to configure trapping appropriately.
Acked-by: Marc Zyngier <marc.zyngier@arm.com> Signed-off-by: Will Deacon <will.deacon@arm.com> Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
The APIC LVTT register is MMIO mapped but the TSC_DEADLINE register is an
MSR. The write to the TSC_DEADLINE MSR is not serializing, so it's not
guaranteed that the write to LVTT has reached the APIC before the
TSC_DEADLINE MSR is written. In such a case the write to the MSR is
ignored and as a consequence the local timer interrupt never fires.
The SDM decribes this issue for xAPIC and x2APIC modes. The
serialization methods recommended by the SDM differ.
xAPIC:
"1. Memory-mapped write to LVT Timer Register, setting bits 18:17 to 10b.
2. WRMSR to the IA32_TSC_DEADLINE MSR a value much larger than current time-stamp counter.
3. If RDMSR of the IA32_TSC_DEADLINE MSR returns zero, go to step 2.
4. WRMSR to the IA32_TSC_DEADLINE MSR the desired deadline."
x2APIC:
"To allow for efficient access to the APIC registers in x2APIC mode,
the serializing semantics of WRMSR are relaxed when writing to the
APIC registers. Thus, system software should not use 'WRMSR to APIC
registers in x2APIC mode' as a serializing instruction. Read and write
accesses to the APIC registers will occur in program order. A WRMSR to
an APIC register may complete before all preceding stores are globally
visible; software can prevent this by inserting a serializing
instruction, an SFENCE, or an MFENCE before the WRMSR."
The xAPIC method is to just wait for the memory mapped write to hit
the LVTT by checking whether the MSR write has reached the hardware.
There is no reason why a proper MFENCE after the memory mapped write would
not do the same. Andi Kleen confirmed that MFENCE is sufficient for the
xAPIC case as well.
Issue MFENCE before writing to the TSC_DEADLINE MSR. This can be done
unconditionally as all CPUs which have TSC_DEADLINE also have MFENCE
support.
[ tglx: Massaged the changelog ]
Signed-off-by: Shaohua Li <shli@fb.com> Reviewed-by: Ingo Molnar <mingo@kernel.org> Cc: <Kernel-team@fb.com> Cc: <lenb@kernel.org> Cc: <fenghua.yu@intel.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: H. Peter Anvin <hpa@zytor.com> Link: http://lkml.kernel.org/r/20150909041352.GA2059853@devbig257.prn2.facebook.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
This might lead to local privilege escalation (code execution as
kernel) for systems where the following conditions are met:
- CONFIG_CIFS_SMB2 and CONFIG_CIFS_POSIX are enabled
- a cifs filesystem is mounted where:
- the mount option "vers" was used and set to a value >=2.0
- the attacker has write access to at least one file on the filesystem
To attack this, an attacker would have to guess the target_tcon
pointer (but guessing wrong doesn't cause a crash, it just returns an
error code) and win a narrow race.
Signed-off-by: Jann Horn <jann@thejh.net> Signed-off-by: Steve French <smfrench@gmail.com> Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
While working on the 32-bit ARM port of UEFI, I noticed a strange
corruption in the kernel log. The following snprintf() statement
(in drivers/firmware/efi/efi.c:efi_md_typeattr_format())
As it turns out, this is caused by incorrect code being emitted for
the string() function in lib/vsprintf.c. The following code
if (!(spec.flags & LEFT)) {
while (len < spec.field_width--) {
if (buf < end)
*buf = ' ';
++buf;
}
}
for (i = 0; i < len; ++i) {
if (buf < end)
*buf = *s;
++buf; ++s;
}
while (len < spec.field_width--) {
if (buf < end)
*buf = ' ';
++buf;
}
when called with len == 0, triggers an issue in the GCC SRA optimization
pass (Scalar Replacement of Aggregates), which handles promotion of signed
struct members incorrectly. This is a known but as yet unresolved issue.
(https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65932). In this particular
case, it is causing the second while loop to be executed erroneously a
single time, causing the additional space characters to be printed.
So disable the optimization by passing -fno-ipa-sra.
Acked-by: Nicolas Pitre <nico@linaro.org> Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
[ luis: backported to 3.16: adjusted context ] Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
When detecting a serial port on newer PA-RISC machines (with iosapic) we have a
long way to go to find the right IRQ line, registering it, then registering the
serial port and the irq handler for the serial port. During this phase spurious
interrupts for the serial port may happen which then crashes the kernel because
the action handler might not have been set up yet.
So, basically it's a race condition between the serial port hardware and the
CPU which sets up the necessary fields in the irq sructs. The main reason for
this race is, that we unmask the serial port irqs too early without having set
up everything properly before (which isn't easily possible because we need the
IRQ number to register the serial ports).
This patch is a work-around for this problem. It adds checks to the CPU irq
handler to verify if the IRQ action field has been initialized already. If not,
we just skip this interrupt (which isn't critical for a serial port at bootup).
The real fix would probably involve rewriting all PA-RISC specific IRQ code
(for CPU, IOSAPIC, GSC and EISA) to use IRQ domains with proper parenting of
the irq chips and proper irq enabling along this line.
This bug has been in the PA-RISC port since the beginning, but the crashes
happened very rarely with currently used hardware. But on the latest machine
which I bought (a C8000 workstation), which uses the fastest CPUs (4 x PA8900,
1GHz) and which has the largest possible L1 cache size (64MB each), the kernel
crashed at every boot because of this race. So, without this patch the machine
would currently be unuseable.
For the record, here is the flow logic:
1. serial_init_chip() in 8250_gsc.c calls iosapic_serial_irq().
2. iosapic_serial_irq() calls txn_alloc_irq() to find the irq.
3. iosapic_serial_irq() calls cpu_claim_irq() to register the CPU irq
4. cpu_claim_irq() unmasks the CPU irq (which it shouldn't!)
5. serial_init_chip() then registers the 8250 port.
Problems:
- In step 4 the CPU irq shouldn't have been registered yet, but after step 5
- If serial irq happens between 4 and 5 have finished, the kernel will crash
Cc: linux-parisc@vger.kernel.org Signed-off-by: Helge Deller <deller@gmx.de>
[ luis: backported to 3.16: used Helge's backport ] Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
dump_rules returns skb length and not error.
But when family == AF_UNSPEC, the caller of dump_rules
assumes that it returns an error. Hence, when family == AF_UNSPEC,
we continue trying to dump on -EMSGSIZE errors resulting in
incorrect dump idx carried between skbs belonging to the same dump.
This results in fib rule dump always only dumping rules that fit
into the first skb.
This patch fixes dump_rules to return error so that we exit correctly
and idx is correctly maintained between skbs that are part of the
same dump.
Signed-off-by: Wilson Kok <wkok@cumulusnetworks.com> Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
When support for megaflows was introduced, OVS needed to start
installing flows with a mask applied to them. Since masking is an
expensive operation, OVS also had an optimization that would only
take the parts of the flow keys that were covered by a non-zero
mask. The values stored in the remaining pieces should not matter
because they are masked out.
While this works fine for the purposes of matching (which must always
look at the mask), serialization to netlink can be problematic. Since
the flow and the mask are serialized separately, the uninitialized
portions of the flow can be encoded with whatever values happen to be
present.
In terms of functionality, this has little effect since these fields
will be masked out by definition. However, it leaks kernel memory to
userspace, which is a potential security vulnerability. It is also
possible that other code paths could look at the masked key and get
uninitialized data, although this does not currently appear to be an
issue in practice.
This removes the mask optimization for flows that are being installed.
This was always intended to be the case as the mask optimizations were
really targetting per-packet flow operations.
Fixes: 03f0d916 ("openvswitch: Mega flow implementation") Signed-off-by: Jesse Gross <jesse@nicira.com> Acked-by: Pravin B Shelar <pshelar@nicira.com> Signed-off-by: David S. Miller <davem@davemloft.net>
[ luis: backported to 3.16: adjusted context ] Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
Consider sctp module is unloaded and is being requested because an user
is creating a sctp socket.
During initialization, sctp will add the new protocol type and then
initialize pernet subsys:
status = sctp_v4_protosw_init();
if (status)
goto err_protosw_init;
status = sctp_v6_protosw_init();
if (status)
goto err_v6_protosw_init;
status = register_pernet_subsys(&sctp_net_ops);
The problem is that after those calls to sctp_v{4,6}_protosw_init(), it
is possible for userspace to create SCTP sockets like if the module is
already fully loaded. If that happens, one of the possible effects is
that we will have readers for net->sctp.local_addr_list list earlier
than expected and sctp_net_init() does not take precautions while
dealing with that list, leading to a potential panic but not limited to
that, as sctp_sock_init() will copy a bunch of blank/partially
initialized values from net->sctp.
The race happens like this:
CPU 0 | CPU 1
socket() |
__sock_create | socket()
inet_create | __sock_create
list_for_each_entry_rcu( |
answer, &inetsw[sock->type], |
list) { | inet_create
/* no hits */ |
if (unlikely(err)) { |
... |
request_module() |
/* socket creation is blocked |
* the module is fully loaded |
*/ |
sctp_init |
sctp_v4_protosw_init |
inet_register_protosw |
list_add_rcu(&p->list, |
last_perm); |
| list_for_each_entry_rcu(
| answer, &inetsw[sock->type],
sctp_v6_protosw_init | list) {
| /* hit, so assumes protocol
| * is already loaded
| */
| /* socket creation continues
| * before netns is initialized
| */
register_pernet_subsys |
Simply inverting the initialization order between
register_pernet_subsys() and sctp_v4_protosw_init() is not possible
because register_pernet_subsys() will create a control sctp socket, so
the protocol must be already visible by then. Deferring the socket
creation to a work-queue is not good specially because we loose the
ability to handle its errors.
So, as suggested by Vlad, the fix is to split netns initialization in
two moments: defaults and control socket, so that the defaults are
already loaded by when we register the protocol, while control socket
initialization is kept at the same moment it is today.
Fixes: 4db67e808640 ("sctp: Make the address lists per network namespace") Signed-off-by: Vlad Yasevich <vyasevich@gmail.com> Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
Ken-ichirou reported that running netlink in mmap mode for receive in
combination with nlmon will throw a NULL pointer dereference in
__kfree_skb() on nlmon_xmit(), in my case I can also trigger an "unable
to handle kernel paging request". The problem is the skb_clone() in
__netlink_deliver_tap_skb() for skbs that are mmaped.
I.e. the cloned skb doesn't have a destructor, whereas the mmap netlink
skb has it pointed to netlink_skb_destructor(), set in the handler
netlink_ring_setup_skb(). There, skb->head is being set to NULL, so
that in such cases, __kfree_skb() doesn't perform a skb_release_data()
via skb_release_all(), where skb->head is possibly being freed through
kfree(head) into slab allocator, although netlink mmap skb->head points
to the mmap buffer. Similarly, the same has to be done also for large
netlink skbs where the data area is vmalloced. Therefore, as discussed,
make a copy for these rather rare cases for now. This fixes the issue
on my and Ken-ichirou's test-cases.
Reference: http://thread.gmane.org/gmane.linux.network/371129 Fixes: bcbde0d449ed ("net: netlink: virtual tap device management") Reported-by: Ken-ichirou MATSUZAWA <chamaken@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Tested-by: Ken-ichirou MATSUZAWA <chamaken@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
[ luis: backported to 3.16: adjusted context ] Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
In the IPv6 multicast routing code the mrt_lock was not being released
correctly in the MFC iterator, as a result adding or deleting a MIF would
cause a hang because the mrt_lock could not be acquired.
This fix is a copy of the code for the IPv4 case and ensures that the lock
is released correctly.
Signed-off-by: Richard Laing <richard.laing@alliedtelesis.co.nz> Acked-by: Cong Wang <cwang@twopensource.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
It is needed to check EVENT_NO_RUNTIME_PM bit of dev->flags in
usbnet_stop(), but its value should be read before it is cleared
when dev->flags is set to 0.
The problem was spotted and the fix was provided by
Oliver Neukum <oneukum@suse.de>.
Signed-off-by: Eugene Shatokhin <eugene.shatokhin@rosalab.ru> Acked-by: Oliver Neukum <oneukum@suse.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
Block drivers are responsible for calling flush_dcache_page() on each
BIO request. This operation keeps the I$ coherent with the D$ on
architectures that don't have hardware coherency support. Without this
flush, random crashes are seen when executing user programs from an ext4
filesystem backed by a ubiblock device.
This patch is based on the change implemented in commit 2d4dc890b5c8
("block: add helpers to run flush_dcache_page() against a bio and a
request's pages").
Fixes: 9d54c8a33eec ("UBI: R/O block driver on top of UBI volumes") Signed-off-by: Kevin Cernekee <cernekee@chromium.org> Signed-off-by: Ezequiel Garcia <ezequiel.garcia@imgtec.com> Signed-off-by: Richard Weinberger <richard@nod.at>
[ luis: backported to 3.16: used Kevin's backport to 3.14 ] Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Signed-off-by: Sasha Levin <sasha.levin@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
This fixes a race which can result in the same virtual IRQ number
being assigned to two different MSI interrupts. The most visible
consequence of that is usually a warning and stack trace from the
sysfs code about an attempt to create a duplicate entry in sysfs.
The race happens when one CPU (say CPU 0) is disposing of an MSI
while another CPU (say CPU 1) is setting up an MSI. CPU 0 calls
(for example) pnv_teardown_msi_irqs(), which calls
msi_bitmap_free_hwirqs() to indicate that the MSI (i.e. its
hardware IRQ number) is no longer in use. Then, before CPU 0 gets
to calling irq_dispose_mapping() to free up the virtal IRQ number,
CPU 1 comes in and calls msi_bitmap_alloc_hwirqs() to allocate an
MSI, and gets the same hardware IRQ number that CPU 0 just freed.
CPU 1 then calls irq_create_mapping() to get a virtual IRQ number,
which sees that there is currently a mapping for that hardware IRQ
number and returns the corresponding virtual IRQ number (which is
the same virtual IRQ number that CPU 0 was using). CPU 0 then
calls irq_dispose_mapping() and frees that virtual IRQ number.
Now, if another CPU comes along and calls irq_create_mapping(), it
is likely to get the virtual IRQ number that was just freed,
resulting in the same virtual IRQ number apparently being used for
two different hardware interrupts.
To fix this race, we just move the call to msi_bitmap_free_hwirqs()
to after the call to irq_dispose_mapping(). Since virq_to_hw()
doesn't work for the virtual IRQ number after irq_dispose_mapping()
has been called, we need to call it before irq_dispose_mapping() and
remember the result for the msi_bitmap_free_hwirqs() call.
The pattern of calling msi_bitmap_free_hwirqs() before
irq_dispose_mapping() appears in 5 places under arch/powerpc, and
appears to have originated in commit 05af7bd2d75e ("[POWERPC] MPIC
U3/U4 MSI backend") from 2007.
Fixes: 05af7bd2d75e ("[POWERPC] MPIC U3/U4 MSI backend") Reported-by: Alexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Luis Henriques <luis.henriques@canonical.com>