ALSA sequencer code has an open race between the timer setup ioctl and
the close of the client. This was triggered by syzkaller fuzzer, and
a use-after-free was caught there as a result.
This patch papers over it by adding a proper queue->timer_mutex lock
around the timer-related calls in the relevant code path.
snd_seq_ioctl_remove_events() calls snd_seq_fifo_clear()
unconditionally even if there is no FIFO assigned, and this leads to
an Oops due to NULL dereference. The fix is just to add a proper NULL
check.
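A minimal sketch of what the guard amounts to, assuming the FIFO hangs
off the client as client->data.user.fifo (field layout inferred from
the description, not necessarily the exact code):

	if (client->data.user.fifo)
		snd_seq_fifo_clear(client->data.user.fifo);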
When using protocol v2 or v3 hardware, elantech uses the function
elantech_report_semi_mt_data() to report data. These devices are rather
quirky: if num_fingers is 3, (x2,y2) is reported as (0,0), so only one
valid touch is reported.
Userspace (libinput) is confused by these (0,0) touches, detects them as
palms, and rejects them.
Commit 3c0213d17a09 ("Input: elantech - fix semi-mt protocol for v3 HW")
was sufficient for xf86-input-synaptics, and for libinput before it
gained palm rejection. Now we need to actually tell libinput that this
device is a semi-mt one and that it should not rely on the actual values
of the two touches.
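One way to advertise that is to flag the input device as semi-mt; a
minimal sketch, assuming dev is the elantech input_dev:

	/* tell userspace the two slots only bound the touches */
	__set_bit(INPUT_PROP_SEMI_MT, dev->propbit);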
Signed-off-by: Benjamin Tissoires <benjamin.tissoires@redhat.com> Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com> Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
The vt8500 clocksource driver declares itself as capable of handling a
minimum delay of 4 cycles by passing this value into
clockevents_config_and_register(). However, vt8500_timer_set_next_event()
requires the passed cycles value to be at least 16. The impact is that
userspace hangs in nanosleep() calls with small delay intervals.
This problem is reproducible in Linux 4.2 starting from: c6eb3f70d448 ('hrtimer: Get rid of hrtimer softirq')
From Russell King, more detailed explanation:
"It's a speciality of the StrongARM/PXA hardware. It takes a certain
number of OSCR cycles for the value written to hit the compare registers.
So, if a very small delta is written (eg, the compare register is written
with a value of OSCR + 1), the OSCR will have incremented past this value
before it hits the underlying hardware. The result is, that you end up
waiting a very long time for the OSCR to wrap before the event fires.
So, we introduce a check in set_next_event() to detect this and return
-ETIME if the calculated delta is too small, which causes the generic
clockevents code to retry after adding the min_delta specified in
clockevents_config_and_register() to the current time value.
min_delta must be sufficient that we don't re-trip the -ETIME check - if
we do, we will return -ETIME, forward the next event time, try to set it,
return -ETIME again, and basically lock the system up. So, min_delta
must be larger than the check inside set_next_event(). A factor of two
was chosen to ensure that this situation would never occur.
The PXA code worked on PXA systems for years, and I'd suggest no one
changes this mechanism without access to a wide range of PXA systems,
otherwise they're risking breakage."
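A sketch of the check Russell describes, with read_counter() and
MIN_OSCR_DELTA standing in for the driver's counter read and its
hardware-specific minimum (16 cycles for vt8500); not the exact driver
code:

	/* after programming the match register with 'alarm' */
	if ((signed)(alarm - read_counter()) <= MIN_OSCR_DELTA)
		return -ETIME;	/* generic code retries after adding min_delta */
	return 0;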
Cc: Russell King <linux@arm.linux.org.uk> Acked-by: Alexey Charkov <alchark@gmail.com> Signed-off-by: Roman Volkov <rvolkov@v1ros.org> Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org> Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
When we do dquot readahead in log recovery, we do not use a verifier
as the underlying buffer may not have dquots in it. e.g. the
allocation operation hasn't yet been replayed. Hence we do not want
to fail recovery because we detect an operation to be replayed has
not been run yet. This problem was addressed for inodes in commit d891400 ("xfs: inode buffers may not be valid during recovery
readahead") but the problem was not recognised to exist for dquots
and their buffers as the dquot readahead did not have a verifier.
The result of not using a verifier is that when the buffer is then
next read to replay a dquot modification, the dquot buffer verifier
will only be attached to the buffer if *readahead is not complete*.
Hence we can read the buffer, replay the dquot changes and then add
it to the delwri submission list without it having a verifier
attached to it. This then generates warnings in xfs_buf_ioapply(),
which catches and warns about this case.
Fix this and make it handle the same readahead verifier error cases
as for inode buffers by adding a new readahead verifier that has a
write operation as well as a read operation that marks the buffer as
not done if any corruption is detected. Also make sure we don't run
readahead if the dquot buffer has been marked as cancelled by
recovery.
This will result in readahead either succeeding and the buffer
having a valid write verifier, or readahead failing and the buffer
state requiring the subsequent read to resubmit the IO with the new
verifier. In either case, this will result in the buffer always
ending up with a valid write verifier on it.
Note: we also need to fix the inode buffer readahead error handling
to mark the buffer with EIO. Brian noticed during review that the code I
copied from there was wrong, so fix it at the same time. Add comments
linking the two functions that handle readahead verifier errors
together so we don't forget this behavioural link in future.
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
[ luis: backported to 3.16:
- struct xfs_buf_ops does not have a 'name' field in 3.16
- adjusted context ] Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
When we do inode readahead in log recovery, we can do the
readahead before we've replayed the icreate transaction that stamps
the buffer with inode cores. The inode readahead verifier catches
this and marks the buffer as !done to indicate that it doesn't yet
contain valid inodes.
In adding buffer error notification (i.e. setting b_error = -EIO at
the same time as we clear the done flag) to such a readahead
verifier failure, we can then get subsequent inode recovery failing
with this error:
This occurs when readahead completion races with icreate item replay
such as:
inode readahead
find buffer
lock buffer
submit RA io
....
icreate recovery
xfs_trans_get_buffer
find buffer
lock buffer
<blocks on RA completion>
.....
<ra completion>
fails verifier
clear XBF_DONE
set bp->b_error = -EIO
release and unlock buffer
<icreate gains lock>
icreate initialises buffer
marks buffer as done
adds buffer to delayed write queue
releases buffer
At this point, we have an initialised inode buffer that is up to
date but has an -EIO state registered against it. When we finally
get to recovering an inode in that buffer:
inode item recovery
xfs_trans_read_buffer
find buffer
lock buffer
sees XBF_DONE is set, returns buffer
sees bp->b_error is set
fail log recovery!
Essentially, we need xfs_trans_get_buf_map() to clear the error status of
the buffer when doing a lookup. This function returns uninitialised
buffers, so the buffer returned can not be in an error state and
none of the code that uses this function expects b_error to be set
on return. Indeed, there is an ASSERT(!bp->b_error); in the
transaction case in xfs_trans_get_buf_map() that would have caught
this if log recovery used transactions....
This patch firstly changes the inode readahead failure to set -EIO
on the buffer, and secondly changes xfs_buf_get_map() to never
return a buffer with an error state set so this first change doesn't
cause unexpected log recovery failures.
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
[ luis: backported to 3.16:
- file rename: fs/xfs/libxfs/xfs_inode_buf.c -> fs/xfs/xfs_inode_buf.c
- adjusted context ] Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
The normalization pass in the sorting routine of the relative exception
table serves two purposes:
- it ensures that the address fields of the exception table entries are
fully ordered, so that no ambiguities arise between entries with
identical instruction offsets (i.e., when two instructions that are
exactly 8 bytes apart each have an exception table entry associated with
them)
- it ensures that the offsets of both the instruction and the fixup fields
of each entry are relative to their final location after sorting.
Commit eb608fb366de ("s390/exceptions: switch to relative exception table
entries") ported the relative exception table format from x86, but modified
the sorting routine to only normalize the instruction offset field and not
the fixup offset field. The result is that the fixup offset of each entry
will be relative to the original location of the entry before sorting,
which likely leads to crashes when those entries are dereferenced.
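An illustrative sketch of what normalizing both fields in the sort's
swap callback looks like for 32-bit relative entries (names and layout
are simplified, not the exact s390 code):

	struct exception_table_entry {
		int insn, fixup;	/* offsets relative to the fields themselves */
	};

	static void swap_ex(void *a, void *b, int size)
	{
		struct exception_table_entry *x = a, *y = b, tmp;
		int delta = (int)((char *)b - (char *)a);

		tmp = *x;
		x->insn = y->insn + delta;
		x->fixup = y->fixup + delta;	/* the adjustment that was missing */
		y->insn = tmp.insn - delta;
		y->fixup = tmp.fixup - delta;
	}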
Fixes: eb608fb366de ("s390/exceptions: switch to relative exception table entries") Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
When decompressing the kernel image during x86 bootup, allocating memory
for the ELF program headers may run out of heap space, which leads
to a system halt. This patch doubles BOOT_HEAP_SIZE to 64KB.
Tested with 32-bit kernel which failed to boot without this patch.
Signed-off-by: H.J. Lu <hjl.tools@gmail.com> Acked-by: H. Peter Anvin <hpa@zytor.com> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
When switch_mm() activates a new PGD, it also sets a bit that
tells other CPUs that the PGD is in use so that TLB flush IPIs
will be sent. In order for that to work correctly, the bit
needs to be visible prior to loading the PGD and therefore
starting to fill the local TLB.
Document all the barriers that make this work correctly and add
a couple that were missing.
Signed-off-by: Andy Lutomirski <luto@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-mm@kvack.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
[ luis: backported to 3.16:
- dropped N/A comment in flush_tlb_mm_range()
- adjusted context ] Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
Let's also get rid of the "cosmic ray protection" while we're at it.
Fixes: e9193059b1b3 "hostfs: fix races in dentry_name() and inode_name()" Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com> Cc: Jeff Dike <jdike@addtoit.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Richard Weinberger <richard@nod.at> Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
When there is an error copying a chunk dm-snapshot can incorrectly hold
associated bios indefinitely, resulting in hung IO.
The function copy_callback sets pe->error if there was error copying the
chunk, and then calls complete_exception. complete_exception calls
pending_complete on error, otherwise it calls commit_exception with
commit_callback (and commit_callback calls complete_exception).
The persistent exception store (dm-snap-persistent.c) assumes that calls
to prepare_exception and commit_exception are paired.
persistent_prepare_exception increases ps->pending_count and
persistent_commit_exception decreases it.
If there is a copy error, persistent_prepare_exception is called but
persistent_commit_exception is not. This results in the variable
ps->pending_count never returning to zero and that causes some pending
exceptions (and their associated bios) to be held forever.
Fix this by unconditionally calling commit_exception regardless of
whether the copy was successful. A new "valid" parameter is added to
commit_exception -- when the copy fails this parameter is set to zero so
that the chunk that failed to copy (and all following chunks) is not
recorded in the snapshot store. Also, remove commit_callback now that
it is merely a wrapper around pending_complete.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
The detection of direction for compress was only taking into account the
codec capabilities and not the CPU ones. Fix this by checking the
CPU-side capabilities as well.
Tested-by: Ashish Panwar <ashish.panwar@intel.com> Signed-off-by: Vinod Koul <vinod.koul@intel.com> Signed-off-by: Mark Brown <broonie@kernel.org>
[ luis: backported to 3.16: adjusted context ] Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
Dmitry reported that he was able to reproduce the WARN_ON_ONCE that
fires in locks_free_lock_context when the flc_posix list isn't empty.
The problem turns out to be that we're basically rebuilding the
file_lock from scratch in fcntl_setlk when we discover that the setlk
has raced with a close. If the l_whence field is SEEK_CUR or SEEK_END,
then we may end up with fl_start and fl_end values that differ from
when the lock was initially set, if the file position or length of the
file has changed in the interim.
Fix this by just reusing the same lock request structure, and simply
overriding the fl_type value with F_UNLCK as appropriate. That ensures that
we really are unlocking the lock that was initially set.
While we're there, make sure that we do pop a WARN_ON_ONCE if the
removal ever fails. Also return -EBADF in this event, since that's
what we would have returned if the close had happened earlier.
Cc: Alexander Viro <viro@zeniv.linux.org.uk> Fixes: c293621bbf67 (stale POSIX lock handling) Reported-by: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Jeff Layton <jeff.layton@primarydata.com> Acked-by: "J. Bruce Fields" <bfields@fieldses.org> Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
On -RT, or if the kernel is booted with the "threadirqs" command-line
parameter, PCIe/PCI (MSI) IRQ cascade handlers (like
dra7xx_pcie_msi_irq_handler()) will be forced threaded and, as a result,
will generate warnings like this:
WARNING: CPU: 1 PID: 82 at kernel/irq/handle.c:150 handle_irq_event_percpu+0x14c/0x174()
irq 460 handler irq_default_primary_handler+0x0/0x14 enabled interrupts
Backtrace:
(warn_slowpath_common) from (warn_slowpath_fmt+0x38/0x40)
(warn_slowpath_fmt) from (handle_irq_event_percpu+0x14c/0x174)
(handle_irq_event_percpu) from (handle_irq_event+0x84/0xb8)
(handle_irq_event) from (handle_simple_irq+0x90/0x118)
(handle_simple_irq) from (generic_handle_irq+0x30/0x44)
(generic_handle_irq) from (dra7xx_pcie_msi_irq_handler+0x7c/0x8c)
(dra7xx_pcie_msi_irq_handler) from (irq_forced_thread_fn+0x28/0x5c)
(irq_forced_thread_fn) from (irq_thread+0x128/0x204)
This happens because all of them invoke generic_handle_irq() from the
requested handler. generic_handle_irq() grabs raw_locks and thus needs to
run in raw-IRQ context.
This issue was originally reproduced on the TI dra7-evm, but, as was
identified during discussion [1], other hosts can also suffer from it.
Fix them all at once by marking PCIe/PCI (MSI) IRQ cascade handlers
IRQF_NO_THREAD explicitly.
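The fix boils down to passing IRQF_NO_THREAD when requesting the cascade
interrupt; a sketch roughly along the lines of the dra7xx case (exact
variable and resource names differ per host driver):

	ret = devm_request_irq(dev, msi_irq, dra7xx_pcie_msi_irq_handler,
			       IRQF_SHARED | IRQF_NO_THREAD,
			       "dra7-pcie-msi", dra7xx);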
Commit 36e097a8a297 ("PCI: Split out bridge window override of minimum
allocation address") claimed to do no functional changes but unfortunately
did: The "min" variable is altered. At least the AVM A1 PCMCIA adapter was
no longer detected, breaking ISDN operation.
Use a local copy of "min" to restore the previous behaviour.
[bhelgaas: avoid gcc "?:" extension for portability and readability] Fixes: 36e097a8a297 ("PCI: Split out bridge window override of minimum allocation address") Signed-off-by: Christoph Biedl <linux-kernel.bfrz@manchmal.in-ulm.de> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
If a name contains characters with Unicode values exceeding a single
byte, the CS0 output should have 2 bytes per character. Then even the
input characters with single-byte Unicode values are converted to 2
output bytes, and the length of the output becomes larger than the
length of the input. If the input name is long enough, the output
length may exceed the allocated buffer length.
All this means that conversion from UTF8 or NLS to CS0 requires
checking of output length in order to stop when it exceeds the given
output buffer size.
[JK: Make code return -ENAMETOOLONG instead of silently truncating the
name]
Signed-off-by: Andrew Gabbasov <andrew_gabbasov@mentor.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
The udf_CS0toUTF8 function stops the conversion when the output buffer
length reaches UDF_NAME_LEN-2, which is the correct maximum name length,
but, when checking, it leaves space for a single byte only, while
multi-byte output characters can take more space, causing a buffer
overflow.
A similar error exists in the udf_CS0toNLS function, which restricts
the output length to UDF_NAME_LEN, while the actual maximum allowed
length is UDF_NAME_LEN-2.
In these cases the output can overwrite not only the current buffer's
length field, corrupting the name buffer itself, but also the following
allocation structures, causing a kernel crash.
Adjust the output length checks in both functions to prevent buffer
overruns in case of multi-byte UTF8 or NLS characters.
Signed-off-by: Andrew Gabbasov <andrew_gabbasov@mentor.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
On a cancelled suspend the vcpu_info location does not change (it's
still in the per-cpu area registered by xen_vcpu_setup()). So do not
call xen_hvm_init_shared_info(), which would make the kernel think the
vcpu_info is back in the shared info page. With the wrong vcpu_info, events cannot be
received and the domain will hang after a cancelled suspend.
Signed-off-by: Charles Ouyang <ouyangzhaowei@huawei.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Signed-off-by: David Vrabel <david.vrabel@citrix.com> Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
Because struct xfs_agfl is 36 bytes long and has a 64-bit integer
inside it, gcc will quietly round the structure size up to the nearest
64 bits -- in this case, 40 bytes. This results in the XFS_AGFL_SIZE
macro returning incorrect results for v5 filesystems on 64-bit
machines (118 items instead of 119). As a result, a 32-bit xfs_repair
will see garbage in AGFL item 119 and complain.
Therefore, tell gcc not to pad the structure so that the AGFL size
calculation is correct.
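A simplified sketch of the situation and the fix (the field list here is
illustrative, not the exact on-disk definition, and the flexible bno[]
array is omitted):

	struct xfs_agfl {
		__be32	agfl_magicnum;	/*  4 bytes */
		__be32	agfl_seqno;	/*  4 bytes */
		uuid_t	agfl_uuid;	/* 16 bytes */
		__be64	agfl_lsn;	/*  8 bytes -- the 64-bit member */
		__be32	agfl_crc;	/*  4 bytes => 36 bytes of real data */
	} __attribute__((packed));	/* without packed, gcc pads to 40 */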
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
[ luis: backported to 3.16:
- file rename: fs/xfs/libxfs/xfs_format.h -> fs/xfs/xfs_ag.h ] Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
Without i8042.nomux=1 the Elantech touch pad is not working at all on
a Fujitsu Lifebook U745. This patch does not seem necessary for all
U745 (maybe because of different BIOS versions?). However, it was
verified that the patch does not break those (see opensuse bug 883192:
https://bugzilla.opensuse.org/show_bug.cgi?id=883192).
Fix the below Oops when trying to modprobe wlcore_spi.
The oops occurs because the wl1271_power_{off,on}()
function doesn't check the power() function pointer.
[ 23.665030] [<bf2581f0>] (wl12xx_set_power_on [wlcore]) from
[<bf25f7ac>] (wlcore_nvs_cb+0x118/0xa4c [wlcore])
[ 23.675604] [<bf25f7ac>] (wlcore_nvs_cb [wlcore]) from [<c04387ec>]
(request_firmware_work_func+0x30/0x58)
[ 23.685784] [<c04387ec>] (request_firmware_work_func) from
[<c0058e2c>] (process_one_work+0x1b4/0x4b4)
[ 23.695591] [<c0058e2c>] (process_one_work) from [<c0059168>]
(worker_thread+0x3c/0x4a4)
[ 23.704124] [<c0059168>] (worker_thread) from [<c005ee68>]
(kthread+0xd4/0xf0)
[ 23.711747] [<c005ee68>] (kthread) from [<c000f598>]
(ret_from_fork+0x14/0x3c)
[ 23.719357] Code: bad PC value
[ 23.722760] ---[ end trace 981be8510db9b3a9 ]---
Prevent the oops by validating the power() pointer value before
calling the function.
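A minimal sketch of the guard (the if_ops layout and return convention
are assumed from the description of the bug):

	static int wl1271_power_on(struct wl1271 *wl)
	{
		if (!wl->if_ops->power)
			return -EINVAL;		/* no platform power hook provided */

		return wl->if_ops->power(wl->dev, true);
	}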
Signed-off-by: Uri Mashiach <uri.mashiach@compulab.co.il> Acked-by: Igor Grinberg <grinberg@compulab.co.il> Signed-off-by: Kalle Valo <kvalo@codeaurora.org> Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
Previously, it would only scan the entire disk if it was starting from
the very start of the disk - i.e. if the previous scan got to the end.
This was broken by refill_full_stripes(), which updates last_scanned so
that refill_dirty was never triggering the searched_from_start path.
But if we change refill_dirty() to always scan the entire disk if
necessary, regardless of what last_scanned was, the code gets cleaner
and we fix that bug too.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Jens Axboe <axboe@fb.com> Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
Add a safeguard for the shutdown case. At least while not being
attached, it is also possible to trigger a kernel bug by writing into
writeback_running. This change adds the same check before trying to
wake up the thread for that case.
Signed-off-by: Stefan Bader <stefan.bader@canonical.com> Cc: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Jens Axboe <axboe@fb.com> Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
Allows the use of register, rather than register_quiet, in udev to avoid a "device_busy" error.
The initial patch proposed at https://lkml.org/lkml/2013/8/26/549 by Gabriel de Perthuis
<g2p.code@gmail.com> does not unlock the mutex and hangs the kernel.
See http://thread.gmane.org/gmane.linux.kernel.bcache.devel/2594 for the discussion.
Cc: Denis Bychkov <manover@gmail.com> Cc: Kent Overstreet <kent.overstreet@gmail.com> Cc: Eric Wheeler <bcache@linux.ewheeler.net> Cc: Gabriel de Perthuis <g2p.code@gmail.com> Signed-off-by: Jens Axboe <axboe@fb.com>
[ luis: backported to 3.16: adjusted context ] Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
We forget to clear the BCACHE_DEV_UNLINK_DONE flag in the
bcache_device_attach() function when we attach a backing device for the
first time. After detaching this backing device, this flag will be true
and sysfs_remove_link() isn't called in bcache_device_unlink(). Then when
we attach this backing device again, sysfs_create_link() will return an
EEXIST error in bcache_device_link().
So the fix is trivial: clear this flag in bcache_device_link().
Signed-off-by: Zheng Liu <wenqing.lz@taobao.com> Tested-by: Joshua Schmid <jschmid@suse.com> Tested-by: Eric Wheeler <bcache@linux.ewheeler.net> Cc: Kent Overstreet <kmo@daterainc.com> Signed-off-by: Jens Axboe <axboe@fb.com> Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
This commit tries to fix a livelock in bcache. This livelock might
happen when a huge number of cache misses occur simultaneously.
When we get a cache miss, bcache will execute the following path.
->cached_dev_make_request()
->cached_dev_read()
->cached_lookup()
->bch->btree_map_keys()
->btree_root() <------------------------
->bch_btree_map_keys_recurse() |
->cache_lookup_fn() |
->cached_dev_cache_miss() |
->bch_btree_insert_check_key() -|
[If btree->seq is not equal to seq + 1, we should return
EINTR and traverse btree again.]
In the bch_btree_insert_check_key() function we first need to check the
upgrade flag (op->lock == -1), and when this flag is true we need to
release the read btree->lock and try to take the write btree->lock. While
this write lock is taken and released, btree->seq is monotonically
increased in order to prevent other threads from modifying the btree
during the cache miss (see btree.h:74).
But if there are many cache misses caused by many requests, we can hit a
livelock because btree->seq is always being changed by others, so no one
can make progress.
This commit takes the write btree->lock if it encounters a race while
traversing the btree. Although this sacrifices scalability, it ensures
that only one thread can modify the btree at a time.
Signed-off-by: Zheng Liu <wenqing.lz@taobao.com> Tested-by: Joshua Schmid <jschmid@suse.com> Tested-by: Eric Wheeler <bcache@linux.ewheeler.net> Cc: Joshua Schmid <jschmid@suse.com> Cc: Zhu Yanhai <zhu.yanhai@gmail.com> Cc: Kent Overstreet <kmo@daterainc.com> Signed-off-by: Jens Axboe <axboe@fb.com> Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
If a NFSv4 client uses the cache_consistency_bitmask in order to
request only information about the change attribute, timestamps and
size, then it has not revalidated all attributes, and hence the
attribute timeout timestamp should not be updated.
Reported-by: Donald Buczek <buczek@molgen.mpg.de> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
[ luis: backported to 3.16: adjusted context ] Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
Donald Buczek reports that a nfs4 client incorrectly denies
execute access based on outdated file mode (missing 'x' bit).
After the mode on the server is 'fixed' (chmod +x) further execution
attempts continue to fail, because the nfs ACCESS call updates
the access parameter but not the mode parameter or the mode in
the inode.
The root cause is ultimately that the VFS is calling may_open()
before the NFS client has a chance to OPEN the file and hence revalidate
the access and attribute caches.
Al Viro suggests:
>>> Make nfs_permission() relax the checks when it sees MAY_OPEN, if you know
>>> that things will be caught by server anyway?
>>
>> That can work as long as we're guaranteed that everything that calls
>> inode_permission() with MAY_OPEN on a regular file will also follow up
>> with a vfs_open() or dentry_open() on success. Is this always the
>> case?
>
> 1) in do_tmpfile(), followed by do_dentry_open() (not reachable by NFS since
> it doesn't have ->tmpfile() instance anyway)
>
> 2) in atomic_open(), after the call of ->atomic_open() has succeeded.
>
> 3) in do_last(), followed on success by vfs_open()
>
> That's all. All calls of inode_permission() that get MAY_OPEN come from
> may_open(), and there's no other callers of that puppy.
This driver fails to copy the module parameter for software encryption
to the locations used by the main code.
Signed-off-by: Larry Finger <Larry.Finger@lwfinger.net> Signed-off-by: Kalle Valo <kvalo@codeaurora.org>
[ luis: backported to 3.16:
- file rename: drivers/net/wireless/realtek/rtlwifi/rtl8192cu/sw.c ->
drivers/net/wireless/rtlwifi/rtl8192cu/sw.c ] Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
The module parameter for software encryption was never transferred to
the location used by the driver.
Signed-off-by: Larry Finger <Larry.Finger@lwfinger.net> Signed-off-by: Kalle Valo <kvalo@codeaurora.org>
[ luis: backported to 3.16:
- drivers/net/wireless/realtek/rtlwifi/rtl8192ce/sw.c ->
drivers/net/wireless/rtlwifi/rtl8192ce/sw.c ] Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
Two of the module parameter descriptions show incorrect default values.
In addition the value for software encryption is not transferred to
the locations used by the driver.
Signed-off-by: Larry Finger <Larry.Finger@lwfinger.net> Signed-off-by: Kalle Valo <kvalo@codeaurora.org>
[ luis: backported to 3.16:
- file rename: drivers/net/wireless/realtek/rtlwifi/rtl8192se/sw.c ->
drivers/net/wireless/rtlwifi/rtl8192se/sw.c ] Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
Two of the module parameters are listed with incorrect default values.
Signed-off-by: Larry Finger <Larry.Finger@lwfinger.net> Signed-off-by: Kalle Valo <kvalo@codeaurora.org>
[ luis: backported to 3.16:
- file rename: drivers/net/wireless/realtek/rtlwifi/rtl8192de/sw.c ->
drivers/net/wireless/rtlwifi/rtl8192de/sw.c ] Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
The posix_clock_poll function is supposed to return a bit mask of
POLLxxx values. However, in case the hardware has disappeared (due to
hot plugging for example) this code returns -ENODEV in a futile
attempt to throw an error at the file descriptor level. The kernel's
file_operations interface does not accept such error codes from the
poll method. Instead, this function ought to return POLLERR.
The value -ENODEV does, in fact, contain the POLLERR bit (and almost
all the other POLLxxx bits as well), but only by chance. This patch
fixes the code to return a proper bit mask.
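A sketch of the corrected early return, assuming clk is the looked-up
posix clock:

	if (!clk)
		return POLLERR;	/* a poll mask, not an errno */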
Credit goes to Markus Elfring for pointing out the suspicious
signed/unsigned mismatch.
Reported-by: Markus Elfring <elfring@users.sourceforge.net>
Signed-off-by: Richard Cochran <richardcochran@gmail.com> Cc: John Stultz <john.stultz@linaro.org> Cc: Julia Lawall <julia.lawall@lip6.fr> Link: http://lkml.kernel.org/r/1450819198-17420-1-git-send-email-richardcochran@gmail.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
Add the USB device ID for ELV Marble Sound Board 1.
Signed-off-by: Oliver Freyermuth <o.freyermuth@googlemail.com> Signed-off-by: Johan Hovold <johan@kernel.org> Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
We've seen this in a packet capture - I've intermixed what I
think was going on. The fix here is to grab the so_lock sooner.
1964379 -> #1 open (for write) reply seqid=1
1964393 -> #2 open (for read) reply seqid=2
__nfs4_close(), state->n_wronly--
nfs4_state_set_mode_locked(), changes state->state = [R]
state->flags is [RW]
state->state is [R], state->n_wronly == 0, state->n_rdonly == 1
1964398 -> #3 open (for write) call -> because close is already running
1964399 -> downgrade (to read) call seqid=2 (close of #1)
1964402 -> #3 open (for write) reply seqid=3
__update_open_stateid()
nfs_set_open_stateid_locked(), changes state->flags
state->flags is [RW]
state->state is [R], state->n_wronly == 0, state->n_rdonly == 1
new sequence number is exposed now via nfs4_stateid_copy()
next step would be update_open_stateflags(), pending so_lock
1964403 -> downgrade reply seqid=2, fails with OLD_STATEID (close of #1)
nfs4_close_prepare() gets so_lock and recalcs flags -> send close
udf_next_aext() just follows extent pointers while extents are marked as
indirect. This can loop forever for a corrupted filesystem. Limit the
number of indirect extents we are willing to follow in a row.
[JK: Updated changelog, limit, style]
Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com> Cc: Jan Kara <jack@suse.com> Cc: Quentin Casasnovas <quentin.casasnovas@oracle.com> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
sdhci has a legacy facility to prevent runtime suspend if the
bus power is on. This is needed in cases where the power to
the card is dependent on the bus power. It is controlled by
a pair of functions: sdhci_runtime_pm_bus_on() and
sdhci_runtime_pm_bus_off(). These functions use a boolean
variable 'bus_on' to ensure changes are always paired.
There is an additional check for 'runtime_suspended' which is
the problem. In fact, its use is ill-conceived as the only
requirement for the logic is that 'on' and 'off' are paired,
which is actually broken by the check, for example if the bus
power is turned on during runtime resume. So remove the check.
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com> Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org> Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
The 'ocr' parameter passed to mmc_set_signal_voltage()
defines the power-on voltage used when power cycling
after a failure to set the voltage. However, in the
case of mmc_sdio_init_card(), the value passed has the
R4_18V_PRESENT flag set which is not valid for power-on
and results in an invalid vdd. Fix by passing the card's
ocr value which does not have the flag.
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com> Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org> Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
The pmuserenr_el0 register value is architecturally UNKNOWN on reset.
Current kernel code resets that register value iff the core pmu device is
correctly probed in the kernel. On platforms with missing DT pmu nodes (or
disabled perf events in the kernel), the pmu is not probed, therefore the
pmuserenr_el0 register is not reset in the kernel, which means that its
value retains the reset value that is architecturally UNKNOWN (system
may run with eg pmuserenr_el0 == 0x1, which means that PMU counters access
is available at EL0, which must be disallowed).
This patch adds code that resets pmuserenr_el0 on cold boot and restores
it on core resume from shutdown, so that the pmuserenr_el0 setup is
always enforced in the kernel.
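A sketch of what the reset amounts to (the actual patch does this in the
EL1 CPU setup and resume paths in assembly):

	/* disable EL0 access to the PMU counters until perf re-enables it */
	asm volatile("msr pmuserenr_el0, xzr");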
Cc: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com> Signed-off-by: Will Deacon <will.deacon@arm.com> Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
If the proxy lock in the requeue loop acquires the rtmutex for a
waiter, then it also acquires a refcount on the pi_state related to the
futex, but the waiter side does not drop the reference count.
Add the missing free_pi_state() call.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Darren Hart <darren@dvhart.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Bhuvanesh_Surachari@mentor.com Cc: Andy Lowe <Andy_Lowe@mentor.com> Link: http://lkml.kernel.org/r/20151219200607.178132067@linutronix.de Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
When a thin pool is being destroyed delayed work items are
cancelled using cancel_delayed_work(), which doesn't guarantee that on
return the delayed item isn't running. This can cause the work item to
requeue itself on an already destroyed workqueue. Fix this by using
cancel_delayed_work_sync() which guarantees that on return the work item
is not running anymore.
Fixes: 905e51b39a555 ("dm thin: commit outstanding data every second") Fixes: 85ad643b7e7e5 ("dm thin: add timeout to stop out-of-data-space mode holding IO forever") Signed-off-by: Nikolay Borisov <kernel@kyup.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
Fixes: 50dd842ad ("dm space map metadata: fix ref counting bug when bootstrapping a new space map") Reported-by: David Binderman <dcb314@hotmail.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
According to memory-barriers.txt, xchg*, cmpxchg* and their atomic_
versions all need to be fully ordered, however they are now just
RELEASE+ACQUIRE, which are not fully ordered.
So also replace PPC_RELEASE_BARRIER and PPC_ACQUIRE_BARRIER with
PPC_ATOMIC_ENTRY_BARRIER and PPC_ATOMIC_EXIT_BARRIER in
__{cmp,}xchg_{u32,u64} respectively to guarantee fully ordered semantics
of atomic{,64}_{cmp,}xchg() and {cmp,}xchg(), as a complement of commit b97021f85517 ("powerpc: Fix atomic_xxx_return barrier semantics")
This patch depends on patch "powerpc: Make value-returning atomics fully
ordered" for PPC_ATOMIC_ENTRY_BARRIER definition.
Signed-off-by: Boqun Feng <boqun.feng@gmail.com> Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
According to memory-barriers.txt:
> Any atomic operation that modifies some state in memory and returns
> information about the state (old or new) implies an SMP-conditional
> general memory barrier (smp_mb()) on each side of the actual
> operation ...
Which means these operations should be fully ordered. However on PPC,
PPC_ATOMIC_ENTRY_BARRIER is the barrier before the actual operation,
which is currently "lwsync" if SMP=y. The leading "lwsync" cannot
guarantee fully ordered atomics, according to Paul McKenney:
https://lkml.org/lkml/2015/10/14/970
To fix this, we define PPC_ATOMIC_ENTRY_BARRIER as "sync" to guarantee
the fully-ordered semantics.
This also makes futex atomics fully ordered, which can avoid possible
memory ordering problems if userspace code relies on futex system call
for fully ordered semantics.
Fixes: b97021f85517 ("powerpc: Fix atomic_xxx_return barrier semantics") Signed-off-by: Boqun Feng <boqun.feng@gmail.com> Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
In paging_init, we allocate the zero page, memset it to zero and then
point TTBR0 to it in order to avoid speculative fetches through the
identity mapping.
In order to guarantee that the freshly zeroed page is indeed visible to
the page table walker, we need to execute a dsb instruction prior to
writing the TTBR.
Signed-off-by: Will Deacon <will.deacon@arm.com> Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
EDAC workqueue destruction is really fragile. We cancel delayed work
but if it is still running and requeues itself, we still go ahead and
destroy the workqueue and the queued work explodes when workqueue core
attempts to run it.
Make the destruction more robust by switching op_state to offline so
that requeuing stops. Cancel any pending work *synchronously* too.
I get the splat below when modprobing/rmmoding EDAC drivers. It happens
because bus->name is invalid after bus_unregister() has run. The Code: section
below corresponds to:
Also use goto labels for all failure paths in
edac_create_sysfs_mci_device and update meaningless labels.
Signed-off-by: Junjie Mao <junjie.mao@hotmail.com> Link: http://lkml.kernel.org/r/BLU436-SMTP25291B6B612942A212AEBFE95300@phx.gbl
[ Boris: Use ! for 0 checks and add newlines for less crammed code. ] Signed-off-by: Borislav Petkov <bp@suse.de> Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
The maximum number of chunks used by the function is
(SPI_AGGR_BUFFER_SIZE / WSPI_MAX_CHUNK_SIZE + 1).
The original commands array had space for
(SPI_AGGR_BUFFER_SIZE / WSPI_MAX_CHUNK_SIZE) commands.
When the last chunk is used (len > 4 * WSPI_MAX_CHUNK_SIZE), the last
command is stored outside the bounds of the commands array.
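A sketch of the sizing fix implied above (macro names as quoted in the
text; the exact definitions in the driver may be spelled differently):

	/* one command per full chunk, plus one for the trailing partial chunk */
	#define WSPI_MAX_NUM_OF_CHUNKS \
		((SPI_AGGR_BUFFER_SIZE / WSPI_MAX_CHUNK_SIZE) + 1)

	u32 commands[WSPI_MAX_NUM_OF_CHUNKS];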
Oops 5 (page fault) is generated during current wl1271 firmware load
attempt:
root@debian-armhf:~# ifconfig wlan0 up
[ 294.312399] Unable to handle kernel paging request at virtual address 00203fc4
[ 294.320173] pgd = de528000
[ 294.323028] [00203fc4] *pgd=00000000
[ 294.326916] Internal error: Oops: 5 [#1] SMP ARM
[ 294.331789] Modules linked in: bnep rfcomm bluetooth ipv6 arc4 wl12xx
wlcore mac80211 musb_dsps cfg80211 musb_hdrc usbcore usb_common
wlcore_spi omap_rng rng_core musb_am335x omap_wdt cpufreq_dt thermal_sys
hwmon
[ 294.351838] CPU: 0 PID: 1827 Comm: ifconfig Not tainted 4.2.0-00002-g3e9ad27-dirty #78
[ 294.360154] Hardware name: Generic AM33XX (Flattened Device Tree)
[ 294.366557] task: dc9d6d40 ti: de550000 task.ti: de550000
[ 294.372236] PC is at __spi_validate+0xa8/0x2ac
[ 294.376902] LR is at __spi_sync+0x78/0x210
[ 294.381200] pc : [<c049c760>] lr : [<c049ebe0>] psr: 60000013
[ 294.381200] sp : de551998 ip : de5519d8 fp : 00200000
[ 294.393242] r10: de551c8c r9 : de5519d8 r8 : de3a9000
[ 294.398730] r7 : de3a9258 r6 : de3a9400 r5 : de551a48 r4 : 00203fbc
[ 294.405577] r3 : 00000000 r2 : 00000000 r1 : 00000000 r0 : de3a9000
[ 294.412420] Flags: nZCv IRQs on FIQs on Mode SVC_32 ISA ARM
Segment user
[ 294.419918] Control: 10c5387d Table: 9e528019 DAC: 00000015
[ 294.425954] Process ifconfig (pid: 1827, stack limit = 0xde550218)
[ 294.432437] Stack: (0xde551998 to 0xde552000)
...
[ 294.883613] [<c049c760>] (__spi_validate) from [<c049ebe0>]
(__spi_sync+0x78/0x210)
[ 294.891670] [<c049ebe0>] (__spi_sync) from [<bf036598>]
(wl12xx_spi_raw_write+0xfc/0x148 [wlcore_spi])
[ 294.901661] [<bf036598>] (wl12xx_spi_raw_write [wlcore_spi]) from
[<bf21c694>] (wlcore_boot_upload_firmware+0x1ec/0x458 [wlcore])
[ 294.914038] [<bf21c694>] (wlcore_boot_upload_firmware [wlcore]) from
[<bf24532c>] (wl12xx_boot+0xc10/0xfac [wl12xx])
[ 294.925161] [<bf24532c>] (wl12xx_boot [wl12xx]) from [<bf20d5cc>]
(wl1271_op_add_interface+0x5b0/0x910 [wlcore])
[ 294.936364] [<bf20d5cc>] (wl1271_op_add_interface [wlcore]) from
[<bf15c4ac>] (ieee80211_do_open+0x44c/0xf7c [mac80211])
[ 294.947963] [<bf15c4ac>] (ieee80211_do_open [mac80211]) from
[<c0537978>] (__dev_open+0xa8/0x110)
[ 294.957307] [<c0537978>] (__dev_open) from [<c0537bf8>]
(__dev_change_flags+0x88/0x148)
[ 294.965713] [<c0537bf8>] (__dev_change_flags) from [<c0537cd0>]
(dev_change_flags+0x18/0x48)
[ 294.974576] [<c0537cd0>] (dev_change_flags) from [<c05a55a0>]
(devinet_ioctl+0x6b4/0x7d0)
[ 294.983191] [<c05a55a0>] (devinet_ioctl) from [<c0517040>]
(sock_ioctl+0x1e4/0x2bc)
[ 294.991244] [<c0517040>] (sock_ioctl) from [<c017d378>]
(do_vfs_ioctl+0x420/0x6b0)
[ 294.999208] [<c017d378>] (do_vfs_ioctl) from [<c017d674>]
(SyS_ioctl+0x6c/0x7c)
[ 295.006880] [<c017d674>] (SyS_ioctl) from [<c000f4c0>]
(ret_fast_syscall+0x0/0x54)
[ 295.014835] Code: e1550004 e2444034 0a00007d e5953018 (e5942008)
[ 295.021544] ---[ end trace 66ed188198f4e24e ]---
Signed-off-by: Uri Mashiach <uri.mashiach@compulab.co.il> Acked-by: Igor Grinberg <grinberg@compulab.co.il> Signed-off-by: Kalle Valo <kvalo@codeaurora.org> Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
Free skb for received frames with a wrong checksum. This can happen
pretty rapidly, exhausting all memory.
This fixes a memleak (detected with kmemleak). Originally found while
using monitor mode, but it also appears during managed mode (once the
link is up).
Signed-off-by: Peter Wu <peter@lekensteyn.nl> ACKed-by: Larry Finger <Larry.Finger@lwfinger.net> Signed-off-by: Kalle Valo <kvalo@codeaurora.org>
[ luis: backported to 3.16:
- file rename: drivers/net/wireless/realtek/rtlwifi/usb.c ->
drivers/net/wireless/rtlwifi/usb.c ] Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
1e75fa8 "time: Condense timekeeper.xtime into xtime_sec" replaced a call to
clocksource_cyc2ns() from timekeeping_get_ns() with an open-coded version
of the same logic to avoid keeping a semi-redundant struct timespec
in struct timekeeper.
However, the commit also introduced a subtle semantic change - where
clocksource_cyc2ns() uses purely unsigned math, the new version introduces
a signed temporary, meaning that if (delta * tk->mult) has a 63-bit
overflow the following shift will still give a negative result. The
choice of 'maxsec' in __clocksource_updatefreq_scale() means this will
generally happen if there's a ~10 minute pause in examining the
clocksource.
This can be triggered on a powerpc KVM guest by stopping it from qemu for
a bit over 10 minutes. After resuming time has jumped backwards several
minutes causing numerous problems (jiffies does not advance, msleep()s can
be extended by minutes..). It doesn't happen on x86 KVM guests, because
the guest TSC is effectively frozen while the guest is stopped, which is
not the case for the powerpc timebase.
Obviously an unsigned (64 bit) overflow will only take twice as long as a
signed, 63-bit overflow. I don't know the time code well enough to know
if that will still cause incorrect calculations, or if a 64-bit overflow
is avoided elsewhere.
Still, an incorrect forwards clock adjustment will cause less trouble than
time going backwards. So, this patch removes the potential for
intermediate signed overflow.
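An illustrative reduction of the two variants (field names follow the
struct timekeeper members mentioned above, but this is a sketch, not
the exact timekeeping_get_ns() code):

	/* before (sketch): signed accumulator -- a 63-bit overflow of
	 * delta * mult makes nsec negative; time appears to jump backwards */
	s64 nsec_signed = delta * tk->mult + tk->xtime_nsec;
	nsec_signed >>= tk->shift;

	/* after (sketch): unsigned accumulator -- a wrap still loses time,
	 * but never produces a negative result */
	u64 nsec_unsigned = delta * tk->mult + tk->xtime_nsec;
	nsec_unsigned >>= tk->shift;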
Suggested-by: Laurent Vivier <lvivier@redhat.com> Tested-by: Laurent Vivier <lvivier@redhat.com> Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: John Stultz <john.stultz@linaro.org>
[ luis: backported to 3.16: adjusted context ] Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
Make sure to clear out any ptrace singlestep state when a ptrace(2)
PTRACE_DETACH call is made on arm64 systems.
Otherwise, the previously ptraced task will die off with a SIGTRAP
signal if the debugger just previously singlestepped the ptraced task.
Signed-off-by: John Blackwood <john.blackwood@ccur.com>
[will: added comment to justify why this is in the arch code] Signed-off-by: Will Deacon <will.deacon@arm.com> Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
Signed-off-by: Oliver Neukum <oneukum@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ luis: backported to 3.16:
- moved usb_disabled() check to the top of the function so that there's
no need to invoke xhci_unregister_pci() before returning. Suggested
by gregkh. ] Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
If we do not do this, it is not properly saved and restored across
migration. Windows notices due to its self-protection mechanisms,
and is very upset about it (blue screen of death).
Cc: Radim Krcmar <rkrcmar@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
When a long value is read on 32 bit machines for 64 bit output, the
parsing needs to change "%lu" into "%llu", as the value is read
natively.
Unfortunately, if "%llu" is already there, the code will add another "l"
to it and fail to parse it properly.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Acked-by: Namhyung Kim <namhyung@kernel.org> Link: http://lkml.kernel.org/r/20151116172516.4b79b109@gandalf.local.home Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
v4l2-compliance sends a zeroed struct v4l2_streamparm in
v4l2-test-formats.cpp::testParmType(), and this results in a division by
0 in some gspca subdrivers:
Following what the doc says about a zeroed timeperframe (see
http://www.linuxtv.org/downloads/v4l-dvb-apis/vidioc-g-parm.html):
...
To reset manually applications can just set this field to zero.
Fix the issue by resetting the frame rate to a default value in case of
an unusable timeperframe.
The fix is done in the subdrivers instead of gspca.c because only the
subdrivers have notion of a default frame rate to reset the camera to.
Signed-off-by: Antonio Ospite <ao2@ao2.it> Reviewed-by: Hans de Goede <hdegoede@redhat.com> Signed-off-by: Hans Verkuil <hans.verkuil@cisco.com> Signed-off-by: Mauro Carvalho Chehab <mchehab@osg.samsung.com> Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
When A sends data to B, then A calls close() and enters the
SHUTDOWN_PENDING state. If B neither claims its rwnd is 0 nor sends a
SACK for this data, A will keep retransmitting the data until the t5
timeout; the Max.Retrans limit no longer applies, which is bad.
If B's rwnd is not 0, A should abort after Max.Retrans attempts; only
when B's rwnd == 0 and A's retransmissions exceed Max.Retrans should A
start the t5 timer. This is also what commit f8d960524328 ("sctp:
Enforce retransmission limit during shutdown") intended, but it lacks
the condition peer rwnd == 0.
So fix it by adding a bit (zero_window_announced) in the peer to record
whether the last rwnd was 0; if it was, zero_window_announced will be
set. Use this bit to decide whether to start the t5 timer when
local.state is SHUTDOWN_PENDING.
Fixes: commit f8d960524328 ("sctp: Enforce retransmission limit during shutdown") Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
They don't need to be any bigger than that and with this we start a new
bitfield for tracking association runtime stuff, like zero window
situation.
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Acked-by: Vlad Yasevich <vyasevich@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
[ kamal: 3.19-stable prereq for 8a0d19c sctp: start t5 timer only when peer rwnd is 0 and local state is SHUTDOWN_PENDING ] Signed-off-by: Kamal Mostafa <kamal@canonical.com> Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
A case can occur when sctp_accept() is called by the user during
a heartbeat timeout event after the 4-way handshake. Since
sctp_assoc_migrate() changes both assoc->base.sk and assoc->ep, the
bh_sock_lock in sctp_generate_heartbeat_event() will be taken with
the listening socket but released with the new association socket.
The result is a deadlock on any future attempts to take the listening
socket lock.
Note that this race can occur with other SCTP timeouts that take
the bh_lock_sock() in the event sctp_accept() is called.
=====================================
[ BUG: bad unlock balance detected! ]
-------------------------------------
CslRx/12087 is trying to release lock (slock-AF_INET) at:
[<ffffffffa01bcae0>] sctp_generate_timeout_event+0x40/0xe0 [sctp]
but there are no more locks to release!
other info that might help us debug this:
2 locks held by CslRx/12087:
#0: (&asoc->timers[i]){+.-...}, at: [<ffffffff8108ce1f>] run_timer_softirq+0x16f/0x3e0
#1: (slock-AF_INET){+.-...}, at: [<ffffffffa01bcac3>] sctp_generate_timeout_event+0x23/0xe0 [sctp]
Ensure the socket taken is also the same one that is released by
saving a copy of the socket before entering the timeout event
critical section.
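A sketch of the pattern, using the assoc->base.sk field mentioned above
(the surrounding timeout handling is elided):

	/* snapshot the socket so lock and unlock stay paired even if
	 * sctp_accept() migrates the association meanwhile */
	struct sock *sk = asoc->base.sk;

	bh_lock_sock(sk);
	/* ... timeout processing ... */
	bh_unlock_sock(sk);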
Signed-off-by: Karl Heiss <kheiss@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Cc: Ben Hutchings <ben@decadent.org.uk> Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
Remove the dst_entries_init/destroy calls for xfrm4 and xfrm6 dst_ops
templates; their dst_entries counters will never be used. Move the
xfrm dst_ops initialization from the common xfrm/xfrm_policy.c to
xfrm4/xfrm4_policy.c and xfrm6/xfrm6_policy.c, and call dst_entries_init
and dst_entries_destroy for each net namespace.
The ipv4 and ipv6 xfrms each create dst_ops template, and perform
dst_entries_init on the templates. The template values are copied to each
net namespace's xfrm.xfrm*_dst_ops. The problem there is the dst_ops
pcpuc_entries field is a percpu counter and cannot be used correctly by
simply copying it to another object.
The result of this is a very subtle bug; changes to the dst entries
counter from one net namespace may sometimes get applied to a different
net namespace dst entries counter. This is because of how the percpu
counter works; it has a main count field as well as a pointer to the
percpu variables. Each net namespace maintains its own main count
variable, but all point to one set of percpu variables. When any net
namespace happens to change one of the percpu variables to outside its
small batch range, its count is moved to the net namespace's main count
variable. So with multiple net namespaces operating concurrently, the
dst_ops entries counter can stray from the actual value that it should
be; if counts are consistently moved from one net namespace to another
(which my testing showed is likely), then one net namespace winds up
with a negative dst_ops count while another winds up with a continually
increasing count, eventually reaching its gc_thresh limit, which causes
all new traffic on the net namespace to fail with -ENOBUFS.
Signed-off-by: Dan Streetman <dan.streetman@canonical.com> Signed-off-by: Dan Streetman <ddstreet@ieee.org> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
Sometimes xennet_create_queues() may fail to create all of the requested
queues; we need to update num_queues to the number actually created to
avoid a NULL pointer dereference.
Signed-off-by: Joe Jin <joe.jin@oracle.com> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Wei Liu <wei.liu2@citrix.com> Cc: Ian Campbell <ian.campbell@citrix.com> Cc: David S. Miller <davem@davemloft.net> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
When less than the requested number of queues could be created, include
the actual number in the warning (instead of the requested number).
Signed-off-by: David Vrabel <david.vrabel@citrix.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
Originally that parameter was always reset to num_online_cpus during
module initialisation, which renders it useless.
The fix is to only set max_queues to num_online_cpus when user has not
provided a value.
Signed-off-by: Wei Liu <wei.liu2@citrix.com> Cc: David Vrabel <david.vrabel@citrix.com> Reviewed-by: David Vrabel <david.vrabel@citrix.com> Tested-by: David Vrabel <david.vrabel@citrix.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
Originally that parameter was always reset to num_online_cpus during
module initialisation, which renders it useless.
The fix is to only set max_queues to num_online_cpus when user has not
provided a value.
Reported-by: Johnny Strom <johnny.strom@linuxsolutions.fi> Signed-off-by: Wei Liu <wei.liu2@citrix.com> Reviewed-by: David Vrabel <david.vrabel@citrix.com> Acked-by: Ian Campbell <ian.campbell@citrix.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
We can't be within an RCU read-side critical section when deleting
VLANs, as underlying drivers might sleep during the hardware operation.
Therefore, replace the RCU critical section with a mutex. This is
consistent with team_vlan_rx_add_vid.
Fixes: 3d249d4ca7d0 ("net: introduce ethernet teaming device") Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
When a tunnel decapsulates the outer header, it has to comply
with RFC 6040 and eventually propagate the CE mark into the inner header.
It turns out IP6_ECN_set_ce() does not correctly update skb->csum
for CHECKSUM_COMPLETE packets, triggering infamous "hw csum failure"
messages and stack traces.
Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
[ luis: backported to 3.16: adjusted context ] Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
On ARM64, a BUG() is triggered in the eBPF JIT if a filter with a
constant shift that can't be encoded in the immediate field of the
UBFM/SBFM instructions is passed to the JIT. Since these shifts
amounts, which are negative or >= regsize, are invalid, reject them in
the eBPF verifier and the classic BPF filter checker, for all
architectures.
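For the classic checker the rejection is roughly the following (a
sketch; the 3.16 backport applies it in sk_chk_filter()):

	case BPF_ALU | BPF_LSH | BPF_K:
	case BPF_ALU | BPF_RSH | BPF_K:
		/* constant shift amounts must stay within the register width */
		if (ftest->k >= 32)
			return -EINVAL;
		break;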
Signed-off-by: Rabin Vincent <rabin@rab.in> Acked-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>
[ luis: backported to 3.16:
- drop changes to eBPF verifier, only added in 3.18 kernel
- function rename: bpf_check_classic() -> sk_chk_filter() ] Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
Ivaylo Dimitrov reported a regression caused by commit 7866a621043f
("dev: add per net_device packet type chains").
skb->dev becomes NULL and we crash in __netif_receive_skb_core().
Before above commit, different kind of bugs or corruptions could happen
without major crash.
But the root cause is that phonet_rcv() can queue skb without checking
if skb is shared or not.
Many thanks to Ivaylo Dimitrov for his help, diagnosis and tests.
Reported-by: Ivaylo Dimitrov <ivo.g.dimitrov.75@gmail.com> Tested-by: Ivaylo Dimitrov <ivo.g.dimitrov.75@gmail.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Remi Denis-Courmont <courmisch@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
Commit 1f718f0f4f97 ("bonding: populate neighbour's private on enslave")
undoes the fix provided by commit c2edacf80e15 ("bonding / ipv6: no addrconf
for slaves separately from master") by effectively setting the slave flag
after the slave has been opened. If the slave comes up quickly enough, it
will go through the IPv6 addrconf before the slave flag has been set and
will get a link local IPv6 address.
In order to ensure that addrconf knows to ignore the slave devices on state
change, set IFF_SLAVE before dev_open() during bonding enslavement.
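A sketch of the reordering in bond_enslave() (the error label name is
illustrative):

    /* mark as slave first, so IPv6 addrconf ignores the device */
    slave_dev->flags |= IFF_SLAVE;

    /* open the slave since the application closed it */
    res = dev_open(slave_dev);
    if (res) {
            slave_dev->flags &= ~IFF_SLAVE;
            goto err_restore_mac;          /* illustrative label */
    }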
Fixes: 1f718f0f4f97 ("bonding: populate neighbour's private on enslave") Signed-off-by: Karl Heiss <kheiss@gmail.com> Signed-off-by: Jay Vosburgh <jay.vosburgh@canonical.com> Reviewed-by: Jarod Wilson <jarod@redhat.com> Signed-off-by: Andy Gospodarek <gospo@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
For tcp_yeah, use an ssthresh floor of 2, the same floor used by Reno
and CUBIC, per RFC 5681 (equation 4).
tcp_yeah_ssthresh() sometimes returned a zero or negative ssthresh
value if the intended reduction is as big as, or bigger than, the current
cwnd. Congestion control modules should never return a zero or
negative ssthresh. A zero ssthresh generally results in a zero cwnd,
causing the connection to stall. A negative ssthresh value will be
interpreted as a u32 and will set a target cwnd for PRR near 4
billion.
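A sketch of the resulting clamp in tcp_yeah_ssthresh(), where "reduction" is
whatever decrement YeAH computed:

    /* never return an ssthresh below 2, per RFC 5681 equation (4) */
    return max_t(int, tp->snd_cwnd - reduction, 2);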
Oleksandr Natalenko reported that a system using tcp_yeah with ECN
could see a warning about a prior_cwnd of 0 in
tcp_cwnd_reduction(). Testing verified that this was due to
tcp_yeah_ssthresh() misbehaving in this way.
Reported-by: Oleksandr Natalenko <oleksandr@natalenko.name> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
proc_dostring() needs an initialized destination string, while the one
provided in proc_sctp_do_hmac_alg() contains stack garbage.
Thus, writing to cookie_hmac_alg would strlen() that garbage and end up
accessing invalid memory.
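One way to avoid it, sketched here under the assumption that the handler
copies the user input through a small stack buffer, is to zero that buffer
before handing it to proc_dostring():

    struct ctl_table tbl;
    char tmp[8] = { 0 };    /* zeroed, so a later strlen() is safe */

    memset(&tbl, 0, sizeof(tbl));
    if (write) {
            tbl.data = tmp;
            tbl.maxlen = sizeof(tmp);
    }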
Fixes: 3c68198e7 ("sctp: Make hmac algorithm selection for cookie generation dynamic") Signed-off-by: Sasha Levin <sasha.levin@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
When a vxlan interface is created, the driver checks that there is not
another vxlan interface with the same properties. To do this, it checks
the existing vxlan udp socket. Since commit 1c51a9159dde, the creation of
the vxlan socket is done only when the interface is set up, which breaks
that test.
Example:
$ ip l a vxlan10 type vxlan id 10 group 239.0.0.10 dev eth0 dstport 0
$ ip l a vxlan11 type vxlan id 10 group 239.0.0.10 dev eth0 dstport 0
$ ip -br l | grep vxlan
vxlan10 DOWN f2:55:1c:6a:fb:00 <BROADCAST,MULTICAST>
vxlan11 DOWN 7a:cb:b9:38:59:0d <BROADCAST,MULTICAST>
Instead of checking sockets, let's loop over the vxlan iface list.
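A sketch of that check (structure and field names are from memory and
abbreviated):

    struct vxlan_dev *tmp;

    list_for_each_entry(tmp, &vn->vxlan_list, next) {
            /* same VNI and destination port => the tunnel already exists */
            if (tmp->default_dst.remote_vni == vni &&
                tmp->dst_port == dst_port)
                    return -EEXIST;
    }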
Fixes: 1c51a9159dde ("vxlan: fix race caused by dropping rtnl_unlock") Reported-by: Thomas Faivre <thomas.faivre@6wind.com> Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>
[ luis: backported to 3.16: used davem's backport to 3.18 ] Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
[I stole this patch from Eric Biederman. He wrote:]
> There is no defined mechanism to pass network namespace information
> into /sbin/bridge-stp therefore don't even try to invoke it except
> for bridge devices in the initial network namespace.
>
> It is possible for unprivileged users to cause /sbin/bridge-stp to be
> invoked for any network device name, which, if /sbin/bridge-stp does not
> guard against unreasonable arguments or being invoked twice on the
> same network device, could cause problems.
[Hannes: changed patch using netns_eq]
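A sketch of the guard, assuming the usual BR_STP_PROG usermode-helper call:

    /* only trust /sbin/bridge-stp for bridges in the initial netns */
    if (net_eq(dev_net(br->dev), &init_net))
            err = call_usermodehelper(BR_STP_PROG, argv, envp, UMH_WAIT_PROC);
    else
            err = -ENOENT;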
Cc: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
It is possible for a process to allocate and accumulate far more FDs than
the process' limit by sending them over a unix socket then closing them
to keep the process' fd count low.
This change addresses this problem by keeping track of the number of FDs
in flight per user and preventing non-privileged processes from having
more FDs in flight than their configured FD limit.
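The core of the check looks roughly like this (a sketch; the per-user
accounting field on struct user_struct is introduced by the patch itself):

    static inline bool too_many_unix_fds(struct task_struct *p)
    {
            struct user_struct *user = current_user();

            /* above the fd limit: only privileged users may keep going */
            if (unlikely(user->unix_inflight > task_rlimit(p, RLIMIT_NOFILE)))
                    return !capable(CAP_SYS_RESOURCE) && !capable(CAP_SYS_ADMIN);
            return false;
    }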
Reported-by: socketpair@gmail.com Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Mitigates: CVE-2013-4312 (Linux 2.0+) Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: Willy Tarreau <w@1wt.eu> Signed-off-by: David S. Miller <davem@davemloft.net>
[ luis: backported to 3.16: adjusted context ] Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
Dmitry reports a memory leak with a syzkaller program.
The problem is that the connector code bumps the skb usecount but might not
invoke the callback.
So move the skb_get() to where we invoke the callback.
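A sketch of the reworked receive path (reconstructed from memory, not a
verbatim diff):

    static void cn_rx_skb(struct sk_buff *skb)
    {
            struct nlmsghdr *nlh;
            int len, err;

            if (skb->len >= NLMSG_HDRLEN) {
                    nlh = nlmsg_hdr(skb);
                    len = nlmsg_len(nlh);

                    if (len < (int)sizeof(struct cn_msg) ||
                        skb->len < nlh->nlmsg_len ||
                        len > CONNECTOR_MAX_MSG_SIZE)
                            return;

                    /* take the extra reference only when the callback will run */
                    err = cn_call_callback(skb_get(skb));
                    if (err < 0)
                            kfree_skb(skb);
            }
    }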
Reported-by: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
In sctp_close, sctp_make_abort_user may return NULL because of memory
allocation failure. If this happens, it will bypass any state change
and never free the assoc. The assoc has no chance to be freed and it
will be kept in memory with the state it had even after the socket is
closed by sctp_close().
So if sctp_make_abort_user fails to allocate memory, we should abort
the asoc via sctp_primitive_ABORT as well. Just like the annotation in
sctp_sf_cookie_wait_prm_abort and sctp_sf_do_9_1_prm_abort said,
"Even if we can't send the ABORT due to low memory delete the TCB.
This is a departure from our typical NOMEM handling".
But then the chunk is NULL (low memory) and the SCTP_CMD_REPLY cmd would
dereference the chunk pointer, crashing the system. So we should add the
SCTP_CMD_REPLY cmd only when the chunk is not NULL, just like the other
places where SCTP_CMD_REPLY is added.
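A sketch of the guarded reply in the primitive-abort handlers:

    /* queue the ABORT only if the chunk was actually allocated ... */
    if (abort)
            sctp_add_cmd_sf(commands, SCTP_CMD_REPLY, SCTP_CHUNK(abort));

    /* ... but delete the TCB either way, even under low memory */
    sctp_add_cmd_sf(commands, SCTP_CMD_SET_SK_ERR, SCTP_ERROR(ECONNABORTED));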
Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
Packets that arrive from real hardware devices have ip_summed ==
CHECKSUM_UNNECESSARY if the hardware verified the checksums, or
CHECKSUM_NONE if the packet is bad or it was unable to verify it. The
current version of veth will replace CHECKSUM_NONE with
CHECKSUM_UNNECESSARY, which causes corrupt packets routed from hardware to
a veth device to be delivered to the application. This caused applications
at Twitter to receive corrupt data when network hardware was corrupting
packets.
We believe this was added as an optimization to skip computing and
verifying checksums for communication between containers. However, locally
generated packets have ip_summed == CHECKSUM_PARTIAL, so the code as
written does nothing for them. As far as we can tell, after removing this
code, these packets are transmitted from one stack to another unmodified
(tcpdump shows invalid checksums on both sides, as expected), and they are
delivered correctly to applications. We didn’t test every possible network
configuration, but we tried a few common ones such as bridging containers,
using NAT between the host and a container, and routing from hardware
devices to containers. We have effectively deployed this in production at
Twitter (by disabling RX checksum offloading on veth devices).
This code dates back to the first version of the driver, commit
<e314dbdc1c0dc6a548ecf> ("[NET]: Virtual ethernet device driver"), so I
suspect this bug occurred mostly because the driver API has evolved
significantly since then. Commit <0b7967503dc97864f283a> ("net/veth: Fix
packet checksumming") (in December 2010) fixed this for packets that get
created locally and sent to hardware devices, by not changing
CHECKSUM_PARTIAL. However, the same issue still occurs for packets coming
in from hardware devices.
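For reference, the block being removed from veth_xmit() looked roughly like
this (reconstructed, not a verbatim diff):

    /* removed by this patch: forwarding must not hide a bad checksum */
    if (skb->ip_summed == CHECKSUM_NONE &&
        rcv->features & NETIF_F_RXCSUM)
            skb->ip_summed = CHECKSUM_UNNECESSARY;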
Co-authored-by: Evan Jones <ej@evanjones.ca> Signed-off-by: Evan Jones <ej@evanjones.ca> Cc: Nicolas Dichtel <nicolas.dichtel@6wind.com> Cc: Phil Sutter <phil@nwl.cc> Cc: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp> Cc: netdev@vger.kernel.org Cc: linux-kernel@vger.kernel.org Signed-off-by: Vijay Pandurangan <vijayp@vijayp.ca> Acked-by: Cong Wang <cwang@twopensource.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
MSI interrupts appear not to work for nv46 based cards. Change the mc
subdev oclass for these cards from nv44 to nv4c; the nv4c mc code is
identical to the nv44 mc code except that it does not use MSI
(it does not define an msi_rearm callback).
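Roughly, in the 3.16-era device table this is an oclass swap for chipset
0x46 (a sketch; symbol names from memory, and the real case entry sets many
other subdevs as well):

    case 0x46:
            /* was nv44_mc_oclass; the nv4c variant never rearms MSI */
            device->oclass[NVDEV_SUBDEV_MC] = nv4c_mc_oclass;
            break;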
BugLink: https://bugs.freedesktop.org/show_bug.cgi?id=90435 Signed-off-by: Hans de Goede <hdegoede@redhat.com> Signed-off-by: Ben Skeggs <bskeggs@redhat.com> Cc: Ilia Mirkin <imirkin@alum.mit.edu>
[ luis: backported to 3.16:
- file rename: drivers/gpu/drm/nouveau/nvkm/engine/device/nv40.c ->
drivers/gpu/drm/nouveau/core/engine/device/nv40.c ] Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
with a usage count of 100 * the number of times the program has been run,
then the kernel is malfunctioning. If leaked-keyring has zero usages or
has been garbage collected, then the problem is fixed.
Reported-by: Yevgeny Pats <yevgeny@perception-point.io> Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Don Zickus <dzickus@redhat.com> Acked-by: Prarit Bhargava <prarit@redhat.com> Acked-by: Jarod Wilson <jarod@redhat.com> Signed-off-by: James Morris <james.l.morris@oracle.com> Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
Backport of this upstream commit into stable kernels: 89c22d8c3b27 ("net: Fix skb csum races when peeking")
exposed a bug in udp stack vs MSG_PEEK support, when user provides
a buffer smaller than skb payload.
In this case,
skb_copy_and_csum_datagram_iovec(skb, sizeof(struct udphdr),
msg->msg_iov);
returns -EFAULT.
This bug does not happen in upstream kernels since Al Viro did a great
job of replacing this with:
skb_copy_and_csum_datagram_msg(skb, sizeof(struct udphdr), msg);
This variant is safe vs short buffers.
For the time being, instead of reverting Herbert Xu's patch and adding back
the invalid skb->ip_summed changes, simply store the result of
udp_lib_checksum_complete() so that we avoid computing the checksum a
second time, and avoid the problematic
skb_copy_and_csum_datagram_iovec() call.
This patch can be applied on recent kernels as it avoids a double
checksumming, then backported to stable kernels as a bug fix.
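A sketch of the resulting udp_recvmsg() logic (conditions and variable names
abbreviated):

    bool checksum_valid = false;

    if (copied < ulen || UDP_SKB_CB(skb)->partial_cov) {
            checksum_valid = !udp_lib_checksum_complete(skb);
            if (!checksum_valid)
                    goto csum_copy_err;
    }

    if (checksum_valid || skb_csum_unnecessary(skb))
            err = skb_copy_datagram_iovec(skb, sizeof(struct udphdr),
                                          msg->msg_iov, copied);
    else
            err = skb_copy_and_csum_datagram_iovec(skb, sizeof(struct udphdr),
                                                   msg->msg_iov);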
Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
[ luis: backported to 3.16: adjusted context ] Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
As reported by Michal Kubecek, the backport of commit 89c22d8c3b27
("net: Fix skb csum races when peeking") exposed an upstream bug.
It is being reverted and replaced by a better fix.
Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
While setting the KVM PIT counters in 'kvm_pit_load_count', if
'hpet_legacy_start' is set, the function disables the timer on
channel[0], instead of the respective index 'channel'. This is
because channels 1-3 are not linked to the HPET. Fix the caller
to only activate the special HPET processing for channel 0.
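A sketch of the caller-side fix (the surrounding loop is in the ioctl
handler that restores the PIT state):

    for (i = 0; i < 3; i++)
            /* HPET legacy routing only ever affects channel 0 */
            kvm_pit_load_count(kvm, i, ps->channels[i].count,
                               start && i == 0);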
Reported-by: P J P <pjp@fedoraproject.org> Fixes: 0185604c2d82c560dab2f2933a18f797e74ab5a8 Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
dst_release should not access dst->flags after decrementing
__refcnt to 0. The dst_entry may be in dst_busy_list and
dst_gc_task may dst_destroy it before dst_release gets a chance
to access dst->flags.
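A sketch of the reordering in dst_release():

    /* snapshot the flag before the refcount can reach zero */
    unsigned short nocache = dst->flags & DST_NOCACHE;
    int newrefcnt = atomic_dec_return(&dst->__refcnt);

    if (!newrefcnt && unlikely(nocache))
            call_rcu(&dst->rcu_head, dst_destroy_rcu);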
Fixes: d69bbf88c8d0 ("net: fix a race in dst_release()") Fixes: 27b75c95f10d ("net: avoid RCU for NOCACHE dst") Signed-off-by: Francesco Ruggeri <fruggeri@arista.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
[ luis: backported to 3.16: adjusted context ] Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
The SKF_AD_ALU_XOR_X ancillary is not like the other ancillary data
instructions since it XORs A with X while all the others replace A with
some loaded value. All the BPF JITs fail to clear A if this is used as
the first instruction in a filter. This was found using american fuzzy
lop.
Add a helper to determine if A needs to be cleared given the first
instruction in a filter, and use this in the JITs. Except for ARM, the
rest have only been compile-tested.
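A sketch of the helper the commit describes: every other first instruction
either overwrites A or never reads it, so only the XOR ancillary forces a
clear.

    static inline bool bpf_needs_clear_a(const struct sock_filter *first)
    {
            switch (first->code) {
            case BPF_RET | BPF_K:
            case BPF_LD | BPF_W | BPF_LEN:
                    return false;

            case BPF_LD | BPF_W | BPF_ABS:
            case BPF_LD | BPF_H | BPF_ABS:
            case BPF_LD | BPF_B | BPF_ABS:
                    /* only the XOR ancillary reads A instead of overwriting it */
                    if (first->k == SKF_AD_OFF + SKF_AD_ALU_XOR_X)
                            return true;
                    return false;

            default:
                    return true;    /* be conservative: zero A */
            }
    }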
Fixes: 3480593131e0 ("net: filter: get rid of BPF_S_* enum") Signed-off-by: Rabin Vincent <rabin@rab.in> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
[ luis: backported to 3.16: adjusted context ] Signed-off-by: Luis Henriques <luis.henriques@canonical.com>