We can not unlock/lock cifs_tcp_ses_lock while walking through ses
and tcon lists because it can corrupt list iterator pointers and
a tcon structure can be released if we don't hold an extra reference.
Fix it by moving a reconnect process to a separate delayed work
and acquiring a reference to every tcon that needs to be reconnected.
Also do not send an echo request on newly established connections.
Signed-off-by: Pavel Shilovsky <pshilov@microsoft.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
In dm_sm_metadata_create() we temporarily change the dm_space_map
operations from 'ops' (whose .destroy function deallocates the
sm_metadata) to 'bootstrap_ops' (whose .destroy function doesn't).
If dm_sm_metadata_create() fails in sm_ll_new_metadata() or
sm_ll_extend(), it exits back to dm_tm_create_internal(), which calls
dm_sm_destroy() with the intention of freeing the sm_metadata, but it
doesn't (because the dm_space_map operations is still set to
'bootstrap_ops').
Fix this by setting the dm_space_map operations back to 'ops' if
dm_sm_metadata_create() fails when it is set to 'bootstrap_ops'.
[js] no nr_blocks test in 3.12 yet
Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com> Acked-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
In crypt_set_key(), if a failure occurs while replacing the old key
(e.g. tfm->setkey() fails) the key must not have DM_CRYPT_KEY_VALID flag
set. Otherwise, the crypto layer would have an invalid key that still
has DM_CRYPT_KEY_VALID flag set.
Signed-off-by: Ondrej Kozina <okozina@redhat.com> Reviewed-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
If you have a process that has set itself to be non-dumpable, and it
then undergoes exec(2), any CLOEXEC file descriptors it has open are
"exposed" during a race window between the dumpable flags of the process
being reset for exec(2) and CLOEXEC being applied to the file
descriptors. This can be exploited by a process by attempting to access
/proc/<pid>/fd/... during this window, without requiring CAP_SYS_PTRACE.
The race in question is after set_dumpable has been (for get_link,
though the trace is basically the same for readlink):
Which will return 0, during the race window and CLOEXEC file descriptors
will still be open during this window because do_close_on_exec has not
been called yet. As a result, the ordering of these calls should be
reversed to avoid this race window.
This is of particular concern to container runtimes, where joining a
PID namespace with file descriptors referring to the host filesystem
can result in security issues (since PRCTL_SET_DUMPABLE doesn't protect
against access of CLOEXEC file descriptors -- file descriptors which may
reference filesystem objects the container shouldn't have access to).
Cc: dev@opencontainers.org Reported-by: Michael Crosby <crosbymichael@gmail.com> Signed-off-by: Aleksa Sarai <asarai@suse.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Our system uses significantly more slab memory with memcg enabled with
the latest kernel. With 3.10 kernel, slab uses 2G memory, while with
4.6 kernel, 6G memory is used. The shrinker has problem. Let's see we
have two memcg for one shrinker. In do_shrink_slab:
1. Check cg1. nr_deferred = 0, assume total_scan = 700. batch size
is 1024, then no memory is freed. nr_deferred = 700
2. Check cg2. nr_deferred = 700. Assume freeable = 20, then
total_scan = 10 or 40. Let's assume it's 10. No memory is freed.
nr_deferred = 10.
The deferred share of cg1 is lost in this case. kswapd will free no
memory even run above steps again and again.
The fix makes sure one memcg's deferred share isn't lost.
Link: http://lkml.kernel.org/r/2414be961b5d25892060315fbb56bb19d81d0c07.1476227351.git.shli@fb.com Signed-off-by: Shaohua Li <shli@fb.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Vladimir Davydov <vdavydov@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
The struct file_operations instance serving the f2fs/status debugfs file
lacks an initialization of its ->owner.
This means that although that file might have been opened, the f2fs module
can still get removed. Any further operation on that opened file, releasing
included, will cause accesses to unmapped memory.
Indeed, Mike Marshall reported the following:
BUG: unable to handle kernel paging request at ffffffffa0307430
IP: [<ffffffff8132a224>] full_proxy_release+0x24/0x90
<...>
Call Trace:
[] __fput+0xdf/0x1d0
[] ____fput+0xe/0x10
[] task_work_run+0x8e/0xc0
[] do_exit+0x2ae/0xae0
[] ? __audit_syscall_entry+0xae/0x100
[] ? syscall_trace_enter+0x1ca/0x310
[] do_group_exit+0x44/0xc0
[] SyS_exit_group+0x14/0x20
[] do_syscall_64+0x61/0x150
[] entry_SYSCALL64_slow_path+0x25/0x25
<...>
---[ end trace f22ae883fa3ea6b8 ]---
Fixing recursive fault but reboot is needed!
Fix this by initializing the f2fs/status file_operations' ->owner with
THIS_MODULE.
This will allow debugfs to grab a reference to the f2fs module upon any
open on that file, thus preventing it from getting removed.
Fixes: 902829aa0b72 ("f2fs: move proc files to debugfs") Reported-by: Mike Marshall <hubcap@omnibond.com> Reported-by: Martin Brandenburg <martin@omnibond.com> Signed-off-by: Nicolai Stange <nicstange@gmail.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Fixes: 67cf5b09a46f ("ext4: add the basic function for inline data support") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
The commit "ext4: sanity check the block and cluster size at mount
time" should prevent any problems, but in case the superblock is
modified while the file system is mounted, add an extra safety check
to make sure we won't overrun the allocated buffer.
Fix a large number of problems with how we handle mount options in the
superblock. For one, if the string in the superblock is long enough
that it is not null terminated, we could run off the end of the string
and try to interpret superblocks fields as characters. It's unlikely
this will cause a security problem, but it could result in an invalid
parse. Also, parse_options is destructive to the string, so in some
cases if there is a comma-separated string, it would be modified in
the superblock. (Fortunately it only happens on file systems with a
1k block size.)
The number of 'counters' elements needed in 'struct sg' is
super_block->s_blocksize_bits + 2. Presently we have 16 'counters'
elements in the array. This is insufficient for block sizes >= 32k. In
such cases the memcpy operation performed in ext4_mb_seq_groups_show()
would cause stack memory corruption.
Fixes: c9de560ded61f Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
'border' variable is set to a value of 2 times the block size of the
underlying filesystem. With 64k block size, the resulting value won't
fit into a 16-bit variable. Hence this commit changes the data type of
'border' to 'unsigned int'.
The AEAD givenc descriptor relies on moving the IV through the
output FIFO and then back to the CTX2 for authentication. The
SEQ FIFO STORE could be scheduled before the data can be
read from OFIFO, especially since the SEQ FIFO LOAD needs
to wait for the SEQ FIFO LOAD SKIP to finish first. The
SKIP takes more time when the input is SG than when it's
a contiguous buffer. If the SEQ FIFO LOAD is not scheduled
before the STORE, the DECO will hang waiting for data
to be available in the OFIFO so it can be transferred to C2.
In order to overcome this, first force transfer of IV to C2
by starting the "cryptlen" transfer first and then starting to
store data from OFIFO to the output buffer.
bdev->bd_contains is not stable before calling __blkdev_get().
When __blkdev_get() is called on a parition with ->bd_openers == 0
it sets
bdev->bd_contains = bdev;
which is not correct for a partition.
After a call to __blkdev_get() succeeds, ->bd_openers will be > 0
and then ->bd_contains is stable.
When FMODE_EXCL is used, blkdev_get() calls
bd_start_claiming() -> bd_prepare_to_claim() -> bd_may_claim()
This call happens before __blkdev_get() is called, so ->bd_contains
is not stable. So bd_may_claim() cannot safely use ->bd_contains.
It currently tries to use it, and this can lead to a BUG_ON().
This happens when a whole device is already open with a bd_holder (in
use by dm in my particular example) and two threads race to open a
partition of that device for the first time, one opening with O_EXCL and
one without.
The thread that doesn't use O_EXCL gets through blkdev_get() to
__blkdev_get(), gains the ->bd_mutex, and sets bdev->bd_contains = bdev;
Immediately thereafter the other thread, using FMODE_EXCL, calls
bd_start_claiming() from blkdev_get(). This should fail because the
whole device has a holder, but because bdev->bd_contains == bdev
bd_may_claim() incorrectly reports success.
This thread continues and blocks on bd_mutex.
The first thread then sets bdev->bd_contains correctly and drops the mutex.
The thread using FMODE_EXCL then continues and when it calls bd_may_claim()
again in:
BUG_ON(!bd_may_claim(bdev, whole, holder));
The BUG_ON fires.
Fix this by removing the dependency on ->bd_contains in
bd_may_claim(). As bd_may_claim() has direct access to the whole
device, it can simply test if the target bdev is the whole device.
So we can read a btree block via readahead or intentional read,
and we can end up with a memory leak when something happens as
follows,
1) readahead starts to read block A but does not wait for read
completion,
2) btree_readpage_end_io_hook finds that block A is corrupted,
and it needs to clear all block A's pages' uptodate bit.
3) meanwhile an intentional read kicks in and checks block A's
pages' uptodate to decide which page needs to be read.
4) when some pages have the uptodate bit during 3)'s check so
3) doesn't count them for eb->io_pages, but they are later
cleared by 2) so we has to readpage on the page, we get
the wrong eb->io_pages which results in a memory leak of
this block.
This fixes the problem by firstly getting all pages's locking and
then checking pages' uptodate bit.
t1(readahead) t2(readahead endio) t3(the following read)
read_extent_buffer_pages end_bio_extent_readpage
for pg in eb: for page 0,1,2 in eb:
if pg is uptodate: btree_readpage_end_io_hook(pg)
num_reads++ if uptodate:
eb->io_pages = num_reads SetPageUptodate(pg) _______________
for pg in eb: for page 3 in eb: read_extent_buffer_pages
if pg is NOT uptodate: btree_readpage_end_io_hook(pg) for pg in eb:
__extent_read_full_page(pg) sanity check reports something wrong if pg is uptodate:
clear_extent_buffer_uptodate(eb) num_reads++
for pg in eb: eb->io_pages = num_reads
ClearPageUptodate(page) _______________
for pg in eb:
if pg is NOT uptodate:
__extent_read_full_page(pg)
So t3's eb->io_pages is not consistent with the number of pages it's reading,
and during endio(), atomic_dec_and_test(&eb->io_pages) will get a negative
number so that we're not able to free the eb.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
HP Z1 Gen3 AiO with Conexant codec doesn't give an unsolicited event
to the headset mic pin upon the jack plugging, it reports only to the
headphone pin. It results in the missing mic switching. Let's fix up
by simply gating the jack event.
Sampling rate changes after first set one are not reflected to the
hardware, while driver and ALSA think the rate has been changed.
Fix the problem by properly stopping the interface at the beginning of
prepare call, allowing new rate to be set to the hardware. This keeps
the hardware in sync with the driver.
The Logitech QuickCam Communicate Deluxe/S7500 microphone fails with the
following warning.
[ 6.778995] usb 2-1.2.2.2: Warning! Unlikely big volume range (=3072),
cval->res is probably wrong.
[ 6.778996] usb 2-1.2.2.2: [5] FU [Mic Capture Volume] ch = 1, val =
4608/7680/1
Adding it to the list of devices in volume_control_quirks makes it work
properly, fixing related typo.
The UHCI controllers in Intel chipsets rely on a platform-specific non-PME
mechanism for wakeup signalling. They can generate wakeup signals even
though they don't support PME.
We need to let the USB core know this so that it will enable runtime
suspend for UHCI controllers.
usb_endpoint_maxp() returns wMaxPacketSize in its
raw form. Without taking into consideration that it
also contains other bits reserved for isochronous
endpoints.
This patch fixes one occasion where this is a
problem by making sure that we initialize
ep->maxpacket only with lower 10 bits of the value
returned by usb_endpoint_maxp(). Note that seperate
patches will be necessary to audit all call sites of
usb_endpoint_maxp() and make sure that
usb_endpoint_maxp() only returns lower 10 bits of
wMaxPacketSize.
Signed-off-by: Felipe Balbi <felipe.balbi@linux.intel.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
USB-3 does not have any link state that will avoid negotiating a connection
with a plugged-in cable but will signal the host when the cable is
unplugged.
For USB-3 we used to first set the link to Disabled, then to RxDdetect to
be able to detect cable connects or disconnects. But in RxDetect the
connected device is detected again and eventually enabled.
Instead set the link into U3 and disable remote wakeups for the device.
This is what Windows does, and what Alan Stern suggested.
Cc: Alan Stern <stern@rowland.harvard.edu> Acked-by: Alan Stern <stern@rowland.harvard.edu> Signed-off-by: Mathias Nyman <mathias.nyman@linux.intel.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
leaf N:
...
item 240 key (282 DIR_LOG_ITEM 0) itemoff 8189 itemsize 8
dir log end 1275809046
leaf N + 1:
item 0 key (282 DIR_LOG_ITEM 3936149215) itemoff 16275 itemsize 8
dir log end 18446744073709551615
...
When we pass the value 1275809046 + 1 as the parameter start_ret to the
function tree-log.c:find_dir_range() (done by replay_dir_deletes()), we
end up with path->slots[0] having the value 239 (points to the last item
of leaf N, item 240). Because the dir log item in that position has an
offset value smaller than *start_ret (1275809046 + 1) we need to move on
to the next leaf, however the logic for that is wrong since it compares
the current slot to the number of items in the leaf, which is smaller
and therefore we don't lookup for the next leaf but instead we set the
slot to point to an item that does not exist, at slot 240, and we later
operate on that slot which has unexpected content or in the worst case
can result in an invalid memory access (accessing beyond the last page
of leaf N's extent buffer).
So fix the logic that checks when we need to lookup at the next leaf
by first incrementing the slot and only after to check if that slot
is beyond the last item of the current leaf.
Signed-off-by: Robbie Ko <robbieko@synology.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Fixes: e02119d5a7b4 (Btrfs: Add a write ahead tree log to optimize synchronous operations) Signed-off-by: Filipe Manana <fdmanana@suse.com>
[Modified changelog for clarity and correctness] Signed-off-by: Jiri Slaby <jslaby@suse.cz>
The original patch for mainline, 6f8960541b1 (Btrfs: don't delay
inode ref updates during log replay) lists 1d52c78afbb (Btrfs: try
not to ENOSPC on log replay) as the only pre-3.18 dependency, but it
also depends on 67de11769bd (Btrfs: introduce the delayed inode ref
deletion for the single link inode), which was introduced in 3.14
and isn't in 3.12.y.
The -stable commit added the check to btrfs_delayed_update_inode,
which may look similar to btrfs_delayed_delete_inode_ref, but it's
only superficial. The tops of both functions handle typical
delayed node boilerplate. The upshot is that the patch is harmless
since the caller already checks to see if we're doing log recovery,
so we're not breaking anything. It should be reverted because it
makes it appear as if this issue was fixed for users who did
backport 67de11769bd, when it is not.
Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Yu Zhao has noticed that __unregister_cpu_notifier only unregisters its
notifiers when HOTPLUG_CPU=y while the registration might succeed even
when HOTPLUG_CPU=n if MODULE is enabled. This means that e.g. zswap
might keep a stale notifier on the list on the manual clean up during
the pool tear down and thus corrupt the list. Resulting in the following
This can be even triggered manually by changing
/sys/module/zswap/parameters/compressor multiple times.
Fix this issue by making unregister APIs symmetric to the register so
there are no surprises.
[js] backport to 3.12
Fixes: 47e627bc8c9a ("[PATCH] hotplug: Allow modules to use the cpu hotplug notifiers even if !CONFIG_HOTPLUG_CPU") Reported-and-tested-by: Yu Zhao <yuzhao@google.com> Signed-off-by: Michal Hocko <mhocko@suse.com> Cc: linux-mm@kvack.org Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Dan Streetman <ddstreet@ieee.org> Link: http://lkml.kernel.org/r/20161207135438.4310-1-mhocko@kernel.org Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
The current ndelay() macro definition has an extra semi-colon at the
end of the line thus leading to a compilation error when ndelay is used
in a conditional block without curly braces like this one:
if (cond)
ndelay(t);
else
...
which, after the preprocessor pass gives:
if (cond)
m68k_ndelay(t);;
else
...
thus leading to the following gcc error:
error: 'else' without a previous 'if'
Remove this extra semi-colon.
Signed-off-by: Boris Brezillon <boris.brezillon@free-electrons.com> Fixes: c8ee038bd1488 ("m68k: Implement ndelay() based on the existing udelay() logic") Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
This patch adds a check to limit the number of can_filters that can be
set via setsockopt on CAN_RAW sockets. Otherwise allocations > MAX_ORDER
are not prevented resulting in a warning.
Lukasz reported that perf stat counters overflow handling is broken on KNL/SLM.
Both these parts have full_width_write set, and that does indeed have
a problem. In order to deal with counter wrap, we must sample the
counter at at least half the counter period (see also the sampling
theorem) such that we can unambiguously reconstruct the count.
However commit:
069e0c3c4058 ("perf/x86/intel: Support full width counting")
sets the sampling interval to the full period, not half.
Fixing that exposes another issue, in that we must not sign extend the
delta value when we shift it right; the counter cannot have
decremented after all.
With both these issues fixed, counter overflow functions correctly
again.
Reported-by: Lukasz Odzioba <lukasz.odzioba@intel.com> Tested-by: Liang, Kan <kan.liang@intel.com> Tested-by: Odzioba, Lukasz <lukasz.odzioba@intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Fixes: 069e0c3c4058 ("perf/x86/intel: Support full width counting") Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
While debugging the rtmutex unlock vs. dequeue race Will suggested to use
READ_ONCE() in rt_mutex_owner() as it might race against the
cmpxchg_release() in unlock_rt_mutex_safe().
Will: "It's a minor thing which will most likely not matter in practice"
Careful search did not unearth an actual problem in todays code, but it's
better to be safe than surprised.
Suggested-by: Will Deacon <will.deacon@arm.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: David Daney <ddaney@caviumnetworks.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Sebastian Siewior <bigeasy@linutronix.de> Cc: Steven Rostedt <rostedt@goodmis.org> Link: http://lkml.kernel.org/r/20161130210030.431379999@linutronix.de Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
That means the problem is caused by fixup_rt_mutex_waiters() which does the
RMW to clear the waiters bit unconditionally when there are no waiters in
the rtmutexes rbtree.
This can be fatal: A concurrent unlock can release the rtmutex in the
fastpath because the waiters bit is not set. If the cmpxchg() gets in the
middle of the RMW operation then the previous owner, which just unlocked
the rtmutex is set as the owner again when the write takes place after the
successfull cmpxchg().
The solution is rather trivial: verify that the owner member of the rtmutex
has the waiters bit set before clearing it. This does not require a
cmpxchg() or other atomic operations because the waiters bit can only be
set and cleared with the rtmutex wait_lock held. It's also safe against the
fast path unlock attempt. The unlock attempt via cmpxchg() will either see
the bit set and take the slowpath or see the bit cleared and release it
atomically in the fastpath.
It's remarkable that the test program provided by David triggers on ARM64
and MIPS64 really quick, but it refuses to reproduce on x86-64, while the
problem exists there as well. That refusal might explain that this got not
discovered earlier despite the bug existing from day one of the rtmutex
implementation more than 10 years ago.
Thanks to David for meticulously instrumenting the code and providing the
information which allowed to decode this subtle problem.
Reported-by: David Daney <ddaney@caviumnetworks.com> Tested-by: David Daney <david.daney@cavium.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Steven Rostedt <rostedt@goodmis.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Sebastian Siewior <bigeasy@linutronix.de> Cc: Will Deacon <will.deacon@arm.com> Fixes: 23f78d4a03c5 ("[PATCH] pi-futex: rt mutex core") Link: http://lkml.kernel.org/r/20161130210030.351136722@linutronix.de Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Huang has reported that in his powerfail testing he is seeing stale
block contents in some of recently allocated blocks although he mounts
ext4 in data=ordered mode. After some investigation I have found out
that indeed when delayed allocation is used, we don't add inode to
transaction's list of inodes needing flushing before commit. Originally
we were doing that but commit f3b59291a69d removed the logic with a
flawed argument that it is not needed.
The problem is that although for delayed allocated blocks we write their
contents immediately after allocating them, there is no guarantee that
the IO scheduler or device doesn't reorder things and thus transaction
allocating blocks and attaching them to inode can reach stable storage
before actual block contents. Actually whenever we attach freshly
allocated blocks to inode using a written extent, we should add inode to
transaction's ordered inode list to make sure we properly wait for block
contents to be written before committing the transaction. So that is
what we do in this patch. This also handles other cases where stale data
exposure was possible - like filling hole via mmap in
data=ordered,nodelalloc mode.
The only exception to the above rule are extending direct IO writes where
blkdev_direct_IO() waits for IO to complete before increasing i_size and
thus stale data exposure is not possible. For now we don't complicate
the code with optimizing this special case since the overhead is pretty
low. In case this is observed to be a performance problem we can always
handle it using a special flag to ext4_map_blocks().
The global mutex of 'gdp_mutex' is used to serialize creating/querying
glue dir and its cleanup. Turns out it isn't a perfect way because
part(kobj_kset_leave()) of the actual cleanup action() is done inside
the release handler of the glue dir kobject. That means gdp_mutex has
to be held before releasing the last reference count of the glue dir
kobject.
This patch moves glue dir's cleanup after kobject_del() in device_del()
for avoiding the race.
Cc: Yijing Wang <wangyijing@huawei.com> Reported-by: Chandra Sekhar Lingutla <clingutla@codeaurora.org> Signed-off-by: Ming Lei <ming.lei@canonical.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
A compile warning is introduced by a commit to fix the find_node().
This patch fix the compile warning by moving find_node() into __init
section. Because find_node() is only used by memblock_nid_range() which
is only used by a __init add_node_ranges(). find_node() and
memblock_nid_range() should also be inside __init section.
Signed-off-by: Thomas Tai <thomas.tai@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
When booting up LDOM, find_node() warns that a physical address
doesn't match a NUMA node.
WARNING: CPU: 0 PID: 0 at arch/sparc/mm/init_64.c:835
find_node+0xf4/0x120 find_node: A physical address doesn't
match a NUMA node rule. Some physical memory will be
owned by node 0.Modules linked in:
It is because linux use an internal structure node_masks[] to
keep the best memory latency node only. However, LDOM mdesc can
contain single latency-group with multiple memory latency nodes.
If the address doesn't match the best latency node within
node_masks[], it should check for an alternative via mdesc.
The warning message should only be printed if the address
doesn't match any node_masks[] nor within mdesc. To minimize
the impact of searching mdesc every time, the last matched
mask and index is stored in a variable.
Signed-off-by: Thomas Tai <thomas.tai@oracle.com> Reviewed-by: Chris Hyser <chris.hyser@oracle.com> Reviewed-by: Liam Merwick <liam.merwick@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Prior to commit c0371da6047a ("put iov_iter into msghdr") in v3.19, there
was no check that the iovec contained enough bytes for an ICMP header,
and the read loop would walk across neighboring stack contents. Since the
iov_iter conversion, bad arguments are noticed, but the returned error is
EFAULT. Returning EINVAL is a clearer error and also solves the problem
prior to v3.19.
CAP_NET_ADMIN users should not be allowed to set negative
sk_sndbuf or sk_rcvbuf values, as it can lead to various memory
corruptions, crashes, OOM...
Note that before commit 82981930125a ("net: cleanups in
sock_setsockopt()"), the bug was even more serious, since SO_SNDBUF
and SO_RCVBUF were vulnerable.
This needs to be backported to all known linux kernels.
Again, many thanks to syzkaller team for discovering this gem.
Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Andrey Konovalov <andreyknvl@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
When packet_set_ring creates a ring buffer it will initialize a
struct timer_list if the packet version is TPACKET_V3. This value
can then be raced by a different thread calling setsockopt to
set the version to TPACKET_V1 before packet_set_ring has finished.
This leads to a use-after-free on a function pointer in the
struct timer_list when the socket is closed as the previously
initialized timer will not be deleted.
The bug is fixed by taking lock_sock(sk) in packet_setsockopt when
changing the packet version while also taking the lock at the start
of packet_set_ring.
Fixes: f6fb8f100b80 ("af-packet: TPACKET_V3 flexible buffer implementation.") Signed-off-by: Philip Pettersson <philip.pettersson@gmail.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
pskb_may_pull() can reallocate skb->head, we need to reload dh pointer
in dccp_invalid_packet() or risk use after free.
Bug found by Andrey Konovalov using syzkaller.
Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Andrey Konovalov <andreyknvl@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Add a validation function to make sure offset is valid:
1. Not below skb head (could happen when offset is negative).
2. Validate both 'offset' and 'at'.
Signed-off-by: Amir Vadai <amir@vadai.me> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Lock socket before checking the SOCK_ZAPPED flag in l2tp_ip6_bind().
Without lock, a concurrent call could modify the socket flags between
the sock_flag(sk, SOCK_ZAPPED) test and the lock_sock() call. This way,
a socket could be inserted twice in l2tp_ip6_bind_table. Releasing it
would then leave a stale pointer there, generating use-after-free
errors when walking through the list or modifying adjacent entries.
BUG: KASAN: use-after-free in l2tp_ip6_close+0x22e/0x290 at addr ffff8800081b0ed8
Write of size 8 by task syz-executor/10987
CPU: 0 PID: 10987 Comm: syz-executor Not tainted 4.8.0+ #39
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.2-0-g33fbe13 by qemu-project.org 04/01/2014 ffff880031d97838ffffffff829f835bffff88001b5a1640ffff8800081b0ec0 ffff8800081b15a0ffff8800081b6d20ffff880031d97860ffffffff8174d3cc ffff880031d978f0ffff8800081b0e80ffff88001b5a1640ffff880031d978e0
Call Trace:
[<ffffffff829f835b>] dump_stack+0xb3/0x118 lib/dump_stack.c:15
[<ffffffff8174d3cc>] kasan_object_err+0x1c/0x70 mm/kasan/report.c:156
[< inline >] print_address_description mm/kasan/report.c:194
[<ffffffff8174d666>] kasan_report_error+0x1f6/0x4d0 mm/kasan/report.c:283
[< inline >] kasan_report mm/kasan/report.c:303
[<ffffffff8174db7e>] __asan_report_store8_noabort+0x3e/0x40 mm/kasan/report.c:329
[< inline >] __write_once_size ./include/linux/compiler.h:249
[< inline >] __hlist_del ./include/linux/list.h:622
[< inline >] hlist_del_init ./include/linux/list.h:637
[<ffffffff8579047e>] l2tp_ip6_close+0x22e/0x290 net/l2tp/l2tp_ip6.c:239
[<ffffffff850b2dfd>] inet_release+0xed/0x1c0 net/ipv4/af_inet.c:415
[<ffffffff851dc5a0>] inet6_release+0x50/0x70 net/ipv6/af_inet6.c:422
[<ffffffff84c4581d>] sock_release+0x8d/0x1d0 net/socket.c:570
[<ffffffff84c45976>] sock_close+0x16/0x20 net/socket.c:1017
[<ffffffff817a108c>] __fput+0x28c/0x780 fs/file_table.c:208
[<ffffffff817a1605>] ____fput+0x15/0x20 fs/file_table.c:244
[<ffffffff813774f9>] task_work_run+0xf9/0x170
[<ffffffff81324aae>] do_exit+0x85e/0x2a00
[<ffffffff81326dc8>] do_group_exit+0x108/0x330
[<ffffffff81348cf7>] get_signal+0x617/0x17a0 kernel/signal.c:2307
[<ffffffff811b49af>] do_signal+0x7f/0x18f0
[<ffffffff810039bf>] exit_to_usermode_loop+0xbf/0x150 arch/x86/entry/common.c:156
[< inline >] prepare_exit_to_usermode arch/x86/entry/common.c:190
[<ffffffff81006060>] syscall_return_slowpath+0x1a0/0x1e0 arch/x86/entry/common.c:259
[<ffffffff85e4d726>] entry_SYSCALL_64_fastpath+0xc4/0xc6
Object at ffff8800081b0ec0, in cache L2TP/IPv6 size: 1448
Allocated:
PID = 10987
[ 1116.897025] [<ffffffff811ddcb6>] save_stack_trace+0x16/0x20
[ 1116.897025] [<ffffffff8174c736>] save_stack+0x46/0xd0
[ 1116.897025] [<ffffffff8174c9ad>] kasan_kmalloc+0xad/0xe0
[ 1116.897025] [<ffffffff8174cee2>] kasan_slab_alloc+0x12/0x20
[ 1116.897025] [< inline >] slab_post_alloc_hook mm/slab.h:417
[ 1116.897025] [< inline >] slab_alloc_node mm/slub.c:2708
[ 1116.897025] [< inline >] slab_alloc mm/slub.c:2716
[ 1116.897025] [<ffffffff817476a8>] kmem_cache_alloc+0xc8/0x2b0 mm/slub.c:2721
[ 1116.897025] [<ffffffff84c4f6a9>] sk_prot_alloc+0x69/0x2b0 net/core/sock.c:1326
[ 1116.897025] [<ffffffff84c58ac8>] sk_alloc+0x38/0xae0 net/core/sock.c:1388
[ 1116.897025] [<ffffffff851ddf67>] inet6_create+0x2d7/0x1000 net/ipv6/af_inet6.c:182
[ 1116.897025] [<ffffffff84c4af7b>] __sock_create+0x37b/0x640 net/socket.c:1153
[ 1116.897025] [< inline >] sock_create net/socket.c:1193
[ 1116.897025] [< inline >] SYSC_socket net/socket.c:1223
[ 1116.897025] [<ffffffff84c4b46f>] SyS_socket+0xef/0x1b0 net/socket.c:1203
[ 1116.897025] [<ffffffff85e4d685>] entry_SYSCALL_64_fastpath+0x23/0xc6
Freed:
PID = 10987
[ 1116.897025] [<ffffffff811ddcb6>] save_stack_trace+0x16/0x20
[ 1116.897025] [<ffffffff8174c736>] save_stack+0x46/0xd0
[ 1116.897025] [<ffffffff8174cf61>] kasan_slab_free+0x71/0xb0
[ 1116.897025] [< inline >] slab_free_hook mm/slub.c:1352
[ 1116.897025] [< inline >] slab_free_freelist_hook mm/slub.c:1374
[ 1116.897025] [< inline >] slab_free mm/slub.c:2951
[ 1116.897025] [<ffffffff81748b28>] kmem_cache_free+0xc8/0x330 mm/slub.c:2973
[ 1116.897025] [< inline >] sk_prot_free net/core/sock.c:1369
[ 1116.897025] [<ffffffff84c541eb>] __sk_destruct+0x32b/0x4f0 net/core/sock.c:1444
[ 1116.897025] [<ffffffff84c5aca4>] sk_destruct+0x44/0x80 net/core/sock.c:1452
[ 1116.897025] [<ffffffff84c5ad33>] __sk_free+0x53/0x220 net/core/sock.c:1460
[ 1116.897025] [<ffffffff84c5af23>] sk_free+0x23/0x30 net/core/sock.c:1471
[ 1116.897025] [<ffffffff84c5cb6c>] sk_common_release+0x28c/0x3e0 ./include/net/sock.h:1589
[ 1116.897025] [<ffffffff8579044e>] l2tp_ip6_close+0x1fe/0x290 net/l2tp/l2tp_ip6.c:243
[ 1116.897025] [<ffffffff850b2dfd>] inet_release+0xed/0x1c0 net/ipv4/af_inet.c:415
[ 1116.897025] [<ffffffff851dc5a0>] inet6_release+0x50/0x70 net/ipv6/af_inet6.c:422
[ 1116.897025] [<ffffffff84c4581d>] sock_release+0x8d/0x1d0 net/socket.c:570
[ 1116.897025] [<ffffffff84c45976>] sock_close+0x16/0x20 net/socket.c:1017
[ 1116.897025] [<ffffffff817a108c>] __fput+0x28c/0x780 fs/file_table.c:208
[ 1116.897025] [<ffffffff817a1605>] ____fput+0x15/0x20 fs/file_table.c:244
[ 1116.897025] [<ffffffff813774f9>] task_work_run+0xf9/0x170
[ 1116.897025] [<ffffffff81324aae>] do_exit+0x85e/0x2a00
[ 1116.897025] [<ffffffff81326dc8>] do_group_exit+0x108/0x330
[ 1116.897025] [<ffffffff81348cf7>] get_signal+0x617/0x17a0 kernel/signal.c:2307
[ 1116.897025] [<ffffffff811b49af>] do_signal+0x7f/0x18f0
[ 1116.897025] [<ffffffff810039bf>] exit_to_usermode_loop+0xbf/0x150 arch/x86/entry/common.c:156
[ 1116.897025] [< inline >] prepare_exit_to_usermode arch/x86/entry/common.c:190
[ 1116.897025] [<ffffffff81006060>] syscall_return_slowpath+0x1a0/0x1e0 arch/x86/entry/common.c:259
[ 1116.897025] [<ffffffff85e4d726>] entry_SYSCALL_64_fastpath+0xc4/0xc6
Memory state around the buggy address: ffff8800081b0d80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc ffff8800081b0e00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
>ffff8800081b0e80: fc fc fc fc fc fc fc fc fb fb fb fb fb fb fb fb
^ ffff8800081b0f00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ffff8800081b0f80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
This is caused by the sky2 being called after it has been shutdown.
A previous thread about this can be found here:
https://lkml.org/lkml/2016/4/12/410
An alternative fix is to assure that IFF_UP gets cleared by
calling dev_close() during shutdown. This is similar to what the
bnx2/tg3/xgene and maybe others are doing to assure that the driver
isn't being called following _shutdown().
Signed-off-by: Jeremy Linton <jeremy.linton@arm.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Currently kill_fasync() is called outside the stream lock in
snd_pcm_period_elapsed(). This is potentially racy, since the stream
may get released even during the irq handler is running. Although
snd_pcm_release_substream() calls snd_pcm_drop(), this doesn't
guarantee that the irq handler finishes, thus the kill_fasync() call
outside the stream spin lock may be invoked after the substream is
detached, as recently reported by KASAN.
As a quick workaround, move kill_fasync() call inside the stream
lock. The fasync is rarely used interface, so this shouldn't have a
big impact from the performance POV.
Ideally, we should implement some sync mechanism for the proper finish
of stream and irq handler. But this oneliner should suffice for most
cases, so far.
where skb_gso_segment() relies on skb->protocol to function properly.
This patch sets skb->protocol to ETH_P_IP before dst_output() is called,
fixing a bug where GSO packets sent through a sit tunnel are dropped
when xfrm is involved.
Signed-off-by: Eli Cooper <elicooper@gmx.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
where skb_gso_segment() relies on skb->protocol to function properly.
This patch sets skb->protocol to ETH_P_IPV6 before dst_output() is called,
fixing a bug where GSO packets sent through an ipip6 tunnel are dropped
when xfrm is involved.
Signed-off-by: Eli Cooper <elicooper@gmx.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
It turns out that the rcuos callback-offload kthread is busy processing
a very large quantity of RCU callbacks, and it is not reliquishing the
CPU while doing so. This commit therefore adds an cond_resched_rcu_qs()
within the loop to allow other tasks to run.
[js] use onlu cond_resched() in 3.12
Signed-off-by: Ding Tianhong <dingtianhong@huawei.com>
[ paulmck: Substituted cond_resched_rcu_qs for cond_resched. ] Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Dhaval Giani <dhaval.giani@oracle.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
On the 80486 DX, it seems that some exceptions may leave garbage in
the high bits of CS. This causes sporadic failures in which
early_fixup_exception() refuses to fix up an exception.
As far as I can tell, this has been buggy for a long time, but the
problem seems to have been exacerbated by commits:
1e02ce4cccdc ("x86: Store a per-cpu shadow copy of CR4") e1bfc11c5a6f ("x86/init: Fix cr4_init_shadow() on CR4-less machines")
This appears to have broken for as long as we've had early
exception handling.
[ This backport should apply to kernels from 3.4 - 4.5. ]
Fixes: 4c5023a3fa2e ("x86-32: Handle exception table entries during early boot") Cc: H. Peter Anvin <hpa@zytor.com> Reported-by: Matthew Whitehead <tedheadster@gmail.com> Signed-off-by: Andy Lutomirski <luto@kernel.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Michel Dänzer [Tue, 29 Nov 2016 09:40:20 +0000 (18:40 +0900)]
drm/radeon: Ensure vblank interrupt is enabled on DPMS transition to on
NOTE: This patch only applies to 4.5.y or older kernels. With newer
kernels, this problem cannot happen because the driver now uses
drm_crtc_vblank_on/off instead of drm_vblank_pre/post_modeset[0]. I
consider this patch safer for older kernels than backporting the API
change, because drm_crtc_vblank_on/off had various issues in older
kernels, and I'm not sure all fixes for those have been backported to
all stable branches where this patch could be applied.
---------------------
Fixes the vblank interrupt being disabled when it should be on, which
can cause at least the following symptoms:
* Hangs when running 'xset dpms force off' in a GNOME session with
gnome-shell using DRI2.
* RandR 1.4 slave outputs freezing with garbage displayed using
xf86-video-ati 7.8.0 or newer.
Reported-and-Tested-by: Max Staudt <mstaudt@suse.de> Reviewed-by: Daniel Vetter <daniel@ffwll.ch> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Michel Dänzer <michel.daenzer@amd.com>
If mpi_powm() is given a zero exponent, it wants to immediately return
either 1 or 0, depending on the modulus. However, if the result was
initalised with zero limb space, no limbs space is allocated and a
NULL-pointer exception ensues.
Fix this by allocating a minimal amount of limb space for the result when
the 0-exponent case when the result is 1 and not touching the limb space
when the result is 0.
This affects the use of RSA keys and X.509 certificates that carry them.
After a policy replacement, the task cred may be out of date and need
to be updated. However change_hat is using the stale profiles from
the out of date cred resulting in either: a stale profile being applied
or, incorrect failure when searching for a hat profile as it has been
migrated to the new parent profile.
Fixes: 01e2b670aa898a39259bc85c78e3d74820f4d3b6 (failure to find hat) Fixes: 898127c34ec03291c86f4ff3856d79e9e18952bc (stale policy being applied)
Bugzilla: https://bugzilla.suse.com/show_bug.cgi?id=1000287 Signed-off-by: John Johansen <john.johansen@canonical.com> Signed-off-by: James Morris <james.l.morris@oracle.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
It's possible to make scanning consume almost arbitrary amounts
of memory, e.g. by sending beacon frames with random BSSIDs at
high rates while somebody is scanning.
Limit the number of BSS table entries we're willing to cache to
1000, limiting maximum memory usage to maybe 4-5MB, but lower
in practice - that would be the case for having both full-sized
beacon and probe response frames for each entry; this seems not
possible in practice, so a limit of 1000 entries will likely be
closer to 0.5 MB.
Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
For large values of "mult" and long uptimes, the intermediate
result of "cycles * mult" can overflow 64 bits. For example,
the tile platform calls clocksource_cyc2ns with a 1.2 GHz clock;
we have mult = 853, and after 208.5 days, we overflow 64 bits.
Since clocksource_cyc2ns() is intended to be used for relative
cycle counts, not absolute cycle counts, performance is more
importance than accepting a wider range of cycle values. So,
just use mult_frac() directly in tile's sched_clock().
Commit 4cecf6d401a0 ("sched, x86: Avoid unnecessary overflow
in sched_clock") by Salman Qazi results in essentially the same
generated code for x86 as this change does for tile. In fact,
a follow-on change by Salman introduced mult_frac() and switched
to using it, so the C code was largely identical at that point too.
Peter Zijlstra then added mul_u64_u32_shr() and switched x86
to use it. This is, in principle, better; by optimizing the
64x64->64 multiplies to be 32x32->64 multiplies we can potentially
save some time. However, the compiler piplines the 64x64->64
multiplies pretty well, and the conditional branch in the generic
mul_u64_u32_shr() causes some bubbles in execution, with the
result that it's pretty much a wash. If tilegx provided its own
implementation of mul_u64_u32_shr() without the conditional branch,
we could potentially save 3 cycles, but that seems like small gain
for a fair amount of additional build scaffolding; no other platform
currently provides a mul_u64_u32_shr() override, and tile doesn't
currently have an <asm/div64.h> header to put the override in.
Additionally, gcc currently has an optimization bug that prevents
it from recognizing the opportunity to use a 32x32->64 multiply,
and so the result would be no better than the existing mult_frac()
until such time as the compiler is fixed.
For now, just using mult_frac() seems like the right answer.
Signed-off-by: Chris Metcalf <cmetcalf@mellanox.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
This is a work around for a bug with LSI Fusion MPT SAS2 when perfoming
secure erase. Due to the very long time the operation takes, commands
issued during the erase will time out and will trigger execution of the
abort hook. Even though the abort hook is called for the specific
command which timed out, this leads to entire device halt
(scsi_state terminated) and premature termination of the secure erase.
Set device state to busy while ATA passthrough commands are in progress.
[mkp: hand applied to 4.9/scsi-fixes, tweaked patch description]
Signed-off-by: Andrey Grodzovsky <andrey2805@gmail.com> Acked-by: Sreekanth Reddy <Sreekanth.Reddy@broadcom.com> Cc: <linux-scsi@vger.kernel.org> Cc: Sathya Prakash <sathya.prakash@broadcom.com> Cc: Chaitra P B <chaitra.basappa@broadcom.com> Cc: Suganath Prabu Subramani <suganath-prabu.subramani@broadcom.com> Cc: Sreekanth Reddy <Sreekanth.Reddy@broadcom.com> Cc: Hannes Reinecke <hare@suse.de> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Some code (all error handling) submits CDBs that are allocated
on the stack. This breaks with CB/CBI code that tries to create
URB directly from SCSI command buffer - which happens to be in
vmalloced memory with vmalloced kernel stacks.
Let's make copy of the command in usb_stor_CB_transport.
Signed-off-by: Petr Vandrovec <petr@vandrovec.name> Acked-by: Alan Stern <stern@rowland.harvard.edu> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
This patch adds support for the TI CC3200 LaunchPad board, which uses a
custom USB vendor ID and product ID. Channel A is used for JTAG, and
channel B is used for a UART.
Signed-off-by: Doug Brown <doug@schmorgal.com> Signed-off-by: Johan Hovold <johan@kernel.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
The BRIM Brothers Zone DPMX is a bicycle powermeter. This ID is for the USB
serial interface in its charging dock for the control pods, via which some
settings for the pods can be modified.
Signed-off-by: Paul Jakma <paul@jakma.org> Cc: Barry Redmond <barry@brimbrothers.com> Signed-off-by: Johan Hovold <johan@kernel.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
em_jmp_far and em_ret_far assumed that setting IP can only fail in 64
bit mode, but syzkaller proved otherwise (and SDM agrees).
Code segment was restored upon failure, but it was left uninitialized
outside of long mode, which could lead to a leak of host kernel stack.
We could have fixed that by always saving and restoring the CS, but we
take a simpler approach and just break any guest that manages to fail
as the error recovery is error-prone and modern CPUs don't need emulator
for this.
Found by syzkaller:
WARNING: CPU: 2 PID: 3668 at arch/x86/kvm/emulate.c:2217 em_ret_far+0x428/0x480
Kernel panic - not syncing: panic_on_warn set ...
Reported-by: Dmitry Vyukov <dvyukov@google.com> Fixes: d1442d85cc30 ("KVM: x86: Handle errors when RIP is set during far jumps") Signed-off-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Now that GSO functionality can correctly track if the fragment
id has been selected and select a fragment id if necessary,
we can re-enable UFO on tap/macvap and virtio devices.
Signed-off-by: Vladislav Yasevich <vyasevic@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net> Cc: Ben Hutchings <ben@decadent.org.uk> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Commit 08d78658f393 ("panic: release stale console lock to always get the
logbuf printed out") introduced an unwanted bad unlock balance report when
panic() is called directly and not from OOPS (e.g. from out_of_memory()).
The difference is that in case of OOPS we disable locks debug in
oops_enter() and on direct panic call nobody does that.
Fixes: 08d78658f393 ("panic: release stale console lock to always get the logbuf printed out") Reported-by: kernel test robot <ying.huang@linux.intel.com> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Cc: HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com> Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Cc: Jiri Kosina <jkosina@suse.cz> Cc: Baoquan He <bhe@redhat.com> Cc: Prarit Bhargava <prarit@redhat.com> Cc: Xie XiuQi <xiexiuqi@huawei.com> Cc: Seth Jennings <sjenning@redhat.com> Cc: "K. Y. Srinivasan" <kys@microsoft.com> Cc: Jan Kara <jack@suse.cz> Cc: Petr Mladek <pmladek@suse.cz> Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Commit 073db4a51ee4 ("mtd: fix: avoid race condition when accessing
mtd->usecount") fixed a race condition but due to poor ordering of the
mutex acquisition, introduced a potential deadlock.
The deadlock can occur, for example, when rmmod'ing the m25p80 module, which
will delete one or more MTDs, along with any corresponding mtdblock
devices. This could potentially race with an acquisition of the block
device as follows.
This is a classic (potential) ABBA deadlock, which can be fixed by
making the A->B ordering consistent everywhere. There was no real
purpose to the ordering in the original patch, AFAIR, so this shouldn't
be a problem. This ordering was actually already present in
del_mtd_blktrans_dev(), for one, where the function tried to ensure that
its caller already held mtd_table_mutex before it acquired &dev->lock:
if (mutex_trylock(&mtd_table_mutex)) {
mutex_unlock(&mtd_table_mutex);
BUG();
}
So, reverse the ordering of acquisition of &dev->lock and &mtd_table_mutex so
we always acquire mtd_table_mutex first.
In some cases a NACK interrupt may be pending in the Status Register (SR)
as a result of a previous transfer. However at91_do_twi_transfer() did not
read the SR to clear pending interruptions before starting a new transfer.
Hence a NACK interrupt rose as soon as it was enabled again at the I2C
controller level, resulting in a wrong sequence of operations and strange
patterns of behaviour on the I2C bus, such as a clock stretch followed by
a restart of the transfer.
This first issue occurred with both DMA and PIO write transfers.
Also when a NACK error was detected during a PIO write transfer, the
interrupt handler used to wrongly start a new transfer by writing into the
Transmit Holding Register (THR). Then the I2C slave was likely to reply
with a second NACK.
This second issue is fixed in atmel_twi_interrupt() by handling the TXRDY
status bit only if both the TXCOMP and NACK status bits are cleared.
Tested with a at24 eeprom on sama5d36ek board running a linux-4.1-at91
kernel image. Adapted to linux-next.
Reported-by: Peter Rosin <peda@lysator.liu.se> Signed-off-by: Cyrille Pitchen <cyrille.pitchen@atmel.com> Signed-off-by: Ludovic Desroches <ludovic.desroches@atmel.com> Tested-by: Peter Rosin <peda@lysator.liu.se> Signed-off-by: Wolfram Sang <wsa@the-dreams.de> Fixes: 93563a6a71bb ("i2c: at91: fix a race condition when using the DMA controller") Signed-off-by: Jiri Slaby <jslaby@suse.cz>
932c435caba8 ("PCI: Add dev_flags bit to access VPD through function 0")
added PCI_DEV_FLAGS_VPD_REF_F0. Previously, we set the flag on every
non-zero function of quirked devices. If a function turned out to be
different from function 0, i.e., it had a different class, vendor ID, or
device ID, the flag remained set but we didn't make VPD accessible at all.
Flip this around so we only set PCI_DEV_FLAGS_VPD_REF_F0 for functions that
are identical to function 0, and allow regular VPD access for any other
functions.
[bhelgaas: changelog, stable tag] Fixes: 932c435caba8 ("PCI: Add dev_flags bit to access VPD through function 0") Signed-off-by: Alex Williamson <alex.williamson@redhat.com> Signed-off-by: Bjorn Helgaas <helgaas@kernel.org> Acked-by: Myron Stowe <myron.stowe@redhat.com> Acked-by: Mark Rustad <mark.d.rustad@intel.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Commit 932c435caba8 ("PCI: Add dev_flags bit to access VPD through function
0") passes PCI_SLOT(devfn) for the devfn parameter of pci_get_slot().
Generally this works because we're fairly well guaranteed that a PCIe
device is at slot address 0, but for the general case, including
conventional PCI, it's incorrect. We need to get the slot and then convert
it back into a devfn.
Fixes: 932c435caba8 ("PCI: Add dev_flags bit to access VPD through function 0") Signed-off-by: Alex Williamson <alex.williamson@redhat.com> Signed-off-by: Bjorn Helgaas <helgaas@kernel.org> Acked-by: Myron Stowe <myron.stowe@redhat.com> Acked-by: Mark Rustad <mark.d.rustad@intel.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Commit b253149b843f ("sched/idle/x86: Restore mwait_idle() to fix boot
hangs, to improve power savings and to improve performance") restores
mwait_idle(), but the trace_cpu_idle related calls are missing. This
causes powertop on my old desktop powered by Intel Core2 E6550 to
report zero wakeups and zero events.
The fix for deadlock in PM in commit [1ee23fe07ee8: ALSA: usb-audio:
Fix deadlocks at resuming] introduced a new check of in_pm flag.
However, the brainless patch author evaluated it in a wrong way
(logical AND instead of logical OR), thus usb_autopm_get_interface()
is wrongly called at probing, leading to unbalance of runtime PM
refcount.
This patch fixes it by correcting the logic.
Reported-by: Hans Yang <hansy@nvidia.com> Fixes: 1ee23fe07ee8 ('ALSA: usb-audio: Fix deadlocks at resuming') Signed-off-by: Takashi Iwai <tiwai@suse.de> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
The variable for the 'permissive' module parameter used to be static
but was recently changed to be extern. This puts it in the kernel
global namespace if the driver is built-in, so its name should begin
with a prefix identifying the driver.
Signed-off-by: Ben Hutchings <ben@decadent.org.uk> Fixes: af6fc858a35b ("xen-pciback: limit guest control of command register") Signed-off-by: David Vrabel <david.vrabel@citrix.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Some AMD CS553x devices have read-only BARs because of a firmware or
hardware defect. There's a workaround in quirk_cs5536_vsa(), but it no
longer works after 36e8164882ca ("PCI: Restore detection of read-only
BARs"). Prior to 36e8164882ca, we filled in res->start; afterwards we
leave it zeroed out. The quirk only updated the size, so the driver tried
to use a region starting at zero, which didn't work.
Expand quirk_cs5536_vsa() to read the base addresses from the BARs and
hard-code the sizes.
On Nix's system BAR 2's read-only value is 0x6200. Prior to 36e8164882ca,
we interpret that as a 512-byte BAR based on the lowest-order bit set. Per
datasheet sec 5.6.1, that BAR (MFGPT) requires only 64 bytes; use that to
avoid clearing any address bits if a platform uses only 64-byte alignment.
[js] pcibios_bus_to_resource takes pdev, not bus in 3.12
The fix from 9fc81d87420d ("perf: Fix events installation during
moving group") was incomplete in that it failed to recognise that
creating a group with events for different CPUs is semantically
broken -- they cannot be co-scheduled.
Furthermore, it leads to real breakage where, when we create an event
for CPU Y and then migrate it to form a group on CPU X, the code gets
confused where the counter is programmed -- triggered in practice
as well by me via the perf fuzzer.
Fix this by tightening the rules for creating groups. Only allow
grouping of counters that can be co-scheduled in the same context.
This means for the same task and/or the same cpu.
Fixes: 9fc81d87420d ("perf: Fix events installation during moving group") Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/20150123125834.090683288@infradead.org Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
There is a poll loop for max 25us for HS devices. Now guess what, I
tested it in gadget mode and forgot about the little detail. Nobody seem
to have it noticed…
This patch adds the missing logic for hostmode so it is recognized in
host and device mode properly.
Fixes: 50aea6fca771 ("usb: musb: cppi41: fire hrtimer according to
programmed channel length") Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Felipe Balbi <balbi@ti.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
IPv6 does not allow fragmentation by routers, so there is no
fragmentation ID in the fixed header. UFO for IPv6 requires the ID to
be passed separately, but there is no provision for this in the virtio
net protocol.
Until recently our software implementation of UFO/IPv6 generated a new
ID, but this was a bug. Now we will use ID=0 for any UFO/IPv6 packet
passed through a tap, which is even worse.
Unfortunately there is no distinction between UFO/IPv4 and v6
features, so disable UFO on taps and virtio_net completely until we
have a proper solution.
We cannot depend on VM managers respecting the tap feature flags, so
keep accepting UFO packets but log a warning the first time we do
this.
Signed-off-by: Ben Hutchings <ben@decadent.org.uk> Fixes: 916e4cf46d02 ("ipv6: reuse ip6_frag_id from ip6_ufo_append_data") Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Read-only memory ranges may be backed by the zero page, so avoid
misidentifying it a a MMIO pfn.
This fixes another issue I identified when testing QEMU+KVM_UEFI, where
a read to an uninitialized emulated NOR flash brought in the zero page,
but mapped as a read-write device region, because kvm_is_mmio_pfn()
misidentifies it as a MMIO pfn due to its PG_reserved bit being set.
In order to make the static inline function is_zero_pfn() callable by
modules, export its symbol dependencies 'zero_pfn' and (for s390 and
mips) 'zero_page_mask'.
We need this for KVM, as CONFIG_KVM is a tristate for all supported
architectures except ARM and arm64, and testing a pfn whether it refers
to the zero page is required to correctly distinguish the zero page
from other special RAM ranges that may also have the PG_reserved bit
set, but need to be treated as MMIO memory.
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> Acked-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
1343 /*
1344 * This function is never used on a shmem/tmpfs
1345 * mapping, so a swap entry won't be found here.
1346 */
1347 BUG();
After commit 0cd6144aadd2 ("mm + fs: prepare for non-page entries in
page cache radix trees") this comment and BUG() are out of date because
exceptional entries can now appear in all mappings - as shadows of
recently evicted pages.
However, as Hugh Dickins notes,
"it is truly surprising for a PAGECACHE_TAG_WRITEBACK (and probably
any other PAGECACHE_TAG_*) to appear on an exceptional entry.
I expect it comes down to an occasional race in RCU lookup of the
radix_tree: lacking absolute synchronization, we might sometimes
catch an exceptional entry, with the tag which really belongs with
the unexceptional entry which was there an instant before."
And indeed, not only is the tree walk lockless, the tags are also read
in chunks, one radix tree node at a time. There is plenty of time for
page reclaim to swoop in and replace a page that was already looked up
as tagged with a shadow entry.
Remove the BUG() and update the comment. While reviewing all other
lookup sites for whether they properly deal with shadow entries of
evicted pages, update all the comments and fix memcg file charge moving
to not miss shmem/tmpfs swapcache pages.
Fixes: 0cd6144aadd2 ("mm + fs: prepare for non-page entries in page cache radix trees") Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reported-by: Dave Jones <davej@redhat.com> Acked-by: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
When the vmalloc area gets fragmented, and because the firmware
mapping area sits between where modules live and the vmalloc area, we
can sometimes receive requests for enormous kernel TLB range flushes.
When this happens the cpu just spins flushing billions of pages and
this triggers the NMI watchdog and other problems.
We took care of this on the TSB side by doing a linear scan of the
table once we pass a certain threshold.
Do something similar for the TLB flush, however we are limited by
the TLB flush facilities provided by the different chip variants.
First of all we use an (mostly arbitrary) cut-off of 256K which is
about 32 pages. This can be tuned in the future.
The huge range code path for each chip works as follows:
1) On spitfire we flush all non-locked TLB entries using diagnostic
acceses.
2) On cheetah we use the "flush all" TLB flush.
3) On sun4v/hypervisor we do a TLB context flush on context 0, which
unlike previous chips does not remove "permanent" or locked
entries.
We could probably do something better on spitfire, such as limiting
the flush to kernel TLB entries or even doing range comparisons.
However that probably isn't worth it since those chips are old and
the TLB only had 64 entries.
Reported-by: James Clarke <jrtc27@jrtc27.com> Tested-by: James Clarke <jrtc27@jrtc27.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
When we copy code over to patch another piece of code, we can only use
PC-relative branches that target code within that piece of code.
Such PC-relative branches cannot be made to external symbols because
the patch moves the location of the code and thus modifies the
relative address of external symbols.
Use an absolute jmpl to fix this problem.
Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
If the number of pages we are flushing is more than twice the number
of entries in the TSB, just scan the TSB table for matches rather
than probing each and every page in the range.
Based upon a patch and report by James Clarke.
Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
do_sparc64_fault() calculates both the base and huge page RSS sizes and
uses this information in calls to tsb_grow(). The calculation for base
page TSB size is not correct if the task uses hugetlb pages. hugetlb
pages are not accounted for in RSS, therefore the call to get_mm_rss(mm)
does not include hugetlb pages. However, the number of pages based on
huge_pte_count (which does include hugetlb pages) is subtracted from
this value. This will result in an artificially small and often negative
RSS calculation. The base TSB size is then often set to max_tsb_size
as the passed RSS is unsigned, so a negative value looks really big.
THP pages are also accounted for in huge_pte_count, and THP pages are
accounted for in RSS so the calculation in do_sparc64_fault() is correct
if a task only uses THP pages.
A single huge_pte_count is not sufficient for TSB sizing if both hugetlb
and THP pages can be used. Instead of a single counter, use two: one
for hugetlb and one for THP.
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
On pre-Niagara systems, we fetch the fault address on data TLB
exceptions from the TLB_TAG_ACCESS register. But this register also
contains the context ID assosciated with the fault in the low 13 bits
of the register value.
This propagates into current_thread_info()->fault_address and can
cause trouble later on.
So clear the low 13-bits out of the TLB_TAG_ACCESS value in the cases
where it matters.
Reported-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
With syzkaller help, Marco Grassi found a bug in TCP stack,
crashing in tcp_collapse()
Root cause is that sk_filter() can truncate the incoming skb,
but TCP stack was not really expecting this to happen.
It probably was expecting a simple DROP or ACCEPT behavior.
We first need to make sure no part of TCP header could be removed.
Then we need to adjust TCP_SKB_CB(skb)->end_seq
Many thanks to syzkaller team and Marco for giving us a reproducer.
Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Marco Grassi <marco.gra@gmail.com> Reported-by: Vladis Dronov <vdronov@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
In v2.6, ip_rt_redirect() calls arp_bind_neighbour() which returns 0
and then the state of the neigh for the new_gw is checked. If the state
isn't valid then the redirected route is deleted. This behavior is
maintained up to v3.5.7 by check_peer_redirect() because rt->rt_gateway
is assigned to peer->redirect_learned.a4 before calling
ipv4_neigh_lookup().
After commit 5943634fc559 ("ipv4: Maintain redirect and PMTU info in
struct rtable again."), ipv4_neigh_lookup() is performed without the
rt_gateway assigned to the new_gw. In the case when rt_gateway (old_gw)
isn't zero, the function uses it as the key. The neigh is most likely
valid since the old_gw is the one that sends the ICMP redirect message.
Then the new_gw is assigned to fib_nh_exception. The problem is: the
new_gw ARP may never gets resolved and the traffic is blackholed.
So, use the new_gw for neigh lookup.
Changes from v1:
- use __ipv4_neigh_lookup instead (per Eric Dumazet).
Fixes: 5943634fc559 ("ipv4: Maintain redirect and PMTU info in struct rtable again.") Signed-off-by: Stephen Suryaputra Lin <ssurya@ieee.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Fixes: commit f187bc6efb7250afee0e2009b6106 ("ipv4: No need to set generic neighbour pointer") Cc: David S. Miller <davem@davemloft.net> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
sctp_wait_for_connect() currently already holds the asoc to keep it
alive during the sleep, in case another thread release it. But Andrey
Konovalov and Dmitry Vyukov reported an use-after-free in such
situation.
Problem is that __sctp_connect() doesn't get a ref on the asoc and will
do a read on the asoc after calling sctp_wait_for_connect(), but by then
another thread may have closed it and the _put on sctp_wait_for_connect
will actually release it, causing the use-after-free.
Fix is, instead of doing the read after waiting for the connect, do it
before so, and avoid this issue as the socket is still locked by then.
There should be no issue on returning the asoc id in case of failure as
the application shouldn't trust on that number in such situations
anyway.
This issue doesn't exist in sctp_sendmsg() path.
Reported-by: Dmitry Vyukov <dvyukov@google.com> Reported-by: Andrey Konovalov <andreyknvl@google.com> Tested-by: Andrey Konovalov <andreyknvl@google.com> Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Reviewed-by: Xin Long <lucien.xin@gmail.com> Acked-by: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Jiri Slaby <jslaby@suse.cz>