Functions inc_rt_prio_smp() and dec_rt_prio_smp() change (global) rq's
priority, while rt_rq passed to them may be not the top-level rt_rq.
This is wrong, because changing of priority on a child level does not
guarantee that the priority is the highest all over the rq. So, this
leak makes RT balancing unusable.
The short example: the task having the highest priority among all rq's
RT tasks (no one other task has the same priority) are waking on a
throttle rt_rq. The rq's cpupri is set to the task's priority
equivalent, but real rq->rt.highest_prio.curr is less.
The patch below fixes the problem.
Signed-off-by: Kirill Tkhai <tkhai@yandex.ru> Signed-off-by: Peter Zijlstra <peterz@infradead.org> CC: Steven Rostedt <rostedt@goodmis.org> Link: http://lkml.kernel.org/r/49231385567953@web4m.yandex.ru Signed-off-by: Ingo Molnar <mingo@kernel.org>
[bwh: Backported to 3.2: adjust filename] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
VMAs covering a bo but that didn't start at the same address space offset as
the bo they were mapping were incorrectly generating SEGFAULT errors in
the fault handler.
Reported-by: Joseph Dolinak <kanilo2@yahoo.com> Signed-off-by: Thomas Hellstrom <thellstrom@vmware.com> Reviewed-by: Jakob Bornecrantz <jakob@vmware.com>
[bwh: Backported to 3.2: drm_vma_node_start() is open-coded;
vma_pages() was open-coded] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
A user on StackExchange had a failing SSD that's soldered directly
onto the motherboard of his system. The BIOS does not give any option
to disable it at all, so he can't just hide it from the OS via the
BIOS.
The old IDE layer had hdX=noprobe override for situations like this,
but that was never ported to the libata layer.
This patch implements a disable flag for libata.force.
Example use:
libata.force=2.0:disable
[v2 of the patch, removed the nodisable flag per Tejun Heo]
Ftrace currently initializes only the online CPUs. This implementation has
two problems:
- If we online a CPU after we enable the function profile, and then run the
test, we will lose the trace information on that CPU.
Steps to reproduce:
# echo 0 > /sys/devices/system/cpu/cpu1/online
# cd <debugfs>/tracing/
# echo <some function name> >> set_ftrace_filter
# echo 1 > function_profile_enabled
# echo 1 > /sys/devices/system/cpu/cpu1/online
# run test
- If we offline a CPU before we enable the function profile, we will not clear
the trace information when we enable the function profile. It will trouble
the users.
Steps to reproduce:
# cd <debugfs>/tracing/
# echo <some function name> >> set_ftrace_filter
# echo 1 > function_profile_enabled
# run test
# cat trace_stat/function*
# echo 0 > /sys/devices/system/cpu/cpu1/online
# echo 0 > function_profile_enabled
# echo 1 > function_profile_enabled
# cat trace_stat/function*
# run test
# cat trace_stat/function*
So it is better that we initialize the ftrace profiler for each possible cpu
every time we enable the function profile instead of just the online ones.
Evan Huus found (by fuzzing in wireshark) that the radiotap
iterator code can access beyond the length of the buffer if
the first bitmap claims an extension but then there's no
data at all. Fix this.
Reported-by: Evan Huus <eapache@gmail.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
We should be writing bits here but instead we're writing the
numbers that correspond to the bits we want to write. Fix it by
wrapping the numbers in the BIT() macro. This fixes gpios acting
as interrupts.
Signed-off-by: Stephen Boyd <sboyd@codeaurora.org> Signed-off-by: Linus Walleij <linus.walleij@linaro.org> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
While enabling these machines, we found we would sometimes lose an
interrupt if we change hardware volume during playback, and that
disabling msi fixed this issue. (Losing the interrupt caused underruns
and crackling audio, as the one second timeout is usually bigger than
the period size.)
The machines were all machines from HP, running AMD Hudson controller,
and Realtek ALC282 codec.
BugLink: https://bugs.launchpad.net/bugs/1260225 Signed-off-by: David Henningsson <david.henningsson@canonical.com> Signed-off-by: Takashi Iwai <tiwai@suse.de>
[bwh: Backported to 3.2: adjust context] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Some RS690 boards with 64MB of sideport memory show up as
having 128MB sideport + 256MB of UMA. In this case,
just skip the sideport memory and use UMA. This fixes
rendering corruption and should improve performance.
This patch changes special case handling for ISCSI_OP_SCSI_CMD
where an initiator sends a zero length Expected Data Transfer
Length (EDTL), but still sets the WRITE and/or READ flag bits
when no payload transfer is requested.
Many, many moons ago two special cases where added for an ancient
version of ESX that has long since been fixed, so instead of adding
a new special case for the reported bug with a Broadcom 57800 NIC,
go ahead and always strip off the incorrect WRITE + READ flag bits.
Also, avoid sending a reject here, as RFC-3720 does mandate this
case be handled without protocol error.
Reported-by: Witold Bazakbal <865perl@wp.pl> Tested-by: Witold Bazakbal <865perl@wp.pl> Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
[bwh: Backported to 3.2: adjust context] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
We've got regression reports that my previous fix for spurious wakeups
after S5 on HP Haswell machines leads to the automatic reboot at
shutdown on some machines. It turned out that the fix for one side
triggers another BIOS bug in other side. So, it's exclusive.
Since the original S5 wakeups have been confirmed only on HP machines,
it'd be safer to apply it only to limited machines. As a wild guess,
limiting to machines with HP PCI SSID should suffice.
This patch should be backported to kernels as old as 3.12, that
contain the commit 638298dc66ea36623dbc2757a24fc2c4ab41b016 "xhci: Fix
spurious wakeups after S5 on Haswell".
That thing should be del_timer_sync(); consider what happens
if ext4_put_super() call of del_timer() happens to come just as it's
getting run on another CPU. Since that timer reschedules itself
to run next day, you are pretty much guaranteed that you'll end up
with kfree'd scheduled timer, with usual fun consequences. AFAICS,
that's -stable fodder all way back to 2010... [the second del_timer_sync()
is almost certainly not needed, but it doesn't hurt either]
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
[bwh: Backported to 3.2: adjust context] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
ext2_quota_write() doesn't properly setup bh it passes to
ext2_get_block() and thus we hit assertion BUG_ON(maxblocks == 0) in
ext2_get_blocks() (or we could actually ask for mapping arbitrary number
of blocks depending on whatever value was on stack).
Fix ext2_quota_write() to properly fill in number of blocks to map.
Reviewed-by: "Theodore Ts'o" <tytso@mit.edu> Reviewed-by: Christoph Hellwig <hch@lst.de> Reported-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
A corrupted ext4 may have out of order leaf extents, i.e.
extent: lblk 0--1023, len 1024, pblk 9217, flags: LEAF UNINIT
extent: lblk 1000--2047, len 1024, pblk 10241, flags: LEAF UNINIT
^^^^ overlap with previous extent
Reading such extent could hit BUG_ON() in ext4_es_cache_extent().
BUG_ON(end < lblk);
The problem is that __read_extent_tree_block() tries to cache holes as
well but assumes 'lblk' is greater than 'prev' and passes underflowed
length to ext4_es_cache_extent(). Fix it by checking for overlapping
extents in ext4_valid_extent_entries().
I hit this when fuzz testing ext4, and am able to reproduce it by
modifying the on-disk extent by hand.
Also add the check for (ee_block + len - 1) in ext4_valid_extent() to
make sure the value is not overflow.
ext4_mb_put_pa should hold pa->pa_lock before accessing pa->pa_count.
While ext4_mb_use_preallocated checks pa->pa_deleted first and then
increments pa->count later, ext4_mb_put_pa decrements pa->pa_count
before holding pa->pa_lock and then sets pa->pa_deleted.
* Free sequence
ext4_mb_put_pa (1): atomic_dec_and_test pa->pa_count
ext4_mb_put_pa (2): lock pa->pa_lock
ext4_mb_put_pa (3): check pa->pa_deleted
ext4_mb_put_pa (4): set pa->pa_deleted=1
ext4_mb_put_pa (5): unlock pa->pa_lock
ext4_mb_put_pa (6): remove pa from a list
ext4_mb_pa_callback: free pa
* Use sequence
ext4_mb_use_preallocated (1): iterate over preallocation
ext4_mb_use_preallocated (2): lock pa->pa_lock
ext4_mb_use_preallocated (3): check pa->pa_deleted
ext4_mb_use_preallocated (4): increase pa->pa_count
ext4_mb_use_preallocated (5): unlock pa->pa_lock
ext4_mb_release_context: access pa
While it's true that errors can only happen if there is a bug in
jbd2_journal_dirty_metadata(), if a bug does happen, we need to halt
the kernel or remount the file system read-only in order to avoid
further data loss. The ext4_journal_abort_handle() function doesn't
do any of this, and while it's likely that this call (since it doesn't
adjust refcounts) will likely result in the file system eventually
deadlocking since the current transaction will never be able to close,
it's much cleaner to call let ext4's error handling system deal with
this situation.
There's a separate bug here which is that if certain jbd2 errors
errors occur and file system is mounted errors=continue, the file
system will probably eventually end grind to a halt as described
above. But things have been this way in a long time, and usually when
we have these sorts of errors it's pretty much a disaster --- and
that's why the jbd2 layer aggressively retries memory allocations,
which is the most likely cause of these jbd2 errors.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz>
[bwh: Backported to 3.2: drop logging of missing transaction debug data] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
We've received multiple reports in Fedora via (BZ 907193)
that the Seagate Momentus SpinPoint M8 errors out when enabling AA:
[ 2.555905] ata2.00: failed to enable AA (error_mask=0x1)
[ 2.568482] ata2.00: failed to enable AA (error_mask=0x1)
Add the ATA_HORKAGE_BROKEN_FPDMA_AA for this specific harddisk.
Reported-by: Nicholas <arealityfarbetween@googlemail.com> Signed-off-by: Michele Baldessari <michele@acksyn.org> Tested-by: Nicholas <arealityfarbetween@googlemail.com> Acked-by: Alan Cox <gnomes@lxorguk.ukuu.org.uk> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
E.g. landisk_defconfig, which has CONFIG_NTFS_FS=m:
ERROR: "__ashrdi3" [fs/ntfs/ntfs.ko] undefined!
For "lib-y", if no symbols in a compilation unit are referenced by other
units, the compilation unit will not be included in vmlinux. This
breaks modules that do reference those symbols.
This doesn't fix all cases. There are others, e.g. udivsi3.
This is also not limited to sh, many architectures handle this in the
same way.
A simple solution is to unconditionally include all helper functions.
A more complex solution is to make the choice of "lib-y" or "obj-y" depend
on CONFIG_MODULES:
Aborted requests usually get cleared when the reply is received.
If MDS crashes, no reply will be received. So we need to cleanup
aborted requests when re-sending requests.
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com> Reviewed-by: Greg Farnum <greg@inktank.com> Signed-off-by: Sage Weil <sage@inktank.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
This patch fixes the problem that get_unmapped_area() can return illegal
address and result in failing mmap(2) etc.
In case that the address higher than PAGE_SIZE is set to
/proc/sys/vm/mmap_min_addr, the address lower than mmap_min_addr can be
returned by get_unmapped_area(), even if you do not pass any virtual
address hint (i.e. the second argument).
This is because the current get_unmapped_area() code does not take into
account mmap_min_addr.
This leads to two actual problems as follows:
1. mmap(2) can fail with EPERM on the process without CAP_SYS_RAWIO,
although any illegal parameter is not passed.
2. The bottom-up search path after the top-down search might not work in
arch_get_unmapped_area_topdown().
Note: The first and third chunk of my patch, which changes "len" check,
are for more precise check using mmap_min_addr, and not for solving the
above problem.
Signed-off-by: Akira Takeuchi <takeuchi.akr@jp.panasonic.com> Signed-off-by: Kiyoshi Owada <owada.kiyoshi@jp.panasonic.com> Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
[bwh: Backported to 3.2:
As we do not have vm_unmapped_area(), make arch_get_unmapped_area_topdown()
calculate the lower limit for the new area's end address and then compare
addresses with this instead of with len. In the process, fix an off-by-one
error which could result in returning 0 if mm->mmap_base == len.] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Before we do an EMMS in the AMD FXSAVE information leak workaround we
need to clear any pending exceptions, otherwise we trap with a
floating-point exception inside this code.
In kvm_lapic_sync_from_vapic and kvm_lapic_sync_to_vapic there is the
potential to corrupt kernel memory if userspace provides an address that
is at the end of a page. This patches concerts those functions to use
kvm_write_guest_cached and kvm_read_guest_cached. It also checks the
vapic_address specified by userspace during ioctl processing and returns
an error to userspace if the address is not a valid GPA.
This is generally not guest triggerable, because the required write is
done by firmware that runs before the guest. Also, it only affects AMD
processors and oldish Intel that do not have the FlexPriority feature
(unless you disable FlexPriority, of course; then newer processors are
also affected).
Fixes: b93463aa59d6 ('KVM: Accelerated apic support') Reported-by: Andrew Honig <ahonig@google.com> Signed-off-by: Andrew Honig <ahonig@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
[dannf: backported to Debian's 3.2] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Pick the MAC address of the first virtual interface as the new hardware MAC
address. Set BSSID mask according to this MAC address. This fixes CVE-2013-4579.
Signed-off-by: Mathy Vanhoef <vanhoefm@gmail.com> Signed-off-by: John W. Linville <linville@tuxdriver.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Use proper cpp defined(...) constructs to avoid this:
arch/ia64/kernel/machine_kexec.c: In function 'arch_crash_save_vmcoreinfo':
arch/ia64/kernel/machine_kexec.c:160:8: warning: "CONFIG_PGTABLE_4" is not defined
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
At some point, Measurement Computing / ComputerBoards redesigned the
PCI-DIO48H to use a PLX PCI interface chip instead of an AMCC chip.
This meant they had to put their hardware registers in the PCI BAR 2
region instead of PCI BAR 1. Unfortunately, they kept the same PCI
device ID for the new design. This means the driver recognizes the
newer cards, but doesn't work (and is likely to screw up the local
configuration registers of the PLX chip) because it's using the wrong
region.
Since all the supported boards have the DIO registers in the PCI BAR 2
region except for older PCI-DIO48H boards which have an empty PCI BAR 2
region and the DIO registers in PCI BAR 1, determine which PCI BAR
region to use based on whether the PCI BAR 2 region is empty or not.
This change makes the `dioregs_badrindex` member of `struct
pcidio_board` redundant. The `pcicontroler_badrindex` member is also
unused, so remove both members.
Signed-off-by: Ian Abbott <abbotti@mev.co.uk> Cc: kernel-team@lists.ubuntu.com Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
After a successful hugetlb page migration by soft offline, the source
page will either be freed into hugepage_freelists or buddy(over-commit
page). If page is in buddy, page_hstate(page) will be NULL. It will
hit a NULL pointer dereference in dequeue_hwpoisoned_huge_page().
Currently, we enable ARI in a device's upstream bridge if the bridge and
the device support it. But we never disable ARI, even if the device is
removed and replaced with a device that doesn't support ARI.
This means that if we hot-remove an ARI device and replace it with a
non-ARI multi-function device, we find only function 0 of the new device
because the upstream bridge still has ARI enabled, and next_ari_fn()
only returns function 0 for the new non-ARI device.
This patch disables ARI in the upstream bridge if the device doesn't
support ARI. See the PCIe spec, r3.0, sec 6.13.
[bhelgaas: changelog, function comment]
[yijing: replace PCIe Cap accessor with legacy PCI accessor] Signed-off-by: Yijing Wang <wangyijing@huawei.com> Signed-off-by: Jiang Liu <jiang.liu@huawei.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
The common cause appears to be lots of freeze and unfreeze cycles,
and the output from the warnings indicates that we are leaking
around 8 bytes of log space per freeze/unfreeze cycle.
When we freeze the filesystem, we write an unmount record and that
uses xlog_write directly - a special type of transaction,
effectively. What it doesn't do, however, is correctly account for
the log space it uses. The unmount record writes an 8 byte structure
with a special magic number into the log, and the space this
consumes is not accounted for in the log ticket tracking the
operation. Hence we leak 8 bytes every unmount record that is
written.
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ben Myers <bpm@sgi.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Bob Falken reported that after 4G packets, multicast forwarding stopped
working. This was because of a rule reference counter overflow which
freed the rule as soon as the overflow happend.
This patch solves this by adding the FIB_LOOKUP_NOREF flag to
fib_rules_lookup calls. This is safe even from non-rcu locked sections
as in this case the flag only implies not taking a reference to the rule,
which we don't need at all.
Rules only hold references to the namespace, which are guaranteed to be
available during the call of the non-rcu protected function reg_vif_xmit
because of the interface reference which itself holds a reference to
the net namespace.
Fixes: f0ad0860d01e47 ("ipv4: ipmr: support multiple tables") Fixes: d1db275dd3f6e4 ("ipv6: ip6mr: support multiple tables") Reported-by: Bob Falken <NetFestivalHaveFun@gmx.com> Cc: Patrick McHardy <kaber@trash.net> Cc: Thomas Graf <tgraf@suug.ch> Cc: Julian Anastasov <ja@ssi.bg> Cc: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Fix inet_diag_dump_icsk() to reflect the fact that both TIME_WAIT and
FIN_WAIT2 connections are represented by inet_timewait_sock (not just
TIME_WAIT). Thus:
(a) We need to iterate through the time_wait buckets if the user wants
either TIME_WAIT or FIN_WAIT2. (Before fixing this, "ss -nemoi state
fin-wait-2" would not return any sockets, even if there were some in
FIN_WAIT2.)
(b) We need to check tw_substate to see if the user wants to dump
sockets in the particular substate (TIME_WAIT or FIN_WAIT2) that a
given connection is in. (Before fixing this, "ss -nemoi state
time-wait" would actually return sockets in state FIN_WAIT2.)
An analogous fix is in v3.13: 70315d22d3c7383f9a508d0aab21e2eb35b2303a
("inet_diag: fix inet_diag_dump_icsk() to use correct state for
timewait sockets") but that patch is quite different because 3.13 code
is very different in this area due to the unification of TCP hash
tables in 05dbc7b ("tcp/dccp: remove twchain") in v3.13-rc1.
I tested that this applies cleanly between v3.3 and v3.12, and tested
that it works in both 3.3 and 3.12. It does not apply cleanly to 3.2
and earlier (though it makes semantic sense), and semantically is not
the right fix for 3.13 and beyond (as mentioned above).
Signed-off-by: Neal Cardwell <ncardwell@google.com> Cc: Eric Dumazet <edumazet@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
bnx2x triggers warnings with CONFIG_DMA_API_DEBUG=y:
WARNING: CPU: 0 PID: 2253 at lib/dma-debug.c:887 check_unmap+0xf8/0x920()
bnx2x 0000:28:00.0: DMA-API: device driver frees DMA memory with
different size [device address=0x00000000da2b389e] [map size=1490 bytes]
[unmap size=66 bytes]
The reason is that bnx2x splits a TSO BD into two BDs (headers + data)
using one DMA mapping for both, but it uses only the length of the first
BD when unmapping.
This patch fixes the bug by unmapping the whole length of the two BDs.
Signed-off-by: Michal Schmidt <mschmidt@redhat.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Acked-by: Dmitry Kravkov <dmitry@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
br_multicast_set_hash_max() is called from process context in
net/bridge/br_sysfs_br.c by the sysfs store_hash_max() function.
br_multicast_set_hash_max() calls spin_lock(&br->multicast_lock),
which can deadlock the CPU if a softirq that also tries to take the
same lock interrupts br_multicast_set_hash_max() while the lock is
held . This can happen quite easily when any of the bridge multicast
timers expire, which try to take the same lock.
The fix here is to use spin_lock_bh(), preventing other softirqs from
executing on this CPU.
Steps to reproduce:
1. Create a bridge with several interfaces (I used 4).
2. Set the "multicast query interval" to a low number, like 2.
3. Enable the bridge as a multicast querier.
4. Repeatedly set the bridge hash_max parameter via sysfs.
# while true ; do echo 4096 > /sys/class/net/br0/bridge/hash_max; done
Signed-off-by: Curt Brune <curt@cumulusnetworks.com> Signed-off-by: Scott Feldman <sfeldma@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
While commit 30a584d944fb fixes datagram interface in LLC, a use
after free bug has been introduced for SOCK_STREAM sockets that do
not make use of MSG_PEEK.
The flow is as follow ...
if (!(flags & MSG_PEEK)) {
...
sk_eat_skb(sk, skb, false);
...
}
...
if (used + offset < skb->len)
continue;
... where sk_eat_skb() calls __kfree_skb(). Therefore, cache
original length and work on skb_len to check partial reads.
Fixes: 30a584d944fb ("[LLX]: SOCK_DGRAM interface fixes") Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Cc: Stephen Hemminger <stephen@networkplumber.org> Cc: Arnaldo Carvalho de Melo <acme@ghostprotocols.net> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
When the vlan code detects that the real device can do TX VLAN offloads
in hardware, it tries to arrange for the real device's header_ops to
be invoked directly.
But it does so illegally, by simply hooking the real device's
header_ops up to the VLAN device.
This doesn't work because we will end up invoking a set of header_ops
routines which expect a device type which matches the real device, but
will see a VLAN device instead.
Fix this by providing a pass-thru set of header_ops which will arrange
to pass the proper real device instead.
To facilitate this add a dev_rebuild_header(). There are
implementations which provide a ->cache and ->create but not a
->rebuild (f.e. PLIP). So we need a helper function just like
dev_hard_header() to avoid crashes.
Use this helper in the one existing place where the
header_ops->rebuild was being invoked, the neighbour code.
With lots of help from Florian Westphal.
Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
recvmsg handler in net/rose/af_rose.c performs size-check ->msg_namelen.
After commit f3d3342602f8bcbf37d7c46641cb9bca7618eb1c
(net: rework recvmsg handler msg_name and msg_namelen logic), we now
always take the else branch due to namelen being initialized to 0.
Digging in netdev-vger-cvs git repo shows that msg_namelen was
initialized with a fixed-size since at least 1995, so the else branch
was never taken.
Compile tested only.
Signed-off-by: Florian Westphal <fw@strlen.de> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Signed-off-by: Sasha Levin <sasha.levin@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
The yam_ioctl() code fails to initialise the cmd field
of the struct yamdrv_ioctl_cfg. Add an explicit memset(0)
before filling the structure to avoid the 4-byte info leak.
Signed-off-by: Salva Peiró <speiro@ai2.upv.es> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
The local variable 'bi' comes from userspace. If userspace passed a
large number to 'bi.data.calibrate', there would be an integer overflow
in the following line:
s->hdlctx.calibrate = bi.data.calibrate * s->par.bitrate / 16;
Signed-off-by: Wenliang Fan <fanwlexca@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Jakub reported while working with nlmon netlink sniffer that parts of
the inet_diag_sockid are not initialized when r->idiag_family != AF_INET6.
That is, fields of r->id.idiag_src[1 ... 3], r->id.idiag_dst[1 ... 3].
In fact, it seems that we can leak 6 * sizeof(u32) byte of kernel [slab]
memory through this. At least, in udp_dump_one(), we allocate a skb in ...
... and then pass that to inet_sk_diag_fill() that puts the whole struct
inet_diag_msg into the skb, where we only fill out r->id.idiag_src[0],
r->id.idiag_dst[0] and leave the rest untouched:
struct inet_diag_msg embeds struct inet_diag_sockid that is correctly /
fully filled out in IPv6 case, but for IPv4 not.
So just zero them out by using plain memset (for this little amount of
bytes it's probably not worth the extra check for idiag_family == AF_INET).
Similarly, fix also other places where we fill that out.
Reported-by: Jakub Zawadzki <darkjames-ws@darkjames.pl> Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
This is similar to the set_peek_off patch where calling bind while the
socket is stuck in unix_dgram_recvmsg() will block and cause a hung task
spew after a while.
This is also the last place that did a straightforward mutex_lock(), so
there shouldn't be any more of these patches.
Signed-off-by: Sasha Levin <sasha.levin@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
The new tg3 driver leaves REG_BASE_ADDR (PCI config offset 120)
uninitialized. From power on reset this register may have garbage in it. The
Register Base Address register defines the device local address of a
register. The data pointed to by this location is read or written using
the Register Data register (PCI config offset 128). When REG_BASE_ADDR has
garbage any read or write of Register Data Register (PCI 128) will cause the
PCI bus to lock up. The TCO watchdog will fire and bring down the system.
Signed-off-by: Nat Gurumoorthy <natg@google.com> Acked-by: Michael Chan <mchan@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
maxattr in genl_family should be used to save the max attribute
type, but not the max command type. Drop monitor doesn't support
any attributes, so we should leave it as zero.
Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Brett Ciphery reported that new ipv6 addresses failed to get installed
because the addrconf generated dsts where counted against the dst gc
limit. We don't need to count those routes like we currently don't count
administratively added routes.
Because the max_addresses check enforces a limit on unbounded address
generation first in case someone plays with router advertisments, we
are still safe here.
Reported-by: Brett Ciphery <brett.ciphery@windriver.com> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
After congestion update on a local connection, when rds_ib_xmit returns
less bytes than that are there in the message, rds_send_xmit calls
back rds_ib_xmit with an offset that causes BUG_ON(off & RDS_FRAG_SIZE)
to trigger.
For a 4Kb PAGE_SIZE rds_ib_xmit returns min(8240,4096)=4096 when actually
the message contains 8240 bytes. rds_send_xmit thinks there is more to send
and calls rds_ib_xmit again with a data offset "off" of 4096-48(rds header)
=4048 bytes thus hitting the BUG_ON(off & RDS_FRAG_SIZE) [RDS_FRAG_SIZE=4k].
The commit 6094628bfd94323fc1cea05ec2c6affd98c18f7f
"rds: prevent BUG_ON triggering on congestion map updates" introduced
this regression. That change was addressing the triggering of a different
BUG_ON in rds_send_xmit() on PowerPC architecture with 64Kbytes PAGE_SIZE:
BUG_ON(ret != 0 &&
conn->c_xmit_sg == rm->data.op_nents);
This was the sequence it was going through:
(rds_ib_xmit)
/* Do not send cong updates to IB loopback */
if (conn->c_loopback
&& rm->m_inc.i_hdr.h_flags & RDS_FLAG_CONG_BITMAP) {
rds_cong_map_updated(conn->c_fcong, ~(u64) 0);
return sizeof(struct rds_header) + RDS_CONG_MAP_BYTES;
}
rds_ib_xmit returns 8240
rds_send_xmit:
c_xmit_data_off = 0 + 8240 - 48 (rds header accounted only the first time)
= 8192
c_xmit_data_off < 65536 (sg->length), so calls rds_ib_xmit again
rds_ib_xmit returns 8240
rds_send_xmit:
c_xmit_data_off = 8192 + 8240 = 16432, calls rds_ib_xmit again
and so on (c_xmit_data_off 24672,32912,41152,49392,57632)
rds_ib_xmit returns 8240
On this iteration this sequence causes the BUG_ON in rds_send_xmit:
while (ret) {
tmp = min_t(int, ret, sg->length - conn->c_xmit_data_off);
[tmp = 65536 - 57632 = 7904]
conn->c_xmit_data_off += tmp;
[c_xmit_data_off = 57632 + 7904 = 65536]
ret -= tmp;
[ret = 8240 - 7904 = 336]
if (conn->c_xmit_data_off == sg->length) {
conn->c_xmit_data_off = 0;
sg++;
conn->c_xmit_sg++;
BUG_ON(ret != 0 &&
conn->c_xmit_sg == rm->data.op_nents);
[c_xmit_sg = 1, rm->data.op_nents = 1]
What the current fix does:
Since the congestion update over loopback is not actually transmitted
as a message, all that rds_ib_xmit needs to do is let the caller think
the full message has been transmitted and not return partial bytes.
It will return 8240 (RDS_CONG_MAP_BYTES+48) when PAGE_SIZE is 4Kb.
And 64Kb+48 when page size is 64Kb.
Reported-by: Josh Hunt <joshhunt00@gmail.com> Tested-by: Honggang Li <honli@redhat.com> Acked-by: Bang Nguyen <bang.nguyen@oracle.com> Signed-off-by: Venkat Venkatsubra <venkat.x.venkatsubra@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Few network drivers really supports frag_list : virtual drivers.
Some drivers wrongly advertise NETIF_F_FRAGLIST feature.
If skb with a frag_list is given to them, packet on the wire will be
corrupt.
Remove this flag, as core networking stack will make sure to
provide packets that can be sent without corruption.
Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Thadeu Lima de Souza Cascardo <cascardo@linux.vnet.ibm.com> Cc: Anirudha Sarangi <anirudh@xilinx.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Current MMC driver doesn't handle generic error (bit19 of device
status) in write sequence. As a result, write data gets lost when
generic error occurs. For example, a generic error when updating a
filesystem management information causes a loss of write data and
corrupts the filesystem. In the worst case, the system will never
boot.
This patch includes the following functionality:
1. To enable error checking for the response of CMD12 and CMD13
in write command sequence
2. To retry write sequence when a generic error occurs
Messages are added for v2 to show what occurs.
[Backported to 3.4-stable]
Signed-off-by: KOBAYASHI Yoshitake <yoshitake.kobayashi@toshiba.co.jp> Signed-off-by: Chris Ball <cjb@laptop.org> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Commit 8c4f3c3fa9681 "ftrace: Check module functions being traced on reload"
fixed module loading and unloading with respect to function tracing, but
it missed the function graph tracer. If you perform the following
The above mentioned commit didn't go far enough. Well, it covered the
function tracer by adding checks in __register_ftrace_function(). The
problem is that the function graph tracer circumvents that (for a slight
efficiency gain when function graph trace is running with a function
tracer. The gain was not worth this).
The problem came with ftrace_startup() which should always be called after
__register_ftrace_function(), if you want this bug to be completely fixed.
Anyway, this solution moves __register_ftrace_function() inside of
ftrace_startup() and removes the need to call them both.
Reported-by: Dave Wysochanski <dwysocha@redhat.com> Fixes: ed926f9b35cd ("ftrace: Use counters to enable functions to trace") Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
It was finally narrowed down to unloading a module that was being traced.
It was actually more than that. When functions are being traced, there's
a table of all functions that have a ref count of the number of active
tracers attached to that function. When a function trace callback is
registered to a function, the function's record ref count is incremented.
When it is unregistered, the function's record ref count is decremented.
If an inconsistency is detected (ref count goes below zero) the above
warning is shown and the function tracing is permanently disabled until
reboot.
The ftrace callback ops holds a hash of functions that it filters on
(and/or filters off). If the hash is empty, the default means to filter
all functions (for the filter_hash) or to disable no functions (for the
notrace_hash).
When a module is unloaded, it frees the function records that represent
the module functions. These records exist on their own pages, that is
function records for one module will not exist on the same page as
function records for other modules or even the core kernel.
Now when a module unloads, the records that represents its functions are
freed. When the module is loaded again, the records are recreated with
a default ref count of zero (unless there's a callback that traces all
functions, then they will also be traced, and the ref count will be
incremented).
The problem is that if an ftrace callback hash includes functions of the
module being unloaded, those hash entries will not be removed. If the
module is reloaded in the same location, the hash entries still point
to the functions of the module but the module's ref counts do not reflect
that.
With the help of Steve and Joern, we found a reproducer:
Using uinput module and uinput_release function.
cd /sys/kernel/debug/tracing
modprobe uinput
echo uinput_release > set_ftrace_filter
echo function > current_tracer
rmmod uinput
modprobe uinput
# check /proc/modules to see if loaded in same addr, otherwise try again
echo nop > current_tracer
[BOOM]
The above loads the uinput module, which creates a table of functions that
can be traced within the module.
We add uinput_release to the filter_hash to trace just that function.
Enable function tracincg, which increments the ref count of the record
associated to uinput_release.
Remove uinput, which frees the records including the one that represents
uinput_release.
Load the uinput module again (and make sure it's at the same address).
This recreates the function records all with a ref count of zero,
including uinput_release.
Disable function tracing, which will decrement the ref count for uinput_release
which is now zero because of the module removal and reload, and we have
a mismatch (below zero ref count).
The solution is to check all currently tracing ftrace callbacks to see if any
are tracing any of the module's functions when a module is loaded (it already does
that with callbacks that trace all functions). If a callback happens to have
a module function being traced, it increments that records ref count and starts
tracing that function.
There may be a strange side effect with this, where tracing module functions
on unload and then reloading a new module may have that new module's functions
being traced. This may be something that confuses the user, but it's not
a big deal. Another approach is to disable all callback hashes on module unload,
but this leaves some ftrace callbacks that may not be registered, but can
still have hashes tracing the module's function where ftrace doesn't know about
it. That situation can cause the same bug. This solution solves that case too.
Another benefit of this solution, is it is possible to trace a module's
function on unload and load.
Link: http://lkml.kernel.org/r/20130705142629.GA325@redhat.com Reported-by: Jörn Engel <joern@logfs.org> Reported-by: Dave Jones <davej@redhat.com> Reported-by: Steve Hodgson <steve@purestorage.com> Tested-by: Steve Hodgson <steve@purestorage.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
[bwh: Backported to 3.2: adjust context, indentation] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
There are two types of hashes in the ftrace_ops; one type
is the filter_hash and the other is the notrace_hash. Either
one may be null, meaning it has no elements. But when elements
are added, the hash is allocated.
Throughout the code, a check needs to be made to see if a hash
exists or the hash has elements, but the check if the hash exists
is usually missing causing the possible "NULL pointer dereference bug".
Add a helper routine called "ftrace_hash_empty()" that returns
true if the hash doesn't exist or its count is zero. As they mean
the same thing.
Last-bug-reported-by: Jiri Olsa <jolsa@redhat.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
When disabling the "notrace" records, that means we want to trace them.
If the notrace_hash is zero, it means that we want to trace all
records. But to disable a zero notrace_hash means nothing.
The check for the notrace_hash count was incorrect with:
if (hash && !hash->count)
return
With the correct comment above it that states that we do nothing
if the notrace_hash has zero count. But !hash also means that
the notrace hash has zero count. I think this was done to
protect against dereferencing NULL. But if !hash is true, then
we go through the following loop without doing a single thing.
Fix it to:
if (!hash || !hash->count)
return;
Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
We don't validate iph->ihl which may lead a dead loop if we meet a IPIP
skb whose iph->ihl is zero. Fix this by failing immediately when iph->ihl
is evil (less than 5).
Cc: Eric Dumazet <edumazet@google.com> Cc: Petr Matousek <pmatouse@redhat.com> Cc: Michael S. Tsirkin <mst@redhat.com> Cc: Daniel Borkmann <dborkman@redhat.com> Signed-off-by: Jason Wang <jasowang@redhat.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
[bwh: Backported to 3.2: the affected code is in __skb_get_rxhash()] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
It appears that driver runs into a problem here if fibsize is too small
because we allocate user_srbcmd with fibsize size only but later we
access it until user_srbcmd->sg.count to copy it over to srbcmd.
It is not correct to test (fibsize < sizeof(*user_srbcmd)) because this
structure already includes one sg element and this is not needed for
commands without data. So, we would recommend to add the following
(instead of test for fibsize == 0).
If we do a zero size allocation then it will oops. Also we can't be
sure the user passes us a NUL terminated string so I've added a
terminator.
This code can only be triggered by root.
Reported-by: Nico Golde <nico@ngolde.de> Reported-by: Fabian Yamaguchi <fabs@goesec.de> Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Acked-by: Dan Williams <dcbw@redhat.com> Signed-off-by: John W. Linville <linville@tuxdriver.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
The {get,put}_user macros don't perform range checking on the provided
__user address when !CPU_HAS_DOMAINS.
This patch reworks the out-of-line assembly accessors to check the user
address against a specified limit, returning -EFAULT if is is out of
range.
[will: changed get_user register allocation to match put_user]
[rmk: fixed building on older ARM architectures]
Reported-by: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Will Deacon <will.deacon@arm.com> Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
[bwh: Backported to 3.2: TUSER() was called T()] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
The iommu integration into memory slots expects memory slots to be
added or removed and doesn't handle the move case. We can unmap
slots from the iommu after we mark them invalid and map them before
installing the final memslot array. Also re-order the kmemdup vs
map so we don't leave iommu mappings if we get ENOMEM.
Reviewed-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Alex Williamson <alex.williamson@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
[bwh: Backported to 3.2: adjust context] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Stephan Mueller reported to me recently a error in random number generation in
the ansi cprng. If several small requests are made that are less than the
instances block size, the remainder for loop code doesn't increment
rand_data_valid in the last iteration, meaning that the last bytes in the
rand_data buffer gets reused on the subsequent smaller-than-a-block request for
random data.
The fix is pretty easy, just re-code the for loop to make sure that
rand_data_valid gets incremented appropriately
Signed-off-by: Neil Horman <nhorman@tuxdriver.com> Reported-by: Stephan Mueller <stephan.mueller@atsec.com> CC: Stephan Mueller <stephan.mueller@atsec.com> CC: Petr Matousek <pmatouse@redhat.com> CC: Herbert Xu <herbert@gondor.apana.org.au> CC: "David S. Miller" <davem@davemloft.net> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
When working on report indexes, always validate that they are in bounds.
Without this, a HID device could report a malicious feature report that
could trick the driver into a heap overflow:
[ 634.885003] usb 1-1: New USB device found, idVendor=0596, idProduct=0500
...
[ 676.469629] BUG kmalloc-192 (Tainted: G W ): Redzone overwritten
Note that we need to change the indexes from s8 to s16 as they can
be between -1 and 255.
CVE-2013-2897
Signed-off-by: Benjamin Tissoires <benjamin.tissoires@redhat.com> Acked-by: Kees Cook <keescook@chromium.org> Signed-off-by: Jiri Kosina <jkosina@suse.cz>
[bwh: Backported to 3.2: mt_device::{cc,cc_value,inputmode}_index do not
exist and the corresponding indices do not need to be validated.
mt_device::maxcontact_report_id does not exist either. So all we need
to do is to widen mt_device::inputmode.] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
commit a553e4a6317b2cfc7659542c10fe43184ffe53da ("[PKTGEN]: IPSEC support")
tried to support IPsec ESP transport transformation for pktgen, but acctually
this doesn't work at all for two reasons(The orignal transformed packet has
bad IPv4 checksum value, as well as wrong auth value, reported by wireshark)
- After transpormation, IPv4 header total length needs update,
because encrypted payload's length is NOT same as that of plain text.
- After transformation, IPv4 checksum needs re-caculate because of payload
has been changed.
With this patch, armmed pktgen with below cofiguration, Wireshark is able to
decrypted ESP packet generated by pktgen without any IPv4 checksum error or
auth value error.
pgset "flag IPSEC"
pgset "flows 1"
Signed-off-by: Fan Du <fan.du@windriver.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
IPv6 stats are 64 bits and thus are protected with a seqlock. By not
disabling bottom-half we could deadlock here if we don't disable bh and
a softirq reentrantly updates the same mib.
Cc: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Currently we're using plain spin_lock() in prb_shutdown_retire_blk_timer(),
however the timer might fire right in the middle and thus try to re-aquire
the same spinlock, leaving us in a endless loop.
To fix that, use the spin_lock_bh() to block it.
Fixes: f6fb8f100b80 ("af-packet: TPACKET_V3 flexible buffer implementation.") CC: "David S. Miller" <davem@davemloft.net> CC: Daniel Borkmann <dborkman@redhat.com> CC: Willem de Bruijn <willemb@google.com> CC: Phil Sutter <phil@nwl.cc> CC: Eric Dumazet <edumazet@google.com> Reported-by: Jan Stancek <jstancek@redhat.com> Tested-by: Jan Stancek <jstancek@redhat.com> Signed-off-by: Veaceslav Falico <vfalico@redhat.com> Acked-by: Daniel Borkmann <dborkman@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Salam reported a use after free bug in PF_PACKET that occurs when
we're sending out frames on a socket bound device and suddenly the
net device is being unregistered. It appears that commit 827d9780
introduced a possible race condition between {t,}packet_snd() and
packet_notifier(). In the case of a bound socket, packet_notifier()
can drop the last reference to the net_device and {t,}packet_snd()
might end up suddenly sending a packet over a freed net_device.
To avoid reverting 827d9780 and thus introducing a performance
regression compared to the current state of things, we decided to
hold a cached RCU protected pointer to the net device and maintain
it on write side via bind spin_lock protected register_prot_hook()
and __unregister_prot_hook() calls.
In {t,}packet_snd() path, we access this pointer under rcu_read_lock
through packet_cached_dev_get() that holds reference to the device
to prevent it from being freed through packet_notifier() while
we're in send path. This is okay to do as dev_put()/dev_hold() are
per-cpu counters, so this should not be a performance issue. Also,
the code simplifies a bit as we don't need need_rls_dev anymore.
Fixes: 827d978037d7 ("af-packet: Use existing netdev reference for bound sockets.") Reported-by: Salam Noureddine <noureddine@aristanetworks.com> Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Signed-off-by: Salam Noureddine <noureddine@aristanetworks.com> Cc: Ben Greear <greearb@candelatech.com> Cc: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
--------------------------- cut here -------------------------------
The reason is that when the bridge dev's address is changed, the
br_fdb_change_mac_address() will add new address in fdb, but when
the bridge was removed, the address entry in the fdb did not free,
the bridge_fdb_cache still has objects when destroy the cache, Fix
this by flushing the bridge address entry when removing the bridge.
v2: according to the Toshiaki Makita and Vlad's suggestion, I only
delete the vlan0 entry, it still have a leak here if the vlan id
is other number, so I need to call fdb_delete_by_port(br, NULL, 1)
to flush all entries whose dst is NULL for the bridge.
Suggested-by: Toshiaki Makita <toshiaki.makita1@gmail.com> Suggested-by: Vlad Yasevich <vyasevich@gmail.com> Signed-off-by: Ding Tianhong <dingtianhong@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
tried to fix a problem with VLAN devices and promiscuouse flag setting.
The issue was that VLAN device was setting a flag on an interface that
was down, thus resulting in bad promiscuity count.
This commit blocked flag propagation to any device that is currently
down.
fixed VLAN code to only propagate flags when the VLAN interface is up,
thus fixing the same issue as above, only localized to VLAN.
The problem we have now is that if we have create a complex stack
involving multiple software devices like bridges, bonds, and vlans,
then it is possible that the flags would not propagate properly to
the physical devices. A simple examle of the scenario is the
following:
eth0----> bond0 ----> bridge0 ---> vlan50
If bond0 or eth0 happen to be down at the time bond0 is added to
the bridge, then eth0 will never have promisc mode set which is
currently required for operation as part of the bridge. As a
result, packets with vlan50 will be dropped by the interface.
The only 2 devices that implement the special flag handling are
VLAN and DSA and they both have required code to prevent incorrect
flag propagation. As a result we can remove the generic solution
introduced in b6c40d68ff6498b7f63ddf97cf0aa818d748dee7 and leave
it to the individual devices to decide whether they will block
flag propagation or not.
Reported-by: Stefan Priebe <s.priebe@profihost.ag> Suggested-by: Veaceslav Falico <vfalico@redhat.com> Signed-off-by: Vlad Yasevich <vyasevic@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
init_card() calls dev_get_by_name() to get a network deceive. But it
doesn't decrease network device reference count after the device is
used.
Signed-off-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Offenders don't have port numbers, so set it to 0.
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
If kmsg->msg_namelen > sizeof(struct sockaddr_storage) then in the
original code that would lead to memory corruption in the kernel if you
had audit configured. If you didn't have audit configured it was
harmless.
There are some programs such as beta versions of Ruby which use too
large of a buffer and returning an error code breaks them. We should
clamp the ->msg_namelen value instead.
Fixes: 1661bf364ae9 ("net: heap overflow in __audit_sockaddr()") Reported-by: Eric Wong <normalperson@yhbt.net> Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Tested-by: Eric Wong <normalperson@yhbt.net> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Commit bceaa90240b6019ed73b49965eac7d167610be69 ("inet: prevent leakage
of uninitialized memory to user in recv syscalls") conditionally updated
addr_len if the msg_name is written to. The recv_error and rxpmtu
functions relied on the recvmsg functions to set up addr_len before.
As this does not happen any more we have to pass addr_len to those
functions as well and set it to the size of the corresponding sockaddr
length.
This broke traceroute and such.
Fixes: bceaa90240b6 ("inet: prevent leakage of uninitialized memory to user in recv syscalls") Reported-by: Brad Spengler <spender@grsecurity.net> Reported-by: Tom Labanowski Cc: mpb <mpb.mail@gmail.com> Cc: David S. Miller <davem@davemloft.net> Cc: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
In that case it is probable that kernel code overwrote part of the
stack. So we should bail out loudly here.
The BUG_ON may be removed in future if we are sure all protocols are
conformant.
Suggested-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
This patch now always passes msg->msg_namelen as 0. recvmsg handlers must
set msg_namelen to the proper size <= sizeof(struct sockaddr_storage)
to return msg_name to the user.
This prevents numerous uninitialized memory leaks we had in the
recvmsg handlers and makes it harder for new code to accidentally leak
uninitialized memory.
Optimize for the case recvfrom is called with NULL as address. We don't
need to copy the address at all, so set it to NULL before invoking the
recvmsg handler. We can do so, because all the recvmsg handlers must
cope with the case a plain read() is called on them. read() also sets
msg_name to NULL.
Also document these changes in include/linux/net.h as suggested by David
Miller.
Changes since RFC:
Set msg->msg_name = NULL if user specified a NULL in msg_name but had a
non-null msg_namelen in verify_iovec/verify_compat_iovec. This doesn't
affect sendto as it would bail out earlier while trying to copy-in the
address. It also more naturally reflects the logic by the callers of
verify_iovec.
With this change in place I could remove "
if (!uaddr || msg_sys->msg_namelen == 0)
msg->msg_name = NULL
".
This change does not alter the user visible error logic as we ignore
msg_namelen as long as msg_name is NULL.
Also remove two unnecessary curly brackets in ___sys_recvmsg and change
comments to netdev style.
Cc: David Miller <davem@davemloft.net> Suggested-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Only update *addr_len when we actually fill in sockaddr, otherwise we
can return uninitialized memory from the stack to the caller in the
recvfrom, recvmmsg and recvmsg syscalls. Drop the the (addr_len == NULL)
checks because we only get called with a valid addr_len pointer either
from sock_common_recvmsg or inet_recvmsg.
If a blocking read waits on a socket which is concurrently shut down we
now return zero and set msg_msgnamelen to 0.
Reported-by: mpb <mpb.mail@gmail.com> Suggested-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
ip4_datagram_connect() being called from process context,
it should use IP_INC_STATS() instead of IP_INC_STATS_BH()
otherwise we can deadlock on 32bit arches, or get corruptions of
SNMP counters.
Fixes: 584bdf8cbdf6 ("[IPV4]: Fix "ipOutNoRoutes" counter error for TCP and UDP") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Dave Jones <davej@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
In af3e095a1fb4, Erik Jacobsen fixed one type of unaligned access
bug for ia64 by converting a 64-bit write to use put_unaligned().
Unfortunately, since gcc will convert a short memset() to a series
of appropriately-aligned stores, the problem is now visible again
on tilegx, where the memset that zeros out proc_event is converted
to three 64-bit stores, causing an unaligned access panic.
A better fix for the original problem is to ensure that proc_event
is aligned to 8 bytes here. We can do that relatively easily by
arranging to start the struct cn_msg aligned to 8 bytes and then
offset by 4 bytes. Doing so means that the immediately following
proc_event structure is then correctly aligned to 8 bytes.
The result is that the memset() stores are now aligned, and as an
added benefit, we can remove the put_unaligned() calls in the code.
Signed-off-by: Chris Metcalf <cmetcalf@tilera.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
These strings come from a copy_from_user() and there is no way to be
sure they are NUL terminated.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
This patch fixes two race conditions between bond_store_updelay/downdelay
and bond_store_miimon which could lead to division by zero as miimon can
be set to 0 while either updelay/downdelay are being set and thus miss the
zero check in the beginning, the zero div happens because updelay/downdelay
are stored as new_value / bond->params.miimon. Use rtnl to synchronize with
miimon setting.
CC: Jay Vosburgh <fubar@us.ibm.com> CC: Andy Gospodarek <andy@greyhouse.net> CC: Veaceslav Falico <vfalico@redhat.com> Signed-off-by: Nikolay Aleksandrov <nikolay@redhat.com> Acked-by: Veaceslav Falico <vfalico@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
If priority/traffic class field in IPv6 header is set (seen when
using ssh), the uncompression sets the TC and Flow fields incorrectly.
Example:
This is IPv6 header of a sent packet. Note the priority/TC (=1) in
the first byte.
00000000: 61 00 00 00 00 2c 06 40 fe 80 00 00 00 00 00 00 00000010: 02 02 72 ff fe c6 42 10 fe 80 00 00 00 00 00 00 00000020: 02 1e ab ff fe 4c 52 57
This gets compressed like this in the sending side
00000000: 72 31 04 06 02 1e ab ff fe 4c 52 57 ec c2 00 16 00000010: aa 2d fe 92 86 4e be c6 ....
In the receiving end, the packet gets uncompressed to this
IPv6 header
00000000: 60 06 06 02 00 2a 1e 40 fe 80 00 00 00 00 00 00 00000010: 02 02 72 ff fe c6 42 10 fe 80 00 00 00 00 00 00 00000020: ab ff fe 4c 52 57 ec c2
First four bytes are set incorrectly and we have also lost
two bytes from destination address.
The fix is to switch the case values in switch statement
when checking the TC field.
Signed-off-by: Jukka Rissanen <jukka.rissanen@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Currently the ARP monitoring is not supported with 802.3ad, and it's
prohibited to use it via the module params.
However we still can set it afterwards via sysfs, cause we only check for
*LB modes there.
To fix this - add a check for 802.3ad mode in bonding_store_arp_interval.
CC: Jay Vosburgh <fubar@us.ibm.com> CC: Andy Gospodarek <andy@greyhouse.net> Signed-off-by: Veaceslav Falico <vfalico@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
For properly initialising the Tausworthe generator [1], we have
a strict seeding requirement, that is, s1 > 1, s2 > 7, s3 > 15.
Commit 697f8d0348 ("random32: seeding improvement") introduced
a __seed() function that imposes boundary checks proposed by the
errata paper [2] to properly ensure above conditions.
However, we're off by one, as the function is implemented as:
"return (x < m) ? x + m : x;", and called with __seed(X, 1),
__seed(X, 7), __seed(X, 15). Thus, an unwanted seed of 1, 7, 15
would be possible, whereas the lower boundary should actually
be of at least 2, 8, 16, just as GSL does. Fix this, as otherwise
an initialization with an unwanted seed could have the effect
that Tausworthe's PRNG properties cannot not be ensured.
Note that this PRNG is *not* used for cryptography in the kernel.
Fixes: 697f8d0348a6 ("random32: seeding improvement") Cc: Stephen Hemminger <stephen@networkplumber.org> Cc: Florian Weimer <fweimer@redhat.com> Cc: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
As the rfc 4191 said, the Router Preference and Lifetime values in a
::/0 Route Information Option should override the preference and lifetime
values in the Router Advertisement header. But when the kernel deals with
a ::/0 Route Information Option, the rt6_get_route_info() always return
NULL, that means that overriding will not happen, because those default
routers were added without flag RTF_ROUTEINFO in rt6_add_dflt_router().
In order to deal with that condition, we should call rt6_get_dflt_router
when the prefix length is 0.
Signed-off-by: Duan Jiong <duanj.fnst@cn.fujitsu.com> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
When trying to delete a table >= 256 using iproute2 the local table
will be deleted.
The table id is specified as a netlink attribute when it needs more then
8 bits and iproute2 then sets the table field to RT_TABLE_UNSPEC (0).
Preconditions to matching the table id in the rule delete code
doesn't seem to take the "table id in netlink attribute" into condition
so the frh_get_table helper function never gets to do its job when
matching against current rule.
Use the helper function twice instead of peaking at the table value directly.
Originally reported at: http://bugs.debian.org/724783
Reported-by: Nicolas HICHER <nhicher@avencall.com> Signed-off-by: Andreas Henriksson <andreas@fatal.se> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
arch/um/os-Linux/start_up.c: In function 'check_coredump_limit':
arch/um/os-Linux/start_up.c:338:16: error: storage size of 'lim' isn't known
arch/um/os-Linux/start_up.c:339:2: error: implicit declaration of function 'getrlimit' [-Werror=implicit-function-declaration]
Signed-off-by: Sergei Trofimovich <slyfox@gentoo.org> CC: Jeff Dike <jdike@addtoit.com> CC: Richard Weinberger <richard@nod.at> CC: Al Viro <viro@zeniv.linux.org.uk> CC: user-mode-linux-devel@lists.sourceforge.net CC: user-mode-linux-user@lists.sourceforge.net CC: linux-kernel@vger.kernel.org Signed-off-by: Richard Weinberger <richard@nod.at> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
throttle_cfs_rq() doesn't check to make sure that period_timer is running,
and while update_curr/assign_cfs_runtime does, a concurrently running
period_timer on another cpu could cancel itself between this cpu's
update_curr and throttle_cfs_rq(). If there are no other cfs_rqs running
in the tg to restart the timer, this causes the cfs_rq to be stranded
forever.
Fix this by calling __start_cfs_bandwidth() in throttle if the timer is
inactive.
(Also add some sched_debug lines for cfs_bandwidth.)
Tested: make a run/sleep task in a cgroup, loop switching the cgroup
between 1ms/100ms quota and unlimited, checking for timer_active=0 and
throttled=1 as a failure. With the throttle_cfs_rq() change commented out
this fails, with the full patch it passes.
In selinux_ip_postroute() we perform access checks based on the
packet's security label. For locally generated traffic we get the
packet's security label from the associated socket; this works in all
cases except for TCP SYN-ACK packets. In the case of SYN-ACK packet's
the correct security label is stored in the connection's request_sock,
not the server's socket. Unfortunately, at the point in time when
selinux_ip_postroute() is called we can't query the request_sock
directly, we need to recreate the label using the same logic that
originally labeled the associated request_sock.
See the inline comments for more explanation.
Reported-by: Janak Desai <Janak.Desai@gtri.gatech.edu> Tested-by: Janak Desai <Janak.Desai@gtri.gatech.edu> Signed-off-by: Paul Moore <pmoore@redhat.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
In selinux_ip_output() we always label packets based on the parent
socket. While this approach works in almost all cases, it doesn't
work in the case of TCP SYN-ACK packets when the correct label is not
the label of the parent socket, but rather the label of the larval
socket represented by the request_sock struct.
Unfortunately, since the request_sock isn't queued on the parent
socket until *after* the SYN-ACK packet is sent, we can't lookup the
request_sock to determine the correct label for the packet; at this
point in time the best we can do is simply pass/NF_ACCEPT the packet.
It must be said that simply passing the packet without any explicit
labeling action, while far from ideal, is not terrible as the SYN-ACK
packet will inherit any IP option based labeling from the initial
connection request so the label *should* be correct and all our
access controls remain in place so we shouldn't have to worry about
information leaks.
Reported-by: Janak Desai <Janak.Desai@gtri.gatech.edu> Tested-by: Janak Desai <Janak.Desai@gtri.gatech.edu> Signed-off-by: Paul Moore <pmoore@redhat.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Under guest controllable circumstances apic_get_tmcct will execute a
divide by zero and cause a crash. If the guest cpuid support
tsc deadline timers and performs the following sequence of requests
the host will crash.
- Set the mode to periodic
- Set the TMICT to 0
- Set the mode bits to 11 (neither periodic, nor one shot, nor tsc deadline)
- Set the TMICT to non-zero.
Then the lapic_timer.period will be 0, but the TMICT will not be. If the
guest then reads from the TMCCT then the host will perform a divide by 0.
This patch ensures that if the lapic_timer.period is 0, then the division
does not occur.
Reported-by: Andrew Honig <ahonig@google.com> Signed-off-by: Andrew Honig <ahonig@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
[bwh: Backported to 3.2: s/kvm_apic_get_reg/apic_get_reg/] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
In multiple functions the vcpu_id is used as an offset into a bitfield. Ag
malicious user could specify a vcpu_id greater than 255 in order to set or
clear bits in kernel memory. This could be used to elevate priveges in the
kernel. This patch verifies that the vcpu_id provided is less than 255.
The api documentation already specifies that the vcpu_id must be less than
max_vcpus, but this is currently not checked.
Reported-by: Andrew Honig <ahonig@google.com> Signed-off-by: Andrew Honig <ahonig@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
The hugepage code had the exact same bug that regular pages had in
commit 7485d0d3758e ("futexes: Remove rw parameter from
get_futex_key()").
The regular page case was fixed by commit 9ea71503a8ed ("futex: Fix
regression with read only mappings"), but the transparent hugepage case
(added in a5b338f2b0b1: "thp: update futex compound knowledge") case
remained broken.
Found by Dave Jones and his trinity tool.
Reported-and-tested-by: Dave Jones <davej@fedoraproject.org> Acked-by: Thomas Gleixner <tglx@linutronix.de> Cc: Mel Gorman <mgorman@suse.de> Cc: Darren Hart <dvhart@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
The "rpm * div" operations can overflow here, so this patch adds an
upper limit to rpm to prevent that. Jean Delvare helped me with this
patch.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Acked-by: Roger Lucas <vt8231@hiddenengine.co.uk> Signed-off-by: Jean Delvare <khali@linux-fr.org>
[bwh: Backported to 3.2: adjust context] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
The W83L786NG stores the fan speed on 4 bits while the sysfs interface
uses a 0-255 range. Thus the driver should scale the user input down
to map it to the device range, and scale up the value read from the
device before presenting it to the user. The reserved register nibble
should be left unchanged.
Signed-off-by: Jean Delvare <khali@linux-fr.org> Reviewed-by: Guenter Roeck <linux@roeck-us.net>
[bwh: Backported to 3.2: adjust context] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Erratum 71 of PXA270M Processor Family Specification Update
(April 19, 2010) explains that watchdog reset time is just
8us insead of 10ms in EMTS.
If SDRAM is not reset, it causes memory bus congestion and
the device hangs. We put SDRAM in selfresh mode before watchdog
reset, removing potential freezes.
Without this patch PXA270-based ICP DAS LP-8x4x hangs after up to 40
reboots. With this patch it has successfully rebooted 500 times.
Signed-off-by: Sergei Ianovich <ynvich@gmail.com> Tested-by: Marek Vasut <marex@denx.de> Signed-off-by: Haojian Zhuang <haojian.zhuang@gmail.com> Signed-off-by: Olof Johansson <olof@lixom.net>
[bwh: Backported to 3.2: adjust context] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
When converting from tosa-keyboard driver to matrix keyboard, tosa keys
received extra 1 column shift. Replace that with correct values to make
keyboard work again.
Fixes: f69a6548c9d5 ('[ARM] pxa/tosa: make use of the matrix keypad driver') Signed-off-by: Dmitry Eremin-Solenikov <dbaryshkov@gmail.com> Signed-off-by: Haojian Zhuang <haojian.zhuang@gmail.com> Signed-off-by: Olof Johansson <olof@lixom.net> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Some module parameters in dm-bufio are read-only. These parameters
inform the user about memory consumption. They are not supposed to be
changed by the user.
However, despite being read-only, these parameters can be set on
modprobe or insmod command line, for example:
modprobe dm-bufio current_allocated_bytes=12345
The kernel doesn't expect that these variables can be non-zero at module
initialization and if the user sets them, it results in BUG.
This patch initializes the variables in the module init routine, so that
user-supplied values are ignored.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
UEFI time services are often broken once we're in virtual mode. We were
already refusing to use them on 64-bit systems, but it turns out that
they're also broken on some 32-bit firmware, including the Dell Venue.
Disable them for now, we can revisit once we have the 1:1 mappings code
incorporated.
Signed-off-by: Matthew Garrett <matthew.garrett@nebula.com> Link: http://lkml.kernel.org/r/1385754283-2464-1-git-send-email-matthew.garrett@nebula.com Cc: Matt Fleming <matt.fleming@intel.com> Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
[bwh: Backported to 3.2: deleted code is slightly different] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
When compiling with icc, <linux/compiler-gcc.h> ends up included
because the icc environment defines __GNUC__. Thus, we neither need
nor want to have this macro defined in both compiler-gcc.h and
compiler-intel.h, and the fact that they are inconsistent just makes
the compiler spew warnings.
Reported-by: Sunil K. Pandey <sunil.k.pandey@intel.com> Cc: Kevin B. Smith <kevin.b.smith@intel.com> Signed-off-by: H. Peter Anvin <hpa@linux.intel.com> Link: http://lkml.kernel.org/n/tip-0mbwou1zt7pafij09b897lg3@git.kernel.org
[bwh: Backported to 3.2: adjust context] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>