]> git.ipfire.org Git - thirdparty/kernel/stable.git/log
thirdparty/kernel/stable.git
10 years agogen_stats.c: Duplicate xstats buffer for later use
Ignacy Gawędzki [Fri, 13 Feb 2015 22:47:05 +0000 (14:47 -0800)] 
gen_stats.c: Duplicate xstats buffer for later use

commit 1c4cff0cf55011792125b6041bc4e9713e46240f upstream.

The gnet_stats_copy_app() function gets called, more often than not, with its
second argument a pointer to an automatic variable in the caller's stack.
Therefore, to avoid copying garbage afterwards when calling
gnet_stats_finish_copy(), this data is better copied to a dynamically allocated
memory that gets freed after use.

[xiyou.wangcong@gmail.com: remove a useless kfree()]

Signed-off-by: Ignacy Gawędzki <ignacy.gawedzki@green-communications.fr>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
10 years agortnetlink: call ->dellink on failure when ->newlink exists
WANG Cong [Fri, 13 Feb 2015 21:56:53 +0000 (13:56 -0800)] 
rtnetlink: call ->dellink on failure when ->newlink exists

commit 7afb8886a05be68e376655539a064ec672de8a8e upstream.

Ignacy reported that when eth0 is down and add a vlan device
on top of it like:

  ip link add link eth0 name eth0.1 up type vlan id 1

We will get a refcount leak:

  unregister_netdevice: waiting for eth0.1 to become free. Usage count = 2

The problem is when rtnl_configure_link() fails in rtnl_newlink(),
we simply call unregister_device(), but for stacked device like vlan,
we almost do nothing when we unregister the upper device, more work
is done when we unregister the lower device, so call its ->dellink().

Reported-by: Ignacy Gawedzki <ignacy.gawedzki@green-communications.fr>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
[ luis: backported to 3.16: based on davem's backport to 3.14 ]
Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
10 years agoipv6: fix ipv6_cow_metrics for non DST_HOST case
Martin KaFai Lau [Fri, 13 Feb 2015 00:14:08 +0000 (16:14 -0800)] 
ipv6: fix ipv6_cow_metrics for non DST_HOST case

commit 3b4711757d7903ab6fa88a9e7ab8901b8227da60 upstream.

ipv6_cow_metrics() currently assumes only DST_HOST routes require
dynamic metrics allocation from inetpeer.  The assumption breaks
when ndisc discovered router with RTAX_MTU and RTAX_HOPLIMIT metric.
Refer to ndisc_router_discovery() in ndisc.c and note that dst_metric_set()
is called after the route is created.

This patch creates the metrics array (by calling dst_cow_metrics_generic) in
ipv6_cow_metrics().

Test:
radvd.conf:
interface qemubr0
{
AdvLinkMTU 1300;
AdvCurHopLimit 30;

prefix fd00:face:face:face::/64
{
AdvOnLink on;
AdvAutonomous on;
AdvRouterAddr off;
};
};

Before:
[root@qemu1 ~]# ip -6 r show | egrep -v unreachable
fd00:face:face:face::/64 dev eth0  proto kernel  metric 256  expires 27sec
fe80::/64 dev eth0  proto kernel  metric 256
default via fe80::74df:d0ff:fe23:8ef2 dev eth0  proto ra  metric 1024  expires 27sec

After:
[root@qemu1 ~]# ip -6 r show | egrep -v unreachable
fd00:face:face:face::/64 dev eth0  proto kernel  metric 256  expires 27sec mtu 1300
fe80::/64 dev eth0  proto kernel  metric 256  mtu 1300
default via fe80::74df:d0ff:fe23:8ef2 dev eth0  proto ra  metric 1024  expires 27sec mtu 1300 hoplimit 30

Fixes: 8e2ec639173f325 (ipv6: don't use inetpeer to store metrics for routes.)
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
10 years agotcp: make sure skb is not shared before using skb_get()
Eric Dumazet [Fri, 13 Feb 2015 12:47:12 +0000 (04:47 -0800)] 
tcp: make sure skb is not shared before using skb_get()

commit ba34e6d9d346fe4e05d7e417b9edf5140772d34c upstream.

IPv6 can keep a copy of SYN message using skb_get() in
tcp_v6_conn_request() so that caller wont free the skb when calling
kfree_skb() later.

Therefore TCP fast open has to clone the skb it is queuing in
child->sk_receive_queue, as all skbs consumed from receive_queue are
freed using __kfree_skb() (ie assuming skb->users == 1)

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Fixes: 5b7ed0892f2af ("tcp: move fastopen functions to tcp_fastopen.c")
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
10 years agortnetlink: ifla_vf_policy: fix misuses of NLA_BINARY
Daniel Borkmann [Thu, 5 Feb 2015 17:44:04 +0000 (18:44 +0100)] 
rtnetlink: ifla_vf_policy: fix misuses of NLA_BINARY

commit 364d5716a7adb91b731a35765d369602d68d2881 upstream.

ifla_vf_policy[] is wrong in advertising its individual member types as
NLA_BINARY since .type = NLA_BINARY in combination with .len declares the
len member as *max* attribute length [0, len].

The issue is that when do_setvfinfo() is being called to set up a VF
through ndo handler, we could set corrupted data if the attribute length
is less than the size of the related structure itself.

The intent is exactly the opposite, namely to make sure to pass at least
data of minimum size of len.

Fixes: ebc08a6f47ee ("rtnetlink: Add VF config code to rtnetlink")
Cc: Mitch Williams <mitch.a.williams@intel.com>
Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
10 years agopktgen: fix UDP checksum computation
Sabrina Dubroca [Wed, 4 Feb 2015 22:08:50 +0000 (23:08 +0100)] 
pktgen: fix UDP checksum computation

commit 7744b5f3693cc06695cb9d6667671c790282730f upstream.

This patch fixes two issues in UDP checksum computation in pktgen.

First, the pseudo-header uses the source and destination IP
addresses. Currently, the ports are used for IPv4.

Second, the UDP checksum covers both header and data.  So we need to
generate the data earlier (move pktgen_finalize_skb up), and compute
the checksum for UDP header + data.

Fixes: c26bf4a51308c ("pktgen: Add UDPCSUM flag to support UDP checksums")
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
10 years agoflowcache: Fix kernel panic in flow_cache_flush_task
Miroslav Urbanek [Thu, 5 Feb 2015 15:36:50 +0000 (16:36 +0100)] 
flowcache: Fix kernel panic in flow_cache_flush_task

commit 233c96fc077d310772375d47522fb444ff546905 upstream.

flow_cache_flush_task references a structure member flow_cache_gc_work
where it should reference flow_cache_flush_task instead.

Kernel panic occurs on kernels using IPsec during XFRM garbage
collection. The garbage collection interval can be shortened using the
following sysctl settings:

net.ipv4.xfrm4_gc_thresh=4
net.ipv6.xfrm6_gc_thresh=4

With the default settings, our productions servers crash approximately
once a week. With the settings above, they crash immediately.

Fixes: ca925cf1534e ("flowcache: Make flow cache name space aware")
Reported-by: Tomáš Charvát <tc@excello.cz>
Tested-by: Jan Hejl <jh@excello.cz>
Signed-off-by: Miroslav Urbanek <mu@miroslavurbanek.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
10 years agoipvs: add missing ip_vs_pe_put in sync code
Julian Anastasov [Sat, 21 Feb 2015 19:03:10 +0000 (21:03 +0200)] 
ipvs: add missing ip_vs_pe_put in sync code

commit 528c943f3bb919aef75ab2fff4f00176f09a4019 upstream.

ip_vs_conn_fill_param_sync() gets in param.pe a module
reference for persistence engine from __ip_vs_pe_getbyname()
but forgets to put it. Problem occurs in backup for
sync protocol v1 (2.6.39).

Also, pe_data usually comes in sync messages for
connection templates and ip_vs_conn_new() copies
the pointer only in this case. Make sure pe_data
is not leaked if it comes unexpectedly for normal
connections. Leak can happen only if bogus messages
are sent to backup server.

Fixes: fe5e7a1efb66 ("IPVS: Backup, Adding Version 1 receive capability")
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
10 years agonetfilter: xt_socket: fix a stack corruption bug
Eric Dumazet [Mon, 16 Feb 2015 03:03:45 +0000 (19:03 -0800)] 
netfilter: xt_socket: fix a stack corruption bug

commit 78296c97ca1fd3b104f12e1f1fbc06c46635990b upstream.

As soon as extract_icmp6_fields() returns, its local storage (automatic
variables) is deallocated and can be overwritten.

Lets add an additional parameter to make sure storage is valid long
enough.

While we are at it, adds some const qualifiers.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Fixes: b64c9256a9b76 ("tproxy: added IPv6 support to the socket match")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
10 years agonetfilter: nft_compat: fix module refcount underflow
Pablo Neira Ayuso [Thu, 12 Feb 2015 21:15:31 +0000 (22:15 +0100)] 
netfilter: nft_compat: fix module refcount underflow

commit 520aa7414bb590f39d0d1591b06018e60cbc7cf4 upstream.

Feb 12 18:20:42 nfdev kernel: ------------[ cut here ]------------
Feb 12 18:20:42 nfdev kernel: WARNING: CPU: 4 PID: 4359 at kernel/module.c:963 module_put+0x9b/0xba()
Feb 12 18:20:42 nfdev kernel: CPU: 4 PID: 4359 Comm: ebtables-compat Tainted: G        W      3.19.0-rc6+ #43
[...]
Feb 12 18:20:42 nfdev kernel: Call Trace:
Feb 12 18:20:42 nfdev kernel: [<ffffffff815fd911>] dump_stack+0x4c/0x65
Feb 12 18:20:42 nfdev kernel: [<ffffffff8103e6f7>] warn_slowpath_common+0x9c/0xb6
Feb 12 18:20:42 nfdev kernel: [<ffffffff8109919f>] ? module_put+0x9b/0xba
Feb 12 18:20:42 nfdev kernel: [<ffffffff8103e726>] warn_slowpath_null+0x15/0x17
Feb 12 18:20:42 nfdev kernel: [<ffffffff8109919f>] module_put+0x9b/0xba
Feb 12 18:20:42 nfdev kernel: [<ffffffff813ecf7c>] nft_match_destroy+0x45/0x4c
Feb 12 18:20:42 nfdev kernel: [<ffffffff813e683f>] nf_tables_rule_destroy+0x28/0x70

Reported-by: Arturo Borrero Gonzalez <arturo.borrero.glez@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Tested-by: Arturo Borrero Gonzalez <arturo.borrero.glez@gmail.com>
Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
10 years agoipvs: rerouting to local clients is not needed anymore
Julian Anastasov [Thu, 18 Dec 2014 20:41:23 +0000 (22:41 +0200)] 
ipvs: rerouting to local clients is not needed anymore

commit 579eb62ac35845686a7c4286c0a820b4eb1f96aa upstream.

commit f5a41847acc5 ("ipvs: move ip_route_me_harder for ICMP")
from 2.6.37 introduced ip_route_me_harder() call for responses to
local clients, so that we can provide valid rt_src after SNAT.
It was used by TCP to provide valid daddr for ip_send_reply().
After commit 0a5ebb8000c5 ("ipv4: Pass explicit daddr arg to
ip_send_reply()." from 3.0 this rerouting is not needed anymore
and should be avoided, especially in LOCAL_IN.

Fixes 3.12.33 crash in xfrm reported by Florian Wiessner:
"3.12.33 - BUG xfrm_selector_match+0x25/0x2f6"

Reported-by: Smart Weblications GmbH - Florian Wiessner <f.wiessner@smart-weblications.de>
Tested-by: Smart Weblications GmbH - Florian Wiessner <f.wiessner@smart-weblications.de>
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
10 years agonetfilter: nf_tables: fix leaks in error path of nf_tables_newchain()
Pablo Neira Ayuso [Thu, 29 Jan 2015 18:08:09 +0000 (19:08 +0100)] 
netfilter: nf_tables: fix leaks in error path of nf_tables_newchain()

commit f5553c19ff9058136e7082c0b1f4268e705ea538 upstream.

Release statistics and module refcount on memory allocation problems.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
10 years agonetfilter: nf_tables: disable preemption when restoring chain counters
Pablo Neira Ayuso [Wed, 21 Jan 2015 17:04:18 +0000 (18:04 +0100)] 
netfilter: nf_tables: disable preemption when restoring chain counters

commit e8781f70a5b210a1b08cff8ce05895ebcec18d83 upstream.

With CONFIG_DEBUG_PREEMPT=y

[22144.496057] BUG: using smp_processor_id() in preemptible [00000000] code: iptables-compat/10406
[22144.496061] caller is debug_smp_processor_id+0x17/0x1b
[22144.496065] CPU: 2 PID: 10406 Comm: iptables-compat Not tainted 3.19.0-rc4+ #
[...]
[22144.496092] Call Trace:
[22144.496098]  [<ffffffff8145b9fa>] dump_stack+0x4f/0x7b
[22144.496104]  [<ffffffff81244f52>] check_preemption_disabled+0xd6/0xe8
[22144.496110]  [<ffffffff81244f90>] debug_smp_processor_id+0x17/0x1b
[22144.496120]  [<ffffffffa07c557e>] nft_stats_alloc+0x94/0xc7 [nf_tables]
[22144.496130]  [<ffffffffa07c73d2>] nf_tables_newchain+0x471/0x6d8 [nf_tables]
[22144.496140]  [<ffffffffa07c5ef6>] ? nft_trans_alloc+0x18/0x34 [nf_tables]
[22144.496154]  [<ffffffffa063c8da>] nfnetlink_rcv_batch+0x2b4/0x457 [nfnetlink]

Reported-by: Andreas Schultz <aschultz@tpip.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
10 years agoLinux 3.16.7-ckt8
Luis Henriques [Wed, 11 Mar 2015 10:16:00 +0000 (10:16 +0000)] 
Linux 3.16.7-ckt8

Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
10 years agoquota: Store maximum space limit in bytes
Jan Kara [Thu, 9 Oct 2014 14:54:13 +0000 (16:54 +0200)] 
quota: Store maximum space limit in bytes

commit b10a08194c2b615955dfab2300331a90ae9344c7 upstream.

Currently maximum space limit quota format supports is in blocks however
since we store space limits in bytes, this is somewhat confusing. So
store the maximum limit in bytes as well. Also rename the field to match
the new unit and related inode field to match the new naming scheme.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
[ luis: backported to 3.16: adjusted context ]
Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
10 years agocaif: remove wrong dev_net_set() call
Nicolas Dichtel [Mon, 26 Jan 2015 21:28:13 +0000 (22:28 +0100)] 
caif: remove wrong dev_net_set() call

commit 8997c27ec41127bf57421cc0205413d525421ddc upstream.

src_net points to the netns where the netlink message has been received. This
netns may be different from the netns where the interface is created (because
the user may add IFLA_NET_NS_[PID|FD]). In this case, src_net is the link netns.

It seems wrong to override the netns in the newlink() handler because if it
was not already src_net, it means that the user explicitly asks to create the
netdevice in another netns.

CC: Sjur Brændeland <sjur.brandeland@stericsson.com>
CC: Dmitry Tarnyagin <dmitry.tarnyagin@lockless.no>
Fixes: 8391c4aab1aa ("caif: Bugfixes in CAIF netdevice for close and flow control")
Fixes: c41254006377 ("caif-hsi: Add rtnl support")
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
10 years agosched/rt: Reduce rq lock contention by eliminating locking of non-feasible target
Tim Chen [Fri, 12 Dec 2014 23:38:12 +0000 (15:38 -0800)] 
sched/rt: Reduce rq lock contention by eliminating locking of non-feasible target

commit 80e3d87b2c5582db0ab5e39610ce3707d97ba409 upstream.

This patch adds checks that prevens futile attempts to move rt tasks
to a CPU with active tasks of equal or higher priority.

This reduces run queue lock contention and improves the performance of
a well known OLTP benchmark by 0.7%.

Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Shawn Bohrer <sbohrer@rgmadvisors.com>
Cc: Suruchi Kadu <suruchi.a.kadu@intel.com>
Cc: Doug Nelson<doug.nelson@intel.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1421430374.2399.27.camel@schen9-desk2.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
10 years agoARM: dts: am335x-bone*: usb0 is hardwired for peripheral
Robert Nelson [Tue, 24 Feb 2015 16:10:43 +0000 (10:10 -0600)] 
ARM: dts: am335x-bone*: usb0 is hardwired for peripheral

commit 67fd14b3eca63b14429350e9eadc5fab709a8821 upstream.

Fixes: http://bugs.elinux.org/issues/127
the bb.org community was seeing random reboots before this change.

Signed-off-by: Robert Nelson <robertcnelson@gmail.com>
Reviewed-by: Felipe Balbi <balbi@ti.com>
Acked-by: Felipe Balbi <balbi@ti.com>
Signed-off-by: Tony Lindgren <tony@atomide.com>
Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
10 years agoproc/pagemap: walk page tables under pte lock
Konstantin Khlebnikov [Wed, 11 Feb 2015 23:27:31 +0000 (15:27 -0800)] 
proc/pagemap: walk page tables under pte lock

commit 05fbf357d94152171bc50f8a369390f1f16efd89 upstream.

Lockless access to pte in pagemap_pte_range() might race with page
migration and trigger BUG_ON(!PageLocked()) in migration_entry_to_page():

CPU A (pagemap)                           CPU B (migration)
                                          lock_page()
                                          try_to_unmap(page, TTU_MIGRATION...)
                                               make_migration_entry()
                                               set_pte_at()
<read *pte>
pte_to_pagemap_entry()
                                          remove_migration_ptes()
                                          unlock_page()
    if(is_migration_entry())
        migration_entry_to_page()
            BUG_ON(!PageLocked(page))

Also lockless read might be non-atomic if pte is larger than wordsize.
Other pte walkers (smaps, numa_maps, clear_refs) already lock ptes.

Fixes: 052fb0d635df ("proc: report file/anon bit in /proc/pid/pagemap")
Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Reported-by: Andrey Ryabinin <a.ryabinin@samsung.com>
Reviewed-by: Cyrill Gorcunov <gorcunov@openvz.org>
Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
10 years agomm: softdirty: unmapped addresses between VMAs are clean
Peter Feiner [Thu, 9 Oct 2014 22:28:32 +0000 (15:28 -0700)] 
mm: softdirty: unmapped addresses between VMAs are clean

commit 81d0fa623c5b8dbd5279d9713094b0f9b0a00fb4 upstream.

If a /proc/pid/pagemap read spans a [VMA, an unmapped region, then a
VM_SOFTDIRTY VMA], the virtual pages in the unmapped region are reported
as softdirty.  Here's a program to demonstrate the bug:

int main() {
const uint64_t PAGEMAP_SOFTDIRTY = 1ul << 55;
uint64_t pme[3];
int fd = open("/proc/self/pagemap", O_RDONLY);;
char *m = mmap(NULL, 3 * getpagesize(), PROT_READ,
               MAP_ANONYMOUS | MAP_SHARED, -1, 0);
munmap(m + getpagesize(), getpagesize());
pread(fd, pme, 24, (unsigned long) m / getpagesize() * 8);
assert(pme[0] & PAGEMAP_SOFTDIRTY);    /* passes */
assert(!(pme[1] & PAGEMAP_SOFTDIRTY)); /* fails */
assert(pme[2] & PAGEMAP_SOFTDIRTY);    /* passes */
return 0;
}

(Note that all pages in new VMAs are softdirty until cleared).

Tested:
Used the program given above. I'm going to include this code in
a selftest in the future.

[n-horiguchi@ah.jp.nec.com: prevent pagemap_pte_range() from overrunning]
Signed-off-by: Peter Feiner <pfeiner@google.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Jamie Liu <jamieliu@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
[ luis: 3.16-stable prereq for:
  05fbf357d941 "proc/pagemap: walk page tables under pte lock" ]
Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
10 years agoHID: wacom: Report ABS_MISC event for Cintiq Companion Hybrid
Jason Gerecke [Thu, 22 Jan 2015 23:53:28 +0000 (15:53 -0800)] 
HID: wacom: Report ABS_MISC event for Cintiq Companion Hybrid

commit 33e5df0e0e32027866e9fb00451952998fc957f2 upstream.

It appears that the Cintiq Companion Hybrid does not send an ABS_MISC event to
userspace when any of its ExpressKeys are pressed. This is not strictly
necessary now that the pad exists on its own device, but should be fixed for
consistency's sake.

Traditionally both the stylus and pad shared the same device node, and
xf86-input-wacom would use ABS_MISC for disambiguation. Not sending this causes
the Hybrid to behave incorrectly with xf86-input-wacom beginning with its
8f44f3 commit.

Signed-off-by: Jason Gerecke <killertofu@gmail.com>
Reviewed-by: Benjamin Tissoires <benjamin.tissoires@redhat.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
[killertofu@gmail.com: ported to drivers/input/tablet/wacom_wac.c]
Signed-off-by: Jason Gerecke <killertofu@gmail.com>
Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
10 years agosunxi: clk: Set sun6i-pll1 n_start = 1
Hans de Goede [Sat, 24 Jan 2015 11:56:32 +0000 (12:56 +0100)] 
sunxi: clk: Set sun6i-pll1 n_start = 1

commit 76820fcf7aa5a418b69cb7bed31b62d1feb1d6ad upstream.

For all pll-s on sun6i n == 0 means use a multiplier of 1, rather then 0 as
it means on sun4i / sun5i / sun7i. n_start = 1 is already correctly set
for sun6i pll6, but was missing for pll1, this commit fixes this.

Cc: Chen-Yu Tsai <wens@csie.org>
Signed-off-by: Hans de Goede <hdegoede@redhat.com>
Signed-off-by: Maxime Ripard <maxime.ripard@free-electrons.com>
Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
10 years agoclk: sunxi: Support factor clocks with N factor starting not from 0
Chen-Yu Tsai [Thu, 26 Jun 2014 15:55:41 +0000 (23:55 +0800)] 
clk: sunxi: Support factor clocks with N factor starting not from 0

commit 9a5e6c7eb5ccbb5f0d3a1dffce135f0a727f40e1 upstream.

The PLLs on newer Allwinner SoC's, such as the A31 and A23, have a
N multiplier factor that starts from 1, not 0.

This patch adds an option to the factor clk driver's config data
structures to specify the base value of N.

Signed-off-by: Chen-Yu Tsai <wens@csie.org>
Acked-by: Maxime Ripard <maxime.ripard@free-electrons.com>
Signed-off-by: Maxime Ripard <maxime.ripard@free-electrons.com>
Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
10 years agovhost/scsi: potential memory corruption
Dan Carpenter [Thu, 5 Feb 2015 07:37:33 +0000 (10:37 +0300)] 
vhost/scsi: potential memory corruption

commit 59c816c1f24df0204e01851431d3bab3eb76719c upstream.

This code in vhost_scsi_make_tpg() is confusing because we limit "tpgt"
to UINT_MAX but the data type of "tpg->tport_tpgt" and that is a u16.

I looked at the context and it turns out that in
vhost_scsi_set_endpoint(), "tpg->tport_tpgt" is used as an offset into
the vs_tpg[] array which has VHOST_SCSI_MAX_TARGET (256) elements so
anything higher than 255 then it is invalid.  I have made that the limit
now.

In vhost_scsi_send_evt() we mask away values higher than 255, but now
that the limit has changed, we don't need the mask.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
[ luis: backported to 3.16: functions rename:
  - tcm_vhost_send_evt -> vhost_scsi_send_evt
  - tcm_vhost_make_tpg -> vhost_scsi_make_tpg ]
Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
10 years agotarget: Allow Write Exclusive non-reservation holders to READ
Lee Duncan [Mon, 5 Jan 2015 18:49:44 +0000 (10:49 -0800)] 
target: Allow Write Exclusive non-reservation holders to READ

commit 1ecc7586922662e3ca2f3f0c3f17fec8749fc621 upstream.

For PGR reservation of type Write Exclusive Access, allow all non
reservation holding I_T nexuses with active registrations to READ
from the device.

This addresses a bug where active registrations that attempted
to READ would result in an reservation conflict.

Signed-off-by: Lee Duncan <lduncan@suse.com>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
10 years agotarget: Allow AllRegistrants to re-RESERVE existing reservation
Nicholas Bellinger [Fri, 19 Dec 2014 00:49:23 +0000 (00:49 +0000)] 
target: Allow AllRegistrants to re-RESERVE existing reservation

commit ae450e246e8540300699480a3780a420a028b73f upstream.

This patch changes core_scsi3_pro_release() logic to allow an
existing AllRegistrants type reservation to be re-reserved by
any registered I_T nexus.

This addresses a issue where AllRegistrants type RESERVE was
receiving RESERVATION_CONFLICT status if dev_pr_res_holder did
not match the same I_T nexus, instead of just returning GOOD
status following spc4r34 Section 5.9.9:

"If the device server receives a PERSISTENT RESERVE OUT command
 with RESERVE service action where the TYPE field and the SCOPE
 field contain the same values as the existing type and scope
 from a persistent reservation holder, it shall not make any
 change to the existing persistent reservation and shall complete
 the command with GOOD status."

Reported-by: Ilias Tsitsimpis <i.tsitsimpis@gmail.com>
Cc: Ilias Tsitsimpis <i.tsitsimpis@gmail.com>
Cc: Lee Duncan <lduncan@suse.com>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
10 years agotarget: Avoid dropping AllRegistrants reservation during unregister
Nicholas Bellinger [Mon, 15 Dec 2014 19:50:26 +0000 (11:50 -0800)] 
target: Avoid dropping AllRegistrants reservation during unregister

commit 6c3c9baa0debeb4bcc52a78c4463a0a97518de10 upstream.

This patch fixes an issue with AllRegistrants reservations where
an unregister operation by the I_T nexus reservation holder would
incorrectly drop the reservation, instead of waiting until the
last active I_T nexus is unregistered as per SPC-4.

This includes updating __core_scsi3_complete_pro_release() to reset
dev->dev_pr_res_holder with another pr_reg for this special case,
as well as a new 'unreg' parameter to determine when the release
is occuring from an implicit unregister, vs. explicit RELEASE.

It also adds special handling in core_scsi3_free_pr_reg_from_nacl()
to release the left-over pr_res_holder, now that pr_reg is deleted
from pr_reg_list within __core_scsi3_complete_pro_release().

Reported-by: Ilias Tsitsimpis <i.tsitsimpis@gmail.com>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
10 years agotarget: Fix R_HOLDER bit usage for AllRegistrants
Nicholas Bellinger [Sun, 14 Dec 2014 09:47:19 +0000 (01:47 -0800)] 
target: Fix R_HOLDER bit usage for AllRegistrants

commit d16ca7c5198fd668db10d2c7b048ed3359c12c54 upstream.

This patch fixes the usage of R_HOLDER bit for an All Registrants
reservation in READ_FULL_STATUS, where only the registration who
issued RESERVE was being reported as having an active reservation.

It changes core_scsi3_pri_read_full_status() to check ahead of the
list walk of active registrations to see if All Registrants is active,
and if so set R_HOLDER bit and scope/type fields for all active
registrations.

Reported-by: Ilias Tsitsimpis <i.tsitsimpis@gmail.com>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
10 years agostaging: comedi: cb_pcidas64: fix incorrect AI range code handling
Ian Abbott [Mon, 19 Jan 2015 14:47:27 +0000 (14:47 +0000)] 
staging: comedi: cb_pcidas64: fix incorrect AI range code handling

commit be8e89087ec2d2c8a1ad1e3db64bf4efdfc3c298 upstream.

The hardware range code values and list of valid ranges for the AI
subdevice is incorrect for several supported boards.  The hardware range
code values for all boards except PCI-DAS4020/12 is determined by
calling `ai_range_bits_6xxx()` based on the maximum voltage of the range
and whether it is bipolar or unipolar, however it only returns the
correct hardware range code for the PCI-DAS60xx boards.  For
PCI-DAS6402/16 (and /12) it returns the wrong code for the unipolar
ranges.  For PCI-DAS64/Mx/16 it returns the wrong code for all the
ranges and the comedi range table is incorrect.

Change `ai_range_bits_6xxx()` to use a look-up table pointed to by new
member `ai_range_codes` of `struct pcidas64_board` to map the comedi
range table indices to the hardware range codes.  Use a new comedi range
table for the PCI-DAS64/Mx/16 boards (and the commented out variants).

Signed-off-by: Ian Abbott <abbotti@mev.co.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
10 years ago[media] Si2168: increase timeout to fix firmware loading
Jurgen Kramer [Mon, 8 Dec 2014 08:30:44 +0000 (05:30 -0300)] 
[media] Si2168: increase timeout to fix firmware loading

commit 551c33e729f654ecfaed00ad399f5d2a631b72cb upstream.

Increase si2168 cmd execute timeout to prevent firmware load failures. Tests
shows it takes up to 52ms to load the 'dvb-demod-si2168-a30-01.fw' firmware.
Increase timeout to a safe value of 70ms.

Signed-off-by: Jurgen Kramer <gtmkramer@xs4all.nl>
Reviewed-by: Antti Palosaari <crope@iki.fi>
Signed-off-by: Mauro Carvalho Chehab <mchehab@osg.samsung.com>
Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
10 years agoIB/iser: Use correct dma direction when unmapping SGs
Roi Dayan [Sun, 28 Dec 2014 12:26:11 +0000 (14:26 +0200)] 
IB/iser: Use correct dma direction when unmapping SGs

commit c6c95ef4cec680f7a10aa425a9970744b35b6489 upstream.

We always unmap SGs with the same direction instead of unmapping
with the direction the mapping was done, fix that.

Fixes: 9a8b08fad2ef ("IB/iser: Generalize iser_unmap_task_data and [...]")
Signed-off-by: Roi Dayan <roid@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
[ luis: backported to 3.16: adjusted context ]
Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
10 years agoIB/mlx4: Fix wrong usage of IPv4 protocol for multicast attach/detach
Or Gerlitz [Wed, 17 Dec 2014 14:17:34 +0000 (16:17 +0200)] 
IB/mlx4: Fix wrong usage of IPv4 protocol for multicast attach/detach

commit e9a7faf11af94957e5107b40af46c2e329541510 upstream.

The MLX4_PROT_IB_IPV4 protocol should only be used with RoCEv2 and such.
Removing this wrong usage allows to run multicast applications over RoCE.

Fixes: d487ee77740c ("IB/mlx4: Use IBoE (RoCE) IP based GIDs in the port GID table")
Reported-by: Carol Soto <clsoto@linux.vnet.ibm.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
10 years agoIB/core: Fix deadlock on uverbs modify_qp error flow
Moshe Lazer [Thu, 5 Feb 2015 11:53:52 +0000 (13:53 +0200)] 
IB/core: Fix deadlock on uverbs modify_qp error flow

commit 0fb8bcf022f19a375d7c4bd79ac513da8ae6d78b upstream.

The deadlock occurs in __uverbs_modify_qp: we take a lock (idr_read_qp)
and in case of failure in ib_resolve_eth_l2_attrs we don't release
it (put_qp_read).  Fix that.

Fixes: ed4c54e5b4ba ("IB/core: Resolve Ethernet L2 addresses when modifying QP")
Signed-off-by: Moshe Lazer <moshel@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
10 years agoIB/core: When marshaling ucma path from user-space, clear unused fields
Ilya Nelkenbaum [Thu, 5 Feb 2015 11:53:48 +0000 (13:53 +0200)] 
IB/core: When marshaling ucma path from user-space, clear unused fields

commit c2be9dc0e0fa59cc43c2c7084fc42b430809a0fe upstream.

When marshaling a user path to the kernel struct ib_sa_path, we need
to zero smac and dmac and set the vlan id to the "no vlan" value.

This is to ensure that Ethernet attributes are not used with
InfiniBand QPs.

Fixes: dd5f03beb4f7 ("IB/core: Ethernet L2 attributes in verbs/cm structures")
Signed-off-by: Ilya Nelkenbaum <ilyan@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
10 years agoideapad-laptop: Change Lenovo Yoga 2 series rfkill handling
Hans de Goede [Mon, 23 Jun 2014 14:45:51 +0000 (16:45 +0200)] 
ideapad-laptop: Change Lenovo Yoga 2 series rfkill handling

commit ce363c2bcb2303e7fad3a79398db739c6995141b upstream.

It seems that the same problems which lead to adding an rfkill blacklist and
putting the Lenovo Yoga 2 11 on it are also present on the Lenovo Yoga 2 13
and Lenovo Yoga 2 Pro too:
https://bugzilla.redhat.com/show_bug.cgi?id=1021036
https://forums.lenovo.com/t5/Linux-Discussion/Yoga-2-13-not-Pro-Linux-Warning/m-p/1517612

Testing has shown that the firmware rfkill settings are persistent over
reboots. So blacklisting the driver is not good enough, if the wifi is blocked
at the firmware level the wifi needs to be explictly unblocked through the
ideapad-laptop interface.

And at least on the Lenovo Yoga 2 13 the VPCCMD_RF register which on devices
with hardware kill switch reports the hardware switch state, needs to be
explictly set to 1 (radio enabled / not blocked).

So this patch does 3 things to get proper rfkill handling on these models:

1) Instead of blacklisting the rfkill functionality, which means that people
with a firmware blocked wifi get stuck in that situation, ignore the value
reported by the not present hardware rfkill switch, as this is what is causing
ideapad-laptop to wrongly report all radios as hardware blocks. But do register
the rfkill interfaces so that the user can soft [un]block them.

2) On models without a hardware rfkill switch, explictly set VPCCMD_RF to 1

3) Drop the " 11" postfix from the dmi match string, as the entire Yoga 2
series is affected.

Yoga 2 11:
Reported-and-tested-by: Vincent Gerris <vgerris@gmail.com>
Yoga 2 13:
Tested-by: madls05 <http://ubuntuforums.org/showthread.php?t=2215044>
Yoga 2 Pro:
Reported-and-tested-by: Peter F. Patel-Schneider <pfpschneider@gmail.com>
Signed-off-by: Hans de Goede <hdegoede@redhat.com>
Signed-off-by: Matthew Garrett <matthew.garrett@nebula.com>
Cc: Gaudenz Steinlin <gaudenz@debian.org>
Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
10 years agoipv6: mld: fix add_grhead skb_over_panic for devs with large MTUs
Daniel Borkmann [Wed, 5 Nov 2014 19:27:38 +0000 (20:27 +0100)] 
ipv6: mld: fix add_grhead skb_over_panic for devs with large MTUs

commit 4c672e4b42bc8046d63a6eb0a2c6a450a501af32 upstream.

It has been reported that generating an MLD listener report on
devices with large MTUs (e.g. 9000) and a high number of IPv6
addresses can trigger a skb_over_panic():

skbuff: skb_over_panic: text:ffffffff80612a5d len:3776 put:20
head:ffff88046d751000 data:ffff88046d751010 tail:0xed0 end:0xec0
dev:port1
 ------------[ cut here ]------------
kernel BUG at net/core/skbuff.c:100!
invalid opcode: 0000 [#1] SMP
Modules linked in: ixgbe(O)
CPU: 3 PID: 0 Comm: swapper/3 Tainted: G O 3.14.23+ #4
[...]
Call Trace:
 <IRQ>
 [<ffffffff80578226>] ? skb_put+0x3a/0x3b
 [<ffffffff80612a5d>] ? add_grhead+0x45/0x8e
 [<ffffffff80612e3a>] ? add_grec+0x394/0x3d4
 [<ffffffff80613222>] ? mld_ifc_timer_expire+0x195/0x20d
 [<ffffffff8061308d>] ? mld_dad_timer_expire+0x45/0x45
 [<ffffffff80255b5d>] ? call_timer_fn.isra.29+0x12/0x68
 [<ffffffff80255d16>] ? run_timer_softirq+0x163/0x182
 [<ffffffff80250e6f>] ? __do_softirq+0xe0/0x21d
 [<ffffffff8025112b>] ? irq_exit+0x4e/0xd3
 [<ffffffff802214bb>] ? smp_apic_timer_interrupt+0x3b/0x46
 [<ffffffff8063f10a>] ? apic_timer_interrupt+0x6a/0x70

mld_newpack() skb allocations are usually requested with dev->mtu
in size, since commit 72e09ad107e7 ("ipv6: avoid high order allocations")
we have changed the limit in order to be less likely to fail.

However, in MLD/IGMP code, we have some rather ugly AVAILABLE(skb)
macros, which determine if we may end up doing an skb_put() for
adding another record. To avoid possible fragmentation, we check
the skb's tailroom as skb->dev->mtu - skb->len, which is a wrong
assumption as the actual max allocation size can be much smaller.

The IGMP case doesn't have this issue as commit 57e1ab6eaddc
("igmp: refine skb allocations") stores the allocation size in
the cb[].

Set a reserved_tailroom to make it fit into the MTU and use
skb_availroom() helper instead. This also allows to get rid of
igmp_skb_size().

Reported-by: Wei Liu <lw1a2.jing@gmail.com>
Fixes: 72e09ad107e7 ("ipv6: avoid high order allocations")
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
Cc: David L Stevens <david.stevens@oracle.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
10 years agogpio: sysfs: fix gpio attribute-creation race
Johan Hovold [Tue, 13 Jan 2015 12:00:06 +0000 (13:00 +0100)] 
gpio: sysfs: fix gpio attribute-creation race

commit ebbeba120ab2ec6ac5f3afc1425ec6ff0b77ad6f upstream.

Fix attribute-creation race with userspace by using the default group
to create also the contingent gpio device attributes.

Fixes: d8f388d8dc8d ("gpio: sysfs interface")
Signed-off-by: Johan Hovold <johan@kernel.org>
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
[ luis: backported to 3.16:
  - all changes in drivers/gpio/gpiolib.c
  - adjusted context ]
Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
10 years agonet: sctp: fix race for one-to-many sockets in sendmsg's auto associate
Daniel Borkmann [Thu, 15 Jan 2015 15:34:35 +0000 (16:34 +0100)] 
net: sctp: fix race for one-to-many sockets in sendmsg's auto associate

commit 2061dcd6bff8b774b4fac8b0739b6be3f87bc9f2 upstream.

I.e. one-to-many sockets in SCTP are not required to explicitly
call into connect(2) or sctp_connectx(2) prior to data exchange.
Instead, they can directly invoke sendmsg(2) and the SCTP stack
will automatically trigger connection establishment through 4WHS
via sctp_primitive_ASSOCIATE(). However, this in its current
implementation is racy: INIT is being sent out immediately (as
it cannot be bundled anyway) and the rest of the DATA chunks are
queued up for later xmit when connection is established, meaning
sendmsg(2) will return successfully. This behaviour can result
in an undesired side-effect that the kernel made the application
think the data has already been transmitted, although none of it
has actually left the machine, worst case even after close(2)'ing
the socket.

Instead, when the association from client side has been shut down
e.g. first gracefully through SCTP_EOF and then close(2), the
client could afterwards still receive the server's INIT_ACK due
to a connection with higher latency. This INIT_ACK is then considered
out of the blue and hence responded with ABORT as there was no
alive assoc found anymore. This can be easily reproduced f.e.
with sctp_test application from lksctp. One way to fix this race
is to wait for the handshake to actually complete.

The fix defers waiting after sctp_primitive_ASSOCIATE() and
sctp_primitive_SEND() succeeded, so that DATA chunks cooked up
from sctp_sendmsg() have already been placed into the output
queue through the side-effect interpreter, and therefore can then
be bundeled together with COOKIE_ECHO control chunks.

strace from example application (shortened):

socket(PF_INET, SOCK_SEQPACKET, IPPROTO_SCTP) = 3
sendmsg(3, {msg_name(28)={sa_family=AF_INET, sin_port=htons(8888), sin_addr=inet_addr("192.168.1.115")},
           msg_iov(1)=[{"hello", 5}], msg_controllen=0, msg_flags=0}, 0) = 5
sendmsg(3, {msg_name(28)={sa_family=AF_INET, sin_port=htons(8888), sin_addr=inet_addr("192.168.1.115")},
           msg_iov(1)=[{"hello", 5}], msg_controllen=0, msg_flags=0}, 0) = 5
sendmsg(3, {msg_name(28)={sa_family=AF_INET, sin_port=htons(8888), sin_addr=inet_addr("192.168.1.115")},
           msg_iov(1)=[{"hello", 5}], msg_controllen=0, msg_flags=0}, 0) = 5
sendmsg(3, {msg_name(28)={sa_family=AF_INET, sin_port=htons(8888), sin_addr=inet_addr("192.168.1.115")},
           msg_iov(1)=[{"hello", 5}], msg_controllen=0, msg_flags=0}, 0) = 5
sendmsg(3, {msg_name(28)={sa_family=AF_INET, sin_port=htons(8888), sin_addr=inet_addr("192.168.1.115")},
           msg_iov(0)=[], msg_controllen=48, {cmsg_len=48, cmsg_level=0x84 /* SOL_??? */, cmsg_type=, ...},
           msg_flags=0}, 0) = 0 // graceful shutdown for SOCK_SEQPACKET via SCTP_EOF
close(3) = 0

tcpdump before patch (fooling the application):

22:33:36.306142 IP 192.168.1.114.41462 > 192.168.1.115.8888: sctp (1) [INIT] [init tag: 3879023686] [rwnd: 106496] [OS: 10] [MIS: 65535] [init TSN: 3139201684]
22:33:36.316619 IP 192.168.1.115.8888 > 192.168.1.114.41462: sctp (1) [INIT ACK] [init tag: 3345394793] [rwnd: 106496] [OS: 10] [MIS: 10] [init TSN: 3380109591]
22:33:36.317600 IP 192.168.1.114.41462 > 192.168.1.115.8888: sctp (1) [ABORT]

tcpdump after patch:

14:28:58.884116 IP 192.168.1.114.35846 > 192.168.1.115.8888: sctp (1) [INIT] [init tag: 438593213] [rwnd: 106496] [OS: 10] [MIS: 65535] [init TSN: 3092969729]
14:28:58.888414 IP 192.168.1.115.8888 > 192.168.1.114.35846: sctp (1) [INIT ACK] [init tag: 381429855] [rwnd: 106496] [OS: 10] [MIS: 10] [init TSN: 2141904492]
14:28:58.888638 IP 192.168.1.114.35846 > 192.168.1.115.8888: sctp (1) [COOKIE ECHO] , (2) [DATA] (B)(E) [TSN: 3092969729] [...]
14:28:58.893278 IP 192.168.1.115.8888 > 192.168.1.114.35846: sctp (1) [COOKIE ACK] , (2) [SACK] [cum ack 3092969729] [a_rwnd 106491] [#gap acks 0] [#dup tsns 0]
14:28:58.893591 IP 192.168.1.114.35846 > 192.168.1.115.8888: sctp (1) [DATA] (B)(E) [TSN: 3092969730] [...]
14:28:59.096963 IP 192.168.1.115.8888 > 192.168.1.114.35846: sctp (1) [SACK] [cum ack 3092969730] [a_rwnd 106496] [#gap acks 0] [#dup tsns 0]
14:28:59.097086 IP 192.168.1.114.35846 > 192.168.1.115.8888: sctp (1) [DATA] (B)(E) [TSN: 3092969731] [...] , (2) [DATA] (B)(E) [TSN: 3092969732] [...]
14:28:59.103218 IP 192.168.1.115.8888 > 192.168.1.114.35846: sctp (1) [SACK] [cum ack 3092969732] [a_rwnd 106486] [#gap acks 0] [#dup tsns 0]
14:28:59.103330 IP 192.168.1.114.35846 > 192.168.1.115.8888: sctp (1) [SHUTDOWN]
14:28:59.107793 IP 192.168.1.115.8888 > 192.168.1.114.35846: sctp (1) [SHUTDOWN ACK]
14:28:59.107890 IP 192.168.1.114.35846 > 192.168.1.115.8888: sctp (1) [SHUTDOWN COMPLETE]

Looks like this bug is from the pre-git history museum. ;)

Fixes: 08707d5482df ("lksctp-2_5_31-0_5_1.patch")
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Acked-by: Vlad Yasevich <vyasevich@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
[ luis: backported to 3.16: adjusted context ]
Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
10 years agofib_trie: Fix /proc/net/fib_trie when CONFIG_IP_MULTIPLE_TABLES is not defined
Alexander Duyck [Tue, 2 Dec 2014 18:58:21 +0000 (10:58 -0800)] 
fib_trie: Fix /proc/net/fib_trie when CONFIG_IP_MULTIPLE_TABLES is not defined

commit a5a519b2710be43fce3cf9ce7bd8de8db3f2a9de upstream.

In recent testing I had disabled CONFIG_IP_MULTIPLE_TABLES and as a result
when I ran "cat /proc/net/fib_trie" the main trie was displayed multiple
times.  I found that the problem line of code was in the function
fib_trie_seq_next.  Specifically the line below caused the indexes to go in
the opposite direction of our traversal:

h = tb->tb_id & (FIB_TABLE_HASHSZ - 1);

This issue was that the RT tables are defined such that RT_TABLE_LOCAL is ID
255, while it is located at TABLE_LOCAL_INDEX of 0, and RT_TABLE_MAIN is 254
with a TABLE_MAIN_INDEX of 1.  This means that the above line will return 1
for the local table and 0 for main.  The result is that fib_trie_seq_next
will return NULL at the end of the local table, fib_trie_seq_start will
return the start of the main table, and then fib_trie_seq_next will loop on
main forever as h will always return 0.

The fix for this is to reverse the ordering of the two tables.  It has the
advantage of making it so that the tables now print in the same order
regardless of if multiple tables are enabled or not.  In order to make the
definition consistent with the multiple tables case I simply masked the to
RT_TABLE_XXX values by (FIB_TABLE_HASHSZ - 1).  This way the two table
layouts should always stay consistent.

Fixes: 93456b6 ("[IPV4]: Unify access to the routing tables")
Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
10 years agoHID: i2c-hid: Limit reads to wMaxInputLength bytes for input events
Seth Forshee [Fri, 20 Feb 2015 17:45:11 +0000 (11:45 -0600)] 
HID: i2c-hid: Limit reads to wMaxInputLength bytes for input events

commit 6d00f37e49d95e640a3937a4a1ae07dbe92a10cb upstream.

d1c7e29e8d27 (HID: i2c-hid: prevent buffer overflow in early IRQ)
changed hid_get_input() to read ihid->bufsize bytes, which can be
more than wMaxInputLength. This is the case with the Dell XPS 13
9343, and it is causing events to be missed. In some cases the
missed events are releases, which can cause the cursor to jump or
freeze, among other problems. Limit the number of bytes read to
min(wMaxInputLength, ihid->bufsize) to prevent such problems.

Fixes: d1c7e29e8d27 "HID: i2c-hid: prevent buffer overflow in early IRQ"
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>
Reviewed-by: Benjamin Tissoires <benjamin.tissoires@redhat.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
10 years agonet: rds: use correct size for max unacked packets and bytes
Sasha Levin [Tue, 3 Feb 2015 13:55:58 +0000 (08:55 -0500)] 
net: rds: use correct size for max unacked packets and bytes

commit db27ebb111e9f69efece08e4cb6a34ff980f8896 upstream.

Max unacked packets/bytes is an int while sizeof(long) was used in the
sysctl table.

This means that when they were getting read we'd also leak kernel memory
to userspace along with the timeout values.

Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Cc: Moritz Muehlenhoff <jmm@inutil.org>
Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
10 years agonet: llc: use correct size for sysctl timeout entries
Sasha Levin [Sat, 24 Jan 2015 01:47:00 +0000 (20:47 -0500)] 
net: llc: use correct size for sysctl timeout entries

commit 6b8d9117ccb4f81b1244aafa7bc70ef8fa45fc49 upstream.

The timeout entries are sizeof(int) rather than sizeof(long), which
means that when they were getting read we'd also leak kernel memory
to userspace along with the timeout values.

Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Cc: Moritz Muehlenhoff <jmm@inutil.org>
Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
10 years agoGFS2: Fix crash during ACL deletion in acl max entry check in gfs2_set_acl()
Andrew Elble [Mon, 9 Feb 2015 17:53:04 +0000 (12:53 -0500)] 
GFS2: Fix crash during ACL deletion in acl max entry check in gfs2_set_acl()

commit 278702074ff77b1a3fa2061267997095959f5e2c upstream.

Fixes: e01580bf9e ("gfs2: use generic posix ACL infrastructure")
Reported-by: Eric Meddaugh <etmsys@rit.edu>
Tested-by: Eric Meddaugh <etmsys@rit.edu>
Signed-off-by: Andrew Elble <aweits@rit.edu>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
10 years agoALSA: off by one bug in snd_riptide_joystick_probe()
Dan Carpenter [Mon, 9 Feb 2015 13:51:40 +0000 (16:51 +0300)] 
ALSA: off by one bug in snd_riptide_joystick_probe()

commit e4940626defdf6c92da1052ad3f12741c1a28c90 upstream.

The problem here is that we check:

if (dev >= SNDRV_CARDS)

Then we increment "dev".

       if (!joystick_port[dev++])

Then we use it as an offset into a array with SNDRV_CARDS elements.

if (!request_region(joystick_port[dev], 8, "Riptide gameport")) {

This has 3 effects:
1) If you use the module option to specify the joystick port then it has
   to be shifted one space over.
2) The wrong error message will be printed on failure if you have over
   32 cards.
3) Static checkers will correctly complain that are off by one.

Fixes: db1005ec6ff8 ('ALSA: riptide - Fix joystick resource handling')
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Takashi Iwai <tiwai@suse.de>
Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
10 years agopinctrl: pinctrl-imx: don't use invalid value of conf_reg
Uwe Kleine-König [Tue, 27 Jan 2015 22:50:25 +0000 (23:50 +0100)] 
pinctrl: pinctrl-imx: don't use invalid value of conf_reg

commit 4ff0f034e95d65f8f063a362dfcf86e986377a82 upstream.

The right check for conf_reg to be invalid it testing against -1 not 0
as is done in the rest of the driver.

This fixes an oops that can be triggered by:

cat /sys/kernel/debug/pinctrl/43fac000.iomuxc/*

Fixes: ae75ff814538 ("pinctrl: pinctrl-imx: add imx pinctrl core driver")
Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
[ luis: backported to 3.16:
  - file rename: drivers/pinctrl/freescale/pinctrl-imx.c ->
    drivers/pinctrl/pinctrl-imx.c ]
Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
10 years agopowerpc/kernel: Avoid memory corruption at early stage
Gavin Shan [Thu, 8 Jan 2015 05:40:51 +0000 (16:40 +1100)] 
powerpc/kernel: Avoid memory corruption at early stage

commit 6f20e7f2e930211613a66d0603fa4abaaf3ce662 upstream.

When calling to early_setup(), we pick "boot_paca" up for the master CPU
and initialize that with initialise_paca(). At that point, the SLB
shadow buffer isn't populated yet. Updating the SLB shadow buffer should
corrupt what we had in physical address 0 where the trap instruction is
usually stored.

This hasn't been observed to cause any trouble in practice, but is
obviously fishy.

Fixes: 6f4441ef7009 ("powerpc: Dynamically allocate slb_shadow from memblock")
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
10 years agoclk-gate: fix bit # check in clk_register_gate()
Sergei Shtylyov [Wed, 24 Dec 2014 14:43:27 +0000 (17:43 +0300)] 
clk-gate: fix bit # check in clk_register_gate()

commit 2e9dcdae4068460c45a308dd891be5248260251c upstream.

In case CLK_GATE_HIWORD_MASK flag is passed to clk_register_gate(), the bit #
should be no higher than 15, however the corresponding check is obviously off-
by-one.

Fixes: 045779942c04 ("clk: gate: add CLK_GATE_HIWORD_MASK")
Signed-off-by: Sergei Shtylyov <sergei.shtylyov@cogentembedded.com>
Signed-off-by: Michael Turquette <mturquette@linaro.org>
Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
10 years agoefi: Small leak on error in runtime map code
Dan Carpenter [Thu, 15 Jan 2015 09:21:21 +0000 (12:21 +0300)] 
efi: Small leak on error in runtime map code

commit 86d68a58d00db3770735b5919ef2c6b12d7f06f3 upstream.

The "> 0" here should ">= 0" so we free map_entries[0].

Fixes: 926172d46038 ('efi: Export EFI runtime memory mapping to sysfs')
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Acked-by: Dave Young <dyoung@redhat.com>
Signed-off-by: Matt Fleming <matt.fleming@intel.com>
Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
10 years agogpio: rcar: Fix error path for devm_kzalloc() failure
Geert Uytterhoeven [Mon, 12 Jan 2015 10:07:58 +0000 (11:07 +0100)] 
gpio: rcar: Fix error path for devm_kzalloc() failure

commit 7d82bf3419c103dbb730e7834186fc5d577b9da1 upstream.

If the call to devm_kzalloc() fails, nothing must be cleant up.
This was missed before because gpio_rcar_probe() had a "return"
statement after the first "goto err0".

Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
Fixes: df0c6c80232f2ad4 ("gpio: rcar: Add minimal runtime PM support")
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
10 years agoASoC: mioa701_wm9713: Fix speaker event
Lars-Peter Clausen [Thu, 15 Jan 2015 11:52:01 +0000 (12:52 +0100)] 
ASoC: mioa701_wm9713: Fix speaker event

commit 7331ea474e9e7a348541c207bdb6aa518c6403f4 upstream.

Commit f6b2a04590bb ("ASoC: pxa: mioa701_wm9713: Convert to table based DAPM
setup") converted the driver to register the board level DAPM elements with
the card's DAPM context rather than the CODEC's DAPM context. The change
overlooked that the speaker widget event callback accesses the widget's
codec field which is only valid if the widget has been registered in a CODEC
DAPM context. This patch modifies the callback to take an alternative route
to get the CODEC.

Fixes: f6b2a04590bb ("ASoC: pxa: mioa701_wm9713: Convert to table based DAPM
setup")
Signed-off-by: Lars-Peter Clausen <lars@metafoo.de>
Signed-off-by: Mark Brown <broonie@kernel.org>
Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
10 years agoautofs4 copy_dev_ioctl(): keep the value of ->size we'd used for allocation
Al Viro [Sun, 22 Feb 2015 03:19:57 +0000 (22:19 -0500)] 
autofs4 copy_dev_ioctl(): keep the value of ->size we'd used for allocation

commit 0a280962dc6e117e0e4baa668453f753579265d9 upstream.

X-Coverup: just ask spender
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
10 years agoprocfs: fix race between symlink removals and traversals
Al Viro [Sun, 22 Feb 2015 03:16:11 +0000 (22:16 -0500)] 
procfs: fix race between symlink removals and traversals

commit 7e0e953bb0cf649f93277ac8fb67ecbb7f7b04a9 upstream.

use_pde()/unuse_pde() in ->follow_link()/->put_link() resp.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
10 years agodebugfs: leave freeing a symlink body until inode eviction
Al Viro [Sun, 22 Feb 2015 03:05:11 +0000 (22:05 -0500)] 
debugfs: leave freeing a symlink body until inode eviction

commit 0db59e59299f0b67450c5db21f7f316c8fb04e84 upstream.

As it is, we have debugfs_remove() racing with symlink traversals.
Supply ->evict_inode() and do freeing there - inode will remain
pinned until we are done with the symlink body.

And rip the idiocy with checking if dentry is positive right after
we'd verified debugfs_positive(), which is a stronger check...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
10 years agoblk-throttle: check stats_cpu before reading it from sysfs
Thadeu Lima de Souza Cascardo [Mon, 16 Feb 2015 19:16:45 +0000 (17:16 -0200)] 
blk-throttle: check stats_cpu before reading it from sysfs

commit 045c47ca306acf30c740c285a77a4b4bda6be7c5 upstream.

When reading blkio.throttle.io_serviced in a recently created blkio
cgroup, it's possible to race against the creation of a throttle policy,
which delays the allocation of stats_cpu.

Like other functions in the throttle code, just checking for a NULL
stats_cpu prevents the following oops caused by that race.

[ 1117.285199] Unable to handle kernel paging request for data at address 0x7fb4d0020
[ 1117.285252] Faulting instruction address: 0xc0000000003efa2c
[ 1137.733921] Oops: Kernel access of bad area, sig: 11 [#1]
[ 1137.733945] SMP NR_CPUS=2048 NUMA PowerNV
[ 1137.734025] Modules linked in: bridge stp llc kvm_hv kvm binfmt_misc autofs4
[ 1137.734102] CPU: 3 PID: 5302 Comm: blkcgroup Not tainted 3.19.0 #5
[ 1137.734132] task: c000000f1d188b00 ti: c000000f1d210000 task.ti: c000000f1d210000
[ 1137.734167] NIP: c0000000003efa2c LR: c0000000003ef9f0 CTR: c0000000003ef980
[ 1137.734202] REGS: c000000f1d213500 TRAP: 0300   Not tainted  (3.19.0)
[ 1137.734230] MSR: 9000000000009032 <SF,HV,EE,ME,IR,DR,RI>  CR: 42008884  XER: 20000000
[ 1137.734325] CFAR: 0000000000008458 DAR: 00000007fb4d0020 DSISR: 40000000 SOFTE: 0
GPR00: c0000000003ed3a0 c000000f1d213780 c000000000c59538 0000000000000000
GPR04: 0000000000000800 0000000000000000 0000000000000000 0000000000000000
GPR08: ffffffffffffffff 00000007fb4d0020 00000007fb4d0000 c000000000780808
GPR12: 0000000022000888 c00000000fdc0d80 0000000000000000 0000000000000000
GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
GPR20: 000001003e120200 c000000f1d5b0cc0 0000000000000200 0000000000000000
GPR24: 0000000000000001 c000000000c269e0 0000000000000020 c000000f1d5b0c80
GPR28: c000000000ca3a08 c000000000ca3dec c000000f1c667e00 c000000f1d213850
[ 1137.734886] NIP [c0000000003efa2c] .tg_prfill_cpu_rwstat+0xac/0x180
[ 1137.734915] LR [c0000000003ef9f0] .tg_prfill_cpu_rwstat+0x70/0x180
[ 1137.734943] Call Trace:
[ 1137.734952] [c000000f1d213780] [d000000005560520] 0xd000000005560520 (unreliable)
[ 1137.734996] [c000000f1d2138a0] [c0000000003ed3a0] .blkcg_print_blkgs+0xe0/0x1a0
[ 1137.735039] [c000000f1d213960] [c0000000003efb50] .tg_print_cpu_rwstat+0x50/0x70
[ 1137.735082] [c000000f1d2139e0] [c000000000104b48] .cgroup_seqfile_show+0x58/0x150
[ 1137.735125] [c000000f1d213a70] [c0000000002749dc] .kernfs_seq_show+0x3c/0x50
[ 1137.735161] [c000000f1d213ae0] [c000000000218630] .seq_read+0xe0/0x510
[ 1137.735197] [c000000f1d213bd0] [c000000000275b04] .kernfs_fop_read+0x164/0x200
[ 1137.735240] [c000000f1d213c80] [c0000000001eb8e0] .__vfs_read+0x30/0x80
[ 1137.735276] [c000000f1d213cf0] [c0000000001eb9c4] .vfs_read+0x94/0x1b0
[ 1137.735312] [c000000f1d213d90] [c0000000001ebb38] .SyS_read+0x58/0x100
[ 1137.735349] [c000000f1d213e30] [c000000000009218] syscall_exit+0x0/0x98
[ 1137.735383] Instruction dump:
[ 1137.735405] 7c6307b4 7f891800 409d00b8 60000000 60420000 3d420004 392a63b0 786a1f24
[ 1137.735471] 7d49502a e93e01c8 7d495214 7d2ad214 <7cead02ae9090008 e9490010 e9290018

And here is one code that allows to easily reproduce this, although this
has first been found by running docker.

void run(pid_t pid)
{
int n;
int status;
int fd;
char *buffer;
buffer = memalign(BUFFER_ALIGN, BUFFER_SIZE);
n = snprintf(buffer, BUFFER_SIZE, "%d\n", pid);
fd = open(CGPATH "/test/tasks", O_WRONLY);
write(fd, buffer, n);
close(fd);
if (fork() > 0) {
fd = open("/dev/sda", O_RDONLY | O_DIRECT);
read(fd, buffer, 512);
close(fd);
wait(&status);
} else {
fd = open(CGPATH "/test/blkio.throttle.io_serviced", O_RDONLY);
n = read(fd, buffer, BUFFER_SIZE);
close(fd);
}
free(buffer);
exit(0);
}

void test(void)
{
int status;
mkdir(CGPATH "/test", 0666);
if (fork() > 0)
wait(&status);
else
run(getpid());
rmdir(CGPATH "/test");
}

int main(int argc, char **argv)
{
int i;
for (i = 0; i < NR_TESTS; i++)
test();
return 0;
}

Reported-by: Ricardo Marin Matinata <rmm@br.ibm.com>
Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@linux.vnet.ibm.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
10 years agokdb: fix incorrect counts in KDB summary command output
Jay Lan [Mon, 29 Sep 2014 22:36:57 +0000 (15:36 -0700)] 
kdb: fix incorrect counts in KDB summary command output

commit 146755923262037fc4c54abc28c04b1103f3cc51 upstream.

The output of KDB 'summary' command should report MemTotal, MemFree
and Buffers output in kB. Current codes report in unit of pages.

A define of K(x) as
is defined in the code, but not used.

This patch would apply the define to convert the values to kB.
Please include me on Cc on replies. I do not subscribe to linux-kernel.

Signed-off-by: Jay Lan <jlan@sgi.com>
Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
10 years agoMIPS: Export MSA functions used by lose_fpu(1) for KVM
James Hogan [Tue, 10 Feb 2015 10:03:00 +0000 (10:03 +0000)] 
MIPS: Export MSA functions used by lose_fpu(1) for KVM

commit ca5d25642e212f73492d332d95dc90ef46a0e8dc upstream.

Export the _save_msa asm function used by the lose_fpu(1) macro to GPL
modules so that KVM can make use of it when it is built as a module.

This fixes the following build error when CONFIG_KVM=m and
CONFIG_CPU_HAS_MSA=y due to commit f798217dfd03 ("KVM: MIPS: Don't leak
FPU/DSP to guest"):

ERROR: "_save_msa" [arch/mips/kvm/kvm.ko] undefined!

Fixes: f798217dfd03 (KVM: MIPS: Don't leak FPU/DSP to guest)
Signed-off-by: James Hogan <james.hogan@imgtec.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Paul Burton <paul.burton@imgtec.com>
Cc: Gleb Natapov <gleb@kernel.org>
Cc: kvm@vger.kernel.org
Cc: linux-mips@linux-mips.org
Patchwork: https://patchwork.linux-mips.org/patch/9261/
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
10 years agoMIPS: Export FP functions used by lose_fpu(1) for KVM
James Hogan [Tue, 10 Feb 2015 10:02:59 +0000 (10:02 +0000)] 
MIPS: Export FP functions used by lose_fpu(1) for KVM

commit 3ce465e04bfd8de9956d515d6e9587faac3375dc upstream.

Export the _save_fp asm function used by the lose_fpu(1) macro to GPL
modules so that KVM can make use of it when it is built as a module.

This fixes the following build error when CONFIG_KVM=m due to commit
f798217dfd03 ("KVM: MIPS: Don't leak FPU/DSP to guest"):

ERROR: "_save_fp" [arch/mips/kvm/kvm.ko] undefined!

Signed-off-by: James Hogan <james.hogan@imgtec.com>
Fixes: f798217dfd03 (KVM: MIPS: Don't leak FPU/DSP to guest)
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Paul Burton <paul.burton@imgtec.com>
Cc: Gleb Natapov <gleb@kernel.org>
Cc: kvm@vger.kernel.org
Cc: linux-mips@linux-mips.org
Patchwork: https://patchwork.linux-mips.org/patch/9260/
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
10 years agolibceph: fix double __remove_osd() problem
Ilya Dryomov [Tue, 17 Feb 2015 16:37:15 +0000 (19:37 +0300)] 
libceph: fix double __remove_osd() problem

commit 7eb71e0351fbb1b242ae70abb7bb17107fe2f792 upstream.

It turns out it's possible to get __remove_osd() called twice on the
same OSD.  That doesn't sit well with rb_erase() - depending on the
shape of the tree we can get a NULL dereference, a soft lockup or
a random crash at some point in the future as we end up touching freed
memory.  One scenario that I was able to reproduce is as follows:

            <osd3 is idle, on the osd lru list>
<con reset - osd3>
con_fault_finish()
  osd_reset()
                              <osdmap - osd3 down>
                              ceph_osdc_handle_map()
                                <takes map_sem>
                                kick_requests()
                                  <takes request_mutex>
                                  reset_changed_osds()
                                    __reset_osd()
                                      __remove_osd()
                                  <releases request_mutex>
                                <releases map_sem>
    <takes map_sem>
    <takes request_mutex>
    __kick_osd_requests()
      __reset_osd()
        __remove_osd() <-- !!!

A case can be made that osd refcounting is imperfect and reworking it
would be a proper resolution, but for now Sage and I decided to fix
this by adding a safe guard around __remove_osd().

Fixes: http://tracker.ceph.com/issues/8087
Cc: Sage Weil <sage@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Sage Weil <sage@redhat.com>
Reviewed-by: Alex Elder <elder@linaro.org>
Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
10 years agolibceph: change from BUG to WARN for __remove_osd() asserts
Ilya Dryomov [Wed, 5 Nov 2014 16:33:44 +0000 (19:33 +0300)] 
libceph: change from BUG to WARN for __remove_osd() asserts

commit cc9f1f518cec079289d11d732efa490306b1ddad upstream.

No reason to use BUG_ON for osd request list assertions.

Signed-off-by: Ilya Dryomov <idryomov@redhat.com>
Reviewed-by: Alex Elder <elder@linaro.org>
Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
10 years agolibceph: assert both regular and lingering lists in __remove_osd()
Ilya Dryomov [Wed, 18 Jun 2014 09:02:12 +0000 (13:02 +0400)] 
libceph: assert both regular and lingering lists in __remove_osd()

commit 7c6e6fc53e7335570ed82f77656cedce1502744e upstream.

It is important that both regular and lingering requests lists are
empty when the OSD is removed.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
10 years agox86, mm/ASLR: Fix stack randomization on 64-bit systems
Hector Marco-Gisbert [Sat, 14 Feb 2015 17:33:50 +0000 (09:33 -0800)] 
x86, mm/ASLR: Fix stack randomization on 64-bit systems

commit 4e7c22d447bb6d7e37bfe39ff658486ae78e8d77 upstream.

The issue is that the stack for processes is not properly randomized on
64 bit architectures due to an integer overflow.

The affected function is randomize_stack_top() in file
"fs/binfmt_elf.c":

  static unsigned long randomize_stack_top(unsigned long stack_top)
  {
           unsigned int random_variable = 0;

           if ((current->flags & PF_RANDOMIZE) &&
                   !(current->personality & ADDR_NO_RANDOMIZE)) {
                   random_variable = get_random_int() & STACK_RND_MASK;
                   random_variable <<= PAGE_SHIFT;
           }
           return PAGE_ALIGN(stack_top) + random_variable;
           return PAGE_ALIGN(stack_top) - random_variable;
  }

Note that, it declares the "random_variable" variable as "unsigned int".
Since the result of the shifting operation between STACK_RND_MASK (which
is 0x3fffff on x86_64, 22 bits) and PAGE_SHIFT (which is 12 on x86_64):

  random_variable <<= PAGE_SHIFT;

then the two leftmost bits are dropped when storing the result in the
"random_variable". This variable shall be at least 34 bits long to hold
the (22+12) result.

These two dropped bits have an impact on the entropy of process stack.
Concretely, the total stack entropy is reduced by four: from 2^28 to
2^30 (One fourth of expected entropy).

This patch restores back the entropy by correcting the types involved
in the operations in the functions randomize_stack_top() and
stack_maxrandom_size().

The successful fix can be tested with:

  $ for i in `seq 1 10`; do cat /proc/self/maps | grep stack; done
  7ffeda566000-7ffeda587000 rw-p 00000000 00:00 0                          [stack]
  7fff5a332000-7fff5a353000 rw-p 00000000 00:00 0                          [stack]
  7ffcdb7a1000-7ffcdb7c2000 rw-p 00000000 00:00 0                          [stack]
  7ffd5e2c4000-7ffd5e2e5000 rw-p 00000000 00:00 0                          [stack]
  ...

Once corrected, the leading bytes should be between 7ffc and 7fff,
rather than always being 7fff.

Signed-off-by: Hector Marco-Gisbert <hecmargi@upv.es>
Signed-off-by: Ismael Ripoll <iripoll@upv.es>
[ Rebased, fixed 80 char bugs, cleaned up commit message, added test example and CVE ]
Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Fixes: CVE-2015-1593
Link: http://lkml.kernel.org/r/20150214173350.GA18393@www.outflux.net
Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
10 years agocpufreq: s3c: remove incorrect __init annotations
Arnd Bergmann [Wed, 18 Feb 2015 20:55:03 +0000 (21:55 +0100)] 
cpufreq: s3c: remove incorrect __init annotations

commit 61882b63171736571e1139ab5aa929e3bb336016 upstream.

The two functions s3c2416_cpufreq_driver_init and s3c_cpufreq_register
are marked init but are called from a context that might be run after
the __init sections are discarded, as the compiler points out:

WARNING: vmlinux.o(.data+0x1ad9dc): Section mismatch in reference from the variable s3c2416_cpufreq_driver to the function .init.text:s3c2416_cpufreq_driver_init()
WARNING: drivers/built-in.o(.text+0x35b5dc): Section mismatch in reference from the function s3c2410a_cpufreq_add() to the function .init.text:s3c_cpufreq_register()

This removes the __init markings.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
10 years agodm snapshot: fix a possible invalid memory access on unload
Mikulas Patocka [Tue, 17 Feb 2015 19:34:00 +0000 (14:34 -0500)] 
dm snapshot: fix a possible invalid memory access on unload

commit 22aa66a3ee5b61e0f4a0bfeabcaa567861109ec3 upstream.

When the snapshot target is unloaded, snapshot_dtr() waits until
pending_exceptions_count drops to zero.  Then, it destroys the snapshot.
Therefore, the function that decrements pending_exceptions_count
should not touch the snapshot structure after the decrement.

pending_complete() calls free_pending_exception(), which decrements
pending_exceptions_count, and then it performs up_write(&s->lock) and it
calls retry_origin_bios() which dereferences  s->origin.  These two
memory accesses to the fields of the snapshot may touch the dm_snapshot
struture after it is freed.

This patch moves the call to free_pending_exception() to the end of
pending_complete(), so that the snapshot will not be destroyed while
pending_complete() is in progress.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
10 years agodm: fix a race condition in dm_get_md
Mikulas Patocka [Tue, 17 Feb 2015 19:30:53 +0000 (14:30 -0500)] 
dm: fix a race condition in dm_get_md

commit 2bec1f4a8832e74ebbe859f176d8a9cb20dd97f4 upstream.

The function dm_get_md finds a device mapper device with a given dev_t,
increases the reference count and returns the pointer.

dm_get_md calls dm_find_md, dm_find_md takes _minor_lock, finds the
device, tests that the device doesn't have DMF_DELETING or DMF_FREEING
flag, drops _minor_lock and returns pointer to the device. dm_get_md then
calls dm_get. dm_get calls BUG if the device has the DMF_FREEING flag,
otherwise it increments the reference count.

There is a possible race condition - after dm_find_md exits and before
dm_get is called, there are no locks held, so the device may disappear or
DMF_FREEING flag may be set, which results in BUG.

To fix this bug, we need to call dm_get while we hold _minor_lock. This
patch renames dm_find_md to dm_get_md and changes it so that it calls
dm_get while holding the lock.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
10 years agontp: Fixup adjtimex freq validation on 32-bit systems
John Stultz [Tue, 10 Feb 2015 07:30:36 +0000 (23:30 -0800)] 
ntp: Fixup adjtimex freq validation on 32-bit systems

commit 29183a70b0b828500816bd794b3fe192fce89f73 upstream.

Additional validation of adjtimex freq values to avoid
potential multiplication overflows were added in commit
5e5aeb4367b (time: adjtimex: Validate the ADJ_FREQUENCY values)

Unfortunately the patch used LONG_MAX/MIN instead of
LLONG_MAX/MIN, which was fine on 64-bit systems, but being
much smaller on 32-bit systems caused false positives
resulting in most direct frequency adjustments to fail w/
EINVAL.

ntpd only does direct frequency adjustments at startup, so
the issue was not as easily observed there, but other time
sync applications like ptpd and chrony were more effected by
the bug.

See bugs:

  https://bugzilla.kernel.org/show_bug.cgi?id=92481
  https://bugzilla.redhat.com/show_bug.cgi?id=1188074

This patch changes the checks to use LLONG_MAX for
clarity, and additionally the checks are disabled
on 32-bit systems since LLONG_MAX/PPM_SCALE is always
larger then the 32-bit long freq value, so multiplication
overflows aren't possible there.

Reported-by: Josh Boyer <jwboyer@fedoraproject.org>
Reported-by: George Joseph <george.joseph@fairview5.com>
Tested-by: George Joseph <george.joseph@fairview5.com>
Signed-off-by: John Stultz <john.stultz@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Sasha Levin <sasha.levin@oracle.com>
Link: http://lkml.kernel.org/r/1423553436-29747-1-git-send-email-john.stultz@linaro.org
[ Prettified the changelog and the comments a bit. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
10 years agotime: adjtimex: Validate the ADJ_FREQUENCY values
Sasha Levin [Thu, 4 Dec 2014 00:25:05 +0000 (19:25 -0500)] 
time: adjtimex: Validate the ADJ_FREQUENCY values

commit 5e5aeb4367b450a28f447f6d5ab57d8f2ab16a5f upstream.

Verify that the frequency value from userspace is valid and makes sense.

Unverified values can cause overflows later on.

Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
[jstultz: Fix up bug for negative values and drop redunent cap check]
Signed-off-by: John Stultz <john.stultz@linaro.org>
Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
10 years agolocking/rtmutex: Avoid a NULL pointer dereference on deadlock
Sebastian Andrzej Siewior [Tue, 17 Feb 2015 15:43:43 +0000 (16:43 +0100)] 
locking/rtmutex: Avoid a NULL pointer dereference on deadlock

commit 8d1e5a1a1ccf5ae9d8a5a0ee7960202ccb0c5429 upstream.

With task_blocks_on_rt_mutex() returning early -EDEADLK we never
add the waiter to the waitqueue. Later, we try to remove it via
remove_waiter() and go boom in rt_mutex_top_waiter() because
rb_entry() gives a NULL pointer.

( Tested on v3.18-RT where rtmutex is used for regular mutex and I
  tried to get one twice in a row. )

Not sure when this started but I guess 397335f004f4 ("rtmutex: Fix
deadlock detector for real") or commit 3d5c9340d194 ("rtmutex:
Handle deadlock detection smarter").

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1424187823-19600-1-git-send-email-bigeasy@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
[ luis: backported to 3.16: adjusted context ]
Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
10 years agomd/raid5: Fix livelock when array is both resyncing and degraded.
NeilBrown [Wed, 18 Feb 2015 00:35:14 +0000 (11:35 +1100)] 
md/raid5: Fix livelock when array is both resyncing and degraded.

commit 26ac107378c4742978216be1005b7291b799c7b2 upstream.

Commit a7854487cd7128a30a7f4f5259de9f67d5efb95f:
  md: When RAID5 is dirty, force reconstruct-write instead of read-modify-write.

Causes an RCW cycle to be forced even when the array is degraded.
A degraded array cannot support RCW as that requires reading all data
blocks, and one may be missing.

Forcing an RCW when it is not possible causes a live-lock and the code
spins, repeatedly deciding to do something that cannot succeed.

So change the condition to only force RCW on non-degraded arrays.

Reported-by: Manibalan P <pmanibalan@amiindia.co.in>
Bisected-by: Jes Sorensen <Jes.Sorensen@redhat.com>
Tested-by: Jes Sorensen <Jes.Sorensen@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Fixes: a7854487cd7128a30a7f4f5259de9f67d5efb95f
Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
10 years agoMIPS: kernel: cps-vec: Replace "addi" with "addiu"
Markos Chandras [Mon, 24 Nov 2014 14:40:11 +0000 (14:40 +0000)] 
MIPS: kernel: cps-vec: Replace "addi" with "addiu"

commit acac4108df6029c03195513ead7073bbb0cb9718 upstream.

The "addi" instruction will trap on overflows which is not something
we need in this code, so we replace that with "addiu".

Link: http://www.linux-mips.org/archives/linux-mips/2015-01/msg00430.html
Cc: Maciej W. Rozycki <macro@linux-mips.org>
Cc: Paul Burton <paul.burton@imgtec.com>
Signed-off-by: Markos Chandras <markos.chandras@imgtec.com>
Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
10 years agoMIPS: asm: asmmacro: Replace "add" instructions with "addu"
Markos Chandras [Wed, 5 Nov 2014 14:17:52 +0000 (14:17 +0000)] 
MIPS: asm: asmmacro: Replace "add" instructions with "addu"

commit 98a833c1fa4de0695830f77b2d13fd86693da298 upstream.

The "add" instruction is actually a macro in binutils and depending on
the size of the immediate it can expand to an "addi" instruction.
However, the "addi" instruction traps on overflows which is not
something we want on address calculation.

Link: http://www.linux-mips.org/archives/linux-mips/2015-01/msg00121.html
Cc: Paul Burton <paul.burton@imgtec.com>
Cc: Maciej W. Rozycki <macro@linux-mips.org>
Signed-off-by: Markos Chandras <markos.chandras@imgtec.com>
[ luis: backported to 3.16: adjusted context ]
Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
10 years agoEDAC, amd64_edac: Prevent OOPS with >16 memory controllers
Daniel J Blueman [Tue, 17 Feb 2015 03:34:38 +0000 (11:34 +0800)] 
EDAC, amd64_edac: Prevent OOPS with >16 memory controllers

commit 0c510cc83bdbaac8406f4f7caef34f4da0ba35ea upstream.

When DRAM errors occur on memory controllers after EDAC_MAX_MCS (16),
the kernel fatally dereferences unallocated structures, see splat below;
this occurs on at least NumaConnect systems.

Fix by checking if a memory controller info structure was found.

BUG: unable to handle kernel NULL pointer dereference at 0000000000000320
IP: [<ffffffff819f714f>] decode_bus_error+0x2f/0x2b0
PGD 2f8b5a3067 PUD 2f8b5a2067 PMD 0
Oops: 0000 [#2] SMP
Modules linked in:
CPU: 224 PID: 11930 Comm: stream_c.exe.gn Tainted: G   D    3.19.0 #1
Hardware name: Supermicro H8QGL/H8QGL, BIOS 3.5b    01/28/2015
task: ffff8807dbfb8c00 ti: ffff8807dd16c000 task.ti: ffff8807dd16c000
RIP: 0010:[<ffffffff819f714f>] [<ffffffff819f714f>] decode_bus_error+0x2f/0x2b0
RSP: 0000:ffff8907dfc03c48 EFLAGS: 00010297
RAX: 0000000000000001 RBX: 9c67400010080a13 RCX: 0000000000001dc6
RDX: 000000001dc61dc6 RSI: ffff8907dfc03df0 RDI: 000000000000001c
RBP: ffff8907dfc03ce8 R08: 0000000000000000 R09: 0000000000000022
R10: ffff891fffa30380 R11: 00000000001cfc90 R12: 0000000000000008
R13: 0000000000000000 R14: 000000000000001c R15: 00009c6740001000
FS: 00007fa97ee18700(0000) GS:ffff8907dfc00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000320 CR3: 0000003f889b8000 CR4: 00000000000407e0
Stack:
 0000000000000000 ffff8907dfc03df0 0000000000000008 9c67400010080a13
 000000000000001c 00009c6740001000 ffff8907dfc03c88 ffffffff810e4f9a
 ffff8907dfc03ce8 ffffffff81b375b9 0000000000000000 0000000000000010
Call Trace:
 <IRQ>
 ? vprintk_default
 ? printk
 amd_decode_mce
 notifier_call_chain
 atomic_notifier_call_chain
 mce_log
 machine_check_poll
 mce_timer_fn
 ? mce_cpu_restart
 call_timer_fn.isra.29
 run_timer_softirq
 __do_softirq
 irq_exit
 smp_apic_timer_interrupt
 apic_timer_interrupt
 <EOI>
 ? down_read_trylock
 __do_page_fault
 ? __schedule
 do_page_fault
 page_fault

Signed-off-by: Daniel J Blueman <daniel@numascale.com>
Link: http://lkml.kernel.org/r/1424144078-24589-1-git-send-email-daniel@numascale.com
[ Boris: massage commit message ]
Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
10 years agoIB/qib: Do not write EEPROM
Mitko Haralanov [Fri, 16 Jan 2015 13:55:27 +0000 (08:55 -0500)] 
IB/qib: Do not write EEPROM

commit 18c0b82a3e4501511b08d0e8676fb08ac08734a3 upstream.

This changeset removes all the code that allows the driver to write to
the EEPROM and update the recorded error counters and power on hours.

These two stats are unused and writing them exposes a timing risk
which could leave the EEPROM in a bad state preventing further normal
operation of the HCA.

Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Mitko Haralanov <mitko.haralanov@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
10 years agosg: fix read() error reporting
Tony Battersby [Wed, 11 Feb 2015 16:32:06 +0000 (11:32 -0500)] 
sg: fix read() error reporting

commit 3b524a683af8991b4eab4182b947c65f0ce1421b upstream.

Fix SCSI generic read() incorrectly returning success after detecting an
error.

Signed-off-by: Tony Battersby <tonyb@cybernetics.com>
Acked-by: Douglas Gilbert <dgilbert@interlog.com>
Signed-off-by: James Bottomley <JBottomley@Parallels.com>
Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
10 years agofixed invalid assignment of 64bit mask to host dma_boundary for scatter gather segmen...
Minh Duc Tran [Mon, 9 Feb 2015 18:54:09 +0000 (18:54 +0000)] 
fixed invalid assignment of 64bit mask to host dma_boundary for scatter gather segment boundary limit.

commit f76a610a8b4b6280eaedf48f3af9d5d74e418b66 upstream.

In reference to bug https://bugzilla.redhat.com/show_bug.cgi?id=1097141
Assert is seen with AMD cpu whenever calling pci_alloc_consistent.

[   29.406183] ------------[ cut here ]------------
[   29.410505] kernel BUG at lib/iommu-helper.c:13!

Signed-off-by: Minh Tran <minh.tran@emulex.com>
Fixes: 6733b39a1301b0b020bbcbf3295852e93e624cb1
Signed-off-by: James Bottomley <JBottomley@Parallels.com>
Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
10 years agoAdded Little Endian support to vtpm module
honclo [Fri, 13 Feb 2015 02:02:24 +0000 (21:02 -0500)] 
Added Little Endian support to vtpm module

commit eb71f8a5e33fa1066fb92f0111ab366a341e1f6c upstream.

The tpm_ibmvtpm module is affected by an unaligned access problem.
ibmvtpm_crq_get_version failed with rc=-4 during boot when vTPM is
enabled in Power partition, which supports both little endian and
big endian modes.

We added little endian support to fix this problem:
1) added cpu_to_be64 calls to ensure BE data is sent from an LE OS.
2) added be16_to_cpu and be32_to_cpu calls to make sure data received
   is in LE format on a LE OS.

Signed-off-by: Hon Ching(Vicky) Lo <honclo@linux.vnet.ibm.com>
Signed-off-by: Joy Latten <jmlatten@linux.vnet.ibm.com>
[phuewe: manually applied the patch :( ]
Reviewed-by: Ashley Lai <ashley@ahsleylai.com>
Signed-off-by: Peter Huewe <peterhuewe@gmx.de>
Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
10 years agoBtrfs: fix fsync data loss after adding hard link to inode
Filipe Manana [Fri, 13 Feb 2015 12:30:56 +0000 (12:30 +0000)] 
Btrfs: fix fsync data loss after adding hard link to inode

commit 1a4bcf470c886b955adf36486f4c86f2441d85cb upstream.

We have a scenario where after the fsync log replay we can lose file data
that had been previously fsync'ed if we added an hard link for our inode
and after that we sync'ed the fsync log (for example by fsync'ing some
other file or directory).

This is because when adding an hard link we updated the inode item in the
log tree with an i_size value of 0. At that point the new inode item was
in memory only and a subsequent fsync log replay would not make us lose
the file data. However if after adding the hard link we sync the log tree
to disk, by fsync'ing some other file or directory for example, we ended
up losing the file data after log replay, because the inode item in the
persisted log tree had an an i_size of zero.

This is easy to reproduce, and the following excerpt from my test for
xfstests shows this:

  _scratch_mkfs >> $seqres.full 2>&1
  _init_flakey
  _mount_flakey

  # Create one file with data and fsync it.
  # This made the btrfs fsync log persist the data and the inode metadata with
  # a correct inode->i_size (4096 bytes).
  $XFS_IO_PROG -f -c "pwrite -S 0xaa -b 4K 0 4K" -c "fsync" \
       $SCRATCH_MNT/foo | _filter_xfs_io

  # Now add one hard link to our file. This made the btrfs code update the fsync
  # log, in memory only, with an inode metadata having a size of 0.
  ln $SCRATCH_MNT/foo $SCRATCH_MNT/foo_link

  # Now force persistence of the fsync log to disk, for example, by fsyncing some
  # other file.
  touch $SCRATCH_MNT/bar
  $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/bar

  # Before a power loss or crash, we could read the 4Kb of data from our file as
  # expected.
  echo "File content before:"
  od -t x1 $SCRATCH_MNT/foo

  # Simulate a crash/power loss.
  _load_flakey_table $FLAKEY_DROP_WRITES
  _unmount_flakey

  _load_flakey_table $FLAKEY_ALLOW_WRITES
  _mount_flakey

  # After the fsync log replay, because the fsync log had a value of 0 for our
  # inode's i_size, we couldn't read anymore the 4Kb of data that we previously
  # wrote and fsync'ed. The size of the file became 0 after the fsync log replay.
  echo "File content after:"
  od -t x1 $SCRATCH_MNT/foo

Another alternative test, that doesn't need to fsync an inode in the same
transaction it was created, is:

  _scratch_mkfs >> $seqres.full 2>&1
  _init_flakey
  _mount_flakey

  # Create our test file with some data.
  $XFS_IO_PROG -f -c "pwrite -S 0xaa -b 8K 0 8K" \
       $SCRATCH_MNT/foo | _filter_xfs_io

  # Make sure the file is durably persisted.
  sync

  # Append some data to our file, to increase its size.
  $XFS_IO_PROG -f -c "pwrite -S 0xcc -b 4K 8K 4K" \
       $SCRATCH_MNT/foo | _filter_xfs_io

  # Fsync the file, so from this point on if a crash/power failure happens, our
  # new data is guaranteed to be there next time the fs is mounted.
  $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/foo

  # Add one hard link to our file. This made btrfs write into the in memory fsync
  # log a special inode with generation 0 and an i_size of 0 too. Note that this
  # didn't update the inode in the fsync log on disk.
  ln $SCRATCH_MNT/foo $SCRATCH_MNT/foo_link

  # Now make sure the in memory fsync log is durably persisted.
  # Creating and fsync'ing another file will do it.
  touch $SCRATCH_MNT/bar
  $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/bar

  # As expected, before the crash/power failure, we should be able to read the
  # 12Kb of file data.
  echo "File content before:"
  od -t x1 $SCRATCH_MNT/foo

  # Simulate a crash/power loss.
  _load_flakey_table $FLAKEY_DROP_WRITES
  _unmount_flakey

  _load_flakey_table $FLAKEY_ALLOW_WRITES
  _mount_flakey

  # After mounting the fs again, the fsync log was replayed.
  # The btrfs fsync log replay code didn't update the i_size of the persisted
  # inode because the inode item in the log had a special generation with a
  # value of 0 (and it couldn't know the correct i_size, since that inode item
  # had a 0 i_size too). This made the last 4Kb of file data inaccessible and
  # effectively lost.
  echo "File content after:"
  od -t x1 $SCRATCH_MNT/foo

This isn't a new issue/regression. This problem has been around since the
log tree code was added in 2008:

  Btrfs: Add a write ahead tree log to optimize synchronous operations
  (commit e02119d5a7b4396c5a872582fddc8bd6d305a70a)

Test cases for xfstests follow soon.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
10 years agotarget: Check for LBA + sectors wrap-around in sbc_parse_cdb
Nicholas Bellinger [Fri, 13 Feb 2015 22:27:40 +0000 (22:27 +0000)] 
target: Check for LBA + sectors wrap-around in sbc_parse_cdb

commit aa179935edea9a64dec4b757090c8106a3907ffa upstream.

This patch adds a check to sbc_parse_cdb() in order to detect when
an LBA + sector vs. end-of-device calculation wraps when the LBA is
sufficently large enough (eg: 0xFFFFFFFFFFFFFFFF).

Cc: Martin Petersen <martin.petersen@oracle.com>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
10 years agotarget: Add missing WRITE_SAME end-of-device sanity check
Nicholas Bellinger [Fri, 13 Feb 2015 22:09:47 +0000 (22:09 +0000)] 
target: Add missing WRITE_SAME end-of-device sanity check

commit 8e575c50a171f2579e367a7f778f86477dfdaf49 upstream.

This patch adds a check to sbc_setup_write_same() to verify
the incoming WRITE_SAME LBA + number of blocks does not exceed
past the end-of-device.

Also check for potential LBA wrap-around as well.

Reported-by: Bart Van Assche <bart.vanassche@sandisk.com>
Cc: Martin Petersen <martin.petersen@oracle.com>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
10 years agoNFS: struct nfs_commit_info.lock must always point to inode->i_lock
Trond Myklebust [Sat, 14 Feb 2015 02:03:16 +0000 (21:03 -0500)] 
NFS: struct nfs_commit_info.lock must always point to inode->i_lock

commit f4086a3d789dbe18949862276d83b8f49fce6d2f upstream.

Commit 411a99adffb4f (nfs: clear_request_commit while holding i_lock)
assumes that the nfs_commit_info always points to the inode->i_lock.
For historical reasons, that is not the case for O_DIRECT writes.

Cc: Weston Andros Adamson <dros@primarydata.com>
Fixes: 411a99adffb4f ("nfs: clear_request_commit while holding i_lock")
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
10 years agodm io: reject unsupported DISCARD requests with EOPNOTSUPP
Darrick J. Wong [Fri, 13 Feb 2015 19:05:37 +0000 (11:05 -0800)] 
dm io: reject unsupported DISCARD requests with EOPNOTSUPP

commit 37527b869207ad4c208b1e13967d69b8bba1fbf9 upstream.

I created a dm-raid1 device backed by a device that supports DISCARD
and another device that does NOT support DISCARD with the following
dm configuration:

 #  echo '0 2048 mirror core 1 512 2 /dev/sda 0 /dev/sdb 0' | dmsetup create moo
 # lsblk -D
 NAME         DISC-ALN DISC-GRAN DISC-MAX DISC-ZERO
 sda                 0        4K       1G         0
 `-moo (dm-0)        0        4K       1G         0
 sdb                 0        0B       0B         0
 `-moo (dm-0)        0        4K       1G         0

Notice that the mirror device /dev/mapper/moo advertises DISCARD
support even though one of the mirror halves doesn't.

If I issue a DISCARD request (via fstrim, mount -o discard, or ioctl
BLKDISCARD) through the mirror, kmirrord gets stuck in an infinite
loop in do_region() when it tries to issue a DISCARD request to sdb.
The problem is that when we call do_region() against sdb, num_sectors
is set to zero because q->limits.max_discard_sectors is zero.
Therefore, "remaining" never decreases and the loop never terminates.

To fix this: before entering the loop, check for the combination of
REQ_DISCARD and no discard and return -EOPNOTSUPP to avoid hanging up
the mirror device.

This bug was found by the unfortunate coincidence of pvmove and a
discard operation in the RHEL 6.5 kernel; upstream is also affected.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Acked-by: "Martin K. Petersen" <martin.petersen@oracle.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
10 years agodm mirror: do not degrade the mirror on discard error
Mikulas Patocka [Thu, 12 Feb 2015 15:09:20 +0000 (10:09 -0500)] 
dm mirror: do not degrade the mirror on discard error

commit f2ed51ac64611d717d1917820a01930174c2f236 upstream.

It may be possible that a device claims discard support but it rejects
discards with -EOPNOTSUPP.  It happens when using loopback on ext2/ext3
filesystem driven by the ext4 driver.  It may also happen if the
underlying devices are moved from one disk on another.

If discard error happens, we reject the bio with -EOPNOTSUPP, but we do
not degrade the array.

This patch fixes failed test shell/lvconvert-repair-transient.sh in the
lvm2 testsuite if the testsuite is extracted on an ext2 or ext3
filesystem and it is being driven by the ext4 driver.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
10 years agojffs2: fix handling of corrupted summary length
Chen Jie [Tue, 10 Feb 2015 20:49:48 +0000 (12:49 -0800)] 
jffs2: fix handling of corrupted summary length

commit 164c24063a3eadee11b46575c5482b2f1417be49 upstream.

sm->offset maybe wrong but magic maybe right, the offset do not have CRC.

Badness at c00c7580 [verbose debug info unavailable]
NIP: c00c7580 LR: c00c718c CTR: 00000014
REGS: df07bb40 TRAP: 0700   Not tainted  (2.6.34.13-WR4.3.0.0_standard)
MSR: 00029000 <EE,ME,CE>  CR: 22084f84  XER: 00000000
TASK = df84d6e0[908] 'mount' THREAD: df07a000
GPR00: 00000001 df07bbf0 df84d6e0 00000000 00000001 00000000 df07bb58 00000041
GPR08: 00000041 c0638860 00000000 00000010 22084f88 100636c8 df814ff8 00000000
GPR16: df84d6e0 dfa558cc c05adb90 00000048 c0452d30 00000000 000240d0 000040d0
GPR24: 00000014 c05ae734 c05be2e0 00000000 00000001 00000000 00000000 c05ae730
NIP [c00c7580] __alloc_pages_nodemask+0x4d0/0x638
LR [c00c718c] __alloc_pages_nodemask+0xdc/0x638
Call Trace:
[df07bbf0] [c00c718c] __alloc_pages_nodemask+0xdc/0x638 (unreliable)
[df07bc90] [c00c7708] __get_free_pages+0x20/0x48
[df07bca0] [c00f4a40] __kmalloc+0x15c/0x1ec
[df07bcd0] [c01fc880] jffs2_scan_medium+0xa58/0x14d0
[df07bd70] [c01ff38c] jffs2_do_mount_fs+0x1f4/0x6b4
[df07bdb0] [c020144c] jffs2_do_fill_super+0xa8/0x260
[df07bdd0] [c020230c] jffs2_fill_super+0x104/0x184
[df07be00] [c0335814] get_sb_mtd_aux+0x9c/0xec
[df07be20] [c033596c] get_sb_mtd+0x84/0x1e8
[df07be60] [c0201ed0] jffs2_get_sb+0x1c/0x2c
[df07be70] [c0103898] vfs_kern_mount+0x78/0x1e8
[df07bea0] [c0103a58] do_kern_mount+0x40/0x100
[df07bec0] [c011fe90] do_mount+0x240/0x890
[df07bf10] [c0120570] sys_mount+0x90/0xd8
[df07bf40] [c00110d8] ret_from_syscall+0x0/0x4

=== Exception: c01 at 0xff61a34
    LR = 0x100135f0
Instruction dump:
38800005 38600000 48010f41 4bfffe1c 4bfc2d15 4bfffe8c 72e90200 4082fc28
3d20c064 39298860 8809000d 68000001 <0f0000002f800000 419efc0c 38000001
mount: mounting /dev/mtdblock3 on /common failed: Input/output error

Signed-off-by: Chen Jie <chenjie6@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
10 years agoALSA: hdspm - Constrain periods to 2 on older cards
Adrian Knoth [Tue, 10 Feb 2015 10:33:50 +0000 (11:33 +0100)] 
ALSA: hdspm - Constrain periods to 2 on older cards

commit f0153c3d948c1764f6c920a0675d86fc1d75813e upstream.

RME RayDAT and AIO use a fixed buffer size of 16384 samples. With period
sizes of 32-4096, this translates to 4-512 periods.

The older RME cards have a variable buffer size but require exactly two
periods.

This patch enforces nperiods=2 on those cards.

Signed-off-by: Adrian Knoth <adi@drcomp.erfurt.thur.de>
Signed-off-by: Takashi Iwai <tiwai@suse.de>
Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
10 years agodrm/radeon: fix voltage setup on hawaii
Alex Deucher [Thu, 12 Feb 2015 05:40:58 +0000 (00:40 -0500)] 
drm/radeon: fix voltage setup on hawaii

commit 09b6e85fc868568e1b2820235a2a851aecbccfcc upstream.

Missing parameter when fetching the real voltage values
from atom.  Fixes problems with dynamic clocking on
certain boards.

bug:
https://bugs.freedesktop.org/show_bug.cgi?id=87457

Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
10 years agodrm/radeon/dp: Set EDP_CONFIGURATION_SET for bridge chips if necessary
Alex Deucher [Wed, 11 Feb 2015 23:34:36 +0000 (18:34 -0500)] 
drm/radeon/dp: Set EDP_CONFIGURATION_SET for bridge chips if necessary

commit 66c2b84ba6256bc5399eed45582af9ebb3ba2c15 upstream.

Don't restrict it to just eDP panels.  Some LVDS bridge chips require
this.  Fixes blank panels on resume on certain laptops.  Noticed
by mrnuke on IRC.

bug:
https://bugs.freedesktop.org/show_bug.cgi?id=42960

Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
10 years agoARC: fix page address calculation if PAGE_OFFSET != LINUX_LINK_BASE
Alexey Brodkin [Thu, 12 Feb 2015 18:10:11 +0000 (21:10 +0300)] 
ARC: fix page address calculation if PAGE_OFFSET != LINUX_LINK_BASE

commit 06f34e1c28f3608b0ce5b310e41102d3fe7b65a1 upstream.

We used to calculate page address differently in 2 cases:

1. In virt_to_page(x) we do
 --->8---
 mem_map + (x - CONFIG_LINUX_LINK_BASE) >> PAGE_SHIFT
 --->8---

2. In in pte_page(x) we do
 --->8---
 mem_map + (pte_val(x) - PAGE_OFFSET) >> PAGE_SHIFT
 --->8---

That leads to problems in case PAGE_OFFSET != CONFIG_LINUX_LINK_BASE -
different pages will be selected depending on where and how we calculate
page address.

In particular in the STAR 9000853582 when gdb attempted to read memory
of another process it got improper page in get_user_pages() because this
is exactly one of the places where we search for a page by pte_page().

The fix is trivial - we need to calculate page address similarly in both
cases.

Signed-off-by: Alexey Brodkin <abrodkin@synopsys.com>
Signed-off-by: Vineet Gupta <vgupta@synopsys.com>
Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
10 years agoALSA: hda - enable mute led quirk for one more hp machine.
Hui Wang [Fri, 13 Feb 2015 03:14:41 +0000 (11:14 +0800)] 
ALSA: hda - enable mute led quirk for one more hp machine.

commit 7976eb49cbd138d8014fa02682d8f969ad1e9ff2 upstream.

Otherwise, the mute led can't work at all.

Tested-by: Taihsiang Ho <taihsiang.ho@canonical.com>
BugLink: https://bugs.launchpad.net/bugs/1410704
Signed-off-by: Hui Wang <hui.wang@canonical.com>
Signed-off-by: Takashi Iwai <tiwai@suse.de>
[ luis: backported to 3.16: adjusted context ]
Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
10 years agomm: hwpoison: drop lru_add_drain_all() in __soft_offline_page()
Naoya Horiguchi [Thu, 12 Feb 2015 23:00:25 +0000 (15:00 -0800)] 
mm: hwpoison: drop lru_add_drain_all() in __soft_offline_page()

commit 9ab3b598d2dfbdb0153ffa7e4b1456bbff59a25d upstream.

A race condition starts to be visible in recent mmotm, where a PG_hwpoison
flag is set on a migration source page *before* it's back in buddy page
poo= l.

This is problematic because no page flag is supposed to be set when
freeing (see __free_one_page().) So the user-visible effect of this race
is that it could trigger the BUG_ON() when soft-offlining is called.

The root cause is that we call lru_add_drain_all() to make sure that the
page is in buddy, but that doesn't work because this function just
schedule= s a work item and doesn't wait its completion.
drain_all_pages() does drainin= g directly, so simply dropping
lru_add_drain_all() solves this problem.

Fixes: f15bdfa802bf ("mm/memory-failure.c: fix memory leak in successful soft offlining")
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Chen Gong <gong.chen@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
10 years agomm/memory.c: actually remap enough memory
Grazvydas Ignotas [Thu, 12 Feb 2015 23:00:19 +0000 (15:00 -0800)] 
mm/memory.c: actually remap enough memory

commit 9cb12d7b4ccaa976f97ce0c5fd0f1b6a83bc2a75 upstream.

For whatever reason, generic_access_phys() only remaps one page, but
actually allows to access arbitrary size.  It's quite easy to trigger
large reads, like printing out large structure with gdb, which leads to a
crash.  Fix it by remapping correct size.

Fixes: 28b2ee20c7cb ("access_process_vm device memory infrastructure")
Signed-off-by: Grazvydas Ignotas <notasas@gmail.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
10 years agomm/compaction: fix wrong order check in compact_finished()
Joonsoo Kim [Thu, 12 Feb 2015 22:59:50 +0000 (14:59 -0800)] 
mm/compaction: fix wrong order check in compact_finished()

commit 372549c2a3778fd3df445819811c944ad54609ca upstream.

What we want to check here is whether there is highorder freepage in buddy
list of other migratetype in order to steal it without fragmentation.
But, current code just checks cc->order which means allocation request
order.  So, this is wrong.

Without this fix, non-movable synchronous compaction below pageblock order
would not stopped until compaction is complete, because migratetype of
most pageblocks are movable and high order freepage made by compaction is
usually on movable type buddy list.

There is some report related to this bug. See below link.

  http://www.spinics.net/lists/linux-mm/msg81666.html

Although the issued system still has load spike comes from compaction,
this makes that system completely stable and responsive according to his
report.

stress-highalloc test in mmtests with non movable order 7 allocation
doesn't show any notable difference in allocation success rate, but, it
shows more compaction success rate.

Compaction success rate (Compaction success * 100 / Compaction stalls, %)
18.47 : 28.94

Fixes: 1fb3f8ca0e92 ("mm: compaction: capture a suitable high-order page immediately when it is made available")
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: David Rientjes <rientjes@google.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
10 years agotarget: Fix PR_APTPL_BUF_LEN buffer size limitation
Nicholas Bellinger [Thu, 12 Feb 2015 02:34:40 +0000 (18:34 -0800)] 
target: Fix PR_APTPL_BUF_LEN buffer size limitation

commit f161d4b44d7cc1dc66b53365215227db356378b1 upstream.

This patch addresses the original PR_APTPL_BUF_LEN = 8k limitiation
for write-out of PR APTPL metadata that Martin has recently been
running into.

It changes core_scsi3_update_and_write_aptpl() to use vzalloc'ed
memory instead of kzalloc, and increases the default hardcoded
length to 256k.

It also adds logic in core_scsi3_update_and_write_aptpl() to double
the original length upon core_scsi3_update_aptpl_buf() failure, and
retries until the vzalloc'ed buffer is large enough to accommodate
the outgoing APTPL metadata.

Reported-by: Martin Svec <martin.svec@zoner.cz>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
10 years agoiscsi-target: Drop problematic active_ts_list usage
Nicholas Bellinger [Thu, 22 Jan 2015 08:56:53 +0000 (00:56 -0800)] 
iscsi-target: Drop problematic active_ts_list usage

commit 3fd7b60f2c7418239d586e359e0c6d8503e10646 upstream.

This patch drops legacy active_ts_list usage within iscsi_target_tq.c
code.  It was originally used to track the active thread sets during
iscsi-target shutdown, and is no longer used by modern upstream code.

Two people have reported list corruption using traditional iscsi-target
and iser-target with the following backtrace, that appears to be related
to iscsi_thread_set->ts_list being used across both active_ts_list and
inactive_ts_list.

[   60.782534] ------------[ cut here ]------------
[   60.782543] WARNING: CPU: 0 PID: 9430 at lib/list_debug.c:53 __list_del_entry+0x63/0xd0()
[   60.782545] list_del corruption, ffff88045b00d180->next is LIST_POISON1 (dead000000100100)
[   60.782546] Modules linked in: ib_srpt tcm_qla2xxx qla2xxx tcm_loop tcm_fc libfc scsi_transport_fc scsi_tgt ib_isert rdma_cm iw_cm ib_addr iscsi_target_mod target_core_pscsi target_core_file target_core_iblock target_core_mod configfs ebtable_nat ebtables ipt_MASQUERADE iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 ipt_REJECT xt_CHECKSUM iptable_mangle iptable_filter ip_tables bridge stp llc autofs4 sunrpc ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables ipv6 ib_ipoib ib_cm ib_uverbs ib_umad mlx4_en mlx4_ib ib_sa ib_mad ib_core mlx4_core dm_mirror dm_region_hash dm_log dm_mod vhost_net macvtap macvlan vhost tun kvm_intel kvm uinput iTCO_wdt iTCO_vendor_support microcode serio_raw pcspkr sb_edac edac_core sg i2c_i801 lpc_ich mfd_core mtip32xx igb i2c_algo_bit i2c_core ptp pps_core ioatdma dca wmi ext3(F) jbd(F) mbcache(F) sd_mod(F) crc_t10dif(F) crct10dif_common(F) ahci(F) libahci(F) isci(F) libsas(F) scsi_transport_sas(F) [last unloaded: speedstep_lib]
[   60.782597] CPU: 0 PID: 9430 Comm: iscsi_ttx Tainted: GF 3.12.19+ #2
[   60.782598] Hardware name: Supermicro X9DRX+-F/X9DRX+-F, BIOS 3.00 07/09/2013
[   60.782599]  0000000000000035 ffff88044de31d08 ffffffff81553ae7 0000000000000035
[   60.782602]  ffff88044de31d58 ffff88044de31d48 ffffffff8104d1cc 0000000000000002
[   60.782605]  ffff88045b00d180 ffff88045b00d0c0 ffff88045b00d0c0 ffff88044de31e58
[   60.782607] Call Trace:
[   60.782611]  [<ffffffff81553ae7>] dump_stack+0x49/0x62
[   60.782615]  [<ffffffff8104d1cc>] warn_slowpath_common+0x8c/0xc0
[   60.782618]  [<ffffffff8104d2b6>] warn_slowpath_fmt+0x46/0x50
[   60.782620]  [<ffffffff81280933>] __list_del_entry+0x63/0xd0
[   60.782622]  [<ffffffff812809b1>] list_del+0x11/0x40
[   60.782630]  [<ffffffffa06e7cf9>] iscsi_del_ts_from_active_list+0x29/0x50 [iscsi_target_mod]
[   60.782635]  [<ffffffffa06e87b1>] iscsi_tx_thread_pre_handler+0xa1/0x180 [iscsi_target_mod]
[   60.782642]  [<ffffffffa06fb9ae>] iscsi_target_tx_thread+0x4e/0x220 [iscsi_target_mod]
[   60.782647]  [<ffffffffa06fb960>] ? iscsit_handle_snack+0x190/0x190 [iscsi_target_mod]
[   60.782652]  [<ffffffffa06fb960>] ? iscsit_handle_snack+0x190/0x190 [iscsi_target_mod]
[   60.782655]  [<ffffffff8106f99e>] kthread+0xce/0xe0
[   60.782657]  [<ffffffff8106f8d0>] ? kthread_freezable_should_stop+0x70/0x70
[   60.782660]  [<ffffffff8156026c>] ret_from_fork+0x7c/0xb0
[   60.782662]  [<ffffffff8106f8d0>] ? kthread_freezable_should_stop+0x70/0x70
[   60.782663] ---[ end trace 9662f4a661d33965 ]---

Since this code is no longer used, go ahead and drop the problematic usage
all-together.

Reported-by: Gavin Guo <gavin.guo@canonical.com>
Reported-by: Moussa Ba <moussaba@micron.com>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
10 years agomm/nommu.c: fix arithmetic overflow in __vm_enough_memory()
Roman Gushchin [Wed, 11 Feb 2015 23:28:42 +0000 (15:28 -0800)] 
mm/nommu.c: fix arithmetic overflow in __vm_enough_memory()

commit 8138a67a5557ffea3a21dfd6f037842d4e748513 upstream.

I noticed that "allowed" can easily overflow by falling below 0, because
(total_vm / 32) can be larger than "allowed".  The problem occurs in
OVERCOMMIT_NONE mode.

In this case, a huge allocation can success and overcommit the system
(despite OVERCOMMIT_NONE mode).  All subsequent allocations will fall
(system-wide), so system become unusable.

The problem was masked out by commit c9b1d0981fcc
("mm: limit growth of 3% hardcoded other user reserve"),
but it's easy to reproduce it on older kernels:
1) set overcommit_memory sysctl to 2
2) mmap() large file multiple times (with VM_SHARED flag)
3) try to malloc() large amount of memory

It also can be reproduced on newer kernels, but miss-configured
sysctl_user_reserve_kbytes is required.

Fix this issue by switching to signed arithmetic here.

Signed-off-by: Roman Gushchin <klamm@yandex-team.ru>
Cc: Andrew Shewmaker <agshew@gmail.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
10 years agomm/mmap.c: fix arithmetic overflow in __vm_enough_memory()
Roman Gushchin [Wed, 11 Feb 2015 23:28:39 +0000 (15:28 -0800)] 
mm/mmap.c: fix arithmetic overflow in __vm_enough_memory()

commit 5703b087dc8eaf47bfb399d6cf512d471beff405 upstream.

I noticed, that "allowed" can easily overflow by falling below 0,
because (total_vm / 32) can be larger than "allowed".  The problem
occurs in OVERCOMMIT_NONE mode.

In this case, a huge allocation can success and overcommit the system
(despite OVERCOMMIT_NONE mode).  All subsequent allocations will fall
(system-wide), so system become unusable.

The problem was masked out by commit c9b1d0981fcc
("mm: limit growth of 3% hardcoded other user reserve"),
but it's easy to reproduce it on older kernels:
1) set overcommit_memory sysctl to 2
2) mmap() large file multiple times (with VM_SHARED flag)
3) try to malloc() large amount of memory

It also can be reproduced on newer kernels, but miss-configured
sysctl_user_reserve_kbytes is required.

Fix this issue by switching to signed arithmetic here.

[akpm@linux-foundation.org: use min_t]
Signed-off-by: Roman Gushchin <klamm@yandex-team.ru>
Cc: Andrew Shewmaker <agshew@gmail.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
10 years agomm: when stealing freepages, also take pages created by splitting buddy page
Vlastimil Babka [Wed, 11 Feb 2015 23:28:15 +0000 (15:28 -0800)] 
mm: when stealing freepages, also take pages created by splitting buddy page

commit 99592d598eca62bdbbf62b59941c189176dfc614 upstream.

When studying page stealing, I noticed some weird looking decisions in
try_to_steal_freepages().  The first I assume is a bug (Patch 1), the
following two patches were driven by evaluation.

Testing was done with stress-highalloc of mmtests, using the
mm_page_alloc_extfrag tracepoint and postprocessing to get counts of how
often page stealing occurs for individual migratetypes, and what
migratetypes are used for fallbacks.  Arguably, the worst case of page
stealing is when UNMOVABLE allocation steals from MOVABLE pageblock.
RECLAIMABLE allocation stealing from MOVABLE allocation is also not ideal,
so the goal is to minimize these two cases.

The evaluation of v2 wasn't always clear win and Joonsoo questioned the
results.  Here I used different baseline which includes RFC compaction
improvements from [1].  I found that the compaction improvements reduce
variability of stress-highalloc, so there's less noise in the data.

First, let's look at stress-highalloc configured to do sync compaction,
and how these patches reduce page stealing events during the test.  First
column is after fresh reboot, other two are reiterations of test without
reboot.  That was all accumulater over 5 re-iterations (so the benchmark
was run 5x3 times with 5 fresh restarts).

Baseline:

                                                   3.19-rc4        3.19-rc4        3.19-rc4
                                                  5-nothp-1       5-nothp-2       5-nothp-3
Page alloc extfrag event                               10264225     8702233    10244125
Extfrag fragmenting                                    10263271     8701552    10243473
Extfrag fragmenting for unmovable                         13595       17616       15960
Extfrag fragmenting unmovable placed with movable          7989       12193        8447
Extfrag fragmenting for reclaimable                         658        1840        1817
Extfrag fragmenting reclaimable placed with movable         558        1677        1679
Extfrag fragmenting for movable                        10249018     8682096    10225696

With Patch 1:
                                                   3.19-rc4        3.19-rc4        3.19-rc4
                                                  6-nothp-1       6-nothp-2       6-nothp-3
Page alloc extfrag event                               11834954     9877523     9774860
Extfrag fragmenting                                    11833993     9876880     9774245
Extfrag fragmenting for unmovable                          7342       16129       11712
Extfrag fragmenting unmovable placed with movable          4191       10547        6270
Extfrag fragmenting for reclaimable                         373        1130         923
Extfrag fragmenting reclaimable placed with movable         302         906         738
Extfrag fragmenting for movable                        11826278     9859621     9761610

With Patch 2:
                                                   3.19-rc4        3.19-rc4        3.19-rc4
                                                  7-nothp-1       7-nothp-2       7-nothp-3
Page alloc extfrag event                                4725990     3668793     3807436
Extfrag fragmenting                                     4725104     3668252     3806898
Extfrag fragmenting for unmovable                          6678        7974        7281
Extfrag fragmenting unmovable placed with movable          2051        3829        4017
Extfrag fragmenting for reclaimable                         429        1208        1278
Extfrag fragmenting reclaimable placed with movable         369         976        1034
Extfrag fragmenting for movable                         4717997     3659070     3798339

With Patch 3:
                                                   3.19-rc4        3.19-rc4        3.19-rc4
                                                  8-nothp-1       8-nothp-2       8-nothp-3
Page alloc extfrag event                                5016183     4700142     3850633
Extfrag fragmenting                                     5015325     4699613     3850072
Extfrag fragmenting for unmovable                          1312        3154        3088
Extfrag fragmenting unmovable placed with movable          1115        2777        2714
Extfrag fragmenting for reclaimable                         437        1193        1097
Extfrag fragmenting reclaimable placed with movable         330         969         879
Extfrag fragmenting for movable                         5013576     4695266     3845887

In v2 we've seen apparent regression with Patch 1 for unmovable events,
this is now gone, suggesting it was indeed noise.  Here, each patch
improves the situation for unmovable events.  Reclaimable is improved by
patch 1 and then either the same modulo noise, or perhaps sligtly worse -
a small price for unmovable improvements, IMHO.  The number of movable
allocations falling back to other migratetypes is most noisy, but it's
reduced to half at Patch 2 nevertheless.  These are least critical as
compaction can move them around.

If we look at success rates, the patches don't affect them, that didn't change.

Baseline:
                             3.19-rc4              3.19-rc4              3.19-rc4
                            5-nothp-1             5-nothp-2             5-nothp-3
Success 1 Min         49.00 (  0.00%)       42.00 ( 14.29%)       41.00 ( 16.33%)
Success 1 Mean        51.00 (  0.00%)       45.00 ( 11.76%)       42.60 ( 16.47%)
Success 1 Max         55.00 (  0.00%)       51.00 (  7.27%)       46.00 ( 16.36%)
Success 2 Min         53.00 (  0.00%)       47.00 ( 11.32%)       44.00 ( 16.98%)
Success 2 Mean        59.60 (  0.00%)       50.80 ( 14.77%)       48.20 ( 19.13%)
Success 2 Max         64.00 (  0.00%)       56.00 ( 12.50%)       52.00 ( 18.75%)
Success 3 Min         84.00 (  0.00%)       82.00 (  2.38%)       78.00 (  7.14%)
Success 3 Mean        85.60 (  0.00%)       82.80 (  3.27%)       79.40 (  7.24%)
Success 3 Max         86.00 (  0.00%)       83.00 (  3.49%)       80.00 (  6.98%)

Patch 1:
                             3.19-rc4              3.19-rc4              3.19-rc4
                            6-nothp-1             6-nothp-2             6-nothp-3
Success 1 Min         49.00 (  0.00%)       44.00 ( 10.20%)       44.00 ( 10.20%)
Success 1 Mean        51.80 (  0.00%)       46.00 ( 11.20%)       45.80 ( 11.58%)
Success 1 Max         54.00 (  0.00%)       49.00 (  9.26%)       49.00 (  9.26%)
Success 2 Min         58.00 (  0.00%)       49.00 ( 15.52%)       48.00 ( 17.24%)
Success 2 Mean        60.40 (  0.00%)       51.80 ( 14.24%)       50.80 ( 15.89%)
Success 2 Max         63.00 (  0.00%)       54.00 ( 14.29%)       55.00 ( 12.70%)
Success 3 Min         84.00 (  0.00%)       81.00 (  3.57%)       79.00 (  5.95%)
Success 3 Mean        85.00 (  0.00%)       81.60 (  4.00%)       79.80 (  6.12%)
Success 3 Max         86.00 (  0.00%)       82.00 (  4.65%)       82.00 (  4.65%)

Patch 2:

                             3.19-rc4              3.19-rc4              3.19-rc4
                            7-nothp-1             7-nothp-2             7-nothp-3
Success 1 Min         50.00 (  0.00%)       44.00 ( 12.00%)       39.00 ( 22.00%)
Success 1 Mean        52.80 (  0.00%)       45.60 ( 13.64%)       42.40 ( 19.70%)
Success 1 Max         55.00 (  0.00%)       46.00 ( 16.36%)       47.00 ( 14.55%)
Success 2 Min         52.00 (  0.00%)       48.00 (  7.69%)       45.00 ( 13.46%)
Success 2 Mean        53.40 (  0.00%)       49.80 (  6.74%)       48.80 (  8.61%)
Success 2 Max         57.00 (  0.00%)       52.00 (  8.77%)       52.00 (  8.77%)
Success 3 Min         84.00 (  0.00%)       81.00 (  3.57%)       79.00 (  5.95%)
Success 3 Mean        85.00 (  0.00%)       82.40 (  3.06%)       79.60 (  6.35%)
Success 3 Max         86.00 (  0.00%)       83.00 (  3.49%)       80.00 (  6.98%)

Patch 3:
                             3.19-rc4              3.19-rc4              3.19-rc4
                            8-nothp-1             8-nothp-2             8-nothp-3
Success 1 Min         46.00 (  0.00%)       44.00 (  4.35%)       42.00 (  8.70%)
Success 1 Mean        50.20 (  0.00%)       45.60 (  9.16%)       44.00 ( 12.35%)
Success 1 Max         52.00 (  0.00%)       47.00 (  9.62%)       47.00 (  9.62%)
Success 2 Min         53.00 (  0.00%)       49.00 (  7.55%)       48.00 (  9.43%)
Success 2 Mean        55.80 (  0.00%)       50.60 (  9.32%)       49.00 ( 12.19%)
Success 2 Max         59.00 (  0.00%)       52.00 ( 11.86%)       51.00 ( 13.56%)
Success 3 Min         84.00 (  0.00%)       80.00 (  4.76%)       79.00 (  5.95%)
Success 3 Mean        85.40 (  0.00%)       81.60 (  4.45%)       80.40 (  5.85%)
Success 3 Max         87.00 (  0.00%)       83.00 (  4.60%)       82.00 (  5.75%)

While there's no improvement here, I consider reduced fragmentation events
to be worth on its own.  Patch 2 also seems to reduce scanning for free
pages, and migrations in compaction, suggesting it has somewhat less work
to do:

Patch 1:

Compaction stalls                 4153        3959        3978
Compaction success                1523        1441        1446
Compaction failures               2630        2517        2531
Page migrate success           4600827     4943120     5104348
Page migrate failure             19763       16656       17806
Compaction pages isolated      9597640    10305617    10653541
Compaction migrate scanned    77828948    86533283    87137064
Compaction free scanned      517758295   521312840   521462251
Compaction cost                   5503        5932        6110

Patch 2:

Compaction stalls                 3800        3450        3518
Compaction success                1421        1316        1317
Compaction failures               2379        2134        2201
Page migrate success           4160421     4502708     4752148
Page migrate failure             19705       14340       14911
Compaction pages isolated      8731983     9382374     9910043
Compaction migrate scanned    98362797    96349194    98609686
Compaction free scanned      496512560   469502017   480442545
Compaction cost                   5173        5526        5811

As with v2, /proc/pagetypeinfo appears unaffected with respect to numbers
of unmovable and reclaimable pageblocks.

Configuring the benchmark to allocate like THP page fault (i.e.  no sync
compaction) gives much noisier results for iterations 2 and 3 after
reboot.  This is not so surprising given how [1] offers lower improvements
in this scenario due to less restarts after deferred compaction which
would change compaction pivot.

Baseline:
                                                   3.19-rc4        3.19-rc4        3.19-rc4
                                                    5-thp-1         5-thp-2         5-thp-3
Page alloc extfrag event                                8148965     6227815     6646741
Extfrag fragmenting                                     8147872     6227130     6646117
Extfrag fragmenting for unmovable                         10324       12942       15975
Extfrag fragmenting unmovable placed with movable          5972        8495       10907
Extfrag fragmenting for reclaimable                         601        1707        2210
Extfrag fragmenting reclaimable placed with movable         520        1570        2000
Extfrag fragmenting for movable                         8136947     6212481     6627932

Patch 1:
                                                   3.19-rc4        3.19-rc4        3.19-rc4
                                                    6-thp-1         6-thp-2         6-thp-3
Page alloc extfrag event                                8345457     7574471     7020419
Extfrag fragmenting                                     8343546     7573777     7019718
Extfrag fragmenting for unmovable                         10256       18535       30716
Extfrag fragmenting unmovable placed with movable          6893       11726       22181
Extfrag fragmenting for reclaimable                         465        1208        1023
Extfrag fragmenting reclaimable placed with movable         353         996         843
Extfrag fragmenting for movable                         8332825     7554034     6987979

Patch 2:
                                                   3.19-rc4        3.19-rc4        3.19-rc4
                                                    7-thp-1         7-thp-2         7-thp-3
Page alloc extfrag event                                3512847     3020756     2891625
Extfrag fragmenting                                     3511940     3020185     2891059
Extfrag fragmenting for unmovable                          9017        6892        6191
Extfrag fragmenting unmovable placed with movable          1524        3053        2435
Extfrag fragmenting for reclaimable                         445        1081        1160
Extfrag fragmenting reclaimable placed with movable         375         918         986
Extfrag fragmenting for movable                         3502478     3012212     2883708

Patch 3:
                                                   3.19-rc4        3.19-rc4        3.19-rc4
                                                    8-thp-1         8-thp-2         8-thp-3
Page alloc extfrag event                                3181699     3082881     2674164
Extfrag fragmenting                                     3180812     3082303     2673611
Extfrag fragmenting for unmovable                          1201        4031        4040
Extfrag fragmenting unmovable placed with movable           974        3611        3645
Extfrag fragmenting for reclaimable                         478        1165        1294
Extfrag fragmenting reclaimable placed with movable         387         985        1030
Extfrag fragmenting for movable                         3179133     3077107     2668277

The improvements for first iteration are clear, the rest is much noisier
and can appear like regression for Patch 1.  Anyway, patch 2 rectifies it.

Allocation success rates are again unaffected so there's no point in
making this e-mail any longer.

[1] http://marc.info/?l=linux-mm&m=142166196321125&w=2

This patch (of 3):

When __rmqueue_fallback() is called to allocate a page of order X, it will
find a page of order Y >= X of a fallback migratetype, which is different
from the desired migratetype.  With the help of try_to_steal_freepages(),
it may change the migratetype (to the desired one) also of:

1) all currently free pages in the pageblock containing the fallback page
2) the fallback pageblock itself
3) buddy pages created by splitting the fallback page (when Y > X)

These decisions take the order Y into account, as well as the desired
migratetype, with the goal of preventing multiple fallback allocations
that could e.g.  distribute UNMOVABLE allocations among multiple
pageblocks.

Originally, decision for 1) has implied the decision for 3).  Commit
47118af076f6 ("mm: mmzone: MIGRATE_CMA migration type added") changed that
(probably unintentionally) so that the buddy pages in case 3) are always
changed to the desired migratetype, except for CMA pageblocks.

Commit fef903efcf0c ("mm/page_allo.c: restructure free-page stealing code
and fix a bug") did some refactoring and added a comment that the case of
3) is intended.  Commit 0cbef29a7821 ("mm: __rmqueue_fallback() should
respect pageblock type") removed the comment and tried to restore the
original behavior where 1) implies 3), but due to the previous
refactoring, the result is instead that only 2) implies 3) - and the
conditions for 2) are less frequently met than conditions for 1).  This
may increase fragmentation in situations where the code decides to steal
all free pages from the pageblock (case 1)), but then gives back the buddy
pages produced by splitting.

This patch restores the original intended logic where 1) implies 3).
During testing with stress-highalloc from mmtests, this has shown to
decrease the number of events where UNMOVABLE and RECLAIMABLE allocations
steal from MOVABLE pageblocks, which can lead to permanent fragmentation.
In some cases it has increased the number of events when MOVABLE
allocations steal from UNMOVABLE or RECLAIMABLE pageblocks, but these are
fixable by sync compaction and thus less harmful.

Note that evaluation has shown that the behavior introduced by
47118af076f6 for buddy pages in case 3) is actually even better than the
original logic, so the following patch will introduce it properly once
again.  For stable backports of this patch it makes thus sense to only fix
versions containing 0cbef29a7821.

[iamjoonsoo.kim@lge.com: tracepoint fix]
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
10 years agomm/hugetlb: add migration entry check in __unmap_hugepage_range
Naoya Horiguchi [Wed, 11 Feb 2015 23:25:32 +0000 (15:25 -0800)] 
mm/hugetlb: add migration entry check in __unmap_hugepage_range

commit 9fbc1f635fd0bd28cb32550211bf095753ac637a upstream.

If __unmap_hugepage_range() tries to unmap the address range over which
hugepage migration is on the way, we get the wrong page because pte_page()
doesn't work for migration entries.  This patch simply clears the pte for
migration entries as we do for hwpoison entries.

Fixes: 290408d4a2 ("hugetlb: hugepage migration core")
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: James Hogan <james.hogan@imgtec.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Luiz Capitulino <lcapitulino@redhat.com>
Cc: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: Steve Capper <steve.capper@linaro.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
10 years agomm/hugetlb: add migration/hwpoisoned entry check in hugetlb_change_protection
Naoya Horiguchi [Wed, 11 Feb 2015 23:25:28 +0000 (15:25 -0800)] 
mm/hugetlb: add migration/hwpoisoned entry check in hugetlb_change_protection

commit a8bda28d87c38c6aa93de28ba5d30cc18e865a11 upstream.

There is a race condition between hugepage migration and
change_protection(), where hugetlb_change_protection() doesn't care about
migration entries and wrongly overwrites them.  That causes unexpected
results like kernel crash.  HWPoison entries also can cause the same
problem.

This patch adds is_hugetlb_entry_(migration|hwpoisoned) check in this
function to do proper actions.

Fixes: 290408d4a2 ("hugetlb: hugepage migration core")
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: James Hogan <james.hogan@imgtec.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Luiz Capitulino <lcapitulino@redhat.com>
Cc: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: Steve Capper <steve.capper@linaro.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
10 years agomm/hugetlb: fix getting refcount 0 page in hugetlb_fault()
Naoya Horiguchi [Wed, 11 Feb 2015 23:25:25 +0000 (15:25 -0800)] 
mm/hugetlb: fix getting refcount 0 page in hugetlb_fault()

commit 0f792cf949a0be506c2aa8bfac0605746b146dda upstream.

When running the test which causes the race as shown in the previous patch,
we can hit the BUG "get_page() on refcount 0 page" in hugetlb_fault().

This race happens when pte turns into migration entry just after the first
check of is_hugetlb_entry_migration() in hugetlb_fault() passed with false.
To fix this, we need to check pte_present() again after huge_ptep_get().

This patch also reorders taking ptl and doing pte_page(), because
pte_page() should be done in ptl.  Due to this reordering, we need use
trylock_page() in page != pagecache_page case to respect locking order.

Fixes: 66aebce747ea ("hugetlb: fix race condition in hugetlb_fault()")
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: James Hogan <james.hogan@imgtec.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Luiz Capitulino <lcapitulino@redhat.com>
Cc: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: Steve Capper <steve.capper@linaro.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
10 years agomm/hugetlb: take page table lock in follow_huge_pmd()
Naoya Horiguchi [Wed, 11 Feb 2015 23:25:22 +0000 (15:25 -0800)] 
mm/hugetlb: take page table lock in follow_huge_pmd()

commit e66f17ff71772b209eed39de35aaa99ba819c93d upstream.

We have a race condition between move_pages() and freeing hugepages, where
move_pages() calls follow_page(FOLL_GET) for hugepages internally and
tries to get its refcount without preventing concurrent freeing.  This
race crashes the kernel, so this patch fixes it by moving FOLL_GET code
for hugepages into follow_huge_pmd() with taking the page table lock.

This patch intentionally removes page==NULL check after pte_page.
This is justified because pte_page() never returns NULL for any
architectures or configurations.

This patch changes the behavior of follow_huge_pmd() for tail pages and
then tail pages can be pinned/returned.  So the caller must be changed to
properly handle the returned tail pages.

We could have a choice to add the similar locking to
follow_huge_(addr|pud) for consistency, but it's not necessary because
currently these functions don't support FOLL_GET flag, so let's leave it
for future development.

Here is the reproducer:

  $ cat movepages.c
  #include <stdio.h>
  #include <stdlib.h>
  #include <numaif.h>

  #define ADDR_INPUT      0x700000000000UL
  #define HPS             0x200000
  #define PS              0x1000

  int main(int argc, char *argv[]) {
          int i;
          int nr_hp = strtol(argv[1], NULL, 0);
          int nr_p  = nr_hp * HPS / PS;
          int ret;
          void **addrs;
          int *status;
          int *nodes;
          pid_t pid;

          pid = strtol(argv[2], NULL, 0);
          addrs  = malloc(sizeof(char *) * nr_p + 1);
          status = malloc(sizeof(char *) * nr_p + 1);
          nodes  = malloc(sizeof(char *) * nr_p + 1);

          while (1) {
                  for (i = 0; i < nr_p; i++) {
                          addrs[i] = (void *)ADDR_INPUT + i * PS;
                          nodes[i] = 1;
                          status[i] = 0;
                  }
                  ret = numa_move_pages(pid, nr_p, addrs, nodes, status,
                                        MPOL_MF_MOVE_ALL);
                  if (ret == -1)
                          err("move_pages");

                  for (i = 0; i < nr_p; i++) {
                          addrs[i] = (void *)ADDR_INPUT + i * PS;
                          nodes[i] = 0;
                          status[i] = 0;
                  }
                  ret = numa_move_pages(pid, nr_p, addrs, nodes, status,
                                        MPOL_MF_MOVE_ALL);
                  if (ret == -1)
                          err("move_pages");
          }
          return 0;
  }

  $ cat hugepage.c
  #include <stdio.h>
  #include <sys/mman.h>
  #include <string.h>

  #define ADDR_INPUT      0x700000000000UL
  #define HPS             0x200000

  int main(int argc, char *argv[]) {
          int nr_hp = strtol(argv[1], NULL, 0);
          char *p;

          while (1) {
                  p = mmap((void *)ADDR_INPUT, nr_hp * HPS, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
                  if (p != (void *)ADDR_INPUT) {
                          perror("mmap");
                          break;
                  }
                  memset(p, 0, nr_hp * HPS);
                  munmap(p, nr_hp * HPS);
          }
  }

  $ sysctl vm.nr_hugepages=40
  $ ./hugepage 10 &
  $ ./movepages 10 $(pgrep -f hugepage)

Fixes: e632a938d914 ("mm: migrate: add hugepage migration code to move_pages()")
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Reported-by: Hugh Dickins <hughd@google.com>
Cc: James Hogan <james.hogan@imgtec.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Luiz Capitulino <lcapitulino@redhat.com>
Cc: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: Steve Capper <steve.capper@linaro.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
10 years agomm/hugetlb: pmd_huge() returns true for non-present hugepage
Naoya Horiguchi [Wed, 11 Feb 2015 23:25:19 +0000 (15:25 -0800)] 
mm/hugetlb: pmd_huge() returns true for non-present hugepage

commit cbef8478bee55775ac312a574aad48af7bb9cf9f upstream.

Migrating hugepages and hwpoisoned hugepages are considered as non-present
hugepages, and they are referenced via migration entries and hwpoison
entries in their page table slots.

This behavior causes race condition because pmd_huge() doesn't tell
non-huge pages from migrating/hwpoisoned hugepages.  follow_page_mask() is
one example where the kernel would call follow_page_pte() for such
hugepage while this function is supposed to handle only normal pages.

To avoid this, this patch makes pmd_huge() return true when pmd_none() is
true *and* pmd_present() is false.  We don't have to worry about mixing up
non-present pmd entry with normal pmd (pointing to leaf level pte entry)
because pmd_present() is true in normal pmd.

The same race condition could happen in (x86-specific) gup_pmd_range(),
where this patch simply adds pmd_present() check instead of pmd_huge().
This is because gup_pmd_range() is fast path.  If we have non-present
hugepage in this function, we will go into gup_huge_pmd(), then return 0
at flag mask check, and finally fall back to the slow path.

Fixes: 290408d4a2 ("hugetlb: hugepage migration core")
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: James Hogan <james.hogan@imgtec.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Luiz Capitulino <lcapitulino@redhat.com>
Cc: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: Steve Capper <steve.capper@linaro.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Luis Henriques <luis.henriques@canonical.com>