]> git.ipfire.org Git - thirdparty/kernel/stable.git/log
thirdparty/kernel/stable.git
10 years agoiio: mxs-lradc: only update the buffer when its conversions have finished
Kristina Martšenko [Sun, 25 Jan 2015 16:28:22 +0000 (18:28 +0200)] 
iio: mxs-lradc: only update the buffer when its conversions have finished

commit 89bb35e200bee745c539a96666e0792301ca40f1 upstream.

Using the touchscreen while running buffered capture results in the
buffer reporting lots of wrong values, often just zeros. This is because
we push readings to the buffer every time a touchscreen interrupt
arrives, including when the buffer's own conversions have not yet
finished. So let's only push to the buffer when its conversions are
ready.

Signed-off-by: Kristina Martšenko <kristina.martsenko@gmail.com>
Reviewed-by: Marek Vasut <marex@denx.de>
Signed-off-by: Jonathan Cameron <jic23@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
10 years agoiio: mxs-lradc: make ADC reads not unschedule touchscreen conversions
Kristina Martšenko [Sun, 25 Jan 2015 16:28:21 +0000 (18:28 +0200)] 
iio: mxs-lradc: make ADC reads not unschedule touchscreen conversions

commit 6abe0300a1d5242f4ff89257197f284679af1a06 upstream.

Reading a channel through sysfs, or starting a buffered capture, can
occasionally turn off the touchscreen.

This is because the read_raw() and buffer preenable()/postdisable()
callbacks unschedule current conversions on all channels. If a delay
channel happens to schedule a touchscreen conversion at the same time,
the conversion gets cancelled and the touchscreen sequence stops.

This is probably related to this note from the reference manual:

"If a delay group schedules channels to be sampled and a manual
write to the schedule field in CTRL0 occurs while the block is
discarding samples, the LRADC will switch to the new schedule
and will not sample the channels that were previously scheduled.
The time window for this to happen is very small and lasts only
while the LRADC is discarding samples."

So make the callbacks only unschedule conversions for the channels they
use. This means channel 0 for read_raw() and channels 0-5 for the buffer
(if the touchscreen is enabled). Since the touchscreen uses different
channels (6 and 7), it no longer gets turned off.

This is tested and fixes the issue on i.MX28, but hasn't been tested on
i.MX23.

Signed-off-by: Kristina Martšenko <kristina.martsenko@gmail.com>
Reviewed-by: Marek Vasut <marex@denx.de>
Signed-off-by: Jonathan Cameron <jic23@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
10 years agoiio: mxs-lradc: make ADC reads not disable touchscreen interrupts
Kristina Martšenko [Sun, 25 Jan 2015 16:28:20 +0000 (18:28 +0200)] 
iio: mxs-lradc: make ADC reads not disable touchscreen interrupts

commit 86bf7f3ef7e961e91e16dceb31ae0f583483b204 upstream.

Reading a channel through sysfs, or starting a buffered capture, will
currently turn off the touchscreen. This is because the read_raw() and
buffer preenable()/postdisable() callbacks disable interrupts for all
LRADC channels, including those the touchscreen uses.

So make the callbacks only disable interrupts for the channels they use.
This means channel 0 for read_raw() and channels 0-5 for the buffer (if
the touchscreen is enabled). Since the touchscreen uses different
channels (6 and 7), it no longer gets turned off.

Note that only i.MX28 is affected by this issue, i.MX23 should be fine.

Signed-off-by: Kristina Martšenko <kristina.martsenko@gmail.com>
Reviewed-by: Marek Vasut <marex@denx.de>
Signed-off-by: Jonathan Cameron <jic23@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
10 years agoiio: mxs-lradc: separate touchscreen and buffer virtual channels
Kristina Martšenko [Sun, 25 Jan 2015 16:28:19 +0000 (18:28 +0200)] 
iio: mxs-lradc: separate touchscreen and buffer virtual channels

commit f81197b8a31b8fb287ae57f597b5b6841e1ece92 upstream.

The touchscreen was initially designed [1] to map all of its physical
channels to one virtual channel, leaving buffered capture to use the
remaining 7 virtual channels. When the touchscreen was reimplemented
[2], it was made to use four virtual channels, which overlap and
conflict with the channels the buffer uses.

As a result, when the buffer is enabled, the touchscreen's virtual
channels are remapped to whichever physical channels the buffer was
configured with, causing the touchscreen to read those instead of the
touch measurement channels. Effectively the touchscreen stops working.

So here we separate the channels again, giving the touchscreen 2 virtual
channels and the buffer 6. We can't give the touchscreen just 1 channel
as before, as the current pressure calculation requires 2 channels to be
read at the same time.

This makes the touchscreen continue to work during buffered capture. It
has been tested on i.MX28, but not on i.MX23.

[1] 06ddd353f5c8 ("iio: mxs: Implement support for touchscreen")
[2] dee05308f602 ("Staging/iio/adc/touchscreen/MXS: add interrupt driven
touch detection")

Signed-off-by: Kristina Martšenko <kristina.martsenko@gmail.com>
Reviewed-by: Marek Vasut <marex@denx.de>
Signed-off-by: Jonathan Cameron <jic23@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
10 years agoiio: imu: adis16400: Fix sign extension
Rasmus Villemoes [Thu, 22 Jan 2015 23:34:02 +0000 (00:34 +0100)] 
iio: imu: adis16400: Fix sign extension

commit 19e353f2b344ad86cea6ebbc0002e5f903480a90 upstream.

The intention is obviously to sign-extend a 12 bit quantity. But
because of C's promotion rules, the assignment is equivalent to "val16
&= 0xfff;". Use the proper API for this.

Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Acked-by: Lars-Peter Clausen <lars@metafoo.de>
Signed-off-by: Jonathan Cameron <jic23@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
10 years agoiio: mxs-lradc: fix iio channel map regression
Stefan Wahren [Sat, 3 Jan 2015 20:34:12 +0000 (20:34 +0000)] 
iio: mxs-lradc: fix iio channel map regression

commit 03305e535cd5cdc1079b32909bf4b2dd67d46f7f upstream.

Since commit c8231a9af8147f8a ("iio: mxs-lradc: compute temperature
from channel 8 and 9") with the removal of adc channel 9 there is
no 1-1 mapping in the channel spec.

All hwmon channel values above 9 are accessible via there index minus
one. So add a hidden iio channel 9 to fix this issue.

Signed-off-by: Stefan Wahren <stefan.wahren@i2se.com>
Acked-by: Alexandre Belloni <alexandre.belloni@free-electrons.com>
Reviewed-by: Marek Vasut <marex@denx.de>
Signed-off-by: Jonathan Cameron <jic23@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
10 years agox86/fpu/xsaves: Fix improper uses of __ex_table
Quentin Casasnovas [Thu, 5 Mar 2015 12:19:22 +0000 (13:19 +0100)] 
x86/fpu/xsaves: Fix improper uses of __ex_table

commit 06c8173eb92bbfc03a0fe8bb64315857d0badd06 upstream.

Commit:

  f31a9f7c7169 ("x86/xsaves: Use xsaves/xrstors to save and restore xsave area")

introduced alternative instructions for XSAVES/XRSTORS and commit:

  adb9d526e982 ("x86/xsaves: Add xsaves and xrstors support for booting time")

added support for the XSAVES/XRSTORS instructions at boot time.

Unfortunately both failed to properly protect them against faulting:

The 'xstate_fault' macro will use the closest label named '1'
backward and that ends up in the .altinstr_replacement section
rather than in .text. This means that the kernel will never find
in the __ex_table the .text address where this instruction might
fault, leading to serious problems if userspace manages to
trigger the fault.

Signed-off-by: Quentin Casasnovas <quentin.casasnovas@oracle.com>
Signed-off-by: Jamie Iles <jamie.iles@oracle.com>
[ Improved the changelog, fixed some whitespace noise. ]
Acked-by: Borislav Petkov <bp@alien8.de>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Allan Xavier <mr.a.xavier@gmail.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: adb9d526e982 ("x86/xsaves: Add xsaves and xrstors support for booting time")
Fixes: f31a9f7c7169 ("x86/xsaves: Use xsaves/xrstors to save and restore xsave area")
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
10 years agox86/asm/entry/64: Remove a bogus 'ret_from_fork' optimization
Andy Lutomirski [Thu, 5 Mar 2015 00:09:44 +0000 (01:09 +0100)] 
x86/asm/entry/64: Remove a bogus 'ret_from_fork' optimization

commit 956421fbb74c3a6261903f3836c0740187cf038b upstream.

'ret_from_fork' checks TIF_IA32 to determine whether 'pt_regs' and
the related state make sense for 'ret_from_sys_call'.  This is
entirely the wrong check.  TS_COMPAT would make a little more
sense, but there's really no point in keeping this optimization
at all.

This fixes a return to the wrong user CS if we came from int
0x80 in a 64-bit task.

Signed-off-by: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/4710be56d76ef994ddf59087aad98c000fbab9a4.1424989793.git.luto@amacapital.net
[ Backported from tip:x86/asm. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
10 years agotarget: Check for LBA + sectors wrap-around in sbc_parse_cdb
Nicholas Bellinger [Fri, 13 Feb 2015 22:27:40 +0000 (22:27 +0000)] 
target: Check for LBA + sectors wrap-around in sbc_parse_cdb

commit aa179935edea9a64dec4b757090c8106a3907ffa upstream.

This patch adds a check to sbc_parse_cdb() in order to detect when
an LBA + sector vs. end-of-device calculation wraps when the LBA is
sufficently large enough (eg: 0xFFFFFFFFFFFFFFFF).

Cc: Martin Petersen <martin.petersen@oracle.com>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
10 years agotarget: Add missing WRITE_SAME end-of-device sanity check
Nicholas Bellinger [Fri, 13 Feb 2015 22:09:47 +0000 (22:09 +0000)] 
target: Add missing WRITE_SAME end-of-device sanity check

commit 8e575c50a171f2579e367a7f778f86477dfdaf49 upstream.

This patch adds a check to sbc_setup_write_same() to verify
the incoming WRITE_SAME LBA + number of blocks does not exceed
past the end-of-device.

Also check for potential LBA wrap-around as well.

Reported-by: Bart Van Assche <bart.vanassche@sandisk.com>
Cc: Martin Petersen <martin.petersen@oracle.com>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
10 years agotarget: Fix PR_APTPL_BUF_LEN buffer size limitation
Nicholas Bellinger [Thu, 12 Feb 2015 02:34:40 +0000 (18:34 -0800)] 
target: Fix PR_APTPL_BUF_LEN buffer size limitation

commit f161d4b44d7cc1dc66b53365215227db356378b1 upstream.

This patch addresses the original PR_APTPL_BUF_LEN = 8k limitiation
for write-out of PR APTPL metadata that Martin has recently been
running into.

It changes core_scsi3_update_and_write_aptpl() to use vzalloc'ed
memory instead of kzalloc, and increases the default hardcoded
length to 256k.

It also adds logic in core_scsi3_update_and_write_aptpl() to double
the original length upon core_scsi3_update_aptpl_buf() failure, and
retries until the vzalloc'ed buffer is large enough to accommodate
the outgoing APTPL metadata.

Reported-by: Martin Svec <martin.svec@zoner.cz>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
10 years agodrm/i915: Clamp efficient frequency to valid range
Tom O'Rourke [Wed, 11 Feb 2015 07:06:46 +0000 (23:06 -0800)] 
drm/i915: Clamp efficient frequency to valid range

commit 46efa4abe5712276494adbce102f46e3214632fd upstream.

The efficient frequency (RPe) should stay in the range
RPn <= RPe <= RP0.  The pcode clamps the returned value
internally on Broadwell but not on Haswell.

Fix for missing range check in
commit 93ee29203f506582cca2bcec5f05041526d9ab0a
Author: Tom O'Rourke <Tom.O'Rourke@intel.com>
Date:   Wed Nov 19 14:21:52 2014 -0800

    drm/i915: Use efficient frequency for HSW/BDW

Reference: http://lists.freedesktop.org/archives/intel-gfx/2015-February/059802.html
Reported-by: Michael Auchter <a@phire.org>
Suggested-by: Chris Wilson <chris@chris-wilson.co.uk>
Signed-off-by: Tom O'Rourke <Tom.O'Rourke@intel.com>
Signed-off-by: Jani Nikula <jani.nikula@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
10 years agodrm/i915: Correct the IOSF Dev_FN field for IOSF transfers
Shobhit Kumar [Thu, 5 Feb 2015 11:40:56 +0000 (17:10 +0530)] 
drm/i915: Correct the IOSF Dev_FN field for IOSF transfers

commit d180d2bbb66579e3bf449642b8ec2a76f4014fcd upstream.

As per the specififcation, the SB_DevFn is the PCI_DEVFN of the target
device and not the source. So PCI_DEVFN(2,0) is not correct. Further the
port ID should be enough to identify devices unless they are MFD. The
SB_DevFn was intended to remove ambiguity in case of these MFD devices.

For non MFD devices the recommendation for the target device IP was to
ignore these fields, but not all of them followed the recommendation.
Some like CCK ignore these fields and hence PCI_DEVFN(2, 0) works and so
does PCI_DEVFN(0, 0) as it works for DPIO. The issue came to light because
of GPIONC which was not getting programmed correctly with PCI_DEVFN(2, 0).
It turned out that this did not follow the recommendation and expected 0
in this field.

In general the recommendation is to use SB_DevFn as PCI_DEVFN(0, 0) for
all devices except target PCI devices.

Signed-off-by: Shobhit Kumar <shobhit.kumar@intel.com>
Reviewed-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
Signed-off-by: Jani Nikula <jani.nikula@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
10 years agodrm/i915: Prevent use-after-free in invalidate_range_start callback
Michał Winiarski [Tue, 3 Feb 2015 14:48:17 +0000 (15:48 +0100)] 
drm/i915: Prevent use-after-free in invalidate_range_start callback

commit 460822b0b1a77db859b0320469799fa4dbe4d367 upstream.

It's possible for invalidate_range_start mmu notifier callback to race
against userptr object release. If the gem object was released prior to
obtaining the spinlock in invalidate_range_start we're hitting null
pointer dereference.

Testcase: igt/gem_userptr_blits/stress-mm-invalidate-close
Testcase: igt/gem_userptr_blits/stress-mm-invalidate-close-overlap
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Signed-off-by: Michał Winiarski <michal.winiarski@intel.com>
Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>
[Jani: added code comment suggested by Chris]
Signed-off-by: Jani Nikula <jani.nikula@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
10 years agodrm/i915: Drop vblank wait from intel_dp_link_down
Daniel Vetter [Mon, 24 Nov 2014 15:54:11 +0000 (16:54 +0100)] 
drm/i915: Drop vblank wait from intel_dp_link_down

commit 0ca09685546fed5fc8f0535204f0626f352140f4 upstream.

Nothing in Bspec seems to indicate that we actually needs this, and it
looks like can't work since by this point the pipe is off and so
vblanks won't really happen any more.

Note that Bspec mentions that it takes a vblank for this bit to
change, but _only_ when enabling.

Dropping this code quenches an annoying backtrace introduced by the
more anal checking since

commit 51e31d49c89055299e34b8f44d13f70e19aaaad1
Author: Daniel Vetter <daniel.vetter@ffwll.ch>
Date:   Mon Sep 15 12:36:02 2014 +0200

    drm/i915: Use generic vblank wait

Note: This fixes the fallout from the above commit, but does not address
the shortcomings of the IBX transcoder select workaround implementation
discussed during review [1].

[1] http://mid.gmane.org/87y4o7usxf.fsf@intel.com

Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=86095
Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
Reviewed-by: Paulo Zanoni <paulo.r.zanoni@intel.com>
Signed-off-by: Jani Nikula <jani.nikula@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
10 years agodrm/i915: Insert a command barrier on BLT/BSD cache flushes
Chris Wilson [Thu, 22 Jan 2015 13:42:00 +0000 (13:42 +0000)] 
drm/i915: Insert a command barrier on BLT/BSD cache flushes

commit f0a1fb10e5f79f5aaf8d7e94b9fa6bf2fa9aeebf upstream.

This looked like an odd regression from

commit ec5cc0f9b019af95e4571a9fa162d94294c8d90b
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Thu Jun 12 10:28:55 2014 +0100

    drm/i915: Restrict GPU boost to the RCS engine

but in reality it undercovered a much older coherency bug. The issue that
boosting the GPU frequency on the BCS ring was masking was that we could
wake the CPU up after completion of a BCS batch and inspect memory prior
to the write cache being fully evicted. In order to serialise the
breadcrumb interrupt (and so ensure that the CPU's view of memory is
coherent) we need to perform a post-sync operation in the MI_FLUSH_DW.

v2: Fix all the MI_FLUSH_DW (bsd plus the duplication in execlists).

Also fix the invalidate_domains mask in gen8_emit_flush() for ring !=
VCS.

Testcase: gpuX-rcs-gpu-read-after-write
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Acked-by: Daniel Vetter <daniel@ffwll.ch>
Signed-off-by: Jani Nikula <jani.nikula@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
10 years agodrm/radeon: fix voltage setup on hawaii
Alex Deucher [Thu, 12 Feb 2015 05:40:58 +0000 (00:40 -0500)] 
drm/radeon: fix voltage setup on hawaii

commit 09b6e85fc868568e1b2820235a2a851aecbccfcc upstream.

Missing parameter when fetching the real voltage values
from atom.  Fixes problems with dynamic clocking on
certain boards.

bug:
https://bugs.freedesktop.org/show_bug.cgi?id=87457

Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
10 years agodrm/radeon/dp: Set EDP_CONFIGURATION_SET for bridge chips if necessary
Alex Deucher [Wed, 11 Feb 2015 23:34:36 +0000 (18:34 -0500)] 
drm/radeon/dp: Set EDP_CONFIGURATION_SET for bridge chips if necessary

commit 66c2b84ba6256bc5399eed45582af9ebb3ba2c15 upstream.

Don't restrict it to just eDP panels.  Some LVDS bridge chips require
this.  Fixes blank panels on resume on certain laptops.  Noticed
by mrnuke on IRC.

bug:
https://bugs.freedesktop.org/show_bug.cgi?id=42960

Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
10 years agodrm/radeon: workaround for CP HW bug on CIK
Christian König [Tue, 10 Feb 2015 13:26:39 +0000 (14:26 +0100)] 
drm/radeon: workaround for CP HW bug on CIK

commit a9c73a0e022c33954835e66fec3cd744af90ec98 upstream.

Emit the EOP twice to avoid cache flushing problems.

Signed-off-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
10 years agodrm/radeon: only enable kv/kb dpm interrupts once v3
Alex Deucher [Fri, 6 Feb 2015 17:53:27 +0000 (12:53 -0500)] 
drm/radeon: only enable kv/kb dpm interrupts once v3

commit 410af8d7285a0b96314845c75c39fd612b755688 upstream.

Enable at init and disable on fini. Workaround for hardware problems.

v2 (chk): extend commit message
v3: add new function

Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Christian König <christian.koenig@amd.com> (v2)
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
10 years agodrm/radeon: Don't try to enable write-combining without PAT
Michel Dänzer [Wed, 4 Feb 2015 01:19:51 +0000 (10:19 +0900)] 
drm/radeon: Don't try to enable write-combining without PAT

commit a53fa43873b88bad15a2eb1f01dc5efa689625ce upstream.

Doing so can cause things to become slow.

Print a warning at compile time and an informative message at runtime in
that case.

Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=88758
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Michel Dänzer <michel.daenzer@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
10 years agodrm/tegra: Use correct relocation target offsets
David Ung [Wed, 21 Jan 2015 02:37:35 +0000 (18:37 -0800)] 
drm/tegra: Use correct relocation target offsets

commit 31f40f86526b71009973854c1dfe799ee70f7588 upstream.

When copying a relocation from userspace, copy the correct target
offset.

Signed-off-by: David Ung <davidu@nvidia.com>
Fixes: 961e3beae3b2 ("drm/tegra: Make job submission 64-bit safe")
[treding@nvidia.com: provide a better commit message]
Signed-off-by: Thierry Reding <treding@nvidia.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
10 years agomm: page_alloc: revert inadvertent !__GFP_FS retry behavior change
Johannes Weiner [Fri, 27 Feb 2015 23:52:09 +0000 (15:52 -0800)] 
mm: page_alloc: revert inadvertent !__GFP_FS retry behavior change

commit cc87317726f851531ae8422e0c2d3d6e2d7b1955 upstream.

Historically, !__GFP_FS allocations were not allowed to invoke the OOM
killer once reclaim had failed, but nevertheless kept looping in the
allocator.

Commit 9879de7373fc ("mm: page_alloc: embed OOM killing naturally into
allocation slowpath"), which should have been a simple cleanup patch,
accidentally changed the behavior to aborting the allocation at that
point.  This creates problems with filesystem callers (?) that currently
rely on the allocator waiting for other tasks to intervene.

Revert the behavior as it shouldn't have been changed as part of a
cleanup patch.

Fixes: 9879de7373fc ("mm: page_alloc: embed OOM killing naturally into allocation slowpath")
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Dave Chinner <david@fromorbit.com>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
10 years agomm/nommu: fix memory leak
Joonsoo Kim [Fri, 27 Feb 2015 23:51:43 +0000 (15:51 -0800)] 
mm/nommu: fix memory leak

commit da616534ed7f6e8ffaab699258b55c8d78d0b4ea upstream.

Maxime reported the following memory leak regression due to commit
dbc8358c7237 ("mm/nommu: use alloc_pages_exact() rather than its own
implementation").

On v3.19, I am facing a memory leak.  Each time I run a command one page
is lost.  Here an example with busybox's free command:

  / # free
               total       used       free     shared    buffers     cached
  Mem:          7928       1972       5956          0          0        492
  -/+ buffers/cache:       1480       6448
  / # free
               total       used       free     shared    buffers     cached
  Mem:          7928       1976       5952          0          0        492
  -/+ buffers/cache:       1484       6444
  / # free
               total       used       free     shared    buffers     cached
  Mem:          7928       1980       5948          0          0        492
  -/+ buffers/cache:       1488       6440
  / # free
               total       used       free     shared    buffers     cached
  Mem:          7928       1984       5944          0          0        492
  -/+ buffers/cache:       1492       6436
  / # free
               total       used       free     shared    buffers     cached
  Mem:          7928       1988       5940          0          0        492
  -/+ buffers/cache:       1496       6432

At some point, the system fails to sastisfy 256KB allocations:

  free: page allocation failure: order:6, mode:0xd0
  CPU: 0 PID: 67 Comm: free Not tainted 3.19.0-05389-gacf2cf1-dirty #64
  Hardware name: STM32 (Device Tree Support)
    show_stack+0xb/0xc
    warn_alloc_failed+0x97/0xbc
    __alloc_pages_nodemask+0x295/0x35c
    __get_free_pages+0xb/0x24
    alloc_pages_exact+0x19/0x24
    do_mmap_pgoff+0x423/0x658
    vm_mmap_pgoff+0x3f/0x4e
    load_flat_file+0x20d/0x4f8
    load_flat_binary+0x3f/0x26c
    search_binary_handler+0x51/0xe4
    do_execveat_common+0x271/0x35c
    do_execve+0x19/0x1c
    ret_fast_syscall+0x1/0x4a
  Mem-info:
  Normal per-cpu:
  CPU    0: hi:    0, btch:   1 usd:   0
  active_anon:0 inactive_anon:0 isolated_anon:0
   active_file:0 inactive_file:0 isolated_file:0
   unevictable:123 dirty:0 writeback:0 unstable:0
   free:1515 slab_reclaimable:17 slab_unreclaimable:139
   mapped:0 shmem:0 pagetables:0 bounce:0
   free_cma:0
  Normal free:6060kB min:352kB low:440kB high:528kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:492kB isolated(anon):0ks
  lowmem_reserve[]: 0 0
  Normal: 23*4kB (U) 22*8kB (U) 24*16kB (U) 23*32kB (U) 23*64kB (U) 23*128kB (U) 1*256kB (U) 0*512kB 0*1024kB 0*2048kB 0*4096kB = 6060kB
  123 total pagecache pages
  2048 pages of RAM
  1538 free pages
  66 reserved pages
  109 slab pages
  -46 pages shared
  0 pages swap cached
  nommu: Allocation of length 221184 from process 67 (free) failed
  Normal per-cpu:
  CPU    0: hi:    0, btch:   1 usd:   0
  active_anon:0 inactive_anon:0 isolated_anon:0
   active_file:0 inactive_file:0 isolated_file:0
   unevictable:123 dirty:0 writeback:0 unstable:0
   free:1515 slab_reclaimable:17 slab_unreclaimable:139
   mapped:0 shmem:0 pagetables:0 bounce:0
   free_cma:0
  Normal free:6060kB min:352kB low:440kB high:528kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:492kB isolated(anon):0ks
  lowmem_reserve[]: 0 0
  Normal: 23*4kB (U) 22*8kB (U) 24*16kB (U) 23*32kB (U) 23*64kB (U) 23*128kB (U) 1*256kB (U) 0*512kB 0*1024kB 0*2048kB 0*4096kB = 6060kB
  123 total pagecache pages
  Unable to allocate RAM for process text/data, errno 12 SEGV

This problem happens because we allocate ordered page through
__get_free_pages() in do_mmap_private() in some cases and we try to free
individual pages rather than ordered page in free_page_series().  In
this case, freeing pages whose refcount is not 0 won't be freed to the
page allocator so memory leak happens.

To fix the problem, this patch changes __get_free_pages() to
alloc_pages_exact() since alloc_pages_exact() returns
physically-contiguous pages but each pages are refcounted.

Fixes: dbc8358c7237 ("mm/nommu: use alloc_pages_exact() rather than its own implementation").
Reported-by: Maxime Coquelin <mcoquelin.stm32@gmail.com>
Tested-by: Maxime Coquelin <mcoquelin.stm32@gmail.com>
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
10 years agomm: fix negative nr_isolated counts
Hugh Dickins [Thu, 12 Feb 2015 23:00:28 +0000 (15:00 -0800)] 
mm: fix negative nr_isolated counts

commit ff59909a077b3c51c168cb658601c6b63136a347 upstream.

The vmstat interfaces are good at hiding negative counts (at least when
CONFIG_SMP); but if you peer behind the curtain, you find that
nr_isolated_anon and nr_isolated_file soon go negative, and grow ever
more negative: so they can absorb larger and larger numbers of isolated
pages, yet still appear to be zero.

I'm happy to avoid a congestion_wait() when too_many_isolated() myself;
but I guess it's there for a good reason, in which case we ought to get
too_many_isolated() working again.

The imbalance comes from isolate_migratepages()'s ISOLATE_ABORT case:
putback_movable_pages() decrements the NR_ISOLATED counts, but we forgot
to call acct_isolated() to increment them.

It is possible that the bug whcih this patch fixes could cause OOM kills
when the system still has a lot of reclaimable page cache.

Fixes: edc2ca612496 ("mm, compaction: move pageblock checks up from isolate_migratepages_range()")
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
10 years agomm: hwpoison: drop lru_add_drain_all() in __soft_offline_page()
Naoya Horiguchi [Thu, 12 Feb 2015 23:00:25 +0000 (15:00 -0800)] 
mm: hwpoison: drop lru_add_drain_all() in __soft_offline_page()

commit 9ab3b598d2dfbdb0153ffa7e4b1456bbff59a25d upstream.

A race condition starts to be visible in recent mmotm, where a PG_hwpoison
flag is set on a migration source page *before* it's back in buddy page
poo= l.

This is problematic because no page flag is supposed to be set when
freeing (see __free_one_page().) So the user-visible effect of this race
is that it could trigger the BUG_ON() when soft-offlining is called.

The root cause is that we call lru_add_drain_all() to make sure that the
page is in buddy, but that doesn't work because this function just
schedule= s a work item and doesn't wait its completion.
drain_all_pages() does drainin= g directly, so simply dropping
lru_add_drain_all() solves this problem.

Fixes: f15bdfa802bf ("mm/memory-failure.c: fix memory leak in successful soft offlining")
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Chen Gong <gong.chen@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
10 years agomm/memory.c: actually remap enough memory
Grazvydas Ignotas [Thu, 12 Feb 2015 23:00:19 +0000 (15:00 -0800)] 
mm/memory.c: actually remap enough memory

commit 9cb12d7b4ccaa976f97ce0c5fd0f1b6a83bc2a75 upstream.

For whatever reason, generic_access_phys() only remaps one page, but
actually allows to access arbitrary size.  It's quite easy to trigger
large reads, like printing out large structure with gdb, which leads to a
crash.  Fix it by remapping correct size.

Fixes: 28b2ee20c7cb ("access_process_vm device memory infrastructure")
Signed-off-by: Grazvydas Ignotas <notasas@gmail.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
10 years agomm/compaction: fix wrong order check in compact_finished()
Joonsoo Kim [Thu, 12 Feb 2015 22:59:50 +0000 (14:59 -0800)] 
mm/compaction: fix wrong order check in compact_finished()

commit 372549c2a3778fd3df445819811c944ad54609ca upstream.

What we want to check here is whether there is highorder freepage in buddy
list of other migratetype in order to steal it without fragmentation.
But, current code just checks cc->order which means allocation request
order.  So, this is wrong.

Without this fix, non-movable synchronous compaction below pageblock order
would not stopped until compaction is complete, because migratetype of
most pageblocks are movable and high order freepage made by compaction is
usually on movable type buddy list.

There is some report related to this bug. See below link.

  http://www.spinics.net/lists/linux-mm/msg81666.html

Although the issued system still has load spike comes from compaction,
this makes that system completely stable and responsive according to his
report.

stress-highalloc test in mmtests with non movable order 7 allocation
doesn't show any notable difference in allocation success rate, but, it
shows more compaction success rate.

Compaction success rate (Compaction success * 100 / Compaction stalls, %)
18.47 : 28.94

Fixes: 1fb3f8ca0e92 ("mm: compaction: capture a suitable high-order page immediately when it is made available")
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: David Rientjes <rientjes@google.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
10 years agomm/nommu.c: fix arithmetic overflow in __vm_enough_memory()
Roman Gushchin [Wed, 11 Feb 2015 23:28:42 +0000 (15:28 -0800)] 
mm/nommu.c: fix arithmetic overflow in __vm_enough_memory()

commit 8138a67a5557ffea3a21dfd6f037842d4e748513 upstream.

I noticed that "allowed" can easily overflow by falling below 0, because
(total_vm / 32) can be larger than "allowed".  The problem occurs in
OVERCOMMIT_NONE mode.

In this case, a huge allocation can success and overcommit the system
(despite OVERCOMMIT_NONE mode).  All subsequent allocations will fall
(system-wide), so system become unusable.

The problem was masked out by commit c9b1d0981fcc
("mm: limit growth of 3% hardcoded other user reserve"),
but it's easy to reproduce it on older kernels:
1) set overcommit_memory sysctl to 2
2) mmap() large file multiple times (with VM_SHARED flag)
3) try to malloc() large amount of memory

It also can be reproduced on newer kernels, but miss-configured
sysctl_user_reserve_kbytes is required.

Fix this issue by switching to signed arithmetic here.

Signed-off-by: Roman Gushchin <klamm@yandex-team.ru>
Cc: Andrew Shewmaker <agshew@gmail.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
10 years agomm/mmap.c: fix arithmetic overflow in __vm_enough_memory()
Roman Gushchin [Wed, 11 Feb 2015 23:28:39 +0000 (15:28 -0800)] 
mm/mmap.c: fix arithmetic overflow in __vm_enough_memory()

commit 5703b087dc8eaf47bfb399d6cf512d471beff405 upstream.

I noticed, that "allowed" can easily overflow by falling below 0,
because (total_vm / 32) can be larger than "allowed".  The problem
occurs in OVERCOMMIT_NONE mode.

In this case, a huge allocation can success and overcommit the system
(despite OVERCOMMIT_NONE mode).  All subsequent allocations will fall
(system-wide), so system become unusable.

The problem was masked out by commit c9b1d0981fcc
("mm: limit growth of 3% hardcoded other user reserve"),
but it's easy to reproduce it on older kernels:
1) set overcommit_memory sysctl to 2
2) mmap() large file multiple times (with VM_SHARED flag)
3) try to malloc() large amount of memory

It also can be reproduced on newer kernels, but miss-configured
sysctl_user_reserve_kbytes is required.

Fix this issue by switching to signed arithmetic here.

[akpm@linux-foundation.org: use min_t]
Signed-off-by: Roman Gushchin <klamm@yandex-team.ru>
Cc: Andrew Shewmaker <agshew@gmail.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
10 years agomm: when stealing freepages, also take pages created by splitting buddy page
Vlastimil Babka [Wed, 11 Feb 2015 23:28:15 +0000 (15:28 -0800)] 
mm: when stealing freepages, also take pages created by splitting buddy page

commit 99592d598eca62bdbbf62b59941c189176dfc614 upstream.

When studying page stealing, I noticed some weird looking decisions in
try_to_steal_freepages().  The first I assume is a bug (Patch 1), the
following two patches were driven by evaluation.

Testing was done with stress-highalloc of mmtests, using the
mm_page_alloc_extfrag tracepoint and postprocessing to get counts of how
often page stealing occurs for individual migratetypes, and what
migratetypes are used for fallbacks.  Arguably, the worst case of page
stealing is when UNMOVABLE allocation steals from MOVABLE pageblock.
RECLAIMABLE allocation stealing from MOVABLE allocation is also not ideal,
so the goal is to minimize these two cases.

The evaluation of v2 wasn't always clear win and Joonsoo questioned the
results.  Here I used different baseline which includes RFC compaction
improvements from [1].  I found that the compaction improvements reduce
variability of stress-highalloc, so there's less noise in the data.

First, let's look at stress-highalloc configured to do sync compaction,
and how these patches reduce page stealing events during the test.  First
column is after fresh reboot, other two are reiterations of test without
reboot.  That was all accumulater over 5 re-iterations (so the benchmark
was run 5x3 times with 5 fresh restarts).

Baseline:

                                                   3.19-rc4        3.19-rc4        3.19-rc4
                                                  5-nothp-1       5-nothp-2       5-nothp-3
Page alloc extfrag event                               10264225     8702233    10244125
Extfrag fragmenting                                    10263271     8701552    10243473
Extfrag fragmenting for unmovable                         13595       17616       15960
Extfrag fragmenting unmovable placed with movable          7989       12193        8447
Extfrag fragmenting for reclaimable                         658        1840        1817
Extfrag fragmenting reclaimable placed with movable         558        1677        1679
Extfrag fragmenting for movable                        10249018     8682096    10225696

With Patch 1:
                                                   3.19-rc4        3.19-rc4        3.19-rc4
                                                  6-nothp-1       6-nothp-2       6-nothp-3
Page alloc extfrag event                               11834954     9877523     9774860
Extfrag fragmenting                                    11833993     9876880     9774245
Extfrag fragmenting for unmovable                          7342       16129       11712
Extfrag fragmenting unmovable placed with movable          4191       10547        6270
Extfrag fragmenting for reclaimable                         373        1130         923
Extfrag fragmenting reclaimable placed with movable         302         906         738
Extfrag fragmenting for movable                        11826278     9859621     9761610

With Patch 2:
                                                   3.19-rc4        3.19-rc4        3.19-rc4
                                                  7-nothp-1       7-nothp-2       7-nothp-3
Page alloc extfrag event                                4725990     3668793     3807436
Extfrag fragmenting                                     4725104     3668252     3806898
Extfrag fragmenting for unmovable                          6678        7974        7281
Extfrag fragmenting unmovable placed with movable          2051        3829        4017
Extfrag fragmenting for reclaimable                         429        1208        1278
Extfrag fragmenting reclaimable placed with movable         369         976        1034
Extfrag fragmenting for movable                         4717997     3659070     3798339

With Patch 3:
                                                   3.19-rc4        3.19-rc4        3.19-rc4
                                                  8-nothp-1       8-nothp-2       8-nothp-3
Page alloc extfrag event                                5016183     4700142     3850633
Extfrag fragmenting                                     5015325     4699613     3850072
Extfrag fragmenting for unmovable                          1312        3154        3088
Extfrag fragmenting unmovable placed with movable          1115        2777        2714
Extfrag fragmenting for reclaimable                         437        1193        1097
Extfrag fragmenting reclaimable placed with movable         330         969         879
Extfrag fragmenting for movable                         5013576     4695266     3845887

In v2 we've seen apparent regression with Patch 1 for unmovable events,
this is now gone, suggesting it was indeed noise.  Here, each patch
improves the situation for unmovable events.  Reclaimable is improved by
patch 1 and then either the same modulo noise, or perhaps sligtly worse -
a small price for unmovable improvements, IMHO.  The number of movable
allocations falling back to other migratetypes is most noisy, but it's
reduced to half at Patch 2 nevertheless.  These are least critical as
compaction can move them around.

If we look at success rates, the patches don't affect them, that didn't change.

Baseline:
                             3.19-rc4              3.19-rc4              3.19-rc4
                            5-nothp-1             5-nothp-2             5-nothp-3
Success 1 Min         49.00 (  0.00%)       42.00 ( 14.29%)       41.00 ( 16.33%)
Success 1 Mean        51.00 (  0.00%)       45.00 ( 11.76%)       42.60 ( 16.47%)
Success 1 Max         55.00 (  0.00%)       51.00 (  7.27%)       46.00 ( 16.36%)
Success 2 Min         53.00 (  0.00%)       47.00 ( 11.32%)       44.00 ( 16.98%)
Success 2 Mean        59.60 (  0.00%)       50.80 ( 14.77%)       48.20 ( 19.13%)
Success 2 Max         64.00 (  0.00%)       56.00 ( 12.50%)       52.00 ( 18.75%)
Success 3 Min         84.00 (  0.00%)       82.00 (  2.38%)       78.00 (  7.14%)
Success 3 Mean        85.60 (  0.00%)       82.80 (  3.27%)       79.40 (  7.24%)
Success 3 Max         86.00 (  0.00%)       83.00 (  3.49%)       80.00 (  6.98%)

Patch 1:
                             3.19-rc4              3.19-rc4              3.19-rc4
                            6-nothp-1             6-nothp-2             6-nothp-3
Success 1 Min         49.00 (  0.00%)       44.00 ( 10.20%)       44.00 ( 10.20%)
Success 1 Mean        51.80 (  0.00%)       46.00 ( 11.20%)       45.80 ( 11.58%)
Success 1 Max         54.00 (  0.00%)       49.00 (  9.26%)       49.00 (  9.26%)
Success 2 Min         58.00 (  0.00%)       49.00 ( 15.52%)       48.00 ( 17.24%)
Success 2 Mean        60.40 (  0.00%)       51.80 ( 14.24%)       50.80 ( 15.89%)
Success 2 Max         63.00 (  0.00%)       54.00 ( 14.29%)       55.00 ( 12.70%)
Success 3 Min         84.00 (  0.00%)       81.00 (  3.57%)       79.00 (  5.95%)
Success 3 Mean        85.00 (  0.00%)       81.60 (  4.00%)       79.80 (  6.12%)
Success 3 Max         86.00 (  0.00%)       82.00 (  4.65%)       82.00 (  4.65%)

Patch 2:

                             3.19-rc4              3.19-rc4              3.19-rc4
                            7-nothp-1             7-nothp-2             7-nothp-3
Success 1 Min         50.00 (  0.00%)       44.00 ( 12.00%)       39.00 ( 22.00%)
Success 1 Mean        52.80 (  0.00%)       45.60 ( 13.64%)       42.40 ( 19.70%)
Success 1 Max         55.00 (  0.00%)       46.00 ( 16.36%)       47.00 ( 14.55%)
Success 2 Min         52.00 (  0.00%)       48.00 (  7.69%)       45.00 ( 13.46%)
Success 2 Mean        53.40 (  0.00%)       49.80 (  6.74%)       48.80 (  8.61%)
Success 2 Max         57.00 (  0.00%)       52.00 (  8.77%)       52.00 (  8.77%)
Success 3 Min         84.00 (  0.00%)       81.00 (  3.57%)       79.00 (  5.95%)
Success 3 Mean        85.00 (  0.00%)       82.40 (  3.06%)       79.60 (  6.35%)
Success 3 Max         86.00 (  0.00%)       83.00 (  3.49%)       80.00 (  6.98%)

Patch 3:
                             3.19-rc4              3.19-rc4              3.19-rc4
                            8-nothp-1             8-nothp-2             8-nothp-3
Success 1 Min         46.00 (  0.00%)       44.00 (  4.35%)       42.00 (  8.70%)
Success 1 Mean        50.20 (  0.00%)       45.60 (  9.16%)       44.00 ( 12.35%)
Success 1 Max         52.00 (  0.00%)       47.00 (  9.62%)       47.00 (  9.62%)
Success 2 Min         53.00 (  0.00%)       49.00 (  7.55%)       48.00 (  9.43%)
Success 2 Mean        55.80 (  0.00%)       50.60 (  9.32%)       49.00 ( 12.19%)
Success 2 Max         59.00 (  0.00%)       52.00 ( 11.86%)       51.00 ( 13.56%)
Success 3 Min         84.00 (  0.00%)       80.00 (  4.76%)       79.00 (  5.95%)
Success 3 Mean        85.40 (  0.00%)       81.60 (  4.45%)       80.40 (  5.85%)
Success 3 Max         87.00 (  0.00%)       83.00 (  4.60%)       82.00 (  5.75%)

While there's no improvement here, I consider reduced fragmentation events
to be worth on its own.  Patch 2 also seems to reduce scanning for free
pages, and migrations in compaction, suggesting it has somewhat less work
to do:

Patch 1:

Compaction stalls                 4153        3959        3978
Compaction success                1523        1441        1446
Compaction failures               2630        2517        2531
Page migrate success           4600827     4943120     5104348
Page migrate failure             19763       16656       17806
Compaction pages isolated      9597640    10305617    10653541
Compaction migrate scanned    77828948    86533283    87137064
Compaction free scanned      517758295   521312840   521462251
Compaction cost                   5503        5932        6110

Patch 2:

Compaction stalls                 3800        3450        3518
Compaction success                1421        1316        1317
Compaction failures               2379        2134        2201
Page migrate success           4160421     4502708     4752148
Page migrate failure             19705       14340       14911
Compaction pages isolated      8731983     9382374     9910043
Compaction migrate scanned    98362797    96349194    98609686
Compaction free scanned      496512560   469502017   480442545
Compaction cost                   5173        5526        5811

As with v2, /proc/pagetypeinfo appears unaffected with respect to numbers
of unmovable and reclaimable pageblocks.

Configuring the benchmark to allocate like THP page fault (i.e.  no sync
compaction) gives much noisier results for iterations 2 and 3 after
reboot.  This is not so surprising given how [1] offers lower improvements
in this scenario due to less restarts after deferred compaction which
would change compaction pivot.

Baseline:
                                                   3.19-rc4        3.19-rc4        3.19-rc4
                                                    5-thp-1         5-thp-2         5-thp-3
Page alloc extfrag event                                8148965     6227815     6646741
Extfrag fragmenting                                     8147872     6227130     6646117
Extfrag fragmenting for unmovable                         10324       12942       15975
Extfrag fragmenting unmovable placed with movable          5972        8495       10907
Extfrag fragmenting for reclaimable                         601        1707        2210
Extfrag fragmenting reclaimable placed with movable         520        1570        2000
Extfrag fragmenting for movable                         8136947     6212481     6627932

Patch 1:
                                                   3.19-rc4        3.19-rc4        3.19-rc4
                                                    6-thp-1         6-thp-2         6-thp-3
Page alloc extfrag event                                8345457     7574471     7020419
Extfrag fragmenting                                     8343546     7573777     7019718
Extfrag fragmenting for unmovable                         10256       18535       30716
Extfrag fragmenting unmovable placed with movable          6893       11726       22181
Extfrag fragmenting for reclaimable                         465        1208        1023
Extfrag fragmenting reclaimable placed with movable         353         996         843
Extfrag fragmenting for movable                         8332825     7554034     6987979

Patch 2:
                                                   3.19-rc4        3.19-rc4        3.19-rc4
                                                    7-thp-1         7-thp-2         7-thp-3
Page alloc extfrag event                                3512847     3020756     2891625
Extfrag fragmenting                                     3511940     3020185     2891059
Extfrag fragmenting for unmovable                          9017        6892        6191
Extfrag fragmenting unmovable placed with movable          1524        3053        2435
Extfrag fragmenting for reclaimable                         445        1081        1160
Extfrag fragmenting reclaimable placed with movable         375         918         986
Extfrag fragmenting for movable                         3502478     3012212     2883708

Patch 3:
                                                   3.19-rc4        3.19-rc4        3.19-rc4
                                                    8-thp-1         8-thp-2         8-thp-3
Page alloc extfrag event                                3181699     3082881     2674164
Extfrag fragmenting                                     3180812     3082303     2673611
Extfrag fragmenting for unmovable                          1201        4031        4040
Extfrag fragmenting unmovable placed with movable           974        3611        3645
Extfrag fragmenting for reclaimable                         478        1165        1294
Extfrag fragmenting reclaimable placed with movable         387         985        1030
Extfrag fragmenting for movable                         3179133     3077107     2668277

The improvements for first iteration are clear, the rest is much noisier
and can appear like regression for Patch 1.  Anyway, patch 2 rectifies it.

Allocation success rates are again unaffected so there's no point in
making this e-mail any longer.

[1] http://marc.info/?l=linux-mm&m=142166196321125&w=2

This patch (of 3):

When __rmqueue_fallback() is called to allocate a page of order X, it will
find a page of order Y >= X of a fallback migratetype, which is different
from the desired migratetype.  With the help of try_to_steal_freepages(),
it may change the migratetype (to the desired one) also of:

1) all currently free pages in the pageblock containing the fallback page
2) the fallback pageblock itself
3) buddy pages created by splitting the fallback page (when Y > X)

These decisions take the order Y into account, as well as the desired
migratetype, with the goal of preventing multiple fallback allocations
that could e.g.  distribute UNMOVABLE allocations among multiple
pageblocks.

Originally, decision for 1) has implied the decision for 3).  Commit
47118af076f6 ("mm: mmzone: MIGRATE_CMA migration type added") changed that
(probably unintentionally) so that the buddy pages in case 3) are always
changed to the desired migratetype, except for CMA pageblocks.

Commit fef903efcf0c ("mm/page_allo.c: restructure free-page stealing code
and fix a bug") did some refactoring and added a comment that the case of
3) is intended.  Commit 0cbef29a7821 ("mm: __rmqueue_fallback() should
respect pageblock type") removed the comment and tried to restore the
original behavior where 1) implies 3), but due to the previous
refactoring, the result is instead that only 2) implies 3) - and the
conditions for 2) are less frequently met than conditions for 1).  This
may increase fragmentation in situations where the code decides to steal
all free pages from the pageblock (case 1)), but then gives back the buddy
pages produced by splitting.

This patch restores the original intended logic where 1) implies 3).
During testing with stress-highalloc from mmtests, this has shown to
decrease the number of events where UNMOVABLE and RECLAIMABLE allocations
steal from MOVABLE pageblocks, which can lead to permanent fragmentation.
In some cases it has increased the number of events when MOVABLE
allocations steal from UNMOVABLE or RECLAIMABLE pageblocks, but these are
fixable by sync compaction and thus less harmful.

Note that evaluation has shown that the behavior introduced by
47118af076f6 for buddy pages in case 3) is actually even better than the
original logic, so the following patch will introduce it properly once
again.  For stable backports of this patch it makes thus sense to only fix
versions containing 0cbef29a7821.

[iamjoonsoo.kim@lge.com: tracepoint fix]
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
10 years agomm, hugetlb: remove unnecessary lower bound on sysctl handlers"?
Andrey Ryabinin [Tue, 10 Feb 2015 22:11:33 +0000 (14:11 -0800)] 
mm, hugetlb: remove unnecessary lower bound on sysctl handlers"?

commit 3cd7645de624939c38f5124b4ac15f8b35a1a8b7 upstream.

Commit ed4d4902ebdd ("mm, hugetlb: remove hugetlb_zero and
hugetlb_infinity") replaced 'unsigned long hugetlb_zero' with 'int zero'
leading to out-of-bounds access in proc_doulongvec_minmax().  Use
'.extra1 = NULL' instead of '.extra1 = &zero'.  Passing NULL is
equivalent to passing minimal value, which is 0 for unsigned types.

Fixes: ed4d4902ebdd ("mm, hugetlb: remove hugetlb_zero and hugetlb_infinity")
Signed-off-by: Andrey Ryabinin <a.ryabinin@samsung.com>
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Suggested-by: Manfred Spraul <manfred@colorfullife.com>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
10 years agomm/hugetlb: add migration entry check in __unmap_hugepage_range
Naoya Horiguchi [Wed, 11 Feb 2015 23:25:32 +0000 (15:25 -0800)] 
mm/hugetlb: add migration entry check in __unmap_hugepage_range

commit 9fbc1f635fd0bd28cb32550211bf095753ac637a upstream.

If __unmap_hugepage_range() tries to unmap the address range over which
hugepage migration is on the way, we get the wrong page because pte_page()
doesn't work for migration entries.  This patch simply clears the pte for
migration entries as we do for hwpoison entries.

Fixes: 290408d4a2 ("hugetlb: hugepage migration core")
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: James Hogan <james.hogan@imgtec.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Luiz Capitulino <lcapitulino@redhat.com>
Cc: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: Steve Capper <steve.capper@linaro.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
10 years agomm/hugetlb: add migration/hwpoisoned entry check in hugetlb_change_protection
Naoya Horiguchi [Wed, 11 Feb 2015 23:25:28 +0000 (15:25 -0800)] 
mm/hugetlb: add migration/hwpoisoned entry check in hugetlb_change_protection

commit a8bda28d87c38c6aa93de28ba5d30cc18e865a11 upstream.

There is a race condition between hugepage migration and
change_protection(), where hugetlb_change_protection() doesn't care about
migration entries and wrongly overwrites them.  That causes unexpected
results like kernel crash.  HWPoison entries also can cause the same
problem.

This patch adds is_hugetlb_entry_(migration|hwpoisoned) check in this
function to do proper actions.

Fixes: 290408d4a2 ("hugetlb: hugepage migration core")
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: James Hogan <james.hogan@imgtec.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Luiz Capitulino <lcapitulino@redhat.com>
Cc: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: Steve Capper <steve.capper@linaro.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
10 years agomm/hugetlb: fix getting refcount 0 page in hugetlb_fault()
Naoya Horiguchi [Wed, 11 Feb 2015 23:25:25 +0000 (15:25 -0800)] 
mm/hugetlb: fix getting refcount 0 page in hugetlb_fault()

commit 0f792cf949a0be506c2aa8bfac0605746b146dda upstream.

When running the test which causes the race as shown in the previous patch,
we can hit the BUG "get_page() on refcount 0 page" in hugetlb_fault().

This race happens when pte turns into migration entry just after the first
check of is_hugetlb_entry_migration() in hugetlb_fault() passed with false.
To fix this, we need to check pte_present() again after huge_ptep_get().

This patch also reorders taking ptl and doing pte_page(), because
pte_page() should be done in ptl.  Due to this reordering, we need use
trylock_page() in page != pagecache_page case to respect locking order.

Fixes: 66aebce747ea ("hugetlb: fix race condition in hugetlb_fault()")
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: James Hogan <james.hogan@imgtec.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Luiz Capitulino <lcapitulino@redhat.com>
Cc: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: Steve Capper <steve.capper@linaro.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
10 years agoteam: don't traverse port list using rcu in team_set_mac_address
Jiri Pirko [Wed, 4 Mar 2015 07:36:31 +0000 (08:36 +0100)] 
team: don't traverse port list using rcu in team_set_mac_address

[ Upstream commit 9215f437b85da339a7dfe3db6e288637406f88b2 ]

Currently the list is traversed using rcu variant. That is not correct
since dev_set_mac_address can be called which eventually calls
rtmsg_ifinfo_build_skb and there, skb allocation can sleep. So fix this
by remove the rcu usage here.

Fixes: 3d249d4ca7 "net: introduce ethernet teaming device"
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
10 years agonet: ping: Return EAFNOSUPPORT when appropriate.
Lorenzo Colitti [Tue, 3 Mar 2015 14:16:16 +0000 (23:16 +0900)] 
net: ping: Return EAFNOSUPPORT when appropriate.

[ Upstream commit 9145736d4862145684009d6a72a6e61324a9439e ]

1. For an IPv4 ping socket, ping_check_bind_addr does not check
   the family of the socket address that's passed in. Instead,
   make it behave like inet_bind, which enforces either that the
   address family is AF_INET, or that the family is AF_UNSPEC and
   the address is 0.0.0.0.
2. For an IPv6 ping socket, ping_check_bind_addr returns EINVAL
   if the socket family is not AF_INET6. Return EAFNOSUPPORT
   instead, for consistency with inet6_bind.
3. Make ping_v4_sendmsg and ping_v6_sendmsg return EAFNOSUPPORT
   instead of EINVAL if an incorrect socket address structure is
   passed in.
4. Make IPv6 ping sockets be IPv6-only. The code does not support
   IPv4, and it cannot easily be made to support IPv4 because
   the protocol numbers for ICMP and ICMPv6 are different. This
   makes connect(::ffff:192.0.2.1) fail with EAFNOSUPPORT instead
   of making the socket unusable.

Among other things, this fixes an oops that can be triggered by:

    int s = socket(AF_INET, SOCK_DGRAM, IPPROTO_ICMP);
    struct sockaddr_in6 sin6 = {
        .sin6_family = AF_INET6,
        .sin6_addr = in6addr_any,
    };
    bind(s, (struct sockaddr *) &sin6, sizeof(sin6));

Change-Id: If06ca86d9f1e4593c0d6df174caca3487c57a241
Signed-off-by: Lorenzo Colitti <lorenzo@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
10 years agoudp: only allow UFO for packets from SOCK_DGRAM sockets
Michal Kubeček [Mon, 2 Mar 2015 17:27:11 +0000 (18:27 +0100)] 
udp: only allow UFO for packets from SOCK_DGRAM sockets

[ Upstream commit acf8dd0a9d0b9e4cdb597c2f74802f79c699e802 ]

If an over-MTU UDP datagram is sent through a SOCK_RAW socket to a
UFO-capable device, ip_ufo_append_data() sets skb->ip_summed to
CHECKSUM_PARTIAL unconditionally as all GSO code assumes transport layer
checksum is to be computed on segmentation. However, in this case,
skb->csum_start and skb->csum_offset are never set as raw socket
transmit path bypasses udp_send_skb() where they are usually set. As a
result, driver may access invalid memory when trying to calculate the
checksum and store the result (as observed in virtio_net driver).

Moreover, the very idea of modifying the userspace provided UDP header
is IMHO against raw socket semantics (I wasn't able to find a document
clearly stating this or the opposite, though). And while allowing
CHECKSUM_NONE in the UFO case would be more efficient, it would be a bit
too intrusive change just to handle a corner case like this. Therefore
disallowing UFO for packets from SOCK_DGRAM seems to be the best option.

Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
10 years agousb: plusb: Add support for National Instruments host-to-host cable
Ben Shelton [Mon, 16 Feb 2015 19:47:06 +0000 (13:47 -0600)] 
usb: plusb: Add support for National Instruments host-to-host cable

[ Upstream commit 42c972a1f390e3bc51ca1e434b7e28764992067f ]

The National Instruments USB Host-to-Host Cable is based on the Prolific
PL-25A1 chipset.  Add its VID/PID so the plusb driver will recognize it.

Signed-off-by: Ben Shelton <ben.shelton@ni.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
10 years agonet: do not use rcu in rtnl_dump_ifinfo()
Eric Dumazet [Fri, 27 Feb 2015 17:42:50 +0000 (09:42 -0800)] 
net: do not use rcu in rtnl_dump_ifinfo()

[ Upstream commit cac5e65e8a7ea074f2626d2eaa53aa308452dec4 ]

We did a failed attempt in the past to only use rcu in rtnl dump
operations (commit e67f88dd12f6 "net: dont hold rtnl mutex during
netlink dump callbacks")

Now that dumps are holding RTNL anyway, there is no need to also
use rcu locking, as it forbids any scheduling ability, like
GFP_KERNEL allocations that controlling path should use instead
of GFP_ATOMIC whenever possible.

This should fix following splat Cong Wang reported :

 [ INFO: suspicious RCU usage. ]
 3.19.0+ #805 Tainted: G        W

 include/linux/rcupdate.h:538 Illegal context switch in RCU read-side critical section!

 other info that might help us debug this:

 rcu_scheduler_active = 1, debug_locks = 0
 2 locks held by ip/771:
  #0:  (rtnl_mutex){+.+.+.}, at: [<ffffffff8182b8f4>] netlink_dump+0x21/0x26c
  #1:  (rcu_read_lock){......}, at: [<ffffffff817d785b>] rcu_read_lock+0x0/0x6e

 stack backtrace:
 CPU: 3 PID: 771 Comm: ip Tainted: G        W       3.19.0+ #805
 Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
  0000000000000001 ffff8800d51e7718 ffffffff81a27457 0000000029e729e6
  ffff8800d6108000 ffff8800d51e7748 ffffffff810b539b ffffffff820013dd
  00000000000001c8 0000000000000000 ffff8800d7448088 ffff8800d51e7758
 Call Trace:
  [<ffffffff81a27457>] dump_stack+0x4c/0x65
  [<ffffffff810b539b>] lockdep_rcu_suspicious+0x107/0x110
  [<ffffffff8109796f>] rcu_preempt_sleep_check+0x45/0x47
  [<ffffffff8109e457>] ___might_sleep+0x1d/0x1cb
  [<ffffffff8109e67d>] __might_sleep+0x78/0x80
  [<ffffffff814b9b1f>] idr_alloc+0x45/0xd1
  [<ffffffff810cb7ab>] ? rcu_read_lock_held+0x3b/0x3d
  [<ffffffff814b9f9d>] ? idr_for_each+0x53/0x101
  [<ffffffff817c1383>] alloc_netid+0x61/0x69
  [<ffffffff817c14c3>] __peernet2id+0x79/0x8d
  [<ffffffff817c1ab7>] peernet2id+0x13/0x1f
  [<ffffffff817d8673>] rtnl_fill_ifinfo+0xa8d/0xc20
  [<ffffffff810b17d9>] ? __lock_is_held+0x39/0x52
  [<ffffffff817d894f>] rtnl_dump_ifinfo+0x149/0x213
  [<ffffffff8182b9c2>] netlink_dump+0xef/0x26c
  [<ffffffff8182bcba>] netlink_recvmsg+0x17b/0x2c5
  [<ffffffff817b0adc>] __sock_recvmsg+0x4e/0x59
  [<ffffffff817b1b40>] sock_recvmsg+0x3f/0x51
  [<ffffffff817b1f9a>] ___sys_recvmsg+0xf6/0x1d9
  [<ffffffff8115dc67>] ? handle_pte_fault+0x6e1/0xd3d
  [<ffffffff8100a3a0>] ? native_sched_clock+0x35/0x37
  [<ffffffff8109f45b>] ? sched_clock_local+0x12/0x72
  [<ffffffff8109f6ac>] ? sched_clock_cpu+0x9e/0xb7
  [<ffffffff810cb7ab>] ? rcu_read_lock_held+0x3b/0x3d
  [<ffffffff811abde8>] ? __fcheck_files+0x4c/0x58
  [<ffffffff811ac556>] ? __fget_light+0x2d/0x52
  [<ffffffff817b376f>] __sys_recvmsg+0x42/0x60
  [<ffffffff817b379f>] SyS_recvmsg+0x12/0x1c

Signed-off-by: Eric Dumazet <edumazet@google.com>
Fixes: 0c7aecd4bde4b7302 ("netns: add rtnl cmd to add and get peer netns ids")
Cc: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Reported-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
10 years agosh_eth: Fix lost MAC address on kexec
Geert Uytterhoeven [Fri, 27 Feb 2015 16:16:26 +0000 (17:16 +0100)] 
sh_eth: Fix lost MAC address on kexec

[ Upstream commit a14c7d15ca91b444e77df08b916befdce77562ab ]

Commit 740c7f31c094703c ("sh_eth: Ensure DMA engines are stopped before
freeing buffers") added a call to sh_eth_reset() to the
sh_eth_set_ringparam() and sh_eth_close() paths.

However, setting the software reset bit(s) in the EDMR register resets
the MAC Address Registers to zero. Hence after kexec, the new kernel
doesn't detect a valid MAC address and assigns a random MAC address,
breaking DHCP.

Set the MAC address again after the reset in sh_eth_dev_exit() to fix
this.

Tested on r8a7740/armadillo (GETHER) and r8a7791/koelsch (FAST_RCAR).

Fixes: 740c7f31c094703c ("sh_eth: Ensure DMA engines are stopped before freeing buffers")
Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
10 years agonet: bcmgenet: fix software maintained statistics
Florian Fainelli [Sun, 1 Mar 2015 02:09:16 +0000 (18:09 -0800)] 
net: bcmgenet: fix software maintained statistics

[ Upstream commit f62ba9c14b85a682b64a4c421f91de0bd2aa8538 ]

Commit 44c8bc3ce39f ("net: bcmgenet: log RX buffer allocation and RX/TX dma
failures") added a few software maintained statistics using
BCMGENET_STAT_MIB_RX and BCMGENET_STAT_MIB_TX. These statistics are read from
the hardware MIB counters, such that bcmgenet_update_mib_counters() was trying
to read from a non-existing MIB offset for these counters.

Fix this by introducing a special type: BCMGENET_STAT_SOFT, similar to
BCMGENET_STAT_NETDEV, such that bcmgenet_get_ethtool_stats will read from the
software mib.

Fixes: 44c8bc3ce39f ("net: bcmgenet: log RX buffer allocation and RX/TX dma failures")
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
10 years agonet: bcmgenet: fix throughtput regression
Jaedon Shin [Sat, 28 Feb 2015 02:48:26 +0000 (11:48 +0900)] 
net: bcmgenet: fix throughtput regression

[ Upstream commit 4092e6acf5cb16f56154e2dd22d647023dc3d646 ]

This patch adds bcmgenet_tx_poll for the tx_rings. This can reduce the
interrupt load and send xmit in network stack on time. This also
separated for the completion of tx_ring16 from bcmgenet_poll.

The bcmgenet_tx_reclaim of tx_ring[{0,1,2,3}] operative by an interrupt
is to be not more than a certain number TxBDs. It is caused by too
slowly reclaiming the transmitted skb. Therefore, performance
degradation of xmit after 605ad7f ("tcp: refine TSO autosizing").

Signed-off-by: Jaedon Shin <jaedon.shin@gmail.com>
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
10 years agomacvtap: make sure neighbour code can push ethernet header
Eric Dumazet [Sat, 28 Feb 2015 02:35:35 +0000 (18:35 -0800)] 
macvtap: make sure neighbour code can push ethernet header

[ Upstream commit 2f1d8b9e8afa5a833d96afcd23abcb8cdf8d83ab ]

Brian reported crashes using IPv6 traffic with macvtap/veth combo.

I tracked the crashes in neigh_hh_output()

-> memcpy(skb->data - HH_DATA_MOD, hh->hh_data, HH_DATA_MOD);

Neighbour code assumes headroom to push Ethernet header is
at least 16 bytes.

It appears macvtap has only 14 bytes available on arches
where NET_IP_ALIGN is 0 (like x86)

Effect is a corruption of 2 bytes right before skb->head,
and possible crashes if accessing non existing memory.

This fix should also increase IPv4 performance, as paranoid code
in ip_finish_output2() wont have to call skb_realloc_headroom()

Reported-by: Brian Rak <brak@vultr.com>
Tested-by: Brian Rak <brak@vultr.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
10 years agonet: compat: Ignore MSG_CMSG_COMPAT in compat_sys_{send, recv}msg
Catalin Marinas [Mon, 23 Feb 2015 18:12:56 +0000 (18:12 +0000)] 
net: compat: Ignore MSG_CMSG_COMPAT in compat_sys_{send, recv}msg

[ Upstream commit d720d8cec563ce4e4fa44a613d4f2dcb1caf2998 ]

With commit a7526eb5d06b (net: Unbreak compat_sys_{send,recv}msg), the
MSG_CMSG_COMPAT flag is blocked at the compat syscall entry points,
changing the kernel compat behaviour from the one before the commit it
was trying to fix (1be374a0518a, net: Block MSG_CMSG_COMPAT in
send(m)msg and recv(m)msg).

On 32-bit kernels (!CONFIG_COMPAT), MSG_CMSG_COMPAT is 0 and the native
32-bit sys_sendmsg() allows flag 0x80000000 to be set (it is ignored by
the kernel). However, on a 64-bit kernel, the compat ABI is different
with commit a7526eb5d06b.

This patch changes the compat_sys_{send,recv}msg behaviour to the one
prior to commit 1be374a0518a.

The problem was found running 32-bit LTP (sendmsg01) binary on an arm64
kernel. Arguably, LTP should not pass 0xffffffff as flags to sendmsg()
but the general rule is not to break user ABI (even when the user
behaviour is not entirely sane).

Fixes: a7526eb5d06b (net: Unbreak compat_sys_{send,recv}msg)
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
10 years agoteam: fix possible null pointer dereference in team_handle_frame
Jiri Pirko [Mon, 23 Feb 2015 13:02:54 +0000 (14:02 +0100)] 
team: fix possible null pointer dereference in team_handle_frame

[ Upstream commit 57e595631904c827cfa1a0f7bbd7cc9a49da5745 ]

Currently following race is possible in team:

CPU0                                        CPU1
                                            team_port_del
                                              team_upper_dev_unlink
                                                priv_flags &= ~IFF_TEAM_PORT
team_handle_frame
  team_port_get_rcu
    team_port_exists
      priv_flags & IFF_TEAM_PORT == 0
    return NULL (instead of port got
                 from rx_handler_data)
                                              netdev_rx_handler_unregister

The thing is that the flag is removed before rx_handler is unregistered.
If team_handle_frame is called in between, team_port_exists returns 0
and team_port_get_rcu will return NULL.
So do not check the flag here. It is guaranteed by netdev_rx_handler_unregister
that team_handle_frame will always see valid rx_handler_data pointer.

Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Fixes: 3d249d4ca7d0 ("net: introduce ethernet teaming device")
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
10 years agonet: pktgen: disable xmit_clone on virtual devices
Eric Dumazet [Mon, 23 Feb 2015 01:03:41 +0000 (17:03 -0800)] 
net: pktgen: disable xmit_clone on virtual devices

[ Upstream commit 52d6c8c6ca125872459054daa70f2f1c698c8e75 ]

Trying to use burst capability (aka xmit_more) on a virtual device
like bonding is not supported.

For example, skb might be queued multiple times on a qdisc, with
various list corruptions.

Fixes: 38b2cf2982dc ("net: pktgen: packet bursting via skb->xmit_more")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Alexei Starovoitov <ast@plumgrid.com>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
10 years agoRevert "r8169: add support for Byte Queue Limits"
David S. Miller [Tue, 10 Mar 2015 22:47:33 +0000 (18:47 -0400)] 
Revert "r8169: add support for Byte Queue Limits"

This reverts commit 1e918876853aa85435e0f17fd8b4a92dcfff53d6.

Revert BQL support in r8169 driver as several regressions
point to this commit and we cannot figure out the real
cause yet.

Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
10 years agonet: reject creation of netdev names with colons
Matthew Thode [Wed, 18 Feb 2015 00:31:57 +0000 (18:31 -0600)] 
net: reject creation of netdev names with colons

[ Upstream commit a4176a9391868bfa87705bcd2e3b49e9b9dd2996 ]

colons are used as a separator in netdev device lookup in dev_ioctl.c

Specific functions are SIOCGIFTXQLEN SIOCETHTOOL SIOCSIFNAME

Signed-off-by: Matthew Thode <mthode@mthode.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
10 years agosock: sock_dequeue_err_skb() needs hard irq safety
Eric Dumazet [Wed, 18 Feb 2015 13:47:55 +0000 (05:47 -0800)] 
sock: sock_dequeue_err_skb() needs hard irq safety

[ Upstream commit 997d5c3f4427f38562cbe207ce05bb25fdcb993b ]

Non NAPI drivers can call skb_tstamp_tx() and then sock_queue_err_skb()
from hard IRQ context.

Therefore, sock_dequeue_err_skb() needs to block hard irq or
corruptions or hangs can happen.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Fixes: 364a9e93243d1 ("sock: deduplicate errqueue dequeue")
Fixes: cb820f8e4b7f7 ("net: Provide a generic socket error queue delivery method for Tx time stamps.")
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
10 years agoopenvswitch: Fix net exit.
Pravin B Shelar [Tue, 17 Feb 2015 19:23:10 +0000 (11:23 -0800)] 
openvswitch: Fix net exit.

[ Upstream commit 7b4577a9da3702049650f7095506e9afd9f68849 ]

Open vSwitch allows moving internal vport to different namespace
while still connected to the bridge. But when namespace deleted
OVS does not detach these vports, that results in dangling
pointer to netdevice which causes kernel panic as follows.
This issue is fixed by detaching all ovs ports from the deleted
namespace at net-exit.

BUG: unable to handle kernel NULL pointer dereference at 0000000000000028
IP: [<ffffffffa0aadaa5>] ovs_vport_locate+0x35/0x80 [openvswitch]
Oops: 0000 [#1] SMP
Call Trace:
 [<ffffffffa0aa6391>] lookup_vport+0x21/0xd0 [openvswitch]
 [<ffffffffa0aa65f9>] ovs_vport_cmd_get+0x59/0xf0 [openvswitch]
 [<ffffffff8167e07c>] genl_family_rcv_msg+0x1bc/0x3e0
 [<ffffffff8167e319>] genl_rcv_msg+0x79/0xc0
 [<ffffffff8167d919>] netlink_rcv_skb+0xb9/0xe0
 [<ffffffff8167deac>] genl_rcv+0x2c/0x40
 [<ffffffff8167cffd>] netlink_unicast+0x12d/0x1c0
 [<ffffffff8167d3da>] netlink_sendmsg+0x34a/0x6b0
 [<ffffffff8162e140>] sock_sendmsg+0xa0/0xe0
 [<ffffffff8162e5e8>] ___sys_sendmsg+0x408/0x420
 [<ffffffff8162f541>] __sys_sendmsg+0x51/0x90
 [<ffffffff8162f592>] SyS_sendmsg+0x12/0x20
 [<ffffffff81764ee9>] system_call_fastpath+0x12/0x17

Reported-by: Assaf Muller <amuller@redhat.com>
Fixes: 46df7b81454("openvswitch: Add support for network namespaces.")
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Reviewed-by: Thomas Graf <tgraf@noironetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
10 years agoematch: Fix auto-loading of ematch modules.
Ignacy Gawędzki [Tue, 17 Feb 2015 19:15:20 +0000 (20:15 +0100)] 
ematch: Fix auto-loading of ematch modules.

[ Upstream commit 34eea79e2664b314cab6a30fc582fdfa7a1bb1df ]

In tcf_em_validate(), after calling request_module() to load the
kind-specific module, set em->ops to NULL before returning -EAGAIN, so
that module_put() is not called again by tcf_em_tree_destroy().

Signed-off-by: Ignacy Gawędzki <ignacy.gawedzki@green-communications.fr>
Acked-by: Cong Wang <cwang@twopensource.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
10 years agonet: phy: Fix verification of EEE support in phy_init_eee
Guenter Roeck [Tue, 17 Feb 2015 17:36:22 +0000 (09:36 -0800)] 
net: phy: Fix verification of EEE support in phy_init_eee

[ Upstream commit 54da5a8be3c1e924c35480eb44c6e9b275f6444e ]

phy_init_eee uses phy_find_setting(phydev->speed, phydev->duplex)
to find a valid entry in the settings array for the given speed
and duplex value. For full duplex 1000baseT, this will return
the first matching entry, which is the entry for 1000baseKX_Full.

If the phy eee does not support 1000baseKX_Full, this entry will not
match, causing phy_init_eee to fail for no good reason.

Fixes: 9a9c56cb34e6 ("net: phy: fix a bug when verify the EEE support")
Fixes: 3e7077067e80c ("phy: Expand phy speed/duplex settings array")
Cc: Giuseppe Cavallaro <peppe.cavallaro@st.com>
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Acked-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
10 years agoipv4: ip_check_defrag should not assume that skb_network_offset is zero
Alexander Drozdov [Thu, 5 Mar 2015 07:29:39 +0000 (10:29 +0300)] 
ipv4: ip_check_defrag should not assume that skb_network_offset is zero

[ Upstream commit 3e32e733d1bbb3f227259dc782ef01d5706bdae0 ]

ip_check_defrag() may be used by af_packet to defragment outgoing packets.
skb_network_offset() of af_packet's outgoing packets is not zero.

Signed-off-by: Alexander Drozdov <al.drozdov@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
10 years agoipv4: ip_check_defrag should correctly check return value of skb_copy_bits
Alexander Drozdov [Tue, 17 Feb 2015 10:33:46 +0000 (13:33 +0300)] 
ipv4: ip_check_defrag should correctly check return value of skb_copy_bits

[ Upstream commit fba04a9e0c869498889b6445fd06cbe7da9bb834 ]

skb_copy_bits() returns zero on success and negative value on error,
so it is needed to invert the condition in ip_check_defrag().

Fixes: 1bf3751ec90c ("ipv4: ip_check_defrag must not modify skb before unsharing")
Signed-off-by: Alexander Drozdov <al.drozdov@gmail.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
10 years agogen_stats.c: Duplicate xstats buffer for later use
Ignacy Gawędzki [Fri, 13 Feb 2015 22:47:05 +0000 (14:47 -0800)] 
gen_stats.c: Duplicate xstats buffer for later use

[ Upstream commit 1c4cff0cf55011792125b6041bc4e9713e46240f ]

The gnet_stats_copy_app() function gets called, more often than not, with its
second argument a pointer to an automatic variable in the caller's stack.
Therefore, to avoid copying garbage afterwards when calling
gnet_stats_finish_copy(), this data is better copied to a dynamically allocated
memory that gets freed after use.

[xiyou.wangcong@gmail.com: remove a useless kfree()]

Signed-off-by: Ignacy Gawędzki <ignacy.gawedzki@green-communications.fr>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
10 years agortnetlink: call ->dellink on failure when ->newlink exists
WANG Cong [Fri, 13 Feb 2015 21:56:53 +0000 (13:56 -0800)] 
rtnetlink: call ->dellink on failure when ->newlink exists

[ Upstream commit 7afb8886a05be68e376655539a064ec672de8a8e ]

Ignacy reported that when eth0 is down and add a vlan device
on top of it like:

  ip link add link eth0 name eth0.1 up type vlan id 1

We will get a refcount leak:

  unregister_netdevice: waiting for eth0.1 to become free. Usage count = 2

The problem is when rtnl_configure_link() fails in rtnl_newlink(),
we simply call unregister_device(), but for stacked device like vlan,
we almost do nothing when we unregister the upper device, more work
is done when we unregister the lower device, so call its ->dellink().

Reported-by: Ignacy Gawedzki <ignacy.gawedzki@green-communications.fr>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
10 years agoipv6: fix ipv6_cow_metrics for non DST_HOST case
Martin KaFai Lau [Fri, 13 Feb 2015 00:14:08 +0000 (16:14 -0800)] 
ipv6: fix ipv6_cow_metrics for non DST_HOST case

[ Upstream commit 3b4711757d7903ab6fa88a9e7ab8901b8227da60 ]

ipv6_cow_metrics() currently assumes only DST_HOST routes require
dynamic metrics allocation from inetpeer.  The assumption breaks
when ndisc discovered router with RTAX_MTU and RTAX_HOPLIMIT metric.
Refer to ndisc_router_discovery() in ndisc.c and note that dst_metric_set()
is called after the route is created.

This patch creates the metrics array (by calling dst_cow_metrics_generic) in
ipv6_cow_metrics().

Test:
radvd.conf:
interface qemubr0
{
AdvLinkMTU 1300;
AdvCurHopLimit 30;

prefix fd00:face:face:face::/64
{
AdvOnLink on;
AdvAutonomous on;
AdvRouterAddr off;
};
};

Before:
[root@qemu1 ~]# ip -6 r show | egrep -v unreachable
fd00:face:face:face::/64 dev eth0  proto kernel  metric 256  expires 27sec
fe80::/64 dev eth0  proto kernel  metric 256
default via fe80::74df:d0ff:fe23:8ef2 dev eth0  proto ra  metric 1024  expires 27sec

After:
[root@qemu1 ~]# ip -6 r show | egrep -v unreachable
fd00:face:face:face::/64 dev eth0  proto kernel  metric 256  expires 27sec mtu 1300
fe80::/64 dev eth0  proto kernel  metric 256  mtu 1300
default via fe80::74df:d0ff:fe23:8ef2 dev eth0  proto ra  metric 1024  expires 27sec mtu 1300 hoplimit 30

Fixes: 8e2ec639173f325 (ipv6: don't use inetpeer to store metrics for routes.)
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
10 years agotcp: make sure skb is not shared before using skb_get()
Eric Dumazet [Fri, 13 Feb 2015 12:47:12 +0000 (04:47 -0800)] 
tcp: make sure skb is not shared before using skb_get()

[ Upstream commit ba34e6d9d346fe4e05d7e417b9edf5140772d34c ]

IPv6 can keep a copy of SYN message using skb_get() in
tcp_v6_conn_request() so that caller wont free the skb when calling
kfree_skb() later.

Therefore TCP fast open has to clone the skb it is queuing in
child->sk_receive_queue, as all skbs consumed from receive_queue are
freed using __kfree_skb() (ie assuming skb->users == 1)

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Fixes: 5b7ed0892f2af ("tcp: move fastopen functions to tcp_fastopen.c")
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
10 years agoipv6: Make __ipv6_select_ident static
Vlad Yasevich [Mon, 9 Feb 2015 14:38:21 +0000 (09:38 -0500)] 
ipv6: Make __ipv6_select_ident static

[ Upstream commit 8381eacf5c3b35cf7755f4bc521c4d56d24c1cd9 ]

Make __ipv6_select_ident() static as it isn't used outside
the file.

Fixes: 0508c07f5e0c9 (ipv6: Select fragment id during UFO segmentation if not set.)
Signed-off-by: Vladislav Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
10 years agoipv6: Fix fragment id assignment on LE arches.
Vlad Yasevich [Mon, 9 Feb 2015 14:38:20 +0000 (09:38 -0500)] 
ipv6: Fix fragment id assignment on LE arches.

[ Upstream commit 51f30770e50eb787200f30a79105e2615b379334 ]

Recent commit:
0508c07f5e0c94f38afd5434e8b2a55b84553077
Author: Vlad Yasevich <vyasevich@gmail.com>
Date:   Tue Feb 3 16:36:15 2015 -0500

    ipv6: Select fragment id during UFO segmentation if not set.

Introduced a bug on LE in how ipv6 fragment id is assigned.
This was cought by nightly sparce check:

Resolve the following sparce error:
 net/ipv6/output_core.c:57:38: sparse: incorrect type in assignment
 (different base types)
   net/ipv6/output_core.c:57:38:    expected restricted __be32
[usertype] ip6_frag_id
   net/ipv6/output_core.c:57:38:    got unsigned int [unsigned]
[assigned] [usertype] id

Fixes: 0508c07f5e0c9 (ipv6: Select fragment id during UFO segmentation if not set.)
Signed-off-by: Vladislav Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
10 years agortnetlink: ifla_vf_policy: fix misuses of NLA_BINARY
Daniel Borkmann [Thu, 5 Feb 2015 17:44:04 +0000 (18:44 +0100)] 
rtnetlink: ifla_vf_policy: fix misuses of NLA_BINARY

[ Upstream commit 364d5716a7adb91b731a35765d369602d68d2881 ]

ifla_vf_policy[] is wrong in advertising its individual member types as
NLA_BINARY since .type = NLA_BINARY in combination with .len declares the
len member as *max* attribute length [0, len].

The issue is that when do_setvfinfo() is being called to set up a VF
through ndo handler, we could set corrupted data if the attribute length
is less than the size of the related structure itself.

The intent is exactly the opposite, namely to make sure to pass at least
data of minimum size of len.

Fixes: ebc08a6f47ee ("rtnetlink: Add VF config code to rtnetlink")
Cc: Mitch Williams <mitch.a.williams@intel.com>
Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
10 years agopktgen: fix UDP checksum computation
Sabrina Dubroca [Wed, 4 Feb 2015 22:08:50 +0000 (23:08 +0100)] 
pktgen: fix UDP checksum computation

[ Upstream commit 7744b5f3693cc06695cb9d6667671c790282730f ]

This patch fixes two issues in UDP checksum computation in pktgen.

First, the pseudo-header uses the source and destination IP
addresses. Currently, the ports are used for IPv4.

Second, the UDP checksum covers both header and data.  So we need to
generate the data earlier (move pktgen_finalize_skb up), and compute
the checksum for UDP header + data.

Fixes: c26bf4a51308c ("pktgen: Add UDPCSUM flag to support UDP checksums")
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
10 years agoipv6: addrconf: add missing validate_link_af handler
Daniel Borkmann [Thu, 5 Feb 2015 13:39:11 +0000 (14:39 +0100)] 
ipv6: addrconf: add missing validate_link_af handler

[ Upstream commit 11b1f8288d4341af5d755281c871bff6c3e270dd ]

We still need a validate_link_af() handler with an appropriate nla policy,
similarly as we have in IPv4 case, otherwise size validations are not being
done properly in that case.

Fixes: f53adae4eae5 ("net: ipv6: add tokenized interface identifier support")
Fixes: bc91b0f07ada ("ipv6: addrconf: implement address generation modes")
Cc: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
10 years agoflowcache: Fix kernel panic in flow_cache_flush_task
Miroslav Urbanek [Thu, 5 Feb 2015 15:36:50 +0000 (16:36 +0100)] 
flowcache: Fix kernel panic in flow_cache_flush_task

[ Upstream commit 233c96fc077d310772375d47522fb444ff546905 ]

flow_cache_flush_task references a structure member flow_cache_gc_work
where it should reference flow_cache_flush_task instead.

Kernel panic occurs on kernels using IPsec during XFRM garbage
collection. The garbage collection interval can be shortened using the
following sysctl settings:

net.ipv4.xfrm4_gc_thresh=4
net.ipv6.xfrm6_gc_thresh=4

With the default settings, our productions servers crash approximately
once a week. With the settings above, they crash immediately.

Fixes: ca925cf1534e ("flowcache: Make flow cache name space aware")
Reported-by: Tomáš Charvát <tc@excello.cz>
Tested-by: Jan Hejl <jh@excello.cz>
Signed-off-by: Miroslav Urbanek <mu@miroslavurbanek.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
10 years agoLinux 3.19.1 v3.19.1
Greg Kroah-Hartman [Fri, 6 Mar 2015 22:57:59 +0000 (14:57 -0800)] 
Linux 3.19.1

10 years agoppc/kvm: Replace ACCESS_ONCE with READ_ONCE
Christian Borntraeger [Tue, 6 Jan 2015 21:41:46 +0000 (22:41 +0100)] 
ppc/kvm: Replace ACCESS_ONCE with READ_ONCE

commit 5ee07612e9e20817bb99256ab6cf1400fd5aa270 upstream.

ACCESS_ONCE does not work reliably on non-scalar types. For
example gcc 4.6 and 4.7 might remove the volatile tag for such
accesses during the SRA (scalar replacement of aggregates) step
(https://gcc.gnu.org/bugzilla/show_bug.cgi?id=58145)

Change the ppc/kvm code to replace ACCESS_ONCE with READ_ONCE.

Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Acked-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
10 years agoppc/hugetlbfs: Replace ACCESS_ONCE with READ_ONCE
Christian Borntraeger [Tue, 6 Jan 2015 21:47:41 +0000 (22:47 +0100)] 
ppc/hugetlbfs: Replace ACCESS_ONCE with READ_ONCE

commit da1a288d8562739aa8ba0273d4fb6b73b856c0d3 upstream.

ACCESS_ONCE does not work reliably on non-scalar types. For
example gcc 4.6 and 4.7 might remove the volatile tag for such
accesses during the SRA (scalar replacement of aggregates) step
(https://gcc.gnu.org/bugzilla/show_bug.cgi?id=58145)

Change the ppc/hugetlbfs code to replace ACCESS_ONCE with READ_ONCE.

Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
10 years agomm/gup: Replace ACCESS_ONCE with READ_ONCE
Christian Borntraeger [Tue, 6 Jan 2015 21:54:46 +0000 (22:54 +0100)] 
mm/gup: Replace ACCESS_ONCE with READ_ONCE

commit 38c5ce936a0862a6ce2c8d1c72689a3aba301425 upstream.

ACCESS_ONCE does not work reliably on non-scalar types. For
example gcc 4.6 and 4.7 might remove the volatile tag for such
accesses during the SRA (scalar replacement of aggregates) step
(https://gcc.gnu.org/bugzilla/show_bug.cgi?id=58145)

Fixup gup_pmd_range.

Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
10 years agonext: sh: Fix compile error
Guenter Roeck [Wed, 7 Jan 2015 20:32:28 +0000 (12:32 -0800)] 
next: sh: Fix compile error

commit 378af02b1aecabb3756e19c0cbb8cdd9c3b9637f upstream.

Commit 927609d622a3 ("kernel: tighten rules for ACCESS ONCE") results in a
compile failure for sh builds with CONFIG_X2TLB enabled.

arch/sh/mm/gup.c: In function 'gup_get_pte':
arch/sh/mm/gup.c:20:2: error: invalid initializer
make[1]: *** [arch/sh/mm/gup.o] Error 1

Replace ACCESS_ONCE with READ_ONCE to fix the problem.

Fixes: 927609d622a3 ("kernel: tighten rules for ACCESS ONCE")
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
10 years agoquota: Store maximum space limit in bytes
Jan Kara [Thu, 9 Oct 2014 14:54:13 +0000 (16:54 +0200)] 
quota: Store maximum space limit in bytes

commit b10a08194c2b615955dfab2300331a90ae9344c7 upstream.

Currently maximum space limit quota format supports is in blocks however
since we store space limits in bytes, this is somewhat confusing. So
store the maximum limit in bytes as well. Also rename the field to match
the new unit and related inode field to match the new naming scheme.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
10 years agox86/xen/p2m: Replace ACCESS_ONCE with READ_ONCE
Christian Borntraeger [Sun, 7 Dec 2014 21:01:59 +0000 (22:01 +0100)] 
x86/xen/p2m: Replace ACCESS_ONCE with READ_ONCE

commit 1760f1eb7ec485197bd3a8a9c13e4160bb740275 upstream.

ACCESS_ONCE does not work reliably on non-scalar types. For
example gcc 4.6 and 4.7 might remove the volatile tag for such
accesses during the SRA (scalar replacement of aggregates) step
(https://gcc.gnu.org/bugzilla/show_bug.cgi?id=58145)

Change the p2m code to replace ACCESS_ONCE with READ_ONCE.

Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Reviewed-by: Juergen Gross <jgross@suse.com>
Acked-by: David Vrabel <david.vrabel@citrix.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
10 years agox86/spinlocks/paravirt: Fix memory corruption on unlock
Raghavendra K T [Fri, 6 Feb 2015 11:14:11 +0000 (16:44 +0530)] 
x86/spinlocks/paravirt: Fix memory corruption on unlock

commit d6abfdb2022368d8c6c4be3f11a06656601a6cc2 upstream.

Paravirt spinlock clears slowpath flag after doing unlock.
As explained by Linus currently it does:

                prev = *lock;
                add_smp(&lock->tickets.head, TICKET_LOCK_INC);

                /* add_smp() is a full mb() */

                if (unlikely(lock->tickets.tail & TICKET_SLOWPATH_FLAG))
                        __ticket_unlock_slowpath(lock, prev);

which is *exactly* the kind of things you cannot do with spinlocks,
because after you've done the "add_smp()" and released the spinlock
for the fast-path, you can't access the spinlock any more.  Exactly
because a fast-path lock might come in, and release the whole data
structure.

Linus suggested that we should not do any writes to lock after unlock(),
and we can move slowpath clearing to fastpath lock.

So this patch implements the fix with:

 1. Moving slowpath flag to head (Oleg):
    Unlocked locks don't care about the slowpath flag; therefore we can keep
    it set after the last unlock, and clear it again on the first (try)lock.
    -- this removes the write after unlock. note that keeping slowpath flag would
    result in unnecessary kicks.
    By moving the slowpath flag from the tail to the head ticket we also avoid
    the need to access both the head and tail tickets on unlock.

 2. use xadd to avoid read/write after unlock that checks the need for
    unlock_kick (Linus):
    We further avoid the need for a read-after-release by using xadd;
    the prev head value will include the slowpath flag and indicate if we
    need to do PV kicking of suspended spinners -- on modern chips xadd
    isn't (much) more expensive than an add + load.

Result:
 setup: 16core (32 cpu +ht sandy bridge 8GB 16vcpu guest)
 benchmark overcommit %improve
 kernbench  1x           -0.13
 kernbench  2x            0.02
 dbench     1x           -1.77
 dbench     2x           -0.63

[Jeremy: Hinted missing TICKET_LOCK_INC for kick]
[Oleg: Moved slowpath flag to head, ticket_equals idea]
[PeterZ: Added detailed changelog]

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Reported-by: Sasha Levin <sasha.levin@oracle.com>
Tested-by: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Cc: Andrew Jones <drjones@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Dave Jones <davej@redhat.com>
Cc: David Vrabel <david.vrabel@citrix.com>
Cc: Fernando Luis Vázquez Cao <fernando_b1@lab.ntt.co.jp>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Ulrich Obergfell <uobergfe@redhat.com>
Cc: Waiman Long <Waiman.Long@hp.com>
Cc: a.ryabinin@samsung.com
Cc: dave@stgolabs.net
Cc: hpa@zytor.com
Cc: jasowang@redhat.com
Cc: jeremy@goop.org
Cc: paul.gortmaker@windriver.com
Cc: riel@redhat.com
Cc: tglx@linutronix.de
Cc: waiman.long@hp.com
Cc: xen-devel@lists.xenproject.org
Link: http://lkml.kernel.org/r/20150215173043.GA7471@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
10 years agokernel: make READ_ONCE() valid on const arguments
Linus Torvalds [Fri, 20 Feb 2015 23:46:31 +0000 (15:46 -0800)] 
kernel: make READ_ONCE() valid on const arguments

commit dd36929720f40f17685e841ae0d4c581c165ea60 upstream.

The use of READ_ONCE() causes lots of warnings witht he pending paravirt
spinlock fixes, because those ends up having passing a member to a
'const' structure to READ_ONCE().

There should certainly be nothing wrong with using READ_ONCE() with a
const source, but the helper function __read_once_size() would cause
warnings because it would drop the 'const' qualifier, but also because
the destination would be marked 'const' too due to the use of 'typeof'.

Use a union of types in READ_ONCE() to avoid this issue.

Also make sure to use parenthesis around the macro arguments to avoid
possible operator precedence issues.

Tested-by: Ingo Molnar <mingo@kernel.org>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
10 years agokernel: Fix sparse warning for ACCESS_ONCE
Christian Borntraeger [Mon, 12 Jan 2015 11:13:39 +0000 (12:13 +0100)] 
kernel: Fix sparse warning for ACCESS_ONCE

commit c5b19946eb76c67566aae6a84bf2b10ad59295ea upstream.

Commit 927609d622a3 ("kernel: tighten rules for ACCESS ONCE") results in
sparse warnings like "Using plain integer as NULL pointer" - Let's add a
type cast to the dummy assignment.
To avoid warnings lik "sparse: warning: cast to restricted __hc32" we also
use __force on that cast.

Fixes: 927609d622a3 ("kernel: tighten rules for ACCESS ONCE")
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
10 years agokernel: tighten rules for ACCESS ONCE
Christian Borntraeger [Tue, 25 Nov 2014 09:16:39 +0000 (10:16 +0100)] 
kernel: tighten rules for ACCESS ONCE

commit 927609d622a3773995f84bc03b4564f873cf0e22 upstream.

Now that all non-scalar users of ACCESS_ONCE have been converted
to READ_ONCE or ASSIGN once, lets tighten ACCESS_ONCE to only
work on scalar types.
This variant was proposed by Alexei Starovoitov.

Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
10 years agox86: pmc-atom: Assign debugfs node as soon as possible
Andy Shevchenko [Wed, 14 Jan 2015 16:39:31 +0000 (18:39 +0200)] 
x86: pmc-atom: Assign debugfs node as soon as possible

commit 1b43d7125f3b6f7d46e72da64f65f3187a83b66b upstream.

pmc_dbgfs_unregister() will be called when pmc->dbgfs_dir is unconditionally
NULL on error path in pmc_dbgfs_register(). To prevent this we move the
assignment to where is should be.

Fixes: f855911c1f48 (x86/pmc_atom: Expose PMC device state and platform sleep state)
Reported-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Aubrey Li <aubrey.li@linux.intel.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Kumar P. Mahesh <mahesh.kumar.p@intel.com>
Link: http://lkml.kernel.org/r/1421253575-22509-2-git-send-email-andriy.shevchenko@linux.intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
10 years agox86/irq: Fix regression caused by commit b568b8601f05
Jiang Liu [Mon, 16 Feb 2015 02:11:13 +0000 (10:11 +0800)] 
x86/irq: Fix regression caused by commit b568b8601f05

commit 1ea76fbadd667b19c4fa4466f3a3b55a505e83d9 upstream.

Commit b568b8601f05 ("Treat SCI interrupt as normal GSI interrupt")
accidently removes support of legacy PIC interrupt when fixing a
regression for Xen, which causes a nasty regression on HP/Compaq
nc6000 where we fail to register the ACPI interrupt, and thus
lose eg. thermal notifications leading a potentially overheated
machine.

So reintroduce support of legacy PIC based ACPI SCI interrupt.

Reported-by: Ville Syrjälä <syrjala@sci.fi>
Tested-by: Ville Syrjälä <syrjala@sci.fi>
Signed-off-by: Jiang Liu <jiang.liu@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Pavel Machek <pavel@ucw.cz>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Len Brown <len.brown@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Rafael J. Wysocki <rjw@rjwysocki.net>
Cc: Sander Eikelenboom <linux@eikelenboom.it>
Cc: linux-pm@vger.kernel.org
Link: http://lkml.kernel.org/r/1424052673-22974-1-git-send-email-jiang.liu@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
10 years agox86, mm/ASLR: Fix stack randomization on 64-bit systems
Hector Marco-Gisbert [Sat, 14 Feb 2015 17:33:50 +0000 (09:33 -0800)] 
x86, mm/ASLR: Fix stack randomization on 64-bit systems

commit 4e7c22d447bb6d7e37bfe39ff658486ae78e8d77 upstream.

The issue is that the stack for processes is not properly randomized on
64 bit architectures due to an integer overflow.

The affected function is randomize_stack_top() in file
"fs/binfmt_elf.c":

  static unsigned long randomize_stack_top(unsigned long stack_top)
  {
           unsigned int random_variable = 0;

           if ((current->flags & PF_RANDOMIZE) &&
                   !(current->personality & ADDR_NO_RANDOMIZE)) {
                   random_variable = get_random_int() & STACK_RND_MASK;
                   random_variable <<= PAGE_SHIFT;
           }
           return PAGE_ALIGN(stack_top) + random_variable;
           return PAGE_ALIGN(stack_top) - random_variable;
  }

Note that, it declares the "random_variable" variable as "unsigned int".
Since the result of the shifting operation between STACK_RND_MASK (which
is 0x3fffff on x86_64, 22 bits) and PAGE_SHIFT (which is 12 on x86_64):

  random_variable <<= PAGE_SHIFT;

then the two leftmost bits are dropped when storing the result in the
"random_variable". This variable shall be at least 34 bits long to hold
the (22+12) result.

These two dropped bits have an impact on the entropy of process stack.
Concretely, the total stack entropy is reduced by four: from 2^28 to
2^30 (One fourth of expected entropy).

This patch restores back the entropy by correcting the types involved
in the operations in the functions randomize_stack_top() and
stack_maxrandom_size().

The successful fix can be tested with:

  $ for i in `seq 1 10`; do cat /proc/self/maps | grep stack; done
  7ffeda566000-7ffeda587000 rw-p 00000000 00:00 0                          [stack]
  7fff5a332000-7fff5a353000 rw-p 00000000 00:00 0                          [stack]
  7ffcdb7a1000-7ffcdb7c2000 rw-p 00000000 00:00 0                          [stack]
  7ffd5e2c4000-7ffd5e2e5000 rw-p 00000000 00:00 0                          [stack]
  ...

Once corrected, the leading bytes should be between 7ffc and 7fff,
rather than always being 7fff.

Signed-off-by: Hector Marco-Gisbert <hecmargi@upv.es>
Signed-off-by: Ismael Ripoll <iripoll@upv.es>
[ Rebased, fixed 80 char bugs, cleaned up commit message, added test example and CVE ]
Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Fixes: CVE-2015-1593
Link: http://lkml.kernel.org/r/20150214173350.GA18393@www.outflux.net
Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
10 years agox86/efi: Avoid triple faults during EFI mixed mode calls
Matt Fleming [Tue, 13 Jan 2015 15:25:00 +0000 (15:25 +0000)] 
x86/efi: Avoid triple faults during EFI mixed mode calls

commit 96738c69a7fcdbf0d7c9df0c8a27660011e82a7b upstream.

Andy pointed out that if an NMI or MCE is received while we're in the
middle of an EFI mixed mode call a triple fault will occur. This can
happen, for example, when issuing an EFI mixed mode call while running
perf.

The reason for the triple fault is that we execute the mixed mode call
in 32-bit mode with paging disabled but with 64-bit kernel IDT handlers
installed throughout the call.

At Andy's suggestion, stop playing the games we currently do at runtime,
such as disabling paging and installing a 32-bit GDT for __KERNEL_CS. We
can simply switch to the __KERNEL32_CS descriptor before invoking
firmware services, and run in compatibility mode. This way, if an
NMI/MCE does occur the kernel IDT handler will execute correctly, since
it'll jump to __KERNEL_CS automatically.

However, this change is only possible post-ExitBootServices(). Before
then the firmware "owns" the machine and expects for its 32-bit IDT
handlers to be left intact to service interrupts, etc.

So, we now need to distinguish between early boot and runtime
invocations of EFI services. During early boot, we need to restore the
GDT that the firmware expects to be present. We can only jump to the
__KERNEL32_CS code segment for mixed mode calls after ExitBootServices()
has been invoked.

A liberal sprinkling of comments in the thunking code should make the
differences in early and late environments more apparent.

Reported-by: Andy Lutomirski <luto@amacapital.net>
Tested-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Matt Fleming <matt.fleming@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
10 years agoblk-throttle: check stats_cpu before reading it from sysfs
Thadeu Lima de Souza Cascardo [Mon, 16 Feb 2015 19:16:45 +0000 (17:16 -0200)] 
blk-throttle: check stats_cpu before reading it from sysfs

commit 045c47ca306acf30c740c285a77a4b4bda6be7c5 upstream.

When reading blkio.throttle.io_serviced in a recently created blkio
cgroup, it's possible to race against the creation of a throttle policy,
which delays the allocation of stats_cpu.

Like other functions in the throttle code, just checking for a NULL
stats_cpu prevents the following oops caused by that race.

[ 1117.285199] Unable to handle kernel paging request for data at address 0x7fb4d0020
[ 1117.285252] Faulting instruction address: 0xc0000000003efa2c
[ 1137.733921] Oops: Kernel access of bad area, sig: 11 [#1]
[ 1137.733945] SMP NR_CPUS=2048 NUMA PowerNV
[ 1137.734025] Modules linked in: bridge stp llc kvm_hv kvm binfmt_misc autofs4
[ 1137.734102] CPU: 3 PID: 5302 Comm: blkcgroup Not tainted 3.19.0 #5
[ 1137.734132] task: c000000f1d188b00 ti: c000000f1d210000 task.ti: c000000f1d210000
[ 1137.734167] NIP: c0000000003efa2c LR: c0000000003ef9f0 CTR: c0000000003ef980
[ 1137.734202] REGS: c000000f1d213500 TRAP: 0300   Not tainted  (3.19.0)
[ 1137.734230] MSR: 9000000000009032 <SF,HV,EE,ME,IR,DR,RI>  CR: 42008884  XER: 20000000
[ 1137.734325] CFAR: 0000000000008458 DAR: 00000007fb4d0020 DSISR: 40000000 SOFTE: 0
GPR00: c0000000003ed3a0 c000000f1d213780 c000000000c59538 0000000000000000
GPR04: 0000000000000800 0000000000000000 0000000000000000 0000000000000000
GPR08: ffffffffffffffff 00000007fb4d0020 00000007fb4d0000 c000000000780808
GPR12: 0000000022000888 c00000000fdc0d80 0000000000000000 0000000000000000
GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
GPR20: 000001003e120200 c000000f1d5b0cc0 0000000000000200 0000000000000000
GPR24: 0000000000000001 c000000000c269e0 0000000000000020 c000000f1d5b0c80
GPR28: c000000000ca3a08 c000000000ca3dec c000000f1c667e00 c000000f1d213850
[ 1137.734886] NIP [c0000000003efa2c] .tg_prfill_cpu_rwstat+0xac/0x180
[ 1137.734915] LR [c0000000003ef9f0] .tg_prfill_cpu_rwstat+0x70/0x180
[ 1137.734943] Call Trace:
[ 1137.734952] [c000000f1d213780] [d000000005560520] 0xd000000005560520 (unreliable)
[ 1137.734996] [c000000f1d2138a0] [c0000000003ed3a0] .blkcg_print_blkgs+0xe0/0x1a0
[ 1137.735039] [c000000f1d213960] [c0000000003efb50] .tg_print_cpu_rwstat+0x50/0x70
[ 1137.735082] [c000000f1d2139e0] [c000000000104b48] .cgroup_seqfile_show+0x58/0x150
[ 1137.735125] [c000000f1d213a70] [c0000000002749dc] .kernfs_seq_show+0x3c/0x50
[ 1137.735161] [c000000f1d213ae0] [c000000000218630] .seq_read+0xe0/0x510
[ 1137.735197] [c000000f1d213bd0] [c000000000275b04] .kernfs_fop_read+0x164/0x200
[ 1137.735240] [c000000f1d213c80] [c0000000001eb8e0] .__vfs_read+0x30/0x80
[ 1137.735276] [c000000f1d213cf0] [c0000000001eb9c4] .vfs_read+0x94/0x1b0
[ 1137.735312] [c000000f1d213d90] [c0000000001ebb38] .SyS_read+0x58/0x100
[ 1137.735349] [c000000f1d213e30] [c000000000009218] syscall_exit+0x0/0x98
[ 1137.735383] Instruction dump:
[ 1137.735405] 7c6307b4 7f891800 409d00b8 60000000 60420000 3d420004 392a63b0 786a1f24
[ 1137.735471] 7d49502a e93e01c8 7d495214 7d2ad214 <7cead02ae9090008 e9490010 e9290018

And here is one code that allows to easily reproduce this, although this
has first been found by running docker.

void run(pid_t pid)
{
int n;
int status;
int fd;
char *buffer;
buffer = memalign(BUFFER_ALIGN, BUFFER_SIZE);
n = snprintf(buffer, BUFFER_SIZE, "%d\n", pid);
fd = open(CGPATH "/test/tasks", O_WRONLY);
write(fd, buffer, n);
close(fd);
if (fork() > 0) {
fd = open("/dev/sda", O_RDONLY | O_DIRECT);
read(fd, buffer, 512);
close(fd);
wait(&status);
} else {
fd = open(CGPATH "/test/blkio.throttle.io_serviced", O_RDONLY);
n = read(fd, buffer, BUFFER_SIZE);
close(fd);
}
free(buffer);
exit(0);
}

void test(void)
{
int status;
mkdir(CGPATH "/test", 0666);
if (fork() > 0)
wait(&status);
else
run(getpid());
rmdir(CGPATH "/test");
}

int main(int argc, char **argv)
{
int i;
for (i = 0; i < NR_TESTS; i++)
test();
return 0;
}

Reported-by: Ricardo Marin Matinata <rmm@br.ibm.com>
Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@linux.vnet.ibm.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
10 years agoBtrfs: fix fsync data loss after adding hard link to inode
Filipe Manana [Fri, 13 Feb 2015 12:30:56 +0000 (12:30 +0000)] 
Btrfs: fix fsync data loss after adding hard link to inode

commit 1a4bcf470c886b955adf36486f4c86f2441d85cb upstream.

We have a scenario where after the fsync log replay we can lose file data
that had been previously fsync'ed if we added an hard link for our inode
and after that we sync'ed the fsync log (for example by fsync'ing some
other file or directory).

This is because when adding an hard link we updated the inode item in the
log tree with an i_size value of 0. At that point the new inode item was
in memory only and a subsequent fsync log replay would not make us lose
the file data. However if after adding the hard link we sync the log tree
to disk, by fsync'ing some other file or directory for example, we ended
up losing the file data after log replay, because the inode item in the
persisted log tree had an an i_size of zero.

This is easy to reproduce, and the following excerpt from my test for
xfstests shows this:

  _scratch_mkfs >> $seqres.full 2>&1
  _init_flakey
  _mount_flakey

  # Create one file with data and fsync it.
  # This made the btrfs fsync log persist the data and the inode metadata with
  # a correct inode->i_size (4096 bytes).
  $XFS_IO_PROG -f -c "pwrite -S 0xaa -b 4K 0 4K" -c "fsync" \
       $SCRATCH_MNT/foo | _filter_xfs_io

  # Now add one hard link to our file. This made the btrfs code update the fsync
  # log, in memory only, with an inode metadata having a size of 0.
  ln $SCRATCH_MNT/foo $SCRATCH_MNT/foo_link

  # Now force persistence of the fsync log to disk, for example, by fsyncing some
  # other file.
  touch $SCRATCH_MNT/bar
  $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/bar

  # Before a power loss or crash, we could read the 4Kb of data from our file as
  # expected.
  echo "File content before:"
  od -t x1 $SCRATCH_MNT/foo

  # Simulate a crash/power loss.
  _load_flakey_table $FLAKEY_DROP_WRITES
  _unmount_flakey

  _load_flakey_table $FLAKEY_ALLOW_WRITES
  _mount_flakey

  # After the fsync log replay, because the fsync log had a value of 0 for our
  # inode's i_size, we couldn't read anymore the 4Kb of data that we previously
  # wrote and fsync'ed. The size of the file became 0 after the fsync log replay.
  echo "File content after:"
  od -t x1 $SCRATCH_MNT/foo

Another alternative test, that doesn't need to fsync an inode in the same
transaction it was created, is:

  _scratch_mkfs >> $seqres.full 2>&1
  _init_flakey
  _mount_flakey

  # Create our test file with some data.
  $XFS_IO_PROG -f -c "pwrite -S 0xaa -b 8K 0 8K" \
       $SCRATCH_MNT/foo | _filter_xfs_io

  # Make sure the file is durably persisted.
  sync

  # Append some data to our file, to increase its size.
  $XFS_IO_PROG -f -c "pwrite -S 0xcc -b 4K 8K 4K" \
       $SCRATCH_MNT/foo | _filter_xfs_io

  # Fsync the file, so from this point on if a crash/power failure happens, our
  # new data is guaranteed to be there next time the fs is mounted.
  $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/foo

  # Add one hard link to our file. This made btrfs write into the in memory fsync
  # log a special inode with generation 0 and an i_size of 0 too. Note that this
  # didn't update the inode in the fsync log on disk.
  ln $SCRATCH_MNT/foo $SCRATCH_MNT/foo_link

  # Now make sure the in memory fsync log is durably persisted.
  # Creating and fsync'ing another file will do it.
  touch $SCRATCH_MNT/bar
  $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/bar

  # As expected, before the crash/power failure, we should be able to read the
  # 12Kb of file data.
  echo "File content before:"
  od -t x1 $SCRATCH_MNT/foo

  # Simulate a crash/power loss.
  _load_flakey_table $FLAKEY_DROP_WRITES
  _unmount_flakey

  _load_flakey_table $FLAKEY_ALLOW_WRITES
  _mount_flakey

  # After mounting the fs again, the fsync log was replayed.
  # The btrfs fsync log replay code didn't update the i_size of the persisted
  # inode because the inode item in the log had a special generation with a
  # value of 0 (and it couldn't know the correct i_size, since that inode item
  # had a 0 i_size too). This made the last 4Kb of file data inaccessible and
  # effectively lost.
  echo "File content after:"
  od -t x1 $SCRATCH_MNT/foo

This isn't a new issue/regression. This problem has been around since the
log tree code was added in 2008:

  Btrfs: Add a write ahead tree log to optimize synchronous operations
  (commit e02119d5a7b4396c5a872582fddc8bd6d305a70a)

Test cases for xfstests follow soon.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
10 years agobtrfs: fix leak of path in btrfs_find_item
David Sterba [Fri, 2 Jan 2015 17:45:16 +0000 (18:45 +0100)] 
btrfs: fix leak of path in btrfs_find_item

commit 381cf6587f8a8a8e981bc0c1aaaa8859b51dc756 upstream.

If btrfs_find_item is called with NULL path it allocates one locally but
does not free it. Affected paths are inserting an orphan item for a file
and for a subvol root.

Move the path allocation to the callers.

Fixes: 3f870c289900 ("btrfs: expand btrfs_find_item() to include find_orphan_item functionality")
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
10 years agobtrfs: set proper message level for skinny metadata
David Sterba [Fri, 19 Dec 2014 17:38:47 +0000 (18:38 +0100)] 
btrfs: set proper message level for skinny metadata

commit 5efa0490cc94aee06cd8d282683e22a8ce0a0026 upstream.

This has been confusing people for too long, the message is really just
informative.

Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
10 years agolibceph: fix double __remove_osd() problem
Ilya Dryomov [Tue, 17 Feb 2015 16:37:15 +0000 (19:37 +0300)] 
libceph: fix double __remove_osd() problem

commit 7eb71e0351fbb1b242ae70abb7bb17107fe2f792 upstream.

It turns out it's possible to get __remove_osd() called twice on the
same OSD.  That doesn't sit well with rb_erase() - depending on the
shape of the tree we can get a NULL dereference, a soft lockup or
a random crash at some point in the future as we end up touching freed
memory.  One scenario that I was able to reproduce is as follows:

            <osd3 is idle, on the osd lru list>
<con reset - osd3>
con_fault_finish()
  osd_reset()
                              <osdmap - osd3 down>
                              ceph_osdc_handle_map()
                                <takes map_sem>
                                kick_requests()
                                  <takes request_mutex>
                                  reset_changed_osds()
                                    __reset_osd()
                                      __remove_osd()
                                  <releases request_mutex>
                                <releases map_sem>
    <takes map_sem>
    <takes request_mutex>
    __kick_osd_requests()
      __reset_osd()
        __remove_osd() <-- !!!

A case can be made that osd refcounting is imperfect and reworking it
would be a proper resolution, but for now Sage and I decided to fix
this by adding a safe guard around __remove_osd().

Fixes: http://tracker.ceph.com/issues/8087
Cc: Sage Weil <sage@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Sage Weil <sage@redhat.com>
Reviewed-by: Alex Elder <elder@linaro.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
10 years agosamsung-laptop: Add use_native_backlight quirk, and enable it on some models
Hans de Goede [Fri, 9 Jan 2015 14:10:01 +0000 (15:10 +0100)] 
samsung-laptop: Add use_native_backlight quirk, and enable it on some models

commit 4690555e13c48fef07f2762f6b0cd6b181e326d0 upstream.

Since kernel 3.14 the backlight control has been broken on various Samsung
Atom based netbooks. This has been bisected and this problem happens since
commit b35684b8fa94 ("drm/i915: do full backlight setup at enable time")

This has been reported and discussed in detail here:
http://lists.freedesktop.org/archives/intel-gfx/2014-July/049395.html

Unfortunately no-one has been able to fix this. This only affects Samsung
Atom netbooks, and the Linux kernel and the BIOS of those laptops have never
worked well together. All affected laptops already have a quirk to avoid using
the standard acpi-video interface and instead use the samsung specific SABI
interface which samsung-laptop uses. It seems that recent fixes to the i915
driver have also broken backlight control through the SABI interface.

The intel_backlight driver OTOH works fine, and also allows for finer grained
backlight control. So add a new use_native_backlight quirk, and replace the
broken_acpi_video quirk with this quirk for affected models. This new quirk
disables acpi-video as before and also stops samsung-laptop from registering
the SABI based samsung_laptop backlight interface, leaving only the working
intel_backlight interface.

This commit enables this new quirk for 3 models which are known to be affected,
chances are that it needs to be used on other models too.

BugLink: https://bugzilla.redhat.com/show_bug.cgi?id=1094948
BugLink: https://bugzilla.redhat.com/show_bug.cgi?id=1115713
Reported-by: Bertrik Sikken <bertrik@sikken.nl> # N150P
Cc: stable@vger.kernel.org # 3.16
Signed-off-by: Hans de Goede <hdegoede@redhat.com>
Signed-off-by: Darren Hart <dvhart@linux.intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
10 years agojffs2: fix handling of corrupted summary length
Chen Jie [Tue, 10 Feb 2015 20:49:48 +0000 (12:49 -0800)] 
jffs2: fix handling of corrupted summary length

commit 164c24063a3eadee11b46575c5482b2f1417be49 upstream.

sm->offset maybe wrong but magic maybe right, the offset do not have CRC.

Badness at c00c7580 [verbose debug info unavailable]
NIP: c00c7580 LR: c00c718c CTR: 00000014
REGS: df07bb40 TRAP: 0700   Not tainted  (2.6.34.13-WR4.3.0.0_standard)
MSR: 00029000 <EE,ME,CE>  CR: 22084f84  XER: 00000000
TASK = df84d6e0[908] 'mount' THREAD: df07a000
GPR00: 00000001 df07bbf0 df84d6e0 00000000 00000001 00000000 df07bb58 00000041
GPR08: 00000041 c0638860 00000000 00000010 22084f88 100636c8 df814ff8 00000000
GPR16: df84d6e0 dfa558cc c05adb90 00000048 c0452d30 00000000 000240d0 000040d0
GPR24: 00000014 c05ae734 c05be2e0 00000000 00000001 00000000 00000000 c05ae730
NIP [c00c7580] __alloc_pages_nodemask+0x4d0/0x638
LR [c00c718c] __alloc_pages_nodemask+0xdc/0x638
Call Trace:
[df07bbf0] [c00c718c] __alloc_pages_nodemask+0xdc/0x638 (unreliable)
[df07bc90] [c00c7708] __get_free_pages+0x20/0x48
[df07bca0] [c00f4a40] __kmalloc+0x15c/0x1ec
[df07bcd0] [c01fc880] jffs2_scan_medium+0xa58/0x14d0
[df07bd70] [c01ff38c] jffs2_do_mount_fs+0x1f4/0x6b4
[df07bdb0] [c020144c] jffs2_do_fill_super+0xa8/0x260
[df07bdd0] [c020230c] jffs2_fill_super+0x104/0x184
[df07be00] [c0335814] get_sb_mtd_aux+0x9c/0xec
[df07be20] [c033596c] get_sb_mtd+0x84/0x1e8
[df07be60] [c0201ed0] jffs2_get_sb+0x1c/0x2c
[df07be70] [c0103898] vfs_kern_mount+0x78/0x1e8
[df07bea0] [c0103a58] do_kern_mount+0x40/0x100
[df07bec0] [c011fe90] do_mount+0x240/0x890
[df07bf10] [c0120570] sys_mount+0x90/0xd8
[df07bf40] [c00110d8] ret_from_syscall+0x0/0x4

=== Exception: c01 at 0xff61a34
    LR = 0x100135f0
Instruction dump:
38800005 38600000 48010f41 4bfffe1c 4bfc2d15 4bfffe8c 72e90200 4082fc28
3d20c064 39298860 8809000d 68000001 <0f0000002f800000 419efc0c 38000001
mount: mounting /dev/mtdblock3 on /common failed: Input/output error

Signed-off-by: Chen Jie <chenjie6@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
10 years agoEDAC, amd64_edac: Prevent OOPS with >16 memory controllers
Daniel J Blueman [Tue, 17 Feb 2015 03:34:38 +0000 (11:34 +0800)] 
EDAC, amd64_edac: Prevent OOPS with >16 memory controllers

commit 0c510cc83bdbaac8406f4f7caef34f4da0ba35ea upstream.

When DRAM errors occur on memory controllers after EDAC_MAX_MCS (16),
the kernel fatally dereferences unallocated structures, see splat below;
this occurs on at least NumaConnect systems.

Fix by checking if a memory controller info structure was found.

BUG: unable to handle kernel NULL pointer dereference at 0000000000000320
IP: [<ffffffff819f714f>] decode_bus_error+0x2f/0x2b0
PGD 2f8b5a3067 PUD 2f8b5a2067 PMD 0
Oops: 0000 [#2] SMP
Modules linked in:
CPU: 224 PID: 11930 Comm: stream_c.exe.gn Tainted: G   D    3.19.0 #1
Hardware name: Supermicro H8QGL/H8QGL, BIOS 3.5b    01/28/2015
task: ffff8807dbfb8c00 ti: ffff8807dd16c000 task.ti: ffff8807dd16c000
RIP: 0010:[<ffffffff819f714f>] [<ffffffff819f714f>] decode_bus_error+0x2f/0x2b0
RSP: 0000:ffff8907dfc03c48 EFLAGS: 00010297
RAX: 0000000000000001 RBX: 9c67400010080a13 RCX: 0000000000001dc6
RDX: 000000001dc61dc6 RSI: ffff8907dfc03df0 RDI: 000000000000001c
RBP: ffff8907dfc03ce8 R08: 0000000000000000 R09: 0000000000000022
R10: ffff891fffa30380 R11: 00000000001cfc90 R12: 0000000000000008
R13: 0000000000000000 R14: 000000000000001c R15: 00009c6740001000
FS: 00007fa97ee18700(0000) GS:ffff8907dfc00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000320 CR3: 0000003f889b8000 CR4: 00000000000407e0
Stack:
 0000000000000000 ffff8907dfc03df0 0000000000000008 9c67400010080a13
 000000000000001c 00009c6740001000 ffff8907dfc03c88 ffffffff810e4f9a
 ffff8907dfc03ce8 ffffffff81b375b9 0000000000000000 0000000000000010
Call Trace:
 <IRQ>
 ? vprintk_default
 ? printk
 amd_decode_mce
 notifier_call_chain
 atomic_notifier_call_chain
 mce_log
 machine_check_poll
 mce_timer_fn
 ? mce_cpu_restart
 call_timer_fn.isra.29
 run_timer_softirq
 __do_softirq
 irq_exit
 smp_apic_timer_interrupt
 apic_timer_interrupt
 <EOI>
 ? down_read_trylock
 __do_page_fault
 ? __schedule
 do_page_fault
 page_fault

Signed-off-by: Daniel J Blueman <daniel@numascale.com>
Link: http://lkml.kernel.org/r/1424144078-24589-1-git-send-email-daniel@numascale.com
[ Boris: massage commit message ]
Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
10 years agosb_edac: Fix detection on SNB machines
Borislav Petkov [Thu, 5 Feb 2015 11:39:36 +0000 (12:39 +0100)] 
sb_edac: Fix detection on SNB machines

commit 11249e73992981e31fd50e7231da24fad68e3320 upstream.

d0585cd815fa ("sb_edac: Claim a different PCI device") changed the
probing of sb_edac to look for PCI device 0x3ca0:

3f:0e.0 System peripheral: Intel Corporation Xeon E5/Core i7 Processor Home Agent (rev 07)
00: 86 80 a0 3c 00 00 00 00 07 00 80 08 00 00 80 00
...

but we're matching for 0x3ca8, i.e. PCI_DEVICE_ID_INTEL_SBRIDGE_IMC_TA
in sbridge_probe() therefore the probing fails.

Changing it to probe for 0x3ca0 (PCI_DEVICE_ID_INTEL_SBRIDGE_IMC_HA0),
.i.e., the 14.0 device, fixes the issue and driver loads successfully
again:

[ 2449.013120] EDAC DEBUG: sbridge_init:
[ 2449.017029] EDAC sbridge: Seeking for: PCI ID 8086:3ca0
[ 2449.022368] EDAC DEBUG: sbridge_get_onedevice: Detected 8086:3ca0
[ 2449.028498] EDAC sbridge: Seeking for: PCI ID 8086:3ca0
[ 2449.033768] EDAC sbridge: Seeking for: PCI ID 8086:3ca8
[ 2449.039028] EDAC DEBUG: sbridge_get_onedevice: Detected 8086:3ca8
[ 2449.045155] EDAC sbridge: Seeking for: PCI ID 8086:3ca8
...

Add a debug printk while at it to be able to catch the failure in the
future and dump driver version on successful load.

Fixes: d0585cd815fa ("sb_edac: Claim a different PCI device")
Acked-by: Aristeu Rozanski <aris@redhat.com>
Cc: Tony Luck <tony.luck@intel.com>
Acked-by: Andy Lutomirski <luto@amacapital.net>
Acked-by: Mauro Carvalho Chehab <m.chehab@samsung.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
10 years agomd/raid1: fix read balance when a drive is write-mostly.
Tomáš Hodek [Mon, 23 Feb 2015 00:00:38 +0000 (11:00 +1100)] 
md/raid1: fix read balance when a drive is write-mostly.

commit d1901ef099c38afd11add4cfb3312c02ef21ec4a upstream.

When a drive is marked write-mostly it should only be the
target of reads if there is no other option.

This behaviour was broken by

commit 9dedf60313fa4dddfd5b9b226a0ef12a512bf9dc
    md/raid1: read balance chooses idlest disk for SSD

which causes a write-mostly device to be *preferred* is some cases.

Restore correct behaviour by checking and setting
best_dist_disk and best_pending_disk rather than best_disk.

We only need to test one of these as they are both changed
from -1 or >=0 at the same time.

As we leave min_pending and best_dist unchanged, any non-write-mostly
device will appear better than the write-mostly device.

Reported-by: Tomáš Hodek <tomas.hodek@volny.cz>
Reported-by: Dark Penguin <darkpenguin@yandex.ru>
Signed-off-by: NeilBrown <neilb@suse.de>
Link: http://marc.info/?l=linux-raid&m=135982797322422
Fixes: 9dedf60313fa4dddfd5b9b226a0ef12a512bf9dc
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
10 years agomd/raid5: Fix livelock when array is both resyncing and degraded.
NeilBrown [Wed, 18 Feb 2015 00:35:14 +0000 (11:35 +1100)] 
md/raid5: Fix livelock when array is both resyncing and degraded.

commit 26ac107378c4742978216be1005b7291b799c7b2 upstream.

Commit a7854487cd7128a30a7f4f5259de9f67d5efb95f:
  md: When RAID5 is dirty, force reconstruct-write instead of read-modify-write.

Causes an RCW cycle to be forced even when the array is degraded.
A degraded array cannot support RCW as that requires reading all data
blocks, and one may be missing.

Forcing an RCW when it is not possible causes a live-lock and the code
spins, repeatedly deciding to do something that cannot succeed.

So change the condition to only force RCW on non-degraded arrays.

Reported-by: Manibalan P <pmanibalan@amiindia.co.in>
Bisected-by: Jes Sorensen <Jes.Sorensen@redhat.com>
Tested-by: Jes Sorensen <Jes.Sorensen@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Fixes: a7854487cd7128a30a7f4f5259de9f67d5efb95f
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
10 years agoperf tools: Fix probing for PERF_FLAG_FD_CLOEXEC flag
Adrian Hunter [Tue, 24 Feb 2015 11:20:59 +0000 (13:20 +0200)] 
perf tools: Fix probing for PERF_FLAG_FD_CLOEXEC flag

commit 48536c9195ae8c2a00fd8f400bac72ab613feaab upstream.

Commit f6edb53c4993ffe92ce521fb449d1c146cea6ec2 converted the probe to
a CPU wide event first (pid == -1). For kernels that do not support
the PERF_FLAG_FD_CLOEXEC flag the probe fails with EINVAL. Since this
errno is not handled pid is not reset to 0 and the subsequent use of
pid = -1 as an argument brings in an additional failure path if
perf_event_paranoid > 0:

$ perf record -- sleep 1
perf_event_open(..., 0) failed unexpectedly with error 13 (Permission denied)
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.007 MB /tmp/perf.data (11 samples) ]

Also, ensure the fd of the confirmation check is closed and comment why
pid = -1 is used.

Needs to go to 3.18 stable tree as well.

Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Based-on-patch-by: David Ahern <david.ahern@oracle.com>
Acked-by: David Ahern <david.ahern@oracle.com>
Cc: David Ahern <dsahern@gmail.com>
Link: http://lkml.kernel.org/r/54EC610C.8000403@intel.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
10 years agoclocksource: mtk: Fix race conditions in probe code
Matthias Brugger [Thu, 19 Feb 2015 10:41:33 +0000 (11:41 +0100)] 
clocksource: mtk: Fix race conditions in probe code

commit d4a19eb3b15a4ba98f627182f48d5bc0cffae670 upstream.

We have two race conditions in the probe code which could lead to a null
pointer dereference in the interrupt handler.

The interrupt handler accesses the clockevent device, which may not yet be
registered.

First race condition happens when the interrupt handler gets registered before
the interrupts get disabled. The second race condition happens when the
interrupts get enabled, but the clockevent device is not yet registered.

Fix that by disabling the interrupts before we register the interrupt and enable
the interrupts after the clockevent device got registered.

Reported-by: Gongbae Park <yongbae2@gmail.com>
Signed-off-by: Matthias Brugger <matthias.bgg@gmail.com>
Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
10 years agometag: Fix KSTK_EIP() and KSTK_ESP() macros
James Hogan [Tue, 24 Feb 2015 12:25:25 +0000 (12:25 +0000)] 
metag: Fix KSTK_EIP() and KSTK_ESP() macros

commit c2996cb29bfb73927a79dc96e598a718e843f01a upstream.

The KSTK_EIP() and KSTK_ESP() macros should return the user program
counter (PC) and stack pointer (A0StP) of the given task. These are used
to determine which VMA corresponds to the user stack in
/proc/<pid>/maps, and for the user PC & A0StP in /proc/<pid>/stat.

However for Meta the PC & A0StP from the task's kernel context are used,
resulting in broken output. For example in following /proc/<pid>/maps
output, the 3afff000-3b021000 VMA should be described as the stack:

  # cat /proc/self/maps
  ...
  100b0000-100b1000 rwxp 00000000 00:00 0          [heap]
  3afff000-3b021000 rwxp 00000000 00:00 0

And in the following /proc/<pid>/stat output, the PC is in kernel code
(1074234964 = 0x40078654) and the A0StP is in the kernel heap
(1335981392 = 0x4fa17550):

  # cat /proc/self/stat
  51 (cat) R ... 1335981392 1074234964 ...

Fix the definitions of KSTK_EIP() and KSTK_ESP() to use
task_pt_regs(tsk)->ctx rather than (tsk)->thread.kernel_context. This
gets the registers from the user context stored after the thread info at
the base of the kernel stack, which is from the last entry into the
kernel from userland, regardless of where in the kernel the task may
have been interrupted, which results in the following more correct
/proc/<pid>/maps output:

  # cat /proc/self/maps
  ...
  0800b000-08070000 r-xp 00000000 00:02 207        /lib/libuClibc-0.9.34-git.so
  ...
  100b0000-100b1000 rwxp 00000000 00:00 0          [heap]
  3afff000-3b021000 rwxp 00000000 00:00 0          [stack]

And /proc/<pid>/stat now correctly reports the PC in libuClibc
(134320308 = 0x80190b4) and the A0StP in the [stack] region (989864576 =
0x3b002280):

  # cat /proc/self/stat
  51 (cat) R ... 989864576 134320308 ...

Reported-by: Alexey Brodkin <Alexey.Brodkin@synopsys.com>
Reported-by: Vineet Gupta <Vineet.Gupta1@synopsys.com>
Signed-off-by: James Hogan <james.hogan@imgtec.com>
Cc: linux-metag@vger.kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
10 years agoxfs: Fix quota type in quota structures when reusing quota file
Jan Kara [Mon, 23 Feb 2015 11:34:17 +0000 (22:34 +1100)] 
xfs: Fix quota type in quota structures when reusing quota file

commit dfcc70a8c868fe03276fa59864149708fb41930b upstream.

For filesystems without separate project quota inode field in the
superblock we just reuse project quota file for group quotas (and vice
versa) if project quota file is allocated and we need group quota file.
When we reuse the file, quota structures on disk suddenly have wrong
type stored in d_flags though. Nobody really cares about this (although
structure type reported to userspace was wrong as well) except
that after commit 14bf61ffe6ac (quota: Switch ->get_dqblk() and
->set_dqblk() to use bytes as space units) assertion in
xfs_qm_scall_getquota() started to trigger on xfs/106 test (apparently I
was testing without XFS_DEBUG so I didn't notice when submitting the
above commit).

Fix the problem by properly resetting ddq->d_flags when running quotacheck
for a quota file.

Reported-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
10 years agogpio: tps65912: fix wrong container_of arguments
Nicolas Saenz Julienne [Thu, 19 Feb 2015 01:52:25 +0000 (01:52 +0000)] 
gpio: tps65912: fix wrong container_of arguments

commit 2f97c20e5f7c3582c7310f65a04465bfb0fd0e85 upstream.

The gpio_chip operations receive a pointer the gpio_chip struct which is
contained in the driver's private struct, yet the container_of call in those
functions point to the mfd struct defined in include/linux/mfd/tps65912.h.

Signed-off-by: Nicolas Saenz Julienne <nicolassaenzj@gmail.com>
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
10 years agogpiolib: of: allow of_gpiochip_find_and_xlate to find more than one chip per node
Hans Holmberg [Tue, 10 Feb 2015 08:48:27 +0000 (09:48 +0100)] 
gpiolib: of: allow of_gpiochip_find_and_xlate to find more than one chip per node

commit 9cf75e9e4ddd587ac12e88e8751c358b7b27e95f upstream.

The change:

7b8792bbdffdff3abda704f89c6a45ea97afdc62
gpiolib: of: Correct error handling in of_get_named_gpiod_flags

assumed that only one gpio-chip is registred per of-node.
Some drivers register more than one chip per of-node, so
adjust the matching function of_gpiochip_find_and_xlate to
not stop looking for chips if a node-match is found and
the translation fails.

Fixes: 7b8792bbdffd ("gpiolib: of: Correct error handling in of_get_named_gpiod_flags")
Signed-off-by: Hans Holmberg <hans.holmberg@intel.com>
Acked-by: Alexandre Courbot <acourbot@nvidia.com>
Tested-by: Robert Jarzmik <robert.jarzmik@free.fr>
Tested-by: Tyler Hall <tylerwhall@gmail.com>
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
10 years agoarm64: compat Fix siginfo_t -> compat_siginfo_t conversion on big endian
Catalin Marinas [Mon, 23 Feb 2015 15:13:40 +0000 (15:13 +0000)] 
arm64: compat Fix siginfo_t -> compat_siginfo_t conversion on big endian

commit 9d42d48a342aee208c1154696196497fdc556bbf upstream.

The native (64-bit) sigval_t union contains sival_int (32-bit) and
sival_ptr (64-bit). When a compat application invokes a syscall that
takes a sigval_t value (as part of a larger structure, e.g.
compat_sys_mq_notify, compat_sys_timer_create), the compat_sigval_t
union is converted to the native sigval_t with sival_int overlapping
with either the least or the most significant half of sival_ptr,
depending on endianness. When the corresponding signal is delivered to a
compat application, on big endian the current (compat_uptr_t)sival_ptr
cast always returns 0 since sival_int corresponds to the top part of
sival_ptr. This patch fixes copy_siginfo_to_user32() so that sival_int
is copied to the compat_siginfo_t structure.

Reported-by: Bamvor Jian Zhang <bamvor.zhangjian@huawei.com>
Tested-by: Bamvor Jian Zhang <bamvor.zhangjian@huawei.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
10 years agohx4700: regulator: declare full constraints
Martin Vajnar [Tue, 23 Dec 2014 23:27:57 +0000 (00:27 +0100)] 
hx4700: regulator: declare full constraints

commit a52d209336f8fc7483a8c7f4a8a7d2a8e1692a6c upstream.

Since the removal of CONFIG_REGULATOR_DUMMY option, the touchscreen stopped
working. This patch enables the "replacement" for REGULATOR_DUMMY and
allows the touchscreen to work even though there is no regulator for "vcc".

Signed-off-by: Martin Vajnar <martin.vajnar@gmail.com>
Signed-off-by: Robert Jarzmik <robert.jarzmik@free.fr>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
10 years agoKVM: s390: avoid memory leaks if __inject_vm() fails
David Hildenbrand [Fri, 16 Jan 2015 11:58:09 +0000 (12:58 +0100)] 
KVM: s390: avoid memory leaks if __inject_vm() fails

commit 428d53be5e7468769d4e7899cca06ed5f783a6e1 upstream.

We have to delete the allocated interrupt info if __inject_vm() fails.

Otherwise user space can keep flooding kvm with floating interrupts and
provoke more and more memory leaks.

Reported-by: Dominik Dingel <dingel@linux.vnet.ibm.com>
Reviewed-by: Dominik Dingel <dingel@linux.vnet.ibm.com>
Signed-off-by: David Hildenbrand <dahi@linux.vnet.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>