Using the touchscreen while running buffered capture results in the
buffer reporting lots of wrong values, often just zeros. This is because
we push readings to the buffer every time a touchscreen interrupt
arrives, including when the buffer's own conversions have not yet
finished. So let's only push to the buffer when its conversions are
ready.
Signed-off-by: Kristina Martšenko <kristina.martsenko@gmail.com> Reviewed-by: Marek Vasut <marex@denx.de> Signed-off-by: Jonathan Cameron <jic23@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reading a channel through sysfs, or starting a buffered capture, can
occasionally turn off the touchscreen.
This is because the read_raw() and buffer preenable()/postdisable()
callbacks unschedule current conversions on all channels. If a delay
channel happens to schedule a touchscreen conversion at the same time,
the conversion gets cancelled and the touchscreen sequence stops.
This is probably related to this note from the reference manual:
"If a delay group schedules channels to be sampled and a manual
write to the schedule field in CTRL0 occurs while the block is
discarding samples, the LRADC will switch to the new schedule
and will not sample the channels that were previously scheduled.
The time window for this to happen is very small and lasts only
while the LRADC is discarding samples."
So make the callbacks only unschedule conversions for the channels they
use. This means channel 0 for read_raw() and channels 0-5 for the buffer
(if the touchscreen is enabled). Since the touchscreen uses different
channels (6 and 7), it no longer gets turned off.
This is tested and fixes the issue on i.MX28, but hasn't been tested on
i.MX23.
Signed-off-by: Kristina Martšenko <kristina.martsenko@gmail.com> Reviewed-by: Marek Vasut <marex@denx.de> Signed-off-by: Jonathan Cameron <jic23@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reading a channel through sysfs, or starting a buffered capture, will
currently turn off the touchscreen. This is because the read_raw() and
buffer preenable()/postdisable() callbacks disable interrupts for all
LRADC channels, including those the touchscreen uses.
So make the callbacks only disable interrupts for the channels they use.
This means channel 0 for read_raw() and channels 0-5 for the buffer (if
the touchscreen is enabled). Since the touchscreen uses different
channels (6 and 7), it no longer gets turned off.
Note that only i.MX28 is affected by this issue, i.MX23 should be fine.
Signed-off-by: Kristina Martšenko <kristina.martsenko@gmail.com> Reviewed-by: Marek Vasut <marex@denx.de> Signed-off-by: Jonathan Cameron <jic23@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The touchscreen was initially designed [1] to map all of its physical
channels to one virtual channel, leaving buffered capture to use the
remaining 7 virtual channels. When the touchscreen was reimplemented
[2], it was made to use four virtual channels, which overlap and
conflict with the channels the buffer uses.
As a result, when the buffer is enabled, the touchscreen's virtual
channels are remapped to whichever physical channels the buffer was
configured with, causing the touchscreen to read those instead of the
touch measurement channels. Effectively the touchscreen stops working.
So here we separate the channels again, giving the touchscreen 2 virtual
channels and the buffer 6. We can't give the touchscreen just 1 channel
as before, as the current pressure calculation requires 2 channels to be
read at the same time.
This makes the touchscreen continue to work during buffered capture. It
has been tested on i.MX28, but not on i.MX23.
[1] 06ddd353f5c8 ("iio: mxs: Implement support for touchscreen")
[2] dee05308f602 ("Staging/iio/adc/touchscreen/MXS: add interrupt driven
touch detection")
Signed-off-by: Kristina Martšenko <kristina.martsenko@gmail.com> Reviewed-by: Marek Vasut <marex@denx.de> Signed-off-by: Jonathan Cameron <jic23@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The intention is obviously to sign-extend a 12 bit quantity. But
because of C's promotion rules, the assignment is equivalent to "val16
&= 0xfff;". Use the proper API for this.
Since commit c8231a9af8147f8a ("iio: mxs-lradc: compute temperature
from channel 8 and 9") removed adc channel 9, there is
no longer a 1:1 mapping in the channel spec.
All hwmon channel values above 9 are accessible via their index minus
one. So add a hidden iio channel 9 to fix this issue.
Signed-off-by: Stefan Wahren <stefan.wahren@i2se.com> Acked-by: Alexandre Belloni <alexandre.belloni@free-electrons.com> Reviewed-by: Marek Vasut <marex@denx.de> Signed-off-by: Jonathan Cameron <jic23@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
f31a9f7c7169 ("x86/xsaves: Use xsaves/xrstors to save and restore xsave area")
introduced alternative instructions for XSAVES/XRSTORS and commit:
adb9d526e982 ("x86/xsaves: Add xsaves and xrstors support for booting time")
added support for the XSAVES/XRSTORS instructions at boot time.
Unfortunately both failed to properly protect them against faulting:
The 'xstate_fault' macro will use the closest label named '1'
backward and that ends up in the .altinstr_replacement section
rather than in .text. This means that the kernel will never find
in the __ex_table the .text address where this instruction might
fault, leading to serious problems if userspace manages to
trigger the fault.
Signed-off-by: Quentin Casasnovas <quentin.casasnovas@oracle.com> Signed-off-by: Jamie Iles <jamie.iles@oracle.com>
[ Improved the changelog, fixed some whitespace noise. ] Acked-by: Borislav Petkov <bp@alien8.de> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Allan Xavier <mr.a.xavier@gmail.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Thomas Gleixner <tglx@linutronix.de> Fixes: adb9d526e982 ("x86/xsaves: Add xsaves and xrstors support for booting time") Fixes: f31a9f7c7169 ("x86/xsaves: Use xsaves/xrstors to save and restore xsave area") Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
'ret_from_fork' checks TIF_IA32 to determine whether 'pt_regs' and
the related state make sense for 'ret_from_sys_call'. This is
entirely the wrong check. TS_COMPAT would make a little more
sense, but there's really no point in keeping this optimization
at all.
This fixes a return to the wrong user CS if we came from int
0x80 in a 64-bit task.
Signed-off-by: Andy Lutomirski <luto@amacapital.net> Cc: Borislav Petkov <bp@alien8.de> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/4710be56d76ef994ddf59087aad98c000fbab9a4.1424989793.git.luto@amacapital.net
[ Backported from tip:x86/asm. ] Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
This patch adds a check to sbc_parse_cdb() in order to detect when
an LBA + sector vs. end-of-device calculation wraps when the LBA is
sufficiently large (e.g. 0xFFFFFFFFFFFFFFFF).
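A hedged sketch of the class of check being added (names and signature are illustrative, not the actual sbc_parse_cdb() code):

    #include <stdbool.h>
    #include <stdint.h>

    /*
     * Reject a request whose LBA + sectors computation wraps around 64
     * bits: with lba = 0xFFFFFFFFFFFFFFFF even a small sector count wraps
     * end_lba to a tiny value that would wrongly pass an end-of-device
     * comparison.
     */
    static bool lba_range_valid(uint64_t lba, uint32_t sectors,
                                uint64_t dev_end_lba)
    {
            uint64_t end_lba = lba + sectors;

            if (end_lba < lba)      /* unsigned wraparound */
                    return false;

            return end_lba <= dev_end_lba;
    }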
Cc: Martin Petersen <martin.petersen@oracle.com> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
This patch addresses the original PR_APTPL_BUF_LEN = 8k limitation
for write-out of PR APTPL metadata that Martin has recently been
running into.
It changes core_scsi3_update_and_write_aptpl() to use vzalloc'ed
memory instead of kzalloc, and increases the default hardcoded
length to 256k.
It also adds logic in core_scsi3_update_and_write_aptpl() to double
the original length upon core_scsi3_update_aptpl_buf() failure, and
retries until the vzalloc'ed buffer is large enough to accommodate
the outgoing APTPL metadata.
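A minimal sketch of the grow-and-retry pattern described above; fill_aptpl_buf() is a hypothetical stand-in for core_scsi3_update_aptpl_buf(), and the function name and return convention are illustrative:

    /* Sketch only: fill_aptpl_buf() is a hypothetical stand-in that
     * returns a negative error when the metadata does not fit in `len`. */
    static unsigned char *alloc_aptpl_buf(struct se_device *dev, u32 *lenp)
    {
            u32 len = PR_APTPL_BUF_LEN;     /* 256k default after this change */
            unsigned char *buf;

            for (;;) {
                    buf = vzalloc(len);
                    if (!buf)
                            return NULL;
                    if (fill_aptpl_buf(dev, buf, len) == 0)
                            break;          /* metadata fit into the buffer */
                    vfree(buf);             /* too small: double and retry */
                    len *= 2;
            }
            *lenp = len;
            return buf;
    }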
Reported-by: Martin Svec <martin.svec@zoner.cz> Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The efficient frequency (RPe) should stay in the range
RPn <= RPe <= RP0. The pcode clamps the returned value
internally on Broadwell but not on Haswell.
Fix for missing range check in
commit 93ee29203f506582cca2bcec5f05041526d9ab0a
Author: Tom O'Rourke <Tom.O'Rourke@intel.com>
Date: Wed Nov 19 14:21:52 2014 -0800
drm/i915: Use efficient frequency for HSW/BDW
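A minimal sketch of the software clamp, assuming the rps bookkeeping fields of that era (efficient_freq/min_freq/max_freq); `ratio` stands in for the raw pcode-reported value:

    /* Haswell's pcode does not clamp the efficient frequency itself,
     * so keep RPn <= RPe <= RP0 in software; clamp_t() is the kernel
     * helper for typed clamping. */
    dev_priv->rps.efficient_freq =
            clamp_t(u8, ratio,
                    dev_priv->rps.min_freq,     /* RPn */
                    dev_priv->rps.max_freq);    /* RP0 */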
Reference: http://lists.freedesktop.org/archives/intel-gfx/2015-February/059802.html Reported-by: Michael Auchter <a@phire.org> Suggested-by: Chris Wilson <chris@chris-wilson.co.uk> Signed-off-by: Tom O'Rourke <Tom.O'Rourke@intel.com> Signed-off-by: Jani Nikula <jani.nikula@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
As per the specification, the SB_DevFn is the PCI_DEVFN of the target
device and not the source. So PCI_DEVFN(2, 0) is not correct. Further the
port ID should be enough to identify devices unless they are MFD. The
SB_DevFn was intended to remove ambiguity in case of these MFD devices.
For non MFD devices the recommendation for the target device IP was to
ignore these fields, but not all of them followed the recommendation.
Some like CCK ignore these fields and hence PCI_DEVFN(2, 0) works and so
does PCI_DEVFN(0, 0) as it works for DPIO. The issue came to light because
of GPIONC which was not getting programmed correctly with PCI_DEVFN(2, 0).
It turned out that this did not follow the recommendation and expected 0
in this field.
In general the recommendation is to use SB_DevFn as PCI_DEVFN(0, 0) for
all devices except target PCI devices.
Signed-off-by: Shobhit Kumar <shobhit.kumar@intel.com> Reviewed-by: Ville Syrjälä <ville.syrjala@linux.intel.com> Signed-off-by: Jani Nikula <jani.nikula@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
It's possible for the invalidate_range_start mmu notifier callback to race
against userptr object release. If the gem object was released prior to
obtaining the spinlock in invalidate_range_start, we hit a null
pointer dereference.
Testcase: igt/gem_userptr_blits/stress-mm-invalidate-close
Testcase: igt/gem_userptr_blits/stress-mm-invalidate-close-overlap Cc: Chris Wilson <chris@chris-wilson.co.uk> Signed-off-by: Michał Winiarski <michal.winiarski@intel.com> Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>
[Jani: added code comment suggested by Chris] Signed-off-by: Jani Nikula <jani.nikula@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Nothing in Bspec seems to indicate that we actually need this, and it
looks like it can't work, since by this point the pipe is off and so
vblanks won't really happen any more.
Note that Bspec mentions that it takes a vblank for this bit to
change, but _only_ when enabling.
Dropping this code quenches an annoying backtrace introduced by the
more anal checking since
Note: This fixes the fallout from the above commit, but does not address
the shortcomings of the IBX transcoder select workaround implementation
discussed during review [1].
[1] http://mid.gmane.org/87y4o7usxf.fsf@intel.com
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=86095 Signed-off-by: Daniel Vetter <daniel.vetter@intel.com> Reviewed-by: Paulo Zanoni <paulo.r.zanoni@intel.com> Signed-off-by: Jani Nikula <jani.nikula@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
but in reality it uncovered a much older coherency bug. The issue that
boosting the GPU frequency on the BCS ring was masking was that we could
wake the CPU up after completion of a BCS batch and inspect memory prior
to the write cache being fully evicted. In order to serialise the
breadcrumb interrupt (and so ensure that the CPU's view of memory is
coherent) we need to perform a post-sync operation in the MI_FLUSH_DW.
v2: Fix all the MI_FLUSH_DW (bsd plus the duplication in execlists).
Also fix the invalidate_domains mask in gen8_emit_flush() for ring !=
VCS.
Testcase: gpuX-rcs-gpu-read-after-write Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Acked-by: Daniel Vetter <daniel@ffwll.ch> Signed-off-by: Jani Nikula <jani.nikula@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Emit the EOP twice to avoid cache flushing problems.
Signed-off-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Enable at init and disable on fini. Workaround for hardware problems.
v2 (chk): extend commit message
v3: add new function
Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Christian König <christian.koenig@amd.com> (v2) Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Print a warning at compile time and an informative message at runtime in
that case.
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=88758 Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Michel Dänzer <michel.daenzer@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Historically, !__GFP_FS allocations were not allowed to invoke the OOM
killer once reclaim had failed, but nevertheless kept looping in the
allocator.
Commit 9879de7373fc ("mm: page_alloc: embed OOM killing naturally into
allocation slowpath"), which should have been a simple cleanup patch,
accidentally changed the behavior to aborting the allocation at that
point. This creates problems with filesystem callers that currently
rely on the allocator waiting for other tasks to intervene.
Revert the behavior as it shouldn't have been changed as part of a
cleanup patch.
Fixes: 9879de7373fc ("mm: page_alloc: embed OOM killing naturally into allocation slowpath") Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.cz> Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Theodore Ts'o <tytso@mit.edu> Cc: Dave Chinner <david@fromorbit.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Maxime reported the following memory leak regression due to commit dbc8358c7237 ("mm/nommu: use alloc_pages_exact() rather than its own
implementation").
On v3.19, I am facing a memory leak. Each time I run a command, one page
is lost. Here is an example with busybox's free command:
/ # free
             total       used       free     shared    buffers     cached
Mem:          7928       1972       5956          0          0        492
-/+ buffers/cache:        1480       6448
/ # free
             total       used       free     shared    buffers     cached
Mem:          7928       1976       5952          0          0        492
-/+ buffers/cache:        1484       6444
/ # free
             total       used       free     shared    buffers     cached
Mem:          7928       1980       5948          0          0        492
-/+ buffers/cache:        1488       6440
/ # free
             total       used       free     shared    buffers     cached
Mem:          7928       1984       5944          0          0        492
-/+ buffers/cache:        1492       6436
/ # free
             total       used       free     shared    buffers     cached
Mem:          7928       1988       5940          0          0        492
-/+ buffers/cache:        1496       6432
At some point, the system fails to satisfy 256KB allocations:
This problem happens because in some cases we allocate a high-order page
through __get_free_pages() in do_mmap_private(), but then try to free
individual pages rather than the high-order page in free_page_series(). In
this case, pages whose refcount is not 0 won't be returned to the
page allocator, so a memory leak happens.
To fix the problem, this patch changes __get_free_pages() to
alloc_pages_exact(), since alloc_pages_exact() returns
physically-contiguous pages but each page is individually refcounted.
Fixes: dbc8358c7237 ("mm/nommu: use alloc_pages_exact() rather than its own implementation"). Reported-by: Maxime Coquelin <mcoquelin.stm32@gmail.com> Tested-by: Maxime Coquelin <mcoquelin.stm32@gmail.com> Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The vmstat interfaces are good at hiding negative counts (at least when
CONFIG_SMP); but if you peer behind the curtain, you find that
nr_isolated_anon and nr_isolated_file soon go negative, and grow ever
more negative: so they can absorb larger and larger numbers of isolated
pages, yet still appear to be zero.
I'm happy to avoid a congestion_wait() when too_many_isolated() myself;
but I guess it's there for a good reason, in which case we ought to get
too_many_isolated() working again.
The imbalance comes from isolate_migratepages()'s ISOLATE_ABORT case:
putback_movable_pages() decrements the NR_ISOLATED counts, but we forgot
to call acct_isolated() to increment them.
It is possible that the bug which this patch fixes could cause OOM kills
when the system still has a lot of reclaimable page cache.
Fixes: edc2ca612496 ("mm, compaction: move pageblock checks up from isolate_migratepages_range()") Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
A race condition starts to be visible in recent mmotm, where a PG_hwpoison
flag is set on a migration source page *before* it's back in the buddy page
pool.
This is problematic because no page flag is supposed to be set when
freeing (see __free_one_page().) So the user-visible effect of this race
is that it could trigger the BUG_ON() when soft-offlining is called.
The root cause is that we call lru_add_drain_all() to make sure that the
page is in buddy, but that doesn't work because this function just
schedules a work item and doesn't wait for its completion.
drain_all_pages() does the draining directly, so simply dropping
lru_add_drain_all() solves this problem.
Fixes: f15bdfa802bf ("mm/memory-failure.c: fix memory leak in successful soft offlining") Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Tony Luck <tony.luck@intel.com> Cc: Chen Gong <gong.chen@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
For whatever reason, generic_access_phys() only remaps one page, but
actually allows access of arbitrary size. It's quite easy to trigger
large reads, like printing out a large structure with gdb, which leads to a
crash. Fix it by remapping the correct size.
What we want to check here is whether there is a high-order freepage in the
buddy list of another migratetype, in order to steal it without fragmentation.
But the current code just checks cc->order, which means the allocation
request order. So this is wrong.
Without this fix, non-movable synchronous compaction below pageblock order
would not stop until compaction is complete, because the migratetype of
most pageblocks is movable and high-order freepages made by compaction are
usually on movable-type buddy lists.
There is a report related to this bug; see the link below.
Although the affected system still has load spikes coming from compaction,
this change makes the system completely stable and responsive according to
that report.
The stress-highalloc test in mmtests with non-movable order-7 allocation
doesn't show any notable difference in allocation success rate, but it
shows a higher compaction success rate.
I noticed that "allowed" can easily overflow by falling below 0, because
(total_vm / 32) can be larger than "allowed". The problem occurs in
OVERCOMMIT_NONE mode.
In this case, a huge allocation can succeed and overcommit the system
(despite OVERCOMMIT_NONE mode). All subsequent allocations will fail
(system-wide), so the system becomes unusable.
The problem was masked out by commit c9b1d0981fcc
("mm: limit growth of 3% hardcoded other user reserve"),
but it's easy to reproduce it on older kernels:
1) set overcommit_memory sysctl to 2
2) mmap() large file multiple times (with VM_SHARED flag)
3) try to malloc() large amount of memory
It can also be reproduced on newer kernels, but a misconfigured
sysctl_user_reserve_kbytes is required.
Fix this issue by switching to signed arithmetic here.
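A user-space sketch of the underflow and the signed fix (the values are illustrative):

    #include <stdio.h>

    int main(void)
    {
            unsigned long allowed = 1000;   /* pages the commit limit permits */
            unsigned long reserve = 4000;   /* e.g. total_vm / 32, can be larger */

            /* Buggy: unsigned subtraction wraps to a huge positive value,
             * so a later "pages <= allowed" check passes for anything. */
            unsigned long u = allowed - reserve;

            /* Fixed: signed arithmetic keeps the result meaningfully negative. */
            long s = (long)allowed - (long)reserve;

            printf("unsigned: %lu, signed: %ld\n", u, s);
            return 0;
    }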
Signed-off-by: Roman Gushchin <klamm@yandex-team.ru> Cc: Andrew Shewmaker <agshew@gmail.com> Cc: Rik van Riel <riel@redhat.com> Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
I noticed, that "allowed" can easily overflow by falling below 0,
because (total_vm / 32) can be larger than "allowed". The problem
occurs in OVERCOMMIT_NONE mode.
In this case, a huge allocation can succeed and overcommit the system
(despite OVERCOMMIT_NONE mode). All subsequent allocations will fail
(system-wide), so the system becomes unusable.
The problem was masked out by commit c9b1d0981fcc
("mm: limit growth of 3% hardcoded other user reserve"),
but it's easy to reproduce it on older kernels:
1) set overcommit_memory sysctl to 2
2) mmap() large file multiple times (with VM_SHARED flag)
3) try to malloc() large amount of memory
It can also be reproduced on newer kernels, but a misconfigured
sysctl_user_reserve_kbytes is required.
Fix this issue by switching to signed arithmetic here.
[akpm@linux-foundation.org: use min_t] Signed-off-by: Roman Gushchin <klamm@yandex-team.ru> Cc: Andrew Shewmaker <agshew@gmail.com> Cc: Rik van Riel <riel@redhat.com> Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Reviewed-by: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
When studying page stealing, I noticed some weird looking decisions in
try_to_steal_freepages(). The first I assume is a bug (Patch 1), the
following two patches were driven by evaluation.
Testing was done with stress-highalloc of mmtests, using the
mm_page_alloc_extfrag tracepoint and postprocessing to get counts of how
often page stealing occurs for individual migratetypes, and what
migratetypes are used for fallbacks. Arguably, the worst case of page
stealing is when UNMOVABLE allocation steals from MOVABLE pageblock.
RECLAIMABLE allocation stealing from MOVABLE allocation is also not ideal,
so the goal is to minimize these two cases.
The evaluation of v2 wasn't always a clear win and Joonsoo questioned the
results. Here I used a different baseline which includes the RFC compaction
improvements from [1]. I found that the compaction improvements reduce the
variability of stress-highalloc, so there's less noise in the data.
First, let's look at stress-highalloc configured to do sync compaction,
and how these patches reduce page stealing events during the test. First
column is after fresh reboot, other two are reiterations of test without
reboot. That was all accumulated over 5 re-iterations (so the benchmark
was run 5x3 times with 5 fresh restarts).
With Patch 1:
                                                     3.19-rc4    3.19-rc4    3.19-rc4
                                                    6-nothp-1   6-nothp-2   6-nothp-3
Page alloc extfrag event                             11834954     9877523     9774860
Extfrag fragmenting                                  11833993     9876880     9774245
Extfrag fragmenting for unmovable                        7342       16129       11712
Extfrag fragmenting unmovable placed with movable        4191       10547        6270
Extfrag fragmenting for reclaimable                       373        1130         923
Extfrag fragmenting reclaimable placed with movable       302         906         738
Extfrag fragmenting for movable                      11826278     9859621     9761610
With Patch 2:
                                                     3.19-rc4    3.19-rc4    3.19-rc4
                                                    7-nothp-1   7-nothp-2   7-nothp-3
Page alloc extfrag event                              4725990     3668793     3807436
Extfrag fragmenting                                   4725104     3668252     3806898
Extfrag fragmenting for unmovable                        6678        7974        7281
Extfrag fragmenting unmovable placed with movable        2051        3829        4017
Extfrag fragmenting for reclaimable                       429        1208        1278
Extfrag fragmenting reclaimable placed with movable       369         976        1034
Extfrag fragmenting for movable                       4717997     3659070     3798339
With Patch 3:
                                                     3.19-rc4    3.19-rc4    3.19-rc4
                                                    8-nothp-1   8-nothp-2   8-nothp-3
Page alloc extfrag event                              5016183     4700142     3850633
Extfrag fragmenting                                   5015325     4699613     3850072
Extfrag fragmenting for unmovable                        1312        3154        3088
Extfrag fragmenting unmovable placed with movable        1115        2777        2714
Extfrag fragmenting for reclaimable                       437        1193        1097
Extfrag fragmenting reclaimable placed with movable       330         969         879
Extfrag fragmenting for movable                       5013576     4695266     3845887
In v2 we've seen apparent regression with Patch 1 for unmovable events,
this is now gone, suggesting it was indeed noise. Here, each patch
improves the situation for unmovable events. Reclaimable is improved by
patch 1 and then either the same modulo noise, or perhaps slightly worse -
a small price for unmovable improvements, IMHO. The number of movable
allocations falling back to other migratetypes is most noisy, but it's
reduced to half at Patch 2 nevertheless. These are least critical as
compaction can move them around.
If we look at success rates, the patches don't affect them.
While there's no improvement here, I consider reduced fragmentation events
to be worthwhile on their own. Patch 2 also seems to reduce scanning for free
pages, and migrations in compaction, suggesting it has somewhat less work
to do:
As with v2, /proc/pagetypeinfo appears unaffected with respect to numbers
of unmovable and reclaimable pageblocks.
Configuring the benchmark to allocate like THP page fault (i.e. no sync
compaction) gives much noisier results for iterations 2 and 3 after
reboot. This is not so surprising given how [1] offers lower improvements
in this scenario due to fewer restarts after deferred compaction, which
would change the compaction pivot.
When __rmqueue_fallback() is called to allocate a page of order X, it will
find a page of order Y >= X of a fallback migratetype, which is different
from the desired migratetype. With the help of try_to_steal_freepages(),
it may change the migratetype (to the desired one) also of:
1) all currently free pages in the pageblock containing the fallback page
2) the fallback pageblock itself
3) buddy pages created by splitting the fallback page (when Y > X)
These decisions take the order Y into account, as well as the desired
migratetype, with the goal of preventing multiple fallback allocations
that could e.g. distribute UNMOVABLE allocations among multiple
pageblocks.
Originally, the decision for 1) implied the decision for 3). Commit 47118af076f6 ("mm: mmzone: MIGRATE_CMA migration type added") changed that
(probably unintentionally) so that the buddy pages in case 3) are always
changed to the desired migratetype, except for CMA pageblocks.
Commit fef903efcf0c ("mm/page_allo.c: restructure free-page stealing code
and fix a bug") did some refactoring and added a comment that the case of
3) is intended. Commit 0cbef29a7821 ("mm: __rmqueue_fallback() should
respect pageblock type") removed the comment and tried to restore the
original behavior where 1) implies 3), but due to the previous
refactoring, the result is instead that only 2) implies 3) - and the
conditions for 2) are less frequently met than conditions for 1). This
may increase fragmentation in situations where the code decides to steal
all free pages from the pageblock (case 1)), but then gives back the buddy
pages produced by splitting.
This patch restores the original intended logic where 1) implies 3).
During testing with stress-highalloc from mmtests, this has shown to
decrease the number of events where UNMOVABLE and RECLAIMABLE allocations
steal from MOVABLE pageblocks, which can lead to permanent fragmentation.
In some cases it has increased the number of events when MOVABLE
allocations steal from UNMOVABLE or RECLAIMABLE pageblocks, but these are
fixable by sync compaction and thus less harmful.
Note that evaluation has shown that the behavior introduced by 47118af076f6 for buddy pages in case 3) is actually even better than the
original logic, so the following patch will introduce it properly once
again. For stable backports of this patch it thus makes sense to only fix
versions containing 0cbef29a7821.
[iamjoonsoo.kim@lge.com: tracepoint fix] Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Mel Gorman <mgorman@suse.de> Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Acked-by: Minchan Kim <minchan@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Rik van Riel <riel@redhat.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Commit ed4d4902ebdd ("mm, hugetlb: remove hugetlb_zero and
hugetlb_infinity") replaced 'unsigned long hugetlb_zero' with 'int zero'
leading to out-of-bounds access in proc_doulongvec_minmax(). Use
'.extra1 = NULL' instead of '.extra1 = &zero'. Passing NULL is
equivalent to passing minimal value, which is 0 for unsigned types.
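A hedged sketch of the corrected pattern; the table entry shown is illustrative, not the exact hugetlb sysctl table:

    static struct ctl_table example_table[] = {
            {
                    .procname       = "nr_hugepages",
                    .data           = NULL,
                    .maxlen         = sizeof(unsigned long),
                    .mode           = 0644,
                    .proc_handler   = hugetlb_sysctl_handler,
                    /* proc_doulongvec_minmax() dereferences extra1/extra2
                     * as unsigned long; pointing .extra1 at an int `zero`
                     * reads past the int. NULL means "minimum of 0" for
                     * unsigned types. */
                    .extra1         = NULL,
            },
            { }
    };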
If __unmap_hugepage_range() tries to unmap the address range over which
hugepage migration is on the way, we get the wrong page because pte_page()
doesn't work for migration entries. This patch simply clears the pte for
migration entries as we do for hwpoison entries.
Fixes: 290408d4a2 ("hugetlb: hugepage migration core") Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Hugh Dickins <hughd@google.com> Cc: James Hogan <james.hogan@imgtec.com> Cc: David Rientjes <rientjes@google.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Rik van Riel <riel@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Luiz Capitulino <lcapitulino@redhat.com> Cc: Nishanth Aravamudan <nacc@linux.vnet.ibm.com> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Steve Capper <steve.capper@linaro.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
There is a race condition between hugepage migration and
change_protection(), where hugetlb_change_protection() doesn't care about
migration entries and wrongly overwrites them. That causes unexpected
results like kernel crash. HWPoison entries also can cause the same
problem.
This patch adds is_hugetlb_entry_(migration|hwpoisoned) check in this
function to do proper actions.
Fixes: 290408d4a2 ("hugetlb: hugepage migration core") Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Hugh Dickins <hughd@google.com> Cc: James Hogan <james.hogan@imgtec.com> Cc: David Rientjes <rientjes@google.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Rik van Riel <riel@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Luiz Capitulino <lcapitulino@redhat.com> Cc: Nishanth Aravamudan <nacc@linux.vnet.ibm.com> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Steve Capper <steve.capper@linaro.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
When running the test which causes the race as shown in the previous patch,
we can hit the BUG "get_page() on refcount 0 page" in hugetlb_fault().
This race happens when the pte turns into a migration entry just after the first
check of is_hugetlb_entry_migration() in hugetlb_fault() has returned false.
To fix this, we need to check pte_present() again after huge_ptep_get().
This patch also reorders taking ptl and doing pte_page(), because
pte_page() should be done under ptl. Due to this reordering, we need to use
trylock_page() in the page != pagecache_page case to respect the locking order.
Fixes: 66aebce747ea ("hugetlb: fix race condition in hugetlb_fault()") Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Hugh Dickins <hughd@google.com> Cc: James Hogan <james.hogan@imgtec.com> Cc: David Rientjes <rientjes@google.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Rik van Riel <riel@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Luiz Capitulino <lcapitulino@redhat.com> Cc: Nishanth Aravamudan <nacc@linux.vnet.ibm.com> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Steve Capper <steve.capper@linaro.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Currently the list is traversed using the rcu variant. That is not correct,
since dev_set_mac_address can be called, which eventually calls
rtmsg_ifinfo_build_skb, and there skb allocation can sleep. So fix this
by removing the rcu usage here.
Fixes: 3d249d4ca7 "net: introduce ethernet teaming device" Signed-off-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
1. For an IPv4 ping socket, ping_check_bind_addr does not check
the family of the socket address that's passed in. Instead,
make it behave like inet_bind, which enforces either that the
address family is AF_INET, or that the family is AF_UNSPEC and
the address is 0.0.0.0.
2. For an IPv6 ping socket, ping_check_bind_addr returns EINVAL
if the socket family is not AF_INET6. Return EAFNOSUPPORT
instead, for consistency with inet6_bind.
3. Make ping_v4_sendmsg and ping_v6_sendmsg return EAFNOSUPPORT
instead of EINVAL if an incorrect socket address structure is
passed in.
4. Make IPv6 ping sockets be IPv6-only. The code does not support
IPv4, and it cannot easily be made to support IPv4 because
the protocol numbers for ICMP and ICMPv6 are different. This
makes connect(::ffff:192.0.2.1) fail with EAFNOSUPPORT instead
of making the socket unusable.
Among other things, this fixes an oops that can be triggered by:
Change-Id: If06ca86d9f1e4593c0d6df174caca3487c57a241 Signed-off-by: Lorenzo Colitti <lorenzo@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
If an over-MTU UDP datagram is sent through a SOCK_RAW socket to a
UFO-capable device, ip_ufo_append_data() sets skb->ip_summed to
CHECKSUM_PARTIAL unconditionally as all GSO code assumes transport layer
checksum is to be computed on segmentation. However, in this case,
skb->csum_start and skb->csum_offset are never set as raw socket
transmit path bypasses udp_send_skb() where they are usually set. As a
result, driver may access invalid memory when trying to calculate the
checksum and store the result (as observed in virtio_net driver).
Moreover, the very idea of modifying the userspace provided UDP header
is IMHO against raw socket semantics (I wasn't able to find a document
clearly stating this or the opposite, though). And while allowing
CHECKSUM_NONE in the UFO case would be more efficient, it would be a bit
too intrusive a change just to handle a corner case like this. Therefore
allowing UFO only for packets from SOCK_DGRAM sockets seems to be the best
option.
Signed-off-by: Michal Kubecek <mkubecek@suse.cz> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The National Instruments USB Host-to-Host Cable is based on the Prolific
PL-25A1 chipset. Add its VID/PID so the plusb driver will recognize it.
Signed-off-by: Ben Shelton <ben.shelton@ni.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
We did a failed attempt in the past to only use rcu in rtnl dump
operations (commit e67f88dd12f6 "net: dont hold rtnl mutex during
netlink dump callbacks")
Now that dumps are holding RTNL anyway, there is no need to also
use rcu locking, as it forbids any sleeping, like the GFP_KERNEL
allocations that control paths should use instead of GFP_ATOMIC
whenever possible.
This should fix the following splat Cong Wang reported:
[ INFO: suspicious RCU usage. ]
3.19.0+ #805 Tainted: G W
include/linux/rcupdate.h:538 Illegal context switch in RCU read-side critical section!
Signed-off-by: Eric Dumazet <edumazet@google.com> Fixes: 0c7aecd4bde4b7302 ("netns: add rtnl cmd to add and get peer netns ids") Cc: Nicolas Dichtel <nicolas.dichtel@6wind.com> Reported-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Commit 740c7f31c094703c ("sh_eth: Ensure DMA engines are stopped before
freeing buffers") added a call to sh_eth_reset() to the
sh_eth_set_ringparam() and sh_eth_close() paths.
However, setting the software reset bit(s) in the EDMR register resets
the MAC Address Registers to zero. Hence after kexec, the new kernel
doesn't detect a valid MAC address and assigns a random MAC address,
breaking DHCP.
Set the MAC address again after the reset in sh_eth_dev_exit() to fix
this.
Tested on r8a7740/armadillo (GETHER) and r8a7791/koelsch (FAST_RCAR).
Fixes: 740c7f31c094703c ("sh_eth: Ensure DMA engines are stopped before freeing buffers") Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Commit 44c8bc3ce39f ("net: bcmgenet: log RX buffer allocation and RX/TX dma
failures") added a few software maintained statistics using
BCMGENET_STAT_MIB_RX and BCMGENET_STAT_MIB_TX. These statistics are read from
the hardware MIB counters, such that bcmgenet_update_mib_counters() was trying
to read from a non-existing MIB offset for these counters.
Fix this by introducing a special type: BCMGENET_STAT_SOFT, similar to
BCMGENET_STAT_NETDEV, such that bcmgenet_get_ethtool_stats will read from the
software mib.
Fixes: 44c8bc3ce39f ("net: bcmgenet: log RX buffer allocation and RX/TX dma failures") Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
This patch adds bcmgenet_tx_poll for the tx_rings. This reduces the
interrupt load and lets the network stack submit transmits on time. It also
separates the completion handling of tx_ring16 from bcmgenet_poll.
Previously, bcmgenet_tx_reclaim for tx_ring[{0,1,2,3}] was driven by an
interrupt raised only after a certain number of TxBDs, so the transmitted
skbs were reclaimed too slowly. This caused xmit performance degradation
after 605ad7f ("tcp: refine TSO autosizing").
Signed-off-by: Jaedon Shin <jaedon.shin@gmail.com> Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Neighbour code assumes headroom to push Ethernet header is
at least 16 bytes.
It appears macvtap has only 14 bytes available on arches
where NET_IP_ALIGN is 0 (like x86)
The effect is a corruption of 2 bytes right before skb->head,
and possible crashes if accessing non-existing memory.
This fix should also increase IPv4 performance, as paranoid code
in ip_finish_output2() won't have to call skb_realloc_headroom()
Reported-by: Brian Rak <brak@vultr.com> Tested-by: Brian Rak <brak@vultr.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
With commit a7526eb5d06b (net: Unbreak compat_sys_{send,recv}msg), the
MSG_CMSG_COMPAT flag is blocked at the compat syscall entry points,
changing the kernel compat behaviour from the one before the commit it
was trying to fix (1be374a0518a, net: Block MSG_CMSG_COMPAT in
send(m)msg and recv(m)msg).
On 32-bit kernels (!CONFIG_COMPAT), MSG_CMSG_COMPAT is 0 and the native
32-bit sys_sendmsg() allows flag 0x80000000 to be set (it is ignored by
the kernel). However, on a 64-bit kernel, the compat ABI is different
with commit a7526eb5d06b.
This patch changes the compat_sys_{send,recv}msg behaviour to the one
prior to commit 1be374a0518a.
The problem was found running 32-bit LTP (sendmsg01) binary on an arm64
kernel. Arguably, LTP should not pass 0xffffffff as flags to sendmsg()
but the general rule is not to break user ABI (even when the user
behaviour is not entirely sane).
Fixes: a7526eb5d06b (net: Unbreak compat_sys_{send,recv}msg) Cc: Andy Lutomirski <luto@amacapital.net> Cc: David S. Miller <davem@davemloft.net> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
         CPU0                                     CPU1

                                             team_port_del
                                               team_upper_dev_unlink
                                                 priv_flags &= ~IFF_TEAM_PORT
        team_handle_frame
          team_port_get_rcu
            team_port_exists
              priv_flags & IFF_TEAM_PORT == 0
                return NULL (instead of port got
                             from rx_handler_data)
                                             netdev_rx_handler_unregister
The thing is that the flag is removed before rx_handler is unregistered.
If team_handle_frame is called in between, team_port_exists returns 0
and team_port_get_rcu will return NULL.
So do not check the flag here. It is guaranteed by netdev_rx_handler_unregister
that team_handle_frame will always see a valid rx_handler_data pointer.
Signed-off-by: Jiri Pirko <jiri@resnulli.us> Fixes: 3d249d4ca7d0 ("net: introduce ethernet teaming device") Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Colons are used as a separator in netdev device lookups in dev_ioctl.c.
The specific functions affected are SIOCGIFTXQLEN, SIOCETHTOOL and SIOCSIFNAME.
Signed-off-by: Matthew Thode <mthode@mthode.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Non NAPI drivers can call skb_tstamp_tx() and then sock_queue_err_skb()
from hard IRQ context.
Therefore, sock_dequeue_err_skb() needs to block hard irqs, or
corruptions or hangs can happen.
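A minimal sketch of the required locking, assuming the queue in question is sk->sk_error_queue:

    /* Producers may run in hard-IRQ context, so the consumer must use
     * the irq-saving lock variants rather than plain spin_lock(). */
    static struct sk_buff *dequeue_err_skb(struct sock *sk)
    {
            struct sk_buff_head *q = &sk->sk_error_queue;
            unsigned long flags;
            struct sk_buff *skb;

            spin_lock_irqsave(&q->lock, flags);
            skb = __skb_dequeue(q);
            spin_unlock_irqrestore(&q->lock, flags);

            return skb;
    }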
Signed-off-by: Eric Dumazet <edumazet@google.com> Fixes: 364a9e93243d1 ("sock: deduplicate errqueue dequeue") Fixes: cb820f8e4b7f7 ("net: Provide a generic socket error queue delivery method for Tx time stamps.") Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Open vSwitch allows moving an internal vport to a different namespace
while it is still connected to the bridge. But when the namespace is
deleted, OVS does not detach these vports, which results in a dangling
pointer to the netdevice and causes a kernel panic as follows.
This issue is fixed by detaching all ovs ports from the deleted
namespace at net-exit.
Reported-by: Assaf Muller <amuller@redhat.com> Fixes: 46df7b81454("openvswitch: Add support for network namespaces.") Signed-off-by: Pravin B Shelar <pshelar@nicira.com> Reviewed-by: Thomas Graf <tgraf@noironetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
In tcf_em_validate(), after calling request_module() to load the
kind-specific module, set em->ops to NULL before returning -EAGAIN, so
that module_put() is not called again by tcf_em_tree_destroy().
Signed-off-by: Ignacy Gawędzki <ignacy.gawedzki@green-communications.fr> Acked-by: Cong Wang <cwang@twopensource.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
phy_init_eee uses phy_find_setting(phydev->speed, phydev->duplex)
to find a valid entry in the settings array for the given speed
and duplex value. For full duplex 1000baseT, this will return
the first matching entry, which is the entry for 1000baseKX_Full.
If the phy eee does not support 1000baseKX_Full, this entry will not
match, causing phy_init_eee to fail for no good reason.
Fixes: 9a9c56cb34e6 ("net: phy: fix a bug when verify the EEE support") Fixes: 3e7077067e80c ("phy: Expand phy speed/duplex settings array") Cc: Giuseppe Cavallaro <peppe.cavallaro@st.com> Signed-off-by: Guenter Roeck <linux@roeck-us.net> Acked-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
ip_check_defrag() may be used by af_packet to defragment outgoing packets.
skb_network_offset() of af_packet's outgoing packets is not zero.
Signed-off-by: Alexander Drozdov <al.drozdov@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
skb_copy_bits() returns zero on success and a negative value on error,
so the condition in ip_check_defrag() needs to be inverted.
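A sketch of the corrected check; the surrounding code is illustrative, not the exact ip_check_defrag() body:

    struct iphdr iph;

    /* skb_copy_bits() returns 0 on success and negative on error, so
     * only a negative return means the header could not be read. */
    if (skb_copy_bits(skb, skb_network_offset(skb), &iph, sizeof(iph)) < 0)
            return skb;     /* give up, pass the skb through unchanged */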
Fixes: 1bf3751ec90c ("ipv4: ip_check_defrag must not modify skb before unsharing") Signed-off-by: Alexander Drozdov <al.drozdov@gmail.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The gnet_stats_copy_app() function gets called, more often than not, with its
second argument a pointer to an automatic variable in the caller's stack.
Therefore, to avoid copying garbage afterwards when calling
gnet_stats_finish_copy(), this data is better copied to dynamically
allocated memory that gets freed after use.
[xiyou.wangcong@gmail.com: remove a useless kfree()]
Signed-off-by: Ignacy Gawędzki <ignacy.gawedzki@green-communications.fr> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Ignacy reported that when eth0 is down and we add a vlan device
on top of it, like:
ip link add link eth0 name eth0.1 up type vlan id 1
We will get a refcount leak:
unregister_netdevice: waiting for eth0.1 to become free. Usage count = 2
The problem is that when rtnl_configure_link() fails in rtnl_newlink(),
we simply call unregister_netdevice(), but for a stacked device like vlan,
we do almost nothing when we unregister the upper device; more work
is done when we unregister the lower device, so call its ->dellink().
Reported-by: Ignacy Gawedzki <ignacy.gawedzki@green-communications.fr> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
ipv6_cow_metrics() currently assumes only DST_HOST routes require
dynamic metrics allocation from inetpeer. The assumption breaks when
ndisc has discovered a router with the RTAX_MTU and RTAX_HOPLIMIT metrics.
Refer to ndisc_router_discovery() in ndisc.c and note that dst_metric_set()
is called after the route is created.
This patch creates the metrics array (by calling dst_cow_metrics_generic) in
ipv6_cow_metrics().
Before:
[root@qemu1 ~]# ip -6 r show | egrep -v unreachable
fd00:face:face:face::/64 dev eth0 proto kernel metric 256 expires 27sec
fe80::/64 dev eth0 proto kernel metric 256
default via fe80::74df:d0ff:fe23:8ef2 dev eth0 proto ra metric 1024 expires 27sec
After:
[root@qemu1 ~]# ip -6 r show | egrep -v unreachable
fd00:face:face:face::/64 dev eth0 proto kernel metric 256 expires 27sec mtu 1300
fe80::/64 dev eth0 proto kernel metric 256 mtu 1300
default via fe80::74df:d0ff:fe23:8ef2 dev eth0 proto ra metric 1024 expires 27sec mtu 1300 hoplimit 30
Fixes: 8e2ec639173f325 (ipv6: don't use inetpeer to store metrics for routes.) Signed-off-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
IPv6 can keep a copy of the SYN message using skb_get() in
tcp_v6_conn_request(), so that the caller won't free the skb when calling
kfree_skb() later.
Therefore TCP fast open has to clone the skb it is queuing in
child->sk_receive_queue, as all skbs consumed from receive_queue are
freed using __kfree_skb() (ie assuming skb->users == 1)
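A hedged sketch of the queueing fix, simplified from the fastopen child setup; `child` is the newly created child socket:

    /* Queue a clone, not the original: IPv6 may still hold a reference
     * on the SYN via skb_get(), but everything consumed from
     * sk_receive_queue is freed with __kfree_skb(), which assumes
     * skb->users == 1. */
    struct sk_buff *skb2 = skb_clone(skb, GFP_ATOMIC);

    if (skb2)
            __skb_queue_tail(&child->sk_receive_queue, skb2);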
Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Fixes: 5b7ed0892f2af ("tcp: move fastopen functions to tcp_fastopen.c") Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Make __ipv6_select_ident() static as it isn't used outside
the file.
Fixes: 0508c07f5e0c9 (ipv6: Select fragment id during UFO segmentation if not set.) Signed-off-by: Vladislav Yasevich <vyasevic@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Commit 0508c07f5e0c9 ("ipv6: Select fragment id during UFO segmentation
if not set.") introduced a bug on LE in how the ipv6 fragment id is
assigned. This was caught by the nightly sparse check.
Resolve the following sparse error:
net/ipv6/output_core.c:57:38: sparse: incorrect type in assignment
(different base types)
net/ipv6/output_core.c:57:38: expected restricted __be32
[usertype] ip6_frag_id
net/ipv6/output_core.c:57:38: got unsigned int [unsigned]
[assigned] [usertype] id
Fixes: 0508c07f5e0c9 (ipv6: Select fragment id during UFO segmentation if not set.) Signed-off-by: Vladislav Yasevich <vyasevic@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
ifla_vf_policy[] is wrong in advertising its individual member types as
NLA_BINARY since .type = NLA_BINARY in combination with .len declares the
len member as *max* attribute length [0, len].
The issue is that when do_setvfinfo() is being called to set up a VF
through ndo handler, we could set corrupted data if the attribute length
is less than the size of the related structure itself.
The intent is exactly the opposite, namely to make sure that at least
len bytes of data are passed.
Fixes: ebc08a6f47ee ("rtnetlink: Add VF config code to rtnetlink") Cc: Mitch Williams <mitch.a.williams@intel.com> Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
This patch fixes two issues in UDP checksum computation in pktgen.
First, the pseudo-header uses the source and destination IP
addresses; currently, for IPv4, the ports are used instead.
Second, the UDP checksum covers both header and data. So we need to
generate the data earlier (move pktgen_finalize_skb up), and compute
the checksum for UDP header + data.
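A hedged sketch of a correct IPv4 UDP checksum along these lines; `datalen` (the already-generated payload length) and `iph` (the IPv4 header) are assumed from the surrounding code:

    struct udphdr *udph = udp_hdr(skb);
    int udplen = datalen + sizeof(struct udphdr);
    __wsum csum;

    udph->check = 0;

    /* Checksum the UDP header *and* data (which therefore must already
     * have been generated), then fold in the IPv4 pseudo-header, which
     * uses the addresses -- not the ports. */
    csum = skb_checksum(skb, skb_transport_offset(skb), udplen, 0);
    udph->check = csum_tcpudp_magic(iph->saddr, iph->daddr,
                                    udplen, IPPROTO_UDP, csum);

    if (udph->check == 0)
            udph->check = CSUM_MANGLED_0;   /* 0 means "no checksum" for UDP */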
Fixes: c26bf4a51308c ("pktgen: Add UDPCSUM flag to support UDP checksums") Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Acked-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
We still need a validate_link_af() handler with an appropriate nla policy,
similarly as we have in IPv4 case, otherwise size validations are not being
done properly in that case.
Fixes: f53adae4eae5 ("net: ipv6: add tokenized interface identifier support") Fixes: bc91b0f07ada ("ipv6: addrconf: implement address generation modes") Cc: Jiri Pirko <jiri@resnulli.us> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
flow_cache_flush_task references the structure member flow_cache_gc_work
where it should reference flow_cache_flush_work instead.
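A self-contained user-space sketch of this bug class; the struct and member names are stand-ins for struct netns_xfrm and its work_structs:

    #include <stddef.h>
    #include <stdio.h>

    #define container_of(ptr, type, member) \
            ((type *)((char *)(ptr) - offsetof(type, member)))

    struct ns {                             /* stand-in for struct netns_xfrm */
            long flow_cache_gc_work;        /* stand-ins for the work_structs */
            long flow_cache_flush_work;
    };

    int main(void)
    {
            struct ns ns;

            /* Naming the wrong member subtracts the wrong offset, so the
             * computed struct pointer is shifted; every later field access
             * then reads unrelated memory. */
            struct ns *bad  = container_of(&ns.flow_cache_flush_work,
                                           struct ns, flow_cache_gc_work);
            struct ns *good = container_of(&ns.flow_cache_flush_work,
                                           struct ns, flow_cache_flush_work);

            printf("bad=%p good=%p real=%p\n",
                   (void *)bad, (void *)good, (void *)&ns);
            return 0;
    }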
Kernel panic occurs on kernels using IPsec during XFRM garbage
collection. The garbage collection interval can be shortened using the
following sysctl settings:
With the default settings, our production servers crash approximately
once a week. With the settings above, they crash immediately.
Fixes: ca925cf1534e ("flowcache: Make flow cache name space aware") Reported-by: Tomáš Charvát <tc@excello.cz> Tested-by: Jan Hejl <jh@excello.cz> Signed-off-by: Miroslav Urbanek <mu@miroslavurbanek.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
ACCESS_ONCE does not work reliably on non-scalar types. For
example gcc 4.6 and 4.7 might remove the volatile tag for such
accesses during the SRA (scalar replacement of aggregates) step
(https://gcc.gnu.org/bugzilla/show_bug.cgi?id=58145)
Change the ppc/kvm code to replace ACCESS_ONCE with READ_ONCE.
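Simplified from the kernel headers of that era, to show why the aggregate case needs READ_ONCE():

    /* ACCESS_ONCE() relies on a volatile cast of the whole object; gcc's
     * SRA pass may drop that volatile for aggregate (non-scalar) types. */
    #define ACCESS_ONCE(x) (*(volatile typeof(x) *)&(x))

    /* READ_ONCE() instead copies through a size-dispatched helper using
     * volatile scalar accesses, which SRA cannot optimize away. */
    static __always_inline void __read_once_size(const volatile void *p,
                                                 void *res, int size)
    {
            switch (size) {
            case 1: *(__u8 *)res  = *(volatile __u8 *)p;  break;
            case 2: *(__u16 *)res = *(volatile __u16 *)p; break;
            case 4: *(__u32 *)res = *(volatile __u32 *)p; break;
            case 8: *(__u64 *)res = *(volatile __u64 *)p; break;
            default:
                    barrier();
                    __builtin_memcpy(res, (const void *)p, size);
                    barrier();
            }
    }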
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com> Acked-by: Alexander Graf <agraf@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
ACCESS_ONCE does not work reliably on non-scalar types. For
example gcc 4.6 and 4.7 might remove the volatile tag for such
accesses during the SRA (scalar replacement of aggregates) step
(https://gcc.gnu.org/bugzilla/show_bug.cgi?id=58145)
Change the ppc/hugetlbfs code to replace ACCESS_ONCE with READ_ONCE.
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
ACCESS_ONCE does not work reliably on non-scalar types. For
example gcc 4.6 and 4.7 might remove the volatile tag for such
accesses during the SRA (scalar replacement of aggregates) step
(https://gcc.gnu.org/bugzilla/show_bug.cgi?id=58145)
Fixup gup_pmd_range.
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Commit 927609d622a3 ("kernel: tighten rules for ACCESS ONCE") results in a
compile failure for sh builds with CONFIG_X2TLB enabled.
arch/sh/mm/gup.c: In function 'gup_get_pte':
arch/sh/mm/gup.c:20:2: error: invalid initializer
make[1]: *** [arch/sh/mm/gup.o] Error 1
Replace ACCESS_ONCE with READ_ONCE to fix the problem.
Fixes: 927609d622a3 ("kernel: tighten rules for ACCESS ONCE") Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: Guenter Roeck <linux@roeck-us.net> Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Currently the maximum space limit the quota format supports is expressed in
blocks; however, since we store space limits in bytes, this is somewhat
confusing. So store the maximum limit in bytes as well. Also rename the
field to match the new unit, and the related inode field to match the new
naming scheme.
Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
ACCESS_ONCE does not work reliably on non-scalar types. For
example gcc 4.6 and 4.7 might remove the volatile tag for such
accesses during the SRA (scalar replacement of aggregates) step
(https://gcc.gnu.org/bugzilla/show_bug.cgi?id=58145)
Change the p2m code to replace ACCESS_ONCE with READ_ONCE.
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com> Reviewed-by: Juergen Gross <jgross@suse.com> Acked-by: David Vrabel <david.vrabel@citrix.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
if (unlikely(lock->tickets.tail & TICKET_SLOWPATH_FLAG))
__ticket_unlock_slowpath(lock, prev);
which is *exactly* the kind of things you cannot do with spinlocks,
because after you've done the "add_smp()" and released the spinlock
for the fast-path, you can't access the spinlock any more. Exactly
because a fast-path lock might come in, and release the whole data
structure.
Linus suggested that we should not do any writes to lock after unlock(),
and we can move slowpath clearing to fastpath lock.
So this patch implements the fix with:
1. Moving slowpath flag to head (Oleg):
Unlocked locks don't care about the slowpath flag; therefore we can keep
it set after the last unlock, and clear it again on the first (try)lock.
-- this removes the write after unlock. note that keeping slowpath flag would
result in unnecessary kicks.
By moving the slowpath flag from the tail to the head ticket we also avoid
the need to access both the head and tail tickets on unlock.
2. use xadd to avoid read/write after unlock that checks the need for
unlock_kick (Linus):
We further avoid the need for a read-after-release by using xadd;
the prev head value will include the slowpath flag and indicate if we
need to do PV kicking of suspended spinners -- on modern chips xadd
isn't (much) more expensive than an add + load.
The use of READ_ONCE() causes lots of warnings with the pending paravirt
spinlock fixes, because those end up passing a member of a 'const'
structure to READ_ONCE().
There should certainly be nothing wrong with using READ_ONCE() with a
const source, but the helper function __read_once_size() would cause
warnings because it would drop the 'const' qualifier, but also because
the destination would be marked 'const' too due to the use of 'typeof'.
Use a union of types in READ_ONCE() to avoid this issue.
Also make sure to use parenthesis around the macro arguments to avoid
possible operator precedence issues.
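A sketch of the resulting helpers (close to the final form):

static __always_inline
void __read_once_size(const volatile void *p, void *res, int size)
{
	switch (size) {
	case 1: *(__u8 *)res = *(volatile __u8 *)p; break;
	case 2: *(__u16 *)res = *(volatile __u16 *)p; break;
	case 4: *(__u32 *)res = *(volatile __u32 *)p; break;
	case 8: *(__u64 *)res = *(volatile __u64 *)p; break;
	default:
		barrier();
		__builtin_memcpy(res, (const void *)p, size);
		barrier();
	}
}

/* The union provides a non-const destination of the right size, so
 * typeof() no longer stamps 'const' onto the local copy. */
#define READ_ONCE(x) \
	({ union { typeof(x) __val; char __c[1]; } __u;	\
	   __read_once_size(&(x), __u.__c, sizeof(x));	\
	   __u.__val; })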
Commit 927609d622a3 ("kernel: tighten rules for ACCESS ONCE") results in
sparse warnings like "Using plain integer as NULL pointer", so let's add a
type cast to the dummy assignment.
To avoid warnings like "sparse: warning: cast to restricted __hc32" we also
use __force on that cast.
Fixes: 927609d622a3 ("kernel: tighten rules for ACCESS ONCE") Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Now that all non-scalar users of ACCESS_ONCE have been converted
to READ_ONCE or ASSIGN_ONCE, let's tighten ACCESS_ONCE to only
work on scalar types.
This variant was proposed by Alexei Starovoitov.
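The tightened macro, roughly in its final form (including the __force cast from the sparse fix above):

/* The dummy assignment of 0 only compiles for scalar types, so any
 * remaining non-scalar user now fails at build time. */
#define __ACCESS_ONCE(x) ({ \
	 __maybe_unused typeof(x) __var = (__force typeof(x)) 0; \
	(volatile typeof(x) *)&(x); })
#define ACCESS_ONCE(x) (*__ACCESS_ONCE(x))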
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com> Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
pmc_dbgfs_unregister() is called on the error path in pmc_dbgfs_register()
while pmc->dbgfs_dir is still unconditionally NULL. To prevent this, we
move the assignment to where it should be.
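A sketch of the fixed flow (the file-creation details are elided):

static int pmc_dbgfs_register(struct pmc_dev *pmc)
{
	struct dentry *dir;

	dir = debugfs_create_dir("pmc_atom", NULL);
	if (!dir)
		return -ENOMEM;

	/* Assign before creating the files: the error path below
	 * cleans up via pmc_dbgfs_unregister(), which removes
	 * pmc->dbgfs_dir and must not see it as NULL. */
	pmc->dbgfs_dir = dir;

	/* ... create the debugfs files, goto err on failure ... */

	return 0;
err:
	pmc_dbgfs_unregister(pmc);
	return -ENODEV;
}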
Fixes: f855911c1f48 (x86/pmc_atom: Expose PMC device state and platform sleep state) Reported-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Cc: Aubrey Li <aubrey.li@linux.intel.com> Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Cc: Kumar P. Mahesh <mahesh.kumar.p@intel.com> Link: http://lkml.kernel.org/r/1421253575-22509-2-git-send-email-andriy.shevchenko@linux.intel.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Commit b568b8601f05 ("Treat SCI interrupt as normal GSI interrupt")
accidentally removed support for the legacy PIC interrupt when fixing a
regression for Xen, which causes a nasty regression on the HP/Compaq
nc6000, where we fail to register the ACPI interrupt and thus lose
e.g. thermal notifications, leading to a potentially overheated
machine.
So reintroduce support of legacy PIC based ACPI SCI interrupt.
Reported-by: Ville Syrjälä <syrjala@sci.fi> Tested-by: Ville Syrjälä <syrjala@sci.fi> Signed-off-by: Jiang Liu <jiang.liu@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Pavel Machek <pavel@ucw.cz> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Len Brown <len.brown@intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Rafael J. Wysocki <rjw@rjwysocki.net> Cc: Sander Eikelenboom <linux@eikelenboom.it> Cc: linux-pm@vger.kernel.org Link: http://lkml.kernel.org/r/1424052673-22974-1-git-send-email-jiang.liu@linux.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Note that randomize_stack_top() declares the "random_variable" variable as
"unsigned int". Since the result of the shift operation between
STACK_RND_MASK (which is 0x3fffff on x86_64, i.e. 22 bits) and PAGE_SHIFT
(which is 12 on x86_64):
random_variable <<= PAGE_SHIFT;
can be up to 34 bits wide, the two leftmost bits are dropped when the
result is stored in the 32-bit "random_variable". The variable needs to be
at least 34 bits wide to hold the (22+12)-bit result.
These two dropped bits reduce the entropy of the process stack.
Concretely, the total stack entropy is reduced by a factor of four: from
2^30 to 2^28 (one fourth of the expected entropy).
This patch restores the entropy by correcting the types involved
in the operations in the functions randomize_stack_top() and
stack_maxrandom_size().
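A sketch of the corrected function (close to fs/binfmt_elf.c after the fix):

static unsigned long randomize_stack_top(unsigned long stack_top)
{
	unsigned long random_variable = 0;	/* was: unsigned int */

	if ((current->flags & PF_RANDOMIZE) &&
	    !(current->personality & ADDR_NO_RANDOMIZE)) {
		random_variable = (unsigned long) get_random_int();
		random_variable &= STACK_RND_MASK;
		/* bits 12..33 now fit, nothing is truncated */
		random_variable <<= PAGE_SHIFT;
	}
#ifdef CONFIG_STACK_GROWSUP
	return PAGE_ALIGN(stack_top) + random_variable;
#else
	return PAGE_ALIGN(stack_top) - random_variable;
#endif
}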
Andy pointed out that if an NMI or MCE is received while we're in the
middle of an EFI mixed mode call, a triple fault will occur. This can
happen, for example, when issuing an EFI mixed mode call while running
perf.
The reason for the triple fault is that we execute the mixed mode call
in 32-bit mode with paging disabled but with 64-bit kernel IDT handlers
installed throughout the call.
At Andy's suggestion, stop playing the games we currently do at runtime,
such as disabling paging and installing a 32-bit GDT for __KERNEL_CS. We
can simply switch to the __KERNEL32_CS descriptor before invoking
firmware services, and run in compatibility mode. This way, if an
NMI/MCE does occur the kernel IDT handler will execute correctly, since
it'll jump to __KERNEL_CS automatically.
However, this change is only possible post-ExitBootServices(). Before
then the firmware "owns" the machine and expects its 32-bit IDT
handlers to be left intact to service interrupts, etc.
So, we now need to distinguish between early boot and runtime
invocations of EFI services. During early boot, we need to restore the
GDT that the firmware expects to be present. We can only jump to the
__KERNEL32_CS code segment for mixed mode calls after ExitBootServices()
has been invoked.
A liberal sprinkling of comments in the thunking code should make the
differences in early and late environments more apparent.
Reported-by: Andy Lutomirski <luto@amacapital.net> Tested-by: Borislav Petkov <bp@suse.de> Signed-off-by: Matt Fleming <matt.fleming@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
When reading blkio.throttle.io_serviced in a recently created blkio
cgroup, it's possible to race against the creation of a throttle policy,
which delays the allocation of stats_cpu.
Like other functions in the throttle code, just checking for a NULL
stats_cpu prevents the following oops caused by that race.
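A sketch of the guard (names follow block/blk-throttle.c of that era; the exact function may differ):

static u64 tg_prfill_cpu_rwstat(struct seq_file *sf,
				struct blkg_policy_data *pd, int off)
{
	struct throtl_grp *tg = blkg_to_tg(pd_to_blkg(pd));

	/* stats_cpu is allocated after the policy is attached; a
	 * reader racing with that allocation just sees no stats yet. */
	if (!tg->stats_cpu)
		return 0;

	/* ... sum the per-cpu counters as before ... */
}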
We have a scenario where after the fsync log replay we can lose file data
that had been previously fsync'ed if we added a hard link for our inode
and after that we synced the fsync log (for example by fsync'ing some
other file or directory).
This is because when adding a hard link we updated the inode item in the
log tree with an i_size value of 0. At that point the new inode item was
in memory only and a subsequent fsync log replay would not make us lose
the file data. However if after adding the hard link we synced the log tree
to disk, by fsync'ing some other file or directory for example, we ended
up losing the file data after log replay, because the inode item in the
persisted log tree had an i_size of zero.
This is easy to reproduce, and the following excerpt from my test for
xfstests shows this:
# Create one file with data and fsync it.
# This made the btrfs fsync log persist the data and the inode metadata with
# a correct inode->i_size (4096 bytes).
$XFS_IO_PROG -f -c "pwrite -S 0xaa -b 4K 0 4K" -c "fsync" \
$SCRATCH_MNT/foo | _filter_xfs_io
# Now add one hard link to our file. This made the btrfs code update the fsync
# log, in memory only, with inode metadata having a size of 0.
ln $SCRATCH_MNT/foo $SCRATCH_MNT/foo_link
# Now force persistence of the fsync log to disk, for example, by fsyncing some
# other file.
touch $SCRATCH_MNT/bar
$XFS_IO_PROG -c "fsync" $SCRATCH_MNT/bar
# Before a power loss or crash, we could read the 4Kb of data from our file as
# expected.
echo "File content before:"
od -t x1 $SCRATCH_MNT/foo
# Simulate a crash/power loss.
_load_flakey_table $FLAKEY_DROP_WRITES
_unmount_flakey
# After the fsync log replay, because the fsync log had a value of 0 for our
# inode's i_size, we could no longer read the 4Kb of data that we previously
# wrote and fsync'ed. The size of the file became 0 after the fsync log replay.
echo "File content after:"
od -t x1 $SCRATCH_MNT/foo
Another alternative test, which doesn't need to fsync an inode in the same
transaction in which it was created, is:
# Create our test file with some data.
$XFS_IO_PROG -f -c "pwrite -S 0xaa -b 8K 0 8K" \
$SCRATCH_MNT/foo | _filter_xfs_io
# Make sure the file is durably persisted.
sync
# Append some data to our file, to increase its size.
$XFS_IO_PROG -f -c "pwrite -S 0xcc -b 4K 8K 4K" \
$SCRATCH_MNT/foo | _filter_xfs_io
# Fsync the file, so from this point on if a crash/power failure happens, our
# new data is guaranteed to be there next time the fs is mounted.
$XFS_IO_PROG -c "fsync" $SCRATCH_MNT/foo
# Add one hard link to our file. This made btrfs write into the in-memory fsync
# log a special inode with generation 0 and an i_size of 0 too. Note that this
# didn't update the inode in the fsync log on disk.
ln $SCRATCH_MNT/foo $SCRATCH_MNT/foo_link
# Now make sure the in memory fsync log is durably persisted.
# Creating and fsync'ing another file will do it.
touch $SCRATCH_MNT/bar
$XFS_IO_PROG -c "fsync" $SCRATCH_MNT/bar
# As expected, before the crash/power failure, we should be able to read the
# 12Kb of file data.
echo "File content before:"
od -t x1 $SCRATCH_MNT/foo
# Simulate a crash/power loss.
_load_flakey_table $FLAKEY_DROP_WRITES
_unmount_flakey
# After mounting the fs again, the fsync log was replayed.
# The btrfs fsync log replay code didn't update the i_size of the persisted
# inode because the inode item in the log had a special generation with a
# value of 0 (and it couldn't know the correct i_size, since that inode item
# had a 0 i_size too). This made the last 4Kb of file data inaccessible and
# effectively lost.
echo "File content after:"
od -t x1 $SCRATCH_MNT/foo
This isn't a new issue/regression. This problem has been around since the
log tree code was added in 2008:
If btrfs_find_item is called with a NULL path it allocates one locally but
does not free it. The affected paths are inserting an orphan item for a
file and for a subvol root.
Move the path allocation to the callers.
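A sketch of a caller after the change (key type per the orphan-item case):

	struct btrfs_path *path;

	path = btrfs_alloc_path();
	if (!path)
		return -ENOMEM;

	/* The caller owns the path now; btrfs_find_item() no longer
	 * allocates (and previously leaked) one internally. */
	ret = btrfs_find_item(root, path, objectid, offset,
			      BTRFS_ORPHAN_ITEM_KEY, NULL);
	btrfs_free_path(path);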
Fixes: 3f870c289900 ("btrfs: expand btrfs_find_item() to include find_orphan_item functionality") Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
It turns out it's possible to get __remove_osd() called twice on the
same OSD. That doesn't sit well with rb_erase() - depending on the
shape of the tree we can get a NULL dereference, a soft lockup or
a random crash at some point in the future as we end up touching freed
memory. One scenario that I was able to reproduce is as follows:
A case can be made that osd refcounting is imperfect and reworking it
would be a proper resolution, but for now Sage and I decided to fix
this by adding a safeguard around __remove_osd().
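A sketch of the safeguard (the rbtree helpers are the stock include/linux/rbtree.h ones):

static void __remove_osd(struct ceph_osd_client *osdc, struct ceph_osd *osd)
{
	/* A second call on the same OSD becomes a no-op instead of
	 * erasing an already-erased node and touching freed memory. */
	if (RB_EMPTY_NODE(&osd->o_node))
		return;

	rb_erase(&osd->o_node, &osdc->osds);
	RB_CLEAR_NODE(&osd->o_node);

	/* ... drop the reference, remove from the LRU, etc ... */
}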
Fixes: http://tracker.ceph.com/issues/8087 Cc: Sage Weil <sage@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Sage Weil <sage@redhat.com> Reviewed-by: Alex Elder <elder@linaro.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Since kernel 3.14 the backlight control has been broken on various Samsung
Atom based netbooks. This has been bisected and this problem happens since
commit b35684b8fa94 ("drm/i915: do full backlight setup at enable time")
This has been reported and discussed in detail here:
http://lists.freedesktop.org/archives/intel-gfx/2014-July/049395.html
Unfortunately no-one has been able to fix this. This only affects Samsung
Atom netbooks, and the Linux kernel and the BIOS of those laptops have never
worked well together. All affected laptops already have a quirk to avoid using
the standard acpi-video interface and instead use the samsung specific SABI
interface which samsung-laptop uses. It seems that recent fixes to the i915
driver have also broken backlight control through the SABI interface.
The intel_backlight driver OTOH works fine, and also allows for finer grained
backlight control. So add a new use_native_backlight quirk, and replace the
broken_acpi_video quirk with this quirk for affected models. This new quirk
disables acpi-video as before and also stops samsung-laptop from registering
the SABI based samsung_laptop backlight interface, leaving only the working
intel_backlight interface.
This commit enables this new quirk for 3 models which are known to be
affected; chances are that it needs to be used on other models too.
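The quirk itself is small; roughly (assuming the quirk_entry layout in drivers/platform/x86/samsung-laptop.c, with the DMI matches for the affected models pointing at it):

static struct quirk_entry quirk_use_native_backlight = {
	.broken_acpi_video	= true,
	.use_native_backlight	= true,
};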
When DRAM errors occur on memory controllers beyond EDAC_MAX_MCS (16),
the kernel fatally dereferences unallocated structures (see splat below);
this occurs on at least NumaConnect systems.
Fix by checking if a memory controller info structure was found.
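A sketch of the check (amd64_edac decoder path; the variable name is illustrative):

	struct mem_ctl_info *mci;

	mci = edac_mc_find(node_id);
	if (!mci)	/* controllers beyond EDAC_MAX_MCS have no mci */
		return;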
This causes an RCW cycle to be forced even when the array is degraded.
A degraded array cannot support RCW as that requires reading all data
blocks, and one may be missing.
Forcing an RCW when it is not possible causes a live-lock and the code
spins, repeatedly deciding to do something that cannot succeed.
So change the condition to only force RCW on non-degraded arrays.
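In pseudocode the corrected condition is simply (names illustrative, not the exact handle_stripe_dirtying() code):

	/* RCW needs to read every data block; with a failed device one
	 * of them may be missing, so never force RCW when degraded. */
	if (rcw_preferred && conf->mddev->degraded == 0)
		force_rcw = 1;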
Commit f6edb53c4993ffe92ce521fb449d1c146cea6ec2 converted the probe to
a CPU-wide event first (pid == -1). For kernels that do not support
the PERF_FLAG_FD_CLOEXEC flag the probe fails with EINVAL. Since this
errno is not handled, pid is not reset to 0, and the subsequent use of
pid = -1 as an argument brings in an additional failure path if
perf_event_paranoid > 0:
$ perf record -- sleep 1
perf_event_open(..., 0) failed unexpectedly with error 13 (Permission denied)
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.007 MB /tmp/perf.data (11 samples) ]
Also, ensure the fd of the confirmation check is closed and comment why
pid = -1 is used.
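A sketch of the fixed probe (tools/perf; names approximate):

static int perf_flag_probe(void)
{
	struct perf_event_attr attr = {
		.type	= PERF_TYPE_SOFTWARE,
		.config	= PERF_COUNT_SW_CPU_CLOCK,
	};
	int fd, err;
	pid_t pid = -1;		/* CPU-wide event first */

	fd = sys_perf_event_open(&attr, pid, 0, -1, PERF_FLAG_FD_CLOEXEC);
	if (fd >= 0) {
		close(fd);
		return 1;
	}
	err = errno;

	/* EINVAL: kernel predates the flag. EACCES: pid == -1 needs
	 * privileges when perf_event_paranoid > 0. Either way, fall
	 * back to a per-process event on ourselves. */
	if (err == EINVAL || err == EACCES)
		pid = 0;

	fd = sys_perf_event_open(&attr, pid, -1, -1, 0);
	if (fd >= 0) {
		close(fd);	/* don't leak the confirmation fd */
		return 0;
	}
	return -1;
}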
Needs to go to 3.18 stable tree as well.
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com> Based-on-patch-by: David Ahern <david.ahern@oracle.com> Acked-by: David Ahern <david.ahern@oracle.com> Cc: David Ahern <dsahern@gmail.com> Link: http://lkml.kernel.org/r/54EC610C.8000403@intel.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
We have two race conditions in the probe code which could lead to a null
pointer dereference in the interrupt handler.
The interrupt handler accesses the clockevent device, which may not yet be
registered.
The first race condition happens when the interrupt handler gets registered
before the interrupts get disabled. The second race condition happens when
the interrupts get enabled, but the clockevent device is not yet registered.
Fix that by disabling the interrupts before we register the interrupt
handler, and enabling the interrupts only after the clockevent device has
been registered.
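The resulting probe ordering, as a sketch (the IRQ-enable register write is hypothetical, standing in for whatever the timer hardware uses):

	/* 1. Mask the timer interrupt before the handler can fire. */
	writel(0, timer_base + TIMER_IRQ_EN);

	/* 2. Registering the handler is now safe: the source is masked
	 *    and cannot fire before the clockevent device exists. */
	ret = request_irq(irq, timer_interrupt, IRQF_TIMER, "timer", &clkevt);
	if (ret)
		return ret;

	/* 3. Register the clockevent device the handler dereferences. */
	clockevents_config_and_register(&clkevt, rate, 3, 0xffffffff);

	/* 4. Only now unmask the interrupt. */
	writel(1, timer_base + TIMER_IRQ_EN);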
Reported-by: Gongbae Park <yongbae2@gmail.com> Signed-off-by: Matthias Brugger <matthias.bgg@gmail.com> Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The KSTK_EIP() and KSTK_ESP() macros should return the user program
counter (PC) and stack pointer (A0StP) of the given task. These are used
to determine which VMA corresponds to the user stack in
/proc/<pid>/maps, and for the user PC & A0StP in /proc/<pid>/stat.
However for Meta the PC & A0StP from the task's kernel context are used,
resulting in broken output. For example, in the following /proc/<pid>/maps
output, the 3afff000-3b021000 VMA should be described as the stack:
And in the following /proc/<pid>/stat output, the PC is in kernel code
(1074234964 = 0x40078654) and the A0StP is in the kernel heap
(1335981392 = 0x4fa17550):
Fix the definitions of KSTK_EIP() and KSTK_ESP() to use
task_pt_regs(tsk)->ctx rather than (tsk)->thread.kernel_context. This
gets the registers from the user context stored after the thread info at
the base of the kernel stack, i.e. from the last entry into the kernel
from userland, regardless of where in the kernel the task may have been
interrupted, and results in the following more correct
/proc/<pid>/maps output:
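The fixed definitions themselves are short (ctx field names per arch/metag; a sketch):

#define KSTK_EIP(tsk)	(task_pt_regs(tsk)->ctx.CurrPC)
#define KSTK_ESP(tsk)	(task_pt_regs(tsk)->ctx.AX[0].U0)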
For filesystems without a separate project quota inode field in the
superblock, we just reuse the project quota file for group quotas (and
vice versa) if the project quota file is allocated and we need a group
quota file. When we reuse the file, quota structures on disk suddenly
have the wrong type stored in d_flags, though. Nobody really cares about
this (although the structure type reported to userspace was wrong as well)
except that after commit 14bf61ffe6ac (quota: Switch ->get_dqblk() and
->set_dqblk() to use bytes as space units) an assertion in
xfs_qm_scall_getquota() started to trigger on the xfs/106 test (apparently
I was testing without XFS_DEBUG so I didn't notice when submitting the
above commit).
Fix the problem by properly resetting ddq->d_flags when running quotacheck
for a quota file.
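The reset itself is essentially a one-liner in the quotacheck path, roughly:

	/* Stamp the dquot with the type it is being used as in this
	 * mount, not whatever type the reused file last stored. */
	ddq->d_flags = type;	/* XFS_DQ_USER, XFS_DQ_PROJ or XFS_DQ_GROUP */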
Reported-by: Al Viro <viro@ZenIV.linux.org.uk> Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The gpio_chip operations receive a pointer to the gpio_chip struct which is
contained in the driver's private struct, yet the container_of calls in those
functions point to the mfd struct defined in include/linux/mfd/tps65912.h.
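A sketch of the corrected pattern (the private struct name is illustrative):

struct tps65912_gpio_data {
	struct tps65912 *tps65912;
	struct gpio_chip gpio_chip;
};

static int tps65912_gpio_get(struct gpio_chip *gc, unsigned offset)
{
	/* Walk back to the structure that actually embeds gc; pointing
	 * container_of() at struct tps65912 computed a bogus address. */
	struct tps65912_gpio_data *gpio =
		container_of(gc, struct tps65912_gpio_data, gpio_chip);

	/* ... use gpio->tps65912 for the register access ... */
}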
The change in commit 7b8792bbdffd ("gpiolib: of: Correct error handling in
of_get_named_gpiod_flags") assumed that only one gpio-chip is registered
per of-node. Some drivers register more than one chip per of-node, so
adjust the matching function of_gpiochip_find_and_xlate to not stop
looking for chips if a node match is found and the translation fails.
Fixes: 7b8792bbdffd ("gpiolib: of: Correct error handling in of_get_named_gpiod_flags") Signed-off-by: Hans Holmberg <hans.holmberg@intel.com> Acked-by: Alexandre Courbot <acourbot@nvidia.com> Tested-by: Robert Jarzmik <robert.jarzmik@free.fr> Tested-by: Tyler Hall <tylerwhall@gmail.com> Signed-off-by: Linus Walleij <linus.walleij@linaro.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The native (64-bit) sigval_t union contains sival_int (32-bit) and
sival_ptr (64-bit). When a compat application invokes a syscall that
takes a sigval_t value (as part of a larger structure, e.g.
compat_sys_mq_notify, compat_sys_timer_create), the compat_sigval_t
union is converted to the native sigval_t with sival_int overlapping
with either the least or the most significant half of sival_ptr,
depending on endianness. When the corresponding signal is delivered to a
compat application, on big endian the current (compat_uptr_t)sival_ptr
cast always returns 0 since sival_int corresponds to the top part of
sival_ptr. This patch fixes copy_siginfo_to_user32() so that sival_int
is copied to the compat_siginfo_t structure.
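A sketch of the overlap and the fix (union layout as in the generic compat definitions):

typedef union compat_sigval {
	compat_int_t	sival_int;
	compat_uptr_t	sival_ptr;
} compat_sigval_t;

/* On big endian, the 32-bit sival_int overlays the most significant
 * half of the 64-bit native sival_ptr, so truncating the pointer
 * yields 0. Copy the integer view instead: */
err |= __put_user(from->si_int, &to->si_int);
/* was: __put_user((compat_uptr_t)(unsigned long)from->si_ptr, &to->si_ptr); */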
Since the removal of the CONFIG_REGULATOR_DUMMY option, the touchscreen
stopped working. This patch enables the "replacement" for REGULATOR_DUMMY
and allows the touchscreen to work even though there is no regulator for
"vcc".
Signed-off-by: Martin Vajnar <martin.vajnar@gmail.com> Signed-off-by: Robert Jarzmik <robert.jarzmik@free.fr> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>