The check was already in place in the dp mode_valid check, but
radeon_dp_get_dp_link_clock() never returned the high clock
mode_valid was checking for because that function clipped the
clock based on the hw capabilities. Add an explicit check
in the mode_valid function.
Check that the ring we are using for copies is functional
rather than the GFX ring. On newer asics we use the DMA
ring for bo moves.
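A minimal sketch of such a check (the field names asic->copy.copy_ring_index and ring->ready are assumptions for illustration, not quotes from the patch):

	struct radeon_ring *ring = &rdev->ring[rdev->asic->copy.copy_ring_index];

	/* validate the ring that actually performs bo moves, not GFX */
	if (!ring->ready)
		return -EINVAL; /* or fall back to CPU copies */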
Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Andrew Morton wrote:
> On Wed, 12 Nov 2014 13:08:55 +0900 Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> wrote:
>
> > Andrew Morton wrote:
> > > Poor ttm guys - this is a bit of a trap we set for them.
> >
> > Commit a91576d7916f6cce ("drm/ttm: Pass GFP flags in order to avoid deadlock.")
> > changed to use sc->gfp_mask rather than GFP_KERNEL.
> >
> > - pages_to_free = kmalloc(npages_to_free * sizeof(struct page *),
> > - GFP_KERNEL);
> > + pages_to_free = kmalloc(npages_to_free * sizeof(struct page *), gfp);
> >
> > But this bug is caused by sc->gfp_mask containing some flags which are not
> > in GFP_KERNEL, right? Then, I think
> >
> > - pages_to_free = kmalloc(npages_to_free * sizeof(struct page *), gfp);
> > + pages_to_free = kmalloc(npages_to_free * sizeof(struct page *), gfp & GFP_KERNEL);
> >
> > would hide this bug.
> >
> > But I think we should use GFP_ATOMIC (or drop __GFP_WAIT flag)
>
> Well no - ttm_page_pool_free() should stop calling kmalloc altogether.
> Just do
>
> struct page *pages_to_free[16];
>
> and rework the code to free 16 pages at a time. Easy.
Well, ttm code wants to process 512 pages at a time for performance.
The memory footprint added by a static 512 * sizeof(struct page *) buffer is
only 4096 bytes. What about using a static buffer like below?
----------
From d3cb5393c9c8099d6b37e769f78c31af1541fe8c Mon Sep 17 00:00:00 2001
From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Date: Thu, 13 Nov 2014 22:21:54 +0900
Subject: drm/ttm: Avoid memory allocation from shrinker functions.
Commit a91576d7916f6cce ("drm/ttm: Pass GFP flags in order to avoid
deadlock.") caused BUG_ON() due to sc->gfp_mask containing flags
which are not in GFP_KERNEL.
https://bugzilla.kernel.org/show_bug.cgi?id=87891
Changing from sc->gfp_mask to (sc->gfp_mask & GFP_KERNEL) would
avoid the BUG_ON(), but avoiding memory allocation from the shrinker
function is a better and more reliable fix.
The shrinker function is already serialized by a global lock, and the
cleanup function is called after the shrinker function is unregistered.
Thus, we can use a static buffer when called from the shrinker
function and the cleanup function.
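A condensed sketch of that approach (illustrative, not the literal patch; the use_static flag and sizing are assumptions):

	static struct page *static_buf[NUM_PAGES_TO_ALLOC];

	static int ttm_page_pool_free(struct ttm_page_pool *pool, unsigned nr_free,
				      bool use_static)
	{
		struct page **pages_to_free;
		unsigned npages_to_free = min(nr_free, (unsigned)NUM_PAGES_TO_ALLOC);

		/* use_static may only be true for callers serialized by the
		 * shrinker's global lock, or for cleanup after unregistration */
		if (use_static)
			pages_to_free = static_buf;
		else
			pages_to_free = kmalloc(npages_to_free * sizeof(struct page *),
						GFP_KERNEL);
		/* ... free up to npages_to_free pages per iteration as before,
		 * and kfree(pages_to_free) only when !use_static ... */
	}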
The commit "vmwgfx: Rework fence event action" introduced a number of bugs
that are fixed with this commit:
a) A forgotten return statement.
b) An if statement with identical branches.
Reported-by: Rob Clark <robdclark@gmail.com> Signed-off-by: Thomas Hellstrom <thellstrom@vmware.com> Reviewed-by: Jakob Bornecrantz <jakob@vmware.com> Reviewed-by: Sinclair Yeh <syeh@vmware.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
There is a conflict seen when requesting the kernel to reserve
the physical space used for the stolen area. This is because
some BIOSes wrap the stolen area in the root PCI bus, but have
an off-by-one error. As a workaround we retry the reservation with an
offset of 1 instead of 0.
v2: updated commit message & the comment in source file (Daniel)
Signed-off-by: Akash Goel <akash.goel@intel.com> Reviewed-by: Jesse Barnes <jbarnes@virtuousgeek.org> Tested-by: Arjan van de Ven <arjan@linux.intel.com> Signed-off-by: Jani Nikula <jani.nikula@intel.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
ACPI maintains cache of ioremap regions to speed up operations and
access to them from irq context where ioremap() calls aren't allowed.
This code abuses synchronize_rcu() on unmap path for synchronization
with fast-path in acpi_os_read/write_memory which uses this cache.
Since v3.10 CPUs are allowed to enter idle state even if they have RCU
callbacks queued, see commit c0f4dfd4f90f1667d234d21f15153ea09a2eaa66
("rcu: Make RCU_FAST_NO_HZ take advantage of numbered callbacks").
That change caused problems with nvidia proprietary driver which calls
acpi_os_map/unmap_generic_address several times during initialization.
Each unmap calls synchronize_rcu and adds a significant delay. In total,
initialization is slowed by a couple of seconds, which is enough to
trigger a timeout in the hardware: the GPU decides it has "fallen off the
bus". A widespread workaround is reducing "rcu_idle_gp_delay" from 4 to 1 jiffy.
This patch replaces synchronize_rcu() with synchronize_rcu_expedited()
which is much faster.
Link: https://devtalk.nvidia.com/default/topic/567297/linux/linux-3-10-driver-crash/ Signed-off-by: Konstantin Khlebnikov <koct9i@gmail.com> Reported-and-tested-by: Alexander Monakov <amonakov@gmail.com> Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
We could be reading 8 bytes into a 4 byte buffer here. It seems
harmless but adding a check is the right thing to do and it silences a
static checker warning.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Acked-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
When VXLAN encapsulated traffic is received from a similarly
configured peer, the above warning is generated in the receive
processing of the encapsulated packet. Note that the warning is
associated with the container eth0.
The skbs from sky2 have ip_summed set to CHECKSUM_COMPLETE, and
because the packet is an encapsulated Ethernet frame, the checksum
generated by the hardware includes the inner protocol and Ethernet
headers.
The receive code is careful to update the skb->csum, except in
__dev_forward_skb, as called by dev_forward_skb. __dev_forward_skb
calls eth_type_trans, which in turn calls skb_pull_inline(skb, ETH_HLEN)
to skip over the Ethernet header, but does not update skb->csum when
doing so.
This patch resolves the problem by adding a call to
skb_postpull_rcsum to update the skb->csum after the call to
eth_type_trans.
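The change amounts to roughly this (a sketch of the idea, not the exact hunk):

	skb->protocol = eth_type_trans(skb, dev);
	/* eth_type_trans() pulled ETH_HLEN bytes; fold them out of a
	 * CHECKSUM_COMPLETE value so skb->csum stays consistent */
	skb_postpull_rcsum(skb, eth_hdr(skb), ETH_HLEN);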
Signed-off-by: Jay Vosburgh <jay.vosburgh@canonical.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
The hardware verifies the IP & tcp/udp header checksums but does not provide
the payload checksum, so use CHECKSUM_UNNECESSARY. Set it only if it is a
valid IP tcp/udp packet.
Cc: Jiri Benc <jbenc@redhat.com> Cc: Stefan Assmann <sassmann@redhat.com> Reported-by: Sunil Choudhary <schoudha@redhat.com> Signed-off-by: Govindarajulu Varadarajan <_govind@gmx.com> Reviewed-by: Jiri Benc <jbenc@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
This patch is fixing a race condition that may cause setting
count_pending to -1, which results in unwanted big bulk of arp messages
(in case of "notify peers").
Consider the following scenario:

count_pending == 2
    CPU0                          CPU1
                                  team_notify_peers_work
                                    atomic_dec_and_test (dec count_pending to 1)
                                    schedule_delayed_work
    team_notify_peers
      atomic_add (adding 1 to count_pending)
                                  team_notify_peers_work
                                    atomic_dec_and_test (dec count_pending to 1)
                                    schedule_delayed_work
                                  team_notify_peers_work
                                    atomic_dec_and_test (dec count_pending to 0)
                                    schedule_delayed_work
                                  team_notify_peers_work
                                    atomic_dec_and_test (dec count_pending to -1)
Fix this race by using atomic_dec_if_positive - that will prevent
count_pending from going below 0.
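atomic_dec_if_positive() performs the decrement only when the result would not be negative, so the rescheduling check can be written along these lines (a sketch; the struct layout is assumed):

	/* never lets count_pending drop below zero */
	if (atomic_dec_if_positive(&team->notify_peers.count_pending) > 0)
		schedule_delayed_work(&team->notify_peers.dw,
				      msecs_to_jiffies(team->notify_peers.interval));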
Fixes: fc423ff00df3a1955441 ("team: add peer notification") Fixes: 492b200efdd20b8fcfd ("team: add support for sending multicast rejoins") Signed-off-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: Jiri Benc <jbenc@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Commit d75b1ade567f ("net: less interrupt masking in NAPI") uncovered
wrong alx_poll() behavior.
A NAPI poll() handler is supposed to return exactly the budget when/if
napi_complete() has not been called.
It is also supposed to return number of frames that were received, so
that netdev_budget can have a meaning.
Also, in case of TX pressure, we still have to dequeue received
packets: alx_clean_rx_irq() has to be called even if
alx_clean_tx_irq(alx) returns false; otherwise the device is effectively
half duplex.
Signed-off-by: Eric Dumazet <edumazet@google.com> Fixes: d75b1ade567f ("net: less interrupt masking in NAPI") Reported-by: Oded Gabbay <oded.gabbay@amd.com> Bisected-by: Oded Gabbay <oded.gabbay@amd.com> Tested-by: Oded Gabbay <oded.gabbay@amd.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Thomas Jarosch reported IPsec TCP stalls when a PMTU event occurs.
In fact the problem was completely unrelated to IPsec. The bug is
also reproducible if you just disable TSO/GSO.
The problem is that when the MSS goes down, existing queued packet
on the TX queue that have not been transmitted yet all look like
TSO packets and get treated as such.
This then triggers a bug where tcp_mss_split_point tells us to
generate a zero-sized packet on the TX queue. Once that happens
we're screwed because the zero-sized packet can never be removed
by ACKs.
Fixes: 1485348d242 ("tcp: Apply device TSO segment limit earlier") Reported-by: Thomas Jarosch <thomas.jarosch@intra2net.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
skb_scrub_packet() is called when a packet switches between a context
such as between underlay and overlay, between namespaces, or between
L3 subnets.
While we already scrub the packet mark, connection tracking entry,
and cached destination, the security mark/context is left intact.
It seems wrong to inherit the security context of a packet when going
from overlay to underlay or across forwarding paths.
Signed-off-by: Thomas Graf <tgraf@suug.ch> Acked-by: Flavio Leitner <fbl@sysclose.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
When vlan tags are stacked, it is very likely that the outer tag is stored
in skb->vlan_tci and skb->protocol shows the inner tag's vlan_proto.
Currently netif_skb_features() first looks at skb->protocol even if there
is the outer tag in vlan_tci, thus it incorrectly retrieves the protocol
encapsulated by the inner vlan instead of the inner vlan protocol.
This allows GSO packets to be passed to HW and they end up being
corrupted.
Fixes: 58e998c6d239 ("offloading: Force software GSO for multiple vlan tags.") Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
During driver load in tg3_init_one, if the driver detects DMA activity before
initializing the chip, tg3_halt is called. As part of tg3_halt, interrupts are
disabled using the routine tg3_disable_ints. This routine was using a mailbox
value which was not initialized (default value is 0). As a result the driver
was writing 0x00000001 to pci config space register 0, which is the vendor
id / device id.
This driver bug was exposed because of the commit a7877b17a667 (PCI: Check only
the Vendor ID to identify Configuration Request Retry). Also this issue is only
seen in older generation chipsets like 5722 because config space write to offset
0 from driver is possible. The newer generation chips ignore writes to offset 0.
Also without commit a7877b17a667, for these older chips when a GRC reset is
issued the Bootcode would reprogram the vendor id/device id, which is the reason
this bug was masked earlier.
Fixed by initializing the interrupt mailbox registers before calling tg3_halt.
Please queue for -stable.
Reported-by: Nils Holland <nholland@tisys.org> Reported-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: Prashant Sreedharan <prashant@broadcom.com> Signed-off-by: Michael Chan <mchan@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Resolve conflicts between glibc definition of IPV6 socket options
and those defined in Linux headers. Looks like earlier efforts to
solve this did not cover all the definitions.
It resolves warnings during iproute2 build.
Please consider for stable as well.
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Each mmap Netlink frame contains a status field which indicates
whether the frame is unused, reserved, contains data or needs to
be skipped. Both loads and stores may not be reordered and must
complete before the status field is changed and another CPU might
pick up the frame for use. Use an smp_mb() to cover needs of both
types of callers to netlink_set_status(), callers which have been
reading data from the frame, and callers which have been
filling or releasing and thus writing to the frame.
- Example code path requiring a smp_rmb():
memcpy(skb->data, (void *)hdr + NL_MMAP_HDRLEN, hdr->nm_len);
netlink_set_status(hdr, NL_MMAP_STATUS_UNUSED);
- Example code path requiring a smp_wmb():
hdr->nm_uid = from_kuid(sk_user_ns(sk), NETLINK_CB(skb).creds.uid);
hdr->nm_gid = from_kgid(sk_user_ns(sk), NETLINK_CB(skb).creds.gid);
netlink_frame_flush_dcache(hdr);
netlink_set_status(hdr, NL_MMAP_STATUS_VALID);
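With the full barrier, netlink_set_status() looks roughly like this (a sketch using the helpers named above):

	static void netlink_set_status(struct nl_mmap_hdr *hdr,
				       enum nl_mmap_status status)
	{
		/* order all prior loads and stores to the frame before
		 * publishing it to other CPUs via nm_status */
		smp_mb();
		hdr->nm_status = status;
		flush_dcache_page(pgvec_to_page(hdr));
	}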
Fixes: f9c228 ("netlink: implement memory mapped recvmsg()") Reported-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Checking the file f_count and the nlk->mapped count is not completely
sufficient to prevent the mmap'd area contents from changing from
under us during netlink mmap sendmsg() operations.
Be careful to sample the header's length field only once, because this
could change from under us as well.
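In sketch form (variable names illustrative):

	/* read the length exactly once; the mmap'd frame can change under us */
	unsigned int nm_len = ACCESS_ONCE(hdr->nm_len);

	if (nm_len > maxlen)
		return -EINVAL;
	/* all subsequent copying and validation must use nm_len and
	 * never re-read hdr->nm_len */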
Fixes: 5fd96123ee19 ("netlink: implement memory mapped sendmsg()") Signed-off-by: David S. Miller <davem@davemloft.net> Acked-by: Daniel Borkmann <dborkman@redhat.com> Acked-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Commit fee7e49d4514 ("mm: propagate error from stack expansion even for
guard page") made sure that we return the error properly for stack
growth conditions. It also theorized that counting the guard page
towards the stack limit might break something, but also said "Let's see
if anybody notices".
Somebody did notice. Apparently android-x86 sets the stack limit very
close to the limit indeed, and including the guard page in the rlimit
check causes the android 'zygote' process problems.
So this adds the (fairly trivial) code to make the stack rlimit check be
against the actual real stack size, rather than the size of the vma that
includes the guard page.
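The check then becomes, in sketch form (names illustrative):

	/* the rlimit applies to the real stack, excluding the guard page */
	actual_size = size;
	if (size && (vma->vm_flags & (VM_GROWSUP | VM_GROWSDOWN)))
		actual_size -= PAGE_SIZE;
	if (actual_size > ACCESS_ONCE(rlim[RLIMIT_STACK].rlim_cur))
		return -ENOMEM;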
Jay Foad reports that the address sanitizer test (asan) sometimes gets
confused by a stack pointer that ends up being outside the stack vma
that is reported by /proc/maps.
This happens due to an interaction between RLIMIT_STACK and the guard
page: when we do the guard page check, we ignore the potential error
from the stack expansion, which effectively results in a missing guard
page, since the expected stack expansion won't have been done.
And since /proc/maps explicitly ignores the guard page (commit d7824370e263: "mm: fix up some user-visible effects of the stack guard
page"), the stack pointer ends up being outside the reported stack area.
This is the minimal patch: it just propagates the error. It also
effectively makes the guard page part of the stack limit, which in turn
means that the actual real stack is one page less than the stack limit.
Let's see if anybody notices. We could teach acct_stack_growth() to
allow an extra page for a grow-up/grow-down stack in the rlimit test,
but I don't want to add more complexity if it isn't needed.
Charles Shirron and Paul Cassella from Cray Inc have reported kswapd
stuck in a busy loop with nothing left to balance, but
kswapd_try_to_sleep() failing to sleep. Their analysis found the cause
to be a combination of several factors:
1. A process is waiting in throttle_direct_reclaim() on pgdat->pfmemalloc_wait
2. The process has been killed (by OOM in this case), but has not yet been
scheduled to remove itself from the waitqueue and die.
3. kswapd checks for throttled processes in prepare_kswapd_sleep():
if (waitqueue_active(&pgdat->pfmemalloc_wait)) {
wake_up(&pgdat->pfmemalloc_wait);
return false; // kswapd will not go to sleep
}
However, for a process that was already killed, wake_up() does not remove
the process from the waitqueue, since try_to_wake_up() checks its state
first and returns false when the process is no longer waiting.
4. kswapd is running on the same CPU as the only CPU that the process is
allowed to run on (through cpus_allowed, or possibly single-cpu system).
5. CONFIG_PREEMPT_NONE=y kernel is used. If there's nothing to balance, kswapd
encounters no voluntary preemption points and repeatedly fails
prepare_kswapd_sleep(), blocking the process from running and removing
itself from the waitqueue, which would let kswapd sleep.
So, the source of the problem is that we prevent kswapd from going to
sleep while there are processes waiting on the pfmemalloc_wait queue,
and a process waiting on a queue is guaranteed to be removed from the
queue only when it gets scheduled. This was done to make sure that no
process is left sleeping on pfmemalloc_wait when kswapd itself goes to
sleep.
However, it isn't necessary to postpone kswapd sleep until the
pfmemalloc_wait queue actually empties. To prevent processes from being
left sleeping, it's actually enough to guarantee that all processes
waiting on pfmemalloc_wait queue have been woken up by the time we put
kswapd to sleep.
This patch therefore fixes this issue by substituting 'wake_up' with
'wake_up_all' and removing 'return false' in the code snippet from
prepare_kswapd_sleep() above. Note that if any process puts itself in
the queue after this waitqueue_active() check, or after the wake up
itself, it means that the process will also wake up kswapd - and since
we are under prepare_to_wait(), the wake up won't be missed. Also we
update the comment in prepare_kswapd_sleep() to hopefully more clearly
describe the races it is preventing.
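After the change, the snippet above reduces to something like:

	if (waitqueue_active(&pgdat->pfmemalloc_wait))
		wake_up_all(&pgdat->pfmemalloc_wait);
	/* no 'return false' here: kswapd may still go to sleep */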
Fixes: 5515061d22f0 ("mm: throttle direct reclaimers if PF_MEMALLOC reserves are low and swap is backed by network storage") Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.cz> Acked-by: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Sleep in atomic context happened on Trats2 board after inserting or
removing SD card because mmc_gpio_get_cd() was called under spin lock.
Fix this by moving card detection earlier, before acquiring spin lock.
The mmc_gpio_get_cd() call does not have to be protected by spin lock
because it does not access any sdhci internal data.
The sdhci_do_get_cd() call accesses host flags (SDHCI_DEVICE_DEAD). After
moving it outside of the spin lock it could theoretically race with driver
removal, but there is still no actual protection against manual card
eject.
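A sketch of the reordering in the request path (field names assumed):

	/* mmc_gpio_get_cd() may sleep: call it before taking the lock */
	present = mmc_gpio_get_cd(host->mmc);

	spin_lock_irqsave(&host->lock, flags);
	/* the rest of the request handling stays under the spinlock */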
Dmesg after inserting SD card:
[ 41.663414] BUG: sleeping function called from invalid context at drivers/gpio/gpiolib.c:1511
[ 41.670469] in_atomic(): 1, irqs_disabled(): 128, pid: 30, name: kworker/u8:1
[ 41.677580] INFO: lockdep is turned off.
[ 41.681486] irq event stamp: 61972
[ 41.684872] hardirqs last enabled at (61971): [<c0490ee0>] _raw_spin_unlock_irq+0x24/0x5c
[ 41.693118] hardirqs last disabled at (61972): [<c04907ac>] _raw_spin_lock_irq+0x18/0x54
[ 41.701190] softirqs last enabled at (61648): [<c0026fd4>] __do_softirq+0x234/0x2c8
[ 41.708914] softirqs last disabled at (61631): [<c00273a0>] irq_exit+0xd0/0x114
[ 41.716206] Preemption disabled at:[< (null)>] (null)
[ 41.721500]
[ 41.722985] CPU: 3 PID: 30 Comm: kworker/u8:1 Tainted: G W 3.18.0-rc5-next-20141121 #883
[ 41.732111] Workqueue: kmmcd mmc_rescan
[ 41.735945] [<c0014d2c>] (unwind_backtrace) from [<c0011c80>] (show_stack+0x10/0x14)
[ 41.743661] [<c0011c80>] (show_stack) from [<c0489d14>] (dump_stack+0x70/0xbc)
[ 41.750867] [<c0489d14>] (dump_stack) from [<c0228b74>] (gpiod_get_raw_value_cansleep+0x18/0x30)
[ 41.759628] [<c0228b74>] (gpiod_get_raw_value_cansleep) from [<c03646e8>] (mmc_gpio_get_cd+0x38/0x58)
[ 41.768821] [<c03646e8>] (mmc_gpio_get_cd) from [<c036d378>] (sdhci_request+0x50/0x1a4)
[ 41.776808] [<c036d378>] (sdhci_request) from [<c0357934>] (mmc_start_request+0x138/0x268)
[ 41.785051] [<c0357934>] (mmc_start_request) from [<c0357cc8>] (mmc_wait_for_req+0x58/0x1a0)
[ 41.793469] [<c0357cc8>] (mmc_wait_for_req) from [<c0357e68>] (mmc_wait_for_cmd+0x58/0x78)
[ 41.801714] [<c0357e68>] (mmc_wait_for_cmd) from [<c0361c00>] (mmc_io_rw_direct_host+0x98/0x124)
[ 41.810480] [<c0361c00>] (mmc_io_rw_direct_host) from [<c03620f8>] (sdio_reset+0x2c/0x64)
[ 41.818641] [<c03620f8>] (sdio_reset) from [<c035a3d8>] (mmc_rescan+0x254/0x2e4)
[ 41.826028] [<c035a3d8>] (mmc_rescan) from [<c003a0e0>] (process_one_work+0x180/0x3f4)
[ 41.833920] [<c003a0e0>] (process_one_work) from [<c003a3bc>] (worker_thread+0x34/0x4b0)
[ 41.841991] [<c003a3bc>] (worker_thread) from [<c003fed8>] (kthread+0xe4/0x104)
[ 41.849285] [<c003fed8>] (kthread) from [<c000f268>] (ret_from_fork+0x14/0x2c)
[ 42.038276] mmc0: new high speed SDHC card at address 1234
When used via spidev with more than one message to transfer via
SPI_IOC_MESSAGE, the current implementation would return with
-EINVAL, since bits_per_word and speed_hz are set in all
transfer structs. And in the 2nd loop, status will stay at
-EINVAL as it is not overwritten again via fsl_spi_setup_transfer().
This patch changes this behaviour by first checking whether one of
the messages uses different settings. If this is the case,
the function will return with -EINVAL. If not, the messages
are transferred correctly.
Signed-off-by: Stefan Roese <sr@denx.de> Signed-off-by: Mark Brown <broonie@linaro.org> Cc: Esben Haabendal <esbenhaabendal@gmail.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Linus reported perf report command being interrupted due to processing
of 'out of order' event, with following error:
Timestamp below last timeslice flush
0x5733a8 [0x28]: failed to process type: 3
I could reproduce the issue and in my case it was caused by one CPU
(mmap) being behind during record and userspace mmap reader seeing the
data after other CPUs data were already stored.
This is expected under some circumstances because we need to limit the
number of events that we queue for reordering when we receive a
PERF_RECORD_FINISHED_ROUND or when we force flush due to memory
pressure.
Reported-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Jiri Olsa <jolsa@kernel.org> Acked-by: Ingo Molnar <mingo@kernel.org> Cc: Andi Kleen <ak@linux.intel.com> Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com> Cc: David Ahern <dsahern@gmail.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Matt Fleming <matt.fleming@intel.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Stephane Eranian <eranian@google.com> Link: http://lkml.kernel.org/r/1417016371-30249-1-git-send-email-jolsa@kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
[zhangzhiqiang: backport to 3.10:
- adjust context
- in commit f61ff6c06d, struct events_stats was defined in tools/perf/util/event.h,
while in 3.10 stable it is defined in tools/perf/util/hist.h.
- 3.10 stable there is no pr_oe_time() which used for debug.
- After the above adjustments, becomes same to the original patch:
https://github.com/torvalds/linux/commit/f61ff6c06dc8f32c7036013ad802c899ec590607
] Signed-off-by: Zhiqiang Zhang <zhangzhiqiang.zhang@huawei.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
forces group members to change their event->cpu,
if the currently-opened-event's PMU changed the cpu
and there is a group move.
The above behaviour causes problems for breakpoint events,
which use event->cpu to touch cpu-specific data for
breakpoint accounting. By changing event->cpu, some
breakpoint slots were wrongly accounted for a given
cpu.
Vince's perf fuzzer hit this issue and caused the following
WARN on my setup:
WARNING: CPU: 0 PID: 20214 at arch/x86/kernel/hw_breakpoint.c:119 arch_install_hw_breakpoint+0x142/0x150()
Can't find any breakpoint slot
[...]
This patch changes the group moving code to keep the event's
original cpu.
The uncore_collect_events function assumes that an event group
might contain only uncore events, which is wrong, because it
might contain any type of events.
This bug leads to the uncore framework touching non-uncore events,
which could end up causing all sorts of bugs.
One was triggered by Vince's perf fuzzer, when the uncore code
touched breakpoint event private event space as if it were an uncore
event and caused a BUG:
The code in uncore_assign_events() function was looking for
event->hw.idx data while the event was initialized as a
breakpoint with different members in event->hw union.
This patch forces uncore_collect_events() to collect only uncore
events.
Enabling the hardware I/O coherency on Armada 370, Armada 375, Armada
38x and Armada XP requires a certain number of conditions:
- On Armada 370, the cache policy must be set to write-allocate.
- On Armada 375, 38x and XP, the cache policy must be set to
write-allocate, the pages must be mapped with the shareable
attribute, and the SMP bit must be set
Currently, on Armada XP, when CONFIG_SMP is enabled, those conditions
are met. However, when Armada XP is used in a !CONFIG_SMP kernel, none
of these conditions are met. With Armada 370, the situation is worse:
since the processor is single core, regardless of whether CONFIG_SMP
or !CONFIG_SMP is used, the cache policy will be set to write-back by
the kernel and not write-allocate.
Since solving this problem turns out to be quite complicated, and we
don't want to leave users of a mainline kernel exposed to
infrequent but real data corruption, this commit proposes to
simply disable hardware I/O coherency in situations where it is known
not to work.
And basically, the is_smp() function of the kernel tells us whether it
is OK to enable hardware I/O coherency or not, so this commit slightly
refactors the coherency_type() function to return
COHERENCY_FABRIC_TYPE_NONE when is_smp() is false, or the appropriate
type of the coherency fabric in the other case.
Thanks to this, the I/O coherency fabric will no longer be used at all
in !CONFIG_SMP configurations. It will continue to be used in
CONFIG_SMP configurations on Armada XP, Armada 375 and Armada 38x
(which are multiple cores processors), but will no longer be used on
Armada 370 (which is a single core processor).
In the process, it simplifies the implementation of the
coherency_type() function, and adds a missing call to of_node_put().
Signed-off-by: Thomas Petazzoni <thomas.petazzoni@free-electrons.com> Fixes: e60304f8cb7bb545e79fe62d9b9762460c254ec2 ("arm: mvebu: Add hardware I/O Coherency support") Acked-by: Gregory CLEMENT <gregory.clement@free-electrons.com> Link: https://lkml.kernel.org/r/1415871540-20302-3-git-send-email-thomas.petazzoni@free-electrons.com Signed-off-by: Jason Cooper <jason@lakedaemon.net> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Commit 9fc2105aeaaf ("ARM: 7830/1: delay: don't bother reporting
bogomips in /proc/cpuinfo") breaks audio in python, and probably
elsewhere, with message
Commit 705814b5ea6f ("ARM: OMAP4+: PM: Consolidate OMAP4 PM code to
re-use it for OMAP5")
moved logic generic to OMAP5+ into the init routine by
introducing omap4_pm_init. However, the patch left behind the powerdomain
initial setup, an unused omap4430 es1.0 check, and a spurious
"Power Management for TI OMAP4." log in the original code.
Remove the duplicate code which is already present in omap4_pm_init from
omap4_init_static_deps.
As part of this change, also move the u-boot version print out of the
static dependency function to the omap4_pm_init function.
Fixes: 705814b5ea6f ("ARM: OMAP4+: PM: Consolidate OMAP4 PM code to re-use it for OMAP5") Signed-off-by: Nishanth Menon <nm@ti.com> Signed-off-by: Tony Lindgren <tony@atomide.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Same story as in commit 41080b5a2401 ("nfsd race fixes: ext2") (similar
ext2 fix) except that nilfs2 needs to use insert_inode_locked4() instead
of insert_inode_locked(), and a buggy check for dead inodes needs
to be fixed.
If nilfs_iget() is called from nfsd after nilfs_new_inode() calls
insert_inode_locked4(), nilfs_iget() will wait for unlock_new_inode() at
the end of nilfs_mkdir()/nilfs_create()/etc to unlock the inode.
If nilfs_iget() is called before nilfs_new_inode() calls
insert_inode_locked4(), it will create an in-core inode and read its
data from the on-disk inode. But, nilfs_iget() will find i_nlink equals
zero and fail at nilfs_read_inode_common(), which will lead it to call
iget_failed() and cleanly fail.
However, this sanity check doesn't work as expected for reused on-disk
inodes because they leave a non-zero value in i_mode field and it
hinders the test of i_nlink. This patch also fixes the issue by
removing the test on i_mode that nilfs2 doesn't need.
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Bugs similar to the one in acbbe6fbb240 (kcmp: fix standard comparison
bug) are in rich supply.
In this variant, the problem is that struct xdr_netobj::len has type
unsigned int, so the expression o1->len - o2->len _also_ has type
unsigned int; it has completely well-defined semantics, and the result
is some non-negative integer, which is always representable in a long
long. But this means that if the conditional triggers, we are
guaranteed to return a positive value from compare_blob.
In this case it could be fixed by
- res = o1->len - o2->len;
+ res = (long long)o1->len - (long long)o2->len;
but I'd rather eliminate the usually broken 'return a - b;' idiom.
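A safe shape is the explicit three-way comparison, sketched here:

	if (o1->len < o2->len)
		return -1;
	if (o1->len > o2->len)
		return 1;
	return memcmp(o1->data, o2->data, o1->len);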
Reviewed-by: Jeff Layton <jlayton@primarydata.com> Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk> Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
This happens due to the following race: between 'if (channel->device_obj)' check
in vmbus_process_rescind_offer() and pr_debug() in vmbus_device_unregister() the
device can disappear. Fix the issue by taking an additional reference to the
device before proceeding to vmbus_device_unregister().
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: K. Y. Srinivasan <kys@microsoft.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Commit 19e2ad6a09f0c06dbca19c98e5f4584269d913dd ("n_tty: Remove overflow
tests from receive_buf() path") moved the increment of read_head into
the arguments list of read_buf_addr(). Function calls represent a
sequence point in C. Therefore read_head is incremented before the
character c is placed in the buffer. Since the circular read buffer is
a lock-less design since commit 6d76bd2618535c581f1673047b8341fd291abc67
("n_tty: Make N_TTY ldisc receive path lockless"), this creates a race
condition that leads to communication errors.
This patch modifies the code to increment read_head _after_ the data
is placed in the buffer and thus fixes the race for non-SMP machines.
To fix the problem for SMP machines, memory barriers must be added in
a separate patch.
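The non-SMP fix is essentially this reordering:

-	*read_buf_addr(ldata, ldata->read_head++) = c;
+	*read_buf_addr(ldata, ldata->read_head) = c;
+	ldata->read_head++;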
Signed-off-by: Christian Riesch <christian.riesch@omicron.at> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
This patch adds waiting until the transmit buffer and shifter are empty
before disabling the clock.
Without this fix it is possible to have the clock disabled while data
has not been transmitted yet, which causes an improper state of the TX
line and problems in subsequent data transfers.
Signed-off-by: Robert Baldyga <r.baldyga@samsung.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
After invoking ->dirty_inode(), __mark_inode_dirty() does smp_mb() and
tests inode->i_state locklessly to see whether it already has all the
necessary I_DIRTY bits set. The comment above the barrier doesn't
contain any useful information - a memory barrier can't ensure that
"changes are seen by all cpus" by itself.
And it sure enough was broken. Please consider the following
scenario.
CPU 0                                     CPU 1
-------------------------------------------------------------------------------
                                          enters __writeback_single_inode()
                                          grabs inode->i_lock
                                          tests PAGECACHE_TAG_DIRTY which is clear
enters __set_page_dirty()
grabs mapping->tree_lock
sets PAGECACHE_TAG_DIRTY
releases mapping->tree_lock
leaves __set_page_dirty()
enters __mark_inode_dirty()
smp_mb()
sees I_DIRTY_PAGES set
leaves __mark_inode_dirty()
                                          clears I_DIRTY_PAGES
                                          releases inode->i_lock
Now @inode has dirty pages w/ I_DIRTY_PAGES clear. This doesn't seem
to lead to an immediately critical problem because requeue_inode()
later checks PAGECACHE_TAG_DIRTY instead of I_DIRTY_PAGES when
deciding whether the inode needs to be requeued for IO and there are
enough unintentional memory barriers inbetween, so while the inode
ends up with inconsistent I_DIRTY_PAGES flag, it doesn't fall off the
IO list.
The lack of explicit barrier may also theoretically affect the other
I_DIRTY bits which deal with metadata dirtiness. There is no
guarantee that a strong enough barrier exists between
I_DIRTY_[DATA]SYNC clearing and write_inode() writing out the dirtied
inode. Filesystem inode writeout path likely has enough stuff which
can behave as full barrier but it's theoretically possible that the
writeout may not see all the updates from ->dirty_inode().
Fix it by adding an explicit smp_mb() after I_DIRTY clearing. Note
that I_DIRTY_PAGES needs a special treatment as it always needs to be
cleared to be interlocked with the lockless test on
__mark_inode_dirty() side. It's cleared unconditionally and
reinstated after smp_mb() if the mapping still has dirty pages.
Also add comments explaining how and why the barriers are paired.
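The clearing side then looks roughly like this sketch:

	inode->i_state &= ~dirty;

	/*
	 * Paired with smp_mb() in __mark_inode_dirty(): make the
	 * I_DIRTY clearing visible before re-testing for dirty pages.
	 */
	smp_mb();

	if (mapping_tagged(mapping, PAGECACHE_TAG_DIRTY))
		inode->i_state |= I_DIRTY_PAGES;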
Lightly tested.
Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Jan Kara <jack@suse.cz> Cc: Mikulas Patocka <mpatocka@redhat.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Al Viro <viro@zeniv.linux.org.uk> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@fb.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
We can get here from blkdev_ioctl() -> blkpg_ioctl() -> add_partition()
with a user passed in partno value. If we pass in 0x7fffffff, the
new target in disk_expand_part_tbl() overflows the 'int' and we
access beyond the end of ptbl->part[] and even write to it when we
do the rcu_assign_pointer() to assign the new partition.
Currently we enable Exynos devices in the multi v7 defconfig; however, when
testing on my ODROID-U3, I noticed that USB was not working. Enabling this
option makes USB work, which enables networking support as well, since the
ODROID-U3 has networking on the USB bus.
[arnd] Support for odroid-u3 was added in 3.10, so it would be nice to
backport this fix at least that far.
stac_store_hints() gets the masking of the values for gpio_dir and
gpio_data utterly wrong, likely due to copy&paste errors. Fortunately,
this feature is used very rarely, so the impact must be really small.
In olden times the snd_hda_param_read() function always set "*start_id"
but in 2007 we introduced a new return and it causes uninitialized data
bugs in a couple of the callers: print_codec_info() and
hdmi_parse_codec().
The Arcam rPAC seems to have the same problem - whenever anything
(alsamixer, udevd, 3.9+ kernel from 60af3d037eb8c, ..) attempts to
access mixer / control interface of the card, the firmware "locks up"
the entire device, resulting in
SNDRV_PCM_IOCTL_HW_PARAMS failed (-5): Input/output error
from alsa-lib.
Other operating systems can somehow read the mixer (there seems to be
playback volume/mute), but any manipulation is ignored by the device
(which has hardware volume controls).
bus_find_device_by_name() acquires a device reference which is never
released. This results in an object leak, which on older kernels
results in failure to release all resources of PCI devices. libvirt
uses drivers_probe to re-attach devices to the host after assignment
and is therefore a common trigger for this leak.
In Linux 3.18 and below, GCC hoists the lsl instructions in the
pvclock code all the way to the beginning of __vdso_clock_gettime,
slowing the non-paravirt case significantly. For unknown reasons,
presumably related to the removal of a branch, the performance issue
is gone as of
e76b027e6408 x86,vdso: Use LSL unconditionally for vgetcpu
but I don't trust GCC enough to expect the problem to stay fixed.
There should be no correctness issue, because the __getcpu calls in
__vdso_clock_gettime were never necessary in the first place.
Note to stable maintainers: In 3.18 and below, depending on
configuration, gcc 4.9.2 generates code like this:
This patch won't apply as is to any released kernel, but I'll send a
trivial backported version if needed.
[
Backported by Andy Lutomirski. Should apply to all affected
versions. This fixes a functionality bug as well as a performance
bug: buggy kernels can infinite loop in __vdso_clock_gettime on
affected compilers. See, for example:
The theory behind vdso randomization is that it's mapped at a random
offset above the top of the stack. To avoid wasting a page of
memory for an extra page table, the vdso isn't supposed to extend
past the lowest PMD into which it can fit. Other than that, the
address should be a uniformly distributed address that meets all of
the alignment requirements.
The current algorithm is buggy: the vdso has about a 50% probability
of being at the very end of a PMD. The current algorithm also has a
decent chance of failing outright due to incorrect handling of the
case where the top of the stack is near the top of its PMD.
This fixes the implementation. The paxtest estimate of vdso
"randomisation" improves from 11 bits to 18 bits. (Disclaimer: I
don't know what the paxtest code is actually calculating.)
It's worth noting that this algorithm is inherently biased: the vdso
is more likely to end up near the end of its PMD than near the
beginning. Ideally we would either nix the PMD sharing requirement
or jointly randomize the vdso and the stack to reduce the bias.
In the mean time, this is a considerable improvement with basically
no risk of compatibility issues, since the allowed outputs of the
algorithm are unchanged.
As an easy test, doing this:
for i in `seq 10000`
do grep -P vdso /proc/self/maps |cut -d- -f1
done |sort |uniq -d
used to produce lots of output (1445 lines on my most recent run).
A tiny subset looks like this:
New Genius MousePen i608X devices have a new id 0x501a instead of the
old 0x5011 so add a new #define with "_2" appended and change required
places.
The remaining two checkpatch warnings about line length
being over 80 characters are present in the original files too and this
patch was made in the same style (no line break).
Just adding a new id and changing the required places should make the
new device work without any issues according to the bug report in the
following url.
This patch was made according to and fixes:
https://bugzilla.kernel.org/show_bug.cgi?id=67111
Apple bluetooth wireless keyboard (sold in UK) has always reported zero
for battery strength no matter what condition the batteries are actually
in. With this patch applied (applying same quirk as other Apple
keyboards), the battery strength is now correctly reported.
This is a static checker fix. We write some binary settings to the
sysfs file. One of the settings is the "->startup_profile". There
isn't any checking to make sure it fits into the
pyra->profile_settings[] array in the profile_activated() function.
I added a check to pyra_sysfs_write_settings() in both places because
I wasn't positive that the other callers were correct.
Before ->start() is called, bufsize is set to HID_MIN_BUFFER_SIZE,
64 bytes. While processing the IRQ, we were asking to receive up to
wMaxInputLength bytes, which can be bigger than 64 bytes.
Later, when ->start is run, a proper bufsize will be calculated.
Given that wMaxInputLength is said to be unreliable in other parts of the
code, set up to receive only what we can, even if it results in truncated
reports.
The current driver uses a common buffer for reading reports, either
synchronously in i2c_hid_get_raw_report() or asynchronously in
the interrupt handler.
There is a race condition if an interrupt arrives immediately after
the report is received in i2c_hid_get_raw_report(); the common
buffer is modified by the interrupt handler with the new report
and then i2c_hid_get_raw_report() proceeds using wrong data.
Fix it by using a separate buffer for synchronous reports.
Signed-off-by: Jean-Baptiste Maneyrol <jmaneyrol@invensense.com>
[Antonio Borneo: cleanup, rebase to v3.17, submit mainline] Signed-off-by: Antonio Borneo <borneo.antonio@gmail.com> Reviewed-by: Benjamin Tissoires <benjamin.tissoires@redhat.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
There's an off-by-one bug in function __domain_mapping(), which may
trigger the BUG_ON(nr_pages < lvl_pages) when
(nr_pages + 1) & superpage_mask == 0
The issue was introduced by commit 9051aa0268dc "intel-iommu: Combine
domain_pfn_mapping() and domain_sg_mapping()", which sets sg_res to
"nr_pages + 1" to avoid some of the 'sg_res==0' code paths.
It's safe to remove the extra "+1" because sg_res is only used to
calculate the page size now.
Reported-And-Tested-by: Sudeep Dutt <sudeep.dutt@intel.com> Signed-off-by: Jiang Liu <jiang.liu@linux.intel.com> Acked-By: David Woodhouse <David.Woodhouse@intel.com> Signed-off-by: Joerg Roedel <jroedel@suse.de> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
If the erase worker is unable to erase a PEB it will
free the ubi_wl_entry itself.
The failing ubi_wl_entry must not be free()'d again after
do_sync_erase() returns.
The logic of vfree()'ing vol->upd_buf is tied to vol->updating.
In ubi_start_update() vol->updating is set long before vmalloc()'ing
vol->upd_buf. If we encounter a write failure in ubi_start_update()
before vmalloc() the UBI device release function will try to vfree()
vol->upd_buf because vol->updating is set.
Fix this by allocating vol->upd_buf directly after setting vol->updating.
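A sketch of the resulting order (error handling condensed; the buffer size is illustrative):

	vol->updating = 1;
	vol->upd_buf = vmalloc(ubi->leb_size);
	if (!vol->upd_buf)
		return -ENOMEM;
	/* from here on, any failure path that sees vol->updating set can
	 * safely vfree(vol->upd_buf) */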
On some ARMs the memory can be mapped pgprot_noncached() and still
be working for atomic operations. As pointed out by Colin Cross
<ccross@android.com>, in some cases you do want to use
pgprot_noncached() if the SoC supports it to see a debug printk
just before a write hanging the system.
On ARMs, the atomic operations on strongly ordered memory are
implementation defined. So let's provide an optional kernel parameter
for configuring pgprot_noncached(), and use pgprot_writecombine() by
default.
Cc: Arnd Bergmann <arnd@arndb.de> Cc: Rob Herring <robherring2@gmail.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Anton Vorontsov <anton@enomsg.org> Cc: Colin Cross <ccross@android.com> Cc: Olof Johansson <olof@lixom.net> Cc: Russell King <linux@arm.linux.org.uk> Acked-by: Kees Cook <keescook@chromium.org> Signed-off-by: Tony Lindgren <tony@atomide.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Currently trying to use pstore on at least ARMs can hang as we're
mapping the peristent RAM with pgprot_noncached().
On ARMs, pgprot_noncached() will actually make the memory strongly
ordered, and as the atomic operations pstore uses are implementation
defined for strongly ordered memory, they may not work. So basically
atomic operations have undefined behavior on ARM for device or strongly
ordered memory types.
Let's fix the issue by using write-combine variants for mappings. This
corresponds to normal, non-cacheable memory on ARM. For many other
architectures, this change does not change the mapping type as by
default we have:
#define pgprot_writecombine pgprot_noncached
The reason why pgprot_noncached() was originally used for pstore
is because Colin Cross <ccross@android.com> had observed lost
debug prints right before a device-hanging write operation on some
systems. For the platforms supporting pgprot_noncached(), we can
add an optional configuration option to support that. But let's
get pstore working first before adding new features.
Cc: Arnd Bergmann <arnd@arndb.de> Cc: Anton Vorontsov <cbouatmailru@gmail.com> Cc: Colin Cross <ccross@android.com> Cc: Olof Johansson <olof@lixom.net> Cc: linux-kernel@vger.kernel.org Acked-by: Kees Cook <keescook@chromium.org> Signed-off-by: Rob Herring <rob.herring@calxeda.com>
[tony@atomide.com: updated description] Signed-off-by: Tony Lindgren <tony@atomide.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Commit 6ac665c63dca ("PCI: rewrite PCI BAR reading code") masked off
low-order bits from 'l', but not from 'sz'. Both are passed to pci_size(),
which compares 'base == maxbase' to check for read-only BARs. The masking
of 'l' means that comparison will never be 'true', so the check for
read-only BARs no longer works.
Resolve this by also masking off the low-order bits of 'sz' before passing
it into pci_size() as 'maxbase'. With this change, pci_size() will once
again catch the problems that have been encountered to date:
- AGP aperture BAR of AMD-7xx host bridges: if the AGP window is
disabled, this BAR is read-only and read as 0x00000008 [1]
- BARs 0-4 of ALi IDE controllers can be non-zero and read-only [1]
Currently, when going idle, we set the flag indicating that we are in
nap mode (paca->kvm_hstate.hwthread_state) and then execute the nap
(or sleep or rvwinkle) instruction, all with the MMU on. This is bad
for two reasons: (a) the architecture specifies that those instructions
must be executed with the MMU off, and in fact with only the SF, HV, ME
and possibly RI bits set, and (b) this introduces a race, because as
soon as we set the flag, another thread can switch the MMU to a guest
context. If the race is lost, this thread will typically start looping
on relocation-on ISIs at 0xc...4400.
This fixes it by setting the MSR as required by the architecture before
setting the flag or executing the nap/sleep/rvwinkle instruction.
[ shreyas@linux.vnet.ibm.com: Edited to handle LE ] Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Shreyas B. Prabhu <shreyas@linux.vnet.ibm.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: linuxppc-dev@lists.ozlabs.org Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
We have some code in udbg_uart_getc_poll() that tries to protect
against a NULL udbg_uart_in, but gets it all wrong.
Found with the LLVM static analyzer (scan-build).
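The fixed poll routine, in sketch form (register macro names assumed from udbg_16550.c):

	static int udbg_uart_getc_poll(void)
	{
		if (!udbg_uart_in)
			return -1;

		if (!(udbg_uart_in(UART_LSR) & LSR_DR))
			return -1;

		return udbg_uart_in(UART_RBR);
	}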
Fixes: 309257484cc1 ("powerpc: Cleanup udbg_16550 and add support for LPC PIO-only UARTs") Signed-off-by: Anton Blanchard <anton@samba.org>
[mpe: Add some newlines for readability while we're here] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Flush the FIFOs when the stream is prepared for use. This avoids
an inadvertent swapping of the left/right channels if the FIFOs are
not empty at startup.
Signed-off-by: Andrew Jackson <Andrew.Jackson@arm.com> Signed-off-by: Mark Brown <broonie@kernel.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Commit 5fe5b767dc6f ("ASoC: dapm: Do not pretend to support controls for non
mixer/mux widgets") revealed ill-defined control in a route between
"STENL Mux" and DACs in max98090.c:
max98090 i2c-193C9890:00: Control not supported for path STENL Mux -> [NULL] -> DACL
max98090 i2c-193C9890:00: ASoC: no dapm match for STENL Mux --> NULL --> DACL
max98090 i2c-193C9890:00: ASoC: Failed to add route STENL Mux -> NULL -> DACL
max98090 i2c-193C9890:00: Control not supported for path STENL Mux -> [NULL] -> DACR
max98090 i2c-193C9890:00: ASoC: no dapm match for STENL Mux --> NULL --> DACR
max98090 i2c-193C9890:00: ASoC: Failed to add route STENL Mux -> NULL -> DACR
Since there is no control between "STENL Mux" and DACs the control name must
be NULL not "NULL".
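In the route table this is just the difference between the string "NULL" and a NULL control, roughly:

-	{"DACL", "NULL", "STENL Mux"},
-	{"DACR", "NULL", "STENL Mux"},
+	{"DACL", NULL, "STENL Mux"},
+	{"DACR", NULL, "STENL Mux"},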
Signed-off-by: Jarkko Nikula <jarkko.nikula@linux.intel.com> Signed-off-by: Mark Brown <broonie@kernel.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Make sure to check the version field of the firmware header to make sure to
not accidentally try to parse a firmware file with a different layout.
Trying to do so can result in loading invalid firmware code to the device.
Signed-off-by: Lars-Peter Clausen <lars@metafoo.de> Signed-off-by: Mark Brown <broonie@kernel.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Like with ath9k, ath5k queues also need to be ordered by priority.
queue_info->tqi_subtype already contains the correct index, so use it
instead of relying on the order of ath5k_hw_setup_tx_queue calls.
Signed-off-by: Felix Fietkau <nbd@openwrt.org> Signed-off-by: John W. Linville <linville@tuxdriver.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
This patch fixes a misplaced call to memset() that fills the request
buffer with 0. The problem was with sending PCAN_USBPRO_REQ_FCT
requests: the content set by the caller was thus lost.
With this patch, the memory area is zeroed only when requesting info
from the device.
This patch sets the correct reverse sequence order for the instructions
to run when any failure occurs during the initialization steps.
It also adds the missing unregistration call for the can device if the
failure appears after it has been registered.
The driver passes the desired hardware queue index for a WMM data queue
in qinfo->tqi_subtype. This was ignored in ath9k_hw_setuptxqueue, which
instead relied on the order in which the function is called.
Reported-by: Hubert Feurstein <h.feurstein@gmail.com> Signed-off-by: Felix Fietkau <nbd@openwrt.org> Signed-off-by: John W. Linville <linville@tuxdriver.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
For buffer writes, the page lock is taken in write_begin and released in
write_end. In ocfs2_write_end_nolock(), before it unlocks the page in
ocfs2_free_write_ctxt(), it calls ocfs2_run_deallocs(), which asks
for the read lock of journal->j_trans_barrier. Holding the page lock while
asking for journal->j_trans_barrier breaks the locking order.
This will cause a deadlock with the journal commit threads: ocfs2cmt will
get the write lock of journal->j_trans_barrier first, then it wakes up
kjournald2 to do the commit work, and at last it waits until done. To
commit the journal, kjournald2 needs to flush data first, for which it
needs to get the cache page lock.
Since some ocfs2 cluster locks are held by the write process, this
deadlock may hang the whole cluster.
Unlocking the pages before ocfs2_run_deallocs() fixes the locking order;
also, the unlock is moved before ocfs2_commit_trans() so that the page
lock is released before j_trans_barrier, preserving the unlocking order.
Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com> Reviewed-by: Wengang Wang <wen.gang.wang@oracle.com> Reviewed-by: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Move the rtc registration to be later than the hardware initialization.
The reason is that devm_rtc_device_register() will do a read_time(),
which is a callback accessing hardware. This sometimes causes a hang in
the hardware-related callback.
Signed-off-by: Guo Zeng <guo.zeng@csr.com> Signed-off-by: Barry Song <Baohua.Song@csr.com> Cc: Alessandro Zummo <a.zummo@towertech.it> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
During file system stress testing on 3.10 and 3.12 based kernels, the
umount command occasionally hung in fsnotify_unmount_inodes in the
section of code:
spin_lock(&inode->i_lock);
if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) {
spin_unlock(&inode->i_lock);
continue;
}
As this section of code holds the global inode_sb_list_lock, eventually
the system hangs trying to acquire the lock.
Multiple crash dumps showed:
The inode->i_state == 0x60 and i_count == 0 and i_sb_list would point
back at itself. As this is not the value of list upon entry to the
function, the kernel never exits the loop.
To help narrow down the problem, the call to list_del_init in
inode_sb_list_del was changed to list_del. This poisons the pointers in
the i_sb_list and causes the kernel to panic if it traverses a freed
inode.
Subsequent stress testing panicked in fsnotify_unmount_inodes at the
bottom of the list_for_each_entry_safe loop, showing next_i had become
free.
We believe the root cause of the problem is that next_i is being freed
during the window of time that the list_for_each_entry_safe loop
temporarily releases inode_sb_list_lock to call fsnotify and
fsnotify_inode_delete.
The code in fsnotify_unmount_inodes attempts to prevent the freeing of
inode and next_i by calling __iget. However, the code doesn't do the
__iget call on next_i
if i_count == 0 or
if i_state & (I_FREEING | I_WILL_FREE)
The patch addresses this issue by advancing next_i in the above two cases
until we either find a next_i which we can __iget or we reach the end of
the list. This makes the handling of next_i more closely match the
handling of the variable "inode."
The time to reproduce the hang is highly variable (from hours to days.) We
ran the stress test on a 3.10 kernel with the proposed patch for a week
without failure.
During list_for_each_entry_safe, next_i is becoming free causing
the loop to never terminate. Advance next_i in those cases where
__iget is not done.
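In sketch form, the advance loop based on the description above:

	/* skip inodes we cannot pin; stop when we wrap around to the head */
	while (&next_i->i_sb_list != list) {
		spin_lock(&next_i->i_lock);
		if (!(next_i->i_state & (I_FREEING | I_WILL_FREE)) &&
		    atomic_read(&next_i->i_count)) {
			__iget(next_i);
			need_iput = next_i;
			spin_unlock(&next_i->i_lock);
			break;
		}
		spin_unlock(&next_i->i_lock);
		next_i = list_entry(next_i->i_sb_list.next,
				    struct inode, i_sb_list);
	}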
Signed-off-by: Jerry Hoemann <jerry.hoemann@hp.com> Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com> Cc: Ken Helias <kenhelias@firemail.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
[stable 3.12 note]
This commit was supposed to fix a completely other issue. But in 3.12,
with commit f72e7dcdd25229446b102e587ef2f826f76bff28 (mm: let
mm_find_pmd fix buggy race with THP fault), we need this commit as
well (it fixes the issue as a by-product). Hugh Dickins writes:
<== citation starts here>
Fine for this to go in, but there is one catch, which I discovered
when backporting to v3.11: it needed one more hunk. I haven't checked
your base tree, but if this applies then I believe you need it - most
of the time no problem, but it can case page migration to fail to find
a migration entry it inserted earlier, then BUG_ON(!PageLocked(p)) in
migration_entry_to_page() soon after. Here's what I wrote back then:
Note on rebase to v3.11: added a hunk to replace the use of
mm_find_pmd() in page_check_address_pmd(). This call had been
similarly replaced by the time of my v3.16 commit, in Kirill
Shutemov's v3.15 b5a8cad376ee ("thp: close race between split and zap
huge pages"): which we do not need as such, since it's fixing v3.13 117b0791ac42 ("mm, thp: move ptl taking inside
page_check_address_pmd()"), from a split page-table-lock series we are
not backporting. But without this additional hunk, rmap sometimes
broke when the new semantic for mm_find_pmd() was used here.
<== end of citation>
But instead of appending hunks to commits, I am taking a full,
backported version of commit b5a8cad376ee with this note prepended.
So the changelog of b5a8cad376ee is left below, but does not apply to
3.12 yet.
[=== stable 3.12 note ends here]
Sasha Levin has reported two THP BUGs[1][2]. I believe both of them
have the same root cause. Let's look to them one by one.
The first bug[1] is "kernel BUG at mm/huge_memory.c:1829!". It's
BUG_ON(mapcount != page_mapcount(page)) in __split_huge_page(). From my
testing I see that page_mapcount() is higher than mapcount here.
I think it happens due to race between zap_huge_pmd() and
page_check_address_pmd(). page_check_address_pmd() misses PMD which is
under zap:
        CPU0                              CPU1
zap_huge_pmd()
  pmdp_get_and_clear()
                                  __split_huge_page()
                                    anon_vma_interval_tree_foreach()
                                      __split_huge_page_splitting()
                                        page_check_address_pmd()
                                          mm_find_pmd()
                                            /*
                                             * We check if PMD present without taking ptl: no
                                             * serialization against zap_huge_pmd(). We miss this PMD,
                                             * it's not accounted to 'mapcount' in __split_huge_page().
                                             */
                                            pmd_present(pmd) == 0
The second bug[2] is "kernel BUG at mm/huge_memory.c:1371!".
It's VM_BUG_ON_PAGE(!PageHead(page), page) in zap_huge_pmd().
This happens in similar way:
CPU0                                    CPU1
zap_huge_pmd()
  pmdp_get_and_clear()
  page_remove_rmap(page)
    atomic_add_negative(-1, &page->_mapcount)
                                        __split_huge_page()
                                          anon_vma_interval_tree_foreach()
                                            __split_huge_page_splitting()
                                              page_check_address_pmd()
                                                mm_find_pmd()
                                                  pmd_present(pmd) == 0
                                                  /* The same comment as above */
                                          /*
                                           * No crash this time since we already
                                           * decremented page->_mapcount in
                                           * zap_huge_pmd().
                                           */
                                          BUG_ON(mapcount != page_mapcount(page))
                                          /*
                                           * We split the compound page here into
                                           * small pages without serialization
                                           * against zap_huge_pmd()
                                           */
                                          __split_huge_page_refcount()
VM_BUG_ON_PAGE(!PageHead(page), page); // CRASH!!!
So my understanding is that the problem is the pmd_present() check in
mm_find_pmd() without taking the page table lock.
The bug was introduced by my commit 117b0791ac42. Sorry for
that. :(
Let's open code mm_find_pmd() in page_check_address_pmd() and do the
check under page table lock.
Note that __page_check_address() does the same for PTE entries
if sync != 0.
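The shape of the change in page_check_address_pmd() is roughly the
following (a sketch; the pmd_lock() naming assumes the split
page-table-lock kernels, older trees take mm->page_table_lock instead):

	pgd = pgd_offset(mm, address);
	if (!pgd_present(*pgd))
		return NULL;
	pud = pud_offset(pgd, address);
	if (!pud_present(*pud))
		return NULL;
	pmd = pmd_offset(pud, address);

	*ptl = pmd_lock(mm, pmd);
	/* the pmd_present() check is now serialized against zap_huge_pmd() */
	if (!pmd_present(*pmd))
		goto unlock;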
I've stress tested split and zap code paths for 36+ hours by now and
don't see crashes with the patch applied. Before, it took <20 min to
trigger the first bug and a few hours for the second one (if we ignore
the first).
The current code simply assumes that Intel Arch PerfMon v2+ has
the IA32_PERF_CAPABILITIES MSR; the SDM specifies that we should check
CPUID[1].ECX[15] (aka, FEATURE_PDCM) instead.
This was found by KVM which implements v2+ but didn't provide the
capabilities MSR. Change the code to DTRT; KVM will also implement the
MSR and return 0.
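The change amounts to roughly this, in the Intel PMU init code:

-	/*
-	 * v2 and above have a perf capabilities MSR
-	 */
-	if (version > 1) {
+	if (boot_cpu_has(X86_FEATURE_PDCM)) {
 		u64 capabilities;

 		rdmsrl(MSR_IA32_PERF_CAPABILITIES, capabilities);
 		x86_pmu.intel_cap.capabilities = capabilities;
 	}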
Cc: pbonzini@redhat.com Reported-by: "Michael S. Tsirkin" <mst@redhat.com> Suggested-by: Eduardo Habkost <ehabkost@redhat.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/20140203132903.GI8874@twins.programming.kicks-ass.net Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
I believe this comes about because, whereas collapsing and splitting THP
functions take anon_vma lock in write mode (which excludes concurrent
rmap walks), faulting THP functions (write protection and misplaced
NUMA) do not - and mostly they do not need to.
But they do use a pmdp_clear_flush(), set_pmd_at() sequence which, for
an instant (indeed, for a long instant, given the inter-CPU TLB flush in
there), leaves *pmd neither present nor trans_huge.
Which can confuse a concurrent rmap walk, as when removing migration
ptes, seen in the dumped trace. Although that rmap walk has a 4k page
to insert, anon_vmas containing THPs are in no way segregated from
4k-page anon_vmas, so the 4k-intent mm_find_pmd() does need to cope with
that instant when a trans_huge pmd is temporarily absent.
I don't think we need to strengthen the locking at the THP end: it's easily
handled with an ACCESS_ONCE() before testing both conditions.
And since mm_find_pmd() had only one caller who wanted a THP rather than
a pmd, let's slightly repurpose it to fail when it hits a THP or
non-present pmd, and open code split_huge_page_address() again.
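The tail of mm_find_pmd() then looks roughly like this (a sketch of the
approach):

	pmd_t pmde;
	...
	pmd = pmd_offset(pud, address);
	/*
	 * Read *pmd once: a concurrent THP fault may be in its
	 * pmdp_clear_flush(), set_pmd_at() window, so test present
	 * and !trans_huge against the same snapshot.
	 */
	pmde = ACCESS_ONCE(*pmd);
	if (!pmd_present(pmde) || pmd_trans_huge(pmde))
		pmd = NULL;
out:
	return pmd;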
Signed-off-by: Hugh Dickins <hughd@google.com> Reported-by: Sasha Levin <sasha.levin@oracle.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Bob Liu <bob.liu@oracle.com> Cc: Christoph Lameter <cl@gentwo.org> Cc: Dave Jones <davej@redhat.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Allow more than one viperboard to be connected by registering with
PLATFORM_DEVID_AUTO instead of PLATFORM_DEVID_NONE.
The subdevices are currently registered with PLATFORM_DEVID_NONE, which
will cause a name collision on the platform bus when a second viperboard
is plugged in:
viperboard 1-2.4:1.0: version 0.00 found at bus 001 address 004
------------[ cut here ]------------
WARNING: CPU: 0 PID: 181 at /home/johan/work/omicron/src/linux/fs/sysfs/dir.c:31 sysfs_warn_dup+0x74/0x84()
sysfs: cannot create duplicate filename '/bus/platform/devices/viperboard-gpio'
Modules linked in: i2c_viperboard viperboard netconsole [last unloaded: viperboard]
CPU: 0 PID: 181 Comm: bash Tainted: G W 3.17.0-rc6 #1
[<c0016bf4>] (unwind_backtrace) from [<c0013860>] (show_stack+0x20/0x24)
[<c0013860>] (show_stack) from [<c04305f8>] (dump_stack+0x24/0x28)
[<c04305f8>] (dump_stack) from [<c0040fb4>] (warn_slowpath_common+0x80/0x98)
[<c0040fb4>] (warn_slowpath_common) from [<c004100c>] (warn_slowpath_fmt+0x40/0x48)
[<c004100c>] (warn_slowpath_fmt) from [<c016f1bc>] (sysfs_warn_dup+0x74/0x84)
[<c016f1bc>] (sysfs_warn_dup) from [<c016f548>] (sysfs_do_create_link_sd.isra.2+0xcc/0xd0)
[<c016f548>] (sysfs_do_create_link_sd.isra.2) from [<c016f588>] (sysfs_create_link+0x3c/0x48)
[<c016f588>] (sysfs_create_link) from [<c02867ec>] (bus_add_device+0x12c/0x1e0)
[<c02867ec>] (bus_add_device) from [<c0284820>] (device_add+0x410/0x584)
[<c0284820>] (device_add) from [<c0289440>] (platform_device_add+0xd8/0x26c)
[<c0289440>] (platform_device_add) from [<c02a5ae4>] (mfd_add_device+0x240/0x344)
[<c02a5ae4>] (mfd_add_device) from [<c02a5ce0>] (mfd_add_devices+0xb8/0x110)
[<c02a5ce0>] (mfd_add_devices) from [<bf00d1c8>] (vprbrd_probe+0x160/0x1b0 [viperboard])
[<bf00d1c8>] (vprbrd_probe [viperboard]) from [<c030c000>] (usb_probe_interface+0x1bc/0x2a8)
[<c030c000>] (usb_probe_interface) from [<c028768c>] (driver_probe_device+0x14c/0x3ac)
[<c028768c>] (driver_probe_device) from [<c02879e4>] (__driver_attach+0xa4/0xa8)
[<c02879e4>] (__driver_attach) from [<c0285698>] (bus_for_each_dev+0x70/0xa4)
[<c0285698>] (bus_for_each_dev) from [<c0287030>] (driver_attach+0x2c/0x30)
[<c0287030>] (driver_attach) from [<c030a288>] (usb_store_new_id+0x170/0x1ac)
[<c030a288>] (usb_store_new_id) from [<c030a2f8>] (new_id_store+0x34/0x3c)
[<c030a2f8>] (new_id_store) from [<c02853ec>] (drv_attr_store+0x30/0x3c)
[<c02853ec>] (drv_attr_store) from [<c016eaa8>] (sysfs_kf_write+0x5c/0x60)
[<c016eaa8>] (sysfs_kf_write) from [<c016dc68>] (kernfs_fop_write+0xd4/0x194)
[<c016dc68>] (kernfs_fop_write) from [<c010fe40>] (vfs_write+0xb4/0x1c0)
[<c010fe40>] (vfs_write) from [<c01104a8>] (SyS_write+0x4c/0xa0)
[<c01104a8>] (SyS_write) from [<c000f900>] (ret_fast_syscall+0x0/0x48)
---[ end trace 98e8603c22d65817 ]---
viperboard 1-2.4:1.0: Failed to add mfd devices to core.
viperboard: probe of 1-2.4:1.0 failed with error -17
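The fix is a one-line change to the registration call in vprbrd_probe(),
roughly:

-	ret = mfd_add_devices(&interface->dev, PLATFORM_DEVID_NONE,
+	ret = mfd_add_devices(&interface->dev, PLATFORM_DEVID_AUTO,
 			      vprbrd_devs, ARRAY_SIZE(vprbrd_devs),
 			      NULL, 0, NULL);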
Signed-off-by: Johan Hovold <johan@kernel.org> Signed-off-by: Lee Jones <lee.jones@linaro.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
The least significant byte of the GPIO value read register
on the STMPE24xx series is at address 0xA4, not 0xA5. Corrected
against the datasheet and tested on STMPE2401 hardware.
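The whole fix is a one-line register definition change (the macro name
is an assumption, from memory of the driver header):

-#define STMPE24XX_REG_GPMR_LSB		0xA5
+#define STMPE24XX_REG_GPMR_LSB		0xA4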
Signed-off-by: Linus Walleij <linus.walleij@linaro.org> Signed-off-by: Lee Jones <lee.jones@linaro.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
When we abort a transaction we iterate over all the ranges marked as dirty
in fs_info->freed_extents[0] and fs_info->freed_extents[1], clear them
from those trees, add them back (unpin) to the free space caches and, if
the fs was mounted with "-o discard", perform a discard on those regions.
Also, after adding the regions to the free space caches, a fitrim ioctl call
can see those ranges in a block group's free space cache and perform a discard
on the ranges, so the same issue can happen without "-o discard" as well.
This causes corruption, affecting one or multiple btree nodes (in the worst
case leaving the fs unmountable) because some of those ranges (the ones in
the fs_info->pinned_extents tree) correspond to btree nodes/leafs that are
referred by the last committed super block - breaking the rule that anything
that was committed by a transaction is untouched until the next transaction
commits successfully.
I ran into this while running in a loop (for several hours) the fstest that
I recently submitted:
[PATCH] fstests: add btrfs test to stress chunk allocation/removal and fstrim
The corruption always happened when a transaction aborted and then fsck complained
like this:
_check_btrfs_filesystem: filesystem on /dev/sdc is inconsistent
*** fsck.btrfs output ***
Check tree block failed, want=94945280, have=0
Check tree block failed, want=94945280, have=0
Check tree block failed, want=94945280, have=0
Check tree block failed, want=94945280, have=0
Check tree block failed, want=94945280, have=0
read block failed check_tree_block
Couldn't open file system
In this case 94945280 corresponded to the root of a tree.
Using ftrace, I observed the following sequence of steps:
1) transaction N started, fs_info->pinned_extents pointed to
fs_info->freed_extents[0];
4) transaction N commit starts, fs_info->pinned_extents now points to
fs_info->freed_extents[1], and transaction N completes;
5) transaction N + 1 starts;
6) eb is COWed, and btrfs_free_tree_block() called for this eb;
7) eb range (94945280 to 94945280 + 16Kb) is added to
fs_info->pinned_extents (fs_info->freed_extents[1]);
8) Something goes wrong in transaction N + 1, like hitting ENOSPC
for example, and the transaction is aborted, turning the fs into
readonly mode. The stack trace I got for example:
[112065.253935] [<ffffffff8140c7b6>] dump_stack+0x4d/0x66
[112065.254271] [<ffffffff81042984>] warn_slowpath_common+0x7f/0x98
[112065.254567] [<ffffffffa0325990>] ? __btrfs_abort_transaction+0x50/0x10b [btrfs]
[112065.261674] [<ffffffff810429e5>] warn_slowpath_fmt+0x48/0x50
[112065.261922] [<ffffffffa032949e>] ? btrfs_free_path+0x26/0x29 [btrfs]
[112065.262211] [<ffffffffa0325990>] __btrfs_abort_transaction+0x50/0x10b [btrfs]
[112065.262545] [<ffffffffa036b1d6>] btrfs_remove_chunk+0x537/0x58b [btrfs]
[112065.262771] [<ffffffffa033840f>] btrfs_delete_unused_bgs+0x1de/0x21b [btrfs]
[112065.263105] [<ffffffffa0343106>] cleaner_kthread+0x100/0x12f [btrfs]
(...)
[112065.264493] ---[ end trace dd7903a975a31a08 ]---
[112065.264673] BTRFS: error (device sdc) in btrfs_remove_chunk:2625: errno=-28 No space left
[112065.264997] BTRFS info (device sdc): forced readonly
9) The cleaner kthread sees that the BTRFS_FS_STATE_ERROR bit is set in
fs_info->fs_state and calls btrfs_cleanup_transaction(), which in
turn calls btrfs_destroy_pinned_extent();
10) Then btrfs_destroy_pinned_extent() iterates over all the ranges
marked as dirty in fs_info->freed_extents[], and for each one
it calls discard, if the fs was mounted with "-o discard", and
adds the range to the free space cache of the respective block
group;
11) btrfs_trim_block_group(), invoked from the fitrim ioctl code path,
sees the free space entries and performs a discard;
12) After an umount and mount (or fsck), our eb's location on disk was full
of zeroes, and it should have been untouched, because it was marked as
dirty in the fs_info->pinned_extents tree, and therefore used by the
trees that the last committed superblock points to.
Fix this by not performing a discard and not adding the ranges to the free space
caches - it's useless from this point on, since the fs is now in readonly mode and
we won't write free space caches to disk anymore (otherwise we would leak space)
nor any new superblock. Not adding the ranges to the free space caches also
prevents other code paths from allocating that space and writing to it,
therefore being safer and simpler.
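The loop in btrfs_destroy_pinned_extent() then reduces to roughly the
following (a sketch using 3.12-era helper names):

	while (1) {
		ret = find_first_extent_bit(unpin, 0, &start, &end,
					    EXTENT_DIRTY, NULL);
		if (ret)
			break;
		/*
		 * No discard and no unpinning into the free space caches:
		 * the fs is read-only and these ranges may still be in use
		 * by the trees the last committed super block points to.
		 */
		clear_extent_dirty(unpin, start, end, GFP_NOFS);
		cond_resched();
	}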
We use the modified list to keep track of which extents have been modified so we
know which ones are candidates for logging at fsync() time. Newly modified
extents are added to the list at modification time, around the same time the
ordered extent is created. We do this so that we don't have to wait for ordered
extents to complete before we know what we need to log. The problem is when
something like this happens
log extent 0-4k on inode 1
copy csum for 0-4k from ordered extent into log
sync log
commit transaction
log some other extent on inode 1
ordered extent for 0-4k completes and adds itself onto modified list again
log changed extents
see ordered extent for 0-4k has already been logged
at this point we assume the csum has been copied
sync log
crash
On replay we will see the extent 0-4k in the log, drop the original 0-4k extent,
which is the same one that we are replaying and which also drops the csum, and then
we won't find the csum in the log for that bytenr. This of course causes us to
have errors about not having csums for certain ranges of our inode. So remove
the modified list manipulation in unpin_extent_cache, any modified extents
should have been added well before now, and we don't want them re-logged. This
fixes my test that I could reliably reproduce this problem with. Thanks,
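For reference, the removed manipulation in unpin_extent_cache() was
roughly the following (approximate, from memory of the 3.12-era code):

-	if (!test_bit(EXTENT_FLAG_LOGGING, &em->flags))
-		list_move(&em->list, &tree->modified_extents);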
Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Dmitry Chernenkov used KASAN to discover that eCryptfs writes past the
end of the allocated buffer during encrypted filename decoding. This
fix corrects the issue by getting rid of the unnecessary 0 write when
the current bit offset is 2.
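The fix simply deletes the stray write in ecryptfs_decode_from_filename(),
roughly:

 		case 2:
 			dst[dst_byte_offset++] |= (src_byte);
-			dst[dst_byte_offset] = 0;
 			current_bit_offset = 0;
 			break;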
The ecryptfs_encrypted_view mount option greatly changes the
functionality of an eCryptfs mount. Instead of encrypting and decrypting
lower files, it provides a unified view of the encrypted files in the
lower filesystem. The presence of the ecryptfs_encrypted_view mount
option is intended to force a read-only mount and modifying files is not
supported when the feature is in use. See the following commit for more
information:
This patch forces the mount to be read-only when the
ecryptfs_encrypted_view mount option is specified by setting the
MS_RDONLY flag on the superblock. Additionally, this patch removes some
broken logic in ecryptfs_open() that attempted to prevent modifications
of files when the encrypted view feature was in use. The check in
ecryptfs_open() was not sufficient to prevent file modifications using
system calls that do not operate on a file descriptor.
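The core of the change is roughly this, in ecryptfs_mount() after the
mount options have been parsed:

	if (mount_crypt_stat->flags & ECRYPTFS_ENCRYPTED_VIEW_ENABLED)
		s->s_flags |= MS_RDONLY;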
The UDF specification allows arbitrarily large symlinks. However, we support
only symlinks at most one block long. Check the length of the symlink
so that we don't access memory beyond the end of the symlink block.
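A sketch of the check, done early in udf_symlink_filler() (exact error
code and label names approximate):

	/* We don't support symlinks longer than one block */
	if (inode->i_size > inode->i_sb->s_blocksize) {
		err = -ENAMETOOLONG;
		goto out_unmap;
	}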
Reported-by: Carl Henrik Lunde <chlunde@gmail.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
alloc_pid() does get_pid_ns() beforehand but forgets to put_pid_ns() if it
fails because disable_pid_allocation() was called by the exiting
child_reaper.
We could simply move get_pid_ns() down to successful return, but this fix
tries to be as trivial as possible.
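The fix adds the missing put on the failure path, roughly:

 out_free:
 	while (++i <= ns->level)
 		free_pidmap(pid->numbers + i);

+	/* drop the reference taken by get_pid_ns() at the top of alloc_pid() */
+	put_pid_ns(ns);
 	kmem_cache_free(ns->pid_cachep, pid);
 	pid = NULL;
 	goto out;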
Signed-off-by: Oleg Nesterov <oleg@redhat.com> Reviewed-by: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Aaron Tomlin <atomlin@redhat.com> Cc: Pavel Emelyanov <xemul@parallels.com> Cc: Serge Hallyn <serge.hallyn@ubuntu.com> Cc: Sterling Alexander <stalexan@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
If some error happens in the NCP_IOC_SETROOT ioctl, the appropriate error
return value is then (in most cases) just overwritten before we return.
This can result in reporting success to userspace although an error happened.
This bug was introduced by commit 2e54eb96e2c8 ("BKL: Remove BKL from
ncpfs"). Propagate the errors correctly.
If a request is backlogged, its complete() handler will get called
twice: once with -EINPROGRESS, and once with the final error code.
af_alg's complete handler, unlike other users, does not handle the
-EINPROGRESS but instead always completes the completion that recvmsg()
is waiting on. This can lead to a return to user space while the
request is still pending in the driver. If userspace closes the sockets
before the requests are handled by the driver, this will lead to
use-after-frees (and potential crashes) in the kernel due to the tfm
having been freed.
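The fix makes af_alg's completion callback ignore the first,
-EINPROGRESS notification, roughly:

void af_alg_complete(struct crypto_async_request *req, int err)
{
	struct af_alg_completion *completion = req->data;

	/* a backlogged request completes twice; wake recvmsg() only once */
	if (err == -EINPROGRESS)
		return;

	completion->err = err;
	complete(&completion->completion);
}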
The crashes can be easily reproduced (for example) by reducing the max
queue length in cryptd.c and running the following (from
http://www.chronox.de/libkcapi.html) on AES-NI capable hardware:
A regression was caused by commit 780a7654cee8:
audit: Make testing for a valid loginuid explicit.
(which in turn attempted to fix a regression caused by e1760bd)
When audit_krule_to_data() fills in the rules to get a listing, there was a
missing clause to convert back from AUDIT_LOGINUID_SET to AUDIT_LOGINUID.
This broke userspace by not returning the same information that was sent and
expected.
The rule:
auditctl -a exit,never -F auid=-1
gives:
auditctl -l
LIST_RULES: exit,never f24=0 syscall=all
when it should give:
LIST_RULES: exit,never auid=-1 (0xffffffff) syscall=all
Tag it so that it is reported the same way it was set. Create a new
private audit_krule flags field (pflags) to store it, one that won't interact
with the public flags from the API.
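A sketch of the conversion back in audit_krule_to_data(), assuming the
field names (treat details as approximate):

	case AUDIT_LOGINUID_SET:
		if (krule->pflags & AUDIT_LOGINUID_LEGACY && !f->val) {
			data->fields[i] = AUDIT_LOGINUID;
			data->values[i] = AUDIT_UID_UNSET;
			break;
		}
		/* fall through if the rule really is auid-set */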
Signed-off-by: Richard Guy Briggs <rgb@redhat.com> Signed-off-by: Paul Moore <pmoore@redhat.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
A security fix caused the way the unprivileged remount tests were
using user namespaces to break. Tweak the way user namespaces are
being used so the test works again.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Now that setgroups can be disabled and not re-enabled, setting gid_map
without privilege can be enabled when setgroups is disabled.
This restores most of the functionality that was lost when unprivileged
setting of gid_map was removed. Applications that use this functionality
will need to check to see if they use setgroups or init_groups, and if they
don't they can be fixed by simply disabling setgroups before writing to
gid_map.
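For illustration, the resulting userspace pattern for an unprivileged,
single-gid mapping (a minimal sketch assuming the caller runs as
uid/gid 1000; error handling omitted):

#define _GNU_SOURCE
#include <fcntl.h>
#include <sched.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	const char *deny = "deny";
	const char *map = "0 1000 1";
	int fd;

	unshare(CLONE_NEWUSER);

	/* must happen before gid_map is written, and is irreversible */
	fd = open("/proc/self/setgroups", O_WRONLY);
	write(fd, deny, strlen(deny));
	close(fd);

	/* now permitted without CAP_SETGID in the parent namespace */
	fd = open("/proc/self/gid_map", O_WRONLY);
	write(fd, map, strlen(map));
	close(fd);

	return 0;
}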
Reviewed-by: Andy Lutomirski <luto@amacapital.net> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
- Expose the knob to user space through a proc file /proc/<pid>/setgroups
A value of "deny" means the setgroups system call is disabled in the
current process's user namespace and can not be enabled in the
future in this user namespace.
A value of "allow" means the setgroups system call is enabled.
- Descendant user namespaces inherit the value of setgroups from
their parents.
- A proc file is used (instead of a sysctl) as sysctls currently do
not allow checking the permissions at open time.
- Writing to the proc file is restricted to before the gid_map
for the user namespace is set.
This ensures that disabling setgroups at a user namespace
level will never remove the ability to call setgroups
from a process that already has that ability.
A process may opt in to the setgroups disable for itself by
creating, entering and configuring a user namespace or by calling
setns on an existing user namespace with setgroups disabled.
Processes without privileges already can not call setgroups so this
is a noop. Processes with privilege become processes without
privilege when entering a user namespace and as with any other path
to dropping privilege they would not have the ability to call
setgroups. So this remains within the bounds of what is possible
without a knob to disable setgroups permanently in a user namespace.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
If you did not create the user namespace and are allowed
to write to uid_map or gid_map you should already have the necessary
privilege in the parent user namespace to establish any mapping
you want so this will not affect userspace in practice.
Limiting unprivileged uid mapping establishment to the creator of the
user namespace makes it easier to verify all credentials obtained with
the uid mapping can be obtained without the uid mapping without
privilege.
Limiting unprivileged gid mapping establishment (which is temporarily
absent) to the creator of the user namespace also ensures that the
combination of uid and gid can already be obtained without privilege.
This is part of the fix for CVE-2014-8989.
Reviewed-by: Andy Lutomirski <luto@amacapital.net> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
setresuid allows the euid to be set to any of uid, euid, suid, and
fsuid. Therefore it is safe to allow an unprivileged user to map
their euid and use CAP_SETUID privileges with exactly that uid,
as no new credentials can be obtained.
I can not find a combination of existing system calls that allows setting
uid, euid, suid, and fsuid from the fsuid, making the previous use
of fsuid for allowing unprivileged mappings a bug.
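The core of the change in new_idmap_permitted() is roughly:

 		if (cap_setid == CAP_SETUID) {
 			kuid_t uid = make_kuid(ns->parent, id);
-			if (uid_eq(uid, cred->fsuid))
+			if (uid_eq(uid, cred->euid))
 				return true;
 		}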
This is part of a fix for CVE-2014-8989.
Reviewed-by: Andy Lutomirski <luto@amacapital.net> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Since any gid mapping will allow (and, for backwards compatibility,
must allow) dropping groups, don't allow any gid mappings to be
established without CAP_SETGID in the parent user namespace.
For a small class of applications this change breaks userspace
and removes useful functionality. This small class of applications
includes tools/testing/selftests/mount/unprivileged-remount-test.c
Most of the removed functionality will be added back with the addition
of a one way knob to disable setgroups. Once setgroups is disabled
setting the gid_map becomes as safe as setting the uid_map.
For more common applications that set the uid_map and the gid_map
with privilege this change will have no effect.
This is part of a fix for CVE-2014-8989.
Reviewed-by: Andy Lutomirski <luto@amacapital.net> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
setgroups is unique in not needing a valid mapping before it can be called,
in the case of setgroups(0, NULL) which drops all supplemental groups.
The design of the user namespace assumes that CAP_SETGID can not actually
be used until a gid mapping is established. Therefore add a helper function
to see if the user namespace gid mapping has been established and call
that function in the setgroups permission check.
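A sketch of the helper, assuming the mutex that already serializes map
updates:

bool userns_may_setgroups(const struct user_namespace *ns)
{
	bool allowed;

	mutex_lock(&userns_state_mutex);
	/* it is not safe to use setgroups until a gid mapping exists */
	allowed = ns->gid_map.nr_extents != 0;
	mutex_unlock(&userns_state_mutex);

	return allowed;
}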
This is part of the fix for CVE-2014-8989, being able to drop groups
without privilege using user namespaces.
Reviewed-by: Andy Lutomirski <luto@amacapital.net> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
The rule is simple. Don't allow anything that wouldn't be allowed
without unprivileged mappings.
It was previously overlooked that establishing gid mappings would
allow dropping groups and potentially gaining permission to files and
directories that had lesser permissions for a specific group than for
all other users.
This is the rule needed to fix CVE-2014-8989 and prevent any other
security issues with new_idmap_permitted.
The reason for this rule is that the unix permission model is old and
there are programs out there somewhere that take advantage of every
little corner of it. So allowing a uid or gid mapping to be
established without privilege that would allow anything that would not
be allowed without that mapping will result in expectations from some
code somewhere being violated. Violated expectations about the
behavior of the OS is a long way of saying "security issue".
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz>