HPET irq is routed to i8259 and then to MIPS CPU irq (cascade). After
commit a3e6c1eff5 (MIPS: IRQ: Fix disable_irq on CPU IRQs), if without
IRQF_NO_SUSPEND in cascade_irqaction, HPET interrupts will lost during
suspend. The result is machine cannot be waken up.
memsize denotes the amount of RAM we can access from kseg{0,1} and
that should be up to 256M. In case the bootloader reports a value
higher than that (perhaps reporting all the available RAM) it's best
if we fix it ourselves and just warn the user about that. This is
usually a problem with the bootloader and/or its environment.
[ralf@linux-mips.org: Remove useless parens as suggested bei Sergei.
Reformat long pr_warn statement to fit into 80 column limit.]
The lose_fpu() function only disables the FPU in CP0_Status.CU1 if the
FPU is in use and MSA isn't enabled.
This isn't necessarily a problem because KSTK_STATUS(current), the
version of CP0_Status stored on the kernel stack on entry from user
mode, does always get updated and gets restored when returning to user
mode, but I don't think it was intended, and it is inconsistent with the
case of only the FPU being in use. Sometimes leaving the FPU enabled may
also mask kernel bugs where FPU operations are executed when the FPU
might not be enabled.
Guest user mode can generate a guest MSA Disabled exception on an MSA
capable core by simply trying to execute an MSA instruction. Since this
exception is unknown to KVM it will be passed on to the guest kernel.
However guest Linux kernels prior to v3.15 do not set up an exception
handler for the MSA Disabled exception as they don't support any MSA
capable cores. This results in a guest OS panic.
Since an older processor ID may be being emulated, and MSA support is
not advertised to the guest, the correct behaviour is to generate a
Reserved Instruction exception in the guest kernel so it can send the
guest process an illegal instruction signal (SIGILL), as would happen
with a non-MSA-capable core.
Fix this as minimally as reasonably possible by preventing
kvm_mips_check_privilege() from relaying MSA Disabled exceptions from
guest user mode to the guest kernel, and handling the MSA Disabled
exception by emulating a Reserved Instruction exception in the guest,
via a new handle_msa_disabled() KVM callback.
When userland injects a SPI via the KVM_IRQ_LINE ioctl we currently
only check it against a fixed limit, which historically is set
to 127. With the new dynamic IRQ allocation the effective limit may
actually be smaller (64).
So when now a malicious or buggy userland injects a SPI in that
range, we spill over on our VGIC bitmaps and bytemaps memory.
I could trigger a host kernel NULL pointer dereference with current
mainline by injecting some bogus IRQ number from a hacked kvmtool:
-----------------
....
DEBUG: kvm_vgic_inject_irq(kvm, cpu=0, irq=114, level=1)
DEBUG: vgic_update_irq_pending(kvm, cpu=0, irq=114, level=1)
DEBUG: IRQ #114 still in the game, writing to bytemap now...
Unable to handle kernel NULL pointer dereference at virtual address 00000000
pgd = ffffffc07652e000
[00000000] *pgd=00000000f658b003, *pud=00000000f658b003, *pmd=0000000000000000
Internal error: Oops: 96000006 [#1] PREEMPT SMP
Modules linked in:
CPU: 1 PID: 1053 Comm: lkvm-msi-irqinj Not tainted 4.0.0-rc7+ #3027
Hardware name: FVP Base (DT)
task: ffffffc0774e9680 ti: ffffffc0765a8000 task.ti: ffffffc0765a8000
PC is at kvm_vgic_inject_irq+0x234/0x310
LR is at kvm_vgic_inject_irq+0x30c/0x310
pc : [<ffffffc0000ae0a8>] lr : [<ffffffc0000ae180>] pstate: 80000145
.....
So this patch fixes this by checking the SPI number against the
actual limit. Also we remove the former legacy hard limit of
127 in the ioctl code.
Signed-off-by: Andre Przywara <andre.przywara@arm.com> Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org> CC: <stable@vger.kernel.org> # 4.0, 3.19, 3.18
[maz: wrap KVM_ARM_IRQ_GIC_MAX with #ifndef __KERNEL__,
as suggested by Christopher Covington] Signed-off-by: Marc Zyngier <marc.zyngier@arm.com> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
kvm_write_guest_cached() does not mark all written pages as dirty and
code comments in kvm_gfn_to_hva_cache_init() talk about NULL memslot
with cross page accesses. Fix all the easy way.
The check is '<= 1' to have the same result for 'len = 0' cache anywhere
in the page. (nr_pages_needed is 0 on page boundary.)
Fixes: 8f964525a121 ("KVM: Allow cross page reads and writes from cached translations.") Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Message-Id: <20150408121648.GA3519@potion.brq.redhat.com> Reviewed-by: Wanpeng Li <wanpeng.li@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Sebastian reported a crash caused by a jump label mismatch after resume.
This happens because we do not save the kernel text section during suspend
and therefore also do not restore it during resume, but use the kernel image
that restores the old system.
This means that after a suspend/resume cycle we lost all modifications done
to the kernel text section.
The reason for this is the pfn_is_nosave() function, which incorrectly
returns that read-only pages don't need to be saved. This is incorrect since
we mark the kernel text section read-only.
We still need to make sure to not save and restore pages contained within
NSS and DCSS segment.
To fix this add an extra case for the kernel text section and only save
those pages if they are not contained within an NSS segment.
Fixes the following crash (and the above bugs as well):
This fixes a bug introduced with commit c05c4186bbe4 ("KVM: s390:
add floating irq controller").
get_all_floating_irqs() does copy_to_user() while holding
a spin lock. Let's fix this by filling a temporary buffer
first and copy it to userspace after giving up the lock.
Cc: <stable@vger.kernel.org> # 3.18+: 69a8d4562638 KVM: s390: no need to hold... Reviewed-by: David Hildenbrand <dahi@linux.vnet.ibm.com> Signed-off-by: Jens Freimann <jfrei@linux.vnet.ibm.com> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com> Acked-by: Cornelia Huck <cornelia.huck@de.ibm.com> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
The kvm mutex was (probably) used to protect against cpu hotplug.
The current code no longer needs to protect against that, as we only
rely on CPU data structures that are guaranteed to be available
if we can access the CPU. (e.g. vcpu_create will put the cpu
in the array AFTER the cpu is ready).
The reinjection of an I/O interrupt can fail if the list is at the limit
and between the dequeue and the reinjection, another I/O interrupt is
injected (e.g. if user space floods kvm with I/O interrupts).
This patch avoids this memory leak and returns -EFAULT in this special
case. This error is not recoverable, so let's fail hard. This can later
be avoided by not dequeuing the interrupt but working directly on the
locked list.
Signed-off-by: David Hildenbrand <dahi@linux.vnet.ibm.com> Cc: stable@vger.kernel.org # 3.16+ Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
If the I/O interrupt could not be written to the guest provided
area (e.g. access exception), a program exception was injected into the
guest but "inti" wasn't freed, therefore resulting in a memory leak.
In addition, the I/O interrupt wasn't reinjected. Therefore the dequeued
interrupt is lost.
This patch fixes the problem while cleaning up the function and making the
cc and rc logic easier to handle.
Signed-off-by: David Hildenbrand <dahi@linux.vnet.ibm.com> Cc: stable@vger.kernel.org # 3.16+ Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Function-specific setup requests should be handled in such a way, that
apart from filling in the data buffer, the requests are also actually
enqueued: if function-specific setup is called from composte_setup(),
the "usb_ep_queue()" block of code in composite_setup() is skipped.
The printer function lacks this part and it results in e.g. get device id
requests failing: the host expects some response, the device prepares it
but does not equeue it for sending to the host, so the host finally asserts
timeout.
This patch adds enqueueing the prepared responses.
Cc: <stable@vger.kernel.org> # v3.4+ Fixes: 2e87edf49227: "usb: gadget: make g_printer use composite" Signed-off-by: Andrzej Pietrasiewicz <andrzej.p@samsung.com> Signed-off-by: Felipe Balbi <balbi@ti.com> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
If we pass a length of 0 to the extent_same ioctl, we end up locking an
extent range with a start offset greater then its end offset (if the
destination file's offset is greater than zero). This results in a warning
from extent_io.c:insert_state through the following call chain:
This leads to an infinite loop when evicting the inode. This is the same
problem that my previous patch titled
"Btrfs: fix inode eviction infinite loop after cloning into it" addressed
but for the extent_same ioctl instead of the clone ioctl.
Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Omar Sandoval <osandov@osandov.com> Signed-off-by: Chris Mason <clm@fb.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 9dc106617d5669a6f8d86e08f620dc2fb0413e21) Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
If we attempt to clone a 0 length region into a file we can end up
inserting a range in the inode's extent_io tree with a start offset
that is greater then the end offset, which triggers immediately the
following warning:
So just bail out of the clone ioctl if the length of the region to clone
is zero, without locking any extent range, in order to prevent this issue
(same behaviour as a pwrite with a 0 length for example).
This is trivial to reproduce. For example, the steps for the test I just
made for fstests:
mkfs.btrfs -f SCRATCH_DEV
mount SCRATCH_DEV $SCRATCH_MNT
While committing a transaction we free the log roots before we write the
new super block. Freeing the log roots implies marking the disk location
of every node/leaf (metadata extent) as pinned before the new super block
is written. This is to prevent the disk location of log metadata extents
from being reused before the new super block is written, otherwise we
would have a corrupted log tree if before the new super block is written
a crash/reboot happens and the location of any log tree metadata extent
ended up being reused and rewritten.
Even though we pinned the log tree's metadata extents, we were issuing a
discard against them if the fs was mounted with the -o discard option,
resulting in corruption of the log tree if a crash/reboot happened before
writing the new super block - the next time the fs was mounted, during
the log replay process we would find nodes/leafs of the log btree with
a content full of zeroes, causing the process to fail and require the
use of the tool btrfs-zero-log to wipeout the log tree (and all data
previously fsynced becoming lost forever).
Fix this by not doing a discard when pinning an extent. The discard will
be done later when it's safe (after the new super block is committed) at
extent-tree.c:btrfs_finish_extent_commit().
Fixes: e688b7252f78 (Btrfs: fix extent pinning bugs in the tree log) Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 3909e5a93ed64a186a396c1b7fd1db07e065728f) Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
kvm_init_msr_list is currently called before hardware_setup. As a result,
vmx_mpx_supported always returns false when kvm_init_msr_list checks whether to
save MSR_IA32_BNDCFGS.
Move kvm_init_msr_list after vmx_hardware_setup is called to fix this issue.
Signed-off-by: Nadav Amit <namit@cs.technion.ac.il> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Message-Id: <1428864435-4732-1-git-send-email-namit@cs.technion.ac.il> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
(cherry picked from commit 702a71cf592282298395b3359f49a9a985182934) Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
To fully take advantage of MWAIT, apparently the CLFLUSH instruction needs
another quirk on certain CPUs: proper barriers around it on certain machines.
On a Q6600 SMP system, pipe-test scheduling performance, cross core,
improves significantly:
Since X86_BUG_CLFLUSH_MONITOR is already a quirk, don't create a separate
quirk for the extra smp_mb()s.
Signed-off-by: Mike Galbraith <bitbucket@online.de> Cc: <stable@vger.kernel.org> # 3.10+ Cc: Borislav Petkov <bp@alien8.de> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Ian Malone <ibmalone@gmail.com> Cc: Josh Boyer <jwboyer@redhat.com> Cc: Len Brown <len.brown@intel.com> Cc: Len Brown <lenb@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1390061684.5566.4.camel@marge.simpson.net
[ Ported to recent kernel, added comments about the quirk. ] Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
69fb3676df33 ("x86 idle: remove mwait_idle() and "idle=mwait" cmdline param")
The reasoning was that modern machines should be sufficiently
happy during the boot process using the default_idle() HALT
loop, until cpuidle loads and either acpi_idle or intel_idle
invoke the newer MWAIT-with-hints idle loop.
But two machines reported problems:
1. Certain Core2-era machines support MWAIT-C1 and HALT only.
MWAIT-C1 is preferred for optimal power and performance.
But if they support just C1, cpuidle never loads and
so they use the boot-time default idle loop forever.
2. Some laptops will boot-hang if HALT is used,
but will boot successfully if MWAIT is used.
This appears to be a hidden assumption in BIOS SMI,
that is presumably valid on the proprietary OS
where the BIOS was validated.
https://bugzilla.kernel.org/show_bug.cgi?id=60770
So here we effectively revert the patch above, restoring
the mwait_idle() loop. However, we don't bother restoring
the idle=mwait cmdline parameter, since it appears to add
no value.
Maintainer notes:
For 3.9, simply revert 69fb3676df
for 3.10, patch -F3 applies, fuzz needed due to __cpuinit use in
context For 3.11, 3.12, 3.13, this patch applies cleanly
Tested-by: Mike Galbraith <bitbucket@online.de> Signed-off-by: Len Brown <len.brown@intel.com> Acked-by: Mike Galbraith <bitbucket@online.de> Cc: <stable@vger.kernel.org> # 3.9+ Cc: Borislav Petkov <bp@alien8.de> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Ian Malone <ibmalone@gmail.com> Cc: Josh Boyer <jwboyer@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/345254a551eb5a6a866e048d7ab570fd2193aca4.1389763084.git.len.brown@intel.com
[ Ported to recent kernels. ] Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Fixes: 79930f5892e ("net: do not deplete pfmemalloc reserve") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Sasha Levin <sasha.levin@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
build_skb() should look at the page pfmemalloc status.
If set, this means page allocator allocated this page in the
expectation it would help to free other pages. Networking
stack can do that only if skb->pfmemalloc is also set.
Also, we must refrain using high order pages from the pfmemalloc
reserve, so __page_frag_refill() must also use __GFP_NOMEMALLOC for
them. Under memory pressure, using order-0 pages is probably the best
strategy.
Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Presence of an unbound loop in tcp_send_fin() had always been hard
to explain when analyzing crash dumps involving gigantic dying processes
with millions of sockets.
Lets try a different strategy :
In case of memory pressure, try to add the FIN flag to last packet
in write queue, even if packet was already sent. TCP stack will
be able to deliver this FIN after a timeout event. Note that this
FIN being delivered by a retransmit, it also carries a Push flag
given our current implementation.
By checking sk_under_memory_pressure(), we anticipate that cooking
many FIN packets might deplete tcp memory.
In the case we could not allocate a packet, even with __GFP_WAIT
allocation, then not sending a FIN seems quite reasonable if it allows
to get rid of this socket, free memory, and not block the process from
eventually doing other useful work.
Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Using sk_stream_alloc_skb() in tcp_send_fin() is dangerous in
case a huge process is killed by OOM, and tcp_mem[2] is hit.
To be able to free memory we need to make progress, so this
patch allows FIN packets to not care about tcp_mem[2], if
skb allocation succeeded.
In a follow-up patch, we might abort tcp_send_fin() infinite loop
in case TIF_MEMDIE is set on this thread, as memory allocator
did its best getting extra memory already.
This patch reverts d22e15371811 ("tcp: fix tcp fin memory accounting")
Fixes: d22e15371811 ("tcp: fix tcp fin memory accounting") Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Call checksum_complete_unset in PPP receive to discard checksum-complete
value. PPP does not pull checksum for headers and also modifies packet
as in VJ compression.
Signed-off-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
This function changes ip_summed to CHECKSUM_NONE if CHECKSUM_COMPLETE
is set. This is called to discard checksum-complete when packet
is being modified and checksum is not pulled for headers in a layer.
Signed-off-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Forwarded frames should not have a socket attached. Especially
tw sockets will lead to panics later-on in the stack.
This was observed with TPROXY assigning a tw socket and broken
policy routing (misconfigured). As a result frame enters
forwarding path instead of input. We cannot solve this in
TPROXY as it cannot know that policy routing is broken.
v2:
Remove useless comment
Signed-off-by: Sebastian Poehn <sebastian.poehn@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
When system is out of memory, refilling of RX buffers fails while
the driver continue to pass the received packets to the kernel stack.
At some point, when all RX buffers deplete, driver may fall into a
sleep, and not recover when memory for new RX buffers is once again
availible. This is because hardware does not have valid descriptors,
so no interrupt will be generated for the driver to return to work
in napi context. Fix it by schedule the napi poll function from
stats_task delayed workqueue, as long as the allocations fail.
Signed-off-by: Ido Shamay <idos@mellanox.com> Signed-off-by: Amir Vadai <amirv@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
By default, the number of tx queues is limited by the number of online cpus
in mlx4_en_get_profile(). However, this limit no longer holds after the
ethtool .set_channels method has been called. In that situation, the driver
may access invalid bits of certain cpumask variables when queue_index >=
nr_cpu_ids.
Signed-off-by: Benjamin Poirier <bpoirier@suse.de> Acked-by: Ido Shamay <idos@mellanox.com> Fixes: d03a68f ("net/mlx4_en: Configure the XPS queue mapping on driver load") Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
There is an interesting bug in the vgic code, which manifests itself
when the KVM run loop has a signal pending or needs a vmid generation
rollover after having disabled interrupts but before actually switching
to the guest.
In this case, we flush the vgic as usual, but we sync back the vgic
state and exit to userspace before entering the guest. The consequence
is that we will be syncing the list registers back to the software model
using the GICH_ELRSR and GICH_EISR from the last execution of the guest,
potentially overwriting a list register containing an interrupt.
This showed up during migration testing where we would capture a state
where the VM has masked the arch timer but there were no interrupts,
resulting in a hung test.
Cc: Marc Zyngier <marc.zyngier@arm.com> Reported-by: Alex Bennee <alex.bennee@linaro.org> Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org> Signed-off-by: Alex Bennée <alex.bennee@linaro.org> Acked-by: Marc Zyngier <marc.zyngier@arm.com> Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org> Signed-off-by: Shannon Zhao <shannon.zhao@linaro.org> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
The kernel's pgd_index macro is designed to index a normal, page
sized array. KVM is a bit diffferent, as we can use concatenated
pages to have a bigger address space (for example 40bit IPA with
4kB pages gives us an 8kB PGD.
In the above case, the use of pgd_index will always return an index
inside the first 4kB, which makes a guest that has memory above
0x8000000000 rather unhappy, as it spins forever in a page fault,
whist the host happilly corrupts the lower pgd.
The obvious fix is to get our own kvm_pgd_index that does the right
thing(tm).
Tested on X-Gene with a hacked kvmtool that put memory at a stupidly
high address.
Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org> Signed-off-by: Marc Zyngier <marc.zyngier@arm.com> Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org> Signed-off-by: Shannon Zhao <shannon.zhao@linaro.org> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
We're using __get_free_pages with to allocate the guest's stage-2
PGD. The standard behaviour of this function is to return a set of
pages where only the head page has a valid refcount.
This behaviour gets us into trouble when we're trying to increment
the refcount on a non-head page:
A possible approach for this is to split the compound page using
split_page() at allocation time, and change the teardown path to
free one page at a time. It turns out that alloc_pages_exact() and
free_pages_exact() does exactly that.
While we're at it, the PGD allocation code is reworked to reduce
duplication.
This has been tested on an X-Gene platform with a 4kB/48bit-VA host
kernel, and kvmtool hacked to place memory in the second page of
the hardware PGD (PUD for the host kernel). Also regression-tested
on a Cubietruck (Cortex-A7).
[ Reworked to use alloc_pages_exact() and free_pages_exact() and to
return pointers directly instead of by reference as arguments
- Christoffer ]
Reported-by: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Marc Zyngier <marc.zyngier@arm.com> Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org> Signed-off-by: Shannon Zhao <shannon.zhao@linaro.org> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
When handling a fault in stage-2, we need to resync I$ and D$, just
to be sure we don't leave any old cache line behind.
That's very good, except that we do so using the *user* address.
Under heavy load (swapping like crazy), we may end up in a situation
where the page gets mapped in stage-2 while being unmapped from
userspace by another CPU.
At that point, the DC/IC instructions can generate a fault, which
we handle with kvm->mmu_lock held. The box quickly deadlocks, user
is unhappy.
Instead, perform this invalidation through the kernel mapping,
which is guaranteed to be present. The box is much happier, and so
am I.
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com> Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org> Signed-off-by: Shannon Zhao <shannon.zhao@linaro.org> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Let's assume a guest has created an uncached mapping, and written
to that page. Let's also assume that the host uses a cache-coherent
IO subsystem. Let's finally assume that the host is under memory
pressure and starts to swap things out.
Before this "uncached" page is evicted, we need to make sure
we invalidate potential speculated, clean cache lines that are
sitting there, or the IO subsystem is going to swap out the
cached view, loosing the data that has been written directly
into memory.
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com> Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org> Signed-off-by: Shannon Zhao <shannon.zhao@linaro.org> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Commit b856a59141b1 (arm/arm64: KVM: Reset the HCR on each vcpu
when resetting the vcpu) moved the init of the HCR register to
happen later in the init of a vcpu, but left out the fixup
done in kvm_reset_vcpu when preparing for a 32bit guest.
As a result, the 32bit guest is run as a 64bit guest, but the
rest of the kernel still manages it as a 32bit. Fun follows.
Moving the fixup to vcpu_reset_hcr solves the problem for good.
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com> Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Shannon Zhao <shannon.zhao@linaro.org> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
It took about two years for someone to notice that the IPA passed
to TLBI IPAS2E1IS must be shifted by 12 bits. Clearly our reviewing
is not as good as it should be...
Paper bag time for me.
Reported-by: Mario Smarduch <m.smarduch@samsung.com> Tested-by: Mario Smarduch <m.smarduch@samsung.com> Signed-off-by: Marc Zyngier <marc.zyngier@arm.com> Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Shannon Zhao <shannon.zhao@linaro.org> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
It is curently possible to run a VM with architected timers support
without creating an in-kernel VGIC, which will result in interrupts from
the virtual timer going nowhere.
To address this issue, move the architected timers initialization to the
time when we run a VCPU for the first time, and then only initialize
(and enable) the architected timers if we have a properly created and
initialized in-kernel VGIC.
When injecting interrupts from the virtual timer to the vgic, the
current setup should ensure that this never calls an on-demand init of
the VGIC, which is the only call path that could return an error from
kvm_vgic_inject_irq(), so capture the return value and raise a warning
if there's an error there.
We also change the kvm_timer_init() function from returning an int to be
a void function, since the function always succeeds.
Reviewed-by: Marc Zyngier <marc.zyngier@arm.com> Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org> Signed-off-by: Shannon Zhao <shannon.zhao@linaro.org> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Userspace assumes that it can wire up IRQ injections after having
created all VCPUs and after having created the VGIC, but potentially
before starting the first VCPU. This can currently lead to lost IRQs
because the state of that IRQ injection is not stored anywhere and we
don't return an error to userspace.
We haven't seen this problem manifest itself yet, presumably because
guests reset the devices on boot, but this could cause issues with
migration and other non-standard startup configurations.
Reviewed-by: Marc Zyngier <marc.zyngier@arm.com> Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org> Signed-off-by: Shannon Zhao <shannon.zhao@linaro.org> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
When call kvm_vgic_inject_irq to inject interrupt, we can known which
vcpu the interrupt for by the irq_num and the cpuid. So we should just
kick this vcpu to avoid iterating through all.
Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org> Signed-off-by: Shannon Zhao <zhaoshenglong@huawei.com> Signed-off-by: Marc Zyngier <marc.zyngier@arm.com> Signed-off-by: Shannon Zhao <shannon.zhao@linaro.org> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
When the vgic initializes its internal state it does so based on the
number of VCPUs available at the time. If we allow KVM to create more
VCPUs after the VGIC has been initialized, we are likely to error out in
unfortunate ways later, perform buffer overflows etc.
Acked-by: Marc Zyngier <marc.zyngier@arm.com> Reviewed-by: Eric Auger <eric.auger@linaro.org> Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org> Signed-off-by: Shannon Zhao <shannon.zhao@linaro.org> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
VGIC initialization currently happens in three phases:
(1) kvm_vgic_create() (triggered by userspace GIC creation)
(2) vgic_init_maps() (triggered by userspace GIC register read/write
requests, or from kvm_vgic_init() if not already run)
(3) kvm_vgic_init() (triggered by first VM run)
We were doing initialization of some state to correspond with the
state of a freshly-reset GIC in kvm_vgic_init(); this is too late,
since it will overwrite changes made by userspace using the
register access APIs before the VM is run. Move this initialization
earlier, into the vgic_init_maps() phase.
This fixes a bug where QEMU could successfully restore a saved
VM state snapshot into a VM that had already been run, but could
not restore it "from cold" using the -loadvm command line option
(the symptoms being that the restored VM would run but interrupts
were ignored).
Finally rename vgic_init_maps to vgic_init and renamed kvm_vgic_init to
kvm_vgic_map_resources.
[ This patch is originally written by Peter Maydell, but I have
modified it somewhat heavily, renaming various bits and moving code
around. If something is broken, I am to be blamed. - Christoffer ]
Acked-by: Marc Zyngier <marc.zyngier@arm.com> Reviewed-by: Eric Auger <eric.auger@linaro.org> Signed-off-by: Peter Maydell <peter.maydell@linaro.org> Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org> Signed-off-by: Shannon Zhao <shannon.zhao@linaro.org> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Introduce a new function to unmap user RAM regions in the stage2 page
tables. This is needed on reboot (or when the guest turns off the MMU)
to ensure we fault in pages again and make the dcache, RAM, and icache
coherent.
Using unmap_stage2_range for the whole guest physical range does not
work, because that unmaps IO regions (such as the GIC) which will not be
recreated or in the best case faulted in on a page-by-page basis.
Call this function on secondary and subsequent calls to the
KVM_ARM_VCPU_INIT ioctl so that a reset VCPU will detect the guest
Stage-1 MMU is off when faulting in pages and make the caches coherent.
Acked-by: Marc Zyngier <marc.zyngier@arm.com> Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org> Signed-off-by: Shannon Zhao <shannon.zhao@linaro.org> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
When a vcpu calls SYSTEM_OFF or SYSTEM_RESET with PSCI v0.2, the vcpus
should really be turned off for the VM adhering to the suggestions in
the PSCI spec, and it's the sane thing to do.
Also, clarify the behavior and expectations for exits to user space with
the KVM_EXIT_SYSTEM_EVENT case.
Acked-by: Marc Zyngier <marc.zyngier@arm.com> Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org> Signed-off-by: Shannon Zhao <shannon.zhao@linaro.org> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
When userspace resets the vcpu using KVM_ARM_VCPU_INIT, we should also
reset the HCR, because we now modify the HCR dynamically to
enable/disable trapping of guest accesses to the VM registers.
This is crucial for reboot of VMs working since otherwise we will not be
doing the necessary cache maintenance operations when faulting in pages
with the guest MMU off.
Acked-by: Marc Zyngier <marc.zyngier@arm.com> Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org> Signed-off-by: Shannon Zhao <shannon.zhao@linaro.org> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
The implementation of KVM_ARM_VCPU_INIT is currently not doing what
userspace expects, namely making sure that a vcpu which may have been
turned off using PSCI is returned to its initial state, which would be
powered on if userspace does not set the KVM_ARM_VCPU_POWER_OFF flag.
Implement the expected functionality and clarify the ABI.
Acked-by: Marc Zyngier <marc.zyngier@arm.com> Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org> Signed-off-by: Shannon Zhao <shannon.zhao@linaro.org> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
If a VCPU was originally started with power off (typically to be brought
up by PSCI in SMP configurations), there is no need to clear the
POWER_OFF flag in the kernel, as this flag is only tested during the
init ioctl itself.
Acked-by: Marc Zyngier <marc.zyngier@arm.com> Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org> Signed-off-by: Shannon Zhao <shannon.zhao@linaro.org> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Readonly memslots are often used to implement emulation of ROMs and
NOR flashes, in which case the guest may legally map these regions as
uncached.
To deal with the incoherency associated with uncached guest mappings,
treat all readonly memslots as incoherent, and ensure that pages that
belong to regions tagged as such are flushed to DRAM before being passed
to the guest.
To allow handling of incoherent memslots in a subsequent patch, this
patch adds a paramater 'ipa_uncached' to cache_coherent_guest_page()
so that we can instruct it to flush the page's contents to DRAM even
if the guest has caching globally enabled.
Memory regions may be incoherent with the caches, typically when the
guest has mapped a host system RAM backed memory region as uncached.
Add a flag KVM_MEMSLOT_INCOHERENT so that we can tag these memslots
and handle them appropriately when mapping them.
A race condition starts to be visible in recent mmotm, where a PG_hwpoison
flag is set on a migration source page *before* it's back in buddy page
poo= l.
This is problematic because no page flag is supposed to be set when
freeing (see __free_one_page().) So the user-visible effect of this race
is that it could trigger the BUG_ON() when soft-offlining is called.
The root cause is that we call lru_add_drain_all() to make sure that the
page is in buddy, but that doesn't work because this function just
schedule= s a work item and doesn't wait its completion.
drain_all_pages() does drainin= g directly, so simply dropping
lru_add_drain_all() solves this problem.
The Asus Z97-DELUXE motherboard contains a Broadcom based Bluetooth
controller on the USB bus. However vendor and product ID are listed
as ASUSTek Computer.
If a USB serial device is unplugged while there is an active program
using the device it may spam the logs with -EPROTO (71) messages as it
attempts to retry.
Most serial usb drivers (metro-usb, pl2303, mos7840, ...) only output
these messages for debugging. The generic driver treats these as
errors.
Change the default output for the generic serial driver from error to
debug to silence these non-critical errors.
Bo Yan [Fri, 24 Apr 2015 17:30:52 +0000 (10:30 -0700)]
arm64: fix midr range for Cortex-A57 erratum 832075
Register MIDR_EL1 is masked to get variant and revision fields, then
compared against midr_range_min and midr_range_max when checking
whether CPU is affected by any particular erratum. However, variant
and revision fields in MIDR_EL1 are separated by 16 bits, so the min
and max of midr range should be constructed accordingly, otherwise
the patch will not be applied when variant field is non-0.
Acked-by: Andre Przywara <andre.przywara@arm.com> Reviewed-by: Paul Walmsley <paul@pwsan.com> Signed-off-by: Bo Yan <byan@nvidia.com>
[will: use MIDR_VARIANT_SHIFT to construct upper bound] Signed-off-by: Will Deacon <will.deacon@arm.com> Cc: <stable@vger.kernel.org> # v3.18.y
(cherry picked from commit 6d1966dfd6e0ad2f8aa4b664ae1a62e33abe1998) Signed-off-by: Kevin Hilman <khilman@linaro.org> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Will Deacon [Fri, 24 Apr 2015 17:30:51 +0000 (10:30 -0700)]
arm64: errata: add workaround for cortex-a53 erratum #845719
When running a compat (AArch32) userspace on Cortex-A53, a load at EL0
from a virtual address that matches the bottom 32 bits of the virtual
address used by a recent load at (AArch64) EL1 might return incorrect
data.
This patch works around the issue by writing to the contextidr_el1
register on the exception return path when returning to a 32-bit task.
This workaround is patched in at runtime based on the MIDR value of the
processor.
Reviewed-by: Marc Zyngier <marc.zyngier@arm.com> Tested-by: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Will Deacon <will.deacon@arm.com> Cc: <stable@vger.kernel.org> # v3.18.y
(cherry picked from commit 905e8c5dcaa147163672b06fe9dcb5abaacbc711) Signed-off-by: Kevin Hilman <khilman@linaro.org> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Andre Przywara [Fri, 24 Apr 2015 17:30:50 +0000 (10:30 -0700)]
arm64: protect alternatives workarounds with Kconfig options
Not all of the errata we have workarounds for apply necessarily to all
SoCs, so people compiling a kernel for one very specific SoC may not
need to patch the kernel.
Introduce a new submenu in the "Platform selection" menu to allow
people to turn off certain bugs if they are not affected. By default
all of them are enabled.
Normal users or distribution kernels shouldn't bother to deselect any
bugs here, since the alternatives framework will take care of
patching them in only if needed.
Signed-off-by: Andre Przywara <andre.przywara@arm.com>
[will: moved kconfig menu under `Kernel Features'] Signed-off-by: Will Deacon <will.deacon@arm.com> Cc: <stable@vger.kernel.org> # v3.18.y
(cherry picked from commit c0a01b84b1fdbd98bff5bca5b201fe73fda7e9d9) Signed-off-by: Kevin Hilman <khilman@linaro.org> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Andre Przywara [Fri, 24 Apr 2015 17:30:49 +0000 (10:30 -0700)]
arm64: add Cortex-A57 erratum 832075 workaround
The ARM erratum 832075 applies to certain revisions of Cortex-A57,
one of the workarounds is to change device loads into using
load-aquire semantics.
This is achieved using the alternatives framework.
Signed-off-by: Andre Przywara <andre.przywara@arm.com> Signed-off-by: Will Deacon <will.deacon@arm.com> Cc: <stable@vger.kernel.org> # v3.18.y
(cherry picked from commit 5afaa1fc1b320cec48affa7e6949f2493f875c12) Signed-off-by: Kevin Hilman <khilman@linaro.org> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Andre Przywara [Fri, 24 Apr 2015 17:30:48 +0000 (10:30 -0700)]
arm64: add Cortex-A53 cache errata workaround
The ARM errata 819472, 826319, 827319 and 824069 define the same
workaround for these hardware issues in certain Cortex-A53 parts.
Use the new alternatives framework and the CPU MIDR detection to
patch "cache clean" into "cache clean and invalidate" instructions if
an affected CPU is detected at runtime.
Signed-off-by: Andre Przywara <andre.przywara@arm.com>
[will: add __maybe_unused to squash gcc warning] Signed-off-by: Will Deacon <will.deacon@arm.com> Cc: <stable@vger.kernel.org> # v3.18.y
(cherry picked from commit 301bcfac42897dbd1b0b3c1be49f24654a1bc49e) Signed-off-by: Kevin Hilman <khilman@linaro.org> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Andre Przywara [Fri, 24 Apr 2015 17:30:47 +0000 (10:30 -0700)]
arm64: detect silicon revisions and set cap bits accordingly
After each CPU has been started, we iterate through a list of
CPU features or bugs to detect CPUs which need (or could benefit
from) kernel code patches.
For each feature/bug there is a function which checks if that
particular CPU is affected. We will later provide some more generic
functions for common things like testing for certain MIDR ranges.
We do this for every CPU to cover big.LITTLE systems properly as
well.
If a certain feature/bug has been detected, the capability bit will
be set, so that later the call to apply_alternatives() will trigger
the actual code patching.
Signed-off-by: Andre Przywara <andre.przywara@arm.com> Signed-off-by: Will Deacon <will.deacon@arm.com> Cc: <stable@vger.kernel.org> # v3.18.y
(cherry picked from commit e116a375423393cdb94714e90a96857005d58428) Signed-off-by: Kevin Hilman <khilman@linaro.org> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Andre Przywara [Fri, 24 Apr 2015 17:30:46 +0000 (10:30 -0700)]
arm64: add alternative runtime patching
With a blatant copy of some x86 bits we introduce the alternative
runtime patching "framework" to arm64.
This is quite basic for now and we only provide the functions we need
at this time.
This is connected to the newly introduced feature bits.
Signed-off-by: Andre Przywara <andre.przywara@arm.com> Signed-off-by: Will Deacon <will.deacon@arm.com> Cc: <stable@vger.kernel.org> # v3.18.y
(cherry picked from commit e039ee4ee3fcf174736f2cb0a2eed6cb908348a6) Signed-off-by: Kevin Hilman <khilman@linaro.org> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Andre Przywara [Fri, 24 Apr 2015 17:30:45 +0000 (10:30 -0700)]
arm64: add cpu_capabilities bitmap
For taking note if at least one CPU in the system needs a bug
workaround or would benefit from a code optimization, we create a new
bitmap to hold (artificial) feature bits.
Since elf_hwcap is part of the userland ABI, we keep it alone and
introduce a new data structure for that (along with some accessors).
Signed-off-by: Andre Przywara <andre.przywara@arm.com> Signed-off-by: Will Deacon <will.deacon@arm.com> Cc: <stable@vger.kernel.org> # v3.18.y
(cherry picked from commit 930da09f5e50dd22fb0a8600388da8677d62d671) Signed-off-by: Kevin Hilman <khilman@linaro.org> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
tg3_init_one() calls tg3_halt() without tp->lock despite its assumption
and causes deadlock.
If lockdep is enabled, a warning like this shows up before the stall:
[ BUG: bad unlock balance detected! ]
3.19.0test #3 Tainted: G E
-------------------------------------
insmod/369 is trying to release lock (&(&tp->lock)->rlock) at:
[<ffffffffa02d5a1d>] tg3_chip_reset+0x14d/0x780 [tg3]
but there are no more locks to release!
tg3_init_one() doesn't call tg3_halt() under normal situation but
during kexec kdump I hit this problem.
Fixes: 932f19de ("tg3: Release tp->lock before invoking synchronize_irq()") Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
cdc_ncm disagrees with usbnet about how much framing overhead should
be counted in the tx_bytes statistics, and tries 'fix' this by
decrementing tx_bytes on the transmit path. But statistics must never
be decremented except due to roll-over; this will thoroughly confuse
user-space. Also, tx_bytes is only incremented by usbnet in the
completion path.
Fix this by requiring drivers that set FLAG_MULTI_FRAME to set a
tx_bytes delta along with the tx_packets count.
Currently the usbnet core does not update the tx_packets statistic for
drivers with FLAG_MULTI_PACKET and there is no hook in the TX
completion path where they could do this.
cdc_ncm and dependent drivers are bumping tx_packets stat on the
transmit path while asix and sr9800 aren't updating it at all.
Add a packet count in struct skb_data so these drivers can fill it
in, initialise it to 1 for other drivers, and add the packet count
to the tx_packets statistic on completion.
Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk> Tested-by: Bjørn Mork <bjorn@mork.no> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
handle_offloads() calls skb_reset_inner_headers() to store
the layer pointers to the encapsulated packet. However, we
currently push the vlag tag (if there is one) onto the packet
afterwards. This changes the MAC header for the encapsulated
packet but it is not reflected in skb->inner_mac_header, which
breaks GSO and drivers which attempt to use this for encapsulation
offloads.
Fixes: 1eaa8178 ("vxlan: Add tx-vlan offload support.") Signed-off-by: Jesse Gross <jesse@nicira.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
In case of error vxlan_xmit_one() can free already freed skb.
Also fixes memory leak of dst-entry.
Fixes: acbf74a7630 ("vxlan: Refactor vxlan driver to make use
of the common UDP tunnel functions").
Signed-off-by: Pravin B Shelar <pshelar@nicira.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Use them to push skb->vlan_tci into the payload and avoid code
duplication.
Signed-off-by: Jiri Pirko <jiri@resnulli.us> Acked-by: Pravin B Shelar <pshelar@nicira.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Name fits better. Plus there's going to be introduced
__vlan_insert_tag later on.
Signed-off-by: Jiri Pirko <jiri@resnulli.us> Acked-by: Pravin B Shelar <pshelar@nicira.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
On Wed, Apr 15, 2015 at 05:41:26PM +0200, Nicolas Dichtel wrote:
> Le 15/04/2015 15:57, Herbert Xu a écrit :
> >On Wed, Apr 15, 2015 at 06:22:29PM +0800, Herbert Xu wrote:
> [snip]
> >Subject: skbuff: Do not scrub skb mark within the same name space
> >
> >The commit ea23192e8e577dfc51e0f4fc5ca113af334edff9 ("tunnels:
> Maybe add a Fixes tag?
> Fixes: ea23192e8e57 ("tunnels: harmonize cleanup done on skb on rx path")
>
> >harmonize cleanup done on skb on rx path") broke anyone trying to
> >use netfilter marking across IPv4 tunnels. While most of the
> >fields that are cleared by skb_scrub_packet don't matter, the
> >netfilter mark must be preserved.
> >
> >This patch rearranges skb_scurb_packet to preserve the mark field.
> nit: s/scurb/scrub
>
> Else it's fine for me.
Sure.
PS I used the wrong email for James the first time around. So
let me repeat the question here. Should secmark be preserved
or cleared across tunnels within the same name space? In fact,
do our security models even support name spaces?
---8<---
The commit ea23192e8e577dfc51e0f4fc5ca113af334edff9 ("tunnels:
harmonize cleanup done on skb on rx path") broke anyone trying to
use netfilter marking across IPv4 tunnels. While most of the
fields that are cleared by skb_scrub_packet don't matter, the
netfilter mark must be preserved.
This patch rearranges skb_scrub_packet to preserve the mark field.
Fixes: ea23192e8e57 ("tunnels: harmonize cleanup done on skb on rx path") Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Acked-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
This patch reverts commit b8fb4e0648a2ab3734140342002f68fb0c7d1602
because the secmark must be preserved even when a packet crosses
namespace boundaries. The reason is that security labels apply to
the system as a whole and is not per-namespace.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Commit 9a2620c877454 ("bnx2x: prevent WARN during driver unload")
switched the napi/busy_lock locking mechanism from spin_lock() into
spin_lock_bh(), breaking inter-operability with netconsole, as netpoll
disables interrupts prior to calling our napi mechanism.
This switches the driver into using atomic assignments instead of the
spinlock mechanisms previously employed.
Based on initial patch from Yuval Mintz & Ariel Elior
I basically added softirq starvation avoidance, and mixture
of atomic operations, plain writes and barriers.
Note this slightly reduces the overhead for this driver when no
busy_poll sockets are in use.
Fixes: 9a2620c877454 ("bnx2x: prevent WARN during driver unload") Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
I noticed tcpdump was giving funky timestamps for locally
generated SYNACK messages on loopback interface.
11:42:46.938990 IP 127.0.0.1.48245 > 127.0.0.2.23850: S 945476042:945476042(0) win 43690 <mss 65495,nop,nop,sackOK,nop,wscale 7>
20:28:58.502209 IP 127.0.0.2.23850 > 127.0.0.1.48245: S 3160535375:3160535375(0) ack 945476043 win 43690 <mss
65495,nop,nop,sackOK,nop,wscale 7>
This is because we need to clear skb->tstamp before
entering lower stack, otherwise net_timestamp_check()
does not set skb->tstamp.
Fixes: 7faee5c0d514 ("tcp: remove TCP_SKB_CB(skb)->when") Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Commit 1daa4303b4ca ("net/mlx4_core: Deprecate error message at
ConnectX-2 cards startup to debug") did the deprecation only for port 1
of the card. Need to deprecate for port 2 as well.
Fixes: 1daa4303b4ca ("net/mlx4_core: Deprecate error message at ConnectX-2 cards startup to debug") Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il> Signed-off-by: Amir Vadai <amirv@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
We should not consult skb->sk for output decisions in xmit recursion
levels > 0 in the stack. Otherwise local socket settings could influence
the result of e.g. tunnel encapsulation process.
ipv6 does not conform with this in three places:
1) ip6_fragment: we do consult ipv6_npinfo for frag_size
2) sk_mc_loop in ipv6 uses skb->sk and checks if we should
loop the packet back to the local socket
3) ip6_skb_dst_mtu could query the settings from the user socket and
force a wrong MTU
Furthermore:
In sk_mc_loop we could potentially land in WARN_ON(1) if we use a
PF_PACKET socket ontop of an IPv6-backed vxlan device.
Reuse xmit_recursion as we are currently only interested in protecting
tunnel devices.
Cc: Jiri Pirko <jiri@resnulli.us> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
On processing cumulative ACKs, the FRTO code was not checking the
SACKed bit, meaning that there could be a spurious FRTO undo on a
cumulative ACK of a previously SACKed skb.
The FRTO code should only consider a cumulative ACK to indicate that
an original/unretransmitted skb is newly ACKed if the skb was not yet
SACKed.
The effect of the spurious FRTO undo would typically be to make the
connection think that all previously-sent packets were in flight when
they really weren't, leading to a stall and an RTO.
Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Fixes: e33099f96d99c ("tcp: implement RFC5682 F-RTO") Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
xen-netfront limits transmitted skbs to be at most 44 segments in size. However,
GSO permits up to 65536 bytes, which means a maximum of 45 segments of 1448
bytes each. This slight reduction in the size of packets means a slight loss in
efficiency.
Since c/s 9ecd1a75d, xen-netfront sets gso_max_size to
XEN_NETIF_MAX_TX_SIZE - MAX_TCP_HEADER,
where XEN_NETIF_MAX_TX_SIZE is 65535 bytes.
The calculation used by tcp_tso_autosize (and also tcp_xmit_size_goal since c/s 6c09fa09d) in determining when to split an skb into two is
sk->sk_gso_max_size - 1 - MAX_TCP_HEADER.
So the maximum permitted size of an skb is calculated to be
(XEN_NETIF_MAX_TX_SIZE - MAX_TCP_HEADER) - 1 - MAX_TCP_HEADER.
Intuitively, this looks like the wrong formula -- we don't need two TCP headers.
Instead, there is no need to deviate from the default gso_max_size of 65536 as
this already accommodates the size of the header.
Currently, the largest skb transmitted by netfront is 63712 bytes (44 segments
of 1448 bytes each), as observed via tcpdump. This patch makes netfront send
skbs of up to 65160 bytes (45 segments of 1448 bytes each).
Similarly, the maximum allowable mtu does not need to subtract MAX_TCP_HEADER as
it relates to the size of the whole packet, including the header.
Fixes: 9ecd1a75d977 ("xen-netfront: reduce gso_max_size to account for max TCP header") Signed-off-by: Jonathan Davies <jonathan.davies@citrix.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Before commit 3900f29021f0bc7fe9815aa32f1a993b7dfdd402 ("bonding: slight
optimizztion for bond_slave_override()") the override logic was to send packets
with non-zero queue_id through the slave with corresponding queue_id, under two
conditions only - if the slave can transmit and it's up.
The above mentioned commit changed this logic by introducing an additional
condition - whether the bond is active (indirectly, using the slave_can_tx and
later - bond_is_active_slave), that prevents the user from implementing more
complex policies according to the Documentation/networking/bonding.txt.
Signed-off-by: Anton Nayshtut <anton@swortex.com> Signed-off-by: Alexey Bogoslavsky <alexey@swortex.com> Signed-off-by: Andy Gospodarek <gospo@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
tcp_v6_fill_cb() will be called twice if socket's state changes from
TCP_TIME_WAIT to TCP_LISTEN. That can result in control buffer data
corruption because in the second tcp_v6_fill_cb() call it's not copying
IP6CB(skb) anymore, but 'seq', 'end_seq', etc., so we can get weird and
unpredictable results. Performance loss of up to 1200% has been observed
in LTP/vxlan03 test.
This can be fixed by copying inet6_skb_parm to the beginning of 'cb'
only if xfrm6_policy_check() and tcp_v6_fill_cb() are going to be
called again.
Fixes: 2dc49d1680b53 ("tcp6: don't move IP6CB before xfrm6_policy_check()") Signed-off-by: Alexey Kodanev <alexey.kodanev@oracle.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
A local route may have a lower hop_limit set than global routes do.
RFC 3756, Section 4.2.7, "Parameter Spoofing"
> 1. The attacker includes a Current Hop Limit of one or another small
> number which the attacker knows will cause legitimate packets to
> be dropped before they reach their destination.
> As an example, one possible approach to mitigate this threat is to
> ignore very small hop limits. The nodes could implement a
> configurable minimum hop limit, and ignore attempts to set it below
> said limit.
Signed-off-by: D.S. Ljungmark <ljungmark@modio.se> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Netdevice registration should be performed a the end of the driver
initialization flow. If we don't do that, after calling register_netdevice,
device callbacks may be issued by higher layers of the stack before
final configuration of the device is done.
For example (VXLAN configuration race), mlx4_SET_PORT_VXLAN was issued
after the register_netdev command. System network scripts may configure
the interface (UP) right after the registration, which also attach
unicast VXLAN steering rule, before mlx4_SET_PORT_VXLAN was called,
causing the firmware to fail the rule attachment.
Fixes: 837052d0ccc5 ("net/mlx4_en: Add netdev support for TCP/IP offloads of vxlan tunneling") Signed-off-by: Ido Shamay <idos@mellanox.com> Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
On s390x, gcc 4.8 compiles this part of tcp_v6_early_demux()
struct dst_entry *dst = sk->sk_rx_dst;
if (dst)
dst = dst_check(dst, inet6_sk(sk)->rx_dst_cookie);
to code reading sk->sk_rx_dst twice, once for the test and once for
the argument of ip6_dst_check() (dst_check() is inline). This allows
ip6_dst_check() to be called with null first argument, causing a crash.
Protect sk->sk_rx_dst access by READ_ONCE() both in IPv4 and IPv6
TCP early demux code.
Fixes: 41063e9dd119 ("ipv4: Early TCP socket demux.") Fixes: c7109986db3c ("ipv6: Early TCP socket demux") Signed-off-by: Michal Kubecek <mkubecek@suse.cz> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
ACCESS_ONCE does not work reliably on non-scalar types. For
example gcc 4.6 and 4.7 might remove the volatile tag for such
accesses during the SRA (scalar replacement of aggregates) step
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=58145)
Let's provide READ_ONCE/ASSIGN_ONCE that will do all accesses via
scalar types as suggested by Linus Torvalds. Accesses larger than
the machines word size cannot be guaranteed to be atomic. These
macros will use memcpy and emit a build warning.
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Failure happens when attempting to allocate pages for
'struct kvm_memslots', however it doesn't have to be
present in physically contiguous (kmalloc-ed) address
space, change allocation to kvm_kvzalloc() so that
it will be vmalloc-ed when its size is more then a page.
A new fsync vs power fail test in xfstests indicated that XFS can
have unreliable data consistency when doing extending truncates that
require block zeroing. The blocks beyond EOF get zeroed in memory,
but we never force those changes to disk before we run the
transaction that extends the file size and exposes those blocks to
userspace. This can result in the blocks not being correctly zeroed
after a crash.
Because in-memory behaviour is correct, tools like fsx don't pick up
any coherency problems - it's not until the filesystem is shutdown
or the system crashes after writing the truncate transaction to the
journal but before the zeroed data in the page cache is flushed that
the issue is exposed.
Fix this by also flushing the dirty data in memory region between
the old size and new size when we've found blocks that need zeroing
in the truncate process.
Reported-by: Liu Bo <bo.li.liu@oracle.com>
cc: <stable@vger.kernel.org> Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Commit 4f579ae7de56 (ext4: fix punch hole on files with indirect
mapping) rewrote FALLOC_FL_PUNCH_HOLE for ext4 files with indirect
mapping. However, there are bugs in several corner cases. This fixes 5
distinct bugs:
1. When there is at least one entire level of indirection between the
start and end of the punch range and the end of the punch range is the
first block of its level, we can't return early; we have to free the
intervening levels.
2. When the end is at a higher level of indirection than the start and
ext4_find_shared returns a top branch for the end, we still need to free
the rest of the shared branch it returns; we can't decrement partial2.
3. When a punch happens within one level of indirection, we need to
converge on an indirect block that contains the start and end. However,
because the branches returned from ext4_find_shared do not necessarily
start at the same level (e.g., the partial2 chain will be shallower if
the last block occurs at the beginning of an indirect group), the walk
of the two chains can end up "missing" each other and freeing a bunch of
extra blocks in the process. This mismatch can be handled by first
making sure that the chains are at the same level, then walking them
together until they converge.
4. When the punch happens within one level of indirection and
ext4_find_shared returns a top branch for the start, we must free it,
but only if the end does not occur within that branch.
5. When the punch happens within one level of indirection and
ext4_find_shared returns a top branch for the end, then we shouldn't
free the block referenced by the end of the returned chain (this mirrors
the different levels case).
The hrtimer_{start/cancel} functions call into tracing which uses RCU.
But it is not legal to call into RCU in cpuidle because it is one of the
quiescent states. Hence protect this region with RCU_NONIDLE which informs
RCU that the cpu is momentarily non-idle.
As an aside it is helpful to point out that the clock event device that is
programmed here is not a per-cpu clock device; it is a
pseudo clock device, used by the broadcast framework alone.
The per-cpu clock device programming never goes through bc_set_next().
Signed-off-by: Preeti U Murthy <preeti@linux.vnet.ibm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: linuxppc-dev@ozlabs.org Cc: mpe@ellerman.id.au Cc: tglx@linutronix.de Link: http://lkml.kernel.org/r/20150318104705.17763.56668.stgit@preeti.in.ibm.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
For RoCE ports, we set the u32 PMA values based on u64 HCA counters. In case of
overflow, according to the IB spec, we have to saturate a counter to its
max value, do that.
Fixes: c37791349cc7 ('IB/mlx4: Support PMA counters for IBoE') Signed-off-by: Majd Dibbiny <majd@mellanox.com> Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com> Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com> Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
The rate provided at the output of a clk-divider is calculated as:
DIV_ROUND_UP(parent_rate, div)
since commit b11d282dbea2 (clk: divider: fix rate calculation for
fractional rates). So to yield a rate not bigger than r parent_rate
must be <= r * div.
The effect of choosing a parent rate that is too big as was done before
this patch results in wrongly ruling out good dividers.
Note that this is not a complete fix as __clk_round_rate might return a
value >= its 2nd parameter. Also for dividers with
CLK_DIVIDER_ROUND_CLOSEST set the calculation is not accurate. But this
fixes the test case by Sascha Hauer that uses a chain of three dividers
under a fixed clock.
It's an invalid approach to assume that among two divider values
the one nearer the exact divider is the better one.
Assume a parent rate of 1000 Hz, a divider with CLK_DIVIDER_POWER_OF_TWO
and a target rate of 89 Hz. The exact divider is ~ 11.236 so 8 and 16
are the candidates to choose from yielding rates 125 Hz and 62.5 Hz
respectivly. While 8 is nearer to 11.236 than 16 is, the latter is still
the better divider as 62.5 is nearer to 89 than 125 is.
Fixes: 774b514390b1 (clk: divider: Add round to closest divider) Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de> Acked-by: Sascha Hauer <s.hauer@pengutronix.de> Acked-by: Maxime Coquelin <maxime.coquelin@st.com> Signed-off-by: Michael Turquette <mturquette@linaro.org> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
This is due to a race condition between stopping the thread and
calling vb2_internal_streamoff(). While I have not been able to deduce
the exact mechanism how this race condition can produce this warning,
I can see that the way the stream is stopped is likely to lead to a
race somewhere.
This patch simplifies how this is done by first ensuring that the
thread is completely stopped before cleaning up the vb2 queue. It
does that by setting threadio->stop to true, followed by a call to
vb2_queue_error() which will wake up the thread. The thread sees that
'stop' is true and it will exit.
The call to kthread_stop() waits until the thread has exited, and only
then is the queue cleaned up by calling __vb2_cleanup_fileio().
This is a much cleaner sequence and the warning has now disappeared.
Reported-by: Jurgen Kramer <gtmkramer@xs4all.nl> Tested-by: Jurgen Kramer <gtmkramer@xs4all.nl> Signed-off-by: Hans Verkuil <hans.verkuil@cisco.com> Cc: <stable@vger.kernel.org> # for v3.18 and up Signed-off-by: Mauro Carvalho Chehab <mchehab@osg.samsung.com> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Unlike scan_async_group(), soc_of_bind() doesn't allocate its
soc_camera_async_client structure using devm_kzalloc(), but has it
embedded inside the soc_of_info structure. Hence on failure, it must
free the whole soc_of_info structure, and not just the embedded
soc_camera_async_client structure, as the latter causes a warning, and
may cause slab corruption: