Currently, for the packets receives from tuntap, before doing header check,
kernel just reset the transport header in netif_receive_skb() which pretends no
l4 header. This is suboptimal for precise packet length estimation (introduced
in 1def9238) which needs correct l4 header for gso packets.
So this patch set the transport header to csum_start for partial checksum
packets, otherwise it first try skb_flow_dissect(), if it fails, just reset the
transport header.
Signed-off-by: Jason Wang <jasowang@redhat.com> Cc: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Set the transport header for 1) some drivers (e.g ixgbe) needs l4 header 2)
precise packet length estimation (introduced in 1def9238) needs l4 header to
compute header length.
For the packets with partial checksum, the patch just set the transport header
to csum_start. Otherwise tries to use skb_flow_dissect() to get l4 offset, if it
fails, just pretend no l4 header.
Signed-off-by: Jason Wang <jasowang@redhat.com> Cc: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit (3be8fbab tuntap: fix error return code in tun_set_iff()) breaks the
creation of multiqueue tuntap since it forbids to create more than one queues
for a multiqueue tuntap device. We need return 0 instead -EBUSY here since we
don't want to re-initialize the device when one or more queues has been already
attached. Add a comment and correct the return value to zero.
Reported-by: Jerry Chu <hkchu@google.com> Cc: Jerry Chu <hkchu@google.com> Cc: Wei Yongjun <weiyj.lk@gmail.com> Cc: Eric Dumazet <edumazet@google.com> Signed-off-by: Jason Wang <jasowang@redhat.com> Acked-by: Jerry Chu <hkchu@google.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
This patch fixes an issue that the driver increments the "RX length error"
on every buffer in sh_eth_rx() if the R8A7740.
This patch also adds a description about the Receive Frame Status bits.
Signed-off-by: Yoshihiro Shimoda <yoshihiro.shimoda.uh@renesas.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
In commit 2f94aabd9f6c925d77aecb3ff020f1cc12ed8f86
(refactor sctp_outq_teardown to insure proper re-initalization)
we modified sctp_outq_teardown to use sctp_outq_init to fully re-initalize the
outq structure. Steve West recently asked me why I removed the q->error = 0
initalization from sctp_outq_teardown. I did so because I was operating under
the impression that sctp_outq_init would properly initalize that value for us,
but it doesn't. sctp_outq_init operates under the assumption that the outq
struct is all 0's (as it is when called from sctp_association_init), but using
it in __sctp_outq_teardown violates that assumption. We should do a memset in
sctp_outq_init to ensure that the entire structure is in a known state there
instead.
Signed-off-by: Neil Horman <nhorman@tuxdriver.com> Reported-by: "West, Steve (NSN - US/Fort Worth)" <steve.west@nsn.com> CC: Vlad Yasevich <vyasevich@gmail.com> CC: netdev@vger.kernel.org CC: davem@davemloft.net Acked-by: Vlad Yasevich <vyasevich@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
PPPoL2TP sockets should comply with the standard send*() return values
(i.e. return number of bytes sent instead of 0 upon success).
Signed-off-by: Guillaume Nault <g.nault@alphalink.fr> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Copy user data after PPP framing header. This prevents erasure of the
added PPP header and avoids leaking two bytes of uninitialised memory
at the end of skb's data buffer.
Signed-off-by: Guillaume Nault <g.nault@alphalink.fr> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
uaddr->sa_data is exactly of size 14, which is hard-coded here and
passed as a size argument to strncpy(). A device name can be of size
IFNAMSIZ (== 16), meaning we might leave the destination string
unterminated. Thus, use strlcpy() and also sizeof() while we're
at it. We need to memset the data area beforehand, since strlcpy
does not padd the remaining buffer with zeroes for user space, so
that we do not possibly leak anything.
Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
team_port_enable() adds port to port_hashlist. Reader sees port
in team_get_port_by_index_rcu() and returns it, but
team_get_first_port_txable_rcu() tries to go through port_list, where the
port is not inserted yet -> NULL pointer dereference.
Fix this by reordering port_list and port_hashlist insertion.
Panic is easily triggeable when txing packets and adding/removing port
in a loop.
Introduced by commit 3d249d4c "net: introduce ethernet teaming device"
Signed-off-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
team_get_port_by_index_rcu() might return NULL due to race between port
removal and skb tx path. Panic is easily triggeable when txing packets
and adding/removing port in a loop.
introduced by commit 3d249d4ca "net: introduce ethernet teaming device"
and commit 753f993911b "team: introduce random mode" (for random mode)
Signed-off-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Commit 54f968d6efdbf7dec36faa44fc11f01b0e4d1990
(tuntap: move socket to tun_file) forgets to set SOCK_ZEROCOPY flag, which will
prevent vhost_net from doing zercopy w/ tap. This patch fixes this by setting
it during file open.
Signed-off-by: Jason Wang <jasowang@redhat.com> Cc: Michael S. Tsirkin <mst@redhat.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
I did not hit this with the lksctp-tools functional tests, but with a
small, multi-threaded test program, that heavily allocates, binds,
listens and waits in accept on sctp sockets, and then randomly kills
some of them (no need for an actual client in this case to hit this).
Then, again, allocating, binding, etc, and then killing child processes.
This panic then only occurs when ``echo 1 > /proc/sys/net/sctp/auth_enable''
is set. The cause for that is actually very simple: in sctp_endpoint_init()
we enter the path of sctp_auth_init_hmacs(). There, we try to allocate
our crypto transforms through crypto_alloc_hash(). In our scenario,
it then can happen that crypto_alloc_hash() fails with -EINTR from
crypto_larval_wait(), thus we bail out and release the socket via
sk_common_release(), sctp_destroy_sock() and hit the NULL pointer
dereference as soon as we try to access members in the endpoint during
sctp_endpoint_free(), since endpoint at that time is still NULL. Now,
if we have that case, we do not need to do any cleanup work and just
leave the destruction handler.
Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Acked-by: Neil Horman <nhorman@tuxdriver.com> Acked-by: Vlad Yasevich <vyasevich@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
When we decide not use zero-copy, msg.control should be set to NULL otherwise
macvtap/tap may set zerocopy callbacks which may decrease the kref of ubufs
wrongly.
Signed-off-by: Jason Wang <jasowang@redhat.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Commit 25fb6ca4ed9cad72f14f61629b68dc03c0d9713f
"net IPv6 : Fix broken IPv6 routing table after loopback down-up"
forgot to assign rt6_info to the inet6_ifaddr.
When disable the net device, the rt6_info which allocated
in init_loopback will not be destroied in __ipv6_ifa_notify.
This will trigger the waring message below
[23527.916091] unregister_netdevice: waiting for tap0 to become free. Usage count = 1
Reported-by: Arkadiusz Miskiewicz <a.miskiewicz@gmail.com> Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Roman Gushchin discovered that udp4_lib_lookup2() was not reloading
first item in the rcu protected list, in case the loop was restarted.
This produced soft lockups as in https://lkml.org/lkml/2013/4/16/37
rcu_dereference(X)/ACCESS_ONCE(X) seem to not work as intended if X is
ptr->field :
In some cases, gcc caches the value or ptr->field in a register.
Use a barrier() to disallow such caching, as documented in
Documentation/atomic_ops.txt line 114
Thanks a lot to Roman for providing analysis and numerous patches.
Diagnosed-by: Roman Gushchin <klamm@yandex-team.ru> Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Boris Zhmurov <zhmurov@yandex-team.ru> Signed-off-by: Roman Gushchin <klamm@yandex-team.ru> Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
udp6 over GRE tunnel does not work after to GRE tso changes. GRE
tso handler passes inner packet but keeps track of outer header
start in SKB_GSO_CB(skb)->mac_offset. udp6 fragment need to
take care of outer header, which start at the mac_offset, while
adding fragment header.
This bug is introduced by commit 68c3316311 (GRE: Add TCP
segmentation offload for GRE).
Reported-by: Dmitry Kravkov <dkravkov@gmail.com> Signed-off-by: Pravin B Shelar <pshelar@nicira.com> Tested-by: Dmitry Kravkov <dmitry@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
We currently allow changing the mq flag (IFF_MULTI_QUEUE) for a persistent
device. This will result a mismatch between the number the queues in netdev and
tuntap. This is because we only allocate a 1q netdevice when IFF_MULTI_QUEUE was
not specified, so when we set the IFF_MULTI_QUEUE and try to attach more queues
later, netif_set_real_num_tx_queues() may fail which result a single queue
netdevice with multiple sockets attached.
Solve this by disallowing changing the mq flag for persistent device.
Reported-by: Sriram Narasimhan <sriram.narasimhan@hp.com> Cc: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The three arrays of strings: af_family_key_strings,
af_family_slock_key_strings and af_family_clock_key_strings have not
VSOCK's string
Signed-off-by: Federico Vaga <federico.vaga@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
MSG_CMSG_COMPAT is (AFAIK) not intended to be part of the API --
it's a hack that steals a bit to indicate to other networking code
that a compat entry was used. So don't allow it from a non-compat
syscall.
This prevents an oops when running this code:
int main()
{
int s;
struct sockaddr_in addr;
struct msghdr *hdr;
if (syscall(__NR_sendmmsg, s, evil, 1, MSG_CMSG_COMPAT) < 0)
err(1, "sendmmsg");
return 0;
}
Signed-off-by: Andy Lutomirski <luto@amacapital.net> Cc: David S. Miller <davem@davemloft.net> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Unlike ipv4_redirect() and ipv4_sk_redirect(), ip_do_redirect()
doesn't call __build_flow_key() directly but via
ip_rt_build_flow_key() wrapper. This leads to __build_flow_key()
getting pointer to IPv4 header of the ICMP redirect packet
rather than pointer to the embedded IPv4 header of the packet
initiating the redirect.
As a result, handling of ICMP redirects initiated by TCP packets
is broken. Issue was introduced by
Signed-off-by: Michal Kubecek <mkubecek@suse.cz> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The phy_init_eee has to exit with an error when the
local device and its link partner both do not support EEE.
So this patch fixes a problem when verify this.
Signed-off-by: Giuseppe Cavallaro <peppe.cavallaro@st.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Daniel Petre reported crashes in icmp_dst_unreach() with following call
graph:
Daniel found a similar problem mentioned in
http://lkml.indiana.edu/hypermail/linux/kernel/1007.0/00961.html
And indeed this is the root cause : skb->cb[] contains data fooling IP
stack.
We must clear IPCB in ip_tunnel_xmit() sooner in case dst_link_failure()
is called. Or else skb->cb[] might contain garbage from GSO segmentation
layer.
A similar fix was tested on linux-3.9, but gre code was refactored in
linux-3.10. I'll send patches for stable kernels as well.
Many thanks to Daniel for providing reports, patches and testing !
Reported-by: Daniel Petre <daniel.petre@rcs-rds.ro> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 3853b5841c01a ("xps: Improvements in TX queue selection")
introduced ooo_okay flag, but the condition to set it is slightly wrong.
In our traces, we have seen ACK packets being received out of order,
and RST packets sent in response.
We should test if we have any packets still in host queue.
Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Tom Herbert <therbert@google.com> Cc: Yuchung Cheng <ycheng@google.com> Cc: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The error exit path needs err explicitly set. Otherwise it
returns success and the only caller, xfrm_output_resume(),
would oops in skb_dst(skb)->ops derefence as skb_dst(skb) is
NULL.
Bug introduced in commit bb65a9cb (xfrm: removes a superfluous
check and add a statistic).
Signed-off-by: Timo Teräs <timo.teras@iki.fi> Cc: Li RongQing <roy.qing.li@gmail.com> Cc: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
This patch is a fix for a bug triggering newly_acked_sacked < 0
in tcp_ack(.).
The bug is triggered by sacked_out decreasing relative to prior_sacked,
but packets_out remaining the same as pior_packets. This is because the
snapshot of prior_packets is taken after tcp_sacktag_write_queue() while
prior_sacked is captured before tcp_sacktag_write_queue(). The problem
is: tcp_sacktag_write_queue (tcp_match_skb_to_sack() -> tcp_fragment)
adjusts the pcount for packets_out and sacked_out (MSS change or other
reason). As a result, this delta in pcount is reflected in
(prior_sacked - sacked_out) but not in (prior_packets - packets_out).
This patch does the following:
1) initializes prior_packets at the start of tcp_ack() so as to
capture the delta in packets_out created by tcp_fragment.
2) introduces a new "previous_packets_out" variable that snapshots
packets_out right before tcp_clean_rtx_queue, so pkts_acked can be
correctly computed as before.
3) Computes pkts_acked using previous_packets_out, and computes
newly_acked_sacked using prior_packets.
Signed-off-by: Nandita Dukkipati <nanditad@google.com> Acked-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
This patch cures transmit timeout's with DHCP observed
while running under KVM. When the transmit ring is cleaned out,
the Byte Queue Limit values need to be reset.
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
8168evl offloaded checksums are wrong since commit e5195c1f31f399289347e043d6abf3ffa80f0005 ("r8169: fix 8168evl frame padding.")
pads small packets to 60 bytes (without ethernet checksum). Typical symptoms
appear as UDP checksums which are wrong by the count of added bytes.
It isn't worth compensating. Let the driver checksum.
Due to the skb length changes, TSO code is moved before the Tx descriptor gets
written.
Signed-off-by: Francois Romieu <romieu@fr.zoreil.com> Tested-by: Holger Hoffstätte <holger.hoffstaette@googlemail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The net/netlabel/netlabel_domainhash.c:netlbl_domhsh_add() function
does not properly validate new domain hash entries resulting in
potential problems when an administrator attempts to add an invalid
entry. One such problem, as reported by Vlad Halilov, is a kernel
BUG (found in netlabel_domainhash.c:netlbl_domhsh_audit_add()) when
adding an IPv6 outbound mapping with a CIPSO configuration.
This patch corrects this problem by adding the necessary validation
code to netlbl_domhsh_add() via the newly created
netlbl_domhsh_validate() function.
Ideally this patch should also be pushed to the currently active
-stable trees.
Reported-by: Vlad Halilov <vlad.halilov@gmail.com> Signed-off-by: Paul Moore <pmoore@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 0178b695fd6b4 ("ipv6: Copy cork options in ip6_append_data")
added some code duplication and bad error recovery, leading to potential
crash in ip6_cork_release() as kfree() could be called with garbage.
use kzalloc() to make sure this wont happen.
Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org> Cc: Neal Cardwell <ncardwell@google.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Fix some instances where vxlan fdb 'used' field is not updated after the entry
is used.
v2: rename vxlan_find_mac() as __vxlan_find_mac() and create a new vxlan_find_mac()
that also updates ->used field.
Signed-off-by: Sridhar Samudrala <sri@us.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Add the missing iounmap() before return from gianfar_ptp_probe()
in the error handling case.
Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
app->lock is normally taken under softirq context, so disable BH to
avoid the splat.
Reported-by: Denys Fedoryshchenko <denys@visp.net.lb> Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: David Ward <david.ward@ll.mit.edu> Cc: Cong Wang <amwang@redhat.com> Tested-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[2] mlx4 driver uses order-2 pages to allocate RX frags
Reported-by: Matt Schnall <mischnal@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Bernhard Beck <bbeck@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
reproduce steps
1. flood ping from other machine
ping -f -s 41000 IP
2. run below script
while [ 1 ]; do ethtool -s eth0 autoneg off;
sleep 3;ethtool -s eth0 autoneg on; sleep 4; done;
You can see oops in one hour.
The reason is fec_restart clear BD but NAPI may use it.
The solution is disable NAPI and stop xmit when reset BD.
disable NAPI may sleep, so fec_restart can't be call in
atomic context.
Signed-off-by: Frank Li <Frank.Li@freescale.com> Reviewed-by: Lucas Stach <l.stach@pengutronix.de> Tested-by: Lucas Stach <l.stach@pengutronix.de> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
We are in the process of removing all the __cpuinit annotations.
While working on making that change, an existing problem was
made evident:
WARNING: arch/x86/kernel/built-in.o(.text+0x198f2): Section mismatch
in reference from the function cpu_init() to the function
.init.text:load_ucode_ap() The function cpu_init() references
the function __init load_ucode_ap(). This is often because cpu_init
lacks a __init annotation or the annotation of load_ucode_ap is wrong.
This now appears because in my working tree, cpu_init() is no longer
tagged as __cpuinit, and so the audit picks up the mismatch. The 2nd
hypothesis from the audit is the correct one, as there was an incorrect
__init tag on the prototype in the header (but __cpuinit was used on
the function itself.)
The audit is telling us that the prototype's __init annotation took
effect and the function did land in the .init.text section. Checking
with objdump on a mainline tree that still has __cpuinit shows that
the __cpuinit on the function takes precedence over the __init on the
prototype, but that won't be true once we make __cpuinit a no-op.
Even though we are removing __cpuinit, we temporarily align both
the function and the prototype on __cpuinit so that the changeset
can be applied to stable trees if desired.
[ hpa: build fix only, no object code change ]
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Link: http://lkml.kernel.org/r/1371654926-11729-1-git-send-email-paul.gortmaker@windriver.com Signed-off-by: H. Peter Anvin <hpa@linux.intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Fix kconfig warning and build errors on x86_64 by selecting BINFMT_ELF
when COMPAT_BINFMT_ELF is being selected.
warning: (IA32_EMULATION) selects COMPAT_BINFMT_ELF which has unmet direct dependencies (COMPAT && BINFMT_ELF)
fs/built-in.o: In function `elf_core_dump':
compat_binfmt_elf.c:(.text+0x3e093): undefined reference to `elf_core_extra_phdrs'
compat_binfmt_elf.c:(.text+0x3ebcd): undefined reference to `elf_core_extra_data_size'
compat_binfmt_elf.c:(.text+0x3eddd): undefined reference to `elf_core_write_extra_phdrs'
compat_binfmt_elf.c:(.text+0x3f004): undefined reference to `elf_core_write_extra_data'
[ hpa: This was sent to me for -next but it is a low risk build fix ]
Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Link: http://lkml.kernel.org/r/51C0B614.5000708@infradead.org Signed-off-by: H. Peter Anvin <hpa@linux.intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Joshua reported: Commit cd7b304dfaf1 (x86, range: fix missing merge
during add range) broke mtrr cleanup on his setup in 3.9.5.
corresponding commit in upstream is fbe06b7bae7c.
It turns out we have some problem with initial mtrr range retrieval.
The current sequence is:
x86_get_mtrr_mem_range
==> bunchs of add_range_with_merge
==> bunchs of subract_range
==> clean_sort_range
add_range_with_merge for [0,1M)
sort_range()
add_range_with_merge could have blank slots, so we can not just
sort only, that will have final result have extra blank slot in head.
So move that calling add_range_with_merge for [0,1M), with that we
could avoid extra clean_sort_range calling.
Reported-by: Joshua Covington <joshuacov@googlemail.com> Tested-by: Joshua Covington <joshuacov@googlemail.com> Signed-off-by: Yinghai Lu <yinghai@kernel.org> Link: http://lkml.kernel.org/r/1371154622-8929-2-git-send-email-yinghai@kernel.org Signed-off-by: H. Peter Anvin <hpa@linux.intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Joshua reported: Commit cd7b304dfaf1 (x86, range: fix missing merge
during add range) broke mtrr cleanup on his setup in 3.9.5.
corresponding commit in upstream is fbe06b7bae7c.
The reason is add_range_with_merge could generate blank spot.
We could avoid that by searching new expanded start/end, that
new range should include all connected ranges in range array.
At last add the new expanded start/end to the range array.
Also move up left array so do not add new blank slot in the
range array.
-v2: move left array to avoid enhance add_range()
-v3: include fix from Joshua about memmove declaring when
DYN_DEBUG is used.
Reported-by: Joshua Covington <joshuacov@googlemail.com> Tested-by: Joshua Covington <joshuacov@googlemail.com> Signed-off-by: Yinghai Lu <yinghai@kernel.org> Link: http://lkml.kernel.org/r/1371154622-8929-3-git-send-email-yinghai@kernel.org Signed-off-by: H. Peter Anvin <hpa@linux.intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
There might be issue with lockup detection when scheduling on an
empty ring that have been sitting idle for a while. Thus update
the lockup tracking data when scheduling new work in an empty ring.
Signed-off-by: Jerome Glisse <jglisse@redhat.com> Tested-by: Andy Lutomirski <luto@amacapital.net> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
If a buffer is never bound to a virtual memory pagetable than don't try
to unbind it. Only drawback is that we don't update the pagetable when
unbinding the ib pool buffer which is fine because it only happens at
suspend or module unload/shutdown.
Fixes spurious messages about buffers without VM mappings. E.g.:
radeon 0000:01:00.0: bo ffff88020afac400 don't has a mapping in vm ffff88021ca2b900
Commit 781d737 (ACPI: Drop power resources driver) introduced a
bug in the power resources initialization error code path causing
a NULL pointer to be referenced in acpi_release_power_resource()
if there's an error triggering a jump to the 'err' label in
acpi_add_power_resource(). This happens because the list_node
field of struct acpi_power_resource has not been initialized yet
at this point and doing a list_del() on it is a bad idea.
To prevent this problem from occuring, initialize the list_node
field of struct acpi_power_resource upfront.
Reported-by: Mika Westerberg <mika.westerberg@linux.intel.com> Tested-by: Lan Tianyu <tianyu.lan@intel.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Reviewed-by: Mika Westerberg <mika.westerberg@linux.intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Since commit 3757b94 (ACPI / hotplug: Fix concurrency issues and
memory leaks) acpi_bus_scan() and acpi_bus_trim() must always be
called under acpi_scan_lock, but currently the following scenario
violating that requirement is possible:
Fix that by making write_undock() acquire acpi_scan_lock before
calling handle_eject_request() as appropriate (begin_undock() is
under the lock too in analogy with acpi_dock_deferred_cb()).
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Toshi Kani <toshi.kani@hp.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
acpi_get_override_irq() was added because there was a problem with
buggy BIOSes passing wrong IRQ() resource for the RTC IRQ. The
commit that added the workaround was 61fd47e0c8476 (ACPI: fix two
IRQ8 issues in IOAPIC mode).
With ACPI 5 enumerated devices there are typically one or more
extended IRQ resources per device (and these IRQs can be shared).
However, the acpi_get_override_irq() workaround forces all IRQs in
range 0 - 15 (the legacy ISA IRQs) to be edge triggered, active high
as can be seen from the dmesg below:
ACPI: IRQ 6 override to edge, high
ACPI: IRQ 7 override to edge, high
ACPI: IRQ 7 override to edge, high
ACPI: IRQ 13 override to edge, high
Also /proc/interrupts for the I2C controllers (INT33C2 and INT33C3) shows
the same thing:
7: 4 0 0 0 IO-APIC-edge INT33C2:00, INT33C3:00
The _CSR method for INT33C2 (and INT33C3) device returns following
resource:
which states that this is supposed to be level triggered, active low,
shared IRQ instead.
Fix this by making sure that acpi_get_override_irq() gets only called
when we are dealing with legacy IRQ() or IRQNoFlags() descriptors.
While we are there, correct pr_warning() to print the right triggering
value.
This change turns out to be necessary to make DMA work correctly on
systems based on the Intel Lynxpoint PCH (Platform Controller Hub).
[rjw: Changelog] Signed-off-by: Mika Westerberg <mika.westerberg@linux.intel.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
kernel might hung in pvclock_clocksource_read() due to
uninitialized memory might contain odd version value in
following cycle:
do {
version = __pvclock_read_cycles(src, &ret, &flags);
} while ((src->version & 1) || version != src->version);
if secondary kvmclock is accessed before it's registered with kvm.
Clear garbage in pvclock shared memory area right after it's
allocated to avoid this issue.
Ref: https://bugzilla.kernel.org/show_bug.cgi?id=59521 Signed-off-by: Igor Mammedov <imammedo@redhat.com>
[See BZ for analysis. We may want a different fix for 3.11, but
this is the safest for now - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
With "mac80211/minstrel_ht: add support for using CCK rates"
minstrel_ht selects legacy CCK rates as viable rates for
outgoing frames which might be sent as part of an A-MPDU
[IEEE80211_TX_CTL_AMPDU is set].
This behavior triggered the following WARN_ON in the driver:
> WARNING: at carl9170/tx.c:995 carl9170_op_tx+0x1dd/0x6fd
The driver assumed that the rate control algorithm made a
mistake and dropped the frame.
This patch removes the noisy warning altogether and allows
said A-MPDU frames with CCK sample and/or fallback rates to
be transmitted seamlessly.
Signed-off-by: Christian Lamparter <chunkeey@googlemail.com> Signed-off-by: John W. Linville <linville@tuxdriver.com> Cc: Stanislaw Gruszka <sgruszka@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The C8000 workstation (64 bit kernel only) has a somewhat different
serial port configuration than other models.
Thomas Bogendoerfer sent a patch to fix this in September 2010, which
was now minimally modified by me.
Make sure that we really return -1 (instead of 0x00ff) as node id for
page frame numbers which are not physically available.
This finally fixes the kernel panic when running
cat /proc/kpageflags /proc/kpagecount.
Theoretically this patch now limits the number of physical memory ranges
to 127 instead of 254, but currently we have MAX_PHYSMEM_RANGES
hardcoded to 8 which is sufficient for all existing parisc machines.
Fix the above kernel error from parport_announce_port() on 32bit GSC
machines (e.g. B160L). The parport driver requires now a pointer to the
device struct.
There's a Makefile line setting cflags for CONFIG_PA7100. But that
Kconfig macro doesn't exist. There is a Kconfig symbol PA7000, which
covers both PA7000 and PA7100 processors. So let's use the corresponding
Kconfig macro.
With CONFIG_DISCONTIGMEM=y and multiple physical memory areas,
cat /proc/kpageflags triggers this kernel bug:
kernel BUG at arch/parisc/include/asm/mmzone.h:50!
CPU: 2 PID: 7848 Comm: cat Tainted: G D W 3.10.0-rc3-64bit #44
IAOQ[0]: kpageflags_read0x128/0x238
IAOQ[1]: kpageflags_read0x12c/0x238
RP(r2): proc_reg_read0xbc/0x130
Backtrace:
[<00000000402ca2d4>] proc_reg_read0xbc/0x130
[<0000000040235bcc>] vfs_read0xc4/0x1d0
[<0000000040235f0c>] SyS_read0x94/0xf0
[<0000000040105fc0>] syscall_exit0x0/0x14
kpageflags_read() walks through the whole memory, even if some memory
areas are physically not available. So, we should better not BUG on an
unavailable pfn in pfn_to_nid() but just return the expected value -1 or
0.
The logic to detect if the irq stack was already in use with
raw_spin_trylock() is wrong, because it will generate a "trylock failure
on UP" error message with CONFIG_SMP=n and CONFIG_DEBUG_SPINLOCK=y.
arch_spin_trylock() can't be used either since in the CONFIG_SMP=n case
no atomic protection is given and we are reentrant here. A mutex didn't
worked either and brings more overhead by turning off interrupts.
So, let's use the fastest path for parisc which is the ldcw instruction.
Counting how often the irq stack was used is pretty useless, so just
drop this piece of code.
The get_stack_use_cr30 and get_stack_use_r30 macros allocate a stack
frame for external interrupts and interruptions requiring a stack frame.
They are currently not reentrant in that they save register context
before the stack is set or adjusted.
I have observed a number of system crashes where there was clear
evidence of stack corruption during interrupt processing, and as a
result register corruption. Some interruptions can still occur during
interruption processing, however external interrupts are disabled and
data TLB misses don't occur for absolute accesses. So, it's not entirely
clear what triggers this issue. Also, if an interruption occurs when
Q=0, it is generally not possible to recover as the shadowed registers
are not copied.
The attached patch reworks the get_stack_use_cr30 and get_stack_use_r30
macros to allocate stack before doing register saves. The new code is a
couple of instructions shorter than the old implementation. Thus, it's
an improvement even if it doesn't fully resolve the stack corruption
issue. Based on limited testing, it improves SMP system stability.
Signed-off-by: John David Anglin <dave.anglin@bell.net> Signed-off-by: Helge Deller <deller@gmx.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Default kernel stack size on parisc is 16k. During tests we found that the
kernel stack can easily grow beyond 13k, which leaves 3k left for irq
processing.
This patch adds the possibility to activate an additional stack of 16k per CPU
which is being used during irq processing. This implementation does not yet
uses this irq stack for the irq bh handler.
The assembler code for call_on_stack was heavily cleaned up by John
David Anglin.
Signed-off-by: Helge Deller <deller@gmx.de> CC: John David Anglin <dave.anglin@bell.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Add the CONFIG_DEBUG_STACKOVERFLOW config option to enable checks to
detect kernel stack overflows.
Stack overflows can not be detected reliable since we do not want to
introduce too much overhead.
Instead, during irq processing in do_cpu_irq_mask() we check kernel
stack usage of the interrupted kernel process. Kernel threads can be
easily detected by checking the value of space register 7 (sr7) which
is zero when running inside the kernel.
Since THREAD_SIZE is 16k and PAGE_SIZE is 4k, reduce the alignment of
the init thread to the lower value (PAGE_SIZE) in the kernel
vmlinux.ld.S linker script.
ARP offloading should only be used in STA or P2P client mode. It
is currently configured once at init. When being configured for AP
ARP offloading should be turned off and when AP mode is left it can
be turned back on.
Reviewed-by: Arend Van Spriel <arend@broadcom.com> Signed-off-by: Hante Meuleman <meuleman@broadcom.com> Signed-off-by: Arend van Spriel <arend@broadcom.com> Signed-off-by: John W. Linville <linville@tuxdriver.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Driver rtl8192cu can connect to WPA2 networks, but fails for any other
encryption method. The cause is a failure to set the rate control data
blocks. These changes fix https://bugzilla.redhat.com/show_bug.cgi?id=952793
and https://bugzilla.redhat.com/show_bug.cgi?id=761525.
Signed-off-by: Larry Finger <Larry.Finger@lwfinger.net> Signed-off-by: John W. Linville <linville@tuxdriver.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
gcc 4.7.x is emitting calls to __ffsdi2 where previously
it used to inline the appropriate ctz instructions.
While this needs to be fixed in gcc, it's also easy to avoid
having it cause build failures when building with those
compilers by exporting __ffsdi2 to modules.
Signed-off-by: Chris Metcalf <cmetcalf@tilera.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
When the Android firmware enables the audio interfaces in accessory
mode, it always declares in the control interface's baInterfaceNr array
that interfaces 0 and 1 belong to the audio function. However, the
accessory interface itself, if also enabled, already is at index 0 and
shifts the actual audio interface numbers to 1 and 2, which prevents the
PCM streaming interface from being seen by the host driver.
To get the PCM interface interface to work, detect when the descriptors
point to the (for this driver useless) accessory interface, and redirect
to the correct one.
Reported-by: Jeremy Rosen <jeremy.rosen@openwide.fr> Tested-by: Jeremy Rosen <jeremy.rosen@openwide.fr> Signed-off-by: Clemens Ladisch <clemens@ladisch.de> Signed-off-by: Takashi Iwai <tiwai@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
MacBook Air 4,2 requires the whole default pin configuration table to
be overridden by the driver, as usual, as Apple's machines don't set
up properly after boot. Otherwise mic won't work, and other ill
effect may happen.
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=59381 Reported-and-tested-by: Peter John Hartman <peterjohnhartman@gmail.com> Signed-off-by: Takashi Iwai <tiwai@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Just like the previous fix for LogitechHD Webcam c270 in commit 11e7064f35bb87da8f427d1aa4bbd8b7473a3993, c310 model also requires the
same workaround for avoiding the kernel warning.
With this change, we no longer lose the innermost entry in the user-mode
part of the call chain. See also the x86 port, which includes the ip,
and the corresponding change in arch/arm.
Signed-off-by: Jed Davis <jld@mozilla.com> Acked-by: Ingo Molnar <mingo@kernel.org> Acked-by: Will Deacon <will.deacon@arm.com> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
On Cortex-A9 before version r1p0, the LoUIS bit field of the CLIDR
register returns zero when it should return one. This leads to cache
maintenance operations which rely on this value to not function as
intended, causing data corruption.
The workaround for this errata is to detect affected CPUs and correct
the LoUIS value read.
Acked-by: Will Deacon <will.deacon@arm.com> Acked-by: Nicolas Pitre <nico@linaro.org> Signed-off-by: Jon Medhurst <tixy@linaro.org> Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
MPP_F6281_MASK would be previously be returned when on mv88f6282,
which would disallow some valid MPP configurations.
Commit 830f8b91 (arm: plat-orion: fix printing of "MPP config
unavailable on this hardware") made this problem visible as an invalid
MPP configuration is now correctly detected and not applied.
Signed-off-by: Nicolas Schichan <nschichan@freebox.fr> Signed-off-by: Jason Cooper <jason@lakedaemon.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Some systems that don't need wake-on-lan may choose to power down the
chip on system standby. Upon resume, the power on causes the boot code
to startup and initialize the hardware. On one new platform, this is
causing the device to go into a bad state due to a race between the
driver and boot code, once every several hundred resumes. The same race
exists on open since we come up from a power on.
This patch adds a wait for boot code signature at the beginning of
tg3_init_hw() which is common to both cases. If there has not been a
power-off or the boot code has already completed, the signature will be
present and poll_fw() returns immediately. Also return immediately if
the device does not have firmware.
Signed-off-by: Nithin Nayak Sujir <nsujir@broadcom.com> Signed-off-by: Michael Chan <mchan@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Re-enable chipidea irq even if there's no role changing to do. This is
a problem since b183c19f ("USB: chipidea: re-order irq handling to avoid
unhandled irqs"); when it manifests, chipidea irq gets disabled for good.
Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
When replaying interrupts (as a result of the interrupt occurring
while soft-disabled), in the case of the decrementer, we are exclusively
testing for a pending timer target. However we also use decrementer
interrupts to trigger the new "irq_work", which in this case would
be missed.
This change the logic to force a replay in both cases of a timer
boundary reached and a decrementer interrupt having actually occurred
while disabled. The former test is still useful to catch cases where
a CPU having been hard-disabled for a long time completely misses the
interrupt due to a decrementer rollover.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Tested-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Normally, the kernel emulates a few instructions that are unimplemented
on some processors (e.g. the old dcba instruction), or privileged (e.g.
mfpvr). The emulation of unimplemented instructions is currently not
working on the PowerNV platform. The reason is that on these machines,
unimplemented and illegal instructions cause a hypervisor emulation
assist interrupt, rather than a program interrupt as on older CPUs.
Our vector for the emulation assist interrupt just calls
program_check_exception() directly, without setting the bit in SRR1
that indicates an illegal instruction interrupt. This fixes it by
making the emulation assist interrupt set that bit before calling
program_check_interrupt(). With this, old programs that use no-longer
implemented instructions such as dcba now work again.
Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
If we look at current's stack (paca->__current->stack) we see it is
equal to c0000002ecab0000. Our stack is 16K, and comparing to
paca->kstack (c0000002ecab3e30) we can see that we have overflowed our
kernel stack. This leads to us writing over our struct thread_info, and
in this case we have corrupted thread_info->flags and set
_TIF_EMULATE_STACK_STORE.
__ppc64_runlatch_on() is called from RUNLATCH_ON in the exception entry
path. At that point the irq state is not consistent, ie. interrupts are
hard disabled (by the exception entry), but the paca soft-enabled flag
may be out of sync.
This leads to the local_irq_restore() in trace_graph_entry() actually
enabling interrupts, which we do not want. Because we have not yet
reprogrammed the decrementer we immediately take another decrementer
exception, and recurse.
The fix is twofold. Firstly make sure we call DISABLE_INTS before
calling RUNLATCH_ON. The badly named DISABLE_INTS actually reconciles
the irq state in the paca with the hardware, making it safe again to
call local_irq_save/restore().
Although that should be sufficient to fix the bug, we also mark the
runlatch routines as notrace. They are called very early in the
exception entry and we are asking for trouble tracing them. They are
also fairly uninteresting and tracing them just adds unnecessary
overhead.
This patch reworks the UEFI anti-bricking code, including an effective
reversion of cc5a080c and 31ff2f20. It turns out that calling
QueryVariableInfo() from boot services results in some firmware
implementations jumping to physical addresses even after entering virtual
mode, so until we have 1:1 mappings for UEFI runtime space this isn't
going to work so well.
Reverting these gets us back to the situation where we'd refuse to create
variables on some systems because they classify deleted variables as "used"
until the firmware triggers a garbage collection run, which they won't do
until they reach a lower threshold. This results in it being impossible to
install a bootloader, which is unhelpful.
Feedback from Samsung indicates that the firmware doesn't need more than
5KB of storage space for its own purposes, so that seems like a reasonable
threshold. However, there's still no guarantee that a platform will attempt
garbage collection merely because it drops below this threshold. It seems
that this is often only triggered if an attempt to write generates a
genuine EFI_OUT_OF_RESOURCES error. We can force that by attempting to
create a variable larger than the remaining space. This should fail, but if
it somehow succeeds we can then immediately delete it.
I've tested this on the UEFI machines I have available, but I don't have
a Samsung and so can't verify that it avoids the bricking problem.
The auth code is called from a variety of contexts, include the mon_client
(protected by the monc's mutex) and the messenger callbacks (currently
protected by nothing). Avoid chaos by protecting all auth state with a
mutex. Nothing is blocking, so this should be simple and lightweight.
Signed-off-by: Sage Weil <sage@inktank.com> Reviewed-by: Alex Elder <elder@inktank.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Use wrapper functions that check whether the auth op exists so that callers
do not need a bunch of conditional checks. Simplifies the external
interface.
Signed-off-by: Sage Weil <sage@inktank.com> Reviewed-by: Alex Elder <elder@inktank.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Currently the messenger calls out to a get_authorizer con op, which will
create a new authorizer if it doesn't yet have one. In the meantime, when
we rotate our service keys, the authorizer doesn't get updated. Eventually
it will be rejected by the server on a new connection attempt and get
invalidated, and we will then rebuild a new authorizer, but this is not
ideal.
Instead, if we do have an authorizer, call a new update_authorizer op that
will verify that the current authorizer is using the latest secret. If it
is not, we will build a new one that does. This avoids the transient
failure.
This fixes one of the sorry sequence of events for bug
http://tracker.ceph.com/issues/4282
Signed-off-by: Sage Weil <sage@inktank.com> Reviewed-by: Alex Elder <elder@inktank.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
We were invalidating the authorizer by removing the ticket handler
entirely. This was effective in inducing us to request a new authorizer,
but in the meantime it mean that any authorizer we generated would get a
new and initialized handler with secret_id=0, which would always be
rejected by the server side with a confusing error message:
auth: could not find secret_id=0
cephx: verify_authorizer could not get service secret for service osd secret_id=0
Instead, simply clear the validity field. This will still induce the auth
code to request a new secret, but will let us continue to use the old
ticket in the meantime. The messenger code will probably continue to fail,
but the exponential backoff will kick in, and eventually the we will get a
new (hopefully more valid) ticket from the mon and be able to continue.
Signed-off-by: Sage Weil <sage@inktank.com> Reviewed-by: Alex Elder <elder@inktank.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
We maintain a counter of failed auth attempts to allow us to retry once
before failing. However, if the second attempt succeeds, the flag isn't
cleared, which makes us think auth failed again later when the connection
resets for other reasons (like a socket error).
This is one part of the sorry sequence of events in bug
http://tracker.ceph.com/issues/4282
Signed-off-by: Sage Weil <sage@inktank.com> Reviewed-by: Alex Elder <elder@inktank.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
for last entry is not what we want, we should have
[mem 0x00200000-0x3fffffff] page 2M
[mem 0x40000000-0x7bffffff] page 1G
Actually we merge the continuous ranges with same page size too early.
in this case, before merging we have
[mem 0x00200000-0x3fffffff] page 2M
[mem 0x40000000-0x7bffffff] page 2M
after merging them, will get
[mem 0x00200000-0x7bffffff] page 2M
even we can use 1G page to map
[mem 0x40000000-0x7bffffff]
that will cause problem, because we already map
[mem 0x7fe00000-0x7fffffff] page 1G
[mem 0x7c000000-0x7fdfffff] page 1G
with 1G page, aka [0x40000000-0x7fffffff] is mapped with 1G page already.
During phys_pud_init() for [0x40000000-0x7bffffff], it will not
reuse existing that pud page, and allocate new one then try to use
2M page to map it instead, as page_size_mask does not include
PG_LEVEL_1G. At end will have [7c000000-0x7fffffff] not mapped, loop
in phys_pmd_init stop mapping at 0x7bffffff.
That is right behavoir, it maps exact range with exact page size that
we ask, and we should explicitly call it to map [7c000000-0x7fffffff]
before or after mapping 0x40000000-0x7bffffff.
Anyway we need to make sure ranges' page_size_mask correct and consistent
after split_mem_range for each range.
Fix that by calling adjust_range_size_mask before merging range
with same page size.
-v2: update change log.
-v3: add more explanation why [7c000000-0x7fffffff] is not mapped, and
it causes panic.
Bisected-by: "Xie, ChanglongX" <changlongx.xie@intel.com> Bisected-by: Yuanhan Liu <yuanhan.liu@linux.intel.com> Reported-and-tested-by: Yuanhan Liu <yuanhan.liu@linux.intel.com> Signed-off-by: Yinghai Lu <yinghai@kernel.org> Link: http://lkml.kernel.org/r/1370015587-20835-1-git-send-email-yinghai@kernel.org Signed-off-by: H. Peter Anvin <hpa@linux.intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
When we have a page fault for the address which is backed by a hugepage
under migration, the kernel can't wait correctly and do busy looping on
hugepage fault until the migration finishes. As a result, users who try
to kick hugepage migration (via soft offlining, for example) occasionally
experience long delay or soft lockup.
This is because pte_offset_map_lock() can't get a correct migration entry
or a correct page table lock for hugepage. This patch introduces
migration_entry_wait_huge() to solve this.
The watermark check consists of two sub-checks. The first one is:
if (free_pages <= min + lowmem_reserve)
return false;
The check assures that there is minimal amount of RAM in the zone. If
CMA is used then the free_pages is reduced by the number of free pages
in CMA prior to the over-mentioned check.
if (!(alloc_flags & ALLOC_CMA))
free_pages -= zone_page_state(z, NR_FREE_CMA_PAGES);
This prevents the zone from being drained from pages available for
non-movable allocations.
The second check prevents the zone from getting too fragmented.
for (o = 0; o < order; o++) {
free_pages -= z->free_area[o].nr_free << o;
min >>= 1;
if (free_pages <= min)
return false;
}
The field z->free_area[o].nr_free is equal to the number of free pages
including free CMA pages. Therefore the CMA pages are subtracted twice.
This may cause a false positive fail of __zone_watermark_ok() if the CMA
area gets strongly fragmented. In such a case there are many 0-order
free pages located in CMA. Those pages are subtracted twice therefore
they will quickly drain free_pages during the check against
fragmentation. The test fails even though there are many free non-cma
pages in the zone.
This patch fixes this issue by subtracting CMA pages only for a purpose of
(free_pages <= min + lowmem_reserve) check.
Laura said:
We were observing allocation failures of higher order pages (order 5 =
128K typically) under tight memory conditions resulting in driver
failure. The output from the page allocation failure showed plenty of
free pages of the appropriate order/type/zone and mostly CMA pages in
the lower orders.
For full disclosure, we still observed some page allocation failures
even after applying the patch but the number was drastically reduced and
those failures were attributed to fragmentation/other system issues.
Signed-off-by: Tomasz Stanislawski <t.stanislaws@samsung.com> Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com> Tested-by: Laura Abbott <lauraa@codeaurora.org> Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com> Acked-by: Minchan Kim <minchan@kernel.org> Cc: Mel Gorman <mel@csn.ul.ie> Tested-by: Marek Szyprowski <m.szyprowski@samsung.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>