Neighbour code assumes headroom to push Ethernet header is
at least 16 bytes.
It appears macvtap has only 14 bytes available on arches
where NET_IP_ALIGN is 0 (like x86)
Effect is a corruption of 2 bytes right before skb->head,
and possible crashes if accessing non existing memory.
This fix should also increase IPv4 performance, as paranoid code
in ip_finish_output2() wont have to call skb_realloc_headroom()
Reported-by: Brian Rak <brak@vultr.com> Tested-by: Brian Rak <brak@vultr.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
With commit a7526eb5d06b (net: Unbreak compat_sys_{send,recv}msg), the
MSG_CMSG_COMPAT flag is blocked at the compat syscall entry points,
changing the kernel compat behaviour from the one before the commit it
was trying to fix (1be374a0518a, net: Block MSG_CMSG_COMPAT in
send(m)msg and recv(m)msg).
On 32-bit kernels (!CONFIG_COMPAT), MSG_CMSG_COMPAT is 0 and the native
32-bit sys_sendmsg() allows flag 0x80000000 to be set (it is ignored by
the kernel). However, on a 64-bit kernel, the compat ABI is different
with commit a7526eb5d06b.
This patch changes the compat_sys_{send,recv}msg behaviour to the one
prior to commit 1be374a0518a.
The problem was found running 32-bit LTP (sendmsg01) binary on an arm64
kernel. Arguably, LTP should not pass 0xffffffff as flags to sendmsg()
but the general rule is not to break user ABI (even when the user
behaviour is not entirely sane).
Fixes: a7526eb5d06b (net: Unbreak compat_sys_{send,recv}msg) Cc: Andy Lutomirski <luto@amacapital.net> Cc: David S. Miller <davem@davemloft.net> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
CPU0 CPU1
team_port_del
team_upper_dev_unlink
priv_flags &= ~IFF_TEAM_PORT
team_handle_frame
team_port_get_rcu
team_port_exists
priv_flags & IFF_TEAM_PORT == 0
return NULL (instead of port got
from rx_handler_data)
netdev_rx_handler_unregister
The thing is that the flag is removed before rx_handler is unregistered.
If team_handle_frame is called in between, team_port_exists returns 0
and team_port_get_rcu will return NULL.
So do not check the flag here. It is guaranteed by netdev_rx_handler_unregister
that team_handle_frame will always see valid rx_handler_data pointer.
Signed-off-by: Jiri Pirko <jiri@resnulli.us> Fixes: 3d249d4ca7d0 ("net: introduce ethernet teaming device") Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
colons are used as a separator in netdev device lookup in dev_ioctl.c
Specific functions are SIOCGIFTXQLEN SIOCETHTOOL SIOCSIFNAME
Signed-off-by: Matthew Thode <mthode@mthode.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Non NAPI drivers can call skb_tstamp_tx() and then sock_queue_err_skb()
from hard IRQ context.
Therefore, sock_dequeue_err_skb() needs to block hard irq or
corruptions or hangs can happen.
Signed-off-by: Eric Dumazet <edumazet@google.com> Fixes: 364a9e93243d1 ("sock: deduplicate errqueue dequeue") Fixes: cb820f8e4b7f7 ("net: Provide a generic socket error queue delivery method for Tx time stamps.") Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Open vSwitch allows moving internal vport to different namespace
while still connected to the bridge. But when namespace deleted
OVS does not detach these vports, that results in dangling
pointer to netdevice which causes kernel panic as follows.
This issue is fixed by detaching all ovs ports from the deleted
namespace at net-exit.
Reported-by: Assaf Muller <amuller@redhat.com> Fixes: 46df7b81454("openvswitch: Add support for network namespaces.") Signed-off-by: Pravin B Shelar <pshelar@nicira.com> Reviewed-by: Thomas Graf <tgraf@noironetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
In tcf_em_validate(), after calling request_module() to load the
kind-specific module, set em->ops to NULL before returning -EAGAIN, so
that module_put() is not called again by tcf_em_tree_destroy().
Signed-off-by: Ignacy Gawędzki <ignacy.gawedzki@green-communications.fr> Acked-by: Cong Wang <cwang@twopensource.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
phy_init_eee uses phy_find_setting(phydev->speed, phydev->duplex)
to find a valid entry in the settings array for the given speed
and duplex value. For full duplex 1000baseT, this will return
the first matching entry, which is the entry for 1000baseKX_Full.
If the phy eee does not support 1000baseKX_Full, this entry will not
match, causing phy_init_eee to fail for no good reason.
Fixes: 9a9c56cb34e6 ("net: phy: fix a bug when verify the EEE support") Fixes: 3e7077067e80c ("phy: Expand phy speed/duplex settings array") Cc: Giuseppe Cavallaro <peppe.cavallaro@st.com> Signed-off-by: Guenter Roeck <linux@roeck-us.net> Acked-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
ip_check_defrag() may be used by af_packet to defragment outgoing packets.
skb_network_offset() of af_packet's outgoing packets is not zero.
Signed-off-by: Alexander Drozdov <al.drozdov@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
skb_copy_bits() returns zero on success and negative value on error,
so it is needed to invert the condition in ip_check_defrag().
Fixes: 1bf3751ec90c ("ipv4: ip_check_defrag must not modify skb before unsharing") Signed-off-by: Alexander Drozdov <al.drozdov@gmail.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The gnet_stats_copy_app() function gets called, more often than not, with its
second argument a pointer to an automatic variable in the caller's stack.
Therefore, to avoid copying garbage afterwards when calling
gnet_stats_finish_copy(), this data is better copied to a dynamically allocated
memory that gets freed after use.
[xiyou.wangcong@gmail.com: remove a useless kfree()]
Signed-off-by: Ignacy Gawędzki <ignacy.gawedzki@green-communications.fr> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Ignacy reported that when eth0 is down and add a vlan device
on top of it like:
ip link add link eth0 name eth0.1 up type vlan id 1
We will get a refcount leak:
unregister_netdevice: waiting for eth0.1 to become free. Usage count = 2
The problem is when rtnl_configure_link() fails in rtnl_newlink(),
we simply call unregister_device(), but for stacked device like vlan,
we almost do nothing when we unregister the upper device, more work
is done when we unregister the lower device, so call its ->dellink().
Reported-by: Ignacy Gawedzki <ignacy.gawedzki@green-communications.fr> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
ipv6_cow_metrics() currently assumes only DST_HOST routes require
dynamic metrics allocation from inetpeer. The assumption breaks
when ndisc discovered router with RTAX_MTU and RTAX_HOPLIMIT metric.
Refer to ndisc_router_discovery() in ndisc.c and note that dst_metric_set()
is called after the route is created.
This patch creates the metrics array (by calling dst_cow_metrics_generic) in
ipv6_cow_metrics().
Before:
[root@qemu1 ~]# ip -6 r show | egrep -v unreachable
fd00:face:face:face::/64 dev eth0 proto kernel metric 256 expires 27sec
fe80::/64 dev eth0 proto kernel metric 256
default via fe80::74df:d0ff:fe23:8ef2 dev eth0 proto ra metric 1024 expires 27sec
After:
[root@qemu1 ~]# ip -6 r show | egrep -v unreachable
fd00:face:face:face::/64 dev eth0 proto kernel metric 256 expires 27sec mtu 1300
fe80::/64 dev eth0 proto kernel metric 256 mtu 1300
default via fe80::74df:d0ff:fe23:8ef2 dev eth0 proto ra metric 1024 expires 27sec mtu 1300 hoplimit 30
Fixes: 8e2ec639173f325 (ipv6: don't use inetpeer to store metrics for routes.) Signed-off-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
IPv6 can keep a copy of SYN message using skb_get() in
tcp_v6_conn_request() so that caller wont free the skb when calling
kfree_skb() later.
Therefore TCP fast open has to clone the skb it is queuing in
child->sk_receive_queue, as all skbs consumed from receive_queue are
freed using __kfree_skb() (ie assuming skb->users == 1)
Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Fixes: 5b7ed0892f2af ("tcp: move fastopen functions to tcp_fastopen.c") Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Make __ipv6_select_ident() static as it isn't used outside
the file.
Fixes: 0508c07f5e0c9 (ipv6: Select fragment id during UFO segmentation if not set.) Signed-off-by: Vladislav Yasevich <vyasevic@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
ipv6: Select fragment id during UFO segmentation if not set.
Introduced a bug on LE in how ipv6 fragment id is assigned.
This was cought by nightly sparce check:
Resolve the following sparce error:
net/ipv6/output_core.c:57:38: sparse: incorrect type in assignment
(different base types)
net/ipv6/output_core.c:57:38: expected restricted __be32
[usertype] ip6_frag_id
net/ipv6/output_core.c:57:38: got unsigned int [unsigned]
[assigned] [usertype] id
Fixes: 0508c07f5e0c9 (ipv6: Select fragment id during UFO segmentation if not set.) Signed-off-by: Vladislav Yasevich <vyasevic@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
ifla_vf_policy[] is wrong in advertising its individual member types as
NLA_BINARY since .type = NLA_BINARY in combination with .len declares the
len member as *max* attribute length [0, len].
The issue is that when do_setvfinfo() is being called to set up a VF
through ndo handler, we could set corrupted data if the attribute length
is less than the size of the related structure itself.
The intent is exactly the opposite, namely to make sure to pass at least
data of minimum size of len.
Fixes: ebc08a6f47ee ("rtnetlink: Add VF config code to rtnetlink") Cc: Mitch Williams <mitch.a.williams@intel.com> Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
This patch fixes two issues in UDP checksum computation in pktgen.
First, the pseudo-header uses the source and destination IP
addresses. Currently, the ports are used for IPv4.
Second, the UDP checksum covers both header and data. So we need to
generate the data earlier (move pktgen_finalize_skb up), and compute
the checksum for UDP header + data.
Fixes: c26bf4a51308c ("pktgen: Add UDPCSUM flag to support UDP checksums") Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Acked-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
We still need a validate_link_af() handler with an appropriate nla policy,
similarly as we have in IPv4 case, otherwise size validations are not being
done properly in that case.
Fixes: f53adae4eae5 ("net: ipv6: add tokenized interface identifier support") Fixes: bc91b0f07ada ("ipv6: addrconf: implement address generation modes") Cc: Jiri Pirko <jiri@resnulli.us> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
flow_cache_flush_task references a structure member flow_cache_gc_work
where it should reference flow_cache_flush_task instead.
Kernel panic occurs on kernels using IPsec during XFRM garbage
collection. The garbage collection interval can be shortened using the
following sysctl settings:
With the default settings, our productions servers crash approximately
once a week. With the settings above, they crash immediately.
Fixes: ca925cf1534e ("flowcache: Make flow cache name space aware") Reported-by: Tomáš Charvát <tc@excello.cz> Tested-by: Jan Hejl <jh@excello.cz> Signed-off-by: Miroslav Urbanek <mu@miroslavurbanek.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
ACCESS_ONCE does not work reliably on non-scalar types. For
example gcc 4.6 and 4.7 might remove the volatile tag for such
accesses during the SRA (scalar replacement of aggregates) step
(https://gcc.gnu.org/bugzilla/show_bug.cgi?id=58145)
Change the ppc/kvm code to replace ACCESS_ONCE with READ_ONCE.
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com> Acked-by: Alexander Graf <agraf@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
ACCESS_ONCE does not work reliably on non-scalar types. For
example gcc 4.6 and 4.7 might remove the volatile tag for such
accesses during the SRA (scalar replacement of aggregates) step
(https://gcc.gnu.org/bugzilla/show_bug.cgi?id=58145)
Change the ppc/hugetlbfs code to replace ACCESS_ONCE with READ_ONCE.
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
ACCESS_ONCE does not work reliably on non-scalar types. For
example gcc 4.6 and 4.7 might remove the volatile tag for such
accesses during the SRA (scalar replacement of aggregates) step
(https://gcc.gnu.org/bugzilla/show_bug.cgi?id=58145)
Fixup gup_pmd_range.
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Commit 927609d622a3 ("kernel: tighten rules for ACCESS ONCE") results in a
compile failure for sh builds with CONFIG_X2TLB enabled.
arch/sh/mm/gup.c: In function 'gup_get_pte':
arch/sh/mm/gup.c:20:2: error: invalid initializer
make[1]: *** [arch/sh/mm/gup.o] Error 1
Replace ACCESS_ONCE with READ_ONCE to fix the problem.
Fixes: 927609d622a3 ("kernel: tighten rules for ACCESS ONCE") Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: Guenter Roeck <linux@roeck-us.net> Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Currently maximum space limit quota format supports is in blocks however
since we store space limits in bytes, this is somewhat confusing. So
store the maximum limit in bytes as well. Also rename the field to match
the new unit and related inode field to match the new naming scheme.
Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
ACCESS_ONCE does not work reliably on non-scalar types. For
example gcc 4.6 and 4.7 might remove the volatile tag for such
accesses during the SRA (scalar replacement of aggregates) step
(https://gcc.gnu.org/bugzilla/show_bug.cgi?id=58145)
Change the p2m code to replace ACCESS_ONCE with READ_ONCE.
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com> Reviewed-by: Juergen Gross <jgross@suse.com> Acked-by: David Vrabel <david.vrabel@citrix.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
if (unlikely(lock->tickets.tail & TICKET_SLOWPATH_FLAG))
__ticket_unlock_slowpath(lock, prev);
which is *exactly* the kind of things you cannot do with spinlocks,
because after you've done the "add_smp()" and released the spinlock
for the fast-path, you can't access the spinlock any more. Exactly
because a fast-path lock might come in, and release the whole data
structure.
Linus suggested that we should not do any writes to lock after unlock(),
and we can move slowpath clearing to fastpath lock.
So this patch implements the fix with:
1. Moving slowpath flag to head (Oleg):
Unlocked locks don't care about the slowpath flag; therefore we can keep
it set after the last unlock, and clear it again on the first (try)lock.
-- this removes the write after unlock. note that keeping slowpath flag would
result in unnecessary kicks.
By moving the slowpath flag from the tail to the head ticket we also avoid
the need to access both the head and tail tickets on unlock.
2. use xadd to avoid read/write after unlock that checks the need for
unlock_kick (Linus):
We further avoid the need for a read-after-release by using xadd;
the prev head value will include the slowpath flag and indicate if we
need to do PV kicking of suspended spinners -- on modern chips xadd
isn't (much) more expensive than an add + load.
The use of READ_ONCE() causes lots of warnings witht he pending paravirt
spinlock fixes, because those ends up having passing a member to a
'const' structure to READ_ONCE().
There should certainly be nothing wrong with using READ_ONCE() with a
const source, but the helper function __read_once_size() would cause
warnings because it would drop the 'const' qualifier, but also because
the destination would be marked 'const' too due to the use of 'typeof'.
Use a union of types in READ_ONCE() to avoid this issue.
Also make sure to use parenthesis around the macro arguments to avoid
possible operator precedence issues.
Commit 927609d622a3 ("kernel: tighten rules for ACCESS ONCE") results in
sparse warnings like "Using plain integer as NULL pointer" - Let's add a
type cast to the dummy assignment.
To avoid warnings lik "sparse: warning: cast to restricted __hc32" we also
use __force on that cast.
Fixes: 927609d622a3 ("kernel: tighten rules for ACCESS ONCE") Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Now that all non-scalar users of ACCESS_ONCE have been converted
to READ_ONCE or ASSIGN once, lets tighten ACCESS_ONCE to only
work on scalar types.
This variant was proposed by Alexei Starovoitov.
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com> Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
pmc_dbgfs_unregister() will be called when pmc->dbgfs_dir is unconditionally
NULL on error path in pmc_dbgfs_register(). To prevent this we move the
assignment to where is should be.
Fixes: f855911c1f48 (x86/pmc_atom: Expose PMC device state and platform sleep state) Reported-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Cc: Aubrey Li <aubrey.li@linux.intel.com> Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Cc: Kumar P. Mahesh <mahesh.kumar.p@intel.com> Link: http://lkml.kernel.org/r/1421253575-22509-2-git-send-email-andriy.shevchenko@linux.intel.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Commit b568b8601f05 ("Treat SCI interrupt as normal GSI interrupt")
accidently removes support of legacy PIC interrupt when fixing a
regression for Xen, which causes a nasty regression on HP/Compaq
nc6000 where we fail to register the ACPI interrupt, and thus
lose eg. thermal notifications leading a potentially overheated
machine.
So reintroduce support of legacy PIC based ACPI SCI interrupt.
Reported-by: Ville Syrjälä <syrjala@sci.fi> Tested-by: Ville Syrjälä <syrjala@sci.fi> Signed-off-by: Jiang Liu <jiang.liu@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Pavel Machek <pavel@ucw.cz> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Len Brown <len.brown@intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Rafael J. Wysocki <rjw@rjwysocki.net> Cc: Sander Eikelenboom <linux@eikelenboom.it> Cc: linux-pm@vger.kernel.org Link: http://lkml.kernel.org/r/1424052673-22974-1-git-send-email-jiang.liu@linux.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Note that, it declares the "random_variable" variable as "unsigned int".
Since the result of the shifting operation between STACK_RND_MASK (which
is 0x3fffff on x86_64, 22 bits) and PAGE_SHIFT (which is 12 on x86_64):
random_variable <<= PAGE_SHIFT;
then the two leftmost bits are dropped when storing the result in the
"random_variable". This variable shall be at least 34 bits long to hold
the (22+12) result.
These two dropped bits have an impact on the entropy of process stack.
Concretely, the total stack entropy is reduced by four: from 2^28 to
2^30 (One fourth of expected entropy).
This patch restores back the entropy by correcting the types involved
in the operations in the functions randomize_stack_top() and
stack_maxrandom_size().
Andy pointed out that if an NMI or MCE is received while we're in the
middle of an EFI mixed mode call a triple fault will occur. This can
happen, for example, when issuing an EFI mixed mode call while running
perf.
The reason for the triple fault is that we execute the mixed mode call
in 32-bit mode with paging disabled but with 64-bit kernel IDT handlers
installed throughout the call.
At Andy's suggestion, stop playing the games we currently do at runtime,
such as disabling paging and installing a 32-bit GDT for __KERNEL_CS. We
can simply switch to the __KERNEL32_CS descriptor before invoking
firmware services, and run in compatibility mode. This way, if an
NMI/MCE does occur the kernel IDT handler will execute correctly, since
it'll jump to __KERNEL_CS automatically.
However, this change is only possible post-ExitBootServices(). Before
then the firmware "owns" the machine and expects for its 32-bit IDT
handlers to be left intact to service interrupts, etc.
So, we now need to distinguish between early boot and runtime
invocations of EFI services. During early boot, we need to restore the
GDT that the firmware expects to be present. We can only jump to the
__KERNEL32_CS code segment for mixed mode calls after ExitBootServices()
has been invoked.
A liberal sprinkling of comments in the thunking code should make the
differences in early and late environments more apparent.
Reported-by: Andy Lutomirski <luto@amacapital.net> Tested-by: Borislav Petkov <bp@suse.de> Signed-off-by: Matt Fleming <matt.fleming@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
When reading blkio.throttle.io_serviced in a recently created blkio
cgroup, it's possible to race against the creation of a throttle policy,
which delays the allocation of stats_cpu.
Like other functions in the throttle code, just checking for a NULL
stats_cpu prevents the following oops caused by that race.
We have a scenario where after the fsync log replay we can lose file data
that had been previously fsync'ed if we added an hard link for our inode
and after that we sync'ed the fsync log (for example by fsync'ing some
other file or directory).
This is because when adding an hard link we updated the inode item in the
log tree with an i_size value of 0. At that point the new inode item was
in memory only and a subsequent fsync log replay would not make us lose
the file data. However if after adding the hard link we sync the log tree
to disk, by fsync'ing some other file or directory for example, we ended
up losing the file data after log replay, because the inode item in the
persisted log tree had an an i_size of zero.
This is easy to reproduce, and the following excerpt from my test for
xfstests shows this:
# Create one file with data and fsync it.
# This made the btrfs fsync log persist the data and the inode metadata with
# a correct inode->i_size (4096 bytes).
$XFS_IO_PROG -f -c "pwrite -S 0xaa -b 4K 0 4K" -c "fsync" \
$SCRATCH_MNT/foo | _filter_xfs_io
# Now add one hard link to our file. This made the btrfs code update the fsync
# log, in memory only, with an inode metadata having a size of 0.
ln $SCRATCH_MNT/foo $SCRATCH_MNT/foo_link
# Now force persistence of the fsync log to disk, for example, by fsyncing some
# other file.
touch $SCRATCH_MNT/bar
$XFS_IO_PROG -c "fsync" $SCRATCH_MNT/bar
# Before a power loss or crash, we could read the 4Kb of data from our file as
# expected.
echo "File content before:"
od -t x1 $SCRATCH_MNT/foo
# Simulate a crash/power loss.
_load_flakey_table $FLAKEY_DROP_WRITES
_unmount_flakey
# After the fsync log replay, because the fsync log had a value of 0 for our
# inode's i_size, we couldn't read anymore the 4Kb of data that we previously
# wrote and fsync'ed. The size of the file became 0 after the fsync log replay.
echo "File content after:"
od -t x1 $SCRATCH_MNT/foo
Another alternative test, that doesn't need to fsync an inode in the same
transaction it was created, is:
# Create our test file with some data.
$XFS_IO_PROG -f -c "pwrite -S 0xaa -b 8K 0 8K" \
$SCRATCH_MNT/foo | _filter_xfs_io
# Make sure the file is durably persisted.
sync
# Append some data to our file, to increase its size.
$XFS_IO_PROG -f -c "pwrite -S 0xcc -b 4K 8K 4K" \
$SCRATCH_MNT/foo | _filter_xfs_io
# Fsync the file, so from this point on if a crash/power failure happens, our
# new data is guaranteed to be there next time the fs is mounted.
$XFS_IO_PROG -c "fsync" $SCRATCH_MNT/foo
# Add one hard link to our file. This made btrfs write into the in memory fsync
# log a special inode with generation 0 and an i_size of 0 too. Note that this
# didn't update the inode in the fsync log on disk.
ln $SCRATCH_MNT/foo $SCRATCH_MNT/foo_link
# Now make sure the in memory fsync log is durably persisted.
# Creating and fsync'ing another file will do it.
touch $SCRATCH_MNT/bar
$XFS_IO_PROG -c "fsync" $SCRATCH_MNT/bar
# As expected, before the crash/power failure, we should be able to read the
# 12Kb of file data.
echo "File content before:"
od -t x1 $SCRATCH_MNT/foo
# Simulate a crash/power loss.
_load_flakey_table $FLAKEY_DROP_WRITES
_unmount_flakey
# After mounting the fs again, the fsync log was replayed.
# The btrfs fsync log replay code didn't update the i_size of the persisted
# inode because the inode item in the log had a special generation with a
# value of 0 (and it couldn't know the correct i_size, since that inode item
# had a 0 i_size too). This made the last 4Kb of file data inaccessible and
# effectively lost.
echo "File content after:"
od -t x1 $SCRATCH_MNT/foo
This isn't a new issue/regression. This problem has been around since the
log tree code was added in 2008:
If btrfs_find_item is called with NULL path it allocates one locally but
does not free it. Affected paths are inserting an orphan item for a file
and for a subvol root.
Move the path allocation to the callers.
Fixes: 3f870c289900 ("btrfs: expand btrfs_find_item() to include find_orphan_item functionality") Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
It turns out it's possible to get __remove_osd() called twice on the
same OSD. That doesn't sit well with rb_erase() - depending on the
shape of the tree we can get a NULL dereference, a soft lockup or
a random crash at some point in the future as we end up touching freed
memory. One scenario that I was able to reproduce is as follows:
A case can be made that osd refcounting is imperfect and reworking it
would be a proper resolution, but for now Sage and I decided to fix
this by adding a safe guard around __remove_osd().
Fixes: http://tracker.ceph.com/issues/8087 Cc: Sage Weil <sage@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Sage Weil <sage@redhat.com> Reviewed-by: Alex Elder <elder@linaro.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Since kernel 3.14 the backlight control has been broken on various Samsung
Atom based netbooks. This has been bisected and this problem happens since
commit b35684b8fa94 ("drm/i915: do full backlight setup at enable time")
This has been reported and discussed in detail here:
http://lists.freedesktop.org/archives/intel-gfx/2014-July/049395.html
Unfortunately no-one has been able to fix this. This only affects Samsung
Atom netbooks, and the Linux kernel and the BIOS of those laptops have never
worked well together. All affected laptops already have a quirk to avoid using
the standard acpi-video interface and instead use the samsung specific SABI
interface which samsung-laptop uses. It seems that recent fixes to the i915
driver have also broken backlight control through the SABI interface.
The intel_backlight driver OTOH works fine, and also allows for finer grained
backlight control. So add a new use_native_backlight quirk, and replace the
broken_acpi_video quirk with this quirk for affected models. This new quirk
disables acpi-video as before and also stops samsung-laptop from registering
the SABI based samsung_laptop backlight interface, leaving only the working
intel_backlight interface.
This commit enables this new quirk for 3 models which are known to be affected,
chances are that it needs to be used on other models too.
When DRAM errors occur on memory controllers after EDAC_MAX_MCS (16),
the kernel fatally dereferences unallocated structures, see splat below;
this occurs on at least NumaConnect systems.
Fix by checking if a memory controller info structure was found.
Causes an RCW cycle to be forced even when the array is degraded.
A degraded array cannot support RCW as that requires reading all data
blocks, and one may be missing.
Forcing an RCW when it is not possible causes a live-lock and the code
spins, repeatedly deciding to do something that cannot succeed.
So change the condition to only force RCW on non-degraded arrays.
Commit f6edb53c4993ffe92ce521fb449d1c146cea6ec2 converted the probe to
a CPU wide event first (pid == -1). For kernels that do not support
the PERF_FLAG_FD_CLOEXEC flag the probe fails with EINVAL. Since this
errno is not handled pid is not reset to 0 and the subsequent use of
pid = -1 as an argument brings in an additional failure path if
perf_event_paranoid > 0:
$ perf record -- sleep 1
perf_event_open(..., 0) failed unexpectedly with error 13 (Permission denied)
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.007 MB /tmp/perf.data (11 samples) ]
Also, ensure the fd of the confirmation check is closed and comment why
pid = -1 is used.
Needs to go to 3.18 stable tree as well.
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com> Based-on-patch-by: David Ahern <david.ahern@oracle.com> Acked-by: David Ahern <david.ahern@oracle.com> Cc: David Ahern <dsahern@gmail.com> Link: http://lkml.kernel.org/r/54EC610C.8000403@intel.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
We have two race conditions in the probe code which could lead to a null
pointer dereference in the interrupt handler.
The interrupt handler accesses the clockevent device, which may not yet be
registered.
First race condition happens when the interrupt handler gets registered before
the interrupts get disabled. The second race condition happens when the
interrupts get enabled, but the clockevent device is not yet registered.
Fix that by disabling the interrupts before we register the interrupt and enable
the interrupts after the clockevent device got registered.
Reported-by: Gongbae Park <yongbae2@gmail.com> Signed-off-by: Matthias Brugger <matthias.bgg@gmail.com> Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The KSTK_EIP() and KSTK_ESP() macros should return the user program
counter (PC) and stack pointer (A0StP) of the given task. These are used
to determine which VMA corresponds to the user stack in
/proc/<pid>/maps, and for the user PC & A0StP in /proc/<pid>/stat.
However for Meta the PC & A0StP from the task's kernel context are used,
resulting in broken output. For example in following /proc/<pid>/maps
output, the 3afff000-3b021000 VMA should be described as the stack:
And in the following /proc/<pid>/stat output, the PC is in kernel code
(1074234964 = 0x40078654) and the A0StP is in the kernel heap
(1335981392 = 0x4fa17550):
Fix the definitions of KSTK_EIP() and KSTK_ESP() to use
task_pt_regs(tsk)->ctx rather than (tsk)->thread.kernel_context. This
gets the registers from the user context stored after the thread info at
the base of the kernel stack, which is from the last entry into the
kernel from userland, regardless of where in the kernel the task may
have been interrupted, which results in the following more correct
/proc/<pid>/maps output:
For filesystems without separate project quota inode field in the
superblock we just reuse project quota file for group quotas (and vice
versa) if project quota file is allocated and we need group quota file.
When we reuse the file, quota structures on disk suddenly have wrong
type stored in d_flags though. Nobody really cares about this (although
structure type reported to userspace was wrong as well) except
that after commit 14bf61ffe6ac (quota: Switch ->get_dqblk() and
->set_dqblk() to use bytes as space units) assertion in
xfs_qm_scall_getquota() started to trigger on xfs/106 test (apparently I
was testing without XFS_DEBUG so I didn't notice when submitting the
above commit).
Fix the problem by properly resetting ddq->d_flags when running quotacheck
for a quota file.
Reported-by: Al Viro <viro@ZenIV.linux.org.uk> Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The gpio_chip operations receive a pointer the gpio_chip struct which is
contained in the driver's private struct, yet the container_of call in those
functions point to the mfd struct defined in include/linux/mfd/tps65912.h.
assumed that only one gpio-chip is registred per of-node.
Some drivers register more than one chip per of-node, so
adjust the matching function of_gpiochip_find_and_xlate to
not stop looking for chips if a node-match is found and
the translation fails.
Fixes: 7b8792bbdffd ("gpiolib: of: Correct error handling in of_get_named_gpiod_flags") Signed-off-by: Hans Holmberg <hans.holmberg@intel.com> Acked-by: Alexandre Courbot <acourbot@nvidia.com> Tested-by: Robert Jarzmik <robert.jarzmik@free.fr> Tested-by: Tyler Hall <tylerwhall@gmail.com> Signed-off-by: Linus Walleij <linus.walleij@linaro.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The native (64-bit) sigval_t union contains sival_int (32-bit) and
sival_ptr (64-bit). When a compat application invokes a syscall that
takes a sigval_t value (as part of a larger structure, e.g.
compat_sys_mq_notify, compat_sys_timer_create), the compat_sigval_t
union is converted to the native sigval_t with sival_int overlapping
with either the least or the most significant half of sival_ptr,
depending on endianness. When the corresponding signal is delivered to a
compat application, on big endian the current (compat_uptr_t)sival_ptr
cast always returns 0 since sival_int corresponds to the top part of
sival_ptr. This patch fixes copy_siginfo_to_user32() so that sival_int
is copied to the compat_siginfo_t structure.
Since the removal of CONFIG_REGULATOR_DUMMY option, the touchscreen stopped
working. This patch enables the "replacement" for REGULATOR_DUMMY and
allows the touchscreen to work even though there is no regulator for "vcc".
Signed-off-by: Martin Vajnar <martin.vajnar@gmail.com> Signed-off-by: Robert Jarzmik <robert.jarzmik@free.fr> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Patch 0759d0681cae ("KVM: s390: cleanup handle_wait by reusing
kvm_vcpu_block") changed the way pending guest clock comparator
interrupts are detected. It was assumed that as soon as the hrtimer
wakes up, the condition for the guest ckc is satisfied.
This is however only true as long as adjclock() doesn't speed
up the monotonic clock. Reason is that the hrtimer is based on
CLOCK_MONOTONIC, the guest clock comparator detection is based
on the raw TOD clock. If CLOCK_MONOTONIC runs faster than the
TOD clock, the hrtimer wakes the target VCPU up too early and
the target VCPU will not detect any pending interrupts, therefore
going back to sleep. It will never be woken up again because the
hrtimer has finished. The VCPU is stuck.
As a quick fix, we have to forward the hrtimer until the guest
clock comparator is really due, to guarantee properly timed wake
ups.
As the hrtimer callback might be triggered on another cpu, we
have to make sure that the timer is really stopped and not currently
executing the callback on another cpu. This can happen if the vcpu
thread is scheduled onto another physical cpu, but the timer base
is not migrated. So lets use hrtimer_cancel instead of try_to_cancel.
A proper fix might be to introduce a RAW based hrtimer.
Reported-by: Christian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: David Hildenbrand <dahi@linux.vnet.ibm.com> Acked-by: Cornelia Huck <cornelia.huck@de.ibm.com> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Check length of extended attributes and allocation descriptors when
loading inodes from disk. Otherwise corrupted filesystems could confuse
the code and make the kernel oops.
Reported-by: Carl Henrik Lunde <chlunde@ping.uio.no> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
activate_mm() and switch_mm() call get_new_mmu_context() which in turn
can enable the HTW before the entryhi is changed with the new ASID.
Since the latter will enable the HTW in local_flush_tlb_all(),
then there is a small timing window where the HTW is running with the
new ASID but with an old pgd since the TLBMISS_HANDLER_SETUP_PGD
hasn't assigned a new one yet. In order to prevent that, we introduce a
simple htw counter to avoid starting HTW accidentally due to nested
htw_{start,stop}() sequences. Moreover, since various IPI calls can
enforce TLB flushing operations on a different core, such an operation
may interrupt another htw_{stop,start} in progress leading inconsistent
updates of the htw_seq variable. In order to avoid that, we disable the
interrupts whenever we update that variable.
We used to calculate page address differently in 2 cases:
1. In virt_to_page(x) we do
--->8---
mem_map + (x - CONFIG_LINUX_LINK_BASE) >> PAGE_SHIFT
--->8---
2. In in pte_page(x) we do
--->8---
mem_map + (pte_val(x) - PAGE_OFFSET) >> PAGE_SHIFT
--->8---
That leads to problems in case PAGE_OFFSET != CONFIG_LINUX_LINK_BASE -
different pages will be selected depending on where and how we calculate
page address.
In particular in the STAR 9000853582 when gdb attempted to read memory
of another process it got improper page in get_user_pages() because this
is exactly one of the places where we search for a page by pte_page().
The fix is trivial - we need to calculate page address similarly in both
cases.
When the UART is in DMA receive mode (RDMAS set) and one character
just arrived while another interrupt is handled (e.g. TX), the RDRF
(receiver data register full flag) is set due to the water level of
1. But since the DMA will take care of this character, there is no
need to handle it by calling lpuart_prepare_rx. Handling it leads to
adding the RX timeout timer twice:
[ 74.336698] Kernel BUG at 80053070 [verbose debug info unavailable]
[ 74.342999] Internal error: Oops - BUG: 0 [#1] ARM0:00.00 khungtaskd
[ 74.347817] Modules linked in: 0 S 0.0 0.0 0:00.00 writeback
[ 74.350926] CPU: 0 PID: 0 Comm: swapper Not tainted 3.19.0-rc3-00001-g39d78e2 #1788
[ 74.358617] Hardware name: Freescale Vybrid VF610 (Device Tree)t
[ 74.364563] task: 807a7678 ti: 8079c000 task.ti: 8079c000 kblockd
[ 74.370002] PC is at add_timer+0x24/0x28.0 0.0 0:00.09 kworker/u2:1
[ 74.373960] LR is at lpuart_int+0x15c/0x3d8
[ 74.378171] pc : [<80053070>] lr : [<802e0d88>] psr: a0010193
[ 74.378171] sp : 8079de10 ip : 8079de20 fp : 8079de1c
[ 74.389694] r10: 807d44c0 r9 : 8688c300 r8 : 00000013
[ 74.394943] r7 : 20010193 r6 : 00000000 r5 : 000000a0 r4 : 86997210
[ 74.401498] r3 : ffffa7da r2 : 80817868 r1 : 86997210 r0 : 86997344
[ 74.408052] Flags: NzCv IRQs off FIQs on Mode SVC_32 ISA ARM Segment kernel
[ 74.415489] Control: 10c5387d Table: 8611c059 DAC: 00000015
[ 74.421265] Process swapper (pid: 0, stack limit = 0x8079c230)
...
Solve this by only execute the receiver path (lpuart_prepare_rx) if
the DMA receive mode (RDMAS) is not set. Also, make sure the flag is
cleared on initialization, in case it has been left set.
This can be best reproduced using UART as a serial console, then
running top while dd'ing data into the terminal.
Signed-off-by: Stefan Agner <stefan@agner.ch> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
If the serial port gets closed while a RX transfer is in progress,
the timer might fire after the serial port shutdown finished. This
leads in a NULL pointer dereference:
[ 7.508324] Unable to handle kernel NULL pointer dereference at virtual address 00000000
[ 7.516590] pgd = 86348000
[ 7.519445] [00000000] *pgd=86179831, *pte=00000000, *ppte=00000000
[ 7.526145] Internal error: Oops: 17 [#1] ARM
[ 7.530611] Modules linked in:
[ 7.533876] CPU: 0 PID: 123 Comm: systemd Not tainted 3.19.0-rc3-00004-g5b11ea7 #1778
[ 7.541827] Hardware name: Freescale Vybrid VF610 (Device Tree)
[ 7.547862] task: 861c3400 ti: 86ac8000 task.ti: 86ac8000
[ 7.553392] PC is at lpuart_timer_func+0x24/0xf8
[ 7.558127] LR is at lpuart_timer_func+0x20/0xf8
[ 7.562857] pc : [<802df99c>] lr : [<802df998>] psr: 600b0113
[ 7.562857] sp : 86ac9b90 ip : 86ac9b90 fp : 86ac9bbc
[ 7.574467] r10: 80817180 r9 : 80817b98 r8 : 80817998
[ 7.579803] r7 : 807acee0 r6 : 86989000 r5 : 00000100 r4 : 86997210
[ 7.586444] r3 : 86ac8000 r2 : 86ac9bc0 r1 : 86997210 r0 : 00000000
[ 7.593085] Flags: nZCv IRQs on FIQs on Mode SVC_32 ISA ARM Segment user
[ 7.600341] Control: 10c5387d Table: 86348059 DAC: 00000015
[ 7.606203] Process systemd (pid: 123, stack limit = 0x86ac8230)
Setup the timer on UART startup which allows to delete the timer
unconditionally on shutdown. This also saves the initialization
on each transfer.
Signed-off-by: Stefan Agner <stefan@agner.ch> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Additional validation of adjtimex freq values to avoid
potential multiplication overflows were added in commit 5e5aeb4367b (time: adjtimex: Validate the ADJ_FREQUENCY values)
Unfortunately the patch used LONG_MAX/MIN instead of
LLONG_MAX/MIN, which was fine on 64-bit systems, but being
much smaller on 32-bit systems caused false positives
resulting in most direct frequency adjustments to fail w/
EINVAL.
ntpd only does direct frequency adjustments at startup, so
the issue was not as easily observed there, but other time
sync applications like ptpd and chrony were more effected by
the bug.
This patch changes the checks to use LLONG_MAX for
clarity, and additionally the checks are disabled
on 32-bit systems since LLONG_MAX/PPM_SCALE is always
larger then the 32-bit long freq value, so multiplication
overflows aren't possible there.
Reported-by: Josh Boyer <jwboyer@fedoraproject.org> Reported-by: George Joseph <george.joseph@fairview5.com> Tested-by: George Joseph <george.joseph@fairview5.com> Signed-off-by: John Stultz <john.stultz@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Sasha Levin <sasha.levin@oracle.com> Link: http://lkml.kernel.org/r/1423553436-29747-1-git-send-email-john.stultz@linaro.org
[ Prettified the changelog and the comments a bit. ] Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
There was a follow on replacement patch against the prior
"kgdb: Timeout if secondary CPUs ignore the roundup".
See: https://lkml.org/lkml/2015/1/7/442
This patch is the delta vs the patch that was committed upstream:
* Fix an off-by-one error in kdb_cpu().
* Replace NR_CPUS with CONFIG_NR_CPUS to tell checkpatch that we
really want a static limit.
* Removed the "KGDB: " prefix from the pr_crit() in debug_core.c
(kgdb-next contains a patch which introduced pr_fmt() to this file
to the tag will now be applied automatically).
Cc: Daniel Thompson <daniel.thompson@linaro.org> Signed-off-by: Jason Wessel <jason.wessel@windriver.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Currently when kdb traps printk messages then the raw log level prefix
(consisting of '\001' followed by a numeral) does not get stripped off
before the message is issued to the various I/O handlers supported by
kdb. This causes annoying visual noise as well as causing problems
grepping for ^. It is also a change of behaviour compared to normal usage
of printk() usage. For example <SysRq>-h ends up with different output to
that of kdb's "sr h".
This patch addresses the problem by stripping log levels from messages
before they are issued to the I/O handlers. printk() which can also
act as an i/o handler in some cases is special cased; if the caller
provided a log level then the prefix will be preserved when sent to
printk().
The addition of non-printable characters to the output of kdb commands is a
regression, albeit and extremely elderly one, introduced by commit 04d2c8c83d0e ("printk: convert the format for KERN_<LEVEL> to a 2 byte
pattern"). Note also that this patch does *not* restore the original
behaviour from v3.5. Instead it makes printk() from within a kdb command
display the message without any prefix (i.e. like printk() normally does).
Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org> Cc: Joe Perches <joe@perches.com> Signed-off-by: Jason Wessel <jason.wessel@windriver.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The output of KDB 'summary' command should report MemTotal, MemFree
and Buffers output in kB. Current codes report in unit of pages.
A define of K(x) as
is defined in the code, but not used.
This patch would apply the define to convert the values to kB.
Please include me on Cc on replies. I do not subscribe to linux-kernel.
Signed-off-by: Jay Lan <jlan@sgi.com> Signed-off-by: Jason Wessel <jason.wessel@windriver.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
mvebu_armada375_smp_wa_init is only used on armada 375 but is defined
for all mvebu machines. As it calls a function that is only provided
sometimes, this can result in a link error:
arch/arm/mach-mvebu/built-in.o: In function `mvebu_armada375_smp_wa_init':
:(.text+0x228): undefined reference to `mvebu_setup_boot_addr_wa'
To solve this, we can just change the existing #ifdef around the
function to also check for Armada375 SMP platforms.
Signed-off-by: Arnd Bergmann <arnd@arndb.de> Fixes: 305969fb6292 ("ARM: mvebu: use the common function for Armada 375 SMP workaround") Cc: Andrew Lunn <andrew@lunn.ch> Cc: Jason Cooper <jason@lakedaemon.net> Cc: Gregory Clement <gregory.clement@free-electrons.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
A recent cleanup rearranged the Kconfig file for mach-bcm and
accidentally dropped the dependency on ARCH_MULTI_V7, which
makes it possible to now build the two mobile SoC platforms
on an ARMv6-only kernel, resulting in a log of Kconfig
warnings like
warning: ARCH_BCM_MOBILE selects ARM_ERRATA_775420 which has unmet direct dependencies (CPU_V7)
and which of course cannot work on any machine.
This puts back the dependencies as before.
Signed-off-by: Arnd Bergmann <arnd@arndb.de> Fixes: 64e74aa788f99 ("ARM: mach-bcm: ARCH_BCM_MOBILE: remove one level of menu from Kconfig") Acked-by: Florian Fainelli <f.fainelli@gmail.com> Acked-by: Scott Branden <sbranden@broadcom.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Add regulator_has_full_constraints() call to spitz board file to let
regulator core know that we do not have any additional regulators left.
This lets it substitute unprovided regulators with dummy ones.
This fixes the following warnings that can be seen on spitz if
regulators are enabled:
ads7846 spi2.0: unable to get regulator: -517
spi spi2.0: Driver ads7846 requests probe deferral
Signed-off-by: Dmitry Eremin-Solenikov <dbaryshkov@gmail.com> Acked-by: Mark Brown <broonie@kernel.org> Signed-off-by: Robert Jarzmik <robert.jarzmik@free.fr> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Add regulator_has_full_constraints() call to poodle board file to let
regulator core know that we do not have any additional regulators left.
This lets it substitute unprovided regulators with dummy ones.
This fixes the following warnings that can be seen on poodle if
regulators are enabled:
ads7846 spi1.0: unable to get regulator: -517
spi spi1.0: Driver ads7846 requests probe deferral
wm8731 0-001b: Failed to get supply 'AVDD': -517
wm8731 0-001b: Failed to request supplies: -517
wm8731 0-001b: ASoC: failed to probe component -517
Signed-off-by: Dmitry Eremin-Solenikov <dbaryshkov@gmail.com> Acked-by: Mark Brown <broonie@kernel.org> Signed-off-by: Robert Jarzmik <robert.jarzmik@free.fr> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Add regulator_has_full_constraints() call to corgi board file to let
regulator core know that we do not have any additional regulators left.
This lets it substitute unprovided regulators with dummy ones.
This fixes the following warnings that can be seen on corgi if
regulators are enabled:
ads7846 spi1.0: unable to get regulator: -517
spi spi1.0: Driver ads7846 requests probe deferral
wm8731 0-001b: Failed to get supply 'AVDD': -517
wm8731 0-001b: Failed to request supplies: -517
wm8731 0-001b: ASoC: failed to probe component -517
corgi-audio corgi-audio: ASoC: failed to instantiate card -517
Signed-off-by: Dmitry Eremin-Solenikov <dbaryshkov@gmail.com> Acked-by: Mark Brown <broonie@kernel.org> Signed-off-by: Robert Jarzmik <robert.jarzmik@free.fr> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The vcs device's poll/fasync support relies on the vt notifier to signal
changes to the screen content. Notifier invocations were missing for
changes that comes through the selection interface though. Fix that.
Tested with BRLTTY 5.2.
Signed-off-by: Nicolas Pitre <nico@linaro.org> Cc: Dave Mielke <dave@mielke.cc> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Check the special CDC headers for a plausible minimum length.
Another big operating systems ignores such garbage.
Signed-off-by: Oliver Neukum <oneukum@suse.de> Reviewed-by: Adam Lee <adam8157@gmail.com> Tested-by: Adam Lee <adam8157@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Currently the USB stack assumes that all host controller drivers are
capable of receiving wakeup requests from downstream devices.
However, this isn't true for the isp1760-hcd driver, which means that
it isn't safe to do a runtime suspend of any device attached to a
root-hub port if the device requires wakeup.
This patch adds a "cant_recv_wakeups" flag to the usb_hcd structure
and sets the flag in isp1760-hcd. The core is modified to prevent a
direct child of the root hub from being put into runtime suspend with
wakeup enabled if the flag is set.
Signed-off-by: Alan Stern <stern@rowland.harvard.edu> Tested-by: Nicolas Pitre <nico@linaro.org> Signed-off-by: Greg Kroah-Hartman <greg@kroah.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The USB stack provides a mechanism for drivers to request an
asynchronous device reset (usb_queue_reset_device()). The mechanism
uses a work item (reset_ws) embedded in the usb_interface structure
used by the driver, and the reset is carried out by a work queue
routine.
The asynchronous reset can race with driver unbinding. When this
happens, we try to cancel the queued reset before unbinding the
driver, on the theory that the driver won't care about any resets once
it is unbound.
However, thanks to the fact that lockdep now tracks work queue
accesses, this can provoke a lockdep warning in situations where the
device reset causes another interface's driver to be unbound; see
for an example. The reason is that the work routine for reset_ws in
one interface calls cancel_queued_work() for the reset_ws in another
interface. Lockdep thinks this might lead to a work routine trying to
cancel itself. The simplest solution is not to cancel queued resets
when unbinding drivers.
This means we now need to acquire a reference to the usb_interface
when queuing a reset_ws work item and to drop the reference when the
work routine finishes. We also need to make sure that the
usb_interface structure doesn't outlive its parent usb_device; this
means acquiring and dropping a reference when the interface is created
and destroyed.
In addition, cancelling a queued reset can fail (if the device is in
the middle of an earlier reset), and this can cause usb_reset_device()
to try to rebind an interface that has been deallocated (see
http://marc.info/?l=linux-usb&m=142175717016628&w=2 for details).
Acquiring the extra references prevents this failure.
Signed-off-by: Alan Stern <stern@rowland.harvard.edu> Reported-by: Russell King - ARM Linux <linux@arm.linux.org.uk> Reported-by: Olivier Sobrie <olivier@sobrie.be> Tested-by: Olivier Sobrie <olivier@sobrie.be> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
the following error pops up during "testusb -a -t 10"
| musb-hdrc musb-hdrc.1.auto: dma_pool_free buffer-128, f134e000/be842000 (bad dma)
hcd_buffer_create() creates a few buffers, the smallest has 32 bytes of
size. ARCH_KMALLOC_MINALIGN is set to 64 bytes. This combo results in
hcd_buffer_alloc() returning memory which is 32 bytes aligned and it
might by identified by buffer_offset() as another buffer. This means the
buffer which is on a 32 byte boundary will not get freed, instead it
tries to free another buffer with the error message.
This patch fixes the issue by creating the smallest DMA buffer with the
size of ARCH_KMALLOC_MINALIGN (or 32 in case ARCH_KMALLOC_MINALIGN is
smaller). This might be 32, 64 or even 128 bytes. The next three pools
will have the size 128, 512 and 2048.
In case the smallest pool is 128 bytes then we have only three pools
instead of four (and zero the first entry in the array).
The last pool size is always 2048 bytes which is the assumed PAGE_SIZE /
2 of 4096. I doubt it makes sense to continue using PAGE_SIZE / 2 where
we would end up with 8KiB buffer in case we have 16KiB pages.
Instead I think it makes sense to have a common size(s) and extend them
if there is need to.
There is a BUILD_BUG_ON() now in case someone has a minalign of more than
128 bytes.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Acked-by: Alan Stern <stern@rowland.harvard.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 8e74475b0e0a (usb: dwc3: gadget: use udc-core's
reset notifier) added support for the new UDC core's
reset notifier to dwc3 but while at it, it removed
a spin_lock() from dwc3_reset_gadget() which might
cause an unbalanced spin_unlock() further down the line
Fixes: 8e74475b0e0a (usb: dwc3: gadget: use udc-core's reset notifier) Signed-off-by: Felipe Balbi <balbi@ti.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The usb_hcd_unlink_urb() routine in hcd.c contains two possible
use-after-free errors. The dev_dbg() statement at the end of the
routine dereferences urb and urb->dev even though both structures may
have been deallocated.
This patch fixes the problem by storing urb->dev in a local variable
(avoiding the dereference of urb) and moving the dev_dbg() up before
the usb_put_dev() call.
Signed-off-by: Alan Stern <stern@rowland.harvard.edu> Reported-by: Joe Lawrence <joe.lawrence@stratus.com> Tested-by: Joe Lawrence <joe.lawrence@stratus.com> Signed-off-by: Greg Kroah-Hartman <greg@kroah.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
We might enter the interrupt handler with hw_ready already set,
but prior we actually started the reset flow.
To soleve this we move the reset release from the interrupt handler
to the HW start wait function which is part of the reset sequence.
Signed-off-by: Alexander Usyskin <alexander.usyskin@intel.com> Signed-off-by: Tomas Winkler <tomas.winkler@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-EDEFER error wasn't handle properly by atmel_serial_probe().
As an example, when atmel_serial_probe() is called for the first time, we pass
the test_and_set_bit() test to check whether the port has already been
initalized. Then we call atmel_init_port(), which may return -EDEFER, possibly
returned before by clk_get(). Consequently atmel_serial_probe() used to return
this error code WITHOUT clearing the port bit in the "atmel_ports_in_use" mask.
When atmel_serial_probe() was called for the second time, it used to fail on
the test_and_set_bit() function then returning -EBUSY.
When atmel_serial_probe() fails, this patch make it clear the port bit in the
"atmel_ports_in_use" mask, if needed, before returning the error code.
atmel_serial_probe() calls atmel_init_port(). In turn, atmel_init_port() calls
clk_disable_unprepare() to disable the peripheral clock before returning.
Later atmel_serial_probe() accesses some I/O registers such as the Mode and
Control registers for RS485 support then the Name and Version registers, through a call to
atmel_get_ip_name(), but at that moment the peripheral clock was still
disabled.
Commit 26df6d13406d1a5 ("tty: Add EXTPROC support for LINEMODE")
allows a process which has opened a pty master to send _any_ signal
to the process group of the pty slave. Although potentially
exploitable by a malicious program running a setuid program on
a pty slave, it's unknown if this exploit currently exists.
Limit to signals actually used.
Cc: Theodore Ts'o <tytso@mit.edu> Cc: Howard Chu <hyc@symas.com> Cc: One Thousand Gnomes <gnomes@lxorguk.ukuu.org.uk> Cc: Jiri Slaby <jslaby@suse.cz> Signed-off-by: Peter Hurley <peter@hurleysoftware.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
We hit use after free on dereferncing pointer to task_smack struct in
smk_of_task() called from smack_task_to_inode().
task_security() macro uses task_cred_xxx() to get pointer to the task_smack.
task_cred_xxx() could be used only for non-pointer members of task's
credentials. It cannot be used for pointer members since what they point
to may disapper after dropping RCU read lock.
Mainly task_security() used this way:
smk_of_task(task_security(p))
Intead of this introduce function smk_of_task_struct() which
takes task_struct as argument and returns pointer to smk_known struct
and do this under RCU read lock.
Bogus task_security() macro is not used anymore, so remove it.
KASan's report for this:
AddressSanitizer: use after free in smack_task_to_inode+0x50/0x70 at addr c4635600
=============================================================================
BUG kmalloc-64 (Tainted: PO): kasan error
-----------------------------------------------------------------------------
Bytes b4 c46355f0: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a ZZZZZZZZZZZZZZZZ
Object c4635600: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk
Object c4635610: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk
Object c4635620: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk
Object c4635630: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b a5 kkkkkkkkkkkkkkk.
Redzone c4635640: bb bb bb bb ....
Padding c46356e8: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a ZZZZZZZZZZZZZZZZ
Padding c46356f8: 5a 5a 5a 5a 5a 5a 5a 5a ZZZZZZZZ
CPU: 5 PID: 834 Comm: launchpad_prelo Tainted: PBO 3.10.30 #1
Backtrace:
[<c00233a4>] (dump_backtrace+0x0/0x158) from [<c0023dec>] (show_stack+0x20/0x24)
r7:c4634010 r6:d3b23f50 r5:c4635600 r4:d1002140
[<c0023dcc>] (show_stack+0x0/0x24) from [<c06d6d7c>] (dump_stack+0x20/0x28)
[<c06d6d5c>] (dump_stack+0x0/0x28) from [<c01c1d50>] (print_trailer+0x124/0x144)
[<c01c1c2c>] (print_trailer+0x0/0x144) from [<c01c1e88>] (object_err+0x3c/0x44)
r7:c4635600 r6:d1002140 r5:d3b23f50 r4:c4635600
[<c01c1e4c>] (object_err+0x0/0x44) from [<c01cac18>] (kasan_report_error+0x2b8/0x538)
r6:d1002140 r5:d3b23f50 r4:c6429cf8 r3:c09e1aa7
[<c01ca960>] (kasan_report_error+0x0/0x538) from [<c01c9430>] (__asan_load4+0xd4/0xf8)
[<c01c935c>] (__asan_load4+0x0/0xf8) from [<c031e168>] (smack_task_to_inode+0x50/0x70)
r5:c4635600 r4:ca9da000
[<c031e118>] (smack_task_to_inode+0x0/0x70) from [<c031af64>] (security_task_to_inode+0x3c/0x44)
r5:cca25e80 r4:c0ba9780
[<c031af28>] (security_task_to_inode+0x0/0x44) from [<c023d614>] (pid_revalidate+0x124/0x178)
r6:00000000 r5:cca25e80 r4:cbabe3c0 r3:00008124
[<c023d4f0>] (pid_revalidate+0x0/0x178) from [<c01db98c>] (lookup_fast+0x35c/0x43y4)
r9:c6429efc r8:00000101 r7:c079d940 r6:c6429e90 r5:c6429ed8 r4:c83c4148
[<c01db630>] (lookup_fast+0x0/0x434) from [<c01deec8>] (do_last.isra.24+0x1c0/0x1108)
[<c01ded08>] (do_last.isra.24+0x0/0x1108) from [<c01dff04>] (path_openat.isra.25+0xf4/0x648)
[<c01dfe10>] (path_openat.isra.25+0x0/0x648) from [<c01e1458>] (do_filp_open+0x3c/0x88)
[<c01e141c>] (do_filp_open+0x0/0x88) from [<c01ccb28>] (do_sys_open+0xf0/0x198)
r7:00000001 r6:c0ea2180 r5:0000000b r4:00000000
[<c01cca38>] (do_sys_open+0x0/0x198) from [<c01ccc00>] (SyS_open+0x30/0x34)
[<c01ccbd0>] (SyS_open+0x0/0x34) from [<c001db80>] (ret_fast_syscall+0x0/0x48)
Read of size 4 by thread T834:
Memory state around the buggy address: c4635380: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc c4635400: 00 00 00 00 00 00 00 00 fc fc fc fc fc fc fc fc c4635480: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc c4635500: 00 00 00 00 00 fc fc fc fc fc fc fc fc fc fc fc c4635580: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
>c4635600: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
^ c4635680: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb c4635700: 00 00 00 00 04 fc fc fc fc fc fc fc fc fc fc fc c4635780: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc c4635800: 00 00 00 00 00 00 04 fc fc fc fc fc fc fc fc fc c4635880: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
==================================================================
When an application connects to the ring buffer via splice, it can only
read full pages. Splice does not work with partial pages. If there is
not enough data to fill a page, the splice command will either block
or return -EAGAIN (if set to nonblock).
Code was added where if the page is not full, to just sleep again.
The problem is, it will get woken up again on the next event. That
is, when something is written into the ring buffer, if there is a waiter
it will wake it up. The waiter would then check the buffer, see that
it still does not have enough data to fill a page and go back to sleep.
To make matters worse, when the waiter goes back to sleep, it could
cause another event, which would wake it back up again to see it
doesn't have enough data and sleep again. This produces a tremendous
overhead and fills the ring buffer with noise.
For example, recording sched_switch on an idle system for 10 seconds
produces 25,350,475 events!!!
Create another wait queue for those waiters wanting full pages.
When an event is written, it only wakes up waiters if there's a full
page of data. It does not wake up the waiter if the page is not yet
full.
After this change, recording sched_switch on an idle system for 10
seconds produces only 800 events. Getting rid of 25,349,675 useless
events (99.9969% of events!!), is something to take seriously.
Cc: Rabin Vincent <rabin@rab.in> Fixes: e30f53aad220 "tracing: Do not busy wait in buffer splice" Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Using the IPCB() macro to get the IPv4 options is convenient, but
unfortunately NetLabel often needs to examine the CIPSO option outside
of the scope of the IP layer in the stack. While historically IPCB()
worked above the IP layer, due to the inclusion of the inet_skb_param
struct at the head of the {tcp,udp}_skb_cb structs, recent commit 971f10ec ("tcp: better TCP_SKB_CB layout to reduce cache line misses")
reordered the tcp_skb_cb struct and invalidated this IPCB() trick.
This patch fixes the problem by creating a new function,
cipso_v4_optptr(), which locates the CIPSO option inside the IP header
without calling IPCB(). Unfortunately, this isn't as fast as a simple
lookup so some additional tweaks were made to limit the use of this
new function.
Reported-by: Casey Schaufler <casey@schaufler-ca.com> Signed-off-by: Paul Moore <pmoore@redhat.com> Tested-by: Casey Schaufler <casey@schaufler-ca.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
If you can manage to submit an async write as the first async I/O from
the context of a process with realtime scheduling priority, then a
cfq_queue is allocated, but filed into the wrong async_cfqq bucket. It
ends up in the best effort array, but actually has realtime I/O
scheduling priority set in cfqq->ioprio.
The reason is that cfq_get_queue assumes the default scheduling class and
priority when there is no information present (i.e. when the async cfqq
is created):
static struct cfq_queue *
cfq_get_queue(struct cfq_data *cfqd, bool is_sync, struct cfq_io_cq *cic,
struct bio *bio, gfp_t gfp_mask)
{
const int ioprio_class = IOPRIO_PRIO_CLASS(cic->ioprio);
const int ioprio = IOPRIO_PRIO_DATA(cic->ioprio);
cic->ioprio starts out as 0, which is "invalid". So, class of 0
(IOPRIO_CLASS_NONE) is passed to cfq_async_queue_prio like so:
static struct cfq_queue **
cfq_async_queue_prio(struct cfq_data *cfqd, int ioprio_class, int ioprio)
{
switch (ioprio_class) {
case IOPRIO_CLASS_RT:
return &cfqd->async_cfqq[0][ioprio];
case IOPRIO_CLASS_NONE:
ioprio = IOPRIO_NORM;
/* fall through */
case IOPRIO_CLASS_BE:
return &cfqd->async_cfqq[1][ioprio];
case IOPRIO_CLASS_IDLE:
return &cfqd->async_idle_cfqq;
default:
BUG();
}
}
Here, instead of returning a class mapped from the process' scheduling
priority, we get back the bucket associated with IOPRIO_CLASS_BE.
Now, there is no queue allocated there yet, so we create it:
Cfq_lookup_create_cfqg() allocates struct blkcg_gq using GFP_ATOMIC.
In cfq_find_alloc_queue() possible allocation failure is not handled.
As a result kernel oopses on NULL pointer dereference when
cfq_link_cfqq_cfqg() calls cfqg_get() for NULL pointer.
Bug was introduced in v3.5 in commit cd1604fab4f9 ("blkcg: factor
out blkio_group creation"). Prior to that commit cfq group lookup
had returned pointer to root group as fallback.
This patch handles this error using existing fallback oom_cfqq.
This patch drops legacy active_ts_list usage within iscsi_target_tq.c
code. It was originally used to track the active thread sets during
iscsi-target shutdown, and is no longer used by modern upstream code.
Two people have reported list corruption using traditional iscsi-target
and iser-target with the following backtrace, that appears to be related
to iscsi_thread_set->ts_list being used across both active_ts_list and
inactive_ts_list.
With scsi-mq enabled, userspace programs can get unexpected EWOULDBLOCK
(a.k.a. EAGAIN) errors when submitting commands to the SCSI generic
driver. Fix by calling blk_get_request() with GFP_KERNEL instead of
GFP_ATOMIC.
Note: to avoid introducing a potential deadlock, this patch should be
applied after the patch titled "sg: fix unkillable I/O wait deadlock
with scsi-mq".
Signed-off-by: Tony Battersby <tonyb@cybernetics.com> Acked-by: Douglas Gilbert <dgilbert@interlog.com> Tested-by: Douglas Gilbert <dgilbert@interlog.com> Signed-off-by: James Bottomley <JBottomley@Parallels.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>