For the devices in this driver, the default qdisc is "noqueue",
because their "tx_queue_len" is 0.
In function "__dev_queue_xmit" in "net/core/dev.c", devices with the
"noqueue" qdisc are specially handled. Packets are transmitted without
being queued after a "dev->flags & IFF_UP" check. However, it's possible
that even if this check succeeds, "ops->ndo_stop" may still have already
been called. This is because in "__dev_close_many", "ops->ndo_stop" is
called before clearing the "IFF_UP" flag.
If we call "netif_stop_queue" in "ops->ndo_stop", then it's possible in
"__dev_queue_xmit", it sees the "IFF_UP" flag is present, and then it
checks "netif_xmit_stopped" and finds that the queue is already stopped.
In this case, it will complain that:
"Virtual device ... asks to queue packet!"
To prevent "__dev_queue_xmit" from generating this complaint, we should
not call "netif_stop_queue" in "ops->ndo_stop".
We also don't need to call "netif_start_queue" in "ops->ndo_open",
because after a netdev is allocated and registered, the
"__QUEUE_STATE_DRV_XOFF" flag is initially not set, so there is no need
to call "netif_start_queue" to clear it.
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Xie He <xie.he.0141@gmail.com> Acked-by: Martin Schiller <ms@dev.tdt.de> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The current CIPSO and CALIPSO refcounting scheme for the DOI
definitions is a bit flawed in that we:
1. Don't correctly match gets/puts in netlbl_cipsov4_list().
2. Decrement the refcount on each attempt to remove the DOI from the
DOI list, only removing it from the list once the refcount drops
to zero.
This patch fixes these problems by adding the missing "puts" to
netlbl_cipsov4_list() and introduces a more conventional, i.e.
not-buggy, refcounting mechanism to the DOI definitions. Upon the
addition of a DOI to the DOI list, it is initialized with a refcount
of one, removing a DOI from the list removes it from the list and
drops the refcount by one; "gets" and "puts" behave as expected with
respect to refcounts, increasing and decreasing the DOI's refcount by
one.
Fixes: b1edeb102397 ("netlabel: Replace protocol/NetLabel linking with refrerence counts") Fixes: d7cce01504a0 ("netlabel: Add support for removing a CALIPSO DOI.") Reported-by: syzbot+9ec037722d2603a9f52e@syzkaller.appspotmail.com Signed-off-by: Paul Moore <paul@paul-moore.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
This is a follow up of commit ea3274695353 ("net: sched: avoid
duplicates in qdisc dump") which has fixed the issue only for the qdisc
dump.
The duplicate printing also occurs when dumping the classes via
tc class show dev eth0
Fixes: 59cc1f61f09c ("net: sched: convert qdisc linked list to hashtable") Signed-off-by: Maximilian Heyne <mheyne@amazon.de> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
We introduce dwmac410_dma_init_channel() here for both EQoS v4.10 and
above which use different DMA_CH(n)_Interrupt_Enable bit definitions for
NIE and AIE.
Fixes: 48863ce5940f ("stmmac: add DMA support for GMAC 4.xx") Signed-off-by: Ong Boon Leong <boon.leong.ong@intel.com> Signed-off-by: Ramesh Babu B <ramesh.babu.b@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
This patch fixes a bug that the moderation config will not be
applied when calling mlx4_en_reset_config. For example, when
turning on rx timestamping, mlx4_en_reset_config() will be called,
causing the NIC to forget previous moderation config.
This fix is in phase with a previous fix:
commit 79c54b6bbf06 ("net/mlx4_en: Fix TX moderation info loss
after set_ringparam is called")
Tested: Before this patch, on a host with NIC using mlx4, run
netserver and stream TCP to the host at full utilization.
$ sar -I SUM 1
INTR intr/s
14:03:56 sum 48758.00
After rx hwtstamp is enabled:
$ sar -I SUM 1
14:10:38 sum 317771.00
We see the moderation is not working properly and issued 7x more
interrupts.
After the patch, and turned on rx hwtstamp, the rate of interrupts
is as expected:
$ sar -I SUM 1
14:52:11 sum 49332.00
Fixes: 79c54b6bbf06 ("net/mlx4_en: Fix TX moderation info loss after set_ringparam is called") Signed-off-by: Kevin(Yudong) Yang <yyd@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Neal Cardwell <ncardwell@google.com> CC: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
According to the SH7710, SH7712, SH7713 Group User's Manual: Hardware,
Rev. 3.00, the TRSCER register actually has only bit 7 valid (and named
differently), with all the other bits reserved. Apparently, this was not
the case with some early revisions of the manual as we have the other
bits declared (and set) in the original driver. Follow the suit and add
the explicit sh_eth_cpu_data::trscer_err_mask initializer for SH771x...
Fixes: 86a74ff21a7a ("net: sh_eth: add support for Renesas SuperH Ethernet") Signed-off-by: Sergey Shtylyov <s.shtylyov@omprussia.ru> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The kernel test robot reports a huge performance regression due to the
commit, and the reason seems fairly straightforward: when there is
contention on the page list (which is what causes acquire_slab() to
fail), we do _not_ want to just loop and try again, because that will
transfer the contention to the 'n->list_lock' spinlock we hold, and
just make things even worse.
This is admittedly likely a problem only on big machines - the kernel
test robot report comes from a 96-thread dual socket Intel Xeon Gold
6252 setup, but the regression there really is quite noticeable:
-47.9% regression of stress-ng.rawpkt.ops_per_sec
and the commit that was marked as being fixed (7ced37197196: "slub:
Acquire_slab() avoid loop") actually did the loop exit early very
intentionally (the hint being that "avoid loop" part of that commit
message), exactly to avoid this issue.
The correct thing to do may be to pick some kind of reasonable middle
ground: instead of breaking out of the loop on the very first sign of
contention, or trying over and over and over again, the right thing may
be to re-try _once_, and then give up on the second failure (or pick
your favorite value for "once"..).
Reported-by: kernel test robot <oliver.sang@intel.com> Link: https://lore.kernel.org/lkml/20210301080404.GF12822@xsang-OptiPlex-9020/ Cc: Jann Horn <jannh@google.com> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Acked-by: Christoph Lameter <cl@linux.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
When building with -ffunction-sections, the compiler will place each
function into its own ELF section, prefixed with ".text". For example,
a simple test module with functions test_module_do_work() and
test_module_wq_func():
Adjust the recordmcount scripts to look for ".text" as a section name
prefix. This will ensure that those functions will be included in the
__mcount_loc relocations:
% objdump --reloc --section __mcount_loc test_module.o
OFFSET TYPE VALUE 0000000000000000 R_X86_64_64 .text.test_module_do_work 0000000000000008 R_X86_64_64 .text.test_module_wq_func 0000000000000010 R_X86_64_64 .init.text
In cifs_statfs(), if server->ops->queryfs is not NULL, then we should
use its return value rather than always returning 0. Instead, use rc
variable as it is properly set to 0 in case there is no
server->ops->queryfs.
Signed-off-by: Paulo Alcantara (SUSE) <pc@cjr.nz> Reviewed-by: Aurelien Aptel <aaptel@suse.com> Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com> CC: <stable@vger.kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
nested target/match_revfn() calls work with xt[NFPROTO_UNSPEC] lists
without taking xt[NFPROTO_UNSPEC].mutex. This can race with module unload
and cause host to crash:
Assert HALT bit to enter freeze mode, there is a premise that FRZ bit is
asserted. This patch asserts FRZ bit in flexcan_chip_freeze, although
the reset value is 1b'1. This is a prepare patch, later patch will
invoke flexcan_chip_freeze() to enter freeze mode, which polling freeze
mode acknowledge.
Fixes: b1aa1c7a2165b ("can: flexcan: fix transition from and to freeze mode in chip_{,un}freeze") Link: https://lore.kernel.org/r/20210218110037.16591-2-qiangqing.zhang@nxp.com Signed-off-by: Joakim Zhang <qiangqing.zhang@nxp.com> Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
There are two ref count variables controlling the free()ing of a socket:
- struct sock::sk_refcnt - which is changed by sock_hold()/sock_put()
- struct sock::sk_wmem_alloc - which accounts the memory allocated by
the skbs in the send path.
In case there are still TX skbs on the fly and the socket() is closed,
the struct sock::sk_refcnt reaches 0. In the TX-path the CAN stack
clones an "echo" skb, calls sock_hold() on the original socket and
references it. This produces the following back trace:
A packet with skb_inner_network_header(skb) == skb_network_header(skb)
and ETH_P_MPLS_UC will prevent mpls_gso_segment from pulling any headers
from the packet. Subsequently, the call to skb_mac_gso_segment will
again call mpls_gso_segment with the same packet leading to an infinite
loop. In addition, ensure that the header length is a multiple of four,
which should hold irrespective of the number of stacked labels.
Signed-off-by: Balazs Nemeth <bnemeth@redhat.com> Acked-by: Willem de Bruijn <willemb@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
For gso packets, virtio_net_hdr_set_proto sets the protocol (if it isn't
set) based on the type in the virtio net hdr, but the skb could contain
anything since it could come from packet_snd through a raw socket. If
there is a mismatch between what virtio_net_hdr_set_proto sets and
the actual protocol, then the skb could be handled incorrectly later
on.
An example where this poses an issue is with the subsequent call to
skb_flow_dissect_flow_keys_basic which relies on skb->protocol being set
correctly. A specially crafted packet could fool
skb_flow_dissect_flow_keys_basic preventing EINVAL to be returned.
Avoid blindly trusting the information provided by the virtio net header
by checking that the protocol in the packet actually matches the
protocol set by virtio_net_hdr_set_proto. Note that since the protocol
is only checked if skb->dev implements header_ops->parse_protocol,
packets from devices without the implementation are not checked at this
stage.
Fixes: 9274124f023b ("net: stricter validation of untrusted gso packets") Signed-off-by: Balazs Nemeth <bnemeth@redhat.com> Acked-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Introduce a new optional header_ops callback called parse_protocol and a
wrapper function dev_parse_header_protocol, similar to dev_parse_header.
The new callback's purpose is to extract the protocol number from the L2
header, the format of which is known to the driver, but not to the upper
layers of the stack.
Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
We noticed a GRO issue for UDP-based encaps such as vxlan/geneve when the
csum for the UDP header itself is 0. In that case, GRO aggregation does
not take place on the phys dev, but instead is deferred to the vxlan/geneve
driver (see trace below).
The reason is essentially that GRO aggregation bails out in udp_gro_receive()
for such case when drivers marked the skb with CHECKSUM_UNNECESSARY (ice, i40e,
others) where for non-zero csums 2abb7cdc0dc8 ("udp: Add support for doing
checksum unnecessary conversion") promotes those skbs to CHECKSUM_COMPLETE
and napi context has csum_valid set. This is however not the case for zero
UDP csum (here: csum_cnt is still 0 and csum_valid continues to be false).
At the same time 57c67ff4bd92 ("udp: additional GRO support") added matches
on !uh->check ^ !uh2->check as part to determine candidates for aggregation,
so it certainly is expected to handle zero csums in udp_gro_receive(). The
purpose of the check added via 662880f44203 ("net: Allow GRO to use and set
levels of checksum unnecessary") seems to catch bad csum and stop aggregation
right away.
One way to fix aggregation in the zero case is to only perform the !csum_valid
check in udp_gro_receive() if uh->check is infact non-zero.
# netperf -H 10.55.10.4 -t TCP_STREAM -l 20
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.55.10.4 () port 0 AF_INET : demo
Recv Send Send
Socket Socket Message Elapsed
Size Size Size Time Throughput
bytes bytes bytes secs. 10^6bits/sec
87380 16384 16384 20.01 24576.53
Fixes: 57c67ff4bd92 ("udp: additional GRO support") Fixes: 662880f44203 ("net: Allow GRO to use and set levels of checksum unnecessary") Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Cc: Eric Dumazet <edumazet@google.com> Cc: Jesse Brandeburg <jesse.brandeburg@intel.com> Cc: Tom Herbert <tom@herbertland.com> Acked-by: Willem de Bruijn <willemb@google.com> Acked-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/r/20210226212248.8300-1-daniel@iogearbox.net Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
When transmitting to a receiver in dynamic SMPS mode, all transmissions that
use multiple spatial streams need to be sent using CTS-to-self or RTS/CTS to
give the receiver's extra chains some time to wake up.
This fixes the tx rate getting stuck at <= MCS7 for some clients, especially
Intel ones, which make aggressive use of SMPS.
Cc: stable@vger.kernel.org Reported-by: Martin Kennedy <hurricos@gmail.com> Signed-off-by: Felix Fietkau <nbd@nbd.name> Signed-off-by: Kalle Valo <kvalo@codeaurora.org> Link: https://lore.kernel.org/r/20210214184911.96702-1-nbd@nbd.name Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
netif_device_attach() will unpause the queues so we can't call
it before __alx_open(). This went undetected until
commit b0999223f224 ("alx: add ability to allocate and free
alx_napi structures") but now if stack tries to xmit immediately
on resume before __alx_open() we'll crash on the NAPI being null:
Apparently, <linux/netfilter/nfnetlink_cthelper.h> and
<linux/netfilter/nfnetlink_acct.h> could not be included into the same
compilation unit because of a cut-and-paste typo in the former header.
The PC_DBG_ECO_CNTL register on the Adreno A5xx family gets
programmed to some different values on a per-model basis.
At least, this is what we intend to do here;
Unfortunately, though, this register is being overwritten with a
static magic number, right after applying the GPU-specific
configuration (including the GPU-specific quirks) and that is
effectively nullifying the efforts.
Let's remove the redundant and wrong write to the PC_DBG_ECO_CNTL
register in order to retain the wanted configuration for the
target GPU.
Signed-off-by: AngeloGioacchino Del Regno <angelogioacchino.delregno@somainline.org> Reviewed-by: Jordan Crouse <jcrouse@codeaurora.org> Signed-off-by: Rob Clark <robdclark@chromium.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
A dummy zero bit is sent preceding the data during a read transfer by the
Microchip 93LC46B eeprom (section 2.7 of[1]). This results in right shift
of data during a read. In order to ignore this bit a quirk can be added to
send an extra zero bit after the read address.
Add a quirk to ignore the zero bit sent before data by adding a zero bit
after the read address.
Add support for SW_TABLET_MODE on the Acer Switch 10 (SW5-012) and the
acer Switch 10 (S1003) models.
There is no way to detect if this is supported, so this uses DMI based
quirks setting force_caps to ACER_CAP_KBD_DOCK (these devices have no
other acer-wmi based functionality).
The new SW_TABLET_MODE functionality can be tested on devices which
are not in the DMI table by passing acer_wmi.force_caps=0x40 on the
kernel commandline.
Add a new force_caps module parameter to allow overriding the drivers
builtin capability detection mechanism.
This can be used to for example:
-Disable rfkill functionality on devices where there is an AA OEM DMI
record advertising non functional rfkill switches
-Force loading of the driver on devices with a missing AA OEM DMI record
Note that force_caps is -1 when unset, this allows forcing the
capability field to 0, which results in acer-wmi only providing WMI
hotkey handling while disabling all other (led, rfkill, backlight)
functionality.
Cleanup accelerometer device handling:
-Drop acer_wmi_accel_destroy instead directly call input_unregister_device.
-The information tracked by the CAP_ACCEL flag mirrors acer_wmi_accel_dev
being NULL. Drop the CAP flag, this is a preparation change for allowing
users to override the capability flags. Dropping the flag stops users
from causing a NULL pointer dereference by forcing the capability.
Cleanup the ACER_CAP_FOO defines:
-Switch to using BIT() macro.
-The ACER_CAP_RFBTN flag is set, but it is never checked anywhere, drop it.
-Drop the unused ACER_CAP_ANY define.
If a reset is performed, but even the reset fails for some reasons (e.g.,
on Surface devices, the fw reset requires another quirks),
cancel_work_sync() hangs in mwifiex_cleanup_pcie().
This is a deadlock caused by calling cancel_work_sync() in
mwifiex_cleanup_pcie():
- Device resets are done via mwifiex_pcie_card_reset()
- which schedules card->work to call mwifiex_pcie_card_reset_work()
- which calls pci_try_reset_function().
- This leads to mwifiex_pcie_reset_done() be called on the same workqueue,
which in turn calls
- mwifiex_reinit_sw() and that calls
- _mwifiex_fw_dpc().
The problem is now that _mwifiex_fw_dpc() calls mwifiex_free_adapter()
in case firmware initialization fails. That ends up calling
mwifiex_cleanup_pcie().
Note that all those calls are still running on the workqueue. So when
mwifiex_cleanup_pcie() now calls cancel_work_sync(), it's really waiting
on itself to complete, causing a deadlock.
This commit fixes the deadlock by skipping cancel_work_sync() on a reset
failure path.
After this commit, when reset fails, the following output is
expected to be shown:
increase_address_space() calls get_zeroed_page(gfp) under spin_lock with
disabled interrupts. gfp flags passed to increase_address_space() may allow
sleeping, so it comes to this:
BUG: sleeping function called from invalid context at mm/page_alloc.c:4342
in_atomic(): 1, irqs_disabled(): 1, pid: 21555, name: epdcbbf1qnhbsd8
Fix dm_table_supports_zoned_model() and invert logic of both
iterate_devices_callout_fn so that all devices' zoned capabilities are
properly checked.
Add one more parameter to dm_table_any_dev_attr(), which is actually
used as the @data parameter of iterate_devices_callout_fn, so that
dm_table_matches_zone_sectors() can be replaced by
dm_table_any_dev_attr().
According to the definition of dm_iterate_devices_fn:
* This function must iterate through each section of device used by the
* target until it encounters a non-zero return code, which it then returns.
* Returns zero if no callout returned non-zero.
For some target type (e.g. dm-stripe), one call of iterate_devices() may
iterate multiple underlying devices internally, in which case a non-zero
return code returned by iterate_devices_callout_fn will stop the iteration
in advance. No iterate_devices_callout_fn should return non-zero unless
device iteration should stop.
Rename dm_table_requires_stable_pages() to dm_table_any_dev_attr() and
elevate it for reuse to stop iterating (and return non-zero) on the
first device that causes iterate_devices_callout_fn to return non-zero.
Use dm_table_any_dev_attr() to properly iterate through devices.
Rename device_is_nonrot() to device_is_rotational() and invert logic
accordingly to fix improper disposition.
[jeffle: backport notes]
Also convert the no_sg_merge capability check, which is introduced by
commit 200612ec33e5 ("dm table: propagate QUEUE_FLAG_NO_SG_MERGE"), and
removed since commit 2705c93742e9 ("block: kill QUEUE_FLAG_NO_SG_MERGE")
in v5.1.
Fixes: c3c4555edd10 ("dm table: clear add_random unless all devices have it set") Fixes: 4693c9668fdc ("dm table: propagate non rotational flag") Cc: stable@vger.kernel.org Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The copy_to_user() function returns the number of bytes remaining but
we want to return -EFAULT to the user if it can't complete the copy.
The "st" variable only holds zero on success or negative error codes on
failure so the type should be int.
Fixes: 36f988e978f8 ("rsxx: Adding in debugfs entries.") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
Currently the mask operation on variable conf is just 3 bits so
the switch statement case value of 8 is unreachable dead code.
The function daio_mgr_dao_init can be passed a 4 bit value,
function dao_rsc_init calls it with conf set to:
conf = (desc->msr & 0x7) | (desc->passthru << 3);
so clearly when desc->passthru is set to 1 then conf can be
at least 8.
With GCC 10, building usbip triggers error for multiple definition
of 'udev_context', in:
- libsrc/vhci_driver.c:18 and
- libsrc/usbip_host_common.c:27.
Declare as extern the definition in libsrc/usbip_host_common.c.
Because the PM-runtime status of the device is not updated in
__rpm_callback(), attempts to suspend the suppliers of the given
device triggered by rpm_put_suppliers() called by it may fail.
Fix this by making __rpm_callback() update the device's status to
RPM_SUSPENDED before calling rpm_put_suppliers() if the current
status of the device is RPM_SUSPENDING and the callback just invoked
by it has returned 0 (success).
While at it, modify the code in __rpm_callback() to always check
the device's PM-runtime status under its PM lock.
Link: https://lore.kernel.org/linux-pm/CAPDyKFqm06KDw_p8WXsM4dijDbho4bb6T4k50UqqvR1_COsp8g@mail.gmail.com/ Fixes: 21d5c57b3726 ("PM / runtime: Use device links") Reported-by: Elaine Zhang <zhangqing@rock-chips.com> Diagnosed-by: Ulf Hansson <ulf.hansson@linaro.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Tested-by: Elaine Zhang <zhangiqng@rock-chips.com> Reviewed-by: Ulf Hansson <ulf.hansson@linaro.org> Cc: 4.10+ <stable@vger.kernel.org> # 4.10+ Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
When a qstripe is required an extra page is allocated and mapped. There
were 3 problems:
1) There is no corresponding call of kunmap() for the qstripe page.
2) There is no reason to map the qstripe page more than once if the
number of bits set in rbio->dbitmap is greater than one.
3) There is no reason to map the parity page and unmap it each time
through the loop.
The page memory can continue to be reused with a single mapping on each
iteration by raid6_call.gen_syndrome() without remapping. So map the
page for the duration of the loop.
Similarly, improve the algorithm by mapping the parity page just 1 time.
Fixes: 5a6ac9eacb49 ("Btrfs, raid56: support parity scrub on raid56") CC: stable@vger.kernel.org # 4.4.x: c17af96554a8: btrfs: raid56: simplify tracking of Q stripe presence CC: stable@vger.kernel.org # 4.4.x Signed-off-by: Ira Weiny <ira.weiny@intel.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
There are temporary variables tracking the index of P and Q stripes, but
none of them is really used as such, merely for determining if the Q
stripe is present. This leads to compiler warnings with
-Wunused-but-set-variable and has been reported several times.
fs/btrfs/raid56.c: In function ‘finish_rmw’:
fs/btrfs/raid56.c:1199:6: warning: variable ‘p_stripe’ set but not used [-Wunused-but-set-variable]
1199 | int p_stripe = -1;
| ^~~~~~~~
fs/btrfs/raid56.c: In function ‘finish_parity_scrub’:
fs/btrfs/raid56.c:2356:6: warning: variable ‘p_stripe’ set but not used [-Wunused-but-set-variable]
2356 | int p_stripe = -1;
| ^~~~~~~~
Replace the two variables with one that has a clear meaning and also get
rid of the warnings. The logic that verifies that there are only 2
valid cases is unchanged.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
When an IOCTL with argument size larger than 128 that also used array
arguments were handled, two memory allocations were made but alas, only
the latter one of them was released. This happened because there was only
a single local variable to hold such a temporary allocation.
Fix this by adding separate variables to hold the pointers to the
temporary allocations.
We're not factoring in the start of the file for where to write and
read the swapfile, which leads to very unfortunate side effects of
writing where we should not be...
[This issue only affects swapfiles on filesystems on top of blockdevs
that implement rw_page ops (brd, zram, btt, pmem), and not on top of any
other block devices, in contrast to the upstream commit fix.]
Fixes: dd6bd0d9c7db ("swap: use bdev_read_page() / bdev_write_page()") Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Anthony Iliopoulos <ailiop@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
There exists multiple path may do zram compaction concurrently.
1. auto-compaction triggered during memory reclaim
2. userspace utils write zram<id>/compaction node
So, multiple threads may call zs_shrinker_scan/zs_compact concurrently.
But pages_compacted is a per zsmalloc pool variable and modification
of the variable is not serialized(through under class->lock).
There are two issues here:
1. the pages_compacted may not equal to total number of pages
freed(due to concurrently add).
2. zs_shrinker_scan may not return the correct number of pages
freed(issued by current shrinker).
The fix is simple:
1. account the number of pages freed in zs_compact locally.
2. use actomic variable pages_compacted to accumulate total number.
Link: https://lkml.kernel.org/r/20210202122235.26885-1-wu-yan@tcl.com Fixes: 860c707dca155a56 ("zsmalloc: account the number of compacted pages") Signed-off-by: Rokudo Yan <wu-yan@tcl.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Commit 3194a1746e8a ("xen-netback: don't "handle" error by BUG()")
dropped respective a BUG_ON() without noticing that with this the
variable's value wouldn't be consumed anymore. With gnttab_set_map_op()
setting all status fields to a non-zero value, in case of an error no
slot should have a status of GNTST_okay (zero).
Bailing immediately from set_foreign_p2m_mapping() upon a p2m updating
error leaves the full batch in an ambiguous state as far as the caller
is concerned. Instead flags respective slots as bad, unmapping what
was mapped there right away.
HYPERVISOR_grant_table_op()'s return value and the individual unmap
slots' status fields get used only for a one-time - there's not much we
can do in case of a failure.
Note that there's no GNTST_enomem or alike, so GNTST_general_error gets
used.
The map ops' handle fields get overwritten just to be on the safe side.
Open-iSCSI sends passthrough PDUs over netlink, but the kernel should be
verifying that the provided PDU header and data lengths fall within the
netlink message to prevent accessing beyond that in memory.
Cc: stable@vger.kernel.org Reported-by: Adam Nichols <adam@grimm-co.com> Reviewed-by: Lee Duncan <lduncan@suse.com> Reviewed-by: Mike Christie <michael.christie@oracle.com> Signed-off-by: Chris Leech <cleech@redhat.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
As the iSCSI parameters are exported back through sysfs, it should be
enforcing that they never are more than PAGE_SIZE (which should be more
than enough) before accepting updates through netlink.
Change all iSCSI sysfs attributes to use sysfs_emit().
Cc: stable@vger.kernel.org Reported-by: Adam Nichols <adam@grimm-co.com> Reviewed-by: Lee Duncan <lduncan@suse.com> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Reviewed-by: Mike Christie <michael.christie@oracle.com> Signed-off-by: Chris Leech <cleech@redhat.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Output defects can exist in sysfs content using sprintf and snprintf.
sprintf does not know the PAGE_SIZE maximum of the temporary buffer
used for outputting sysfs content and it's possible to overrun the
PAGE_SIZE buffer length.
Add a generic sysfs_emit function that knows that the size of the
temporary buffer and ensures that no overrun is done.
Add a generic sysfs_emit_at function that can be used in multiple
call situations that also ensures that no overrun is done.
Validate the output buffer argument to be page aligned.
Validate the offset len argument to be within the PAGE_SIZE buf.
Protect the iSCSI transport handle, available in sysfs, by requiring
CAP_SYS_ADMIN to read it. Also protect the netlink socket by restricting
reception of messages to ones sent with CAP_SYS_ADMIN. This disables
normal users from being able to end arbitrary iSCSI sessions.
Cc: stable@vger.kernel.org Reported-by: Adam Nichols <adam@grimm-co.com> Reviewed-by: Chris Leech <cleech@redhat.com> Reviewed-by: Mike Christie <michael.christie@oracle.com> Signed-off-by: Lee Duncan <lduncan@suse.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
I had a kernel IRQ stack overflow on the mx3210 debian buildd machine. This patch increases the
64-bit IRQ stack size to 64 KB. The 64-bit stack size needs to be larger than the 32-bit stack
size since registers are twice as big.
Signed-off-by: John David Anglin <dave.anglin@bell.net> Signed-off-by: Helge Deller <deller@gmx.de> Signed-off-by: Sasha Levin <sashal@kernel.org>
b21ebf2fb4cd ("x86: Treat R_X86_64_PLT32 as R_X86_64_PC32")
but for i386. As far as the kernel is concerned, R_386_PLT32 can be
treated the same as R_386_PC32.
R_386_PLT32/R_X86_64_PLT32 are PC-relative relocation types which
can only be used by branches. If the referenced symbol is defined
externally, a PLT will be used.
R_386_PC32/R_X86_64_PC32 are PC-relative relocation types which can be
used by address taking operations and branches. If the referenced symbol
is defined externally, a copy relocation/canonical PLT entry will be
created in the executable.
On x86-64, there is no PIC vs non-PIC PLT distinction and an
R_X86_64_PLT32 relocation is produced for both `call/jmp foo` and
`call/jmp foo@PLT` with newer (2018) GNU as/LLVM integrated assembler.
This avoids canonical PLT entries (st_shndx=0, st_value!=0).
On i386, there are 2 types of PLTs, PIC and non-PIC. Currently,
the GCC/GNU as convention is to use R_386_PC32 for non-PIC PLT and
R_386_PLT32 for PIC PLT. Copy relocations/canonical PLT entries
are possible ABI issues but GCC/GNU as will likely keep the status
quo because (1) the ABI is legacy (2) the change will drop a GNU
ld diagnostic for non-default visibility ifunc in shared objects.
clang-12 -fno-pic (since [1]) can emit R_386_PLT32 for compiler
generated function declarations, because preventing canonical PLT
entries is weighed over the rare ifunc diagnostic.
[84977.840894] ath10k_snoc a000000.wifi: wmi mgmt tx queue is full
[84977.840913] ath10k_snoc a000000.wifi: failed to transmit packet, dropping: -28
[84977.840924] ath10k_snoc a000000.wifi: failed to submit frame: -28
[84977.840932] ath10k_snoc a000000.wifi: failed to transmit frame: -28
This issue is caused by race condition between skb_dequeue and
__skb_queue_tail. The queue of ‘wmi_mgmt_tx_queue’ is protected by a
different lock: ar->data_lock vs list->lock, the result is no protection.
So when ath10k_mgmt_over_wmi_tx_work() and ath10k_mac_tx_wmi_mgmt()
running concurrently on different CPUs, there appear to be a rare corner
cases when the queue length is 1,
If the instruction ‘next = skb->next’ is executed before
‘WRITE_ONCE(prev->next, newsk)’, newsk will be lost, as CPUx get the
old ‘next’ pointer, but the length is still added by one. The final
result is the length of the queue will reach the maximum value but
the queue is empty.
So remove ar->data_lock, and use 'skb_queue_tail' instead of
'__skb_queue_tail' to prevent the potential race condition. Also switch
to use skb_queue_len_lockless, in case we queue a few SKBs simultaneously.
pktgen create threads for all online cpus and bond these threads to
relevant cpu repecivtily. when this thread firstly be woken up, it
will compare cpu currently running with the cpu specified at the time
of creation and if the two cpus are not equal, BUG_ON() will take effect
causing panic on the system.
Notice that these threads could be migrated to other cpus before start
running because of the cpu hotplug after these threads have created. so the
BUG_ON() used here seems unreasonable and we can replace it with WARN_ON()
to just printf a warning other than panic the system.
We can currently get a "command execute failure 19" error on beacon loss
if the signal is weak:
wlcore: Beacon loss detected. roles:0xff
wlcore: Connection loss work (role_id: 0).
...
wlcore: ERROR command execute failure 19
...
WARNING: CPU: 0 PID: 1552 at drivers/net/wireless/ti/wlcore/main.c:803
...
(wl12xx_queue_recovery_work.part.0 [wlcore])
(wl12xx_cmd_role_start_sta [wlcore])
(wl1271_op_bss_info_changed [wlcore])
(ieee80211_prep_connection [mac80211])
Error 19 is defined as CMD_STATUS_WRONG_NESTING from the wlcore firmware,
and seems to mean that the firmware no longer wants to see the quirk
handling for WLCORE_QUIRK_START_STA_FAILS done.
This quirk got added with commit 18eab430700d ("wlcore: workaround
start_sta problem in wl12xx fw"), and it seems that this already got fixed
in the firmware long time ago back in 2012 as wl18xx never had this quirk
in place to start with.
As we no longer even support firmware that early, to me it seems that it's
safe to just drop WLCORE_QUIRK_START_STA_FAILS to fix the error. Looks
like earlier firmware got disabled back in 2013 with commit 0e284c074ef9
("wl12xx: increase minimum singlerole firmware version required").
If it turns out we still need WLCORE_QUIRK_START_STA_FAILS with any
firmware that the driver works with, we can simply revert this patch and
add extra checks for firmware version used.
With this fix wlcore reconnects properly after a beacon loss.
The constant 20 makes the font sum computation signed which can lead to
sign extensions and signed wraps. It's not much of a problem as we build
with -fno-strict-overflow. But if we ever decide not to, be ready, so
switch the constant to unsigned.
On this system the M.2 PCIe WiFi card isn't detected after reboot, only
after cold boot. reboot=pci fixes this behavior. In [0] the same issue
is described, although on another system and with another Intel WiFi
card. In case it's relevant, both systems have Celeron CPUs.
Add a PCI reboot quirk on affected systems until a more generic fix is
available.
When fw_core_add_address_handler() fails, we need to destroy
the port by tty_port_destroy(). Also we need to unregister
the address handler by fw_core_remove_address_handler() on
failure.
Looking through patchwork I don't see that there was any consensus to
use switchdev notifiers only in case of netlink provided port flags but
not sysfs (as a sort of deprecation, punishment or anything like that),
so we should probably keep the user interface consistent in terms of
functionality.
Fixes: 3922285d96e7 ("net: bridge: Add support for offloading port attributes") Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The current code would unnecessarily expand the address range. Consider
one example, (start, end) = (1G-2M, 3G+2M), and (vm_start, vm_end) =
(1G-4M, 3G+4M), the expected adjustment should be keep (1G-2M, 3G+2M)
without expand. But the current result will be (1G-4M, 3G+4M). Actually,
the range (1G-4M, 1G) and (3G, 3G+4M) would never been involved in pmd
sharing.
After this patch, we will check that the vma span at least one PUD aligned
size and the start,end range overlap the aligned range of vma.
With above example, the aligned vma range is (1G, 3G), so if (start, end)
range is within (1G-4M, 1G), or within (3G, 3G+4M), then no adjustment to
both start and end. Otherwise, we will have chance to adjust start
downwards or end upwards without exceeding (vm_start, vm_end).
Mike:
: The 'adjusted range' is used for calls to mmu notifiers and cache(tlb)
: flushing. Since the current code unnecessarily expands the range in some
: cases, more entries than necessary would be flushed. This would/could
: result in performance degradation. However, this is highly dependent on
: the user runtime. Is there a combination of vma layout and calls to
: actually hit this issue? If the issue is hit, will those entries
: unnecessarily flushed be used again and need to be unnecessarily reloaded?
Link: https://lkml.kernel.org/r/20210104081631.2921415-1-lixinhai.lxh@gmail.com Fixes: 75802ca66354 ("mm/hugetlb: fix calculation of adjust_range_if_pmd_sharing_possible") Signed-off-by: Li Xinhai <lixinhai.lxh@gmail.com> Suggested-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Peter Xu <peterx@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Avoid the assumption that ksize(kmalloc(S)) == ksize(kmalloc(S)): when
cloning an skb, save and restore truesize after pskb_expand_head(). This
can occur if the allocator decides to service an allocation of the same
size differently (e.g. use a different size class, or pass the
allocation on to KFENCE).
Because truesize is used for bookkeeping (such as sk_wmem_queued), a
modified truesize of a cloned skb may result in corrupt bookkeeping and
relevant warnings (such as in sk_stream_kill_queues()).
syzbot found WARNINGs in several smackfs write operations where
bytes count is passed to memdup_user_nul which exceeds
GFP MAX_ORDER. Check count size if bigger than PAGE_SIZE.
Per smackfs doc, smk_write_net4addr accepts any label or -CIPSO,
smk_write_net6addr accepts any label or -DELETE. I couldn't find
any general rule for other label lengths except SMK_LABELLEN,
SMK_LONGLABEL, SMK_CIPSOMAX which are documented.
Let's constrain, in general, smackfs label lengths for PAGE_SIZE.
Although fuzzer crashes write to smackfs/netlabel on 0x400000 length.
Here is a quick way to reproduce the WARNING:
python -c "print('A' * 0x400000)" > /sys/fs/smackfs/netlabel
An assert failure is triggered by syzkaller test due to
ATTR_KILL_PRIV is not cleared before xfs_setattr_size.
As ATTR_KILL_PRIV is not checked/used by xfs_setattr_size,
just remove it from the assert.
Signed-off-by: Yumei Huang <yuhuang@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
syzbot is feeding invalid superblock data to JFS for mount testing.
JFS does not check several of the fields -- just assumes that they
are good since the JFS_MAGIC and version fields are good.
In this case (syzbot reproducer), we have s_l2bsize == 0xda0c,
pad == 0xf045, and s_state == 0x50, all of which are invalid IMO.
Having s_l2bsize == 0xda0c causes this UBSAN warning:
UBSAN: shift-out-of-bounds in fs/jfs/jfs_mount.c:373:25
shift exponent -9716 is negative
s_l2bsize can be tested for correctness. pad can be tested for non-0
and punted. s_state can be tested for its valid values and punted.
Do those 3 tests and if any of them fails, report the superblock as
invalid/corrupt and let fsck handle it.
With this patch, chkSuper() says this when JFS_DEBUG is enabled:
jfs_mount: Mount Failure: superblock is corrupt!
Mount JFS Failure: -22
jfs_mount failed w/return code = -22
The obvious problem with this method is that next week there could
be another syzbot test that uses different fields for invalid values,
this making this like a game of whack-a-mole.
link: https://syzkaller.appspot.com/bug?extid=36315852ece4132ec193 Reported-by: syzbot+36315852ece4132ec193@syzkaller.appspotmail.com Reported-by: kernel test robot <lkp@intel.com> # v2 Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com> Cc: jfs-discussion@lists.sourceforge.net Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
... but we currently use the 'I' constraint for many atomic operations
using sub or logical instructions, which is not always valid.
When CONFIG_ARM64_LSE_ATOMICS is not set, this allows invalid immediates
to be passed to instructions, potentially resulting in a build failure.
When CONFIG_ARM64_LSE_ATOMICS is selected the out-of-line ll/sc atomics
always use a register as they have no visibility of the value passed by
the caller.
This patch adds a constraint parameter to the ATOMIC_xx and
__CMPXCHG_CASE macros so that we can pass appropriate constraints for
each case, with uses updated accordingly.
Unfortunately prior to GCC 8.1.0 the 'K' constraint erroneously accepted
'4294967295', so we must instead force the use of a register.
Signed-off-by: Andrew Murray <andrew.murray@arm.com> Signed-off-by: Will Deacon <will@kernel.org>
[bwh: Backported to 4.14: adjust context] Signed-off-by: Ben Hutchings <ben@decadent.org.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The "L" AArch64 machine constraint, which we use for the "old" value in
an LL/SC cmpxchg(), generates an immediate that is suitable for a 64-bit
logical instruction. However, for cmpxchg() operations on types smaller
than 64 bits, this constraint can result in an invalid instruction which
is correctly rejected by GAS, such as EOR W1, W1, #0xffffffff.
Whilst we could special-case the constraint based on the cmpxchg size,
it's far easier to change the constraint to "K" and put up with using
a register for large 64-bit immediates. For out-of-line LL/SC atomics,
this is all moot anyway.
Reported-by: Robin Murphy <robin.murphy@arm.com> Signed-off-by: Will Deacon <will.deacon@arm.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Our atomic instructions (either LSE atomics of LDXR/STXR sequences)
natively support byte, half-word, word and double-word memory accesses
so there is no need to mask the data register prior to being stored.
Signed-off-by: Will Deacon <will.deacon@arm.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
These plt* and .text.ftrace_trampoline sections specified for arm64 have
non-zero addressses. Non-zero section addresses in a relocatable ELF would
confuse GDB when it tries to compute the section offsets and it ends up
printing wrong symbol addresses. Therefore, set them to zero, which mirrors
the change in commit 5d8591bc0fba ("module: set ksymtab/kcrctab* section
addresses to 0x0").
Reported-by: Frank van der Linden <fllinden@amazon.com> Signed-off-by: Shaoying Xu <shaoyi@amazon.com> Cc: <stable@vger.kernel.org> Link: https://lore.kernel.org/r/20210216183234.GA23876@amazon.com Signed-off-by: Will Deacon <will@kernel.org>
[shaoyi@amazon.com: made same changes in arch/arm64/kernel/module.lds for 5.4] Signed-off-by: Shaoying Xu <shaoyi@amazon.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
CCW_CMD_READ_STATUS was introduced with revision 2 of virtio-ccw,
and drivers should only rely on it being implemented when they
negotiated at least that revision with the device.
However, virtio_ccw_get_status() issued READ_STATUS for any
device operating at least at revision 1. If the device accepts
READ_STATUS regardless of the negotiated revision (which some
implementations like QEMU do, even though the spec currently does
not allow it), everything works as intended. While a device
rejecting the command should also be handled gracefully, we will
not be able to see any changes the device makes to the status,
such as setting NEEDS_RESET or setting the status to zero after
a completed reset.
We negotiated the revision to at most 1, as we never bumped the
maximum revision; let's do that now and properly send READ_STATUS
only if we are operating at least at revision 2.
We observed that some of virtio_gpu_object_shmem_init() allocations
can be rather costly - order 6 - which can be difficult to fulfill
under memory pressure conditions. Switch to kvmalloc_array() in
virtio_gpu_object_shmem_init() and let the kernel vmalloc the entries
array.
page structs are not guaranteed to be contiguous for gigantic pages. The
routine update_and_free_page can encounter a gigantic page, yet it assumes
page structs are contiguous when setting page flags in subpages.
If update_and_free_page encounters non-contiguous page structs, we can see
“BUG: Bad page state in process …” errors.
Non-contiguous page structs are generally not an issue. However, they can
exist with a specific kernel configuration and hotplug operations. For
example: Configure the kernel with CONFIG_SPARSEMEM and
!CONFIG_SPARSEMEM_VMEMMAP. Then, hotplug add memory for the area where
the gigantic page will be allocated. Zi Yan outlined steps to reproduce
here [1].
Link: https://lkml.kernel.org/r/20210217184926.33567-1-mike.kravetz@oracle.com Fixes: 944d9fec8d7a ("hugetlb: add support for gigantic page allocation at runtime") Signed-off-by: Zi Yan <ziy@nvidia.com> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Zi Yan <ziy@nvidia.com> Cc: Davidlohr Bueso <dbueso@suse.de> Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Oscar Salvador <osalvador@suse.de> Cc: Joao Martins <joao.m.martins@oracle.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Otherwise build fails if the headers are not in the default location. While at
it also ask pkg-config for the libs, with fallback to the existing value.
Now that interface 3 in "option" driver is no longer mapped, add device
ID matching it to qmi_wwan.
The modem is used inside ZTE MF283+ router and carriers identify it as
such.
Interface mapping is:
0: QCDM, 1: AT (PCUI), 2: AT (Modem), 3: QMI, 4: ADB
In case of a system crash, dm-era might fail to mark blocks as written
in its metadata, although the corresponding writes to these blocks were
passed down to the origin device and completed successfully.
Consider the following sequence of events:
1. We write to a block that has not been yet written in the current era
2. era_map() checks the in-core bitmap for the current era and sees
that the block is not marked as written.
3. The write is deferred for submission after the metadata have been
updated and committed.
4. The worker thread processes the deferred write
(process_deferred_bios()) and marks the block as written in the
in-core bitmap, **before** committing the metadata.
5. The worker thread starts committing the metadata.
6. We do more writes that map to the same block as the write of step (1)
7. era_map() checks the in-core bitmap and sees that the block is marked
as written, **although the metadata have not been committed yet**.
8. These writes are passed down to the origin device immediately and the
device reports them as completed.
9. The system crashes, e.g., power failure, before the commit from step
(5) finishes.
When the system recovers and we query the dm-era target for the list of
written blocks it doesn't report the aforementioned block as written,
although the writes of step (6) completed successfully.
The issue is that era_map() decides whether to defer or not a write
based on non committed information. The root cause of the bug is that we
update the in-core bitmap, **before** committing the metadata.
Fix this by updating the in-core bitmap **after** successfully
committing the metadata.
The icmp{,v6}_send functions make all sorts of use of skb->cb, casting
it with IPCB or IP6CB, assuming the skb to have come directly from the
inet layer. But when the packet comes from the ndo layer, especially
when forwarded, there's no telling what might be in skb->cb at that
point. As a result, the icmp sending code risks reading bogus memory
contents, which can result in nasty stack overflows such as this one
reported by a user:
In icmp_send, skb->cb is cast with IPCB and an ip_options struct is read
from it. The optlen parameter there is of particular note, as it can
induce writes beyond bounds. There are quite a few ways that can happen
in __ip_options_echo. For example:
// sptr/skb are attacker-controlled skb bytes
sptr = skb_network_header(skb);
// dptr/dopt points to stack memory allocated by __icmp_send
dptr = dopt->__data;
// sopt is the corrupt skb->cb in question
if (sopt->rr) {
optlen = sptr[sopt->rr+1]; // corrupt skb->cb + skb->data
soffset = sptr[sopt->rr+2]; // corrupt skb->cb + skb->data
// this now writes potentially attacker-controlled data, over
// flowing the stack:
memcpy(dptr, sptr+sopt->rr, optlen);
}
In the icmpv6_send case, the story is similar, but not as dire, as only
IP6CB(skb)->iif and IP6CB(skb)->dsthao are used. The dsthao case is
worse than the iif case, but it is passed to ipv6_find_tlv, which does
a bit of bounds checking on the value.
This is easy to simulate by doing a `memset(skb->cb, 0x41,
sizeof(skb->cb));` before calling icmp{,v6}_ndo_send, and it's only by
good fortune and the rarity of icmp sending from that context that we've
avoided reports like this until now. For example, in KASAN:
BUG: KASAN: stack-out-of-bounds in __ip_options_echo+0xa0e/0x12b0
Write of size 38 at addr ffff888006f1f80e by task ping/89
CPU: 2 PID: 89 Comm: ping Not tainted 5.10.0-rc7-debug+ #5
Call Trace:
dump_stack+0x9a/0xcc
print_address_description.constprop.0+0x1a/0x160
__kasan_report.cold+0x20/0x38
kasan_report+0x32/0x40
check_memory_region+0x145/0x1a0
memcpy+0x39/0x60
__ip_options_echo+0xa0e/0x12b0
__icmp_send+0x744/0x1700
Actually, out of the 4 drivers that do this, only gtp zeroed the cb for
the v4 case, while the rest did not. So this commit actually removes the
gtp-specific zeroing, while putting the code where it belongs in the
shared infrastructure of icmp{,v6}_ndo_send.
This commit fixes the issue by passing an empty IPCB or IP6CB along to
the functions that actually do the work. For the icmp_send, this was
already trivial, thanks to __icmp_send providing the plumbing function.
For icmpv6_send, this required a tiny bit of refactoring to make it
behave like the v4 case, after which it was straight forward.
If IPv6 is builtin, we do not need an expensive indirect call
to reach icmp6_send().
v2: put inline keyword before the type to avoid sparse warnings.
Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Because sunvnet is calling icmp from network device context, it should use
the ndo helper so that the rate limiting applies correctly. While we're
at it, doing the additional route lookup before calling icmp_ndo_send is
superfluous, since this is the job of the icmp code in the first place.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Cc: Shannon Nelson <shannon.nelson@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Because gtp is calling icmp from network device context, it should use
the ndo helper so that the rate limiting applies correctly.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Cc: Harald Welte <laforge@gnumonks.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The icmpv6_send function has long had a static inline implementation
with an empty body for CONFIG_IPV6=n, so that code calling it doesn't
need to be ifdef'd. The new icmpv6_ndo_send function, which is intended
for drivers as a drop-in replacement with an identical function
signature, should follow the same pattern. Without this patch, drivers
that used to work with CONFIG_IPV6=n now result in a linker error.
Cc: Chen Zhou <chenzhou10@huawei.com> Reported-by: Hulk Robot <hulkci@huawei.com> Fixes: 0b41713b6066 ("icmp: introduce helper for nat'd source address in network device context") Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
This introduces a helper function to be called only by network drivers
that wraps calls to icmp[v6]_send in a conntrack transformation, in case
NAT has been used. We don't want to pollute the non-driver path, though,
so we introduce this as a helper to be called by places that actually
make use of this, as suggested by Florian.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Cc: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Metadata resize shouldn't happen in the ctr. The ctr loads a temporary
(inactive) table that will only become active upon resume. That is why
resize should always be done in terms of resume. Otherwise a load (ctr)
whose inactive table never becomes active will incorrectly resize the
metadata.
Also, perform the resize directly in preresume, instead of using the
worker to do it.
The worker might run other metadata operations, e.g., it could start
digestion, before resizing the metadata. These operations will end up
using the old size.
In case of devices with at most 64 blocks, the digestion of consecutive
eras uses the writeset of the first era as the writeset of all eras to
digest, leading to lost writes. That is, we lose the information about
what blocks were written during the affected eras.
The digestion code uses a dm_disk_bitset object to access the archived
writesets. This structure includes a one word (64-bit) cache to reduce
the number of array lookups.
This structure is initialized only once, in metadata_digest_start(),
when we kick off digestion.
But, when we insert a new writeset into the writeset tree, before the
digestion of the previous writeset is done, or equivalently when there
are multiple writesets in the writeset tree to digest, then all these
writesets are digested using the same cache and the cache is not
re-initialized when moving from one writeset to the next.
For devices with more than 64 blocks, i.e., the size of the cache, the
cache is indirectly invalidated when we move to a next set of blocks, so
we avoid the bug.
But for devices with at most 64 blocks we end up using the same cached
data for digesting all archived writesets, i.e., the cache is loaded
when digesting the first writeset and it never gets reloaded, until the
digestion is done.
As a result, the writeset of the first era to digest is used as the
writeset of all the following archived eras, leading to lost writes.
Fix this by reinitializing the dm_disk_bitset structure, and thus
invalidating the cache, every time the digestion code starts digesting a
new writeset.
dm-era doesn't support changing the data block size of existing devices,
so check explicitly that the requested block size for a new target
matches the one stored in the metadata.
Following a system crash, dm-era fails to recover the committed writeset
for the current era, leading to lost writes. That is, we lose the
information about what blocks were written during the affected era.
dm-era assumes that the writeset of the current era is archived when the
device is suspended. So, when resuming the device, it just moves on to
the next era, ignoring the committed writeset.
This assumption holds when the device is properly shut down. But, when
the system crashes, the code that suspends the target never runs, so the
writeset for the current era is not archived.
There are three issues that cause the committed writeset to get lost:
1. dm-era doesn't load the committed writeset when opening the metadata
2. The code that resizes the metadata wipes the information about the
committed writeset (assuming it was loaded at step 1)
3. era_preresume() starts a new era, without taking into account that
the current era might not have been archived, due to a system crash.
To fix this:
1. Load the committed writeset when opening the metadata
2. Fix the code that resizes the metadata to make sure it doesn't wipe
the loaded writeset
3. Fix era_preresume() to check for a loaded writeset and archive it,
before starting a new era.
Patch fb6791d100d1 was designed to allow gfs2 to unmount quicker by
skipping the step where it tells dlm to unlock glocks in EX with lvbs.
This was done because when gfs2 unmounts a file system, it destroys the
dlm lockspace shortly after it destroys the glocks so it doesn't need to
unlock them all: the unlock is implied when the lockspace is destroyed
by dlm.
However, that patch introduced a use-after-free in dlm: as part of its
normal dlm_recoverd process, it can call ls_recovery to recover dead
locks. In so doing, it can call recover_rsbs which calls recover_lvb for
any mastered rsbs. Func recover_lvb runs through the list of lkbs queued
to the given rsb (if the glock is cached but unlocked, it will still be
queued to the lkb, but in NL--Unlocked--mode) and if it has an lvb,
copies it to the rsb, thus trying to preserve the lkb. However, when
gfs2 skips the dlm unlock step, it frees the glock and its lvb, which
means dlm's function recover_lvb references the now freed lvb pointer,
copying the freed lvb memory to the rsb.
This patch changes the check in gdlm_put_lock so that it calls
dlm_unlock for all glocks that contain an lvb pointer.
Fixes: fb6791d100d1 ("GFS2: skip dlm_unlock calls in unmount") Cc: stable@vger.kernel.org # v3.8+ Signed-off-by: Bob Peterson <rpeterso@redhat.com> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Back in 2.1.29 the clear_user() guts (__bzero()) had been merged
with memset(). Unfortunately, while all exception handlers had been
copied, one of the exception table entries got lost. As the result,
clear_user() starting at 128*n bytes before the end of page and
spanning between 8 and 127 bytes into the next page would oops when
the second page is unmapped. It's trivial to reproduce - all
it takes is
which had been oopsing since March 1997. Says something about
the quality of test coverage... ;-/ And while today sparc32 port
is nearly dead, back in '97 it had been very much alive; in fact,
sparc64 had only been in mainline for 3 months by that point...
Cc: stable@kernel.org Fixes: v2.1.29 Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>