read_poll_timeout inside phy_read_poll_timeout can set val negative
in some cases (for example, __mdiobus_read inside phy_read can return
-EOPNOTSUPP).
Supposedly, commit 4ec732951702 ("net: phylib: fix phy_read*_poll_timeout()")
should fix problems with wrong-signed vals, but I do not see how
as val is sent to phy_read as is and __val = phy_read (not val)
is checked for sign.
Change val type for signed to allow better error handling as done in other
phy_read_poll_timeout callers. This will not fix any error handling
by itself, but allows, for example, to modify cond with appropriate
sign check or check resulting val separately.
Found by Linux Verification Center (linuxtesting.org) with SVACE.
Fixes: 014068dcb5b1 ("net: phy: genphy_loopback: add link speed configuration") Signed-off-by: Nikita Kiryushin <kiryushin@ancud.ru> Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Link: https://lore.kernel.org/r/20240315175052.8049-1-kiryushin@ancud.ru Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
If EOF is encountered, ceph_sync_read() return value is adjusted down
according to i_size, but the "to" iter is advanced by the actual number
of bytes read. Then, when retrying, the remainder of the range may be
skipped incorrectly.
Ensure that the "to" iter is advanced only until EOF.
[ idryomov: changelog ]
Fixes: c3d8e0b5de48 ("ceph: return the real size read when it hits EOF") Reported-by: Frank Hsiao <frankhsiao@qnap.com> Signed-off-by: Xiubo Li <xiubli@redhat.com> Reviewed-by: Ilya Dryomov <idryomov@gmail.com> Tested-by: Frank Hsiao <frankhsiao@qnap.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
Since the referenced commit, the xfrm_inner_extract_output() function
uses the protocol field to determine the address family. So not setting
it for IPv4 raw sockets meant that such packets couldn't be tunneled via
IPsec anymore.
IPv6 raw sockets are not affected as they already set the protocol since 9c9c9ad5fae7 ("ipv6: set skb->protocol on tcp, raw and ip6_append_data
genereated skbs").
Fixes: f4796398f21b ("xfrm: Remove inner/outer modes from output path") Signed-off-by: Tobias Brunner <tobias@strongswan.org> Reviewed-by: David Ahern <dsahern@kernel.org> Reviewed-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Link: https://lore.kernel.org/r/c5d9a947-eb19-4164-ac99-468ea814ce20@strongswan.org Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
A failure during registration of the netdev notifier was not handled at
all. A failure during netlink initialization did not unregister the netdev
notifier.
Handle failures of netdev notifier registration and netlink initialization.
Both functions should only return negative values on failure and thereby
lead to the hsr module not being loaded.
Fixes: f421436a591d ("net/hsr: Add support for the High-availability Seamless Redundancy protocol (HSRv0)") Signed-off-by: Felix Maurer <fmaurer@redhat.com> Reviewed-by: Shigeru Yoshida <syoshida@redhat.com> Reviewed-by: Breno Leitao <leitao@debian.org> Link: https://lore.kernel.org/r/3ce097c15e3f7ace98fc7fd9bcbf299f092e63d1.1710504184.git.fmaurer@redhat.com Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
acquire/release_in_xmit() work as bit lock in rds_send_xmit(), so they
are expected to ensure acquire/release memory ordering semantics.
However, test_and_set_bit/clear_bit() don't imply such semantics, on
top of this, following smp_mb__after_atomic() does not guarantee release
ordering (memory barrier actually should be placed before clear_bit()).
Instead, we use clear_bit_unlock/test_and_set_bit_lock() here.
Syzkaller with KCSAN identified a data-race issue when accessing
keypair->receiving_counter.counter. Use READ_ONCE() and WRITE_ONCE()
annotations to mark the data race as intentional.
BUG: KCSAN: data-race in wg_packet_decrypt_worker / wg_packet_rx_poll
write to 0xffff888107765888 of 8 bytes by interrupt on cpu 0:
counter_validate drivers/net/wireguard/receive.c:321 [inline]
wg_packet_rx_poll+0x3ac/0xf00 drivers/net/wireguard/receive.c:461
__napi_poll+0x60/0x3b0 net/core/dev.c:6536
napi_poll net/core/dev.c:6605 [inline]
net_rx_action+0x32b/0x750 net/core/dev.c:6738
__do_softirq+0xc4/0x279 kernel/softirq.c:553
do_softirq+0x5e/0x90 kernel/softirq.c:454
__local_bh_enable_ip+0x64/0x70 kernel/softirq.c:381
__raw_spin_unlock_bh include/linux/spinlock_api_smp.h:167 [inline]
_raw_spin_unlock_bh+0x36/0x40 kernel/locking/spinlock.c:210
spin_unlock_bh include/linux/spinlock.h:396 [inline]
ptr_ring_consume_bh include/linux/ptr_ring.h:367 [inline]
wg_packet_decrypt_worker+0x6c5/0x700 drivers/net/wireguard/receive.c:499
process_one_work kernel/workqueue.c:2633 [inline]
...
read to 0xffff888107765888 of 8 bytes by task 3196 on cpu 1:
decrypt_packet drivers/net/wireguard/receive.c:252 [inline]
wg_packet_decrypt_worker+0x220/0x700 drivers/net/wireguard/receive.c:501
process_one_work kernel/workqueue.c:2633 [inline]
process_scheduled_works+0x5b8/0xa30 kernel/workqueue.c:2706
worker_thread+0x525/0x730 kernel/workqueue.c:2787
...
Fixes: a9e90d9931f3 ("wireguard: noise: separate receive counter from send counter") Reported-by: syzbot+d1de830e4ecdaac83d89@syzkaller.appspotmail.com Signed-off-by: Nikita Zhandarovich <n.zhandarovich@fintech.ru> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
The MLX driver was not updating its control virtqueue size at set_vq_num
and instead always initialized to MLX5_CVQ_MAX_ENT (16) at
setup_cvq_vring.
Qemu would try to set the size to 64 by default, however, because the
CVQ size always was initialized to 16, an error would be thrown when
sending >16 control messages (as used-ring entry 17 is initialized to 0).
For example, starting a guest with x-svq=on and then executing the
following command would produce the error below:
# for i in {1..20}; do ifconfig eth0 hw ether XX:xx:XX:xx:XX:XX; done
qemu-system-x86_64: Insufficient written data (0)
[ 435.331223] virtio_net virtio0: Failed to set mac address by vq command.
SIOCSIFHWADDR: Invalid argument
Acked-by: Dragos Tatulea <dtatulea@nvidia.com> Acked-by: Eugenio Pérez <eperezma@redhat.com> Signed-off-by: Jonah Palmer <jonah.palmer@oracle.com>
Message-Id: <20240216142502.78095-1-jonah.palmer@oracle.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Tested-by: Lei Yang <leiyang@redhat.com> Fixes: 5262912ef3cf ("vdpa/mlx5: Add support for control VQ and MAC setting") Signed-off-by: Sasha Levin <sashal@kernel.org>
vdpasim_do_reset sets running to true, which is wrong, as it allows
vdpasim_kick_vq to post work requests before the device has been
configured. To fix, do not set running until VIRTIO_CONFIG_S_DRIVER_OK
is set.
Fixes: 0c89e2a3a9d0 ("vdpa_sim: Implement suspend vdpa op") Signed-off-by: Steve Sistare <steven.sistare@oracle.com> Reviewed-by: Eugenio Pérez <eperezma@redhat.com> Acked-by: Jason Wang <jasowang@redhat.com>
Message-Id: <1707517807-137331-1-git-send-email-steven.sistare@oracle.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
As well noted by Pekka[1], the rounding of drm_fixp2int_round is wrong.
To round a number, you need to add 0.5 to the number and floor that,
drm_fixp2int_round() is adding 0.0000076. Make it add 0.5.
c712c05e46c8 ("spi: imx: fix the burst length at DMA mode and CPU mode")
corrects three cases of setting the ECSPI burst length but erroneously
leaves the in-range CPU case one bit to big (in that field a value of
0 means 1 bit). The effect was that transmissions that should have been
8-bit bytes appeared as 9-bit causing failed communication with SPI
devices.
On MT7530, the HT_XTAL_FSEL field of the HWTRAP register stores a 2-bit
value that represents the frequency of the crystal oscillator connected to
the switch IC. The field is populated by the state of the ESW_P4_LED_0 and
ESW_P4_LED_0 pins, which is done right after reset is deasserted.
On MT7531, the XTAL25 bit of the STRAP register stores this. The LAN0LED0
pin is used to populate the bit. 25MHz when the pin is high, 40MHz when
it's low.
These pins are also used with LEDs, therefore, their state can be set to
something other than the bootstrapping configuration. For example, a link
may be established on port 3 before the DSA subdriver takes control of the
switch which would set ESW_P3_LED_0 to high.
Currently on mt7530_setup() and mt7531_setup(), 1000 - 1100 usec delay is
described between reset assertion and deassertion. Some switch ICs in real
life conditions cannot always have these pins set back to the bootstrapping
configuration before reset deassertion in this amount of delay. This causes
wrong crystal frequency to be selected which puts the switch in a
nonfunctional state after reset deassertion.
The tests below are conducted on an MT7530 with a 40MHz crystal oscillator
by Justin Swartz.
With a cable from an active peer connected to port 3 before reset, an
incorrect crystal frequency (0b11 = 25MHz) is selected:
[1] Reset is asserted.
[2] Period of 1000 - 1100 usec.
[3] Reset is deasserted.
[4] Period of 315 usec. HWTRAP register is populated with incorrect
XTAL frequency.
[5] Signals reflect the bootstrapped configuration.
Increase the delay between reset_control_assert() and
reset_control_deassert(), and gpiod_set_value_cansleep(priv->reset, 0) and
gpiod_set_value_cansleep(priv->reset, 1) to 5000 - 5100 usec. This amount
ensures a higher possibility that the switch IC will have these pins back
to the bootstrapping configuration before reset deassertion.
With a cable from an active peer connected to port 3 before reset, the
correct crystal frequency (0b10 = 40MHz) is selected:
[1] Reset is asserted.
[2] Period of 5000 - 5100 usec.
[2-1] ESW_P3_LED_0 goes low.
[2-2] Remaining period of 5000 - 5100 usec.
[3] Reset is deasserted.
[4] Period of 310 usec. HWTRAP register is populated with bootstrapped
XTAL frequency.
[5] Signals reflect the bootstrapped configuration.
Revert commit 2920dd92b980 ("net: dsa: mt7530: disable LEDs before reset").
Changing the state of pins via reset assertion is simpler and more
efficient than doing so by setting the LED controller off.
Fixes: b8f126a8d543 ("net-next: dsa: add dsa support for Mediatek MT7530 switch") Fixes: c288575f7810 ("net: dsa: mt7530: Add the support of MT7531 switch") Co-developed-by: Justin Swartz <justin.swartz@risingedge.co.za> Signed-off-by: Justin Swartz <justin.swartz@risingedge.co.za> Signed-off-by: Arınç ÜNAL <arinc.unal@arinc9.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
Commit d3256efd8e8b ("veth: allow enabling NAPI even without XDP") tried to fix
the fact that GRO was not possible without XDP, because veth did not use NAPI
without XDP. However, it also introduced the behaviour that GRO is always
enabled, when XDP is enabled.
While it might be desired for most cases, it is confusing for the user at best
as the GRO flag suddenly changes, when an XDP program is attached. It also
introduces some complexities in state management as was partially addressed in
commit fe9f801355f0 ("net: veth: clear GRO when clearing XDP even when down").
But the biggest problem is that it is not possible to disable GRO at all, when
an XDP program is attached, which might be needed for some use cases.
Fix this by not touching the GRO flag on XDP enable/disable as the code already
supports switching to NAPI if either GRO or XDP is requested.
Link: https://lore.kernel.org/lkml/20240311124015.38106-1-ignat@cloudflare.com/ Fixes: d3256efd8e8b ("veth: allow enabling NAPI even without XDP") Fixes: fe9f801355f0 ("net: veth: clear GRO when clearing XDP even when down") Signed-off-by: Ignat Korchagin <ignat@cloudflare.com> Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
BUG: KCSAN: data-race in dev_queue_xmit_nit / packet_setsockopt
write to 0xffff888107804542 of 1 bytes by task 22618 on cpu 0:
packet_setsockopt+0xd83/0xfd0 net/packet/af_packet.c:4003
do_sock_setsockopt net/socket.c:2311 [inline]
__sys_setsockopt+0x1d8/0x250 net/socket.c:2334
__do_sys_setsockopt net/socket.c:2343 [inline]
__se_sys_setsockopt net/socket.c:2340 [inline]
__x64_sys_setsockopt+0x66/0x80 net/socket.c:2340
do_syscall_64+0xd3/0x1d0
entry_SYSCALL_64_after_hwframe+0x6d/0x75
read to 0xffff888107804542 of 1 bytes by task 27 on cpu 1:
dev_queue_xmit_nit+0x82/0x620 net/core/dev.c:2248
xmit_one net/core/dev.c:3527 [inline]
dev_hard_start_xmit+0xcc/0x3f0 net/core/dev.c:3547
__dev_queue_xmit+0xf24/0x1dd0 net/core/dev.c:4335
dev_queue_xmit include/linux/netdevice.h:3091 [inline]
batadv_send_skb_packet+0x264/0x300 net/batman-adv/send.c:108
batadv_send_broadcast_skb+0x24/0x30 net/batman-adv/send.c:127
batadv_iv_ogm_send_to_if net/batman-adv/bat_iv_ogm.c:392 [inline]
batadv_iv_ogm_emit net/batman-adv/bat_iv_ogm.c:420 [inline]
batadv_iv_send_outstanding_bat_ogm_packet+0x3f0/0x4b0 net/batman-adv/bat_iv_ogm.c:1700
process_one_work kernel/workqueue.c:3254 [inline]
process_scheduled_works+0x465/0x990 kernel/workqueue.c:3335
worker_thread+0x526/0x730 kernel/workqueue.c:3416
kthread+0x1d1/0x210 kernel/kthread.c:388
ret_from_fork+0x4b/0x60 arch/x86/kernel/process.c:147
ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:243
value changed: 0x00 -> 0x01
Reported by Kernel Concurrency Sanitizer on:
CPU: 1 PID: 27 Comm: kworker/u8:1 Tainted: G W 6.8.0-syzkaller-08073-g480e035fc4c7 #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 02/29/2024
Workqueue: bat_events batadv_iv_send_outstanding_bat_ogm_packet
Fixes: fa788d986a3a ("packet: add sockopt to ignore outgoing packets") Reported-by: syzbot+c669c1136495a2e7c31f@syzkaller.appspotmail.com Closes: https://lore.kernel.org/netdev/CANn89i+Z7MfbkBLOv=p7KZ7=K1rKHO4P1OL5LYDCtBiyqsa9oQ@mail.gmail.com/T/#t Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Willem de Bruijn <willemdebruijn.kernel@gmail.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Reviewed-by: Jason Xing <kerneljasonxing@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
In bind_evtchn_to_irq_chip() don't increment the refcnt of the event
channel blindly. In case the event channel is NOT refcounted, issue a
warning instead.
Add an additional safety net by doing the refcnt increment only if the
caller has specified IRQF_SHARED in the irqflags parameter.
When unbinding a user event channel, the related handler might be
called a last time in case the kernel was built with
CONFIG_DEBUG_SHIRQ. This might cause a WARN() in the handler.
Avoid that by adding an "unbinding" flag to struct user_event which
will short circuit the handler.
Taking the ctx lock is not enough to use the deferred request completion
infrastructure, it'll get queued into the list but no one would expect
it there, so it will sit there until next io_submit_flush_completions().
It's hard to care about the cancellation path, so complete it via tw.
A patch to resolve an issue was found in MediaTek's GPL-licensed SDK:
In the mtk_ppe_stop() function, the PPE scan mode is not disabled before
disabling the PPE. This can potentially lead to a hang during the process
of disabling the PPE.
Without this patch, the PPE may experience a hang during the reboot test.
We found a issue on production environment while using NVMe over RDMA,
admin_q reconnect failed forever while remote target and network is ok.
After dig into it, we found it may caused by a ABBA deadlock due to tag
allocation. In my case, the tag was hold by a keep alive request
waiting inside admin_q, as we quiesced admin_q while reset ctrl, so the
request maked as idle and will not process before reset success. As
fabric_q shares tagset with admin_q, while reconnect remote target, we
need a tag for connect command, but the only one reserved tag was held
by keep alive command which waiting inside admin_q. As a result, we
failed to reconnect admin_q forever. In order to fix this issue, I
think we should keep two reserved tags for admin queue.
Fixes: ed01fee283a0 ("nvme-fabrics: only reserve a single tag") Signed-off-by: Chunguang Xu <chunguang.xu@shopee.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
txgbe register clk which name is i2c_designware.pci_dev_id(),
clk_name will be stored in clk_lookup_alloc. If PCIe bus number
is larger than 0x39, clk_name size will be larger than 20 bytes.
It exceeds clk_lookup_alloc MAX_DEV_ID limits. So the driver
shortened clk_name.
CPU: 1 PID: 5033 Comm: syz-executor334 Not tainted 6.7.0-syzkaller-00562-g9f8413c4a66f #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 11/17/2023
=====================================================
If the packet type ID field in the Ethernet header is either ETH_P_PRP or
ETH_P_HSR, but it is not followed by an HSR tag, hsr_get_skb_sequence_nr()
reads an invalid value as a sequence number. This causes the above issue.
This patch fixes the issue by returning NULL if the Ethernet header is not
followed by an HSR tag.
syzbot reported a warning in sk_nulls_del_node_init_rcu().
The commit 66b60b0c8c4a ("dccp/tcp: Unhash sk from ehash for tb2 alloc
failure after check_estalblished().") tried to fix an issue that an
unconnected socket occupies an ehash entry when bhash2 allocation fails.
In such a case, we need to revert changes done by check_established(),
which does not hold refcnt when inserting socket into ehash.
So, to revert the change, we need to __sk_nulls_add_node_rcu() instead
of sk_nulls_add_node_rcu().
Otherwise, sock_put() will cause refcnt underflow and leak the socket.
Looking at the error path of __io_uaddr_map, if we fail after pinning
the pages for any reasons, ret will be set to -EINVAL and the error
handler won't properly release the pinned pages.
I didn't manage to trigger it without forcing a failure, but it can
happen in real life when memory is heavily fragmented.
In the existing code, per-policy flags don't have any impact i.e.
if cpufreq_driver boost is enabled and boost is disabled for one or
more of the policies, the cpufreq driver will behave as if boost is
enabled.
Fix this by incorporating per-policy boost flag in the policy->max
computation used in cpufreq_frequency_table_cpuinfo and setting the
default per-policy boost to mirror the cpufreq_driver boost flag.
Fixes: 218a06a79d9a ("cpufreq: Support per-policy performance boost") Reported-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Reviewed-by: Viresh Kumar <viresh.kumar@linaro.org> Reviewed-by: Dhruva Gole <d-gole@ti.com> Signed-off-by: Sibi Sankar <quic_sibis@quicinc.com>
Tested-by:Yipeng Zou <zouyipeng@huawei.com> <mailto:zouyipeng@huawei.com> Reviewed-by: Yipeng Zou <zouyipeng@huawei.com> <mailto:zouyipeng@huawei.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
A previous bugfix added a call to kcalloc(), which starting in gcc-14
causes a harmless warning about the argument order:
drivers/soc/fsl/dpio/dpio-service.c: In function 'dpaa2_io_service_enqueue_multiple_desc_fq':
drivers/soc/fsl/dpio/dpio-service.c:526:29: error: 'kcalloc' sizes specified with 'sizeof' in the earlier argument and not in the later argument [-Werror=calloc-transposed-args]
526 | ed = kcalloc(sizeof(struct qbman_eq_desc), 32, GFP_KERNEL);
| ^~~~~~
drivers/soc/fsl/dpio/dpio-service.c:526:29: note: earlier argument should specify number of elements, later size of each element
Since the two are only multiplied, the order does not change the
behavior, so just fix it now to shut up the compiler warning.
Commit 870c7ad4a52b ("devlink: protect devlink->dev by the instance
lock") added devlink instance locking inside a loop that iterates over
all the registered devlink instances on the machine in the pre-doit
phase. This can lead to serialization of devlink commands over
different devlink instances.
For example: While the first devlink instance is executing firmware
flash, all commands to other devlink instances on the machine are
forced to wait until the first devlink finishes.
Therefore, in the pre-doit phase, take the devlink instance lock only
for the devlink instance the command is targeting. Devlink layer is
taking a reference on the devlink instance, ensuring the devlink->dev
pointer is valid. This reference taking was introduced by commit a380687200e0 ("devlink: take device reference for devlink object").
Without this commit, it would not be safe to access devlink->dev
lockless.
Fixes: 870c7ad4a52b ("devlink: protect devlink->dev by the instance lock") Signed-off-by: Shay Drory <shayd@nvidia.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
Introduce a new private flag ('DEVLINK_NL_FLAG_NEED_DEV_LOCK') to allow
netlink commands to specify that they need to acquire the device lock in
their pre_doit operation and release it in their post_doit operation.
The reload command will use this flag in the subsequent patch.
No functional changes intended.
Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
Stable-dep-of: d7d75124965a ("devlink: Fix devlink parallel commands processing") Signed-off-by: Sasha Levin <sashal@kernel.org>
Currently, private flags (e.g., 'DEVLINK_NL_FLAG_NEED_PORT') are only
used in pre_doit operations, but a subsequent patch will need to
conditionally lock and unlock the device lock in pre and post doit
operations, respectively.
As a preparation, enable the use of private flags in post_doit
operations in a similar fashion to how it is done for pre_doit
operations.
No functional changes intended.
Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
Stable-dep-of: d7d75124965a ("devlink: Fix devlink parallel commands processing") Signed-off-by: Sasha Levin <sashal@kernel.org>
Device drivers register with devlink from their probe routines (under
the device lock) by acquiring the devlink instance lock and calling
devl_register().
Drivers that support a devlink reload usually implement the
reload_{down, up}() operations in a similar fashion to their remove and
probe routines, respectively.
However, while the remove and probe routines are invoked with the device
lock held, the reload operations are only invoked with the devlink
instance lock held. It is therefore impossible for drivers to acquire
the device lock from their reload operations, as this would result in
lock inversion.
The motivating use case for invoking the reload operations with the
device lock held is in mlxsw which needs to trigger a PCI reset as part
of the reload. The driver cannot call pci_reset_function() as this
function acquires the device lock. Instead, it needs to call
__pci_reset_function_locked which expects the device lock to be held.
To that end, adjust devlink to always acquire the device lock before the
devlink instance lock when performing a reload.
For now, only do that when reload is triggered as part of netns
dismantle. Subsequent patches will handle the case where reload is
explicitly triggered by user space.
Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
Stable-dep-of: d7d75124965a ("devlink: Fix devlink parallel commands processing") Signed-off-by: Sasha Levin <sashal@kernel.org>
taprio_parse_tc_entry() is not correctly checking
TCA_TAPRIO_TC_ENTRY_INDEX attribute:
int tc; // Signed value
tc = nla_get_u32(tb[TCA_TAPRIO_TC_ENTRY_INDEX]);
if (tc >= TC_QOPT_MAX_QUEUE) {
NL_SET_ERR_MSG_MOD(extack, "TC entry index out of range");
return -ERANGE;
}
syzbot reported that it could fed arbitary negative values:
Fixes: a54fc09e4cba ("net/sched: taprio: allow user input of per-tc max SDU") Reported-and-tested-by: syzbot+a340daa06412d6028918@syzkaller.appspotmail.com Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Michal Kubiak <michal.kubiak@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
Current average steal timer calculation produces volatile and inflated
values. The only user of this value is KVM so far and it uses that to
decide whether or not to yield the vCPU which is seeing steal time.
KVM compares average steal timer to a threshold and if the threshold
is past then it does not allow CPU polling and yields it to host, else
it keeps the CPU by polling.
Since KVM's steal time threshold is very low by default (%10) it most
likely is not effected much by the bloated average steal timer values
because the operating region is pretty small. However there might be
new users in the future who might rely on this number. Fix average
steal timer calculation by changing the formula from:
This ensures that avg_steal_timer is actually a naive average of steal
timer values. It now closely follows steal timer values but of course
in a smoother manner.
Fixes: 152e9b8676c6 ("s390/vtime: steal time exponential moving average") Signed-off-by: Mete Durlu <meted@linux.ibm.com> Acked-by: Heiko Carstens <hca@linux.ibm.com> Acked-by: Christian Borntraeger <borntraeger@linux.ibm.com> Signed-off-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
Use wake_up API instead of wake_up_interruptible, since
wait_event_timeout API is used for waiting on command completion.
Fixes: 1463f382f58d ("octeontx2-af: Add support for CGX link management") Signed-off-by: Linu Cherian <lcherian@marvell.com> Signed-off-by: Sunil Goutham <sgoutham@marvell.com> Signed-off-by: Subbaraya Sundeep <sbhatta@marvell.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
The notable thing here is 0x4001 in connect(), which is RDS_TCP_PORT.
So, the scenario would be:
1. unshare(CLONE_NEWNET) creates a per netns tcp listener in
rds_tcp_listen_init().
2. syz-executor connect()s to it and creates a reqsk.
3. syz-executor exit()s immediately.
4. netns is dismantled. [0]
5. reqsk timer is fired, and UAF happens while freeing reqsk. [1]
6. listener is freed after RCU grace period. [2]
Basically, reqsk assumes that the listener guarantees netns safety
until all reqsk timers are expired by holding the listener's refcount.
However, this was not the case for kernel sockets.
Commit 740ea3c4a0b2 ("tcp: Clean up kernel listener's reqsk in
inet_twsk_purge()") fixed this issue only for per-netns ehash.
Allocated by task 258 on cpu 0 at 83.612050s:
kasan_save_stack (mm/kasan/common.c:48)
kasan_save_track (mm/kasan/common.c:68)
__kasan_slab_alloc (mm/kasan/common.c:343)
kmem_cache_alloc (mm/slub.c:3813 mm/slub.c:3860 mm/slub.c:3867)
copy_net_ns (./include/linux/slab.h:701 net/core/net_namespace.c:421 net/core/net_namespace.c:480)
create_new_namespaces (kernel/nsproxy.c:110)
unshare_nsproxy_namespaces (kernel/nsproxy.c:228 (discriminator 4))
ksys_unshare (kernel/fork.c:3429)
__x64_sys_unshare (kernel/fork.c:3496)
do_syscall_64 (arch/x86/entry/common.c:52 arch/x86/entry/common.c:83)
entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:129)
Freed by task 27 on cpu 0 at 329.158864s:
kasan_save_stack (mm/kasan/common.c:48)
kasan_save_track (mm/kasan/common.c:68)
kasan_save_free_info (mm/kasan/generic.c:643)
__kasan_slab_free (mm/kasan/common.c:265)
kmem_cache_free (mm/slub.c:4299 mm/slub.c:4363)
cleanup_net (net/core/net_namespace.c:456 net/core/net_namespace.c:446 net/core/net_namespace.c:639)
process_one_work (kernel/workqueue.c:2638)
worker_thread (kernel/workqueue.c:2700 kernel/workqueue.c:2787)
kthread (kernel/kthread.c:388)
ret_from_fork (arch/x86/kernel/process.c:153)
ret_from_fork_asm (arch/x86/entry/entry_64.S:250)
The buggy address belongs to the object at ffff88801b370000
which belongs to the cache net_namespace of size 4352
The buggy address is located 1024 bytes inside of
freed 4352-byte region [ffff88801b370000, ffff88801b371100)
Later attempts to refault the bo won't happen and the whole
GPU does to lunch. I think Christian's refactoring of this
code out to the driver broke this not very well tested path.
Fixes: 141b15e59175 ("drm/nouveau: move io_reserve_lru handling into the driver v5") Cc: Christian König <christian.koenig@amd.com> Signed-off-by: Dave Airlie <airlied@redhat.com> Acked-by: Christian König <christian.koenig@amd.com> Signed-off-by: Danilo Krummrich <dakr@redhat.com> Link: https://patchwork.freedesktop.org/patch/msgid/20240311072037.287905-1-airlied@gmail.com Signed-off-by: Sasha Levin <sashal@kernel.org>
The sample rates set by the rockchip_i2s_tdm driver in master mode are
inaccurate up to 5% in several cases, due to the driver logic to configure
clocks and a nasty interaction with the Common Clock Framework.
To understand what happens, here is the relevant section of the clock tree
(slightly simplified), along with the names used in the driver:
This is what happens when playing back e.g. at 192 kHz using
audio-graph-card (when recording the same applies, only s/tx/rx/):
0. at probe, rockchip_i2s_tdm_set_sysclk() stores the passed frequency in
i2s_tdm->mclk_tx_freq (*) which is 50176000, and that is never modified
afterwards
1. when playback is started, rockchip_i2s_tdm_hw_params() is called and
does the following two calls
2. rockchip_i2s_tdm_calibrate_mclk():
2a. selects mclk_root0 (vpll0) as a parent for mclk_parent
(mclk_tx_src), which is OK because the vpll0 rate is a good for
192000 (and sumbultiple) rates
2b. sets the mclk_root frequency based on ppm calibration computations
2c. sets mclk_tx_src to 49152000 (= 256 * 192000), which is also OK as
it is a multiple of the required bit clock
3. rockchip_i2s_tdm_set_mclk()
3a. calls clk_set_rate() to set the rate of mclk_tx (clk_i2s2_8ch_tx)
to the value of i2s_tdm->mclk_tx_freq (*), i.e. 50176000 which is
not a multiple of the sampling frequency -- this is not OK
3a1. clk_set_rate() reacts by reparenting clk_i2s2_8ch_tx_src to
vpll1 -- this is not OK because the default vpll1 rate can be
divided to get 44.1 kHz and related rates, not 192 kHz
The result is that the driver does a lot of ad-hoc decisions about clocks
and ends up in using the wrong parent at an unoptimal rate.
Step 0 is one part of the problem: unless the card driver calls set_sysclk
at each stream start, whatever rate is set in mclk_tx_freq during boot will
be taken and used until reboot. Moreover the driver does not care if its
value is not a multiple of any audio frequency.
Another part of the problem is that the whole reparenting and clock rate
setting logic is conflicting with the CCF algorithms to achieve largely the
same goal: selecting the best parent and setting the closest clock
rate. And it turns out that only calling once clk_set_rate() on
clk_i2s2_8ch_tx picks the correct vpll and sets the correct rate.
The fix is based on removing the custom logic in the driver to select the
parent and set the various clocks, and just let the Clock Framework do it
all. As a side effect, the set_sysclk() op becomes useless because we now
let the CCF compute the appropriate value for the sampling rate. It also
implies that the whole calibration logic is now dead code and so it is
removed along with the "PCM Clock Compensation in PPM" kcontrol, which has
always been broken anyway. The handling of the 4 optional clocks also
becomes dead code and is removed.
The actual rates have been tested playing 30 seconds of audio at various
sampling rates before and after this change using sox:
time play -r <sample_rate> -n synth 30 sine 950 gain -3
The time reported in the table below is the 'real' value reported by the
'time' command in the above command line.
While the tests are running the clock tree confirms that:
* without the patch, vpll1 is always used and clk_i2s2_8ch_tx always
produces 50176000 Hz, which cannot be divided for most audio rates
except the slowest ones, generating inaccurate rates
* with the patch:
- for 192000 Hz vpll0 is used
- for 176400 Hz vpll1 is used
- clk_i2s2_8ch_tx always produces (256 * <rate>) Hz
Tested on the RK3308 using the internal audio codec.
fsl_lpspi_probe() is allocating/disposing memory manually with
spi_alloc_host()/spi_alloc_target(), but uses
devm_spi_register_controller(). In case of error after the latter call the
memory will be explicitly freed in the probe function by
spi_controller_put() call, but used afterwards by "devm" management outside
probe() (spi_unregister_controller() <- devm_spi_unregister() below).
This kind of state is per-syscall, and since we're doing the waiting off
entering the io_uring_enter(2) syscall, there's no way that iowait can
already be set for this case. Simplify it by setting it if we need to,
and always clearing it to 0 when done.
Fixes: 7b72d661f1f2 ("io_uring: gate iowait schedule on having pending requests") Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
Using __exit for the remove function results in the remove callback
being discarded with SND_SOC_TLV320ADC3XXX=y. When such a device gets
unbound (e.g. using sysfs or hotplug), the driver is just removed
without the cleanup being performed. This results in resource leaks. Fix
it by compiling in the remove callback unconditionally.
Currently, the secondary CPUs in Hyper-V VTL context lack support for
parallel startup. Therefore, relying on the single initial_stack fetched
from the current task structure suffices for all vCPUs.
However, common initial_stack risks stack corruption when parallel startup
is enabled. In order to facilitate parallel startup, use the initial_stack
from the per CPU idle thread instead of the current task.
Fixes: 3be1bc2fe9d2 ("x86/hyperv: VTL support for Hyper-V") Signed-off-by: Saurabh Sengar <ssengar@linux.microsoft.com> Reviewed-by: Michael Kelley <mhklinux@outlook.com> Link: https://lore.kernel.org/r/1709452896-13342-1-git-send-email-ssengar@linux.microsoft.com Signed-off-by: Wei Liu <wei.liu@kernel.org>
Message-ID: <1709452896-13342-1-git-send-email-ssengar@linux.microsoft.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
Currently the variable irqflags is being set but is not being used,
it appears it should be used in the call to net2272_probe_fin
rather than IRQF_TRIGGER_LOW being used. Kudos to Uwe Kleine-König
for suggesting the fix.
Cleans up clang scan build warning:
drivers/usb/gadget/udc/net2272.c:2610:15: warning: variable 'irqflags'
set but not used [-Wunused-but-set-variable]
Fixes: ceb80363b2ec ("USB: net2272: driver for PLX NET2272 USB device controller") Signed-off-by: Colin Ian King <colin.i.king@gmail.com> Acked-by: Alan Stern <stern@rowland.harvard.edu> Link: https://lore.kernel.org/r/20240307181734.2034407-1-colin.i.king@gmail.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
The get_channel_from_mode() function is supposed to return the channel
which matches the mode. But it has a bug where if it doesn't find a
matching channel then it returns the last channel. It should return
NULL instead.
Also remove an unnecessary NULL check on "channel".
Fixes: 2870b52bae4c ("greybus: lights: add lights implementation") Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org> Reviewed-by: Rui Miguel Silva <rmfrfs@gmail.com> Reviewed-by: Alex Elder <elder@linaro.org> Link: https://lore.kernel.org/r/379c0cb4-39e0-4293-8a18-c7b1298e5420@moroto.mountain Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
It seems a copy&paste mistake that suspend callback removes the GPIO
device. There is no counterpart of this action, means once suspended
there is no more GPIO device available untile full unbind-bind cycle
is performed. Remove suspicious GPIO device removal in suspend.
Fixes: d0aeaa83f0b0 ("serial: exar: split out the exar code from 8250_pci") Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Link: https://lore.kernel.org/r/20240219150627.2101198-2-andriy.shevchenko@linux.intel.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
IRQ_DOMAIN is a hidden (not user visible) symbol. Users cannot set
it directly thru "make *config", so drivers should select it instead
of depending on it if they need it.
Relying on it being set for a dependency is risky.
Consistently using "select" or "depends on" can also help reduce
Kconfig circular dependency issues.
Therefore, change the use of "depends on" for IRQ_DOMAIN to
"select" for RTC_DRV_MT6397.
Fixes: 04d3ba70a3c9 ("rtc: mt6397: add IRQ domain dependency") Cc: Arnd Bergmann <arnd@arndb.de> Cc: Eddie Huang <eddie.huang@mediatek.com> Cc: Sean Wang <sean.wang@mediatek.com> Cc: Matthias Brugger <matthias.bgg@gmail.com> Cc: linux-arm-kernel@lists.infradead.org Cc: linux-mediatek@lists.infradead.org Cc: Alessandro Zummo <a.zummo@towertech.it> Cc: Alexandre Belloni <alexandre.belloni@bootlin.com> Cc: linux-rtc@vger.kernel.org Cc: Marc Zyngier <maz@kernel.org> Cc: Philipp Zabel <p.zabel@pengutronix.de> Cc: Peter Rosin <peda@axentia.se> Reviewed-by: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com> Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Link: https://lore.kernel.org/r/20240213050258.6167-1-rdunlap@infradead.org Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
When perf_init_event() calls perf_try_init_event() to init pmu driver,
searches for the next pmu driver only when the return value is -ENOENT.
Therefore, hisi_ptt_pmu_event_init() needs to check the type at the
beginning of the function.
Otherwise, in the case of perf-task mode, perf_try_init_event() returns
-EOPNOTSUPP and skips subsequent pmu drivers, causes perf_init_event() to
fail.
Fixes: ff0de066b463 ("hwtracing: hisi_ptt: Add trace function support for HiSilicon PCIe Tune and Trace device") Signed-off-by: Yang Jihong <yangjihong1@huawei.com> Reviewed-by: Yicong Yang <yangyicong@hisilicon.com> Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com> Link: https://lore.kernel.org/r/20240108121906.3514820-1-yangjihong1@huawei.com Signed-off-by: Sasha Levin <sashal@kernel.org>
The loop based 64bit division may run for a long time when dividend is a
lot bigger than the divider. Replace the division loop by the
div64_u64() which implementation may be significantly faster.
skip_power_up is used in etm4_init_arch_data when set lpoverride. So
need to set the value of it before calling using it.
Fixes: 5214b563588e ("coresight: etm4x: Add support for sysreg only devices") Signed-off-by: Mao Jinlong <quic_jinlmao@quicinc.com> Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com> Link: https://lore.kernel.org/r/20240131105423.9519-1-quic_jinlmao@quicinc.com Signed-off-by: Sasha Levin <sashal@kernel.org>
The linked commit reverts the change that accidentally used some sysfs
enable/disable functions from Perf which broke the refcounting, but it
also removes the fact that the sysfs disable function disabled the
helpers.
Add a new wrapper function that does both which is used by both Perf and
sysfs, and label the sysfs disable function appropriately. The naming of
all of the functions will be tidied up later to avoid this happening
again.
Fixes: 287e82cf69aa ("coresight: Fix crash when Perf and sysfs modes are used concurrently") Signed-off-by: James Clark <james.clark@arm.com> Link: https://lore.kernel.org/r/20240129154050.569566-2-james.clark@arm.com Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
Ethernet switch does not have addressable subnodes.
This fixes:
arch/arm64/boot/dts/broadcom/bcmbca/bcm4908-asus-gt-ac5300.dtb: ethernet-switch@0: '#address-cells', '#size-cells' do not match any of the regexes: 'pinctrl-[0-9]+'
from schema $id: http://devicetree.org/schemas/net/dsa/brcm,sf2.yaml#
The core expects for tx_empty() either TIOCSER_TEMT when the tx is
empty or 0 otherwise. s3c24xx_serial_txempty_nofifo() might return
0x4, and at least uart_get_lsr_info() tries to clear exactly
TIOCSER_TEMT (BIT(1)). Fix tx_empty() to return TIOCSER_TEMT.
The if (c >= 20 && c <= 0x3f) test added in commit 7a99565f8732 is
wrong. 20 is DC4 in ascii and it makes no sense to consider that as the
bottom limit. Instead, it should be 0x20 as in the other test in
the commit above. This is supposed to NOT change anything as we handle
interesting 20-0x20 asciis far before this if.
While support for working with a vbus was added, the regulator was never
actually gotten (despite what was documented). Fix this by actually
getting the supply from the device tree.
Fix off-by-one error in transfer-function property.
The honeywell,transfer-function property takes values between 1-3 so
make sure the proper enum gets used.
Fixes: 713337d9143ed ("iio: pressure: Honeywell mprls0025pa pressure sensor") Co-developed-by: Andreas Klinger <ak@it-klinger.de> Signed-off-by: Andreas Klinger <ak@it-klinger.de> Signed-off-by: Petre Rodan <petre.rodan@subdimension.ro> Link: https://lore.kernel.org/r/20231229092445.30180-5-petre.rodan@subdimension.ro Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
The sparse tool complains about the remove of the _iomem attribute.
stm32_rproc.c:660:17: warning: cast removes address space '__iomem' of expression
Add '__force' to explicitly specify that the cast is intentional.
This conversion is necessary to cast to addresses pointer,
which are then managed by the remoteproc core as a pointer to a
resource_table structure.
Add '__force' to explicitly specify that the cast is intentional.
This conversion is necessary to cast to virtual addresses pointer,used,
by the remoteproc core.
Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202312150052.HCiNKlqB-lkp@intel.com/ Fixes: 13140de09cc2 ("remoteproc: stm32: add an ST stm32_rproc driver") Signed-off-by: Arnaud Pouliquen <arnaud.pouliquen@foss.st.com> Link: https://lore.kernel.org/r/20240117135312.3381936-2-arnaud.pouliquen@foss.st.com Signed-off-by: Mathieu Poirier <mathieu.poirier@linaro.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
On Arrow Lake S systems, MEI is no longer strictly connected to bus 0,
while graphics remain exclusively on bus 0. Adapt the component
matching logic to accommodate this change:
Original behavior: Required both MEI and graphics to be on the same
bus 0.
New behavior: Only enforces graphics to be on bus 0 (integrated),
allowing MEI to reside on any bus.
This ensures compatibility with Arrow Lake S and maintains functionality
for the legacy systems.
The comedi_test devices have a couple of timers (ai_timer and ao_timer)
that can be started to simulate hardware interrupts. Their expiry
functions normally reschedule the timer. The driver code calls either
del_timer_sync() or del_timer() to delete the timers from the queue, but
does not currently prevent the timers from rescheduling themselves so
synchronized deletion may be ineffective.
Add a couple of boolean members (one for each timer: ai_timer_enable and
ao_timer_enable) to the device private data structure to indicate
whether the timers are allowed to reschedule themselves. Set the member
to true when adding the timer to the queue, and to false when deleting
the timer from the queue in the waveform_ai_cancel() and
waveform_ao_cancel() functions.
The del_timer_sync() function is also called from the waveform_detach()
function, but the timer enable members will already be set to false when
that function is called, so no change is needed there.
The refactoring done in commit 5c57b1ccecc7 ("comedi: comedi_8255: Rework
subdevice initialization functions") to the initialization of the io
field of struct subdev_8255_private broke all cards using the
drivers/comedi/drivers/comedi_8255.c module.
Prior to 5c57b1ccecc7, __subdev_8255_init() initialized the io field
in the newly allocated struct subdev_8255_private to the non-NULL
callback given to the function, otherwise it used a flag parameter to
select between subdev_8255_mmio and subdev_8255_io. The refactoring
removed that logic and the flag, as subdev_8255_mm_init() and
subdev_8255_io_init() now explicitly pass subdev_8255_mmio and
subdev_8255_io respectively to __subdev_8255_init(), only
__subdev_8255_init() never sets spriv->io to the supplied
callback. That spriv->io is NULL leads to a later BUG:
If we're in FILTER_SNAPSHOTS mode and we start scanning a range of the
keyspace where no keys are visible in the current snapshot, we have a
problem - we'll scan for a very long time before scanning terminates.
Awhile back, this was fixed for most cases with peek_upto() (and
assertions that enforce that it's being used).
But the fix missed the fact that the inodes btree is different - every
key offset is in a different snapshot tree, not just the inode field.
Fixes: Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
With the upcoming member seq patch, it's now critical that we don't ever
write to a superblock that hasn't been version downgraded - failure to
update member seq fields will cause split brain detection to fire
erroniously.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Calling fd_install() makes a file reachable for userland, including the
possibility to close the file descriptor, which leads to calling its
'release' hook. If that happens before the code had a chance to bump the
reference of the newly created task struct, the release callback will
call put_task_struct() too early, leading to the premature destruction
of the kernel thread.
Avoid that race by calling fd_install() later, after all the setup is
done.
The gcc compiler on paric does support the __int128 type, although the
architecture does not have native 128-bit support.
The effect is, that the bcachefs u128_square() function will pull in the
libgcc __multi3() helper, which breaks the kernel build when bcachefs is
built as module since this function isn't currently exported in
arch/parisc/kernel/parisc_ksyms.c.
The build failure can be seen in the latest debian kernel build at:
https://buildd.debian.org/status/fetch.php?pkg=linux&arch=hppa&ver=6.7.1-1%7Eexp1&stamp=1706132569&raw=0
We prefer to not export that symbol, so fall back to the optional 64-bit
implementation provided by bcachefs and thus avoid usage of __multi3().
Signed-off-by: Helge Deller <deller@gmx.de> Cc: Kent Overstreet <kent.overstreet@linux.dev> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Inspecting the core with drgn I was able to pull this
>>> prog.crashed_thread().stack_trace()[0]
#0 at 0xffffffffa079657a (ff_layout_cancel_io+0x3a/0x84) in ff_layout_cancel_io at fs/nfs/flexfilelayout/flexfilelayout.c:2021:27
>>> prog.crashed_thread().stack_trace()[0]['idx']
(u32)1
>>> prog.crashed_thread().stack_trace()[0]['flseg'].mirror_array[1].mirror_ds
(struct nfs4_ff_layout_ds *)0xffffffffffffffed
This is clear from the stack trace, we call nfs4_ff_layout_prepare_ds()
which could error out initializing the mirror_ds, and then we go to
clean it all up and our check is only for if (!mirror->mirror_ds). This
is inconsistent with the rest of the users of mirror_ds, which have
if (IS_ERR_OR_NULL(mirror_ds))
to keep from tripping over this exact scenario. Fix this up in
ff_layout_cancel_io() to make sure we don't panic when we get an error.
I also spot checked all the other instances of checking mirror_ds and we
appear to be doing the correct checks everywhere, only unconditionally
dereferencing mirror_ds when we know it would be valid.
Signed-off-by: Josef Bacik <josef@toxicpanda.com> Fixes: b739a5bd9d9f ("NFSv4/flexfiles: Cancel I/O if the layout is recalled or revoked") Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
When bringing a CPU online, some of the PMC and LBR related registers
are reset. The same is done when a CPU is taken offline although that
is unnecessary. This currently happens in the "cpu_dead" callback which
is also incorrect as the callback runs on a control CPU instead of the
one that is being taken offline. This also affects hibernation and
suspend to RAM on some platforms as reported in the link below.
Fixes: 21d59e3e2c40 ("perf/x86/amd/core: Detect PerfMonV2 support") Reported-by: Mario Limonciello <mario.limonciello@amd.com> Signed-off-by: Sandipan Das <sandipan.das@amd.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: https://lore.kernel.org/r/550a026764342cf7e5812680e3e2b91fe662b5ac.1706526029.git.sandipan.das@amd.com Signed-off-by: Sasha Levin <sashal@kernel.org>
Thread A GC thread
- gc_data_segment
- ra_data_block
- locked meta_inode page
- f2fs_inplace_write_data
- invalidate_mapping_pages
: fail to invalidate meta_inode page
due to lock failure or dirty|writeback
status
- f2fs_submit_page_bio
: write last dirty data to old blkaddr
- move_data_block
- load old data from meta_inode page
- f2fs_submit_page_write
: write old data to new blkaddr
Because invalidate_mapping_pages() will skip invalidating page which
has unclear status including locked, dirty, writeback and so on, so
we need to use truncate_inode_pages_range() instead of
invalidate_mapping_pages() to make sure meta_inode page will be dropped.
Fixes: 6aa58d8ad20a ("f2fs: readahead encrypted block during GC") Fixes: e3b49ea36802 ("f2fs: invalidate META_MAPPING before IPU/DIO write") Signed-off-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
This is because the file has only one direct_node. After returning
to -ENOSPC, reserved_blocks += ret will not be executed. As a result,
the reserved_blocks at this time is still 0, which is not the real
number of reserved blocks. Therefore, fsck cannot be set to repair
the file.
After this patch, the fsck flag will be set to fix this problem.
unisoc # df -h | grep dm-48
/dev/block/dm-48 112G 112G 1.8M 100% /data
unisoc # ./f2fs_io reserve_cblocks test.apk
F2FS_IOC_RESERVE_COMPRESS_BLOCKS failed: No space left on device
adb reboot then fsck will be executed
unisoc # df -h | grep dm-48
/dev/block/dm-48 112G 112G 11M 100% /data
unisoc # ./f2fs_io reserve_cblocks test.apk
924
The following f2fs_io test will get a "0" result instead of -EINVAL,
unisoc # ./f2fs_io compress file
unisoc # ./f2fs_io reserve_cblocks file
0
it's not reasonable, so the judgement of
atomic_read(&F2FS_I(inode)->i_compr_blocks) should be placed after
the judgement of is_inode_flag_set(inode, FI_COMPRESS_RELEASED).
Currently, even though xprtsec=tls is specified and used for operations
to MDS, any operations that go to DS travel over unencrypted connection.
Or additionally, if more than 1 DS can serve the data, then trunked
connections are also done unencrypted.
IN GETDEVINCEINFO, we get an entry for the DS which carries a protocol
type (which is TCP), then nfs4_set_ds_client() gets called with TCP
instead of TCP with TLS.
Currently, each trunked connection is created and uses clp->cl_hostname
value which if TLS is used would get passed up in the handshake upcall,
but instead we need to pass in the appropriate trunked address value.
Fixes: c8407f2e560c ("NFS: Add an "xprtsec=" NFS mount option") Signed-off-by: Olga Kornievskaia <kolga@netapp.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
The intent is to check if 'dest' is truncated or not. So, >= should be
used instead of >, because strlcat() returns the length of 'dest' and 'src'
excluding the trailing NULL.
Fixes: 56463e50d1fc ("NFS: Use super.c for NFSROOT mount option parsing") Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Reviewed-by: Benjamin Coddington <bcodding@redhat.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
The loop inside nfs_netfs_issue_read() currently does not disable
interrupts while iterating through pages in the xarray to submit
for NFS read. This is not safe though since after taking xa_lock,
another page in the mapping could be processed for writeback inside
an interrupt, and deadlock can occur. The fix is simple and clean
if we use xa_for_each_range(), which handles the iteration with RCU
while reducing code complexity.
The problem is easily reproduced with the following test:
mount -o vers=3,fsc 127.0.0.1:/export /mnt/nfs
dd if=/dev/zero of=/mnt/nfs/file1.bin bs=4096 count=1
echo 3 > /proc/sys/vm/drop_caches
dd if=/mnt/nfs/file1.bin of=/dev/null
umount /mnt/nfs
On the console with a lockdep-enabled kernel a message similar to
the following will be seen:
================================
WARNING: inconsistent lock state
6.7.0-lockdbg+ #10 Not tainted
--------------------------------
inconsistent {IN-SOFTIRQ-W} -> {SOFTIRQ-ON-W} usage.
test5/1708 [HC0[0]:SC0[0]:HE1:SE1] takes: ffff888127baa598 (&xa->xa_lock#4){+.?.}-{3:3}, at:
nfs_netfs_issue_read+0x1b2/0x4b0 [nfs]
{IN-SOFTIRQ-W} state was registered at:
lock_acquire+0x144/0x380
_raw_spin_lock_irqsave+0x4e/0xa0
__folio_end_writeback+0x17e/0x5c0
folio_end_writeback+0x93/0x1b0
iomap_finish_ioend+0xeb/0x6a0
blk_update_request+0x204/0x7f0
blk_mq_end_request+0x30/0x1c0
blk_complete_reqs+0x7e/0xa0
__do_softirq+0x113/0x544
__irq_exit_rcu+0xfe/0x120
irq_exit_rcu+0xe/0x20
sysvec_call_function_single+0x6f/0x90
asm_sysvec_call_function_single+0x1a/0x20
pv_native_safe_halt+0xf/0x20
default_idle+0x9/0x20
default_idle_call+0x67/0xa0
do_idle+0x2b5/0x300
cpu_startup_entry+0x34/0x40
start_secondary+0x19d/0x1c0
secondary_startup_64_no_verify+0x18f/0x19b
irq event stamp: 176891
hardirqs last enabled at (176891): [<ffffffffa67a0be4>]
_raw_spin_unlock_irqrestore+0x44/0x60
hardirqs last disabled at (176890): [<ffffffffa67a0899>]
_raw_spin_lock_irqsave+0x79/0xa0
softirqs last enabled at (176646): [<ffffffffa515d91e>]
__irq_exit_rcu+0xfe/0x120
softirqs last disabled at (176633): [<ffffffffa515d91e>]
__irq_exit_rcu+0xfe/0x120
other info that might help us debug this:
Possible unsafe locking scenario:
Fixes: 000dbe0bec05 ("NFS: Convert buffered read paths to use netfs when fscache is enabled") Suggested-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Dave Wysochanski <dwysocha@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: David Howells <dhowells@redhat.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
@type in f2fs_allocate_data_block() indicates log header's type, it
can be CURSEG_COLD_DATA_PINNED or CURSEG_ALL_DATA_ATGC, rather than
type of data/node, however IS_DATASEG()/IS_NODESEG() only accept later
one, fix it.
Fixes: 093749e296e2 ("f2fs: support age threshold based garbage collection") Signed-off-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
The driver never sets a default timeout value, therefore it is
initialized to zero. When CONFIG_WATCHDOG_HANDLE_BOOT_ENABLED is
enabled, the watchdog is started during probe. The kernel is supposed to
automatically ping the watchdog from this point until userspace takes
over, but this does not happen if the configured timeout is zero. A zero
timeout causes watchdog_need_worker() to return false, so the heartbeat
worker does not run and the system therefore resets soon after the
driver is probed.
This patch fixes this by setting an arbitrary non-zero default timeout.
The default could be read from the hardware instead, but I didn't see
any reason to add this complexity.
This has been tested on an STM32F746.
Fixes: 85fdc63fe256 ("drivers: watchdog: stm32_iwdg: set WDOG_HW_RUNNING at probe") Signed-off-by: Ben Wolsieffer <ben.wolsieffer@hefring.com> Reviewed-by: Guenter Roeck <linux@roeck-us.net> Link: https://lore.kernel.org/r/20240228182723.12855-1-ben.wolsieffer@hefring.com Signed-off-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Wim Van Sebroeck <wim@linux-watchdog.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
In the probe function, pm_runtime_put_sync() will fail on platform with
runtime PM disabled.
Check if runtime PM is enabled before calling pm_runtime_put_sync() to
fix it.
It needs to check compress flag w/ .i_sem lock, otherwise, compressed
inode may be disabled after the check condition, it's not needed to
set compress option on non-compress inode.
Switch order of operations to avoid creating a short XDR buffer:
e.g., buflen = 12, old xdrlen = 12, new xdrlen = 20.
Having a short XDR buffer leads to lxa_maxcount be a few bytes
less than what is needed to retrieve the whole list when using
a buflen as returned by a call with size = 0:
buflen = listxattr(path, NULL, 0);
buf = malloc(buflen);
buflen = listxattr(path, buf, buflen);
For a file with one attribute (name = '123456'), the first call
with size = 0 will return buflen = 12 ('user.123456\x00').
The second call with size = 12, sends LISTXATTRS with
lxa_maxcount = 12 + 8 (cookie) + 4 (array count) = 24. The
XDR buffer needs 8 (cookie) + 4 (array count) + 4 (name count)
+ 6 (name len) + 2 (padding) + 4 (eof) = 28 which is 4 bytes
shorter than the lxa_maxcount provided in the call.
Fixes: 04a5da690e8f ("NFSv4.2: define limits and sizes for user xattr handling") Signed-off-by: Jorge Mora <mora@netapp.com> Reviewed-by: Benjamin Coddington <bcodding@redhat.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
A call to listxattr() with a buffer size = 0 returns the actual
size of the buffer needed for a subsequent call. When size > 0,
nfs4_listxattr() does not return an error because either
generic_listxattr() or nfs4_listxattr_nfs4_label() consumes
exactly all the bytes then size is 0 when calling
nfs4_listxattr_nfs4_user() which then triggers the following
kernel BUG:
The intent is to check if the strings' are truncated or not. So, >= should
be used instead of >, because strlcat() and snprintf() return the length of
the output, excluding the trailing NULL.
Fixes: a02d69261134 ("SUNRPC: Provide functions for managing universal addresses") Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Reviewed-by: Benjamin Coddington <bcodding@redhat.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
f2fs only support to config zstd compress level w/ a positive number due
to layout design, but since commit e0c1b49f5b67 ("lib: zstd: Upgrade to
latest upstream zstd version 1.4.10"), zstd supports negative compress
level, so that zstd_min_clevel() may return a negative number, then w/
below mount option, .compress_level can be configed w/ a negative number,
which is not allowed to f2fs, let's add check condition to avoid it.
mount -o compress_algorithm=zstd:4294967295 /dev/sdx /mnt/f2fs