commit 2f4a4d63a193 ("ACPI: CPPC: Use access_width over bit_width for system
memory accesses") modified cpc_read()/cpc_write() to use access_width to
read CPC registers.
However, for PCC registers the access width field in the ACPI register
macro specifies the PCC subspace ID. For non-zero PCC subspace ID it is
incorrectly treated as access width. This causes errors when reading
from PCC registers in the CPPC driver.
For PCC registers, base the size of read/write on the bit width field.
The debug message in cpc_read()/cpc_write() is updated to print relevant
information for the address space type used to read the register.
Fixes: 2f4a4d63a193 ("ACPI: CPPC: Use access_width over bit_width for system memory accesses") Signed-off-by: Vanshidhar Konda <vanshikonda@os.amperecomputing.com> Tested-by: Jarred White <jarredwhite@linux.microsoft.com> Reviewed-by: Jarred White <jarredwhite@linux.microsoft.com> Reviewed-by: Easwar Hariharan <eahariha@linux.microsoft.com> Cc: 5.15+ <stable@vger.kernel.org> # 5.15+ Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
[ eahariha: Backport to v5.15 by dropping SystemIO bits as commit a2c8f92bea5f is not present ] Signed-off-by: Easwar Hariharan <eahariha@linux.microsoft.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
Commit 2f4a4d63a193 ("ACPI: CPPC: Use access_width over bit_width for
system memory accesses") neglected to properly wrap the bit_offset shift
when it comes to applying the mask. This may cause incorrect values to be
read and may cause the cpufreq module not be loaded.
[ 11.059751] cpu_capacity: CPU0 missing/invalid highest performance.
[ 11.066005] cpu_capacity: partial information: fallback to 1024 for all CPUs
Also, corrected the bitmask generation in GENMASK (extra bit being added).
Fixes: 2f4a4d63a193 ("ACPI: CPPC: Use access_width over bit_width for system memory accesses") Signed-off-by: Jarred White <jarredwhite@linux.microsoft.com> Cc: 5.15+ <stable@vger.kernel.org> # 5.15+ Reviewed-by: Vanshidhar Konda <vanshikonda@os.amperecomputing.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Easwar Hariharan <eahariha@linux.microsoft.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
New request from KMD/VBIOS in order to support new UMA carveout
model. This fixes a null dereference from accessing
Ctx->dc_bios->integrated_info while it was NULL.
DAL parses through the BIOS and extracts the necessary
integrated_info but was missing a case for the new BIOS
version 2.3.
Reviewed-by: Nicholas Kazlauskas <nicholas.kazlauskas@amd.com> Acked-by: Aurabindo Pillai <aurabindo.pillai@amd.com> Signed-off-by: Gabe Teeger <gabe.teeger@amd.com> Tested-by: Daniel Wheeler <daniel.wheeler@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
When no mode is set, the utility pin appears to be grounded. No signal
is getting through.
This is problematic because ARC and eARC use this line and may do so even
if no display mode is set.
This change enable the bandgap setting on g12 chip, which fix the problem
with the utility pin. This is done by restoring init values on PHY init and
disable.
The phy is not in a useful state right after init. It will become useful,
including for auxiliary function such as CEC or ARC, after the first mode
is set. This is a problem on systems where the display is using another
interface like DSI or CVBS.
This change refactor the init and mode change callback to power up the PHY
on init and leave only what is necessary for mode changes in the related
function. This is enough to fix CEC operation when HDMI display is not
enabled.
According to hardware limitation, for device support modify
VLAN filter state but not support bypass port VLAN filter,
it should always disable the port VLAN filter. but the driver
enables port VLAN filter when initializing, if there is no
VLAN(except VLAN 0) id added, the driver will disable it
in service task. In most time, it works fine. But there is
a time window before the service task shceduled and net device
being registered. So if user adds VLAN at this time, the driver
will not update the VLAN filter state, and the port VLAN filter
remains enabled.
To fix the problem, if support modify VLAN filter state but not
support bypass port VLAN filter, set the port vlan filter to "off".
Fixes: 184cd221a863 ("net: hns3: disable port VLAN filter when support function level VLAN filter control") Fixes: 2ba306627f59 ("net: hns3: add support for modify VLAN filter state") Signed-off-by: Yonglong Liu <liuyonglong@huawei.com> Signed-off-by: Jijie Shao <shaojijie@huawei.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
There is a memory barrier in followed case. When set the port down,
hclgevf_set_timmer will set DOWN in state. Meanwhile, the service task has
different behaviour based on whether the state is DOWN. Thus, to make sure
service task see DOWN, use smp_mb__after_atomic after calling set_bit().
Fixes: ff200099d271 ("net: hns3: remove unnecessary work in hclgevf_main") Fixes: 1c6dfe6fc6f7 ("net: hns3: remove mailbox and reset work in hclge_main") Signed-off-by: Peiyang Wang <wangpeiyang1@huawei.com> Signed-off-by: Jijie Shao <shaojijie@huawei.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
It provides nodemask_t to describe the numa node mask in kernel. To
improve transportability, change the type of numa_node_mask as nodemask_t.
Fixes: 38caee9d3ee8 ("net: hns3: Add support of the HNAE3 framework") Signed-off-by: Peiyang Wang <wangpeiyang1@huawei.com> Signed-off-by: Jijie Shao <shaojijie@huawei.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
This patch firstly uses new hardware description struct hclge_comm_hw as
child member of hclge_hw and deletes the original child memebers of
hclge_hw. All the hclge_hw variables used in PF module is modified
according to the new hclge_hw.
Secondly hclge_cmd_send is refactored to use hclge_comm_cmd_send APIs. The
old functions called by hclge_cmd_send are deleted and hclge_cmd_send is
kept to avoid too many meaningless modifications.
Signed-off-by: Jie Wang <wangjie125@huawei.com> Signed-off-by: Guangbin Huang <huangguangbin2@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Stable-dep-of: 6639a7b95321 ("net: hns3: change type of numa_node_mask as nodemask_t") Signed-off-by: Sasha Levin <sashal@kernel.org>
This patch create new set of unified hclge_comm_cmd_send APIs for PF and VF
cmdq module. Subfunctions called by hclge_comm_cmd_send are also created
include cmdq result check, cmdq return code conversion and ring space
opertaion APIs.
These new common cmdq APIs will be used to replace the old PF and VF cmdq
APIs in next patches.
Signed-off-by: Jie Wang <wangjie125@huawei.com> Signed-off-by: Guangbin Huang <huangguangbin2@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Stable-dep-of: 6639a7b95321 ("net: hns3: change type of numa_node_mask as nodemask_t") Signed-off-by: Sasha Levin <sashal@kernel.org>
Currently PF and VF cmdq APIs use struct hclge(vf)_hw to describe cmdq
hardware information needed by hclge(vf)_cmd_send. There are a little
differences between its child struct hclge_cmq_ring and hclgevf_cmq_ring.
It is redundent to use two sets of structures to support same functions.
So this patch creates new set of common cmdq hardware description
structures(hclge_comm_hw) to unify PF and VF cmdq functions. The struct
hclge_desc is still kept to avoid too many meaningless replacement.
These new structures will be used to unify hclge(vf)_hw structures in PF
and VF cmdq APIs in next patches.
Signed-off-by: Jie Wang <wangjie125@huawei.com> Signed-off-by: Guangbin Huang <huangguangbin2@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Stable-dep-of: 6639a7b95321 ("net: hns3: change type of numa_node_mask as nodemask_t") Signed-off-by: Sasha Levin <sashal@kernel.org>
Currently we plan to refactor PF and VF cmdq module. A new file folder
hns3_common will be created to store new common APIs used by PF and VF
cmdq module. Thus the PF and VF compilation process will both depends on
the hns3_common. This may cause parallel building problems if we add a new
makefile building unit.
So this patch combined the PF and VF makefile scripts to the top level
makefile to support the new hns3_common which will be created in the next
patch.
Signed-off-by: Jie Wang <wangjie125@huawei.com> Signed-off-by: Guangbin Huang <huangguangbin2@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Stable-dep-of: 6639a7b95321 ("net: hns3: change type of numa_node_mask as nodemask_t") Signed-off-by: Sasha Levin <sashal@kernel.org>
Currently, the driver didn't return when receive a unknown
mailbox message, and continue checking whether need to
generate a response. It's unnecessary and may be incorrect.
Fixes: bb5790b71bad ("net: hns3: refactor mailbox response scheme between PF and VF") Signed-off-by: Jian Shen <shenjian15@huawei.com> Signed-off-by: Jijie Shao <shaojijie@huawei.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
Currently, the function hclge_mbx_handler() has too many switch-case
statements, it makes this function too long. To improve code readability,
refactor this function and use lookup table instead.
Signed-off-by: Hao Lan <lanhao@huawei.com> Signed-off-by: Guangbin Huang <huangguangbin2@huawei.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Stable-dep-of: 669554c512d2 ("net: hns3: direct return when receive a unknown mailbox message") Signed-off-by: Sasha Levin <sashal@kernel.org>
This patch adds a new mailbox opcode to query map relation between
vf ring and vector.
Signed-off-by: Guangbin Huang <huangguangbin2@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Stable-dep-of: 669554c512d2 ("net: hns3: direct return when receive a unknown mailbox message") Signed-off-by: Sasha Levin <sashal@kernel.org>
When the mbx or reset message arrives, the driver is informed
through an interrupt. This task can be processed only after
the workqueue is scheduled. In some cases, this workqueue
scheduling takes a long time. As a result, the mbx or reset
service task cannot be processed in time. So add some warning
message to improve debugging efficiency for this case.
Signed-off-by: Yufeng Mo <moyufeng@huawei.com> Signed-off-by: Guangbin Huang <huangguangbin2@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Stable-dep-of: 669554c512d2 ("net: hns3: direct return when receive a unknown mailbox message") Signed-off-by: Sasha Levin <sashal@kernel.org>
When a reset occurring, it's supposed to recover user's configuration.
Currently, the port info(speed, duplex and autoneg) is stored in hclge_mac
and will be scheduled updated. Consider the case that reset was happened
consecutively. During the first reset, the port info is configured with
a temporary value cause the PHY is reset and looking for best link config.
Second reset start and use pervious configuration which is not the user's.
The specific process is as follows:
To avoid aboved situation, this patch introduced req_speed, req_duplex,
req_autoneg to store user's configuration and it only be used after
hardware reset and to recover user's configuration
Fixes: f5f2b3e4dcc0 ("net: hns3: add support for imp-controlled PHYs") Signed-off-by: Peiyang Wang <wangpeiyang1@huawei.com> Signed-off-by: Jijie Shao <shaojijie@huawei.com> Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
Currently, there are two ways for PF to set the unicast MAC address space
size: specified by config parameters in firmware or set to default value.
That's mean if the config parameters in firmware is zero, driver will
divide the whole unicast MAC address space equally to 8 PFs. However, in
this case, the unicast MAC address space will be wasted a lot when the
hardware actually has less then 8 PFs. And in the other hand, if one PF has
much more VFs than other PFs, then each function of this PF will has much
less address space than other PFs.
In order to ameliorate the above two situations, introduce the third way
of unicast MAC address space assignment: firmware divides the whole unicast
MAC address space equally to functions of all PFs, and calculates the space
size of each PF according to its function number. PF queries the space size
by the querying device specification command when in initialization
process.
The third way assignment is lower priority than specified by config
parameters, only if the config parameters is zero can be used, and if
firmware does not support the third way assignment, then driver still
divides the whole unicast MAC address space equally to 8 PFs.
Signed-off-by: Guangbin Huang <huangguangbin2@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Stable-dep-of: 05eb60e9648c ("net: hns3: using user configure after hardware reset") Signed-off-by: Sasha Levin <sashal@kernel.org>
The change from skb_copy to pskb_copy unfortunately changed the data
copying to omit the ethernet header, since it was pulled before reaching
this point. Fix this by calling __skb_push/pull around pskb_copy.
Fixes: 59c878cbcdd8 ("net: bridge: fix multicast-to-unicast with fraglist GSO") Signed-off-by: Felix Fietkau <nbd@nbd.name> Acked-by: Nikolay Aleksandrov <razor@blackwall.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
Through hidraw, userspace can cause a status report to be sent
from the device. The parsing in ccp_raw_event() may happen in
parallel to a send_usb_cmd() call (which resets the completion
for tracking the report) if it's running on a different CPU where
bottom half interrupts are not disabled.
Add a spinlock around the complete_all() in ccp_raw_event() and
reinit_completion() in send_usb_cmd() to prevent race issues.
In ccp_raw_event(), the ccp->wait_input_report completion is
completed once. Since we're waiting for exactly one report in
send_usb_cmd(), use complete_all() instead of complete()
to mark the completion as spent.
Introduce cmd_buffer, a separate buffer for storing only
the command that is sent to the device. Before this separation,
the existing buffer was shared for both the command and the
report received in ccp_raw_event(), which was copied into it.
However, because of hidraw, the raw event parsing may be triggered
in the middle of sending a command, resulting in outputting gibberish
to the device. Using a separate buffer resolves this.
Each attribute inside a nested IFLA_VF_VLAN_LIST is assumed to be a
struct ifla_vf_vlan_info so the size of such attribute needs to be at least
of sizeof(struct ifla_vf_vlan_info) which is 14 bytes.
The current size validation in do_setvfinfo is against NLA_HDRLEN (4 bytes)
which is less than sizeof(struct ifla_vf_vlan_info) so this validation
is not enough and a too small attribute might be cast to a
struct ifla_vf_vlan_info, this might result in an out of bands
read access when accessing the saved (casted) entry in ivvl.
Fixes: 79aab093a0b5 ("net: Update API for VF vlan protocol 802.1ad support") Signed-off-by: Roded Zats <rzats@paloaltonetworks.com> Reviewed-by: Donald Hunter <donald.hunter@gmail.com> Link: https://lore.kernel.org/r/20240502155751.75705-1-rzats@paloaltonetworks.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
There is a race condition between l2cap_chan_timeout() and
l2cap_chan_del(). When we use l2cap_chan_del() to delete the
channel, the chan->conn will be set to null. But the conn could
be dereferenced again in the mutex_lock() of l2cap_chan_timeout().
As a result the null pointer dereference bug will happen. The
KASAN report triggered by POC is shown below:
When the sco connection is established and then, the sco socket
is releasing, timeout_work will be scheduled to judge whether
the sco disconnection is timeout. The sock will be deallocated
later, but it is dereferenced again in sco_sock_timeout. As a
result, the use-after-free bugs will happen. The root cause is
shown below:
[ 95.890016] ==================================================================
[ 95.890496] BUG: KASAN: slab-use-after-free in sco_sock_timeout+0x5e/0x1c0
[ 95.890755] Write of size 4 at addr ffff88800c388080 by task kworker/0:0/7
...
[ 95.890755] Workqueue: events sco_sock_timeout
[ 95.890755] Call Trace:
[ 95.890755] <TASK>
[ 95.890755] dump_stack_lvl+0x45/0x110
[ 95.890755] print_address_description+0x78/0x390
[ 95.890755] print_report+0x11b/0x250
[ 95.890755] ? __virt_addr_valid+0xbe/0xf0
[ 95.890755] ? sco_sock_timeout+0x5e/0x1c0
[ 95.890755] kasan_report+0x139/0x170
[ 95.890755] ? update_load_avg+0xe5/0x9f0
[ 95.890755] ? sco_sock_timeout+0x5e/0x1c0
[ 95.890755] kasan_check_range+0x2c3/0x2e0
[ 95.890755] sco_sock_timeout+0x5e/0x1c0
[ 95.890755] process_one_work+0x561/0xc50
[ 95.890755] worker_thread+0xab2/0x13c0
[ 95.890755] ? pr_cont_work+0x490/0x490
[ 95.890755] kthread+0x279/0x300
[ 95.890755] ? pr_cont_work+0x490/0x490
[ 95.890755] ? kthread_blkcg+0xa0/0xa0
[ 95.890755] ret_from_fork+0x34/0x60
[ 95.890755] ? kthread_blkcg+0xa0/0xa0
[ 95.890755] ret_from_fork_asm+0x11/0x20
[ 95.890755] </TASK>
[ 95.890755]
[ 95.890755] Allocated by task 506:
[ 95.890755] kasan_save_track+0x3f/0x70
[ 95.890755] __kasan_kmalloc+0x86/0x90
[ 95.890755] __kmalloc+0x17f/0x360
[ 95.890755] sk_prot_alloc+0xe1/0x1a0
[ 95.890755] sk_alloc+0x31/0x4e0
[ 95.890755] bt_sock_alloc+0x2b/0x2a0
[ 95.890755] sco_sock_create+0xad/0x320
[ 95.890755] bt_sock_create+0x145/0x320
[ 95.890755] __sock_create+0x2e1/0x650
[ 95.890755] __sys_socket+0xd0/0x280
[ 95.890755] __x64_sys_socket+0x75/0x80
[ 95.890755] do_syscall_64+0xc4/0x1b0
[ 95.890755] entry_SYSCALL_64_after_hwframe+0x67/0x6f
[ 95.890755]
[ 95.890755] Freed by task 506:
[ 95.890755] kasan_save_track+0x3f/0x70
[ 95.890755] kasan_save_free_info+0x40/0x50
[ 95.890755] poison_slab_object+0x118/0x180
[ 95.890755] __kasan_slab_free+0x12/0x30
[ 95.890755] kfree+0xb2/0x240
[ 95.890755] __sk_destruct+0x317/0x410
[ 95.890755] sco_sock_release+0x232/0x280
[ 95.890755] sock_close+0xb2/0x210
[ 95.890755] __fput+0x37f/0x770
[ 95.890755] task_work_run+0x1ae/0x210
[ 95.890755] get_signal+0xe17/0xf70
[ 95.890755] arch_do_signal_or_restart+0x3f/0x520
[ 95.890755] syscall_exit_to_user_mode+0x55/0x120
[ 95.890755] do_syscall_64+0xd1/0x1b0
[ 95.890755] entry_SYSCALL_64_after_hwframe+0x67/0x6f
[ 95.890755]
[ 95.890755] The buggy address belongs to the object at ffff88800c388000
[ 95.890755] which belongs to the cache kmalloc-1k of size 1024
[ 95.890755] The buggy address is located 128 bytes inside of
[ 95.890755] freed 1024-byte region [ffff88800c388000, ffff88800c388400)
[ 95.890755]
[ 95.890755] The buggy address belongs to the physical page:
[ 95.890755] page: refcount:1 mapcount:0 mapping:0000000000000000 index:0xffff88800c38a800 pfn:0xc388
[ 95.890755] head: order:3 entire_mapcount:0 nr_pages_mapped:0 pincount:0
[ 95.890755] anon flags: 0x100000000000840(slab|head|node=0|zone=1)
[ 95.890755] page_type: 0xffffffff()
[ 95.890755] raw: 0100000000000840ffff888006842dc000000000000000000000000000000001
[ 95.890755] raw: ffff88800c38a800000000000010000a00000001ffffffff0000000000000000
[ 95.890755] head: 0100000000000840ffff888006842dc000000000000000000000000000000001
[ 95.890755] head: ffff88800c38a800000000000010000a00000001ffffffff0000000000000000
[ 95.890755] head: 0100000000000003ffffea000030e201ffffea000030e24800000000ffffffff
[ 95.890755] head: 0000000800000000000000000000000000000000ffffffff0000000000000000
[ 95.890755] page dumped because: kasan: bad access detected
[ 95.890755]
[ 95.890755] Memory state around the buggy address:
[ 95.890755] ffff88800c387f80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[ 95.890755] ffff88800c388000: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[ 95.890755] >ffff88800c388080: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[ 95.890755] ^
[ 95.890755] ffff88800c388100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[ 95.890755] ffff88800c388180: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[ 95.890755] ==================================================================
Fix this problem by adding a check protected by sco_conn_lock to judget
whether the conn->hcon is null. Because the conn->hcon will be set to null,
when the sock is releasing.
Fixes: ba316be1b6a0 ("Bluetooth: schedule SCO timeouts with delayed_work") Signed-off-by: Duoming Zhou <duoming@zju.edu.cn> Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
Anderson Nascimento reported a use-after-free splat in tcp_twsk_unique()
with nice analysis.
Since commit ec94c2696f0b ("tcp/dccp: avoid one atomic operation for
timewait hashdance"), inet_twsk_hashdance() sets TIME-WAIT socket's
sk_refcnt after putting it into ehash and releasing the bucket lock.
Thus, there is a small race window where other threads could try to
reuse the port during connect() and call sock_hold() in tcp_twsk_unique()
for the TIME-WAIT socket with zero refcnt.
If that happens, the refcnt taken by tcp_twsk_unique() is overwritten
and sock_put() will cause underflow, triggering a real use-after-free
somewhere else.
To avoid the use-after-free, we need to use refcount_inc_not_zero() in
tcp_twsk_unique() and give up on reusing the port if it returns false.
To fix this issue, change tcp_shutdown() to not
perform a TCP_SYN_RECV -> TCP_FIN_WAIT1 transition,
which makes no sense anyway.
When tcp_rcv_state_process() later changes socket state
from TCP_SYN_RECV to TCP_ESTABLISH, then look at
sk->sk_shutdown to finally enter TCP_FIN_WAIT1 state,
and send a FIN packet from a sane socket state.
This means tcp_send_fin() can now be called from BH
context, and must use GFP_ATOMIC allocations.
[ 33.452494] ==================================================================
[ 33.453513] BUG: KASAN: stack-out-of-bounds in refresh_cpu_vm_stats.constprop.0+0xcc/0x2ec
[ 33.454660] Write of size 164 at addr c1d03d30 by task swapper/0/0
[ 33.455515]
[ 33.455767] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G O 6.1.25-mainline #1
[ 33.456880] Hardware name: Generic DT based system
[ 33.457555] unwind_backtrace from show_stack+0x18/0x1c
[ 33.458326] show_stack from dump_stack_lvl+0x40/0x4c
[ 33.459072] dump_stack_lvl from print_report+0x158/0x4a4
[ 33.459863] print_report from kasan_report+0x9c/0x148
[ 33.460616] kasan_report from kasan_check_range+0x94/0x1a0
[ 33.461424] kasan_check_range from memset+0x20/0x3c
[ 33.462157] memset from refresh_cpu_vm_stats.constprop.0+0xcc/0x2ec
[ 33.463064] refresh_cpu_vm_stats.constprop.0 from tick_nohz_idle_stop_tick+0x180/0x53c
[ 33.464181] tick_nohz_idle_stop_tick from do_idle+0x264/0x354
[ 33.465029] do_idle from cpu_startup_entry+0x20/0x24
[ 33.465769] cpu_startup_entry from rest_init+0xf0/0xf4
[ 33.466528] rest_init from arch_post_acpi_subsys_init+0x0/0x18
[ 33.467397]
[ 33.467644] The buggy address belongs to stack of task swapper/0/0
[ 33.468493] and is located at offset 112 in frame:
[ 33.469172] refresh_cpu_vm_stats.constprop.0+0x0/0x2ec
[ 33.469917]
[ 33.470165] This frame has 2 objects:
[ 33.470696] [32, 76) 'global_zone_diff'
[ 33.470729] [112, 276) 'global_node_diff'
[ 33.471294]
[ 33.472095] The buggy address belongs to the physical page:
[ 33.472862] page:3cd72da8 refcount:1 mapcount:0 mapping:00000000 index:0x0 pfn:0x41d03
[ 33.473944] flags: 0x1000(reserved|zone=0)
[ 33.474565] raw: 00001000ed741470ed741470000000000000000000000000ffffffff00000001
[ 33.475656] raw: 00000000
[ 33.476050] page dumped because: kasan: bad access detected
[ 33.476816]
[ 33.477061] Memory state around the buggy address:
[ 33.477732] c1d03c00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[ 33.478630] c1d03c80: 00 00 00 00 00 00 00 00 f1 f1 f1 f1 00 00 00 00
[ 33.479526] >c1d03d00: 00 04 f2 f2 f2 f2 00 00 00 00 00 00 f1 f1 f1 f1
[ 33.480415] ^
[ 33.481195] c1d03d80: 00 00 00 00 00 00 00 00 00 00 04 f3 f3 f3 f3 f3
[ 33.482088] c1d03e00: f3 f3 f3 f3 00 00 00 00 00 00 00 00 00 00 00 00
[ 33.482978] ==================================================================
We find the root cause of this OOB is that arm does not clear stale stack
poison in the case of cpuidle.
This patch refer to arch/arm64/kernel/sleep.S to resolve this issue.
From cited commit [1] that explain the problem
Functions which the compiler has instrumented for KASAN place poison on
the stack shadow upon entry and remove this poison prior to returning.
In the case of cpuidle, CPUs exit the kernel a number of levels deep in
C code. Any instrumented functions on this critical path will leave
portions of the stack shadow poisoned.
If CPUs lose context and return to the kernel via a cold path, we
restore a prior context saved in __cpu_suspend_enter are forgotten, and
we never remove the poison they placed in the stack shadow area by
functions calls between this and the actual exit of the kernel.
Thus, (depending on stackframe layout) subsequent calls to instrumented
functions may hit this stale poison, resulting in (spurious) KASAN
splats to the console.
To avoid this, clear any stale poison from the idle thread for a CPU
prior to bringing a CPU online.
From cited commit [2]
Extend to check for CONFIG_KASAN_STACK
[1] commit 0d97e6d8024c ("arm64: kasan: clear stale stack poison")
[2] commit d56a9ef84bd0 ("kasan, arm64: unpoison stack only with CONFIG_KASAN_STACK")
Signed-off-by: Boy Wu <boy.wu@mediatek.com> Reviewed-by: Mark Rutland <mark.rutland@arm.com> Acked-by: Andrey Ryabinin <ryabinin.a.a@gmail.com> Reviewed-by: Linus Walleij <linus.walleij@linaro.org> Fixes: 5615f69bc209 ("ARM: 9016/2: Initialize the mapping of KASan shadow memory") Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Signed-off-by: Sasha Levin <sashal@kernel.org>
The software GRO path for esp transport mode uses skb_mac_header_rebuild
prior to re-injecting the packet via the xfrm_napi_dev. This only
copies skb->mac_len bytes of header which may not be sufficient if the
packet contains 802.1Q tags or other VLAN tags. Worse copying only the
initial header will leave a packet marked as being VLAN tagged but
without the corresponding tag leading to mangling when it is later
untagged.
The VLAN tags are important when receiving the decrypted esp transport
mode packet after GRO processing to ensure it is received on the correct
interface.
Therefore record the full mac header length in xfrm*_transport_input for
later use in corresponding xfrm*_transport_finish to copy the entire mac
header when rebuilding the mac header for GRO. The skb->data pointer is
left pointing skb->mac_header bytes after the start of the mac header as
is expected by the network stack and network and transport header
offsets reset to this location.
Fixes: 7785bba299a8 ("esp: Add a software GRO codepath") Signed-off-by: Paul Davey <paul.davey@alliedtelesis.co.nz> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
simple_recursive_removal() drops the pinning references to all positives
in subtree. For the cases when its argument has been kept alive by
the pinning alone that's exactly the right thing to do, but here
the argument comes from dcache lookup, that needs to be balanced by
explicit dput().
Fixes: e41d237818598 "qib_fs: switch to simple_recursive_removal()" Fucked-up-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Sasha Levin <sashal@kernel.org>
We noticed some rare sk_buffs were stepping past the queue when system was
under memory pressure. The general theory is to skip enqueueing
sk_buffs when its not necessary which is the normal case with a system
that is properly provisioned for the task, no memory pressure and enough
cpu assigned.
But, if we can't allocate memory due to an ENOMEM error when enqueueing
the sk_buff into the sockmap receive queue we push it onto a delayed
workqueue to retry later. When a new sk_buff is received we then check
if that queue is empty. However, there is a problem with simply checking
the queue length. When a sk_buff is being processed from the ingress queue
but not yet on the sockmap msg receive queue its possible to also recv
a sk_buff through normal path. It will check the ingress queue which is
zero and then skip ahead of the pkt being processed.
Previously we used sock lock from both contexts which made the problem
harder to hit, but not impossible.
To fix instead of popping the skb from the queue entirely we peek the
skb from the queue and do the copy there. This ensures checks to the
queue length are non-zero while skb is being processed. Then finally
when the entire skb has been copied to user space queue or another
socket we pop it off the queue. This way the queue length check allows
bypassing the queue only after the list has been completely processed.
To reproduce issue we run NGINX compliance test with sockmap running and
observe some flakes in our testing that we attributed to this issue.
Fixes: 04919bed948dc ("tcp: Introduce tcp_read_skb()") Suggested-by: Jakub Sitnicki <jakub@cloudflare.com> Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Tested-by: William Findlay <will@isovalent.com> Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com> Link: https://lore.kernel.org/bpf/20230523025618.113937-5-john.fastabend@gmail.com Signed-off-by: Sasha Levin <sashal@kernel.org>
Now that the backlog manages the reschedule() logic correctly we can drop
the partial fix to reschedule from recvmsg hook.
Rescheduling on recvmsg hook was added to address a corner case where we
still had data in the backlog state but had nothing to kick it and
reschedule the backlog worker to run and finish copying data out of the
state. This had a couple limitations, first it required user space to
kick it introducing an unnecessary EBUSY and retry. Second it only
handled the ingress case and egress redirects would still be hung.
With the correct fix, pushing the reschedule logic down to where the
enomem error occurs we can drop this fix.
Fixes: bec217197b412 ("skmsg: Schedule psock work if the cached skb exists on the psock") Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com> Link: https://lore.kernel.org/bpf/20230523025618.113937-4-john.fastabend@gmail.com
Stable-dep-of: 405df89dd52c ("bpf, sockmap: Improved check for empty queue") Signed-off-by: Sasha Levin <sashal@kernel.org>
Sk_buffs are fed into sockmap verdict programs either from a strparser
(when the user might want to decide how framing of skb is done by attaching
another parser program) or directly through tcp_read_sock. The
tcp_read_sock is the preferred method for performance when the BPF logic is
a stream parser.
The flow for Cilium's common use case with a stream parser is,
tcp_read_sock()
sk_psock_verdict_recv
ret = bpf_prog_run_pin_on_cpu()
sk_psock_verdict_apply(sock, skb, ret)
// if system is under memory pressure or app is slow we may
// need to queue skb. Do this queuing through ingress_skb and
// then kick timer to wake up handler
skb_queue_tail(ingress_skb, skb)
schedule_work(work);
The work queue is wired up to sk_psock_backlog(). This will then walk the
ingress_skb skb list that holds our sk_buffs that could not be handled,
but should be OK to run at some later point. However, its possible that
the workqueue doing this work still hits an error when sending the skb.
When this happens the skbuff is requeued on a temporary 'state' struct
kept with the workqueue. This is necessary because its possible to
partially send an skbuff before hitting an error and we need to know how
and where to restart when the workqueue runs next.
Now for the trouble, we don't rekick the workqueue. This can cause a
stall where the skbuff we just cached on the state variable might never
be sent. This happens when its the last packet in a flow and no further
packets come along that would cause the system to kick the workqueue from
that side.
To fix we could do simple schedule_work(), but while under memory pressure
it makes sense to back off some instead of continue to retry repeatedly. So
instead to fix convert schedule_work to schedule_delayed_work and add
backoff logic to reschedule from backlog queue on errors. Its not obvious
though what a good backoff is so use '1'.
To test we observed some flakes whil running NGINX compliance test with
sockmap we attributed these failed test to this bug and subsequent issue.
>From on list discussion. This commit
bec217197b41("skmsg: Schedule psock work if the cached skb exists on the psock")
was intended to address similar race, but had a couple cases it missed.
Most obvious it only accounted for receiving traffic on the local socket
so if redirecting into another socket we could still get an sk_buff stuck
here. Next it missed the case where copied=0 in the recv() handler and
then we wouldn't kick the scheduler. Also its sub-optimal to require
userspace to kick the internal mechanisms of sockmap to wake it up and
copy data to user. It results in an extra syscall and requires the app
to actual handle the EAGAIN correctly.
Fixes: 04919bed948dc ("tcp: Introduce tcp_read_skb()") Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Tested-by: William Findlay <will@isovalent.com> Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com> Link: https://lore.kernel.org/bpf/20230523025618.113937-3-john.fastabend@gmail.com
Stable-dep-of: 405df89dd52c ("bpf, sockmap: Improved check for empty queue") Signed-off-by: Sasha Levin <sashal@kernel.org>
The sockmap code is returning EAGAIN after a FIN packet is received and no
more data is on the receive queue. Correct behavior is to return 0 to the
user and the user can then close the socket. The EAGAIN causes many apps
to retry which masks the problem. Eventually the socket is evicted from
the sockmap because its released from sockmap sock free handling. The
issue creates a delay and can cause some errors on application side.
To fix this check on sk_msg_recvmsg side if length is zero and FIN flag
is set then set return to zero. A selftest will be added to check this
condition.
Fixes: 04919bed948dc ("tcp: Introduce tcp_read_skb()") Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Tested-by: William Findlay <will@isovalent.com> Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com> Link: https://lore.kernel.org/bpf/20230523025618.113937-6-john.fastabend@gmail.com Signed-off-by: Sasha Levin <sashal@kernel.org>
A common mechanism to put a TCP socket into the sockmap is to hook the
BPF_SOCK_OPS_{ACTIVE_PASSIVE}_ESTABLISHED_CB event with a BPF program
that can map the socket info to the correct BPF verdict parser. When
the user adds the socket to the map the psock is created and the new
ops are assigned to ensure the verdict program will 'see' the sk_buffs
as they arrive.
Part of this process hooks the sk_data_ready op with a BPF specific
handler to wake up the BPF verdict program when data is ready to read.
The logic is simple enough (posted here for easy reading)
if (unlikely(!sock || !sock->ops || !sock->ops->read_skb))
return;
sock->ops->read_skb(sk, sk_psock_verdict_recv);
}
The oversight here is sk->sk_socket is not assigned until the application
accepts() the new socket. However, its entirely ok for the peer application
to do a connect() followed immediately by sends. The socket on the receiver
is sitting on the backlog queue of the listening socket until its accepted
and the data is queued up. If the peer never accepts the socket or is slow
it will eventually hit data limits and rate limit the session. But,
important for BPF sockmap hooks when this data is received TCP stack does
the sk_data_ready() call but the read_skb() for this data is never called
because sk_socket is missing. The data sits on the sk_receive_queue.
Then once the socket is accepted if we never receive more data from the
peer there will be no further sk_data_ready calls and all the data
is still on the sk_receive_queue(). Then user calls recvmsg after accept()
and for TCP sockets in sockmap we use the tcp_bpf_recvmsg_parser() handler.
The handler checks for data in the sk_msg ingress queue expecting that
the BPF program has already run from the sk_data_ready hook and enqueued
the data as needed. So we are stuck.
To fix do an unlikely check in recvmsg handler for data on the
sk_receive_queue and if it exists wake up data_ready. We have the sock
locked in both read_skb and recvmsg so should avoid having multiple
runners.
Fixes: 04919bed948dc ("tcp: Introduce tcp_read_skb()") Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com> Link: https://lore.kernel.org/bpf/20230523025618.113937-7-john.fastabend@gmail.com Signed-off-by: Sasha Levin <sashal@kernel.org>
I didn't pay close enough attention the last time I tried to fix this
problem - while we currently do correctly take care to make sure we don't
probe a connected eDP port more then once, we don't do the same thing for
eDP ports we found to be disconnected.
So, fix this and make sure we only ever probe eDP ports once and then leave
them at that connector state forever (since without HPD, it's not going to
change on its own anyway). This should get rid of the last few GSP errors
getting spit out during runtime suspend and resume on some machines, as we
tried to reprobe eDP ports in response to ACPI hotplug probe events.
We don't need to hold the prepare_lock when dropping a ref on a struct
clk_core. The release function is only freeing memory and any code with
a pointer reference has already unlinked anything pointing to the
clk_core. This reduces the holding area of the prepare_lock a bit.
Note that we also don't call free_clk() with the prepare_lock held.
There isn't any reason to do that.
9p is a remote network protocol, and it doesn't support asynchronous
notifications from the server. Ensure that we don't hand out any leases
since we can't guarantee they'll be broken when a file's contents
change.
Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Eric Van Hensbergen <ericvh@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
Garbage in plain 9P2000's perm bits is allowed through, which causes it
to be able to set (among others) the suid bit. This was presumably not
the intent since the unix extended bits are handled explicitly and
conditionally on .u.
Signed-off-by: Joakim Sindholt <opensource@zhasha.com> Signed-off-by: Eric Van Hensbergen <ericvh@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
valid-adjtimex.c:66:6: warning: shifting a negative signed value is undefined [-Wshift-negative-value]
-499<<16,
~~~~^
valid-adjtimex.c:67:6: warning: shifting a negative signed value is undefined [-Wshift-negative-value]
-450<<16,
~~~~^
..
Fix it by using a multiply by (1 << 16) instead of shifting negative values
in the valid-adjtimex test case. Align the values for better readability.
thread_info.syscall is used by syscall_get_nr to supply syscall nr
over a thread stack frame.
Previously, thread_info.syscall is only saved at syscall_trace_enter
when syscall tracing is enabled. However rest of the kernel code do
expect syscall_get_nr to be available without syscall tracing. The
previous design breaks collect_syscall.
Move saving process to syscall entry to fix it.
Reported-by: Xi Ruoyao <xry111@xry111.site> Link: https://github.com/util-linux/util-linux/issues/2867 Signed-off-by: Jiaxun Yang <jiaxun.yang@flygoat.com> Signed-off-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Signed-off-by: Sasha Levin <sashal@kernel.org>
UBSAN catches undefined behavior in blk-iocost, where sometimes
iocg->delay is shifted right by a number that is too large,
resulting in undefined behavior on some architectures.
[ 186.556576] ------------[ cut here ]------------
UBSAN: shift-out-of-bounds in block/blk-iocost.c:1366:23
shift exponent 64 is too large for 64-bit type 'u64' (aka 'unsigned long long')
CPU: 16 PID: 0 Comm: swapper/16 Tainted: G S E N 6.9.0-0_fbk700_debug_rc2_kbuilder_0_gc85af715cac0 #1
Hardware name: Quanta Twin Lakes MP/Twin Lakes Passive MP, BIOS F09_3A23 12/08/2020
Call Trace:
<IRQ>
dump_stack_lvl+0x8f/0xe0
__ubsan_handle_shift_out_of_bounds+0x22c/0x280
iocg_kick_delay+0x30b/0x310
ioc_timer_fn+0x2fb/0x1f80
__run_timer_base+0x1b6/0x250
...
Avoid that undefined behavior by simply taking the
"delay = 0" branch if the shift is too large.
I am not sure what the symptoms of an undefined value
delay will be, but I suspect it could be more than a
little annoying to debug.
If the systemd-modules service loads the target module, the credentials of
that userspace process will be used to validate the access to the target db
directory. SELinux will prevent it, reporting an error like the following:
It is possible to clear a root's IN_TRANS tag from the radix tree, but
not clear its PERTRANS, if there is some error in between. Eliminate
that possibility by moving the free up to where we clear the tag.
Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
Currently, this call site in btrfs_clear_delalloc_extent() only converts
the reservation. We are marking it not delalloc, so I don't think it
makes sense to keep the rsv around. This is a path where we are not
sure to join a transaction, so it leads to incorrect free-ing during
umount.
Helps with the pass rate of generic/269 and generic/475.
Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
In the FireWire OHCI interrupt handler, if a bus reset interrupt has
occurred, mask bus reset interrupts until bus_reset_work has serviced and
cleared the interrupt.
Normally, we always leave bus reset interrupts masked. We infer the bus
reset from the self-ID interrupt that happens shortly thereafter. A
scenario where we unmask bus reset interrupts was introduced in 2008 in a007bb857e0b26f5d8b73c2ff90782d9c0972620: If
OHCI_PARAM_DEBUG_BUSRESETS (8) is set in the debug parameter bitmask, we
will unmask bus reset interrupts so we can log them.
irq_handler logs the bus reset interrupt. However, we can't clear the bus
reset event flag in irq_handler, because we won't service the event until
later. irq_handler exits with the event flag still set. If the
corresponding interrupt is still unmasked, the first bus reset will
usually freeze the system due to irq_handler being called again each
time it exits. This freeze can be reproduced by loading firewire_ohci
with "modprobe firewire_ohci debug=-1" (to enable all debugging output).
Apparently there are also some cases where bus_reset_work will get called
soon enough to clear the event, and operation will continue normally.
This freeze was first reported a few months after a007bb85 was committed,
but until now it was never fixed. The debug level could safely be set
to -1 through sysfs after the module was loaded, but this would be
ineffectual in logging bus reset interrupts since they were only
unmasked during initialization.
irq_handler will now leave the event flag set but mask bus reset
interrupts, so irq_handler won't be called again and there will be no
freeze. If OHCI_PARAM_DEBUG_BUSRESETS is enabled, bus_reset_work will
unmask the interrupt after servicing the event, so future interrupts
will be caught as desired.
As a side effect to this change, OHCI_PARAM_DEBUG_BUSRESETS can now be
enabled through sysfs in addition to during initial module loading.
However, when enabled through sysfs, logging of bus reset interrupts will
be effective only starting with the second bus reset, after
bus_reset_work has executed.
If the RBUF logic is not reset when the kernel starts then there
may be some data left over from any network boot loader. If the
64-byte packet headers are enabled then this can be fatal.
Extend bcmgenet_dma_disable to do perform the reset, but not when
called from bcmgenet_resume in order to preserve a wake packet.
N.B. This different handling of resume is just based on a hunch -
why else wouldn't one reset the RBUF as well as the TBUF? If this
isn't the case then it's easy to change the patch to make the RBUF
reset unconditional.
Signed-off-by: Phil Elwell <phil@raspberrypi.com> Signed-off-by: Maarten Vanraes <maarten@rmail.be> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
For shutting up spurious KMSAN uninit-value warnings, just replace
kmalloc() calls with kzalloc() for the buffers used for
communications. There should be no real issue with the original code,
but it's still better to cover.
When KCSAN and CONSTRUCTORS are enabled, one can trigger the
"Unpatched return thunk in use. This should not happen!"
catch-all warning.
Usually, when objtool runs on the .o objects, it does generate a section
.return_sites which contains all offsets in the objects to the return
thunks of the functions present there. Those return thunks then get
patched at runtime by the alternatives.
KCSAN and CONSTRUCTORS add this to the object file's .text.startup
section:
-------------------
Disassembly of section .text.startup:
which, if it is built as a module goes through the intermediary stage of
creating a <module>.mod.c file which, when translated, receives a second
constructor:
-------------------
Disassembly of section .text.startup:
Objtool has run already so that second constructor's return thunk cannot
be added to the .return_sites section and thus the return thunk remains
unpatched and the warning rightfully fires.
Drop KCSAN flags from the mod.c generation stage as those constructors
do not contain data races one would be interested about.
Debugged together with David Kaplan <David.Kaplan@amd.com> and Nikolay
Borisov <nik.borisov@suse.com>.
Reported-by: Paul Menzel <pmenzel@molgen.mpg.de> Closes: https://lore.kernel.org/r/0851a207-7143-417e-be31-8bf2b3afb57d@molgen.mpg.de Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Tested-by: Paul Menzel <pmenzel@molgen.mpg.de> # Dell XPS 13 Reviewed-by: Nikolay Borisov <nik.borisov@suse.com> Reviewed-by: Marco Elver <elver@google.com> Signed-off-by: Masahiro Yamada <masahiroy@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
When attempting to exclusive open a device which has no exclusive open
permission, such as a physical device associated with the flakey dm
device, the open operation will fail, resulting in a mount failure.
In this particular scenario, we erroneously return -EINVAL instead of the
correct error code provided by the bdev_open_by_path() function, which is
-EBUSY.
Fix this, by returning error code from the bdev_open_by_path() function.
With this correction, the mount error message will align with that of
ext4 and xfs.
Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
The session resources are used by FW and driver when session is offloaded,
once session is uploaded these resources are not used. The lock is not
required as these fields won't be used any longer. The offload and upload
calls are sequential, hence lock is not required.
sk->sk_rcvbuf in __sock_queue_rcv_skb() and __sk_receive_skb() can be
changed by other threads. Mark this as benign using READ_ONCE().
This patch is aimed at reducing the number of benign races reported by
KCSAN in order to focus future debugging effort on harmful races.
Signed-off-by: linke li <lilinke99@qq.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
Running kernel-doc on ieee80211_i.h flagged the following:
net/mac80211/ieee80211_i.h:145: warning: expecting prototype for enum ieee80211_corrupt_data_flags. Prototype was for enum ieee80211_bss_corrupt_data_flags instead
net/mac80211/ieee80211_i.h:162: warning: expecting prototype for enum ieee80211_valid_data_flags. Prototype was for enum ieee80211_bss_valid_data_flags instead
In punch_hole(), when the offset lies in the final block for a given
height, there is no hole to punch, but the maximum size check fails to
detect that. Consequently, punch_hole() will try to punch a hole beyond
the end of the metadata and fail. Fix the maximum size check.
Signed-off-by: Andrew Price <anprice@redhat.com> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
Typically when an out of resource CQE status is detected, the
lpfc_ramp_down_queue_handler() logic is called to help reduce I/O load by
reducing an sdev's queue_depth.
However, the current lpfc_rampdown_queue_depth() logic does not help reduce
queue_depth. num_cmd_success is never updated and is always zero, which
means new_queue_depth will always be set to sdev->queue_depth. So,
new_queue_depth = sdev->queue_depth - new_queue_depth always sets
new_queue_depth to zero. And, scsi_change_queue_depth(sdev, 0) is
essentially a no-op.
Change the lpfc_ramp_down_queue_handler() logic to set new_queue_depth
equal to sdev->queue_depth subtracted from number of times num_rsrc_err was
incremented. If num_rsrc_err is >= sdev->queue_depth, then set
new_queue_depth equal to 1. Eventually, the frequency of Good_Status
frames will signal SCSI upper layer to auto increase the queue_depth back
to the driver default of 64 via scsi_handle_queue_ramp_up().
There are cases after NPIV deletion where the fabric switch still believes
the NPIV is logged into the fabric. This occurs when a vport is
unregistered before the Remove All DA_ID CT and LOGO ELS are sent to the
fabric.
Currently fc_remove_host(), which calls dev_loss_tmo for all D_IDs including
the fabric D_ID, removes the last ndlp reference and frees the ndlp rport
object. This sometimes causes the race condition where the final DA_ID and
LOGO are skipped from being sent to the fabric switch.
Fix by moving the fc_remove_host() and scsi_remove_host() calls after DA_ID
and LOGO are sent.
vgic_v2_parse_attr() is responsible for finding the vCPU that matches
the user-provided CPUID, which (of course) may not be valid. If the ID
is invalid, kvm_get_vcpu_by_id() returns NULL, which isn't handled
gracefully.
Similar to the GICv3 uaccess flow, check that kvm_get_vcpu_by_id()
actually returns something and fail the ioctl if not.
Cc: stable@vger.kernel.org Fixes: 7d450e282171 ("KVM: arm/arm64: vgic-new: Add userland access to VGIC dist registers") Reported-by: Alexander Potapenko <glider@google.com> Tested-by: Alexander Potapenko <glider@google.com> Reviewed-by: Alexander Potapenko <glider@google.com> Reviewed-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20240424173959.3776798-2-oliver.upton@linux.dev Signed-off-by: Oliver Upton <oliver.upton@linux.dev> Signed-off-by: Sasha Levin <sashal@kernel.org>
When parsing a GICv2 attribute that contains a cpuid, handle this
as the vcpu_id, not a vcpu_idx, as userspace cannot really know
the mapping between the two. For this, use kvm_get_vcpu_by_id()
instead of kvm_get_vcpu().
Take this opportunity to get rid of the pointless check against
online_vcpus, which doesn't make much sense either, and switch
to FIELD_GET as a way to extract the vcpu_id.
Reviewed-by: Zenghui Yu <yuzenghui@huawei.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20230927090911.3355209-5-maz@kernel.org Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Stable-dep-of: 6ddb4f372fc6 ("KVM: arm64: vgic-v2: Check for non-NULL vCPU in vgic_v2_parse_attr()") Signed-off-by: Sasha Levin <sashal@kernel.org>
While PLL CPUX clock rate change when CPU is running from it works in
vast majority of cases, now and then it causes instability. This leads
to system crashes and other undefined behaviour. After a lot of testing
(30+ hours) while also doing a lot of frequency switches, we can't
observe any instability issues anymore when doing reparenting to stable
clock like 24 MHz oscillator.
GRO-GSO path is supposed to be transparent and as such L3 flush checks are
relevant to all UDP flows merging in GRO. This patch uses the same logic
and code from tcp_gro_receive, terminating merge if flush is non zero.
Fixes: e20cf8d3f1f7 ("udp: implement GRO for plain UDP sockets.") Signed-off-by: Richard Gobert <richardbgobert@gmail.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
In one case the -1 is returned which is quite confusing code for
the wrong device ID, in another the ret is returning instead of
plain 0 that also confusing as readed may ask the possible meaning
of positive codes, which are never the case there. Convert both
to use explicit predefined error codes to make it clear what's going
on there.
GPIO controller might not be available when driver is being probed.
There are plenty of reasons why, one of which is deferred probe.
Since GPIOs are optional, return any error code we got to the upper
layer, including deferred probe. With that in mind, use dev_err_probe()
in order to avoid spamming the logs.
Symptom:
When the hsuid attribute is set for the first time on an IQD Layer3
device while the corresponding network interface is already UP,
the kernel will try to execute a napi function pointer that is NULL.
Analysis:
There is one napi structure per out_q: card->qdio.out_qs[i].napi
The napi.poll functions are set during qeth_open().
Since
commit 1cfef80d4c2b ("s390/qeth: Don't call dev_close/dev_open (DOWN/UP)")
qeth_set_offline()/qeth_set_online() no longer call dev_close()/
dev_open(). So if qeth_free_qdio_queues() cleared
card->qdio.out_qs[i].napi.poll while the network interface was UP and the
card was offline, they are not set again.
Reproduction:
chzdev -e $devno layer2=0
ip link set dev $network_interface up
echo 0 > /sys/bus/ccwgroup/devices/0.0.$devno/online
echo foo > /sys/bus/ccwgroup/devices/0.0.$devno/hsuid
echo 1 > /sys/bus/ccwgroup/devices/0.0.$devno/online
-> Crash (can be enforced e.g. by af_iucv connect(), ip link down/up, ...)
Note that a Completion Queue (CQ) is only enabled or disabled, when hsuid
is set for the first time or when it is removed.
Workarounds:
- Set hsuid before setting the device online for the first time
or
- Use chzdev -d $devno; chzdev $devno hsuid=xxx; chzdev -e $devno;
to set hsuid on an existing device. (this will remove and recreate the
network interface)
Fix:
There is no need to free the output queues when a completion queue is
added or removed.
card->qdio.state now indicates whether the inbound buffer pool and the
outbound queues are allocated.
card->qdio.c_q indicates whether a CQ is allocated.
Fixes: 1cfef80d4c2b ("s390/qeth: Don't call dev_close/dev_open (DOWN/UP)") Signed-off-by: Alexandra Winter <wintera@linux.ibm.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://lore.kernel.org/r/20240430091004.2265683-1-wintera@linux.ibm.com Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
The only actual user of qdio.no_input_queues is qeth_qdio_establish(),
and there we already have full awareness of the current Input Queue
configuration (1 RX queue, plus potentially 1 TX Completion queue).
So avoid this state tracking, and the ambiguity it brings with it.
Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Stable-dep-of: 8a2e4d37afb8 ("s390/qeth: Fix kernel panic after setting hsuid") Signed-off-by: Sasha Levin <sashal@kernel.org>
SKB_GSO_FRAGLIST skbs must not be linearized, otherwise they become
invalid. Return NULL if such an skb is passed to skb_copy or
skb_copy_expand, in order to prevent a crash on a potential later
call to skb_gso_segment.
Fixes: 3a1296a38d0c ("net: Support GRO/GSO fraglist chaining.") Signed-off-by: Felix Fietkau <nbd@nbd.name> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
Calling skb_copy on a SKB_GSO_FRAGLIST skb is not valid, since it returns
an invalid linearized skb. This code only needs to change the ethernet
header, so pskb_copy is the right function to call here.
Fixes: 6db6f0eae605 ("bridge: multicast to unicast") Signed-off-by: Felix Fietkau <nbd@nbd.name> Acked-by: Paolo Abeni <pabeni@redhat.com> Acked-by: Nikolay Aleksandrov <razor@blackwall.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
The selftest for the driver sends a dummy packet and checks if the
packet will be received properly as it should be. The regular TX path
and the selftest can use the same network queue so locking is required
and was missing in the selftest path. This was addressed in the commit
cited below.
Unfortunately locking the TX queue requires BH to be disabled which is
not the case in selftest path which is invoked in process context.
Lockdep should be complaining about this.
Use __netif_tx_lock_bh() for TX queue locking.
Fixes: c650e04898072 ("cxgb4: Fix race between loopback and normal Tx path") Reported-by: "John B. Wyatt IV" <jwyatt@redhat.com> Closes: https://lore.kernel.org/all/Zic0ot5aGgR-V4Ks@thinkpad2021/ Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Link: https://lore.kernel.org/r/20240429091147.YWAaal4v@linutronix.de Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
Currently, we allocate a lbuf-sized kernel buffer and copy lbuf from
userspace to that buffer. Later, we use scanf on this buffer but we don't
ensure that the string is terminated inside the buffer, this can lead to
OOB read when using scanf. Fix this issue by using memdup_user_nul instead.
Amlogic sound cards do create a lot of pcm interfaces, possibly more than
8. Some pcm interfaces are internal (like DPCM backends and c2c) and not
exposed to userspace.
Those interfaces still increase the number passed to snd_find_free_minor(),
which eventually exceeds 8 causing -EBUSY error on card registration if
CONFIG_SND_DYNAMIC_MINORS=n and the interface is exposed to userspace.
select CONFIG_SND_DYNAMIC_MINORS for Amlogic cards to avoid the problem.
So far, the formatters have been reset/enabled using the .prepare()
callback. This was done in this callback because walking the formatters use
a mutex so it could not be done in .trigger(), which is atomic by default.
It turns out there is a problem on capture path of the AXG series.
The FIFO may get out of sync with the TDM decoder if the IP are not enabled
in a specific order. The FIFO must be enabled before the formatter starts
producing data. IOW, we must deal with FE before the BE. The .prepare()
callback is called on the BEs before the FE so it is not OK for the AXG.
The .trigger() callback order can be configured, and it deals with the FE
before the BEs by default. To solve our problem, we just need to start and
stop the formatters from the .trigger() callback. It is OK do so now that
the links have been made 'nonatomic' in the card driver.
Non atomic operations need to be performed in the trigger callback
of the TDM interfaces. Those are BEs but what matters is the nonatomic
flag of the FE in the DPCM context. Just set nonatomic for everything so,
at least, it is clear.
With the AXG audio subsystem, there is a possible random channel shift on
TDM capture, when the slot number per lane is more than 2, and there is
more than one lane used.
The problem has been there since the introduction of the axg audio support
but such scenario is pretty uncommon. This is why there is no loud
complains about the problem.
Solving the problem require to make the links non-atomic and use the
trigger() callback to start FEs and BEs in the appropriate order.
This was tried in the past and reverted because it caused the block irq to
sleep while atomic. However, instead of reverting, the solution is to call
snd_pcm_period_elapsed() in a non atomic context.
Use the bottom half of a threaded IRQ to do so.
Fixes: 6dc4fa179fb8 ("ASoC: meson: add axg fifo base driver") Signed-off-by: Jerome Brunet <jbrunet@baylibre.com> Link: https://lore.kernel.org/r/20240426152946.3078805-2-jbrunet@baylibre.com Signed-off-by: Mark Brown <broonie@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
In qede_flow_spec_to_rule(), when calling
qede_parse_flow_attr() then the return code
was only used for a non-zero check, and then
-EINVAL was returned.
qede_parse_flow_attr() can currently fail with:
* -EINVAL
* -EOPNOTSUPP
* -EPROTONOSUPPORT
This patch changes the code to use the actual
return code, not just return -EINVAL.
The blaimed commit introduced qede_flow_spec_to_rule(),
and this call to qede_parse_flow_attr(), it looks
like it just duplicated how it was already used.
Only compile tested.
Fixes: 37c5d3efd7f8 ("qede: use ethtool_rx_flow_rule() to remove duplicated parser code") Signed-off-by: Asbjørn Sloth Tønnesen <ast@fiberby.net> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
In qede_add_tc_flower_fltr(), when calling
qede_parse_flow_attr() then the return code
was only used for a non-zero check, and then
-EINVAL was returned.
qede_parse_flow_attr() can currently fail with:
* -EINVAL
* -EOPNOTSUPP
* -EPROTONOSUPPORT
This patch changes the code to use the actual
return code, not just return -EINVAL.
The blaimed commit introduced these functions.
Only compile tested.
Fixes: 2ce9c93eaca6 ("qede: Ingress tc flower offload (drop action) support.") Signed-off-by: Asbjørn Sloth Tønnesen <ast@fiberby.net> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
Explicitly set 'rc' (return code), before jumping to the
unlock and return path.
By not having any code depend on that 'rc' remains at
it's initial value of -EINVAL, then we can re-use 'rc' for
the return code of function calls in subsequent patches.
Only compile tested.
Signed-off-by: Asbjørn Sloth Tønnesen <ast@fiberby.net> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
Stable-dep-of: fcee2065a178 ("net: qede: use return from qede_parse_flow_attr() for flower") Signed-off-by: Sasha Levin <sashal@kernel.org>
The return-address (RA) register r14 is specified as volatile in the
s390x ELF ABI [1]. Nevertheless proper CFI directives must be provided
for an unwinder to restore the return address, if the RA register
value is changed from its value at function entry, as it is the case.
NSH can encapsulate IPv4, IPv6, Ethernet, NSH, and MPLS. As the inner
protocol can be Ethernet, NSH GSO handler, nsh_gso_segment(), calls
skb_mac_gso_segment() to invoke inner protocol GSO handlers.
nsh_gso_segment() does the following for the original skb before
calling skb_mac_gso_segment()
1. reset skb->network_header
2. save the original skb->{mac_heaeder,mac_len} in a local variable
3. pull the NSH header
4. resets skb->mac_header
5. set up skb->mac_len and skb->protocol for the inner protocol.
and does the following for the segmented skb
6. set ntohs(ETH_P_NSH) to skb->protocol
7. push the NSH header
8. restore skb->mac_header
9. set skb->mac_header + mac_len to skb->network_header
10. restore skb->mac_len
There are two problems in 6-7 and 8-9.
(a)
After 6 & 7, skb->data points to the NSH header, so the outer header
(ETH_P_8021AD in this case) is stripped when skb is sent out of netdev.
Also, if NSH is encapsulated by NSH + Ethernet (so NSH-Ethernet-NSH),
skb_pull() in the first nsh_gso_segment() will make skb->data point
to the middle of the outer NSH or Ethernet header because the Ethernet
header is not pulled by the second nsh_gso_segment().
(b)
While restoring skb->{mac_header,network_header} in 8 & 9,
nsh_gso_segment() does not assume that the data in the linear
buffer is shifted.
However, udp6_ufo_fragment() could shift the data and change
skb->mac_header accordingly as demonstrated by syzbot.
If this happens, even the restored skb->mac_header points to
the middle of the outer header.
It seems nsh_gso_segment() has never worked with outer headers so far.
At the end of nsh_gso_segment(), the outer header must be restored for
the segmented skb, instead of the NSH header.
To do that, let's calculate the outer header position relatively from
the inner header and set skb->{data,mac_header,protocol} properly.