The truesize for a UDP GRO packet is added by main skb and skbs in main
skb's frag_list:
skb_gro_receive_list
p->truesize += skb->truesize;
The commit 53475c5dd856 ("net: fix use-after-free when UDP GRO with
shared fraglist") introduced a truesize increase for frag_list skbs.
When uncloning skb, it will call pskb_expand_head and trusesize for
frag_list skbs may increase. This can occur when allocators uses
__netdev_alloc_skb and not jump into __alloc_skb. This flow does not
use ksize(len) to calculate truesize while pskb_expand_head uses.
skb_segment_list
err = skb_unclone(nskb, GFP_ATOMIC);
pskb_expand_head
if (!skb->sk || skb->destructor == sock_edemux)
skb->truesize += size - osize;
If we uses increased truesize adding as delta_truesize, it will be
larger than before and even larger than previous total truesize value
if skbs in frag_list are abundant. The main skb truesize will become
smaller and even a minus value or a huge value for an unsigned int
parameter. Then the following memory check will drop this abnormal skb.
To avoid this error we should use the original truesize to segment the
main skb.
Fixes: 53475c5dd856 ("net: fix use-after-free when UDP GRO with shared fraglist") Signed-off-by: lena wang <lena.wang@mediatek.com> Acked-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/1646133431-8948-1-git-send-email-lena.wang@mediatek.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Update MAC type check e1000_pch_tgp because for e1000_pch_cnp,
NVM checksum update is still possible.
Emit a more detailed warning message.
Bugzilla: https://bugzilla.opensuse.org/show_bug.cgi?id=1191663 Fixes: 4051f68318ca ("e1000e: Do not take care about recovery NVM checksum") Reported-by: Thomas Bogendoerfer <tbogendoerfer@suse.de> Signed-off-by: Sasha Neftin <sasha.neftin@intel.com> Tested-by: Naama Meir <naamax.meir@linux.intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
struct xfrm_user_offload has flags variable that received user input,
but kernel didn't check if valid bits were provided. It caused a situation
where not sanitized input was forwarded directly to the drivers.
For example, XFRM_OFFLOAD_IPV6 define that was exposed, was used by
strongswan, but not implemented in the kernel at all.
As a solution, check and sanitize input flags to forward
XFRM_OFFLOAD_INBOUND to the drivers.
We must not dereference @new_hooks after nf_hook_mutex has been released,
because other threads might have freed our allocated hooks already.
BUG: KASAN: use-after-free in nf_hook_entries_get_hook_ops include/linux/netfilter.h:130 [inline]
BUG: KASAN: use-after-free in hooks_validate net/netfilter/core.c:171 [inline]
BUG: KASAN: use-after-free in __nf_register_net_hook+0x77a/0x820 net/netfilter/core.c:438
Read of size 2 at addr ffff88801c1a8000 by task syz-executor237/4430
A Packet Too Big ICMPv6 message received in response to an ESP
packet will prevent all further communication through the tunnel
if the reported MTU minus the ESP overhead is smaller than 1280.
E.g. in a case of a tunnel-mode ESP with sha256/aes the overhead
is 92 bytes. Receiving a PTB with MTU of 1371 or less will result
in all further packets in the tunnel dropped. A ping through the
tunnel fails with "ping: sendmsg: Invalid argument".
Apparently the MTU on the xfrm route is smaller than 1280 and
fails the check inside ip6_setup_cork() added by 749439bf.
We found this by debugging USGv6/ipv6ready failures. Failing
tests are: "Phase-2 Interoperability Test Scenario IPsec" /
5.3.11 and 5.4.11 (Tunnel Mode: Fragmentation).
Commit b515d2637276a3810d6595e10ab02c13bfd0b63a ("xfrm:
xfrm_state_mtu should return at least 1280 for ipv6") attempted
to fix this but caused another regression in TCP MSS calculations
and had to be reverted.
The patch below fixes the situation by dropping the MTU
check and instead checking for the underflows described in the 749439bf commit message.
Signed-off-by: Jiri Bohac <jbohac@suse.cz> Fixes: 749439bfac6e ("ipv6: fix udpv6 sendmsg crash caused by too small MTU") Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
syzkaller was recently triggering an oversized kvmalloc() warning via
xdp_umem_create().
The triggered warning was added back in 7661809d493b ("mm: don't allow
oversized kvmalloc() calls"). The rationale for the warning for huge
kvmalloc sizes was as a reaction to a security bug where the size was
more than UINT_MAX but not everything was prepared to handle unsigned
long sizes.
Anyway, the AF_XDP related call trace from this syzkaller report was:
Björn mentioned that requests for >2GB allocation can still be valid:
The structure that is being allocated is the page-pinning accounting.
AF_XDP has an internal limit of U32_MAX pages, which is *a lot*, but
still fewer than what memcg allows (PAGE_COUNTER_MAX is a LONG_MAX/
PAGE_SIZE on 64 bit systems). [...]
I could just change from U32_MAX to INT_MAX, but as I stated earlier
that has a hacky feeling to it. [...] From my perspective, the code
isn't broken, with the memcg limits in consideration. [...]
Linus says:
[...] Pretty much every time this has come up, the kernel warning has
shown that yes, the code was broken and there really wasn't a reason
for doing allocations that big.
Of course, some people would be perfectly fine with the allocation
failing, they just don't want the warning. I didn't want __GFP_NOWARN
to shut it up originally because I wanted people to see all those
cases, but these days I think we can just say "yeah, people can shut
it up explicitly by saying 'go ahead and fail this allocation, don't
warn about it'".
So enough time has passed that by now I'd certainly be ok with [it].
Thus allow call-sites to silence such userspace triggered splats if the
allocation requests have __GFP_NOWARN. For xdp_umem_pin_pages()'s call
to kvcalloc() this is already the case, so nothing else needed there.
Fixes: 7661809d493b ("mm: don't allow oversized kvmalloc() calls") Reported-by: syzbot+11421fbbff99b989670e@syzkaller.appspotmail.com Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Tested-by: syzbot+11421fbbff99b989670e@syzkaller.appspotmail.com Cc: Björn Töpel <bjorn@kernel.org> Cc: Magnus Karlsson <magnus.karlsson@intel.com> Cc: Willy Tarreau <w@1wt.eu> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Andrii Nakryiko <andrii@kernel.org> Cc: Jakub Kicinski <kuba@kernel.org> Cc: David S. Miller <davem@davemloft.net> Link: https://lore.kernel.org/bpf/CAJ+HfNhyfsT5cS_U9EC213ducHs9k9zNxX9+abqC0kTrPbQ0gg@mail.gmail.com Link: https://lore.kernel.org/bpf/20211201202905.b9892171e3f5b9a60f9da251@linux-foundation.org Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Ackd-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The field offset for port configuration status on SPR has been changed to
bit 14 from ICX where it resides at bit 12. By chance link status detection
continued to work on SPR. This is due to bit 12 being a configuration bit
which is in sync with the status bit. Fix this by checking for a SPR device
and checking correct status bit.
Fixes: 26bfe3d0b227 ("ntb: intel: Add Icelake (gen4) support for Intel NTB") Tested-by: Jerry Dai <jerry.dai@intel.com> Signed-off-by: Dave Jiang <dave.jiang@intel.com> Signed-off-by: Jon Mason <jdmason@kudzu.us> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
When tracing the whole disk, 'dropped' and 'msg' will be created
under 'q->debugfs_dir' and 'bt->dir' is NULL, thus blk_trace_free()
won't remove those files. What's worse, the following UAF can be
triggered because of accessing stale 'dropped' and 'msg':
==================================================================
BUG: KASAN: use-after-free in blk_dropped_read+0x89/0x100
Read of size 4 at addr ffff88816912f3d8 by task blktrace/1188
Memory state around the buggy address: ffff88816912f280: fa fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc ffff88816912f300: fa fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc
>ffff88816912f380: fa fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc
^ ffff88816912f400: fa fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc ffff88816912f480: fa fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc
==================================================================
mac80211 set capability NL80211_EXT_FEATURE_CONTROL_PORT_OVER_NL80211
to upper layer by default. That means we should pass EAPoL packets through
nl80211 path only, and should not send the EAPoL skb to netdevice diretly.
At the meanwhile, wpa_supplicant would not register sock to listen EAPoL
skb on the netdevice.
However, there is no control_port_protocol handler in mac80211 for 802.3 RX
packets, mac80211 driver would pass up the EAPoL rekey frame to netdevice
and wpa_supplicant would be never interactive with this kind of packets,
if SUPPORTS_RX_DECAP_OFFLOAD is enabled. This causes STA always rekey fail
if EAPoL frame go through 802.3 path.
To avoid this problem, align the same process as 802.11 type to handle
this frame before put it into network stack.
This also addresses a potential security issue in 802.3 RX mode that was
previously fixed in commit a8c4d76a8dd4 ("mac80211: do not accept/forward
invalid EAPOL frames").
Cc: stable@vger.kernel.org # 5.12+ Fixes: 80a915ec4427 ("mac80211: add rx decapsulation offload support") Signed-off-by: Deren Wu <deren.wu@mediatek.com> Link: https://lore.kernel.org/r/6889c9fced5859ebb088564035f84fd0fa792a49.1644680751.git.deren.wu@mediatek.com
[fix typos, update comment and add note about security issue] Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
xennet_destroy_queues() relies on info->netdev->real_num_tx_queues to
delete queues. Since d7dac083414eb5bb99a6d2ed53dc2c1b405224e5
("net-sysfs: update the queue counts in the unregistration path"),
unregister_netdev() indirectly sets real_num_tx_queues to 0. Those two
facts together means, that xennet_destroy_queues() called from
xennet_remove() cannot do its job, because it's called after
unregister_netdev(). This results in kfree-ing queues that are still
linked in napi, which ultimately crashes:
Fix this by calling xennet_destroy_queues() from xennet_uninit(),
when real_num_tx_queues is still available. This ensures that queues are
destroyed when real_num_tx_queues is set to 0, regardless of how
unregister_netdev() was called.
Originally reported at
https://github.com/QubesOS/qubes-issues/issues/7257
Fixes: d7dac083414eb5bb9 ("net-sysfs: update the queue counts in the unregistration path") Cc: stable@vger.kernel.org Signed-off-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[Why & How]
Dmesg errors are found on dcn3.1 during reset test, but it's not
a really failure. So reduce it to a debug print.
Signed-off-by: Leo (Hanghong) Ma <hanghong.ma@amd.com> Reviewed-by: Nicholas Kazlauskas <nicholas.kazlauskas@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Cc: Mario Limonciello <mario.limonciello@amd.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
This JSP2 PCH actually seems to be some special Apple
specific ICP variant rather than a JSP. Make it so. Or at
least all the references to it seem to be some Apple ICL
machines. Didn't manage to find these PCI IDs in any
public chipset docs unfortunately.
The only thing we're losing here with this JSP->ICP change
is Wa_14011294188, but based on the HSD that isn't actually
needed on any ICP based design (including JSP), only TGP
based stuff (including MCC) really need it. The documented
w/a just never made that distinction because Windows didn't
want to differentiate between JSP and MCC (not sure how
they handle hpd/ddc/etc. then though...).
Cc: stable@vger.kernel.org Cc: Matt Roper <matthew.d.roper@intel.com> Cc: Vivek Kasireddy <vivek.kasireddy@intel.com> Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/4226 Fixes: 943682e3bd19 ("drm/i915: Introduce Jasper Lake PCH") Signed-off-by: Ville Syrjälä <ville.syrjala@linux.intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20220224132142.12927-1-ville.syrjala@linux.intel.com Acked-by: Vivek Kasireddy <vivek.kasireddy@intel.com> Tested-by: Tomas Bzatek <bugs@bzatek.net>
(cherry picked from commit 53581504a8e216d435f114a4f2596ad0dfd902fc) Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The AMD IOMMU logs I/O page faults and such to a ring buffer in
system memory, and this ring buffer can overflow. The AMD IOMMU
spec has the following to say about the interrupt status bit that
signals this overflow condition:
EventOverflow: Event log overflow. RW1C. Reset 0b. 1 = IOMMU
event log overflow has occurred. This bit is set when a new
event is to be written to the event log and there is no usable
entry in the event log, causing the new event information to
be discarded. An interrupt is generated when EventOverflow = 1b
and MMIO Offset 0018h[EventIntEn] = 1b. No new event log
entries are written while this bit is set. Software Note: To
resume logging, clear EventOverflow (W1C), and write a 1 to
MMIO Offset 0018h[EventLogEn].
The AMD IOMMU driver doesn't currently implement this recovery
sequence, meaning that if a ring buffer overflow occurs, logging
of EVT/PPR/GA events will cease entirely.
This patch implements the spec-mandated reset sequence, with the
minor tweak that the hardware seems to want to have a 0 written to
MMIO Offset 0018h[EventLogEn] first, before writing an 1 into this
field, or the IOMMU won't actually resume logging events.
When enabling VMD and IOMMU scalable mode, the following kernel panic
call trace/kernel log is shown in Eagle Stream platform (Sapphire Rapids
CPU) during booting:
The symptom 'list_add double add' is caused by the following failure
message:
pci 10000:80:01.0: DMAR: Setup RID2PASID failed
pci 10000:80:01.0: Failed to add to iommu group 42: -16
pci 10000:80:03.0: [8086:352b] type 01 class 0x060400
Device 10000:80:01.0 is the subdevice of the VMD device 0000:59:00.5,
so invoking intel_pasid_alloc_table() gets the pasid_table of the VMD
device 0000:59:00.5. Here is call path:
pci_real_dma_dev() in pci_for_each_dma_alias() gets the real dma device
which is the VMD device 0000:59:00.5. However, pte of the VMD device
0000:59:00.5 has been configured during this message "pci 0000:59:00.5:
Adding to iommu group 42". So, the status -EBUSY is returned when
configuring pasid entry for device 10000:80:01.0.
It then invokes dmar_remove_one_dev_info() to release
'struct device_domain_info *' from iommu_devinfo_cache. But, the pasid
table is not released because of the following statement in
__dmar_remove_one_dev_info():
if (info->dev && !dev_is_real_dma_subdevice(info->dev)) {
...
intel_pasid_free_table(info->dev);
}
The subsequent dmar_insert_one_dev_info() operation of device
10000:80:03.0 allocates 'struct device_domain_info *' from
iommu_devinfo_cache. The allocated address is the same address that
is released previously for device 10000:80:01.0. Finally, invoking
device_attach_pasid_table() causes the issue.
`git bisect` points to the offending commit 474dd1c65064 ("iommu/vt-d:
Fix clearing real DMA device's scalable-mode context entries"), which
releases the pasid table if the device is not the subdevice by
checking the returned status of dev_is_real_dma_subdevice().
Reverting the offending commit can work around the issue.
The solution is to prevent from allocating pasid table if those
devices are subdevices of the VMD device.
While the $val/$val2 values passed in from userspace are always >= 0
integers, the limits of the control can be signed integers and the $min
can be non-zero and less than zero. To correctly validate $val/$val2
against platform_max, add the $min offset to val first.
Fixes: 817f7c9335ec0 ("ASoC: ops: Reject out of bounds values in snd_soc_put_volsw()") Signed-off-by: Marek Vasut <marex@denx.de> Cc: Mark Brown <broonie@kernel.org> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20220215130645.164025-1-marex@denx.de Signed-off-by: Mark Brown <broonie@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
In order to get the pfn of a struct page* when sparsemem is enabled
without vmemmap, the mem_section structures need to be initialized which
happens in sparse_init.
But kasan_early_init calls pfn_to_page way before sparse_init is called,
which then tries to dereference a null mem_section pointer.
Fix this by removing the usage of this function in kasan_early_init.
The get_boot_hartid_from_fdt() function currently returns U32_MAX
for failure case which is not correct because U32_MAX is a valid
hartid value. This patch fixes the issue by returning error code.
Cc: <stable@vger.kernel.org> Fixes: d7071743db31 ("RISC-V: Add EFI stub support.") Signed-off-by: Sunil V L <sunilvl@ventanamicro.com> Reviewed-by: Heinrich Schuchardt <heinrich.schuchardt@canonical.com> Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Partially revert commit 5f501d555653 ("binfmt_elf: reintroduce using
MAP_FIXED_NOREPLACE"), which applied the ET_DYN "total_mapping_size"
logic also to ET_EXEC.
At least ia64 has ET_EXEC PT_LOAD segments that are not virtual-address
contiguous (but _are_ file-offset contiguous). This would result in a
giant mapping attempting to cover the entire span, including the virtual
address range hole, and well beyond the size of the ELF file itself,
causing the kernel to refuse to load it. For example:
File offset range : 0x000000-0x00bb4c
0x00bb4c bytes
Virtual address range : 0x4000000000000000-0x600000000000bcb0
0x200000000000bcb0 bytes
Remove the total_mapping_size logic for ET_EXEC, which reduces the
ET_EXEC MAP_FIXED_NOREPLACE coverage to only the first PT_LOAD (better
than nothing), and retains it for ET_DYN.
Ironically, this is the reverse of the problem that originally caused
problems with MAP_FIXED_NOREPLACE: overlapping PT_LOAD segments. Future
work could restore full coverage if load_elf_binary() were to perform
mappings in a separate phase from the loading (where it could resolve
both overlaps and holes).
Long story short recursively enforcing RLIMIT_NPROC when it is not
enforced on the process that creates a new user namespace, causes
currently working code to fail. There is no reason to enforce
RLIMIT_NPROC recursively when we don't enforce it normally so update
the code to detect this case.
I would like to simply use capable(CAP_SYS_RESOURCE) to detect when
RLIMIT_NPROC is not enforced upon the caller. Unfortunately because
RLIMIT_NPROC is charged and checked for enforcement based upon the
real uid, using capable() which is euid based is inconsistent with reality.
Come as close as possible to testing for capable(CAP_SYS_RESOURCE) by
testing for when the real uid would match the conditions when
CAP_SYS_RESOURCE would be present if the real uid was the effective
uid.
PCM buffers might be allocated dynamically when the buffer
preallocation failed or a larger buffer is requested, and it's not
guaranteed that substream->dma_buffer points to the actually used
buffer. The driver needs to refer to substream->runtime->dma_addr
instead for the buffer address.
For extent maps, if they are not compressed extents and are adjacent by
logical addresses and file offsets, they can be merged into one larger
extent map.
Such merged extent map will have the higher generation of all the
original ones.
But this brings a problem for autodefrag, as it relies on accurate
extent_map::generation to determine if one extent should be defragged.
For merged extent maps, their higher generation can mark some older
extents to be defragged while the original extent map doesn't meet the
minimal generation threshold.
Thus this will cause extra IO.
So solve the problem, here we introduce a new flag, EXTENT_FLAG_MERGED,
to indicate if the extent map is merged from one or more ems.
And for autodefrag, if we find a merged extent map, and its generation
meets the generation requirement, we just don't use this one, and go
back to defrag_get_extent() to read extent maps from subvolume trees.
This could cause more read IO, but should result less defrag data write,
so in the long run it should be a win for autodefrag.
Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
For defrag, we don't really want to use btrfs_get_extent() to iterate
all extent maps of an inode.
The reasons are:
- btrfs_get_extent() can merge extent maps
And the result em has the higher generation of the two, causing defrag
to mark unnecessary part of such merged large extent map.
This in fact can result extra IO for autodefrag in v5.16+ kernels.
However this patch is not going to completely solve the problem, as
one can still using read() to trigger extent map reading, and got
them merged.
The completely solution for the extent map merging generation problem
will come as an standalone fix.
- btrfs_get_extent() caches the extent map result
Normally it's fine, but for defrag the target range may not get
another read/write for a long long time.
Such cache would only increase the memory usage.
- btrfs_get_extent() doesn't skip older extent map
Unlike the old find_new_extent() which uses btrfs_search_forward() to
skip the older subtree, thus it will pick up unnecessary extent maps.
This patch will fix the regression by introducing defrag_get_extent() to
replace the btrfs_get_extent() call.
This helper will:
- Not cache the file extent we found
It will search the file extent and manually convert it to em.
- Use btrfs_search_forward() to skip entire ranges which is modified in
the past
This should reduce the IO for autodefrag.
Reported-by: Filipe Manana <fdmanana@suse.com> Fixes: 7b508037d4ca ("btrfs: defrag: use defrag_one_cluster() to implement btrfs_defrag_file()") Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
While investigating on why a synchronize_net() has been added recently
in ipv6_mc_down(), I found that igmp6_event_query() and igmp6_event_report()
might drop skbs in some cases.
Discussion about removing synchronize_net() from ipv6_mc_down()
will happen in a different thread.
Fixes: f185de28d9ae ("mld: add new workqueues for process mld events") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Taehee Yoo <ap420073@gmail.com> Cc: Cong Wang <xiyou.wangcong@gmail.com> Cc: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/r/20220303173728.937869-1-eric.dumazet@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
Since referencing user space pointers is special, if the user wants to
filter on a field that is a pointer to user space, then they need to
specify it.
Add a ".ustring" attribute to the field name for filters to state that the
field is pointing to user space such that the kernel can take the
appropriate action to read that pointer.
Link: https://lore.kernel.org/all/yt9d8rvmt2jq.fsf@linux.ibm.com/ Fixes: 77360f9bbc7e ("tracing: Add test for user space strings when filtering on string pointers") Tested-by: Sven Schnelle <svens@linux.ibm.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
Workstation application ANSA/META v21.1.4 get this error dmesg when
running CI test suite provided by ANSA/META:
[drm:amdgpu_gem_va_ioctl [amdgpu]] *ERROR* Couldn't update BO_VA (-16)
This is caused by:
1. create a 256MB buffer in invisible VRAM
2. CPU map the buffer and access it causes vm_fault and try to move
it to visible VRAM
3. force visible VRAM space and traverse all VRAM bos to check if
evicting this bo is valuable
4. when checking a VM bo (in invisible VRAM), amdgpu_vm_evictable()
will set amdgpu_vm->evicting, but latter due to not in visible
VRAM, won't really evict it so not add it to amdgpu_vm->evicted
5. before next CS to clear the amdgpu_vm->evicting, user VM ops
ioctl will pass amdgpu_vm_ready() (check amdgpu_vm->evicted)
but fail in amdgpu_vm_bo_update_mapping() (check
amdgpu_vm->evicting) and get this error log
This error won't affect functionality as next CS will finish the
waiting VM ops. But we'd better clear the error log by checking
the amdgpu_vm->evicting flag in amdgpu_vm_ready() to stop calling
amdgpu_vm_bo_update_mapping() later.
Another reason is amdgpu_vm->evicted list holds all BOs (both
user buffer and page table), but only page table BOs' eviction
prevent VM ops. amdgpu_vm->evicting flag is set only for page
table BOs, so we should use evicting flag instead of evicted list
in amdgpu_vm_ready().
The side effect of this change is: previously blocked VM op (user
buffer in "evicted" list but no page table in it) gets done
immediately.
v2: update commit comments.
Acked-by: Paul Menzel <pmenzel@molgen.mpg.de> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Qiang Yu <qiang.yu@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Cc: stable@vger.kernel.org Signed-off-by: Sasha Levin <sashal@kernel.org>
Follow the precedent set by other architectures that support the VCPU
ioctl, KVM_ENABLE_CAP, and advertise the VM extension, KVM_CAP_ENABLE_CAP.
This way, userspace can ensure that KVM_ENABLE_CAP is available on a
vcpu before using it.
Syzbot found a GPF in reweight_entity. This has been bisected to
commit 4ef0c5c6b5ba ("kernel/sched: Fix sched_fork() access an invalid
sched_task_group")
There is a race between sched_post_fork() and setpriority(PRIO_PGRP)
within a thread group that causes a null-ptr-deref in
reweight_entity() in CFS. The scenario is that the main process spawns
number of new threads, which then call setpriority(PRIO_PGRP, 0, -20),
wait, and exit. For each of the new threads the copy_process() gets
invoked, which adds the new task_struct and calls sched_post_fork()
for it.
In the above scenario there is a possibility that
setpriority(PRIO_PGRP) and set_one_prio() will be called for a thread
in the group that is just being created by copy_process(), and for
which the sched_post_fork() has not been executed yet. This will
trigger a null pointer dereference in reweight_entity(), as it will
try to access the run queue pointer, which hasn't been set.
Before the mentioned change the cfs_rq pointer for the task has been
set in sched_fork(), which is called much earlier in copy_process(),
before the new task is added to the thread_group. Now it is done in
the sched_post_fork(), which is called after that. To fix the issue
the remove the update_load param from the update_load param() function
and call reweight_task() only if the task flag doesn't have the
TASK_NEW flag set.
in the follow scenario:
1. jbd start transaction n
2. task A get new handle for transaction n+1
3. task A do some actions and add inode to FC_Q_MAIN fc_q
4. jbd complete transaction n and clear FC_Q_MAIN fc_q
5. task A call fsync
Fast commit will lost the file actions during a full commit.
we should also add updates to staging queue during a full commit.
and in ext4_fc_cleanup(), when reset a inode's fc track range, check
it's i_sync_tid, if it bigger than current transaction tid, do not
rest it, or we will lost the track range.
And EXT4_MF_FC_COMMITTING is not needed anymore, so drop it.
For the follow scenario:
1. jbd start commit transaction n
2. task A get new handle for transaction n+1
3. task A do some ineligible actions and mark FC_INELIGIBLE
4. jbd complete transaction n and clean FC_INELIGIBLE
5. task A call fsync
In this case fast commit will not fallback to full commit and
transaction n+1 also not handled by jbd.
Make ext4_fc_mark_ineligible() also record transaction tid for
latest ineligible case, when call ext4_fc_cleanup() check
current transaction tid, if small than latest ineligible tid
do not clear the EXT4_MF_FC_INELIGIBLE.
This patch drops ext4_fc_start_ineligible() and
ext4_fc_stop_ineligible() APIs. Fast commit ineligible transactions
should simply call ext4_fc_mark_ineligible() after starting the
trasaction.
Cortex-A510's erratum #2077057 causes SPSR_EL2 to be corrupted when
single-stepping authenticated ERET instructions. A single step is
expected, but a pointer authentication trap is taken instead. The
erratum causes SPSR_EL1 to be copied to SPSR_EL2, which could allow
EL1 to cause a return to EL2 with a guest controlled ELR_EL2.
Because the conditions require an ERET into active-not-pending state,
this is only a problem for the EL2 when EL2 is stepping EL1. In this case
the previous SPSR_EL2 value is preserved in struct kvm_vcpu, and can be
restored.
Cc: stable@vger.kernel.org # 53960faf2b73: arm64: Add Cortex-A510 CPU part definition Cc: stable@vger.kernel.org Signed-off-by: James Morse <james.morse@arm.com>
[maz: fixup cpucaps ordering] Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20220127122052.1584324-5-james.morse@arm.com Signed-off-by: Sasha Levin <sashal@kernel.org>
When sending x_char in stm32_usart_transmit_chars(), driver can overwrite
the value of TDR register by the value of x_char. If this happens, the
previous value that was present in TDR register will not be sent through
uart.
This code checks if the previous value in TDR register is sent before
writing the x_char value into register.
Mark the start_backtrace() as notrace and NOKPROBE_SYMBOL
because this function is called from ftrace and lockdep to
get the caller address via return_address(). The lockdep
is used in kprobes, it should also be NOKPROBE_SYMBOL.
Fixes: b07f3499661c ("arm64: stacktrace: Move start_backtrace() out of the header") Cc: <stable@vger.kernel.org> # 5.13.x Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org> Reviewed-by: Mark Brown <broonie@kernel.org> Link: https://lore.kernel.org/r/164301227374.1433152.12808232644267107415.stgit@devnote2 Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
The reason is that trace event filter treats the user space pointer
defined by "filename" as a normal pointer to compare against the "cpu"
string. The following bug happened:
The above happened because the kernel tried to access user space directly
and triggered a "supervisor read access in kernel mode" fault. Worse yet,
the memory could not even be loaded yet, and a SEGFAULT could happen as
well. This could be true for kernel space accessing as well.
To be even more robust, test both kernel and user space strings. If the
string fails to read, then simply have the filter fail.
Note, TASK_SIZE is used to determine if the pointer is user or kernel space
and the appropriate strncpy_from_kernel/user_nofault() function is used to
copy the memory. For some architectures, the compare to TASK_SIZE may always
pick user space or kernel space. If it gets it wrong, the only thing is that
the filter will fail to match. In the future, this needs to be fixed to have
the event denote which should be used. But failing a filter is much better
than panicing the machine, and that can be solved later.
Link: https://lore.kernel.org/all/20220107044951.22080-1-kernelfans@gmail.com/ Link: https://lkml.kernel.org/r/20220110115532.536088fd@gandalf.local.home Cc: stable@vger.kernel.org Cc: Ingo Molnar <mingo@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Tom Zanussi <zanussi@kernel.org> Reported-by: Pingfan Liu <kernelfans@gmail.com> Tested-by: Pingfan Liu <kernelfans@gmail.com> Fixes: 87a342f5db69d ("tracing/filters: Support filtering for char * strings") Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
In exfat_truncate(), the computation of inode->i_blocks is wrong if
the file is larger than 4 GiB because a 32-bit variable is used as a
mask. This is fixed and simplified by using round_up().
Also fix the same buggy computation in exfat_read_root() and another
(correct) one in exfat_fill_inode(). The latter was fixed another way
last month but can be simplified by using round_up() as well. See:
commit 0c336d6e33f4 ("exfat: fix incorrect loading of i_blocks for
large files")
pm_runtime_get_() increments the runtime PM usage counter even
when it returns an error code, thus a matching decrement is needed on
the error handling path to keep the counter balanced.
Choose some other visible function for the test. The function does not
have to be actually executed during the test, because it is only testing
filter API interface.
Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@canonical.com> Reviewed-by: Shuah Khan <skhan@linuxfoundation.org> Acked-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Shuah Khan <skhan@linuxfoundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
seccomp_bpf failed on tests 47 global.user_notification_filter_empty
and 48 global.user_notification_filter_empty_threaded when it's
tested on updated kernel but with old kernel headers. Because old
kernel headers don't have definition of macro __NR_clone3 which is
required for these two tests. Since under selftests/, we can install
headers once for all tests (the default INSTALL_HDR_PATH is
usr/include), fix it by adding usr/include to the list of directories
to be searched. Use "-isystem" to indicate it's a system directory as
the real kernel headers directories are.
Signed-off-by: Sherry Yang <sherry.yang@oracle.com> Tested-by: Sherry Yang <sherry.yang@oracle.com> Reviewed-by: Kees Cook <keescook@chromium.org> Signed-off-by: Shuah Khan <skhan@linuxfoundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
When we create a file with modefromsids we set an ACL that
has one ACE for the magic modefromsid as well as a second ACE that
grants full access to all authenticated users.
When later we chante the mode on the file we strip away this, and other,
ACE for authenticated users in set_chmod_dacl() and then just add back/update
the modefromsid ACE.
Thus leaving the file with a single ACE that is for the mode and no ACE
to grant any user any rights to access the file.
Fix this by always adding back also the modefromsid ACE so that we do not
drop the rights to access the file.
Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com> Reviewed-by: Shyam Prasad N <sprasad@microsoft.com> Signed-off-by: Steve French <stfrench@microsoft.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
On newer AMD platforms with SFH, it is observed that random interrupts
get generated on the SFH hardware and until this is cleared the firmware
sensor processing is stalled, resulting in no data been received to
driver side.
Add routines to handle these interrupts, so that firmware operations are
not stalled.
Newer AMD platforms with SFH may generate interrupts on some events
which are unwarranted. Until this is cleared the actual MP2 data
processing maybe stalled in some cases.
Add a mechanism to clear the pending interrupts (if any) during the
driver initialization and sensor command operations.
Since in the current amd_sfh design the sensor data is periodically
obtained in the form of poll data, during the suspend/resume cycle,
scheduling a delayed work adds no value.
So, cancel the work and restart back during the suspend/resume cycle
respectively.
When cifs_get_root() fails during cifs_smb3_do_mount() we call
deactivate_locked_super() which eventually will call delayed_free() which
will free the context.
In this situation we should not proceed to enter the out: section in
cifs_smb3_do_mount() and free the same resources a second time.
[Thu Feb 10 12:59:06 2022] BUG: KASAN: use-after-free in rcu_cblist_dequeue+0x32/0x60
[Thu Feb 10 12:59:06 2022] Read of size 8 at addr ffff888364f4d110 by task swapper/1/0
[Thu Feb 10 12:59:06 2022] CPU: 1 PID: 0 Comm: swapper/1 Tainted: G OE 5.17.0-rc3+ #4
[Thu Feb 10 12:59:06 2022] Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS Hyper-V UEFI Release v4.0 12/17/2019
[Thu Feb 10 12:59:06 2022] Call Trace:
[Thu Feb 10 12:59:06 2022] <IRQ>
[Thu Feb 10 12:59:06 2022] dump_stack_lvl+0x5d/0x78
[Thu Feb 10 12:59:06 2022] print_address_description.constprop.0+0x24/0x150
[Thu Feb 10 12:59:06 2022] ? rcu_cblist_dequeue+0x32/0x60
[Thu Feb 10 12:59:06 2022] kasan_report.cold+0x7d/0x117
[Thu Feb 10 12:59:06 2022] ? rcu_cblist_dequeue+0x32/0x60
[Thu Feb 10 12:59:06 2022] __asan_load8+0x86/0xa0
[Thu Feb 10 12:59:06 2022] rcu_cblist_dequeue+0x32/0x60
[Thu Feb 10 12:59:06 2022] rcu_core+0x547/0xca0
[Thu Feb 10 12:59:06 2022] ? call_rcu+0x3c0/0x3c0
[Thu Feb 10 12:59:06 2022] ? __this_cpu_preempt_check+0x13/0x20
[Thu Feb 10 12:59:06 2022] ? lock_is_held_type+0xea/0x140
[Thu Feb 10 12:59:06 2022] rcu_core_si+0xe/0x10
[Thu Feb 10 12:59:06 2022] __do_softirq+0x1d4/0x67b
[Thu Feb 10 12:59:06 2022] __irq_exit_rcu+0x100/0x150
[Thu Feb 10 12:59:06 2022] irq_exit_rcu+0xe/0x30
[Thu Feb 10 12:59:06 2022] sysvec_hyperv_stimer0+0x9d/0xc0
...
[Thu Feb 10 12:59:07 2022] Freed by task 58179:
[Thu Feb 10 12:59:07 2022] kasan_save_stack+0x26/0x50
[Thu Feb 10 12:59:07 2022] kasan_set_track+0x25/0x30
[Thu Feb 10 12:59:07 2022] kasan_set_free_info+0x24/0x40
[Thu Feb 10 12:59:07 2022] ____kasan_slab_free+0x137/0x170
[Thu Feb 10 12:59:07 2022] __kasan_slab_free+0x12/0x20
[Thu Feb 10 12:59:07 2022] slab_free_freelist_hook+0xb3/0x1d0
[Thu Feb 10 12:59:07 2022] kfree+0xcd/0x520
[Thu Feb 10 12:59:07 2022] cifs_smb3_do_mount+0x149/0xbe0 [cifs]
[Thu Feb 10 12:59:07 2022] smb3_get_tree+0x1a0/0x2e0 [cifs]
[Thu Feb 10 12:59:07 2022] vfs_get_tree+0x52/0x140
[Thu Feb 10 12:59:07 2022] path_mount+0x635/0x10c0
[Thu Feb 10 12:59:07 2022] __x64_sys_mount+0x1bf/0x210
[Thu Feb 10 12:59:07 2022] do_syscall_64+0x5c/0xc0
[Thu Feb 10 12:59:07 2022] entry_SYSCALL_64_after_hwframe+0x44/0xae
[Thu Feb 10 12:59:07 2022] Last potentially related work creation:
[Thu Feb 10 12:59:07 2022] kasan_save_stack+0x26/0x50
[Thu Feb 10 12:59:07 2022] __kasan_record_aux_stack+0xb6/0xc0
[Thu Feb 10 12:59:07 2022] kasan_record_aux_stack_noalloc+0xb/0x10
[Thu Feb 10 12:59:07 2022] call_rcu+0x76/0x3c0
[Thu Feb 10 12:59:07 2022] cifs_umount+0xce/0xe0 [cifs]
[Thu Feb 10 12:59:07 2022] cifs_kill_sb+0xc8/0xe0 [cifs]
[Thu Feb 10 12:59:07 2022] deactivate_locked_super+0x5d/0xd0
[Thu Feb 10 12:59:07 2022] cifs_smb3_do_mount+0xab9/0xbe0 [cifs]
[Thu Feb 10 12:59:07 2022] smb3_get_tree+0x1a0/0x2e0 [cifs]
[Thu Feb 10 12:59:07 2022] vfs_get_tree+0x52/0x140
[Thu Feb 10 12:59:07 2022] path_mount+0x635/0x10c0
[Thu Feb 10 12:59:07 2022] __x64_sys_mount+0x1bf/0x210
[Thu Feb 10 12:59:07 2022] do_syscall_64+0x5c/0xc0
[Thu Feb 10 12:59:07 2022] entry_SYSCALL_64_after_hwframe+0x44/0xae
Reported-by: Shyam Prasad N <sprasad@microsoft.com> Reviewed-by: Shyam Prasad N <sprasad@microsoft.com> Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com> Signed-off-by: Steve French <stfrench@microsoft.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
When idsfromsid is used we create a special SID for owner/group.
This structure must be initialized or else the first 5 bytes
of the Authority field of the SID will contain uninitialized data
and thus not be a valid SID.
Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com> Signed-off-by: Steve French <stfrench@microsoft.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
If backing file's filesystem has implemented ->fallocate(), we think the
loop device can support discard, then pass sb->s_blocksize as
discard_granularity. However, some underlying FS, such as overlayfs,
doesn't set sb->s_blocksize, and causes discard_granularity to be set as
zero, then the warning in __blkdev_issue_discard() is triggered.
Christoph suggested to pass kstatfs.f_bsize as discard granularity, and
this way is fine because kstatfs.f_bsize means 'Optimal transfer block
size', which still matches with definition of discard granularity.
So fix the issue by setting discard_granularity as kstatfs.f_bsize if it
is available, otherwise claims discard isn't supported.
Cc: Christoph Hellwig <hch@lst.de> Cc: Vivek Goyal <vgoyal@redhat.com> Reported-by: Pei Zhang <pezhang@redhat.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20220126035830.296465-1-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
It appears that a read access to GIC[DR]_I[CS]PENDRn doesn't always
result in the pending interrupts being accurately reported if they are
mapped to a HW interrupt. This is particularily visible when acking
the timer interrupt and reading the GICR_ISPENDR1 register immediately
after, for example (the interrupt appears as not-pending while it really
is...).
This is because a HW interrupt has its 'active and pending state' kept
in the *physical* distributor, and not in the virtual one, as mandated
by the spec (this is what allows the direct deactivation). The virtual
distributor only caries the pending and active *states* (note the
plural, as these are two independent and non-overlapping states).
Fix it by reading the HW state back, either from the timer itself or
from the distributor if necessary.
This is because we started using writeback_inodes_sb() to flush delalloc
when committing a transaction (when using -o flushoncommit), in order to
avoid deadlocks with filesystem freeze operations. This change was made
by commit ce8ea7cc6eb313 ("btrfs: don't call btrfs_start_delalloc_roots
in flushoncommit"). After that change we started producing that warning,
and every now and then a user reports this since the warning happens too
often, it spams dmesg/syslog, and a user is unsure if this reflects any
problem that might compromise the filesystem's reliability.
We can not just lock the sb->s_umount semaphore before calling
writeback_inodes_sb(), because that would at least deadlock with
filesystem freezing, since at fs/super.c:freeze_super() sync_filesystem()
is called while we are holding that semaphore in write mode, and that can
trigger a transaction commit, resulting in a deadlock. It would also
trigger the same type of deadlock in the unmount path. Possibly, it could
also introduce some other locking dependencies that lockdep would report.
To fix this call try_to_writeback_inodes_sb() instead of
writeback_inodes_sb(), because that will try to read lock sb->s_umount
and then will only call writeback_inodes_sb() if it was able to lock it.
This is fine because the cases where it can't read lock sb->s_umount
are during a filesystem unmount or during a filesystem freeze - in those
cases sb->s_umount is write locked and sync_filesystem() is called, which
calls writeback_inodes_sb(). In other words, in all cases where we can't
take a read lock on sb->s_umount, writeback is already being triggered
elsewhere.
An alternative would be to call btrfs_start_delalloc_roots() with a
number of pages different from LONG_MAX, for example matching the number
of delalloc bytes we currently have, in which case we would end up
starting all delalloc with filemap_fdatawrite_wbc() and not with an
async flush via filemap_flush() - that is only possible after the rather
recent commit e076ab2a2ca70a ("btrfs: shrink delalloc pages instead of
full inodes"). However that creates a whole new can of worms due to new
lock dependencies, which lockdep complains, like for example:
[ 8948.247280] ======================================================
[ 8948.247823] WARNING: possible circular locking dependency detected
[ 8948.248353] 5.17.0-rc1-btrfs-next-111 #1 Not tainted
[ 8948.248786] ------------------------------------------------------
[ 8948.249320] kworker/u16:18/933570 is trying to acquire lock:
[ 8948.249812] ffff9b3de1591690 (sb_internal#2){.+.+}-{0:0}, at: find_free_extent+0x141e/0x1590 [btrfs]
[ 8948.250638]
but task is already holding lock:
[ 8948.251140] ffff9b3e09c717d8 (&root->delalloc_mutex){+.+.}-{3:3}, at: start_delalloc_inodes+0x78/0x400 [btrfs]
[ 8948.252018]
which lock already depends on the new lock.
It all comes from the fact that btrfs_start_delalloc_roots() takes the
delalloc_root_mutex, in the transaction commit path we are holding a
read lock on one of the superblock's freeze semaphores (via
sb_start_intwrite()), the async reclaim task can also do a call to
btrfs_start_delalloc_roots(), which ends up triggering writeback with
calls to filemap_fdatawrite_wbc(), resulting in extent allocation which
in turn can call btrfs_start_transaction(), which will result in taking
the freeze semaphore via sb_start_intwrite(), forming a nasty dependency
on all those locks which can be taken in different orders by different
code paths.
So just adopt the simple approach of calling try_to_writeback_inodes_sb()
at btrfs_start_delalloc_flush().
Buttonpads are expected to map the INPUT_PROP_BUTTONPAD property bit
and the BTN_LEFT key bit.
As explained in the specification, where a device has a button type
value of 0 (click-pad) or 1 (pressure-pad) there should not be
discrete buttons:
https://docs.microsoft.com/en-us/windows-hardware/design/component-guidelines/touchpad-windows-precision-touchpad-collection#device-capabilities-feature-report
However, some drivers map the BTN_RIGHT and/or BTN_MIDDLE key bits even
though the device is a buttonpad and therefore does not have those
buttons.
This behavior has forced userspace applications like libinput to
implement different workarounds and quirks to detect buttonpads and
offer to the user the right set of features and configuration options.
For more information:
https://gitlab.freedesktop.org/libinput/libinput/-/merge_requests/726
In order to avoid this issue clear the BTN_RIGHT and BTN_MIDDLE key
bits when the input device is register if the INPUT_PROP_BUTTONPAD
property bit is set.
Notice that this change will not affect udev because it does not check
for buttons. See systemd/src/udev/udev-builtin-input_id.c.
List of known affected hardware:
- Chuwi AeroBook Plus
- Chuwi Gemibook
- Framework Laptop
- GPD Win Max
- Huawei MateBook 2020
- Prestigio Smartbook 141 C2
- Purism Librem 14v1
- StarLite Mk II - AMI firmware
- StarLite Mk II - Coreboot firmware
- StarLite Mk III - AMI firmware
- StarLite Mk III - Coreboot firmware
- StarLabTop Mk IV - AMI firmware
- StarLabTop Mk IV - Coreboot firmware
- StarBook Mk V
Acked-by: Peter Hutterer <peter.hutterer@who-t.net> Acked-by: Benjamin Tissoires <benjamin.tissoires@redhat.com> Acked-by: Jiri Kosina <jkosina@suse.cz> Signed-off-by: José Expósito <jose.exposito89@gmail.com> Link: https://lore.kernel.org/r/20220208174806.17183-1-jose.exposito89@gmail.com Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
The check done by regulator_late_cleanup() to detect whether a regulator
is on was inconsistent with the check done by _regulator_is_enabled().
While _regulator_is_enabled() takes the enable GPIO into account,
regulator_late_cleanup() was not doing that.
This resulted in a false positive, e.g. when a GPIO-controlled fixed
regulator was used, which was not enabled at boot time, e.g.
Such regulator doesn't have an is_enabled() operation. Nevertheless
it's state can be determined based on the enable GPIO. The check in
regulator_late_cleanup() wrongly assumed that the regulator is on and
tried to disable it.
The current rt5682_jack_detect_handler() assumes the component
and card will always show up and implements an infinite usleep
loop waiting for them to show up.
This does not hold true if a codec interrupt (or other
event) occurs when the card is unbound. The codec driver's
remove or shutdown functions cannot cancel the workqueue due
to the wait loop. As a result, code can either end up blocking
the workqueue, or hit a kernel oops when the card is freed.
Fix the issue by rescheduling the jack detect handler in
case the card is not ready. In case card never shows up,
the shutdown/remove/suspend calls can now cancel the detect
task.
Signed-off-by: Kai Vehmanen <kai.vehmanen@linux.intel.com> Reviewed-by: Bard Liao <yung-chuan.liao@linux.intel.com> Reviewed-by: Ranjani Sridharan <ranjani.sridharan@linux.intel.com> Reviewed-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com> Reviewed-by: Péter Ujfalusi <peter.ujfalusi@linux.intel.com> Reviewed-by: Shuming Fan <shumingf@realtek.com> Link: https://lore.kernel.org/r/20220207153000.3452802-3-kai.vehmanen@linux.intel.com Signed-off-by: Mark Brown <broonie@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
The current rt5668_jack_detect_handler() assumes the component
and card will always show up and implements an infinite usleep
loop waiting for them to show up.
This does not hold true if a codec interrupt (or other
event) occurs when the card is unbound. The codec driver's
remove or shutdown functions cannot cancel the workqueue due
to the wait loop. As a result, code can either end up blocking
the workqueue, or hit a kernel oops when the card is freed.
Fix the issue by rescheduling the jack detect handler in
case the card is not ready. In case card never shows up,
the shutdown/remove/suspend calls can now cancel the detect
task.
Signed-off-by: Kai Vehmanen <kai.vehmanen@linux.intel.com> Reviewed-by: Bard Liao <yung-chuan.liao@linux.intel.com> Reviewed-by: Ranjani Sridharan <ranjani.sridharan@linux.intel.com> Reviewed-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com> Reviewed-by: Péter Ujfalusi <peter.ujfalusi@linux.intel.com> Reviewed-by: Shuming Fan <shumingf@realtek.com> Link: https://lore.kernel.org/r/20220207153000.3452802-2-kai.vehmanen@linux.intel.com Signed-off-by: Mark Brown <broonie@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
The current rt5682s_jack_detect_handler() assumes the component
and card will always show up and implements an infinite usleep
loop waiting for them to show up.
This does not hold true if a codec interrupt (or other
event) occurs when the card is unbound. The codec driver's
remove or shutdown functions cannot cancel the workqueue due
to the wait loop. As a result, code can either end up blocking
the workqueue, or hit a kernel oops when the card is freed.
Fix the issue by rescheduling the jack detect handler in
case the card is not ready. In case card never shows up,
the shutdown/remove/suspend calls can now cancel the detect
task.
Signed-off-by: Kai Vehmanen <kai.vehmanen@linux.intel.com> Reviewed-by: Bard Liao <yung-chuan.liao@linux.intel.com> Reviewed-by: Ranjani Sridharan <ranjani.sridharan@linux.intel.com> Reviewed-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com> Reviewed-by: Péter Ujfalusi <peter.ujfalusi@linux.intel.com> Reviewed-by: Shuming Fan <shumingf@realtek.com> Link: https://lore.kernel.org/r/20220207153000.3452802-1-kai.vehmanen@linux.intel.com Signed-off-by: Mark Brown <broonie@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
The CLKT register contains at poweron 0x40, which at our typical 100kHz
bus rate means .64ms. But there is no specified limit to how long devices
should be able to stretch the clocks, so just disable the timeout. We
still have a timeout wrapping the entire transfer.
Signed-off-by: Eric Anholt <eric@anholt.net> Signed-off-by: Stefan Wahren <stefan.wahren@i2se.com> BugLink: https://github.com/raspberrypi/linux/issues/3064 Signed-off-by: Wolfram Sang <wsa@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
In mac80211_hwsim, the probe_req frame is created and sent while
scanning. It is sent with ieee80211_tx_info which is not initialized.
Uninitialized ieee80211_tx_info can cause problems when using
mac80211_hwsim with wmediumd. wmediumd checks the tx_rates field of
ieee80211_tx_info and doesn't relay probe_req frame to other clients
even if it is a broadcasting message.
Call ieee80211_tx_prepare_skb() to initialize ieee80211_tx_info for
the probe_req that is created by hw_scan_work in mac80211_hwsim.
memblock.{reserved,memory}.regions may be allocated using kmalloc() in
memblock_double_array(). Use kfree() to release these kmalloced regions
indicated by memblock_{reserved,memory}_in_slab.
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Fixes: 3010f876500f ("mm: discard memblock data later") Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The tegra186 GPIO driver makes the assumption that the pointer
returned by irq_data_get_irq_chip_data() is a pointer to a
tegra_gpio structure. Unfortunately, it is actually a pointer
to the inner gpio_chip structure, as mandated by the gpiolib
infrastructure. Nice try.
The saving grace is that the gpio_chip is the first member of
tegra_gpio, so the bug has gone undetected since... forever.
Fix it by performing a container_of() on the pointer. This results
in no additional code, and makes it possible to understand how
the whole thing works.
In the current implementation the user may open a virtual tty which then
could fail to establish the underlying DLCI. The function gsmtty_open()
gets stuck in tty_port_block_til_ready() while waiting for a carrier rise.
This happens if the remote side fails to acknowledge the link establishment
request in time or completely. At some point gsm_dlci_close() is called
to abort the link establishment attempt. The function tries to inform the
associated virtual tty by performing a hangup. But the blocking loop within
tty_port_block_til_ready() is not informed about this event.
The patch proposed here fixes this by resetting the initialization state of
the virtual tty to ensure the loop exits and triggering it to make
tty_port_block_til_ready() return.
The function gsm_process_modem() exists to handle modem status bits of
incoming frames. This includes incoming MSC (modem status command) frames
and convergence layer type 2 data frames. The function, however, was only
designed to handle MSC frames as it expects the command length. Within
gsm_dlci_data() it is wrongly assumed that this is the same as the data
frame length. This is only true if the data frame contains only 1 byte of
payload.
This patch names the length parameter of gsm_process_modem() in a generic
manner to reflect its association. It also corrects all calls to the
function to handle the variable number of modem status octets correctly in
both cases.
Fixes: 7263287af93d ("tty: n_gsm: Fixed logic to decode break signal from modem status") Cc: stable@vger.kernel.org Signed-off-by: Daniel Starke <daniel.starke@siemens.com> Link: https://lore.kernel.org/r/20220218073123.2121-6-daniel.starke@siemens.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
tty flow control is handled via gsmtty_throttle() and gsmtty_unthrottle().
Both functions propagate the outgoing hardware flow control state to the
remote side via MSC (modem status command) frames. The local state is taken
from the RTS (ready to send) flag of the tty. However, RTS gets mapped to
DTR (data terminal ready), which is wrong.
This patch corrects this by mapping RTS to RTS.
The here fixed commit made the tty hangup asynchronous to avoid a circular
locking warning. I could not reproduce this warning. Furthermore, due to
the asynchronous hangup the function call now gets queued up while the
underlying tty is being freed. Depending on the timing this results in a
NULL pointer access in the global work queue scheduler. To be precise in
process_one_work(). Therefore, the previous commit made the issue worse
which it tried to fix.
This patch fixes this by falling back to the old behavior which uses a
blocking tty hangup call before freeing up the associated tty.
Fixes: 7030082a7415 ("tty: n_gsm: avoid recursive locking with async port hangup") Cc: stable@vger.kernel.org Signed-off-by: Daniel Starke <daniel.starke@siemens.com> Link: https://lore.kernel.org/r/20220218073123.2121-4-daniel.starke@siemens.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Trying to open a DLCI by sending a SABM frame may fail with a timeout.
The link is closed on the initiator side without informing the responder
about this event. The responder assumes the link is open after sending a
UA frame to answer the SABM frame. The link gets stuck in a half open
state.
This patch fixes this by initiating the proper link termination procedure
after link setup timeout instead of silently closing it down.
n_gsm is based on the 3GPP 07.010 and its newer version is the 3GPP 27.010.
See https://portal.3gpp.org/desktopmodules/Specifications/SpecificationDetails.aspx?specificationId=1516
The changes from 07.010 to 27.010 are non-functional. Therefore, I refer to
the newer 27.010 here. Chapter 5.2.1.2 describes the encoding of the
C/R (command/response) bit. Table 1 shows that the actual encoding of the
C/R bit is inverted if the associated frame is sent by the responder.
The referenced commit fixed here further broke the internal meaning of this
bit in the outgoing path by always setting the C/R bit regardless of the
frame type.
This patch fixes both by setting the C/R bit always consistently for
command (1) and response (0) frames and inverting it later for the
responder where necessary. The meaning of this bit in the debug output
is being preserved and shows the bit as if it was encoded by the initiator.
This reflects only the frame type rather than the encoded combination of
communication side and frame type.
Fixes: cc0f42122a7e ("tty: n_gsm: Modify CR,PF bit when config requester") Cc: stable@vger.kernel.org Signed-off-by: Daniel Starke <daniel.starke@siemens.com> Link: https://lore.kernel.org/r/20220218073123.2121-2-daniel.starke@siemens.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
n_gsm is based on the 3GPP 07.010 and its newer version is the 3GPP 27.010.
See https://portal.3gpp.org/desktopmodules/Specifications/SpecificationDetails.aspx?specificationId=1516
The changes from 07.010 to 27.010 are non-functional. Therefore, I refer to
the newer 27.010 here. Chapter 5.4.6.3.7 describes the encoding of the
control signal octet used by the MSC (modem status command). The same
encoding is also used in convergence layer type 2 as described in chapter
5.5.2. Table 7 and 24 both require the DV (data valid) bit to be set 1 for
outgoing control signal octets sent by the DTE (data terminal equipment),
i.e. for the initiator side.
Currently, the DV bit is only set if CD (carrier detect) is on, regardless
of the side.
This patch fixes this behavior by setting the DV bit on the initiator side
unconditionally.
kernel BUG at include/linux/mm.h:2373!
cpu 0x5d: Vector: 700 (Program Check) at [c00000003c6e76e0]
pc: c000000000581a54: pmd_to_page+0x54/0x80
lr: c00000000058d184: move_hugetlb_page_tables+0x4e4/0x5b0
sp: c00000003c6e7980
msr: 9000000000029033
current = 0xc00000003bd8d980
paca = 0xc000200fff610100 irqmask: 0x03 irq_happened: 0x01
pid = 9349, comm = hugepage-mremap
kernel BUG at include/linux/mm.h:2373!
move_hugetlb_page_tables+0x4e4/0x5b0 (link register)
move_hugetlb_page_tables+0x22c/0x5b0 (unreliable)
move_page_tables+0xdbc/0x1010
move_vma+0x254/0x5f0
sys_mremap+0x7c0/0x900
system_call_exception+0x160/0x2c0
the kernel can't use huge_pte_offset before it set the pte entry because
a page table lookup check for huge PTE bit in the page table to
differentiate between a huge pte entry and a pointer to pte page. A
huge_pte_alloc won't mark the page table entry huge and hence kernel
should not use huge_pte_offset after a huge_pte_alloc.
Link: https://lkml.kernel.org/r/20220211063221.99293-1-aneesh.kumar@linux.ibm.com Fixes: 550a7d60bd5e ("mm, hugepages: add mremap() support for hugepage backed vma") Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: Mina Almasry <almasrymina@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The trace_hardirqs_{on,off}() require the caller to setup frame pointer
properly. This because these two functions use macro 'CALLER_ADDR1' (aka.
__builtin_return_address(1)) to acquire caller info. If the $fp is used
for other purpose, the code generated this macro (as below) could trigger
memory access fault.
The patch below has two "linkcontrol" names causing the duplication.
Fix by using the correct "diag_counters" name on the second instance.
Fixes: 4a7aaf88c89f ("RDMA/qib: Use attributes for the port sysfs") Link: https://lore.kernel.org/r/1645106372-23004-1-git-send-email-mike.marciniszyn@cornelisnetworks.com Cc: <stable@vger.kernel.org> Reviewed-by: Dennis Dalessandro <dennis.dalessandro@cornelisnetworks.com> Signed-off-by: Mike Marciniszyn <mike.marciniszyn@cornelisnetworks.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The interrupt mask is enabled before any potential failure points in
the driver, which can leave a failure path where we exit with
interrupts enabled but the device not live. This causes an infinite
stream of interrupts on an Apple M1 Pro laptop on USB-C.
Add a failure label that's used post enabling interrupts, where we
mask them again before returning an error.
Suggested-by: Sven Peter <sven@svenpeter.dev> Cc: stable <stable@vger.kernel.org> Reviewed-by: Heikki Krogerus <heikki.krogerus@linux.intel.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Link: https://lore.kernel.org/r/e6b80669-20f3-06e7-9ed5-8951a9c6db6f@kernel.dk Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
In rare cases the display is flipped or mirrored. This was observed more
often in a low temperature environment. A clean reset on init_display()
should help to get registers in a sane state.
Fixes: ef8f317795da (staging: fbtft: use init function instead of init sequence) Cc: stable@vger.kernel.org Signed-off-by: Oliver Graute <oliver.graute@kococonnector.com> Link: https://lore.kernel.org/r/20220210085322.15676-1-oliver.graute@kococonnector.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
If the state is not idle then resolve_prepare_src() should immediately
fail and no change to global state should happen. However, it
unconditionally overwrites the src_addr trying to build a temporary any
address.
For instance if the state is already RDMA_CM_LISTEN then this will corrupt
the src_addr and would cause the test in cma_cancel_operation():
if (cma_any_addr(cma_src_addr(id_priv)) && !id_priv->cma_dev)
Which would manifest as this trace from syzkaller:
BUG: KASAN: use-after-free in __list_add_valid+0x93/0xa0 lib/list_debug.c:26
Read of size 8 at addr ffff8881546491e0 by task syz-executor.1/32204
This is indicating that an rdma_id_private was destroyed without doing
cma_cancel_listens().
Instead of trying to re-use the src_addr memory to indirectly create an
any address derived from the dst build one explicitly on the stack and
bind to that as any other normal flow would do. rdma_bind_addr() will copy
it over the src_addr once it knows the state is valid.
This is similar to commit bc0bdc5afaa7 ("RDMA/cma: Do not change
route.addr.src_addr.ss_family")
Link: https://lore.kernel.org/r/0-v2-e975c8fd9ef2+11e-syz_cma_srcaddr_jgg@nvidia.com Cc: stable@vger.kernel.org Fixes: 732d41c545bb ("RDMA/cma: Make the locking for automatic state transition more clear") Reported-by: syzbot+c94a3675a626f6333d74@syzkaller.appspotmail.com Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
There is a big gap between inode_should_defrag() and autodefrag extent
size threshold. For inode_should_defrag() it has a flexible
@small_write value. For compressed extent is 16K, and for non-compressed
extent it's 64K.
However for autodefrag extent size threshold, it's always fixed to the
default value (256K).
This means, the following write sequence will trigger autodefrag to
defrag ranges which didn't trigger autodefrag:
pwrite 0 8k
sync
pwrite 8k 128K
sync
The latter 128K write will also be considered as a defrag target (if
other conditions are met). While only that 8K write is really
triggering autodefrag.
Such behavior can cause extra IO for autodefrag.
Close the gap, by copying the @small_write value into inode_defrag, so
that later autodefrag can use the same @small_write value which
triggered autodefrag.
With the existing transid value, this allows autodefrag really to scan
the ranges which triggered autodefrag.
Although this behavior change is mostly reducing the extent_thresh value
for autodefrag, I believe in the future we should allow users to specify
the autodefrag extent threshold through mount options, but that's an
other problem to consider in the future.
In the rework of btrfs_defrag_file(), we always call
defrag_one_cluster() and increase the offset by cluster size, which is
only 256K.
But there are cases where we have a large extent (e.g. 128M) which
doesn't need to be defragged at all.
Before the refactor, we can directly skip the range, but now we have to
scan that extent map again and again until the cluster moves after the
non-target extent.
Fix the problem by allow defrag_one_cluster() to increase
btrfs_defrag_ctrl::last_scanned to the end of an extent, if and only if
the last extent of the cluster is not a target.
The test script looks like this:
mkfs.btrfs -f $dev > /dev/null
mount $dev $mnt
# As btrfs ioctl uses 32M as extent_threshold
xfs_io -f -c "pwrite 0 64M" $mnt/file1
sync
# Some fragemented range to defrag
xfs_io -s -c "pwrite 65548k 4k" \
-c "pwrite 65544k 4k" \
-c "pwrite 65540k 4k" \
-c "pwrite 65536k 4k" \
$mnt/file1
sync
echo "=== before ==="
xfs_io -c "fiemap -v" $mnt/file1
echo "=== after ==="
btrfs fi defrag $mnt/file1
sync
xfs_io -c "fiemap -v" $mnt/file1
umount $mnt
With extra ftrace put into defrag_one_cluster(), before the patch it
would result tons of loops:
(As defrag_one_cluster() is inlined, the function name is its caller)
Compressed length can be corrupted to be a lot larger than memory
we have allocated for buffer.
This will cause memcpy in copy_compressed_segment to write outside
of allocated memory.
This mostly results in stuck read syscall but sometimes when using
btrfs send can get #GP
From the very beginning of btrfs defrag, there is a check to reject
extents which meet both conditions:
- Physically adjacent
We may want to defrag physically adjacent extents to reduce the number
of extents or the size of subvolume tree.
- Larger than 128K
This may be there for compressed extents, but unfortunately 128K is
exactly the max capacity for compressed extents.
And the check is > 128K, thus it never rejects compressed extents.
Furthermore, the compressed extent capacity bug is fixed by previous
patch, there is no reason for that check anymore.
The original check has a very small ranges to reject (the target extent
size is > 128K, and default extent threshold is 256K), and for
compressed extent it doesn't work at all.
So it's better just to remove the rejection, and allow us to defrag
physically adjacent extents.
CC: stable@vger.kernel.org # 5.16 Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[BUG]
For compressed extents, defrag ioctl will always try to defrag any
compressed extents, wasting not only IO but also CPU time to
compress/decompress:
Then it shows the 2 128K extents just get COW for no extra benefit, with
extra IO/CPU spent:
=== before ===
/mnt/btrfs/file1:
EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS
0: [0..255]: 26624..26879 256 0x8
1: [256..511]: 26632..26887 256 0x9
=== after ===
/mnt/btrfs/file1:
EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS
0: [0..255]: 26640..26895 256 0x8
1: [256..511]: 26648..26903 256 0x9
This affects not only v5.16 (after the defrag rework), but also v5.15
(before the defrag rework).
[CAUSE]
From the very beginning, btrfs defrag never checks if one extent is
already at its max capacity (128K for compressed extents, 128M
otherwise).
And the default extent size threshold is 256K, which is already beyond
the compressed extent max size.
This means, by default btrfs defrag ioctl will mark all compressed
extent which is not adjacent to a hole/preallocated range for defrag.
[FIX]
Introduce a helper to grab the maximum extent size, and then in
defrag_collect_targets() and defrag_check_next_extent(), reject extents
which are already at their max capacity.
Reported-by: Filipe Manana <fdmanana@suse.com> CC: stable@vger.kernel.org # 5.16 Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[BUG]
With older kernels (before v5.16), btrfs will defrag preallocated extents.
While with newer kernels (v5.16 and newer) btrfs will not defrag
preallocated extents, but it will defrag the extent just before the
preallocated extent, even it's just a single sector.
This can be exposed by the following small script:
Which defrags the single sector along with the preallocated extent, and
replace them with an regular extent into a new location (caused by data
COW).
This wastes most of the data IO just for the preallocated range.
The preallocated range is not defragged, but the sector before it still
gets defragged, which has no need for it.
[CAUSE]
One of the function reused by the old and new behavior is
defrag_check_next_extent(), it will determine if we should defrag
current extent by checking the next one.
It only checks if the next extent is a hole or inlined, but it doesn't
check if it's preallocated.
On the other hand, out of the function, both old and new kernel will
reject preallocated extents.
Such inconsistent behavior causes above behavior.
[FIX]
- Also check if next extent is preallocated
If so, don't defrag current extent.
- Add comments for each branch why we reject the extent
This will reduce the IO caused by defrag ioctl and autodefrag.
CC: stable@vger.kernel.org # 5.16 Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>