Adjacent changes: e3f02f32a050 ("ionic: fix kernel panic due to multi-buffer handling") d9c04209990b ("ionic: Mark error paths in the data path as unlikely")
Jakub Kicinski [Thu, 27 Jun 2024 18:57:40 +0000 (11:57 -0700)]
Merge tag 'wireless-2024-06-27' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless
Johannes Berg says:
====================
Just a few changes:
- maintainers: Larry Finger sadly passed away
- maintainers: ath trees are in their group now
- TXQ FQ quantum configuration fix
- TI wl driver: work around stuck FW in AP mode
- mac80211: disable softirqs in some new code
needing that
* tag 'wireless-2024-06-27' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless:
MAINTAINERS: wifi: update ath.git location
MAINTAINERS: Remembering Larry Finger
wifi: mac80211: disable softirqs for queued frame handling
wifi: cfg80211: restrict NL80211_ATTR_TXQ_QUANTUM values
wifi: wlcore: fix wlcore AP mode
====================
Linus Torvalds [Thu, 27 Jun 2024 17:05:35 +0000 (10:05 -0700)]
Merge tag 'net-6.10-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Paolo Abeni:
"Including fixes from can, bpf and netfilter.
There are a bunch of regressions addressed here, but hopefully nothing
spectacular. We are still waiting the driver fix from Intel, mentioned
by Jakub in the previous networking pull.
Current release - regressions:
- core: add softirq safety to netdev_rename_lock
- tcp: fix tcp_rcv_fastopen_synack() to enter TCP_CA_Loss for failed
TFO
- batman-adv: fix RCU race at module unload time
Previous releases - regressions:
- openvswitch: get related ct labels from its master if it is not
confirmed
- eth: mlxsw: fix memory corruptions on spectrum-4 systems
- eth: ionic: use dev_consume_skb_any outside of napi
Previous releases - always broken:
- netfilter: fully validate NFT_DATA_VALUE on store to data registers
- unix: several fixes for OoB data
- tcp: fix race for duplicate reqsk on identical SYN
- bpf:
- fix may_goto with negative offset
- fix the corner case with may_goto and jump to the 1st insn
- fix overrunning reservations in ringbuf
- can:
- j1939: recover socket queue on CAN bus error during BAM
transmission
- mcp251xfd: fix infinite loop when xmit fails
- dsa: microchip: monitor potential faults in half-duplex mode
- eth: vxlan: pull inner IP header in vxlan_xmit_one()
- eth: ionic: fix kernel panic due to multi-buffer handling
Misc:
- selftest: unix tests refactor and a lot of new cases added"
* tag 'net-6.10-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (61 commits)
net: mana: Fix possible double free in error handling path
selftest: af_unix: Check SIOCATMARK after every send()/recv() in msg_oob.c.
af_unix: Fix wrong ioctl(SIOCATMARK) when consumed OOB skb is at the head.
selftest: af_unix: Check EPOLLPRI after every send()/recv() in msg_oob.c
selftest: af_unix: Check SIGURG after every send() in msg_oob.c
selftest: af_unix: Add SO_OOBINLINE test cases in msg_oob.c
af_unix: Don't stop recv() at consumed ex-OOB skb.
selftest: af_unix: Add non-TCP-compliant test cases in msg_oob.c.
af_unix: Don't stop recv(MSG_DONTWAIT) if consumed OOB skb is at the head.
af_unix: Stop recv(MSG_PEEK) at consumed OOB skb.
selftest: af_unix: Add msg_oob.c.
selftest: af_unix: Remove test_unix_oob.c.
tracing/net_sched: NULL pointer dereference in perf_trace_qdisc_reset()
netfilter: nf_tables: fully validate NFT_DATA_VALUE on store to data registers
net: usb: qmi_wwan: add Telit FN912 compositions
tcp: fix tcp_rcv_fastopen_synack() to enter TCP_CA_Loss for failed TFO
ionic: use dev_consume_skb_any outside of napi
net: dsa: microchip: fix wrong register write when masking interrupt
Fix race for duplicate reqsk on identical SYN
ibmvnic: Add tx check to prevent skb leak
...
Linus Torvalds [Thu, 27 Jun 2024 16:34:09 +0000 (09:34 -0700)]
Merge tag 'sound-6.10-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound
Pull sound fixes from Takashi Iwai:
"This became bigger than usual, as it receives a pile of pending ASoC
fixes. Most of changes are for device-specific issues while there are
a few core fixes that are all rather trivial:
- DMA-engine sync fixes
- Continued MIDI2 conversion fixes
- Various ASoC Intel SOF fixes
- A series of ASoC topology fixes for memory handling
- AMD ACP fix, curing a recent regression, too
- Platform / codec-specific fixes for mediatek, atmel, realtek, etc"
* tag 'sound-6.10-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound: (40 commits)
ASoC: rt5645: fix issue of random interrupt from push-button
ALSA: seq: Fix missing MSB in MIDI2 SPP conversion
ASoC: amd: yc: Fix non-functional mic on ASUS M5602RA
ALSA: hda/realtek: fix mute/micmute LEDs don't work for EliteBook 645/665 G11.
ALSA: hda/realtek: Fix conflicting quirk for PCI SSID 17aa:3820
ALSA: dmaengine_pcm: terminate dmaengine before synchronize
ALSA: hda/relatek: Enable Mute LED on HP Laptop 15-gw0xxx
ALSA: PCM: Allow resume only for suspended streams
ALSA: seq: Fix missing channel at encoding RPN/NRPN MIDI2 messages
ASoC: mediatek: mt8195: Add platform entry for ETDM1_OUT_BE dai link
ASoC: fsl-asoc-card: set priv->pdev before using it
ASoC: amd: acp: move chip->flag variable assignment
ASoC: amd: acp: remove i2s configuration check in acp_i2s_probe()
ASoC: amd: acp: add a null check for chip_pdev structure
ASoC: Intel: soc-acpi: mtl: fix speaker no sound on Dell SKU 0C64
ASoC: q6apm-lpass-dai: close graph on prepare errors
ASoC: cs35l56: Disconnect ASP1 TX sources when ASP1 DAI is hooked up
ASoC: topology: Fix route memory corruption
ASoC: rt722-sdca-sdw: add debounce time for type detection
ASoC: SOF: sof-audio: Skip unprepare for in-use widgets on error rollback
...
Paolo Abeni [Thu, 27 Jun 2024 11:00:50 +0000 (13:00 +0200)]
Merge tag 'nf-24-06-27' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf
Pablo Neira Ayuso says:
====================
Netfilter fixes for net
The following patchset contains two Netfilter fixes for net:
Patch #1 fixes CONFIG_SYSCTL=n for a patch coming in the previous PR
to move the sysctl toggle to enable SRv6 netfilter hooks from
nf_conntrack to the core, from Jianguo Wu.
Patch #2 fixes a possible pointer leak to userspace due to insufficient
validation of NFT_DATA_VALUE.
Linus found this pointer leak to userspace via zdi-disclosures@ and
forwarded the notice to Netfilter maintainers, he appears as reporter
because whoever found this issue never approached Netfilter
maintainers neither via security@ nor in private.
netfilter pull request 24-06-27
* tag 'nf-24-06-27' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
netfilter: nf_tables: fully validate NFT_DATA_VALUE on store to data registers
netfilter: fix undefined reference to 'netfilter_lwtunnel_*' when CONFIG_SYSCTL=n
====================
Ma Ke [Tue, 25 Jun 2024 13:03:14 +0000 (21:03 +0800)]
net: mana: Fix possible double free in error handling path
When auxiliary_device_add() returns error and then calls
auxiliary_device_uninit(), callback function adev_release
calls kfree(madev). We shouldn't call kfree(madev) again
in the error handling path. Set 'madev' to NULL.
af_unix: Fix wrong ioctl(SIOCATMARK) when consumed OOB skb is at the head.
Even if OOB data is recv()ed, ioctl(SIOCATMARK) must return 1 when the
OOB skb is at the head of the receive queue and no new OOB data is queued.
Without fix:
# RUN msg_oob.no_peek.oob ...
# msg_oob.c:305:oob:Expected answ[0] (0) == oob_head (1)
# oob: Test terminated by assertion
# FAIL msg_oob.no_peek.oob
not ok 2 msg_oob.no_peek.oob
With fix:
# RUN msg_oob.no_peek.oob ...
# OK msg_oob.no_peek.oob
ok 2 msg_oob.no_peek.oob
selftest: af_unix: Check SIGURG after every send() in msg_oob.c
When data is sent with MSG_OOB, SIGURG is sent to a process if the
receiver socket has set its owner to the process by ioctl(FIOSETOWN)
or fcntl(F_SETOWN).
This patch adds SIGURG check after every send(MSG_OOB) call.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Now the consumed OOB skb stays at the head of recvq to return a correct
value for ioctl(SIOCATMARK), which is broken now and fixed by a later
patch.
Then, if peer issues recv() with MSG_DONTWAIT, manage_oob() returns NULL,
so recv() ends up with -EAGAIN.
>>> c2.setblocking(False) # This causes -EAGAIN even with available data
>>> c2.recv(5)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
BlockingIOError: [Errno 11] Resource temporarily unavailable
However, next recv() will return the following available data, "world".
>>> c2.recv(5)
b'world'
When the consumed OOB skb is at the head of the queue, we need to fetch
the next skb to fix the weird behaviour.
Note that the issue does not happen without MSG_DONTWAIT because we can
retry after manage_oob().
This patch also adds a test case that covers the issue.
Without fix:
# RUN msg_oob.no_peek.ex_oob_break ...
# msg_oob.c:134:ex_oob_break:AF_UNIX :Resource temporarily unavailable
# msg_oob.c:135:ex_oob_break:Expected:ld
# msg_oob.c:137:ex_oob_break:Expected ret[0] (-1) == expected_len (2)
# ex_oob_break: Test terminated by assertion
# FAIL msg_oob.no_peek.ex_oob_break
not ok 8 msg_oob.no_peek.ex_oob_break
With fix:
# RUN msg_oob.no_peek.ex_oob_break ...
# OK msg_oob.no_peek.ex_oob_break
ok 8 msg_oob.no_peek.ex_oob_break
Yunseong Kim [Mon, 24 Jun 2024 17:33:23 +0000 (02:33 +0900)]
tracing/net_sched: NULL pointer dereference in perf_trace_qdisc_reset()
In the TRACE_EVENT(qdisc_reset) NULL dereference occurred from
qdisc->dev_queue->dev <NULL> ->name
This situation simulated from bunch of veths and Bluetooth disconnection
and reconnection.
During qdisc initialization, qdisc was being set to noop_queue.
In veth_init_queue, the initial tx_num was reduced back to one,
causing the qdisc reset to be called with noop, which led to the kernel
panic.
I've attached the GitHub gist link that C converted syz-execprogram
source code and 3 log of reproduced vmcore-dmesg.
However, without our patch applied, I tested upstream 6.10.0-rc3 kernel
using the qdisc_reset event and the ip command on my qemu virtual machine.
This 2 commands makes always kernel panic.
Linux version: 6.10.0-rc3
[ 0.000000] Linux version 6.10.0-rc3-00164-g44ef20baed8e-dirty
(paran@fedora) (gcc (GCC) 14.1.1 20240522 (Red Hat 14.1.1-4), GNU ld
version 2.41-34.fc40) #20 SMP PREEMPT Sat Jun 15 16:51:25 KST 2024
Yeoreum and I don't know if the patch we wrote will fix the underlying
cause, but we think that priority is to prevent kernel panic happening.
So, we're sending this patch.
====================
Series to deliver Ethernet for STM32MP25
STM32MP25 is STM32 SOC with 2 GMACs instances.
GMAC IP version is SNPS 5.3x.
GMAC IP configure with 2 RX and 4 TX queue.
DMA HW capability register supported
RX Checksum Offload Engine supported
TX Checksum insertion supported
Wake-Up On Lan supported
TSO supported
====================
net: stmmac: dwmac-stm32: stm32: add management of stm32mp25 for stm32
Add Ethernet support for STM32MP25.
STM32MP25 is STM32 SOC with 2 GMACs instances.
GMAC IP version is SNPS 5.3x.
GMAC IP configure with 2 RX and 4 TX queue.
DMA HW capability register supported
RX Checksum Offload Engine supported
TX Checksum insertion supported
Wake-Up On Lan supported
TSO supported
Signed-off-by: Christophe Roullier <christophe.roullier@foss.st.com> Reviewed-by: Simon Horman <horms@kernel.org> Reviewed-by: Marek Vasut <marex@denx.de> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
====================
selftests: drv-net: rss_ctx: add tests for RSS contexts
Add a few tests exercising RSS context API.
In addition to basic sanity checks, tests add RSS contexts,
n-tuple rule to direct traffic to them (based on dst port),
and qstats to make sure traffic landed where we expected.
v2 adds a test for removing contexts out of order. When testing
bnxt - either the new test or running more tests after the overlap
test makes the device act strangely. To the point where it may start
giving out ntuple IDs of 0 for all rules..
$ export NETIF=eth0 REMOTE_...
$ ./drivers/net/hw/rss_ctx.py
KTAP version 1
1..8
ok 1 rss_ctx.test_rss_key_indir
ok 2 rss_ctx.test_rss_context
ok 3 rss_ctx.test_rss_context4
# Increasing queue count 44 -> 66
# Failed to create context 32, trying to test what we got
ok 4 rss_ctx.test_rss_context32 # SKIP Tested only 31 contexts, wanted 32
ok 5 rss_ctx.test_rss_context_overlap
ok 6 rss_ctx.test_rss_context_overlap2
# .. sprays traffic like a headless chicken ..
not ok 7 rss_ctx.test_rss_context_out_of_order
ok 8 rss_ctx.test_rss_context4_create_with_cfg
# Totals: pass:6 fail:1 xfail:0 xpass:0 skip:1 error:0
Jakub Kicinski [Wed, 26 Jun 2024 01:24:56 +0000 (18:24 -0700)]
selftests: drv-net: rss_ctx: add tests for RSS configuration and contexts
Add tests focusing on indirection table configuration and
creating extra RSS contexts in drivers which support it.
$ export NETIF=eth0 REMOTE_...
$ ./drivers/net/hw/rss_ctx.py
KTAP version 1
1..8
ok 1 rss_ctx.test_rss_key_indir
ok 2 rss_ctx.test_rss_context
ok 3 rss_ctx.test_rss_context4
# Increasing queue count 44 -> 66
# Failed to create context 32, trying to test what we got
ok 4 rss_ctx.test_rss_context32 # SKIP Tested only 31 contexts, wanted 32
ok 5 rss_ctx.test_rss_context_overlap
ok 6 rss_ctx.test_rss_context_overlap2
# .. sprays traffic like a headless chicken ..
not ok 7 rss_ctx.test_rss_context_out_of_order
ok 8 rss_ctx.test_rss_context4_create_with_cfg
# Totals: pass:6 fail:1 xfail:0 xpass:0 skip:1 error:0
Note that rss_ctx.test_rss_context_out_of_order fails with the device
I tested with, but it seems to be a device / driver bug.
Jakub Kicinski [Wed, 26 Jun 2024 01:24:55 +0000 (18:24 -0700)]
selftests: drv-net: add ability to wait for at least N packets to load gen
Teach the load generator how to wait for at least given number
of packets to be received. This will be useful for filtering
where we'll want to send a non-trivial number of packets and
make sure they landed in right queues.
Reviewed-by: Breno Leitao <leitao@debian.org> Reviewed-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://patch.msgid.link/20240626012456.2326192-4-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Wed, 26 Jun 2024 01:24:53 +0000 (18:24 -0700)]
selftests: drv-net: try to check if port is in use
We use random ports for communication. As Willem predicted
this leads to occasional failures. Try to check if port is
already in use by opening a socket and binding to that port.
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Reviewed-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://patch.msgid.link/20240626012456.2326192-2-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Linus Torvalds [Thu, 27 Jun 2024 00:51:39 +0000 (17:51 -0700)]
Merge tag 'mm-hotfixes-stable-2024-06-26-17-28' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull misc fixes from Andrew Morton:
"13 hotfixes, 7 are cc:stable.
All are MM related apart from a MAINTAINERS update. There is no
identifiable theme here - just singleton patches in various places"
* tag 'mm-hotfixes-stable-2024-06-26-17-28' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
mm/memory: don't require head page for do_set_pmd()
mm/page_alloc: Separate THP PCP into movable and non-movable categories
nfs: drop the incorrect assertion in nfs_swap_rw()
mm/migrate: make migrate_pages_batch() stats consistent
MAINTAINERS: TPM DEVICE DRIVER: update the W-tag
selftests/mm:fix test_prctl_fork_exec return failure
mm: convert page type macros to enum
ocfs2: fix DIO failure due to insufficient transaction credits
kasan: fix bad call to unpoison_slab_object
mm: handle profiling for fake memory allocations during compaction
mm/slab: fix 'variable obj_exts set but not used' warning
/proc/pid/smaps: add mseal info for vma
mm: fix incorrect vbq reference in purge_fragmented_block
netfilter: nf_tables: fully validate NFT_DATA_VALUE on store to data registers
register store validation for NFT_DATA_VALUE is conditional, however,
the datatype is always either NFT_DATA_VALUE or NFT_DATA_VERDICT. This
only requires a new helper function to infer the register type from the
set datatype so this conditional check can be removed. Otherwise,
pointer to chain object can be leaked through the registers.
Linus Torvalds [Wed, 26 Jun 2024 22:01:33 +0000 (15:01 -0700)]
Merge tag 'wq-for-6.10-rc5-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq
Pull workqueue fixes from Tejun Heo:
"Two patches to fix kworker name formatting"
* tag 'wq-for-6.10-rc5-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
workqueue: Increase worker desc's length to 32
workqueue: Refactor worker ID formatting and make wq_worker_comm() use full ID string
Takashi Iwai [Wed, 26 Jun 2024 20:02:55 +0000 (22:02 +0200)]
Merge tag 'asoc-fix-v6.10-rc5' of https://git.kernel.org/pub/scm/linux/kernel/git/broonie/sound into for-linus
ASoC: Fixes for v6.10
A relatively large batch of updates, largely due to the long interval
since I last sent fixes due to various travel and holidays. There's a
lot of driver specific fixes and quirks in here, none of them too major,
and also some fixes for recently introduced memory safety issues in the
topology code.
Kalle Valo [Wed, 26 Jun 2024 10:26:32 +0000 (13:26 +0300)]
MAINTAINERS: wifi: update ath.git location
ath.git tree has moved to a new location. The old location will be an alias to
the new location and will work at least until end of 2024, but best to update
git trees already now.
Kalle Valo [Tue, 25 Jun 2024 10:39:29 +0000 (13:39 +0300)]
MAINTAINERS: Remembering Larry Finger
We got sad news that Larry is not with us anymore. He was a long time
Linux developer, his first commit was back in 2005 and he has
maintained several wireless drivers over the years. He was known for
patiently supporting Linux users with all sorts of problems they had.
Larry's work helped so many people around the world and I always
enjoyed working with him, even though I sadly never met him.
Takashi Iwai [Wed, 26 Jun 2024 14:51:13 +0000 (16:51 +0200)]
ALSA: seq: Fix missing MSB in MIDI2 SPP conversion
The conversion of SPP to MIDI2 UMP called a wrong function, and the
secondary argument wasn't taken. As a result, MSB of SPP was always
zero. Fix to call the right function.
====================
mlxsw: Reduce memory footprint of mlxsw driver
Amit Cohen writes:
A previous patch-set used page pool to allocate buffers, to simplify the
change, we first used one continuous buffer, which was allocated with
order > 0. This set improves page pool usage to allocate the exact number
of pages which are required for packet.
This change requires using fragmented SKB, till now all the buffer was in
the linear part. Note that 'skb->truesize' is decreased for small packets.
This set significantly reduces memory consumption of mlxsw driver. The
footprint is reduced by 26%.
Patch set overview:
Patch #1 calculates number of scatter/gather entries and stores the value
Patch #2 converts the driver to use fragmented buffers
====================
Amit Cohen [Tue, 25 Jun 2024 13:47:35 +0000 (15:47 +0200)]
mlxsw: pci: Use fragmented buffers
WQE (Work Queue Element) includes 3 scatter/gather entries for buffers.
The buffer can be split into 3 parts, software should set address and byte
count of each part.
A previous patch-set used page pool to allocate buffers, to simplify the
change, we first used one continuous buffer, which was allocated with
order > 0. This patch improves page pool usage to allocate the exact
number of pages which are required for packet.
As part of init, fill WQE.address[x] and WQE.byte_count* with pages which
are allocated from the pool. Fill x entries according to number of
scatter/gather entries which are required for maximum packet size. When a
packet is received, check the actual size and replace only the used pages.
Save bytes for software overhead only as part of the first entry.
This change also requires using fragmented SKB, till now all the buffer
was in the linear part. Note that 'skb->truesize' is decreased for small
packets.
For now the maximum buffer size is 3 * PAGE_SIZE which is enough, in
case that the driver will support larger MTU, we can use 'order' to
allocate more than one page per scatter/gather entry.
This change significantly reduces memory consumption of mlxsw driver. The
footprint is reduced by 26%.
Amit Cohen [Tue, 25 Jun 2024 13:47:34 +0000 (15:47 +0200)]
mlxsw: pci: Store number of scatter/gather entries for maximum packet size
A previous patch-set used page pool for Rx buffers allocations. To
simplify the change, we first used page pool for one allocation per
packet - one continuous buffer is allocated for each packet. This can be
improved by using fragmented buffers, then memory consumption will be
significantly reduced.
WQE (Work Queue Element) includes up to 3 scatter/gather entries for
data. As preparation for fragmented buffer usage, calculate number of
scatter/gather entries which are required for packet according to
maximum MTU and store it for future use. For now use PAGE_SIZE for each
entry, which means that maximum buffer size is 3 * PAGE_SIZE. This is
enough for the maximum MTU which is supported in the driver now (10K).
Warn in an unlikely case of maximum MTU which requires more than 3 pages,
for now this warn should not happen with standard page size (>=4K) and
maximum MTU (10K).
net: Drop explicit initialization of struct i2c_device_id::driver_data to 0
These drivers don't use the driver_data member of struct i2c_device_id,
so don't explicitly initialize this member.
This prepares putting driver_data in an anonymous union which requires
either no initialization or named designators. But it's also a nice
cleanup on its own.
While add it, also remove commas after the sentinel entries.
Signed-off-by: Uwe Kleine-König <u.kleine-koenig@baylibre.com> Reviewed-by: Petr Machata <petrm@nvidia.com> # For mlxsw Reviewed-by: Kory Maincent <Kory.maincent@bootlin.com> Reviewed-by: Jeremy Kerr <jk@codeconstruct.com.au> # for mctp-i2c Reviewed-by: Oleksij Rempel <o.rempel@pengutronix.de> Acked-by: Oleksij Rempel <o.rempel@pengutronix.de> Link: https://patch.msgid.link/20240625083853.2205977-2-u.kleine-koenig@baylibre.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Johannes Berg [Wed, 26 Jun 2024 07:15:59 +0000 (09:15 +0200)]
wifi: mac80211: disable softirqs for queued frame handling
As noticed by syzbot, calling ieee80211_handle_queued_frames()
(and actually handling frames there) requires softirqs to be
disabled, since we call into the RX code. Fix that in the case
of cleaning up frames left over during shutdown.
Takashi Iwai [Tue, 25 Jun 2024 15:52:12 +0000 (17:52 +0200)]
ALSA: hda/realtek: Fix conflicting quirk for PCI SSID 17aa:3820
The recent fix for Lenovo IdeaPad 330-17IKB replaced the quirk entry,
and this eventually breaks the existing quirk for Lenovo Yoga Duet 7
13ITL6 equipped with the same PCI SSID 17aa:3820.
For applying a proper quirk for each model, check the codec SSID
additionally. Fortunately Yoga Duet has a different codec SSID,
0x17aa3802.
(Interestingly, 17aa:3802 has another conflict of SSID between another
Yoga model vs 14IRP8 which we had to work around similarly.)
====================
add ethernet driver for Tehuti Networks TN40xx chips
This patchset adds a new 10G ethernet driver for Tehuti Networks
TN40xx chips. Note in mainline, there is a driver for Tehuti Networks
(drivers/net/ethernet/tehuti/tehuti.[hc]), which supports TN30xx
chips.
Multiple vendors (DLink, Asus, Edimax, QNAP, etc) developed adapters
based on TN40xx chips. Tehuti Networks went out of business but the
drivers are still distributed under GPL2 with some of the hardware
(and also available on some sites). With some changes, I try to
upstream this driver with a new PHY driver in Rust.
The major change is replacing the PHY abstraction layer in the original
driver with phylink. TN40xx chips are used with various PHY hardware
(AMCC QT2025, TI TLK10232, Aqrate AQR105, and Marvell MV88X3120,
MV88X3310, and MV88E2010).
I've also been working on a new PHY driver for QT2025 in Rust [1]. For
now, I enable only adapters using QT2025 PHY in the PCI ID table of
this driver. I've tested this driver and the QT2025 PHY driver with
Edimax EN-9320 10G adapter and 10G-SR SFP+. In mainline, there are PHY
drivers for AQR105 and Marvell PHYs, which could work for some TN40xx
adapters with this driver.
To make reviewing easier, this patchset has only basic functions. Once
merged, I'll submit features like ethtool support.
FUJITA Tomonori [Sun, 23 Jun 2024 23:55:07 +0000 (08:55 +0900)]
net: tn40xx: add phylink support
This patch adds supports for multiple PHY hardware with phylink. The
adapters with TN40xx chips use multiple PHY hardware; AMCC QT2025, TI
TLK10232, Aqrate AQR105, and Marvell 88X3120, 88X3310, and MV88E2010.
For now, the PCI ID table of this driver enables adapters using only
QT2025 PHY. I've tested this driver and the QT2025 PHY driver (SFP+
10G SR) with Edimax EN-9320 10G adapter.
Signed-off-by: FUJITA Tomonori <fujita.tomonori@gmail.com> Reviewed-by: Hans-Frieder Vogt <hfdevel@gmx.net> Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Link: https://patch.msgid.link/20240623235507.108147-8-fujita.tomonori@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
FUJITA Tomonori [Sun, 23 Jun 2024 23:55:05 +0000 (08:55 +0900)]
net: tn40xx: add basic Rx handling
This patch adds basic Rx handling. The Rx logic uses three major data
structures; two ring buffers with NIC and one database. One ring
buffer is used to send information to NIC about memory to be stored
packets to be received. The other is used to get information from NIC
about received packets. The database is used to keep the information
about DMA mapping. After a packet arrived, the db is used to pass the
packet to the network stack.
FUJITA Tomonori [Sun, 23 Jun 2024 23:55:04 +0000 (08:55 +0900)]
net: tn40xx: add basic Tx handling
This patch adds device specific structures to initialize the hardware
with basic Tx handling. The original driver loads the embedded
firmware in the header file. This driver is implemented to use the
firmware APIs.
The Tx logic uses three major data structures; two ring buffers with
NIC and one database. One ring buffer is used to send information
about packets to be sent for NIC. The other is used to get information
from NIC about packet that are sent. The database is used to keep the
information about DMA mapping. After a packet is sent, the db is used
to free the resource used for the packet.
FUJITA Tomonori [Sun, 23 Jun 2024 23:55:01 +0000 (08:55 +0900)]
PCI: Add Edimax Vendor ID to pci_ids.h
Add the Edimax Vendor ID (0x1432) for an ethernet driver for Tehuti
Networks TN40xx chips. This ID can be used for Realtek 8180 and Ralink
rt28xx wireless drivers.
Jakub Kicinski [Wed, 26 Jun 2024 00:48:35 +0000 (17:48 -0700)]
Merge branch 'gve-add-flow-steering-support'
Ziwei Xiao says:
====================
gve: Add flow steering support
To support flow steering in GVE driver, there are two adminq changes
need to be made in advance.
The first one is adding adminq mutex lock, which is to allow the
incoming flow steering operations to be able to temporarily drop the
rtnl_lock to reduce the latency for registering flow rules among
several NICs at the same time. This could be achieved by the future
changes to reduce the drivers' dependencies on the rtnl lock for
particular ethtool ops.
The second one is to add the extended adminq command so that we can
support larger adminq command such as configure_flow_rule command. In
that patch, there is a new added function called
gve_adminq_execute_extended_cmd with the attribute of __maybe_unused.
That attribute will be removed in the third patch of this series where
it will use the previously unused function.
And the other three patches are needed for the actual flow steering
feature support in driver.
====================
Jeroen de Borst [Tue, 25 Jun 2024 00:12:31 +0000 (00:12 +0000)]
gve: Add flow steering ethtool support
Implement the ethtool commands that can be used to configure and query
flow-steering rules.
A large part of this change consists of translating the ethtool
representation of 'ntuples' to our internal gve_flow_rule and vice-versa
in the new created gve_flow_rule.c
Considering the possible large amount of flow rules, the driver doesn't
store all the rules locally. When the user runs 'ethtool -n <nic>' to
check the registered rules, the driver will send adminq command to
query a limited amount of rules/rule ids(that filled in a 4096 bytes dma
memory) at a time as a cache for the ethtool queries. The adminq query
commands will be repeated for several times until the ethtool has
queried all the needed rules.
Signed-off-by: Jeroen de Borst <jeroendb@google.com> Co-developed-by: Ziwei Xiao <ziweixiao@google.com> Signed-off-by: Ziwei Xiao <ziweixiao@google.com> Reviewed-by: Praveen Kaligineedi <pkaligineedi@google.com> Reviewed-by: Harshitha Ramamurthy <hramamurthy@google.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://patch.msgid.link/20240625001232.1476315-6-ziweixiao@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jeroen de Borst [Tue, 25 Jun 2024 00:12:30 +0000 (00:12 +0000)]
gve: Add flow steering adminq commands
Add new adminq commands for the driver to configure and query flow rules
that are stored in the device. Flow steering rules are assigned with a
location that determines the relative order of the rules.
Flow rules can run up to an order of millions. In such cases, storing
a full copy of the rules in the driver to prepare for the ethtool query
is infeasible while querying them from the device is better. That needs
to be optimized too so that we don't send a lot of adminq commands. The
solution here is to store a limited number of rules/rule ids in the
driver in a cache. Use dma_pool to allocate 4k bytes which lets device
write at most 46 flow rules(4096/88) or 1024 rule ids(4096/4) at a time.
For configuring flow rules, there are 3 sub-commands:
- ADD which adds a rule at the location supplied
- DEL which deletes the rule at the location supplied
- RESET which clears all currently active rules in the device
For querying flow rules, there are also 3 sub-commands:
- QUERY_RULES corresponds to ETHTOOL_GRXCLSRULE. It fills the rules in
the allocated cache after querying the device
- QUERY_RULES_IDS corresponds to ETHTOOL_GRXCLSRLALL. It fills the
rule_ids in the allocated cache after querying the device
- QUERY_RULES_STATS corresponds to ETHTOOL_GRXCLSRLCNT. It queries the
device's current flow rule number and the supported max flow rule
limit
Signed-off-by: Jeroen de Borst <jeroendb@google.com> Co-developed-by: Ziwei Xiao <ziweixiao@google.com> Signed-off-by: Ziwei Xiao <ziweixiao@google.com> Reviewed-by: Praveen Kaligineedi <pkaligineedi@google.com> Reviewed-by: Harshitha Ramamurthy <hramamurthy@google.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://patch.msgid.link/20240625001232.1476315-5-ziweixiao@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jeroen de Borst [Tue, 25 Jun 2024 00:12:29 +0000 (00:12 +0000)]
gve: Add flow steering device option
Add a new device option to signal to the driver that the device supports
flow steering. This device option also carries the maximum number of
flow steering rules that the device can store.
Signed-off-by: Jeroen de Borst <jeroendb@google.com> Co-developed-by: Ziwei Xiao <ziweixiao@google.com> Signed-off-by: Ziwei Xiao <ziweixiao@google.com> Reviewed-by: Praveen Kaligineedi <pkaligineedi@google.com> Reviewed-by: Harshitha Ramamurthy <hramamurthy@google.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://patch.msgid.link/20240625001232.1476315-4-ziweixiao@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jeroen de Borst [Tue, 25 Jun 2024 00:12:28 +0000 (00:12 +0000)]
gve: Add adminq extended command
The adminq command is limited to 64 bytes per entry and it's 56 bytes
for the command itself at maximum. To support larger commands, we need
to dma_alloc a separate memory to put the command in that memory and
send the dma memory address instead of the actual command.
Introduce an extended adminq command to wrap the real command with the
inner opcode and the allocated dma memory address specified. Once the
device receives it, it can get the real command from the given dma
memory address. As designed with the device, all the extended commands
will use inner opcode larger than 0xFF.
Signed-off-by: Jeroen de Borst <jeroendb@google.com> Co-developed-by: Ziwei Xiao <ziweixiao@google.com> Signed-off-by: Ziwei Xiao <ziweixiao@google.com> Reviewed-by: Praveen Kaligineedi <pkaligineedi@google.com> Reviewed-by: Harshitha Ramamurthy <hramamurthy@google.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://patch.msgid.link/20240625001232.1476315-3-ziweixiao@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Ziwei Xiao [Tue, 25 Jun 2024 00:12:27 +0000 (00:12 +0000)]
gve: Add adminq mutex lock
We were depending on the rtnl_lock to make sure there is only one adminq
command running at a time. But some commands may take too long to hold
the rtnl_lock, such as the upcoming flow steering operations. For such
situations, it can temporarily drop the rtnl_lock, and replace it for
these operations with a new adminq lock, which can ensure the adminq
command execution to be thread-safe.
Signed-off-by: Ziwei Xiao <ziweixiao@google.com> Reviewed-by: Praveen Kaligineedi <pkaligineedi@google.com> Reviewed-by: Harshitha Ramamurthy <hramamurthy@google.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://patch.msgid.link/20240625001232.1476315-2-ziweixiao@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Neal Cardwell [Mon, 24 Jun 2024 14:43:23 +0000 (14:43 +0000)]
tcp: fix tcp_rcv_fastopen_synack() to enter TCP_CA_Loss for failed TFO
Testing determined that the recent commit 9e046bb111f1 ("tcp: clear
tp->retrans_stamp in tcp_rcv_fastopen_synack()") has a race, and does
not always ensure retrans_stamp is 0 after a TFO payload retransmit.
If transmit completion for the SYN+data skb happens after the client
TCP stack receives the SYNACK (which sometimes happens), then
retrans_stamp can erroneously remain non-zero for the lifetime of the
connection, causing a premature ETIMEDOUT later.
Testing and tracing showed that the buggy scenario is the following
somewhat tricky sequence:
+ Client attempts a TFO handshake. tcp_send_syn_data() sends SYN + TFO
cookie + data in a single packet in the syn_data skb. It hands the
syn_data skb to tcp_transmit_skb(), which makes a clone. Crucially,
it then reuses the same original (non-clone) syn_data skb,
transforming it by advancing the seq by one byte and removing the
FIN bit, and enques the resulting payload-only skb in the
sk->tcp_rtx_queue.
+ Client sets retrans_stamp to the start time of the three-way
handshake.
+ Cookie mismatches or server has TFO disabled, and server only ACKs
SYN.
+ tcp_ack() sees SYN is acked, tcp_clean_rtx_queue() clears
retrans_stamp.
+ Since the client SYN was acked but not the payload, the TFO failure
code path in tcp_rcv_fastopen_synack() tries to retransmit the
payload skb. However, in some cases the transmit completion for the
clone of the syn_data (which had SYN + TFO cookie + data) hasn't
happened. In those cases, skb_still_in_host_queue() returns true
for the retransmitted TFO payload, because the clone of the syn_data
skb has not had its tx completetion.
+ Because skb_still_in_host_queue() finds skb_fclone_busy() is true,
it sets the TSQ_THROTTLED bit and the retransmit does not happen in
the tcp_rcv_fastopen_synack() call chain.
+ The tcp_rcv_fastopen_synack() code next implicitly assumes the
retransmit process is finished, and sets retrans_stamp to 0 to clear
it, but this is later overwritten (see below).
+ Later, upon tx completion, tcp_tsq_write() calls
tcp_xmit_retransmit_queue(), which puts the retransmit in flight and
sets retrans_stamp to a non-zero value.
+ The client receives an ACK for the retransmitted TFO payload data.
+ Since we're in CA_Open and there are no dupacks/SACKs/DSACKs/ECN to
make tcp_ack_is_dubious() true and make us call
tcp_fastretrans_alert() and reach a code path that clears
retrans_stamp, retrans_stamp stays nonzero.
+ Later, if there is a TLP, RTO, RTO sequence, then the connection
will suffer an early ETIMEDOUT due to the erroneously ancient
retrans_stamp.
The fix: this commit refactors the code to have
tcp_rcv_fastopen_synack() retransmit by reusing the relevant parts of
tcp_simple_retransmit() that enter CA_Loss (without changing cwnd) and
call tcp_xmit_retransmit_queue(). We have tcp_simple_retransmit() and
tcp_rcv_fastopen_synack() share code in this way because in both cases
we get a packet indicating non-congestion loss (MTU reduction or TFO
failure) and thus in both cases we want to retransmit as many packets
as cwnd allows, without reducing cwnd. And given that retransmits will
set retrans_stamp to a non-zero value (and may do so in a later
calling context due to TSQ), we also want to enter CA_Loss so that we
track when all retransmitted packets are ACked and clear retrans_stamp
when that happens (to ensure later recurring RTOs are using the
correct retrans_stamp and don't declare ETIMEDOUT prematurely).
Fixes: 9e046bb111f1 ("tcp: clear tp->retrans_stamp in tcp_rcv_fastopen_synack()") Fixes: a7abf3cd76e1 ("tcp: consider using standard rtx logic in tcp_rcv_fastopen_synack()") Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Yuchung Cheng <ycheng@google.com> Link: https://patch.msgid.link/20240624144323.2371403-1-ncardwell.sw@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
====================
ethtool: provide the dim profile fine-tuning channel
The NetDIM library provides excellent acceleration for many modern
network cards. However, the default profiles of DIM limits its maximum
capabilities for different NICs, so providing a way which the NIC can
be custom configured is necessary.
Currently, the way is based on the commonly used "ethtool -C".
For example,
on the server side, the virtio-net NIC with rx dim enabled has 8
queues and runs nginx.
The client uses the following command to send traffic to the server:
./wrk http://server_ip:80 -c 64 -t 5 -d 30
Then adjust the default rx-profile for server dim to
Heng Qi [Fri, 21 Jun 2024 10:13:53 +0000 (18:13 +0800)]
virtio-net: support dim profile fine-tuning
Virtio-net has different types of back-end device implementations.
In order to effectively optimize the dim library's gains for different
device implementations, let's use the new interface params to
initialize and query dim results from a customized profile list.
Heng Qi [Fri, 21 Jun 2024 10:13:51 +0000 (18:13 +0800)]
ethtool: provide customized dim profile management
The NetDIM library, currently leveraged by an array of NICs, delivers
excellent acceleration benefits. Nevertheless, NICs vary significantly
in their dim profile list prerequisites.
Specifically, virtio-net backends may present diverse sw or hw device
implementation, making a one-size-fits-all parameter list impractical.
On Alibaba Cloud, the virtio DPU's performance under the default DIM
profile falls short of expectations, partly due to a mismatch in
parameter configuration.
I also noticed that ice/idpf/ena and other NICs have customized
profilelist or placed some restrictions on dim capabilities.
Motivated by this, I tried adding new params for "ethtool -C" that provides
a per-device control to modify and access a device's interrupt parameters.
Usage
========
The target NIC is named ethx.
Assume that ethx only declares support for rx profile setting
(with DIM_PROFILE_RX flag set in profile_flags) and supports modification
of usec and pkt fields.
1. Query the currently customized list of the device
Heng Qi [Fri, 21 Jun 2024 10:13:50 +0000 (18:13 +0800)]
dim: make DIMLIB dependent on NET
DIMLIB's capabilities are supplied by the dim, net_dim, and
rdma_dim objects, and dim's interfaces solely act as a base for
net_dim and rdma_dim and are not explicitly used anywhere else.
rdma_dim is utilized by the infiniband driver, while net_dim
is for network devices, excluding the soc/fsl driver.
In this patch, net_dim relies on some NET's interfaces, thus
DIMLIB needs to explicitly depend on the NET Kconfig.
The soc/fsl driver uses the functions provided by net_dim, so
it also needs to depend on NET.
Jakub Kicinski [Wed, 26 Jun 2024 00:07:06 +0000 (17:07 -0700)]
Merge branch 'ravb-add-mii-support-for-r-car-v4m'
Geert Uytterhoeven says:
====================
ravb: Add MII support for R-Car V4M
All EtherAVB instances on R-Car Gen3/Gen4 SoCs support the RGMII
interface. In addition, the first two EtherAVB instances on R-Car V4M
also support the MII interface, but this is not yet supported by the
driver. This patch series adds support for MII on R-Car Gen4, after the
customary cleanup.
The corresponding pin control support is available in [1].
Compile-tested only, as all AVB interfaces on the Gray Hawk Single
development board are connected to RGMII PHYs.
No regressions on R-Car V4H.
All EtherAVB instances on R-Car Gen3/Gen4 SoCs support the RGMII
interface. In addition, the first two EtherAVB instances on R-Car V4M
also support the MII interface, but this is not yet supported by the
driver.
Add support for MII on R-Car Gen4 by adding an R-Car Gen4-specific EMAC
initialization function that selects the MII clock instead of the RGMII
clock when the PHY interface is MII. Note that all implementations of
EtherAVB on R-Car Gen4 SoCs have the APSR register, but only MII-capable
instances are documented to have the MIISELECT bit, which has a
documented value of zero when reserved.
Shannon Nelson [Mon, 24 Jun 2024 17:50:15 +0000 (10:50 -0700)]
ionic: use dev_consume_skb_any outside of napi
If we're not in a NAPI softirq context, we need to be careful
about how we call napi_consume_skb(), specifically we need to
call it with budget==0 to signal to it that we're not in a
safe context.
This was found while running some configuration stress testing
of traffic and a change queue config loop running, and this
curious note popped out:
I found that ionic_tx_clean() calls napi_consume_skb() which calls
napi_skb_cache_put(), but before that last call is the note
/* Zero budget indicate non-NAPI context called us, like netpoll */
and
DEBUG_NET_WARN_ON_ONCE(!in_softirq());
Those are pretty big hints that we're doing it wrong. We can pass a
context hint down through the calls to let ionic_tx_clean() know what
we're doing so it can call napi_consume_skb() correctly.
Li RongQing [Fri, 21 Jun 2024 09:45:52 +0000 (17:45 +0800)]
virtio_net: Remove u64_stats_update_begin()/end() for stats fetch
This place is fetching the stats, u64_stats_update_begin()/end()
should not be used, and the fetcher of stats is in the same context
as the updater of the stats, so don't need any protection
James Chapman [Mon, 24 Jun 2024 08:29:45 +0000 (09:29 +0100)]
l2tp: remove incorrect __rcu attribute
This fixes a sparse warning.
Fixes: d18d3f0a24fc ("l2tp: replace hlist with simple list for per-tunnel session list") Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202406220754.evK8Hrjw-lkp@intel.com/ Signed-off-by: James Chapman <jchapman@katalix.com> Link: https://patch.msgid.link/20240624082945.1925009-1-jchapman@katalix.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Elad Yifee [Sun, 23 Jun 2024 17:51:09 +0000 (20:51 +0300)]
net: ethernet: mtk_eth_soc: ppe: prevent ppe update for non-mtk devices
Introduce an additional validation to ensure that the PPE index
is modified exclusively for mtk_eth ingress devices.
This primarily addresses the issue related
to WED operation with multiple PPEs.
Tristram Ha [Fri, 21 Jun 2024 22:34:22 +0000 (15:34 -0700)]
net: dsa: microchip: fix wrong register write when masking interrupt
The switch global port interrupt mask, REG_SW_PORT_INT_MASK__4, is
defined as 0x001C in ksz9477_reg.h. The designers used 32-bit value in
anticipation for increase of port count in future product but currently
the maximum port count is 7 and the effective value is 0x7F in register
0x001F. Each port has its own interrupt mask and is defined as 0x#01F.
It uses only 4 bits for different interrupts.
The developer who implemented the current interrupt mechanism in the
switch driver noticed there are similarities between the mechanism to
mask port interrupts in global interrupt and individual interrupts in
each port and so used the same code to handle these interrupts. He
updated the code to use the new macro REG_SW_PORT_INT_MASK__1 which is
defined as 0x1F in ksz_common.h but he forgot to update the 32-bit write
to 8-bit as now the mask registers are 0x1F and 0x#01F.
In addition all KSZ switches other than the KSZ9897/KSZ9893 and LAN937X
families use only 8-bit access and so this common code will eventually
be changed to accommodate them.
Shengjiu Wang [Thu, 20 Jun 2024 02:40:18 +0000 (10:40 +0800)]
ALSA: dmaengine_pcm: terminate dmaengine before synchronize
When dmaengine supports pause function, in suspend state,
dmaengine_pause() is called instead of dmaengine_terminate_async(),
In end of playback stream, the runtime->state will go to
SNDRV_PCM_STATE_DRAINING, if system suspend & resume happen
at this time, application will not resume playback stream, the
stream will be closed directly, the dmaengine_terminate_async()
will not be called before the dmaengine_synchronize(), which
violates the call sequence for dmaengine_synchronize().
This behavior also happens for capture streams, but there is no
SNDRV_PCM_STATE_DRAINING state for capture. So use
dmaengine_tx_status() to check the DMA status if the status is
DMA_PAUSED, then call dmaengine_terminate_async() to terminate
dmaengine before dmaengine_synchronize().
Paolo Abeni [Tue, 25 Jun 2024 09:53:09 +0000 (11:53 +0200)]
Merge branch 'net-macb-wol-enhancements'
Vineeth Karumanchi says:
====================
net: macb: WOL enhancements
- Add provisioning for queue tie-off and queue disable during suspend.
- Add support for ARP packet types to WoL.
- Advertise WoL attributes by default.
- Extend MACB supported WoL modes to the PHY supported WoL modes.
- Deprecate magic-packet property.
WOL modes such as magic-packet should be an OS policy.
By default, advertise supported modes and use ethtool to activate
the required mode.
Suggested-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: Vineeth Karumanchi <vineeth.karumanchi@amd.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Acked-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Currently, if PHY supports WOL, ethtool ignores the modes supported
by MACB. This change extends the WOL modes with MACB supported modes.
Advertise wake-on LAN supported modes by default without relying on
dt node. By default, wake-on LAN will be in disabled state.
Using ethtool, users can enable/disable or choose packet types.
For wake-on LAN via ARP, ensure the IP address is assigned and
report an error otherwise.
net: macb: queue tie-off or disable during WOL suspend
When GEM is used as a wake device, it is not mandatory for the RX DMA
to be active. The RX engine in IP only needs to receive and identify
a wake packet through an interrupt. The wake packet is of no further
significance; hence, it is not required to be copied into memory.
By disabling RX DMA during suspend, we can avoid unnecessary DMA
processing of any incoming traffic.
During suspend, perform either of the below operations:
- tie-off/dummy descriptor: Disable unused queues by connecting
them to a looped descriptor chain without free slots.
- queue disable: The newer IP version allows disabling individual queues.
luoxuanqiang [Fri, 21 Jun 2024 01:39:29 +0000 (09:39 +0800)]
Fix race for duplicate reqsk on identical SYN
When bonding is configured in BOND_MODE_BROADCAST mode, if two identical
SYN packets are received at the same time and processed on different CPUs,
it can potentially create the same sk (sock) but two different reqsk
(request_sock) in tcp_conn_request().
These two different reqsk will respond with two SYNACK packets, and since
the generation of the seq (ISN) incorporates a timestamp, the final two
SYNACK packets will have different seq values.
The consequence is that when the Client receives and replies with an ACK
to the earlier SYNACK packet, we will reset(RST) it.
This behavior is consistently reproducible in my local setup,
which comprises:
| NETA1 ------ NETB1 |
PC_A --- bond --- | | --- bond --- PC_B
| NETA2 ------ NETB2 |
- PC_A is the Server and has two network cards, NETA1 and NETA2. I have
bonded these two cards using BOND_MODE_BROADCAST mode and configured
them to be handled by different CPU.
- PC_B is the Client, also equipped with two network cards, NETB1 and
NETB2, which are also bonded and configured in BOND_MODE_BROADCAST mode.
If the client attempts a TCP connection to the server, it might encounter
a failure. Capturing packets from the server side reveals:
The attempted solution is as follows:
Add a return value to inet_csk_reqsk_queue_hash_add() to confirm if the
ehash insertion is successful (Up to now, the reason for unsuccessful
insertion is that a reqsk for the same connection has already been
inserted). If the insertion fails, release the reqsk.
Due to the refcnt, Kuniyuki suggests also adding a return value check
for the DCCP module; if ehash insertion fails, indicating a successful
insertion of the same connection, simply release the reqsk as well.
Simultaneously, In the reqsk_queue_hash_req(), the start of the
req->rsk_timer is adjusted to be after successful insertion.
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: luoxuanqiang <luoxuanqiang@kylinos.cn> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20240621013929.1386815-1-luoxuanqiang@kylinos.cn Signed-off-by: Paolo Abeni <pabeni@redhat.com>
====================
af_unix: Remove spin_lock_nested() and convert to lock_cmp_fn.
This series removes spin_lock_nested() in AF_UNIX and instead
defines the locking orders as functions tied to each lock by
lockdep_set_lock_cmp_fn().
When the defined function returns a negative value, lockdep
considers it will not cause deadlock. (See ->cmp_fn() in
check_deadlock() and check_prev_add().)
When we cannot define the total ordering, we return -1 for
the allowed ordering and otherwise 0 as undefined. [0]
af_unix: Don't use spin_lock_nested() in copy_peercred().
When (AF_UNIX, SOCK_STREAM) socket connect()s to a listening socket,
the listener's sk_peer_pid/sk_peer_cred are copied to the client in
copy_peercred().
Then, two sk_peer_locks are held there; one is client's and another
is listener's.
However, the latter is not needed because we hold the listner's
unix_state_lock() there and unix_listen() cannot update the cred
concurrently.
Let's drop the unnecessary spin_lock() and use the bare spin_lock()
for the client to protect concurrent read by getsockopt(SO_PEERCRED).
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
af_unix: Remove put_pid()/put_cred() in copy_peercred().
When (AF_UNIX, SOCK_STREAM) socket connect()s to a listening socket,
the listener's sk_peer_pid/sk_peer_cred are copied to the client in
copy_peercred().
Then, the client's sk_peer_pid and sk_peer_cred are always NULL, so
we need not call put_pid() and put_cred() there.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Commit 1971d13ffa84 ("af_unix: Suppress false-positive lockdep splat for
spin_lock() in __unix_gc().") added U_LOCK_GC_LISTENER for the old GC,
but it's no longer needed for the new GC.
Let's remove U_LOCK_GC_LISTENER and unix_state_lock_nested() as there's
no user.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
sk_diag_dump_icons() acquires embryo's lock by unix_state_lock_nested()
to fetch its peer.
The embryo's ->peer is set to NULL only when its parent listener is
close()d. Then, unix_release_sock() is called for each embryo after
unlinking skb by skb_dequeue().
In sk_diag_dump_icons(), we hold the parent's recvq lock, so we need
not acquire unix_state_lock_nested(), and peer is always non-NULL.
Let's remove unnecessary unix_state_lock_nested() and non-NULL test
for peer.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
af_unix: Don't acquire unix_state_lock() for sock_i_ino().
sk_diag_dump_peer() and sk_diag_dump() call unix_state_lock() for
sock_i_ino() which reads SOCK_INODE(sk->sk_socket)->i_ino, but it's
protected by sk->sk_callback_lock.
Let's remove unnecessary unix_state_lock().
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
af_unix: Define locking order for U_LOCK_SECOND in unix_stream_connect().
While a SOCK_(STREAM|SEQPACKET) socket connect()s to another, we hold
two locks of them by unix_state_lock() and unix_state_lock_nested() in
unix_stream_connect().
Before unix_state_lock_nested(), the following is guaranteed by checking
sk->sk_state:
1. The first socket is TCP_LISTEN
2. The second socket is not the first one
3. Simultaneous connect() must fail
So, the client state can be TCP_CLOSE or TCP_LISTEN or TCP_ESTABLISHED.
Let's define the expected states as unix_state_lock_cmp_fn() instead of
using unix_state_lock_nested().
Note that 2. is detected by debug_spin_lock_before() and 3. cannot be
expressed as lock_cmp_fn.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
af_unix: Don't retry after unix_state_lock_nested() in unix_stream_connect().
When a SOCK_(STREAM|SEQPACKET) socket connect()s to another one, we need
to lock the two sockets to check their states in unix_stream_connect().
We use unix_state_lock() for the server and unix_state_lock_nested() for
client with tricky sk->sk_state check to avoid deadlock.
The possible deadlock scenario are the following:
1) Self connect()
2) Simultaneous connect()
The former is simple, attempt to grab the same lock, and the latter is
AB-BA deadlock.
After the server's unix_state_lock(), we check the server socket's state,
and if it's not TCP_LISTEN, connect() fails with -EINVAL.
Then, we avoid the former deadlock by checking the client's state before
unix_state_lock_nested(). If its state is not TCP_LISTEN, we can make
sure that the client and the server are not identical based on the state.
Also, the latter deadlock can be avoided in the same way. Due to the
server sk->sk_state requirement, AB-BA deadlock could happen only with
TCP_LISTEN sockets. So, if the client's state is TCP_LISTEN, we can
give up the second lock to avoid the deadlock.
CPU 1 CPU 2 CPU 3
connect(A -> B) connect(B -> A) listen(A)
--- --- ---
unix_state_lock(B)
B->sk_state == TCP_LISTEN
READ_ONCE(A->sk_state) == TCP_CLOSE
^^^^^^^^^
ok, will lock A unix_state_lock(A)
.--------------' WRITE_ONCE(A->sk_state, TCP_LISTEN)
| unix_state_unlock(A)
|
| unix_state_lock(A)
| A->sk_sk_state == TCP_LISTEN
| READ_ONCE(B->sk_state) == TCP_LISTEN
v ^^^^^^^^^^
unix_state_lock_nested(A) Don't lock B !!
Currently, while checking the client's state, we also check if it's
TCP_ESTABLISHED, but this is unlikely and can be checked after we know
the state is not TCP_CLOSE.
Moreover, if it happens after the second lock, we now jump to the restart
label, but it's unlikely that the server is not found during the retry,
so the jump is mostly to revist the client state check.
Let's remove the retry logic and check the state against TCP_CLOSE first.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Nick Child [Thu, 20 Jun 2024 15:23:11 +0000 (10:23 -0500)]
ibmvnic: Add tx check to prevent skb leak
Below is a summary of how the driver stores a reference to an skb during
transmit:
tx_buff[free_map[consumer_index]]->skb = new_skb;
free_map[consumer_index] = IBMVNIC_INVALID_MAP;
consumer_index ++;
Where variable data looks like this:
free_map == [4, IBMVNIC_INVALID_MAP, IBMVNIC_INVALID_MAP, 0, 3]
consumer_index^
tx_buff == [skb=null, skb=<ptr>, skb=<ptr>, skb=null, skb=null]
The driver has checks to ensure that free_map[consumer_index] pointed to
a valid index but there was no check to ensure that this index pointed
to an unused/null skb address. So, if, by some chance, our free_map and
tx_buff lists become out of sync then we were previously risking an
skb memory leak. This could then cause tcp congestion control to stop
sending packets, eventually leading to ETIMEDOUT.
Therefore, add a conditional to ensure that the skb address is null. If
not then warn the user (because this is still a bug that should be
patched) and free the old pointer to prevent memleak/tcp problems.
Signed-off-by: Nick Child <nnac123@linux.ibm.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
mm/memory: don't require head page for do_set_pmd()
The requirement that the head page be passed to do_set_pmd() was added in
commit ef37b2ea08ac ("mm/memory: page_add_file_rmap() ->
folio_add_file_rmap_[pte|pmd]()") and prevents pmd-mapping in the
finish_fault() and filemap_map_pages() paths if the page to be inserted is
anything but the head page for an otherwise suitable vma and pmd-sized
page.
Matthew said:
: We're going to stop using PMDs to map large folios unless the fault is
: within the first 4KiB of the PMD. No idea how many workloads that
: affects, but it only needs to be backported as far as v6.8, so we may
: as well backport it.
Link: https://lkml.kernel.org/r/20240611153216.2794513-1-abrestic@rivosinc.com Fixes: ef37b2ea08ac ("mm/memory: page_add_file_rmap() -> folio_add_file_rmap_[pte|pmd]()") Signed-off-by: Andrew Bresticker <abrestic@rivosinc.com> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Hugh Dickins <hughd@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>