Jason Xing [Thu, 20 Feb 2025 07:29:34 +0000 (15:29 +0800)]
bpf: Add BPF_SOCK_OPS_TSTAMP_SCHED_CB callback
Support SCM_TSTAMP_SCHED case for bpf timestamping.
Add a new sock_ops callback, BPF_SOCK_OPS_TSTAMP_SCHED_CB. This
callback will occur at the same timestamping point as the user
space's SCM_TSTAMP_SCHED. The BPF program can use it to get the
same SCM_TSTAMP_SCHED timestamp without modifying the user-space
application.
A new SKBTX_BPF flag is added to mark skb_shinfo(skb)->tx_flags,
ensuring that the new BPF timestamping and the current user
space's SO_TIMESTAMPING do not interfere with each other.
Jason Xing [Thu, 20 Feb 2025 07:29:33 +0000 (15:29 +0800)]
net-timestamp: Prepare for isolating two modes of SO_TIMESTAMPING
No functional changes here. Only add test to see if the orig_skb
matches the usage of application SO_TIMESTAMPING.
In this series, bpf timestamping and previous socket timestamping
are implemented in the same function __skb_tstamp_tx(). To test
the socket enables socket timestamping feature, this function
skb_tstamp_tx_report_so_timestamping() is added.
In the next patch, another check for bpf timestamping feature
will be introduced just like the above report function, namely,
skb_tstamp_tx_report_bpf_timestamping(). Then users will be able
to know the socket enables either or both of features.
Jason Xing [Thu, 20 Feb 2025 07:29:32 +0000 (15:29 +0800)]
bpf: Disable unsafe helpers in TX timestamping callbacks
New TX timestamping sock_ops callbacks will be added in the
subsequent patch. Some of the existing BPF helpers will not
be safe to be used in the TX timestamping callbacks.
The bpf_sock_ops_setsockopt, bpf_sock_ops_getsockopt, and
bpf_sock_ops_cb_flags_set require owning the sock lock. TX
timestamping callbacks will not own the lock.
The bpf_sock_ops_load_hdr_opt needs the skb->data pointing
to the TCP header. This will not be true in the TX timestamping
callbacks.
At the beginning of these helpers, this patch checks the
bpf_sock->op to ensure these helpers are used by the existing
sock_ops callbacks only.
Jason Xing [Thu, 20 Feb 2025 07:29:31 +0000 (15:29 +0800)]
bpf: Prevent unsafe access to the sock fields in the BPF timestamping callback
The subsequent patch will implement BPF TX timestamping. It will
call the sockops BPF program without holding the sock lock.
This breaks the current assumption that all sock ops programs will
hold the sock lock. The sock's fields of the uapi's bpf_sock_ops
requires this assumption.
To address this, a new "u8 is_locked_tcp_sock;" field is added. This
patch sets it in the current sock_ops callbacks. The "is_fullsock"
test is then replaced by the "is_locked_tcp_sock" test during
sock_ops_convert_ctx_access().
The new TX timestamping callbacks added in the subsequent patch will
not have this set. This will prevent unsafe access from the new
timestamping callbacks.
Potentially, we could allow read-only access. However, this would
require identifying which callback is read-safe-only and also requires
additional BPF instruction rewrites in the covert_ctx. Since the BPF
program can always read everything from a socket (e.g., by using
bpf_core_cast), this patch keeps it simple and disables all read
and write access to any socket fields through the bpf_sock_ops
UAPI from the new TX timestamping callback.
Moreover, note that some of the fields in bpf_sock_ops are specific
to tcp_sock, and sock_ops currently only supports tcp_sock. In
the future, UDP timestamping will be added, which will also break
this assumption. The same idea used in this patch will be reused.
Considering that the current sock_ops only supports tcp_sock, the
variable is named is_locked_"tcp"_sock.
Jason Xing [Thu, 20 Feb 2025 07:29:30 +0000 (15:29 +0800)]
bpf: Prepare the sock_ops ctx and call bpf prog for TX timestamping
This patch introduces a new bpf_skops_tx_timestamping() function
that prepares the "struct bpf_sock_ops" ctx and then executes the
sockops BPF program.
The subsequent patch will utilize bpf_skops_tx_timestamping() at
the existing TX timestamping kernel callbacks (__sk_tstamp_tx
specifically) to call the sockops BPF program. Later, four callback
points to report information to user space based on this patch will
be introduced.
Jason Xing [Thu, 20 Feb 2025 07:29:29 +0000 (15:29 +0800)]
bpf: Add networking timestamping support to bpf_get/setsockopt()
The new SK_BPF_CB_FLAGS and new SK_BPF_CB_TX_TIMESTAMPING are
added to bpf_get/setsockopt. The later patches will implement the
BPF networking timestamping. The BPF program will use
bpf_setsockopt(SK_BPF_CB_FLAGS, SK_BPF_CB_TX_TIMESTAMPING) to
enable the BPF networking timestamping on a socket.
Jason Xing [Wed, 19 Feb 2025 08:13:31 +0000 (16:13 +0800)]
bpf: Support TCP_RTO_MAX_MS for bpf_setsockopt
Some applications don't want to wait for too long because the
time of retransmission increases exponentially and can reach more
than 10 seconds, for example. Eric implements the core logic
on supporting rto max feature in the stack previously. Based on that,
we can support it for BPF use.
This patch reuses the same logic of TCP_RTO_MAX_MS in do_tcp_setsockopt()
and do_tcp_getsockopt(). BPF program can call bpf_{set/get}sockopt()
to set/get the maximum value of RTO.
Linus Torvalds [Thu, 13 Feb 2025 20:17:04 +0000 (12:17 -0800)]
Merge tag 'net-6.14-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Jakub Kicinski:
"Including fixes from netfilter, wireless and bluetooth.
Kalle Valo steps down after serving as the WiFi driver maintainer for
over a decade.
Current release - fix to a fix:
- vsock: orphan socket after transport release, avoid null-deref
- Bluetooth: L2CAP: fix corrupted list in hci_chan_del
Current release - regressions:
- eth:
- stmmac: correct Rx buffer layout when SPH is enabled
- iavf: fix a locking bug in an error path
- rxrpc: fix alteration of headers whilst zerocopy pending
- s390/qeth: move netif_napi_add_tx() and napi_enable() from under BH
- Revert "netfilter: flowtable: teardown flow if cached mtu is stale"
Current release - new code bugs:
- rxrpc: fix ipv6 path MTU discovery, only ipv4 worked
- pse-pd: fix deadlock in current limit functions
Previous releases - regressions:
- rtnetlink: fix netns refleak with rtnl_setlink()
- wifi: brcmfmac: use random seed flag for BCM4355 and BCM4364
firmware
Previous releases - always broken:
- add missing RCU protection of struct net throughout the stack
- can: rockchip: bail out if skb cannot be allocated
- eth: ti: am65-cpsw: base XDP support fixes
Misc:
- ethtool: tsconfig: update the format of hwtstamp flags, changes the
uAPI but this uAPI was not in any release yet"
* tag 'net-6.14-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (72 commits)
net: pse-pd: Fix deadlock in current limit functions
rxrpc: Fix ipv6 path MTU discovery
Reapply "net: skb: introduce and use a single page frag cache"
s390/qeth: move netif_napi_add_tx() and napi_enable() from under BH
mlxsw: Add return value check for mlxsw_sp_port_get_stats_raw()
ipv6: mcast: add RCU protection to mld_newpack()
team: better TEAM_OPTION_TYPE_STRING validation
Bluetooth: L2CAP: Fix corrupted list in hci_chan_del
Bluetooth: btintel_pcie: Fix a potential race condition
Bluetooth: L2CAP: Fix slab-use-after-free Read in l2cap_send_cmd
net: ethernet: ti: am65_cpsw: fix tx_cleanup for XDP case
net: ethernet: ti: am65-cpsw: fix RX & TX statistics for XDP_TX case
net: ethernet: ti: am65-cpsw: fix memleak in certain XDP cases
vsock/test: Add test for SO_LINGER null ptr deref
vsock: Orphan socket after transport release
MAINTAINERS: Add sctp headers to the general netdev entry
Revert "netfilter: flowtable: teardown flow if cached mtu is stale"
iavf: Fix a locking bug in an error path
rxrpc: Fix alteration of headers whilst zerocopy pending
net: phylink: make configuring clock-stop dependent on MAC support
...
Linus Torvalds [Thu, 13 Feb 2025 20:06:29 +0000 (12:06 -0800)]
Merge tag 'for-6.14-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
- fix stale page cache after race between readahead and direct IO write
- fix hole expansion when writing at an offset beyond EOF, the range
will not be zeroed
- use proper way to calculate offsets in folio ranges
* tag 'for-6.14-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: fix hole expansion when writing at an offset beyond EOF
btrfs: fix stale page cache after race between readahead and direct IO write
btrfs: fix two misuses of folio_shift()
Linus Torvalds [Thu, 13 Feb 2025 19:58:11 +0000 (11:58 -0800)]
Merge tag 'bcachefs-2025-02-12' of git://evilpiepirate.org/bcachefs
Pull bcachefs fixes from Kent Overstreet:
"Just small stuff.
As a general announcement, on disk format is now frozen in my master
branch - future on disk format changes will be optional, not required.
- More fixes for going read-only: the previous fix was insufficient,
but with more work on ordering journal reclaim flushing (and a
btree node accounting fix so we don't split until we have to) the
tiering_replication test now consistently goes read-only in less
than a second.
- fix for fsck when we have reflink pointers to missing indirect
extents
- some transaction restart handling fixes from Alan; the "Pass
_orig_restart_count to trans_was_restarted" likely fixes some rare
undefined behaviour heisenbugs"
* tag 'bcachefs-2025-02-12' of git://evilpiepirate.org/bcachefs:
bcachefs: Reuse transaction
bcachefs: Pass _orig_restart_count to trans_was_restarted
bcachefs: CONFIG_BCACHEFS_INJECT_TRANSACTION_RESTARTS
bcachefs: Fix want_new_bset() so we write until the end of the btree node
bcachefs: Split out journal pins by btree level
bcachefs: Fix use after free
bcachefs: Fix marking reflink pointers to missing indirect extents
Kory Maincent [Wed, 12 Feb 2025 15:17:51 +0000 (16:17 +0100)]
net: pse-pd: Fix deadlock in current limit functions
Fix a deadlock in pse_pi_get_current_limit and pse_pi_set_current_limit
caused by consecutive mutex_lock calls. One in the function itself and
another in pse_pi_get_voltage.
Resolve the issue by using the unlocked version of pse_pi_get_voltage
instead.
David Howells [Wed, 12 Feb 2025 11:21:24 +0000 (11:21 +0000)]
rxrpc: Fix ipv6 path MTU discovery
rxrpc path MTU discovery currently only makes use of ICMPv4, but not
ICMPv6, which means that pmtud for IPv6 doesn't work correctly. Fix it to
check for ICMPv6 messages also.
Fixes: eeaedc5449d9 ("rxrpc: Implement path-MTU probing using padded PING ACKs (RFC8899)") Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: Simon Horman <horms@kernel.org>
cc: linux-afs@lists.infradead.org Link: https://patch.msgid.link/3517283.1739359284@warthog.procyon.org.uk Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Thu, 13 Feb 2025 17:41:33 +0000 (09:41 -0800)]
Merge tag 'for-net-2025-02-13' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth
Luiz Augusto von Dentz says:
====================
bluetooth pull request for net:
- btintel_pcie: Fix a potential race condition
- L2CAP: Fix slab-use-after-free Read in l2cap_send_cmd
- L2CAP: Fix corrupted list in hci_chan_del
* tag 'for-net-2025-02-13' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth:
Bluetooth: L2CAP: Fix corrupted list in hci_chan_del
Bluetooth: btintel_pcie: Fix a potential race condition
Bluetooth: L2CAP: Fix slab-use-after-free Read in l2cap_send_cmd
====================
Jakub Kicinski [Thu, 13 Feb 2025 17:38:50 +0000 (09:38 -0800)]
Merge tag 'nf-25-02-13' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf
Pablo Neira Ayuso says:
====================
Netfilter fixes for net
The following batch contains one revert for:
1) Revert flowtable entry teardown cycle when skbuff exceeds mtu to
deal with DF flag unset scenarios. This is reverts a patch coming
in the previous merge window (available in 6.14-rc releases).
* tag 'nf-25-02-13' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
Revert "netfilter: flowtable: teardown flow if cached mtu is stale"
====================
Sabrina reports that the revert may trigger warnings due to intervening
changes, especially the ability to rise MAX_SKB_FRAGS. Let's drop it
and revisit once that part is also ironed out.
Linus Torvalds [Thu, 13 Feb 2025 16:43:46 +0000 (08:43 -0800)]
Merge tag 'loongarch-fixes-6.14-1' of git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson
Pull LoongArch fixes from Huacai Chen:
"Fix bugs about idle, kernel_page_present(), IP checksum and KVM, plus
some trival cleanups"
* tag 'loongarch-fixes-6.14-1' of git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson:
LoongArch: KVM: Set host with kernel mode when switch to VM mode
LoongArch: KVM: Remove duplicated cache attribute setting
LoongArch: KVM: Fix typo issue about GCFG feature detection
LoongArch: csum: Fix OoB access in IP checksum code for negative lengths
LoongArch: Remove the deprecated notifier hook mechanism
LoongArch: Use str_yes_no() helper function for /proc/cpuinfo
LoongArch: Fix kernel_page_present() for KPRANGE/XKPRANGE
LoongArch: Fix idle VS timer enqueue
Linus Torvalds [Thu, 13 Feb 2025 16:41:48 +0000 (08:41 -0800)]
Merge tag 'platform-drivers-x86-v6.14-3' of git://git.kernel.org/pub/scm/linux/kernel/git/pdx86/platform-drivers-x86
Pull x86 platform driver fixes from Ilpo Järvinen:
- thinkpad_acpi:
- Fix registration of tpacpi platform driver
- Support fan speed in ticks per revolution (Thinkpad X120e)
- Support V9 DYTC profiles (new Thinkpad AMD platforms)
- int3472: Handle GPIO "enable" vs "reset" variation (ov7251)
* tag 'platform-drivers-x86-v6.14-3' of git://git.kernel.org/pub/scm/linux/kernel/git/pdx86/platform-drivers-x86:
platform/x86: thinkpad_acpi: Fix registration of tpacpi platform driver
platform/x86: int3472: Call "reset" GPIO "enable" for INT347E
platform/x86: int3472: Use correct type for "polarity", call it gpio_flags
platform/x86: thinkpad_acpi: Support for V9 DYTC platform profiles
platform/x86: thinkpad_acpi: Fix invalid fan speed on ThinkPad X120e
Alexandra Winter [Wed, 12 Feb 2025 16:36:59 +0000 (17:36 +0100)]
s390/qeth: move netif_napi_add_tx() and napi_enable() from under BH
Like other drivers qeth is calling local_bh_enable() after napi_schedule()
to kick-start softirqs [0].
Since netif_napi_add_tx() and napi_enable() now take the netdev_lock()
mutex [1], move them out from under the BH protection. Same solution as in
commit a60558644e20 ("wifi: mt76: move napi_enable() from under BH")
Max Schulze [Wed, 12 Feb 2025 15:09:51 +0000 (16:09 +0100)]
net: usb: asix_devices: add FiberGecko DeviceID
The FiberGecko is a small USB module that connects a 100 Mbit/s SFP
Signed-off-by: Max Schulze <max.schulze@online.de> Tested-by: Max Schulze <max.schulze@online.de> Suggested-by: David Hollis <dhollis@davehollis.com> Reported-by: Sven Kreiensen <s.kreiensen@lyconsys.com> Link: https://patch.msgid.link/20250212150957.43900-2-max.schulze@online.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Wentao Liang [Wed, 12 Feb 2025 15:23:11 +0000 (23:23 +0800)]
mlxsw: Add return value check for mlxsw_sp_port_get_stats_raw()
Add a check for the return value of mlxsw_sp_port_get_stats_raw()
in __mlxsw_sp_port_get_stats(). If mlxsw_sp_port_get_stats_raw()
returns an error, exit the function to prevent further processing
with potentially invalid data.
Fixes: 614d509aa1e7 ("mlxsw: Move ethtool_ops to spectrum_ethtool.c") Cc: stable@vger.kernel.org # 5.9+ Signed-off-by: Wentao Liang <vulab@iscas.ac.cn> Reviewed-by: Petr Machata <petrm@nvidia.com> Link: https://patch.msgid.link/20250212152311.1332-1-vulab@iscas.ac.cn Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Heiner Kallweit [Wed, 12 Feb 2025 06:32:52 +0000 (07:32 +0100)]
ixgene-v2: prepare for phylib stop exporting phy_10_100_features_array
As part of phylib cleanup we plan to stop exporting the feature arrays.
So explicitly remove the modes not supported by the MAC. The media type
bits don't have any impact on kernel behavior, so don't touch them.
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Reviewed-by: Mateusz Polchlopek <mateusz.polchlopek@intel.com> Link: https://patch.msgid.link/be356a21-5a1a-45b3-9407-3a97f3af4600@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Bluetooth: L2CAP: Fix corrupted list in hci_chan_del
This fixes the following trace by reworking the locking of l2cap_conn
so instead of only locking when changing the chan_l list this promotes
chan_lock to a general lock of l2cap_conn so whenever it is being held
it would prevents the likes of l2cap_conn_del to run:
Kiran K [Fri, 31 Jan 2025 13:00:19 +0000 (18:30 +0530)]
Bluetooth: btintel_pcie: Fix a potential race condition
On HCI_OP_RESET command, firmware raises alive interrupt. Driver needs
to wait for this before sending other command. This patch fixes the potential
miss of alive interrupt due to which HCI_OP_RESET can timeout.
Expected flow:
If tx command is HCI_OP_RESET,
1. set data->gp0_received = false
2. send HCI_OP_RESET
3. wait for alive interrupt
Actual flow having potential race:
If tx command is HCI_OP_RESET,
1. send HCI_OP_RESET
1a. Firmware raises alive interrupt here and in ISR
data->gp0_received is set to true
2. set data->gp0_received = false
3. wait for alive interrupt
Signed-off-by: Kiran K <kiran.k@intel.com> Fixes: 05c200c8f029 ("Bluetooth: btintel_pcie: Add handshake between driver and firmware") Reported-by: Bjorn Helgaas <helgaas@kernel.org> Closes: https://patchwork.kernel.org/project/bluetooth/patch/20241001104451.626964-1-kiran.k@intel.com/ Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Bluetooth: L2CAP: Fix slab-use-after-free Read in l2cap_send_cmd
After the hci sync command releases l2cap_conn, the hci receive data work
queue references the released l2cap_conn when sending to the upper layer.
Add hci dev lock to the hci receive data work queue to synchronize the two.
[1]
BUG: KASAN: slab-use-after-free in l2cap_send_cmd+0x187/0x8d0 net/bluetooth/l2cap_core.c:954
Read of size 8 at addr ffff8880271a4000 by task kworker/u9:2/5837
Reported-by: syzbot+31c2f641b850a348a734@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=31c2f641b850a348a734 Tested-by: syzbot+31c2f641b850a348a734@syzkaller.appspotmail.com Signed-off-by: Edward Adam Davis <eadavis@qq.com> Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Dimitri Fedrau [Mon, 10 Feb 2025 14:53:40 +0000 (15:53 +0100)]
net: phy: marvell-88q2xxx: Add support for PHY LEDs on 88q2xxx
Marvell 88Q2XXX devices support up to two configurable Light Emitting
Diode (LED). Add minimal LED controller driver supporting the most common
uses with the 'netdev' trigger.
Roger Quadros [Mon, 10 Feb 2025 14:52:17 +0000 (16:52 +0200)]
net: ethernet: ti: am65_cpsw: fix tx_cleanup for XDP case
For XDP transmit case, swdata doesn't contain SKB but the
XDP Frame. Infer the correct swdata based on buffer type
and return the XDP Frame for XDP transmit case.
Roger Quadros [Mon, 10 Feb 2025 14:52:16 +0000 (16:52 +0200)]
net: ethernet: ti: am65-cpsw: fix RX & TX statistics for XDP_TX case
For successful XDP_TX and XDP_REDIRECT cases, the packet was received
successfully so update RX statistics. Use original received
packet length for that.
TX packets statistics are incremented on TX completion so don't
update it while TX queueing.
If xdp_convert_buff_to_frame() fails, increment tx_dropped.
Roger Quadros [Mon, 10 Feb 2025 14:52:15 +0000 (16:52 +0200)]
net: ethernet: ti: am65-cpsw: fix memleak in certain XDP cases
If the XDP program doesn't result in XDP_PASS then we leak the
memory allocated by am65_cpsw_build_skb().
It is pointless to allocate SKB memory before running the XDP
program as we would be wasting CPU cycles for cases other than XDP_PASS.
Move the SKB allocation after evaluating the XDP program result.
This fixes the memleak. A performance boost is seen for XDP_DROP test.
Huacai Chen [Mon, 10 Feb 2025 13:43:28 +0000 (21:43 +0800)]
net: stmmac: dwmac-loongson: Set correct {tx,rx}_fifo_size
Now for dwmac-loongson {tx,rx}_fifo_size are uninitialised, which means
zero. This means dwmac-loongson doesn't support changing MTU because in
stmmac_change_mtu() it requires the fifo size be no less than MTU. Thus,
set the correct tx_fifo_size and rx_fifo_size for it (16KB multiplied by
queue counts).
Here {tx,rx}_fifo_size is initialised with the initial value (also the
maximum value) of {tx,rx}_queues_to_use. So it will keep as 16KB if we
don't change the queue count, and will be larger than 16KB if we change
(decrease) the queue count. However stmmac_change_mtu() still work well
with current logic (MTU cannot be larger than 16KB for stmmac).
Note: the Fixes tag picked here is the oldest commit and key commit of
the dwmac-loongson series "stmmac: Add Loongson platform support".
Acked-by: Yanteng Si <si.yanteng@linux.dev> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Chong Qiao <qiaochong@loongson.cn> Signed-off-by: Huacai Chen <chenhuacai@loongson.cn> Link: https://patch.msgid.link/20250210134328.2755328-1-chenhuacai@loongson.cn Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Bibo Mao [Thu, 13 Feb 2025 04:02:56 +0000 (12:02 +0800)]
LoongArch: KVM: Set host with kernel mode when switch to VM mode
PRMD register is only meaningful on the beginning stage of exception
entry, and it is overwritten with nested irq or exception.
When CPU runs in VM mode, interrupt need be enabled on host. And the
mode for host had better be kernel mode rather than random or user mode.
When VM is running, the running mode with top command comes from CRMD
register, and running mode should be kernel mode since kernel function
is executing with perf command. It needs be consistent with both top and
perf command.
Signed-off-by: Bibo Mao <maobibo@loongson.cn> Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
Cache attribute comes from GPA->HPA secondary mmu page table and is
configured when kvm is enabled. It is the same for all VMs, so remove
duplicated cache attribute setting on vCPU context switch.
Signed-off-by: Bibo Mao <maobibo@loongson.cn> Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
Bibo Mao [Thu, 13 Feb 2025 04:02:56 +0000 (12:02 +0800)]
LoongArch: KVM: Fix typo issue about GCFG feature detection
This is typo issue and misusage about GCFG feature macro. The code
is wrong, only that it does not cause obvious problem since GCFG is
set again on vCPU context switch.
Yuli Wang [Thu, 13 Feb 2025 04:02:40 +0000 (12:02 +0800)]
LoongArch: Remove the deprecated notifier hook mechanism
The notifier hook mechanism in proc and cpuinfo is actually unnecessary
for LoongArch because it's not used anywhere.
It was originally added to the MIPS code in commit d6d3c9afaab4 ("MIPS:
MT: proc: Add support for printing VPE and TC ids"), and LoongArch then
inherited it.
But as the kernel code stands now, this notifier hook mechanism doesn't
really make sense for either LoongArch or MIPS.
In addition, the seq_file forward declaration needs to be moved to its
proper place, as only the show_ipi_list() function in smp.c requires it.
Huacai Chen [Thu, 13 Feb 2025 04:02:35 +0000 (12:02 +0800)]
LoongArch: Fix kernel_page_present() for KPRANGE/XKPRANGE
Now kernel_page_present() always return true for KPRANGE/XKPRANGE
addresses, this isn't correct because hibernation (ACPI S4) use it
to distinguish whether a page is saveable. If all KPRANGE/XKPRANGE
addresses are considered as saveable, then reserved memory such as
EFI_RUNTIME_SERVICES_CODE / EFI_RUNTIME_SERVICES_DATA will also be
saved and restored.
Fix this by returning true only if the KPRANGE/XKPRANGE address is in
memblock.memory.
Marco Crivellari [Thu, 13 Feb 2025 04:02:35 +0000 (12:02 +0800)]
LoongArch: Fix idle VS timer enqueue
LoongArch re-enables interrupts on its idle routine and performs a
TIF_NEED_RESCHED check afterwards before putting the CPU to sleep.
The IRQs firing between the check and the idle instruction may set the
TIF_NEED_RESCHED flag. In order to deal with such a race, IRQs
interrupting __arch_cpu_idle() rollback their return address to the
beginning of __arch_cpu_idle() so that TIF_NEED_RESCHED is checked
again before going back to sleep.
However idle IRQs can also queue timers that may require a tick
reprogramming through a new generic idle loop iteration but those timers
would go unnoticed here because __arch_cpu_idle() only checks
TIF_NEED_RESCHED. It doesn't check for pending timers.
Fix this with fast-forwarding idle IRQs return address to the end of the
idle routine instead of the beginning, so that the generic idle loop can
handle both TIF_NEED_RESCHED and pending timers.
Michal Luczaj [Mon, 10 Feb 2025 12:15:00 +0000 (13:15 +0100)]
vsock: Orphan socket after transport release
During socket release, sock_orphan() is called without considering that it
sets sk->sk_wq to NULL. Later, if SO_LINGER is enabled, this leads to a
null pointer dereferenced in virtio_transport_wait_close().
Orphan the socket only after transport release.
Partially reverts the 'Fixes:' commit.
KASAN: null-ptr-deref in range [0x0000000000000018-0x000000000000001f]
lock_acquire+0x19e/0x500
_raw_spin_lock_irqsave+0x47/0x70
add_wait_queue+0x46/0x230
virtio_transport_release+0x4e7/0x7f0
__vsock_release+0xfd/0x490
vsock_release+0x90/0x120
__sock_release+0xa3/0x250
sock_close+0x14/0x20
__fput+0x35e/0xa90
__x64_sys_close+0x78/0xd0
do_syscall_64+0x93/0x1b0
entry_SYSCALL_64_after_hwframe+0x76/0x7e
Reported-by: syzbot+9d55b199192a4be7d02c@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=9d55b199192a4be7d02c Fixes: fcdd2242c023 ("vsock: Keep the binding until socket destruction") Tested-by: Luigi Leonardi <leonardi@redhat.com> Reviewed-by: Luigi Leonardi <leonardi@redhat.com> Signed-off-by: Michal Luczaj <mhal@rbox.co> Link: https://patch.msgid.link/20250210-vsock-linger-nullderef-v3-1-ef6244d02b54@rbox.co Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Thu, 13 Feb 2025 03:53:03 +0000 (19:53 -0800)]
Merge branch '200GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue
Tony Nguyen says:
====================
Intel Wired LAN Driver Updates 2025-02-11 (idpf, ixgbe, igc)
For idpf:
Sridhar fixes a couple issues in handling of RSC packets.
Josh adds a call to set_real_num_queues() to keep queue count in sync.
For ixgbe:
Piotr removes missed IS_ERR() removal when ERR_PTR usage was removed.
For igc:
Zdenek Bouska fixes reporting of Rx timestamp with AF_XDP.
Siang sets buffer type on empty frame to ensure proper handling.
* '200GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue:
igc: Set buffer type for empty frames in igc_init_empty_frame
igc: Fix HW RX timestamp when passed by ZC XDP
ixgbe: Fix possible skb NULL pointer dereference
idpf: call set_real_num_queues in idpf_open
idpf: record rx queue in skb for RSC packets
idpf: fix handling rsc packet with a single segment
====================
Changes since last iteration:
- Address comments from Donald Hunter, in particular, fixup "get" and
"get-stats" descriptions, the former operation supports both dump
and normal request (returns a single entry, if found), the latter
only supports dumps.
Paolo Abeni [Tue, 11 Feb 2025 17:17:31 +0000 (18:17 +0100)]
net: avoid unconditionally touching sk_tsflags on RX
After commit 5d4cc87414c5 ("net: reorganize "struct sock" fields"),
the sk_tsflags field shares the same cacheline with sk_forward_alloc.
The UDP protocol does not acquire the sock lock in the RX path;
forward allocations are protected via the receive queue spinlock;
additionally udp_recvmsg() calls sock_recv_cmsgs() unconditionally
touching sk_tsflags on each packet reception.
Due to the above, under high packet rate traffic, when the BH and the
user-space process run on different CPUs, UDP packet reception
experiences a cache miss while accessing sk_tsflags.
The receive path doesn't strictly need to access the problematic field;
change sock_set_timestamping() to maintain the relevant information
in a newly allocated sk_flags bit, so that sock_recv_cmsgs() can
take decisions accessing the latter field only.
With this patch applied, on an AMD epic server with i40e NICs, I
measured a 10% performance improvement for small packets UDP flood
performance tests - possibly a larger delta could be observed with more
recent H/W.
====================
netlink: specs: add a spec for nl80211 wiphy
Add a rudimentary YNL spec for nl80211 that includes get-wiphy and
get-interface, along with some required enhancements to YNL and the
netlink schemas.
Patch 1 is a minor cleanup to prepare for patch 2
Patches 2-4 are new features for YNL
Patches 5-7 are updates to ynl_gen_c
Patches 8-9 are schema updates for feature parity
Patch 10 is the new nl80211 spec
====================
Donald Hunter [Tue, 11 Feb 2025 12:01:21 +0000 (12:01 +0000)]
tools/net/ynl: accept IP string inputs
The ynl tool uses display-hint to know when to format IP addresses in
printed output, but not to parse IP addresses from --json input. Add
support for parsing ipv4 and ipv6 strings.
Donald Hunter [Tue, 11 Feb 2025 12:01:20 +0000 (12:01 +0000)]
tools/net/ynl: support rendering C array members to strings
The nl80211 family encodes the list of supported ciphers as a C array of
u32 values. Add support for translating arrays of scalars into strings
for enum names and display hints.
Donald Hunter [Tue, 11 Feb 2025 12:01:19 +0000 (12:01 +0000)]
tools/net/ynl: support decoding indexed arrays as enums
When decoding an indexed-array with a scalar subtype, it is currently
only possible to add a display-hint. Add support for decoding each value
as an enum.
Jakub Kicinski [Thu, 13 Feb 2025 02:20:06 +0000 (18:20 -0800)]
Merge branch 'net: dsa: add support for phylink managed EEE'
Russell King says:
====================
net: dsa: add support for phylink managed EEE
This series adds support for phylink managed EEE to DSA, and converts
mt753x to make use of this feature.
Patch 1 implements a helper to indicate whether the MAC LPI operations
are populated (suggested by Vladimir)
Patch 2 makes the necessary changes to the core code - we retain calling
set_mac_eee(), but this method now becomes a way to merely validate the
arguments when using phylink managed EEE rather than performing any
configuration.
Patch 3 converts the mt7530 driver to use phylink managed EEE.
====================
Convert mt7530 to use phylink managed EEE. When enabling EEE, we set
both PMCR_FORCE_EEE1G and PMCR_FORCE_EEE100 irrespective of the speed,
and clear them both when disabling.
net: dsa: allow use of phylink managed EEE support
In order to allow DSA drivers to use phylink managed EEE, we need to
change the behaviour of the DSA's .set_eee() ethtool method.
Implementation of the DSA .set_mac_eee() method becomes optional with
phylink managed EEE as it is only used to validate the EEE parameters
supplied from userspace. The rest of the EEE state management should
be left to phylink.
Note that we don't collect csum-unnecessary, just the uncommon
cases (and unnecessary is all the rest of the packets). There
is no programatic use for these stats AFAIK, just manual debug.
====================
Jakub Kicinski [Tue, 11 Feb 2025 18:13:53 +0000 (10:13 -0800)]
eth: fbnic: wrap tx queue stats in a struct
The queue stats struct is used for Rx and Tx queues. Wrap
the Tx stats in a struct and a union, so that we can reuse
the same space for Rx stats on Rx queues.
This also makes it easy to add an assert to the stat handling
code to catch new stats not being aggregated on shutdown.
Jakub Kicinski [Tue, 11 Feb 2025 18:13:52 +0000 (10:13 -0800)]
net: report csum_complete via qstats
Commit 13c7c941e729 ("netdev: add qstat for csum complete") reserved
the entry for csum complete in the qstats uAPI. Start reporting this
value now that we have a driver which needs it.
====================
Use PHYlib for reset randomization and adjustable polling
This patch set tackles a DP83TG720 reset lock issue and improves PHY
polling. Rather than adding a separate polling worker to randomize PHY
resets, I chose to extend the PHYlib framework - which already handles
most of the needed functionality - with adjustable polling. This
approach not only addresses the DP83TG720-specific problem (where
synchronized resets can lock the link) but also lays the groundwork for
optimizing PHY stats polling across all PHY drivers. With generic PHY
stats coming in, we can adjust the polling interval based on hardware
characteristics, such as using longer intervals for PHYs with stable HW
counters or shorter ones for high-speed links prone to counter
overflows.
Patch version changes are tracked in separate patches.
====================
Oleksij Rempel [Mon, 10 Feb 2025 08:23:58 +0000 (09:23 +0100)]
net: phy: dp83tg720: Add randomized polling intervals for link detection
Address the limitations of the DP83TG720 PHY, which cannot reliably
detect or report a stable link state. To handle this, the PHY must be
periodically reset when the link is down. However, synchronized reset
intervals between the PHY and its link partner can result in a deadlock,
preventing the link from re-establishing.
This change introduces a randomized polling interval when the link is
down to desynchronize resets between link partners.
Oleksij Rempel [Mon, 10 Feb 2025 08:23:57 +0000 (09:23 +0100)]
net: phy: Add support for driver-specific next update time
Introduce the `phy_get_next_update_time` function to allow PHY drivers
to dynamically determine the time (in jiffies) until the next state
update event. This enables more flexible and adaptive polling intervals
based on the link state or other conditions.
Alexei Lazar [Sun, 9 Feb 2025 10:17:16 +0000 (12:17 +0200)]
net/mlx5: XDP, Enable TX side XDP multi-buffer support
In XDP scenarios, fragmented packets can occur if the MTU is larger
than the page size, even when the packet size fits within the linear
part.
If XDP multi-buffer support is disabled, the fragmented part won't be
handled in the TX flow, leading to packet drops.
Since XDP multi-buffer support is always available, this commit removes
the conditional check for enabling it.
This ensures that XDP multi-buffer support is always enabled,
regardless of the `is_xdp_mb` parameter, and guarantees the handling of
fragmented packets in such scenarios.
Alexei Lazar [Sun, 9 Feb 2025 10:17:15 +0000 (12:17 +0200)]
net/mlx5: Extend Ethtool loopback selftest to support non-linear SKB
Current loopback test validation ignores non-linear SKB case in
the SKB access, which can lead to failures in scenarios such as
when HW GRO is enabled.
Linearize the SKB so both cases will be handled.
Amir Tzin [Sun, 9 Feb 2025 10:17:13 +0000 (12:17 +0200)]
net/mlx5e: Add direct TIRs to devlink rx reporter diagnose
Add "RX resources" tag to the output of rx reporter diagnose callback.
Underneath add tag for direct TIRs, for each TIR expose its tirn and
the corresponding rqtn.
Amir Tzin [Sun, 9 Feb 2025 10:17:12 +0000 (12:17 +0200)]
net/mlx5e: Move RQs diagnose to a dedicated function
Move rx reporter RQs diagnose from mlx5e_rx_reporter_diagnose() to a
dedicated function. This change is a preparation for the following
series which extends diagnose output for the rx reporter. While at it,
also pass a mlx5e_priv pointer to
mlx5e_rx_reporter_diagnose_common_config() as this is the argument the
latter actually needs.
net/mlx5: Rename and move mlx5_esw_query_vport_vhca_id
Rename mlx5_esw_query_vport_vhca_id to mlx5_vport_get_vhca_id and move
it to vport file. Also, add function declaration to mlx5_core header
file. This better represents the function's usage and allows for it to
be called from other parts of the mlx5_core driver.
William Tu [Sun, 9 Feb 2025 10:17:09 +0000 (12:17 +0200)]
net/mlx5e: set the tx_queue_len for pfifo_fast
By default, the mq netdev creates a pfifo_fast qdisc. On a
system with 16 core, the pfifo_fast with 3 bands consumes
16 * 3 * 8 (size of pointer) * 1024 (default tx queue len)
= 393KB. The patch sets the tx qlen to representor default
value, 128 (1<<MLX5E_REP_PARAMS_DEF_LOG_SQ_SIZE), which
consumes 16 * 3 * 8 * 128 = 49KB, saving 344KB for each
representor at ECPF.
Signed-off-by: William Tu <witu@nvidia.com> Reviewed-by: Daniel Jurgens <danielj@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Link: https://patch.msgid.link/20250209101716.112774-9-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
William Tu [Sun, 9 Feb 2025 10:17:08 +0000 (12:17 +0200)]
net/mlx5e: reduce rep rxq depth to 256 for ECPF
By experiments, a single queue representor netdev consumes kernel
memory around 2.8MB, and 1.8MB out of the 2.8MB is due to page
pool for the RXQ. Scaling to a thousand representors consumes 2.8GB,
which becomes a memory pressure issue for embedded devices such as
BlueField-2 16GB / BlueField-3 32GB memory.
Since representor netdevs mostly handles miss traffic, and ideally,
most of the traffic will be offloaded, reduce the default non-uplink
rep netdev's RXQ default depth from 1024 to 256 if mdev is ecpf eswitch
manager. This saves around 1MB of memory per regular RQ,
(1024 - 256) * 2KB, allocated from page pool.
With rxq depth of 256, the netlink page pool tool reports
$./tools/net/ynl/cli.py --spec Documentation/netlink/specs/netdev.yaml \
--dump page-pool-get
{'id': 277,
'ifindex': 9,
'inflight': 128,
'inflight-mem': 786432,
'napi-id': 775}]
This is due to mtu 1500 + headroom consumes half pages, so 256 rxq
entries consumes around 128 pages (thus create a page pool with
size 128), shown above at inflight.
Note that each netdev has multiple types of RQs, including
Regular RQ, XSK, PTP, Drop, Trap RQ. Since non-uplink representor
only supports regular rq, this patch only changes the regular RQ's
default depth.
Signed-off-by: William Tu <witu@nvidia.com> Reviewed-by: Bodong Wang <bodong@nvidia.com> Reviewed-by: Saeed Mahameed <saeedm@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Link: https://patch.msgid.link/20250209101716.112774-8-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
William Tu [Sun, 9 Feb 2025 10:17:07 +0000 (12:17 +0200)]
net/mlx5e: reduce the max log mpwrq sz for ECPF and reps
For the ECPF and representors, reduce the max MPWRQ size from 256KB (18)
to 128KB (17). This prepares the later patch for saving representor
memory.
With Striding RQ, there is a minimum of 4 MPWQEs. So with 128KB of max
MPWRQ size, the minimal memory is 4 * 128KB = 512KB. When creating page
pool, consider 1500 mtu, the minimal page pool size will be 512KB/4KB =
128 pages = 256 rx ring entries (2 entries per page).
Before this patch, setting RX ringsize (ethtool -G rx) to 256 causes
driver to allocate page pool size more than it needs due to max MPWRQ
is 256KB (18). Ex: 4 * 256KB = 1MB, 1MB/4KB = 256 pages, but actually
128 pages is good enough. Reducing the max MPWRQ to 128KB fixes the
limitation.
Signed-off-by: William Tu <witu@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Link: https://patch.msgid.link/20250209101716.112774-7-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Mark Pearson [Tue, 11 Feb 2025 17:36:11 +0000 (12:36 -0500)]
platform/x86: thinkpad_acpi: Fix registration of tpacpi platform driver
The recent platform profile changes prevent the tpacpi platform driver
from registering. This error is seen in the kernel logs, and the
various tpacpi entries are not created:
[ 7550.642171] platform thinkpad_acpi: Resources present before probing
This happens because devm_platform_profile_register() is called before
tpacpi_pdev probes (thanks to Kurt Borja for identifying the root
cause).
For now revert back to the old platform_profile_register to fix the
issue. This is quick fix and will be re-implemented later as more
testing is needed for full solution.
Tested on X1 Carbon G12.
Fixes: 31658c916fa6 ("platform/x86: thinkpad_acpi: Use devm_platform_profile_register()") Signed-off-by: Mark Pearson <mpearson-lenovo@squebb.ca> Reviewed-by: Kurt Borja <kuurtb@gmail.com> Link: https://lore.kernel.org/r/20250211173620.16522-1-mpearson-lenovo@squebb.ca Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
IPv4 packets with no DF flag set on result in frequent flow entry
teardown cycles, this is visible in the network topology that is used in
the nft_flowtable.sh test.
nft_flowtable.sh test ocassionally fails reporting that the dscp_fwd
test sees no packets going through the flowtable path.
Fixes: b8baac3b9c5c ("netfilter: flowtable: teardown flow if cached mtu is stale") Reported-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Jakub Kicinski [Wed, 12 Feb 2025 03:51:16 +0000 (19:51 -0800)]
Merge branch '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue
Tony Nguyen says:
====================
Intel Wired LAN Driver Updates 2025-02-10 (ice, igc, e1000e)
For ice:
Karol, Jake, and Michal add PTP support for E830 devices. Karol
refactors and cleans up PTP code. Jake allows for a common
cross-timestamp implementation to be shared for all devices and
Michal adds E830 support.
Mateusz cleans up initial Flow Director rule creation to loop rather
than duplicate repeated similar calls.
For igc:
Siang adjust calls to remove need for close and open calls on loading
XDP program.
For e1000e:
Gerhard Engleder batches register writes for writing multicast table
on real-time kernels.
* '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue:
e1000e: Fix real-time violations on link up
igc: Avoid unnecessary link down event in XDP_SETUP_PROG process
ice: refactor ice_fdir_create_dflt_rules() function
ice: Implement PTP support for E830 devices
ice: Refactor ice_ptp_init_tx_*
ice: Add unified ice_capture_crosststamp
ice: Process TSYN IRQ in a separate function
ice: Use FIELD_PREP for timestamp values
ice: Remove unnecessary ice_is_e8xx() functions
ice: Don't check device type when checking GNSS presence
====================
Edward Cree [Mon, 10 Feb 2025 11:25:45 +0000 (11:25 +0000)]
sfc: document devlink flash support
Update the information in sfc's devlink documentation including
support for firmware update with devlink flash.
Also update the help text for CONFIG_SFC_MTD, as it is no longer
strictly required for firmware updates.