From: Sasha Levin Date: Mon, 29 May 2023 02:43:51 +0000 (-0400) Subject: Fixes for 6.1 X-Git-Tag: v4.14.316~9 X-Git-Url: http://git.ipfire.org/?a=commitdiff_plain;h=a8d6d7975c3d8f0277d73026bf099b26bd0ed4ea;p=thirdparty%2Fkernel%2Fstable-queue.git Fixes for 6.1 Signed-off-by: Sasha Levin --- diff --git a/queue-6.1/blk-mq-fix-race-condition-in-active-queue-accounting.patch b/queue-6.1/blk-mq-fix-race-condition-in-active-queue-accounting.patch new file mode 100644 index 00000000000..fda7a4d8c21 --- /dev/null +++ b/queue-6.1/blk-mq-fix-race-condition-in-active-queue-accounting.patch @@ -0,0 +1,57 @@ +From 7d3619700ce03dbf88f176046df552aec7a6646a Mon Sep 17 00:00:00 2001 +From: Sasha Levin +Date: Mon, 22 May 2023 17:05:55 -0400 +Subject: blk-mq: fix race condition in active queue accounting + +From: Tian Lan + +[ Upstream commit 3e94d54e83cafd2b562bb6d15bb2f72d76200fb5 ] + +If multiple CPUs are sharing the same hardware queue, it can +cause leak in the active queue counter tracking when __blk_mq_tag_busy() +is executed simultaneously. + +Fixes: ee78ec1077d3 ("blk-mq: blk_mq_tag_busy is no need to return a value") +Signed-off-by: Tian Lan +Reviewed-by: Ming Lei +Reviewed-by: Damien Le Moal +Reviewed-by: John Garry +Link: https://lore.kernel.org/r/20230522210555.794134-1-tilan7663@gmail.com +Signed-off-by: Jens Axboe +Signed-off-by: Sasha Levin +--- + block/blk-mq-tag.c | 12 ++++++++---- + 1 file changed, 8 insertions(+), 4 deletions(-) + +diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c +index 9eb968e14d31f..a80d7c62bdfe6 100644 +--- a/block/blk-mq-tag.c ++++ b/block/blk-mq-tag.c +@@ -41,16 +41,20 @@ void __blk_mq_tag_busy(struct blk_mq_hw_ctx *hctx) + { + unsigned int users; + ++ /* ++ * calling test_bit() prior to test_and_set_bit() is intentional, ++ * it avoids dirtying the cacheline if the queue is already active. ++ */ + if (blk_mq_is_shared_tags(hctx->flags)) { + struct request_queue *q = hctx->queue; + +- if (test_bit(QUEUE_FLAG_HCTX_ACTIVE, &q->queue_flags)) ++ if (test_bit(QUEUE_FLAG_HCTX_ACTIVE, &q->queue_flags) || ++ test_and_set_bit(QUEUE_FLAG_HCTX_ACTIVE, &q->queue_flags)) + return; +- set_bit(QUEUE_FLAG_HCTX_ACTIVE, &q->queue_flags); + } else { +- if (test_bit(BLK_MQ_S_TAG_ACTIVE, &hctx->state)) ++ if (test_bit(BLK_MQ_S_TAG_ACTIVE, &hctx->state) || ++ test_and_set_bit(BLK_MQ_S_TAG_ACTIVE, &hctx->state)) + return; +- set_bit(BLK_MQ_S_TAG_ACTIVE, &hctx->state); + } + + users = atomic_inc_return(&hctx->tags->active_queues); +-- +2.39.2 + diff --git a/queue-6.1/bpf-sockmap-convert-schedule_work-into-delayed_work.patch b/queue-6.1/bpf-sockmap-convert-schedule_work-into-delayed_work.patch new file mode 100644 index 00000000000..c51c5d7ec14 --- /dev/null +++ b/queue-6.1/bpf-sockmap-convert-schedule_work-into-delayed_work.patch @@ -0,0 +1,190 @@ +From 907ed25a5d69d2ddd7edd2d08a54a1079ece1211 Mon Sep 17 00:00:00 2001 +From: Sasha Levin +Date: Mon, 22 May 2023 19:56:06 -0700 +Subject: bpf, sockmap: Convert schedule_work into delayed_work + +From: John Fastabend + +[ Upstream commit 29173d07f79883ac94f5570294f98af3d4287382 ] + +Sk_buffs are fed into sockmap verdict programs either from a strparser +(when the user might want to decide how framing of skb is done by attaching +another parser program) or directly through tcp_read_sock. The +tcp_read_sock is the preferred method for performance when the BPF logic is +a stream parser. 
+ +The flow for Cilium's common use case with a stream parser is, + + tcp_read_sock() + sk_psock_verdict_recv + ret = bpf_prog_run_pin_on_cpu() + sk_psock_verdict_apply(sock, skb, ret) + // if system is under memory pressure or app is slow we may + // need to queue skb. Do this queuing through ingress_skb and + // then kick timer to wake up handler + skb_queue_tail(ingress_skb, skb) + schedule_work(work); + +The work queue is wired up to sk_psock_backlog(). This will then walk the +ingress_skb skb list that holds our sk_buffs that could not be handled, +but should be OK to run at some later point. However, its possible that +the workqueue doing this work still hits an error when sending the skb. +When this happens the skbuff is requeued on a temporary 'state' struct +kept with the workqueue. This is necessary because its possible to +partially send an skbuff before hitting an error and we need to know how +and where to restart when the workqueue runs next. + +Now for the trouble, we don't rekick the workqueue. This can cause a +stall where the skbuff we just cached on the state variable might never +be sent. This happens when its the last packet in a flow and no further +packets come along that would cause the system to kick the workqueue from +that side. + +To fix we could do simple schedule_work(), but while under memory pressure +it makes sense to back off some instead of continue to retry repeatedly. So +instead to fix convert schedule_work to schedule_delayed_work and add +backoff logic to reschedule from backlog queue on errors. Its not obvious +though what a good backoff is so use '1'. + +To test we observed some flakes whil running NGINX compliance test with +sockmap we attributed these failed test to this bug and subsequent issue. + +>From on list discussion. This commit + + bec217197b41("skmsg: Schedule psock work if the cached skb exists on the psock") + +was intended to address similar race, but had a couple cases it missed. +Most obvious it only accounted for receiving traffic on the local socket +so if redirecting into another socket we could still get an sk_buff stuck +here. Next it missed the case where copied=0 in the recv() handler and +then we wouldn't kick the scheduler. Also its sub-optimal to require +userspace to kick the internal mechanisms of sockmap to wake it up and +copy data to user. It results in an extra syscall and requires the app +to actual handle the EAGAIN correctly. 
+ +Fixes: 04919bed948dc ("tcp: Introduce tcp_read_skb()") +Signed-off-by: John Fastabend +Signed-off-by: Daniel Borkmann +Tested-by: William Findlay +Reviewed-by: Jakub Sitnicki +Link: https://lore.kernel.org/bpf/20230523025618.113937-3-john.fastabend@gmail.com +Signed-off-by: Sasha Levin +--- + include/linux/skmsg.h | 2 +- + net/core/skmsg.c | 21 ++++++++++++++------- + net/core/sock_map.c | 3 ++- + 3 files changed, 17 insertions(+), 9 deletions(-) + +diff --git a/include/linux/skmsg.h b/include/linux/skmsg.h +index 84f787416a54d..904ff9a32ad61 100644 +--- a/include/linux/skmsg.h ++++ b/include/linux/skmsg.h +@@ -105,7 +105,7 @@ struct sk_psock { + struct proto *sk_proto; + struct mutex work_mutex; + struct sk_psock_work_state work_state; +- struct work_struct work; ++ struct delayed_work work; + struct rcu_work rwork; + }; + +diff --git a/net/core/skmsg.c b/net/core/skmsg.c +index 2b6d9519ff29c..6a9b794861f3f 100644 +--- a/net/core/skmsg.c ++++ b/net/core/skmsg.c +@@ -481,7 +481,7 @@ int sk_msg_recvmsg(struct sock *sk, struct sk_psock *psock, struct msghdr *msg, + } + out: + if (psock->work_state.skb && copied > 0) +- schedule_work(&psock->work); ++ schedule_delayed_work(&psock->work, 0); + return copied; + } + EXPORT_SYMBOL_GPL(sk_msg_recvmsg); +@@ -639,7 +639,8 @@ static void sk_psock_skb_state(struct sk_psock *psock, + + static void sk_psock_backlog(struct work_struct *work) + { +- struct sk_psock *psock = container_of(work, struct sk_psock, work); ++ struct delayed_work *dwork = to_delayed_work(work); ++ struct sk_psock *psock = container_of(dwork, struct sk_psock, work); + struct sk_psock_work_state *state = &psock->work_state; + struct sk_buff *skb = NULL; + bool ingress; +@@ -679,6 +680,12 @@ static void sk_psock_backlog(struct work_struct *work) + if (ret == -EAGAIN) { + sk_psock_skb_state(psock, state, skb, + len, off); ++ ++ /* Delay slightly to prioritize any ++ * other work that might be here. ++ */ ++ if (sk_psock_test_state(psock, SK_PSOCK_TX_ENABLED)) ++ schedule_delayed_work(&psock->work, 1); + goto end; + } + /* Hard errors break pipe and stop xmit. 
*/ +@@ -733,7 +740,7 @@ struct sk_psock *sk_psock_init(struct sock *sk, int node) + INIT_LIST_HEAD(&psock->link); + spin_lock_init(&psock->link_lock); + +- INIT_WORK(&psock->work, sk_psock_backlog); ++ INIT_DELAYED_WORK(&psock->work, sk_psock_backlog); + mutex_init(&psock->work_mutex); + INIT_LIST_HEAD(&psock->ingress_msg); + spin_lock_init(&psock->ingress_lock); +@@ -822,7 +829,7 @@ static void sk_psock_destroy(struct work_struct *work) + + sk_psock_done_strp(psock); + +- cancel_work_sync(&psock->work); ++ cancel_delayed_work_sync(&psock->work); + mutex_destroy(&psock->work_mutex); + + psock_progs_drop(&psock->progs); +@@ -937,7 +944,7 @@ static int sk_psock_skb_redirect(struct sk_psock *from, struct sk_buff *skb) + } + + skb_queue_tail(&psock_other->ingress_skb, skb); +- schedule_work(&psock_other->work); ++ schedule_delayed_work(&psock_other->work, 0); + spin_unlock_bh(&psock_other->ingress_lock); + return 0; + } +@@ -1017,7 +1024,7 @@ static int sk_psock_verdict_apply(struct sk_psock *psock, struct sk_buff *skb, + spin_lock_bh(&psock->ingress_lock); + if (sk_psock_test_state(psock, SK_PSOCK_TX_ENABLED)) { + skb_queue_tail(&psock->ingress_skb, skb); +- schedule_work(&psock->work); ++ schedule_delayed_work(&psock->work, 0); + err = 0; + } + spin_unlock_bh(&psock->ingress_lock); +@@ -1048,7 +1055,7 @@ static void sk_psock_write_space(struct sock *sk) + psock = sk_psock(sk); + if (likely(psock)) { + if (sk_psock_test_state(psock, SK_PSOCK_TX_ENABLED)) +- schedule_work(&psock->work); ++ schedule_delayed_work(&psock->work, 0); + write_space = psock->saved_write_space; + } + rcu_read_unlock(); +diff --git a/net/core/sock_map.c b/net/core/sock_map.c +index a68a7290a3b2b..d382672018928 100644 +--- a/net/core/sock_map.c ++++ b/net/core/sock_map.c +@@ -1624,9 +1624,10 @@ void sock_map_close(struct sock *sk, long timeout) + rcu_read_unlock(); + sk_psock_stop(psock); + release_sock(sk); +- cancel_work_sync(&psock->work); ++ cancel_delayed_work_sync(&psock->work); + sk_psock_put(sk, psock); + } ++ + /* Make sure we do not recurse. This is a bug. + * Leak the socket instead of crashing on a stack overflow. + */ +-- +2.39.2 + diff --git a/queue-6.1/bpf-sockmap-handle-fin-correctly.patch b/queue-6.1/bpf-sockmap-handle-fin-correctly.patch new file mode 100644 index 00000000000..9d6bb0e88d5 --- /dev/null +++ b/queue-6.1/bpf-sockmap-handle-fin-correctly.patch @@ -0,0 +1,83 @@ +From a0481dab8243bb79c9e5d41af86f6d19f7298477 Mon Sep 17 00:00:00 2001 +From: Sasha Levin +Date: Mon, 22 May 2023 19:56:09 -0700 +Subject: bpf, sockmap: Handle fin correctly + +From: John Fastabend + +[ Upstream commit 901546fd8f9ca4b5c481ce00928ab425ce9aacc0 ] + +The sockmap code is returning EAGAIN after a FIN packet is received and no +more data is on the receive queue. Correct behavior is to return 0 to the +user and the user can then close the socket. The EAGAIN causes many apps +to retry which masks the problem. Eventually the socket is evicted from +the sockmap because its released from sockmap sock free handling. The +issue creates a delay and can cause some errors on application side. + +To fix this check on sk_msg_recvmsg side if length is zero and FIN flag +is set then set return to zero. A selftest will be added to check this +condition. 
+ +Fixes: 04919bed948dc ("tcp: Introduce tcp_read_skb()") +Signed-off-by: John Fastabend +Signed-off-by: Daniel Borkmann +Tested-by: William Findlay +Reviewed-by: Jakub Sitnicki +Link: https://lore.kernel.org/bpf/20230523025618.113937-6-john.fastabend@gmail.com +Signed-off-by: Sasha Levin +--- + net/ipv4/tcp_bpf.c | 31 +++++++++++++++++++++++++++++++ + 1 file changed, 31 insertions(+) + +diff --git a/net/ipv4/tcp_bpf.c b/net/ipv4/tcp_bpf.c +index 2e9547467edbe..73c13642d47f6 100644 +--- a/net/ipv4/tcp_bpf.c ++++ b/net/ipv4/tcp_bpf.c +@@ -174,6 +174,24 @@ static int tcp_msg_wait_data(struct sock *sk, struct sk_psock *psock, + return ret; + } + ++static bool is_next_msg_fin(struct sk_psock *psock) ++{ ++ struct scatterlist *sge; ++ struct sk_msg *msg_rx; ++ int i; ++ ++ msg_rx = sk_psock_peek_msg(psock); ++ i = msg_rx->sg.start; ++ sge = sk_msg_elem(msg_rx, i); ++ if (!sge->length) { ++ struct sk_buff *skb = msg_rx->skb; ++ ++ if (skb && TCP_SKB_CB(skb)->tcp_flags & TCPHDR_FIN) ++ return true; ++ } ++ return false; ++} ++ + static int tcp_bpf_recvmsg_parser(struct sock *sk, + struct msghdr *msg, + size_t len, +@@ -196,6 +214,19 @@ static int tcp_bpf_recvmsg_parser(struct sock *sk, + lock_sock(sk); + msg_bytes_ready: + copied = sk_msg_recvmsg(sk, psock, msg, len, flags); ++ /* The typical case for EFAULT is the socket was gracefully ++ * shutdown with a FIN pkt. So check here the other case is ++ * some error on copy_page_to_iter which would be unexpected. ++ * On fin return correct return code to zero. ++ */ ++ if (copied == -EFAULT) { ++ bool is_fin = is_next_msg_fin(psock); ++ ++ if (is_fin) { ++ copied = 0; ++ goto out; ++ } ++ } + if (!copied) { + long timeo; + int data; +-- +2.39.2 + diff --git a/queue-6.1/bpf-sockmap-improved-check-for-empty-queue.patch b/queue-6.1/bpf-sockmap-improved-check-for-empty-queue.patch new file mode 100644 index 00000000000..eea008995a4 --- /dev/null +++ b/queue-6.1/bpf-sockmap-improved-check-for-empty-queue.patch @@ -0,0 +1,178 @@ +From b30cce8f5c1fd0d245cc4075f7cfe47d1ea91c85 Mon Sep 17 00:00:00 2001 +From: Sasha Levin +Date: Mon, 22 May 2023 19:56:08 -0700 +Subject: bpf, sockmap: Improved check for empty queue + +From: John Fastabend + +[ Upstream commit 405df89dd52cbcd69a3cd7d9a10d64de38f854b2 ] + +We noticed some rare sk_buffs were stepping past the queue when system was +under memory pressure. The general theory is to skip enqueueing +sk_buffs when its not necessary which is the normal case with a system +that is properly provisioned for the task, no memory pressure and enough +cpu assigned. + +But, if we can't allocate memory due to an ENOMEM error when enqueueing +the sk_buff into the sockmap receive queue we push it onto a delayed +workqueue to retry later. When a new sk_buff is received we then check +if that queue is empty. However, there is a problem with simply checking +the queue length. When a sk_buff is being processed from the ingress queue +but not yet on the sockmap msg receive queue its possible to also recv +a sk_buff through normal path. It will check the ingress queue which is +zero and then skip ahead of the pkt being processed. + +Previously we used sock lock from both contexts which made the problem +harder to hit, but not impossible. + +To fix instead of popping the skb from the queue entirely we peek the +skb from the queue and do the copy there. This ensures checks to the +queue length are non-zero while skb is being processed. 
Then finally +when the entire skb has been copied to user space queue or another +socket we pop it off the queue. This way the queue length check allows +bypassing the queue only after the list has been completely processed. + +To reproduce issue we run NGINX compliance test with sockmap running and +observe some flakes in our testing that we attributed to this issue. + +Fixes: 04919bed948dc ("tcp: Introduce tcp_read_skb()") +Suggested-by: Jakub Sitnicki +Signed-off-by: John Fastabend +Signed-off-by: Daniel Borkmann +Tested-by: William Findlay +Reviewed-by: Jakub Sitnicki +Link: https://lore.kernel.org/bpf/20230523025618.113937-5-john.fastabend@gmail.com +Signed-off-by: Sasha Levin +--- + include/linux/skmsg.h | 1 - + net/core/skmsg.c | 32 ++++++++------------------------ + 2 files changed, 8 insertions(+), 25 deletions(-) + +diff --git a/include/linux/skmsg.h b/include/linux/skmsg.h +index 904ff9a32ad61..054d7911bfc9f 100644 +--- a/include/linux/skmsg.h ++++ b/include/linux/skmsg.h +@@ -71,7 +71,6 @@ struct sk_psock_link { + }; + + struct sk_psock_work_state { +- struct sk_buff *skb; + u32 len; + u32 off; + }; +diff --git a/net/core/skmsg.c b/net/core/skmsg.c +index 2dfb6e31e8d04..d3ffca1b96462 100644 +--- a/net/core/skmsg.c ++++ b/net/core/skmsg.c +@@ -621,16 +621,12 @@ static int sk_psock_handle_skb(struct sk_psock *psock, struct sk_buff *skb, + + static void sk_psock_skb_state(struct sk_psock *psock, + struct sk_psock_work_state *state, +- struct sk_buff *skb, + int len, int off) + { + spin_lock_bh(&psock->ingress_lock); + if (sk_psock_test_state(psock, SK_PSOCK_TX_ENABLED)) { +- state->skb = skb; + state->len = len; + state->off = off; +- } else { +- sock_drop(psock->sk, skb); + } + spin_unlock_bh(&psock->ingress_lock); + } +@@ -641,23 +637,17 @@ static void sk_psock_backlog(struct work_struct *work) + struct sk_psock *psock = container_of(dwork, struct sk_psock, work); + struct sk_psock_work_state *state = &psock->work_state; + struct sk_buff *skb = NULL; ++ u32 len = 0, off = 0; + bool ingress; +- u32 len, off; + int ret; + + mutex_lock(&psock->work_mutex); +- if (unlikely(state->skb)) { +- spin_lock_bh(&psock->ingress_lock); +- skb = state->skb; ++ if (unlikely(state->len)) { + len = state->len; + off = state->off; +- state->skb = NULL; +- spin_unlock_bh(&psock->ingress_lock); + } +- if (skb) +- goto start; + +- while ((skb = skb_dequeue(&psock->ingress_skb))) { ++ while ((skb = skb_peek(&psock->ingress_skb))) { + len = skb->len; + off = 0; + if (skb_bpf_strparser(skb)) { +@@ -666,7 +656,6 @@ static void sk_psock_backlog(struct work_struct *work) + off = stm->offset; + len = stm->full_len; + } +-start: + ingress = skb_bpf_ingress(skb); + skb_bpf_redirect_clear(skb); + do { +@@ -676,8 +665,7 @@ static void sk_psock_backlog(struct work_struct *work) + len, ingress); + if (ret <= 0) { + if (ret == -EAGAIN) { +- sk_psock_skb_state(psock, state, skb, +- len, off); ++ sk_psock_skb_state(psock, state, len, off); + + /* Delay slightly to prioritize any + * other work that might be here. +@@ -689,15 +677,16 @@ static void sk_psock_backlog(struct work_struct *work) + /* Hard errors break pipe and stop xmit. */ + sk_psock_report_error(psock, ret ? 
-ret : EPIPE); + sk_psock_clear_state(psock, SK_PSOCK_TX_ENABLED); +- sock_drop(psock->sk, skb); + goto end; + } + off += ret; + len -= ret; + } while (len); + +- if (!ingress) ++ skb = skb_dequeue(&psock->ingress_skb); ++ if (!ingress) { + kfree_skb(skb); ++ } + } + end: + mutex_unlock(&psock->work_mutex); +@@ -790,11 +779,6 @@ static void __sk_psock_zap_ingress(struct sk_psock *psock) + skb_bpf_redirect_clear(skb); + sock_drop(psock->sk, skb); + } +- kfree_skb(psock->work_state.skb); +- /* We null the skb here to ensure that calls to sk_psock_backlog +- * do not pick up the free'd skb. +- */ +- psock->work_state.skb = NULL; + __sk_psock_purge_ingress_msg(psock); + } + +@@ -813,7 +797,6 @@ void sk_psock_stop(struct sk_psock *psock) + spin_lock_bh(&psock->ingress_lock); + sk_psock_clear_state(psock, SK_PSOCK_TX_ENABLED); + sk_psock_cork_free(psock); +- __sk_psock_zap_ingress(psock); + spin_unlock_bh(&psock->ingress_lock); + } + +@@ -828,6 +811,7 @@ static void sk_psock_destroy(struct work_struct *work) + sk_psock_done_strp(psock); + + cancel_delayed_work_sync(&psock->work); ++ __sk_psock_zap_ingress(psock); + mutex_destroy(&psock->work_mutex); + + psock_progs_drop(&psock->progs); +-- +2.39.2 + diff --git a/queue-6.1/bpf-sockmap-incorrectly-handling-copied_seq.patch b/queue-6.1/bpf-sockmap-incorrectly-handling-copied_seq.patch new file mode 100644 index 00000000000..f645c8d25a0 --- /dev/null +++ b/queue-6.1/bpf-sockmap-incorrectly-handling-copied_seq.patch @@ -0,0 +1,235 @@ +From 6fe373073f0aed92f9c8bc739f3553e1f5eb2ece Mon Sep 17 00:00:00 2001 +From: Sasha Levin +Date: Mon, 22 May 2023 19:56:12 -0700 +Subject: bpf, sockmap: Incorrectly handling copied_seq + +From: John Fastabend + +[ Upstream commit e5c6de5fa025882babf89cecbed80acf49b987fa ] + +The read_skb() logic is incrementing the tcp->copied_seq which is used for +among other things calculating how many outstanding bytes can be read by +the application. This results in application errors, if the application +does an ioctl(FIONREAD) we return zero because this is calculated from +the copied_seq value. + +To fix this we move tcp->copied_seq accounting into the recv handler so +that we update these when the recvmsg() hook is called and data is in +fact copied into user buffers. This gives an accurate FIONREAD value +as expected and improves ACK handling. Before we were calling the +tcp_rcv_space_adjust() which would update 'number of bytes copied to +user in last RTT' which is wrong for programs returning SK_PASS. The +bytes are only copied to the user when recvmsg is handled. + +Doing the fix for recvmsg is straightforward, but fixing redirect and +SK_DROP pkts is a bit tricker. Build a tcp_psock_eat() helper and then +call this from skmsg handlers. This fixes another issue where a broken +socket with a BPF program doing a resubmit could hang the receiver. This +happened because although read_skb() consumed the skb through sock_drop() +it did not update the copied_seq. Now if a single reccv socket is +redirecting to many sockets (for example for lb) the receiver sk will be +hung even though we might expect it to continue. The hang comes from +not updating the copied_seq numbers and memory pressure resulting from +that. + +We have a slight layer problem of calling tcp_eat_skb even if its not +a TCP socket. To fix we could refactor and create per type receiver +handlers. I decided this is more work than we want in the fix and we +already have some small tweaks depending on caller that use the +helper skb_bpf_strparser(). 
So we extend that a bit and always set +the strparser bit when it is in use and then we can gate the +seq_copied updates on this. + +Fixes: 04919bed948dc ("tcp: Introduce tcp_read_skb()") +Signed-off-by: John Fastabend +Signed-off-by: Daniel Borkmann +Reviewed-by: Jakub Sitnicki +Link: https://lore.kernel.org/bpf/20230523025618.113937-9-john.fastabend@gmail.com +Signed-off-by: Sasha Levin +--- + include/net/tcp.h | 10 ++++++++++ + net/core/skmsg.c | 15 +++++++-------- + net/ipv4/tcp.c | 10 +--------- + net/ipv4/tcp_bpf.c | 28 +++++++++++++++++++++++++++- + 4 files changed, 45 insertions(+), 18 deletions(-) + +diff --git a/include/net/tcp.h b/include/net/tcp.h +index 5b70b241ce71b..0744717f5caa7 100644 +--- a/include/net/tcp.h ++++ b/include/net/tcp.h +@@ -1467,6 +1467,8 @@ static inline void tcp_adjust_rcv_ssthresh(struct sock *sk) + } + + void tcp_cleanup_rbuf(struct sock *sk, int copied); ++void __tcp_cleanup_rbuf(struct sock *sk, int copied); ++ + + /* We provision sk_rcvbuf around 200% of sk_rcvlowat. + * If 87.5 % (7/8) of the space has been consumed, we want to override +@@ -2291,6 +2293,14 @@ int tcp_bpf_update_proto(struct sock *sk, struct sk_psock *psock, bool restore); + void tcp_bpf_clone(const struct sock *sk, struct sock *newsk); + #endif /* CONFIG_BPF_SYSCALL */ + ++#ifdef CONFIG_INET ++void tcp_eat_skb(struct sock *sk, struct sk_buff *skb); ++#else ++static inline void tcp_eat_skb(struct sock *sk, struct sk_buff *skb) ++{ ++} ++#endif ++ + int tcp_bpf_sendmsg_redir(struct sock *sk, bool ingress, + struct sk_msg *msg, u32 bytes, int flags); + #endif /* CONFIG_NET_SOCK_MSG */ +diff --git a/net/core/skmsg.c b/net/core/skmsg.c +index 062612ee508c0..9e0f694515636 100644 +--- a/net/core/skmsg.c ++++ b/net/core/skmsg.c +@@ -978,10 +978,8 @@ static int sk_psock_verdict_apply(struct sk_psock *psock, struct sk_buff *skb, + err = -EIO; + sk_other = psock->sk; + if (sock_flag(sk_other, SOCK_DEAD) || +- !sk_psock_test_state(psock, SK_PSOCK_TX_ENABLED)) { +- skb_bpf_redirect_clear(skb); ++ !sk_psock_test_state(psock, SK_PSOCK_TX_ENABLED)) + goto out_free; +- } + + skb_bpf_set_ingress(skb); + +@@ -1010,18 +1008,19 @@ static int sk_psock_verdict_apply(struct sk_psock *psock, struct sk_buff *skb, + err = 0; + } + spin_unlock_bh(&psock->ingress_lock); +- if (err < 0) { +- skb_bpf_redirect_clear(skb); ++ if (err < 0) + goto out_free; +- } + } + break; + case __SK_REDIRECT: ++ tcp_eat_skb(psock->sk, skb); + err = sk_psock_skb_redirect(psock, skb); + break; + case __SK_DROP: + default: + out_free: ++ skb_bpf_redirect_clear(skb); ++ tcp_eat_skb(psock->sk, skb); + sock_drop(psock->sk, skb); + } + +@@ -1066,8 +1065,7 @@ static void sk_psock_strp_read(struct strparser *strp, struct sk_buff *skb) + skb_dst_drop(skb); + skb_bpf_redirect_clear(skb); + ret = bpf_prog_run_pin_on_cpu(prog, skb); +- if (ret == SK_PASS) +- skb_bpf_set_strparser(skb); ++ skb_bpf_set_strparser(skb); + ret = sk_psock_map_verd(ret, skb_bpf_redirect_fetch(skb)); + skb->sk = NULL; + } +@@ -1173,6 +1171,7 @@ static int sk_psock_verdict_recv(struct sock *sk, struct sk_buff *skb) + psock = sk_psock(sk); + if (unlikely(!psock)) { + len = 0; ++ tcp_eat_skb(sk, skb); + sock_drop(sk, skb); + goto out; + } +diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c +index 31156ebb759c0..021a8bf6a1898 100644 +--- a/net/ipv4/tcp.c ++++ b/net/ipv4/tcp.c +@@ -1570,7 +1570,7 @@ static int tcp_peek_sndq(struct sock *sk, struct msghdr *msg, int len) + * calculation of whether or not we must ACK for the sake of + * a window update. 
+ */ +-static void __tcp_cleanup_rbuf(struct sock *sk, int copied) ++void __tcp_cleanup_rbuf(struct sock *sk, int copied) + { + struct tcp_sock *tp = tcp_sk(sk); + bool time_to_ack = false; +@@ -1785,14 +1785,6 @@ int tcp_read_skb(struct sock *sk, skb_read_actor_t recv_actor) + break; + } + } +- WRITE_ONCE(tp->copied_seq, seq); +- +- tcp_rcv_space_adjust(sk); +- +- /* Clean up data we have read: This will do ACK frames. */ +- if (copied > 0) +- __tcp_cleanup_rbuf(sk, copied); +- + return copied; + } + EXPORT_SYMBOL(tcp_read_skb); +diff --git a/net/ipv4/tcp_bpf.c b/net/ipv4/tcp_bpf.c +index 01dd76be1a584..5f93918c063c7 100644 +--- a/net/ipv4/tcp_bpf.c ++++ b/net/ipv4/tcp_bpf.c +@@ -11,6 +11,24 @@ + #include + #include + ++void tcp_eat_skb(struct sock *sk, struct sk_buff *skb) ++{ ++ struct tcp_sock *tcp; ++ int copied; ++ ++ if (!skb || !skb->len || !sk_is_tcp(sk)) ++ return; ++ ++ if (skb_bpf_strparser(skb)) ++ return; ++ ++ tcp = tcp_sk(sk); ++ copied = tcp->copied_seq + skb->len; ++ WRITE_ONCE(tcp->copied_seq, copied); ++ tcp_rcv_space_adjust(sk); ++ __tcp_cleanup_rbuf(sk, skb->len); ++} ++ + static int bpf_tcp_ingress(struct sock *sk, struct sk_psock *psock, + struct sk_msg *msg, u32 apply_bytes, int flags) + { +@@ -198,8 +216,10 @@ static int tcp_bpf_recvmsg_parser(struct sock *sk, + int flags, + int *addr_len) + { ++ struct tcp_sock *tcp = tcp_sk(sk); ++ u32 seq = tcp->copied_seq; + struct sk_psock *psock; +- int copied; ++ int copied = 0; + + if (unlikely(flags & MSG_ERRQUEUE)) + return inet_recv_error(sk, msg, len, addr_len); +@@ -244,9 +264,11 @@ static int tcp_bpf_recvmsg_parser(struct sock *sk, + + if (is_fin) { + copied = 0; ++ seq++; + goto out; + } + } ++ seq += copied; + if (!copied) { + long timeo; + int data; +@@ -284,6 +306,10 @@ static int tcp_bpf_recvmsg_parser(struct sock *sk, + copied = -EAGAIN; + } + out: ++ WRITE_ONCE(tcp->copied_seq, seq); ++ tcp_rcv_space_adjust(sk); ++ if (copied > 0) ++ __tcp_cleanup_rbuf(sk, copied); + release_sock(sk); + sk_psock_put(sk, psock); + return copied; +-- +2.39.2 + diff --git a/queue-6.1/bpf-sockmap-pass-skb-ownership-through-read_skb.patch b/queue-6.1/bpf-sockmap-pass-skb-ownership-through-read_skb.patch new file mode 100644 index 00000000000..ffe6b2c14e4 --- /dev/null +++ b/queue-6.1/bpf-sockmap-pass-skb-ownership-through-read_skb.patch @@ -0,0 +1,159 @@ +From 4df24f4760fc039a3b25c34569712d7615d3b5a4 Mon Sep 17 00:00:00 2001 +From: Sasha Levin +Date: Mon, 22 May 2023 19:56:05 -0700 +Subject: bpf, sockmap: Pass skb ownership through read_skb + +From: John Fastabend + +[ Upstream commit 78fa0d61d97a728d306b0c23d353c0e340756437 ] + +The read_skb hook calls consume_skb() now, but this means that if the +recv_actor program wants to use the skb it needs to inc the ref cnt +so that the consume_skb() doesn't kfree the sk_buff. + +This is problematic because in some error cases under memory pressure +we may need to linearize the sk_buff from sk_psock_skb_ingress_enqueue(). +Then we get this, + + skb_linearize() + __pskb_pull_tail() + pskb_expand_head() + BUG_ON(skb_shared(skb)) + +Because we incremented users refcnt from sk_psock_verdict_recv() we +hit the bug on with refcnt > 1 and trip it. + +To fix lets simply pass ownership of the sk_buff through the skb_read +call. Then we can drop the consume from read_skb handlers and assume +the verdict recv does any required kfree. + +Bug found while testing in our CI which runs in VMs that hit memory +constraints rather regularly. William tested TCP read_skb handlers. 
+ +[ 106.536188] ------------[ cut here ]------------ +[ 106.536197] kernel BUG at net/core/skbuff.c:1693! +[ 106.536479] invalid opcode: 0000 [#1] PREEMPT SMP PTI +[ 106.536726] CPU: 3 PID: 1495 Comm: curl Not tainted 5.19.0-rc5 #1 +[ 106.537023] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ArchLinux 1.16.0-1 04/01/2014 +[ 106.537467] RIP: 0010:pskb_expand_head+0x269/0x330 +[ 106.538585] RSP: 0018:ffffc90000138b68 EFLAGS: 00010202 +[ 106.538839] RAX: 000000000000003f RBX: ffff8881048940e8 RCX: 0000000000000a20 +[ 106.539186] RDX: 0000000000000002 RSI: 0000000000000000 RDI: ffff8881048940e8 +[ 106.539529] RBP: ffffc90000138be8 R08: 00000000e161fd1a R09: 0000000000000000 +[ 106.539877] R10: 0000000000000018 R11: 0000000000000000 R12: ffff8881048940e8 +[ 106.540222] R13: 0000000000000003 R14: 0000000000000000 R15: ffff8881048940e8 +[ 106.540568] FS: 00007f277dde9f00(0000) GS:ffff88813bd80000(0000) knlGS:0000000000000000 +[ 106.540954] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 +[ 106.541227] CR2: 00007f277eeede64 CR3: 000000000ad3e000 CR4: 00000000000006e0 +[ 106.541569] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 +[ 106.541915] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 +[ 106.542255] Call Trace: +[ 106.542383] +[ 106.542487] __pskb_pull_tail+0x4b/0x3e0 +[ 106.542681] skb_ensure_writable+0x85/0xa0 +[ 106.542882] sk_skb_pull_data+0x18/0x20 +[ 106.543084] bpf_prog_b517a65a242018b0_bpf_skskb_http_verdict+0x3a9/0x4aa9 +[ 106.543536] ? migrate_disable+0x66/0x80 +[ 106.543871] sk_psock_verdict_recv+0xe2/0x310 +[ 106.544258] ? sk_psock_write_space+0x1f0/0x1f0 +[ 106.544561] tcp_read_skb+0x7b/0x120 +[ 106.544740] tcp_data_queue+0x904/0xee0 +[ 106.544931] tcp_rcv_established+0x212/0x7c0 +[ 106.545142] tcp_v4_do_rcv+0x174/0x2a0 +[ 106.545326] tcp_v4_rcv+0xe70/0xf60 +[ 106.545500] ip_protocol_deliver_rcu+0x48/0x290 +[ 106.545744] ip_local_deliver_finish+0xa7/0x150 + +Fixes: 04919bed948dc ("tcp: Introduce tcp_read_skb()") +Reported-by: William Findlay +Signed-off-by: John Fastabend +Signed-off-by: Daniel Borkmann +Tested-by: William Findlay +Reviewed-by: Jakub Sitnicki +Link: https://lore.kernel.org/bpf/20230523025618.113937-2-john.fastabend@gmail.com +Signed-off-by: Sasha Levin +--- + net/core/skmsg.c | 2 -- + net/ipv4/tcp.c | 1 - + net/ipv4/udp.c | 7 ++----- + net/unix/af_unix.c | 7 ++----- + 4 files changed, 4 insertions(+), 13 deletions(-) + +diff --git a/net/core/skmsg.c b/net/core/skmsg.c +index 53d0251788aa2..2b6d9519ff29c 100644 +--- a/net/core/skmsg.c ++++ b/net/core/skmsg.c +@@ -1180,8 +1180,6 @@ static int sk_psock_verdict_recv(struct sock *sk, struct sk_buff *skb) + int ret = __SK_DROP; + int len = skb->len; + +- skb_get(skb); +- + rcu_read_lock(); + psock = sk_psock(sk); + if (unlikely(!psock)) { +diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c +index 1fb67f819de49..31156ebb759c0 100644 +--- a/net/ipv4/tcp.c ++++ b/net/ipv4/tcp.c +@@ -1772,7 +1772,6 @@ int tcp_read_skb(struct sock *sk, skb_read_actor_t recv_actor) + WARN_ON_ONCE(!skb_set_owner_sk_safe(skb, sk)); + tcp_flags = TCP_SKB_CB(skb)->tcp_flags; + used = recv_actor(sk, skb); +- consume_skb(skb); + if (used < 0) { + if (!copied) + copied = used; +diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c +index 3ffa30c37293e..956d6797c76f3 100644 +--- a/net/ipv4/udp.c ++++ b/net/ipv4/udp.c +@@ -1806,7 +1806,7 @@ EXPORT_SYMBOL(__skb_recv_udp); + int udp_read_skb(struct sock *sk, skb_read_actor_t recv_actor) + { + struct sk_buff *skb; +- int err, copied; ++ int err; + + try_again: + skb 
= skb_recv_udp(sk, MSG_DONTWAIT, &err); +@@ -1825,10 +1825,7 @@ int udp_read_skb(struct sock *sk, skb_read_actor_t recv_actor) + } + + WARN_ON_ONCE(!skb_set_owner_sk_safe(skb, sk)); +- copied = recv_actor(sk, skb); +- kfree_skb(skb); +- +- return copied; ++ return recv_actor(sk, skb); + } + EXPORT_SYMBOL(udp_read_skb); + +diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c +index 70eb3bc67126d..5b19b6c53a2cb 100644 +--- a/net/unix/af_unix.c ++++ b/net/unix/af_unix.c +@@ -2552,7 +2552,7 @@ static int unix_read_skb(struct sock *sk, skb_read_actor_t recv_actor) + { + struct unix_sock *u = unix_sk(sk); + struct sk_buff *skb; +- int err, copied; ++ int err; + + mutex_lock(&u->iolock); + skb = skb_recv_datagram(sk, MSG_DONTWAIT, &err); +@@ -2560,10 +2560,7 @@ static int unix_read_skb(struct sock *sk, skb_read_actor_t recv_actor) + if (!skb) + return err; + +- copied = recv_actor(sk, skb); +- kfree_skb(skb); +- +- return copied; ++ return recv_actor(sk, skb); + } + + /* +-- +2.39.2 + diff --git a/queue-6.1/bpf-sockmap-reschedule-is-now-done-through-backlog.patch b/queue-6.1/bpf-sockmap-reschedule-is-now-done-through-backlog.patch new file mode 100644 index 00000000000..cb76d0f3c81 --- /dev/null +++ b/queue-6.1/bpf-sockmap-reschedule-is-now-done-through-backlog.patch @@ -0,0 +1,48 @@ +From ef1bb490282f412394b809afb82fd36538826a3f Mon Sep 17 00:00:00 2001 +From: Sasha Levin +Date: Mon, 22 May 2023 19:56:07 -0700 +Subject: bpf, sockmap: Reschedule is now done through backlog + +From: John Fastabend + +[ Upstream commit bce22552f92ea7c577f49839b8e8f7d29afaf880 ] + +Now that the backlog manages the reschedule() logic correctly we can drop +the partial fix to reschedule from recvmsg hook. + +Rescheduling on recvmsg hook was added to address a corner case where we +still had data in the backlog state but had nothing to kick it and +reschedule the backlog worker to run and finish copying data out of the +state. This had a couple limitations, first it required user space to +kick it introducing an unnecessary EBUSY and retry. Second it only +handled the ingress case and egress redirects would still be hung. + +With the correct fix, pushing the reschedule logic down to where the +enomem error occurs we can drop this fix. 
+ +Fixes: bec217197b412 ("skmsg: Schedule psock work if the cached skb exists on the psock") +Signed-off-by: John Fastabend +Signed-off-by: Daniel Borkmann +Reviewed-by: Jakub Sitnicki +Link: https://lore.kernel.org/bpf/20230523025618.113937-4-john.fastabend@gmail.com +Signed-off-by: Sasha Levin +--- + net/core/skmsg.c | 2 -- + 1 file changed, 2 deletions(-) + +diff --git a/net/core/skmsg.c b/net/core/skmsg.c +index 6a9b794861f3f..2dfb6e31e8d04 100644 +--- a/net/core/skmsg.c ++++ b/net/core/skmsg.c +@@ -480,8 +480,6 @@ int sk_msg_recvmsg(struct sock *sk, struct sk_psock *psock, struct msghdr *msg, + msg_rx = sk_psock_peek_msg(psock); + } + out: +- if (psock->work_state.skb && copied > 0) +- schedule_delayed_work(&psock->work, 0); + return copied; + } + EXPORT_SYMBOL_GPL(sk_msg_recvmsg); +-- +2.39.2 + diff --git a/queue-6.1/bpf-sockmap-tcp-data-stall-on-recv-before-accept.patch b/queue-6.1/bpf-sockmap-tcp-data-stall-on-recv-before-accept.patch new file mode 100644 index 00000000000..3e64f68bdc7 --- /dev/null +++ b/queue-6.1/bpf-sockmap-tcp-data-stall-on-recv-before-accept.patch @@ -0,0 +1,96 @@ +From 2df778add75145bf4cdf666c03e6d3c3732444e5 Mon Sep 17 00:00:00 2001 +From: Sasha Levin +Date: Mon, 22 May 2023 19:56:10 -0700 +Subject: bpf, sockmap: TCP data stall on recv before accept + +From: John Fastabend + +[ Upstream commit ea444185a6bf7da4dd0df1598ee953e4f7174858 ] + +A common mechanism to put a TCP socket into the sockmap is to hook the +BPF_SOCK_OPS_{ACTIVE_PASSIVE}_ESTABLISHED_CB event with a BPF program +that can map the socket info to the correct BPF verdict parser. When +the user adds the socket to the map the psock is created and the new +ops are assigned to ensure the verdict program will 'see' the sk_buffs +as they arrive. + +Part of this process hooks the sk_data_ready op with a BPF specific +handler to wake up the BPF verdict program when data is ready to read. +The logic is simple enough (posted here for easy reading) + + static void sk_psock_verdict_data_ready(struct sock *sk) + { + struct socket *sock = sk->sk_socket; + + if (unlikely(!sock || !sock->ops || !sock->ops->read_skb)) + return; + sock->ops->read_skb(sk, sk_psock_verdict_recv); + } + +The oversight here is sk->sk_socket is not assigned until the application +accepts() the new socket. However, its entirely ok for the peer application +to do a connect() followed immediately by sends. The socket on the receiver +is sitting on the backlog queue of the listening socket until its accepted +and the data is queued up. If the peer never accepts the socket or is slow +it will eventually hit data limits and rate limit the session. But, +important for BPF sockmap hooks when this data is received TCP stack does +the sk_data_ready() call but the read_skb() for this data is never called +because sk_socket is missing. The data sits on the sk_receive_queue. + +Then once the socket is accepted if we never receive more data from the +peer there will be no further sk_data_ready calls and all the data +is still on the sk_receive_queue(). Then user calls recvmsg after accept() +and for TCP sockets in sockmap we use the tcp_bpf_recvmsg_parser() handler. +The handler checks for data in the sk_msg ingress queue expecting that +the BPF program has already run from the sk_data_ready hook and enqueued +the data as needed. So we are stuck. + +To fix do an unlikely check in recvmsg handler for data on the +sk_receive_queue and if it exists wake up data_ready. 
We have the sock +locked in both read_skb and recvmsg so should avoid having multiple +runners. + +Fixes: 04919bed948dc ("tcp: Introduce tcp_read_skb()") +Signed-off-by: John Fastabend +Signed-off-by: Daniel Borkmann +Reviewed-by: Jakub Sitnicki +Link: https://lore.kernel.org/bpf/20230523025618.113937-7-john.fastabend@gmail.com +Signed-off-by: Sasha Levin +--- + net/ipv4/tcp_bpf.c | 20 ++++++++++++++++++++ + 1 file changed, 20 insertions(+) + +diff --git a/net/ipv4/tcp_bpf.c b/net/ipv4/tcp_bpf.c +index 73c13642d47f6..01dd76be1a584 100644 +--- a/net/ipv4/tcp_bpf.c ++++ b/net/ipv4/tcp_bpf.c +@@ -212,6 +212,26 @@ static int tcp_bpf_recvmsg_parser(struct sock *sk, + return tcp_recvmsg(sk, msg, len, flags, addr_len); + + lock_sock(sk); ++ ++ /* We may have received data on the sk_receive_queue pre-accept and ++ * then we can not use read_skb in this context because we haven't ++ * assigned a sk_socket yet so have no link to the ops. The work-around ++ * is to check the sk_receive_queue and in these cases read skbs off ++ * queue again. The read_skb hook is not running at this point because ++ * of lock_sock so we avoid having multiple runners in read_skb. ++ */ ++ if (unlikely(!skb_queue_empty(&sk->sk_receive_queue))) { ++ tcp_data_ready(sk); ++ /* This handles the ENOMEM errors if we both receive data ++ * pre accept and are already under memory pressure. At least ++ * let user know to retry. ++ */ ++ if (unlikely(!skb_queue_empty(&sk->sk_receive_queue))) { ++ copied = -EAGAIN; ++ goto out; ++ } ++ } ++ + msg_bytes_ready: + copied = sk_msg_recvmsg(sk, psock, msg, len, flags); + /* The typical case for EFAULT is the socket was gracefully +-- +2.39.2 + diff --git a/queue-6.1/bpf-sockmap-wake-up-polling-after-data-copy.patch b/queue-6.1/bpf-sockmap-wake-up-polling-after-data-copy.patch new file mode 100644 index 00000000000..78bb3ef5778 --- /dev/null +++ b/queue-6.1/bpf-sockmap-wake-up-polling-after-data-copy.patch @@ -0,0 +1,60 @@ +From 410c637b47383afd271b811512d7430abca6ea54 Mon Sep 17 00:00:00 2001 +From: Sasha Levin +Date: Mon, 22 May 2023 19:56:11 -0700 +Subject: bpf, sockmap: Wake up polling after data copy + +From: John Fastabend + +[ Upstream commit 6df7f764cd3cf5a03a4a47b23be47e57e41fcd85 ] + +When TCP stack has data ready to read sk_data_ready() is called. Sockmap +overwrites this with its own handler to call into BPF verdict program. +But, the original TCP socket had sock_def_readable that would additionally +wake up any user space waiters with sk_wake_async(). + +Sockmap saved the callback when the socket was created so call the saved +data ready callback and then we can wake up any epoll() logic waiting +on the read. + +Note we call on 'copied >= 0' to account for returning 0 when a FIN is +received because we need to wake up user for this as well so they +can do the recvmsg() -> 0 and detect the shutdown. 
+ +Fixes: 04919bed948dc ("tcp: Introduce tcp_read_skb()") +Signed-off-by: John Fastabend +Signed-off-by: Daniel Borkmann +Reviewed-by: Jakub Sitnicki +Link: https://lore.kernel.org/bpf/20230523025618.113937-8-john.fastabend@gmail.com +Signed-off-by: Sasha Levin +--- + net/core/skmsg.c | 11 ++++++++++- + 1 file changed, 10 insertions(+), 1 deletion(-) + +diff --git a/net/core/skmsg.c b/net/core/skmsg.c +index d3ffca1b96462..062612ee508c0 100644 +--- a/net/core/skmsg.c ++++ b/net/core/skmsg.c +@@ -1196,10 +1196,19 @@ static int sk_psock_verdict_recv(struct sock *sk, struct sk_buff *skb) + static void sk_psock_verdict_data_ready(struct sock *sk) + { + struct socket *sock = sk->sk_socket; ++ int copied; + + if (unlikely(!sock || !sock->ops || !sock->ops->read_skb)) + return; +- sock->ops->read_skb(sk, sk_psock_verdict_recv); ++ copied = sock->ops->read_skb(sk, sk_psock_verdict_recv); ++ if (copied >= 0) { ++ struct sk_psock *psock; ++ ++ rcu_read_lock(); ++ psock = sk_psock(sk); ++ psock->saved_data_ready(sk); ++ rcu_read_unlock(); ++ } + } + + void sk_psock_start_verdict(struct sock *sk, struct sk_psock *psock) +-- +2.39.2 + diff --git a/queue-6.1/firmware-arm_ffa-fix-usage-of-partition-info-get-cou.patch b/queue-6.1/firmware-arm_ffa-fix-usage-of-partition-info-get-cou.patch new file mode 100644 index 00000000000..516ad8191b8 --- /dev/null +++ b/queue-6.1/firmware-arm_ffa-fix-usage-of-partition-info-get-cou.patch @@ -0,0 +1,50 @@ +From 2fbb7f647a422c87e7c155ac8cb640c970027f51 Mon Sep 17 00:00:00 2001 +From: Sasha Levin +Date: Thu, 20 Apr 2023 16:06:02 +0100 +Subject: firmware: arm_ffa: Fix usage of partition info get count flag + +From: Sudeep Holla + +[ Upstream commit c6e045361a27ecd4fac6413164e0d091d80eee99 ] + +Commit bb1be7498500 ("firmware: arm_ffa: Add v1.1 get_partition_info support") +adds support to discovery the UUIDs of the partitions or just fetch the +partition count using the PARTITION_INFO_GET_RETURN_COUNT_ONLY flag. + +However the commit doesn't handle the fact that the older version doesn't +understand the flag and must be MBZ which results in firmware returning +invalid parameter error. That results in the failure of the driver probe +which is in correct. + +Limit the usage of the PARTITION_INFO_GET_RETURN_COUNT_ONLY flag for the +versions above v1.0(i.e v1.1 and onwards) which fixes the issue. 
+ +Fixes: bb1be7498500 ("firmware: arm_ffa: Add v1.1 get_partition_info support") +Reported-by: Jens Wiklander +Reported-by: Marc Bonnici +Tested-by: Jens Wiklander +Reviewed-by: Jens Wiklander +Link: https://lore.kernel.org/r/20230419-ffa_fixes_6-4-v2-2-d9108e43a176@arm.com +Signed-off-by: Sudeep Holla +Signed-off-by: Sasha Levin +--- + drivers/firmware/arm_ffa/driver.c | 3 ++- + 1 file changed, 2 insertions(+), 1 deletion(-) + +diff --git a/drivers/firmware/arm_ffa/driver.c b/drivers/firmware/arm_ffa/driver.c +index 737f36e7a9035..5904a679d3512 100644 +--- a/drivers/firmware/arm_ffa/driver.c ++++ b/drivers/firmware/arm_ffa/driver.c +@@ -274,7 +274,8 @@ __ffa_partition_info_get(u32 uuid0, u32 uuid1, u32 uuid2, u32 uuid3, + int idx, count, flags = 0, sz, buf_sz; + ffa_value_t partition_info; + +- if (!buffer || !num_partitions) /* Just get the count for now */ ++ if (drv_info->version > FFA_VERSION_1_0 && ++ (!buffer || !num_partitions)) /* Just get the count for now */ + flags = PARTITION_INFO_GET_RETURN_COUNT_ONLY; + + mutex_lock(&drv_info->rx_lock); +-- +2.39.2 + diff --git a/queue-6.1/gpio-f7188x-fix-chip-name-and-pin-count-on-nuvoton-c.patch b/queue-6.1/gpio-f7188x-fix-chip-name-and-pin-count-on-nuvoton-c.patch new file mode 100644 index 00000000000..c4032476bed --- /dev/null +++ b/queue-6.1/gpio-f7188x-fix-chip-name-and-pin-count-on-nuvoton-c.patch @@ -0,0 +1,147 @@ +From 9a125be940383e0c3453bff0d1ded868ddb199b2 Mon Sep 17 00:00:00 2001 +From: Sasha Levin +Date: Thu, 27 Apr 2023 17:20:55 +0200 +Subject: gpio-f7188x: fix chip name and pin count on Nuvoton chip + +From: Henning Schild + +[ Upstream commit 3002b8642f016d7fe3ff56240dacea1075f6b877 ] + +In fact the device with chip id 0xD283 is called NCT6126D, and that is +the chip id the Nuvoton code was written for. Correct that name to avoid +confusion, because a NCT6116D in fact exists as well but has another +chip id, and is currently not supported. + +The look at the spec also revealed that GPIO group7 in fact has 8 pins, +so correct the pin count in that group as well. + +Fixes: d0918a84aff0 ("gpio-f7188x: Add GPIO support for Nuvoton NCT6116") +Reported-by: Xing Tong Wu +Signed-off-by: Henning Schild +Acked-by: Simon Guinot +Signed-off-by: Bartosz Golaszewski +Signed-off-by: Sasha Levin +--- + drivers/gpio/Kconfig | 2 +- + drivers/gpio/gpio-f7188x.c | 28 ++++++++++++++-------------- + 2 files changed, 15 insertions(+), 15 deletions(-) + +diff --git a/drivers/gpio/Kconfig b/drivers/gpio/Kconfig +index e3af86f06c630..3e8e5f4ffa59f 100644 +--- a/drivers/gpio/Kconfig ++++ b/drivers/gpio/Kconfig +@@ -882,7 +882,7 @@ config GPIO_F7188X + help + This option enables support for GPIOs found on Fintek Super-I/O + chips F71869, F71869A, F71882FG, F71889F and F81866. +- As well as Nuvoton Super-I/O chip NCT6116D. ++ As well as Nuvoton Super-I/O chip NCT6126D. + + To compile this driver as a module, choose M here: the module will + be called f7188x-gpio. +diff --git a/drivers/gpio/gpio-f7188x.c b/drivers/gpio/gpio-f7188x.c +index 9effa7769bef5..f54ca5a1775ea 100644 +--- a/drivers/gpio/gpio-f7188x.c ++++ b/drivers/gpio/gpio-f7188x.c +@@ -48,7 +48,7 @@ + /* + * Nuvoton devices. 
+ */ +-#define SIO_NCT6116D_ID 0xD283 /* NCT6116D chipset ID */ ++#define SIO_NCT6126D_ID 0xD283 /* NCT6126D chipset ID */ + + #define SIO_LD_GPIO_NUVOTON 0x07 /* GPIO logical device */ + +@@ -62,7 +62,7 @@ enum chips { + f81866, + f81804, + f81865, +- nct6116d, ++ nct6126d, + }; + + static const char * const f7188x_names[] = { +@@ -74,7 +74,7 @@ static const char * const f7188x_names[] = { + "f81866", + "f81804", + "f81865", +- "nct6116d", ++ "nct6126d", + }; + + struct f7188x_sio { +@@ -187,8 +187,8 @@ static int f7188x_gpio_set_config(struct gpio_chip *chip, unsigned offset, + /* Output mode register (0:open drain 1:push-pull). */ + #define f7188x_gpio_out_mode(base) ((base) + 3) + +-#define f7188x_gpio_dir_invert(type) ((type) == nct6116d) +-#define f7188x_gpio_data_single(type) ((type) == nct6116d) ++#define f7188x_gpio_dir_invert(type) ((type) == nct6126d) ++#define f7188x_gpio_data_single(type) ((type) == nct6126d) + + static struct f7188x_gpio_bank f71869_gpio_bank[] = { + F7188X_GPIO_BANK(0, 6, 0xF0, DRVNAME "-0"), +@@ -274,7 +274,7 @@ static struct f7188x_gpio_bank f81865_gpio_bank[] = { + F7188X_GPIO_BANK(60, 5, 0x90, DRVNAME "-6"), + }; + +-static struct f7188x_gpio_bank nct6116d_gpio_bank[] = { ++static struct f7188x_gpio_bank nct6126d_gpio_bank[] = { + F7188X_GPIO_BANK(0, 8, 0xE0, DRVNAME "-0"), + F7188X_GPIO_BANK(10, 8, 0xE4, DRVNAME "-1"), + F7188X_GPIO_BANK(20, 8, 0xE8, DRVNAME "-2"), +@@ -282,7 +282,7 @@ static struct f7188x_gpio_bank nct6116d_gpio_bank[] = { + F7188X_GPIO_BANK(40, 8, 0xF0, DRVNAME "-4"), + F7188X_GPIO_BANK(50, 8, 0xF4, DRVNAME "-5"), + F7188X_GPIO_BANK(60, 8, 0xF8, DRVNAME "-6"), +- F7188X_GPIO_BANK(70, 1, 0xFC, DRVNAME "-7"), ++ F7188X_GPIO_BANK(70, 8, 0xFC, DRVNAME "-7"), + }; + + static int f7188x_gpio_get_direction(struct gpio_chip *chip, unsigned offset) +@@ -490,9 +490,9 @@ static int f7188x_gpio_probe(struct platform_device *pdev) + data->nr_bank = ARRAY_SIZE(f81865_gpio_bank); + data->bank = f81865_gpio_bank; + break; +- case nct6116d: +- data->nr_bank = ARRAY_SIZE(nct6116d_gpio_bank); +- data->bank = nct6116d_gpio_bank; ++ case nct6126d: ++ data->nr_bank = ARRAY_SIZE(nct6126d_gpio_bank); ++ data->bank = nct6126d_gpio_bank; + break; + default: + return -ENODEV; +@@ -559,9 +559,9 @@ static int __init f7188x_find(int addr, struct f7188x_sio *sio) + case SIO_F81865_ID: + sio->type = f81865; + break; +- case SIO_NCT6116D_ID: ++ case SIO_NCT6126D_ID: + sio->device = SIO_LD_GPIO_NUVOTON; +- sio->type = nct6116d; ++ sio->type = nct6126d; + break; + default: + pr_info("Unsupported Fintek device 0x%04x\n", devid); +@@ -569,7 +569,7 @@ static int __init f7188x_find(int addr, struct f7188x_sio *sio) + } + + /* double check manufacturer where possible */ +- if (sio->type != nct6116d) { ++ if (sio->type != nct6126d) { + manid = superio_inw(addr, SIO_FINTEK_MANID); + if (manid != SIO_FINTEK_ID) { + pr_debug("Not a Fintek device at 0x%08x\n", addr); +@@ -581,7 +581,7 @@ static int __init f7188x_find(int addr, struct f7188x_sio *sio) + err = 0; + + pr_info("Found %s at %#x\n", f7188x_names[sio->type], (unsigned int)addr); +- if (sio->type != nct6116d) ++ if (sio->type != nct6126d) + pr_info(" revision %d\n", superio_inb(addr, SIO_FINTEK_DEVREV)); + + err: +-- +2.39.2 + diff --git a/queue-6.1/inet-add-ip_local_port_range-socket-option.patch b/queue-6.1/inet-add-ip_local_port_range-socket-option.patch new file mode 100644 index 00000000000..d63c499da26 --- /dev/null +++ b/queue-6.1/inet-add-ip_local_port_range-socket-option.patch @@ -0,0 +1,311 @@ +From 
584b6e562ba94b1fbed370fb6cc60bb54e7b519b Mon Sep 17 00:00:00 2001 +From: Sasha Levin +Date: Tue, 24 Jan 2023 14:36:43 +0100 +Subject: inet: Add IP_LOCAL_PORT_RANGE socket option + +From: Jakub Sitnicki + +[ Upstream commit 91d0b78c5177f3e42a4d8738af8ac19c3a90d002 ] + +Users who want to share a single public IP address for outgoing connections +between several hosts traditionally reach for SNAT. However, SNAT requires +state keeping on the node(s) performing the NAT. + +A stateless alternative exists, where a single IP address used for egress +can be shared between several hosts by partitioning the available ephemeral +port range. In such a setup: + +1. Each host gets assigned a disjoint range of ephemeral ports. +2. Applications open connections from the host-assigned port range. +3. Return traffic gets routed to the host based on both, the destination IP + and the destination port. + +An application which wants to open an outgoing connection (connect) from a +given port range today can choose between two solutions: + +1. Manually pick the source port by bind()'ing to it before connect()'ing + the socket. + + This approach has a couple of downsides: + + a) Search for a free port has to be implemented in the user-space. If + the chosen 4-tuple happens to be busy, the application needs to retry + from a different local port number. + + Detecting if 4-tuple is busy can be either easy (TCP) or hard + (UDP). In TCP case, the application simply has to check if connect() + returned an error (EADDRNOTAVAIL). That is assuming that the local + port sharing was enabled (REUSEADDR) by all the sockets. + + # Assume desired local port range is 60_000-60_511 + s = socket(AF_INET, SOCK_STREAM) + s.setsockopt(SOL_SOCKET, SO_REUSEADDR, 1) + s.bind(("192.0.2.1", 60_000)) + s.connect(("1.1.1.1", 53)) + # Fails only if 192.0.2.1:60000 -> 1.1.1.1:53 is busy + # Application must retry with another local port + + In case of UDP, the network stack allows binding more than one socket + to the same 4-tuple, when local port sharing is enabled + (REUSEADDR). Hence detecting the conflict is much harder and involves + querying sock_diag and toggling the REUSEADDR flag [1]. + + b) For TCP, bind()-ing to a port within the ephemeral port range means + that no connecting sockets, that is those which leave it to the + network stack to find a free local port at connect() time, can use + the this port. + + IOW, the bind hash bucket tb->fastreuse will be 0 or 1, and the port + will be skipped during the free port search at connect() time. + +2. Isolate the app in a dedicated netns and use the use the per-netns + ip_local_port_range sysctl to adjust the ephemeral port range bounds. + + The per-netns setting affects all sockets, so this approach can be used + only if: + + - there is just one egress IP address, or + - the desired egress port range is the same for all egress IP addresses + used by the application. + + For TCP, this approach avoids the downsides of (1). Free port search and + 4-tuple conflict detection is done by the network stack: + + system("sysctl -w net.ipv4.ip_local_port_range='60000 60511'") + + s = socket(AF_INET, SOCK_STREAM) + s.setsockopt(SOL_IP, IP_BIND_ADDRESS_NO_PORT, 1) + s.bind(("192.0.2.1", 0)) + s.connect(("1.1.1.1", 53)) + # Fails if all 4-tuples 192.0.2.1:60000-60511 -> 1.1.1.1:53 are busy + + For UDP this approach has limited applicability. Setting the + IP_BIND_ADDRESS_NO_PORT socket option does not result in local source + port being shared with other connected UDP sockets. 
+ + Hence relying on the network stack to find a free source port, limits the + number of outgoing UDP flows from a single IP address down to the number + of available ephemeral ports. + +To put it another way, partitioning the ephemeral port range between hosts +using the existing Linux networking API is cumbersome. + +To address this use case, add a new socket option at the SOL_IP level, +named IP_LOCAL_PORT_RANGE. The new option can be used to clamp down the +ephemeral port range for each socket individually. + +The option can be used only to narrow down the per-netns local port +range. If the per-socket range lies outside of the per-netns range, the +latter takes precedence. + +UAPI-wise, the low and high range bounds are passed to the kernel as a pair +of u16 values in host byte order packed into a u32. This avoids pointer +passing. + + PORT_LO = 40_000 + PORT_HI = 40_511 + + s = socket(AF_INET, SOCK_STREAM) + v = struct.pack("I", PORT_HI << 16 | PORT_LO) + s.setsockopt(SOL_IP, IP_LOCAL_PORT_RANGE, v) + s.bind(("127.0.0.1", 0)) + s.getsockname() + # Local address between ("127.0.0.1", 40_000) and ("127.0.0.1", 40_511), + # if there is a free port. EADDRINUSE otherwise. + +[1] https://github.com/cloudflare/cloudflare-blog/blob/232b432c1d57/2022-02-connectx/connectx.py#L116 + +Reviewed-by: Marek Majkowski +Reviewed-by: Kuniyuki Iwashima +Signed-off-by: Jakub Sitnicki +Reviewed-by: Eric Dumazet +Signed-off-by: Jakub Kicinski +Stable-dep-of: 3632679d9e4f ("ipv{4,6}/raw: fix output xfrm lookup wrt protocol") +Signed-off-by: Sasha Levin +--- + include/net/inet_sock.h | 4 ++++ + include/net/ip.h | 3 ++- + include/uapi/linux/in.h | 1 + + net/ipv4/inet_connection_sock.c | 25 +++++++++++++++++++++++-- + net/ipv4/inet_hashtables.c | 2 +- + net/ipv4/ip_sockglue.c | 18 ++++++++++++++++++ + net/ipv4/udp.c | 2 +- + net/sctp/socket.c | 2 +- + 8 files changed, 51 insertions(+), 6 deletions(-) + +diff --git a/include/net/inet_sock.h b/include/net/inet_sock.h +index bf5654ce711ef..51857117ac099 100644 +--- a/include/net/inet_sock.h ++++ b/include/net/inet_sock.h +@@ -249,6 +249,10 @@ struct inet_sock { + __be32 mc_addr; + struct ip_mc_socklist __rcu *mc_list; + struct inet_cork_full cork; ++ struct { ++ __u16 lo; ++ __u16 hi; ++ } local_port_range; + }; + + #define IPCORK_OPT 1 /* ip-options has been held in ipcork.opt */ +diff --git a/include/net/ip.h b/include/net/ip.h +index 144bdfbb25afe..c3fffaa92d6e0 100644 +--- a/include/net/ip.h ++++ b/include/net/ip.h +@@ -340,7 +340,8 @@ static inline u64 snmp_fold_field64(void __percpu *mib, int offt, size_t syncp_o + } \ + } + +-void inet_get_local_port_range(struct net *net, int *low, int *high); ++void inet_get_local_port_range(const struct net *net, int *low, int *high); ++void inet_sk_get_local_port_range(const struct sock *sk, int *low, int *high); + + #ifdef CONFIG_SYSCTL + static inline bool inet_is_local_reserved_port(struct net *net, unsigned short port) +diff --git a/include/uapi/linux/in.h b/include/uapi/linux/in.h +index 07a4cb149305b..4b7f2df66b995 100644 +--- a/include/uapi/linux/in.h ++++ b/include/uapi/linux/in.h +@@ -162,6 +162,7 @@ struct in_addr { + #define MCAST_MSFILTER 48 + #define IP_MULTICAST_ALL 49 + #define IP_UNICAST_IF 50 ++#define IP_LOCAL_PORT_RANGE 51 + + #define MCAST_EXCLUDE 0 + #define MCAST_INCLUDE 1 +diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c +index 7152ede18f115..916075e00d066 100644 +--- a/net/ipv4/inet_connection_sock.c ++++ b/net/ipv4/inet_connection_sock.c +@@ -117,7 +117,7 @@ bool 
inet_rcv_saddr_any(const struct sock *sk) + return !sk->sk_rcv_saddr; + } + +-void inet_get_local_port_range(struct net *net, int *low, int *high) ++void inet_get_local_port_range(const struct net *net, int *low, int *high) + { + unsigned int seq; + +@@ -130,6 +130,27 @@ void inet_get_local_port_range(struct net *net, int *low, int *high) + } + EXPORT_SYMBOL(inet_get_local_port_range); + ++void inet_sk_get_local_port_range(const struct sock *sk, int *low, int *high) ++{ ++ const struct inet_sock *inet = inet_sk(sk); ++ const struct net *net = sock_net(sk); ++ int lo, hi, sk_lo, sk_hi; ++ ++ inet_get_local_port_range(net, &lo, &hi); ++ ++ sk_lo = inet->local_port_range.lo; ++ sk_hi = inet->local_port_range.hi; ++ ++ if (unlikely(lo <= sk_lo && sk_lo <= hi)) ++ lo = sk_lo; ++ if (unlikely(lo <= sk_hi && sk_hi <= hi)) ++ hi = sk_hi; ++ ++ *low = lo; ++ *high = hi; ++} ++EXPORT_SYMBOL(inet_sk_get_local_port_range); ++ + static bool inet_use_bhash2_on_bind(const struct sock *sk) + { + #if IS_ENABLED(CONFIG_IPV6) +@@ -316,7 +337,7 @@ inet_csk_find_open_port(const struct sock *sk, struct inet_bind_bucket **tb_ret, + ports_exhausted: + attempt_half = (sk->sk_reuse == SK_CAN_REUSE) ? 1 : 0; + other_half_scan: +- inet_get_local_port_range(net, &low, &high); ++ inet_sk_get_local_port_range(sk, &low, &high); + high++; /* [32768, 60999] -> [32768, 61000[ */ + if (high - low < 4) + attempt_half = 0; +diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c +index f0750c06d5ffc..e8734ffca85a8 100644 +--- a/net/ipv4/inet_hashtables.c ++++ b/net/ipv4/inet_hashtables.c +@@ -1022,7 +1022,7 @@ int __inet_hash_connect(struct inet_timewait_death_row *death_row, + + l3mdev = inet_sk_bound_l3mdev(sk); + +- inet_get_local_port_range(net, &low, &high); ++ inet_sk_get_local_port_range(sk, &low, &high); + high++; /* [32768, 60999] -> [32768, 61000[ */ + remaining = high - low; + if (likely(remaining > 1)) +diff --git a/net/ipv4/ip_sockglue.c b/net/ipv4/ip_sockglue.c +index 6e19cad154f5c..d05f631ea6401 100644 +--- a/net/ipv4/ip_sockglue.c ++++ b/net/ipv4/ip_sockglue.c +@@ -922,6 +922,7 @@ int do_ip_setsockopt(struct sock *sk, int level, int optname, + case IP_CHECKSUM: + case IP_RECVFRAGSIZE: + case IP_RECVERR_RFC4884: ++ case IP_LOCAL_PORT_RANGE: + if (optlen >= sizeof(int)) { + if (copy_from_sockptr(&val, optval, sizeof(val))) + return -EFAULT; +@@ -1364,6 +1365,20 @@ int do_ip_setsockopt(struct sock *sk, int level, int optname, + WRITE_ONCE(inet->min_ttl, val); + break; + ++ case IP_LOCAL_PORT_RANGE: ++ { ++ const __u16 lo = val; ++ const __u16 hi = val >> 16; ++ ++ if (optlen != sizeof(__u32)) ++ goto e_inval; ++ if (lo != 0 && hi != 0 && lo > hi) ++ goto e_inval; ++ ++ inet->local_port_range.lo = lo; ++ inet->local_port_range.hi = hi; ++ break; ++ } + default: + err = -ENOPROTOOPT; + break; +@@ -1742,6 +1757,9 @@ int do_ip_getsockopt(struct sock *sk, int level, int optname, + case IP_MINTTL: + val = inet->min_ttl; + break; ++ case IP_LOCAL_PORT_RANGE: ++ val = inet->local_port_range.hi << 16 | inet->local_port_range.lo; ++ break; + default: + sockopt_release_sock(sk); + return -ENOPROTOOPT; +diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c +index 2eaf47e23b221..3ffa30c37293e 100644 +--- a/net/ipv4/udp.c ++++ b/net/ipv4/udp.c +@@ -243,7 +243,7 @@ int udp_lib_get_port(struct sock *sk, unsigned short snum, + int low, high, remaining; + unsigned int rand; + +- inet_get_local_port_range(net, &low, &high); ++ inet_sk_get_local_port_range(sk, &low, &high); + remaining = (high - low) + 1; + + rand = 
get_random_u32(); +diff --git a/net/sctp/socket.c b/net/sctp/socket.c +index 17185200079d5..bc3d08bd7cef3 100644 +--- a/net/sctp/socket.c ++++ b/net/sctp/socket.c +@@ -8325,7 +8325,7 @@ static int sctp_get_port_local(struct sock *sk, union sctp_addr *addr) + int low, high, remaining, index; + unsigned int rover; + +- inet_get_local_port_range(net, &low, &high); ++ inet_sk_get_local_port_range(sk, &low, &high); + remaining = (high - low) + 1; + rover = prandom_u32_max(remaining) + low; + +-- +2.39.2 + diff --git a/queue-6.1/ipv-4-6-raw-fix-output-xfrm-lookup-wrt-protocol.patch b/queue-6.1/ipv-4-6-raw-fix-output-xfrm-lookup-wrt-protocol.patch new file mode 100644 index 00000000000..633d6c50013 --- /dev/null +++ b/queue-6.1/ipv-4-6-raw-fix-output-xfrm-lookup-wrt-protocol.patch @@ -0,0 +1,138 @@ +From 35f9d8bdac8a43aa43859b47745171d684a418a5 Mon Sep 17 00:00:00 2001 +From: Sasha Levin +Date: Mon, 22 May 2023 14:08:20 +0200 +Subject: ipv{4,6}/raw: fix output xfrm lookup wrt protocol + +From: Nicolas Dichtel + +[ Upstream commit 3632679d9e4f879f49949bb5b050e0de553e4739 ] + +With a raw socket bound to IPPROTO_RAW (ie with hdrincl enabled), the +protocol field of the flow structure, build by raw_sendmsg() / +rawv6_sendmsg()), is set to IPPROTO_RAW. This breaks the ipsec policy +lookup when some policies are defined with a protocol in the selector. + +For ipv6, the sin6_port field from 'struct sockaddr_in6' could be used to +specify the protocol. Just accept all values for IPPROTO_RAW socket. + +For ipv4, the sin_port field of 'struct sockaddr_in' could not be used +without breaking backward compatibility (the value of this field was never +checked). Let's add a new kind of control message, so that the userland +could specify which protocol is used. + +Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") +CC: stable@vger.kernel.org +Signed-off-by: Nicolas Dichtel +Link: https://lore.kernel.org/r/20230522120820.1319391-1-nicolas.dichtel@6wind.com +Signed-off-by: Paolo Abeni +Signed-off-by: Sasha Levin +--- + include/net/ip.h | 2 ++ + include/uapi/linux/in.h | 1 + + net/ipv4/ip_sockglue.c | 12 +++++++++++- + net/ipv4/raw.c | 5 ++++- + net/ipv6/raw.c | 3 ++- + 5 files changed, 20 insertions(+), 3 deletions(-) + +diff --git a/include/net/ip.h b/include/net/ip.h +index c3fffaa92d6e0..acec504c469a0 100644 +--- a/include/net/ip.h ++++ b/include/net/ip.h +@@ -76,6 +76,7 @@ struct ipcm_cookie { + __be32 addr; + int oif; + struct ip_options_rcu *opt; ++ __u8 protocol; + __u8 ttl; + __s16 tos; + char priority; +@@ -96,6 +97,7 @@ static inline void ipcm_init_sk(struct ipcm_cookie *ipcm, + ipcm->sockc.tsflags = inet->sk.sk_tsflags; + ipcm->oif = READ_ONCE(inet->sk.sk_bound_dev_if); + ipcm->addr = inet->inet_saddr; ++ ipcm->protocol = inet->inet_num; + } + + #define IPCB(skb) ((struct inet_skb_parm*)((skb)->cb)) +diff --git a/include/uapi/linux/in.h b/include/uapi/linux/in.h +index 4b7f2df66b995..e682ab628dfa6 100644 +--- a/include/uapi/linux/in.h ++++ b/include/uapi/linux/in.h +@@ -163,6 +163,7 @@ struct in_addr { + #define IP_MULTICAST_ALL 49 + #define IP_UNICAST_IF 50 + #define IP_LOCAL_PORT_RANGE 51 ++#define IP_PROTOCOL 52 + + #define MCAST_EXCLUDE 0 + #define MCAST_INCLUDE 1 +diff --git a/net/ipv4/ip_sockglue.c b/net/ipv4/ip_sockglue.c +index d05f631ea6401..a7fd035b5b4f9 100644 +--- a/net/ipv4/ip_sockglue.c ++++ b/net/ipv4/ip_sockglue.c +@@ -317,7 +317,14 @@ int ip_cmsg_send(struct sock *sk, struct msghdr *msg, struct ipcm_cookie *ipc, + ipc->tos = val; + ipc->priority = rt_tos2priority(ipc->tos); + break; +- ++ case 
IP_PROTOCOL: ++ if (cmsg->cmsg_len != CMSG_LEN(sizeof(int))) ++ return -EINVAL; ++ val = *(int *)CMSG_DATA(cmsg); ++ if (val < 1 || val > 255) ++ return -EINVAL; ++ ipc->protocol = val; ++ break; + default: + return -EINVAL; + } +@@ -1760,6 +1767,9 @@ int do_ip_getsockopt(struct sock *sk, int level, int optname, + case IP_LOCAL_PORT_RANGE: + val = inet->local_port_range.hi << 16 | inet->local_port_range.lo; + break; ++ case IP_PROTOCOL: ++ val = inet_sk(sk)->inet_num; ++ break; + default: + sockopt_release_sock(sk); + return -ENOPROTOOPT; +diff --git a/net/ipv4/raw.c b/net/ipv4/raw.c +index af03aa8a8e513..86197634dcf5d 100644 +--- a/net/ipv4/raw.c ++++ b/net/ipv4/raw.c +@@ -530,6 +530,9 @@ static int raw_sendmsg(struct sock *sk, struct msghdr *msg, size_t len) + } + + ipcm_init_sk(&ipc, inet); ++ /* Keep backward compat */ ++ if (hdrincl) ++ ipc.protocol = IPPROTO_RAW; + + if (msg->msg_controllen) { + err = ip_cmsg_send(sk, msg, &ipc, false); +@@ -597,7 +600,7 @@ static int raw_sendmsg(struct sock *sk, struct msghdr *msg, size_t len) + + flowi4_init_output(&fl4, ipc.oif, ipc.sockc.mark, tos, + RT_SCOPE_UNIVERSE, +- hdrincl ? IPPROTO_RAW : sk->sk_protocol, ++ hdrincl ? ipc.protocol : sk->sk_protocol, + inet_sk_flowi_flags(sk) | + (hdrincl ? FLOWI_FLAG_KNOWN_NH : 0), + daddr, saddr, 0, 0, sk->sk_uid); +diff --git a/net/ipv6/raw.c b/net/ipv6/raw.c +index f44b99f7ecdcc..33852fc38ad91 100644 +--- a/net/ipv6/raw.c ++++ b/net/ipv6/raw.c +@@ -791,7 +791,8 @@ static int rawv6_sendmsg(struct sock *sk, struct msghdr *msg, size_t len) + + if (!proto) + proto = inet->inet_num; +- else if (proto != inet->inet_num) ++ else if (proto != inet->inet_num && ++ inet->inet_num != IPPROTO_RAW) + return -EINVAL; + + if (proto > 255) +-- +2.39.2 + diff --git a/queue-6.1/net-mlx5-e-switch-devcom-sync-devcom-events-and-devc.patch b/queue-6.1/net-mlx5-e-switch-devcom-sync-devcom-events-and-devc.patch new file mode 100644 index 00000000000..441d4aadd59 --- /dev/null +++ b/queue-6.1/net-mlx5-e-switch-devcom-sync-devcom-events-and-devc.patch @@ -0,0 +1,91 @@ +From a2faf150514ee4a652fafe17c2dcb05efa261f8d Mon Sep 17 00:00:00 2001 +From: Sasha Levin +Date: Mon, 6 Feb 2023 11:52:02 +0200 +Subject: net/mlx5: E-switch, Devcom, sync devcom events and devcom comp + register + +From: Shay Drory + +[ Upstream commit 8c253dfc89efde6b5faddf9e7400e5d17884e042 ] + +devcom events are sent to all registered component. Following the +cited patch, it is possible for two components, e.g.: two eswitches, +to send devcom events, while both components are registered. This +means eswitch layer will do double un/pairing, which is double +allocation and free of resources, even though only one un/pairing is +needed. flow example: + + cpu0 cpu1 + ---- ---- + + mlx5_devlink_eswitch_mode_set(dev0) + esw_offloads_devcom_init() + mlx5_devcom_register_component(esw0) + mlx5_devlink_eswitch_mode_set(dev1) + esw_offloads_devcom_init() + mlx5_devcom_register_component(esw1) + mlx5_devcom_send_event() + mlx5_devcom_send_event() + +Hence, check whether the eswitches are already un/paired before +free/allocation of resources. 
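+
+For illustration only, a minimal userspace model of such a pairing check;
+the names esw_state, handle_pair_event() and peer_idx below are hypothetical
+stand-ins, not taken from this patch:
+
+    #include <stdbool.h>
+    #include <stdio.h>
+
+    #define MAX_PORTS 4
+
+    struct esw_state {
+        bool paired[MAX_PORTS];
+    };
+
+    /* Pair only once per peer, no matter how many duplicate events arrive. */
+    static void handle_pair_event(struct esw_state *esw, int peer_idx)
+    {
+        if (esw->paired[peer_idx])
+            return;                 /* already paired, drop duplicate */
+        /* ... allocate pairing resources here ... */
+        esw->paired[peer_idx] = true;
+    }
+
+    static void handle_unpair_event(struct esw_state *esw, int peer_idx)
+    {
+        if (!esw->paired[peer_idx])
+            return;                 /* never paired or already unpaired */
+        /* ... free pairing resources here ... */
+        esw->paired[peer_idx] = false;
+    }
+
+    int main(void)
+    {
+        struct esw_state esw = { { false } };
+
+        handle_pair_event(&esw, 1);
+        handle_pair_event(&esw, 1);     /* duplicate, ignored */
+        handle_unpair_event(&esw, 1);
+        handle_unpair_event(&esw, 1);   /* duplicate, ignored */
+        printf("paired[1] = %d\n", esw.paired[1]);
+        return 0;
+    }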
+ +Fixes: 09b278462f16 ("net: devlink: enable parallel ops on netlink interface") +Signed-off-by: Shay Drory +Reviewed-by: Mark Bloch +Signed-off-by: Saeed Mahameed +Signed-off-by: Sasha Levin +--- + drivers/net/ethernet/mellanox/mlx5/core/eswitch.h | 1 + + .../net/ethernet/mellanox/mlx5/core/eswitch_offloads.c | 9 ++++++++- + 2 files changed, 9 insertions(+), 1 deletion(-) + +diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h +index 821c78bab3732..a3daca44f74b1 100644 +--- a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h ++++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h +@@ -340,6 +340,7 @@ struct mlx5_eswitch { + } params; + struct blocking_notifier_head n_head; + struct dentry *dbgfs; ++ bool paired[MLX5_MAX_PORTS]; + }; + + void esw_offloads_disable(struct mlx5_eswitch *esw); +diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c b/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c +index 5235b5a7b9637..433cdd0a2cf34 100644 +--- a/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c ++++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c +@@ -2827,6 +2827,9 @@ static int mlx5_esw_offloads_devcom_event(int event, + mlx5_eswitch_vport_match_metadata_enabled(peer_esw)) + break; + ++ if (esw->paired[mlx5_get_dev_index(peer_esw->dev)]) ++ break; ++ + err = mlx5_esw_offloads_set_ns_peer(esw, peer_esw, true); + if (err) + goto err_out; +@@ -2838,14 +2841,18 @@ static int mlx5_esw_offloads_devcom_event(int event, + if (err) + goto err_pair; + ++ esw->paired[mlx5_get_dev_index(peer_esw->dev)] = true; ++ peer_esw->paired[mlx5_get_dev_index(esw->dev)] = true; + mlx5_devcom_set_paired(devcom, MLX5_DEVCOM_ESW_OFFLOADS, true); + break; + + case ESW_OFFLOADS_DEVCOM_UNPAIR: +- if (!mlx5_devcom_is_paired(devcom, MLX5_DEVCOM_ESW_OFFLOADS)) ++ if (!esw->paired[mlx5_get_dev_index(peer_esw->dev)]) + break; + + mlx5_devcom_set_paired(devcom, MLX5_DEVCOM_ESW_OFFLOADS, false); ++ esw->paired[mlx5_get_dev_index(peer_esw->dev)] = false; ++ peer_esw->paired[mlx5_get_dev_index(esw->dev)] = false; + mlx5_esw_offloads_unpair(peer_esw); + mlx5_esw_offloads_unpair(esw); + mlx5_esw_offloads_set_ns_peer(esw, peer_esw, false); +-- +2.39.2 + diff --git a/queue-6.1/net-page_pool-use-in_softirq-instead.patch b/queue-6.1/net-page_pool-use-in_softirq-instead.patch new file mode 100644 index 00000000000..a1e5b2a1a2c --- /dev/null +++ b/queue-6.1/net-page_pool-use-in_softirq-instead.patch @@ -0,0 +1,75 @@ +From fd295408c719ed8b6b4595392c12e6575fedbfe0 Mon Sep 17 00:00:00 2001 +From: Sasha Levin +Date: Fri, 3 Feb 2023 09:16:11 +0800 +Subject: net: page_pool: use in_softirq() instead + +From: Qingfang DENG + +[ Upstream commit 542bcea4be866b14b3a5c8e90773329066656c43 ] + +We use BH context only for synchronization, so we don't care if it's +actually serving softirq or not. + +As a side node, in case of threaded NAPI, in_serving_softirq() will +return false because it's in process context with BH off, making +page_pool_recycle_in_cache() unreachable. + +Signed-off-by: Qingfang DENG +Tested-by: Felix Fietkau +Signed-off-by: David S. 
Miller +Stable-dep-of: 368d3cb406cd ("page_pool: fix inconsistency for page_pool_ring_[un]lock()") +Signed-off-by: Sasha Levin +--- + include/net/page_pool.h | 4 ++-- + net/core/page_pool.c | 6 +++--- + 2 files changed, 5 insertions(+), 5 deletions(-) + +diff --git a/include/net/page_pool.h b/include/net/page_pool.h +index 813c93499f201..34bf531ffc8d6 100644 +--- a/include/net/page_pool.h ++++ b/include/net/page_pool.h +@@ -386,7 +386,7 @@ static inline void page_pool_nid_changed(struct page_pool *pool, int new_nid) + static inline void page_pool_ring_lock(struct page_pool *pool) + __acquires(&pool->ring.producer_lock) + { +- if (in_serving_softirq()) ++ if (in_softirq()) + spin_lock(&pool->ring.producer_lock); + else + spin_lock_bh(&pool->ring.producer_lock); +@@ -395,7 +395,7 @@ static inline void page_pool_ring_lock(struct page_pool *pool) + static inline void page_pool_ring_unlock(struct page_pool *pool) + __releases(&pool->ring.producer_lock) + { +- if (in_serving_softirq()) ++ if (in_softirq()) + spin_unlock(&pool->ring.producer_lock); + else + spin_unlock_bh(&pool->ring.producer_lock); +diff --git a/net/core/page_pool.c b/net/core/page_pool.c +index 9b203d8660e47..193c187998650 100644 +--- a/net/core/page_pool.c ++++ b/net/core/page_pool.c +@@ -511,8 +511,8 @@ static void page_pool_return_page(struct page_pool *pool, struct page *page) + static bool page_pool_recycle_in_ring(struct page_pool *pool, struct page *page) + { + int ret; +- /* BH protection not needed if current is serving softirq */ +- if (in_serving_softirq()) ++ /* BH protection not needed if current is softirq */ ++ if (in_softirq()) + ret = ptr_ring_produce(&pool->ring, page); + else + ret = ptr_ring_produce_bh(&pool->ring, page); +@@ -570,7 +570,7 @@ __page_pool_put_page(struct page_pool *pool, struct page *page, + page_pool_dma_sync_for_device(pool, page, + dma_sync_size); + +- if (allow_direct && in_serving_softirq() && ++ if (allow_direct && in_softirq() && + page_pool_recycle_in_cache(page, pool)) + return NULL; + +-- +2.39.2 + diff --git a/queue-6.1/net-phy-mscc-enable-vsc8501-2-rgmii-rx-clock.patch b/queue-6.1/net-phy-mscc-enable-vsc8501-2-rgmii-rx-clock.patch new file mode 100644 index 00000000000..2a4ea5181f5 --- /dev/null +++ b/queue-6.1/net-phy-mscc-enable-vsc8501-2-rgmii-rx-clock.patch @@ -0,0 +1,134 @@ +From dc0b2c73e5cd6a9e840a2796e921994c31096ba6 Mon Sep 17 00:00:00 2001 +From: Sasha Levin +Date: Tue, 23 May 2023 17:31:08 +0200 +Subject: net: phy: mscc: enable VSC8501/2 RGMII RX clock + +From: David Epping + +[ Upstream commit 71460c9ec5c743e9ffffca3c874d66267c36345e ] + +By default the VSC8501 and VSC8502 RGMII/GMII/MII RX_CLK output is +disabled. To allow packet forwarding towards the MAC it needs to be +enabled. + +For other PHYs supported by this driver the clock output is enabled +by default. 
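+
+For illustration only, a small sketch of the masked read-modify-write that
+clears such a clock-disable bit; reg_modify() is a hypothetical stand-in for
+phy_modify_paged(), and only the 0x0800 bit value is taken from this patch:
+
+    #include <stdint.h>
+    #include <stdio.h>
+
+    #define RGMII_RX_CLK_DISABLE  (1u << 11)    /* 0x0800 */
+
+    /* Update only the bits selected by mask, leave the rest untouched. */
+    static uint16_t reg_modify(uint16_t reg, uint16_t mask, uint16_t val)
+    {
+        return (reg & ~mask) | (val & mask);
+    }
+
+    int main(void)
+    {
+        uint16_t rgmii_cntl = 0x0800;   /* RX clock disabled out of reset */
+
+        /* Mask selects the disable bit, the new value leaves it cleared. */
+        rgmii_cntl = reg_modify(rgmii_cntl, RGMII_RX_CLK_DISABLE, 0);
+        printf("RGMII_CNTL = 0x%04x\n", (unsigned int)rgmii_cntl);
+        return 0;
+    }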
+ +Fixes: d3169863310d ("net: phy: mscc: add support for VSC8502") +Signed-off-by: David Epping +Reviewed-by: Russell King (Oracle) +Reviewed-by: Vladimir Oltean +Signed-off-by: Jakub Kicinski +Signed-off-by: Sasha Levin +--- + drivers/net/phy/mscc/mscc.h | 1 + + drivers/net/phy/mscc/mscc_main.c | 54 +++++++++++++++++--------------- + 2 files changed, 29 insertions(+), 26 deletions(-) + +diff --git a/drivers/net/phy/mscc/mscc.h b/drivers/net/phy/mscc/mscc.h +index a50235fdf7d99..055e4ca5b3b5c 100644 +--- a/drivers/net/phy/mscc/mscc.h ++++ b/drivers/net/phy/mscc/mscc.h +@@ -179,6 +179,7 @@ enum rgmii_clock_delay { + #define VSC8502_RGMII_CNTL 20 + #define VSC8502_RGMII_RX_DELAY_MASK 0x0070 + #define VSC8502_RGMII_TX_DELAY_MASK 0x0007 ++#define VSC8502_RGMII_RX_CLK_DISABLE 0x0800 + + #define MSCC_PHY_WOL_LOWER_MAC_ADDR 21 + #define MSCC_PHY_WOL_MID_MAC_ADDR 22 +diff --git a/drivers/net/phy/mscc/mscc_main.c b/drivers/net/phy/mscc/mscc_main.c +index f778e4f8b5080..7bd940baec595 100644 +--- a/drivers/net/phy/mscc/mscc_main.c ++++ b/drivers/net/phy/mscc/mscc_main.c +@@ -527,14 +527,27 @@ static int vsc85xx_mac_if_set(struct phy_device *phydev, + * * 2.0 ns (which causes the data to be sampled at exactly half way between + * clock transitions at 1000 Mbps) if delays should be enabled + */ +-static int vsc85xx_rgmii_set_skews(struct phy_device *phydev, u32 rgmii_cntl, +- u16 rgmii_rx_delay_mask, +- u16 rgmii_tx_delay_mask) ++static int vsc85xx_update_rgmii_cntl(struct phy_device *phydev, u32 rgmii_cntl, ++ u16 rgmii_rx_delay_mask, ++ u16 rgmii_tx_delay_mask) + { + u16 rgmii_rx_delay_pos = ffs(rgmii_rx_delay_mask) - 1; + u16 rgmii_tx_delay_pos = ffs(rgmii_tx_delay_mask) - 1; + u16 reg_val = 0; +- int rc; ++ u16 mask = 0; ++ int rc = 0; ++ ++ /* For traffic to pass, the VSC8502 family needs the RX_CLK disable bit ++ * to be unset for all PHY modes, so do that as part of the paged ++ * register modification. ++ * For some family members (like VSC8530/31/40/41) this bit is reserved ++ * and read-only, and the RX clock is enabled by default. 
++ */ ++ if (rgmii_cntl == VSC8502_RGMII_CNTL) ++ mask |= VSC8502_RGMII_RX_CLK_DISABLE; ++ ++ if (phy_interface_is_rgmii(phydev)) ++ mask |= rgmii_rx_delay_mask | rgmii_tx_delay_mask; + + mutex_lock(&phydev->lock); + +@@ -545,10 +558,9 @@ static int vsc85xx_rgmii_set_skews(struct phy_device *phydev, u32 rgmii_cntl, + phydev->interface == PHY_INTERFACE_MODE_RGMII_ID) + reg_val |= RGMII_CLK_DELAY_2_0_NS << rgmii_tx_delay_pos; + +- rc = phy_modify_paged(phydev, MSCC_PHY_PAGE_EXTENDED_2, +- rgmii_cntl, +- rgmii_rx_delay_mask | rgmii_tx_delay_mask, +- reg_val); ++ if (mask) ++ rc = phy_modify_paged(phydev, MSCC_PHY_PAGE_EXTENDED_2, ++ rgmii_cntl, mask, reg_val); + + mutex_unlock(&phydev->lock); + +@@ -557,19 +569,11 @@ static int vsc85xx_rgmii_set_skews(struct phy_device *phydev, u32 rgmii_cntl, + + static int vsc85xx_default_config(struct phy_device *phydev) + { +- int rc; +- + phydev->mdix_ctrl = ETH_TP_MDI_AUTO; + +- if (phy_interface_mode_is_rgmii(phydev->interface)) { +- rc = vsc85xx_rgmii_set_skews(phydev, VSC8502_RGMII_CNTL, +- VSC8502_RGMII_RX_DELAY_MASK, +- VSC8502_RGMII_TX_DELAY_MASK); +- if (rc) +- return rc; +- } +- +- return 0; ++ return vsc85xx_update_rgmii_cntl(phydev, VSC8502_RGMII_CNTL, ++ VSC8502_RGMII_RX_DELAY_MASK, ++ VSC8502_RGMII_TX_DELAY_MASK); + } + + static int vsc85xx_get_tunable(struct phy_device *phydev, +@@ -1766,13 +1770,11 @@ static int vsc8584_config_init(struct phy_device *phydev) + if (ret) + return ret; + +- if (phy_interface_is_rgmii(phydev)) { +- ret = vsc85xx_rgmii_set_skews(phydev, VSC8572_RGMII_CNTL, +- VSC8572_RGMII_RX_DELAY_MASK, +- VSC8572_RGMII_TX_DELAY_MASK); +- if (ret) +- return ret; +- } ++ ret = vsc85xx_update_rgmii_cntl(phydev, VSC8572_RGMII_CNTL, ++ VSC8572_RGMII_RX_DELAY_MASK, ++ VSC8572_RGMII_TX_DELAY_MASK); ++ if (ret) ++ return ret; + + ret = genphy_soft_reset(phydev); + if (ret) +-- +2.39.2 + diff --git a/queue-6.1/page_pool-fix-inconsistency-for-page_pool_ring_-un-l.patch b/queue-6.1/page_pool-fix-inconsistency-for-page_pool_ring_-un-l.patch new file mode 100644 index 00000000000..b6f463397f0 --- /dev/null +++ b/queue-6.1/page_pool-fix-inconsistency-for-page_pool_ring_-un-l.patch @@ -0,0 +1,129 @@ +From 0f7c4261319e04c926d12a31b9ed6abd87280272 Mon Sep 17 00:00:00 2001 +From: Sasha Levin +Date: Mon, 22 May 2023 11:17:14 +0800 +Subject: page_pool: fix inconsistency for page_pool_ring_[un]lock() + +From: Yunsheng Lin + +[ Upstream commit 368d3cb406cdd074d1df2ad9ec06d1bfcb664882 ] + +page_pool_ring_[un]lock() use in_softirq() to decide which +spin lock variant to use, and when they are called in the +context with in_softirq() being false, spin_lock_bh() is +called in page_pool_ring_lock() while spin_unlock() is +called in page_pool_ring_unlock(), because spin_lock_bh() +has disabled the softirq in page_pool_ring_lock(), which +causes inconsistency for spin lock pair calling. + +This patch fixes it by returning in_softirq state from +page_pool_producer_lock(), and use it to decide which +spin lock variant to use in page_pool_producer_unlock(). + +As pool->ring has both producer and consumer lock, so +rename it to page_pool_producer_[un]lock() to reflect +the actual usage. Also move them to page_pool.c as they +are only used there, and remove the 'inline' as the +compiler may have better idea to do inlining or not. 
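+
+For illustration only, a minimal userspace model of the fix: the lock helper
+records which variant it picked and the unlock helper reuses that record
+instead of re-evaluating the context. in_fast_context(), producer_lock() and
+producer_unlock() are hypothetical stand-ins:
+
+    #include <pthread.h>
+    #include <stdbool.h>
+    #include <stdio.h>
+
+    static pthread_mutex_t ring_lock = PTHREAD_MUTEX_INITIALIZER;
+
+    /* Stand-in for in_softirq(); in the kernel its result can differ
+     * between the lock and the unlock call sites, which is the bug.
+     */
+    static bool in_fast_context(void)
+    {
+        return false;
+    }
+
+    /* Return the decision so the caller can hand it back at unlock time. */
+    static bool producer_lock(void)
+    {
+        bool fast = in_fast_context();
+
+        /* The real code picks spin_lock() vs spin_lock_bh() here;
+         * the model only records the choice.
+         */
+        pthread_mutex_lock(&ring_lock);
+        return fast;
+    }
+
+    static void producer_unlock(bool fast)
+    {
+        /* Reuse the recorded choice, never re-evaluate the context. */
+        (void)fast;
+        pthread_mutex_unlock(&ring_lock);
+    }
+
+    int main(void)
+    {
+        bool fast = producer_lock();
+        /* ... produce pages into the ring ... */
+        producer_unlock(fast);
+        printf("lock and unlock used the same variant\n");
+        return 0;
+    }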
+ +Fixes: 7886244736a4 ("net: page_pool: Add bulk support for ptr_ring") +Signed-off-by: Yunsheng Lin +Acked-by: Jesper Dangaard Brouer +Acked-by: Ilias Apalodimas +Link: https://lore.kernel.org/r/20230522031714.5089-1-linyunsheng@huawei.com +Signed-off-by: Jakub Kicinski +Signed-off-by: Sasha Levin +--- + include/net/page_pool.h | 18 ------------------ + net/core/page_pool.c | 28 ++++++++++++++++++++++++++-- + 2 files changed, 26 insertions(+), 20 deletions(-) + +diff --git a/include/net/page_pool.h b/include/net/page_pool.h +index 34bf531ffc8d6..ad0bafc877d48 100644 +--- a/include/net/page_pool.h ++++ b/include/net/page_pool.h +@@ -383,22 +383,4 @@ static inline void page_pool_nid_changed(struct page_pool *pool, int new_nid) + page_pool_update_nid(pool, new_nid); + } + +-static inline void page_pool_ring_lock(struct page_pool *pool) +- __acquires(&pool->ring.producer_lock) +-{ +- if (in_softirq()) +- spin_lock(&pool->ring.producer_lock); +- else +- spin_lock_bh(&pool->ring.producer_lock); +-} +- +-static inline void page_pool_ring_unlock(struct page_pool *pool) +- __releases(&pool->ring.producer_lock) +-{ +- if (in_softirq()) +- spin_unlock(&pool->ring.producer_lock); +- else +- spin_unlock_bh(&pool->ring.producer_lock); +-} +- + #endif /* _NET_PAGE_POOL_H */ +diff --git a/net/core/page_pool.c b/net/core/page_pool.c +index 193c187998650..2396c99bedeaa 100644 +--- a/net/core/page_pool.c ++++ b/net/core/page_pool.c +@@ -133,6 +133,29 @@ EXPORT_SYMBOL(page_pool_ethtool_stats_get); + #define recycle_stat_add(pool, __stat, val) + #endif + ++static bool page_pool_producer_lock(struct page_pool *pool) ++ __acquires(&pool->ring.producer_lock) ++{ ++ bool in_softirq = in_softirq(); ++ ++ if (in_softirq) ++ spin_lock(&pool->ring.producer_lock); ++ else ++ spin_lock_bh(&pool->ring.producer_lock); ++ ++ return in_softirq; ++} ++ ++static void page_pool_producer_unlock(struct page_pool *pool, ++ bool in_softirq) ++ __releases(&pool->ring.producer_lock) ++{ ++ if (in_softirq) ++ spin_unlock(&pool->ring.producer_lock); ++ else ++ spin_unlock_bh(&pool->ring.producer_lock); ++} ++ + static int page_pool_init(struct page_pool *pool, + const struct page_pool_params *params) + { +@@ -615,6 +638,7 @@ void page_pool_put_page_bulk(struct page_pool *pool, void **data, + int count) + { + int i, bulk_len = 0; ++ bool in_softirq; + + for (i = 0; i < count; i++) { + struct page *page = virt_to_head_page(data[i]); +@@ -633,7 +657,7 @@ void page_pool_put_page_bulk(struct page_pool *pool, void **data, + return; + + /* Bulk producer into ptr_ring page_pool cache */ +- page_pool_ring_lock(pool); ++ in_softirq = page_pool_producer_lock(pool); + for (i = 0; i < bulk_len; i++) { + if (__ptr_ring_produce(&pool->ring, data[i])) { + /* ring full */ +@@ -642,7 +666,7 @@ void page_pool_put_page_bulk(struct page_pool *pool, void **data, + } + } + recycle_stat_add(pool, ring, i); +- page_pool_ring_unlock(pool); ++ page_pool_producer_unlock(pool, in_softirq); + + /* Hopefully all pages was return into ptr_ring */ + if (likely(i == bulk_len)) +-- +2.39.2 + diff --git a/queue-6.1/platform-x86-amd-pmf-fix-cnqf-and-auto-mode-after-re.patch b/queue-6.1/platform-x86-amd-pmf-fix-cnqf-and-auto-mode-after-re.patch new file mode 100644 index 00000000000..e609fbc2ed4 --- /dev/null +++ b/queue-6.1/platform-x86-amd-pmf-fix-cnqf-and-auto-mode-after-re.patch @@ -0,0 +1,103 @@ +From 5197daa25cf6bd27df0920debbf93d53067824e2 Mon Sep 17 00:00:00 2001 +From: Sasha Levin +Date: Fri, 12 May 2023 20:14:08 -0500 +Subject: platform/x86/amd/pmf: Fix CnQF 
and auto-mode after resume + +From: Mario Limonciello + +[ Upstream commit b54147fa374dbeadcb01b1762db1a793e06e37de ] + +After suspend/resume cycle there is an error message and auto-mode +or CnQF stops working. + +[ 5741.447511] amd-pmf AMDI0100:00: SMU cmd failed. err: 0xff +[ 5741.447523] amd-pmf AMDI0100:00: AMD_PMF_REGISTER_RESPONSE:ff +[ 5741.447527] amd-pmf AMDI0100:00: AMD_PMF_REGISTER_ARGUMENT:7 +[ 5741.447531] amd-pmf AMDI0100:00: AMD_PMF_REGISTER_MESSAGE:16 +[ 5741.447540] amd-pmf AMDI0100:00: [AUTO_MODE] avg power: 0 mW mode: QUIET + +This is because the DRAM address used for accessing metrics table +needs to be refreshed after a suspend resume cycle. Add a resume +callback to reset this again. + +Fixes: 1a409b35c995 ("platform/x86/amd/pmf: Get performance metrics from PMFW") +Signed-off-by: Mario Limonciello +Link: https://lore.kernel.org/r/20230513011408.958-1-mario.limonciello@amd.com +Reviewed-by: Hans de Goede +Signed-off-by: Hans de Goede +Signed-off-by: Sasha Levin +--- + drivers/platform/x86/amd/pmf/core.c | 32 ++++++++++++++++++++++------- + 1 file changed, 25 insertions(+), 7 deletions(-) + +diff --git a/drivers/platform/x86/amd/pmf/core.c b/drivers/platform/x86/amd/pmf/core.c +index 0acc0b6221290..dc9803e1a4b9b 100644 +--- a/drivers/platform/x86/amd/pmf/core.c ++++ b/drivers/platform/x86/amd/pmf/core.c +@@ -245,24 +245,29 @@ static const struct pci_device_id pmf_pci_ids[] = { + { } + }; + +-int amd_pmf_init_metrics_table(struct amd_pmf_dev *dev) ++static void amd_pmf_set_dram_addr(struct amd_pmf_dev *dev) + { + u64 phys_addr; + u32 hi, low; + +- INIT_DELAYED_WORK(&dev->work_buffer, amd_pmf_get_metrics); ++ phys_addr = virt_to_phys(dev->buf); ++ hi = phys_addr >> 32; ++ low = phys_addr & GENMASK(31, 0); ++ ++ amd_pmf_send_cmd(dev, SET_DRAM_ADDR_HIGH, 0, hi, NULL); ++ amd_pmf_send_cmd(dev, SET_DRAM_ADDR_LOW, 0, low, NULL); ++} + ++int amd_pmf_init_metrics_table(struct amd_pmf_dev *dev) ++{ + /* Get Metrics Table Address */ + dev->buf = kzalloc(sizeof(dev->m_table), GFP_KERNEL); + if (!dev->buf) + return -ENOMEM; + +- phys_addr = virt_to_phys(dev->buf); +- hi = phys_addr >> 32; +- low = phys_addr & GENMASK(31, 0); ++ INIT_DELAYED_WORK(&dev->work_buffer, amd_pmf_get_metrics); + +- amd_pmf_send_cmd(dev, SET_DRAM_ADDR_HIGH, 0, hi, NULL); +- amd_pmf_send_cmd(dev, SET_DRAM_ADDR_LOW, 0, low, NULL); ++ amd_pmf_set_dram_addr(dev); + + /* + * Start collecting the metrics data after a small delay +@@ -273,6 +278,18 @@ int amd_pmf_init_metrics_table(struct amd_pmf_dev *dev) + return 0; + } + ++static int amd_pmf_resume_handler(struct device *dev) ++{ ++ struct amd_pmf_dev *pdev = dev_get_drvdata(dev); ++ ++ if (pdev->buf) ++ amd_pmf_set_dram_addr(pdev); ++ ++ return 0; ++} ++ ++static DEFINE_SIMPLE_DEV_PM_OPS(amd_pmf_pm, NULL, amd_pmf_resume_handler); ++ + static void amd_pmf_init_features(struct amd_pmf_dev *dev) + { + int ret; +@@ -414,6 +431,7 @@ static struct platform_driver amd_pmf_driver = { + .name = "amd-pmf", + .acpi_match_table = amd_pmf_acpi_ids, + .dev_groups = amd_pmf_driver_groups, ++ .pm = pm_sleep_ptr(&amd_pmf_pm), + }, + .probe = amd_pmf_probe, + .remove = amd_pmf_remove, +-- +2.39.2 + diff --git a/queue-6.1/selftests-bpf-fix-pkg-config-call-building-sign-file.patch b/queue-6.1/selftests-bpf-fix-pkg-config-call-building-sign-file.patch new file mode 100644 index 00000000000..3c2206bd512 --- /dev/null +++ b/queue-6.1/selftests-bpf-fix-pkg-config-call-building-sign-file.patch @@ -0,0 +1,51 @@ +From a8033055415b14a46897376561eaf631b38910df Mon Sep 17 00:00:00 2001 
+From: Sasha Levin +Date: Wed, 26 Apr 2023 22:50:32 +0100 +Subject: selftests/bpf: Fix pkg-config call building sign-file + +From: Jeremy Sowden + +[ Upstream commit 5f5486b620cd43b16a1787ef92b9bc21bd72ef2e ] + +When building sign-file, the call to get the CFLAGS for libcrypto is +missing white-space between `pkg-config` and `--cflags`: + + $(shell $(HOSTPKG_CONFIG)--cflags libcrypto 2> /dev/null) + +Removing the redirection of stderr, we see: + + $ make -C tools/testing/selftests/bpf sign-file + make: Entering directory '[...]/tools/testing/selftests/bpf' + make: pkg-config--cflags: No such file or directory + SIGN-FILE sign-file + make: Leaving directory '[...]/tools/testing/selftests/bpf' + +Add the missing space. + +Fixes: fc97590668ae ("selftests/bpf: Add test for bpf_verify_pkcs7_signature() kfunc") +Signed-off-by: Jeremy Sowden +Signed-off-by: Daniel Borkmann +Reviewed-by: Roberto Sassu +Link: https://lore.kernel.org/bpf/20230426215032.415792-1-jeremy@azazel.net +Signed-off-by: Alexei Starovoitov +Signed-off-by: Sasha Levin +--- + tools/testing/selftests/bpf/Makefile | 2 +- + 1 file changed, 1 insertion(+), 1 deletion(-) + +diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile +index 687249d99b5f1..0465ddc81f352 100644 +--- a/tools/testing/selftests/bpf/Makefile ++++ b/tools/testing/selftests/bpf/Makefile +@@ -193,7 +193,7 @@ $(OUTPUT)/urandom_read: urandom_read.c urandom_read_aux.c $(OUTPUT)/liburandom_r + + $(OUTPUT)/sign-file: ../../../../scripts/sign-file.c + $(call msg,SIGN-FILE,,$@) +- $(Q)$(CC) $(shell $(HOSTPKG_CONFIG)--cflags libcrypto 2> /dev/null) \ ++ $(Q)$(CC) $(shell $(HOSTPKG_CONFIG) --cflags libcrypto 2> /dev/null) \ + $< -o $@ \ + $(shell $(HOSTPKG_CONFIG) --libs libcrypto 2> /dev/null || echo -lcrypto) + +-- +2.39.2 + diff --git a/queue-6.1/series b/queue-6.1/series index 6bb6a794e67..b87ac97b77c 100644 --- a/queue-6.1/series +++ b/queue-6.1/series @@ -117,3 +117,29 @@ regulator-mt6359-add-read-check-for-pmic-mt6359.patch net-smc-reset-connection-when-trying-to-use-smcrv2-fails.patch 3c589_cs-fix-an-error-handling-path-in-tc589_probe.patch net-phy-mscc-add-vsc8502-to-module_device_table.patch +inet-add-ip_local_port_range-socket-option.patch +ipv-4-6-raw-fix-output-xfrm-lookup-wrt-protocol.patch +firmware-arm_ffa-fix-usage-of-partition-info-get-cou.patch +selftests-bpf-fix-pkg-config-call-building-sign-file.patch +platform-x86-amd-pmf-fix-cnqf-and-auto-mode-after-re.patch +tls-rx-device-fix-checking-decryption-status.patch +tls-rx-strp-set-the-skb-len-of-detached-cow-ed-skbs.patch +tls-rx-strp-fix-determining-record-length-in-copy-mo.patch +tls-rx-strp-force-mixed-decrypted-records-into-copy-.patch +tls-rx-strp-factor-out-copying-skb-data.patch +tls-rx-strp-preserve-decryption-status-of-skbs-when-.patch +net-mlx5-e-switch-devcom-sync-devcom-events-and-devc.patch +gpio-f7188x-fix-chip-name-and-pin-count-on-nuvoton-c.patch +bpf-sockmap-pass-skb-ownership-through-read_skb.patch +bpf-sockmap-convert-schedule_work-into-delayed_work.patch +bpf-sockmap-reschedule-is-now-done-through-backlog.patch +bpf-sockmap-improved-check-for-empty-queue.patch +bpf-sockmap-handle-fin-correctly.patch +bpf-sockmap-tcp-data-stall-on-recv-before-accept.patch +bpf-sockmap-wake-up-polling-after-data-copy.patch +bpf-sockmap-incorrectly-handling-copied_seq.patch +blk-mq-fix-race-condition-in-active-queue-accounting.patch +vfio-type1-check-pfn-valid-before-converting-to-stru.patch +net-page_pool-use-in_softirq-instead.patch 
+page_pool-fix-inconsistency-for-page_pool_ring_-un-l.patch +net-phy-mscc-enable-vsc8501-2-rgmii-rx-clock.patch diff --git a/queue-6.1/tls-rx-device-fix-checking-decryption-status.patch b/queue-6.1/tls-rx-device-fix-checking-decryption-status.patch new file mode 100644 index 00000000000..2d6b0c24870 --- /dev/null +++ b/queue-6.1/tls-rx-device-fix-checking-decryption-status.patch @@ -0,0 +1,44 @@ +From d0b063d0c88043a9198c69a24b09490b1459ac5e Mon Sep 17 00:00:00 2001 +From: Sasha Levin +Date: Tue, 16 May 2023 18:50:36 -0700 +Subject: tls: rx: device: fix checking decryption status + +From: Jakub Kicinski + +[ Upstream commit b3a03b540e3cf62a255213d084d76d71c02793d5 ] + +skb->len covers the entire skb, including the frag_list. +In fact we're guaranteed that rxm->full_len <= skb->len, +so since the change under Fixes we were not checking decrypt +status of any skb but the first. + +Note that the skb_pagelen() added here may feel a bit costly, +but it's removed by subsequent fixes, anyway. + +Reported-by: Tariq Toukan +Fixes: 86b259f6f888 ("tls: rx: device: bound the frag walk") +Tested-by: Shai Amiram +Signed-off-by: Jakub Kicinski +Reviewed-by: Simon Horman +Signed-off-by: David S. Miller +Signed-off-by: Sasha Levin +--- + net/tls/tls_device.c | 2 +- + 1 file changed, 1 insertion(+), 1 deletion(-) + +diff --git a/net/tls/tls_device.c b/net/tls/tls_device.c +index a7cc4f9faac28..3b87c7b04ac87 100644 +--- a/net/tls/tls_device.c ++++ b/net/tls/tls_device.c +@@ -1012,7 +1012,7 @@ int tls_device_decrypted(struct sock *sk, struct tls_context *tls_ctx) + struct sk_buff *skb_iter; + int left; + +- left = rxm->full_len - skb->len; ++ left = rxm->full_len + rxm->offset - skb_pagelen(skb); + /* Check if all the data is decrypted already */ + skb_iter = skb_shinfo(skb)->frag_list; + while (skb_iter && left > 0) { +-- +2.39.2 + diff --git a/queue-6.1/tls-rx-strp-factor-out-copying-skb-data.patch b/queue-6.1/tls-rx-strp-factor-out-copying-skb-data.patch new file mode 100644 index 00000000000..99c400009b9 --- /dev/null +++ b/queue-6.1/tls-rx-strp-factor-out-copying-skb-data.patch @@ -0,0 +1,84 @@ +From ac5b79057662d03c2434227903c80e19c171b90f Mon Sep 17 00:00:00 2001 +From: Sasha Levin +Date: Tue, 16 May 2023 18:50:40 -0700 +Subject: tls: rx: strp: factor out copying skb data + +From: Jakub Kicinski + +[ Upstream commit c1c607b1e5d5477d82ca6a86a05a4f10907b33ee ] + +We'll need to copy input skbs individually in the next patch. +Factor that code out (without assuming we're copying a full record). + +Tested-by: Shai Amiram +Signed-off-by: Jakub Kicinski +Reviewed-by: Simon Horman +Signed-off-by: David S. 
Miller +Stable-dep-of: eca9bfafee3a ("tls: rx: strp: preserve decryption status of skbs when needed") +Signed-off-by: Sasha Levin +--- + net/tls/tls_strp.c | 33 +++++++++++++++++++++++---------- + 1 file changed, 23 insertions(+), 10 deletions(-) + +diff --git a/net/tls/tls_strp.c b/net/tls/tls_strp.c +index e2e48217e7ac9..61fbf84baf9e0 100644 +--- a/net/tls/tls_strp.c ++++ b/net/tls/tls_strp.c +@@ -34,31 +34,44 @@ static void tls_strp_anchor_free(struct tls_strparser *strp) + strp->anchor = NULL; + } + +-/* Create a new skb with the contents of input copied to its page frags */ +-static struct sk_buff *tls_strp_msg_make_copy(struct tls_strparser *strp) ++static struct sk_buff * ++tls_strp_skb_copy(struct tls_strparser *strp, struct sk_buff *in_skb, ++ int offset, int len) + { +- struct strp_msg *rxm; + struct sk_buff *skb; +- int i, err, offset; ++ int i, err; + +- skb = alloc_skb_with_frags(0, strp->stm.full_len, TLS_PAGE_ORDER, ++ skb = alloc_skb_with_frags(0, len, TLS_PAGE_ORDER, + &err, strp->sk->sk_allocation); + if (!skb) + return NULL; + +- offset = strp->stm.offset; + for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) { + skb_frag_t *frag = &skb_shinfo(skb)->frags[i]; + +- WARN_ON_ONCE(skb_copy_bits(strp->anchor, offset, ++ WARN_ON_ONCE(skb_copy_bits(in_skb, offset, + skb_frag_address(frag), + skb_frag_size(frag))); + offset += skb_frag_size(frag); + } + +- skb->len = strp->stm.full_len; +- skb->data_len = strp->stm.full_len; +- skb_copy_header(skb, strp->anchor); ++ skb->len = len; ++ skb->data_len = len; ++ skb_copy_header(skb, in_skb); ++ return skb; ++} ++ ++/* Create a new skb with the contents of input copied to its page frags */ ++static struct sk_buff *tls_strp_msg_make_copy(struct tls_strparser *strp) ++{ ++ struct strp_msg *rxm; ++ struct sk_buff *skb; ++ ++ skb = tls_strp_skb_copy(strp, strp->anchor, strp->stm.offset, ++ strp->stm.full_len); ++ if (!skb) ++ return NULL; ++ + rxm = strp_msg(skb); + rxm->offset = 0; + return skb; +-- +2.39.2 + diff --git a/queue-6.1/tls-rx-strp-fix-determining-record-length-in-copy-mo.patch b/queue-6.1/tls-rx-strp-fix-determining-record-length-in-copy-mo.patch new file mode 100644 index 00000000000..f85cd408e7f --- /dev/null +++ b/queue-6.1/tls-rx-strp-fix-determining-record-length-in-copy-mo.patch @@ -0,0 +1,71 @@ +From 44208b50d590560b00e95c93b4078c892f242444 Mon Sep 17 00:00:00 2001 +From: Sasha Levin +Date: Tue, 16 May 2023 18:50:39 -0700 +Subject: tls: rx: strp: fix determining record length in copy mode + +From: Jakub Kicinski + +[ Upstream commit 8b0c0dc9fbbd01e58a573a41c38885f9e4c17696 ] + +We call tls_rx_msg_size(skb) before doing skb->len += chunk. +So the tls_rx_msg_size() code will see old skb->len, most +likely leading to an over-read. + +Worst case we will over read an entire record, next iteration +will try to trim the skb but may end up turning frag len negative +or discarding the subsequent record (since we already told TCP +we've read it during previous read but now we'll trim it out of +the skb). + +Fixes: 84c61fe1a75b ("tls: rx: do not use the standard strparser") +Tested-by: Shai Amiram +Signed-off-by: Jakub Kicinski +Reviewed-by: Simon Horman +Signed-off-by: David S. 
Miller +Signed-off-by: Sasha Levin +--- + net/tls/tls_strp.c | 21 +++++++++++++++------ + 1 file changed, 15 insertions(+), 6 deletions(-) + +diff --git a/net/tls/tls_strp.c b/net/tls/tls_strp.c +index 24016c865e004..9889df5ce0660 100644 +--- a/net/tls/tls_strp.c ++++ b/net/tls/tls_strp.c +@@ -210,19 +210,28 @@ static int tls_strp_copyin(read_descriptor_t *desc, struct sk_buff *in_skb, + skb_frag_size(frag), + chunk)); + +- sz = tls_rx_msg_size(strp, strp->anchor); ++ skb->len += chunk; ++ skb->data_len += chunk; ++ skb_frag_size_add(frag, chunk); ++ ++ sz = tls_rx_msg_size(strp, skb); + if (sz < 0) { + desc->error = sz; + return 0; + } + + /* We may have over-read, sz == 0 is guaranteed under-read */ +- if (sz > 0) +- chunk = min_t(size_t, chunk, sz - skb->len); ++ if (unlikely(sz && sz < skb->len)) { ++ int over = skb->len - sz; ++ ++ WARN_ON_ONCE(over > chunk); ++ skb->len -= over; ++ skb->data_len -= over; ++ skb_frag_size_add(frag, -over); ++ ++ chunk -= over; ++ } + +- skb->len += chunk; +- skb->data_len += chunk; +- skb_frag_size_add(frag, chunk); + frag++; + len -= chunk; + offset += chunk; +-- +2.39.2 + diff --git a/queue-6.1/tls-rx-strp-force-mixed-decrypted-records-into-copy-.patch b/queue-6.1/tls-rx-strp-force-mixed-decrypted-records-into-copy-.patch new file mode 100644 index 00000000000..1464e7e8c58 --- /dev/null +++ b/queue-6.1/tls-rx-strp-force-mixed-decrypted-records-into-copy-.patch @@ -0,0 +1,97 @@ +From 200b14455e14910ea81e13833331928828c347fc Mon Sep 17 00:00:00 2001 +From: Sasha Levin +Date: Tue, 16 May 2023 18:50:38 -0700 +Subject: tls: rx: strp: force mixed decrypted records into copy mode + +From: Jakub Kicinski + +[ Upstream commit 14c4be92ebb3e36e392aa9dd8f314038a9f96f3c ] + +If a record is partially decrypted we'll have to CoW it, anyway, +so go into copy mode and allocate a writable skb right away. + +This will make subsequent fix simpler because we won't have to +teach tls_strp_msg_make_copy() how to copy skbs while preserving +decrypt status. + +Tested-by: Shai Amiram +Signed-off-by: Jakub Kicinski +Reviewed-by: Simon Horman +Signed-off-by: David S. 
Miller +Stable-dep-of: eca9bfafee3a ("tls: rx: strp: preserve decryption status of skbs when needed") +Signed-off-by: Sasha Levin +--- + include/linux/skbuff.h | 10 ++++++++++ + net/tls/tls_strp.c | 16 +++++++++++----- + 2 files changed, 21 insertions(+), 5 deletions(-) + +diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h +index 20ca1613f2e3e..cc5ed2cf25f65 100644 +--- a/include/linux/skbuff.h ++++ b/include/linux/skbuff.h +@@ -1567,6 +1567,16 @@ static inline void skb_copy_hash(struct sk_buff *to, const struct sk_buff *from) + to->l4_hash = from->l4_hash; + }; + ++static inline int skb_cmp_decrypted(const struct sk_buff *skb1, ++ const struct sk_buff *skb2) ++{ ++#ifdef CONFIG_TLS_DEVICE ++ return skb2->decrypted - skb1->decrypted; ++#else ++ return 0; ++#endif ++} ++ + static inline void skb_copy_decrypted(struct sk_buff *to, + const struct sk_buff *from) + { +diff --git a/net/tls/tls_strp.c b/net/tls/tls_strp.c +index 9889df5ce0660..e2e48217e7ac9 100644 +--- a/net/tls/tls_strp.c ++++ b/net/tls/tls_strp.c +@@ -326,15 +326,19 @@ static int tls_strp_read_copy(struct tls_strparser *strp, bool qshort) + return 0; + } + +-static bool tls_strp_check_no_dup(struct tls_strparser *strp) ++static bool tls_strp_check_queue_ok(struct tls_strparser *strp) + { + unsigned int len = strp->stm.offset + strp->stm.full_len; +- struct sk_buff *skb; ++ struct sk_buff *first, *skb; + u32 seq; + +- skb = skb_shinfo(strp->anchor)->frag_list; +- seq = TCP_SKB_CB(skb)->seq; ++ first = skb_shinfo(strp->anchor)->frag_list; ++ skb = first; ++ seq = TCP_SKB_CB(first)->seq; + ++ /* Make sure there's no duplicate data in the queue, ++ * and the decrypted status matches. ++ */ + while (skb->len < len) { + seq += skb->len; + len -= skb->len; +@@ -342,6 +346,8 @@ static bool tls_strp_check_no_dup(struct tls_strparser *strp) + + if (TCP_SKB_CB(skb)->seq != seq) + return false; ++ if (skb_cmp_decrypted(first, skb)) ++ return false; + } + + return true; +@@ -422,7 +428,7 @@ static int tls_strp_read_sock(struct tls_strparser *strp) + return tls_strp_read_copy(strp, true); + } + +- if (!tls_strp_check_no_dup(strp)) ++ if (!tls_strp_check_queue_ok(strp)) + return tls_strp_read_copy(strp, false); + + strp->msg_ready = 1; +-- +2.39.2 + diff --git a/queue-6.1/tls-rx-strp-preserve-decryption-status-of-skbs-when-.patch b/queue-6.1/tls-rx-strp-preserve-decryption-status-of-skbs-when-.patch new file mode 100644 index 00000000000..b597c3cb50c --- /dev/null +++ b/queue-6.1/tls-rx-strp-preserve-decryption-status-of-skbs-when-.patch @@ -0,0 +1,262 @@ +From 6fdc036e01a2673b96c29fc5efa8805244fb2f66 Mon Sep 17 00:00:00 2001 +From: Sasha Levin +Date: Tue, 16 May 2023 18:50:41 -0700 +Subject: tls: rx: strp: preserve decryption status of skbs when needed + +From: Jakub Kicinski + +[ Upstream commit eca9bfafee3a0487e59c59201ae14c7594ba940a ] + +When receive buffer is small we try to copy out the data from +TCP into a skb maintained by TLS to prevent connection from +stalling. Unfortunately if a single record is made up of a mix +of decrypted and non-decrypted skbs combining them into a single +skb leads to loss of decryption status, resulting in decryption +errors or data corruption. + +Similarly when trying to use TCP receive queue directly we need +to make sure that all the skbs within the record have the same +status. If we don't the mixed status will be detected correctly +but we'll CoW the anchor, again collapsing it into a single paged +skb without decrypted status preserved. 
So the "fixup" code will +not know which parts of skb to re-encrypt. + +Fixes: 84c61fe1a75b ("tls: rx: do not use the standard strparser") +Tested-by: Shai Amiram +Signed-off-by: Jakub Kicinski +Reviewed-by: Simon Horman +Signed-off-by: David S. Miller +Signed-off-by: Sasha Levin +--- + include/net/tls.h | 1 + + net/tls/tls.h | 5 ++ + net/tls/tls_device.c | 22 +++----- + net/tls/tls_strp.c | 117 ++++++++++++++++++++++++++++++++++++------- + 4 files changed, 114 insertions(+), 31 deletions(-) + +diff --git a/include/net/tls.h b/include/net/tls.h +index 154949c7b0c88..c36bf4c50027e 100644 +--- a/include/net/tls.h ++++ b/include/net/tls.h +@@ -124,6 +124,7 @@ struct tls_strparser { + u32 mark : 8; + u32 stopped : 1; + u32 copy_mode : 1; ++ u32 mixed_decrypted : 1; + u32 msg_ready : 1; + + struct strp_msg stm; +diff --git a/net/tls/tls.h b/net/tls/tls.h +index 0e840a0c3437b..17737a65c643a 100644 +--- a/net/tls/tls.h ++++ b/net/tls/tls.h +@@ -165,6 +165,11 @@ static inline bool tls_strp_msg_ready(struct tls_sw_context_rx *ctx) + return ctx->strp.msg_ready; + } + ++static inline bool tls_strp_msg_mixed_decrypted(struct tls_sw_context_rx *ctx) ++{ ++ return ctx->strp.mixed_decrypted; ++} ++ + #ifdef CONFIG_TLS_DEVICE + int tls_device_init(void); + void tls_device_cleanup(void); +diff --git a/net/tls/tls_device.c b/net/tls/tls_device.c +index 3b87c7b04ac87..bf69c9d6d06c0 100644 +--- a/net/tls/tls_device.c ++++ b/net/tls/tls_device.c +@@ -1007,20 +1007,14 @@ int tls_device_decrypted(struct sock *sk, struct tls_context *tls_ctx) + struct tls_sw_context_rx *sw_ctx = tls_sw_ctx_rx(tls_ctx); + struct sk_buff *skb = tls_strp_msg(sw_ctx); + struct strp_msg *rxm = strp_msg(skb); +- int is_decrypted = skb->decrypted; +- int is_encrypted = !is_decrypted; +- struct sk_buff *skb_iter; +- int left; +- +- left = rxm->full_len + rxm->offset - skb_pagelen(skb); +- /* Check if all the data is decrypted already */ +- skb_iter = skb_shinfo(skb)->frag_list; +- while (skb_iter && left > 0) { +- is_decrypted &= skb_iter->decrypted; +- is_encrypted &= !skb_iter->decrypted; +- +- left -= skb_iter->len; +- skb_iter = skb_iter->next; ++ int is_decrypted, is_encrypted; ++ ++ if (!tls_strp_msg_mixed_decrypted(sw_ctx)) { ++ is_decrypted = skb->decrypted; ++ is_encrypted = !is_decrypted; ++ } else { ++ is_decrypted = 0; ++ is_encrypted = 0; + } + + trace_tls_device_decrypted(sk, tcp_sk(sk)->copied_seq - rxm->full_len, +diff --git a/net/tls/tls_strp.c b/net/tls/tls_strp.c +index 61fbf84baf9e0..da95abbb7ea32 100644 +--- a/net/tls/tls_strp.c ++++ b/net/tls/tls_strp.c +@@ -29,7 +29,8 @@ static void tls_strp_anchor_free(struct tls_strparser *strp) + struct skb_shared_info *shinfo = skb_shinfo(strp->anchor); + + DEBUG_NET_WARN_ON_ONCE(atomic_read(&shinfo->dataref) != 1); +- shinfo->frag_list = NULL; ++ if (!strp->copy_mode) ++ shinfo->frag_list = NULL; + consume_skb(strp->anchor); + strp->anchor = NULL; + } +@@ -195,22 +196,22 @@ static void tls_strp_flush_anchor_copy(struct tls_strparser *strp) + for (i = 0; i < shinfo->nr_frags; i++) + __skb_frag_unref(&shinfo->frags[i], false); + shinfo->nr_frags = 0; ++ if (strp->copy_mode) { ++ kfree_skb_list(shinfo->frag_list); ++ shinfo->frag_list = NULL; ++ } + strp->copy_mode = 0; ++ strp->mixed_decrypted = 0; + } + +-static int tls_strp_copyin(read_descriptor_t *desc, struct sk_buff *in_skb, +- unsigned int offset, size_t in_len) ++static int tls_strp_copyin_frag(struct tls_strparser *strp, struct sk_buff *skb, ++ struct sk_buff *in_skb, unsigned int offset, ++ size_t in_len) + { +- struct 
tls_strparser *strp = (struct tls_strparser *)desc->arg.data; +- struct sk_buff *skb; +- skb_frag_t *frag; + size_t len, chunk; ++ skb_frag_t *frag; + int sz; + +- if (strp->msg_ready) +- return 0; +- +- skb = strp->anchor; + frag = &skb_shinfo(skb)->frags[skb->len / PAGE_SIZE]; + + len = in_len; +@@ -228,10 +229,8 @@ static int tls_strp_copyin(read_descriptor_t *desc, struct sk_buff *in_skb, + skb_frag_size_add(frag, chunk); + + sz = tls_rx_msg_size(strp, skb); +- if (sz < 0) { +- desc->error = sz; +- return 0; +- } ++ if (sz < 0) ++ return sz; + + /* We may have over-read, sz == 0 is guaranteed under-read */ + if (unlikely(sz && sz < skb->len)) { +@@ -271,15 +270,99 @@ static int tls_strp_copyin(read_descriptor_t *desc, struct sk_buff *in_skb, + offset += chunk; + } + +- if (strp->stm.full_len == skb->len) { ++read_done: ++ return in_len - len; ++} ++ ++static int tls_strp_copyin_skb(struct tls_strparser *strp, struct sk_buff *skb, ++ struct sk_buff *in_skb, unsigned int offset, ++ size_t in_len) ++{ ++ struct sk_buff *nskb, *first, *last; ++ struct skb_shared_info *shinfo; ++ size_t chunk; ++ int sz; ++ ++ if (strp->stm.full_len) ++ chunk = strp->stm.full_len - skb->len; ++ else ++ chunk = TLS_MAX_PAYLOAD_SIZE + PAGE_SIZE; ++ chunk = min(chunk, in_len); ++ ++ nskb = tls_strp_skb_copy(strp, in_skb, offset, chunk); ++ if (!nskb) ++ return -ENOMEM; ++ ++ shinfo = skb_shinfo(skb); ++ if (!shinfo->frag_list) { ++ shinfo->frag_list = nskb; ++ nskb->prev = nskb; ++ } else { ++ first = shinfo->frag_list; ++ last = first->prev; ++ last->next = nskb; ++ first->prev = nskb; ++ } ++ ++ skb->len += chunk; ++ skb->data_len += chunk; ++ ++ if (!strp->stm.full_len) { ++ sz = tls_rx_msg_size(strp, skb); ++ if (sz < 0) ++ return sz; ++ ++ /* We may have over-read, sz == 0 is guaranteed under-read */ ++ if (unlikely(sz && sz < skb->len)) { ++ int over = skb->len - sz; ++ ++ WARN_ON_ONCE(over > chunk); ++ skb->len -= over; ++ skb->data_len -= over; ++ __pskb_trim(nskb, nskb->len - over); ++ ++ chunk -= over; ++ } ++ ++ strp->stm.full_len = sz; ++ } ++ ++ return chunk; ++} ++ ++static int tls_strp_copyin(read_descriptor_t *desc, struct sk_buff *in_skb, ++ unsigned int offset, size_t in_len) ++{ ++ struct tls_strparser *strp = (struct tls_strparser *)desc->arg.data; ++ struct sk_buff *skb; ++ int ret; ++ ++ if (strp->msg_ready) ++ return 0; ++ ++ skb = strp->anchor; ++ if (!skb->len) ++ skb_copy_decrypted(skb, in_skb); ++ else ++ strp->mixed_decrypted |= !!skb_cmp_decrypted(skb, in_skb); ++ ++ if (IS_ENABLED(CONFIG_TLS_DEVICE) && strp->mixed_decrypted) ++ ret = tls_strp_copyin_skb(strp, skb, in_skb, offset, in_len); ++ else ++ ret = tls_strp_copyin_frag(strp, skb, in_skb, offset, in_len); ++ if (ret < 0) { ++ desc->error = ret; ++ ret = 0; ++ } ++ ++ if (strp->stm.full_len && strp->stm.full_len == skb->len) { + desc->count = 0; + + strp->msg_ready = 1; + tls_rx_msg_ready(strp); + } + +-read_done: +- return in_len - len; ++ return ret; + } + + static int tls_strp_read_copyin(struct tls_strparser *strp) +-- +2.39.2 + diff --git a/queue-6.1/tls-rx-strp-set-the-skb-len-of-detached-cow-ed-skbs.patch b/queue-6.1/tls-rx-strp-set-the-skb-len-of-detached-cow-ed-skbs.patch new file mode 100644 index 00000000000..3148cfc42db --- /dev/null +++ b/queue-6.1/tls-rx-strp-set-the-skb-len-of-detached-cow-ed-skbs.patch @@ -0,0 +1,40 @@ +From fad0e22d1b05be4fa3448678d8728fbfe3939065 Mon Sep 17 00:00:00 2001 +From: Sasha Levin +Date: Tue, 16 May 2023 18:50:37 -0700 +Subject: tls: rx: strp: set the skb->len of detached / CoW'ed 
skbs + +From: Jakub Kicinski + +[ Upstream commit 210620ae44a83f25220450bbfcc22e6fe986b25f ] + +alloc_skb_with_frags() fills in page frag sizes but does not +set skb->len and skb->data_len. Set those correctly otherwise +device offload will most likely generate an empty skb and +hit the BUG() at the end of __skb_nsg(). + +Fixes: 84c61fe1a75b ("tls: rx: do not use the standard strparser") +Tested-by: Shai Amiram +Signed-off-by: Jakub Kicinski +Reviewed-by: Simon Horman +Signed-off-by: David S. Miller +Signed-off-by: Sasha Levin +--- + net/tls/tls_strp.c | 2 ++ + 1 file changed, 2 insertions(+) + +diff --git a/net/tls/tls_strp.c b/net/tls/tls_strp.c +index 955ac3e0bf4d3..24016c865e004 100644 +--- a/net/tls/tls_strp.c ++++ b/net/tls/tls_strp.c +@@ -56,6 +56,8 @@ static struct sk_buff *tls_strp_msg_make_copy(struct tls_strparser *strp) + offset += skb_frag_size(frag); + } + ++ skb->len = strp->stm.full_len; ++ skb->data_len = strp->stm.full_len; + skb_copy_header(skb, strp->anchor); + rxm = strp_msg(skb); + rxm->offset = 0; +-- +2.39.2 + diff --git a/queue-6.1/vfio-type1-check-pfn-valid-before-converting-to-stru.patch b/queue-6.1/vfio-type1-check-pfn-valid-before-converting-to-stru.patch new file mode 100644 index 00000000000..7f1d923409c --- /dev/null +++ b/queue-6.1/vfio-type1-check-pfn-valid-before-converting-to-stru.patch @@ -0,0 +1,64 @@ +From ed771c5a237db5efc48a1c50f3f430d419080509 Mon Sep 17 00:00:00 2001 +From: Sasha Levin +Date: Fri, 19 May 2023 14:58:43 +0800 +Subject: vfio/type1: check pfn valid before converting to struct page + +From: Yan Zhao + +[ Upstream commit 4752354af71043e6fd72ef5490ed6da39e6cab4a ] + +Check physical PFN is valid before converting the PFN to a struct page +pointer to be returned to caller of vfio_pin_pages(). + +vfio_pin_pages() pins user pages with contiguous IOVA. +If the IOVA of a user page to be pinned belongs to vma of vm_flags +VM_PFNMAP, pin_user_pages_remote() will return -EFAULT without returning +struct page address for this PFN. This is because usually this kind of PFN +(e.g. MMIO PFN) has no valid struct page address associated. +Upon this error, vaddr_get_pfns() will obtain the physical PFN directly. + +While previously vfio_pin_pages() returns to caller PFN arrays directly, +after commit +34a255e67615 ("vfio: Replace phys_pfn with pages for vfio_pin_pages()"), +PFNs will be converted to "struct page *" unconditionally and therefore +the returned "struct page *" array may contain invalid struct page +addresses. + +Given current in-tree users of vfio_pin_pages() only expect "struct page * +returned, check PFN validity and return -EINVAL to let the caller be +aware of IOVAs to be pinned containing PFN not able to be returned in +"struct page *" array. So that, the caller will not consume the returned +pointer (e.g. test PageReserved()) and avoid error like "supervisor read +access in kernel mode". 
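+
+For illustration only, a tiny userspace model of the validate-before-convert
+rule; pfn_has_struct_page(), lookup_page() and pin_one_pfn() are hypothetical
+stand-ins for pfn_valid(), pfn_to_page() and the pinning path:
+
+    #include <errno.h>
+    #include <stdbool.h>
+    #include <stdio.h>
+
+    struct page { unsigned long pfn; };
+
+    static struct page page_array[16];      /* pretend memmap for 16 PFNs */
+
+    static bool pfn_has_struct_page(unsigned long pfn)
+    {
+        return pfn < 16;                    /* MMIO-like PFNs fall outside */
+    }
+
+    static struct page *lookup_page(unsigned long pfn)
+    {
+        return &page_array[pfn];
+    }
+
+    static int pin_one_pfn(unsigned long pfn, struct page **out)
+    {
+        /* Refuse PFNs with no struct page instead of handing the caller
+         * a bogus pointer it might dereference later.
+         */
+        if (!pfn_has_struct_page(pfn))
+            return -EINVAL;
+        *out = lookup_page(pfn);
+        return 0;
+    }
+
+    int main(void)
+    {
+        struct page *pg;
+
+        printf("pfn 3:  %d\n", pin_one_pfn(3, &pg));    /* 0 */
+        printf("pfn 99: %d\n", pin_one_pfn(99, &pg));   /* -EINVAL */
+        return 0;
+    }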
+ +Fixes: 34a255e67615 ("vfio: Replace phys_pfn with pages for vfio_pin_pages()") +Cc: Sean Christopherson +Reviewed-by: Jason Gunthorpe +Signed-off-by: Yan Zhao +Reviewed-by: Sean Christopherson +Link: https://lore.kernel.org/r/20230519065843.10653-1-yan.y.zhao@intel.com +Signed-off-by: Alex Williamson +Signed-off-by: Sasha Levin +--- + drivers/vfio/vfio_iommu_type1.c | 5 +++++ + 1 file changed, 5 insertions(+) + +diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c +index 7fa68dc4e938a..009ba186652ac 100644 +--- a/drivers/vfio/vfio_iommu_type1.c ++++ b/drivers/vfio/vfio_iommu_type1.c +@@ -936,6 +936,11 @@ static int vfio_iommu_type1_pin_pages(void *iommu_data, + if (ret) + goto pin_unwind; + ++ if (!pfn_valid(phys_pfn)) { ++ ret = -EINVAL; ++ goto pin_unwind; ++ } ++ + ret = vfio_add_to_pfn_list(dma, iova, phys_pfn); + if (ret) { + if (put_pfn(phys_pfn, dma->prot) && do_accounting) +-- +2.39.2 +