--- /dev/null
+From 7d3619700ce03dbf88f176046df552aec7a6646a Mon Sep 17 00:00:00 2001
+From: Sasha Levin <sashal@kernel.org>
+Date: Mon, 22 May 2023 17:05:55 -0400
+Subject: blk-mq: fix race condition in active queue accounting
+
+From: Tian Lan <tian.lan@twosigma.com>
+
+[ Upstream commit 3e94d54e83cafd2b562bb6d15bb2f72d76200fb5 ]
+
+If multiple CPUs are sharing the same hardware queue, it can cause a
+leak in the active queue counter tracking when __blk_mq_tag_busy()
+is executed simultaneously.
+
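+As a rough userspace-style illustration of the pattern (illustrative
+names only, not the kernel code itself), an atomic test-and-set lets
+exactly one racer fall through to the counter increment:
+
+  #include <stdatomic.h>
+
+  static atomic_flag hctx_active = ATOMIC_FLAG_INIT;
+  static atomic_int  active_queues;
+
+  static void mark_queue_busy(void)
+  {
+          /* atomic_flag_test_and_set() returns the previous value, so
+           * only the caller that flips the flag accounts the queue.  A
+           * plain "if (!flag) { flag = 1; counter++; }" lets two CPUs
+           * both pass the check and double-increment the counter.
+           */
+          if (atomic_flag_test_and_set(&hctx_active))
+                  return;
+          atomic_fetch_add(&active_queues, 1);
+  }
+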
+Fixes: ee78ec1077d3 ("blk-mq: blk_mq_tag_busy is no need to return a value")
+Signed-off-by: Tian Lan <tian.lan@twosigma.com>
+Reviewed-by: Ming Lei <ming.lei@redhat.com>
+Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
+Reviewed-by: John Garry <john.g.garry@oracle.com>
+Link: https://lore.kernel.org/r/20230522210555.794134-1-tilan7663@gmail.com
+Signed-off-by: Jens Axboe <axboe@kernel.dk>
+Signed-off-by: Sasha Levin <sashal@kernel.org>
+---
+ block/blk-mq-tag.c | 12 ++++++++----
+ 1 file changed, 8 insertions(+), 4 deletions(-)
+
+diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
+index 9eb968e14d31f..a80d7c62bdfe6 100644
+--- a/block/blk-mq-tag.c
++++ b/block/blk-mq-tag.c
+@@ -41,16 +41,20 @@ void __blk_mq_tag_busy(struct blk_mq_hw_ctx *hctx)
+ {
+ unsigned int users;
+
++ /*
++ * calling test_bit() prior to test_and_set_bit() is intentional,
++ * it avoids dirtying the cacheline if the queue is already active.
++ */
+ if (blk_mq_is_shared_tags(hctx->flags)) {
+ struct request_queue *q = hctx->queue;
+
+- if (test_bit(QUEUE_FLAG_HCTX_ACTIVE, &q->queue_flags))
++ if (test_bit(QUEUE_FLAG_HCTX_ACTIVE, &q->queue_flags) ||
++ test_and_set_bit(QUEUE_FLAG_HCTX_ACTIVE, &q->queue_flags))
+ return;
+- set_bit(QUEUE_FLAG_HCTX_ACTIVE, &q->queue_flags);
+ } else {
+- if (test_bit(BLK_MQ_S_TAG_ACTIVE, &hctx->state))
++ if (test_bit(BLK_MQ_S_TAG_ACTIVE, &hctx->state) ||
++ test_and_set_bit(BLK_MQ_S_TAG_ACTIVE, &hctx->state))
+ return;
+- set_bit(BLK_MQ_S_TAG_ACTIVE, &hctx->state);
+ }
+
+ users = atomic_inc_return(&hctx->tags->active_queues);
+--
+2.39.2
+
--- /dev/null
+From 907ed25a5d69d2ddd7edd2d08a54a1079ece1211 Mon Sep 17 00:00:00 2001
+From: Sasha Levin <sashal@kernel.org>
+Date: Mon, 22 May 2023 19:56:06 -0700
+Subject: bpf, sockmap: Convert schedule_work into delayed_work
+
+From: John Fastabend <john.fastabend@gmail.com>
+
+[ Upstream commit 29173d07f79883ac94f5570294f98af3d4287382 ]
+
+Sk_buffs are fed into sockmap verdict programs either from a strparser
+(when the user might want to decide how framing of the skb is done by
+attaching another parser program) or directly through tcp_read_sock.
+tcp_read_sock() is the preferred method for performance when the BPF logic
+is a stream parser.
+
+The flow for Cilium's common use case with a stream parser is,
+
+ tcp_read_sock()
+ sk_psock_verdict_recv
+ ret = bpf_prog_run_pin_on_cpu()
+ sk_psock_verdict_apply(sock, skb, ret)
+ // if system is under memory pressure or app is slow we may
+ // need to queue skb. Do this queuing through ingress_skb and
+ // then kick timer to wake up handler
+ skb_queue_tail(ingress_skb, skb)
+ schedule_work(work);
+
+The work queue is wired up to sk_psock_backlog(). This will then walk the
+ingress_skb skb list that holds our sk_buffs that could not be handled,
+but should be OK to run at some later point. However, it's possible that
+the workqueue doing this work still hits an error when sending the skb.
+When this happens the skbuff is requeued on a temporary 'state' struct
+kept with the workqueue. This is necessary because it's possible to
+partially send an skbuff before hitting an error and we need to know how
+and where to restart when the workqueue runs next.
+
+Now for the trouble: we don't rekick the workqueue. This can cause a
+stall where the skbuff we just cached on the state variable might never
+be sent. This happens when it's the last packet in a flow and no further
+packets come along that would cause the system to kick the workqueue from
+that side.
+
+To fix we could do a simple schedule_work(), but while under memory
+pressure it makes sense to back off some instead of continuing to retry
+repeatedly. So instead convert schedule_work() to schedule_delayed_work()
+and add backoff logic to reschedule from the backlog queue on errors. It's
+not obvious what a good backoff is, so use '1'.
+
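+The resulting retry pattern looks roughly like this (sketch with
+hypothetical helper and struct names; the real code lives in
+sk_psock_backlog()):
+
+  static void backlog_work(struct work_struct *work)
+  {
+          struct delayed_work *dwork = to_delayed_work(work);
+          struct my_psock *psock = container_of(dwork, struct my_psock, work);
+
+          if (try_send_backlog(psock) == -EAGAIN)
+                  /* back off one jiffy instead of retrying immediately */
+                  schedule_delayed_work(&psock->work, 1);
+  }
+
+  /* setup: INIT_DELAYED_WORK(&psock->work, backlog_work);
+   * kick:  schedule_delayed_work(&psock->work, 0);
+   */
+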
+While testing we observed some flakes when running the NGINX compliance
+test with sockmap; we attributed these failed tests to this bug and the
+subsequent issues.
+
+As noted in the on-list discussion, this commit
+
+ bec217197b41 ("skmsg: Schedule psock work if the cached skb exists on the psock")
+
+was intended to address a similar race, but had a couple of cases it
+missed. Most obviously, it only accounted for receiving traffic on the
+local socket, so if redirecting into another socket we could still get an
+sk_buff stuck here. Next, it missed the case where copied=0 in the recv()
+handler and then we wouldn't kick the scheduler. Also it's sub-optimal to
+require userspace to kick the internal mechanisms of sockmap to wake it up
+and copy data to user. It results in an extra syscall and requires the app
+to actually handle the EAGAIN correctly.
+
+Fixes: 04919bed948dc ("tcp: Introduce tcp_read_skb()")
+Signed-off-by: John Fastabend <john.fastabend@gmail.com>
+Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
+Tested-by: William Findlay <will@isovalent.com>
+Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
+Link: https://lore.kernel.org/bpf/20230523025618.113937-3-john.fastabend@gmail.com
+Signed-off-by: Sasha Levin <sashal@kernel.org>
+---
+ include/linux/skmsg.h | 2 +-
+ net/core/skmsg.c | 21 ++++++++++++++-------
+ net/core/sock_map.c | 3 ++-
+ 3 files changed, 17 insertions(+), 9 deletions(-)
+
+diff --git a/include/linux/skmsg.h b/include/linux/skmsg.h
+index 84f787416a54d..904ff9a32ad61 100644
+--- a/include/linux/skmsg.h
++++ b/include/linux/skmsg.h
+@@ -105,7 +105,7 @@ struct sk_psock {
+ struct proto *sk_proto;
+ struct mutex work_mutex;
+ struct sk_psock_work_state work_state;
+- struct work_struct work;
++ struct delayed_work work;
+ struct rcu_work rwork;
+ };
+
+diff --git a/net/core/skmsg.c b/net/core/skmsg.c
+index 2b6d9519ff29c..6a9b794861f3f 100644
+--- a/net/core/skmsg.c
++++ b/net/core/skmsg.c
+@@ -481,7 +481,7 @@ int sk_msg_recvmsg(struct sock *sk, struct sk_psock *psock, struct msghdr *msg,
+ }
+ out:
+ if (psock->work_state.skb && copied > 0)
+- schedule_work(&psock->work);
++ schedule_delayed_work(&psock->work, 0);
+ return copied;
+ }
+ EXPORT_SYMBOL_GPL(sk_msg_recvmsg);
+@@ -639,7 +639,8 @@ static void sk_psock_skb_state(struct sk_psock *psock,
+
+ static void sk_psock_backlog(struct work_struct *work)
+ {
+- struct sk_psock *psock = container_of(work, struct sk_psock, work);
++ struct delayed_work *dwork = to_delayed_work(work);
++ struct sk_psock *psock = container_of(dwork, struct sk_psock, work);
+ struct sk_psock_work_state *state = &psock->work_state;
+ struct sk_buff *skb = NULL;
+ bool ingress;
+@@ -679,6 +680,12 @@ static void sk_psock_backlog(struct work_struct *work)
+ if (ret == -EAGAIN) {
+ sk_psock_skb_state(psock, state, skb,
+ len, off);
++
++ /* Delay slightly to prioritize any
++ * other work that might be here.
++ */
++ if (sk_psock_test_state(psock, SK_PSOCK_TX_ENABLED))
++ schedule_delayed_work(&psock->work, 1);
+ goto end;
+ }
+ /* Hard errors break pipe and stop xmit. */
+@@ -733,7 +740,7 @@ struct sk_psock *sk_psock_init(struct sock *sk, int node)
+ INIT_LIST_HEAD(&psock->link);
+ spin_lock_init(&psock->link_lock);
+
+- INIT_WORK(&psock->work, sk_psock_backlog);
++ INIT_DELAYED_WORK(&psock->work, sk_psock_backlog);
+ mutex_init(&psock->work_mutex);
+ INIT_LIST_HEAD(&psock->ingress_msg);
+ spin_lock_init(&psock->ingress_lock);
+@@ -822,7 +829,7 @@ static void sk_psock_destroy(struct work_struct *work)
+
+ sk_psock_done_strp(psock);
+
+- cancel_work_sync(&psock->work);
++ cancel_delayed_work_sync(&psock->work);
+ mutex_destroy(&psock->work_mutex);
+
+ psock_progs_drop(&psock->progs);
+@@ -937,7 +944,7 @@ static int sk_psock_skb_redirect(struct sk_psock *from, struct sk_buff *skb)
+ }
+
+ skb_queue_tail(&psock_other->ingress_skb, skb);
+- schedule_work(&psock_other->work);
++ schedule_delayed_work(&psock_other->work, 0);
+ spin_unlock_bh(&psock_other->ingress_lock);
+ return 0;
+ }
+@@ -1017,7 +1024,7 @@ static int sk_psock_verdict_apply(struct sk_psock *psock, struct sk_buff *skb,
+ spin_lock_bh(&psock->ingress_lock);
+ if (sk_psock_test_state(psock, SK_PSOCK_TX_ENABLED)) {
+ skb_queue_tail(&psock->ingress_skb, skb);
+- schedule_work(&psock->work);
++ schedule_delayed_work(&psock->work, 0);
+ err = 0;
+ }
+ spin_unlock_bh(&psock->ingress_lock);
+@@ -1048,7 +1055,7 @@ static void sk_psock_write_space(struct sock *sk)
+ psock = sk_psock(sk);
+ if (likely(psock)) {
+ if (sk_psock_test_state(psock, SK_PSOCK_TX_ENABLED))
+- schedule_work(&psock->work);
++ schedule_delayed_work(&psock->work, 0);
+ write_space = psock->saved_write_space;
+ }
+ rcu_read_unlock();
+diff --git a/net/core/sock_map.c b/net/core/sock_map.c
+index a68a7290a3b2b..d382672018928 100644
+--- a/net/core/sock_map.c
++++ b/net/core/sock_map.c
+@@ -1624,9 +1624,10 @@ void sock_map_close(struct sock *sk, long timeout)
+ rcu_read_unlock();
+ sk_psock_stop(psock);
+ release_sock(sk);
+- cancel_work_sync(&psock->work);
++ cancel_delayed_work_sync(&psock->work);
+ sk_psock_put(sk, psock);
+ }
++
+ /* Make sure we do not recurse. This is a bug.
+ * Leak the socket instead of crashing on a stack overflow.
+ */
+--
+2.39.2
+
--- /dev/null
+From a0481dab8243bb79c9e5d41af86f6d19f7298477 Mon Sep 17 00:00:00 2001
+From: Sasha Levin <sashal@kernel.org>
+Date: Mon, 22 May 2023 19:56:09 -0700
+Subject: bpf, sockmap: Handle fin correctly
+
+From: John Fastabend <john.fastabend@gmail.com>
+
+[ Upstream commit 901546fd8f9ca4b5c481ce00928ab425ce9aacc0 ]
+
+The sockmap code is returning EAGAIN after a FIN packet is received and no
+more data is on the receive queue. Correct behavior is to return 0 to the
+user and the user can then close the socket. The EAGAIN causes many apps
+to retry, which masks the problem. Eventually the socket is evicted from
+the sockmap because it's released from sockmap sock free handling. The
+issue creates a delay and can cause some errors on the application side.
+
+To fix this, check on the sk_msg_recvmsg() side if the length is zero and
+the FIN flag is set, and in that case return zero. A selftest will be
+added to check this condition.
+
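+For reference, the end-of-stream contract the application expects is the
+usual one (plain userspace sketch, not kernel code):
+
+  #include <errno.h>
+  #include <sys/socket.h>
+  #include <unistd.h>
+
+  /* Returns 1 once the peer's FIN has been observed, 0 otherwise. */
+  static int check_eof(int fd)
+  {
+          char buf[4096];
+          ssize_t n = recv(fd, buf, sizeof(buf), 0);
+
+          if (n == 0) {                   /* graceful shutdown, not EAGAIN */
+                  close(fd);
+                  return 1;
+          }
+          if (n < 0 && errno == EAGAIN)   /* genuinely no data yet */
+                  return 0;
+          return 0;
+  }
+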
+Fixes: 04919bed948dc ("tcp: Introduce tcp_read_skb()")
+Signed-off-by: John Fastabend <john.fastabend@gmail.com>
+Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
+Tested-by: William Findlay <will@isovalent.com>
+Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
+Link: https://lore.kernel.org/bpf/20230523025618.113937-6-john.fastabend@gmail.com
+Signed-off-by: Sasha Levin <sashal@kernel.org>
+---
+ net/ipv4/tcp_bpf.c | 31 +++++++++++++++++++++++++++++++
+ 1 file changed, 31 insertions(+)
+
+diff --git a/net/ipv4/tcp_bpf.c b/net/ipv4/tcp_bpf.c
+index 2e9547467edbe..73c13642d47f6 100644
+--- a/net/ipv4/tcp_bpf.c
++++ b/net/ipv4/tcp_bpf.c
+@@ -174,6 +174,24 @@ static int tcp_msg_wait_data(struct sock *sk, struct sk_psock *psock,
+ return ret;
+ }
+
++static bool is_next_msg_fin(struct sk_psock *psock)
++{
++ struct scatterlist *sge;
++ struct sk_msg *msg_rx;
++ int i;
++
++ msg_rx = sk_psock_peek_msg(psock);
++ i = msg_rx->sg.start;
++ sge = sk_msg_elem(msg_rx, i);
++ if (!sge->length) {
++ struct sk_buff *skb = msg_rx->skb;
++
++ if (skb && TCP_SKB_CB(skb)->tcp_flags & TCPHDR_FIN)
++ return true;
++ }
++ return false;
++}
++
+ static int tcp_bpf_recvmsg_parser(struct sock *sk,
+ struct msghdr *msg,
+ size_t len,
+@@ -196,6 +214,19 @@ static int tcp_bpf_recvmsg_parser(struct sock *sk,
+ lock_sock(sk);
+ msg_bytes_ready:
+ copied = sk_msg_recvmsg(sk, psock, msg, len, flags);
++ /* The typical case for EFAULT is the socket was gracefully
++ * shutdown with a FIN pkt. So check here the other case is
++ * some error on copy_page_to_iter which would be unexpected.
++ * On fin return correct return code to zero.
++ */
++ if (copied == -EFAULT) {
++ bool is_fin = is_next_msg_fin(psock);
++
++ if (is_fin) {
++ copied = 0;
++ goto out;
++ }
++ }
+ if (!copied) {
+ long timeo;
+ int data;
+--
+2.39.2
+
--- /dev/null
+From b30cce8f5c1fd0d245cc4075f7cfe47d1ea91c85 Mon Sep 17 00:00:00 2001
+From: Sasha Levin <sashal@kernel.org>
+Date: Mon, 22 May 2023 19:56:08 -0700
+Subject: bpf, sockmap: Improved check for empty queue
+
+From: John Fastabend <john.fastabend@gmail.com>
+
+[ Upstream commit 405df89dd52cbcd69a3cd7d9a10d64de38f854b2 ]
+
+We noticed some rare sk_buffs were stepping past the queue when the system
+was under memory pressure. The general theory is to skip enqueueing
+sk_buffs when it's not necessary, which is the normal case with a system
+that is properly provisioned for the task: no memory pressure and enough
+CPU assigned.
+
+But, if we can't allocate memory due to an ENOMEM error when enqueueing
+the sk_buff into the sockmap receive queue, we push it onto a delayed
+workqueue to retry later. When a new sk_buff is received we then check
+if that queue is empty. However, there is a problem with simply checking
+the queue length. When a sk_buff is being processed from the ingress queue
+but not yet on the sockmap msg receive queue, it's possible to also recv
+a sk_buff through the normal path. It will check the ingress queue, see a
+length of zero, and then skip ahead of the pkt being processed.
+
+Previously we used the sock lock from both contexts, which made the
+problem harder to hit, but not impossible.
+
+To fix, instead of popping the skb from the queue entirely we peek the
+skb from the queue and do the copy there. This ensures checks of the
+queue length are non-zero while the skb is being processed. Then finally,
+when the entire skb has been copied to the user space queue or another
+socket, we pop it off the queue. This way the queue length check allows
+bypassing the queue only after the list has been completely processed.
+
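+The peek-then-pop idea, as a simplified sketch (send_one() is a
+hypothetical stand-in for the real send path):
+
+  struct sk_buff *skb;
+
+  while ((skb = skb_peek(&psock->ingress_skb))) {
+          if (send_one(psock, skb) == -EAGAIN)
+                  break;          /* skb stays queued, length stays != 0 */
+          /* Only a fully handled skb leaves the list, so a concurrent
+           * "is ingress_skb empty?" check cannot race past it.
+           */
+          skb = skb_dequeue(&psock->ingress_skb);
+          kfree_skb(skb);         /* real code frees only the egress case */
+  }
+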
+To reproduce the issue we run the NGINX compliance test with sockmap
+enabled and observe some flakes in our testing that we attributed to this
+issue.
+
+Fixes: 04919bed948dc ("tcp: Introduce tcp_read_skb()")
+Suggested-by: Jakub Sitnicki <jakub@cloudflare.com>
+Signed-off-by: John Fastabend <john.fastabend@gmail.com>
+Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
+Tested-by: William Findlay <will@isovalent.com>
+Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
+Link: https://lore.kernel.org/bpf/20230523025618.113937-5-john.fastabend@gmail.com
+Signed-off-by: Sasha Levin <sashal@kernel.org>
+---
+ include/linux/skmsg.h | 1 -
+ net/core/skmsg.c | 32 ++++++++------------------------
+ 2 files changed, 8 insertions(+), 25 deletions(-)
+
+diff --git a/include/linux/skmsg.h b/include/linux/skmsg.h
+index 904ff9a32ad61..054d7911bfc9f 100644
+--- a/include/linux/skmsg.h
++++ b/include/linux/skmsg.h
+@@ -71,7 +71,6 @@ struct sk_psock_link {
+ };
+
+ struct sk_psock_work_state {
+- struct sk_buff *skb;
+ u32 len;
+ u32 off;
+ };
+diff --git a/net/core/skmsg.c b/net/core/skmsg.c
+index 2dfb6e31e8d04..d3ffca1b96462 100644
+--- a/net/core/skmsg.c
++++ b/net/core/skmsg.c
+@@ -621,16 +621,12 @@ static int sk_psock_handle_skb(struct sk_psock *psock, struct sk_buff *skb,
+
+ static void sk_psock_skb_state(struct sk_psock *psock,
+ struct sk_psock_work_state *state,
+- struct sk_buff *skb,
+ int len, int off)
+ {
+ spin_lock_bh(&psock->ingress_lock);
+ if (sk_psock_test_state(psock, SK_PSOCK_TX_ENABLED)) {
+- state->skb = skb;
+ state->len = len;
+ state->off = off;
+- } else {
+- sock_drop(psock->sk, skb);
+ }
+ spin_unlock_bh(&psock->ingress_lock);
+ }
+@@ -641,23 +637,17 @@ static void sk_psock_backlog(struct work_struct *work)
+ struct sk_psock *psock = container_of(dwork, struct sk_psock, work);
+ struct sk_psock_work_state *state = &psock->work_state;
+ struct sk_buff *skb = NULL;
++ u32 len = 0, off = 0;
+ bool ingress;
+- u32 len, off;
+ int ret;
+
+ mutex_lock(&psock->work_mutex);
+- if (unlikely(state->skb)) {
+- spin_lock_bh(&psock->ingress_lock);
+- skb = state->skb;
++ if (unlikely(state->len)) {
+ len = state->len;
+ off = state->off;
+- state->skb = NULL;
+- spin_unlock_bh(&psock->ingress_lock);
+ }
+- if (skb)
+- goto start;
+
+- while ((skb = skb_dequeue(&psock->ingress_skb))) {
++ while ((skb = skb_peek(&psock->ingress_skb))) {
+ len = skb->len;
+ off = 0;
+ if (skb_bpf_strparser(skb)) {
+@@ -666,7 +656,6 @@ static void sk_psock_backlog(struct work_struct *work)
+ off = stm->offset;
+ len = stm->full_len;
+ }
+-start:
+ ingress = skb_bpf_ingress(skb);
+ skb_bpf_redirect_clear(skb);
+ do {
+@@ -676,8 +665,7 @@ static void sk_psock_backlog(struct work_struct *work)
+ len, ingress);
+ if (ret <= 0) {
+ if (ret == -EAGAIN) {
+- sk_psock_skb_state(psock, state, skb,
+- len, off);
++ sk_psock_skb_state(psock, state, len, off);
+
+ /* Delay slightly to prioritize any
+ * other work that might be here.
+@@ -689,15 +677,16 @@ static void sk_psock_backlog(struct work_struct *work)
+ /* Hard errors break pipe and stop xmit. */
+ sk_psock_report_error(psock, ret ? -ret : EPIPE);
+ sk_psock_clear_state(psock, SK_PSOCK_TX_ENABLED);
+- sock_drop(psock->sk, skb);
+ goto end;
+ }
+ off += ret;
+ len -= ret;
+ } while (len);
+
+- if (!ingress)
++ skb = skb_dequeue(&psock->ingress_skb);
++ if (!ingress) {
+ kfree_skb(skb);
++ }
+ }
+ end:
+ mutex_unlock(&psock->work_mutex);
+@@ -790,11 +779,6 @@ static void __sk_psock_zap_ingress(struct sk_psock *psock)
+ skb_bpf_redirect_clear(skb);
+ sock_drop(psock->sk, skb);
+ }
+- kfree_skb(psock->work_state.skb);
+- /* We null the skb here to ensure that calls to sk_psock_backlog
+- * do not pick up the free'd skb.
+- */
+- psock->work_state.skb = NULL;
+ __sk_psock_purge_ingress_msg(psock);
+ }
+
+@@ -813,7 +797,6 @@ void sk_psock_stop(struct sk_psock *psock)
+ spin_lock_bh(&psock->ingress_lock);
+ sk_psock_clear_state(psock, SK_PSOCK_TX_ENABLED);
+ sk_psock_cork_free(psock);
+- __sk_psock_zap_ingress(psock);
+ spin_unlock_bh(&psock->ingress_lock);
+ }
+
+@@ -828,6 +811,7 @@ static void sk_psock_destroy(struct work_struct *work)
+ sk_psock_done_strp(psock);
+
+ cancel_delayed_work_sync(&psock->work);
++ __sk_psock_zap_ingress(psock);
+ mutex_destroy(&psock->work_mutex);
+
+ psock_progs_drop(&psock->progs);
+--
+2.39.2
+
--- /dev/null
+From 6fe373073f0aed92f9c8bc739f3553e1f5eb2ece Mon Sep 17 00:00:00 2001
+From: Sasha Levin <sashal@kernel.org>
+Date: Mon, 22 May 2023 19:56:12 -0700
+Subject: bpf, sockmap: Incorrectly handling copied_seq
+
+From: John Fastabend <john.fastabend@gmail.com>
+
+[ Upstream commit e5c6de5fa025882babf89cecbed80acf49b987fa ]
+
+The read_skb() logic is incrementing tcp->copied_seq, which is used for,
+among other things, calculating how many outstanding bytes can be read by
+the application. This results in application errors: if the application
+does an ioctl(FIONREAD) we return zero because this is calculated from
+the copied_seq value.
+
+To fix this we move the tcp->copied_seq accounting into the recv handler
+so that we update it when the recvmsg() hook is called and data is in
+fact copied into user buffers. This gives an accurate FIONREAD value
+as expected and improves ACK handling. Before, we were calling
+tcp_rcv_space_adjust(), which would update the 'number of bytes copied to
+user in the last RTT', which is wrong for programs returning SK_PASS. The
+bytes are only copied to the user when recvmsg is handled.
+
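+The userspace-visible symptom can be seen with the usual FIONREAD check
+(sketch; fd is any sockmap-managed TCP socket):
+
+  #include <stdio.h>
+  #include <sys/ioctl.h>
+
+  /* With copied_seq advanced too early this reported 0 even though
+   * data was queued for the application.
+   */
+  static void show_readable(int fd)
+  {
+          int avail = 0;
+
+          if (ioctl(fd, FIONREAD, &avail) == 0)
+                  printf("bytes readable: %d\n", avail);
+  }
+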
+Doing the fix for recvmsg is straightforward, but fixing redirect and
+SK_DROP pkts is a bit trickier. Build a tcp_eat_skb() helper and then
+call this from the skmsg handlers. This fixes another issue where a broken
+socket with a BPF program doing a resubmit could hang the receiver. This
+happened because although read_skb() consumed the skb through sock_drop()
+it did not update the copied_seq. Now if a single recv socket is
+redirecting to many sockets (for example for lb) the receiver sk will be
+hung even though we might expect it to continue. The hang comes from
+not updating the copied_seq numbers and the memory pressure resulting
+from that.
+
+We have a slight layering problem of calling tcp_eat_skb() even if it's
+not a TCP socket. To fix we could refactor and create per-type receiver
+handlers. I decided this is more work than we want in the fix and we
+already have some small tweaks depending on the caller that use the
+helper skb_bpf_strparser(). So we extend that a bit and always set
+the strparser bit when it is in use, and then we can gate the
+copied_seq updates on this.
+
+Fixes: 04919bed948dc ("tcp: Introduce tcp_read_skb()")
+Signed-off-by: John Fastabend <john.fastabend@gmail.com>
+Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
+Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
+Link: https://lore.kernel.org/bpf/20230523025618.113937-9-john.fastabend@gmail.com
+Signed-off-by: Sasha Levin <sashal@kernel.org>
+---
+ include/net/tcp.h | 10 ++++++++++
+ net/core/skmsg.c | 15 +++++++--------
+ net/ipv4/tcp.c | 10 +---------
+ net/ipv4/tcp_bpf.c | 28 +++++++++++++++++++++++++++-
+ 4 files changed, 45 insertions(+), 18 deletions(-)
+
+diff --git a/include/net/tcp.h b/include/net/tcp.h
+index 5b70b241ce71b..0744717f5caa7 100644
+--- a/include/net/tcp.h
++++ b/include/net/tcp.h
+@@ -1467,6 +1467,8 @@ static inline void tcp_adjust_rcv_ssthresh(struct sock *sk)
+ }
+
+ void tcp_cleanup_rbuf(struct sock *sk, int copied);
++void __tcp_cleanup_rbuf(struct sock *sk, int copied);
++
+
+ /* We provision sk_rcvbuf around 200% of sk_rcvlowat.
+ * If 87.5 % (7/8) of the space has been consumed, we want to override
+@@ -2291,6 +2293,14 @@ int tcp_bpf_update_proto(struct sock *sk, struct sk_psock *psock, bool restore);
+ void tcp_bpf_clone(const struct sock *sk, struct sock *newsk);
+ #endif /* CONFIG_BPF_SYSCALL */
+
++#ifdef CONFIG_INET
++void tcp_eat_skb(struct sock *sk, struct sk_buff *skb);
++#else
++static inline void tcp_eat_skb(struct sock *sk, struct sk_buff *skb)
++{
++}
++#endif
++
+ int tcp_bpf_sendmsg_redir(struct sock *sk, bool ingress,
+ struct sk_msg *msg, u32 bytes, int flags);
+ #endif /* CONFIG_NET_SOCK_MSG */
+diff --git a/net/core/skmsg.c b/net/core/skmsg.c
+index 062612ee508c0..9e0f694515636 100644
+--- a/net/core/skmsg.c
++++ b/net/core/skmsg.c
+@@ -978,10 +978,8 @@ static int sk_psock_verdict_apply(struct sk_psock *psock, struct sk_buff *skb,
+ err = -EIO;
+ sk_other = psock->sk;
+ if (sock_flag(sk_other, SOCK_DEAD) ||
+- !sk_psock_test_state(psock, SK_PSOCK_TX_ENABLED)) {
+- skb_bpf_redirect_clear(skb);
++ !sk_psock_test_state(psock, SK_PSOCK_TX_ENABLED))
+ goto out_free;
+- }
+
+ skb_bpf_set_ingress(skb);
+
+@@ -1010,18 +1008,19 @@ static int sk_psock_verdict_apply(struct sk_psock *psock, struct sk_buff *skb,
+ err = 0;
+ }
+ spin_unlock_bh(&psock->ingress_lock);
+- if (err < 0) {
+- skb_bpf_redirect_clear(skb);
++ if (err < 0)
+ goto out_free;
+- }
+ }
+ break;
+ case __SK_REDIRECT:
++ tcp_eat_skb(psock->sk, skb);
+ err = sk_psock_skb_redirect(psock, skb);
+ break;
+ case __SK_DROP:
+ default:
+ out_free:
++ skb_bpf_redirect_clear(skb);
++ tcp_eat_skb(psock->sk, skb);
+ sock_drop(psock->sk, skb);
+ }
+
+@@ -1066,8 +1065,7 @@ static void sk_psock_strp_read(struct strparser *strp, struct sk_buff *skb)
+ skb_dst_drop(skb);
+ skb_bpf_redirect_clear(skb);
+ ret = bpf_prog_run_pin_on_cpu(prog, skb);
+- if (ret == SK_PASS)
+- skb_bpf_set_strparser(skb);
++ skb_bpf_set_strparser(skb);
+ ret = sk_psock_map_verd(ret, skb_bpf_redirect_fetch(skb));
+ skb->sk = NULL;
+ }
+@@ -1173,6 +1171,7 @@ static int sk_psock_verdict_recv(struct sock *sk, struct sk_buff *skb)
+ psock = sk_psock(sk);
+ if (unlikely(!psock)) {
+ len = 0;
++ tcp_eat_skb(sk, skb);
+ sock_drop(sk, skb);
+ goto out;
+ }
+diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
+index 31156ebb759c0..021a8bf6a1898 100644
+--- a/net/ipv4/tcp.c
++++ b/net/ipv4/tcp.c
+@@ -1570,7 +1570,7 @@ static int tcp_peek_sndq(struct sock *sk, struct msghdr *msg, int len)
+ * calculation of whether or not we must ACK for the sake of
+ * a window update.
+ */
+-static void __tcp_cleanup_rbuf(struct sock *sk, int copied)
++void __tcp_cleanup_rbuf(struct sock *sk, int copied)
+ {
+ struct tcp_sock *tp = tcp_sk(sk);
+ bool time_to_ack = false;
+@@ -1785,14 +1785,6 @@ int tcp_read_skb(struct sock *sk, skb_read_actor_t recv_actor)
+ break;
+ }
+ }
+- WRITE_ONCE(tp->copied_seq, seq);
+-
+- tcp_rcv_space_adjust(sk);
+-
+- /* Clean up data we have read: This will do ACK frames. */
+- if (copied > 0)
+- __tcp_cleanup_rbuf(sk, copied);
+-
+ return copied;
+ }
+ EXPORT_SYMBOL(tcp_read_skb);
+diff --git a/net/ipv4/tcp_bpf.c b/net/ipv4/tcp_bpf.c
+index 01dd76be1a584..5f93918c063c7 100644
+--- a/net/ipv4/tcp_bpf.c
++++ b/net/ipv4/tcp_bpf.c
+@@ -11,6 +11,24 @@
+ #include <net/inet_common.h>
+ #include <net/tls.h>
+
++void tcp_eat_skb(struct sock *sk, struct sk_buff *skb)
++{
++ struct tcp_sock *tcp;
++ int copied;
++
++ if (!skb || !skb->len || !sk_is_tcp(sk))
++ return;
++
++ if (skb_bpf_strparser(skb))
++ return;
++
++ tcp = tcp_sk(sk);
++ copied = tcp->copied_seq + skb->len;
++ WRITE_ONCE(tcp->copied_seq, copied);
++ tcp_rcv_space_adjust(sk);
++ __tcp_cleanup_rbuf(sk, skb->len);
++}
++
+ static int bpf_tcp_ingress(struct sock *sk, struct sk_psock *psock,
+ struct sk_msg *msg, u32 apply_bytes, int flags)
+ {
+@@ -198,8 +216,10 @@ static int tcp_bpf_recvmsg_parser(struct sock *sk,
+ int flags,
+ int *addr_len)
+ {
++ struct tcp_sock *tcp = tcp_sk(sk);
++ u32 seq = tcp->copied_seq;
+ struct sk_psock *psock;
+- int copied;
++ int copied = 0;
+
+ if (unlikely(flags & MSG_ERRQUEUE))
+ return inet_recv_error(sk, msg, len, addr_len);
+@@ -244,9 +264,11 @@ static int tcp_bpf_recvmsg_parser(struct sock *sk,
+
+ if (is_fin) {
+ copied = 0;
++ seq++;
+ goto out;
+ }
+ }
++ seq += copied;
+ if (!copied) {
+ long timeo;
+ int data;
+@@ -284,6 +306,10 @@ static int tcp_bpf_recvmsg_parser(struct sock *sk,
+ copied = -EAGAIN;
+ }
+ out:
++ WRITE_ONCE(tcp->copied_seq, seq);
++ tcp_rcv_space_adjust(sk);
++ if (copied > 0)
++ __tcp_cleanup_rbuf(sk, copied);
+ release_sock(sk);
+ sk_psock_put(sk, psock);
+ return copied;
+--
+2.39.2
+
--- /dev/null
+From 4df24f4760fc039a3b25c34569712d7615d3b5a4 Mon Sep 17 00:00:00 2001
+From: Sasha Levin <sashal@kernel.org>
+Date: Mon, 22 May 2023 19:56:05 -0700
+Subject: bpf, sockmap: Pass skb ownership through read_skb
+
+From: John Fastabend <john.fastabend@gmail.com>
+
+[ Upstream commit 78fa0d61d97a728d306b0c23d353c0e340756437 ]
+
+The read_skb hook calls consume_skb() now, but this means that if the
+recv_actor program wants to use the skb it needs to inc the ref cnt
+so that the consume_skb() doesn't kfree the sk_buff.
+
+This is problematic because in some error cases under memory pressure
+we may need to linearize the sk_buff from sk_psock_skb_ingress_enqueue().
+Then we get this,
+
+ skb_linearize()
+ __pskb_pull_tail()
+ pskb_expand_head()
+ BUG_ON(skb_shared(skb))
+
+Because we incremented the users refcnt from sk_psock_verdict_recv(), the
+skb is shared (refcnt > 1) and we trip the BUG_ON().
+
+To fix, let's simply pass ownership of the sk_buff through the read_skb
+call. Then we can drop the consume from the read_skb handlers and assume
+the verdict recv does any required kfree.
+
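+The ownership rule after this change, as a simplified sketch
+(run_verdict() is a hypothetical stand-in for the verdict handling):
+
+  /* The read_skb() caller no longer calls consume_skb(); the recv_actor
+   * callback owns the skb and must free it on any path that does not
+   * queue or redirect it further.
+   */
+  static int my_recv_actor(struct sock *sk, struct sk_buff *skb)
+  {
+          int used = run_verdict(sk, skb);
+
+          if (used < 0)
+                  kfree_skb(skb);         /* error path frees */
+          /* on success the skb was queued or redirected downstream */
+          return used;
+  }
+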
+Bug found while testing in our CI which runs in VMs that hit memory
+constraints rather regularly. William tested TCP read_skb handlers.
+
+[ 106.536188] ------------[ cut here ]------------
+[ 106.536197] kernel BUG at net/core/skbuff.c:1693!
+[ 106.536479] invalid opcode: 0000 [#1] PREEMPT SMP PTI
+[ 106.536726] CPU: 3 PID: 1495 Comm: curl Not tainted 5.19.0-rc5 #1
+[ 106.537023] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ArchLinux 1.16.0-1 04/01/2014
+[ 106.537467] RIP: 0010:pskb_expand_head+0x269/0x330
+[ 106.538585] RSP: 0018:ffffc90000138b68 EFLAGS: 00010202
+[ 106.538839] RAX: 000000000000003f RBX: ffff8881048940e8 RCX: 0000000000000a20
+[ 106.539186] RDX: 0000000000000002 RSI: 0000000000000000 RDI: ffff8881048940e8
+[ 106.539529] RBP: ffffc90000138be8 R08: 00000000e161fd1a R09: 0000000000000000
+[ 106.539877] R10: 0000000000000018 R11: 0000000000000000 R12: ffff8881048940e8
+[ 106.540222] R13: 0000000000000003 R14: 0000000000000000 R15: ffff8881048940e8
+[ 106.540568] FS: 00007f277dde9f00(0000) GS:ffff88813bd80000(0000) knlGS:0000000000000000
+[ 106.540954] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
+[ 106.541227] CR2: 00007f277eeede64 CR3: 000000000ad3e000 CR4: 00000000000006e0
+[ 106.541569] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
+[ 106.541915] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
+[ 106.542255] Call Trace:
+[ 106.542383] <IRQ>
+[ 106.542487] __pskb_pull_tail+0x4b/0x3e0
+[ 106.542681] skb_ensure_writable+0x85/0xa0
+[ 106.542882] sk_skb_pull_data+0x18/0x20
+[ 106.543084] bpf_prog_b517a65a242018b0_bpf_skskb_http_verdict+0x3a9/0x4aa9
+[ 106.543536] ? migrate_disable+0x66/0x80
+[ 106.543871] sk_psock_verdict_recv+0xe2/0x310
+[ 106.544258] ? sk_psock_write_space+0x1f0/0x1f0
+[ 106.544561] tcp_read_skb+0x7b/0x120
+[ 106.544740] tcp_data_queue+0x904/0xee0
+[ 106.544931] tcp_rcv_established+0x212/0x7c0
+[ 106.545142] tcp_v4_do_rcv+0x174/0x2a0
+[ 106.545326] tcp_v4_rcv+0xe70/0xf60
+[ 106.545500] ip_protocol_deliver_rcu+0x48/0x290
+[ 106.545744] ip_local_deliver_finish+0xa7/0x150
+
+Fixes: 04919bed948dc ("tcp: Introduce tcp_read_skb()")
+Reported-by: William Findlay <will@isovalent.com>
+Signed-off-by: John Fastabend <john.fastabend@gmail.com>
+Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
+Tested-by: William Findlay <will@isovalent.com>
+Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
+Link: https://lore.kernel.org/bpf/20230523025618.113937-2-john.fastabend@gmail.com
+Signed-off-by: Sasha Levin <sashal@kernel.org>
+---
+ net/core/skmsg.c | 2 --
+ net/ipv4/tcp.c | 1 -
+ net/ipv4/udp.c | 7 ++-----
+ net/unix/af_unix.c | 7 ++-----
+ 4 files changed, 4 insertions(+), 13 deletions(-)
+
+diff --git a/net/core/skmsg.c b/net/core/skmsg.c
+index 53d0251788aa2..2b6d9519ff29c 100644
+--- a/net/core/skmsg.c
++++ b/net/core/skmsg.c
+@@ -1180,8 +1180,6 @@ static int sk_psock_verdict_recv(struct sock *sk, struct sk_buff *skb)
+ int ret = __SK_DROP;
+ int len = skb->len;
+
+- skb_get(skb);
+-
+ rcu_read_lock();
+ psock = sk_psock(sk);
+ if (unlikely(!psock)) {
+diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
+index 1fb67f819de49..31156ebb759c0 100644
+--- a/net/ipv4/tcp.c
++++ b/net/ipv4/tcp.c
+@@ -1772,7 +1772,6 @@ int tcp_read_skb(struct sock *sk, skb_read_actor_t recv_actor)
+ WARN_ON_ONCE(!skb_set_owner_sk_safe(skb, sk));
+ tcp_flags = TCP_SKB_CB(skb)->tcp_flags;
+ used = recv_actor(sk, skb);
+- consume_skb(skb);
+ if (used < 0) {
+ if (!copied)
+ copied = used;
+diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
+index 3ffa30c37293e..956d6797c76f3 100644
+--- a/net/ipv4/udp.c
++++ b/net/ipv4/udp.c
+@@ -1806,7 +1806,7 @@ EXPORT_SYMBOL(__skb_recv_udp);
+ int udp_read_skb(struct sock *sk, skb_read_actor_t recv_actor)
+ {
+ struct sk_buff *skb;
+- int err, copied;
++ int err;
+
+ try_again:
+ skb = skb_recv_udp(sk, MSG_DONTWAIT, &err);
+@@ -1825,10 +1825,7 @@ int udp_read_skb(struct sock *sk, skb_read_actor_t recv_actor)
+ }
+
+ WARN_ON_ONCE(!skb_set_owner_sk_safe(skb, sk));
+- copied = recv_actor(sk, skb);
+- kfree_skb(skb);
+-
+- return copied;
++ return recv_actor(sk, skb);
+ }
+ EXPORT_SYMBOL(udp_read_skb);
+
+diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
+index 70eb3bc67126d..5b19b6c53a2cb 100644
+--- a/net/unix/af_unix.c
++++ b/net/unix/af_unix.c
+@@ -2552,7 +2552,7 @@ static int unix_read_skb(struct sock *sk, skb_read_actor_t recv_actor)
+ {
+ struct unix_sock *u = unix_sk(sk);
+ struct sk_buff *skb;
+- int err, copied;
++ int err;
+
+ mutex_lock(&u->iolock);
+ skb = skb_recv_datagram(sk, MSG_DONTWAIT, &err);
+@@ -2560,10 +2560,7 @@ static int unix_read_skb(struct sock *sk, skb_read_actor_t recv_actor)
+ if (!skb)
+ return err;
+
+- copied = recv_actor(sk, skb);
+- kfree_skb(skb);
+-
+- return copied;
++ return recv_actor(sk, skb);
+ }
+
+ /*
+--
+2.39.2
+
--- /dev/null
+From ef1bb490282f412394b809afb82fd36538826a3f Mon Sep 17 00:00:00 2001
+From: Sasha Levin <sashal@kernel.org>
+Date: Mon, 22 May 2023 19:56:07 -0700
+Subject: bpf, sockmap: Reschedule is now done through backlog
+
+From: John Fastabend <john.fastabend@gmail.com>
+
+[ Upstream commit bce22552f92ea7c577f49839b8e8f7d29afaf880 ]
+
+Now that the backlog manages the reschedule() logic correctly we can drop
+the partial fix to reschedule from the recvmsg hook.
+
+Rescheduling on the recvmsg hook was added to address a corner case where
+we still had data in the backlog state but had nothing to kick it and
+reschedule the backlog worker to run and finish copying data out of the
+state. This had a couple of limitations: first, it required user space to
+kick it, introducing an unnecessary EBUSY and retry. Second, it only
+handled the ingress case; egress redirects would still be hung.
+
+With the correct fix, pushing the reschedule logic down to where the
+ENOMEM error occurs, we can drop this partial fix.
+
+Fixes: bec217197b412 ("skmsg: Schedule psock work if the cached skb exists on the psock")
+Signed-off-by: John Fastabend <john.fastabend@gmail.com>
+Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
+Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
+Link: https://lore.kernel.org/bpf/20230523025618.113937-4-john.fastabend@gmail.com
+Signed-off-by: Sasha Levin <sashal@kernel.org>
+---
+ net/core/skmsg.c | 2 --
+ 1 file changed, 2 deletions(-)
+
+diff --git a/net/core/skmsg.c b/net/core/skmsg.c
+index 6a9b794861f3f..2dfb6e31e8d04 100644
+--- a/net/core/skmsg.c
++++ b/net/core/skmsg.c
+@@ -480,8 +480,6 @@ int sk_msg_recvmsg(struct sock *sk, struct sk_psock *psock, struct msghdr *msg,
+ msg_rx = sk_psock_peek_msg(psock);
+ }
+ out:
+- if (psock->work_state.skb && copied > 0)
+- schedule_delayed_work(&psock->work, 0);
+ return copied;
+ }
+ EXPORT_SYMBOL_GPL(sk_msg_recvmsg);
+--
+2.39.2
+
--- /dev/null
+From 2df778add75145bf4cdf666c03e6d3c3732444e5 Mon Sep 17 00:00:00 2001
+From: Sasha Levin <sashal@kernel.org>
+Date: Mon, 22 May 2023 19:56:10 -0700
+Subject: bpf, sockmap: TCP data stall on recv before accept
+
+From: John Fastabend <john.fastabend@gmail.com>
+
+[ Upstream commit ea444185a6bf7da4dd0df1598ee953e4f7174858 ]
+
+A common mechanism to put a TCP socket into the sockmap is to hook the
+BPF_SOCK_OPS_{ACTIVE_PASSIVE}_ESTABLISHED_CB event with a BPF program
+that can map the socket info to the correct BPF verdict parser. When
+the user adds the socket to the map the psock is created and the new
+ops are assigned to ensure the verdict program will 'see' the sk_buffs
+as they arrive.
+
+Part of this process hooks the sk_data_ready op with a BPF specific
+handler to wake up the BPF verdict program when data is ready to read.
+The logic is simple enough (posted here for easy reading)
+
+ static void sk_psock_verdict_data_ready(struct sock *sk)
+ {
+ struct socket *sock = sk->sk_socket;
+
+ if (unlikely(!sock || !sock->ops || !sock->ops->read_skb))
+ return;
+ sock->ops->read_skb(sk, sk_psock_verdict_recv);
+ }
+
+The oversight here is that sk->sk_socket is not assigned until the
+application accept()s the new socket. However, it's entirely OK for the
+peer application to do a connect() followed immediately by sends. The
+socket on the receiver is sitting on the backlog queue of the listening
+socket until it's accepted and the data is queued up. If the peer never
+accepts the socket or is slow it will eventually hit data limits and rate
+limit the session. But, importantly for BPF sockmap hooks, when this data
+is received the TCP stack does the sk_data_ready() call but the read_skb()
+for this data is never called because sk_socket is missing. The data sits
+on the sk_receive_queue.
+
+Then, once the socket is accepted, if we never receive more data from the
+peer there will be no further sk_data_ready calls and all the data is
+still on the sk_receive_queue. Then the user calls recvmsg() after
+accept(), and for TCP sockets in sockmap we use the
+tcp_bpf_recvmsg_parser() handler. The handler checks for data in the
+sk_msg ingress queue, expecting that the BPF program has already run from
+the sk_data_ready hook and enqueued the data as needed. So we are stuck.
+
+To fix, do an unlikely check in the recvmsg handler for data on the
+sk_receive_queue and, if it exists, wake up data_ready. We have the sock
+locked in both read_skb and recvmsg so we should avoid having multiple
+runners.
+
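+A userspace sketch of the triggering sequence (srv, lfd and buf are
+placeholders set up elsewhere):
+
+  /* client: connect() and send immediately, before the peer accept()s;
+   * the data queues on the not-yet-accepted socket.
+   */
+  int fd = socket(AF_INET, SOCK_STREAM, 0);
+
+  connect(fd, (struct sockaddr *)&srv, sizeof(srv));
+  send(fd, "hello", 5, 0);
+
+  /* server, later:
+   *   cfd = accept(lfd, NULL, NULL);
+   *   recv(cfd, buf, sizeof(buf), 0);   -- could stall before this fix
+   */
+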
+Fixes: 04919bed948dc ("tcp: Introduce tcp_read_skb()")
+Signed-off-by: John Fastabend <john.fastabend@gmail.com>
+Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
+Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
+Link: https://lore.kernel.org/bpf/20230523025618.113937-7-john.fastabend@gmail.com
+Signed-off-by: Sasha Levin <sashal@kernel.org>
+---
+ net/ipv4/tcp_bpf.c | 20 ++++++++++++++++++++
+ 1 file changed, 20 insertions(+)
+
+diff --git a/net/ipv4/tcp_bpf.c b/net/ipv4/tcp_bpf.c
+index 73c13642d47f6..01dd76be1a584 100644
+--- a/net/ipv4/tcp_bpf.c
++++ b/net/ipv4/tcp_bpf.c
+@@ -212,6 +212,26 @@ static int tcp_bpf_recvmsg_parser(struct sock *sk,
+ return tcp_recvmsg(sk, msg, len, flags, addr_len);
+
+ lock_sock(sk);
++
++ /* We may have received data on the sk_receive_queue pre-accept and
++ * then we can not use read_skb in this context because we haven't
++ * assigned a sk_socket yet so have no link to the ops. The work-around
++ * is to check the sk_receive_queue and in these cases read skbs off
++ * queue again. The read_skb hook is not running at this point because
++ * of lock_sock so we avoid having multiple runners in read_skb.
++ */
++ if (unlikely(!skb_queue_empty(&sk->sk_receive_queue))) {
++ tcp_data_ready(sk);
++ /* This handles the ENOMEM errors if we both receive data
++ * pre accept and are already under memory pressure. At least
++ * let user know to retry.
++ */
++ if (unlikely(!skb_queue_empty(&sk->sk_receive_queue))) {
++ copied = -EAGAIN;
++ goto out;
++ }
++ }
++
+ msg_bytes_ready:
+ copied = sk_msg_recvmsg(sk, psock, msg, len, flags);
+ /* The typical case for EFAULT is the socket was gracefully
+--
+2.39.2
+
--- /dev/null
+From 410c637b47383afd271b811512d7430abca6ea54 Mon Sep 17 00:00:00 2001
+From: Sasha Levin <sashal@kernel.org>
+Date: Mon, 22 May 2023 19:56:11 -0700
+Subject: bpf, sockmap: Wake up polling after data copy
+
+From: John Fastabend <john.fastabend@gmail.com>
+
+[ Upstream commit 6df7f764cd3cf5a03a4a47b23be47e57e41fcd85 ]
+
+When the TCP stack has data ready to read, sk_data_ready() is called.
+Sockmap overwrites this with its own handler to call into the BPF verdict
+program. But the original TCP socket had sock_def_readable(), which would
+additionally wake up any user space waiters with sk_wake_async().
+
+Sockmap saved the callback when the socket was created, so call the saved
+data_ready callback; then we can wake up any epoll() logic waiting on the
+read.
+
+Note we call it on 'copied >= 0' to account for returning 0 when a FIN is
+received, because we need to wake up the user for this as well so they
+can do the recvmsg() -> 0 and detect the shutdown.
+
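+On the application side the effect is simply that a poller is woken and
+can observe the shutdown (fragment sketch; epfd and fd set up elsewhere):
+
+  struct epoll_event ev;
+  char buf[4096];
+
+  if (epoll_wait(epfd, &ev, 1, -1) == 1 && (ev.events & EPOLLIN)) {
+          if (recv(fd, buf, sizeof(buf), 0) == 0)
+                  close(fd);              /* peer shutdown observed */
+  }
+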
+Fixes: 04919bed948dc ("tcp: Introduce tcp_read_skb()")
+Signed-off-by: John Fastabend <john.fastabend@gmail.com>
+Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
+Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
+Link: https://lore.kernel.org/bpf/20230523025618.113937-8-john.fastabend@gmail.com
+Signed-off-by: Sasha Levin <sashal@kernel.org>
+---
+ net/core/skmsg.c | 11 ++++++++++-
+ 1 file changed, 10 insertions(+), 1 deletion(-)
+
+diff --git a/net/core/skmsg.c b/net/core/skmsg.c
+index d3ffca1b96462..062612ee508c0 100644
+--- a/net/core/skmsg.c
++++ b/net/core/skmsg.c
+@@ -1196,10 +1196,19 @@ static int sk_psock_verdict_recv(struct sock *sk, struct sk_buff *skb)
+ static void sk_psock_verdict_data_ready(struct sock *sk)
+ {
+ struct socket *sock = sk->sk_socket;
++ int copied;
+
+ if (unlikely(!sock || !sock->ops || !sock->ops->read_skb))
+ return;
+- sock->ops->read_skb(sk, sk_psock_verdict_recv);
++ copied = sock->ops->read_skb(sk, sk_psock_verdict_recv);
++ if (copied >= 0) {
++ struct sk_psock *psock;
++
++ rcu_read_lock();
++ psock = sk_psock(sk);
++ psock->saved_data_ready(sk);
++ rcu_read_unlock();
++ }
+ }
+
+ void sk_psock_start_verdict(struct sock *sk, struct sk_psock *psock)
+--
+2.39.2
+
--- /dev/null
+From 2fbb7f647a422c87e7c155ac8cb640c970027f51 Mon Sep 17 00:00:00 2001
+From: Sasha Levin <sashal@kernel.org>
+Date: Thu, 20 Apr 2023 16:06:02 +0100
+Subject: firmware: arm_ffa: Fix usage of partition info get count flag
+
+From: Sudeep Holla <sudeep.holla@arm.com>
+
+[ Upstream commit c6e045361a27ecd4fac6413164e0d091d80eee99 ]
+
+Commit bb1be7498500 ("firmware: arm_ffa: Add v1.1 get_partition_info support")
+adds support to discover the UUIDs of the partitions or just fetch the
+partition count using the PARTITION_INFO_GET_RETURN_COUNT_ONLY flag.
+
+However the commit doesn't handle the fact that the older version doesn't
+understand the flag, which must be MBZ (must be zero) there, and so the
+firmware returns an invalid parameters error. That results in the failure
+of the driver probe, which is incorrect.
+
+Limit the usage of the PARTITION_INFO_GET_RETURN_COUNT_ONLY flag to
+versions above v1.0 (i.e. v1.1 and onwards), which fixes the issue.
+
+Fixes: bb1be7498500 ("firmware: arm_ffa: Add v1.1 get_partition_info support")
+Reported-by: Jens Wiklander <jens.wiklander@linaro.org>
+Reported-by: Marc Bonnici <marc.bonnici@arm.com>
+Tested-by: Jens Wiklander <jens.wiklander@linaro.org>
+Reviewed-by: Jens Wiklander <jens.wiklander@linaro.org>
+Link: https://lore.kernel.org/r/20230419-ffa_fixes_6-4-v2-2-d9108e43a176@arm.com
+Signed-off-by: Sudeep Holla <sudeep.holla@arm.com>
+Signed-off-by: Sasha Levin <sashal@kernel.org>
+---
+ drivers/firmware/arm_ffa/driver.c | 3 ++-
+ 1 file changed, 2 insertions(+), 1 deletion(-)
+
+diff --git a/drivers/firmware/arm_ffa/driver.c b/drivers/firmware/arm_ffa/driver.c
+index 737f36e7a9035..5904a679d3512 100644
+--- a/drivers/firmware/arm_ffa/driver.c
++++ b/drivers/firmware/arm_ffa/driver.c
+@@ -274,7 +274,8 @@ __ffa_partition_info_get(u32 uuid0, u32 uuid1, u32 uuid2, u32 uuid3,
+ int idx, count, flags = 0, sz, buf_sz;
+ ffa_value_t partition_info;
+
+- if (!buffer || !num_partitions) /* Just get the count for now */
++ if (drv_info->version > FFA_VERSION_1_0 &&
++ (!buffer || !num_partitions)) /* Just get the count for now */
+ flags = PARTITION_INFO_GET_RETURN_COUNT_ONLY;
+
+ mutex_lock(&drv_info->rx_lock);
+--
+2.39.2
+
--- /dev/null
+From 9a125be940383e0c3453bff0d1ded868ddb199b2 Mon Sep 17 00:00:00 2001
+From: Sasha Levin <sashal@kernel.org>
+Date: Thu, 27 Apr 2023 17:20:55 +0200
+Subject: gpio-f7188x: fix chip name and pin count on Nuvoton chip
+
+From: Henning Schild <henning.schild@siemens.com>
+
+[ Upstream commit 3002b8642f016d7fe3ff56240dacea1075f6b877 ]
+
+In fact the device with chip id 0xD283 is called NCT6126D, and that is
+the chip id the Nuvoton code was written for. Correct that name to avoid
+confusion, because an NCT6116D in fact exists as well but has another
+chip id, and is currently not supported.
+
+A look at the spec also revealed that GPIO group 7 in fact has 8 pins,
+so correct the pin count in that group as well.
+
+Fixes: d0918a84aff0 ("gpio-f7188x: Add GPIO support for Nuvoton NCT6116")
+Reported-by: Xing Tong Wu <xingtong.wu@siemens.com>
+Signed-off-by: Henning Schild <henning.schild@siemens.com>
+Acked-by: Simon Guinot <simon.guinot@sequanux.org>
+Signed-off-by: Bartosz Golaszewski <bartosz.golaszewski@linaro.org>
+Signed-off-by: Sasha Levin <sashal@kernel.org>
+---
+ drivers/gpio/Kconfig | 2 +-
+ drivers/gpio/gpio-f7188x.c | 28 ++++++++++++++--------------
+ 2 files changed, 15 insertions(+), 15 deletions(-)
+
+diff --git a/drivers/gpio/Kconfig b/drivers/gpio/Kconfig
+index e3af86f06c630..3e8e5f4ffa59f 100644
+--- a/drivers/gpio/Kconfig
++++ b/drivers/gpio/Kconfig
+@@ -882,7 +882,7 @@ config GPIO_F7188X
+ help
+ This option enables support for GPIOs found on Fintek Super-I/O
+ chips F71869, F71869A, F71882FG, F71889F and F81866.
+- As well as Nuvoton Super-I/O chip NCT6116D.
++ As well as Nuvoton Super-I/O chip NCT6126D.
+
+ To compile this driver as a module, choose M here: the module will
+ be called f7188x-gpio.
+diff --git a/drivers/gpio/gpio-f7188x.c b/drivers/gpio/gpio-f7188x.c
+index 9effa7769bef5..f54ca5a1775ea 100644
+--- a/drivers/gpio/gpio-f7188x.c
++++ b/drivers/gpio/gpio-f7188x.c
+@@ -48,7 +48,7 @@
+ /*
+ * Nuvoton devices.
+ */
+-#define SIO_NCT6116D_ID 0xD283 /* NCT6116D chipset ID */
++#define SIO_NCT6126D_ID 0xD283 /* NCT6126D chipset ID */
+
+ #define SIO_LD_GPIO_NUVOTON 0x07 /* GPIO logical device */
+
+@@ -62,7 +62,7 @@ enum chips {
+ f81866,
+ f81804,
+ f81865,
+- nct6116d,
++ nct6126d,
+ };
+
+ static const char * const f7188x_names[] = {
+@@ -74,7 +74,7 @@ static const char * const f7188x_names[] = {
+ "f81866",
+ "f81804",
+ "f81865",
+- "nct6116d",
++ "nct6126d",
+ };
+
+ struct f7188x_sio {
+@@ -187,8 +187,8 @@ static int f7188x_gpio_set_config(struct gpio_chip *chip, unsigned offset,
+ /* Output mode register (0:open drain 1:push-pull). */
+ #define f7188x_gpio_out_mode(base) ((base) + 3)
+
+-#define f7188x_gpio_dir_invert(type) ((type) == nct6116d)
+-#define f7188x_gpio_data_single(type) ((type) == nct6116d)
++#define f7188x_gpio_dir_invert(type) ((type) == nct6126d)
++#define f7188x_gpio_data_single(type) ((type) == nct6126d)
+
+ static struct f7188x_gpio_bank f71869_gpio_bank[] = {
+ F7188X_GPIO_BANK(0, 6, 0xF0, DRVNAME "-0"),
+@@ -274,7 +274,7 @@ static struct f7188x_gpio_bank f81865_gpio_bank[] = {
+ F7188X_GPIO_BANK(60, 5, 0x90, DRVNAME "-6"),
+ };
+
+-static struct f7188x_gpio_bank nct6116d_gpio_bank[] = {
++static struct f7188x_gpio_bank nct6126d_gpio_bank[] = {
+ F7188X_GPIO_BANK(0, 8, 0xE0, DRVNAME "-0"),
+ F7188X_GPIO_BANK(10, 8, 0xE4, DRVNAME "-1"),
+ F7188X_GPIO_BANK(20, 8, 0xE8, DRVNAME "-2"),
+@@ -282,7 +282,7 @@ static struct f7188x_gpio_bank nct6116d_gpio_bank[] = {
+ F7188X_GPIO_BANK(40, 8, 0xF0, DRVNAME "-4"),
+ F7188X_GPIO_BANK(50, 8, 0xF4, DRVNAME "-5"),
+ F7188X_GPIO_BANK(60, 8, 0xF8, DRVNAME "-6"),
+- F7188X_GPIO_BANK(70, 1, 0xFC, DRVNAME "-7"),
++ F7188X_GPIO_BANK(70, 8, 0xFC, DRVNAME "-7"),
+ };
+
+ static int f7188x_gpio_get_direction(struct gpio_chip *chip, unsigned offset)
+@@ -490,9 +490,9 @@ static int f7188x_gpio_probe(struct platform_device *pdev)
+ data->nr_bank = ARRAY_SIZE(f81865_gpio_bank);
+ data->bank = f81865_gpio_bank;
+ break;
+- case nct6116d:
+- data->nr_bank = ARRAY_SIZE(nct6116d_gpio_bank);
+- data->bank = nct6116d_gpio_bank;
++ case nct6126d:
++ data->nr_bank = ARRAY_SIZE(nct6126d_gpio_bank);
++ data->bank = nct6126d_gpio_bank;
+ break;
+ default:
+ return -ENODEV;
+@@ -559,9 +559,9 @@ static int __init f7188x_find(int addr, struct f7188x_sio *sio)
+ case SIO_F81865_ID:
+ sio->type = f81865;
+ break;
+- case SIO_NCT6116D_ID:
++ case SIO_NCT6126D_ID:
+ sio->device = SIO_LD_GPIO_NUVOTON;
+- sio->type = nct6116d;
++ sio->type = nct6126d;
+ break;
+ default:
+ pr_info("Unsupported Fintek device 0x%04x\n", devid);
+@@ -569,7 +569,7 @@ static int __init f7188x_find(int addr, struct f7188x_sio *sio)
+ }
+
+ /* double check manufacturer where possible */
+- if (sio->type != nct6116d) {
++ if (sio->type != nct6126d) {
+ manid = superio_inw(addr, SIO_FINTEK_MANID);
+ if (manid != SIO_FINTEK_ID) {
+ pr_debug("Not a Fintek device at 0x%08x\n", addr);
+@@ -581,7 +581,7 @@ static int __init f7188x_find(int addr, struct f7188x_sio *sio)
+ err = 0;
+
+ pr_info("Found %s at %#x\n", f7188x_names[sio->type], (unsigned int)addr);
+- if (sio->type != nct6116d)
++ if (sio->type != nct6126d)
+ pr_info(" revision %d\n", superio_inb(addr, SIO_FINTEK_DEVREV));
+
+ err:
+--
+2.39.2
+
--- /dev/null
+From 584b6e562ba94b1fbed370fb6cc60bb54e7b519b Mon Sep 17 00:00:00 2001
+From: Sasha Levin <sashal@kernel.org>
+Date: Tue, 24 Jan 2023 14:36:43 +0100
+Subject: inet: Add IP_LOCAL_PORT_RANGE socket option
+
+From: Jakub Sitnicki <jakub@cloudflare.com>
+
+[ Upstream commit 91d0b78c5177f3e42a4d8738af8ac19c3a90d002 ]
+
+Users who want to share a single public IP address for outgoing connections
+between several hosts traditionally reach for SNAT. However, SNAT requires
+state keeping on the node(s) performing the NAT.
+
+A stateless alternative exists, where a single IP address used for egress
+can be shared between several hosts by partitioning the available ephemeral
+port range. In such a setup:
+
+1. Each host gets assigned a disjoint range of ephemeral ports.
+2. Applications open connections from the host-assigned port range.
+3. Return traffic gets routed to the host based on both, the destination IP
+ and the destination port.
+
+An application which wants to open an outgoing connection (connect) from a
+given port range today can choose between two solutions:
+
+1. Manually pick the source port by bind()'ing to it before connect()'ing
+ the socket.
+
+ This approach has a couple of downsides:
+
+ a) The search for a free port has to be implemented in user space. If
+ the chosen 4-tuple happens to be busy, the application needs to retry
+ from a different local port number.
+
+ Detecting if a 4-tuple is busy can be either easy (TCP) or hard
+ (UDP). In the TCP case, the application simply has to check if connect()
+ returned an error (EADDRNOTAVAIL). That is assuming that local
+ port sharing was enabled (REUSEADDR) by all the sockets.
+
+ # Assume desired local port range is 60_000-60_511
+ s = socket(AF_INET, SOCK_STREAM)
+ s.setsockopt(SOL_SOCKET, SO_REUSEADDR, 1)
+ s.bind(("192.0.2.1", 60_000))
+ s.connect(("1.1.1.1", 53))
+ # Fails only if 192.0.2.1:60000 -> 1.1.1.1:53 is busy
+ # Application must retry with another local port
+
+ In the case of UDP, the network stack allows binding more than one
+ socket to the same 4-tuple, when local port sharing is enabled
+ (REUSEADDR). Hence detecting the conflict is much harder and involves
+ querying sock_diag and toggling the REUSEADDR flag [1].
+
+ b) For TCP, bind()-ing to a port within the ephemeral port range means
+ that no connecting sockets, that is those which leave it to the
+ network stack to find a free local port at connect() time, can use
+ this port.
+
+ IOW, the bind hash bucket tb->fastreuse will be 0 or 1, and the port
+ will be skipped during the free port search at connect() time.
+
+2. Isolate the app in a dedicated netns and use the per-netns
+ ip_local_port_range sysctl to adjust the ephemeral port range bounds.
+
+ The per-netns setting affects all sockets, so this approach can be used
+ only if:
+
+ - there is just one egress IP address, or
+ - the desired egress port range is the same for all egress IP addresses
+ used by the application.
+
+ For TCP, this approach avoids the downsides of (1). The free port search
+ and 4-tuple conflict detection are done by the network stack:
+
+ system("sysctl -w net.ipv4.ip_local_port_range='60000 60511'")
+
+ s = socket(AF_INET, SOCK_STREAM)
+ s.setsockopt(SOL_IP, IP_BIND_ADDRESS_NO_PORT, 1)
+ s.bind(("192.0.2.1", 0))
+ s.connect(("1.1.1.1", 53))
+ # Fails if all 4-tuples 192.0.2.1:60000-60511 -> 1.1.1.1:53 are busy
+
+ For UDP this approach has limited applicability. Setting the
+ IP_BIND_ADDRESS_NO_PORT socket option does not result in the local source
+ port being shared with other connected UDP sockets.
+
+ Hence relying on the network stack to find a free source port limits the
+ number of outgoing UDP flows from a single IP address down to the number
+ of available ephemeral ports.
+
+To put it another way, partitioning the ephemeral port range between hosts
+using the existing Linux networking API is cumbersome.
+
+To address this use case, add a new socket option at the SOL_IP level,
+named IP_LOCAL_PORT_RANGE. The new option can be used to clamp down the
+ephemeral port range for each socket individually.
+
+The option can be used only to narrow down the per-netns local port
+range. If the per-socket range lies outside of the per-netns range, the
+latter takes precedence.
+
+UAPI-wise, the low and high range bounds are passed to the kernel as a pair
+of u16 values in host byte order packed into a u32. This avoids pointer
+passing.
+
+ PORT_LO = 40_000
+ PORT_HI = 40_511
+
+ s = socket(AF_INET, SOCK_STREAM)
+ v = struct.pack("I", PORT_HI << 16 | PORT_LO)
+ s.setsockopt(SOL_IP, IP_LOCAL_PORT_RANGE, v)
+ s.bind(("127.0.0.1", 0))
+ s.getsockname()
+ # Local address between ("127.0.0.1", 40_000) and ("127.0.0.1", 40_511),
+ # if there is a free port. EADDRINUSE otherwise.
+
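+ The same packing done from C looks roughly like this (sketch; fd is an
+ already-created socket, IPPROTO_IP == SOL_IP):
+
+   #include <netinet/in.h>
+   #include <stdint.h>
+   #include <sys/socket.h>
+
+   #ifndef IP_LOCAL_PORT_RANGE
+   #define IP_LOCAL_PORT_RANGE 51        /* value added by this patch */
+   #endif
+
+   uint32_t range = (40511u << 16) | 40000u;     /* hi << 16 | lo */
+
+   setsockopt(fd, IPPROTO_IP, IP_LOCAL_PORT_RANGE, &range, sizeof(range));
+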
+[1] https://github.com/cloudflare/cloudflare-blog/blob/232b432c1d57/2022-02-connectx/connectx.py#L116
+
+Reviewed-by: Marek Majkowski <marek@cloudflare.com>
+Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
+Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
+Reviewed-by: Eric Dumazet <edumazet@google.com>
+Signed-off-by: Jakub Kicinski <kuba@kernel.org>
+Stable-dep-of: 3632679d9e4f ("ipv{4,6}/raw: fix output xfrm lookup wrt protocol")
+Signed-off-by: Sasha Levin <sashal@kernel.org>
+---
+ include/net/inet_sock.h | 4 ++++
+ include/net/ip.h | 3 ++-
+ include/uapi/linux/in.h | 1 +
+ net/ipv4/inet_connection_sock.c | 25 +++++++++++++++++++++++--
+ net/ipv4/inet_hashtables.c | 2 +-
+ net/ipv4/ip_sockglue.c | 18 ++++++++++++++++++
+ net/ipv4/udp.c | 2 +-
+ net/sctp/socket.c | 2 +-
+ 8 files changed, 51 insertions(+), 6 deletions(-)
+
+diff --git a/include/net/inet_sock.h b/include/net/inet_sock.h
+index bf5654ce711ef..51857117ac099 100644
+--- a/include/net/inet_sock.h
++++ b/include/net/inet_sock.h
+@@ -249,6 +249,10 @@ struct inet_sock {
+ __be32 mc_addr;
+ struct ip_mc_socklist __rcu *mc_list;
+ struct inet_cork_full cork;
++ struct {
++ __u16 lo;
++ __u16 hi;
++ } local_port_range;
+ };
+
+ #define IPCORK_OPT 1 /* ip-options has been held in ipcork.opt */
+diff --git a/include/net/ip.h b/include/net/ip.h
+index 144bdfbb25afe..c3fffaa92d6e0 100644
+--- a/include/net/ip.h
++++ b/include/net/ip.h
+@@ -340,7 +340,8 @@ static inline u64 snmp_fold_field64(void __percpu *mib, int offt, size_t syncp_o
+ } \
+ }
+
+-void inet_get_local_port_range(struct net *net, int *low, int *high);
++void inet_get_local_port_range(const struct net *net, int *low, int *high);
++void inet_sk_get_local_port_range(const struct sock *sk, int *low, int *high);
+
+ #ifdef CONFIG_SYSCTL
+ static inline bool inet_is_local_reserved_port(struct net *net, unsigned short port)
+diff --git a/include/uapi/linux/in.h b/include/uapi/linux/in.h
+index 07a4cb149305b..4b7f2df66b995 100644
+--- a/include/uapi/linux/in.h
++++ b/include/uapi/linux/in.h
+@@ -162,6 +162,7 @@ struct in_addr {
+ #define MCAST_MSFILTER 48
+ #define IP_MULTICAST_ALL 49
+ #define IP_UNICAST_IF 50
++#define IP_LOCAL_PORT_RANGE 51
+
+ #define MCAST_EXCLUDE 0
+ #define MCAST_INCLUDE 1
+diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
+index 7152ede18f115..916075e00d066 100644
+--- a/net/ipv4/inet_connection_sock.c
++++ b/net/ipv4/inet_connection_sock.c
+@@ -117,7 +117,7 @@ bool inet_rcv_saddr_any(const struct sock *sk)
+ return !sk->sk_rcv_saddr;
+ }
+
+-void inet_get_local_port_range(struct net *net, int *low, int *high)
++void inet_get_local_port_range(const struct net *net, int *low, int *high)
+ {
+ unsigned int seq;
+
+@@ -130,6 +130,27 @@ void inet_get_local_port_range(struct net *net, int *low, int *high)
+ }
+ EXPORT_SYMBOL(inet_get_local_port_range);
+
++void inet_sk_get_local_port_range(const struct sock *sk, int *low, int *high)
++{
++ const struct inet_sock *inet = inet_sk(sk);
++ const struct net *net = sock_net(sk);
++ int lo, hi, sk_lo, sk_hi;
++
++ inet_get_local_port_range(net, &lo, &hi);
++
++ sk_lo = inet->local_port_range.lo;
++ sk_hi = inet->local_port_range.hi;
++
++ if (unlikely(lo <= sk_lo && sk_lo <= hi))
++ lo = sk_lo;
++ if (unlikely(lo <= sk_hi && sk_hi <= hi))
++ hi = sk_hi;
++
++ *low = lo;
++ *high = hi;
++}
++EXPORT_SYMBOL(inet_sk_get_local_port_range);
++
+ static bool inet_use_bhash2_on_bind(const struct sock *sk)
+ {
+ #if IS_ENABLED(CONFIG_IPV6)
+@@ -316,7 +337,7 @@ inet_csk_find_open_port(const struct sock *sk, struct inet_bind_bucket **tb_ret,
+ ports_exhausted:
+ attempt_half = (sk->sk_reuse == SK_CAN_REUSE) ? 1 : 0;
+ other_half_scan:
+- inet_get_local_port_range(net, &low, &high);
++ inet_sk_get_local_port_range(sk, &low, &high);
+ high++; /* [32768, 60999] -> [32768, 61000[ */
+ if (high - low < 4)
+ attempt_half = 0;
+diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
+index f0750c06d5ffc..e8734ffca85a8 100644
+--- a/net/ipv4/inet_hashtables.c
++++ b/net/ipv4/inet_hashtables.c
+@@ -1022,7 +1022,7 @@ int __inet_hash_connect(struct inet_timewait_death_row *death_row,
+
+ l3mdev = inet_sk_bound_l3mdev(sk);
+
+- inet_get_local_port_range(net, &low, &high);
++ inet_sk_get_local_port_range(sk, &low, &high);
+ high++; /* [32768, 60999] -> [32768, 61000[ */
+ remaining = high - low;
+ if (likely(remaining > 1))
+diff --git a/net/ipv4/ip_sockglue.c b/net/ipv4/ip_sockglue.c
+index 6e19cad154f5c..d05f631ea6401 100644
+--- a/net/ipv4/ip_sockglue.c
++++ b/net/ipv4/ip_sockglue.c
+@@ -922,6 +922,7 @@ int do_ip_setsockopt(struct sock *sk, int level, int optname,
+ case IP_CHECKSUM:
+ case IP_RECVFRAGSIZE:
+ case IP_RECVERR_RFC4884:
++ case IP_LOCAL_PORT_RANGE:
+ if (optlen >= sizeof(int)) {
+ if (copy_from_sockptr(&val, optval, sizeof(val)))
+ return -EFAULT;
+@@ -1364,6 +1365,20 @@ int do_ip_setsockopt(struct sock *sk, int level, int optname,
+ WRITE_ONCE(inet->min_ttl, val);
+ break;
+
++ case IP_LOCAL_PORT_RANGE:
++ {
++ const __u16 lo = val;
++ const __u16 hi = val >> 16;
++
++ if (optlen != sizeof(__u32))
++ goto e_inval;
++ if (lo != 0 && hi != 0 && lo > hi)
++ goto e_inval;
++
++ inet->local_port_range.lo = lo;
++ inet->local_port_range.hi = hi;
++ break;
++ }
+ default:
+ err = -ENOPROTOOPT;
+ break;
+@@ -1742,6 +1757,9 @@ int do_ip_getsockopt(struct sock *sk, int level, int optname,
+ case IP_MINTTL:
+ val = inet->min_ttl;
+ break;
++ case IP_LOCAL_PORT_RANGE:
++ val = inet->local_port_range.hi << 16 | inet->local_port_range.lo;
++ break;
+ default:
+ sockopt_release_sock(sk);
+ return -ENOPROTOOPT;
+diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
+index 2eaf47e23b221..3ffa30c37293e 100644
+--- a/net/ipv4/udp.c
++++ b/net/ipv4/udp.c
+@@ -243,7 +243,7 @@ int udp_lib_get_port(struct sock *sk, unsigned short snum,
+ int low, high, remaining;
+ unsigned int rand;
+
+- inet_get_local_port_range(net, &low, &high);
++ inet_sk_get_local_port_range(sk, &low, &high);
+ remaining = (high - low) + 1;
+
+ rand = get_random_u32();
+diff --git a/net/sctp/socket.c b/net/sctp/socket.c
+index 17185200079d5..bc3d08bd7cef3 100644
+--- a/net/sctp/socket.c
++++ b/net/sctp/socket.c
+@@ -8325,7 +8325,7 @@ static int sctp_get_port_local(struct sock *sk, union sctp_addr *addr)
+ int low, high, remaining, index;
+ unsigned int rover;
+
+- inet_get_local_port_range(net, &low, &high);
++ inet_sk_get_local_port_range(sk, &low, &high);
+ remaining = (high - low) + 1;
+ rover = prandom_u32_max(remaining) + low;
+
+--
+2.39.2
+
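+For reference, a minimal userspace sketch of how the new option is used
+(an editorial illustration, not part of the patch): IP_LOCAL_PORT_RANGE is
+51 per the uapi hunk above, the fallback define only covers libcs that do
+not carry it yet, the lower bound goes in the low 16 bits of the u32 and
+the upper bound in the high 16 bits, and a zero half leaves that end at
+the per-netns default, as the inet_sk_get_local_port_range() hunk shows.
+
+  #include <netinet/in.h>
+  #include <stdint.h>
+  #include <stdio.h>
+  #include <sys/socket.h>
+
+  #ifndef IP_LOCAL_PORT_RANGE
+  #define IP_LOCAL_PORT_RANGE 51
+  #endif
+
+  int main(void)
+  {
+          /* restrict this socket's ephemeral ports to [40000, 40999] */
+          uint32_t range = 40000u | (40999u << 16);
+          int fd = socket(AF_INET, SOCK_STREAM, 0);
+
+          if (fd < 0 || setsockopt(fd, IPPROTO_IP, IP_LOCAL_PORT_RANGE,
+                                   &range, sizeof(range)))
+                  perror("IP_LOCAL_PORT_RANGE");
+          return 0;
+  }
+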
--- /dev/null
+From 35f9d8bdac8a43aa43859b47745171d684a418a5 Mon Sep 17 00:00:00 2001
+From: Sasha Levin <sashal@kernel.org>
+Date: Mon, 22 May 2023 14:08:20 +0200
+Subject: ipv{4,6}/raw: fix output xfrm lookup wrt protocol
+
+From: Nicolas Dichtel <nicolas.dichtel@6wind.com>
+
+[ Upstream commit 3632679d9e4f879f49949bb5b050e0de553e4739 ]
+
+With a raw socket bound to IPPROTO_RAW (i.e. with hdrincl enabled), the
+protocol field of the flow structure, built by raw_sendmsg() /
+rawv6_sendmsg(), is set to IPPROTO_RAW. This breaks the IPsec policy
+lookup when some policies are defined with a protocol in the selector.
+
+For ipv6, the sin6_port field from 'struct sockaddr_in6' could be used to
+specify the protocol. Just accept all values for an IPPROTO_RAW socket.
+
+For ipv4, the sin_port field of 'struct sockaddr_in' could not be used
+without breaking backward compatibility (the value of this field was never
+checked). Let's add a new kind of control message, so that userland can
+specify which protocol is used.
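+
+A hedged sketch of the intended userland usage (illustration only; the
+IP_PROTOCOL value of 52 comes from the uapi hunk below, so older headers
+need the fallback define, and buffer/address handling is left out):
+
+  #include <netinet/in.h>
+  #include <string.h>
+  #include <sys/socket.h>
+
+  #ifndef IP_PROTOCOL
+  #define IP_PROTOCOL 52
+  #endif
+
+  /* Attach an IP_PROTOCOL ancillary message telling the kernel which
+   * protocol the raw (hdrincl) payload really carries, e.g. IPPROTO_UDP,
+   * so the xfrm policy lookup can match on it. cbuf_len should be
+   * CMSG_SPACE(sizeof(int)). */
+  static void set_ip_protocol_cmsg(struct msghdr *msg, char *cbuf,
+                                   size_t cbuf_len, int proto)
+  {
+          struct cmsghdr *cmsg;
+
+          memset(cbuf, 0, cbuf_len);
+          msg->msg_control = cbuf;
+          msg->msg_controllen = cbuf_len;
+          cmsg = CMSG_FIRSTHDR(msg);
+          cmsg->cmsg_level = SOL_IP;
+          cmsg->cmsg_type = IP_PROTOCOL;
+          cmsg->cmsg_len = CMSG_LEN(sizeof(int));
+          memcpy(CMSG_DATA(cmsg), &proto, sizeof(proto));
+  }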
+
+Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
+CC: stable@vger.kernel.org
+Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
+Link: https://lore.kernel.org/r/20230522120820.1319391-1-nicolas.dichtel@6wind.com
+Signed-off-by: Paolo Abeni <pabeni@redhat.com>
+Signed-off-by: Sasha Levin <sashal@kernel.org>
+---
+ include/net/ip.h | 2 ++
+ include/uapi/linux/in.h | 1 +
+ net/ipv4/ip_sockglue.c | 12 +++++++++++-
+ net/ipv4/raw.c | 5 ++++-
+ net/ipv6/raw.c | 3 ++-
+ 5 files changed, 20 insertions(+), 3 deletions(-)
+
+diff --git a/include/net/ip.h b/include/net/ip.h
+index c3fffaa92d6e0..acec504c469a0 100644
+--- a/include/net/ip.h
++++ b/include/net/ip.h
+@@ -76,6 +76,7 @@ struct ipcm_cookie {
+ __be32 addr;
+ int oif;
+ struct ip_options_rcu *opt;
++ __u8 protocol;
+ __u8 ttl;
+ __s16 tos;
+ char priority;
+@@ -96,6 +97,7 @@ static inline void ipcm_init_sk(struct ipcm_cookie *ipcm,
+ ipcm->sockc.tsflags = inet->sk.sk_tsflags;
+ ipcm->oif = READ_ONCE(inet->sk.sk_bound_dev_if);
+ ipcm->addr = inet->inet_saddr;
++ ipcm->protocol = inet->inet_num;
+ }
+
+ #define IPCB(skb) ((struct inet_skb_parm*)((skb)->cb))
+diff --git a/include/uapi/linux/in.h b/include/uapi/linux/in.h
+index 4b7f2df66b995..e682ab628dfa6 100644
+--- a/include/uapi/linux/in.h
++++ b/include/uapi/linux/in.h
+@@ -163,6 +163,7 @@ struct in_addr {
+ #define IP_MULTICAST_ALL 49
+ #define IP_UNICAST_IF 50
+ #define IP_LOCAL_PORT_RANGE 51
++#define IP_PROTOCOL 52
+
+ #define MCAST_EXCLUDE 0
+ #define MCAST_INCLUDE 1
+diff --git a/net/ipv4/ip_sockglue.c b/net/ipv4/ip_sockglue.c
+index d05f631ea6401..a7fd035b5b4f9 100644
+--- a/net/ipv4/ip_sockglue.c
++++ b/net/ipv4/ip_sockglue.c
+@@ -317,7 +317,14 @@ int ip_cmsg_send(struct sock *sk, struct msghdr *msg, struct ipcm_cookie *ipc,
+ ipc->tos = val;
+ ipc->priority = rt_tos2priority(ipc->tos);
+ break;
+-
++ case IP_PROTOCOL:
++ if (cmsg->cmsg_len != CMSG_LEN(sizeof(int)))
++ return -EINVAL;
++ val = *(int *)CMSG_DATA(cmsg);
++ if (val < 1 || val > 255)
++ return -EINVAL;
++ ipc->protocol = val;
++ break;
+ default:
+ return -EINVAL;
+ }
+@@ -1760,6 +1767,9 @@ int do_ip_getsockopt(struct sock *sk, int level, int optname,
+ case IP_LOCAL_PORT_RANGE:
+ val = inet->local_port_range.hi << 16 | inet->local_port_range.lo;
+ break;
++ case IP_PROTOCOL:
++ val = inet_sk(sk)->inet_num;
++ break;
+ default:
+ sockopt_release_sock(sk);
+ return -ENOPROTOOPT;
+diff --git a/net/ipv4/raw.c b/net/ipv4/raw.c
+index af03aa8a8e513..86197634dcf5d 100644
+--- a/net/ipv4/raw.c
++++ b/net/ipv4/raw.c
+@@ -530,6 +530,9 @@ static int raw_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
+ }
+
+ ipcm_init_sk(&ipc, inet);
++ /* Keep backward compat */
++ if (hdrincl)
++ ipc.protocol = IPPROTO_RAW;
+
+ if (msg->msg_controllen) {
+ err = ip_cmsg_send(sk, msg, &ipc, false);
+@@ -597,7 +600,7 @@ static int raw_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
+
+ flowi4_init_output(&fl4, ipc.oif, ipc.sockc.mark, tos,
+ RT_SCOPE_UNIVERSE,
+- hdrincl ? IPPROTO_RAW : sk->sk_protocol,
++ hdrincl ? ipc.protocol : sk->sk_protocol,
+ inet_sk_flowi_flags(sk) |
+ (hdrincl ? FLOWI_FLAG_KNOWN_NH : 0),
+ daddr, saddr, 0, 0, sk->sk_uid);
+diff --git a/net/ipv6/raw.c b/net/ipv6/raw.c
+index f44b99f7ecdcc..33852fc38ad91 100644
+--- a/net/ipv6/raw.c
++++ b/net/ipv6/raw.c
+@@ -791,7 +791,8 @@ static int rawv6_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
+
+ if (!proto)
+ proto = inet->inet_num;
+- else if (proto != inet->inet_num)
++ else if (proto != inet->inet_num &&
++ inet->inet_num != IPPROTO_RAW)
+ return -EINVAL;
+
+ if (proto > 255)
+--
+2.39.2
+
--- /dev/null
+From a2faf150514ee4a652fafe17c2dcb05efa261f8d Mon Sep 17 00:00:00 2001
+From: Sasha Levin <sashal@kernel.org>
+Date: Mon, 6 Feb 2023 11:52:02 +0200
+Subject: net/mlx5: E-switch, Devcom, sync devcom events and devcom comp
+ register
+
+From: Shay Drory <shayd@nvidia.com>
+
+[ Upstream commit 8c253dfc89efde6b5faddf9e7400e5d17884e042 ]
+
+devcom events are sent to all registered components. Following the
+cited patch, it is possible for two components, e.g. two eswitches,
+to send devcom events while both components are registered. This
+means the eswitch layer will do double un/pairing, which means double
+allocation and freeing of resources, even though only one un/pairing is
+needed. Flow example:
+
+ cpu0 cpu1
+ ---- ----
+
+ mlx5_devlink_eswitch_mode_set(dev0)
+ esw_offloads_devcom_init()
+ mlx5_devcom_register_component(esw0)
+ mlx5_devlink_eswitch_mode_set(dev1)
+ esw_offloads_devcom_init()
+ mlx5_devcom_register_component(esw1)
+ mlx5_devcom_send_event()
+ mlx5_devcom_send_event()
+
+Hence, check whether the eswitches are already un/paired before
+freeing/allocating resources.
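+
+A simplified, self-contained sketch of that guard (not the driver code
+itself; struct esw, MAX_PORTS and the resource hooks are placeholders,
+while the real code keys the array by mlx5_get_dev_index() as the hunks
+below show): duplicate PAIR/UNPAIR events become no-ops.
+
+  #include <stdbool.h>
+
+  #define MAX_PORTS 4                     /* placeholder for MLX5_MAX_PORTS */
+
+  struct esw {
+          bool paired[MAX_PORTS];         /* indexed by peer device index */
+  };
+
+  static int handle_pair(struct esw *esw, int my_idx,
+                         struct esw *peer, int peer_idx)
+  {
+          if (esw->paired[peer_idx])
+                  return 0;               /* duplicate event: nothing to do */
+          /* ... allocate and pair offload resources here ... */
+          esw->paired[peer_idx] = true;
+          peer->paired[my_idx] = true;
+          return 0;
+  }
+
+  static void handle_unpair(struct esw *esw, int my_idx,
+                            struct esw *peer, int peer_idx)
+  {
+          if (!esw->paired[peer_idx])
+                  return;                 /* never paired, or already unpaired */
+          esw->paired[peer_idx] = false;
+          peer->paired[my_idx] = false;
+          /* ... free offload resources here ... */
+  }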
+
+Fixes: 09b278462f16 ("net: devlink: enable parallel ops on netlink interface")
+Signed-off-by: Shay Drory <shayd@nvidia.com>
+Reviewed-by: Mark Bloch <mbloch@nvidia.com>
+Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
+Signed-off-by: Sasha Levin <sashal@kernel.org>
+---
+ drivers/net/ethernet/mellanox/mlx5/core/eswitch.h | 1 +
+ .../net/ethernet/mellanox/mlx5/core/eswitch_offloads.c | 9 ++++++++-
+ 2 files changed, 9 insertions(+), 1 deletion(-)
+
+diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h
+index 821c78bab3732..a3daca44f74b1 100644
+--- a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h
++++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h
+@@ -340,6 +340,7 @@ struct mlx5_eswitch {
+ } params;
+ struct blocking_notifier_head n_head;
+ struct dentry *dbgfs;
++ bool paired[MLX5_MAX_PORTS];
+ };
+
+ void esw_offloads_disable(struct mlx5_eswitch *esw);
+diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c b/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
+index 5235b5a7b9637..433cdd0a2cf34 100644
+--- a/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
++++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
+@@ -2827,6 +2827,9 @@ static int mlx5_esw_offloads_devcom_event(int event,
+ mlx5_eswitch_vport_match_metadata_enabled(peer_esw))
+ break;
+
++ if (esw->paired[mlx5_get_dev_index(peer_esw->dev)])
++ break;
++
+ err = mlx5_esw_offloads_set_ns_peer(esw, peer_esw, true);
+ if (err)
+ goto err_out;
+@@ -2838,14 +2841,18 @@ static int mlx5_esw_offloads_devcom_event(int event,
+ if (err)
+ goto err_pair;
+
++ esw->paired[mlx5_get_dev_index(peer_esw->dev)] = true;
++ peer_esw->paired[mlx5_get_dev_index(esw->dev)] = true;
+ mlx5_devcom_set_paired(devcom, MLX5_DEVCOM_ESW_OFFLOADS, true);
+ break;
+
+ case ESW_OFFLOADS_DEVCOM_UNPAIR:
+- if (!mlx5_devcom_is_paired(devcom, MLX5_DEVCOM_ESW_OFFLOADS))
++ if (!esw->paired[mlx5_get_dev_index(peer_esw->dev)])
+ break;
+
+ mlx5_devcom_set_paired(devcom, MLX5_DEVCOM_ESW_OFFLOADS, false);
++ esw->paired[mlx5_get_dev_index(peer_esw->dev)] = false;
++ peer_esw->paired[mlx5_get_dev_index(esw->dev)] = false;
+ mlx5_esw_offloads_unpair(peer_esw);
+ mlx5_esw_offloads_unpair(esw);
+ mlx5_esw_offloads_set_ns_peer(esw, peer_esw, false);
+--
+2.39.2
+
--- /dev/null
+From fd295408c719ed8b6b4595392c12e6575fedbfe0 Mon Sep 17 00:00:00 2001
+From: Sasha Levin <sashal@kernel.org>
+Date: Fri, 3 Feb 2023 09:16:11 +0800
+Subject: net: page_pool: use in_softirq() instead
+
+From: Qingfang DENG <qingfang.deng@siflower.com.cn>
+
+[ Upstream commit 542bcea4be866b14b3a5c8e90773329066656c43 ]
+
+We use BH context only for synchronization, so we don't care if it's
+actually serving softirq or not.
+
+As a side note, in the case of threaded NAPI, in_serving_softirq() will
+return false because it runs in process context with BH disabled, making
+page_pool_recycle_in_cache() unreachable.
+
+Signed-off-by: Qingfang DENG <qingfang.deng@siflower.com.cn>
+Tested-by: Felix Fietkau <nbd@nbd.name>
+Signed-off-by: David S. Miller <davem@davemloft.net>
+Stable-dep-of: 368d3cb406cd ("page_pool: fix inconsistency for page_pool_ring_[un]lock()")
+Signed-off-by: Sasha Levin <sashal@kernel.org>
+---
+ include/net/page_pool.h | 4 ++--
+ net/core/page_pool.c | 6 +++---
+ 2 files changed, 5 insertions(+), 5 deletions(-)
+
+diff --git a/include/net/page_pool.h b/include/net/page_pool.h
+index 813c93499f201..34bf531ffc8d6 100644
+--- a/include/net/page_pool.h
++++ b/include/net/page_pool.h
+@@ -386,7 +386,7 @@ static inline void page_pool_nid_changed(struct page_pool *pool, int new_nid)
+ static inline void page_pool_ring_lock(struct page_pool *pool)
+ __acquires(&pool->ring.producer_lock)
+ {
+- if (in_serving_softirq())
++ if (in_softirq())
+ spin_lock(&pool->ring.producer_lock);
+ else
+ spin_lock_bh(&pool->ring.producer_lock);
+@@ -395,7 +395,7 @@ static inline void page_pool_ring_lock(struct page_pool *pool)
+ static inline void page_pool_ring_unlock(struct page_pool *pool)
+ __releases(&pool->ring.producer_lock)
+ {
+- if (in_serving_softirq())
++ if (in_softirq())
+ spin_unlock(&pool->ring.producer_lock);
+ else
+ spin_unlock_bh(&pool->ring.producer_lock);
+diff --git a/net/core/page_pool.c b/net/core/page_pool.c
+index 9b203d8660e47..193c187998650 100644
+--- a/net/core/page_pool.c
++++ b/net/core/page_pool.c
+@@ -511,8 +511,8 @@ static void page_pool_return_page(struct page_pool *pool, struct page *page)
+ static bool page_pool_recycle_in_ring(struct page_pool *pool, struct page *page)
+ {
+ int ret;
+- /* BH protection not needed if current is serving softirq */
+- if (in_serving_softirq())
++ /* BH protection not needed if current is softirq */
++ if (in_softirq())
+ ret = ptr_ring_produce(&pool->ring, page);
+ else
+ ret = ptr_ring_produce_bh(&pool->ring, page);
+@@ -570,7 +570,7 @@ __page_pool_put_page(struct page_pool *pool, struct page *page,
+ page_pool_dma_sync_for_device(pool, page,
+ dma_sync_size);
+
+- if (allow_direct && in_serving_softirq() &&
++ if (allow_direct && in_softirq() &&
+ page_pool_recycle_in_cache(page, pool))
+ return NULL;
+
+--
+2.39.2
+
--- /dev/null
+From dc0b2c73e5cd6a9e840a2796e921994c31096ba6 Mon Sep 17 00:00:00 2001
+From: Sasha Levin <sashal@kernel.org>
+Date: Tue, 23 May 2023 17:31:08 +0200
+Subject: net: phy: mscc: enable VSC8501/2 RGMII RX clock
+
+From: David Epping <david.epping@missinglinkelectronics.com>
+
+[ Upstream commit 71460c9ec5c743e9ffffca3c874d66267c36345e ]
+
+By default the VSC8501 and VSC8502 RGMII/GMII/MII RX_CLK output is
+disabled. To allow packet forwarding towards the MAC it needs to be
+enabled.
+
+For other PHYs supported by this driver the clock output is enabled
+by default.
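+
+In register terms this boils down to a masked clear of the new disable
+bit via the paged PHY helpers; a minimal sketch of just that step, using
+the constants added below (the actual patch folds it into
+vsc85xx_update_rgmii_cntl() together with the RGMII skew configuration):
+
+  static int vsc8502_enable_rx_clk(struct phy_device *phydev)
+  {
+          /* Clear the RX_CLK output disable bit; on family members where
+           * the bit is reserved/read-only the clock is already enabled. */
+          return phy_modify_paged(phydev, MSCC_PHY_PAGE_EXTENDED_2,
+                                  VSC8502_RGMII_CNTL,
+                                  VSC8502_RGMII_RX_CLK_DISABLE, 0);
+  }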
+
+Fixes: d3169863310d ("net: phy: mscc: add support for VSC8502")
+Signed-off-by: David Epping <david.epping@missinglinkelectronics.com>
+Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
+Reviewed-by: Vladimir Oltean <olteanv@gmail.com>
+Signed-off-by: Jakub Kicinski <kuba@kernel.org>
+Signed-off-by: Sasha Levin <sashal@kernel.org>
+---
+ drivers/net/phy/mscc/mscc.h | 1 +
+ drivers/net/phy/mscc/mscc_main.c | 54 +++++++++++++++++---------------
+ 2 files changed, 29 insertions(+), 26 deletions(-)
+
+diff --git a/drivers/net/phy/mscc/mscc.h b/drivers/net/phy/mscc/mscc.h
+index a50235fdf7d99..055e4ca5b3b5c 100644
+--- a/drivers/net/phy/mscc/mscc.h
++++ b/drivers/net/phy/mscc/mscc.h
+@@ -179,6 +179,7 @@ enum rgmii_clock_delay {
+ #define VSC8502_RGMII_CNTL 20
+ #define VSC8502_RGMII_RX_DELAY_MASK 0x0070
+ #define VSC8502_RGMII_TX_DELAY_MASK 0x0007
++#define VSC8502_RGMII_RX_CLK_DISABLE 0x0800
+
+ #define MSCC_PHY_WOL_LOWER_MAC_ADDR 21
+ #define MSCC_PHY_WOL_MID_MAC_ADDR 22
+diff --git a/drivers/net/phy/mscc/mscc_main.c b/drivers/net/phy/mscc/mscc_main.c
+index f778e4f8b5080..7bd940baec595 100644
+--- a/drivers/net/phy/mscc/mscc_main.c
++++ b/drivers/net/phy/mscc/mscc_main.c
+@@ -527,14 +527,27 @@ static int vsc85xx_mac_if_set(struct phy_device *phydev,
+ * * 2.0 ns (which causes the data to be sampled at exactly half way between
+ * clock transitions at 1000 Mbps) if delays should be enabled
+ */
+-static int vsc85xx_rgmii_set_skews(struct phy_device *phydev, u32 rgmii_cntl,
+- u16 rgmii_rx_delay_mask,
+- u16 rgmii_tx_delay_mask)
++static int vsc85xx_update_rgmii_cntl(struct phy_device *phydev, u32 rgmii_cntl,
++ u16 rgmii_rx_delay_mask,
++ u16 rgmii_tx_delay_mask)
+ {
+ u16 rgmii_rx_delay_pos = ffs(rgmii_rx_delay_mask) - 1;
+ u16 rgmii_tx_delay_pos = ffs(rgmii_tx_delay_mask) - 1;
+ u16 reg_val = 0;
+- int rc;
++ u16 mask = 0;
++ int rc = 0;
++
++ /* For traffic to pass, the VSC8502 family needs the RX_CLK disable bit
++ * to be unset for all PHY modes, so do that as part of the paged
++ * register modification.
++ * For some family members (like VSC8530/31/40/41) this bit is reserved
++ * and read-only, and the RX clock is enabled by default.
++ */
++ if (rgmii_cntl == VSC8502_RGMII_CNTL)
++ mask |= VSC8502_RGMII_RX_CLK_DISABLE;
++
++ if (phy_interface_is_rgmii(phydev))
++ mask |= rgmii_rx_delay_mask | rgmii_tx_delay_mask;
+
+ mutex_lock(&phydev->lock);
+
+@@ -545,10 +558,9 @@ static int vsc85xx_rgmii_set_skews(struct phy_device *phydev, u32 rgmii_cntl,
+ phydev->interface == PHY_INTERFACE_MODE_RGMII_ID)
+ reg_val |= RGMII_CLK_DELAY_2_0_NS << rgmii_tx_delay_pos;
+
+- rc = phy_modify_paged(phydev, MSCC_PHY_PAGE_EXTENDED_2,
+- rgmii_cntl,
+- rgmii_rx_delay_mask | rgmii_tx_delay_mask,
+- reg_val);
++ if (mask)
++ rc = phy_modify_paged(phydev, MSCC_PHY_PAGE_EXTENDED_2,
++ rgmii_cntl, mask, reg_val);
+
+ mutex_unlock(&phydev->lock);
+
+@@ -557,19 +569,11 @@ static int vsc85xx_rgmii_set_skews(struct phy_device *phydev, u32 rgmii_cntl,
+
+ static int vsc85xx_default_config(struct phy_device *phydev)
+ {
+- int rc;
+-
+ phydev->mdix_ctrl = ETH_TP_MDI_AUTO;
+
+- if (phy_interface_mode_is_rgmii(phydev->interface)) {
+- rc = vsc85xx_rgmii_set_skews(phydev, VSC8502_RGMII_CNTL,
+- VSC8502_RGMII_RX_DELAY_MASK,
+- VSC8502_RGMII_TX_DELAY_MASK);
+- if (rc)
+- return rc;
+- }
+-
+- return 0;
++ return vsc85xx_update_rgmii_cntl(phydev, VSC8502_RGMII_CNTL,
++ VSC8502_RGMII_RX_DELAY_MASK,
++ VSC8502_RGMII_TX_DELAY_MASK);
+ }
+
+ static int vsc85xx_get_tunable(struct phy_device *phydev,
+@@ -1766,13 +1770,11 @@ static int vsc8584_config_init(struct phy_device *phydev)
+ if (ret)
+ return ret;
+
+- if (phy_interface_is_rgmii(phydev)) {
+- ret = vsc85xx_rgmii_set_skews(phydev, VSC8572_RGMII_CNTL,
+- VSC8572_RGMII_RX_DELAY_MASK,
+- VSC8572_RGMII_TX_DELAY_MASK);
+- if (ret)
+- return ret;
+- }
++ ret = vsc85xx_update_rgmii_cntl(phydev, VSC8572_RGMII_CNTL,
++ VSC8572_RGMII_RX_DELAY_MASK,
++ VSC8572_RGMII_TX_DELAY_MASK);
++ if (ret)
++ return ret;
+
+ ret = genphy_soft_reset(phydev);
+ if (ret)
+--
+2.39.2
+
--- /dev/null
+From 0f7c4261319e04c926d12a31b9ed6abd87280272 Mon Sep 17 00:00:00 2001
+From: Sasha Levin <sashal@kernel.org>
+Date: Mon, 22 May 2023 11:17:14 +0800
+Subject: page_pool: fix inconsistency for page_pool_ring_[un]lock()
+
+From: Yunsheng Lin <linyunsheng@huawei.com>
+
+[ Upstream commit 368d3cb406cdd074d1df2ad9ec06d1bfcb664882 ]
+
+page_pool_ring_[un]lock() use in_softirq() to decide which
+spin lock variant to use. When they are called in a context
+where in_softirq() is false, spin_lock_bh() is called in
+page_pool_ring_lock() while spin_unlock() is called in
+page_pool_ring_unlock(), because spin_lock_bh() has disabled
+softirqs in page_pool_ring_lock(). This leaves the spin lock
+calls unpaired.
+
+This patch fixes it by returning the in_softirq state from
+page_pool_producer_lock() and using it to decide which
+spin lock variant to use in page_pool_producer_unlock().
+
+As pool->ring has both a producer and a consumer lock,
+rename the helpers to page_pool_producer_[un]lock() to reflect
+the actual usage. Also move them to page_pool.c, as they
+are only used there, and drop the 'inline' since the
+compiler has a better idea of whether to inline them.
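+
+The resulting internal calling convention, in sketch form (this is how
+page_pool_put_page_bulk() uses the helpers in the hunks below): the
+context is sampled exactly once, so lock and unlock always pick the same
+variant even though spin_lock_bh() itself changes what in_softirq()
+reports afterwards.
+
+  bool in_sirq;
+
+  in_sirq = page_pool_producer_lock(pool);   /* samples in_softirq() once */
+  /* ... __ptr_ring_produce() the pages into pool->ring ... */
+  page_pool_producer_unlock(pool, in_sirq);  /* same variant as the lock */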
+
+Fixes: 7886244736a4 ("net: page_pool: Add bulk support for ptr_ring")
+Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
+Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
+Acked-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
+Link: https://lore.kernel.org/r/20230522031714.5089-1-linyunsheng@huawei.com
+Signed-off-by: Jakub Kicinski <kuba@kernel.org>
+Signed-off-by: Sasha Levin <sashal@kernel.org>
+---
+ include/net/page_pool.h | 18 ------------------
+ net/core/page_pool.c | 28 ++++++++++++++++++++++++++--
+ 2 files changed, 26 insertions(+), 20 deletions(-)
+
+diff --git a/include/net/page_pool.h b/include/net/page_pool.h
+index 34bf531ffc8d6..ad0bafc877d48 100644
+--- a/include/net/page_pool.h
++++ b/include/net/page_pool.h
+@@ -383,22 +383,4 @@ static inline void page_pool_nid_changed(struct page_pool *pool, int new_nid)
+ page_pool_update_nid(pool, new_nid);
+ }
+
+-static inline void page_pool_ring_lock(struct page_pool *pool)
+- __acquires(&pool->ring.producer_lock)
+-{
+- if (in_softirq())
+- spin_lock(&pool->ring.producer_lock);
+- else
+- spin_lock_bh(&pool->ring.producer_lock);
+-}
+-
+-static inline void page_pool_ring_unlock(struct page_pool *pool)
+- __releases(&pool->ring.producer_lock)
+-{
+- if (in_softirq())
+- spin_unlock(&pool->ring.producer_lock);
+- else
+- spin_unlock_bh(&pool->ring.producer_lock);
+-}
+-
+ #endif /* _NET_PAGE_POOL_H */
+diff --git a/net/core/page_pool.c b/net/core/page_pool.c
+index 193c187998650..2396c99bedeaa 100644
+--- a/net/core/page_pool.c
++++ b/net/core/page_pool.c
+@@ -133,6 +133,29 @@ EXPORT_SYMBOL(page_pool_ethtool_stats_get);
+ #define recycle_stat_add(pool, __stat, val)
+ #endif
+
++static bool page_pool_producer_lock(struct page_pool *pool)
++ __acquires(&pool->ring.producer_lock)
++{
++ bool in_softirq = in_softirq();
++
++ if (in_softirq)
++ spin_lock(&pool->ring.producer_lock);
++ else
++ spin_lock_bh(&pool->ring.producer_lock);
++
++ return in_softirq;
++}
++
++static void page_pool_producer_unlock(struct page_pool *pool,
++ bool in_softirq)
++ __releases(&pool->ring.producer_lock)
++{
++ if (in_softirq)
++ spin_unlock(&pool->ring.producer_lock);
++ else
++ spin_unlock_bh(&pool->ring.producer_lock);
++}
++
+ static int page_pool_init(struct page_pool *pool,
+ const struct page_pool_params *params)
+ {
+@@ -615,6 +638,7 @@ void page_pool_put_page_bulk(struct page_pool *pool, void **data,
+ int count)
+ {
+ int i, bulk_len = 0;
++ bool in_softirq;
+
+ for (i = 0; i < count; i++) {
+ struct page *page = virt_to_head_page(data[i]);
+@@ -633,7 +657,7 @@ void page_pool_put_page_bulk(struct page_pool *pool, void **data,
+ return;
+
+ /* Bulk producer into ptr_ring page_pool cache */
+- page_pool_ring_lock(pool);
++ in_softirq = page_pool_producer_lock(pool);
+ for (i = 0; i < bulk_len; i++) {
+ if (__ptr_ring_produce(&pool->ring, data[i])) {
+ /* ring full */
+@@ -642,7 +666,7 @@ void page_pool_put_page_bulk(struct page_pool *pool, void **data,
+ }
+ }
+ recycle_stat_add(pool, ring, i);
+- page_pool_ring_unlock(pool);
++ page_pool_producer_unlock(pool, in_softirq);
+
+ /* Hopefully all pages was return into ptr_ring */
+ if (likely(i == bulk_len))
+--
+2.39.2
+
--- /dev/null
+From 5197daa25cf6bd27df0920debbf93d53067824e2 Mon Sep 17 00:00:00 2001
+From: Sasha Levin <sashal@kernel.org>
+Date: Fri, 12 May 2023 20:14:08 -0500
+Subject: platform/x86/amd/pmf: Fix CnQF and auto-mode after resume
+
+From: Mario Limonciello <mario.limonciello@amd.com>
+
+[ Upstream commit b54147fa374dbeadcb01b1762db1a793e06e37de ]
+
+After a suspend/resume cycle there is an error message, and auto-mode
+or CnQF stops working.
+
+[ 5741.447511] amd-pmf AMDI0100:00: SMU cmd failed. err: 0xff
+[ 5741.447523] amd-pmf AMDI0100:00: AMD_PMF_REGISTER_RESPONSE:ff
+[ 5741.447527] amd-pmf AMDI0100:00: AMD_PMF_REGISTER_ARGUMENT:7
+[ 5741.447531] amd-pmf AMDI0100:00: AMD_PMF_REGISTER_MESSAGE:16
+[ 5741.447540] amd-pmf AMDI0100:00: [AUTO_MODE] avg power: 0 mW mode: QUIET
+
+This is because the DRAM address used for accessing the metrics table
+needs to be refreshed after a suspend/resume cycle. Add a resume
+callback to reset it again.
+
+Fixes: 1a409b35c995 ("platform/x86/amd/pmf: Get performance metrics from PMFW")
+Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
+Link: https://lore.kernel.org/r/20230513011408.958-1-mario.limonciello@amd.com
+Reviewed-by: Hans de Goede <hdegoede@redhat.com>
+Signed-off-by: Hans de Goede <hdegoede@redhat.com>
+Signed-off-by: Sasha Levin <sashal@kernel.org>
+---
+ drivers/platform/x86/amd/pmf/core.c | 32 ++++++++++++++++++++++-------
+ 1 file changed, 25 insertions(+), 7 deletions(-)
+
+diff --git a/drivers/platform/x86/amd/pmf/core.c b/drivers/platform/x86/amd/pmf/core.c
+index 0acc0b6221290..dc9803e1a4b9b 100644
+--- a/drivers/platform/x86/amd/pmf/core.c
++++ b/drivers/platform/x86/amd/pmf/core.c
+@@ -245,24 +245,29 @@ static const struct pci_device_id pmf_pci_ids[] = {
+ { }
+ };
+
+-int amd_pmf_init_metrics_table(struct amd_pmf_dev *dev)
++static void amd_pmf_set_dram_addr(struct amd_pmf_dev *dev)
+ {
+ u64 phys_addr;
+ u32 hi, low;
+
+- INIT_DELAYED_WORK(&dev->work_buffer, amd_pmf_get_metrics);
++ phys_addr = virt_to_phys(dev->buf);
++ hi = phys_addr >> 32;
++ low = phys_addr & GENMASK(31, 0);
++
++ amd_pmf_send_cmd(dev, SET_DRAM_ADDR_HIGH, 0, hi, NULL);
++ amd_pmf_send_cmd(dev, SET_DRAM_ADDR_LOW, 0, low, NULL);
++}
+
++int amd_pmf_init_metrics_table(struct amd_pmf_dev *dev)
++{
+ /* Get Metrics Table Address */
+ dev->buf = kzalloc(sizeof(dev->m_table), GFP_KERNEL);
+ if (!dev->buf)
+ return -ENOMEM;
+
+- phys_addr = virt_to_phys(dev->buf);
+- hi = phys_addr >> 32;
+- low = phys_addr & GENMASK(31, 0);
++ INIT_DELAYED_WORK(&dev->work_buffer, amd_pmf_get_metrics);
+
+- amd_pmf_send_cmd(dev, SET_DRAM_ADDR_HIGH, 0, hi, NULL);
+- amd_pmf_send_cmd(dev, SET_DRAM_ADDR_LOW, 0, low, NULL);
++ amd_pmf_set_dram_addr(dev);
+
+ /*
+ * Start collecting the metrics data after a small delay
+@@ -273,6 +278,18 @@ int amd_pmf_init_metrics_table(struct amd_pmf_dev *dev)
+ return 0;
+ }
+
++static int amd_pmf_resume_handler(struct device *dev)
++{
++ struct amd_pmf_dev *pdev = dev_get_drvdata(dev);
++
++ if (pdev->buf)
++ amd_pmf_set_dram_addr(pdev);
++
++ return 0;
++}
++
++static DEFINE_SIMPLE_DEV_PM_OPS(amd_pmf_pm, NULL, amd_pmf_resume_handler);
++
+ static void amd_pmf_init_features(struct amd_pmf_dev *dev)
+ {
+ int ret;
+@@ -414,6 +431,7 @@ static struct platform_driver amd_pmf_driver = {
+ .name = "amd-pmf",
+ .acpi_match_table = amd_pmf_acpi_ids,
+ .dev_groups = amd_pmf_driver_groups,
++ .pm = pm_sleep_ptr(&amd_pmf_pm),
+ },
+ .probe = amd_pmf_probe,
+ .remove = amd_pmf_remove,
+--
+2.39.2
+
--- /dev/null
+From a8033055415b14a46897376561eaf631b38910df Mon Sep 17 00:00:00 2001
+From: Sasha Levin <sashal@kernel.org>
+Date: Wed, 26 Apr 2023 22:50:32 +0100
+Subject: selftests/bpf: Fix pkg-config call building sign-file
+
+From: Jeremy Sowden <jeremy@azazel.net>
+
+[ Upstream commit 5f5486b620cd43b16a1787ef92b9bc21bd72ef2e ]
+
+When building sign-file, the call to get the CFLAGS for libcrypto is
+missing white-space between `pkg-config` and `--cflags`:
+
+ $(shell $(HOSTPKG_CONFIG)--cflags libcrypto 2> /dev/null)
+
+Removing the redirection of stderr, we see:
+
+ $ make -C tools/testing/selftests/bpf sign-file
+ make: Entering directory '[...]/tools/testing/selftests/bpf'
+ make: pkg-config--cflags: No such file or directory
+ SIGN-FILE sign-file
+ make: Leaving directory '[...]/tools/testing/selftests/bpf'
+
+Add the missing space.
+
+Fixes: fc97590668ae ("selftests/bpf: Add test for bpf_verify_pkcs7_signature() kfunc")
+Signed-off-by: Jeremy Sowden <jeremy@azazel.net>
+Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
+Reviewed-by: Roberto Sassu <roberto.sassu@huawei.com>
+Link: https://lore.kernel.org/bpf/20230426215032.415792-1-jeremy@azazel.net
+Signed-off-by: Alexei Starovoitov <ast@kernel.org>
+Signed-off-by: Sasha Levin <sashal@kernel.org>
+---
+ tools/testing/selftests/bpf/Makefile | 2 +-
+ 1 file changed, 1 insertion(+), 1 deletion(-)
+
+diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile
+index 687249d99b5f1..0465ddc81f352 100644
+--- a/tools/testing/selftests/bpf/Makefile
++++ b/tools/testing/selftests/bpf/Makefile
+@@ -193,7 +193,7 @@ $(OUTPUT)/urandom_read: urandom_read.c urandom_read_aux.c $(OUTPUT)/liburandom_r
+
+ $(OUTPUT)/sign-file: ../../../../scripts/sign-file.c
+ $(call msg,SIGN-FILE,,$@)
+- $(Q)$(CC) $(shell $(HOSTPKG_CONFIG)--cflags libcrypto 2> /dev/null) \
++ $(Q)$(CC) $(shell $(HOSTPKG_CONFIG) --cflags libcrypto 2> /dev/null) \
+ $< -o $@ \
+ $(shell $(HOSTPKG_CONFIG) --libs libcrypto 2> /dev/null || echo -lcrypto)
+
+--
+2.39.2
+
net-smc-reset-connection-when-trying-to-use-smcrv2-fails.patch
3c589_cs-fix-an-error-handling-path-in-tc589_probe.patch
net-phy-mscc-add-vsc8502-to-module_device_table.patch
+inet-add-ip_local_port_range-socket-option.patch
+ipv-4-6-raw-fix-output-xfrm-lookup-wrt-protocol.patch
+firmware-arm_ffa-fix-usage-of-partition-info-get-cou.patch
+selftests-bpf-fix-pkg-config-call-building-sign-file.patch
+platform-x86-amd-pmf-fix-cnqf-and-auto-mode-after-re.patch
+tls-rx-device-fix-checking-decryption-status.patch
+tls-rx-strp-set-the-skb-len-of-detached-cow-ed-skbs.patch
+tls-rx-strp-fix-determining-record-length-in-copy-mo.patch
+tls-rx-strp-force-mixed-decrypted-records-into-copy-.patch
+tls-rx-strp-factor-out-copying-skb-data.patch
+tls-rx-strp-preserve-decryption-status-of-skbs-when-.patch
+net-mlx5-e-switch-devcom-sync-devcom-events-and-devc.patch
+gpio-f7188x-fix-chip-name-and-pin-count-on-nuvoton-c.patch
+bpf-sockmap-pass-skb-ownership-through-read_skb.patch
+bpf-sockmap-convert-schedule_work-into-delayed_work.patch
+bpf-sockmap-reschedule-is-now-done-through-backlog.patch
+bpf-sockmap-improved-check-for-empty-queue.patch
+bpf-sockmap-handle-fin-correctly.patch
+bpf-sockmap-tcp-data-stall-on-recv-before-accept.patch
+bpf-sockmap-wake-up-polling-after-data-copy.patch
+bpf-sockmap-incorrectly-handling-copied_seq.patch
+blk-mq-fix-race-condition-in-active-queue-accounting.patch
+vfio-type1-check-pfn-valid-before-converting-to-stru.patch
+net-page_pool-use-in_softirq-instead.patch
+page_pool-fix-inconsistency-for-page_pool_ring_-un-l.patch
+net-phy-mscc-enable-vsc8501-2-rgmii-rx-clock.patch
--- /dev/null
+From d0b063d0c88043a9198c69a24b09490b1459ac5e Mon Sep 17 00:00:00 2001
+From: Sasha Levin <sashal@kernel.org>
+Date: Tue, 16 May 2023 18:50:36 -0700
+Subject: tls: rx: device: fix checking decryption status
+
+From: Jakub Kicinski <kuba@kernel.org>
+
+[ Upstream commit b3a03b540e3cf62a255213d084d76d71c02793d5 ]
+
+skb->len covers the entire skb, including the frag_list.
+In fact we're guaranteed that rxm->full_len <= skb->len,
+so since the change under Fixes we were not checking decrypt
+status of any skb but the first.
+
+Note that the skb_pagelen() added here may feel a bit costly,
+but it's removed by subsequent fixes, anyway.
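+
+A worked example with made-up numbers: for a record with rxm->full_len =
+16384 and rxm->offset = 100, where the head skb's linear data plus page
+frags carry skb_pagelen(skb) = 4096 bytes,
+
+  left = rxm->full_len + rxm->offset - skb_pagelen(skb)
+       = 16384 + 100 - 4096 = 12388
+
+bytes of the record still sit in the frag_list and must have their
+decryption status checked, whereas the old "rxm->full_len - skb->len"
+goes non-positive and skips the frag_list walk entirely.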
+
+Reported-by: Tariq Toukan <tariqt@nvidia.com>
+Fixes: 86b259f6f888 ("tls: rx: device: bound the frag walk")
+Tested-by: Shai Amiram <samiram@nvidia.com>
+Signed-off-by: Jakub Kicinski <kuba@kernel.org>
+Reviewed-by: Simon Horman <simon.horman@corigine.com>
+Signed-off-by: David S. Miller <davem@davemloft.net>
+Signed-off-by: Sasha Levin <sashal@kernel.org>
+---
+ net/tls/tls_device.c | 2 +-
+ 1 file changed, 1 insertion(+), 1 deletion(-)
+
+diff --git a/net/tls/tls_device.c b/net/tls/tls_device.c
+index a7cc4f9faac28..3b87c7b04ac87 100644
+--- a/net/tls/tls_device.c
++++ b/net/tls/tls_device.c
+@@ -1012,7 +1012,7 @@ int tls_device_decrypted(struct sock *sk, struct tls_context *tls_ctx)
+ struct sk_buff *skb_iter;
+ int left;
+
+- left = rxm->full_len - skb->len;
++ left = rxm->full_len + rxm->offset - skb_pagelen(skb);
+ /* Check if all the data is decrypted already */
+ skb_iter = skb_shinfo(skb)->frag_list;
+ while (skb_iter && left > 0) {
+--
+2.39.2
+
--- /dev/null
+From ac5b79057662d03c2434227903c80e19c171b90f Mon Sep 17 00:00:00 2001
+From: Sasha Levin <sashal@kernel.org>
+Date: Tue, 16 May 2023 18:50:40 -0700
+Subject: tls: rx: strp: factor out copying skb data
+
+From: Jakub Kicinski <kuba@kernel.org>
+
+[ Upstream commit c1c607b1e5d5477d82ca6a86a05a4f10907b33ee ]
+
+We'll need to copy input skbs individually in the next patch.
+Factor that code out (without assuming we're copying a full record).
+
+Tested-by: Shai Amiram <samiram@nvidia.com>
+Signed-off-by: Jakub Kicinski <kuba@kernel.org>
+Reviewed-by: Simon Horman <simon.horman@corigine.com>
+Signed-off-by: David S. Miller <davem@davemloft.net>
+Stable-dep-of: eca9bfafee3a ("tls: rx: strp: preserve decryption status of skbs when needed")
+Signed-off-by: Sasha Levin <sashal@kernel.org>
+---
+ net/tls/tls_strp.c | 33 +++++++++++++++++++++++----------
+ 1 file changed, 23 insertions(+), 10 deletions(-)
+
+diff --git a/net/tls/tls_strp.c b/net/tls/tls_strp.c
+index e2e48217e7ac9..61fbf84baf9e0 100644
+--- a/net/tls/tls_strp.c
++++ b/net/tls/tls_strp.c
+@@ -34,31 +34,44 @@ static void tls_strp_anchor_free(struct tls_strparser *strp)
+ strp->anchor = NULL;
+ }
+
+-/* Create a new skb with the contents of input copied to its page frags */
+-static struct sk_buff *tls_strp_msg_make_copy(struct tls_strparser *strp)
++static struct sk_buff *
++tls_strp_skb_copy(struct tls_strparser *strp, struct sk_buff *in_skb,
++ int offset, int len)
+ {
+- struct strp_msg *rxm;
+ struct sk_buff *skb;
+- int i, err, offset;
++ int i, err;
+
+- skb = alloc_skb_with_frags(0, strp->stm.full_len, TLS_PAGE_ORDER,
++ skb = alloc_skb_with_frags(0, len, TLS_PAGE_ORDER,
+ &err, strp->sk->sk_allocation);
+ if (!skb)
+ return NULL;
+
+- offset = strp->stm.offset;
+ for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) {
+ skb_frag_t *frag = &skb_shinfo(skb)->frags[i];
+
+- WARN_ON_ONCE(skb_copy_bits(strp->anchor, offset,
++ WARN_ON_ONCE(skb_copy_bits(in_skb, offset,
+ skb_frag_address(frag),
+ skb_frag_size(frag)));
+ offset += skb_frag_size(frag);
+ }
+
+- skb->len = strp->stm.full_len;
+- skb->data_len = strp->stm.full_len;
+- skb_copy_header(skb, strp->anchor);
++ skb->len = len;
++ skb->data_len = len;
++ skb_copy_header(skb, in_skb);
++ return skb;
++}
++
++/* Create a new skb with the contents of input copied to its page frags */
++static struct sk_buff *tls_strp_msg_make_copy(struct tls_strparser *strp)
++{
++ struct strp_msg *rxm;
++ struct sk_buff *skb;
++
++ skb = tls_strp_skb_copy(strp, strp->anchor, strp->stm.offset,
++ strp->stm.full_len);
++ if (!skb)
++ return NULL;
++
+ rxm = strp_msg(skb);
+ rxm->offset = 0;
+ return skb;
+--
+2.39.2
+
--- /dev/null
+From 44208b50d590560b00e95c93b4078c892f242444 Mon Sep 17 00:00:00 2001
+From: Sasha Levin <sashal@kernel.org>
+Date: Tue, 16 May 2023 18:50:39 -0700
+Subject: tls: rx: strp: fix determining record length in copy mode
+
+From: Jakub Kicinski <kuba@kernel.org>
+
+[ Upstream commit 8b0c0dc9fbbd01e58a573a41c38885f9e4c17696 ]
+
+We call tls_rx_msg_size(skb) before doing skb->len += chunk.
+So the tls_rx_msg_size() code will see the old skb->len, most
+likely leading to an over-read.
+
+Worst case, we will over-read an entire record; the next iteration
+will try to trim the skb but may end up turning the frag len negative
+or discarding the subsequent record (since we already told TCP
+we've read it during the previous read, but now we'll trim it out of
+the skb).
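+
+In sketch form, the copy path becomes "grow, measure, trim" (this mirrors
+the hunk below rather than adding new code):
+
+  skb->len      += chunk;                /* grow first ...               */
+  skb->data_len += chunk;
+  skb_frag_size_add(frag, chunk);
+
+  sz = tls_rx_msg_size(strp, skb);       /* ... so the parser sees the   */
+  if (sz < 0)                            /*     bytes we just added      */
+          return sz;
+
+  if (sz && sz < skb->len) {             /* over-read: give back the     */
+          int over = skb->len - sz;      /* tail beyond this record      */
+
+          skb->len      -= over;
+          skb->data_len -= over;
+          skb_frag_size_add(frag, -over);
+          chunk -= over;
+  }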
+
+Fixes: 84c61fe1a75b ("tls: rx: do not use the standard strparser")
+Tested-by: Shai Amiram <samiram@nvidia.com>
+Signed-off-by: Jakub Kicinski <kuba@kernel.org>
+Reviewed-by: Simon Horman <simon.horman@corigine.com>
+Signed-off-by: David S. Miller <davem@davemloft.net>
+Signed-off-by: Sasha Levin <sashal@kernel.org>
+---
+ net/tls/tls_strp.c | 21 +++++++++++++++------
+ 1 file changed, 15 insertions(+), 6 deletions(-)
+
+diff --git a/net/tls/tls_strp.c b/net/tls/tls_strp.c
+index 24016c865e004..9889df5ce0660 100644
+--- a/net/tls/tls_strp.c
++++ b/net/tls/tls_strp.c
+@@ -210,19 +210,28 @@ static int tls_strp_copyin(read_descriptor_t *desc, struct sk_buff *in_skb,
+ skb_frag_size(frag),
+ chunk));
+
+- sz = tls_rx_msg_size(strp, strp->anchor);
++ skb->len += chunk;
++ skb->data_len += chunk;
++ skb_frag_size_add(frag, chunk);
++
++ sz = tls_rx_msg_size(strp, skb);
+ if (sz < 0) {
+ desc->error = sz;
+ return 0;
+ }
+
+ /* We may have over-read, sz == 0 is guaranteed under-read */
+- if (sz > 0)
+- chunk = min_t(size_t, chunk, sz - skb->len);
++ if (unlikely(sz && sz < skb->len)) {
++ int over = skb->len - sz;
++
++ WARN_ON_ONCE(over > chunk);
++ skb->len -= over;
++ skb->data_len -= over;
++ skb_frag_size_add(frag, -over);
++
++ chunk -= over;
++ }
+
+- skb->len += chunk;
+- skb->data_len += chunk;
+- skb_frag_size_add(frag, chunk);
+ frag++;
+ len -= chunk;
+ offset += chunk;
+--
+2.39.2
+
--- /dev/null
+From 200b14455e14910ea81e13833331928828c347fc Mon Sep 17 00:00:00 2001
+From: Sasha Levin <sashal@kernel.org>
+Date: Tue, 16 May 2023 18:50:38 -0700
+Subject: tls: rx: strp: force mixed decrypted records into copy mode
+
+From: Jakub Kicinski <kuba@kernel.org>
+
+[ Upstream commit 14c4be92ebb3e36e392aa9dd8f314038a9f96f3c ]
+
+If a record is partially decrypted we'll have to CoW it anyway,
+so go into copy mode and allocate a writable skb right away.
+
+This will make the subsequent fix simpler because we won't have to
+teach tls_strp_msg_make_copy() how to copy skbs while preserving
+decrypt status.
+
+Tested-by: Shai Amiram <samiram@nvidia.com>
+Signed-off-by: Jakub Kicinski <kuba@kernel.org>
+Reviewed-by: Simon Horman <simon.horman@corigine.com>
+Signed-off-by: David S. Miller <davem@davemloft.net>
+Stable-dep-of: eca9bfafee3a ("tls: rx: strp: preserve decryption status of skbs when needed")
+Signed-off-by: Sasha Levin <sashal@kernel.org>
+---
+ include/linux/skbuff.h | 10 ++++++++++
+ net/tls/tls_strp.c | 16 +++++++++++-----
+ 2 files changed, 21 insertions(+), 5 deletions(-)
+
+diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
+index 20ca1613f2e3e..cc5ed2cf25f65 100644
+--- a/include/linux/skbuff.h
++++ b/include/linux/skbuff.h
+@@ -1567,6 +1567,16 @@ static inline void skb_copy_hash(struct sk_buff *to, const struct sk_buff *from)
+ to->l4_hash = from->l4_hash;
+ };
+
++static inline int skb_cmp_decrypted(const struct sk_buff *skb1,
++ const struct sk_buff *skb2)
++{
++#ifdef CONFIG_TLS_DEVICE
++ return skb2->decrypted - skb1->decrypted;
++#else
++ return 0;
++#endif
++}
++
+ static inline void skb_copy_decrypted(struct sk_buff *to,
+ const struct sk_buff *from)
+ {
+diff --git a/net/tls/tls_strp.c b/net/tls/tls_strp.c
+index 9889df5ce0660..e2e48217e7ac9 100644
+--- a/net/tls/tls_strp.c
++++ b/net/tls/tls_strp.c
+@@ -326,15 +326,19 @@ static int tls_strp_read_copy(struct tls_strparser *strp, bool qshort)
+ return 0;
+ }
+
+-static bool tls_strp_check_no_dup(struct tls_strparser *strp)
++static bool tls_strp_check_queue_ok(struct tls_strparser *strp)
+ {
+ unsigned int len = strp->stm.offset + strp->stm.full_len;
+- struct sk_buff *skb;
++ struct sk_buff *first, *skb;
+ u32 seq;
+
+- skb = skb_shinfo(strp->anchor)->frag_list;
+- seq = TCP_SKB_CB(skb)->seq;
++ first = skb_shinfo(strp->anchor)->frag_list;
++ skb = first;
++ seq = TCP_SKB_CB(first)->seq;
+
++ /* Make sure there's no duplicate data in the queue,
++ * and the decrypted status matches.
++ */
+ while (skb->len < len) {
+ seq += skb->len;
+ len -= skb->len;
+@@ -342,6 +346,8 @@ static bool tls_strp_check_no_dup(struct tls_strparser *strp)
+
+ if (TCP_SKB_CB(skb)->seq != seq)
+ return false;
++ if (skb_cmp_decrypted(first, skb))
++ return false;
+ }
+
+ return true;
+@@ -422,7 +428,7 @@ static int tls_strp_read_sock(struct tls_strparser *strp)
+ return tls_strp_read_copy(strp, true);
+ }
+
+- if (!tls_strp_check_no_dup(strp))
++ if (!tls_strp_check_queue_ok(strp))
+ return tls_strp_read_copy(strp, false);
+
+ strp->msg_ready = 1;
+--
+2.39.2
+
--- /dev/null
+From 6fdc036e01a2673b96c29fc5efa8805244fb2f66 Mon Sep 17 00:00:00 2001
+From: Sasha Levin <sashal@kernel.org>
+Date: Tue, 16 May 2023 18:50:41 -0700
+Subject: tls: rx: strp: preserve decryption status of skbs when needed
+
+From: Jakub Kicinski <kuba@kernel.org>
+
+[ Upstream commit eca9bfafee3a0487e59c59201ae14c7594ba940a ]
+
+When the receive buffer is small we try to copy out the data from
+TCP into an skb maintained by TLS to prevent the connection from
+stalling. Unfortunately, if a single record is made up of a mix
+of decrypted and non-decrypted skbs, combining them into a single
+skb leads to loss of decryption status, resulting in decryption
+errors or data corruption.
+
+Similarly, when trying to use the TCP receive queue directly, we need
+to make sure that all the skbs within the record have the same
+status. If we don't, the mixed status will be detected correctly,
+but we'll CoW the anchor, again collapsing it into a single paged
+skb without the decrypted status preserved. So the "fixup" code will
+not know which parts of the skb to re-encrypt.
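+
+The detection itself is cheap; in sketch form the strparser just compares
+each queued skb's ->decrypted bit against the first one, using the
+skb_cmp_decrypted() helper added earlier in this series (simplified from
+tls_strp_check_queue_ok() in the hunks below):
+
+  struct sk_buff *first, *skb;
+  bool mixed = false;
+
+  first = skb_shinfo(strp->anchor)->frag_list;
+  for (skb = first->next; skb; skb = skb->next)
+          mixed |= !!skb_cmp_decrypted(first, skb);
+  /* mixed => the record must go through copy mode, with each input skb
+   * copied separately so its decryption status survives */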
+
+Fixes: 84c61fe1a75b ("tls: rx: do not use the standard strparser")
+Tested-by: Shai Amiram <samiram@nvidia.com>
+Signed-off-by: Jakub Kicinski <kuba@kernel.org>
+Reviewed-by: Simon Horman <simon.horman@corigine.com>
+Signed-off-by: David S. Miller <davem@davemloft.net>
+Signed-off-by: Sasha Levin <sashal@kernel.org>
+---
+ include/net/tls.h | 1 +
+ net/tls/tls.h | 5 ++
+ net/tls/tls_device.c | 22 +++-----
+ net/tls/tls_strp.c | 117 ++++++++++++++++++++++++++++++++++++-------
+ 4 files changed, 114 insertions(+), 31 deletions(-)
+
+diff --git a/include/net/tls.h b/include/net/tls.h
+index 154949c7b0c88..c36bf4c50027e 100644
+--- a/include/net/tls.h
++++ b/include/net/tls.h
+@@ -124,6 +124,7 @@ struct tls_strparser {
+ u32 mark : 8;
+ u32 stopped : 1;
+ u32 copy_mode : 1;
++ u32 mixed_decrypted : 1;
+ u32 msg_ready : 1;
+
+ struct strp_msg stm;
+diff --git a/net/tls/tls.h b/net/tls/tls.h
+index 0e840a0c3437b..17737a65c643a 100644
+--- a/net/tls/tls.h
++++ b/net/tls/tls.h
+@@ -165,6 +165,11 @@ static inline bool tls_strp_msg_ready(struct tls_sw_context_rx *ctx)
+ return ctx->strp.msg_ready;
+ }
+
++static inline bool tls_strp_msg_mixed_decrypted(struct tls_sw_context_rx *ctx)
++{
++ return ctx->strp.mixed_decrypted;
++}
++
+ #ifdef CONFIG_TLS_DEVICE
+ int tls_device_init(void);
+ void tls_device_cleanup(void);
+diff --git a/net/tls/tls_device.c b/net/tls/tls_device.c
+index 3b87c7b04ac87..bf69c9d6d06c0 100644
+--- a/net/tls/tls_device.c
++++ b/net/tls/tls_device.c
+@@ -1007,20 +1007,14 @@ int tls_device_decrypted(struct sock *sk, struct tls_context *tls_ctx)
+ struct tls_sw_context_rx *sw_ctx = tls_sw_ctx_rx(tls_ctx);
+ struct sk_buff *skb = tls_strp_msg(sw_ctx);
+ struct strp_msg *rxm = strp_msg(skb);
+- int is_decrypted = skb->decrypted;
+- int is_encrypted = !is_decrypted;
+- struct sk_buff *skb_iter;
+- int left;
+-
+- left = rxm->full_len + rxm->offset - skb_pagelen(skb);
+- /* Check if all the data is decrypted already */
+- skb_iter = skb_shinfo(skb)->frag_list;
+- while (skb_iter && left > 0) {
+- is_decrypted &= skb_iter->decrypted;
+- is_encrypted &= !skb_iter->decrypted;
+-
+- left -= skb_iter->len;
+- skb_iter = skb_iter->next;
++ int is_decrypted, is_encrypted;
++
++ if (!tls_strp_msg_mixed_decrypted(sw_ctx)) {
++ is_decrypted = skb->decrypted;
++ is_encrypted = !is_decrypted;
++ } else {
++ is_decrypted = 0;
++ is_encrypted = 0;
+ }
+
+ trace_tls_device_decrypted(sk, tcp_sk(sk)->copied_seq - rxm->full_len,
+diff --git a/net/tls/tls_strp.c b/net/tls/tls_strp.c
+index 61fbf84baf9e0..da95abbb7ea32 100644
+--- a/net/tls/tls_strp.c
++++ b/net/tls/tls_strp.c
+@@ -29,7 +29,8 @@ static void tls_strp_anchor_free(struct tls_strparser *strp)
+ struct skb_shared_info *shinfo = skb_shinfo(strp->anchor);
+
+ DEBUG_NET_WARN_ON_ONCE(atomic_read(&shinfo->dataref) != 1);
+- shinfo->frag_list = NULL;
++ if (!strp->copy_mode)
++ shinfo->frag_list = NULL;
+ consume_skb(strp->anchor);
+ strp->anchor = NULL;
+ }
+@@ -195,22 +196,22 @@ static void tls_strp_flush_anchor_copy(struct tls_strparser *strp)
+ for (i = 0; i < shinfo->nr_frags; i++)
+ __skb_frag_unref(&shinfo->frags[i], false);
+ shinfo->nr_frags = 0;
++ if (strp->copy_mode) {
++ kfree_skb_list(shinfo->frag_list);
++ shinfo->frag_list = NULL;
++ }
+ strp->copy_mode = 0;
++ strp->mixed_decrypted = 0;
+ }
+
+-static int tls_strp_copyin(read_descriptor_t *desc, struct sk_buff *in_skb,
+- unsigned int offset, size_t in_len)
++static int tls_strp_copyin_frag(struct tls_strparser *strp, struct sk_buff *skb,
++ struct sk_buff *in_skb, unsigned int offset,
++ size_t in_len)
+ {
+- struct tls_strparser *strp = (struct tls_strparser *)desc->arg.data;
+- struct sk_buff *skb;
+- skb_frag_t *frag;
+ size_t len, chunk;
++ skb_frag_t *frag;
+ int sz;
+
+- if (strp->msg_ready)
+- return 0;
+-
+- skb = strp->anchor;
+ frag = &skb_shinfo(skb)->frags[skb->len / PAGE_SIZE];
+
+ len = in_len;
+@@ -228,10 +229,8 @@ static int tls_strp_copyin(read_descriptor_t *desc, struct sk_buff *in_skb,
+ skb_frag_size_add(frag, chunk);
+
+ sz = tls_rx_msg_size(strp, skb);
+- if (sz < 0) {
+- desc->error = sz;
+- return 0;
+- }
++ if (sz < 0)
++ return sz;
+
+ /* We may have over-read, sz == 0 is guaranteed under-read */
+ if (unlikely(sz && sz < skb->len)) {
+@@ -271,15 +270,99 @@ static int tls_strp_copyin(read_descriptor_t *desc, struct sk_buff *in_skb,
+ offset += chunk;
+ }
+
+- if (strp->stm.full_len == skb->len) {
++read_done:
++ return in_len - len;
++}
++
++static int tls_strp_copyin_skb(struct tls_strparser *strp, struct sk_buff *skb,
++ struct sk_buff *in_skb, unsigned int offset,
++ size_t in_len)
++{
++ struct sk_buff *nskb, *first, *last;
++ struct skb_shared_info *shinfo;
++ size_t chunk;
++ int sz;
++
++ if (strp->stm.full_len)
++ chunk = strp->stm.full_len - skb->len;
++ else
++ chunk = TLS_MAX_PAYLOAD_SIZE + PAGE_SIZE;
++ chunk = min(chunk, in_len);
++
++ nskb = tls_strp_skb_copy(strp, in_skb, offset, chunk);
++ if (!nskb)
++ return -ENOMEM;
++
++ shinfo = skb_shinfo(skb);
++ if (!shinfo->frag_list) {
++ shinfo->frag_list = nskb;
++ nskb->prev = nskb;
++ } else {
++ first = shinfo->frag_list;
++ last = first->prev;
++ last->next = nskb;
++ first->prev = nskb;
++ }
++
++ skb->len += chunk;
++ skb->data_len += chunk;
++
++ if (!strp->stm.full_len) {
++ sz = tls_rx_msg_size(strp, skb);
++ if (sz < 0)
++ return sz;
++
++ /* We may have over-read, sz == 0 is guaranteed under-read */
++ if (unlikely(sz && sz < skb->len)) {
++ int over = skb->len - sz;
++
++ WARN_ON_ONCE(over > chunk);
++ skb->len -= over;
++ skb->data_len -= over;
++ __pskb_trim(nskb, nskb->len - over);
++
++ chunk -= over;
++ }
++
++ strp->stm.full_len = sz;
++ }
++
++ return chunk;
++}
++
++static int tls_strp_copyin(read_descriptor_t *desc, struct sk_buff *in_skb,
++ unsigned int offset, size_t in_len)
++{
++ struct tls_strparser *strp = (struct tls_strparser *)desc->arg.data;
++ struct sk_buff *skb;
++ int ret;
++
++ if (strp->msg_ready)
++ return 0;
++
++ skb = strp->anchor;
++ if (!skb->len)
++ skb_copy_decrypted(skb, in_skb);
++ else
++ strp->mixed_decrypted |= !!skb_cmp_decrypted(skb, in_skb);
++
++ if (IS_ENABLED(CONFIG_TLS_DEVICE) && strp->mixed_decrypted)
++ ret = tls_strp_copyin_skb(strp, skb, in_skb, offset, in_len);
++ else
++ ret = tls_strp_copyin_frag(strp, skb, in_skb, offset, in_len);
++ if (ret < 0) {
++ desc->error = ret;
++ ret = 0;
++ }
++
++ if (strp->stm.full_len && strp->stm.full_len == skb->len) {
+ desc->count = 0;
+
+ strp->msg_ready = 1;
+ tls_rx_msg_ready(strp);
+ }
+
+-read_done:
+- return in_len - len;
++ return ret;
+ }
+
+ static int tls_strp_read_copyin(struct tls_strparser *strp)
+--
+2.39.2
+
--- /dev/null
+From fad0e22d1b05be4fa3448678d8728fbfe3939065 Mon Sep 17 00:00:00 2001
+From: Sasha Levin <sashal@kernel.org>
+Date: Tue, 16 May 2023 18:50:37 -0700
+Subject: tls: rx: strp: set the skb->len of detached / CoW'ed skbs
+
+From: Jakub Kicinski <kuba@kernel.org>
+
+[ Upstream commit 210620ae44a83f25220450bbfcc22e6fe986b25f ]
+
+alloc_skb_with_frags() fills in the page frag sizes but does not
+set skb->len and skb->data_len. Set those correctly, otherwise
+device offload will most likely generate an empty skb and
+hit the BUG() at the end of __skb_nsg().
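+
+In other words, any caller of alloc_skb_with_frags() that fills the frags
+by hand has to account for the bytes itself; a minimal sketch, mirroring
+tls_strp_msg_make_copy() in the hunk below:
+
+  skb = alloc_skb_with_frags(0, len, TLS_PAGE_ORDER, &err,
+                             sk->sk_allocation);
+  if (!skb)
+          return NULL;
+  /* frag sizes are filled in, but the totals are not: */
+  skb->len      = len;
+  skb->data_len = len;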
+
+Fixes: 84c61fe1a75b ("tls: rx: do not use the standard strparser")
+Tested-by: Shai Amiram <samiram@nvidia.com>
+Signed-off-by: Jakub Kicinski <kuba@kernel.org>
+Reviewed-by: Simon Horman <simon.horman@corigine.com>
+Signed-off-by: David S. Miller <davem@davemloft.net>
+Signed-off-by: Sasha Levin <sashal@kernel.org>
+---
+ net/tls/tls_strp.c | 2 ++
+ 1 file changed, 2 insertions(+)
+
+diff --git a/net/tls/tls_strp.c b/net/tls/tls_strp.c
+index 955ac3e0bf4d3..24016c865e004 100644
+--- a/net/tls/tls_strp.c
++++ b/net/tls/tls_strp.c
+@@ -56,6 +56,8 @@ static struct sk_buff *tls_strp_msg_make_copy(struct tls_strparser *strp)
+ offset += skb_frag_size(frag);
+ }
+
++ skb->len = strp->stm.full_len;
++ skb->data_len = strp->stm.full_len;
+ skb_copy_header(skb, strp->anchor);
+ rxm = strp_msg(skb);
+ rxm->offset = 0;
+--
+2.39.2
+
--- /dev/null
+From ed771c5a237db5efc48a1c50f3f430d419080509 Mon Sep 17 00:00:00 2001
+From: Sasha Levin <sashal@kernel.org>
+Date: Fri, 19 May 2023 14:58:43 +0800
+Subject: vfio/type1: check pfn valid before converting to struct page
+
+From: Yan Zhao <yan.y.zhao@intel.com>
+
+[ Upstream commit 4752354af71043e6fd72ef5490ed6da39e6cab4a ]
+
+Check that the physical PFN is valid before converting the PFN to a struct
+page pointer to be returned to the caller of vfio_pin_pages().
+
+vfio_pin_pages() pins user pages with contiguous IOVAs.
+If the IOVA of a user page to be pinned belongs to a vma with vm_flags
+VM_PFNMAP, pin_user_pages_remote() will return -EFAULT without returning
+a struct page address for this PFN. This is because usually this kind of
+PFN (e.g. an MMIO PFN) has no valid struct page address associated with it.
+Upon this error, vaddr_get_pfns() will obtain the physical PFN directly.
+
+While previously vfio_pin_pages() returned PFN arrays directly to the
+caller, after commit
+34a255e67615 ("vfio: Replace phys_pfn with pages for vfio_pin_pages()"),
+PFNs are converted to "struct page *" unconditionally, and therefore
+the returned "struct page *" array may contain invalid struct page
+addresses.
+
+Given that current in-tree users of vfio_pin_pages() only expect "struct
+page *" to be returned, check PFN validity and return -EINVAL to make the
+caller aware that the IOVAs to be pinned contain PFNs that cannot be
+returned in the "struct page *" array. That way the caller will not
+consume the returned pointer (e.g. test PageReserved()) and will avoid
+errors like "supervisor read access in kernel mode".
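+
+The shape of the fix, in sketch form (simplified from the hunk below; the
+pfn_to_page() conversion happens later in the same pinning loop): reject
+the IOVA early when its PFN cannot be represented as a struct page.
+
+  if (!pfn_valid(phys_pfn)) {            /* e.g. an MMIO PFN from a      */
+          ret = -EINVAL;                 /* VM_PFNMAP vma                */
+          goto pin_unwind;
+  }
+  pages[i] = pfn_to_page(phys_pfn);      /* now guaranteed to be valid   */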
+
+Fixes: 34a255e67615 ("vfio: Replace phys_pfn with pages for vfio_pin_pages()")
+Cc: Sean Christopherson <seanjc@google.com>
+Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
+Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
+Reviewed-by: Sean Christopherson <seanjc@google.com>
+Link: https://lore.kernel.org/r/20230519065843.10653-1-yan.y.zhao@intel.com
+Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
+Signed-off-by: Sasha Levin <sashal@kernel.org>
+---
+ drivers/vfio/vfio_iommu_type1.c | 5 +++++
+ 1 file changed, 5 insertions(+)
+
+diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
+index 7fa68dc4e938a..009ba186652ac 100644
+--- a/drivers/vfio/vfio_iommu_type1.c
++++ b/drivers/vfio/vfio_iommu_type1.c
+@@ -936,6 +936,11 @@ static int vfio_iommu_type1_pin_pages(void *iommu_data,
+ if (ret)
+ goto pin_unwind;
+
++ if (!pfn_valid(phys_pfn)) {
++ ret = -EINVAL;
++ goto pin_unwind;
++ }
++
+ ret = vfio_add_to_pfn_list(dma, iova, phys_pfn);
+ if (ret) {
+ if (put_pfn(phys_pfn, dma->prot) && do_accounting)
+--
+2.39.2
+