net/mlx5e: SHAMPO, Switch to header memcpy

Previously, the HW-GRO code used a separate page_pool for the header
buffer, and the pages of the header buffer were replenished via UMR.
This mechanism has several drawbacks:
- Reference counting on the page_pool page frags is not cheap.
- UMRs incur HW overhead both for updates and for access, especially
  for the KLM type that was previously used.
- The UMR code for headers is complex.
This patch switches to a static memory area (a static MTT MKEY) for
the header buffer and memcpys the headers instead. The copy happens
only once per GRO session. The SKB is allocated from the per-CPU NAPI
SKB cache.
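
As a rough illustration of the new receive flow (names like
build_hdr_skb(), hdr and hdr_len are hypothetical, not the driver's
actual identifiers; this is a sketch, not the real mlx5e code):

  #include <linux/skbuff.h>

  /* Sketch: build the SKB for a HW-GRO session by copying the packet
   * headers out of the statically mapped header buffer.
   */
  static struct sk_buff *build_hdr_skb(struct napi_struct *napi,
                                       const void *hdr,
                                       unsigned int hdr_len)
  {
          struct sk_buff *skb;

          /* Allocate from the per-CPU NAPI SKB cache. */
          skb = napi_alloc_skb(napi, hdr_len);
          if (unlikely(!skb))
                  return NULL;

          /* The single header copy for this GRO session. */
          skb_put_data(skb, hdr, hdr_len);

          /* Payload pages would then be attached as frags (e.g. via
           * skb_add_rx_frag()), so no page refcounting is needed for
           * the header area itself.
           */
          return skb;
  }
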
Performance numbers for x86:

+----------------------------------------------------------+
| Test                | Baseline    | Header Copy | Change |
|---------------------+-------------+-------------+--------|
| iperf3 oncpu        | 59.50 Gbps  | 64.00 Gbps  |    7 % |
| iperf3 offcpu       | 102.50 Gbps | 104.20 Gbps |    2 % |
| kperf oncpu         | 115.00 Gbps | 130.00 Gbps |   12 % |
| XDP_DROP (skb mode) | 3.90 Mpps   | 3.90 Mpps   |    0 % |
+----------------------------------------------------------+

Notes on tests:
- System: Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz
- oncpu: NAPI and application running on same CPU
- offcpu: NAPI and application running on different CPUs
- MTU: 1500
- iperf3 tests are single stream, 60s with IPv6 (for slightly larger
headers)
- kperf version from [1]

[1] git://git.kernel.dk/kperf.git

Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://patch.msgid.link/20260204200345.1724098-1-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>