High Performance Configuration
==============================
-If you have enough RAM, consider the following options in suricata.yaml to off-load as much work from the CPU's as possible:
+NIC
+---
+
+One of the major dependencies for Suricata's performance is the Network
+Interface Card. There are many vendors and possibilities. Some NICs have, and
+require, their own specific instructions and tools on how to set up the NIC.
+This ensures the greatest benefit when running Suricata. Vendors like
+Napatech, Netronome, Accolade and Myricom include those tools and
+documentation as part of their sources.
+
+For Intel, Mellanox and commodity NICs the suggestions below can be utilized.
+
+It is recommended that the latest available stable NIC drivers are used. In
+general, when changing NIC settings, it is advisable to use the latest
+``ethtool`` version. Some NICs ship with their own ``ethtool`` that is
+recommended to be used. Here is an example of how to set up ethtool if needed:
+
+::
+
+ wget https://mirrors.edge.kernel.org/pub/software/network/ethtool/ethtool-5.2.tar.xz
+ tar -xf ethtool-5.2.tar.xz
+ cd ethtool-5.2
+ ./configure && make clean && make && make install
+ /usr/local/sbin/ethtool --version
+
+When doing high performance optimisation make sure ``irqbalance`` is off and
+not running:
+
+::
+
+ service irqbalance stop
+
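+On systemd based distributions, for example, the service can also be stopped
+and prevented from starting at boot:
+
+::
+
+ systemctl stop irqbalance
+ systemctl disable irqbalance
+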
+Depending on the NIC's available queues (for example Intel's x710/i40e has 64
+available per port/interface) the worker threads can be set up accordingly.
+Usually the available queues can be seen by running:
+
+::
+
+ /usr/local/sbin/ethtool -l eth1
+
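+The output will be similar to the illustrative example below (taken from an
+i40e based NIC; the exact values depend on the NIC, driver and firmware):
+
+::
+
+ Channel parameters for eth1:
+ Pre-set maximums:
+ RX:              0
+ TX:              0
+ Other:           1
+ Combined:        64
+ Current hardware settings:
+ RX:              0
+ TX:              0
+ Other:           1
+ Combined:        64
+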
+Some NICs - generally lower end 1Gbps - do not support symmetric hashing (see
+:doc:`packet-capture`). On those systems, due to considerations for out of
+order packets, the following setup with af-packet is suggested (the example
+below uses ``eth1``):
+
+::
+
+ /usr/local/sbin/ethtool -L eth1 combined 1
+
+Then set up af-packet with the desired number of worker threads ``threads: auto``
+(auto by default will use the number of CPUs available) and
+``cluster-type: cluster_flow`` (also the default setting).
+
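+For example, a minimal sketch of that af-packet section in suricata.yaml
+(``eth1`` and the cluster-id value are placeholders):
+
+::
+
+ af-packet:
+   - interface: eth1
+     threads: auto
+     cluster-id: 99
+     cluster-type: cluster_flow
+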
+For higher end systems/NICs a better and more performant solution could be to
+utilize the NIC itself a bit more. x710/i40e and similar Intel NICs or
+Mellanox MT27800 Family [ConnectX-5], for example, can easily be set up to do
+a bigger chunk of the work by using more RSS queues and symmetric hashing.
+This allows for increased performance on the Suricata side when using
+af-packet with ``cluster-type: cluster_qm`` mode. In that mode all packets
+linked by the network card to an RSS queue are sent to the same af-packet
+socket. Below is an example of a suggested config set up based on a 16 core,
+one CPU/NUMA node socket system using an x710:
+
+::
+
+ rmmod i40e && modprobe i40e
+ ifconfig eth1 down
+ /usr/local/sbin/ethtool -L eth1 combined 16
+ /usr/local/sbin/ethtool -K eth1 rxhash on
+ /usr/local/sbin/ethtool -K eth1 ntuple on
+ ifconfig eth1 up
+ /usr/local/sbin/ethtool -X eth1 hkey 6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A equal 16
+ /usr/local/sbin/ethtool -A eth1 rx off
+ /usr/local/sbin/ethtool -C eth1 adaptive-rx off adaptive-tx off rx-usecs 125
+ /usr/local/sbin/ethtool -G eth1 rx 1024
+
+The commands above can be reviewed in detail in the help or manpages of
+``ethtool``. In brief the sequence makes sure the NIC is reset, the number of
+RSS queues is set to 16, load balancing is enabled for the NIC, a low entropy
+Toeplitz key is inserted to allow for symmetric hashing, receive offloading is
+disabled, the adaptive control is disabled for lowest possible latency and,
+last but not least, the rx ring descriptor size is set to 1024.
+Make sure the RSS hash function is Toeplitz:
+
+::
+
+ /usr/local/sbin/ethtool -X eth1 hfunc toeplitz
+
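+The result can be double checked by printing the RSS indirection table and
+hash key; newer ``ethtool`` versions also report the hash function in use:
+
+::
+
+ /usr/local/sbin/ethtool -x eth1
+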
+Let the NIC balance as much as possible:
+
+::
+
+ for proto in tcp4 udp4 tcp6 udp6; do
+ /usr/local/sbin/ethtool -N eth1 rx-flow-hash $proto sdfn
+ done
+
+In some cases:
+
+::
+
+ /usr/local/sbin/ethtool -N eth1 rx-flow-hash $proto sd
+
+might be enough or even better, depending on the type of traffic. However not
+all NICs allow it. The ``sd`` specifies that the multi queue hashing algorithm
+of the NIC (for the particular proto) uses src IP and dst IP only. The
+``sdfn`` allows the tuple src IP, dst IP, src port, dst port to be used for
+the hashing algorithm.
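+
+The currently configured hash fields for a protocol can be verified as shown
+below (``tcp4`` is used as an example):
+
+::
+
+ /usr/local/sbin/ethtool -n eth1 rx-flow-hash tcp4
+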
+In the af-packet section of suricata.yaml:
+
+::
+
+ af-packet:
+   - interface: eth1
+     threads: 16
+     cluster-id: 99
+     cluster-type: cluster_qm
+     ...
+     ...
+
+CPU affinity and NUMA
+---------------------
+
+Intel based systems
+~~~~~~~~~~~~~~~~~~~
+
+If the system has more than one NUMA node there are some more possibilities.
+In those cases it is generally recommended to use as many worker threads as
+there are CPU cores available/possible - from the same NUMA node. The example
+below uses a 72 core machine where the sniffing NIC that Suricata uses is
+located on NUMA node 1. In such 2 socket configurations it is recommended to
+have Suricata and the sniffing NIC running and residing on the second NUMA
+node, as by default CPU 0 is widely used by many services in Linux. In a case
+where this is not possible it is recommended that (via the cpu affinity config
+section in suricata.yaml and the irq affinity script for the NIC) CPU 0 is
+never used.
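+
+The NUMA node a NIC resides on can usually be checked via sysfs, for example
+(``eth1`` as the example interface; a value of ``-1`` means the platform does
+not report it):
+
+::
+
+ cat /sys/class/net/eth1/device/numa_node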
+
+In the case below 36 worker threads are used out of NUMA node 1's CPUs, with
+the af-packet runmode and ``cluster-type: cluster_qm``.
+
+If the CPU's NUMA set up is as follows:
+
+::
+
+ lscpu
+ Architecture: x86_64
+ CPU op-mode(s): 32-bit, 64-bit
+ Byte Order: Little Endian
+ CPU(s): 72
+ On-line CPU(s) list: 0-71
+ Thread(s) per core: 2
+ Core(s) per socket: 18
+ Socket(s): 2
+ NUMA node(s): 2
+ Vendor ID: GenuineIntel
+ CPU family: 6
+ Model: 79
+ Model name: Intel(R) Xeon(R) CPU E5-2697 v4 @ 2.30GHz
+ Stepping: 1
+ CPU MHz: 1199.724
+ CPU max MHz: 3600.0000
+ CPU min MHz: 1200.0000
+ BogoMIPS: 4589.92
+ Virtualization: VT-x
+ L1d cache: 32K
+ L1i cache: 32K
+ L2 cache: 256K
+ L3 cache: 46080K
+ NUMA node0 CPU(s): 0-17,36-53
+ NUMA node1 CPU(s): 18-35,54-71
+
+It is recommended that 36 worker threads are used and the NIC set up could be
+as follows:
+
+::
+
+ rmmod i40e && modprobe i40e
+ ifconfig eth1 down
+ /usr/local/sbin/ethtool -L eth1 combined 36
+ /usr/local/sbin/ethtool -K eth1 rxhash on
+ /usr/local/sbin/ethtool -K eth1 ntuple on
+ ifconfig eth1 up
+ ./set_irq_affinity local eth1
+ /usr/local/sbin/ethtool -X eth1 hkey 6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A equal 36
+ /usr/local/sbin/ethtool -A eth1 rx off tx off
+ /usr/local/sbin/ethtool -C eth1 adaptive-rx off adaptive-tx off rx-usecs 125
+ /usr/local/sbin/ethtool -G eth1 rx 1024
+ for proto in tcp4 udp4 tcp6 udp6; do
+ echo "/usr/local/sbin/ethtool -N eth1 rx-flow-hash $proto sdfn"
+ /usr/local/sbin/ethtool -N eth1 rx-flow-hash $proto sdfn
+ done
+
+In the example above the ``set_irq_affinity`` script is used from the NIC
+driver's sources.
+
+In the cpu affinity section of suricata.yaml config:
+
+::
+
+ # Suricata is multi-threaded. Here the threading can be influenced.
+ threading:
+   set-cpu-affinity: yes
+   cpu-affinity:
+     - management-cpu-set:
+         cpu: [ "1-10" ]  # include only these CPUs in affinity settings
+     - receive-cpu-set:
+         cpu: [ "0-10" ]  # include only these CPUs in affinity settings
+     - worker-cpu-set:
+         cpu: [ "18-35", "54-71" ]
+         mode: "exclusive"
+         prio:
+           low: [ 0 ]
+           medium: [ "1" ]
+           high: [ "18-35","54-71" ]
+           default: "high"
+
+In the af-packet section of suricata.yaml config:
+
+::
+
+ - interface: eth1
+   # Number of receive threads. "auto" uses the number of cores
+   threads: 18
+   cluster-id: 99
+   cluster-type: cluster_qm
+   defrag: no
+   use-mmap: yes
+   mmap-locked: yes
+   tpacket-v3: yes
+   ring-size: 100000
+   block-size: 1048576
+ - interface: eth1
+   # Number of receive threads. "auto" uses the number of cores
+   threads: 18
+   cluster-id: 99
+   cluster-type: cluster_qm
+   defrag: no
+   use-mmap: yes
+   mmap-locked: yes
+   tpacket-v3: yes
+   ring-size: 100000
+   block-size: 1048576
+
+That way 36 worker threads can be mapped (18 per af-packet interface slot), in
+total covering NUMA node 1's CPU range - 18-35,54-71. That mapping is done via
+the ``worker-cpu-set`` affinity settings. ``ring-size`` and ``block-size`` in
+the config section above are decent default values to start with. Those can be
+adjusted further if needed as explained in :doc:`tuning-considerations`.
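+
+Once Suricata is running, the resulting thread placement can be reviewed, for
+example by listing the threads and the CPU each one is currently assigned to
+(an illustrative command, not specific to Suricata):
+
+::
+
+ ps -T -o spid,psr,comm -p $(pidof suricata)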
+
+AMD based systems
+~~~~~~~~~~~~~~~~~
+
+Another example is an AMD based system where the architecture and design of
+the system itself, plus the NUMA node interaction, is different as it is based
+on the HyperTransport (HT) technology. In that case a per NUMA node
+thread/lock set up would not be needed. The example below shows a suggestion
+for such a configuration utilising af-packet with
+``cluster-type: cluster_flow``. The Mellanox NIC is located on NUMA node 0.
+
+The CPU set up is as follows:
+
+::
+
+ Architecture: x86_64
+ CPU op-mode(s): 32-bit, 64-bit
+ Byte Order: Little Endian
+ CPU(s): 128
+ On-line CPU(s) list: 0-127
+ Thread(s) per core: 2
+ Core(s) per socket: 32
+ Socket(s): 2
+ NUMA node(s): 8
+ Vendor ID: AuthenticAMD
+ CPU family: 23
+ Model: 1
+ Model name: AMD EPYC 7601 32-Core Processor
+ Stepping: 2
+ CPU MHz: 1200.000
+ CPU max MHz: 2200.0000
+ CPU min MHz: 1200.0000
+ BogoMIPS: 4391.55
+ Virtualization: AMD-V
+ L1d cache: 32K
+ L1i cache: 64K
+ L2 cache: 512K
+ L3 cache: 8192K
+ NUMA node0 CPU(s): 0-7,64-71
+ NUMA node1 CPU(s): 8-15,72-79
+ NUMA node2 CPU(s): 16-23,80-87
+ NUMA node3 CPU(s): 24-31,88-95
+ NUMA node4 CPU(s): 32-39,96-103
+ NUMA node5 CPU(s): 40-47,104-111
+ NUMA node6 CPU(s): 48-55,112-119
+ NUMA node7 CPU(s): 56-63,120-127
+
+The ``ethtool``, ``show_irq_affinity.sh`` and ``set_irq_affinity_cpulist.sh``
+tools are provided as part of the official driver sources.
+Set up the NIC, including offloading and load balancing:
+
+::
+
+ ifconfig eno6 down
+ /opt/mellanox/ethtool/sbin/ethtool -L eno6 combined 15
+ /opt/mellanox/ethtool/sbin/ethtool -K eno6 rxhash on
+ /opt/mellanox/ethtool/sbin/ethtool -K eno6 ntuple on
+ ifconfig eno6 up
+ /sbin/set_irq_affinity_cpulist.sh 1-7,64-71 eno6
+ /opt/mellanox/ethtool/sbin/ethtool -X eno6 hfunc toeplitz
+ /opt/mellanox/ethtool/sbin/ethtool -X eno6 hkey 6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A
+
+In the example above (1-7,64-71 for the irq affinity) CPU 0 is skipped as it is usually used by default on Linux systems by many applications/tools.
+Let the NIC balance as much as possible:
+
+::
+
+ for proto in tcp4 udp4 tcp6 udp6; do
+ /opt/mellanox/ethtool/sbin/ethtool -N eno6 rx-flow-hash $proto sdfn
+ done
+
+In the cpu affinity section of suricata.yaml config:
+
+::
+
+ # Suricata is multi-threaded. Here the threading can be influenced.
+ threading:
+   set-cpu-affinity: yes
+   cpu-affinity:
+     - management-cpu-set:
+         cpu: [ "120-127" ]  # include only these cpus in affinity settings
+     - receive-cpu-set:
+         cpu: [ 0 ]  # include only these cpus in affinity settings
+     - worker-cpu-set:
+         cpu: [ "8-55" ]
+         mode: "exclusive"
+         prio:
+           high: [ "8-55" ]
+           default: "high"
+
+In the af-packet section of suricata.yaml config:
+
::
- detect:
- profile: custom
- custom-values:
- toclient-groups: 200
- toserver-groups: 200
- sgh-mpm-context: auto
- inspection-recursion-limit: 3000
+ - interface: eno6
+   # Number of receive threads. "auto" uses the number of cores
+   threads: 48  # 48 worker threads on cpus "8-55" above
+   cluster-id: 99
+   cluster-type: cluster_flow
+   defrag: no
+   use-mmap: yes
+   mmap-locked: yes
+   tpacket-v3: yes
+   ring-size: 100000
+   block-size: 1048576
+
+
+In the example above there are 15 RSS queues pinned to cores 1-7,64-71 on NUMA
+node 0 and 48 worker threads using other CPUs on different NUMA nodes. The
+reason why CPU 0 is skipped in this set up is that on Linux systems it is very
+common for CPU 0 to be used by default by many tools/services. The NIC itself
+in this config is positioned on NUMA node 0, so starting with 15 RSS queues on
+that NUMA node and keeping those CPUs reserved from other tools in the system
+could offer the best advantage.
+
+.. note:: Performance and optimization of the whole system can be affected by regular NIC driver and package/kernel upgrades, so it should be monitored regularly and tested out in QA/test environments first. As a general suggestion it is always recommended to run the latest stable firmware and drivers as instructed and provided by the particular NIC vendor.
-Be advised, however, that this may require lots of RAM for even modestly sized rule sets. Also be aware that having additional CPU's available provides a greater performance boost than having more RAM available. That is, it would be better to spend money on CPU's instead of RAM when configuring a system.
+Other considerations
+~~~~~~~~~~~~~~~~~~~~
-It may also lead to significantly longer rule loading times.
+Another advanced option to consider is the ``isolcpus`` kernel boot parameter.
+It is a way of isolating CPU cores from the general system scheduler and
+processes. That ensures total dedication of those CPUs/ranges to the Suricata
+process only.
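+
+A sketch of how that could look for the AMD example above - the CPU range and
+the GRUB file location are assumptions that depend on the distribution, and
+the change requires regenerating the grub config and a reboot:
+
+::
+
+ # /etc/default/grub (excerpt), isolating the worker CPU range used above
+ GRUB_CMDLINE_LINUX_DEFAULT="... isolcpus=8-55"
+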
+``stream.wrong_thread`` / ``tcp.pkt_on_wrong_thread`` are counters available
+in ``stats.log`` or ``eve.json`` as ``event_type: stats`` that indicate issues
+with the load balancing. The cause could also be related to the traffic or to
+the NIC settings. In the case of very high or heavily increasing counter
+values it is recommended to experiment with a different load balancing method,
+either via the NIC or for example using XDP/eBPF. There is an open issue,
+https://redmine.openinfosecfoundation.org/issues/2725, that is a placeholder
+for feedback and findings.
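+
+A quick way to keep an eye on those counters could be the commands below; the
+``jq`` query assumes the default eve.json stats layout, and the exact counter
+paths may differ between Suricata versions:
+
+::
+
+ grep wrong_thread stats.log | tail -5
+ jq 'select(.event_type=="stats") | .stats.tcp.pkt_on_wrong_thread' eve.json | tail -5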