From: Peter Manev
Date: Thu, 3 Oct 2019 09:14:58 +0000 (+0200)
Subject: doc: Update high performance config doc
X-Git-Tag: suricata-5.0.0~122
X-Git-Url: http://git.ipfire.org/cgi-bin/gitweb.cgi?a=commitdiff_plain;h=6df10019575f5c62adac3799593e28b49d718e4b;p=thirdparty%2Fsuricata.git

doc: Update high performance config doc
---

diff --git a/doc/userguide/performance/high-performance-config.rst b/doc/userguide/performance/high-performance-config.rst
index 19c5e8f621..8a51eee3e8 100644
--- a/doc/userguide/performance/high-performance-config.rst
+++ b/doc/userguide/performance/high-performance-config.rst
@@ -1,19 +1,382 @@
 High Performance Configuration
 ==============================
 
-If you have enough RAM, consider the following options in suricata.yaml to off-load as much work from the CPU's as possible:
+NIC
+---
+
+One of the major dependencies for Suricata's performance is the Network
+Interface Card. There are many vendors and possibilities. Some NICs have, and
+require, their own specific instructions and tools for setting up the NIC,
+which ensures the greatest benefit when running Suricata. Vendors like
+Napatech, Netronome, Accolade and Myricom include those tools and
+documentation as part of their sources.
+
+For Intel, Mellanox and commodity NICs the suggestions below can be utilized.
+
+It is recommended that the latest available stable NIC drivers are used. In
+general, when changing NIC settings it is advisable to use the latest
+``ethtool`` version. Some NICs ship with their own ``ethtool`` which is
+recommended to be used instead. Here is an example of how to set up a more
+recent ``ethtool`` version if needed:
+
+::
+
+  wget https://mirrors.edge.kernel.org/pub/software/network/ethtool/ethtool-5.2.tar.xz
+  tar -xf ethtool-5.2.tar.xz
+  cd ethtool-5.2
+  ./configure && make clean && make && make install
+  /usr/local/sbin/ethtool --version
+
+When doing high performance optimisation, make sure ``irqbalance`` is off and
+not running:
+
+::
+
+  service irqbalance stop
+
+Depending on the NIC's available queues (for example Intel's x710/i40 has 64
+available per port/interface) the worker threads can be set up accordingly.
+Usually the available queues can be seen by running:
+
+::
+
+  /usr/local/sbin/ethtool -l eth1
+
+Some NICs - generally lower end 1Gbps - do not support symmetric hashing, see
+:doc:`packet-capture`. On those systems, due to considerations for out of
+order packets, the following setup with af-packet is suggested (the example
+below uses ``eth1``):
+
+::
+
+  /usr/local/sbin/ethtool -L eth1 combined 1
+
+Then set up af-packet with the desired number of worker threads
+``threads: auto`` (``auto`` by default will use the number of CPUs available)
+and ``cluster-type: cluster_flow`` (also the default setting).
+
+For higher end systems/NICs a better and more performant solution could be to
+utilize the NIC itself a bit more. x710/i40 and similar Intel NICs or Mellanox
+MT27800 Family [ConnectX-5], for example, can easily be set up to do a bigger
+chunk of the work by using more RSS queues and symmetric hashing, which allows
+for increased performance on the Suricata side when using af-packet with
+``cluster-type: cluster_qm`` mode. In that mode all packets linked by the
+network card to an RSS queue are sent to the same socket.
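+
+As a quick sanity check, the queue layout and the actual per-queue packet
+spread can be inspected before and after tuning. This is only a minimal,
+hedged sketch: the interface name ``eth1`` and the ``/usr/local/sbin``
+``ethtool`` path follow the examples above, and the exact per-queue counter
+names differ between drivers:
+
+::
+
+  # maximum and currently configured channel/queue counts
+  /usr/local/sbin/ethtool -l eth1
+  # per-queue receive counters - counter names vary per driver
+  /usr/local/sbin/ethtool -S eth1 | grep -iE 'rx.*packets'
+
+If one or two queues receive most of the packets while the rest stay idle, the
+hashing/load balancing set up should be revisited before adding more worker
+threads.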
+
+Below is an example of a suggested configuration, based on a 16 core, single
+CPU/NUMA node socket system using an x710:
+
+::
+
+  rmmod i40e && modprobe i40e
+  ifconfig eth1 down
+  /usr/local/sbin/ethtool -L eth1 combined 16
+  /usr/local/sbin/ethtool -K eth1 rxhash on
+  /usr/local/sbin/ethtool -K eth1 ntuple on
+  ifconfig eth1 up
+  /usr/local/sbin/ethtool -X eth1 hkey 6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A equal 16
+  /usr/local/sbin/ethtool -A eth1 rx off
+  /usr/local/sbin/ethtool -C eth1 adaptive-rx off adaptive-tx off rx-usecs 125
+  /usr/local/sbin/ethtool -G eth1 rx 1024
+
+The commands above can be reviewed in detail in the ``ethtool`` help or man
+pages. In brief, the sequence makes sure the NIC is reset, the number of RSS
+queues is set to 16, load balancing is enabled for the NIC, a low entropy
+Toeplitz key is inserted to allow for symmetric hashing, receive offloading is
+disabled, the adaptive control is disabled for the lowest possible latency
+and, last but not least, the rx ring descriptor size is set to 1024.
+
+Make sure the RSS hash function is Toeplitz:
+
+::
+
+  /usr/local/sbin/ethtool -X eth1 hfunc toeplitz
+
+Let the NIC balance as much as possible:
+
+::
+
+  for proto in tcp4 udp4 tcp6 udp6; do
+    /usr/local/sbin/ethtool -N eth1 rx-flow-hash $proto sdfn
+  done
+
+In some cases:
+
+::
+
+  /usr/local/sbin/ethtool -N eth1 rx-flow-hash $proto sd
+
+might be enough or even better, depending on the type of traffic. However, not
+all NICs allow it. The ``sd`` option tells the NIC's multi queue hashing
+algorithm (for the particular proto) to use only src IP and dst IP, while
+``sdfn`` uses the tuple of src IP, dst IP, src port and dst port for the
+hashing algorithm.
+
+In the af-packet section of suricata.yaml:
+
+::
+
+  af-packet:
+    - interface: eth1
+      threads: 16
+      cluster-id: 99
+      cluster-type: cluster_qm
+      ...
+      ...
+
+CPU affinity and NUMA
+---------------------
+
+Intel based systems
+~~~~~~~~~~~~~~~~~~~
+
+If the system has more than one NUMA node there are some more possibilities.
+In those cases it is generally recommended to use as many worker threads as
+there are CPU cores available - from the same NUMA node. The example below
+uses a 72 core machine where the sniffing NIC that Suricata uses is located on
+NUMA node 1. In such 2 socket configurations it is recommended to have
+Suricata and the sniffing NIC running and residing on the second NUMA node, as
+by default CPU 0 is widely used by many services in Linux. Where this is not
+possible, it is recommended that CPU 0 is never used (via the cpu affinity
+config section in suricata.yaml and the irq affinity script for the NIC).
+
+In the case below 36 worker threads are used, taken from NUMA node 1's CPUs,
+with the af-packet runmode and ``cluster-type: cluster_qm``.
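+
+To confirm on which NUMA node the sniffing NIC actually resides, the sysfs
+entry of the device can be queried. A minimal sketch, assuming the interface
+is ``eth1`` (a value of ``-1`` means the platform does not report NUMA
+locality for that device):
+
+::
+
+  # NUMA node the NIC is attached to
+  cat /sys/class/net/eth1/device/numa_node
+  # CPU ranges belonging to each NUMA node
+  lscpu | grep NUMA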
+
+If the CPU's NUMA set up is as follows:
+
+::
+
+  lscpu
+  Architecture:          x86_64
+  CPU op-mode(s):        32-bit, 64-bit
+  Byte Order:            Little Endian
+  CPU(s):                72
+  On-line CPU(s) list:   0-71
+  Thread(s) per core:    2
+  Core(s) per socket:    18
+  Socket(s):             2
+  NUMA node(s):          2
+  Vendor ID:             GenuineIntel
+  CPU family:            6
+  Model:                 79
+  Model name:            Intel(R) Xeon(R) CPU E5-2697 v4 @ 2.30GHz
+  Stepping:              1
+  CPU MHz:               1199.724
+  CPU max MHz:           3600.0000
+  CPU min MHz:           1200.0000
+  BogoMIPS:              4589.92
+  Virtualization:        VT-x
+  L1d cache:             32K
+  L1i cache:             32K
+  L2 cache:              256K
+  L3 cache:              46080K
+  NUMA node0 CPU(s):     0-17,36-53
+  NUMA node1 CPU(s):     18-35,54-71
+
+it is recommended that 36 worker threads are used and the NIC is set up as
+follows:
+
+::
+
+  rmmod i40e && modprobe i40e
+  ifconfig eth1 down
+  /usr/local/sbin/ethtool -L eth1 combined 36
+  /usr/local/sbin/ethtool -K eth1 rxhash on
+  /usr/local/sbin/ethtool -K eth1 ntuple on
+  ifconfig eth1 up
+  ./set_irq_affinity local eth1
+  /usr/local/sbin/ethtool -X eth1 hkey 6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A equal 36
+  /usr/local/sbin/ethtool -A eth1 rx off tx off
+  /usr/local/sbin/ethtool -C eth1 adaptive-rx off adaptive-tx off rx-usecs 125
+  /usr/local/sbin/ethtool -G eth1 rx 1024
+  for proto in tcp4 udp4 tcp6 udp6; do
+    echo "/usr/local/sbin/ethtool -N eth1 rx-flow-hash $proto sdfn"
+    /usr/local/sbin/ethtool -N eth1 rx-flow-hash $proto sdfn
+  done
+
+In the example above the ``set_irq_affinity`` script is used from the NIC
+driver's sources.
+
+In the cpu affinity section of suricata.yaml config:
+
+::
+
+  # Suricata is multi-threaded. Here the threading can be influenced.
+  threading:
+    set-cpu-affinity: yes
+    cpu-affinity:
+      - management-cpu-set:
+          cpu: [ "1-10" ]  # include only these CPUs in affinity settings
+      - receive-cpu-set:
+          cpu: [ "0-10" ]  # include only these CPUs in affinity settings
+      - worker-cpu-set:
+          cpu: [ "18-35", "54-71" ]
+          mode: "exclusive"
+          prio:
+            low: [ 0 ]
+            medium: [ "1" ]
+            high: [ "18-35","54-71" ]
+          default: "high"
+
+In the af-packet section of suricata.yaml config:
+
+::
+
+    - interface: eth1
+      # Number of receive threads. "auto" uses the number of cores
+      threads: 18
+      cluster-id: 99
+      cluster-type: cluster_qm
+      defrag: no
+      use-mmap: yes
+      mmap-locked: yes
+      tpacket-v3: yes
+      ring-size: 100000
+      block-size: 1048576
+    - interface: eth1
+      # Number of receive threads. "auto" uses the number of cores
+      threads: 18
+      cluster-id: 99
+      cluster-type: cluster_qm
+      defrag: no
+      use-mmap: yes
+      mmap-locked: yes
+      tpacket-v3: yes
+      ring-size: 100000
+      block-size: 1048576
+
+That way 36 worker threads can be mapped (18 per af-packet interface section)
+in total to NUMA node 1's CPU range - 18-35,54-71. That part is done via the
+``worker-cpu-set`` affinity settings. ``ring-size`` and ``block-size`` in the
+config section above are decent default values to start with. Those can be
+further adjusted if needed, as explained in :doc:`tuning-considerations`.
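+
+Once Suricata is running, it can be double checked that the worker threads
+really ended up on the intended NUMA node 1 CPUs. A minimal, hedged sketch -
+the worker thread naming (``W#01-eth1`` and similar) may differ between
+Suricata versions:
+
+::
+
+  # list Suricata threads together with the CPU (PSR column) they run on
+  ps -T -p $(pidof suricata) -o spid,psr,comm | grep 'W#'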
+
+AMD based systems
+~~~~~~~~~~~~~~~~~
+
+Another example is an AMD based system, where the architecture and design of
+the system itself, plus the NUMA node interaction, is different as it is based
+on the HyperTransport (HT) technology. In that case per NUMA node thread/lock
+pinning would not be needed. The example below shows a suggestion for such a
+configuration utilising af-packet with ``cluster-type: cluster_flow``. The
+Mellanox NIC is located on NUMA node 0.
+
+The CPU set up is as follows:
+
+::
+
+  Architecture:          x86_64
+  CPU op-mode(s):        32-bit, 64-bit
+  Byte Order:            Little Endian
+  CPU(s):                128
+  On-line CPU(s) list:   0-127
+  Thread(s) per core:    2
+  Core(s) per socket:    32
+  Socket(s):             2
+  NUMA node(s):          8
+  Vendor ID:             AuthenticAMD
+  CPU family:            23
+  Model:                 1
+  Model name:            AMD EPYC 7601 32-Core Processor
+  Stepping:              2
+  CPU MHz:               1200.000
+  CPU max MHz:           2200.0000
+  CPU min MHz:           1200.0000
+  BogoMIPS:              4391.55
+  Virtualization:        AMD-V
+  L1d cache:             32K
+  L1i cache:             64K
+  L2 cache:              512K
+  L3 cache:              8192K
+  NUMA node0 CPU(s):     0-7,64-71
+  NUMA node1 CPU(s):     8-15,72-79
+  NUMA node2 CPU(s):     16-23,80-87
+  NUMA node3 CPU(s):     24-31,88-95
+  NUMA node4 CPU(s):     32-39,96-103
+  NUMA node5 CPU(s):     40-47,104-111
+  NUMA node6 CPU(s):     48-55,112-119
+  NUMA node7 CPU(s):     56-63,120-127
+
+The ``ethtool``, ``show_irq_affinity.sh`` and ``set_irq_affinity_cpulist.sh``
+tools are provided by the official driver sources.
+
+Set up the NIC, including offloading and load balancing:
+
+::
+
+  ifconfig eno6 down
+  /opt/mellanox/ethtool/sbin/ethtool -L eno6 combined 15
+  /opt/mellanox/ethtool/sbin/ethtool -K eno6 rxhash on
+  /opt/mellanox/ethtool/sbin/ethtool -K eno6 ntuple on
+  ifconfig eno6 up
+  /sbin/set_irq_affinity_cpulist.sh 1-7,64-71 eno6
+  /opt/mellanox/ethtool/sbin/ethtool -X eno6 hfunc toeplitz
+  /opt/mellanox/ethtool/sbin/ethtool -X eno6 hkey 6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A
+
+In the example above (1-7,64-71 for the irq affinity) CPU 0 is skipped, as it
+is usually used by default on Linux systems by many applications/tools.
+
+Let the NIC balance as much as possible:
+
+::
+
+  for proto in tcp4 udp4 tcp6 udp6; do
+    /opt/mellanox/ethtool/sbin/ethtool -N eno6 rx-flow-hash $proto sdfn
+  done
+
+In the cpu affinity section of suricata.yaml config:
+
+::
+
+  # Suricata is multi-threaded. Here the threading can be influenced.
+  threading:
+    set-cpu-affinity: yes
+    cpu-affinity:
+      - management-cpu-set:
+          cpu: [ "120-127" ]  # include only these CPUs in affinity settings
+      - receive-cpu-set:
+          cpu: [ 0 ]  # include only these CPUs in affinity settings
+      - worker-cpu-set:
+          cpu: [ "8-55" ]
+          mode: "exclusive"
+          prio:
+            high: [ "8-55" ]
+          default: "high"
+
+In the af-packet section of suricata.yaml config:
 
 ::
 
-  detect:
-    profile: custom
-    custom-values:
-      toclient-groups: 200
-      toserver-groups: 200
-    sgh-mpm-context: auto
-    inspection-recursion-limit: 3000
+    - interface: eno6
+      # Number of receive threads. "auto" uses the number of cores
+      threads: 48  # 48 worker threads on CPUs "8-55" above
+      cluster-id: 99
+      cluster-type: cluster_flow
+      defrag: no
+      use-mmap: yes
+      mmap-locked: yes
+      tpacket-v3: yes
+      ring-size: 100000
+      block-size: 1048576
+
+In the example above there are 15 RSS queues pinned to cores 1-7,64-71 on NUMA
+node 0 and 48 worker threads using CPUs on the other NUMA nodes. The reason
+why CPU 0 is skipped in this set up is that on Linux systems it is very common
+for CPU 0 to be used by default by many tools/services. The NIC itself in this
+config is positioned on NUMA node 0, so dedicating the 15 RSS queues to CPUs
+on that NUMA node and keeping those CPUs away from other tools in the system
+could offer the best advantage.
+
+.. note:: Performance and optimization of the whole system can be affected by
+   regular NIC driver and package/kernel upgrades, so it should be monitored
+   regularly and tested out in QA/test environments first. As a general
+   suggestion, it is always recommended to run the latest stable firmware and
+   drivers as instructed and provided by the particular NIC vendor.
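+
+Before rolling out such a tuned configuration, it can be validated with
+Suricata's configuration test mode, which parses the configuration and rules
+and then exits. A minimal sketch - the configuration path is an assumption and
+depends on the local installation:
+
+::
+
+  suricata -T -c /etc/suricata/suricata.yaml -v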
+
-Be advised, however, that this may require lots of RAM for even modestly sized rule sets. Also be aware that having additional CPU's available provides a greater performance boost than having more RAM available. That is, it would be better to spend money on CPU's instead of RAM when configuring a system.
+Other considerations
+~~~~~~~~~~~~~~~~~~~~
 
-It may also lead to significantly longer rule loading times.
+Another advanced option to consider is the ``isolcpus`` kernel boot parameter,
+which is a way of isolating CPU cores from the general system scheduler so
+that they are not used by general system processes. That ensures total
+dedication of those CPUs/ranges to the Suricata process only.
+
+``stream.wrong_thread`` / ``tcp.pkt_on_wrong_thread`` are counters available
+in ``stats.log`` or ``eve.json`` (as ``event_type: stats``) that indicate
+issues with the load balancing. The cause could also be related to the traffic
+or to the NIC's settings. In the case of very high or heavily increasing
+counter values it is recommended to experiment with a different load balancing
+method, either via the NIC or, for example, by using XDP/eBPF. There is an
+open issue, https://redmine.openinfosecfoundation.org/issues/2725, that serves
+as a placeholder for feedback and findings.
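+
+A minimal, hedged sketch for keeping an eye on those counters - the log
+locations are assumptions depending on the local setup, ``jq`` needs to be
+installed for the ``eve.json`` variant, and the exact field path inside the
+stats event may differ with the stats configuration:
+
+::
+
+  # stats.log variant
+  grep wrong_thread /var/log/suricata/stats.log | tail -5
+
+  # eve.json variant, looking at the tcp.pkt_on_wrong_thread counter
+  tail -1000 /var/log/suricata/eve.json | \
+    jq 'select(.event_type=="stats") | .stats.tcp.pkt_on_wrong_thread'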