doc: add performance analysis section

author Andreas Herz <andreas.herz@dcso.de>

Thu, 9 Apr 2020 13:23:40 +0000 (15:23 +0200)

committer Victor Julien <victor@inliniac.net>

Mon, 20 Apr 2020 13:29:25 +0000 (15:29 +0200)
author Andreas Herz <andreas.herz@dcso.de>
Thu, 9 Apr 2020 13:23:40 +0000 (15:23 +0200)
committer Victor Julien <victor@inliniac.net>
Mon, 20 Apr 2020 13:29:25 +0000 (15:29 +0200)
diff --git a/doc/userguide/performance/analysis.rst b/doc/userguide/performance/analysis.rst

new file mode 100644 (file)

index 0000000..529f1cb
--- /dev/null
+++ b/doc/userguide/performance/analysis.rst
@@ -0,0 +1,186 @@
+Performance Analysis
+=====================
+
+There are many possibilities that could be the reason for performance issues.
+In this section we will guide you through some options. The first part will
+cover basic steps and introduce some helpful tools. The second part will cover
+more in-depth explanations and corner cases.
+
+System Load
+-----------
+
+The first step should be to check the system load. Run a top tool like **htop**
+to get an overview of the system load and if there is a bottleneck with the
+traffic distribution. For example if you can see that only a small number of
+cpu cores hit 100% all the time and others don't, it could be related to a bad
+traffic distribution or elephant flows like in the screenshot where one core
+peaks due to one big elephant flow.
+
+.. image:: analysis/htopelephantflow.png
+
+If all cores are at peak load the system might be too slow for the traffic load
+or misconfigured. Also keep an eye on memory usage, if the actual memory usage
+is too high and the system needs to swap it will also result in low
+performance.
+
+The load will give you a first indication where to start with the debugging at
+specific parts we describe in more detail in the second part.
+
+Logfiles
+--------
+
+The next step would be to check all the log files with a focus on **stats.log**
+and **suricata.log** if any obvious issues are seen. The most obvious indicator
+is the **capture.kernel_drops** value that ideally would not even show up but
+should be below 1% of the **capture.kernel_packets** value as high drop rates
+could lead to a reduced amount of events and alerts.
+
+If **memcap** is seen in the stats the memcap values in the configuration could
+be increased. This can result to higher memory usage and should be taken into
+account when the settings are changed.
+
+Don't forget to check any system logs as well, even a **dmesg** run can show
+potential issues.
+
+Suricata Load
+-------------
+
+Besides the system load, another indicator for potential performance issues is
+the load of Suricata itself.  A helpful tool for that is **perf** which helps
+to spot performance issues. Make sure you have it installed and also the debug
+symbols installed for Suricata or the output won't be very helpful. This output
+is also helpful when you report performance issues as the Suricata Development
+team can narrow down possible issues with that.
+
+::
+
+    sudo perf top -p $(pidof suricata)
+
+If you see specific function calls at the top in red it's a hint that those are
+the bottlenecks. For example if you see **IPOnlyMatchPacket** it can be either
+a result of high drop rates or incomplete flows which result in decreased
+performance. To look into the performance issues on a specific thread you can
+pass **-t TID** to perf top. In other cases you can see functions that give you
+a hint that a specific protocol parser is used a lot and can either try to
+debug a performance bug or try to filter related traffic.
+
+.. image:: analysis/perftop.png
+
+In general try to play around with the different configuration options that
+Suricata does provide with a focus on the options described in
+:doc:`high-performance-config`.
+
+Traffic
+-------
+
+In most cases where the hardware is fast enough to handle the traffic but the
+drop rate is still high it's related to specific traffic issues.
+
+Basics
+^^^^^^
+
+Some of the basic checks are:
+
+- Check if the traffic is bidirectional, if it's mostly unidirectional you're
+  missing relevant parts of the flow (see **tshark** example at the bottom).
+  Another indicator could be a big discrepancy between SYN and SYN-ACK as well
+  as RST counter in the Suricata stats.
+
+- Check for encapsulated traffic, while GRE, MPLS etc. are supported they could
+  also lead to performance issues. Especially if there are several layers of
+  encapsulation.
+
+- Use tools like **iftop** to spot elephant flows. Flows that have a rate of
+  over 1Gbit/s for a long time can result in one cpu core peak at 100% all the
+  time and increasing the droprate while it doesn't make sense to dig deep into
+  this traffic.
+
+- Another approach to narrow down issues is the usage of **bpf filter**. For
+  example filter all HTTPS traffic with **not port 443** to exclude traffic
+  that might be problematic or just look into one specific port **port 25** if
+  you expect some issues with a specific protocol. See :doc:`ignoring-traffic`
+  for more details.
+
+- If VLAN is used it might help to disable **vlan.use-for-tracking** in
+  scenarios where only one direction of the flow has the VLAN tag.
+
+Advanced
+^^^^^^^^
+
+There are several advanced steps and corner cases when it comes to a deep dive
+into the traffic.
+
+If VLAN QinQ (IEEE 802.1ad) is used be very cautious if you use **cluster_qm**
+in combinatin with Intel drivers and AF_PACKET runmode. While the RFC expects
+ethertype 0x8100 and 0x88A8 in this case (see
+https://en.wikipedia.org/wiki/IEEE_802.1ad) most implementations only add
+0x8100 on each layer. If the first seen layer has the same VLAN tag but the
+inner one has different VLAN tags it will still end up in the same queue in
+**cluster_qm** mode. This was observed with the i40e driver up to 2.8.20 and
+the firmare version up to 7.00, feel free to report if newer versions have
+fixed this (see https://suricata-ids.org/support/).
+
+
+If you want to use **tshark** to get an overview of the traffic direction use
+this command:
+
+::
+
+    sudo tshark -i $INTERFACE -q -z conv,ip -a duration:10
+
+The output will show you all flows within 10s and if you see 0 for one
+direction you have unidirectional traffic, thus you don't see the ACK packets
+for example. Since Suricata is trying to work on flows this will have a rather
+big impact on the visibility. Focus on fixing the unidirectional traffic. If
+it's not possible at all you can enable **async-oneside** in the **stream**
+configuration setting.
+
+Check for other unusual or complex protocols that aren't supported very well.
+You can try to filter those to see if it has any impact on the performance.  In
+this example we filter Cisco Fabric Path (ethertype 0x8903) with the bpf filter
+**not ether proto 0x8903** as it's assumed to be a performance issue (see
+https://redmine.openinfosecfoundation.org/issues/3637)
+
+Elephant Flows
+^^^^^^^^^^^^^^
+
+The so called Elephant Flows or traffic spikes are quite difficult to deal
+with. In most cases those are big file transfers or backup traffic and it's not
+feasible to decode the whole traffic. From a network security monitoring
+perspective it's enough to log the metadata of that flow and do a packet
+inspection at the beginning but not the whole flow.
+
+If you can spot specific flows as described above then try to filter those. The
+easiest solution would be a bpf filter but that would still result in a
+performance impact. Ideally you can filter such traffic even sooner on driver
+or NIC level (see eBPF/XDP) or even before it reaches the system where Suricata
+is running. Some commercial packet broker support such filtering where it's
+called **Flow Shunting** or **Flow Slicing**.
+
+Rules
+-----
+
+The Ruleset plays an important role in the detection but also in the
+performance capability of Suricata. Thus it's recommended to look into the
+impact of enabled rules as well.
+
+If you run into performance issues and struggle to narrow it down start with
+running Suricata without any rules enabled and use the tools again that have
+been explained at the first part. Keep in mind that even without signatures
+enabled Suricata still does all the decoding and traffic analysis, so a fair
+amount of load should still be seen. If the load is still very high and drops
+are seen and the hardware should be capable to deal with such traffic loads you
+should deep dive if there is any specific traffic issue (see above) or report
+the performance issue so it can be investigated (see
+https://suricata-ids.org/support/).
+
+Suricata also provides several specific traffic related signatures in the rules
+folder that could be enabled for testing to spot specific traffic issues. Those
+are found the **rules** and you should start with **decoder-events.rules**,
+**stream-events.rules** and **app-layer-events.rules**.
+
+It can also be helpful to use :doc:`rule-profiling` and/or
+:doc:`packet-profiling` to find problematic rules or traffic pattern. This is
+achieved by compiling Suricata with **--enable-profiling** but keep in mind
+that this has an impact on performance and should only be used for
+troubleshooting.
diff --git a/doc/userguide/performance/analysis/htopelephantflow.png b/doc/userguide/performance/analysis/htopelephantflow.png

new file mode 100644 (file)

index 0000000..0976644

Binary files /dev/null and b/doc/userguide/performance/analysis/htopelephantflow.png differ
diff --git a/doc/userguide/performance/analysis/perftop.png b/doc/userguide/performance/analysis/perftop.png

new file mode 100644 (file)

index 0000000..303fc7a

Binary files /dev/null and b/doc/userguide/performance/analysis/perftop.png differ
diff --git a/doc/userguide/performance/index.rst b/doc/userguide/performance/index.rst

index d43fa658963e54a04ed41397176349db0804d5af..369fd749b4d80f26dc99420061c0f77c4645cea0 100644 (file)
--- a/doc/userguide/performance/index.rst
+++ b/doc/userguide/performance/index.rst
@@ -13,3 +13,4 @@ Performance
     packet-profiling
     rule-profiling
     tcmalloc
+   analysis
author	Andreas Herz <andreas.herz@dcso.de>
	Thu, 9 Apr 2020 13:23:40 +0000 (15:23 +0200)
committer	Victor Julien <victor@inliniac.net>
	Mon, 20 Apr 2020 13:29:25 +0000 (15:29 +0200)
doc/userguide/performance/analysis.rst	[new file with mode: 0644]	patch \| blob
doc/userguide/performance/analysis/htopelephantflow.png	[new file with mode: 0644]	patch \| blob
doc/userguide/performance/analysis/perftop.png	[new file with mode: 0644]	patch \| blob
doc/userguide/performance/index.rst		patch \| blob \| blame \| history