Datasets
========
-Using the ``dataset`` and ``datarep`` keyword it is possible to match on
-large amounts of data against any sticky buffer.
+Using the ``dataset`` and ``datarep`` keywords, it is possible
+to match large amounts of data against any sticky buffer.
For example, to match against a DNS black list called ``dns-bl``::
dataset:<cmd>,<name>,<options>;
dataset:<set|unset|isset|isnotset>,<name> \
- [, type <string|md5|sha256|ipv4|ip>, save <file name>, load <file name>, state <file name>, memcap <size>, hashsize <size>];
+ [, type <string|md5|sha256|ipv4|ip>, save <file name>, load <file name>, state <file name>, memcap <size>, hashsize <size>
+ , format <csv|json>, enrichment_key <output_key>, value_key <json_key>, array_key <json_path>];
type <type>
the data type: string, md5, sha256, ipv4, ip
maximum memory limit for the respective dataset
hashsize <size>
allowed size of the hash for the respective dataset
+format <type>
+ the format of the file: csv, json. Defaults to csv. See
+ :ref:`dataset with json format <datasets_datajson>` for the json
+ option
+enrichment_key <key>
+ the key to use for the enrichment of the alert event
+ for json format
+value_key <key>
+ the key to use for the value of the alert
+ for json format
+array_key <key>
+ the key to use for the array of the alert
+ for json format
+
.. note:: 'type' is mandatory and needs to be set.
value is higher than 200.
+.. _datasets_datajson:
+
+dataset with json
+~~~~~~~~~~~~~~~~~
+
+DataJSON allows matching data against a set and outputting the data attached to the
+matching value in the event.
+
+Syntax::
+
+ dataset:<cmd>,<name>,<options>;
+
+ dataset:<isset|isnotset>,<name> \
+ [, type <string|md5|sha256|ipv4|ip>, load <file name>, format json, memcap <size>, hashsize <size>, enrichment_key <json_key> \
+ , value_key <json_key>, array_key <json_path>];
+
+Example rules could look like::
+
+ alert http any any -> any any (msg:"IP match"; ip.dst; dataset:isset,bad_ips, type ip, load bad_ips.json, format json, enrichment_key bad_ones, value_key ip; sid:8000001;)
+
+In this example, the match will occur if the destination IP is in the set, and the
+alert will have an ``alert.extra.bad_ones`` subobject that contains the JSON
+data associated with the value.
+
+If ``value_key`` is present, then the data file has to contain a valid JSON object containing an array
+in which every element contains a key equal to the ``value_key`` value.
+If ``array_key`` is present, Suricata will extract the corresponding subobject, which has to be
+a JSON array.
+
+See :ref:`Datajson format <datajson_data>` for more information.
+
Rule Reloads
------------
dataset-dump
+dataset-add-json
+~~~~~~~~~~~~~~~~
+
+Unix Socket command to add data to a set. On success, the addition becomes
+active instantly.
+
+Syntax::
+
+ dataset-add-json <set name> <set type> <data> <json_info>
+
+set name
+ Name of an already defined dataset
+type
+ Data type: string, md5, sha256, ipv4, ip
+data
+ Data to add in serialized form (base64 for string, hex notation for md5/sha256, string representation for ipv4/ip)
+json_info
+ JSON data to attach to the added value
+
+Example adding 'google.com' to set 'myset'::
+
+ dataset-add-json myset string Z29vZ2xlLmNvbQ== {"city":"Mountain View"}
+
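The serialization rules listed above can be sketched in Python. This is only an illustration of how the command line in the example might be assembled; the helper names ``serialize_value`` and ``build_add_json_command`` are hypothetical and not part of Suricata, and the compact JSON output is an assumption made to mirror the example.

```python
import base64
import json


def serialize_value(set_type: str, value: str) -> str:
    """Serialize a value following the rules listed for the ``data`` argument."""
    if set_type == "string":
        # Strings are passed base64-encoded.
        return base64.b64encode(value.encode("utf-8")).decode("ascii")
    if set_type in ("md5", "sha256", "ipv4", "ip"):
        # Hashes are given in hex notation and IPs in their string form,
        # so the caller's value is passed through unchanged.
        return value
    raise ValueError(f"unsupported set type: {set_type}")


def build_add_json_command(set_name: str, set_type: str, value: str, json_info: dict) -> str:
    """Assemble the text of a dataset-add-json command (hypothetical helper)."""
    # Compact separators reproduce the JSON layout used in the example above.
    info = json.dumps(json_info, separators=(",", ":"))
    return f"dataset-add-json {set_name} {set_type} {serialize_value(set_type, value)} {info}"


command = build_add_json_command("myset", "string", "google.com", {"city": "Mountain View"})
print(command)
```

The printed line should match the example above; how the command is delivered to the Unix socket is outside the scope of this sketch.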
+
File formats
------------
datarep
~~~~~~~
-The datarep format follows the dataset, expect that there are 1 more CSV
+The datarep format follows the dataset, except that there is 1 more CSV
field:
Syntax::
<data>,<value>
+.. _datajson_data:
+
+dataset with JSON enrichment
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+If ``format json`` is used in the parameters of a dataset keyword, then the loaded
+file has to contain a valid JSON object.
+
+If the ``value_key`` option is present, then the file has to contain a valid JSON
+object containing an array in which each element has a key equal to the ``value_key`` value.
+
+For example, if the file ``file.json`` looks like the following (typical of the response of a REST API call) ::
+
+ {
+ "time": "2024-12-21",
+ "response": {
+ "threats":
+ [
+ {"host": "toto.com", "origin": "japan"},
+ {"host": "grenouille.com", "origin": "french"}
+ ]
+ }
+ }
+
+then the match to check the list of threats using datajson can be defined as ::
+
+ http.host; dataset:isset,threats,load file.json, enrichment_key threat, value_key host, array_key response.threats;
+
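To make the roles of ``array_key`` and ``value_key`` concrete, here is a minimal Python sketch that walks the example file the way the options are described above. It is not Suricata's implementation: the function ``index_entries`` is hypothetical, and treating ``array_key`` as a dot-separated path is an assumption inferred from ``response.threats``.

```python
import json

# Same content as the example file.json above.
FILE_JSON = """
{
  "time": "2024-12-21",
  "response": {
    "threats": [
      {"host": "toto.com", "origin": "japan"},
      {"host": "grenouille.com", "origin": "french"}
    ]
  }
}
"""


def index_entries(document, value_key, array_key=None):
    """Map each matchable value to the JSON object it was found in."""
    node = document
    if array_key:
        # Assumption: array_key is a dot-separated path to the array.
        for part in array_key.split("."):
            node = node[part]
    if not isinstance(node, list):
        raise ValueError("the selected element has to be a JSON array")
    # value_key names the field holding the value to match on.
    return {entry[value_key]: entry for entry in node}


entries = index_entries(json.loads(FILE_JSON), value_key="host", array_key="response.threats")

# A lookup on a matching host returns the object that would be attached
# to the alert under the enrichment key ("threat" in the rule above).
print(entries.get("toto.com"))
```

In this sketch, a host that is not in the array simply yields ``None``, which corresponds to the rule not matching.
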
.. _datasets_file_locations:
File Locations
.. note:: The following characters must be escaped inside the content:
``;`` ``\`` ``"``
+PCRE extraction
+~~~~~~~~~~~~~~~
+
+It is possible to capture groups from the regular expression and log them into the
+alert events.
+
+There are 3 capabilities:
+
+* pkt: the extracted group is logged as pkt variable in ``metadata.pktvars``
+* alert: the extracted group is logged to the ``alert.extra`` subobject
+* flow: the extracted group is stored in a flow variable and ends up in ``metadata.flowvars``
+
+To use the feature, the parameters of the ``pcre`` keyword need to be extended.
+After the regular pcre regex and options, add a comma-separated list of variable names.
+Each name is prefixed with ``flow:``, ``pkt:`` or ``alert:`` and can contain special
+characters. The names map to the capturing substring expressions in order ::
+
+ pcre:"/([a-z]+)\/[a-z]+\/(.+)\/(.+)\/changelog$/GUR, \
+ flow:ua/ubuntu/repo,flow:ua/ubuntu/pkg/base, \
+ flow:ua/ubuntu/pkg/version";
+
+This would result in the alert event containing something like ::
+
+ "metadata": {
+ "flowvars": [
+ {"ua/ubuntu/repo": "fr"},
+ {"ua/ubuntu/pkg/base": "curl"},
+ {"ua/ubuntu/pkg/version": "2.2.1"}
+ ]
+ }
+
+The other events on the same flow such as the ``flow`` one will
+also have the flow vars.
+
+If this is not wanted, you can use the ``alert:`` construct so that the
+extracted values only appear in the alert ::
+
+ pcre:"/([a-z]+)\/[a-z]+\/(.+)\/(.+)\/changelog$/GUR, \
+ alert:ua/ubuntu/repo,alert:ua/ubuntu/pkg/base, \
+ alert:ua/ubuntu/pkg/version";
+
+With that syntax, the result of the extraction will appear like ::
+
+ "alert": {
+ "extra": {
+ "ua/ubuntu/repo": "fr",
+ "ua/ubuntu/pkg/base": "curl",
+ "ua/ubuntu/pkg/version": "2.2.1"
+ }
+ }
+
+The extraction scopes can be combined.
+
+It is also possible to extract a key/value pair in the ``pkt`` scope.
+One capture would be the key, the second the value. The notation is similar to the last ::
+
+ pcre:"/^([A-Z]+) (.*)\r\n/, pkt:key,pkt:value";
+
+``key`` and ``value`` are simply hardcoded names that trigger the key/value extraction.
+As a consequence, they can't be used as names for variables.
+
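The in-order mapping between capture groups and variable names can be illustrated outside Suricata with Python's ``re`` module. This sketch only mimics the pairing of names and groups; the input URI is a hypothetical value chosen to reproduce the values shown in the example events, and the Suricata-specific ``GUR`` modifiers are not modeled.

```python
import re

# Same expression as in the rule examples, without the Suricata modifiers.
PATTERN = re.compile(r"([a-z]+)/[a-z]+/(.+)/(.+)/changelog$")

# Variable names in the order they are listed after the regex.
NAMES = ["ua/ubuntu/repo", "ua/ubuntu/pkg/base", "ua/ubuntu/pkg/version"]


def extract(uri: str) -> dict:
    """Pair each capture group with the variable name at the same position."""
    match = PATTERN.search(uri)
    if match is None:
        return {}
    return dict(zip(NAMES, match.groups()))


# Hypothetical input chosen to match the values in the example output.
extracted = extract("/fr/ubuntu/curl/2.2.1/changelog")
print(extracted)
```

The first name receives the first capture group, the second name the second group, and so on, which is why the order of the names after the regex matters.
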
Suricata's modifiers
~~~~~~~~~~~~~~~~~~~~