To prevent dataset files from different sources from overwriting each
other, give each downloaded and extracted file a prefix derived from a
hash of its source URL. This ensures unique filenames across all
rulesets. This mostly matters for datasets: by the time datasets are
processed we are working with a merged set of filenames, unlike rules,
which are parsed much earlier while we still have a per-source list of
files.
Not the most elegant solution, but saves a rather large refactor.
Bug: #6833
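
For illustration only (not part of the change itself; the helper name
and example URLs below are hypothetical), a minimal sketch of the
hash-prefix scheme:

    import hashlib

    def unique_key(url, filename):
        # Hypothetical helper: prefix a filename with the MD5 hash of
        # its source URL so files with the same name from different
        # sources do not collide.
        prefix = hashlib.md5(url.encode()).hexdigest()
        return "{}/{}".format(prefix, filename)

    # Two sources that both contain a file named "example.lst" end up
    # with distinct keys, so neither clobbers the other.
    key_a = unique_key("https://example.com/ruleset-a.tar.gz", "example.lst")
    key_b = unique_key("https://example.com/ruleset-b.tar.gz", "example.lst")
    assert key_a != key_b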
- Don't base dataset filenames on the contents of the file, but
instead the filename path:
https://redmine.openinfosecfoundation.org/issues/6763
+- Give each file in a source a unique filename by prefixing the files
+ with a hash of the URL to prevent duplicate filenames from
+ clobbering each other, in particular dataset files:
+ https://redmine.openinfosecfoundation.org/issues/6833
## 1.3.0 - 2023-07-07
# Now download each URL.
files = []
for url in urls:
+
+    # To de-duplicate filenames, add a prefix that is a hash of the URL.
+    prefix = hashlib.md5(url[0].encode()).hexdigest()
    source_files = Fetch().run(url)
    for key in source_files:
-        files.append(SourceFile(key, source_files[key]))
+        content = source_files[key]
+        key = "{}/{}".format(prefix, key)
+        files.append(SourceFile(key, content))
# Now load local rules.
if config.get("local") is not None: