
-The main disadvantage is the amount of tokens which is multiplied by size of window. In rspamd, we use a window of 5 tokens that means that
+The main disadvantage is the number of tokens, which is multiplied by the size of the window. In rspamd, we use a window of 5 tokens, which means that
the number of tokens is about 5 times larger than the amount of words.
Statistical tokens are stored in statfiles which, in turn, are mapped to specific backends. This architecture is displayed in the following image:
metainformation in statistics. The following configuration demonstrates the recommended statistics configuration:
~~~ucl
-classifier {
- type = "bayes";
+# Classifier's algorithm is BAYES
+classifier "bayes" {
tokenizer {
name = "osb";
}
+
+ # Unique name used to learn the specific classifier
+ name = "common_bayes";
+
cache {
path = "${DBDIR}/learn_cache.sqlite";
}
+
+ # Minimum number of words required for statistics processing
min_tokens = 11;
+ # Minimum learn count for both spam and ham classes to perform classification
+ min_learns = 200;
+
backend = "sqlite3";
languages_enabled = true;
statfile {
of such a script that extracts domain names from recipients, thus organizing per-domain statistics:
~~~ucl
- classifier {
- tokenizer {
- name = "osb";
- }
- name = "bayes2";
- min_tokens = 11;
- backend = "sqlite3";
- per_language = true;
- per_user = <<EOD
+classifier "bayes" {
+ tokenizer {
+ name = "osb";
+ }
+
+ name = "bayes2";
+
+ min_tokens = 11;
+ min_learns = 200;
+
+ backend = "sqlite3";
+ per_language = true;
+ per_user = <<EOD
return function(task)
local rcpt = task:get_recipients(1)
return nil
end
EOD
- statfile {
- path = "/tmp/bayes2.spam.sqlite";
- symbol = "BAYES_SPAM2";
- }
- statfile {
- path = "/tmp/bayes2.ham.sqlite";
- symbol = "BAYES_HAM2";
- }
+ statfile {
+ path = "/tmp/bayes2.spam.sqlite";
+ symbol = "BAYES_SPAM2";
+ }
+ statfile {
+ path = "/tmp/bayes2.ham.sqlite";
+ symbol = "BAYES_HAM2";
}
+}
~~~
## Applying per-user and per-language statistics
Rspamd can learn and check multiple classifiers for a single message. This is useful, for example, if you have both common and per-user statistics. It is even possible to use the same statfiles for these purposes. Classifiers **might** have the same symbols (though this is not recommended), and they should have a **unique** `name` attribute that is used for learning. Here is an example of such a configuration:
~~~ucl
- classifier {
- tokenizer {
- name = "osb";
- }
- name = "bayes_user";
- min_tokens = 11;
- backend = "sqlite3";
- per_language = true;
- per_user = true;
- statfile {
- path = "/tmp/bayes.spam.sqlite";
- symbol = "BAYES_SPAM_USER";
- }
- statfile {
- path = "/tmp/bayes.ham.sqlite";
- symbol = "BAYES_HAM_USER";
- }
+classifier "bayes" {
+ tokenizer {
+ name = "osb";
}
- classifier {
- tokenizer {
- name = "osb";
- }
- name = "bayes";
- min_tokens = 11;
- backend = "sqlite3";
- per_language = true;
- statfile {
- path = "/tmp/bayes.spam.sqlite";
- symbol = "BAYES_SPAM";
- }
- statfile {
- path = "/tmp/bayes.ham.sqlite";
- symbol = "BAYES_HAM";
- }
+ name = "users";
+ min_tokens = 11;
+ min_learns = 200;
+ backend = "sqlite3";
+ per_language = true;
+ per_user = true;
+
+ statfile {
+ path = "/tmp/bayes.spam.sqlite";
+ symbol = "BAYES_SPAM_USER";
+ }
+ statfile {
+ path = "/tmp/bayes.ham.sqlite";
+ symbol = "BAYES_HAM_USER";
}
+}
+
+classifier "bayes" {
+ tokenizer {
+ name = "osb";
+ }
+
+ name = "common";
+ min_tokens = 11;
+ min_learns = 200;
+ backend = "sqlite3";
+ per_language = true;
+
+ statfile {
+ path = "/tmp/bayes.spam.sqlite";
+ symbol = "BAYES_SPAM";
+ }
+ statfile {
+ path = "/tmp/bayes.ham.sqlite";
+ symbol = "BAYES_HAM";
+ }
+}
~~~
To learn a specific classifier, use the `-c` option of `rspamc` (or the `Classifier` HTTP header):
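As a minimal sketch, assuming the classifier named `users` from the configuration above and a hypothetical message file `spam_sample.eml`, a learning call could be assembled like this:

```shell
# Name of the classifier to learn; must match its `name` attribute.
classifier="users"
# Hypothetical message file, used here only for illustration.
message="spam_sample.eml"

# Build the command first so it can be inspected before it is run.
cmd="rspamc -c $classifier learn_spam $message"
echo "$cmd"

# Run it only when rspamc is installed and the message file exists.
# (Unquoted $cmd relies on word splitting, so paths must not contain spaces.)
if command -v rspamc >/dev/null 2>&1 && [ -f "$message" ]; then
    $cmd
fi
```

Replacing `learn_spam` with `learn_ham`, or `users` with `common`, targets the other class or the other classifier.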
From version 1.1, it is also possible to use Redis as a backend for statistics and for the cache of learned messages. Redis is recommended for clustered configurations because it allows simultaneous learning and checking and is also very fast. To set up Redis, use the `redis` backend for a classifier (the cache is set to the same servers accordingly).
~~~ucl
- classifier {
+ classifier "bayes" {
tokenizer {
name = "osb";
}
+
name = "bayes";
min_tokens = 11;
+ min_learns = 200;
backend = "redis";
servers = "localhost:6379";
#write_servers = "localhost:6379"; # Optional: separate servers used for learning