Training the Bayesian databases
Bayesian scans analyze the words (or “tokens”) in the message header and message body of an email to determine the probability that it is spam. For every token, the FortiMail unit calculates the probability that the email is spam based on the percentage of times that the word has previously been associated with spam or non-spam email. If a Bayesian database has not yet been trained, the Bayesian scan does not yet know the spam or non-spam association of many tokens, and does not have enough information to determine the statistical likelihood of an email being spam. By training a Bayesian database to recognize words that are and are not likely to be associated with spam, Bayesian scans become increasingly accurate.
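As an illustration of the per-token calculation described above, the following is a minimal sketch of a generic Bayesian token table. It is not the FortiMail implementation: the class name TokenStats, the whitespace tokenizer, and the probability formula are all assumptions chosen for clarity.

```python
from collections import Counter


class TokenStats:
    """Hypothetical token table: counts how many spam and non-spam
    messages each token has appeared in."""

    def __init__(self):
        self.spam = Counter()
        self.non_spam = Counter()
        self.spam_messages = 0
        self.non_spam_messages = 0

    def train(self, text, is_spam):
        # Count each token once per message (assumed whitespace tokenizer).
        tokens = set(text.lower().split())
        if is_spam:
            self.spam.update(tokens)
            self.spam_messages += 1
        else:
            self.non_spam.update(tokens)
            self.non_spam_messages += 1

    def token_spam_probability(self, token):
        """Return the estimated spam probability for a token,
        or None if the token has never been seen in training."""
        s = self.spam[token]
        h = self.non_spam[token]
        if s + h == 0:
            return None  # untrained token: no association known
        spam_rate = s / self.spam_messages if self.spam_messages else 0.0
        non_spam_rate = h / self.non_spam_messages if self.non_spam_messages else 0.0
        return spam_rate / (spam_rate + non_spam_rate)


db = TokenStats()
db.train("cheap pills now", is_spam=True)
db.train("meeting notes now", is_spam=False)
```

In this sketch, a token that has never appeared in training yields no probability at all, which mirrors why an untrained database cannot estimate the likelihood that an email is spam.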
However, spammers are constantly trying to invent new ways to defeat antispam filters. In one technique commonly used in an attempt to avoid antispam filters, spammers alter words commonly identified as characteristic of spam, inserting symbols such as periods ( . ), or using nonstandard but human-readable spellings, such as substituting Â, Ç, Ë, or Í for A, C, E, or I. These altered words are technically different tokens to a Bayesian database, so mature Bayesian databases may require some ongoing training to recognize new spam tokens.
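The effect of altered spellings can be shown with a short sketch. The trained token table and the example word are hypothetical, and the lookup is a plain exact-match dictionary check, which is an assumption about how a token database might behave rather than a description of FortiMail internals.

```python
# Hypothetical trained table: token -> number of spam messages seen.
trained_spam_counts = {"free": 42}


def known_tokens(tokens, trained):
    """Map each token to whether the trained table already contains it."""
    return {token: token in trained for token in tokens}


# The original word, a variant with inserted periods,
# and a variant that substitutes an accented character for E.
variants = ["free", "f.r.e.e", "frËe"]
result = known_tokens(variants, trained_spam_counts)
```

Only the unaltered spelling matches the trained entry. The two variants are unknown tokens until the database is trained with messages that contain them.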
You generally will not want to enable Bayesian scans until you have performed initial training of your Bayesian databases, as using untrained Bayesian databases can increase your rate of spam false positives and false negatives. Unlike global and per-domain Bayesian databases, however, per-user databases cannot be used until they are mature — that is, until they have been sufficiently trained. A user Bayesian database is considered to be mature when it has been trained with a minimum of 100 spam messages and 200 non-spam messages. If you have enabled use of per-user Bayesian databases but a per-user database is not yet mature, the Bayesian scanner will instead use either the global or per-domain database, whichever is enabled for the protected domain. You can determine whether a per-user Bayesian database is mature by viewing the database training level summary. For more information, see “Backing up, batch training, and monitoring the Bayesian databases” on page 649.
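The maturity rule and the fallback behavior described above can be summarized in a minimal sketch. The thresholds come from this section; the function names, parameters, and return strings are hypothetical and exist only for illustration.

```python
# Thresholds stated in this section for per-user database maturity.
MIN_SPAM_MESSAGES = 100
MIN_NON_SPAM_MESSAGES = 200


def is_mature(spam_trained, non_spam_trained):
    """A per-user database is mature once both minimums are met."""
    return (spam_trained >= MIN_SPAM_MESSAGES
            and non_spam_trained >= MIN_NON_SPAM_MESSAGES)


def select_database(per_user_enabled, spam_trained, non_spam_trained,
                    domain_database):
    """Return which database a scan would use in this sketch.

    domain_database is "global" or "per-domain", whichever is
    enabled for the protected domain (hypothetical labels).
    """
    if per_user_enabled and is_mature(spam_trained, non_spam_trained):
        return "per-user"
    return domain_database
```

For example, select_database(True, 50, 300, "global") falls back to the global database because only 50 spam messages have been used for training, even though the non-spam minimum has been exceeded.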
To initially train the Bayesian databases
- Train the global database by uploading mailbox (.mbox) files. For details, see “Backing up, batch training, and monitoring the Bayesian databases” on page 649.
By uploading mailbox files, you can provide initial training more rapidly than through the Bayesian control email addresses. Training the global database ensures that outgoing antispam profiles in which you have enabled Bayesian scanning, and incoming antispam profiles for protected domains that you have configured to use the global database, can recognize spam.
You can leave the global database untrained if both of these conditions are true:
- no outgoing antispam profile has Bayesian scanning enabled
- no protected domain is configured to use the global Bayesian database
- Train the per-domain databases by uploading mailbox (.mbox) files. For details, see “Backing up, batch training, and monitoring the Bayesian databases” on page 649.
By uploading mailbox files, you can provide initial training more rapidly than through the Bayesian control email addresses. Training per-domain databases ensures that incoming antispam profiles for protected domains that you have configured to use the per-domain database can recognize spam.
You can leave a per-domain database untrained if either of these conditions is true:
- the protected domain is configured to use the global Bayesian database
- no incoming antispam profiles exist for the protected domain
- If you have enabled incoming antispam profiles to train Bayesian databases when the FortiMail unit receives training messages, and have selected those antispam profiles in recipient-based policies that match training messages, instruct FortiMail administrators and email users to forward sample spam and non-spam email to the Bayesian control email addresses. For more information, see “Configuring the Bayesian training control accounts” on page 654, “Accept training messages from users” on page 511, and “Training Bayesian databases” on page 719.
Before instructing email users to train the Bayesian databases, verify that you have enabled the FortiMail unit to accept training messages. If you have not enabled the “Accept training messages from users” option in the antispam profile for policies which match training messages, the training messages will be discarded without notification to the sender, and no training will occur.
FortiMail units apply training messages to either the global or per-domain Bayesian database, whichever is enabled for the sender’s protected domain. If per-user Bayesian databases are enabled, training messages are also applied to the sender’s per-user Bayesian database.
To more quickly train per-user databases to a mature state, you can also configure the FortiMail unit to use scan results from other antispam methods to train per-user Bayesian databases. For more information, see “Use other techniques for auto training” on page 511.
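The handling of a training message described in the last step can be sketched as follows. This is an illustration of the documented behavior only; the function name, parameters, and database labels are hypothetical.

```python
def training_targets(accept_training_messages, domain_database,
                     per_user_enabled):
    """Return the databases a training message is applied to.

    accept_training_messages: whether the matching antispam profile
        has "Accept training messages from users" enabled.
    domain_database: "global" or "per-domain", whichever is enabled
        for the sender's protected domain (hypothetical labels).
    per_user_enabled: whether per-user Bayesian databases are enabled.
    """
    if not accept_training_messages:
        # The message is discarded without notifying the sender,
        # so no database is trained.
        return []
    targets = [domain_database]
    if per_user_enabled:
        targets.append("per-user")
    return targets
```

An empty result corresponds to the case where the option is disabled and no training occurs, which is why that option should be verified before users are instructed to send training messages.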