Bayesian spam filtering is a popular way to separate legitimate email from spam using Bayesian statistical methods. It treats filtering as a document classification problem: based on the contents of a message, a Bayesian spam filter calculates the probability that the message is spam. Such filters are generally more robust than ordinary content-based filters, and they tend to produce few false positives.
Normally, when you receive an email, one look tells you whether it is spam. To your eyes, there is practically zero probability of spam passing for a good email. Ideally, spam filters would work the same way.
Bayesian Spam Filters
Bayesian spam filters belong to the class of scoring content-based spam filters. They try to work the way your eye does when identifying spam, by looking for words and other characteristics that typify it. Every characteristic typical of spam is assigned a score, and a total spam score is computed for the whole message. Depending on the type of Bayesian spam filter you are using, it may also look for characteristics of legitimate email, which lower the total score.
The basic difference between Bayesian spam filters and other, simpler scoring content-based filters is that Bayesian filters build the list of characteristics themselves, whereas the others depend on a manually built list.
You start with a sizable collection of emails you have identified as spam and another collection of good emails. The filter looks at both the legitimate and the spam emails and calculates the probability with which various characteristics appear in each. Bayesian spam filters may look at:
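The scoring idea can be illustrated with a minimal sketch. The word list, the score values, and the threshold below are hypothetical, chosen only for illustration; they do not come from any real filter.

```python
# Hypothetical, hand-picked scores: positive values suggest spam,
# negative values suggest legitimate mail. Illustrative numbers only.
SCORES = {
    "viagra": 3.0,
    "winner": 2.0,
    "free": 1.0,
    "meeting": -1.5,
    "invoice": -1.0,
}


def spam_score(message: str) -> float:
    """Sum the scores of every known word in the message."""
    words = message.lower().split()
    return sum(SCORES.get(word, 0.0) for word in words)


def is_spam(message: str, threshold: float = 2.5) -> bool:
    """Flag the message when its total score reaches the (assumed) threshold."""
    return spam_score(message) >= threshold
```

Here a message such as "free viagra winner" accumulates a high score, while words typical of legitimate mail pull the total down.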
• The words in the message body
• The headers (message paths and senders)
• The word pairs and phrases
• HTML code, such as colors
• Where a particular phrase appears (meta information)
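As a rough illustration of how such a filter might learn from two collections of mail, the sketch below uses only the words in the message body. It is a simplified naive Bayes classifier, not the method of any particular product: the function names, the add-one smoothing, and the choice to consider only words that are present in a message are all assumptions made for this example. It also assumes both training collections are non-empty.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def train(spam_messages, good_messages):
    """Count, per class, how many messages contain each word."""
    spam_counts = Counter()
    good_counts = Counter()
    for msg in spam_messages:
        spam_counts.update(set(tokenize(msg)))
    for msg in good_messages:
        good_counts.update(set(tokenize(msg)))
    return {
        "spam_counts": spam_counts,
        "good_counts": good_counts,
        "n_spam": len(spam_messages),
        "n_good": len(good_messages),
    }


def word_likelihood(word, counts, n_messages):
    """Estimate P(word | class) with add-one smoothing (an assumption)."""
    return (counts[word] + 1) / (n_messages + 2)


def spam_probability(message, model):
    """Estimate P(spam | message), treating words as independent."""
    n_spam, n_good = model["n_spam"], model["n_good"]
    log_spam = math.log(n_spam / (n_spam + n_good))
    log_good = math.log(n_good / (n_spam + n_good))
    for word in set(tokenize(message)):
        log_spam += math.log(word_likelihood(word, model["spam_counts"], n_spam))
        log_good += math.log(word_likelihood(word, model["good_counts"], n_good))
    # Clamp the difference so math.exp cannot overflow on long messages.
    diff = max(min(log_good - log_spam, 700.0), -700.0)
    return 1.0 / (1.0 + math.exp(diff))


spam = ["win free money now", "free prize winner", "cheap money offer"]
good = ["meeting agenda for monday", "lunch on monday", "project status report"]
model = train(spam, good)
print(spam_probability("free money", model))
print(spam_probability("monday meeting", model))
```

With this toy training data, "free money" scores well above 0.5 and "monday meeting" well below it. A word the filter has never seen contributes equally to both classes and leaves the result unchanged.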
The Problems With Scoring Content-Based Filters
Though scoring-based spam filters work well, they also run into certain problems, the ordinary ones more so than Bayesian filters. Some of these problems are:
• Scoring content-based spam filters build a list of characteristics from the spam and the good emails they receive. To build a good list of spam characteristics, mail has to be collected from hundreds of sources (email addresses). This can reduce the effectiveness of the filter, because the characteristics of good email differ from person to person.
• If spammers make an effort to make their mail look genuine, the filtering characteristics may have to be corrected manually, which takes considerable effort.