The Advantages of Bayesian Spam Filters

by: Arvind



Bayesian spam filtering is an effective way of keeping spam out of your inbox. M. Sahami, S. Dumais, D. Heckerman, and E. Horvitz proposed the technique in "A Bayesian approach to filtering junk e-mail" in 1998, but it gained little attention until Paul Graham described it in 2002. Since then, it has become a widely used technique for distinguishing legitimate email from spam. Many modern email programs use Bayesian spam filtering, and so do server-side email filters, some of which embed the Bayesian filtering function within the mail server software itself.

The Bayesian spam filter works by analyzing the contents of an email and calculating the probability that it is spam. It builds its own list of the characteristics of spam as well as of legitimate mail. Based on this analysis, each message is classified as spam or legitimate. After a message has been classified, the filter is trained further on that result, on a per-user basis. This per-user training is the main advantage of Bayesian spam filters.
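As a rough illustration of the analysis described above, here is a minimal sketch of a word-based Bayesian filter in Python. It is not the code of any particular email program: the class name TinyBayesFilter, the "spam"/"ham" labels, the add-one smoothing, and the 0.5 threshold are all assumptions made for the example, and the scoring is a simplified naive Bayes that only looks at words present in the message.

```python
import math
import re
from collections import Counter


class TinyBayesFilter:
    """Toy per-user spam filter (illustrative names, not a real product's API)."""

    def __init__(self):
        # How many spam / legitimate ("ham") messages contained each word.
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        # How many messages of each kind have been seen so far.
        self.message_counts = {"spam": 0, "ham": 0}

    @staticmethod
    def tokenize(text):
        # Each distinct lowercase word counts once per message.
        return set(re.findall(r"[a-z0-9']+", text.lower()))

    def train(self, text, label):
        self.message_counts[label] += 1
        self.word_counts[label].update(self.tokenize(text))

    def spam_probability(self, text):
        n_spam = self.message_counts["spam"]
        n_ham = self.message_counts["ham"]
        # Start from the smoothed prior odds of spam versus ham.
        log_odds = math.log((n_spam + 1) / (n_ham + 1))
        for word in self.tokenize(text):
            # Add-one smoothing (an assumption) keeps unseen words from
            # producing a zero probability.
            p_spam = (self.word_counts["spam"][word] + 1) / (n_spam + 2)
            p_ham = (self.word_counts["ham"][word] + 1) / (n_ham + 2)
            log_odds += math.log(p_spam / p_ham)
        # Clamp so math.exp cannot overflow on very long messages.
        log_odds = max(-700.0, min(700.0, log_odds))
        return 1.0 / (1.0 + math.exp(-log_odds))

    def classify(self, text, threshold=0.5):
        return "spam" if self.spam_probability(text) > threshold else "ham"


# Made-up training messages, purely for demonstration.
mail_filter = TinyBayesFilter()
mail_filter.train("win a free lottery prize now", "spam")
mail_filter.train("free prize claim now", "spam")
mail_filter.train("meeting agenda for monday", "ham")
mail_filter.train("lunch on monday", "ham")

print(mail_filter.classify("free lottery prize"))
print(mail_filter.classify("monday meeting"))
```

Working in log-odds rather than multiplying raw probabilities is a design choice made here to avoid numeric underflow. A real filter would also need far more training mail than four messages before its scores meant much.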

Most of the spam you receive is related to your own online activities. You may, for example, have subscribed to an online newsletter that you now consider spam. Each issue, like other newsletters from the same source, is likely to contain common words, such as the newsletter's name and the email address from which it originated. Your Bayesian spam filter will analyze the contents, identify these characteristics, and assign the message a higher probability of being spam. All of this is based on your specific activity as a user.

The legitimate emails you receive differ from spam, and the Bayesian spam filter will assign them a lower probability of being spam. In an environment where you receive corporate emails from the same source, the messages will contain the same company name and the names of clients or customers. Your Bayesian spam filter will learn to treat these as signs of legitimate mail.

The Bayesian spam filter's accuracy improves over time. It keeps refining the characteristics it uses to assess the probability, and whenever it classifies a message incorrectly, corrective training adjusts it. As a result, the probability attached to each word is specific to each individual user.
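The corrective training and per-user probabilities described above can be sketched as follows. This is a hypothetical illustration, not how any specific filter stores its data: the helper names (new_profile, learn, correct, word_spam_score), the dictionary layout, and the add-one smoothing are all assumptions. Each user gets a separate profile, and correcting a mistake simply moves a message's words from the wrong class to the right one.

```python
from collections import Counter


def new_profile():
    # One profile per user: word counts and message counts for each class.
    return {"spam": Counter(), "ham": Counter(), "n_spam": 0, "n_ham": 0}


def learn(profile, words, label, sign=1):
    # sign=+1 adds a message to a class; sign=-1 removes it again.
    profile["n_" + label] += sign
    for word in set(words):
        profile[label][word] += sign


def correct(profile, words, wrong_label):
    # Assumes the message was previously learned under wrong_label.
    right_label = "ham" if wrong_label == "spam" else "spam"
    learn(profile, words, wrong_label, sign=-1)
    learn(profile, words, right_label, sign=1)


def word_spam_score(profile, word):
    # Smoothed share of the evidence that points to spam for this word.
    in_spam = (profile["spam"][word] + 1) / (profile["n_spam"] + 2)
    in_ham = (profile["ham"][word] + 1) / (profile["n_ham"] + 2)
    return in_spam / (in_spam + in_ham)


# Two made-up users who see the same word in very different mail.
alice = new_profile()  # e.g. works with loans, so "mortgage" is normal
learn(alice, ["mortgage", "rate", "client"], "ham")

bob = new_profile()    # only ever sees "mortgage" in junk mail
learn(bob, ["mortgage", "offer"], "spam")

print(word_spam_score(alice, "mortgage") < word_spam_score(bob, "mortgage"))

# Bob's filter wrongly filed an invoice as spam; he marks it as legitimate.
learn(bob, ["invoice", "attached"], "spam")
before = word_spam_score(bob, "invoice")
correct(bob, ["invoice", "attached"], wrong_label="spam")
after = word_spam_score(bob, "invoice")
print(after < before)
```

Keeping the counts rather than the finished probabilities is what makes the correction cheap in this sketch: the scores are recomputed from the updated counts whenever they are needed.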

The Bayesian spam filter is particularly good at avoiding false positives. If an email you receive contains the words 'Nigeria' or 'lottery', which are frequently seen in spam, your Bayesian spam filter would treat it as probable spam rather than reject it outright, as a simple keyword-based filter might. It would look at the message's other characteristics before classifying it. If the mail happens to be from your spouse, those characteristics would indicate its legitimacy and outweigh the probable spam words.
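To see how legitimate characteristics can outweigh a few suspicious words, consider this small sketch. It uses one common rule for combining per-word spam probabilities under the naive assumption that words are independent. The function name combine and every probability below are invented for illustration; a real filter would learn these values from the user's own mail.

```python
import math


def combine(token_probs):
    """Combine per-word spam probabilities into one score for the message.

    Assumes every probability lies strictly between 0 and 1.
    """
    spam_side = math.prod(token_probs)
    ham_side = math.prod(1.0 - p for p in token_probs)
    return spam_side / (spam_side + ham_side)


# Hypothetical per-word spam probabilities for one user.
suspicious_words = {"nigeria": 0.90, "lottery": 0.95}
# Hypothetical words typical of mail from this user's spouse.
familiar_words = {"alice": 0.02, "dinner": 0.05, "kids": 0.03}

only_suspicious = combine(list(suspicious_words.values()))
whole_message = combine(list(suspicious_words.values()) + list(familiar_words.values()))

print(f"suspicious words alone: {only_suspicious:.3f}")
print(f"with familiar words:    {whole_message:.3f}")
```

With these made-up numbers the two suspicious words on their own give a score close to 1, while the full message scores close to 0, which is the behaviour the paragraph above describes. A keyword filter that rejected anything containing 'lottery' would have no way to take the familiar words into account.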