The English noun "spam", as in "spam filter", can be rendered in German as Abfall (waste, junk). Originally, the word referred to canned meat. In IT, it refers to unsolicited electronic messages, i.e. messages delivered without the recipient having asked for them. They usually contain advertising. According to the Hamburg-based statistics portal Statista, the number of spam mails worldwide in 2014 was 28 billion. Spam is a global problem that spam filters are meant to address: a computer program sorts out the unwanted messages. The originator of such unwanted mail is called a spammer; the process is called spamming.
Application areas of a spam filter
Classically, spam filters were used only to sort out unwanted e-mails. For this purpose, filter modules were developed for e-mail programs and mail servers. However, as advertising on the Internet has grown steadily in importance, newer programs also filter web content. More specifically, spam filters are now also used in web browsers, wikis and blogs.
Working methods of a spam filter
Spam filters evaluate information that is directly related to a mail. This can be the content of the mail itself; the originator of a message can also be checked, to a limited extent. Three methods have become established:
(a) The blacklist method. A blacklist (literally a "black list") is a list of unwanted entries. In terms of content, such a list contains certain expressions and keywords. An algorithm searches a mail for these keywords; if it finds any, the mail is sorted out. The same procedure can be extended to the sender. Many spam filters that work according to the blacklist method ship with an extensive database, which users can extend according to their personal needs.
(b) The Bayesian filter method. The Bayesian filter method is based on probability theory and requires the user's cooperation, especially at the beginning of its use. If it is set up correctly, it is superior to the blacklist method. The user must mark received mails as spam or non-spam. In the background, the Bayesian filter learns the characteristics of both classes on its own; the user does not have to intervene in the algorithm. After about 1,000 manually sorted mails, the filter works independently. It also continues to learn whenever the user later re-sorts a mail.
(c) Database-based solutions. Advertising mails in particular contain data that is intended to lead the recipient to a specific contact. This includes, above all, the URL of a website and a phone number. Database-based solutions use algorithms to search mails for this information and compare it with known entries. If a match is found, the mail is sorted out. The success rate of such procedures can be described as very good: although advertising mails can be redesigned again and again, and thus appear in an unlimited number of variants, certain data always remain the same.
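The three methods can be sketched in Python roughly as follows. This is a minimal illustration, not the implementation of any real spam filter: the keyword, sender, URL and phone entries are made-up placeholders, all function and class names are hypothetical, and the Bayesian part is a simple naive Bayes classifier with add-one smoothing, which is one common way to realize such a filter.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())

# (a) Blacklist method: hypothetical keyword and sender lists.
BLOCKED_KEYWORDS = {"lottery", "jackpot"}
BLOCKED_SENDERS = {"promo@spam.example"}

def blacklist_hit(sender, body):
    """True if the sender or any word of the body is on the blacklist."""
    if sender.lower() in BLOCKED_SENDERS:
        return True
    return bool(BLOCKED_KEYWORDS & set(tokenize(body)))

# (c) Database-based method: hypothetical database of known spam contact data.
KNOWN_SPAM_URLS = {"http://cheap-pills.example"}
KNOWN_SPAM_PHONES = {"+491234567890"}
URL_RE = re.compile(r"https?://[^\s<>\"']+")
PHONE_RE = re.compile(r"\+?\d[\d\s\-/]{6,}\d")

def contact_data_hit(body):
    """True if the body contains a URL or phone number from the database."""
    urls = {u.lower().rstrip("/.,;)") for u in URL_RE.findall(body)}
    phones = {re.sub(r"[\s\-/]", "", p) for p in PHONE_RE.findall(body)}
    return bool((urls & KNOWN_SPAM_URLS) or (phones & KNOWN_SPAM_PHONES))

# (b) Bayesian method: learns word frequencies from mails the user has sorted.
class BayesFilter:
    def __init__(self):
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.mail_counts = {"spam": 0, "ham": 0}

    def train(self, body, label):
        """Record one user-sorted mail; label is 'spam' or 'ham'."""
        self.word_counts[label].update(tokenize(body))
        self.mail_counts[label] += 1

    def spam_probability(self, body):
        """Estimate P(spam | body) with naive Bayes and add-one smoothing."""
        vocab = set(self.word_counts["spam"]) | set(self.word_counts["ham"])
        total_mails = sum(self.mail_counts.values())
        words = tokenize(body)
        log_score = {}
        for label in ("spam", "ham"):
            counts = self.word_counts[label]
            n_words = sum(counts.values())
            # Smoothed prior for the class, then one smoothed factor per word.
            score = math.log((self.mail_counts[label] + 1) / (total_mails + 2))
            for word in words:
                score += math.log((counts[word] + 1) / (n_words + len(vocab) + 1))
            log_score[label] = score
        # Turn the two log scores into a probability; clamp to avoid overflow.
        diff = min(log_score["ham"] - log_score["spam"], 700.0)
        return 1.0 / (1.0 + math.exp(diff))
```

In use, a mail could be sorted out as soon as `blacklist_hit` or `contact_data_hit` returns True, or when `spam_probability` exceeds a threshold that the operator chooses. Each later correction by the user is simply another call to `train`, which is how the filter keeps learning.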
Error rates of spam filters
Spam mails have become increasingly sophisticated over time. As a result, spam filters have to evolve as well. This involves effort and costs, which is why some providers charge fees for the service. In addition, automated sorting is associated with an error rate, although this can be reduced by training. A false negative occurs when a spam mail reaches the regular inbox; a false positive occurs when a normal mail is mistaken for spam. With optimization, the false negative rate can be reduced to between one and ten percent, while the false positive rate tends towards zero.
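How the two error rates are computed can be illustrated with a short Python sketch. The function name `error_rates` is hypothetical, and the sketch assumes the usual definitions: the false negative rate is the share of actual spam mails that reach the inbox, and the false positive rate is the share of normal mails that are flagged as spam.

```python
def error_rates(predicted_spam, actually_spam):
    """Return (false_negative_rate, false_positive_rate) for two bool lists."""
    if len(predicted_spam) != len(actually_spam):
        raise ValueError("both lists must have the same length")
    pairs = list(zip(predicted_spam, actually_spam))
    # Spam that the filter let through to the inbox.
    false_negatives = sum(1 for pred, actual in pairs if actual and not pred)
    # Normal mail that the filter wrongly sorted out.
    false_positives = sum(1 for pred, actual in pairs if pred and not actual)
    n_spam = sum(1 for _, actual in pairs if actual)
    n_ham = len(pairs) - n_spam
    fn_rate = false_negatives / n_spam if n_spam else 0.0
    fp_rate = false_positives / n_ham if n_ham else 0.0
    return fn_rate, fp_rate


# Five mails: the filter's verdicts versus the true classification.
predicted = [True, False, False, False, True]
actual = [True, True, False, False, False]
fn_rate, fp_rate = error_rates(predicted, actual)
```

In this made-up example, one of the two spam mails slips through and one of the three normal mails is flagged, so the false negative rate is 0.5 and the false positive rate is one third.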
A well-known spam filter is, for example, SpamAssassin, which is used by many email providers.