I mentioned before that there are currently two basic approaches to combating the email spam problem. I call them “sender passive” and “sender active,” depending on whether the sender has to do something other than what is specified in the basic SMTP protocol. Today I’ll talk about the common “sender passive” methods. Later I’ll mention a few of the “sender active” methods.
Content Filtering
This is by far the most common anti-spam technique in use today. The basic premise is that you scan the content of the message and, if it contains “spammy” keywords, you drop it. So if someone sends me a message with the word “viagra” in it, I can be pretty sure it’s spam.
The main benefit of this solution is its simplicity and its flexibility. There are lots of different kinds of filters, from simple keyword-based ones to more complex Bayesian filtering.
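To give a feel for the Bayesian end of that range, here is a toy sketch of a naive Bayes scorer. Everything in it (the class name `TinyBayesFilter`, the whitespace tokenizer, the add-one smoothing, the four training messages) is my own assumption for illustration and not how any particular filter is built.

```python
import math
from collections import Counter


def tokenize(text):
    # Deliberately crude: lowercase and split on whitespace.
    return text.lower().split()


class TinyBayesFilter:
    """Toy naive Bayes spam scorer with add-one smoothing (illustrative only)."""

    def __init__(self):
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.message_counts = {"spam": 0, "ham": 0}

    def train(self, text, label):
        self.word_counts[label].update(tokenize(text))
        self.message_counts[label] += 1

    def spam_log_odds(self, text):
        vocab = set(self.word_counts["spam"]) | set(self.word_counts["ham"])
        totals = {label: sum(c.values()) for label, c in self.word_counts.items()}
        # Start from the (smoothed) prior odds of spam versus ham.
        score = math.log(
            (self.message_counts["spam"] + 1) / (self.message_counts["ham"] + 1)
        )
        for word in tokenize(text):
            # The +1 terms keep unseen words from producing a zero probability.
            p_spam = (self.word_counts["spam"][word] + 1) / (totals["spam"] + len(vocab) + 1)
            p_ham = (self.word_counts["ham"][word] + 1) / (totals["ham"] + len(vocab) + 1)
            score += math.log(p_spam / p_ham)
        return score

    def is_spam(self, text):
        return self.spam_log_odds(text) > 0


bayes = TinyBayesFilter()
bayes.train("cheap viagra buy now", "spam")
bayes.train("buy cheap pills now", "spam")
bayes.train("patient asked about viagra dosage", "ham")
bayes.train("meeting notes for the clinic", "ham")
```

With this tiny training set, `bayes.is_spam("buy cheap pills now")` comes out true and `bayes.is_spam("patient asked about dosage")` comes out false. The point is that the verdict depends on what each user's own mail looks like, not on a fixed word list.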
The problem is that spammers can tailor their messages to get past the filters, so it’s basically a cat-and-mouse game, and the filters are always the ones playing catch-up. This is especially true for the “big four” providers (that is, Hotmail, Yahoo, AOL and Gmail). Spammers specifically target addresses at those providers, but before sending out a batch, they sign up for an account of their own and use it for “testing.” Once a message gets into the test account, bam! They hit everybody.
The other drawback I hinted at above. Say I’m a doctor: many of my messages might legitimately include the word “viagra,” so I can’t just blindly filter that word out.
Finally, there’s basically an infinite number of ways to write “viagra” – “v1agra,” “vi@gra”, etc. It’s just a matter of coming up with new ways to get past the filter.
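Here is a rough sketch of why the keyword approach turns into cat-and-mouse. The keyword set and the look-alike substitution table are hypothetical, made up for this example. The second function catches the two spellings above, but a spammer only needs to find one variation that the table doesn't cover.

```python
import re

# Hypothetical keyword list and look-alike table, for illustration only.
SPAM_KEYWORDS = {"viagra"}
LOOKALIKES = str.maketrans({"1": "i", "@": "a", "4": "a", "0": "o", "3": "e", "$": "s"})


def naive_is_spam(message):
    # Exact keyword match on whitespace-separated words, ignoring edge punctuation.
    words = message.lower().split()
    return any(word.strip(".,!?\"'") in SPAM_KEYWORDS for word in words)


def normalized_is_spam(message):
    # Undo common character substitutions first, then match runs of letters.
    normalized = message.lower().translate(LOOKALIKES)
    words = re.findall(r"[a-z]+", normalized)
    return any(word in SPAM_KEYWORDS for word in words)
```

`naive_is_spam("Buy v1agra now")` is false while `normalized_is_spam("Buy v1agra now")` is true, but `normalized_is_spam("Buy v.i.a.g.r.a now")` is false again, and so the next round of the game begins.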
IP Blacklisting
This method uses the one piece of “reliable” information we get about the source – their IP address. The basic idea is simple: if the sender’s IP address is on the blacklist, don’t accept mail from them.
There are a few ways for an IP address to get onto the blacklist. First of all, there are blocks that can reliably stay on there permanently. For example, if an ISP assigns the block 123.123.123.* to its dial-up lines, then you can be pretty sure there’s no mail server in that range. Any mail originating from there is probably from a botnet.
Another way you might get on there is from user feedback. You can use a service like NJABL, which accepts feedback from users – if enough people report spam from an IP address, it gets added to the blacklist.
The good thing about blacklists is that they pretty much automatically kill botnets: most botnets live in dial-up or residential IP blocks, and the blacklists have those listed by default. Also, very little extra processing is required on the client. It’s usually just a matter of doing a specially formed DNS query.
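For the curious, the “specially formed” query for a DNS-based blacklist is conventionally the sender’s IPv4 octets in reverse order, followed by the blacklist’s zone name. If the name resolves, the address is listed. Below is a minimal sketch; the zone `dnsbl.example.org` is a placeholder I made up, and the helper names are mine.

```python
import ipaddress
import socket


def dnsbl_query_name(ip, zone):
    # Validate the address, then reverse its octets: 1.2.3.4 -> 4.3.2.1.<zone>
    addr = ipaddress.IPv4Address(ip)
    reversed_octets = ".".join(reversed(str(addr).split(".")))
    return f"{reversed_octets}.{zone}"


def is_listed(ip, zone):
    # A successful lookup means the blacklist has an entry for this address.
    try:
        socket.gethostbyname(dnsbl_query_name(ip, zone))
    except OSError:
        # No record, or the lookup itself failed. This sketch treats both
        # cases as "not listed", which is an assumption, not a requirement.
        return False
    return True


# "dnsbl.example.org" is a hypothetical zone, not a real blacklist.
print(dnsbl_query_name("123.123.123.45", "dnsbl.example.org"))
```

Treating a failed lookup as “not listed” means a blacklist outage lets mail through rather than blocking everything. That seemed like the safer default for a sketch.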
The problem is that, again, it’s a game of catch-up. A certain amount of spam has to arrive from an IP before it gets blacklisted. There is also the problem of false positives: your own IP might be blacklisted by mistake, which isn’t much fun when it happens.
Greylisting
This post is starting to get rather long, so I’ll finish up with a relatively unknown technique called “greylisting.” Basically, the way this works is that the first time you see an email with a new “From” address, you add that address to the greylist and send back an SMTP “temporarily unavailable” error. A conformant SMTP server will then queue the message itself and send it again after a short while. On the second attempt, you simply let the message through.
Once a message has successfully got through the greylist, the server will usually add the sender to a temporary whitelist so that the next few messages will get through without going through the greylist.
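The greylist-then-whitelist flow described above can be sketched in a few lines. This is an in-memory toy under my own assumptions: the class name, the five-minute minimum retry delay, the one-day whitelist lifetime and the reply strings are all invented for the example (451 is simply one of SMTP’s 4xx “try again later” codes). It keys on the sender address as described here, although as far as I know deployed greylisters usually key on the combination of sending IP, envelope sender and envelope recipient.

```python
import time

TEMPFAIL = "451 temporarily unavailable, try again later"
ACCEPT = "250 OK"


class Greylist:
    """In-memory greylist keyed on sender address (illustrative only)."""

    def __init__(self, min_delay=300, whitelist_ttl=86400, clock=time.time):
        self.min_delay = min_delay          # seconds a sender must wait before retrying
        self.whitelist_ttl = whitelist_ttl  # seconds a proven sender skips the greylist
        self.clock = clock                  # injectable so the logic can be tested
        self.first_seen = {}
        self.whitelisted_until = {}

    def check(self, sender):
        now = self.clock()
        # Senders that recently passed the greylist go straight through.
        if self.whitelisted_until.get(sender, 0) > now:
            return ACCEPT
        first = self.first_seen.get(sender)
        if first is None:
            # New sender: remember it and ask the other server to retry.
            self.first_seen[sender] = now
            return TEMPFAIL
        if now - first < self.min_delay:
            # Retried too quickly to look like a real queueing mail server.
            return TEMPFAIL
        # A patient retry: accept and whitelist the sender for a while.
        del self.first_seen[sender]
        self.whitelisted_until[sender] = now + self.whitelist_ttl
        return ACCEPT
```

A real server would also need to expire old greylist entries and persist the tables across restarts, which I have left out to keep the sketch short.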
This works because almost all legitimate SMTP servers handle the “temporarily unavailable” response correctly, while almost all spammers’ servers do not. The main reason is that the spam is being sent from zombie computers, where queuing messages for a retry would look a little suspicious!
Of the three techniques that I’ve mentioned today, this one has had the most success for me (that doesn’t mean it works well for everybody). However, it does have a couple of problems. First of all, it means legitimate email takes a little longer to reach me initially (depending on the sender’s server, it could be anywhere from 10 minutes to a couple of hours). The other problem is that the technique is successful largely because of its relative obscurity. If everybody did greylisting, the spammers would catch on pretty quickly.
Another problem is that many large organisations have several SMTP servers for the same domain (using MX record priorities). Each server must know what the others have seen, since the sender may deliver the first attempt to one server and the retry to another. A greylisting setup needs to share that state between servers.
By the way, I’ve listed this under “sender passive” even though, technically, the sender has to do something “extra” – but since that something “extra” is already part of the SMTP standard, it doesn’t really count.
Conclusion
In the end, all of the above techniques fail for the same reason: there is no cost to the sender for failed messages. Even if 90% of messages are blocked in this way, a batch of 1,000,000 messages will still reach 100,000 people. It’s also always a catch-up game: the spammers are necessarily one step ahead. After all, we can only block stuff once we know it’s being sent.
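To put numbers on that, here is the arithmetic as a tiny sketch. The function name is mine, and the 99% figure is just a hypothetical “what if the filters were much better” case.

```python
def messages_delivered(batch_size, blocked_percent):
    # Integer arithmetic avoids floating-point rounding on whole percentages.
    return batch_size * (100 - blocked_percent) // 100


for blocked in (90, 99):
    print(f"{blocked}% blocked: {messages_delivered(1_000_000, blocked):,} delivered")
```

Even at 99% blocked, 10,000 messages out of every million still land in someone’s inbox, and the extra volume costs the spammer next to nothing.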
Next time, I’ll describe a couple of what I call “sender active” techniques, which require extra effort from the sender (outside of the basic SMTP protocol) to get a message through.