Like anyone who uses computers, I have to deal with a ridiculous amount of spam. My email is double-filtered with SpamAssassin and Gmail, so very little email spam actually makes it to my inbox anymore. Unfortunately, I still have to deal with comment spam on my blog and websites. I decided to give an overview of some of the ways that you can analyze spam (particularly blog comment spam) in order to eliminate it.

Points
A good system of checking for spam is one that uses points. Your scripts should check several factors and assign points based on severity (e.g., including 100 links is more severe than saying “Viagra” one time). You can then determine points thresholds such as the level at which the comment is automatically marked as spam, the level at which a comment is “iffy” so it needs to be moderated by a human, and the level at which a comment is most likely “real” and can be displayed instantly. Below are several things to analyze that could be weighted with points.
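As a rough sketch of the idea, the snippet below adds up the points from several checks and maps the total to one of the three outcomes described above. The threshold values and the names `Verdict` and `classify` are my own illustrative assumptions, not recommended settings.

```python
from enum import Enum


class Verdict(Enum):
    APPROVE = "approve"    # most likely real; display instantly
    MODERATE = "moderate"  # "iffy"; hold for a human
    SPAM = "spam"          # automatically marked as spam


# Hypothetical thresholds; tune them against your own comment history.
MODERATE_AT = 5
SPAM_AT = 15


def classify(points_by_check: dict[str, int]) -> Verdict:
    """Sum the points from every check and map the total to a verdict."""
    total = sum(points_by_check.values())
    if total >= SPAM_AT:
        return Verdict.SPAM
    if total >= MODERATE_AT:
        return Verdict.MODERATE
    return Verdict.APPROVE
```

Keeping each check's points in a dictionary (rather than a single running total) makes it easier to see later which check pushed a comment over a threshold.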

Time spent viewing the page
The kind of user whose comments you want will read your post and then type a reply. This often takes a few minutes (which is why most people don’t invest the time to comment), so if you create a hidden field in the comment form with a timestamp (preferably encrypted or signed so it isn’t entirely obvious or easy to forge) and then check it when the comment is submitted, you can see how long the entire read-and-respond process took. Spammers often take only about four seconds, because the submission comes from an automated script. This check also ensures that the page was actually viewed (rather than having a comment submitted directly).

Since the amount of time a user takes to respond can range from zero (or negative if they’re trying to spoof your timestamp) to infinity, the points you assign should decrease as the time spent on the page increases, with a cutoff. For example: zero seconds might give ten “spam points,” whereas five minutes might give zero. At a certain point (preferably a couple of hours), you should have a cutoff where the points max out again. This makes it harder to spoof the timestamp, because only a limited window of timestamps is acceptable.
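Here is one way the timestamp check could look. This is a minimal sketch under several assumptions of mine: the hidden field carries the timestamp plus an HMAC signature (signing rather than encrypting), the points fall off linearly between zero seconds and five minutes, and the cutoff is two hours. `SECRET_KEY`, `make_token`, and `time_points` are hypothetical names.

```python
import hashlib
import hmac

SECRET_KEY = b"change-me"     # hypothetical server-side secret
MAX_POINTS = 10               # points for an instant (or spoofed) submission
FULL_CREDIT_SECONDS = 5 * 60  # five minutes or more earns zero points
CUTOFF_SECONDS = 2 * 60 * 60  # after two hours the points max out again


def _sign(raw_ts: str) -> str:
    return hmac.new(SECRET_KEY, raw_ts.encode(), hashlib.sha256).hexdigest()


def make_token(issued_at: int) -> str:
    """Build the hidden-field value: the timestamp plus its signature."""
    raw_ts = str(issued_at)
    return f"{raw_ts}:{_sign(raw_ts)}"


def time_points(token: str, now: int) -> int:
    """Return spam points based on how long the visitor had the page open."""
    try:
        raw_ts, sig = token.split(":", 1)
        issued_at = int(raw_ts)
    except ValueError:
        return MAX_POINTS  # malformed token
    if not hmac.compare_digest(sig.encode(), _sign(raw_ts).encode()):
        return MAX_POINTS  # timestamp was tampered with
    elapsed = now - issued_at
    if elapsed < 0 or elapsed > CUTOFF_SECONDS:
        return MAX_POINTS  # negative time or past the cutoff
    if elapsed >= FULL_CREDIT_SECONDS:
        return 0
    # Linear ramp from MAX_POINTS at zero seconds down to 0 at five minutes.
    return round(MAX_POINTS * (1 - elapsed / FULL_CREDIT_SECONDS))
```

When rendering the form you would call something like `make_token(int(time.time()))`, and on submission pass the posted value and the current time to `time_points`.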

Count links
The vast majority of spam is designed to get traffic to a website where the user is expected to spend money. Because of this, most spam includes at least one link. Unfortunately, real users might also leave comments with links, so you have to weight links carefully. You can choose to do it based purely on the number of links (e.g., no links = no spam points, 2 links = 4 spam points, 3 links = 8 spam points, etc.) or you can do it based on a ratio of links to non-linked text. Usually spammers will just include link after link, but real users will post paragraphs with a few links scattered throughout.
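Both approaches are easy to sketch. In the snippet below, the doubling scheme extends the example above by assuming a single link is worth 2 points, and the cap of 64 is an arbitrary value of mine. The regular expression only matches `http://` and `https://` URLs, which is a simplification.

```python
import re

# Matches http(s) URLs; a simplification that ignores bare domains.
LINK_RE = re.compile(r"https?://\S+", re.IGNORECASE)
MAX_LINK_POINTS = 64  # hypothetical cap so one check cannot dominate


def link_count_points(comment: str) -> int:
    """Doubling scheme: 0 links = 0, 1 = 2 (assumed), 2 = 4, 3 = 8, capped."""
    n = len(LINK_RE.findall(comment))
    if n == 0:
        return 0
    return min(2 ** n, MAX_LINK_POINTS)


def link_ratio(comment: str) -> float:
    """Fraction of the comment's characters that belong to links."""
    if not comment:
        return 0.0
    linked = sum(len(match) for match in LINK_RE.findall(comment))
    return linked / len(comment)
```

A comment that is nothing but links gets a ratio close to 1.0, while a few links scattered through several paragraphs stay well below that, so the ratio can be turned into points with whatever scale suits your site.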

JavaScript
Some people think it is unfair to discriminate against users who don’t have JavaScript, but the fact is that the vast majority of spam scripts do not process JavaScript. Some web developers are willing to say “too bad” to legitimate users without JavaScript (or with it disabled), because they represent a very small portion of all Internet users. I recommend weighting this very lightly. Use JavaScript to write to a hidden form field. If the field is not written to, then you could give a point or two, but not enough to actually mark the comment as spam (though, if you’re picky, it could be enough to put it into the “to be moderated” pile).

Blacklists
Blacklists are another point of contention among web developers. Spammers often use the same IP address many times, but we have no way of knowing whether Spammer Bob used an IP address to spam all night and then disconnected, after which Joe Schmoe connected and was assigned the same address. It’s unfair to block Joe Schmoe based on Spammer Bob’s actions, but we can’t ignore that blacklists can be extremely effective.

With external blacklists (i.e., when using a third party’s blacklist), you run a much greater risk of blocking a legitimate user. You could choose to weight any blacklisted IP heavily enough to automatically make the comment moderated. Another option is to assign points based on how recently the IP address was blacklisted (one that was blacklisted 30 minutes ago should be assigned far more points than one that was blacklisted a week or a month ago).
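One possible way to weight by recency is an exponential decay, sketched below. The 24-hour half-life and the 10-point maximum are assumptions of mine, and the function presumes the blacklist you use tells you when the address was listed.

```python
MAX_BLACKLIST_POINTS = 10
HALF_LIFE_HOURS = 24.0  # hypothetical: the penalty halves every day


def blacklist_points(hours_since_listed: float) -> float:
    """Decay the penalty exponentially as the blacklist entry ages."""
    if hours_since_listed < 0:
        raise ValueError("hours_since_listed must be non-negative")
    return MAX_BLACKLIST_POINTS * 0.5 ** (hours_since_listed / HALF_LIFE_HOURS)
```

With these numbers an address listed 30 minutes ago scores nearly the full 10 points, while one listed a week ago scores a small fraction of a point.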

You can also use internal blacklisting. Create your own list of banned IP addresses based on your comment spam. You might do it based on the number of comments marked as spam from a given IP address over a specific amount of time (e.g., 3 days), or you can base it on the average number of points for a given IP address over that amount of time (e.g., five comments in the past 24 hours for a total of 75 spam points would be 15 points per comment; if that is above your threshold, that IP address could be blocked from commenting temporarily).
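The averaging variant might be sketched like this. It keeps the history in memory for simplicity (a real blog would store it in its database), and the class name, the 24-hour window, and the threshold of 12 points per comment are all hypothetical.

```python
from collections import defaultdict

WINDOW_SECONDS = 24 * 60 * 60  # look back 24 hours
BLOCK_AVERAGE = 12             # hypothetical threshold (points per comment)


class InternalBlacklist:
    """Track spam points per IP address and block on a high recent average."""

    def __init__(self) -> None:
        # ip -> list of (timestamp, spam_points), one entry per scored comment
        self._history: dict[str, list[tuple[float, int]]] = defaultdict(list)

    def record(self, ip: str, timestamp: float, points: int) -> None:
        self._history[ip].append((timestamp, points))

    def average_points(self, ip: str, now: float) -> float:
        recent = [
            points
            for timestamp, points in self._history.get(ip, [])
            if now - timestamp <= WINDOW_SECONDS
        ]
        return sum(recent) / len(recent) if recent else 0.0

    def is_blocked(self, ip: str, now: float) -> bool:
        return self.average_points(ip, now) > BLOCK_AVERAGE
```

Because old entries simply fall out of the window, the block is temporary by construction: once an address stops spamming, its average drops back to zero.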

Keywords
“Viagra! Cialis! BUY NOW!!” Nearly everyone has seen comments or emails with messages like this. You can create a function that runs through a comment and counts the number of flagged words and assigns points based on the total number of those words in the message. One instance of the word “Viagra” might be legitimate (particularly if the comment is on a post about spamming), but the more these words are used, the less likely that the comment is legitimate. Keeping a word list can be excessive, because there is a new drug or other spam item every day, but it’s one of the easiest ways to identify spam.
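A bare-bones version of such a function is sketched below. The word list, the points per hit, and the choice to let the first flagged word go free (since a single mention may be legitimate) are all assumptions of mine.

```python
import re

# Hypothetical starter list; a real one needs regular upkeep.
FLAGGED_WORDS = {"viagra", "cialis", "casino"}
POINTS_PER_HIT = 2
FREE_HITS = 1  # one mention may be legitimate, so it costs nothing


def keyword_points(comment: str) -> int:
    """Count flagged words (case-insensitive) and convert hits to points."""
    words = re.findall(r"[a-z0-9]+", comment.lower())
    hits = sum(1 for word in words if word in FLAGGED_WORDS)
    return max(0, hits - FREE_HITS) * POINTS_PER_HIT
```

Note that exact matching like this misses obfuscated spellings such as “V1AGRA,” so you would need to add those variants to the list or normalize the text before counting.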

Age of post
It’s a fact: older posts are spammed more. The older a post is, the more time it has been floating around at places like Technorati, being scooped up by spam bots as a potential target. Some people completely block comments on old posts, but that’s a bit extreme to me. I’ve quit visiting blogs, because I had a legitimate comment that I couldn’t post. I felt like my opinion wasn’t important to the author, so I decided his/her opinion was no longer important to me. A better option is to either assign points or simply require moderation on all posts older than X days.
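Either option takes only a few lines. In this sketch, the 90-day value for “X days,” the rate of one point per 30 days, and the 6-point cap are placeholder numbers of mine.

```python
from datetime import datetime, timedelta

MODERATE_AFTER_DAYS = 90  # hypothetical value for "X days"
POINTS_PER_30_DAYS = 1
MAX_AGE_POINTS = 6


def age_points(published: datetime, now: datetime) -> int:
    """Option 1: add one point per 30 days of post age, up to a cap."""
    days = max(0, (now - published).days)
    return min(days // 30 * POINTS_PER_30_DAYS, MAX_AGE_POINTS)


def requires_moderation(published: datetime, now: datetime) -> bool:
    """Option 2: hold every comment on posts older than the limit."""
    return now - published > timedelta(days=MODERATE_AFTER_DAYS)
```

The cap matters: without it, a thoughtful comment on a three-year-old post would be marked as spam on age alone.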

User base
If your site allows creating accounts and posting while logged in, users who are logged in should be given bonus points (you could simply give them a negative number of spam points to offset some of the potential triggers and make it more likely that their messages go through right away). You should also try to track previous posters. If John (john@gmail.com) from http://johnsawesomeblog.com is commenting on your posts (legitimately) on a semi-regular basis, he should be recognized as a regular user and his posts should be less likely to be flagged as spam. Usually blogs collect email addresses from users but don’t share that address with other users, so that’s a good way of identifying a frequent commenter.
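In code, the bonus is just a negative number added to the total. The sketch below assumes you keep a count of previously approved comments per email address; the bonus sizes and the three-comment requirement are arbitrary choices of mine.

```python
LOGGED_IN_BONUS = -5  # hypothetical values; negative points offset other triggers
REGULAR_BONUS = -3
REGULAR_AFTER = 3     # approved comments needed to count as a regular


def reputation_points(logged_in: bool, approved_by_email: dict[str, int], email: str) -> int:
    """Return a negative number of spam points for trusted commenters."""
    points = 0
    if logged_in:
        points += LOGGED_IN_BONUS
    # Normalize the address so "John@Gmail.com " matches "john@gmail.com".
    if approved_by_email.get(email.strip().lower(), 0) >= REGULAR_AFTER:
        points += REGULAR_BONUS
    return points
```

Keep the bonuses modest, so that a regular whose account or address is misused by a spammer can still be caught by the other checks.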

Trackback the trackbacks
If someone posts a trackback, your blog should automatically access that page (such as with cURL) to verify it has a link to your site. This will stop 95% of trackback spam. The only problem is that the trackback could contain a bogus URL (e.g., it could be http://blog.spammer.com/?sucker=10500dcd4bca8cb8f46279d8e61e4cd8) that has a GET query containing your site’s address. The spammer’s site could easily create a link to your blog based on the query, so your blog would see this link and consider the site legitimate. Currently, this isn’t a significant problem in the blogging world, but it’s something to be aware of as it may become an issue later.
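The verification step could be sketched as follows, using Python’s standard library in place of cURL. The function names, the 5-second timeout, and the 1 MB read limit are assumptions of mine, and the sketch does nothing about the bogus-query trick described above.

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit
from urllib.request import urlopen

MAX_BYTES = 1_000_000  # hypothetical limit on how much of the page to read


class _LinkCollector(HTMLParser):
    """Collect the href of every <a> tag in a page."""

    def __init__(self) -> None:
        super().__init__()
        self.hrefs: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def _host(url: str):
    """Return the lowercase host of a URL, or None if it has none."""
    try:
        return urlsplit(url).hostname
    except ValueError:
        return None


def links_to_host(html: str, my_host: str) -> bool:
    """True if the page contains an <a href> pointing at my_host."""
    collector = _LinkCollector()
    collector.feed(html)
    return any(_host(href) == my_host for href in collector.hrefs)


def verify_trackback(trackback_url: str, my_host: str, timeout: float = 5.0) -> bool:
    """Fetch the trackback page and check for a link back (needs network access)."""
    try:
        if urlsplit(trackback_url).scheme not in ("http", "https"):
            return False
        with urlopen(trackback_url, timeout=timeout) as response:
            html = response.read(MAX_BYTES).decode("utf-8", errors="replace")
    except (OSError, ValueError):
        return False  # unreachable or malformed URL: treat as unverified
    return links_to_host(html, my_host)
```

Parsing the anchors, rather than searching the raw HTML for your domain, means a page that merely mentions your address in plain text does not pass.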

Monitor your spam
Keep any spam you receive (at least for some time), because you can analyze it to improve your catching methods. If you see something that was just barely flagged as spam, see how you can guarantee it will be flagged next time. If you see a pattern, develop a way of catching that pattern. If a spam comment makes it through your system, find out why. Sometimes you can flag one comment as spam based on its similarity to a comment that was definitely spam (e.g., if a “V1AGRA” spam is caught and then the same username and email address try to submit another message, the second message is likely to be spam too).

Other thoughts
Obviously, you can use none, all, or any mixture of these methods and modify them to meet your needs, but you may also consider a few alternatives. For instance, you can have a flag that sets a message as needing moderation regardless of the number of spam points it has (e.g., you might flag all trackbacks and comments with links as needing moderation if you are very specific about what links you allow on your site).

It’s typical to have a message appear explaining to the user what happened. If the message was considered spam, the user should know about it (on the off chance that it is a human after all). If the comment is being held for moderation, the user should know so s/he does not try to submit it again or think your blog is “broken.” Some developers choose to have messages that need moderation show up, but they add nofollow (e.g., <a href="http://potentialspammer.com" rel="nofollow">Cool site!</a>) to all links, preventing the user from improving his/her site’s search engine ranking (most search engines use the number of links to a site as part of the measure of the site’s importance).

As a last resort, you can use a CAPTCHA. It should always be optional, because users with vision impairments often cannot complete a CAPTCHA. By optional, I mean that it should give you “bonus points” for completing it rather than punishing you for not completing it. It can be used effectively when a comment would go into moderation by presenting the end user with the CAPTCHA as a means of getting their comment instantly approved.

The point system is also useful on a blog where every comment must be manually approved: sort the moderation queue by points so that you can view the comments most likely to be legitimate first.
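Ordering the queue is a one-line sort once every pending comment carries its score. The `PendingComment` record below is a hypothetical stand-in for whatever your blog stores.

```python
from dataclasses import dataclass


@dataclass
class PendingComment:
    author: str
    body: str
    spam_points: int


def moderation_order(queue: list[PendingComment]) -> list[PendingComment]:
    """Lowest score first, so the likeliest-legitimate comments come up first."""
    return sorted(queue, key=lambda comment: comment.spam_points)
```

Python’s sort is stable, so comments with equal scores stay in the order they arrived.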

Summary
I can’t emphasize enough the need for a point system. The more pieces of the puzzle you have to look at, the more accurate your guess will be. In my opinion, one spam message getting through is better than one legitimate comment being blocked. If you’re adamant that no spam is ever shown on your site, consider full moderation. If your site receives few comments, you can have it email you every time you receive a new comment or a comment in moderation. That allows you to quickly spot bogus comments that managed (or almost managed) to make it through your filters.

Remember that spam is used because it works. If you send out a message to a million people, some are likely to be suckered by it. That means that every spam comment you block makes spamming less cost efficient and just might dissuade future spammers. We can only hope…


8 Responses to “How To Analyze And Stop Comment Spam”

  1. Bernie Zimmermann

    This is an awesome list, Ian. I’m not quite to the point where I’m ready for a points system at my blog, but I can certainly see the merit in such an approach. Thanks for posting such a thorough list of spam countermeasures. I’m sure I’ll be revisiting this post in the future (unfortunately).

  2. Gordaen

    A custom CMS like yours usually lasts a bit longer against the spammers, but eventually someone will have too much time on their hands and create a script for it as well. At first, it’s easy to stop these spams by blocking keywords, excessive links, or other obvious patterns, but someone out there will be just as interested in circumventing your countermeasures as you are in stopping their spam.

    Hopefully this excessively long post will help you (and others) when the spammers become a bit more sophisticated. In some ways, it’s good that there are so many easy targets (keeps their eyes off the more challenging sites to spam), but I wonder how much those sites encourage these people…

    One thing I didn’t mention is that you can also shut down a lot of spammers’ sites. Oftentimes, they use sites like JubiiBlog where they can quickly set up a page with spam advertisements or JavaScript forwarding, and you can simply contact the support team with the offending sites. I’ve done that quite a bit in the past few weeks while analyzing spam I receive through my comment form, and I can only hope it irritates the spammer. It takes a lot less time to find a customer support email address and forward on the spam sites than it does for them to find new sites that are vulnerable to their techniques.

  3. Brian Buffington

    This article is fantastic! Before this I was strongly leaning towards integrating a CAPTCHA into my custom CMS, but your points approach brings up some new ideas. I’m definitely going to take a look at my pages and see how I can integrate some of these great suggestions. Thanks again!

  4. Gordaen

    Thanks for the kind comment, Brian. I’m glad you found the article helpful.

    CAPTCHAs are definitely one of the most commonly used solutions because they’re so easy. Unfortunately, there are some significant problems with them. Vision-impaired users can’t use them, people using command line browsers (e.g., Lynx) can’t see them, and small, portable devices often cannot use them (e.g., phones and PDAs, which might have screens that are too small, resolution that is too low, or images disabled to save bandwidth/memory). I also see people who implement CAPTCHAs incorrectly, using an obvious hash that can be cracked by a computer faster than a real user can type in the image’s characters.

  5. mya

    Hi folks,

    I use a spam blocker that takes a different approach: it doesn’t try to analyse the email. This spam blocker uses a new pidkey technology. It’s too long to explain here, but you can see an explanation at http://www.pidware.com/pidkey.php. The point is that the machine (computer) is competing with the human brain, and a computer is no match for that. So maybe it’s time to look elsewhere.

    Mya

  6. mya

    Hi folks,

    I saw, I gave you the wrong link, is http://www.pidware.com/pidkey.php (without . at the end) If you are like me, I hate spam and now… I relax :-)

    Mya

  7. Gordaen

    I updated the link in your first post, so now they both work. Pidkey sounds like a simple solution to email spam, but it seems like it has a few problems. The major one that I see is email from non-humans that isn’t spam. For instance, I have scripts on my site that email me upon various events. I would have to specifically create a contact in my address book for those scripts. I also don’t know that it could be used effectively for blogs. It could be adapted to send out an email confirmation to each poster and store who is accepted, but that could generate a lot of emails on a popular blog (mine receives an average of about 150 comment spam attempts a day, sometimes triple that, and my blog is certainly not a major one).

    Another interesting solution, which would be nice if more people used it, is hashing. Programs like hashcash basically make the computer do some calculations in order to send an email. Since humans don’t type and send emails extremely fast, it’s not a problem for us, but spammers wouldn’t be able to do the calculations for the millions (or billions?) of messages sent. Of course, then you run into problems with companies sending out legitimate newsletters.

  8. paul

    Hi Gordaen,
    I checked Pidkey and what I saw is a little bit more than just a simple solution anti-spam. And … my humble advice, the problem “The major one that I see is email from non-humans that isn’t spam” is not a problem… because with that… spammers are out of service :-) . I start to use it and I have no more spams :-) see you:-)