Like anyone who uses computers, I have to deal with a ridiculous amount of spam. My email is double-filtered with SpamAssassin and Gmail, so very little email spam actually makes it to my inbox anymore. Unfortunately, I still have to deal with comment spam on my blog and websites. I decided to give an overview of some of the ways that you can analyze spam (particularly blog comment spam) in order to eliminate it.
A good system of checking for spam is one that uses points. Your scripts should check several factors and assign points based on severity (e.g., including 100 links is more severe than saying “Viagra” one time). You can then set point thresholds: the level at which a comment is automatically marked as spam, the level at which a comment is “iffy” and needs to be moderated by a human, and the level at which a comment is most likely “real” and can be displayed instantly. Below are several things to analyze that could be weighted with points.
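As a minimal sketch of the three outcomes described above, the function below maps a total score to an action. The threshold values and the name classify are assumptions made up for illustration; you would tune them against your own comments.

```python
# Hypothetical thresholds; tune them against your own comment history.
MODERATION_THRESHOLD = 10  # at or above this, hold the comment for a human
SPAM_THRESHOLD = 25        # at or above this, mark the comment as spam


def classify(points: float) -> str:
    """Map a total spam score to one of three outcomes."""
    if points >= SPAM_THRESHOLD:
        return "spam"
    if points >= MODERATION_THRESHOLD:
        return "moderate"
    return "publish"
```

Each check described below would add to (or subtract from) the total that is passed in here.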
Time spent viewing the page
The kind of user whose comments you want will read your post and then type a reply. This often takes a few minutes (which is why most people don’t invest the time to comment), so if you create a hidden field in the comment form with a timestamp (preferably encrypted or signed so it isn’t obvious or easy to alter) and then check it when the comment is submitted, you can see how long the entire read-and-respond process took. Spammers will often take only about four seconds, because the submission comes from an automated script. This check also ensures that the page was actually viewed (rather than having a comment submitted directly).
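One way to protect the hidden timestamp is sketched below. Note that it signs the timestamp with an HMAC rather than encrypting it, so the value stays readable but cannot be changed without detection. SECRET_KEY, make_token, and seconds_on_page are hypothetical names, and the key is assumed to live only on the server.

```python
import hashlib
import hmac
import time
from typing import Optional

# Assumption: a long random secret that never leaves the server.
SECRET_KEY = b"replace-with-a-long-random-secret"


def _sign(ts: str) -> str:
    return hmac.new(SECRET_KEY, ts.encode(), hashlib.sha256).hexdigest()


def make_token(now: Optional[float] = None) -> str:
    """Return 'timestamp:signature' for the hidden form field."""
    ts = str(int(time.time() if now is None else now))
    return f"{ts}:{_sign(ts)}"


def seconds_on_page(token: str, now: Optional[float] = None) -> Optional[int]:
    """Return elapsed seconds, or None if the token is malformed or forged."""
    ts, sep, sig = token.partition(":")
    if not sep or not ts.isdigit():
        return None
    # Compare as bytes in constant time to avoid leaking the signature.
    if not hmac.compare_digest(sig.encode(), _sign(ts).encode()):
        return None
    current = int(time.time() if now is None else now)
    return current - int(ts)
```

A result of None could be treated the same as the worst possible time score, since a real browser would never produce it.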
Since the amount of time a user takes to respond can range from zero (or negative, if they’re trying to spoof your timestamp) to infinity, the points you assign should decrease as the time spent on the page increases, with a cutoff. For example, zero seconds might give ten “spam points,” whereas five minutes might give zero. At a certain point (preferably a couple of hours), you should have a cutoff where the points max out again. This makes it harder to spoof the timestamp, because the range of acceptable timestamps is limited.
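The scoring curve could look something like this. It uses the numbers from the example (ten points at zero seconds, zero points at five minutes, a cutoff at two hours); the straight-line ramp between zero and five minutes is my own assumption, and time_points is a hypothetical name.

```python
MAX_POINTS = 10.0
FULL_CREDIT_SECONDS = 5 * 60   # five minutes or more earns zero points
STALE_SECONDS = 2 * 60 * 60    # beyond two hours the timestamp is suspect


def time_points(elapsed_seconds: float) -> float:
    """Spam points for the time between page load and comment submission."""
    # Negative or very old timestamps are treated as spoofed.
    if elapsed_seconds < 0 or elapsed_seconds > STALE_SECONDS:
        return MAX_POINTS
    if elapsed_seconds >= FULL_CREDIT_SECONDS:
        return 0.0
    # Assumed linear ramp from MAX_POINTS down to zero.
    return MAX_POINTS * (1 - elapsed_seconds / FULL_CREDIT_SECONDS)
```

With these values, a four-second submission scores close to the maximum, while anything between five minutes and two hours scores nothing.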
The vast majority of spam is designed to get traffic to a website where the user is expected to spend money. Because of this, most spam includes at least one link. Unfortunately, real users might also leave comments with links, so you have to weight links carefully. You can choose to do it based purely on the number of links (e.g., no links = no spam points, 2 = 4 spam points, 3 = 8 spam points, etc.) or you can do it based on a ratio of links to non-linked text. Usually spammers will just include link after link, but real users will post paragraphs with a few links scattered throughout.
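Both approaches are sketched below. The doubling scale follows the example (2 links = 4 points, 3 links = 8 points); the score for a single link and the cap of 64 are assumptions, as are the function names. The regular expression is a rough URL matcher, not a full parser.

```python
import re

# Rough matcher: anything starting with http:// or https:// up to whitespace.
LINK_RE = re.compile(r"https?://\S+", re.IGNORECASE)


def link_count_points(text: str) -> int:
    """Doubling score: 0 links -> 0, 1 -> 2, 2 -> 4, 3 -> 8, capped at 64."""
    n = len(LINK_RE.findall(text))
    return 0 if n == 0 else min(2 ** n, 64)


def link_ratio(text: str) -> float:
    """Fraction of whitespace-separated words that contain a link."""
    words = text.split()
    if not words:
        return 0.0
    linked = sum(1 for word in words if LINK_RE.search(word))
    return linked / len(words)
```

A comment that is nothing but links has a ratio of 1.0, while a paragraph with one link scattered in will be close to zero, so the ratio could be multiplied by a weight of your choosing to turn it into points.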
Blacklists are another point of contention among web developers. Spammers often use the same IP address many times, but we have no way of knowing if Spammer Bob was using the IP address to spam all night before disconnecting and then Joe Schmoe connects and is assigned that IP address. It’s unfair to block Joe Schmoe based on Spammer Bob’s actions, but we can’t ignore that blacklists can be extremely effective.
With external blacklists (i.e., when using a third party’s blacklist), you run a much greater risk of blocking a legitimate user. You could choose to weight any blacklisted IP heavily enough to automatically send the comment to moderation. Another option is to assign points based on how recently the IP address was blacklisted (one that was blacklisted 30 minutes ago should be assigned far more points than one that was blacklisted a week or a month ago).
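Here is one possible way to weight a listing by its age. The exponential decay with a one-day half-life is purely an assumption (any decreasing curve would do), and it presumes your blacklist source tells you when the address was listed. blacklist_points is a hypothetical name.

```python
from datetime import datetime, timedelta


def blacklist_points(
    listed_at: datetime,
    now: datetime,
    max_points: float = 20.0,
    half_life: timedelta = timedelta(days=1),
) -> float:
    """Points halve for every half_life that has passed since the listing."""
    age = now - listed_at
    if age < timedelta(0):
        age = timedelta(0)  # a listing dated in the future counts as brand new
    return max_points * 0.5 ** (age / half_life)
```

With these defaults, an address listed 30 minutes ago scores nearly the full 20 points, while one listed a week ago scores well under one point.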
You can also use internal blacklisting: create your own list of banned IP addresses based on your comment spam. You might base it on the number of comments marked as spam from a given IP address over a specific period (e.g., 3 days), or on the average number of points for a given IP address over that period (e.g., five comments in the past 24 hours for a total of 75 spam points would be 15 points per comment; if that is above your threshold, that IP address could be temporarily blocked from commenting).
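The averaging variant might be sketched as follows. The history is assumed to be a list of (timestamp, points) pairs that you have stored for one IP address; should_block and the threshold of 12 are made up for illustration.

```python
from datetime import datetime, timedelta
from typing import List, Tuple


def should_block(
    history: List[Tuple[datetime, float]],
    now: datetime,
    window: timedelta = timedelta(hours=24),
    avg_threshold: float = 12.0,
) -> bool:
    """True if the average score of recent comments exceeds the threshold."""
    recent = [
        points
        for when, points in history
        if timedelta(0) <= now - when <= window
    ]
    if not recent:
        return False
    return sum(recent) / len(recent) > avg_threshold
```

Using the example above, five comments in the past 24 hours totaling 75 points average 15 per comment, which is over the assumed threshold of 12, so the address would be blocked until those comments age out of the window.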
“Viagra! Cialis! BUY NOW!!” Nearly everyone has seen comments or emails with messages like this. You can create a function that runs through a comment and counts the number of flagged words and assigns points based on the total number of those words in the message. One instance of the word “Viagra” might be legitimate (particularly if the comment is on a post about spamming), but the more these words are used, the less likely that the comment is legitimate. Keeping a word list can be excessive, because there is a new drug or other spam item every day, but it’s one of the easiest ways to identify spam.
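A counting function along these lines might look like the sketch below. The two-word list and the three points per hit are placeholders, and flagged_word_points is a hypothetical name; a real list would be much longer and would need regular upkeep.

```python
import re
from typing import Set

# Illustrative only; a real list would be longer and updated regularly.
FLAGGED_WORDS = {"viagra", "cialis"}


def flagged_word_points(
    text: str,
    words: Set[str] = FLAGGED_WORDS,
    points_per_hit: int = 3,
) -> int:
    """Count flagged words (case-insensitive) and score each occurrence."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    hits = sum(1 for token in tokens if token in words)
    return hits * points_per_hit
```

Because the score grows with every occurrence, a single mention in an otherwise normal comment stays cheap while a comment stuffed with flagged words adds up quickly. Note that this simple matcher will not catch deliberate misspellings.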
Age of post
It’s a fact: older posts are spammed more. The older a post is, the more time it has been floating around at places like Technorati, being scooped up by spam bots as a potential target. Some people completely block comments on old posts, but that’s a bit extreme to me. I’ve quit visiting blogs, because I had a legitimate comment that I couldn’t post. I felt like my opinion wasn’t important to the author, so I decided his/her opinion was no longer important to me. A better option is to either assign points or simply require moderation on all posts older than X days.
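If you go the points route, something like this would work. The 30-day grace period, the one point per additional 30 days, and the cap are all assumptions, and post_age_points is a hypothetical name.

```python
from datetime import date


def post_age_points(
    published: date,
    today: date,
    grace_days: int = 30,
    points_per_30_days: float = 1.0,
    max_points: float = 10.0,
) -> float:
    """No points for fresh posts, then a slow climb up to a cap."""
    age_days = (today - published).days
    if age_days <= grace_days:
        return 0.0
    extra = (age_days - grace_days) / 30 * points_per_30_days
    return min(max_points, extra)
```

Capping the score means an old post alone can push a comment toward moderation without ever marking it as spam outright, assuming the cap sits below your spam threshold.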
If your site allows creating accounts and posting while logged in, users who are logged in should be given bonus points (you could simply give them a negative number of spam points to offset some of the potential triggers and make it more likely that their messages go through right away). You should also try to track previous posters. If John (email@example.com) from http://johnsawesomeblog.com is commenting on your posts (legitimately) on a semi-regular basis, he should be recognized as a regular user and his posts should be less likely to be flagged as spam. Usually blogs collect email addresses from users but don’t share that address with other users, so that’s a good way of identifying a frequent commenter.
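The bonus could be as simple as the sketch below. It assumes you can look up how many previously approved comments are tied to the commenter's email address; the sizes of the bonuses and the name reputation_points are made up.

```python
def reputation_points(logged_in: bool, approved_before: int) -> int:
    """Return a negative score (a bonus) for trusted commenters."""
    points = 0
    if logged_in:
        points -= 5  # assumed bonus for posting from an account
    # One point off per previously approved comment, up to five.
    points -= min(max(approved_before, 0), 5)
    return points
```

Keeping the bonus capped means a regular commenter's account that gets hijacked can still be caught by the other checks.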
Trackback the trackbacks
If someone posts a trackback, your blog should automatically fetch that page (such as with cURL) to verify that it links to your site. This will stop the vast majority of trackback spam. The only problem is that the trackback could contain a bogus URL (e.g., http://blog.spammer.com/?sucker=10500dcd4bca8cb8f46279d8e61e4cd8) with a GET query that encodes your site’s address. The spammer’s site could easily generate a link to your blog based on the query, so your blog would see this link and consider the site legitimate. Currently, this isn’t a significant problem in the blogging world, but it’s something to be aware of, as it may become an issue later.
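The verification step might be sketched like this. It covers only the checking half: it assumes you have already downloaded the page's HTML by whatever means you prefer, and links_to_site is a hypothetical name. It will not defeat the query-string trick described above.

```python
from html.parser import HTMLParser
from typing import List, Optional
from urllib.parse import urlparse


class _LinkCollector(HTMLParser):
    """Collect the href of every <a> tag in a document."""

    def __init__(self) -> None:
        super().__init__()
        self.hrefs: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def _host(url: str) -> Optional[str]:
    try:
        return urlparse(url).hostname
    except ValueError:  # malformed URL
        return None


def links_to_site(html: str, my_host: str) -> bool:
    """True if any anchor in the page points at my_host."""
    collector = _LinkCollector()
    collector.feed(html)
    return any(_host(href) == my_host.lower() for href in collector.hrefs)
```

Comparing the parsed hostname (rather than searching the raw text for your address) means a page that merely mentions your URL, or carries it inside another site's query string, does not count as linking to you.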
Monitor your spam
Keep any spam you receive (at least for some time), because you can analyze it to improve your catching methods. If you see something that was just barely flagged as spam, see how you can guarantee it will be flagged next time. If you see a pattern, develop a way of catching that pattern. If a spam comment makes it through your system, find out why. Sometimes you can flag one comment as spam based on its similarity to a comment that was definitely spam (e.g., if a “V1AGRA” spam is caught and then the same username and email address try to submit another message, the second message is likely to be spam too).
Obviously, you can use none, all, or any mixture of these methods and modify them to meet your needs, but you may also consider a few alternatives. For instance, you can have a flag that sets a message as needing moderation regardless of the number of spam points it has (e.g., you might flag all trackbacks and comments with links as needing moderation if you are very specific about what links you allow on your site).
It’s typical to have a message appear explaining to the user what happened. If the message was considered spam, the user should know about it (on the off chance that it is a human after all). If the comment is being held for moderation, the user should know so s/he does not try to submit it again or think your blog is “broken.” Some developers choose to have messages that need moderation show up, but they add nofollow (e.g., <a href="http://potentialspammer.com" rel="nofollow">Cool site!</a>) to all links, preventing the user from improving his/her site’s search engine ranking (most search engines use the number of links to a site as part of the measure of the site’s importance).
As a last resort, you can use a CAPTCHA. It should always be optional, because users with vision impairments often cannot complete a CAPTCHA. By optional, I mean that it should give you “bonus points” for completing it rather than punishing you for not completing it. It can be used effectively when a comment would go into moderation by presenting the end user with the CAPTCHA as a means of getting their comment instantly approved.
The point system is also useful on a blog where all comments must be manually approved: sort the moderation queue by score so that you see the comments most likely to be legitimate first.
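Ordering the queue takes very little code. This sketch assumes each pending comment is stored as a dictionary with a "points" key; that layout and the name review_order are made up for illustration.

```python
from typing import Dict, List


def review_order(pending: List[Dict]) -> List[Dict]:
    """Return pending comments with the lowest spam score first."""
    return sorted(pending, key=lambda comment: comment["points"])
```

The original list is left untouched, so the same queue can also be shown in submission order if you prefer.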
I can’t emphasize enough the need for a point system. The more pieces of the puzzle you have to look at, the more accurate your guess will be. In my opinion, one spam message getting through is better than one legitimate comment being blocked. If you’re adamant that no spam is ever shown on your site, consider full moderation. If your site receives few comments, you can have it email you every time you receive a new comment or a comment in moderation. That allows you to quickly spot bogus comments that managed (or almost managed) to make it through your filters.
Remember that spam is used because it works. If you send out a message to a million people, some are likely to be suckered by it. That means that every spam comment you block makes spamming less cost-effective and just might dissuade future spammers. We can only hope…