On Comment Spam

30 December 2008

I noticed an article on Slashdot today called 'Smart Spam Filtering For Forums and Blogs?', where someone asks "Does anyone know of a good system for filtering spam".

In this post I reflect on comment spam and my approach to dealing with it.

Background

On this site I have previously talked about my past approaches to this topic:

  • My original post on the subject was called 'only the penitent man will pass - on captchas and cotton wool', where I explained why I do not go in for 'captchas' or other party games that make it harder for you, the reader, to leave a comment.
  • My next post was Akismet Blues, where I discovered that a third-party comment judging service called Akismet was suddenly blocking everyone, because upstream changes in the service had broken the blog plugin I was using at the time.
  • My third post on the subject was Django FreeComments cleanup script where I shared a command line cleanup script for Django FreeComments, a comments system that appeared in the previous version of Django.

Now that I have recoded the site to use Django 1.0 (read about that here), I have had to make these decisions again.

Principles

The aim of this site is to have a discussion about the topics I raise. The comments are not some optional add-on; they are core to what the site is about. Furthermore, my readers are busy people who take time out of their busy lives to share their wisdom, thoughts and views by leaving a comment here.

Therefore I respect that very highly. I have 1431 comments. I may not have responded individually to each of them, but I have read them all many times, and each time I have changed blog software I have written SQL to move them, backed them up, reformatted them, and so on.

This is a website about fairly advanced computing, which is aimed at adults. If someone can use a bash recipe or write a Python script, then if the comment spam fighting techniques fail, they have the intellect to handle a Viagra advert in the comments for a few minutes or hours.

Therefore my first principle is that the cost of a false positive is far greater than any number of false negatives. If there are false positives then they must not be permanent, and they must be easy to spot and reverse.

Secondly, as mentioned above, spam fighting should be server-side: it is my problem, not yours. I am not into making the reader guess the word, jump through hoops or pin the tail on the donkey while blindfolded and spun around.

What is a spam comment?

The commenting system here is for discussing the topic and communicating with other readers. Linking to a resource that is relevant to the discussion is clearly not spam. Many people, while leaving a relevant comment, will link to their own website as a kind of signature. This is clearly not spam.

A person or script using the comment system to leave links to irrelevant websites for financial gain, with no comment or something extremely generic, is comment spam. So obviously context is important.

Catching by URLs

Most genuine commentators paste the URL in the post directly. Those that want to mark up their text use the markup that I have specified.

Meanwhile spam comments have HTML or BBCode marked-up links in them, from a few to dozens or hundreds of them. So comments with HTML or BBCode marked-up links can be held for moderation. Comments with a large number of URLs can also be held for moderation. These little measures stop most of the indiscriminate spam scripts sent from zombie machines.
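
The two checks above could be sketched roughly as follows in Python. This is only an illustration: the function name, the regular expressions and the threshold of three URLs are my assumptions, not the code the site actually runs.

```python
import re

# Assumed patterns: an HTML anchor tag, or a BBCode [url] tag.
HTML_LINK = re.compile(r"<a\s[^>]*href", re.IGNORECASE)
BBCODE_LINK = re.compile(r"\[url(=|\])", re.IGNORECASE)
# Any bare http(s) URL pasted directly into the text.
BARE_URL = re.compile(r"https?://\S+", re.IGNORECASE)

MAX_URLS = 3  # hypothetical threshold, tune to taste


def should_hold_for_moderation(text, max_urls=MAX_URLS):
    """Return True if the comment should be held (never deleted outright)."""
    # Links marked up in HTML or BBCode are held straight away.
    if HTML_LINK.search(text) or BBCODE_LINK.search(text):
        return True
    # Otherwise hold only if there is an unusually large number of URLs.
    return len(BARE_URL.findall(text)) > max_urls
```

Because a match only holds the comment rather than deleting it, a false positive stays visible in the moderation queue and is easy to reverse.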

Catching by IP

This leaves the more discriminate spammers, those who probe and test to see what gets through and what does not. These appear to send multiple spams from the same IP address.

/images/posts/django/sameip.png

In the image above, you can see the admin panel showing 18 comments held for moderation. Highlighted in red are 16 comments from the same IP address. I moderate all the comments caught using a new version of my command-line tool. When a comment is confirmed as spam, it is deleted and the originating IP address is blocked, thus stopping repeat offenders.
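
The grouping and blocking steps might look something like this in Python. The `HeldComment` record, the in-memory `blocked_ips` set and the function names are all made up for illustration; the real tool works against the site's database and is not shown in this post.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class HeldComment:
    """Hypothetical stand-in for a comment held for moderation."""
    comment_id: int
    ip_address: str
    text: str


def repeat_offenders(held, minimum=2):
    """Return {ip: count} for IPs with at least `minimum` held comments."""
    counts = Counter(comment.ip_address for comment in held)
    return {ip: n for ip, n in counts.items() if n >= minimum}


def confirm_spam(comment, held, blocked_ips):
    """Delete a confirmed spam comment and block its originating IP."""
    held.remove(comment)
    blocked_ips.add(comment.ip_address)


def is_blocked(ip_address, blocked_ips):
    """Check an incoming comment's IP against the block list."""
    return ip_address in blocked_ips
```

In this sketch nothing is blocked automatically: an address only joins the block list after a human has confirmed one of its comments as spam.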

Extra herbs and spices

I have some other ingredients to my spam fighting recipe, but at the moment they are not being brought into play because the above ingredients seem to do the trick. One advantage of using Django and Soturi is that I can quickly reprogram the site if my spam fighting recipe stops being effective.

Since this approach seems to work, I personally have no desire to use some third-party service to filter my comments, because I have no easy way of knowing if or how it is likely to make false positives. Also, as this is a technology site, people often write comments involving command-line commands or source code, which get flagged easily by these filtering services.

Anyhow, that is my setup. I would be interested to hear what other people find useful.

1 Eion says...

Other than server-side processing of comments, I like to add additional <input>'s and hide them in external css. Most of the time the fields are populated by spam-bots, and if so, they get blocked.

No, not a perfect solution on its own but just another feather in the cap against comment spam.

Posted at 2:20 a.m. on December 30, 2008


2 Bug says...

Well... Sadly, and I guess you hate me for it, I use captcha. But at least it's not an image, so even if you visit using w3m [yey!] you can comment. I know it's not perfect and it could get annoying... I thought about making a cookie of passing the captcha once grants you captcha-less life for as long as the cookie is alive, but I never got around to it [Along with the setting rewrite mod].

I guess creating your site from scratch has disadvantages [spam], but I still love the advantages, it lets me play around with it as much as I want.

Posted at 6:33 a.m. on December 30, 2008


3 Zeth says...

Hi Eion, Yes that is an interesting approach also. It is the only approach given by default in the stock Django comments module, though it does not stop all comment spam.

Bug, well the simple easy text-based question is not too annoying, and if the question rotates occasionally then it is quite effective.

Posted at 2:30 a.m. on December 31, 2008


4 bug says...

@Zeth: The hidden field does block some. Not perfect, but it does release some weight from the filtering system, as those are 100% false comments.

Actually, if you had tried to post comments on my webbie, you'd see that it doesn't change. I didn't see a reason to implement one, because as long as my webbie isn't boing boing or anything, it won't have hand made spam. This magical captcha of mine blocks the automated kind of spam. But if someone does manually code it, yeah. It'd be a pain and I'd have to make the question change.

Posted at 7:46 a.m. on December 31, 2008


5 Ryan v. says...

Clever solution Eion. Sort of stumbled across a solution similar to that on my own (accidentally). Seems to be undefeated for the time being...

Posted at 12:03 a.m. on July 24, 2009


6 dahuletam says...

Having good content can only get you so far unless you also provide a good atmosphere to comment in

Posted at 7:43 p.m. on August 29, 2009


7 Sim says...

In relation to spam I think in most cases you need human judgment to decide whether it is a spam or not. I have used Akismet on my wordpress blog after hearing a lot of good things about it but I also think it is too too secure as their spam detecting system seems to be very strict. I don't mind linking to commenter as long as their comment is relevant to the post and is engaging with other commenter but some of these comments seem to be blocked off.

I do believe that many website owners comment for the purpose of building links towards their own sites however it does not necessarily mean they are spamming. As long as they are actually engaging with the blog post itself and with other commenter then it is not a spam.

Posted at 9:51 p.m. on September 15, 2009


8 expifesluse says...

Hi People How are you doing?

Posted at 6:29 p.m. on October 6, 2009


9 leifbk says...

Hi Zeth, from the most recent comments on this post it may appear like your comment spam blocking is a little bit broken :)

Posted at 11:15 a.m. on June 5, 2010


10 Zeth says...

Well as I talk about in the post, my approach is not really so much blocking spam. Rather I delete comments quickly and then block IP addresses. Also, as I said in the post, I don't really care too much if I have a certain amount of spam at any moment, as long as I am not preventing a real commentator getting through.

Posted at 12:04 a.m. on June 13, 2010


11 tahitian noni says...

Thank You For This Blog, was added to my bookmarks.

Posted at 11:38 p.m. on July 28, 2010


12 noni juice says...

I just sent this post to a bunch of my friends as I agree with most of what you’re saying here and the way you’ve presented it is awesome.

Posted at 12:03 a.m. on July 30, 2010


13 scuba says...

I’ve been visiting your blog for a while now and I always find a gem in your new posts. Thanks for sharing.

Posted at 7:09 p.m. on July 30, 2010


14 vemma2018 says...

I find myself coming to your blog more and more often to the point where my visits are almost daily now!

Posted at 8:24 p.m. on July 30, 2010



About

Hello, my name is Zeth, I'll be your host here.

Command Line Warriors is about taking control of your own technology, it looks at our experiences of computing; especially using GNU/Linux, the Python programming language, the command-line and issues such as techno-ethics, best practices and whatever is cool now. If you take control of your technology then you are a Warrior too!

This site is your site too which means that you can contribute and get involved. You can leave comments using the facility provided. For me, the comments and discussions are by far the best part of the site. So please do have your say!
