wingolog -- a mostly dorky weblog by Andy Wingo

99% spam (2021-03-08)
https://wingolog.org/2021/03/08/99-spam

Hey all, happy new year apparently! A quick service update on the old wingolog. For some time the site has been drowning in spam comments, despite my best efforts to point a bayesian classifier at the problem.

I don't keep logs of the number of attempts at posting comments that don't pass the classifier. But what I can say is that since I put in the classifier around 4 years ago, about 2500 comments a year made it through -- enough to turn the comment section into a bit of a dump. Icky, right??

At the same time of course, that's too many comments to triage manually, so I never got around to fixing the problem. So in fact I had two problems: lots o' spam, and lots o' incoming spam.

With regards to the existing spam, I took a heavy-handed approach. I took a look at all the good emails and URLs that people had submitted with comments prior to 2017, assuming those comments had already been triaged. Then I made a goodlist of comments since 2017 that included any of those emails or URLs. There were very few of those -- maybe 50 or 70 or so.
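For the curious, the goodlist check comes down to something like this sketch in Guile Scheme (hypothetical field and procedure names, not the actual triage script):

    (use-modules (srfi srfi-1))

    ;; Sketch: collect the emails and URLs seen on comments triaged
    ;; before 2017, then use that set to vouch for later commenters.
    ;; Each comment is assumed to be an alist with 'email and 'url keys.
    (define (known-good-set old-comments)
      (delete-duplicates
       (append-map (lambda (comment)
                     (filter string?
                             (list (assq-ref comment 'email)
                                   (assq-ref comment 'url))))
                   old-comments)))

    (define (known-poster? comment good-set)
      (or (member (assq-ref comment 'email) good-set)
          (member (assq-ref comment 'url) good-set)))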

Then I took a look at when comments were made relative to their parent posts. Turns out, 98.3% of comments were made more than 21 days after their parent post was published -- and sometimes years afterwards. I used that as a first filter: if a comment wasn't from a known poster, and was made 3 weeks or more after the post, I just classified it as spam.
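In code, that first-pass rule is just a predicate over the comment's age and the goodlist sketched above (timestamps here are plain seconds-since-the-epoch; again a sketch, not the actual script):

    ;; Sketch: a comment from an unknown poster, arriving three weeks
    ;; or more after its parent post, gets classified as spam outright.
    (define three-weeks (* 21 24 60 60))   ; 21 days, in seconds

    (define (presumed-spam? comment post-timestamp good-set)
      (and (not (known-poster? comment good-set))
           (>= (- (assq-ref comment 'timestamp) post-timestamp)
               three-weeks)))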

The list of comments made within 3 weeks of the parent post was small enough for me to triage manually, and there I was able to save a bit of wheat from the chaff. In the end, though, the result of winnowing was less than 1% of what went in.

As I don't really want to babysit this wobsite, I'll keep this policy in place in the future -- comments will be open for a while after articles are posted. Hopefully that should keep the ol' wingolog in tidy shape going forward, while still permitting people to comment -- something I have really appreciated in the past.
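As a sketch, that policy is a one-line predicate on post age; the three-week window here is just the cutoff from above carried forward, an assumption rather than a promise:

    ;; Sketch: comments stay open for a fixed window after publication.
    (define comment-window (* 21 24 60 60))  ; three weeks, in seconds

    (define (comments-open? post-timestamp now)
      (< (- now post-timestamp) comment-window))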

So happy 2021 to everybody, may the vaccine gods shine upon your shoulders, and happy hacking :)

it's probably spam (2017-03-06)
https://wingolog.org/2017/03/06/its-probably-spam

Greetings, peoples. As you probably know, these words are served to you by Tekuti, a blog engine written in Scheme that uses Git as its database.

Part of the reason I wrote this blog software was that from the time when I was using Wordpress, I actually appreciated the comments that I would get. Sometimes nice folks visit this blog and comment with information that I find really interesting, and I thought it would be a shame if I had to disable those entirely.

But allowing users to add things to your site is tricky. There are all kinds of potential security vulnerabilities. I thought about the ones that were important to me, back in 2008 when I wrote Tekuti, and I thought I did a pretty OK job on preventing XSS and designing-out code execution possibilities. When it came to bogus comments though, things worked well enough for the time. Tekuti uses Git as a log-structured database, and so to delete a comment, you just revert the change that added the comment. I added a little security question ("what's your favorite number?"; any number worked) to prevent wordpress spammers from hitting me, and I was good to go.
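The whole challenge check amounted to something like this (a sketch with a hypothetical name, not Tekuti's actual code):

    ;; Sketch of the original challenge: any answer that parses as a
    ;; number passes.
    (define (passes-challenge? answer)
      (number? (string->number answer)))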

Sadly, what was good enough in 2008 isn't good enough in 2017. In 2017 alone, some 2000 bogus comments made it through. So I took comments offline and painstakingly went through and separated the wheat from the chaff while pondering what to do next.

an aside

I really wondered why spammers bothered though. I mean, I added the rel="external nofollow" attribute on links, which should prevent search engines from granting relevancy to the spammers' links, so what gives? Could be that all the advice from the mid-2000s regarding nofollow is bogus. But it was definitely the case that while I was adding the attribute to commenters' home page links, I wasn't adding it to links in the comment body. Doh! With this fixed, perhaps I will just have to deal with the spammers I have and not attract even more spammers in the future.
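Assuming the pages are built as SXML, the fix is a matter of adding the attribute wherever a user-supplied link gets emitted -- a sketch, not the actual template:

    ;; Sketch: every anchor that comes from user-supplied content gets
    ;; the nofollow attribute, whether it's the commenter's home page
    ;; link or a link inside the comment body.
    (define (comment-link url text)
      `(a (@ (href ,url)
             (rel "external nofollow"))
          ,text))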

i digress

I started by simply changing my security question to require a number in a certain range. No dice; bogus comments still got through. I changed the range; could it be the numbers they were using were already in range? Again the bogosity continued undaunted.

So I decided to break down and write a bogus comment filter. Luckily, Git gives me a handy corpus of legit and bogus comments: all the comments that remain live are legit, and all that were ever added but are no longer live are bogus. I wrote a simple tokenizer across the comments, extracted feature counts, and fed that into a naive Bayesian classifier. I finally turned it on this morning; fingers crossed!

My trials at home show that if I train the classifier on half the data set (around 5300 bogus comments and 1900 legit comments) and then run it against the other half, I get about 6% false negatives and 1% false positives. The feature extractor interns sequences of 1, 2, and 3 tokens, and doesn't impose a lower limit on how often a feature must be seen -- a feature seen only once in bogus comments and never in legit comments is a fairly strong bogosity signal; as you have to make up the denominator in that case, I set it to indicate that such a feature is 99.9% bogus. A corresponding single feature in the legit set without an appearance in the bogus set is 99% legit.
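In rough outline, the feature extraction and per-feature estimate look something like this (a sketch of the idea, not the actual code; the plain count ratio below is a stand-in, but the 99.9%/99% caps for one-sided features behave as described):

    (use-modules (srfi srfi-1))

    ;; Tokenize a comment into lowercase words, then intern every run
    ;; of 1, 2 and 3 consecutive tokens as a feature.
    (define (tokenize text)
      (string-tokenize (string-downcase text) char-set:letter+digit))

    (define (features text)
      (let ((tokens (tokenize text)))
        (append-map
         (lambda (n)
           (let loop ((ts tokens) (acc '()))
             (if (< (length ts) n)
                 (reverse acc)
                 (loop (cdr ts)
                       (cons (string-join (list-head ts n) " ") acc)))))
         '(1 2 3))))

    ;; Given a feature's counts in the bogus and legit corpora, estimate
    ;; P(bogus | feature).  When a feature never appears in one corpus
    ;; the ratio has no meaningful denominator, so cap the estimate:
    ;; bogus-only features count as 99.9% bogus, legit-only ones as 99%
    ;; legit.
    (define (feature-bogosity bogus-count legit-count)
      (cond
       ((zero? legit-count) 0.999)
       ((zero? bogus-count) 0.01)
       (else (exact->inexact
              (/ bogus-count (+ bogus-count legit-count))))))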

Of course with this strong of a bias towards precise features of the training set, if you run the classifier against its own training set, it produces no false positives and only 0.3% false negatives, some of which were simply reverted duplicate comments.

It wasn't straightforward to get these results out of a Bayesian classifier. The "smoothing" factor that you add to both numerator and denominator was tricky, as I mentioned above. Getting a useful tokenization was tricky. And the final trick was even trickier: limiting the significant-feature count when determining bogosity. I hate to cite Paul Graham but I have to do so here -- choosing the N most significant features in the document made the classification much less sensitive to the varying lengths of legit and bogus comments, and less sensitive to inclusions of verbatim texts from other comments.
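That last trick, sketched (a reconstruction following the Graham recipe, not the actual code): rank features by distance from the neutral 0.5, keep only the N most significant, and combine just those.

    ;; Combine only the N per-feature probabilities furthest from 0.5,
    ;; so long comments and quoted text don't drown out strong signals.
    (define (classify-bogosity probs n)
      (let* ((significant (list-head
                           (sort probs
                                 (lambda (a b)
                                   (> (abs (- a 0.5)) (abs (- b 0.5)))))
                           (min n (length probs))))
             (p (apply * significant))
             (q (apply * (map (lambda (x) (- 1.0 x)) significant))))
        ;; Standard combination of independent per-feature estimates.
        (/ p (+ p q))))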

We'll see I guess. If your comment gets caught by my filters, let me know -- over email or Twitter I guess, since you might not be able to comment! I hope to be able to keep comments open; I've learned a lot from y'all over the years.