
Fighting image spam

August 23, 2006

This article was contributed by Jake Edge.

A number of spammers have been evading filters like SpamAssassin (SA) recently by encoding their messages as images. SA already has a set of rules that are meant to combat image spam, but the more recent messages (typically for stock scams or pharmacy products) have been crafted to avoid them. This would indicate, once again, that spammers are using SA to pre-test their messages and are modifying them to get through. SA developers, however, are up to the challenge and two specific countermeasures have been released.

The first technique, the FuzzyOCR plugin, uses Optical Character Recognition (OCR) software to pull words out of the images, then checks them against a blacklist of words to increase the SA score. It was quickly realized that spammers obfuscate the text in images the same way they have long done in text emails (misspelling words, using characters that look like others, etc.), so fuzzy matching was added to the plugin.
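The fuzzy blacklist idea can be sketched in a few lines of Python. This is only an illustration, not the plugin's code: the function names, the sample word list, and the 25% tolerance are assumptions made up for the example, and plain Levenshtein distance stands in for whatever matching the plugin really performs.

```python
# Hypothetical sketch of fuzzy blacklist matching on OCR output.
# Names, word list, and tolerance are illustrative assumptions.

def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance using the classic two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            cur.append(min(prev[j] + 1,          # delete ca
                           cur[j - 1] + 1,       # insert cb
                           prev[j - 1] + cost))  # substitute
        prev = cur
    return prev[-1]


def fuzzy_hits(ocr_text, blacklist, max_ratio=0.25):
    """Return the blacklist words that some OCR token approximately matches."""
    tokens = [t.lower() for t in ocr_text.split()]
    hits = set()
    for word in blacklist:
        # Allow more edits for longer words (assumed tolerance).
        limit = int(len(word) * max_ratio)
        if any(edit_distance(token, word) <= limit for token in tokens):
            hits.add(word)
    return hits


if __name__ == "__main__":
    print(fuzzy_hits("Hot st0ck alert for lnvestors",
                     ["stock", "investors", "pharmacy"]))
```

Scaling the allowed distance with word length keeps short words from matching almost anything, which is one plausible way to limit false positives.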

Unsurprisingly, there are already reports of images that put a light background of random 'snow' behind the text (example). This practice does not affect the readability for humans, but does affect the quality of the OCR output. The FuzzyOCR developers have quickly adapted by using a feature that removes smaller particles before doing the OCR scan. The question remains, of course, whether the OCR software will be able to keep up with obfuscations that will still be readable to humans. Human pattern matching may be too good for the state of the art in OCR.
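To make the "remove smaller particles" step concrete, here is a minimal Python sketch under stated assumptions: the image is already a binary bitmap (lists of 0/1), a "particle" is a 4-connected group of set pixels, and the `min_size` cutoff is an arbitrary illustrative value. The plugin delegates this to external image tools; this is not their implementation.

```python
from collections import deque

# Hypothetical despeckle sketch: drop connected groups of set pixels that are
# too small to be part of a letter. Bitmap format and min_size are assumptions.

def remove_small_particles(bitmap, min_size=4):
    """Return a copy of bitmap with 4-connected components < min_size cleared."""
    height = len(bitmap)
    width = len(bitmap[0]) if height else 0
    seen = [[False] * width for _ in range(height)]
    cleaned = [row[:] for row in bitmap]

    for y in range(height):
        for x in range(width):
            if not bitmap[y][x] or seen[y][x]:
                continue
            # Breadth-first search collects one connected component.
            component = []
            queue = deque([(y, x)])
            seen[y][x] = True
            while queue:
                cy, cx = queue.popleft()
                component.append((cy, cx))
                for ny, nx in ((cy - 1, cx), (cy + 1, cx),
                               (cy, cx - 1), (cy, cx + 1)):
                    if (0 <= ny < height and 0 <= nx < width
                            and bitmap[ny][nx] and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            if len(component) < min_size:
                for cy, cx in component:
                    cleaned[cy][cx] = 0  # erase the speck
    return cleaned
```

The trade-off is visible in the cutoff: set it too high and punctuation or the dots on letters disappear along with the snow.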

The plugin uses several external programs from the netpbm tools, the gocr open source OCR program, and several other libraries and Perl modules. This is a fairly heavy-handed approach, requiring a good bit of installation and configuration of the various pieces.

Another approach is the ImageInfo plugin, which does not require any external tools. It looks at the GIF and PNG headers of images in the email and calculates the area, in pixels, that they cover. Those values can be used in SA rules to increase the score of messages having the characteristics of the latest image spam. The current ruleset penalizes single images that are larger than 180K pixels as well as combinations of four or more images that total more than 180K. It seems very likely that the spammers will be using the plugin and testing their images, so this ruleset will likely have to evolve rather quickly.
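Reading image dimensions from the headers needs no image library at all, as the following Python sketch shows. It is an illustration rather than the plugin's code: the function names are invented, "180K" is assumed to mean 180,000 pixels, and "single images" is read as "messages carrying exactly one image". The byte offsets follow the published GIF and PNG file layouts.

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def image_area(data: bytes):
    """Return width * height in pixels from a GIF or PNG header, else None."""
    # GIF: 6-byte signature, then little-endian 16-bit width and height.
    if data[:6] in (b"GIF87a", b"GIF89a") and len(data) >= 10:
        width, height = struct.unpack("<HH", data[6:10])
        return width * height
    # PNG: 8-byte signature, 4-byte chunk length, "IHDR",
    # then big-endian 32-bit width and height.
    if (data[:8] == PNG_SIGNATURE and len(data) >= 24
            and data[12:16] == b"IHDR"):
        width, height = struct.unpack(">II", data[16:24])
        return width * height
    return None


def exceeds_area_rules(images, limit=180_000):
    """Apply the two area rules described above (assumed interpretation)."""
    areas = [a for a in (image_area(d) for d in images) if a is not None]
    if len(areas) == 1 and areas[0] > limit:
        return True                      # one large image
    return len(areas) >= 4 and sum(areas) > limit  # many images, large total
```

Because only the first couple dozen bytes of each attachment are inspected, a check like this costs almost nothing compared with an OCR pass.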

It is interesting to watch the battle over our email inboxes as the level of cleverness of the spammers seems to be increasing over time. This is clearly an arms race and one that spam filtering developers will have to stay on top of for the foreseeable future. Long term solutions to the problem do not seem to exist and this incremental measure-countermeasure war is here to stay.


Index entries for this article
GuestArticles: Edge, Jake



Fighting image spam

Posted Aug 24, 2006 1:17 UTC (Thu) by dskoll (subscriber, #1630) [Link]

gocr is the right way to go, but "fuzzy" matching rules are the wrong way to go. It's better just to feed the output of gocr into your Bayes database and let the statistical algorithm lock on to the words.

You'll quickly find that "INVESToR", "Indudry", "accomplishmems" and
so on become incredibly strong indicators of spaminess. gocr's common
mistakes actually make it *easier* for Bayes to lock on.
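A toy version of this idea, as a hedged Python sketch: raw OCR tokens, misreadings included, go straight into a naive Bayes classifier. The class name and training strings are invented for illustration, and this is a textbook classifier with add-one smoothing, not SpamAssassin's Bayes implementation.

```python
import math
from collections import Counter

# Hypothetical sketch: train a tiny naive Bayes model on unmodified OCR output.
# Tokens are kept case-sensitive so OCR misreadings act as distinct features.

class TinyBayes:
    def __init__(self):
        self.counts = {"spam": Counter(), "ham": Counter()}
        self.messages = {"spam": 0, "ham": 0}

    def train(self, label, text):
        self.messages[label] += 1
        self.counts[label].update(text.split())

    def spam_probability(self, text):
        vocab = set(self.counts["spam"]) | set(self.counts["ham"])
        total_msgs = sum(self.messages.values())
        scores = {}
        for label in ("spam", "ham"):
            total_tokens = sum(self.counts[label].values())
            # Log prior with add-one smoothing.
            score = math.log((self.messages[label] + 1) / (total_msgs + 2))
            for token in text.split():
                # Log likelihood with add-one smoothing; the extra +1 in the
                # denominator reserves mass for unseen tokens.
                score += math.log((self.counts[label][token] + 1)
                                  / (total_tokens + len(vocab) + 1))
            scores[label] = score
        # Clamp before exponentiating to avoid overflow on long texts.
        diff = max(min(scores["ham"] - scores["spam"], 700.0), -700.0)
        return 1.0 / (1.0 + math.exp(diff))


if __name__ == "__main__":
    nb = TinyBayes()
    nb.train("spam", "INVESToR Indudry accomplishmems")
    nb.train("ham", "meeting notes attached investor")
    print(nb.spam_probability("INVESToR alert"))
```

Whether this works in practice depends on the OCR engine making consistent mistakes, which is exactly the point disputed further down the thread.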

Fighting image spam

Posted Aug 24, 2006 1:25 UTC (Thu) by flewellyn (subscriber, #5047) [Link]

I agree. Especially since there are legitimate uses for image attachments in email.

Fighting image spam

Posted Aug 24, 2006 2:18 UTC (Thu) by Ross (guest, #4065) [Link]

Do spammers use the same misspellings over and over? If not (they aren't that stupid), there are an incredibly large number of combinations for each word. Especially when you throw in "looks like" numbers, symbols, and alternate letters. Spammy words would never already be in the database.

Fighting image spam

Posted Aug 24, 2006 9:28 UTC (Thu) by nix (subscriber, #2304) [Link]

The point is that *gocr* makes the same mistakes over and over, acting as, in effect, a 'this word came from an image' tag. :)

Fighting bloated email messages

Posted Aug 24, 2006 2:26 UTC (Thu) by bignose (subscriber, #40) [Link]

The solution to this is the same as it's always been. Send and accept only email message bodies that are plain text. Putting images, HTML pages, Word documents, or any other junk in the message body is an invitation to abuse.

Note that this doesn't mean *attachments* are junk, if all parties involved want them. And things like OpenPGP signatures, that are small and aren't required for the message body to be read, don't detract from this either.

If we use MUAs that display only static, local, textual information in the message header and body, and *don't* treat the message as an executable program or network-retrieval script or image-display request, this sort of spam wouldn't exist: There'd be no point sending an attached image that purported to be the message body, because MUAs wouldn't interpret it that way.

If, on the other hand, we invite everyone to send us any old crap as a message body and attachments, and expect our MUA to unquestioningly display all the attachments it can get its hands on, this sort of spam can only flourish.

Fighting bloated email messages

Posted Aug 24, 2006 4:20 UTC (Thu) by sjj (subscriber, #2020) [Link]

Absolutely. Even better, only read email on green-on-black serial terminals. That'll teach 'em!

Fighting bloated email messages

Posted Sep 2, 2006 18:53 UTC (Sat) by Baylink (guest, #755) [Link]

I regularly use Mutt 1.2.5 in a PuTTY window; the annoyances it occasionally causes are *vastly* outweighed by the trouble it keeps me out of.

Fighting bloated email messages

Posted Aug 24, 2006 9:29 UTC (Thu) by nix (subscriber, #2304) [Link]

We probably already do this: but to make it not worth it to the spammers to send this sort of stuff, we'd need to adjust the behaviour of their OE-using normal targets.

And *that* is hard.

Fighting bloated email messages

Posted Aug 25, 2006 5:12 UTC (Fri) by bignose (subscriber, #40) [Link]

> We probably already do this

Anyone who doesn't want this spam is included in that "we", so no, "we" are not already doing this. Some of us are, some of us are not. If we stop thinking of "them" and "us" and see that all users of email who hate spam are "us", perhaps we can improve the situation.

Fighting bloated email messages

Posted Aug 25, 2006 16:00 UTC (Fri) by nix (subscriber, #2304) [Link]

I was actually considering 'them' to be 'email newbies who use Outlook Express or similar program and are barely aware that alternatives exist'.

I don't think many such people are going to be LWN subscribers :)

Fighting bloated email messages

Posted Aug 25, 2006 18:08 UTC (Fri) by tzafrir (subscriber, #11501) [Link]

The approach of pine is different:

Allow HTML, but only a very limited subset of it. No images. No implicit
following of external links. It is possible to follow links in a message
to a browser (a highly useful operation) and to see basic HTML formatting.

Devious

Posted Aug 24, 2006 3:30 UTC (Thu) by gjmarter (guest, #5777) [Link]

The spammers probably do not care if SpamAssassin can figure out how to OCR their images. If SA is successful enough, then they have a new tool to help them solve CAPTCHAs.

Unexpected benefit?

Posted Aug 24, 2006 6:41 UTC (Thu) by eru (subscriber, #2753) [Link]

On the other hand, this arms race might help advance the state-of-the-art in free OCR software, which would be a good outcome...

Unexpected benefit?

Posted Aug 24, 2006 7:17 UTC (Thu) by drag (guest, #31333) [Link]

Hell ya.

High quality OCR software for open source software would rock. It would be useful for a whole bunch of different things.

For example.. improved handwriting recognition, maybe? Increased accessibility for disabled people would be another purpose. Make it easier and cheaper for people to scan in massive amounts of written and printed documents for archival purposes.

If they can get that improved then it would make the fight against spam much more worthwhile than just fighting spam. :)

Plus it would be great if SA is able to outwit the spammers with the aid of the super cow powers of open source. Because this would be simply something that is not possible with closed source software. The anti-virus vs the pro-virus people have proven time and time again that crafty folks on irc channels can defeat big corporate money.

Unexpected benefit?

Posted Aug 24, 2006 9:45 UTC (Thu) by NAR (subscriber, #1313) [Link]

but might help defeat some of the current defense strategies used against web forum spamming (i.e. put an image on the registration page which is hard for a computer to read but easier for a user, and have the user type the string in the image into a text field)...

Bye,NAR

Unexpected benefit?

Posted Aug 24, 2006 12:50 UTC (Thu) by eru (subscriber, #2753) [Link]

You mean "captchas". One way to make OCR ineffective against them is to use pictures of something, instead of strings. For example, show photos of well-known people, buildings, or animal species and ask the person to type the name. For a bot to beat that, it would have to do sophisticated pattern-matching against a large library of pictures, making it infeasible for spammers.

Unexpected benefit?

Posted Aug 24, 2006 16:13 UTC (Thu) by NAR (subscriber, #1313) [Link]

I'm afraid it would be really hard to create a set of images that every user would recognize - after all, "well known" is a relative concept.

Bye,NAR

Unexpected benefit?

Posted Aug 24, 2006 18:36 UTC (Thu) by mattdm (subscriber, #18) [Link]

The other benefit would hopefully be that sites would stop using CAPTCHAs.

Devious

Posted Aug 24, 2006 10:52 UTC (Thu) by robdinn (guest, #30753) [Link]

I didn't know what a CAPTCHA was, but wikipedia came to the rescue:

http://en.wikipedia.org/wiki/Captchas

CAPTCHAs are easy to beat

Posted Aug 26, 2006 4:58 UTC (Sat) by dlang (guest, #313) [Link]

all it takes are pictures of naked females.

just set up a porn site and you can get a LOT of people willing to prove that they are human by solving the CAPTCHAs that you harvest from the site you want to penetrate. You probably have enough traffic that you can do this in real time: you fetch the image from the target, present it to the next person to access the site, and use the result against your target.

Fighting image spam

Posted Aug 24, 2006 4:02 UTC (Thu) by jwb (guest, #15467) [Link]

A third way is to calculate the frequency of the pixel-to-pixel transitions in the image. Images which contain mostly text will be high frequency, photographs and other real images will have a normal frequency distribution, and diagrams and such will have mostly the same color. Some code for this recently crossed the SpamAssassin developer mailing list. I tried it out, and with a small modification to work with animated GIFs, it successfully classified 20 instances of spam and legitimate images.
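One simple way to measure this, sketched in Python under loud assumptions: the image is a grid of 0-255 greyscale values, only horizontally adjacent pixels are compared, and the threshold of 64 is arbitrary. This is not the code from the mailing list, just an illustration of the measurement.

```python
# Hypothetical sketch: fraction of neighbouring pixel pairs with a sharp
# brightness jump. Input format and threshold are illustrative assumptions.

def transition_rate(pixels, threshold=64):
    """Return the share of horizontal neighbour pairs that differ sharply."""
    transitions = 0
    pairs = 0
    for row in pixels:
        for left, right in zip(row, row[1:]):
            pairs += 1
            if abs(left - right) >= threshold:
                transitions += 1
    return transitions / pairs if pairs else 0.0
```

Rendered text alternates rapidly between ink and background, so its rate should sit well above that of a smooth photograph or a mostly uniform diagram; where to draw the cutoffs would have to be tuned on real mail.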

Fighting image spam

Posted Aug 24, 2006 4:34 UTC (Thu) by warmcat1 (guest, #31975) [Link]

These CPU-intensive solutions won't scale.

Greylisting remains the most powerful tool until the spamming software gets smarter.
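For readers unfamiliar with it, greylisting temporarily rejects mail from an unknown (client IP, sender, recipient) triplet and accepts it when a well-behaved server retries later. A minimal Python sketch, where the class name, the 5-minute delay, and the one-day expiry are assumptions for illustration rather than any particular implementation's defaults:

```python
# Hypothetical greylisting sketch. Delay and expiry values are assumptions.

class Greylist:
    def __init__(self, min_delay=300, max_age=86400):
        self.min_delay = min_delay  # seconds a new triplet must wait
        self.max_age = max_age      # seconds before an entry is forgotten
        self.first_seen = {}

    def check(self, client_ip, sender, recipient, now):
        """Return "defer" (temporary failure) or "accept" for a delivery."""
        key = (client_ip, sender.lower(), recipient.lower())
        first = self.first_seen.get(key)
        if first is None or now - first > self.max_age:
            # Unknown or expired triplet: remember it and ask for a retry.
            self.first_seen[key] = now
            return "defer"
        if now - first < self.min_delay:
            return "defer"  # retried too quickly
        return "accept"


if __name__ == "__main__":
    g = Greylist()
    print(g.check("192.0.2.1", "a@example.com", "b@example.org", now=0))
    print(g.check("192.0.2.1", "a@example.com", "b@example.org", now=600))
```

The check is a dictionary lookup, which is why it scales so much better than OCR; it only works for as long as spam software does not bother to retry.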

Fighting image spam

Posted Aug 24, 2006 8:39 UTC (Thu) by dion (guest, #2764) [Link]

I agree, greylisting is a very useful technique, combined with a realtime database of currently spamming machines it's almost undefeatable.

The tricky bit is getting the realtime database fast and accurate enough to classify a spamming server before the first greylisting wears out.

Fighting image spam

Posted Aug 24, 2006 7:40 UTC (Thu) by pointwood (guest, #2814) [Link]

The ultimate solution would be to hire Chuck Norris (http://www.chucknorrisfacts.com/) and let him hunt them down :p

There is no silver bullet solution that will kill spam. If there were, spam wouldn't be a problem by now.

I don't get much spam though, thanks to various spam filters and greylisting.

Fighting image spam

Posted Aug 25, 2006 16:02 UTC (Fri) by nix (subscriber, #2304) [Link]

No, silver bullets won't work. Garlic will.

(sorry)

Fighting image spam

Posted Aug 24, 2006 8:16 UTC (Thu) by ekj (guest, #1524) [Link]

Filtering is an endless arms-race. I don't think it can be won.

I personally think the solution will be some sort of reputation-system, preferably web-of-trust like.

All you'd need would be people signing their email-messages, and a repository storing signed statements to the effect that certain public-keys belong to non-spammers.

Fighting image spam

Posted Aug 24, 2006 8:50 UTC (Thu) by NAR (subscriber, #1313) [Link]

> All you'd need would be people signing their email-messages, and a repository storing signed statements to the effect that certain public-keys belong to non-spammers.

I'm afraid this wouldn't work for addresses like lwn@lwn.net, which has to be able to receive e-mails from any e-mail addresses - and addresses like these get the most spam.

Bye,NAR

Fighting image spam

Posted Aug 24, 2006 9:33 UTC (Thu) by evgeny (subscriber, #774) [Link]

Such addresses, actually, shouldn't exist. What's wrong with a web-form for sending comments? The latter is much easier to protect from spam.

Fighting image spam

Posted Aug 24, 2006 9:52 UTC (Thu) by NAR (subscriber, #1313) [Link]

> Such addresses, actually, shouldn't exist. What's wrong with a web-form for sending comments?

Imagine that you're responsible for the PR of a (linux-specific) project. There is a new release, so you type up an announcement. Today you'll probably have a list in your addressbook containing addresses like pr@lwn.net so you simply send your announcement to these addresses. Imagine if you wouldn't have this list, instead you'd have a folder in your browser's bookmarks containing the "Press Release Forms" of all relevant publications and you'd have to copy&paste the announcement into these forms one by one. It wouldn't be nice.

Bye,NAR

Fighting image spam

Posted Aug 24, 2006 10:02 UTC (Thu) by evgeny (subscriber, #774) [Link]

1. Some hard work (like copy&paste and then clicking on a button) is a nice addition to the air-bubbling activities PR agents are doing most of the time.
2. In most cases, either or both of the two parties participating in PR announcements (the sender & the recipient) get some cash as a result. A tiny part of it probably justifies fighting the spam if email submission is a must (which I'm still not sure about).
3. PR is the last thing I'd worry about. The lwn's email you mentioned initially as an example is actually used for any comment submission, which in 99.99% are not PR, I believe.

Fighting image spam

Posted Aug 24, 2006 17:21 UTC (Thu) by bronson (subscriber, #4806) [Link]

It was an *example*. There are thousands of others just like it. And, you appear to hold PR people in very low esteem.

Fighting image spam

Posted Aug 24, 2006 19:35 UTC (Thu) by evgeny (subscriber, #774) [Link]

> It was an *example*.

Give a better one then.

> And, you appear to hold PR people in very low esteem.

Yes, I do. At least those who tend to post their releases to a huge amount of emails (otherwise doing it from web forms wouldn't take too much time anyway).

Fighting image spam

Posted Aug 24, 2006 10:04 UTC (Thu) by rwmj (subscriber, #5474) [Link]

You seem to speak as someone who doesn't have an open contact form on their site ...

Rich.

Fighting image spam

Posted Aug 24, 2006 10:13 UTC (Thu) by evgeny (subscriber, #774) [Link]

Believe me, I do. Yes, web forms get spammed, too. But fighting this spam is much easier. In the worst possible case, just pipe the form output through SA.

Fighting image spam

Posted Aug 25, 2006 13:26 UTC (Fri) by kpower (guest, #37136) [Link]

Removing one easily accessible open method of communication just moves the spam target elsewhere. If it's not email, then it's a web form, an IM address or other method. The digital methods are all nearly alike in their ease to spam, and will likely remain that way into the near future.

Business requires open, easy to use communication to survive. Without a means for the outside world to communicate to and with, a business will wither and die. This is different than a single private individual outside of a business context, who doesn't need the same type of openness in communication.

A variety of open communication channels are available; all are subjected to spam (at least in the USA). Email, web forms, web forums, news groups, IM, telephone, fax, postal, all receive spam in greater or lesser quantities.

It's not just PR, but Sales, prospective clients, prospective employees and others that rely upon this open channel. Without the open channel, how will the communication be initiated?

Unprofitable?

Posted Aug 25, 2006 5:18 UTC (Fri) by GreyWizard (guest, #1026) [Link]

I'm skeptical of reputation systems, but if one could be effectively deployed and only addresses like lwn@lwn.net could receive spam wouldn't spam become unprofitable?

Fighting image spam

Posted Aug 25, 2006 21:21 UTC (Fri) by pimlott (guest, #1535) [Link]

This is why, ideally, trust-based controls would work in tandem with economics-based controls. Senders you don't trust as much have to pay more (possibly with a refund if they prove trustworthy). Unfortunately, we're a long way from having either one on a wide scale.

The ultimate spam solution

Posted Aug 26, 2006 21:16 UTC (Sat) by giraffedata (guest, #1954) [Link]

I guess I don't know the technique you're alluding to. It's not clear to me how I would get someone to vouch for me as a non-spammer.

I don't think it's really possible to determine adequately whether a person is a spammer, and more to the point, whether a piece of email came from a spammer, so I think the ultimate solution will be financial accounting. If we can require payment to send an email into the Internet, the spammers and hammers will sort themselves out. As will the people with worm-infected computers.

Remember that what makes a spam annoying to you is the same thing that makes it worthless to the sender: you're not interested. If the spam cost the sender as much to send as it costs you to read/delete it, it wouldn't get sent.

But both plans assume it's possible to install a whole new system of email on the Internet. Looking at the failures so far, I'm starting to doubt we'll ever see a systemic fix to spam. Something like filtering that can be done one mailbox at a time may be our only hope.

OT: Does usable free OCR software exist?

Posted Aug 24, 2006 13:32 UTC (Thu) by debacle (subscriber, #7114) [Link]

I tried different OCR programs (not sure, but possibly everything available in Debian: clara, gocr, ocrad), but all of them produced at least one "typo" per word, often much more. Typing is faster and less error-prone.

OT: Does usable free OCR software exist?

Posted Aug 24, 2006 17:27 UTC (Thu) by bronson (subscriber, #4806) [Link]

It totally depends on your source images. High resolution scans of clean, printed text tend to do pretty well. Blurry or misregistered scans (i.e. from a book without first tearing out the page) are trouble. If any letters bleed together, you're sunk.

I agree that Linux OCR is nascent but, even so, one typo per word is way too high. I would guess that your source images are flawed somehow?

OT: Does usable free OCR software exist?

Posted Aug 24, 2006 18:08 UTC (Thu) by debacle (subscriber, #7114) [Link]

You are right, the input was not perfect: Google scans of a mid 18th century book. Still, I had hoped that the OCR software would do better.

OT: Does usable free OCR software exist?

Posted Aug 25, 2006 19:25 UTC (Fri) by leoc (guest, #39773) [Link]

One thing you might want to try is to "posterize" the image you are scanning. I wrote a perl script to use gocr to scan my satellite television source for channels that I do not receive (they are the ones that come up with a screen of text that says something to that effect), and I found that I had to posterize the output to 2 colours (black text on white background) before gocr could read any text off them.
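Posterizing to two colours amounts to thresholding each pixel. A minimal Python sketch, assuming the image is a grid of 0-255 greyscale values and, when no threshold is given, using the image's mean brightness as the cutoff (an assumption for illustration, not what the Perl script above does):

```python
# Hypothetical sketch: reduce a greyscale image to pure black and white
# before handing it to an OCR engine. The mean-brightness default is assumed.

def posterize_two_colours(pixels, threshold=None):
    """Map every pixel to 0 (black) or 255 (white)."""
    if threshold is None:
        values = [v for row in pixels for v in row]
        threshold = sum(values) / len(values) if values else 128
    return [[255 if value >= threshold else 0 for value in row]
            for row in pixels]
```

For light text on a dark background the result would need inverting first, since the goal described above is black text on a white background.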

There is no need to actually OCR the image

Posted Aug 26, 2006 4:43 UTC (Sat) by spitzak (guest, #4593) [Link]

There is no need to actually read the text in the image. All that is needed is a way to detect that the image *is* text. Since there is no legitimate reason to send text as an image, this will indicate spam. I'm pretty sure it is far easier to determine that an image contains a lot of text with 99% certainty than it is to determine that the image's text contains a certain word.

There is no need to actually OCR the image

Posted Aug 28, 2006 17:43 UTC (Mon) by tack (guest, #12542) [Link]

I occasionally receive scanned newspaper articles that may be of interest to me.

I prefer the approach of OCRing the image and filtering that through a Bayesian classifier. One might be able to use some of the techniques described here to first optionally determine if the image likely contains a lot of text, and only then OCR it, which would help out with the CPU overhead.

Fighting image spam

Posted Sep 1, 2006 23:14 UTC (Fri) by decoder (guest, #40285) [Link]

Hello all, I am the developer that invented the FuzzyOcr plugin and I just wanted to pick up some things I've read here:

- Using the text recognized by OCR for bayes/other SA rules/etc:

Most gocr output can be really bad, completely unsuitable for the standard SA rules. Also you'd feed more junk to bayes (the output often contains a lot of junk too) than useful words, and no, gocr does not always make the same mistakes on words. It depends on colors, fonts, size, etc. and that will cause bayes to be ineffective.

- Can we keep up with spammers obfuscating their images?

Yes and no. On one side, spammers will always find a way to avoid our detection methods, hence we try to keep the internals secret. But that won't work forever, so spammers will evolve. But on the other side, we have to keep one thing in mind: These pictures are still ads. Ads are supposed to have an appeal, and more obfuscation leads to less appeal. Comparing this to captchas is not suitable, as a captcha does not have this intention. And a captcha can hardly be so appealing as a normal ad can be.

- It takes up many resources:

Yes, it does. OCR will always take up many resources. But FuzzyOcr already has many improvements that save actual OCR passes. For example, the new image hash system filters out up to 40% of the image spam (depending on what spam you get) after it has been recognized once, hence saving OCR runtime.

- Legitimate uses of image attachments:

Yes, they are there. And FuzzyOcr is actually the most friendly plugin concerning people that send images in their mails. Most images that get sent in ham are not even near getting detected by FuzzyOcr. The only class of images that can cause false positives are screenshots. But with the new word threshold tweaking, you can eliminate almost all "true" false positives (caused by too fuzzy matching or too common words). However, if the screenshot really contains spam words, then this will most likely produce a false positive.

- Greylisting > *.

I agree that greylisting stops a lot of spam. But in what time do you actually live? At our university, there is greylisting in place and even though we have greylisting, we still get TONS of spam. Greylisting worked nice and fine for quite some time, but the latest stats show that the effectiveness decreased heavily.

- Does usable free OCR software exist?

Yes it does... For those that aren't satisfied with gocr, you should try tesseract. This OCR was once a commercial engine and was released just today. It will be included in FuzzyOcr in approximately 8-9 weeks as an experimental feature.

- No need to actually read the text, text means spam:

See the screenshot false positive situation. For most situations though, you are right.

- Spammers don't care about OCR:

Wrong... see http://users.own-hero.net/~decoder/forgiving26.gif just as an example. They actually do care and they are not dumb at all.

So, what do I recommend as a general anti-spam recipe for the future, you ask?

In my opinion, it would already help A LOT, if stuff like DKIM was used by everyone. We are far away from that, but if everyone was using it, then this would be the first step to a more controlled spam flow. This would surely not stop the spam, but it would change how spam is sent, it would have to be sent from valid domains, forcing the spammers to get easier trackable/blockable, costing them money and everything that is connected to that.

My 2 cents :)

Chris

Tarproxy

Posted Sep 2, 2006 18:51 UTC (Sat) by Baylink (guest, #755) [Link]

tarproxy, TarProxy.

Sheesh.

If the top 20 ISP's paid Marty $10K a piece (or a buck a sub :-), he'd have a version of this that would handle ISP wire-speed email done within a year, and spam would *stop*.

There would be no practical way to make it cost-effective to send anymore.

FuzzyOcr hit Debian yesterday !

Posted Dec 19, 2006 14:08 UTC (Tue) by liotier (guest, #42321) [Link]

FuzzyOcr hit Debian unstable yesterday ! Mail server administrators rejoice ! Somebody must have been even more pissed off than me about image spam and decided to make the Debian packaging work… Installing FuzzyOcr on a Debian server is now trivially easy. I have installed FuzzyOcr with great success and reported about my experience.

Captcha detection

Posted Dec 25, 2006 11:52 UTC (Mon) by aprilmay (guest, #42405) [Link]

Months after this thread, has someone found a good solution ?

Maybe the best way would be to detect that an image is indeed a captcha, no need to do ocr. I don't see any reason for a real mail to send a captcha.

I guess this process would be far less CPU expensive than the full OCR (which could always be defeated by a new captcha builder imho).


Copyright © 2006, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds