chris-allen-lane.com Technology | Security | Making

Processing MD5 Suppression Lists - a Tool for Affiliate Marketers

tags: +foss +wordpress +ruby

Affiliate marketers will from time to time have to process what’s called an “MD5 suppression list". In brief, an MD5 suppression list is a list of email addresses which a marketer must remove from her mailing lists, in order to comply with the CAN SPAM Act of 2003, and respect the rights of individuals to opt-out of email marketing campaigns.

An MD5 suppression list is simply a file containing a long list of MD5 hashes of unsubscribers’ email addresses, the hashing being a security measure designed to prevent unscrupulous marketers from using suppression lists themselves as sources for obtaining more email addresses to use in email marketing campaigns.

To use a suppression list, an email marketer must compare each hash in the suppression list against an MD5 hash of each contact in her mailing lists. A matched pair of MD5 hashes indicates that an email address has been found in the suppression list, and thus must be removed from the marketer’s email lists.¬†(The mechanic here, obviously, is similar to how user passwords are hashed before being stored in a database.)

Recently, at work, I had to process a 2 gigabyte suppression list (of about 62 million rows) from Groupon. To my surprise, I didn’t find any readily available tools to do this, and thus, rolled my own.

The tool is (unimaginatively) called md5-suppression-list-match, and is available on Github. It’s a relatively small (<300 lines) Ruby script that is designed to run on Unix/Linux systems, and can process lists of any size (tested up to 2G) using a SQLite database. For faster performance (and smaller lists), it can also be configured to run entirely in RAM as an in-memory hash.

More detailed instructions are available in the README.

If you find the tool useful, have a feature request, or otherwise have feedback, feel free to let me know in the comments.

Update: 3 Dec 2011

After using this tool for a few days, I was able to improve its performance dramatically by solving the problem via a different algorithm. As such, I’ve pushed major revisions to the script, which are currently available at the same URL on github.

The new tool is dramatically¬†faster - lists that used to take me 10+ hours to process now get processed in around 20 minutes. It’s also significantly simpler to configure, and will now automatically produce both a whitelist and a blacklist with each run.