Archive for August, 2008

Spam Filtering techniques

Sunday, August 31st, 2008

Spam is a problem of enormous proportions. Current estimates figure that over 80% of all email is spam.

Some time ago I wrote a post about some changes to the configuration of my mail server that cut down the spam drastically. I thought I might take a moment to talk about the various techniques that are used to combat spam.

Some terminology I’m going to use:

  • spam - unwanted email
  • ham - wanted email
  • false positive - ham that is marked as spam
  • client - mail client, eg Thunderbird, Outlook
  • server - mail server, eg Exchange, postfix
  • host - someone who hosts servers
  • Joe Job - when spam is sent using the email address of someone else

Bayesian Filtering

This approach was popularised by Paul Graham. The idea is that you break a message up into tokens and check each token against a database of tokens. Each token in the database has a score for how spammy it is. The individual scores are combined to give a score for the email, and emails are then rejected or allowed based on that score. This requires that you train your filter on collections of spam and ham.
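To make the scoring step concrete, here is a minimal sketch in Python. It is not Graham's actual formula: it assumes whitespace tokenising, add-one smoothing, equal weight for spam and ham, and combining token scores by summing log-odds. The class and method names are made up for illustration.

```python
import math
from collections import Counter


class TinyBayesFilter:
    """Toy token-based spam scorer (illustrative, not a production filter)."""

    def __init__(self):
        self.spam_tokens = Counter()  # token -> number of spam messages containing it
        self.ham_tokens = Counter()   # token -> number of ham messages containing it
        self.spam_messages = 0
        self.ham_messages = 0

    @staticmethod
    def tokenize(text):
        # Assumption: lower-cased whitespace splitting is good enough here.
        return set(text.lower().split())

    def train(self, text, is_spam):
        tokens = self.tokenize(text)
        if is_spam:
            self.spam_tokens.update(tokens)
            self.spam_messages += 1
        else:
            self.ham_tokens.update(tokens)
            self.ham_messages += 1

    def token_log_odds(self, token):
        # Add-one smoothing so unseen tokens never produce log(0).
        p_spam = (self.spam_tokens[token] + 1) / (self.spam_messages + 2)
        p_ham = (self.ham_tokens[token] + 1) / (self.ham_messages + 2)
        return math.log(p_spam / p_ham)

    def spam_probability(self, text):
        # Sum the per-token log-odds, then squash the total to a 0..1 score.
        log_odds = sum(self.token_log_odds(t) for t in self.tokenize(text))
        return 1 / (1 + math.exp(-log_odds))
```

You would call train() on each message in your spam and ham collections, then reject anything whose spam_probability() is above a threshold you choose.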

Spammer responses

  • Replacing letters with numbers (v1agra) or adding in spaces. This is generally pretty ineffective.
  • Attempting to poison the filters with random text
  • Delivering their payload as an image

Advantages

  • Generally cuts spam significantly (>75%)
  • Can be configured and trained to specific needs
  • Can be run on the client (eg Thunderbird) or the server

Disadvantages

  • CPU intensive, a burden borne by the receiver of the email.
  • Doesn’t tend to scale well over an organisation. One person’s spam is another person’s ham.

Realtime Black List (RBL)

An RBL works by storing a list of IP addresses or IP address blocks that are known to send spam. When a server receives a HELO request, it checks the IP address of the sender against the RBL. If the IP address matches a known spammer IP address, it refuses the email. One issue with RBLs is that they are often easy to get on to and hard to get off. In addition, some RBLs take the view that even if just a single IP address is being used to send spam, they should ban the whole block to encourage the host not to allow spammers on their network. This tends to punish the innocent along with the guilty.
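Many such lists are published over DNS: the server reverses the octets of the sender's IPv4 address, appends the list's zone and looks the resulting name up. A minimal sketch in Python, assuming that convention; the zone rbl.example.org is a placeholder and not a real list, and the function names are invented for illustration.

```python
import ipaddress
import socket


def dnsbl_query_name(ip, zone):
    """Build the DNS name to look up for an IPv4 address in a DNS-based list."""
    # IPv4Address() validates the input before we split it into octets.
    octets = str(ipaddress.IPv4Address(ip)).split(".")
    return ".".join(reversed(octets)) + "." + zone


def is_listed(ip, zone):
    """Return True if the list answers for this address (needs network access)."""
    try:
        socket.gethostbyname(dnsbl_query_name(ip, zone))
        return True
    except socket.gaierror:
        # No such name: the address is not listed (or the lookup itself failed).
        return False
```

For example, checking 192.0.2.99 against the placeholder zone means looking up 99.2.0.192.rbl.example.org.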

Spammer responses

  • Find a host who will allow them to hop between IP addresses
  • DDOS against the RBL
  • Relay spam through zombies (generally home computers) on dynamic IP addresses

Advantages

  • Can have a significant impact on the amount of spam received
  • Runs at very little cost to the receiver of the email (no bandwidth spent receiving the email)

Disadvantages

  • It can be hard to get off an RBL if you get on one
  • The false positive rate can be quite high, depending on which RBL you choose
  • If you have a false positive, you never know about it

Whitelisting

This works by storing a list of valid email addresses or IP addresses (generally just email addresses) that your server will receive emails from. On its own this is not a terribly effective solution, as it severely limits the list of people you can receive email from. Whitelisting is typically used to exempt email from other checks (eg to avoid running bayesian filters over it).

Spammer responses

  • Joe job

Advantages

  • Can have a significant impact on the amount of spam received
  • Low requirements (bandwidth, computation)

Disadvantages

  • You can only receive email from email addresses/IP addresses on that list

Challenge - Response

This is really a variation on whitelisting for email addresses, with a dynamic whitelist. When someone who is not in your whitelist sends an email, an automatic email with a link goes back to them. Clicking on that link adds them to your whitelist.

Spammer responses

  • Joe job

Advantages

  • Can have a significant impact on the amount of spam received

Disadvantages

  • Places a burden of work on the people sending ham emails
  • Tends to work only if you have a small, known list of people who send you email

Greylisting

Greylisting is one of the more interesting ideas out there. The server checks an internal database to see whether the combination of sender, recipient and sender IP address matches that of an email that has already been delivered. If there is a match, the email is received. If not, the receiving server tells the sender that it is unable to receive the email at the moment and to retry after a delay. This eliminates a proportion of spam by delivering mail only from MTAs that comply with the standards for email, which expect a sender to retry after a temporary failure. The real power of greylisting comes when it is coupled with RBLs: if the email is part of a spam run, by the time the sending MTA resends the email, the IP address is likely to be in an RBL.
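A minimal sketch of the triplet check in Python. It simplifies the description above: it remembers when each triplet was first seen and accepts once a minimum delay has passed, rather than tracking delivered mail. The in-memory dictionary, the 300 second delay and the names are assumptions for illustration.

```python
import time


class Greylist:
    """Toy in-memory greylist keyed on (sender IP, sender, recipient)."""

    def __init__(self, min_delay=300, clock=time.time):
        self.min_delay = min_delay  # seconds a new triplet has to wait
        self.clock = clock          # injectable so the delay can be simulated
        self.first_seen = {}        # triplet -> time of the first attempt

    def check(self, client_ip, sender, recipient):
        """Return 'accept' or 'defer' for one delivery attempt."""
        triplet = (client_ip, sender.lower(), recipient.lower())
        now = self.clock()
        # Record the first attempt; later attempts reuse the stored time.
        first = self.first_seen.setdefault(triplet, now)
        if now - first >= self.min_delay:
            return "accept"
        # A real server would answer with a temporary (4xx) SMTP failure here.
        return "defer"
```

A sender that never retries stays deferred forever, which is the whole point.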

Spammer responses

  • Running a compliant MTA helps

Advantages

  • Low bandwidth/CPU cost

Disadvantages

  • Delays some emails from arriving immediately

SenderID and SPF

SenderID and SPF are two approaches to dealing with one aspect of spam: Joe jobs. Both add records to a domain's DNS that list the IP addresses allowed to send email for that domain. Of the two, SenderID is technically the better tool; however, Microsoft (the creator of SenderID) has patented parts of it. This makes it impossible to implement in most Open Source mail servers (postfix, qmail, sendmail, exim, etc), which make up a significant proportion of all mail servers. As a result we are unlikely to see SenderID implemented.
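To show what the receiving side does with such a record, here is a deliberately partial sketch in Python. It only understands the ip4 and ip6 mechanisms of an SPF record and ignores everything else (a, mx, include, qualifiers), so treat it as an illustration rather than a real SPF check. The function name and the example record are made up.

```python
import ipaddress


def spf_allows(record, client_ip):
    """Check an IP against the ip4/ip6 mechanisms of an SPF record.

    Deliberately partial: other mechanisms (a, mx, include) and all
    qualifiers are ignored, so anything not matched is treated as "no".
    """
    terms = record.split()
    if not terms or terms[0].lower() != "v=spf1":
        return False
    address = ipaddress.ip_address(client_ip)
    for term in terms[1:]:
        name, _, value = term.partition(":")
        if name.lower() in ("ip4", "ip6") and value:
            # strict=False accepts a bare address as well as a CIDR block.
            if address in ipaddress.ip_network(value, strict=False):
                return True
    return False
```

With the record "v=spf1 ip4:192.0.2.0/24 -all", mail arriving from 192.0.2.55 passes and mail from 203.0.113.9 does not.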

Spammer responses

  • Run an MTA that supports this

Advantages

  • Low bandwidth
  • Goes some way to deal with the Joe Job issue

Disadvantages

  • Not supported by all MTAs, likely to drop some ham

Blue Frog

As far as I am aware there was only one implementation of this. The basic idea was to make a single HTTP request to every link in every incoming email. This would bring the sites hosting the products sold by the spam to their knees through the sheer volume of requests. Even if the servers could handle the load, the increased cost of bandwidth would make the spamming uneconomic. Please note that this is not a DDOS, as it makes just one request for each incoming email.

Spammer responses

  • multiple DDOS

Advantages

  • Hurts the spammers, adds costs to them in proportion to the emails they send

Disadvantages

  • Not around any more :(. Unfortunately the DDOSes brought the service to an end.

The dropping cost of hardware

Saturday, August 23rd, 2008

One thing that never ceases to amaze me is the way that hardware continues to drop in cost. This really came home to me when I specced and built a couple of machines for my parents. My parents have the misfortune to have a son who knows his way around a computer and as a result has been able to keep their computers running far longer than they really should have. My mother’s computer was just over 11 years old this year when I replaced it, and had (from memory) 3 replacement power supplies, more RAM, 2 replacement HDD, 3 replacement DVD/CDRom drives, 1 replacement sound card, 2 replacement NICs.

My parents use their computers largely for email, surfing the web and editing the odd Word and Excel document. In this part of the market the AMD chips win hands down in bang for your buck. In the end I got something like this (monitors were not needed):

  1. AM2 4000
  2. nVidia chipset AT motherboard with integrated gfx & dual channel RAM
  3. 2xaGb DDR2 800 RAM
  4. DVD burner
  5. 160Gb 7200rpm Seagate HDD
  6. Antec case
  7. XP Home

For a total of $485 (AUD) per machine.

All name brand parts, none of them really bottom of the market. To keep this in perspective, under 10 years ago I paid ~$800 (AUD) for a 700MHz slot A Athlon for the first computer I ever built, the total cost of the computer was.

The crazy thing about this is that these computers are quite frankly overpowered for their needs. There are people who need more (gamers, video editors, graphic artists, programmers), but these computers are overpowered for most people’s needs. Even then, moving to a Core2 Duo and an ATX motherboard, adding a larger HDD and adding a gfx card would likely still keep the price under $1000 (AUD); you could probably get it below the price of my prized slot A Athlon processor.

Interestingly that processor is still running … it is in the machine that currently hosts this website.

The other interesting part of this purchase is that the OS makes up $109 of that $485, or 22% of the total. For comparison, the OS was under 10% of the cost of the machine this replaced. This should be a warning to Microsoft, particularly when there are other credible alternatives.

The greatest engine of the air in WWII

Tuesday, August 19th, 2008

I love history, particularly the first 50 years of the 20th century. While reading about aircraft in WWII, I noticed something interesting: a good proportion of the best aircraft on the Allied side were powered by the same engine, the Rolls-Royce Merlin.

Why was the engine so important? It determines the speed that an aircraft can fly at, the range of the aircraft and, to a lesser extent, the ceiling. A faster aircraft can attack and quickly reposition for another attack, and a faster aircraft can escape attacks.

Among the fighters: Spitfires, Hurricanes and probably the greatest fighter of the war, the P-51 Mustang (powered by a Packard-built Merlin). Among the bombers: the Lancaster, probably the greatest heavy bomber of the war (with the possible exception of the B-29).

Most of the aircraft listed here were pivotal in their own way.

The Battle of Britain was won by Hurricanes and Spitfires (and to a certain extent the small fuel tanks of the German fighters crossing the Channel). The Spitfire went through 34 revisions and was still in service at the end of the war, an icon of the Battle of Britain.

The P-51’s performance made it one of the best fighters of the war but, more importantly, with drop tanks it had the range to go all the way to Berlin from England. This enabled the Allies to fly escort on the day bombing missions, drastically reducing losses.

The Mosquito was one of the most versatile aircraft of the war, remaining the fastest aircraft in Bomber Command until the end of the war. The Mosquito was used extensively in reconnaissance, as a medium bomber, for marking targets and even as a nightfighter. The Mosquito could fly at close to the performance of most of the Axis fighters and still deliver 4000lbs of bombs.

The Lancaster was the backbone of the British night bombing offensive. The Lancaster was famous for the Dam Busters raid and the sinking of the Tirpitz.

The Rolls-Royce Merlin was certainly the best engine among the Allied forces. It isn’t inconceivable that the course of the war might have been different had the engine not been built. Of course there were other interesting aircraft to come out of the war.

Exception handling

Saturday, August 9th, 2008

I read a recent post that complained about the lack of error handling in Twitter.

My problem with this is that while the author is unhappy with the error handling in Twitter, no reasonable solution is provided.

In my (admittedly limited) experience of web applications, exceptions fall into three basic categories.

  1. The code is broken somewhere
  2. Platform instability: this might be an issue in the hardware/software platform stack that the application runs on. For example your server might have a bad stick of RAM or there might be a bug in php/.net/tomcat etc.
  3. Load issues: the app is overloaded, resulting in inability to connect to the database, file locking etc

None of these three items (although to a lesser extent 3) are issues you can plan for. If you knew where the bugs in your code were, you would fix them (duh). If there is an issue in the platform you would either code around it or replace the defective parts of the platform. As for the last, load does interesting things, and it is hard to predict exactly what will break under load. In the end you can spend a lot of time writing code to handle expected load situations that do not occur.

My question is, what is the programmer supposed to do with these exceptions? At the least the error should be logged (with enough data to replicate it) for the development team so that they might be able to fix it.

You can take the tried and true option of throwing the whole thing into the user’s lap with a detailed error message. What is the user going to do with this? For a general user this is gobbledegook; even for a user who is a developer this only makes sense if they understand the application itself.

Or you can do what Twitter does: recognise that the information is essentially useless and simply apologise for the problem.
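The two halves of that approach (log the detail for the developers, apologise to the user) can be sketched in a few lines of Python. This is a generic illustration, not how Twitter does it; the function name, the logger name and the short reference code are all made up.

```python
import logging
import traceback
import uuid

logger = logging.getLogger("webapp")


def handle_request(handler, request):
    """Run a handler; on failure log the details and return a polite apology."""
    try:
        return 200, handler(request)
    except Exception:
        # The reference lets a support request be matched to the log entry.
        reference = uuid.uuid4().hex[:8]
        logger.error("request %r failed (ref %s)\n%s",
                     request, reference, traceback.format_exc())
        # The user sees no stack trace, only an apology and the reference.
        return 500, "Sorry, something went wrong. Reference: " + reference
```

The stack trace and the request go to the log, where the development team can use them, and the user gets something they can quote if they report the problem.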

Jeff Atwood does get something right in this though: Twitter should try to let you know how long the site is going to be down for. However this is only really possible when the developers have assessed the situation resulting from the errors that have been logged and worked out how long the site/feature will be unavailable for.

Coding on whiteboards - interview procedure

Saturday, August 2nd, 2008

Edited to improve the code samples slightly. Also still tweaking the CSS to get the code to display better.

Update 2: just found the preserve code formatting plugin. Fighting wordpress (which was completely screwing up the code tag) was no fun.

A lot of people recommend including a practical test as part of an interview for a programming position. Quite a few people, including some notable people, recommend doing this on a whiteboard.

I think that this stinks: somebody trying to write code on a whiteboard is no reflection on their abilities as a programmer. It isn’t just that it is so different to the way people normally write code: it penalises people who write code well. It is good programming practice to design the skeleton and then to put some flesh on those bones. For example, I have got into the habit of writing closing braces for blocks as soon as I write the opening brace. In my mind there is no question that this is a good idea, but it is based on the assumption that the space between the braces is effectively infinitely expandable, which is the case when writing normal code but not when writing code on paper or on a whiteboard.

Let’s take a simple function that retrieves some data from the database and writes it to the screen (C#, illustrative purposes only, not tested). I write code in multiple passes. The first pass through might look something like this:


// TODO: retrieve data

// TODO: loop through data

// TODO: write totals

The next pass would fill some of that in:


// retrieve data
DataTable data = this.GetData();

// loop through data
foreach (DataRow row in data.Rows)
{
  // one output row per data row
  TableRow tableRow = new TableRow();

  TableCell cell1 = new TableCell();
  cell1.Text = row["label"].ToString();
  tableRow.Cells.Add(cell1);

  TableCell cell2 = new TableCell();
  cell2.Text = row["amount"].ToString();
  tableRow.Cells.Add(cell2);

  this.Results.Rows.Add(tableRow);
}

// TODO: write totals

And some more in the next pass:


// retrieve data
DataTable data = this.GetData();

int total = 0;
// loop through data
foreach (DataRow row in data.Rows)
{
  TableRow tableRow = new TableRow();

  this.AddCell(tableRow, row["label"].ToString());

  this.AddCell(tableRow, row["amount"].ToString());

  total += Convert.ToInt32(row["amount"]);

  this.Results.Rows.Add(tableRow);
}

// write totals
TableRow totalRow = new TableRow();
this.AddCell(totalRow, "Total");
this.AddCell(totalRow, total.ToString());
this.Results.Rows.Add(totalRow);

// helper: append a text cell to a row
private void AddCell(TableRow row, string value)
{
  TableCell cell = new TableCell();
  cell.Text = value;
  row.Cells.Add(cell);
}

And probably a final pass, to alternate colours on the rows and set some styling on the total:


// retrieve data
DataTable data = this.GetData();

int total = 0;
// loop through data (indexed, so that row colours can alternate)
for (int i = 0; i < data.Rows.Count; i++)
{
  DataRow row = data.Rows[i];
  string style = "background-color:" + (i % 2 == 0 ? "#FFFFFF" : "#CCCCCC") + ";";

  TableRow tableRow = new TableRow();
  this.AddCell(tableRow, row["label"].ToString(), style);
  this.AddCell(tableRow, row["amount"].ToString(), style);
  total += Convert.ToInt32(row["amount"]);

  this.Results.Rows.Add(tableRow);
}

// write totals
TableRow totalRow = new TableRow();
this.AddCell(totalRow, "Total", "font-weight:bold");
this.AddCell(totalRow, total.ToString(), "font-weight:bold");
this.Results.Rows.Add(totalRow);

// helper: append a text cell to a row, with an optional inline style
private void AddCell(TableRow row, string value, string style)
{
  TableCell cell = new TableCell();
  cell.Text = value;
  if (style.Length != 0) cell.Attributes["style"] = style;
  row.Cells.Add(cell);
}

And normally this would have been broken out into a number of functions, but I think the point is clear. One of the most frustrating experiences of my life, technology-wise, was hand-writing code as part of an exam.

In more complex code this is even worse: when you start writing a block it is not clear how long it will be.