New Project

September 12th, 2008

I’ve decided to start a new project, and like all good home projects, it scratches an itch. I’ve also wanted to teach myself a new language or two, and the best way to do that is to write some code.

I read a lot of books. I’m pretty omnivorous, so I’ll happily read history, science fiction, fantasy, fiction and older literature. I also like to own books so that I can re-read them or refer to them later. So I have a lot of books.

I’m starting to run into problems organising all my books: I forget which books I have and where they are.

Obviously this calls for a database, but I have no inclination to type the author, title, publication date, ISBN etc. into a database for the (estimated) thousands of books I have.

So here is the project. I want to build a tool that can use a barcode scanner to retrieve the ISBN from a book, query an online database for the details and then perform operations on my own database. Operations would include adding a book, moving a book and getting rid of a book. In addition, the list of books will be published on a website, possibly with some filtering tools.

I was chatting to a friend about this and he suggested that this could be great for managing CDs and DVDs. That might make an interesting extension of the project.

Now I’m pretty sure that I could find something that does most of this, but I have been looking for a project to work on for a while and this looks like fun. I’ll release the whole thing under GPLv3 and possibly host it on SourceForge if it gets to a polished enough state.

The current plan is:

  1. Learn Python, my new language of choice (in progress)
  2. Work out how to get the barcode reader to work and retrieve the ISBN from a barcode. This is pretty pivotal to the project so I should work that out first
  3. Find an appropriate online store to query for the details of the book. Amazon is one option but I need to check out their ToS.
  4. Design the database. I’ll aim to make it as cross-platform as possible, but will probably use MySQL
  5. Build the class structure
  6. Build the GUI for adding books. This would also need a manual ISBN entry option, as not all books have barcodes. Current thinking favors using wxPython for this
  7. Build the web interface
  8. Install/build scripts.
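
As a rough sketch of step 2: assuming the scanner hands back the 13 digits printed under the barcode (an assumption, I haven't confirmed how any particular scanner behaves), the ISBN can be checked and, for the 978 prefix, converted to the older 10-digit form. The function names here are made up for illustration.

```python
def is_valid_isbn13(code):
    """Return True if code is 13 digits with a correct ISBN-13 check digit."""
    if len(code) != 13 or not code.isdecimal():
        return False
    # Digits are weighted 1, 3, 1, 3, ... and the weighted sum must be a multiple of 10.
    total = sum(int(d) * (1 if i % 2 == 0 else 3) for i, d in enumerate(code))
    return total % 10 == 0


def isbn13_to_isbn10(code):
    """Convert a valid 978-prefixed ISBN-13 to ISBN-10, or return None."""
    if not is_valid_isbn13(code) or not code.startswith("978"):
        return None
    body = code[3:12]  # drop the 978 prefix and the old check digit
    # ISBN-10 uses weights 10 down to 2 and a modulo 11 check digit (10 is written as X).
    total = sum(int(d) * (10 - i) for i, d in enumerate(body))
    check = (11 - total % 11) % 11
    return body + ("X" if check == 10 else str(check))


if __name__ == "__main__":
    scanned = "9780306406157"
    print(is_valid_isbn13(scanned), isbn13_to_isbn10(scanned))
```

Rejecting a bad check digit before querying anything online should catch most misreads cheaply, and the manual entry option in step 6 could reuse the same check.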

I’m really looking forward to getting my teeth into this. I’ll add updates as this progresses.

Spam Filtering techniques

August 31st, 2008

Spam is a problem of enormous proportions. Current estimates figure that over 80% of all email is spam.

Some time ago I wrote a post about some changes to the configuration of my mail server that cut down the spam drastically. I thought I might take a moment to talk about the various techniques that are used to combat spam.

Some terminology I’m going to use:

  • spam - unwanted email
  • ham - wanted email
  • false positive - ham that is marked as spam
  • client - mail client, eg Thunderbird, Outlook
  • server - mail server, eg Exchange, postfix
  • host - someone who hosts servers
  • Joe Job - when spam is sent using the email address of someone else

Bayesian Filtering

This was popularised by Paul Graham’s essay “A Plan for Spam”. The idea is that you break a message up into tokens and then check those tokens against a database. Each token in your database has a score for how spammy it is. The individual scores are combined to give a score for the email, and emails are then rejected or allowed based on that score. This requires that you train your filter on collections of spam and ham.
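
To make that concrete, here is a minimal sketch of a token-scoring filter. It is not Graham's exact algorithm or what any real filter ships: the class name, the tokeniser and the simple smoothing are all assumptions made for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Split a message into lower-case word-like tokens."""
    return re.findall(r"[a-z0-9$']+", text.lower())


class TokenFilter:
    def __init__(self):
        self.spam_counts = Counter()  # token -> number of spam messages containing it
        self.ham_counts = Counter()   # token -> number of ham messages containing it
        self.spam_total = 0
        self.ham_total = 0

    def train(self, text, is_spam):
        tokens = set(tokenize(text))
        if is_spam:
            self.spam_counts.update(tokens)
            self.spam_total += 1
        else:
            self.ham_counts.update(tokens)
            self.ham_total += 1

    def token_score(self, token):
        """Spamminess of one token, between 0 and 1 (0.5 means no evidence)."""
        # Add-one smoothing so unseen tokens never produce 0 or 1.
        spam_rate = (self.spam_counts[token] + 1) / (self.spam_total + 2)
        ham_rate = (self.ham_counts[token] + 1) / (self.ham_total + 2)
        return spam_rate / (spam_rate + ham_rate)

    def score(self, text):
        """Combine the token scores into one score for the whole message."""
        spam_product = 1.0
        ham_product = 1.0
        for token in set(tokenize(text)):
            p = self.token_score(token)
            spam_product *= p
            ham_product *= 1.0 - p
        return spam_product / (spam_product + ham_product)


if __name__ == "__main__":
    f = TokenFilter()
    f.train("cheap pills buy now", is_spam=True)
    f.train("buy cheap watches now", is_spam=True)
    f.train("meeting notes attached", is_spam=False)
    f.train("lunch tomorrow at noon", is_spam=False)
    print(f.score("buy cheap pills"), f.score("meeting tomorrow"))
```

A message would then be rejected when its score is above some threshold, which is where the per-user training matters: the same tokens can be spammy for one person and perfectly normal for another.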

Spammer responses

  • Replacing letters with numbers (v1agra) or adding in spaces. This is generally pretty ineffective.
  • Attempt to poison the filters with random text
  • Delivering their payload as an image

Advantages

  • Generally cuts spam significantly (>75%)
  • Can be configured and trained to specific needs
  • Can be run on the client (eg Thunderbird) or the server

Disadvantages

  • CPU intensive, a burden borne by the receiver of the email.
  • Doesn’t tend to scale well over an organisation. One person’s spam is another person’s ham.

Realtime Black List (RBL)

An RBL works by storing a known list of IP addresses or IP address blocks that send spam. When a server receives a HELO request, it checks the IP address of the sender against the RBL. If the IP address matches a known spammer IP address, it refuses the email. One issue with RBLs is that they are often easy to get on to and hard to get off. In addition, some RBLs take the view that even if just a single IP address is being used to send spam, they should ban the whole block to encourage the host not to allow spammers on their network. This tends to punish the innocent along with the guilty.
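
Most RBLs I know of are published over DNS: the server reverses the octets of the sender's IP address, appends the list's zone and does an ordinary lookup, and any answer means the address is listed. A small sketch of that check, where the zone name is a placeholder rather than a real list:

```python
import ipaddress
import socket


def dnsbl_query_name(ip, zone):
    """Build the DNS name to look up, e.g. 192.0.2.1 -> 1.2.0.192.<zone>."""
    octets = str(ipaddress.IPv4Address(ip)).split(".")
    return ".".join(reversed(octets)) + "." + zone


def is_listed(ip, zone):
    """Return True if the list answers for this address (needs network access)."""
    try:
        socket.gethostbyname(dnsbl_query_name(ip, zone))
        return True
    except socket.gaierror:
        # No such name: the address is not on the list (or the lookup failed).
        return False


if __name__ == "__main__":
    # "rbl.example.org" is a made-up zone used only to show the name format.
    print(dnsbl_query_name("192.0.2.1", "rbl.example.org"))
```

Note that a failed lookup is treated as "not listed" here, so mail still flows if the list is unreachable, for instance while it is being DDOSed.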

Spammer responses

  • Find a host who will allow them to hop between IP addresses
  • DDOS against the RBL
  • Relay spam through zombies (generally home computers) on dynamic IP addresses

Advantages

  • Can have a significant impact on the amount of spam received
  • Runs at very little cost to the receiver of the email (no bandwidth spent receiving the email)

Disadvantages

  • It can be hard to get off an RBL if you get on one
  • The false positive rate can be quite high, depending on which RBL you choose
  • If you have a false positive, you never know about it

Whitelisting

This works by storing a list of valid email addresses or IP addresses (generally just email addresses) that your server will receive emails from. In general this is not a terribly effective solution as it severely limits the list of people you can receive email from. It is typically used to exempt email from other tests (eg to avoid running Bayesian filters over it).

Spammer responses

  • Joe job

Advantages

  • Can have a significant impact on the amount of spam received
  • Low requirements (bandwidth, computation)

Disadvantages

  • You can only receive email from email addresses/IP addresses on that list

Challenge - Response

This is really a variation on whitelisting for email addresses, with a dynamic whitelist. When someone who is not in your whitelist sends an email, an automatic email with a link goes back to them. Clicking on that link adds them to your whitelist.

Spammer responses

  • Joe job

Advantages

  • Can have a significant impact on the amount of spam received

Disadvantages

  • Places a burden of work on the people sending ham emails
  • Tends to work only if you have a small, known list of people who send you email

Greylisting

Greylisting is one of the more interesting ideas out there. The receiving server checks an internal database to see whether the combination of sender, recipient and sender IP address matches an email it has seen before. If there is a match, the email is received. If not, the server tells the sender that it is unable to receive the email at the moment and to retry after a delay. This eliminates a proportion of spam by delivering mail only from MTAs that comply with the standards for email, which call for a retry after a temporary failure. The real power of greylisting comes when coupled with RBLs. If the email is part of a spam run, by the time the sending MTA resends the email, the IP address is likely to be in an RBL.
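
A minimal in-memory sketch of that decision, assuming a five-minute minimum delay (the class name, the delay and the response strings are illustrative; a real implementation would persist the database and expire old entries):

```python
class Greylist:
    def __init__(self, min_delay=300):
        self.min_delay = min_delay  # seconds a new triplet must wait before acceptance
        self.first_seen = {}        # (sender, recipient, ip) -> time first seen

    def check(self, sender, recipient, ip, now):
        """Return an SMTP-style response for a delivery attempt at time `now`."""
        triplet = (sender, recipient, ip)
        first = self.first_seen.get(triplet)
        if first is None:
            # Never seen before: remember it and ask the sender to come back later.
            self.first_seen[triplet] = now
            return "451 try again later"
        if now - first < self.min_delay:
            # Retried too quickly: still a temporary failure.
            return "451 try again later"
        return "250 accepted"


if __name__ == "__main__":
    greylist = Greylist()
    args = ("alice@example.org", "bob@example.net", "192.0.2.10")
    print(greylist.check(*args, now=0))
    print(greylist.check(*args, now=600))
```

The 4xx response is what makes this work: a standards-following MTA queues the message and retries, while a lot of spam software simply moves on to the next address.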

Spammer responses

  • Run a compliant MTA

Advantages

  • Low bandwidth/CPU cost

Disadvantages

  • Delays some emails from arriving immediately

SenderID and SPF

SenderID and SPF are two approaches to dealing with one aspect of spam: Joe jobs. Both add records to the DNS for a domain, listing the IP addresses that can send emails for that domain. Of the two, SenderID is technically the better tool; however, Microsoft (the creator of SenderID) has patented parts of it. This makes it impossible for it to be implemented on most Open Source mail servers (postfix, qmail, sendmail, exim, etc), which make up a significant proportion of all mail servers. As a result we are unlikely to see SenderID widely implemented.
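
As an illustration of the idea, an SPF record is a DNS TXT record along the lines of "v=spf1 ip4:192.0.2.0/24 -all". The sketch below checks a sending IP against only the ip4 and ip6 entries of such a record; it deliberately ignores the rest of SPF (include, a, mx, qualifiers and so on), the record is a made-up example, and the function names are my own.

```python
import ipaddress


def spf_networks(record):
    """Extract the networks named by ip4:/ip6: terms in an SPF record string."""
    networks = []
    for term in record.split():
        for prefix in ("ip4:", "ip6:"):
            if term.startswith(prefix):
                # strict=False accepts entries like 192.0.2.7/24 with host bits set.
                networks.append(ipaddress.ip_network(term[len(prefix):], strict=False))
    return networks


def sender_is_listed(record, sender_ip):
    """Return True if sender_ip falls inside one of the record's listed networks."""
    address = ipaddress.ip_address(sender_ip)
    return any(address in network for network in spf_networks(record))


if __name__ == "__main__":
    record = "v=spf1 ip4:192.0.2.0/24 ip4:198.51.100.7 -all"  # example record, not a real domain's
    print(sender_is_listed(record, "192.0.2.55"))
    print(sender_is_listed(record, "203.0.113.9"))
```

A receiving server would fetch the record for the domain in the envelope sender and treat mail from an unlisted address as a likely Joe job.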

Spammer responses

  • Run an MTA that supports this

Advantages

  • Low bandwidth
  • goes some way to deal with the Joe Job issue

Disadvantages

  • Not supported by all MTAs, likely to drop some ham

Blue Frog

As far as I am aware there was only one implementation of this. The basic idea was to make a single http request to all links in all incoming emails. This would bring the sites hosting the products sold by the spam to their knees by the sheer volume of requests. Even if the servers could handle the load, the increased cost of bandwidth would make the spamming uneconomic. Please note that this is not a DDOS, as it is making just one request for each incoming email.

Spammer responses

  • multiple DDOS

Advantages

  • Hurts the spammers, adds costs to them in proportion to the emails they send

Disadvantages

  • Not around any more :( . Unfortunately the DDOSes brought the service to an end.

The dropping cost of hardware

August 23rd, 2008

One thing that never ceases to amaze me is the way that hardware continues to drop in cost. This really came home to me when I specced and built a couple of machines for my parents. My parents have the misfortune to have a son who knows his way around a computer and as a result has been able to keep their computers running far longer than they really should have. My mother’s computer was just over 11 years old this year when I replaced it, and had (from memory) 3 replacement power supplies, more RAM, 2 replacement HDD, 3 replacement DVD/CDRom drives, 1 replacement sound card, 2 replacement NICs.

My parents use their computers largely for email, surfing the web and editing the odd Word and Excel document. In this part of the market the AMD chips win hands down in bang for your buck. In the end I got something like this (monitors were not needed):

  1. AM2 4000
  2. nVidia chipset AT motherboard with integrated gfx & dual channel RAM
  3. 2xaGb DDR2 800 RAM
  4. DVD burner
  5. 160Gb 7200rpm Seagate HDD
  6. Antec case
  7. XP Home

For a total of $485 (AUD) per machine.

All name-brand parts, none of them really bottom-of-the-market. To keep this in perspective, under 10 years ago I paid ~$800 (AUD) for a 700MHz Slot A Athlon for the first computer I ever built, the total cost of the computer was.

The crazy thing about this is that these computers are, quite frankly, overpowered for my parents’ needs, and for most people’s. There are people who need more: gamers, video editors, graphic artists, programmers. Even then, moving to a Core2 Duo and an ATX motherboard, adding a larger HDD and adding a gfx card would likely still keep the price under $1000 (AUD), and you could probably get it below the price of my prized Slot A Athlon processor.

Interestingly that processor is still running … it is in the machine that currently hosts this website.

The other interesting part of this purchase is that the OS makes up $109 of that $485, or 22%. For comparison, the OS was under 10% of the cost of the machine this replaced. This should be a warning to Microsoft, particularly when there are other credible alternatives.

The greatest engine of the air in WWII

August 19th, 2008

I love history, particularly the first 50 years of the 20th century. While reading about aircraft in WWII, I noticed something interesting: a good proportion of the best aircraft on the allied side were powered by the same engine, the Rolls-Royce Merlin.

Why was the engine so important? It controls the speed that an aircraft could fly at, the range of the aircraft and to a lesser extent the ceiling. A faster aircraft can attack and quickly reposition for another attack, a faster aircraft can escape attacks.

Among the fighters: Spitfires, Hurricanes and probably the greatest fighter of the war, the P51 Mustang (powered by a Packard-built Merlin). Among the bombers: Lancaster, probably the greatest heavy bomber of the war (with the possible exception of the B29).

Most of the aircraft listed here were pivotal in their own way.

The Battle of Britain was won by Hurricanes and Spitfires (and to a certain extent the small fuel tanks of the German fighters crossing the channel). The Spitfire went through 34 revisions and was still in service at the end of the war, an icon of the Battle of Britain.

The P51’s performance made it one of the best fighters of the war, but more importantly, with drop tanks, it had the range to go all the way to Berlin from England. This enabled the allies to fly escort on the day bombing missions, drastically reducing losses.

The Mosquito was one of the most versatile aircraft of the war, remaining the fastest aircraft in Bomber Command until the end of the war. The Mosquito was used extensively in reconnaissance, as a medium bomber, for marking targets and even as a nightfighter. The Mosquito could fly at close to the performance of most of the axis fighters and still deliver 4000lbs of bombs.

The Lancaster was the backbone of the British night bombing offensive. The Lancaster was famous for the dam buster raid and the sinking of the Tirpitz.

The Rolls-Royce Merlin was certainly the best engine among the allied forces. It isn’t inconceivable that the course of the war might have been different had the engine not been built. Of course there were other interesting aircraft to come out of the war.

Exception handling

August 9th, 2008

I read a recent post that complained about the lack of error handling in twitter.

My problem with this is while the author is unhappy with the error handling in twitter, no reasonable solution is provided.

In my (admittedly limited) experience of web applications, exceptions fall into three basic categories.

  1. The code is broken somewhere
  2. Platform instability: this might be an issue in the hardware/software platform stack that the application runs on. For example your server might have a bad stick of RAM or there might be a bug in php/.net/tomcat etc.
  3. Load issues: the app is overloaded, resulting in inability to connect to the database, file locking etc

None of these three items (although to a lesser extent 3) are issues you can plan for. If you knew where the bugs in your code were, you would fix them (duh). If there is an issue in the platform you would either code around it or replace the defective parts of the platform. As for the last, load does interesting things, and it is hard to predict exactly what will break under load; in the end you can spend a lot of time writing code to handle expected load situations that do not occur.

My question is, what is the programmer supposed to do with these exceptions? At the least the error should be logged (with enough data to replicate it) for the development team so that they might be able to fix it.

You can take the tried and true option of throwing the whole thing into the user’s lap with a detailed error message. What is the user going to do with this? For a general user this is gobbledegook, and even for a user who is a developer it only makes sense if they understand the application itself.

Or you can do what twitter does, recognise that the information is essentially useless and simply apologise for the problem.
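
A minimal sketch of that split, with the detail going to the log and only an apology going to the user. The handle_request wrapper and the reference id are my own illustration, not how Twitter or any particular framework does it.

```python
import logging
import uuid

logger = logging.getLogger("webapp")


def handle_request(handler, request):
    """Run handler(request); on any failure, log the detail and return an apology."""
    try:
        return 200, handler(request)
    except Exception:
        # The id lets a user's report be matched to the logged traceback.
        incident_id = uuid.uuid4().hex[:8]
        # logger.exception records the full traceback for the development team.
        logger.exception("Unhandled error %s while handling %r", incident_id, request)
        # The user gets no internals, just an apology and something to quote.
        message = "Sorry, something went wrong on our side. (Reference: %s)" % incident_id
        return 500, message


if __name__ == "__main__":
    def broken_handler(request):
        return 1 / 0  # stands in for a bug somewhere in the code

    print(handle_request(broken_handler, {"path": "/timeline"}))
```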

Jeff Atwood does get something right in this though: Twitter should try to let you know how long the site is going to be down for. However this is only really possible when the developers have assessed the situation resulting from the errors that have been logged and worked out how long the site/feature will be unavailable for.

Coding on whiteboards - interview procedure

August 2nd, 2008

Edited to improve the code samples slightly. Also still tweaking the CSS to get the code to display better.

Update 2: just found the preserve code formatting plugin. Fighting wordpress (which was completely screwing up the code tag) was no fun.

A lot of people recommend including a practical test as part of an interview for a programming position. Quite a few people, including some notable people, recommend doing this on a whiteboard.

I think that this stinks: somebody trying to write code on a whiteboard is no reflection on their abilities as a programmer. It isn’t just that it is so different to the way people normally write code: it penalises people who write code well. It is good programming practice to design the skeleton and then to put some flesh on those bones. For example, I have got into the habit of writing closing braces for blocks as soon as I write the opening brace. In my mind there is no question that this is a good idea, but this is based on the assumption that the space between the braces is effectively infinitely expandable, which is the case when writing normal code but not when writing code on paper or on a whiteboard.

Let’s take a simple function that retrieves some data from the database and writes it to the screen (C#, illustrative purposes only, not tested). I write code in multiple passes. The first pass through might look something like this:


// TODO: retrieve data

// TODO: loop through data

// TODO: write totals

The next pass would fill some of that in:


// retrieve data
DataTable data = this.GetData();

// loop through data
foreach (DataRow dataRow in data.Rows)
{
  TableRow row = new TableRow();

  TableCell cell1 = new TableCell();
  cell1.Text = dataRow["label"].ToString();
  row.Cells.Add(cell1);

  TableCell cell2 = new TableCell();
  cell2.Text = dataRow["amount"].ToString();
  row.Cells.Add(cell2);

  this.Results.Rows.Add(row);
}

// TODO: write totals

And some more in the next pass:


// retrieve data
DataTable data = this.GetData();

int total = 0;
// loop through data
foreach (DataRow dataRow in data.Rows)
{
  TableRow row = new TableRow();

  this.AddCell(row, dataRow["label"].ToString());

  this.AddCell(row, dataRow["amount"].ToString());

  total += Convert.ToInt32(dataRow["amount"]);

  this.Results.Rows.Add(row);
}

// write totals
TableRow totalRow = new TableRow();
this.AddCell(totalRow, "Total");
this.AddCell(totalRow, total.ToString());
this.Results.Rows.Add(totalRow);

// helper: create a cell with the given text and append it to the row
private void AddCell(TableRow row, string value)
{
  TableCell cell = new TableCell();
  cell.Text = value;
  row.Cells.Add(cell);
}

And probably a final pass, to alternate colours on the rows and set some styling on the total:


// retrieve data
DataTable data = this.GetData();

int total = 0;
// loop through data, by index so rows can alternate colours
for (int i = 0; i < data.Rows.Count; i++)
{
  DataRow dataRow = data.Rows[i];
  string style = "background-color:" + (i % 2 == 0 ? "#FFFFFF" : "#CCCCCC") + ";";

  TableRow row = new TableRow();
  this.AddCell(row, dataRow["label"].ToString(), style);
  this.AddCell(row, dataRow["amount"].ToString(), style);
  total += Convert.ToInt32(dataRow["amount"]);

  this.Results.Rows.Add(row);
}

// write totals
TableRow totalRow = new TableRow();
this.AddCell(totalRow, "Total", "font-weight:bold");
this.AddCell(totalRow, total.ToString(), "font-weight:bold");
this.Results.Rows.Add(totalRow);

// helper: create a styled cell with the given text and append it to the row
private void AddCell(TableRow row, string value, string style)
{
  TableCell cell = new TableCell();
  cell.Text = value;
  if (style.Length != 0) cell.Attributes["style"] = style;
  row.Cells.Add(cell);
}

And normally this would have been broken out into a number of functions, but I think the point is clear. One of the most frustrating experiences of my life, technology-wise, was hand-writing code as part of an exam.

In more complex code this is even worse: when you are writing the code it is not clear how long a block is.

Multiple failures suck

May 17th, 2008

So, someone might have noticed that my site has been down for a little while.

First my mail/web server died hard, complete hard drive failure. It was about this point that I discovered that my backup scripts were somewhat lacking. It was about this time that my file server died.

Everything backs up to the fileserver, which then backs up to a machine offsite and occasionally to a local desktop. However this doesn’t occur as regularly as it should. Piecing together the files remaining on my fileserver, the files from the offsite backup and the onsite backup got me most of my data back.

I almost lost 6 months of email, but managed to restore this from local stores of email.

All this has taken some time to get things back together, but I’m back online now.

Unfortunately I lost all the photos hosted here. I still have the actual photos, but the resized images and html pages are gone. And I’m somewhat disinclined to bring most of them back, given the work involved. So I’m going to have to delete most of the posts related to that.

MSDN Pricing

October 13th, 2007

I was looking to purchase a couple of new MSDN subscriptions for work recently and discovered something interesting. We were looking to purchase Visual Studio Professional MSDN Professional subscriptions.

The first interesting thing is that the pricing is the same for DVD and online distribution. This doesn’t make a whole lot of sense, the DVDs cost something to produce. The more interesting thing is the pricing around the world.

Go here and select Australia in the top right corner. Click on Buy direct from Microsoft and select the appropriate subscription. Price: $2084.

Now, go back to the first page and change to United States. Select “How to Buy”, select “Buy or renew MSDN Subscriptions directly from Microsoft” under “Buy from Microsoft” (you might need to clear cookies for this). Select the appropriate subscription level. Price $1199 US.

Converting that to AUD comes to $1,326.33. Even adding GST only brings it up to $1,458.96. Given that this is distributed online, there is no difference between distributing this product in Australia or the US. This is flat out ripping people off by over $600. Now it is possible that Microsoft hasn’t caught up with the freefall of the US dollar; however, in a global market this seems pretty silly.
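
For anyone who wants to check the sums, here they are spelled out. The exchange rate is not stated above, so the script only derives the rate implied by the two figures, and it assumes GST at 10%.

```python
AU_PRICE_AUD = 2084.00     # price quoted on the Australian site
US_PRICE_USD = 1199.00     # price quoted on the US site
US_PRICE_IN_AUD = 1326.33  # the converted figure quoted in the post
GST_RATE = 0.10            # assumption: Australian GST at 10%

# Exchange rate implied by the two US figures (AUD per USD).
implied_rate = US_PRICE_IN_AUD / US_PRICE_USD

# What the US subscription would cost in AUD once GST is added.
us_price_with_gst = US_PRICE_IN_AUD * (1 + GST_RATE)

# How much more the Australian price is than that.
premium = AU_PRICE_AUD - us_price_with_gst

if __name__ == "__main__":
    print("Implied rate: %.4f AUD per USD" % implied_rate)
    print("US price plus GST: $%.2f AUD" % us_price_with_gst)
    print("Australian premium: $%.2f AUD" % premium)
```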

I rang Microsoft (I do this occasionally, it doesn’t make any difference but I think it should be done) and a rather helpful guy couldn’t come up with an explanation. He did say I could buy a US subscription, but I would need to use support from the US.

Site re-arrangement

September 7th, 2007

I’ve rearranged the site a bit so it is a little more logical, putting the blog in a directory of its own rather than the root directory for the site.

Long time…

August 20th, 2007

It has been some time since I’ve written a post, largely due to being very busy with work. One thing of note has happened.

New laptop

I got a new work laptop, retiring the much loved Thinkpad T41.

The new laptop is a MacBook Pro, my first Mac. There were a number of really good reasons to get a Mac, but the major one was that we needed a Mac to test the websites we build in Safari. Also, with Parallels (and Boot Camp), I can work on Windows just fine. That is a good thing as I primarily write code that runs on Windows.

Some initial impressions after working with it for a little while:

  • The hardware is very pretty
  • OS X is really, really slick. Very user friendly. Just picking one example, when you’ve used the wireless configuration tools in OS X, anything else seems clumsy.
  • Parallels is really good at virtualising Windows. I write code for Windows, running IIS, SQL Server, Visual Studio and a host of other basic applications. The single issue I’ve run across to date is that SQL Server Profiler doesn’t seem to resume when you pause it.
  • The keyboard on the laptop is really, really annoying if you write code. No single page up/down keys. Function keys don’t work as single keys (an issue for Windows). The killer is that the home/end keys require two keys. Any and every programmer would hit these keys all the time.
  • Wide screen sucks for code. I want longer screens, not wider screens.

The keyboard and screen were almost enough for me to wish for the ThinkPad back.