Archive for the ‘Tech’ Category

Update

Saturday, March 6th, 2010

It has been a while between posts; apologies, I’ve been busy, especially with work. I also ran into a bug in WordPress when I tried to log in to create a post. Essentially, if you logged in with admin rights but were not an admin, and the install needed upgrading, it redirected endlessly.

So a quick tech summary of the last year, which I will expand on in further posts:

  • iPhone - I got one, it is great, highly recommended
  • Snow Leopard - I finally moved my laptop to Snow Leopard, seems a bit better
  • Windows 7 - I ran the RC for 8 months or so, looks good, just installed the RTM version on my desktop
  • Virtualisation - I’ve got very excited about virtualisation in the last 6 months, particularly VMware
  • Code - I’ve found a language worse than Perl: PowerBuilder
  • Library project in Python - I found LibraryThing and decided I didn’t have enough time to write my own

Occasional insomnia has benefits (part 2)

Saturday, January 24th, 2009

While looking at a new flash drive, I spent some time looking again at PortableApps. In case you’ve never seen this, portable apps are basically apps optimised to run from your flash drive. So you can walk up to another computer, plug in your flash drive and feel right at home. I’d always assumed that PortableApps would be stateless and that settings would not be remembered between uses. I was wrong.

The first discovery was just how well Thunderbird works from PortableApps. I run my own mail server, with remote access via IMAP over SSL. I have installed SquirrelMail, but I’ve found it to be generally pretty slow on my low end hardware. So being able to plug in a flash drive and just run Thunderbird for IMAP over SSL solves an immediate problem.

I was happy to discover that PortableApps also quite happily allows you to install plugins for Thunderbird and Firefox. This is great, as I generally have about 10 plugins installed for Firefox. For example, this means I can have bookmarks synchronised with Foxmarks.

Ideally for something like this, you want to have all the mail cached locally so there isn’t too much time spent loading up messages. However if I do this and I lose the drive, someone would have all my emails. Even if I did not cache all my emails, they would still have cached the headers, which is far too much information.

Enter TrueCrypt, true friend of the appropriately paranoid. TrueCrypt is an encryption program that can encrypt whole drives. It includes a traveller mode for flash drives, just what I was looking for. Using TrueCrypt, you can encrypt the entire PortableApps folder, ensuring that if you lose your drive your data is still safe. Unfortunately it seems to require admin privileges, but in most cases that should be possible to arrange.

I haven’t bought the new drive yet, I’ll report any issues when I get it.

Occasional insomnia has benefits

Saturday, January 17th, 2009

When I wake up I normally can’t get back to sleep, regardless of what time it is.

This happened recently when I was helping an organisation test a web application. I woke up at about 1am and couldn’t get back to sleep. By about 5am I’d tested everything and written out a full report for them.

This morning I couldn’t sleep again, so I worked on solving what has been a long running problem for me.

I was looking for a flash drive to replace my current rather battered 1GB drive. Once my flash drive sat happily on my keyring. First the keyring connector broke. Then the plastic casing came apart and I was left with this:
[image: flash drive]

It works just fine, sits in a jeans pocket well and has been through the wash safely at least half a dozen times. However I do leave it behind at times so I really want something that I can put on a keyring. The other problem is that I keep finding myself running short of space.

Unfortunately most flash drives have pretty fragile keyring links. There are some that don’t, but they seem to be ultra-rugged style drives that would be rather heavy on a keyring. Additionally, I don’t see the point of waterproofing flash drives. If mine has been through the laundry half a dozen times quite safely, what is the point? So long as it is dry inside and out when I plug it in, it should work just fine.

Fortunately I found the SanDisk Cruzer Titanium. Not too large, tough exterior and a good strong keyring link.

Book Project Update

Saturday, January 17th, 2009

So unfortunately I haven’t had as much time as I would have liked to work on my book library project. What with buying a unit, Christmas, going to NZ and helping test a software app on the side, I just haven’t had the time.

However I have made some progress.

I’ve done some basic work familiarising myself with Python. I’ve come to the conclusion that it would have to be one of the best languages I’ve ever worked with. The only way I can describe it is as if C++ and Perl reproduced and had a child that took the best features from both languages and none of the worst features. Glee! It is very easy to write nice clean code that makes sense.

I also bought a CueCat for scanning the barcodes. My reasoning was that it was cheap and was likely to have good support on many platforms since there are so many of them around. I’ve since found that it can be a little slow to scan barcodes, but it is certainly good enough for the moment. I’ve ported some code to decode the CueCat output from JavaScript to Python.

My original plan was to build a GUI for this using wxPython. Since then I’ve discovered that the CueCat doesn’t need complicated drivers; it just dumps encoded output similar to the way a keyboard does. So there is no real need for a heavy client on a desktop, and I can just skip to building the whole thing as a web application.

I’m currently looking at different frameworks, but at the moment Django looks pretty sweet.

New Project

Friday, September 12th, 2008

I’ve decided to start a new project, and like all good home projects, it scratches an itch. I’ve also wanted to teach myself a new language or two and the best way to do that is to write some code.

I read a lot of books, I’m pretty omnivorous so I’ll happily read history, science fiction, fantasy, fiction and older literature. I also like to own books so that I can re-read them or refer to them later. So I have a lot of books.

I’m starting to run into problems organising all my books, I forget which books I have and where they are.

Obviously this calls for a database, but I have no inclination to type in the author, title, publication date, ISBN etc into a database for the (est) thousands of books I have.

So here is the project. I want to build a tool that can use a barcode scanner to retrieve the ISBN from books, query an online database and do some operations on the database. Operations would include add book, move book and get rid of book. In addition the list of books will be published on a website, possibly with some filtering tools.
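A minimal sketch of those three operations, assuming Python’s built-in sqlite3 module stands in for whatever database the project ends up using. The table layout, the function names and the one-copy-per-ISBN rule are all hypothetical, chosen only to illustrate the idea.

```python
import sqlite3


def open_catalogue(path=":memory:"):
    """Open (or create) the catalogue. The schema is an assumption for illustration."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS books ("
        "isbn TEXT PRIMARY KEY, title TEXT, author TEXT, location TEXT)"
    )
    return db


def add_book(db, isbn, title, author, location):
    """Add book: record where a newly scanned book lives."""
    db.execute(
        "INSERT INTO books VALUES (?, ?, ?, ?)", (isbn, title, author, location)
    )


def move_book(db, isbn, new_location):
    """Move book: update the shelf or room for an existing entry."""
    db.execute("UPDATE books SET location = ? WHERE isbn = ?", (new_location, isbn))


def remove_book(db, isbn):
    """Get rid of book: delete the entry."""
    db.execute("DELETE FROM books WHERE isbn = ?", (isbn,))


def where_is(db, isbn):
    """Return the stored location, or None if the book is not in the catalogue."""
    row = db.execute(
        "SELECT location FROM books WHERE isbn = ?", (isbn,)
    ).fetchone()
    return row[0] if row else None
```

In this sketch the title and author are passed in by hand; in the real tool they would come from the online lookup.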

I was chatting to a friend about this and he suggested that this could be great for managing CDs and DVDs. That might make an interesting extension of the project.

Now I’m pretty sure that I could find something that does most of this, but I have been looking for a project to work on for a while and this looks like fun. I’ll release the whole thing under GPL3 and possibly host it on SourceForge if it gets to a polished enough state.

The current plan is:

  1. Learn python, my new language of choice (in progress)
  2. Work out how to get barcode reader to work and retrieve ISBN from barcode. This is pretty pivotal to the project so I should work that out first
  3. Find an appropriate online store to query for the details on the book. Amazon is one option but I need to check out their ToS.
  4. Design the database. I’ll aim to make it as cross platform as possible, but probably using MySQL
  5. Build the class structure
  6. Build the GUI interface for adding books. Would also need a manual ISBN entry option, not all books have barcodes. Current thinking favors using wxPython for this
  7. Build the web interface
  8. Install/build scripts.
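For step 2, once the scanner output has been decoded to a string of digits, the barcode on most books is a 13-digit EAN starting with 978, and the older 10-digit ISBN can be derived from it. The following is a rough sketch of that conversion only; it assumes the decoding has already happened, and the function names are made up for illustration.

```python
def ean13_is_valid(code):
    """Check the EAN-13 check digit: weights alternate 1, 3 from the left."""
    if len(code) != 13 or not code.isdigit():
        return False
    total = sum(int(d) * (3 if i % 2 else 1) for i, d in enumerate(code))
    return total % 10 == 0


def ean13_to_isbn10(code):
    """Convert a 978-prefixed EAN-13 barcode to an ISBN-10, or return None."""
    if not ean13_is_valid(code) or not code.startswith("978"):
        return None
    # Nine digits: drop the 978 prefix and the EAN check digit.
    body = code[3:12]
    # ISBN-10 check digit: weights 10 down to 2, modulo 11, with X for 10.
    total = sum(int(d) * w for d, w in zip(body, range(10, 1, -1)))
    check = (11 - total % 11) % 11
    return body + ("X" if check == 10 else str(check))
```

Rejecting codes with a bad check digit up front means a misread scan never reaches the online lookup.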

I’m really looking forward to getting my teeth into this. I’ll add updates as this progresses.

Spam Filtering techniques

Sunday, August 31st, 2008

Spam is a problem of enormous proportions. Current estimates figure that over 80% of all email is spam.

Some time ago I wrote a post about some changes to the configuration of my mail server that cut down the spam drastically. I thought I might take a moment to talk about the various techniques that are used to combat spam.

Some terminology I’m going to use:

  • spam - unwanted email
  • ham - wanted email
  • false positive - ham that is marked as spam
  • client - mail client, eg Thunderbird, Outlook
  • server - mail server, eg Exchange, postfix
  • host - someone who hosts servers
  • Joe Job - when spam is sent using the email address of someone else

Bayesian Filtering

This originated from Paul Graham. The idea was that you break a message up into tokens and then examine the tokens against a database of tokens. Each of the tokens in your database has a score as to how spammy the token is. The individual scores are combined to provide a score for an email. Emails are then rejected or allowed based on that score. This requires that you train your filter on collections of spam and ham.
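To make the token scoring concrete, here is a much simplified sketch. It is not Paul Graham’s exact formula: the tokeniser, the clamping values and the score for unseen tokens are assumptions chosen to keep the example short, and the class and method names are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Break a message into lower-case tokens."""
    return re.findall(r"[a-z0-9$'-]+", text.lower())


class TokenFilter:
    def __init__(self):
        self.spam_counts = Counter()  # token -> number of spam messages containing it
        self.ham_counts = Counter()   # token -> number of ham messages containing it
        self.spam_total = 0
        self.ham_total = 0

    def train(self, text, is_spam):
        tokens = set(tokenize(text))
        if is_spam:
            self.spam_counts.update(tokens)
            self.spam_total += 1
        else:
            self.ham_counts.update(tokens)
            self.ham_total += 1

    def token_score(self, token):
        """How spammy one token is, between 0.01 and 0.99."""
        s = self.spam_counts[token] / max(self.spam_total, 1)
        h = self.ham_counts[token] / max(self.ham_total, 1)
        if s + h == 0:
            return 0.4  # unseen token; this neutral-ish value is an assumption
        return min(0.99, max(0.01, s / (s + h)))

    def message_score(self, text):
        """Combine the token scores into one score for the whole message."""
        log_odds = 0.0
        for token in set(tokenize(text)):
            p = self.token_score(token)
            log_odds += math.log(p) - math.log(1.0 - p)
        log_odds = max(-50.0, min(50.0, log_odds))  # keep exp() in range
        return 1.0 / (1.0 + math.exp(-log_odds))
```

A message would then be rejected when `message_score` is above some threshold; where to put that threshold is exactly the part that needs tuning per user.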

Spammer responses

  • replacing letters with numbers (v1arga) or adding in spaces. This is generally pretty ineffective.
  • Attempt to poison the filters with random text
  • Delivering their payload as an image

Advantages

  • Generally cuts spam significantly (>75%)
  • Can be configured and trained to specific needs
  • Can be run on the client (eg Thunderbird) or the server

Disadvantages

  • CPU intensive, a burden borne by the receiver of the email.
  • Doesn’t tend to scale well over an organisation. One person’s spam is another person’s ham.

Realtime Black List (RBL)

An RBL works by storing a list of IP addresses or IP address blocks known to send spam. When a server receives a HELO request, it checks the IP address of the sender against the RBL. If the IP address matches a known spammer IP address, it refuses the email. One issue with RBLs is that they are often easy to get on to and hard to get off. In addition some RBLs take the view that even if just a single IP address is being used to send spam, they should ban the whole block to encourage the host not to allow spammers on their network. This tends to punish the innocent along with the guilty.
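The check itself is simple, which is part of the appeal. The sketch below holds the list locally to stay self-contained (public RBLs are normally queried over DNS instead), and the class name, function name and reply text are assumptions for illustration. Note how listing a whole block catches every address in it, innocent or not.

```python
import ipaddress


class Blocklist:
    def __init__(self, entries):
        # Each entry is a single address ("198.51.100.25") or a block
        # ("203.0.113.0/24"). strict=False tolerates host bits in a block.
        self.networks = [ipaddress.ip_network(e, strict=False) for e in entries]

    def is_listed(self, sender_ip):
        address = ipaddress.ip_address(sender_ip)
        return any(address in network for network in self.networks)


def smtp_reply(blocklist, sender_ip):
    """Refuse the connection before any message data is transferred."""
    if blocklist.is_listed(sender_ip):
        return "554 rejected: sender address is listed"
    return "250 ok"
```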

Spammer responses

  • Find a host who will allow them to hop between IP addresses
  • DDOS against the RBL
  • Relay spam through zombies (generally home computers) on dynamic IP addresses

Advantages

  • Can have a significant impact on the amount of spam received
  • Runs at very little cost to the receiver of the email (no bandwidth spent receiving the email)

Disadvantages

  • It can be hard to get off an RBL if you get on one
  • The false positive rate can be quite high, depending on which RBL you choose
  • If you have a false positive, you never know about it

Whitelisting

This works by storing a list of valid email addresses or IP addresses (generally just email addresses) that your server will receive emails from. In general this is not a terribly effective solution as it severely limits the list of people you can receive email from. It is typically used to exempt email from other tests (eg to avoid running Bayesian filters over it).

Spammer responses

  • Joe job

Advantages

  • Can have a significant impact on the amount of spam received
  • Low requirements (bandwidth, computation)

Disadvantages

  • You can only receive email from email addresses/IP addresses on that list

Challenge - Response

This is really a variation on whitelisting for email addresses, with a dynamic whitelist. When someone who is not in your whitelist sends an email, an automatic email with a link goes back to them. Clicking on that link adds them to your whitelist.
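A rough sketch of the bookkeeping involved, with the actual sending of the challenge email left out. Holding the original message until the link is clicked is an assumption on my part, as are the class and method names.

```python
import secrets


class ChallengeResponse:
    def __init__(self):
        self.whitelist = set()
        self.pending = {}  # token embedded in the link -> sender address
        self.held = {}     # sender address -> messages waiting for confirmation

    def receive(self, sender, message):
        """Return ('deliver', message) or ('challenge', token)."""
        if sender in self.whitelist:
            return ("deliver", message)
        token = secrets.token_urlsafe(16)
        self.pending[token] = sender
        self.held.setdefault(sender, []).append(message)
        return ("challenge", token)

    def confirm(self, token):
        """Called when the sender clicks the link; releases any held mail."""
        sender = self.pending.pop(token, None)
        if sender is None:
            return []
        self.whitelist.add(sender)
        return self.held.pop(sender, [])
```

The weakness is visible in `receive`: the challenge goes to whatever address the email claims to be from, so a Joe job sends the challenge to an innocent third party.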

Spammer responses

  • Joe job

Advantages

  • Can have a significant impact on the amount of spam received

Disadvantages

  • Places a burden of work on the people sending ham emails
  • Tends to work only if you have a small, known list of people who send you email

Greylisting

Greylisting is one of the more interesting ideas out there. Greylisting checks an internal database to see if the combination of sender, recipient and sender IP address matches an earlier delivery attempt. If there is a match, the email is received. If not, the receiving server tells the sender that it is unable to receive the email at the moment and to retry after a delay. This eliminates a proportion of spam by delivering mail only from MTAs that comply with the standards for email. The real power of greylisting comes when coupled with RBLs. If the email is part of a spam run, by the time the sending MTA resends the email, the IP address is likely to be in an RBL.
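The logic fits in a few lines. This is a sketch under assumptions: the five minute minimum delay, the reply strings and the names are illustrative, the database is just a dictionary, and the current time is passed in so the behaviour is easy to follow.

```python
class Greylist:
    def __init__(self, min_delay=300):
        self.min_delay = min_delay  # seconds a new sender must wait before retrying
        self.first_seen = {}        # (ip, sender, recipient) -> time of first attempt
        self.passed = set()         # combinations that have already retried correctly

    def check(self, sender_ip, sender, recipient, now):
        triplet = (sender_ip, sender, recipient)
        if triplet in self.passed:
            return "250 accept"
        first = self.first_seen.setdefault(triplet, now)
        if now - first >= self.min_delay:
            # A standards-compliant MTA came back after the delay.
            self.passed.add(triplet)
            return "250 accept"
        # Temporary failure: a compliant MTA will queue the mail and retry.
        return "451 try again later"
```

A sender that never retries, which describes a lot of spam software, simply never gets past the 451.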

Spammer responses

  • Run a compliant MTA

Advantages

  • Low bandwidth/CPU cost

Disadvantages

  • Delays some emails from arriving immediately

SenderID and SPF

SenderID and SPF are two approaches to dealing with one aspect of spam: Joe jobs. Both add records to the DNS records for the domain to list the IP addresses that can send emails for that domain. Of the two, SenderID is technically the better tool; however, Microsoft (the creator of SenderID) has patented parts of it. This makes it impossible to implement in most open source mail servers (postfix, qmail, sendmail, exim, etc), which make up a significant proportion of all mail servers. As a result we are unlikely to see SenderID widely implemented.
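As an illustration of the SPF side, the sketch below checks a sending IP address against the `ip4:` terms of an SPF record. It is deliberately incomplete: it assumes the record text has already been fetched from DNS, and it ignores every other kind of term a real record can contain (`include`, `a`, `mx`, the qualifiers and so on). The function names are made up.

```python
import ipaddress


def allowed_networks(spf_record):
    """Collect the ip4: terms from a record such as 'v=spf1 ip4:192.0.2.0/24 -all'."""
    terms = spf_record.split()
    if not terms or terms[0].lower() != "v=spf1":
        return []
    networks = []
    for term in terms[1:]:
        if term.lower().startswith("ip4:"):
            networks.append(ipaddress.ip_network(term[4:], strict=False))
    return networks


def sender_is_authorised(spf_record, sender_ip):
    """True if the domain's record lists the IP address the mail came from."""
    address = ipaddress.ip_address(sender_ip)
    return any(address in network for network in allowed_networks(spf_record))
```

A Joe job fails this check because the spammer’s machine is not in the list published by the domain being forged.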

Spammer responses

  • Run an MTA that supports this

Advantages

  • Low bandwidth
  • goes some way to deal with the Joe Job issue

Disadvantages

  • Not supported by all MTAs, likely to drop some ham

Blue Frog

As far as I am aware there was only one implementation of this. The basic idea was to make a single HTTP request to all links in all incoming emails. This would bring the sites hosting the products sold by the spam to their knees by the sheer volume of requests. Even if the servers could handle the load, the increased cost of bandwidth would make the spamming uneconomic. Please note that this is not a DDOS, as it is making just one request for each incoming email.

Spammer responses

  • multiple DDOS

Advantages

  • Hurts the spammers, adds costs to them in proportion to the emails they send

Disadvantages

  • Not around any more :(. Unfortunately the DDOSes brought the service to an end.

The dropping cost of hardware

Saturday, August 23rd, 2008

One thing that never ceases to amaze me is the way that hardware continues to drop in cost. This really came home to me when I specced and built a couple of machines for my parents. My parents have the misfortune to have a son who knows his way around a computer and as a result has been able to keep their computers running far longer than they really should have. My mother’s computer was just over 11 years old this year when I replaced it, and had (from memory) 3 replacement power supplies, more RAM, 2 replacement HDD, 3 replacement DVD/CDRom drives, 1 replacement sound card, 2 replacement NICs.

My parents use their computers largely for email, surfing the web and editing the odd Word and Excel document. In this part of the market the AMD chips win hands down in bang for your buck. In the end I got something like this (monitors were not needed):

  1. AM2 4000
  2. nVidia chipset AT motherboard with integrated gfx & dual channel RAM
  3. 2xaGb DDR2 800 RAM
  4. DVD burner
  5. 160Gb 7200rpm seagate HDD
  6. antec case
  7. XP home

For a total of $485 (AUD) per machine.

All name brand parts, none really bottom of the market parts. To keep this in perspective, under 10 years ago I paid ~$800 (AUD) for a 700MHz slot A Athlon for the first computer I ever built, the total cost of the computer was.

The crazy thing about this is that these computers are quite frankly overpowered for their needs. There are people who need more (gaming, video editing, graphical work, programming), but these computers are overpowered for most people’s needs. Even then, moving to a Core2 Duo and an ATX motherboard, adding a larger HDD and adding a gfx card would likely still keep the price under $1000 (AUD). You could probably get it below the price of my prized slot A Athlon processor.

Interestingly that processor is still running … it is in the machine that currently hosts this website.

The other interesting part of this purchase is that the OS makes up $109 of that $485, or 22% of the total. For comparison the OS was under 10% of the cost of the machine this replaced. This should be a warning to Microsoft, particularly when there are other credible alternatives.

Exception handling

Saturday, August 9th, 2008

I read a recent post that complained about the lack of error handling in twitter.

My problem with this is while the author is unhappy with the error handling in twitter, no reasonable solution is provided.

In my (admittedly limited) experience of web applications, exceptions fall into three basic categories.

  1. The code is broken somewhere
  2. Platform instability: this might be an issue in the hardware/software platform stack that the application runs on. For example your server might have a bad stick of RAM or there might be a bug in php/.net/tomcat etc.
  3. Load issues: the app is overloaded, resulting in inability to connect to the database, file locking etc

None of these three (although 3 to a lesser extent) are issues you can plan for. If you knew where the bugs in your code were, you would fix them (duh). If there is an issue in the platform you would either code around it or replace the defective parts of the platform. As for the last, load does interesting things, and it is hard to predict exactly what will break under load. In the end you can spend a lot of time writing code to handle expected load situations that do not occur.

My question is, what is the programmer supposed to do with these exceptions? At the least the error should be logged (with enough data to replicate it) for the development team so that they might be able to fix it.

You can take the tried and true option of throwing the whole thing into the user’s lap with a detailed error message. What is the user going to do with this? For a general user this is gobbledegook; even for a user who is a developer this only makes sense if they understand the application itself.

Or you can do what twitter does, recognise that the information is essentially useless and simply apologise for the problem.
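Put together, that is: log everything for the developers, apologise to the user. A minimal sketch of the pattern, not how Twitter or any particular framework does it. The `handle_request` wrapper, the logger name and the short reference code shown to the user are all assumptions added for illustration.

```python
import logging
import traceback
import uuid

log = logging.getLogger("webapp")

APOLOGY = "Sorry, something went wrong on our side. Reference: {ref}"


def handle_request(handler, request):
    """Run a handler; on failure log the details and return only an apology."""
    try:
        return 200, handler(request)
    except Exception:
        # The reference ties the user's report to the log entry without
        # exposing any internals to the user.
        ref = uuid.uuid4().hex[:8]
        log.error("ref=%s request=%r\n%s", ref, request, traceback.format_exc())
        return 500, APOLOGY.format(ref=ref)
```

The log line carries the request and the full traceback, which is the "enough data to replicate" part; the user sees none of it.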

Jeff Atwood does get something right in this though: Twitter should try to let you know how long the site is going to be down for. However this is only really possible when the developers have assessed the situation resulting from the errors that have been logged and worked out how long the site/feature will be unavailable for.

Coding on whiteboards - interview procedure

Saturday, August 2nd, 2008

Edited to improve the code samples slightly. Also still tweaking the CSS to get the code to display better.

Update 2: just found the preserve code formatting plugin. Fighting WordPress (which was completely screwing up the code tag) was no fun.

A lot of people recommend including a practical test as part of an interview for a programming position. Quite a few people, including some notable people, recommend doing this on a whiteboard.

I think that this stinks: somebody trying to write code on a whiteboard is no reflection on their abilities as a programmer. It isn’t just that it is so different to the way people normally write code: it penalises people who write code well. It is good programming practice to design the skeleton and then to put some flesh on those bones. For example, I have got into the habit of writing closing braces for blocks as soon as I write the opening brace. In my mind there is no question that this is a good idea, but this is based on the assumption that the space between the braces is effectively infinitely expandable, which is the case when writing normal code but not when writing code on paper or on a whiteboard.

Let’s take a simple function that retrieves some data from the database and writes it to the screen (C#, illustrative purposes only, not tested). I write code in multiple passes. The first pass through might look something like this:


// TODO: retrieve data

// TODO: loop through data

// TODO: write totals

The next pass would fill some of that in:


// retrieve data
DataTable data = this.GetData();

// loop through data
foreach (DataRow dataRow in data.Rows)
{
  TableRow row = new TableRow();

  TableCell cell1 = new TableCell();
  cell1.Text = dataRow["label"].ToString();
  row.Cells.Add(cell1);

  TableCell cell2 = new TableCell();
  cell2.Text = dataRow["amount"].ToString();
  row.Cells.Add(cell2);

  this.Results.Rows.Add(row);
}

// TODO: write totals

And some more in the next pass:


// retrieve data
DataTable data = this.GetData();

int total = 0;
// loop through data
foreach (DataRow dataRow in data.Rows)
{
  TableRow row = new TableRow();

  this.AddCell(row, dataRow["label"].ToString());

  this.AddCell(row, dataRow["amount"].ToString());

  total += Convert.ToInt32(dataRow["amount"]);

  this.Results.Rows.Add(row);
}

// write totals
TableRow totalRow = new TableRow();
this.AddCell(totalRow, "Total");
this.AddCell(totalRow, total.ToString());
this.Results.Rows.Add(totalRow);

// helper: append a text cell to a row
private void AddCell(TableRow row, string value)
{
  TableCell cell = new TableCell();
  cell.Text = value;
  row.Cells.Add(cell);
}

And probably a final pass, to alternate colours on the rows and set some styling on the total:


// retrieve data
DataTable data = this.GetData();

int total = 0;
// loop through data, using the index to alternate row colours
for (int i = 0; i < data.Rows.Count; i++)
{
  DataRow dataRow = data.Rows[i];
  string style = "background-color:" + (i % 2 == 0 ? "#FFFFFF" : "#CCCCCC") + ";";

  TableRow row = new TableRow();
  this.AddCell(row, dataRow["label"].ToString(), style);
  this.AddCell(row, dataRow["amount"].ToString(), style);
  total += Convert.ToInt32(dataRow["amount"]);

  this.Results.Rows.Add(row);
}

// write totals
TableRow totalRow = new TableRow();
this.AddCell(totalRow, "Total", "font-weight:bold");
this.AddCell(totalRow, total.ToString(), "font-weight:bold");
this.Results.Rows.Add(totalRow);

// helper: append a text cell to a row, with optional inline style
private void AddCell(TableRow row, string value, string style)
{
  TableCell cell = new TableCell();
  cell.Text = value;
  if (style.Length != 0) cell.Attributes["style"] = style;
  row.Cells.Add(cell);
}

And normally this would have been broken out into a number of functions, but I think the point is clear. One of the most frustrating experiences of my life, technology-wise, was hand-writing code as part of an exam.

In more complex code this is even worse: when you are writing the code it is not clear how long a block is.