Saturday, 18 June 2011

Compressing Postcodes..

The UK has postcodes, which serve a similar purpose to zip codes in the US. You put a postcode and a house number on an envelope, and in theory that's pretty much all that's needed to get the letter to its destination.

I'd never really put much thought into how they were constructed, until I met a stack of data that was indexed using postcodes as the lookup. Very handy, but it was causing the data size to exceed a limit, which meant I couldn't embed the data into a blog post ;-)

I went hunting, and found a document explaining the structure here. A quick scan over my data turned up 121 postcode areas present in my list, some 129 sub bits of them, and some 4000 or so last bits =) E.g. in PO1 3AX, PO is what I'd call the area, 1 the sub bit, and 3AX the last bit.

I started playing around a bit and here's what I ended up with...

.. the area & sub bits being ~128ish, I figured I could deal with those easily enough.. so I started looking at the usage of the last bits: how much they're reused, how they vary, etc.

First up, they get reused a LOT =) a quick hack with the data showed that one was reused 465 times, and the average reuse was around 200. Breaking that down, I found that the total repeated usage (not counting initial usage) of all last bits was a rather large 799275 times.

Splitting that out into brackets, I can see that the average wasn't being skewed by a few high reuses..
<50:700  <100:561  <150:238  <200:278  <250:392  <300:721  <350:841  <400:277  <450:20  >449:1
(that's 0 to 49 reuses, 50 to 99 reuses, 100 to 149 reuses, etc.)
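
For the curious, the tally works out to something like this (a quick sketch, my reconstruction rather than the original hack; the names and the bucket width of 50 are mine):

// Tally how often each 3-character ending appears in the data,
// then bucket the reuse counts (occurrences minus the first use)
// into brackets of 50.
function reuseBrackets(endings) {
  var counts = {};
  endings.forEach(function (e) {
    counts[e] = (counts[e] || 0) + 1;
  });
  var brackets = {};
  Object.keys(counts).forEach(function (e) {
    var reuse = counts[e] - 1;                  // don't count initial usage
    var b = Math.floor(reuse / 50) * 50;        // 0 = "<50", 50 = "<100" ...
    brackets[b] = (brackets[b] || 0) + 1;
  });
  return brackets;
}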

Nice. So if I can find a way to shrink that 3-character extension, I could save a fair number of bytes..

Easy enough: I just put all the known 3-character endings into a table, then use an index to refer to the one I want when I need it.. given there are ~4000 or so, I can get away with a 2-byte index, which should mean I can knock a third off the current storage.
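
In Javascript that looks something like this (a minimal sketch; the table contents and helper names are mine):

// Store each distinct 3-character ending once, and refer to it
// everywhere else by a 2-byte index.
var endings = ["0AA", "0AB", "0AD" /* ... ~4000 entries ... */];

var indexOf = {};
endings.forEach(function (e, i) { indexOf[e] = i; });

// Pack an index into two characters, and unpack it again.
function encodeIndex(i) {
  return String.fromCharCode(i >> 8, i & 0xff);
}
function decodeEnding(s) {
  return endings[(s.charCodeAt(0) << 8) | s.charCodeAt(1)];
}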

Of course, I need to get the entire table (and its expansion code) included if the index is to be usable within a blog post. The simple approach of sending the entire table as a proper table takes a rather large amount of space.
{"0AA","0AB","0AD","0AE","0AF","0AG","0AH","0AJ","0AL" ....(etc)
So, given the length of each ending is fixed at 3 chars, I could just concatenate them all into a single string, and send that.. except that's still pretty huge =)
"0AA0AB0AD0AE0AF0AG0AH0AJ0AL0AN0AP0AQ0AR0AS"... (etc)
I tried it anyway, and noticed that a side effect of sorting the list was that there were very strong patterns within the long list of extensions; it was pretty much an incrementing counter, using custom units per column.

With that in mind, I thought maybe instead of sending every postcode, I could send one postcode, and follow it with instructions on how to increment to the next one. I decided to try with just the last character, and to send the amount to increment by, encoded as letters..
"0AAbcbbbbccccbbbbbcbbb*0BAbcbbbbcccbbbbbbbcbbb*0DAbcbbbbccccbbbbbcbbb" ... (etc)
This approach would take 0AA, then use the 'b' to know to increment the last A by 1 to get 0AB, then use the 'c' to know to increment what is now a B by 2 to get 0AD, and so on and so on..
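
Decoding that is straightforward; something like this (a minimal sketch, my reconstruction rather than the actual code):

// Each '*'-separated group starts with a literal 3-character ending;
// every following letter bumps the last character's char code
// ('b' = +1, 'c' = +2, 'd' = +3 ...) to produce the next ending.
function expandIncrements(encoded) {
  var endings = [];
  encoded.split("*").forEach(function (group) {
    var current = group.slice(0, 3);            // e.g. "0AA"
    endings.push(current);
    for (var i = 3; i < group.length; i++) {
      var step = group.charCodeAt(i) - "a".charCodeAt(0);
      current = current.slice(0, 2) +
                String.fromCharCode(current.charCodeAt(2) + step);
      endings.push(current);
    }
  });
  return endings;
}

// expandIncrements("0AAbc") gives ["0AA", "0AB", "0AD"]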

But then I noticed the "bcbbbbccbbcbbbbbcbbb" data seemed to repeat a bit: although I was processing 121 postcode areas, I only had 8 distinct patterns for the incrementer strings. Well.. maybe I should send those in a table, then just put an index to that table after each base code.
p="bcbbbbccbbcbbbbbcbbb%bcbbbbcbbccbbbbbcbbb%bbbbbbbccccbbbbbcbbb%bcbbbbccccbbdcd%bcbbbbcccbbbbbbbcbbb%bcbbbbccccbbbbbcbbb%bcbbbbccccbcbbcd%bcbbbbccccbbbbbbbbbb*0AA50BA40DA50EA50FA50GA50HA50JA50LA50MR#0NA20OO#0PA50QA50RA50SA50TA50UA50VV#0WA50XA50YA50ZA31AA51BA51DA51EA51FA51GA51HA51JA51LA51ME#1NA51PA51QA51RA51SA51TA51UA51VV#1WA51XA51YA51ZA52AA52BA22DA52EA52FA52GA52HA52JA52LA52NA52PA52QA52RA52SA52TA22UA52WA52XA52YA72ZA53AA53BA53DA53EA53FA53GA53HA53JA53LA53NA53PA53QA53RA53SA53TA53UA53VV#3WA53XA53YA53ZA54AA54BA54DA54EA54FA54GA54HA54JA54LA54MH#4NA54OR#4PA54QA54RA54SA54TA04UA54WA54XA54YA54ZA55AA55BA55CH#5DA55EA55FA55GA55HA75JA55LA55NA55PA05QA55RA55SA55TA55UA55VV#5WA55XA55YA55ZA56AA56BA56DA56EA56FA56GA56HA56JA56LA56NA56PA56QA56RA56SA56TA56UA56VV#6WA56XA56YA56ZA57AA57BA57DA57EA57FA07GA57HA57JA57LA47NA57PA57QA57RA17SA57TA57UA57VV#7WA57XA57YA57ZA58AA58BA58DA58EA48FA58GA58HA58JA58LA58NA58PA58QA58RA58SA58TA58UA58WA58XA58YA58ZA59AA59BA59DA59EA59FA59GA59HA59JA59LA59NA59PA59QA59RA59SA59TA59UA59VV#9WA59XA59YA59ZA6";
Well.. that's the entire data table there =) 1010 bytes, instead of 24k worth for the proper table (or 12k for the 'just a string' approach).

Inflating the data from that String to an Array of Strings in Javascript takes another 500 or so bytes, which overall gives me about 90% compression for the table =) Yes, it can probably improve a little from there, but there's a certain effort/reward ratio where I stop =) Feel free to show me how it can improve further.
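
For reference, the inflation logic works out to something like this (a hedged reconstruction from my reading of the string format above, not the original 500-byte version; in particular the '#' handling is a guess):

// Everything before the '*' is the '%'-separated pattern table;
// after it come 4-character records, each a 3-character base ending
// plus one selector. A digit selects a pattern of increments to
// apply to the base; '#' appears to mark a base that stands alone.
function inflateTable(p) {
  var parts = p.split("*");
  var patterns = parts[0].split("%");
  var records = parts[1];
  var out = [];
  for (var r = 0; r < records.length; r += 4) {
    var current = records.slice(r, r + 3);
    var sel = records.charAt(r + 3);
    out.push(current);
    if (sel === "#") continue;                  // standalone base
    var pat = patterns[parseInt(sel, 10)];
    for (var i = 0; i < pat.length; i++) {
      var step = pat.charCodeAt(i) - "a".charCodeAt(0);
      current = current.slice(0, 2) +
                String.fromCharCode(current.charCodeAt(2) + step);
      out.push(current);
    }
  }
  return out;
}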

Next up, I get to look at the first parts (the PO1 part of the example).. I think I can shrink those a little by taking advantage of the fact that the data I'm processing has them in clusters, so I may be able to generate localized lookup tables within the tree, enough to potentially let me encode each postcode as just 3 characters.

1 comment:

  1. Very interesting. What is the use case for this (i.e. why do you need to compress the postcode data so?)

    Alex
