I've never really put much thought into how postcodes were constructed, until I met a stack of data that was indexed using postcodes as the lookup. Very handy, but it was causing the data size to exceed a limit, which meant I couldn't embed the data into a blog post ;-)
I went hunting, and found a document explaining the structure here. I ran a quick scan over my data, discovering I had 121 postcode areas present in my list, some 129 sub bits of them, and some 4000 or so last bits =) E.g. in PO1 3AX, PO is what I'd call an area, 1 the sub bit, and 3AX the last bit.
I started playing around a bit and here's what I ended up with...
.. the area & sub bit being ~128ish, I figured I can deal with easy enough.. so I started looking at the usage of the last bits.. how much are they reused, how they vary, etc.
First up, they get reused a LOT =) A quick hack with the data showed that one last bit was reused 465 times, and the average reuse was around 200. Breaking that down, I found that the total repeated usage (not counting initial usage) of all last bits was a rather large 799275 times.
Splitting that out into brackets, I could see that the average wasn't being skewed by a few high reuses..
<50:700 <100:561 <150:238 <200:278 <250:392 <300:721 <350:841 <400:277 <450:20 >449:1(thats 0 to 49 reuses, 50 to 99 reuses, 100 to 149 reuses etc..)
Nice, so if I can find a way to shrink that 3 character extension, I could save a fair number of bytes..
Easy enough, I just put all the known 3 character endings into a table, then use an index to refer to the one I want when I need it.. given there are ~4000 or so, I can get away with a 2-byte index, which should mean I can knock a third off the current storage (3 characters down to 2 bytes per ending).
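Something along these lines (a minimal sketch; the table contents and function names are made up for illustration):

    // All known endings, sorted once so the indexes stay stable.
    var endings = ["0AA", "0AB", "0AD" /* ... ~4000 entries ... */];
    function encodeEnding(ending) {
      return endings.indexOf(ending); // fits in 2 bytes, since ~4000 < 65536
    }
    function decodeEnding(index) {
      return endings[index];
    }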
Of course, I need to include the entire table (and its expansion code), if the index is to be usable within a blog post. The simple approach of sending the entire table as a proper table takes a rather large amount of space:
{"0AA","0AB","0AD","0AE","0AF","0AG","0AH","0AJ","0AL" ....(etc)So, given the length of each ending is fixed at 3chars, I could just concatenate them all into a single string, and send that.. except that's pretty huge =)
"0AA0AB0AD0AE0AF0AG0AH0AJ0AL0AN0AP0AQ0AR0AS"... (etc)
With that in mind, I thought maybe instead of sending every ending, I could send one ending, and follow it with instructions on how to increment to the next one. I decided to try with just the last character, and to send the amount to increment by encoded as letters..
"0AAbcbbbbccccbbbbbcbbb*0BAbcbbbbcccbbbbbbbcbbb*0DAbcbbbbccccbbbbbcbbb" ... (etc)This approach would take 0AA, then use the 'b' to know to increment the last A by 1 to get 0AB, then use the 'c' to know to increment what is now a B by 2 to get 0AD, and so on and so on..
But then I noticed the "bcbbbbccbbcbbbbbcbbb" data seemed to repeat a bit: although I was processing 121 postcode areas, I only had 8 distinct patterns for the incrementer strings. Well.. maybe I should send those in a table, then just put an index to that table after each base pattern.
p="bcbbbbccbbcbbbbbcbbb%bcbbbbcbbccbbbbbcbbb%bbbbbbbccccbbbbbcbbb%bcbbbbccccbbdcd%bcbbbbcccbbbbbbbcbbb%bcbbbbccccbbbbbcbbb%bcbbbbccccbcbbcd%bcbbbbccccbbbbbbbbbb*0AA50BA40DA50EA50FA50GA50HA50JA50LA50MR#0NA20OO#0PA50QA50RA50SA50TA50UA50VV#0WA50XA50YA50ZA31AA51BA51DA51EA51FA51GA51HA51JA51LA51ME#1NA51PA51QA51RA51SA51TA51UA51VV#1WA51XA51YA51ZA52AA52BA22DA52EA52FA52GA52HA52JA52LA52NA52PA52QA52RA52SA52TA22UA52WA52XA52YA72ZA53AA53BA53DA53EA53FA53GA53HA53JA53LA53NA53PA53QA53RA53SA53TA53UA53VV#3WA53XA53YA53ZA54AA54BA54DA54EA54FA54GA54HA54JA54LA54MH#4NA54OR#4PA54QA54RA54SA54TA04UA54WA54XA54YA54ZA55AA55BA55CH#5DA55EA55FA55GA55HA75JA55LA55NA55PA05QA55RA55SA55TA55UA55VV#5WA55XA55YA55ZA56AA56BA56DA56EA56FA56GA56HA56JA56LA56NA56PA56QA56RA56SA56TA56UA56VV#6WA56XA56YA56ZA57AA57BA57DA57EA57FA07GA57HA57JA57LA47NA57PA57QA57RA17SA57TA57UA57VV#7WA57XA57YA57ZA58AA58BA58DA58EA48FA58GA58HA58JA58LA58NA58PA58QA58RA58SA58TA58UA58WA58XA58YA58ZA59AA59BA59DA59EA59FA59GA59HA59JA59LA59NA59PA59QA59RA59SA59TA59UA59VV#9WA59XA59YA59ZA6";Well.. thats the entire data table there =) 1010 bytes, instead of 24k worth for the proper table (or 12k for the 'just a string' approach).
Inflating the data from that string back to an array of strings in Javascript takes another 500 or so bytes, which overall gives me about 90% compression for the table =) Yes, it can probably improve a little from there, but there's a certain effort/reward ratio beyond which I stop =) Feel free to show me how it can improve further.
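For reference, the inflation step might look something like this sketch (based on my reading of the format above: '%' separates the 8 incrementer patterns, '*' separates the pattern table from the data, and each 4-char entry is a 3-char base ending plus either a digit indexing a pattern or '#' for a standalone ending):

    function inflate(p) {
      var parts = p.split("*");
      var patterns = parts[0].split("%");
      var data = parts[1];
      var endings = [];
      for (var i = 0; i < data.length; i += 4) {
        var base = data.substr(i, 3);
        var tag = data.charAt(i + 3);
        endings.push(base);
        if (tag === "#") continue; // standalone ending, no sequence follows
        var steps = patterns[+tag];
        var last = base.charCodeAt(2);
        for (var j = 0; j < steps.length; j++) {
          last += steps.charCodeAt(j) - 97; // 'b' -> +1, 'c' -> +2 ...
          endings.push(base.substr(0, 2) + String.fromCharCode(last));
        }
      }
      return endings;
    }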
Next up, I get to look at the first parts (the PO1 part of the example).. I think I can shrink those a little by taking advantage of the fact that the data I'm processing has them in clusters, so I may be able to generate localized lookup tables within the tree, enough to potentially let me encode each postcode as just 3 characters.
Very interesting. What is the use case for this (i.e. why do you need to compress the postcode data so?)
- Alex