So there I was, trying to work out why data is so huge and how to handle it better, when I had a thought: what actually is data?
Data, in simplest terms, is a string of bytes. For that data to be useful, we put those bytes into a logical order to describe things such as text, images and the like inside a file. Normally these files have headers and other such human-recognisable features to describe how the computer-readable data is laid out inside the file.
I digress. Data is (currently) arranged in files for us humans to look at. Those files have structure, and all future files will have structure too, so I want to predict what those structures will be. To that end, I started playing.
[Image: Plain Text file]
I wrote a script in Python that counts the occurrences of each byte value. Its output is a PNG in which the redder a colour is, the more often that byte value appears relative to the total number of bytes in the file.
As you can see, byte values that correspond to language-based ASCII characters feature heavily. Now what happens if we zip it?
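The script itself isn't reproduced here, so what follows is only a rough sketch of the counting and colouring idea. The function names `byte_frequencies` and `to_colour`, the white-to-red scale and the sample bytes are all assumptions made for illustration, and writing the actual PNG is left out.

```python
from collections import Counter


def byte_frequencies(data: bytes) -> list[float]:
    """Return 256 relative frequencies, one per possible byte value."""
    counts = Counter(data)  # iterating over bytes yields ints 0..255
    total = len(data) or 1  # avoid dividing by zero on an empty file
    return [counts.get(value, 0) / total for value in range(256)]


def to_colour(frequency: float, peak: float) -> tuple[int, int, int]:
    """Map a frequency to RGB: white for unused values, full red for the peak."""
    strength = frequency / peak if peak else 0.0
    fade = round(255 * (1.0 - strength))
    return (255, fade, fade)


# Hypothetical sample; in practice this would be open(path, "rb").read().
sample = b"the quick brown fox jumps over the lazy dog"
freqs = byte_frequencies(sample)
colours = [to_colour(f, max(freqs)) for f in freqs]
```

Each of the 256 entries in `colours` could then be painted as one pixel or block, for example in a 16 by 16 grid, using whatever imaging library is to hand.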
[Image: Zipped Plain Text]
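As you can see, the image is a lot flatter. This is because of the way zipping a file works. Essentially, it records one or more bytes and then, on the next occurrence, writes a back reference instead of repeating them. With the repetition stripped out, the bytes that remain are spread far more evenly across all the possible values. I can now show that the maths for different types of compression is different.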
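One way to put a number on "flatter" is Shannon entropy, measured in bits per byte, where 8.0 would be a perfectly even spread. The sketch below is an assumption-laden stand-in rather than the original experiment: it uses Python's `zlib` module (the DEFLATE method normally found inside zip files, without the zip container) and a made-up text built from a tiny vocabulary.

```python
import math
import random
import zlib
from collections import Counter


def byte_entropy(data: bytes) -> float:
    """Shannon entropy in bits per byte; 8.0 means a perfectly flat spread."""
    total = len(data)
    if total == 0:
        return 0.0
    return -sum((n / total) * math.log2(n / total) for n in Counter(data).values())


# Hypothetical stand-in for a plain text file: random words from a small list.
rng = random.Random(0)
words = ["data", "file", "byte", "header", "text", "image", "zip", "structure"]
plain = " ".join(rng.choice(words) for _ in range(20000)).encode("ascii")

zipped = zlib.compress(plain, 9)

plain_entropy = byte_entropy(plain)
zipped_entropy = byte_entropy(zipped)
print(f"plain:  {len(plain)} bytes, {plain_entropy:.2f} bits/byte")
print(f"zipped: {len(zipped)} bytes, {zipped_entropy:.2f} bits/byte")
```

The compressed bytes should score much closer to 8 bits per byte than the text does, and `zlib.decompress(zipped)` gives back the original exactly, so nothing has been lost along the way.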
[Image: 7z file]
See? Proof positive that something else is going on. OK, OK, but it is interesting, isn't it? How it's flattened out? It's something I've noticed with other file types too. Here are some below:
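The same comparison can be roughed out in code. This is a sketch under assumptions, not the files pictured here: Python's `zlib` stands in for zip, and the `lzma` module stands in for 7z (it writes an .xz container using LZMA2, a close relative of what 7z uses by default, not a real .7z archive). The sample text is made up.

```python
import lzma
import zlib

# Hypothetical sample input; any bytes would do.
plain = b"Data, in simplest terms, is a string of bytes. " * 500

deflated = zlib.compress(plain, 9)  # DEFLATE, the usual method inside zip files
lzma_out = lzma.compress(plain)     # LZMA2 in an .xz container, standing in for 7z

for name, blob in (("zlib", deflated), ("lzma", lzma_out)):
    print(f"{name}: {len(blob)} bytes, starts with {blob[:6].hex()}")
```

Both outputs decompress back to the same input, yet the bytes in between are entirely different, which is the point: same data, different maths.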
[Image: JPG file]
[Image: PDF file]
[Image: MSI file]
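The pattern seems to be: the harder a file has been compressed, the less repetition is left in it and the flatter it is. That goes for lossless formats like zip and 7z as much as for lossy ones like JPG, so it is the compression rather than the lossiness that does the flattening.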
Nobbing around with data is something I've been doing for years. I am a great believer that we have already reached "L-Space": everything that has ever been written, or ever will be written, is already out there. With the birth of huge datastores and the internet, it stands to reason, to my mind, that every combination of bytes will have been written somewhere.
Thanks for reading,