A new lossless, infinite compression algorithm
  • A Thought Experiment on the Relationship between Information Entropy, Randomness, and Effort

    Can truly random data be compressed? The conventional answer is “no”. Like many conventional notions, that is incorrect. Not only can random data be compressed, it is trivially easy to do so.

    Most would agree that if one were to attempt to compress numerous files containing random data, out of pure luck some of the processed files would end up somewhat smaller than the originals. Conversely, some of the processed files would end up somewhat larger than the originals. In fact, nearly all processed files would differ slightly in size from the originals, even if all the original files were the same size. However, taken in the aggregate, the random data files would require the same amount of storage as the processed files – right? Well… not necessarily. By adding a simple rule to the compression algorithm one can have a collection of files containing truly random data that always requires less space to store than the original collection.

    The rule that accomplishes this only requires that a simple decision be included in the processing sequence: if the processed file is smaller than the original, use it; if it is not, use the original. This will result in a statistically small, but significant and repeatable compression of truly random (i.e. entropic) information.
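
    As a rough sketch of that rule (zlib is used here purely as a stand-in compressor, and the one-byte flag recording the choice is my own addition, since the decompressor needs to know which branch was taken):

```python
# Minimal sketch of the "keep whichever is smaller" rule described above.
# zlib stands in for any lossless compressor; the 1-byte flag records which
# branch was taken so the original can always be recovered.
import os
import zlib

def maybe_compress(data: bytes) -> bytes:
    packed = zlib.compress(data, 9)
    if len(packed) < len(data):
        return b"\x01" + packed   # flag 1: compressed form was smaller
    return b"\x00" + data         # flag 0: keep the original bytes

def restore(blob: bytes) -> bytes:
    flag, payload = blob[0], blob[1:]
    return zlib.decompress(payload) if flag == 1 else payload

# Try it on a batch of files full of random data and compare the totals.
files = [os.urandom(4096) for _ in range(1000)]
processed = [maybe_compress(f) for f in files]
assert all(restore(p) == f for p, f in zip(processed, files))
print(sum(len(f) for f in files), "->", sum(len(p) for p in processed))
```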

    Can the resulting collection of files be further compressed? Absolutely! If the data in the files is somehow transformed (using a lossless transform), the entire compression scheme can be applied again, resulting in approximately the same degree of compression achieved the first time. This is true because random data is already “uncompressible”, so the entire process started out as a worst-case scenario. Alternatively, one could leave the collection of files alone and use a different compression algorithm instead.

    Can this process be executed ad infinitum? If so, that would imply that, given enough iterations, the minute amount of compression realized at each iteration could add up to – well… an arbitrary amount of compression.

    In fairness one would have to include the information about this algorithm into the calculation of the total data volume. But that won’t matter because there are self-modifying transforms that can achieve an infinite number of different results, so only a very small amount of storage would be necessary. Besides, the algorithm data can also be compressed.

    This implies that information and effort are related and interchangeable – much the same way matter and energy are. In fact, it wouldn’t surprise me if the amount of processing required to achieve significant compression ended up being calculated such that the energy expended on the processing equals the storage savings multiplied by …wait for it… the speed of light squared.

    I’m a little busy these days. Do any of you have time to put this lossless, infinite compression algorithm together? If you get it to work, I’ll split the royalties with you.

    My apologies to Claude Shannon… I wish he were around to reply.

    Chris

  • 16 Replies
  • Infinite compression does not exist, sorry. What you've described in terms of marking blocks as non-compressible is already done by some compression techniques (including ZFS filesystem compression). At some point, the compression gains in aggregate are overtaken by the compressor's metadata (including info on non-compressible blocks).

    Actually, I should correct that a little bit. Infinite compression IS possible, but only as long as you don't ever care to decompress your file. In fact, by doing so, you could just compress any arbitrary data stream down to 1 bit in length. Assume, for example, that you have a file 3 bits long and your algorithm reduces it to 2 bits:

    8 input possibilities:

    000 001 010 011 100 101 110 111

    4 output possibilities:

    00 01 10 11

    How does the decompressor know which set of inputs it should use when looking at the outputs?

    The same applies as you keep adding more length. Infinite lossless compression is as impossible today as it always was.
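
    The counting argument is easy to check mechanically; in this sketch, "drop the last bit" stands in for whatever rule you like that shortens every 3-bit string to 2 bits:

```python
# Every mapping from 3-bit strings to 2-bit strings puts 8 inputs into
# 4 outputs, so at least two inputs must share an output (pigeonhole).
from itertools import product

inputs = [''.join(bits) for bits in product('01', repeat=3)]   # 8 strings
compress = lambda s: s[:2]        # example rule: drop the last bit

buckets = {}
for s in inputs:
    buckets.setdefault(compress(s), []).append(s)

for out, srcs in buckets.items():
    print(out, '<-', srcs)        # every 2-bit output is claimed by 2 inputs
```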

  • You CAN compress something infinitely; the problem is not making it small, but making it big again. You said that if one algorithm doesn't work, then you use another; how would you know which algorithm to use when making it big again? Unless you tuck in a small bit of data in the already existing data telling it which algorithm to use next. But I have a feeling that will lead to guesswork within the algorithm every once in a while, and it wouldn't work 100% of the time.

  • I think the trick to infinitely compressing data lies within an iterative self-similarity algorithm using the imaginary number i. If in each iteration the data can be made into a somewhat smaller size, and the decompression method is infused with that output as a mesh between what is compressed and the key to decompression, the problem would be resolved.

    Some might ask: what comes first, the chicken or the egg? After all, how can you have the data output, which also contains the decompression method, if you haven't even compressed the data yet to determine the method used to compress it? That is where the wonderful world of self-similar fractal geometry comes into the fold. Methods for creating infinite data sets with self-similar fractional dimensions from a very small iterative equation are already being utilized and have been for decades now. The smaller features are an exact copy, and many times a slightly skewed version, of the larger features which contain those smaller features.

    There should be a way to generate a mathematical code that would do the same thing with 1's and 0's, where the data itself would provide the encoding method, and the total output data, smaller than the original, would contain both the original data without loss as well as the unique decompression method needed for that particular compressed data set. Each set of data would be uniquely compressed in a way that is different from every other data set, based on the data itself providing the unique encoding for itself. It may require the assistance of quantum computing, though, so it can analyze every possible encryption combination simultaneously with real-time feedback to the original data set as well as the output data. This process would then be iterated as many times as needed to reduce the data to the size desired, balanced with the computing time required to do so. The number of times iterated would then become the only metadata required to be attached to the final compressed data/decompression method.

    It would undoubtedly be several decades before we would be able to perform this kind of thing. It would also require another Benoit Mandelbrot-type genius of compression, or quantum compression, algorithms as well. Anything can be done with data since it is infinitely malleable. The problem with traditional compression techniques is just that: they are traditional.

  • While it may seem absurd, I have already found 2 rules to use during decompression, using truth tables to regenerate temporary data used during compression, for non-quantum computing. I need at least 1 more, possibly 2, to complete the process used for encoding. I'm looking at larger strings now, but running through and testing every possible combination to develop rules is becoming exponentially time-consuming. I have 2 even rules and am now looking for an odd one to fill in the gaps for the data retrieval. If an odd-numbered rule exists, then it will be possible to convert the Excel sheet I have into code to use for iterative compression and decompression.

  • Even if your file is already compressed, we can make it smaller.

    Now you see it, now…you have a hard time seeing it, because it’s so much smaller! Smaller files take up less space on your computer. Less space = more stuff. More stuff = better life!

    (c) Silicon Valley

  • Here's a sample of the self-similarity pattern I'm using:

    101100011001101010

    11010010101011111

    0111011111110000

    100110000001000

    10101000001100

    1111100001010

    000010001111

    00011001000

    etc...

    The absolute value difference of parallel bits is used to generate strings of decreasing sizes of temporary data. A fixed number of levels down (which needs to remain static for reference on the way back up) is chosen as the output of the first iteration, along with known bits of temporary data at fixed locations. The output plus the temporary data must stay at least 1 bit smaller than the original data. I'm only up to 12 bits at the moment, but I just doubled that size to look for odd-numbered truth-table rules to fill in the gaps. An example of 4 digits apart would be:

    01011

    1110

    001

    01

    1

    Given any 2 of the 3 points on the triangle, the 3rd is ALWAYS the absolute value difference of the other 2 points for any 4-bit input combination, no matter what values of zeroes and ones are in between. The output compressed value is used to generate further levels down as reference points for retrieving the temporary data used in compression during the decompression cycle, filling in the gaps on the way back up the hierarchy. The original coding rule is a 1-digit rule. There is a 2-digit rule and a 4-digit rule, and the above is an example of the 4-digit one. Now I'm looking into odd-numbered rules.
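
    For illustration, here is one way to reconstruct that difference triangle from the description above (my own sketch, not the poster's actual code); for single bits the absolute difference |a - b| is just a XOR b, and the 1-, 2- and 4-digit relationships can be checked directly:

```python
# Build the triangle of absolute bit differences from a starting row, then
# verify that a bit k rows down equals the XOR of the two bits k apart in
# the starting row, for the gaps mentioned in the post (1, 2, 4).
def next_row(row):
    """Each new bit is the absolute difference of two adjacent bits."""
    return [abs(a - b) for a, b in zip(row, row[1:])]

def triangle(bits):
    rows = [bits]
    while len(rows[-1]) > 1:
        rows.append(next_row(rows[-1]))
    return rows

top = [0, 1, 0, 1, 1]               # the 5-bit example from the post
rows = triangle(top)
for r in rows:
    print(''.join(map(str, r)))     # 01011 / 1110 / 001 / 01 / 1

# "4-digit rule": the bit 4 rows down is the XOR of the bits 4 apart on top,
# so knowing any two of the three points fixes the third.
assert rows[4][0] == top[0] ^ top[4]
# the 1- and 2-digit rules hold the same way:
assert rows[1][0] == top[0] ^ top[1]
assert rows[2][0] == top[0] ^ top[2]
```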

  • Those values are supposed to be stacked but the formatting on this forum isn't working.

    edit: nevermind, just had to give them 2 returns instead of 1 to get them to stack properly.

  • Those values are supposed to be stacked but the formatting on this forum isn't working.

    All is working :-) Just need to learn things. Same as with compression.

  • Thanks for the insightful feedback, Vitaliy, but it isn't. When you hit return, it only tabs unless you hit return twice.

  • Well, there is one good button above the post area called "Format Help".

  • Again, thanks for the insightful feedback. You really bring a lot to the table here in this discussion.

  • If you click the link that he suggested, it really is very helpful.

  • Not sure I quite understand some of the previous posts, but it brought to mind a simple idea that has been used for a long time: using "libraries" and transmitting only the data that points to parts of the libraries.

    In other words, instead of compressing the book, let the codec on both sides include the book and transmit only "page X, word Y" information. Or, instead of saving a wave file of orchestral music, transmit only the sheet music.

    For example, MIDI files with GM standards use this idea directly (and are lossless if the sound library is exactly the same on both ends), while mp3 and h264 have aspects that can be thought of as "instructions for resynthesizing" something that resembles the original material.

    AFAIK ideas like this are utilized in all current codecs already, to varying degrees. In theory, if you have a big enough library of long random and non-random data strings, I suppose you would always use less data by transmitting only indexes.
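
    As a toy illustration of that idea (the word list and indexing scheme here are invented purely for the example): both ends keep an identical library, and only small indexes are transmitted.

```python
# Shared-library sketch: sender and receiver hold the same "book" and only
# (position-in-book) style indexes travel over the wire.
LIBRARY = ["the", "quick", "brown", "fox", "jumps", "over", "lazy", "dog"]
INDEX = {word: i for i, word in enumerate(LIBRARY)}

def encode(text):
    """Sender: replace each word with its position in the shared library."""
    return [INDEX[w] for w in text.split()]

def decode(indexes):
    """Receiver: look the positions back up in its own copy of the library."""
    return ' '.join(LIBRARY[i] for i in indexes)

msg = "the quick brown fox jumps over the lazy dog"
codes = encode(msg)
print(codes)                 # a handful of small integers instead of the text
assert decode(codes) == msg  # lossless, as long as both libraries match
```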

  • Neokoo, that does work, but it is more like a rainbow table. That would be good for transferring large amounts of data to and from locations, but only if those locations have local access to the libraries, which means you have to have a lot of local storage for the libraries to reassemble what is being transmitted. I'm trying to get around the local storage problem while still being able to transmit a large amount of data in a small package.

    With that in mind, though, data storage and transmission speeds seem to be accelerating faster than processing speeds currently, and the method I stated above requires large amounts of processing power, so the library method you described might even be better than infinite lossless compression algorithms at some point in the near future. Infinite lossless compression would be best for very long distance data packets, like satellite telemetry. You gather your data, compress it down, then send the very small amount you now have multiple times to ensure the receiving end, light-minutes to light-days away, has multiple chances of receiving it all correctly.

  • I wish this forum were still alive :-) The limit of compression is mass; that is a law of physics. It is said that information has mass: bits, bytes... indeed represent mass, so there is no such thing as infinite compression. But, having said that, the error is to consider information as having mass. What we are used to compressing is not information, but a mass to which we assign meaning/information. I believe, and I can demonstrate, that no matter how "big" the supposed data volume is, it is in the end a number: 100011100001111111100001111...... That number is not the information but a pointer, an address. As long as you are able to represent that specific address, you are able to recover the complete information. Is there anyone there?