I've been playing around a bit with file compression again, as we generate some very large backup files daily on Stack Overflow.
This is a companion discussion topic for the original blog entry at: http://www.codinghorror.com/blog/2009/02/file-compression-in-the-multi-core-era.html
The bzip2 compressor splits data into blocks of 900kB and compresses each block separately; compressing your 4.7GB file could make use of over 5000 CPUs efficiently. The 7zip compressor isn’t easily parallelized for exactly the same reason as it produces better compression: It doesn’t work operate on small blocks. Compressors operate by forming a model of the data which is being compressed; the more data they’re compressing at once the better the model.
Colin – so it is effectively impossible to paralellize 7zip? I sort of suspected this, for the reasons you stated, but I was hesitant to come out and say that in the post.
Alternately, what’s the largest block size that could be used without defeating parallelization? The same size as the L2 cache per CPU core?
Break your backups into various parts, and run parallel copies of your favorite tool to compress the individual parts. One way to deal with the problem if your tool-of-choice doesn’t scale.
As for multi-core CPUs – developers are the likely consumers to use multi-core CPUs on their desktop. My grandparents and parents aren’t going to take advantage of multi-core when sending e-mail. But as a serious career developer, when I have multiple VMs running, I can see the load spread across the cores. I typically have several VMs running with different versions of my C++ environment/project, and another 1 or 2 VMs with python projects. I can have huge re-compile events going on in several VMs, while continuing to work in the foreground of yet another VM. And I do this regularly. I’ve never understood your stance on the topic.
(Otherwise, keep up the great work. Love the blog SO.)
I’m not a big expert on compression algorithms but I’m still going to comment because this is the first time I’ve been on your site recently and there haven’t been 500 comments so maybe this gets read
I started using 7zip after my last desktop upgrade because it was the first thing I grabbed that was free. I don’t do any major compression/decompression more than a dozen MB or so. Speed isn’t my major concern. It’s fast enough. The only big files I sometimes have to decompress are gzipped log and pgsql backups and it does fine with those.
The benchmarks are interesting. 10x more wait for only a 10% decrease in files size seems like such a waste. I’d rather use 7zip than bzip if file size was an issue.
By the way, I’m confused. You say you don’t like more than 2 cores on a desktop but your showing windows task manager.
Oh. Coding HORROR. Now I get it
7zip doesn’t operate on blocks at all, it’s entirely streaming. And since it uses an arithmetic coder, it probably only outputs one bit at a time.
You could parallelize it per file, though.
The 7zip preferences allows a selection of 1 or 2 in the Number of CPU Threads option. Looks like it does something with the 2nd core, but obviously it doesn’t extend to more than 2.
A streaming compressor counts as one with an infinite block size.
I don’t think it is possible to parallelize 7z compression given the current file format; but that doesn’t mean that the file format couldn’t be changed. The problem with stream compressors is that each time a byte is processed the model is updated; you can’t have additional processors jump ahead and process data which the 1st processor hasn’t reached, because you don’t know what the compression model is going to look like until you get there. To solve this, you would need to have the compression state periodically revert to an earlier state; one way of doing this would be to zero the compression state (i.e., to operate on blocks completely independently of each other), but somewhat better results could be obtained by branching and merging models. Either way, however, you’d need to change the compression format, since the decompressor needs to update its model in sync with the compressor.
BTW, the 7z compressor operates in several stages (optional preprocessing, deflation, arithmetic compression), so I’m guessing that the option to use two threads just pushes one of these (probably arithmetic compression) off to a second thread. This is a cheap way of parallelizing, but it doesn’t scale.
If you broke up your backups into multiple files and did multiple invocations of 7zip as Stephane suggested, that would nullify the benefits of the 7zip format.
It may very well be possible to parallelize the format without blocking up the data, but as to how trivial it would be, I don’t know.
If nothing else, reimplementing the algorithm in a functional language designed for parallelism might be worth the overhead of a ‘slower-than-C’ language with that many cores available.
I’m actually a tinkerer with asm and C, so if you (Jeff Atwood) expressed sufficient interest in the subject I could look at 7zip’s code and see what the deal is.
So, 7zip with default settings takes 34 meters to compress the database, while bzip2 takes only three and a half meters. Cool. But I was confused with the MB/sec units you used in the forth column, This resembles me the unit for data throughput, but that is MB/s, not MB/sec.
As suggested, you should break the big database in separate tables, and compress tables in parallel. Since the LZMA algorithm is streaming, it makes more sense to compress tables individually, since data may look more alike inside each table (better opportunity for compression) than taking the whole database at once.
What is bzip doing with all that cpu time? It looks like 7zip running fastest mode (with less parallelization) beats bzip ultra mode (in file size). Based upong the final file size, it doesn’t look like bzip runs faster. It just generates more heat.
This was my thought too. It’s a good example of why I still don’t think anything beyond 2 cores on the desktop is really necessary yet. Get a super fast dual core (mine runs at 3.8 GHz now), and you will win every time except for edge conditions.
(to be fair to @Stephane3, C++ compilation is one of those edge conditions; it’s very slow and highly, highly parallelizable.)
The only reason to go bzip2 is if you need the file compressed as rapidly as possible to a reasonable size.
On pure efficiency / energy, 7zip wins, despite bzip2 using 4x the CPU time.
Never mind, I get it. The bzip throughput is higher (bytes processed faster0. However the result size is worse (larger). It would be interesting to come up with some ratio that would represent compressed bytes per minute.
Metro, maximum compression developed an efficiency measure:
That combines compression time, decompression time, and size, which sounds like what you’re looking for. It used to be only size and compression time, which for some scenarios is more useful (like this one).
Well, actually, it seems to me to be feasible to have both the parallel operations and the better compression size.
Compression works on the model of the data. So effectively, you’d need some sort of modeller algorithm that can analyze blocks of the dataset, then merge the results together. The question becomes then, is that algorithm better than one that does not parallelize?
An easy way to prove something does not work is by trying to find a situation where it fails. Of course, logically, not finding a situation does not mean it always works either.
Let us assume that in each block there is a unique byte, b1. Because it is unique within only that block, but not in the context of the program, most naive modeling algorithms would miss the correlation. But it is possible, with an accompanying increase in memory consumption, to build a modeling algorithm that catches that pattern, while still parallelizing the work.
Now, for a real proof, we’d start at a base case, where each block is one byte. Since there is no repetition in the block, we cannot compress it, not without increasing the size of the file. Then we merge a two of the checked bytes together. And one of two situations happen: either there is repetition, or there is not. At each stage, the blocks are merged together with more blocks, and again, the same two situations happen. Each merged (n+1) block is checked. Then merged. If the modeler can find a pattern of data at an n-size block, it follows it will find more at each n+i-size block. Therefore, we achieve the parallelization and the proper data modeling we need to do this.
Not quite a full proper proof, but it should be sufficient. If not, I apologize. I’d suggest taking a look at the literature on how DBMS’ deal with efficient indexing while taking advantage of multiple cores. Most of the problems we face now were solved in that area 10-15-20 years ago.
Sorry, but your example is rather meaningless.
bzip2’s ultra is worst in both time and size than 7zip’s fastest.
So its rather pointless to say bzip2 is more parallel and scales better, when 7zip out right beats it.
However this is only a sample size of 1, so its hard to say.
Any reason you didn’t use gzip? If possible, can you please post results for that, too?
My personal observation is gzip is extremely fast compared to bzip2 (and 7 zip) and the penalty of extra storage is not that high.
Er… look at the numbers…
We can just conclude that 7zip fastest does a great job in 5 minutes on 1 core, while bzip2 ultra isn’t as good in 20 minutes on 8 cores.
Philippe: Or we can conclude that bzip is better if you need speed and 7zip is better if you need size. If you look at both of them they both shine in their specialty but suck when not.