File Compression in the Multi-Core Era

I tried extracting a huge RAR archive using both 7-Zip and WinRAR. 7-Zip takes 50-60% of the CPU, while WinRAR utilizes both cores at close to 100%.

7-Zip reports 50 minutes to complete the operation; even with its higher CPU utilization, WinRAR shows over 60 minutes to complete.

Basic queuing theory suggests that 3 cores/CPUs may offer a practical advantage over just 2, even in light-load settings (under ~33% utilization), so a quad-core scenario might be overkill on the desktop.

However, my quad-core desktop is vastly more responsive than my dual-core laptop, with similar clock speeds, the same RAM configuration, and very similar usage profiles. I doubt it’s all about disk I/O, but I can’t discount it, knowing the limitations of typical laptop drives.

John Goerzen recently blogged about comparing various compression schemes, and the time/space tradeoff you get from each.

It’s amazing how good gzip still is in terms of the compression you get for a given amount of CPU time.

http://changelog.complete.org/archives/910-how-to-think-about-compression

So, to test parallelization’s effects, do a simple ‘split’ of your multi-GB file into 4 or 8 chunks and have 4 (2-CPU setting) or 8 (1-CPU setting) 7-Zip processes running on them. The relevant timing is the total process time of the longest-running of the 4 or 8 processes, and the relevant file size is the sum of all 4 (or 8) chunks post-compression.
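Something like this rough sketch, assuming GNU split (the -n chunk option needs a reasonably recent coreutils) and the 7z command-line tool; file names are just placeholders:

    # Split the dump into 4 numbered chunks and compress them in parallel,
    # then compare the longest wall-clock time and the total compressed size.
    split -n 4 -d huge-backup.sql chunk.
    for f in chunk.0*; do 7z a "$f.7z" "$f" & done
    wait
    ls -l chunk.*.7z    # the sum of these sizes is the figure to compare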

Seems like the obvious next step, given the conclusion that bzip’s performance gains are due to it using multiple cores. It also seems logical that you’d see a significant performance boost without a significant effectiveness drop, given the huge source file size. (I have a hard time believing that any compression algorithm is going to be significantly more efficient on 4 GB of data than on 2 GB of data! Ten years ago I found the efficiency differences in gzip between 50 MB and 100 MB file sizes weren’t significant, on the order of 0.1%, and I don’t think the algorithms have improved enough in the intervening years to take advantage of the larger pattern space!) Who knows: you might find that a split/7zip process fits your needs best.

On the other hand, it also seems like a straight block split to multiple cores would have occurred to the folks who make 7-zip too, so maybe they’ve already tried this and found no advantage. Still, their test cases and your specific case are likely different enough that what doesn’t make sense for them as a general-use option does make enough sense in your scenario.

Does this stuff really matter? Even for your backup solution, Jeff, it is a little overkill to research this much, unless it’s just out of curiosity.

For almost all users, it really doesn’t matter which compression you pick, as long as you pick one (just like copyright licenses).

I heard you describe in your podcast how you’re using local backup because, even compressed, your db is too big to download daily.

I download my 5 GB database every day in about 5 minutes using rsync, once I’ve got an initial reference copy.

I dump the PostgreSQL db on the server in plain uncompressed format, then rsync with compression to copy it locally. If your db is like mine and many others, where a small amount of existing data changes daily and most of the changes are new records, rsync is VERY efficient at pulling out just the new records to bring your local copy of the file up to date with the server copy.
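A minimal sketch of that setup, assuming a pg_dump on the server and rsync over ssh (host and path names are placeholders):

    # On the server: plain, uncompressed dump so rsync's delta transfer
    # has stable byte ranges to match against
    pg_dump -Fp mydb > /backups/mydb.sql

    # On the backup machine: -z compresses over the wire, and only the
    # changed parts of the file are actually transferred
    rsync -avz server:/backups/mydb.sql /local/backups/mydb.sql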

I don’t think rsync is available on Windows natively, but you can get it as part of cygwin.

This article tells a bit of a different story: according to http://changelog.complete.org/archives/910-how-to-think-about-compression, 7za -2 (the LZMA algorithm used in 7-Zip and p7zip) is both faster and smaller than any possible bzip2 combination.

Wow… this is impressive. You should try messing with the dictionary size to see if there are any improvements.
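With the 7z command line that would look roughly like this (the size is just an example, and any gain depends heavily on the data; a larger dictionary also needs more RAM to compress and decompress):

    # Same LZMA algorithm, bigger dictionary, maximum compression level
    7z a -t7z -m0=lzma -md=64m -mx=9 backup.7z backup.sql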

IIRC 7zip can’t handle Unix permissions, so I’ve always used tar, then compressed with the lzma util and uncompressed with lzma -d before untarring. More recently it is also possible to call lzma directly from tar.
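Roughly, assuming a GNU tar new enough to have the --lzma option (paths are placeholders):

    # The two-step way: tar first, compress/decompress separately
    tar cf backup.tar /some/dir
    lzma backup.tar             # produces backup.tar.lzma
    lzma -d backup.tar.lzma     # decompress before untarring

    # Newer GNU tar can drive lzma itself
    tar --lzma -cf backup.tar.lzma /some/dir
    tar --lzma -xf backup.tar.lzma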

Has anyone benchmarked the lzma util?

Any reason you didn’t use gzip? If possible, can you please post results for that, too?

Same file, same conditions, gzip:

fastest: 2.5 min, 29 MB/sec, 1169 MB, one CPU core at 100%

normal: 17 min, 5 MB/sec, 1098 MB, one CPU core at 100%
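For reference, on the command line the two settings would look roughly like this; I’m assuming the presets map to gzip’s -1 and its default -6 level, which is a guess on my part:

    gzip -1 -c backup.sql > backup-fast.sql.gz      # "fastest"
    gzip -6 -c backup.sql > backup-normal.sql.gz    # "normal" (gzip's default level)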

I second Nat’s suggestion: keep an eye on 7-Zip development, as LZMA2 is going to become a very interesting solution in the near-to-mid-term future if the current multicore trend continues.

Another interesting project to observe, in a longer time frame, is PAQ.
Based on neural networks, it is currently the top-performing compressor (meaning that it, or derived works like WinRK and KGBArchiver, sits at the top of all comparative benchmarks); ZPAQ is an ongoing effort to standardize a format:
http://www.cs.fit.edu/~mmahoney/compression/#zpaq
It is too heavy in terms of memory usage and computational complexity to be a feasible choice for current-generation machines, but it has proven to be the best currently known approach for reducing file sizes toward the theoretical amount of entropy they contain.

Another way to reduce the size of your backups is to not take full backups every day. Instead, you can take weekly full backups and daily differential backups.

In fact, you can use a combination of full, differential and transaction log backups to increase the amount of data covered by your backups while still reducing the size of your backup files.
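As a Windows command-line sketch of that schedule, driven through sqlcmd (server, database, and path names are placeholders; WITH COMPRESSION assumes SQL Server 2008):

    REM Weekly full backup
    sqlcmd -S myserver -Q "BACKUP DATABASE MyDb TO DISK='D:\backups\MyDb_full.bak' WITH COMPRESSION"

    REM Nightly differential (only the changes since the last full backup)
    sqlcmd -S myserver -Q "BACKUP DATABASE MyDb TO DISK='D:\backups\MyDb_diff.bak' WITH DIFFERENTIAL, COMPRESSION"

    REM Frequent transaction log backups (requires the full recovery model)
    sqlcmd -S myserver -Q "BACKUP LOG MyDb TO DISK='D:\backups\MyDb_log.trn'"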

Have you compared your home built solution against RedGate SQL Backup?

http://www.red-gate.com/products/SQL_Backup/index.htm

And remember, recovery time is just as important as backup time. I will concur with BradC: native SQL 2008 compression does enough compression at speed, and is an easy enough option to implement that it’s almost the default setting on new 2008 servers.

Jeff, which server is doing the compression? Are you maxing out your DB server for 3 or so minutes? What happens to site response times during this activity?

The point is that it doesn’t take much to load up current CPUs; even a top-of-the-line machine like mine with an efficient OS can easily make efficient use of more than 2 cores with typical consumer use.

No actual published benchmarks of typical computer use support your statement. I can point to dozens of articles backed by data on AnandTech, TechReport, Tom’s Hardware, etc., that all show the same thing: there is a massive point of diminishing returns beyond 2 cores.

If you aren’t in one of the narrow niches that can exploit parallelism, an ultra-fast dual core is all you need.

On the other hand, it is fun to believe that you need one CPU core for your email, one for your iTunes playback, etc. It’s like people have forgotten how multi-tasking operating systems work… :)

One thing that hasn’t been noted is that bzip2 takes all the cores, resulting in the SQL service being put on ‘pause’ while it’s working (it will be fighting with the SQL service for CPU time).

Whereas if you use 7zip, you will only hold 2 cores, while the other 6 are free for the server to do its business :)
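If it comes to that, both tools can be told how many threads to grab; a rough sketch, assuming the parallel bzip2 in question is pbzip2 (thread counts are just examples):

    # pbzip2 caps its worker threads with -p
    pbzip2 -p2 backup.sql

    # 7-Zip's command line has a multithreading switch
    7z a -mmt=2 backup.7z backup.sql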

How about lrzip? http://ck.kolivas.org/apps/lrzip/README

Slightly different implementation; it uses your choice of bzip2, lzma, or lzo.

Has anyone tried compiling bzip2 and/or 7zip with Intel’s vector instructions, i.e. using XMM or MMX or any of the newer instruction sets?
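A cheap way to find out, assuming GCC and the stock bzip2 Makefile (flags are only an example, and the compiler may not find much to vectorize in this kind of integer-heavy code):

    # Rebuild bzip2 with aggressive optimization and SSE2 code generation
    make clean
    make CFLAGS="-O3 -msse2 -D_FILE_OFFSET_BITS=64"
    time ./bzip2 -9 -k -c testfile > /dev/null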

Wall Clock vs. CPU time

It’s a shame that 7zip, as a streaming algorithm, can’t really be used in a massively parallel manner. If you’re wondering just how parallel bzip2 can get (it is a block algorithm, with -1 through -9 setting the block size of the uncompressed data), though, Wikipedia implemented a distributed network bzip2 compressor.

I’ve done a lot of thinking about how to address the Wikipedia backup problem (the English wiki is so huge that the backup takes months and regularly fails before completion), and I have favored 7zip entirely because of the performance of 7zip (or lzma) at its faster compression settings. Tricks need to be used with 7zip for large amounts of data, namely splitting into multiple input and output streams: db-backup-1, db-backup-2, etc. Kind of like what we did in the ‘old days’, when you’d use 100 floppy disks to back up your desktop. That sucked, by the way.

Anyway, the answer at the end of it is that I pipe my database backup streams into ‘lzma -2’, because I find it compresses faster than bzip2 and to a smaller size as well (assuming you’re not running bzip2 with a 4-fold CPU advantage, at least).
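Concretely, the pipeline looks something like this (the dump command is a placeholder; substitute whatever your database provides):

    # Stream the dump straight into lzma at a fast preset; no temp file needed
    mysqldump --single-transaction mydb | lzma -2 > mydb.sql.lzma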

Of course it’s all moot in Wikipedia’s case, since I guess something on the order of 100 of us are willing to tackle the problem, and nobody seems to want to let us try to get our hands dirty. It was fun reading on the mailing list last week.

It’s years old, but Linux Journal made one of the niftiest looking compression comparison graphs I’ve seen in their comparison of 18 compressors: http://www.linuxjournal.com/article/8051

-Jeff

It never occurred to me before, but since bzip2 uses the BWT (Burrows-Wheeler Transform), it does allow parallelism, since the compression has two independent steps:

(1) cyclic suffix sort
(2) compress the last column

and it operates on fixed-size chunks, making it a perfect candidate for multi-core CPUs. Sounds like another blog posting; BWT is a beautiful and elegant idea, one of those ideas that seems mysterious and makes you wonder how it was ever invented.

Not to be a link-whore, but I tracked 7-zip’s compression efficiency just recently: http://heliologue.com/2009/02/09/tracking-lzma-efficiency/

There was a precipitous drop in encoding time between the early 4.x series and about v4.57, since which it has mostly leveled off. Looks like LZMA2 (4.66 in my data) promises to increase the efficiency even further.

Final compressed sizes are similar (too similar to bother reporting, I found, at least on Ultra).