File Compression in the Multi-Core Era

Dobbs: I stand corrected - bzip2 is really fast for fairly good results.

Hi Jeff: could you change the font color for this blog, please? Otherwise, an excellent place.

Jeff, you may want to try lzop too; it won’t give as good compression, but it should be the fastest by CPU seconds, and the decompressor is very fast too.

7zip fastest: 5 min, 14 MB/sec, 973 MB
bzip2 ultra: 21 min, 4 MB/sec, 986 MB

Why is bzip2 able to work so much faster than 7zip?


To compare like for like, we need similar final file sizes, and bzip2 is not faster than 7zip.

to hmm:

Does your quad core have more L2 or L3 cache than your dual core laptop?

It might not take much more cache to improve the hit rate an extra 1%. At high hit rates (e.g. 90%), a 1% improvement in the hit rate is a major (e.g. 10%) reduction in the miss rate.

At a 100-clock-tick penalty per miss (vs. 1 cycle per hit), that’s the difference between 100 memory accesses taking 1090 cycles and taking 991 cycles.
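The arithmetic above can be sanity-checked with a quick sketch (assuming, as the example does, a flat 1-cycle cost per hit and a 100-cycle cost per miss; the function name is just for illustration):

```python
def total_cycles(accesses, hit_rate, hit_cost=1, miss_cost=100):
    """Total cycles for a batch of memory accesses at a given hit rate."""
    hits = round(accesses * hit_rate)
    misses = accesses - hits
    return hits * hit_cost + misses * miss_cost

print(total_cycles(100, 0.90))  # 1090 cycles at a 90% hit rate
print(total_cycles(100, 0.91))  # 991 cycles at a 91% hit rate
```

A one-point improvement in the hit rate removed a full 100-cycle miss, which is why the miss rate, not the hit rate, is the number to watch.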

Depending on the working set of all the processes (including clock-tick-, keystroke-, and mickey-triggered background processes), an additional 1 MB of cache can make a BIG difference.

David Kra

Another relevant measure with multicore is MiB/J - mebibytes per joule.

Yes everyone saw that:
7zip fastest: 5 min, 14 MB/sec, 973 MB
bzip2 ultra: 21 min, 4 MB/sec, 986 MB

But that’s missing the point, which was that you can get a very fast compression of 5 GB of data using bzip2: 2 min vs. 5 min with 7zip, at a loss of ‘only’ 100 MB, which compared to 5 GB is not that much (2%).

And at the normal setting, bzip2 narrows the size difference considerably while still finishing 1.5 min sooner.

Of course, one wouldn’t want to max out ALL the server’s cores; you may need to actually ‘serve’ things during that time. So IMO the 3-minute saving isn’t really worth that much in that use case.

But I guess if you needed to compress your skynet DB on a machine with a gazillion cores, then bzip2 would be the way to go.

@Pyrolistical:

We can get a sample size greater than 1 if we compare something other than the one configuration that makes bzip2 look worst :).

bzip2 ultra is clearly the wrong choice on this hardware and dataset. But there’s no configuration of 7zip that can get the original file under a quarter of its original size in under 5 minutes, which bzip2’s normal (and faster) settings achieve. So bzip2 is clearly faster than 7zip in those scenarios.

Of course, as stressed above, the scenario must be considered to decide the proper space/time tradeoff.

Interesting, but WHY ARE YOU USING UNCOMPRESSED NATIVE SQL BACKUPS, and not compressed SQL backups (let the SQL engine compress the file on the fly as it is performing the backup)?

Yes, SQL 2008 supports it: http://sqlblog.com/blogs/tibor_karaszi/archive/2007/12/12/backup-compression-in-sql-server-2008.aspx

(OK, so only the Enterprise edition supports it; if you aren’t using that edition, use LightSpeed backup instead to compress backups on the fly. It also allows other cool stuff, like restores of individual tables instead of forcing you to restore an entire copy of a huge database.)

BradC: In Unix I take that for granted… I pipe the output of mysqldump to bzip2, and there’s your on-the-fly compression as it performs the backup.
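The same pipe-to-compressor idea can be sketched in Python with just the standard library (the `mysqldump` invocation in the usage comment is a stand-in; substitute your real dump command and options):

```python
import bz2
import subprocess

def stream_compress(cmd, out_path, chunk_size=64 * 1024):
    """Run `cmd` and bzip2-compress its stdout on the fly,
    never holding the whole dump in memory."""
    with subprocess.Popen(cmd, stdout=subprocess.PIPE) as proc, \
            bz2.open(out_path, "wb") as out:
        # Read fixed-size chunks until the producer closes its stdout.
        for chunk in iter(lambda: proc.stdout.read(chunk_size), b""):
            out.write(chunk)
    if proc.returncode != 0:
        raise RuntimeError(f"{cmd[0]} exited with code {proc.returncode}")

# Usage (hypothetical): stream_compress(["mysqldump", "mydb"], "backup.sql.bz2")
```

The shell one-liner `mysqldump mydb | bzip2 > backup.sql.bz2` does the same thing with less ceremony; the sketch is only useful if you want the step inside a larger Python script.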

Did you try LZMA2 in 7-zip 4.66 Alpha?
http://sourceforge.net/forum/forum.php?thread_id=2965956&forum_id=45797

Not all programs/algorithms can be parallelized indefinitely. Some algorithms are inherently non-parallelizable beyond a limit (for example, calculating the Fibonacci series via the formula F(k + 2) = F(k + 1) + F(k) is strictly sequential). It may be that the algorithm used by 7zip does not scale well beyond 2 threads.
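To make the Fibonacci example concrete: each term depends on the two before it, so there is no way to hand different terms to different cores (a minimal sketch):

```python
def fib_sequence(n):
    """First n Fibonacci numbers via F(k + 2) = F(k + 1) + F(k).
    Every iteration reads the previous two results, so the loop
    carries a dependency chain and cannot be split across threads."""
    seq = [0, 1]
    while len(seq) < n:
        seq.append(seq[-1] + seq[-2])
    return seq[:n]

print(fib_sequence(8))  # [0, 1, 1, 2, 3, 5, 8, 13]
```

Compression has an analogous constraint: if later output depends on earlier context (as with a single LZMA stream), threads can only help by splitting the input into independent blocks, which usually costs some ratio.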

Why does bzip2 work so much faster than 7zip? It’s simple:

I don’t see that it works faster; have a look for yourself:

7zip took 7 minutes to compress the archive to 926 MB,
while bzip2, in the same seven minutes, compressed it to 987 MB.

I.e., in the same time 7zip compressed better, which means that to reach the same size 7zip would compress faster.

On top of that, bzip2 used all eight cores, while 7zip used barely 2. If it used all 8 cores, it would be about 4 times faster again. And 7zip’s algorithm is not symmetric: it decompresses about 10 times faster.

Conclusion: 7zip is much, much faster than bzip2.

Your blog has a bad character encoding.

I first came here to say what your first commenter said.

So instead I’ll say that people using the original bzip2 program (rather than its algorithm implemented by 7zip) should install pbzip2 to get the parallelization advantages.
http://compression.ca/pbzip2/
(It seems to be available already in Ubuntu Linux.)

One other thought… have you considered how caching of some of the data between runs may affect your speed results? You didn’t mention how much memory you have, but I assume it’s at least the size of your data file.

Nat – sounds very interesting!

copied from that thread, from Igor Pavlov (the author of 7zip):

WHY ARE YOU USING UNCOMPRESSED NATIVE SQL BACKUPS, and not compressed SQL backups (let the SQL engine compress the file on the fly as it is performing the backup)?

Because the default compression offered is very poor – and once compressed, it cannot be recompressed with any kind of savings.

We can literally get 1/2 the size of those compressed backups using 7zip.

@Jeff
But even though it isn’t compressed as much, the backups are many times faster when done to a compressed file. That’s the biggest advantage - try benchmarking that! The difference is startling (like 90% speed improvement).

Cheers

There are versions of gzip and bzip2 specifically written in a parallel fashion (at least for compression). A post with some benchmarks:

http://blogs.sun.com/timc/entry/tamp_a_lightweight_multi_threaded

You may want to check out the pigz utility (Parallel Implementation of GZip):

http://zlib.net/pigz/

It leverages the standard zlib found on most Unix-y systems, and by default launches up to eight threads to go through the file / stream (configurable).

(Sun has some systems that can run in up-to 256 threads in parallel, so the people there are kind of interested in this.)

On the file format debate (which of 7zip, gzip, bzip2 is better?), wouldn’t it depend on what you’re doing with the compressed file?

If you’re sending it out once or twice to a few people, then the overhead of bzip2 compression may not be worth it.

However, if you’re posting it on a web or FTP site, to be downloaded by thousands of people, then the bandwidth savings may be worth waiting an extra five minutes for the file to be compressed (something you only have to do once). So if you’re saving 1 MB by going from gzip to bzip2, and the file will be downloaded 10,000 times in a month, that’s a saving of about 10 GB.
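That back-of-the-envelope arithmetic generalizes easily (the 1 MB and 10,000-download figures are just the example above; the function name is illustrative):

```python
def bandwidth_saved_gib(mb_saved_per_file, downloads):
    """Total transfer saved when every download ships a smaller file."""
    return mb_saved_per_file * downloads / 1024  # MiB -> GiB

# 1 MB saved per file, 10,000 downloads per month:
print(f"{bandwidth_saved_gib(1, 10_000):.1f} GiB saved")  # 9.8 GiB saved
```

A few extra minutes of one-time compression versus ~10 GiB of recurring transfer is an easy trade in the download-heavy case.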

Depending on the file savings and your distribution audience, a few minutes’ wait can save you in other places down the road.