Basic queuing theory might indicate that 3 cores/CPUs offer a practical advantage over just 2 even in light-load settings (under ~33% utilization), so a quad scenario might be overkill on the desktop.
However, my quad-core desktop is vastly more responsive than my dual-core laptop: similar clock speeds, the same RAM configuration, and very similar usage profiles. I doubt it's all about disk I/O, but I can't discount it, knowing the limitations of typical laptop drives.
So, to test parallelization's effects, do a simple "split" of your multi-GB file into 4 or 8 chunks and have 4 (2-CPU setting) or 8 (1-CPU setting) 7-Zip processes running on them. The relevant timing is the total process time of the longest-running of the 4 or 8 processes, and the relevant file size is the sum of all 4 (or 8) chunks post-compression.
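A minimal sketch of that experiment, using gzip as a stand-in compressor so it runs anywhere (swap in something like `7z a -mmt=1` to match the scenario; all file names here are made up):

```shell
# Fake a dump file so the sketch is self-contained (a real run would
# point at the actual multi-GB backup instead).
yes "INSERT INTO t VALUES (1);" | head -c 1048576 > backup.sql

# Split into 4 equal chunks named chunk.00 .. chunk.03 (GNU split).
split -n 4 -d backup.sql chunk.

# Compress each chunk in its own single-threaded process, in parallel.
for f in chunk.0?; do gzip -f "$f" & done
wait

# Relevant time = the slowest worker; relevant size = sum of all chunks.
ls -l chunk.0?.gz
```

Because of the `wait`, timing the loop with `time` gives you exactly the longest-running worker's wall-clock figure.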
Seems like the obvious next step, given the conclusion that bzip's performance gains come from using multiple cores. It also seems logical that you'd see a significant performance boost without a significant effectiveness drop, given the huge source file size. (I have a hard time believing that any compression algorithm is going to be significantly more efficient on 4GB of data than on 2GB! Ten years ago I found the efficiency differences in gzip between 50MB and 100MB file sizes weren't significant, on the order of 0.1%, and I don't think the algorithms have improved enough in the intervening years to take advantage of the larger pattern space.) Who knows: you might find that a split/7zip process fits your needs best.
On the other hand, a straight block split across multiple cores would presumably have occurred to the folks who make 7-zip too, so maybe they've already tried this and found no advantage. Still, their test cases and your specific case are likely different enough that what doesn't make sense for them as a general-use option may make enough sense in your scenario.
I heard you describe in your podcast how you're using local backup because, even compressed, your db is too big to download daily.
I download my 5GB database every day in about 5 minutes using rsync, once I've got an initial reference copy.
I dump the PostgreSQL db on the server in plain uncompressed format, then rsync with compression to copy it locally. If your db is like mine and many others, where a small amount of existing data changes daily and most of the changes are new records, rsync is VERY efficient at pulling out just the new records to bring your local copy of the file up to date with the server copy.
I don't think rsync is available on Windows natively, but you can get it as part of Cygwin.
IIRC 7zip can't handle Unix permissions, so I've always used tar, then compressed with the lzma util and uncompressed with lzma -d before untarring. More recently it is also possible to call lzma directly from tar.
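A sketch of that workflow (directory and file names are examples; assumes the `lzma` command shipped with xz-utils):

```shell
# Something with Unix permissions worth preserving (hypothetical names).
mkdir -p project && echo "hello" > project/file.txt
chmod 750 project/file.txt        # a mode bit 7zip could lose

# Archive with tar (which records Unix permissions), then compress.
tar -cf backup.tar project
lzma -f backup.tar                # produces backup.tar.lzma

# Restore: decompress first, then untar.
lzma -d -f backup.tar.lzma
tar -xf backup.tar

# Newer GNU tar can drive lzma itself:
#   tar --lzma -cf backup.tar.lzma project
```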
I second Nat's suggestion: keep an eye on 7-Zip development, as LZMA2 is going to become a very interesting solution in the near/mid-term future if the current multicore trend goes on.
Another interesting project to observe, in a longer time frame, is PAQ.
Based on neural networks, it is currently the top-performing compressor (meaning that PAQ, or derived works like WinRK and KGBArchiver, holds the top position in all comparative benchmarks); ZPAQ is an ongoing effort to standardize a format: http://www.cs.fit.edu/~mmahoney/compression/#zpaq
It is too heavy in memory usage and computational complexity to be a feasible choice on current-generation machines, but it has proven to be the best currently known approach to reducing file sizes toward the theoretical amount of entropy they contain.
Another way to reduce the size of your backups is to not take full backups every day. Instead, you can take weekly full backups and daily differential backups.
In fact, you can use a combination of full, differential and transaction log backups to increase the amount of data covered by your backups while still reducing the size of your backup files.
And remember, recovery time is just as important as backup time. I concur with BradC: native SQL 2008 compression compresses well enough, fast enough, and is so easy to implement that it's almost the default setting on new 2008 servers.
Jeff, which server is doing the compression? Are you maxing out your DB server for 3 or so minutes? What happens to site response times during this activity?
The point is that it doesn't take much to load up current CPUs: even a top-of-the-line machine like mine with an efficient OS can easily make efficient use of more than 2 cores under typical consumer use.
No actual published benchmarks of typical computer use support your statement. I can point to dozens of articles backed by data on AnandTech, TechReport, Tom's Hardware, etc., that all show the same thing: there is a massive point of diminishing returns beyond 2 cores.
If you arenât in one of the narrow niches that can exploit parallelism, an ultra-fast dual core is all you need.
On the other hand, it is fun to believe that you need one CPU core for your email, one for your iTunes playback, etc. It's like people have forgotten how multi-tasking operating systems work...
One thing that hasn't been noted is that bzip2 takes all the cores, effectively putting the SQL service on "pause" while it's working (it will be fighting with the SQL service for CPU time).
If you use 7zip instead, you will only hold 2 cores, while the other 6 are free for the server to do its business.
It's a shame that 7zip, as a streaming algorithm, can't really be used in a massively parallel manner. If you're wondering just how parallel bzip2 can get (it is a block algorithm, with -1 through -9 correlating to the uncompressed block size), though, Wikipedia implemented a distributed network bzip2 compressor.
I've done a lot of thinking trying to address the Wikipedia backup problem (the English wiki is so huge that the backup takes months and regularly fails before completion), and I have favored 7zip entirely because of its (or lzma's) performance at faster compression settings. Tricks need to be used with 7zip for large amounts of data, and namely that trick is multiple input to multiple output streams, a la db-backup-1, db-backup-2, etc. Kind of like what we did in the "old days" when you'd use 100 floppy disks to back up your desktop. That sucked, by the way.
Anyway, the answer at the end of it is that I pipe my database backup streams into "lzma -2", because I find it compresses faster than bzip2 and compresses to a smaller size as well (assuming you're not running bzip2 with a 4-fold CPU advantage, at least).
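For instance (the input here is a stand-in stream; in practice the producer would be `pg_dump`, `mysqldump`, or whatever backup tool writes to stdout):

```shell
# Generate a dump-like stream and pipe it straight into lzma at level 2,
# which trades some compression ratio for speed on large inputs.
yes "INSERT INTO t VALUES (1);" | head -n 50000 | lzma -2 > dump.sql.lzma

# Round-trip check: decompress to stdout and count the lines.
lzma -dc dump.sql.lzma | wc -l
```

Streaming this way means no uncompressed intermediate file ever touches the disk.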
Of course it's all moot in the sense of Wikipedia, since I guess something on the order of 100 of us are willing to tackle the problem, and nobody seems to want to let us try to get our hands dirty. It was fun reading on the mailing list last week.
It's years old, but Linux Journal made one of the niftiest-looking compression comparison graphs I've seen in their comparison of 18 compressors: http://www.linuxjournal.com/article/8051
It never occurred to me before, but since bzip uses the BWT (Burrows-Wheeler Transform), it does allow parallelism, because the compression has two independent steps:
(1) cyclic suffix sort
(2) compress the last column
and it operates on fixed-size chunks, making it a perfect candidate for multi-core CPUs. Sounds like another blog posting; BWT is a beautiful and elegant idea, one of those ideas that seems mysterious and makes you wonder how it was ever invented.
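A toy illustration of those two steps on the classic example string "banana$" (pure shell, for exposition only; real implementations sort suffixes far more cleverly):

```shell
s='banana$'; n=7
# (1) build every cyclic rotation, then sort them (the suffix sort);
# doubling the string makes rotation i just an n-character window.
for i in $(seq 0 $((n - 1))); do
  echo "$s$s" | cut -c"$((i + 1))-$((i + n))"
done | LC_ALL=C sort > rotations.txt
# (2) the transform is the last column of the sorted rotations.
cut -c"$n" rotations.txt | tr -d '\n'; echo    # prints: annb$aa
```

Note the runs of equal characters ("nn", "aa") in the output; that clustering is what makes the transformed block so easy to compress.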
There was a precipitous drop in encoding time between the early 4.x series and about v4.57, since which it's been idling. Looks like LZMA2 (4.66 in my data) promises to increase the efficiency even further.
Final compressed sizes are similar (too similar to bother reporting, I found, at least on Ultra).