Chris Guest
|
Posted: Sat Jul 26, 2008 11:15 pm Post subject: Advice on selecting text compression library |
|
|
My app needs to compress and decompress large amounts of plain text very
quickly. The text is usually English, sometimes non-English in UTF8.
Can anyone point me to the best text compression library to use?
Here are the requirements:
1. Must be accessible from Java.
2. Text *decompression* speed is the most important thing to optimize
for. We>ll compress only once, but decompress many times. The amount of
compression we get is important, but we>re willing to trade size for speed.
3. The blocks of text to compress will range from 100 bytes to 100K,
with an average ~10K.
I>m currently using the java.util.zip library built in to Java, and it
works ok, but a profiler shows that a large percentage of processor time
is spent in decompression. Is there a library that can do significantly
better? |
|
Mark Nelson Guest
|
Posted: Sun Jul 27, 2008 3:53 pm Post subject: Re: Advice on selecting text compression library |
|
|
On Jul 26, 1:15 pm, Chris <spam_me_...@goaway.com> wrote:
[quote]My app needs to compress and decompress large amounts of plain text very
quickly. The text is usually English, sometimes non-English in UTF8.
[...]
I>m currently using the java.util.zip library built in to Java, and it
works ok, but a profiler shows that a large percentage of processor time
is spent in decompression. Is there a library that can do significantly
better?
[/quote]
I suppose it>s always possible that your native methods are
inefficient on your platform, but in general java.util.zip should
perform very well in terms of speed. I think that the overall
acceptance of this format shows that it provides a balance between
speed and compression that is pretty comfortable for most people. Last
time I studied it the zip classes were implemented on top of zlib, so
they are not using Java but rather the target system>s C compiler.
Note that Zip is somewhat asymmetrical - the compressor has to perform
a search for matching strings at the current compression point,
whereas the decompressor simply has to follow an index back to the
matching string.
Likewise, the compressor has to build a Huffman table and encode using
it, the decompressor doesn>t have to go through the build process
(although I suspect this is not a big deal.)
Anyway, the point is that decompression should be fairly fast, and you
will be hard pressed to find something a lot faster. You can do some
parameter tuning in an attempt to speed things up, but I think that in
many cases it will affect the compressor more than the decompressor.
Markus Oberhumer makes the LZO library which is reputed to be very
fast, although people often have trouble with licensing. I don>t know
if it has Java bindings, you may have to do that yourself:
http://en.wikipedia.org/wiki/Lempel-Ziv-Oberhumer
If you look at some compression benchmarks you can look at the faster
compressors and see if any of them are available in lib format:
http://cs.fit.edu/~mmahoney/compression/text.html
It>s worth noting that the zip compressors show up as having very fast
decompression times in Matt>s benchmark, but LZO does manage to beat
them. (At some cost to compression ratios.)
|
| Mark Nelson - http://marknelson.us
| |
|