| View previous topic :: View next topic |
| Author |
Message |
SuperFly Guest
|
Posted: Sun Sep 07, 2003 10:08 pm Post subject: sorted lists |
|
|
Hi,
I have to compress sorted lists of 32 bit integers. Are there any
algorithms that were specifically designed to do this. Or is delta
coding close to the best modelling scheme for this type of data.
Thanks for any input. |
|
| |
|
Back to top |
Colin Andrew Percival Guest
|
Posted: Sun Sep 07, 2003 10:08 pm Post subject: Re: sorted lists |
|
|
SuperFly <fake@email.com> wrote:
[quote]I have to compress sorted lists of 32 bit integers. Are there any
algorithms that were specifically designed to do this. Or is delta
coding close to the best modelling scheme for this type of data.
[/quote]
How many integers do you have? Is repetition allowed? With what
distribution are they distributed?
Colin Percival |
|
| |
|
Back to top |
Colin Andrew Percival Guest
|
Posted: Sun Sep 07, 2003 10:08 pm Post subject: Re: sorted lists |
|
|
SuperFly <fake@email.com> wrote:
[quote]The size of each list is different and the only known fact is that +/-
75% of the values are unique. Nothing more is known. Repetition is
allowed and there is no fixed distribution or pattern.
Delta coding does a decent job. Just wondered if there are specific
modelling schemes for sorted data. If you look at the amount of
research that has been put in sorting algorithms, I kind of expected
there to be better ones than delta coding.
[/quote]
If you>ve got lots of values, you could try storing them as a bitmap and
using arithmetic compression; I don>t know if that would help or not.
Colin Percival |
|
| |
|
Back to top |
Willem Guest
|
Posted: Sun Sep 07, 2003 10:08 pm Post subject: Re: sorted lists |
|
|
SuperFly wrote:
) I have to compress sorted lists of 32 bit integers. Are there any
) algorithms that were specifically designed to do this. Or is delta
) coding close to the best modelling scheme for this type of data.
I don>t know of any algorithms that are specifically designed to compress
sorted lists of integers, but I>m guessing there are enough people who cvan
come up with easy ideas how to compress such lists.
For example, you could encode the numbers in 'binary search' order, i.e.
first the center number, then the number in the middle of the left half,
the number in the middle of the right half, etc. This can easily be done
in a recursion, or some other way.
This means that the further you get, the smaller the range will be that the
number you>re encodiong can be in. Which means they will use less bits.
(You can use arith/rangecoding, or some simpler way of encoding for speed.)
I don>t know for sure how well this works, but I think it should do pretty
good.
(I can probably post some pseudocode, or C code if you want.)
SaSW, Willem (at stack dot nl)
--
Disclaimer: I am in no way responsible for any of the statements
made in the above text. For all I know I might be
drugged or something..
No I>m not paranoid. You all think I>m paranoid, don>t you !
#EOT |
|
| |
|
Back to top |
Willem Guest
|
Posted: Sun Sep 07, 2003 10:08 pm Post subject: Re: sorted lists |
|
|
Raymond wrote:
) Yes, this works fine and is called interpolative coding:
I could have guessed...
) ... If I remember correctly, the paper
) did not just look at binary encoding the range of numbers, but how the
) coding is performed (i.e., golomb coding, etc. -- which is better).
Naah, arith coding with the proper probability distribution is unbeatable.
(Of course, you need to find the 'proper' distribution first...)
) But, the paper is based on index compression for an IR system, and
) the sorted list of numbers does not have any duplicates. So for these two
) reasons, it may not be relevant.
Doesn>t matter, you just have to make the ranges inclusive, that neatly
covers the possibility of having duplicates.
In other words, instead of recursing on (l, m-1) and (m+1, r) you recurse
on (l, m) and (m, r). Easy as pie.
SaSW, Willem (at stack dot nl)
--
Disclaimer: I am in no way responsible for any of the statements
made in the above text. For all I know I might be
drugged or something..
No I>m not paranoid. You all think I>m paranoid, don>t you !
#EOT |
|
| |
|
Back to top |
SuperFly Guest
|
Posted: Sun Sep 07, 2003 11:25 pm Post subject: Re: sorted lists |
|
|
On Sun, 7 Sep 2003 17:24:57 +0000 (UTC), Colin Andrew Percival
<cperciva@sfu.ca> wrote:
[quote]SuperFly <fake@email.com> wrote:
I have to compress sorted lists of 32 bit integers. Are there any
algorithms that were specifically designed to do this. Or is delta
coding close to the best modelling scheme for this type of data.
How many integers do you have? Is repetition allowed? With what
distribution are they distributed?
[/quote]
The size of each list is different and the only known fact is that +/-
75% of the values are unique. Nothing more is known. Repetition is
allowed and there is no fixed distribution or pattern.
Delta coding does a decent job. Just wondered if there are specific
modelling schemes for sorted data. If you look at the amount of
research that has been put in sorting algorithms, I kind of expected
there to be better ones than delta coding.
But i couldn>t find any using google. |
|
| |
|
Back to top |
Steven Pigeon Guest
|
Posted: Mon Sep 08, 2003 1:21 am Post subject: Re: sorted lists |
|
|
"SuperFly" <fake@email.com> wrote in message
news:45pmlv4380fbrgvoqdhdslej1pjmoc4pd5@4ax.com...
[quote]Hi,
I have to compress sorted lists of 32 bit integers. Are there any
algorithms that were specifically designed to do this. Or is delta
coding close to the best modelling scheme for this type of data.
[/quote]
If the list is very sparse, there are sparse bitmap coding algorithms
that may be useful. If you have numerous continous runs, for example,
a list of the form of {1000-1250,2000-2040,... } a list of (start,length)
encodings would do the job. Is it very random and dense? Then good
luck.
Steven Pigeon, Ph. D.
pigeon@iro.umontreal.ca
[quote]
Thanks for any input.[/quote] |
|
| |
|
Back to top |
Colin Andrew Percival Guest
|
Posted: Mon Sep 08, 2003 1:59 am Post subject: Re: sorted lists |
|
|
Kevin Easton <kevin@-nospam-pcug.org.au> wrote:
[quote]This thread has set me thinking... how would you go about determining
the probability distribution of X, where X is defined as the median
value of N values evenly distributed over the range R1 to R2?
[/quote]
Binomial distribution. For N=2k+1 (such that we have an easily defined
median), P(median=X) = ((X-R1)^k (R2-X)^k k! k!) / ((2k+1)! (R2-R1)^(2k+1)) dX.
Colin Percival |
|
| |
|
Back to top |
Raymond Wan Guest
|
Posted: Mon Sep 08, 2003 2:34 am Post subject: Re: sorted lists |
|
|
On Sun, 7 Sep 2003, Willem wrote:
[quote]SuperFly wrote:
) I have to compress sorted lists of 32 bit integers. Are there any
) algorithms that were specifically designed to do this. Or is delta
) coding close to the best modelling scheme for this type of data.
I don>t know of any algorithms that are specifically designed to compress
sorted lists of integers, but I>m guessing there are enough people who cvan
come up with easy ideas how to compress such lists.
For example, you could encode the numbers in 'binary search' order, i.e.
first the center number, then the number in the middle of the left half,
the number in the middle of the right half, etc. This can easily be done
in a recursion, or some other way.
[/quote]
Yes, this works fine and is called interpolative coding:
author = "A. Moffat and L. Stuiver",
title = "Binary Interpolative Coding for Effective Index Compression",
journal = "Information Retrieval",
year = 2000,
volume = 3,
number = 1,
pages = "25--47",
month = jul,
It also appeared in DCC 1996. I don>t know whether or not it>s
been called something else by others. If I remember correctly, the paper
did not just look at binary encoding the range of numbers, but how the
coding is performed (i.e., golomb coding, etc. -- which is better).
But, the paper is based on index compression for an IR system, and
the sorted list of numbers does not have any duplicates. So for these two
reasons, it may not be relevant. Try transforming the sorted list of
duplicate numbers to cummulative sums -- that would get around the second
problem, but gaps between consecutive values would get larger; not sure if
this helps.
Ray |
|
| |
|
Back to top |
Kevin Easton Guest
|
Posted: Mon Sep 08, 2003 6:15 am Post subject: Re: sorted lists |
|
|
Willem <no@sp.am.invalid> wrote:
[quote]Raymond wrote:
) Yes, this works fine and is called interpolative coding:
I could have guessed...
) ... If I remember correctly, the paper
) did not just look at binary encoding the range of numbers, but how the
) coding is performed (i.e., golomb coding, etc. -- which is better).
Naah, arith coding with the proper probability distribution is unbeatable.
(Of course, you need to find the 'proper' distribution first...)
[/quote]
This thread has set me thinking... how would you go about determining
the probability distribution of X, where X is defined as the median
value of N values evenly distributed over the range R1 to R2? It>s
probably a normal distribution which is sharper for larger N (decaying
to flat for N = 1, of course), but I>m interested in how you>d work it
out exactly.
- Kevin. |
|
| |
|
Back to top |
Willem Guest
|
Posted: Mon Sep 08, 2003 8:12 am Post subject: Re: sorted lists |
|
|
Raymond wrote:
)> Doesn>t matter, you just have to make the ranges inclusive, that neatly
)> covers the possibility of having duplicates.
)> In other words, instead of recursing on (l, m-1) and (m+1, r) you recurse
)> on (l, m) and (m, r). Easy as pie.
)
) But, by assuming uniqueness in the list of numbers, some shortcuts
) can be made.
The shortcut would simply be to exclude the midpoint from the range when
you recurse down.
) For example, if your interval is now (0, 2), then only one
) number can exist in this range, so you can actually code "1" in "0 bits".
) But, suppose you have five 1>s, then how do you code them without
) explicitly coding a number that says how many there are?
Like I said, you simply include the midpoint in the interval when you
recurse. Example:
If I have [1, 1, 1, 2, 2, 4, 7], and I want to encode this, the
encoding steps will be (assuming the max. range is 0-7 inclusive):
2 in range (0..7)
1 in range (0..2) 4 in range (2..7)
1 in range (0..1) 1 in range (1..2) 2 in range (2..4) 7 in range (4..7)
When you know each number is unique, you get the following:
[0, 2, 3, 5, 6, 8, 9]
5 in range (0..9)
2 in range (0..4) 8 in range (6..9)
0 in range (0..1) 3 in range (3..4) 6 in range (6..7) 9 in range (9..9)
And another comment: My guess is that it would be easiest to 'train' a
probability model by pushing lots of data through an adaptive model and
then smoothing it out.
SaSW, Willem (at stack dot nl)
--
Disclaimer: I am in no way responsible for any of the statements
made in the above text. For all I know I might be
drugged or something..
No I>m not paranoid. You all think I>m paranoid, don>t you !
#EOT |
|
| |
|
Back to top |
Raymond Wan Guest
|
Posted: Mon Sep 08, 2003 10:39 am Post subject: Re: sorted lists |
|
|
On Sun, 7 Sep 2003, Willem wrote:
[quote]) ... If I remember correctly, the paper
) did not just look at binary encoding the range of numbers, but how the
) coding is performed (i.e., golomb coding, etc. -- which is better).
Naah, arith coding with the proper probability distribution is unbeatable.
(Of course, you need to find the 'proper' distribution first...)
[/quote]
Sorry, I didn>t mean golomb coding was better, but that they might
have looked at several coding options (A, B, C, ...) and came up with the
ideal combination. Reading over what I wrote, I can see the mistake I
made.
And contrary to the papers, I actually don>t consider it a coding
method...feels more like a transformation, because it just says how the
list of numbers are treated, and not what codewords are made. *shrug*
[quote]) But, the paper is based on index compression for an IR system, and
) the sorted list of numbers does not have any duplicates. So for these two
) reasons, it may not be relevant.
Doesn>t matter, you just have to make the ranges inclusive, that neatly
covers the possibility of having duplicates.
In other words, instead of recursing on (l, m-1) and (m+1, r) you recurse
on (l, m) and (m, r). Easy as pie.
[/quote]
But, by assuming uniqueness in the list of numbers, some shortcuts
can be made. For example, if your interval is now (0, 2), then only one
number can exist in this range, so you can actually code "1" in "0 bits".
But, suppose you have five 1>s, then how do you code them without
explicitly coding a number that says how many there are?
One way to get around that is to transform: [0, 1, 1, 1, 1, 1, 2]
to [0, 1, 2, 3, 4, 5, 7], and code this list instead. Ummm, not sure if
you>re right as well, but it>s quite possible there is another way around
it.
Ray |
|
| |
|
Back to top |
SuperFly Guest
|
Posted: Mon Sep 08, 2003 1:53 pm Post subject: Re: sorted lists |
|
|
On Sun, 07 Sep 2003 21:34:04 GMT, Raymond Wan
<rwan@student.unimelb.edu.au> wrote:
[snip]
[quote]For example, you could encode the numbers in 'binary search' order, i.e.
first the center number, then the number in the middle of the left half,
the number in the middle of the right half, etc. This can easily be done
in a recursion, or some other way.
Yes, this works fine and is called interpolative coding:
author = "A. Moffat and L. Stuiver",
title = "Binary Interpolative Coding for Effective Index Compression",
journal = "Information Retrieval",
year = 2000,
volume = 3,
number = 1,
pages = "25--47",
month = jul,
[/quote]
[snip]
I will try this out. Thanks at everyone for your ideas.
I wonder if anything can be said about the maximum ratio given the
fact that the data is 'random' and a certain percentage of the data is
unique. I>m now down to +/-25%. and it would be great to determine the
mathematical limit. So I have something to aim for. |
|
| |
|
Back to top |
Mikael Lundqvist Guest
|
Posted: Wed Sep 10, 2003 1:22 am Post subject: Re: sorted lists |
|
|
SuperFly wrote:
[quote]Hi,
I have to compress sorted lists of 32 bit integers. Are there any
algorithms that were specifically designed to do this. Or is delta
coding close to the best modelling scheme for this type of data.
Thanks for any input.
[/quote]
This has been dealt with before.
http://groups.google.com/groups?hl=sv&lr=&ie=UTF-8&threadm=3D295255.7111A71C%40telia.com&rnum=1
Binary arithmetic encoding of the delta-codes seems like a good idea.
*Maybe* using some list update algorithm to minimize the delta-codes
could be a good idea. (Unless you>d like to create a simple solution)
/Mikael |
|
| |
|
Back to top |
YG Guest
|
Posted: Sun Sep 14, 2003 2:36 am Post subject: Re: sorted lists |
|
|
hi,
Raymond Wan wrote:
[quote]On Sun, 7 Sep 2003, Willem wrote:
) ... If I remember correctly, the paper
) did not just look at binary encoding the range of numbers, but how the
) coding is performed (i.e., golomb coding, etc. -- which is better).
Naah, arith coding with the proper probability distribution is unbeatable.
(Of course, you need to find the 'proper' distribution first...)
Sorry, I didn>t mean golomb coding was better, but that they might
have looked at several coding options (A, B, C, ...) and came up with the
ideal combination. Reading over what I wrote, I can see the mistake I
made.
And contrary to the papers, I actually don>t consider it a coding
method...feels more like a transformation, because it just says how the
list of numbers are treated, and not what codewords are made. *shrug*
) But, the paper is based on index compression for an IR system, and
) the sorted list of numbers does not have any duplicates. So for these two
) reasons, it may not be relevant.
Doesn>t matter, you just have to make the ranges inclusive, that neatly
covers the possibility of having duplicates.
In other words, instead of recursing on (l, m-1) and (m+1, r) you recurse
on (l, m) and (m, r). Easy as pie.
But, by assuming uniqueness in the list of numbers, some shortcuts
can be made. For example, if your interval is now (0, 2), then only one
number can exist in this range, so you can actually code "1" in "0 bits".
But, suppose you have five 1>s, then how do you code them without
explicitly coding a number that says how many there are?
One way to get around that is to transform: [0, 1, 1, 1, 1, 1, 2]
to [0, 1, 2, 3, 4, 5, 7], and code this list instead. Ummm, not sure if
you>re right as well, but it>s quite possible there is another way around
it.
[/quote]
i>m preparing a (french) article on this subject,
i>ll need french-speaking reviewers when i>m done.
i>ve "found" a little algo for "encoding" sorted lists
and it can also be applied to unsorted lists with a recursive
trick. however it is not fully tested yet (though i have working
code) so i prefer to stay cool about it.
<completely random thought>
maybe a "modified" Haar algo could fit the original poster>s needs ?
</completely random thought>
[quote]Ray
YG[/quote] |
|
| |
|
Back to top |
|