Ken Willets Guest
|
Posted: Sun Sep 14, 2003 12:13 am Post subject: Re: HOW TO SUFFIX ARRAY CONSTRUCTION IS USED IN BWT EFFICIEN |
|
|
In article <57dda274.0309120551.3fcd5f1f@posting.google.com>, "BHARATH"
<bharathkj@hotmail.com> wrote:
If you list out the block-sorted matrix M, eg
$abracadabra
a$abracadabr
abra$abracad
abracadabra$
acadabra$abr
adabra$abrac
bra$abracada
bracadabra$a
cadabra@abra
dabra$abraca
ra$abracadab
racadabra@ab
^ ^
F L
(Here, $ < a, and using $ ensures that M is sorted by suffixes only )
You can draw arrows from each row to its immediate suffix, eg "abracadbra$
--> "bracadabra$a", giving each row a link in the form of the row number
of its suffix. This creates an array Psi which is the permutation that
would result if M were shifted one column to the left and re-sorted.
Psi is typically used in reversing the BWT to get the original text (it>s
easily calculated from L). If you just store the first character of each
row i (column F, or just character counts), and Psi[i], you can find
successive characters of a row by reading the first character of i, going
to Psi[i] to find the second character, so on, iterating i = Psi[i] to
cycle through all n suffixes.
Psi can be converted to a suffix array by starting at the row
corresponding to the original string (the primary index), and iterating
the same way, replacing Psi[i] with 0,1,2... as you move through the text.
Psi has a useful property in that it is increasing on each range of rows
in M which have the same first character, that is if rows i and j start
with "a", say, and i < j, then Psi[i] < Psi[j]. The reason is simple: if
the row at i is lexicographically less than the one at j, and they have
the same first character, then string comparison depends on their 2nd or
further characters, meaning that their suffixes must be in the same order,
etc.
Psi is compressible because consecutive values often differ by one; this
distribution is used in Distance Coding and Inversion Frequencies BWT
compression methods.
It>s also fairly easy to build Psi character-by-character, creating a
block-sorted array in n log n time and very small space. I thought
I was the only one who had thought of this until I read this paper:
* T. W. Lam, K. Sadakane, W. K. Sung and S. M. Yiu: A Space and Time
Efficient Algorithm for Constructing Compressed Suffix Arrays, Proc. of
COCOON, LNCS 2387, pp. 401--410, 2002. ps.gz file (66875 bytes).
This is a good description of Psi and the insertion method. |
|