In 1994, Burrows and Wheeler presented a block-sorting based transform in their technical report [Burrows and Wheeler 1994], known as the BWT, which permutes a text into a new sequence that usually is more “compressible”. Briefly, given a text S of N characters, the BWT can be performed in three steps: (1) to derive a matrix consisting of N rotations (cyclic shifts) of S; (2) to sort the rows of the matrix lexicographically; and (3) to extract the last column of the sorted matrix to produce the transformed text. The BWT is lossless and reversible. There exists an efficient algorithm which can correctly restore the original text S from the transformed text with a time/space complexity of O(N). Therefore, when N is large, any naive sorting scheme will still impose an evident bottleneck on the BWT.
One drawback of the BWT in its original form is its time and space complexities. Any naive implementation of the second step that utilizes general purpose comparison-based sorting algorithms could result in a O(N^2 log N) worst case running time complexity. If a traditional radix sort algorithm is used, it requires O(N^2). One solution proposed by Schindler in [Schindler 1997; 2001], known as ST, to speed up the block sorting is to sort only the first k columns of the matrix in order to reduce the complexity to O(kN). A major tradeoff for the ST to achieve the speedup gain over the BWT is that the inverse ST is far more complicated than the inverse BWT. In Schindler's patent disclosure [Schindler 2001], the hash table based inverse ST algorithms were given only for the orders of 1 and 2. In [Chan and Nong 2006] and [Nong, Zhang and Chan 2011], we show that the time complexity of the inverse ST is independent of k and gave a linear time algorithm to the inverse ST.
Our study was then extended to the construction of suffix array SA(S) of the text S. SA(S) is an array of pointers for all the suffixes in the S sorted in the lexicographically ascending order. It is clear that BWT and the suffix array construction on S is equivalent. In the talk, I will also introduce the development of linear suffix array construction algorithms in the last decade.
简历