讲座 | From Burrows-Wheeler Transform to Suffix Array Construction


主讲人: Daricks W.H. Chan博士
时间: 2013年5月8日(星期三),下午2:00-3:00
地点: E202
摘要:

In 1994, Burrows and Wheeler presented a block-sorting based transform in their technical report [Burrows and Wheeler 1994], known as the BWT, which permutes a text into a new sequence that usually is more “compressible”. Briefly, given a text S of N characters, the BWT can be performed in three steps: (1) to derive a matrix consisting of N rotations (cyclic shifts) of S; (2) to sort the rows of the matrix lexicographically; and (3) to extract the last column of the sorted matrix to produce the transformed text. The BWT is lossless and reversible. There exists an efficient algorithm which can correctly restore the original text S from the transformed text with a time/space complexity of O(N). Therefore, when N is large, any naive sorting scheme will still impose an evident bottleneck on the BWT.

One drawback of the BWT in its original form is its time and space complexities. Any naive implementation of the second step that utilizes general purpose comparison-based sorting algorithms could result in a O(N^2 log N) worst case running time complexity. If a traditional radix sort algorithm is used, it requires O(N^2). One solution proposed by Schindler in [Schindler 1997; 2001], known as ST, to speed up the block sorting is to sort only the first k columns of the matrix in order to reduce the complexity to O(kN). A major tradeoff for the ST to achieve the speedup gain over the BWT is that the inverse ST is far more complicated than the inverse BWT. In Schindler's patent disclosure [Schindler 2001], the hash table based inverse ST algorithms were given only for the orders of 1 and 2. In [Chan and Nong 2006] and [Nong, Zhang and Chan 2011], we show that the time complexity of the inverse ST is independent of k and gave a linear time algorithm to the inverse ST.

Our study was then extended to the construction of suffix array SA(S) of the text S. SA(S) is an array of pointers for all the suffixes in the S sorted in the lexicographically ascending order. It is clear that BWT and the suffix array construction on S is equivalent. In the talk, I will also introduce the development of linear suffix array construction algorithms in the last decade.

简历

Dr. Daricks W.H. Chan is now serving the Department of Mathematics and Information Technology of the Hong Kong Institute of Education as an assistant professor. Before joining HKIEd, Dr. Chan taught in the Department of Mathematics of Hong Kong Baptist University, and he had been teaching mathematics subjects in various levels for over ten years. He obtained his doctorate from the HKBU in 2003, and the postgraduate diploma in education from the CUHK in 1999. His areas of expertise are algorithm design, quantum information, graph theory and combinatorics. Dr. Chan has published over fourty papers in international journals including SIAM, IEEE, Physical Review and Discrete Applied Mathematics. He was awarded research and teaching grants of total amount over two million Hong Kong dollars.