[1]胡春玲,赵俊杰,姚梦媛,等.基于优化BWT索引技术的序列比对算法研究[J].南京师范大学学报(工程技术版),2024,24(04):037-45.[doi:10.3969/j.issn.1672-1292.2024.04.004]
 Hu Chunling,Zhao Junjie,Yao Mengyuan,et al.Research on Sequence Alignment Algorithm Based on Optimized BWT Indexing Technique[J].Journal of Nanjing Normal University(Engineering and Technology),2024,24(04):037-45.[doi:10.3969/j.issn.1672-1292.2024.04.004]
点击复制

基于优化BWT索引技术的序列比对算法研究
分享到:

南京师范大学学报(工程技术版)[ISSN:1006-6977/CN:61-1281/TN]

卷:
24卷
期数:
2024年04期
页码:
037-45
栏目:
计算机科学与技术
出版日期:
2024-12-15

文章信息/Info

Title:
Research on Sequence Alignment Algorithm Based on Optimized BWT Indexing Technique
文章编号:
1672-1292(2024)04-0037-09
作者:
胡春玲1赵俊杰2姚梦媛1高欢欢1朱艺杭1汪少鸿1
(1.合肥大学人工智能与大数据学院,安徽 合肥 230031)
(2.郑州大学计算机与人工智能学院 河南 郑州 450001)
Author(s):
Hu Chunling1Zhao Junjie2Yao Mengyuan1Gao Huanhuan1Zhu Yihang1Wang Shaohong1
(1.School of Artificial Intelligence and Big Data,Hefei University,Hefei 230031,China)
(2.School of Computer Science and Artificial Intelligence,Zhengzhou University,Zhengzhou 450001,China)
关键词:
长序列比对BWT索引DC3后缀数组
Keywords:
sequence alignmentBWT indexdifference cover mod 3suffix array
分类号:
TP181
DOI:
10.3969/j.issn.1672-1292.2024.04.004
文献标志码:
A
摘要:
生物信息学中,大规模的生物基因序列比对是最重要的基础问题. 针对主流的BWT(burrows-wheeler transform)索引技术的研究,提出一种新的多阶混合BWT索引方法MD-BWT(multi difference cover mod3 burrows-wheeler transform),根据待比对序列的长度,动态选取适合的多位索引查找. 实验结果表明,改进后的方法可以有效减少序列比对算法中的比对次数和计算次数,降低序列比对算法中索引算法的时间复杂度,明显提高序列比对的效率. 在构造BWT(S)字符串过程中,通过DC3(difference cover mod 3)算法来构造后缀数组,实验表明DC3算法构造后缀数组比倍增算法的时间复杂度更低,时间性能更优.
Abstract:
Large-scale gene sequence alignment is the most important fundamental problem in bioinformatics. Based on the mainstream research of BWT index technology,the paper proposes a new multi-order mixed BWT index method,which dynamically selects the appropiate multi-bit indexing according to the length of the sequence to be compared. The experimental results show that the improved method can effectively reduce the number of comparison and calculation times,reduce the time complexity of the index algorithm,and significantly improve the efficiency of sequence comparison. In the process of constructing BWT(S)string,this paper uses DC3(difference cover mod 3)algorithm to construct the suffix array. Experiments show that DC3 algorithm has better time performance compared to Binary Lifting.

参考文献/References:

[1]SARKAR A,GHOSH S,RAY S S. A hardware-based memory-efficient solution for pair-wise compact sequence alignment[J]. IETE Journal of Research,2023,69(6):3638-3649.
[2]刘济晗. 三代RNA测序序列的比对和分析工具[D]. 哈尔滨:哈尔滨工业大学,2018.
[3]LI R Q,LI Y R,KRISTIANSEN K,et al. SOAP:short oligonucleotide alignment program[J]. Bioinformatics,2008,24(5):713-714.
[4]CAMPAGNA D,ALBIERO A,BILARDI A,et al. PASS:a program to align short sequences[J]. Bioinformatics,2009,25(7):967-968.
[5]SMITH A D,XUAN Z Y,ZHANG M Q. Using quality scores and longer reads improves accuracy of Solexa read mapping[J]. BMC Bioinformatics,2008,9(1):128.
[6]SSERWADDA I,MBOOWA G. rMAP:The rapid microbial analysis pipeline for ESKAPE bacterial group whole-genome sequence data[J]. Microbial Genomics,2021,7(6):000583.
[7]LI H,RUAN J,DURBIN R. Mapping short DNA sequencing reads and calling variants using mapping quality scores[J]. Genome Research,2008,18(11):1851-1858.
[8]RUMBLE S M,LACROUTE P,DALCA A V,et al. SHRiMP:accurate mapping of short color-space reads[J]. PLoS Computational Biology,2009,5(5):e1000386.
[9]BURROWS M J,WHEELER D. A block-sorting lossless data compression algorithm:SRC Research Report 124[R]. Palo Alto,CA,USA:Digital Systems Research Center,1994.
[10]FUENTES-SEPAM'U1LVEDA J,NAVARRO G,NEKRICH Y. Parallel computation of the Burrows Wheeler Transform in compact space[J]. Theoretical Computer Science,2020,812:123-136.
[11]GAGIE T,MANZINI G,SIRÉN J. Wheeler graphs:A framework for BWT-based data structures[J]. Theoretical Computer Science,2017,698:67-78.
[12]LANGMEAD B,TRAPNELL C,POP M,et al. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome[J]. Genome Biology,2009,10(3):R25.
[13]LANGMEAD B,SALZBERG S L. Fast gapped-read alignment with Bowtie 2[J]. Nature Methods,2012,9(4):357-359.
[14]LI H,DURBIN R. Fast and accurate long-read alignment with Burrows-Wheeler transform[J]. Bioinformatics,2010,26(5):589-595.
[15]KEEL B N,SNELLING W M. Comparison of burrows-wheeler transform-based mapping algorithms used in high-throughput whole-genome sequencing:application to illumina data for livestock genomes[J]. Frontiers in Genetics,2018(9):35.
[16]赵雅男,徐云,程昊宇. 序列比对算法中的 BW 变换索引技术研究及其改进[J]. 计算机工程,2016,42(1):282-286.
[17]CHANG C H,CHOU M T,WU Y C,et al. sBWT:Memory efficient implementation of the hardware-acceleration-friendly Schindler transform for the fast biological sequence mapping[J]. Bioinformatics,2016,32(22):3498-3500.
[18]EGIDI L,LOUZA F A,MANZINI G,et al. External memory BWT and LCP computation for sequence collections with applications[J]. Algorithms for Molecular Biology,2019,14:6.
[19]LIU Y C,SCHMIDT B,MASKELL D. CUSHAW:a CUDA compatible short read aligner to large genomes based on the Burrows-Wheeler transform[J]. Bioinformatics,2012,28(14):1830-1837.
[20]KUMAR S,AGARWAL S,RANVIJAY. Burrows wheeler transform and wavelet tree based retrieval of genome sequence in an indexed genome database[J]. Recent Advances in Computer Science and Communications,2020,13(6):1213-1220.
[21]SILVA M,PRATAS D,PINHO A J. AC2:an efficient protein sequence compression tool using artificial neural networks and cache-hash models[J]. Entropy(Basel),2021,23(5):530.
[22]CHEN Y,YOU D L,ZHANG T J,et al. SLDMS:a tool for calculating the overlapping regions of sequences[J]. Frontiers in Plant Science,2022,12:813036.
[23]BINGMANN T. Scalable string and suffix sorting:algorithms,techniques,and tools[EB/OL].(2018-08-02)[2024-05-12]. https://doi.org/10.48550/arXiv.1808.00963.
[24]ZHENG W B,CHEN J,DOAK T G,et al. ADFinder:accurate detection of programmed DNA elimination using NGS high-throughput sequencing data[J]. Bioinformatics,2020,36(12):3632-3636.
[25]KIM Y,CHOI S,JEONG J,et al. Data dependency reduction for high-performance FPGA implementation of DEFLATE compression algorithm[J]. Journal of Systems Architecture,2019,98:41-52.
[26]GRIEBLER D,HOFFMANN R B,DANELUTTO M,et al. High-level and productive stream parallelism for Dedup,Ferret,and Bzip2[J]. International Journal of Parallel Programming,2019,47(2):253-271.

备注/Memo

备注/Memo:
收稿日期:2024-05-12.
基金项目:国家自然科学基金青年项目(62306100)、安徽省教学研究重大项目(2023jyxm0558).
通讯作者:胡春玲,博士,教授,研究方向:机器学习、生物信息学. E-mail:huchunling@hfuu.edu.cn
更新日期/Last Update: 2024-12-15