2.6 The BLAST algorithm
-
Intro
-
Application
-
BLAST idea: seed-and-extend
- Find matches (seeds) between the query and the subject; a seed is a short, highly similar sequence fragment.
- Extend each seed into a High-scoring Segment Pair (HSP).
– Run the Smith-Waterman algorithm on the specified region only.
- Assess the reliability of the alignment by computing its statistical significance.
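The seed-and-extend idea can be sketched in a few lines. This is a toy illustration, not the real BLAST implementation: it assumes seeds are exact word matches, and that the "specified region" is a fixed window of `flank` letters on each side of the seed. The function names and the scoring values (`match`, `mismatch`, `gap`) are hypothetical choices made for this example.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between a and b (linear gap penalty)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Local alignment: a cell never drops below 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def seed_and_extend(query, subject, w=3, flank=5):
    """Find exact w-letter seeds, then align only a window around each seed."""
    hits = []
    for qi in range(len(query) - w + 1):
        word = query[qi:qi + w]
        si = subject.find(word)
        while si != -1:
            # Restrict Smith-Waterman to the neighbourhood of the seed.
            q_lo, q_hi = max(0, qi - flank), min(len(query), qi + w + flank)
            s_lo, s_hi = max(0, si - flank), min(len(subject), si + w + flank)
            score = smith_waterman_score(query[q_lo:q_hi], subject[s_lo:s_hi])
            hits.append((qi, si, score))
            si = subject.find(word, si + 1)
    return hits


hits = seed_and_extend("GATTACA", "CCGATTACACC")
print(max(hits, key=lambda h: h[2]))
```

The dynamic programming step only ever sees a window of at most `w + 2 * flank` letters per sequence, which is where the speedup over aligning the full sequences comes from.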
-
Seeding
For a given word length w (usually 3 for proteins and 11 for nucleotides), slice the query sequence into multiple contiguous “seed words” of length w.
-
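A minimal sketch of the slicing step, assuming a word is taken at every offset so that neighbouring words overlap. The helper name `seed_words` is hypothetical.

```python
def seed_words(query, w):
    """Return (offset, word) for every contiguous length-w word in the query."""
    return [(i, query[i:i + w]) for i in range(len(query) - w + 1)]


print(seed_words("MKVLAT", 3))
# [(0, 'MKV'), (1, 'KVL'), (2, 'VLA'), (3, 'LAT')]
```

A query of length L yields L - w + 1 words, so a query shorter than w yields none.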
Speedup: index the database
The database is pre-indexed so that all positions of a given seed can be located quickly; the index is built ahead of time for every seed word.
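One way to picture the pre-built index is a hash table from each word to the places it occurs. The sketch below assumes the database fits in memory as a dictionary of sequences; `build_index` and the sample sequences are made up for illustration, and real BLAST database formats are more elaborate.

```python
from collections import defaultdict


def build_index(database, w):
    """Map each length-w word to the (sequence id, offset) pairs where it occurs."""
    index = defaultdict(list)
    for seq_id, seq in database.items():
        for i in range(len(seq) - w + 1):
            index[seq[i:i + w]].append((seq_id, i))
    return index


database = {"s1": "ACGTACGT", "s2": "TTACGA"}
index = build_index(database, 3)
print(index["ACG"])
# [('s1', 0), ('s1', 4), ('s2', 2)]
```

Once the index exists, looking up a seed is a single dictionary access instead of a scan over the whole database.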
-
Speedup: mask low-complexity regions
Low-complexity regions are masked to speed up the search, at the cost of some sensitivity.
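To give a feel for what masking does, here is a simplified sketch that masks windows with low Shannon entropy. This is not the filter BLAST itself uses (BLAST relies on dedicated programs such as SEG and DUST); the window size, threshold and mask character are assumptions chosen for the example.

```python
import math
from collections import Counter


def shannon_entropy(window):
    """Shannon entropy (bits per letter) of the letters in a window."""
    counts = Counter(window)
    n = len(window)
    return -sum(c / n * math.log2(c / n) for c in counts.values())


def mask_low_complexity(seq, window=8, threshold=1.0, mask_char="N"):
    """Replace every window whose entropy is below threshold with mask_char."""
    masked = list(seq)
    for i in range(len(seq) - window + 1):
        if shannon_entropy(seq[i:i + window]) < threshold:
            masked[i:i + window] = mask_char * window
    return "".join(masked)


seq = "ACGTGCTA" + "A" * 12 + "CGTAGCTG"
print(mask_low_complexity(seq))
```

Masked letters no longer produce seeds, so a repetitive stretch such as the poly-A run above stops generating large numbers of uninformative hits. Any true match that lies inside a masked region is lost as well, which is the sensitivity cost.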
-
Quality assessment: compute the statistical significance to make sure the alignment did not arise by chance
-
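E-value: how likely a match is to arise by chance
The E-value is the number of matches expected to occur at random, so smaller is better; 0.05 is a common cutoff. The larger the database (n) and the longer the sequence (m), the larger E tends to be (BLAST is a local alignment method).
(Reference: https://zhuanlan.zhihu.com/p/62342599)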
-
The E-value is the number of alignments with a given score that would be expected to occur at random in the database that has been searched.
– e.g. if E=10, 10 matches with scores this high are expected to be found by chance.
(Reference: https://blog.csdn.net/GUET_DM_LQ/article/details/106185880)
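The dependence on m and n can be made concrete with the standard Karlin-Altschul estimate, E = K · m · n · e^(−λS), where S is the raw alignment score. The sketch below is only an illustration: λ and K depend on the scoring system, and the default values used here are assumed, of the order usually quoted for ungapped BLOSUM62 scoring.

```python
import math


def e_value(score, m, n, lam=0.318, K=0.134):
    """Expected number of chance alignments scoring at least `score`.

    m: query length, n: total database length.
    lam and K are illustrative defaults, not values fitted to real data.
    """
    return K * m * n * math.exp(-lam * score)


# Same query and score, searched against databases of growing size.
for n in (1_000_000, 10_000_000, 100_000_000):
    print(n, e_value(score=60, m=250, n=n))
```

E grows linearly with the database size and the query length, and falls exponentially with the score, so the same alignment looks less significant in a larger database.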
-
-
Summary
-
WHY
-
BLAST is the most frequently used tool for calculating sequence similarity by searching a database.
-
If you work with one or a few proteins or genes, it can tell you about their conservation, active sites, structure and regulation in other organisms, etc.
-
-
What does BLAST do?
-
Identity: the occurrence of exactly the same nucleotide or amino acid at the same position in aligned sequences.
-
Similarity: a measure of the sameness or difference of the sequences.
-
Homology: defined in terms of shared ancestry. Homologous sequences are often similar. Sequence regions that are homologous are also called conserved regions.
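Of the three terms, identity is the one that can be computed directly from an alignment. A minimal sketch, assuming the two sequences are already aligned to equal length with "-" as the gap character and that the percentage is taken over all alignment columns (other denominators are also in use); `percent_identity` is a hypothetical helper name.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percentage of alignment columns where both sequences carry the same residue."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    if not aligned_a:
        return 0.0
    identical = sum(
        1 for x, y in zip(aligned_a, aligned_b) if x == y and x != gap
    )
    return 100.0 * identical / len(aligned_a)


# 4 of the 6 columns carry the same letter in both sequences.
print(percent_identity("ACGT-A", "ACCTGA"))
```

Homology, by contrast, is a statement about ancestry and cannot be read off as a percentage; high identity is only evidence for it.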
-
-
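Differences between alignment algorithms
Dynamic programming becomes a serious bottleneck in speed and resource usage when searching a database. This led to BLAST and FASTA, which are much faster without losing much accuracy.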
-
How BLAST works
-
Step 0: Filtering
-
Step 1: Seeding
-
Step 2: Search word hits
-
Step 3: Scanning
-
Step 4: Extending to HSPs
-
Step 5: Significance evaluation
-
-
BLAST programs
-
Gapped BLAST
-
PSI-BLAST
-
Caveat emptor