2.6 The BLAST Algorithm

Intro
Application
Idea: seeding and extending
1. Find matches (seeds) between the query and the subject. A seed is a short, highly similar sequence fragment.
2. Extend each seed into a High-scoring Segment Pair (HSP).
  – Run the Smith-Waterman algorithm on the specified region only.
3. Assess the reliability of the alignment by computing its statistical significance.
- Seeding
  
  For a given word length w (usually 3 for proteins and 11 for nucleotides), slice the query sequence into overlapping "seed words" of w consecutive residues.
- Speedup: index the database
  
  The database is pre-indexed so that every position of a given seed can be located quickly.
- Speedup: mask low-complexity regions
  
  Low-complexity regions are masked to speed up the search, at the cost of some sensitivity.
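- Quality assessment: compute the statistical significance to make sure the alignment did not arise by chance
  - E-value: how likely a match is to arise by chance
    
    The E-value is the number of matches expected to occur by chance, so smaller is better; 0.05 is a common cutoff. The larger the database (n) and the longer the query (m), the larger E tends to be (BLAST performs local alignment).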
    
    (Reference: https://zhuanlan.zhihu.com/p/62342599)
    - The number of alignments with a given score that are expected to occur at random in the database searched
      – e.g. if E = 10, 10 matches with scores this high are expected to be found by chance
      
      (Reference: https://blog.csdn.net/GUET_DM_LQ/article/details/106185880)
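
The dependence of E on database size and query length can be made concrete with the Karlin-Altschul relation E = K · m · n · e^(−λS), the standard form behind BLAST statistics (the formula itself is not given in these notes). A minimal sketch follows; the values of `K` and `lam` are illustrative placeholders, since the real ones depend on the scoring matrix and gap penalties.

```python
import math

def expected_hits(score, m, n, K=0.1, lam=0.3):
    """Karlin-Altschul estimate: E = K * m * n * exp(-lam * score).

    m: query length, n: total database length (in residues).
    K and lam are illustrative placeholders, not values from a real scoring system.
    """
    return K * m * n * math.exp(-lam * score)

# Same raw score, same query, but a database ten times larger.
e_small_db = expected_hits(score=60, m=250, n=1_000_000)
e_large_db = expected_hits(score=60, m=250, n=10_000_000)
print(e_small_db, e_large_db)
```

With everything else fixed, a tenfold larger database gives a tenfold larger E, so the same raw score is less convincing there. A higher score shrinks E exponentially.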

Summary

WHY
- BLAST is the most frequently used tool for calculating sequence similarity by searching a database.
- If you work with one or a few proteins or genes, it can tell you about their conservation, active sites, structure, and regulation in other organisms, etc.
What does BLAST do?
- Identity: the occurrence of exactly the same nucleotide or amino acid at the same position in aligned sequences.
- Similarity: a measure of how alike or different the sequences are.
- Homology: defined in terms of shared ancestry. Homologous sequences are often similar. Sequence regions that are homologous are also called conserved regions.
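
To make the difference between identity and similarity concrete, here is a minimal sketch that computes both for two already-aligned sequences. The `SIMILAR_GROUPS` table is a made-up simplification for illustration only; real tools judge similarity with a substitution matrix such as BLOSUM62.

```python
# Hypothetical grouping of amino acids with similar properties (illustration only).
SIMILAR_GROUPS = [set("ILVM"), set("FYW"), set("KRH"), set("DE"), set("ST"), set("NQ")]

def _check_aligned(a, b):
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")

def percent_identity(a, b):
    """Percentage of alignment columns holding exactly the same residue."""
    _check_aligned(a, b)
    matches = sum(1 for x, y in zip(a, b) if x == y and x != "-")
    return 100.0 * matches / len(a)

def percent_similarity(a, b):
    """Percentage of columns that are identical or fall in the same group."""
    _check_aligned(a, b)

    def similar(x, y):
        return x == y or any(x in g and y in g for g in SIMILAR_GROUPS)

    hits = sum(1 for x, y in zip(a, b) if "-" not in (x, y) and similar(x, y))
    return 100.0 * hits / len(a)

print(percent_identity("KVLDE", "RVIDD"))    # 40.0
print(percent_similarity("KVLDE", "RVIDD"))  # 100.0
```

Homology cannot be computed this way: it is a statement about ancestry that is inferred from similarity, not a percentage.
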
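Differences between alignment algorithms

Dynamic programming becomes a major bottleneck in both speed and resource usage when searching a database. This motivated BLAST and FASTA, which are much faster without sacrificing much accuracy.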
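
The bottleneck is easy to see in code: Smith-Waterman fills one table cell per pair of query and subject positions, so the work grows with the product of the two lengths. Below is a minimal sketch with a simple match/mismatch/linear-gap scheme; the scores are illustrative assumptions, not BLAST defaults.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score of a and b; time is O(len(a) * len(b))."""
    prev = [0] * (len(b) + 1)  # previous row of the dynamic programming table
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best

print(smith_waterman_score("ACGT", "ACGT"))  # 8
print(smith_waterman_score("AAAA", "TTTT"))  # 0
```

For a 300-residue query against a database of 10^9 residues, this would mean roughly 3 × 10^11 cell updates, which is why BLAST limits dynamic programming to the regions around seed hits.
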
How BLAST works
- Step 0 —— Filtering
- Step 1 —— Seeding
- Step 2 —— Search word hits
- Step 3 —— Scanning
- Step 4 —— Extending ➡️ HSP
- Step 5 —— Significance evaluation
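
Steps 1 to 4 can be sketched as a toy pipeline for nucleotide sequences. This is a simplified illustration under stated assumptions, not the actual BLAST implementation: it uses exact word matches only, an in-memory dictionary as the index, match/mismatch scores of +1/−1, and a hypothetical `x_drop` cutoff to stop ungapped extension. All function names are invented for this example.

```python
from collections import defaultdict

def seed_words(query, w):
    """Step 1: every overlapping word of length w, with its query offset."""
    return [(i, query[i:i + w]) for i in range(len(query) - w + 1)]

def build_index(subject, w):
    """Pre-index the subject: word -> list of start positions."""
    index = defaultdict(list)
    for j in range(len(subject) - w + 1):
        index[subject[j:j + w]].append(j)
    return index

def extend(query, subject, qi, sj, w, match=1, mismatch=-1, x_drop=3):
    """Step 4: ungapped extension of a seed hit in both directions.

    A direction stops once the running score falls more than x_drop
    below the best score seen. Returns (score, q_start, s_start, length).
    """
    def walk(q, s, step):
        best = score = best_len = k = 0
        while 0 <= q < len(query) and 0 <= s < len(subject):
            score += match if query[q] == subject[s] else mismatch
            k += 1
            if score > best:
                best, best_len = score, k
            elif best - score > x_drop:
                break
            q += step
            s += step
        return best, best_len

    right_score, right_len = walk(qi + w, sj + w, +1)
    left_score, left_len = walk(qi - 1, sj - 1, -1)
    score = w * match + left_score + right_score
    return score, qi - left_len, sj - left_len, left_len + w + right_len

def find_hsps(query, subject, w=4, min_score=6):
    """Steps 2-4: look up seeds, extend each hit, keep distinct high scorers."""
    index = build_index(subject, w)
    hsps = set()
    for qi, word in seed_words(query, w):
        for sj in index.get(word, []):
            hsp = extend(query, subject, qi, sj, w)
            if hsp[0] >= min_score:
                hsps.add(hsp)  # seeds on the same diagonal collapse into one HSP
    return sorted(hsps, reverse=True)

print(find_hsps("TTACGTACGG", "CCCACGTACGAAA"))  # [(7, 2, 3, 7)]
```

In the example, four different seeds lie on the same diagonal and all extend to the same 7-base segment pair, while an off-diagonal `TACG` hit scores only 4 and is discarded. Step 5 would then assign a statistical significance to each surviving HSP.
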
BLAST programs
Gapped BLAST
PSI-BLAST
Caveat emptor