2.6 The BLAST Algorithm

  • Intro

  • Application

  • Approach: BLAST's core idea is seed-and-extend

    1. Find matches (seeds) between the query and the subject; a seed is a short, highly similar sequence fragment.
    2. Extend each seed into a High-scoring Segment Pair (HSP).
      – Run the Smith-Waterman algorithm on the specified region only.
    3. Assess the reliability of the alignment by computing its statistical significance.
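
Step 2 can be sketched in Python as a minimal illustration, assuming the "specified region" is a fixed-size window around the seed hit. The scoring values (match=2, mismatch=-1, gap=-2), the `flank` parameter, and the helper name `extend_seed` are assumptions made for this sketch, not BLAST's actual parameters.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between strings a and b."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def extend_seed(query, subject, q_pos, s_pos, w, flank=10):
    """Score only the window around a seed hit (hypothetical helper).

    q_pos and s_pos are the 0-based start positions of a seed of
    length w in the query and the subject.
    """
    q_lo, q_hi = max(0, q_pos - flank), min(len(query), q_pos + w + flank)
    s_lo, s_hi = max(0, s_pos - flank), min(len(subject), s_pos + w + flank)
    return smith_waterman_score(query[q_lo:q_hi], subject[s_lo:s_hi])
```

Restricting the dynamic programming to a small window is what keeps the cost low: the quadratic work applies only to the flanked region, not to the full sequences.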
    • Seeding

      For a given word length w (usually 3 for proteins and 11 for nucleotides), slice the query sequence into multiple contiguous "seed words", the short fragments used as seeds.
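
A minimal sketch of the slicing step, assuming every contiguous word of length w is kept together with its position. The function name `seed_words` is hypothetical; real BLAST also adds similar "neighborhood" words for proteins, which this sketch omits.

```python
def seed_words(query, w):
    """Return (position, word) pairs for every contiguous word of length w."""
    return [(i, query[i:i + w]) for i in range(len(query) - w + 1)]


# A protein query with w = 3 yields len(query) - w + 1 overlapping words.
words = seed_words("MKVLA", 3)
```

A query shorter than w produces no seed words at all, which is why very short queries need a smaller word size.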

    • Speedup: index the database

      The database is pre-indexed, with an entry built ahead of time for each seed, so that all positions of a given seed in the database can be located quickly.
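
The indexing idea can be sketched with a plain dictionary, assuming the database is a small in-memory mapping from sequence ID to sequence. The names `build_index` and `find_hits` are hypothetical, and the on-disk index formats of real BLAST are not modeled here.

```python
from collections import defaultdict


def build_index(database, w):
    """Map each word of length w to a list of (sequence_id, position) pairs."""
    index = defaultdict(list)
    for seq_id, seq in database.items():
        for i in range(len(seq) - w + 1):
            index[seq[i:i + w]].append((seq_id, i))
    return index


def find_hits(query, index, w):
    """Return (query_pos, sequence_id, subject_pos) for every exact seed match."""
    hits = []
    for q_pos in range(len(query) - w + 1):
        # .get avoids creating empty entries for words absent from the index.
        for seq_id, s_pos in index.get(query[q_pos:q_pos + w], []):
            hits.append((q_pos, seq_id, s_pos))
    return hits
```

The index is built once per database, so each query pays only for dictionary lookups rather than for a scan of every database sequence.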

    • Speedup: mask low-complexity regions

      Low-complexity regions are masked to speed up the search, at the cost of some sensitivity.
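
One simple way to illustrate masking is a sliding-window Shannon entropy filter. This is an assumption-laden sketch rather than the filter BLAST actually uses (NCBI BLAST relies on dedicated programs such as SEG and DUST); the window size, threshold, and mask character below are arbitrary illustrative choices.

```python
import math
from collections import Counter


def shannon_entropy(window):
    """Shannon entropy (in bits) of the letter distribution in a window."""
    counts = Counter(window)
    n = len(window)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def mask_low_complexity(seq, window=12, threshold=1.0, mask_char="N"):
    """Replace every window whose entropy is below threshold with mask_char."""
    masked = list(seq)
    for i in range(len(seq) - window + 1):
        if shannon_entropy(seq[i:i + window]) < threshold:
            for j in range(i, i + window):
                masked[j] = mask_char
    return "".join(masked)
```

Masked positions generate no seeds, so repetitive stretches stop producing large numbers of uninformative hits. The trade-off is that a genuine match lying inside a masked region can be missed.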

    • Quality assessment: compute statistical significance to make sure the alignment is not the result of chance

      • E-value: how likely a match is to arise by chance

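        The E-value is the number of matches expected to arise by chance alone, so smaller is better; 0.05 is a common cutoff. The larger the database (n) and the longer the sequence (m), the larger E tends to be (BLAST is a local alignment method).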

        (Reference: https://zhuanlan.zhihu.com/p/62342599)

        • The number of alignments with a given score that are expected to occur at random in the database that was searched.
          – e.g. if E = 10, 10 matches with scores this high are expected to be found by chance.

          (Reference: https://blog.csdn.net/GUET_DM_LQ/article/details/106185880)
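
The dependence of E on m, n, and the score S is usually written as E = K * m * n * exp(-lambda * S), the Karlin-Altschul relation. The sketch below simply evaluates that expression. The default values of `lam` and `k` are placeholders chosen for illustration; in practice both depend on the scoring matrix and gap penalties in use.

```python
import math


def e_value(score, m, n, lam=0.318, k=0.134):
    """Expected number of chance alignments scoring at least `score`.

    m is the query length, n is the total database length.
    lam and k are illustrative placeholder values (assumptions).
    """
    return k * m * n * math.exp(-lam * score)


# Doubling the database size doubles E; raising the score shrinks it exponentially.
small_db = e_value(score=50, m=100, n=1_000_000)
large_db = e_value(score=50, m=100, n=2_000_000)
```

This makes the note above concrete: E grows linearly with both m and n, so the same raw score is less significant in a larger search space.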


Summary

  • WHY

    • BLAST is the most frequently used tool for calculating sequence similarity; it does so by searching a database.

    • If you work with one or a few proteins or genes, it can tell you about their conservation, active sites, structure, and regulation in other organisms, among other things.

  • What does BLAST do?

    • Identity: the occurrence of exactly the same nucleotide or amino
      acid at the same position in aligned sequences.

    • Similarity: a measure of how alike or different the sequences are.

    • Homology: defined in terms of shared ancestry. Homologous
      sequences are often similar. Sequence regions that are homologous
      are also called conserved regions.
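
Of the three terms above, identity is the one that reduces directly to a count. The sketch below computes percent identity for two already-aligned sequences, assuming "-" marks a gap and that the denominator is the number of alignment columns. Other denominators are also in use, so this is one convention, not the only one. The function name `percent_identity` is hypothetical.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent of alignment columns holding the same residue in both sequences."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    # Ignore columns that are gaps in both sequences.
    columns = [(x, y) for x, y in zip(aligned_a, aligned_b)
               if not (x == gap and y == gap)]
    if not columns:
        return 0.0
    identical = sum(1 for x, y in columns if x == y and x != gap)
    return 100.0 * identical / len(columns)
```

Homology, by contrast, cannot be computed this way: it is an inference about shared ancestry that high identity or similarity can support but does not prove.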

  • Differences between alignment algorithms


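    Dynamic programming runs into severe speed and resource bottlenecks when searching a database. This motivated BLAST and FASTA, which are much faster without giving up much accuracy.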

  • How BLAST works

    • Step 0 —— Filtering

    • Step 1 —— Seeding

    • Step 2 —— Search word hits

    • Step 3 —— Scanning

    • Step 4 —— Extending ➡️ HSP

    • Step 5 —— Significance evaluation

  • BLAST programs

  • Gapped BLAST

  • PSI-BLAST

  • Caveat emptor


In plain language