暗能星系

    Removing host information in pathogen analysis

    Posted by anneng (last edited by anneng)

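      1. Is it necessary to remove the host?
      Suppose the sample contains n reads and the reference database contains x entries. For host removal, the user can specify the host species in advance: typically human for hospitals, a particular animal for animal disease control, and a particular plant for plant protection. In that case x = 1 in effect, so the host sequences among the n reads can be identified quickly. From a performance standpoint, host removal is well worth doing.
      As for alignment accuracy, could this step align some pathogen sequences to the host by mistake? In theory this cannot be ruled out, but hosts and microbes are far apart on the phylogenetic tree, so the probability should be low (the literature still needs to be checked).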

      2. Option 1: bowtie2/samtools
      https://sites.google.com/site/wiki4metagenomics/tools/short-read/remove-host-sequences
      Remove host sequences
      In silico removal of host sequences
      removing host (contamination) sequences in order to analyze remaining (bacterial) sequences

      1. Using bowtie2 option: --un-conc
        quick solution to get the paired reads that do not align concordantly to the host reference genome

      download ready to use bowtie2 database of human host genome GRCh38 (hg38)

      wget https://genome-idx.s3.amazonaws.com/bt/GRCh38_noalt_as.zip
      unzip GRCh38_noalt_as.zip

      run bowtie2 mapping (using --un-conc-gz to get gzip compressed output files; 8 processors)

      bowtie2 -p 8 -x GRCh38_noalt_as -1 SAMPLE_R1.fastq.gz -2 SAMPLE_R2.fastq.gz --un-conc-gz SAMPLE_host_removed > SAMPLE_mapped_and_unmapped.sam
      http://bowtie-bio.sourceforge.net/bowtie2/manual.shtml#output-options

      bowtie2 results (gz files without gz ending)

      ls
      SAMPLE_host_removed.1
      SAMPLE_host_removed.2

      rename host-sequence free samples

      mv SAMPLE_host_removed.1 SAMPLE_host_removed_R1.fastq.gz
      mv SAMPLE_host_removed.2 SAMPLE_host_removed_R2.fastq.gz

      Option --un-conc gives results similar to the samtools filter -F 2 (excluding reads flagged "mapped in proper pair").
      Pairs in which only one read maps to the host, or in which both reads map but not concordantly, can therefore still end up in the "host removed" output.
      For finer control over read filtering, see the bowtie2 + samtools workflow below.

      If multi-processor option -p is used, output reads might have a different order compared to input files.
      Use option --reorder to keep the original read order.
      (read order refers to the .sam output but might also affect the host-removed read output files .1 and .2)
      http://bowtie-bio.sourceforge.net/bowtie2/manual.shtml#performance-options
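The quick --un-conc workflow, the rename step and the --reorder note can be folded into a single command. The sketch below only prints the command rather than running it. It relies on the behaviour described in the bowtie2 manual that a % in the --un-conc path is replaced by 1 or 2, so the output files already carry their final names. The function name and the SAMPLE_R1/SAMPLE_R2 naming pattern are assumptions made for illustration, and -S /dev/null is used on the assumption that the SAM output is not needed.

```shell
# Sketch: print (do not run) a bowtie2 host-removal command for one sample.
# bowtie2_host_removal_cmd is a hypothetical helper, not part of bowtie2.
bowtie2_host_removal_cmd() {
    local index="$1" sample="$2" threads="${3:-8}"
    # --reorder keeps the output in input order even when -p is greater than 1.
    # The % in the --un-conc-gz path becomes 1 or 2, so no mv step is needed.
    # -S /dev/null discards the alignments when only the unaligned pairs matter.
    echo "bowtie2 -p ${threads} --reorder -x ${index}" \
         "-1 ${sample}_R1.fastq.gz -2 ${sample}_R2.fastq.gz" \
         "--un-conc-gz ${sample}_host_removed_R%.fastq.gz -S /dev/null"
}
```

To run the printed command after reviewing it: eval "$(bowtie2_host_removal_cmd GRCh38_noalt_as SAMPLE)".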

      2. Using bowtie2 together with samtools
        complex solution that gives better control over the rejected reads by using SAM-flags

      How to filter out host reads from paired-end fastq files?

      a) bowtie2 mapping against host genome: write all (mapped and unmapped) reads to a single .bam file
      b) samtools view: use filter-flags to extract unmapped reads
      c) samtools fastq: split paired-end reads into separated R1 and R2 fastq files

      a) bowtie2 mapping against host sequence

      1. create or download ready to use bowtie2 index databases (host_DB)

      create bowtie2 database using a reference genome (fasta file)

      bowtie2-build host_genome_sequence.fasta host_DB
      → Download reference genomes of common hosts (human, mouse, ..)
      → How to create a bowtie2 index database

      download ready to use bowtie2 database of human genome GRCh38 (hg38)

      wget https://genome-idx.s3.amazonaws.com/bt/GRCh38_noalt_as.zip
      unzip GRCh38_noalt_as.zip
      # move all files into your working directory (or into your predefined $BOWTIE2_INDEXES location)
      # use "GRCh38_noalt_as" for your bowtie2 database name (instead of "host_DB")
      → see all available bowtie2 databases (host species list is shown on the right)

      2. bowtie2 mapping against host sequence database, keep both aligned and unaligned reads (paired-end reads)
        bowtie2 -p 8 -x host_DB -1 SAMPLE_R1.fastq.gz -2 SAMPLE_R2.fastq.gz -S SAMPLE_mapped_and_unmapped.sam

      3. convert file .sam to .bam
        samtools view -bS SAMPLE_mapped_and_unmapped.sam > SAMPLE_mapped_and_unmapped.bam

      b) filter required unmapped reads

      SAMtools SAM-flag filter: get unmapped pairs (both reads R1 and R2 unmapped)

      samtools view -b -f 12 -F 256 SAMPLE_mapped_and_unmapped.bam > SAMPLE_bothReadsUnmapped.bam
      -f 12 Extract only (-f) alignments with both reads unmapped: <read unmapped><mate unmapped>
      -F 256 Do not(-F) extract alignments which are: <not primary alignment>
      see meaning of SAM-flags
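To make the -f 12 -F 256 filter easier to read, here is a small sketch that decodes the flag bits used in this workflow and mimics how samtools view combines -f (all listed bits must be set) and -F (none of the listed bits may be set). The bit values come from the SAM specification. The function names are hypothetical and only a subset of the flag bits is listed.

```shell
# Sketch: decode the SAM flag bits relevant to host filtering.
decode_sam_flag() {
    local flag="$1" names="" pair bit name
    for pair in 1:paired 2:proper_pair 4:read_unmapped 8:mate_unmapped 256:not_primary; do
        bit="${pair%%:*}"
        name="${pair#*:}"
        if [ $((flag & bit)) -ne 0 ]; then
            names="$names $name"
        fi
    done
    echo "${names# }"
}

# Succeeds if a flag would pass "samtools view -f REQUIRE -F EXCLUDE".
sam_flag_passes() {
    local flag="$1" require="$2" exclude="$3"
    [ $((flag & require)) -eq "$require" ] && [ $((flag & exclude)) -eq 0 ]
}
```

For example, decode_sam_flag 12 prints "read_unmapped mate_unmapped", which is why -f 12 selects pairs where both reads are unmapped. A read with flag 73 (paired, mate unmapped, first in pair) is itself mapped and so fails the filter.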
      c) split paired-end reads into separated fastq files .._R1 .._R2

      sort bam file by read name (-n) to have paired reads next to each other (2 parallel threads, each using up to 5G memory)

      samtools sort -n -m 5G -@ 2 SAMPLE_bothReadsUnmapped.bam -o SAMPLE_bothReadsUnmapped_sorted.bam

      samtools fastq -@ 8 SAMPLE_bothReadsUnmapped_sorted.bam \
        -1 SAMPLE_host_removed_R1.fastq.gz \
        -2 SAMPLE_host_removed_R2.fastq.gz \
        -0 /dev/null -s /dev/null -n
      http://www.htslib.org/doc/samtools-fasta.html

      see other options for → converting .bam to .fastq
      Result
      Two files of paired-end reads, containing non-host sequences
      SAMPLE_host_removed_R1.fastq.gz
      SAMPLE_host_removed_R2.fastq.gz
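A quick sanity check on the two result files is to confirm that R1 and R2 still contain the same number of reads, and to see what share of the input was removed as host. The following is a minimal sketch, assuming gzip-compressed FASTQ with exactly four lines per record. All function names are hypothetical.

```shell
# Print the number of reads in a gzipped FASTQ file (4 lines per record assumed).
count_fastq_reads() {
    gzip -dc "$1" | awk 'END { print NR / 4 }'
}

# Succeed only if the R1 and R2 files contain the same number of reads.
pairs_in_sync() {
    [ "$(count_fastq_reads "$1")" -eq "$(count_fastq_reads "$2")" ]
}

# Print the percentage of reads removed, given total and remaining read counts.
host_percent() {
    awk -v total="$1" -v kept="$2" \
        'BEGIN { printf "%.1f\n", 100 * (total - kept) / total }'
}
```

Possible usage: pairs_in_sync SAMPLE_host_removed_R1.fastq.gz SAMPLE_host_removed_R2.fastq.gz && echo "pairs OK", followed by host_percent with the read counts of the original and the host-removed R1 file.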

      3. Option 2
      https://github.com/UEA-Cancer-Genetics-Lab/sepath_tool_UEA
      https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1819-8
      SEPATH: benchmarking the search for pathogens in human tissue whole genome sequence data leads to template pipelines
      Judging from the benchmarks in the SEPATH paper above, bbduk and Kontaminant performed comparatively well.
      https://yiweiniu.github.io/blog/2018/07/Remove-Contamination-of-Pokaryotic-Organisms-in-Reads/
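      This post recommends bbduk for second-generation (short-read) data and minimap2 for removing contamination from third-generation (long-read) data.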
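As a rough illustration of that recommendation, the sketch below prints (without running) one plausible command per read type. The function name, the file naming pattern and k=31 are assumptions. The map-ont preset targets Nanopore reads, so a PacBio run would need a different minimap2 preset. The exact parameters used in the cited paper and blog post are not reproduced here.

```shell
# Sketch: print a host-filtering command for short or long reads.
# host_filter_cmd is a hypothetical helper; review the command before running it.
host_filter_cmd() {
    local read_type="$1" host_fasta="$2" sample="$3"
    case "$read_type" in
        short)
            # bbduk.sh: reads sharing k-mers with ref= go to outm=, the rest to out=.
            echo "bbduk.sh in=${sample}_R1.fastq.gz in2=${sample}_R2.fastq.gz" \
                 "out=${sample}_clean_R1.fastq.gz out2=${sample}_clean_R2.fastq.gz" \
                 "outm=${sample}_host_R1.fastq.gz outm2=${sample}_host_R2.fastq.gz" \
                 "ref=${host_fasta} k=31"
            ;;
        long)
            # minimap2 -a emits SAM; samtools fastq -f 4 keeps only unmapped reads.
            echo "minimap2 -ax map-ont ${host_fasta} ${sample}.fastq.gz" \
                 "| samtools fastq -f 4 - | gzip > ${sample}_clean.fastq.gz"
            ;;
        *)
            echo "unknown read type: ${read_type}" >&2
            return 1
            ;;
    esac
}
```

For example, host_filter_cmd long host.fa SAMPLE prints the minimap2 pipeline for a file named SAMPLE.fastq.gz.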

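      4. Option 3
      Keep using blast to remove the human sequences. This approach brings in the least additional software, although a k-mer-based method would certainly be somewhat faster.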
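A minimal sketch of the blast route, assuming the reads are in FASTA format and have already been searched against a human database with tabular output, for example: blastn -query reads.fasta -db human_db -outfmt 6 -evalue 1e-10 -out hits.tsv (the database name and e-value cutoff are placeholders). The hypothetical function below then drops every read whose ID appears in the first column of the hit table.

```shell
# Sketch: print the FASTA records that have no BLAST hit against the host.
# Usage: drop_hit_reads hits.tsv reads.fasta > reads_host_removed.fasta
drop_hit_reads() {
    awk '
        # First file: remember every query ID that produced a host hit.
        FILENAME == ARGV[1] { hit[$1] = 1; next }
        # Second file: on each header, decide whether to keep the record.
        /^>/ { id = substr($1, 2); keep = !(id in hit) }
        keep
    ' "$1" "$2"
}
```

Comparing FILENAME with ARGV[1] rather than using the common NR == FNR idiom keeps the filter correct when the hit table is empty, in which case all reads are passed through.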

        Reply from anneng:

        http://www.metagenomics.wiki/tools/short-read/remove-host-sequences
        https://www.microbiologyresearch.org/content/journal/mgen/10.1099/mgen.0.000393?crawler=true
