Removing host information in pathogen analysis

anneng

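1. Is it necessary to remove the host?
Suppose the sample contains n reads and the reference database contains x entries. With a host-removal step, the user can specify the host species in advance: typically human for hospitals, a particular animal for animal disease control, and a particular plant for plant protection. In that case x = 1 in effect, and the host sequences among the n reads can be identified quickly. From a performance standpoint, host removal is well worth doing.
As for alignment accuracy: could this step align some pathogen sequences to the host? In theory this cannot be ruled out, but host and microbes are far apart on the phylogenetic tree, so the probability should be low (the literature still needs to be checked).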

2. Option 1: bowtie2/samtools
https://sites.google.com/site/wiki4metagenomics/tools/short-read/remove-host-sequences
Remove host sequences
In silico removal of host sequences
removing host (contamination) sequences in order to analyze remaining (bacterial) sequences

Using bowtie2 option: --un-conc
quick solution to get the paired reads that do not map to the host reference genome (both reads unmapped).

download ready to use bowtie2 database of human host genome GRCh38 (hg38)

wget https://genome-idx.s3.amazonaws.com/bt/GRCh38_noalt_as.zip
unzip GRCh38_noalt_as.zip

run bowtie2 mapping (using --un-conc-gz to get gzip compressed output files; 8 processors)

bowtie2 -p 8 -x GRCh38_noalt_as -1 SAMPLE_R1.fastq.gz -2 SAMPLE_R2.fastq.gz --un-conc-gz SAMPLE_host_removed > SAMPLE_mapped_and_unmapped.sam
http://bowtie-bio.sourceforge.net/bowtie2/manual.shtml#output-options

bowtie2 results (gzip-compressed files, written without a .gz extension)

ls
SAMPLE_host_removed.1
SAMPLE_host_removed.2

rename host-sequence free samples

mv SAMPLE_host_removed.1 SAMPLE_host_removed_R1.fastq.gz
mv SAMPLE_host_removed.2 SAMPLE_host_removed_R2.fastq.gz
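
The mapping and renaming steps above can probably be folded into one call. The following is a minimal sketch, not a tested pipeline: the function name host_removal_cmd and its arguments are hypothetical, and it only prints the command so it can be reviewed first. It relies on the bowtie2 manual's statement that a percent symbol in the --un-conc path is replaced with 1 or 2, and it assumes the SAM output is not needed.

```shell
#!/bin/sh
# Sketch: build a one-step bowtie2 host-removal command.
# Assumption: input files are named <sample>_R1.fastq.gz and <sample>_R2.fastq.gz.

host_removal_cmd() {
    index=$1          # bowtie2 index prefix, e.g. GRCh38_noalt_as
    sample=$2         # sample prefix (hypothetical naming scheme)
    threads=${3:-8}   # number of processors, defaults to 8

    # '%' in the --un-conc-gz path becomes 1 or 2, so no renaming is needed.
    # --reorder keeps the SAM records in input order when -p is used.
    # -S /dev/null discards the alignments (assumption: only the reads are wanted).
    printf 'bowtie2 -p %s --reorder -x %s -1 %s_R1.fastq.gz -2 %s_R2.fastq.gz --un-conc-gz %s_host_removed_R%%.fastq.gz -S /dev/null\n' \
        "$threads" "$index" "$sample" "$sample" "$sample"
}

# Print the command for the example sample used in these notes
host_removal_cmd GRCh38_noalt_as SAMPLE 8
```

Printing rather than executing keeps the sketch safe to try on a machine without bowtie2; the printed line can be copied into a terminal once the file names have been checked.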

Option --un-conc gives results similar to the samtools option -F 2 (excluding reads "mapped in proper pair").
Pairs in which only one of the two reads maps to the host sequence might therefore still be included in the "host removed" output.
For better control over the read filtering options, see the workflow below.

If multi-processor option -p is used, output reads might have a different order compared to input files.
Use option --reorder to keep the original read order.
(read order refers to the .sam output but might also affect the host-removed read output files .1 and .2)
http://bowtie-bio.sourceforge.net/bowtie2/manual.shtml#performance-options

Using bowtie2 together with samtools
complex solution that gives better control over the rejected reads by using SAM-flags

How to filter out host reads from paired-end fastq files?

a) bowtie2 mapping against host genome: write all (mapped and unmapped) reads to a single .bam file
b) samtools view: use filter-flags to extract unmapped reads
c) samtools fastq: split paired-end reads into separated R1 and R2 fastq files

a) bowtie2 mapping against host sequence

create or download ready to use bowtie2 index databases (host_DB)

create bowtie2 database using a reference genome (fasta file)

bowtie2-build host_genome_sequence.fasta host_DB
→ Download reference genomes of common hosts (human, mouse, ..)
→ How to create a bowtie2 index database

download ready to use bowtie2 database of human genome GRCh38 (hg38)

wget https://genome-idx.s3.amazonaws.com/bt/GRCh38_noalt_as.zip
unzip GRCh38_noalt_as.zip
# move all files into your working directory (or into your predefined $BOWTIE2_INDEXES location)
# use "GRCh38_noalt_as" for your bowtie2 database name (instead of "host_DB")
→ see all available bowtie2 databases (host species list is shown on the right)

bowtie2 mapping against host sequence database, keep both aligned and unaligned reads (paired-end reads)
bowtie2 -p 8 -x host_DB -1 SAMPLE_R1.fastq.gz -2 SAMPLE_R2.fastq.gz -S SAMPLE_mapped_and_unmapped.sam
convert file .sam to .bam
samtools view -bS SAMPLE_mapped_and_unmapped.sam > SAMPLE_mapped_and_unmapped.bam

b) filter required unmapped reads

SAMtools SAM-flag filter: get unmapped pairs (both reads R1 and R2 unmapped)

samtools view -b -f 12 -F 256 SAMPLE_mapped_and_unmapped.bam > SAMPLE_bothReadsUnmapped.bam
-f 12 Extract only (-f) alignments with both reads unmapped: <read unmapped><mate unmapped>
-F 256 Do not(-F) extract alignments which are: <not primary alignment>
see meaning of SAM-flags
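
To make the flag arithmetic concrete, here is a small sketch that applies the same test as "-f 12 -F 256" to a few typical paired-end flag values. The function name keep_read is hypothetical; the bit values (4 = read unmapped, 8 = mate unmapped, 256 = not primary alignment) come from the SAM flag definition.

```shell
#!/bin/sh
# keep_read FLAG: succeeds if a record with this SAM flag passes "-f 12 -F 256".
keep_read() {
    flag=$1
    # -f 12: both bit 4 (read unmapped) and bit 8 (mate unmapped) must be set
    # -F 256: bit 256 (not primary alignment) must be clear
    [ $(( flag & 12 )) -eq 12 ] && [ $(( flag & 256 )) -eq 0 ]
}

# 77, 141: both reads unmapped (first / second in pair)
# 69: read unmapped, mate mapped    73: read mapped, mate unmapped
# 99: mapped in proper pair
for flag in 77 141 69 73 99; do
    if keep_read "$flag"; then
        echo "$flag keep"
    else
        echo "$flag drop"
    fi
done
# prints: 77 keep, 141 keep, 69 drop, 73 drop, 99 drop (one per line)
```

Flags 69 and 73 are the pairs in which only one read maps to the host. They are dropped here, whereas a filter that only excludes proper pairs (-F 2) would keep them.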
c) split paired-end reads into separated fastq files .._R1 .._R2

sort bam file by read name (-n) to have paired reads next to each other (2 parallel threads, each using up to 5G memory)

samtools sort -n -m 5G -@ 2 SAMPLE_bothReadsUnmapped.bam -o SAMPLE_bothReadsUnmapped_sorted.bam

samtools fastq -@ 8 SAMPLE_bothReadsUnmapped_sorted.bam \
    -1 SAMPLE_host_removed_R1.fastq.gz \
    -2 SAMPLE_host_removed_R2.fastq.gz \
    -0 /dev/null -s /dev/null -n
http://www.htslib.org/doc/samtools-fasta.html

see other options for → converting .bam to .fastq
Result
Two files of paired-end reads, containing non-host sequences
SAMPLE_host_removed_R1.fastq.gz
SAMPLE_host_removed_R2.fastq.gz
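
A quick sanity check on the two result files is to confirm that they contain the same number of reads. The sketch below is one way to do that; the function names count_reads and check_pairs are hypothetical, and it assumes standard 4-line FASTQ records (no wrapped sequence lines).

```shell
#!/bin/sh
# count_reads FILE: number of records in a gzip-compressed FASTQ file
# (assumes exactly 4 lines per record).
count_reads() {
    lines=$(gzip -dc "$1" | wc -l)
    echo $(( lines / 4 ))
}

# check_pairs R1 R2: report both counts, succeed only if they are equal.
check_pairs() {
    n1=$(count_reads "$1")
    n2=$(count_reads "$2")
    echo "R1=$n1 R2=$n2"
    [ "$n1" -eq "$n2" ]
}

# Example call, using the result file names from above:
# check_pairs SAMPLE_host_removed_R1.fastq.gz SAMPLE_host_removed_R2.fastq.gz
```

Equal counts do not prove that the mates are in matching order, but unequal counts are a clear sign that the filtering or splitting step went wrong.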

3. Option 2
https://github.com/UEA-Cancer-Genetics-Lab/sepath_tool_UEA
https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1819-8
SEPATH: benchmarking the search for pathogens in human tissue whole genome sequence data leads to template pipelines
Judging from the benchmarks in the Option 2 paper, BBDuk and Kontaminant perform comparatively well.
https://yiweiniu.github.io/blog/2018/07/Remove-Contamination-of-Pokaryotic-Organisms-in-Reads/
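This post recommends BBDuk for removing contamination from second-generation (short-read) data and minimap2 for third-generation (long-read) data.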
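
As a rough illustration of that recommendation, the sketch below prints a candidate command for each read type. It is not taken from the paper or the blog post: the function name decontam_cmd, the file naming scheme, k=31, and the map-ont preset (Nanopore reads) are all assumptions that need adjusting to the actual data. Note also that BBDuk holds the reference k-mers in memory, so memory needs grow with the size of the host genome.

```shell
#!/bin/sh
# decontam_cmd PLATFORM HOST_FASTA SAMPLE: print a host-removal command.
# PLATFORM is "short" (second-generation) or "long" (third-generation).
decontam_cmd() {
    platform=$1
    host=$2      # host reference FASTA (hypothetical path)
    sample=$3    # sample prefix (hypothetical naming scheme)
    case $platform in
        short)
            # BBDuk: reads sharing k-mers with ref are filtered out,
            # the remaining reads are written to out1/out2.
            echo "bbduk.sh in1=${sample}_R1.fastq.gz in2=${sample}_R2.fastq.gz out1=${sample}_clean_R1.fastq.gz out2=${sample}_clean_R2.fastq.gz ref=${host} k=31"
            ;;
        long)
            # minimap2 + samtools: keep only reads flagged as unmapped (-f 4).
            echo "minimap2 -ax map-ont ${host} ${sample}.fastq.gz | samtools fastq -f 4 - | gzip > ${sample}_clean.fastq.gz"
            ;;
        *)
            echo "unknown platform: $platform" >&2
            return 1
            ;;
    esac
}

decontam_cmd short host.fasta SAMPLE
decontam_cmd long host.fasta SAMPLE
```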

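4. Option 3
Keep using BLAST to remove the human sequences. This approach introduces fewer additional tools, but a k-mer-based method will certainly be somewhat faster.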
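
A minimal sketch of how the BLAST route might look: align the reads (as FASTA) against a host database with tabular output, then drop every read that produced a hit. The function name drop_hit_reads and all file names are hypothetical, and any hit at all counts as host here; a real run would probably apply identity or e-value thresholds.

```shell
#!/bin/sh
# Assumed preparation (not run here):
#   makeblastdb -in host_genome_sequence.fasta -dbtype nucl -out host_blast_db
#   blastn -query reads.fasta -db host_blast_db -outfmt 6 -out hits.tsv

# drop_hit_reads HITS_TSV READS_FASTA: print the FASTA records whose ID
# does not appear in column 1 (query ID) of the tabular BLAST output.
drop_hit_reads() {
    awk -v hits="$1" '
        BEGIN {
            # Load the query IDs that had at least one host hit
            while ((getline line < hits) > 0) {
                split(line, f, "\t")
                hit[f[1]] = 1
            }
        }
        /^>/ { id = substr($1, 2); keep = !(id in hit) }   # header line: decide
        keep                                               # print kept records
    ' "$2"
}

# Example call:
# drop_hit_reads hits.tsv reads.fasta > reads_host_removed.fasta
```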

anneng

http://www.metagenomics.wiki/tools/short-read/remove-host-sequences
https://www.microbiologyresearch.org/content/journal/mgen/10.1099/mgen.0.000393?crawler=true