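<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[Removing host sequences in pathogen analysis]]></title><description><![CDATA[<p dir="auto">1. Is it necessary to remove host sequences?<br />
Suppose a sample contains n reads and the reference database contains x sequences. For host removal, the user can specify the host species in advance: typically human in a hospital setting, a particular animal in animal disease control, or a particular plant in plant disease control. In effect x = 1, so the host reads among the n reads can be identified quickly. From a performance point of view, host removal is well worth doing.<br />
As for alignment accuracy: could this step align some pathogen reads to the host by mistake? In theory this cannot be ruled out, but hosts and microbes are far apart on the phylogenetic tree, so the probability should be very low (the literature still needs to be checked).</p>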
<p dir="auto">2. Option 1: bowtie2/samtools<br />
<a href="https://sites.google.com/site/wiki4metagenomics/tools/short-read/remove-host-sequences" rel="nofollow ugc">https://sites.google.com/site/wiki4metagenomics/tools/short-read/remove-host-sequences</a><br />
Remove host sequences<br />
In silico removal of host sequences<br />
removing host (contamination) sequences in order to analyze remaining (bacterial) sequences</p>
<ol>
<li>Using bowtie2 option:  --un-conc<br />
quick solution to get the paired reads that do not map to the host reference genome (both reads unmapped).</li>
</ol>
<p dir="auto"># download ready to use bowtie2 database of human host genome GRCh38 (hg38)<br />
wget <a href="https://genome-idx.s3.amazonaws.com/bt/GRCh38_noalt_as.zip" rel="nofollow ugc">https://genome-idx.s3.amazonaws.com/bt/GRCh38_noalt_as.zip</a><br />
unzip GRCh38_noalt_as.zip</p>
<p dir="auto"># run bowtie2 mapping (using --un-conc-gz to get gzip compressed output files;  8 processors)<br />
bowtie2 -p 8 -x GRCh38_noalt_as -1 SAMPLE_R1.fastq.gz -2 SAMPLE_R2.fastq.gz --un-conc-gz SAMPLE_host_removed &gt; SAMPLE_mapped_and_unmapped.sam<br />
<a href="http://bowtie-bio.sourceforge.net/bowtie2/manual.shtml#output-options" rel="nofollow ugc">http://bowtie-bio.sourceforge.net/bowtie2/manual.shtml#output-options</a></p>
<p dir="auto"># bowtie2 results (gz files without gz ending)<br />
ls<br />
SAMPLE_host_removed.1<br />
SAMPLE_host_removed.2</p>
<p dir="auto"># rename host-sequence free samples<br />
mv SAMPLE_host_removed.1 SAMPLE_host_removed_R1.fastq.gz<br />
mv SAMPLE_host_removed.2 SAMPLE_host_removed_R2.fastq.gz</p>
<p dir="auto">The --un-conc option behaves like the samtools filter -F 2 (excluding reads flagged "mapped in proper pair").<br />
Pairs that do not align concordantly to the host (for example, only one of the two reads maps) can therefore still end up in the "host removed" output.<br />
For finer control over read filtering, see the workflow below.</p>
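<p dir="auto">To make the difference concrete, here is a minimal sketch in plain shell arithmetic that compares the -F 2 behaviour described above with the stricter -f 12 -F 256 filter used in the samtools workflow of this post. The helper names (has_all, has_none, describe) are made up for illustration; only the flag values come from the SAM format. It looks at one read's flag at a time, which is a simplification, since --un-conc works on whole pairs.</p>

```shell
# has_all FLAG MASK: true when every bit of MASK is set in FLAG (like samtools -f MASK)
has_all() { [ $(( $1 & $2 )) -eq "$2" ]; }
# has_none FLAG MASK: true when no bit of MASK is set in FLAG (like samtools -F MASK)
has_none() { [ $(( $1 & $2 )) -eq 0 ]; }

# describe FLAG: report how the two filters treat a read carrying this SAM flag
describe() {
    if has_none "$1" 2; then a="kept"; else a="dropped"; fi
    if has_all "$1" 12 && has_none "$1" 256; then b="kept"; else b="dropped"; fi
    echo "flag $1: -F 2 $a, -f 12 -F 256 $b"
}

describe 73   # paired (1) + mate unmapped (8) + first in pair (64): this read hit the host
describe 77   # paired (1) + read unmapped (4) + mate unmapped (8) + first in pair (64)
describe 99   # paired (1) + proper pair (2) + mate reverse (32) + first in pair (64)
```

<p dir="auto">Flag 73 is the interesting case: the read aligned to the host, but because its mate did not, it passes -F 2 and is only rejected by -f 12.</p>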
<p dir="auto">If multi-processor option -p is used, output reads might have a different order compared to input files.<br />
Use option --reorder to keep the original read order.<br />
(read order refers to the .sam output, but might also affect the host-removed read output files .1 and .2)<br />
<a href="http://bowtie-bio.sourceforge.net/bowtie2/manual.shtml#performance-options" rel="nofollow ugc">http://bowtie-bio.sourceforge.net/bowtie2/manual.shtml#performance-options</a></p>
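<p dir="auto">The quick variant can be wrapped in a small function. This is only a sketch: the function name and argument order are my own, and it assumes that the SAM output is not needed (it is sent to /dev/null). According to the bowtie2 manual, a % in the --un-conc path is replaced by 1 or 2, which would make the rename step unnecessary; --reorder is added as suggested above.</p>

```shell
# remove_host_quick DB R1 R2 PREFIX [THREADS]
# Writes PREFIX_R1.fastq.gz and PREFIX_R2.fastq.gz (pairs that did not align
# concordantly to the host); the SAM alignments themselves are discarded.
remove_host_quick() {
    bowtie2 -p "${5:-8}" --reorder -x "$1" \
        -1 "$2" -2 "$3" \
        --un-conc-gz "${4}_R%.fastq.gz" > /dev/null
}

# Example call (needs bowtie2 and the index files on disk):
# remove_host_quick GRCh38_noalt_as SAMPLE_R1.fastq.gz SAMPLE_R2.fastq.gz SAMPLE_host_removed
```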
<ol start="2">
<li>Using bowtie2 together with samtools<br />
complex solution that gives better control over the rejected reads by using SAM-flags</li>
</ol>
<p dir="auto">How to filter out host reads from paired-end fastq files?</p>
<p dir="auto">a) bowtie2 mapping against host genome: write all (mapped and unmapped) reads to a single .bam file<br />
b) samtools view: use filter-flags to extract unmapped reads<br />
c) samtools fastq: split paired-end reads into separated R1 and R2 fastq files</p>
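<p dir="auto">Steps a) to c) can also be chained with pipes so that no intermediate .sam or .bam file is written. The sketch below is one possible arrangement, not the wiki's own recipe: the function name and arguments are hypothetical, and it assumes a samtools version that reads from standard input when given "-". The individual commands and flags are the same ones explained step by step in the rest of this section.</p>

```shell
# remove_host_stream DB R1 R2 PREFIX [THREADS]
# a) map against the host, b) keep pairs with both reads unmapped,
# c) name-sort and split into PREFIX_R1.fastq.gz / PREFIX_R2.fastq.gz
remove_host_stream() {
    bowtie2 -p "${5:-8}" -x "$1" -1 "$2" -2 "$3" |
        samtools view -b -f 12 -F 256 - |
        samtools sort -n - |
        samtools fastq -1 "${4}_R1.fastq.gz" -2 "${4}_R2.fastq.gz" \
            -0 /dev/null -s /dev/null -n -
}

# Example call (needs bowtie2, samtools and the index files on disk):
# remove_host_stream host_DB SAMPLE_R1.fastq.gz SAMPLE_R2.fastq.gz SAMPLE_host_removed
```

<p dir="auto">The trade-off is that the full alignment is not kept, so a different filter cannot be applied later without mapping again.</p>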
<p dir="auto">a) bowtie2 mapping against host sequence</p>
<ol>
<li>create or download ready to use bowtie2 index databases (host_DB)</li>
</ol>
<p dir="auto"># create bowtie2 database using a reference genome (fasta file)<br />
bowtie2-build host_genome_sequence.fasta host_DB<br />
→ Download reference genomes of common hosts (human, mouse, ..)<br />
→ How to create a bowtie2 index database</p>
<p dir="auto"># download ready to use bowtie2 database of human genome GRCh38 (hg38)<br />
wget <a href="https://genome-idx.s3.amazonaws.com/bt/GRCh38_noalt_as.zip" rel="nofollow ugc">https://genome-idx.s3.amazonaws.com/bt/GRCh38_noalt_as.zip</a><br />
unzip GRCh38_noalt_as.zip<br />
# move all files into your working directory (or into your predefined $BOWTIE2_INDEXES location)<br />
# use  "GRCh38_noalt_as" for your bowtie2 database name (instead of "host_DB")<br />
→ see all available bowtie2 databases  (host species list is shown on the right)</p>
<ol start="2">
<li>
<p dir="auto">bowtie2 mapping against host sequence database, keep both aligned and unaligned reads (paired-end reads)<br />
bowtie2 -p 8 -x host_DB -1 SAMPLE_R1.fastq.gz -2 SAMPLE_R2.fastq.gz -S SAMPLE_mapped_and_unmapped.sam</p>
</li>
<li>
<p dir="auto">convert file .sam to .bam<br />
samtools view -bS SAMPLE_mapped_and_unmapped.sam &gt; SAMPLE_mapped_and_unmapped.bam</p>
</li>
</ol>
<p dir="auto">b) filter required unmapped reads</p>
<p dir="auto"># SAMtools SAM-flag filter: get unmapped pairs (both reads R1 and R2 unmapped)<br />
samtools view -b -f 12 -F 256 SAMPLE_mapped_and_unmapped.bam &gt; SAMPLE_bothReadsUnmapped.bam<br />
-f 12     Extract only (-f) alignments with both reads unmapped: &lt;read unmapped&gt;&lt;mate unmapped&gt;<br />
-F 256   Do not (-F) extract alignments which are: &lt;not primary alignment&gt;<br />
see meaning of SAM-flags</p>
<p dir="auto">c) split paired-end reads into separate fastq files .._R1 .._R2</p>
<p dir="auto"># sort bam file by read name (-n) to have paired reads next to each other (2 parallel threads, each using up to 5G memory)<br />
samtools sort -n -m 5G -@ 2 SAMPLE_bothReadsUnmapped.bam -o SAMPLE_bothReadsUnmapped_sorted.bam</p>
<p dir="auto">samtools fastq -@ 8 SAMPLE_bothReadsUnmapped_sorted.bam \<br />
-1 SAMPLE_host_removed_R1.fastq.gz \<br />
-2 SAMPLE_host_removed_R2.fastq.gz \<br />
-0 /dev/null -s /dev/null -n<br />
<a href="http://www.htslib.org/doc/samtools-fasta.html" rel="nofollow ugc">http://www.htslib.org/doc/samtools-fasta.html</a></p>
<p dir="auto">see other options for  → converting .bam to .fastq<br />
Result<br />
Two files of paired-end reads, containing non-host sequences<br />
SAMPLE_host_removed_R1.fastq.gz<br />
SAMPLE_host_removed_R2.fastq.gz</p>
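<p dir="auto">A quick sanity check on the result is to confirm that both output files hold the same number of reads. The helpers below are a hypothetical sketch (the names count_reads and check_pairs are mine) and assume the usual FASTQ layout of four lines per record.</p>

```shell
# count_reads FILE: number of FASTQ records in a gzip-compressed file
# (assumes exactly 4 lines per record)
count_reads() {
    gzip -dc "$1" | awk 'END { print int(NR / 4) }'
}

# check_pairs R1 R2: print both counts, succeed only if they are equal
check_pairs() {
    n1=$(count_reads "$1")
    n2=$(count_reads "$2")
    echo "R1=$n1 R2=$n2"
    [ "$n1" -eq "$n2" ]
}

# Example call:
# check_pairs SAMPLE_host_removed_R1.fastq.gz SAMPLE_host_removed_R2.fastq.gz
```

<p dir="auto">Comparing the count with the number of input reads also shows roughly what fraction of the sample was classified as host.</p>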
<p dir="auto">3. Option 2<br />
<a href="https://github.com/UEA-Cancer-Genetics-Lab/sepath_tool_UEA" rel="nofollow ugc">https://github.com/UEA-Cancer-Genetics-Lab/sepath_tool_UEA</a><br />
<a href="https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1819-8" rel="nofollow ugc">https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1819-8</a><br />
SEPATH: benchmarking the search for pathogens in human tissue whole genome sequence data leads to template pipelines<br />
Judging from the benchmarks in the Option 2 paper, bbduk and Kontaminant perform comparatively well.<br />
<a href="https://yiweiniu.github.io/blog/2018/07/Remove-Contamination-of-Pokaryotic-Organisms-in-Reads/" rel="nofollow ugc">https://yiweiniu.github.io/blog/2018/07/Remove-Contamination-of-Pokaryotic-Organisms-in-Reads/</a><br />
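That post recommends bbduk for second-generation (short-read) data and minimap2 for contamination in third-generation (long-read) data.</p>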
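<p dir="auto">For orientation, the two tools might be invoked roughly as follows. These are sketches under assumptions, not commands taken from the linked paper or blog post: the function names, file names, k=31 and the map-ont preset (Nanopore reads) are my own choices. The parameters themselves (in/in2/out/out2/outm/outm2/ref/k for bbduk.sh, -ax for minimap2, -f 4 for samtools fastq) are standard options of those tools.</p>

```shell
# remove_host_bbduk REF_FASTA R1 R2 PREFIX
# Short reads: pairs that share a k-mer with the host reference are diverted to
# the outm files; the remaining pairs go to PREFIX_R1/_R2.
remove_host_bbduk() {
    bbduk.sh in="$2" in2="$3" \
        out="${4}_R1.fastq.gz" out2="${4}_R2.fastq.gz" \
        outm="${4}_host_R1.fastq.gz" outm2="${4}_host_R2.fastq.gz" \
        ref="$1" k=31
}

# remove_host_minimap2 REF_FASTA READS OUT_FASTQ_GZ
# Long reads: align to the host and keep only reads flagged as unmapped (-f 4).
remove_host_minimap2() {
    minimap2 -ax map-ont "$1" "$2" |
        samtools fastq -f 4 - |
        gzip > "$3"
}
```

<p dir="auto">A k-mer table built from a whole host genome can need a lot of memory, so this should be tried on a small sample first.</p>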
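<p dir="auto">4. Option 3<br />
Alternatively, keep using blast to remove the human sequences. This approach introduces fewer additional tools, but a k-mer-based method would certainly be somewhat faster.</p>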
]]></description><link>http://an.forum.genostack.com/topic/250/在病原分析中去掉宿主信息</link><generator>RSS for Node</generator><lastBuildDate>Sat, 13 Jun 2026 12:37:06 GMT</lastBuildDate><atom:link href="http://an.forum.genostack.com/topic/250.rss" rel="self" type="application/rss+xml"/><pubDate>Thu, 18 Mar 2021 03:43:17 GMT</pubDate><ttl>60</ttl><item><title><![CDATA[Reply to Removing host sequences in pathogen analysis on Mon, 12 Apr 2021 01:22:09 GMT]]></title><description><![CDATA[<p dir="auto"><a href="http://www.metagenomics.wiki/tools/short-read/remove-host-sequences" rel="nofollow ugc">http://www.metagenomics.wiki/tools/short-read/remove-host-sequences</a><br />
<a href="https://www.microbiologyresearch.org/content/journal/mgen/10.1099/mgen.0.000393?crawler=true" rel="nofollow ugc">https://www.microbiologyresearch.org/content/journal/mgen/10.1099/mgen.0.000393?crawler=true</a></p>
]]></description><link>http://an.forum.genostack.com/post/579</link><guid isPermaLink="true">http://an.forum.genostack.com/post/579</guid><dc:creator><![CDATA[anneng]]></dc:creator><pubDate>Mon, 12 Apr 2021 01:22:09 GMT</pubDate></item></channel></rss>