HBV analysis

anneng

https://sci-hub.st/10.1053/j.gastro.2009.08.063


https://jbiomedsci.biomedcentral.com/articles/10.1186/s12929-018-0442-4
Applications of next-generation sequencing analysis for the detection of hepatocellular carcinoma-associated hepatitis B virus mutations
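This paper analyzes the association between HBV mutations and hepatocellular carcinoma.
1. On assembly, the paper says reference-guided assembly works better than de novo assembly, since Illumina reads are relatively short.

2. The paper discusses how the choice of reference affects false positives: a sample-specific reference sequence, a reference of the same genotype, or an incompatible reference.

The paper is thin on methods: it does not describe how the bioinformatics analysis was done.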


https://hivdb.stanford.edu/HBV/releaseNotes/


https://www.ncbi.nlm.nih.gov/labs/pmc/articles/PMC4382110/

Genotypes samples by placing the reads directly on a phylogenetic tree; it can also genotype mixed samples.


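Fourth Military Medical University HBV analysis log
1. Merge R1 and R2 with bbmerge. The script below uses find to list the sample names, then passes each sample name to parallel for concurrent processing.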

find ../all_data/*_L001_R1_001.fastq.gz | sed 's/_L001_R1_001\.fastq\.gz$//' | parallel 'bbmerge.sh in1={}_L001_R1_001.fastq.gz in2={}_L001_R2_001.fastq.gz out={/}.fastq outu1={/}.R1.unmerged outu2={/}.R2.unmerged'

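Among the 327 samples, a few had different numbers of reads in R1 and R2. For these samples, assemble with spades and take the longest sequence into step 2.
Because assembly is involved, mixed-sample analysis is not possible for them, so these samples are treated as single samples.
Convert all fastq files to fasta (blast only accepts fasta).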

parallel 'seqtk seq -a {}> {.}.fasta' ::: *.fastq
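The "take the longest sequence" step for the spades-assembled samples could be sketched as below. This is a minimal sketch, not the script that was actually used; the function names are made up for illustration, and it assumes the assembly output is an ordinary FASTA file.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            # Sequence lines may be wrapped, so collect and join them.
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def longest_record(lines):
    """Return the (header, sequence) pair with the longest sequence.

    Raises ValueError if the input contains no FASTA records.
    """
    return max(read_fasta(lines), key=lambda rec: len(rec[1]))


if __name__ == "__main__":
    # Made-up records, for illustration only.
    demo = [">contig_a", "ACGT", "ACGT", ">contig_b", "ACG"]
    header, seq = longest_record(demo)
    print(header, len(seq))
```

To use it on a real assembly, pass an open file handle for the contigs FASTA instead of the demo list.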

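2. Genotype the sequences in each sample with blast to get, for each sample, the number of sequences assigned to each genotype.
Build the blast database.
The reference sequences downloaded from hbvdb include a category called RF, for example https://www.ncbi.nlm.nih.gov/nucleotide/EU871985.1?report=genbank&log$=nuclalign&blast_rank=1&RID=Z8DW1MY8016 . NCBI does not label a genotype for this sequence, while hbvdb annotates it as a B/C recombinant. For now we drop these RF sequences.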

makeblastdb -in all_hbvdb_Genomes.fas -dbtype nucl

blastn -task blastn -max_target_seqs 1 -query ../0-merging-pe/100_S42.fasta -db ../hbvdb/all_hbvdb_Genomes.fas -num_threads 10 -out 100_S42.m8 -outfmt 6

nohup bash -c "find ../0-merging-pe/*.fasta | sed 's/\.fasta$//' | parallel --joblog ./logs -j40 blastn -task blastn -max_target_seqs 1 -query {}.fasta -db ../hbvdb/A-H/HBV_A_H.fas -out {/}.m8 -outfmt 6" &
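Counting the sequences per genotype from one .m8 file could look like the sketch below. It relies on the standard -outfmt 6 layout (query id in column 1, subject id in column 2) and counts only the first row per query, since one query can produce several HSP rows. ASSUMPTION: the genotype is the text after the last "-" in the subject id (e.g. a made-up id "X_P-C"); the real hbvdb headers may differ, so genotype_from_subject is a hypothetical helper to be replaced.

```python
from collections import Counter


def genotype_from_subject(subject_id):
    """Hypothetical helper: take the genotype from the end of the subject id.

    ASSUMPTION: ids end in "-<genotype>", e.g. "X_P-C" -> "C".
    """
    return subject_id.rsplit("-", 1)[-1]


def count_genotypes(m8_lines, genotype_of=genotype_from_subject):
    """Count the genotype of the first hit per query in -outfmt 6 lines."""
    seen = set()
    counts = Counter()
    for line in m8_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        query, subject = fields[0], fields[1]
        if query in seen:
            continue  # additional HSP rows for a read already counted
        seen.add(query)
        counts[genotype_of(subject)] += 1
    return counts


if __name__ == "__main__":
    # Made-up rows, for illustration only.
    demo = ["read1\tX_P-C\t99.0", "read1\tX_P-C\t91.0", "read2\tY_P-B\t98.5"]
    print(count_genotypes(demo))
```

A sample whose counts are split between two genotypes would be a candidate mixed sample; what split counts as "mixed" is a threshold these notes do not fix.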

3. Alignment

nohup bash -c "find ../all_data/*_L001_R1_001.fastq.gz | sed 's/_L001_R1_001.fastq.gz$//' | parallel 'bwa mem -M AB033556_hbc_type_C.fasta {}_L001_R1_001.fastq.gz {}_L001_R2_001.fastq.gz > {/}.sam' " &

nohup parallel "samtools view -bF 4 {} > {/.}.bam" ::: ./sam/*.sam &
parallel samtools sort {} -o {.}.sorted.bam ::: *.bam

4. Variant calling

nohup parallel "lofreq indelqual {} --dindel -f ../3-mapping/AB033556_hbc_type_C.fasta -o {/.}.dindel.bam " ::: ../3-mapping/bam/*.sorted.bam &

nohup parallel "lofreq call {} --call-indels -f ../3-mapping/AB033556_hbc_type_C.fasta -o {/.}.vcf " ::: *.bam &

5. Haplotype analysis

find /ceph_disk2/siyida_327_sample/3-mapping/sam/ -name "*.sam" -exec basename \{} .sam \; | sed 's/.sam$//' |parallel 'java -jar clique-snv.jar -m snv-illumina -in /ceph_disk2/siyida_327_sample/3-mapping/sam/{}.sam'


Assembly with spades

/home/bioinfo/miniconda2/envs/assembly/bin/spades.py -1 /ceph_disk3/hbv/HBV_illumina/106/106_S46_L001_R1_001.fastq -2 /ceph_disk3/hbv/HBV_illumina/106/106_S46_L001_R2_001.fastq -o /ceph_disk3/hbv/HBV_illumina/106/spades


https://www.sciencedirect.com/science/article/pii/S1386653218300970
Frequency of hepatitis B surface antigen variants (HBsAg) in hepatitis B virus genotype B and C infected East- and Southeast Asian patients: Detection by the Elecsys HBsAg II assay


https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0172101
Ultra-deep sequencing reveals high prevalence and broad structural diversity of hepatitis B surface antigen mutations in a global population
https://github.com/spabinger/HBV_data_publication_2016_07
an MHR variant was defined as a nucleotide sequence change in the S gene region (encoding amino acids 99 to 170) with an allele frequency >5% (in both sequencing directions) and at least 3 variant reads present on the forward as well as on the reverse strand.
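The MHR variant definition quoted above can be turned into a filter. This is a sketch under stated assumptions, not the paper's code: it takes strand counts in DP4 order (ref-forward, ref-reverse, alt-forward, alt-reverse, as reported in the DP4 INFO field of the lofreq VCF), and it approximates the per-direction allele frequency as alt / (ref + alt) on each strand, which ignores any third allele. The function name and the amino-acid position argument are hypothetical.

```python
def passes_mhr_filter(dp4, aa_position, min_af=0.05, min_reads=3):
    """Apply the quoted MHR variant definition to one variant.

    dp4: (ref_fwd, ref_rev, alt_fwd, alt_rev) read counts.
    aa_position: amino-acid position in the S gene (MHR = 99 to 170).
    """
    ref_fwd, ref_rev, alt_fwd, alt_rev = dp4
    if not 99 <= aa_position <= 170:
        return False
    # At least min_reads variant reads on each strand.
    if alt_fwd < min_reads or alt_rev < min_reads:
        return False
    fwd_total = ref_fwd + alt_fwd
    rev_total = ref_rev + alt_rev
    if fwd_total == 0 or rev_total == 0:
        return False
    # Allele frequency must exceed the threshold in both directions.
    return alt_fwd / fwd_total > min_af and alt_rev / rev_total > min_af


if __name__ == "__main__":
    # Made-up counts, for illustration only.
    print(passes_mhr_filter((90, 90, 10, 10), aa_position=145))
    print(passes_mhr_filter((90, 90, 10, 2), aa_position=145))
```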


https://sci-hub.st/10.1159/000361076
Hepatitis B Virus Drug Resistance Tools: One Sequence, Two Predictions
www.genafor.org/services.php

HIV-GRADE HBV

The paper mentions several tools for genotyping, drug-resistance, and immune-escape analysis.


Genetic Diversity of Hepatitis B Virus Strains Derived Worldwide: Genotypes, Subgenotypes, and HBsAg Subtypes

https://sci-hub.st/10.1159/000080872
Phylogenetic tree analysis of HBV; the paper also covers the complex correspondence between serotypes and genotypes.
Software used:
DNADIST and NEIGHBOR from the Phylip program package version 3.53

PUZZLE

Bootstrap on 1,000 replicas was performed with SEQBOOT, DNADIST, NEIGHBOR, and CONSENSE from the Phylip package.
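To make the distance step concrete, here is a sketch of a pairwise distance matrix of the kind a neighbor-joining program takes as input. It uses the Jukes-Cantor (JC69) correction, d = -3/4 ln(1 - 4p/3), as a simple stand-in; which substitution model the paper ran in DNADIST is not recorded in these notes, so this is not a reproduction of their settings. Function names are made up, and the input is assumed to be already aligned.

```python
import math


def p_distance(a, b):
    """Proportion of differing sites, skipping gaps and ambiguous bases."""
    pairs = [(x, y) for x, y in zip(a.upper(), b.upper())
             if x in "ACGT" and y in "ACGT"]
    if not pairs:
        raise ValueError("no comparable sites")
    return sum(x != y for x, y in pairs) / len(pairs)


def jc69_distance(a, b):
    """Jukes-Cantor distance; infinite when p reaches saturation (0.75)."""
    p = p_distance(a, b)
    if p >= 0.75:
        return math.inf
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def distance_matrix(seqs):
    """Map every ordered pair of names to its JC69 distance."""
    names = list(seqs)
    return {(m, n): jc69_distance(seqs[m], seqs[n])
            for m in names for n in names}


if __name__ == "__main__":
    # Made-up aligned sequences, for illustration only.
    demo = {"s1": "ACGTACGTAC", "s2": "ACGTACGTAT", "s3": "ACGAACGTAT"}
    for pair, dist in sorted(distance_matrix(demo).items()):
        print(pair, round(dist, 4))
```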


Global Occurrence of Clinically Relevant Hepatitis B Virus
viruses-12-01344-v3.pdf

Predicts the serotype from the protein sequence.
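A minimal sketch of serotype prediction from an HBsAg sequence, using only the two commonly cited main determinants: Lys/Arg at position 122 for d/y and Lys/Arg at position 160 for w/r. This is an assumption-laden simplification, not the paper's method: it does not resolve sub-determinants (w1 to w4, q), and the function name is made up.

```python
def main_serotype_determinants(hbsag):
    """Return e.g. "adw" from a small-S (HBsAg) amino-acid sequence.

    Positions are 1-based in the paper convention, so 122 -> index 121
    and 160 -> index 159. Unexpected residues are reported as "?".
    """
    if len(hbsag) < 160:
        raise ValueError("sequence too short to cover position 160")
    d_or_y = {"K": "d", "R": "y"}.get(hbsag[121].upper(), "?")
    w_or_r = {"K": "w", "R": "r"}.get(hbsag[159].upper(), "?")
    return "a" + d_or_y + w_or_r


if __name__ == "__main__":
    # Made-up placeholder sequence with only the two key residues set.
    residues = ["A"] * 226
    residues[121] = "K"
    residues[159] = "R"
    print(main_serotype_determinants("".join(residues)))
```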


https://www.aimspress.com/article/doi/10.3934/microbiol.2020024?viewType=HTML
This paper summarizes the possible effects of mutations.


https://www.nature.com/articles/s41598-019-43524-9
Illumina and Nanopore methods for whole genome sequencing of hepatitis B virus (HBV)


https://www.frontiersin.org/articles/10.3389/fmicb.2020.616023/full
Comprehensive Analysis of Clinically Significant Hepatitis B Virus Mutations in Relation to Genotype, Subgenotype and Geographic Region
Analyzes HBV mutations using public data.
Table_1_Comprehensive Analysis of Clinically Significant Hepatitis B Virus Mutations in Relation to Genotype, Subgenotype and Geographic Region.XLSX

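The format of this table can serve as a template for our analysis.
Rows are samples; columns are mutation positions or the codes of important markers.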
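The table layout described above (one row per sample, one column per mutation) could be produced with a sketch like this. The input structure, a dict from sample name to a set of mutation labels, and the function names are assumptions for illustration; the labels in the demo are placeholder data, not results.

```python
import csv
import io


def mutation_table(sample_mutations):
    """Build rows for a sample-by-mutation presence/absence table.

    sample_mutations: dict mapping sample name -> set of mutation labels.
    Returns a list of rows; the first row is the header.
    """
    columns = sorted({m for muts in sample_mutations.values() for m in muts})
    rows = [["sample"] + columns]
    for sample in sorted(sample_mutations):
        present = sample_mutations[sample]
        rows.append([sample] + [1 if m in present else 0 for m in columns])
    return rows


def to_csv(rows):
    """Serialize the rows as CSV text."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerows(rows)
    return buf.getvalue()


if __name__ == "__main__":
    # Placeholder samples and labels, for illustration only.
    demo = {"100_S42": {"sG145R", "rtM204V"}, "106_S46": {"sG145R"}}
    print(to_csv(mutation_table(demo)), end="")
```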


https://www.hiv.lanl.gov/content/sequence/ENTROPY/entropy.html

Shannon entropy calculator
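For reference, per-position Shannon entropy over an alignment, H = -sum(p_i * ln p_i) for the residues in each column, can be sketched as follows. This is not the web tool's code: the use of the natural log and the choice to skip gaps and N are assumptions made here, and the function name is made up.

```python
import math
from collections import Counter


def column_entropies(aligned_seqs, ignore="-N"):
    """Return the Shannon entropy (natural log) of each alignment column."""
    if len({len(s) for s in aligned_seqs}) != 1:
        raise ValueError("need at least one sequence, all of equal length")
    entropies = []
    for column in zip(*aligned_seqs):
        counts = Counter(c for c in column if c not in ignore)
        total = sum(counts.values())
        if total == 0:
            entropies.append(0.0)  # column with only gaps or N
            continue
        h = -sum((n / total) * math.log(n / total) for n in counts.values())
        entropies.append(h + 0.0)  # turn -0.0 into 0.0 for conserved columns
    return entropies


if __name__ == "__main__":
    # Made-up alignment, for illustration only.
    demo = ["ACGT", "ACGA", "ATGA", "AGGA"]
    print([round(h, 3) for h in column_entropies(demo)])
```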


https://zhanglab.ccmb.med.umich.edu/I-TASSER/
Structure prediction


https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7229894/
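A paper provided by Dr. Xiao of the Fourth Military Medical University. It sequenced the full-length HBV genome by clone sequencing.

Assembly: Contig-Express and CodonCode Aligner
Sequence alignment: MEGA X, Clustal X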


Inference with viral quasispecies diversity indices: Clonal and NGS approaches

Gives a detailed analysis of mutation frequency and Shannon entropy.
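At the haplotype level, Shannon entropy is computed from haplotype read counts rather than alignment columns. The sketch below also shows one common normalization, dividing by the log of the number of observed haplotypes so the value falls between 0 and 1. Whether this matches the exact index definitions in the paper is an assumption to check against the text; the function names are made up.

```python
import math


def haplotype_entropy(counts):
    """Shannon entropy (natural log) of haplotype read counts."""
    total = sum(counts)
    if total <= 0:
        raise ValueError("need at least one read")
    probs = [c / total for c in counts if c > 0]
    return -sum(p * math.log(p) for p in probs) + 0.0


def normalized_haplotype_entropy(counts):
    """Entropy divided by ln(number of observed haplotypes); 0 if only one."""
    n_haplotypes = sum(1 for c in counts if c > 0)
    if n_haplotypes <= 1:
        return 0.0
    return haplotype_entropy(counts) / math.log(n_haplotypes)


if __name__ == "__main__":
    # Made-up haplotype read counts, for illustration only.
    demo = [700, 200, 100]
    print(round(haplotype_entropy(demo), 4))
    print(round(normalized_haplotype_entropy(demo), 4))
```

In this pipeline the counts would come from the haplotype frequencies that CliqueSNV reports for each sample.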


https://www.yacinemahdid.com/shannon-entropy-from-theory-to-python/
Python implementation of Shannon entropy


https://elifesciences.org/articles/61803
The haplotypes for each sample were reconstructed for each gene segment using a previously published pipeline (Cacciabue et al., 2020). In brief, FastQC (Andrews, 2010) was used for quality assurance of the NGS paired-end raw reads followed by BBtools (Bushnell, 2014), for removing and filtering adapters and low-quality reads. Bowtie2 (Langmead and Salzberg, 2012), an aligner tool to align the trimmed reads to the selected reference of the influenza strain (i.e. the inoculum), was then used. Samtools suite (Li et al., 2009) was used to sort, index, and generate depth and coverage statistics for read alignment files. Next, CliqueSNV (Knyazev, 2020) was used to infer the haplotypes and frequencies for all eight gene segments for each sample.