暗能星系

    • 登录
    • 搜索

    简化基因组分析

    生物信息分析
    1
    13
    78
    正在加载更多帖子
    • 从旧到新
    • 从新到旧
    • 最多赞同
    回复
    • 在新帖中回复
    登录后回复
    此主题已被删除。只有拥有主题管理权限的用户可以查看。
    • A
      anneng 最后由 anneng 编辑

      一,生物学预备知识
      160ccc2c-d5ab-46b3-94c5-b0b083f97f78-image.png
      Restriction enzymes are DNA-cutting enzymes. Each enzyme recognizes one or a few target sequences and cuts DNA at or near those sequences.
      DNA ligase is a DNA-joining enzyme. If two pieces of DNA have matching ends, ligase can link them to form a single, unbroken molecule of DNA.
      e5a0ef80-94fc-400b-b9a8-e2f74f540dd9-image.png
      64a727dc-5c4e-43ce-b3b7-a4267b5b3997-image.png
      enzymes that leave single-stranded overhangs are said to produce sticky ends. Sticky ends are helpful in cloning because they hold two pieces of DNA together so they can be linked by DNA ligase.
      Some are “blunt cutters,” which cut straight down the middle of a target sequence and leave no overhang. The restriction enzyme SmaI is an example of a blunt cutter:
      9b706ca4-000b-4cab-a73c-f22172af4b32-image.png
      adffef80-198b-4eff-97b0-67b0594befe4-image.png
      一个例子:
      5e9fbbda-8e8c-477e-9304-1f76faafa453-image.png
      7320c572-4466-4497-acd3-4a1ad2c62168-image.png
      78f5acd4-506a-4d13-96b9-315a9bcc1b72-image.png
      二,实验方法
      CRoPS
      RAD-Seq
      GBS
      double-digest RAD-Seq
      2bRAD
      restriction-enzyme anchored positions
      https://www.researchgate.net/figure/An-overview-of-the-RAD-seq-library-creation-protocol-and-initial-analysis-steps_fig1_235794002
      d5d8f02e-6106-42ed-8b34-e850111dab4d-image.png
      三,分析方法
      Stack2流程
      官网:https://catchenlab.life.illinois.edu/stacks/
      a44c215f-c432-4aa5-8037-9304f58bb975-image.png
      原理:maximum likelihood statistical model
      实现语言:C++ with wrapper programs written in Perl,基于OpenMP的多线程技术
      无参考序列分析(de novo):
      process_radtags:demultiplexed and cleaned
      ustacks:building loci
      cstacks:creating the catalog of loci
      sstacks:matching against the catalog
      tsv2bam: transpose data from being store per-sample to be stored per-locus
      gstacks:assemble and merge paired-end contigs, call variant sites in the population and genotypes in each sample
      populations:filter data, calculate population genetics statistics, and export a variety of data formats.
      有参考序列分析:
      process_radtags:demultiplexed and cleaned
      gstacks:assemble and merge paired-end contigs, call variant sites in the population and genotypes in each sample
      populations:filter data, calculate population genetics statistics, and export a variety of data formats.
      四,安装

      tar xfvz stacks-2.xx.tar.gz
      cd stacks-2.xx
      ./configure
      make
      (become root)
      make install
      (or, use sudo)
      sudo make install
      

      五,详细过程
      5.1.Clean the data
      process_radtags 样本拆分和过滤低质量序列
      样本是否需要拆分,以及barcode是什么,需要和实验人员确定。
      参考手册4.1.1 指定 --inline_index参数
      https://catchenlab.life.illinois.edu/stacks/manual/#install
      指定barcode列表

      cat barcodes_lane3
      CGATA<tab>sample_01
      CGGCG     sample_02
      GAAGC     sample_03
      GAGAT     sample_04
      TAATG     sample_05
      TAGCA     sample_06
      AAGGG     sample_07
      ACACG     sample_08
      ACGTA     sample_09
      
      process_radtags -p ./raw/ -o ./samples/ -b ./barcodes/barcodes_lane3 \
                        -e sbfI -r -c -q
      process_radtags -P -p ./raw  -b ./barcodes/barcodes -o ./samples/ \
                        -c -q -r --inline_index --renz_1 nlaIII --renz_2 mluCI
      

      -p 待处理fataq文件所在的目录
      -o 输出结果目录
      -b barcode列表文件
      -e restrction enzyme类型
      5.2 和参考序列进行比对
      可以使用常用的比对软件 GSnap, BWA, or Bowtie2.
      可以用samtools或者picard查看比对的质量。
      建议使用bwa mem进行比对。
      烟草有参考基因组吗?
      5.3
      无参流程denovo_map.pl 或者单独执行命令
      有参流程ref_map.pl或者单独执行命令
      大部分stack2的组件需要一个population map文件,这个文件其实是根据样本特征进行分组,打标签
      population map

      cat popmap_both
      sample_01<tab>red<tab>high
      sample_02     red      high
      sample_03     red      high
      sample_04     red      high
      sample_05     yellow   high
      sample_06     yellow   high
      sample_07     yellow   high
      sample_08     yellow   high
      sample_09     blue     low
      sample_10     blue     low
      sample_11     blue     low
      sample_12     blue     low
      sample_13     orange   low
      sample_14     orange   low
      sample_15     orange   low
      sample_16     orange   low
      
      denovo_map.pl -T 8 -M 6 -o ./stacks/ --samples ./samples --popmap ./popmaps/popmap --paired
      

      或者单独执行命令:

      #!/bin/bash
      
      src=$HOME/research/project
      
      files=”sample_01
      sample_02
      sample_03”
      
      #
      # Build loci de novo in each sample for the single-end reads only. If paired-end reads are available, 
      # they will be integrated in a later stage (tsv2bam stage).
      # This loop will run ustacks on each sample, e.g.
      #   ustacks -f ./samples/sample_01.1.fq.gz -o ./stacks -i 1 --name sample_01 -M 4 -p 8
      #
      id=1
      for sample in $files
      do
          ustacks -f $src/samples/${sample}.1.fq.gz -o $src/stacks -i $id --name $sample -M 4 -p 8
          let "id+=1"
      done
      
      # 
      # Build the catalog of loci available in the metapopulation from the samples contained
      # in the population map. To build the catalog from a subset of individuals, supply
      # a separate population map only containing those samples.
      #
      cstacks -n 6 -P $src/stacks/ -M $src/popmaps/popmap -p 8
      
      #
      # Run sstacks. Match all samples supplied in the population map against the catalog.
      #
      sstacks -P $src/stacks/ -M $src/popmaps/popmap -p 8
      
      #
      # Run tsv2bam to transpose the data so it is stored by locus, instead of by sample. We will include
      # paired-end reads using tsv2bam. tsv2bam expects the paired read files to be in the samples
      # directory and they should be named consistently with the single-end reads,
      # e.g. sample_01.1.fq.gz and sample_01.2.fq.gz, which is how process_radtags will output them.
      #
      tsv2bam -P $src/stacks/ -M $src/popmaps/popmap --pe-reads-dir $src/samples -t 8
      
      #
      # Run gstacks: build a paired-end contig from the metapopulation data (if paired-reads provided),
      # align reads per sample, call variant sites in the population, genotypes in each individual.
      #
      gstacks -P $src/stacks/ -M $src/popmaps/popmap -t 8
      
      #
      # Run populations. Calculate Hardy-Weinberg deviation, population statistics, f-statistics
      # export several output files.
      #
      populations -P $src/stacks/ -M $src/popmaps/popmap -r 0.65 --vcf --genepop --structure --fstats --hwe -t 8
      
      ref_map.pl -T 8 --popmap ./popmaps/popmap -o ./stacks/ --samples ./aligned
      

      或者单独执行命令:

      #!/bin/bash
      
      src=$HOME/research/project
      bwa_db=$src/bwa_db/my_bwa_db_prefix
          
      files=”sample_01
      sample_02
      sample_03”
      
      #
      # Align paired-end data with BWA, convert to BAM and SORT.
      #
      for sample in $files
      do 
          bwa mem -t 8 $bwa_db $src/samples/${sample}.1.fq.gz $src/samples/${sample}.2.fq.gz |
            samtools view -b |
            samtools sort --threads 4 > $src/aligned/${sample}.bam
      done
      
      #
      # Run gstacks to build loci from the aligned paired-end data. We have instructed
      # gstacks to remove any PCR duplicates that it finds.
      #
      gstacks -I $src/aligned/ -M $src/popmaps/popmap --rm-pcr-duplicates -O $src/stacks/ -t 8
      
      #
      # Run populations. Calculate Hardy-Weinberg deviation, population statistics, f-statistics and 
      # smooth the statistics across the genome. Export several output files.
      #
      populations -P $src/stacks/ -M $src/popmaps/popmap -r 0.65 --vcf --genepop --fstats --smooth --hwe -t 8
      

      5.4 结果数据分析
      genetic map
      population analysis
      STRUCTURE
      Adegenet
      coverage

      1 条回复 最后回复 回复 引用 0
      • A
        anneng 最后由 anneng 编辑

        https://www.yn-tobacco.com/gzfw/rqfw/xfz/jyzwjb/201801/t20180112_215188.html
        烟草有参考基因组吗?
        普通烟草(Nicotiana tabacum)为异源四倍体植物,基因组大小约 4.5G,基因组结构高度复杂。美国烟草基因组计划对四倍体烤烟品种采用甲基过滤法测序只获得了部分基因序列信息。我国烟草基因组计划完成了对普通烟草的两个祖先种绒毛状烟草(N. tomentosiformis)和林烟草(N. sylvestris)全基因组序列图谱的绘制,更充分地验证了烟草基因组结构高度复杂这一事实。在林烟草和绒毛状烟草7万多个基因中,有1万多个属于高度同源基因,就像孪生兄弟一样十分相像,常规的表达谱芯片探针设计难以有效区分鉴别。
        N.tomentosiformis 绒毛状烟草
        N.sylvestris    林烟草
        N.tabacum    普通烟草
        用tobacco搜索NCBI 有参考基因组 到时侯也可以和客户沟通下用哪个

        1 条回复 最后回复 回复 引用 0
        • A
          anneng 最后由 anneng 编辑

          实际测试:
          鸿元服务器 /home/bioinfo/radseq
          process_radtags -p . -o 1-cleandata/ -e sbfI -c -q -r
          zgrep TGCAGG --color=always SRR828261.1.fq.gz |head -n 20
          //比对
          bwa mem ../Gasterosteus_aculeatus.BROADS1.dna.toplevel.fa ../SRR828261.1.fastq |gzip -3 > SRR828261.1.sam.gz
          //查看比对率
          samtools flagstat SRR828303.1.sam.gz
          //构建loci
          gstacks -I ../2-align/ -M popmap -O output -t 8
          //populations分析
          populations -P ../3-build-loci/output/ -M ./popmap -r 0.65 --vcf --genepop --fstats --smooth --hwe -t 8

          1 条回复 最后回复 回复 引用 0
          • A
            anneng 最后由 编辑

            https://pearg.github.io/pearg_documentation/tutorials/intro-rad-seq/

            1 条回复 最后回复 回复 引用 0
            • A
              anneng 最后由 编辑

              https://simison.com/brian/Structure_notes.html

              1 条回复 最后回复 回复 引用 0
              • A
                anneng 最后由 编辑

                https://evolution.genetics.washington.edu/phylip/

                1 条回复 最后回复 回复 引用 0
                • A
                  anneng 最后由 编辑

                  https://github.com/evansbenj/Reduced-Representation-Workshop

                  1 条回复 最后回复 回复 引用 0
                  • A
                    anneng 最后由 anneng 编辑

                    OneMap应用指南
                    https://mran.microsoft.com/snapshot/2017-02-04/web/packages/onemap/vignettes/Tutorial_Onemap_reduced_version.pdf

                    1 条回复 最后回复 回复 引用 0
                    • A
                      anneng 最后由 编辑

                      https://evolution.genetics.washington.edu/phylip/tuimala3.pdf

                      1 条回复 最后回复 回复 引用 0
                      • A
                        anneng 最后由 编辑

                        https://www.researchgate.net/post/Which-program-is-best-to-use-for-phylogeny-analysis

                        1 条回复 最后回复 回复 引用 0
                        • A
                          anneng 最后由 编辑

                          python tagdigger_script.py -w . -b samples.csv -o mycounts.csv -k markerstokeep.txt --StacksTags ../ANCHA180152/ustack/catalog.tags.tsv.gz --StacksSNPs ../ANCHA180152/ustack/catalog.snps.tsv.gz --StacksAlleles ../ANCHA180152/ustack/catalog.alleles.tsv.gz

                          1 条回复 最后回复 回复 引用 0
                          • A
                            anneng 最后由 编辑

                            37dc496d-5922-433f-90bf-7d68d03368f7-image.png
                            https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0201254
                            A RAD-sequencing approach to genome-wide marker discovery, genotyping, and phylogenetic inference in a diverse radiation of primates

                            1 条回复 最后回复 回复 引用 0
                            • A
                              anneng 最后由 编辑

                              https://groups.google.com/g/stacks-users/c/1n6O98oWO2o/m/oRm3FYZTs1IJ?pli=1

                              Stacks不太适合多倍体(大于2)

                              1 条回复 最后回复 回复 引用 0
                              • First post
                                Last post
                              Powered by 暗能星系