暗能星系

    • 登录
    • 搜索

    简化基因组流程开发笔记

    刘茜
    1
    5
    24
    正在加载更多帖子
    • 从旧到新
    • 从新到旧
    • 最多赞同
    回复
    • 在新帖中回复
    登录后回复
    此主题已被删除。只有拥有主题管理权限的用户可以查看。
    • I
      ice-melt 最后由 ice-melt 编辑

      process_radtags

      process_radtags -p in_dir [--paired [--interleaved]] [-b barcode_file] -o out_dir -e enz [-c] [-q] [-r] [-t len]
      process_radtags -f in_file [-b barcode_file] -o out_dir -e enz [-c] [-q] [-r] [-t len]
      process_radtags -1 pair_1 -2 pair_2 [-b barcode_file] -o out_dir -e enz [-c] [-q] [-r] [-t len]
      
        p: path to a directory of files.
        P,--paired: files contained within the directory are paired.
        I,--interleaved: specify that the paired-end reads are interleaved in single files.
        i: input file type, either 'fastq', 'gzfastq' (gzipped fastq), 'bam', or 'bustard' (default: guess, or gzfastq if unable to).
        b: path to a file containing barcodes for this run, omit to ignore any barcoding.
        o: path to output the processed files.
        f: path to the input file if processing single-end sequences.
        1: first input file in a set of paired-end sequences.
        2: second input file in a set of paired-end sequences.
        c,--clean: clean data, remove any read with an uncalled base.
        q,--quality: discard reads with low quality scores.
        r,--rescue: rescue barcodes and RAD-Tags.
        t: truncate final read length to this value.
        D: capture discarded reads to a file.
        E: specify how quality scores are encoded, 'phred33' (Illumina 1.8+/Sanger, default) or 'phred64' (Illumina 1.3-1.5).
        w: set the size of the sliding window as a fraction of the read length, between 0 and 1 (default 0.15).
        s: set the score limit. If the average score within the sliding window drops below this value, the read is discarded (default 10).
        y: output type, either 'fastq', 'gzfastq', 'fasta', or 'gzfasta' (default: match input type).
      
        Barcode options:
          --inline-null:   barcode is inline with sequence, occurs only on single-end read (default).
          --index-null:    barcode is provded in FASTQ header (Illumina i5 or i7 read).
          --null-index:    barcode is provded in FASTQ header (Illumina i7 read if both i5 and i7 read are provided).
          --inline-inline: barcode is inline with sequence, occurs on single and paired-end read.
          --index-index:   barcode is provded in FASTQ header (Illumina i5 and i7 reads).
          --inline-index:  barcode is inline with sequence on single-end read and occurs in FASTQ header (from either i5 or i7 read).
          --index-inline:  barcode occurs in FASTQ header (Illumina i5 or i7 read) and is inline with single-end sequence (for single-end data) on paired-end read (for paired-end data).
      
        Restriction enzyme options:
          -e <enz>, --renz-1 <enz>: provide the restriction enzyme used (cut site occurs on single-end read)
          --renz-2 <enz>: if a double digest was used, provide the second restriction enzyme used (cut site occurs on the paired-end read).
          Currently supported enzymes include:
            'aciI', 'ageI', 'aluI', 'apaLI', 'apeKI', 'apoI', 'aseI', 'bamHI', 
            'bbvCI', 'bfaI', 'bfuCI', 'bgIII', 'bsaHI', 'bspDI', 'bstYI', 'btgI', 
            'cac8I', 'claI', 'csp6I', 'ddeI', 'dpnII', 'eaeI', 'ecoRI', 'ecoRV', 
            'ecoT22I', 'haeIII', 'hinP1I', 'hindIII', 'hpaII', 'hpyCH4IV', 'kpnI', 'mluCI', 
            'mseI', 'mslI', 'mspI', 'ncoI', 'ndeI', 'nheI', 'nlaIII', 'notI', 
            'nsiI', 'nspI', 'pacI', 'pspXI', 'pstI', 'rsaI', 'sacI', 'sau3AI', 
            'sbfI', 'sexAI', 'sgrAI', 'speI', 'sphI', 'taqI', 'xbaI', or 'xhoI'
            
      
        Protocol-specific options:
          --bestrad: library was generated using BestRAD, check for restriction enzyme on either read and potentially tranpose reads.
      
        Adapter options:
          --adapter-1 <sequence>: provide adaptor sequence that may occur on the single-end read for filtering.
          --adapter-2 <sequence>: provide adaptor sequence that may occur on the paired-read for filtering.
            --adapter-mm <mismatches>: number of mismatches allowed in the adapter sequence.
      
        Output options:
          --retain-header: retain unmodified FASTQ headers in the output.
          --merge: if no barcodes are specified, merge all input files into a single output file.
      
        Advanced options:
          --filter-illumina: discard reads that have been marked by Illumina's chastity/purity filter as failing.
          --disable-rad-check: disable checking if the RAD site is intact.
          --len-limit <limit>: specify a minimum sequence length (useful if your data has already been trimmed).
          --barcode-dist-1: the number of allowed mismatches when rescuing single-end barcodes (default 1).
          --barcode-dist-2: the number of allowed mismatches when rescuing paired-end barcodes (defaults to --barcode-dist-1).
      
      

      分析:

      • 输入支持目录和文件两种方式
        • 目录:通过-P/--paired 指定是否双端数据
        • 文件:单端-f指定文件;双端-1\-2指定R1和R2文件
      • 可选参数 [-c] [-q] [-r] [-t len]
        • -c 表示数据清洗时去除表示为N的碱基,
        • -q 表示数据清理时要去除低质量碱基 ,
        • -r 表示要抢救下barcode和RAD-tag。
      • 单选参数:
        • barcode 类型 -b
        • 建库所用的限制性内切酶 -e, --renz-1\--renz-2
        • 质量分数编码 -E,有默认值 phred33
      I 2 条回复 最后回复 回复 引用 0
      • I
        ice-melt 最后由 编辑

        参考基因组创建索引

        烟草的参考基因组下载链接:
        https://www.ncbi.nlm.nih.gov/assembly/GCA_000715135.1
        三刺鱼参考基因组:
        http://asia.ensembl.org/Gasterosteus_aculeatus/Info/Index

        wget -q ftp://ftp.ensembl.org/pub/release-91/fasta/gasterosteus_aculeatus/dna/Gasterosteus_aculeatus.BROADS1.dna.toplevel.fa.gz
        
        gzip -d Gasterosteus_aculeatus.BROADS1.dna.toplevel.fa.gz
        

        创建索引

        stacks不直接参与比对,而是处理不同比对软件得到的BAM文件
        这里建立bwa的索引

        genome_fa=genome/Gasterosteus_aculeatus.BROADS1.dna.toplevel.fa
        
        bwa index -p index/bwa/gac $genome_fa &> index/bwa/bwa_index.oe
        
        # 结果如下
        |-- genome
        |   |-- Gasterosteus_aculeatus.BROADS1.dna.toplevel.fa
        |-- index
            |-- bwa
                |-- bwa_index.oe
                |-- gac.amb
                |-- gac.ann
                |-- gac.bwt
                |-- gac.pac
                |-- gac.sa
        
        I 1 条回复 最后回复 回复 引用 0
        • I
          ice-melt @ice-melt 最后由 编辑

          @ice-melt

          这一步需要留意自己的单端测序,还是双端测序,barcode是在read中,还是在FASTQ的header中,是否还需要去接头序列,是否是双酶切建库等。

          另外这一步比较耗时,尽量脱机运行或者提交到计算节点中,不然突然断网导致运行终止就更浪费时间了。
          将运行结果记录到日志文件中,方便后期检查报错。

          process_radtags -p $raw_dir -b $barcode_file \
              -o $processed_data/ -e sbfI --inline_null \
              -c -q -r &> $processed_data/process_radtags.lane1.oe&
          
          1 条回复 最后回复 回复 引用 0
          • I
            ice-melt @ice-melt 最后由 编辑

            @ice-melt 在 简化基因组流程开发笔记 中说:

            参考基因组创建索引

            烟草的参考基因组下载链接:
            https://www.ncbi.nlm.nih.gov/assembly/GCA_000715135.1
            三刺鱼参考基因组:
            http://asia.ensembl.org/Gasterosteus_aculeatus/Info/Index

            wget -q ftp://ftp.ensembl.org/pub/release-91/fasta/gasterosteus_aculeatus/dna/Gasterosteus_aculeatus.BROADS1.dna.toplevel.fa.gz
            
            gzip -d Gasterosteus_aculeatus.BROADS1.dna.toplevel.fa.gz
            

            创建索引

            stacks不直接参与比对,而是处理不同比对软件得到的BAM文件
            这里建立bwa的索引

            genome_fa=genome/Gasterosteus_aculeatus.BROADS1.dna.toplevel.fa
            
            bwa index -p index/bwa/gac $genome_fa &> index/bwa/bwa_index.oe
            
            # 结果如下
            |-- genome
            |   |-- Gasterosteus_aculeatus.BROADS1.dna.toplevel.fa
            |-- index
                |-- bwa
                    |-- bwa_index.oe
                    |-- gac.amb
                    |-- gac.ann
                    |-- gac.bwt
                    |-- gac.pac
                    |-- gac.sa
            

            生成 bam/sam 文件

            # sam 文件
            bwa mem -M ../genome/index/bwa/gcf T5_R1.fq.gz T5_R2.fq.gz > T5_R2.sam &
            
            # bam 文件
            bwa mem -M ../genome/index/bwa/gcf T5_R1.fq.gz T5_R2.fq.gz |samtools view -b > T5_R2.bam &
            
            • 如果创建参考基因组索引没有指定索引文件夹, bwa mem 不需要指定-M参数
            • samtools 可以把bam转成sam格式
            • 可以指定 -t 增加线程数
            • 比对的结果可以查看一下质量
              samtools flagstat T5_R2.bam 
              
            1 条回复 最后回复 回复 引用 0
            • I
              ice-melt @ice-melt 最后由 编辑

              @ice-melt

              process_radtags 双端处理格式

              
                  // Parse a file name that looks like: lane6_NoIndex_L006_R1_003.fastq
                  // but exclude the paired-end files:  lane6_NoIndex_L006_R2_003.fastq
                  //
                  // Another example could be:          GfddRAD1_001_ATCACG_L008_R1_001.fastq.gz
                  // and excluding the paired-end file: GfddRAD1_001_ATCACG_L008_R2_001.fastq.gz
              
              1 条回复 最后回复 回复 引用 0
              • First post
                Last post
              Powered by 暗能星系