Public Datasets
-
International HapMap Project
https://www.genome.gov/10001688/international-hapmap-project
The DNA sequence of any two people is 99.5 percent identical. The variations, however, may greatly affect an individual's disease risk. Sites in the DNA sequence where individuals differ at a single DNA base are called single nucleotide polymorphisms (SNPs). Sets of nearby SNPs on the same chromosome are inherited in blocks. The pattern of SNPs on a block is a haplotype. Blocks may contain a large number of SNPs, but a few SNPs are enough to uniquely identify the haplotypes in a block. The HapMap is a map of these haplotype blocks, and the specific SNPs that identify the haplotypes are called tag SNPs.
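The claim that a few SNPs are enough to identify every haplotype in a block can be illustrated with a small sketch. The haplotype strings below are invented for illustration, and the exhaustive search is only a toy stand-in: it is not the method used by the HapMap project, where tag SNP selection is typically based on linkage disequilibrium statistics.

```python
from itertools import combinations

# Toy haplotype block: each string is one haplotype, each column one SNP site.
# These sequences are made up for illustration; they are not HapMap data.
HAPLOTYPES = [
    "ACGTA",
    "ACGCA",
    "GTGTC",
    "GTACC",
]

def find_tag_snps(haplotypes):
    """Return the smallest list of site indices whose alleles
    distinguish every haplotype in the block (brute-force search)."""
    n_sites = len(haplotypes[0])
    for size in range(1, n_sites + 1):
        for sites in combinations(range(n_sites), size):
            # Alleles observed at the candidate sites, per haplotype.
            signatures = {tuple(h[i] for i in sites) for h in haplotypes}
            if len(signatures) == len(haplotypes):
                return list(sites)
    # Falls through only if two haplotypes are identical at every site.
    return list(range(n_sites))

tags = find_tag_snps(HAPLOTYPES)
print(tags)  # [0, 3]
```

Here two of the five sites (indices 0 and 3) are sufficient to tell the four haplotypes apart, so genotyping only those two sites would identify the haplotype in this toy block.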
Personal Genome Project
https://www.personalgenomes.org/us
The Personal Genome Project, initiated in 2005, is a vision and coalition of projects across the world dedicated to creating public genome, health, and trait data. Sharing data is critical to scientific progress, but has been hampered by traditional research practices. The PGP approach is to invite willing participants to publicly share their personal data for the greater good.
The International Genome Sample Resource
The 1000 Genomes Project created a catalogue of common human genetic variation, using openly consented samples from people who declared themselves to be healthy. The reference data resources generated by the project remain heavily used by the biomedical science community. The International Genome Sample Resource (IGSR) maintains and shares the human genetic variation resources built by the 1000 Genomes Project. We also update the resources to the current reference assembly, add new data sets generated from the 1000 Genomes Project samples, and add data from projects working with other openly consented samples.
Human Genome Structural Variation Consortium
https://www.internationalgenome.org/human-genome-structural-variation-consortium/
The Human Genome Structural Variation Consortium (HGSVC) creates high-quality maps of human structural variation and develops new methods, taking advantage of the burgeoning array of genomics assays now available to define genomic structure.
1000 Genomes Project 30x high-coverage sequencing (BioProject PRJEB31736)
https://ddbj.nig.ac.jp/resource/bioproject/PRJEB31736
We sequenced all 2,504 samples from the 1000 Genomes (1KG) Project to a minimum of 30x mean genome coverage. Although a small number of 1KG samples had previously been sequenced to high coverage, we sequenced all samples to this depth on the latest technology, providing a unified dataset for the next phase of analyses. We processed these samples using the laboratory processes we previously used for the CCDG project, with minor modifications. Specifically, we generated PCR-free sequencing libraries using unique dual indices to avoid index switching, which causes low-level contamination of sequencing data on Illumina patterned flow cells. We sequenced these samples on the Illumina NovaSeq 6000 instrument with 2x150 bp reads. We believe this instrument represents the future of WGS with short-read technology, and it was important to sequence the 1KG samples in a format consistent with future large-scale sequencing projects. Our automated analysis pipeline for whole-genome sequencing matches the CCDG and TOPMed recommended best practices. Sequencing reads were aligned to the human reference, hs38DH, using BWA-MEM v0.7.15. Data were further processed using the GATK best practices (v3.5), which generate VCF files in the 4.2 format. Single nucleotide variants and indels were called using GATK HaplotypeCaller (v3.5), which generates a single-sample GVCF.
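To give a sense of scale for "30x mean genome coverage" with 2x150 bp reads, the following back-of-the-envelope sketch estimates the number of read pairs required per sample. The genome size of 3.1 Gb is an assumption made here for illustration (the description above does not state one), and the calculation ignores duplicates, unmapped reads, and trimming, so it is a rough lower bound and not a figure reported by the project.

```python
import math

# Assumed approximate human genome size in base pairs (illustrative only).
GENOME_SIZE_BP = 3_100_000_000

def read_pairs_for_coverage(target_coverage, genome_size_bp, read_length_bp=150):
    """Read pairs needed so that sequenced bases / genome size >= target coverage."""
    bases_per_pair = 2 * read_length_bp  # paired-end: two reads per fragment
    return math.ceil(target_coverage * genome_size_bp / bases_per_pair)

def mean_coverage(n_read_pairs, genome_size_bp, read_length_bp=150):
    """Mean coverage = total sequenced bases / genome size."""
    return n_read_pairs * 2 * read_length_bp / genome_size_bp

pairs = read_pairs_for_coverage(30, GENOME_SIZE_BP)
print(pairs)                                 # 310000000
print(mean_coverage(pairs, GENOME_SIZE_BP))  # 30.0
```

Under these assumptions, each sample needs on the order of 310 million read pairs, or roughly 93 Gb of raw sequence, to reach 30x.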