GWAS

anneng

Clumping: This is a procedure in which only the most significant SNP (i.e., lowest p value) in each LD block is identified and selected for further analyses. This reduces the correlation between the remaining SNPs, while retaining SNPs with the strongest statistical evidence.
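
A minimal sketch of the greedy idea behind clumping, assuming p values and pairwise r² values are already available. The SNP names, the `r2_threshold` default, and the dictionary layout are hypothetical, and the physical-distance window that real tools also apply is left out for brevity.

```python
def clump(p_values, r2, r2_threshold=0.5):
    """Greedy clumping sketch.

    p_values: dict mapping SNP id -> association p value.
    r2: dict mapping frozenset({snp_a, snp_b}) -> pairwise r^2
        (pairs that are absent are treated as uncorrelated).
    Returns the index SNPs, most significant first.
    """
    # Visit SNPs from the lowest p value to the highest.
    remaining = sorted(p_values, key=p_values.get)
    index_snps = []
    while remaining:
        lead = remaining.pop(0)          # most significant SNP still unassigned
        index_snps.append(lead)
        # Drop every SNP that is in LD with the lead SNP (it joins the clump).
        remaining = [
            snp for snp in remaining
            if r2.get(frozenset((lead, snp)), 0.0) < r2_threshold
        ]
    return index_snps


# Hypothetical toy data: rs1/rs2 form one LD block, rs3/rs4 another.
p_values = {"rs1": 1e-9, "rs2": 3e-8, "rs3": 2e-4, "rs4": 5e-6}
r2 = {frozenset(("rs1", "rs2")): 0.9, frozenset(("rs3", "rs4")): 0.8}
print(clump(p_values, r2))  # ['rs1', 'rs4']
```

Each block contributes only its lowest-p SNP, which is the behaviour the definition describes.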

Co‐heritability: This is a measure of the genetic relationship between disorders. The SNP‐based co‐heritability is the proportion of covariance between disorder pairs (e.g., schizophrenia and bipolar disorder) that is explained by SNPs.

Gene: This is a sequence of nucleotides in the DNA that codes for a molecule (e.g., a protein).

Heterozygosity: This is the carrying of two different alleles of a specific SNP. The heterozygosity rate of an individual is the proportion of heterozygous genotypes. High levels of heterozygosity within an individual might be an indication of low sample quality whereas low levels of heterozygosity may be due to inbreeding.
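
As a rough illustration, the heterozygosity rate of one individual can be computed as below. The 0/1/2 genotype coding (count of one allele) with `None` for missing calls is an assumption made for this sketch, not something the notes specify.

```python
def heterozygosity_rate(genotypes):
    """Proportion of heterozygous calls among the non-missing genotypes.

    genotypes: list with 0 or 2 for homozygotes, 1 for heterozygotes,
    and None for a missing call (assumed coding).
    """
    called = [g for g in genotypes if g is not None]
    if not called:
        raise ValueError("no non-missing genotypes for this individual")
    return sum(1 for g in called if g == 1) / len(called)


# Two heterozygous calls out of four non-missing genotypes.
print(heterozygosity_rate([0, 1, 2, 1, None]))  # 0.5
```

In quality control, individuals whose rate lies far from the sample mean are the ones typically inspected; the cutoff used is a study-specific choice.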

Individual‐level missingness: This is the number of SNPs that are missing for a specific individual. High levels of missingness can be an indication of poor DNA quality or technical problems.

Linkage disequilibrium (LD): This is a measure of non‐random association between alleles at different loci on the same chromosome in a given population. SNPs are in LD when the frequency of association of their alleles is higher than expected under random assortment. LD concerns patterns of correlations between SNPs.
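
One common way to quantify LD between two biallelic SNPs is the coefficient D and the squared correlation r². The sketch below assumes phased haplotypes coded 0/1 at each locus; the function name and the toy data are hypothetical.

```python
def ld_stats(haplotypes):
    """Return (D, r^2) for two biallelic loci.

    haplotypes: list of (a, b) tuples, where a and b are 0/1 allele
    codes at locus 1 and locus 2 on the same chromosome copy.
    """
    n = len(haplotypes)
    p_a = sum(a for a, _ in haplotypes) / n           # freq. of allele 1 at locus 1
    p_b = sum(b for _, b in haplotypes) / n           # freq. of allele 1 at locus 2
    p_ab = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n
    d = p_ab - p_a * p_b                              # deviation from independence
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("r^2 is undefined when a locus is monomorphic")
    return d, d * d / denom


# Toy sample of 10 haplotypes in which the two loci mostly travel together.
haps = [(1, 1)] * 4 + [(0, 0)] * 4 + [(1, 0), (0, 1)]
d, r2 = ld_stats(haps)
print(round(d, 4), round(r2, 4))  # 0.15 0.36
```

Under random assortment D would be close to 0; here the co-occurring alleles push r² to 0.36.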

Minor allele frequency (MAF): This is the frequency of the least often occurring allele at a specific location. Most studies are underpowered to detect associations with SNPs with a low MAF and therefore exclude these SNPs.
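
A small sketch of how MAF can be derived from genotype calls, again assuming 0/1/2 coding (copies of one designated allele) and `None` for missing values. The 0.05 cutoff in the usage line is only an example; the appropriate threshold depends on sample size.

```python
def minor_allele_frequency(genotypes):
    """MAF from 0/1/2 genotype codes; None marks a missing call."""
    called = [g for g in genotypes if g is not None]
    if not called:
        raise ValueError("no non-missing genotypes for this SNP")
    # Each individual carries two alleles, hence the factor 2.
    freq = sum(called) / (2 * len(called))
    # The counted allele may be the major one, so take the smaller frequency.
    return min(freq, 1 - freq)


maf = minor_allele_frequency([0, 0, 1, 2, None])
print(maf)           # 0.375
print(maf >= 0.05)   # True: this SNP would pass an example MAF filter of 0.05
```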

Population stratification: This is the presence of multiple subpopulations (e.g., individuals with different ethnic backgrounds) in a study. Because allele frequencies can differ between subpopulations, population stratification can lead to false positive associations and/or mask true associations. An excellent example of this is the chopstick gene, where a SNP, due to population stratification, accounted for nearly half of the variance in the capacity to eat with chopsticks (Hamer & Sirota, 2000).

Pruning: This is a method to select a subset of markers that are in approximate linkage equilibrium. In PLINK, this method uses the strength of LD between SNPs within a specific window (region) of the chromosome and selects only SNPs that are approximately uncorrelated, based on a user‐specified threshold of LD. In contrast to clumping, pruning does not take the p value of a SNP into account.
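
The sketch below shows the general idea of LD-based pruning in simplified form. It is not a reimplementation of PLINK's `--indep-pairwise` procedure: the window is counted in SNP positions, SNPs are visited in chromosomal order, and the names and defaults are hypothetical. Note that no p values are involved.

```python
def prune(snps, r2, window=50, r2_threshold=0.2):
    """Simplified LD pruning.

    snps: SNP ids in chromosomal order.
    r2: dict mapping frozenset({snp_a, snp_b}) -> pairwise r^2
        (absent pairs are treated as uncorrelated).
    A SNP is kept only if its r^2 with every already-kept SNP inside
    the window is at or below the threshold.
    """
    kept = []  # list of (position index, snp id)
    for i, snp in enumerate(snps):
        nearby = [k for j, k in kept if i - j < window]
        if all(r2.get(frozenset((snp, k)), 0.0) <= r2_threshold for k in nearby):
            kept.append((i, snp))
    return [k for _, k in kept]


# Hypothetical data: rs2 is strongly correlated with rs1 and is removed.
snps = ["rs1", "rs2", "rs3", "rs4"]
r2 = {
    frozenset(("rs1", "rs2")): 0.9,
    frozenset(("rs2", "rs3")): 0.6,
    frozenset(("rs3", "rs4")): 0.1,
}
print(prune(snps, r2))  # ['rs1', 'rs3', 'rs4']
```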

Relatedness: This indicates how strongly a pair of individuals is genetically related. A conventional GWAS assumes that all subjects are unrelated (i.e., no pair of individuals is more closely related than second‐degree relatives). Without appropriate correction, the inclusion of relatives could lead to biased estimations of standard errors of SNP effect sizes. Note that specific tools for analysing family data have been developed.

Sex discrepancy: This is the difference between the assigned sex and the sex determined based on the genotype. A discrepancy likely points to sample mix‐ups in the lab. Note, this test can only be conducted when SNPs on the sex chromosomes (X and Y) have been assessed.

Single nucleotide polymorphism (SNP): This is a variation in a single nucleotide (i.e., A, C, G, or T) that occurs at a specific position in the genome. A SNP usually exists as two different forms (e.g., A vs. T). These different forms are called alleles. A SNP with two alleles has three different genotypes (e.g., AA, AT, and TT).

SNP‐heritability: This is the fraction of phenotypic variance of a trait explained by all SNPs in the analysis.

SNP‐level missingness: This is the number of individuals in the sample for whom information on a specific SNP is missing. SNPs with a high level of missingness can potentially lead to bias.
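
Individual-level and SNP-level missingness are the row-wise and column-wise views of the same genotype matrix. The sketch below reports both as proportions rather than raw counts, which is an assumption made here so that values are comparable across data sets; `None` marks a missing call.

```python
def missingness(genotype_matrix):
    """Return (per_individual, per_snp) missing-call proportions.

    genotype_matrix: list of rows, one row per individual and one
    column per SNP; None marks a missing genotype.
    """
    n_ind = len(genotype_matrix)
    n_snp = len(genotype_matrix[0])
    per_individual = [
        sum(g is None for g in row) / n_snp for row in genotype_matrix
    ]
    per_snp = [
        sum(row[j] is None for row in genotype_matrix) / n_ind
        for j in range(n_snp)
    ]
    return per_individual, per_snp


# Toy matrix: 4 individuals (rows) by 4 SNPs (columns).
matrix = [
    [0, 1, None, 2],
    [1, None, None, 0],
    [2, 0, 1, 1],
    [0, 0, None, 1],
]
per_individual, per_snp = missingness(matrix)
print(per_individual)  # [0.25, 0.5, 0.0, 0.25]
print(per_snp)         # [0.0, 0.25, 0.75, 0.0]
```

Here the third SNP (75% missing) and the second individual (50% missing) would be the obvious candidates for removal under any reasonable threshold.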

Summary statistics: These are the results obtained after conducting a GWAS, including information on chromosome number, position of the SNP, SNP(rs)‐identifier, MAF, effect size (odds ratio/beta), standard error, and p value. Summary statistics of GWAS are often freely accessible or shared between researchers.

The Hardy–Weinberg (dis)equilibrium (HWE) law: This concerns the relation between the allele and genotype frequencies. It assumes an infinitely large population, with no selection, mutation, or migration. The law states that the genotype and the allele frequencies are constant over generations. Under HWE, the expected genotype frequencies follow from the allele frequencies (e.g., if the frequency of allele A = 0.20 and the frequency of allele T = 0.80, the expected frequency of genotype AT is 2 × 0.2 × 0.8 = 0.32), and the observed frequencies should not differ significantly from these expectations. Violation of the HWE law indicates that genotype frequencies are significantly different from expectations. In GWAS, it is generally assumed that deviations from HWE are the result of genotyping errors. The HWE thresholds in cases are often less stringent than those in controls, as the violation of the HWE law in cases can be indicative of true genetic association with disease risk.
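
The worked example in this entry can be checked with a few lines of code. The sketch computes expected genotype frequencies from an allele frequency and a simple chi-square goodness-of-fit statistic from observed genotype counts. The choice of a chi-square statistic is an assumption for illustration (exact tests are also widely used), and the genotype counts are made up.

```python
def hwe_expected(p):
    """Expected genotype frequencies (AA, AT, TT) given freq(A) = p."""
    q = 1 - p
    return p * p, 2 * p * q, q * q


def hwe_chi_square(n_aa, n_at, n_tt):
    """Chi-square statistic (1 degree of freedom) for deviation from HWE.

    Assumes both alleles are observed, so no expected count is zero.
    """
    n = n_aa + n_at + n_tt
    p = (2 * n_aa + n_at) / (2 * n)          # observed frequency of allele A
    expected = [f * n for f in hwe_expected(p)]
    observed = [n_aa, n_at, n_tt]
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


print(round(hwe_expected(0.2)[1], 2))          # 0.32, the AT frequency from the text
print(round(hwe_chi_square(4, 32, 64), 4))     # 0.0: counts match HWE exactly
print(round(hwe_chi_square(10, 20, 70), 4))    # 14.0625: too few heterozygotes
```

A statistic of 14.06 is far above 3.84, the 0.05 critical value for one degree of freedom, so the second set of counts would be flagged as deviating from HWE.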


References:
https://sci-hub.st/10.1146/annurev-biodatasci-030320-041026


https://somepomed.org/articulos/contents/mobipreview.htm?14/2/14377

(1, 2) Well-characterized cases and controls of similar ancestry are genotyped using oligonucleotide microarrays, typically consisting of assays for 500,000 to 1,000,000 SNPs. A representative Cartesian plot of results for one such assay is presented. Individuals homozygous for the major allele are plotted in blue, homozygotes for the minor allele in red, and heterozygotes in green.
(3) Next, extensive quality control analysis is conducted, including assessments at the level of the individual samples (microarray quality, subject genotype completion rates, gender inconsistencies) and individual assays (assay quality scores, genotype completion rates, Hardy–Weinberg equilibrium, parent-child genotype inconsistencies). Screens for cryptic sample relatedness are performed using Identity By State statistics. Data are then filtered, with removal of results for problematic subjects or assays.
(4) Principal components analysis (shown) enables screening for evidence of population stratification. If detected, outlier subjects can be removed from consideration, and the eigenvectors generated can be used as covariates in the subsequent association testing.
(5) Using reference genotype panels, additional SNP genotypes for untyped variants (grey SNPs) can be imputed from the typed variants (in black) and considered for association testing.
(6) Statistical inference is performed by testing individual variants for association by logistic regression for case-control studies or linear regression for quantitative traits, with adjustment for relevant covariates, including eigenvectors. To accommodate the large number of hypotheses tested, p-values must be adjusted, with more stringent p-value cutoffs designated in advance.
(7) Results can be visualized using so-called "Manhattan Plots," where the individual -log10 p-values are plotted as a function of physical position.
(8) Significant associations can be passed for additional validation, including replication studies in independent cohorts, sequencing and fine-mapping studies to identify causal variants, and experimental validation to confirm biological relevance and function. Additional modeling, including gene-by-gene interaction and gene-by-environment modeling, can also be conducted.
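
Steps (6) and (7) can be sketched as follows. The Bonferroni correction is used here as one common way to set the stricter cutoff (an assumption, since the text does not name a method), and the association results are invented toy values. The function only prepares the values a Manhattan plot would display; the plotting itself is omitted.

```python
import math


def bonferroni_threshold(alpha, n_tests):
    """Per-test significance cutoff that keeps the family-wise error rate at alpha."""
    return alpha / n_tests


def manhattan_points(results, threshold):
    """Prepare Manhattan-plot data.

    results: list of (chromosome, position, p_value) tuples.
    Returns (chromosome, position, -log10(p), is_significant) tuples
    ordered by chromosome and physical position.
    """
    ordered = sorted(results, key=lambda r: (r[0], r[1]))
    return [
        (chrom, pos, -math.log10(p), p < threshold)
        for chrom, pos, p in ordered
    ]


# One million tests at alpha = 0.05 gives the conventional 5e-8 cutoff.
threshold = bonferroni_threshold(0.05, 1_000_000)

# Hypothetical association results: (chromosome, base-pair position, p value).
results = [(2, 5_000, 1e-3), (1, 12_000, 1e-8), (1, 3_000, 0.2)]
for chrom, pos, neg_log_p, significant in manhattan_points(results, threshold):
    print(chrom, pos, round(neg_log_p, 2), significant)
```

Only the variant with p = 1e-8 falls below the cutoff; it would appear as the single peak above the significance line.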


Differences between linkage analysis and GWAS:
https://www.futurelearn.com/info/courses/translational-research/0/steps/14199
Genetic linkage and association analyses are the major tools to identify the genetic basis of diseases or traits. The primary difference between these two approaches is that linkage analysis looks at the relation between the transmission of a locus and the disease/trait within families, whereas association analysis focuses on the relation between a specific allele and the disease/trait within a population.


https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6099125/ (specific positions of the loci)


https://www.nature.com/articles/s43586-021-00056-9
Genome-wide association studies
A review in Nature Reviews Methods Primers covering the current state of GWAS as a whole.