Some 10x single-cell concepts

anneng

https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/glossary#gem_well
Single Cell Gene Expression
GEM well (formerly GEM group): A set of partitioned cells (Gel Beads-in-Emulsion) from a single 10x Chromium Chip channel. One or more sequencing libraries can be derived from a GEM well.

Library (or Sequencing Library): A 10x-barcoded sequencing library prepared from a single GEM well. With Feature Barcode or V(D)J assays, it is possible to create multiple libraries from the same GEM well. The library types may include Gene Expression, Antibody Capture, CRISPR Guide Capture, TCR-enrichment, etc.

LT (or Low Throughput): The Chromium Next GEM Single Cell 3’ LT v3.1 kit is a low throughput, cost-effective solution for smaller-scale and pilot studies for profiling whole transcriptome at the single cell level for 100 - 1,000 cells per sample. In combination with Feature Barcode technology (Antibody Capture), the assay also enables simultaneous cell surface protein detection in single cells.

Sample: A cell suspension extracted from a single biological source (blood, tissue, etc).

Sequencing Run (or Flowcell): A flowcell containing data from one sequencing instrument run. The sequencing data can be further demultiplexed by lane or by sample indices.

Feature Barcode Technology
Feature Barcode Antibody (or Antibody): Refers to a Feature Barcode reagent consisting of an antibody with high affinity to a known Cell Surface Protein coupled to a Feature Barcode oligonucleotide that identifies the antibody. These reagents are used to quantify the expression of cell surface proteins. For example, the TotalSeq-B product line is a family of Feature Barcode antibodies that are compatible with the Single Cell 3' v3 solution.

Cell Surface Protein: a protein that is localized to the cell membrane, typically containing extracellular domains. These proteins can be quantified with Feature Barcodes such as TotalSeq antibody-oligonucleotide conjugates.

CRISPRa (or CRISPR activation): Similar to CRISPRi, but uses a Cas9 fused to an activating domain to promote expression of the target gene instead of repressing it.

CRISPR Guide RNA: See sgRNA

CRISPRi (or CRISPR Interference): A method for measuring the impact of perturbations to gene expression levels. sgRNAs with protospacers targeting a gene of interest are used with a non-cutting Cas9 that is fused to a repressive domain. This represses the expression of the selected gene.

Count Matrix (or Feature-Barcode Matrix): Formerly known as the Gene-Barcode Matrix. A matrix of counts representing the number of unique observations of each feature within each cell barcode. Genes defined by the transcriptome reference and Feature Barcodes defined in the Feature Reference appear as rows in the matrix. Each barcode is a column of the matrix.

CROP-Seq: An assay scheme for pooled CRISPRi and CRISPRa experiments with single-cell Gene Expression readout. See Datlinger et al., Nature Methods 2017

Dextramer: Refers to a Feature Barcode reagent consisting of multiple copies of a peptide-MHC (p-MHC) complex conjugated to a Dextran backbone, coupled to a DNA oligonucleotide carrying a Feature Barcode that identifies the peptide-MHC complex. The p-MHC complex is the antigen of a T-Cell Receptor. Dextramers compatible with 10x Genomics Feature Barcode Technology are supplied by Immudex.

Feature: A unique type of countable molecule. Can refer to a gene, a barcoded antibody, a CRISPR Guide RNA or another barcoded reagent. Each feature is either a gene declared in the transcriptome reference or a feature barcode declared in the feature reference file. Corresponds to a row in the Count Matrix.

Feature Barcode: The subsequence of a Feature Barcode read that uniquely identifies the Feature Barcode reagent.

Feature Reference: A CSV file declaring the name, read layout, and barcode sequence of all the Feature Barcode reagents in use in an experiment. A Feature Reference CSV must be provided to cellranger count when using Feature Barcode Technology. See the Feature Reference Documentation for details.

Guide RNA (or sgRNA, or Single Guide RNA): The Guide RNA, along with a Cas9 enzyme, forms the CRISPR system. The protospacer region of the Guide RNA recognizes a particular sequence in the genome.

p-MHC (or Peptide-MHC): An antigen-presenting MHC gene, bound to a displayed peptide. These complexes are recognized by T-Cell receptors in the adaptive immune system. Dextramers are Feature Barcode capable p-MHC reagent technology.

Perturb-Seq: The original demonstration of a pooled CRISPRi assay with a single-cell Gene Expression readout, using barcodes to identify which CRISPRi perturbations were present in each cell. See Dixit et al., Cell 2016.
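
As a rough illustration of the Count Matrix (Feature-Barcode Matrix) defined in this section, the sketch below builds a small matrix with features as rows and cell barcodes as columns. The function name, the input tuple layout, and the toy barcodes and features are all assumptions made for illustration; this is not how cellranger count is implemented.

```python
def build_count_matrix(observations):
    """Build a feature-by-barcode count matrix.

    observations: iterable of (cell_barcode, feature, umi) tuples
    (a hypothetical input layout chosen for this sketch).
    Returns (features, barcodes, matrix); rows are features, columns are barcodes.
    """
    # Count each (barcode, feature, UMI) combination once: "unique observations".
    unique = set(observations)
    features = sorted({feature for _, feature, _ in unique})
    barcodes = sorted({barcode for barcode, _, _ in unique})
    row = {feature: i for i, feature in enumerate(features)}
    col = {barcode: j for j, barcode in enumerate(barcodes)}
    matrix = [[0] * len(barcodes) for _ in features]
    for barcode, feature, _ in unique:
        matrix[row[feature]][col[barcode]] += 1
    return features, barcodes, matrix


# Toy data: genes and a Feature Barcode antibody share the same matrix.
obs = [
    ("AAAC", "GeneA", "UMI1"),
    ("AAAC", "GeneA", "UMI1"),          # duplicate observation, counted once
    ("AAAC", "GeneA", "UMI2"),
    ("AAAC", "CD3_antibody", "UMI3"),
    ("TTTG", "GeneA", "UMI1"),
]
features, barcodes, matrix = build_count_matrix(obs)
for feature, counts in zip(features, matrix):
    print(feature, counts)
```

Real matrices are very sparse, so in practice they are stored in a sparse format rather than as nested lists.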

Targeted Gene Expression
Target Panel CSV file: A CSV file declaring the target gene panel used for a Targeted Gene Expression experiment, which specifies detailed information about the targeted genes and baits included in the panel. This file must be provided to cellranger count via the --target-panel option when analyzing Targeted Gene Expression data. Details and specifications of the Target Panel CSV file format are given in the 10x Genomics documentation.

Baits: A set of oligonucleotide sequences designed to specifically hybridize to and recover molecules from targeted genes during the hybridization-capture enrichment step of the Targeted Gene Expression assay.

WTA (or Whole Transcriptome Analysis): Describes a (non-targeted) Chromium single-cell RNA-seq library or dataset, which may be used as input for a Targeted Gene Expression experiment.

Parent: A WTA single-cell gene expression library or dataset that was used as the input material for a Targeted Gene Expression experiment. Because the Parent and its corresponding Targeted dataset are both derived from the same Library and/or GEM well, it is possible to directly compare results from these matched datasets in a pairwise manner, as enabled by cellranger targeted-compare.

Cell Multiplexing
Cell Multiplexing: the labeling of a given cell (or nuclei) sample with a molecular tag and subsequently mixing this sample with other labeled samples. Introduced in Cell Ranger 6.0.

Cell Multiplexing Oligo (CMO): a specific type of feature barcode used to tag cells prior to pooling in a single GEM well.

Multiplet: A cell-associated barcode containing more than one cell. Multiplets that are assigned more than 1 CMO are detected and filtered out.

Physical Library: A sequencing library produced from a single GEM well.

Singlet: a cell-associated barcode assigned exactly one CMO. Only these are assigned to samples.
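
The Singlet and Multiplet definitions above can be expressed as a small rule on the number of distinct CMOs assigned to a barcode. This is only a sketch of the definitions: the function name and the "unassigned" label are assumptions, and deciding which CMOs are assigned to a barcode in the first place is a separate step that is not shown here.

```python
def classify_barcode(assigned_cmos):
    """Classify a cell-associated barcode by the CMOs assigned to it.

    assigned_cmos: collection of CMO identifiers already assigned to the barcode.
    """
    n = len(set(assigned_cmos))
    if n == 1:
        return "singlet"      # exactly one CMO: can be assigned to a sample
    if n > 1:
        return "multiplet"    # more than one CMO: detected and filtered out
    return "unassigned"       # hypothetical label for barcodes with no CMO


# Illustrative CMO identifiers.
calls = {
    "AAAC": classify_barcode(["CMO301"]),
    "CCCG": classify_barcode(["CMO301", "CMO305"]),
    "TTTG": classify_barcode([]),
}
print(calls)
```

Note that a droplet containing two cells from the same sample carries a single CMO, so a rule like this cannot flag it.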

The software offering is organized into cloud, secondary analysis, and tertiary analysis.

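10x also supports processing multiple samples in a single well. The cells of each sample are labeled in advance with Cell Multiplexing Oligos (CMOs) so they can be told apart. There are 12 CMOs, so up to 12 samples can be loaded into one well.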

CMOs contain three distinct molecular regions:
• The reverse complement of Capture Sequence 2, enabling direct capture and priming within GEMs.
• A 15 nt Feature Barcode that provides a unique identifier for a given sample.
• Illumina Nextera Read 2 (Read 2N; read 2 sequencing primer) sequence that provides a PCR handle enabling specific amplification of the CMOs.

CG000383_TechNote_ChromiumNextGEMSingle_Cell_3___v3.1_CellMultiplexing_Rev_A.pdf
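In addition to the sequence that later captures RNA, the gel beads carry two capture sequences, and these can capture the CMOs attached to the cells.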

https://www.jieandze1314.com/post/cnposts/scrna-6/
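How it works, step 1: designing the DNA primers on the beads

Gel beads are manufactured in advance, and each gel bead is "seeded" with specific DNA oligos. Each oligo sequence consists of several segments: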

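The first segment is the barcode: 16 bp long, with about 3.5 million distinct barcodes. Each bead carries one barcode, so this many barcodes is enough to tell the gel beads apart. => the ID of each gel bead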

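Any two barcodes differ by at least two bases. Sequencing occasionally misreads a base, and this design keeps two barcodes from being confused with each other (consider two barcodes that differ by only one base: a single misread at that one position, out of 16, would turn one barcode into the other).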
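
The "at least two bases apart" property can be checked with a pairwise Hamming distance. The sketch below is a minimal illustration; the function names and the three example barcodes are made up, and a brute-force pairwise comparison like this would be far too slow for millions of barcodes.

```python
from itertools import combinations


def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("barcodes must have the same length")
    return sum(x != y for x, y in zip(a, b))


def min_pairwise_distance(barcodes):
    """Smallest Hamming distance between any two barcodes in the list."""
    return min(hamming(a, b) for a, b in combinations(barcodes, 2))


# Made-up 16 bp barcodes for illustration.
example = [
    "AAAACCCCGGGGTTTT",
    "AAAACCCCGGGGTTAA",
    "TTTTCCCCGGGGTTTT",
]
print(min_pairwise_distance(example))
```

If the minimum distance is at least 2, no single misread base can turn one valid barcode into another valid barcode.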

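The second segment is the UMI (unique molecular identifier), a random sequence, so each DNA oligo has its own UMI. With a UMI length of 10 bp there are 4^10 = 1,048,576 possible sequences, just over one million. Its purpose is to let reads be traced back to the original cDNA molecule after PCR and deep sequencing. => the ID of each DNA tag molecule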

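It addresses the following situation: a single gene fragment yields multiple reads after PCR amplification, but without a tag we cannot tell that they came from the same molecule. PCR efficiency can also differ between genes, so one gene may end up with more reads than another simply because it amplified more efficiently, even though the true expression levels of the two genes are similar. Counting UMIs instead of reads removes this "PCR bias".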
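
A minimal sketch of the idea, assuming reads have already been assigned to a cell barcode and a gene: reads that share the same barcode, gene, and UMI are treated as PCR copies of one molecule. The function name and the toy reads are illustrative, and real pipelines additionally correct sequencing errors within UMIs, which is not shown.

```python
from collections import defaultdict


def count_reads_and_molecules(reads):
    """reads: iterable of (cell_barcode, umi, gene) tuples.

    Returns two dicts keyed by (cell_barcode, gene):
    raw read counts and UMI-deduplicated molecule counts.
    """
    read_counts = defaultdict(int)
    umis_seen = defaultdict(set)
    for barcode, umi, gene in reads:
        read_counts[(barcode, gene)] += 1
        umis_seen[(barcode, gene)].add(umi)
    molecule_counts = {key: len(umis) for key, umis in umis_seen.items()}
    return dict(read_counts), molecule_counts


# GeneA amplified more efficiently than GeneB, but both started from 2 molecules.
reads = (
    [("CELL1", "AACCGGTTAA", "GeneA")] * 6
    + [("CELL1", "TTGGCCAATT", "GeneA")] * 2
    + [("CELL1", "ACGTACGTAC", "GeneB")]
    + [("CELL1", "GGGGAAAACC", "GeneB")]
)
read_counts, molecule_counts = count_reads_and_molecules(reads)
print(read_counts)
print(molecule_counts)
```

By read count GeneA looks four times as abundant as GeneB, while the UMI counts show the same number of original molecules for both.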

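The third segment is the poly(dT) sequence. It binds the poly(A) tail of mRNA and serves as the primer for reverse transcription, from which the cDNA is synthesized.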

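How it works, step 2: the fluidic channels of the chip

At the first cross-junction the cell suspension is mixed with the gel beads. At the second junction oil is added, which wraps the gel bead plus cell suspension => water-in-oil droplets => together these droplets form an emulsion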

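Within the emulsion, some droplets contain a single cell (circled in red in the original figure), some contain no cell, and some contain two or more cells (these are called "doublets"). The number of cells per droplet follows a Poisson distribution.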
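
To make the Poisson statement concrete, the sketch below computes the expected fractions of droplets with zero, one, and more than one cell for a given mean number of cells per droplet. The mean of 0.1 is an arbitrary illustrative value, not a 10x specification, and the function names are made up.

```python
import math


def poisson_pmf(k, lam):
    """Probability of exactly k cells in a droplet when the mean is lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


def droplet_fractions(lam):
    """Fractions of droplets that are empty, hold one cell, or hold several."""
    p_empty = poisson_pmf(0, lam)
    p_single = poisson_pmf(1, lam)
    return {
        "empty": p_empty,
        "single": p_single,
        "multiple": 1.0 - p_empty - p_single,
    }


fractions = droplet_fractions(0.1)  # assumed mean of 0.1 cells per droplet
for name, value in fractions.items():
    print(f"{name}: {value:.4f}")
```

Keeping the mean low makes droplets with several cells rare, at the cost of leaving most droplets empty.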

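Most cells end up in a droplet (about 65% of the cells in the suspension are successfully encapsulated in a droplet that contains a bead => this is the cell capture efficiency, ~65%). The reads used in downstream analysis come from these cells.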

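How it works, step 3: building the sequencing library
Once the emulsion has formed, the cells are lysed so that their mRNA is released =>

The released mRNA mixes with the aqueous phase of the droplet, which contains the oligo primers attached to the gel bead, reverse transcriptase, and dNTP substrates, and reverse transcription takes place =>

The poly(A) tail of the mRNA pairs with the poly(dT) on the gel bead, so the mRNA binds the tagged DNA oligos on the bead, and reverse transcriptase then synthesizes cDNA =>

The resulting cDNA carries the bead-specific barcode, and each cDNA molecule carries its own UMI. With these two tags, a given cDNA can be distinguished from every other cDNA =>

Next, the aqueous phase is recovered from the whole emulsion, i.e. the tagged cDNA molecules are pulled out =>

Adapters are added to the cDNA, which is then PCR-amplified to give an Illumina library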
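
Because every cDNA carries the bead barcode followed by a UMI, the two tags can be recovered from a sequenced read by position. The sketch below assumes the read begins with the 16 bp barcode followed by the 10 bp UMI, matching the segment lengths described in step 1; the function name and the example sequence are made up.

```python
BARCODE_LEN = 16  # barcode length given in step 1
UMI_LEN = 10      # UMI length given in step 1


def split_tags(read_seq):
    """Return (cell_barcode, umi) taken from the start of a read."""
    if len(read_seq) < BARCODE_LEN + UMI_LEN:
        raise ValueError("read is too short to contain barcode and UMI")
    barcode = read_seq[:BARCODE_LEN]
    umi = read_seq[BARCODE_LEN:BARCODE_LEN + UMI_LEN]
    return barcode, umi


# Made-up read: barcode + UMI + the start of the poly(dT) stretch.
read1 = "AAACCCAAGAAACACT" + "ACGTACGTAC" + "TTTTTTTTTT"
barcode, umi = split_tags(read1)
print(barcode, umi)
```

Grouping reads by the extracted barcode gives the per-cell split, and grouping by UMI within a cell and gene gives the per-molecule counts.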

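Data composition
A sample usually contains only a few hundred to a few thousand cells, while there are more than 3 million barcodes, so it is rare for one barcode to correspond to two cells. The data can therefore be split by barcode, tracing sequencing reads back to individual cells.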

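It is still possible for one barcode to correspond to two or more cells. Splitting by barcode then merges the reads of those cells into a single "pool". To reduce the number of pools, the starting number of cells has to be controlled when the cell suspension is prepared.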

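So more cells is not always better. The fewer cells loaded, the fewer mixed pools in the end, which is also consistent with the Poisson distribution. As a rule of thumb, a sample suspension should contain fewer than 10,000 cells.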
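
A back-of-the-envelope sketch of why fewer cells means fewer pools: if each cell is assumed to receive a barcode drawn independently and uniformly from the barcode set (a simplification made only for this estimate), the expected number of cell pairs sharing a barcode is n(n-1)/(2B) for n cells and B barcodes. The function name is made up, and B = 3,500,000 is taken from the barcode count quoted in step 1.

```python
def expected_colliding_pairs(n_cells, n_barcodes):
    """Expected number of cell pairs sharing a barcode under uniform random assignment."""
    return n_cells * (n_cells - 1) / (2 * n_barcodes)


N_BARCODES = 3_500_000  # approximate barcode count quoted in step 1

for n_cells in (500, 1_000, 5_000, 10_000):
    pairs = expected_colliding_pairs(n_cells, N_BARCODES)
    print(f"{n_cells:>6} cells: {pairs:.3f} expected shared-barcode pairs")
```

The estimate grows roughly with the square of the cell number, so doubling the number of cells about quadruples the expected number of pools.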

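After collapsing reads by UMI, one can examine the relationship between reads per cell and the number of genes detected, for example in a plot with reads per cell on the x-axis and genes detected on the y-axis: the more reads, the more genes are detected. Typically, once a cell has about 300,000 reads, the number of genes grows more slowly as reads increase => the gene count reaches a "plateau"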

A cell typically yields 40,000 to 80,000 valid UMIs, averaging about 10 UMIs per gene per cell.

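The expression level of one gene in one cell is one dimension describing that cell, so the several thousand genes measured make up several thousand dimensions. When thousands of cells are analyzed together, dimensionality reduction and clustering place them in a three-dimensional space, and coloring the points reveals how the cells are distributed (shown in a figure in the original post).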