CyTOF Data Preprocessing

For CyTOF datasets, we often need to preprocess the expression matrix before analysis. We currently support the following preprocessing steps:

  • Arcsinh transformation

  • Gating:
    • Debris Removal

    • Find intact cells

    • Find live cells

    • Center, offset, and residual outlier removal

  • Bead normalization

Note

Currently, we do not support any built-in cross-batch normalization.

We break this tutorial into two sections. The first focuses on our built-in benchmark datasets, which require little preprocessing. The second covers the full API and the details of each step.


Benchmark Datasets

The only preprocessing step for all three benchmark datasets (levine13, levine32, and samusik) is the arcsinh transformation. Typically, we use a cofactor of 5 for this transformation. You can let the DataLoader.load_dataset method perform the transformation automatically upon loading a dataset:

>>> from PyCytoData import DataLoader
>>> exprs = DataLoader.load_dataset(dataset="levine13", preprocess=True)

This will automatically apply the arcsinh transformation to the expression matrix. You can use the data for downstream analyses if needed. The transformation is applied to the lineage channels only.

If you wish to preprocess later, you can call the preprocess method separately as well:

>>> from PyCytoData import DataLoader
>>> exprs = DataLoader.load_dataset(dataset="levine13", preprocess=False)
>>> exprs.preprocess(arcsinh=True, cofactor=5)

This will have the same effect as the previous snippet, but you have the choice of applying the transformation whenever you wish or with different cofactors.


Full Preprocessing API

To access the API, call the preprocess method on your PyCytoData object:

>>> exprs.preprocess(arcsinh=True,
...                  gate_debris_removal=True,
...                  gate_intact_cells=True,
...                  gate_live_cells=True,
...                  gate_center_offset_residual=True,
...                  bead_normalization=True)
Running Arcsinh transformation...
Running debris remvoal...
Running gating intact cells...
Running gating live cells...
Running gating Center, Offset, and Residual...
Running bead normalization...

Each step can be optionally included. By default, the method automatically attempts to resolve the necessary instrument channels. However, in case this step fails, some manual intervention is needed.

Channel Names

The preprocessing pipeline requires some specific channels to work properly. If none are specified, the preprocess method tries to guess how such channels are named. If auto_channels is set to False, then specifying instrument channels is required for gating and bead normalization; otherwise, any channels you specify override the automatically detected ones.

Below, we have a list of channels that users may wish to specify for preprocessing:

  • lineage_channels: These are protein marker channels used for analyses and arcsinh transformation (e.g. CD4).

  • bead_channels: These are the channels for the beads added to track signal decay over time. Typically there are multiple bead channels (e.g. Bead1).

  • time_channel: The time of each event, typically named “Time”.

  • cor_channels: These consist of “Center”, “Offset”, and “Residual”.

  • dead_channel: Typically named “Dead” or “Live” to indicate cell status.

  • DNA_channels: Typically consisting of “DNA1” and “DNA2”.

We assume that these channels are present and conventionally named. If they’re not present, it can be an indication that some preprocessing has been done already. As for arcsinh transformation, we suggest users plot the distributions of each channel and verify whether transformation is needed.
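
The naming conventions used throughout this tutorial can be sketched as a simple matching pass over the channel names. The helper below, guess_channels, is hypothetical and is not part of PyCytoData. It only illustrates the kind of rules this tutorial states (bead channels starting with “bead”, DNA channels containing “DNA”, a dead channel containing “dead”); the case-insensitive matching for every group and the fallback to lineage channels are assumptions, and the actual auto_channels logic may differ.

```python
def guess_channels(channels):
    """Group channel names by conventional CyTOF naming.

    Hypothetical helper for illustration; not PyCytoData's implementation.
    """
    groups = {"bead": [], "DNA": [], "dead": [], "time": [], "cor": [], "lineage": []}
    for name in channels:
        lower = name.lower()  # assumption: all matching is case insensitive
        if lower.startswith("bead"):
            groups["bead"].append(name)
        elif "dna" in lower:
            groups["DNA"].append(name)
        elif "dead" in lower:
            groups["dead"].append(name)
        elif lower == "time":
            groups["time"].append(name)
        elif lower in ("center", "offset", "residual"):
            groups["cor"].append(name)
        else:
            # Assumption: anything left over is treated as a lineage marker.
            groups["lineage"].append(name)
    return groups


channels = ["Time", "Bead1", "Bead2", "DNA1", "DNA2", "Dead",
            "Center", "Offset", "Residual", "CD4", "CD8"]
groups = guess_channels(channels)
print(groups["bead"])     # ['Bead1', 'Bead2']
print(groups["lineage"])  # ['CD4', 'CD8']
```

If a dataset uses other names (for example, a dead channel named “Live”), a pass like this would misfile them, which is why passing the channel lists explicitly is the safer option.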

Arcsinh Transformation

Arcsinh transformation is often used to transform the channels. For a raw expression value \(x\), the transformation is:

\[\operatorname{arcsinh}\left(\frac{x}{cofactor}\right)\]

By convention, \(cofactor=5\) is the typical choice. You can change it to suit your own data analyses.

As an example:

>>> exprs.preprocess(arcsinh=True,
...                  cofactor=2)
Running Arcsinh transformation...
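
To get a feel for what the cofactor does, the formula can be evaluated directly with Python's standard library. This is a standalone sketch with made-up intensity values, not a call into PyCytoData; arcsinh_transform is a hypothetical helper.

```python
import math


def arcsinh_transform(values, cofactor=5):
    """Apply arcsinh(x / cofactor) to each raw intensity (hypothetical helper)."""
    return [math.asinh(x / cofactor) for x in values]


raw = [0, 5, 50, 500]  # made-up raw intensities
for cofactor in (5, 2):
    transformed = arcsinh_transform(raw, cofactor=cofactor)
    print(cofactor, [round(t, 3) for t in transformed])
```

Near zero the transformation is approximately linear, while large values are compressed roughly logarithmically. A smaller cofactor moves that transition toward lower intensities.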

Gating: Debris Removal

This is the first gating step. We check each bead channel and remove any cells whose signal is more than three standard deviations above that channel's mean.

The bead_channels parameter must be given if auto_channels is disabled. Otherwise, the bead channel names must start with “bead” (case insensitive) for the method to detect them automatically.

>>> exprs.preprocess(gate_debris_removal=True,
...                  bead_channels = ["Bead1", "Bead2", "Bead3", "Bead4"])
Running debris remvoal...
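
The rule described above can be sketched in plain Python. This is an illustrative reading of the description rather than PyCytoData's actual code: the helper name remove_high_bead_events, the choice to drop an event if it exceeds the cutoff in any one bead channel, the use of the population standard deviation, and the toy data are all assumptions.

```python
import statistics


def remove_high_bead_events(events, bead_columns, n_sd=3.0):
    """Keep events whose bead signals are all at most mean + n_sd * SD.

    Hypothetical helper; an event is dropped if any bead channel exceeds its cutoff.
    """
    keep = [True] * len(events)
    for col in bead_columns:
        values = [row[col] for row in events]
        # Assumption: population standard deviation of the whole channel.
        cutoff = statistics.fmean(values) + n_sd * statistics.pstdev(values)
        for i, value in enumerate(values):
            if value > cutoff:
                keep[i] = False
    return [row for row, flag in zip(events, keep) if flag]


# Toy data: each row is one event, columns are [Bead1, Bead2, CD4].
events = [[1.0, 2.0, 3.0] for _ in range(19)]
events.append([50.0, 2.0, 3.0])  # one event with an extreme Bead1 signal

cleaned = remove_high_bead_events(events, bead_columns=[0, 1])
print(len(events), len(cleaned))  # 20 19
```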

Gating: Finding Intact Cells

This step uses the DNA channels to gate for intact cells. Specifically, it trims cells whose DNA signal lies more than cutoff_DNA_sd standard deviations above or below the channel mean. By default, it preserves cells within two standard deviations of the mean. The DNA channel names must contain “DNA” or be provided explicitly.

>>> exprs.preprocess(gate_intact_cells=True,
...                  DNA_channels = ["DNA1", "DNA2"],
...                  cutoff_DNA_sd = 2)
Running gating intact cells...
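
As a rough illustration of a two-sided standard deviation gate, the sketch below keeps only events that fall within the cutoff in every DNA channel. The gate_intact helper is hypothetical, the population standard deviation is an assumption, and the toy data are made up; PyCytoData's implementation may differ in these details.

```python
import statistics


def gate_intact(events, dna_columns, cutoff_sd=2.0):
    """Keep events within cutoff_sd standard deviations of each DNA channel mean.

    Hypothetical helper for illustration only.
    """
    bounds = []
    for col in dna_columns:
        values = [row[col] for row in events]
        mean = statistics.fmean(values)
        sd = statistics.pstdev(values)  # assumption: population SD
        bounds.append((col, mean - cutoff_sd * sd, mean + cutoff_sd * sd))
    return [row for row in events
            if all(low <= row[col] <= high for col, low, high in bounds)]


# Toy data: columns are [DNA1, DNA2, CD4].
events = [[5.0, 5.0, 1.0] for _ in range(18)]
events.append([0.0, 0.0, 1.0])    # unusually low DNA signal
events.append([10.0, 10.0, 1.0])  # unusually high DNA signal

intact = gate_intact(events, dna_columns=[0, 1], cutoff_sd=2.0)
print(len(events), len(intact))  # 20 18
```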

Gating: Finding Live Cells

This step uses the dead channel to gate for live cells. Specifically, it trims cells in the top quantile of the dead channel, as specified by dead_cutoff_quantile. By default, the top 3% of cells are trimmed. The dead channel name must contain “dead” or be provided explicitly.

>>> exprs.preprocess(gate_live_cells=True,
...                  dead_channel=["Dead"],
...                  dead_cutoff_quantile=0.03)
Running gating live cells...
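
Quantile-based trimming of this kind can be sketched in a few lines of plain Python. The quantile and gate_live helpers below are hypothetical, the linear-interpolation quantile convention is an assumption, and the toy data are made up.

```python
def quantile(values, q):
    """Linearly interpolated quantile of a list (assumed convention)."""
    ordered = sorted(values)
    position = q * (len(ordered) - 1)
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])


def gate_live(events, dead_column, cutoff_quantile=0.03):
    """Drop events in the top cutoff_quantile of the dead channel (hypothetical helper)."""
    values = [row[dead_column] for row in events]
    cutoff = quantile(values, 1 - cutoff_quantile)
    return [row for row in events if row[dead_column] <= cutoff]


# Toy data: columns are [Dead, CD4]; the dead signal runs from 0 to 99.
events = [[float(i), 1.0] for i in range(100)]
live = gate_live(events, dead_column=0, cutoff_quantile=0.03)
print(len(events), len(live))  # 100 97
```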

Gating: Center, Offset, and Residuals

This step gates cells using the Center, Offset, and Residual channels. Specifically, it trims cells in the top and bottom quantiles of the three channels, as specified by cor_cutoff_quantile. By default, the top and bottom 3% are trimmed. The channels must be named “Center”, “Offset”, and “Residual”, or be provided explicitly.

>>> exprs.preprocess(gate_center_offset_residual=True,
...                  cor_channels = ["Center", "Offset", "Residual"],
...                  cor_cutoff_quantile = 0.03)
Running gating Center, Offset, and Residual...
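
A two-sided version of quantile trimming across several channels might look like the following sketch. It relies on statistics.quantiles from the standard library and therefore takes the cutoff as a whole percentage; the gate_cor helper, its cutoff_percent parameter, and the toy data are hypothetical, and dropping an event that falls outside the cutoffs in any one channel is an assumption.

```python
import statistics


def gate_cor(events, cor_columns, cutoff_percent=3):
    """Keep events between the lower and upper percentile cutoffs in every column.

    Hypothetical helper; cutoff_percent must be a whole number from 1 to 49.
    """
    bounds = []
    for col in cor_columns:
        values = [row[col] for row in events]
        # 99 cut points; index k - 1 holds the k-th percentile.
        cuts = statistics.quantiles(values, n=100, method="inclusive")
        bounds.append((col, cuts[cutoff_percent - 1], cuts[100 - cutoff_percent - 1]))
    return [row for row in events
            if all(low <= row[col] <= high for col, low, high in bounds)]


# Toy data: columns are [Center, Offset, Residual]; each column is a
# different ordering of the made-up values 0..99.
events = [[float(i), float((i * 37) % 100), float((i * 61) % 100)]
          for i in range(100)]
gated = gate_cor(events, cor_columns=[0, 1, 2], cutoff_percent=3)
print(len(events), len(gated))
```

Because an event is dropped when it falls outside the cutoffs in any single channel, the overall fraction removed can be noticeably larger than the per-channel fraction.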

Bead Normalization

Our bead normalization algorithm uses the bead channels and the time channel to correct signal decay. We use a two-step process:

  1. We remove cells whose bead signals are in the bottom 5th percentile.

  2. We perform the transformation using the most correlated bead channels.

This algorithm was developed and implemented in-house. To perform bead normalization:

>>> exprs.preprocess(bead_normalization=True,
...                  bead_channels = ["Bead1", "Bead2", "Bead3", "Bead4"],
...                  time_channel = ["Time"])
Running bead normalization...