Module: PyCytoData.data

class PyCytoData.data.PyCytoData(expression_matrix: ArrayLike, channels: ArrayLike | None = None, cell_types: ArrayLike | None = None, sample_index: ArrayLike | None = None, lineage_channels: ArrayLike | None = None)[source]

Bases: object

The PyCytoData class for handling CyTOF data.

This is an all-purpose data class for handling CyTOF data. It is compatible with benchmark datasets downloaded with the DataLoader class as well as users’ own CyTOF datasets. It offers wide-ranging functionality, including preprocessing, dimension reduction (DR), and much more.

Parameters:

expression_matrix (ArrayLike) – The expression matrix for the CyTOF sample. Rows are cells and columns are channels.
channels (ArrayLike) – The names of the channels, defaults to None
cell_types (ArrayLike) – The cell types of the cells, defaults to None
sample_index (ArrayLike) – The indices or names indicating the sample of each cell. This allows multiple samples to be combined into one object, defaults to None
lineage_channels (ArrayLike) – The names of lineage channels, defaults to None

Raises:

exceptions.ExpressionMatrixDimensionError – The expression matrix is not or cannot be cast into a two dimensional array.
exceptions.DimensionMismatchError – The number of channel names does not agree with the number of columns of the expression matrix.
exceptions.DimensionMismatchError – The number of cell types for all cells does not agree with the number of rows of the expression matrix.
exceptions.DimensionMismatchError – The number of sample indices does not agree with the number of rows of the expression matrix.

Additional Attributes:

reductions: A reductions object for dimension reduction using CytofDR.
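
The dimension checks listed under Raises can be pictured with a small standard-library sketch. This is not the library's implementation: check_dimensions is a hypothetical helper, nested lists stand in for arrays, and ValueError stands in for exceptions.DimensionMismatchError.

```python
from typing import Optional, Sequence


def check_dimensions(
    expression_matrix: Sequence[Sequence[float]],
    channels: Optional[Sequence[str]] = None,
    cell_types: Optional[Sequence[str]] = None,
    sample_index: Optional[Sequence[str]] = None,
) -> None:
    """Hypothetical helper mirroring the checks listed under Raises."""
    n_cells = len(expression_matrix)
    n_channels = len(expression_matrix[0]) if n_cells else 0
    # ValueError is a stand-in for exceptions.DimensionMismatchError.
    if channels is not None and len(channels) != n_channels:
        raise ValueError("channels must match the number of columns")
    if cell_types is not None and len(cell_types) != n_cells:
        raise ValueError("cell_types must match the number of rows")
    if sample_index is not None and len(sample_index) != n_cells:
        raise ValueError("sample_index must match the number of rows")


# Two cells (rows) measured on three channels (columns).
exprs = [[0.1, 2.0, 3.5], [1.2, 0.0, 4.4]]
check_dimensions(exprs, channels=["CD3", "CD4", "CD8"], cell_types=["T", "B"])
```

Metadata that describes channels is compared against columns, while metadata that describes cells is compared against rows.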

__add__(new_object: PyCytoData) → PyCytoData[source]

Concatenate two PyCytoData objects with the + operator.

This method concatenates two PyCytoData objects by using the add_sample method internally. A new PyCytoData object is returned.

Parameters:: new_object (PyCytoData) – The second PyCytoData object.
Raises:: TypeError – The provided object is not a PyCytoData object.
Returns:: A new PyCytoData object after concatenation.
Return type:: PyCytoData

__getitem__(items: int | slice | List[int] | Tuple[Any, Any]) → PyCytoData

The method to index elements of the PyCytoData object.

This method implements the bracket notation to index part of the class. The notation is mostly consistent with the numpy indexing notation with a few exceptions, which are listed below. When indexing specific cells, the metadata are appropriately indexed as well.

A few deviations from the numpy notations:

Integer indices are currently not supported. This is because indexing by integer returns a 1-d array instead of a 2-d array, which can possibly cause confusion.
Indexing by two lists or arrays of different lengths is supported. The first indexes rows and the second indexes columns. For example, exprs[[0,1,2], [3,4]] is perfectly valid and selects the first 3 cells with the fourth and fifth channels.
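
The second deviation amounts to taking every requested row crossed with every requested column. A minimal sketch with nested lists, assuming that interpretation (index_rows_cols is a hypothetical helper, not part of the package):

```python
from typing import List


def index_rows_cols(
    matrix: List[List[int]], rows: List[int], cols: List[int]
) -> List[List[int]]:
    """Select the cross product of the given rows and columns."""
    return [[matrix[r][c] for c in cols] for r in rows]


# A toy 4-cell x 5-channel matrix where entry (r, c) equals 10 * r + c.
exprs = [[10 * r + c for c in range(5)] for r in range(4)]

# First three cells, fourth and fifth channels: the result stays 2-d.
sub = index_rows_cols(exprs, [0, 1, 2], [3, 4])
print(sub)
```

With plain numpy, two index lists of different lengths would not broadcast together this way, which is why the behavior is listed as a deviation.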

Tip

To index columns/channels by name, use the subset method instead.

Parameters:

items (Union[int, slice, List[int], Tuple[Any, Any]]) – The indices for items.

Raises:

IndexError – More than two indices are present.
TypeError – Indexing by integer in either or both axes.
IndexError – A higher-dimensional array is used.
TypeError – Invalid indices type used.

Returns:

An appropriately indexed PyCytoData object.

Return type:

PyCytoData

__iadd__(new_object: PyCytoData) → PyCytoData[source]

Concatenate a new PyCytoData object with the += operator.

This essentially works the same way as the add_sample method. However, instead of requiring the expression matrices, sample indices, and cell types to be provided manually, the concatenation is performed automatically from a new PyCytoData object.

Parameters:: new_object (PyCytoData) – The second PyCytoData object.
Raises:: TypeError – The provided object is not a PyCytoData object.
Returns:: A new PyCytoData object after concatenation.
Return type:: PyCytoData

__len__() → int[source]

The length of the PyCytoData Class.

This method implements support for the built-in Python len function. It returns the total number of cells in the expression matrix.

Returns:: The length of the object.
Return type:: int

__str__() → str[source]

String representation of the PyCytoData class.

This method returns a string containing the most basic metadata of the class along with the memory address.

Returns:: The string representation of the class.
Return type:: str

add_sample(expression_matrix: ArrayLike, sample_index: ArrayLike, cell_types: ArrayLike | None = None)[source]

Add another CyTOF sample from the same experiment.

This method allows users to combine samples into existing samples. The data must be in the same shape. Sample indices must be provided so that the class can properly index these samples using names.

Parameters:

expression_matrix (ArrayLike) – The expression matrix of the new sample.
sample_index (ArrayLike) – The sample indices to name the sample.
cell_types (Optional[ArrayLike], optional) – The cell types of each cell, defaults to None

Raises:

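exceptions.ExpressionMatrixDimensionError – The expression matrix is not or cannot be cast into a two dimensional array.
exceptions.DimensionMismatchError – The number of sample indices does not agree with the number of rows of the expression matrix.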
exceptions.DimensionMismatchError – The number of cell types does not agree with the number of rows of the expression matrix.
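
Conceptually, adding a sample stacks the new cells under the existing ones and extends the per-cell sample labels. The sketch below illustrates that idea with nested lists. append_sample is a hypothetical helper, ValueError stands in for the package's own exceptions, and the requirement of equal column counts is an assumption about what "same shape" means.

```python
from typing import List, Tuple


def append_sample(
    exprs: List[List[float]],
    sample_index: List[str],
    new_exprs: List[List[float]],
    new_sample_index: List[str],
) -> Tuple[List[List[float]], List[str]]:
    """Stack new cells below existing ones and extend the sample labels."""
    if len(new_exprs) != len(new_sample_index):
        raise ValueError("one sample index is needed per new cell")
    n_channels = len(exprs[0])
    if any(len(row) != n_channels for row in new_exprs):
        raise ValueError("new sample must have the same number of channels")
    return exprs + new_exprs, sample_index + new_sample_index


exprs = [[1.0, 2.0], [3.0, 4.0]]
samples = ["s1", "s1"]
combined, combined_samples = append_sample(exprs, samples, [[5.0, 6.0]], ["s2"])
```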

property cell_types: ndarray

Getter for cell_types.

Returns:: The cell types.
Return type:: np.ndarray

property channels: ndarray

Getter for channels.

Returns:: The channel names.
Return type:: np.ndarray

property expression_matrix: ndarray

Getter for the expression matrix.

Returns:: The expression matrix.
Return type:: np.ndarray

get_channel_expressions(channels: ArrayLike) → Tuple[ndarray, ndarray][source]

Get the expressions of specific channels.

This method subsets the expression matrix with the specific channels specified and returns the expression matrix along with the channel names. As opposed to subset, this method is more useful for investigating the expressions themselves rather than subsetting the object as a whole.

Parameters:

channels (Union[str, List[str]]) – The channel names to subset the data.

Raises:

TypeError – The channels n
ValueError – The channels specified are not listed in the channel names.

Returns:

A tuple of the expressions and the corresponding channel names.

Return type:

Tuple[np.ndarray, np.ndarray]
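
As an illustration of the name-based column selection described above, the following standard-library sketch returns the selected columns together with their names. select_channels is a hypothetical helper. Returning columns in the requested order is an assumption, not confirmed behavior of get_channel_expressions.

```python
from typing import List, Tuple, Union


def select_channels(
    exprs: List[List[float]],
    channel_names: List[str],
    wanted: Union[str, List[str]],
) -> Tuple[List[List[float]], List[str]]:
    """Return the columns named in `wanted` plus the matching names."""
    if isinstance(wanted, str):
        wanted = [wanted]
    missing = [name for name in wanted if name not in channel_names]
    if missing:
        raise ValueError(f"unknown channels: {missing}")
    cols = [channel_names.index(name) for name in wanted]
    return [[row[c] for c in cols] for row in exprs], list(wanted)


exprs = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
names = ["CD3", "CD4", "CD8"]
values, selected = select_channels(exprs, names, ["CD8", "CD3"])
```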

property lineage_channels: ndarray | None

Getter for lineage_channels.

Returns:: An array of lineage channels or None.
Return type:: np.ndarray, optional

property n_cell_types: int

Getter for n_cell_types.

Returns:: The number of cell types.
Return type:: int

property n_cells: int

Getter for n_cells.

Returns:: The number of cells.
Return type:: int

property n_channels: int

Getter for n_channels.

Returns:: The number of channels.
Return type:: int

property n_samples: int

Getter for n_samples.

Returns:: The number of samples.
Return type:: int

preprocess(arcsinh: bool = False, gate_debris_removal: bool = False, gate_intact_cells: bool = False, gate_live_cells: bool = False, gate_center_offset_residual: bool = False, bead_normalization: bool = False, auto_channels: bool = True, bead_channels: ArrayLike | None = None, time_channel: ArrayLike | None = None, cor_channels: ArrayLike | None = None, dead_channel: ArrayLike | None = None, DNA_channels: ArrayLike | None = None, cofactor: int = 5, cutoff_DNA_sd: float = 2, dead_cutoff_quantile: float = 0.03, cor_cutoff_quantile: float = 0.03, verbose: bool = True)[source]

Preprocess the expression matrix.

This is a one-size-fits-all method to preprocess the CyTOF sample using the preprocess module. The preprocessing consists of the following steps:

Arcsinh transformation.
Gate to remove debris.
Gate for intact cells.
Gate for live cells.
Gate for anomalies using center, offset, and residual channels.
Bead normalization.
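
For orientation, the arcsinh step is conventionally computed as arcsinh(x / cofactor) for every entry of the expression matrix. The sketch below shows that convention with the standard library only. It is an illustration under that assumption, not the code of the preprocess module, and arcsinh_transform is a hypothetical name.

```python
import math
from typing import List


def arcsinh_transform(
    exprs: List[List[float]], cofactor: float = 5
) -> List[List[float]]:
    """Apply asinh(x / cofactor) element-wise to a nested list."""
    return [[math.asinh(value / cofactor) for value in row] for row in exprs]


raw = [[0.0, 5.0], [50.0, 500.0]]
transformed = arcsinh_transform(raw, cofactor=5)
```

The transformation is roughly linear near zero and logarithmic for large values, which compresses the long right tail typical of CyTOF intensities.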

Parameters:

arcsinh (bool) – Whether to perform the arcsinh transformation, defaults to False.
gate_debris_removal (bool) – Whether to gate to remove debris, defaults to False.
gate_intact_cells (bool) – Whether to gate for intact cells, defaults to False.
gate_live_cells (bool) – Whether to gate for live cells, defaults to False.
gate_center_offset_residual (bool) – Whether to gate using center, offset, and residual channels, defaults to False.
bead_normalization (bool) – Whether to perform bead normalization, defaults to False.
auto_channels (bool) – Allow the method to recognize instrument and other non-lineage channels automatically. This can be overridden by specifying channels in bead_channels, time_channel, cor_channels, dead_channel, and DNA_channels, defaults to True.
bead_channels (ArrayLike, optional) – The bead channels as specified by name, defaults to None
time_channel (ArrayLike, optional) – The time channel as specified by name, defaults to None
cor_channels (ArrayLike, optional) – The Center, Offset, and Residual channels as specified by name, defaults to None
dead_channel (ArrayLike, optional) – The dead channels as specified by name, defaults to None
DNA_channels (ArrayLike, optional) – The DNA channels as specified by name, defaults to None
cofactor (int, optional) – The cofactor for the arcsinh transformation, defaults to 5.
cutoff_DNA_sd (float) – The standard deviation cutoff for DNA channels, measured as the number of standard deviations away from the mean, defaults to 2
dead_cutoff_quantile (float) – The cutoff quantile for dead channels. The top specified quantile will be excluded, defaults to 0.03
cor_cutoff_quantile (float) – The cutoff quantile for Center, Offset, and Residual channels. Both the top and bottom specified quantiles will be excluded, defaults to 0.03
verbose (bool) – Whether to print out progress, defaults to True.
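
To make the quantile cutoffs concrete, here is a rough sketch of excluding the top fraction of cells by one channel, as described for dead_cutoff_quantile. It assumes a simple rank-based rule (drop the round(q * n) highest cells). The package may compute its threshold differently, and gate_top_quantile is a hypothetical name.

```python
from typing import List


def gate_top_quantile(values: List[float], cutoff_quantile: float = 0.03) -> List[int]:
    """Return the indices of cells kept after dropping the highest values."""
    n_drop = round(cutoff_quantile * len(values))
    # Cell indices ordered by channel value, lowest first.
    order = sorted(range(len(values)), key=lambda i: values[i])
    kept = order[: len(values) - n_drop]
    return sorted(kept)


# 100 cells with distinct intensities 0.0 .. 99.0 in a "dead" channel.
dead_channel = [float(i) for i in range(100)]
kept = gate_top_quantile(dead_channel, cutoff_quantile=0.03)
print(len(kept))
```

A two-sided gate, as described for cor_cutoff_quantile, would drop the same fraction from the bottom of the ordering as well.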

Returns:

The gated expression matrix.

Return type:

np.ndarray

property reductions: dr.Reductions | None

Getter for reductions.

Returns:: A Reductions object or None.
Return type:: CytofDR.dr.Reductions, optional

run_dr_methods(methods: str | List[str] = 'all', out_dims: int = 2, n_jobs: int = -1, verbose: bool = True, suppress_error_msg: bool = False)[source]

Run dimension reduction methods.

This is a one-size-fits-all dispatcher that runs all supported methods in the module. It supports running multiple methods at the same time at the sacrifice of some more granular control of parameters. If you would like more customization, please use the CytofDR package directly.

Parameters:

methods (Union[str, List[str]]) – DR methods to run (not case sensitive).
out_dims (int) – Output dimension of DR.
n_jobs (int) – The number of jobs to run when applicable, defaults to -1.
verbose (bool) – Whether to print out progress, defaults to True.
suppress_error_msg (bool) – Whether to suppress printed error messages, defaults to False.

Raises:

ImportError – CytofDR is not installed.

property sample_index: ndarray

Getter for sample_index.

Returns:: The sample index.
Return type:: np.ndarray

subset(channels: ArrayLike | None = None, sample: ArrayLike | None = None, cell_types: ArrayLike | None = None, not_in: bool = False, in_place: bool = True) → PyCytoData | None[source]

Subset the dataset with specific cell types or samples.

This method allows you to subset the data using channels, samples, or cell types. In terms of the expression matrix, channels subsets are operations on columns, whereas sample or cell type subsets are operations on rows.

Tip

To index specific channels and get the expression matrix instead of a PyCytoData object, use the get_channel_expressions method.

Tip

To subset by indices, use the [] syntax, which supports indexing similar to that of numpy.

Parameters:

channels (Optional[ArrayLike], optional) – The names of the channels to perform subset, defaults to None.
sample (Optional[ArrayLike], optional) – The names of the samples to perform subset, defaults to None
cell_types (Optional[ArrayLike], optional) – The name of the cell types to perform subset, defaults to None
not_in (bool, optional) – Whether to filter out the provided cell types or samples, defaults to False
in_place (bool, optional) – Whether to perform the subset in place. If not, a new object will be created and returned. defaults to True.

Returns:

A new PyCytoData after subsetting

Return type:

PyCytoData, optional

Raises:

ValueError – The subset filters out all cells, leaving nothing in the expression matrix, which is unsupported.
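
The row-wise part of subsetting, including the effect of not_in, can be sketched as a boolean mask over cells. This is an illustration with nested lists rather than the method's actual code. subset_rows is a hypothetical helper.

```python
from typing import List, Tuple


def subset_rows(
    exprs: List[List[float]],
    cell_types: List[str],
    wanted: List[str],
    not_in: bool = False,
) -> Tuple[List[List[float]], List[str]]:
    """Keep cells whose type is in `wanted`, or the complement if not_in."""
    keep = [(cell_type in wanted) != not_in for cell_type in cell_types]
    if not any(keep):
        raise ValueError("subsetting would remove every cell")
    kept_rows = [row for row, flag in zip(exprs, keep) if flag]
    kept_types = [ct for ct, flag in zip(cell_types, keep) if flag]
    return kept_rows, kept_types


exprs = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
types = ["T", "B", "T"]
t_rows, t_types = subset_rows(exprs, types, ["T"])
other_rows, other_types = subset_rows(exprs, types, ["T"], not_in=True)
```

The metadata is filtered with the same mask as the rows, so the cell types stay aligned with the expression matrix.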

class PyCytoData.data.DataLoader[source]

Bases: object

The class with utility functions to load datasets.

This class offers one public utility function to load datasets, load_dataset, which loads and preprocesses existing benchmark datasets. All other methods are private methods. Instantiation is not necessary.

classmethod load_dataset(dataset: str, sample: ArrayLike | None = None, force_download: bool = False, preprocess: bool = False) → PyCytoData[source]

Load benchmark datasets.

This method downloads and loads benchmark datasets. Each dataset is downloaded only once and then cached for future use. Currently, we support three datasets:

levine13
levine32
samusik

This method also supports specifying a specific sample instead of loading the entire dataset. Below is a list of samples available:

levine13: 0 (There is only one sample in this case)
levine32: AML08 and AML09.
samusik: 01, 02, …, 09, 10

Parameters:

dataset (str) – The name of the dataset.
sample (ArrayLike, optional) – The specific sample to load from the dataset, defaults to None.
force_download (bool) – Whether to download dataset regardless of previous cache, defaults to False
preprocess (bool, optional) – Whether to automatically perform all the necessary preprocessing, defaults to False. In the case of the existing three datasets, preprocessing includes just arcsinh transformation with cofactor of 5.

Returns:

The loaded dataset.

Return type:

PyCytoData

classmethod purge_dataset_cache(dataset: str) → None[source]

Delete the cached benchmark datasets.

This method permanently deletes the datasets downloaded and cached with the load_dataset method. Currently, it supports only the three benchmark datasets: levine13, levine32, and samusik. Once deleted, the dataset has to be downloaded again in the future if needed. However, performing this operation will free up storage space for one-time usages.

Currently, we do not support deleting only specific samples.

Parameters:: dataset (str) – The benchmark dataset name: levine13, levine32, samusik.

classmethod purge_dataset_cache_all() → None[source]

Delete all cached benchmark datasets.

This method permanently deletes all the datasets downloaded and cached with the load_dataset method. Once deleted, the dataset has to be downloaded again in the future if needed. However, performing this operation will free up storage space for one-time usages.

class PyCytoData.data.FileIO[source]

Bases: object

A utility class to handle common IO workflows for CyTOF data.

This class includes a few utility static methods to load and save CyTOF data. Currently, it includes the following methods:

load_delim
load_expression
save_2d_list_to_csv
save_np_array

Most of the methods are wrappers, but we offer a few advantages, such as returning PyCytoData objects and saving NumPy arrays along with channel names. For detailed documentation, read the docstring for each method.

static load_delim(files: List[str] | str, skiprows: int = 0, drop_columns: int | List[int] | None = None, delim: str = '\t', dtype: type = float, return_sample_indices: bool = False) → ndarray | Tuple[ndarray, ndarray][source]

Load delimited file(s) as a numpy array.

This method loads a delimited file and returns a numpy array. The file has to be a standard text file. It is essentially a wrapper for the np.loadtxt function, but we offer the functionality of loading a list of files all at once, which are automatically concatenated.

Parameters:

files (Union[List[str], str]) – The path (or a list of paths) to the files to be loaded.
skiprows (int, optional) – The number of rows to skip, default to 0.
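drop_columns (Union[int, List[int]], optional) – The indices of the columns to be dropped, defaults to None.
delim (str, optional) – The delimiter to use, defaults to \t
dtype (type, optional) – The data type for the arrays, defaults to float.
return_sample_indices (bool, optional) – Whether to return the sample indices along with the array, defaults to False.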

Raises:

TypeError – The files argument is neither a string nor a list of strings.

Returns:

An array or an array along with the sample indices.

Return type:

Union[np.ndarray, Tuple[np.ndarray, np.ndarray]]
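
The multi-file behavior can be pictured with the standard-library sketch below, which reads several delimited files, stacks their rows, and records which file each row came from. load_delimited is a hypothetical stand-in built on the csv module instead of np.loadtxt. Numbering samples by file order is an assumption.

```python
import csv
import os
import tempfile
from typing import List, Tuple, Union


def load_delimited(
    paths: Union[str, List[str]], skiprows: int = 0, delim: str = "\t"
) -> Tuple[List[List[float]], List[int]]:
    """Read delimited files and stack their rows into one nested list."""
    if isinstance(paths, str):
        paths = [paths]
    rows: List[List[float]] = []
    sample_indices: List[int] = []
    for sample, path in enumerate(paths):
        with open(path, newline="") as handle:
            for line_no, record in enumerate(csv.reader(handle, delimiter=delim)):
                if line_no < skiprows or not record:
                    continue  # skip header rows and blank lines
                rows.append([float(value) for value in record])
                sample_indices.append(sample)
    return rows, sample_indices


with tempfile.TemporaryDirectory() as tmp:
    first = os.path.join(tmp, "sample_a.tsv")
    second = os.path.join(tmp, "sample_b.tsv")
    with open(first, "w") as handle:
        handle.write("CD3\tCD4\n1.0\t2.0\n3.0\t4.0\n")
    with open(second, "w") as handle:
        handle.write("CD3\tCD4\n5.0\t6.0\n")
    # skiprows=1 drops the header line of each file.
    matrix, samples = load_delimited([first, second], skiprows=1)
```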

static load_expression(files: List[str] | str, col_names: bool = True, drop_columns: int | List[int] | None = None, delim: str = '\t', dtype=float) → PyCytoData[source]

Load a delimited text file as a PyCytoData object.

This method loads delimited file(s) and returns a PyCytoData object. The file has to be a standard text file containing the expression matrix. Rows are cells and columns are channels. If col_names is True, the first row of the file will be treated as channel names. If multiple file paths are present, they will be automatically concatenated into one object, but the sample indices will be recorded.

Parameters:

files (Union[List[str], str]) – The path (or a list of paths) to the files to be loaded.
col_names (bool, optional) – Whether the first row is channel names, defaults to True.
drop_columns (Union[int, List[int]], optional) – The indices of the columns to be dropped, defaults to None.
delim (str, optional) – The delimiter to use, defaults to \t
dtype (type, optional) – The data type for the arrays, defaults to float.

Raises:

TypeError – The files argument is neither a string nor a list of strings.
ValueError – The expression matrices’ channels are mismatched or misaligned.

Returns:

A PyCytoData object.

Return type:

PyCytoData

static save_2d_list_to_csv(data: List[List[Any]], path: str, overwrite: bool = False)[source]

Save a nested list to a CSV file.

Parameters:

data (List[List[Any]]) – The nested list to be written to disk
path (str) – Path to save the CSV file
overwrite (bool, optional) – Whether to overwrite an existing file, defaults to False

Note

By default, this method does not overwrite existing files. In case a file exists, a FileExistsError is thrown.
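
The overwrite guard described in the note can be sketched with the csv module as follows. save_rows_to_csv is a hypothetical stand-in that mirrors the documented behavior. It is not the package's code.

```python
import csv
import os
import tempfile
from typing import Any, List


def save_rows_to_csv(data: List[List[Any]], path: str, overwrite: bool = False) -> None:
    """Write a nested list to CSV, refusing to clobber an existing file."""
    if os.path.exists(path) and not overwrite:
        raise FileExistsError(f"{path} already exists")
    with open(path, "w", newline="") as handle:
        csv.writer(handle).writerows(data)


with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "cells.csv")
    save_rows_to_csv([["CD3", "CD4"], [1.5, 2.5]], target)
    try:
        # A second write without overwrite=True should be rejected.
        save_rows_to_csv([["CD3", "CD4"]], target)
        second_write_blocked = False
    except FileExistsError:
        second_write_blocked = True
    with open(target, newline="") as handle:
        saved = list(csv.reader(handle))
```

Note that values read back from a CSV file are strings, so numeric data needs to be converted again after loading.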

static save_np_array(array: ndarray, path: str, col_names: ndarray | None = None, dtype: str = '%.18e', overwrite: bool = False) → None[source]

Save a NumPy array to a plain text file.

Parameters:

array (np.ndarray) – The NumPy array to be saved
path (str) – Path to save the plain text file
col_names (np.ndarray, optional) – Column names to be saved as the first row, defaults to None
dtype (str, optional) – NumPy data type, defaults to “%.18e”
overwrite (bool, optional) – Whether to overwrite an existing file, defaults to False

Note

By default, this method does not overwrite existing files. In case a file exists, a FileExistsError is thrown.