Module: PyCytoData.data

class PyCytoData.data.PyCytoData(expression_matrix: ArrayLike, channels: Optional[ArrayLike] = None, cell_types: Optional[ArrayLike] = None, sample_index: Optional[ArrayLike] = None, lineage_channels: Optional[ArrayLike] = None)[source]

Bases: object

The CytoData Class for handling CyTOF data.

This is an all-purpose data class for handling CyTOF data. It is compatible with benchmark datasets downloaded from the DataLoader class as well as users’ own CyTOF datasets. It has wide-ranging functionality, including preprocessing, dimension reduction (DR), and much more.

Parameters

expression_matrix (ArrayLike) – The expression matrix for the CyTOF sample. Rows are cells and columns are channels.
channels (ArrayLike) – The name of the channels, defaults to None
cell_types (ArrayLike) – The cell types of the cells, defaults to None
sample_index (ArrayLike) – The indices or names that indicate the sample of each cell. This allows multiple samples to be combined into one object, defaults to None
lineage_channels (ArrayLike) – The names of lineage channels, defaults to None

Raises

exceptions.ExpressionMatrixDimensionError – The expression matrix is not or cannot be cast into a two dimensional array.
exceptions.DimensionMismatchError – The number of channel names does not agree with the number of columns of the expression matrix.
exceptions.DimensionMismatchError – The number of cell types for all cells does not agree with the number of rows of the expression matrix.
exceptions.DimensionMismatchError – The number of sample indices does not agree with the number of rows of the expression matrix.

Additional Attributes

reductions: A reductions object for dimension reduction using CytofDR.

__add__(new_object: PyCytoData) → PyCytoData[source]

__getitem__(items: Union[slice, List[int], ndarray, Tuple[Union[slice, List[int], ndarray], Union[slice, List[int], ndarray]]]) → PyCytoData[source]

The method to index elements of the PyCytoData object.

This method implements the bracket notation to index part of the class. The notation is mostly consistent with the numpy indexing notation, with a few exceptions that are listed below. When indexing specific cells, the metadata are indexed accordingly.

A few deviations from the numpy notations:

Integer indices are currently not supported. This is because indexing by an integer returns a 1-d array instead of a 2-d array, which can cause confusion.
Indexing by two lists or arrays with different lengths is supported. They are treated as row and column indices, so exprs[[0,1,2], [3,4]] is perfectly valid and selects the first 3 cells with the fourth and fifth channels.
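
The second deviation amounts to taking every combination of the selected rows and columns instead of pairing them element-wise. A minimal sketch of that behavior on a plain nested list follows; the helper index_rows_cols is hypothetical and is not part of PyCytoData.

```python
from typing import List, Sequence


def index_rows_cols(matrix: Sequence[Sequence[float]],
                    rows: Sequence[int],
                    cols: Sequence[int]) -> List[List[float]]:
    """Select every (row, column) combination; hypothetical helper for illustration."""
    # Rows and columns are indexed independently, so their lengths may differ.
    return [[matrix[r][c] for c in cols] for r in rows]


# Toy expression matrix: 4 cells (rows) by 5 channels (columns).
exprs = [[10 * r + c for c in range(5)] for r in range(4)]

# First three cells, fourth and fifth channels.
selected = index_rows_cols(exprs, [0, 1, 2], [3, 4])
print(selected)
```

In plain NumPy, index lists of different lengths do not broadcast against each other, so the same selection is usually written with np.ix_.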

Tip

To index columns/channels by name, use the subset method instead.

Parameters

items (Union[int, slice, List[int], Tuple[Any, Any]]) – The indices for items.

Raises

IndexError – More than two indices present.
TypeError – Indexing by integer in either or both axes.
IndexError – A higher-dimensional array is used.
TypeError – Invalid indices type used.

Returns

An appropriately indexed PyCytoData object.

Return type

PyCytoData

__iadd__(new_object: PyCytoData) → PyCytoData[source]

__init__(expression_matrix: ArrayLike, channels: Optional[ArrayLike] = None, cell_types: Optional[ArrayLike] = None, sample_index: Optional[ArrayLike] = None, lineage_channels: Optional[ArrayLike] = None)[source]

__len__() → int[source]

The length of the PyCytoData Class.

This method implements support for Python's built-in len function. It returns the total number of cells in the expression matrix.

Returns: The length of the object.
Return type: int

__module__ = 'PyCytoData.data'

__str__() → str[source]

String representation of the PyCytoData class.

This method returns a string containing the most basic metadata of the class along with the memory address.

Returns: The string representation of the class.
Return type: str

__weakref__: list of weak references to the object (if defined)

add_sample(expression_matrix: ArrayLike, sample_index: ArrayLike, cell_types: Optional[ArrayLike] = None)[source]

Add another CyTOF sample from the same experiment.

This method allows users to combine samples into existing samples. The data must be in the same shape. Sample indices must be provided so that the class can properly index these samples using names.

Parameters

expression_matrix (ArrayLike) – The expression matrix of the new sample.
sample_index (ArrayLike) – The sample indices to name the sample.
cell_types (Optional[ArrayLike], optional) – The cell types of each cell, defaults to None

Raises

exceptions.ExpressionMatrixDimensionError – The expression matrix cannot be cast
exceptions.DimensionMismatchError – The number of sample indices
exceptions.DimensionMismatchError – _description_
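
As a rough illustration of what adding a sample involves, the sketch below appends new cells to existing data after two consistency checks. It assumes that "the same shape" means the same number of channels, uses plain lists, and raises ValueError as a stand-in for the library's own exceptions; append_sample is a hypothetical helper, not the library's implementation.

```python
from typing import List, Sequence, Tuple


def append_sample(exprs: List[List[float]],
                  sample_index: List[str],
                  new_exprs: Sequence[Sequence[float]],
                  new_sample_index: Sequence[str]) -> Tuple[List[List[float]], List[str]]:
    """Append a sample's cells to existing data; hypothetical helper for illustration."""
    n_channels = len(exprs[0])
    # Assumption: every new cell must have the same number of channels.
    if any(len(row) != n_channels for row in new_exprs):
        raise ValueError("New sample has a different number of channels.")
    # One sample label is required per new cell.
    if len(new_sample_index) != len(new_exprs):
        raise ValueError("Need one sample index per new cell.")
    combined = exprs + [list(row) for row in new_exprs]
    return combined, sample_index + list(new_sample_index)


exprs = [[1.0, 2.0], [3.0, 4.0]]
labels = ["s1", "s1"]
exprs, labels = append_sample(exprs, labels, [[5.0, 6.0]], ["s2"])
```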

property cell_types: ndarray

Getter for cell_types.

Returns: The cell types.
Return type: np.ndarray

property channels: ndarray

Getter for channels.

Returns: The channel names.
Return type: np.ndarray

property expression_matrix: ndarray

Getter for the expression matrix.

Returns: The expression matrix.
Return type: np.ndarray

get_channel_expressions(channels: ArrayLike) → Tuple[ndarray, ndarray][source]

Get the expressions of specific channels.

This method subsets the expression matrix with the specific channels specified and returns the expression matrix along with the channel names. As opposed to subset, this method is more useful for investigating the expressions themselves rather than subsetting the object as a whole.

Parameters

channels (Union[str, List[str]]) – The channel names to subset the data.

Raises

TypeError – The channels n
ValueError – The channels specified are not listed in the channel names.

Returns

A tuple of the expressions and the corresponding channel names.

Return type

Tuple[np.ndarray, np.ndarray]
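
The idea of selecting columns by channel name can be sketched on plain lists as follows. The helper select_channels is hypothetical; in particular, returning the columns in the requested order is an assumption of this sketch, and the library may order them differently.

```python
from typing import List, Sequence, Tuple


def select_channels(exprs: Sequence[Sequence[float]],
                    channel_names: Sequence[str],
                    wanted: Sequence[str]) -> Tuple[List[List[float]], List[str]]:
    """Return the columns for the requested channels; hypothetical helper."""
    names = list(channel_names)
    missing = [ch for ch in wanted if ch not in names]
    if missing:
        # Mirrors the documented ValueError for unknown channel names.
        raise ValueError(f"Channels not found: {missing}")
    col_idx = [names.index(ch) for ch in wanted]
    return [[row[i] for i in col_idx] for row in exprs], list(wanted)


matrix = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
channel_names = ["CD3", "CD4", "CD8"]
sub_exprs, sub_names = select_channels(matrix, channel_names, ["CD8", "CD3"])
```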

property lineage_channels: Optional[ndarray]

Getter for lineage_channels.

Returns: An array of lineage channels or None.
Return type: np.ndarray, optional

property n_cell_types: int

Getter for n_cell_types.

Returns: The number of cell types.
Return type: int

property n_cells: int

Getter for n_cells.

Returns: The number of cells.
Return type: int

property n_channels: int

Getter for n_channels.

Returns: The number of channels.
Return type: int

property n_samples: int

Getter for n_samples.

Returns: The number of samples.
Return type: int

preprocess(arcsinh: bool = False, gate_debris_removal: bool = False, gate_intact_cells: bool = False, gate_live_cells: bool = False, gate_center_offset_residual: bool = False, bead_normalization: bool = False, auto_channels: bool = True, bead_channels: Optional[ArrayLike] = None, time_channel: Optional[ArrayLike] = None, cor_channels: Optional[ArrayLike] = None, dead_channel: Optional[ArrayLike] = None, DNA_channels: Optional[ArrayLike] = None, cofactor: int = 5, cutoff_DNA_sd: float = 2, dead_cutoff_quantile: float = 0.03, cor_cutoff_quantile: float = 0.03, verbose: bool = True)[source]

Preprocess the expression matrix.

This is a one-size-fits-all method to preprocess the CyTOF sample using the preprocess module. The preprocessing consists of the following steps:

Arcsinh transformation.
Gate to remove debris.
Gate for intact cells.
Gate for live cells.
Gate for anomalies using center, offset, and residual channels.
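
For the first step, a minimal sketch of an arcsinh transformation is shown below. It assumes the common CyTOF convention asinh(x / cofactor); whether the preprocess module applies exactly this formula is an assumption, and arcsinh_transform is a hypothetical helper.

```python
import math
from typing import List, Sequence


def arcsinh_transform(exprs: Sequence[Sequence[float]],
                      cofactor: float = 5) -> List[List[float]]:
    """Apply asinh(x / cofactor) to every value; hypothetical helper."""
    return [[math.asinh(value / cofactor) for value in row] for row in exprs]


raw = [[0.0, 5.0], [50.0, 500.0]]
transformed = arcsinh_transform(raw)
```

The transform is roughly linear near zero and logarithmic for large values, which is why it is often preferred over a plain log for data containing zeros.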

Parameters

arcsinh (bool) – Whether to apply the arcsinh transformation, defaults to False.
gate_debris_removal (bool) – Whether to gate to remove debris, defaults to False.
gate_intact_cells (bool) – Whether to gate for intact cells, defaults to False.
gate_live_cells (bool) – Whether to gate for live cells, defaults to False.
gate_center_offset_residual (bool) – Whether to gate using center, offset, and residual channels, defaults to False.
bead_normalization (bool) – Whether to perform bead normalization, defaults to False.
auto_channels (bool) – Allow the method to recognize instrument and other non-lineage channels automatically. This can be overridden by specifying channels in bead_channels, time_channel, cor_channels, dead_channel, and DNA_channels, defaults to True.
bead_channels (ArrayLike, optional) – The bead channels as specified by name, defaults to None
time_channel (ArrayLike, optional) – The time channel as specified by name, defaults to None
cor_channels (ArrayLike, optional) – The Center, Offset, and Residual channels as specified by name, defaults to None
dead_channel (ArrayLike, optional) – The dead channels as specified by name, defaults to None
DNA_channels (ArrayLike, optional) – The DNA channels as specified by name, defaults to None
cofactor (int, optional) – The cofactor for the arcsinh transformation, defaults to 5.
cutoff_DNA_sd (float) – The standard deviation cutoff for DNA channels, measured as the number of standard deviations away from the mean, defaults to 2
dead_cutoff_quantile (float) – The cutoff quantile for dead channels. The top specified quantile will be excluded, defaults to 0.03
cor_cutoff_quantile (float) – The cutoff quantile for Center, Offset, and Residual channels. Both the top and bottom specified quantiles will be excluded, defaults to 0.03
verbose (bool) – Whether to print out progress, defaults to True.

Returns

The gated expression matrix.

Return type

np.ndarray
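
The quantile-based cutoffs can be pictured with the sketch below, which builds a keep-mask that drops the cells in the top quantile of a single channel, in the spirit of dead_cutoff_quantile. The nearest-rank threshold is an assumption of this sketch, and gate_top_quantile is a hypothetical helper; the library's gating may compute the threshold differently.

```python
from typing import List, Sequence


def gate_top_quantile(values: Sequence[float],
                      cutoff_quantile: float = 0.03) -> List[bool]:
    """Return a keep-mask that drops cells in the top quantile; hypothetical helper."""
    ordered = sorted(values)
    # Nearest-rank position of the (1 - cutoff_quantile) quantile.
    position = int(round((1 - cutoff_quantile) * (len(ordered) - 1)))
    threshold = ordered[position]
    # Keep only cells at or below the threshold.
    return [value <= threshold for value in values]


dead_channel_values = list(range(100))
keep = gate_top_quantile(dead_channel_values)
```

Applying the mask to the rows of the expression matrix removes the corresponding cells; a two-sided version for the Center, Offset, and Residual channels would add a matching lower threshold.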

property reductions: Optional[dr.Reductions]

Getter for reductions.

Returns: A Reductions object or None.
Return type: CytofDR.dr.Reductions, optional

run_dr_methods(methods: Union[str, List[str]] = 'all', out_dims: int = 2, n_jobs: int = - 1, verbose: bool = True, suppress_error_msg: bool = False)[source]

Run dimension reduction methods.

This is a one-size-fits-all dispatcher that runs all supported methods in the module. It supports running multiple methods at the same time at the sacrifice of some more granular control of parameters. If you would like more customization, please use the CytofDR package directly.

Parameters

methods (Union[str, List[str]]) – DR methods to run (not case sensitive).
out_dims (int) – Output dimension of DR.
n_jobs (int) – The number of jobs to run when applicable, defaults to -1.
verbose (bool) – Whether to print out progress, defaults to True.
suppress_error_msg (bool) – Whether to suppress printed error messages, defaults to False.

Raises

ImportError – CytofDR is not installed.

property sample_index: ndarray

Getter for sample_index.

Returns: The sample index.
Return type: np.ndarray

subset(channels: Optional[ArrayLike] = None, sample: Optional[ArrayLike] = None, cell_types: Optional[ArrayLike] = None, not_in: bool = False, in_place: bool = True) → Optional[PyCytoData][source]

Subset the dataset with specific cell types or samples.

This method allows you to subset the data using channels, samples, or cell types. In terms of the expression matrix, channels subsets are operations on columns, whereas sample or cell type subsets are operations on rows.

Tip

To index specific channels and get the expression matrix instead of a PyCytoData object, use the get_channel_expressions method.

Tip

To subset by indices, use the [] syntax, which supports indexing similar to that of numpy.

Parameters

channels (Optional[ArrayLike], optional) – The names of the channels to perform subset, defaults to None.
sample (Optional[ArrayLike], optional) – The names of the samples to perform subset, defaults to None
cell_types (Optional[ArrayLike], optional) – The name of the cell types to perform subset, defaults to None
not_in (bool, optional) – Whether to filter out the provided cell types or samples, defaults to False
in_place (bool, optional) – Whether to perform the subset in place. If not, a new object will be created and returned, defaults to True.

Returns

A new PyCytoData object after subsetting, or None when the subset is performed in place.

Return type

PyCytoData, optional

Raises

ValueError – The subset filters out all cells, leaving nothing in the expression matrix, which is unsupported.
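
The row-wise part of subsetting can be sketched as a boolean mask over cells. The sketch assumes that the sample and cell type filters are combined with a logical AND and that not_in inverts the combined mask; both points, and the helper row_mask itself, are assumptions for illustration rather than the library's implementation.

```python
from typing import List, Optional, Sequence


def row_mask(sample_index: Sequence[str],
             cell_types: Sequence[str],
             sample: Optional[Sequence[str]] = None,
             types: Optional[Sequence[str]] = None,
             not_in: bool = False) -> List[bool]:
    """Build a keep-mask over cells; hypothetical helper for illustration."""
    mask = [True] * len(sample_index)
    if sample is not None:
        mask = [keep and s in sample for keep, s in zip(mask, sample_index)]
    if types is not None:
        mask = [keep and t in types for keep, t in zip(mask, cell_types)]
    if not_in:
        # Exclude the matching cells instead of keeping them.
        mask = [not keep for keep in mask]
    if not any(mask):
        raise ValueError("Subset would remove every cell.")
    return mask


samples = ["A", "A", "B", "B"]
types_per_cell = ["T", "B cell", "T", "NK"]
only_sample_a = row_mask(samples, types_per_cell, sample=["A"])
without_t_cells = row_mask(samples, types_per_cell, types=["T"], not_in=True)
```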

class PyCytoData.data.DataLoader[source]

Bases: object

The class with utility functions to load datasets.

This class offers one public utility function to load datasets, load_dataset, which loads and preprocesses existing benchmark datasets. All other methods are private methods. Instantiation is not necessary.

classmethod load_dataset(dataset: str, sample: Optional[ArrayLike] = None, force_download: bool = False, preprocess: bool = False) → PyCytoData[source]

Load benchmark datasets.

This method downloads and loads benchmark datasets. Each dataset is downloaded only once and then cached for future use. Currently, we support three datasets:

levine13
levine32
samusik

This method also supports specifying a specific sample instead of loading the entire dataset. Below is a list of samples available:

levine13: 0 (There is only one sample in this case)
levine32: AML08 and AML09.
samusik: 01, 02, …, 09, 10

Parameters

dataset (str) – The name of the dataset.
sample (ArrayLike, optional) – The specific sample to load from the dataset, defaults to None.
force_download (bool) – Whether to download dataset regardless of previous cache, defaults to False
preprocess (bool, optional) – Whether to automatically perform all the necessary preprocessing, defaults to False. In the case of the existing three datasets, preprocessing includes just arcsinh transformation with a cofactor of 5.

Returns

The loaded dataset.

Return type

PyCytoData
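
The download-once behavior and the force_download flag follow a common caching pattern, sketched below with a stand-in fetch function instead of a real download. The names load_with_cache and fake_fetch, the cache location, and the file layout are all hypothetical; how DataLoader actually stores its cache is not shown here.

```python
import os
import tempfile
from typing import Callable, List


def load_with_cache(name: str, cache_dir: str,
                    fetch: Callable[[str], str],
                    force_download: bool = False) -> str:
    """Return dataset text, fetching only when no cached copy exists."""
    path = os.path.join(cache_dir, f"{name}.txt")
    if force_download or not os.path.exists(path):
        # Cache miss or forced refresh: fetch and store on disk.
        with open(path, "w") as f:
            f.write(fetch(name))
    with open(path) as f:
        return f.read()


calls: List[str] = []


def fake_fetch(name: str) -> str:
    """Stand-in for a real download; records each call."""
    calls.append(name)
    return f"data for {name}"


with tempfile.TemporaryDirectory() as cache:
    first = load_with_cache("levine13", cache, fake_fetch)
    second = load_with_cache("levine13", cache, fake_fetch)  # served from cache
    third = load_with_cache("levine13", cache, fake_fetch, force_download=True)
```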

class PyCytoData.data.FileIO[source]

Bases: object

A utility class to handle common IO workflows for CyTOF data.

This class includes a few utility static methods to load and save CyTOF data. Currently, it includes the following methods:

load_delim
load_expression
save_2d_list_to_csv
save_np_array

Most of the methods are wrappers, but they offer a few advantages, such as returning PyCytoData objects and saving NumPy arrays along with channel names. For detailed documentation, read the docstring of each method.

static load_delim(files: Union[List[str], str], skiprows: int = 0, drop_columns: Optional[Union[int, List[int]]] = None, delim: str = '\t', dtype: type = float, return_sample_indices: bool = False) → Union[ndarray, Tuple[ndarray, ndarray]][source]

Load delimited file(s) as a numpy array.

This method loads a delimited file and returns a numpy array. The file has to be a standard text file. It is essentially a wrapper for the np.loadtxt function, with the added ability to load a list of files at once, which are automatically concatenated.

Parameters

files (Union[List[str], str]) – The path (or a list of paths) to the files to be loaded.
skiprows (int, optional) – The number of rows to skip, defaults to 0.
drop_columns (Union[int, List[int]], optional) – The indices of the columns to be dropped, defaults to None.
delim (str, optional) – The delimiter to use, defaults to \t
dtype (type, optional) – The data type for the arrays, defaults to float.
return_sample_indices (bool, optional) – Whether to return the sample indices along with the array, defaults to False.

Raises

TypeError – files is neither a string nor a list of strings.

Returns

An array or an array along with the sample indices.

Return type

Union[np.ndarray, Tuple[np.ndarray, np.ndarray]]
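
Loading several files and concatenating them can be sketched with the standard csv module as shown below. This is not the library's implementation, which wraps np.loadtxt; load_delim_files is a hypothetical helper, and labeling each row with the position of its file in the list is an assumption about what the sample indices contain.

```python
import csv
import os
import tempfile
from typing import List, Tuple, Union


def load_delim_files(files: Union[str, List[str]], skiprows: int = 0,
                     delim: str = "\t") -> Tuple[List[List[float]], List[int]]:
    """Load and concatenate delimited files; hypothetical helper."""
    if isinstance(files, str):
        files = [files]
    rows: List[List[float]] = []
    sample_indices: List[int] = []
    for file_number, path in enumerate(files):
        with open(path, newline="") as f:
            for line_number, record in enumerate(csv.reader(f, delimiter=delim)):
                if line_number < skiprows:
                    continue  # e.g. skip a header row of channel names
                rows.append([float(value) for value in record])
                sample_indices.append(file_number)
    return rows, sample_indices


with tempfile.TemporaryDirectory() as tmp:
    paths = [os.path.join(tmp, "a.txt"), os.path.join(tmp, "b.txt")]
    with open(paths[0], "w", newline="") as f:
        f.write("CD3\tCD4\n1\t2\n3\t4\n")
    with open(paths[1], "w", newline="") as f:
        f.write("CD3\tCD4\n5\t6\n")
    data, indices = load_delim_files(paths, skiprows=1)
```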

static load_expression(files: Union[List[str], str], col_names: bool = True, drop_columns: Optional[Union[int, List[int]]] = None, delim: str = '\t', dtype=float) → PyCytoData[source]

Load a delimited text file as a PyCytoData object.

This method loads delimited file(s) and returns a PyCytoData object. The file has to be a standard text file containing the expression matrix. Rows are cells and columns are channels. If col_names is True, the first row of the file will be treated as channel names. If multiple file paths are given, the files will be automatically concatenated into one object, and the sample indices will be recorded.

Parameters

files (Union[List[str], str]) – The path (or a list of paths) to the files to be loaded.
col_names (bool, optional) – Whether the first row is channel names, defaults to True.
drop_columns (Union[int, List[int]], optional) – The indices of the columns to be dropped, defaults to None.
delim (str, optional) – The delimiter to use, defaults to \t
dtype (type, optional) – The data type for the arrays, defaults to float.

Raises

TypeError – files is neither a string nor a list of strings.

Returns

A PyCytoData object.

Return type

PyCytoData

static save_2d_list_to_csv(data: List[List[Any]], path: str, overwrite: bool = False)[source]

Save a nested list to a CSV file.

Parameters

data (List[List[Any]]) – The nested list to be written to disk
path (str) – Path to save the CSV file
overwrite (bool, optional) – Whether to overwrite an existing file, defaults to False

Note

By default, this method does not overwrite existing files. In case a file exists, a FileExistsError is thrown.
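
The no-overwrite default can be sketched with Python's exclusive-creation file mode, which raises FileExistsError when the target already exists. The helper save_rows_to_csv is hypothetical and only illustrates the pattern; the library may check for existing files in a different way.

```python
import csv
import os
import tempfile
from typing import Any, List


def save_rows_to_csv(data: List[List[Any]], path: str,
                     overwrite: bool = False) -> None:
    """Write a nested list to CSV; hypothetical helper for illustration."""
    # Mode "x" fails with FileExistsError if the file exists; "w" replaces it.
    mode = "w" if overwrite else "x"
    with open(path, mode, newline="") as f:
        csv.writer(f).writerows(data)


with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "out.csv")
    save_rows_to_csv([["CD3", "CD4"], [1.5, 2.5]], target)
    try:
        save_rows_to_csv([["x"]], target)  # second write without overwrite
        second_write_blocked = False
    except FileExistsError:
        second_write_blocked = True
    with open(target, newline="") as f:
        saved = list(csv.reader(f))
```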

static save_np_array(array: ndarray, path: str, col_names: Optional[ndarray] = None, dtype: str = '%.18e', overwrite: bool = False) → None[source]

Save a NumPy array to a plain text file

Parameters

array (np.ndarray) – The NumPy array to be saved
path (str) – Path to save the plain text file
col_names (np.ndarray, optional) – Column names to be saved as the first row, defaults to None
dtype (str, optional) – The output format string for the values, defaults to “%.18e”
overwrite (bool, optional) – Whether to overwrite an existing file, defaults to False

Note

By default, this method does not overwrite existing files. In case a file exists, a FileExistsError is thrown.