Module: PyCytoData.data

class PyCytoData.data.PyCytoData(expression_matrix: ArrayLike, channels: Optional[ArrayLike] = None, cell_types: Optional[ArrayLike] = None, sample_index: Optional[ArrayLike] = None, lineage_channels: Optional[ArrayLike] = None)

Bases: object

The CytoData Class for handling CyTOF data.

This is an all-purpose data class for handling CyTOF data. It is compatible with benchmark datasets downloaded from the DataLoader class as well as users’ own CyTOF datasets. It has wide-ranging functionality, including preprocessing, dimension reduction (DR), and more.

Parameters
  • expression_matrix (ArrayLike) – The expression matrix for the CyTOF sample. Rows are cells and columns are channels.

  • channels (ArrayLike) – The names of the channels, defaults to None

  • cell_types (ArrayLike) – The cell types of the cells, defaults to None

  • sample_index (ArrayLike) – The indices or names indicating the sample of each cell. This allows multiple samples to be combined into one object, defaults to None

  • lineage_channels (ArrayLike) – The names of lineage channels, defaults to None
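
The shape requirements on these parameters can be illustrated with a minimal NumPy sketch. The helper `check_shapes` is hypothetical and raises a plain `ValueError`; the class itself may validate differently and uses its own exception types.

```python
import numpy as np

def check_shapes(expression_matrix, channels=None, cell_types=None, sample_index=None):
    """Hypothetical helper: check inputs against the documented shapes."""
    exprs = np.asarray(expression_matrix)
    if exprs.ndim != 2:
        raise ValueError("The expression matrix must be two-dimensional.")
    n_cells, n_channels = exprs.shape
    # Channel names describe columns; cell types and sample indices describe rows.
    if channels is not None and len(channels) != n_channels:
        raise ValueError("Need one channel name per column.")
    if cell_types is not None and len(cell_types) != n_cells:
        raise ValueError("Need one cell type per row.")
    if sample_index is not None and len(sample_index) != n_cells:
        raise ValueError("Need one sample index per row.")
    return exprs

exprs = check_shapes(
    np.zeros((4, 3)),
    channels=["CD3", "CD4", "CD8"],
    cell_types=["T", "T", "B", "B"],
    sample_index=[0, 0, 1, 1],
)
```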

Raises
  • exceptions.ExpressionMatrixDimensionError – The expression matrix is not, or cannot be cast into, a two-dimensional array.

  • exceptions.DimensionMismatchError – The number of channel names does not agree with the number of columns of the expression matrix.

  • exceptions.DimensionMismatchError – The number of cell types does not agree with the number of rows of the expression matrix.

  • exceptions.DimensionMismatchError – The number of sample indices does not agree with the number of rows of the expression matrix.

Additional Attributes

  • reductions: A reductions object for dimension reduction using CytofDR.

__add__(new_object: PyCytoData) → PyCytoData
__getitem__(items: Union[slice, List[int], ndarray, Tuple[Union[slice, List[int], ndarray], Union[slice, List[int], ndarray]]]) → PyCytoData

The method to index elements of the PyCytoData object.

This method implements the bracket notation to index part of the object. The notation is mostly consistent with the numpy indexing notation, with a few exceptions, which are listed below. When indexing specific cells, the metadata are indexed accordingly.

A few deviations from the numpy notations:

  1. Integer indices are currently not supported. This is because indexing by an integer returns a 1-d array instead of a 2-d array, which can cause confusion.

  2. Indexing by two lists or arrays of different lengths is supported. They are treated as row and column indices, so exprs[[0,1,2], [3,4]] is valid and selects the first three cells with the fourth and fifth channels.

Tip

To index columns/channels by name, use the subset method instead.

Parameters

items (Union[int, slice, List[int], Tuple[Any, Any]]) – The indices for items.

Raises
  • IndexError – More than two indices present.

  • TypeError – Indexing by integer in either or both axes.

  • IndexError – A higher-dimensional array is used.

  • TypeError – Invalid indices type used.

Returns

An appropriately indexed PyCytoData object.

Return type

PyCytoData
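
The two deviations from numpy described for this method can be illustrated with plain arrays. This sketch does not use PyCytoData; it only shows the numpy behaviour the method departs from, with `np.ix_` standing in for the independent row and column selection described above.

```python
import numpy as np

exprs = np.arange(30).reshape(6, 5)  # 6 cells, 5 channels

# An integer index drops a dimension in numpy: the result is 1-d.
row = exprs[0]

# A slice keeps the result 2-d.
first_cell = exprs[0:1]

# In plain numpy, two index lists of different lengths cannot be broadcast.
try:
    exprs[[0, 1, 2], [3, 4]]
    numpy_accepts = True
except IndexError:
    numpy_accepts = False

# Selecting rows and columns independently is what np.ix_ provides.
block = exprs[np.ix_([0, 1, 2], [3, 4])]
```

Here `block` holds the first three cells restricted to the fourth and fifth channels, which matches the selection described in item 2.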

__iadd__(new_object: PyCytoData) → PyCytoData
__init__(expression_matrix: ArrayLike, channels: Optional[ArrayLike] = None, cell_types: Optional[ArrayLike] = None, sample_index: Optional[ArrayLike] = None, lineage_channels: Optional[ArrayLike] = None)
__len__() → int

The length of the PyCytoData object.

This method implements support for Python's built-in len function. It returns the total number of cells in the expression matrix.

Returns

The length of the object.

Return type

int

__str__() → str

String representation of the PyCytoData class.

This method returns a string containing the most basic metadata of the class along with the memory address.

Returns

The string representation of the class.

Return type

str

add_sample(expression_matrix: ArrayLike, sample_index: ArrayLike, cell_types: Optional[ArrayLike] = None)

Add another CyTOF sample from the same experiment.

This method allows users to combine samples into existing samples. The data must be in the same shape. Sample indices must be provided so that the class can properly index these samples using names.
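
Conceptually, adding a sample appends rows to the expression matrix and extends the per-cell metadata to match. A minimal NumPy sketch of that idea, assuming both samples share the same channels (the variable names are illustrative, and this is not the library's implementation):

```python
import numpy as np

# Existing data: 3 cells x 2 channels, all from sample "A".
exprs = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
sample_index = np.array(["A", "A", "A"])

# New sample: 2 cells with the same 2 channels, labelled "B".
new_exprs = np.array([[7.0, 8.0], [9.0, 10.0]])
new_sample_index = np.array(["B", "B"])

# The number of columns (channels) has to agree before stacking rows.
if exprs.shape[1] != new_exprs.shape[1]:
    raise ValueError("Both samples need the same number of channels.")

combined = np.concatenate([exprs, new_exprs], axis=0)
combined_index = np.concatenate([sample_index, new_sample_index])
n_samples = len(np.unique(combined_index))
```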

Parameters
  • expression_matrix (ArrayLike) – The expression matrix of the new sample.

  • sample_index (ArrayLike) – The sample indices to name the sample.

  • cell_types (Optional[ArrayLike], optional) – The cell types of each cell, defaults to None

Raises
property cell_types: ndarray

Getter for cell_types.

Returns

The cell types.

Return type

np.ndarray

property channels: ndarray

Getter for channels.

Returns

The channel names.

Return type

np.ndarray

property expression_matrix: ndarray

Getter for the expression matrix.

Returns

The expression matrix.

Return type

np.ndarray

get_channel_expressions(channels: ArrayLike) → Tuple[ndarray, ndarray]

Get the expressions of specific channels.

This method subsets the expression matrix with the specific channels specified and returns the expression matrix along with the channel names. As opposed to subset, this method is more useful for investigating the expressions themselves rather than subsetting the object as a whole.

Parameters

channels (Union[str, List[str]]) – The channel names to subset the data.

Raises
  • TypeError – The channels n

  • ValueError – The channels specified are not listed in the channel names.

Returns

A tuple of the expressions and the corresponding channel names.

Return type

Tuple[np.ndarray, np.ndarray]
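
The selection this method describes amounts to picking columns by name. A minimal NumPy sketch with a hypothetical helper, `select_channels`, which returns the columns in their original order (the library's ordering and error messages may differ):

```python
import numpy as np

exprs = np.array([[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]])
channels = np.array(["CD3", "CD4", "CD8"])

def select_channels(exprs, channels, wanted):
    """Hypothetical helper: return the requested columns and their names."""
    wanted = np.atleast_1d(np.asarray(wanted))
    missing = wanted[~np.isin(wanted, channels)]
    if missing.size > 0:
        raise ValueError(f"Channels not found: {missing.tolist()}")
    # Boolean mask over columns; the original column order is preserved.
    mask = np.isin(channels, wanted)
    return exprs[:, mask], channels[mask]

values, names = select_channels(exprs, channels, ["CD3", "CD8"])
```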

property lineage_channels: Optional[ndarray]

Getter for lineage_channels.

Returns

An array of lineage channels or None.

Return type

np.ndarray, optional

property n_cell_types: int

Getter for n_cell_types.

Returns

The number of cell types.

Return type

int

property n_cells: int

Getter for n_cells.

Returns

The number of cells.

Return type

int

property n_channels: int

Getter for n_channels.

Returns

The number of channels.

Return type

int

property n_samples: int

Getter for n_samples.

Returns

The number of samples.

Return type

int

preprocess(arcsinh: bool = False, gate_debris_removal: bool = False, gate_intact_cells: bool = False, gate_live_cells: bool = False, gate_center_offset_residual: bool = False, bead_normalization: bool = False, auto_channels: bool = True, bead_channels: Optional[ArrayLike] = None, time_channel: Optional[ArrayLike] = None, cor_channels: Optional[ArrayLike] = None, dead_channel: Optional[ArrayLike] = None, DNA_channels: Optional[ArrayLike] = None, cofactor: int = 5, cutoff_DNA_sd: float = 2, dead_cutoff_quantile: float = 0.03, cor_cutoff_quantile: float = 0.03, verbose: bool = True)

Preprocess the expression matrix.

This is a one-size-fits-all method to preprocess the CyTOF sample using the preprocess module. The preprocessing consists of the following steps:

  1. Arcsinh transformation.

  2. Gate to remove debris.

  3. Gate for intact cells.

  4. Gate for live cells.

  5. Gate for anomalies using center, offset, and residual channels.
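
For the first step, the arcsinh transformation commonly applied to CyTOF data divides each value by a cofactor before taking the inverse hyperbolic sine. A minimal sketch assuming that convention (the formula is not spelled out here, and `arcsinh_transform` is a hypothetical name):

```python
import numpy as np

def arcsinh_transform(exprs, cofactor=5):
    """Assumed convention: arcsinh(x / cofactor), applied element-wise."""
    return np.arcsinh(np.asarray(exprs, dtype=float) / cofactor)

raw = np.array([[0.0, 5.0], [50.0, 500.0]])
transformed = arcsinh_transform(raw, cofactor=5)
```

The transformation is roughly linear near zero and logarithmic for large values, so it compresses the long right tail of the raw intensities.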

Parameters
  • arcsinh (bool) – Whether to apply the arcsinh transformation, defaults to False.

  • gate_debris_removal (bool) – Whether to gate to remove debris, defaults to False.

  • gate_intact_cells (bool) – Whether to gate for intact cells, defaults to False.

  • gate_live_cells (bool) – Whether to gate for live cells, defaults to False.

  • gate_center_offset_residual (bool) – Whether to gate using center, offset, and residual channels, defaults to False.

  • bead_normalization (bool) – Whether to perform bead normalization, defaults to False.

  • auto_channels (bool) – Allow the method to recognize instrument and other non-lineage channels automatically. This can be overridden by specifying channels in bead_channels, time_channel, cor_channels, dead_channel, and DNA_channels, defaults to True.

  • bead_channels (ArrayLike, optional) – The bead channels as specified by name, defaults to None

  • time_channel (ArrayLike, optional) – The time channel as specified by name, defaults to None

  • cor_channels (ArrayLike, optional) – The Center, Offset, and Residual channels as specified by name, defaults to None

  • dead_channel (ArrayLike, optional) – The dead channels as specified by name, defaults to None

  • DNA_channels (ArrayLike, optional) – The DNA channels as specified by name, defaults to None

  • cofactor (int, optional) – The cofactor for the arcsinh transformation, defaults to 5.

  • cutoff_DNA_sd (float) – The standard deviation cutoff for DNA channels, measured as the number of standard deviations away from the mean, defaults to 2

  • dead_cutoff_quantile (float) – The cutoff quantile for dead channels. The top specified quantile will be excluded, defaults to 0.03

  • cor_cutoff_quantile (float) – The cutoff quantile for Center, Offset, and Residual channels. Both the top and bottom specified quantiles will be excluded, defaults to 0.03

  • verbose (bool) – Whether to print out progress, defaults to True.

Returns

The gated expression matrix.

Return type

np.ndarray
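
The quantile cutoffs can be read as follows: for the dead channel, cells in the top quantile are dropped; for the Center, Offset, and Residual channels, cells in both tails are dropped. A NumPy sketch of that reading on synthetic data; the helper names and the inclusive boundaries are assumptions, not the library's implementation:

```python
import numpy as np

def gate_top_quantile(values, cutoff_quantile=0.03):
    """Boolean mask keeping cells at or below the (1 - cutoff) quantile."""
    upper = np.quantile(values, 1 - cutoff_quantile)
    return values <= upper

def gate_both_tails(values, cutoff_quantile=0.03):
    """Boolean mask keeping cells between the lower and upper quantiles."""
    lower = np.quantile(values, cutoff_quantile)
    upper = np.quantile(values, 1 - cutoff_quantile)
    return (values >= lower) & (values <= upper)

channel = np.arange(100, dtype=float)  # synthetic intensities 0..99
keep_live = gate_top_quantile(channel, 0.03)
keep_cor = gate_both_tails(channel, 0.03)
```

Each mask can then be applied to the rows of an expression matrix, for example `exprs[keep_live]`.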

property reductions: Optional[dr.Reductions]

Getter for reductions.

Returns

A Reductions object or None.

Return type

CytofDR.dr.Reductions, optional

run_dr_methods(methods: Union[str, List[str]] = 'all', out_dims: int = 2, n_jobs: int = -1, verbose: bool = True, suppress_error_msg: bool = False)

Run dimension reduction methods.

This is a one-size-fits-all dispatcher that runs all supported methods in the module. It supports running multiple methods at the same time at the sacrifice of some more granular control of parameters. If you would like more customization, please use the CytofDR package directly.

Parameters
  • methods (Union[str, List[str]]) – DR methods to run (not case sensitive), defaults to 'all'.

  • out_dims (int) – Output dimension of DR, defaults to 2.

  • n_jobs (int) – The number of jobs to run when applicable, defaults to -1.

  • verbose (bool) – Whether to print out progress, defaults to True.

  • suppress_error_msg (bool) – Whether to suppress printed error messages, defaults to False.

Raises

ImportError – CytofDR is not installed.
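
The dispatcher idea can be sketched without CytofDR. Everything below is hypothetical: the registry, the PCA stand-in, and the choice to print errors and continue are assumptions made for illustration, not the behaviour of the real method.

```python
import numpy as np

def pca(exprs, out_dims):
    """Plain PCA via SVD, used here only as a stand-in DR method."""
    centered = exprs - exprs.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:out_dims].T

def broken(exprs, out_dims):
    raise RuntimeError("this method always fails")

METHODS = {"pca": pca, "broken": broken}  # hypothetical registry

def run_methods(exprs, methods="all", out_dims=2, suppress_error_msg=False):
    if methods == "all":
        names = list(METHODS)
    elif isinstance(methods, str):
        names = [methods]
    else:
        names = list(methods)
    results = {}
    for name in names:
        key = name.lower()  # method names are not case sensitive
        try:
            results[key] = METHODS[key](exprs, out_dims)
        except Exception as error:
            # One failing method does not stop the others.
            if not suppress_error_msg:
                print(f"{name} failed: {error}")
    return results

rng = np.random.default_rng(0)
embeddings = run_methods(
    rng.normal(size=(20, 5)), methods=["PCA", "broken"], suppress_error_msg=True
)
```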

property sample_index: ndarray

Getter for sample_index.

Returns

The sample index.

Return type

np.ndarray

subset(channels: Optional[ArrayLike] = None, sample: Optional[ArrayLike] = None, cell_types: Optional[ArrayLike] = None, not_in: bool = False, in_place: bool = True) → Optional[PyCytoData]

Subset the dataset with specific cell types or samples.

This method allows you to subset the data using channels, samples, or cell types. In terms of the expression matrix, channels subsets are operations on columns, whereas sample or cell type subsets are operations on rows.

Tip

To index specific channels and get the expression matrix instead of a PyCytoData object, use the get_channel_expressions method.

Tip

To subset by indices, use the [] syntax, which supports indexing similar to that of numpy.

Parameters
  • channels (Optional[ArrayLike], optional) – The names of the channels to perform subset, defaults to None.

  • sample (Optional[ArrayLike], optional) – The names of the samples to perform subset, defaults to None

  • cell_types (Optional[ArrayLike], optional) – The name of the cell types to perform subset, defaults to None

  • not_in (bool, optional) – Whether to filter out the provided cell types or samples, defaults to False

  • in_place (bool, optional) – Whether to perform the subset in place. If not, a new object will be created and returned, defaults to True.

Returns

A new PyCytoData object after subsetting, if the subset is not performed in place.

Return type

PyCytoData, optional

Raises

ValueError – The subset filters out all cells, leaving nothing in the expression matrix, which is unsupported.
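
Row subsetting by cell type or sample can be pictured as a boolean mask over cells, inverted when not_in is True. A minimal NumPy sketch with a hypothetical helper, `row_mask`:

```python
import numpy as np

def row_mask(labels, wanted, not_in=False):
    """Hypothetical helper: mask of rows whose label is (or is not) in `wanted`."""
    mask = np.isin(labels, np.atleast_1d(wanted))
    if not_in:
        mask = ~mask
    if not mask.any():
        raise ValueError("The subset would remove every cell.")
    return mask

exprs = np.arange(8).reshape(4, 2)
cell_types = np.array(["T", "B", "T", "NK"])

keep_t = row_mask(cell_types, "T")
drop_t = row_mask(cell_types, "T", not_in=True)
t_cells = exprs[keep_t]
```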

class PyCytoData.data.DataLoader

Bases: object

The class with utility functions to load datasets.

This class offers one public utility function to load datasets, load_dataset, which loads and preprocesses existing benchmark datasets. All other methods are private methods. Instantiation is not necessary.

classmethod load_dataset(dataset: str, sample: Optional[ArrayLike] = None, force_download: bool = False, preprocess: bool = False) → PyCytoData

Load benchmark datasets.

This method downloads and loads benchmark datasets. Each dataset is downloaded only once and then cached for future use. Currently, we support three datasets:

  • levine13

  • levine32

  • samusik

This method also supports specifying a specific sample instead of loading the entire dataset. Below is a list of samples available:

  • levine13: 0 (There is only one sample in this case)

  • levine32: AML08 and AML09.

  • samusik: 01, 02, …, 09, 10

Parameters
  • dataset (str) – The name of the dataset.

  • sample (ArrayLike, optional) – The specific sample to load from the dataset, defaults to None.

  • force_download (bool) – Whether to download dataset regardless of previous cache, defaults to False

  • preprocess (bool, optional) – Whether to automatically perform all the necessary preprocessing, defaults to False. In the case of the existing three datasets, preprocessing includes just arcsinh transformation with a cofactor of 5.

Returns

The loaded dataset.

Return type

PyCytoData
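
The download-once behaviour and the force_download flag follow a common caching pattern. The sketch below uses only the standard library and a fake download function; the file layout and helper names are hypothetical and say nothing about where the library actually stores its cache.

```python
import tempfile
from pathlib import Path

downloads = []  # records each simulated download

def fake_download(path):
    """Stand-in for a real download; writes a placeholder file."""
    downloads.append(path.name)
    path.write_text("placeholder data")

def load_cached(cache_dir, dataset, force_download=False):
    """Fetch `dataset` into `cache_dir` unless a cached copy already exists."""
    path = Path(cache_dir) / f"{dataset}.txt"
    if force_download or not path.exists():
        fake_download(path)
    return path.read_text()

with tempfile.TemporaryDirectory() as cache_dir:
    load_cached(cache_dir, "levine13")                       # downloads
    load_cached(cache_dir, "levine13")                       # served from cache
    load_cached(cache_dir, "levine13", force_download=True)  # downloads again
```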

class PyCytoData.data.FileIO

Bases: object

A utility class to handle common IO workflows for CyTOF data.

This class includes a few utility static methods to load and save CyTOF data. Currently, it includes the following methods:

  • load_delim

  • load_expression

  • save_2d_list_to_csv

  • save_np_array

Most of the methods are wrappers, but we offer a few advantages, such as returning PyCytoData objects and saving numpy arrays along with channel names. For detailed documentation, read the docstring for each method.

static load_delim(files: Union[List[str], str], skiprows: int = 0, drop_columns: Optional[Union[int, List[int]]] = None, delim: str = '\t', dtype: type = float, return_sample_indices: bool = False) → Union[ndarray, Tuple[ndarray, ndarray]]

Load delimited file(s) as a numpy array.

This method loads a delimited file and returns a numpy array. The file has to be a standard text file. It is essentially a wrapper for the np.loadtxt function, but it can also load a list of files at once, which are automatically concatenated.

Parameters
  • files (Union[List[str], str]) – The path (or a list of paths) to the files to be loaded.

  • skiprows (int, optional) – The number of rows to skip, defaults to 0.

  • drop_columns (Union[int, List[int]], optional) – The indices of the columns to be dropped, defaults to None.

  • delim (str, optional) – The delimiter to use, defaults to \t

  • dtype (type, optional) – The data type for the arrays, defaults to float.

  • return_sample_indices (bool, optional) – Whether to return the sample indices along with the array, defaults to False.

Raises

TypeError – The files is neither a string nor a list of strings.

Returns

An array or an array along with the sample indices.

Return type

Union[np.ndarray, Tuple[np.ndarray, np.ndarray]]
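
Loading several files and concatenating them, as described above, can be sketched directly with np.loadtxt. The helper `load_many` is hypothetical, and the integer sample indices (one per source file) are an assumption about what the returned indices look like.

```python
import os
import tempfile
import numpy as np

def load_many(files, skiprows=0, delim="\t", dtype=float):
    """Hypothetical helper: load one or more files and stack them row-wise."""
    if isinstance(files, str):
        files = [files]
    arrays, indices = [], []
    for i, path in enumerate(files):
        # ndmin=2 keeps single-row files two-dimensional.
        data = np.loadtxt(path, delimiter=delim, skiprows=skiprows, dtype=dtype, ndmin=2)
        arrays.append(data)
        indices.append(np.full(data.shape[0], i))
    return np.concatenate(arrays), np.concatenate(indices)

with tempfile.TemporaryDirectory() as tmp:
    paths = []
    for name, rows in [("a.txt", "1\t2\n3\t4\n"), ("b.txt", "5\t6\n")]:
        path = os.path.join(tmp, name)
        with open(path, "w") as handle:
            handle.write(rows)
        paths.append(path)
    exprs, sample_indices = load_many(paths)
```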

static load_expression(files: Union[List[str], str], col_names: bool = True, drop_columns: Optional[Union[int, List[int]]] = None, delim: str = '\t', dtype=float) → PyCytoData

Load a delimited text file as a PyCytoData object.

This method loads delimited file(s) and returns a PyCytoData object. The file has to be a standard text file containing the expression matrix. Rows are cells and columns are channels. If col_names is True, the first row of the file will be treated as channel names. If multiple file paths are present, they will be automatically concatenated into one object, and the sample indices will be recorded.

Parameters
  • files (Union[List[str], str]) – The path (or a list of paths) to the files to be loaded.

  • col_names (bool, optional) – Whether the first row is channel names, defaults to True.

  • drop_columns (Union[int, List[int]], optional) – The indices of the columns to be dropped, defaults to None.

  • delim (str, optional) – The delimiter to use, defaults to \t

  • dtype (type, optional) – The data type for the arrays, defaults to float.

Raises

TypeError – The files is neither a string nor a list of strings.

Returns

A PyCytoData object.

Return type

PyCytoData
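
Treating the first row as channel names can be sketched as reading the header separately and parsing the remaining rows as numbers. This uses an in-memory string and plain NumPy rather than the method itself:

```python
import io
import numpy as np

text = "CD3\tCD4\n1.0\t2.0\n3.0\t4.0\n"

# The first row holds the channel names.
header, _, _ = text.partition("\n")
channels = np.array(header.split("\t"))

# The remaining rows form the expression matrix (cells x channels).
exprs = np.loadtxt(io.StringIO(text), delimiter="\t", skiprows=1, ndmin=2)
```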

static save_2d_list_to_csv(data: List[List[Any]], path: str, overwrite: bool = False)

Save a nested list to a CSV file.

Parameters
  • data (List[List[Any]]) – The nested list to be written to disk

  • path (str) – Path to save the CSV file

  • overwrite (bool, optional) – Whether to overwrite an existing file, defaults to False

Note

By default, this method does not overwrite existing files. In case a file exists, a FileExistsError is thrown.
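
The overwrite guard described in the note can be sketched with the standard csv module. The helper `save_rows` is hypothetical and only mirrors the documented behaviour:

```python
import csv
import os
import tempfile

def save_rows(data, path, overwrite=False):
    """Hypothetical helper: write a nested list to CSV, guarding existing files."""
    if os.path.exists(path) and not overwrite:
        raise FileExistsError(path)
    with open(path, "w", newline="") as handle:
        csv.writer(handle).writerows(data)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "out.csv")
    save_rows([["cell", "type"], [1, "T"]], path)
    try:
        save_rows([["x"]], path)  # the file already exists
        second_write_blocked = False
    except FileExistsError:
        second_write_blocked = True
    with open(path, newline="") as handle:
        rows = list(csv.reader(handle))
```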

static save_np_array(array: ndarray, path: str, col_names: Optional[ndarray] = None, dtype: str = '%.18e', overwrite: bool = False) → None

Save a NumPy array to a plain text file.

Parameters
  • array (np.ndarray) – The NumPy array to be saved

  • path (str) – Path to save the plain text file

  • col_names (np.ndarray, optional) – Column names to be saved as the first row, defaults to None

  • dtype (str, optional) – The format string used to write the values, defaults to “%.18e”

  • overwrite (bool, optional) – Whether to overwrite an existing file, defaults to False

Note

By default, this method does not overwrite existing files. In case a file exists, a FileExistsError is thrown.
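
Writing column names as the first row can be done with np.savetxt by passing them as the header and disabling the comment prefix. The helper `save_array` is hypothetical, and the tab delimiter is an assumption rather than the documented output format:

```python
import os
import tempfile
import numpy as np

def save_array(array, path, col_names=None, fmt="%.18e", overwrite=False):
    """Hypothetical helper: save an array with an optional header row."""
    if os.path.exists(path) and not overwrite:
        raise FileExistsError(path)
    header = "\t".join(col_names) if col_names is not None else ""
    # comments="" keeps np.savetxt from prefixing the header with "# ".
    np.savetxt(path, array, fmt=fmt, delimiter="\t", header=header, comments="")

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "exprs.txt")
    save_array(np.array([[1.0, 2.0]]), path, col_names=np.array(["CD3", "CD4"]), fmt="%.2f")
    with open(path) as handle:
        lines = handle.read().splitlines()
```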