Module: PyCytoData.data
- class PyCytoData.data.PyCytoData(expression_matrix: ArrayLike, channels: Optional[ArrayLike] = None, cell_types: Optional[ArrayLike] = None, sample_index: Optional[ArrayLike] = None, lineage_channels: Optional[ArrayLike] = None)[source]
Bases:
object
The CytoData Class for handling CyTOF data.
This is an all-purpose data class for handling CyTOF data. It is compatible with benchmark datasets downloaded from the DataLoader class as well as users' own CyTOF datasets. It has wide-ranging functionality, including preprocessing, dimension reduction (DR), and much more.
- Parameters
expression_matrix (ArrayLike) – The expression matrix for the CyTOF sample. Rows are cells and columns are channels.
channels (ArrayLike) – The name of the channels, defaults to None
cell_types (ArrayLike) – The cell types of the cells, defaults to None
sample_index (ArrayLike) – The indices or names that indicate the sample of each cell. This allows the combination of multiple samples into one class, defaults to None
lineage_channels (ArrayLike) – The names of lineage channels, defaults to None
- Raises
exceptions.ExpressionMatrixDimensionError – The expression matrix is not or cannot be cast into a two dimensional array.
exceptions.DimensionMismatchError – The number of channel names does not agree with the number of columns of the expression matrix.
exceptions.DimensionMismatchError – The number of cell types for all cells does not agree with the number of rows of the expression matrix.
exceptions.DimensionMismatchError – The number of sample indices does not agree with the number of rows of the expression matrix.
- Additional Attributes
reductions: A reductions object for dimension reduction using CytofDR.
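The dimension checks listed under Raises can be sketched with plain numpy. This is only an illustration of the rules, not the class's own code: `check_dimensions` is a hypothetical helper, and `ValueError` stands in for the package's `ExpressionMatrixDimensionError` and `DimensionMismatchError`.

```python
import numpy as np


def check_dimensions(expression_matrix, channels=None, cell_types=None, sample_index=None):
    """Hypothetical helper mirroring the checks described under Raises."""
    exprs = np.asarray(expression_matrix)
    if exprs.ndim != 2:
        # Stand-in for exceptions.ExpressionMatrixDimensionError.
        raise ValueError("The expression matrix must be two-dimensional.")
    n_cells, n_channels = exprs.shape
    # Stand-ins for exceptions.DimensionMismatchError.
    if channels is not None and len(channels) != n_channels:
        raise ValueError("channels must match the number of columns.")
    if cell_types is not None and len(cell_types) != n_cells:
        raise ValueError("cell_types must match the number of rows.")
    if sample_index is not None and len(sample_index) != n_cells:
        raise ValueError("sample_index must match the number of rows.")
    return exprs


exprs = check_dimensions(
    [[0.1, 2.0, 3.5], [1.2, 0.0, 4.1]],   # 2 cells x 3 channels
    channels=["CD3", "CD4", "CD8"],
    cell_types=["T", "T"],
    sample_index=[0, 0],
)
print(exprs.shape)
```

Rows are cells and columns are channels, so `channels` is checked against the column count while `cell_types` and `sample_index` are checked against the row count.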
- __add__(new_object: PyCytoData) PyCytoData [source]
- __getitem__(items: Union[slice, List[int], ndarray, Tuple[Union[slice, List[int], ndarray], Union[slice, List[int], ndarray]]]) PyCytoData [source]
The method to index elements of the PyCytoData object.
This method implements bracket notation for indexing the object. The notation is mostly consistent with numpy indexing, with a few exceptions, which are listed below. When indexing specific cells, the metadata are indexed accordingly.
A few deviations from the numpy notation:
- Integer indices are currently not supported. This is because indexing by integer returns a 1-d array instead of a 2-d array, which can cause confusion.
- Indexing by two lists or arrays with different lengths is supported. They are treated as row and column indices, so exprs[[0,1,2], [3,4]] is perfectly valid and selects the first 3 cells with the fourth and fifth channels.
Tip
To index columns/channels by name, use the subset method instead.
- Parameters
items (Union[int, slice, List[int], Tuple[Any, Any]]) – The indices for items.
- Raises
IndexError – Two or more indices present.
TypeError – Indexing by integer in either or both axes.
IndexError – A higher-dimensional array is used.
TypeError – Invalid indices type used.
- Returns
An appropriately indexed PyCytoData object.
- Return type
PyCytoData
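The two deviations from numpy indexing can be illustrated on a bare array. The sketch below uses only numpy and made-up data. It shows the row-by-column selection that the description attributes to `exprs[[0,1,2], [3,4]]`, written with `np.ix_`, which is an assumption about how to reproduce the behavior rather than the class's actual implementation.

```python
import numpy as np

exprs = np.arange(20).reshape(4, 5)          # 4 cells x 5 channels
cell_types = np.array(["A", "B", "A", "C"])  # one label per cell

rows, cols = [0, 1, 2], [3, 4]

# Plain numpy tries to broadcast the two lists against each other and raises
# an IndexError because their lengths differ. np.ix_ builds the row-by-column
# selection described above instead.
sub = exprs[np.ix_(rows, cols)]

# Per-cell metadata follows the row indices only.
sub_cell_types = cell_types[rows]

# An integer index drops a dimension, while a slice keeps the result 2-d.
print(exprs[0].shape)
print(exprs[0:1].shape)
print(sub)
```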
- __iadd__(new_object: PyCytoData) PyCytoData [source]
- __init__(expression_matrix: ArrayLike, channels: Optional[ArrayLike] = None, cell_types: Optional[ArrayLike] = None, sample_index: Optional[ArrayLike] = None, lineage_channels: Optional[ArrayLike] = None)[source]
- __len__() int [source]
The length of the PyCytoData Class.
This method implements the builtin Python len function for the class. It returns the total number of cells in the expression matrix.
- Returns
The length of the object.
- Return type
int
- __module__ = 'PyCytoData.data'
- __str__() str [source]
String representation of the PyCytoData class.
This method returns a string containing the most basic metadata of the class along with the memory address.
- Returns
The string representation of the class.
- Return type
str
- __weakref__
list of weak references to the object (if defined)
- add_sample(expression_matrix: ArrayLike, sample_index: ArrayLike, cell_types: Optional[ArrayLike] = None)[source]
Add another CyTOF sample from the same experiment.
This method allows users to combine samples into existing samples. The data must be in the same shape. Sample indices must be provided so that the class can properly index these samples using names.
- Parameters
expression_matrix (ArrayLike) – The expression matrix of the new sample.
sample_index (ArrayLike) – The sample indices to name the sample.
cell_types (Optional[ArrayLike], optional) – The cell types of each cell, defaults to None
- Raises
exceptions.ExpressionMatrixDimensionError – The expression matrix cannot be cast
exceptions.DimensionMismatchError – The number of sample indices
exceptions.DimensionMismatchError – _description_
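As a rough picture of what adding a sample involves, the sketch below stacks a new expression matrix under an existing one and extends the sample index to match. It assumes that "the same shape" means the same number of channels. All arrays are made-up data, and the code is not the method's implementation.

```python
import numpy as np

# Existing data: 2 cells x 2 channels, both cells from sample "s1".
exprs = np.array([[1.0, 2.0], [3.0, 4.0]])
sample_index = np.array(["s1", "s1"])

# New sample: 1 cell x 2 channels, with one sample index per new cell.
new_exprs = np.array([[5.0, 6.0]])
new_sample_index = np.array(["s2"])

# Assumption: both matrices must share the same channels (columns).
if new_exprs.shape[1] != exprs.shape[1]:
    raise ValueError("The new sample must have the same number of channels.")
if len(new_sample_index) != new_exprs.shape[0]:
    raise ValueError("One sample index is needed for every new cell.")

combined = np.concatenate((exprs, new_exprs), axis=0)
combined_index = np.concatenate((sample_index, new_sample_index))
n_samples = len(np.unique(combined_index))
print(combined.shape, n_samples)
```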
- property cell_types: ndarray
Getter for cell_types.
- Returns
The cell types.
- Return type
np.ndarray
- property channels: ndarray
Getter for channels.
- Returns
The channel names.
- Return type
np.ndarray
- property expression_matrix: ndarray
Getter for the expression matrix.
- Returns
The expression matrix.
- Return type
np.ndarray
- get_channel_expressions(channels: ArrayLike) Tuple[ndarray, ndarray] [source]
Get the expressions of specific channels.
This method subsets the expression matrix with the specified channels and returns the expression matrix along with the channel names. As opposed to subset, this method is more useful for investigating the expressions themselves rather than subsetting the object as a whole.
- Parameters
channels (Union[str, List[str]]) – The channel names to subset the data.
- Raises
TypeError – The channels n
ValueError – The channels specified are not listed in the channel names.
- Returns
A tuple of the expressions and the corresponding channel names.
- Return type
Tuple[np.ndarray, np.ndarray]
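Selecting channels by name comes down to a column mask. The following numpy sketch uses made-up channel names and reproduces the described result (the expressions plus the matching names) along with the ValueError case. It is an illustration only, and the column order it returns (the matrix's original order) is an assumption.

```python
import numpy as np

exprs = np.array([[0.5, 1.0, 2.0], [1.5, 0.0, 3.0]])  # 2 cells x 3 channels
channels = np.array(["CD3", "CD4", "CD8"])
wanted = ["CD3", "CD8"]

# Every requested name has to exist in the channel names.
if not np.all(np.isin(wanted, channels)):
    raise ValueError("Some channels are not listed in the channel names.")

# Boolean column mask; columns come back in the matrix's original order.
mask = np.isin(channels, wanted)
selected = exprs[:, mask]
selected_names = channels[mask]
print(selected_names, selected.shape)
```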
- property lineage_channels: Optional[ndarray]
Getter for lineage_channels.
- Returns
An array of lineage channels or None.
- Return type
np.ndarray, optional
- property n_cell_types: int
Getter for n_cell_types.
- Returns
The number of cell types.
- Return type
int
- property n_cells: int
Getter for n_cells.
- Returns
The number of cells.
- Return type
int
- property n_channels: int
Getter for n_channels.
- Returns
The number of channels.
- Return type
int
- property n_samples: int
Getter for n_samples.
- Returns
The number of samples.
- Return type
int
- preprocess(arcsinh: bool = False, gate_debris_removal: bool = False, gate_intact_cells: bool = False, gate_live_cells: bool = False, gate_center_offset_residual: bool = False, bead_normalization: bool = False, auto_channels: bool = True, bead_channels: Optional[ArrayLike] = None, time_channel: Optional[ArrayLike] = None, cor_channels: Optional[ArrayLike] = None, dead_channel: Optional[ArrayLike] = None, DNA_channels: Optional[ArrayLike] = None, cofactor: int = 5, cutoff_DNA_sd: float = 2, dead_cutoff_quantile: float = 0.03, cor_cutoff_quantile: float = 0.03, verbose: bool = True)[source]
Preprocess the expression matrix.
This is a one-size-fits-all method to preprocess the CyTOF sample using the preprocess module. The preprocessing consists of the following steps:
Arcsinh transformation.
Gate to remove debris.
Gate for intact cells.
Gate for live cells.
Gate for anomalies using center, offset, and residual channels.
- Parameters
arcsinh (bool) – Whether to apply the arcsinh transformation, defaults to False.
gate_debris_removal (bool) – Whether to gate to remove debris, defaults to False.
gate_intact_cells (bool) – Whether to gate for intact cells, defaults to False.
gate_live_cells (bool) – Whether to gate for live cells, defaults to False.
gate_center_offset_residual (bool) – Whether to gate using center, offset, and residual channels, defaults to False.
bead_normalization (bool) – Whether to perform bead normalization, defaults to False.
auto_channels (bool) – Allow the method to recognize instrument and other non-lineage channels automatically. This can be overridden by specifying channels in bead_channels, time_channel, cor_channels, dead_channel, and DNA_channels, defaults to True.
bead_channels (ArrayLike, optional) – The bead channels as specified by name, defaults to None
time_channel (ArrayLike, optional) – The time channel as specified by name, defaults to None
cor_channels (ArrayLike, optional) – The Center, Offset, and Residual channels as specified by name, defaults to None
dead_channel (ArrayLike, optional) – The dead channels as specified by name, defaults to None
DNA_channels (ArrayLike, optional) – The DNA channels as specified by name, defaults to None
cofactor (int, optional) – The cofactor for the arcsinh transformation, defaults to 5.
cutoff_DNA_sd (float) – The standard deviation cutoff for DNA channels, measured as the number of standard deviations away from the mean, defaults to 2
dead_cutoff_quantile (float) – The cutoff quantiles for dead channels. The top specified quantile will be excluded, defaults to 0.03
cor_cutoff_quantile (float) – The cutoff quantiles for Center, Offset, and Residual channels. Both the top and bottom specified quantiles will be excluded, defaults to 0.03
verbose (bool) – Whether to print out progress, defaults to True.
- Returns
The gated expression matrix.
- Return type
np.ndarray
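Two of the parameters above, cofactor and dead_cutoff_quantile, can be made concrete with a small numpy sketch. It assumes the conventional CyTOF form arcsinh(x / cofactor) and a simple "drop the top quantile" gate. The source names the parameters but does not spell out the formulas, so both are assumptions, and the data are made up.

```python
import numpy as np

# Arcsinh transformation with the default cofactor of 5 (assumed form).
cofactor = 5
raw = np.array([[0.0, 5.0], [10.0, 50.0], [100.0, 500.0]])
transformed = np.arcsinh(raw / cofactor)

# Quantile gate in the spirit of dead_cutoff_quantile: cells whose dead-channel
# value falls in the top 3% are excluded.
dead = np.arange(100, dtype=float)      # made-up dead-channel values, one per cell
dead_cutoff_quantile = 0.03
threshold = np.quantile(dead, 1 - dead_cutoff_quantile)
keep = dead <= threshold

print(transformed[0])
print(int(keep.sum()), "of", dead.size, "cells kept")
```

A two-sided gate such as the one described for cor_cutoff_quantile would apply the same idea at both `q` and `1 - q`.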
- property reductions: Optional[dr.Reductions]
Getter for reductions.
- Returns
A Reductions object or None.
- Return type
CytofDR.dr.Reductions, optional
- run_dr_methods(methods: Union[str, List[str]] = 'all', out_dims: int = 2, n_jobs: int = -1, verbose: bool = True, suppress_error_msg: bool = False)[source]
Run dimension reduction methods.
This is a one-size-fits-all dispatcher that runs all supported methods in the module. It supports running multiple methods at the same time at the cost of some granular control over parameters. If you would like more customization, please use the CytofDR package directly.
- Parameters
methods (Union[str, List[str]]) – DR methods to run (not case sensitive).
out_dims (int) – Output dimension of DR.
n_jobs (int) – The number of jobs to run when applicable, defaults to -1.
verbose (bool) – Whether to print out progress, defaults to True.
suppress_error_msg (bool) – Whether to suppress printed error messages, defaults to False.
- Raises
ImportError – CytofDR is not installed.
- property sample_index: ndarray
Getter for sample_index.
- Returns
The sample index.
- Return type
np.ndarray
- subset(channels: Optional[ArrayLike] = None, sample: Optional[ArrayLike] = None, cell_types: Optional[ArrayLike] = None, not_in: bool = False, in_place: bool = True) Optional[PyCytoData] [source]
Subset the dataset with specific cell types or samples.
This method allows you to subset the data using channels, samples, or cell types. In terms of the expression matrix, channels subsets are operations on columns, whereas sample or cell type subsets are operations on rows.
Tip
To index specific channels and get the expression matrix instead of a PyCytoData object, use the get_channel_expressions method.
Tip
To subset by indices, use the [] syntax, which supports indexing similar to that of numpy.
- Parameters
channels (Optional[ArrayLike], optional) – The names of the channels to perform subset, defaults to None.
sample (Optional[ArrayLike], optional) – The names of the samples to perform subset, defaults to None
cell_types (Optional[ArrayLike], optional) – The name of the cell types to perform subset, defaults to None
not_in (bool, optional) – Whether to filter out the provided cell types or samples, defaults to False
in_place (bool, optional) – Whether to perform the subset in place. If not, a new object will be created and returned, defaults to True.
- Returns
A new PyCytoData object after subsetting, or None if the subset is performed in place.
- Return type
PyCytoData, optional
- Raises
ValueError – Filtering out all cells with nothing in the expression matrix, which is unsupported.
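The row-wise part of subsetting, including the effect of not_in, can be sketched as a boolean mask over the cell-level metadata. `row_mask` is a hypothetical helper written for this illustration, and the data are made up. It shows the idea, not the method's implementation.

```python
import numpy as np

exprs = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])   # 3 cells x 2 channels
cell_types = np.array(["B cell", "T cell", "B cell"])


def row_mask(labels, wanted, not_in=False):
    """Hypothetical helper: True for rows to keep; not_in=True inverts the match."""
    return np.isin(labels, wanted, invert=not_in)


keep = row_mask(cell_types, ["B cell"])                # keep only B cells
drop = row_mask(cell_types, ["B cell"], not_in=True)   # keep everything except B cells

# An empty result corresponds to the ValueError case described above.
if not keep.any():
    raise ValueError("Filtering out all cells is unsupported.")

print(exprs[keep])
print(exprs[drop])
```

Subsetting by sample works the same way with the sample index in place of the cell types, while channel subsets apply the mask to columns instead of rows.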
- class PyCytoData.data.DataLoader[source]
Bases:
object
The class with utility functions to load datasets.
This class offers one public utility function to load datasets, load_dataset, which loads and preprocesses existing benchmark datasets. All other methods are private methods. Instantiation is not necessary.
- classmethod load_dataset(dataset: str, sample: Optional[ArrayLike] = None, force_download: bool = False, preprocess: bool = False) PyCytoData [source]
Load benchmark datasets.
This method downloads and loads benchmark datasets. Each dataset is downloaded only once and then cached for future use. Currently, we support three datasets:
levine13
levine32
samusik
This method also supports specifying a specific sample instead of loading the entire dataset. Below is a list of samples available:
levine13: 0 (There is only one sample in this case)
levine32: AML08 and AML09.
samusik: 01, 02, …, 09, 10
- Parameters
dataset (str) – The name of the dataset.
sample (ArrayLike, optional) – The specific sample to load from the dataset, defaults to None.
force_download (bool) – Whether to download dataset regardless of previous cache, defaults to False
preprocess (bool, optional) – Whether to automatically perform all the necessary preprocessing, defaults to False. In the case of the existing three datasets, preprocessing includes just arcsinh transformation with a cofactor of 5.
- Returns
The loaded dataset.
- Return type
PyCytoData
- class PyCytoData.data.FileIO[source]
Bases:
object
A utility class to handle common IO workflows for CyTOF data.
This class includes a few utility static methods to load and save CyTOF data. Currently, it includes the following methods:
load_delim
load_expression
save_2d_list_to_csv
save_np_array
Most of the methods are wrappers, but we offer a few advantages, such as returning PyCytoData objects and saving numpy arrays along with channel names. For detailed documentation, read the docstring for each method.
- static load_delim(files: Union[List[str], str], skiprows: int = 0, drop_columns: Optional[Union[int, List[int]]] = None, delim: str = '\t', dtype: type = float, return_sample_indices: bool = False) Union[ndarray, Tuple[ndarray, ndarray]] [source]
Load delimited file(s) as a numpy array.
This method loads a delimited file and returns a numpy array. The file has to be a standard text file. It is essentially a wrapper for the np.loadtxt function, but we offer the functionality of loading a list of files all at once, which are automatically concatenated.
- Parameters
files (Union[List[str], str]) – The path (or a list of paths) to the files to be loaded.
skiprows (int, optional) – The number of rows to skip, defaults to 0.
drop_columns (Union[int, List[int]], optional) – The indices of the columns to be dropped, defaults to None.
delim (str, optional) – The delimiter to use, defaults to \t
dtype (type, optional) – The data type for the arrays, defaults to float.
return_sample_indices (bool, optional) – Whether to also return the sample indices, defaults to False.
- Raises
TypeError – files is neither a string nor a list of strings.
- Returns
An array or an array along with the sample indices.
- Return type
Union[np.ndarray, Tuple[np.ndarray, np.ndarray]]
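The "load several files and concatenate them" behavior can be approximated with np.loadtxt directly. In the sketch below, `load_many` is a hypothetical helper, the in-memory StringIO objects stand in for files on disk, and labeling each row with the position of its source file is an assumption about what the sample indices contain.

```python
import io

import numpy as np


def load_many(sources, skiprows=0, delim="\t", dtype=float):
    """Hypothetical helper: load each source with np.loadtxt and stack the rows."""
    arrays, indices = [], []
    for i, src in enumerate(sources):
        # ndmin=2 keeps single-row files two-dimensional.
        arr = np.loadtxt(src, delimiter=delim, skiprows=skiprows, dtype=dtype, ndmin=2)
        arrays.append(arr)
        # Assumption: every row is labeled with the position of its source file.
        indices.append(np.full(arr.shape[0], i))
    return np.concatenate(arrays, axis=0), np.concatenate(indices)


# Two tab-delimited "files", each with a header row that is skipped.
file_a = io.StringIO("CD3\tCD4\n1.0\t2.0\n3.0\t4.0\n")
file_b = io.StringIO("CD3\tCD4\n5.0\t6.0\n")

exprs, sample_indices = load_many([file_a, file_b], skiprows=1)
print(exprs.shape, sample_indices)
```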
- static load_expression(files: Union[List[str], str], col_names: bool = True, drop_columns: Optional[Union[int, List[int]]] = None, delim: str = '\t', dtype=float) PyCytoData [source]
Load a delimited text file as a PyCytoData object.
This method loads delimited file(s) and returns a PyCytoData object. The file has to be a standard text file containing the expression matrix. Rows are cells and columns are channels. If col_names is True, the first row of the file will be treated as channel names. If multiple file paths are present, they will be automatically concatenated into one object, but the sample indices will be recorded.
- Parameters
files (Union[List[str], str]) – The path (or a list of paths) to the files to be loaded.
col_names (bool, optional) – Whether the first row is channel names, defaults to True.
drop_columns (Union[int, List[int]], optional) – The indices of the columns to be dropped, defaults to None.
delim (str, optional) – The delimiter to use, defaults to \t
dtype (type, optional) – The data type for the arrays, defaults to float.
- Raises
TypeError – files is neither a string nor a list of strings.
- Returns
A PyCytoData object.
- Return type
PyCytoData
- static save_2d_list_to_csv(data: List[List[Any]], path: str, overwrite: bool = False)[source]
Save a nested list to a CSV file.
- Parameters
data (List[List[Any]]) – The nested list to be written to disk
path (str) – Path to save the CSV file
Note
By default, this method does not overwrite existing files. If the file already exists, a FileExistsError is raised.
- static save_np_array(array: ndarray, path: str, col_names: Optional[ndarray] = None, dtype: str = '%.18e', overwrite: bool = False) None [source]
Save a NumPy array to a plain text file.
- Parameters
array (np.ndarray) – The NumPy array to be saved
path (str) – Path to save the plain text file
col_names (np.ndarray, optional) – Column names to be save as the first row, defaults to None
dtype (str, optional) – NumPy data type, defaults to “%.18e”
Note
By default, this method does not overwrite existing files. If the file already exists, a FileExistsError is raised.
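The combination of column names in the first row and the no-overwrite default can be sketched with np.savetxt. `save_array` is a hypothetical helper, and the tab delimiter and the use of the `dtype` string as a savetxt format are assumptions. The example writes into a temporary directory so that nothing is left on disk.

```python
import os
import tempfile

import numpy as np


def save_array(array, path, col_names=None, fmt="%.18e", overwrite=False):
    """Hypothetical helper: save an array, optionally with a header row."""
    if os.path.exists(path) and not overwrite:
        raise FileExistsError(path)
    header = "" if col_names is None else "\t".join(col_names)
    # comments="" keeps savetxt from prefixing the header row with "# ".
    np.savetxt(path, array, delimiter="\t", header=header, comments="", fmt=fmt)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "exprs.txt")
    save_array(np.array([[1.0, 2.0]]), path, col_names=["CD3", "CD4"])
    with open(path) as f:
        first_line = f.readline().strip()
    # A second save to the same path is refused unless overwrite=True.
    try:
        save_array(np.array([[1.0, 2.0]]), path)
        refused = False
    except FileExistsError:
        refused = True

print(first_line, refused)
```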