The PyCytoData Object
The PyCytoData object is the workhorse that carries you through every step of the way! If you’re familiar with Seurat and its signature workflow, you should feel more or less at ease here: we aim to use one single object for all your needs! Actions are exposed as methods (verbs), whereas the metadata can be accessed via attributes.
This tutorial walks you through the details of the class and how you can best utilize all its neat features beyond the most common usage as shown in the quickstart guide.
Creating PyCytoData Object
In general, there are three ways to obtain a PyCytoData object:
You load a benchmark dataset using the DataLoader.load_dataset method.
You load an expression matrix with the FileIO.load_expression method.
You create your own object using the constructor.
While the first two approaches are much more common, we will walk you through each of the three methods.
Loading Benchmark Dataset
This is by far the easiest and the most common method:
>>> from PyCytoData import DataLoader
>>> exprs = DataLoader.load_dataset(dataset="levine13")
The exprs object here is a PyCytoData object. The details of how this method works with benchmark datasets are included in the FileIO and Datasets section.
Loading Expression Matrix
If you have an existing expression matrix stored in plain text (not fcs, which is currently unsupported), you can easily load it into a PyCytoData object as well:
>>> from PyCytoData import FileIO
>>> exprs = FileIO.load_expression(files="<path>", col_names=True, delim="\t")
Again, exprs is a PyCytoData object. The load_expression method has more features, which are documented thoroughly in the FileIO and Datasets section as well.
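To make the expected input more concrete, here is a minimal sketch of a tab-delimited file with a header row, parsed by hand with the standard library only. The file contents, the channel names, and the helper read_expression_text are hypothetical; they illustrate the layout suggested by col_names=True and delim="\t", and they are not how FileIO.load_expression is implemented.

```python
import csv
import os
import tempfile

# Hypothetical toy data: a header row of channel names followed by numeric rows.
rows = [
    ["CD3", "CD4", "CD8"],
    ["0.5", "1.2", "0.0"],
    ["2.1", "0.3", "4.4"],
]


def read_expression_text(path, delim="\t"):
    """Parse a delimited text file whose first row holds the column names."""
    with open(path, newline="") as handle:
        reader = csv.reader(handle, delimiter=delim)
        col_names = next(reader)
        values = [[float(x) for x in line] for line in reader]
    return col_names, values


# Write the toy file to a temporary directory, then read it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "expression.txt")
    with open(path, "w", newline="") as handle:
        csv.writer(handle, delimiter="\t").writerows(rows)
    col_names, values = read_expression_text(path)

print(col_names)
print(values)
```

A file laid out this way is the kind of input the load_expression call above is meant for, with the header row supplying the channel names.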
Using PyCytoData Constructor
If you already have your data as arrays, you can of course construct an object:
>>> from PyCytoData import PyCytoData
>>> exprs = PyCytoData(expression_matrix=expression_matrix,
... channels=channels,
... cell_types=cell_types,
... sample_index=sample_index,
... lineage_channels=lineage_channels)
Here, you can supply all the information on your CyTOF data. The only mandatory component is the expression_matrix. All parameters should be ArrayLike, but they don’t have to be numpy arrays. The expression matrix should naturally consist of numeric values, whereas the others are expected to be strings (or, in some cases, integers).
In some cases, your data include non-protein channels, such as beads, Time, or other instrument channels. In this case, you can specify lineage_channels to indicate which channels are actually protein channels rather than supplementary ones, so that PyCytoData can use the appropriate channels in downstream analyses.
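As a small illustration of preparing this argument, the sketch below derives a lineage channel list by excluding instrument channels from the full channel list. The channel names and the set of instrument channels are made up for the example; your own panel will differ.

```python
# Hypothetical channel names: three protein markers plus instrument channels.
channels = ["Time", "CD3", "CD4", "Bead", "CD8"]

# Assumption: these are the non-protein channels in this made-up panel.
instrument_channels = {"Time", "Bead"}

# Keep every channel that is not an instrument channel, preserving order.
lineage_channels = [name for name in channels if name not in instrument_channels]

print(lineage_channels)
```

The resulting list can then be passed as the lineage_channels argument of the constructor shown above.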
Object Attributes and Metadata
One of the beauties of working with a PyCytoData object is that it offers a whole set of built-in metadata and attributes you can work with. Some of them are automatically added to the object, whereas others you will need to supply. In any case, they will allow you to get any information you need from your dataset.
Metadata
The metadata are basically statistics that are calculated from your dataset. You can easily get the number of cells, channels, cell types, etc. Here, we will use a builtin benchmark dataset as an example:
>>> from PyCytoData import DataLoader
>>> exprs = DataLoader.load_dataset(dataset="levine32")
After loading levine32, you can look at some of its statistics:
>>> exprs.n_cells
265627
>>> exprs.n_cell_types
15
>>> exprs.n_samples
2
>>> exprs.n_channels
39
Note
There are 39 channels for levine32 because instrument channels are included.
The metadata are automatically updated with changes and operations to the expression matrix, such as subsetting, concatenating, etc. You can always be sure that the metadata are up to date.
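To see how such statistics relate to the underlying arrays, here is a minimal NumPy sketch on toy data. It assumes the counts correspond to the shape of the expression matrix and to the number of unique labels; the arrays are hypothetical, and this is not the library's own implementation.

```python
import numpy as np

# Toy data standing in for the arrays held by an object.
expression_matrix = np.zeros((6, 3))  # 6 cells, 3 channels
cell_types = np.array(["B", "T", "T", "NK", "B", "T"])
sample_index = np.array([0, 0, 0, 1, 1, 1])

# Rows are cells and columns are channels.
n_cells = expression_matrix.shape[0]
n_channels = expression_matrix.shape[1]

# Counts of distinct labels.
n_cell_types = len(np.unique(cell_types))
n_samples = len(np.unique(sample_index))

print(n_cells, n_channels, n_cell_types, n_samples)
```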
Attribute: channels
This attribute stores all the channel names. If no channel names are given during instantiation, names are automatically generated using the "Channel" + int convention. Namely, the first channel will be “Channel0” and the rest are numbered sequentially. If channels are given or built into benchmark datasets, they’re stored as is in this attribute.
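The naming convention can be sketched in a couple of lines. The helper default_channel_names is hypothetical and only mimics the "Channel" + int pattern described here.

```python
def default_channel_names(n_channels):
    """Generate names following the "Channel" + int convention, starting at 0."""
    return [f"Channel{i}" for i in range(n_channels)]


names = default_channel_names(4)
print(names)
```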
Attribute: lineage_channels
This attribute denotes all the lineage channels, which are a subset of the channels. Lineage channels are protein channels typically used for analyses. They are stored separately to differentiate them from instrument channels, which are not used for analyses and transformations.
Attribute: cell_types
This attribute stores the cell types for each cell. If no cell types are given or available, an array of None is automatically created and stored.
Attribute: sample_index
This attribute stores the sample indices for each cell. They are stored as strings or integers within an array. If no sample information is available, all cells are assumed to be from the same sample and indices of 0 are given.
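The defaults described for cell_types and sample_index can be pictured with plain NumPy arrays. This is a sketch of what such default arrays look like for a hypothetical dataset of five cells, not the library's internal code.

```python
import numpy as np

n_cells = 5  # hypothetical number of cells

# An array of None, as described for missing cell types.
default_cell_types = np.full(n_cells, None)

# All cells assigned to a single sample with index 0.
default_sample_index = np.zeros(n_cells, dtype=int)

print(default_cell_types)
print(default_sample_index)
```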
Attribute: reductions
This stores a Reductions object for dimension reduction from the CytofDR package. By default, this is None. If the run_dr_methods method is called, the results will be automatically stored. Users can supply and set their own Reductions object as well.
Subsetting and Indexing
The PyCytoData object supports both the standard bracket notation, which is mostly consistent with numpy indexing, and a custom subset method to subset by metadata, such as channels, cell types, and samples. Here is a tutorial on both.
Indexing Using Brackets
We have implemented many of the features of numpy’s indexing behavior. Here is a list of allowable types in the brackets (which are passed into the magic __getitem__ method):
slice
List[int]
np.ndarray
Indexing is performed on the expression matrix. A new object will be created and returned along with the appropriate metadata. The indices can be one-dimensional or two-dimensional, allowing you to index by the zeroth axis without having to specify the other axis. For example:
>>> new_exprs = exprs[:10]
>>> new_exprs.n_cells
10
>>> new_exprs.n_channels
39
As expected, this indexes the first 10 cells from the original object. This is equivalent to the following:
>>> new_exprs = exprs[:10, :]
And of course, you can index both rows and columns:
>>> new_exprs = exprs[5:100, [2,3,4]]
which will index the cells at positions 5 through 99 (zero-based) along with the 3 given channels. You can also use numpy arrays to index, which enables you to do something such as:
>>> import numpy as np
>>> new_exprs = exprs[5:100, np.isin(exprs.channels, ["CD3", "CD4"])]
As shown, you can mix and match indices types.
We do not support higher dimensional indexing because we assume that expression matrices are two dimensional. Higher dimensional arrays will cause confusion.
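Since the bracket notation is described as mostly consistent with numpy, the same index expressions can be tried on a plain array to preview the resulting shapes. The sketch below uses a toy matrix and hypothetical channel names; it shows numpy's behavior only and makes no claim about the metadata handling inside PyCytoData.

```python
import numpy as np

# Toy stand-ins: 200 cells by 6 channels, with hypothetical channel names.
matrix = np.arange(200 * 6).reshape(200, 6)
channels = np.array(["Time", "CD3", "CD4", "CD8", "CD19", "Bead"])

first_ten = matrix[:10]           # rows only; all columns are kept
same_first_ten = matrix[:10, :]   # explicit second axis, same result
block = matrix[5:100, [2, 3, 4]]  # rows 5 through 99 and three columns

# Boolean column mask built from channel names.
by_name = matrix[5:100, np.isin(channels, ["CD3", "CD4"])]

print(first_ten.shape, block.shape, by_name.shape)
```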
Note
We do not support indexing with a single integer, such as exprs[5]. This is to avoid the complexities it would introduce. If you wish to index a single observation, use exprs[[5]] or exprs[5:6].
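One plausible source of that complexity (an assumption on our part, illustrated with plain numpy rather than PyCytoData) is that integer indexing drops an axis, whereas a one-element list or slice keeps the result two-dimensional:

```python
import numpy as np

matrix = np.zeros((10, 4))  # toy matrix: 10 cells, 4 channels

as_integer = matrix[5]  # plain numpy drops the cell axis
as_list = matrix[[5]]   # a one-element list keeps both axes
as_slice = matrix[5:6]  # a one-element slice keeps both axes

print(as_integer.shape, as_list.shape, as_slice.shape)
```

Keeping both axes means a single cell still looks like a one-row expression matrix, which is the form the two recommended alternatives produce.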
Subsetting
Instead of using indices and arrays, you can specifically subset based on metadata using the subset method provided. You can subset based on the following:
channels
sample
cell_types
This usage is quite common and will save you the trouble of writing np.where or np.isin calls yourself.
To get started, you can simply do:
>>> new_exprs = exprs.subset(channels=['CD13(Er168)Di', 'CD3(Er170)Di'], sample=[0], cell_types=["Pro B Cells"], in_place=False)
>>> print(new_exprs)
A 'PyCytoData' object with 10 cells, 39 channels, 1 cell types, and 1 samples at 0x7fbf5eb0db80.
This will subset the dataset with:
The CD13(Er168)Di and CD3(Er170)Di channels
The 0th sample
Pro B cells
You can pick and choose which of the metadata to subset. By default, the operation is done in place. However, if you wish a new object to be returned, you can set in_place=False.
Optionally, you can also negate the selection by setting not_in=True, which will exclude the given channels, sample, or cell types.
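For comparison, here is roughly what the manual approach with np.isin looks like on plain arrays. All arrays and labels are hypothetical, and the code is a sketch of mask-based selection in numpy, not the implementation of the subset method.

```python
import numpy as np

# Toy stand-ins for the arrays stored in an object (all values hypothetical).
expression_matrix = np.arange(24, dtype=float).reshape(6, 4)
channels = np.array(["CD3", "CD4", "CD8", "Time"])
cell_types = np.array(["B", "T", "T", "NK", "B", "T"])
sample_index = np.array([0, 0, 1, 1, 0, 1])

# Select T cells from sample 0, keeping only two channels.
row_mask = np.isin(cell_types, ["T"]) & np.isin(sample_index, [0])
col_mask = np.isin(channels, ["CD3", "CD4"])
selected = expression_matrix[row_mask][:, col_mask]

# Negated selection, analogous in spirit to not_in=True: drop the listed channels.
negated_cols = np.isin(channels, ["CD3", "CD4"], invert=True)
excluded = expression_matrix[:, negated_cols]

print(selected)
print(channels[negated_cols])
```

With the manual route, the channel names, cell types, and sample indices would each need to be filtered with the same masks, which is the bookkeeping the subset method is described as handling for you.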
Subset Channels by Name
The standard subset method returns a PyCytoData object. However, if you wish to get an array with the specified channels, you can use the get_channel_expressions method:
>>> expressions, channels = exprs.get_channel_expressions(['CD13(Er168)Di', 'CD3(Er170)Di'])
>>> expressions
array([[ 1.92805946e-01, -1.63007990e-01],
       [ 1.01540089e+01, -2.17397958e-01],
       [ 1.07605422e+00,  1.63160920e+00],
       ...,
       [-8.74943915e-04, -1.50887862e-01],
       [ 7.03829479e+00, -7.69027993e-02],
       [ 1.62252128e+00, -2.47358829e-01]])
>>> channels
array(['CD13(Er168)Di', 'CD3(Er170)Di'], dtype='<U18')
This saves you the extra step of extracting the expression matrix from a subsetted object.
Adding New Samples
If you have an existing PyCytoData object and you have a new sample, you can add your new sample to your existing object. This can be easily achieved by the following:
>>> exprs = exprs.add_sample(expression_matrix=expression_matrix, sample_index = sample_index)
where sample_index is an array-like object with sample names, and expression_matrix is just a matrix.
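Conceptually, adding a sample amounts to stacking the new cells under the existing ones and extending the per-cell sample labels. The numpy sketch below shows this on toy arrays. It assumes one sample label per new cell and the same channels in both matrices; the values and the labels "A" and "B" are hypothetical.

```python
import numpy as np

# Existing data: 4 cells, 3 channels, all from sample "A" (hypothetical values).
existing_matrix = np.ones((4, 3))
existing_samples = np.full(4, "A")

# New sample: 2 cells measured on the same 3 channels.
new_matrix = np.zeros((2, 3))
new_samples = np.full(new_matrix.shape[0], "B")

# Stack cells row-wise and extend the per-cell sample labels to match.
combined_matrix = np.concatenate((existing_matrix, new_matrix), axis=0)
combined_samples = np.concatenate((existing_samples, new_samples))

print(combined_matrix.shape, combined_samples)
```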
Magic Methods
PyCytoData implements a number of magic methods for convenience. Here, we detail the usage of such methods.
Method: len
This returns the number of cells of the object, which is equivalent to using n_cells:
>>> len(exprs)
265627
Method: print
This method prints out the general information of the object with information on the number of cells, channels, cell types, and samples along with its memory address.
>>> print(exprs)
A 'PyCytoData' object with 265627 cells, 39 channels, 15 cell types, and 2 samples at 0x7fbf5ea09f10.
Method: + and +=
These two methods are used to concatenate two PyCytoData objects. The + operator returns a new object:
>>> exprs1 = DataLoader.load_dataset(dataset="levine32", sample="AML08")
>>> exprs2 = DataLoader.load_dataset(dataset="levine32", sample="AML09")
>>> exprs = exprs1 + exprs2
>>> print(exprs)
A 'PyCytoData' object with 265627 cells, 39 channels, 15 cell types, and 2 samples at 0x7fbf313f7190.
Or if you prefer to concatenate a new object to an existing one, you can use the += operator:
>>> exprs1 += exprs2
>>> print(exprs1)
A 'PyCytoData' object with 265627 cells, 39 channels, 15 cell types, and 2 samples at 0x7fbf5ea6f100.
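For readers curious how such operators are wired up in Python, here is a toy class that implements the same four special methods. TinyData is a hypothetical stand-in that stores one label per cell. It sketches the general pattern and is not the PyCytoData implementation; in particular, treating += as an in-place update is an assumption about the intended behavior.

```python
class TinyData:
    """Hypothetical stand-in that stores one label per cell."""

    def __init__(self, cells):
        self.cells = list(cells)

    def __len__(self):
        # len(obj) reports the number of cells.
        return len(self.cells)

    def __str__(self):
        # print(obj) shows a short summary.
        return f"A 'TinyData' object with {len(self)} cells."

    def __add__(self, other):
        # obj1 + obj2 builds and returns a new object.
        return TinyData(self.cells + other.cells)

    def __iadd__(self, other):
        # obj1 += obj2 extends obj1 in place and returns it.
        self.cells.extend(other.cells)
        return self


first = TinyData(["a", "b"])
second = TinyData(["c"])

combined = first + second  # new object; first and second are unchanged
first_id = id(first)
first += second            # first itself now holds all three cells
print(combined)
```

The practical difference is whether the left-hand object is preserved: + leaves both operands untouched, while += in this sketch modifies the object on the left.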