###########################
The PyCytoData Object
###########################

The ``PyCytoData`` object is the workhorse that carries you through every step of the way! If you're familiar with ``seurat`` and its signature workflow, you should feel more or less at ease here: we aim to use one single object for all your needs! Actions are exposed as methods (verbs), whereas the metadata can be accessed via attributes.

This tutorial walks you through the details of the class and how you can best use its features beyond the most common usage shown in the quickstart guide.

---------------------------------

*******************************
Creating ``PyCytoData`` Object
*******************************

In general, there are three ways to obtain a ``PyCytoData`` object:

1. You load a benchmark dataset using the ``DataLoader.load_dataset`` method.
2. You load an expression matrix with the ``FileIO.load_expression`` method.
3. You create your own object using the constructor.

While the first two approaches are much more common, we will walk you through each of the three methods.

Loading Benchmark Dataset
----------------------------

This is by far the easiest and the most common method:

.. code-block:: python

    >>> from PyCytoData import DataLoader
    >>> exprs = DataLoader.load_dataset(dataset="levine13")

The ``exprs`` object here is a ``PyCytoData`` object. Details on how this method works with benchmark datasets are included in the FileIO and Datasets section.

Loading Expression Matrix
---------------------------

If you have an existing expression matrix stored in plain text (not ``fcs``, which is currently unsupported), you can easily load it into a ``PyCytoData`` object as well:

.. code-block:: python

    >>> from PyCytoData import FileIO
    >>> exprs = FileIO.load_expression(files="", col_names=True, delim="\t")

Again, ``exprs`` is a ``PyCytoData`` object. The ``load_expression`` method has more features, which are also documented thoroughly in the FileIO and Datasets section.

Using ``PyCytoData`` Constructor
------------------------------------

If you already have your data as arrays, you can construct an object directly:

.. code-block:: python

    >>> from PyCytoData import PyCytoData
    >>> exprs = PyCytoData(expression_matrix=expression_matrix,
    ...                    channels=channels,
    ...                    cell_types=cell_types,
    ...                    sample_index=sample_index,
    ...                    lineage_channels=lineage_channels)

Here, you can supply all the information on your CyTOF data. The only mandatory component is the ``expression_matrix``. All parameters should be ``ArrayLike``, but they don't have to be ``numpy`` arrays. The expression matrix should consist of numeric values, whereas the others are expected to be strings (or possibly integers).

In some cases, your data contain non-protein channels, such as beads, Time, or other instrument channels. In this case, you can specify ``lineage_channels`` to indicate which channels are actual protein channels, so that ``PyCytoData`` uses the appropriate channels in downstream analyses.

-------------------------

********************************
Object Attributes and Metadata
********************************

One of the beauties of working with a ``PyCytoData`` object is that it offers a whole set of builtin metadata and attributes you can work with. Some of them are automatically added to the object, whereas others you will need to supply. In any case, they allow you to get the information you need from your dataset.

Metadata
-----------

The metadata are basically statistics that are calculated from your dataset. You can easily get the number of cells, channels, cell types, etc.
Here, we will use a builtin benchmark dataset as an example:

.. code-block:: python

    >>> from PyCytoData import DataLoader
    >>> exprs = DataLoader.load_dataset(dataset="levine32")

After loading ``levine32``, you can look at some of its statistics:

.. code-block:: python

    >>> exprs.n_cells
    265627
    >>> exprs.n_cell_types
    15
    >>> exprs.n_samples
    2
    >>> exprs.n_channels
    39

.. note::

    ``levine32`` has 39 channels because the count includes instrument channels.

The metadata are automatically updated whenever the expression matrix changes, such as through subsetting or concatenation, so you can always be sure that they are up to date.

Attribute: ``channels``
-------------------------

This attribute stores all the channel names. If no channel names are given during instantiation, names are automatically generated using the ``"Channel" + int`` convention. Namely, the first channel will be "Channel0" and the rest are numbered sequentially. If channels are given or built into benchmark datasets, they're stored as is in this attribute.

Attribute: ``lineage_channels``
---------------------------------

This attribute denotes all the lineage channels, which are a subset of the ``channels``. Lineage channels are protein channels typically used for analyses. They are stored to differentiate them from instrument channels, which are not used for analyses and transformations.

Attribute: ``cell_types``
--------------------------

This attribute stores the cell type of each cell. If no cell types are given or available, an array of ``None`` is automatically created and stored.

Attribute: ``sample_index``
---------------------------

This attribute stores the sample index of each cell. The indices are stored as strings or integers within an array. If no sample information is available, all cells are assumed to be from the same sample and indices of ``0`` are given.

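
The default values described above for ``channels``, ``cell_types``, and ``sample_index`` can be illustrated with plain ``numpy``. This is only a minimal sketch of the documented conventions, not the library's implementation: the helper name ``default_metadata`` and the toy matrix are hypothetical.

.. code-block:: python

    import numpy as np


    def default_metadata(expression_matrix):
        """Hypothetical helper mirroring the documented defaults; not part of PyCytoData."""
        n_cells, n_channels = expression_matrix.shape
        # "Channel" + int convention, starting from 0.
        channels = np.array([f"Channel{i}" for i in range(n_channels)])
        # No cell types available: one ``None`` per cell.
        cell_types = np.full(n_cells, None)
        # No sample information: every cell is assigned to sample 0.
        sample_index = np.zeros(n_cells, dtype=int)
        return channels, cell_types, sample_index


    # Toy 5-cell, 3-channel matrix with made-up values.
    toy_matrix = np.random.default_rng(0).random((5, 3))
    channels, cell_types, sample_index = default_metadata(toy_matrix)
    print(channels)

Each returned array has one entry per channel or per cell, which is what keeps the metadata aligned with the axes of the expression matrix.
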

Attribute: ``reductions``
--------------------------

This attribute stores a ``Reductions`` object for dimension reduction from the ``CytofDR`` package. By default, this is ``None``. If the ``run_dr_methods`` method is called, the results are automatically stored here. Users can also supply and set their own ``Reductions`` object.

----------------------------------

***************************
Subsetting and Indexing
***************************

The ``PyCytoData`` object supports both the standard bracket notation, which is mostly consistent with ``numpy`` indexing, and a custom ``subset`` method to subset by metadata, such as channels, cell types, and samples. Here is a tutorial on both.

Indexing Using Brackets
-------------------------

We have implemented many of the features of ``numpy``'s indexing behavior. Here is a list of allowable types in the brackets (which are passed into the magic ``__getitem__`` method):

- ``slice``
- ``List[int]``
- ``np.ndarray``

Indexing is performed on the expression matrix. A new object is created and returned along with the appropriate metadata. The indices can be one-dimensional or two-dimensional, allowing you to index by the zeroth axis without having to specify the other axis. For example:

.. code-block:: python

    >>> new_exprs = exprs[:10]
    >>> new_exprs.n_cells
    10
    >>> new_exprs.n_channels
    39

As expected, this selects the first 10 cells from the original object. This is equivalent to the following:

.. code-block:: python

    >>> new_exprs = exprs[:10, :]

And of course, you can index both rows and columns:

.. code-block:: python

    >>> new_exprs = exprs[5:100, [2,3,4]]

which selects the cells at indices 5 through 99 along with the 3 given channels. You can also use ``numpy`` arrays to index, which enables you to do something such as:

.. code-block:: python

    >>> new_exprs = exprs[5:100, np.isin(exprs.channels, ["CD3", "CD4"])]

As shown, you can mix and match index types.
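
The row and column selections above follow ordinary ``numpy`` semantics, which can be sketched on a bare array. The example below is an illustration under assumptions, not the ``__getitem__`` implementation: the matrix, channel names, and cell types are made-up toy data, and it only shows how applying the same indices to the metadata keeps everything aligned.

.. code-block:: python

    import numpy as np

    # Toy stand-ins for an expression matrix and its metadata (made-up values).
    matrix = np.arange(200 * 4, dtype=float).reshape(200, 4)
    channels = np.array(["CD3", "CD4", "CD8", "Time"])
    cell_types = np.array(["T cell", "B cell"] * 100)

    rows = slice(5, 100)                      # cells at indices 5 through 99
    cols = np.isin(channels, ["CD3", "CD4"])  # boolean mask over channels

    # Apply the row index to per-cell metadata and the column index to
    # per-channel metadata so that everything stays aligned.
    sub_matrix = matrix[rows, cols]
    sub_channels = channels[cols]
    sub_cell_types = cell_types[rows]

    print(sub_matrix.shape)
    print(sub_channels)

The slice keeps 95 cells and the mask keeps the two matching channels, so the subset matrix and both metadata arrays agree in length along their respective axes.
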
We do not support higher-dimensional indexing because we assume that expression matrices are two-dimensional. Higher-dimensional arrays would only cause confusion.

.. note::

    We do not support indexing with integers such as ``exprs[5]``. This avoids the complexity that scalar indexing would introduce. If you wish to index a single observation, use ``exprs[[5]]`` or ``exprs[5:6]``.

Subsetting
--------------

Instead of using indices and arrays, you can subset based on metadata using the ``subset`` method. You can subset based on the following:

- channels
- sample
- cell_types

This usage is quite common and saves you from having to use ``np.where`` or ``np.isin`` yourself. To get started, you can simply do:

.. code-block:: python

    >>> new_exprs = exprs.subset(channels=['CD13(Er168)Di', 'CD3(Er170)Di'],
    ...                          sample=[0],
    ...                          cell_types=["Pro B Cells"],
    ...                          in_place=False)
    >>> print(new_exprs)
    A 'PyCytoData' object with 10 cells, 39 channels, 1 cell types, and 1 samples at 0x7fbf5eb0db80.

This will subset the dataset with:

- ``CD13(Er168)Di``, ``CD3(Er170)Di`` channels
- The 0th sample
- Pro B cells

You can pick and choose which of the metadata to subset by. By default, the operation is done in place. However, if you want a new object to be returned, you can set ``in_place=False``. Optionally, you can also negate the selection by setting ``not_in=True``, which will exclude the given channels, sample, or cell types.

Subset Channels by Name
-------------------------

The standard ``subset`` method returns a ``PyCytoData`` object. However, if you wish to get an array with the specified channels, you can use the ``get_channel_expressions`` method:

.. code-block:: python

    >>> expressions, channels = exprs.get_channel_expressions(['CD13(Er168)Di', 'CD3(Er170)Di'])
    >>> expressions
    array([[ 1.92805946e-01, -1.63007990e-01],
           [ 1.01540089e+01, -2.17397958e-01],
           [ 1.07605422e+00,  1.63160920e+00],
           ...,
           [-8.74943915e-04, -1.50887862e-01],
           [ 7.03829479e+00, -7.69027993e-02],
           [ 1.62252128e+00, -2.47358829e-01]])
    >>> channels
    array(['CD13(Er168)Di', 'CD3(Er170)Di'], dtype=...)

A new sample can be added with the ``add_sample`` method:

.. code-block:: python

    >>> exprs = exprs.add_sample(expression_matrix=expression_matrix, sample_index=sample_index)

where ``sample_index`` is an array-like object with sample names, and ``expression_matrix`` is just a matrix.

-----------------------------

********************
Magic Methods
********************

``PyCytoData`` implements a number of magic methods for convenience. Here, we detail the usage of these methods.

Method: ``len``
-------------------

This returns the number of cells of the object, which is equivalent to using ``n_cells``:

.. code-block:: python

    >>> len(exprs)
    265627

Method: ``print``
--------------------

This method prints a summary of the object: the number of cells, channels, cell types, and samples, along with its memory address.

.. code-block:: python

    >>> print(exprs)
    A 'PyCytoData' object with 265627 cells, 39 channels, 15 cell types, and 2 samples at 0x7fbf5ea09f10.

Method: ``+`` and ``+=``
-----------------------------

These two methods are used to concatenate two ``PyCytoData`` objects. The ``+`` operator returns a new object:

.. code-block:: python

    >>> exprs1 = DataLoader.load_dataset(dataset="levine32", sample="AML08")
    >>> exprs2 = DataLoader.load_dataset(dataset="levine32", sample="AML09")
    >>> exprs = exprs1 + exprs2
    >>> print(exprs)
    A 'PyCytoData' object with 265627 cells, 39 channels, 15 cell types, and 2 samples at 0x7fbf313f7190.

Or, if you prefer to concatenate a new object onto an existing one, you can use the ``+=`` operator:

.. code-block:: python

    >>> exprs1 += exprs2
    >>> print(exprs1)
    A 'PyCytoData' object with 265627 cells, 39 channels, 15 cell types, and 2 samples at 0x7fbf5ea6f100.
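
Conceptually, concatenation stacks the cells of both objects and extends the per-cell metadata to match. The following is a rough sketch of that idea with plain ``numpy``, not the library's ``__add__`` implementation. It assumes that both operands share the same channels, which is an assumption of this example, and all arrays are made-up toy data.

.. code-block:: python

    import numpy as np

    # Two toy "samples" sharing the same three channels (made-up values).
    matrix_a = np.ones((4, 3))
    matrix_b = np.zeros((6, 3))
    samples_a = np.array(["A"] * 4)
    samples_b = np.array(["B"] * 6)

    # Stack cells along axis 0 and extend the per-cell metadata to match.
    combined_matrix = np.concatenate([matrix_a, matrix_b], axis=0)
    combined_samples = np.concatenate([samples_a, samples_b])

    # Summary statistics derived from the combined data.
    n_cells = combined_matrix.shape[0]
    n_samples = len(np.unique(combined_samples))
    print(n_cells, n_samples)

The derived counts play the same role as ``n_cells`` and ``n_samples`` in the printed summaries above: they are recomputed from the combined data rather than stored separately.
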