SSAM (Spot-based Spatial cell-type Analysis by Multidimensional mRNA density estimation)

Author: Jeongbin Park (jeongbin.park@charite.de)1,2 and Wonyl Choi (wonyl@bu.edu)3

1Digital Health Center, Berlin Institute of Health (BIH) and Charité – Universitätsmedizin, Berlin, Germany; 2Faculty of Biosciences, Heidelberg University, Heidelberg, Germany; 3Department of Computer Science, Boston University, Boston, the United States of America

(Not referring this :laughing:: https://en.wikipedia.org/wiki/Ssam)

This project was done under supervision of Dr. Naveed Ishaque (naveed.ishaque@charite.de) and Prof. Roland Eils (roland.eils@charite.de), and in collaboration with the SpaceTx consortium and the Human Cell Atlas project.

Please also check our example Jupyter notebooks here: https://github.com/eilslabs/ssam_example

Prerequisites

Currently SSAM was only tested with Python 3 in Linux environment. In addition to this package, SSAM requires a local R installation with pre-installed packages feather and sctransform. For details, please follow the instructions here: https://ssam.readthedocs.io/en/release/userguide/01-tldr.html#installation

Citations

Jeongbin Park, Wonyl Choi, Sebastian Tiesmeyer, Brian Long, Lars E. Borm, Emma Garren, Thuc Nghi Nguyen, Bosiljka Tasic, Simone Codeluppi, Tobias Graf, Matthias Schlesner, Oliver Stegle, Roland Eils & Naveed Ishaque. “Cell segmentation-free inference of cell types from in situ transcriptomics data.Nature Communications 12, 3545 (2021).

License

Copyright (C) 2018 Jeongbin Park and Wonyl Choi

This program is free software: you can redistribute it and/or modify it under the terms of the GNU Affero General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Affero General Public License for more details.

You should have received a copy of the GNU Affero General Public License along with this program. If not, see https://www.gnu.org/licenses/.

Spatial gene expression analysis with SSAM

quick start / tldr page

This tl;dr guide is for you if you already know what happens in a SSAM analysis or if you don’t care.

For everyone else we recommend using the full userguide.

Installation

Setup a conda environment:

conda create -n ssam python=3.6
conda activate ssam
conda install gxx_linux-64 numpy pip R=3.6 pyarrow=0.15.1

Do this in R:

install.packages("sctransform")
install.packages("feather")

Install SSAM via pip:

pip install ssam

Data download

curl "https://zenodo.org/record/3478502/files/supplemental_data_ssam_2019.zip?download=1" -o zenodo.zip
unzip zenodo.zip

Data preparation

All following steps in python:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import ssam

df = pd.read_csv(
    "zenodo/multiplexed_smFISH/raw_data/smFISH_MCT_CZI_Panel_0_spot_table.csv",
    usecols=['x', 'y', 'z', 'target'])

um_per_pixel = 0.1

df.x = (df.x - df.x.min()) * um_per_pixel + 10
df.y = (df.y - df.y.min()) * um_per_pixel + 10
df.z = (df.z - df.z.min()) * um_per_pixel + 10
width = df.x.max() - df.x.min() + 10
height = df.y.max() - df.y.min() + 10

grouped = df.groupby('target').agg(list)
genes = list(grouped.index)
coord_list = []
for target, coords in grouped.iterrows():
    coord_list.append(np.array(list(zip(*coords))))

Create SSAM dataset and vector field

ds = ssam.SSAMDataset(genes, coord_list, width, height)
analysis = ssam.SSAMAnalysis(
  ds,
  ncores=10, # used for kde step
  save_dir="kde/",
  verbose=True)

analysis.run_kde(bandwidth=2.5, use_mmap=False)

analysis.find_localmax(
    search_size=3,
    min_norm=0.2,
    min_expression=0.027
    )

analysis.normalize_vectors_sctransform()

Creating the de novo cell map

analysis.cluster_vectors(
    min_cluster_size=0,
    pca_dims=22,
    resolution=0.15,
    metric='correlation')

# post-filtering parameter for cell-type map
filter_method = "local"
filter_params = {
    "block_size": 151,
    "method": "mean",
    "mode": "constant",
    "offset": 0.2
}

analysis.map_celltypes()
analysis.filter_celltypemaps(min_norm=filter_method, filter_params=filter_params, min_r=0.6, fill_blobs=True, min_blob_area=50, output_mask=output_mask)
Visualisation of cell type map.

Visualisation of cell type map.

Creating the tissue domain map

analysis.bin_celltypemaps(step=10, radius=100)
analysis.find_domains(n_clusters=20, merge_remote=True, merge_thres=0.7, norm_thres=1500)

plt.figure(figsize=[5, 5])
ds.plot_domains(rotate=1, cmap=cmap)

Installation

A step-by-step guide

The easiest way to prepare a python environment for SSAM is using conda. Keeping python projects in isolated environments prevents dependency version conflicts or conflicts with your OS installation of python which usually depends on older versions incompatible with current scientific packages.

Create your environment:

conda create -n ssam python=3.6

Remember to activate before using it:

conda activate ssam

Now we use conda to install some dependencies into our ssam environment:

conda install gxx_linux-64=7.3.0 numpy=1.19.2 pip R=3.6 pyarrow=0.15.1

Now we can install the R packages sctransform and feather. Open R and type:

install.packages("sctransform")
install.packages("feather")

Finally we switch to pip:

pip install git+https://github.com/HiDiHlabs/ssam.git

Next we can download and prepare our data.

SSAM’s source code

In case you want to work with SSAM’s source code, it is also hosted on github.

Data Preparation

Download VISp data

In this tutorial we work with data of the murine primary visual cortex (VISp) profiled using multiplexed smFISH. Further details are available in the SSAM publication (Park, et. al. 2019).

First, download the data and unpack it:

curl "https://zenodo.org/record/3478502/files/supplemental_data_ssam_2019.zip?download=1" -o zenodo.zip
unzip zenodo.zip

Load data into python

Let’s start with loading our python packages:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import ssam

Now we can load the mRNA spot table. Each row describes one mRNA spot and the columns contain its coordinates and target gene. We load the required columns into a dataframe:

df = pd.read_csv(
    "zenodo/multiplexed_smFISH/raw_data/smFISH_MCT_CZI_Panel_0_spot_table.csv",
    usecols=['x', 'y', 'z', 'target'])

If your dataset is organized differently, you will have to reshape it before continuing with the next steps. ## Transform Data

Because SSAM analysis is rooted in a cellular scale we transform the coordinates from a laboratory system into micrometers. Also we make them a bit tidier:

um_per_pixel = 0.1

df.x = (df.x - df.x.min()) * um_per_pixel + 10
df.y = (df.y - df.y.min()) * um_per_pixel + 10
df.z = (df.z - df.z.min()) * um_per_pixel + 10

Prepare data for SSAM

To create a SSAMDataset object we need to provide four arguments: - a list of gene names profiled in the experiment: genes - a list of lists that contains the coordinates of each gene: coord_list - the width of the image - the height of the image

The width and height are straightforward to infer from the dimensions of the image:

width = df.x.max() - df.x.min() + 10
height = df.y.max() - df.y.min() + 10

We group the dataframe by gene and create the list of gene names:

grouped = df.groupby('target').agg(list)
genes = list(grouped.index)

And finally the coordinate list:

coord_list = []
for target, coords in grouped.iterrows():
    coord_list.append(np.array(list(zip(*coords))))

Create the SSAMDataset object

With everything in place we can now instantiate the SSAMDataset object:

ds = ssam.SSAMDataset(genes, coord_list, width, height)

Now we can start the analysis with the kernel density estimation step.

Creating the vector field

After the data has been loaded, SSAM converts the discrete mRNA locations into mRNA desntiy (that can be thought of as continuous “gene expression clouds” over the tissue) through application of Kernel Density Estimation.

KDE

With our SSAMDataset object ds we can now initialize a SSAMAnalysis object analysis.

analysis = ssam.SSAMAnalysis(
  ds,
  ncores=10, # used for kde step
  save_dir="kde/",
  verbose=True)

And calculate a mRNA density estimate with the run_kde method. Important considerations here are the kernel function and the kernel bandwidth. As default, we recommend using a Gaussian kernel with a bandwidth of 2.5:

analysis.run_kde(bandwidth=2.5, use_mmap=False)

Masking

If you want to perform the analysis on only a part of your sample you can use a mask. This can restrict what parts of the image are used for local maxima sampling (the input_mask), or restrict the cell-type map generation of SSAM to certain regions (the output_mask). While this is not required for analysis (infact the SSAM paper did not apply masks to the osmFISH or MERFISH dataset), here we define a simply polygon as both the input_mask and output_mask for the VISp region.

from matplotlib.path import Path
# manual area annotation
xy = np.array([[1535,  90],
               [ 795,  335],
               [ 135,  940],
               [ 835, 1995],
               [1465, 1695],
               [2010, 1215]])

# Extract coordinates from SSAMDataset
x, y = np.meshgrid(np.arange(ds.vf.shape[0]), np.arange(ds.vf.shape[1]))
x, y = x.flatten(), y.flatten()
points = np.vstack((x,y)).T

path = Path(xy)
input_mask = path.contains_points(points)
input_mask = input_mask.reshape((ds.vf.shape[1], ds.vf.shape[0], 1)).swapaxes(0, 1)
output_mask = input_mask

We recommend a visual inspection of the mask to make sure it alignes with the data as you expect it to:

from matplotlib.patches import Polygon
from matplotlib.collections import PatchCollection

patch = Polygon(xy, True)
p = PatchCollection([patch], alpha=0.4)

plt.figure(figsize=[5, 5])
ds.plot_l1norm(rotate=1, cmap="Greys")
plt.gca().add_collection(p)
plt.axis('off')
plt.savefig('images/mask.png')
plot of the mRNA density superimposed with the mask

plot of the mRNA density superimposed with the mask

Local maxima search and normalization

In order to reduce the computational burden, we recommend downsampling the image. While random sampling can be performe, we strongly encourage downsampling via local maxima selection, followed by filtering based of individual and total gene expression.

The local maxima are used to (i) determine the variance stabilisation parameters for the image, and (ii) be used to determine clusters in de novo analysis. In this section, we will use the local maxima for variance stabilisation.

Here we apply the find_localmax function to find the local maxima of the mRNA density, using a per gene expression threshold of 0.027 and a total gene expression threshold of 0.2:

analysis.find_localmax(
    search_size=3,
    min_norm=0.2, # the total gene expression threshold
    min_expression=0.027, # the per gene expression threshold
    mask=input_mask
    )

Visualization

After the local maxima have been identified, they can be visualised. In cases when many local maxima orginate from outside the tissue area a k-NN density threshold can be used to filter “stray” local maxima, however in this example we use an input mask so it is not a problem.

plt.figure(figsize=[5, 5])
ds.plot_l1norm(cmap="Greys", rotate=1)
ds.plot_localmax(c="Blue", rotate=1, s=0.1)

patch = Polygon(xy, facecolor="black", edgecolor="red", linewidth=10, ls="-")
p = PatchCollection([patch], alpha=0.4)
plt.gca().add_collection(p)

scalebar = ScaleBar(1, 'um') # 1 pixel = 1um
plt.gca().add_artist(scalebar)
plt.tight_layout()
plt.axis('off')
plt.show()
plot found maxima superimposed with the mask

plot found maxima superimposed with the mask

Normalization

Once the local maxima have been identified, we can use them for calculating the variance stabilisation parameters using sctransform. If you receive an error here, make sure that you have installed the R packages in the installation step

This part of the analysis ends with the normalization of the mRNA density and the local-maximum vectors.

analysis.normalize_vectors_sctransform()

Now we are rady to continue with mapping the cell types in guided or de novo mode.

The shape of the kernel

The shape of the kernel is defined by the kernel function. The shape of the kernel determines how the mRNA signal is smoothed.

We adopt the use of the Gaussian kernel due to it’s popular use in signal processing, however other kernel functions can be used: - we have had success in using semi-circle kernels when applied to ISS data of the human pancreas - the Epanechnikov kernel minimizes AMISE and has therefore been described as optimal

The following exmaples shows how you can apply a semicircular kernel instead of a Gaussian.

# code to change the shape of the kernel (@sebastiantiesmeyer)

Kernel bandwidth

The bandwidth of the kernel controls the amount of smoothing applied. With a low bandwidth, the smooth is spread less. With a high badnwidth, the smoothing is spread more.

The bandwidth should be set according to 2 factors: - the maximum size of the bandwidth should not smooth the signals outside of cells. by default we choose a bandwidth of 2.5 um, as this has a FWTM or ~10um, which is the average size of cells in the mouse SSp. This worked well for all examples in the SSAM paper. - the minimum size of the bandwidth should at least smooth signal to adjacent mRNA. From experience, this is not an issue for most ISH based techniques, but sequencing based techniques such as ISS can produce very sparse data and may require higher bandwidths to smooth signal sufficiently.

Here is a close-up of the osmFISH mouse SSp dataset which investigates the effect of adjusting the kernel bandwidth. You can see that with a bandwidth of 1um the smoothing is sufficient, and with a bandwidth of 5um it is a little too much. The bandwidth of 2.5um appears to be a good balance of smoothing adjacent signal, while not smooting into the adjacent area or loosing sparse cell types.

image0

Input masks

For some tissue images you may want to restrict analysis to certain parts of the image. For example, the image may have degradation towards the edges, you may wish to exclude non tissue areas, or even perhaps restricting SSAM analysis to previously segmented areas.

SSAM accepts input masks that are defined as polygons.

Example for the VISp smFISH dataset:

from matplotlib.patches import Polygon
from matplotlib.collections import PatchCollection

plt.figure(figsize=[5, 5])
ds.plot_l1norm(cmap="Greys", rotate=1)
ds.plot_localmax(c="Blue", rotate=1, s=0.1)

patch = Polygon(xy, facecolor="black", edgecolor="red", linewidth=10, ls="-")
p = PatchCollection([patch], alpha=0.4)
plt.gca().add_collection(p~)
plt.show()

image0

After the desired region selected, a mask can be created. In this case we define an input_mask and output_mask which restricts all data process anf reported output to the selected region.

from matplotlib.path import Path

x, y = np.meshgrid(np.arange(ds.vf.shape[0]), np.arange(ds.vf.shape[1]))
x, y = x.flatten(), y.flatten()
points = np.vstack((x,y)).T

path = Path(xy)
input_mask = path.contains_points(points)
output_mask = input_mask = input_mask.reshape((ds.vf.shape[1], ds.vf.shape[0], 1)).swapaxes(0, 1)

SSAM guided analysis

The main visual output of SSAM is the creation of the cell-type map, which is created by classifying pixels in the tissue image based of either predefined or calculated genes expression signatures. When the gene expression signatures are already known, one can use SSAM in guided mode. When previously known cell type signatures are known, we highly recommend running guided mode analysis as a quality check.

Single cell RNA sequencing data

We will use scRNA-seq data from Tasic et al. 2018 for the guided analysis. In the paper they identified “shared and distinct transcriptomic cell types across neocortical areas” in the mouse brain, also including the mouse VISp (which is our exmaple).

First we need to load the data:

scrna_cl = pd.read_feather("zenodo/multiplexed_smFISH/raw_data/scrna_data_tasic_2018/cl.feather")
scrna_cl_df = pd.read_feather("zenodo/multiplexed_smFISH/raw_data/scrna_data_tasic_2018/cl_df.feather")
scrna_genes = pd.read_feather("zenodo/multiplexed_smFISH/raw_data/scrna_data_tasic_2018/genes.feather")
scrna_counts = pd.read_feather("zenodo/multiplexed_smFISH/raw_data/scrna_data_tasic_2018/counts.feather")

scrna_clusters = scrna_cl['cluster_id']

scrna_cl_dic = dict(zip(scrna_cl['cell_id'], scrna_cl['cluster_id']))
scrna_cl_metadata_dic = dict(zip(
    scrna_cl_df['cluster_id'],
    zip(scrna_cl_df['cluster_label'],
        scrna_cl_df['cluster_color'], )
))

qc_gene_indices = np.sum(scrna_counts > 0, axis=1) > 5
scrna_genes_qc = np.array(scrna_genes)[qc_gene_indices]

scrna_counts_qc = np.array(scrna_counts).T[:, qc_gene_indices]

Normalisation

Once the data is loaded, we will normalise it using run_sctransform:

scrna_data_normalized = np.array(ssam.run_sctransform(scrna_counts_qc)[0])

Cell-type gene expression signatures

Once the data is normalised, we can calculate the average gene expression per cell type (the centroids), which can then be used for classifying pixels in the image

selected_genes_idx = [list(scrna_genes_qc).index(g) for g in ds.genes]
scrna_uniq_clusters = np.unique(scrna_clusters)
scrna_centroids = []
for cl in scrna_uniq_clusters:
    scrna_centroids.append(np.mean(scrna_data_normalized[:, selected_genes_idx][scrna_clusters == cl], axis=0))

Generate a guided cell-type map

We can now continue to classify pixels in the tissue image using the cell-type gene expression signatures from the sc-RNAseq data.

We map the local maxima vectors to the most similar clusters in the scRNA-seq data using, using a correlation threshold of classifying pixels of ``0.6` <celltype_map_thresh_g.md>`__

analysis.map_celltypes(scrna_centroids) # map the scRNAseq cell type signatures to the tissue image
analysis.filter_celltypemaps(min_norm=filter_method, filter_params=filter_params, min_r=0.3, output_mask=output_mask) # post-filter cell-type map to remove spurious pixels

plt.figure(figsize=[5, 5]) # initiate the plotting area
ds.plot_celltypes_map(rotate=1, colors=scrna_colors, set_alpha=False) # SSAM plotting function

image0

Despite the guided mode producing passable results, we highly recommend using the de novo mode for more accurate analysis.

Thresholding the guided cell-type map

After cell-type signatures are provided, the tissue image can be classified. The classification of each pixel is based on the Pearson correlation metric (although an experimental adversarial autoencoder based classification method can be applied).

We found that a minimum correlation threshold (min_r) of 0.3 worked well for guided mode based on single cell RNAseq cell-type signatures, and 0.6 worked well for de novo mode.

Below we show how the cell-type map changes using correlation thresholds of 0.15,0.3,0.45 using the scRNAseq signatures

scrna_uniq_labels = [scrna_cl_metadata_dic[i][0] for i in scrna_uniq_clusters]
scrna_colors = [scrna_cl_metadata_dic[i][1] for i in scrna_uniq_clusters]

analysis.map_celltypes(scrna_centroids)

analysis.filter_celltypemaps(min_norm=filter_method, filter_params=filter_params, min_r=0.15, output_mask=output_mask) # post-filter cell-
plt.figure(figsize=[5, 5])
ds.plot_celltypes_map(rotate=1, colors=scrna_colors, set_alpha=False)

analysis.filter_celltypemaps(min_norm=filter_method, filter_params=filter_params, min_r=0.3, output_mask=output_mask) # post-filter cell-
plt.figure(figsize=[5, 5])
ds.plot_celltypes_map(rotate=1, colors=scrna_colors, set_alpha=False)

analysis.filter_celltypemaps(min_norm=filter_method, filter_params=filter_params, min_r=0.45, output_mask=output_mask) # post-filter cell-
plt.figure(figsize=[5, 5])
ds.plot_celltypes_map(rotate=1, colors=scrna_colors, set_alpha=False)

SSAM de novo analysis

While we believe that the guided mode of SSAM to be able to generate good cell-type maps rapidly, the de novo mode provide much more accurate results.

The steps of the de novo analysis are briefly discussed below, with links to more detailed discussion:

Clustering of expression vectors

Once the local maxima have been selected and filtered, we can perform clustering analysis. SSAM supports a number of clustering methods. Here we use the Louvain algorithm using 22 principle components, a resolution of 0.15.

analysis.cluster_vectors(
    min_cluster_size=0,
    pca_dims=22,
    resolution=0.15,
    metric='correlation')

Cluster annotation and diagnostics

SSAM provides diagnostic plots which can be used to evaluate the quality of clusters, and facilitates the annotation of clusters.

Visualisng the clusters

SSAM supports cluster visualisation via heatmaps, and 2D embedding (t-SNE and UMAP). Here we give an example of the t-SNE plot:

plt.figure(figsize=[5, 5])
ds.plot_tsne(pca_dims=22, metric="correlation", s=5, run_tsne=True)
plt.savefig('images/tsne.png')
plot of t-SNE embedding of cell types

plot of t-SNE embedding of cell types

Cell type map

Once the clusters have been evaluated for quality, we can generate the de novo cell-type map. This involves classifying all the pixels in the tissue image based on a correlation threshold. For the de novo application 0.6 was found to perform well:

analysis.map_celltypes()

filter_params = {
    "block_size": 151,
    "method": "mean",
    "mode": "constant",
    "offset": 0.2
    }

analysis.filter_celltypemaps(min_norm="local", filter_params=filter_params, min_r=0.6, fill_blobs=True, min_blob_area=50, output_mask=output_mask)
plt.figure(figsize=[5, 5])
ds.plot_celltypes_map(rotate=1, set_alpha=False)
plt.axis('off')
plt.savefig('images/de_novo.png')
plot of the de novo generated celltype map

plot of the de novo generated celltype map

We can now use our celltype map to infer a map of tissue domains.

Filtering local maxima

As demonstrated in the SSAM paper, local L1 maxima selection is an effective way of downsampling the entire vector field for faster computation, and they better represent known gene expression profiles compared to random downsampling.

However, local maxima in the vector field can arrise from undesirable locations, e.g. singleton mRNAs. In order to filter less informative local maxima.

We recommend applying threshold for individual genes, and for the total gene expression.

Per gene expression threshold

The per gene threshold should be at least the height of a single Gaussian curve over an mRNA. This can easily be empirically determined by visual analysis. In this multiplexed smFISH exmaple, the per gene expression threshold, exp_thres is set to 0.027

exp_thres = 0.027
viewport = 0.1
gindices = np.arange(len(ds.genes))
np.random.shuffle(gindices)
plt.figure(figsize=[5, 7])
for i, gidx in enumerate(gindices[:6], start=1):
    ax = plt.subplot(5, 2, i)
    n, bins, patches = ax.hist(ds.vf[..., gidx][np.logical_and(ds.vf[..., gidx] > 0, ds.vf[..., gidx] < viewport)], bins=100, log=True, histtype=u'step')
    ax.set_xlim([0, viewport])
    ax.set_ylim([n[0], n[-1]])
    ax.axvline(exp_thres, c='red', ls='--')
    ax.set_title(ds.genes[gidx])
    ax.set_xlabel("Expression")
    ax.set_ylabel("Count")
plt.tight_layout()
pass

image0

Total gene expression threshold

The total gene threshold should be empirically determined by examing the curve of total gene expression of local maxima. This isn’t always easy, and we highly encourage investigating this thoroughly.

norm_thres = 0.2
gidx = 0
plt.figure(figsize=[5, 2])
#plt.hist(ds.vf[..., gidx][ds.vf[..., gidx] > 0], bins=100, log=True)
n, _, _ = plt.hist(ds.vf_norm[np.logical_and(ds.vf_norm > 0, ds.vf_norm < 0.3)], bins=100, log=True, histtype='step')
ax = plt.gca()
ax.axvline(norm_thres, c='red', ls='--')
ax.set_xlabel("L1-norm")
ax.set_ylabel("Count")

plt.xlim([0, 0.3])
plt.ylim([np.min(n), np.max(n) + 100000])
pass

image1

Filtering “stray” local maxima using k-nearest neighbour density

If there is mRNA signal originating from outside the tissue area (due to background noise), it would improve downstream analysis to remove such vectors. We observed this in the osMFISH data. These “stray” local maxima tend to be less dense than local maxima from the tissue area:

image2

Because of this, they can be effectively filtered using their k-neearest neighbor density, in this example settting the threshold to 0.002.

from sklearn.neighbors import KDTree
X = np.array([ds.local_maxs[0], ds.local_maxs[1]]).T
kdt = KDTree(X, leaf_size=30, metric='euclidean')
rho = 100 / (np.pi * kdt.query(X, k=100)[0][:, 99] ** 2)

threshold = 0.002

plt.figure(figsize=[5, 2.5])
plt.hist(rho, bins=100, histtype='step')
plt.axvline(x=threshold, color='r', linestyle='--')

ax = plt.gca()
ax.set_xlabel("Local KNN density")
ax.set_ylabel("Count")
pass

image3

…. and a quick look at the before and after in the osmFISH dataset

image4

Clustering Local L-1 Maxima

In the de novo mode analysis, after the local maxima have been identified from the tissue image, they are clustered.

The default clustering algorithm is based on Louvain community detection. SSAM also supports clustering using hdbscan and optics.

It can be initiated by:

analysis.cluster_vectors(method="louvain",
                         pca_dims=-1,
                         min_cluster_size=2,
                         max_correlation=1.0,
                         metric="correlation",
                         outlier_detection_method='medoid-correlation',
                         outlier_detection_kwargs={},
                         random_state=0,
                         **kwargs)

… where - method can be louvain, hdbscan, optics. - pca_dims are the number of principal componants used for clustering. - min_cluster_size is the minimum cluster size. - resolution is the resolution for Louvain community detection. - prune is the threshold for Jaccard index (weight of SNN network). If it is smaller than prune, it is set to zero. - snn_neighbors is the number of neighbors for SNN network. - max_correlation is the threshold for which clusters with higher correlation to this value will be merged. - metric is the metric for calculation of distance between vectors in gene expression space. - subclustering if set to True, each cluster will be clustered once again with DBSCAN algorithm to find more subclusters. - dbscan_eps is the eps value for DBSCAN subclustering. Not used when ‘subclustering’ is set False. - centroid_correction_threshold is the threshold for which centroid will be recalculated with the vectors which have the correlation to the cluster medoid equal or higher than this value. - random_state is the random seed or scikit-learn’s random state object to replicate the same result

Removing outliers

The cell type signature is determined as the centroid of the cluster. This can be affected by outliers, so SSAM supports a number of outlier removal methods:

analysis.remove_outliers(outlier_detection_method='medoid-correlation', outlier_detection_kwargs={}, normalize=True)

robust-covariance, one-class-svm, isolation-forest, local-outlier-factor - outlier_detection_kwargs are arguments passed to the outlier detection method

Diagnostic plots

After unsupervised clustering of gene expression vectors, some clusters may need to be manually merged or discarded. SSAM supports merging of clusters based on correlation of gene expression profile, however in many cases manual inspection is needed to rule out any non-trivial issues.

To guide this process, SSAM generates a cluster-wise ‘diagnostic plot’, which consists of four panels: 1) location of the clustered vectors on the tissue image, 2) the pixels classified to belong the cluster signature (the cluster centroid), 3) the mean expression profile of the clustered vectors, and 4) the t-SNE or UMAP embedding.

In the three datasets analyzed the clusters to be merged or removed often showed a discordance between the location of sampled vectors used to determine the cluster (panel 1) and the pixels classified to belong to that cluster (panel 2). In case of overclustering, i.e. when a cell-type signature is split over 2 clusters, the map typically does not classify the full shape of the cells but instead only fragments (panel 2), and having almost the same marker gene expression of another cluster (panel 3). Such clusters can be merged.

For dubious clusters that should be removed, we observed that vectors usually originate from outside the tissue region or from image artifacts (panel 1), or that the gene expression does not show any clear expression of marker genes or similarity to expected gene expression profiles (panel 3).

The remaining clusters are then annotated by comparing cluster marker genes to known cell-type markers. Note that in many cases, the identity of clusters can be easily assigned by comparing the centroids of the clusters to the known cell-type signatures, e.g., from single cell RNA sequencing.

To support rapid annotation of cell types to clusters, SSAM additionally shows the highest correlating known cell-type signature should this data be available in panel 3.

Example 1: a large cluster that can be easily annotated

Local maxima (panel 1), correspond to the same area (panel 2), and matches known gene expression patterns of Vip Arhgap36 Hmcn1 cell types from scRNAseq experiments with high correlation (panel 3)

image0

Example 2: a large cluster that cannot be easily annotated

Local maxima (panel 1), correspond to the same area (panel 2). The gene expression profile has a good correlation to L2/3 IT VISp Adamts2 cell types, but are lacking the very high expression of Pde1a. In this particular case, one would need to check other clusters matching this cell type and perhaps merge them, or perhaps this indicates low efficiency of the Pde1a probe in the experiment.

image1

Example 3: a small cluster that is good

Despite only 2 local maxima (panel 1), the classified pixels correspond to the same area (panel 2), and matches known gene expression patterns (panel 3). This presents a very rare, SSt Chodl cell type.

image2

Example 4: a small cluster that is questionable

Sampled local maxima (panel 1) to no correspond to the classified pixels (panel 2), and doesnt clearly match known gene expression patterns (panel 3)

image3

Cluster annotation

In a typical single cell RNAseq experiment, the process of annotating cell types manually can be laborious and as such, a number of automated methods have emerged.

In a typical in situ transcriptomics experiment, the annotation of cell types is usually much easier as these assays are usually profile established cell type markers. Cluster can be annotated easily based on marker gene expression.

The diagnostic plots can be used to compare existing signatures against those identified de novo

from scipy.stats import pearsonr, spearmanr

for idx in range(len(ds.centroids)):
    plt.figure(figsize=[50, 15])
    ds.plot_diagnostic_plot(idx, known_signatures=[
        ("scRNA-seq", scrna_uniq_labels, scrna_centroids, scrna_colors),
    ], correlation_methods=[
        ("r", pearsonr),
        ("rho", spearmanr)
    ])
    plt.tight_layout()
    plt.savefig('diagplots_multiplexed_smFISH/diagplot_centroid_%d.png'%idx)
    plt.close()

This will generate a diagnostic plot for each cluster, which can be used to assign cluster labels. E.g. the following cluster matches known gene expression patterns of Vip Arhgap36 Hmcn1 cell types from scRNAseq experiments with high correlation (panel 3):

image0

While this is a good example of cluster that can be easily annotated, some clusters may prepresent noise and would need to be removed, and when over clustering occurs then clusters may have to be merged. The diagnostic plots documentation assist the decision making process.

Once each cluster is reviewed, a cell-type be assigned, or removed, or merged. In the following code snippet, we show an elegent way to annotate, remove, and merge clusters.

  1. Determine that (i) clusters with a name will be annotated, (ii) clusters with a “N/A” will be removed, (iii) clusters with the same name will be merged
denovo_labels = [
    "N/A",
    "VLMC",
    "Vip Arhgap36 Hmcn1 / Vip Igfbp4 Map21l1",
    "L2/3 IT Rrad",
    "N/A",
    "L2/3 IT Adamts2",
    "Sst Nts / Sst Rxfp1 Eya1",
    "Lamp5 Lsp1",
    "N/A",
    "Sst Crhr2 Efemp1 / Sst Esm1",

    "Pvalb Calb1 Sst / Pvalb Reln Tac1",
    "Astro Aqp4",
    "L6 IT Penk Fst",
    "L4 IT Superficial",
    "L5 IT Col27a1",
    "L2/3 IT Adamts2",
    "OPC",
    "Oligo",
    "L4 IT Rspo1",
    "L5 NP Trhr Met",

    "L5 IT Hsd11b1 Endou",
    "Pvalb Th Sst / Pvalb Reln Tac1",
    "L6 CT Ctxn3 Brinp3 / L6 CT Gpr139",
    "L5 PT Chrna6",
    "L5 IT Batf3",
    "L5 PT C1ql2 Cdh13",
    "L5 PT Krt80",
    "L6 IT Penk Col27a1",
    "L6 IT Penk Col27a1",
    "L6b Crh",

    "Sst Chodl",
]
  1. make objects for storing the index of clusters to be annotated, removed and merged
denovo_labels_final = []
exclude_indices = []
merge_indices = []
  1. iterate over the denovo_labels object and populate the denovo_labels_final, exclude_indices, merge_indices objects
for idx, cl in enumerate(denovo_labels):
    if cl == 'N/A':
        exclude_indices.append(idx)
        continue
    if cl in denovo_labels_final:
        continue
    denovo_labels_final.append(cl)

for cl in np.unique(denovo_labels):
    if cl == 'N/A':
        continue
    mask = [cl == e for e in denovo_labels]
    if np.sum(mask) > 1:
        merge_indices.append(np.where(mask)[0])
  1. plot the removed clusters in t-SNE embedding
cmap = plt.get_cmap('jet')
jet_colors = cmap(np.array(list(range(len(ds.centroids)))) / (len(ds.centroids) - 1))
tsne_colors = np.zeros_like(jet_colors)
tsne_colors[..., :] = [0.8, 0.8, 0.8, 1]
tsne_colors[exclude_indices] = [0, 0, 0, 1] #jet_colors[exclude_indices]
import matplotlib.patheffects as PathEffects
plt.figure(figsize=[5, 5])
ds.plot_tsne(pca_dims=33, metric="correlation", s=5, run_tsne=False, colors=tsne_colors)
plt.axis('off')

image1

  1. plot the merged clusters in t-SNE embedding
cmap = plt.get_cmap('rainbow')
jet_colors = cmap(np.array(list(range(len(merge_indices)))) / (len(merge_indices) - 1))
plt.figure(figsize=[5, 5])
tsne_colors = np.zeros([len(ds.centroids), 4])
tsne_colors[..., :] = [0.8, 0.8, 0.8, 1]
for idx, mi in enumerate(merge_indices):
    tsne_colors[mi] = jet_colors[idx]
    ds.plot_tsne(pca_dims=33, metric="correlation", s=5, run_tsne=False, colors=tsne_colors)
plt.axis('off')

image2

  1. update the analysis object with the clusters to remove and merge
analysis.exclude_and_merge_clusters(exclude_indices, merge_indices, centroid_correction_threshold=0.6)

Thresholding the de-novo cell-type map

After cell-type signatures are calculated, the tissue image can be classified. The classification of each pixel is based on the Pearson correlation metric (although an experimental adversarial autoencoder based classification method can be applied).

We found that a minimum correlation threshold (min_r) of 0.3 worked well for guided mode based on single cell RNAseq cell-type signatures, and 0.6 worked well for de novo mode.

Below we show how the cell-type map changes using correlation thresholds of 0.4,0.6,0.8 for the guided cell-type map.

analysis.map_celltypes()

analysis.filter_celltypemaps(min_norm=filter_method, filter_params=filter_params, min_r=0.4, fill_blobs=True, min_blob_area=50, output_mask=output_mask)
plt.figure(figsize=[5, 5])
ds.plot_celltypes_map(colors=denovo_celltype_colors, rotate=1, set_alpha=False)

analysis.filter_celltypemaps(min_norm=filter_method, filter_params=filter_params, min_r=0.6, fill_blobs=True, min_blob_area=50, output_mask=output_mask)
plt.figure(figsize=[5, 5])
ds.plot_celltypes_map(colors=denovo_celltype_colors, rotate=1, set_alpha=False)

analysis.filter_celltypemaps(min_norm=filter_method, filter_params=filter_params, min_r=0.8, fill_blobs=True, min_blob_area=50, output_mask=output_mask)
plt.figure(figsize=[5, 5])
ds.plot_celltypes_map(colors=denovo_celltype_colors, rotate=1, set_alpha=False)

Visualisation of 2D gene expression embeddings (t-SNE and UMAP)

An important part of presenting the summary of the clustering analysis is 2D visualisation via embedding.

UMAP and t-SNE, are 2 common dimensionality reduction methods that can be useful for displaying clustering results.

Running t-SNE

To run the t-SNE on the ds object: ds.run_tsne(pca_dims=-1,n_iter=5000, perplexity=70, early_exaggeration=10, metric="correlation", exclude_bad_clusters=True, random_state=0, tsne_kwargs={})

  • pca_dims: Number of PCA dimensions used for the tSNE embedding.
  • n_iter: Maximum number of iterations for the tSNE.
  • perplexity: The perplexity value of the tSNE (please refer to the section How should I set the perplexity in t-SNE? ).
  • early_exaggeration: Early exaggeration parameter for tSNE. Controls the tightness of the resulting tSNE plot.
  • metric: Metric for calculation of distance between vectors in gene expression space.
  • exclude_bad_clusters: If true, the vectors that are excluded by the clustering algorithm will not be considered for tSNE computation.
  • random_state: Random seed or scikit-learn’s random state object to replicate the same result
  • tsne_kwargs: Other keyward parameters for tSNE.

Running UMAP

To run the t-SNE on the ds object: ds.run_umap(self, pca_dims=-1, metric="correlation", min_dist=0.8, exclude_bad_clusters=True, random_state=0, umap_kwargs={})

  • pca_dims: Number of PCA dimensions used for the UMAP embedding.
  • metric: Metric for calculation of distance between vectors in gene expression space.
  • min_dist: ‘min_dist’ parameter for UMAP.
  • exclude_bad_clusters: If true, the vectors that are excluded by the clustering algorithm will not be considered for UMAP computation.
  • random_state: Random seed or scikit-learn’s random state object to replicate the same result
  • umap_kwargs: Other keyward parameters for UMAP.

Plotting embeddings

Plotting of the t-SNE and UMAP beddings can be performed by:

ds.plot_embedding(method='umap')
ds.plot_embedding(method='tSNE')

image0

Identifying tissue domains

Cells are organised into tissues and organs. Spatial gene expression not only allows the identification of cell types in situ, but also allows investigation of how these cells are organised.

SSAM facilitates the identification of “tissue domains”, which are regions in the tissue exhibiting similar local cell type composition. This is based on circular window sampling with a defined radius and step, which is then followed by agglomerative clustering.

Perform circular window sampling

The first step is to sample cell-type composition in circular sweeping windows. For this, the size of circular window (radius) and the step between each sampling (step) has to be defined. The units here are in um, which is also equivalent to pixels in this example. The following performs this sampling using a circular window of 100um, with 10um steps:

analysis.bin_celltypemaps(step=10, radius=100)

Clustering domain signatures

After performing the sampling, we continue with identifying domain signatures through clustering. This is based on agglomerative clustering to identify the initial clusters (n_clusters) of windows which include a minimum number of classified pixels (norm_thres), followed cluster merging when the correlation between clusters exceeds a threshold (merge_thres). The merging of clusters can be restricted to adjacent clusters (merge_remote=FALSE), or not restricted to spatial proximity (merge_remote=True)

analysis.find_domains(n_clusters=20, merge_remote=True, merge_thres=0.7, norm_thres=1500)

Visualizing identified domains

Once the domains have been indentified, they have to be visualised for evaluation.

from matplotlib.colors import ListedColormap
cmap_jet = plt.get_cmap('jet')
num_domains = np.max(ds.inferred_domains_cells) + 1

fig, axs = plt.subplots(1, num_domains, figsize=(4*num_domains, 4))
for domain_idx in range(num_domains):
    ax = axs[domain_idx]
    plt.sca(ax)
    plt.axis('off')
    cmap = ListedColormap([cmap_jet(lbl_idx / num_domains) if domain_idx == lbl_idx else "#cccccc" for lbl_idx in range(num_domains)])
    ds.plot_domains(rotate=1, cmap=cmap)
plt.tight_layout()
plt.savefig(f'plots/domains_individual')
side by side plot of all tissue domains

side by side plot of all tissue domains

Post-processing the identified domains

In certain cases, one may wish to exclude certain domains (excluded_domain_indices) as they may originate from tissue artifacts or contain no information. In our case the third domain (0 based index 2) seems to be an artifact and the fourth one contains no useful information. The First two domains are obviously part of the same layer and can therefore be merged.

Due to possible imaging artifacts such as tiling, some domains might be split. While it is still possible to tune the merge_thres in the clustering step, one can simply perform this as manual post processing. In the case above, there do not appear to be any domains that require merging.

Once the domains to be excluded or merged have been determined, they can be excluded and removed(!):

excluded_domain_indices = [2,3,7,10]
merged_domain_indices = [[0,1],[9,11]]
analysis.exclude_and_merge_domains(excluded_domain_indices, merged_domain_indices)

The final plot

The individual domains represent the established neocortex layering patterns found in the mouse brain. We can continue with assigning domain colours, names, and plotting all of the domains together.

plt.figure(figsize=[5, 5])
ds.plot_domains(rotate=1)

image0

Cell-type composition analysis in tissue domains

After identifying tissue domains that exhibit specific cell-type composition properties, it may be desirable to report the cell-type composition properties of the identified domains.

In the SSAM manuscript we used this functionality to identify that astrocytes cell type representation of neocortex layer were previously under-reported, and identified the cell-type composition of novel layering patterns in the primary visual cortex (VISp).

Performing the cell-type composition analysis

The analysis is initiated on the analysis object:

analysis.calc_cell_type_compositions()

Plotting the composition of each domain

Once this has completed, you can plot the cell-type composition of the different layers using the plot function. In the following exmaple, we plot the 7 identified layers (domain_index = 0-6) in the order that they would appear in the neocortex:

# note - this could be wrapped up into a function
for domain_idx in [1, 0, 2, 3, 4, 5, 6]:
    plt.figure(figsize=[5, 5])
    ds.plot_celltype_composition(domain_idx,
                                 cell_type_colors=denovo_celltype_colors,
                                 cell_type_orders=heatmap_clusters_index[::-1],
                                 label_cutoff=0.03)
    plt.title(domain_labels[domain_idx])

image0

Plotting the composition of the entire tissue

It would be worthwhile to compare the cell-type composition within each domain, and compare this to what is observed over the entire tissue. The cell-type compostion over the entire tissue is stored as the last domain, in this case the 8th element (domain_index = 7):

# note - this can be wrapped up into a function
plt.figure(figsize=[5, 5])
ds.plot_celltype_composition(domain_index=7,
                             cell_type_colors=denovo_celltype_colors,
                             cell_type_orders=heatmap_clusters_index[::-1],
                             label_cutoff=0.03)
plt.title('All')

image1

Experimental features

We will endevour to improve the functionality of SSAM by implementing novel features. So far, these experimental features only works with the develop branch of SSAM.

The current novel features supported by SSAM include:

Cell-type classification using Adversarial Autoencoders

The default classification algorithm is based on Pearson correlation as this has been shown to be effective for automatic classification of cell types for single cell RNAseq experiments. This proved to be both highly performant and accurate also for spatial gene expression data. However, it may be desirable to explore other classification methods.

One recent and exciting Deep Learning framework that achieve competitive results in generative modeling and semi-supervised classification tasks are adversarial autoencoders.

SSAM implements a modified version of adversarial autoencoder classifier based on the original implementation by Shahar Azulay.

Mapping cell types using an adversarial autoencoder

In order to use the AAEC classification of pixels instead of the Pearson correlation based method, simply replace analysis.map_celltypes() with :

analysis.map_celltypes_aaec(epochs=1000, seed=0, batch_size=1000, chunk_size=100000, z_dim=10, noise=0)

Segmenting the SSAM cell type map

While we demonstrate the accuracy of SSAM in reconstructing celltype maps, we understand that many applications in biology require cell segmentation. As such, the development branch of SSAM supports segmentation of the celltype map using the watershed algorithm.

This is an experimental feature!

The segmentation of the cell type map can be performed by:

# Load DAPI image
with open('zenodo/osmFISH/raw_data/im_nuc_small.pickle', 'rb') as f:
    dapi = pickle.load(f)
dapi_small = np.hstack([dapi.T[:1640], np.zeros([1640, 12])]).reshape(ds.vf_norm.shape)

# Threshold DAPI image to create markers
dapi_threshold = filters.threshold_local(dapi_small[..., 0], 35, offset=-0.0002)
dapi_thresh_im = dapi_small[..., 0] > dapi_threshold
dapi_thresh_im = dapi_thresh_im.reshape(ds.vf_norm.shape).astype(np.uint8) * 255

# Run watershed segmentation of cell-type maps with DAPI as markers
# After running below, the segmentation data will be available as:
#  - Segmentations: ds.watershed_segmentations
#  - Cell-type map: ds.watershed_celltype_map
analysis.run_watershed(dapi_thresh_im)

Below we demonstrate the application of the segmentation on the de novo celltype map generated for the mouse SSp osmFISH data.

image0

Module contents

class ssam.SSAMAnalysis(dataset, ncores=-1, save_dir='', verbose=False)[source]

Bases: object

A class to run SSAM analysis.

Parameters:
  • dataset (SSAMDataset) – A SSAMDataset object.
  • ncores (int) – Number of cores for parallel computation. If a negative value is given, ((# of all available cores on system) - abs(ncores)) cores will be used.
  • save_dir (str) – Directory to store intermediate data (e.g. density / vector field). Any data which already exists will be loaded and reused.
  • verbose (bool) – If True, then it prints out messages during the analysis.
bin_celltypemaps(step=10, radius=100)[source]

Sweep a sphere window along a lattice on the image, and count the number of cell types in each window.

Parameters:
  • step (int) – The lattice spacing.
  • radius (int) – The radius of the sphere window.
calc_cell_type_compositions()[source]

Calculate cell type compositions in each domain.

calc_correlation_map(corr_size=3)[source]

Calculate local correlation map of the vector field.

Parameters:corr_size (int) – Size of square (or cube) that is used to compute the local correlation values. This value should be an odd number.
calc_spatial_relationship()[source]

Calculate spatial relationship between the domains using the result of bin_celltypemap.

cluster_vectors(pca_dims=10, min_cluster_size=0, resolution=0.6, prune=0.06666666666666667, snn_neighbors=30, max_correlation=1.0, metric='correlation', subclustering=False, dbscan_eps=0.4, centroid_correction_threshold=0.8, random_state=0)[source]

Cluster the given vectors using the specified clustering method.

Parameters:
  • pca_dims (int) – Number of principal componants used for clustering.
  • min_cluster_size (int) – Set minimum cluster size.
  • resolution (float) – Resolution for Louvain community detection.
  • prune (float) – Threshold for Jaccard index (weight of SNN network). If it is smaller than prune, it is set to zero.
  • snn_neighbors (int) – Number of neighbors for SNN network.
  • max_correlation (bool) – Clusters with higher correlation to this value will be merged.
  • metric (str) – Metric for calculation of distance between vectors in gene expression space.
  • subclustering (bool) – If True, each cluster will be clustered once again with DBSCAN algorithm to find more subclusters.
  • centroid_correction_threshold (float) – Centroid will be recalculated with the vectors which have the correlation to the cluster medoid equal or higher than this value.
  • random_state (int or random state object) – Random seed or scikit-learn’s random state object to replicate the same result
exclude_and_merge_clusters(exclude=[], merge=[], centroid_correction_threshold=0.8)[source]

Exclude bad clusters (including the vectors in the clusters), and merge similar clusters for the downstream analysis.

Parameters:
  • exclude (list(int)) – List of cluster indices to be excluded.
  • merge (list(list(int))) – List of list of cluster indices to be merged.
  • centroid_correction_threshold (float) – Centroid will be recalculated with the vectors which have the correlation to the cluster medoid equal or higher than this value.
exclude_and_merge_domains(exclude=[], merge=[])[source]

Manually exclude or merge domains.

Parameters:
  • exclude (list(int)) – Indices of the domains which will be excluded.
  • merge (list(list(int))) – List of indices of the domains which will be merged.
expand_localmax(r=0.99, min_pixels=7, max_pixels=1000)[source]

Merge the vectors nearby the local max vectors. Only the vectors with the large Pearson correlation values are merged.

Parameters:
  • r (float) – Minimum Pearson’s correlation coefficient to look for the nearby vectors.
  • min_pixels (float) – Minimum number of pixels to merge.
  • max_pixels (float) – Maximum number of pixels to merge.
filter_celltypemaps(min_r=0.6, min_norm=0.1, fill_blobs=True, min_blob_area=0, filter_params={}, output_mask=None)[source]

Post-filter cell type maps created by map_celltypes.

Parameters:
  • min_r (float) – minimum threshold of the correlation.
  • min_norm (str or float) – minimum threshold of the vector norm. If a string is given instead, then the threshold is automatically determined using sklearn’s threshold filter functions (The functions start with threshold_).
  • fill_blobs (bool) – If true, then the algorithm automatically fill holes in each blob.
  • min_blob_area (int) – The blobs with its area less than this value will be removed.
  • filter_params (dict) – Filter parameters used for the sklearn’s threshold filter functions. Not used when min_norm is float.
  • output_mask (np.ndarray(bool)) – If given, the cell type maps will be filtered using the output mask.
find_domains(centroid_indices=[], n_clusters=10, norm_thres=0, merge_thres=0.6, merge_remote=True)[source]

Find domains in the image, using the result of bin_celltypemaps.

Parameters:
  • centroid_indices (list(int)) – The indices of centroids which will be used for determine tissue domains.
  • n_clusters (int) – Initial number of clusters (domains) of agglomerative clustering.
  • norm_thres (int) – Threshold for the total number of cell types in each window. The window which contains the number of cell-type pixels less than this value will be ignored.
  • merge_thres (float) – Threshold for merging domains. The centroids of the domains which have higher correlation to this value will be merged.
  • merge_remote (bool) – If true, allow merging clusters that are not adjacent to each other.
find_localmax(search_size=3, min_norm=0, min_expression=0, mask=None)[source]

Find local maxima vectors in the norm of the vector field.

Parameters:
  • search_size (int) – Size of square (or cube in 3D) that is used to search for the local maxima. This value should be an odd number.
  • min_norm (float) – Minimum value of norm at the local maxima.
  • min_expression (float) – Minimum value of gene expression in a unit pixel at the local maxima. mask: numpy.ndarray, optional If given, find vectors in the masked region, instead of the whole image.
map_celltypes(centroids=None)[source]

Create correlation maps between the centroids and the vector field. Each correlation map corresponds each cell type map.

Parameters:centroids (list(np.array(int))) – If given, map celltypes with the given cluster centroids.
normalize_vectors(use_expanded_vectors=False, normalize_gene=False, normalize_vector=False, normalize_median=False, size_after_normalization=10000.0, log_transform=False, scale=False)[source]

Normalize and regularize vectors

Parameters:
  • use_expanded_vectors (bool) – If True, use averaged vectors nearby local maxima of the vector field.
  • normalize_gene (bool) – If True, normalize vectors by sum of each gene expression across all vectors.
  • normalize_vector (bool) – If True, normalize vectors by sum of all gene expression of each vector.
  • log_transform (bool) – If True, vectors are log transformed.
  • scale (bool) – If True, vectors are z-scaled (mean centered and scaled by stdev).
normalize_vectors_sctransform(use_expanded_vectors=False, normalize_vf=True, vst_kwargs={})[source]

Normalize and regularize vectors using SCtransform

Parameters:
  • use_expanded_vectors (bool) – If True, use averaged vectors nearby local maxima of the vector field.
  • normalize_vf (bool) – If True, the vector field is also normalized using the same parameters used to normalize the local maxima.
  • vst_kwargs (dict) – Optional keywords arguments for sctransform’s vst function.
rescue_cluster(gene_names, expression_thresholds=[])[source]
run_fast_kde(kernel='gaussian', bandwidth=2.5, sampling_distance=1.0, re_run=False, use_mmap=False)[source]

Run KDE faster than run_kde method. This method uses precomputed kernels to estimate density of mRNA.

Parameters:
  • kernel (str) – Kernel for density estimation. Currently only Gaussian kernel is supported.
  • bandwidth (float) – Parameter to adjust width of kernel. Set it 2.5 to make FWTM of Gaussian kernel to be ~10um (assume that avg. cell diameter is ~10um).
  • sampling_distance (float) – Grid spacing in um. Currently only 1 um is supported.
  • re_run (bool) – Recomputes KDE, ignoring all existing precomputed densities in the data directory.
  • use_mmap (bool) – Use MMAP to reduce memory usage during analysis. Currently not implemented, this option should be always disabled.
run_kde(kernel='gaussian', bandwidth=2.5, sampling_distance=1.0, use_mmap=False)[source]

Run KDE to estimate density of mRNA.

Parameters:
  • kernel (str) – Kernel for density estimation.
  • bandwidth (float) – Parameter to adjust width of kernel. Set it 2.5 to make FWTM of Gaussian kernel to be ~10um (assume that avg. cell diameter is ~10um).
  • sampling_distance (float) – Grid spacing in um.
  • use_mmap (bool) – Use MMAP to reduce memory usage during analysis. Turning on this option can reduce the amount of memory used by SSAM analysis, but also lower the analysis speed.
class ssam.SSAMDataset(genes, locations, width, height, depth=1)[source]

Bases: object

A class to store intial values and results of SSAM analysis.

Parameters:
  • genes (list(str)) – The genes that will be used for the analysis.
  • locations (list(numpy.ndarray)) – Location of the mRNAs in um, given as a list of N x D ndarrays (N is number of mRNAs, D is number of dimensions).
  • width (float) – Width of the image in um.
  • height (float) – Height of the image in um.
  • depth (float) – Depth of the image in um. Depth == 1 means 2D image.
get_celltype_correlation(idx)[source]

Get correlation values of a cell type map between the given cluster’s centroid to the vector field.

Parameters:idx (int) – Index of a cluster
Returns:Correlation values of a cell type map of the specified cluster’s centroid
Return type:numpy.ndarray
plot_celltype_composition(domain_index, cell_type_colors=None, cell_type_cmap='jet', cell_type_orders=None, label_cutoff=0.03, pctdistance=1.15, **kwargs)[source]

Plot composition of cell types in each domain.

Parameters:
  • domain_index (int) – Index of the domain.
  • cell_type_colors (str or list(float)) – The colors of the cell types. Overrides cell_type_cmap parameter.
  • cell_type_cmap (str or matplotlib.colors.Colormap) – The colormap for the cell types.
  • label_cutoff (float) – The minimum cutoff of the labeling of the percentage. From 0 to 1.
  • pctdistance (float) – The distance from center of the pie to the labels.
  • kwargs – More kewward arguments for the matplotlib.pyplot.pie.
plot_celltypes_map(background='black', centroid_indices=[], colors=None, cmap='jet', rotate=0, min_r=0.6, set_alpha=False, z=None)[source]

Plot the merged cell-type map.

Parameters:
  • background (str or list(float)) – Set background color of the cell-type map.
  • centroid_indices (list(int)) – The centroids which will be in the cell type map. If not given, the cell-type map is drawn with all centroids.
  • colors (list(str), list(list(float))) – Color of the clusters. Overrides cmap parameter.
  • cmap (str or matplotlib.colors.Colormap) – Colormap for the clusters.
  • rotate (int) – Rotate the plot. Possible values are 0, 1, 2, and 3.
  • min_r (float) – Minimum correlation threshold for the cell-type map. This value is only for the plotting, does not affect to the cell-type maps generated by filter_celltypemaps.
  • set_alpha (bool) – Set alpha of each pixel based on the correlation. Not properly implemented yet, doesn’t work properly with the background other than black.
  • z (int) – Z index to slice 3D cell-type map. If not given, the slice at the middle will be used.
plot_correlation_map(cmap='hot')[source]

Plot the correlations near the vectors in the vector field (Not fully implemented yet).

Parameters:cmap – Colormap for the image.
plot_diagnostic_plot(centroid_index, cluster_name=None, cluster_color=None, cmap=None, rotate=0, z=None, use_embedding='tsne', known_signatures=[], correlation_methods=[])[source]

Plot the diagnostic plot. This method requires plot_tsne or plot_umap was run at least once before.

Parameters:
  • centroid_index (int) – Index of the centroid for the diagnostic plot.
  • cluster_name (str) – The name of the cluster.
  • cluster_color (str or list(float)) – The color of the cluster. Overrides cmap parameter.
  • cmap (str or matplotlib.colors.Colormap) – The colormap for the clusters. The cluster color is determined using the centroid_index th color of the given colormap.
  • rotate (int) – Rotate the plot. Possible values are 0, 1, 2, and 3.
  • z (int) – Z index to slice 3D vector norm and cell-type map plots. If not given, the slice at the middle will be used.
  • use_embedding (str) – The type of the embedding for the last panel. Possible values are “tsne” or “umap”.
  • known_signatures (list(tuple)) – The list of known signatures, which will be displayed in the 3rd panel. Each signature can be 3-tuple or 4-tuple, containing 1) the name of signature, 2) gene labels of the signature, 3) gene expression values of the signature, 4) optionally the color of the signature.
  • correlation_methods (list(tuple)) – The correlation method used to determine max correlation of the centroid to the known_signatures. Each method should be 2-tuple, containing 1) the name of the correaltion, 2) the correaltion function (compatiable with the correlation methods available in scipy.stats)
plot_domains(background='white', colors=None, cmap='jet', rotate=0, domain_background=False, background_alpha=0.3, z=None)[source]

Plot tissue domains.

Parameters:
  • background (str or list(float)) – Background color of the plot.
  • colors (list(str), list(list(float))) – Color of the domains. Overrides cmap parameter.
  • cmap (str or matplotlib.colors.Colormap) – Colormap for the domains.
  • rotate (int) – Rotate the plot. Possible values are 0, 1, 2, and 3.
  • domain_background (bool) – Show the area of the inferred domains behind the domain map.
  • background_alpha (float) – The alpha value of the area of the inferred domains.
  • z (int) – Z index to slice 3D domain map. If not given, the slice at the middle will be used.
plot_expanded_mask(cmap='Greys')[source]

Plot the expanded area of the vectors (Not fully implemented yet).

Parameters:cmap – Colormap for the mask.
plot_l1norm(cmap='viridis', rotate=0, z=None)[source]

Plot the L1-norm of the vector field.

Parameters:
  • cmap (str or matplotlib.colors.Colormap) – Colormap used for the plot.
  • rotate (int) – Rotate the plot. Possible values are 0, 1, 2, and 3.
  • z (int) – Z index to slice 3D vector field. If not given, the slice at the middle will be plotted.
plot_localmax(c=None, cmap=None, s=1, rotate=0)[source]

Scatter plot the local maxima.

Parameters:
  • c (str or list(str), or list(float) or list(list(float))) – Color of the scatter dots. Overrides cmap parameter.
  • cmap (str or matplotlib.colors.Colormap) – Colormap of the scatter dots.
  • s – Size of the scatter dots.
  • rotate (int) – Rotate the plot. Possible values are 0, 1, 2, and 3.
plot_spatial_relationships(cluster_labels, *args, **kwargs)[source]

Plot spatial relationship between cell types, presented as a heatmap.

Parameters:
  • cluster_labels (list(str)) – x- and y-axis label of the heatmap.
  • args – More arguments for the seaborn.heatmap.
  • kwargs – More keyword arguments for the seaborn.heatmap.
plot_tsne(run_tsne=False, pca_dims=10, n_iter=5000, perplexity=70, early_exaggeration=10, metric='correlation', exclude_bad_clusters=True, s=None, random_state=0, colors=[], excluded_color='#00000033', cmap='jet', tsne_kwargs={})[source]

Scatter plot the tSNE embedding.

Parameters:
  • run_tsne (bool) – If false, this method tries to load precomputed tSNE result before running tSNE.
  • pca_dims (int) – Number of PCA dimensions used for the tSNE embedding.
  • n_iter (int) – Maximum number of iterations for the tSNE.
  • perplexity (float) – The perplexity value of the tSNE (please refer to the section How should I set the perplexity in t-SNE? in this link).
  • early_exaggeration (float) – Early exaggeration parameter for tSNE. Controls the tightness of the resulting tSNE plot.
  • metric (str) – Metric for calculation of distance between vectors in gene expression space.
  • exclude_bad_clusters (bool) – If true, the vectors that are excluded by the clustering algorithm will not be considered for tSNE computation.
  • s (float) – Size of the scatter dots.
  • random_state (int or random state object) – Random seed or scikit-learn’s random state object to replicate the same result
  • colors (list(str), list(list(float))) – Color of each clusters.
  • excluded_color (str of list(float)) – Color of the vectors excluded by the clustering algorithm.
  • cmap (str or matplotlib.colors.Colormap) – Colormap for the clusters.
  • tsne_kwargs (dict) – Other keyward parameters for tSNE.
plot_umap(run_umap=False, pca_dims=10, metric='correlation', exclude_bad_clusters=True, s=None, random_state=0, colors=[], excluded_color='#00000033', cmap='jet', umap_kwargs={})[source]

Scatter plot the UMAP embedding.

Parameters:
  • run_umap – If false, this method tries to load precomputed UMAP result before running UMAP.
  • pca_dims (int) – Number of PCA dimensions used for the UMAP embedding.
  • metric (str) – Metric for calculation of distance between vectors in gene expression space.
  • exclude_bad_clusters (bool) – If true, the vectors that are excluded by the clustering algorithm will not be considered for tSNE computation.
  • s (float) – Size of the scatter dots.
  • random_state (int or random state object) – Random seed or scikit-learn’s random state object to replicate the same result
  • colors (list(str), list(list(float))) – Color of each clusters.
  • excluded_color (str of list(float)) – Color of the vectors excluded by the clustering algorithm.
  • cmap (str or matplotlib.colors.Colormap) – Colormap for the clusters.
  • umap_kwargs (dict) – Other keyward parameters for UMAP.
vf

Vector field as a numpy.ndarray.

vf_norm

L1-norm of the vector field as a numpy.ndarray.

ssam.run_sctransform(data, clip_range=None, verbose=True, debug_path=None, plot_model_pars=False, **kwargs)[source]

Run ‘sctransform’ R package and returns the normalized matrix and the model parameters. Package ‘feather’ is used for the data exchange between R and Python. :param data: N x D ndarray to normlize (N is number of samples, D is number of dimensions). :type data: numpy.ndarray :param kwargs: Any keyword arguments passed to R function vst. :returns: A 2-tuple, which contains two pandas.dataframe:

  1. normalized N x D matrix.
  2. determined model parameters.

Indices and tables