Filtering local maxima¶
As demonstrated in the SSAM paper, local L1 maxima selection is an effective way of downsampling the entire vector field for faster computation, and they better represent known gene expression profiles compared to random downsampling.
However, local maxima in the vector field can arrise from undesirable locations, e.g. singleton mRNAs. In order to filter less informative local maxima.
We recommend applying threshold for individual genes, and for the total gene expression.
Per gene expression threshold¶
The per gene threshold should be at least the height of a single
Gaussian curve over an mRNA. This can easily be empirically determined
by visual analysis. In this multiplexed smFISH exmaple, the per gene
expression threshold, exp_thres
is set to 0.027
exp_thres = 0.027
viewport = 0.1
gindices = np.arange(len(ds.genes))
np.random.shuffle(gindices)
plt.figure(figsize=[5, 7])
for i, gidx in enumerate(gindices[:6], start=1):
ax = plt.subplot(5, 2, i)
n, bins, patches = ax.hist(ds.vf[..., gidx][np.logical_and(ds.vf[..., gidx] > 0, ds.vf[..., gidx] < viewport)], bins=100, log=True, histtype=u'step')
ax.set_xlim([0, viewport])
ax.set_ylim([n[0], n[-1]])
ax.axvline(exp_thres, c='red', ls='--')
ax.set_title(ds.genes[gidx])
ax.set_xlabel("Expression")
ax.set_ylabel("Count")
plt.tight_layout()
pass
Total gene expression threshold¶
The total gene threshold should be empirically determined by examing the curve of total gene expression of local maxima. This isn’t always easy, and we highly encourage investigating this thoroughly.
norm_thres = 0.2
gidx = 0
plt.figure(figsize=[5, 2])
#plt.hist(ds.vf[..., gidx][ds.vf[..., gidx] > 0], bins=100, log=True)
n, _, _ = plt.hist(ds.vf_norm[np.logical_and(ds.vf_norm > 0, ds.vf_norm < 0.3)], bins=100, log=True, histtype='step')
ax = plt.gca()
ax.axvline(norm_thres, c='red', ls='--')
ax.set_xlabel("L1-norm")
ax.set_ylabel("Count")
plt.xlim([0, 0.3])
plt.ylim([np.min(n), np.max(n) + 100000])
pass
Filtering “stray” local maxima using k-nearest neighbour density¶
If there is mRNA signal originating from outside the tissue area (due to background noise), it would improve downstream analysis to remove such vectors. We observed this in the osMFISH data. These “stray” local maxima tend to be less dense than local maxima from the tissue area:
Because of this, they can be effectively filtered using their k-neearest
neighbor density, in this example settting the threshold
to 0.002.
from sklearn.neighbors import KDTree
X = np.array([ds.local_maxs[0], ds.local_maxs[1]]).T
kdt = KDTree(X, leaf_size=30, metric='euclidean')
rho = 100 / (np.pi * kdt.query(X, k=100)[0][:, 99] ** 2)
threshold = 0.002
plt.figure(figsize=[5, 2.5])
plt.hist(rho, bins=100, histtype='step')
plt.axvline(x=threshold, color='r', linestyle='--')
ax = plt.gca()
ax.set_xlabel("Local KNN density")
ax.set_ylabel("Count")
pass
…. and a quick look at the before and after in the osmFISH dataset