t-SNE is a non-linear method for visualizing structure in high-dimensional datasets that has been popularized by single-cell sequencing studies. In contrast to PCA, which is a linear method that seeks to preserve global structure, t-SNE aims to preserve local structure. t-SNE is often used in combination with PCA, and can be used to enhance the visualization of structure present in PC-PC plots. Here is an example from one of the papers we will study this semester (Figure 1).
The mathematics behind t-SNE are a bit complicated and are beyond the scope of this review. Here we seek to get a general feel for this method. Its primary goal is to map nearby points in the original high-dimensional data to nearby points in a two- or three-dimensional representation.
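As a concrete illustration of how t-SNE is commonly run on top of PCA, here is a minimal sketch using scikit-learn; the data matrix `X` and the parameter values are placeholders, not settings from any particular paper.

```python
# Minimal sketch: reduce to the top principal components first, then run t-SNE.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2000))   # placeholder: 500 samples x 2000 features (e.g. cells x genes)

X_pca = PCA(n_components=50).fit_transform(X)                 # denoise / compress first
X_tsne = TSNE(n_components=2, perplexity=30,
              random_state=0).fit_transform(X_pca)            # 2-D map of local structure
# Nearby rows of X_pca should end up as nearby points in X_tsne.
```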
The structure of many complex, non-linear shapes (manifolds), such as the famous “Swiss roll”, can be reasonably represented using non-linear methods but not by linear methods such as PCA. If the original features are related by complex non-linear functions (for example, polynomials), PCA will not capture that structure well.
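A quick way to see this is sketched below using scikit-learn's built-in Swiss roll generator; the sample size and perplexity are arbitrary choices.

```python
# Sketch: PCA vs. t-SNE on the Swiss roll.
from sklearn.datasets import make_swiss_roll
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

X, t = make_swiss_roll(n_samples=1000, noise=0.05, random_state=0)  # t = position along the roll
X_pca = PCA(n_components=2).fit_transform(X)
X_tsne = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)
# Colouring both embeddings by t (e.g. with matplotlib) shows that PCA squashes
# distant parts of the roll on top of each other, while t-SNE keeps local
# neighbourhoods along the roll together.
```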
The O’Reilly tutorial listed below encourages the reader to think about the motivation for the t-distribution by visualizing a sphere in multiple dimensions. If random points are picked within the sphere, then most of them will be near the surface, as illustrated in Figure 2. The proportion of points near the surface increases with dimensionality, compressing most of the data toward it. Because the t-distribution has heavier tails than a Gaussian, it spreads moderately distant points further apart in the low-dimensional map, which alleviates this crowding.
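This concentration effect is easy to check numerically. The sketch below samples points uniformly inside a unit ball and counts how many fall within 10% of the surface; the dimensions and sample size are arbitrary.

```python
# Sketch: most of the volume of a high-dimensional ball lies near its surface.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
for d in (2, 10, 50):
    # Uniform sampling in the d-dimensional unit ball:
    # uniform direction on the sphere, radius drawn with density proportional to r^(d-1).
    g = rng.normal(size=(n, d))
    directions = g / np.linalg.norm(g, axis=1, keepdims=True)
    radii = rng.uniform(size=(n, 1)) ** (1.0 / d)
    points = directions * radii
    frac_near_surface = np.mean(np.linalg.norm(points, axis=1) > 0.9)
    print(d, frac_near_surface)   # approaches 1 - 0.9**d, which rises quickly with d
```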
The mapped space goes through several iterations as the optimization proceeds. Mapped points start out really close together, and then the graph relaxes as the distributions are recalculated. Figure 3 shows the evolution of a set of data containing 10 natural clusters.
The original algorithm is computationally costly: its running time grows with the square of the number of data points. We say that it runs on an “order” of \(n\) squared, which we write using “Big O” notation as \(\mathcal{O}(n^2)\). An implementation using the Barnes-Hut approximation greatly speeds up the processing time, allowing it to be run on very large real-world datasets.
Barnes-Hut speeds up the calculations by dividing the space into small cells for nearby points (giving high resolution for short distances) and grouping far-away points into very large cells, allowing them to be treated as a single point centered on the cell’s center of mass. Barnes-Hut runs in \(\mathcal{O}(n\log n)\).
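In scikit-learn, both variants are available through the `method` argument of `TSNE`; a minimal sketch with placeholder data is shown below.

```python
# Sketch: exact O(n^2) t-SNE vs. the Barnes-Hut O(n log n) approximation.
import numpy as np
from sklearn.manifold import TSNE

X = np.random.default_rng(0).normal(size=(2000, 50))   # placeholder data

tsne_exact = TSNE(method="exact", random_state=0)                 # O(n^2); only practical for small n
tsne_bh = TSNE(method="barnes_hut", angle=0.5, random_state=0)    # the default; scales to large n
Y = tsne_bh.fit_transform(X)
# `angle` controls the cell size used for distant points: larger values are
# faster but less accurate.
```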
You can think of the map as a dynamic graph composed of nodes connected by springs. If the mapped points get too close together, the springs push them apart. Similarly, points can’t drift too far from each other either, since the springs will pull them back. The length of the spring is proportional to the distance between points in the low-dimensional space. The stiffness of the spring is proportional to the mismatch in the conditional probabilities between the points. This image is from an animation you can find in the O’Reilly tutorial listed at the end of these notes.
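This spring picture corresponds directly to the gradient of the t-SNE cost function with respect to each mapped point \(\mathbf{y_i}\) (using the probabilities \(p_{ij}\) and \(q_{ij}\) and the mapped coordinates defined in the appendix below):

\[ \frac{\partial C}{\partial \mathbf{y_i}} = 4\sum_j{(p_{ij} - q_{ij})(\mathbf{y_i} - \mathbf{y_j})(1 + |\mathbf{y_i} - \mathbf{y_j}|^2)^{-1}} \]

The factor \((p_{ij} - q_{ij})\) plays the role of the spring stiffness (the mismatch between the two spaces), and \((\mathbf{y_i} - \mathbf{y_j})\) sets the direction in which point \(i\) is pushed or pulled.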
The perplexity, which can be loosely interpreted as the effective number of neighbors each point considers, should always be smaller than the total number of points.
Running enough iterations to reach a stable configuration is important.
Figures 5-9 show theoretical examples from one of the tutorials referenced below (see “A cautionary tale”), chosen to illustrate how results may vary depending on different properties of the original dataset and different settings used when running the algorithm.
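To build the same intuition on your own data, it is worth re-running t-SNE over a grid of settings. A sketch is shown below; the data are placeholders, and note that newer scikit-learn releases name the iteration count `max_iter` rather than `n_iter`.

```python
# Sketch: re-running t-SNE with different perplexities and iteration counts.
import numpy as np
from sklearn.manifold import TSNE

X = np.random.default_rng(0).normal(size=(1000, 50))   # placeholder data

embeddings = {}
for perplexity in (5, 30, 100):        # must stay below the number of points
    for n_iter in (250, 1000, 5000):   # 250 is the minimum scikit-learn accepts
        tsne = TSNE(perplexity=perplexity, n_iter=n_iter, random_state=0)
        embeddings[(perplexity, n_iter)] = tsne.fit_transform(X)
# Plotting the embeddings side by side shows how cluster shapes and apparent
# distances shift with the settings, so no single run should be over-interpreted.
```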
Dimensional reduction methods are really useful for visualizing complex data in just two or three dimensions. Different methods have different strengths and drawbacks, and it will take you some time to get an intuitive understanding of their utility and their behavior.
PCA is a well-established linear method that is great for finding the directions of greatest variation within your data.
t-SNE is a relatively new non-linear method that has become very popular because it is very flexible and can often find structure in data when other methods cannot.
It’s always a good idea to play around with different methods and parameters before settling on a final visualization. You will get a better feel for them the more you experiment. You should also get in the habit of cross-checking your results with other techniques as a “sanity check”.
A pretty good overview: https://www.oreilly.com/learning/an-illustrated-introduction-to-the-t-sne-algorithm
A cautionary tale: https://distill.pub/2016/misread-tsne/
The conditional probability \(p_{j|i}\) is given by a Gaussian centered on point \(\mathbf{x_i}\). The variance \(\sigma_i^2\) is different for each point and depends on its local density; these point-specific Gaussians are called Gaussian kernels, and each \(\sigma_i\) is chosen so that the kernel’s perplexity matches the value set by the user. The conditional probability is defined as:
\[ p_{j|i} = \frac{\exp(-|\mathbf{x_i} - \mathbf{x_j}|^2/2\sigma_i^2)}{\sum_{k\ne{i}}\exp(-|\mathbf{x_i} - \mathbf{x_k}|^2/2\sigma_i^2)} \]
The joint probability \(p_{ij}\) is the average of \(p_{i|j}\) and \(p_{j|i}\) divided by the number of points, N:
\[ p_{ij} = \frac{p_{i|j} + p_{j|i}}{2N} \]
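For concreteness, here is a small sketch that computes these quantities for a toy dataset, using a single fixed \(\sigma\) for every point (the real algorithm tunes each \(\sigma_i\) to match the chosen perplexity, as described below).

```python
# Sketch: high-dimensional affinities p_{j|i} and the joint probabilities p_ij.
import numpy as np

X = np.random.default_rng(0).normal(size=(6, 3))   # toy data: 6 points in 3-D
N = X.shape[0]
sigma = 1.0                                        # one shared kernel width, for simplicity

sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)   # |x_i - x_j|^2
P_cond = np.exp(-sq_dists / (2 * sigma**2))
np.fill_diagonal(P_cond, 0.0)                      # p_{i|i} is defined as 0
P_cond /= P_cond.sum(axis=1, keepdims=True)        # row i holds p_{j|i}

P_joint = (P_cond + P_cond.T) / (2 * N)            # p_ij = (p_{j|i} + p_{i|j}) / 2N
# P_joint is symmetric and sums to 1 over all pairs.
```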
The similarity \(q_{ij}\) between two points \(\mathbf{y_i}\) and \(\mathbf{y_j}\) in the mapped (low-dimensional) space is given by a Student’s t-distribution with one degree of freedom and is defined as:
\[ q_{ij} = \frac{(1 + |\mathbf{y_i} - \mathbf{y_j}|^2)^{-1}}{\sum_{k\ne{l}}(1 + |\mathbf{y_k} - \mathbf{y_l}|^2)^{-1}} \]
The cost function \(C\) is the Kullback-Leibler divergence, or relative entropy, between the joint distributions \(P\) and \(Q\) over all pairs of points. For symmetric SNE, it is defined on the joint probabilities \(p_{ij}\) and \(q_{ij}\) as:
\[ C = KL(P||Q) = \sum_i{\sum_j{p_{ij}\log{\frac{p_{ij}}{q_{ij}}}}} \]
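The same toy example can be extended to the mapped space and the cost. The sketch below builds \(Q\) from a random 2-D embedding and evaluates the KL divergence; it recomputes \(P\) so that it runs on its own.

```python
# Sketch: low-dimensional similarities q_ij and the KL-divergence cost.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))
N = X.shape[0]

# High-dimensional joint probabilities (fixed sigma = 1, as in the sketch above).
D = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
P = np.exp(-D / 2.0)
np.fill_diagonal(P, 0.0)
P /= P.sum(axis=1, keepdims=True)
P = (P + P.T) / (2 * N)

# Low-dimensional similarities from a random 2-D embedding Y (Student-t, 1 d.o.f.).
Y = rng.normal(size=(N, 2))
Dy = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
Q = 1.0 / (1.0 + Dy)
np.fill_diagonal(Q, 0.0)
Q /= Q.sum()

# The cost that t-SNE minimises by moving the points in Y with gradient descent.
mask = P > 0
C = np.sum(P[mask] * np.log(P[mask] / Q[mask]))
print(C)
```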
The perplexity is defined as \(Perp(P_i) = 2^{H(P_i)}\), where \(H(P_i)\) is the Shannon entropy measured in bits:
\[ H(P_i) = -\sum_j{p_{j|i}\,\log_2{p_{j|i}}} \]
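In practice each \(\sigma_i\) is found by a search (commonly bisection) so that \(Perp(P_i)\) matches the user-specified perplexity. The sketch below shows the idea for a single point of a toy dataset; the target value and search bounds are arbitrary.

```python
# Sketch: perplexity of P_i for one point, and bisection on sigma_i to hit a target.
import numpy as np

def conditional_probs(sq_dists_i, sigma):
    """p_{j|i} for one point, given squared distances to every other point."""
    p = np.exp(-sq_dists_i / (2 * sigma**2))
    return p / p.sum()

def perplexity(p):
    h = -np.sum(p * np.log2(p + 1e-12))   # Shannon entropy H(P_i) in bits
    return 2.0 ** h                       # Perp(P_i) = 2^H(P_i)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))
i = 0
sq_dists_i = np.sum((X[i] - np.delete(X, i, axis=0)) ** 2, axis=1)

target, lo, hi = 15.0, 1e-3, 1e3          # target perplexity and sigma search bounds
for _ in range(50):
    sigma = 0.5 * (lo + hi)
    if perplexity(conditional_probs(sq_dists_i, sigma)) > target:
        hi = sigma                         # kernel too wide -> shrink it
    else:
        lo = sigma
print(sigma, perplexity(conditional_probs(sq_dists_i, sigma)))
```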