To perform our analysis this time, we will use the MNIST handwritten digit dataset.
%matplotlib notebook
from sklearn.datasets import fetch_openml
import numpy as np
import matplotlib.pyplot as plt

# Load MNIST data from https://www.openml.org/d/554.
# as_frame=False ensures X and y come back as numpy arrays rather than a
# pandas DataFrame (the default in recent scikit-learn versions).
X, y = fetch_openml('mnist_784', version=1, return_X_y=True, as_frame=False)
%matplotlib notebook
np.random.seed(1)
num_samples = 2000
# Select num_samples random MNIST digits (without replacement) and rescale
# the pixel values to [0, 1].
data = X[np.random.choice(X.shape[0], num_samples, replace=False), :] / 255.0
num_clusters = 10
# Inspect a single digit from the dataset.
digit = data[0,:]
digit = np.reshape(digit, (28,28))
plt.imshow(digit, cmap='Greys')
plt.show()
Spectral embedding and other "kernel trick" methods use a kernel $K(s_i, s_j)$ that measures the similarity between data points $s_i$ and $s_j$. The choice of kernel is part of the procedure and typically requires some intuition and trial and error.
A popular choice is the Gaussian kernel $$ K(s_i, s_j) = \exp\left(- \frac{\|s_i - s_j\|^2}{2\sigma^2}\right), $$ where the bandwidth $\sigma$ sets the distance scale over which points are considered similar.
%matplotlib notebook
import numpy.linalg as nla

# Gaussian kernel with variance sigma^2.
sigma = 100.0
def kernel(s_i, s_j):
    return np.exp(-nla.norm(s_i - s_j)**2.0 / (2.0*sigma**2.0))
# The kernel matrix of the data.
kernel_matrix = np.zeros((num_samples, num_samples))
for i in range(num_samples):
    for j in range(i, num_samples):
        kernel_matrix[i,j] = kernel(data[i,:], data[j,:])
        kernel_matrix[j,i] = kernel_matrix[i,j]
plt.matshow(kernel_matrix)
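The double loop above makes $O(n^2)$ Python-level function calls and becomes slow as num_samples grows. As an aside, the same matrix can be computed in a vectorized way; this is a sketch that should agree with kernel_matrix up to floating-point round-off:
# Pairwise squared distances via ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b.
sq_norms = np.sum(data**2, axis=1)
sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (data @ data.T)
# Clip tiny negative values from round-off before exponentiating.
kernel_matrix_fast = np.exp(-np.clip(sq_dists, 0.0, None) / (2.0 * sigma**2))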
A key step of spectral embedding is the construction of the graph Laplacian, $L = D - K$, where $D_{ij} = \sum_{l=1}^n K(s_i, s_l) \delta_{ij}$ is the degree matrix and $K_{ij} = K(s_i, s_j)$ is the edge weight matrix (the kernel matrix). The graph Laplacian can be interpreted as an effective quadratic Hamiltonian for a network of harmonically coupled springs: for any vector $x$, $$ x^\top L x = \frac{1}{2} \sum_{i,j} K_{ij} (x_i - x_j)^2, $$ so the low-energy (small-eigenvalue) modes vary slowly between strongly coupled points.
%matplotlib notebook
# The degree of each node is its total attached edge weight (row sum of K).
degrees = np.sum(kernel_matrix, axis=0)
D = np.diag(degrees)
K = kernel_matrix
L = D - K
plt.matshow(L)
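As a quick optional sanity check, $L$ should be symmetric and its rows should sum to zero, which is exactly why the constant vector $(1,\ldots,1)$ is an eigenvector with eigenvalue zero:
# Rows of L sum to zero, so L @ (1, ..., 1) vanishes.
assert np.allclose(L @ np.ones(num_samples), 0.0)
# L is symmetric, so eigh below applies and all eigenvalues are real.
assert np.allclose(L, L.T)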
Compute the spectrum of $L$. Ignore the zero eigenvalue: its eigenvector is always proportional to the constant vector $(1,\ldots,1)$, which carries no information about the cluster structure.
%matplotlib notebook
import numpy.linalg as nla
(eigvals, eigvecs) = nla.eigh(L)
print('Smallest eigenvalues = {}'.format(eigvals[0:2*num_clusters]))
plt.plot(eigvals[1:], 'ro')
plt.xlabel('Eigenvalue index')
plt.ylabel('Eigenvalue')
plt.show()
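The clustering step itself is simple: use the components of the lowest nontrivial eigenvectors as coordinates for each point (the spectral embedding) and run k-means in that space. Below we use scikit-learn's SpectralClustering, which packages this pipeline; as a rough sketch of what it does internally (not an exact match, since sklearn works with a normalized Laplacian and its own kernel scale):
from sklearn.cluster import KMeans
# Embed each point using its components in the num_clusters lowest
# nontrivial eigenvectors (column 0 is the constant zero mode).
embedding = eigvecs[:, 1:num_clusters+1]
# k-means in the embedded space assigns a cluster label to each digit.
manual_labels = KMeans(n_clusters=num_clusters, n_init=10, random_state=0).fit_predict(embedding)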
Here we use scikit-learn's spectral clustering to assign each digit to one of the clusters:
from sklearn.cluster import SpectralClustering

sigma = 500.0
# Spectral clustering. Note that sklearn's RBF affinity is
# exp(-gamma * ||x - y||^2), so gamma = 1/sigma^2 here uses a slightly
# different convention than the kernel above (no factor of 2).
sc = SpectralClustering(n_clusters=num_clusters, gamma=1.0/sigma**2.0, affinity='rbf', n_init=100, random_state=0, assign_labels='kmeans').fit(data)
# Group the data point indices by their assigned cluster label.
skl_sc_clusters_info = [[] for _ in range(num_clusters)]
for ind_point in range(num_samples):
    ind_cluster = sc.labels_[ind_point]
    skl_sc_clusters_info[ind_cluster].append(ind_point)
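Since the ten digit classes are roughly equally common in MNIST, a quick first diagnostic is whether the cluster sizes are reasonably balanced; wildly uneven sizes usually indicate a poorly scaled kernel:
# Number of points assigned to each cluster.
print(np.bincount(sc.labels_, minlength=num_clusters))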
Here we show which digits were put into which cluster, one row per cluster:
%matplotlib notebook
num_cols = 10
plt.figure()
ind_subplot = 1
for ind_cluster in range(num_clusters):
    for ind_col in range(num_cols):
        plt.subplot(num_clusters, num_cols, ind_subplot)
        ind_point = skl_sc_clusters_info[ind_cluster][ind_col]
        digit = data[ind_point, :]
        digit = np.reshape(digit, (28,28))
        plt.imshow(digit, cmap='Greys')
        frame1 = plt.gca()
        frame1.axes.xaxis.set_ticklabels([])
        frame1.axes.yaxis.set_ticklabels([])
        ind_subplot += 1
plt.show()