To perform our analysis this time, we will use the MNIST handwritten digit dataset.
%matplotlib notebook
from sklearn.datasets import fetch_openml
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import rcParams
#rcParams['font.size'] = 30
# Load MNIST data from https://www.openml.org/d/554
X, y = fetch_openml('mnist_784', version=1, return_X_y=True, as_frame=False)
%matplotlib notebook
np.random.seed(1)
num_samples = 2000
# Select num_samples distinct MNIST digits (sampled without replacement) to analyze.
data = X[np.random.choice(X.shape[0], num_samples, replace=False), :] / 255.
num_clusters = 10
# Inspect a single digit from the dataset.
digit = data[0,:]
digit = np.reshape(digit, (28,28))
plt.imshow(digit, cmap='Greys')
plt.show()
Spectral embedding and other "kernel trick" methods rely on a kernel $K(s_i, s_j)$ that measures the similarity between data points $s_i$ and $s_j$. The choice of kernel is part of the procedure and requires some intuition and trial and error.
A popular choice is the Gaussian kernel $$ K(s_i, s_j) = \exp\left(- \frac{\|s_i - s_j\|^2}{2\sigma^2}\right), $$ where $\sigma$ is a bandwidth parameter that sets the length scale on which two points count as similar.
%matplotlib notebook
import numpy.linalg as nla
# Gaussian kernel with variance sigma^2.
sigma = 100.0
def kernel(s_i, s_j):
    return np.exp(-nla.norm(s_i - s_j)**2.0 / (2.0*sigma**2.0))
# The kernel matrix of the data.
kernel_matrix = np.zeros((num_samples, num_samples))
for i in range(num_samples):
    for j in range(i, num_samples):
        # The kernel matrix is symmetric, so only compute the upper triangle.
        kernel_matrix[i,j] = kernel(data[i,:], data[j,:])
        kernel_matrix[j,i] = kernel_matrix[i,j]
plt.matshow(kernel_matrix)
plt.show()
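The nested loop above evaluates the kernel one pair at a time, which becomes slow as num_samples grows. As a minimal sketch (reusing the data and sigma defined above, and assuming SciPy is available), the same Gram matrix can be computed in vectorized form and checked against the loop-based result:
from scipy.spatial.distance import cdist
# Squared Euclidean distances between all pairs of rows,
# then the Gaussian kernel applied elementwise.
sq_dists = cdist(data, data, metric='sqeuclidean')
kernel_matrix_vec = np.exp(-sq_dists / (2.0*sigma**2.0))
# Sanity check: should agree with the loop-based kernel_matrix up to floating-point error.
print(np.allclose(kernel_matrix, kernel_matrix_vec))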