Intro to Machine Learning 6 | Unsupervised Learning and Clustering

Series: Intro to Machine Learning

1. Unsupervised Learning Techniques

  • Principal Component Analysis
  • PageRank
  • Word embeddings like GloVe
  • Anomaly detection

2. Common Unsupervised Learning Models

  • Clustering: k-means / k-medians / k-medoids / mean-shift
  • Hierarchical Clustering
  • Spectral Clustering
  • Collaborative Filtering

3. Clustering Problems

  • Useful in specific circumstances, e.g. compression
  • Generally doesn’t work well with imbalanced data
  • No clear measure of success

4. Common Distances for Clustering

  • We need to normalize the features so that distance means the same thing in every direction (e.g. standardize each dimension to zero mean and unit variance)
  • Taxicab Distance: L1 distance in Euclidean space
  • Euclidean Distance: L2 distance in Euclidean space
  • Chebyshev Distance: the L∞ distance can also be useful
  • Cosine Similarity: cos(θ) for the angle θ between the two vectors; the corresponding cosine distance is 1 − cos(θ)
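The four distances above can be sketched in a few lines of plain Python (the function names here are illustrative, not from the original):

```python
import math

def taxicab(a, b):
    # L1: sum of absolute coordinate differences
    return sum(abs(x - y) for x, y in zip(a, b))

def euclidean(a, b):
    # L2: square root of the summed squared differences
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def chebyshev(a, b):
    # L-infinity: the largest single-coordinate difference
    return max(abs(x - y) for x, y in zip(a, b))

def cosine_distance(a, b):
    # 1 - cos(theta), where cos(theta) = dot(a, b) / (|a| * |b|)
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return 1 - dot / (na * nb)

a, b = (0.0, 3.0), (4.0, 0.0)
print(taxicab(a, b))          # 7.0
print(euclidean(a, b))        # 5.0
print(chebyshev(a, b))        # 4.0
print(cosine_distance(a, b))  # orthogonal vectors -> 1.0
```

Note how the same pair of points gets four different distances; which one is appropriate depends on the geometry of the features, which is one more reason to normalize them first.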

5. k-Means, k-Medians, and k-Medoids

  • k-means uses the mean for centroids; centroids don’t have to be points in X
  • k-medians uses the median instead of the mean for centroids, and it minimizes with respect to the L1, not the L2, distance
  • k-medoids requires the medoids to be points in X; it picks the most centrally located point of each cluster
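As a concrete reference for the first variant, here is a plain-Python sketch of Lloyd's algorithm for k-means (the `kmeans` helper and its defaults are illustrative, a minimal sketch rather than a production implementation):

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Minimal k-means: alternate assignment and mean-update steps."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initial centroids drawn from the data
    for _ in range(iters):
        # Assignment step: attach each point to its nearest centroid (L2)
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda i: sum((a - b) ** 2
                                      for a, b in zip(p, centroids[i])))
            clusters[j].append(p)
        # Update step: move each centroid to the mean of its cluster.
        # The mean need not coincide with any point in X (unlike k-medoids).
        centroids = [
            tuple(sum(c) / len(c) for c in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
    return centroids, clusters

points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (9.0, 9.0), (9.0, 10.0), (10.0, 9.0)]
centroids, clusters = kmeans(points, k=2)
```

Swapping the mean for the median (and L2 for L1) in the update step turns this into k-medians; restricting centroids to the data points themselves gives k-medoids.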

6. Troubles with k-means

  • Requires specifying k in advance, which is usually difficult
  • Not stable: different starting centroids can lead to different results
  • Hard assignments: soft predictions, such as cluster-membership probabilities, would often be preferable
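The last point can be softened after the fact: given fixed centroids, a softmax over the negative squared distances turns hard assignments into probability-like responsibilities, roughly as a Gaussian mixture with equal spherical variances would. A minimal sketch (`soft_assign` and `temperature` are illustrative names, not from the original):

```python
import math

def soft_assign(point, centroids, temperature=1.0):
    # Softmax over negative squared L2 distances: closer centroids get
    # higher probability; temperature controls how "soft" the split is.
    d2 = [sum((a - b) ** 2 for a, b in zip(point, c)) for c in centroids]
    w = [math.exp(-d / temperature) for d in d2]
    s = sum(w)
    return [x / s for x in w]

probs = soft_assign((1.0, 0.0), [(0.0, 0.0), (5.0, 0.0)])
# probs sum to 1, and the nearer centroid gets the larger share
```

Raising the temperature moves the responsibilities toward uniform; lowering it toward zero recovers the hard k-means assignment.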