Intro to Machine Learning 6 | Unsupervised Learning and Clustering

Series: Intro to Machine Learning

1. Unsupervised Learning Techniques

  • Principal Component Analysis
  • PageRank
  • Word embeddings like GloVe
  • Anomaly detection

2. Common Unsupervised Learning Models

  • Clustering: k-means / k-medians / k-medoids / mean-shift
  • Hierarchical Clustering
  • Spectral Clustering
  • Collaborative Filtering

3. Clustering Problems

  • Useful in specific circumstances, e.g. compression
  • Generally doesn’t work well with imbalanced data
  • No clear measure of success

4. Common Distances for Clustering

  • We need to normalize the features so that distance means the same thing in every direction (e.g. standardize each dimension to zero mean and unit variance)
  • Taxicab Distance: L1 distance in Euclidean space
  • Euclidean Distance: L2 distance in Euclidean space
  • Chebyshev Distance: the L∞ distance can also be useful
  • Cosine Similarity: cos(θ) for the angle θ between the two vectors; the corresponding cosine distance is 1 − cos(θ)
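The four distances above can be sketched in a few lines of plain Python (the function names here are illustrative, not from the original):

```python
import math

def taxicab(a, b):
    # L1: sum of absolute coordinate differences
    return sum(abs(x - y) for x, y in zip(a, b))

def euclidean(a, b):
    # L2: square root of the summed squared differences
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def chebyshev(a, b):
    # L-infinity: the largest single-coordinate difference
    return max(abs(x - y) for x, y in zip(a, b))

def cosine_distance(a, b):
    # 1 - cos(theta), where cos(theta) = dot(a, b) / (|a| * |b|)
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return 1 - dot / (na * nb)

a, b = (0.0, 3.0), (4.0, 0.0)
print(taxicab(a, b))          # 7.0
print(euclidean(a, b))        # 5.0
print(chebyshev(a, b))        # 4.0
print(cosine_distance(a, b))  # orthogonal vectors -> 1.0
```

Note how the same pair of points gets four different distances; which one is appropriate depends on the geometry of the features, which is one more reason to normalize them first.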

5. k-Means, k-Medians, and k-Medoids

  • k-means uses the mean for centroids; centroids don’t have to be points in X
  • k-medians uses the median instead of the mean for centroids, and it minimizes with respect to the L1, not the L2, distance
  • k-medoids requires the medoids to be points in X; it picks the most centrally located point of each cluster
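As a concrete reference for the first variant, here is a plain-Python sketch of Lloyd's algorithm for k-means (the `kmeans` helper and its defaults are illustrative, a minimal sketch rather than a production implementation):

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Minimal k-means: alternate assignment and mean-update steps."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initial centroids drawn from the data
    for _ in range(iters):
        # Assignment step: attach each point to its nearest centroid (L2)
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda i: sum((a - b) ** 2
                                      for a, b in zip(p, centroids[i])))
            clusters[j].append(p)
        # Update step: move each centroid to the mean of its cluster.
        # The mean need not coincide with any point in X (unlike k-medoids).
        centroids = [
            tuple(sum(c) / len(c) for c in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
    return centroids, clusters

points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (9.0, 9.0), (9.0, 10.0), (10.0, 9.0)]
centroids, clusters = kmeans(points, k=2)
```

Swapping the mean for the median (and L2 for L1) in the update step turns this into k-medians; restricting centroids to the data points themselves gives k-medoids.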

6. Troubles with k-means

  • Requires specifying k in advance, which is usually difficult
  • Not stable: different starting centroids can lead to different results
  • Hard assignments: soft predictions, such as cluster-membership probabilities, would often be preferable
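The last point can be softened after the fact: given fixed centroids, a softmax over the negative squared distances turns hard assignments into probability-like responsibilities, roughly as a Gaussian mixture with equal spherical variances would. A minimal sketch (`soft_assign` and `temperature` are illustrative names, not from the original):

```python
import math

def soft_assign(point, centroids, temperature=1.0):
    # Softmax over negative squared L2 distances: closer centroids get
    # higher probability; temperature controls how "soft" the split is.
    d2 = [sum((a - b) ** 2 for a, b in zip(point, c)) for c in centroids]
    w = [math.exp(-d / temperature) for d in d2]
    s = sum(w)
    return [x / s for x in w]

probs = soft_assign((1.0, 0.0), [(0.0, 0.0), (5.0, 0.0)])
# probs sum to 1, and the nearer centroid gets the larger share
```

Raising the temperature moves the responsibilities toward uniform; lowering it toward zero recovers the hard k-means assignment.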