What is: collaborative filtering, n-grams, cosine distance?
Collaborative filtering:
- Technique used by some recommender systems
- Filtering for information or patterns using techniques that combine input from multiple agents, viewpoints, or data sources
1. A user expresses preferences by rating items (e.g. movies, CDs)
2. The system matches this user's ratings against other users' ratings and finds the people with the most similar tastes
3. From these similar users, the system recommends items that they have rated highly but that this user has not yet rated
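The three steps above can be sketched as a tiny user-based recommender. This is a minimal illustration, not a production method: the `ratings` data, the function names `similarity` and `recommend`, and the parameters `k` and `min_rating` are all hypothetical, and cosine similarity over co-rated items is just one possible way to compare tastes.

```python
from math import sqrt

# Hypothetical data for step 1: user -> {item: rating on a 1-5 scale}.
ratings = {
    "alice": {"Alien": 5, "Brazil": 4, "Casablanca": 1},
    "bob": {"Alien": 5, "Brazil": 5, "Casablanca": 1, "Dune": 5},
    "carol": {"Alien": 1, "Brazil": 2, "Casablanca": 5, "Eraserhead": 5},
}


def similarity(a, b):
    """Cosine similarity restricted to the items both users have rated."""
    common = set(a) & set(b)
    if not common:
        return 0.0
    dot = sum(a[i] * b[i] for i in common)
    norm_a = sqrt(sum(a[i] ** 2 for i in common))
    norm_b = sqrt(sum(b[i] ** 2 for i in common))
    return dot / (norm_a * norm_b)


def recommend(user, ratings, k=1, min_rating=4):
    """Items the k most similar users rated highly that `user` has not rated."""
    mine = ratings[user]
    # Step 2: rank the other users by similarity to this user.
    neighbours = sorted(
        (other for other in ratings if other != user),
        key=lambda other: similarity(mine, ratings[other]),
        reverse=True,
    )[:k]
    # Step 3: collect highly rated items this user has not rated yet.
    recs = []
    for other in neighbours:
        for item, score in ratings[other].items():
            if item not in mine and score >= min_rating and item not in recs:
                recs.append(item)
    return recs


print(recommend("alice", ratings))  # ['Dune']
```

Here bob's ratings point in almost the same direction as alice's, so his highly rated, unseen item is what gets recommended.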
n-grams:
- Contiguous sequence of n items from a given sequence of text or speech
- "Andrew is a talented data scientist"
- Bigrams: "Andrew is", "is a", "a talented", "talented data", "data scientist"
- Trigrams: "Andrew is a", "is a talented", "a talented data", "talented data scientist"
- An n-gram model models sequences using statistical properties of n-grams; see: Shannon Game
- More concisely, n-gram model: $P(X_i \mid X_{i-(n-1)}, \dots, X_{i-1})$, i.e. a Markov model
- N-gram model: each word depends only on the previous $n-1$ words
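As a rough sketch of the points above, the snippet below extracts n-grams from the example sentence and builds a maximum-likelihood bigram model (n = 2), where each word depends only on the one before it. The helper names `ngrams` and `bigram_model` and the toy `corpus` are assumptions made for illustration.

```python
from collections import Counter, defaultdict


def ngrams(tokens, n):
    """Return every contiguous run of n tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def bigram_model(tokens):
    """Maximum-likelihood estimate of P(word | previous word) from counts."""
    counts = defaultdict(Counter)
    for prev, word in ngrams(tokens, 2):
        counts[prev][word] += 1
    return {
        prev: {w: c / sum(following.values()) for w, c in following.items()}
        for prev, following in counts.items()
    }


tokens = "Andrew is a talented data scientist".split()
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)
print(bigrams[:3])  # [('Andrew', 'is'), ('is', 'a'), ('a', 'talented')]

# Hypothetical toy corpus, chosen so that "the" is followed by two different words.
corpus = "the cat sat on the mat".split()
model = bigram_model(corpus)
print(model["the"])  # {'cat': 0.5, 'mat': 0.5}
```

Any word pair that never occurs in the corpus is simply absent from `model`, which amounts to a probability of zero.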
Issues:
- Infrequent n-grams have sparse counts, so their estimated probabilities are unreliable, and unseen n-grams get probability zero
- Solution: smooth the probability distributions by assigning non-zero probabilities to unseen words or n-grams
- Methods: Good-Turing, backoff, Kneser-Ney smoothing
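To make the idea of smoothing concrete, here is a minimal sketch using add-one (Laplace) smoothing. Note that this is a simpler technique than the methods listed above and is used only as a stand-in. The function name `laplace_bigram_prob` and the toy `corpus` are hypothetical.

```python
from collections import Counter


def laplace_bigram_prob(tokens, prev, word):
    """P(word | prev) with add-one smoothing over the observed vocabulary."""
    vocab = set(tokens)
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    # Count how often each token appears as the first element of a bigram.
    prev_counts = Counter(tokens[:-1])
    # Adding 1 to every bigram count adds len(vocab) to the denominator.
    return (bigram_counts[(prev, word)] + 1) / (prev_counts[prev] + len(vocab))


corpus = "the cat sat on the mat".split()
seen = laplace_bigram_prob(corpus, "the", "cat")    # (1 + 1) / (2 + 5)
unseen = laplace_bigram_prob(corpus, "the", "sat")  # (0 + 1) / (2 + 5)
print(seen, unseen)
```

The unseen bigram now has a small non-zero probability, and the probabilities over the vocabulary for a given previous word still sum to 1.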
Cosine distance:
- How similar are two documents?
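- Perfect similarity/agreement: cosine similarity of 1
- No agreement: cosine similarity of 0 (orthogonality)
- Measures orientation, not magnitude
- Cosine distance is commonly defined as 1 − cosine similarity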
Given two vectors A and B representing word frequencies:
$$\text{cosine-similarity}(A, B) = \frac{\langle A, B \rangle}{\|A\| \cdot \|B\|}$$
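The formula can be sketched in a few lines for word-frequency vectors. This is an illustration under simple assumptions: documents are tokenised by whitespace, the helper names `cosine_similarity` and `cosine_distance` are hypothetical, and the distance is taken as 1 minus the similarity.

```python
from collections import Counter
from math import sqrt


def cosine_similarity(a, b):
    """Cosine similarity between two word-frequency Counters."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for empty documents
    return dot / (norm_a * norm_b)


def cosine_distance(a, b):
    """One common definition of cosine distance: 1 - cosine similarity."""
    return 1.0 - cosine_similarity(a, b)


doc_a = Counter("data science".split())
doc_b = Counter("data science data science".split())  # same words, twice as long
doc_c = Counter("machine learning".split())           # no words in common

print(cosine_similarity(doc_a, doc_b))  # very close to 1: same orientation
print(cosine_similarity(doc_a, doc_c))  # 0.0: orthogonal vectors
```

The first pair shows why the measure reflects orientation rather than magnitude: repeating a document doubles its vector's length without changing its direction.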