What is: collaborative filtering, n-grams, cosine distance?

Collaborative filtering:
- A technique used by many recommender systems
- Filters for information or patterns by combining input from multiple agents (viewpoints, data sources)
1. A user expresses his/her preferences by rating items (movies, CDs).
2. The system matches this user's ratings against other users' ratings and finds the people with the most similar tastes.
3. Based on those similar users, the system recommends items that they have rated highly but that this user has not yet rated (a minimal sketch follows this list).
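
A minimal Python sketch of steps 2 and 3, assuming a small hypothetical ratings matrix (all users, items, and ratings here are illustrative) and cosine similarity between users' rating vectors:

```python
import numpy as np

# Rows = users, columns = items; 0 means "not yet rated" (hypothetical data).
ratings = np.array([
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [1, 0, 5, 4],
], dtype=float)

def cosine_sim(a, b):
    """Cosine similarity between two rating vectors."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def recommend(user, top_k=1):
    """Step 2: weigh other users by similarity; step 3: score unrated items."""
    others = [u for u in range(len(ratings)) if u != user]
    sims = np.array([cosine_sim(ratings[user], ratings[u]) for u in others])
    scores = sims @ ratings[others] / sims.sum()  # similarity-weighted average rating
    scores[ratings[user] > 0] = -np.inf           # skip items this user already rated
    return np.argsort(scores)[::-1][:top_k]

print(recommend(user=0))  # -> [2]: item 2 is user 0's only unrated item
```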

n-grams:
- Contiguous sequence of n items from a given sequence of text or speech
- "Andrew is a talented data scientist"
- Bi-grams: "Andrew is", "is a", "a talented", ...
- Tri-grams: "Andrew is a", "is a talented", "a talented data", ...
- An n-gram model predicts the next item in a sequence from the statistical properties of n-grams; see: Shannon Game
- More concisely, an n-gram model estimates P(x_i | x_{i−(n−1)}, ..., x_{i−1}): a Markov model
- In an n-gram model, each word depends only on the last n−1 words (a minimal sketch follows this list)
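
A minimal sketch of n-gram extraction and a maximum-likelihood bigram model, using the example sentence above (no smoothing yet; that is the next point):

```python
from collections import Counter

def ngrams(tokens, n):
    """Return the list of contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "Andrew is a talented data scientist".split()
print(ngrams(tokens, 2))  # bigrams: ('Andrew', 'is'), ('is', 'a'), ...

# Bigram model: P(w_i | w_{i-1}) = count(w_{i-1}, w_i) / count(w_{i-1})
bigram_counts = Counter(ngrams(tokens, 2))
unigram_counts = Counter(tokens)

def p_bigram(prev, word):
    """Maximum-likelihood estimate; unseen bigrams get probability 0."""
    return bigram_counts[(prev, word)] / unigram_counts[prev]

print(p_bigram("is", "a"))  # 1.0 in this tiny one-sentence corpus
```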

Issues:
- Infrequent or unseen n-grams get zero (or unreliable) maximum-likelihood estimates
- Solution: smooth the probability distributions by assigning non-zero probabilities to unseen words or n-grams
- Methods: Good-Turing, Backoff, Kneser-Ney smoothing (a simpler add-one variant is sketched below)
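
Good-Turing, backoff, and Kneser-Ney are more involved; the add-one (Laplace) smoothing below is a deliberately simpler sketch of the same idea, reserving probability mass for unseen bigrams:

```python
from collections import Counter

tokens = "Andrew is a talented data scientist".split()
bigram_counts = Counter(zip(tokens, tokens[1:]))
unigram_counts = Counter(tokens)
V = len(unigram_counts)  # vocabulary size

def p_laplace(prev, word):
    """P(word | prev) with add-one smoothing: never exactly zero."""
    return (bigram_counts[(prev, word)] + 1) / (unigram_counts[prev] + V)

print(p_laplace("is", "a"))      # seen bigram:   (1 + 1) / (1 + 6)
print(p_laplace("is", "robot"))  # unseen bigram: (0 + 1) / (1 + 6), small but non-zero
```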

Cosine distance:
- How similar are two documents?
- Perfect similarity/agreement: 1
- No agreement: 0 (orthogonality; for non-negative word-frequency vectors, similarity ranges from 0 to 1)
- Measures orientation, not magnitude
- Strictly, cosine distance = 1 − cosine similarity

Given two vectors A and B representing word frequencies:

cosine-similarity(A, B) = ⟨A, B⟩ / (‖A‖ · ‖B‖)
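
A minimal sketch of this formula on word-frequency vectors built from two hypothetical documents:

```python
from collections import Counter
import math

def cosine_similarity(doc_a, doc_b):
    """<A, B> / (||A|| * ||B||) over word-count vectors."""
    a, b = Counter(doc_a.split()), Counter(doc_b.split())
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b)

print(cosine_similarity("data science is fun", "data science is hard"))  # 0.75
print(cosine_similarity("data science", "cats sleep all day"))           # 0.0 (orthogonal)
```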
