Points to cover
-
Why do words need to be represented as vectors or as integers?
- Because machine learning algorithms can't work with strings or text directly.
- Because computers are designed to compute with numbers much faster than with strings.
-
What was the state of the art before word embeddings?
- These methods were mainly used to make it easy to search for a sentence in a document or corpus.
- Count Vectorizer
- Bag of words -> Just count the words.
- N-grams -> Same as bag of words, but counting contiguous sequences of n words instead of single words (see the bigram sketch after the CountVectorizer example below).
- TF-IDF Vectorizer -> Term Frequency - Inverse Document Frequency, used to weight how important a word is to a document relative to the whole corpus.
- Pros
- Easy to implement
- Easy to understand
- Cons
- No relation between words
- Unintentional similarity between words
- Sparse matrix
- Unable to handle unknown (out-of-vocabulary) words
- CountVectorizer Example:
source: https://learning.oreilly.com/scenarios/understanding-word-embeddings/9781492095309/

```python
# Importing CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer

# Using two documents
text = [
    "Two roads diverged in a wood and I took the one less traveled by, and that has made all the difference.",
    "The best way out is always through."
]

# A CountVectorizer object
vectorizer = CountVectorizer()

# This method tokenizes text and generates vocabulary
vectorizer.fit(text)

print("Generated Vocabulary:")
print(vectorizer.vocabulary_)

print("\nNumber of words in the document:")
print(len(text[0].split()) + len(text[1].split()))

print("\nNumber of words in the vocabulary:")
print(len(vectorizer.vocabulary_))

# Transforming document into a vector based on vocabulary
vector = vectorizer.transform(text)

print("\nShape of the transformed vector:")
print(vector.shape)

print("\nVector representation of the document:")
print(vector.toarray())
```
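The example above only counts single words. A minimal sketch, assuming scikit-learn is installed, of the n-gram counting mentioned in the N-grams bullet; the two toy sentences are made up for illustration:

```python
# Counting unigrams and bigrams instead of single words only
from sklearn.feature_extraction.text import CountVectorizer

text = [
    "the quick brown fox",
    "the quick brown dog",
]

# ngram_range=(1, 2) keeps both single words and pairs of adjacent words
vectorizer = CountVectorizer(ngram_range=(1, 2))
vectorizer.fit(text)

print(vectorizer.vocabulary_)            # contains entries like 'quick brown'
print(vectorizer.transform(text).toarray())
```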
- TfIdf Vectorizer Example:
source: https://learning.oreilly.com/scenarios/understanding-word-embeddings/9781492095309/

```python
# Importing TfidfVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

# Using two documents
text = [
    "Two roads diverged in a wood and I took the one less traveled by, and that has made all the difference.",
    "The best way out is always through."
]

# A TfidfVectorizer object
vectorizer = TfidfVectorizer()

# This method tokenizes text and generates vocabulary
vectorizer.fit(text)

print("Generated Vocabulary:")
print(vectorizer.vocabulary_)

print("\nNumber of words in the document:")
print(len(text[0].split()) + len(text[1].split()))

print("\nNumber of words in the vocabulary:")
print(len(vectorizer.vocabulary_))

print("\nInverse document frequency:")
print(vectorizer.idf_)

# Transforming document into a vector based on vocabulary
vector = vectorizer.transform(text)

print("\nShape of the transformed vectors:")
print(vector.shape)

print("\nVector representation of the documents:")
print(vector.toarray())
```
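Note: the idf_ values printed above follow scikit-learn's default smoothed formula (smooth_idf=True): idf(t) = ln((1 + n) / (1 + df(t))) + 1, where n is the number of documents and df(t) is the number of documents containing term t; each tf-idf row is then L2-normalized. This is why a word that appears in both documents (e.g. "the") gets a lower weight than a word unique to one of them.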
- One-hot encoding (see the sketch after this list)
- Pros
- No unintentional similarity
- Easy to understand
- Easy to implement
- Cons
- More disk space (vector length equals vocabulary size)
- Mostly zero values (very sparse vectors)
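A minimal sketch using only numpy; the vocabulary and words are made up for illustration. It shows both the pro (no unintentional similarity) and the cons (vectors as long as the vocabulary, almost all zeros):

```python
import numpy as np

vocab = ["king", "queen", "man", "woman"]
word_to_index = {word: i for i, word in enumerate(vocab)}

# Identity matrix: row i is the one-hot vector for the i-th word
one_hot = np.eye(len(vocab), dtype=int)

print(word_to_index)
print(one_hot[word_to_index["queen"]])   # [0 1 0 0]

# Every pair of different words has dot product 0, i.e. no similarity at all
print(one_hot[word_to_index["king"]] @ one_hot[word_to_index["queen"]])  # 0
```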
-
Why word embeddings?
- Word embeddings try to solve the problem of representing two similar words as two similar (nearby) vectors.
-
What does similarity between two words mean?
- How often they appear together.
- How often they appear near each other.
- How often they appear in the same context.
- This can be learned from raw text with unsupervised learning (no labels needed).
-
Embedding space
- It is a vector space where each word is represented as a vector.
- The distance between two word vectors in the embedding space reflects their similarity: similar words lie close together (see the cosine-similarity sketch after this list).
- The number of dimensions of the embedding space is the number of features used to describe each word.
- It can be pictured as a physical space in which each word is a point.
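A minimal sketch with numpy; the 3-dimensional vectors below are made-up toy values, not real embeddings, but they show how cosine similarity measures closeness in the embedding space:

```python
import numpy as np

# Toy 3-dimensional "embeddings" (made up for illustration only)
embeddings = {
    "king":  np.array([0.8, 0.6, 0.1]),
    "queen": np.array([0.7, 0.7, 0.2]),
    "apple": np.array([0.1, 0.2, 0.9]),
}

def cosine_similarity(a, b):
    # 1.0 means same direction (very similar), values near 0 mean unrelated
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine_similarity(embeddings["king"], embeddings["queen"]))  # high
print(cosine_similarity(embeddings["king"], embeddings["apple"]))  # low
```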
-
Relationship between words in 2D
- Show in Odia demo
- Describe addition and subtraction of vectors with the King - Man + Woman = Queen example (see the sketch after this list).
- Show https://projector.tensorflow.org/ for 3D representation.
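A minimal sketch of the analogy arithmetic, assuming gensim is installed and an internet connection is available to download a small pretrained GloVe model on first use:

```python
# Vector arithmetic on pretrained embeddings: king - man + woman ≈ queen
import gensim.downloader as api

# Small pretrained GloVe model (downloaded on first use)
model = api.load("glove-wiki-gigaword-50")

# positive vectors are added, negative vectors are subtracted
print(model.most_similar(positive=["king", "woman"], negative=["man"], topn=3))

# Plain similarity between two words
print(model.similarity("king", "queen"))
```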
-
How do we make word embeddings?
- Embedding layer in neural networks
- Word2Vec (see the training sketch after this section)
- CBOW (Continuous Bag of Words)
- Predict the word given the neighbouring words.
- Skip Gram
- Predict the neighbouring words given the word.
- Cons:
- It reproduces the same (e.g. gender) bias that is present in the training data.
- We cannot rely on it 100%.
- We cannot inspect or debug the individual dimensions of the output. -> No interpretability.
- GloVe (Global Vectors)
- Uses the co-occurrence matrix of the overall corpus.
- The co-occurrence matrix has one row and one column per word, and the value at the intersection of a row and a column is the number of times the two words appear together (within a context window). GloVe is a probabilistic, count-based method that fits word vectors to these global statistics (see the counting sketch after this section).
- FastText
- It is an extension of Word2Vec.
- It uses subword information in the form of character n-grams.
- Word2Vec and GloVe cannot produce vectors for unknown (out-of-vocabulary) words, but FastText can, because it builds vectors from subword n-grams.
- Good for morphologically rich languages like German.
- ELMo (Embeddings from Language Models)
- Uses a bidirectional LSTM language model.
- It produces deep contextualized word representations: the same word gets different vectors in different sentence contexts.
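A minimal Word2Vec/FastText training sketch, assuming gensim is installed; the tiny corpus is made up purely for illustration (the resulting vectors are not meaningful), but it shows the CBOW vs. skip-gram switch and how FastText handles an unseen word:

```python
from gensim.models import Word2Vec, FastText

# Tiny tokenized toy corpus (a real model needs far more text)
sentences = [
    ["the", "king", "rules", "the", "kingdom"],
    ["the", "queen", "rules", "the", "kingdom"],
    ["the", "man", "walks", "in", "the", "wood"],
    ["the", "woman", "walks", "in", "the", "wood"],
]

# sg=0 -> CBOW: predict the word from its neighbouring words
cbow = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0)

# sg=1 -> skip-gram: predict the neighbouring words from the word
skipgram = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)

print(cbow.wv["king"][:5])                       # first 5 dimensions
print(skipgram.wv.most_similar("king", topn=3))  # nearest neighbours

# FastText builds vectors from character n-grams, so it can still
# produce a vector for a word it never saw during training.
ft = FastText(sentences, vector_size=50, window=2, min_count=1)
print(ft.wv["kingdoms"][:5])                     # out-of-vocabulary word
```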
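A minimal sketch of the co-occurrence counting that GloVe starts from; the window size and toy corpus are made up, and this only builds the matrix, it does not fit the GloVe vectors:

```python
import numpy as np

corpus = [["the", "king", "rules", "the", "kingdom"],
          ["the", "queen", "rules", "the", "kingdom"]]
window = 2

vocab = sorted({w for sent in corpus for w in sent})
index = {w: i for i, w in enumerate(vocab)}

# cooc[i, j] = how often word j appears within `window` words of word i
cooc = np.zeros((len(vocab), len(vocab)), dtype=int)
for sent in corpus:
    for pos, word in enumerate(sent):
        for other in sent[max(0, pos - window):pos + window + 1]:
            if other != word:
                cooc[index[word], index[other]] += 1

print(vocab)
print(cooc)
```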
-
References: