Points to cover
-
Why do words need to be represented as vectors or as integers?
- Because machine learning algorithms can't work with strings or text directly.
- Because computers are designed to compute with numbers much faster than with strings.
-
What was the state of the art before word embeddings?
- These methods were mainly used to make it easy to search for a sentence in a document or corpus.
- Count Vectorizer
- Bag of words -> Just count the words.
- N-grams -> Same as bag of words, but counting contiguous sequences of n words instead of single words (see the bigram sketch after the CountVectorizer example below).
- TF-IDF Vectorizer -> Term Frequency - Inverse Document Frequency, used to weight how important a word is to a document relative to the whole corpus.
- Pros
- Easy to implement
- Easy to understand
- Cons
- No relation between words
- Unintentional similarity between words
- Sparse matrix
- Unable to handle unknown (out-of-vocabulary) words
- CountVectorizer Example:
source: https://learning.oreilly.com/scenarios/understanding-word-embeddings/9781492095309/

```python
# Importing CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer

# Using two documents
text = [
    "Two roads diverged in a wood and I took the one less traveled by, and that has made all the difference.",
    "The best way out is always through."
]

# A CountVectorizer object
vectorizer = CountVectorizer()

# This method tokenizes text and generates vocabulary
vectorizer.fit(text)

print("Generated Vocabulary:")
print(vectorizer.vocabulary_)

print("\nNumber of words in the document:")
print(len(text[0].split()) + len(text[1].split()))

print("\nNumber of words in the vocabulary:")
print(len(vectorizer.vocabulary_))

# Transforming document into a vector based on vocabulary
vector = vectorizer.transform(text)

print("\nShape of the transformed vector:")
print(vector.shape)

print("\nVector representation of the document:")
print(vector.toarray())
```
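The example above only counts single words. A minimal sketch, assuming scikit-learn is installed, of the n-gram counting mentioned in the N-grams bullet; the two toy sentences are made up for illustration:

```python
# Counting unigrams and bigrams instead of single words only
from sklearn.feature_extraction.text import CountVectorizer

text = [
    "the quick brown fox",
    "the quick brown dog",
]

# ngram_range=(1, 2) keeps both single words and pairs of adjacent words
vectorizer = CountVectorizer(ngram_range=(1, 2))
vectorizer.fit(text)

print(vectorizer.vocabulary_)            # contains entries like 'quick brown'
print(vectorizer.transform(text).toarray())
```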
- TfIdf Vectorizer Example:
source: https://learning.oreilly.com/scenarios/understanding-word-embeddings/9781492095309/

```python
# Importing TfidfVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

# Using two documents
text = [
    "Two roads diverged in a wood and I took the one less traveled by, and that has made all the difference.",
    "The best way out is always through."
]

# A TfidfVectorizer object
vectorizer = TfidfVectorizer()

# This method tokenizes text and generates vocabulary
vectorizer.fit(text)

print("Generated Vocabulary:")
print(vectorizer.vocabulary_)

print("\nNumber of words in the document:")
print(len(text[0].split()) + len(text[1].split()))

print("\nNumber of words in the vocabulary:")
print(len(vectorizer.vocabulary_))

print("\nInverse document frequency:")
print(vectorizer.idf_)

# Transforming document into a vector based on vocabulary
vector = vectorizer.transform(text)

print("\nShape of the transformed vectors:")
print(vector.shape)

print("\nVector representation of the documents:")
print(vector.toarray())
```
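Note: the idf_ values printed above follow scikit-learn's default smoothed formula (smooth_idf=True): idf(t) = ln((1 + n) / (1 + df(t))) + 1, where n is the number of documents and df(t) is the number of documents containing term t; each tf-idf row is then L2-normalized. This is why a word that appears in both documents (e.g. "the") gets a lower weight than a word unique to one of them.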
- One-hot encoding (see the sketch after this list)
- Pros
- No unintentional similarity
- Easy to understand
- Easy to implement
- Cons
- More disk space (vector length equals vocabulary size)
- Mostly zero values (very sparse vectors)
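A minimal sketch using only numpy; the vocabulary and words are made up for illustration. It shows both the pro (no unintentional similarity) and the cons (vectors as long as the vocabulary, almost all zeros):

```python
import numpy as np

vocab = ["king", "queen", "man", "woman"]
word_to_index = {word: i for i, word in enumerate(vocab)}

# Identity matrix: row i is the one-hot vector for the i-th word
one_hot = np.eye(len(vocab), dtype=int)

print(word_to_index)
print(one_hot[word_to_index["queen"]])   # [0 1 0 0]

# Every pair of different words has dot product 0, i.e. no similarity at all
print(one_hot[word_to_index["king"]] @ one_hot[word_to_index["queen"]])  # 0
```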
-
Why word embeddings?
- Word embeddings try to solve the problem of representing two similar words as two similar (nearby) vectors.
-
What does similarity between two words mean?
- How often they appear together.
- How often they appear near each other.
- How often they appear in the same context.
- This can be learned from raw text with unsupervised learning (no labels needed).
-
Embedding space
- It is a vector space where each word is represented as a vector.
- The distance between two word vectors in the embedding space reflects their similarity: similar words lie close together (see the cosine-similarity sketch after this list).
- The number of dimensions of the embedding space is the number of features used to describe each word.
- It can be pictured as a physical space in which each word is a point.
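A minimal sketch with numpy; the 3-dimensional vectors below are made-up toy values, not real embeddings, but they show how cosine similarity measures closeness in the embedding space:

```python
import numpy as np

# Toy 3-dimensional "embeddings" (made up for illustration only)
embeddings = {
    "king":  np.array([0.8, 0.6, 0.1]),
    "queen": np.array([0.7, 0.7, 0.2]),
    "apple": np.array([0.1, 0.2, 0.9]),
}

def cosine_similarity(a, b):
    # 1.0 means same direction (very similar), values near 0 mean unrelated
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine_similarity(embeddings["king"], embeddings["queen"]))  # high
print(cosine_similarity(embeddings["king"], embeddings["apple"]))  # low
```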
-
Relationship between words in 2D
- Show in Odia demo
- Describe addition and subtraction of vectors with the King - Man + Woman = Queen example (see the sketch after this list).
- Show https://projector.tensorflow.org/ for 3D representation.
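A minimal sketch of the analogy arithmetic, assuming gensim is installed and an internet connection is available to download a small pretrained GloVe model on first use:

```python
# Vector arithmetic on pretrained embeddings: king - man + woman ≈ queen
import gensim.downloader as api

# Small pretrained GloVe model (downloaded on first use)
model = api.load("glove-wiki-gigaword-50")

# positive vectors are added, negative vectors are subtracted
print(model.most_similar(positive=["king", "woman"], negative=["man"], topn=3))

# Plain similarity between two words
print(model.similarity("king", "queen"))
```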
-
How do we make word embeddings?
- Embedding layer in neural networks
- Word2Vec (see the training sketch after this section)
- CBOW (Continuous Bag of Words)
- Predict the word given the neighbouring words.
- Skip Gram
- Predict the neighbouring words given the word.
- Cons:
- It reproduces the same (e.g. gender) bias that is present in the training data.
- We cannot rely on it 100%.
- We cannot inspect or debug the individual dimensions of the output. -> No interpretability.
- GloVe (Global Vectors)
- Uses the co-occurrence matrix of the overall corpus.
- The co-occurrence matrix has one row and one column per word, and the value at the intersection of a row and a column is the number of times the two words appear together (within a context window). GloVe is a probabilistic, count-based method that fits word vectors to these global statistics (see the counting sketch after this section).
- FastText
- It is an extension of Word2Vec.
- It uses subword information in the form of character n-grams.
- Word2Vec and GloVe cannot produce vectors for unknown (out-of-vocabulary) words, but FastText can, because it builds vectors from subword n-grams.
- Good for morphologically rich languages like German.
- ELMo (Embeddings from Language Models)
- Uses a bidirectional LSTM language model.
- It produces deep contextualized word representations: the same word gets different vectors in different sentence contexts.
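A minimal Word2Vec/FastText training sketch, assuming gensim is installed; the tiny corpus is made up purely for illustration (the resulting vectors are not meaningful), but it shows the CBOW vs. skip-gram switch and how FastText handles an unseen word:

```python
from gensim.models import Word2Vec, FastText

# Tiny tokenized toy corpus (a real model needs far more text)
sentences = [
    ["the", "king", "rules", "the", "kingdom"],
    ["the", "queen", "rules", "the", "kingdom"],
    ["the", "man", "walks", "in", "the", "wood"],
    ["the", "woman", "walks", "in", "the", "wood"],
]

# sg=0 -> CBOW: predict the word from its neighbouring words
cbow = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0)

# sg=1 -> skip-gram: predict the neighbouring words from the word
skipgram = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)

print(cbow.wv["king"][:5])                       # first 5 dimensions
print(skipgram.wv.most_similar("king", topn=3))  # nearest neighbours

# FastText builds vectors from character n-grams, so it can still
# produce a vector for a word it never saw during training.
ft = FastText(sentences, vector_size=50, window=2, min_count=1)
print(ft.wv["kingdoms"][:5])                     # out-of-vocabulary word
```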
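A minimal sketch of the co-occurrence counting that GloVe starts from; the window size and toy corpus are made up, and this only builds the matrix, it does not fit the GloVe vectors:

```python
import numpy as np

corpus = [["the", "king", "rules", "the", "kingdom"],
          ["the", "queen", "rules", "the", "kingdom"]]
window = 2

vocab = sorted({w for sent in corpus for w in sent})
index = {w: i for i, w in enumerate(vocab)}

# cooc[i, j] = how often word j appears within `window` words of word i
cooc = np.zeros((len(vocab), len(vocab)), dtype=int)
for sent in corpus:
    for pos, word in enumerate(sent):
        for other in sent[max(0, pos - window):pos + window + 1]:
            if other != word:
                cooc[index[word], index[other]] += 1

print(vocab)
print(cooc)
```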
-
References: