Requirements (Python 3.x):
- torch
- numpy
Word2Vec is a popular word embedding model proposed by Google researchers (Tomas Mikolov et al.) in 2013. It transforms words from a text corpus into vectors of numbers in such a way that words with similar semantic meanings have close vector representations in multidimensional space. This makes Word2Vec a powerful tool for natural language processing (NLP) tasks such as sentiment analysis, machine translation, automatic summarization, and many others.
- Distributed representation: Each word is represented as a vector in a multidimensional space, where relationships between words are reflected through the cosine similarity between their vectors (see the sketch after this list).
- Unsupervised learning: Word2Vec is trained on large unlabeled text corpora without the need for external annotations or labeling.
- Contextual learning: Word vectors are obtained based on the context in which these words appear, capturing their semantic and syntactic relations.
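
To illustrate the distributed-representation idea, here is a minimal sketch of comparing word vectors with cosine similarity, using numpy from the requirements above. The three toy vectors (`king`, `queen`, `apple`) are made up for the example and are not real Word2Vec embeddings.

```python
import numpy as np

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine similarity between two word vectors: 1.0 means identical direction."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy 3-dimensional embeddings (real Word2Vec vectors are typically 100-300 dims).
king  = np.array([0.80, 0.65, 0.10])
queen = np.array([0.75, 0.70, 0.15])
apple = np.array([0.10, 0.20, 0.90])

print(cosine_similarity(king, queen))  # close to 1.0 -> semantically similar
print(cosine_similarity(king, apple))  # noticeably lower -> less related
```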
CBOW (Continuous Bag of Words): This approach predicts the current word based on the context around it. For example, for the phrase "blue sky above the head", the CBOW model would try to predict the word "sky" from the context words "blue", "above", "head". CBOW processes large volumes of data quickly but is less effective for rare words.
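
A small sketch of how CBOW training examples could be generated from the sentence above. The `cbow_pairs` helper and the window size of 2 are illustrative assumptions, not part of any specific library.

```python
def cbow_pairs(tokens, window=2):
    """Build (context_words, target_word) pairs for CBOW with a symmetric window."""
    pairs = []
    for i, target in enumerate(tokens):
        context = [tokens[j]
                   for j in range(max(0, i - window), min(len(tokens), i + window + 1))
                   if j != i]
        pairs.append((context, target))
    return pairs

sentence = "blue sky above the head".split()
for context, target in cbow_pairs(sentence, window=2):
    print(context, "->", target)
# e.g. ['blue', 'above', 'the'] -> 'sky'
```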
Skip-Gram: In this approach, the current word is used to predict the words in its context. For the same example, the Skip-Gram model would try to predict the words "blue", "above", "head" from the word "sky". Skip-Gram processes data more slowly but works better with rare words and less common contexts.
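
For comparison, a sketch of Skip-Gram pair generation for the same sentence. The `skipgram_pairs` helper and the window size are again illustrative assumptions.

```python
def skipgram_pairs(tokens, window=2):
    """Build (target_word, context_word) pairs for Skip-Gram with a symmetric window."""
    pairs = []
    for i, target in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

sentence = "blue sky above the head".split()
for target, context in skipgram_pairs(sentence, window=2):
    if target == "sky":
        print(target, "->", context)
# sky -> blue, sky -> above, sky -> the
```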
The goal of CBOW is to predict the target word based on the context around it. The context is defined as the set of words within a given window around the target word. The model architecture is a simple three-layer neural network: an input layer, a hidden layer, and an output layer.
Input layer: The model receives context words. These words are represented as vectors using "one-hot encoding", where each vector has a dimension equal to the size of the vocabulary and contains 1 at the position corresponding to the word's index in the vocabulary, and 0 in all other positions.
Hidden layer: The input word vectors are multiplied by a weight matrix between the input and hidden layer, resulting in a hidden layer vector. For CBOW, the context word vectors are usually averaged before being passed to the next layer.
Output layer: The hidden layer vector is multiplied by a weight matrix between the hidden and output layer, and the result passes through a softmax function to obtain the probability of each word in the vocabulary being the target word. The goal of training is to maximize the probability of the correct target word.
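
A minimal CBOW sketch in PyTorch, assuming torch from the requirements above. The multiplication of one-hot vectors by the input weight matrix is expressed as an `nn.Embedding` lookup (mathematically equivalent), and `nn.CrossEntropyLoss` applies the softmax internally; the vocabulary size, embedding dimension, and random batch are placeholders.

```python
import torch
import torch.nn as nn

class CBOW(nn.Module):
    """Minimal CBOW: average context embeddings, then project to vocabulary scores."""
    def __init__(self, vocab_size: int, embedding_dim: int = 100):
        super().__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)  # input -> hidden weights
        self.output = nn.Linear(embedding_dim, vocab_size)         # hidden -> output weights

    def forward(self, context_ids: torch.Tensor) -> torch.Tensor:
        # context_ids: (batch, context_size) integer word indices
        hidden = self.embeddings(context_ids).mean(dim=1)          # average the context vectors
        return self.output(hidden)                                 # raw scores over the vocabulary

# Training step sketch: nn.CrossEntropyLoss applies the softmax internally.
model = CBOW(vocab_size=5000)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

context = torch.randint(0, 5000, (8, 4))   # batch of 8 examples, 4 context words each
target = torch.randint(0, 5000, (8,))      # the word to predict for each example
loss = criterion(model(context), target)
loss.backward()
optimizer.step()
```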
Unlike CBOW, the goal of Skip-Gram is to predict the context words for a given target word. The target word at the model's input is used to predict the words in its context within a given range (the window).
Input layer: The input is the target word, represented as a one-hot vector.
Hidden layer: The same as in CBOW, where the target word vector is multiplied by a weight matrix leading to the hidden layer.
Output layer: Unlike CBOW, where the output layer is a single softmax, in Skip-Gram each context word gets its own softmax, meaning the model tries to predict each context word separately. The goal of training is to maximize the probability of the real context words appearing for a given target word.
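
A matching Skip-Gram sketch in PyTorch. Each (target, context) pair is treated as a separate training example, so the softmax over the vocabulary is effectively applied once per context word, as described above. As before, the sizes and random batch are placeholders.

```python
import torch
import torch.nn as nn

class SkipGram(nn.Module):
    """Minimal Skip-Gram: embed the target word, score every vocabulary word as context."""
    def __init__(self, vocab_size: int, embedding_dim: int = 100):
        super().__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)  # target-word embeddings
        self.output = nn.Linear(embedding_dim, vocab_size)         # hidden -> output weights

    def forward(self, target_ids: torch.Tensor) -> torch.Tensor:
        # target_ids: (batch,) indices of the center words
        hidden = self.embeddings(target_ids)
        return self.output(hidden)                                 # scores for each candidate context word

# Training step sketch: one (target, context) pair per example.
model = SkipGram(vocab_size=5000)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

targets = torch.randint(0, 5000, (8,))    # batch of center words
contexts = torch.randint(0, 5000, (8,))   # one context word per center word
loss = criterion(model(targets), contexts)
loss.backward()
optimizer.step()
```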