Word embeddings are a type of word representation in which words with similar meanings have similar representations. They are dense vector representations that capture a word's meaning from the contexts in which it appears in text.
Word2Vec is a popular method for learning word embeddings using a two-layer neural network. It can be trained using two approaches: Continuous Bag of Words (CBOW) and Skip-Gram. CBOW predicts a target word from its context words, while Skip-Gram predicts context words from a target word.
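In Gensim, the choice between the two training approaches is controlled by the sg parameter. A minimal sketch, using a hypothetical two-sentence toy corpus just to show the parameter:

from gensim.models import Word2Vec

# Hypothetical toy corpus, tokenised into lists of words
toy_sentences = [["the", "cat", "sat", "on", "the", "mat"],
                 ["the", "dog", "sat", "on", "the", "rug"]]

cbow_model = Word2Vec(toy_sentences, sg=0, min_count=1)      # sg=0 selects CBOW (the default)
skipgram_model = Word2Vec(toy_sentences, sg=1, min_count=1)  # sg=1 selects Skip-Gram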
To train a Word2Vec model, you need a corpus of tokenised text. Here's how to do it with the Gensim library and the Brown corpus from NLTK:
import nltk
from gensim.models import Word2Vec
from nltk.corpus import brown

nltk.download('brown')  # fetch the Brown corpus if it is not already available locally

# Prepare the corpus: brown.sents() yields sentences as lists of tokens
sentences = brown.sents()

# Create and train the model: vector_size sets the embedding dimensionality, window the
# context size, min_count the minimum word frequency, and workers the thread count
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, workers=4)

model.save("word2vec.model")  # save the trained model to disk
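If more text becomes available later, training can be continued on the in-memory model rather than started from scratch. A brief sketch, where new_sentences stands in for a hypothetical batch of additional tokenised sentences:

# Hypothetical additional data; in practice this would be a larger collection of tokenised sentences
new_sentences = [["word", "embeddings", "capture", "meaning", "from", "context"]]

model.build_vocab(new_sentences, update=True)  # add any unseen words to the existing vocabulary
model.train(new_sentences, total_examples=len(new_sentences), epochs=model.epochs)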
Once the model is trained, you can use it to find similar words, get word vectors, and perform other operations:
from gensim.models import Word2Vec

# Load the trained model from disk
model = Word2Vec.load("word2vec.model")

# Get the 100-dimensional vector for a word (raises KeyError if the word is not in the vocabulary)
vector = model.wv['example']

# Find the words whose vectors are most similar to that of 'example'
similar_words = model.wv.most_similar('example')

print("Vector for 'example':", vector)
print("Similar words to 'example':", similar_words)