Natural Language Processing (NLP) is a field at the intersection of computer science, artificial intelligence, and linguistics. It focuses on the interaction between computers and human language. NLTK (Natural Language Toolkit) is a leading platform for building Python programs to work with human language data.
To get started with NLTK, follow these steps:
```shell
pip install nltk
```

Then, in Python:

```python
import nltk
nltk.download('popular')  # fetches the most commonly used corpora and models
```
Tokenization is the process of breaking down text into individual words or sentences.
```python
from nltk.tokenize import word_tokenize, sent_tokenize

text = "NLTK is a powerful library for NLP. It's widely used in Python."
words = word_tokenize(text)
sentences = sent_tokenize(text)
print("Words:", words)
print("Sentences:", sentences)
```
Part-of-speech (POS) tagging assigns a grammatical category (noun, verb, adjective, and so on) to each word in a text.
```python
from nltk import pos_tag

tagged = pos_tag(words)
print("POS Tagged:", tagged)
```
Stemming reduces words to their root form by stripping affixes; the result is not always a dictionary word.
```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
words = ["running", "runs", "ran", "runner"]
stemmed_words = [stemmer.stem(word) for word in words]
print("Stemmed words:", stemmed_words)
```
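NLTK ships several stemmers that differ in how aggressively they cut; Porter is the mildest, Lancaster the harshest. A quick comparison on one word (no downloads needed, these stemmers are pure Python):

```python
from nltk.stem import PorterStemmer, SnowballStemmer, LancasterStemmer

word = "generously"
for stemmer in (PorterStemmer(), SnowballStemmer("english"), LancasterStemmer()):
    print(type(stemmer).__name__, "->", stemmer.stem(word))
```

Try a few words from your own data; if the stems look too mangled, switch to a milder stemmer or to lemmatization.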
Lemmatization is similar to stemming, but it uses a vocabulary (WordNet) and the word's part of speech to return a proper dictionary form, the lemma. NLTK's WordNetLemmatizer treats every word as a noun unless you pass a `pos` argument.
```python
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
words = ["better", "worse", "running", "am", "is"]
# Without a pos argument, lemmatize() assumes every word is a noun,
# so verbs and adjectives may come back unchanged.
lemmatized_words = [lemmatizer.lemmatize(word) for word in words]
print("Lemmatized words:", lemmatized_words)
```
Named entity recognition (NER) identifies and classifies named entities in text, such as people, organizations, and locations.
```python
from nltk import ne_chunk, pos_tag
from nltk.tokenize import word_tokenize

tagged = pos_tag(word_tokenize("John works at Google in New York"))
named_entities = ne_chunk(tagged)
print(named_entities)
```
Text classification assigns text to predefined categories, such as labeling movie reviews as positive or negative.
```python
import random, nltk
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews

# Represent each review as bag-of-words presence features, paired with its label
documents = [(dict.fromkeys(movie_reviews.words(fid), True), category)
             for category in movie_reviews.categories()
             for fid in movie_reviews.fileids(category)]
random.shuffle(documents)
train_data, test_data = documents[:1600], documents[1600:]
classifier = NaiveBayesClassifier.train(train_data)
print("Accuracy:", nltk.classify.accuracy(classifier, test_data))
```
Word embeddings are dense vector representations of words in which semantically similar words receive similar vectors. NLTK does not train embeddings itself; the Gensim library (`pip install gensim`) is the usual choice.
```python
from gensim.models import Word2Vec
from nltk.tokenize import word_tokenize

# Word2Vec expects a list of token lists, not raw sentence strings
tokenized = [word_tokenize(s.lower()) for s in sentences]
model = Word2Vec(tokenized, vector_size=100, window=5, min_count=1, workers=4)
print(model.wv.most_similar("python"))  # meaningful only with a large corpus
```
This tutorial has introduced you to the fundamentals of NLP using NLTK. As you progress, you'll be well-prepared to explore more advanced topics and integrate these concepts with deep learning frameworks like PyTorch or TensorFlow.