Natural Language Processing with NLTK Tutorial

Introduction to NLP and NLTK

Natural Language Processing (NLP) is a field at the intersection of computer science, artificial intelligence, and linguistics. It focuses on the interaction between computers and human language. NLTK (Natural Language Toolkit) is a leading platform for building Python programs to work with human language data.

Installation and Setup

To get started with NLTK, follow these steps:

  1. Install NLTK using pip:
       pip install nltk
  2. Download the necessary NLTK data:
       import nltk
       nltk.download('popular')
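
Before running the examples, it can help to confirm that a data package is actually on disk. The following is a minimal sketch rather than an official NLTK recipe: `resource_available` is a hypothetical helper name, and the two resource paths are assumed examples. Depending on the NLTK version, the tokenizer data may live under a different name, in which case the LookupError message names the resource to download.

```python
import nltk


def resource_available(resource_path: str) -> bool:
    """Return True if NLTK can locate the resource in its data directories."""
    try:
        nltk.data.find(resource_path)
        return True
    except LookupError:
        # nltk.data.find raises LookupError when the resource is missing.
        return False


# Assumed example paths; replace them with the resources your code needs.
for path in ("tokenizers/punkt", "corpora/stopwords"):
    status = "found" if resource_available(path) else "missing"
    print(path, status)
```

Any path reported as missing can then be fetched with `nltk.download`, using the package name given in the error message.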

Basic NLP Tasks with NLTK

Tokenization

Tokenization is the process of breaking down text into individual words or sentences.

    from nltk.tokenize import word_tokenize, sent_tokenize

    text = "NLTK is a powerful library for NLP. It's widely used in Python."
    words = word_tokenize(text)
    sentences = sent_tokenize(text)

    print("Words:", words)
    print("Sentences:", sentences)
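
How a tokenizer treats punctuation and contractions affects every later step. As a rough illustration, not a recommendation of one tokenizer over another, the sketch below runs two of NLTK's rule-based tokenizers, which need no downloaded data, on the second sentence of the example text. The variable names are illustrative.

```python
from nltk.tokenize import RegexpTokenizer, wordpunct_tokenize

sentence = "It's widely used in Python."

# Splits into runs of word characters and runs of punctuation,
# so the apostrophe becomes its own token.
punct_tokens = wordpunct_tokenize(sentence)

# Keeps only runs of word characters, dropping punctuation entirely.
word_only_tokens = RegexpTokenizer(r"\w+").tokenize(sentence)

print(punct_tokens)      # ['It', "'", 's', 'widely', 'used', 'in', 'Python', '.']
print(word_only_tokens)  # ['It', 's', 'widely', 'used', 'in', 'Python']
```

Neither result keeps "It's" together in a useful way, which is one reason a tokenizer that knows about English contractions is often preferred for running text.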

Part-of-Speech (POS) Tagging

POS tagging assigns grammatical categories to words in a text.

    from nltk import pos_tag

    # `words` is the token list produced in the Tokenization example.
    tagged = pos_tag(words)
    print("POS Tagged:", tagged)
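
The tagger returns a list of (word, tag) pairs, and the default English tags follow the Penn Treebank convention, where noun tags start with "NN", verb tags with "VB", and adjective tags with "JJ". The sketch below shows one way to consume that structure. The sample pairs are hand-written for illustration and are not actual tagger output, and `words_with_tag` is a hypothetical helper name.

```python
# Hand-written (word, tag) pairs in the shape pos_tag returns;
# illustrative only, not actual tagger output.
tagged_sample = [
    ("NLTK", "NNP"), ("is", "VBZ"), ("a", "DT"),
    ("powerful", "JJ"), ("library", "NN"),
    ("for", "IN"), ("NLP", "NNP"), (".", "."),
]


def words_with_tag(tagged, prefix):
    """Return the words whose tag starts with the given prefix (e.g. "NN")."""
    return [word for word, tag in tagged if tag.startswith(prefix)]


nouns = words_with_tag(tagged_sample, "NN")
verbs = words_with_tag(tagged_sample, "VB")
print("Nouns:", nouns)  # ['NLTK', 'library', 'NLP']
print("Verbs:", verbs)  # ['is']
```

Matching on a prefix keeps related tags together, so "NN", "NNS", and "NNP" all count as nouns.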

Stemming

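Stemming reduces words to a common stem by stripping suffixes with heuristic rules. The result is not always a dictionary word, and irregular forms such as "ran" are left unchanged.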

    from nltk.stem import PorterStemmer

    stemmer = PorterStemmer()
    words = ["running", "runs", "ran", "runner"]
    stemmed_words = [stemmer.stem(word) for word in words]
    print("Stemmed words:", stemmed_words)
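
To make the behaviour of rule-based stemming more concrete, the sketch below runs a few words through both the Porter stemmer and NLTK's Snowball stemmer for English. The word list is an arbitrary choice for illustration; neither stemmer needs downloaded data when used this way.

```python
from nltk.stem import PorterStemmer, SnowballStemmer

porter = PorterStemmer()
snowball = SnowballStemmer("english")

# "studies" shows that a stem need not be a real word;
# "ran" shows that irregular forms are not mapped to "run".
for word in ["running", "studies", "ran"]:
    print(word, "->", porter.stem(word), "/", snowball.stem(word))
```

Stems such as "studi" are fine for grouping related words, for example in search indexing, but they are a poor fit when the output has to be readable.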

Lemmatization

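Lemmatization is similar to stemming, but it maps each word to its dictionary base form (lemma) by looking it up in WordNet. NLTK's WordNetLemmatizer does not infer context on its own: it treats every word as a noun unless a part of speech is passed through the pos argument.

    from nltk.stem import WordNetLemmatizer

    lemmatizer = WordNetLemmatizer()
    # Each word is paired with a WordNet POS code ("a" = adjective, "v" = verb).
    # Without pos, lemmatize() assumes a noun and leaves these words unchanged.
    words = [("better", "a"), ("worse", "a"), ("running", "v"), ("am", "v"), ("is", "v")]
    lemmatized_words = [lemmatizer.lemmatize(word, pos=pos) for word, pos in words]
    print("Lemmatized words:", lemmatized_words)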
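
In practice the part of speech usually comes from a tagger rather than being typed by hand. The sketch below shows one common way to bridge the two, under the assumption that tags follow the Penn Treebank convention. Both `penn_to_wordnet` and `lemmatize_tagged` are hypothetical helper names, and the prefix mapping is a simplification (for instance, it treats every tag starting with "R" as an adverb).

```python
def penn_to_wordnet(tag: str) -> str:
    """Map a Penn Treebank tag to the single-letter POS codes WordNet uses."""
    if tag.startswith("J"):
        return "a"  # adjective
    if tag.startswith("V"):
        return "v"  # verb
    if tag.startswith("R"):
        return "r"  # adverb
    return "n"      # noun; also the lemmatizer's default


def lemmatize_tagged(tagged, lemmatizer):
    """Lemmatize (word, tag) pairs with any object exposing lemmatize(word, pos=...)."""
    return [
        lemmatizer.lemmatize(word, pos=penn_to_wordnet(tag))
        for word, tag in tagged
    ]
```

With the WordNet data installed, this could be called as `lemmatize_tagged(pos_tag(word_tokenize(text)), WordNetLemmatizer())`.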

Advanced NLP Concepts

Named Entity Recognition (NER)

NER identifies and classifies named entities in text.

    from nltk import ne_chunk, pos_tag
    from nltk.tokenize import word_tokenize

    # ne_chunk expects POS-tagged tokens and returns an nltk.Tree.
    tagged = pos_tag(word_tokenize("John works at Google in New York"))
    named_entities = ne_chunk(tagged)
    print(named_entities)
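
The chunker's result is a tree in which each recognized entity is a labelled subtree and every other token is a plain (word, tag) leaf. The sketch below walks such a tree and collects the entities. The tree is built by hand in the same shape purely for illustration and is not actual chunker output; `extract_entities` is a hypothetical helper name.

```python
from nltk import Tree

# Hand-built tree in the shape ne_chunk returns; illustrative only.
sample_tree = Tree("S", [
    Tree("PERSON", [("John", "NNP")]),
    ("works", "VBZ"),
    ("at", "IN"),
    Tree("ORGANIZATION", [("Google", "NNP")]),
    ("in", "IN"),
    Tree("GPE", [("New", "NNP"), ("York", "NNP")]),
])


def extract_entities(tree):
    """Return (label, text) pairs for each labelled chunk directly under the root."""
    entities = []
    for child in tree:
        if isinstance(child, Tree):
            # Join the words of a multi-token entity such as "New York".
            text = " ".join(word for word, _tag in child.leaves())
            entities.append((child.label(), text))
    return entities


print(extract_entities(sample_tree))
# [('PERSON', 'John'), ('ORGANIZATION', 'Google'), ('GPE', 'New York')]
```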

Text Classification

Text classification categorizes text into predefined classes.

    import nltk
    from nltk.classify import NaiveBayesClassifier
    from nltk.corpus import movie_reviews

    # (Code for preparing data and training classifier)
    # train_data and test_data must each be a list of (feature_dict, label) pairs.
    classifier = NaiveBayesClassifier.train(train_data)
    print("Accuracy:", nltk.classify.accuracy(classifier, test_data))
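
To show the data shape the classifier expects, here is a minimal end-to-end sketch. It swaps the movie_reviews corpus for a tiny hand-written dataset, which is an assumption made only so the example runs without downloads. `word_features` is a hypothetical feature extractor using simple bag-of-words features.

```python
import nltk
from nltk.classify import NaiveBayesClassifier


def word_features(text):
    """Bag-of-words features: each lowercase token maps to True."""
    return {word: True for word in text.lower().split()}


# Tiny hand-written dataset, for illustration only.
train_texts = [
    ("great fun film", "pos"),
    ("wonderful acting great story", "pos"),
    ("boring dull film", "neg"),
    ("terrible acting dull story", "neg"),
]
test_texts = [
    ("great wonderful story", "pos"),
    ("dull boring acting", "neg"),
]

# Each item becomes a (feature_dict, label) pair.
train_data = [(word_features(text), label) for text, label in train_texts]
test_data = [(word_features(text), label) for text, label in test_texts]

classifier = NaiveBayesClassifier.train(train_data)
prediction = classifier.classify(word_features("great wonderful"))
accuracy = nltk.classify.accuracy(classifier, test_data)
print("Prediction:", prediction)
print("Accuracy:", accuracy)
```

A dataset this small says nothing about real-world accuracy. It only demonstrates the training and evaluation calls.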

Word Embeddings

Word embeddings are dense vector representations of words.

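    # gensim is a separate package: pip install gensim
    from gensim.models import Word2Vec
    from nltk.tokenize import word_tokenize

    # (Code for training or loading a Word2Vec model)
    # `sentences` is assumed to be a list of sentence strings, such as the
    # output of sent_tokenize. Word2Vec expects a list of token lists, so
    # tokenize (and lowercase) each sentence first.
    tokenized = [word_tokenize(s.lower()) for s in sentences]
    model = Word2Vec(tokenized, vector_size=100, window=5, min_count=1, workers=4)
    print(model.wv.most_similar("python"))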
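
Similarity between embeddings is usually measured with cosine similarity, which is also what gensim's `most_similar` ranks by. The sketch below computes it in plain Python on made-up three-dimensional vectors. The numbers are invented for illustration and do not come from any trained model.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Made-up vectors; real embeddings are learned from text and are
# typically far higher-dimensional.
toy_vectors = {
    "python": [0.9, 0.1, 0.0],
    "java": [0.8, 0.2, 0.1],
    "banana": [0.0, 0.1, 0.9],
}

query = "python"
# Rank every other word by its similarity to the query, highest first.
ranked = sorted(
    (word for word in toy_vectors if word != query),
    key=lambda word: cosine_similarity(toy_vectors[query], toy_vectors[word]),
    reverse=True,
)
print(ranked)  # ['java', 'banana']
```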

Practical Exercises

  1. Create a function that takes a sentence as input and returns the most common part of speech.
  2. Implement a simple text summarizer using NLTK's frequency distribution functionality.
  3. Build a named entity recognizer for a specific domain (e.g., medical texts).
  4. Create a function that removes stop words from a given text.
  5. Implement a basic sentiment analysis model using NLTK and scikit-learn.

Conclusion

This tutorial has introduced you to the fundamentals of NLP using NLTK. As you progress, you'll be well-prepared to explore more advanced topics and integrate these concepts with deep learning frameworks like PyTorch or TensorFlow.
