Natural Language Processing

This is a set of code used to demonstrate basics of natural language processing with some basic of machine learning.

Notebooks

The notebooks contain code to demonstrate the concepts and how they are applied in code and data.

Regular Expressions

This shows how to use the re module in python as well as some examples of the different Regular Expression (Regex) syntax.

Main Concepts:

Using Python's re module
Matching multiple cases with Regex's disjunction
Reusing the same pattern with Regex's quantifiers
Positioning the matches with Regex's anchors
Retrieving matches with Regex's capture groups
Looking before and after a match with Regex's lookahead and lookbehind

Text Normalization

This notebook demonstrate the process of converting text into a standard form. This is one of the first step in the pipelines for a natural language processing project.

Main Concepts:

Exploring datasets using Regex
Splitting sentences and paragraphs to words
Transforming words into roots and base forms
Separating different sentences from a paragraph

Part of Speech Tagging

This notebook contains code to show how words in the sentences are given their sentence function, or their part of speech (POS).

Main Concepts:

Utilizing word order using N-Gram taggers
Training different machine learning algorithms to develop a POS tagger

Word Classification

This notebook implements some basic pipelines for processing different words for classification. Specifically, this contains a basic sentiment analysis pipeline.

Main Concepts:

Balancing imbalance datasets
Creating word features for training models
Using a Naive Bayes algorithm for sentiment analysis
Interpreting results of a Naive Bayes algorithm

Document Classification

This notebook extends the concepts in Word Classification. This creates a document classification pipeline by determining what topic a paragraph is discussing.

Main Concepts:

Creating paragraph features for training models such as counts and tf-idf
Using a Random Forest algorithm for document classification
Interpreting results of a Random Forest algorithm

Information Extraction

This notebook contains topics related to Chunking and Named Entity Recognition (NER). These are two concepts are often used for information extraction within a text.

Main Concepts:

Working with sentence syntax tree formats
Creating syntax trees using rule based approaches
Training syntax tree models using different machine learning algorithms
Developing a NER model using Conditional Random Fields (CRF)

Models

The models folder is used as the output of the trained models when running the notebook. Specifically, the CRF models created by NLTK are stored here.