A Simple Bigram Language Model with Jupyter Notebook

Chethani Dilhari
5 min read · Sep 8, 2021

Bigram language models can be used for sentiment analysis. The basic task of this model is to take a corpus, train on it, and classify a given sentence as positive, negative, or neutral. This is done by analyzing the positive and negative corpora using their bigram probabilities. The process has the following steps.

Step 1 — Data gathering and annotating them as positive and negative

First of all, we need to find a corpus of data from a particular domain and manually annotate it with positive and negative polarities. Here I used Jupyter Notebook and libraries such as nltk to develop the model.

Here the corpus is a large set of inputs that we use to train our model. We can collect a list of inputs and classify them manually as positive and negative so they can be used to train the model.
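For illustration, a minimal, hypothetical annotated corpus could simply be two Python lists (or two text files), one per polarity; the sentences below are made-up examples:

    # Hypothetical annotated corpus: one list of sentences per polarity.
    positive_corpus = [
        "the food was great and the service was friendly",
        "i really enjoyed this place",
    ]
    negative_corpus = [
        "the food was cold and the staff was rude",
        "i will never come back here",
    ]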

Step 2 — preprocessing

After that we need to preprocess those two sets of data so that only the necessary words remain. We can apply several preprocessing techniques depending on the dataset; a small sketch combining these steps appears after the list.

  • Remove punctuations

We can choose the punctuation marks we need to remove using the string library and get an output data set without unnecessary punctuation.

  • Remove special characters

Special characters are removed using regular expressions, keeping only the data with the characters we need.

  • Remove numbers

If needed, we can remove numbers and keep only the letters, or we can convert the numbers into words.

Eg: 3 → three

  • Remove whitespaces

Some sentences and words contain unnecessary whitespace. By removing it we get a clean data set without extra spaces or newline characters.

  • Turn to lowercase

After that we can turn the whole text into lowercase using the lower() function. Otherwise the model will treat “hello” and “Hello” as two different words.

  • Remove stop words

Every language has stop words that carry little meaning on their own. It is good to remove them, since they get high counts without adding unique meaning. Some libraries provide functions to remove them, but not for all languages.

  • Stemming

Stemming reduces words to their stems. It can be done using the PorterStemmer() class in nltk, which uses Porter's stemming algorithm.

Eg: words → word

  • Lemmatization

Lemmatization reduces the different inflected forms of a word to its common lemma. It can be done using the WordNetLemmatizer() class in nltk.

Eg: cookery → cook
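The sketch below combines the techniques above into one helper function. It assumes an English corpus and that the nltk stopwords and wordnet data have already been downloaded; in practice you would usually pick either stemming or lemmatization rather than applying both:

    import re
    import string
    from nltk.corpus import stopwords
    from nltk.stem import PorterStemmer, WordNetLemmatizer

    # Run nltk.download('stopwords') and nltk.download('wordnet') once beforehand.
    stop_words = set(stopwords.words('english'))
    stemmer = PorterStemmer()
    lemmatizer = WordNetLemmatizer()

    def preprocess(text):
        text = text.translate(str.maketrans('', '', string.punctuation))  # remove punctuation
        text = re.sub(r'[^a-zA-Z\s]', ' ', text)   # remove special characters and numbers
        text = re.sub(r'\s+', ' ', text).strip()   # remove extra whitespace and newlines
        text = text.lower()                        # lowercase so "Hello" matches "hello"
        words = [w for w in text.split() if w not in stop_words]  # remove stop words
        words = [stemmer.stem(w) for w in words]                  # Porter stemming
        words = [lemmatizer.lemmatize(w) for w in words]          # lemmatization
        return ' '.join(words)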

Step 3 — Tokenize

Now we have a preprocessed data set of words. We need to split the text into individual words correctly. There are several functions to do this; here I use the word_tokenize() function. After this step we get the unigram (token) list of the text. Do this for both data sets.

Eg: word_tokenize(“he eat rice”) → [‘he’, ‘eat’, ‘rice’]
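A minimal sketch (the punkt tokenizer data must be downloaded once with nltk.download('punkt')):

    from nltk.tokenize import word_tokenize

    tokens = word_tokenize("he eat rice")
    print(tokens)  # ['he', 'eat', 'rice']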

Step 4 — Get unigram frequencies and store in a list

We can use a function to get the unigram frequencies of the separated tokens and store them in a list. In this model I used the FreqDist() function in nltk, which outputs each token together with its count.
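For example:

    from nltk import FreqDist

    unigram_freq = FreqDist(tokens)     # counts how often each token occurs
    print(unigram_freq.most_common(3))  # e.g. [('he', 1), ('eat', 1), ('rice', 1)]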

Step 5 — Get bigram frequencies and store in a list

Then we need to create a list to store the bigram frequencies of the tokens. Bigrams are extracted using the bigrams() function in nltk, which, for a sentence padded with start and end markers, gives a list of bigrams as follows:

Eg: [(‘<s>’, ‘he’), (‘he’, ‘eat’), (‘eat’, ‘rice’), (‘rice’,’</s>’)]
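A minimal sketch, padding each sentence with <s> and </s> markers by hand before extracting the bigrams:

    from nltk import bigrams, FreqDist

    padded = ['<s>'] + tokens + ['</s>']   # add start/end markers
    bigram_list = list(bigrams(padded))
    print(bigram_list)
    # [('<s>', 'he'), ('he', 'eat'), ('eat', 'rice'), ('rice', '</s>')]
    bigram_freq = FreqDist(bigram_list)    # counts how often each bigram occurs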

Step 6 — Calculate bigram probabilities

Then we calculate the bigram probabilities using the bigram frequency list and the unigram frequency list created above.

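The standard maximum-likelihood estimate is P(w2 | w1) = count(w1 w2) / count(w1), i.e. the bigram count divided by the count of its first word. A minimal sketch, assuming unigram_freq and bigram_freq are built over the padded tokens so that the count of ‘<s>’ is available:

    def bigram_probability(w1, w2, bigram_freq, unigram_freq):
        # P(w2 | w1) = count(w1, w2) / count(w1)
        if unigram_freq[w1] == 0:
            return 0.0
        return bigram_freq[(w1, w2)] / unigram_freq[w1]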

Step 7 — Testing and evaluate output

In the testing part we test the model using sample inputs. We feed in the inputs one by one to check whether the model works correctly and produces the expected classification, together with a perplexity evaluation.

We preprocess the input sentence as in Step 2 and then repeat Steps 3, 4, 5, and 6: split the preprocessed sentence into words, build its unigram and bigram lists, and look up the probability of each bigram in the two previously created bigram probability lists.

After that we compute the total probability of the sentence according to the positive probability list and according to the negative probability list.

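A minimal sketch of this comparison, assuming separate frequency counts were built for the positive and negative corpora (the names test_bigrams, pos_*, and neg_* are placeholders); unseen bigrams are given a tiny floor probability so the product does not collapse to zero, though a proper implementation would use smoothing (e.g. add-one):

    def sentence_probability(sentence_bigrams, bigram_freq, unigram_freq):
        # Multiply the probabilities of all bigrams in the sentence.
        prob = 1.0
        for w1, w2 in sentence_bigrams:
            p = bigram_probability(w1, w2, bigram_freq, unigram_freq)
            prob *= p if p > 0 else 1e-6   # floor for unseen bigrams
        return prob

    pos_prob = sentence_probability(test_bigrams, pos_bigram_freq, pos_unigram_freq)
    neg_prob = sentence_probability(test_bigrams, neg_bigram_freq, neg_unigram_freq)
    label = 'positive' if pos_prob > neg_prob else 'negative'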

Then we compare the total positive and negative probabilities (as in the last line of the sketch above) and classify the input as positive, negative, or neutral.

Perplexity is also calculated from the positive and negative probabilities computed above, as follows.

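A common definition is perplexity = P(W) ** (-1/N), where N is the number of tokens (or bigrams) in the sentence; a minimal sketch based on the probabilities computed above:

    def perplexity(sentence_prob, n):
        # Perplexity = P(W) ** (-1 / N); lower is better.
        return sentence_prob ** (-1.0 / n)

    pos_perplexity = perplexity(pos_prob, len(test_bigrams))
    neg_perplexity = perplexity(neg_prob, len(test_bigrams))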

When the perplexity value is low, the model is considered comparatively good, since lower perplexity indicates better performance.

This article briefly describes how to develop a bigram language model step by step, covering the basic details. I hope this will be helpful for people who are interested in natural language processing.

You can find the bigram language model I developed to analyze social media comments in my GitHub repository using the link below.

Thank you for reading…
