Generate Text with N-gram Language Models: Working, Implementation & Limitations

As we dive deeper into text generation, we encounter the N-gram model, which improves upon simpler models like Markov Chains by considering more than just the previous word. This extra context makes the generated text more coherent. In this article, we’ll explore how N-gram models work, how to build one, their limitations, and why we eventually move to more advanced techniques.


What is an N-gram?

An N-gram is a sequence of N words from a given text. The idea behind N-grams is that the probability of a word depends not just on the immediately preceding word (as in a simple Markov Chain), but on a sequence of preceding words.

  • Unigram: A single word; a unigram model uses no preceding context.
  • Bigram: A sequence of two consecutive words; a bigram model predicts a word from the one before it.
  • Trigram: A sequence of three consecutive words; a trigram model predicts a word from the two before it.
  • N-gram: The general term for a sequence of N words.

Example:

Consider the sentence: "The cat sits on the mat"

  • Unigrams: "The", "cat", "sits", "on", "the", "mat".
  • Bigrams: "The cat", "cat sits", "sits on", "on the", "the mat".
  • Trigrams: "The cat sits", "cat sits on", "sits on the", "on the mat".

By using N-grams, we can model the relationships between sequences of words rather than just individual words, which improves the quality of generated text compared to a simple Markov chain.


Trigram Model Example

A trigram is a sequence of three consecutive words. In language modeling, we use these sequences to predict the likelihood of a word following a pair of preceding words.

Unlike unigram (one-word) or bigram (two-word) models, trigram models offer more context: they consider the previous two words when predicting the next one. This gives more meaningful predictions than bigrams, especially for complex sentences.

Example:

If you have text like:

"The cat sits on the mat. The dog barks on the mat."

The trigrams in this text are:

  • (“The”, “cat”, “sits”)
  • (“cat”, “sits”, “on”)
  • (“sits”, “on”, “the”)
  • (“on”, “the”, “mat.”)
  • (“the”, “mat.”, “The”)
  • (“The”, “dog”, “barks”)
  • (“dog”, “barks”, “on”)
  • (“barks”, “on”, “the”)
  • (“on”, “the”, “mat.”)

In a trigram model, we use the first two words to predict the third word. For example, if we see “on the”, we predict that the next word will likely be “mat.”.


How Trigram Models Work

Probability Calculation in Trigram Models

In a trigram model, we calculate the probability of a word given the two preceding words.
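
The standard way to estimate this probability is to count how often the full trigram occurs in the training text and divide by how often its two-word context occurs:

P(w3 | w1, w2) = Count(w1, w2, w3) / Count(w1, w2)

For example, in the sample text used below, the pair "on the" occurs twice and is followed by "mat." both times, so P("mat." | "on", "the") = 2 / 2 = 1.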


Implementation

Step 1: Build a Trigram Model

To build a trigram model, we first need to tokenize the sentence into trigrams and store them in a way that we can use them for prediction.

Python Code for Tokenizing into Trigrams

import random
from pprint import pprint
from collections import defaultdict
# Sample sentence
text = "The cat sits on the mat. The dog barks on the mat."

# Split the text into words (tokens)
words = text.split()
# Create a dictionary to store trigrams
trigram_model = defaultdict(list)

# Create trigrams and store them in the model
for i in range(len(words) - 2):
    key = (words[i], words[i + 1])
    trigram_model[key].append(words[i + 2])

# Print the trigram model
pprint(dict(trigram_model))

Explanation:

  1. Tokenization: We split the sentence into individual words (tokens).
  2. Trigrams: For each position in the sentence, we take two consecutive words as the key and store the word that follows them as a value in the dictionary.
  3. Output: The dictionary stores each two-word pair and the corresponding next words.

This is your trigram model, which can predict the next word based on the previous two words.
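
For the sample text above, the printed model looks like this (pprint sorts the keys):

{('The', 'cat'): ['sits'],
 ('The', 'dog'): ['barks'],
 ('barks', 'on'): ['the'],
 ('cat', 'sits'): ['on'],
 ('dog', 'barks'): ['on'],
 ('mat.', 'The'): ['dog'],
 ('on', 'the'): ['mat.', 'mat.'],
 ('sits', 'on'): ['the'],
 ('the', 'mat.'): ['The']}

Note that the pair ("on", "the") is followed by "mat." twice because it occurs in both sentences.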


Step 2: Generate Text from the Trigram Model

Now that we have the trigram model, we can use it to generate new sentences by selecting the next word based on the previous two words.

Python Code for Text Generation

# Function to generate text using the trigram model
def generate_text(trigram_model, start_words, num_words=10):
    current_words = start_words
    generated_text = list(current_words)
    
    for _ in range(num_words):
        next_word = random.choice(trigram_model.get(tuple(current_words), ['']))
        if not next_word:  # If no word is found, stop
            break
        generated_text.append(next_word)
        current_words = (current_words[1], next_word)  # Shift the current words
    
    return ' '.join(generated_text)

# Starting words to begin the generation
start_words = ("The", "cat")
generated_sentence = generate_text(trigram_model, start_words)
print(generated_sentence)

Explanation:

  • generate_text: This function takes the trigram model and a pair of starting words, and generates text by repeatedly predicting the next word from the model.
  • random.choice: Randomly selects the next word from the list of possible words that follow the given pair of words.
  • Word Shift: After predicting the next word, we shift the window by one word and continue predicting.

This output is generated based on the trigrams learned from the input text.
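
With this tiny corpus the output is effectively deterministic: every two-word context has only one distinct continuation, so the generated sentence simply reproduces the training text:

The cat sits on the mat. The dog barks on the mat.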


Limitations of Trigram Models

While trigram models provide more context than bigrams, they still have limitations:

  1. Data Sparsity: Trigrams require more data to be accurate. Since they rely on three-word sequences, we may encounter many word combinations that never appeared in the training text, making prediction impossible for those contexts (see the short demonstration after this list).
  2. Lack of Long-term Context: Even though trigrams consider two previous words, they still cannot capture long-term dependencies in a sentence. For instance, if a word earlier in the sentence affects the current word, a trigram model won’t notice that.
  3. More Complex Models Needed: To handle complex text generation tasks, more sophisticated models like neural networks (e.g., RNNs, LSTMs) or transformers are needed, which can learn from much longer contexts and handle large datasets better.
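
A quick way to see the data sparsity problem is to start generation from a context the model has never seen. Continuing with the trigram_model and generate_text defined above, the pair ("the", "dog") never occurs in the training text (only "The dog" does, and whitespace tokens are case-sensitive), so generation stops immediately:

# Start from a context that never appears in the training text
print(generate_text(trigram_model, ("the", "dog")))
# Output: "the dog" -- no continuation exists, so generation stops at once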

Summary

Trigram models are a step up from bigrams because they use two words of context to predict the next word. They are simple, easy to understand, and work well for small datasets. However, they are limited by the amount of context they can handle and often require large amounts of data to perform well.


Projects to Try

Here are a few simple projects you can build to practice using n-grams (a generalized starting point is sketched after the list):

  1. N-gram Poem Generator: Use a 4-gram / 5-gram model to generate random poems by training your model on a dataset of famous poems.
  2. Next-word Prediction App: Build an app that suggests the next word based on bigrams. This could be similar to predictive text on smartphones.
  3. Text Completion Tool: Create a tool that completes a sentence using n-grams. You could use bigrams or trigrams to suggest how a sentence might end.
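
As a starting point for these projects, here is a minimal sketch of a generalized N-gram builder and generator, assuming the same whitespace tokenization as in the examples above; the names build_ngram_model and generate are my own choices, and in practice you would train on a much larger corpus:

import random
from collections import defaultdict

def build_ngram_model(words, n):
    # Map each (n-1)-word context to the list of words that follow it
    model = defaultdict(list)
    for i in range(len(words) - n + 1):
        context = tuple(words[i:i + n - 1])
        model[context].append(words[i + n - 1])
    return model

def generate(model, start_words, num_words=20):
    context = tuple(start_words)
    generated = list(context)
    for _ in range(num_words):
        candidates = model.get(context)
        if not candidates:  # Unseen context: stop generating
            break
        next_word = random.choice(candidates)
        generated.append(next_word)
        context = context[1:] + (next_word,)  # Slide the context window
    return ' '.join(generated)

# Example with n=4 on the sample text from earlier
words = "The cat sits on the mat. The dog barks on the mat.".split()
model = build_ngram_model(words, 4)
print(generate(model, words[:3]))

With n=2 this gives you the bigram model for the next-word prediction app; with n=4 or n=5 it covers the poem generator, provided the training corpus is large enough to avoid the sparsity problems discussed above.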
