Introduction to tf-idf


Although I’ve been able to automate some portion of the blog workflow, there’s always been a challenging part that I wanted to further automate using deep learning: automatic tagging and categorization. Every post requires some form of YAML front matter, containing information such as the title, tags, and categories of the blog post to be uploaded. Although I sometimes create new tags or categories if existing ones seem unfit, I only deal with a limited number of topics on this blog, which is why I’ve always thought that some form of supervised learning would be able to automate the process by at least generating some candidate tags for me.

I’m currently in the process of preparing the data (my previous blog posts) and building a simple NLP document classification model for the job. However, NLP is a field that I’m admittedly not well-acquainted with, not to mention the fact that I haven’t been posting a lot about deep learning implementations for a while now. So in today’s short post, I decided to write about tf-idf vectorization, which is a very simple yet powerful technique that is often used in routine tasks like document classification, where SOTA models aren’t really required. As always, this post is going to take a hands-on approach by demonstrating a simple way of implementing tf-idf vectorization from scratch. Let’s get started.

tf-idf Formula

tf-idf stands for term frequency-inverse document frequency. This is all there is to it—in fact, the formula for tf-idf can simply be expressed as

\[\text{tfidf}(t, d, D) = \text{tf}(t, d) \cdot \text{idf}(t, D) \tag{1}\]

where $t$ denotes a single term; $d$, a single document; and $D$, a collection of documents.

Simply put, tf-idf is the product of the term frequency, denoted above as $\text{tf}$, and the inverse document frequency, $\text{idf}$. All that is left, then, is to figure out what term frequency and inverse document frequency are.

Term Frequency

Without much explanation, you can probably guess what term frequency is: it simply indicates how frequently a word appears in a given document. For example, if a document contained a total of 3 distinct words, each appearing exactly once (a very short document, I know), then each of the three words would have a tf score of $1/3$. Put differently, the entries of each document’s tf vector should sum to one. The tf score might thus be expressed as

\[\text{tf}(t, d) = \frac{\sum_{t \in d} t}{\lvert d \rvert} \tag{2}\]

where the numerator denotes the count of all occurrences of the term $t$ in document $d$, and the denominator, $\lvert d \rvert$, represents the total number of terms in the document.
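
To make this concrete, here is a minimal sketch of the idea on a single made-up sentence (both the sentence and the snippet are just for illustration; the full implementation comes later in the post).

from collections import Counter

sentence = "the cat sat on the mat".split()
counts = Counter(sentence)
# Term frequency: occurrences of each word divided by the total number of tokens.
tf = {word: count / len(sentence) for word, count in counts.items()}
# 'the' appears twice among six tokens, so tf['the'] == 2/6; every other word gets 1/6.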

Inverse Document Frequency

Roughly speaking, inverse document frequency is simply the reciprocal of document frequency. Therefore, it suffices to show what document frequency is, since idf would immediately follow from df.

Before getting into the formula, I think it’s instructive to consider the motivation behind tf-idf, and in particular what role idf plays in the final score. The motivation behind tf-idf starts from a simple question: how do we determine the semantic importance of a word in a set of documents? On one hand, words that appear a lot are probably worth paying attention to. For example, in one of my posts on Gaussian distributions, the word “Gaussian” probably appears many times throughout the post. A keyword probably appears frequently in the document; hence the need to calculate tf.

On the other hand, there might be words that appear a lot, but aren’t really that important at all. For example, consider the word “denote.” I know that I use this word a lot before writing down equations or formulas, just for the sake of notational clarity. However, the word itself carries little information on what the post is about. The same goes for other words, such as “example,” “however,” and so on. So term frequency alone doesn’t really tell us much; instead, we want to pay attention to words that occur frequently in a given document but don’t appear much in others. Such words are most likely to be unique keywords that capture the gist of that document.

Given this analysis, it isn’t difficult to see why tf-idf is designed the way it is. Although we give priority weight to words with high term frequency, we discount words that appear frequently across all documents by dividing tf by their document frequency, which is exactly what multiplying by idf accomplishes. In short, document frequency tells us how frequently a given word appears throughout the collection; inverse document frequency is the reciprocal of that quantity.

\[\text{idf}(t, D) = \frac{\lvert D \rvert}{\sum_{t \in D} t} \tag{3}\]

In practice, we often apply a logarithm to prevent the idf score from exploding. Also, we add some smoothing to prevent division by zero. There seem to be many variations of how smoothing is implemented in practice, but here I present one that seems to be adopted by scikit-learn. For other schemes, refer to this table on Wikipedia.

\[\text{idf}(t, D) = 1 + \log \left( \frac{1 + \lvert D \rvert}{1 + \sum_{t \in D} t} \right) \tag{4}\]

However, this is a mere technicality; the intuition outlined earlier still applies regardless.
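
To get a feel for the smoothed formula in (4), here is a quick numeric check with NumPy. The numbers assume a collection of four documents, and they will reappear in the implementation below.

import numpy as np

n_docs = 4
# A word appearing in only 1 of the 4 documents gets a relatively high idf...
print(1 + np.log((1 + n_docs) / (1 + 1)))  # roughly 1.916
# ...while a word appearing in 3 of the 4 documents is discounted.
print(1 + np.log((1 + n_docs) / (1 + 3)))  # roughly 1.223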

With these ideas in mind, let’s go implement tf-idf vectorization in Python!

Python Implementation

In this section, we will develop a simple set of methods to convert a set of raw documents to tf-idf vectors, using a dummy dataset. Below are four documents (again, I know they’re short) that we will be using throughout this tutorial.

docs = [
    "Tom plays soccer!",
    "Tom loves basketball.",
    "Basketball is his hobby?",
    "Sarah loves basketball;"
]

Preprocessing

The first step is to preprocess and tokenize the data. Although the specifics of preprocessing will probably differ from task to task, in this simple example we simply remove all punctuation, convert the documents to lower case, and tokenize them by breaking each document down into a bag of words. Other possible techniques not discussed here include stemming and lemmatization.

The remove_punctuations() function accepts as input a set of documents and removes all the punctuation in each document.

punctuations = ".,;:!?"

def remove_punctuations(docs):
    for i in range(len(docs)):
        for punctuation in punctuations:
            docs[i] = docs[i].replace(punctuation, "")
    return docs

Here is the result of applying our function to the dummy data.

docs = remove_punctuations(docs)
docs
['Tom plays soccer',
 'Tom loves basketball',
 'Basketball is his hobby',
 'Sarah loves basketball']

Next, we need to tokenize the strings by splitting them into words. In this process, we will also convert all documents to lower case.

def tokenize(doc):
    return doc.lower().split(" ")

Note that tokenize() works on each document, not the entire collection. Let’s try calling the function with the first document in our dummy example.

tokenize(docs[0])
['tom', 'plays', 'soccer']

Finally, as part of the preprocessing step, let’s build the corpus. Here, the corpus refers to the entire set of words in the dataset (strictly speaking, this is usually called the vocabulary, but I’ll stick with corpus since that’s what the variable is named). Specifically for our purposes, the corpus will be a dictionary whose keys are the words and values are an ordinal index. Another way to think about the corpus in this context is to consider it as a word-to-index mapping. We will be using the indices to represent each word in the tf, idf, and tf-idf vectors later on in the tutorial.

def build_corpus(docs):
    # Collect every unique word across all documents.
    words_set = set()
    for doc in docs:
        for word in tokenize(doc):
            words_set.add(word)
    # Assign each word an ordinal index.
    res = {}
    for i, word in enumerate(list(words_set)):
        res[word] = i
    return res

Because we have a very simple example, our corpus only contains 9 words. This also means that the tf-idf vector for each document will be a list of length 9. Thus it isn’t difficult to see how tf-idf vectorization can result in extremely high-dimensional matrices, which is why we often apply techniques such as lemmatization or PCA to the final result. Also note that published modules use sparse representations to minimize computational load, as we will later see with scikit-learn.

corpus = build_corpus(docs)
corpus
{'his': 0,
 'soccer': 1,
 'sarah': 2,
 'basketball': 3,
 'plays': 4,
 'hobby': 5,
 'loves': 6,
 'tom': 7,
 'is': 8}
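
One caveat: because build_corpus() iterates over a Python set, the index assigned to each word is arbitrary and may change between runs. If you prefer a deterministic mapping, a small variant like the one below (a helper I’m calling build_sorted_corpus purely for illustration, which sorts the words alphabetically, as I believe scikit-learn does internally) would do the trick. We’ll stick with the original version for the rest of the post.

def build_sorted_corpus(docs):
    # Illustrative variant of build_corpus() with a stable, alphabetical ordering.
    words = set()
    for doc in docs:
        words.update(tokenize(doc))
    return {word: i for i, word in enumerate(sorted(words))}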

Term Frequency

Now it’s time to implement the first step: calculating term frequency. In Python, this simply amounts to looping through each document, creating a tf vector per iteration, and making sure that they are normalized as frequencies at the very end. In creating tf vectors for each document, we will be referencing the word-to-index mapping in our corpus.

def get_tf(docs, corpus):
    res = []
    for doc in docs:
        # Count occurrences of each word, positioned by its corpus index.
        freq = [0] * len(corpus)
        for word in tokenize(doc):
            freq[corpus[word]] += 1
        # Normalize raw counts into frequencies that sum to one.
        total_count = sum(freq)
        for i in range(len(freq)):
            freq[i] /= total_count
        res.append(freq)
    return res

Let’s see what we get for the four documents in our dummy example.

for doc in get_tf(docs, corpus):
    print(doc)
[0.0, 0.3333333333333333, 0.0, 0.0, 0.3333333333333333, 0.0, 0.0, 0.3333333333333333, 0.0]
[0.0, 0.0, 0.0, 0.3333333333333333, 0.0, 0.0, 0.3333333333333333, 0.3333333333333333, 0.0]
[0.25, 0.0, 0.0, 0.25, 0.0, 0.25, 0.0, 0.0, 0.25]
[0.0, 0.0, 0.3333333333333333, 0.3333333333333333, 0.0, 0.0, 0.3333333333333333, 0.0, 0.0]

Due to floating point arithmetic, the decimals don’t look the most pleasing to the eye, but it’s clear that normalization has been performed correctly. Also note that we get 4 vectors of length 9 each, as expected.
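
If the long decimals bother you, rounding purely for display cleans things up; this changes nothing about the underlying vectors.

# Round each tf value to two decimal places, for display only.
for doc in get_tf(docs, corpus):
    print([round(freq, 2) for freq in doc])
[0.0, 0.33, 0.0, 0.0, 0.33, 0.0, 0.0, 0.33, 0.0]
[0.0, 0.0, 0.0, 0.33, 0.0, 0.0, 0.33, 0.33, 0.0]
[0.25, 0.0, 0.0, 0.25, 0.0, 0.25, 0.0, 0.0, 0.25]
[0.0, 0.0, 0.33, 0.33, 0.0, 0.0, 0.33, 0.0, 0.0]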

Inverse Document Frequency

Next, it’s time to implement the idf portion of the vectorization process. To calculate idf, we first need a count of each word across the entire document collection. (Strictly speaking, idf calls for the number of documents containing each word rather than the total number of occurrences; the two coincide for our dummy data, and a document-frequency variant is sketched after the output below.) A module that is perfect for this job is collections.Counter, which accepts as input an iterable and outputs a dictionary-like object whose values represent the count of each key.

from collections import Counter

def count(docs):
    res = []
    for doc in docs:
        res += tokenize(doc)
    return Counter(res)

Let’s test it on our dummy dataset to see if we get the count of each tokenized word.

count(docs)
Counter({'tom': 2,
         'plays': 1,
         'soccer': 1,
         'loves': 2,
         'basketball': 3,
         'is': 1,
         'his': 1,
         'hobby': 1,
         'sarah': 1})
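
As mentioned above, idf is really defined in terms of document frequency: the number of documents in which a word appears, counting each document at most once. A variant along these lines might look like the sketch below (count_df is a hypothetical helper name). On our dummy data it produces exactly the same counts as count(), since no word repeats within a single document, so we’ll keep using count() here.

def count_df(docs):
    # Document frequency: how many documents each word appears in,
    # counting a word at most once per document.
    df = Counter()
    for doc in docs:
        df.update(set(tokenize(doc)))
    return df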

This is precisely what we need to calculate idf. Recall the formula for calculating idf

\[\text{idf}(t, D) = 1 + \log \left( \frac{1 + \lvert D \rvert}{1 + \sum_{t \in D} t} \right) \tag{4}\]

As noted earlier, the intuition behind idf was that important keywords probably appear only in specific relevant documents, whereas generic words of comparatively lesser importance appear throughout all documents. We transcribe (4) into code as follows:

import numpy as np

def get_idf(docs, corpus):
    n_docs = len(docs)
    counter = count(docs)
    res = [0] * len(corpus)
    for word, i in corpus.items():
        # Smoothed idf, as in equation (4).
        res[i] = 1 + np.log((n_docs + 1) / (1 + counter[word]))
    return res

Now we have the idf score for each of the nine terms in the dummy dataset.

get_idf(docs, corpus)
[1.916290731874155,
 1.916290731874155,
 1.916290731874155,
 1.2231435513142097,
 1.916290731874155,
 1.916290731874155,
 1.5108256237659907,
 1.5108256237659907,
 1.916290731874155]

tf-idf

At this point, all that is left to do is to multiply the term frequencies by their corresponding idf scores. This is extremely easy, since we are essentially performing an element-wise product of the tf and idf vectors for each document. As a final step, we normalize the result so that longer documents do not overshadow shorter ones. Normalizing is pretty simple, so for now we’ll assume that we have a function normalize() that does the job.

def get_tfidf(docs, corpus):
    tf = get_tf(docs, corpus)
    idf = get_idf(docs, corpus)
    for i, doc in enumerate(tf):
        # Element-wise product of the tf vector and the idf vector.
        for j in range(len(doc)):
            doc[j] *= idf[j]
        # L2-normalize so document length doesn't skew the result.
        tf[i] = normalize(doc)
    return tf

Before we test the code, we obviously need to implement normalize(). This can simply be done by computing the L2 norm of each vector, then dividing each element by that constant.

def normalize(vector):
    # Divide each element by the vector's L2 norm.
    norm_constant = np.sqrt(sum(num ** 2 for num in vector))
    for i in range(len(vector)):
        vector[i] /= norm_constant
    return vector

Here is an easy contrived example we can do in our heads:

normalize([3, 4])
[0.6, 0.8]
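
As an aside, since NumPy is already imported, the same normalization could be written more compactly with np.linalg.norm. This is just an equivalent sketch (with a hypothetical name, normalize_np), not what the rest of the post uses.

def normalize_np(vector):
    # Equivalent to normalize(): divide by the vector's L2 norm.
    arr = np.asarray(vector, dtype=float)
    return (arr / np.linalg.norm(arr)).tolist()

normalize_np([3, 4])
[0.6, 0.8]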

And now we’re done! Let’s print the tf-idf vectors for each of the four documents in the dummy example.

tf_idf = get_tfidf(docs, corpus)
for doc in tf_idf:
    print(doc)
[0.0, 0.617614370975602, 0.0, 0.0, 0.617614370975602, 0.0, 0.0, 0.48693426407352264, 0.0]
[0.0, 0.0, 0.0, 0.496816117482646, 0.0, 0.0, 0.6136667440107332, 0.6136667440107332, 0.0]
[0.5417361046803605, 0.0, 0.0, 0.3457831381910465, 0.0, 0.5417361046803605, 0.0, 0.0, 0.5417361046803605]
[0.0, 0.0, 0.7020348194149619, 0.4480997313625987, 0.0, 0.0, 0.5534923152870045, 0.0, 0.0]

It seems about right, as all the vectors appear normalized and have the desired dimensions. However, to really verify the result, it’s probably a good idea to pit our algorithm against scikit-learn’s implementation. In scikit-learn, TfidfVectorizer does all the work.

from sklearn.feature_extraction.text import TfidfVectorizer

To transform the data into tf-idf vectors, we need to create an instance of TfidfVectorizer and call its fit_transform() method.

vectorizer = TfidfVectorizer()
sklearn_tfidf = vectorizer.fit_transform(docs).toarray()

And here are the results:

for doc in sklearn_tfidf:
    print(doc)
[0.         0.         0.         0.         0.         0.61761437
 0.         0.61761437 0.48693426]
[0.49681612 0.         0.         0.         0.61366674 0.
 0.         0.         0.61366674]
[0.34578314 0.5417361  0.5417361  0.5417361  0.         0.
 0.         0.         0.        ]
[0.44809973 0.         0.         0.         0.55349232 0.
 0.70203482 0.         0.        ]

There are several observations to be made about this result. First, note that the default return type of TfidfVectorizer().fit_transform() is a sparse matrix. Sparse matrices are a great choice since many of the entries will be zero: there is probably no document that contains every word in the corpus, so sparse representations save a lot of space and compute time. This is also why we had to call toarray() on the result. Second, you might be wondering why the order of the elements is different. This is because scikit-learn orders its vocabulary alphabetically, whereas our corpus assigns indices in whatever order the words happen to come out of a Python set. This point notwithstanding, it’s clear that the values in each vector are identical, disregarding the fact that our output simply shows more decimal places.
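
If you want to see exactly how scikit-learn ordered its columns, the vectorizer can report its vocabulary (on recent versions via get_feature_names_out(); older releases expose get_feature_names() instead).

vectorizer.get_feature_names_out()
array(['basketball', 'his', 'hobby', 'is', 'loves', 'plays', 'sarah',
       'soccer', 'tom'], dtype=object)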

Conclusion

This was a short introductory post on tf-idf vectors. When I first heard about tf-idf vectors from a friend studying computational linguistics, I was intimidated. However, now that I have a project I want to complete, namely an auto-tagging and classification NLP model, I’ve mustered more courage and motivation to continue studying the basics of NLP.

I hope you’ve enjoyed reading this post. Catch you up in the next one! (Yes, this is a trite ending comment I use in almost every post, so the idf scores for the words in these two sentences are going to be very low.)