TF-IDF

#!/usr/bin/env python
#-------------------------------------------------------------------------------------#
# TF-IDF (Term Frequency-Inverse Document Frequency)
#-------------------------------------------------------------------------------------#
from sklearn.feature_extraction.text import TfidfVectorizer
from prettytable import PrettyTable

# Example documents
docs = [  "This is a sample document.",
          "This document is another sample document.",
          "Machine Learning Document"
        ]

# Initialize the vectorizer
vectorizer = TfidfVectorizer()

# Fit the model and transform the text data into a TF-IDF matrix
tfidf_matrix = vectorizer.fit_transform(docs)

# Get feature names (terms)
feature_names = vectorizer.get_feature_names_out()

# Convert to array to see the result
tfidf_array = tfidf_matrix.toarray()

# Initialize PrettyTable
table = PrettyTable()

# Add column names (terms)
table.field_names = ["Document"] + list(feature_names)

# Add rows (documents and their TF-IDF values rounded to 4 decimals)
for i, scores in enumerate(tfidf_array, start=1):
    row = [f"Document {i}"] + [round(value, 4) for value in scores]
    table.add_row(row)

print(table)


+------------+---------+----------+--------+----------+---------+--------+--------+
|  Document  | another | document |   is   | learning | machine | sample |  this  |
+------------+---------+----------+--------+----------+---------+--------+--------+
| Document 1 |   0.0   |  0.4091  | 0.5268 |   0.0    |   0.0   | 0.5268 | 0.5268 |
| Document 2 |  0.492  |  0.5812  | 0.3742 |   0.0    |   0.0   | 0.3742 | 0.3742 |
| Document 3 |   0.0   |  0.3854  |  0.0   |  0.6525  |  0.6525 |  0.0   |  0.0   |
+------------+---------+----------+--------+----------+---------+--------+--------+

TF-IDF Value Breakdown

Higher TF-IDF values: The higher the TF-IDF score for a term in a document, the more relevant or important that term is to that specific document.

Lower TF-IDF values: Terms with lower TF-IDF scores are either less frequent or appear in many documents, which reduces their importance in distinguishing between documents.
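
As a rough sketch of where these numbers come from, the following reproduces the scores by hand using only the standard library. It assumes the vectorizer's defaults: lowercasing, tokens of at least two word characters, the smoothed IDF ln((1 + n) / (1 + df)) + 1, and L2 normalization of each row. The helper names tokenize and tfidf_row are illustrative and not part of scikit-learn.

```python
import math
import re

docs = [
    "This is a sample document.",
    "This document is another sample document.",
    "Machine Learning Document",
]

def tokenize(text):
    # Assumed to mirror the default tokenization: lowercase, then keep
    # tokens of two or more word characters (so "a" is dropped).
    return re.findall(r"\b\w\w+\b", text.lower())

tokenized = [tokenize(d) for d in docs]
vocab = sorted({term for tokens in tokenized for term in tokens})
n_docs = len(docs)

# Smoothed IDF: ln((1 + n) / (1 + df)) + 1, where df is the number of
# documents that contain the term.
idf = {}
for term in vocab:
    df = sum(term in tokens for tokens in tokenized)
    idf[term] = math.log((1 + n_docs) / (1 + df)) + 1

def tfidf_row(tokens):
    # Raw weight = term count * IDF, then scale the row to unit L2 length.
    raw = [tokens.count(term) * idf[term] for term in vocab]
    norm = math.sqrt(sum(w * w for w in raw))
    return [round(w / norm, 4) for w in raw]

rows = [tfidf_row(tokens) for tokens in tokenized]
for i, row in enumerate(rows, start=1):
    print(f"Document {i}:", dict(zip(vocab, row)))
```

The two effects described above are both visible here: tokens.count(term) raises a weight when a term repeats within a document, and idf[term] lowers it when the term occurs in many documents.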

Document 1: "This is a sample document."


|  Document  | another | document |  is    | learning | machine | sample |  this  |
|------------|---------|----------|--------|----------|---------|--------|--------|
| Document 1 |   0.0   |  0.4091  | 0.5268 |   0.0    |   0.0   | 0.5268 | 0.5268 |

'this', 'is', 'sample': These terms each score 0.5268. Each appears once in this document and in two of the three documents, so they receive identical weights. The word 'a' is missing from the table because TfidfVectorizer's default tokenizer drops single-character tokens.
'document': This word appears in all three documents, so it carries the lowest IDF weight and scores lower (0.4091) than the other terms present in this document.
'another', 'learning', 'machine': These words are absent in this document, so their TF-IDF score is 0.0.

Document 2: "This document is another sample document."


|  Document  | another | document |  is    | learning | machine |  sample |  this  |
|------------|---------|----------|--------|----------|---------|---------|--------|
| Document 2 |  0.492  |  0.5812  | 0.3742 |   0.0    |   0.0   |  0.3742 | 0.3742 |

'another': This word has a high TF-IDF score of 0.492 because it only appears in Document 2, making it more unique and relevant to this document.
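'document': This word has the highest score of 0.5812 because it appears twice in Document 2. Its doubled term frequency outweighs its low IDF, which results from it appearing in every document.
'is', 'sample', 'this': These terms have lower scores (0.3742) because they also appear in Document 1. They also score lower here than in Document 1 (0.5268) because Document 2 contains more weighted terms, so L2 normalization leaves each of them a smaller share.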

Document 3: "Machine Learning Document"


|  Document  | another | document |  is    | learning | machine | sample | this  |
|------------|---------|----------|--------|----------|---------|--------|-------|
| Document 3 |   0.0   |  0.3854  |  0.0   |  0.6525  | 0.6525  |  0.0   |  0.0  |

'learning', 'machine': These words have high TF-IDF scores (0.6525) because they are unique to Document 3. Since they don’t appear in the other documents, they are very important for distinguishing this document.
'document': This word has the lowest nonzero score (0.3854) because it appears in all three documents. The score is not zero because scikit-learn's default smoothed IDF never drops below 1, so a term that is present in a document always keeps some weight.
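
To illustrate why 'document' keeps a nonzero score, here is a small sketch comparing two IDF definitions for this three-document corpus. It assumes the smoothed formula ln((1 + n) / (1 + df)) + 1 (the scikit-learn default) and the unsmoothed textbook formula ln(n / df). The function names smoothed_idf and textbook_idf are made up for this example.

```python
import math

n_docs = 3  # documents in the example corpus

def smoothed_idf(df, n=n_docs):
    # Add-one smoothing plus a +1 offset: the minimum value is 1.
    return math.log((1 + n) / (1 + df)) + 1

def textbook_idf(df, n=n_docs):
    # Classic definition: a term found in every document gets weight 0.
    return math.log(n / df)

# df = number of documents containing the term
for term, df in [("another", 1), ("sample", 2), ("document", 3)]:
    print(f"{term:<9} df={df}  "
          f"smoothed={smoothed_idf(df):.4f}  textbook={textbook_idf(df):.4f}")
```

Under the textbook definition, 'document' would be weighted 0 and vanish from every row. The smoothed variant keeps it at weight 1, which is why it still contributes to all three documents.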

TF-IDF Ranking Summary

Terms with high TF-IDF scores are unique to, or more frequent in, a particular document, making them key terms for that document.
Example: 'document' in Document 2, 'machine' and 'learning' in Document 3.
Terms with low TF-IDF scores are either shared across multiple documents or appear less frequently in the specific document, making them less important.
Example: 'this', 'is', 'sample' in Document 2; 'document' in Documents 1 and 3.
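
One way to read such a ranking off the matrix is to sort each document's terms by score. The sketch below does this for the scores printed in the output table. The rank_terms helper is hypothetical, and terms with a score of 0.0 are skipped on the assumption that absent terms are not of interest.

```python
# Scores copied from the output table above.
terms = ["another", "document", "is", "learning", "machine", "sample", "this"]
scores = {
    "Document 1": [0.0, 0.4091, 0.5268, 0.0, 0.0, 0.5268, 0.5268],
    "Document 2": [0.492, 0.5812, 0.3742, 0.0, 0.0, 0.3742, 0.3742],
    "Document 3": [0.0, 0.3854, 0.0, 0.6525, 0.6525, 0.0, 0.0],
}

def rank_terms(row):
    # Keep only terms that occur in the document, highest score first.
    present = [(term, score) for term, score in zip(terms, row) if score > 0]
    return sorted(present, key=lambda pair: pair[1], reverse=True)

ranking = {name: rank_terms(row) for name, row in scores.items()}
for name, ranked in ranking.items():
    print(name, ranked)
```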

Range of TF-IDF

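Raw TF-IDF weights (term count times IDF) have no fixed upper limit and can exceed 1, depending on the data. TfidfVectorizer, however, L2-normalizes each document's vector by default (norm='l2'), so the scores shown here always fall between 0.0 and 1.0:

0.0: Indicates that the term is not present in the document. With scikit-learn's default smoothed IDF, a term that is present never scores exactly 0, however common it is; under the unsmoothed textbook formula log(N/df), a term found in every document would also score 0.
1.0: Indicates that the term is the only one left in the document after tokenization, so it carries the entire normalized weight. Scores close to 1.0 mean the term dominates the document's weight, typically because it is frequent there and rare in the rest of the corpus.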
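
The effect of normalization on the range can be checked with a short standard-library sketch. It recomputes Document 2's weights before and after L2 normalization, assuming the smoothed IDF ln((1 + n) / (1 + df)) + 1, and adds a hypothetical one-term document to show the upper bound. The names l2_normalize, raw_doc2, and single_term_doc are illustrative.

```python
import math

def l2_normalize(weights):
    # Divide every weight by the Euclidean length of the whole vector.
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}

# Smoothed IDF values for a 3-document corpus: ln((1 + 3) / (1 + df)) + 1
idf_df3 = math.log(4 / 4) + 1   # 'document' (in all three documents)
idf_df2 = math.log(4 / 3) + 1   # 'this', 'is', 'sample'
idf_df1 = math.log(4 / 2) + 1   # 'another'

# Document 2 before normalization: term count * IDF
raw_doc2 = {
    "another": 1 * idf_df1,
    "document": 2 * idf_df3,
    "is": 1 * idf_df2,
    "sample": 1 * idf_df2,
    "this": 1 * idf_df2,
}
normalized_doc2 = l2_normalize(raw_doc2)

# Hypothetical document containing a single term: it takes the whole weight.
single_term_doc = l2_normalize({"machine": idf_df1})

print({t: round(w, 4) for t, w in raw_doc2.items()})
print({t: round(w, 4) for t, w in normalized_doc2.items()})
print(round(single_term_doc["machine"], 4))
```

Every raw weight in Document 2 is above 1, while the normalized weights all fall inside the 0.0 to 1.0 range and match the Document 2 row of the table.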