Document-Term Matrix
Explore the creation and interpretation of document-term matrices, a key structured data format for text analysis. Understand how rows represent documents and columns represent terms with frequencies. Discover how to use R's tm package to find frequent terms and gain insights into text corpora.
A document-term matrix (DTM) is fairly simple to understand. It is a matrix whose rows and columns each have a fixed meaning:
Each row represents a document. In our case, there will be one row for Frankenstein and a second row for The Last Man.
Each column represents a term. In this case, terms are words, although they can be sentences, lines, paragraphs, or n-grams (more on these in a later lesson).
Each cell in the matrix contains the frequency of the term in the document.
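To make the row, column, and cell roles concrete, here is a minimal sketch that builds a tiny document-term matrix by hand in base R. The two short texts and the names docs, tokens, vocab, count_terms, and dtm_by_hand are made up for illustration; they are not the lesson's corpus or code.

```r
# Two toy "documents" (hypothetical text, not the lesson's novels)
docs <- c(
  doc_a = "the monster fled and the creator followed",
  doc_b = "the plague spread and the last man wandered"
)

# Split each document into lowercase words
tokens <- lapply(docs, function(txt) strsplit(tolower(txt), "\\s+")[[1]])

# The vocabulary is every distinct word, sorted alphabetically
vocab <- sort(unique(unlist(tokens)))

# Count how often each vocabulary word occurs in one document
count_terms <- function(tok) tabulate(match(tok, vocab), nbins = length(vocab))

# One row per document, one column per term, counts in the cells
dtm_by_hand <- t(vapply(tokens, count_terms, integer(length(vocab))))
colnames(dtm_by_hand) <- vocab

print(dtm_by_hand)
```

A term that never occurs in a document still gets a cell; it simply holds 0, as with "monster" in doc_b.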
A term-document matrix (TDM), by contrast, is the transpose of a document-term matrix. It also consists of rows and columns, but their roles are swapped.
Each row in a TDM represents a term. For instance, if we have a set of terms like love, hate, joy, and so forth, each term will have its own row in the matrix.
Each column in the TDM represents a document. For example, if we have documents named Document 1, Document 2, and Document 3, these will be represented as separate columns in the matrix.
Each cell in the term-document matrix contains the frequency of the term in the corresponding document.
This matrix is often used to examine the occurrence of specific terms across various documents, enabling researchers to gain insights into how certain terms are distributed throughout the collection of documents.
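The relationship between the two layouts can be checked directly with the tm package. This is a small sketch on a made-up two-document corpus; the variable names texts, corpus, dtm, and tdm are assumptions for illustration only.

```r
library(tm)

# Hypothetical toy corpus: two very short documents
texts <- c("love and joy", "hate and love and love")
corpus <- VCorpus(VectorSource(texts))

# Same corpus, two layouts
dtm <- DocumentTermMatrix(corpus)   # rows = documents, columns = terms
tdm <- TermDocumentMatrix(corpus)   # rows = terms, columns = documents

dim(dtm)
dim(tdm)

# Transposing the dense DTM gives the same counts as the TDM
same_counts <- all(as.matrix(tdm) == t(as.matrix(dtm)))
print(same_counts)
```

Which layout to use is mostly a matter of convenience: both hold the same counts, so the choice depends on whether the analysis works document by document or term by term.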
Understanding the DTM
The easiest way to understand a document-term matrix is to look at one. Here’s the code to create a DTM from the corpus we’ve been working with.
Line 9: The inspect( ) call produces an abbreviated view of the DTM. For our immediate purposes, we can ignore the first six lines of the output and jump to the table at the bottom.
First, notice that each row is labeled with the title of a document. The first row is mws_frankenstein.txt, and the second row is mws_lastman.txt. These row labels appear under the heading Docs.
Then, notice each column is labeled with a term. The inspect( ) function only shows 10 of the 27,389 total terms. In this case, terms are words and include “and,” “for,” “had,” “her,” and so on.
The cells in the matrix list the number of times each term appears in each document. For example, the word “and” appears in mws_frankenstein.txt 2,833 times. The word “and” appears in mws_lastman.txt 6,107 times.
That is the basic idea behind a document-term matrix.
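The same reading of rows, columns, and cells can be tried on a corpus small enough to check by eye. The sketch below uses two invented sentences rather than the two novels, so the counts will not match the numbers quoted above; texts, corpus, dtm, and counts are illustrative names.

```r
library(tm)

# Hypothetical stand-in corpus (not the lesson's novels)
texts <- c("her father and her friend",
           "and the friend had her letter")
corpus <- VCorpus(VectorSource(texts))
dtm <- DocumentTermMatrix(corpus)

# Abbreviated view: summary lines first, then the table of counts
inspect(dtm)

# Size of the matrix
nDocs(dtm)
nTerms(dtm)

# Convert to an ordinary matrix to read individual cells
counts <- as.matrix(dtm)
counts[1, "her"]   # occurrences of "her" in the first document
counts[2, "her"]   # occurrences of "her" in the second document
```

Converting with as.matrix( ) is fine for a toy corpus, but for a large DTM it may use a lot of memory, because most cells in a real DTM are zero.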
Uses of a DTM
Once we’ve created a DTM, it’s trivial to find the most frequent terms. The tm package provides two functions for this purpose: findFreqTerms( ) and findMostFreqTerms( ). Here’s how they work:
Line 11: findFreqTerms(DTmatrix, lowfreq = 400) creates a list of terms that appear a minimum of 400 times. Note this is for the entire DTM and not specific to each document.
Line 13: findMostFreqTerms(DTmatrix) shows the most frequent terms in each document. There is an overlap between the documents, but they aren’t identical.
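The difference between the two functions is easier to see on a toy corpus. The following sketch assumes two invented documents and a threshold of 4 instead of 400; texts, corpus, dtm, frequent_overall, and top_per_doc are illustrative names, not the lesson's.

```r
library(tm)

# Hypothetical toy corpus with deliberately repeated words
texts <- c("night night night storm storm ship",
           "plague plague night ship ship ship")
corpus <- VCorpus(VectorSource(texts))
dtm <- DocumentTermMatrix(corpus)

# Terms whose total count across ALL documents is at least 4
frequent_overall <- findFreqTerms(dtm, lowfreq = 4)
print(frequent_overall)

# The top 2 terms within EACH document, with their counts
top_per_doc <- findMostFreqTerms(dtm, n = 2)
print(top_per_doc)
```

Here "night" passes the corpus-wide threshold only because its counts in both documents are added together, while the per-document result ranks terms separately within each document.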