Basics of Text Mining in R

Thinking of Text as a List of Words

By Harshvardhan in R ML

January 27, 2022

Text in R can be represented in several ways, but it is generally held in a character vector (strings). Reading a text file typically gives either a single long character string, or a data frame broken into variables and observations, as with comma-separated (CSV) files. In this tutorial, I will download one of Jane Austen’s books and perform some basic analysis to understand how these text functions work.
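To make the two representations concrete, here is a minimal base R sketch. The two lines of text and the names lines and lines_df are made up for illustration; they are not part of the book data used later.

```r
# Two lines of text held as a character vector (one string per element)
lines <- c("It is a truth universally acknowledged,",
           "that a single man in possession of a good fortune")

# The same text as a data frame: one row per line, text in its own column
lines_df <- data.frame(line = seq_along(lines),
                       text = lines,
                       stringsAsFactors = FALSE)

is.character(lines)  # the vector holds strings
length(lines)        # one element per line
nchar(lines)         # number of characters in each string
```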

Packages

The common packages for text mining in R are stringr, tidytext, tidyverse and quanteda. I will also use gutenbergr to download the book for analysis.

library(stringr)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5          ✓ purrr   0.3.4     
## ✓ tibble  3.1.6          ✓ dplyr   1.0.8.9000
## ✓ tidyr   1.2.0          ✓ forcats 0.5.1     
## ✓ readr   2.1.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(tidytext)
library(quanteda)
## Package version: 3.2.0
## Unicode version: 13.0
## ICU version: 67.1
## Parallel computing: 8 of 8 threads used.
## See https://quanteda.io for tutorials and examples.
library(gutenbergr)

# changing default ggplot theme to minimal
theme_set(theme_minimal())

Downloading the Book

Once I have the required functions in my namespace, I can download the book using gutenberg_download(). gutenberg_works() gives a list of works that can be downloaded. (gutenberg_metadata will give a list of all books in Project Gutenberg, but we only need the ones that can be downloaded.)

gutenberg_works(title == "Persuasion")
## # A tibble: 1 × 8
##   gutenberg_id title    author gutenberg_autho… language gutenberg_books… rights
##          <int> <chr>    <chr>             <int> <chr>    <chr>            <chr> 
## 1          105 Persuas… Auste…               68 en       <NA>             Publi…
## # … with 1 more variable: has_text <lgl>

I am looking for Persuasion, Jane Austen’s last book. The output tells me the rights to the book are public and it has text, so it works for my purpose. Downloading the book requires its gutenberg_id, which is 105 for Persuasion, as seen in the previous output.
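gutenberg_works() accepts filter conditions on other metadata columns too, not only title. A small sketch, assuming the metadata bundled with gutenbergr stores the author as "Austen, Jane"; the exact rows returned depend on the package version, and the name austen is only illustrative.

```r
library(gutenbergr)

# All downloadable works whose author field is exactly "Austen, Jane"
austen <- gutenberg_works(author == "Austen, Jane")

# Keep only the columns needed to pick an ID for gutenberg_download()
austen[, c("gutenberg_id", "title")]
```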

book = gutenberg_download(105)
## Determining mirror for Project Gutenberg from http://www.gutenberg.org/robot/harvest
## Using mirror http://aleph.gutenberg.org

I can download more than one book at a time and do many other fancy things. Check gutenbergr’s vignette for more information.

Exploring the Book

Let’s see what we have in book.

book
## # A tibble: 8,328 × 2
##    gutenberg_id text         
##           <int> <chr>        
##  1          105 "Persuasion" 
##  2          105 ""           
##  3          105 ""           
##  4          105 "by"         
##  5          105 ""           
##  6          105 "Jane Austen"
##  7          105 ""           
##  8          105 "(1818)"     
##  9          105 ""           
## 10          105 ""           
## # … with 8,318 more rows

The book object has two variables: gutenberg_id and text. Unless you are downloading multiple books, text is the only useful variable.

Also note that there are 8,328 rows in the dataset. However, this text is not in tidytext format, where each row identifies a token and each column is a variable. (An easy way to remember the format is to repeat out loud “One Token Per Document Per Row” as often as you can.)

To convert it into tidytext format, I will use the unnest_tokens() function from the tidytext package.

book %>% 
  unnest_tokens(word, text)
## # A tibble: 83,658 × 2
##    gutenberg_id word      
##           <int> <chr>     
##  1          105 persuasion
##  2          105 by        
##  3          105 jane      
##  4          105 austen    
##  5          105 1818      
##  6          105 chapter   
##  7          105 1         
##  8          105 sir       
##  9          105 walter    
## 10          105 elliot    
## # … with 83,648 more rows

unnest_tokens() as used here takes two arguments: the column to create and the column to convert. First comes the output column name that will be created as the text is unnested into it (word, in this case), and then the input column that the text comes from (text, in this case).

The function also did some other operations in the background. It removed all the punctuation marks from the text. It also converted everything to lower case (which can be switched off with to_lower = FALSE in unnest_tokens()). The function also has an argument token to specify what kind of unit the text should be split into. words is the default option, which worked for our case. Other options are characters, character_shingles, ngrams, skip_ngrams, sentences, lines, paragraphs, regex, tweets and ptb.
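These options are easier to see on a tiny example. The sketch below uses a made-up two-row tibble called toy (not the downloaded book) and assumes a tidytext version that supports token = "ngrams".

```r
library(tibble)
library(tidytext)

# A made-up two-line text, standing in for the book
toy <- tibble(line = 1:2,
              text = c("Sir Walter Elliot, of Kellynch Hall,",
                       "was a man who never took up any book."))

# Default: one lower-cased word per row, punctuation dropped
words <- unnest_tokens(toy, word, text)

# Keep the original capitalisation
words_cased <- unnest_tokens(toy, word, text, to_lower = FALSE)

# Two-word sequences (bigrams) instead of single words
bigrams <- unnest_tokens(toy, bigram, text, token = "ngrams", n = 2)

words
words_cased
bigrams
```

The first argument after the data is always the new column name, so it can be called word, bigram, or anything else that describes the token.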

Exploring Words

We can manipulate the words in several ways to gain insight. For example, how many four-letter words did she use? How many words shorter than four letters? How many longer than ten letters?

book = book %>% 
  unnest_tokens(word, text)

# Four Letter Words
book %>% 
  filter(str_length(word) == 4)
## # A tibble: 15,505 × 2
##    gutenberg_id word 
##           <int> <chr>
##  1          105 jane 
##  2          105 1818 
##  3          105 hall 
##  4          105 took 
##  5          105 book 
##  6          105 idle 
##  7          105 hour 
##  8          105 were 
##  9          105 into 
## 10          105 from 
## # … with 15,495 more rows
# Less than four letter words
book %>% 
  filter(str_length(word) < 4)
## # A tibble: 37,908 × 2
##    gutenberg_id word 
##           <int> <chr>
##  1          105 by   
##  2          105 1    
##  3          105 sir  
##  4          105 of   
##  5          105 in   
##  6          105 was  
##  7          105 a    
##  8          105 man  
##  9          105 who  
## 10          105 for  
## # … with 37,898 more rows
# More than ten letters
book %>% 
  filter(str_length(word) > 10)
## # A tibble: 1,636 × 2
##    gutenberg_id word         
##           <int> <chr>        
##  1          105 somersetshire
##  2          105 consolation  
##  3          105 contemplating
##  4          105 information  
##  5          105 respectable  
##  6          105 representing 
##  7          105 parliaments  
##  8          105 handwriting  
##  9          105 presumptive  
## 10          105 infatuation  
## # … with 1,626 more rows

We see that 15,505 word occurrences have exactly four letters, and 37,908 have fewer than four (this includes numbers such as 1). There are 1,636 occurrences with more than ten letters. These counts are of tokens, so a repeated word is counted every time it appears.
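The three filters can also be collapsed into a single summarise() call, since summing a logical vector counts its TRUE values. This is a sketch on a made-up vector of tokens, not the book; the column names four, shorter and longer are my own.

```r
library(tibble)
library(dplyr)
library(stringr)

# A handful of made-up tokens, standing in for the unnested book
tokens <- tibble(word = c("sir", "walter", "elliot", "of",
                          "kellynch", "hall", "1", "somersetshire"))

# One row with the three counts
length_counts <- tokens %>%
  summarise(
    four    = sum(str_length(word) == 4),  # exactly four characters
    shorter = sum(str_length(word) < 4),   # fewer than four
    longer  = sum(str_length(word) > 10)   # more than ten
  )

length_counts
```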

Words that Start or End with …

We can also find words that start or end with a particular string. For example, how often does Jane Austen use the V4 (“-ing”) form of a verb? We can use str_ends() from the stringr package to find words ending in “ing”.

book %>% 
  filter(str_ends(word, "ing"))
## # A tibble: 2,638 × 2
##    gutenberg_id word         
##           <int> <chr>        
##  1          105 contemplating
##  2          105 arising      
##  3          105 adding       
##  4          105 inserting    
##  5          105 serving      
##  6          105 representing 
##  7          105 forming      
##  8          105 concluding   
##  9          105 handwriting  
## 10          105 beginning    
## # … with 2,628 more rows

Words ending with “ing” appear 2,638 times. I’m curious, what are their frequencies? I only need to add count() at the end.

book %>% 
  filter(str_ends(word, "ing")) %>% 
  count(word, sort = T)
## # A tibble: 549 × 2
##    word           n
##    <chr>      <int>
##  1 being        220
##  2 nothing      139
##  3 having        92
##  4 going         65
##  5 something     64
##  6 morning       59
##  7 evening       54
##  8 anything      49
##  9 looking       45
## 10 everything    43
## # … with 539 more rows

“Being” and “nothing” are the most often used (no pun intended). What about words that start with “h”? I can use str_starts() from the stringr package for this.

book %>% 
  filter(str_starts(word, "h")) %>% 
  count(word, sort = T)
## # A tibble: 216 × 2
##    word        n
##    <chr>   <int>
##  1 her      1203
##  2 had      1187
##  3 he        961
##  4 his       659
##  5 have      589
##  6 him       467
##  7 herself   159
##  8 how       125
##  9 has        99
## 10 himself    95
## # … with 206 more rows

They’re mostly pronouns. How many times does “gh” appear in her text, and in which words? (“gh” is a fairly common letter pair in English, although pairs such as “th” and “he” are more frequent.)

book %>% 
  filter(str_detect(word, fixed("gh"))) %>% 
  count(word, sort = T)
## # A tibble: 97 × 2
##    word        n
##    <chr>   <int>
##  1 might     166
##  2 though    117
##  3 thought    90
##  4 enough     71
##  5 ought      52
##  6 right      43
##  7 through    34
##  8 brought    33
##  9 high       27
## 10 night      24
## # … with 87 more rows

I did this using the str_detect() function from stringr. By default, this function treats its pattern as a regular expression. Since I was looking for a fixed string (gh), I used fixed() to tell R exactly what I was looking for. It will not do pattern matching, only exact literal matches. I’m quite new to regular expressions, but a good starting guide is the Strings chapter of Hadley Wickham’s R for Data Science.
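For “gh” the two modes give the same answer, because neither letter has a special meaning in a regular expression. The difference shows up with characters such as the dot. A minimal sketch on a made-up vector x:

```r
library(stringr)

x <- c("a.b", "acb", "gh", "might")

# As a regular expression, "." matches any single character
regex_match <- str_detect(x, "a.b")

# With fixed(), "." is just a literal dot
fixed_match <- str_detect(x, fixed("a.b"))

# For a pattern with no special characters, both modes agree
same_for_gh <- identical(str_detect(x, "gh"), str_detect(x, fixed("gh")))

regex_match
fixed_match
same_for_gh
```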

I can also look for words that start with certain letters and end with certain letters. How? Just add another condition in the filter() statement. Let’s look for words Jane used that start and end with “t”.

book %>% 
  filter(str_starts(word, "t") & str_ends(word, "t")) %>% 
  count(word, sort = T)
## # A tibble: 18 × 2
##    word            n
##    <chr>       <int>
##  1 that          876
##  2 thought        90
##  3 tenant         13
##  4 trust           4
##  5 throat          2
##  6 torment         2
##  7 transient       2
##  8 treat           2
##  9 taught          1
## 10 temperament     1
## 11 tempt           1
## 12 tenderest       1
## 13 test            1
## 14 thickest        1
## 15 throughout      1
## 16 tight           1
## 17 trent           1
## 18 trustiest       1

The most common such word is “that”, followed by “thought”.
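The same filter can be written as one regular expression with anchors. This is an alternative I am sketching on a made-up vector, not what the code above does, and the two are not quite identical: the anchored pattern needs at least two characters, so a lone “t” passes the str_starts()/str_ends() pair but not the regex.

```r
library(stringr)

words <- c("that", "thought", "tenant", "the", "cat", "t")

# Two separate conditions, as in the filter() call
both <- str_starts(words, "t") & str_ends(words, "t")

# One pattern: starts with t, anything in between, ends with t
regex_both <- str_detect(words, "^t.*t$")

words[both]
words[regex_both]
```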

Frequency Distribution Plots

We saw how adding count(word, sort = T) created the frequency distribution. We can also visualise the counts.

Frequency Table

book %>% 
  count(word, sort = T) %>%
  head(20) %>% 
  mutate(word = reorder(word, n))
## # A tibble: 20 × 2
##    word      n
##    <fct> <int>
##  1 the    3329
##  2 to     2808
##  3 and    2800
##  4 of     2570
##  5 a      1594
##  6 in     1389
##  7 was    1337
##  8 her    1203
##  9 had    1187
## 10 she    1146
## 11 i      1123
## 12 it     1038
## 13 he      961
## 14 be      950
## 15 not     934
## 16 that    876
## 17 as      810
## 18 for     707
## 19 but     664
## 20 his     659

Frequency Plot

count(word, sort = T) already sorts the rows, but ggplot2 orders a character axis alphabetically and ignores the row order. So for the plot, I convert word into a factor whose levels are ordered by n, using reorder().
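A small sketch of what reorder() changes, using three counts copied from the frequency table; the names counts and counts_ordered are only for illustration.

```r
library(tibble)
library(dplyr)

counts <- tibble(word = c("the", "to", "and"),
                 n    = c(3329, 2808, 2800))

# As plain character data, an axis would be laid out alphabetically
alphabetical <- sort(counts$word)

# reorder() makes word a factor with levels in ascending order of n
counts_ordered <- counts %>%
  mutate(word = reorder(word, n))

alphabetical
levels(counts_ordered$word)
```

With the word on the y-axis, the first level is drawn at the bottom, so the most frequent word ends up at the top of the chart.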

book %>% 
  count(word, sort = T) %>%
  head(20) %>% 
  mutate(word = reorder(word, n)) %>% 
  ggplot(aes(x = n, y = word)) +
  geom_col() +
  xlab("Count") +
  ylab("Word")

Finding Hapaxes

Hapaxes are words that occur only once in the text. Nothing complicated; I will first count the occurrences and then filter when the count is 1.

book %>% 
  count(word, sort = T) %>% 
  filter(n == 1)
## # A tibble: 2,566 × 2
##    word      n
##    <chr> <int>
##  1 10        1
##  2 11        1
##  3 12        1
##  4 13        1
##  5 14        1
##  6 17        1
##  7 1760      1
##  8 1784      1
##  9 1785      1
## 10 1787      1
## # … with 2,556 more rows

The first rows are all numbers. What about words?

book %>% 
  count(word, sort = T) %>% 
  filter(n == 1) %>% 
  filter(!str_detect(word, "[0-9]"))
## # A tibble: 2,534 × 2
##    word             n
##    <chr>        <int>
##  1 a'n't            1
##  2 abbreviation     1
##  3 abdication       1
##  4 abide            1
##  5 abode            1
##  6 abominable       1
##  7 abominate        1
##  8 abominates       1
##  9 absenting        1
## 10 abstraction      1
## # … with 2,524 more rows

I have used a regular expression here to keep only the words that do not contain any digits.
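The pattern [0-9] matches any single digit, so a token is dropped if a digit appears anywhere in it, not only when the whole token is a number. A sketch on made-up tokens (chapter1 is invented to show a mixed case); it also uses the negate argument of str_detect(), which I am assuming is available in the installed stringr version.

```r
library(stringr)

words <- c("1818", "a'n't", "abode", "chapter1", "hall")

# TRUE wherever the token contains at least one digit
has_digit <- str_detect(words, "[0-9]")

# Two equivalent ways to keep the digit-free tokens
no_digits     <- words[!has_digit]
no_digits_neg <- words[str_detect(words, "[0-9]", negate = TRUE)]

no_digits
```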

Distribution of Word Lengths

Some writers have a habit of writing long words. What were the longest words used by Jane and how often did she use them?

book %>% 
  mutate(length = str_length(word)) %>% 
  count(length, sort = T)
## # A tibble: 16 × 2
##    length     n
##     <int> <int>
##  1      3 19955
##  2      4 15505
##  3      2 15212
##  4      5  8419
##  5      6  6489
##  6      7  5682
##  7      8  3450
##  8      9  2907
##  9      1  2741
## 10     10  1662
## 11     11   820
## 12     12   486
## 13     13   231
## 14     14    70
## 15     15    25
## 16     16     4

Three-letter words are the most common, followed by four-letter and two-letter ones. To get these numbers, I first calculated the length of each word using mutate() and str_length(), and then counted the lengths.
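The table gives the distribution of lengths but not the longest words themselves. One way to list them, sketched on made-up tokens rather than the book, is to keep distinct words and sort by length; the name longest and the cut-off of two rows are arbitrary choices.

```r
library(tibble)
library(dplyr)
library(stringr)

tokens <- tibble(word = c("sir", "somersetshire", "hall",
                          "contemplating", "of", "hall"))

longest <- tokens %>%
  distinct(word) %>%                     # each word once
  mutate(length = str_length(word)) %>%  # characters per word
  arrange(desc(length)) %>%              # longest first
  head(2)

longest
```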

I can also plot them.

book %>% 
  mutate(length = str_length(word)) %>% 
  count(length, sort = T) %>% 
  mutate(length = reorder(length, n)) %>% 
  ggplot(aes(x = length, y = n)) +
  geom_col() +
  xlab("Length of Word") +
  ylab("Count")

That was all! See you next week when I try some harder text analysis tools.


P.S. I have used the words “word(s)” and “token(s)” quite liberally. They are not always the same. As the token argument of unnest_tokens() shows, there are many options besides words that can be tokens.
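As a quick illustration of a token that is not a word, here is a sketch that splits a made-up string into single characters. The column name letter is my own choice, and I am assuming the default behaviour of the characters tokenizer, which drops spaces and punctuation.

```r
library(tibble)
library(tidytext)

toy <- tibble(text = "Anne Elliot")

# One row per character instead of one row per word
chars <- unnest_tokens(toy, letter, text, token = "characters")

chars
```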
