Creating a Table of Word Frequencies in R

The function word_freq_table() takes a text and returns a table listing each word in the text and the number of times it occurs. Several optional arguments control how the text is cleaned and which words are listed:

  • to_upper: a logical parameter indicating whether the text should be converted to upper case.
  • remove_punct: a regular expression matching the punctuation/special characters to be removed; an empty string skips this step.
  • remove_numbers: a logical parameter indicating whether the digits 0 through 9 should be removed.
  • replace_CR: the string that replaces each “carriage return” character; the default, "\r", leaves the text unchanged.
  • replace_LF: the string that replaces each “line feed” character; the default, "\n", leaves the text unchanged.
  • remove_repetitive_space: a logical parameter indicating whether runs of repeated blank spaces should be replaced with a single blank space.
  • wordlength_atleast: the minimum length of the words listed in the resulting table.
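
The cleaning options listed above can be traced step by step on a short string. The following is a minimal sketch that applies the same stringr calls the function uses, in the same order; the sample text is hypothetical and is not the text used elsewhere in this post.

```r
library(stringr)

# Hypothetical sample text (an assumption for illustration only)
sample_txt <- "Hello,  world!\nHello again in 2024."

step1 <- str_to_upper(sample_txt)                    # to_upper = TRUE
step2 <- str_replace_all(step1, "[[:punct:]]", "")   # remove_punct = "[[:punct:]]"
step3 <- str_replace_all(step2, "[0-9]", "")         # remove_numbers = TRUE
step4 <- str_replace_all(step3, "[\n]", " ")         # replace_LF = " "
step5 <- str_replace_all(step4, "[ ]{2,}", " ")      # remove_repetitive_space = TRUE

print(step5)
# [1] "HELLO WORLD HELLO AGAIN IN "
```

Removing the digits leaves a trailing space, so splitting on spaces produces one empty string. With wordlength_atleast set to 1 or more, that empty string is filtered out of the final table.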

Here is an example call to the function. The %>% pipe requires the magrittr or dplyr package to be attached.

> word_freq_table(txt, replace_LF = " ", wordlength_atleast = 5)   %>%
+   head()
# Words Total: 228, Unique: 202
# String Length Minimum: 5, Maximum: 14
          Word Freq
1 ACCELERATION    1
2    ACCORDING    1
3        AFTER    2
4      AGAINST    1
5    AGREEMENT    1
6    APPOINTED    2
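
The rows above appear in alphabetical order because table() sorts the words it counts. To see the most frequent words first, the result can be sorted before calling head(). The sketch below assumes a small hypothetical data frame with the same Word and Freq columns, standing in for the function's real output.

```r
library(dplyr)

# Hypothetical stand-in for the result of word_freq_table()
# (same column names; the values are made up for illustration)
freq <- data.frame(
  Word = c("AFTER", "AGAINST", "APPOINTED", "BOARD"),
  Freq = c(2, 1, 2, 5)
)

top_words <- freq %>%
  arrange(desc(Freq), Word) %>%  # most frequent first, ties in alphabetical order
  head(3)

print(top_words)
```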

The function is listed below. It relies on the stringr and dplyr packages.

word_freq_table <-
  function(txt,
           to_upper = TRUE,
           remove_punct = "[[:punct:]]",
           remove_numbers = TRUE,
           replace_CR = "\r",
           replace_LF = "\n",
           remove_repetitive_space = TRUE,
           wordlength_atleast = 1) {
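    # Normalise case so that "Word" and "word" are counted together
    if (to_upper)
      txt <- stringr::str_to_upper(txt)
    # Strip punctuation/special characters (skipped when the pattern is "")
    if (remove_punct != "")
      txt <- stringr::str_replace_all(txt, remove_punct, "")
    # Strip the digits 0 through 9
    if (remove_numbers)
      txt <- stringr::str_replace_all(txt, "[0-9]", "")
    # Replace CR and LF; the defaults put the same character back (no change)
    txt <- stringr::str_replace_all(txt, "[\r]", replace_CR)
    txt <- stringr::str_replace_all(txt, "[\n]", replace_LF)
    # Collapse runs of spaces into a single space
    if (remove_repetitive_space)
      txt <- stringr::str_replace_all(txt, "[ ]{2,}", " ")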
    
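    # str_split() returns a list; unlist() flattens it to one character vector,
    # so a txt vector with several elements is counted as a single text
    Word <- unlist(stringr::str_split(txt, " "))
    freq <- table(Word)
    freq <- as.data.frame(freq)
    
    # Keep only words with at least the requested number of characters
    freq <- dplyr::filter(freq, stringr::str_length(Word) >= wordlength_atleast)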
    cat(sprintf("# Words Total: %s, Unique: %s\r\n", sum(freq$Freq), nrow(freq)))
    cat(sprintf(
      "# String Length Minimum: %s, Maximum: %s\r\n",
      min(stringr::str_length(freq$Word)),
      max(stringr::str_length(freq$Word))
    ))
    freq
  }
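
The counting step at the heart of the function (split, tabulate, filter by length) can also be tried on its own. The sketch below uses base R's strsplit() and nchar() in place of the stringr and dplyr calls, so it is an equivalent illustration rather than the function's own code; the sample text is hypothetical and assumed to be already cleaned.

```r
# Hypothetical, already-cleaned sample text (an assumption for illustration)
txt <- "THE CAT SAT ON THE MAT"

# Split on single spaces and flatten the resulting list into a character vector
Word <- unlist(strsplit(txt, " ", fixed = TRUE))

# Count occurrences; the column is named "Word" after the variable
freq <- as.data.frame(table(Word), stringsAsFactors = FALSE)

# Keep words of at least 3 characters (plays the role of wordlength_atleast = 3)
freq <- freq[nchar(freq$Word) >= 3, ]

print(freq)
```

Here "ON" is dropped by the length filter and "THE" is counted twice.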