Scrape the Kindle notes export file in R

The objective

When I read books in the Kindle application for iPad, I highlight the sentences that I think are worth remembering, and then I like to save them in a file together with the notes from other books.

The Kindle application has a nice feature that exports all those highlights and sends them via e-mail as an HTML file.

However, I want to save the notes as plain text, without any formatting, and with a reference to the page and to the chapter title.

An example of the HTML file to scrape

This is how the exported HTML looks in my Kindle application on iPad:

The code

library(readr)
library(tidyverse)

# This picks up any HTML file on my desktop, so I assume that the only HTML file there is the one I want to scrape

file_html <- Sys.glob("~/Desktop/*.html")
stopifnot(length(file_html) == 1) # stop early if there is no HTML file or more than one

# you can replace ~/Desktop/ with the path to your own desktop

file.copy(file_html, "~/Desktop/original_copy.html")
file.rename(file_html, "~/Desktop/text.txt")

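# Read the file line by line into a one-column table; unlike read_delim(), this does not split lines that contain ";"
text <- tibble(column1 = trimws(read_lines("~/Desktop/text.txt")))
text <- dplyr::filter(text, column1 != "")

# Keep everything from the first section heading onwards
first_heading <- which(text$column1 == "<div class=\"sectionHeading\">")[1]
text <- text[first_heading:nrow(text), ]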

library(stringr)
text$page <- ""
text <- dplyr::filter(text, !grepl("</div>",text$column1))

The default text generated by my Kindle is in French, so you should adapt the French words in the code below (Surlignement, Page and Emplacement, meaning highlight, page and location) to the language of your Kindle.
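To make the language dependency easier to see, here is a minimal sketch that wraps the page extraction in a small function with the two labels as arguments. The function name `extract_page`, its arguments and the sample headings are hypothetical and are not part of the script; the English labels "Page" and "Location" are an assumption, so check them against your own export.

```r
# Hypothetical helper, not part of the original script.
# page_word and location_word are the language-specific labels.
extract_page <- function(heading, page_word = "Page", location_word = "Emplacement") {
  # Take what follows "- <page_word>", then cut it at "· <location_word>"
  after_page <- strsplit(heading, paste0("- ", page_word), fixed = TRUE)[[1]][2]
  before_location <- strsplit(after_page, paste0("· ", location_word), fixed = TRUE)[[1]][1]
  trimws(before_location)
}

# A made-up French heading
extract_page("Surlignement (jaune) - Page 12 · Emplacement 150")
# "12"

# The same call for an English export, assuming the labels are "Page" and "Location"
extract_page("Highlight (yellow) - Page 12 · Location 150", location_word = "Location")
# "12"
```

With a helper like this, switching language only means changing the arguments, not the body of the loop.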

# Find the rows that hold a highlight heading
rows_with_Surlignement <- which(grepl("Surlignement", text$column1))

for (i in rows_with_Surlignement) {
  # Keep what sits between "- Page" and "· Emplacement"
  page_num <- strsplit(text$column1[i], "- Page")
  page_num <- page_num[[1]][2]
  page_num <- strsplit(page_num, "· Emplacement")
  page_num <- page_num[[1]][1]
  page_num <- trimws(page_num)
  text$page[i] <- page_num
}

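# Drop the noteText tags so that each highlight heading sits right above its text
text <- dplyr::filter(text, !grepl("<div class=\"noteText\">", column1))
# Move each page number one row down, from the heading onto the highlighted text
text <- text %>% mutate(page = lag(page, default = ""))
# Drop every remaining row that still contains markup
text <- dplyr::filter(text, !grepl("<", column1))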

# optionally I could remove chapter titles, for the moment I like to have them

# Append the page number to each highlight; rows without a page are chapter titles
text <- text %>% mutate(column1 = ifelse(nchar(page) != 0, paste0(column1, " (page ", page, ")"), paste0(column1, " (chapter title)")))

text <- text %>% select(column1)

# Write one note per line, without a header row
write.table(text, "~/Desktop/text.txt", sep = "\t", row.names = FALSE, col.names = FALSE, quote = FALSE)
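
The least obvious part of the code above is how the page number travels from the highlight heading to the highlighted sentence. Here is a small sketch of that step on a toy table. The rows are made up to imitate the export after the "</div>" lines have been dropped, so the real file may look different, and `default = ""` is an addition of mine that only keeps the first row from becoming NA.

```r
library(dplyr)

# Made-up rows imitating the table once the "</div>" lines are gone
toy <- data.frame(
  column1 = c(
    "<div class=\"sectionHeading\">",
    "Chapter one",
    "Surlignement (<span>jaune</span>) - Page 12 · Emplacement 150",
    "<div class=\"noteText\">",
    "First highlighted sentence."
  ),
  page = c("", "", "12", "", ""),
  stringsAsFactors = FALSE
)

toy <- toy %>%
  filter(!grepl("<div class=\"noteText\">", column1)) %>% # heading now sits right above its text
  mutate(page = lag(page, default = "")) %>%              # move the page one row down
  filter(!grepl("<", column1))                            # drop the remaining markup rows

toy
```

Two rows are left: "Chapter one" with an empty page, which the script later labels as a chapter title, and "First highlighted sentence." with page "12".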

The final output

Disclaimer

I used the book “The Power of Now: A Guide to Spiritual Enlightenment” by Eckhart Tolle as an example. I enjoyed it and I recommend reading it. However, I have no connection with the author, nor do I get any benefit from recommending it.