September 24, 2016
I love reading and I’ve often thought of working my way through every book on a “top books” list. But which list to choose? What books show up consistently? Which authors are always represented, but with different works? How representative are book lists with regard to historical works?
I was thinking about these questions at the same time as I was interested in learning more about bash programming and AWK. This project is the result.
I decided to stick to top 100 lists and to avoid specific lists like “Top 100 scifi novels.” Some of the following are chosen by voting readers and some by literary experts.
I downloaded the raw HTML for each page and wrote AWK scripts to pull out the title and author.
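Grabbing a page is a one-line curl call; for the Goodreads list, for example, it is something like this (the output filename is my own choice):
curl --silent "http://www.goodreads.com/list/show/13086.Goodreads_Top_100_Literary_Novels" > good_reads.html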
Here is an example of the HTML, in this case from Goodreads:
<tr itemscope itemtype="http://schema.org/Book">
<td valign="top" class="number">1</td>
<td width="5%" valign="top">
<a name="1885"></a>
<a href="/book/show/1885.Pride_and_Prejudice" title="Pride and Prejudice">
<img alt="Pride and Prejudice" class="bookSmallImg" src="http://d202m5krfqbpi5.cloudfront.net/books/1320399351s/1885.jpg" />
</a> </td>
<td width="100%" valign="top">
<a href="/book/show/1885.Pride_and_Prejudice" class="bookTitle" itemprop="url">
<span itemprop='name'>Pride and Prejudice</span>
</a> <br/>
<span class='by smallText'>by</span>
<span itemprop='author' itemscope='' itemtype='http://schema.org/Person'>
<a href="http://www.goodreads.com/author/show/1265.Jane_Austen" class="authorName" itemprop="url"><span itemprop="name">Jane Austen</span></a>
</span>
The following AWK script pulls the title and author out of that markup and writes them to a tab-delimited file.
#!/usr/bin/awk
#awk -f extract<name>.awk -f functionLibrary.awk <name>.html
#Source: http://www.goodreads.com/list/show/13086.Goodreads_Top_100_Literary_Novels
BEGIN{
FS="<[^>]+>";
OFS="\t";
lineCount=100;
rankIterator=0;
}
FNR==1{
split(FILENAME,fileNameArray,".");
outputFile = fileNameArray[1]".table";
}
/itemprop='name'/{
title=$2
#print FNR,title
}
/class="authorName"/{
author=$3
#print FNR,author
currentLine=FNR
}
FNR==currentLine{
numNames=split(author,authorNameArray," ");
firstNames=join(authorNameArray,1,numNames-1);
#rank=weightBook(lineCount,rankIterator++);
print title,authorNameArray[numNames],firstNames > outputFile;
}
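The join() function used above isn’t built into AWK; it comes from functionLibrary.awk, which I won’t reproduce here. A minimal sketch of what that helper needs to do, modeled on the join.awk example in the gawk manual, would be:
#Sketch of the join() helper assumed by the extraction script:
#concatenate array[start..end] into one space-separated string
function join(array, start, end,    result, i)
{
    result = array[start]
    for (i = start + 1; i <= end; i++)
        result = result " " array[i]
    return result
}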
The result was a file for each list with the book title and the author’s name on each line (family name followed by given name).
Pride and Prejudice Austen Jane
1984 Orwell George
The Great Gatsby Fitzgerald F. Scott
Jane Eyre Brontë Charlotte
Crime and Punishment Dostoyevsky Fyodor
Lolita Nabokov Vladimir
The Adventures of Huckleberry Finn Twain Mark
Of Mice and Men Steinbeck John
Wuthering Heights Brontë Emily
Brave New World Huxley Aldous
To merge the lists I used another AWK script, this time called from a bash script. This script also counted how many lists each book appeared on and recorded which lists those were.
#!/bin/bash
#usage: ./tabulateBooks.sh
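#Merge the seven .table files, then sort the resulting CSV in descending
#general-numeric order on field 1 (number of lists) and field 2 (total rank);
#the ".2" in each sort key skips the field's opening quote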
awk -f tabulateBooks.awk \
../rawData/bookman_librarians.table \
../rawData/bbc_the_big_read.table \
../rawData/modern_library_readers.table \
../rawData/modern_library_board.table \
../rawData/npr_beach_books.table \
../rawData/good_reads.table \
../rawData/harvard_bookstore.table \
| sort -t"," -k1.2,1gr -k2.2,2gr > tabulatedBooks.csv
And tabulateBooks.awk itself:
BEGIN{
FS="\t"; #Input delimiter
OFS=","; #Output delimiter
lineCount=100 #Number of items in the longest list
}
{ #for every line in each file
title=toupper($1)
authorLast=$2
authorFirst=$3
rank=lineCount-(FNR-1)
#bookArray[title]+=rank
NumOfListsArray[title]+=1
FirstNameArray[title]=authorFirst
LastNameArray[title]=authorLast
#Pull out filename sans extension
len = split(FILENAME,N1,"/")
split(N1[len],N2,".")
listName = N2[1]
#Append which lists the book appears in and the rank it has in that list
FileNameArray[title]=(FileNameArray[title] "\",\"" listName "\",\"" rank)
totalRankArray[title]=(totalRankArray[title] + rank)
}
END{
#print "-------------------"
for(title in FirstNameArray) {
gsub(/^\",\"/,"",FileNameArray[title]) #Remove extra delimter
# Number of lists book is in,total rank, book title, author first name, last name, lists in which book appears and rank in that list
print "\""NumOfListsArray[title]"\"","\""totalRankArray[title]"\"","\""title"\"","\""FirstNameArray[title]"\"","\""LastNameArray[title]"\"","\""FileNameArray[title]"\""
}
}
Of course, some duplicates slipped through since some lists had typos, title variations, etc. Here I gave in and used Python to search the merged list for likely duplicates by calculating the Jaccard index for every pair of books in the list. The Jaccard index of two sets is the size of their intersection divided by the size of their union; here each book line is reduced to its set of four-character shingles (overlapping substrings), so near-identical titles score high even when they don’t match exactly. Once I found the duplicates I corrected the errant names in the relevant source files and recreated the merged list.
#!/usr/bin/python
import sys
#####################################
def jaccard_index(set_1,set_2):
    #Size of the intersection divided by the size of the union
    n = len(set_1.intersection(set_2))
    return n / float(len(set_1) + len(set_2) - n)
#####################################
fileName=str(sys.argv[1])
shingleLen=4
with open(fileName,'r') as f:
    data = f.readlines()
lineNumA=0
for line in data:
    line=line.rstrip('\n')
    #Break the line into overlapping 4-character shingles
    primaryShingle=[line[i:i + shingleLen] for i in range(len(line) - shingleLen + 1)]
    primarySet=set(primaryShingle)
    lineNumB=0
    for ll in data:
        if lineNumA < lineNumB: #Compare each pair of lines only once
            ll=ll.rstrip('\n')
            secondaryShingle=[ll[k:k + shingleLen] for k in range(len(ll) - shingleLen + 1)]
            secondarySet=set(secondaryShingle)
            jac = jaccard_index(primarySet,secondarySet)
            if jac > .35: #Likely duplicate
                print "%.3f\n%s\n%s"%(jac,line,ll)
                print "-----------"
        #print lineNumB
        lineNumB += 1
    #print lineNumA
    lineNumA += 1
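As a sanity check on the 0.35 cutoff, consider two invented title variants (this snippet is illustrative, not part of the pipeline): “THE HOBBIT” produces seven 4-character shingles and “HOBBIT” produces three, all three of which are shared.
#Quick check of the 0.35 threshold on made-up title variants
shingleLen = 4
def shingles(t):
    return set(t[i:i + shingleLen] for i in range(len(t) - shingleLen + 1))
a = shingles("THE HOBBIT") #7 shingles
b = shingles("HOBBIT")     #3 shingles, all shared with a
print len(a & b) / float(len(a | b)) #about 0.43, above the 0.35 cutoff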
This left me with a list of 443 unique books. The next step was to determine the year each book was published. Using curl, I ran the title and author as search terms through Google and WolframAlpha and was able to pull out the publication date for nearly every book. The last few I filled in by hand.
This script looks to see whether the date is already known (since I ran this quite a few times) and, if it isn’t, tries to determine the date using Google and then WolframAlpha.
#!/bin/bash
#Usage: ./tabulateBooks.sh tabulatedBooks.csv
: > tempdatesfile #Create empty file
cat $1 |
while read line; do #for each line in the file do the following
#The FPAT variable keeps commas inside quotes from being field separators
TA=$(echo "$line" | awk -vFPAT='([^,]*)|("[^"]+")' -vOFS=+ \
'{print $3,$5}') #Title and author's last name
Stitle=$(echo "$TA" | cut -d"+" -f1) #cut out just the title
#Now remake $title and $TA with the '+' delimiter replacing spaces
TA=$(echo $TA | awk -F'[ ]' 'BEGIN{OFS="+";} {$1=$1;print;}')
title=$(echo $Stitle | awk -F'[ ]' 'BEGIN{OFS="+";} {$1=$1;print;}')
sanitizedTitle=$( echo ${Stitle//\(/\\\(} ) #Replaces ( with \(
sanitizedTitle=$( echo ${sanitizedTitle//\)/\\)} ) #Replaces ) with \)
#Look for known dates in file (use perl regex because it works)
year=$(cat tabulatedBooks_withDates.csv | grep -iP \
"$sanitizedTitle\,\"([0-9]|\-[0-9])" \
| awk -vFPAT='([^,]*)|("[^"]+")' '{gsub("\"","",$4); print $4}' )
#Check if year is in the range 1000 to 2999 CE
if (echo $year | grep -Eq '^[1-2][0-9][0-9][0-9]$')
then
printf 'Known year:%s\n' "$year"
printf '\"%s\",%s\n' "$year" "$line" >> tempdatesfile
continue
fi
#Check if year is in the range 0 to 999 CE
if (echo $year | grep -Eq '^[0-9][0-9][0-9]$')
then
printf 'Known year:%s\n' "$year"
printf '\"%s\",%s\n' "$year" "$line" >> tempdatesfile
continue
fi
#Check if year is between 0 and 999 BCE
if (echo $year | grep -Eq '^\-[0-9][0-9][0-9]$')
then
printf 'Known year:%s\n' "$year"
printf '\"%s\",%s\n' "$year" "$line" >> tempdatesfile
continue
fi
if !(echo $year | grep -Eq '^[0-9][0-9][0-9][0-9]$')
then
echo "Searching for year"
searchString=$(printf '%s%s' "$TA" "+published")
year=$(curl --silent --user-agent "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.93 Safari/537.36" \
https://www.google.com/search?q=$searchString | \
awk -f scrapeGoogle.awk)
fi
if !(echo $year | grep -Eq '^[0-9][0-9][0-9][0-9]$')
then
echo "Scraping google failed, trying wolframalpha..."
searchString=$(printf '%s%s' "\"$title\"" "+first+publication+date")
year=$(curl --silent --user-agent "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.93 Safari/537.36" \
http://www.wolframalpha.com/input/?i=$searchString | \
awk -f scrapeWolfram.awk)
fi
if !(echo $year | grep -Eq '^[0-9][0-9][0-9][0-9]$')
then
#make sure $year is empty if it isn't a 4 digit number
year=""
echo "FAILED: $Stitle"
fi
printf '\"%s\",%s\n' "$year" "$line" >> tempdatesfile
done
: > tempdatesfile2
#Move the year from field 1 to field 4, giving the final column order:
#number of lists, total rank, title, year, author first name, author last name, lists/ranks
awk -vOFS="," -vFPAT='([^,]*)|("[^"]+")' \
'{t=$1; $1=$2; $2=$3; $3=$4; $4=t; gsub(",,",","); \
print >> "tempdatesfile2"}' tempdatesfile
rm tempdatesfile
mv tabulatedBooks_withDates.csv tabulatedBooks_withDates.csv.backup
mv tempdatesfile2 tabulatedBooks_withDates.csv
These are the helper AWK scripts that interpret the downloaded HTML.
#For Google results
#!/usr/bin/awk
BEGIN{
RS="<[^>]+>"; #Make each HTML tag a row divider
OFS="+";
ORS="";
}
/Originally published/{
if ($0=="Originally published") #Did it work?
pubNR=NR #Remember row number
else
next #No luck, proceed to next book
}
NR==pubNR+3{ #Date will be 3 rows after the 'originally published' text
dateArrayLength=split($0,dateArray," ")
year=dateArray[dateArrayLength] #Pull out just the year
print year
}
#For WolframAlpha results
#!/usr/bin/awk
BEGIN{
FS="\"";
OFS="";
}
/stringified/{
c++;
if(c==2){
dateArrayLength=split($4,date," ")
year=date[dateArrayLength]
print year
}
}
At this point I have a delimited file containing the book title, author name, publication date, number of lists the book appears in, which lists those are, and the book’s rank in each list.
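A line of the final file looks something like this (the values here are invented for illustration, not actual output):
"2","178","PRIDE AND PREJUDICE","1813","Jane","Austen","good_reads","100","bbc_the_big_read","78"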