Calculate word frequency
Philipp Zumstein edited this page Sep 23, 2016 · 2 revisions
It is possible to calculate the word frequencies of an hOCR file using only standard command-line programs.
To extract just the text of an hOCR file, one can use a sed command that replaces every tag around the actual text with a space:
sed 's/<[^>]*>/ /g' sample.hocr
(See also https://github.com/UB-Mannheim/ocr-fileformat for alternatives to this step.)
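As a rough illustration of what the sed step does, the sketch below runs it on a small hand-written hOCR fragment. The fragment, the temporary file, and the variable names are assumptions made for this example and are not taken from sample.hocr. The extra tr -s call only squeezes repeated spaces so the result is easier to read.

```shell
# Hypothetical two-word hOCR fragment, written to a temporary file.
tmp_hocr=$(mktemp)
cat > "$tmp_hocr" <<'EOF'
<span class='ocr_line' title='bbox 10 10 200 40'><span class='ocrx_word' title='bbox 10 10 90 40'>Hello</span> <span class='ocrx_word' title='bbox 100 10 200 40'>World</span></span>
EOF

# Replace every tag with a space, then squeeze runs of spaces for readability.
plain_text=$(sed 's/<[^>]*>/ /g' "$tmp_hocr" | tr -s ' ')
echo "$plain_text"

rm -f "$tmp_hocr"
```

Because sed works line by line, a tag that is split across two lines is not matched by this pattern, and HTML entities such as &amp;amp; are left undecoded. For such files the alternatives linked above may be a better fit.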
Calculate the word frequencies with an awk program as described in the GNU awk User's Guide, section 14.3.5:
# wordfreq.awk --- print list of word frequencies

{
    $0 = tolower($0)    # remove case distinctions
    # remove punctuation
    gsub(/[^[:alnum:]_[:blank:]]/, "", $0)
    for (i = 1; i <= NF; i++)
        freq[$i]++
}

END {
    for (word in freq)
        printf "%s\t%d\n", word, freq[word]
}
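To see what the awk program produces on its own, here is a minimal sketch that feeds it two lines of made-up text. The sample sentence and the variable name word_counts are assumptions for illustration, and the program is passed inline so that no extra file is needed. It assumes an awk that understands POSIX character classes such as [:alnum:], which gawk does. The final sort only makes the order predictable, since awk does not guarantee any order for "for (word in freq)".

```shell
# Hypothetical sample text; the awk body matches wordfreq.awk,
# passed inline so the example runs without extra files.
word_counts=$(printf 'The cat and the hat.\nThe end!\n' | awk '
{
    $0 = tolower($0)                          # remove case distinctions
    gsub(/[^[:alnum:]_[:blank:]]/, "", $0)    # remove punctuation
    for (i = 1; i <= NF; i++)
        freq[$i]++                            # count each field (word)
}
END {
    for (word in freq)
        printf "%s\t%d\n", word, freq[word]
}' | sort)
echo "$word_counts"
```

Each output line holds a word, a tab, and its count. "The", "the" and "The" are counted together as "the" because of the tolower call.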
Sort the result by count in descending order with sort -k 2nr and output the top 10 words with head.
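The sorting step can be tried in isolation on a few hand-written lines. In this sketch the word counts and the variable name top_two are assumptions for illustration. The key -k 2nr tells sort to compare the second field numerically (n) and in reverse order (r). head prints the first 10 lines when called without options, so -n 2 is used here only to keep the example short.

```shell
# Hypothetical tab-separated word counts, in the format printed by wordfreq.awk.
top_two=$(printf 'she\t20\nthe\t24\ndown\t9\nto\t21\n' | sort -k 2nr | head -n 2)
echo "$top_two"
```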
After saving the awk program to a file wordfreq.awk, one can run all steps together with:
sed 's/<[^>]*>/ /g' sample.hocr | awk -f wordfreq.awk | sort -k 2nr | head
The output will then look like this example:
the 24
to 21
she 20
it 18
and 15
of 15
a 13
was 12
her 10
down 9