
Calculate word frequency

Philipp Zumstein edited this page Sep 23, 2016 · 2 revisions

The word frequencies of an hOCR file can be calculated with a few standard command-line programs.

1. Text extraction

To extract just the text of an hOCR file, use a sed command that replaces every tag around the actual text with a space:

sed 's/<[^>]*>/ /g' sample.hocr
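As a minimal sketch of what this substitution does, the following applies the same sed expression to a single line. The hOCR fragment and the variable names `fragment` and `extracted` are made up for illustration and are not taken from sample.hocr.

```shell
# Hypothetical one-line hOCR fragment, invented for this illustration.
fragment="<span class='ocrx_word' title='bbox 10 10 50 30'>Hello</span> <span class='ocrx_word' title='bbox 60 10 110 30'>world</span>"

# Replace every tag with a single space; only the text between the tags remains.
extracted=$(printf '%s\n' "$fragment" | sed 's/<[^>]*>/ /g')

# Brackets make the leftover leading and trailing spaces visible.
printf '[%s]\n' "$extracted"
```

The result contains runs of several spaces where tags used to be. That is harmless for the next step, because awk splits fields on any amount of whitespace.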

(See also https://github.com/UB-Mannheim/ocr-fileformat for alternatives to this step.)

2. Calculate word frequencies

Calculate the word frequencies with an awk program, as described in section 14.3.5 of the GNU awk User's Guide:

# wordfreq.awk --- print list of word frequencies

{
    $0 = tolower($0)    # remove case distinctions
    # remove punctuation
    gsub(/[^[:alnum:]_[:blank:]]/, "", $0)
    for (i = 1; i <= NF; i++)
        freq[$i]++
}

END {
    for (word in freq)
        printf "%s\t%d\n", word, freq[word]
}
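To see the counting logic in isolation, the sketch below runs the same awk statements inline on a made-up sentence. It assumes an awk that supports POSIX character classes such as `[:alnum:]`, as gawk does. The input sentence and the variable name `counts` are illustrative only, and the final `sort` is there just to make the order of the lines predictable.

```shell
# Same statements as wordfreq.awk, passed inline and fed a made-up sentence.
counts=$(printf '%s\n' 'The cat saw the other cat. THE END!' |
  awk '{
    $0 = tolower($0)                          # remove case distinctions
    gsub(/[^[:alnum:]_[:blank:]]/, "", $0)    # remove punctuation
    for (i = 1; i <= NF; i++)
        freq[$i]++                            # count each word
  }
  END {
    for (word in freq)
        printf "%s\t%d\n", word, freq[word]
  }' |
  sort)

printf '%s\n' "$counts"
```

With gawk, "The", "the" and "THE" are counted together as `the` with a count of 3, and "cat" and "cat." are both counted as `cat`.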

3. Sort and output

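Sort the result numerically in descending order on the second field (the count) with sort -k 2nr, then output the top 10 words with head, which prints the first 10 lines by default.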
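A small sketch of this step on its own, using a few made-up word/count pairs in the tab-separated format that the awk program prints. The variable names `sample_counts` and `top2` are illustrative; `head -n 2` is used instead of plain `head` only to keep the example short.

```shell
# Made-up tab-separated word counts, deliberately out of order.
sample_counts=$(printf 'down\t9\nthe\t24\nher\t10\nto\t21\n')

# -k 2nr: sort on the second field, numerically (n), largest first (r).
# head -n 2 keeps the two most frequent words.
top2=$(printf '%s\n' "$sample_counts" | sort -k 2nr | head -n 2)

printf '%s\n' "$top2"
```

This prints `the` with 24 followed by `to` with 21. To list a different number of words in the full pipeline, pass the count to head, for example `head -n 20`.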

Putting it together

After saving the awk program to a file named wordfreq.awk, the whole pipeline can be run with

 sed 's/<[^>]*>/ /g' sample.hocr | awk -f wordfreq.awk | sort -k 2nr | head
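If the pipeline is needed for several files, one possible convenience is to wrap it in a shell function. The sketch below is only one way to do that: the function name `hocr_wordfreq` and its optional second argument are hypothetical, and the awk statements are inlined so that the function does not depend on wordfreq.awk being in the current directory.

```shell
# hocr_wordfreq FILE [N]
# Hypothetical helper: print the N most frequent words of an hOCR file (default 10).
hocr_wordfreq() {
  sed 's/<[^>]*>/ /g' "$1" |
    awk '{
      $0 = tolower($0)                          # remove case distinctions
      gsub(/[^[:alnum:]_[:blank:]]/, "", $0)    # remove punctuation
      for (i = 1; i <= NF; i++)
          freq[$i]++
    }
    END {
      for (word in freq)
          printf "%s\t%d\n", word, freq[word]
    }' |
    sort -k 2nr |
    head -n "${2:-10}"
}
```

For example, `hocr_wordfreq sample.hocr 20` would list the 20 most frequent words, and `hocr_wordfreq sample.hocr` the top 10.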

The output will then look like this example:

the     24
to      21
she     20
it      18
and     15
of      15
a       13
was     12
her     10
down    9