BUILDING VOCABULARY
Processed 1754541204 tokens.
Counted 5329509 unique words.
Truncating vocabulary at min count 5.
Using vocabulary of size 1539115.
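
The log above reports counting unique words and then truncating the vocabulary at a minimum count of 5. A minimal sketch of that idea follows; it is not the tool that produced the log. The function name `build_vocab` and the toy token list are hypothetical, chosen only for illustration.

```python
from collections import Counter

def build_vocab(tokens, min_count=5):
    """Count tokens and keep only those seen at least min_count times.

    Hypothetical helper: illustrates min-count truncation, not the
    tool that produced the log above.
    """
    counts = Counter(tokens)
    # Sort by descending frequency; break ties by the word itself
    # so the result is deterministic.
    kept = sorted(
        ((word, n) for word, n in counts.items() if n >= min_count),
        key=lambda item: (-item[1], item[0]),
    )
    return len(counts), kept

# Toy input: three distinct words with frequencies 6, 5 and 2.
tokens = ("في " * 6 + "من " * 5 + "كتاب " * 2).split()
unique, vocab = build_vocab(tokens, min_count=5)

print(f"Processed {len(tokens)} tokens.")
print(f"Counted {unique} unique words.")
print(f"Using vocabulary of size {len(vocab)}.")
```

On the real corpus the same cut-off removes most of the rare word forms: the log goes from 5329509 unique words to a vocabulary of 1539115.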

Build the Arabic Corpus

Download Resources

The Arabic corpus {1.9B words} consists of the following resources:

  • ShamelaLibrary348.7z link {1.15B}
  • UN Arabic corpus mirror1 mirror2 {0.37B}
  • AraCorpus.tar.gz link {0.14B}
  • Arabic Wikipedia Latest Articles Dump link {0.11B}
  • Tashkeela-arabic-diacritized-text-utf8-0.3.zip link {0.07B}
  • Arabic Tweets link {0.03B}
  • watan-2004.7z link {0.01B}
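
As a quick sanity check on the sizes listed above, the per-resource word counts can be summed and compared with the stated 1.9B total. The dictionary below only copies the figures from the list; the variable names are illustrative.

```python
# Sizes in billions of words, copied from the resource list above.
sizes = {
    "ShamelaLibrary348.7z": 1.15,
    "UN Arabic corpus": 0.37,
    "AraCorpus.tar.gz": 0.14,
    "Arabic Wikipedia Latest Articles Dump": 0.11,
    "Tashkeela-arabic-diacritized-text-utf8-0.3.zip": 0.07,
    "Arabic Tweets": 0.03,
    "watan-2004.7z": 0.01,
}

total = sum(sizes.values())
print(f"Total: {total:.2f}B words")

# Share of the corpus contributed by each resource, largest first.
for name, size in sorted(sizes.items(), key=lambda item: -item[1]):
    print(f"{name}: {size / total:.0%}")
```

The listed figures add up to about 1.88B, which rounds to the 1.9B quoted for the whole corpus.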

More resources are listed by Ayman Eddakrouri

Parse and Process

After downloading the resources from the links above, run make_corpus.sh to automate extraction, preprocessing, and formatting, and to generate a single-line file containing the full Arabic corpus. Some of the commands used are discussed in commands.
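
To illustrate what a "single-line file" means here, the sketch below collapses all whitespace, including newlines, into single spaces and joins several texts into one string. It is an assumption about the general shape of that final step, not a transcription of make_corpus.sh, and the function names `to_single_line` and `merge_documents` are hypothetical.

```python
import re

def to_single_line(text):
    """Collapse every run of whitespace (newlines included) into one space."""
    return re.sub(r"\s+", " ", text).strip()

def merge_documents(documents):
    """Join several texts into one newline-free string.

    Hypothetical helper; texts that are empty or whitespace-only are skipped.
    """
    return " ".join(to_single_line(d) for d in documents if d.strip())

# Toy input standing in for already-extracted text files.
docs = ["السطر الأول\nالسطر الثاني", "  نص   آخر \n", "   "]
corpus = merge_documents(docs)

print(corpus)
print(len(corpus.split()), "words on a single line")
```

For a corpus of this size the real pipeline would need to stream files rather than hold them in memory; the in-memory version is kept only for brevity.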

Because of GitHub's file size limits, the corpus files are not included in this repository.

Download Pre-built Arabic Corpus

A zipped tar may be downloaded from archive.org. I welcome mirroring this file.