BUILDING VOCABULARY
Processed 1754541204 tokens.
Counted 5329509 unique words.
Truncating vocabulary at min count 5.
Using vocabulary of size 1539115.
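
The log above reports counting unique words and then discarding those below a minimum count of 5. As a rough illustration of that kind of truncation (not the tool that produced the log), here is a minimal Python sketch. The function name `build_vocabulary`, the whitespace tokenization, and the "keep words seen at least `min_count` times" rule are assumptions made for this example.

```python
from collections import Counter
from typing import Dict, Iterable


def build_vocabulary(tokens: Iterable[str], min_count: int = 5) -> Dict[str, int]:
    """Count tokens and keep only those seen at least `min_count` times.

    Hypothetical helper for illustration; the threshold semantics (>=) are
    an assumption, not taken from the tool that printed the log.
    """
    counts = Counter(tokens)
    kept = {word: count for word, count in counts.items() if count >= min_count}
    # Order by descending frequency, breaking ties alphabetically.
    return dict(sorted(kept.items(), key=lambda item: (-item[1], item[0])))


if __name__ == "__main__":
    sample = ("a " * 6 + "b " * 5 + "c " * 4).split()
    vocabulary = build_vocabulary(sample, min_count=5)
    print(f"Using vocabulary of size {len(vocabulary)}.")
```

For a corpus of this size the tokens would need to be streamed from disk rather than held in a list; `Counter` accepts any iterable, so a generator over the file works with the same function.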

Build the Arabic Corpus

Download Resources

The Arabic corpus {1.9B words} consists of the following resources:

ShamelaLibrary348.7z link {1.15B}
UN Arabic corpus mirror1 mirror2 {0.37B}
AraCorpus.tar.gz link {0.14B}
Arabic Wikipedia Latest Articles Dump link {0.11B}
Tashkeela-arabic-diacritized-text-utf8-0.3.zip link {0.07B}
Arabic Tweets link {0.03B}
watan-2004.7z link {0.01B}

More resources are listed by Ayman Eddakrouri

Parse and Process

After downloading the resources from the links above, run make_corpus.sh to automate extraction, preprocessing, and formatting, and to generate a single-line file containing the full Arabic corpus. Some of the commands used are discussed in commands.
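
The final step described above, merging the cleaned text into one single-line file, can be sketched in Python. This is only an illustration under stated assumptions, not the contents of make_corpus.sh: the function names `normalize_text` and `build_single_line_corpus` are hypothetical, the inputs are assumed to be already-extracted UTF-8 text files, and the only preprocessing shown is whitespace collapsing.

```python
import re
from pathlib import Path
from typing import Iterable

# Hypothetical helpers for illustration; not taken from make_corpus.sh.
_WHITESPACE = re.compile(r"\s+")


def normalize_text(text: str) -> str:
    """Collapse every run of whitespace (including newlines) into one space."""
    return _WHITESPACE.sub(" ", text).strip()


def build_single_line_corpus(sources: Iterable[Path], output: Path) -> int:
    """Write all source files to `output` as one space-separated line.

    Files are streamed line by line so large inputs never sit in memory.
    Returns the number of whitespace-delimited tokens written.
    """
    token_count = 0
    first = True
    with output.open("w", encoding="utf-8") as out:
        for source in sources:
            with source.open(encoding="utf-8") as handle:
                for line in handle:
                    cleaned = normalize_text(line)
                    if not cleaned:
                        continue  # skip blank lines
                    if not first:
                        out.write(" ")
                    out.write(cleaned)
                    first = False
                    token_count += len(cleaned.split(" "))
        out.write("\n")
    return token_count
```

A call such as `build_single_line_corpus(sorted(Path("extracted").glob("*.txt")), Path("corpus.txt"))` would merge every text file in a hypothetical `extracted` directory; the directory and output names are placeholders.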

Because of GitHub's file size limits, the corpus files are not included in this repository.

Download Pre-built Arabic Corpus

A zipped tar may be downloaded from archive.org. I welcome mirroring this file.
