The Vietnamese Corpus Project aims to provide a well-organized collection of Vietnamese text resources covering multiple subject areas. The corpus can be used for natural language processing (NLP), machine translation, text analysis, and other research and applications involving Vietnamese. The documents in the corpus are categorized by subject so that users can easily access and utilize these resources.
The project also integrates a Vietnamese dictionary resource drawn from Wikipedia, so users can look up definitions and background information for Vietnamese vocabulary.
The documents in the corpus are grouped by topic into the following eight categories:
- Chính trị Xã hội (Politics and Society) - Contains 6567 documents covering Vietnamese politics, social phenomena, and related issues.
- Đời sống (Life) - Contains 4195 documents covering daily life, such as family, education, and culture.
- Kinh doanh (Business) - Contains 4276 documents covering business, the economy, and finance.
- Pháp luật (Law) - Contains 6656 documents covering laws, regulations, and judicial cases.
- Sức khỏe (Health) - Contains 4417 documents covering medicine and public health.
- Thế giới (World) - Contains 5716 documents covering international news, global issues, and diplomatic affairs.
- Thể thao (Sports) - Contains 5667 documents covering sports news, event reports, and athlete information.
- Văn hóa (Culture) - Contains 5250 documents covering art, literature, and traditional culture.
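
The category breakdown above can be sketched in Python as follows. This is a minimal illustration, not the project's own loader: it assumes a hypothetical on-disk layout with one folder per category, named exactly as listed, holding UTF-8 `.txt` files. The names `CATEGORY_COUNTS`, `iter_documents`, and `total_documents` are made up for this example; only the category names and counts come from the list above.

```python
from pathlib import Path

# Document counts per category, copied from the category list above.
CATEGORY_COUNTS = {
    "Chính trị Xã hội": 6567,
    "Đời sống": 4195,
    "Kinh doanh": 4276,
    "Pháp luật": 6656,
    "Sức khỏe": 4417,
    "Thế giới": 5716,
    "Thể thao": 5667,
    "Văn hóa": 5250,
}


def iter_documents(corpus_root, category):
    """Yield (path, text) for each .txt file in one category folder.

    Assumed (hypothetical) layout: <corpus_root>/<category>/*.txt,
    with every file encoded as UTF-8.
    """
    if category not in CATEGORY_COUNTS:
        raise ValueError(f"unknown category: {category!r}")
    folder = Path(corpus_root) / category
    # Sort so that iteration order is stable across runs.
    for path in sorted(folder.glob("*.txt")):
        yield path, path.read_text(encoding="utf-8")


def total_documents():
    """Return the number of documents across all listed categories."""
    return sum(CATEGORY_COUNTS.values())
```

Under these assumptions, `iter_documents("corpus", "Thể thao")` would walk the Sports folder lazily, one file at a time, so a whole category never has to be held in memory. If the actual folder names or file extensions differ, only the path construction and the glob pattern need to change.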