DEXTER is a research project designed to discover and extract product specifications on the Web.
This repository provides information to access the DEXTER dataset described in the VLDB 2015 research paper:
DEXTER: Large-Scale Discovery and Extraction of Product Specifications on the Web
In this repository you can find the output dataset generated by DEXTER. The dataset is organized as follows:
- Output XML: dump with all the attribute/value pairs collected
- URLs of product pages: list of product URLs built by DEXTER
- Pages dump: a dump of the processed pages
Under output-xml we provide a dump of the specifications of the discovered products. Each file is a compressed (.7z) archive that contains an XML dump with all the discovered products for a specific category.
The XML dump follows this structure:
```xml
<products>
  <product>
    <site>www.amazon.com</site>
    <category>camera</category>
    <url>http://www.amazon.com/...</url>
    <attribute_1>value_attribute_1</attribute_1>
    ...
  </product>
  <product>
    ...
  </product>
  ...
</products>
```
To each product we add three additional attributes: the URL from which the specification was extracted, the category associated with the page, and the website.
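As a minimal sketch of how such a dump can be consumed once an archive has been extracted (the file name camera.xml below is hypothetical), the products can be iterated with a standard XML parser:

```python
import xml.etree.ElementTree as ET

# Hypothetical file name: an XML dump extracted from one of the .7z archives.
tree = ET.parse("camera.xml")

for product in tree.getroot().iter("product"):
    # site, category and url are the three attributes added by DEXTER;
    # every other child element is an extracted attribute/value pair.
    site = product.findtext("site")
    url = product.findtext("url")
    specs = {
        child.tag: child.text
        for child in product
        if child.tag not in ("site", "category", "url")
    }
    print(site, url, len(specs), "attribute/value pairs")
```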
The dataset also includes the HTML pages collected by our focused crawler. The pages are organized under the bucket dexter-pages in the following folders:
- data
- dexter_sources
- dataset_local_categories.json
Under /data/*
The folder is organized into subfolders, one for each crawled website. Pages of a given website are stored as .gz files: each page is dumped with an incremental file name <i>.txt.gz, and the mapping between dumped files and original URLs is kept in an index.txt file.
Each line of index.txt stores a tab-separated pair: the dumped file name <i>.txt and the original page URL <file_url>.
An example is:
```
1.txt	http://www.sample_website.com/productAAAA
2.txt	http://www.sample_website.com/productBBBB
3.txt	http://www.sample_website.com/productCCCC
```
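As an illustration of how the index and the page dumps fit together, a sketch in Python (the subfolder name below is hypothetical):

```python
import gzip
import os

site_dir = "data/www.sample_website.com"  # hypothetical per-website subfolder

# Build the mapping from dumped file name to original URL using index.txt.
url_of = {}
with open(os.path.join(site_dir, "index.txt")) as index:
    for line in index:
        name, url = line.rstrip("\n").split("\t")
        url_of[name] = url

# The entry "1.txt" in the index corresponds to the file 1.txt.gz on disk.
with gzip.open(os.path.join(site_dir, "1.txt.gz"), "rt", errors="replace") as page:
    html = page.read()

print(url_of["1.txt"], len(html), "characters")
```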
Under /dexter_sources/*
We also provide the output of the DEXTER classification. Page URLs are grouped into sources (a <category, website> pair); the folder contains a single JSON file for each source classified by DEXTER.
Files are named with the following pattern: <category>_<site>.json
Each file contains, for the corresponding website, a map with the following information (see the sketch after this list):
- "<website_name>": list of page URLs
- "entry_page": list of category entry pages
- "pages_number": number of pages
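For illustration, one of these source files could be loaded as follows (the file and site names are hypothetical, following the naming pattern above):

```python
import json

# Hypothetical file following the <category>_<site>.json pattern.
with open("dexter_sources/camera_www.sample_website.com.json") as f:
    source = json.load(f)

# Keys as described above: the website name maps to its page URLs,
# alongside the category entry pages and the page count.
page_urls = source["www.sample_website.com"]
print(len(page_urls), "classified URLs")
print("entry pages:", source["entry_page"])
print("pages_number:", source["pages_number"])
```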
Under dataset_local_categories.json
We provide the local categories crawled directly from the discovered websites. The file is a nested JSON object organized as follows:
{ "site1": { "category_1": [ url1, url2, ... ] ... }, "site2": { ... }