This repository has been archived by the owner on Mar 3, 2020. It is now read-only.

Update README.md
ficolo authored Aug 23, 2017
1 parent 8e7494b commit 58c52aa
Showing 1 changed file with 2 additions and 0 deletions.
2 changes: 2 additions & 0 deletions README.md
@@ -68,3 +68,5 @@ Then go to http://localhost:9200/nutch/_search?pretty=true&q=*:*&size=100

The document body contains several fields taken from the data crawled by Nutch, such as the plain-text content (content), the crawl timestamp (tstamp), the source URL (id) and the page title (title), among others. The 'bioschemas' field holds a JSON string with the result of the microdata extraction. Once parsed, this result is a JSON object in which each field is named after one of the item types found in the extracted microdata; in this example, "BreadCrumbList" and "Event". Each of those fields holds a JSON array with the JSON object representation of the collected items.
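
Because 'bioschemas' is stored as a JSON string rather than as a nested object, it has to be decoded a second time after the search response itself has been parsed. The following Python sketch illustrates this on a single hit. The `sample_hit` data, the `name` property inside each item, and the helper names `items_by_type` and `count_items` are assumptions made for illustration; real hits will contain whatever Nutch and the microdata extractor produced.

```python
import json

# Hypothetical hit, shaped like one entry of response["hits"]["hits"]
# returned by the _search request above. The values are made up.
sample_hit = {
    "_source": {
        "id": "http://example.org/events/1",
        "title": "Example event page",
        # 'bioschemas' is a JSON *string*, so it is encoded here with dumps.
        "bioschemas": json.dumps({
            "BreadCrumbList": [{"name": "Home"}],
            "Event": [{"name": "Workshop A"}, {"name": "Workshop B"}],
        }),
    }
}


def items_by_type(hit):
    """Decode the 'bioschemas' string into a dict: item type -> list of items."""
    raw = hit.get("_source", {}).get("bioschemas")
    if not raw:
        # The document has no extracted microdata.
        return {}
    return json.loads(raw)


def count_items(hit):
    """Return how many items were collected for each item type."""
    return {item_type: len(items) for item_type, items in items_by_type(hit).items()}


if __name__ == "__main__":
    print(count_items(sample_hit))
```

The same `items_by_type` helper could be applied to every hit of a real response once it has been fetched and parsed with any HTTP client.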

## Apache Nutch documentation
For more information on how to crawl websites with Apache Nutch, see the [Nutch tutorial](https://wiki.apache.org/nutch/NutchTutorial).
