From 58c52aa6bb91e41fd6da3e37ec11d91740f834e7 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Federico=20L=C3=B3pez=20G=C3=B3mez?=
Date: Wed, 23 Aug 2017 02:16:54 -0500
Subject: [PATCH] Update README.md

---
 README.md | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/README.md b/README.md
index 94623f8..906f55a 100644
--- a/README.md
+++ b/README.md
@@ -68,3 +68,5 @@ Then go to http://localhost:9200/nutch/_search?pretty=true&q=*:*&size=100
 The documents body have several fields coming from the data crawled by Nutch, such as the plain text content (content), the crawling time stamp (tstamp), the source url (id) and the page title (title) among others.
 
 Inside the 'bioschemas' field you will find a JSON String containing the JSON representation of the microdata extraction result. This result is a JSON object, each field have the name of one item type coming from the extracted microdata, in this example we have "BreadCrumbList" and "Event". In those fields you will find JSON arrays with the JSON Object representation of the collected items.
+## Apache Nutch documentation
+Please find more information about how to use Apache Nutch in order to crawl websites [here](https://wiki.apache.org/nutch/NutchTutorial)