Note: This project is an open-source externalization of iberobyte's past crawling strategy.
Web spider that crawls through a list of website sources, extracts relevant articles, and indexes the entries by execution date for long-term storage and web consumption.
- Scala 2.12.10
- Mill
mill compile
mill assembly
java -cp out/spider/assembly/dest/out.jar spider.Main > out.log
Alternatively, simply execute ./run.sh in your terminal. The output will be stored in an article_${unixtime}.json file.
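As a minimal sketch, the output filename convention above could be derived like this (the `OutputName` helper is hypothetical and only illustrates the naming scheme; the real project may build the name elsewhere):

```scala
object OutputName {
  // Hypothetical helper illustrating the article_${unixtime}.json naming
  // convention described above; not part of the actual codebase.
  def fileName(unixTimeSeconds: Long): String =
    s"article_${unixTimeSeconds}.json"
}
```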
XSpider uses code generation techniques to accelerate development. Specifically, the script command ./sc delegates control to other script commands.
For example, ./sc scraper ScraperElNuevoDia will invoke the ./scraper command with ScraperElNuevoDia as its argument. The output is a set of files generated based on your input.
In this case, for example, the new files:
- spider/scraper/ScraperElNuevoDia.scala
- spider/scraper/ScraperElNuevoDiaTests.scala
- spider/test/resources/ElNuevoDia.html
will be created. Each generated file contains comments indicating where to make modifications. For example:
/**
 * This file is partially generated. Only make modifications between
 * BEGIN MANUAL SECTION and END MANUAL SECTION designators.
 *
 * This file was generated by the ./sc scraper script command.
 */
package scraper

import collection.JavaConverters._
import org.jsoup.nodes.Document
import org.jsoup.nodes.Element

import spider.CountryType._
import crawler.CrawlerTypes.URL
import crawler.Article

object ScraperElNuevoDia extends TScraper {
  def apply(
      siteURL: URL,
      country: CountryType,
      html: String
  ): Seq[Article] = {
    val doc = parseHTMLDocument(html)
    /** BEGIN MANUAL SECTION */
    /** END MANUAL SECTION */
  }
}
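To illustrate how the manual section might be filled in, here is a self-contained sketch using jsoup. The CSS selector, the `Article` fields, and the `ScraperSketch` object are all assumptions for illustration; the real crawler.Article and the target site's markup will differ.

```scala
import scala.collection.JavaConverters._
import org.jsoup.Jsoup

// Stand-in for crawler.Article; the real case class likely has more fields.
case class Article(title: String, url: String)

object ScraperSketch {
  // Extracts article links from an HTML page. The "article h2 a" selector
  // is a placeholder; a real scraper would target the site's actual markup.
  def extract(html: String, baseURL: String): Seq[Article] = {
    val doc = Jsoup.parse(html, baseURL)
    doc.select("article h2 a").asScala.toSeq.map { link =>
      Article(link.text.trim, link.absUrl("href"))
    }
  }
}
```

A generated test such as ScraperElNuevoDiaTests.scala would typically feed the fixture HTML from spider/test/resources/ElNuevoDia.html into the scraper and assert on the extracted titles and URLs.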
Developers can quickly build new scrapers using this framework.
mill spider.test
MIT