Note: This project is an open-source externalization of iberobyte's past crawling strategy.
Web spider that crawls through a list of website sources, extracts relevant articles, and indexes the entries by execution date for long-term storage and web consumption.
- Scala 2.12.10
- Mill
mill compile
mill assembly
java -cp out/spider/assembly/dest/out.jar spider.Main > out.log
Alternatively, simply execute ./run.sh in your terminal. The output will be stored in an article_${unixtime}.json file.
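As a minimal sketch, the output filename convention above could be derived like this (the `OutputName` helper is hypothetical and only illustrates the naming scheme; the real project may build the name elsewhere):

```scala
object OutputName {
  // Hypothetical helper illustrating the article_${unixtime}.json naming
  // convention described above; not part of the actual codebase.
  def fileName(unixTimeSeconds: Long): String =
    s"article_${unixTimeSeconds}.json"
}
```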
XSpider uses code generation techniques to accelerate development. Specifically, the script command ./sc delegates control to other script commands.
For example, ./sc scraper ScraperElNuevoDia will invoke the ./scraper command with ScraperElNuevoDia as its argument. The output is a set of files generated based on your input.
In this case, for example, the new files:
- spider/scraper/ScraperElNuevoDia.scala
- spider/scraper/ScraperElNuevoDiaTests.scala
- spider/test/resources/ElNuevoDia.html
will be created. Each generated file contains comments indicating where to make modifications. For example:
/**
 * This file is partially generated. Only make modifications between
 * BEGIN MANUAL SECTION and END MANUAL SECTION designators.
 *
 * This file was generated by the ./sc scraper script command.
 */
package scraper

import collection.JavaConverters._
import org.jsoup.nodes.Document
import org.jsoup.nodes.Element

import spider.CountryType._
import crawler.CrawlerTypes.URL
import crawler.Article

object ScraperElNuevoDia extends TScraper {
  def apply(
      siteURL: URL,
      country: CountryType,
      html: String
  ): Seq[Article] = {
    val doc = parseHTMLDocument(html)
    /** BEGIN MANUAL SECTION */
    /** END MANUAL SECTION */
  }
}
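To illustrate how the manual section might be filled in, here is a self-contained sketch using jsoup. The CSS selector, the `Article` fields, and the `ScraperSketch` object are all assumptions for illustration; the real crawler.Article and the target site's markup will differ.

```scala
import scala.collection.JavaConverters._
import org.jsoup.Jsoup

// Stand-in for crawler.Article; the real case class likely has more fields.
case class Article(title: String, url: String)

object ScraperSketch {
  // Extracts article links from an HTML page. The "article h2 a" selector
  // is a placeholder; a real scraper would target the site's actual markup.
  def extract(html: String, baseURL: String): Seq[Article] = {
    val doc = Jsoup.parse(html, baseURL)
    doc.select("article h2 a").asScala.toSeq.map { link =>
      Article(link.text.trim, link.absUrl("href"))
    }
  }
}
```

A generated test such as ScraperElNuevoDiaTests.scala would typically feed the fixture HTML from spider/test/resources/ElNuevoDia.html into the scraper and assert on the extracted titles and URLs.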
Developers can quickly build new scrapers using this framework.
mill spider.test
MIT