Parsing gmane with factor part 2

Part 2 of the Parsing Gmane with Factor tutorial.

The previous part left us with a mail database model and some simple functionality to generate data. That was enough to demonstrate how the database should work, but to make it more interesting we need to get some real data.

The Real Data

The real data is archived mails like this one. The task is to get them into the database, an activity commonly known as scraping. Note how the different parts of that page correspond to different columns in our mail table:

3120                      -> mid
comp.lang.factor.general  -> group
2009-04-28                -> date
Caesar Hu <hupeishun@...> -> sender
XIM patch                 -> subject
Hi, all I can't input...  -> body

On the left are the values to extract, on the right the names of the columns to put them in. Scraping is all about taking the source page and emitting these neat little chunks of data for further processing.
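
In other words, each scraped page should end up as one row of the mail table. As a reminder, here is a rough sketch of that shape as a tuple, with slot names taken from the column list above; the exact definition in part 1 may differ:

! Rough sketch: slot names assumed from the column list above;
! part 1's actual mail tuple may be defined differently.
TUPLE: mail mid group date sender subject body ;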

Extracting the mid and the group is trivial: both appear in the URL of the page. The date, sender and subject can be extracted using regular expressions (a sketch follows after the next example). But getting the body of the message from this page is more troublesome, because it is html formatted. It doesn't read well unless it is rendered:

<div><p>Hi, all<br><br>I can't input Chinese char in factor
listener,&nbsp; so I do some patch in xim.factor to allow me
input chinese,<br>the patc....
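
Before tackling the body, here is a minimal sketch of how the simpler fields might be pulled out. The URL layout, the patterns and the vocabulary name are assumptions for illustration; the tutorial's real words may look quite different:

USING: regexp sequences splitting ;
IN: gmane.sketch

! Assumes the URL ends in .../group/mid, as Gmane article
! URLs do.
: url>group/mid ( url -- group mid )
    "/" split harvest 2 tail* first2 ;

! first-match returns the matching slice, or f if there is
! no match.
: extract-date ( page -- date/f )
    R/ \d{4}-\d{2}-\d{2}/ first-match ;

Under that assumption, feeding the page's URL to url>group/mid leaves the group name and the mid on the stack, and extract-date picks the first ISO-style date out of the page source.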

Usually, when we talk about rendering, the output is a 2d image. But it is equally valid to render to a character array, the way lynx and other text browsers do. Then the above html could be rendered as:

Hi, all

I can't input Chinese char in factor listener,  so I do
some patch in xim.factor to allow me input chinese,
the patch include :
1. Input char from xim server corrently....

So what we will implement is a renderer, or converter, that takes an html mail as input and outputs a string containing a text version of that mail.
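
To make the goal concrete, here is a naive sketch of such a converter built from nothing but regular expression surgery. The word and vocabulary names are made up, and this is emphatically not the tutorial's approach: it handles no entities other than &nbsp; and ignores nesting and wrapping, which is exactly why the real problem needs to be broken up:

USING: regexp splitting ;
IN: gmane.sketch

! Turn <br> tags into newlines before stripping the markup.
: br>newlines ( html -- html' )
    R/ <br[^>]*>/ "\n" re-replace ;

! Drop every remaining tag.
: strip-tags ( html -- text )
    R/ <[^>]+>/ "" re-replace ;

: html>text ( html -- text )
    br>newlines strip-tags "&nbsp;" " " replace ;

Running the html snippet above through html>text gives roughly the rendered text shown earlier; the real implementation has to produce the same kind of output while handling markup and entities properly.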

Since that is a pretty difficult problem, we will divide it into smaller subtasks and create one vocabulary for each. See the wiki pages below for how those vocabularies should be created.