Parsing Gmane with Factor Part 5

Part 5 of the Parsing Gmane with Factor tutorial.

When coding the other vocabularies, the goal has mostly been to write robust, succinct and easily reusable code. But now it is time to write the words that will become the user interface, so the goal changes and we will prioritize user friendliness instead. The focus will be on providing feedback in the form of useful error messages when stuff goes wrong and intuitive to use words that will act as commands.

This part of the application will likely not be very reusable in other projects due to these different constraints. That is ok, because an application that is usable is much better than just having a bunch of reusable code.

Begin with creating a skeleton vocabulary gmane.console on the path gmane/console/console.factor:

USING:
    fill in here ;
IN: gmane.console

: init ( -- )
    [ mail recreate-table ] with-mydb
    "Database created." print ;

The init word just sets up the SQLite database. This vocabulary will tie together all other vocabularies we've created into a neat interface.

User Friendly Scraping Interface

Let's start on the listener and import some mails from the Factor mailing list into our database:

IN: scratchpad 10 [1,b] [ "comp.lang.factor.general" parse-mail insert-mail ] map

--- Data stack:
{ 1862 1863 1864 1865 1866 1867 1868 1869 1870 1871 }

This approach works, but is flawed in a few ways. First, it is too much to type to be user friendly. Second, it does not provide any feedback -- the cursor just sits there for a few seconds until the code completes. Third, it does not handle duplicates so you might get several copies of the same email inserted in the database. And fourth, there is no error handling. We will fix each of these problems one at a time.

Importing Without Duplicates

Let's see what we can do about the duplicates. Note that the example code above always generates the same sequence of mids from 1 to 10 which leads to the same mails being scraped. Instead we want to express "import the next N mails not yet imported in this newsgroup" using the mail-exists? predicate introduced in part 2.

:: next-mids ( n group -- seq )
    n 1 lfrom [ group mail-exists? not ] lfilter ltake list>array ;

The expression 1 lfrom creates an infinite list of natural numbers. The list is then filtered so that only those numbers representing mids not found in the specified group are kept. Finally ltake takes the n first numbers from the list, passes it to list>array that materializes the list and returns it.

Using next-mids we can rewrite our initial mail import expression as:

IN: scratchpad 10 "comp.lang.factor.general" [ next-mids ] keep '[ _ parse-mail insert-mail ] map

--- Data stack:
{ 2072 2073 2074 2075 2076 2077 2078 2079 2080 2081 }

Repeated invocations will import the next ten mails in comp.lang.factor.general avoiding already imported ones.

Lazy Lists

There are a few peculiarities in this code worthy of mention. First is the infinite list of natural numbers. Obviously an infinitely large list is to large to be kept in computer memory. So how does Factor do it? It fakes it using the lazy.lists vocabulary which provides lists that aren't lists at all but tuples that describe how successive elements are constructed. These lazy lists can often be used when, in an imperative language, a while-loop would have been used. Here is how you could have written next-mids in Python:

def next_mids(n, group):
    r = []
    cnt = 1
    while len(r) < n:
        if not mail_exists(cnt, group):
            r.append(cnt)
        cnt += 1
    return r

I know you can solve the problem in a functional approach in Python too, the above example is just for illustration. :)

Local Variables

The other oddness is the double-colon declaration ::. It is actually a parsing word in the locals vocabulary that modifies how word definitions are compiled enabling you to use local variables. What's up with that? Shouldn't Factor's concatenative nature obviate the need for named variables?

Well, yes, and no. You can write Factor code without ever having to write local variables, but the problem is the closures.

As shown earlier, binding values "from the outside" into quotations is tricky. Fry expressions help you "put holes" in the quotations using _ and then filling those holes with values from the stack. But sometimes that leads to too much stack shuffling work so local variables becomes a better fit. next-mids implemented using a fry would be:

: next-mids ( n group -- seq )
    '[ _ mail-exists? not ] 1 lfrom swap lfilter ltake list>array ;

Calling swap is needed to put the arguments to lfilter in the right order on the stack. Admittedly, the difference between the two approaches are small and it's (to me) a tossup which version is the most readable. Generally the more closures you have, and the more values you are closing on, the more sense it makes to use local variables.

Alternative Implementations

next-mids is not efficient when the number of mails in a newsgroup is large. Say we already have imported 100k mails (mid 1 - 100000) in the newsgroup "foo.bar.boo," then next-mids needs to perform 100k + n expensive calls to mail-exists? to get the next n mids to import. The algorithm performs according to O(n) on the number of existing mails.

There are many possible ways to solve this problem. One is to check the largest mid in a newsgroup and then take the n following to import. But that approach will fail if there is any gaps in range like if all mails in the range 1-1000 was imported but mails 20-40 was missing for some reason. A second idea is to write an advanced SQL expression to retrieve the next mids. That approach's problem is that SQLite doesn't support it. A third one is to use one select query to fetch all mids in a newsgroup in memory and figure out the next n ones from that. That approach would still be O(n), but checking if something is present in a memory structure is orders of magnitude than checking it in a database.

I'll leave implementing a more efficient version of next-mids up as an exercise for the reader. :) But keep in mind that, even if the huge number of database queries is a drag on performance, most of the wall-clock execution time will be wasted sending and waiting on http requests sent to Gmane due to network lag.

Progress Updates and Error Reporting

With the mail import working satisfactory our next task is to polish it for the user. Here is how:

: print-row ( seq -- )
    1array simple-table. flush ;

: import-one ( mid group -- )
    2dup parse-mail
    [ dup insert-mail >>id mail-format swap table-row print-row 2drop ]
    [ "Mail %d:%s does not exist!\n" printf flush ]
    if* ;

: import-many ( n group -- )
    mail-format table-header print-row
    [ next-mids reverse ] keep '[ _ import-one ] each ;

Here we are making good use of the table printing functions that was introduced in part 3. Running it produces output like this:

IN: scratchpad 4 "comp.lang.factor.general" import-many 
id     mid   group           date       sender          subject                                           
  2100  1426 comp.lang.facto 2008-05-06 Phil Dawes <phi Re: Bug in CSV parser                             
  2101  1425 comp.lang.facto 2008-05-06 Stefan Scholl < Re: Bug in CSV parser                             
  2102  1424 comp.lang.facto 2008-05-06 Phil Dawes <phi Re: Bug in CSV parser                             
  2103  1423 comp.lang.facto 2008-05-05 Slava Pestov <s Bug in CSV parser

Each time a mail is imported a new table row is written so that the user can see that "something happens" and not get the impression that the program has hung. You may also want to test using a non-existent newsgroup name as argument. The word should display errors to let the user know the newsgroup can't be parsed.

Improving User Friendliness

The last thing we will do before considering the import machinery done is to add two state variables to our program. One which represents the newsgroup the user has selected and the other the number of mails to import per batch. That way the user does not have to specify the group name and message count each time the import word is called.

Factor has a vocabulary called namespaces which is exactly the right tool for the job. It allows you to maintain a hash of mutable variables, one per vocabulary, and can be manipulated using the get and set words.

Below is the import word that wraps import-many and uses the variables pagesize and group:

SYMBOL: pagesize
SYMBOL: group

: get-pagesize ( -- n )
    pagesize get [ 25 pagesize set get-pagesize ] unless* ;

: set-pagesize ( n -- )
    dup 1 100 between?
    [ dup pagesize set "Page size set to %d." sprintf ]
    [ "%d not in range 1-100." sprintf ]
    if print ;

: get-group ( -- group )
    group get [ "comp.lang.factor.general" group set get-group ] unless* ;

: set-group ( group -- )
    dup group set "Group set to %s.\n" printf ;

: import ( -- )
    get-pagesize get-group import-many ;

The accessor functions get-pagsize and set-pagesize wraps the pagesize variable so that if it has not been set by the user, the default 25 will be returned. get-group uses the default "comp.lang.factor.general" in exactly the same way. set-pagesize also limits the allowed range of values to 1-100. It's an entirely arbitrary upper bound and its only purpose is to limit how fast a user can scrape gmane. Always have a way to limit yourself when scraping -- you don't want to get ip-banned by sending millions of requests to a single site.

Now you can write:

IN: scratchpad 23 set-pagesize
Page size set to 23.
IN: scratchpad "comp.lang.factor.general" set-group
Group set to comp.lang.factor.general.
IN: scratchpad import
...

And it will be equivalent to:

IN: scratchpad 23 "comp.lang.factor.general" import-many 
...

Words For Reading The Mails

Here is the definition for recent:

: recent ( -- )
    [
        <query> T{ mail } get-group >>group >>tuple
        "date desc" >>order get-pagesize >>limit select-tuples
    ] with-mydb mail-format print-table ;

It is pretty straightforward I think. The quotation roughly maps to the sql query

select * from mail where _group = $0 order by odate desc limit $1

Where $0 is the name of the group and $1 the number of mails to show.

print-table takes care of pretty printing the tuple sequence.

: read-mail ( id -- )
    '[ T{ mail } _ >>id select-tuple ] with-mydb
     [
        dup
        { { "ID" [ id>> ] }
          { "Group" [ group>> ] }
          { "Mid" [ mid>> ] }
          { "Date" [ date>> timestamp>ymdhms ] }
          { "Sender" [ sender>> ] }
          { "Subject" [ subject>> ] }
        } object-table.
        body>>
    ]
    [ "No such mail :-(" ]
    if* print ;

object-table. is the interesting word here. It formats a table of key-values where the value part is aligned to the same column.

Here is an example how a user could use these "interface" words:

IN: scratchpad 5 set-pagesize
Page size set to 5.
IN: "comp.lang.factor.general" set-group
id     mid   group           date       sender          subject                                           
IN: scratchpad import
... output omitted for brevity ...
IN: scratchpad recent
id     mid   group           date       sender          subject                                           
  3154  2506 comp.lang.facto 2009-01-12 Jose A. Ortega  Re: Pull request: fuel vocabulary refactoring     
  3155  2505 comp.lang.facto 2009-01-12 Slava Pestov <s Bug with fuel insert use feature                  
  3156  2504 comp.lang.facto 2009-01-12 Slava Pestov <s Re: Pull request: fuel vocabulary refactoring     
  3157  2503 comp.lang.facto 2009-01-12 Jose A. Ortega  Pull request: fuel vocabulary refactoring         
  3158  2502 comp.lang.facto 2009-01-11 Jose A. Ortega  Re: FUEL listener issue                           
IN: scratchpad 3155 read-mail
ID      3155
Group   comp.lang.factor.general
Mid     2505
Date    2009-01-12 00:39:19
Sender  Slava Pestov <slava@...>
Subject Bug with fuel insert use feature
From: Slava Pestov <slava@...>
Jose,

I noticed today that sometimes the last character in a buffer is not
...

Final Words

Now the program is done. Hope you had fun! The code is only about 250 lines in total but should be enough to teach you how to the basics of Factor. This tutorial has just scratched the surface -- there is many more combinator vocabularies, syntax extensions, macros! and other incredibly powerful features that makes coding in Factor so fun.

To learn more about Factor, you could try and extend this demo program yourself. There is about a million interesting features one could add. Here are some ideas:

Full text searching of newsgroups. Either using a word index or (slower but easier to implement) like queries.
Newsgroup statistics, such as top posters, average posts per day, longest threads and so on.
Scheduling the scraper with the systems cron so that new mails in configured newsgroups are scraped daily automatically.
Color coding. It would pose an interesting challenge making it portable because different systems use different methods for specifying colors.
Showing mails in a threaded view instead of just a flat list.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly