A ruby-marc reader for MARC files in the Aleph sequential format
- Homepage
- Issues
- Documentation
- [Email](mailto:bill at dueber.com)
require 'marc'
require 'marc_alephsequential'
log = GetALogFromSomewhere.new
# reader = MARC::AlephSequential::Reader.new('myfile.seq')
reader = MARC::AlephSequential::Reader.new('myfile.seq.gz') # automatically notice the .gz and behave!
reader.log = log # optional. Set up a logger; otherwise, a default logger will be used
reader.each do |r|
  if MARC::AlephSequential::ErrorRecord === r
    e = r.error
    log.error "Error while parsing record #{e.record_id} at/near #{e.line_number}: #{e.message}"
    next
  end
  doStuffWithTheRecord(r)
end
Aleph sequential is a MARC serialization format that is easily output by Ex Libris' Aleph software. Each MARC record is presented as a series of Unicode text lines, one field per line.
000000228 LDR   L ^^^^^nam^a22002891^^4500
000000228 001   L 000000228
000000228 006   L m^^^^^^^^d^^^^^^^^
000000228 007   L cr^bn^---auaua
000000228 008   L 880715r19691828nyuab^^^^^^^^|00000^eng^^
000000228 010   L $$a68055188
000000228 020   L $$a083711750X
000000228 035   L $$a(RLIN)MIUG0021856-B
000000794 24514 L $$aThe descent of manuscripts.
000000794 60010 L $$aCicero, Marcus Tullius$$xManuscripts.
000000794 60000 L $$aPlato.$$tCritias$$xManuscripts.
Each line has the following format (note: everything must be in UTF-8):
- 9 characters (all digits) for the Aleph record ID
- [space]
- 3 character tag (left-justified / space padded if need be)
- 1 character indicator 1
- 1 character indicator 2
- [space L space], for historical reasons I don't know
- The tag's value, perhaps with internal subfields
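To make the column layout concrete, here is a minimal sketch that pulls one line apart with a regular expression. The `AlephLine` struct, `LINE_PATTERN`, and `parse_aleph_line` are hypothetical names invented for this example; they are not part of marc_alephsequential, and the gem's own parsing may well differ.

```ruby
# Hypothetical helper for illustration; not part of marc_alephsequential.
AlephLine = Struct.new(:id, :tag, :ind1, :ind2, :value)

# 9-digit ID, space, 3-character tag, two indicators, " L ", then the value.
LINE_PATTERN = /\A(\d{9}) (.{3})(.)(.) L (.*)\z/

def parse_aleph_line(line)
  m = LINE_PATTERN.match(line.chomp)
  raise ArgumentError, "not an Aleph sequential line: #{line.inspect}" unless m

  # Tags are space padded to three characters, so strip the padding.
  AlephLine.new(m[1], m[2].strip, m[3], m[4], m[5])
end

field = parse_aleph_line('000000794 24514 L $$aThe descent of manuscripts.')
puts "#{field.id} #{field.tag} [#{field.ind1}#{field.ind2}] #{field.value}"
```

Control fields and the leader carry blank (space) indicators, so the same pattern covers them as long as the fixed-width spacing is intact.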
A record is defined as a set of contiguous lines with the same record ID (i.e., you know you've finished a record when the record ID changes or you hit EOF).
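A rough sketch of that grouping rule, using plain Ruby and `Enumerable#slice_when`. It assumes every line is well formed (no stray newlines), and the sample lines are only for illustration; this is not necessarily how the reader does it internally.

```ruby
lines = [
  '000000228 001   L 000000228',
  '000000228 010   L $$a68055188',
  '000000794 24514 L $$aThe descent of manuscripts.',
  '000000794 60000 L $$aPlato.$$tCritias$$xManuscripts.'
]

# A new record starts wherever the leading nine-character ID changes.
records = lines.slice_when { |prev, cur| prev[0, 9] != cur[0, 9] }.to_a

records.each do |record_lines|
  puts "#{record_lines.first[0, 9]}: #{record_lines.length} lines"
end
```

The same `slice_when` call should also work on the enumerator returned by `File.foreach('myfile.seq')`, which avoids loading the whole file at once.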
The leader and control fields have no internal structure, but spaces in the values are stored as '^' for some reason. (The reader, obviously, changes them back into spaces.)
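The caret-to-space conversion is a one-liner in Ruby. The `control_value` helper below is a hypothetical name used only for this sketch; the real reader already does the conversion for you.

```ruby
# Hypothetical helper; the real reader performs this conversion itself.
def control_value(raw)
  raw.tr('^', ' ')
end

leader = control_value('^^^^^nam^a22002891^^4500')
puts leader.inspect
puts leader.length   # a MARC leader is always 24 characters
```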
For data fields, the subfields are indicated as follows:
- A subfield start marker (let's just say "SSM") matches /\$\$[a-z0-9]/ (e.g., $$a)
- The value string for a data field must start with an SSM
- An SSM marks the start of a subfield (and the end of the previous subfield, if any)
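These rules can be sketched with a single `String#split`. The `SSM` constant and `parse_subfields` method are hypothetical names for this example. Unlike the real reader, this sketch simply raises when the value does not start with an SSM.

```ruby
# Subfield start marker: two dollar signs followed by a one-character code.
SSM = /\$\$([a-z0-9])/

# Hypothetical helper: returns an array of [code, value] pairs.
def parse_subfields(value)
  unless value.match?(/\A\$\$[a-z0-9]/)
    raise ArgumentError, "data field value must start with an SSM: #{value.inspect}"
  end

  # Splitting on a pattern with a capture group keeps the captured codes.
  # The first element is the empty string before the leading SSM; drop it.
  _, *parts = value.split(SSM, -1)
  parts.each_slice(2).to_a
end

parse_subfields('$$aPlato.$$tCritias$$xManuscripts.').each do |code, text|
  puts "#{code}: #{text}"
end
```

Passing `-1` as the limit keeps a trailing empty subfield (for example a bare `$$a`) instead of silently dropping it.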
Actually, it's not all bad; I like it in a lot of ways. A little verbose at times, but easy to read for a human, and easy to write one-off scripts to run through a file and get statistics about use of tags, find a specific record (just match the bib ID at the beginning of the line), etc.
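As an example of the kind of one-off script meant here, this sketch counts how often each tag appears. It uses only the Ruby standard library and relies on the fixed column positions. `tag_counts` is a hypothetical helper, and the sample data stands in for a real file.

```ruby
require 'stringio'

# Hypothetical helper: count tag usage in Aleph sequential input.
def tag_counts(io)
  counts = Hash.new(0)
  io.each_line do |line|
    next unless line.match?(/\A\d{9} /) # skip anything that is not a field line
    counts[line[10, 3]] += 1            # the tag sits at zero-based columns 10-12
  end
  counts
end

sample = StringIO.new(<<~'SEQ')
  000000228 LDR   L ^^^^^nam^a22002891^^4500
  000000228 001   L 000000228
  000000794 24514 L $$aThe descent of manuscripts.
  000000794 60010 L $$aCicero, Marcus Tullius$$xManuscripts.
  000000794 60000 L $$aPlato.$$tCritias$$xManuscripts.
SEQ

counts = tag_counts(sample)
counts.sort.each { |tag, n| puts "#{tag} #{n}" }
```

For a real file, something like `File.open('myfile.seq') { |f| tag_counts(f) }` should do the same job.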
The easy-to-see problems are:
- Fixed field sizes. Aleph has a lot of COBOL underneath, so if your bib IDs don't happen to be nine characters, well, too bad.
- You can't have an embedded '$$' in a data field's value, because it will be interpreted as the start of a new subfield. '$$' isn't super common as a typo, but I've seen it.

The reader deals with malformed input as follows:

- Lines that don't start with a nine-digit ID are assumed to be the continuation of a previous line that was broken by a spurious newline. The newline is removed and the pieces are joined back together. If there is no previous line because it's the first line of the file, the reader throws an error.
- Any completed record that doesn't include a leader (LDR) will throw an error.
- Data field values that don't start with '$$' are logged as an error, and the leading data is assumed to belong in subfield $$a.
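The handling of spurious newlines can be sketched as follows. `rejoin_broken_lines` is a hypothetical helper written for this example. It shows the idea only, not the gem's actual implementation, and it raises a plain `ArgumentError` where the real reader may report the problem differently.

```ruby
# Hypothetical helper: glue lines broken by a stray newline back together.
def rejoin_broken_lines(lines)
  fixed = []
  lines.each do |raw|
    line = raw.chomp
    if line.match?(/\A\d{9} /)
      fixed << line                   # a normal line starting with a record ID
    elsif fixed.empty?
      raise ArgumentError, 'first line does not start with a nine-digit record ID'
    else
      fixed[-1] = fixed[-1] + line    # continuation of the previous line
    end
  end
  fixed
end

broken = [
  "000000794 24514 L $$aThe descent \n",
  "of manuscripts.\n",
  "000000794 60000 L $$aPlato.$$tCritias$$xManuscripts.\n"
]

puts rejoin_broken_lines(broken)
```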
$ gem install marc_alephsequential
Copyright (c) 2013 Bill Dueber
See [LICENSE.txt] for details.