
Data Source Configuration

mshroom edited this page Dec 13, 2024 · 64 revisions

Data Source Settings

Data sources are defined in conf/datasources.ini. Every data source setting belongs to a section that identifies the data source. The section name is used as the "source" parameter in the command line programs and as the prefix for record IDs unless a different idPrefix is defined. The settings for a data source include format, harvesting source, import settings, transformations etc. Typically a source corresponds to a library catalog or an institutional repository, but it could be almost anything as long as the following rules apply:

  • All records are in a single format
  • All records come from a single harvesting source
  • All the other settings also apply to the whole set of records
  • The records don't need to be deduplicated among themselves

Data Source settings are divided into two categories. The first category of settings is used for all data sources, and the second one is specific to OAI-PMH harvesting.
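
As a rough illustration of the section layout described above, a minimal data source could look like the sketch below. The section name "example" and all values are placeholders, not settings from a real installation.

```ini
; Hypothetical data source. The section name doubles as the "source"
; parameter of the command line programs and as the default record ID prefix.
[example]
institution = ExampleLibrary
format = marc
dedup = true
```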

Common Settings

Setting Description
idPrefix By default the section name in datasources.ini is used as an identifier prefix for the institution. idPrefix can be used to override this e.g. in case multiple OAI-PMH sets need to be harvested from the same data source (which requires multiple uniquely named sections in datasources.ini). It is not recommended to map multiple sources to the same idPrefix, and solrIdPrefix is generally preferred since it does not have side effects in RecordManager.
institution The institution code mapped to the data source. Used e.g. to fill an organization field in the Solr index.
recordXPath An XPath expression used when loading records from a file to identify a single record (default is "//record")
oaiIDXPath An XPath expression used when loading records from a file to find the record's OAI ID, if it's present in the file (typically when importing a file containing an OAI-PMH ListRecords response). Relative to recordXPath (e.g. "../../header/identifier").
format Record format in RecordManager (e.g. dc, ead, lido or marc)
preTransformation[] Optional transformations to be applied to files to be imported (just the names of the xsl files, one per line, in the transformations directory, e.g. to strip namespaces)
reParseTransformed Whether to re-parse transformation results (true/false, defaults to false). This may be required in a transformation chain in case a transformation adds new elements by outputting un-encoded text.
recordSplitter Optional XSL transformation used to split records in import or OAI-PMH harvest (just the name of the xsl file in transformations directory). See transformations/EadSplit.xsl for an example of an XSL transformation. Specify only the .xsl file name without path.
recordSplitterClass Optional PHP class used to split records in import or OAI-PMH harvest (just the name of the class including the namespace). See "\RecordManager\Base\Splitter\Ead" for an example implementation of a splitter.
recordSplitterParams[] Optional splitter parameters to customize its behavior. Available parameters depend on the splitter used. See the table below for description of the parameters.
normalization Optional XSL Transformation to be applied to each record. Points to a properties file in transformations directory (enter only the file name, no path). The properties file further defines the actual XSL transformation and any PHP-based helper functions or classes used in the transformation.
solrTransformation XSL Transformation to be used when converting a record for import to Solr. Must be specified if the record driver does not provide a usable toSolrArray method. Points to a properties file in transformations directory.
dedup Whether this data source needs deduplication (true/false, defaults to false). See important notes if you toggle this for an existing data source.
keepMissingHierarchyMembers Whether members of a hierarchical record that are not present in the imported or harvested records are kept instead of deleted (true/false, defaults to false). Normally it is assumed that an imported hierarchical record contains all the child records, and those not present anymore need to be deleted, but if a record hierarchy is imported in multiple parts, this setting can be enabled to keep the previously imported parts intact. The downside is that another way to handle any deletions (e.g. OAI-PMH harvest with the reharvest parameter) is needed.
componentParts How component parts, if any, are handled in the data source during load to Solr. See the table below for possible values.
indexMergedParts Whether to index merged component parts also separately with hidden_component_boolean field set to true. Defaults to true.
indexUnprefixedIds By default all records are indexed with their IDs prefixed with the source or idPrefix (e.g. record '123' from source 'test' becomes 'test.123'). Setting this to true will index record IDs without the source prefix.
solrIdPrefix By default all records are indexed with their IDs prefixed with the source or idPrefix. This setting can be used to modify the prefix used when indexing records in Solr (but has no effect if indexUnprefixedIds above is specified). This allows two separate sources to be combined into one logical set in Solr, but care must be taken to avoid the identifiers from overlapping.
{field}_mapping[,regexp|,regexp-multi] A mapping file in mappings directory to be used to map values of {field} when updating Solr index. Useful for e.g. mapping multiple location codes to one. See below for an explanation of mapping files.
institutionInBuilding How institution is converted to building field. See below for possible values. See also addInstitutionToBuildingBeforeMapping.
addInstitutionToBuildingBeforeMapping Whether the institution conversion above is done before the mapping files are processed. Defaults to false, which means that building field is prefixed with the institution after any mapping files are processed.
extraFields[] An array of static fields to add to each record when sending them to solr. Format is fieldname:value, e.g. extraFields[] = "building:mainLibrary" or extraFields[] = "sector_str_mv:library"
fieldRules[] An array of rules for changing fields before they're sent to Solr. Allows deleting, copying and moving of fields. See below for more information.
driverParams[] An array of driver-specific parameters that control driver behavior. Format is name=value, e.g. driverParams[] = "holdingsInBuilding=true". See below for available driver parameters.
enrichments[] An array of enrichment classes to use for the records, e.g. enrichments[] = "MarcOnkiLightEnrichment"
suppressOnField[field] Filters that allow records to be suppressed based on the Solr field contents (before applying any mappings). The setting may contain multiple values separated with a pipe character, or a regular expression surrounded by slashes. Examples: suppressOnField[building] = "Closed|Process" or suppressOnField[institution] = "/[Ii]naccessible/". Note that the data source needs to be reharvested or renormalized (php manage.php --func=renormalize --source=xyz) for changes to take effect.
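
The sketch below shows how a few of the common settings above might be combined when two sets are harvested from the same provider and indexed as one logical set. The section names, URL, set names and prefix are hypothetical. The url and set settings belong to the OAI-PMH harvesting settings.

```ini
; Two hypothetical sections harvesting different sets from one provider
[example_books]
institution = Example
format = dc
url = "https://repository.example.org/oai"
set = books
; Index both sources under the same prefix in Solr
solrIdPrefix = example
extraFields[] = "sector_str_mv:library"
suppressOnField[building] = "Closed|Process"

[example_theses]
institution = Example
format = dc
url = "https://repository.example.org/oai"
set = theses
solrIdPrefix = example
```

With a shared solrIdPrefix it is up to the administrator to make sure the record identifiers of the two sections cannot overlap.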

Possible Settings for componentParts

Setting Description
as_is No special handling (default)
merge_all Merge all component parts to their host records
merge_non_articles Merge to host record unless article (including e-journal articles)
merge_non_earticles Merge to host record unless e-journal article
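
For instance, a data source whose component parts should be merged into their host records, except for articles, might be configured roughly as follows. This is an illustrative fragment and the choice of values is an assumption.

```ini
; Merge component parts to their host records unless they are articles
componentParts = merge_non_articles
; Do not index the merged parts separately as hidden records
indexMergedParts = false
```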

Possible Settings for institutionInBuilding

Setting Description
default Use institution setting from datasources.ini
"none" No mapping. Note that due to PHP ini file handling, the quotes are required.
driver Use whatever the record driver provided in institution field
source Use source id
institution/source Use institution and source id separated with a slash
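
A short sketch of how these values could appear in datasources.ini. The institution code is a placeholder. Values are quoted here as a precaution, which PHP ini parsing requires at least for "none".

```ini
institution = Example
; Use institution and source id separated with a slash
institutionInBuilding = "institution/source"
; Alternative: disable the conversion (quotes are required)
;institutionInBuilding = "none"
```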

Field processing rules in fieldRules[]

Field processing rules allow deleting, copying and moving of fields before they are sent to Solr. Rules are processed in order before processing mappings and converting hierarchical facets.

The following rules are available:

Rule Description
delete <field> [params] Removes a field if it exists.
copy <from> <to> [params] Copies field <from> to <to>. A default value can be specified in case the <from> field is empty or doesn't exist.
move <from> <to> [params] Moves field <from> to <to>. A default value can be specified in case the <from> field is empty or doesn't exist.

If the target field exists, a single value is converted to an array as necessary, and new values are appended to it.

Params may be a single word for a default value or a set of further expressions:

  • match="value" Only do the operation if the field contents match the given value.
  • match="/regexp/" Only do the operation if the field contents match the given regular expression.
  • match="/regexp/i" Only do the operation if the field contents match the given regular expression (case insensitive).
  • default="value" Use default value if a value does not exist or doesn't match any regular expression.

Examples:

fieldRules[] = "delete collection"
fieldRules[] = "copy building building2_str_mv MAIN"
fieldRules[] = 'copy building building2_str_mv match="/^(MAIN|SUB)$/" default="MAIN"'
fieldRules[] = "move author author2"

Possible Settings for driverParams

Setting Description
accessRestrictions Overrides the access restrictions for records. Default is ''.
mergeTitleValues Lido: Whether repeated appellationValue elements are merged into a single title. Default is true.
mergeTitleSets Lido: Whether titleSet elements are merged into a single title. Default is true.
allowTitleToMatchFormat Lido: Do not replace title with description when title equals format (objectWorkType). Default is false.
holdingsInBuilding Marc: Whether to include holdings locations (852b) in building field. Default is true.
idIn999 Marc: Whether the real record ID is in 999c field (Koha style). Default is false.
003InLinkingID Marc: Whether links from component parts to the host records include 003 field. Default is false.
kohaNormalization Marc: Koha-specific normalization for converting item fields (952) to holdings. Default is false. Note that if enabled, the "building" field in Solr is populated from both 852 and 952 fields unless overridden with the buildingFields param.
almaNormalization Marc: Alma-specific normalization for converting item fields (952) to holdings. Default is false. Note that if enabled, the "building" field in Solr is populated from both 852 and 952 fields unless overridden with the buildingFields param. See https://www.kiwi.fi/display/Finna/Alman+metadatan+julkaisu+OAI-PMH%3Alla+Finnaa+varten for the enrichment settings used in Alma.
subLocationInBuilding Marc: Which subfield from 852, if any, is used as a sub-location code. Since subfield b is normally used for building, this could be set to c to add it too. Default is empty, so no other subfield than b is used. DEPRECATED, use buildingFields instead.
buildingFields Marc: Colon-separated list of fields that are stored in the building field with first subfield code indicating the main location and second the sub-location (e.g. 852bc:952bc). If specified, overrides holdingsInBuilding and subLocationInBuilding. Default is to use field 852b for main location and subfield specified in subLocationInBuilding, if any, for the sublocation.
fullTextXPaths Dc, Qdc: XPath or array of XPaths where to find full text. Example: driverParams[] = "fullTextXPaths = \"//description\""
fullTextUrlXPaths Dc, Qdc: XPath or array of XPaths where to find full text URLs. Example: driverParams[] = "fullTextUrlXPaths = \"//file[@bundle='TEXT']/@href\""
preferredFormatTypes Qdc: Preferred dc.type fields to be used when indexing format. Default is the first field. Example: "preferredFormatTypes=okm,other" will first check dc.type.okm and then dc.type.other, and lastly the first dc.type.
defaultDisplayLanguage Qdc, Lido: Default display language to prefer if there are multiple entries with different languages to choose from.
prependTitleWithSubtitle Ead/Ead3: Whether to add subtitle in front of title. Valid values are false (never), true (always) and children (only for records that have a parent record)
useHILCC Marc: Map Library of Congress call numbers (LCCN) to hierarchical categories as defined in HILCC. The categories are indexed in Solr field category_str_mv for faceting. Note that you will need to copy other-license/mappings/LcCallNumberCategories.php to the mappings directory and make sure you use it in compliance with its license terms.
addIdToHierarchyTitle Ead, Ead3, Marc: Whether to prefix a hierarchy title with record's identifier such as unit id.
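
As an illustration, driver parameters from the table above might be set for a Koha-style MARC source as sketched below. The combination of parameters is an assumption chosen for the example.

```ini
; Marc: store 852 and 952 in the building field, with subfield b as the
; main location and subfield c as the sub-location
driverParams[] = "buildingFields=852bc:952bc"
; Marc: the real record ID is in field 999c (Koha style)
driverParams[] = "idIn999=true"
; Marc: convert item fields (952) to holdings
driverParams[] = "kohaNormalization=true"
```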

Possible Settings for recordSplitterParams

Setting Description
recordSplitterParams[prependParentTitleWithUnitId] = true Ead, Ead3: Prepend the unit id to the parent title (default is false)
recordSplitterParams[nonInheritedFields] = "tag,tag2,tag3" Ead, Ead3: List of elements that should not be inherited by child records
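
Put together with the splitter class mentioned in the common settings, an EAD source might be configured along these lines. The element names in nonInheritedFields are the same placeholders used in the table above, and the format value is an assumption.

```ini
format = ead
recordSplitterClass = "\RecordManager\Base\Splitter\Ead"
; Prepend the unit id to the parent title
recordSplitterParams[prependParentTitleWithUnitId] = true
; Elements that child records should not inherit (placeholder names)
recordSplitterParams[nonInheritedFields] = "tag,tag2,tag3"
```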

Mapping Files

Normal mapping files are simple .ini-style files where on the left side of an equals sign is the original value and on the right side the resulting value. Mappings are case-sensitive, and if multiple values in a multivalued field map to same result, only one is kept. There is a simple example mapping file in the mappings directory.

There are a couple of special mapping strings that can be used to provide default values:

; A default value of xyz is used if none of the other strings match
##default = xyz
; A default for a singlevalued field where no original value exists
##empty = xyz
; A default for a multivalued field where no original value exists
##emptyarray = xyz
; A default for singlevalued field when the field is empty after mapping
##mappedempty = xyz
; A default for multivalued field when the field is empty after mapping
##mappedemptyarray = xyz

The ##mappedempty and ##mappedemptyarray settings can be combined with an empty ##default to make it so that the value is applied only when the field contains a value but does not match any of the mapped values:

main = Main Library
branch = Branch Library
##default =
##mappedemptyarray = Location Unknown

A single value can be mapped to multiple values by appending [] to the key:

single[] = first
single[] = second

It is also possible to use mapping files with regular expressions by adding ,regexp or ,regexp-multi after the mapping file name. With regexp files, the left-hand side is used as a regexp pattern and the right-hand side as the replacement for strings that match the pattern. The expressions are tested one by one. For ,regexp the process ends when a match is found. For ,regexp-multi all matches are returned. Slashes must not be escaped in the pattern. In the replacement, $1 .. $9 can be used to denote a match in the pattern. An example:

; Remove a number from the beginning
\d+(.*) = $1

; Convert a string to hierarchical using the first character as the top level (e.g. h12 becomes h/h12)
(.)(.*) = $1/$1$2
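
In datasources.ini, a regexp mapping file like the one above would then be referenced by appending the suffix to the file name, roughly as follows. The file name building_regexp.map is hypothetical.

```ini
; building_regexp.map is a hypothetical file in the mappings directory
building_mapping = "building_regexp.map,regexp"
```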

It is possible to define an array of files for handling different levels for hierarchical fields by making the setting an array:

building_mapping[] = building.map
building_mapping[] = building_sub.map

This means that for a hierarchical field the first mapping is used for the first level, second for the second level and so on. This requires that the record driver provides the value of a single hierarchical facet as an array. See src/RecordManager/Base/Record/Marc.php::getBuilding() for an example. See also the hierarchical_facets[] setting in general configuration for information on how to specify hierarchical facets.

Note that if you want to replace "any value" with a string as a catch-all using .* (you could use ##default as well), you'll need to anchor it to the beginning to avoid matching multiple times:

^.* = foo

This is the same as:

##default = foo

OAI-PMH Harvesting Specific Settings

Setting Description
url OAI-PMH provider base URL. Note that by default RecordManager does not follow HTTP 302 redirects, so make sure to use https to access a secure server. See also HTTP Settings for more settings to control the HTTP client.
headers Array of HTTP headers to send with each request.
username Optional username for HTTP authentication.
password Optional password for HTTP authentication.
authType Optional authentication type for HTTP authentication. Possible values are basic (default), digest and ntlm.
set Identifier of a set to harvest (normally found in the setSpec tag of an OAI-PMH ListSets response). Omit this setting to harvest all records.
metadataPrefix Format to harvest. The default is oai_dc.
idSearch[], idReplace[] Can be used to manipulate record IDs with regular expressions.
dateGranularity The granularity used by the server for representing dates. This may be "YYYY-MM-DDThh:mm:ssZ", "YYYY-MM-DD" or "auto" (to query the server for details). The default is "auto".
verbose Can be set to true in order to log more detailed output while harvesting; this may be useful for troubleshooting purposes, but it defaults to false.
debuglog Can be set to a file where all the OAI-PMH requests and responses are written. There is also a splitlog.php utility that can be used to split the responses from the debug log so that they can be reloaded with the import program. This is especially useful when testing record splitters.
oaipmhTransformation[] XSL transformations that are applied to OAI-PMH responses before they are processed (just the names of the xsl files, one per line, in the transformations directory, e.g. to strip namespaces).
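
A minimal harvesting configuration using the settings above could look like the following sketch. The section name, institution code, URL and set name are placeholders.

```ini
[example_oai]
institution = Example
format = dc
; Use https directly, since redirects are not followed by default
url = "https://repository.example.org/oai"
; Omit "set" to harvest all records
set = theses
metadataPrefix = oai_dc
dateGranularity = auto
; More detailed logging while troubleshooting
verbose = true
```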

SFX KB Harvest Specific Settings

SFX KB harvest is actually "fetch export files and import them". SFX export files are fetched according to their time stamps and processed in RecordManager.

Setting Description
type Only valid value is sfx. This tells RecordManager to harvest SFX exports via HTTP.
url HTTP address of the export directory on the SFX server
username Optional username for HTTP authentication.
password Optional password for HTTP authentication.
authType Optional authentication type for HTTP authentication. Possible values are basic (default), digest and ntlm.
filePrefix File name prefix used to distinguish the files to be processed from any other export files

The SFX harvest requires that an SFX export be scheduled to run on the SFX server and the results exposed via the proxy Apache on the SFX server. See Harvesting SFX Objects for information on how to set up the SFX side.
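
A sketch of the SFX-specific part of a data source section is shown below. The section name, URL and file prefix are hypothetical, and the common settings such as format are left out.

```ini
[example_sfx]
type = sfx
; Export directory exposed by the SFX server (placeholder address)
url = "https://sfx.example.org/export"
; Only files whose names start with this prefix are processed
filePrefix = "export_"
```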

III Sierra REST API Specific Settings

Setting Description
type Only valid value is sierra. This tells RecordManager to harvest Sierra records using the Sierra REST API.
url Base address of the REST API without the version. E.g. url = https://sandbox.iii.com/iii/sierra-api
sierraApiKey API key for the REST API
sierraApiSecret API secret for the REST API
sierraApiVersion API version to use. Default is 6.
sierraApiEndpoint API endpoint to use. Possible values are bibs for bibliographic records (default) and authorities for authority records.
batchSize Number of records to request in a single response. A higher number means fewer requests, but the Sierra API may choke on too high numbers and never return the results. Default is 100.
suppressedBibCode3 BIB codes to suppress (RecordManager will process these like they were deleted records)
suppressedRecords Whether to request only suppressed or non-suppressed records. Normally not needed as it's important to fetch both and let RecordManager filter the results.
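
The Sierra-specific settings might be combined as in the sketch below. The section name and credentials are placeholders, the URL is the example address from the table, and the remaining values are the documented defaults written out explicitly.

```ini
[example_sierra]
type = sierra
; Base address without the API version
url = "https://sandbox.iii.com/iii/sierra-api"
sierraApiKey = "your-api-key"
sierraApiSecret = "your-api-secret"
sierraApiVersion = 6
sierraApiEndpoint = bibs
batchSize = 100
```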

OAI-PMH Provider Specific Settings

Setting Description
transformation_to_{format} XSL Transformation used to convert records from the original format to the requested format. E.g. if records are stored in MARC format, transformation_to_ese = marc2ese.properties could be used to transform the MARC records to ESE format.
ignoreOaiIdInProvider Instructs the OAI-PMH provider to ignore any existing OAI identifier received from an upstream repository. Useful when there's need to provide new identifiers e.g. when the source repository provided bad identifiers or a record splitter was used to split a single source record to multiple records.