Data Source Configuration
Data sources are defined in conf/datasources.ini. All settings for a data source belong to a section that identifies the data source. The section name is used as the "source" parameter in the command line programs and as the prefix for record IDs unless a different idPrefix is defined. The settings for a data source include format, harvesting source, import settings, transformations etc. Typically a source corresponds to a library catalog or an institutional repository, but it could be almost anything as long as the following rules apply:
- All records are in a single format
- All records come from a single harvesting source
- All the other settings also apply to the whole set of records
- The records don't need to be deduplicated among themselves
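As a rough illustration of the rules above, a minimal section in conf/datasources.ini might look like the following sketch. The section name mylibrary and the institution code are hypothetical; the setting names are the ones described in this document.

```ini
; Hypothetical data source: a single format and a single harvesting source
[mylibrary]
; Institution code, used e.g. to fill an organization field in Solr
institution = MyLibrary
; All records in this source are MARC
format = marc
; This source takes part in deduplication
dedup = true
```

With a section like this, mylibrary would be the value of the "source" parameter in the command line programs, and record IDs would be prefixed with it unless idPrefix is set.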
Data Source settings are divided into two categories. The first category of settings is used for all data sources, and the second one is specific to OAI-PMH harvesting.
Setting | Description |
---|---|
idPrefix | By default the section name in datasources.ini is used as an identifier prefix for the institution. idPrefix can be used to override this e.g. in case multiple OAI-PMH sets need to be harvested from the same data source (which requires multiple uniquely named sections in datasources.ini). It is not recommended to map multiple sources to the same idPrefix, and generally solrIdPrefix is preferred since it does not have side effects in RecordManager. |
institution | The institution code mapped to the data source. Used e.g. to fill an organization field in the Solr index. |
recordXPath | An XPath expression used when loading records from a file to identify a single record (default is "//record") |
oaiIDXPath | An XPath expression used when loading records from a file to find the record's OAI ID, if it's present in the file (typically when importing a file containing an OAI-PMH ListRecords response). Relative to recordXPath (e.g. "../../header/identifier"). |
format | Record format in RecordManager (e.g. dc, ead, lido or marc) |
preTransformation[] | Optional transformations to be applied to files to be imported (just the names of the xsl files, one per line, in the transformations directory, e.g. to strip namespaces) |
reParseTransformed | Whether to re-parse transformation results (true/false, defaults to false). This may be required in a transformation chain in case a transformation adds new elements by outputting un-encoded text. |
recordSplitter | Optional XSL transformation used to split records in import or OAI-PMH harvest (just the name of the xsl file in transformations directory). See transformations/EadSplit.xsl for an example of an XSL transformation. Specify only the .xsl file name without path. |
recordSplitterClass | Optional PHP class used to split records in import or OAI-PMH harvest (just the name of the class including the namespace). See "\RecordManager\Base\Splitter\Ead" for an example implementation of a splitter. |
recordSplitterParams[] | Optional splitter parameters to customize its behavior. Available parameters depend on the splitter used. See the table below for description of the parameters. |
normalization | Optional XSL Transformation to be applied to each record. Points to a properties file in transformations directory (enter only the file name, no path). The properties file further defines the actual XSL transformation and any PHP-based helper functions or classes used in the transformation. |
solrTransformation | XSL Transformation to be used when converting a record for import to Solr. Must be specified if the record driver does not provide a usable toSolrArray method. Points to a properties file in transformations directory. |
dedup | Whether this data source needs deduplication (true/false, defaults to false). See important notes if you toggle this for an existing data source. |
keepMissingHierarchyMembers | Whether members of a hierarchical record that are not present in the imported or harvested records are kept and not deleted (true/false, defaults to false). Normally it is assumed that an imported hierarchical record contains all the child records, and those not present anymore need to be deleted, but if a record hierarchy is imported in multiple parts, this setting can be enabled to keep the previously imported parts intact. The downside is that another way to handle any deletions (e.g. OAI-PMH harvest with the reharvest parameter) is needed. |
componentParts | How component parts, if any, are handled in the data source during load to Solr. See the table below for possible values. |
indexMergedParts | Whether to index merged component parts also separately with hidden_component_boolean field set to true. Defaults to true. |
indexUnprefixedIds | By default all records are indexed with their IDs prefixed with the source or idPrefix (e.g. record '123' from source 'test' becomes 'test.123'). Setting this to true will index record IDs without the source prefix. |
solrIdPrefix | By default all records are indexed with their IDs prefixed with the source or idPrefix. This setting can be used to modify the prefix used when indexing records in Solr (but has no effect if indexUnprefixedIds above is specified). This allows two separate sources to be combined into one logical set in Solr, but care must be taken to avoid the identifiers from overlapping. |
{field}_mapping[,regexp|,regexp-multi] | A mapping file in mappings directory to be used to map values of {field} when updating Solr index. Useful for e.g. mapping multiple location codes to one. See below for an explanation of mapping files. |
institutionInBuilding | How institution is converted to building field. See below for possible values. See also addInstitutionToBuildingBeforeMapping. |
addInstitutionToBuildingBeforeMapping | Whether the institution conversion above is done before the mapping files are processed. Defaults to false, which means that building field is prefixed with the institution after any mapping files are processed. |
extraFields[] | An array of static fields to add to each record when sending them to solr. Format is fieldname:value, e.g. extraFields[] = "building:mainLibrary" or extraFields[] = "sector_str_mv:library" |
fieldRules[] | An array of rules for changing fields before they're sent to Solr. Allows deleting, copying and moving of fields. See below for more information. |
driverParams[] | An array of driver-specific parameters that control driver behavior. Format is name=value, e.g. driverParams[] = "holdingsInBuilding=true". See below for available driver parameters. |
enrichments[] | An array of enrichment classes to use for the records, e.g. enrichments[] = "MarcOnkiLightEnrichment" |
suppressOnField[field] | Filters that allow records to be suppressed based on the Solr field contents (before applying any mappings). The setting may contain multiple values separated with a pipe character, or a regular expression surrounded by slashes. Examples: suppressOnField[building] = "Closed|Process" or suppressOnField[institution] = "/[Ii]naccessible/". Note that the data source needs to be reharvested or renormalized (php manage.php --func=renormalize --source=xyz) for changes to take effect. |
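The prefix settings can be easier to follow with a concrete case. The sketch below assumes a hypothetical repository at repository.example.org whose two OAI-PMH sets (theses and articles, also hypothetical) are harvested as separate sources but indexed under a common Solr prefix, in line with the recommendation to prefer solrIdPrefix over a shared idPrefix. The url and set settings are the OAI-PMH harvesting settings described later in this document.

```ini
; Two uniquely named sections for two sets of the same (hypothetical) repository
[repo_theses]
institution = ExampleUniversity
format = dc
url = https://repository.example.org/oai
set = theses
; Index under a shared prefix in Solr only
solrIdPrefix = repo
; Static field added to every record sent to Solr
extraFields[] = "sector_str_mv:library"

[repo_articles]
institution = ExampleUniversity
format = dc
url = https://repository.example.org/oai
set = articles
solrIdPrefix = repo
; Suppress records whose building field is Closed or Process
suppressOnField[building] = "Closed|Process"
```

As noted for solrIdPrefix, this is only safe if record identifiers in the two sets never overlap.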
Possible values for componentParts:
Value | Description |
---|---|
as_is | No special handling (default) |
merge_all | Merge all component parts to their host records |
merge_non_articles | Merge to host record unless article (including e-journal articles) |
merge_non_earticles | Merge to host record unless e-journal article |
Possible values for institutionInBuilding:
Value | Description |
---|---|
default | Use institution setting from datasources.ini |
"none" | No mapping. Note that due to PHP ini file handling, the quotes are required. |
driver | Use whatever the record driver provided in institution field |
source | Use source id |
institution/source | Use institution and source id separated with a slash |
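A short sketch of how the componentParts and institutionInBuilding values listed above might appear in a data source section. The combination is chosen purely for illustration.

```ini
; Merge component parts into their host records, except articles
componentParts = merge_non_articles
; Do not index the merged parts separately as hidden records
indexMergedParts = false
; The quotes around none are required because of PHP ini file handling
institutionInBuilding = "none"
```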
Field processing rules allow deleting, copying and moving of fields before they are sent to Solr. Rules are processed in order before processing mappings and converting hierarchical facets.
The following rules are available:
Rule | Description |
---|---|
delete <field> [params] | Removes a field if it exists. |
copy <from> <to> [params] | Copies field <from> to <to>. A default value can be specified in case the <from> field is empty or doesn't exist. |
move <from> <to> [params] | Moves field <from> to <to>. A default value can be specified in case the <from> field is empty or doesn't exist. |
If the target field exists, a single value is converted to an array as necessary, and new values are appended to it.
Params may be a single word for a default value or a set of further expressions:
- match="value": Only do the operation if the field contents match the given value.
- match="/regexp/": Only do the operation if the field contents match the given regular expression.
- match="/regexp/i": Only do the operation if the field contents match the given regular expression (case insensitive).
- default="value": Use the default value if a value does not exist or doesn't match any regular expression.
Examples:
fieldRules[] = "delete collection"
fieldRules[] = "copy building building2_str_mv MAIN"
fieldRules[] = 'copy building building2_str_mv match="/^(MAIN|SUB)$/" default="MAIN"'
fieldRules[] = "move author author2"
Available driver parameters:
Setting | Description |
---|---|
accessRestrictions | Overrides the access restrictions for records. Default is ''. |
mergeTitleValues | Lido: Whether repeated appellationValue elements are merged into a single title. Default is true. |
mergeTitleSets | Lido: Whether titleSet elements are merged into a single title. Default is true. |
allowTitleToMatchFormat | Lido: Do not replace title with description when title equals format (objectWorkType). Default is false. |
holdingsInBuilding | Marc: Whether to include holdings locations (852b) in building field. Default is true. |
idIn999 | Marc: Whether the real record ID is in 999c field (Koha style). Default is false. |
003InLinkingID | Marc: Whether links from component parts to the host records include 003 field. Default is false. |
kohaNormalization | Marc: Koha-specific normalization for converting item fields (952) to holdings. Default is false. Note that if enabled, the "building" field in Solr is populated from both 852 and 952 fields unless overridden with the buildingFields param. |
almaNormalization | Marc: Alma-specific normalization for converting item fields (952) to holdings. Default is false. Note that if enabled, the "building" field in Solr is populated from both 852 and 952 fields unless overridden with the buildingFields param. See https://www.kiwi.fi/display/Finna/Alman+metadatan+julkaisu+OAI-PMH%3Alla+Finnaa+varten for the enrichment settings used in Alma. |
subLocationInBuilding | Marc: Which subfield from 852, if any, is used as a sub-location code. Since subfield b is normally used for building, this could be set to c to add it too. Default is empty, so no other subfield than b is used. DEPRECATED, use buildingFields instead. |
buildingFields | Marc: Colon-separated list of fields that are stored in the building field with first subfield code indicating the main location and second the sub-location (e.g. 852bc:952bc). If specified, overrides holdingsInBuilding and subLocationInBuilding. Default is to use field 852b for main location and subfield specified in subLocationInBuilding, if any, for the sublocation. |
fullTextXPaths | Dc, Qdc: XPath or array of XPaths where to find full text. Example: driverParams[] = "fullTextXPaths = \"//description\"" |
fullTextUrlXPaths | Dc, Qdc: XPath or array of XPaths where to find full text URLs. Example: driverParams[] = "fullTextUrlXPaths = \"//file[@bundle='TEXT']/@href\"" |
preferredFormatTypes | Qdc: Preferred dc.type fields to be used when indexing format. Default is the first field. Example: "preferredFormatTypes=okm,other" will first check dc.type.okm and then dc.type.other, and lastly the first dc.type. |
defaultDisplayLanguage | Qdc, Lido: Default display language to prefer if there are multiple entries with different languages to choose from. |
prependTitleWithSubtitle | Ead/Ead3: Whether to add subtitle in front of title. Valid values are false (never), true (always) and children (only for records that have a parent record) |
useHILCC | Marc: Map Library of Congress call numbers (LCCN) to hierarchical categories as defined in HILCC. The categories are indexed in Solr field category_str_mv for faceting. Note that you will need to copy other-license/mappings/LcCallNumberCategories.php to the mappings directory and make sure you use it in compliance with its license terms. |
addIdToHierarchyTitle | Ead, Ead3, Marc: Whether to prefix a hierarchy title with record's identifier such as unit id. |
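Driver parameters are passed through driverParams[] in the data source section. The following sketch shows how a few of the parameters above could be set. The values are illustrative, and in practice only the parameters relevant to the record format of the source would be used.

```ini
; Marc source: populate building from 852 b/c and 952 b/c
driverParams[] = "buildingFields=852bc:952bc"
; Marc source where the real record ID is in field 999c (Koha style)
driverParams[] = "idIn999=true"
; Qdc source: check dc.type.okm, then dc.type.other, then the first dc.type
driverParams[] = "preferredFormatTypes=okm,other"
```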
Available record splitter parameters:
Setting | Description |
---|---|
recordSplitterParams[prependParentTitleWithUnitId] = true | Ead, Ead3: Prepend the unit id to the parent title (default is false) |
recordSplitterParams[nonInheritedFields] = "tag,tag2,tag3" | Ead, Ead3: List of elements that should not be inherited by child records |
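Putting the splitter settings together, an EAD source that uses the PHP splitter class mentioned earlier might be configured roughly as follows. The section name archive is hypothetical, and the element names in nonInheritedFields are placeholders taken from the table above.

```ini
; Hypothetical EAD data source using the Ead splitter class
[archive]
format = ead
recordSplitterClass = "\RecordManager\Base\Splitter\Ead"
; Prepend the unit id to the parent title
recordSplitterParams[prependParentTitleWithUnitId] = true
; Elements that child records should not inherit (placeholder names)
recordSplitterParams[nonInheritedFields] = "tag,tag2,tag3"
```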
Normal mapping files are simple .ini-style files where the original value is on the left side of an equals sign and the resulting value on the right side. Mappings are case-sensitive, and if multiple values in a multivalued field map to the same result, only one is kept. There is a simple example mapping file in the mappings directory.
There are a couple of special mapping strings that can be used to provide default values:
; A default value of xyz is used if none of the other strings match
##default = xyz
; A default for a singlevalued field where no original value exists
##empty = xyz
; A default for a multivalued field where no original value exists
##emptyarray = xyz
; A default for singlevalued field when the field is empty after mapping
##mappedempty = xyz
; A default for multivalued field when the field is empty after mapping
##mappedemptyarray = xyz
The ##mappedempty and ##mappedemptyarray settings can be combined with an empty ##default to make it so that the value is applied only when the field contains a value but does not match any of the mapped values:
main = Main Library
branch = Branch Library
##default =
##mappedemptyarray = Location Unknown
A single value can be mapped to multiple values by appending [] to the key:
single[] = first
single[] = second
It is also possible to use mapping files with regular expressions by adding ,regexp or ,regexp-multi after the mapping file name. With regexp files, the left-hand side is used as a regexp pattern and the right-hand side as the replacement for strings that match the pattern. The expressions are tested one by one. For ,regexp the process ends when a match is found. For ,regexp-multi all matches are returned. Slashes must not be escaped in the pattern. In the replacement, $1 .. $9 can be used to denote a match in the pattern. An example:
; Remove a number from the beginning
\d+(.*) = $1
; Convert a string to hierarchical using the first character as the hierarchy separator (e.g. h12 becomes h/h12)
(.)(.*) = $1/$1$2
It is possible to define an array of files for handling different levels for hierarchical fields by making the setting an array:
building_mapping[] = building.map
building_mapping[] = building_sub.map
This means that for a hierarchical field the first mapping is used for the first level, second for the second level and so on. This requires that the record driver provides the value of a single hierarchical facet as an array. See src/RecordManager/Base/Record/Marc.php::getBuilding() for an example. See also the hierarchical_facets[] setting in general configuration for information on how to specify hierarchical facets.
Note that if you want to replace "any value" with a string as a catch-all using .* (you could use ##default as well), you'll need to anchor it to the beginning to avoid matching multiple times:
^.* = foo
This is the same as:
##default = foo
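To tie the mapping features back to datasources.ini, the sketch below shows how mapping files might be referenced from a data source section. building.map is the file name used in the example above. format.map and topics.map are hypothetical file names, and the format and topic_facet field names are assumptions chosen for illustration.

```ini
; Plain mapping file for the building field
building_mapping = building.map
; Regular expression mapping: processing ends at the first matching pattern
format_mapping = format.map,regexp
; Regular expression mapping: all matches are returned
topic_facet_mapping = topics.map,regexp-multi
```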
OAI-PMH harvesting settings:
Setting | Description |
---|---|
url | OAI-PMH provider base URL. Note that by default RecordManager does not follow HTTP 302 redirects, so make sure to use https to access a secure server. See also HTTP Settings for more settings to control the HTTP client. |
headers | Array of HTTP headers to send with each request. |
username | Optional username for HTTP authentication. |
password | Optional password for HTTP authentication. |
authType | Optional authentication type for HTTP authentication. Possible values are basic (default), digest and ntlm. |
set | Identifier of a set to harvest (normally found in the setSpec tag of an OAI-PMH ListSets response). Omit this setting to harvest all records. |
metadataPrefix | Format to harvest. The default is oai_dc. |
idSearch[], idReplace[] | Can be used together to manipulate record IDs with regular expressions. |
dateGranularity | The granularity used by the server for representing dates. This may be "YYYY-MM-DDThh:mm:ssZ", "YYYY-MM-DD" or "auto" (to query the server for details). The default is "auto". |
verbose | Can be set to true in order to log more detailed output while harvesting; this may be useful for troubleshooting purposes, but it defaults to false. |
debuglog | Can be set to a file where all the OAI-PMH requests and responses are written. There is also a splitlog.php utility that can be used to split the responses from the debug log so that they can be reloaded with the import program. This is especially useful when testing record splitters. |
oaipmhTransformation[] | XSL transformations that are applied to OAI-PMH responses before they are processed (just the names of the xsl files, one per line, in the transformations directory, e.g. to strip namespaces). |
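A harvesting section combining several of the settings above could look like the following sketch. The section name, institution code, provider URL, set name and transformation file name are all hypothetical.

```ini
; Hypothetical OAI-PMH data source
[repo]
institution = ExampleUniversity
format = dc
; Base URL of the provider; https is used directly since redirects are not followed by default
url = https://repository.example.org/oai
; Harvest a single set; omit to harvest all records
set = theses
metadataPrefix = oai_dc
; Ask the server which date granularity it supports
dateGranularity = auto
; More detailed logging while troubleshooting
verbose = true
; Applied to OAI-PMH responses before processing (hypothetical file name)
oaipmhTransformation[] = strip_namespaces.xsl
```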
SFX KB harvest is actually "fetch export files and import them". SFX export files are fetched according to their time stamps and processed in RecordManager.
Setting | Description |
---|---|
type | Only valid value is sfx. This tells RecordManager to harvest SFX exports via HTTP. |
url | HTTP address of the export directory on the SFX server |
username | Optional username for HTTP authentication. |
password | Optional password for HTTP authentication. |
authType | Optional authentication type for HTTP authentication. Possible values are basic (default), digest and ntlm. |
filePrefix | File name prefix used to distinguish the files to be processed from any other export files |
The SFX harvest requires that an SFX export be scheduled to run on the SFX server and the results exposed via the proxy Apache on the SFX server. See [Harvesting SFX Objects](Harvesting SFX Objects) for information on how to set up the SFX side.
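On the RecordManager side, the SFX-specific part of a data source section might look like this sketch. The section name, server address and file prefix are hypothetical, and the general settings such as institution and format are left out.

```ini
; Hypothetical SFX data source (general settings omitted)
[sfx]
type = sfx
; Export directory exposed over HTTP by the SFX server
url = https://sfx.example.org/export/
; Only files whose names start with this prefix are processed
filePrefix = export_
```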
Settings for harvesting from Sierra:
Setting | Description |
---|---|
type | Only valid value is sierra. This tells RecordManager to harvest Sierra records using the Sierra REST API. |
url | Base address of the REST API without the version. E.g. url = https://sandbox.iii.com/iii/sierra-api |
sierraApiKey | API key for the REST API |
sierraApiSecret | API secret for the REST API |
sierraApiVersion | API version to use. Default is 6. |
sierraApiEndpoint | API endpoint to use. Possible values are bibs for bibliographic records (default) and authorities for authority records. |
batchSize | Number of records to request in a single response. A higher number means fewer requests, but the Sierra API may choke on numbers that are too high and never return the results. Default is 100. |
suppressedBibCode3 | BIB codes to suppress (RecordManager will process these like they were deleted records) |
suppressedRecords | Whether to request only suppressed or non-suppressed records. Normally not needed as it's important to fetch both and let RecordManager filter the results. |
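The Sierra settings above might be combined as in the following sketch. The section name and the key and secret values are placeholders, the URL is the example from the table, and the general settings such as institution and format are left out.

```ini
; Hypothetical Sierra data source (general settings omitted)
[sierra]
type = sierra
; Base address of the REST API without the version
url = https://sandbox.iii.com/iii/sierra-api
sierraApiKey = "my-api-key"
sierraApiSecret = "my-api-secret"
; The documented defaults, written out explicitly
sierraApiVersion = 6
sierraApiEndpoint = bibs
batchSize = 100
```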
Settings used by the OAI-PMH provider:
Setting | Description |
---|---|
transformation_to_{format} | XSL Transformation used to convert records from the original format to the requested format. E.g. if records are stored in MARC format, transformation_to_ese = marc2ese.properties could be used to transform the MARC records to ESE format. |
ignoreOaiIdInProvider | Instructs the OAI-PMH provider to ignore any existing OAI identifier received from an upstream repository. Useful when there is a need to provide new identifiers, e.g. when the source repository provided bad identifiers or a record splitter was used to split a single source record into multiple records. |