DRAFT Setup documentation #215

Draft
wants to merge 128 commits into
base: dev
4e34171
moves code out of bin elasticsearch files and into module
jduss4 Jan 9, 2020
172c4aa
removes unnecessary dtd for french 17
jduss4 Jan 9, 2020
4483c44
combines some parameter gathering files
jduss4 Jan 9, 2020
c70169b
updates gems and fixes test suite
jduss4 Jan 9, 2020
2aa5f38
in progress working on validator for es fields)
jduss4 Jan 10, 2020
a5d0047
creates validator for elasticsearch postings
jduss4 Jan 15, 2020
8615211
whoops missed one
jduss4 Jan 17, 2020
9459f9d
refactored validator to handle nested field specific mapping
jduss4 Jan 17, 2020
4235116
get rid of unnecessary variable definitions
jduss4 Jan 17, 2020
be55c60
put data methods in index class
wkdewey May 20, 2022
6bda669
require_relative so that tests can be run from base directory
wkdewey May 20, 2022
9535604
move puts from bin methods into es classes
wkdewey May 20, 2022
5cd5b4d
change get_schema to return rather than puts
wkdewey May 20, 2022
b3da61e
simplify regex
wkdewey May 20, 2022
da6721f
drop unnecessary conditional
wkdewey May 20, 2022
d7ec579
return early if invalid nested field found
wkdewey May 20, 2022
66396ab
change coverage-spatial to spatial
wkdewey May 23, 2022
fefe254
Update CHANGELOG.md
wkdewey May 23, 2022
621193d
moves code out of bin elasticsearch files and into module
jduss4 Jan 9, 2020
59c9dd5
in progress working on validator for es fields)
jduss4 Jan 10, 2020
162f980
creates validator for elasticsearch postings
jduss4 Jan 15, 2020
ef20fe8
add byebug and update gems
wkdewey Nov 3, 2021
1470e12
specify proper api_version, add xsl file for ead
wkdewey Nov 3, 2021
17f3ac6
add ead to format_to_class
wkdewey Nov 3, 2021
e39f423
add date helpers from newer Datura version
wkdewey Nov 3, 2021
95ea454
add byebug to gemspec
wkdewey Nov 3, 2021
ae49d18
add gem
wkdewey Nov 3, 2021
5d56df7
add file_ead class
wkdewey Nov 3, 2021
5acd4f1
add EadToES class
wkdewey Nov 3, 2021
5b67707
add files for EadToEs
wkdewey Nov 3, 2021
023fd82
add EadToEsItems class and associated files
wkdewey Nov 3, 2021
fc6e4f0
add xsl file for ead (not functional yet)
wkdewey Nov 3, 2021
ba17a76
remove gem doc that is messing things up
wkdewey Nov 4, 2021
3523b83
print full error message, not just something went wrong
wkdewey Nov 4, 2021
a6e2a51
fix xpath
wkdewey Nov 4, 2021
0839528
add all require fields, including unfilled ones
wkdewey Nov 4, 2021
3b10607
fix xpaths hash
wkdewey Nov 4, 2021
90739d5
make EadToEsItems a separate class
wkdewey Nov 4, 2021
10543b7
add abstract field and fix bad xpaths
wkdewey Nov 4, 2021
0d3c07a
add a backtrace to error handling
wkdewey Nov 8, 2021
49ba94b
grab 'items' at any nesting of the EAD
wkdewey Nov 8, 2021
8604663
add xpaths and fields, and make sure eadtoesitems inherits from eadtoes
wkdewey Nov 8, 2021
dbb5659
change order of get id to fix bug
wkdewey Nov 8, 2021
f3628b6
add documentation for adding new format
wkdewey Nov 8, 2021
6cbe577
adjust and add fields for items
wkdewey Nov 9, 2021
352efd1
add items to repository xpaths
wkdewey Nov 9, 2021
ef6024e
fix image_url xpath
wkdewey Nov 9, 2021
a7c0c9e
add puts statements for debugging
wkdewey Nov 9, 2021
9bd0e38
try another way to debug
wkdewey Nov 9, 2021
7f1fa42
test for nil specifically
wkdewey Nov 9, 2021
30294ff
add debugging statements to get_schema
wkdewey Nov 9, 2021
f13dcae
try debugging with byebug
wkdewey Nov 9, 2021
a61363a
remove debugging info
wkdewey Nov 9, 2021
5b87459
add alternative field
wkdewey Nov 10, 2021
fcf01b3
add relation field
wkdewey Nov 10, 2021
d2d5939
add spatial field
wkdewey Nov 10, 2021
102a5a8
fix a get_text method
wkdewey Nov 10, 2021
cf85953
change post_es to match jessica's changes
wkdewey Nov 10, 2021
b35aaeb
change CommonXML to Datura helpers
wkdewey Nov 11, 2021
436f85d
change xpaths to be less specific to Walt Whitman
wkdewey Nov 11, 2021
c85b31f
refactor title fields and xpaths
wkdewey Nov 11, 2021
349ff66
add creator override for items so it is an array
wkdewey Dec 10, 2021
e640971
change creators to creator
wkdewey Dec 21, 2021
7286114
add rdf schema
wkdewey Jun 16, 2022
d4f45f0
update schemas to include rdf fields
wkdewey Jun 20, 2022
f78ea73
add rdf to default fields
wkdewey Jun 20, 2022
f5c9d62
add spatial.title field
wkdewey Jun 21, 2022
a2b78f6
require byebug so it is in scope for posting etc.
wkdewey Jul 18, 2022
330fcd6
remove inserted byebug
wkdewey Aug 24, 2022
f7ebd87
require byebug so it is in scope for posting etc.
wkdewey Jul 18, 2022
73e7b19
include full error message with backtrace
wkdewey Aug 8, 2022
4328acc
updates gems and fixes test suite
jduss4 Jan 9, 2020
1660fb1
update gems in preparation for release
wkdewey May 25, 2022
ee596cc
start adding new api fields
wkdewey Aug 9, 2022
2fe4238
update schema to match spreadsheet with new field names
wkdewey Aug 11, 2022
e40a225
assemble json based on api version
wkdewey Aug 15, 2022
bfe1ca6
add overrides for 2.0 fields
wkdewey Aug 15, 2022
9c8bc72
change next and previous fields
wkdewey Aug 15, 2022
9577bf3
add fig_location
wkdewey Aug 15, 2022
3bd2eeb
add abstract
wkdewey Aug 15, 2022
d30f76b
remove split-out assemble_text methods
wkdewey Aug 15, 2022
48e316e
update gems and get rid of merge conflicts
wkdewey Aug 24, 2022
c9d8730
add new fields
wkdewey Aug 24, 2022
96d4199
correct field name
wkdewey Aug 24, 2022
a3bbc14
add fields to ead overrides
wkdewey Aug 24, 2022
75651f7
populate new fields in json
wkdewey Aug 24, 2022
f49f776
resolve merge conflict
wkdewey Aug 24, 2022
91c79d6
add new fields
wkdewey Aug 25, 2022
95c6a29
update fields for related items, dates, order integers
wkdewey Aug 25, 2022
3c1ac53
correct syntax errors
wkdewey Aug 25, 2022
29f9bf2
correct another syntax error
wkdewey Aug 25, 2022
a70d464
change keywords1 to plain keywords
wkdewey Aug 25, 2022
f7aa3db
add more specific message to es validation
wkdewey Aug 30, 2022
ddb6960
remove extra byebug require
wkdewey Sep 7, 2022
b2ffb0c
remove byebug, change error message
wkdewey Sep 7, 2022
8008c9b
update schema under citations
wkdewey Sep 7, 2022
7a7a0b1
require fileutils to avoid errors in setup
wkdewey Sep 16, 2022
e983197
skip title_sort if title is nil
wkdewey Sep 20, 2022
f12291b
return nil instead of empty string, addresses https://github.com/whit…
wkdewey Sep 23, 2022
974ab39
add more nil checks for results of xpath methods
wkdewey Sep 26, 2022
84d1a9e
check the correct xpath fields
wkdewey Sep 26, 2022
ec81ddb
make sure input is in UTF-8
wkdewey Sep 26, 2022
a3d2fbf
make changes for new api schema and revised xpath methods
wkdewey Sep 26, 2022
9a18c1c
add a nil check for creators
wkdewey Sep 26, 2022
31d0a36
Revert "make sure input is in UTF-8"
wkdewey Oct 3, 2022
74cd809
change error handling to avoid method that isn't present
wkdewey Oct 17, 2022
47533f9
make sure person is an array
wkdewey Oct 20, 2022
79d1455
make sure settings hash is what elasticsearch expects
wkdewey Oct 21, 2022
aab8145
change where mappings are posted for es upgrade
wkdewey Oct 21, 2022
daeb437
add headers to ES requests for authorization
wkdewey Oct 25, 2022
04de44b
add method to construct basic auth header from options
wkdewey Oct 25, 2022
c713334
remove debugging code
wkdewey Oct 25, 2022
4da4062
update conditional logic for status code, dynamic_templates key
wkdewey Oct 25, 2022
5bf8ca7
change endpoint for delete_by_query for ES8 compatibility
wkdewey Jan 26, 2023
a9d9fc0
upgrade to Ruby 3.0.4
wkdewey Oct 27, 2022
9c013b8
make keyword arguments compatible with Ruby 3
wkdewey Oct 27, 2022
c70d01a
go up to ruby 3.1.2
wkdewey Oct 27, 2022
a784c1d
add output if nested field is invalid
wkdewey Nov 8, 2022
2ec32e4
don't use array method on person to avoid errors
wkdewey Nov 8, 2022
1ff5b88
update changelog for new version
wkdewey Nov 9, 2022
5bcfeb3
update reference to ruby version
wkdewey Nov 9, 2022
51ff02f
make changes related to ES and API upgrade
wkdewey Nov 10, 2022
bf82b87
add links to more detailed documentation
wkdewey Nov 10, 2022
e2d9d9f
add link to elasticsearch documentation
wkdewey Nov 10, 2022
bda6d7a
add conditional to creator for nil checking
wkdewey Nov 18, 2022
214c777
Create schema_v2.md
karindalziel Nov 10, 2022
5d6d205
make sure webs_to_es fields can handle nil values
wkdewey Jan 25, 2023
4c65fb0
add documentation steps for starting a new data collection
wkdewey Mar 1, 2023
2 changes: 1 addition & 1 deletion .ruby-version
@@ -1 +1 @@
2.7.1
3.1.2
32 changes: 31 additions & 1 deletion CHANGELOG.md
@@ -25,21 +25,47 @@ Versioning](https://semver.org/spec/v2.0.0.html).
### Security
-->

## [Unreleased](https://github.com/CDRH/datura/compare/v0.2.0-beta...dev)
## [1.0.0](https://github.com/CDRH/datura/compare/v0.2.0-beta...dev)

### Added
- minor test for Datura::Helpers.date_standardize
- documentation for web scraping
- documentation for CsvToEs (transforming CSV files and posting to elasticsearch)
- documentation for adding new ingest formats to Datura
- byebug gem for debugging
- instructions for installing JavaScript runtime files for Saxon
- API schema can either be 1.0 or 2.0 (which includes nested fields); 1.0 will be run by default unless 2.0 is specified. Add the following to `public.yml` or `private.yml` in the data repo:
```
api_version: '2.0'
```
See new schema (2.0) documentation [here](https://github.com/CDRH/datura/docs/schema_v2.md)
- schema validation with API version 2.0; invalidly constructed documents will not post
- authentication with Elasticsearch 8.5; add the following to `public.yml` or `private.yml` in the data repo:
```
es_user: username
es_password: ********
```
- field overrides for new fields in the new API schema
- functionality to transform EAD files and post them to elasticsearch
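
The two config additions above (`api_version`, and `es_user` / `es_password`) can be illustrated with a rough sketch. This is not code from this PR: the helper names `api_version` and `es_headers` and the sample password are hypothetical, and the only behavior taken from the changelog is that 1.0 is the default and that credentials come from the config file.

```ruby
require "yaml"

# Hypothetical config text standing in for public.yml / private.yml.
CONFIG_YAML = <<~YAML
  api_version: '2.0'
  es_user: username
  es_password: secret
YAML

# Returns the configured API version, falling back to "1.0" when unset.
def api_version(options)
  (options["api_version"] || "1.0").to_s
end

# Builds request headers, adding HTTP Basic auth only when both
# credentials are present in the options hash.
def es_headers(options)
  headers = { "Content-Type" => "application/json" }
  user = options["es_user"]
  password = options["es_password"]
  if user && password
    # pack("m0") is strict Base64 without line breaks.
    headers["Authorization"] = "Basic " + ["#{user}:#{password}"].pack("m0")
  end
  headers
end

options = YAML.safe_load(CONFIG_YAML)
puts api_version(options)
puts es_headers(options)["Authorization"]
```

Leaving the `Authorization` header out when no credentials are configured keeps the same code path usable against a local Elasticsearch with security disabled.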

### Changed
- update ruby to 3.1.2
- date_standardize now relies on strftime instead of manual zero padding for month, day
- minor corrections to documentation
- XPath: "text" is now ingested as an array and will be displayed delimited by spaces
- refactored command line methods into elasticsearch library
- refactored and moved date_standardize and date_display helper methods
- Nokogiri methods `get_text` and `get_list` on TEI now return nil rather than empty strings or arrays if there are no matches
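
To show what "relies on strftime instead of manual zero padding" can look like, here is a minimal sketch. It is a stand-in, not Datura's actual `date_standardize`: the method name `standardize_date`, the `before:` keyword, and the fill-in rules for missing month and day are assumptions made for illustration.

```ruby
require "date"

# Hypothetical stand-in, not Datura's helper: expands a partial date
# ("1887", "1887-3", "1887-3-9") to YYYY-MM-DD and lets strftime do the
# zero padding.
def standardize_date(raw, before: true)
  return nil if raw.nil? || raw.strip.empty?

  year, month, day = raw.strip.split("-").map { |part| Integer(part, 10) }
  month ||= before ? 1 : 12
  # A day of -1 asks Date for the last day of the month.
  day ||= before ? 1 : -1
  return nil unless Date.valid_date?(year, month, day)

  Date.new(year, month, day).strftime("%Y-%m-%d")
rescue ArgumentError, TypeError
  # Non-numeric or missing parts are treated as an unusable date.
  nil
end

puts standardize_date("1887-3-9")
puts standardize_date("1887", before: false)
```

Returning `nil` for unusable input matches the direction of the `get_text` / `get_list` change noted above, so callers need a nil check either way.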

### Migration
- check that the "text" xpath produces the desired behavior
- use Elasticsearch 8.5 or higher and add authentication as described above if security is enabled. See [dev docs instructions](https://github.com/CDRH/cdrh_dev_docs/blob/update_elasticsearch_documentation/publishing/2_basic_requirements.md#downloading-elasticsearch).
- upgrade data repos to Ruby 3.1.2
- add api version to config as described above
- make sure fields are consistent with the api schema; many have been renamed or changed in format
- add nil checks with get_text and get_list methods
- add EadToES overrides if ingesting EAD files
- if overriding the `read_csv` method in `lib/datura/file_type.rb`, the hash must be prefixed with ** (`**{}`).
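
The `**{}` note in the last migration item follows from Ruby 3 no longer converting a trailing hash into keyword arguments. The sketch below shows the general rule with a hypothetical `read_rows` method; it is not Datura's `read_csv`, and the option names are only illustrative.

```ruby
# Hypothetical reader that takes keyword options, in the same style as
# methods that accept CSV options.
def read_rows(text, headers: false, col_sep: ",")
  rows = text.lines(chomp: true).map { |line| line.split(col_sep) }
  return rows unless headers

  keys = rows.shift
  rows.map { |row| keys.zip(row).to_h }
end

options = { headers: true, col_sep: "," }

# Double splat turns the hash into keyword arguments.
rows = read_rows("id,title\n1,Leaves of Grass\n", **options)
puts rows.first["title"]

# Passing the hash positionally is an error on Ruby 3.
begin
  read_rows("id,title\n", options)
rescue ArgumentError => e
  puts "positional hash rejected: #{e.message}"
end
```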

## [v0.2.0-beta](https://github.com/CDRH/datura/compare/v0.1.6...v0.2.0-beta) - 2020-08-17 - Altering field and xpath behavior, adds get_elements

@@ -50,6 +76,8 @@ Versioning](https://semver.org/spec/v2.0.0.html).
- Tests and fixtures for all supported formats except CustomToEs
- `get_elements` returns nodeset given xpath arguments
- `spatial` nested fields `spatial.type` and `spatial.title`
- Versioning system to support multiple elasticsearch schemas
- Validator to check against the elasticsearch copy

### Changed
- Arguments for `get_text`, `get_list`, and `get_xpaths`
@@ -58,12 +86,14 @@ Versioning](https://semver.org/spec/v2.0.0.html).
- Documentation updated
- Changed Install instructions to include RVM and gemset naming conventions
- API field `coverage_spatial` is now just `spatial`
- refactored executables into modules and classes

### Migration
- Change `coverage_spatial` nested field to `spatial`
- `get_text`, `get_list`, and `get_xpaths` require changing arguments to keyword (like `xml` and `keep_tags`)
- Recommend checking xpaths and behavior of fields after updating to this version, as some defaults have changed
- Possible to refactor previous FileCsv overrides to use new CsvToEs abilities, but not necessary
- Config files should specify `api_version` as 1.0 or 2.0

## [v0.1.6](https://github.com/CDRH/datura/compare/v0.1.5...v0.1.6) - 2020-04-24 - Improvements to CSV, WEBS transformers and adds Custom transformer

1 change: 1 addition & 0 deletions Gemfile
@@ -2,5 +2,6 @@ source "https://rubygems.org"

git_source(:github) {|repo_name| "https://github.com/#{repo_name}" }

gem "byebug"
# Specify your gem's dependencies in datura.gemspec
gemspec
21 changes: 14 additions & 7 deletions Gemfile.lock
@@ -3,44 +3,51 @@ PATH
specs:
datura (0.2.0.pre.beta)
colorize (~> 0.8.1)
nokogiri (~> 1.8)
rest-client (~> 2.0.2)
nokogiri (~> 1.10)
rest-client (~> 2.1)

GEM
remote: https://rubygems.org/
specs:
byebug (11.1.3)
colorize (0.8.1)
domain_name (0.5.20190701)
unf (>= 0.0.5, < 1.0.0)
http-accept (1.7.0)
http-cookie (1.0.5)
domain_name (~> 0.5)
mime-types (3.4.1)
mime-types-data (~> 3.2015)
mime-types-data (3.2022.0105)
mini_portile2 (2.8.0)
minitest (5.15.0)
minitest (5.16.3)
netrc (0.11.0)
nokogiri (1.13.6)
nokogiri (1.13.9)
mini_portile2 (~> 2.8.0)
racc (~> 1.4)
nokogiri (1.13.9-x86_64-darwin)
racc (~> 1.4)
racc (1.6.0)
rake (13.0.6)
rest-client (2.0.2)
rest-client (2.1.0)
http-accept (>= 1.7.0, < 2.0)
http-cookie (>= 1.0.2, < 2.0)
mime-types (>= 1.16, < 4.0)
netrc (~> 0.8)
unf (0.1.4)
unf_ext
unf_ext (0.0.8.1)
unf_ext (0.0.8.2)

PLATFORMS
ruby
x86_64-darwin-20

DEPENDENCIES
bundler (>= 1.16.0, < 3.0)
byebug
datura!
minitest (~> 5.0)
rake (~> 13.0)

BUNDLED WITH
2.1.4
2.2.33
2 changes: 1 addition & 1 deletion README.md
@@ -7,7 +7,7 @@ Looking for information about how to post documents? Check out the

## Install / Set Up Data Repo

Check that Ruby is installed, preferably 2.7.x or up. If you are using RVM, see the RVM section below.
Check that Ruby is installed, preferably 3.1.2 or up. If you are using RVM, see the RVM section below.

If your project already has a Gemfile, add the `gem "datura"` line. If not, create a new directory and add a file named `Gemfile` (no extension).

16 changes: 4 additions & 12 deletions bin/admin_es_create_index
@@ -2,18 +2,10 @@

require "datura"

params = Datura::Parser.es_create_delete_index
options = Datura::Options.new(params).all

put_url = File.join(options["es_path"], "#{options["es_index"]}?pretty=true")
get_url = File.join(options["es_path"], "_cat", "indices?v&pretty=true")

begin
# TODO if we want to add any default settings to the new index,
# we can do that with the payload and then use rest-client again instead of exec
# however, rest-client appears to require a payload and won't allow simple "PUT" with none
puts "Creating new ES index: #{put_url}"
exec("curl -XPUT #{put_url}")
es = Datura::Elasticsearch::Index.new
es.create
es.set_schema
rescue => e
puts "Error: #{e.inspect}"
puts e
end
11 changes: 3 additions & 8 deletions bin/admin_es_delete_index
@@ -1,15 +1,10 @@
#!/usr/bin/env ruby

require "datura"
require "rest-client"

params = Datura::Parser.es_create_delete_index
options = Datura::Options.new(params).all

url = File.join(options["es_path"], "#{options["es_index"]}?pretty=true")

begin
puts JSON.parse(RestClient.delete(url))
es = Datura::Elasticsearch::Index.new
es.delete
rescue => e
puts "Error with request, check that index exists before deleting: #{e}"
puts e
end
25 changes: 2 additions & 23 deletions bin/es_alias_add
@@ -2,29 +2,8 @@

require "datura"

require "json"
require "rest-client"

params = Datura::Parser.es_alias_add
options = Datura::Options.new(params).all

ali = options["alias"]
idx = options["index"]
url = File.join(options["es_path"], "_aliases")

data = {
actions: [
{ remove: { alias: ali, index: "_all" } },
{ add: { alias: ali, index: idx } }
]
}

begin
res = RestClient.post(url, data.to_json, { content_type: :json })
puts "Results of setting alias #{ali} to index #{idx}"
puts res
list = JSON.parse(RestClient.get(url))
puts "\nAll aliases: #{JSON.pretty_generate(list)}"
Datura::Elasticsearch::Alias.add
rescue => e
puts "Error: #{e.response}"
puts e
end
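
For reference, the request body that the removed script lines built can be sketched on its own. This only mirrors the `data` hash shown in the diff above; the method name `alias_swap_payload` and the sample alias and index names are hypothetical, and how `Datura::Elasticsearch::Alias.add` builds its payload is not shown on this page.

```ruby
require "json"

# Builds the body for a POST to the _aliases endpoint: remove the alias
# from every index, then add it to the target index, in one request.
def alias_swap_payload(alias_name, index)
  {
    actions: [
      { remove: { alias: alias_name, index: "_all" } },
      { add: { alias: alias_name, index: index } }
    ]
  }
end

# Hypothetical alias and index names.
puts JSON.pretty_generate(alias_swap_payload("cdrhapi-v1", "cdrhapi-v1.1"))
```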
14 changes: 5 additions & 9 deletions bin/es_alias_delete
@@ -2,12 +2,8 @@

require "datura"

require "json"
require "rest-client"

params = Datura::Parser.es_alias_delete
options = Datura::Options.new(params).all
url = File.join(options["es_path"], options["index"], "_alias", options["alias"])

res = JSON.parse(RestClient.delete(url))
puts JSON.pretty_generate(res)
begin
Datura::Elasticsearch::Alias.delete
rescue => e
puts e
end
9 changes: 1 addition & 8 deletions bin/es_alias_list
@@ -2,11 +2,4 @@

require "datura"

require "json"
require "rest-client"

options = Datura::Options.new({}).all
url = File.join(options["es_path"], "_aliases")

res = JSON.parse(RestClient.get(url))
puts JSON.pretty_generate(res)
Datura::Elasticsearch::Alias.list
89 changes: 4 additions & 85 deletions bin/es_clear_index
@@ -2,89 +2,8 @@

require "datura"

require "json"
require "rest-client"

def confirm_basic(options, url)
# verify that the user is really sure about the index they're about to wipe
puts "Are you sure that you want to remove entries from"
puts " #{options["collection"]}'s #{options['environment']} environment?"
puts "url: #{url}"
puts "y/N"
answer = STDIN.gets.chomp
# boolean
return !!(answer =~ /[yY]/)
end

def main

# run the parameters through the option parser
params = Datura::Parser.clear_index_params
options = Datura::Options.new(params).all
if options["collection"] == "all"
clear_all(options)
else
clear_index(options)
end
end

def build_data(options)
if options["regex"]
field = options["field"] || "identifier"
return {
"query" => {
"bool" => {
"must" => [
{ "regexp" => { field => options["regex"] } },
{ "term" => { "collection" => options["collection"] } }
]
}
}
}
else
return {
"query" => { "term" => { "collection" => options["collection"] } }
}
end
end

def clear_all(options)
puts "Please verify that you want to clear EVERY ENTRY from the ENTIRE INDEX\n\n"
puts "== FIELD / REGEX FILTERS NOT AVAILABLE FOR THIS OPTION, YOU'LL WIPE EVERYTHING ==\n\n"
puts "Seriously, you probably do not want to do this"
puts "Are you running this on something other than your local machine? RETHINK IT."
puts "Type: 'Yes I'm sure'"
confirm = STDIN.gets.chomp
if confirm == "Yes I'm sure"
url = "#{options["es_path"]}/#{options["es_index"]}/_doc/_delete_by_query?pretty=true"
post url, { "query" => { "match_all" => {} } }
else
puts "You typed '#{confirm}'. This is incorrect, exiting program"
exit
end
end

def clear_index(options)
url = "#{options["es_path"]}/#{options["es_index"]}/_doc/_delete_by_query?pretty=true"
confirmation = confirm_basic(options, url)

if confirmation
data = build_data(options)
post(url, data)
else
puts "come back anytime!"
exit
end
begin
Datura::Elasticsearch::Index.clear
rescue => e
puts e
end

def post(url, data={})
begin
puts "clearing from #{url}: #{data.to_json}"
res = RestClient.post(url, data.to_json, {:content_type => :json})
puts res.body
rescue => e
puts "error posting to ES: #{e.response}"
end
end

main
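
The query-building logic removed from this script can be sketched as a small pure function, which is easier to test than the original inline version. The method names `delete_query` and `delete_by_query_url` are hypothetical. The URL without the `_doc` segment is an assumption based on the commit "change endpoint for delete_by_query for ES8 compatibility" and on Elasticsearch 8 using typeless endpoints; the code that `Datura::Elasticsearch::Index.clear` actually runs is not shown on this page.

```ruby
require "json"

# Builds the delete-by-query body: always restrict to one collection,
# and optionally require a regex match on a field (default "identifier").
def delete_query(collection, regex: nil, field: "identifier")
  term = { "term" => { "collection" => collection } }
  return { "query" => term } unless regex

  {
    "query" => {
      "bool" => {
        "must" => [
          { "regexp" => { field => regex } },
          term
        ]
      }
    }
  }
end

# Assumed typeless endpoint form: /<index>/_delete_by_query
def delete_by_query_url(es_path, index)
  File.join(es_path, index, "_delete_by_query") + "?pretty=true"
end

# Hypothetical host, index, and collection names.
puts delete_by_query_url("http://localhost:9200", "test_index")
puts JSON.pretty_generate(delete_query("whitman", regex: "wfc.*"))
```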
16 changes: 3 additions & 13 deletions bin/es_get_schema
@@ -2,19 +2,9 @@

require "datura"

require "json"
require "rest-client"
require "yaml"

params = Datura::Parser.es_set_schema_params
options = Datura::Options.new(params).all

begin
url = File.join(options["es_path"], options["es_index"], "_mapping", "_doc?pretty=true")
res = RestClient.get(url)
puts res.body
puts "environment: #{options["environment"]}"
puts "url: #{url}"
es = Datura::Elasticsearch::Index.new
puts es.get_schema
rescue => e
puts "Error: #{e.response}"
puts e
end
19 changes: 3 additions & 16 deletions bin/es_set_schema
@@ -2,22 +2,9 @@

require "datura"

require "json"
require "rest-client"
require "yaml"

params = Datura::Parser.es_set_schema_params
options = Datura::Options.new(params).all
path = File.join(options["datura_dir"], options["es_schema_path"])
schema = YAML.load_file(path)

begin
idx = options["es_index"]

url = File.join(options["es_path"], options["es_index"], "_mapping", "_doc?pretty=true")
puts "environment: #{options["environment"]}"
puts "Setting schema: #{url}"
RestClient.put(url, schema.to_json, { :content_type => :json })
es = Datura::Elasticsearch::Index.new
es.set_schema
rescue => e
puts "Error: #{e.response}"
puts e
end