Forces UTF-8 encoding on ALTO XML. #2298

Merged
merged 17 commits into from
Jan 2, 2025
1 change: 1 addition & 0 deletions app/controllers/catalog_controller.rb
@@ -19,6 +19,7 @@ def self.modified_field

# CatalogController-scope behavior and configuration for BlacklightIiifSearch
include BlacklightIiifSearch::Controller
+ skip_before_action :authenticate_user!, only: :iiif_search

configure_blacklight do |config|
# configuration for Blacklight IIIF Content Search
5 changes: 3 additions & 2 deletions app/models/file_set.rb
@@ -87,13 +87,14 @@ def preferred_file
end
end

+ # The two methods below err when storing text in Solr, so forcing UTF-8 encoding removes errant text (most likely ASCII).
def alto_xml
- return extracted&.content if extracted&.file_name&.first&.include?('.xml')
+ return extracted&.content&.force_encoding('UTF-8')&.encode("UTF-8", invalid: :replace, replace: "") if extracted&.file_name&.first&.include?('.xml')
nil
end

def transcript_text
- transcript_file&.content&.force_encoding('UTF-8') if transcript_file&.file_name&.first&.include?('.txt')
+ transcript_file&.content&.force_encoding('UTF-8')&.encode("UTF-8", invalid: :replace, replace: "") if transcript_file&.file_name&.first&.include?('.txt')
end

private
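
For reviewers who want to see what the new chain does to a string, here is a minimal sketch in plain Ruby. The helper name `scrub_to_utf8` and the sample bytes are hypothetical and not part of this PR; the sketch only mirrors the `force_encoding` plus `encode(..., invalid: :replace, replace: "")` calls from the diff, and it assumes a modern Ruby where `encode` drops invalid bytes even when source and target encodings match.

```ruby
# Hypothetical helper (not in the PR) mirroring the chain used in
# alto_xml and transcript_text.
def scrub_to_utf8(content)
  return nil if content.nil?

  # dup first: force_encoding relabels the receiver in place.
  content.dup
         .force_encoding('UTF-8')
         .encode('UTF-8', invalid: :replace, replace: '')
end

# Sample input: valid UTF-8 bytes for "é" plus one stray 0xFF byte,
# tagged as binary the way raw file content often arrives.
raw = "caf\xC3\xA9 \xFFmenu".b

clean = scrub_to_utf8(raw)
puts clean.encoding         # the result is tagged UTF-8
puts clean.valid_encoding?  # the stray byte no longer makes the string invalid
```

`force_encoding` alone only changes the label on the string, so invalid bytes would still be present; the `encode` step with `invalid: :replace` is what removes them.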
2 changes: 1 addition & 1 deletion app/views/manifest/manifest.json.jbuilder
@@ -15,7 +15,7 @@ end
# within the Work to activate, but each text-optimized FileSet's alto_xml_tesi,
# transcript_text_tesi, and is_page_of_ssi fields must also be indexed for normal
# searching functions.
- if @solr_doc['all_text_tsimv'].present?
+ if @image_concerns.any? { |id| SolrDocument&.find(id)&.[]('alto_xml_tesi')&.present? }
json.service do
json.child! do
json.set! :@context, 'http://iiif.io/api/search/0/context.json'
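
The new condition can be read as "at least one image concern has non-empty `alto_xml_tesi`". Below is a rough, framework-free sketch of that check. `FAKE_INDEX`, `find_doc`, and `iiif_search_service?` are hypothetical stand-ins for the Solr lookup and are not part of this PR; the explicit `nil?`/`empty?` test approximates ActiveSupport's `present?` for strings.

```ruby
# Hypothetical in-memory stand-in for the Solr index: id => document hash.
FAKE_INDEX = {
  'fs-1' => { 'alto_xml_tesi' => '<alto>...</alto>' },
  'fs-2' => {},
  'fs-3' => { 'alto_xml_tesi' => '' }
}.freeze

# Hypothetical lookup; returns nil for unknown ids.
def find_doc(id)
  FAKE_INDEX[id]
end

# True when any listed id has a non-empty alto_xml_tesi value.
def iiif_search_service?(image_concern_ids)
  image_concern_ids.any? do |id|
    value = find_doc(id)&.[]('alto_xml_tesi')
    !value.nil? && !value.empty?
  end
end

puts iiif_search_service?(%w[fs-2 fs-1])
puts iiif_search_service?(%w[fs-2 fs-3])
```

Because `any?` stops at the first match, the number of lookups depends on where the first text-bearing FileSet sits in the list.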
5 changes: 2 additions & 3 deletions spec/views/manifest/manifest.json.jbuilder_spec.rb
@@ -96,10 +96,9 @@
expect(work.file_sets.count).to eq 5
end

- context 'when all_text_tsimv is present' do
- let(:solr_document) { SolrDocument.new(attributes.merge('all_text_tsimv' => 'So much text!')) }
-
+ context 'when @image_concerns contains values in alto_xml_tesi' do
it 'renders a IIIF Search service' do
+ allow(image_concerns).to receive(:any?).and_return(true)
render
parsed_rendered_manifest = JSON.parse(rendered)
