
speed up PRD indexing? #148

Closed
holfordm opened this issue May 2, 2018 · 11 comments · Fixed by #173

@holfordm
Collaborator

holfordm commented May 2, 2018

Detailed Description

Building the Solr indexes and HTML is time-consuming, at least on slower machines; it typically takes over an hour for me. If they have already been built and tested for QA, is it necessary to rebuild them for PRD? Or could index-all-prd look for existing HTML / Solr indexes (and/or logfiles, if there is a logfile confirming that everything has indexed OK), and ask in the terminal whether these existing files can be sent to the PRD server?

@andrew-morrison
Contributor

andrew-morrison commented May 2, 2018

I'm already working on this, because Fihrist takes even longer. It was taking two hours on a fast machine, but optimizing the XQuery scripts has reduced that to about an hour.

A quick win would be to parallel-process wherever possible. The HTML needs to be generated before the manuscripts indexing can start, and (for Medieval only) places must be indexed before organizations. But the people and works indexing could run in a second thread.
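Something along these lines, as a minimal sketch only: the `run_xquery` and `build_html` helpers and the works/manuscripts script names are illustrative, not the actual scripts in the repository.

```bash
#!/bin/bash
# Sketch of the ordering constraints described above. Helper names and
# script names are illustrative, not the real build scripts.

(
    # These have no dependency on the HTML build, so they can run in a
    # second thread.
    run_xquery persons.xquery
    run_xquery works.xquery
) &
second_thread=$!

build_html                       # manuscripts indexing needs the HTML first
run_xquery manuscripts.xquery
run_xquery places.xquery         # Medieval only: places before...
run_xquery organizations.xquery  # ...organizations

wait "$second_thread"            # finish only when both threads are done
```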

A mode for index-all-prd.sh to re-use existing files is a good idea. I'll think about the safest way to implement it.
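For example, something like this near the top of the script, purely as a sketch: the directory names, logfile name, success message, and the upload step are all assumptions, standing in for however the files actually get checked and sent to the PRD server.

```bash
# Sketch of a possible reuse mode; paths, logfile, and the upload
# helper are assumptions, not the script's real behaviour.
if [ -d html ] && [ -d solr-index ] \
   && grep -q 'indexing completed OK' qa-indexing.log; then
    read -r -p 'Existing QA build found. Send it to PRD without rebuilding? [y/N] ' reply
    if [ "$reply" = 'y' ]; then
        upload_to_prd html solr-index   # hypothetical upload step
        exit 0
    fi
fi
# ...otherwise fall through to the full rebuild as now.
```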

@andrew-morrison
Contributor

FYI, I've just done an index to Medieval QA with a few optimizations to the XQuery and it took just under 9 minutes. But that's on my fast PC.

@holfordm
Collaborator Author

holfordm commented May 8, 2018

45 minutes for me, but that is still a distinct improvement.

@andrew-morrison
Contributor

Sorry, I should've made it clear that I haven't committed my changes to the scripts yet. I'm still working on that, and need to test that the output isn't affected. So any speed improvement you're seeing is probably due to available memory on your machine.

@andrew-morrison
Contributor

BTW, we have made a change to how the Solr server does fulltext indexing on the QA site. This has nothing to do with speed, but you might see differences in the results you get back.

Basically, now that we've moved on to some of the much, much smaller TEI catalogues, it has become noticeable that the way Solr was configured to split text into individual words wasn't optimal. In Medieval, the sheer number of words, and the variant forms of each, means you're unlikely to get matches for searches that previously found nothing, but you might notice more hits, even for words whose number of instances in the source TEI hasn't changed.

This change hasn't been pushed to the production system yet.

@andrew-morrison
Contributor

I've managed to get the processing time down from 7:20 to 1:40 (minutes:seconds) on my fast PC. Approximately 1:30 of that is the time it takes to build the works index, with everything else running simultaneously alongside it. It needs a bit more testing, but I'll push the updates sometime this week.

@holfordm
Collaborator Author

holfordm commented Jun 4, 2018 via email

@andrew-morrison
Contributor

One question about works: should the works index include entries whose IDs are only found in the key attributes of titles inside bibliographic references that are specifically blocked from being displayed (e.g. a bibl with a type of "commentary")?

@holfordm
Collaborator Author

holfordm commented Jun 6, 2018 via email

@andrew-morrison
Contributor

Sorry about the delay. I'll return to this on Wednesday.

andrew-morrison added a commit that referenced this issue Jun 18, 2018
Both places and organizations are in the same authority file, so it is more efficient to have one script to process that, which will be places.xquery. See #148.
andrew-morrison added a commit that referenced this issue Jun 18, 2018
Includes changes for #148 to speed up indexing and #34 to index extra info in the authority files. Renaming people.xquery to persons.xquery (naming scripts consistently with the indexes they generate so it is easier to run them concurrently).
@andrew-morrison
Contributor

I've been able to optimize the indexing scripts to run, on my machine, in roughly a third of the time.

The only way I have been able to replicate an indexing time over an hour, even on a slow machine with an old-fashioned spinning hard disc drive, is on an unreliable wifi connection. The Java networking library which the Saxon processor uses seem to hang for a very long time if it cannot retrieve the XSLT and XQuery files in the consolidated-tei-schema repository. I cannot find any way to change that. So, I'm modifying the scripts to download what it needs first, then use local copies. Therefore if it fails it should tell you straightaway.
