
speed up PRD indexing? #148

Closed
holfordm opened this issue May 2, 2018 · 11 comments · Fixed by #173

@holfordm
Collaborator

holfordm commented May 2, 2018

Detailed Description

Building the Solr indexes and HTML is time-consuming, at least on slower machines; it typically takes over an hour for me. If they have already been built and tested for QA, is it necessary to rebuild them for PRD? Or could index-all-prd look for existing HTML / Solr indexes (and/or logfiles, if there is a logfile confirming that everything has indexed OK), and ask in the terminal whether these existing files can be sent to the PRD server?

@andrew-morrison
Contributor

andrew-morrison commented May 2, 2018

I'm already working on this, because Fihrist takes even longer. It was taking two hours on a fast machine, but optimizing the XQuery scripts has reduced that to about an hour.

A quick win would be to parallel-process wherever possible. The HTML needs to be generated before the manuscripts indexing can start, and (for Medieval only) places must be indexed before organizations. But the people and works indexing could run in a second thread.
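Something along these lines, as a minimal sketch only: the `run_xquery` and `build_html` helpers and the works/manuscripts script names are illustrative, not the actual scripts in the repository.

```bash
#!/bin/bash
# Sketch of the ordering constraints described above. Helper names and
# script names are illustrative, not the real build scripts.

(
    # These have no dependency on the HTML build, so they can run in a
    # second thread.
    run_xquery persons.xquery
    run_xquery works.xquery
) &
second_thread=$!

build_html                       # manuscripts indexing needs the HTML first
run_xquery manuscripts.xquery
run_xquery places.xquery         # Medieval only: places before...
run_xquery organizations.xquery  # ...organizations

wait "$second_thread"            # finish only when both threads are done
```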

A mode for index-all-prd.sh to re-use existing files is a good idea. I'll think about the safest way to implement it.
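For example, something like this near the top of the script, purely as a sketch: the directory names, logfile name, success message, and the upload step are all assumptions, standing in for however the files actually get checked and sent to the PRD server.

```bash
# Sketch of a possible reuse mode; paths, logfile, and the upload
# helper are assumptions, not the script's real behaviour.
if [ -d html ] && [ -d solr-index ] \
   && grep -q 'indexing completed OK' qa-indexing.log; then
    read -r -p 'Existing QA build found. Send it to PRD without rebuilding? [y/N] ' reply
    if [ "$reply" = 'y' ]; then
        upload_to_prd html solr-index   # hypothetical upload step
        exit 0
    fi
fi
# ...otherwise fall through to the full rebuild as now.
```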

@andrew-morrison
Contributor

FYI, I've just done an index to Medieval QA with a few optimizations to the XQuery and it took just under 9 minutes. But that's on my fast PC.

@holfordm
Collaborator Author

holfordm commented May 8, 2018

45 minutes for me, but that is still a distinct improvement.

@andrew-morrison
Contributor

Sorry, I should've made it clear that I haven't committed my changes to the scripts yet. I'm still working on that, and need to test that the output isn't affected. So any speed improvement you're seeing is probably due to available memory on your machine.

@andrew-morrison
Contributor

BTW, we have made a change to how the Solr server does fulltext indexing on the QA site. This has nothing to do with speed, but you might see differences in the results you get back.

Basically, now that we've moved on to some of the much, much smaller TEI catalogues, it has become noticeable that the way Solr was configured to split text into individual words wasn't optimal. In Medieval, the sheer number of words, and the variant forms of each, means you're unlikely to get matches for searches that previously found nothing, but you might notice more hits, even for words whose number of instances in the source TEI hasn't changed.

This change hasn't been pushed to the production system yet.

@andrew-morrison
Contributor

I've managed to get the processing time down from 7:20 to 1:40 (minutes:seconds) on my fast PC. Approximately 1:30 of that is the time it takes to build the works index, with everything else running simultaneously alongside it. It needs a bit more testing, but I'll push the updates sometime this week.

@holfordm
Collaborator Author

holfordm commented Jun 4, 2018 via email

@andrew-morrison
Contributor

One question about works: should the works index include entries whose IDs are only found in the key attributes of titles inside bibliographic references that are specifically blocked from being displayed (e.g. a bibl with a type of "commentary")?

@holfordm
Collaborator Author

holfordm commented Jun 6, 2018 via email

@andrew-morrison
Contributor

Sorry about the delay. I'll return to this on Wednesday.

andrew-morrison added a commit that referenced this issue Jun 18, 2018
Both places and organizations are in the same authority file, so it is more efficient to have one script to process that, which will be places.xquery. See #148.
andrew-morrison added a commit that referenced this issue Jun 18, 2018
Includes changes for #148 to speed up indexing and #34 to index extra info in the authority files. Renaming people.xquery to persons.xquery (naming scripts consistently with the indexes they generate so it is easier to run them concurrently).
@andrew-morrison
Contributor

I've been able to optimize the indexing scripts to run, on my machine, in roughly a third of the time.

The only way I have been able to replicate an indexing time over an hour, even on a slow machine with an old-fashioned spinning hard disc drive, is on an unreliable wifi connection. The Java networking library which the Saxon processor uses seem to hang for a very long time if it cannot retrieve the XSLT and XQuery files in the consolidated-tei-schema repository. I cannot find any way to change that. So, I'm modifying the scripts to download what it needs first, then use local copies. Therefore if it fails it should tell you straightaway.
