speed up PRD indexing? #148
Comments
I'm already working on this, because Fihrist takes even longer. It was taking two hours on a fast machine, but optimizing the XQuery scripts has reduced that to about an hour. A quick win would be to parallel-process where possible. The HTML needs to be generated before the manuscripts indexing can start, and (for Medieval only) places must be done before organizations. But people and works indexing could run in a second thread. A mode for index-all-prd.sh to re-use existing files is a good idea. I'll think about the safest way to implement it. |
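One way to read the ordering constraints above is as two independent chains run side by side. The following is only a minimal sketch under that assumption: the `step` function and the step names are hypothetical stand-ins for the real Saxon/XQuery invocations, which are not shown in this thread, and it is not the change that was eventually committed.

```shell
#!/usr/bin/env bash
# Sketch only: the step names and the step() helper are hypothetical
# stand-ins for the real indexing commands.

LOG_FILE="${LOG_FILE:-indexing-order.log}"

step() {
    # Record that a step ran; a real step would run its indexing command here.
    echo "$1" >> "$LOG_FILE"
}

run_all() {
    # First chain: HTML before manuscripts, places before organizations.
    ( step html && step manuscripts && step places && step organizations ) &
    local first=$!

    # Second chain: people and works do not depend on the first chain.
    ( step people && step works ) &
    local second=$!

    # Wait for both chains and report failure if either one failed.
    local status=0
    wait "$first" || status=1
    wait "$second" || status=1
    return "$status"
}
```

Calling `run_all` starts both chains and returns non-zero if any step in either chain fails, so a wrapper script could refuse to upload anything after a partial run.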
FYI, I've just done an index to Medieval QA with a few optimizations to the XQuery and it took just under 9 minutes. But that's on my fast PC. |
45 minutes for me, but that is still a distinct improvement |
Sorry, I should've made it clear, I haven't committed my changes to the scripts yet. I'm still working on that, and need to test that the output isn't affected. So any speed improvement is probably due to available memory on your machine. |
BTW, we have made a change to how the Solr server does fulltext indexing on the QA site. This has nothing to do with speed, but you might see differences in the results you get back. Basically, now that we've moved on to some of the much, much smaller TEI catalogues, it has become noticeable that the way Solr was configured to identify individual words wasn't optimal. In Medieval, the large number of words, and variant forms of each, means you're unlikely to find matches for things that didn't match already, but you might notice more hits, even for words whose number of occurrences in the source TEI hasn't changed. This change hasn't been pushed to the production system yet. |
I've managed to get the processing time down from 7:20 to 1:40 on my fast PC. Approximately 1:30 of that is the time it takes to build the works index, with the rest running simultaneously. It needs a bit more testing, but I'll push the updates sometime this week. |
Congratulations; this will make a real difference.
|
One question about works: should the works index include entries with IDs which are only found in the key attributes of titles inside bibliographic references (which are specifically blocked from being displayed, e.g. bibl with a type of "commentary")? |
For the moment, yes. Really most of these cross-references belong in the works index, but until they have been put there they should be indexed as they stand.
|
Sorry about the delay. I'll return to this on Wednesday. |
Both places and organizations are in the same authority file, so it is more efficient to have one script to process that, which will be places.xquery. See #148.
I've been able to optimize the indexing scripts to run, on my machine, in roughly a third of the time. The only way I have been able to replicate an indexing time over an hour, even on a slow machine with an old-fashioned spinning hard disc drive, is on an unreliable wifi connection. The Java networking library which the Saxon processor uses seems to hang for a very long time if it cannot retrieve the XSLT and XQuery files in the consolidated-tei-schema repository. I cannot find any way to change that. So, I'm modifying the scripts to download what they need first, then use local copies. That way, if a download fails, it should tell you straightaway. |
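The download-first approach described above could look roughly like the sketch below. It is an illustration only: the cache directory, the function names, and any URLs passed in are hypothetical, and the timeout values are arbitrary. The point is that curl is given explicit time limits and `--fail`, so a bad connection produces an immediate error instead of a long stall inside Saxon.

```shell
#!/usr/bin/env bash
# Sketch only: CACHE_DIR and the URLs passed in are hypothetical placeholders.

CACHE_DIR="${CACHE_DIR:-./downloaded-deps}"

fetch() {
    # $1 = URL, $2 = local target file.
    # --fail turns HTTP errors into a non-zero exit status;
    # --connect-timeout and --max-time bound how long a bad connection can stall.
    curl --fail --silent --show-error \
         --connect-timeout 10 --max-time 60 \
         --output "$2" "$1"
}

download_dependencies() {
    mkdir -p "$CACHE_DIR" || return 1
    local url target
    for url in "$@"; do
        target="$CACHE_DIR/$(basename "$url")"
        if ! fetch "$url" "$target"; then
            echo "Could not download $url; stopping before indexing starts." >&2
            return 1
        fi
    done
}
```

The indexing commands would then be pointed at the files in `$CACHE_DIR` rather than at remote URLs, so nothing is fetched over the network once indexing has begun.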
Detailed Description
Building the SOLR indexes and HTML is time consuming, at least on slower machines; it typically takes over an hour for me. If they have been built and tested for QA, is it necessary to rebuild them for PRD? Or could index-all-prd look for existing HTML / SOLR indexes (and/or logfiles, if there is a logfile confirming that everything has indexed OK), and ask in the terminal if these existing files can be sent to the PRD server?
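
As a rough illustration of the request, a re-use check with a terminal prompt might be sketched as follows. Everything here is assumed for the example: the function names, the idea that generated files sit in one directory, and the convention that a successful run ends its logfile with a line reading `OK`. The real scripts may record success differently.

```shell
#!/usr/bin/env bash
# Sketch only: the directory layout and the "OK" log marker are assumptions.

outputs_look_complete() {
    # $1 = directory holding generated files, $2 = logfile from the QA run.
    local dir="$1" log="$2"
    [ -d "$dir" ] || return 1
    # Require at least one generated file in the directory.
    [ -n "$(ls -A "$dir")" ] || return 1
    # Require a logfile whose last line reports success (hypothetical convention).
    [ -f "$log" ] && [ "$(tail -n 1 "$log")" = "OK" ]
}

confirm_reuse() {
    # Ask on the terminal; anything other than y or Y means rebuild.
    local answer
    printf 'Existing files found in %s. Send these to PRD without rebuilding? [y/N] ' "$1"
    read -r answer
    case "$answer" in
        y|Y) return 0 ;;
        *)   return 1 ;;
    esac
}
```

Defaulting to a rebuild when the answer is anything other than an explicit yes keeps the safer behaviour as the fallback.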