Parallelize on corpora basis in bulk task clients #1412
# pre-initialize in order to assign parallel tasks for different corpora through assignment
readers = [None for corpus in corpora for _ in corpus.documents] * num_clients
# stagger which corpus each client starts with for better parallelism
for group, corpus in enumerate(corpora[(start_client_index + mod) % len(corpora)] for mod in range(len(corpora))):
Primarily looking for feedback on the round-robin approach implemented between this line and line 961 below. I feel like it's unreadable, but I was stumped on how to make it better.
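For readers following the discussion, the staggering expression in the diff above can be pulled out into a small helper. This is only a sketch of the rotation idea, not the code in the PR: the function name `staggered_corpora` and the string corpus names are hypothetical, and real corpora are objects rather than strings.

```python
def staggered_corpora(corpora, start_client_index):
    """Return the corpora rotated so that each client starts at a different one.

    Mirrors the expression in the diff:
    corpora[(start_client_index + mod) % len(corpora)] for mod in range(len(corpora))
    """
    count = len(corpora)
    return [corpora[(start_client_index + offset) % count] for offset in range(count)]


# Hypothetical corpus names, used only to show the rotation per client.
example_corpora = ["corpus-a", "corpus-b", "corpus-c"]
for client_index in range(4):
    print(client_index, staggered_corpora(example_corpora, client_index))
```

Naming the rotation separately from the `enumerate` loop may be one way to make the round-robin logic easier to read, since the loop body then only deals with assigning readers.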
I think this approach makes sense. As you correctly identified, there are tracks like http_logs that currently assume specific corpus files contain a certain time range of events; to run them in a realistic way, we'd need either the existing sequential strategy or a change to the corpora files in the track.
I'd be interested to see the impact of this approach also in the solutions/logs tracks.
Tried this with a track I'm developing, using 1000 clients, 5 corpora, 3 Elasticsearch nodes, and each data stream configured with 1 primary and 1 replica. The following are the results in that case (
Note that this is because, if corpora are indexed (mostly) sequentially, only two of the three nodes are busy at a time for the majority of the workload. Still TODO are to see the impact on
HTTP_Logs Indexing Results
3 nodes, 1 corpus (7 files), 8 indexing clients
append-no-conflicts-index-only
no effect (expected)
Detailed Results
Baseline is master, Contender is this change
append-sorted-no-conflicts
no effect (unexpected!)
Detailed Results
Baseline is master, Contender is this change
Next
I'm thinking the asc/desc sorted queries need to be checked
HTTP_Logs Querying Results
append-no-conflicts
There's some indication that:
Many of these are known to be volatile. Re-running for validation
Detailed Results
Solutions/Logs Indexing Performance
logging-indexing
No effect. I believe this change is not in the code path for solutions/logs due to custom code.
Detailed Results
HTTP_Logs Querying Results Attempt 2
append-no-conflicts
Re-ran the previous result. Confirmed it seems like
Detailed Results
@dliappis My proposal for next steps given the above results, at a high level:
Can I have your thoughts on that?
Talked to @dliappis OOB. The above-mentioned approach is acceptable. Once we change the http_logs corpora source to a single unified file, we'll backport the change in rally-tracks and put a signpost in our changes for this enhancement, so that if performance changes for users after upgrading Rally, we will be able to easily trace it back to this point.
I have revisited the above results after getting enormous deviations while trying to consolidate all the corpora into a single file, which in its simplest form also collapses all the data into a single index, which affects time-based results. After reviewing results I initially wrote off as too deviant in their relative performance, I noticed, for instance that
For what it's worth, revisiting these numbers is the result of trying to write new corpora files that include the target index as the action and metadata line with the following script:
#!/bin/bash
# Prefix every document in a corpus file with a bulk action-and-metadata line.
source_file=$1
# Keep the text after the last hyphen, drop the .json suffix, and prepend "logs-".
tmp="${source_file##*-}"
index_name="logs-${tmp%%.json}"
#echo $source_file $index_name
# IFS= and -r keep leading whitespace and backslashes in each document intact.
while IFS= read -r source_line
do
  echo "{\"index\":{\"_index\":\"${index_name}\"}}" >> "${index_name}.json"
  printf '%s\n' "$source_line" >> "${index_name}.json"
done < "$source_file"
But for larger files
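The same transformation can be sketched in Python. This is an illustrative equivalent of the bash loop above, not something taken from the PR: the function names and the example file name are hypothetical, and unlike the bash script (which appends with `>>`) this sketch overwrites the target file.

```python
import json


def index_name_for(source_file):
    """Derive the index name as the bash expansions do: text after the last '-', minus '.json'."""
    tail = source_file.rsplit("-", 1)[-1]
    if tail.endswith(".json"):
        tail = tail[: -len(".json")]
    return f"logs-{tail}"


def add_action_lines(source_file, target_file=None):
    """Write a copy of source_file with an action-and-metadata line before each document."""
    index_name = index_name_for(source_file)
    # Assumption: default target name follows the bash script, "<index_name>.json".
    target_file = target_file or f"{index_name}.json"
    action = json.dumps({"index": {"_index": index_name}}, separators=(",", ":"))
    with open(source_file, encoding="utf-8") as src, \
            open(target_file, "w", encoding="utf-8") as dst:
        for line in src:
            dst.write(action + "\n")
            dst.write(line.rstrip("\n") + "\n")
    return target_file
```

For a hypothetical source file named `documents-181998.json`, `index_name_for` returns `logs-181998`, so every document in the output is preceded by `{"index":{"_index":"logs-181998"}}`.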
LGTM
Previously, in multi-corpora tracks/challenges, indexing would proceed sequentially from file to file.
This change causes better parallelism between corpora in indexing tasks in two ways:
Example intended consequences are listed in the following table:
I intend to keep this PR in draft mode until it is tested to ensure it doesn't synthetically reduce throughput in any of our multiple-corpora benchmarks (such as the Logging Solutions or http_logs nightly benchmarks).