Index-append operation only indexing bulk-size * clients documents #377

Closed
dakrone opened this issue Dec 5, 2017 · 5 comments
Labels
enhancement Improves the status quo :Track Management New operations, changes in the track format, track download changes and the like :Usability Makes Rally easier to use

@dakrone
Member

dakrone commented Dec 5, 2017

Rally version (get with esrally --version):
Latest from master, 425d8f6

Invoked command:

./rally --track-path=/home/hinmanm/es/mytrack --target-hosts=127.0.0.1:9200 --pipeline=benchmark-only

Configuration file (located in ~/.rally/rally.ini):

[meta]
config.version = 12

[system]
env.name = local

[node]
root.dir = /home/hinmanm/.rally/benchmarks
src.root.dir = /home/hinmanm/es

[source]
remote.repo.url = https://github.com/elastic/elasticsearch.git
elasticsearch.src.subdir = elasticsearch

[build]
gradle.bin = /home/hinmanm/.sdkman/candidates/gradle/current/bin/gradle

[runtime]
java.home = /usr/lib/jvm/java-1.8.0-openjdk-1.8.0.151-1.b12.fc26.x86_64

[benchmarks]
local.dataset.cache = ${node:root.dir}/data

[reporting]
datastore.type = elasticsearch
datastore.host = localhost
datastore.port = 9900
datastore.secure = False
datastore.user = 
datastore.password = 

[tracks]
default.url = https://github.com/elastic/rally-tracks

[teams]
default.url = https://github.com/elastic/rally-teams

[defaults]
preserve_benchmark_candidate = False

[distributions]
release.1.url = https://download.elasticsearch.org/elasticsearch/elasticsearch/elasticsearch-{{VERSION}}.tar.gz
release.2.url = https://download.elasticsearch.org/elasticsearch/release/org/elasticsearch/distribution/tar/elasticsearch/{{VERSION}}/elasticsearch-{{VERSION}}.tar.gz
release.url = https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-{{VERSION}}.tar.gz
release.cache = true

JVM version:
JDK 8

OS version:
Fedora 26

Description of the problem including expected versus actual behavior:

I have a track with an index-append operation defined inline in the challenge like so:

      "schedule": [
        {
          "operation": {
            "name": "index-append",
            "operation-type": "bulk",
            "bulk-size": {{bulk_size | default(100)}}
          },
          "clients": 4
        },

The documents.json contains 1967 documents; however, only 400 are actually indexed.

Steps to reproduce:

  1. Using a track with many documents, add a challenge schedule with a low bulk-size and multiple clients
  2. Run the track
  3. Observe that only bulk-size x clients documents are indexed, in my case, 100 x 4 = 400 documents actually indexed.
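The arithmetic behind the reported numbers can be sketched as follows. This is only an illustration of the counts in this report, assuming each client sends exactly one bulk request; the function names are hypothetical and are not part of Rally.

```python
import math


def docs_indexed_single_iteration(total_docs: int, bulk_size: int, clients: int) -> int:
    """Documents indexed if every client sends exactly one bulk request."""
    return min(total_docs, bulk_size * clients)


def bulks_needed(total_docs: int, bulk_size: int) -> int:
    """Total number of bulk requests required to ingest the whole corpus."""
    return math.ceil(total_docs / bulk_size)


indexed = docs_indexed_single_iteration(total_docs=1967, bulk_size=100, clients=4)
print(indexed)                   # 400 documents indexed
print(1967 - indexed)            # 1567 documents never sent
print(bulks_needed(1967, 100))   # 20 bulk requests would be needed in total
```

With a bulk size of 100, ingesting all 1967 documents takes 20 bulk requests in total, whereas one request per client only covers 4 of them.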

I've noticed that this didn't affect me when the indexing was defined in a separate operation; it only started affecting me when I defined it inline in the challenge.

Provide logs (if relevant):
The data is from a private repo, so I cannot provide it here.

@danielmitterdorfer
Member

danielmitterdorfer commented Dec 6, 2017

I could reproduce the behavior that you are seeing. It is caused by the fact that you did not specify any iterations or time-periods on the task. If you add "warmup-time-period": 0 to the task definition, then it will index all documents, i.e. this will do what you want:

      "schedule": [
        {
          "clients": 4,
          "warmup-time-period": 0,
          "operation": {
            "name": "index-append",
            "operation-type": "bulk",
            "bulk-size": {{bulk_size | default(100)}}
          }
        }

The reason for this admittedly strange behavior is that a task can be either time-period-based or iteration-based. If you do not specify anything, Rally will by default run the provided operation once without warmup, and that's what you see here.

While we could argue that it makes no sense to execute a bulk operation only once, Rally does not impose any semantics on the operation on that level. It simply executes what you give it.
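The defaulting rule described above can be sketched roughly like this. It is a hypothetical helper that mirrors the explanation in this comment, not Rally's actual code; in particular, `resolve_schedule`, the returned tuple layout and the error raised on mixed settings are assumptions made for illustration.

```python
from typing import Optional, Tuple


def resolve_schedule(task: dict) -> Tuple[str, Optional[int], Optional[int]]:
    """Return (mode, warmup, measurement) for a task definition.

    Hypothetical sketch of the behavior described in this thread.
    """
    time_keys = ("warmup-time-period", "time-period")
    iteration_keys = ("warmup-iterations", "iterations")
    has_time = any(key in task for key in time_keys)
    has_iterations = any(key in task for key in iteration_keys)

    if has_time and has_iterations:
        # Assumption: mixing both styles is treated as an error in this sketch.
        raise ValueError("a task is either time-period-based or iteration-based")
    if has_time:
        # A measurement period of None stands for "no time limit" here, which
        # matches the observation that the task then consumes all documents.
        return ("time", task.get("warmup-time-period", 0), task.get("time-period"))
    if has_iterations:
        return ("iterations", task.get("warmup-iterations", 0), task.get("iterations", 1))
    # Nothing specified: zero warmup iterations and one measurement iteration.
    return ("iterations", 0, 1)


print(resolve_schedule({"clients": 4}))
print(resolve_schedule({"clients": 4, "warmup-time-period": 0}))
```

The first call falls through to the default of a single iteration, which is why only one bulk request per client is issued; the second call switches the task to the time-based mode.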

@dakrone
Member Author

dakrone commented Dec 6, 2017

Very odd, okay. I wonder whether it would be nice to have a different operation type that always consumes all of the documents from the file? That's the only thing I could think of that would help alleviate the weirdness.

@danielmitterdorfer
Member

Yes. I'll leave this ticket open as a reminder for now, but I need to think about how to make this less trappy in the future.

@danielmitterdorfer danielmitterdorfer added :Track Management New operations, changes in the track format, track download changes and the like :Usability Makes Rally easier to use enhancement Improves the status quo labels Dec 6, 2017
@danielmitterdorfer danielmitterdorfer added this to the 0.9.x milestone Feb 19, 2018
@danielmitterdorfer
Member

Another user just hit this in https://discuss.elastic.co/t/bulk-index-operation-for-multiple-indices/120373. Hence, I have changed the milestone now so we do something about this earlier.

@danielmitterdorfer danielmitterdorfer modified the milestones: 0.9.x, 0.9.4 Mar 9, 2018
@danielmitterdorfer
Member

Rally 0.9.4 will implement the following behavior in case the user did not specify warmup-time-period, time-period, warmup-iterations or iterations: It will still default to an iteration-based approach (as opposed to a time-based approach). However, instead of defaulting to no warmup iterations and one measurement iteration, it will first check the corresponding parameter source. For bulk operations, the parameter source is able to determine the necessary number of bulks upfront (for all other operations this has never been a problem). Consequently, we will now ingest all data by default.
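A minimal sketch of this new default, under stated assumptions: the classes `BulkParamSource` and `UnboundedParamSource`, the `size()` method and the function `default_iterations` are hypothetical names chosen for illustration, and the per-client document count is simply passed in rather than derived the way Rally partitions the data.

```python
import math
from typing import Optional, Tuple


class BulkParamSource:
    """Hypothetical parameter source that knows how many bulks it will produce."""

    def __init__(self, docs_for_client: int, bulk_size: int):
        self.docs_for_client = docs_for_client
        self.bulk_size = bulk_size

    def size(self) -> Optional[int]:
        return math.ceil(self.docs_for_client / self.bulk_size)


class UnboundedParamSource:
    """Hypothetical parameter source with no natural end (e.g. a query)."""

    def size(self) -> Optional[int]:
        return None


def default_iterations(source) -> Tuple[int, int]:
    """Return (warmup_iterations, iterations) when the user specified nothing."""
    size = source.size()
    if size is not None:
        # The source knows the number of bulks upfront: run them all.
        return (0, size)
    # Old default: no warmup and a single measurement iteration.
    return (0, 1)


# Assumed example: a client responsible for 492 documents with a bulk size of 100.
print(default_iterations(BulkParamSource(docs_for_client=492, bulk_size=100)))  # (0, 5)
print(default_iterations(UnboundedParamSource()))                               # (0, 1)
```

Operations whose parameter source has no fixed size keep the previous default, which matches the remark that this was never a problem for non-bulk operations.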

danielmitterdorfer added a commit to danielmitterdorfer/rally that referenced this issue Mar 9, 2018
With this commit we also query the parameter source when determining the
default number of iterations. Previously, when the user did not specify
a time period or a number of iterations, we always defaulted to zero
warmup iterations and one measurement iteration. This led to surprising
behavior for bulk-indexing when the user forgot to add a warmup time
period because we only issued one bulk request.

Closes elastic#377
danielmitterdorfer added a commit that referenced this issue Mar 9, 2018
With this commit we also query the parameter source when determining the
default number of iterations. Previously, when the user did not specify
a time period or a number of iterations, we always defaulted to zero
warmup iterations and one measurement iteration. This led to surprising
behavior for bulk-indexing when the user forgot to add a warmup time
period because we only issued one bulk request.

Closes #377
Relates #436