
Parallelization of bulk publisher for high loads #927

Merged
merged 10 commits into master on Jul 19, 2019

Conversation

Contributor
@lancewf lancewf commented Jul 17, 2019


📋 TODO

  • Add a context check to fail the message out of the pipeline - this will be added in future PRs
  • Add numberOfParallelBulkPublishers to the Automate config

🔩 Description

Created a message distributor to send messages to multiple bulk publishers. The distributor creates a set number of bulk publishers that the messages are sent to. Messages are sent to the first publisher until it is full; the overflow is then sent to the second publisher until it is full, and so on with the remaining publishers until they are all full. When they are all full, the distributor waits and tries to send the messages again. This algorithm can easily be changed by updating the distributeMessage function.

The reason for filling one publisher before sending messages to the others is to increase the size of the bulk messages sent. With a round-robin or equal-distribution approach, all the bulk publishers would be sending small messages with the overhead of more connections to Elasticsearch. For example, we would rather send 100 messages over one Elasticsearch connection than 25 messages apiece over 4 connections. Once the load on the system is high enough that publishing the bulk messages becomes the bottleneck, the second bulk publisher's inbox will start to fill and the second bulk publisher will begin sending bulk messages in parallel.

Testing is needed. The distributor may only be useful with multiple Elasticsearch nodes.

Update the numberOfParallelBulkPublishers to add more publishers sending messages to Elasticsearch.
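
For illustration, here is a minimal sketch of the fill-first distribution described above. The function and type names (sendMessage, distributeMessage, message.ChefRun) mirror the PR, but the bodies are a sketch rather than the exact merged code, and it assumes the ingest message package and the standard library "time" package are imported:

// sendMessage keeps offering the message until some publisher inbox accepts it.
func sendMessage(pipeInChannels []chan message.ChefRun, msg message.ChefRun) {
	for !distributeMessage(pipeInChannels, msg) {
		// All publisher inboxes are full. Wait and try again.
		time.Sleep(time.Millisecond * 10)
	}
}

// distributeMessage offers the message to each inbox in order, so the first
// publisher fills up before any overflow reaches the second, and so on.
func distributeMessage(pipeInChannels []chan message.ChefRun, msg message.ChefRun) bool {
	for _, in := range pipeInChannels {
		select {
		case in <- msg: // non-blocking send; succeeds only if this inbox has room
			return true
		default: // this inbox is full, try the next publisher
		}
	}
	return false // every inbox was full
}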

⛓️ Related Resources

#924

✅ Checklist

  • Necessary tests added/updated?
  • Necessary docs added/updated?
  • Code actually executed?
  • Vetting performed (unit tests, lint, etc.)?

messageProcessed := false
for !messageProcessed {
	for _, pipe := range pipes {
		if !pipe.isFull() {
Contributor
have you considered using something like this so you don't need to check for space:

select {
case pipe.in <- msg:
   messageProcessed = true
default:
}

Contributor Author
Thank you. I will give that a try.

Contributor Author
The select worked great! Thanks Jay!

@afiune
I don't think you need the default: line

Contributor Author
@afiune You need the default: line so that the send pipe.in <- msg does not wait for room in the channel.

The code below does the following: if the channel is not full, send the message and set messageProcessed to true; if the channel is full, exit the select.

case pipe.in <- msg:
   messageProcessed = true
default:
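
For anyone following along, a tiny standalone example of this non-blocking send behaviour (illustrative, not PR code):

package main

import "fmt"

func main() {
	ch := make(chan int, 1) // buffered channel with room for one message

	trySend := func(v int) bool {
		select {
		case ch <- v: // succeeds only if the buffer has room
			return true
		default: // without this default, the send would block until room frees up
			return false
		}
	}

	fmt.Println(trySend(1)) // true: the buffer had room
	fmt.Println(trySend(2)) // false: the buffer is full, so default is taken instead of blocking
}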

Clever!!! ⭐️

Signed-off-by: Lance Finfrock <[email protected]>
@lancewf lancewf force-pushed the lancewf/parallelization_of_bulk_publish branch from f1ec0c5 to a0be4a5 on July 17, 2019 17:52
Lance Finfrock added 3 commits July 17, 2019 13:13
Signed-off-by: Lance Finfrock <[email protected]>
Signed-off-by: Lance Finfrock <[email protected]>
Signed-off-by: Lance Finfrock <[email protected]>
Contributor
@stevendanna stevendanna left a comment
Neato

}

func sendMessage(pipeInChannels []chan message.ChefRun, msg message.ChefRun) {
	for true {
Contributor
[nit]

Suggested change
for true {
for {

Contributor

Not sure how we feel about this style, but I think if you wanted you could do:

for !distributeMessage(pipeInChannels, msg) {
	// All pipes are full. Wait and try again
	time.Sleep(time.Millisecond * 10)
}

Contributor Author
@lancewf lancewf Jul 18, 2019

Yeah, I thought about making that change, but it does not seem as readable; the "distributeMessage" function does not say when to stop the for-loop.
What about this?

for messageProcessed := distributeMessage(pipeInChannels, msg); !messageProcessed; messageProcessed = distributeMessage(pipeInChannels, msg) {
   // All pipes are full. Wait and try again
   time.Sleep(time.Millisecond * 10)
}

Contributor

Eh, I'd just leave it, it was pretty clear as is, I was just musing aloud :D

}

func mergeOutChannels(pipeOutChannels []<-chan message.ChefRun) chan message.ChefRun {
	mergedOut := make(chan message.ChefRun, 100)
Contributor
💡 As a follow-up to this PR, I think it might make sense to do some arithmetic on the maximum number of messages we might end up with sitting in channel buffers if we get maximally backed up, and then tune or remove some of these buffers accordingly.
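
For reference, a fan-in merge along these lines might look like the following (a sketch assuming a sync import; the buffer size of 100 mirrors the snippet above, the rest is illustrative rather than the exact PR code):

func mergeOutChannels(pipeOutChannels []<-chan message.ChefRun) chan message.ChefRun {
	mergedOut := make(chan message.ChefRun, 100)
	var wg sync.WaitGroup
	wg.Add(len(pipeOutChannels))
	for _, out := range pipeOutChannels {
		go func(c <-chan message.ChefRun) {
			defer wg.Done()
			// Forward everything from this publisher's output onto the merged channel.
			for msg := range c {
				mergedOut <- msg
			}
		}(out)
	}
	go func() {
		// Close the merged channel once every publisher output has been drained.
		wg.Wait()
		close(mergedOut)
	}()
	return mergedOut
}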

		default:
		}
	}
	return false
Contributor

Would it make sense to log a warning here? If we hit this, it means we are potentially no longer keeping up with requests, right?

Contributor Author

That is right. I will add a warning log.
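
Presumably something along these lines (a sketch; it assumes a logrus-style logger imported as log, and the exact call in the PR may differ):

	// Every publisher inbox was full: the pipeline is not keeping up with incoming messages.
	log.Warn("all bulk publisher inboxes are full; waiting before retrying")
	return false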

Lance Finfrock added 2 commits July 18, 2019 09:24
Signed-off-by: Lance Finfrock <[email protected]>
Signed-off-by: Lance Finfrock <[email protected]>
@thomascate

I did some load testing with this PR and a large Elasticsearch cluster and was able to achieve much higher throughput through the frontend than before. At peak I saw 60k 3MB nodes checking in on a c5.4xlarge, with CPU being the ultimate bottleneck. It looks like this moves the bottleneck from waiting on publishing to the Automate server's hardware. I would recommend going a bit higher than 6, though; from the testing we did, 20 looks to be a good number.

@@ -35,7 +36,8 @@ func NewChefRunPipeline(client backend.Client, authzClient iam_v2.ProjectsClient
processor.BuildRunProjectTagger(authzClient),
publisher.BuildNodeManagerPublisher(nodeMgrClient),
processor.BuildRunMsgToBulkRequestTransformer(client),
publisher.BuildBulkRunPublisher(client, maxNumberOfBundledRunMsgs),
publisher.BuildMsgDistributor(publisher.BuildBulkRunPublisher(

Super nit, but I can read this better if you indent it this way:

publisher.BuildMsgDistributor(
  publisher.BuildBulkRunPublisher(client, maxNumberOfBundledRunMsgs),
  numberOfParallelBulkPublishers,
  maxNumberOfBundledRunMsgs,
),

Contributor Author

That does look better. Thanks.

for msg := range in {
	sendMessage(pipeInChannels, msg)
}
close(out)

I'm not so sure about it, but should this close call be in a defer statement? 🤔 (I'm just starting to think about what happens when one of these goroutines fails for any reason.)

Contributor Author

I think you might be correct. I don't want to make this change right now though, because this has already been thoroughly tested. This is a problem with all the processors in the pipeline and should be tested separately.

Good catch!

inChannels := make([]chan message.ChefRun, numProcessors)
outChannels := make([]<-chan message.ChefRun, numProcessors)
for index := range inChannels {
	in := make(chan message.ChefRun, childPipeInboxSize)

Since this inbox size is coming all the way from the config file in Habitat, should we do a parameter check for -1 or 0 (negative int or zero)?

Contributor Author

That is a good idea. I want to do that in another PR, though. We have not fully published these parameters yet (meaning they are not in the docs), so only the CS teams should be playing around with them for now.
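
The kind of guard being suggested might look roughly like this in that follow-up (illustrative; defaultInboxSize is a hypothetical fallback constant and log is assumed to be a logrus-style logger):

	if childPipeInboxSize <= 0 {
		// Guard against a negative or zero inbox size coming from the config.
		log.Warnf("invalid inbox size %d; falling back to %d", childPipeInboxSize, defaultInboxSize)
		childPipeInboxSize = defaultInboxSize
	}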

@@ -0,0 +1,122 @@
package publisher

nit: By convention, this file should be called msg_distributor_internal_test.go since it is testing the internals of the package.


Thouuuuugh, thanks for adding tests!!!!! 💯

@afiune afiune left a comment

Looking sharp! Loving the improvements you have made, Lance. 💟

Lance Finfrock added 4 commits July 18, 2019 15:48
Signed-off-by: Lance Finfrock <[email protected]>
Signed-off-by: Lance Finfrock <[email protected]>
Signed-off-by: Lance Finfrock <[email protected]>
@@ -29,6 +29,7 @@ func DefaultConfigRequest() *ConfigRequest {
c.V1.Sys.Service.MaxNumberOfBundledRunMsgs = w.Int32(2500)
c.V1.Sys.Service.MaxNumberOfBundledActionMsgs = w.Int32(10000)
c.V1.Sys.Service.NumberOfRunMsgsTransformers = w.Int32(9)
c.V1.Sys.Service.NumberOfRunMsgPublishers = w.Int32(2)
Contributor Author

We are currently going to default to two publishers. This should be good for a local Elasticsearch.

return bulkRunPublisherBundler(in, client, maxNumberOfBundledRunMsgs)
name := fmt.Sprintf("pub-%d", count)
count++
return bulkRunPublisherBundler(in, client, maxNumberOfBundledRunMsgs, name)
Contributor Author

Adding a different name for each publisher to be able to see in the logs how many publishers are being used.

@@ -0,0 +1,94 @@
package publisher
Contributor Author

This is the main addition.

@lancewf lancewf changed the title WIP: Parallelization of bulk publisher for high loads Parallelization of bulk publisher for high loads Jul 18, 2019
@lancewf lancewf removed the WIP label Jul 18, 2019
@afiune afiune left a comment

(GIF)

@jeremymv2
Contributor

(image)

@lancewf lancewf merged commit 73f1053 into master Jul 19, 2019
@chef-ci chef-ci deleted the lancewf/parallelization_of_bulk_publish branch July 19, 2019 15:53