Parallelization of bulk publisher for high loads #927
Conversation
messageProcessed := false
for !messageProcessed {
	for _, pipe := range pipes {
		if !pipe.isFull() {
Have you considered using something like this, so you don't need to check for space?

select {
case pipe.in <- msg:
	messageProcessed = true
default:
}
Thank you. I will give that a try.
The select worked great! Thanks Jay!
I don't think you need the default: line.
@afiune You need the default: line so that the send on pipe.in <- msg does not wait for room in the channel.
The code below does the following: if the channel is not full, send the message and set messageProcessed to true; if the channel is full, fall through to default and exit the select.

case pipe.in <- msg:
	messageProcessed = true
default:
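To make the behavior discussed above concrete, here is a minimal, self-contained sketch of the non-blocking send pattern. The `nonBlockingSend` helper name and the `int` channel type are illustrative stand-ins, not code from this PR:

```go
package main

import "fmt"

// nonBlockingSend demonstrates the select/default pattern: the default
// case makes the send non-blocking, so a full buffer causes the select
// to fall through immediately instead of waiting for room.
func nonBlockingSend(ch chan int, v int) bool {
	select {
	case ch <- v: // succeeds only if the buffer has room
		return true
	default: // buffer full: give up immediately
		return false
	}
}

func main() {
	ch := make(chan int, 2)
	fmt.Println(nonBlockingSend(ch, 1)) // true
	fmt.Println(nonBlockingSend(ch, 2)) // true
	fmt.Println(nonBlockingSend(ch, 3)) // false: buffer is full
}
```

Without the default case, the select would block until a receiver drained the channel, which is exactly the waiting behavior the distributor needs to avoid.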
Clever!!! ⭐️
Signed-off-by: Lance Finfrock <[email protected]>
Force-pushed from f1ec0c5 to a0be4a5 (Compare)
Neato
}

func sendMessage(pipeInChannels []chan message.ChefRun, msg message.ChefRun) {
	for true {
[nit] Suggested change: for true { → for {
Not sure how we feel about this style, but I think if you wanted you could do:

for !distributeMessage(pipeInChannels, msg) {
	// All pipes are full. Wait and try again
	time.Sleep(time.Millisecond * 10)
}
Yeah, I thought about making that change, but it does not seem as readable: the name distributeMessage does not say when the for-loop stops.
What about?

for messageProcessed := distributeMessage(pipeInChannels, msg); !messageProcessed; messageProcessed = distributeMessage(pipeInChannels, msg) {
	// All pipes are full. Wait and try again
	time.Sleep(time.Millisecond * 10)
}
Eh, I'd just leave it, it was pretty clear as is, I was just musing aloud :D
}

func mergeOutChannels(pipeOutChannels []<-chan message.ChefRun) chan message.ChefRun {
	mergedOut := make(chan message.ChefRun, 100)
💡 As a follow-up to this PR, I think it might make sense to do some arithmetic on the maximum number of messages we might end up with sitting in channel buffers if we get maximally backed up, and then tune or remove some of these buffers accordingly.
		default:
		}
	}
	return false
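For context, here is a self-contained sketch of the fill-first distribution this function implements, with `int` standing in for `message.ChefRun`. It is an illustration of the pattern, not the PR's exact code:

```go
package main

import "fmt"

// distributeMessage tries each pipe's inbox in order with a
// non-blocking send and returns true on the first success.
// Returning false means every inbox was full, which is the
// point where logging a warning makes sense.
func distributeMessage(pipeInChannels []chan int, msg int) bool {
	for _, in := range pipeInChannels {
		select {
		case in <- msg:
			return true
		default: // this pipe is full; try the next one
		}
	}
	// All pipes are full; the caller should wait and retry.
	return false
}

func main() {
	pipes := []chan int{make(chan int, 1), make(chan int, 1)}
	fmt.Println(distributeMessage(pipes, 1)) // true: first pipe takes it
	fmt.Println(distributeMessage(pipes, 2)) // true: overflow to second pipe
	fmt.Println(distributeMessage(pipes, 3)) // false: all pipes full
}
```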
Would it make sense to log a warning here? If we hit this, it means we are potentially no longer keeping up with requests, right?
That is right. I will add a warning log.
I did some load testing with this PR and a large Elasticsearch cluster and was able to achieve much higher throughput than before through the frontend. At peak I saw 60k 3MB nodes checking in on a c5.4xlarge, with CPU being the ultimate bottleneck. It looks like this moves the bottleneck from waiting on publishing to the Automate server's hardware. I would recommend a bit higher than 6, though; from the testing we did, 20 looks to be a good number.
@@ -35,7 +36,8 @@ func NewChefRunPipeline(client backend.Client, authzClient iam_v2.ProjectsClient
	processor.BuildRunProjectTagger(authzClient),
	publisher.BuildNodeManagerPublisher(nodeMgrClient),
	processor.BuildRunMsgToBulkRequestTransformer(client),
	publisher.BuildBulkRunPublisher(client, maxNumberOfBundledRunMsgs),
	publisher.BuildMsgDistributor(publisher.BuildBulkRunPublisher(
Super nit, but I can read this better if you indent it this way:

publisher.BuildMsgDistributor(
	publisher.BuildBulkRunPublisher(client, maxNumberOfBundledRunMsgs),
	numberOfParallelBulkPublishers,
	maxNumberOfBundledRunMsgs,
),
That does look better. Thanks.
for msg := range in {
	sendMessage(pipeInChannels, msg)
}
close(out)
I'm not so sure about it, but should this close call belong in a defer statement? 🤔 (I'm just starting to think about what happens when one of these goroutines fails for any reason.)
I think you might be correct. I don't want to make this change right now though, because this has already been thoroughly tested. This is a problem with all the processors in the pipeline and should be tested separately.
Good catch!
inChannels := make([]chan message.ChefRun, numProcessors)
outChannels := make([]<-chan message.ChefRun, numProcessors)
for index := range inChannels {
	in := make(chan message.ChefRun, childPipeInboxSize)
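For reference, a minimal sketch of the fan-in side that `mergeOutChannels` performs on these per-child channels: one goroutine per child output copies values into a single merged channel, which is closed once every child is drained. Types are simplified to `int` in place of `message.ChefRun`; this is an illustration, not the PR's exact implementation:

```go
package main

import (
	"fmt"
	"sync"
)

// mergeOutChannels fans in several read-only channels into one
// buffered output channel. A WaitGroup tracks the copier goroutines
// so the merged channel is closed exactly once, after all inputs end.
func mergeOutChannels(pipeOutChannels []<-chan int) chan int {
	mergedOut := make(chan int, 100)
	var wg sync.WaitGroup
	for _, out := range pipeOutChannels {
		wg.Add(1)
		go func(c <-chan int) {
			defer wg.Done()
			for v := range c {
				mergedOut <- v
			}
		}(out)
	}
	go func() {
		wg.Wait()
		close(mergedOut)
	}()
	return mergedOut
}

func main() {
	a, b := make(chan int, 2), make(chan int, 2)
	a <- 1
	b <- 2
	close(a)
	close(b)
	sum := 0
	for v := range mergeOutChannels([]<-chan int{a, b}) {
		sum += v
	}
	fmt.Println(sum) // 3
}
```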
Since this inbox size is coming all the way from the config file in Habitat, should we add a parameter check for -1 or 0 (negative int or zero)?
That is a good idea. I will want to do that in another PR. We have not fully published these parameters yet (meaning they are not in the docs), so only the CS teams should be playing around with them for now.
@@ -0,0 +1,122 @@
package publisher
nit: By convention, this file should be called msg_distributor_internal_test.go since it is testing the internals of the package.
Thouuuuugh, thanks for adding tests!!!!! 💯
Looking sharp! Loving the improvements you have made Lance. 💟
@@ -29,6 +29,7 @@ func DefaultConfigRequest() *ConfigRequest {
	c.V1.Sys.Service.MaxNumberOfBundledRunMsgs = w.Int32(2500)
	c.V1.Sys.Service.MaxNumberOfBundledActionMsgs = w.Int32(10000)
	c.V1.Sys.Service.NumberOfRunMsgsTransformers = w.Int32(9)
	c.V1.Sys.Service.NumberOfRunMsgPublishers = w.Int32(2)
We are currently going to default to two publishers. This should be good for a local Elasticsearch.
return bulkRunPublisherBundler(in, client, maxNumberOfBundledRunMsgs)
name := fmt.Sprintf("pub-%d", count)
count++
return bulkRunPublisherBundler(in, client, maxNumberOfBundledRunMsgs, name)
Adding a different name for each publisher to be able to see in the logs how many publishers are being used.
@@ -0,0 +1,94 @@
package publisher
This is the main addition.
📋 TODO
- Add numberOfParallelBulkPublishers to the Automate config

🔩 Description
Created a message distributor to send messages to multiple bulk publishers. The distributor creates a set number of bulk publishers that the messages are sent to. Messages go to the first publisher until its inbox is full; the overflow is then sent to the second publisher until it is full, and so on through the remaining publishers until they are all full. When they are all full, the distributor waits and tries to send the message again. This algorithm can easily be changed by updating the distributeMessage function.

The reason for filling one publisher before sending messages to the others is to increase the size of the bulk messages sent. With a round-robin or equal-distribution approach, all the bulk publishers would send small messages, with the overhead of more connections to Elasticsearch. For example, we would rather send 100 messages over one Elasticsearch connection than 25 messages apiece over 4 connections. Once the load on the system is high enough that the bottleneck is publishing the bulk messages, the second bulk publisher's inbox will fill up and the second bulk publisher will start sending bulk messages in parallel.
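The retry behavior described above can be sketched end to end. This is a simplified illustration with `int` in place of `message.ChefRun`, combining a fill-first `distributeMessage` with the wait-and-retry loop; it is not the PR's exact code:

```go
package main

import (
	"fmt"
	"time"
)

// distributeMessage is a stand-in for the fill-first helper: try each
// pipe in order with a non-blocking send; false means all were full.
func distributeMessage(pipes []chan int, msg int) bool {
	for _, in := range pipes {
		select {
		case in <- msg:
			return true
		default:
		}
	}
	return false
}

// sendMessage keeps retrying until some pipe accepts the message,
// sleeping briefly whenever every inbox is full.
func sendMessage(pipes []chan int, msg int) {
	for !distributeMessage(pipes, msg) {
		// All pipes are full. Wait and try again.
		time.Sleep(10 * time.Millisecond)
	}
}

func main() {
	pipes := []chan int{make(chan int, 1)}
	sendMessage(pipes, 42)
	fmt.Println(<-pipes[0]) // 42
}
```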
Testing is needed. The distributor may only be useful with multiple Elasticsearch nodes.
Update numberOfParallelBulkPublishers to add more publishers sending messages to Elasticsearch.

⛓️ Related Resources
#924
✅ Checklist