
[Elastic-agent] Sometimes backend processes are not started #21120

Closed
mdelapenya opened this issue Sep 14, 2020 · 12 comments
Labels
bug, failed-test, v7.10.0

Comments

@mdelapenya
Contributor

I.e. https://beats-ci.elastic.co/blue/organizations/jenkins/e2e-tests%2Fe2e-testing-mbp/detail/PR-289/1/pipeline/350

It causes some flakiness in the test suite

Log error

[2020-09-14T18:59:45.170Z] time="2020-09-14T18:59:44Z" level=warning msg="filebeat process is not running in the container yet" container=ingest-manager_elastic-agent_1 desiredState=started elapsedTime=126.374302ms error="filebeat process is not running in the container yet" process=filebeat retry=1

Agent log

[2020-09-14T18:59:44.908Z] elastic-agent_1 | 2020-09-14T18:59:44.632Z ERROR application/emitter.go:88 Failed to render configuration with latest context from composable controller: operator: failed to execute step sc-run, error: failed to start '/usr/share/elastic-agent/data/elastic-agent-50ea11/install/metricbeat-8.0.0-SNAPSHOT-linux-x86_64/metricbeat': fork/exec /usr/share/elastic-agent/data/elastic-agent-50ea11/install/metricbeat-8.0.0-SNAPSHOT-linux-x86_64/metricbeat: text file busy: failed to start '/usr/share/elastic-agent/data/elastic-agent-50ea11/install/metricbeat-8.0.0-SNAPSHOT-linux-x86_64/metricbeat': fork/exec /usr/share/elastic-agent/data/elastic-agent-50ea11/install/metricbeat-8.0.0-SNAPSHOT-linux-x86_64/metricbeat: text file busy

cc/ @ph @EricDavisX

@mdelapenya
Contributor Author

This flakiness affects both stand-alone and fleet mode tests

@ph
Contributor

ph commented Sep 15, 2020

@blakerouse could you take a look? This looks like a concurrency issue. I am not sure whether it is due to the composable input. Maybe a concurrency issue in the structure?

@mdelapenya
Contributor Author

I can confirm that since last week this is happening much more frequently (I'd say almost always), whereas in the past it happened only from time to time and could usually be resolved by restarting the CI job.

@ph ph transferred this issue from elastic/e2e-testing Sep 16, 2020
@ph
Contributor

ph commented Sep 16, 2020

@mdelapenya / @EricDavisX I've moved the issue into the beats repository.

@botelastic botelastic bot added the needs_team label Sep 16, 2020
@ph ph added the bug and Team:Ingest Management labels and removed the needs_team label Sep 16, 2020
@elasticmachine
Collaborator

Pinging @elastic/ingest-management (Team:Ingest Management)

@jsoriano
Member

I have also found this issue while testing this change #21213

It looks like the operation-install happens twice, but the second one is skipped, and then the agent tries to start the service while the first installation is still in progress:

2020-09-22T11:05:31.672Z	DEBUG	operation/operation_retryable.go:83	running operation 'operation-verify' of the block 'retryable block: operation-fetch operation-verify'
2020-09-22T11:05:31.683Z	DEBUG	operation/operator.go:230	running operation 'operation-install' for metricbeat.8.0.0
2020-09-22T11:05:31.814Z	INFO	operation/operator.go:226	operation 'operation-install' skipped for metricbeat.8.0.0
2020-09-22T11:05:31.814Z	DEBUG	operation/operator.go:230	running operation 'operation-start' for metricbeat.8.0.0
2020-09-22T11:05:31.973Z	INFO	log/reporter.go:40	2020-09-22T11:05:31Z: type: 'STATE': sub_type: 'STARTING' message: Application: metricbeat--8.0.0[c5c35fa6-046d-429b-9f3a-ca680467ba80]: State changed to STARTING: Starting
2020-09-22T11:05:31.973Z	ERROR	application/emitter.go:88	Failed to render configuration with latest context from composable controller: operator: failed to execute step sc-run, error: failed to start '/usr/share/elastic-agent/data/elastic-agent-b1b66c/install/metricbeat-8.0.0-linux-x86_64/metricbeat': fork/exec /usr/share/elastic-agent/data/elastic-agent-b1b66c/install/metricbeat-8.0.0-linux-x86_64/metricbeat: text file busy: failed to start '/usr/share/elastic-agent/data/elastic-agent-b1b66c/install/metricbeat-8.0.0-linux-x86_64/metricbeat': fork/exec /usr/share/elastic-agent/data/elastic-agent-b1b66c/install/metricbeat-8.0.0-linux-x86_64/metricbeat: text file busy

The check for the install operation seems to check only if the directory exists:

func (*Checker) Check(_ context.Context, _, _, installDir string) error {
	_, err := os.Stat(installDir)
	return err
}

So it seems that the second installation request thinks the service is correctly installed because the first one has already created the directory.

If we need concurrent installations of the same service, one option might be to hold an exclusive lock file in the installation directory.

@mdelapenya
Contributor Author

Update: with the latest changes we now see this error:

[2020-09-24T07:06:01.287Z] elastic-agent_1 | 2020-09-24T07:03:38.989Z ERROR operation/operation_retryable.go:85 operation operation-verify failed
[2020-09-24T07:06:01.287Z] elastic-agent_1 | 2020-09-24T07:03:38.989Z DEBUG application/periodic.go:40 Failed to read configuration, error: could not emit configuration: operator: failed to execute step sc-run, error: operation 'operation-verify' failed to verify metricbeat.8.0.0-SNAPSHOT: 3 errors occurred:
[2020-09-24T07:06:01.287Z] elastic-agent_1 | * fetching asc file from '/usr/share/elastic-agent/data/elastic-agent-d816b6/downloads/metricbeat-8.0.0-SNAPSHOT-linux-x86_64.tar.gz.asc': open /usr/share/elastic-agent/data/elastic-agent-d816b6/downloads/metricbeat-8.0.0-SNAPSHOT-linux-x86_64.tar.gz.asc: no such file or directory
[2020-09-24T07:06:01.287Z] elastic-agent_1 | * check detached signature: openpgp: invalid signature: hash tag doesn't match
[2020-09-24T07:06:01.287Z] elastic-agent_1 | * fetching asc file from https://artifacts.elastic.co/downloads/beats/metricbeat/metricbeat-8.0.0-SNAPSHOT-linux-x86_64.tar.gz.asc: call to 'https://artifacts.elastic.co/downloads/beats/metricbeat/metricbeat-8.0.0-SNAPSHOT-linux-x86_64.tar.gz.asc' returned unsuccessful status code: 404: /go/src/github.com/elastic/beats/x-pack/elastic-agent/pkg/artifact/download/http/verifier.go[205]: unknown error
[2020-09-24T07:06:01.287Z] elastic-agent_1 |
[2020-09-24T07:06:01.287Z] elastic-agent_1 | : operation 'operation-verify' failed to verify metricbeat.8.0.0-SNAPSHOT: 3 errors occurred:
[2020-09-24T07:06:01.287Z] elastic-agent_1 | * fetching asc file from '/usr/share/elastic-agent/data/elastic-agent-d816b6/downloads/metricbeat-8.0.0-SNAPSHOT-linux-x86_64.tar.gz.asc': open /usr/share/elastic-agent/data/elastic-agent-d816b6/downloads/metricbeat-8.0.0-SNAPSHOT-linux-x86_64.tar.gz.asc: no such file or directory
[2020-09-24T07:06:01.287Z] elastic-agent_1 | * check detached signature: openpgp: invalid signature: hash tag doesn't match
[2020-09-24T07:06:01.287Z] elastic-agent_1 | * fetching asc file from https://artifacts.elastic.co/downloads/beats/metricbeat/metricbeat-8.0.0-SNAPSHOT-linux-x86_64.tar.gz.asc: call to 'https://artifacts.elastic.co/downloads/beats/metricbeat/metricbeat-8.0.0-SNAPSHOT-linux-x86_64.tar.gz.asc' returned unsuccessful status code: 404: /go/src/github.com/elastic/beats/x-pack/elastic-agent/pkg/artifact/download/http/verifier.go[205]: unknown error

It seems that the .asc file is not present in any of the expected locations (local filesystem or the artifacts server), so no process is able to start.

@blakerouse
Contributor

@mdelapenya Can you file your last comment as a separate issue? It is unrelated to the initial issue. I just want to be able to track it correctly and not get them confused.

@mdelapenya
Contributor Author

Sure, I can do that. I thought it was related because the observed result is the same: backend processes are not started.

I'll do as you suggest and convert that comment into a new issue. Thanks!

@EricDavisX EricDavisX added the failed-test label and removed the impacts_automation label Oct 5, 2020
@ph
Contributor

ph commented Oct 14, 2020

@michalpristas @blakerouse can this issue be closed?

@ph ph added the v7.10.0 label Oct 14, 2020
@blakerouse
Contributor

Yes this can be closed.

@EricDavisX
Contributor

@mdelapenya quick check to confirm that this is not being trapped by the e2e-testing framework at this point.
