Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Promtail: Add new target for receiving GCP PubSub Push messages #6777

Merged
merged 29 commits into from
Aug 2, 2022

Conversation

thepalbi
Copy link
Contributor

@thepalbi thepalbi commented Jul 26, 2022

What this PR does / why we need it:
Promtail currently supports scraping GCP logs using a pull model, in which promtail subscribes to a GCP PubSub topic, to where logs are routed. There's another mechanism possible for receiving messages from PubSub topics (for example logs, but could potentially be other kind of messages) which is by using a push subscription.

This PR adds a new Promtail target which is able to receive messages from a GCP PubSub push subscription. It has been mainly thought for logs, but since it's generic it's able to support whatever message the GCP topic sends, since they all use a similar envelope. The target exposes and HTTP server (with just one POST /gcp/api/v1/push endpoint), which is then configured as target of this Push subscriptions.

Just to give an example, the envelope that GCP uses for this messages has the following shape:

{
	"message": {
		"attributes": {
			"logging.googleapis.com/timestamp": "2022-07-25T22:19:09.903683708Z"
		},
		"data": "ewogICJzZXZlcml0eSI6ICJERUJVRyIsCiAgInRleHRQYXlsb2FkIjogIkZ1bmN0aW9uIGV4ZWN1dGlvbiB0b29rIDExOTggbXMuIEZpbmlzaGVkIHdpdGggc3RhdHVzIGNvZGU6IDIwMCIsCiAgInRyYWNlIjogInByb2plY3RzL3dpcmVkLWhlaWdodC90cmFjZXMvNTRhYThhMjA4NzUyNTNjNmI0N2NhYThiZDI4YWQ2NTIiCn0=",
		"messageId": "5187581549398349",
		"message_id": "5187581549398349",
		"publishTime": "2022-07-25T22:19:15.56Z",
		"publish_time": "2022-07-25T22:19:15.56Z"
	},
	"subscription": "projects/wired-height-350515/subscriptions/test"
}

This message has been extracted from a push subscription, listening for logs from a cloud-function.

The target can be configured as follows:

  gcp_push:
    server:
      http_listen_address: 0.0.0.0
      http_listen_port: 8080
      log_level: debug
      register_instrumentation: true
    labels: 
      job: gcp_push_logs_test
    use_incoming_timestamp: true

This target exposes some of the received "label" as "internal labels" to be re-labeled by the user using the scrape config (much like the syslog target does), and dropped if not used. These are:

  • __gcp_message_id: message id
  • __gcp_attributes_*: All attributes read from .message.attributes are added as a temporal label, conveniently adapting the label name since it might contain non-loki-supported characters. For example for logging.googleapis.com/timestamp it's converted to __gcp_attributes_logging_googleapis_com_timestamp.

Some things to keep in mind:

  • GCP offers some nice delivery garantuees for not overloading the receiving sender like a sliding window algo with slow start.
  • It also offers acks and nacks, and retries, but because of promtail async processing, this are lost when the entry enters the underlying channel towards loki's client. The idea is once WAL-support is added, this will improve vastly.

Which issue(s) this PR fixes:
n/a

Special notes for your reviewer:
Documentation to be added.

Checklist

  • Documentation added
  • Tests updated
  • Is this an important fix or new feature? Add an entry in the CHANGELOG.md.
  • Changes that require user attention or interaction to upgrade are documented in docs/sources/upgrading/_index.md

@grafanabot
Copy link
Collaborator

./tools/diff_coverage.sh ../loki-main/test_results.txt test_results.txt ingester,distributor,querier,querier/queryrange,iter,storage,chunkenc,logql,loki

Change in test coverage per package. Green indicates 0 or positive change, red indicates that test coverage for a package fell.

+           ingester	0%
+        distributor	0%
+            querier	0%
+ querier/queryrange	0%
+               iter	0%
+            storage	0%
+           chunkenc	0%
+              logql	0%
+               loki	0%

@thepalbi thepalbi force-pushed the pablo/gcp-push-target branch from 8c65211 to 95b9b60 Compare July 26, 2022 21:03
@grafanabot
Copy link
Collaborator

./tools/diff_coverage.sh ../loki-main/test_results.txt test_results.txt ingester,distributor,querier,querier/queryrange,iter,storage,chunkenc,logql,loki

Change in test coverage per package. Green indicates 0 or positive change, red indicates that test coverage for a package fell.

+           ingester	0%
+        distributor	0%
+            querier	0%
+ querier/queryrange	0%
+               iter	0%
+            storage	0%
+           chunkenc	0%
+              logql	0%
+               loki	0%

@thepalbi thepalbi marked this pull request as ready for review July 26, 2022 21:19
@thepalbi thepalbi requested a review from a team as a code owner July 26, 2022 21:19
@kavirajk
Copy link
Contributor

@thepalbi thanks a ton for adding support for GCP pubsub push subscription type. Skimmed through the PR and overall looks great 👍

Just wanted to clarify one thing. Can we still keep single target type for GCP(gcplog) and just add extra config subscription_type to differentiate push and pull? (default being pull)? Rather than adding completely new target type.
I see we use more or less similar config for both except the server block for push subscription type.

We can keep everything more or less same, except if subscription_type: push we create newGCPpushTarget.

wdyt?

@@ -37,7 +37,7 @@ type Config struct {
PipelineStages stages.PipelineStages `yaml:"pipeline_stages,omitempty"`
JournalConfig *JournalTargetConfig `yaml:"journal,omitempty"`
SyslogConfig *SyslogTargetConfig `yaml:"syslog,omitempty"`
GcplogConfig *GcplogTargetConfig `yaml:"gcplog,omitempty"`
GcplogConfig *GCPLogTargetConfig `yaml:"gcplog,omitempty"`
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

drive by: renamed the main target to use a better case

}

// nolint:revive
func newGcplogTarget(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: maybe this one should also be renamed?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is just returning the struct. why don't we make it as part of newPullTarget itself?. Any reason to have it as separate function?

Comment on lines +95 to +99
// If the incoming request carries the tenant id, inject it as the reserved label, so it's used by the
// remote write client.
if xScopeOrgID != "" {
labels[lokiClient.ReservedLabelTenantID] = model.LabelValue(xScopeOrgID)
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This functionality mimics what was done in d4cdc37. The idea is to make this target multitenant-supporting, so one can inject the tenantID from upstream using this reserved header (which is well-kown in loki world) and let the loki-client forward it when doing the remote write.

@thepalbi
Copy link
Contributor Author

@thepalbi thanks a ton for adding support for GCP pubsub push subscription type. Skimmed through the PR and overall looks great 👍

Just wanted to clarify one thing. Can we still keep single target type for GCP(gcplog) and just add extra config subscription_type to differentiate push and pull? (default being pull)? Rather than adding completely new target type. I see we use more or less similar config for both except the server block for push subscription type.

We can keep everything more or less same, except if subscription_type: push we create newGCPpushTarget.

wdyt?

@kavirajk completely agree. Just merged both targets under the same package. Refactored (mainly renames) a little bit the Pull one so that both of them fit better. Let me know what you think 😀

@thepalbi thepalbi force-pushed the pablo/gcp-push-target branch from ff3faf7 to 9f481cd Compare July 27, 2022 19:03
@grafanabot
Copy link
Collaborator

./tools/diff_coverage.sh ../loki-main/test_results.txt test_results.txt ingester,distributor,querier,querier/queryrange,iter,storage,chunkenc,logql,loki

Change in test coverage per package. Green indicates 0 or positive change, red indicates that test coverage for a package fell.

+           ingester	0%
+        distributor	0%
+            querier	0%
+ querier/queryrange	0%
+               iter	0%
+            storage	0%
+           chunkenc	0%
+              logql	0%
+               loki	0%

Copy link
Contributor

@kavirajk kavirajk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM 🎉 thanks @thepalbi for addressing the comments.

Left few minor nits and suggestion. We are good to merge after that.

}

// nolint:revive
func newGcplogTarget(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is just returning the struct. why don't we make it as part of newPullTarget itself?. Any reason to have it as separate function?

@kavirajk
Copy link
Contributor

@thepalbi also can you also please update the docs accordingly?
https://grafana.com/docs/loki/latest/clients/promtail/gcplog-cloud/

@grafanabot
Copy link
Collaborator

./tools/diff_coverage.sh ../loki-main/test_results.txt test_results.txt ingester,distributor,querier,querier/queryrange,iter,storage,chunkenc,logql,loki

Change in test coverage per package. Green indicates 0 or positive change, red indicates that test coverage for a package fell.

+           ingester	0%
+        distributor	0%
+            querier	0%
+ querier/queryrange	0%
+               iter	0%
+            storage	0%
+           chunkenc	0%
+              logql	0%
+               loki	0%

@thepalbi
Copy link
Contributor Author

@thepalbi also can you also please update the docs accordingly? https://grafana.com/docs/loki/latest/clients/promtail/gcplog-cloud/

Yup, wanted to have a final 👍 on the implementation before doing this. Will add the docs so the PR is sounds!

@grafanabot
Copy link
Collaborator

./tools/diff_coverage.sh ../loki-main/test_results.txt test_results.txt ingester,distributor,querier,querier/queryrange,iter,storage,chunkenc,logql,loki

Change in test coverage per package. Green indicates 0 or positive change, red indicates that test coverage for a package fell.

+           ingester	0%
+        distributor	0%
+            querier	0%
+ querier/queryrange	0%
+               iter	0%
+            storage	0%
+           chunkenc	0%
+              logql	0%
+               loki	0%

@thepalbi
Copy link
Contributor Author

@kavirajk added the docs on configuration and scraping sections as discussed. Also resolved all open comments. Can you take another pass + tag the correct docs person?

@grafanabot
Copy link
Collaborator

./tools/diff_coverage.sh ../loki-main/test_results.txt test_results.txt ingester,distributor,querier,querier/queryrange,iter,storage,chunkenc,logql,loki

Change in test coverage per package. Green indicates 0 or positive change, red indicates that test coverage for a package fell.

+           ingester	0%
+        distributor	0%
+            querier	0%
+ querier/queryrange	0%
+               iter	0%
+            storage	0%
+           chunkenc	0%
+              logql	0%
+               loki	0%

@thepalbi thepalbi force-pushed the pablo/gcp-push-target branch 2 times, most recently from 35446db to d461e11 Compare July 28, 2022 19:47
@grafanabot
Copy link
Collaborator

./tools/diff_coverage.sh ../loki-main/test_results.txt test_results.txt ingester,distributor,querier,querier/queryrange,iter,storage,chunkenc,logql,loki

Change in test coverage per package. Green indicates 0 or positive change, red indicates that test coverage for a package fell.

+           ingester	0%
+        distributor	0%
+            querier	0%
+ querier/queryrange	0%
+               iter	0%
+            storage	0%
+           chunkenc	0%
+              logql	0%
+               loki	0%

@cstyan
Copy link
Contributor

cstyan commented Jul 29, 2022

Do we have any concerns about this target potentially allowing users to easily hit per stream rate limit issues? It looks to me like it would be very easy to configure many things to all write to one pubsub and have it all pushed to promtail as a single stream.

Should we consider enforcing some amount of the attributes being used as labels?

@kavirajk
Copy link
Contributor

@cstyan that's the perfectly valid concern.

Even now user can take advantage of relabeling those gcp atrributes.
https://github.com/grafana/loki/blob/pablo%2Fgcp-push-target/clients/pkg/promtail/targets/gcplog/formatter.go#L60-L72

And you are right, user may not be aware about the impact before sending too much messages into single stream.

IMO, good place to add this note is in gcp-cloud setup doc https://grafana.com/docs/loki/latest/clients/promtail/gcplog-cloud/

I think @thepalbi decided to update the cloud setup doc in the follow up PR. Can you take care of that when creating follow up PR @thepalbi ?

@thepalbi thepalbi force-pushed the pablo/gcp-push-target branch from bf843ef to 1dfa0dd Compare August 2, 2022 13:41
@grafanabot
Copy link
Collaborator

./tools/diff_coverage.sh ../loki-main/test_results.txt test_results.txt ingester,distributor,querier,querier/queryrange,iter,storage,chunkenc,logql,loki

Change in test coverage per package. Green indicates 0 or positive change, red indicates that test coverage for a package fell.

+           ingester	0%
+        distributor	0%
+            querier	0%
+ querier/queryrange	0%
+               iter	0%
+            storage	0%
+           chunkenc	0%
+              logql	0%
+               loki	0%

Copy link
Contributor

@KMiller-Grafana KMiller-Grafana left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for revising the docs. The docs in this PR look good to me. I put in a suggestion for a typo fix, but I'm approving now so you won't need a re-review from me.

@grafanabot
Copy link
Collaborator

./tools/diff_coverage.sh ../loki-main/test_results.txt test_results.txt ingester,distributor,querier,querier/queryrange,iter,storage,chunkenc,logql,loki

Change in test coverage per package. Green indicates 0 or positive change, red indicates that test coverage for a package fell.

+           ingester	0%
+        distributor	0%
+            querier	0%
+ querier/queryrange	0%
+               iter	0%
+            storage	0%
+           chunkenc	0%
+              logql	0%
+               loki	0%

@kavirajk kavirajk merged commit c9ef062 into main Aug 2, 2022
@kavirajk kavirajk deleted the pablo/gcp-push-target branch August 2, 2022 16:35
Copy link
Contributor

@jeschkies jeschkies left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work!

}

func (h *pushTarget) push(w http.ResponseWriter, r *http.Request) {
entries := h.handler.Chan()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure the channel would change often. Why not h.entires?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You mean saving h.handler.Chan() as a struct field, and directly accessing it?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry. I was out. Yes.

Comment on lines +131 to +138
serverConfig, port, err := getServerConfigWithAvailablePort()
require.NoError(t, err, "error generating server config or finding open port")
config := &scrapeconfig.GcplogTargetConfig{
Server: serverConfig,
Labels: tc.args.Labels,
UseIncomingTimestamp: false,
SubscriptionType: "push",
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Running the server is quite a handful. What do you think about testing the handler function push with a ResponseRecorder instead?

@thepalbi
Copy link
Contributor Author

thepalbi commented Aug 2, 2022

Thanks for the look @jeschkies , I'll address the comments in a follow up 😀

kavirajk added a commit that referenced this pull request Sep 8, 2022
<!--  Thanks for sending a pull request!  Before submitting:

1. Read our CONTRIBUTING.md guide
2. Name your PR as `<Feature Area>: Describe your change`.
a. Do not end the title with punctuation. It will be added in the
changelog.
b. Start with an imperative verb. Example: Fix the latency between
System A and System B.
  c. Use sentence case, not title case.
d. Use a complete phrase or sentence. The PR title will appear in a
changelog, so help other people understand what your change will be.
3. Rebase your PR if it gets out of sync with main
-->
**What this PR does / why we need it**:
This PR includes some pending improvements to newish promtail GCP push
target:
- Pending low hanging optimization for replacing a regex that gets
evaluated on every incoming request from GCP, for a simpler solution.
[Related
comments](#6777 (comment))
- Some small refactors. See:
[this](#6777 (review))
- Add some validations over the payload the target receives. Payload
details are [here](https://cloud.google.com/pubsub/docs/push). This PR
adds just validation over the log line, the incoming GCP message ID to
add some traceability if needed, and the subscription name because it
can be used as label.
- Add `subscription` GCP payload field as label, which includes the
PubSub subscription identifier.
- Add `cause` label to error metrics. Since there's multiple failure
point this might help debugging / notice strange behaviours.

Also, by running the push target in some dev workloads, we noticed it
has some labels different to the ones added in the pull target. This PR
also refactors the logic to re-use the same GCP log entry parsing logic
in both targets. That way the labels exposed to the user in the push
target are $Labels_{pull} \cup Labels_{push}$.

**Which issue(s) this PR fixes**:
Related to: grafana/cloud-onboarding#2067

**Special notes for your reviewer**:
This PR is a follow up [this other
one](#6777). It works on some
comments made after merge, and adds some changes we saw necessary upon
running this target in real environments.
<!--
Note about CHANGELOG entries, if a change adds:
* an important feature
* fixes an issue present in a previous release, 
* causes a change in operation that would be useful for an operator of
Loki to know
then please add a CHANGELOG entry.

For documentation changes, build changes, simple fixes etc please skip
this step. We are attempting to curate a changelog of the most relevant
and important changes to be easier to ingest by end users of Loki.

Note about the upgrade guide, if this changes:
* default configuration values
* metric names or label names
* changes existing log lines such as the metrics.go query output line
* configuration parameters 
* anything to do with any API
* any other change that would require special attention or extra steps
to upgrade
Please document clearly what changed AND what needs to be done in the
upgrade guide.
-->
**Checklist**
- [x] Documentation added
- [x] Tests updated
- [ ] Is this an important fix or new feature? Add an entry in the
`CHANGELOG.md`.
- [ ] Changes that require user attention or interaction to upgrade are
documented in `docs/sources/upgrading/_index.md`

Co-authored-by: Kaviraj Kanagaraj <[email protected]>
lxwzy pushed a commit to lxwzy/loki that referenced this pull request Nov 7, 2022
<!--  Thanks for sending a pull request!  Before submitting:

1. Read our CONTRIBUTING.md guide
2. Name your PR as `<Feature Area>: Describe your change`.
a. Do not end the title with punctuation. It will be added in the
changelog.
b. Start with an imperative verb. Example: Fix the latency between
System A and System B.
  c. Use sentence case, not title case.
d. Use a complete phrase or sentence. The PR title will appear in a
changelog, so help other people understand what your change will be.
3. Rebase your PR if it gets out of sync with main
-->
**What this PR does / why we need it**:
This PR includes some pending improvements to newish promtail GCP push
target:
- Pending low hanging optimization for replacing a regex that gets
evaluated on every incoming request from GCP, for a simpler solution.
[Related
comments](grafana#6777 (comment))
- Some small refactors. See:
[this](grafana#6777 (review))
- Add some validations over the payload the target receives. Payload
details are [here](https://cloud.google.com/pubsub/docs/push). This PR
adds just validation over the log line, the incoming GCP message ID to
add some traceability if needed, and the subscription name because it
can be used as label.
- Add `subscription` GCP payload field as label, which includes the
PubSub subscription identifier.
- Add `cause` label to error metrics. Since there's multiple failure
point this might help debugging / notice strange behaviours.

Also, by running the push target in some dev workloads, we noticed it
has some labels different to the ones added in the pull target. This PR
also refactors the logic to re-use the same GCP log entry parsing logic
in both targets. That way the labels exposed to the user in the push
target are $Labels_{pull} \cup Labels_{push}$.

**Which issue(s) this PR fixes**:
Related to: https://github.com/grafana/cloud-onboarding/issues/2067

**Special notes for your reviewer**:
This PR is a follow up [this other
one](grafana#6777). It works on some
comments made after merge, and adds some changes we saw necessary upon
running this target in real environments.
<!--
Note about CHANGELOG entries, if a change adds:
* an important feature
* fixes an issue present in a previous release, 
* causes a change in operation that would be useful for an operator of
Loki to know
then please add a CHANGELOG entry.

For documentation changes, build changes, simple fixes etc please skip
this step. We are attempting to curate a changelog of the most relevant
and important changes to be easier to ingest by end users of Loki.

Note about the upgrade guide, if this changes:
* default configuration values
* metric names or label names
* changes existing log lines such as the metrics.go query output line
* configuration parameters 
* anything to do with any API
* any other change that would require special attention or extra steps
to upgrade
Please document clearly what changed AND what needs to be done in the
upgrade guide.
-->
**Checklist**
- [x] Documentation added
- [x] Tests updated
- [ ] Is this an important fix or new feature? Add an entry in the
`CHANGELOG.md`.
- [ ] Changes that require user attention or interaction to upgrade are
documented in `docs/sources/upgrading/_index.md`

Co-authored-by: Kaviraj Kanagaraj <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
size/XXL type/docs Issues related to technical documentation; the Docs Squad uses this label across many repositories
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants