
[Proposal] Stop shipping log.offset by default #3934

Open
ruflin opened this issue Dec 20, 2023 · 12 comments
Labels
Team:Elastic-Agent Label for the Agent team

Comments

@ruflin
Contributor

ruflin commented Dec 20, 2023

Currently Elastic Agent (Filebeat under the hood) ships `log.offset` by default for each event when the `log` or `filestream` input is used (are there other inputs that do this too?). Some users rely on `log.offset` to sort log entries that share an identical timestamp; with the switch to nanosecond timestamps as the default (elastic/elasticsearch#102548), this should become less of a problem.

Because `log.offset` also takes up a relevant amount of storage for each event, the proposal is to stop shipping it by default and make it opt-in. This can be considered a breaking change and needs further discussion.

Users who want to remove `log.offset` already today can do so either on the edge with a processor or centrally through an ingest pipeline. The processor on the edge looks as follows:

```yaml
processors:
  - drop_fields:
      fields: ["log.offset"]
      ignore_missing: true
```
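The central alternative could use Elasticsearch's `remove` ingest processor; a minimal sketch (the pipeline name `drop-log-offset` is made up for illustration):

```
PUT _ingest/pipeline/drop-log-offset
{
  "processors": [
    { "remove": { "field": "log.offset", "ignore_missing": true } }
  ]
}
```

The pipeline would still need to be attached to the relevant data streams (e.g. via a `@custom` pipeline), which is the extra setup the edge processor avoids.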

Even though the likely code changes are in Filebeat, I filed this issue in the Elastic Agent repo to make it clear that the changes would also apply to data shipped by Elastic Agent.

Related issues

@elasticmachine
Contributor

Pinging @elastic/elastic-agent (Team:Elastic-Agent)

@cmacknz
Member

cmacknz commented Dec 20, 2023

No opposition to doing this, besides figuring out how to handle the breaking change aspect of it.

We could introduce this as a feature flag directly in the agent policy, and then make a decision on when to flip it from true to false. This would give a relatively easy way to turn it off globally.

```yaml
# This section enables or disables feature flags supported by Agent and its components.
#agent.features:
#  fqdn:
#    enabled: false
```
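Following that pattern, a flag for this proposal could look as follows; this is a sketch only, and `log_offset` is a hypothetical flag name rather than something the agent supports today:

```yaml
agent.features:
  # Hypothetical flag; not currently implemented by Elastic Agent.
  log_offset:
    enabled: false  # false would stop shipping log.offset by default
```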

We could also argue that this is something that should be in each individual integration that collects logs, but there are a lot of those.

@amitkanfer
Contributor

We could introduce this as a feature flag directly in the agent policy,

@nimarezainia another one to the list ^^

@ruflin
Contributor Author

ruflin commented Dec 21, 2023

We could also argue that this is something that should be in each individual integration that collects logs, but there are a lot of those.

I would argue `log.offset` is something related to the input, not to integrations.

@andrewkroh
Member

Recently, several other new non-ECS fields were added to the filestream input, and they have an impact on users. I think their only purpose is troubleshooting, so having them present all the time does not make sense. IMO it would be nice to make these opt-in in some way too, like log.offset.

```jsonc
// windows
"log": {
  "offset": 0,
  "file": {
    "path": "C:\\Users\\Admin\\Desktop\\test\\fs_metadata\\logs\\some.log",
    "idxhi": 327680,
    "idxlo": 99308,
    "vol": 2860917223,
    "fingerprint": "0e9d2f73e97662c9a156111ea556fc536782043f2a1cd69fecbe3b69cab64c9d"
  }
}

// posix
"log": {
  "offset": 0,
  "file": {
    "inode": 119400347,
    "fingerprint": "2edc986847e209b4016e141a6dc8716d3207350f416969382d431539bf292e4a",
    "path": "/path/to/your/logs/some.log",
    "device_id": 16777234
  }
}
```
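To get a rough sense of the per-event overhead being discussed, here is a quick back-of-envelope sketch; the sample event is hypothetical, and raw JSON size is only a proxy for actual Elasticsearch storage cost:

```python
import json

# Back-of-envelope estimate (not an official benchmark): compare the
# JSON-serialized size of a minimal log event with and without the
# log.offset / log.file troubleshooting fields shown above. The base
# event is a made-up sample; on-disk size in Elasticsearch will differ
# due to compression, doc_values, and mappings.
base_event = {
    "@timestamp": "2024-01-10T12:00:00.000000000Z",
    "message": "connection accepted from 10.0.0.5:51234",
}

troubleshooting_fields = {
    "log": {
        "offset": 0,
        "file": {
            "inode": 119400347,
            "fingerprint": "2edc986847e209b4016e141a6dc8716d3207350f416969382d431539bf292e4a",
            "path": "/path/to/your/logs/some.log",
            "device_id": 16777234,
        },
    }
}

slim = len(json.dumps(base_event))
full = len(json.dumps({**base_event, **troubleshooting_fields}))
overhead = full - slim
print(f"{overhead} extra bytes per event ({overhead / full:.0%} of the full event)")
```

For short log lines like this one, the troubleshooting fields can easily exceed the size of the message itself, which is why the per-field storage question matters.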

@ruflin
Contributor Author

ruflin commented Jan 10, 2024

@cmacknz @rdner Have we tested the overhead of adding these fields? As @andrewkroh described, we should ideally make these opt-in for debugging. @jpountz for awareness.

@cachedout I would like us to have some simple storage-usage benchmarks in place for collecting a simple file, so that we find out quickly when a change increases storage usage. @amitkanfer Does your team already have something like this in place?

@cmacknz
Member

cmacknz commented Jan 10, 2024

Have we tested what is the overhead of the addition of these field?

There is no regular testing for the storage size of individual documents or data streams, only ad hoc testing. I believe this was added as a direct result of being unable to diagnose issues in support cases.

For events read from the same file these fields are exactly the same the majority of the time. Having some way to opt-in to those fields would make sense.

I don't think we have a concept of an option that augments the actual event data with troubleshooting information. I think that is probably useful in general. The easiest alternative would be to convert this information into debug logs but they would be extremely verbose at high event rates (essentially making these the only visible logs).

@cachedout
Contributor

@cachedout I would like us to have some simple benchmarks in place for storage usage for collecting a simple file where we find out quickly in case some changes increase storage usage.

We could certainly consider this, but I was under the impression that the Ingest folks have their own people dedicated to benchmarking and performance. Is that the case @andresrc?

@rdner
Member

rdner commented Jan 12, 2024

@ruflin yes, we ran benchmarks and tested the overhead after adding these new fields elastic/beats#36065 (comment)

However, these benchmarks were aimed at the Filebeat throughput and not the Elasticsearch storage layout.

These new values mentioned in #3934 (comment) and log.offset help us to identify issues when investigating SDHs. We could make them opt-in, sure, but that would be an extra step in the investigation.

Also, adding these particular values was requested by a customer.

@ruflin
Contributor Author

ruflin commented Jan 15, 2024

I understand the value of these added fields, but we need to understand their storage impact. Even if they increase storage usage by only a single-digit percentage, that can have a large $ effect across all users.

@amitkanfer Can you help me find who should take on the work to have some storage benchmarks in place? It is important that we know when changes / new features in inputs increase / reduce the storage usage in Elasticsearch, especially for logs.

@amitkanfer
Contributor

@pierrehilbert - this aligns perfectly with one of the KRs you're sponsoring around lowering storage volume. We also mentioned the need to first evaluate the current state. What do you think about using this proposal as a test case?

@andresrc
Contributor

We could certainly consider this but I was under the impression that the Ingest folks have their own folks dedicated to benchmarking and performance. Is this the case @andresrc ?

I think so, but I'm not sure to what extent.
