
S3 input and large buckets #14

Open
talevy opened this issue Feb 7, 2015 · 16 comments

@talevy
Contributor

talevy commented Feb 7, 2015

migrated from: https://logstash.jira.com/browse/LOGSTASH-2125

S3 input is taking a long time until the first logfile is processed:

input {
    s3 {
        credentials => ["XXXX","XXXX"]
        bucket => "my-production-bucket"
        interval => 300
    }
}
output {
    stdout {}
}

Running it with

sudo ./logstash agent -f /etc/logstash/conf.d/central.conf  --debug

shows me that the bucket is being accessed. As soon as I start Logstash, tcpdump shows a lot of traffic between the host and S3.
That bucket currently has 4451 .gz files just in the root folder; subfolders have even more files.
If I create another bucket and put only one of the log files in it, that logfile is downloaded and processed more or less immediately.

@ururk

ururk commented Feb 26, 2015

It looks like the plugin loops through every object in the bucket before processing them. As you add objects, this list grows and it takes longer to loop through. I need to do a bit more testing on this theory, but that's what I seemed to see while processing a large number of files (> 80K). I can't quite tell if it queues them all up, or if it processes them while looping through them. Amazon's ListObjects API returns at most 1,000 objects per call, but some of the libraries abstract this and add paging, so a loop will go through everything.

It would be nice to have the option of limiting how many objects it processes at a time.
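For context, a minimal sketch of what such a limit could look like with explicit paging (assuming aws-sdk v2; the bucket name, prefix, and the max_objects cap are made-up placeholders, not plugin options):

require 'aws-sdk' # aws-sdk v2; illustrative sketch only, not the plugin's code

s3 = Aws::S3::Client.new(region: 'us-east-1')

max_objects = 1_000 # hypothetical per-run cap
keys = []
marker = nil

loop do
  # ListObjects returns at most 1000 keys per call; paging explicitly lets us
  # stop early instead of enumerating every object under the prefix.
  resp = s3.list_objects(bucket: 'my-production-bucket', prefix: 'elb/', marker: marker)
  keys.concat(resp.contents.map(&:key))
  break if keys.size >= max_objects || !resp.is_truncated
  marker = resp.contents.last.key
end

keys.first(max_objects).each { |key| puts key } # hand each key off for processing here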

@DanielRedOak
Contributor

https://github.com/logstash-plugins/logstash-input-s3/blob/master/lib/logstash/inputs/s3.rb#L104

list_new_files runs through the bucket looking for keys that match the prefix and don't match the excludes, storing them in the sincedb. If you move objects to another bucket or prefix after they have been processed, runs should speed up, because the list to walk and check is much smaller. Not a solution, but a workaround at least.
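For readers who don't want to open the source, a rough paraphrase of what list_new_files does (simplified; it assumes the aws-sdk v1 collection API the plugin used at the time, and sincedb_time / exclude_pattern are stand-ins for the plugin's internal state):

# Simplified paraphrase, not the actual plugin code.
def list_new_files(bucket, sincedb_time, exclude_pattern, prefix)
  objects = {}
  # Walks every key under the prefix; on a large bucket this loop alone is expensive.
  bucket.objects.with_prefix(prefix).each do |log|
    next if exclude_pattern && log.key =~ Regexp.new(exclude_pattern)
    next unless log.last_modified > sincedb_time # only keys newer than the sincedb timestamp
    objects[log.key] = log.last_modified
  end
  objects.keys
end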

@lexelby

lexelby commented Mar 6, 2015

The root problem is that in the Ruby aws-sdk, if you iterate through a bucket, checking thing.last_modified does a round trip to the AWS API. The ListObjects API call does return the last-modified date for every object, but apparently the SDK discards this information and re-requests it every time. This means the S3 input makes an API call for every object in the bucket (matching the prefix), which is insanely slow.
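To illustrate the difference, a hedged sketch (bucket name and prefix are placeholders): in the v1 SDK, last_modified on an object from a bucket listing is believed to trigger a separate HEAD request, while the v2 object summaries keep the date that ListObjects already returned.

# With gem "aws-sdk" < 2.0 (require 'aws-sdk', namespace AWS), what the plugin used:
# each last_modified call below issues an extra HEAD Object request per key.
bucket = AWS::S3.new.buckets['my-production-bucket']
bucket.objects.with_prefix('elb/').each do |obj|
  obj.last_modified # one more round trip per object
end

# With gem "aws-sdk" >= 2.0 (require 'aws-sdk', namespace Aws): ObjectSummary
# carries last_modified from the ListObjects response, so iterating the listing
# needs no per-object request.
Aws::S3::Resource.new(region: 'us-east-1')
  .bucket('my-production-bucket')
  .objects(prefix: 'elb/')
  .each { |summary| summary.last_modified }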

@ph
Contributor

ph commented Mar 9, 2015

@lexelby I didn't know aws-sdk was doing a round trip when requesting the last_modified information. I'll check how I can improve that part and boost the performance of this method.

Concerning proxy support, adding the option to our base aws mixin https://github.com/logstash-plugins/logstash-mixin-aws is an easy fix.

I've taken a quick look at how the aws-sdk uses the net/http class: if we don't specify the proxy as an option, it creates a net/http object with http_proxy set to nil, which I believe makes the library skip the http_proxy environment variable.
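A hedged sketch of what passing the proxy explicitly might look like (option names are from memory and differ between SDK major versions; the proxy URL and credentials are placeholders):

require 'aws-sdk' # v1 of the SDK, which the plugin used at the time

# Configure the proxy explicitly instead of relying on the http_proxy
# environment variable, which the SDK appears to ignore.
AWS.config(
  access_key_id:     'XXXX',
  secret_access_key: 'XXXX',
  proxy_uri:         'http://proxy.example.com:3128' # v1 option name
)

# In aws-sdk v2 the equivalent client option is believed to be :http_proxy, e.g.
# Aws::S3::Client.new(region: 'us-east-1', http_proxy: 'http://proxy.example.com:3128')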

@lexelby

lexelby commented Mar 10, 2015

Oh, that mixin looks perfect. I see that the SQS input uses it, for example.

Here's the upstream bug, which they claim is fixed in a more recent version than logstash ships with: aws/aws-sdk-ruby#734

@DanielRedOak
Contributor

This is related, I believe: aws/aws-sdk-ruby#588

So since this uses aws-sdk < 2, I think we're SOL until it's upgraded. I'll look into it if I have some time, but there is also a ticket open to get the mixin updated.

@lexelby

lexelby commented Mar 10, 2015

I found a workaround, for my use case at least. A couple, really. First, there's a pull request floating around for a fog-based S3 input called s3fog. Like its author, I wanted to use the S3 input to pull CloudTrail logs into my ELK stack. I ended up using this: https://bitbucket.org/atlassianlabs/cloudtrailimporter. It's designed to skip Logstash, which I think is kind of limiting, so I hacked on it: http://github.com/lexelby/cloudtrail-logstash/. It works quite nicely. Set up the SNS/SQS pieces as described here: https://github.com/AppliedTrust/traildash. I dropped traildash itself because I couldn't figure out how to build the darned thing.

@DanielRedOak
Contributor

Well, the switchover to v2 of the SDK was quick, but I can't seem to install the updated plugin locally for testing. :( This didn't help either: elastic/logstash#2779

@DanielRedOak
Contributor

If anyone else wants to give testing a shot, check out my fork here: https://github.com/DanielRedOak/logstash-input-s3. The spec tests pass, but I haven't gotten around to updating the integration tests.

@DanielRedOak
Contributor

PR submitted so this can be closed if/when merged: #25

@nowshad-amin

I want to send ELB logs to an S3 bucket. The ELB logs for different services will be in different directories of my main log bucket. When I point my s3 input config at that bucket, I don't get any logs.

Here is my s3 input conf file:
input {
    s3 {
        bucket => "production-logs"
        region => "us-east-1"
        prefix => "elb/"
        type => "elb"
        sincedb_path => "log_sincedb"
    }
}

But if I set a full file path as the prefix, I can see the logs in Kibana (example: elb/production-XXXX/AWSLogs/XXXXXX/elasticloadbalancing/us-east-1/2016/02/24/). What I want, though, is to pull logs from every subdirectory of my bucket.
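Since S3 keys are flat, a prefix of "elb/" should in principle match every key under all "subdirectories". A quick way to check what the input will see is to list the keys with the same prefix (a minimal diagnostic sketch using aws-sdk v2; bucket name, region, and prefix are taken from the config above):

require 'aws-sdk' # aws-sdk v2; diagnostic sketch, not part of the plugin

s3 = Aws::S3::Resource.new(region: 'us-east-1')

# S3 has no real directories: "elb/production-XXXX/.../2016/02/24/file.gz" is just
# a key, so listing with prefix "elb/" returns keys from every "subdirectory".
s3.bucket('production-logs').objects(prefix: 'elb/').first(20).each do |summary|
  puts "#{summary.last_modified}  #{summary.key}"
end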

@neoecos

neoecos commented May 30, 2016

@nowshad-amin Did you find a workaround for this issue?

Using the version with the patch from @DanielRedOak worked like a charm.

@bgerstle

bgerstle commented Aug 4, 2016

The corresponding PR for this issue was merged, and I've updated to Logstash 5.0 and s3-input 3.1.1, but I'm still seeing slower-than-expected processing times for S3 access logs. This could perhaps be because Logstash isn't fully utilizing the available CPU (hovering around 10-20%). Take this with a pinch of salt, as I'm running everything on localhost as an ELK stack orchestrated with docker-compose, but I can see S3 documents coming into Elasticsearch slowly but surely (by looking at a stdout output as well as refreshing a catch-all query in Kibana and watching the hit count). In one example, docker stats shows:

lookbackelk_logstash_1        21.63%              508 MiB / 3.856 GiB     12.86%              36.14 MB / 54.56 MB   81.92 kB / 47.33 MB   65
lookbackelk_elasticsearch_1   3.91%               630.1 MiB / 3.856 GiB   15.96%              82.77 MB / 61.09 MB   954.4 kB / 230.1 MB   138
lookbackelk_kibana_1          0.60%               255.3 MiB / 3.856 GiB   6.47%               49.33 MB / 11.35 MB   1.044 MB / 0 B        10

and my Mac's CPU & network utilization are both pretty low. Any ideas?

@bgerstle

bgerstle commented Aug 4, 2016

I tried upping the pipeline workers and batch size, but didn't notice much of an increase in utilization. Probably just rookie mistakes combined with the input size and runtime environment.

@Chadwiki

+1

@cdenneen

Any updates on speeding this up?
I regularly have to do temporary log analysis by ingesting logs from S3, and the longer the ingest takes, the more it costs and the angrier people get waiting for all their data to be ingested so they can analyze it.
