Home
Here's a quick outline (by joe) of the ooni-pipeline process at the moment.
- ooni-backend machines upload "raw reports" (yaml files) to the ooni-incoming S3 bucket. The following is run daily by cron:
find /data/bouncer/archive -type f -print0 \
| xargs -0 -I FILE \
aws s3 mv FILE s3://ooni-incoming/yaml/
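For reference, the cron entry itself isn't shown here; one way it might look, with the command above saved to a wrapper script (the path and time of day are just placeholders, not what actually runs on the backends):
# hypothetical crontab entry on an ooni-backend host: upload raw reports once a day
0 2 * * * /usr/local/bin/upload-raw-reports.sh >> /var/log/ooni-upload.log 2>&1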
- those reports then get moved to the ooni-private S3 bucket (this second step is for permissions trickery). Another daily cron job:
date_bin=$(date -I)   # e.g. 2016-02-01: the pipeline "bin" for today's run
if [ -z "$date_bin" ]; then exit 1; fi
# should be able to do this filtering with the aws --exclude and --include,
# but I can't get that to work.
aws s3 ls s3://ooni-incoming/yaml/ \
| awk '{print $4}' \
| grep '\.yamloo$' \
| xargs -I FILE \
aws s3 mv s3://ooni-incoming/yaml/FILE \
s3://ooni-private/reports-raw/yaml/"$date_bin"/
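For what it's worth, the --exclude/--include filtering that the comment above refers to would look roughly like this; it's an untested sketch, not the job we actually run:
# move only the .yamloo files, letting aws-cli do the filtering itself
aws s3 mv s3://ooni-incoming/yaml/ \
s3://ooni-private/reports-raw/yaml/"$date_bin"/ \
--recursive --exclude "*" --include "*.yamloo"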
- the following command from this repo does some sanitisation and aggregates the reports by date into json streams in the ooni-public bucket (the folders ("bins") correspond to a pipeline date, not the report measurement date):
invoke bins_to_sanitised_streams --unsanitised-dir "s3n://ooni-private/reports-raw" --sanitised-dir "s3n://ooni-public" --date-interval 2012-12-01-2016-01-01 --workers 32
- the following command reads the json streams and puts each report entry (there are many entries in a report) as a row into the postgres db:
invoke streams_to_db --streams-dir "s3n://ooni-public/json"
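To sanity-check what the two invoke tasks produced, you can just list the corresponding prefixes; the exact per-date layout is an assumption based on the directories passed above:
aws s3 ls s3://ooni-private/reports-raw/yaml/   # one folder ("bin") per pipeline day
aws s3 ls s3://ooni-public/json/                # sanitised json streams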
We currently run the bins->streams step on a c3.8xlarge (32 cores) with 32 processes and 80GB of EBS storage (the S3 files get cached on disk on their way in and out, so this can eat a lot of space). It takes about a day to run through the whole dataset.
The streams->db step runs on an m4.xlarge with a single process and also takes about a day. I haven't looked into what the speed bottleneck there is.
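Since the on-disk caching can fill the volume, it's worth keeping an eye on free space while the bins->streams job runs; a rough way to do that (the cache path is a guess, wherever the workers spool S3 files locally):
# check free space on the EBS volume every minute during the run
watch -n 60 df -h
# see how much the locally-cached S3 files are taking up (path is a guess)
du -sh /tmp 2>/dev/null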