beam-processing

Description

Implement a pipeline that fetches the top 1,000 JavaScript repositories, ranked by stars, from the GitHub API v3. Then calculate the health score of each repository with the predefined formula:

    health_score = (num_stars/max_num_stars) * (num_folks/max_num_folks) *
                   (commits_per_day/max_commits_per_day) * (num_opened_issues/max_num_opened_issues)
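
As a minimal sketch of the formula above, the following computes each ratio against the maximum observed in the fetched set and multiplies the four ratios together. The class `HealthScoreSketch`, the `RepoMetrics` type, and the zero-maximum guard are assumptions made for illustration; they are not taken from the project's source.

```java
import java.util.Arrays;
import java.util.List;

class HealthScoreSketch {
    /** Raw metrics for one repository (hypothetical type, not the project's model). */
    static final class RepoMetrics {
        final long stars;
        final long forks;
        final double commitsPerDay;
        final long openIssues;

        RepoMetrics(long stars, long forks, double commitsPerDay, long openIssues) {
            this.stars = stars;
            this.forks = forks;
            this.commitsPerDay = commitsPerDay;
            this.openIssues = openIssues;
        }
    }

    /** value / max, or 0 when max is 0 (assumed guard against division by zero). */
    static double ratio(double value, double max) {
        return max == 0 ? 0.0 : value / max;
    }

    /** Returns one health score per repository, in input order. */
    static double[] scores(List<RepoMetrics> repos) {
        double maxStars = 0, maxForks = 0, maxCommits = 0, maxIssues = 0;
        for (RepoMetrics r : repos) {
            maxStars = Math.max(maxStars, r.stars);
            maxForks = Math.max(maxForks, r.forks);
            maxCommits = Math.max(maxCommits, r.commitsPerDay);
            maxIssues = Math.max(maxIssues, r.openIssues);
        }
        double[] out = new double[repos.size()];
        for (int i = 0; i < out.length; i++) {
            RepoMetrics r = repos.get(i);
            out[i] = ratio(r.stars, maxStars) * ratio(r.forks, maxForks)
                    * ratio(r.commitsPerDay, maxCommits) * ratio(r.openIssues, maxIssues);
        }
        return out;
    }

    public static void main(String[] args) {
        List<RepoMetrics> repos = List.of(
                new RepoMetrics(100, 50, 4.0, 20),
                new RepoMetrics(50, 25, 2.0, 10));
        System.out.println(Arrays.toString(scores(repos))); // prints [1.0, 0.0625]
    }
}
```

Because every factor is a ratio against the maximum, a score lies between 0 and 1, and a repository that holds the maximum in all four metrics scores exactly 1.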

Then save the repositories, ranked from healthiest to least healthy, in CSV format as below:

repo_id,repo_name,health_score,num_stars,num_folks,created_at,avg_commits_per_day,avg_time_first_response_to_issues,avg_time_opened_issues,num_maintainers,avg_time_merged_pull_request,ratio_closed_open_issues,num_people_open_issues,ratio_commit_per_devs
2126244,bootstrap,0.7,134041,65691,2011-07-29T21:19:00Z,37.84251,72.8793,93.42848,89,73.0398,0.9817419,66,53
10270250,react,0.3,131107,24150,2013-05-24T16:15:54Z,56.24272,24.065191,13.832375,28,16.782827,0.9284099,49,70
...
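
The ranking and CSV step can be sketched as follows. This is an illustration only: the `Row` type and `toCsv` helper are hypothetical, and only the first three columns of the header above are written to keep the example short.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class RankedCsvSketch {
    /** One output row, reduced to three columns (hypothetical type). */
    static final class Row {
        final long repoId;
        final String repoName;
        final double healthScore;

        Row(long repoId, String repoName, double healthScore) {
            this.repoId = repoId;
            this.repoName = repoName;
            this.healthScore = healthScore;
        }
    }

    /** Sorts rows by health score, highest first, and renders them as CSV text. */
    static String toCsv(List<Row> rows) {
        List<Row> sorted = new ArrayList<>(rows);
        sorted.sort(Comparator.comparingDouble((Row r) -> r.healthScore).reversed());
        StringBuilder sb = new StringBuilder("repo_id,repo_name,health_score\n");
        for (Row r : sorted) {
            sb.append(r.repoId).append(',')
              .append(r.repoName).append(',')
              .append(r.healthScore).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Input is deliberately out of order; the output is ranked by score.
        List<Row> rows = List.of(
                new Row(10270250L, "react", 0.3),
                new Row(2126244L, "bootstrap", 0.7));
        System.out.print(toCsv(rows));
    }
}
```

Repository names that contain commas or quotes would need CSV escaping, which this sketch leaves out.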

Structure

    beam-processing
                |__ logs
                |__ output
                |     |__ data-*.csv   ---> Default output data dir
                |     |__ ...
                |__ src
                |     |__ ...
                |__ target
                |     |__ ...
                |     |__ beam-processing-<version>.jar  ---> Executable jar file
                |__ pom.xml
                |__ README.md

Dependencies

Build

$ mvn clean package

Configuration

The GitHub API v3 applies a much lower rate limit to unauthenticated requests, so the pipeline needs a GitHub personal access token to get the higher authenticated limit:

Generate a new token here.
Set the newly generated token in the config.properties file under the github.token key.
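
To illustrate how the token might be used, here is a sketch that reads `github.token` from properties text and builds one authenticated request against the GitHub repository search endpoint. It uses only the JDK's `java.net.http` classes rather than the Google HTTP Client the project relies on, and the class and method names are hypothetical. It also assumes `github.token` is a plain key in `config.properties`. No request is sent.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.net.URI;
import java.net.http.HttpRequest;
import java.util.Properties;

class GithubRequestSketch {
    /** Reads the github.token key from properties-formatted text. */
    static String readToken(Reader config) {
        Properties props = new Properties();
        try {
            props.load(config);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        String token = props.getProperty("github.token");
        if (token == null || token.trim().isEmpty()) {
            throw new IllegalStateException("github.token is not set");
        }
        return token.trim();
    }

    /** Builds (but does not send) a search request for one page of 100 repositories. */
    static HttpRequest searchPage(String token, int page) {
        URI uri = URI.create("https://api.github.com/search/repositories"
                + "?q=language:javascript&sort=stars&order=desc&per_page=100&page=" + page);
        return HttpRequest.newBuilder(uri)
                .header("Accept", "application/vnd.github.v3+json")
                .header("Authorization", "token " + token)
                .GET()
                .build();
    }

    public static void main(String[] args) {
        // Placeholder token for illustration; a real run would read config.properties.
        String token = readToken(new StringReader("github.token=YOUR_TOKEN\n"));
        // 10 pages of 100 results cover the top 1,000 repositories.
        for (int page = 1; page <= 10; page++) {
            System.out.println(searchPage(token, page).uri());
        }
    }
}
```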

Run

1. Run unit tests

$ mvn test

2. Run locally

For testing and development purposes, the pipeline can run on the local direct runner with the following command:

$ mvn clean package exec:java -Dexec.mainClass=com.nvbac.beam.Application -Pdirect-runner -Dexec.args="--runner=DirectRunner" -DskipTests

3. Run on other runners

Apache Spark:

$ mvn clean package -Pspark-runner
$ spark-submit --class com.nvbac.beam.Application --master spark://HOST:PORT target/beam-processing-0.0.1-shaded.jar --runner=SparkRunner

Apache Flink, Samza, and other runners: not tested yet.

Technical decisions

Apache Beam:
- Provides an advanced unified programming model.
- Pipelines can run on multiple execution engines such as Apex, Flink, Spark, and Samza.
- Supports both batch and streaming processing.
- Provides rich APIs and interfaces for pipeline implementations.
Google HTTP Client Library For Java:
- Flexible, efficient and powerful.
- Pluggable HTTP transport abstraction that allows you to use any low-level library such as java.net.HttpURLConnection, Apache HTTP Client, or URL Fetch on Google App Engine.
- Efficient JSON and XML data models for parsing and serialization of HTTP response and request content. The JSON and XML libraries are also fully pluggable, and they include support for Jackson and Android's GSON libraries for JSON.
Gson:
- Provides simple toJson() and fromJson() methods to convert Java objects to JSON and vice versa.
- Allows custom representations for objects.
Log4j:
- High performance, especially in multi-threaded applications.
- Easy to configure.

Future Improvements

Turn the pipeline into a general-purpose library that exposes the interfaces ISink, ITransform, and ISource, so that developers only need to implement their own sinks, sources, and transforms. When a pipeline starts, the Java runtime would pick up the sources, sinks, and transforms available on the classpath.
Handle exceptions more gracefully.
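
One possible shape for the proposed library is sketched below. The interface names come from the item above, but their method signatures, the `discover` helper, and the `run` method are assumptions for illustration. Classpath discovery uses the JDK's `java.util.ServiceLoader`, which finds implementations registered under `META-INF/services`. Since none are registered in this sketch, the demo wires in-memory implementations by hand.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.ServiceLoader;

class PluginPipelineSketch {
    // Hypothetical signatures; records are plain strings to keep the sketch small.
    interface ISource { List<String> read(); }
    interface ITransform { String apply(String item); }
    interface ISink { void write(List<String> items); }

    /** Collects every implementation of the given type registered on the classpath. */
    static <S> List<S> discover(Class<S> type) {
        List<S> found = new ArrayList<>();
        for (S impl : ServiceLoader.load(type)) {
            found.add(impl);
        }
        return found;
    }

    /** Reads from the source, applies each transform in order, and writes to the sink. */
    static void run(ISource source, List<ITransform> transforms, ISink sink) {
        List<String> out = new ArrayList<>();
        for (String item : source.read()) {
            String current = item;
            for (ITransform t : transforms) {
                current = t.apply(current);
            }
            out.add(current);
        }
        sink.write(out);
    }

    public static void main(String[] args) {
        List<String> collected = new ArrayList<>();
        List<ITransform> transforms = List.of(s -> s.toUpperCase());
        run(() -> List.of("react", "bootstrap"), transforms, collected::addAll);
        System.out.println(collected); // prints [REACT, BOOTSTRAP]
    }
}
```

In a Beam-based version these roles would more likely be expressed as `PTransform` implementations; the plain interfaces here only show the discovery and wiring idea.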
