nhoclove/beam-processing

beam-processing

Description

Implements a pipeline that fetches the top 1,000 JavaScript repositories ranked by stars from the GitHub API v3, then calculates the health score of each repository with the predefined formula:

    health_score = (num_stars/max_num_stars) * (num_folks/max_num_folks) *
                   (commits_per_day/max_commits_per_day) * (num_opened_issues/max_num_opened_issues)
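
As a minimal sketch of the formula above (not the project's actual code), the score can be computed by normalizing each metric against its maximum over all repositories and multiplying the results. The class name, the method names and the column order `[stars, forks, commits_per_day, opened_issues]` are assumptions made for illustration.

```java
class HealthScoreSketch {

    // Assumed column order: stars, forks, commits per day, opened issues.
    static final int METRICS = 4;

    /** Finds the per-column maximum over all repositories. */
    static double[] columnMaxima(double[][] rows) {
        double[] max = new double[METRICS];
        for (double[] row : rows) {
            for (int i = 0; i < METRICS; i++) {
                max[i] = Math.max(max[i], row[i]);
            }
        }
        return max;
    }

    /** Multiplies the four normalized metrics; a zero maximum contributes 0 instead of NaN. */
    static double healthScore(double[] row, double[] max) {
        double score = 1.0;
        for (int i = 0; i < METRICS; i++) {
            score *= (max[i] == 0.0) ? 0.0 : row[i] / max[i];
        }
        return score;
    }

    public static void main(String[] args) {
        // Made-up numbers, only to exercise the arithmetic.
        double[][] rows = { {100, 50, 10, 20}, {50, 25, 5, 10} };
        double[] max = columnMaxima(rows);
        for (double[] row : rows) {
            System.out.println(healthScore(row, max));
        }
    }
}
```

Because the score is a product, a repository with a zero in any single metric gets a health score of 0, whatever its other metrics are.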

The repositories are then saved, ranked from healthiest to least healthy, in CSV format as below (num_folks is the number of forks):

repo_id,repo_name,health_score,num_stars,num_folks,created_at,avg_commits_per_day,avg_time_first_response_to_issues,avg_time_opened_issues,num_maintainers,avg_time_merged_pull_request,ratio_closed_open_issues,num_people_open_issues,ratio_commit_per_devs
2126244,bootstrap,0.7,134041,65691,2011-07-29T21:19:00Z,37.84251,72.8793,93.42848,89,73.0398,0.9817419,66,53
10270250,react,0.3,131107,24150,2013-05-24T16:15:54Z,56.24272,24.065191,13.832375,28,16.782827,0.9284099,49,70
...
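
The ranking and CSV step could look roughly like the sketch below. It is an illustration only: the `Repo` class is hypothetical and carries just three of the columns shown above, and no CSV escaping is applied, so it assumes repository names contain no commas or quotes.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class RankedCsvSketch {

    /** Hypothetical minimal row holding only repo_id, repo_name and health_score. */
    static final class Repo {
        final long id;
        final String name;
        final double healthScore;

        Repo(long id, String name, double healthScore) {
            this.id = id;
            this.name = name;
            this.healthScore = healthScore;
        }
    }

    /** Returns CSV lines, header first, ordered from healthiest to least healthy. */
    static List<String> toRankedCsv(List<Repo> repos) {
        List<Repo> sorted = new ArrayList<>(repos); // leave the caller's list untouched
        sorted.sort(Comparator.comparingDouble((Repo r) -> r.healthScore).reversed());

        List<String> lines = new ArrayList<>();
        lines.add("repo_id,repo_name,health_score");
        for (Repo r : sorted) {
            lines.add(r.id + "," + r.name + "," + r.healthScore);
        }
        return lines;
    }

    public static void main(String[] args) {
        List<Repo> repos = new ArrayList<>();
        repos.add(new Repo(10270250L, "react", 0.3));
        repos.add(new Repo(2126244L, "bootstrap", 0.7));
        toRankedCsv(repos).forEach(System.out::println);
    }
}
```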

Structure

    beam-processing
                |__ logs
                |__ output
                |     |__ data-*.csv   ---> Default output data dir
                |     |__ ...
                |__ src
                |     |__ ...
                |__ target
                |     |__ ...
                |     |__ beam-processing-<version>.jar  ---> Executable jar file
                |__ pom.xml
                |__ README.md

Dependencies

  1. beam-sdks-java-core 2.13.0
  2. beam-runners-direct-java 2.13.0
  3. google-http-client 1.30.1
  4. gson 2.8.4
  5. log4j 1.7.25

Build

$ mvn clean package

Configuration

Because the GitHub API v3 applies a much lower rate limit to unauthenticated requests, a personal access token from GitHub is needed to raise it:

  1. Generate a new personal access token in your GitHub account settings.
  2. Set the newly generated token in the config.properties file under the github.token property.
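
For illustration, the sketch below reads the `github.token` property with `java.util.Properties` and builds an `Authorization` header value in the `token <value>` form that GitHub API v3 accepts. How the project itself loads the file and attaches the token is not shown in this README, so the class and method names here are assumptions.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

class GithubTokenConfigSketch {

    /** Reads the github.token property and fails fast when it is missing or blank. */
    static String readToken(Reader configSource) {
        Properties props = new Properties();
        try {
            props.load(configSource);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        String token = props.getProperty("github.token", "").trim();
        if (token.isEmpty()) {
            throw new IllegalStateException("github.token is not set in config.properties");
        }
        return token;
    }

    /** Builds the value of the HTTP Authorization header for the given token. */
    static String authorizationHeader(String token) {
        return "token " + token;
    }

    public static void main(String[] args) {
        // PLACEHOLDER stands in for a real token; a real run would read config.properties.
        String token = readToken(new StringReader("github.token=PLACEHOLDER\n"));
        System.out.println("Authorization: " + authorizationHeader(token));
    }
}
```

Failing fast on a missing token gives a clearer error than a run that later stops on rate-limit responses.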

Run

1. Run unit tests

$ mvn test

2. Run locally

For testing and development purposes, the pipeline can run on the local direct runner with the following command:

$ mvn clean package exec:java -Dexec.mainClass=com.nvbac.beam.Application -Pdirect-runner -Dexec.args="--runner=DirectRunner" -DskipTests

3. Run on other runners

  • Apache Spark:
$ mvn clean package -Pspark-runner
$ spark-submit --class com.nvbac.beam.Application --master spark://HOST:PORT target/beam-processing-0.0.1-shaded.jar --runner=SparkRunner
  • Apache Flink, Samza ... Not tested yet.

Technical decisions

  1. Apache Beam:
    • Provides an advanced unified programming model.
    • The pipelines can execute on multiple execution engines such as: Apex, Flink, Spark, Samza...
    • Supports both batch and streaming processing.
    • Provides rich APIs and interfaces for pipeline implementations.
  2. Google HTTP Client Library For Java:
    • Flexible, efficient and powerful.
    • Pluggable HTTP transport abstraction that allows you to use any low-level library such as java.net.HttpURLConnection, Apache HTTP Client, or URL Fetch on Google App Engine.
    • Efficient JSON and XML data models for parsing and serialization of HTTP response and request content. The JSON and XML libraries are also fully pluggable, and they include support for Jackson and Android's GSON libraries for JSON.
  3. Gson:
    • Provides simple toJson() and fromJson() methods to convert Java objects to JSON and vice versa.
    • Allows custom representations for objects.
  4. Log4j:
    • High performance, especially in multi-threaded applications.
    • Easy to configure.

Future Improvements

  1. Turn the project into a general-purpose library that exposes the interfaces ISink, ITransform and ISource, so that developers only need to implement their own sinks, sources and transforms. When a pipeline starts, the Java runtime would pick up the sources, sinks and transforms available on the classpath.
  2. Handle exceptions more gracefully.
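
One possible way to approach the first improvement is the JDK's `java.util.ServiceLoader`, which discovers implementations registered on the classpath. The sketch below is an assumption about how this could look, not an existing design: the method signatures on ISource, ITransform and ISink are invented for illustration, and elements are simplified to strings.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.ServiceLoader;

class PluginDiscoverySketch {

    /** Returns every implementation of the given interface registered on the classpath. */
    static <S> List<S> discover(Class<S> contract) {
        List<S> found = new ArrayList<>();
        for (S implementation : ServiceLoader.load(contract)) {
            found.add(implementation);
        }
        return found;
    }

    public static void main(String[] args) {
        // Without any registered providers, each list is simply empty.
        System.out.println("sources: " + discover(ISource.class).size());
        System.out.println("transforms: " + discover(ITransform.class).size());
        System.out.println("sinks: " + discover(ISink.class).size());
    }
}

// Hypothetical contracts; the method signatures are illustrative only.
interface ISource {
    Iterable<String> read();
}

interface ITransform {
    String apply(String input);
}

interface ISink {
    void write(String element);
}
```

With this approach, a plug-in jar would register its classes in a `META-INF/services/` file named after the interface it implements, and each implementation would need a public no-argument constructor.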
