Implement a pipeline to fetch the top 1,000 JavaScript repositories ranked by stars from the GitHub API v3, then calculate the health score of each repository with the predefined formula:
health_score = (num_stars/max_num_stars) * (num_folks/max_num_folks) *
(commits_per_day/max_commits_per_day) * (num_opened_issues/max_num_opened_issues)
Then save the repositories, ranked from healthiest to least healthy, in CSV format as below:
repo_id,repo_name,health_score,num_stars,num_folks,created_at,avg_commits_per_day,avg_time_first_response_to_issues,avg_time_opened_issues,num_maintainers,avg_time_merged_pull_request,ratio_closed_open_issues,num_people_open_issues,ratio_commit_per_devs
2126244,bootstrap,0.7,134041,65691,2011-07-29T21:19:00Z,37.84251,72.8793,93.42848,89,73.0398,0.9817419,66,53
10270250,react,0.3,131107,24150,2013-05-24T16:15:54Z,56.24272,24.065191,13.832375,28,16.782827,0.9284099,49,70
...
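The formula and the ranking step above can be sketched in plain Java as follows. This is a minimal illustration, not the project's actual code: the `RepoMetrics` and `HealthScore` classes are hypothetical names, field names mirror the CSV columns (including `num_folks`), and treating a zero maximum as a zero ratio is an assumption made to avoid division by zero.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical holder for the four metrics that appear in the formula. */
final class RepoMetrics {
    final String name;
    final double numStars;
    final double numFolks;        // mirrors the num_folks column
    final double commitsPerDay;
    final double numOpenedIssues;
    double healthScore;           // filled in by HealthScore.rank

    RepoMetrics(String name, double numStars, double numFolks,
                double commitsPerDay, double numOpenedIssues) {
        this.name = name;
        this.numStars = numStars;
        this.numFolks = numFolks;
        this.commitsPerDay = commitsPerDay;
        this.numOpenedIssues = numOpenedIssues;
    }
}

final class HealthScore {
    /** value / max, with a zero maximum mapped to 0.0 (an assumption). */
    static double ratio(double value, double max) {
        return max > 0 ? value / max : 0.0;
    }

    /** Scores every repository against the per-metric maxima, then sorts healthiest first. */
    static List<RepoMetrics> rank(List<RepoMetrics> repos) {
        double maxStars = 0, maxFolks = 0, maxCommits = 0, maxIssues = 0;
        for (RepoMetrics r : repos) {
            maxStars = Math.max(maxStars, r.numStars);
            maxFolks = Math.max(maxFolks, r.numFolks);
            maxCommits = Math.max(maxCommits, r.commitsPerDay);
            maxIssues = Math.max(maxIssues, r.numOpenedIssues);
        }
        List<RepoMetrics> ranked = new ArrayList<>(repos);
        for (RepoMetrics r : ranked) {
            r.healthScore = ratio(r.numStars, maxStars)
                    * ratio(r.numFolks, maxFolks)
                    * ratio(r.commitsPerDay, maxCommits)
                    * ratio(r.numOpenedIssues, maxIssues);
        }
        ranked.sort(Comparator.comparingDouble((RepoMetrics r) -> r.healthScore).reversed());
        return ranked;
    }
}
```

Because every factor is normalized by its maximum, the score always falls between 0 and 1, and a repository scores 1 only if it leads on all four metrics.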
beam-processing
|__ logs
|__ output
| |__ data-*.csv ---> Default output data dir
| |__ ...
|__ src
| |__ ...
|__ target
| |__ ...
| |__ beam-processing-<version>.jar ---> Executable jar file
|__ pom.xml
|__ README.md
- beam-sdks-java-core 2.13.0
- beam-runners-direct-java 2.13.0
- google-http-client 1.30.1
- gson 2.8.4
- log4j 1.7.25
$ mvn clean package
Because of the GitHub API v3 rate limit, a personal access token from GitHub is needed to raise it:
- Generate a new token here.
- Set the newly generated token in the config.properties file under github.token.
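As a rough sketch of how the token might be used, the snippet below reads github.token from the properties content and builds the `Authorization` header value that the GitHub API v3 accepts. The `GithubRequests` class and its method names are hypothetical. The use of the repository search endpoint with `language:javascript`, sorted by stars in pages of 100, is an assumption about how the top 1,000 repositories could be requested, not a description of the project's code.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

final class GithubRequests {
    static final String SEARCH_URL = "https://api.github.com/search/repositories";

    /** Reads github.token from properties content and returns the Authorization header value. */
    static String authorizationHeader(Reader configProperties) {
        Properties props = new Properties();
        try {
            props.load(configProperties);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        String token = props.getProperty("github.token", "").trim();
        if (token.isEmpty()) {
            throw new IllegalStateException("github.token is not set in config.properties");
        }
        return "token " + token;
    }

    /** Builds one search URL per result page (assumed query; 10 pages of 100 cover 1,000 repos). */
    static List<String> topJavascriptPages(int pages, int perPage) {
        List<String> urls = new ArrayList<>();
        for (int page = 1; page <= pages; page++) {
            urls.add(SEARCH_URL + "?q=language:javascript&sort=stars&order=desc"
                    + "&per_page=" + perPage + "&page=" + page);
        }
        return urls;
    }
}
```

Each URL would then be requested with the header `Authorization: <value returned above>`, for example through the Google HTTP client listed in the dependencies.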
$ mvn test
For testing and development purposes, the pipeline can run on the local direct runner with the following command:
$ mvn clean package exec:java -Dexec.mainClass=com.nvbac.beam.Application -Pdirect-runner -Dexec.args="--runner=DirectRunner" -DskipTests
- Apache Spark:
$ mvn clean package -Pspark-runner
$ spark-submit --class com.nvbac.beam.Application --master spark://HOST:PORT target/beam-processing-0.0.1-shaded.jar --runner=SparkRunner
- Apache Flink, Samza ...: not tested yet.
- Apache Beam:
- Provides an advanced unified programming model.
- The pipelines can execute on multiple execution engines such as: Apex, Flink, Spark, Samza...
- Supports both batch and streaming processing.
- Provides rich APIs and interfaces for pipeline implementations.
- Google HTTP Client Library For Java:
- Flexible, efficient and powerful.
- Pluggable HTTP transport abstraction that allows you to use any low-level library such as java.net.HttpURLConnection, Apache HTTP Client, or URL Fetch on Google App Engine.
- Efficient JSON and XML data models for parsing and serialization of HTTP response and request content. The JSON and XML libraries are also fully pluggable, and they include support for Jackson and Android's GSON libraries for JSON.
- Gson:
- Provides simple toJson() and fromJson() methods to convert Java objects to JSON and vice versa.
- Allows custom representations for objects.
- Log4j:
- High performance, especially in multi-threaded applications.
- Easy to configure.
- Should be implemented as a general library that exposes the interfaces ISink, ITransform, and ISource, so that devs only need to implement their own sinks, sources, and transforms. When a pipeline starts, the Java runtime will pick up the sources, sinks, and transforms available on the classpath.
- Handle exceptions more gracefully.
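One possible shape for the plugin idea above is sketched here. The README names the interfaces but not their methods, so every signature below is hypothetical, as is the `PluginPipeline` class. Classpath discovery is shown with the standard `java.util.ServiceLoader`, which is one way to do it and an assumption about the intended design.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.ServiceLoader;

/** Hypothetical signatures; only the interface names come from the README. */
interface ISource<T> {
    List<T> read();
}

interface ITransform<I, O> {
    O apply(I input);
}

interface ISink<T> {
    void write(List<T> records);
}

final class PluginPipeline {
    /** Collects implementations registered under META-INF/services on the classpath. */
    static <S> List<S> discover(Class<S> service) {
        List<S> found = new ArrayList<>();
        for (S impl : ServiceLoader.load(service)) {
            found.add(impl);
        }
        return found;
    }

    /** Reads every record from the source, transforms it, and hands the results to the sink. */
    static <I, O> void run(ISource<I> source, ITransform<I, O> transform, ISink<O> sink) {
        List<O> out = new ArrayList<>();
        for (I record : source.read()) {
            out.add(transform.apply(record));
        }
        sink.write(out);
    }
}
```

With this layout, a plugin jar would list its implementation classes in files such as `META-INF/services/ISource`, and the runner would call `discover(ISource.class)` at startup instead of hard-coding the classes.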