Using JRecord to build a mapred and mapreduce inputformat for HDFS, MAPREDUCE, PIG, HIVE, Spark, ...

tmalaska/CopybookInputFormat

# CopybookInputFormat

## Overview

This project is a collection of tools that let you read copybook data files directly from HDFS using MapReduce, Hive, or Spark.

Here is what is in this project:

  • BasicCopybookConvert: An example of how to use JRecord to read a copybook data file with its copybook schema. It is single-threaded.
  • PrepCopybook: This tool will clean up a copybook file so that it works with Hive and JRecord.
  • GenTestData: This takes a given cbl file and creates sample rows for testing.
  • GenHiveCreateTable: This reads the copybook schema and generates a Hive table definition.
  • mapred.InputFormat & RecordReader: A mapred implementation of FileInputFormat and RecordReader. It also works with Hive.
  • mapreduce.InputFormat & RecordReader: A mapreduce implementation of FileInputFormat and RecordReader.
  • Spark Example: An example of how to read copybook data with Spark.
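
To make the idea of a copybook data file concrete: each record is a fixed-width line, and the copybook schema says which character positions belong to which field. The following is a minimal sketch in plain Java that does not use JRecord at all. The class name, the field names, and the widths are hypothetical, and it assumes plain-text fields only (real copybook data may use EBCDIC or packed decimals, which is what JRecord is for).

```java
import java.util.LinkedHashMap;
import java.util.Map;

class FixedWidthSketch {
    // Hypothetical layout: in a real setup these names and widths
    // would come from the copybook schema, not be hard-coded.
    static final String[] NAMES = {"user_id", "user_name"};
    static final int[] WIDTHS = {5, 10};

    // Slice one fixed-width record into named, trimmed fields.
    static Map<String, String> parse(String record) {
        Map<String, String> row = new LinkedHashMap<>();
        int offset = 0;
        for (int i = 0; i < NAMES.length; i++) {
            // Clamp to the record length so short records do not throw.
            int start = Math.min(offset, record.length());
            int end = Math.min(offset + WIDTHS[i], record.length());
            row.put(NAMES[i], record.substring(start, end).trim());
            offset += WIDTHS[i];
        }
        return row;
    }

    public static void main(String[] args) {
        // Prints {user_id=00571, user_name=Ada}
        System.out.println(parse("00571Ada       "));
    }
}
```

The input formats in this project play the same role at scale: they hand each record to the job as fields defined by the copybook, rather than as an opaque line.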

## Build

JRecord is not in a public Maven repository, so the JRecord jars are included in this project. To build, copy these jars into your local Maven repository at the following paths:

~/.m2/repository/net/sf/JRecord/JRecord/0.80/JRecord-0.80.jar

~/.m2/repository/net/sf/cb2xml/cb2xml/1.0/cb2xml-1.0.jar

After that, run mvn package and use target/copybookInputFormat.jar
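
The two jar paths above follow Maven's standard local-repository layout: the group ID with dots turned into directories, then the artifact ID, the version, and a file named artifactId-version.jar. The sketch below shows that mapping. The helper name `jarPath` is hypothetical, and the group IDs `net.sf.JRecord` and `net.sf.cb2xml` are inferred from the paths rather than confirmed against the project's pom.xml.

```java
class MavenLocalPath {
    // Maps Maven coordinates to the jar's path relative to ~/.m2/repository.
    static String jarPath(String groupId, String artifactId, String version) {
        return groupId.replace('.', '/') + "/" + artifactId + "/" + version
                + "/" + artifactId + "-" + version + ".jar";
    }

    public static void main(String[] args) {
        String repo = System.getProperty("user.home") + "/.m2/repository/";
        // Coordinates inferred from the paths listed in this README.
        System.out.println(repo + jarPath("net.sf.JRecord", "JRecord", "0.80"));
        System.out.println(repo + jarPath("net.sf.cb2xml", "cb2xml", "1.0"));
    }
}
```

Maven's install:install-file goal can also place a jar at such a path when given the same coordinates, which avoids creating the folders by hand.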

## Credits

Sekou Mckissick, Susan Greslik, Gwen Shapira, Jeremy Beard, and Ted Malaska

## Internal Notes

java -jar copybookInputFormat.jar GenHiveCreateTable example.cbl createTable.hql exampleTable /user/root/exampleTable /tmp/example.cbl

hive -f createTable.hql

java -jar copybookInputFormat.jar GenTestData example.cbl copyGen/example.dat 100 10

hadoop fs -put copyGen/example.dat /user/root/exampleTable/example.dat

hadoop fs -put example.cbl /tmp/example.cbl

hive

add jar copybookInputFormat.jar;

set copybook.inputformat.cbl.hdfs.path=/tmp/example.cbl;

desc exampleTable;

select * from exampleTable;

select * from exampleTable where user_id > '570';

hadoop jar SparkCopybookExample.jar com.cloudera.sa.copybook.spark.CopybookSparkExample spark://{host}:7077 hdfs://{host}:8020/tmp/example.cbl hdfs://{host}:8020/user/root/exampleTable hdfs://{host}:8020/user/root/op2

or

java -cp SparkCopybookExample.jar com.cloudera.sa.copybook.spark.CopybookSparkExample spark://{host}:7077 hdfs://{host}:8020/tmp/example.cbl hdfs://{host}:8020/user/root/exampleTable hdfs://{host}:8020/user/root/op3

## Extra Notes

hive.aux.jars.path hdfs:///user/root/copybook-0.0.1-SNAPSHOT.jar
