Home
Whenever these examples are updated for a new Spark version, changes tend to be needed, and some are interesting and important. Starting with Spark 1.5.0, here are the details.
- `DataFrame.foreach` needs more careful passing of `println` because of overloading (see the first sketch after this list)
- User defined types (UDTs) have been removed
- `Dataset.partitionBy` is not supported -- `repartition` seems to be the preferred alternative (see the sketch after this list)
- `graphx.mapReduceTriplets` was deprecated in 1.2 and is now gone -- replaced by `aggregateMessages` (see the sketch after this list)
- `org.apache.spark.Logging` is missing, but it wasn't really being used anyway
- A `Dataset` is not an RDD
- `dataframe.registerTempTable`
- `SQLContext`
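A few sketches for the notes above follow. First, the `foreach`/`println` issue: presumably the problem is that `println` is itself overloaded and `foreach` now has more than one overload as well, so the compiler can no longer resolve `df.foreach(println)` on its own. A minimal illustration with a made-up DataFrame, not code from the examples:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("ForeachSketch").master("local").getOrCreate()
val df = spark.range(3).toDF("n")   // any small DataFrame will do

// df.foreach(println) is now ambiguous; an explicit function literal
// fixes the overload resolution:
df.foreach(row => println(row))
```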
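For the `partitionBy` note, `repartition` seems to be the closest Dataset-level substitute, since there is no `Dataset.partitionBy` taking a `Partitioner`. A sketch with invented data and column names:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("RepartitionSketch").master("local").getOrCreate()
import spark.implicits._

val ds = Seq(("a", 1), ("b", 2), ("a", 3)).toDS()

// Either pick a partition count, or hash-partition by one or more columns:
val evenly = ds.repartition(4)
val byKey  = ds.repartition($"_1")
```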
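For the GraphX change, `aggregateMessages` is the replacement for `mapReduceTriplets`: instead of a map function over triplets that returns messages, you get an `EdgeContext` and explicitly send messages to the source and/or destination vertex. A toy sketch, not the SecondDegreeNeighbors code:

```scala
import org.apache.spark.graphx.{Edge, Graph}
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("AggregateMessagesSketch").setMaster("local"))

val vertices = sc.parallelize(Seq((1L, "a"), (2L, "b"), (3L, "c")))
val edges = sc.parallelize(Seq(Edge(1L, 2L, 1), Edge(1L, 3L, 1)))
val graph = Graph(vertices, edges)

// Old: graph.mapReduceTriplets(mapFunc, reduceFunc)
// New: send a message along each edge, then merge the messages per vertex.
val inDegrees = graph.aggregateMessages[Int](
  ctx => ctx.sendToDst(1),  // one message to the destination of each edge
  _ + _                     // combine messages arriving at the same vertex
)
inDegrees.collect().foreach(println)
```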
- If you are running the examples through `sbt`, you will now get a non-default JVM `MaxPermSize` setting so that the hiveql examples get enough memory to run (see the sbt sketch after this list).
- The two versions of `UDT.scala` (one in `dataframe` and one in `sql`) used to depend on the fact that `ArrayData` and `GenericArrayData` were public in `org.apache.spark.sql.types`. This is no longer true in Spark 1.6.0, due to SPARK-11273. More recently, as a result of SPARK-11780, deprecated type aliases have been added back, and this change is slated for Spark 1.6.1. Frankly, I find this change a bit disturbing, since every attempt at defining a user defined type anywhere in the Spark source tree requires these two types for serialization and deserialization -- see, for example, ExamplePointUDT.scala and UserDefinedTypeSuite.scala, and the UDT sketch after this list. I can understand that perhaps functionality needed for UDTs shouldn't pollute the general purpose public APIs, but then I would argue that support for them needs a package of its own.
- The Hive examples (`hiveql.*`) were failing with memory problems. To execute them, one has to supply the flag `-XX:MaxPermSize=128M` to the JVM somehow (the sbt sketch after this list shows one way). This setting works for these examples, but whether it is the "right" setting in practice depends on your application.
- `sql.OutputJSON` needed to be extended because it seems that JSON integers that were being interpreted as `int`s are now interpreted as `long`s (see the sketch after this list).
- `sql.Types` had to change quite a lot because the type conversions seem to have become a lot more stringent.
- In dealing with a deprecation in `sql.JSONTypes`, I can't find a supported way to provide a schema when reading a JSON file. This is not strictly speaking a 1.5.0 problem. I'll keep looking for a solution.
- The last example (passing an array to a UDF) in `dataframe.UDF` needed to be changed because passing an array now results in the UDF receiving a `WrappedArray` rather than an `ArrayBuffer` (see the sketch after this list).
- With the introduction of more systematic reading and writing for dataframes, I took this opportunity to replace all uses of the older, deprecated techniques (see the sketch after this list).
- Again not really a 1.5.0 problem, but there are two more deprecations I couldn't find a good way to deal with:
  - In `hiveql.UDAF` the deprecated approach is the only one I've been able to figure out so far.
  - In queue-based streaming, `scala.collection.mutable.SynchronizedQueue` has been deprecated for some time, but there doesn't seem to be a non-deprecated replacement that `StreamingContext.queueStream()` will accept as an input (see the sketch after this list).
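Some sketches for the items above follow here as well. For the two `MaxPermSize` items, one way to bake the flag into the build is to fork the JVM from sbt and set `javaOptions`; this is a sketch of the general sbt mechanism, not necessarily the exact lines in this project's build.sbt:

```scala
// build.sbt fragment -- javaOptions only applies to forked JVMs
fork := true
javaOptions += "-XX:MaxPermSize=128M"
```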
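To see why the `UDT.scala` examples need `ArrayData` and `GenericArrayData`, here is roughly the shape of a UDT's serialization code, modeled loosely on Spark's own ExamplePointUDT.scala (the `Point`/`PointUDT` names are illustrative; in Spark 1.6.0 the two array types moved under `org.apache.spark.sql.catalyst.util`):

```scala
import org.apache.spark.sql.types._
// In 1.5 ArrayData/GenericArrayData were reachable via org.apache.spark.sql.types;
// SPARK-11273 moved them, and SPARK-11780 restores deprecated aliases in 1.6.1.

class Point(val x: Double, val y: Double) extends Serializable

class PointUDT extends UserDefinedType[Point] {
  // Catalyst sees a Point as a fixed-length array of doubles.
  override def sqlType: DataType = ArrayType(DoubleType, containsNull = false)

  // Both directions of the conversion need the (formerly public) array types.
  override def serialize(obj: Any): Any = obj match {
    case p: Point => new GenericArrayData(Array[Any](p.x, p.y))
  }
  override def deserialize(datum: Any): Point = datum match {
    case a: ArrayData => new Point(a.getDouble(0), a.getDouble(1))
  }

  override def userClass: Class[Point] = classOf[Point]
}
```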
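The `sql.OutputJSON` change in a nutshell: inferred JSON integer fields now come back as `long`. A small sketch with an inline record and a made-up field name:

```scala
import org.apache.spark.sql.SQLContext
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("JsonLongSketch").setMaster("local"))
val sqlContext = new SQLContext(sc)

val df = sqlContext.read.json(sc.parallelize(Seq("""{"count": 1}""")))
df.printSchema()   // count: long -- code that expected Int has to change

val n: Long = df.first().getAs[Long]("count")
```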
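For the `dataframe.UDF` change, declaring the UDF parameter as a `Seq` keeps it working whether Spark passes an `ArrayBuffer` or a `WrappedArray`. A sketch with invented column names, reusing the `sqlContext` from the previous sketch:

```scala
import org.apache.spark.sql.functions.udf
import sqlContext.implicits._

val df = Seq((1, Seq(1, 2, 3)), (2, Seq(4, 5))).toDF("id", "values")

// Seq[Int] is satisfied by both ArrayBuffer (pre-1.5) and WrappedArray (1.5+).
val sumValues = udf((xs: Seq[Int]) => xs.sum)
val withSums = df.select(df("id"), sumValues(df("values")).as("sum"))
withSums.show()
```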
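The "more systematic reading and writing" refers to the `DataFrameReader`/`DataFrameWriter` API reached via `read` and `write`. The flavour of the replacement (paths are placeholders, and `sqlContext` is again assumed to exist):

```scala
// Old, deprecated style:
//   val people = sqlContext.parquetFile("people.parquet")
//   people.saveAsParquetFile("people-out.parquet")
//   val logs = sqlContext.jsonFile("logs.json")

// Replacement:
val people = sqlContext.read.parquet("people.parquet")
people.write.parquet("people-out.parquet")

val logs = sqlContext.read.json("logs.json")
logs.write.format("json").save("logs-out")
```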
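Finally, the queue-based streaming issue: `StreamingContext.queueStream()` wants a `mutable.Queue`, and the deprecated `SynchronizedQueue` is still the convenient thread-safe choice when the driver keeps pushing RDDs while the stream is running. A sketch with an arbitrary batch interval and data:

```scala
import scala.collection.mutable
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("QueueStreamSketch").setMaster("local[2]"))
val ssc = new StreamingContext(sc, Seconds(1))

// Deprecated, but still the simplest queue that queueStream() will accept
// and that is safe to append to while the stream is running.
val rddQueue = new mutable.SynchronizedQueue[RDD[Int]]()
val stream = ssc.queueStream(rddQueue)
stream.print()

ssc.start()
rddQueue += sc.makeRDD(1 to 10)
ssc.awaitTerminationOrTimeout(5000)
ssc.stop()
```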