Spark DataFrame insert into Exasol Table #37
Conversation
Add more comments, refactor tests.
Moreover, bump some dependency and plugin versions.
They will be useful when doing the Exasol insert, so that we can check whether the table is already available or truncate it if needed.
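For reference, a minimal sketch of such a table-existence check and truncate over a plain JDBC connection; the helper names `tableExists` and `truncateTable` are illustrative, not the connector's actual API:

```scala
import java.sql.Connection

// Hypothetical helpers; the connector's real implementation may differ.
def tableExists(conn: Connection, tableName: String): Boolean = {
  val stmt = conn.createStatement()
  try {
    // Cheap existence check: select zero rows from the table.
    stmt.executeQuery(s"SELECT * FROM $tableName WHERE 1 = 0")
    true
  } catch {
    case _: java.sql.SQLException => false
  } finally {
    stmt.close()
  }
}

def truncateTable(conn: Connection, tableName: String): Unit = {
  val stmt = conn.createStatement()
  try stmt.executeUpdate(s"TRUNCATE TABLE $tableName")
  finally stmt.close()
}
```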
This commit adds save modes: overwrite and append.
The problem: if we have more Exasol data nodes than the number of Dataframe partitions, we will open more sub-connections and some of them will not be used at all. This is currently a problem, since all opened sub-connections have to be connected and closed.

The solution: it is easy to perform a repartition on the dataframe and increase the number of partitions, which is what this commit does. However, `repartition` is an expensive operation that might involve a shuffle. Therefore, we should check whether we can instead open fewer sub-connections, matching the number of partitions.
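A rough sketch of the repartitioning step described above; `alignPartitions` and `nodesCount` are illustrative names, with the node count assumed to come from Exasol metadata:

```scala
import org.apache.spark.sql.DataFrame

// Illustrative only: nodesCount stands in for the number of Exasol data nodes.
def alignPartitions(df: DataFrame, nodesCount: Int): DataFrame =
  if (df.rdd.getNumPartitions < nodesCount) {
    // Increase partitions so every opened sub-connection has a partition to serve.
    // Note: repartition can trigger a shuffle, which is expensive.
    df.repartition(nodesCount)
  } else {
    df
  }
```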
With some of the save modes (for instance, Overwrite or Append), if the table does not exist in Exasol, we first create it using the dataframe schema as the table schema. This allows saving a dataframe even though the table was not available in Exasol. However, the create-table action should perhaps only be permitted via a user-provided parameter.
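As a rough illustration of deriving a table from the dataframe schema (not the connector's actual implementation; the type mapping shown is deliberately simplified):

```scala
import org.apache.spark.sql.types._

// Simplified, illustrative Spark-to-Exasol type mapping.
def createTableStatement(tableName: String, schema: StructType): String = {
  val columns = schema.fields.map { field =>
    val exasolType = field.dataType match {
      case IntegerType | LongType => "DECIMAL(36, 0)"
      case DoubleType             => "DOUBLE"
      case BooleanType            => "BOOLEAN"
      case TimestampType          => "TIMESTAMP"
      case _                      => "VARCHAR(2000000)"
    }
    s"${field.name} $exasolType"
  }
  s"CREATE TABLE $tableName (${columns.mkString(", ")})"
}
```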
- `create_table`: used to create the table, when saving a dataframe, if it does not already exist in Exasol. If it is not set to `"true"` and the table does not exist, the connector throws an exception. By default it is `"false"`.
- `batch_size`: configures the batch size when writing rows into the Exasol JDBC statement (see the sketch below). The default value is `1000`.
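A rough sketch of how such a batch size could drive JDBC batching; the `writeRows` helper is hypothetical and is not the connector's actual writer code:

```scala
import java.sql.PreparedStatement
import org.apache.spark.sql.Row

// Illustrative only: bind each row, flush to Exasol every batchSize rows.
def writeRows(stmt: PreparedStatement, rows: Iterator[Row], batchSize: Int): Unit = {
  var count = 0
  rows.foreach { row =>
    (0 until row.length).foreach(i => stmt.setObject(i + 1, row.get(i).asInstanceOf[AnyRef]))
    stmt.addBatch()
    count += 1
    if (count % batchSize == 0) {
      stmt.executeBatch() // flush a full batch
    }
  }
  if (count % batchSize != 0) {
    stmt.executeBatch() // flush the remaining rows
  }
}
```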
@morazow: sure, I will look into it tonight.
@morazow Yes, saving a dataframe into Exasol seems like a great idea.
Hey @3cham, Thanks a lot for looking into it so far! Please let me know if you have any feedback.
Hi @morazow, so far the integration tests worked, also with our added test. However, when we use our cluster for testing it, we get the following exception, and we couldn't understand why it occurs:
Hey @3cham, Seems like that error is thrown when the datasource does not provide a `CreatableRelationProvider`:

```scala
providingClass.newInstance() match {
  case dataSource: CreatableRelationProvider =>
    dataSource.createRelation(
      sparkSession.sqlContext, mode, caseInsensitiveOptions, Dataset.ofRows(sparkSession, data))
  case format: FileFormat =>
    // ...
  case _ =>
    sys.error(s"${providingClass.getCanonicalName} does not allow create table as select.")
}
```

Could you please also provide the Spark version and the query / insert syntax? Maybe I can try to reproduce the error here from my side also.
Hi @morazow: yeah, the jar is built from your branch. I used Spark 2.2.0 for testing it. Could it be the Spark version causing the problem? I will test with a newer Spark version.
Hi @morazow, there was an older jar built into my Spark distribution. After I removed the old jar, another exception occurs:
Hey @3cham, No, I do not think the Spark version should be a problem. It should be compatible with all versions above Spark 2.1. However, the second error happens to be on
Could you please check if the 'TYPE' column is available? Similarly, look for another log line starting with
Hi @morazow, I should have elaborated more in the last comment. However, the good news is that the data could be written back into Exasol. Great job!

The cause of the first problem is that there was an older artifact of spark-exasol-connector in the distribution that I used. Spark somehow ignored the new one :)

For the second problem: the table that I used has a column defined in lowercase. This is possible in Exasol if you wrap the column name inside "". However, in the enrichQuery method we don't have the quotes: https://github.com/exasol/spark-exasol-connector/blob/master/src/main/scala/com/exasol/spark/ExasolRelation.scala#L105. I think that is why the column could not be resolved.
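For illustration only, one way the column list could be quoted; this hypothetical `enrichQuery` is not the connector's actual implementation at the linked line:

```scala
// Wrap each required column in double quotes so that lowercase Exasol
// identifiers (created with quoted names) keep their case in the generated SQL.
def enrichQuery(query: String, requiredColumns: Array[String]): String = {
  val columns =
    if (requiredColumns.isEmpty) "*"
    else requiredColumns.map(c => "\"" + c + "\"").mkString(", ")
  s"SELECT $columns FROM ($query) A"
}
```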
Hey @3cham, Ahh okay. But it is good news that you could save the dataframe! Then, if there are no more reviews, I am going to merge this and do a release at the beginning of next week. However, let's address the quoted-column issue in a separate ticket. I think it is a good next task to consider. Similarly, we have had the related reserved keywords issue (#14) open for a while now. Thanks a lot for doing the review!
@morazow: yes, LGTM. Thank you for this PR.
What changes are proposed in this PR?
Add a new feature that enables connector users to save a Spark DataFrame into an Exasol table.

In order to save a dataframe, the option `table` should be provided to identify the Exasol table. Please note that it should also include the Exasol schema in the table name, e.g. `my_schema.my_table`. For example:
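A minimal sketch of such a write, assuming the connector is registered under the `exasol` data source name and that connection options (host, credentials, etc.) are configured elsewhere:

```scala
// df is an existing DataFrame; connection options are omitted here.
df.write
  .mode("overwrite")
  .option("table", "my_schema.my_table")
  .format("exasol")
  .save()
```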
It is important to know what each Spark `SaveMode` means here. The Spark write comes with several `SaveMode`-s; their standard meanings are:
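- `ErrorIfExists` (default): fail the save if the target data or table already exists.
- `Append`: append the dataframe contents to the existing data.
- `Overwrite`: overwrite the existing data with the dataframe contents.
- `Ignore`: skip the save if the target data or table already exists.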
So if the mode is `overwrite`, then the table will be truncated and the dataframe will then be saved into that table.

Additionally, I introduced a new user-provided parameter called `create_table`, which is set to `false` by default. This enables the connector to create the table if it does not exist. For instance:
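(A sketch only; the `exasol` format name and the omitted connection options are assumptions, as above.)

```scala
// Create the table from the dataframe schema if it is missing, then append.
df.write
  .mode("append")
  .option("table", "my_schema.my_table")
  .option("create_table", "true")
  .format("exasol")
  .save()
```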
Addresses #32.