Skip to content

Getting Started

shiyuhang0 edited this page May 11, 2022 · 16 revisions

What is TiSpark

TiSpark is a third-party jar package for Spark that provides the ability to read/write TiKV

Getting TiSpark

First of all, choose the TiSpark version:

TiSpark version TiDB、TiKV、PD version Spark version Scala version
2.4.x-scala_2.11 5.x,4.x 2.3.x,2.4.x 2.11
2.4.x-scala_2.12 5.x,4.x 2.4.x 2.12
2.5.x 5.x,4.x 3.0.x,3.1.x 2.12
  • TiSpark 2.4.3, 2.5.0 is the latest stable version, which is highly recommended.
  • Spark < 2.3 is not supported anymore. if your spark version is below 2.3, follow this document to use another version of TiSpark

Then get the TiSpark jar from

Currently, java8 is the only choice to build TiSpark, run mvn -version to check.

git clone https://github.com/pingcap/tispark.git

Run the following command under the TiSpark root directory:

// add -Dmaven.test.skip=true to skip the tests
mvn clean install -Dmaven.test.skip=true

Use TiSpark >= 2.5

Take the use of spark-shell for example, make sure you have deployed Spark.

Start spark-shell

TO use Tispark in Spark shell

  1. Add the following configuration in spark-defaults.conf
spark.sql.extensions  org.apache.spark.sql.TiExtensions
spark.tispark.pd.addresses  ${your_pd_adress}
spark.sql.catalog.tidb_catalog  org.apache.spark.sql.catalyst.catalog.TiCatalog
spark.sql.catalog.tidb_catalog.pd.addresses  ${your_pd_adress}
  1. Start spark-shell with the --jars option
spark-shell --jars tispark-assembly-{version}.jar

Read with TiSpark

You can use Spark SQL to read from TiKV

spark.sql("use tidb_catalog")
spark.sql("select count(*) from ${database}.${table}").show

Write with TiSpark

You can use Spark DataSource API to write to TiKV and guarantees ACID(INSERT statement is not supported yet)

val tidbOptions: Map[String, String] = Map(
  "tidb.addr" -> "127.0.0.1",
  "tidb.password" -> "",
  "tidb.port" -> "4000",
  "tidb.user" -> "root",
  "spark.tispark.pd.addresses" -> "127.0.0.1:2379"
)

val customerDF = spark.sql("select * from customer limit 100000")

customerDF.write
.format("tidb")
.option("database", "tpch_test")
.option("table", "cust_test_select")
.options(tidbOptions)
.mode("append")
.save()

See here for more details.

Delete with TiSpark

You can use Spark SQL to delete from TiKV (Tispark master support)

spark.sql("use tidb_catalog")
spark.sql("delete from ${database}.${table} where xxx").show

See here for more details.

Use TiSpark 2.4.x

Take the use of spark-shell for example

Start spark-shell

TO use Tispark in Spark shell

  1. Add the following configuration in spark-defaults.conf
spark.sql.extensions  org.apache.spark.sql.TiExtensions
spark.tispark.pd.addresses  ${your_pd_adress}
  1. Start spark-shell with the --jars option
spark-shell --jars tispark-assembly-{version}.jar

Read with TiSpark

You can use Spark SQL to read from TiKV

spark.sql("select count(*) from ${database}.${table}").show

Write with TiSpark

You can use Spark DataSource API to write to TiKV and guarantees ACID(INSERT statement is not supported yet)

val tidbOptions: Map[String, String] = Map(
  "tidb.addr" -> "127.0.0.1",
  "tidb.password" -> "",
  "tidb.port" -> "4000",
  "tidb.user" -> "root",
  "spark.tispark.pd.addresses" -> "127.0.0.1:2379"
)

val customerDF = spark.sql("select * from customer limit 100000")

customerDF.write
.format("tidb")
.option("database", "tpch_test")
.option("table", "cust_test_select")
.options(tidbOptions)
.mode("append")
.save()

See here for more details.

Clone this wiki locally