Provide **IMPORT DATA** statement #476

Closed
sherman-the-tank opened this issue May 30, 2019 · 17 comments

@sherman-the-tank
Member

sherman-the-tank commented May 30, 2019

We want to support bulk loading from the console. The statement could look like this:

IMPORT DATA FROM // Will be executed in console

IMPORT DATA FROM SERVER // Will be executed on the query engine

@sherman-the-tank sherman-the-tank added this to the v1_beta_release milestone May 30, 2019
@spacewalkman
Contributor

spacewalkman commented Jun 21, 2019

[image: import concept]
If this one is not taken, I'd like to give it a shot. The picture above is what I have in mind.

@spacewalkman
Contributor

> IMPORT DATA FROM // Will be executed in console
> IMPORT DATA FROM SERVER // Will be executed on the query engine

Some clarification is needed here, @sherman-the-tank:

  1. We are supposed to import files in CSV format only when running in the console, right?
  2. When running in SERVER mode, are we going to support both CSV and SST file formats?

@dangleptr
Contributor

dangleptr commented Jun 21, 2019

@darionyaphet has submitted a patch for the download procedure.
It differs from your proposal in two points:

  1. Meta server will control the whole process instead of GraphServer.
  2. All communication is through HTTP.

@dangleptr
Contributor

You could go ahead with the ingest procedure, @spacewalkman.

@dangleptr
Contributor

Since the patch has not been merged yet, we can still discuss which communication method is better: RPC or HTTP.

@spacewalkman
Contributor

spacewalkman commented Jun 21, 2019

> 1. Meta server will control the whole process instead of GraphServer.

Why is MetaServer in charge? IMO, MetaServer is just a metadata provider and should not be involved in something like a data manipulation procedure. It's the QueryServer that receives the IMPORT request in the first place. Letting GraphServer do it will keep MetaServer simple and tidy.

> 2. All communication is through HTTP.

If it's all about communication, why not let Thrift do it? Introducing an extra HTTP layer would bring security vulnerabilities and communication inefficiency.

WDYT? @dangleptr @darionyaphet

@dangleptr
Contributor

dangleptr commented Jun 21, 2019

> Why is MetaServer in charge?

That's a good question. It's not only download/ingest; we should also take upcoming features into account, for example compaction, balance, and snapshot.
IMO, all admin operations on storage servers should be handled by the Meta server, because it knows which storage servers are still alive and has all the information about each of them.

> Why not let Thrift do it?

For HTTP, the only advantage is that it can be accessed from different kinds of clients, for example a web console.

@darionyaphet
Contributor

MetaServer holds the whole cluster view.

We will make sure the number of files matches the number of Nebula partitions, and work out how to assign the SST files for ingest.
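
One way to read this constraint: if the import produces exactly one SST file per partition, the download plan can be derived from the partition-to-host map that MetaServer already holds. A minimal sketch in Python, not taken from the patch; the `<part_id>.sst` naming scheme, the host names, and `plan_downloads` are hypothetical, and replicas are ignored for brevity.

```python
from collections import defaultdict


def plan_downloads(part_hosts, sst_files):
    """Group SST files by the storage host that owns each partition.

    part_hosts: {partition_id: host}, as known to the meta server (assumed shape).
    sst_files:  {partition_id: file name}, one file per partition (assumed shape).
    """
    # Enforce "file count equals partition count" before assigning anything.
    if set(sst_files) != set(part_hosts):
        missing = sorted(set(part_hosts) - set(sst_files))
        extra = sorted(set(sst_files) - set(part_hosts))
        raise ValueError(f"file/partition mismatch: missing={missing}, extra={extra}")

    plan = defaultdict(list)
    for part_id in sorted(sst_files):
        plan[part_hosts[part_id]].append(sst_files[part_id])
    return dict(plan)


# Hypothetical cluster: 4 partitions spread over 2 storage hosts.
part_hosts = {1: "storage-a", 2: "storage-b", 3: "storage-a", 4: "storage-b"}
sst_files = {p: f"{p}.sst" for p in part_hosts}
plan = plan_downloads(part_hosts, sst_files)
```

Each storage host then only needs the list of files under its own key, and a mismatch between files and partitions is rejected before any download starts.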

@spacewalkman
Contributor

spacewalkman commented Jun 21, 2019

@darionyaphet @dangleptr
You both point out that MetaServer holds the information needed to do the IMPORT, but that doesn't prevent it from passing that information to someone else (like QueryEngine) and letting them do the actual importing job.

@dangleptr
Contributor

dangleptr commented Jun 24, 2019

> @darionyaphet @dangleptr
> You both point out that MetaServer holds the information needed to do the IMPORT, but that doesn't prevent it from passing that information to someone else (like QueryEngine) and letting them do the actual importing job.

It's not only about the information. Some admin operations are long-running procedures, so we may want to record the state step by step and support failover. QueryEngine is stateless and has no leader; if it crashes, we cannot fail over.
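
To illustrate the point about recording state step by step, here is a sketch in Python of a task whose progress is persisted after every step so that a new leader can resume it. The step names, `TaskStore`, and the in-memory dict standing in for the meta server's replicated storage are all assumptions for illustration.

```python
STEPS = ["DOWNLOAD", "INGEST"]  # hypothetical phases of an import task


class TaskStore:
    """Stand-in for durable, replicated storage on the meta server."""

    def __init__(self):
        self._done = {}  # task_id -> number of completed steps

    def completed(self, task_id):
        return self._done.get(task_id, 0)

    def mark_done(self, task_id, count):
        self._done[task_id] = count


def run_task(store, task_id, execute, crash_after=None):
    """Run the remaining steps of a task, persisting progress after each one.

    crash_after simulates the coordinator dying once that many steps are done.
    """
    for index in range(store.completed(task_id), len(STEPS)):
        if crash_after is not None and index >= crash_after:
            raise RuntimeError("coordinator crashed")
        execute(STEPS[index])
        store.mark_done(task_id, index + 1)


store = TaskStore()
log = []
try:
    run_task(store, "task-1", log.append, crash_after=1)  # dies after DOWNLOAD
except RuntimeError:
    pass
run_task(store, "task-1", log.append)  # a new leader resumes from stored state
```

Because a crash could also land between executing a step and recording it, each step would need to be idempotent in a real design. A stateless component has nowhere to keep this record, which is the argument made above.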

@dangleptr dangleptr reopened this Jun 24, 2019
@spacewalkman
Contributor

> > @darionyaphet @dangleptr
> > You both point out that MetaServer holds the information needed to do the IMPORT, but that doesn't prevent it from passing that information to someone else (like QueryEngine) and letting them do the actual importing job.
>
> It's not only about the information. Some admin operations are long-running procedures, so we may want to record the state step by step and support failover. QueryEngine is stateless and has no leader; if it crashes, we cannot fail over.

Fair enough

@spacewalkman
Contributor

[image]

Updated the concept picture to reflect the idea that MetaServer is in charge.

@dangleptr
Contributor

dangleptr commented Jun 24, 2019

I have two questions:

  1. After typing the command "IMPORT DATA xxx" in the console, does the console block?
  2. Are Download and Ingest two commands or one command for users?

@spacewalkman
Contributor

spacewalkman commented Jun 24, 2019

> 1. After typing the command "IMPORT DATA xxx" in the console, does the console block?

IMO, it's a long-running task, so we should not block the console but instead return a handle that can be used to poll the task status periodically. But this has the following cons:

  1. Normally, we would like to notify the user of the progress (percentage). But that progress output may interleave with the user's other activity in the same session; for example, the user may issue USE anothergraphspace and run some other queries.
  2. We need a way to abort the whole procedure, even after the user closes the original console that issued the IMPORT command. (In blocking mode, we could just press Ctrl-C to abort.)
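
A rough sketch of the non-blocking variant in Python: the call returns a task ID immediately, progress is polled on demand rather than pushed into the session, and abort works from any session that knows the ID. `TaskManager`, its method names, and the simulated chunked work are hypothetical.

```python
import threading
import uuid


class TaskManager:
    def __init__(self):
        self._tasks = {}
        self._lock = threading.Lock()

    def submit(self, total_chunks, work):
        """Start the task in the background and return its ID right away."""
        task_id = uuid.uuid4().hex
        task = {"done": 0, "total": total_chunks, "state": "RUNNING",
                "abort": threading.Event()}
        task["thread"] = threading.Thread(
            target=self._run, args=(task, work), daemon=True)
        with self._lock:
            self._tasks[task_id] = task
        task["thread"].start()
        return task_id

    def _run(self, task, work):
        for chunk in range(task["total"]):
            if task["abort"].is_set():  # checked between chunks
                task["state"] = "ABORTED"
                return
            work(chunk)
            task["done"] = chunk + 1
        task["state"] = "FINISHED"

    def status(self, task_id):
        """Return (state, percent complete) for polling."""
        task = self._tasks[task_id]
        percent = 100 * task["done"] // task["total"] if task["total"] else 100
        return task["state"], percent

    def abort(self, task_id):
        self._tasks[task_id]["abort"].set()

    def wait(self, task_id):
        self._tasks[task_id]["thread"].join()


manager = TaskManager()

finished = manager.submit(4, lambda chunk: None)
manager.wait(finished)

gate = threading.Event()
blocked = manager.submit(3, lambda chunk: gate.wait())  # stalls inside a chunk
manager.abort(blocked)  # could come from a different console session
gate.set()
manager.wait(blocked)
```

Since the abort flag is only checked between chunks, cancellation latency is bounded by the size of one chunk of work.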

> 2. Are Download and Ingest two commands or one command for users?

Download and Ingest are just two conceptual phases of the single IMPORT command. But making them two commands would do no harm. WDYT?

@dangleptr
Contributor

> Download and Ingest are just two conceptual phases of the single IMPORT command. But making them two commands would do no harm. WDYT?

For now, we'd better use two commands to control the whole procedure.

> We need a way to abort the whole procedure, even after the user closes the original console that issued the IMPORT command. (In blocking mode, we could just press Ctrl-C to abort.)

Yes, we need this feature.

@sherman-the-tank
Member Author

sherman-the-tank commented Jun 30, 2019

Awesome discussion thread 👍 Way to go, guys!!

Here are some of my thoughts:

  1. The IMPORT DATA ... statement has two modes:
    • In LOCAL mode, the statement specifies a local CSV file path. The console reads the file and generates bulk INSERT statements, which are sent to the Graph Engine for execution.
    • In SERVER mode, the execution process is very much like @spacewalkman's picture above. The IMPORT DATA statement is sent to the Graph Engine, and the Graph Engine contacts the Meta Service to kick off an asynchronous task (the task will orchestrate the SST file download process; other possible tasks include index repair and so on). The task ID is returned to the console. The statement is non-blocking (we should never block in a distributed environment).
  2. Users should be able to query the task list from the Meta Service using the statement SHOW TASKS...
  3. Users should be able to check the status of a specific task using the statement SHOW TASK STATUS <taskid>

The last two points also apply to index repair and other tasks.
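
For the LOCAL mode in point 1, the console-side work could be sketched roughly as follows in Python: read CSV rows and group them into batched INSERT statements. The `INSERT VERTEX ... VALUES vid:(...)` statement shape, the escaping rules in `quote`, and the tag and property names are assumptions for illustration, and every property is treated as a string for simplicity.

```python
import csv
import io


def quote(value):
    """Render a CSV field as a double-quoted string literal (assumed escaping)."""
    return '"' + value.replace("\\", "\\\\").replace('"', '\\"') + '"'


def insert_statements(rows, tag, props, batch_size):
    """Yield one INSERT statement per batch of rows; column 0 is the vertex ID."""
    head = f"INSERT VERTEX {tag}({', '.join(props)}) VALUES "
    batch = []
    for row in rows:
        if not row:  # skip blank lines in the CSV
            continue
        vid, values = row[0], row[1:]
        batch.append(vid + ":(" + ", ".join(quote(v) for v in values) + ")")
        if len(batch) == batch_size:
            yield head + ", ".join(batch) + ";"
            batch = []
    if batch:  # flush the final, possibly smaller batch
        yield head + ", ".join(batch) + ";"


# Hypothetical input: vertex ID, name, team.
data = io.StringIO("100,Tim,Spurs\n101,Tony,Spurs\n102,Manu,Spurs\n")
statements = list(insert_statements(csv.reader(data), "player", ["name", "team"], 2))
```

Batching keeps the number of round trips to the Graph Engine proportional to the row count divided by `batch_size` rather than to the row count itself.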

@sherman-the-tank
Member Author

Regarding tasks: as soon as a task is created on the Meta Service, it is global and not associated with any space.
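
A small sketch of what "global" could mean for the task table, in Python: tasks are keyed by ID only, so listing and status lookups work regardless of the session's current space. `TaskRegistry`, the task kinds, and the status strings are hypothetical names chosen for this example.

```python
import itertools
from dataclasses import dataclass


@dataclass
class Task:
    task_id: int
    kind: str  # e.g. "IMPORT" or "INDEX_REPAIR" (hypothetical names)
    status: str = "QUEUED"


class TaskRegistry:
    """Global task table: tasks are keyed by ID only, never by graph space."""

    def __init__(self):
        self._ids = itertools.count(1)
        self._tasks = {}

    def create(self, kind):
        task = Task(next(self._ids), kind)
        self._tasks[task.task_id] = task
        return task.task_id  # what the console would get back

    def set_status(self, task_id, status):
        self._tasks[task_id].status = status

    def show_tasks(self):
        """Rough analogue of SHOW TASKS."""
        return sorted(self._tasks.values(), key=lambda t: t.task_id)

    def show_task_status(self, task_id):
        """Rough analogue of SHOW TASK STATUS <taskid>."""
        return self._tasks[task_id].status


registry = TaskRegistry()
import_id = registry.create("IMPORT")
repair_id = registry.create("INDEX_REPAIR")
registry.set_status(import_id, "RUNNING")
```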

yixinglu pushed a commit to yixinglu/nebula that referenced this issue Mar 21, 2022