
Add a python script as datasource #2790

Closed
thoth291 opened this issue May 21, 2017 · 7 comments

@thoth291

Is there any way to write and publish custom Python scripts so that they would behave like any other (real-time) data source?
For example, assume that you have some trading data:

| time              | value1 | value2 | value3 |
|-------------------|--------|--------|--------|
| 12:05 25 May 2001 | 13.5   | 14.5   | 15.5   |
| ...               | ...    | ...    | ...    |
| 18:13 12 Jan 2017 | 15.5   | 13.5   | 10.5   |

My Python script contains a class which reads this data from the given input table and outputs another table like this:

| time              | inc_probability1 | inc_probability2 | inc_probability3 |
|-------------------|------------------|------------------|------------------|
| 12:05 25 May 2017 | 0.9              | 0.8              | 0.7              |
| ...               | ...              | ...              | ...              |
| 18:13 12 Dec 2017 | 0.4              | 0.7              | 0.95             |

Now assume that my custom script runs on table T1 and produces table P1.
T1 may itself be the output of another Python script, or real(-time) database data.
P1 should be a first-class citizen in the world of tables, able to respond to SQL queries, etc.
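
To make the idea concrete, here is a minimal sketch of what such a script could look like, assuming pandas DataFrames as the table representation; the class name and the probability formula are purely illustrative, nothing like this exists in Superset today:

```python
# Hypothetical sketch: a class that consumes one table (T1) as a DataFrame
# and produces another (P1).
import pandas as pd

class TradingProbabilities:
    """Reads a table of trading values and emits a table of probabilities."""

    def transform(self, t1: pd.DataFrame) -> pd.DataFrame:
        # Toy logic: squash each value column into (0, 1) so the output
        # resembles the inc_probability columns in the example above.
        p1 = pd.DataFrame({"time": t1["time"]})
        value_cols = [c for c in t1.columns if c != "time"]
        for i, col in enumerate(value_cols):
            p1["inc_probability%d" % (i + 1)] = t1[col] / (1 + t1[col])
        return p1

t1 = pd.DataFrame({
    "time": ["12:05 25 May 2001", "18:13 12 Jan 2017"],
    "value1": [13.5, 15.5],
    "value2": [14.5, 13.5],
    "value3": [15.5, 10.5],
})
p1 = TradingProbabilities().transform(t1)  # P1 could then feed another script
```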

This functionality would allow building a robust R&D framework, not only to explore the data but also to build a pipeline of workflows in a more visual way, and to present and share them with your team.

For example, User1 would create a custom script CS1 which takes T1 as input and produces output P1.
User2 doesn't know whether P1 is a real table or produced (on the fly) by CS1. User2 just puts it into their dashboard and, after playing with it enough, realizes that in order to make some use of P1 they need to create a custom script CS2 which takes P1 and T2 as input and outputs P2 (which can be immediately visualized in the dashboard).

Here is a diagram:
[screenshot: diagram of the pipeline described above]

A solution like this might also solve the issue of unsupported datasources for good, as it would let users write their own Table Generators and use those tables as inputs.

I can provide more concrete examples if these were not clear enough.

Thank you very much for such great software, but this missing feature is a real showstopper for me and my team right now.

All the Best Wishes!

@mistercrunch
Member

Well, so the "connectors" interface is extendable, but it assumes that the source can filter, aggregate, and expose the tabular structures it makes available (tables/views). Superset asks the database through the query interface, and it's assumed that the backend aggregates and filters the specific "cut" of data.

Would your use cases be geared towards altering atomic data in Python, or preaggregated data? I'd assume the former, which means you'd need to perform aggregations yourself, taking over a lot of the database's functions and putting much more workload on the web server, which would have to see all the atomic data of the table on each query.

If it's the second use case, you could hook it in easily on top of an existing datasource as some sort of dataframe mutator function that would receive the dataframe and return another. Though I'm unclear on what this would be used for.
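
For illustration, such a mutator could look like the sketch below; the hook point and the name `df_mutator` are hypothetical, not an existing Superset API:

```python
# Hypothetical "dataframe mutator" hook: receives the already-filtered,
# already-aggregated query result and returns another DataFrame.
import pandas as pd

def df_mutator(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    # Example post-aggregation tweak computed in Python: each row's share
    # of the column total (assumes a numeric "value1" column exists).
    out["value1_share"] = out["value1"] / out["value1"].sum()
    return out

# The datasource would invoke the hook just before returning results,
# e.g.: result_df = df_mutator(result_df)
```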

@thoth291
Author

Thank you for your answer, @mistercrunch.
As far as I understand, the only features I need to cover in my custom Python data source are:

  • Input data indices (time in this case)
  • Output data indices (time in this example, but it could be anything else)
  • Returning the list of output columns
  • Generating one particular column for given indices
  • Aggregation can easily be implemented outside of this script, but it might be useful to have it as an optional feature

This Python script would serve as a low-level implementation of your dataframe handler, which would provide transparent access to features such as tables/views/aggregation.

These capabilities should be sufficient to support all the necessary features of normal table queries, as far as I understand... but I might be wrong, as I haven't looked at the code at all.
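
A rough sketch of that interface, assuming an abstract base class and pandas types (all names here are hypothetical, not part of Superset):

```python
# Hypothetical minimal interface matching the feature list above.
from abc import ABC, abstractmethod
import pandas as pd

class PythonDatasource(ABC):
    @abstractmethod
    def input_index(self) -> pd.Index:
        """Indices of the input data (time, in the example)."""

    @abstractmethod
    def output_index(self) -> pd.Index:
        """Indices of the output data (may differ from the input's)."""

    @abstractmethod
    def output_columns(self) -> list:
        """Return the list of output column names."""

    @abstractmethod
    def generate_column(self, name: str, index: pd.Index) -> pd.Series:
        """Generate one particular column for the given indices."""

    def aggregate(self, df: pd.DataFrame, by: str) -> pd.DataFrame:
        """Optional: aggregation could also live outside the script."""
        return df.groupby(by).mean()
```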

@QiXuanWang

I had a similar question earlier, but didn't put it as well as you. As I understand it, your request would skip the SQLAlchemy interface and feed data to the Superset frontend directly?
mistercrunch indicated that SQL "views" are the best solution, but that requires all the atomic data to be stored in the DB rather than generated on the fly. I'm not a big fan of SQL; it might be faster, but aggregated data is not always useful.

@mistercrunch
Member

Notice: this issue has been closed because it has been inactive for 334 days. Feel free to comment and request that this issue be reopened.

@agershon0

Hi, any updates on this issue? It is a very useful data science tool to have:
raw data -> python(raw data) = feature engineering -> Superset dashboard presentation.
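
In the meantime, that pipeline can be approximated by running the feature engineering step in Python yourself and writing the result to a database Superset already supports; a minimal sketch, where the file name, connection string, and table name are placeholders:

```python
# Workaround sketch: feature engineering in pandas, results pushed to a
# SQL table that a Superset dataset can point at.
import pandas as pd
from sqlalchemy import create_engine

raw = pd.read_csv("raw_data.csv")  # raw data
# python(raw data) = feature engineering: add a derived column as a demo
features = raw.assign(value_mean=raw.mean(axis=1, numeric_only=True))

engine = create_engine("postgresql://user:password@host/db")  # placeholder
features.to_sql("features", engine, if_exists="replace", index=False)
# Superset dash presentation: create a dataset on the "features" table.
```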

@kdcyberdude

Any update on this?

@ly0

ly0 commented Mar 7, 2023

Any update on this?
