
Add a python script as datasource #2790

Closed
thoth291 opened this issue May 21, 2017 · 7 comments

@thoth291

Is there any way to write and publish custom Python scripts so that they would behave like any other (real-time) data source?
For example, assume that you have some trading data:

| time              | value1 | value2 | value3 |
|-------------------|--------|--------|--------|
| 12:05 25 May 2001 | 13.5   | 14.5   | 15.5   |
| ...               | ...    | ...    | ...    |
| 18:13 12 Jan 2017 | 15.5   | 13.5   | 10.5   |

My Python script contains a class which reads this data from the given input table and outputs another table like this:

| time              | inc_probability1 | inc_probability2 | inc_probability3 |
|-------------------|------------------|------------------|------------------|
| 12:05 25 May 2017 | 0.9              | 0.8              | 0.7              |
| ...               | ...              | ...              | ...              |
| 18:13 12 Dec 2017 | 0.4              | 0.7              | 0.95             |

Now assume that my custom script runs on table T1 and produces table P1.
T1 may itself be the output of another Python script, or real(-time) database data.
P1 should be a first-class citizen in the world of tables, able to respond to SQL queries, etc.
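
To make the idea concrete, here is a minimal sketch of what such a script could look like, assuming pandas DataFrames as the table representation; the class name and the probability formula are purely illustrative, nothing like this exists in Superset today:

```python
# Hypothetical sketch: a class that consumes one table (T1) as a DataFrame
# and produces another (P1).
import pandas as pd

class TradingProbabilities:
    """Reads a table of trading values and emits a table of probabilities."""

    def transform(self, t1: pd.DataFrame) -> pd.DataFrame:
        # Toy logic: squash each value column into (0, 1) so the output
        # resembles the inc_probability columns in the example above.
        p1 = pd.DataFrame({"time": t1["time"]})
        value_cols = [c for c in t1.columns if c != "time"]
        for i, col in enumerate(value_cols):
            p1["inc_probability%d" % (i + 1)] = t1[col] / (1 + t1[col])
        return p1

t1 = pd.DataFrame({
    "time": ["12:05 25 May 2001", "18:13 12 Jan 2017"],
    "value1": [13.5, 15.5],
    "value2": [14.5, 13.5],
    "value3": [15.5, 10.5],
})
p1 = TradingProbabilities().transform(t1)  # P1 could then feed another script
```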

This functionality would allow building a robust R&D framework, not only to explore the data but also to build a pipeline of workflows in a more visual way, and to present and share them with your team.

For example, User1 would create a custom script CS1 which takes T1 as input and produces output P1.
User2 doesn't know whether P1 is a real table or produced (on the fly) by CS1. User2 just puts it into their dashboard and, after playing with it enough, realizes that in order to make some use of P1 they need to create a custom script CS2 which takes P1 and T2 as input and outputs P2 (which can be immediately visualized in the dashboard).

Here is a diagram:
[screenshot: diagram of the pipeline described above]

A solution like this might also solve the issue of unsupported datasources for good, as it would let users write their own Table Generators and use those tables as inputs.

I can provide more concrete examples if these were not clear enough.

Thank you very much for such great software, but this missing feature is a real showstopper for me and my team right now.

All the Best Wishes!

@mistercrunch
Member

Well, so the "connectors" interface is extendable, but it assumes that the source can filter, aggregate, and expose the tabular structures it makes available (tables/views). Superset asks the database through the query interface, and it's assumed that the backend aggregates and filters the specific "cut" of data.

Would your use cases be geared towards altering atomic data in Python, or preaggregated data? I'd assume the former, which means you'd need to perform aggregations yourself, taking over a lot of the database's functions and putting much more workload on the web server, which would have to see all the atomic data of the table on each query.

If it's the second use case, you could hook it in easily on top of an existing datasource as some sort of dataframe mutator function that would receive the dataframe and return another. Though I'm unclear on what this would be used for.
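
For illustration, such a mutator could look like the sketch below; the hook point and the name `df_mutator` are hypothetical, not an existing Superset API:

```python
# Hypothetical "dataframe mutator" hook: receives the already-filtered,
# already-aggregated query result and returns another DataFrame.
import pandas as pd

def df_mutator(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    # Example post-aggregation tweak computed in Python: each row's share
    # of the column total (assumes a numeric "value1" column exists).
    out["value1_share"] = out["value1"] / out["value1"].sum()
    return out

# The datasource would invoke the hook just before returning results,
# e.g.: result_df = df_mutator(result_df)
```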

@thoth291
Author

Thank you for your answer, @mistercrunch.
As far as I understand, the only features I need to cover in my custom Python data source are:

  • Input data indices (time in this case)
  • Output data indices (time in this example, but it could be anything else)
  • Returning the list of output columns
  • Generating one particular column for given indices
  • Aggregation can easily be implemented outside of this script, but it might be useful to have it as an optional feature

This Python script would serve as a low-level implementation of your dataframe handler, which would provide transparent access to features such as tables/views/aggregation.

These capabilities should be sufficient to support all the necessary features of normal table queries, as far as I understand... but I might be wrong, as I haven't looked at the code at all.
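
A rough sketch of that interface, assuming an abstract base class and pandas types (all names here are hypothetical, not part of Superset):

```python
# Hypothetical minimal interface matching the feature list above.
from abc import ABC, abstractmethod
import pandas as pd

class PythonDatasource(ABC):
    @abstractmethod
    def input_index(self) -> pd.Index:
        """Indices of the input data (time, in the example)."""

    @abstractmethod
    def output_index(self) -> pd.Index:
        """Indices of the output data (may differ from the input's)."""

    @abstractmethod
    def output_columns(self) -> list:
        """Return the list of output column names."""

    @abstractmethod
    def generate_column(self, name: str, index: pd.Index) -> pd.Series:
        """Generate one particular column for the given indices."""

    def aggregate(self, df: pd.DataFrame, by: str) -> pd.DataFrame:
        """Optional: aggregation could also live outside the script."""
        return df.groupby(by).mean()
```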

@QiXuanWang

I had a similar question earlier, but didn't put it as well as you. As I understand it, your request would skip the SQLAlchemy interface and feed data to the Superset frontend directly?
mistercrunch indicated that SQL "views" are the best solution, but that requires all the atomic data to be stored in the DB rather than generated on the fly. I'm not a big fan of SQL; it might be faster, but aggregated data is not always useful.

@mistercrunch
Member

Notice: this issue has been closed because it has been inactive for 334 days. Feel free to comment and request that this issue be reopened.

@agershon0

Hi, any updates on this issue? It is a very useful data science tool to have:
raw data -> python(raw data) = feature engineering -> Superset dashboard presentation.
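
In the meantime, that pipeline can be approximated by running the feature engineering step in Python yourself and writing the result to a database Superset already supports; a minimal sketch, where the file name, connection string, and table name are placeholders:

```python
# Workaround sketch: feature engineering in pandas, results pushed to a
# SQL table that a Superset dataset can point at.
import pandas as pd
from sqlalchemy import create_engine

raw = pd.read_csv("raw_data.csv")  # raw data
# python(raw data) = feature engineering: add a derived column as a demo
features = raw.assign(value_mean=raw.mean(axis=1, numeric_only=True))

engine = create_engine("postgresql://user:password@host/db")  # placeholder
features.to_sql("features", engine, if_exists="replace", index=False)
# Superset dash presentation: create a dataset on the "features" table.
```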

@kdcyberdude

Any update on this?

@ly0

ly0 commented Mar 7, 2023

Any update on this?
