Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PandasConnector #3492

Closed
wants to merge 39 commits into from
Closed

Conversation

rhunwicks
Copy link
Contributor

Here is an initial draft of PandasConnector for consideration, as discussed in #3302.

It is not complete, but it shows the general approach.

I have created contrib.tests.connectors which implements a BaseConnectorTestCase which I am using to make sure that the results returned by PandasConnector are consistent with those from SqlaTable.

If we get to the point where this PR can be merged, then we will probably want to refactor to put BaseConnectorTestCase and SqlaConnectorTestCase into the main tests directory and just leave the PandasConnectorTestCase in the contrib directory.

@rhunwicks rhunwicks changed the title Initial commit of PandasConnector - see #3302 PandasConnector Sep 19, 2017
Copy link
Contributor

@xrmx xrmx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some small nitpicks

d['granularity_sqla'] = utils.choicify(self.dttm_cols)
d['time_grain_sqla'] = [(g, g) for g in self.GRAINS.keys()]
logging.info(d)
print(d)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

debug leftover


class PandasDatabase(object):
"""Non-ORM object for a Pandas Source"""
database_name = ''
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i don't think you need to set defaults if they are not optional in init


Each Pandas Datasource can have multiple columns"""

__tablename__ = 'pandascolumns'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We may want to be consistent with sqla backend and call it 'pandas_columns'

# Build a dict of the metrics to include, including those that
# are required for post-aggregation filtering
filtered_metrics = [flt['col']
for flt in extras.get('having_druid', [])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

having_druid? If we are reusing the field a comment would be nice

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm adding a comment for the use of having_druid, and similar ones for granularity_sqla and time_grain_sqla. However, if there is a long term plan to encourage the development of more connectors it would be better to rename these parameters in the main code. For example to granularity_col, granularity_freq and post_aggregation_filter.


This is used to be displayed to the user so that she/he can
understand what is taking place behind the scene"""
import json
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can make these imports global

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll take them out, they are part of the debugging code.

return val.isoformat() + "Z"

logging.info(json.dumps(query_obj, indent=4, default=to_serializable))
print(json.dumps(query_obj, indent=4, default=to_serializable))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

debug

'format': _(
"The format of the raw data, e.g. csv"),
'additional_parameters': _(
"A JSON-formatted dictionary of additional parameters "
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we have add something like: "These are the actual parameters for the read_* functions, see https://pandas.pydata.org/pandas-docs/stable/api.html"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@xrmx
Copy link
Contributor

xrmx commented Sep 20, 2017

Thanks a lot for your great work, i'd really like this to be a first citizen in superset

# for post-aggregation filters, and we are reusing that
# interface component.
filtered_metrics = [flt['col']
for flt in extras.get('having_druid', [])
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm have added a comment for the use of having_druid, and similar ones for granularity_sqla and time_grain_sqla. However, if there is a long term plan to encourage the development of more connectors it would be better to rename these parameters in the main code. For example to granularity_col, granularity_freq and post_aggregation_filter.

@xrmx xrmx mentioned this pull request Sep 27, 2017
@coveralls
Copy link

Coverage Status

Coverage remained the same at 70.161% when pulling 9df1786 on kimetrica:3302-pandas-connctor into ef59b6b on apache:master.

3 similar comments
@coveralls
Copy link

Coverage Status

Coverage remained the same at 70.161% when pulling 9df1786 on kimetrica:3302-pandas-connctor into ef59b6b on apache:master.

@coveralls
Copy link

Coverage Status

Coverage remained the same at 70.161% when pulling 9df1786 on kimetrica:3302-pandas-connctor into ef59b6b on apache:master.

@coveralls
Copy link

Coverage Status

Coverage remained the same at 70.161% when pulling 9df1786 on kimetrica:3302-pandas-connctor into ef59b6b on apache:master.

@coveralls
Copy link

coveralls commented Sep 28, 2017

Coverage Status

Coverage increased (+0.5%) to 70.612% when pulling 4b47b75 on kimetrica:3302-pandas-connctor into ef59b6b on apache:master.

@coveralls
Copy link

coveralls commented Sep 28, 2017

Coverage Status

Coverage increased (+0.5%) to 70.612% when pulling d156430 on kimetrica:3302-pandas-connctor into ef59b6b on apache:master.

@coveralls
Copy link

Coverage Status

Coverage increased (+0.7%) to 70.83% when pulling fa38256 on kimetrica:3302-pandas-connctor into ef59b6b on apache:master.

1 similar comment
@coveralls
Copy link

coveralls commented Sep 28, 2017

Coverage Status

Coverage increased (+0.7%) to 70.83% when pulling fa38256 on kimetrica:3302-pandas-connctor into ef59b6b on apache:master.

@rhunwicks
Copy link
Contributor Author

@mistercrunch I have moved the migration into superset/migrations/versions in order to get Travis to run. I can move it back out if you decide you want to keep it in a contrib directory.

I have also added lxml and beautifulsoup4 to dev-reqs.txt because contrib.tests.connector_tests uses them to prepare test data. connector_tests is probably useful for other people writing connectors as well as providing increased test coverage for superset/connectors/sqla/models.py so you might want to adopt some of it into core.

Please let me know what you would like me to do next?

@coveralls
Copy link

coveralls commented Oct 3, 2017

Coverage Status

Coverage increased (+0.7%) to 70.814% when pulling 1df6237 on kimetrica:3302-pandas-connctor into ef59b6b on apache:master.

@coveralls
Copy link

coveralls commented Oct 5, 2017

Coverage Status

Coverage increased (+0.4%) to 70.535% when pulling 878c7c4 on kimetrica:3302-pandas-connctor into ef59b6b on apache:master.

@coveralls
Copy link

coveralls commented Oct 5, 2017

Coverage Status

Coverage increased (+0.4%) to 70.535% when pulling 2826876 on kimetrica:3302-pandas-connctor into ef59b6b on apache:master.

@coveralls
Copy link

coveralls commented Oct 5, 2017

Coverage Status

Coverage increased (+0.4%) to 70.535% when pulling b24a700 on kimetrica:3302-pandas-connctor into ef59b6b on apache:master.

@coveralls
Copy link

coveralls commented Oct 5, 2017

Coverage Status

Coverage increased (+0.6%) to 70.753% when pulling 3610ac0 on kimetrica:3302-pandas-connctor into ef59b6b on apache:master.

@coveralls
Copy link

coveralls commented Oct 17, 2017

Coverage Status

Coverage increased (+1.3%) to 71.412% when pulling bb78be8 on kimetrica:3302-pandas-connctor into ef59b6b on apache:master.

@rhunwicks
Copy link
Contributor Author

Having the migrations and requirements mixed in with the main Superset ones without ever merging this MR causes frequent merge conflicts. Therefore, I am going to close this MR. We will continue to maintain the connector in order to allow us to use Superset with APIs and remote files, but we will do so in our fork rather than a MR.

@rhunwicks rhunwicks closed this Apr 16, 2018
@rhunwicks rhunwicks mentioned this pull request Apr 16, 2018
1 task
@stu-co
Copy link

stu-co commented Jul 11, 2018

@rhunwicks thank you for creating this code. It seems pretty perfect for us to be honest - I wish it had made it into superset!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants