Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add more integrated support for Re:Dash users against the presto query_runner #3723

Closed
Closed
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
16 changes: 9 additions & 7 deletions redash/query_runner/presto.py
Original file line number Diff line number Diff line change
Expand Up @@ -98,14 +98,16 @@ def get_schema(self, get_stats=False):
return schema.values()

def run_query(self, query, user):
user_name = self.configuration.get('username', user.email.split('@')[0] if user is not None else 'redash')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There were previous attempts as similar functionality, but it conflicts with how Redash caches data. Redash caches the last query execution result. This means it will cache the last execution of whatever user who executed it last, and hence might leak data to other users while giving the admin a false sense of protection.

I feel that we can't introduce this change as is without addressing the caching issue.

But I think you already use this in your environment, so maybe you can share how it works for you?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are definitely leveraging this implementation in our environment.

Authentication behavior

The behavior of this code in practice allows for the username configuration for the Presto data source to override the individual user logins. (In the case of a situation where Presto is configured with single user access.)

Aside from that, it will attempt to leverage the email address as the basis for the username to pass to Presto. I would much rather prefer to update the SAML code so that we can pass the username as a unique attribute which updates the username attribute of the user object which is passed to the Query runner. That would allow for a consistent method of finding the username which it supposed to be leveraged against Presto. However that was a much more involved change to get in place. This allows a consistent behavior (In most environments, email matches username.) and gets this initial process in place.

I do acknowledge that using a SAML configured Re:Dash cluster against a authentication enforced Presto endpoints does not work right now. There is no way to obtain the specific user's password. However this implementation does help when auditing and investigating the Presto cluster itself, as you know the specific user running the queries in question.

Cached Results Scenario

The cached results is a known scenario and your concerns are understandable. However they are concerns with a slightly inaccurate assumption.

Presto, in general, does not offer a robust authorization solution internally. There is the ability to do static controls through files on disk, and there is one vendor who has added a much more robust support for Apache Ranger (Starburst).

However it is understood and accepted that until Re:Dash supports the ability to control access to queries based on group membership, that any data queried through Re:Dash is visible to all users within Re:Dash. If there is a need to segment what previously queried data people have access to, for example secure financial data, we would stand up a separate Re:Dash cluster environment which only grants access through SAML for the Finance users who need access to it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So you use this mostly for auditing/monitoring purposes?

However it is understood and accepted that until Re:Dash supports the ability to control access to queries based on group membership

You can control access to data sources (and the queries resulting from them) using groups. It doesn't cover all the use cases groups to queries model would cover, but it's still easier than setting up a separate Redash cluster :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Absolutely! Control to the sources itself is supported internally, but not the cached results of the Query. It is simply recognized that the only way to prevent anyone with access to Re:Dash from viewing the cached results is by that larger scale separation.

The primary purpose of having these users getting passed through to Presto is for the purpose of controlling the user's ability to leverage the server resources within Presto through Presto Resource Groups. Previously any Presto Resource Group tunings apply to anyone on Re:Dash which made it difficult to tune resource utilizations within Presto for different users leveraging Re:Dash.

For users leveraging Starburst Presto, by having usernames getting passed through, we gain the ability to supporting Ranger for authorization.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, this sort of makes sense. I'm still worried that for people without the full context this will cause more harm than good. How about we put this behind an environment variable?

Similar to how we do with Athena:

ANNOTATE_QUERY = parse_boolean(os.environ.get('ATHENA_ANNOTATE_QUERY', 'true'))
SHOW_EXTRA_SETTINGS = parse_boolean(os.environ.get('ATHENA_SHOW_EXTRA_SETTINGS', 'true'))
OPTIONAL_CREDENTIALS = parse_boolean(os.environ.get('ATHENA_OPTIONAL_CREDENTIALS', 'true'))

This way you can easily use this without a fork, while it's only enabled for people who fully understand the implication.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is totally reasonable. Let me do a refactor on the code to implement this.

connection = presto.connect(
host=self.configuration.get('host', ''),
port=self.configuration.get('port', 8080),
protocol=self.configuration.get('protocol', 'http'),
username=self.configuration.get('username', 'redash'),
password=self.configuration.get('password'),
catalog=self.configuration.get('catalog', 'hive'),
schema=self.configuration.get('schema', 'default'))
host=self.configuration.get('host', ''),
port=self.configuration.get('port', 8080),
protocol=self.configuration.get('protocol', 'http'),
username=user_name,
password=self.configuration.get('password', None),
source='redash',
catalog=self.configuration.get('catalog', 'hive'),
schema=self.configuration.get('schema', 'default'))

cursor = connection.cursor()

Expand Down