Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEATURE] Support Cross Cluster Search in SQL/PPL #789

Closed
penghuo opened this issue Aug 25, 2022 · 6 comments
Closed

[FEATURE] Support Cross Cluster Search in SQL/PPL #789

penghuo opened this issue Aug 25, 2022 · 6 comments
Assignees
Labels

Comments

@penghuo
Copy link
Collaborator

penghuo commented Aug 25, 2022

Is your feature request related to a problem?
Currently, SQL/PPL does not support cross cluster search (CCS) because CCS does not support fetch index mapping from remote cluster

What solution would you like?
Support CCS in SQL/PPL

@penghuo penghuo added enhancement New feature or request untriaged and removed untriaged labels Aug 25, 2022
@dai-chen dai-chen added feature and removed enhancement New feature or request labels Nov 7, 2022
@padmashreemutha
Copy link

Yes, I would like to have this feature to get the CCS capability with my old SQL queries.

@seankao-az
Copy link
Collaborator

seankao-az commented Feb 25, 2023

1 Introduction

We will allow any node in an opensearch cluster to execute search requests against other opensearch clusters using piped processing language (PPL) and SQL. The remote cluster should respond with the documents matching the search query.

This is a feature already supported by the OpenSearch service. Users can set up a connection between clusters, and later use that connection for cross cluster search using DSL. The cluster accepting the user request will be the coordinating cluster, and it will forward the request to remote clusters to fetch data results.

The goal of this project is to allow users to make such search using PPL and SQL.

2 Background

2.1 Motivation

Usage of piped processing language (PPL) and SQL to query data stored in OpenSearch is now restricted to local cluster search only. Customers are not able to invoke cross cluster search with both PPL and SQL, and have to make individual connections to each of the remote clusters to fetch data. This is a pain point for them.

As we are integrating more datasources into our query engine, such as prometheus for event monitoring, to allow customers to use PPL to easily explore and discover data of various sources, we also want to better support customers who set up multiple smaller OpenSearch clusters instead of a single large cluster.

2.2 Cross Cluster Search (CCS) in OpenSearch

Users could add the remote cluster to the coordinating clusters’s setting:

PUT _cluster/settings
{
  "persistent": {
    "cluster.remote": {
      "<remote cluster name>": {
        "seeds": ["<IP address>:<port>"]
      }
    }
  }
}

Syntax for cross cluster search is as follows:

# Search index_1 on remote cluster cluster_1
GET cluster_1:index_1/_search

# Search index_1 on both the local cluster and remote cluster cluster_1
GET index_1,cluster_1:index_1/_search

# Search index_1 on both the remote clusters cluster_1 and cluster_2
GET cluster_1:index_1,cluster_2:index_1/_search

# Search index that matches the name pattern on remote cluster cluster_1
GET cluster_1:index_*/_search

# Search all indices on all remote clusters; local cluster NOT included
GET *:*/_search

2.3 Challenge for CCS in PPL/SQL

Our execution flow for each query involves two major steps:

  • (1) and (2): A system index query to get the field mapping and cluster settings. See Appendix B.
  • (3) and (4): An index search query to get the documents.

In order to support CCS in PPL/SQL, we need to make CCS possible for both steps.

  • (3.a) and (3.b): CCS for index data search is already supported. When the name contains a colon in it, for example my-cluster:my-index, a CCS is invoked to “my-cluster”. It can also resolve expressions such as my-cluster-*:my-index, and make CCS to all clusters with the prefix my-cluster-
  • (1.a) and (1.b): However, CCS for field mapping and cluster settings query is not supported.

Screenshot 2023-02-28 at 10 56 19 AM

The main design decision to make is: How to get index mapping and cluster settings from remote clusters?

3 User Interface

3.1 Syntax Options

3.1.1 Option 1 (preferred): Colon syntax

This option uses the same identifier syntax as that of CCS in OpenSearch.

Identifiers

Identifiers should match one of the following syntax:

  • <index-identifier> for local cluster query
  • <cluster-identifier>:<index-identifier> for remote cluster query

Both <index-identifier> and <cluster-identifier> can contain asterisks (*) for pattern matching.

Examples

PPL
search source=cluster_1:index_1
search source=index_1, cluster_1:index_1, cluster_2:index_1
SQL
SELECT * FROM cluster_1:index_1
SELECT * FROM index_1 UNION SELECT * FROM cluster_1:index_1

Pros and Cons

Pros
  • Match the user experience with OpenSearch query, which also uses the : syntax
  • Make cross-cluster search transparent to the sql plugin. A cross-cluster search query is forwarded to the OS service as is.
Cons
  • Does not follow the standard catalog.schema.table format

3.1.2 Option 2: Dot syntax

This option provides the same user experience with standard catalog.schema.table syntax.

Identifiers

Identifiers should match one of the following syntax:

  • <index-identifier> for local cluster query
  • <cluster-identifier>.<index-identifier> for remote cluster query

Both <index-identifier> and <cluster-identifier> can contain asterisks (*) for pattern matching.

Examples

PPL
search source=cluster_1.index_1
search source=index_1, cluster_1.index_1, cluster_2.index_1
SQL
SELECT * FROM cluster_1.index_1
SELECT * FROM index_1 UNION SELECT * FROM cluster_1.index_1

Pros and Cons

Pros
  • Provide a unified query user experience with SQL standard and datasources
Cons
  • Ambiguity in cluster name / index name resolving
    • It's possible to use backticks (`) to eliminate ambiguity in cluster name / index name resolving. However:
      • Might confuse user if they’re not familiar with the syntax and use search source=cluster.index instead of search source=cluster.`index` . The former will result in a local cluster searching for the exact name "cluster.index"
      • Using backticks for index name prohibits use of * for pattern matching, e.g. search source=cluster.`foo*bar*`

3.2 Response formats

Response formats are exactly the same as a regular local cluster search. They are described here: https://opensearch.org/docs/latest/search-plugins/sql/response-formats/

The default response format is jdbc for both PPL and SQL. An example response is as follows:

{
    "schema": [{
        "name": "firstname",
        "type": "text"
    },
    {
        "name": "lastname",
        "type": "text"
    },
    {
        "name": "age",
        "type": "long"
    }
    ],
    "datarows": [
        [
            "Nanette",
            "Bates",
            28
        ],
        [
            "Amber",
            "Duke",
            32
        ]
    ],
    "total": 4,
    "size": 2,
    "status": 200
}

4 Design

We would like to fill the gap between the sql plugin and the OpenSearch CCS service.

4.1 Option 1

Support “get mappings” query for CCS in OpenSearch, filling the gap shown in the diagram in section 2.3
As far as the sql plugin is concerned, there will be no difference in executing local cluster search and remote cluster search.

Pros

  • Result is accurate (compared to option 3)
  • Less requests made (compared to option 2)

Cons

  • RFC. Some inputs from ccs team?

4.2 Option 2

Find a way to connect to each remote cluster individually, and route the cross cluster search requests to the remote clusters directly. For example, registering each remote cluster as a datasource.

Evaluation

Here are some reasons we believe this is not the right call:

  • Cannot leverage the existing logic and optimization in the OpenSearch service for cluster name resolving, request grouping and routing
  • Need to re-invent the wheel for cross-cluster search in sql plugin
  • More resources needed for multiple connections and request round trips.

4.3 Option 3 (preferred)

Workaround by querying the local cluster index for field mapping and cluster settings, assuming that they’re the same as the remote clusters’.
Since cross cluster index field mapping search isn’t supported in OpenSearch, we impose the following limitation to CCS in PPL/SQL:

  • The index mappings (schemas) should be present on the local cluster

This limitation can be removed once we could search the cross cluster index mapping using OpenSearch.

Following the assumption that remote index mappings should also be present at local cluster, we get rid of the cluster name, and use only the index name to query the local cluster about its mapping.
For example, use books instead of ccs-cluster2:books to find index mapping locally. This satisfies the use case where an index is sharded across multiple clusters, so that multiple clusters have the same indices (name and mappings) containing different data.

A breakdown of how we gather each info for a cross cluster search query:

  • Get field mappings from local index stored at local cluster
  • Get cluster settings from local cluster
  • Get documents from remote index stored at remote cluster

For multi-indices query:
ccs-cluster2:books1, books2

  1. option 1 (preferred): use books1, books2, which is the same as current behavior for local cluster search
  2. option 2: try books1, then books2
  3. option 3: use books2, prioritizing the local cluster index

One thing to note is that if the user wants to search only a single index on a single remote cluster, and that the same index name doesn’t already exist on the local cluster, we do not allow such search. A workaround for this is that the user create a mapping table for that index on the local cluster. It does not need to be an index that stores actual data. It contains only the schema. This way, we could search the local cluster for field mapping, and search the data on the remote cluster.

Evaluation

Overall, this option might lead to the most confusing experience for users. However, it naturally solves the use case where an index is sharded across multiple clusters. With clear documentation and response warning / error handling, this option could work for some use cases.

Appendix

A. PPL/SQL Search Syntax

PPL

Both asterisk (*) in index names and comma-separated multi-indices search are supported. However, we do not support searching only using an asterisk.

# These are supported
search source=index_*
search source=index_1,index_2

# This is NOT supported
search source=*

SQL

SQL syntax for querying multiple indices isn’t using the comma syntax. We can use the UNION keyword.

-- This is NOT supported
SELECT * FROM index_1,index_2

-- This is supported
SELECT * FROM index_1
UNION
SELECT * FROM index_2

Using the asterisk character (*), we can query multiple indices that match the expression. However, we do not support searching only using an asterisk.

-- This is supported
SELECT * FROM index_*

-- This is NOT supported
SELECT * FROM *

B. Index Mapping and Cluster Settings in Query Processing

Currently, cluster settings is used to determine the maximum size of response.

Index mapping is used for

@anirudha
Copy link
Collaborator

Remote cluster shouldn't be data sources. Today's concepts of cross cluster exists independent of data sources

@anirudha
Copy link
Collaborator

I would prioritize PPL work over sql. It's would be always good to have both/ but we have many users asking for ppl soon.

@seankao-az
Copy link
Collaborator

seankao-az commented Apr 6, 2023

Tasks

  • Development
    • modify grammar for ppl
    • get field mapping at local cluster
    • forward ccs search request to opensearch
  • Testing
    • unit test (full coverage)
    • integration test (need cross cluster setup)
    • piped commands should work with ccs
    • cross cluster credentials with security plugin
  • Communication to customer about feature, and more importantly, the restriction
    • build a dev version to share with customer
    • documentation in project
    • documentation on OpenSearch website

@anirudha
Copy link
Collaborator

PPL CCS is released

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

5 participants