This repo contains the source code of our SIGMOD 2024 paper.
In this work, we present novel techniques for selectivity estimation on queries involving predicates over set-valued attributes. We first propose the set-valued column factorization problem, whereby each set-valued column is converted into multiple numeric subcolumns, and set containment predicates are converted into numeric comparison predicates. This enables us to leverage any existing estimator to perform selectivity estimation. We then develop two methods for column factorization and query conversion, namely ST and ST-hist.
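To make the idea concrete, here is a toy illustration. The element groups, subset codes, and rewritten predicate below are made up for exposition and are not the exact encoding produced by ST or ST-hist:

```python
# Toy factorization: split the element universe {a, b, c, d} into two groups
# and give every possible subset within a group a small integer code.
group1 = {frozenset(): 0, frozenset("a"): 1, frozenset("b"): 2, frozenset("ab"): 3}
group2 = {frozenset(): 0, frozenset("c"): 1, frozenset("d"): 2, frozenset("cd"): 3}

def factorize(s):
    """Replace one set-valued attribute by two numeric subcolumns."""
    return (group1[frozenset(s) & frozenset("ab")],
            group2[frozenset(s) & frozenset("cd")])

print(factorize({"a", "c", "d"}))   # -> (1, 3)
# The containment predicate {a, d} ⊆ S then becomes a purely numeric condition
# on the subcolumns, e.g. sub1 IN (1, 3) AND sub2 IN (2, 3), which any
# off-the-shelf estimator can evaluate. ST uses settries so that such
# conditions become simple range predicates (see the sketch in the ST section).
```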
We use a sample dataset from Twitter that contains 10,000 tuples. Running the code mainly involves four steps:
1. Preprocessing: partition the elements with the graph-coloring-based method.
2. ST: convert the data and queries with the ST method.
3. STH: convert the data and queries with the ST-hist method.
4. Estimation: estimate the converted queries on Postgres, Neurocard, or DeepDB.
Tested with GCC 7.5 and Python 3.7. We suggest using conda to set up the Python environment:
sudo apt install build-essential
sudo apt-get install libboost-all-dev
conda create -n "myenv" python=3.7.0
conda activate myenv
pip3 install psycopg2
pip3 install numpy==1.21.6
pip3 install pandas==1.0.5
pip3 install networkx==2.4
This environment is for running our algorithms and testing on Postgres. If you want to test on Neurocard or DeepDB, please refer to their respective environment requirements.
To partition the elements with the graph-coloring-based method, run:
bash preprocessing.sh
This command first generates indices for the elements in high-frequency-first order (generate_idx.cc), then builds a graph based on the set-valued column (generate_graph.cc), and finally partitions the graph with the graph coloring method (graph_color_partition.py). The partition result is stored in the folder ./graph_color.
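For intuition, the sketch below mimics this step with networkx; the toy sets and the greedy strategy are illustrative, and generate_graph.cc plus graph_color_partition.py remain the authoritative implementation. Co-occurring elements are connected by an edge, so a proper coloring guarantees that elements in the same color class never appear in the same set.

```python
import itertools
import networkx as nx

# toy set-valued column
sets = [{"a", "b"}, {"b", "c"}, {"a", "b", "c"}, {"c", "d"}]

# build the element co-occurrence graph
G = nx.Graph()
for s in sets:
    G.add_nodes_from(s)
    G.add_edges_from(itertools.combinations(sorted(s), 2))

# proper greedy coloring: co-occurring elements never share a color,
# so each set intersects each color class in at most one element
coloring = nx.greedy_color(G, strategy="largest_first")
print(coloring)   # element -> color (partition) id
```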
After preprocessing, run the following command to convert the set-valued data into 5 numeric subcolumns and convert the queries with the ST method:
bash run_settrie.sh 5
It first greedily partitions the elements based on the graph coloring result (color_based_dis_part.cc), then constructs settries based on the partitioning (set_map.cc), and finally converts the queries based on the settries (settrie_trans_query.cc). The partition result, settries, and converted queries are stored in the folder ./color_partition_5. You can change the argument of this command to change the number of subcolumns.
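The sketch below illustrates, with a single partition and made-up data, how a settrie can turn a set containment predicate into range predicates over a numeric subcolumn: nodes are numbered in DFS order, each tuple's set is mapped to the ID of its node, and the supersets of a query set then fall into a union of contiguous ID intervals. It is only a simplified rendering of the idea; set_map.cc and settrie_trans_query.cc are the actual implementation.

```python
class Node:
    def __init__(self):
        self.children = {}      # element -> child Node
        self.lo = self.hi = 0   # DFS interval covered by this subtree
        self.sets_here = []     # row ids whose set ends at this node

def insert(root, elems, row_id):
    node = root
    for e in sorted(elems):                     # fixed global element order
        node = node.children.setdefault(e, Node())
    node.sets_here.append(row_id)

def number_dfs(node, counter=0):
    node.lo = counter
    counter += 1
    for e in sorted(node.children):
        counter = number_dfs(node.children[e], counter)
    node.hi = counter - 1
    return counter

def set_to_code(root, elems):
    node = root
    for e in sorted(elems):
        node = node.children[e]
    return node.lo                              # numeric subcolumn value

def containment_ranges(node, query, depth=0):
    """DFS-ID intervals whose subtrees hold exactly the supersets of `query`."""
    q = sorted(query)
    if depth == len(q):
        return [(node.lo, node.hi)]             # everything below contains q
    ranges = []
    for e, child in node.children.items():
        if e == q[depth]:                       # matched the next query element
            ranges += containment_ranges(child, query, depth + 1)
        elif e < q[depth]:                      # q[depth] may still occur deeper
            ranges += containment_ranges(child, query, depth)
    return ranges

# usage on a toy table
rows = {1: {"a", "b"}, 2: {"b", "c"}, 3: {"a", "b", "c"}, 4: {"c"}}
root = Node()
for rid, s in rows.items():
    insert(root, s, rid)
number_dfs(root)
codes = {rid: set_to_code(root, s) for rid, s in rows.items()}
print(codes)                                    # {1: 2, 2: 5, 3: 3, 4: 6}
# "{b, c} ⊆ S" becomes "(3 <= code <= 3) OR (5 <= code <= 5)", i.e. rows 3 and 2
print(containment_ranges(root, {"b", "c"}))     # [(3, 3), (5, 5)]
```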
Alternatively, you could run the following command to convert the set-valued data into 5 numeric subcolumns and convert the queries with the STH method, keeping 500 nodes in each settrie:
bash run_settrie_hist.sh 5 500
The partitioning is the same as in ST, so the script simply copies it. If you have not run the ST partitioning first, use color_based_dis_part.cc to partition the elements. The script then constructs histogram-based settries from the partitioning (freq_set_map.cc) and finally converts the queries based on the settries (settrie_trans_query_hist.cc). The partition result, settries, and converted queries are stored in the folder ./partition_5_500. You can change the arguments of this command to change the number of subcolumns and the number of nodes kept in each settrie.
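The sketch below continues the toy settrie from the ST section and shows one plausible reading of the node budget; the real criterion lives in freq_set_map.cc and may differ. Only the K most populated settrie nodes keep their own code, and every other set falls back to the code of its deepest kept ancestor, so the kept nodes act like histogram buckets and the converted query becomes approximate.

```python
def subtree_counts(root):
    """Number of stored sets at or below each node, keyed by id(node)."""
    counts = {}
    def dfs(n):
        total = len(n.sets_here) + sum(dfs(c) for c in n.children.values())
        counts[id(n)] = total
        return total
    dfs(root)
    return counts

def code_with_budget(root, elems, kept):
    """Code of the deepest kept node along the path of `elems`."""
    node, code = root, root.lo
    for e in sorted(elems):
        node = node.children[e]
        if id(node) in kept:
            code = node.lo
    return code

K = 500                                   # mirrors `bash run_settrie_hist.sh 5 500`
counts = subtree_counts(root)
kept = {nid for nid, _ in sorted(counts.items(), key=lambda kv: -kv[1])[:K]}
kept.add(id(root))                        # always keep the root
print(code_with_budget(root, {"a", "b", "c"}, kept))
```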
If you want to run estimation with ST on any estimator, first run the following command to convert the dataset:
python3 clique_trans_table.py --partnum 5
Alternatively, if you want to run estimation with STH, first modify line 20 in clique_trans_table.py to change the folder, then run the following command:
python3 clique_trans_table.py --partnum 5 --keepnum 500
To estimate with Postgres, you can modify generate_sql.py and run it to populate the converted dataset into Postgres; it requires the psycopg2 package. Then modify the folder path and database connection parameters in postgres_est_new.py and run it to estimate. Specifically, you can modify the query path at lines 111-116, or use the arguments defined at lines 12-19 to pass them from the command line. You also need to modify line 23 to connect to your own database.
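For reference, here is a minimal sketch of the estimation call; the database name, table name, and subcolumn names below are placeholders, and postgres_est_new.py is the authoritative script. It runs EXPLAIN on a converted query and reads the planner's row estimate.

```python
import json
import psycopg2

# placeholder connection parameters: adjust to your own database
conn = psycopg2.connect(dbname="geotweet", user="postgres",
                        password="postgres", host="localhost", port=5432)
cur = conn.cursor()

# a converted query over hypothetical numeric subcolumns sub1 ... sub5
converted_query = ("SELECT * FROM converted_table "
                   "WHERE sub1 BETWEEN 3 AND 3 AND sub2 BETWEEN 5 AND 7")
cur.execute("EXPLAIN (FORMAT JSON) " + converted_query)
plan = cur.fetchone()[0]
if isinstance(plan, str):                # some setups return the JSON as text
    plan = json.loads(plan)
print("estimated rows:", plan[0]["Plan"]["Plan Rows"])

cur.close()
conn.close()
```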
If you want to estimate with Neurocard, we have modified the source code of Neurocard to support our method. You can refer to Neurocard for how to add a new table and run queries. We also provide the running scripts, but you need to modify some folder paths and parameters in datasets.py (line 317 for geotweet) and run.py (lines 515-523) to point to the correct dataset folder and query folder.
If you want to estimate with DeepDB, we have modified the source code of DeepDB to support our method. You can refer to DeepDB for how to add a new table and run queries. We also provide the running scripts, but you need to specify args.query_file_location for the query location in maqp.py and csvfile in schemas/geotweet/schema.py for the dataset file location.
The incremental ST and STH are in the folder incremental. The running procedure is similar to that of the base methods.
If you find this work helpful, please cite our paper:
@article{DBLP:journals/pacmmod/MengCC23,
author = {Zizhong Meng and
Xin Cao and
Gao Cong},
title = {Selectivity Estimation for Queries Containing Predicates over Set-Valued
Attributes},
journal = {Proceedings of the ACM on Management of Data},
volume = {1},
number = {4},
pages = {261:1--261:26},
year = {2023}
}