Add EP-PQM-related code #1

Merged
merged 1 commit on Jan 20, 2022
37 changes: 32 additions & 5 deletions README.md
@@ -1,8 +1,10 @@
# String Comparison on a Quantum Computer

This repository contains the files needed to compare and classify strings on a quantum computer, using the Hamming distance between a string and a set of strings.
For computing the distance between a target string and the closest string in a group of strings, see the [preprint](https://arxiv.org/abs/2106.16173). The code for this preprint is packaged in [v0.1.0](https://github.com/miranska/qc-str/releases/tag/v0.1.0). The core code resides in `string_comparison.py`.

Furthermore, this repository extends the above codebase with an efficient version of the Parametric Probabilistic Quantum Memory ([P-PQM](https://doi.org/10.1016/j.neucom.2020.01.116)) approach, which computes the probability that a string belongs to a particular group of strings (i.e., a machine learning classification problem). We call our algorithm EP-PQM; see the [preprint](https://arxiv.org/abs/2201.07265) for details. The code for this preprint is packaged in [v0.2.0](https://github.com/miranska/qc-str/releases/tag/v0.2.0).


## Setup
To set up the environment, run
```bash
pip install -r requirements.txt
```

## Usage examples
### Computing the distance between a target string and the closest string in a group
`test_string_comparison.py` contains unit test cases for `string_comparison.py`. This file can also be interpreted as a
set of examples of how `StringComparator` in `string_comparison.py` should be invoked. To run, do
@@ -25,8 +28,18 @@
To execute code listings shown in the [preprint](https://arxiv.org/abs/2106.16173), run
```bash
python hd_paper_listings.py
```
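
For orientation, here is a minimal sketch of a direct `StringComparator` invocation (the constructor arguments mirror those used in `compute_empirical_complexity.py` in this PR; the concrete strings and parameter values are made up for illustration):
```python
from string_comparison import StringComparator

# Compare a binary target string against a small database of binary strings,
# mirroring the "one-hot" branch of compute_empirical_complexity.py.
target = "0101"
database = ["0101", "0110", "1001"]
x = StringComparator(target, database, symbol_length=2, is_binary=True,
                     symbol_count=None, p_pqm=True)

# The assembled quantum circuit is exposed as x.circuit.
print(x.circuit.num_qubits, x.circuit.depth(), x.circuit.count_ops())
```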

### EP-PQM

The file `compute_empirical_complexity.py` simulates the generation of quantum circuits for string classification as described in the [preprint](https://arxiv.org/abs/2201.07265). Datasets found in `./datasets` (namely, Balance Scale, Breast Cancer, SPECT Heart, Tic-Tac-Toe Endgame, and Zoo) are taken from the UCI Machine Learning [Repository](https://archive.ics.uci.edu/ml/index.php).
To execute, run
```bash
python compute_empirical_complexity.py
```
The output is saved in `stats.csv` and `stats.json`.
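
The saved statistics can be inspected with pandas; here is a minimal sketch (the column names follow the `run_stats` dictionary in `compute_empirical_complexity.py`):
```python
import pandas as pd

# Load the flattened per-run statistics written by compute_empirical_complexity.py.
stats = pd.read_csv("stats.csv")
print(stats[["file_name", "encoding", "backend_type", "qubits_count", "circuit_depth"]])
```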


## Citation
If you use the algorithm or code, please cite them as follows. For computing Hamming distance:
```bibtex
@article{khan2021string,
author = {Mushahid Khan and Andriy Miranskyy},
title = {{String Comparison on a Quantum Computer Using Hamming Distance}},
journal = {CoRR},
volume = {abs/2106.16173},
year = {2021},
archivePrefix = {arXiv},
url = {https://arxiv.org/abs/2106.16173},
eprint = {2106.16173}
}
```

For EP-PQM:
```bibtex
@article{khan2022string,
author = {Mushahid Khan and Jean Paul Latyr Faye and Udson C. Mendes and Andriy Miranskyy},
title = {{EP-PQM: Efficient Parametric Probabilistic Quantum Memory with Fewer Qubits and Gates}},
journal = {CoRR},
volume = {abs/2201.07265},
year = {2022},
archivePrefix = {arXiv},
url = {https://arxiv.org/abs/2201.07265},
eprint = {2201.07265}
}
```

## Contact us
If you found a bug or came up with a new feature,
please open an [issue](https://github.com/miranska/qc-str/issues).
224 changes: 224 additions & 0 deletions compute_empirical_complexity.py
@@ -0,0 +1,224 @@
import json

import numpy as np
import pandas as pd
from qiskit.test.mock import FakeQasmSimulator
from qiskit.transpiler.exceptions import TranspilerError

from string_comparison import StringComparator


class NumpyEnc(json.JSONEncoder):
"""
    Convert numpy int64 values into a type that the standard JSON encoder can serialize
"""

def default(self, obj):
if isinstance(obj, np.int64):
return int(obj)
return json.JSONEncoder.default(self, obj)
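
# Illustrative (hypothetical) use of NumpyEnc:
#     json.dumps({"qubits_count": np.int64(5)}, cls=NumpyEnc)  # -> '{"qubits_count": 5}'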


def standardize_column_elements(column):
"""
    Re-encode the column's elements as consecutive integer IDs so that values are consistent across the column
    Note that the changes happen in place
:param column: pandas series
:return: updated column, number of unique attributes
"""
dic = {}
numeric_id = 0
for ind in range(0, len(column)):
element_value = column.iloc[ind]
if element_value not in dic:
dic[element_value] = numeric_id
numeric_id += 1
column.iloc[ind] = dic[element_value]
return column, len(dic)
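
# Example (illustrative): a column with values ['a', 'b', 'a'] is re-encoded in
# place to [0, 1, 0], and the function returns (column, 2).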


def get_data(file_name, label_location, encoding, columns_to_remove=None,
fraction_of_rows=0.9, random_seed=42):
"""
    Read the dataset, shuffle it, retain a fraction of the rows (90% by default), and group the strings by label/class

:param file_name: name of the file to read the data from
:param label_location: location of the label column (first or last), applied after undesired columns are removed
:param encoding: Type of encoding: either one-hot or label
:param columns_to_remove: List of unwanted columns to remove
:param fraction_of_rows: Fraction of rows to retain for analysis
:param random_seed: The value of random seed needed for reproducibility
    :return: a dict mapping each label to its list of strings, the max number of attributes per feature, and the feature count
"""

df = pd.read_csv(file_name, header=None)

# remove unwanted columns
if columns_to_remove is not None:
df = df.drop(df.columns[columns_to_remove], axis=1)
# update column names
df.columns = list(range(0, len(df.columns)))

# get indexes of data columns
col_count = len(df.columns)
if label_location == "first":
data_columns = range(1, col_count)
label_column = 0
elif label_location == "last":
data_columns = range(0, col_count - 1)
label_column = col_count - 1
else:
raise Exception(f"Unknown label_location {label_location}")

features_cnt = len(data_columns)
# standardize column elements and get max number of attributes in a column/feature
max_attr_cnt = -1
for data_column in data_columns:
updated_column, attr_cnt = standardize_column_elements(df[data_column].copy())
df[data_column] = updated_column
if attr_cnt > max_attr_cnt:
max_attr_cnt = attr_cnt

    # retain a fraction of the rows (drawn at random)
df = df.sample(n=round(len(df.index) * fraction_of_rows), random_state=random_seed)

# get labels
labels = df[label_column].unique()

# generate strings
strings = {}
for label in labels:
single_class = df[df[label_column] == label]
class_strings = []
for ind in range(0, len(single_class.index)):
observation = single_class.iloc[ind]
if encoding == "label":
my_string = []
for feature_ind in data_columns:
my_string.append(str(observation.iloc[feature_ind]))
class_strings.append(my_string)
elif encoding == "one-hot":
my_string = ""
if max_attr_cnt > 2:
for feature_ind in data_columns:
value = observation.iloc[feature_ind]
one_hot = [0] * max_attr_cnt
one_hot[value] = 1
my_string += ''.join(map(str, one_hot))
else: # use binary string for the 2-attribute case
for feature_ind in data_columns:
value = observation.iloc[feature_ind]
one_hot = [value]
my_string += ''.join(map(str, one_hot))

class_strings.append(my_string)
else:
raise Exception(f"Unknown encoding {encoding}")
strings[label] = class_strings

return strings, max_attr_cnt, features_cnt
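
# Illustrative (hypothetical) shape of the return value for a two-class dataset
# with one-hot encoding:
#     strings == {'positive': ['010100...', ...], 'negative': ['100001...', ...]},
#     max_attr_cnt == 3, features_cnt == 9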


if __name__ == "__main__":
stats = []

files = [
{"file_name": "./datasets/balance_scale.csv", "label_location": "first", 'labels': ['R'],
'is_laborious': False},
{"file_name": "./datasets/tictactoe.csv", "label_location": "last", 'labels': ['positive'],
'is_laborious': True},
{"file_name": "./datasets/breast_cancer.csv", "label_location": "last", "remove_columns": [0], 'labels': [2],
'is_laborious': True},
{"file_name": "./datasets/zoo.csv", "label_location": "last", "remove_columns": [0], 'labels': [1],
'is_laborious': True},
{"file_name": "./datasets/SPECTrain.csv", "label_location": "first", 'labels': [1], 'is_laborious': False}
]
encodings = ["one-hot", "label"]

    is_fake_circuit_off = input("Creating the circuit for the fake simulator is laborious. "
                                "Do you want to skip it? (Y/n): ") or "Y"
if is_fake_circuit_off.upper() == 'Y':
print("Skip fake simulator")
backend_types = ["abstract"]
elif is_fake_circuit_off.upper() == 'N':
print("Keep fake simulator")
backend_types = ["abstract", "fake_simulator"]
else:
raise ValueError("Please enter y or n.")

for file in files:
if "remove_columns" in file:
remove_columns = file["remove_columns"]
else:
remove_columns = None
for encoding in encodings:

classes, max_attr_count, features_count = get_data(file["file_name"],
label_location=file["label_location"],
encoding=encoding,
columns_to_remove=remove_columns)
# parameters for String Comparisons
if encoding == "one-hot":
is_binary = True
symbol_length = max_attr_count
p_pqm = True
symbol_count = None
elif encoding == "label":
is_binary = False
symbol_length = None
p_pqm = False
symbol_count = max_attr_count
else:
raise Exception(f"Unknown encoding {encoding}")

for label in classes:
if 'labels' in file: # process only a subset of labels present in file['labels']
if label not in file['labels']:
continue
database = classes[label]
target = database[0] # dummy target string
for backend_type in backend_types:
print(f"Analyzing {file['file_name']} for label {label} on {backend_type}")
if backend_type == "abstract":
x = StringComparator(target, database, symbol_length=symbol_length, is_binary=is_binary,
symbol_count=symbol_count,
p_pqm=p_pqm)
elif backend_type == "fake_simulator":
if file['is_laborious']:
print(f" Skipping {file['file_name']} as it requires too much computing power")
continue
print(" Keep only two rows to speed up processing")
database = database[1:3] # keep only two rows to make it simpler to compute the circuit
try:
x = StringComparator(target, database, symbol_length=symbol_length, is_binary=is_binary,
symbol_count=symbol_count,
p_pqm=p_pqm, optimize_for=FakeQasmSimulator(),
optimization_levels=[0], attempts_per_optimization_level=1)
except TranspilerError as e:
                            print(f"Unexpected {e=}, {type(e)=}")
break
else:
raise Exception(f"Unknown backend type {backend_type}")

circ_decomposed = x.circuit.decompose().decompose(['c3sx', 'rcccx', 'rcccx_dg']).decompose('cu1')
run_stats = {'file_name': file['file_name'], 'encoding': encoding, 'label': label,
'observations_count': len(database), 'features_count': features_count,
'max_attr_count': max_attr_count, 'backend_type': backend_type,
'qubits_count': x.circuit.num_qubits, 'circuit_depth': x.circuit.depth(),
'circuit_count_ops': x.circuit.count_ops(),
'qubits_count_decomposed': circ_decomposed.num_qubits,
'circuit_depth_decomposed': circ_decomposed.depth(),
'circuit_count_ops_decomposed': circ_decomposed.count_ops()
}
stats.append(run_stats)

print(f"Final stats in basic dictionary")
print(stats)

# save stats in JSON format
    with open('stats.json', 'w') as f:
        json.dump(stats, f, cls=NumpyEnc)

# let's also save it in a table
pd.json_normalize(stats).to_csv('stats.csv')
80 changes: 80 additions & 0 deletions datasets/SPECTrain.csv
@@ -0,0 +1,80 @@
1,0,0,0,1,0,0,0,1,1,0,0,0,1,1,0,0,0,0,0,0,0,0
1,0,0,1,1,0,0,0,1,1,0,0,0,1,1,0,0,0,0,0,0,0,1
1,1,0,1,0,1,0,0,1,0,1,0,0,1,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,1,1
1,0,0,0,0,0,0,0,1,0,0,0,0,1,0,1,1,0,0,0,0,0,0
1,0,0,0,1,0,0,0,0,1,0,0,0,1,1,0,1,0,0,0,1,0,1
1,1,0,1,1,0,0,0,1,0,1,0,1,1,0,0,0,0,0,0,0,1,1
1,0,0,1,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,1
1,0,0,1,0,0,0,1,1,0,0,0,0,1,0,1,0,0,0,0,0,1,1
1,0,1,0,0,0,0,1,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0
1,1,1,0,0,1,0,1,0,0,1,1,1,1,0,0,1,1,1,1,1,0,1
1,1,1,0,0,1,1,1,0,1,1,1,1,0,1,0,0,1,0,1,1,0,0
1,1,0,0,0,1,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,1,1
1,0,1,0,0,0,0,1,0,0,0,0,1,0,0,0,1,0,1,0,0,1,1
1,1,0,1,1,0,0,1,1,1,0,1,1,1,1,1,1,0,1,1,0,1,1
1,0,1,1,0,0,1,1,1,0,0,0,1,1,0,0,1,1,1,0,1,1,1
1,0,0,1,1,0,0,0,1,1,0,0,0,1,1,0,1,0,0,0,0,1,0
1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
1,1,0,1,0,1,0,1,1,0,1,0,1,1,0,0,0,1,0,0,1,1,0
1,1,0,0,0,1,0,0,0,0,1,0,0,1,0,0,0,0,0,0,1,0,0
1,0,0,0,0,0,0,1,0,0,0,1,1,1,0,0,0,0,0,0,0,1,1
1,1,0,0,0,1,1,0,1,0,0,1,0,0,0,0,0,0,0,1,1,0,0
1,1,1,0,0,1,1,1,0,0,1,1,1,0,0,0,0,0,0,1,0,0,0
1,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,1,0,1,0,0,0
1,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0
1,0,0,0,1,0,0,1,0,1,0,0,1,0,1,0,0,0,0,0,0,1,0
1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
1,1,0,1,1,1,0,0,0,0,1,0,0,1,1,0,1,0,0,0,1,1,1
1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,1,1,0,0,1,1,1,0,0,1,1,1,0,0,0,0,1,0,1,0,0,1
1,0,1,1,1,0,0,1,1,1,0,1,1,1,0,0,1,1,1,0,0,1,1
1,1,0,1,1,1,0,0,1,1,1,0,0,1,1,1,0,0,0,0,0,1,0
1,1,1,1,1,1,1,0,0,0,0,1,1,1,1,0,1,1,1,1,1,0,1
1,1,1,1,0,1,0,1,1,1,1,0,1,1,1,0,1,0,0,0,1,1,1
1,1,0,0,1,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0
1,1,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0
1,1,0,1,1,0,0,0,1,1,1,0,0,1,1,1,1,0,0,1,1,0,0
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0
0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0
0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
0,0,1,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0
0,1,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1
0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,1,0,0
0,0,0,1,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,1,0,1
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0
0,1,0,0,0,1,0,1,0,0,1,0,1,0,0,0,0,0,0,0,0,0,1
0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0
0,1,1,1,0,1,0,1,1,1,1,1,0,0,1,0,1,0,0,1,0,1,0
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
0,1,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0
0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0
0,1,0,0,0,1,0,1,0,0,1,0,1,0,0,0,0,0,0,1,0,0,0
0,1,0,1,0,0,0,0,1,0,1,0,0,1,0,0,0,0,0,0,0,1,0
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
0,1,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0
0,1,0,0,0,1,1,0,0,1,1,0,0,0,1,0,0,0,0,1,1,0,0
0,1,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0
0,0,0,1,1,0,0,1,0,0,0,0,1,1,1,0,0,0,0,0,0,1,1
0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0