IndexFlatL2 performance degradation #1762

Alexsandruss · 2021-03-15T12:33:54Z

Summary

IndexFlatL2 has performance degradation on large number of threads since e1adde0 commit.
This degradation can be observed with benchs/bench_index_flat.py.

bench_index_flat benchmark aggregated results:

q=100

`56 threads`	`100 query size`	before degradation (`698a459`)	after degradation (`e1adde0`)	master (`6977d72`)
dimension	k	time, ms	time, ms	time, ms
16	1	0.767	2.631	2.894
16	10	0.834	3.004	3.194
16	100	0.949	7.237	7.108
32	1	0.297	2.642	2.744
32	10	0.369	3.009	3.05
32	100	0.491	7.218	7.002
64	1	0.317	2.668	2.772
64	10	0.372	3.044	3.069
64	100	0.5	7.341	7.089
128	1	0.466	2.734	2.837
128	10	0.529	3.104	3.14
128	100	0.649	7.431	7.178

q=10000

`56 threads`	`10000 query size`	before degradation (`698a459`)	after degradation (`e1adde0`)	master (`6977d72`)
dimension	k	time, s	time, s	time, s
16	1	0.008	0.181	0.19
16	10	0.009	0.212	0.214
16	100	0.019	0.529	0.52
32	1	0.009	0.167	0.135
32	10	0.01	0.189	0.157
32	100	0.02	0.508	0.513
64	1	0.01	0.174	0.146
64	10	0.011	0.197	0.166
64	100	0.02	0.508	0.503
128	1	0.014	0.173	0.145
128	10	0.015	0.193	0.163
128	100	0.025	0.496	0.488

bench_index_flat benchmark full logs:

before degradation (698a459):

model name      : Intel(R) Xeon(R) Platinum 8280L CPU @ 2.70GHz
************ nthread= 56
*********** nq= 100
========== d= 16
dataset in dimension 16, with metric L2, size: Q 100 B 10000 T 0
search k=  1 t=0.767 ms (± 0.0058)
search k= 10 t=0.834 ms (± 0.0056)
search k=100 t=0.949 ms (± 0.0274)
========== d= 32
dataset in dimension 32, with metric L2, size: Q 100 B 10000 T 0
search k=  1 t=0.297 ms (± 0.0028)
search k= 10 t=0.369 ms (± 0.0044)
search k=100 t=0.491 ms (± 0.0098)
========== d= 64
dataset in dimension 64, with metric L2, size: Q 100 B 10000 T 0
search k=  1 t=0.317 ms (± 0.0059)
search k= 10 t=0.372 ms (± 0.0034)
search k=100 t=0.500 ms (± 0.0124)
========== d= 128
dataset in dimension 128, with metric L2, size: Q 100 B 10000 T 0
search k=  1 t=0.466 ms (± 0.0029)
search k= 10 t=0.529 ms (± 0.0033)
search k=100 t=0.649 ms (± 0.0103)
restab=
 0.767142       0.834286        0.949353
0.297099        0.369161        0.491083
0.316858        0.372231        0.499964
0.465631        0.529319        0.649452
*********** nq= 10000
========== d= 16
dataset in dimension 16, with metric L2, size: Q 10000 B 10000 T 0
search k=  1 t=0.008 s (± 0.0000)
search k= 10 t=0.009 s (± 0.0000)
search k=100 t=0.019 s (± 0.0003)
========== d= 32
dataset in dimension 32, with metric L2, size: Q 10000 B 10000 T 0
search k=  1 t=0.009 s (± 0.0000)
search k= 10 t=0.010 s (± 0.0000)
search k=100 t=0.020 s (± 0.0025)
========== d= 64
dataset in dimension 64, with metric L2, size: Q 10000 B 10000 T 0
search k=  1 t=0.010 s (± 0.0001)
search k= 10 t=0.011 s (± 0.0000)
search k=100 t=0.020 s (± 0.0000)
========== d= 128
dataset in dimension 128, with metric L2, size: Q 10000 B 10000 T 0
search k=  1 t=0.014 s (± 0.0000)
search k= 10 t=0.015 s (± 0.0000)
search k=100 t=0.025 s (± 0.0016)
restab=
 0.00769553     0.00912893      0.0186442
0.00908014      0.00991631      0.0202678
0.0103275       0.0113654       0.02024
0.0141219       0.0154734       0.0250527
************ nthread= 1
*********** nq= 100
========== d= 16
dataset in dimension 16, with metric L2, size: Q 100 B 10000 T 0
search k=  1 t=3.227 ms (± 0.0031)
search k= 10 t=3.660 ms (± 0.0047)
search k=100 t=7.973 ms (± 0.0070)
========== d= 32
dataset in dimension 32, with metric L2, size: Q 100 B 10000 T 0
search k=  1 t=1.633 ms (± 0.0040)
search k= 10 t=1.881 ms (± 0.0249)
search k=100 t=4.617 ms (± 0.0072)
========== d= 64
dataset in dimension 64, with metric L2, size: Q 100 B 10000 T 0
search k=  1 t=2.292 ms (± 0.0024)
search k= 10 t=2.563 ms (± 0.0217)
search k=100 t=5.368 ms (± 0.1682)
========== d= 128
dataset in dimension 128, with metric L2, size: Q 100 B 10000 T 0
search k=  1 t=3.707 ms (± 0.0010)
search k= 10 t=3.948 ms (± 0.0048)
search k=100 t=6.628 ms (± 0.0059)
restab=
 3.22726        3.66044 7.97331
1.6332  1.88079 4.61653
2.29204 2.56261 5.36752
3.70705 3.948   6.62845
*********** nq= 10000
========== d= 16
dataset in dimension 16, with metric L2, size: Q 10000 B 10000 T 0
search k=  1 t=0.121 s (± 0.0001)
search k= 10 t=0.143 s (± 0.0015)
search k=100 t=0.394 s (± 0.0020)
========== d= 32
dataset in dimension 32, with metric L2, size: Q 10000 B 10000 T 0
search k=  1 t=0.127 s (± 0.0014)
search k= 10 t=0.148 s (± 0.0002)
search k=100 t=0.406 s (± 0.0020)
========== d= 64
dataset in dimension 64, with metric L2, size: Q 10000 B 10000 T 0
search k=  1 t=0.160 s (± 0.0003)
search k= 10 t=0.183 s (± 0.0018)
search k=100 t=0.435 s (± 0.0027)
========== d= 128
dataset in dimension 128, with metric L2, size: Q 10000 B 10000 T 0
search k=  1 t=0.213 s (± 0.0004)
search k= 10 t=0.236 s (± 0.0024)
search k=100 t=0.486 s (± 0.0020)
restab=
 0.121088       0.143475        0.393567
0.126678        0.147563        0.405952
0.160052        0.182952        0.435488
0.213282        0.236116        0.485715

after degradation (e1adde0):

model name      : Intel(R) Xeon(R) Platinum 8280L CPU @ 2.70GHz
************ nthread= 56
*********** nq= 100
========== d= 16
dataset in dimension 16, with metric L2, size: Q 100 B 10000 T 0
search k=  1 t=2.631 ms (± 0.0041)
search k= 10 t=3.004 ms (± 0.0050)
search k=100 t=7.237 ms (± 0.0118)
========== d= 32
dataset in dimension 32, with metric L2, size: Q 100 B 10000 T 0
search k=  1 t=2.642 ms (± 0.0040)
search k= 10 t=3.009 ms (± 0.0022)
search k=100 t=7.218 ms (± 0.0110)
========== d= 64
dataset in dimension 64, with metric L2, size: Q 100 B 10000 T 0
search k=  1 t=2.668 ms (± 0.0035)
search k= 10 t=3.044 ms (± 0.0040)
search k=100 t=7.341 ms (± 0.0259)
========== d= 128
dataset in dimension 128, with metric L2, size: Q 100 B 10000 T 0
search k=  1 t=2.734 ms (± 0.0059)
search k= 10 t=3.104 ms (± 0.0026)
search k=100 t=7.431 ms (± 0.0070)
restab=
 2.63062        3.00425 7.23729
2.64192 3.00926 7.21794
2.66764 3.04425 7.34085
2.73407 3.10364 7.4307
*********** nq= 10000
========== d= 16
dataset in dimension 16, with metric L2, size: Q 10000 B 10000 T 0
search k=  1 t=0.181 s (± 0.0024)
search k= 10 t=0.212 s (± 0.0013)
search k=100 t=0.529 s (± 0.0599)
========== d= 32
dataset in dimension 32, with metric L2, size: Q 10000 B 10000 T 0
search k=  1 t=0.167 s (± 0.0019)
search k= 10 t=0.189 s (± 0.0021)
search k=100 t=0.508 s (± 0.0009)
========== d= 64
dataset in dimension 64, with metric L2, size: Q 10000 B 10000 T 0
search k=  1 t=0.174 s (± 0.0002)
search k= 10 t=0.197 s (± 0.0001)
search k=100 t=0.508 s (± 0.0006)
========== d= 128
dataset in dimension 128, with metric L2, size: Q 10000 B 10000 T 0
search k=  1 t=0.173 s (± 0.0001)
search k= 10 t=0.193 s (± 0.0003)
search k=100 t=0.496 s (± 0.0008)
restab=
 0.180814       0.212223        0.529138
0.166661        0.188869        0.508382
0.174292        0.196803        0.508177
0.17272 0.193195        0.495662
************ nthread= 1
*********** nq= 100
========== d= 16
dataset in dimension 16, with metric L2, size: Q 100 B 10000 T 0
search k=  1 t=2.217 ms (± 0.0041)
search k= 10 t=2.530 ms (± 0.0022)
search k=100 t=6.499 ms (± 0.0067)
========== d= 32
dataset in dimension 32, with metric L2, size: Q 100 B 10000 T 0
search k=  1 t=2.697 ms (± 0.0024)
search k= 10 t=2.375 ms (± 0.4417)
search k=100 t=4.781 ms (± 0.0036)
========== d= 64
dataset in dimension 64, with metric L2, size: Q 100 B 10000 T 0
search k=  1 t=2.528 ms (± 0.0052)
search k= 10 t=2.727 ms (± 0.0058)
search k=100 t=5.531 ms (± 0.0050)
========== d= 128
dataset in dimension 128, with metric L2, size: Q 100 B 10000 T 0
search k=  1 t=3.850 ms (± 0.0031)
search k= 10 t=4.058 ms (± 0.0294)
search k=100 t=6.834 ms (± 0.0151)
restab=
 2.21735        2.52995 6.49855
2.69735 2.37504 4.78083
2.52756 2.72736 5.53104
3.85019 4.058   6.83403
*********** nq= 10000
========== d= 16
dataset in dimension 16, with metric L2, size: Q 10000 B 10000 T 0
search k=  1 t=0.129 s (± 0.0011)
search k= 10 t=0.148 s (± 0.0014)
search k=100 t=0.411 s (± 0.0021)
========== d= 32
dataset in dimension 32, with metric L2, size: Q 10000 B 10000 T 0
search k=  1 t=0.140 s (± 0.0001)
search k= 10 t=0.159 s (± 0.0004)
search k=100 t=0.416 s (± 0.0017)
========== d= 64
dataset in dimension 64, with metric L2, size: Q 10000 B 10000 T 0
search k=  1 t=0.177 s (± 0.0016)
search k= 10 t=0.196 s (± 0.0017)
search k=100 t=0.458 s (± 0.0017)
========== d= 128
dataset in dimension 128, with metric L2, size: Q 10000 B 10000 T 0
search k=  1 t=0.234 s (± 0.0015)
search k= 10 t=0.252 s (± 0.0018)
search k=100 t=0.513 s (± 0.0023)
restab=
 0.129152       0.147665        0.410948
0.140337        0.15915 0.415998
0.177245        0.196366        0.458388
0.23385 0.252216        0.512823

master (6977d72):

model name      : Intel(R) Xeon(R) Platinum 8280L CPU @ 2.70GHz
************ nthread= 56
*********** nq= 100
========== d= 16
dataset in dimension 16, with metric L2, size: Q 100 B 10000 T 0
search k=  1 t=2.894 ms (± 0.0209)
search k= 10 t=3.194 ms (± 0.0389)
search k=100 t=7.108 ms (± 0.0705)
========== d= 32
dataset in dimension 32, with metric L2, size: Q 100 B 10000 T 0
search k=  1 t=2.744 ms (± 0.0019)
search k= 10 t=3.050 ms (± 0.0044)
search k=100 t=7.002 ms (± 0.0070)
========== d= 64
dataset in dimension 64, with metric L2, size: Q 100 B 10000 T 0
search k=  1 t=2.772 ms (± 0.0035)
search k= 10 t=3.069 ms (± 0.0024)
search k=100 t=7.089 ms (± 0.0079)
========== d= 128
dataset in dimension 128, with metric L2, size: Q 100 B 10000 T 0
search k=  1 t=2.837 ms (± 0.0039)
search k= 10 t=3.140 ms (± 0.0036)
search k=100 t=7.178 ms (± 0.0100)
restab=
 2.89437        3.19391 7.10759
2.7439  3.05027 7.00188
2.77224 3.06919 7.08941
2.83748 3.13994 7.17816
*********** nq= 10000
========== d= 16
dataset in dimension 16, with metric L2, size: Q 10000 B 10000 T 0
search k=  1 t=0.190 s (± 0.0012)
search k= 10 t=0.214 s (± 0.0016)
search k=100 t=0.520 s (± 0.0601)
========== d= 32
dataset in dimension 32, with metric L2, size: Q 10000 B 10000 T 0
search k=  1 t=0.135 s (± 0.0002)
search k= 10 t=0.157 s (± 0.0032)
search k=100 t=0.513 s (± 0.0250)
========== d= 64
dataset in dimension 64, with metric L2, size: Q 10000 B 10000 T 0
search k=  1 t=0.146 s (± 0.0002)
search k= 10 t=0.166 s (± 0.0001)
search k=100 t=0.503 s (± 0.0047)
========== d= 128
dataset in dimension 128, with metric L2, size: Q 10000 B 10000 T 0
search k=  1 t=0.145 s (± 0.0007)
search k= 10 t=0.163 s (± 0.0001)
search k=100 t=0.488 s (± 0.0043)
restab=
 0.190026       0.213606        0.520014
0.134764        0.15713 0.513136
0.145534        0.165525        0.503114
0.145355        0.16309 0.488284
************ nthread= 1
*********** nq= 100
========== d= 16
dataset in dimension 16, with metric L2, size: Q 100 B 10000 T 0
search k=  1 t=1.890 ms (± 0.0025)
search k= 10 t=2.175 ms (± 0.0230)
search k=100 t=6.323 ms (± 0.0038)
========== d= 32
dataset in dimension 32, with metric L2, size: Q 100 B 10000 T 0
search k=  1 t=2.349 ms (± 0.0031)
search k= 10 t=2.622 ms (± 0.0025)
search k=100 t=4.895 ms (± 0.1379)
========== d= 64
dataset in dimension 64, with metric L2, size: Q 100 B 10000 T 0
search k=  1 t=2.224 ms (± 0.0522)
search k= 10 t=2.383 ms (± 0.0021)
search k=100 t=5.382 ms (± 0.1252)
========== d= 128
dataset in dimension 128, with metric L2, size: Q 100 B 10000 T 0
search k=  1 t=3.647 ms (± 0.0651)
search k= 10 t=3.756 ms (± 0.0119)
search k=100 t=6.645 ms (± 0.0076)
restab=
 1.88985        2.17527 6.32253
2.34929 2.62222 4.89473
2.22358 2.38296 5.38185
3.64709 3.75602 6.64547
*********** nq= 10000
========== d= 16
dataset in dimension 16, with metric L2, size: Q 10000 B 10000 T 0
search k=  1 t=0.112 s (± 0.0058)
search k= 10 t=0.126 s (± 0.0001)
search k=100 t=0.400 s (± 0.0046)
========== d= 32
dataset in dimension 32, with metric L2, size: Q 10000 B 10000 T 0
search k=  1 t=0.122 s (± 0.0062)
search k= 10 t=0.138 s (± 0.0066)
search k=100 t=0.406 s (± 0.0060)
========== d= 64
dataset in dimension 64, with metric L2, size: Q 10000 B 10000 T 0
search k=  1 t=0.156 s (± 0.0002)
search k= 10 t=0.173 s (± 0.0007)
search k=100 t=0.447 s (± 0.0045)
========== d= 128
dataset in dimension 128, with metric L2, size: Q 10000 B 10000 T 0
search k=  1 t=0.213 s (± 0.0069)
search k= 10 t=0.230 s (± 0.0067)
search k=100 t=0.500 s (± 0.0056)
restab=
 0.111619       0.125901        0.39991
0.122357        0.138325        0.405528
0.156162        0.173034        0.44707
0.213437        0.230029        0.499801

Platform

CPU: Intel(R) Xeon(R) Platinum 8280L CPU @ 2.70GHz, 2 sockets, 28 cores per socket

OS: Linux, CentOS 7.6.1810, 5.4.69 kernel version

Faiss version before perf. degradation: 698a459
Faiss version after perf. degradation: e1adde0

Installed from: compiled with GCC 10.2.0

Faiss compilation options: default from conda build scripts (see reproducer)

SW: Python 3.8.8, MKL 2020.4, llvm-openmp 11.0.1

Running on:

CPU
GPU

Interface:

C++
Python

Reproduction instructions

Clone Faiss repository, modify benchs/bench_index_flat.py to use needed number of threads and run following bash script:

#!/bin/bash

export PYTHON=python
export PY_VER=38
export CPU_COUNT=56
export PREFIX=$CONDA_PREFIX
cp -r benchs benchs_copy # first commit hasn't needed benchmark
rm -rf logs
mkdir -p logs

run_bench () {
    mkdir -p logs/$1
    rm -rf _*
    pip uninstall -y faiss faiss-cpu
    git checkout $1

    bash conda/faiss/build-lib.sh
    bash conda/faiss/build-pkg.sh

    python benchs_copy/bench_index_flat.py > logs/$1/bench_index_flat.log
}

run_bench 698a4592e87920f036aa7a2d8a3a56e12387a8f0
run_bench e1adde0d84ece584f3b4d86db0b1532329f8cdb8
run_bench master

The text was updated successfully, but these errors were encountered:

mdouze · 2021-03-18T08:57:54Z

Thanks for the extensive analysis.
The faiss.cvar.distance_compute_min_k_reservoir is set to 5, which is lower than the default 100, so I am not sure this benchmark is valid.

Alexsandruss · 2021-03-23T11:25:29Z

Forgot to mention that I commented line with faiss.cvar.distance_compute_min_k_reservoir = 5 since it didn't work with oldest commit. So, this parameter was set to default value.

mdouze · 2021-03-28T20:55:33Z

I have to figure out why the results of your benchmark differ from mine.

Alexsandruss · 2021-04-12T11:30:44Z

I made additional measurements on bigger range of data shapes and found out some interesting things:

Faster search with IndexIVFFlat was actually slower than raw IndexFlatL2 before degradation:
Faster search with IndexIVFFlat are also affected by changes in e1adde0 but weaker than raw IndexFlatL2:

Used SW and HW are same as above.

Full benchmark results (csv format):
bench_results.txt

Code of benchmarks:

import faiss
import numpy as np
import pandas as pd
from scipy import stats
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from timeit import default_timer as timer


pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', None)


NLIST = 100
NPROBE = 10
N_CLASSES = 5
N_CLUSTERS_PER_CLASS = 4

N_RUNS = 10


def box_filter(timing, left=0.25, right=0.75):
    timing.sort()
    size = len(timing)
    if size == 1:
        return timing[0]
    q1, q2 = timing[int(size * left)], timing[int(size * right)]
    iq = q2 - q1
    lower, upper = q1 - 1.5 * iq, q2 + 1.5 * iq
    result = np.array([item for item in timing if lower < item < upper])
    return np.mean(result)


def index_flatl2_search(train_data, test_data, k):
    d = train_data.shape[1]
    index = faiss.IndexFlatL2(d)
    index.add(train_data)
    return index.search(test_data, k)


def index_ivfflat_search(train_data, test_data, k):
    d = train_data.shape[1]
    quantizer = faiss.IndexFlatL2(d)
    index = faiss.IndexIVFFlat(quantizer, d, NLIST)
    index.nprobe = NPROBE
    index.train(train_data)
    index.add(train_data)
    return index.search(test_data, k)


def compute_accuracy_from_indices(indices, train_y, test_y):
    pred_y, _ = stats.mode(train_y[indices], axis=1)
    pred_y = pred_y.ravel()
    return accuracy_score(test_y, pred_y)


rows_arr = [5000, 10000, 20000, 50000, 100000]
dims_arr = [8, 32, 128, 512]
k_arr = [8, 32, 128]
n_cases = len(rows_arr) ** 2 * len(dims_arr) * len(k_arr)

bench_data = pd.DataFrame({
    "n_train": np.zeros(shape=(n_cases,), dtype=np.int32),
    "n_test": np.zeros(shape=(n_cases,), dtype=np.int32),
    "dims": np.zeros(shape=(n_cases,), dtype=np.int32),
    "k": np.zeros(shape=(n_cases,), dtype=np.int32),
    "IndexIVFFlat_time[s]": np.zeros(shape=(n_cases,), dtype=np.float32),
    "IndexFlatL2_time[s]": np.zeros(shape=(n_cases,), dtype=np.float32),
    "IndexIVFFlat_accuracy": np.zeros(shape=(n_cases,), dtype=np.float32),
    "IndexFlatL2_accuracy": np.zeros(shape=(n_cases,), dtype=np.float32),
    "IndexIVFFlat_speedup": np.zeros(shape=(n_cases,), dtype=np.float32),
    "IndexIVFFlat_rel_accuracy": np.zeros(shape=(n_cases,), dtype=np.float32)
})

j = 0
for n_train_rows in rows_arr:
    for n_test_rows in rows_arr:
        for n_dims in dims_arr:
            for k in k_arr:
                x, y = make_classification(n_samples=n_train_rows + n_test_rows, n_features=n_dims,
                    n_informative=n_dims, n_repeated=0, n_redundant=0,
                    n_classes=N_CLASSES, n_clusters_per_class=N_CLUSTERS_PER_CLASS, random_state=42)
                x = x.astype(np.float32)
                train_data, test_data = x[:n_train_rows], x[n_train_rows:]
                train_y, test_y = y[:n_train_rows], y[n_train_rows:]
                ivff_times = []
                f_times = []
                for i in range(N_RUNS):
                    t0 = timer()
                    _, ivff_idx = index_ivfflat_search(train_data, test_data, k)
                    t1 = timer()
                    _, f_idx = index_flatl2_search(train_data, test_data, k)
                    t2 = timer()
                    ivff_times.append(t1 - t0)
                    f_times.append(t2 - t1)
                ivff_acc = compute_accuracy_from_indices(ivff_idx, train_y, test_y)
                f_acc = compute_accuracy_from_indices(f_idx, train_y, test_y)
                ivff_time = box_filter(ivff_times)
                f_time = box_filter(f_times)
                speedup = f_time / ivff_time
                rel_acc = ivff_acc / f_acc
                bench_data.at[j, 'n_train'], bench_data.at[j, 'n_test'] = n_train_rows, n_test_rows
                bench_data.at[j, 'dims'], bench_data.at[j, 'k'] = n_dims, k
                bench_data.at[j, 'IndexIVFFlat_time[s]'], bench_data.at[j, 'IndexFlatL2_time[s]'] = ivff_time, f_time
                bench_data.at[j, 'IndexIVFFlat_accuracy'], bench_data.at[j, 'IndexFlatL2_accuracy'] = ivff_acc, f_acc
                bench_data.at[j, 'IndexIVFFlat_speedup'], bench_data.at[j, 'IndexIVFFlat_rel_accuracy'] = speedup, rel_acc
                j += 1

print(bench_data)

ava57r · 2021-04-20T11:43:21Z

I have performance degradation too after update from v1.6.3 to v1.7.0
I use IndexFlatL2 and vectors 512D.

ava57r · 2021-04-20T15:29:28Z

I found this function knn_L2sqr_blas

it was parallel before

faiss/faiss/utils/distances.cpp

Lines 310 to 320 in ef28350

    
                                   x + i0 * d, &di, &zero, 
        
                                   ip_block, &nyi); 
        
                       } 
        
                       /* collect minima */ 
        
           #pragma omp parallel for 
        
                       for (int64_t i = i0; i < i1; i++) { 
        
                           float * __restrict simi = res->get_val(i); 
        
                           int64_t * __restrict idxi = res->get_ids (i); 
        
                           const float *ip_line = ip_block + (i - i0) * (j1 - j0);

it is now

faiss/faiss/utils/distances.cpp

Lines 280 to 295 in e1adde0

    
           /* compute the actual dot products */ 
        
           { 
        
               float one = 1, zero = 0; 
        
               FINTEGER nyi = j1 - j0, nxi = i1 - i0, di = d; 
        
               sgemm_ ("Transpose", "Not transpose", &nyi, &nxi, &di, &one, 
        
                       y + j0 * d, &di, 
        
                       x + i0 * d, &di, &zero, 
        
                       ip_block.get(), &nyi); 
        
           } 
        
           for (int64_t i = i0; i < i1; i++) { 
        
               float *ip_line = ip_block.get() + (i - i0) * (j1 - j0); 
        
               for (size_t j = j0; j < j1; j++) { 
        
                   float ip = *ip_line; 
        
                   float dis = x_norms[i] + y_norms[j] - 2 * ip;

mdouze · 2021-04-22T12:47:11Z

Right, thanks for narrowing down the regression.

Summary: This diff is related to facebookresearch#1762 The ResultHandler introduced for FlatL2 and FlatIP was not multithreaded. This diff attempts to fix that. To be verified if it is indeed faster. Differential Revision: D27939173 fbshipit-source-id: 61e09355df65c899a89f70d2bd5ab47076871289

mdouze · 2021-04-22T13:05:02Z

OK the indexing is a bit tricky so I prepared a PR here: #1840

Would you mind running the benchmark to check if we recover the previous performance?

ava57r · 2021-04-23T09:11:51Z

If query n < distance_compute_blas_threshold
It is slower than v1.6.3

I found *sse function
before

faiss/utils/distances.cpp

Line 186 in a17a631

static void knn_L2sqr_sse (

compile flags in v1.6.3

faiss/configure

Line 6818 in a17a631

ARCH_CPUFLAGS="-mpopcnt -msse4"

mdouze · 2021-04-26T07:09:53Z

@ava57r sorry, I don't understand. Do you suggest that -msse4 is faster than -mavx2 ?

ava57r · 2021-04-26T07:28:48Z

@ava57r sorry, I don't understand. Do you suggest that -msse4 is faster than -mavx2 ?

No.
I think what v1.6.3 compiled with -mss4 by default.
The current Master branch is no.
I compiled lib from source without avx2.

Summary: Pull Request resolved: #1840 This diff is related to #1762 The ResultHandler introduced for FlatL2 and FlatIP was not multithreaded. This diff attempts to fix that. To be verified if it is indeed faster. Reviewed By: wickedfoo Differential Revision: D27939173 fbshipit-source-id: c85f01a97d4249fe0c6bfb04396b68a7a9fe643d

ava57r · 2021-05-11T14:44:01Z

OK the indexing is a bit tricky so I prepared a PR here: #1840

Would you mind running the benchmark to check if we recover the previous performance?

I tried to run this bench

q=100

32 threads	100 query size	before degradation (`698a459`)	after degradation (`e1adde0`)	master (`b4c320a`)
dimension	k	time, ms	time, ms	time, ms
16	1	2.421	2.238	1.377
16	10	1.773	1.103	1.413
16	100	2.900	2.520	2.904
32	1	1.356	1.296	1.336
32	10	1.419	1.470	1.371
32	100	2.711	3.153	2.871
64	1	1.221	1.438	1.249
64	10	1.253	1.503	1.320
64	100	2.547	3.091	2.897
128	1	1.414	1.679	1.384
128	10	1.526	1.460	1.518
128	100	2.836	3.100	3.157

q=10000

32 threads	100 query size	before degradation (`698a459`)	after degradation (`e1adde0`)	master (`b4c320a`)
dimension	k	time, s	time, s	time, s
16	1	0.035	0.036	0.034
16	10	0.037	0.035	0.035
16	100	0.151	0.148	0.140
32	1	0.033	0.031	0.029
32	10	0.037	0.033	0.032
32	100	0.166	0.183	0.157
64	1	0.038	0.040	0.038
64	10	0.042	0.044	0.042
64	100	0.156	0.159	0.145
128	1	0.055	0.056	0.053
128	10	0.059	0.061	0.057
128	100	0.178	0.179	0.177

commit (698a459)

model name      : Intel(R) Xeon(R) Silver 4214R CPU @ 2.40GHz
************ nthread= 32
*********** nq= 100
========== d= 16
dataset in dimension 16, with metric L2, size: Q 100 B 10000 T 0
search k=  1 t=2.421 ms (± 0.1440)
search k= 10 t=1.773 ms (± 0.4305)
search k=100 t=2.900 ms (± 0.1537)
========== d= 32
dataset in dimension 32, with metric L2, size: Q 100 B 10000 T 0
search k=  1 t=1.356 ms (± 0.0085)
search k= 10 t=1.419 ms (± 0.0138)
search k=100 t=2.711 ms (± 0.0145)
========== d= 64
dataset in dimension 64, with metric L2, size: Q 100 B 10000 T 0
search k=  1 t=1.221 ms (± 0.0297)
search k= 10 t=1.253 ms (± 0.0110)
search k=100 t=2.547 ms (± 0.0262)
========== d= 128
dataset in dimension 128, with metric L2, size: Q 100 B 10000 T 0
search k=  1 t=1.414 ms (± 0.0122)
search k= 10 t=1.526 ms (± 0.0410)
search k=100 t=2.836 ms (± 0.0191)
restab=
 2.42072        1.77324 2.90045
1.35612 1.41916 2.71073
1.22073 1.25286 2.54732
1.41433 1.5263  2.83617
*********** nq= 10000
========== d= 16
dataset in dimension 16, with metric L2, size: Q 10000 B 10000 T 0
search k=  1 t=0.035 s (± 0.0004)
search k= 10 t=0.037 s (± 0.0061)
search k=100 t=0.151 s (± 0.0066)
========== d= 32
dataset in dimension 32, with metric L2, size: Q 10000 B 10000 T 0
search k=  1 t=0.033 s (± 0.0003)
search k= 10 t=0.037 s (± 0.0003)
search k=100 t=0.166 s (± 0.0101)
========== d= 64
dataset in dimension 64, with metric L2, size: Q 10000 B 10000 T 0
search k=  1 t=0.038 s (± 0.0007)
search k= 10 t=0.042 s (± 0.0007)
search k=100 t=0.156 s (± 0.0087)
========== d= 128
dataset in dimension 128, with metric L2, size: Q 10000 B 10000 T 0
search k=  1 t=0.055 s (± 0.0005)
search k= 10 t=0.059 s (± 0.0004)
search k=100 t=0.178 s (± 0.0090)
restab=
 0.0350017      0.0371504       0.150556
0.0330732       0.0368384       0.165539
0.038266        0.0417653       0.156422
0.0551096       0.0589395       0.177976

commit (e1adde0)

model name      : Intel(R) Xeon(R) Silver 4214R CPU @ 2.40GHz
************ nthread= 32
*********** nq= 100
========== d= 16
dataset in dimension 16, with metric L2, size: Q 100 B 10000 T 0
search k=  1 t=2.238 ms (± 0.0496)
search k= 10 t=1.103 ms (± 0.0274)
search k=100 t=2.520 ms (± 0.0180)
========== d= 32
dataset in dimension 32, with metric L2, size: Q 100 B 10000 T 0
search k=  1 t=1.296 ms (± 0.0480)
search k= 10 t=1.470 ms (± 0.0168)
search k=100 t=3.153 ms (± 0.0943)
========== d= 64
dataset in dimension 64, with metric L2, size: Q 100 B 10000 T 0
search k=  1 t=1.438 ms (± 0.0105)
search k= 10 t=1.503 ms (± 0.0295)
search k=100 t=3.091 ms (± 0.1679)
========== d= 128
dataset in dimension 128, with metric L2, size: Q 100 B 10000 T 0
search k=  1 t=1.679 ms (± 0.0176)
search k= 10 t=1.460 ms (± 0.0093)
search k=100 t=3.100 ms (± 0.1812)
restab=
 2.23753        1.10283 2.51985
1.29637 1.47039 3.15303
1.43808 1.50332 3.09131
1.67882 1.46049 3.10025
*********** nq= 10000
========== d= 16
dataset in dimension 16, with metric L2, size: Q 10000 B 10000 T 0
search k=  1 t=0.036 s (± 0.0003)
search k= 10 t=0.035 s (± 0.0053)
search k=100 t=0.148 s (± 0.0062)
========== d= 32
dataset in dimension 32, with metric L2, size: Q 10000 B 10000 T 0
search k=  1 t=0.031 s (± 0.0017)
search k= 10 t=0.033 s (± 0.0003)
search k=100 t=0.183 s (± 0.0043)
========== d= 64
dataset in dimension 64, with metric L2, size: Q 10000 B 10000 T 0
search k=  1 t=0.040 s (± 0.0004)
search k= 10 t=0.044 s (± 0.0003)
search k=100 t=0.159 s (± 0.0079)
========== d= 128
dataset in dimension 128, with metric L2, size: Q 10000 B 10000 T 0
search k=  1 t=0.056 s (± 0.0011)
search k= 10 t=0.061 s (± 0.0067)
search k=100 t=0.179 s (± 0.0120)
restab=
 0.0362413      0.034522        0.148349
0.0310689       0.0330717       0.182657
0.0400654       0.043545        0.158652
0.0564006       0.0606787       0.178946

master(b4c320a)

model name      : Intel(R) Xeon(R) Silver 4214R CPU @ 2.40GHz
************ nthread= 32
*********** nq= 100
========== d= 16
dataset in dimension 16, with metric L2, size: Q 100 B 10000 T 0
search k=  1 t=1.377 ms (± 0.0176)
search k= 10 t=1.413 ms (± 0.0192)
search k=100 t=2.904 ms (± 0.0757)
========== d= 32
dataset in dimension 32, with metric L2, size: Q 100 B 10000 T 0
search k=  1 t=1.336 ms (± 0.0141)
search k= 10 t=1.371 ms (± 0.0099)
search k=100 t=2.871 ms (± 0.1652)
========== d= 64
dataset in dimension 64, with metric L2, size: Q 100 B 10000 T 0
search k=  1 t=1.249 ms (± 0.0192)
search k= 10 t=1.320 ms (± 0.0190)
search k=100 t=2.897 ms (± 0.0420)
========== d= 128
dataset in dimension 128, with metric L2, size: Q 100 B 10000 T 0
search k=  1 t=1.384 ms (± 0.0066)
search k= 10 t=1.518 ms (± 0.0412)
search k=100 t=3.157 ms (± 0.1046)
restab=
 1.37651        1.4134  2.90379
1.33565 1.37076 2.87127
1.24866 1.3198  2.89655
1.38402 1.51813 3.15681
*********** nq= 10000
========== d= 16
dataset in dimension 16, with metric L2, size: Q 10000 B 10000 T 0
search k=  1 t=0.034 s (± 0.0005)
search k= 10 t=0.035 s (± 0.0047)
search k=100 t=0.140 s (± 0.0065)
========== d= 32
dataset in dimension 32, with metric L2, size: Q 10000 B 10000 T 0
search k=  1 t=0.029 s (± 0.0003)
search k= 10 t=0.032 s (± 0.0006)
search k=100 t=0.157 s (± 0.0115)
========== d= 64
dataset in dimension 64, with metric L2, size: Q 10000 B 10000 T 0
search k=  1 t=0.038 s (± 0.0008)
search k= 10 t=0.042 s (± 0.0004)
search k=100 t=0.145 s (± 0.0068)
========== d= 128
dataset in dimension 128, with metric L2, size: Q 10000 B 10000 T 0
search k=  1 t=0.053 s (± 0.0003)
search k= 10 t=0.057 s (± 0.0005)
search k=100 t=0.177 s (± 0.0085)
restab=
 0.0341082      0.0346912       0.139799
0.0287708       0.0319045       0.157442
0.0382387       0.0418262       0.145482
0.0525824       0.0565464       0.177184

/bench_index_flat.py

-faiss.cvar.distance_compute_min_k_reservoir = 5
+#faiss.cvar.distance_compute_min_k_reservoir = 5

#!/bin/bash

export PYTHON=python
export PY_VER=38
export CPU_COUNT=32
export PREFIX=$CONDA_PREFIX
export LD_LIBRARY_PATH=$CONDA_PREFIX/lib
cp -r benchs benchs_copy # first commit hasn't needed benchmark
rm -rf logs
mkdir -p logs

run_bench () {
    mkdir -p logs/$1
    rm -rf _*
    pip uninstall -y faiss faiss-cpu
    git checkout $1

    bash conda/faiss/build-lib.sh
    bash conda/faiss/build-pkg.sh

    python benchs_copy/bench_index_flat.py > logs/$1/bench_index_flat.log
}

run_bench 698a4592e87920f036aa7a2d8a3a56e12387a8f0
run_bench e1adde0d84ece584f3b4d86db0b1532329f8cdb8
run_bench master

mdouze added the question label Mar 18, 2021

mdouze added enhancement and removed question labels Mar 28, 2021

mdouze mentioned this issue Apr 22, 2021

Fix performance regression in ResultHandler #1840

Closed

mdouze closed this as completed Sep 7, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

IndexFlatL2 performance degradation #1762

IndexFlatL2 performance degradation #1762

Alexsandruss commented Mar 15, 2021

mdouze commented Mar 18, 2021

Alexsandruss commented Mar 23, 2021

mdouze commented Mar 28, 2021

Alexsandruss commented Apr 12, 2021

ava57r commented Apr 20, 2021 •

edited

Loading

ava57r commented Apr 20, 2021

mdouze commented Apr 22, 2021 •

edited

Loading

mdouze commented Apr 22, 2021

ava57r commented Apr 23, 2021

mdouze commented Apr 26, 2021

ava57r commented Apr 26, 2021

ava57r commented May 11, 2021

IndexFlatL2 performance degradation #1762

IndexFlatL2 performance degradation #1762

Comments

Alexsandruss commented Mar 15, 2021

Summary

bench_index_flat benchmark aggregated results:

bench_index_flat benchmark full logs:

Platform

Reproduction instructions

mdouze commented Mar 18, 2021

Alexsandruss commented Mar 23, 2021

mdouze commented Mar 28, 2021

Alexsandruss commented Apr 12, 2021

ava57r commented Apr 20, 2021 • edited Loading

ava57r commented Apr 20, 2021

mdouze commented Apr 22, 2021 • edited Loading

mdouze commented Apr 22, 2021

ava57r commented Apr 23, 2021

mdouze commented Apr 26, 2021

ava57r commented Apr 26, 2021

ava57r commented May 11, 2021

ava57r commented Apr 20, 2021 •

edited

Loading

mdouze commented Apr 22, 2021 •

edited

Loading