Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

IndexFlatL2 performance degradation #1762

Closed
2 of 4 tasks
Alexsandruss opened this issue Mar 15, 2021 · 12 comments
Closed
2 of 4 tasks

IndexFlatL2 performance degradation #1762

Alexsandruss opened this issue Mar 15, 2021 · 12 comments

Comments

@Alexsandruss
Copy link

Summary

IndexFlatL2 has performance degradation on large number of threads since e1adde0 commit.
This degradation can be observed with benchs/bench_index_flat.py.

bench_index_flat benchmark aggregated results:

q=100

56 threads 100 query size before degradation (698a459) after degradation (e1adde0) master (6977d72)
dimension k time, ms time, ms time, ms
16 1 0.767 2.631 2.894
16 10 0.834 3.004 3.194
16 100 0.949 7.237 7.108
32 1 0.297 2.642 2.744
32 10 0.369 3.009 3.05
32 100 0.491 7.218 7.002
64 1 0.317 2.668 2.772
64 10 0.372 3.044 3.069
64 100 0.5 7.341 7.089
128 1 0.466 2.734 2.837
128 10 0.529 3.104 3.14
128 100 0.649 7.431 7.178

q=10000

56 threads 10000 query size before degradation (698a459) after degradation (e1adde0) master (6977d72)
dimension k time, s time, s time, s
16 1 0.008 0.181 0.19
16 10 0.009 0.212 0.214
16 100 0.019 0.529 0.52
32 1 0.009 0.167 0.135
32 10 0.01 0.189 0.157
32 100 0.02 0.508 0.513
64 1 0.01 0.174 0.146
64 10 0.011 0.197 0.166
64 100 0.02 0.508 0.503
128 1 0.014 0.173 0.145
128 10 0.015 0.193 0.163
128 100 0.025 0.496 0.488

bench_index_flat benchmark full logs:

before degradation (698a459):

model name      : Intel(R) Xeon(R) Platinum 8280L CPU @ 2.70GHz
************ nthread= 56
*********** nq= 100
========== d= 16
dataset in dimension 16, with metric L2, size: Q 100 B 10000 T 0
search k=  1 t=0.767 ms (± 0.0058)
search k= 10 t=0.834 ms (± 0.0056)
search k=100 t=0.949 ms (± 0.0274)
========== d= 32
dataset in dimension 32, with metric L2, size: Q 100 B 10000 T 0
search k=  1 t=0.297 ms (± 0.0028)
search k= 10 t=0.369 ms (± 0.0044)
search k=100 t=0.491 ms (± 0.0098)
========== d= 64
dataset in dimension 64, with metric L2, size: Q 100 B 10000 T 0
search k=  1 t=0.317 ms (± 0.0059)
search k= 10 t=0.372 ms (± 0.0034)
search k=100 t=0.500 ms (± 0.0124)
========== d= 128
dataset in dimension 128, with metric L2, size: Q 100 B 10000 T 0
search k=  1 t=0.466 ms (± 0.0029)
search k= 10 t=0.529 ms (± 0.0033)
search k=100 t=0.649 ms (± 0.0103)
restab=
 0.767142       0.834286        0.949353
0.297099        0.369161        0.491083
0.316858        0.372231        0.499964
0.465631        0.529319        0.649452
*********** nq= 10000
========== d= 16
dataset in dimension 16, with metric L2, size: Q 10000 B 10000 T 0
search k=  1 t=0.008 s (± 0.0000)
search k= 10 t=0.009 s (± 0.0000)
search k=100 t=0.019 s (± 0.0003)
========== d= 32
dataset in dimension 32, with metric L2, size: Q 10000 B 10000 T 0
search k=  1 t=0.009 s (± 0.0000)
search k= 10 t=0.010 s (± 0.0000)
search k=100 t=0.020 s (± 0.0025)
========== d= 64
dataset in dimension 64, with metric L2, size: Q 10000 B 10000 T 0
search k=  1 t=0.010 s (± 0.0001)
search k= 10 t=0.011 s (± 0.0000)
search k=100 t=0.020 s (± 0.0000)
========== d= 128
dataset in dimension 128, with metric L2, size: Q 10000 B 10000 T 0
search k=  1 t=0.014 s (± 0.0000)
search k= 10 t=0.015 s (± 0.0000)
search k=100 t=0.025 s (± 0.0016)
restab=
 0.00769553     0.00912893      0.0186442
0.00908014      0.00991631      0.0202678
0.0103275       0.0113654       0.02024
0.0141219       0.0154734       0.0250527
************ nthread= 1
*********** nq= 100
========== d= 16
dataset in dimension 16, with metric L2, size: Q 100 B 10000 T 0
search k=  1 t=3.227 ms (± 0.0031)
search k= 10 t=3.660 ms (± 0.0047)
search k=100 t=7.973 ms (± 0.0070)
========== d= 32
dataset in dimension 32, with metric L2, size: Q 100 B 10000 T 0
search k=  1 t=1.633 ms (± 0.0040)
search k= 10 t=1.881 ms (± 0.0249)
search k=100 t=4.617 ms (± 0.0072)
========== d= 64
dataset in dimension 64, with metric L2, size: Q 100 B 10000 T 0
search k=  1 t=2.292 ms (± 0.0024)
search k= 10 t=2.563 ms (± 0.0217)
search k=100 t=5.368 ms (± 0.1682)
========== d= 128
dataset in dimension 128, with metric L2, size: Q 100 B 10000 T 0
search k=  1 t=3.707 ms (± 0.0010)
search k= 10 t=3.948 ms (± 0.0048)
search k=100 t=6.628 ms (± 0.0059)
restab=
 3.22726        3.66044 7.97331
1.6332  1.88079 4.61653
2.29204 2.56261 5.36752
3.70705 3.948   6.62845
*********** nq= 10000
========== d= 16
dataset in dimension 16, with metric L2, size: Q 10000 B 10000 T 0
search k=  1 t=0.121 s (± 0.0001)
search k= 10 t=0.143 s (± 0.0015)
search k=100 t=0.394 s (± 0.0020)
========== d= 32
dataset in dimension 32, with metric L2, size: Q 10000 B 10000 T 0
search k=  1 t=0.127 s (± 0.0014)
search k= 10 t=0.148 s (± 0.0002)
search k=100 t=0.406 s (± 0.0020)
========== d= 64
dataset in dimension 64, with metric L2, size: Q 10000 B 10000 T 0
search k=  1 t=0.160 s (± 0.0003)
search k= 10 t=0.183 s (± 0.0018)
search k=100 t=0.435 s (± 0.0027)
========== d= 128
dataset in dimension 128, with metric L2, size: Q 10000 B 10000 T 0
search k=  1 t=0.213 s (± 0.0004)
search k= 10 t=0.236 s (± 0.0024)
search k=100 t=0.486 s (± 0.0020)
restab=
 0.121088       0.143475        0.393567
0.126678        0.147563        0.405952
0.160052        0.182952        0.435488
0.213282        0.236116        0.485715

after degradation (e1adde0):

model name      : Intel(R) Xeon(R) Platinum 8280L CPU @ 2.70GHz
************ nthread= 56
*********** nq= 100
========== d= 16
dataset in dimension 16, with metric L2, size: Q 100 B 10000 T 0
search k=  1 t=2.631 ms (± 0.0041)
search k= 10 t=3.004 ms (± 0.0050)
search k=100 t=7.237 ms (± 0.0118)
========== d= 32
dataset in dimension 32, with metric L2, size: Q 100 B 10000 T 0
search k=  1 t=2.642 ms (± 0.0040)
search k= 10 t=3.009 ms (± 0.0022)
search k=100 t=7.218 ms (± 0.0110)
========== d= 64
dataset in dimension 64, with metric L2, size: Q 100 B 10000 T 0
search k=  1 t=2.668 ms (± 0.0035)
search k= 10 t=3.044 ms (± 0.0040)
search k=100 t=7.341 ms (± 0.0259)
========== d= 128
dataset in dimension 128, with metric L2, size: Q 100 B 10000 T 0
search k=  1 t=2.734 ms (± 0.0059)
search k= 10 t=3.104 ms (± 0.0026)
search k=100 t=7.431 ms (± 0.0070)
restab=
 2.63062        3.00425 7.23729
2.64192 3.00926 7.21794
2.66764 3.04425 7.34085
2.73407 3.10364 7.4307
*********** nq= 10000
========== d= 16
dataset in dimension 16, with metric L2, size: Q 10000 B 10000 T 0
search k=  1 t=0.181 s (± 0.0024)
search k= 10 t=0.212 s (± 0.0013)
search k=100 t=0.529 s (± 0.0599)
========== d= 32
dataset in dimension 32, with metric L2, size: Q 10000 B 10000 T 0
search k=  1 t=0.167 s (± 0.0019)
search k= 10 t=0.189 s (± 0.0021)
search k=100 t=0.508 s (± 0.0009)
========== d= 64
dataset in dimension 64, with metric L2, size: Q 10000 B 10000 T 0
search k=  1 t=0.174 s (± 0.0002)
search k= 10 t=0.197 s (± 0.0001)
search k=100 t=0.508 s (± 0.0006)
========== d= 128
dataset in dimension 128, with metric L2, size: Q 10000 B 10000 T 0
search k=  1 t=0.173 s (± 0.0001)
search k= 10 t=0.193 s (± 0.0003)
search k=100 t=0.496 s (± 0.0008)
restab=
 0.180814       0.212223        0.529138
0.166661        0.188869        0.508382
0.174292        0.196803        0.508177
0.17272 0.193195        0.495662
************ nthread= 1
*********** nq= 100
========== d= 16
dataset in dimension 16, with metric L2, size: Q 100 B 10000 T 0
search k=  1 t=2.217 ms (± 0.0041)
search k= 10 t=2.530 ms (± 0.0022)
search k=100 t=6.499 ms (± 0.0067)
========== d= 32
dataset in dimension 32, with metric L2, size: Q 100 B 10000 T 0
search k=  1 t=2.697 ms (± 0.0024)
search k= 10 t=2.375 ms (± 0.4417)
search k=100 t=4.781 ms (± 0.0036)
========== d= 64
dataset in dimension 64, with metric L2, size: Q 100 B 10000 T 0
search k=  1 t=2.528 ms (± 0.0052)
search k= 10 t=2.727 ms (± 0.0058)
search k=100 t=5.531 ms (± 0.0050)
========== d= 128
dataset in dimension 128, with metric L2, size: Q 100 B 10000 T 0
search k=  1 t=3.850 ms (± 0.0031)
search k= 10 t=4.058 ms (± 0.0294)
search k=100 t=6.834 ms (± 0.0151)
restab=
 2.21735        2.52995 6.49855
2.69735 2.37504 4.78083
2.52756 2.72736 5.53104
3.85019 4.058   6.83403
*********** nq= 10000
========== d= 16
dataset in dimension 16, with metric L2, size: Q 10000 B 10000 T 0
search k=  1 t=0.129 s (± 0.0011)
search k= 10 t=0.148 s (± 0.0014)
search k=100 t=0.411 s (± 0.0021)
========== d= 32
dataset in dimension 32, with metric L2, size: Q 10000 B 10000 T 0
search k=  1 t=0.140 s (± 0.0001)
search k= 10 t=0.159 s (± 0.0004)
search k=100 t=0.416 s (± 0.0017)
========== d= 64
dataset in dimension 64, with metric L2, size: Q 10000 B 10000 T 0
search k=  1 t=0.177 s (± 0.0016)
search k= 10 t=0.196 s (± 0.0017)
search k=100 t=0.458 s (± 0.0017)
========== d= 128
dataset in dimension 128, with metric L2, size: Q 10000 B 10000 T 0
search k=  1 t=0.234 s (± 0.0015)
search k= 10 t=0.252 s (± 0.0018)
search k=100 t=0.513 s (± 0.0023)
restab=
 0.129152       0.147665        0.410948
0.140337        0.15915 0.415998
0.177245        0.196366        0.458388
0.23385 0.252216        0.512823

master (6977d72):

model name      : Intel(R) Xeon(R) Platinum 8280L CPU @ 2.70GHz
************ nthread= 56
*********** nq= 100
========== d= 16
dataset in dimension 16, with metric L2, size: Q 100 B 10000 T 0
search k=  1 t=2.894 ms (± 0.0209)
search k= 10 t=3.194 ms (± 0.0389)
search k=100 t=7.108 ms (± 0.0705)
========== d= 32
dataset in dimension 32, with metric L2, size: Q 100 B 10000 T 0
search k=  1 t=2.744 ms (± 0.0019)
search k= 10 t=3.050 ms (± 0.0044)
search k=100 t=7.002 ms (± 0.0070)
========== d= 64
dataset in dimension 64, with metric L2, size: Q 100 B 10000 T 0
search k=  1 t=2.772 ms (± 0.0035)
search k= 10 t=3.069 ms (± 0.0024)
search k=100 t=7.089 ms (± 0.0079)
========== d= 128
dataset in dimension 128, with metric L2, size: Q 100 B 10000 T 0
search k=  1 t=2.837 ms (± 0.0039)
search k= 10 t=3.140 ms (± 0.0036)
search k=100 t=7.178 ms (± 0.0100)
restab=
 2.89437        3.19391 7.10759
2.7439  3.05027 7.00188
2.77224 3.06919 7.08941
2.83748 3.13994 7.17816
*********** nq= 10000
========== d= 16
dataset in dimension 16, with metric L2, size: Q 10000 B 10000 T 0
search k=  1 t=0.190 s (± 0.0012)
search k= 10 t=0.214 s (± 0.0016)
search k=100 t=0.520 s (± 0.0601)
========== d= 32
dataset in dimension 32, with metric L2, size: Q 10000 B 10000 T 0
search k=  1 t=0.135 s (± 0.0002)
search k= 10 t=0.157 s (± 0.0032)
search k=100 t=0.513 s (± 0.0250)
========== d= 64
dataset in dimension 64, with metric L2, size: Q 10000 B 10000 T 0
search k=  1 t=0.146 s (± 0.0002)
search k= 10 t=0.166 s (± 0.0001)
search k=100 t=0.503 s (± 0.0047)
========== d= 128
dataset in dimension 128, with metric L2, size: Q 10000 B 10000 T 0
search k=  1 t=0.145 s (± 0.0007)
search k= 10 t=0.163 s (± 0.0001)
search k=100 t=0.488 s (± 0.0043)
restab=
 0.190026       0.213606        0.520014
0.134764        0.15713 0.513136
0.145534        0.165525        0.503114
0.145355        0.16309 0.488284
************ nthread= 1
*********** nq= 100
========== d= 16
dataset in dimension 16, with metric L2, size: Q 100 B 10000 T 0
search k=  1 t=1.890 ms (± 0.0025)
search k= 10 t=2.175 ms (± 0.0230)
search k=100 t=6.323 ms (± 0.0038)
========== d= 32
dataset in dimension 32, with metric L2, size: Q 100 B 10000 T 0
search k=  1 t=2.349 ms (± 0.0031)
search k= 10 t=2.622 ms (± 0.0025)
search k=100 t=4.895 ms (± 0.1379)
========== d= 64
dataset in dimension 64, with metric L2, size: Q 100 B 10000 T 0
search k=  1 t=2.224 ms (± 0.0522)
search k= 10 t=2.383 ms (± 0.0021)
search k=100 t=5.382 ms (± 0.1252)
========== d= 128
dataset in dimension 128, with metric L2, size: Q 100 B 10000 T 0
search k=  1 t=3.647 ms (± 0.0651)
search k= 10 t=3.756 ms (± 0.0119)
search k=100 t=6.645 ms (± 0.0076)
restab=
 1.88985        2.17527 6.32253
2.34929 2.62222 4.89473
2.22358 2.38296 5.38185
3.64709 3.75602 6.64547
*********** nq= 10000
========== d= 16
dataset in dimension 16, with metric L2, size: Q 10000 B 10000 T 0
search k=  1 t=0.112 s (± 0.0058)
search k= 10 t=0.126 s (± 0.0001)
search k=100 t=0.400 s (± 0.0046)
========== d= 32
dataset in dimension 32, with metric L2, size: Q 10000 B 10000 T 0
search k=  1 t=0.122 s (± 0.0062)
search k= 10 t=0.138 s (± 0.0066)
search k=100 t=0.406 s (± 0.0060)
========== d= 64
dataset in dimension 64, with metric L2, size: Q 10000 B 10000 T 0
search k=  1 t=0.156 s (± 0.0002)
search k= 10 t=0.173 s (± 0.0007)
search k=100 t=0.447 s (± 0.0045)
========== d= 128
dataset in dimension 128, with metric L2, size: Q 10000 B 10000 T 0
search k=  1 t=0.213 s (± 0.0069)
search k= 10 t=0.230 s (± 0.0067)
search k=100 t=0.500 s (± 0.0056)
restab=
 0.111619       0.125901        0.39991
0.122357        0.138325        0.405528
0.156162        0.173034        0.44707
0.213437        0.230029        0.499801

Platform

CPU: Intel(R) Xeon(R) Platinum 8280L CPU @ 2.70GHz, 2 sockets, 28 cores per socket

OS: Linux, CentOS 7.6.1810, 5.4.69 kernel version

Faiss version before perf. degradation: 698a459
Faiss version after perf. degradation: e1adde0

Installed from: compiled with GCC 10.2.0

Faiss compilation options: default from conda build scripts (see reproducer)

SW: Python 3.8.8, MKL 2020.4, llvm-openmp 11.0.1

Running on:

  • CPU
  • GPU

Interface:

  • C++
  • Python

Reproduction instructions

Clone Faiss repository, modify benchs/bench_index_flat.py to use needed number of threads and run following bash script:

#!/bin/bash

export PYTHON=python
export PY_VER=38
export CPU_COUNT=56
export PREFIX=$CONDA_PREFIX
cp -r benchs benchs_copy # first commit hasn't needed benchmark
rm -rf logs
mkdir -p logs

run_bench () {
    mkdir -p logs/$1
    rm -rf _*
    pip uninstall -y faiss faiss-cpu
    git checkout $1

    bash conda/faiss/build-lib.sh
    bash conda/faiss/build-pkg.sh

    python benchs_copy/bench_index_flat.py > logs/$1/bench_index_flat.log
}

run_bench 698a4592e87920f036aa7a2d8a3a56e12387a8f0
run_bench e1adde0d84ece584f3b4d86db0b1532329f8cdb8
run_bench master
@mdouze
Copy link
Contributor

mdouze commented Mar 18, 2021

Thanks for the extensive analysis.
The faiss.cvar.distance_compute_min_k_reservoir is set to 5, which is lower than the default 100, so I am not sure this benchmark is valid.

@Alexsandruss
Copy link
Author

Forgot to mention that I commented line with faiss.cvar.distance_compute_min_k_reservoir = 5 since it didn't work with oldest commit. So, this parameter was set to default value.

@mdouze
Copy link
Contributor

mdouze commented Mar 28, 2021

I have to figure out why the results of your benchmark differ from mine.

@Alexsandruss
Copy link
Author

I made additional measurements on bigger range of data shapes and found out some interesting things:

  • Faster search with IndexIVFFlat was actually slower than raw IndexFlatL2 before degradation:
    image

  • Faster search with IndexIVFFlat are also affected by changes in e1adde0 but weaker than raw IndexFlatL2:
    image

Used SW and HW are same as above.

Full benchmark results (csv format):
bench_results.txt

Code of benchmarks:

import faiss
import numpy as np
import pandas as pd
from scipy import stats
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from timeit import default_timer as timer


pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', None)


NLIST = 100
NPROBE = 10
N_CLASSES = 5
N_CLUSTERS_PER_CLASS = 4

N_RUNS = 10


def box_filter(timing, left=0.25, right=0.75):
    timing.sort()
    size = len(timing)
    if size == 1:
        return timing[0]
    q1, q2 = timing[int(size * left)], timing[int(size * right)]
    iq = q2 - q1
    lower, upper = q1 - 1.5 * iq, q2 + 1.5 * iq
    result = np.array([item for item in timing if lower < item < upper])
    return np.mean(result)


def index_flatl2_search(train_data, test_data, k):
    d = train_data.shape[1]
    index = faiss.IndexFlatL2(d)
    index.add(train_data)
    return index.search(test_data, k)


def index_ivfflat_search(train_data, test_data, k):
    d = train_data.shape[1]
    quantizer = faiss.IndexFlatL2(d)
    index = faiss.IndexIVFFlat(quantizer, d, NLIST)
    index.nprobe = NPROBE
    index.train(train_data)
    index.add(train_data)
    return index.search(test_data, k)


def compute_accuracy_from_indices(indices, train_y, test_y):
    pred_y, _ = stats.mode(train_y[indices], axis=1)
    pred_y = pred_y.ravel()
    return accuracy_score(test_y, pred_y)


rows_arr = [5000, 10000, 20000, 50000, 100000]
dims_arr = [8, 32, 128, 512]
k_arr = [8, 32, 128]
n_cases = len(rows_arr) ** 2 * len(dims_arr) * len(k_arr)

bench_data = pd.DataFrame({
    "n_train": np.zeros(shape=(n_cases,), dtype=np.int32),
    "n_test": np.zeros(shape=(n_cases,), dtype=np.int32),
    "dims": np.zeros(shape=(n_cases,), dtype=np.int32),
    "k": np.zeros(shape=(n_cases,), dtype=np.int32),
    "IndexIVFFlat_time[s]": np.zeros(shape=(n_cases,), dtype=np.float32),
    "IndexFlatL2_time[s]": np.zeros(shape=(n_cases,), dtype=np.float32),
    "IndexIVFFlat_accuracy": np.zeros(shape=(n_cases,), dtype=np.float32),
    "IndexFlatL2_accuracy": np.zeros(shape=(n_cases,), dtype=np.float32),
    "IndexIVFFlat_speedup": np.zeros(shape=(n_cases,), dtype=np.float32),
    "IndexIVFFlat_rel_accuracy": np.zeros(shape=(n_cases,), dtype=np.float32)
})

j = 0
for n_train_rows in rows_arr:
    for n_test_rows in rows_arr:
        for n_dims in dims_arr:
            for k in k_arr:
                x, y = make_classification(n_samples=n_train_rows + n_test_rows, n_features=n_dims,
                    n_informative=n_dims, n_repeated=0, n_redundant=0,
                    n_classes=N_CLASSES, n_clusters_per_class=N_CLUSTERS_PER_CLASS, random_state=42)
                x = x.astype(np.float32)
                train_data, test_data = x[:n_train_rows], x[n_train_rows:]
                train_y, test_y = y[:n_train_rows], y[n_train_rows:]
                ivff_times = []
                f_times = []
                for i in range(N_RUNS):
                    t0 = timer()
                    _, ivff_idx = index_ivfflat_search(train_data, test_data, k)
                    t1 = timer()
                    _, f_idx = index_flatl2_search(train_data, test_data, k)
                    t2 = timer()
                    ivff_times.append(t1 - t0)
                    f_times.append(t2 - t1)
                ivff_acc = compute_accuracy_from_indices(ivff_idx, train_y, test_y)
                f_acc = compute_accuracy_from_indices(f_idx, train_y, test_y)
                ivff_time = box_filter(ivff_times)
                f_time = box_filter(f_times)
                speedup = f_time / ivff_time
                rel_acc = ivff_acc / f_acc
                bench_data.at[j, 'n_train'], bench_data.at[j, 'n_test'] = n_train_rows, n_test_rows
                bench_data.at[j, 'dims'], bench_data.at[j, 'k'] = n_dims, k
                bench_data.at[j, 'IndexIVFFlat_time[s]'], bench_data.at[j, 'IndexFlatL2_time[s]'] = ivff_time, f_time
                bench_data.at[j, 'IndexIVFFlat_accuracy'], bench_data.at[j, 'IndexFlatL2_accuracy'] = ivff_acc, f_acc
                bench_data.at[j, 'IndexIVFFlat_speedup'], bench_data.at[j, 'IndexIVFFlat_rel_accuracy'] = speedup, rel_acc
                j += 1

print(bench_data)

@ava57r
Copy link
Contributor

ava57r commented Apr 20, 2021

I have performance degradation too after update from v1.6.3 to v1.7.0
I use IndexFlatL2 and vectors 512D.

@ava57r
Copy link
Contributor

ava57r commented Apr 20, 2021

I found this function knn_L2sqr_blas

it was parallel before

x + i0 * d, &di, &zero,
ip_block, &nyi);
}
/* collect minima */
#pragma omp parallel for
for (int64_t i = i0; i < i1; i++) {
float * __restrict simi = res->get_val(i);
int64_t * __restrict idxi = res->get_ids (i);
const float *ip_line = ip_block + (i - i0) * (j1 - j0);

it is now

/* compute the actual dot products */
{
float one = 1, zero = 0;
FINTEGER nyi = j1 - j0, nxi = i1 - i0, di = d;
sgemm_ ("Transpose", "Not transpose", &nyi, &nxi, &di, &one,
y + j0 * d, &di,
x + i0 * d, &di, &zero,
ip_block.get(), &nyi);
}
for (int64_t i = i0; i < i1; i++) {
float *ip_line = ip_block.get() + (i - i0) * (j1 - j0);
for (size_t j = j0; j < j1; j++) {
float ip = *ip_line;
float dis = x_norms[i] + y_norms[j] - 2 * ip;

@mdouze
Copy link
Contributor

mdouze commented Apr 22, 2021

Right, thanks for narrowing down the regression.

mdouze added a commit to mdouze/faiss that referenced this issue Apr 22, 2021
Summary:
This diff is related to

facebookresearch#1762

The ResultHandler introduced for FlatL2 and FlatIP was not multithreaded. This diff attempts to fix that. To be verified if it is indeed faster.

Differential Revision: D27939173

fbshipit-source-id: 61e09355df65c899a89f70d2bd5ab47076871289
@mdouze
Copy link
Contributor

mdouze commented Apr 22, 2021

OK the indexing is a bit tricky so I prepared a PR here: #1840

Would you mind running the benchmark to check if we recover the previous performance?

@ava57r
Copy link
Contributor

ava57r commented Apr 23, 2021

If query n < distance_compute_blas_threshold
It is slower than v1.6.3

I found *sse function
before

static void knn_L2sqr_sse (

compile flags in v1.6.3

faiss/configure

Line 6818 in a17a631

ARCH_CPUFLAGS="-mpopcnt -msse4"

@mdouze
Copy link
Contributor

mdouze commented Apr 26, 2021

@ava57r sorry, I don't understand. Do you suggest that -msse4 is faster than -mavx2 ?

@ava57r
Copy link
Contributor

ava57r commented Apr 26, 2021

@ava57r sorry, I don't understand. Do you suggest that -msse4 is faster than -mavx2 ?

No.
I think what v1.6.3 compiled with -mss4 by default.
The current Master branch is no.
I compiled lib from source without avx2.

facebook-github-bot pushed a commit that referenced this issue Apr 30, 2021
Summary:
Pull Request resolved: #1840

This diff is related to

#1762

The ResultHandler introduced for FlatL2 and FlatIP was not multithreaded. This diff attempts to fix that. To be verified if it is indeed faster.

Reviewed By: wickedfoo

Differential Revision: D27939173

fbshipit-source-id: c85f01a97d4249fe0c6bfb04396b68a7a9fe643d
@ava57r
Copy link
Contributor

ava57r commented May 11, 2021

OK the indexing is a bit tricky so I prepared a PR here: #1840

Would you mind running the benchmark to check if we recover the previous performance?

I tried to run this bench

q=100

32 threads 100 query size before degradation (698a459) after degradation (e1adde0) master (b4c320a)
dimension k time, ms time, ms time, ms
16 1 2.421 2.238 1.377
16 10 1.773 1.103 1.413
16 100 2.900 2.520 2.904
32 1 1.356 1.296 1.336
32 10 1.419 1.470 1.371
32 100 2.711 3.153 2.871
64 1 1.221 1.438 1.249
64 10 1.253 1.503 1.320
64 100 2.547 3.091 2.897
128 1 1.414 1.679 1.384
128 10 1.526 1.460 1.518
128 100 2.836 3.100 3.157

q=10000

32 threads 100 query size before degradation (698a459) after degradation (e1adde0) master (b4c320a)
dimension k time, s time, s time, s
16 1 0.035 0.036 0.034
16 10 0.037 0.035 0.035
16 100 0.151 0.148 0.140
32 1 0.033 0.031 0.029
32 10 0.037 0.033 0.032
32 100 0.166 0.183 0.157
64 1 0.038 0.040 0.038
64 10 0.042 0.044 0.042
64 100 0.156 0.159 0.145
128 1 0.055 0.056 0.053
128 10 0.059 0.061 0.057
128 100 0.178 0.179 0.177

commit (698a459)

model name      : Intel(R) Xeon(R) Silver 4214R CPU @ 2.40GHz
************ nthread= 32
*********** nq= 100
========== d= 16
dataset in dimension 16, with metric L2, size: Q 100 B 10000 T 0
search k=  1 t=2.421 ms (± 0.1440)
search k= 10 t=1.773 ms (± 0.4305)
search k=100 t=2.900 ms (± 0.1537)
========== d= 32
dataset in dimension 32, with metric L2, size: Q 100 B 10000 T 0
search k=  1 t=1.356 ms (± 0.0085)
search k= 10 t=1.419 ms (± 0.0138)
search k=100 t=2.711 ms (± 0.0145)
========== d= 64
dataset in dimension 64, with metric L2, size: Q 100 B 10000 T 0
search k=  1 t=1.221 ms (± 0.0297)
search k= 10 t=1.253 ms (± 0.0110)
search k=100 t=2.547 ms (± 0.0262)
========== d= 128
dataset in dimension 128, with metric L2, size: Q 100 B 10000 T 0
search k=  1 t=1.414 ms (± 0.0122)
search k= 10 t=1.526 ms (± 0.0410)
search k=100 t=2.836 ms (± 0.0191)
restab=
 2.42072        1.77324 2.90045
1.35612 1.41916 2.71073
1.22073 1.25286 2.54732
1.41433 1.5263  2.83617
*********** nq= 10000
========== d= 16
dataset in dimension 16, with metric L2, size: Q 10000 B 10000 T 0
search k=  1 t=0.035 s (± 0.0004)
search k= 10 t=0.037 s (± 0.0061)
search k=100 t=0.151 s (± 0.0066)
========== d= 32
dataset in dimension 32, with metric L2, size: Q 10000 B 10000 T 0
search k=  1 t=0.033 s (± 0.0003)
search k= 10 t=0.037 s (± 0.0003)
search k=100 t=0.166 s (± 0.0101)
========== d= 64
dataset in dimension 64, with metric L2, size: Q 10000 B 10000 T 0
search k=  1 t=0.038 s (± 0.0007)
search k= 10 t=0.042 s (± 0.0007)
search k=100 t=0.156 s (± 0.0087)
========== d= 128
dataset in dimension 128, with metric L2, size: Q 10000 B 10000 T 0
search k=  1 t=0.055 s (± 0.0005)
search k= 10 t=0.059 s (± 0.0004)
search k=100 t=0.178 s (± 0.0090)
restab=
 0.0350017      0.0371504       0.150556
0.0330732       0.0368384       0.165539
0.038266        0.0417653       0.156422
0.0551096       0.0589395       0.177976

commit (e1adde0)

model name      : Intel(R) Xeon(R) Silver 4214R CPU @ 2.40GHz
************ nthread= 32
*********** nq= 100
========== d= 16
dataset in dimension 16, with metric L2, size: Q 100 B 10000 T 0
search k=  1 t=2.238 ms (± 0.0496)
search k= 10 t=1.103 ms (± 0.0274)
search k=100 t=2.520 ms (± 0.0180)
========== d= 32
dataset in dimension 32, with metric L2, size: Q 100 B 10000 T 0
search k=  1 t=1.296 ms (± 0.0480)
search k= 10 t=1.470 ms (± 0.0168)
search k=100 t=3.153 ms (± 0.0943)
========== d= 64
dataset in dimension 64, with metric L2, size: Q 100 B 10000 T 0
search k=  1 t=1.438 ms (± 0.0105)
search k= 10 t=1.503 ms (± 0.0295)
search k=100 t=3.091 ms (± 0.1679)
========== d= 128
dataset in dimension 128, with metric L2, size: Q 100 B 10000 T 0
search k=  1 t=1.679 ms (± 0.0176)
search k= 10 t=1.460 ms (± 0.0093)
search k=100 t=3.100 ms (± 0.1812)
restab=
 2.23753        1.10283 2.51985
1.29637 1.47039 3.15303
1.43808 1.50332 3.09131
1.67882 1.46049 3.10025
*********** nq= 10000
========== d= 16
dataset in dimension 16, with metric L2, size: Q 10000 B 10000 T 0
search k=  1 t=0.036 s (± 0.0003)
search k= 10 t=0.035 s (± 0.0053)
search k=100 t=0.148 s (± 0.0062)
========== d= 32
dataset in dimension 32, with metric L2, size: Q 10000 B 10000 T 0
search k=  1 t=0.031 s (± 0.0017)
search k= 10 t=0.033 s (± 0.0003)
search k=100 t=0.183 s (± 0.0043)
========== d= 64
dataset in dimension 64, with metric L2, size: Q 10000 B 10000 T 0
search k=  1 t=0.040 s (± 0.0004)
search k= 10 t=0.044 s (± 0.0003)
search k=100 t=0.159 s (± 0.0079)
========== d= 128
dataset in dimension 128, with metric L2, size: Q 10000 B 10000 T 0
search k=  1 t=0.056 s (± 0.0011)
search k= 10 t=0.061 s (± 0.0067)
search k=100 t=0.179 s (± 0.0120)
restab=
 0.0362413      0.034522        0.148349
0.0310689       0.0330717       0.182657
0.0400654       0.043545        0.158652
0.0564006       0.0606787       0.178946

master(b4c320a)

model name      : Intel(R) Xeon(R) Silver 4214R CPU @ 2.40GHz
************ nthread= 32
*********** nq= 100
========== d= 16
dataset in dimension 16, with metric L2, size: Q 100 B 10000 T 0
search k=  1 t=1.377 ms (± 0.0176)
search k= 10 t=1.413 ms (± 0.0192)
search k=100 t=2.904 ms (± 0.0757)
========== d= 32
dataset in dimension 32, with metric L2, size: Q 100 B 10000 T 0
search k=  1 t=1.336 ms (± 0.0141)
search k= 10 t=1.371 ms (± 0.0099)
search k=100 t=2.871 ms (± 0.1652)
========== d= 64
dataset in dimension 64, with metric L2, size: Q 100 B 10000 T 0
search k=  1 t=1.249 ms (± 0.0192)
search k= 10 t=1.320 ms (± 0.0190)
search k=100 t=2.897 ms (± 0.0420)
========== d= 128
dataset in dimension 128, with metric L2, size: Q 100 B 10000 T 0
search k=  1 t=1.384 ms (± 0.0066)
search k= 10 t=1.518 ms (± 0.0412)
search k=100 t=3.157 ms (± 0.1046)
restab=
 1.37651        1.4134  2.90379
1.33565 1.37076 2.87127
1.24866 1.3198  2.89655
1.38402 1.51813 3.15681
*********** nq= 10000
========== d= 16
dataset in dimension 16, with metric L2, size: Q 10000 B 10000 T 0
search k=  1 t=0.034 s (± 0.0005)
search k= 10 t=0.035 s (± 0.0047)
search k=100 t=0.140 s (± 0.0065)
========== d= 32
dataset in dimension 32, with metric L2, size: Q 10000 B 10000 T 0
search k=  1 t=0.029 s (± 0.0003)
search k= 10 t=0.032 s (± 0.0006)
search k=100 t=0.157 s (± 0.0115)
========== d= 64
dataset in dimension 64, with metric L2, size: Q 10000 B 10000 T 0
search k=  1 t=0.038 s (± 0.0008)
search k= 10 t=0.042 s (± 0.0004)
search k=100 t=0.145 s (± 0.0068)
========== d= 128
dataset in dimension 128, with metric L2, size: Q 10000 B 10000 T 0
search k=  1 t=0.053 s (± 0.0003)
search k= 10 t=0.057 s (± 0.0005)
search k=100 t=0.177 s (± 0.0085)
restab=
 0.0341082      0.0346912       0.139799
0.0287708       0.0319045       0.157442
0.0382387       0.0418262       0.145482
0.0525824       0.0565464       0.177184

/bench_index_flat.py

-faiss.cvar.distance_compute_min_k_reservoir = 5
+#faiss.cvar.distance_compute_min_k_reservoir = 5
#!/bin/bash

export PYTHON=python
export PY_VER=38
export CPU_COUNT=32
export PREFIX=$CONDA_PREFIX
export LD_LIBRARY_PATH=$CONDA_PREFIX/lib
cp -r benchs benchs_copy # first commit hasn't needed benchmark
rm -rf logs
mkdir -p logs

run_bench () {
    mkdir -p logs/$1
    rm -rf _*
    pip uninstall -y faiss faiss-cpu
    git checkout $1

    bash conda/faiss/build-lib.sh
    bash conda/faiss/build-pkg.sh

    python benchs_copy/bench_index_flat.py > logs/$1/bench_index_flat.log
}

run_bench 698a4592e87920f036aa7a2d8a3a56e12387a8f0
run_bench e1adde0d84ece584f3b4d86db0b1532329f8cdb8
run_bench master

@mdouze mdouze closed this as completed Sep 7, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants