Fixes #3971: Check how to integrate vector databases via rest APIs (#4059)

* Fixes #3971: Check how to integrate vector databases via rest APIs

* fixed CI errors and removed unused imports

* Changes review: added weaviate db, removed vector idx autocreation and vector as a default result

* code clean

* Changes review: added systemdb store, removed constraint creation

* code clean

* 2nd changes review

* fixed qdrant filename typo and removed info procs from docs
vga91 authored May 27, 2024
1 parent 05bb7ff commit 89d167b
Showing 31 changed files with 4,081 additions and 2 deletions.
1 change: 1 addition & 0 deletions docs/asciidoc/modules/ROOT/nav.adoc
@@ -39,6 +39,7 @@ include::partial$generated-documentation/nav.adoc[]
** xref::database-integration/bolt-neo4j.adoc[]
** xref::database-integration/load-ldap.adoc[]
** xref::database-integration/redis.adoc[]
** xref:database-integration/vectordb/index.adoc[]
* xref:graph-updates/index.adoc[]
** xref::graph-updates/uuid.adoc[]

@@ -17,4 +17,5 @@ For more information on how to use these procedures, see:
* xref::database-integration/bolt-neo4j.adoc[]
* xref::database-integration/load-ldap.adoc[]
* xref::database-integration/redis.adoc[]
* xref:database-integration/vectordb/index.adoc[]

@@ -0,0 +1,157 @@

== ChromaDB

Here is a list of all available ChromaDB procedures.
Note that the list and the procedure signatures are consistent with the others, such as the Qdrant ones:

[opts=header, cols="1, 3"]
|===
| name | description
| apoc.vectordb.chroma.createCollection(hostOrKey, collection, similarity, size, $config) |
Creates a collection, with the name specified in the 2nd parameter, and with the specified `similarity` and `size`.
The default endpoint is `<hostOrKey param>/api/v1/collections`.
| apoc.vectordb.chroma.deleteCollection(hostOrKey, collection, $config) |
Deletes a collection with the name specified in the 2nd parameter.
The default endpoint is `<hostOrKey param>/api/v1/collections/<collection param>`.
| apoc.vectordb.chroma.upsert(hostOrKey, collection, vectors, $config) |
Upserts, in the collection with the name specified in the 2nd parameter, the vectors [{id: 'id', vector: '<vectorDb>', metadata: '<metadata>'}].
The default endpoint is `<hostOrKey param>/api/v1/collections/<collection param>/upsert`.
| apoc.vectordb.chroma.delete(hostOrKey, collection, ids, $config) |
Deletes the vectors with the specified `ids`.
The default endpoint is `<hostOrKey param>/api/v1/collections/<collection param>/delete`.
| apoc.vectordb.chroma.get(hostOrKey, collection, ids, $config) |
Gets the vectors with the specified `ids`.
The default endpoint is `<hostOrKey param>/api/v1/collections/<collection param>/get`.
| apoc.vectordb.chroma.query(hostOrKey, collection, vector, filter, limit, $config) |
Retrieves the vectors closest to the given `vector`, returning at most `limit` results, from the collection with the name specified in the 2nd parameter.
The default endpoint is `<hostOrKey param>/api/v1/collections/<collection param>/query`.
| apoc.vectordb.chroma.getAndUpdate(hostOrKey, collection, ids, $config) |
Gets the vectors with the specified `ids`, and optionally creates/updates Neo4j entities.
The default endpoint is `<hostOrKey param>/api/v1/collections/<collection param>/get`.
| apoc.vectordb.chroma.queryAndUpdate(hostOrKey, collection, vector, filter, limit, $config) |
Retrieves the vectors closest to the given `vector`, returning at most `limit` results, from the collection with the name specified in the 2nd parameter, and optionally creates/updates Neo4j entities.
The default endpoint is `<hostOrKey param>/api/v1/collections/<collection param>/query`.
|===

where the 1st parameter can be a key defined by the APOC config `apoc.chroma.<key>.host=myHost`.
With `hostOrKey=null`, the default is `http://localhost:8000`.
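
As a rough illustration of the resolution rule above, the following Python sketch shows how a `hostOrKey` value could be turned into a default endpoint URL. It is not APOC's implementation: the function names `resolve_host` and `default_endpoint`, the `apoc_config` dictionary, and the rule "treat the value as a key if a matching config entry exists, otherwise as the host itself" are assumptions made for illustration only.

```python
from typing import Dict, Optional

DEFAULT_CHROMA_HOST = "http://localhost:8000"  # default stated in the text above


def resolve_host(host_or_key: Optional[str], apoc_config: Dict[str, str]) -> str:
    """Return the base URL for a hostOrKey value (illustrative only)."""
    if host_or_key is None:
        return DEFAULT_CHROMA_HOST
    # Assumption: a value matching an `apoc.chroma.<key>.host` entry is a key;
    # anything else is treated as the host itself.
    return apoc_config.get(f"apoc.chroma.{host_or_key}.host", host_or_key)


def default_endpoint(host_or_key: Optional[str], collection: str,
                     action: str, apoc_config: Dict[str, str]) -> str:
    """Build `<host>/api/v1/collections/<collection>/<action>`."""
    base = resolve_host(host_or_key, apoc_config).rstrip("/")
    return f"{base}/api/v1/collections/{collection}/{action}"
```

For instance, `default_endpoint(None, "test_collection", "query", {})` builds the `query` endpoint on the default local host.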

=== Examples

.Create a collection (it leverages https://docs.trychroma.com/usage-guide#creating-inspecting-and-deleting-collections[this API])
[source,cypher]
----
CALL apoc.vectordb.chroma.createCollection($host, 'test_collection', 'Cosine', 4, {<optional config>})
----


.Delete a collection (it leverages https://docs.trychroma.com/usage-guide#creating-inspecting-and-deleting-collections[this API])
[source,cypher]
----
CALL apoc.vectordb.chroma.deleteCollection($host, '<collection_id>', {<optional config>})
----


.Upsert vectors (it leverages https://docs.trychroma.com/usage-guide#adding-data-to-a-collection[this API])
[source,cypher]
----
CALL apoc.vectordb.chroma.upsert($host, '<collection_id>',
[
{id: 1, vector: [0.05, 0.61, 0.76, 0.74], metadata: {city: "Berlin", foo: "one"}, text: 'ajeje'},
{id: 2, vector: [0.19, 0.81, 0.75, 0.11], metadata: {city: "London", foo: "two"}, text: 'brazorf'}
],
{<optional config>})
----


.Get vectors (it leverages https://docs.trychroma.com/usage-guide#querying-a-collection[this API])
[source,cypher]
----
CALL apoc.vectordb.chroma.get($host, '<collection_id>', ['1','2'], {<optional config>})
----


.Example results
[opts="header"]
|===
| score | metadata | id | vector | text | entity
| null | {city: "Berlin", foo: "one"} | null | null | null | null
| null | {city: "London", foo: "two"} | null | null | null | null
| ...
|===


.Get vectors with `{allResults: true}`
[source,cypher]
----
CALL apoc.vectordb.chroma.get($host, '<collection_id>', ['1','2'], {allResults: true, <optional config>})
----


.Example results
[opts="header"]
|===
| score | metadata | id | vector | text | entity
| null | {city: "Berlin", foo: "one"} | 1 | [...] | ajeje | null
| null | {city: "London", foo: "two"} | 2 | [...] | brazorf | null
| ...
|===


.Query vectors (it leverages https://docs.trychroma.com/usage-guide#querying-a-collection[this API])
[source,cypher]
----
CALL apoc.vectordb.chroma.query($host,
'<collection_id>',
[0.2, 0.1, 0.9, 0.7],
{city: 'London'},
5,
{allResults: true, <optional config>})
----


.Example results
[opts="header"]
|===
| score | metadata | id | vector | text
| 1 | {city: "Berlin", foo: "one"} | 1 | [...] | ajeje
| 0.1 | {city: "London", foo: "two"} | 2 | [...] | brazorf
| ...
|===


[NOTE]
====
To optimize performance, we can choose what to `YIELD` with the `apoc.vectordb.chroma.query` and `apoc.vectordb.chroma.get` procedures.
For example, when executing `CALL apoc.vectordb.chroma.query(...) YIELD metadata, score, id`, the REST API request will contain `{"include": ["metadatas", "documents", "distances"]}`,
so that values we do not need are not returned.
====


In the same way as other procedures, we can define a mapping, to fetch the associated nodes and relationships and optionally create them,
by leveraging the vector metadata. For example:

.Query vectors
[source,cypher]
----
CALL apoc.vectordb.chroma.query($host, '<collection_id>',
[0.2, 0.1, 0.9, 0.7],
{},
5,
{ mapping: {
embeddingKey: "vect",
nodeLabel: "Test",
entityKey: "myId",
metadataKey: "foo"
}
})
----



.Delete vectors (it leverages https://docs.trychroma.com/usage-guide#deleting-data-from-a-collection[this API])
[source,cypher]
----
CALL apoc.vectordb.chroma.delete($host, '<collection_id>', [1,2], {<optional config>})
----

@@ -0,0 +1,99 @@

== Custom (i.e. other vector databases)

We can also interface with other vector databases that do not (yet) have dedicated procedures,
for example https://docs.pinecone.io/guides/getting-started/overview[Pinecone], as we will see later.

Here is a list of all available custom procedures:

[opts=header, cols="1, 3"]
|===
| name | description
| apoc.vectordb.custom.get(host, $embeddingConfig) | Customizable get / query procedure,
returning results like the other `apoc.vectordb.*.get` procedures.
| apoc.vectordb.custom(host, $config) | Fully customizable procedure, returns generic object results.
|===


=== Examples


The `apoc.vectordb.custom.get` procedure can be used with any API that returns something like this
(note that the response does not need to include all keys):

```
[
  {
    "<idKey>": "value",
    "<scoreKey>": scoreValue,
    "<vectorKey>": [ ... ],
    "<metadataKey>": { .. },
    "<textKey>": "..."
  },
  {
    ...
  }
]
```

where we can customize idKey, scoreKey, vectorKey, metadataKey and textKey via the config parameters of the same name.
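
To make the key mapping concrete, here is a minimal Python sketch of how response items could be mapped onto fixed result columns. It is not the procedure's actual code: the function name `to_rows` is hypothetical, and the `"id"` default for `idKey` is an assumption, while the other defaults (`"vector"`, `"metadata"`, `"score"`, `"text"`) are the ones listed for the embeddingConfig parameters.

```python
from typing import Any, Dict, List, Optional


def to_rows(items: List[Dict[str, Any]],
            config: Optional[Dict[str, str]] = None) -> List[Dict[str, Any]]:
    """Map raw response items onto fixed result columns (illustrative only)."""
    cfg = config or {}
    # Result column -> key to read from each response item.
    # The "id" default is an assumption made for this sketch.
    keys = {
        "id": cfg.get("idKey", "id"),
        "score": cfg.get("scoreKey", "score"),
        "vector": cfg.get("vectorKey", "vector"),
        "metadata": cfg.get("metadataKey", "metadata"),
        "text": cfg.get("textKey", "text"),
    }
    # Keys missing from an item become None, since not every key is required.
    return [{col: item.get(key) for col, key in keys.items()} for item in items]
```

For a response whose vectors are stored under `values`, `to_rows(items, {"vectorKey": "values"})` fills the `vector` column from that key and leaves absent columns such as `text` empty.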


Let's look at some examples using https://docs.pinecone.io/guides/getting-started/overview[Pinecone].


.apoc.vectordb.custom.get example
[source,cypher]
----
CALL apoc.vectordb.custom.get('https://<INDEX-ID>.svc.gcp-starter.pinecone.io/query', {
    body: {
        namespace: $namespace,
        vector: $vector,
        topK: 3,
        includeValues: true,
        includeMetadata: true
    },
    headers: {`Api-Key`: $apiKey},
    method: null,
    jsonPath: "matches",
    // the REST API returns the vectors under the `values` key
    vectorKey: 'values'
})
----


.Example results
[opts="header"]
|===
| score | metadata | id | vector | text
| 1 | {a: 1} | 1 | [1,2,3,4] | null
| 0.1 | {a: 2} | 2 | [1,2,3,4] | null
| ...
|===



.apoc.vectordb.custom example
[source,cypher]
----
CALL apoc.vectordb.custom('https://<INDEX-ID>.svc.gcp-starter.pinecone.io/query', {
    body: {
        namespace: $namespace,
        vector: $vector,
        topK: 3,
        includeValues: true,
        includeMetadata: true
    },
    headers: {`Api-Key`: $apiKey},
    method: null,
    jsonPath: "matches"
})
----


.Example results
[opts="header"]
|===
| value
| {score: <score>, metadata: <metadata>, id: <id>, vector: <vector>}
| {score: <score>, metadata: <metadata>, id: <id>, vector: <vector>}
| ...
|===
@@ -0,0 +1,108 @@
[[vectordb]]
= Vector Databases
:description: This section describes procedures that can be used to interact with Vector Databases.

APOC provides the following set of procedures, which leverage REST APIs to interact with vector databases:

- `apoc.vectordb.qdrant.*` (to interact with https://qdrant.tech/documentation/overview/[Qdrant])
- `apoc.vectordb.chroma.*` (to interact with https://docs.trychroma.com/getting-started[Chroma])
- `apoc.vectordb.weaviate.*` (to interact with https://weaviate.io/developers/weaviate[Weaviate])
- `apoc.vectordb.custom.*` (to interact with other vector databases).
- `apoc.vectordb.configure` (to store host, credentials and mapping into the system database)

All the procedures, except the `apoc.vectordb.configure` one, can have, as a final parameter,
a configuration map with these optional parameters:

.config parameters

|===
| key | description
| headers | additional HTTP headers
| method | HTTP method
| endpoint | endpoint key,
can be used to override the default endpoint created via the 1st parameter of the procedures,
to handle potential endpoint changes.
| body | HTTP request body
| jsonPath | To customize https://github.com/json-path/JsonPath[JSONPath] parsing of the response. The default is `null`.
|===


Besides the above config, the `apoc.vectordb.<type>.get` and the `apoc.vectordb.<type>.query` procedures can have these additional parameters:

.embeddingConfig parameters

|===
| key | description
| mapping | to fetch the associated entities and optionally create them. See examples below.
| allResults | if true, returns the vector, metadata and text (if present), otherwise returns null values for those columns.
| vectorKey, metadataKey, scoreKey, textKey | used with the `apoc.vectordb.custom.get` procedure.
They tell the procedure which key in the REST API response (if present) should populate the vector, metadata, score and text result columns, respectively.
Defaults are "vector", "metadata", "score", "text".
See examples below.
|===
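
The effect of `allResults` described in the table above can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name `apply_all_results` is hypothetical, and the default of `False` is assumed rather than taken from the source.

```python
from typing import Any, Dict


def apply_all_results(row: Dict[str, Any], all_results: bool = False) -> Dict[str, Any]:
    """Blank out vector, metadata and text unless all_results is true."""
    if all_results:
        return dict(row)
    # Columns that are only populated when allResults is true.
    hidden = {"vector", "metadata", "text"}
    return {key: (None if key in hidden else value) for key, value in row.items()}
```

With `all_results=False`, a row keeps its `id` and `score` while `vector`, `metadata` and `text` are returned as null values.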


== Ad-hoc procedures

See the following pages for more details on specific vector database procedures:

- xref:./qdrant.adoc[Qdrant]
- xref:./chroma.adoc[ChromaDB]
- xref:./weaviate.adoc[Weaviate]


== Store Vector db info (i.e. `apoc.vectordb.configure`)

We can save some information in the system database for later reuse, namely the host, login credentials, and mapping,
to be used in the `\*.get` and `*.query` procedures, except for `apoc.vectordb.custom.get`.

To store the vector database info, we can execute `CALL apoc.vectordb.configure(vectorName, keyConfig, databaseName, $configMap)`,
where `vectorName` can be "QDRANT", "CHROMA" or "WEAVIATE",
indicating that the info will be reused by `apoc.vectordb.qdrant.*`, `apoc.vectordb.chroma.*` and `apoc.vectordb.weaviate.*`, respectively.

`keyConfig` is the configuration name, `databaseName` is the database where the config will be set,
and `configMap` is a map that can contain:

- `host` is the host base name
- `credentialsValue` is the API key
- `mapping` is a map that can be used by the `apoc.vectordb.\*.getAndUpdate` and `apoc.vectordb.*.queryAndUpdate` procedures

NOTE: This procedure can only be executed by a user with admin permissions, against the system database.

For example:
[source,cypher]
----
// -- within the system database or using the Cypher clause `USE SYSTEM ..` as a prefix
CALL apoc.vectordb.configure('QDRANT', 'qdrant-config-test', 'neo4j',
{
mapping: { embeddingKey: "vect", nodeLabel: "Test", entityKey: "myId", metadataKey: "foo" },
host: 'custom-host-name',
credentials: '<apiKey>'
}
)
----

and then we can execute e.g. the following procedure (within the `neo4j` database):

[source,cypher]
----
CALL apoc.vectordb.qdrant.query('qdrant-config-test', 'test_collection', [0.2, 0.1, 0.9, 0.7], {}, 5)
----

instead of:

[source,cypher]
----
CALL apoc.vectordb.qdrant.query($host, 'test_collection', [0.2, 0.1, 0.9, 0.7], {}, 5,
{ mapping: {
embeddingKey: "vect",
nodeLabel: "Test",
entityKey: "myId",
metadataKey: "foo"
},
headers: {Authorization: 'Bearer <apiKey>'},
endpoint: 'custom-host-name'
})
----
