diff --git a/docs/chapters/jep/jep-3/main.rst b/docs/chapters/jep/jep-3/main.rst
index 6f3f6879c784c..1c8f3a2b47856 100644
--- a/docs/chapters/jep/jep-3/main.rst
+++ b/docs/chapters/jep/jep-3/main.rst
@@ -24,8 +24,11 @@ Abstract
 
 Motivation
 ---------
-Multi-field search is commonly used in practice.
-Concretely, as a user, I want to limit the query within some selected fields.
+Multi-field search is commonly used in practice. Concretely,
+
+.. highlights::
+    as a user, I want to limit the query within some selected fields.
+
 In the following use case, there are two documents and two fields in each of them, i.e. ``title`` and ``summary``.
 The user wants to query ``painter`` but **only** from the ``title`` field.
 The expected result will be ``{'doc_id': 11, 'title': 'hackers and painters'}``.
@@ -53,7 +56,7 @@ Modify ``jina.proto``
 ^^^^^^^^^^^^^^^^^^^^^
 
 Let's take the following ``Flow`` as an example.
-The `FieldsMapper` is a ``Crafter`` that split each ``Document`` into fields and add the ``field_name`` information for ``Chunks``.
+The ``FieldsMapper`` is a ``Crafter`` that splits each ``Document`` into fields and adds the ``field_name`` information to the ``Chunks``.
 Afterwards, the ``Chunks`` containing the ``title`` and the ``summary`` information are processed differently in two pathways and stored separately.
 
 .. image:: JEP3-index-design.png
@@ -66,7 +69,7 @@ Each ``Document`` have one or more fields, and each field can be further splitte
 In other words, each ``Chunk`` can **only** be assigned to one field, but each field contains one or more ``Chunks``.
 
 .. highlights::
-    The concept of ``field`` can be considered as a group of ``Chunks``.
+    The concept of `field` can be considered as a group of ``Chunks``.
 
 Secondly, at the ``Request`` level, we will add another new field, namely ``filter_by``, for the ``SearchRequest``.
 This is used to store the information about which fields the user wants to query.
@@ -75,8 +78,9 @@ By adding this information, the users can specify different fields to query in e
 Adapt Index-Flow Pods
 ^^^^^^^^^^^^^^^^^^^^^
 During index time, most parts of the ``Flow`` stay the same as before.
-In order to make the ``Indexer`` only index the ``Chunks`` whose ``field_name`` meet the selected fields, we need to adapt the ``VectorIndexDriver`` and the ``extract_chunks`` function.
-A new argument, ``filter_by``, is introduced to specify which fields will be indexed.
+
+To make the ``Encoder`` only encode the ``Chunks`` whose ``field_name`` matches the selected fields, a new argument, ``filter_by``, is introduced to specify which fields will be encoded.
+To do so, we need to adapt the ``EncodeDriver`` and the ``extract_chunks()`` function.
 
 .. highlight:: python
 .. code-block:: python
@@ -89,6 +93,24 @@ A new argument, ``filter_by``, is introduced to specify which fields will be ind
     :param filter_by: a list of field names used to filter the chunks
     """
 
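+The filtering inside ``extract_chunks()`` could be as simple as a membership check on the ``field_name`` of each ``Chunk``.
+The snippet below is only a sketch of that idea; the helper ``_is_selected`` and the traversal of ``docs`` are illustrative, not the actual implementation.
+
+.. highlight:: python
+.. code-block:: python
+
+    def _is_selected(chunk, filter_by) -> bool:
+        """Keep the chunk when no filter is given or when its field is selected."""
+        if not filter_by:
+            return True
+        if isinstance(filter_by, str):
+            filter_by = [filter_by]
+        return chunk.field_name in filter_by
+
+    # inside extract_chunks(), only the selected chunks are collected, e.g.
+    # chunks = [c for doc in docs for c in doc.chunks if _is_selected(c, filter_by)]
+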
+The ``EncodeDriver`` passes this argument on: by default it uses the ``filter_by`` set in its YAML config, and for a ``SearchRequest`` it uses the one carried by the request.
+
+.. highlight:: python
+.. code-block:: python
+
+    class EncodeDriver(BaseEncodeDriver):
+        def __init__(self, filter_by: Union[str, List[str], Tuple[str]] = None, *args, **kwargs):
+            super().__init__(*args, **kwargs)
+            self.filter_by = filter_by
+
+        def __call__(self, *args, **kwargs):
+            # a SearchRequest may carry its own filter_by, overriding the YAML default
+            filter_by = self.filter_by
+            if self.req.__class__.__name__ == 'SearchRequest':
+                filter_by = self.req.filter_by
+            contents, chunk_pts, no_chunk_docs, bad_chunk_ids = \
+                extract_chunks(self.req.docs, filter_by, embedding=False)
+
+
+In order to make the ``Indexer`` only index the ``Chunks`` whose ``field_name`` matches the selected fields, we need to adapt the ``VectorIndexDriver`` as well.
+
 .. highlight:: python
 .. code-block:: python

@@ -126,22 +148,8 @@ The same change goes for the ``ChunkKVIndexDriver``.
 
 Adapt Query-Flow Pods
 ^^^^^^^^^^^^^^^^^^^^^
-During the query time, the ``EncodeDriver`` will extract only the ``Chunks`` that meet the ``filter_by`` in the ``SearchRequest``.
-To do so, we need adapt ``EncodeDriver`` to the ``extract_chunks()`` which has been refactored in the previous section.
-
-.. highlight:: python
-.. code-block:: python
-
-    class EncodeDriver(BaseEncodeDriver):
-        """Extract the chunk-level content from documents and call executor and do encoding
-        """
-
-        def __call__(self, *args, **kwargs):
-            contents, chunk_pts, no_chunk_docs, bad_chunk_ids = \
-                extract_chunks(self.req.docs, self.req.filter_by, embedding=False)
-
-
-Plus, we need to refactor the ``BasePea`` so that the ``Pea`` gets the information of how many incoming messages are expected.
+During the query time, the ``EncodeDriver`` described above encodes only the ``Chunks`` that meet the ``filter_by`` argument of the ``SearchRequest``.
+Moreover, we need to refactor the ``BasePea`` so that the ``Pea`` gets the information of how many incoming messages are expected.
 The expected number of incoming messages will change from query to query because the user will select different fields with the ``filter_by`` argument.
 In the current version (v.0.1.15), this information is fixed and stored in ``self.args.num_parts`` when the graph is built.
 And the ``Pea`` will **NOT** start processing the data until the expected number of incoming messages arrive.
@@ -157,7 +165,7 @@ Thereafter, it is reasonable to set the expected number of incoming messages as
         # modify the num_part on the fly for SearchRequest
         num_part = min(self.args.num_part, max(len(self.request.filtered_by), 1))
 
-In the query time, the ``VectorSearchDriver`` and the ``KVSearchDriver`` also need to be adapted accordingly in order to **only** process the ``Chunks`` meet the ``filter_by`` requirement.
+Furthermore, the ``VectorSearchDriver`` and the ``KVSearchDriver`` also need to be adapted so that they **only** process the ``Chunks`` that meet the ``filter_by`` requirement.
 
 .. highlight:: python
 .. code-block:: python

@@ -194,17 +202,106 @@ In the query time, the ``VectorSearchDriver`` and the ``KVSearchDriver`` also ne
 Specification
 -------------
 
-The selected multiple fields will be given in the query by ``field_name`` as following.
+For the use case above, the ``index.yml`` is defined as follows:
+
+.. highlight:: yaml
+.. code-block:: yaml
+
+    !Flow
+    pods:
+      fields_mapper:
+        yaml_path: mapper.yml
+      title_encoder:
+        yaml_path: title_encoder.yml
+        needs: fields_mapper
+      sum_encoder:
+        yaml_path: sum_encoder.yml
+        needs: fields_mapper
+      title_indexer:
+        yaml_path: title_indexer.yml
+        needs: title_encoder
+      sum_indexer:
+        yaml_path: sum_indexer.yml
+        needs: sum_encoder
+      join:
+        needs:
+          - title_indexer
+          - sum_indexer
+
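+For reference, the same topology could presumably be composed with the ``Flow`` API directly in Python.
+The following is only a sketch and assumes that ``Flow.add()`` accepts the same ``yaml_path`` and ``needs`` arguments used in the YAML above.
+
+.. highlight:: python
+.. code-block:: python
+
+    from jina.flow import Flow
+
+    # mirror index.yml: two encode/index pathways that join at the end
+    flow = (Flow()
+            .add(name='fields_mapper', yaml_path='mapper.yml')
+            .add(name='title_encoder', yaml_path='title_encoder.yml', needs='fields_mapper')
+            .add(name='sum_encoder', yaml_path='sum_encoder.yml', needs='fields_mapper')
+            .add(name='title_indexer', yaml_path='title_indexer.yml', needs='title_encoder')
+            .add(name='sum_indexer', yaml_path='sum_indexer.yml', needs='sum_encoder')
+            .add(name='join', needs=['title_indexer', 'sum_indexer']))
+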
+And the ``mapper.yml`` is defined as below:
+
+.. highlight:: yaml
+.. code-block:: yaml
+
+    !FilterMapper
+    requests:
+      on:
+        [SearchRequest, IndexRequest]:
+          - !MapperDriver
+            with:
+              method: craft
+              mapping: {'title': 'title', 'summary': 'summ'}
+
+The ``sum_encoder.yml`` is as below:
+
+.. highlight:: yaml
+.. code-block:: yaml
+
+    !AnotherTextEncoder
+    requests:
+      on:
+        [SearchRequest, IndexRequest]:
+          - !EncodeDriver
+            with:
+              method: encode
+              filter_by: summ
+
+The ``sum_indexer.yml`` is as below:
+
+.. highlight:: yaml
+.. code-block:: yaml
+
+    !ChunkIndexer
+    components:
+      - !NumpyIndexer
+        with:
+          index_filename: vec.gz
+      - !BasePbIndexer
+        with:
+          index_filename: chunk.gz
+    requests:
+      on:
+        IndexRequest:
+          - !VectorIndexDriver
+            with:
+              executor: NumpyIndexer
+              filter_by: summ
+          - !PruneDriver {}
+          - !KVIndexDriver
+            with:
+              level: chunk
+              executor: BasePbIndexer
+              filter_by: summ
+        SearchRequest:
+          - !VectorSearchDriver
+            with:
+              executor: NumpyIndexer
+              filter_by: summ
+          - !PruneDriver {}
+          - !KVSearchDriver
+            with:
+              level: chunk
+              executor: BasePbIndexer
+              filter_by: summ
+
+
+To send the request, one can specify the ``filter_by`` argument as below:
 
-.. highlight:: json
-.. code-block:: json
+.. highlight:: python
+.. code-block:: python
 
-    {
-        "data": "painter",
-        "top_k": 10,
-        "mime_type": "application/text"
-        "fields_name": ["title"],
-    }
+    with flow.build() as fl:
+        fl.search(read_data_fn, callback=call_back_fn, filter_by=['title'])
 
 Open Issues