Skip to content

Commit

Permalink
feat: add an example for the Specification
Browse files Browse the repository at this point in the history
  • Loading branch information
nan-wang committed May 30, 2020
1 parent ece5e21 commit c48cd97
Showing 1 changed file with 129 additions and 32 deletions.
161 changes: 129 additions & 32 deletions docs/chapters/jep/jep-3/main.rst
Original file line number Diff line number Diff line change
Expand Up @@ -24,8 +24,11 @@ Abstract

Motivation
---------
Multi-field search is commonly used in practice.
Concretely, as a user, I want to limit the query within some selected fields.
The Multi-field search is commonly used in practice. Concretely,

.. highlights::
as a user, I want to limit the query within some selected fields.

In the following use case, there are two documents and three two fields in each of them, i.e. ``title`` and ``summary``.
The user wants to query ``painter`` but **only** from the ``title`` field. The expected result will be ``{'doc_id': 11, 'title': 'hackers and painters'}``.

Expand Down Expand Up @@ -53,7 +56,7 @@ Modify ``jina.proto``
^^^^^^^^^^^^^^^^^^^^^

Let's take the following ``Flow`` as an example.
The `FieldsMapper` is a ``Crafter`` that split each ``Document`` into fields and add the ``field_name`` information for ``Chunks``.
The ``FieldsMapper`` is a ``Crafter`` that split each ``Document`` into fields and add the ``field_name`` information for ``Chunks``.
Afterwards, the ``Chunks`` containing the ``title`` and the ``summary`` information are processed differently in two pathways and stored seperately.

.. image:: JEP3-index-design.png
Expand All @@ -66,7 +69,7 @@ Each ``Document`` have one or more fields, and each field can be further splitte
In other words, each ``Chunk`` can **only** be assigned to one field, but each field contains one or more ``Chunks``.

.. highlights::
The concept of ``field`` can be considered as a group of ``Chunks``.
The concept of `field` can be considered as a group of ``Chunks``.

Secondly, at the ``Request`` level, we will add another new field, namely ``filter_by``, for the ``SearchRequest``.
This is used to store the information of on which fields the user wants to query.
Expand All @@ -75,8 +78,9 @@ By adding this information, the users can specify different fields to query in e
Adapt Index-Flow Pods
^^^^^^^^^^^^^^^^^^^^^
During index time, most parts of the ``Flow`` stay the same as before.
In order to make the ``Indexer`` only index the ``Chunks`` whose ``field_name`` meet the selected fields, we need to adapt the ``VectorIndexDriver`` and the ``extract_chunks`` function.
A new argument, ``filter_by``, is introduced to specify which fields will be indexed.

To make the ``Encoder`` only encode the ``Chunks`` whose ``field_name`` meet the selected fields, a new argument, ``filter_by``, is introduced to specify which fields will be encoded.
To do so, we need adapt ``EncodeDriver`` and the ``extract_chunks()``.

.. highlight:: python
.. code-block:: python
Expand All @@ -89,6 +93,24 @@ A new argument, ``filter_by``, is introduced to specify which fields will be ind
:param filter_by: a list of service names to wait
"""
.. highlight:: python
.. code-block:: python
class EncodeDriver(BaseEncodeDriver):
def __init__(self, filter_by: Union[str, List[str], Tuple[str]] = None, *args, **kwargs)
super().__init__(*args, **kwargs)
self.filter_by = filter_by
def __call__(self, *args, **kwargs):
filter_by = self.filter_by
if self._request.__class__.__name__ == 'SearchRequest':
filter_by = self.req.filter_by
contents, chunk_pts, no_chunk_docs, bad_chunk_ids = \
extract_chunks(self.req.docs, self.filter_by, embedding=False)
In order to make the ``Indexer`` only index the ``Chunks`` whose ``field_name`` meet the selected fields, we need to adapt the ``VectorIndexDriver`` as well.

.. highlight:: python
.. code-block:: python
Expand Down Expand Up @@ -126,22 +148,8 @@ The same change goes for the ``ChunkKVIndexDriver``.
Adapt Query-Flow Pods
^^^^^^^^^^^^^^^^^^^^^

During the query time, the ``EncodeDriver`` will extract only the ``Chunks`` that meet the ``filter_by`` in the ``SearchRequest``.
To do so, we need adapt ``EncodeDriver`` to the ``extract_chunks()`` which has been refactored in the previous section.

.. highlight:: python
.. code-block:: python
class EncodeDriver(BaseEncodeDriver):
"""Extract the chunk-level content from documents and call executor and do encoding
"""
def __call__(self, *args, **kwargs):
contents, chunk_pts, no_chunk_docs, bad_chunk_ids = \
extract_chunks(self.req.docs, self.req.filter_by, embedding=False)
Plus, we need to refactor the ``BasePea`` so that the ``Pea`` gets the information of how many incoming messages are expected.
During the query time,
Moreover, we need to refactor the ``BasePea`` so that the ``Pea`` gets the information of how many incoming messages are expected.
The expected number of incoming messages will change from query to query because the user will select different fields with the ``filter_by`` argument.
In the current version (v.0.1.15), this information is fixed and stored in ``self.args.num_parts`` when the graph is built.
And the ``Pea`` will **NOT** start processing the data until the expected number of incoming messages arrive.
Expand All @@ -157,7 +165,7 @@ Thereafter, it is reasonable to set the expected number of incoming messages as
# modify the num_part on the fly for SearchRequest
num_part = min(self.args.num_part, max(len(self.request.filtered_by), 1))
In the query time, the ``VectorSearchDriver`` and the ``KVSearchDriver`` also need to be adapted accordingly in order to **only** process the ``Chunks`` meet the ``filter_by`` requirement.
Furthermore, the ``VectorSearchDriver`` and the ``KVSearchDriver`` also need to be adapted accordingly in order to **only** process the ``Chunks`` meet the ``filter_by`` requirement.

.. highlight:: python
.. code-block:: python
Expand Down Expand Up @@ -194,17 +202,106 @@ In the query time, the ``VectorSearchDriver`` and the ``KVSearchDriver`` also ne
Specification
-------------

The selected multiple fields will be given in the query by ``field_name`` as following.
For the use case above, the `index.yml` will be defined as following,

.. highlight:: yaml
.. code-block:: yaml
!Flow
pods:
fields_mapper:
yaml_path: mapper.yml
title_encoder:
yaml_path: title_encoder.yml
needs: fields_mapper
sum_encoder:
yaml_path: sum_encoder.yml
needs: fields_mapper
title_indexer:
yaml_path: title_indexer.yml
needs: title_encoder
sum_indexer:
yaml_path: sum_indexer.yml
needs: sum_encoder
join:
needs:
- title_indexer
- sum_indexer
And the `mapper.yml` will be defined as below,

.. highlight:: yaml
.. code-block:: yaml
!FilterMapper
requests:
on:
[SearchRequest, IndexRequest]:
- !MapperDriver
with:
method: craft
mapping: {'title': 'title', 'summary': 'summ'}
The `sum_encoder.yml` is as below,

.. highlight:: yaml
.. code-block:: yaml
!AnotherTextEncoder
requests:
on:
[SearchRequest, IndexRequest]:
- !EncodeDriver
with:
method: encode
filter_by: summ
The `sum_indexer.yml` is as below,

.. highlight:: yaml
.. code-block:: yaml
!ChunkIndexer
components:
- !NumpyIndexer
with:
index_filename: vec.gz
- !BasePbIndexer
with:
index_filename: chunk.gz
requests:
on:
IndexRequest:
- !VectorIndexDriver
with:
executor: NumpyIndexer
filter_by: summ
- !PruneDriver {}
- !KVIndexDriver
with:
level: chunk
executor: BasePbIndexer
filter_by: summ
SearchRequest:
- !VectorSearchDriver
with:
executor: NumpyIndexer
filter_by: summ
- !PruneDriver {}
- !KVSearchDriver
with:
level: chunk
executor: BasePbIndexer
filter_by: summ
To send the request, one can specify the `filter_by` argument as below,

.. highlight:: json
.. code-block:: json
.. highlight:: python
.. code-block:: python
{
"data": "painter",
"top_k": 10,
"mime_type": "application/text"
"fields_name": ["title"],
}
with flow.build() as fl:
fl.search(read_data_fn, callback=call_back_fn, filter_by=['title',])
Open Issues
Expand Down

0 comments on commit c48cd97

Please sign in to comment.