Improve documentation for settings #1345

Merged
merged 3 commits into from
Jan 10, 2022
Changes from 2 commits
87 changes: 67 additions & 20 deletions docs/source/admin/fs/index.rst
@@ -1,59 +1,106 @@
Job file specification
======================
Example job file specification
==============================

The job file must comply to the following ``yaml`` specifications:
The job file (``~/.fscrawler/test/_settings.yaml``) for the job name ``test`` must comply with the following ``yaml`` specifications:

.. code:: yaml

name: "job_name"
# required
name: "test"

# required
fs:

# define a "local" file path crawler, if running inside a docker container this must be the path INSIDE the container
url: "/path/to/docs"
follow_symlink: false
remove_deleted: true
continue_on_error: false

# scan every 5 minutes for changes in url defined above
update_rate: "5m"

# optional: define includes and excludes, "~" files are excluded by default if not defined below
includes:
- "*.doc"
- "*.xls"
excludes:
- "resume.doc"

# optional: do not send big files to TIKA
ignore_above: "512mb"

# special handling of JSON files, should only be used if ALL files are JSON
json_support: false
add_as_inner_object: false

# special handling of XML files, should only be used if ALL files are XML
xml_support: false

# use MD5 from filename (instead of filename) if set to false
filename_as_id: true

# include size of file in index
add_filesize: true
remove_deleted: true
add_as_inner_object: false
store_source: true

# include user/group of file only if needed
attributes_support: false

# do you REALLY want to store every file as a copy in the index ? Then set this to true
store_source: false

# you may want to store (partial) content of the file (see indexed_chars)
index_content: true

# how much data from the content of the file should be indexed (and stored inside the index), set to 0 if you need checksum, but no content at all to be indexed
#indexed_chars: "0"
indexed_chars: "10000.0"
attributes_support: false
raw_metadata: true
xml_support: false

# usually file metadata will be stored in separate fields, if you want to keep the original set, set this to true
raw_metadata: false

# optional: add checksum meta (requires index_content to be set to true)
checksum: "MD5"

# recommended, but will create another index
index_folders: true

lang_detect: false
continue_on_error: false
pdf_ocr: true
ocr:
language: "eng"
path: "/path/to/tesseract/if/not/available/in/PATH"
data_path: "/path/to/tesseract/tessdata/if/needed"

ocr.pdf_strategy: noocr
#ocr:
# language: "eng"
# path: "/path/to/tesseract/if/not/available/in/PATH"
# data_path: "/path/to/tesseract/tessdata/if/needed"

# optional: only required if you want to SSH to another server to index documents from there
server:
hostname: "localhost"
port: 22
username: "dadoonet"
password: "password"
protocol: "SSH"
pem_path: "/path/to/pemfile"

# required
elasticsearch:
nodes:
# With Cloud ID
- cloud_id: "CLOUD_ID"
# With URL
- url: "http://127.0.0.1:9200"
index: "docs"
bulk_size: 1000
flush_interval: "5s"
byte_size: "10mb"
username: "elastic"
password: "password"
# optional, defaults to "docs"
index: "test_docs"
# optional, defaults to "test_folders", used when fs.index_folders is set to true
index_folder: "test_fold"
rest:
url: "https://127.0.0.1:8080/fscrawler"
# only started when the --rest option is used
url: "http://127.0.0.1:8080/fscrawler"

Here is a list of existing top level settings:

@@ -73,5 +120,5 @@ Here is a list of existing top level settings:

.. versionadded:: 2.7

You can define your job settings either in ``yaml`` (using ``.yaml`` extension) or
in ``json`` (using ``.json`` extension).
You can define your job settings either in ``_settings.yaml`` (using ``.yaml`` extension) or
in ``_settings.json`` (using ``.json`` extension).
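
As a rough sketch, a minimal ``_settings.yaml`` that keeps only the keys the example above marks as required might look like this. The path and node URL are the placeholders from that example, not values you should expect to work as-is:

.. code:: yaml

   # assumption: job name "test", stored in ~/.fscrawler/test/_settings.yaml
   name: "test"
   fs:
     # placeholder path, replace with the directory to crawl
     url: "/path/to/docs"
   elasticsearch:
     nodes:
     # placeholder node URL
     - url: "http://127.0.0.1:9200"

Every other setting is left out here on the assumption that its default applies.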
23 changes: 17 additions & 6 deletions docs/source/admin/fs/local-fs.rst
@@ -46,13 +46,13 @@ Here is a list of Local FS settings (under ``fs.`` prefix):
+----------------------------+-----------------------+---------------------------------+
| ``fs.continue_on_error`` | ``false`` | :ref:`continue_on_error` |
+----------------------------+-----------------------+---------------------------------+
| ``fs.pdf_ocr`` | ``true`` | :ref:`ocr_integration` |
| ``fs.ocr.pdf_strategy`` | ``ocr_and_text`` | :ref:`ocr_integration` |
+----------------------------+-----------------------+---------------------------------+
| ``fs.indexed_chars`` | ``100000.0`` | `Extracted characters`_ |
+----------------------------+-----------------------+---------------------------------+
| ``fs.ignore_above`` | ``null`` | `Ignore above`_ |
+----------------------------+-----------------------+---------------------------------+
| ``fs.checksum`` | ``null`` | `File Checksum`_ |
| ``fs.checksum`` | ``false`` | `File Checksum`_ |
+----------------------------+-----------------------+---------------------------------+
| ``fs.follow_symlinks`` | ``false`` | `Follow Symlinks`_ |
+----------------------------+-----------------------+---------------------------------+
@@ -275,6 +275,8 @@ Note that in that case, FSCrawler won’t be able to detect removed
folders, so any document that has already been indexed in elasticsearch won’t be
removed when you remove or move the folder away.

See ``elasticsearch.index_folder`` below for the name of the index used to store the folder data (if ``fs.index_folders`` is set to ``true``).

.. code:: yaml

name: "test"
@@ -284,7 +286,7 @@ removed when you remove or move the folder away.
Dealing with multiple types and multiple dirs
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

If you have more than one type, create as many crawlers as types:
If you have more than one type, create as many crawlers as types and/or folders:

``~/.fscrawler/test_type1/_settings.yaml``:

@@ -376,7 +378,7 @@ scanning the same dir and by setting ``includes`` parameter:
Using filename as elasticsearch ``_id``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Please note that the document ``_id`` is always generated (hash value)
Please note that the document ``_id`` is generated as a hash value
from the filename to avoid issues with special characters in filename.
You can force the ``_id`` to be the filename by using the
``filename_as_id`` attribute:
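
A minimal sketch of that setting, assuming a job named ``test`` as in the other examples on this page:

.. code:: yaml

   name: "test"
   fs:
     # use the filename itself as the document _id instead of a hash
     filename_as_id: true
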
@@ -714,7 +716,7 @@ Ignore Above

.. versionadded:: 2.5

By default FSCrawler will send to Tika every single file, whatever its size.
By default (if ``index_content`` is set to ``true``) FSCrawler will send every single file to Tika, whatever its size.
But some files on your file system might be way too big to be parsed.

Set ``ignore_above`` to the desired value of the limit.
@@ -723,7 +725,7 @@ Set ``ignore_above`` to the desired value of the limit.

name: "test"
fs:
ignore_above: "5mb"
ignore_above: "512mb"

File checksum
^^^^^^^^^^^^^
@@ -732,10 +734,19 @@ If you want FSCrawler to generate a checksum for each file, set
``checksum`` to the algorithm you wish to use to compute the checksum,
such as ``MD5`` or ``SHA-1``.

.. note::

You MUST set ``index_content`` to ``true`` for this feature to work. You MAY nevertheless set ``indexed_chars`` to ``0`` if you do not need any content in the index.

You also MUST NOT enable ``json_support`` or ``xml_support`` if you want this feature to work.

.. code:: yaml

name: "test"
fs:
# required
index_content: true
#indexed_chars: 0
checksum: "MD5"
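
As a variation, the sketch below assumes you want a ``SHA-1`` checksum but no extracted content in the index. It only combines the constraints from the note above and has not been checked against a running instance:

.. code:: yaml

   name: "test"
   fs:
     # required for the checksum to be computed
     index_content: true
     # assumption: no extracted text should be stored in the index
     indexed_chars: 0
     # both must stay disabled for the checksum feature to work
     json_support: false
     xml_support: false
     checksum: "SHA-1"
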

Follow Symlinks