dadoonet · dadoonet · Jan 10, 2022 · Jan 5, 2022 · Jan 10, 2022 · Jan 10, 2022
diff --git a/docs/source/admin/fs/index.rst b/docs/source/admin/fs/index.rst
@@ -1,59 +1,106 @@
-Job file specification
-======================
+Example job file specification
+==============================
 
-The job file must comply to the following ``yaml`` specifications:
+The job file (``~/.fscrawler/test/_settings.yaml``) for the job name ``test`` must comply with the following ``yaml`` specification:
 
 .. code:: yaml
 
-   name: "job_name"
+   # required
+   name: "test"
+
+   # required
    fs:
+
+     # define a "local" file path crawler; if running inside a Docker container, this must be the path INSIDE the container
      url: "/path/to/docs"
+     follow_symlinks: false
+     remove_deleted: true
+     continue_on_error: false
+
+     # scan every 5 minutes for changes in url defined above
      update_rate: "5m"
+
+     # optional: define includes and excludes; "~" files are excluded by default if not defined below
      includes:
      - "*.doc"
      - "*.xls"
      excludes:
      - "resume.doc"
+
+     # optional: do not send big files to TIKA
+     ignore_above: "512mb"
+
+     # special handling of JSON files, should only be used if ALL files are JSON
      json_support: false
+     add_as_inner_object: false
+
+     # special handling of XML files, should only be used if ALL files are XML
+     xml_support: false
+
+     # use the filename as the document _id; if set to false, an MD5 hash of the filename is used instead
      filename_as_id: true
+
+     # include the size of the file in the index
      add_filesize: true
-     remove_deleted: true
-     add_as_inner_object: false
-     store_source: true
+
+     # include user/group of the file only if needed
+     attributes_support: false
+
+     # do you REALLY want to store a copy of every file in the index? Then set this to true
+     store_source: false
+
+     # you may want to store (partial) content of the file (see indexed_chars)
      index_content: true
+
+     # how much of the file's content should be indexed (and stored inside the index); set to 0 if you need a checksum but no indexed content at all
+     #indexed_chars: "0"
      indexed_chars: "10000.0"
-     attributes_support: false
-     raw_metadata: true
-     xml_support: false
+
+     # file metadata is usually stored in separate fields; set this to true if you want to keep the original set
+     raw_metadata: false
+
+     # optional: add checksum meta (requires index_content to be set to true)
+     checksum: "MD5"
+
+     # recommended, but will create another index
      index_folders: true
+
      lang_detect: false
-     continue_on_error: false
-     pdf_ocr: true
-     ocr:
-       language: "eng"
-       path: "/path/to/tesseract/if/not/available/in/PATH"
-       data_path: "/path/to/tesseract/tessdata/if/needed"
+
+     ocr:
+       pdf_strategy: "no"
+     #  language: "eng"
+     #  path: "/path/to/tesseract/if/not/available/in/PATH"
+     #  data_path: "/path/to/tesseract/tessdata/if/needed"
+
+   # optional: only required if you want to SSH to another server to index documents from there
    server:
      hostname: "localhost"
      port: 22
      username: "dadoonet"
      password: "password"
      protocol: "SSH"
      pem_path: "/path/to/pemfile"
+
+   # required
    elasticsearch:
      nodes:
      # With Cloud ID
      - cloud_id: "CLOUD_ID"
      # With URL
      - url: "http://127.0.0.1:9200"
-     index: "docs"
      bulk_size: 1000
      flush_interval: "5s"
      byte_size: "10mb"
      username: "elastic"
      password: "password"
+     # optional, defaults to "docs"
+     index: "test_docs"
+     # optional, defaults to "test_folders", used when fs.index_folders is set to true
+     index_folder: "test_fold"
    rest:
-     url: "https://127.0.0.1:8080/fscrawler"
+     # only used if FSCrawler is started with the --rest option
+     url: "http://127.0.0.1:8080/fscrawler"
 
 Here is a list of existing top level settings:
 
@@ -73,5 +120,5 @@ Here is a list of existing top level settings:
 
 .. versionadded:: 2.7
 
-You can define your job settings either in ``yaml`` (using ``.yaml`` extension) or
-in ``json`` (using ``.json`` extension).
+You can define your job settings either in ``_settings.yaml`` (using ``.yaml`` extension) or
+in ``_settings.json`` (using ``.json`` extension).
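+
+As a minimal sketch, assuming the same keys are accepted in both formats, the ``test`` job
+could be written as ``~/.fscrawler/test/_settings.json`` like this (only a subset of the
+settings from the example above is shown):
+
+.. code:: json
+
+   {
+     "name": "test",
+     "fs": {
+       "url": "/path/to/docs",
+       "update_rate": "5m",
+       "index_content": true
+     },
+     "elasticsearch": {
+       "nodes": [
+         { "url": "http://127.0.0.1:9200" }
+       ]
+     }
+   }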
diff --git a/docs/source/admin/fs/local-fs.rst b/docs/source/admin/fs/local-fs.rst
@@ -46,13 +46,13 @@ Here is a list of Local FS settings (under ``fs.`` prefix)`:
 +----------------------------+-----------------------+---------------------------------+
 | ``fs.continue_on_error``   | ``false``             | :ref:`continue_on_error`        |
 +----------------------------+-----------------------+---------------------------------+
-| ``fs.pdf_ocr``             | ``true``              | :ref:`ocr_integration`          |
+| ``fs.ocr.pdf_strategy``    | ``ocr_and_text``      | :ref:`ocr_integration`          |
 +----------------------------+-----------------------+---------------------------------+
 | ``fs.indexed_chars``       | ``100000.0``          | `Extracted characters`_         |
 +----------------------------+-----------------------+---------------------------------+
 | ``fs.ignore_above``        | ``null``              | `Ignore above`_                 |
 +----------------------------+-----------------------+---------------------------------+
-| ``fs.checksum``            | ``null``              | `File Checksum`_                |
+| ``fs.checksum``            | ``false``             | `File Checksum`_                |
 +----------------------------+-----------------------+---------------------------------+
 | ``fs.follow_symlinks``     | ``false``             | `Follow Symlinks`_              |
 +----------------------------+-----------------------+---------------------------------+
@@ -275,6 +275,8 @@ Note that in that case, FSCrawler won’t be able to detect removed
 folders so any document has been indexed in elasticsearch, it won’t be
 removed when you remove or move the folder away.
 
+See ``elasticsearch.index_folder`` below for the name of the index used to store the folder data (if ``fs.index_folders`` is set to ``true``).
+
 .. code:: yaml
 
    name: "test"
@@ -284,7 +286,7 @@ removed when you remove or move the folder away.
 Dealing with multiple types and multiple dirs
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
-If you have more than one type, create as many crawlers as types:
+If you have more than one type, create as many crawlers as types and/or folders:
 
 ``~/.fscrawler/test_type1/_settings.yaml``:
 
@@ -376,7 +378,7 @@ scanning the same dir and by setting ``includes`` parameter:
 Using filename as elasticsearch ``_id``
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
-Please note that the document ``_id`` is always generated (hash value)
+Please note that the document ``_id`` is generated as a hash value
 from the filename to avoid issues with special characters in filename.
 You can force to use the ``_id`` to be the filename using
 ``filename_as_id`` attribute:
@@ -714,7 +716,7 @@ Ignore Above
 
 .. versionadded:: 2.5
 
-By default FSCrawler will send to Tika every single file, whatever its size.
+By default (if ``index_content`` is set to ``true``) FSCrawler will send every single file to Tika, whatever its size.
 But some files on your file system might be a way too big to be parsed.
 
 Set ``ignore_above`` to the desired value of the limit.
@@ -723,7 +725,7 @@ Set ``ignore_above`` to the desired value of the limit.
 
    name: "test"
    fs:
-     ignore_above: "5mb"
+     ignore_above: "512mb"
 
 File checksum
 ^^^^^^^^^^^^^
@@ -732,10 +734,19 @@ If you want FSCrawler to generate a checksum for each file, set
 ``checksum`` to the algorithm you wish to use to compute the checksum,
 such as ``MD5`` or ``SHA-1``.
 
+.. note::
+
+    You MUST set ``index_content`` to ``true`` for this feature to work. Nevertheless, you MAY set ``indexed_chars`` to 0 if you do not need any content in the index.
+
+    You MUST NOT enable ``json_support`` or ``xml_support``, otherwise this feature will not work.
+
 .. code:: yaml
 
    name: "test"
    fs:
+     # required
+     index_content: true
+     #indexed_chars: 0
      checksum: "MD5"
 
 Follow Symlinks