diff --git a/docs/source/admin/fs/index.rst b/docs/source/admin/fs/index.rst
index 65fd8ad18..0d2223093 100644
--- a/docs/source/admin/fs/index.rst
+++ b/docs/source/admin/fs/index.rst
@@ -1,38 +1,79 @@
-Job file specification
-======================
+Example job file specification
+==============================
 
-The job file must comply to the following ``yaml`` specifications:
+The job file (``~/.fscrawler/test/_settings.yaml``) for the job name ``test`` must comply with the following ``yaml`` specifications:
 
 .. code:: yaml
 
-   name: "job_name"
+   # required
+   name: "test"
+
+   # required
    fs:
+
+     # define a "local" file path crawler; if running inside a docker container, this must be the path INSIDE the container
      url: "/path/to/docs"
+     follow_symlinks: false
+     remove_deleted: true
+     continue_on_error: false
+
+     # scan every 5 minutes for changes in the url defined above
      update_rate: "5m"
+
+     # optional: define includes and excludes; "~" files are excluded by default if not defined below
      includes:
      - "*.doc"
      - "*.xls"
      excludes:
      - "resume.doc"
+
+     # optional: do not send big files to Tika
+     ignore_above: "512mb"
+
+     # special handling of JSON files, should only be used if ALL files are JSON
      json_support: false
+     add_as_inner_object: false
+
+     # special handling of XML files, should only be used if ALL files are XML
+     xml_support: false
+
+     # use the filename as _id; if set to false, an MD5 hash of the filename is used instead
      filename_as_id: true
+
+     # include the size of the file in the index
      add_filesize: true
-     remove_deleted: true
-     add_as_inner_object: false
-     store_source: true
+
+     # include user/group of the file only if needed
+     attributes_support: false
+
+     # do you REALLY want to store every file as a copy in the index? Then set this to true
+     store_source: false
+
+     # you may want to store (partial) content of the file (see indexed_chars)
      index_content: true
+
+     # how much data from the content of the file should be indexed (and stored inside the index); set to 0 if you need the checksum but no content indexed at all
+     #indexed_chars: "0"
      indexed_chars: "10000.0"
-     attributes_support: false
-     raw_metadata: true
-     xml_support: false
+
+     # file metadata is usually stored in separate fields; set this to true if you want to keep the original set
+     raw_metadata: false
+
+     # optional: add checksum meta (requires index_content to be set to true)
+     checksum: "MD5"
+
+     # recommended, but will create another index
      index_folders: true
+
      lang_detect: false
-     continue_on_error: false
-     pdf_ocr: true
-     ocr:
-       language: "eng"
-       path: "/path/to/tesseract/if/not/available/in/PATH"
-       data_path: "/path/to/tesseract/tessdata/if/needed"
+
+     ocr.pdf_strategy: noocr
+     #ocr:
+     #  language: "eng"
+     #  path: "/path/to/tesseract/if/not/available/in/PATH"
+     #  data_path: "/path/to/tesseract/tessdata/if/needed"
+
+   # optional: only required if you want to SSH to another server to index documents from there
    server:
      hostname: "localhost"
      port: 22
@@ -40,20 +81,26 @@ The job file must comply to the following ``yaml`` specifications:
      password: "password"
      protocol: "SSH"
      pem_path: "/path/to/pemfile"
+
+   # required
    elasticsearch:
      nodes:
      # With Cloud ID
      - cloud_id: "CLOUD_ID"
      # With URL
      - url: "http://127.0.0.1:9200"
-     index: "docs"
      bulk_size: 1000
      flush_interval: "5s"
      byte_size: "10mb"
      username: "elastic"
      password: "password"
+     # optional, defaults to "docs"
+     index: "test_docs"
+     # optional, defaults to "test_folders", used when fs.index_folders is set to true
+     index_folder: "test_fold"
    rest:
-     url: "https://127.0.0.1:8080/fscrawler"
+     # only started with the --rest option
+     url: "http://127.0.0.1:8080/fscrawler"
 
 Here is a list of existing top level settings:
 
@@ -73,5 +120,5 @@ Here is a list of existing top level settings:
 
 .. versionadded:: 2.7
 
-You can define your job settings either in ``yaml`` (using ``.yaml`` extension) or
-in ``json`` (using ``.json`` extension).
+You can define your job settings either in ``_settings.yaml`` (using ``.yaml`` extension) or
+in ``_settings.json`` (using ``.json`` extension).
diff --git a/docs/source/admin/fs/local-fs.rst b/docs/source/admin/fs/local-fs.rst
index 787c827fa..49b0701bc 100644
--- a/docs/source/admin/fs/local-fs.rst
+++ b/docs/source/admin/fs/local-fs.rst
@@ -46,13 +46,13 @@ Here is a list of Local FS settings (under ``fs.`` prefix)`:
 +----------------------------+-----------------------+---------------------------------+
 | ``fs.continue_on_error``   | ``false``             | :ref:`continue_on_error`        |
 +----------------------------+-----------------------+---------------------------------+
-| ``fs.pdf_ocr``             | ``true``              | :ref:`ocr_integration`          |
+| ``fs.ocr.pdf_strategy``    | ``ocr_and_text``      | :ref:`ocr_integration`          |
 +----------------------------+-----------------------+---------------------------------+
 | ``fs.indexed_chars``       | ``100000.0``          | `Extracted characters`_         |
 +----------------------------+-----------------------+---------------------------------+
 | ``fs.ignore_above``        | ``null``              | `Ignore above`_                 |
 +----------------------------+-----------------------+---------------------------------+
-| ``fs.checksum``            | ``null``              | `File Checksum`_                |
+| ``fs.checksum``            | ``false``             | `File Checksum`_                |
 +----------------------------+-----------------------+---------------------------------+
 | ``fs.follow_symlinks``     | ``false``             | `Follow Symlinks`_              |
 +----------------------------+-----------------------+---------------------------------+
@@ -275,6 +275,8 @@ Note that in that case, FSCrawler won’t be able to detect removed folders so
 any document has been indexed in elasticsearch, it won’t be
 removed when you remove or move the folder away.
 
+See ``elasticsearch.index_folder`` below for the name of the index to be used to store the folder data (if ``fs.index_folders`` is set to ``true``).
+
 .. code:: yaml
 
    name: "test"
@@ -284,7 +286,7 @@ removed when you remove or move the folder away.
 Dealing with multiple types and multiple dirs
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
-If you have more than one type, create as many crawlers as types:
+If you have more than one type, create as many crawlers as types and/or folders:
 
 ``~/.fscrawler/test_type1/_settings.yaml``:
 
@@ -376,7 +378,7 @@ scanning the same dir and by setting ``includes`` parameter:
 Using filename as elasticsearch ``_id``
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
-Please note that the document ``_id`` is always generated (hash value)
+Please note that the document ``_id`` is generated as a hash value
 from the filename to avoid issues with special characters in filename.
 You can force to use the ``_id`` to be the filename using
 ``filename_as_id`` attribute:
@@ -714,7 +716,7 @@ Ignore Above
 
 .. versionadded:: 2.5
 
-By default FSCrawler will send to Tika every single file, whatever its size.
+By default (if ``index_content`` is set to ``true``) FSCrawler will send every single file to Tika, whatever its size.
 But some files on your file system might be a way too big to be parsed.
 
 Set ``ignore_above`` to the desired value of the limit.
@@ -723,7 +725,7 @@ Set ``ignore_above`` to the desired value of the limit.
 
    name: "test"
    fs:
-     ignore_above: "5mb"
+     ignore_above: "512mb"
 
 File checksum
 ^^^^^^^^^^^^^
@@ -732,10 +734,19 @@ If you want FSCrawler to generate a checksum for each file, set
 ``checksum`` to the algorithm you wish to use to compute the checksum,
 such as ``MD5`` or ``SHA-1``.
 
+.. note::
+
+    You MUST set ``index_content`` to ``true`` for this feature to work. You MAY still set ``indexed_chars`` to 0 if you do not need any content in the index.
+
+    You also MUST NOT set ``json_support`` or ``xml_support`` for this feature to work.
+
 .. code:: yaml
 
    name: "test"
    fs:
+     # required
+     index_content: true
+     #indexed_chars: 0
      checksum: "MD5"
 
 Follow Symlinks