Merge remote-tracking branch 'upstream/dev' into feature-consume-eml

author Trenton Holmes <797416+stumpylog@users.noreply.github.com>

Sun, 4 Dec 2022 21:55:46 +0000 (13:55 -0800)

committer Trenton Holmes <797416+stumpylog@users.noreply.github.com>

Sun, 4 Dec 2022 21:55:46 +0000 (13:55 -0800)
diff --cc .github/workflows/ci.yml
Simple merge
diff --cc Pipfile

index 4b32ad01e92c969d9bd8f8e5e8bc411c6ec163cf,dad9a47603130cef1264ec7205f887887c387b36..e7702898a1b8cb64376f089abeb33d722d9579d8
--- 1/Pipfile
--- 2/Pipfile
+++ b/Pipfile
@@@ -60,7 -60,6 +60,9 @@@ setproctitle = "*
   nltk = "*"
   pdf2image = "*"
   flower = "*"
+ +bleach = "*"
++# https://www.piwheels.org/project/cryptography/ last built version
++cryptography = "==38.0.1"
   
   [dev-packages]
   coveralls = "*"
@@@ -79,4 -76,4 +79,5 @@@ black = "*
   pre-commit = "*"
   sphinx-autobuild = "*"
   myst-parser = "*"
+ +imagehash = "*"
+ mkdocs-material = "*"
diff --cc Pipfile.lock

index 74f3a6a8603378485e0f37b820b27bb06cd46668,d00e7029f04b3128caede8837e6b5d08a5cd34b9..8446689132ab3ac8f7acd5b95214e4420d7ebfa9
--- 1/Pipfile.lock
--- 2/Pipfile.lock
+++ b/Pipfile.lock
@@@ -1,7 -1,7 +1,7 @@@
   {
       "_meta": {
           "hash": {
-             "sha256": "548803b8c176073960d6fb5858949d1bb263b36f8811b2963d03a1a29ad65dd0"
- -            "sha256": "0242e3e296e09b30fb69e0d7a2f2e8feb4c6a23d3c7ec99500f2883a032a8c84"
++            "sha256": "cbfe9920231de6e7f993962efb3cc371abdb6b08975232d4cf64d1bad1b53d7a"
           },
           "pipfile-spec": 6,
           "requires": {},
@@@ -2089,6 -2075,9 +2090,7 @@@
               "version": "==0.4.6"
           },
           "coverage": {
- -            "extras": [
- -                "toml"
- -            ],
++            "extras": [],
               "hashes": [
                   "sha256:027018943386e7b942fa832372ebc120155fd970837489896099f5cfa2890f79",
                   "sha256:11b990d520ea75e7ee8dcab5bc908072aaada194a794db9f6d7d5cfd19661e5a",
diff --cc docs/configuration.md

index 0000000000000000000000000000000000000000,ec4cf7765db4e8317fadda7c2faf74fd12b490ba..93eaead36f044f962a75901da8a8917a36352fb8

mode 000000,100644..100644
--- /dev/null
--- 2/docs/configuration.md
+++ b/docs/configuration.md
@@@ -1,0 -1,1031 +1,1036 @@@
- -"Office" documents (such as ".doc", ".xlsx" and ".odt"). If you
- -wish to use this, you must provide a Tika server and a Gotenberg server,
+ # Configuration
+ 
+ Paperless provides a wide range of customizations. Depending on how you
+ run paperless, these settings have to be defined in different places.
+ 
+ - If you run paperless on docker, `paperless.conf` is not used.
+   Rather, configure paperless by copying necessary options to
+   `docker-compose.env`.
+ 
+ - If you are running paperless on anything else, paperless will search
+   for the configuration file in these locations and use the first one
+   it finds:
+ 
+   ```
+   /path/to/paperless/paperless.conf
+   /etc/paperless.conf
+   /usr/local/etc/paperless.conf
+   ```
+ 
+ ## Required services
+ 
+ `PAPERLESS_REDIS=<url>`
+ 
+ : This is required for processing scheduled tasks such as email
+ fetching, index optimization and for training the automatic document
+ matcher.
+ 
+     -   If your Redis server needs login credentials PAPERLESS_REDIS =
+         `redis://<username>:<password>@<host>:<port>`
+     -   With the requirepass option PAPERLESS_REDIS =
+         `redis://:<password>@<host>:<port>`
+ 
+     [More information on securing your Redis
+     Instance](https://redis.io/docs/getting-started/#securing-redis).
+ 
+     Defaults to <redis://localhost:6379>.
+ 
+ `PAPERLESS_DBENGINE=<engine_name>`
+ 
+ : Optional. Selects PostgreSQL or MariaDB as the database engine.
+ Available options are `postgresql` and `mariadb`.
+ 
+     Default is `postgresql`.
+ 
+     !!! warning
+ 
+         Using MariaDB comes with some caveats. See [MySQL Caveats](advanced_usage#mysql-caveats).
+ 
+ `PAPERLESS_DBHOST=<hostname>`
+ 
+ : By default, sqlite is used as the database backend. This can be
+ changed here.
+ 
+     Set PAPERLESS_DBHOST and another database will be used instead of
+     sqlite.
+ 
+ `PAPERLESS_DBPORT=<port>`
+ 
+ : Adjust port if necessary.
+ 
+     Default is 5432.
+ 
+ `PAPERLESS_DBNAME=<name>`
+ 
+ : Database name in PostgreSQL or MariaDB.
+ 
+     Defaults to "paperless".
+ 
+ `PAPERLESS_DBUSER=<name>`
+ 
+ : Database user in PostgreSQL or MariaDB.
+ 
+     Defaults to "paperless".
+ 
+ `PAPERLESS_DBPASS=<password>`
+ 
+ : Database password for PostgreSQL or MariaDB.
+ 
+     Defaults to "paperless".
+ 
+ `PAPERLESS_DBSSLMODE=<mode>`
+ 
+ : SSL mode to use when connecting to PostgreSQL.
+ 
+     See [the official documentation about
+     sslmode](https://www.postgresql.org/docs/current/libpq-ssl.html).
+ 
+     Default is `prefer`.
+ 
+ `PAPERLESS_DB_TIMEOUT=<float>`
+ 
+ : Amount of time for a database connection to wait for the database to
+ unlock. Mostly applicable for an sqlite based installation, consider
+ changing to postgresql if you need to increase this.
+ 
+     Defaults to unset, keeping the Django defaults.
+ 
+ ## Paths and folders
+ 
+ `PAPERLESS_CONSUMPTION_DIR=<path>`
+ 
+ : This is where your documents should go to be consumed. Make sure that
+ it exists and that the user running the paperless service can
+ read/write its contents before you start Paperless.
+ 
+     Don't change this when using docker, as it only changes the path
+     within the container. Change the local consumption directory in the
+     docker-compose.yml file instead.
+ 
+     Defaults to "../consume/", relative to the "src" directory.
+ 
+ `PAPERLESS_DATA_DIR=<path>`
+ 
+ : This is where paperless stores all its data (search index, SQLite
+ database, classification model, etc).
+ 
+     Defaults to "../data/", relative to the "src" directory.
+ 
+ `PAPERLESS_TRASH_DIR=<path>`
+ 
+ : Instead of removing deleted documents, they are moved to this
+ directory.
+ 
+     This must be writeable by the user running paperless. When running
+     inside docker, ensure that this path is within a permanent volume
+     (such as "../media/trash") so it won't get lost on upgrades.
+ 
+     Defaults to empty (i.e. really delete documents).
+ 
+ `PAPERLESS_MEDIA_ROOT=<path>`
+ 
+ : This is where your documents and thumbnails are stored.
+ 
+     You can set this and PAPERLESS_DATA_DIR to the same folder to have
+     paperless store all its data within the same volume.
+ 
+     Defaults to "../media/", relative to the "src" directory.
+ 
+ `PAPERLESS_STATICDIR=<path>`
+ 
+ : Override the default STATIC_ROOT here. This is where all static
+ files created using the "collectstatic" management command are stored.
+ 
+     Unless you're doing something fancy, there is no need to override
+     this.
+ 
+     Defaults to "../static/", relative to the "src" directory.
+ 
+ `PAPERLESS_FILENAME_FORMAT=<format>`
+ 
+ : Changes the filenames paperless uses to store documents in the media
+ directory. See [File name handling](advanced_usage#file_name_handling) for details.
+ 
+     Default is none, which disables this feature.
+ 
+ `PAPERLESS_FILENAME_FORMAT_REMOVE_NONE=<bool>`
+ 
+ : Tells paperless to omit placeholders in
+ `PAPERLESS_FILENAME_FORMAT` that would resolve to 'none' from the
+ resulting filename. This also holds true for directory names. See
+ [File name handling](advanced_usage#file_name_handling) for details.
+ 
+     Defaults to `false`, which disables this feature.
+ 
+ `PAPERLESS_LOGGING_DIR=<path>`
+ 
+ : This is where paperless will store log files.
+ 
+     Defaults to "`PAPERLESS_DATA_DIR`/log/".
+ 
+ ## Logging
+ 
+ `PAPERLESS_LOGROTATE_MAX_SIZE=<num>`
+ 
+ : Maximum file size for log files before they are rotated, in bytes.
+ 
+     Defaults to 1 MiB.
+ 
+ `PAPERLESS_LOGROTATE_MAX_BACKUPS=<num>`
+ 
+ : Number of rotated log files to keep.
+ 
+     Defaults to 20.
+ 
+ ## Hosting & Security {#hosting-and-security}
+ 
+ `PAPERLESS_SECRET_KEY=<key>`
+ 
+ : Paperless uses this to make session tokens. If you expose paperless
+ on the internet, you need to change this, since the default secret
+ is well known.
+ 
+     Use any sequence of characters. The more, the better. You don't
+     need to remember this. Just face-roll your keyboard.
+ 
+     Default is listed in the file `src/paperless/settings.py`.
+ 
+ `PAPERLESS_URL=<url>`
+ 
+ : This setting can be used to set the three options below
+ (ALLOWED_HOSTS, CORS_ALLOWED_HOSTS and CSRF_TRUSTED_ORIGINS). If the
+ other options are set the values will be combined with this one. Do
+ not include a trailing slash. E.g. <https://paperless.domain.com>
+ 
+     Defaults to empty string, leaving the other settings unaffected.
+ 
+ `PAPERLESS_CSRF_TRUSTED_ORIGINS=<comma-separated-list>`
+ 
+ : A list of trusted origins for unsafe requests (e.g. POST). As of
+ Django 4.0 this is required to access the Django admin via the web.
+ See
+ <https://docs.djangoproject.com/en/4.0/ref/settings/#csrf-trusted-origins>
+ 
+     Can also be set using PAPERLESS_URL (see above).
+ 
+     Defaults to empty string, which does not add any origins to the
+     trusted list.
+ 
+ `PAPERLESS_ALLOWED_HOSTS=<comma-separated-list>`
+ 
+ : If you're planning on putting Paperless on the open internet, then
+ you really should set this value to the domain name you're using.
+ Failing to do so leaves you open to HTTP host header attacks:
+ <https://docs.djangoproject.com/en/3.1/topics/security/#host-header-validation>
+ 
+     Just remember that this is a comma-separated list, so
+     "example.com" is fine, as is "example.com,www.example.com", but
+     NOT " example.com" or "example.com,"
+ 
+     Can also be set using PAPERLESS_URL (see above).
+ 
+     If manually set, please remember to include "localhost". Otherwise
+     docker healthcheck will fail.
+ 
+     Defaults to "\*", which is all hosts.
+ 
+ `PAPERLESS_CORS_ALLOWED_HOSTS=<comma-separated-list>`
+ 
+ : You need to add your servers to the list of allowed hosts that can
+ do CORS calls. Set this to your public domain name.
+ 
+     Can also be set using PAPERLESS_URL (see above).
+ 
+     Defaults to "<http://localhost:8000>".
+ 
+ `PAPERLESS_FORCE_SCRIPT_NAME=<path>`
+ 
+ : To host paperless under a subpath url like example.com/paperless you
+ set this value to /paperless. No trailing slash!
+ 
+     Defaults to none, which hosts paperless at "/".
+ 
+ `PAPERLESS_STATIC_URL=<path>`
+ 
+ : Override the STATIC_URL here. Unless you're hosting Paperless under a
+ subpath like /paperless/, you probably don't need to change this.
+ If you do change it, be sure to include the trailing slash.
+ 
+     Defaults to "/static/".
+ 
+     !!! note
+ 
+         When hosting paperless behind a reverse proxy like Traefik or Nginx
+         at a subpath e.g. example.com/paperlessngx you will also need to set
+         `PAPERLESS_FORCE_SCRIPT_NAME` (see above).
+ 
+ `PAPERLESS_AUTO_LOGIN_USERNAME=<username>`
+ 
+ : Specify a username here so that paperless will automatically perform
+ login with the selected user.
+ 
+     !!! danger
+ 
+         Do not use this when exposing paperless on the internet. There are
+         no checks in place that would prevent you from doing this.
+ 
+     Defaults to none, which disables this feature.
+ 
+ `PAPERLESS_ADMIN_USER=<username>`
+ 
+ : If this environment variable is specified, Paperless automatically
+ creates a superuser with the provided username at start. This is
+ useful in cases where you cannot run the `createsuperuser` command
+ separately, such as Kubernetes or AWS ECS.
+ 
+     Requires `PAPERLESS_ADMIN_PASSWORD` to be set.
+ 
+     !!! note
+ 
+         This will not change an existing \[super\]user's password, nor will
+         it recreate a user that already exists. You can leave this
+         throughout the lifecycle of the containers.
+ 
+ `PAPERLESS_ADMIN_MAIL=<email>`
+ 
+ : (Optional) Specify superuser email address. Only used when
+ `PAPERLESS_ADMIN_USER` is set.
+ 
+     Defaults to `root@localhost`.
+ 
+ `PAPERLESS_ADMIN_PASSWORD=<password>`
+ 
+ : Only used when `PAPERLESS_ADMIN_USER` is set. This will
+ be the password of the automatically created superuser.
+ 
+ `PAPERLESS_COOKIE_PREFIX=<str>`
+ 
+ : Specify a prefix that is added to the cookies used by paperless to
+ identify the currently logged in user. This is useful for when
+ you're running two instances of paperless on the same host.
+ 
+     After changing this, you will have to login again.
+ 
+     Defaults to `""`, which does not alter the cookie names.
+ 
+ `PAPERLESS_ENABLE_HTTP_REMOTE_USER=<bool>`
+ 
+ : Allows authentication via HTTP_REMOTE_USER which is used by some SSO
+ applications.
+ 
+     !!! warning
+ 
+         This will allow authentication by simply adding a
+         `Remote-User: <username>` header to a request. Use with care! You
+         especially *must* ensure that any such header is not passed from
+         your proxy server to paperless.
+ 
+         If you're exposing paperless to the internet directly, do not use
+         this.
+ 
+         Also see the warning [in the official
+         documentation](https://docs.djangoproject.com/en/3.1/howto/auth-remote-user/#configuration).
+ 
+     Defaults to `false`, which disables this feature.
+ 
+ `PAPERLESS_HTTP_REMOTE_USER_HEADER_NAME=<str>`
+ 
+ : If `PAPERLESS_ENABLE_HTTP_REMOTE_USER` is enabled, this
+ property lets you customize the name of the HTTP header from which
+ the authenticated username is extracted. Values are in terms of
+ [HttpRequest.META](https://docs.djangoproject.com/en/3.1/ref/request-response/#django.http.HttpRequest.META).
+ Thus, the configured value must start with `HTTP_`
+ followed by the normalized actual header name.
+ 
+     Defaults to `HTTP_REMOTE_USER`.
+ 
+ `PAPERLESS_LOGOUT_REDIRECT_URL=<str>`
+ 
+ : URL to redirect the user to after a logout. This can be used
+ together with `PAPERLESS_ENABLE_HTTP_REMOTE_USER` to
+ redirect the user back to the SSO application's logout page.
+ 
+     Defaults to None, which disables this feature.
+ 
+ ## OCR settings {#ocr}
+ 
+ Paperless uses [OCRmyPDF](https://ocrmypdf.readthedocs.io/en/latest/)
+ for performing OCR on documents and images. Paperless uses sensible
+ defaults for most settings, but all of them can be configured to your
+ needs.
+ 
+ `PAPERLESS_OCR_LANGUAGE=<lang>`
+ 
+ : Customize the language that paperless will attempt to use when
+ parsing documents.
+ 
+     It should be a 3-letter language code consistent with ISO 639:
+     <https://www.loc.gov/standards/iso639-2/php/code_list.php>
+ 
+     Set this to the language most of your documents are written in.
+ 
+     This can be a combination of multiple languages such as `deu+eng`,
+     in which case tesseract will use whatever language matches best.
+     Keep in mind that tesseract uses much more cpu time with multiple
+     languages enabled.
+ 
+     Defaults to "eng".
+ 
+     !!! note
+ 
+         If your language code contains a '-', such as chi-sim, use an
+         underscore instead: chi_sim.
+ 
+ `PAPERLESS_OCR_MODE=<mode>`
+ 
+ : Tell paperless when and how to perform ocr on your documents. Four
+ modes are available:
+ 
+     -   `skip`: Paperless skips all pages that already contain text and
+         performs OCR only on pages where no text is present. This is the
+         safest option.
+ 
+     -   `skip_noarchive`: In addition to skip, paperless won't create
+         an archived version of your documents when it finds any text in
+         them. This is useful if you don't want to have two
+         almost-identical versions of your digital documents in the media
+         folder. This is the fastest option.
+ 
+     -   `redo`: Paperless will OCR all pages of your documents and
+         attempt to replace any existing text layers with new text. This
+         will be useful for documents from scanners that already
+         performed OCR with insufficient results. It will also perform
+         OCR on purely digital documents.
+ 
+         This option may fail on some documents that have features that
+         cannot be removed, such as forms. In this case, the text from
+         the document is used instead.
+ 
+     -   `force`: Paperless rasterizes your documents, converting any
+         text into images and puts the OCRed text on top. This works for
+         all documents, however, the resulting document may be
+         significantly larger and text won't appear as sharp when zoomed
+         in.
+ 
+     The default is `skip`, which only performs OCR when necessary and
+     always creates archived documents.
+ 
+     Read more about this in the [OCRmyPDF
+     documentation](https://ocrmypdf.readthedocs.io/en/latest/advanced.html#when-ocr-is-skipped).
+ 
+ `PAPERLESS_OCR_CLEAN=<mode>`
+ 
+ : Tells paperless to use `unpaper` to clean any input document before
+ sending it to tesseract. This uses more resources, but generally
+ results in better OCR results. The following modes are available:
+ 
+     -   `clean`: Apply unpaper.
+     -   `clean-final`: Apply unpaper, and use the cleaned images to
+         build the output file instead of the original images.
+     -   `none`: Do not apply unpaper.
+ 
+     Defaults to `clean`.
+ 
+     !!! note
+ 
+         `clean-final` is incompatible with ocr mode `redo`. When both
+         `clean-final` and the ocr mode `redo` are configured, `clean` is
+         used instead.
+ 
+ `PAPERLESS_OCR_DESKEW=<bool>`
+ 
+ : Tells paperless to correct skewing (slight rotation of input images
+ mainly due to improper scanning).
+ 
+     Defaults to `true`, which enables this feature.
+ 
+     !!! note
+ 
+         Deskewing is incompatible with ocr mode `redo`. Deskewing will get
+         disabled automatically if `redo` is used as the ocr mode.
+ 
+ `PAPERLESS_OCR_ROTATE_PAGES=<bool>`
+ 
+ : Tells paperless to correct page rotation (90°, 180° and 270°
+ rotation).
+ 
+     If you notice that paperless is not rotating incorrectly rotated
+     pages (or vice versa), try adjusting the threshold up or down (see
+     below).
+ 
+     Defaults to `true`, which enables this feature.
+ 
+ `PAPERLESS_OCR_ROTATE_PAGES_THRESHOLD=<num>`
+ 
+ : Adjust the threshold for automatic page rotation by
+ `PAPERLESS_OCR_ROTATE_PAGES`. This is an arbitrary value reported by
+ tesseract. "15" is a very conservative value, whereas "2" is a
+ very aggressive option and will often result in correctly rotated
+ pages being rotated as well.
+ 
+     Defaults to "12".
+ 
+ `PAPERLESS_OCR_OUTPUT_TYPE=<type>`
+ 
+ : Specify the type of PDF documents that paperless should produce.
+ 
+     -   `pdf`: Modify the PDF document as little as possible.
+     -   `pdfa`: Convert PDF documents into PDF/A-2b documents, which is
+         a subset of the entire PDF specification and meant for storing
+         documents long term.
+     -   `pdfa-1`, `pdfa-2`, `pdfa-3` to specify the exact version of
+         PDF/A you wish to use.
+ 
+     If not specified, `pdfa` is used. Remember that paperless also keeps
+     the original input file as well as the archived version.
+ 
+ `PAPERLESS_OCR_PAGES=<num>`
+ 
+ : Tells paperless to use only the specified number of pages for OCR.
+ Documents with fewer than the specified number of pages get OCR'ed
+ completely.
+ 
+     Specifying 1 here will only use the first page.
+ 
+     When combined with `PAPERLESS_OCR_MODE=redo` or
+     `PAPERLESS_OCR_MODE=force`, paperless will not modify any text it
+     finds on excluded pages and copy it verbatim.
+ 
+     Defaults to 0, which disables this feature and always uses all
+     pages.
+ 
+ `PAPERLESS_OCR_IMAGE_DPI=<num>`
+ 
+ : Paperless will OCR any images you put into the system and convert
+ them into PDF documents. This is useful if your scanner produces
+ images. In order to do so, paperless needs to know the DPI of the
+ image. Most images from scanners will have this information embedded
+ and paperless will detect and use that information. In case this
+ fails, it uses this value as a fallback.
+ 
+     Set this to the DPI your scanner produces images at.
+ 
+     Default is none, which will automatically calculate image DPI so
+     that the produced PDF documents are A4 sized.
+ 
+ `PAPERLESS_OCR_MAX_IMAGE_PIXELS=<num>`
+ 
+ : Paperless will raise a warning when OCRing images which are over
+ this limit and will not OCR images which are more than twice this
+ limit. Note this does not prevent the document from being consumed,
+ but could result in missing text content.
+ 
+     If unset, will default to the value determined by
+     [Pillow](https://pillow.readthedocs.io/en/stable/reference/Image.html#PIL.Image.MAX_IMAGE_PIXELS).
+ 
+     !!! note
+ 
+         Increasing this limit could cause Paperless to consume additional
+         resources when consuming a file. Be sure you have sufficient system
+         resources.
+ 
+     !!! warning
+ 
+         The limit is intended to prevent malicious files from consuming
+         system resources and causing crashes and other errors. Only increase
+         this value if you are certain your documents are not malicious and
+         you need the text which was not OCRed.
+ 
+ `PAPERLESS_OCR_USER_ARGS=<json>`
+ 
+ : OCRmyPDF offers many more options. Use this parameter to specify any
+ additional arguments you wish to pass to OCRmyPDF. Since Paperless
+ uses the API of OCRmyPDF, you have to specify these in a format that
+ can be passed to the API. See [the API reference of
+ OCRmyPDF](https://ocrmypdf.readthedocs.io/en/latest/api.html#reference)
+ for valid parameters. All command line options are supported, but
+ they use underscores instead of dashes.
+ 
+     !!! warning
+ 
+         Paperless has been tested to work with the OCR options provided
+         above. There are many options that are incompatible with each other,
+         so specifying invalid options may prevent paperless from consuming
+         any documents.
+ 
+     Specify arguments as a JSON dictionary. Take note of the lowercase
+     booleans and the double-quoted parameter names and strings. Example:
+ 
+     ``` json
+     {"deskew": true, "optimize": 3, "unpaper_args": "--pre-rotate 90"}
+     ```
+ 
+ ## Tika settings {#tika}
+ 
+ Paperless can make use of [Tika](https://tika.apache.org/) and
+ [Gotenberg](https://gotenberg.dev/) for parsing and converting
- -  gotenberg:
- -    image: gotenberg/gotenberg:7.6
- -    restart: unless-stopped
- -    command:
- -      - 'gotenberg'
- -      - '--chromium-disable-routes=true'
++"Office" documents (such as ".doc", ".xlsx" and ".odt").
++Tika and Gotenberg are also needed to allow parsing of E-Mails (.eml).
++
++If you wish to use this, you must provide a Tika server and a Gotenberg server,
+ configure their endpoints, and enable the feature.
+ 
+ `PAPERLESS_TIKA_ENABLED=<bool>`
+ 
+ : Enable (or disable) the Tika parser.
+ 
+     Defaults to false.
+ 
+ `PAPERLESS_TIKA_ENDPOINT=<url>`
+ 
+ : Set the endpoint URL where Paperless can reach your Tika server.
+ 
+     Defaults to "<http://localhost:9998>".
+ 
+ `PAPERLESS_TIKA_GOTENBERG_ENDPOINT=<url>`
+ 
+ : Set the endpoint URL where Paperless can reach your Gotenberg server.
+ 
+     Defaults to "<http://localhost:3000>".
+ 
+ If you run paperless on docker, you can add those services to the
+ docker-compose file (see the provided `docker-compose.sqlite-tika.yml`
+ file for reference). The required changes are as follows:
+ 
+ ```yaml
+ services:
+   # ...
+ 
+   webserver:
+     # ...
+ 
+     environment:
+       # ...
+ 
+       PAPERLESS_TIKA_ENABLED: 1
+       PAPERLESS_TIKA_GOTENBERG_ENDPOINT: http://gotenberg:3000
+       PAPERLESS_TIKA_ENDPOINT: http://tika:9998
+ 
+   # ...
+ 
++  gotenberg:
++    image: gotenberg/gotenberg:7.6
++    restart: unless-stopped
++    # The gotenberg chromium route is used to convert .eml files. We do not
++    # want to allow external content like tracking pixels or even javascript.
++    command:
++      - "gotenberg"
++      - "--chromium-disable-javascript=true"
++      - "--chromium-allow-list=file:///tmp/.*"
+ 
+   tika:
+     image: ghcr.io/paperless-ngx/tika:latest
+     restart: unless-stopped
+ ```
+ 
+ Add the configuration variables to the environment of the webserver
+ (alternatively put the configuration in the `docker-compose.env` file)
+ and add the additional services below the webserver service. Watch out
+ for indentation.
+ 
+ Make sure to use the correct format `PAPERLESS_TIKA_ENABLED = 1`
+ so python_dotenv can parse the statement correctly.
+ 
+ ## Software tweaks {#software_tweaks}
+ 
+ `PAPERLESS_TASK_WORKERS=<num>`
+ 
+ : Paperless does multiple things in the background: Maintain the
+ search index, maintain the automatic matching algorithm, check
+ emails, consume documents, etc. This variable specifies how many
+ things it will do in parallel.
+ 
+     Defaults to 1.
+ 
+ `PAPERLESS_THREADS_PER_WORKER=<num>`
+ 
+ : Furthermore, paperless uses multiple threads when consuming
+ documents to speed up OCR. This variable specifies how many pages
+ paperless will process in parallel on a single document.
+ 
+     !!! warning
+ 
+         Ensure that the product
+ 
+         `PAPERLESS_TASK_WORKERS * PAPERLESS_THREADS_PER_WORKER`
+ 
+         does not exceed your CPU core count or else paperless will be
+         extremely slow. If you want paperless to process many documents in
+         parallel, choose a high worker count. If you want paperless to
+         process very large documents faster, use a higher thread per worker
+         count.
+ 
+     The default is a balance between the two, according to your CPU core
+     count, with a slight favor towards threads per worker:
+ 
+     | CPU core count | Workers | Threads |
+     |----------------|---------|---------|
+     | 1              | 1       | 1       |
+     | 2              | 2       | 1       |
+     | 4              | 2       | 2       |
+     | 6              | 2       | 3       |
+     | 8              | 2       | 4       |
+     | 12             | 3       | 4       |
+     | 16             | 4       | 4       |
+ 
+     If you only specify PAPERLESS_TASK_WORKERS, paperless will adjust
+     PAPERLESS_THREADS_PER_WORKER automatically.
+ 
+ `PAPERLESS_WORKER_TIMEOUT=<num>`
+ 
+ : Machines with few cores or weak ones might not be able to finish OCR
+ on large documents within the default 1800 seconds. So extending
+ this timeout may prove to be useful on weak hardware setups.
+ 
+ `PAPERLESS_WORKER_RETRY=<num>`
+ 
+ : If PAPERLESS_WORKER_TIMEOUT has been configured, the retry time for
+ a task can also be configured. By default, this value will be set to
+ 10s more than the worker timeout. This value should never be set
+ less than the worker timeout.
+ 
+ `PAPERLESS_TIME_ZONE=<timezone>`
+ 
+ : Set the time zone here. See
+ <https://docs.djangoproject.com/en/3.1/ref/settings/#std:setting-TIME_ZONE>
+ for details on how to set it.
+ 
+     Defaults to UTC.
+ 
+ ## Polling {#polling}
+ 
+ `PAPERLESS_CONSUMER_POLLING=<num>`
+ 
+ : If paperless does not find documents added to your consume folder, it
+ might not be able to automatically detect filesystem changes. In
+ that case, specify a polling interval in seconds here, which will
+ then cause paperless to periodically check your consumption
+ directory for changes. This will also disable listening for file
+ system changes with `inotify`.
+ 
+     Defaults to 0, which disables polling and uses filesystem
+     notifications.
+ 
+ `PAPERLESS_CONSUMER_POLLING_RETRY_COUNT=<num>`
+ 
+ : If consumer polling is enabled, sets the number of times paperless
+ will check for a file to remain unmodified.
+ 
+     Defaults to 5.
+ 
+ `PAPERLESS_CONSUMER_POLLING_DELAY=<num>`
+ 
+ : If consumer polling is enabled, sets the delay in seconds between
+ each check (above) paperless will do while waiting for a file to
+ remain unmodified.
+ 
+     Defaults to 5.
+ 
+ ## iNotify {#inotify}
+ 
+ `PAPERLESS_CONSUMER_INOTIFY_DELAY=<num>`
+ 
+ : Sets the time in seconds the consumer will wait for additional
+ events from inotify before the consumer will consider a file ready
+ and begin consumption. Certain scanners or network setups may
+ generate multiple events for a single file, leading to multiple
+ consumers working on the same file. Configure this to prevent that.
+ 
+     Defaults to 0.5 seconds.
+ 
+ `PAPERLESS_CONSUMER_DELETE_DUPLICATES=<bool>`
+ 
+ : When the consumer detects a duplicate document, it will not touch
+ the original document. This default behavior can be changed here.
+ 
+     Defaults to false.
+ 
+ `PAPERLESS_CONSUMER_RECURSIVE=<bool>`
+ 
+ : Enable recursive watching of the consumption directory. Paperless
+ will then pick up files in subdirectories within your
+ consumption directory as well.
+ 
+     Defaults to false.
+ 
+ `PAPERLESS_CONSUMER_SUBDIRS_AS_TAGS=<bool>`
+ 
+ : Set the names of subdirectories as tags for consumed files. E.g.
+ <CONSUMPTION_DIR>/foo/bar/file.pdf will add the tags "foo" and
+ "bar" to the consumed file. Paperless will create any tags that
+ don't exist yet.
+ 
+     This is useful for sorting documents with certain tags such as `car`
+     or `todo` prior to consumption. These folders won't be deleted.
+ 
+     PAPERLESS_CONSUMER_RECURSIVE must be enabled for this to work.
+ 
+     Defaults to false.
+ 
+ `PAPERLESS_CONSUMER_ENABLE_BARCODES=<bool>`
+ 
+ : Enables the scanning and page separation based on detected barcodes.
+ This allows for scanning and adding multiple documents per uploaded
+ file, which are separated by one or multiple barcode pages.
+ 
+     For ease of use, it is suggested to use a standardized separation
+     page, e.g. [here](https://www.alliancegroup.co.uk/patch-codes.htm).
+ 
+     If no barcodes are detected in the uploaded file, no page separation
+     will happen.
+ 
+     The original document will be removed and the separated pages will
+     be saved as pdf.
+ 
+     Defaults to false.
+ 
+ `PAPERLESS_CONSUMER_BARCODE_TIFF_SUPPORT=<bool>`
+ 
+ : Whether TIFF image files should be scanned for barcodes. This will
+ automatically convert any TIFF image(s) to PDFs for later
+ processing. This only has an effect if
+ PAPERLESS_CONSUMER_ENABLE_BARCODES has been enabled.
+ 
+     Defaults to false.
+ 
+ `PAPERLESS_CONSUMER_BARCODE_STRING=PATCHT`
+ 
+ : Defines the string to be detected as a separator barcode. If
+ paperless is used with the PATCH-T separator pages, users shouldn't
+ change this.
+ 
+     Defaults to "PATCHT".
+ 
+ `PAPERLESS_CONVERT_MEMORY_LIMIT=<num>`
+ 
+ : On smaller systems, or even in the case of Very Large Documents, the
+ consumer may explode, complaining about how it's "unable to extend
+ pixel cache". In such cases, try setting this to a reasonably low
+ value, like 32. The default is to use whatever is necessary to do
+ everything without writing to disk, and units are in megabytes.
+ 
+     For more information on how to use this value, you should search the
+     web for "MAGICK_MEMORY_LIMIT".
+ 
+     Defaults to 0, which disables the limit.
+ 
+ `PAPERLESS_CONVERT_TMPDIR=<path>`
+ 
+ : Similar to the memory limit, if you've got a small system and your
+ OS mounts /tmp as tmpfs, you should set this to a path that's on a
+ physical disk, like /home/your_user/tmp or something. ImageMagick
+ will use this as scratch space when crunching through very large
+ documents.
+ 
+     For more information on how to use this value, you should search the
+     web for "MAGICK_TMPDIR".
+ 
+     Default is none, which disables the temporary directory.
+ 
+ `PAPERLESS_POST_CONSUME_SCRIPT=<filename>`
+ 
+ : After a document is consumed, Paperless can trigger an arbitrary
+ script if you like. This script will be passed a number of arguments
+ for you to work with. For more information, take a look at [Post-consumption script](advanced_usage#post_consume_script).
+ 
+     The default is blank, which means nothing will be executed.
+ 
+ `PAPERLESS_FILENAME_DATE_ORDER=<format>`
+ 
+ : Paperless will check the document text for document date
+ information. Use this setting to enable checking the document
+ filename for date information. The date order can be set to any
+ option as specified in
+ <https://dateparser.readthedocs.io/en/latest/settings.html#date-order>.
+ The filename will be checked first, and if nothing is found, the
+ document text will be checked as normal.
+ 
+     A date in a filename must have some separators (`.`, `-`, `/`,
+     etc.) for it to be parsed.
+ 
+     Defaults to none, which disables this feature.
+ 
+ `PAPERLESS_NUMBER_OF_SUGGESTED_DATES=<num>`
+ 
+ : Paperless searches an entire document for dates. The first date
+ found will be used as the initial value for the created date. When
+ this variable is greater than 0 (or left to its default value),
+ paperless will also suggest other dates found in the document, up to
+ a maximum of this setting. Note that duplicates will be removed,
+ which can result in fewer dates displayed in the frontend than this
+ setting value.
+ 
+     The task to find all dates can be time-consuming and increases with
+     a higher (maximum) number of suggested dates and slower hardware.
+ 
+     Defaults to 3. Set to 0 to disable this feature.
+ 
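The "remove duplicates, then cap at the configured maximum" behavior described above can be sketched as follows. This is only an illustration: the regular expression, the function name `suggested_dates`, and the plain-string handling are assumptions, not Paperless's actual date extraction.

```python
import re

# Hypothetical, deliberately narrow pattern: day, month, 4-digit year
# separated by ".", "/" or "-". Real date detection is far more lenient.
DATE_PATTERN = re.compile(r"\b\d{1,2}[./-]\d{1,2}[./-]\d{4}\b")


def suggested_dates(text, limit=3):
    """Return up to `limit` distinct date strings in order of appearance."""
    if limit <= 0:
        # A setting of 0 disables suggestions entirely.
        return []
    # dict.fromkeys drops duplicates while keeping first-seen order.
    unique = list(dict.fromkeys(DATE_PATTERN.findall(text)))
    return unique[:limit]


sample = "Invoice 01.02.2022, due 15.02.2022, issued 01.02.2022, paid 20.02.2022"
print(suggested_dates(sample, limit=3))
```

Because duplicates are dropped before the cap is applied, a document that repeats one date many times can yield fewer suggestions than the configured maximum.
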
+ `PAPERLESS_THUMBNAIL_FONT_NAME=<filename>`
+ 
+ : Paperless creates thumbnails for plain text files by rendering the
+ content of the file on an image and uses a predefined font for that.
+ This font can be changed here.
+ 
+     Note that this won't have any effect on already generated
+     thumbnails.
+ 
+     Defaults to
+     `/usr/share/fonts/liberation/LiberationSerif-Regular.ttf`.
+ 
+ `PAPERLESS_IGNORE_DATES=<string>`
+ 
+ : Paperless parses a document's creation date from filename and file
+ content. You may specify a comma-separated list of dates that should
+ be ignored during this process. This is useful for special dates
+ (like date of birth) that appear in documents regularly but are very
+ unlikely to be the document's creation date.
+ 
+     The date is parsed using the order specified in `PAPERLESS_DATE_ORDER`.
+ 
+     Defaults to an empty string to not ignore any dates.
+ 
+ `PAPERLESS_DATE_ORDER=<format>`
+ 
+ : Paperless will try to determine the document creation date from its
+ contents. Specify the date format Paperless should expect to see
+ within your documents.
+ 
+     This option defaults to DMY which translates to day first, month
+     second, and year last order. Characters D, M, or Y can be shuffled
+     to meet the required order.
+ 
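To see why the date order matters, the following sketch parses the same ambiguous string under two orders. It uses only the standard library, and the helper `parse_with_order` and its letter-to-directive mapping are hypothetical; it is not how Paperless parses dates internally.

```python
from datetime import datetime

# Hypothetical mapping from a date-order letter to a strptime directive.
_DIRECTIVES = {"D": "%d", "M": "%m", "Y": "%Y"}


def parse_with_order(text, order="DMY", sep="/"):
    """Parse `text` assuming its fields appear in the given D/M/Y order."""
    fmt = sep.join(_DIRECTIVES[letter] for letter in order)
    return datetime.strptime(text, fmt).date()


print(parse_with_order("01/02/2022", "DMY"))  # 2022-02-01
print(parse_with_order("01/02/2022", "MDY"))  # 2022-01-02
```

The same text yields 1 February under `DMY` and 2 January under `MDY`, which is why the setting should match the convention used in your documents.
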
+ `PAPERLESS_CONSUMER_IGNORE_PATTERNS=<json>`
+ 
+ : By default, paperless ignores certain files and folders in the
+ consumption directory, such as system files created by the Mac OS.
+ 
+     This can be adjusted by configuring a custom json array with
+     patterns to exclude.
+ 
+     Defaults to
+     `[".DS_STORE/*", "._*", ".stfolder/*", ".stversions/*", ".localized/*", "desktop.ini"]`.
+ 
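As a rough illustration of how such a JSON array of patterns can be applied, the sketch below loads the default value and tests paths with shell-style globbing. The function `is_ignored` and the choice of `fnmatchcase` are assumptions for this example; the exact matching rules Paperless applies may differ.

```python
import json
from fnmatch import fnmatchcase

# The default value from the documentation, parsed as a JSON array.
DEFAULT_PATTERNS = json.loads(
    '[".DS_STORE/*", "._*", ".stfolder/*", ".stversions/*", '
    '".localized/*", "desktop.ini"]'
)


def is_ignored(relative_path, patterns=DEFAULT_PATTERNS):
    """Return True if the path, or its final component, matches a pattern."""
    name = relative_path.rsplit("/", 1)[-1]
    return any(
        fnmatchcase(relative_path, pattern) or fnmatchcase(name, pattern)
        for pattern in patterns
    )


for candidate in ("._scan.pdf", ".stfolder/marker", "invoice.pdf"):
    print(candidate, is_ignored(candidate))
```

When writing a custom value, remember that it must be valid JSON, so each pattern is a double-quoted string inside square brackets.
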
+ ## Binaries
+ 
+ There are a few external software packages that Paperless expects to
+ find on your system when it starts up. Unless you've done something
+ creative with their installation, you probably won't need to edit any
+ of these. However, if you've installed these programs somewhere where
+ simply typing the name of the program doesn't automatically execute it
+ (i.e. the program isn't in your \$PATH), then you'll need to specify
+ the literal path for that program.
+ 
+ `PAPERLESS_CONVERT_BINARY=<path>`
+ 
+ : Defaults to "convert".
+ 
+ `PAPERLESS_GS_BINARY=<path>`
+ 
+ : Defaults to "gs".
+ 
+ ## Docker-specific options {#docker}
+ 
+ These options don't have any effect in `paperless.conf`. These options
+ adjust the behavior of the docker container. Configure these in
+ `docker-compose.env`.
+ 
+ `PAPERLESS_WEBSERVER_WORKERS=<num>`
+ 
+ : The number of worker processes the webserver should spawn. More
+ worker processes usually result in the front end loading data much
+ quicker. However, each worker process also loads the entire
+ application into memory separately, so increasing this value will
+ increase RAM usage.
+ 
+     Defaults to 1.
+ 
+ `PAPERLESS_BIND_ADDR=<ip address>`
+ 
+ : The IP address the webserver will listen on inside the container.
+ There are special setups where you may need to configure this value
+ to restrict the IP address or interface the webserver listens on.
+ 
+     Defaults to \[::\], meaning all interfaces, including IPv6.
+ 
+ `PAPERLESS_PORT=<port>`
+ 
+ : The port number the webserver will listen on inside the container.
+ There are special setups where you may need this to avoid collisions
+ with other services (like using podman with multiple containers in
+ one pod).
+ 
+     Don't change this when using Docker. To change the port the
+     webserver is reachable outside of the container, instead refer to
+     the "ports" key in `docker-compose.yml`.
+ 
+     Defaults to 8000.
+ 
+ `USERMAP_UID=<uid>`
+ 
+ : The ID of the paperless user in the container. Set this to your
+ actual user ID on the host system, which you can get by executing
+ 
+     ``` shell-session
+     $ id -u
+     ```
+ 
+     Paperless will change ownership on its folders to this user, so you
+     need to get this right in order to be able to write to the
+     consumption directory.
+ 
+     Defaults to 1000.
+ 
+ `USERMAP_GID=<gid>`
+ 
+ : The ID of the paperless Group in the container. Set this to your
+ actual group ID on the host system, which you can get by executing
+ 
+     ``` shell-session
+     $ id -g
+     ```
+ 
+     Paperless will change ownership on its folders to this group, so you
+     need to get this right in order to be able to write to the
+     consumption directory.
+ 
+     Defaults to 1000.
+ 
+ `PAPERLESS_OCR_LANGUAGES=<list>`
+ 
+ : Additional OCR languages to install. By default, paperless comes
+ with English, German, Italian, Spanish and French. If your language
+ is not in this list, install additional languages with this
+ configuration option:
+ 
+     ``` bash
+     PAPERLESS_OCR_LANGUAGES=tur ces
+     ```
+ 
+     To actually use these languages, also set the default OCR language
+     of paperless:
+ 
+     ``` bash
+     PAPERLESS_OCR_LANGUAGE=tur
+     ```
+ 
+     Defaults to none, which does not install any additional languages.
+ 
+ `PAPERLESS_ENABLE_FLOWER=<defined>`
+ 
+ : If this environment variable is defined, the Celery monitoring tool
+ [Flower](https://flower.readthedocs.io/en/latest/index.html) will be
+ started by the container.
+ 
+     You can read more about this in the [advanced documentation](advanced#celery-monitoring).
+ 
+ ## Update Checking {#update-checking}
+ 
+ `PAPERLESS_ENABLE_UPDATE_CHECK=<bool>`
+ 
+ !!! note
+ 
+     This setting was deprecated in favor of a frontend setting after
+     v1.9.2. A one-time migration is performed for users who have this
+     setting set. This setting is always ignored if the corresponding
+     frontend setting has been set.
diff --cc docs/troubleshooting.md

index 0000000000000000000000000000000000000000,53d0e1de3383c2356c79e363f706a329fb2be26a..329de94db729a41fb3879aaf927807a8bfc52e88

mode 000000,100644..100644
--- /dev/null
--- 2/docs/troubleshooting.md
+++ b/docs/troubleshooting.md
@@@ -1,0 -1,334 +1,334 @@@
- -gotenberg:
- -  image: gotenberg/gotenberg:7.6
- -  restart: unless-stopped
+ # Troubleshooting
+ 
+ ## No files are added by the consumer
+ 
+ Check for the following issues:
+ 
+ - Ensure that the directory you're putting your documents in is the
+   folder paperless is watching. With docker, this setting is performed
+   in the `docker-compose.yml` file. Without docker, look at the
+   `CONSUMPTION_DIR` setting. Don't adjust this setting if you're
+   using docker.
+ 
+ - Ensure that redis is up and running. Paperless does its task
+   processing asynchronously, and for documents to arrive at the task
+   processor, it needs redis to run.
+ 
+ - Ensure that the task processor is running. Docker does this
+   automatically. Manually invoke the task processor by executing
+ 
+   ```shell-session
+   $ celery --app paperless worker
+   ```
+ 
+ - Look at the output of paperless and inspect it for any errors.
+ 
+ - Go to the admin interface, and check if there are failed tasks. If
+   so, the tasks will contain an error message.
+ 
+ ## Consumer warns `OCR for XX failed`
+ 
+ If you find the OCR accuracy to be too low, and/or the document consumer
+ warns that
+ `OCR for XX failed, but we're going to stick with what we've got since FORGIVING_OCR is enabled`,
+ then you might need to install the [Tesseract language
+ files](http://packages.ubuntu.com/search?keywords=tesseract-ocr)
+ matching your document's languages.
+ 
+ As an example, if you are running Paperless-ngx from any Ubuntu or
+ Debian box, and your documents are written in Spanish you may need to
+ run:
+ 
+     apt-get install -y tesseract-ocr-spa
+ 
+ ## Consumer fails to pick up any new files
+ 
+ If you notice that the consumer will only pick up files in the
+ consumption directory at startup, but won't find any other files added
+ later, you will need to enable filesystem polling with the configuration
+ option `PAPERLESS_CONSUMER_POLLING`, see
+ [here](/configuration#polling).
+ 
+ This will disable listening to filesystem changes with inotify and
+ paperless will manually check the consumption directory for changes
+ instead.
+ 
+ ## Paperless always redirects to /admin
+ 
+ You probably had the old paperless installed at some point. Paperless
+ installed a permanent redirect to /admin in your browser, and you need
+ to clear your browsing data / cache to fix that.
+ 
+ ## Operation not permitted
+ 
+ You might see errors such as:
+ 
+ ```shell-session
+ chown: changing ownership of '../export': Operation not permitted
+ ```
+ 
+ The container tries to set file ownership on the listed directories.
+ This is required so that the user running paperless inside docker has
+ write permissions to these folders. The error occurs when `chown` is
+ not permitted there, for example when these directories point to NFS
+ shares.
+ 
+ Ensure that `chown` is possible on these directories.
+ 
+ ## Classifier error: No training data available
+ 
+ This indicates that the Auto matching algorithm found no documents to
+ learn from. This may have two reasons:
+ 
+ - You don't use the Auto matching algorithm: The error can be safely
+   ignored in this case.
+ - You are using the Auto matching algorithm: The classifier explicitly
+   excludes documents with Inbox tags. Verify that there are documents
+   in your archive without inbox tags. The algorithm will only learn
+   from documents not in your inbox.
+ 
+ ## UserWarning in sklearn on every single document
+ 
+ You may encounter warnings like this:
+ 
+ ```
+ /usr/local/lib/python3.7/site-packages/sklearn/base.py:315:
+ UserWarning: Trying to unpickle estimator CountVectorizer from version 0.23.2 when using version 0.24.0.
+ This might lead to breaking code or invalid results. Use at your own risk.
+ ```
+ 
+ This happens when certain dependencies of paperless that are responsible
+ for the auto matching algorithm are updated. After updating these, your
+ current training data _might_ not be compatible anymore. This can be
+ ignored in most cases. This warning will disappear automatically when
+ paperless updates the training data.
+ 
+ If you want to get rid of the warning or actually experience issues with
+ automatic matching, delete the file `classification_model.pickle` in the
+ data directory and let paperless recreate it.
+ 
+ ## 504 Server Error: Gateway Timeout when adding Office documents
+ 
+ You may experience these errors when using the optional TIKA
+ integration:
+ 
+ ```
+ requests.exceptions.HTTPError: 504 Server Error: Gateway Timeout for url: http://gotenberg:3000/forms/libreoffice/convert
+ ```
+ 
+ Gotenberg is a server that converts Office documents into PDF documents
+ and has a default timeout of 30 seconds. When conversion takes longer,
+ Gotenberg raises this error.
+ 
+ You can increase the timeout by configuring a command flag for Gotenberg
+ (see also [here](https://gotenberg.dev/docs/modules/api#properties)). If
+ using docker-compose, this is achieved by the following configuration
+ change in the `docker-compose.yml` file:
+ 
+ ```yaml
- -    - 'gotenberg'
- -    - '--chromium-disable-routes=true'
- -    - '--api-timeout=60'
++  # The gotenberg chromium route is used to convert .eml files. We do not
++  # want to allow external content like tracking pixels or even javascript.
+   command:
++    - "gotenberg"
++    - "--chromium-disable-javascript=true"
++    - "--chromium-allow-list=file:///tmp/.*"
++    - "--api-timeout=60"
+ ```
+ 
+ ## Permission denied errors in the consumption directory
+ 
+ You might encounter errors such as:
+ 
+ ```shell-session
+ The following error occured while consuming document.pdf: [Errno 13] Permission denied: '/usr/src/paperless/src/../consume/document.pdf'
+ ```
+ 
+ This happens when paperless does not have permission to delete files
+ inside the consumption directory. Ensure that `USERMAP_UID` and
+ `USERMAP_GID` are set to the user id and group id you use on the host
+ operating system, if these are different from `1000`. See [Docker setup](setup#docker_hub).
+ 
+ Also ensure that you are able to read and write to the consumption
+ directory on the host.
+ 
+ ## OSError: \[Errno 19\] No such device when consuming files
+ 
+ If you experience errors such as:
+ 
+ ```shell-session
+ File "/usr/local/lib/python3.7/site-packages/whoosh/codec/base.py", line 570, in open_compound_file
+ return CompoundStorage(dbfile, use_mmap=storage.supports_mmap)
+ File "/usr/local/lib/python3.7/site-packages/whoosh/filedb/compound.py", line 75, in __init__
+ self._source = mmap.mmap(fileno, 0, access=mmap.ACCESS_READ)
+ OSError: [Errno 19] No such device
+ 
+ During handling of the above exception, another exception occurred:
+ 
+ Traceback (most recent call last):
+ File "/usr/local/lib/python3.7/site-packages/django_q/cluster.py", line 436, in worker
+ res = f(*task["args"], **task["kwargs"])
+ File "/usr/src/paperless/src/documents/tasks.py", line 73, in consume_file
+ override_tag_ids=override_tag_ids)
+ File "/usr/src/paperless/src/documents/consumer.py", line 271, in try_consume_file
+ raise ConsumerError(e)
+ ```
+ 
+ Paperless uses a search index to provide better and faster full text
+ searching. This search index is stored inside the `data` folder. The
+ search index uses memory-mapped files (mmap). The above error indicates
+ that paperless was unable to create and open these files.
+ 
+ This happens when you're trying to store the data directory on certain
+ file systems (mostly network shares) that don't support memory-mapped
+ files.
+ 
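If you are unsure whether a given location supports memory-mapped files, a quick probe along the following lines may help. It is a sketch, not part of Paperless: the function name `supports_mmap` is hypothetical, and it simply mirrors the `mmap.mmap(..., access=mmap.ACCESS_READ)` call seen in the traceback above.

```python
import mmap
import os
import tempfile


def supports_mmap(directory):
    """Try to memory-map a small scratch file created inside `directory`."""
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        # An empty file cannot be mapped, so write a few bytes first.
        os.write(fd, b"paperless mmap probe")
        try:
            with mmap.mmap(fd, 0, access=mmap.ACCESS_READ) as mapped:
                return mapped[:9] == b"paperless"
        except OSError:
            # Matches the "[Errno 19] No such device" failure mode.
            return False
    finally:
        os.close(fd)
        os.unlink(path)


print(supports_mmap(tempfile.gettempdir()))
```

Run it with the path of your data directory as the argument; a result of `False` suggests the search index should live on a different file system.
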
+ ## Web-UI stuck at "Loading\..."
+ 
+ This might have multiple reasons.
+ 
+ 1.  If you built the docker image yourself or deployed using the bare
+     metal route, make sure that there are files in
+     `<paperless-root>/static/frontend/<lang-code>/`. If there are no
+     files, make sure that you executed `collectstatic` successfully,
+     either manually or as part of the docker image build.
+ 
+     If the front end is still missing, make sure that the front end is
+     compiled (files present in `src/documents/static/frontend`). If it
+     is not, you need to compile the front end yourself or download the
+     release archive instead of cloning the repository.
+ 
+ 2.  Check the output of the web server. You might see errors like this:
+ 
+     ```
+     [2021-01-25 10:08:04 +0000] [40] [ERROR] Socket error processing request.
+     Traceback (most recent call last):
+     File "/usr/local/lib/python3.7/site-packages/gunicorn/workers/sync.py", line 134, in handle
+         self.handle_request(listener, req, client, addr)
+     File "/usr/local/lib/python3.7/site-packages/gunicorn/workers/sync.py", line 190, in handle_request
+         util.reraise(*sys.exc_info())
+     File "/usr/local/lib/python3.7/site-packages/gunicorn/util.py", line 625, in reraise
+         raise value
+     File "/usr/local/lib/python3.7/site-packages/gunicorn/workers/sync.py", line 178, in handle_request
+         resp.write_file(respiter)
+     File "/usr/local/lib/python3.7/site-packages/gunicorn/http/wsgi.py", line 396, in write_file
+         if not self.sendfile(respiter):
+     File "/usr/local/lib/python3.7/site-packages/gunicorn/http/wsgi.py", line 386, in sendfile
+         sent += os.sendfile(sockno, fileno, offset + sent, count)
+     OSError: [Errno 22] Invalid argument
+     ```
+ 
+     To fix this issue, add
+ 
+     ```
+     SENDFILE=0
+     ```
+ 
+     to your `docker-compose.env` file.
+ 
+ ## Error while reading metadata
+ 
+ You might find messages like these in your log files:
+ 
+ ```
+ [WARNING] [paperless.parsing.tesseract] Error while reading metadata
+ ```
+ 
+ This indicates that paperless failed to read PDF metadata from one of
+ your documents. This happens when you open the affected documents in
+ paperless for editing. Paperless will continue to work, and will simply
+ not show the invalid metadata.
+ 
+ ## Consumer fails with a FileNotFoundError
+ 
+ You might find messages like these in your log files:
+ 
+ ```
+ [ERROR] [paperless.consumer] Error while consuming document SCN_0001.pdf: FileNotFoundError: [Errno 2] No such file or directory: '/tmp/ocrmypdf.io.yhk3zbv0/origin.pdf'
+ Traceback (most recent call last):
+   File "/app/paperless/src/paperless_tesseract/parsers.py", line 261, in parse
+     ocrmypdf.ocr(**args)
+   File "/usr/local/lib/python3.8/dist-packages/ocrmypdf/api.py", line 337, in ocr
+     return run_pipeline(options=options, plugin_manager=plugin_manager, api=True)
+   File "/usr/local/lib/python3.8/dist-packages/ocrmypdf/_sync.py", line 385, in run_pipeline
+     exec_concurrent(context, executor)
+   File "/usr/local/lib/python3.8/dist-packages/ocrmypdf/_sync.py", line 302, in exec_concurrent
+     pdf = post_process(pdf, context, executor)
+   File "/usr/local/lib/python3.8/dist-packages/ocrmypdf/_sync.py", line 235, in post_process
+     pdf_out = metadata_fixup(pdf_out, context)
+   File "/usr/local/lib/python3.8/dist-packages/ocrmypdf/_pipeline.py", line 798, in metadata_fixup
+     with pikepdf.open(context.origin) as original, pikepdf.open(working_file) as pdf:
+   File "/usr/local/lib/python3.8/dist-packages/pikepdf/_methods.py", line 923, in open
+     pdf = Pdf._open(
+ FileNotFoundError: [Errno 2] No such file or directory: '/tmp/ocrmypdf.io.yhk3zbv0/origin.pdf'
+ ```
+ 
+ This probably indicates paperless tried to consume the same file twice.
+ This can happen for a number of reasons, depending on how documents are
+ placed into the consume folder. If paperless is using inotify (the
+ default) to check for documents, try adjusting the
+ [inotify configuration](/configuration#inotify). If polling is enabled, try adjusting the
+ [polling configuration](/configuration#polling).
+ 
+ ## Consumer fails waiting for file to remain unmodified.
+ 
+ You might find messages like these in your log files:
+ 
+ ```
+ [ERROR] [paperless.management.consumer] Timeout while waiting on file /usr/src/paperless/src/../consume/SCN_0001.pdf to remain unmodified.
+ ```
+ 
+ This indicates paperless timed out while waiting for the file to be
+ completely written to the consume folder. Adjusting
+ [polling configuration](/configuration#polling) values should resolve the issue.
+ 
+ !!! note
+ 
+     The user will need to manually move the file out of the consume folder
+     and back in, for the initial failing file to be consumed.
+ 
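The idea of waiting for a file "to remain unmodified" can be pictured as repeatedly comparing its size and modification time until they stop changing, giving up after a fixed number of attempts. The sketch below is an illustration under that assumption; `wait_until_stable` and its parameters are hypothetical and do not correspond to Paperless settings.

```python
import os
import tempfile
import time


def wait_until_stable(path, interval=0.05, checks=3, retries=20):
    """Return True once size and mtime stay equal for `checks` comparisons."""
    last_seen = None
    stable = 0
    for _ in range(retries):
        info = os.stat(path)
        current = (info.st_size, info.st_mtime_ns)
        if current == last_seen:
            stable += 1
            if stable >= checks:
                return True
        else:
            # The file changed (or this is the first look): start over.
            stable = 0
            last_seen = current
        time.sleep(interval)
    # Corresponds to the "Timeout while waiting" case in the log above.
    return False


def demo():
    """Check a scratch file that is no longer being written to."""
    with tempfile.NamedTemporaryFile() as handle:
        handle.write(b"scan data")
        handle.flush()
        return wait_until_stable(handle.name, interval=0.01)


print(demo())
```

A slow scanner or network copy that keeps appending to the file would keep resetting the counter, which is why longer delays or more retries in the polling configuration can resolve the timeout.
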
+ ## Consumer fails reporting "OS reports file as busy still".
+ 
+ You might find messages like these in your log files:
+ 
+ ```
+ [WARNING] [paperless.management.consumer] Not consuming file /usr/src/paperless/src/../consume/SCN_0001.pdf: OS reports file as busy still
+ ```
+ 
+ This indicates paperless was unable to open the file, as the OS reported
+ the file as still being in use. To prevent a crash, paperless did not
+ try to consume the file. If paperless is using inotify (the default) to
+ check for documents, try adjusting the
+ [inotify configuration](/configuration#inotify). If polling is enabled, try adjusting the
+ [polling configuration](/configuration#polling).
+ 
+ !!! note
+ 
+     The user will need to manually move the file out of the consume folder
+     and back in, for the initial failing file to be consumed.
+ 
+ ## Log reports "Creating PaperlessTask failed".
+ 
+ You might find messages like these in your log files:
+ 
+ ```
+ [ERROR] [paperless.management.consumer] Creating PaperlessTask failed: db locked
+ ```
+ 
+ You are likely using an SQLite-based installation with an increased
+ number of workers and are running into SQLite's concurrency
+ limitations. Uploading or consuming multiple files at once results in
+ many workers attempting to access the database simultaneously.
+ 
+ Consider changing to the PostgreSQL database if you will be processing
+ many documents at once often. Otherwise, try tweaking the
+ `PAPERLESS_DB_TIMEOUT` setting to allow more time for the database to
+ unlock. This may have minor performance implications.
+ 
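The locking behavior can be reproduced with Python's built-in `sqlite3` module. The sketch below assumes that the database timeout ultimately acts like the `timeout` argument of `sqlite3.connect`, that is, how long a connection waits for a lock before giving up; the table name and the function `write_while_locked` are made up for the demonstration.

```python
import os
import sqlite3
import tempfile


def write_while_locked(timeout_seconds):
    """Attempt a write while another connection holds an exclusive lock."""
    with tempfile.TemporaryDirectory() as tmp:
        db_path = os.path.join(tmp, "demo.sqlite3")
        # isolation_level=None lets us issue BEGIN/ROLLBACK ourselves.
        holder = sqlite3.connect(db_path, isolation_level=None)
        holder.execute("CREATE TABLE tasks (name TEXT)")
        holder.execute("BEGIN EXCLUSIVE")  # simulate a busy worker
        waiter = sqlite3.connect(
            db_path, timeout=timeout_seconds, isolation_level=None
        )
        try:
            waiter.execute("INSERT INTO tasks VALUES ('consume')")
            return "ok"
        except sqlite3.OperationalError as exc:
            return str(exc)
        finally:
            waiter.close()
            holder.execute("ROLLBACK")
            holder.close()


print(write_while_locked(0.1))
```

A larger timeout only helps when the lock holder finishes within that window; with many concurrent writers a database server such as PostgreSQL avoids the problem altogether.
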
+ ## gunicorn fails to start with "is not a valid port number"
+ 
+ You are likely running on Kubernetes, which automatically creates an
+ environment variable named `${serviceName}_PORT`. For a service named
+ `paperless`, this is `PAPERLESS_PORT`, the same environment variable
+ which is used by Paperless to optionally change the port gunicorn
+ listens on.
+ 
+ To fix this, set `PAPERLESS_PORT` again to your desired
+ port, or the default of 8000.
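
To illustrate why the injected value is rejected: Kubernetes sets such service variables to a URL-like string of the form `tcp://<cluster-ip>:<port>`, which is not a bare port number. The check below is a minimal sketch; `is_valid_port` is a hypothetical helper and the IP address is only an example value.

```python
def is_valid_port(value):
    """Return True if `value` is a plain TCP port number as a string."""
    try:
        port = int(value)
    except ValueError:
        return False
    return 0 < port < 65536


print(is_valid_port("8000"))                  # True
print(is_valid_port("tcp://10.43.0.1:8000"))  # False
```

Explicitly setting `PAPERLESS_PORT` in the pod specification replaces the injected value with a plain number.
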