From: Trenton Holmes <797416+stumpylog@users.noreply.github.com> Date: Sun, 4 Dec 2022 21:55:46 +0000 (-0800) Subject: Merge remote-tracking branch 'upstream/dev' into feature-consume-eml X-Git-Tag: v1.11.0~1^2~35^2~1 X-Git-Url: http://git.ipfire.org/cgi-bin/gitweb.cgi?a=commitdiff_plain;h=8076ebd78ca61b8b4369ed03d462962e03b76708;p=thirdparty%2Fpaperless-ngx.git Merge remote-tracking branch 'upstream/dev' into feature-consume-eml --- 8076ebd78ca61b8b4369ed03d462962e03b76708 diff --cc Pipfile index 4b32ad01e9,dad9a47603..e7702898a1 --- a/Pipfile +++ b/Pipfile @@@ -60,7 -60,6 +60,9 @@@ setproctitle = "* nltk = "*" pdf2image = "*" flower = "*" +bleach = "*" ++# https://www.piwheels.org/project/cryptography/ last built version ++cryptography = "==38.0.1" [dev-packages] coveralls = "*" @@@ -79,4 -76,4 +79,5 @@@ black = "* pre-commit = "*" sphinx-autobuild = "*" myst-parser = "*" +imagehash = "*" + mkdocs-material = "*" diff --cc Pipfile.lock index 74f3a6a860,d00e7029f0..8446689132 --- a/Pipfile.lock +++ b/Pipfile.lock @@@ -1,7 -1,7 +1,7 @@@ { "_meta": { "hash": { - "sha256": "548803b8c176073960d6fb5858949d1bb263b36f8811b2963d03a1a29ad65dd0" - "sha256": "0242e3e296e09b30fb69e0d7a2f2e8feb4c6a23d3c7ec99500f2883a032a8c84" ++ "sha256": "cbfe9920231de6e7f993962efb3cc371abdb6b08975232d4cf64d1bad1b53d7a" }, "pipfile-spec": 6, "requires": {}, @@@ -2089,6 -2075,9 +2090,7 @@@ "version": "==0.4.6" }, "coverage": { - "extras": [ - "toml" - ], ++ "extras": [], "hashes": [ "sha256:027018943386e7b942fa832372ebc120155fd970837489896099f5cfa2890f79", "sha256:11b990d520ea75e7ee8dcab5bc908072aaada194a794db9f6d7d5cfd19661e5a", diff --cc docs/configuration.md index 0000000000,ec4cf7765d..93eaead36f mode 000000,100644..100644 --- a/docs/configuration.md +++ b/docs/configuration.md @@@ -1,0 -1,1031 +1,1036 @@@ + # Configuration + + Paperless provides a wide range of customizations. Depending on how you + run paperless, these settings have to be defined in different places. 
+ + - If you run paperless on docker, `paperless.conf` is not used. + Rather, configure paperless by copying necessary options to + `docker-compose.env`. + + - If you are running paperless on anything else, paperless will search + for the configuration file in these locations and use the first one + it finds: + + ``` + /path/to/paperless/paperless.conf + /etc/paperless.conf + /usr/local/etc/paperless.conf + ``` + + ## Required services + + `PAPERLESS_REDIS=` + + : This is required for processing scheduled tasks such as email + fetching, index optimization and for training the automatic document + matcher. + + - If your Redis server needs login credentials PAPERLESS_REDIS = + `redis://:@:` + - With the requirepass option PAPERLESS_REDIS = + `redis://:@:` + + [More information on securing your Redis + Instance](https://redis.io/docs/getting-started/#securing-redis). + + Defaults to . + + `PAPERLESS_DBENGINE=` + + : Optional, gives the ability to choose Postgres or MariaDB for + database engine. Available options are [postgresql]{.title-ref} and + [mariadb]{.title-ref}. + + Default is [postgresql]{.title-ref}. + + !!! warning + + Using MariaDB comes with some caveats. See [MySQL Caveats](advanced_usage#mysql-caveats). + + `PAPERLESS_DBHOST=` + + : By default, sqlite is used as the database backend. This can be + changed here. + + Set PAPERLESS_DBHOST and another database will be used instead of + sqlite. + + `PAPERLESS_DBPORT=` + + : Adjust port if necessary. + + Default is 5432. + + `PAPERLESS_DBNAME=` + + : Database name in PostgreSQL or MariaDB. + + Defaults to "paperless". + + `PAPERLESS_DBUSER=` + + : Database user in PostgreSQL or MariaDB. + + Defaults to "paperless". + + `PAPERLESS_DBPASS=` + + : Database password for PostgreSQL or MariaDB. + + Defaults to "paperless". + + `PAPERLESS_DBSSLMODE=` + + : SSL mode to use when connecting to PostgreSQL. + + See [the official documentation about + sslmode](https://www.postgresql.org/docs/current/libpq-ssl.html). 
+ + Default is `prefer`. + + `PAPERLESS_DB_TIMEOUT=` + + : Amount of time for a database connection to wait for the database to + unlock. Mostly applicable to sqlite-based installations; consider + changing to postgresql if you need to increase this. + + Defaults to unset, keeping the Django defaults. + + ## Paths and folders + + `PAPERLESS_CONSUMPTION_DIR=` + + : This is where your documents should go to be consumed. Make sure that + it exists and that the user running the paperless service can + read/write its contents before you start Paperless. + + Don't change this when using docker, as it only changes the path + within the container. Change the local consumption directory in the + docker-compose.yml file instead. + + Defaults to "../consume/", relative to the "src" directory. + + `PAPERLESS_DATA_DIR=` + + : This is where paperless stores all its data (search index, SQLite + database, classification model, etc). + + Defaults to "../data/", relative to the "src" directory. + + `PAPERLESS_TRASH_DIR=` + + : Deleted documents are moved to this directory instead of being + removed. + + This must be writeable by the user running paperless. When running + inside docker, ensure that this path is within a permanent volume + (such as "../media/trash") so it won't get lost on upgrades. + + Defaults to empty (i.e. really delete documents). + + `PAPERLESS_MEDIA_ROOT=` + + : This is where your documents and thumbnails are stored. + + You can set this and PAPERLESS_DATA_DIR to the same folder to have + paperless store all its data within the same volume. + + Defaults to "../media/", relative to the "src" directory. + + `PAPERLESS_STATICDIR=` + + : Override the default STATIC_ROOT here. This is where all static + files created using the "collectstatic" management command are stored. + + Unless you're doing something fancy, there is no need to override + this. + + Defaults to "../static/", relative to the "src" directory. 
+ + `PAPERLESS_FILENAME_FORMAT=` + + : Changes the filenames paperless uses to store documents in the media + directory. See [File name handling](advanced_usage#file_name_handling) for details. + + Default is none, which disables this feature. + + `PAPERLESS_FILENAME_FORMAT_REMOVE_NONE=` + + : Tells paperless to replace placeholders in + [PAPERLESS_FILENAME_FORMAT]{.title-ref} that would resolve to + 'none' to be omitted from the resulting filename. This also holds + true for directory names. See [File name handling](advanced_usage#file_name_handling) for + details. + + Defaults to [false]{.title-ref} which disables this feature. + + `PAPERLESS_LOGGING_DIR=` + + : This is where paperless will store log files. + + Defaults to "`PAPERLESS_DATA_DIR`/log/". + + ## Logging + + `PAPERLESS_LOGROTATE_MAX_SIZE=` + + : Maximum file size for log files before they are rotated, in bytes. + + Defaults to 1 MiB. + + `PAPERLESS_LOGROTATE_MAX_BACKUPS=` + + : Number of rotated log files to keep. + + Defaults to 20. + + ## Hosting & Security {#hosting-and-security} + + `PAPERLESS_SECRET_KEY=` + + : Paperless uses this to make session tokens. If you expose paperless + on the internet, you need to change this, since the default secret + is well known. + + Use any sequence of characters. The more, the better. You don't + need to remember this. Just face-roll your keyboard. + + Default is listed in the file `src/paperless/settings.py`. + + `PAPERLESS_URL=` + + : This setting can be used to set the three options below + (ALLOWED_HOSTS, CORS_ALLOWED_HOSTS and CSRF_TRUSTED_ORIGINS). If the + other options are set the values will be combined with this one. Do + not include a trailing slash. E.g. + + Defaults to empty string, leaving the other settings unaffected. + + `PAPERLESS_CSRF_TRUSTED_ORIGINS=` + + : A list of trusted origins for unsafe requests (e.g. POST). As of + Django 4.0 this is required to access the Django admin via the web. 
+ See + + + Can also be set using PAPERLESS_URL (see above). + + Defaults to empty string, which does not add any origins to the + trusted list. + + `PAPERLESS_ALLOWED_HOSTS=` + + : If you're planning on putting Paperless on the open internet, then + you really should set this value to the domain name you're using. + Failing to do so leaves you open to HTTP host header attacks: + + + Just remember that this is a comma-separated list, so + "example.com" is fine, as is "example.com,www.example.com", but + NOT " example.com" or "example.com," + + Can also be set using PAPERLESS_URL (see above). + + If manually set, please remember to include "localhost". Otherwise + docker healthcheck will fail. + + Defaults to "\*", which is all hosts. + + `PAPERLESS_CORS_ALLOWED_HOSTS=` + + : You need to add your servers to the list of allowed hosts that can + do CORS calls. Set this to your public domain name. + + Can also be set using PAPERLESS_URL (see above). + + Defaults to "". + + `PAPERLESS_FORCE_SCRIPT_NAME=` + + : To host paperless under a subpath url like example.com/paperless you + set this value to /paperless. No trailing slash! + + Defaults to none, which hosts paperless at "/". + + `PAPERLESS_STATIC_URL=` + + : Override the STATIC_URL here. Unless you're hosting Paperless off a + subdomain like /paperless/, you probably don't need to change this. + If you do change it, be sure to include the trailing slash. + + Defaults to "/static/". + + !!! note + + When hosting paperless behind a reverse proxy like Traefik or Nginx + at a subpath e.g. example.com/paperlessngx you will also need to set + `PAPERLESS_FORCE_SCRIPT_NAME` (see above). + + `PAPERLESS_AUTO_LOGIN_USERNAME=` + + : Specify a username here so that paperless will automatically perform + login with the selected user. + + !!! danger + + Do not use this when exposing paperless on the internet. There are + no checks in place that would prevent you from doing this. + + Defaults to none, which disables this feature. 
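As an illustration of the subpath settings described above, the fragment below is a minimal sketch of the relevant entries. It assumes paperless is served at the hypothetical address `https://example.com/paperless`; the host name and subpath are placeholders, not project defaults, and only the variable names come from this section.

```shell
# Hypothetical host; do not include a trailing slash
PAPERLESS_URL=https://example.com
# Subpath under which paperless is served; no trailing slash
PAPERLESS_FORCE_SCRIPT_NAME=/paperless
# Static files under the same subpath; the trailing slash is required
PAPERLESS_STATIC_URL=/paperless/static/
```

The comments sit on their own lines because some env-file parsers treat a trailing `#` comment as part of the value.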
+ + `PAPERLESS_ADMIN_USER=` + + : If this environment variable is specified, Paperless automatically + creates a superuser with the provided username at start. This is + useful in cases where you cannot run the + [createsuperuser]{.title-ref} command separately, such as Kubernetes + or AWS ECS. + + Requires [PAPERLESS_ADMIN_PASSWORD]{.title-ref} to be set. + + !!! note + + This will not change an existing \[super\]user's password, nor will + it recreate a user that already exists. You can leave this set + throughout the lifecycle of the containers. + + `PAPERLESS_ADMIN_MAIL=` + + : (Optional) Specify superuser email address. Only used when + [PAPERLESS_ADMIN_USER]{.title-ref} is set. + + Defaults to `root@localhost`. + + `PAPERLESS_ADMIN_PASSWORD=` + + : Only used when [PAPERLESS_ADMIN_USER]{.title-ref} is set. This will + be the password of the automatically created superuser. + + `PAPERLESS_COOKIE_PREFIX=` + + : Specify a prefix that is added to the cookies used by paperless to + identify the currently logged in user. This is useful when + you're running two instances of paperless on the same host. + + After changing this, you will have to log in again. + + Defaults to `""`, which does not alter the cookie names. + + `PAPERLESS_ENABLE_HTTP_REMOTE_USER=` + + : Allows authentication via HTTP_REMOTE_USER which is used by some SSO + applications. + + !!! warning + + This will allow authentication by simply adding a + `Remote-User: ` header to a request. Use with care! You + especially *must* ensure that any such header is not passed from + your proxy server to paperless. + + If you're exposing paperless to the internet directly, do not use + this. + + Also see the warning [in the official documentation + ]{.title-ref}. + + Defaults to [false]{.title-ref} which disables this feature. 
+ + `PAPERLESS_HTTP_REMOTE_USER_HEADER_NAME=` + + : If [PAPERLESS_ENABLE_HTTP_REMOTE_USER]{.title-ref} is enabled, this + property allows to customize the name of the HTTP header from which + the authenticated username is extracted. Values are in terms of + \[HttpRequest.META\](). + Thus, the configured value must start with [HTTP\_]{.title-ref} + followed by the normalized actual header name. + + Defaults to [HTTP_REMOTE_USER]{.title-ref}. + + `PAPERLESS_LOGOUT_REDIRECT_URL=` + + : URL to redirect the user to after a logout. This can be used + together with [PAPERLESS_ENABLE_HTTP_REMOTE_USER]{.title-ref} to + redirect the user back to the SSO application's logout page. + + Defaults to None, which disables this feature. + + ## OCR settings {#ocr} + + Paperless uses [OCRmyPDF](https://ocrmypdf.readthedocs.io/en/latest/) + for performing OCR on documents and images. Paperless uses sensible + defaults for most settings, but all of them can be configured to your + needs. + + `PAPERLESS_OCR_LANGUAGE=` + + : Customize the language that paperless will attempt to use when + parsing documents. + + It should be a 3-letter language code consistent with ISO 639: + + + Set this to the language most of your documents are written in. + + This can be a combination of multiple languages such as `deu+eng`, + in which case tesseract will use whatever language matches best. + Keep in mind that tesseract uses much more cpu time with multiple + languages enabled. + + Defaults to "eng". + + !!! note + + If your language contains a '-' such as chi-sim, you must use chi_sim + + `PAPERLESS_OCR_MODE=` + + : Tell paperless when and how to perform ocr on your documents. Four + modes are available: + + - `skip`: Paperless skips all pages and will perform ocr only on + pages where no text is present. This is the safest option. + + - `skip_noarchive`: In addition to skip, paperless won't create + an archived version of your documents when it finds any text in + them. 
This is useful if you don't want to have two + almost-identical versions of your digital documents in the media + folder. This is the fastest option. + + - `redo`: Paperless will OCR all pages of your documents and + attempt to replace any existing text layers with new text. This + will be useful for documents from scanners that already + performed OCR with insufficient results. It will also perform + OCR on purely digital documents. + + This option may fail on some documents that have features that + cannot be removed, such as forms. In this case, the text from + the document is used instead. + + - `force`: Paperless rasterizes your documents, converting any + text into images and puts the OCRed text on top. This works for + all documents, however, the resulting document may be + significantly larger and text won't appear as sharp when zoomed + in. + + The default is `skip`, which only performs OCR when necessary and + always creates archived documents. + + Read more about this in the [OCRmyPDF + documentation](https://ocrmypdf.readthedocs.io/en/latest/advanced.html#when-ocr-is-skipped). + + `PAPERLESS_OCR_CLEAN=` + + : Tells paperless to use `unpaper` to clean any input document before + sending it to tesseract. This uses more resources, but generally + results in better OCR results. The following modes are available: + + - `clean`: Apply unpaper. + - `clean-final`: Apply unpaper, and use the cleaned images to + build the output file instead of the original images. + - `none`: Do not apply unpaper. + + Defaults to `clean`. + + !!! note + + `clean-final` is incompatible with ocr mode `redo`. When both + `clean-final` and the ocr mode `redo` is configured, `clean` is used + instead. + + `PAPERLESS_OCR_DESKEW=` + + : Tells paperless to correct skewing (slight rotation of input images + mainly due to improper scanning) + + Defaults to `true`, which enables this feature. + + !!! note + + Deskewing is incompatible with ocr mode `redo`. 
Deskewing will get + disabled automatically if `redo` is used as the ocr mode. + + `PAPERLESS_OCR_ROTATE_PAGES=` + + : Tells paperless to correct page rotation (90°, 180° and 270° + rotation). + + If you notice that paperless is not rotating incorrectly rotated + pages (or vice versa), try adjusting the threshold up or down (see + below). + + Defaults to `true`, which enables this feature. + + `PAPERLESS_OCR_ROTATE_PAGES_THRESHOLD=` + + : Adjust the threshold for automatic page rotation by + `PAPERLESS_OCR_ROTATE_PAGES`. This is an arbitrary value reported by + tesseract. "15" is a very conservative value, whereas "2" is a + very aggressive option and will often result in correctly rotated + pages being rotated as well. + + Defaults to "12". + + `PAPERLESS_OCR_OUTPUT_TYPE=` + + : Specify the type of PDF documents that paperless should produce. + + - `pdf`: Modify the PDF document as little as possible. + - `pdfa`: Convert PDF documents into PDF/A-2b documents, which is + a subset of the entire PDF specification and meant for storing + documents long term. + - `pdfa-1`, `pdfa-2`, `pdfa-3` to specify the exact version of + PDF/A you wish to use. + + If not specified, `pdfa` is used. Remember that paperless also keeps + the original input file as well as the archived version. + + `PAPERLESS_OCR_PAGES=` + + : Tells paperless to use only the specified number of pages for OCR. + Documents with fewer than the specified number of pages get OCR'ed + completely. + + Specifying 1 here will only use the first page. + + When combined with `PAPERLESS_OCR_MODE=redo` or + `PAPERLESS_OCR_MODE=force`, paperless will not modify any text it + finds on excluded pages and copies it verbatim. + + Defaults to 0, which disables this feature and always uses all + pages. + + `PAPERLESS_OCR_IMAGE_DPI=` + + : Paperless will OCR any images you put into the system and convert + them into PDF documents. This is useful if your scanner produces + images.
In order to do so, paperless needs to know the DPI of the + image. Most images from scanners will have this information embedded + and paperless will detect and use that information. In case this + fails, it uses this value as a fallback. + + Set this to the DPI your scanner produces images at. + + Default is none, which will automatically calculate image DPI so + that the produced PDF documents are A4 sized. + + `PAPERLESS_OCR_MAX_IMAGE_PIXELS=` + + : Paperless will raise a warning when OCRing images which are over + this limit and will not OCR images which are more than twice this + limit. Note this does not prevent the document from being consumed, + but could result in missing text content. + + If unset, will default to the value determined by + [Pillow](https://pillow.readthedocs.io/en/stable/reference/Image.html#PIL.Image.MAX_IMAGE_PIXELS). + + !!! note + + Increasing this limit could cause Paperless to consume additional + resources when consuming a file. Be sure you have sufficient system + resources. + + !!! warning + + The limit is intended to prevent malicious files from consuming + system resources and causing crashes and other errors. Only increase + this value if you are certain your documents are not malicious and + you need the text which was not OCRed + + `PAPERLESS_OCR_USER_ARGS=` + + : OCRmyPDF offers many more options. Use this parameter to specify any + additional arguments you wish to pass to OCRmyPDF. Since Paperless + uses the API of OCRmyPDF, you have to specify these in a format that + can be passed to the API. See [the API reference of + OCRmyPDF](https://ocrmypdf.readthedocs.io/en/latest/api.html#reference) + for valid parameters. All command line options are supported, but + they use underscores instead of dashes. + + !!! warning + + Paperless has been tested to work with the OCR options provided + above. 
There are many options that are incompatible with each other, + so specifying invalid options may prevent paperless from consuming + any documents. + + Specify arguments as a JSON dictionary. Keep note of lower case + booleans and double quoted parameter names and strings. Examples: + + ``` json + {"deskew": true, "optimize": 3, "unpaper_args": "--pre-rotate 90"} + ``` + + ## Tika settings {#tika} + + Paperless can make use of [Tika](https://tika.apache.org/) and + [Gotenberg](https://gotenberg.dev/) for parsing and converting -"Office" documents (such as ".doc", ".xlsx" and ".odt"). If you -wish to use this, you must provide a Tika server and a Gotenberg server, ++"Office" documents (such as ".doc", ".xlsx" and ".odt"). ++Tika and Gotenberg are also needed to allow parsing of E-Mails (.eml). ++ ++If you wish to use this, you must provide a Tika server and a Gotenberg server, + configure their endpoints, and enable the feature. + + `PAPERLESS_TIKA_ENABLED=` + + : Enable (or disable) the Tika parser. + + Defaults to false. + + `PAPERLESS_TIKA_ENDPOINT=` + + : Set the endpoint URL were Paperless can reach your Tika server. + + Defaults to "". + + `PAPERLESS_TIKA_GOTENBERG_ENDPOINT=` + + : Set the endpoint URL were Paperless can reach your Gotenberg server. + + Defaults to "". + + If you run paperless on docker, you can add those services to the + docker-compose file (see the provided `docker-compose.sqlite-tika.yml` + file for reference). The changes requires are as follows: + + ```yaml + services: + # ... + + webserver: + # ... + + environment: + # ... + + PAPERLESS_TIKA_ENABLED: 1 + PAPERLESS_TIKA_GOTENBERG_ENDPOINT: http://gotenberg:3000 + PAPERLESS_TIKA_ENDPOINT: http://tika:9998 + + # ... + - gotenberg: - image: gotenberg/gotenberg:7.6 - restart: unless-stopped - command: - - 'gotenberg' - - '--chromium-disable-routes=true' ++ gotenberg: ++ image: gotenberg/gotenberg:7.6 ++ restart: unless-stopped ++ # The gotenberg chromium route is used to convert .eml files. 
We do not ++ # want to allow external content like tracking pixels or even javascript. ++ command: ++ - "gotenberg" ++ - "--chromium-disable-javascript=true" ++ - "--chromium-allow-list=file:///tmp/.*" + + tika: + image: ghcr.io/paperless-ngx/tika:latest + restart: unless-stopped + ``` + + Add the configuration variables to the environment of the webserver + (alternatively put the configuration in the `docker-compose.env` file) + and add the additional services below the webserver service. Watch out + for indentation. + + Make sure to use the correct format [PAPERLESS_TIKA_ENABLED = + 1]{.title-ref} so python_dotenv can parse the statement correctly. + + ## Software tweaks {#software_tweaks} + + `PAPERLESS_TASK_WORKERS=` + + : Paperless does multiple things in the background: Maintain the + search index, maintain the automatic matching algorithm, check + emails, consume documents, etc. This variable specifies how many + tasks it will do in parallel. + + Defaults to 1. + + `PAPERLESS_THREADS_PER_WORKER=` + + : Furthermore, paperless uses multiple threads when consuming + documents to speed up OCR. This variable specifies how many pages + paperless will process in parallel on a single document. + + !!! warning + + Ensure that the product + + `PAPERLESS_TASK_WORKERS * PAPERLESS_THREADS_PER_WORKER` + + does not exceed your CPU core count or else paperless will be + extremely slow. If you want paperless to process many documents in + parallel, choose a high worker count. If you want paperless to + process very large documents faster, use a higher threads-per-worker + count. 
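The check described in the warning above can be sketched as a small shell function. This is only an illustration: `fits_cores` is a hypothetical helper name, not a paperless command, and the worker, thread and core counts passed to it are example values.

```shell
# Sketch: succeed only if workers * threads fits within the given core count.
# Arguments: <workers> <threads_per_worker> <cpu_cores>
# fits_cores is a hypothetical helper, not part of paperless.
fits_cores() {
    [ $(( $1 * $2 )) -le "$3" ]
}

# Example values: 2 workers with 2 threads each on a 4-core machine
if fits_cores 2 2 4; then
    echo "ok: worker/thread product fits the core count"
else
    echo "warning: worker/thread product exceeds the core count"
fi
```

On a Linux host with GNU coreutils, `$(nproc)` could supply the third argument in place of a fixed core count.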
+ + The default is a balance between the two, according to your CPU core + count, with a slight favor towards threads per worker: + + | CPU core count | Workers | Threads | + |----------------|---------|---------| + | > 1 | > 1 | > 1 | + | > 2 | > 2 | > 1 | + | > 4 | > 2 | > 2 | + | > 6 | > 2 | > 3 | + | > 8 | > 2 | > 4 | + | > 12 | > 3 | > 4 | + | > 16 | > 4 | > 4 | + + If you only specify PAPERLESS_TASK_WORKERS, paperless will adjust + PAPERLESS_THREADS_PER_WORKER automatically. + + `PAPERLESS_WORKER_TIMEOUT=` + + : Machines with few cores or weak ones might not be able to finish OCR + on large documents within the default 1800 seconds. So extending + this timeout may prove to be useful on weak hardware setups. + + `PAPERLESS_WORKER_RETRY=` + + : If PAPERLESS_WORKER_TIMEOUT has been configured, the retry time for + a task can also be configured. By default, this value will be set to + 10s more than the worker timeout. This value should never be set + less than the worker timeout. + + `PAPERLESS_TIME_ZONE=` + + : Set the time zone here. See + + for details on how to set it. + + Defaults to UTC. + + ## Polling {#polling} + + `PAPERLESS_CONSUMER_POLLING=` + + : If paperless won't find documents added to your consume folder, it + might not be able to automatically detect filesystem changes. In + that case, specify a polling interval in seconds here, which will + then cause paperless to periodically check your consumption + directory for changes. This will also disable listening for file + system changes with `inotify`. + + Defaults to 0, which disables polling and uses filesystem + notifications. + + `PAPERLESS_CONSUMER_POLLING_RETRY_COUNT=` + + : If consumer polling is enabled, sets the number of times paperless + will check for a file to remain unmodified. + + Defaults to 5. 
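A minimal sketch of a polling setup, for a consumption directory where filesystem notifications are not delivered. The 30-second interval is an arbitrary illustrative value, not a recommendation; the retry count shown is the documented default.

```shell
# Check the consumption directory every 30 seconds instead of relying on inotify
PAPERLESS_CONSUMER_POLLING=30
# How many times paperless checks that a file remains unmodified (documented default)
PAPERLESS_CONSUMER_POLLING_RETRY_COUNT=5
```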
+ + `PAPERLESS_CONSUMER_POLLING_DELAY=` + + : If consumer polling is enabled, sets the delay in seconds between + each check (above) paperless will do while waiting for a file to + remain unmodified. + + Defaults to 5. + + ## iNotify {#inotify} + + `PAPERLESS_CONSUMER_INOTIFY_DELAY=` + + : Sets the time in seconds the consumer will wait for additional + events from inotify before the consumer will consider a file ready + and begin consumption. Certain scanners or network setups may + generate multiple events for a single file, leading to multiple + consumers working on the same file. Configure this to prevent that. + + Defaults to 0.5 seconds. + + `PAPERLESS_CONSUMER_DELETE_DUPLICATES=` + + : When the consumer detects a duplicate document, it will not touch + the original document. This default behavior can be changed here. + + Defaults to false. + + `PAPERLESS_CONSUMER_RECURSIVE=` + + : Enable recursive watching of the consumption directory. Paperless + will then pick up files in subdirectories within your + consumption directory as well. + + Defaults to false. + + `PAPERLESS_CONSUMER_SUBDIRS_AS_TAGS=` + + : Set the names of subdirectories as tags for consumed files. E.g. + /foo/bar/file.pdf will add the tags "foo" and + "bar" to the consumed file. Paperless will create any tags that + don't exist yet. + + This is useful for sorting documents with certain tags such as `car` + or `todo` prior to consumption. These folders won't be deleted. + + PAPERLESS_CONSUMER_RECURSIVE must be enabled for this to work. + + Defaults to false. + + `PAPERLESS_CONSUMER_ENABLE_BARCODES=` + + : Enables scanning and page separation based on detected barcodes. + This allows for scanning and adding multiple documents per uploaded + file, which are separated by one or multiple barcode pages. + + For ease of use, it is suggested to use a standardized separation + page, e.g. [here](https://www.alliancegroup.co.uk/patch-codes.htm). 
+ + If no barcodes are detected in the uploaded file, no page separation + will happen. + + The original document will be removed and the separated pages will + be saved as pdf. + + Defaults to false. + + `PAPERLESS_CONSUMER_BARCODE_TIFF_SUPPORT=` + + : Whether TIFF image files should be scanned for barcodes. This will + automatically convert any TIFF image(s) to pdfs for later + processing. This only has an effect, if + PAPERLESS_CONSUMER_ENABLE_BARCODES has been enabled. + + Defaults to false. + + PAPERLESS_CONSUMER_BARCODE_STRING=PATCHT + + : Defines the string to be detected as a separator barcode. If + paperless is used with the PATCH-T separator pages, users shouldn't + change this. + + Defaults to "PATCHT" + + `PAPERLESS_CONVERT_MEMORY_LIMIT=` + + : On smaller systems, or even in the case of Very Large Documents, the + consumer may explode, complaining about how it's "unable to extend + pixel cache". In such cases, try setting this to a reasonably low + value, like 32. The default is to use whatever is necessary to do + everything without writing to disk, and units are in megabytes. + + For more information on how to use this value, you should search the + web for "MAGICK_MEMORY_LIMIT". + + Defaults to 0, which disables the limit. + + `PAPERLESS_CONVERT_TMPDIR=` + + : Similar to the memory limit, if you've got a small system and your + OS mounts /tmp as tmpfs, you should set this to a path that's on a + physical disk, like /home/your_user/tmp or something. ImageMagick + will use this as scratch space when crunching through very large + documents. + + For more information on how to use this value, you should search the + web for "MAGICK_TMPDIR". + + Default is none, which disables the temporary directory. + + `PAPERLESS_POST_CONSUME_SCRIPT=` + + : After a document is consumed, Paperless can trigger an arbitrary + script if you like. This script will be passed a number of arguments + for you to work with. 
For more information, take a look at [Post-consumption script](advanced_usage#post_consume_script). + + The default is blank, which means nothing will be executed. + + `PAPERLESS_FILENAME_DATE_ORDER=` + + : Paperless will check the document text for document date + information. Use this setting to enable checking the document + filename for date information. The date order can be set to any + option as specified in + . + The filename will be checked first, and if nothing is found, the + document text will be checked as normal. + + A date in a filename must have some separators ([.]{.title-ref}, + [-]{.title-ref}, [/]{.title-ref}, etc) for it to be parsed. + + Defaults to none, which disables this feature. + + `PAPERLESS_NUMBER_OF_SUGGESTED_DATES=` + + : Paperless searches an entire document for dates. The first date + found will be used as the initial value for the created date. When + this variable is greater than 0 (or left at its default value), + paperless will also suggest other dates found in the document, up to + a maximum of this setting. Note that duplicates will be removed, + which can result in fewer dates displayed in the frontend than this + setting value. + + The task to find all dates can be time-consuming and increases with + a higher (maximum) number of suggested dates and slower hardware. + + Defaults to 3. Set to 0 to disable this feature. + + `PAPERLESS_THUMBNAIL_FONT_NAME=` + + : Paperless creates thumbnails for plain text files by rendering the + content of the file on an image and uses a predefined font for that. + This font can be changed here. + + Note that this won't have any effect on already generated + thumbnails. + + Defaults to + `/usr/share/fonts/liberation/LiberationSerif-Regular.ttf`. + + `PAPERLESS_IGNORE_DATES=` + + : Paperless parses a document's creation date from filename and file + content. You may specify a comma-separated list of dates that should + be ignored during this process.
This is useful for special dates + (like date of birth) that appear in documents regularly but are very + unlikely to be the documents creation date. + + The date is parsed using the order specified in PAPERLESS_DATE_ORDER + + Defaults to an empty string to not ignore any dates. + + `PAPERLESS_DATE_ORDER=` + + : Paperless will try to determine the document creation date from its + contents. Specify the date format Paperless should expect to see + within your documents. + + This option defaults to DMY which translates to day first, month + second, and year last order. Characters D, M, or Y can be shuffled + to meet the required order. + + `PAPERLESS_CONSUMER_IGNORE_PATTERNS=` + + : By default, paperless ignores certain files and folders in the + consumption directory, such as system files created by the Mac OS. + + This can be adjusted by configuring a custom json array with + patterns to exclude. + + Defaults to + `[".DS_STORE/*", "._*", ".stfolder/*", ".stversions/*", ".localized/*", "desktop.ini"]`. + + ## Binaries + + There are a few external software packages that Paperless expects to + find on your system when it starts up. Unless you've done something + creative with their installation, you probably won't need to edit any + of these. However, if you've installed these programs somewhere where + simply typing the name of the program doesn't automatically execute it + (ie. the program isn't in your \$PATH), then you'll need to specify + the literal path for that program. + + `PAPERLESS_CONVERT_BINARY=` + + : Defaults to "convert". + + `PAPERLESS_GS_BINARY=` + + : Defaults to "gs". + + ## Docker-specific options {#docker} + + These options don't have any effect in `paperless.conf`. These options + adjust the behavior of the docker container. Configure these in + [docker-compose.env]{.title-ref}. + + `PAPERLESS_WEBSERVER_WORKERS=` + + : The number of worker processes the webserver should spawn. 
More + worker processes usually let the front end load data much + quicker. However, each worker process also loads the entire + application into memory separately, so increasing this value will + increase RAM usage. + + Defaults to 1. + + `PAPERLESS_BIND_ADDR=` + + : The IP address the webserver will listen on inside the container. + There are special setups where you may need to configure this value + to restrict the IP address or interface the webserver listens on. + + Defaults to \[::\], meaning all interfaces, including IPv6. + + `PAPERLESS_PORT=` + + : The port number the webserver will listen on inside the container. + There are special setups where you may need this to avoid collisions + with other services (like using podman with multiple containers in + one pod). + + Don't change this when using Docker. To change the port on which the + webserver is reachable outside of the container, instead refer to + the "ports" key in `docker-compose.yml`. + + Defaults to 8000. + + `USERMAP_UID=` + + : The ID of the paperless user in the container. Set this to your + actual user ID on the host system, which you can get by executing + + ``` shell-session + $ id -u + ``` + + Paperless will change ownership on its folders to this user, so you + need to get this right in order to be able to write to the + consumption directory. + + Defaults to 1000. + + `USERMAP_GID=` + + : The ID of the paperless group in the container. Set this to your + actual group ID on the host system, which you can get by executing + + ``` shell-session + $ id -g + ``` + + Paperless will change ownership on its folders to this group, so you + need to get this right in order to be able to write to the + consumption directory. + + Defaults to 1000. + + `PAPERLESS_OCR_LANGUAGES=` + + : Additional OCR languages to install. By default, paperless comes + with English, German, Italian, Spanish and French.
If your language + is not in this list, install additional languages with this + configuration option: + + ``` bash + PAPERLESS_OCR_LANGUAGES=tur ces + ``` + + To actually use these languages, also set the default OCR language + of paperless: + + ``` bash + PAPERLESS_OCR_LANGUAGE=tur + ``` + + Defaults to none, which does not install any additional languages. + + `PAPERLESS_ENABLE_FLOWER=` + + : If this environment variable is defined, the Celery monitoring tool + [Flower](https://flower.readthedocs.io/en/latest/index.html) will be + started by the container. + + You can read more about this in the [advanced documentation](advanced#celery-monitoring). + + ## Update Checking {#update-checking} + + `PAPERLESS_ENABLE_UPDATE_CHECK=` + + !!! note + + This setting was deprecated in favor of a frontend setting after + v1.9.2. A one-time migration is performed for users who have this + setting set. This setting is always ignored if the corresponding + frontend setting has been set. diff --cc docs/troubleshooting.md index 0000000000,53d0e1de33..329de94db7 mode 000000,100644..100644 --- a/docs/troubleshooting.md +++ b/docs/troubleshooting.md @@@ -1,0 -1,334 +1,334 @@@ + # Troubleshooting + + ## No files are added by the consumer + + Check for the following issues: + + - Ensure that the directory you're putting your documents in is the + folder paperless is watching. With docker, this setting is performed + in the `docker-compose.yml` file. Without docker, look at the + `CONSUMPTION_DIR` setting. Don't adjust this setting if you're + using docker. + + - Ensure that redis is up and running. Paperless does its task + processing asynchronously, and for documents to arrive at the task + processor, it needs redis to run. + + - Ensure that the task processor is running. Docker does this + automatically. Manually invoke the task processor by executing + + ```shell-session + $ celery --app paperless worker + ``` + + - Look at the output of paperless and inspect it for any errors. 
+ + - Go to the admin interface, and check if there are failed tasks. If + so, the tasks will contain an error message. + + ## Consumer warns `OCR for XX failed` + + If you find the OCR accuracy to be too low, and/or the document consumer + warns that + `OCR for XX failed, but we're going to stick with what we've got since FORGIVING_OCR is enabled`, + then you might need to install the [Tesseract language + files](http://packages.ubuntu.com/search?keywords=tesseract-ocr) + matching your document's languages. + + As an example, if you are running Paperless-ngx from any Ubuntu or + Debian box, and your documents are written in Spanish you may need to + run: + + apt-get install -y tesseract-ocr-spa + + ## Consumer fails to pickup any new files + + If you notice that the consumer will only pick up files in the + consumption directory at startup, but won't find any other files added + later, you will need to enable filesystem polling with the configuration + option `PAPERLESS_CONSUMER_POLLING`, see + [here](/configuration#polling). + + This will disable listening to filesystem changes with inotify and + paperless will manually check the consumption directory for changes + instead. + + ## Paperless always redirects to /admin + + You probably had the old paperless installed at some point. Paperless + installed a permanent redirect to /admin in your browser, and you need + to clear your browsing data / cache to fix that. + + ## Operation not permitted + + You might see errors such as: + + ```shell-session + chown: changing ownership of '../export': Operation not permitted + ``` + + The container tries to set file ownership on the listed directories. + This is required so that the user running paperless inside docker has + write permissions to these folders. The error occurs when these + directories point to NFS shares, for example. + + Ensure that `chown` is possible on these directories.
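
One way to check this before starting the container is to attempt the ownership change yourself. The following is only a sketch, not something paperless ships: `can_chown` is a hypothetical helper name, and the function really does change ownership when it succeeds, so pass the UID and GID you actually intend to use.

```python
import os


def can_chown(path, uid, gid):
    """Try to set ownership of `path` to uid:gid.

    Returns False when the OS refuses with "Operation not permitted"
    (or another permission error), True when the call succeeds.
    Note: on success the ownership really is changed.
    """
    try:
        os.chown(path, uid, gid)
    except PermissionError:
        return False
    return True
```

Calling it with the path of the directory you mount into the container and the values of `USERMAP_UID` and `USERMAP_GID` should return `True`. Run it as the same user the container starts as, since a share may accept the change from one user and refuse it from another.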
+ + ## Classifier error: No training data available + + This indicates that the Auto matching algorithm found no documents to + learn from. This may have two reasons: + + - You don't use the Auto matching algorithm: The error can be safely + ignored in this case. + - You are using the Auto matching algorithm: The classifier explicitly + excludes documents with Inbox tags. Verify that there are documents + in your archive without inbox tags. The algorithm will only learn + from documents not in your inbox. + + ## UserWarning in sklearn on every single document + + You may encounter warnings like this: + + ``` + /usr/local/lib/python3.7/site-packages/sklearn/base.py:315: + UserWarning: Trying to unpickle estimator CountVectorizer from version 0.23.2 when using version 0.24.0. + This might lead to breaking code or invalid results. Use at your own risk. + ``` + + This happens when certain dependencies of paperless that are responsible + for the auto matching algorithm are updated. After updating these, your + current training data _might_ not be compatible anymore. This can be + ignored in most cases. This warning will disappear automatically when + paperless updates the training data. + + If you want to get rid of the warning or actually experience issues with + automatic matching, delete the file `classification_model.pickle` in the + data directory and let paperless recreate it. + + ## 504 Server Error: Gateway Timeout when adding Office documents + + You may experience these errors when using the optional TIKA + integration: + + ``` + requests.exceptions.HTTPError: 504 Server Error: Gateway Timeout for url: http://gotenberg:3000/forms/libreoffice/convert + ``` + + Gotenberg is a server that converts Office documents into PDF documents + and has a default timeout of 30 seconds. When conversion takes longer, + Gotenberg raises this error. 
+ + You can increase the timeout by configuring a command flag for Gotenberg + (see also [here](https://gotenberg.dev/docs/modules/api#properties)). If + using docker-compose, this is achieved by the following configuration + change in the `docker-compose.yml` file: + + ```yaml -gotenberg: - image: gotenberg/gotenberg:7.6 - restart: unless-stopped ++ # The gotenberg chromium route is used to convert .eml files. We do not ++ # want to allow external content like tracking pixels or even javascript. + command: - - 'gotenberg' - - '--chromium-disable-routes=true' - - '--api-timeout=60' ++ - "gotenberg" ++ - "--chromium-disable-javascript=true" ++ - "--chromium-allow-list=file:///tmp/.*" ++ - "--api-timeout=60" + ``` + + ## Permission denied errors in the consumption directory + + You might encounter errors such as: + + ```shell-session + The following error occured while consuming document.pdf: [Errno 13] Permission denied: '/usr/src/paperless/src/../consume/document.pdf' + ``` + + This happens when paperless does not have permission to delete files + inside the consumption directory. Ensure that `USERMAP_UID` and + `USERMAP_GID` are set to the user id and group id you use on the host + operating system, if these are different from `1000`. See [Docker setup](setup#docker_hub). + + Also ensure that you are able to read and write to the consumption + directory on the host. 
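
A quick way to verify this is to run a small check as the user whose IDs you put into `USERMAP_UID` and `USERMAP_GID`. This is a minimal sketch under that assumption; `check_consume_dir` is a hypothetical helper, not part of paperless.

```python
import os


def check_consume_dir(path):
    """Return a list of problems found for the current user on `path`.

    An empty list means the directory can be listed and that files
    inside it can be created and deleted by this user.
    """
    if not os.path.isdir(path):
        return ["{} is not a directory".format(path)]
    problems = []
    # Listing a directory needs read and execute permission on it.
    if not os.access(path, os.R_OK | os.X_OK):
        problems.append("cannot list files in {}".format(path))
    # Creating or deleting entries needs write and execute permission.
    if not os.access(path, os.W_OK | os.X_OK):
        problems.append("cannot create or delete files in {}".format(path))
    return problems
```

Keep in mind that `os.access` reports what the calling user may do, so the result is only meaningful when the script runs under the same user and group IDs as the container.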
+ + ## OSError: \[Errno 19\] No such device when consuming files + + If you experience errors such as: + + ```shell-session + File "/usr/local/lib/python3.7/site-packages/whoosh/codec/base.py", line 570, in open_compound_file + return CompoundStorage(dbfile, use_mmap=storage.supports_mmap) + File "/usr/local/lib/python3.7/site-packages/whoosh/filedb/compound.py", line 75, in __init__ + self._source = mmap.mmap(fileno, 0, access=mmap.ACCESS_READ) + OSError: [Errno 19] No such device + + During handling of the above exception, another exception occurred: + + Traceback (most recent call last): + File "/usr/local/lib/python3.7/site-packages/django_q/cluster.py", line 436, in worker + res = f(*task["args"], **task["kwargs"]) + File "/usr/src/paperless/src/documents/tasks.py", line 73, in consume_file + override_tag_ids=override_tag_ids) + File "/usr/src/paperless/src/documents/consumer.py", line 271, in try_consume_file + raise ConsumerError(e) + ``` + + Paperless uses a search index to provide better and faster full text + searching. This search index is stored inside the `data` folder. The + search index uses memory-mapped files (mmap). The above error indicates + that paperless was unable to create and open these files. + + This happens when you're trying to store the data directory on certain + file systems (mostly network shares) that don't support memory-mapped + files. + + ## Web-UI stuck at "Loading\..." + + This might have multiple reasons. + + 1. If you built the docker image yourself or deployed using the bare + metal route, make sure that there are files in + `/static/frontend//`. If there are no + files, make sure that you executed `collectstatic` successfully, + either manually or as part of the docker image build. + + If the front end is still missing, make sure that the front end is + compiled (files present in `src/documents/static/frontend`). 
If it + is not, you need to compile the front end yourself or download the + release archive instead of cloning the repository. + + 2. Check the output of the web server. You might see errors like this: + + ``` + [2021-01-25 10:08:04 +0000] [40] [ERROR] Socket error processing request. + Traceback (most recent call last): + File "/usr/local/lib/python3.7/site-packages/gunicorn/workers/sync.py", line 134, in handle + self.handle_request(listener, req, client, addr) + File "/usr/local/lib/python3.7/site-packages/gunicorn/workers/sync.py", line 190, in handle_request + util.reraise(*sys.exc_info()) + File "/usr/local/lib/python3.7/site-packages/gunicorn/util.py", line 625, in reraise + raise value + File "/usr/local/lib/python3.7/site-packages/gunicorn/workers/sync.py", line 178, in handle_request + resp.write_file(respiter) + File "/usr/local/lib/python3.7/site-packages/gunicorn/http/wsgi.py", line 396, in write_file + if not self.sendfile(respiter): + File "/usr/local/lib/python3.7/site-packages/gunicorn/http/wsgi.py", line 386, in sendfile + sent += os.sendfile(sockno, fileno, offset + sent, count) + OSError: [Errno 22] Invalid argument + ``` + + To fix this issue, add + + ``` + SENDFILE=0 + ``` + + to your [docker-compose.env]{.title-ref} file. + + ## Error while reading metadata + + You might find messages like these in your log files: + + ``` + [WARNING] [paperless.parsing.tesseract] Error while reading metadata + ``` + + This indicates that paperless failed to read PDF metadata from one of + your documents. This happens when you open the affected documents in + paperless for editing. Paperless will continue to work, and will simply + not show the invalid metadata. 
+ + ## Consumer fails with a FileNotFoundError + + You might find messages like these in your log files: + + ``` + [ERROR] [paperless.consumer] Error while consuming document SCN_0001.pdf: FileNotFoundError: [Errno 2] No such file or directory: '/tmp/ocrmypdf.io.yhk3zbv0/origin.pdf' + Traceback (most recent call last): + File "/app/paperless/src/paperless_tesseract/parsers.py", line 261, in parse + ocrmypdf.ocr(**args) + File "/usr/local/lib/python3.8/dist-packages/ocrmypdf/api.py", line 337, in ocr + return run_pipeline(options=options, plugin_manager=plugin_manager, api=True) + File "/usr/local/lib/python3.8/dist-packages/ocrmypdf/_sync.py", line 385, in run_pipeline + exec_concurrent(context, executor) + File "/usr/local/lib/python3.8/dist-packages/ocrmypdf/_sync.py", line 302, in exec_concurrent + pdf = post_process(pdf, context, executor) + File "/usr/local/lib/python3.8/dist-packages/ocrmypdf/_sync.py", line 235, in post_process + pdf_out = metadata_fixup(pdf_out, context) + File "/usr/local/lib/python3.8/dist-packages/ocrmypdf/_pipeline.py", line 798, in metadata_fixup + with pikepdf.open(context.origin) as original, pikepdf.open(working_file) as pdf: + File "/usr/local/lib/python3.8/dist-packages/pikepdf/_methods.py", line 923, in open + pdf = Pdf._open( + FileNotFoundError: [Errno 2] No such file or directory: '/tmp/ocrmypdf.io.yhk3zbv0/origin.pdf' + ``` + + This probably indicates paperless tried to consume the same file twice. + This can happen for a number of reasons, depending on how documents are + placed into the consume folder. If paperless is using inotify (the + default) to check for documents, try adjusting the + [inotify configuration](/configuration#inotify). If polling is enabled, try adjusting the + [polling configuration](/configuration#polling). + + ## Consumer fails waiting for file to remain unmodified. 
+ + You might find messages like these in your log files: + + ``` + [ERROR] [paperless.management.consumer] Timeout while waiting on file /usr/src/paperless/src/../consume/SCN_0001.pdf to remain unmodified. + ``` + + This indicates paperless timed out while waiting for the file to be + completely written to the consume folder. Adjusting + [polling configuration](/configuration#polling) values should resolve the issue. + + !!! note + + The user will need to manually move the file out of the consume folder + and back in, for the initial failing file to be consumed. + + ## Consumer fails reporting "OS reports file as busy still". + + You might find messages like these in your log files: + + ``` + [WARNING] [paperless.management.consumer] Not consuming file /usr/src/paperless/src/../consume/SCN_0001.pdf: OS reports file as busy still + ``` + + This indicates paperless was unable to open the file, as the OS reported + the file as still being in use. To prevent a crash, paperless did not + try to consume the file. If paperless is using inotify (the default) to + check for documents, try adjusting the + [inotify configuration](/configuration#inotify). If polling is enabled, try adjusting the + [polling configuration](/configuration#polling). + + !!! note + + The user will need to manually move the file out of the consume folder + and back in, for the initial failing file to be consumed. + + ## Log reports "Creating PaperlessTask failed". + + You might find messages like these in your log files: + + ``` + [ERROR] [paperless.management.consumer] Creating PaperlessTask failed: db locked + ``` + + You are likely using an sqlite based installation, with an increased + number of workers and are running into sqlite's concurrency + limitations. Uploading or consuming multiple files at once results in + many workers attempting to access the database simultaneously. + + Consider changing to the PostgreSQL database if you will be processing + many documents at once often. 
Otherwise, try tweaking the + `PAPERLESS_DB_TIMEOUT` setting to allow more time for the database to + unlock. This may have minor performance implications. + + ## gunicorn fails to start with "is not a valid port number" + + You are likely running using Kubernetes, which automatically creates an + environment variable named [\${serviceName}\_PORT]{.title-ref}. This is + the same environment variable which is used by Paperless to optionally + change the port gunicorn listens on. + + To fix this, set [PAPERLESS_PORT]{.title-ref} again to your desired + port, or the default of 8000.
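
To see why the injected value is rejected, it helps to look at its shape. For a Service named `paperless`, Kubernetes typically sets the variable to something like `tcp://10.0.0.1:8000` (the address here is made up) instead of a bare number. The sketch below is only an illustration of that mismatch; `is_valid_port` is a hypothetical helper and not how gunicorn or paperless validate the setting.

```python
def is_valid_port(value):
    """Return True if `value` is a plain TCP port number (1-65535)."""
    try:
        port = int(value)
    except (TypeError, ValueError):
        # Covers None and strings such as "tcp://10.0.0.1:8000".
        return False
    return 0 < port < 65536
```

With this check, `"8000"` is accepted while the Kubernetes-style value is not, which is why overriding the variable with a bare port number resolves the error.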