Feature: collate two single-sided multipage scans (#3784)

author Dennis Brakhane <brakhane@gmail.com>

Mon, 24 Jul 2023 07:29:04 +0000 (09:29 +0200)

committer GitHub <noreply@github.com>

Mon, 24 Jul 2023 07:29:04 +0000 (00:29 -0700)
diff --git a/docs/advanced_usage.md b/docs/advanced_usage.md

index 094943f3eb6442aa7d412ecb1151b650695c347d..199931bb5924548b0478a44aaf5280c862d0ad76 100644 (file)
--- a/docs/advanced_usage.md
+++ b/docs/advanced_usage.md
@@ -528,7 +528,7 @@ For how to enable barcode usage, see [the configuration](/configuration#barcodes
  The two settings may be enabled independently, but do have interactions as explained
  below.
  
-### Document Splitting
+### Document Splitting {#document-splitting}
  
  When enabled, Paperless will look for a barcode with the configured value and create a new document
  starting from the next page. The page with the barcode on it will _not_ be retained. It
@@ -543,3 +543,69 @@ If document splitting via barcode is also enabled, documents will be split when
  barcode is located. However, differing from the splitting, the page with the
  barcode _will_ be retained. This allows application of a barcode to any page, including
  one which holds data to keep in the document.
+
+## Automatic collation of double-sided documents {#collate}
+
+!!! note
+
+    If your scanner supports double-sided scanning natively, you do not need this feature.
+
+This feature is turned off by default; see [configuration](/configuration#collate) for how to turn it on.
+
+### Summary
+
+If you have a scanner with an automatic document feeder (ADF) that only scans a single side,
+this feature makes scanning double-sided documents much more convenient by automatically
+collating two separate scans into one document, reordering the pages as necessary.
+
+### Usage example
+
+Suppose you have a double-sided document with 6 pages (3 sheets of paper). First,
+put the stack into your ADF as normal, ensuring that page 1 is scanned first. Your ADF
+will now scan pages 1, 3, and 5. Then you (or your scanner, if it supports it) upload
+the scan into the correct sub-directory of the consume folder (`double-sided` by default;
+keep in mind that Paperless will _not_ automatically create the directory for you).
+Paperless will then process the scan and move it into an internal staging area.
+
+The next step is to turn your stack upside down (without reordering the sheets of paper),
+and scan it once again. Your ADF will now scan pages 6, 4, and 2, in that order. Once this
+scan is copied into the sub-directory, Paperless will collate the previous scan with the
+new one, reversing the order of the pages on the second, "even numbered" scan. The
+resulting document will have the pages 1-6 in the correct order, and this new file will
+then be processed as normal.
+
+!!! tip
+
+    When scanning the even numbered pages, you can omit the last empty pages, if there are
+    any. For example, if page 6 is empty, you only need to scan pages 2 and 4. _Do not_ omit
+    empty pages in the middle of the document.
+
+### Things that could go wrong
+
+Paperless will notice when the first, "odd numbered" scan has fewer pages than the second
+scan (this can happen when, e.g., the ADF skipped a few pages in the first pass). In that
+case, Paperless will remove the staging copy as well as the scan, and give you an error
+message asking you to restart the process from scratch, by scanning the odd pages again,
+followed by the even pages.
+
+Another thing that might happen is that you start a double-sided scan, but then forget
+to upload the second file. To avoid collating the wrong documents if you then come back
+a day later to scan a new double-sided document, Paperless will only keep an "odd numbered
+pages" file for up to 30 minutes. If more time passes, it will consider the next incoming
+scan a completely new "odd numbered pages" one. The old staging file will get discarded.
+
+### Interaction with "subdirs as tags"
+
+The collation feature can be used together with the "subdirs as tags" feature (but this is not
+a requirement). Just create a correctly named double-sided subdir in the hierarchy and upload
+your scans there. For example, both `double-sided/foo/bar` as well as `foo/bar/double-sided` will
+cause the collated document to be treated as if it were uploaded into `foo/bar` and receive both
+`foo` and `bar` tags, but not `double-sided`.
+
+### Interaction with document splitting
+
+You can use the [document splitting](#document-splitting) feature, but if you use a normal
+single-sided split marker page, the split document(s) will have an empty page at the front (or
+whatever else was on the back of the split marker page). You can work around that by having
+a split marker page that has the split barcode on _both_ sides. This way, the extra page will
+get automatically removed.
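
The page order described in the usage example above can be checked with a small, library-free sketch. This is only an illustration using plain lists of page numbers; the function name `collate_pages` is hypothetical and not part of the commit. Unlike the code in `double_sided.py` further down, which relies on an `IndexError` raised while inserting pages, this sketch compares the lengths explicitly, because Python's `list.insert` accepts out-of-range positions without complaint.

```python
from typing import List


def collate_pages(odd_scan: List[int], even_scan: List[int]) -> List[int]:
    """Interleave two single-sided scans (hypothetical helper, lists only).

    odd_scan holds the pages in the order of the first pass (1, 3, 5, ...).
    even_scan holds the pages in the order of the second pass, which is
    back to front (6, 4, 2, ...) because the stack was turned over.
    """
    if len(even_scan) > len(odd_scan):
        # Mirrors the failure case described in the documentation above
        raise ValueError("second scan has more pages than the first scan")

    merged = list(odd_scan)
    # Reverse the second scan, then place each page after its odd neighbour
    for i, page in enumerate(reversed(even_scan)):
        merged.insert(2 * i + 1, page)
    return merged


if __name__ == "__main__":
    print(collate_pages([1, 3, 5], [6, 4, 2]))
    # Trailing empty page 6 omitted from the second pass
    print(collate_pages([1, 3, 5], [4, 2]))
```

The second call corresponds to the tip above: a trailing empty even page may be left out, while an empty page in the middle may not, since every later page would then land one position too early.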
diff --git a/docs/configuration.md b/docs/configuration.md

index 8f587d8acbafb11f08e1d3e305c8dc8d0daf847a..0ed2218a69b80aa5f5e3fecc213122f9bab91280 100644 (file)
--- a/docs/configuration.md
+++ b/docs/configuration.md
@@ -1116,6 +1116,43 @@ combination with PAPERLESS_CONSUMER_BARCODE_UPSCALE bigger than 1.0.
  
      Defaults to "300"
  
+## Collate Double-Sided Documents {#collate}
+
+`PAPERLESS_CONSUMER_ENABLE_COLLATE_DOUBLE_SIDED=<bool>`
+
+: Enables automatic collation of two single-sided scans into a double-sided
+document.
+
+    This is useful if you have an automatic document feeder that only supports
+    single-sided scans, but you need to scan a double-sided document. If your
+    ADF supports double-sided scans natively, you do not need this feature.
+
+    `PAPERLESS_CONSUMER_RECURSIVE` must be enabled for this to work.
+
+    For more information, read the [corresponding section in the advanced
+    documentation](/advanced_usage#collate).
+
+    Defaults to false.
+
+`PAPERLESS_CONSUMER_COLLATE_DOUBLE_SIDED_SUBDIR_NAME=<str>`
+
+: The name of the subdirectory in which the collate feature expects documents to
+arrive.
+
+    This only has an effect if `PAPERLESS_CONSUMER_ENABLE_COLLATE_DOUBLE_SIDED`
+    has been enabled. Note that Paperless will not automatically create the
+    directory.
+
+    Defaults to "double-sided".
+
+`PAPERLESS_CONSUMER_COLLATE_DOUBLE_SIDED_TIFF_SUPPORT=<bool>`
+: Whether TIFF image files should be supported when collating documents.
+This will automatically convert any TIFF image(s) to PDFs for later
+processing. This only has an effect if
+`PAPERLESS_CONSUMER_ENABLE_COLLATE_DOUBLE_SIDED` has been enabled.
+
+    Defaults to false.
+
  ## Binaries
  
  There are a few external software packages that Paperless expects to
diff --git a/paperless.conf.example b/paperless.conf.example

index 9b168db0cc21fff62f5aa4fd60f0a29e8037b226..1610dcda9552c5b3333387d94192e2b6d6b08396 100644 (file)
--- a/paperless.conf.example
+++ b/paperless.conf.example
@@ -68,6 +68,9 @@
  #PAPERLESS_CONSUMER_BARCODE_STRING=PATCHT
  #PAPERLESS_CONSUMER_BARCODE_UPSCALE=0.0
  #PAPERLESS_CONSUMER_BARCODE_DPI=300
+#PAPERLESS_CONSUMER_ENABLE_COLLATE_DOUBLE_SIDED=false
+#PAPERLESS_CONSUMER_COLLATE_DOUBLE_SIDED_SUBDIR_NAME=double-sided
+#PAPERLESS_CONSUMER_COLLATE_DOUBLE_SIDED_TIFF_SUPPORT=false
  #PAPERLESS_PRE_CONSUME_SCRIPT=/path/to/an/arbitrary/script.sh
  #PAPERLESS_POST_CONSUME_SCRIPT=/path/to/an/arbitrary/script.sh
  #PAPERLESS_FILENAME_DATE_ORDER=YMD
diff --git a/src/documents/barcodes.py b/src/documents/barcodes.py

index cabc195b3956bcb862e9aa90dd7b022b9902618c..b64f531d8f0859c8cb435e7177a21443d5715f20 100644 (file)
--- a/src/documents/barcodes.py
+++ b/src/documents/barcodes.py
@@ -2,13 +2,11 @@ import logging
  import tempfile
  from dataclasses import dataclass
  from pathlib import Path
-from subprocess import run
  from typing import Dict
  from typing import Final
  from typing import List
  from typing import Optional
  
-import img2pdf
  from django.conf import settings
  from pdf2image import convert_from_path
  from pdf2image.exceptions import PDFPageCountError
@@ -16,6 +14,7 @@ from pikepdf import Page
  from pikepdf import Pdf
  from PIL import Image
  
+from documents.converters import convert_from_tiff_to_pdf
  from documents.data_models import DocumentSource
  from documents.utils import copy_basic_file_stats
  from documents.utils import copy_file_with_basic_stats
@@ -55,7 +54,7 @@ class BarcodeReader:
          self.mime: Final[str] = mime_type
          self.pdf_file: Path = self.file
          self.barcodes: List[Barcode] = []
-        self.temp_dir: Optional[Path] = None
+        self.temp_dir: Optional[tempfile.TemporaryDirectory] = None
  
          if settings.CONSUMER_BARCODE_TIFF_SUPPORT:
              self.SUPPORTED_FILE_MIMES = {"application/pdf", "image/tiff"}
@@ -155,34 +154,7 @@ class BarcodeReader:
          if self.mime != "image/tiff":
              return
  
-        with Image.open(self.file) as im:
-            has_alpha_layer = im.mode in ("RGBA", "LA")
-        if has_alpha_layer:
-            # Note the save into the temp folder, so as not to trigger a new
-            # consume
-            scratch_image = Path(self.temp_dir.name) / Path(self.file.name)
-            run(
-                [
-                    settings.CONVERT_BINARY,
-                    "-alpha",
-                    "off",
-                    self.file,
-                    scratch_image,
-                ],
-            )
-        else:
-            # Not modifying the original, safe to use in place
-            scratch_image = self.file
-
-        self.pdf_file = Path(self.temp_dir.name) / Path(self.file.name).with_suffix(
-            ".pdf",
-        )
-
-        with scratch_image.open("rb") as img_file, self.pdf_file.open("wb") as pdf_file:
-            pdf_file.write(img2pdf.convert(img_file))
-
-        # Copy what file stat is possible
-        copy_basic_file_stats(self.file, self.pdf_file)
+        self.pdf_file = convert_from_tiff_to_pdf(self.file, Path(self.temp_dir.name))
  
      def detect(self) -> None:
          """
diff --git a/src/documents/converters.py b/src/documents/converters.py

new file mode 100644 (file)

index 0000000..e3a7cb7
--- /dev/null
+++ b/src/documents/converters.py
@@ -0,0 +1,46 @@
+from pathlib import Path
+from subprocess import run
+
+import img2pdf
+from django.conf import settings
+from PIL import Image
+
+from documents.utils import copy_basic_file_stats
+
+
+def convert_from_tiff_to_pdf(tiff_path: Path, target_directory: Path) -> Path:
+    """
+    Converts a TIFF file into a PDF file.
+
+    The PDF will be created in the given target_directory and share the name of
+    the original TIFF file, as well as its stats (mtime etc.).
+
+    Returns the path of the PDF created.
+    """
+    with Image.open(tiff_path) as im:
+        has_alpha_layer = im.mode in ("RGBA", "LA")
+    if has_alpha_layer:
+        # Note the save into the temp folder, so as not to trigger a new
+        # consume
+        scratch_image = target_directory / tiff_path.name
+        run(
+            [
+                settings.CONVERT_BINARY,
+                "-alpha",
+                "off",
+                tiff_path,
+                scratch_image,
+            ],
+        )
+    else:
+        # Not modifying the original, safe to use in place
+        scratch_image = tiff_path
+
+    pdf_path = (target_directory / tiff_path.name).with_suffix(".pdf")
+
+    with scratch_image.open("rb") as img_file, pdf_path.open("wb") as pdf_file:
+        pdf_file.write(img2pdf.convert(img_file))
+
+    # Copy what file stat is possible
+    copy_basic_file_stats(tiff_path, pdf_path)
+    return pdf_path
diff --git a/src/documents/double_sided.py b/src/documents/double_sided.py

new file mode 100644 (file)

index 0000000..4e6b8b7
--- /dev/null
+++ b/src/documents/double_sided.py
@@ -0,0 +1,131 @@
+import datetime as dt
+import logging
+import os
+import shutil
+from pathlib import Path
+
+from django.conf import settings
+from pikepdf import Pdf
+
+from documents.consumer import ConsumerError
+from documents.converters import convert_from_tiff_to_pdf
+from documents.data_models import ConsumableDocument
+
+logger = logging.getLogger("paperless.double_sided")
+
+# Hardcoded for now, could be made a configurable setting if needed
+TIMEOUT_MINUTES = 30
+
+# Used by test cases
+STAGING_FILE_NAME = "double-sided-staging.pdf"
+
+
+def collate(input_doc: ConsumableDocument) -> str:
+    """
+    Tries to collate pages from 2 single sided scans of a double sided
+    document.
+
+    When called with a file, it checks whether or not a staging file
+    exists. If not, the current file is turned into that staging file
+    containing the odd numbered pages.
+
+    If a staging file exists, and it is not too old, the current file is
+    considered to be the second part (the even numbered pages) and it will
+    collate the pages of both. The pages of the second file will be added
+    in reverse order, since the ADF will have scanned the pages from bottom
+    to top.
+
+    Returns a status message on success, or raises a ConsumerError
+    in case of failure.
+    """
+
+    # Make sure scratch dir exists, Consumer might not have run yet
+    settings.SCRATCH_DIR.mkdir(exist_ok=True)
+
+    if input_doc.mime_type == "application/pdf":
+        pdf_file = input_doc.original_file
+    elif (
+        input_doc.mime_type == "image/tiff"
+        and settings.CONSUMER_COLLATE_DOUBLE_SIDED_TIFF_SUPPORT
+    ):
+        pdf_file = convert_from_tiff_to_pdf(
+            input_doc.original_file,
+            settings.SCRATCH_DIR,
+        )
+        input_doc.original_file.unlink()
+    else:
+        raise ConsumerError("Unsupported file type for collation of double-sided scans")
+
+    staging = settings.SCRATCH_DIR / STAGING_FILE_NAME
+
+    valid_staging_exists = False
+    if staging.exists():
+        stats = os.stat(str(staging))
+        # if the file is older than the timeout, we don't consider
+        # it valid
+        if dt.datetime.now().timestamp() - stats.st_mtime > TIMEOUT_MINUTES * 60:
+            logger.warning("Outdated double sided staging file exists, deleting it")
+            os.unlink(str(staging))
+        else:
+            valid_staging_exists = True
+
+    if valid_staging_exists:
+        try:
+            # Collate pages from second PDF in reverse order
+            with Pdf.open(staging) as pdf1, Pdf.open(pdf_file) as pdf2:
+                pdf2.pages.reverse()
+                try:
+                    for i, page in enumerate(pdf2.pages):
+                        pdf1.pages.insert(2 * i + 1, page)
+                except IndexError:
+                    raise ConsumerError(
+                        "This second file (even numbered pages) contains more "
+                        "pages than the first/odd numbered one. This means the "
+                        "two uploaded files don't belong to the same double-"
+                        "sided scan. Please retry, starting with the odd "
+                        "numbered pages again.",
+                    )
+                # Merged file has the same path, but without the
+                # double-sided subdir. Therefore, it is also in the
+                # consumption dir and will be picked up for processing
+                old_file = input_doc.original_file
+                new_file = Path(
+                    *(
+                        part
+                        for part in old_file.with_name(
+                            f"{old_file.stem}-collated.pdf",
+                        ).parts
+                        if part != settings.CONSUMER_COLLATE_DOUBLE_SIDED_SUBDIR_NAME
+                    ),
+                )
+                # If the user didn't create the subdirs yet, do it for them
+                new_file.parent.mkdir(parents=True, exist_ok=True)
+                pdf1.save(new_file)
+            logger.info("Collated documents into new file %s", new_file)
+            return (
+                "Success. Even numbered pages of double sided scan collated "
+                "with odd pages"
+            )
+        finally:
+            # Delete staging and recently uploaded file no matter what.
+            # If any error occurs, the user needs to be able to restart
+            # the process from scratch; after all, the staging file
+            # with the odd numbered pages might be the culprit
+            pdf_file.unlink()
+            staging.unlink()
+
+    else:
+        # In Python 3.9 move supports Path objects directly,
+        # but for now we have to be compatible with 3.8
+        shutil.move(str(pdf_file), str(staging))
+        # update access and modification time so we know if the file
+        # is outdated when another file gets uploaded
+        os.utime(str(staging), (dt.datetime.now().timestamp(),) * 2)
+        logger.info(
+            "Got scan with odd numbered pages of double-sided scan, moved it to %s",
+            staging,
+        )
+        return (
+            "Received odd numbered pages of double sided scan, waiting up to "
+            f"{TIMEOUT_MINUTES} minutes for even numbered pages"
+        )
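
The staging-file timeout in `collate` comes down to comparing the file's modification time with the current time. The following standalone sketch shows that check in isolation. The helper name `staging_is_valid` and its `now` parameter are assumptions made for illustration; the commit performs the check inline and also deletes an outdated staging file, which this sketch deliberately does not do.

```python
import time
from pathlib import Path
from typing import Optional

# Same value as the constant in double_sided.py
TIMEOUT_MINUTES = 30


def staging_is_valid(staging: Path, now: Optional[float] = None) -> bool:
    """Return True if a staging file exists and is not older than the timeout.

    Hypothetical helper: `now` can be passed in to make the check testable.
    """
    if not staging.exists():
        return False
    if now is None:
        now = time.time()
    age_seconds = now - staging.stat().st_mtime
    return age_seconds <= TIMEOUT_MINUTES * 60


if __name__ == "__main__":
    import os
    import tempfile

    with tempfile.TemporaryDirectory() as tmp:
        staging = Path(tmp) / "double-sided-staging.pdf"
        staging.write_bytes(b"")
        print(staging_is_valid(staging))  # freshly written file

        # Backdate the file by 31 minutes, as the tests do with os.utime
        old = time.time() - 31 * 60
        os.utime(str(staging), (old, old))
        print(staging_is_valid(staging))
```

Because the decision depends on `st_mtime`, the `os.utime` call made when the first scan is moved into staging matters: without it, a file that was scanned long before being uploaded could be treated as outdated straight away.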
diff --git a/src/documents/tasks.py b/src/documents/tasks.py

index 97a7791f3206205d4acd548aec35429575dd1078..f1b65c45f67e6cc3e0038857a95250c2708614e6 100644 (file)
--- a/src/documents/tasks.py
+++ b/src/documents/tasks.py
@@ -25,6 +25,7 @@ from documents.consumer import Consumer
  from documents.consumer import ConsumerError
  from documents.data_models import ConsumableDocument
  from documents.data_models import DocumentMetadataOverrides
+from documents.double_sided import collate
  from documents.file_handling import create_source_path_directory
  from documents.file_handling import generate_unique_filename
  from documents.models import Correspondent
@@ -89,10 +90,40 @@ def consume_file(
      input_doc: ConsumableDocument,
      overrides: Optional[DocumentMetadataOverrides] = None,
  ):
+    def send_progress(status="SUCCESS", message="finished"):
+        payload = {
+            "filename": overrides.filename or input_doc.original_file.name,
+            "task_id": None,
+            "current_progress": 100,
+            "max_progress": 100,
+            "status": status,
+            "message": message,
+        }
+        try:
+            async_to_sync(get_channel_layer().group_send)(
+                "status_updates",
+                {"type": "status_update", "data": payload},
+            )
+        except ConnectionError as e:
+            logger.warning(f"ConnectionError on status send: {e!s}")
+
      # Default no overrides
      if overrides is None:
          overrides = DocumentMetadataOverrides()
  
+    # Handle collation of double-sided documents scanned in two parts
+    if settings.CONSUMER_ENABLE_COLLATE_DOUBLE_SIDED and (
+        settings.CONSUMER_COLLATE_DOUBLE_SIDED_SUBDIR_NAME
+        in input_doc.original_file.parts
+    ):
+        try:
+            msg = collate(input_doc)
+            send_progress(message=msg)
+            return msg
+        except ConsumerError as e:
+            send_progress(status="FAILURE", message=e.args[0])
+            raise e
+
      # read all barcodes in the current document
      if settings.CONSUMER_ENABLE_BARCODES or settings.CONSUMER_ENABLE_ASN_BARCODE:
          with BarcodeReader(input_doc.original_file, input_doc.mime_type) as reader:
@@ -102,24 +133,9 @@ def consume_file(
              ):
                  # notify the sender, otherwise the progress bar
                  # in the UI stays stuck
-                payload = {
-                    "filename": overrides.filename or input_doc.original_file.name,
-                    "task_id": None,
-                    "current_progress": 100,
-                    "max_progress": 100,
-                    "status": "SUCCESS",
-                    "message": "finished",
-                }
-                try:
-                    async_to_sync(get_channel_layer().group_send)(
-                        "status_updates",
-                        {"type": "status_update", "data": payload},
-                    )
-                except ConnectionError as e:
-                    logger.warning(f"ConnectionError on status send: {e!s}")
+                send_progress()
                  # consuming stops here, since the original document with
                  # the barcodes has been split and will be consumed separately
-
                  input_doc.original_file.unlink()
                  return "File successfully split"
  
diff --git a/src/documents/tests/samples/double-sided-even.pdf b/src/documents/tests/samples/double-sided-even.pdf

new file mode 100644 (file)

index 0000000..7caa48a

Binary files /dev/null and b/src/documents/tests/samples/double-sided-even.pdf differ
diff --git a/src/documents/tests/samples/double-sided-odd.pdf b/src/documents/tests/samples/double-sided-odd.pdf

new file mode 100644 (file)

index 0000000..7d29320

Binary files /dev/null and b/src/documents/tests/samples/double-sided-odd.pdf differ
diff --git a/src/documents/tests/test_double_sided.py b/src/documents/tests/test_double_sided.py

new file mode 100644 (file)

index 0000000..88cbe7d
--- /dev/null
+++ b/src/documents/tests/test_double_sided.py
@@ -0,0 +1,253 @@
+import datetime as dt
+import os
+import shutil
+from pathlib import Path
+from typing import Union
+from unittest import mock
+
+from django.test import TestCase
+from django.test import override_settings
+from pdfminer.high_level import extract_text
+from pikepdf import Pdf
+
+from documents import tasks
+from documents.consumer import ConsumerError
+from documents.data_models import ConsumableDocument
+from documents.data_models import DocumentSource
+from documents.double_sided import STAGING_FILE_NAME
+from documents.double_sided import TIMEOUT_MINUTES
+from documents.tests.utils import DirectoriesMixin
+from documents.tests.utils import FileSystemAssertsMixin
+
+
+@override_settings(
+    CONSUMER_RECURSIVE=True,
+    CONSUMER_ENABLE_COLLATE_DOUBLE_SIDED=True,
+)
+class TestDoubleSided(DirectoriesMixin, FileSystemAssertsMixin, TestCase):
+    SAMPLE_DIR = Path(__file__).parent / "samples"
+
+    def setUp(self):
+        super().setUp()
+        self.dirs.double_sided_dir = self.dirs.consumption_dir / "double-sided"
+        self.dirs.double_sided_dir.mkdir()
+        self.staging_file = self.dirs.scratch_dir / STAGING_FILE_NAME
+
+    def consume_file(self, srcname, dstname: Union[str, Path] = "foo.pdf"):
+        """
+        Starts the consume process and also ensures the
+        destination file does not exist afterwards
+        """
+        src = self.SAMPLE_DIR / srcname
+        dst = self.dirs.double_sided_dir / dstname
+        dst.parent.mkdir(parents=True, exist_ok=True)
+        shutil.copy(src, dst)
+        with mock.patch("documents.tasks.async_to_sync"), mock.patch(
+            "documents.consumer.async_to_sync",
+        ):
+            msg = tasks.consume_file(
+                ConsumableDocument(
+                    source=DocumentSource.ConsumeFolder,
+                    original_file=dst,
+                ),
+                None,
+            )
+        self.assertIsNotFile(dst)
+        return msg
+
+    def create_staging_file(self, src="double-sided-odd.pdf", datetime=None):
+        shutil.copy(self.SAMPLE_DIR / src, self.staging_file)
+        if datetime is None:
+            datetime = dt.datetime.now()
+        os.utime(str(self.staging_file), (datetime.timestamp(),) * 2)
+
+    def test_odd_numbered_moved_to_staging(self):
+        """
+        GIVEN:
+            - No staging file exists
+        WHEN:
+            - A file is copied into the double-sided consume directory
+        THEN:
+            - The file becomes the new staging file
+            - The file in the consume directory gets removed
+            - The staging file has the st_mtime set to now
+            - The user gets informed
+        """
+
+        msg = self.consume_file("double-sided-odd.pdf")
+
+        self.assertIsFile(self.staging_file)
+        self.assertAlmostEqual(
+            dt.datetime.fromtimestamp(self.staging_file.stat().st_mtime),
+            dt.datetime.now(),
+            delta=dt.timedelta(seconds=5),
+        )
+        self.assertIn("Received odd numbered pages", msg)
+
+    def test_collation(self):
+        """
+        GIVEN:
+            - A staging file not older than TIMEOUT_MINUTES with odd pages exists
+        WHEN:
+            - A file is copied into the double-sided consume directory
+        THEN:
+            - A new file containing the collated staging and uploaded file is
+              created and put into the consume directory
+            - The new file is named "foo-collated.pdf", where foo is the name of
+              the second file
+            - Both staging and uploaded file get deleted
+            - The new file contains the pages in the correct order
+        """
+
+        self.create_staging_file()
+        self.consume_file("double-sided-even.pdf", "some-random-name.pdf")
+
+        target = self.dirs.consumption_dir / "some-random-name-collated.pdf"
+        self.assertIsFile(target)
+        self.assertIsNotFile(self.staging_file)
+        self.assertRegex(
+            extract_text(str(target)),
+            r"(?s)"
+            r"This is page 1.*This is page 2.*This is page 3.*"
+            r"This is page 4.*This is page 5",
+        )
+
+    def test_staging_file_expiration(self):
+        """
+        GIVEN:
+            - A staging file older than TIMEOUT_MINUTES exists
+        WHEN:
+            - A file is copied into the double-sided consume directory
+        THEN:
+            - It becomes the new staging file
+        """
+
+        self.create_staging_file(
+            datetime=dt.datetime.now()
+            - dt.timedelta(minutes=TIMEOUT_MINUTES, seconds=1),
+        )
+        msg = self.consume_file("double-sided-odd.pdf")
+        self.assertIsFile(self.staging_file)
+        self.assertIn("Received odd numbered pages", msg)
+
+    def test_less_odd_pages_then_even_fails(self):
+        """
+        GIVEN:
+            - A valid staging file
+        WHEN:
+            - A file is copied into the double-sided consume directory
+              that has more pages than the staging file
+        THEN:
+            - Both files get removed
+            - A ConsumerError exception is thrown
+        """
+        self.create_staging_file("simple.pdf")
+        self.assertRaises(
+            ConsumerError,
+            self.consume_file,
+            "double-sided-even.pdf",
+        )
+        self.assertIsNotFile(self.staging_file)
+
+    @override_settings(CONSUMER_COLLATE_DOUBLE_SIDED_TIFF_SUPPORT=True)
+    def test_tiff_upload_enabled(self):
+        """
+        GIVEN:
+            - CONSUMER_COLLATE_DOUBLE_SIDED_TIFF_SUPPORT is true
+            - No staging file exists
+        WHEN:
+            - A TIFF file gets uploaded into the double-sided
+              consume dir
+        THEN:
+            - The file is converted into a PDF and moved to
+              the staging file
+        """
+        self.consume_file("simple.tiff", "simple.tiff")
+        self.assertIsFile(self.staging_file)
+        # Ensure the file is a valid PDF by trying to read it
+        Pdf.open(self.staging_file)
+
+    @override_settings(CONSUMER_COLLATE_DOUBLE_SIDED_TIFF_SUPPORT=False)
+    def test_tiff_upload_disabled(self):
+        """
+        GIVEN:
+            - CONSUMER_COLLATE_DOUBLE_SIDED_TIFF_SUPPORT is false
+            - No staging file exists
+        WHEN:
+            - A TIFF file gets uploaded into the double-sided
+              consume dir
+        THEN:
+            - A ConsumerError is raised
+        """
+        self.assertRaises(
+            ConsumerError,
+            self.consume_file,
+            "simple.tiff",
+            "simple.tiff",
+        )
+
+    @override_settings(CONSUMER_COLLATE_DOUBLE_SIDED_SUBDIR_NAME="quux")
+    def test_different_upload_dir_name(self):
+        """
+        GIVEN:
+            - No staging file exists
+            - CONSUMER_COLLATE_DOUBLE_SIDED_SUBDIR_NAME is set to quux
+        WHEN:
+            - A file is uploaded into the quux dir
+        THEN:
+            - A staging file is created
+        """
+        self.consume_file("double-sided-odd.pdf", Path("..") / "quux" / "foo.pdf")
+        self.assertIsFile(self.staging_file)
+
+    def test_only_double_sided_dir_is_handled(self):
+        """
+        GIVEN:
+            - No staging file exists
+        WHEN:
+            - A file is uploaded into the normal consumption dir
+        THEN:
+            - The file is processed as normal
+        """
+        msg = self.consume_file("simple.pdf", Path("..") / "simple.pdf")
+        self.assertIsNotFile(self.staging_file)
+        self.assertRegex(msg, "Success. New document .* created")
+
+    def test_subdirectory_upload(self):
+        """
+        GIVEN:
+            - A staging file exists
+        WHEN:
+            - A file gets uploaded into foo/bar/double-sided
+              or double-sided/foo/bar
+        THEN:
+            - The collated file gets put into foo/bar
+        """
+        for path in [
+            Path("foo") / "bar" / "double-sided",
+            Path("double-sided") / "foo" / "bar",
+        ]:
+            with self.subTest(path=path):
+                # Ensure we get fresh directories for each run
+                self.tearDown()
+                self.setUp()
+
+                self.create_staging_file()
+                self.consume_file("double-sided-odd.pdf", path / "foo.pdf")
+                self.assertIsFile(
+                    self.dirs.consumption_dir / "foo" / "bar" / "foo-collated.pdf",
+                )
+
+    @override_settings(CONSUMER_ENABLE_COLLATE_DOUBLE_SIDED=False)
+    def test_disabled_double_sided_dir_upload(self):
+        """
+        GIVEN:
+            - CONSUMER_ENABLE_COLLATE_DOUBLE_SIDED is false
+        WHEN:
+            - A file is uploaded into the double-sided directory
+        THEN:
+            - The file is processed like a normal upload
+        """
+        msg = self.consume_file("simple.pdf")
+        self.assertIsNotFile(self.staging_file)
+        self.assertRegex(msg, "Success. New document .* created")
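
`test_subdirectory_upload` above exercises the path handling in `collate`: the collated file keeps the upload path, minus every component equal to the configured subdirectory name, and gains a `-collated` suffix. A minimal sketch of that derivation follows, using `PurePosixPath` so it runs without touching the file system. The function name `collated_target` and the `/consume` prefix are assumptions for illustration only.

```python
from pathlib import PurePosixPath


def collated_target(
    uploaded: PurePosixPath,
    subdir_name: str = "double-sided",
) -> PurePosixPath:
    """Derive where the collated PDF would be written (hypothetical helper)."""
    # Rename first: "<stem>-collated.pdf" next to the uploaded file
    renamed = uploaded.with_name(f"{uploaded.stem}-collated.pdf")
    # Then drop every path component that equals the subdirectory name
    return PurePosixPath(
        *(part for part in renamed.parts if part != subdir_name)
    )


if __name__ == "__main__":
    for upload in (
        "/consume/foo/bar/double-sided/scan.pdf",
        "/consume/double-sided/foo/bar/scan.pdf",
    ):
        print(collated_target(PurePosixPath(upload)))
```

Both uploads map to the same target under `foo/bar`, which matches the "subdirs as tags" behaviour documented earlier. The filter removes every matching component rather than only one, so a consumption directory whose own path contains a component with the same name would be affected as well.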
diff --git a/src/paperless/settings.py b/src/paperless/settings.py

index 763cf96fc2fa79f58e7d0d82f9a517c8f9be7168..39460066e694ab1cc5d46d8de1344de852b8b382 100644 (file)
--- a/src/paperless/settings.py
+++ b/src/paperless/settings.py
@@ -791,6 +791,18 @@ CONSUMER_BARCODE_DPI: Final[str] = int(
      os.getenv("PAPERLESS_CONSUMER_BARCODE_DPI", 300),
  )
  
+CONSUMER_ENABLE_COLLATE_DOUBLE_SIDED: Final[bool] = __get_boolean(
+    "PAPERLESS_CONSUMER_ENABLE_COLLATE_DOUBLE_SIDED",
+)
+
+CONSUMER_COLLATE_DOUBLE_SIDED_SUBDIR_NAME: Final[str] = os.getenv(
+    "PAPERLESS_CONSUMER_COLLATE_DOUBLE_SIDED_SUBDIR_NAME",
+    "double-sided",
+)
+
+CONSUMER_COLLATE_DOUBLE_SIDED_TIFF_SUPPORT: Final[bool] = __get_boolean(
+    "PAPERLESS_CONSUMER_COLLATE_DOUBLE_SIDED_TIFF_SUPPORT",
+)
  
  OCR_PAGES = int(os.getenv("PAPERLESS_OCR_PAGES", 0))