From 91a0a073a01bcb7d28cb9da21c44929c8c8863e4 Mon Sep 17 00:00:00 2001
From: Mike Bayer <mike_mp@zzzcomputing.com>
Date: Thu, 5 Sep 2019 11:46:44 -0400
Subject: [PATCH] Document Query deduplication

Users are frequently confused why count() does not return the same
number as the number of objects returned by all().  While we continue
to want to find a better solution to this problem, in the meantime
this has never been clearly documented.  Add an FAQ section with
links from .count() , .all(), ORM tutorial.

Change-Id: I6eff36b686ff6cdd55489036fc48a981bc47d5ee
(cherry picked from commit 75c3950732a66ff7f4aca152653d6cb14eb9783c)
---
 doc/build/faq/sessions.rst  | 72 +++++++++++++++++++++++++++++++++++++
 doc/build/orm/tutorial.rst  | 17 +++++++++
 lib/sqlalchemy/orm/query.py | 47 ++++++++++++++++++------
 3 files changed, 126 insertions(+), 10 deletions(-)

diff --git a/doc/build/faq/sessions.rst b/doc/build/faq/sessions.rst
index 1abdc69417..6fdaa2ceb9 100644
--- a/doc/build/faq/sessions.rst
+++ b/doc/build/faq/sessions.rst
@@ -225,6 +225,78 @@ How do I make a Query that always adds a certain filter to every query?
 
 See the recipe at `FilteredQuery <http://www.sqlalchemy.org/trac/wiki/UsageRecipes/FilteredQuery>`_.
 
+.. _faq_query_deduplicating:
+
+My Query does not return the same number of objects as query.count() tells me - why?
+-------------------------------------------------------------------------------------
+
+The :class:`.Query` object, when asked to return a list of ORM-mapped objects,
+will **deduplicate the objects based on primary key**.   That is, if we
+for example use the ``User`` mapping described at :ref:`ormtutorial_toplevel`,
+and we had a SQL query like the following::
+
+    q = session.query(User).outerjoin(User.addresses).filter(User.name == 'jack')
+
+Above, the sample data used in the tutorial has two rows in the ``addresses``
+table for the ``users`` row with the name ``'jack'``, primary key value 5.
+If we ask the above query for a :meth:`.Query.count`, we will get the answer
+**2**::
+
+    >>> q.count()
+    2
+
+However, if we run :meth:`.Query.all` or iterate over the query, we get back
+**one element**::
+
+  >>> q.all()
+  [User(id=5, name='jack', ...)]
+
+This is because when the :class:`.Query` object returns full entities, they
+are **deduplicated**.    This does not occur if we instead request individual
+columns back::
+
+  >>> session.query(User.id, User.name).outerjoin(User.addresses).filter(User.name == 'jack').all()
+  [(5, 'jack'), (5, 'jack')]
+
+There are two main reasons the :class:`.Query` will deduplicate:
+
+* **To allow joined eager loading to work correctly** - :ref:`joined_eager_loading`
+  works by querying rows using joins against related tables, where it then routes
+  rows from those joins into collections upon the lead objects.   In order to do this,
+  it has to fetch rows where the lead object primary key is repeated for each
+  sub-entry.   This pattern can then continue into further sub-collections such
+  that a multiple of rows may be processed for a single lead object, such as
+  ``User(id=5)``.   The dedpulication allows us to receive objects in the way they
+  were queried, e.g. all the ``User()`` objects whose name is ``'jack'`` which
+  for us is one object, with
+  the ``User.addresses`` collection eagerly loaded as was indicated either
+  by ``lazy='joined'`` on the :func:`.relationship` or via the :func:`.joinedload`
+  option.    For consistency, the deduplication is still applied whether or not
+  the joinedload is established, as the key philosophy behind eager loading
+  is that these options never affect the result.
+
+* **To eliminate confusion regarding the identity map** - this is admittedly
+  the less critical reason.  As the :class:`.Session`
+  makes use of an :term:`identity map`, even though our SQL result set has two
+  rows with primary key 5, there is only one ``User(id=5)`` object inside the :class:`.Session`
+  which must be maintained uniquely on its identity, that is, its primary key /
+  class combination.   It doesn't actually make much sense, if one is querying for
+  ``User()`` objects, to get the same object multiple times in the list.   An
+  ordered set would potentially be a better representation of what :class:`.Query`
+  seeks to return when it returns full objects.
+
+The issue of :class:`.Query` deduplication remains problematic, mostly for the
+single reason that the :meth:`.Query.count` method is inconsistent, and the
+current status is that joined eager loading has in recent releases been
+superseded first by the "subquery eager loading" strategy and more recently the
+"select IN eager loading" strategy, both of which are generally more
+appropriate for collection eager loading. As this evolution continues,
+SQLAlchemy may alter this behavior on :class:`.Query`, which may also involve
+new APIs in order to more directly control this behavior, and may also alter
+the behavior of joined eager loading in order to create a more consistent usage
+pattern.
+
+
 I've created a mapping against an Outer Join, and while the query returns rows, no objects are returned.  Why not?
 ------------------------------------------------------------------------------------------------------------------
 
diff --git a/doc/build/orm/tutorial.rst b/doc/build/orm/tutorial.rst
index 686b774b97..1b92ddced9 100644
--- a/doc/build/orm/tutorial.rst
+++ b/doc/build/orm/tutorial.rst
@@ -834,6 +834,8 @@ Here's a rundown of some of the most common operators used in
     or ``CONTAINS`` function; its behavior will vary by backend and is not
     available on some backends such as SQLite.
 
+.. _orm_tutorial_query_returning:
+
 Returning Lists and Scalars
 ---------------------------
 
@@ -857,6 +859,21 @@ database results.  Here's a brief tour:
       {stop}[<User(name='ed', fullname='Ed Jones', nickname='eddie')>,
             <User(name='fred', fullname='Fred Flintstone', nickname='freddy')>]
 
+  .. warning::
+
+        When the :class:`.Query` object returns lists of ORM-mapped objects
+        such as the ``User`` object above, the entries are **deduplicated**
+        based on primary key, as the results are interpreted from the SQL
+        result set.  That is, if SQL query returns a row with ``id=7`` twice,
+        you would only get a single ``User(id=7)`` object back in the result
+        list.  This does not apply to the case when individual columns are
+        queried.
+
+        .. seealso::
+
+            :ref:`faq_query_deduplicating`
+
+
 * :meth:`~.Query.first()` applies a limit of one and returns
   the first result as a scalar:
 
diff --git a/lib/sqlalchemy/orm/query.py b/lib/sqlalchemy/orm/query.py
index 530b8fee88..08a085f86e 100644
--- a/lib/sqlalchemy/orm/query.py
+++ b/lib/sqlalchemy/orm/query.py
@@ -3170,10 +3170,18 @@ class Query(object):
             self._suffixes = suffixes
 
     def all(self):
-        """Return the results represented by this ``Query`` as a list.
+        """Return the results represented by this :class:`.Query` as a list.
 
-        This results in an execution of the underlying query.
+        This results in an execution of the underlying SQL statement.
+
+        .. warning::  The :class:`.Query` object, when asked to return either
+           a sequence or iterator that consists of full ORM-mapped entities,
+           will **deduplicate entries based on primary key**.  See the FAQ for
+           more details.
+
+            .. seealso::
 
+                :ref:`faq_query_deduplicating`
         """
         return list(self)
 
@@ -3541,7 +3549,8 @@ class Query(object):
         )
 
     def count(self):
-        r"""Return a count of rows this Query would return.
+        r"""Return a count of rows this the SQL formed by this :class:`Query`
+        would return.
 
         This generates the SQL for this Query as follows::
 
@@ -3549,13 +3558,31 @@ class Query(object):
                 SELECT <rest of query follows...>
             ) AS anon_1
 
-        For fine grained control over specific columns
-        to count, to skip the usage of a subquery or
-        otherwise control of the FROM clause,
-        or to use other aggregate functions,
-        use :attr:`~sqlalchemy.sql.expression.func`
-        expressions in conjunction
-        with :meth:`~.Session.query`, i.e.::
+        The above SQL returns a single row, which is the aggregate value
+        of the count function; the :meth:`.Query.count` method then returns
+        that single integer value.
+
+        .. warning::
+
+            It is important to note that the value returned by
+            count() is **not the same as the number of ORM objects that this
+            Query would return from a method such as the .all() method**.
+            The :class:`.Query` object, when asked to return full entities,
+            will **deduplicate entries based on primary key**, meaning if the
+            same primary key value would appear in the results more than once,
+            only one object of that primary key would be present.  This does
+            not apply to a query that is against individual columns.
+
+            .. seealso::
+
+                :ref:`faq_query_deduplicating`
+
+                :ref:`orm_tutorial_query_returning`
+
+        For fine grained control over specific columns to count, to skip the
+        usage of a subquery or otherwise control of the FROM clause, or to use
+        other aggregate functions, use :attr:`~sqlalchemy.sql.expression.func`
+        expressions in conjunction with :meth:`~.Session.query`, i.e.::
 
             from sqlalchemy import func
 
-- 
2.47.2