From: Mike Bayer <mike_mp@zzzcomputing.com>
Date: Thu, 5 Sep 2019 15:46:44 +0000 (-0400)
Subject: Document Query deduplication
X-Git-Tag: rel_1_4_0b1~735
X-Git-Url: http://git.ipfire.org/cgi-bin/gitweb.cgi?a=commitdiff_plain;h=75c3950732a66ff7f4aca152653d6cb14eb9783c;p=thirdparty%2Fsqlalchemy%2Fsqlalchemy.git

Document Query deduplication

Users are frequently confused why count() does not return the same
number as the number of objects returned by all().  While we continue
to want to find a better solution to this problem, in the meantime
this has never been clearly documented.  Add an FAQ section with
links from .count() , .all(), ORM tutorial.

Change-Id: I6eff36b686ff6cdd55489036fc48a981bc47d5ee
---

diff --git a/doc/build/faq/sessions.rst b/doc/build/faq/sessions.rst
index 1abdc69417..6fdaa2ceb9 100644
--- a/doc/build/faq/sessions.rst
+++ b/doc/build/faq/sessions.rst
@@ -225,6 +225,78 @@ How do I make a Query that always adds a certain filter to every query?
 
 See the recipe at `FilteredQuery <http://www.sqlalchemy.org/trac/wiki/UsageRecipes/FilteredQuery>`_.
 
+.. _faq_query_deduplicating:
+
+My Query does not return the same number of objects as query.count() tells me - why?
+-------------------------------------------------------------------------------------
+
+The :class:`.Query` object, when asked to return a list of ORM-mapped objects,
+will **deduplicate the objects based on primary key**.   That is, if we
+for example use the ``User`` mapping described at :ref:`ormtutorial_toplevel`,
+and we had a SQL query like the following::
+
+    q = session.query(User).outerjoin(User.addresses).filter(User.name == 'jack')
+
+Above, the sample data used in the tutorial has two rows in the ``addresses``
+table for the ``users`` row with the name ``'jack'``, primary key value 5.
+If we ask the above query for a :meth:`.Query.count`, we will get the answer
+**2**::
+
+    >>> q.count()
+    2
+
+However, if we run :meth:`.Query.all` or iterate over the query, we get back
+**one element**::
+
+  >>> q.all()
+  [User(id=5, name='jack', ...)]
+
+This is because when the :class:`.Query` object returns full entities, they
+are **deduplicated**.    This does not occur if we instead request individual
+columns back::
+
+  >>> session.query(User.id, User.name).outerjoin(User.addresses).filter(User.name == 'jack').all()
+  [(5, 'jack'), (5, 'jack')]
+
+There are two main reasons the :class:`.Query` will deduplicate:
+
+* **To allow joined eager loading to work correctly** - :ref:`joined_eager_loading`
+  works by querying rows using joins against related tables, where it then routes
+  rows from those joins into collections upon the lead objects.   In order to do this,
+  it has to fetch rows where the lead object primary key is repeated for each
+  sub-entry.   This pattern can then continue into further sub-collections such
+  that a multiple of rows may be processed for a single lead object, such as
+  ``User(id=5)``.   The dedpulication allows us to receive objects in the way they
+  were queried, e.g. all the ``User()`` objects whose name is ``'jack'`` which
+  for us is one object, with
+  the ``User.addresses`` collection eagerly loaded as was indicated either
+  by ``lazy='joined'`` on the :func:`.relationship` or via the :func:`.joinedload`
+  option.    For consistency, the deduplication is still applied whether or not
+  the joinedload is established, as the key philosophy behind eager loading
+  is that these options never affect the result.
+
+* **To eliminate confusion regarding the identity map** - this is admittedly
+  the less critical reason.  As the :class:`.Session`
+  makes use of an :term:`identity map`, even though our SQL result set has two
+  rows with primary key 5, there is only one ``User(id=5)`` object inside the :class:`.Session`
+  which must be maintained uniquely on its identity, that is, its primary key /
+  class combination.   It doesn't actually make much sense, if one is querying for
+  ``User()`` objects, to get the same object multiple times in the list.   An
+  ordered set would potentially be a better representation of what :class:`.Query`
+  seeks to return when it returns full objects.
+
+The issue of :class:`.Query` deduplication remains problematic, mostly for the
+single reason that the :meth:`.Query.count` method is inconsistent, and the
+current status is that joined eager loading has in recent releases been
+superseded first by the "subquery eager loading" strategy and more recently the
+"select IN eager loading" strategy, both of which are generally more
+appropriate for collection eager loading. As this evolution continues,
+SQLAlchemy may alter this behavior on :class:`.Query`, which may also involve
+new APIs in order to more directly control this behavior, and may also alter
+the behavior of joined eager loading in order to create a more consistent usage
+pattern.
+
+
 I've created a mapping against an Outer Join, and while the query returns rows, no objects are returned.  Why not?
 ------------------------------------------------------------------------------------------------------------------
 
diff --git a/doc/build/orm/tutorial.rst b/doc/build/orm/tutorial.rst
index 1132c96fc8..14ea82a0dd 100644
--- a/doc/build/orm/tutorial.rst
+++ b/doc/build/orm/tutorial.rst
@@ -834,6 +834,8 @@ Here's a rundown of some of the most common operators used in
     or ``CONTAINS`` function; its behavior will vary by backend and is not
     available on some backends such as SQLite.
 
+.. _orm_tutorial_query_returning:
+
 Returning Lists and Scalars
 ---------------------------
 
@@ -857,6 +859,21 @@ database results.  Here's a brief tour:
       {stop}[<User(name='ed', fullname='Ed Jones', nickname='eddie')>,
             <User(name='fred', fullname='Fred Flintstone', nickname='freddy')>]
 
+  .. warning::
+
+        When the :class:`.Query` object returns lists of ORM-mapped objects
+        such as the ``User`` object above, the entries are **deduplicated**
+        based on primary key, as the results are interpreted from the SQL
+        result set.  That is, if SQL query returns a row with ``id=7`` twice,
+        you would only get a single ``User(id=7)`` object back in the result
+        list.  This does not apply to the case when individual columns are
+        queried.
+
+        .. seealso::
+
+            :ref:`faq_query_deduplicating`
+
+
 * :meth:`~.Query.first()` applies a limit of one and returns
   the first result as a scalar:
 
diff --git a/lib/sqlalchemy/orm/query.py b/lib/sqlalchemy/orm/query.py
index d4ff35d2e5..6d8037bae5 100644
--- a/lib/sqlalchemy/orm/query.py
+++ b/lib/sqlalchemy/orm/query.py
@@ -3200,10 +3200,18 @@ class Query(object):
             self._suffixes = suffixes
 
     def all(self):
-        """Return the results represented by this ``Query`` as a list.
+        """Return the results represented by this :class:`.Query` as a list.
 
-        This results in an execution of the underlying query.
+        This results in an execution of the underlying SQL statement.
+
+        .. warning::  The :class:`.Query` object, when asked to return either
+           a sequence or iterator that consists of full ORM-mapped entities,
+           will **deduplicate entries based on primary key**.  See the FAQ for
+           more details.
+
+            .. seealso::
 
+                :ref:`faq_query_deduplicating`
         """
         return list(self)
 
@@ -3568,7 +3576,8 @@ class Query(object):
         )
 
     def count(self):
-        r"""Return a count of rows this Query would return.
+        r"""Return a count of rows this the SQL formed by this :class:`Query`
+        would return.
 
         This generates the SQL for this Query as follows::
 
@@ -3576,13 +3585,31 @@ class Query(object):
                 SELECT <rest of query follows...>
             ) AS anon_1
 
-        For fine grained control over specific columns
-        to count, to skip the usage of a subquery or
-        otherwise control of the FROM clause,
-        or to use other aggregate functions,
-        use :attr:`~sqlalchemy.sql.expression.func`
-        expressions in conjunction
-        with :meth:`~.Session.query`, i.e.::
+        The above SQL returns a single row, which is the aggregate value
+        of the count function; the :meth:`.Query.count` method then returns
+        that single integer value.
+
+        .. warning::
+
+            It is important to note that the value returned by
+            count() is **not the same as the number of ORM objects that this
+            Query would return from a method such as the .all() method**.
+            The :class:`.Query` object, when asked to return full entities,
+            will **deduplicate entries based on primary key**, meaning if the
+            same primary key value would appear in the results more than once,
+            only one object of that primary key would be present.  This does
+            not apply to a query that is against individual columns.
+
+            .. seealso::
+
+                :ref:`faq_query_deduplicating`
+
+                :ref:`orm_tutorial_query_returning`
+
+        For fine grained control over specific columns to count, to skip the
+        usage of a subquery or otherwise control of the FROM clause, or to use
+        other aggregate functions, use :attr:`~sqlalchemy.sql.expression.func`
+        expressions in conjunction with :meth:`~.Session.query`, i.e.::
 
             from sqlalchemy import func