From 91a0a073a01bcb7d28cb9da21c44929c8c8863e4 Mon Sep 17 00:00:00 2001 From: Mike Bayer Date: Thu, 5 Sep 2019 11:46:44 -0400 Subject: [PATCH] Document Query deduplication Users are frequently confused why count() does not return the same number as the number of objects returned by all(). While we continue to want to find a better solution to this problem, in the meantime this has never been clearly documented. Add an FAQ section with links from .count() , .all(), ORM tutorial. Change-Id: I6eff36b686ff6cdd55489036fc48a981bc47d5ee (cherry picked from commit 75c3950732a66ff7f4aca152653d6cb14eb9783c) --- doc/build/faq/sessions.rst | 72 +++++++++++++++++++++++++++++++++++++ doc/build/orm/tutorial.rst | 17 +++++++++ lib/sqlalchemy/orm/query.py | 47 ++++++++++++++++++------ 3 files changed, 126 insertions(+), 10 deletions(-) diff --git a/doc/build/faq/sessions.rst b/doc/build/faq/sessions.rst index 1abdc69417..6fdaa2ceb9 100644 --- a/doc/build/faq/sessions.rst +++ b/doc/build/faq/sessions.rst @@ -225,6 +225,78 @@ How do I make a Query that always adds a certain filter to every query? See the recipe at `FilteredQuery `_. +.. _faq_query_deduplicating: + +My Query does not return the same number of objects as query.count() tells me - why? +------------------------------------------------------------------------------------- + +The :class:`.Query` object, when asked to return a list of ORM-mapped objects, +will **deduplicate the objects based on primary key**. That is, if we +for example use the ``User`` mapping described at :ref:`ormtutorial_toplevel`, +and we had a SQL query like the following:: + + q = session.query(User).outerjoin(User.addresses).filter(User.name == 'jack') + +Above, the sample data used in the tutorial has two rows in the ``addresses`` +table for the ``users`` row with the name ``'jack'``, primary key value 5. +If we ask the above query for a :meth:`.Query.count`, we will get the answer +**2**:: + + >>> q.count() + 2 + +However, if we run :meth:`.Query.all` or iterate over the query, we get back +**one element**:: + + >>> q.all() + [User(id=5, name='jack', ...)] + +This is because when the :class:`.Query` object returns full entities, they +are **deduplicated**. This does not occur if we instead request individual +columns back:: + + >>> session.query(User.id, User.name).outerjoin(User.addresses).filter(User.name == 'jack').all() + [(5, 'jack'), (5, 'jack')] + +There are two main reasons the :class:`.Query` will deduplicate: + +* **To allow joined eager loading to work correctly** - :ref:`joined_eager_loading` + works by querying rows using joins against related tables, where it then routes + rows from those joins into collections upon the lead objects. In order to do this, + it has to fetch rows where the lead object primary key is repeated for each + sub-entry. This pattern can then continue into further sub-collections such + that a multiple of rows may be processed for a single lead object, such as + ``User(id=5)``. The dedpulication allows us to receive objects in the way they + were queried, e.g. all the ``User()`` objects whose name is ``'jack'`` which + for us is one object, with + the ``User.addresses`` collection eagerly loaded as was indicated either + by ``lazy='joined'`` on the :func:`.relationship` or via the :func:`.joinedload` + option. For consistency, the deduplication is still applied whether or not + the joinedload is established, as the key philosophy behind eager loading + is that these options never affect the result. + +* **To eliminate confusion regarding the identity map** - this is admittedly + the less critical reason. As the :class:`.Session` + makes use of an :term:`identity map`, even though our SQL result set has two + rows with primary key 5, there is only one ``User(id=5)`` object inside the :class:`.Session` + which must be maintained uniquely on its identity, that is, its primary key / + class combination. It doesn't actually make much sense, if one is querying for + ``User()`` objects, to get the same object multiple times in the list. An + ordered set would potentially be a better representation of what :class:`.Query` + seeks to return when it returns full objects. + +The issue of :class:`.Query` deduplication remains problematic, mostly for the +single reason that the :meth:`.Query.count` method is inconsistent, and the +current status is that joined eager loading has in recent releases been +superseded first by the "subquery eager loading" strategy and more recently the +"select IN eager loading" strategy, both of which are generally more +appropriate for collection eager loading. As this evolution continues, +SQLAlchemy may alter this behavior on :class:`.Query`, which may also involve +new APIs in order to more directly control this behavior, and may also alter +the behavior of joined eager loading in order to create a more consistent usage +pattern. + + I've created a mapping against an Outer Join, and while the query returns rows, no objects are returned. Why not? ------------------------------------------------------------------------------------------------------------------ diff --git a/doc/build/orm/tutorial.rst b/doc/build/orm/tutorial.rst index 686b774b97..1b92ddced9 100644 --- a/doc/build/orm/tutorial.rst +++ b/doc/build/orm/tutorial.rst @@ -834,6 +834,8 @@ Here's a rundown of some of the most common operators used in or ``CONTAINS`` function; its behavior will vary by backend and is not available on some backends such as SQLite. +.. _orm_tutorial_query_returning: + Returning Lists and Scalars --------------------------- @@ -857,6 +859,21 @@ database results. Here's a brief tour: {stop}[, ] + .. warning:: + + When the :class:`.Query` object returns lists of ORM-mapped objects + such as the ``User`` object above, the entries are **deduplicated** + based on primary key, as the results are interpreted from the SQL + result set. That is, if SQL query returns a row with ``id=7`` twice, + you would only get a single ``User(id=7)`` object back in the result + list. This does not apply to the case when individual columns are + queried. + + .. seealso:: + + :ref:`faq_query_deduplicating` + + * :meth:`~.Query.first()` applies a limit of one and returns the first result as a scalar: diff --git a/lib/sqlalchemy/orm/query.py b/lib/sqlalchemy/orm/query.py index 530b8fee88..08a085f86e 100644 --- a/lib/sqlalchemy/orm/query.py +++ b/lib/sqlalchemy/orm/query.py @@ -3170,10 +3170,18 @@ class Query(object): self._suffixes = suffixes def all(self): - """Return the results represented by this ``Query`` as a list. + """Return the results represented by this :class:`.Query` as a list. - This results in an execution of the underlying query. + This results in an execution of the underlying SQL statement. + + .. warning:: The :class:`.Query` object, when asked to return either + a sequence or iterator that consists of full ORM-mapped entities, + will **deduplicate entries based on primary key**. See the FAQ for + more details. + + .. seealso:: + :ref:`faq_query_deduplicating` """ return list(self) @@ -3541,7 +3549,8 @@ class Query(object): ) def count(self): - r"""Return a count of rows this Query would return. + r"""Return a count of rows this the SQL formed by this :class:`Query` + would return. This generates the SQL for this Query as follows:: @@ -3549,13 +3558,31 @@ class Query(object): SELECT ) AS anon_1 - For fine grained control over specific columns - to count, to skip the usage of a subquery or - otherwise control of the FROM clause, - or to use other aggregate functions, - use :attr:`~sqlalchemy.sql.expression.func` - expressions in conjunction - with :meth:`~.Session.query`, i.e.:: + The above SQL returns a single row, which is the aggregate value + of the count function; the :meth:`.Query.count` method then returns + that single integer value. + + .. warning:: + + It is important to note that the value returned by + count() is **not the same as the number of ORM objects that this + Query would return from a method such as the .all() method**. + The :class:`.Query` object, when asked to return full entities, + will **deduplicate entries based on primary key**, meaning if the + same primary key value would appear in the results more than once, + only one object of that primary key would be present. This does + not apply to a query that is against individual columns. + + .. seealso:: + + :ref:`faq_query_deduplicating` + + :ref:`orm_tutorial_query_returning` + + For fine grained control over specific columns to count, to skip the + usage of a subquery or otherwise control of the FROM clause, or to use + other aggregate functions, use :attr:`~sqlalchemy.sql.expression.func` + expressions in conjunction with :meth:`~.Session.query`, i.e.:: from sqlalchemy import func -- 2.47.2