From 8200c2cd35b3e85a636baabe8324b9ecbbd8fedf Mon Sep 17 00:00:00 2001 From: Mike Bayer Date: Wed, 5 Nov 2014 15:11:13 -0500 Subject: [PATCH] - edits to the subqueryload ordering merge --- doc/build/faq.rst | 108 +++++++++++++++++++++---------------- doc/build/orm/loading.rst | 25 ++++++--- doc/build/orm/tutorial.rst | 9 ++-- 3 files changed, 86 insertions(+), 56 deletions(-) diff --git a/doc/build/faq.rst b/doc/build/faq.rst index fa10ba44b1..12d8e0acc6 100644 --- a/doc/build/faq.rst +++ b/doc/build/faq.rst @@ -603,62 +603,78 @@ The same idea applies to all the other arguments, such as ``foreign_keys``:: foo = relationship(Dest, foreign_keys=[foo_id, bar_id]) -.. _faq_subqueryload_sort: - -Why must I always ``ORDER BY`` a unique column when using ``subqueryload``? ----------------------------------------------------------------------------- - -The SQL standard prescribes that RDBMSs are free to return rows in any order it -deems appropriate, if no ``ORDER BY`` clause is specified. This even extends to -the case where the ``ORDER BY`` clause is not unique across all rows, i.e. rows -with the same value in the ``ORDER BY`` column(s) will not necessarily be -returned in a deterministic order. - -SQLAlchemy implements :func:`.orm.subqueryload` by issuing a separate query -(where the table specified in the relationship is joined to the original query) -and then attempting to match up the results in Python. This works fine -normally: +.. _faq_subqueryload_limit_sort: + +Why is ``ORDER BY`` required with ``LIMIT`` (especially with ``subqueryload()``)? +--------------------------------------------------------------------------------- + +A relational database can return rows in any +arbitrary order, when an explicit ordering is not set. +While this ordering very often corresponds to the natural +order of rows within a table, this is not the case for all databases and +all queries. The consequence of this is that any query that limits rows +using ``LIMIT`` or ``OFFSET`` should **always** specify an ``ORDER BY``. +Otherwise, it is not deterministic which rows will actually be returned. + +When we use a SQLAlchemy method like :meth:`.Query.first`, we are in fact +applying a ``LIMIT`` of one to the query, so without an explicit ordering +it is not deterministic what row we actually get back. +While we may not notice this for simple queries on databases that usually +returns rows in their natural +order, it becomes much more of an issue if we also use :func:`.orm.subqueryload` +to load related collections, and we may not be loading the collections +as intended. + +SQLAlchemy implements :func:`.orm.subqueryload` by issuing a separate query, +the results of which are matched up to the results from the first query. +We see two queries emitted like this: .. sourcecode:: python+sql >>> session.query(User).options(subqueryload(User.addresses)).all() - {opensql}# the "main" query + {opensql}-- the "main" query SELECT users.id AS users_id FROM users {stop} - {opensql}# the "load" query issued by subqueryload - SELECT addresses.id AS addresses_id, addresses.user_id AS addresses_user_id, anon_1.users_id AS anon_1_users_id - FROM (SELECT users.id AS users_id - FROM users) AS anon_1 JOIN addresses ON anon_1.users_id = addresses.user_id ORDER BY anon_1.users_id - -Notice how the main query is a subquery in the load query. When an -``OFFSET``/``LIMIT`` is involved, however, things get a bit tricky: + {opensql}-- the "load" query issued by subqueryload + SELECT addresses.id AS addresses_id, + addresses.user_id AS addresses_user_id, + anon_1.users_id AS anon_1_users_id + FROM (SELECT users.id AS users_id FROM users) AS anon_1 + JOIN addresses ON anon_1.users_id = addresses.user_id + ORDER BY anon_1.users_id + +The second query embeds the first query as a source of rows. +When the inner query uses ``OFFSET`` and/or ``LIMIT`` without ordering, +the two queries may not see the same results: .. sourcecode:: python+sql >>> user = session.query(User).options(subqueryload(User.addresses)).first() - {opensql}# the "main" query + {opensql}-- the "main" query SELECT users.id AS users_id FROM users LIMIT 1 {stop} - {opensql}# the "load" query issued by subqueryload - SELECT addresses.id AS addresses_id, addresses.user_id AS addresses_user_id, anon_1.users_id AS anon_1_users_id - FROM (SELECT users.id AS users_id - FROM users - LIMIT 1) AS anon_1 JOIN addresses ON anon_1.users_id = addresses.user_id ORDER BY anon_1.users_id - -The main query is still a subquery in the load query, but *it may return a -different set of results in the second query from the first* because it does -not have a deterministic sort order! Depending on database internals, there is -a chance we may get the following resultset for the two queries:: - + {opensql}-- the "load" query issued by subqueryload + SELECT addresses.id AS addresses_id, + addresses.user_id AS addresses_user_id, + anon_1.users_id AS anon_1_users_id + FROM (SELECT users.id AS users_id FROM users LIMIT 1) AS anon_1 + JOIN addresses ON anon_1.users_id = addresses.user_id + ORDER BY anon_1.users_id + +Depending on database specifics, there is +a chance we may get the a result like the following for the two queries:: + + -- query #1 +--------+ |users_id| +--------+ | 1| +--------+ + -- query #2 +------------+-----------------+---------------+ |addresses_id|addresses_user_id|anon_1_users_id| +------------+-----------------+---------------+ @@ -667,26 +683,28 @@ a chance we may get the following resultset for the two queries:: | 4| 2| 2| +------------+-----------------+---------------+ -From SQLAlchemy's point of view, it didn't get any addresses back for user 1, -so ``user.addresses`` is empty. Oops. +Above, we receive two ``addresses`` rows for ``user.id`` of 2, and none for +1. We've wasted two rows and failed to actually load the collection. This +is an insidious error because without looking at the SQL and the results, the +ORM will not show that there's any issue; if we access the ``addresses`` +for the ``User`` we have, it will emit a lazy load for the collection and we +won't see that anything actually went wrong. The solution to this problem is to always specify a deterministic sort order, so that the main query always returns the same set of rows. This generally -means that you should :meth:`.Query.order_by` on a unique column on the table, -usually the primary key:: +means that you should :meth:`.Query.order_by` on a unique column on the table. +The primary key is a good choice for this:: session.query(User).options(subqueryload(User.addresses)).order_by(User.id).first() -You can get away with not doing a sort if the ``OFFSET``/``LIMIT`` does not -throw away any rows at all, but it's much simpler to remember to always ``ORDER -BY`` the primary key:: - - session.query(User).options(subqueryload(User.addresses)).filter(User.id == 1).first() - Note that :func:`.joinedload` does not suffer from the same problem because only one query is ever issued, so the load query cannot be different from the main query. +.. seealso:: + + :ref:`subqueryload_ordering` + Performance =========== diff --git a/doc/build/orm/loading.rst b/doc/build/orm/loading.rst index 27846b9b28..b2d8124e2c 100644 --- a/doc/build/orm/loading.rst +++ b/doc/build/orm/loading.rst @@ -120,21 +120,32 @@ query options: # set children to load eagerly with a second statement session.query(Parent).options(subqueryload('children')).all() -.. _subquery_loading_tips: +.. _subqueryload_ordering: -Subquery Loading Tips -^^^^^^^^^^^^^^^^^^^^^ +The Importance of Ordering +-------------------------- + +A query which makes use of :func:`.subqueryload` in conjunction with a +limiting modifier such as :meth:`.Query.first`, :meth:`.Query.limit`, +or :meth:`.Query.offset` should **always** include :meth:`.Query.order_by` +against unique column(s) such as the primary key, so that the additional queries +emitted by :func:`.subqueryload` include +the same ordering as used by the parent query. Without it, there is a chance +that the inner query could return the wrong rows:: -If you have ``LIMIT`` or ``OFFSET`` in your query, you **must** ``ORDER BY`` a -unique column, generally the primary key of your table, in order to ensure -correct results (see :ref:`faq_subqueryload_sort`):: + # incorrect, no ORDER BY + session.query(User).options(subqueryload(User.addresses)).first() - # incorrect + # incorrect if User.name is not unique session.query(User).options(subqueryload(User.addresses)).order_by(User.name).first() # correct session.query(User).options(subqueryload(User.addresses)).order_by(User.name, User.id).first() +.. seealso:: + + :ref:`faq_subqueryload_limit_sort` - detailed example + Loading Along Paths ------------------- diff --git a/doc/build/orm/tutorial.rst b/doc/build/orm/tutorial.rst index 19f3f6fead..8871ce765e 100644 --- a/doc/build/orm/tutorial.rst +++ b/doc/build/orm/tutorial.rst @@ -1631,11 +1631,12 @@ very easy to use: >>> jack.addresses [, ] -.. warning:: +.. note:: - If you use :func:`.subqueryload`, you should generally - :meth:`.Query.order_by` on a unique column in order to ensure correct - results. See :ref:`subquery_loading_tips`. + :func:`.subqueryload` when used in conjunction with limiting such as + :meth:`.Query.first`, :meth:`.Query.limit` or :meth:`.Query.offset` + should also include :meth:`.Query.order_by` on a unique column in order to + ensure correct results. See :ref:`subqueryload_ordering`. Joined Load ------------- -- 2.47.3