From: Mike Bayer
Date: Wed, 21 Aug 2013 23:09:01 +0000 (-0400)
Subject: add FAQ entry on large inserts
X-Git-Tag: rel_0_8_3~47
X-Git-Url: http://git.ipfire.org/cgi-bin/gitweb.cgi?a=commitdiff_plain;h=a71e5c68ca9d20eed868ee49e89b4aeb38e4f5f9;p=thirdparty%2Fsqlalchemy%2Fsqlalchemy.git

add FAQ entry on large inserts
---

diff --git a/doc/build/faq.rst b/doc/build/faq.rst
index c1f47864ec..8528428aa3 100644
--- a/doc/build/faq.rst
+++ b/doc/build/faq.rst
@@ -588,7 +588,133 @@ with any combination of subtransactions and real SAVEPOINTs.
 The job of starting/ending the "frame" is kept consistently with the code
 external to the ``flush()``, and we made a decision that this was the most
 consistent approach.
 
+I'm inserting 400,000 rows with the ORM and it's really slow!
+--------------------------------------------------------------
+
+The SQLAlchemy ORM uses the :term:`unit of work` pattern when synchronizing
+changes to the database. This pattern goes far beyond simple "inserts" of
+data: attributes assigned to objects are received by an attribute
+instrumentation system that tracks changes as they are made; all rows
+inserted are tracked in an identity map, which means that for each row
+SQLAlchemy must retrieve its "last inserted id" if not already given; and
+rows to be inserted are scanned and sorted for dependencies as needed.
+Objects are also subject to a fair degree of bookkeeping to keep all of
+this running, which for a very large number of rows at once can mean an
+inordinate amount of time spent on large data structures - hence it's
+best to chunk these.
+
+Basically, the unit of work is a large degree of automation whose job is
+to persist a complex object graph into a relational database with no
+explicit persistence code, and this automation has a price.
+
+ORMs are fundamentally not intended for high-performance bulk inserts -
+this is the whole reason SQLAlchemy offers the Core in addition to the
+ORM as a first-class component.
+
+For the use case of fast bulk inserts, the SQL generation and execution
+system that the ORM builds on top of is part of the Core. Using this
+system directly, we can produce an INSERT that is competitive with using
+the raw database API directly.
+
+The example below illustrates time-based tests for four different
+methods of inserting rows, going from the most automated to the least.
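+
+A minimal sketch of the "chunk these" advice above might look like the
+following, reusing the ``Customer`` mapping from the full script below
+(the helper name is illustrative and not part of the benchmark)::
+
+    def orm_insert_chunked(session, n, chunk=1000):
+        for i in range(n):
+            session.add(Customer(name="NAME " + str(i)))
+            if i % chunk == 0:
+                # emit the pending INSERT statements, then empty the
+                # identity map so the session never holds all n objects
+                session.flush()
+                session.expunge_all()
+        session.commit()
+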
+Runtimes observed here are:
+
+* SQLAlchemy ORM: Total time for 100000 records 16.4133379459 secs
+* SQLAlchemy ORM pk given: Total time for 100000 records 9.77570986748 secs
+* SQLAlchemy Core: Total time for 100000 records 0.568737983704 secs
+* sqlite3: Total time for 100000 records 0.595796823502 sec
+
+Script::
+
+    import time
+    import sqlite3
+
+    from sqlalchemy.ext.declarative import declarative_base
+    from sqlalchemy import Column, Integer, String, create_engine
+    from sqlalchemy.orm import scoped_session, sessionmaker
+
+    Base = declarative_base()
+    DBSession = scoped_session(sessionmaker())
+
+    class Customer(Base):
+        __tablename__ = "customer"
+        id = Column(Integer, primary_key=True)
+        name = Column(String(255))
+
+    def init_sqlalchemy(dbname='sqlite:///sqlalchemy.db'):
+        global engine
+        engine = create_engine(dbname, echo=False)
+        DBSession.remove()
+        DBSession.configure(bind=engine, autoflush=False, expire_on_commit=False)
+        Base.metadata.drop_all(engine)
+        Base.metadata.create_all(engine)
+
+    def test_sqlalchemy_orm(n=100000):
+        init_sqlalchemy()
+        t0 = time.time()
+        for i in range(n):
+            customer = Customer()
+            customer.name = 'NAME ' + str(i)
+            DBSession.add(customer)
+            if i % 1000 == 0:
+                DBSession.flush()
+        DBSession.commit()
+        print(
+            "SQLAlchemy ORM: Total time for " + str(n) +
+            " records " + str(time.time() - t0) + " secs")
+
+    def test_sqlalchemy_orm_pk_given(n=100000):
+        init_sqlalchemy()
+        t0 = time.time()
+        for i in range(n):
+            customer = Customer(id=i + 1, name="NAME " + str(i))
+            DBSession.add(customer)
+            if i % 1000 == 0:
+                DBSession.flush()
+        DBSession.commit()
+        print(
+            "SQLAlchemy ORM pk given: Total time for " + str(n) +
+            " records " + str(time.time() - t0) + " secs")
+
+    def test_sqlalchemy_core(n=100000):
+        init_sqlalchemy()
+        t0 = time.time()
+        engine.execute(
+            Customer.__table__.insert(),
+            [{"name": 'NAME ' + str(i)} for i in range(n)]
+        )
+        print(
+            "SQLAlchemy Core: Total time for " + str(n) +
+            " records " + str(time.time() - t0) + " secs")
+
+    def init_sqlite3(dbname):
+        conn = sqlite3.connect(dbname)
+        c = conn.cursor()
+        c.execute("DROP TABLE IF EXISTS customer")
+        c.execute("CREATE TABLE customer (id INTEGER NOT NULL, "
+                  "name VARCHAR(255), PRIMARY KEY(id))")
+        conn.commit()
+        return conn
+
+    def test_sqlite3(n=100000, dbname='sqlite3.db'):
+        conn = init_sqlite3(dbname)
+        c = conn.cursor()
+        t0 = time.time()
+        for i in range(n):
+            row = ('NAME ' + str(i),)
+            c.execute("INSERT INTO customer (name) VALUES (?)", row)
+        conn.commit()
+        print(
+            "sqlite3: Total time for " + str(n) +
+            " records " + str(time.time() - t0) + " sec")
+
+    if __name__ == '__main__':
+        test_sqlalchemy_orm(100000)
+        test_sqlalchemy_orm_pk_given(100000)
+        test_sqlalchemy_core(100000)
+        test_sqlite3(100000)
 
 How do I make a Query that always adds a certain filter to every query?
 ------------------------------------------------------------------------------------------------
@@ -606,7 +732,7 @@ at :func:`.mapper`.
 
 I'm using ``joinedload()`` or ``lazy=False`` to create a JOIN/OUTER JOIN and SQLAlchemy is not constructing the correct query when I try to add a WHERE, ORDER BY, LIMIT, etc. (which relies upon the (OUTER) JOIN)
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 
 The joins generated by joined eager loading are only used to fully load
 related collections, and are designed to have no impact on the primary
 results of the query.
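+
+When the WHERE/ORDER BY criteria *should* apply to the joined table, a
+common resolution is to state the JOIN explicitly and point the eager
+load at it with :func:`.contains_eager`. A minimal sketch, assuming a
+``User``/``Address`` mapping that is not defined in this document::
+
+    from sqlalchemy.orm import contains_eager
+
+    # the explicit join() is part of the primary query, so criteria
+    # against Address applies to it; contains_eager() tells the ORM to
+    # also populate User.addresses from those same joined columns
+    q = session.query(User).\
+        join(User.addresses).\
+        options(contains_eager(User.addresses)).\
+        filter(Address.email.like('%@example.com'))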