Thursday, July 29, 2010

Polymorphic Associations with SQLAlchemy

http://techspot.zzzeek.org/?p=13

The other day Ben Bangert pointed me over to Understanding Polymorphic Associations, a method used by ActiveRecord to allow a certain type of object to be associated with multiple kinds of parent object, along with the question "Can SQLAlchemy do this?".

My initial mental reaction when I hear "can SQLAlchemy do XYZ" is "probably", with the hopes that it isn't going to require a total refactoring of internals to make it happen. In this case, it took me awhile to figure out exactly what AR is getting at in this one, since my own cognitive dissonance was being heavily influenced by the meaning of "polymorphic" in SA terminology. In SA, the word "polymorphic" is usually meant to mean "a mapper that loads instances across a class hierarchy in one query". This means if you have a mapper for class Vehicle, and inheriting mappers for Car, Train and Airplane, you can load a list of Car, Train, and Airplane objects from just one Query call (typically via a UNION ALL statement...but some new twists to that idea are coming up in SA 0.4).

The "polymorphic association", on the other hand, while it bears some resemblance to the regular polymorphic union of a class hierarchy, its not really the same since you're only dealing with a particular association to a single target class from any number of source classes, source classes which don't have anything else to do with each other; i.e. they aren't in any particular inheritance relationship and probably are all persisted in completely different tables. In this way, the polymorphic association has a lot less to do with object inheritance and a lot more to do with aspect oriented programming (AOP); a particular concept needs to be applied to a divergent set of entities which otherwise are not directly related. Such a concept is referred to as a cross cutting concern, such as, all the entities in your domain need to support a history log of all changes to a common logging table. In the AR example, an Order and a User object are illustrated to both require links to an Address object.

While studying AR's example, the primary thing I was looking for was not the various keywords and directives used to express the object relationship in Ruby, instead I was looking for how the database tables are laid out. Once I found that, the exact nature of the relationship would immediately become apparent. This is one thing I like about SQL; it cuts through all the abstracted mumbo jumbo and just shows you the raw relationships in the most plain spoken way possible. Like virtually every ORM I've ever looked at, I could see AR's docs were no exception in that they were long on configurational/usage patterns and very brief on illustrating actual SQL. SQLAlchemy is the only ORM I know of, for any platform, which does not take the position of SQL as an "implementation detail". It takes the position that the developer wants to deal with domain objects as well as SQL/relational concepts at equal levels. Does this violate abstraction and/or encapsulation ? Not at all, SQLAlchemy isn't in the business of building an opaque abstraction layer for you. Its just providing the richest zone of overlap between the two idioms as it possibly can so that you can build the best abstraction for your needs (or use something off the shelf like Elixir).

So I found AR's primary table for its users/orders/addresses example to be the addresses table. This is mostly a regular looking table, with some extra columns added to create the association. Lets illustrate the AR example piece by piece by building a copy of it in SA. We'll start with just the concrete tables and relationships and worry about generalizing a feature out of it later.

We'll start with the standard SQLAlchemy boilerplate for SA examples...import the whole namespace in just to save typing, set up an anonymous connection to an in-memory sqlite database:

from sqlalchemy import *
metadata = BoundMetaData('sqlite://', echo=False)

Next lets look at their addresses table (I added a surrogate key because I think AR is actually going to put one there):

addresses = Table("addresses", metadata,
Column('id', Integer, primary_key=True),
Column('addressable_id', Integer),
Column('addressable_type', String(50)),
Column('street', String(100)),
Column('city', String(50)),
Column('country', String(50))
)

A totally normal address table, except for the combination of the addressable_id and addressable_type columns. Normally, you'd see a column such as user_id or customer_id with an explicit foreign key constraint to a table like users or customers, so that any number of address rows can be associated to that particular referenced table. But here, addressable_id has no foreign key whatsoever. In this example, that's intentional. It's because many different "source" tables can conceivably be referenced by addressable_id against their surrogate primary key column. From this, it follows that if more than one source table has the same primary key value, its ambiguous which of those tables is referenced by a particular addressable_id value. Therefore, the addressable_type column qualifies which of those tables we should be looking at. Such as, if two tables users and orders both wanted to reference rows in addresses, you'd get this:

users:      user_id   name
-----------------
1 bob
2 ed

orders: order_id description
-------------------------
1 order #1
5 order #2

addresses: id addressable_id addressable_type street
----------------------------------------------------------------------
1 1 'user' 123 anywhere street
2 1 'user' 345 orchard ave
3 2 'user' 27 fulton st.
4 1 'order' 444 park ave.

The first three rows of addresses have an addressable_id which references rows in users, the last row has an addressable_id which references a row in orders. So to get all the address rows that are mapped to user id 1, you'd say:

select * from addresses where addressable_id=1 and addressable_type='user'

If you didn't have the extra criterion for addressable_type, you'd also get address id 4, which is related to an order, not a user.

As a side note, I've been noticing that AR doesn't even support foreign keys out of the box, instead requiring a plugin in order to deal with them. People say this is due to AR's history as a MySQL-oriented ORM, and as we all know MySQL for many years tried to convince us all that foreign keys were silly. Well they lost that argument and its too bad AR is still trying to make it, but whatever. Even though this schema requires a relationship with an explicit "no FK constraint" in order to function, I'm going to make a guess that AR's history as a non-FK oriented tool is what led to the adoption of patterns like this in the first place.

Well, your DBA will never take you seriously again, but lets go with it (for now). The "join" condition implied by the "user" query above is:

addressable_id=; AND addressable_type=

So that we may start constructing SQL expressions the SQLAlchemy way, lets write out the users and orders tables:

users = Table("users", metadata,
Column('id', Integer, primary_key=True),
Column('name', String(50), nullable=False)
)

orders = Table("orders", metadata,
Column('id', Integer, primary_key=True),
Column('description', String(50), nullable=False)
)

Each table has a surrogate primary key and just one column of actual data for example's sake. Placeholder classes for both entities:

class User(object):
pass

class Order(object):
pass

From what we've seen so far, it wont be possible for an Address object to exist without having both an addressable_type and an addressable_id attribute. SQLAlchemy takes care of "foreign key" relationships (even though we don't yet have a real "foreign key" set up), but lets add the "type" attribute to the constructor since we know we'll have to populate that ourselves:

class Address(object):
def __init__(self, type):
self.addressable_type = type

Next, we know we're going to have a little bit of code-intensive stuff to set up the relationship to both User and Order (and whatever else), including making it look as much like the AR example as possible, so lets set up a function addressable which will do the work of the "addressable interface". In this function, we want to have the ability to mark a class as "addressable" using either a one-to-one or one-to-many (many-to-many is of course totally doable, but that would require a slightly different table setup, so we aren't addressing that here). We also want to define the create_address() method on the source class which will both create a new Address object as well as properly associate it with the parent instance, and we want to be able to fetch the parent instance given an Address object, so we will need some backref (bi-directional relationship) action in there too. Additionally, our table has no foreign key constraints on it yet we need to set up a relationship. This will have to be expressed manually. We can generically define the join condition between any table and the addresses table via:

primaryjoin = and_(
list(table.primary_key)[0] == addresses.c.addressable_id,
addresses.c.addressable_type == table.name
)

Above, we are using the primary_key attribute of the source table, converted from an unordered set to a list and pulling the sole column from that list, to get at the single primary key column of the source table (we are assuming a single-column primary key for source tables, since thats all that addressable_id can handle referencing to). The second part of the criterion relates the addressable_type column to the string name of the table. We're using that only because its a string name that happens to be conveniently available; substitute other naming schemes according to taste.

The full function then looks like:

def addressable(cls, name, uselist=True):
"""addressable 'interface'."""

def create_address(self):
a = Address(table.name)
if uselist:
getattr(self, name).append(a)
else:
setattr(self, name, a)
return a

cls.create_address = create_address

mapper = class_mapper(cls)
table = mapper.local_table

# no constraints. therefore define constraints in an ad-hoc fashion.
primaryjoin = and_(
list(table.primary_key)[0] == addresses.c.addressable_id,
addresses.c.addressable_type == table.name
)

foreign_keys = [addresses.c.addressable_id]

mapper.add_property(name, relation(
Address,
primaryjoin=primaryjoin, uselist=uselist, foreign_keys=foreign_keys,
backref=backref(
'_backref_%s' % table.name,
primaryjoin=list(table.primary_key)[0] == addresses.c.addressable_id,
foreign_keys=foreign_keys
)
)
)

The addressable function is given the source class, the name under which the Address-referencing scalar or collection attribute should be placed, and the uselist flag indicating if we are doing one-to-one or one-to-many. The first thing it does is create a create_address() function which is attached to the source class as an instance method; when invoked, this method will instantiate an Address with its proper type string and associate it with the parent using the scalar or list semantics specified. We then retrieve the Mapper and Table associated with the source class and set up a new relation() relating the source class directly to the Address class, and add it to the source mapper using the desired attribute name. We also set up a backref which has a more unusual name given to it.

Both the forward and backwards refs use a constructed primary join condition and an explicit "foreign key" column. How can we say theres a foreign key involved ? The tables themselves have no FKs whatsoever. Well, turns out that mappers and relations don't really care if there are any actual FKs defined for the tables (or if its even impossible to define any, as is the case here). They only care what columns you'd like to pretend have a foreign key constraint for the purposes of this relationship, if there aren't any for real (or even if there are). So in all cases, each relation added to a source class' mapper should consider the addressable_id column to be the "foreign key" column, at least as far as the relationship between a particular source row and address row is concerned. The same goes for the backwards reference which is essentially identical except its looked at from the reverse direction.

Let's look at the backref. First of all, its join condition is lacking the comparison to the addressable_type column. While it still works if the comparison is present, its not actually needed in the case of the backreference since it's invoked from the perspective of a row in the addresses table, and it actually gets in the way of the backref's lazy loader optimizing its reverse-load for a parent User or Order already present in the session. I certainly didn't guess this ahead of time, I only know it because I tried with the full condition to start with, and I watched the SQL generated as I ran my test program (this notion of experimenting and observing to see what works was the inspiration behind the word "alchemy", BTW, just don't blow up your kitchen).

The backref itself has a name which is derived from the source table name. This implies that after several source classes have had addressable applied to them, Address will have several mapped properties of the form _backref_, such as:

# get an address
someaddress = session.query(Address).get(5)

# see if it has an associated Order object
order = someaddress._backref_orders

# see if it has an associated User object
user = someaddress._backref_users

Thats a pretty unpleasant interface above. We actually don't have to guess what kind of object an Address is related to, because it already has an addressable_type attribute. Lets redefine the Address class with a handy property that will give us the correct associated source instance:

class Address(object):
def __init__(self, type):
self.addressable_type = type
member = property(lambda self: getattr(self, '_backref_%s' % self.addressable_type))

mapper(Address, addresses)

And that, adding in a simple mapper def for Address is it...as far as our addressable interface. One hardwired version of a polymorphic association creator. Lets create mappers for User and Order and use our new toy:

mapper(User, users)
addressable(User, 'addresses', uselist=True)

mapper(Order, orders)
addressable(Order, 'address', uselist=False)

Notice that we didn't do the usual properties={} thing when setting up the Address relation, as the addressable function calls add_property() on the source mapper for us. This is because we needed the attribute name to set up our create_address() method as well, and we only want to have to type the concrete attribute name once. Its not quite as nice as AR's minimalist syntax, but its a custom job at this point; in a couple of weeks someone will probably stick this feature into Elixir and it'll be just as pleasing all over again. Plus we're not done with it yet anyway.

So everything is set up; lets fire it up:

metadata.create_all()

and create a session and manipulate some objects (note to any AR users reading this: before you cringe at the verbosity of using Session and Query objects [not to mention the whole Table and mapper() dichotomy], it should be noted that both SA as well as Elixir provide myriad ways of creating more succinct code than this...we are showing "raw SA" here, which while it's the most verbose is also the most explicit and flexible, and is also the style everyone knows):

u1 = User()
u1.name = 'bob'

o1 = Order()
o1.description = 'order 1'

a1 = u1.create_address()
a1.street = '123 anywhere street'
a2 = u1.create_address()
a2.street = '345 orchard ave'

a3 = o1.create_address()
a3.street = '444 park ave.'

sess = create_session()
sess.save(u1)
sess.save(o1)
sess.flush()

sess.clear()

# query objects, get their addresses

bob = sess.query(User).get_by(name='bob')
assert [s.street for s in bob.addresses] == ['123 anywhere street', '345 orchard ave']

order = sess.query(Order).get_by(description='order 1')
assert order.address.street == '444 park ave.'

# query from Address to members

for address in sess.query(Address).list():
print "Street", address.street, "Member", address.member

Running the application without SQL logging gives us the output:

Street 123 anywhere street Member <__main__.user>
Street 345 orchard ave Member <__main__.user>
Street 444 park ave. Member <__main__.order>

And so ends Part I, how to mimic AR's exact method of "polymorphic association". It still feels a little empty knowing that it violates normalization. Or if not, take my word for it...you won't impress Yahoo's development team with this schema. Which brings us to the lightning round:

Part II. How to Really Do This.

It shouldn't be too hard to generalize out the addressable function we created above to provide the polymorphic-association-feature-ala-ActiveRecord as a pluggable module. Before we do that, lets fix the foreign key constraint issue. The AR article has a comment suggesting that, "Well, if you really need a constraint, put a trigger". Which again, is a little naive as to the full functionality of a foreign key (CASCADE, anyone?). In our case, we are going to "flip around" some aspects of the relationship such that the addresses table will be a totally normal table, and each source table will instead have a foreign key column referencing the association.

How will we do that, and still allow one-to-many relationships ? By doing what the name "association" implies - adding an association table. This new table will express part of what we're currently trying to express between just two tables.

For those following at home, the script re-begins right underneath the BoundMetaData declaration, with all new tables. First, the association table:

address_associations = Table("address_associations", metadata,
Column('assoc_id', Integer, primary_key=True),
Column('type', String(50), nullable=False)
)

In this table, the association itself becomes an entity of its own with its own primary key, as well as a type column which will serve a similar, but less intrusive, purpose as the previous addressable_type column. For the one-to-one and one-to-many use cases we are currently addressing, the addresses table can be expressed as o2m from this association table (as before, for a many-to-many association, yet another table is needed, which we won't go into in this article):

addresses = Table("addresses", metadata,
Column('id', Integer, primary_key=True),
Column('assoc_id', None, ForeignKey('address_associations.assoc_id')),
Column('street', String(100)),
Column('city', String(50)),
Column('country', String(50))
)

Back into familiar territory, where a table relation can have a foreign key. Note the lesser known trick we are using of defining the type as None for a column which has a foreign key; the resulting type will be taken from the referenced column. Now, when defining orders and users tables, they will take on the burden of expressing relationships to this association, by having the foreign key on their side:

users = Table("users", metadata,
Column('id', Integer, primary_key=True),
Column('name', String(50), nullable=False),
Column('assoc_id', None, ForeignKey('address_associations.assoc_id'))
)

orders = Table("orders", metadata,
Column('id', Integer, primary_key=True),
Column('description', String(50), nullable=False),
Column('assoc_id', None, ForeignKey('address_associations.assoc_id'))
)

This puts the "polymorphism" where it should be; on the part of the schema that is actually polymorphic. Lets see what our sample data looks like with this scheme:

users:                  user_id   name   assoc_id
----------------------------
1 bob 1
2 ed 2

orders: order_id description assoc_id
------------------------------------
1 order #1 3
5 order #2 NULL

address_associations: assoc_id type
------------------
1 'user'
2 'user'
3 'order'

addresses: id assoc_id street
-----------------------------------------
1 1 123 anywhere street
2 1 345 orchard ave
3 2 27 fulton st.
4 3 444 park ave.

So the above data shows that the addresses table loses its previously complex role. As an additional advantage, the limitation that all source tables (i.e. users and orders tables) need have a single-column surrogate primary key goes away as well (see how it all falls together when you follow the rules?)

Keen users might notice another interesting facet of this schema: which is that the relation from address_associations to the users and orders table is identical to a joined table inheritance pattern, where users and orders both inherit from address_associations (save for the NULL value for order #2, for which we could have a placeholder row in address_associations). Suddenly, the original premise that AR's "polymorphic associations" are nothing like SA's "polymorphic inheritance" starts to melt away. It's not that they're so different, its just how AR was doing it made it look less like the traditional inheritance pattern.

In fact, while this table scheme is exactly joined table inheritance, and we could certainly go the "straight" route of creating an Addressable base class and mapper from which the User and Order classes/mappers derive, then creating the traditional SA "polymorphic mapping" using UNION ALL (or whatever surprise 0.4 has in store) with a relationship to Address, here we're going to do it differently. Namely, because we are still going to look at this association as a "cross-cutting concern" rather than an "is-a" relationship, and also because SA's explicit inheritance features only support single inheritance, and we'd rather not occupy the "inherits" slot with a relationship that is at best a mixin, not an "is-a".

So lets define a new class AddressAssoc which we can use to map to the address_associations table, and a new addressable function to replace the old one. This function will apply a little bit more Python property magic in order to create collection- or scalar-based property on the source class which completely conceals the usage of AddressAssoc:

class AddressAssoc(object):
def __init__(self, name):
self.type = name

def addressable(cls, name, uselist=True):
"""addressable 'interface'."""

mapper = class_mapper(cls)
table = mapper.local_table
mapper.add_property('address_rel', relation(AddressAssoc, backref=backref('_backref_%s' % table.name, uselist=False)))

if uselist:
# list based property decorator
def get(self):
if self.address_rel is None:
self.address_rel = AddressAssoc(table.name)
return self.address_rel.addresses
setattr(cls, name, property(get))
else:
# scalar based property decorator
def get(self):
return self.address_rel.addresses[0]
def set(self, value):
if self.address_rel is None:
self.address_rel = AddressAssoc(table.name)
self.address_rel.addresses = [value]
setattr(cls, name, property(get, set))

The basic idea of the relationship is exactly the same as before; we will be adding distinct backrefs to the AddressAssoc object for each kind of relation configured. But observe that now that we're back in normalized constraint-land, the relationship is totally simple and requires no explicit joins or ad-hoc foreign key arguments. The "database" part of this operation is now just one line of code, and the rest is just some property-based "make-it-pretty" code to make usage more convenient. While the property code is a little verbose, overall the whole thing is conceptually simpler.

The member property on Address looks just about the same as before, save for that its pulling the backref off of its association object rather than itself, as defined below along with the mappers for Address and AddressAssoc:

class Address(object):
member = property(lambda self: getattr(self.association, '_backref_%s' % self.association.type))

mapper(Address, addresses)

mapper(AddressAssoc, address_associations, properties={
'addresses':relation(Address, backref='association'),
})

The Address object is now freely creatable with no weird create_address() function needed (although we probably could have found a way to do away with it in the first example too).

Lets do the same test code, minus the usage of create_address(), which of course we could add back in if we wanted it that way. Otherwise, the test is exactly the same, and the results are the same:

class User(object):
pass

mapper(User, users)
addressable(User, 'addresses', uselist=True)

class Order(object):
pass

mapper(Order, orders)
addressable(Order, 'address', uselist=False)

metadata.create_all()

u1 = User()
u1.name = 'bob'

o1 = Order()
o1.description = 'order 1'

a1 = Address()
u1.addresses.append(a1)
a1.street = '123 anywhere street'

a2 = Address()
u1.addresses.append(a2)
a2.street = '345 orchard ave'

o1.address = Address()
o1.address.street = '444 park ave.'

sess = create_session()
sess.save(u1)
sess.save(o1)
sess.flush()

sess.clear()

# query objects, get their addresses

bob = sess.query(User).get_by(name='bob')
assert [s.street for s in bob.addresses] == ['123 anywhere street', '345 orchard ave']

order = sess.query(Order).get_by(description='order 1')
assert order.address.street == '444 park ave.'

# query from Address to members

for address in sess.query(Address).list():
print "Street", address.street, "Member", address.member

Output is again:

Street 123 anywhere street Member <__main__.user>
Street 345 orchard ave Member <__main__.user>
Street 444 park ave. Member <__main__.order>

And there you have it, polymorphic associations, SQLAlchemy style.

Both of these examples, as well as a third version that "genericizes" the address function, are checked into the /examples directory of SQLAlchemy.

polyassoc1.py - AR version.

polyassoc2.py - Normalized version.

polyassoc3.py - Generic version.

No comments:

Post a Comment