"I have a mind like a steel... uh... thingy." Patrick Logan's weblog.

Search This Blog

Friday, December 23, 2005

Representing RDF

Bill de hÓra explores the upside and down of using an RDBMS for a persistent representation of RDF. This reminds me of several times I've encountered a generic attribute/value pair relational data model. As Bill points out regarding the "disenabling" this does to Rails and Django (as they are currently designed), the generic model moves the domain vocabulary out of the relational schema and into the data itself.

I understand one of the keys to RDF is that this is a conceptual model for the semantic web and not necessarily a recommended persistent physical model. That doesn't mean it couldn't be, but I wonder what other models might make sense.

An advantage Bill points out to this approach is the RDBMS schema itself would change little if any over time. Everything becomes data stored using one simple, generic model. This tells me we need a database, independent of RDF or any other conceptual model, that can recognize more about the domain-specific data and relationships while supporting "agile data refactoring". Some databases are pretty good about this for star schema models. Star schemas kind of aggregate related RDF triples into another conceptual model that is only somewhat more complex than the RDF concept itself.

A star schema relates MxM things to each other through a common "fact". The simplest fact is a statement that such a relationship exists. So when some relationship is established at some point in time, say an employee is assigned a manager, the fact that this exists is established by a fact relating the employee (e.g. a row in the "employee" table) to the manager (e.g. another row in the "employee" table) via a fact (e.g. a row in the "management" table) along with other related "triples" vectoring through the same table (e.g. the "calendar" table for the start date and the "department" table for the organization being managed, perhaps the "role" table for the row of the role the employee is playing).

Then other triples even more tightly associated with each other are those "dimensions" themselves. So the relationship of the employee to their name, address, pay scale, and other vital statistics may simply be represented as columns in the same employee row. The result is a normalized fact table bringing together an array of denormalized dimension tables.

The kinds of changes to this model are simpler than those made to a "fully" normalized model. Columns are typically added to a dimension but are rarely removed. Columns infrequently change data type, but could. Dimensions added to the model, new facts are added, and new relationships between facts and dimensions.

A database not based on 25 year old approaches in the typical RDBMS are pretty efficient about representing this model and refactoring. Sybase IQ is one, e.g. in the Sybase Dynamic Operational Data Store. Efficient in this case means the space used in the database is typically *less* than the space used to represent the same data in a flat file. Very few databases have this capability and at the same time *reduce* the maintenance effort.

This model would be useful for RDF data sets that need to do a good bit of computation. For example when the "facts" of the relationships include numerical measurements of some kind (payments, temperatures, velocities, occurrences (e.g. attendance, page hit, etc.), etc.)

When the model is not so computation intensive I would suspect the queries would be more like searches and path navigation. The structure would not benefit from being in a more general attribute/value exploded RDF-like schema. Some kind of graph model would be better to support path navigation around large networks.

The three activities I think would come up frequently are more or less free-form search (text and other indexing independent of conceptual model), path navigation (across a large network of related RDF triples), and numeric computation (on measurements of some kind related to the things of the triple).

No comments:

Blog Archive

About Me

Portland, Oregon, United States
I'm usually writing from my favorite location on the planet, the pacific northwest of the u.s. I write for myself only and unless otherwise specified my posts here should not be taken as representing an official position of my employer. Contact me at my gee mail account, username patrickdlogan.