"I have a mind like a steel... uh... thingy." Patrick Logan's weblog.


Wednesday, December 20, 2006

Update In Place vs. Coordination Space

Data caches are good. Even distributed, shared, clustered data caches are good. There are many applications where this is the right thing to use.

Comparing them to Javaspaces keeps nagging at me. In some cases one product can serve both purposes. E.g. in Gigaspaces, the developer can choose the Javaspace API or the Map/Cache API. Gigaspaces also allows some funky update-in-place operations within a space, but those are explicitly not part of the Javaspaces API per se.

In any case I think it is imperative that a developer chooses what kind of mechanism they need for specific situations. And I think there are critical differences between the Javaspace mechanism and a data cache mechanism.

In particular, certain data caches run *in* the address space of the client JVMs, while a Javaspace is *never* in the address space of the client JVMs. (There are caches that run outside the address space, e.g. memcached, and that's another angle on the topic.) When these in-process caches use the java.util.Map API there are potentially funky goblins at play. A regular Java Map comes with certain expectations, object identity among them.

Consider an object V1 with a direct object reference to object O1. Now consider another object K1. Put V1 in a shared, clustered cache using K1 as the key. Update object O1 in that JVM. In another JVM in the cluster (JVM'), do cache.get(K1') to get a copy of V1' referencing a copy of O1'.

In this second JVM', update that O1'. Then from there do cache.put(K1', V1'). Note: a cache is *not* a transparent transactional memory OODB like Gemstone/S or Gemstone/J. Not that you'd want one of those anymore, but they do go to pains to maintain referential integrity, which is the cliff this example is heading off of. So the put in JVM' will soon update the first JVM so that the key K1' leads to the value V1' with a reference to the object O1'.

Meanwhile in the first JVM there are still, outside the cache, the objects K1, V1, and O1, with V1 referencing O1. When this JVM does cache.get(K1) it gets back V1' with a reference to O1'.

Question: In the first JVM what are the identities of the objects K1, K1', V1, V1', O1, and O1'?

Answer: I believe the answer depends on the cache implementation. I am not sure what the JCache spec says. Either there is leeway or not all caches do the same thing, and none of these behaviors may be what the developer expects based on experience using the out-of-the-box java.util.Map classes with single or concurrent threads.
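Here is a minimal sketch of one possible outcome, assuming a cache that copies values by Java serialization on every put and get, as a value would be copied when it crosses between JVMs. `CopyingCache`, `Holder`, `Inner`, and `IdentityDemo` are hypothetical names invented for illustration, not any product's API. A plain String stands in for the key K1, and a single JVM plays both roles. A real cache may well behave differently, which is the point of the question.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.HashMap;
import java.util.Map;

// O1 in the example: the object that V1 refers to.
class Inner implements Serializable {
    String state;
    Inner(String state) { this.state = state; }
}

// V1 in the example: the cached value holding a direct reference to O1.
class Holder implements Serializable {
    Inner inner;
    Holder(Inner inner) { this.inner = inner; }
}

// Hypothetical stand-in for a clustered cache (an assumption, not a real API):
// every put and get crosses a serialization boundary.
class CopyingCache {
    private final Map<String, byte[]> store = new HashMap<>();

    void put(String key, Serializable value) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
                out.writeObject(value);
            }
            store.put(key, buffer.toByteArray());
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    Object get(String key) {
        byte[] bytes = store.get(key);
        if (bytes == null) return null;
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return in.readObject();
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }
}

class IdentityDemo {
    // Returns { sameValue, sameInner, sameState, plainMapSameValue }.
    static boolean[] run() {
        CopyingCache cache = new CopyingCache();
        Map<String, Holder> plainMap = new HashMap<>();

        Inner o1 = new Inner("original");
        Holder v1 = new Holder(o1);
        cache.put("K1", v1);
        plainMap.put("K1", v1);

        // The "other JVM": get a copy, update the copy, put it back.
        Holder v1Prime = (Holder) cache.get("K1");
        v1Prime.inner.state = "updated remotely";
        cache.put("K1", v1Prime);

        // Back in the "first JVM", which still holds v1 and o1 outside the cache.
        Holder fromCache = (Holder) cache.get("K1");
        return new boolean[] {
            fromCache == v1,                         // is the value the same object?
            fromCache.inner == o1,                   // is the referenced object the same?
            o1.state.equals(fromCache.inner.state),  // did o1 see the remote update?
            plainMap.get("K1") == v1                 // what a regular Map would do
        };
    }

    public static void main(String[] args) {
        boolean[] r = run();
        System.out.println("cache returns same V1:  " + r[0]);
        System.out.println("cache returns same O1:  " + r[1]);
        System.out.println("O1 saw remote update:   " + r[2]);
        System.out.println("HashMap returns same V1: " + r[3]);
    }
}
```

Under these assumptions the first JVM ends up with two unrelated object graphs: the old V1 and O1 it still holds, and the fresh V1' and O1' it gets from the cache. A regular HashMap hands back the very object that was put in.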

In JBoss Cache the developer has the choice of a "tree cache" or a "pojo cache" (maybe others). The behavior will be different based on this choice. A pojo cache will patch up object references *within* the stream used to cluster the distributed caches. The tree cache does not. And I think this behavior can vary based on the use of their AOP mechanism.

Even the pojo cache, though, never patches up references (as far as I can tell) between cached objects and the references a JVM still holds, outside the cache, to their former versions. I.e. the deserializer does not do a sweep of the entire JVM address space to fix identity problems.

This is not necessarily a bad thing when the cache is used as a read-mostly, shared data cache backing some external data with a well-controlled update convention.

A Javaspace does not have this problem because of a simplifying specification -- Javaspaces do not deal with object identity in JVMs. You always get a new object. If you want to deal with identity then code it yourself in the Entry objects. But really don't do that very much!
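To make that concrete, here is a toy sketch. It is NOT the JavaSpaces API: `ToySpace`, `TaskEntry`, and `SpaceDemo` are hypothetical names, and matching by an id string stands in for real template matching. The assumption it illustrates is only the one stated above: every take hands back a new object, and any identity is an ordinary field the developer coded into the entry.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical entry type. The explicit id field is the only identity it carries.
class TaskEntry implements Serializable {
    final String id;
    String status;

    TaskEntry(String id, String status) {
        this.id = id;
        this.status = status;
    }

    // Equality is defined by the hand-coded id, not by JVM object identity.
    @Override public boolean equals(Object other) {
        return other instanceof TaskEntry && id.equals(((TaskEntry) other).id);
    }

    @Override public int hashCode() { return id.hashCode(); }
}

// Toy in-memory "space" (an assumption for illustration): write stores a
// serialized copy, take removes an entry and returns a freshly built object.
class ToySpace {
    private final List<byte[]> entries = new ArrayList<>();

    void write(TaskEntry entry) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
                out.writeObject(entry);
            }
            entries.add(buffer.toByteArray());
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    // Removes and returns the first entry with the given id, or null if none.
    TaskEntry take(String id) {
        Iterator<byte[]> it = entries.iterator();
        while (it.hasNext()) {
            TaskEntry candidate = deserialize(it.next());
            if (candidate.id.equals(id)) {
                it.remove();
                return candidate;
            }
        }
        return null;
    }

    private static TaskEntry deserialize(byte[] bytes) {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return (TaskEntry) in.readObject();
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }
}

class SpaceDemo {
    // Returns { freshObject, sameLogicalEntry, goneAfterTake, updateVisible, originalUntouched }.
    static boolean[] run() {
        ToySpace space = new ToySpace();
        TaskEntry original = new TaskEntry("task-1", "pending");
        space.write(original);

        TaskEntry taken = space.take("task-1");
        boolean freshObject = taken != original;           // never the object that was written
        boolean sameLogicalEntry = taken.equals(original); // identity via the coded id only
        boolean goneAfterTake = space.take("task-1") == null;

        // No update in place: change the copy, then write it back as a new entry.
        taken.status = "done";
        space.write(taken);
        TaskEntry updated = space.take("task-1");
        boolean updateVisible = "done".equals(updated.status);
        boolean originalUntouched = "pending".equals(original.status);

        return new boolean[] {
            freshObject, sameLogicalEntry, goneAfterTake, updateVisible, originalUntouched
        };
    }

    public static void main(String[] args) {
        for (boolean b : run()) System.out.println(b);
    }
}
```

Because nothing is ever updated in place, there is no moment at which a reference held outside the space silently disagrees with what the space holds. The writer's original object is simply a separate thing.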

That is different, and maybe not what you'd want at first. But that may indicate you're not using the best mechanism for your problem or you're not thinking about the mechanism the best way yet. This is one of those "architectural constraints" that seem to get in your way, but actually can simplify the solution for certain classes of problems. E.g. when the problem is "coordination" of processing rather than "clustering" of read-mostly data.

Yes?

1 comment:

PetrolHead said...

Couple of tidbits:

What would be the value of forcing everything to have some form of unique key as say the Map interface does when the data I'm playing with doesn't naturally have such a key?

In the remote case, maybe the identifier field of my object makes sense to me (perhaps because the identifier represents me in some way) but has no meaning for the recipient who's simply going to tweak it and give it me back? For example, if I receive a multicast, I'm not always going to care who it came from.

Not everything I might pass through a JavaSpace (a task, a message, a state change, a command) has a need for an id and thus forcing a developer to invent one is bad. Of course, when the developer needs a unique id and I don't provide it, that's bad ;)

Object identity across multiple address spaces is tricky. Things like memory addresses and even handles as used by most JVMs make no sense. You can try and make this work under the covers but it's sometimes easier to just put it up there in the developer's face and say, "you need to handle this explicitly so it's predictable in that it does what you expect" i.e. equals works the "right way".

Dan.


About Me

Portland, Oregon, United States
I'm usually writing from my favorite location on the planet, the pacific northwest of the u.s. I write for myself only and unless otherwise specified my posts here should not be taken as representing an official position of my employer. Contact me at my gee mail account, username patrickdlogan.