21 Mar 2004

Different ideas on implementation of GD's distribution set

The “Distribution Set” is an important concept in GridDistribute’s vendor distribution. Here are the different ideas on how to implement it; the choice will influence which search approach we adopt. Personally, I think we must reach a final decision before next Wednesday.

My approach:

We will have two different types of distribution sets. One is the “atomic set”: atomic sets are pairwise disjoint, so any two of them have an empty intersection. The other is the “meta set”, which is a vendor-specified union of atomic sets.

For instance, suppose we have a collection of Beethoven’s Symphonies No. 5 and No. 9, named “Collection of Beethoven 5/9”, and another collection of all of Beethoven’s symphonies. Then we will have two atomic sets: one containing No. 5 and No. 9, and the other containing all the remaining symphonies.

All access control policies are applied to atomic sets, and the system processes atomic sets internally. Therefore, when you request “All Beethoven’s Symphonies”, the system will interpret it as “the two atomic sets”.
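To make the example concrete, here is a minimal sketch of how atomic sets could be derived from vendor-specified meta sets. It assumes that two files belong to the same atomic set exactly when they appear in the same meta sets; the function names `atomic_sets` and `resolve` are made up for illustration and are not part of GridDistribute.

```python
from collections import defaultdict


def atomic_sets(meta_sets):
    """Partition all files into disjoint atomic sets.

    Assumption: two files share an atomic set exactly when they are
    members of the same meta sets.
    """
    membership = defaultdict(set)  # file -> names of the meta sets containing it
    for name, files in meta_sets.items():
        for f in files:
            membership[f].add(name)

    groups = defaultdict(set)  # frozenset of meta set names -> files
    for f, names in membership.items():
        groups[frozenset(names)].add(f)
    return groups


def resolve(meta_name, groups):
    """Return the atomic sets whose union is the requested meta set."""
    return [files for names, files in groups.items() if meta_name in names]


# The Beethoven example from the text; symphonies are identified by number.
meta_sets = {
    "Collection of Beethoven 5/9": {5, 9},
    "All Beethoven's Symphonies": set(range(1, 10)),
}
groups = atomic_sets(meta_sets)
parts = resolve("All Beethoven's Symphonies", groups)
print(sorted(sorted(p) for p in parts))
```

Requesting “All Beethoven’s Symphonies” resolves to two atomic sets here, while requesting “Collection of Beethoven 5/9” resolves to just one.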

This approach gives us better search performance while retaining the flexibility to change the sets, including adding other files to them. The drawback is that when we add or remove files, new atomic sets must be generated and the old ones destroyed, so that the access control policy stays consistent across the old and new atomic sets.

Therefore, it will cost more when vendors change their “releases” frequently by adding and removing files from their distribution sets. However, I don’t see the update cost as a big problem, since we must maintain mandatory access control over the network anyway. It will be the responsibility of every node to update its local resource directory when there is a change, because access to a file might be granted or revoked, and the change must be applied soon to keep the system efficient and secure (no unwanted information disclosed because of an outdated access grant list).

hengdm’s approach:

The above idea is essentially based on hengdm’s original proposal for the resource directory. Here is his own (different) amendment to the original.

The resource directory contains files and several meta directories. In the above scenario, we will have Beethoven’s 5, 9 and the other symphonies as individual entries. This means we maintain “which files are you interested in” information, rather than “which atomic sets are you interested in” information. The benefits are, in my view:

No (unnecessary) search attempts on unrelated nodes. That is to say, if I downloaded Symphony #5 and not #9, then when you look for #9 you won’t have to bug me.
There’s no need to maintain atomic sets across meta set changes. This will make it easier for a vendor to release several distribution sets.
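The first benefit can be sketched as a per-file interest index. This is only an illustration: the node names, file identifiers and helper functions are hypothetical, and the set-level lookup reflects my reading that, with atomic sets, a node is recorded against the whole set rather than the single file it fetched.

```python
from collections import defaultdict

# file identifier -> nodes known to hold that file (per-file interest)
interest = defaultdict(set)


def record_download(node, file_id):
    """Remember that `node` holds `file_id`."""
    interest[file_id].add(node)


def nodes_to_ask_by_file(file_id):
    """Per-file granularity: only nodes that hold this exact file."""
    return set(interest.get(file_id, set()))


def nodes_to_ask_by_set(atomic_set):
    """Set granularity: every node that holds any file of the atomic set."""
    nodes = set()
    for f in atomic_set:
        nodes |= interest.get(f, set())
    return nodes


record_download("node_a", "symphony-5")  # node_a took #5 only
record_download("node_b", "symphony-5")
record_download("node_b", "symphony-9")

print(nodes_to_ask_by_file("symphony-9"))
print(nodes_to_ask_by_set({"symphony-5", "symphony-9"}))
```

When looking for #9, the per-file lookup contacts only `node_b`, while the set-level lookup would also contact `node_a`, which cannot serve the request.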

The drawbacks are apparent, too:

There might be much more information to maintain. My approach maintains three different granularity levels of resource “interests”: by vendor, by atomic set, and by file, of which only the first two are distributed across the network. Hengdm’s approach maintains two levels: by vendor and by file. Even with fewer levels, tracking interest per file will generate much more information for nodes and areas to maintain.

With this in mind, is the extra information worth maintaining for the benefit it brings to search?

While there’s no need to maintain so many atomic sets, we will have to maintain access control on individual files. For example, if a new distribution set contains several files, the system will have to look up (and update when necessary) the ACL of every one of those files, even if the files are already contained in another distribution set with the same ACL.
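A rough way to compare the ACL work in the two approaches is to count lookups for a new distribution set. The sketch below assumes one lookup per file in hengdm’s approach and one lookup per overlapping atomic set in mine, and that the new set is a union of existing atomic sets (so nothing has to be split). Both functions are hypothetical.

```python
def acl_lookups_per_file(new_set):
    """Per-file ACLs: one lookup for every file in the new set."""
    return len(new_set)


def acl_lookups_per_atomic_set(new_set, atomic_sets):
    """Per-atomic-set ACLs: one lookup for every atomic set it touches."""
    return sum(1 for atomic in atomic_sets if atomic & new_set)


# Atomic sets from the Beethoven example, identified by symphony number.
atomic = [frozenset({5, 9}), frozenset({1, 2, 3, 4, 6, 7, 8})]
new_set = set(range(1, 10))  # a new release containing all nine symphonies

print(acl_lookups_per_file(new_set))
print(acl_lookups_per_atomic_set(new_set, atomic))
```

The gap grows with the number of files per atomic set, which is where the per-file approach pays for its flexibility.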

I think the major difference between the two approaches is flexibility, and we should consider the performance impact more carefully. Hopefully we can combine the benefits and avoid the drawbacks of the two approaches, and work out a new, better approach next week.

delphij's Chaos
