28 Mar 2004

Hint guided scheduling in GridDistribute

An interesting conversation in FreeBSDChina’s core:

delphij -> yarshure: I need some help on qmail
yarshure -> delphij: alu is qmail guru
delphij (continue the conversation with alu and finally solved the problem)

This is also often the case we met in a distributed system. Traditionally we obtain the 「who has a certain resource」 information from a static list, typically, a mirror list, or a DNS record, for example, we have two A records for www.frontfree.net, and hence, once beastie.frontfree.net is down, you can still access through the other mirror.

Unfortunately, while this approach is easy and highly efficient (You know what you will deal with before you deal with them!), it lacks the ability to maintain the distributed resource discovery, because it is not so flexible when the resource provider is going online and offline frequently – this is the common case in a typical P2P system.

The approach Grid Distribute will adopt is a different one. It balances efficiency and flexibility through a 「hint」 approach. This is similar with what in humanity social.

We do not explicitly store information whether a remote node has certain resource, because it might be changed from time to time, and maintaining that huge dataset is not a wise idea on a large network. Instead, we maintain the information whether a remote domain (namely, a colored zone), is 「interested in」 some resources in two levels, one is what vendor they interest, and the other is the atomic distribution sets (or files, this is TBD at present, I prefer my idea, atomic distribution set at present). A request will only go to zones which are 「interested in」 a certain vendor, and if we have more precious information, we will send less requests.

You see, we will definitely send information to those nodes that has not the data – they either 「interested in」 the named resource, but don’t have them yet, or 「not interested in」 the resource and fortunately they have a friend node in their own zone having the information, so they give us a hint.

The hint will be: 「Ok, ok, I’m not interested in your request, but I knew that Bob have the same habit you have, go ahead to ask him」.

We then store this information. In my example, I presented only one node, and of course, the hint can contain arbitrary nodes when necessary.

It then turns out that how to make the protocol more efficient. In order to implement this, we should provide more fine grained information – the atomic distribution set, as well as the vendor they are being produced. A 「liaison」, we call it 「Alice」 here, will know what others are interested (by vendor), but not necessarily to know the atomic distribution sets because the node Alice does not interest the vendor, and so it will not interest the atomic distribution set. With the vendor information, it will be able to find 「Vendor A club」, and then, with the atomic distribution sets, it will be able to find 「Who exactly in vendor A club likes the things」, and hence, provide a subset of Vendor A club which can provide useful information.

Still working on this idea, and I need feedback if you have some thoughts.

delphij's Chaos

Hint guided scheduling in GridDistribute