One of the difficulties in running a large-scale proxy infrastructure is choosing which proxy to use. This is not as straightforward as it sounds, and several methods are commonly used to select the best proxy.
In hash-function-based proxy selection, a hash value is calculated from some information in the URL, and the resulting hash value is used to pick the proxy. One approach could be to use the entire URL as the input to the hash function. However, as we’ve seen before, it is harmful to make the proxy selection completely random: some applications expect a given client to contact a given origin server using the same proxy chain.
For this reason, it makes more sense to use the DNS host or domain name in the URL as the basis for the hash function. This way, every URL from a certain origin server host, or domain, will always go through the same proxy server (chain). In practice, it is even safer to use the domain name instead of the full host name (that is, drop the first part of the host name). This avoids any cookie problems where a cookie is shared across several servers in the same domain.
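The domain-based selection described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the proxy names, the helper names `domain_key` and `pick_proxy`, and the choice of SHA-256 are hypothetical and not taken from any particular proxy product. The "drop the first part" rule is applied only when more than two labels are present, which is a simplification; a real deployment would likely need a public-suffix list to handle names such as `example.co.uk` correctly.

```python
import hashlib
from urllib.parse import urlsplit

# Hypothetical proxy pool, numbered from zero by list position.
PROXIES = ["proxy0.example.net", "proxy1.example.net", "proxy2.example.net"]

def domain_key(url: str) -> str:
    """Return the host name with its first label dropped (assumed rule)."""
    host = (urlsplit(url).hostname or "").lower()
    labels = host.split(".")
    # Only drop the first label if at least two labels would remain,
    # so "example.com" is not reduced to "com".
    if len(labels) > 2:
        labels = labels[1:]
    return ".".join(labels)

def pick_proxy(url: str, proxies=PROXIES) -> str:
    """Map every URL in the same domain to the same proxy."""
    digest = hashlib.sha256(domain_key(url).encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as an unsigned integer.
    index = int.from_bytes(digest[:8], "big") % len(proxies)
    return proxies[index]
```

With this sketch, `pick_proxy("http://www.example.com/a")` and `pick_proxy("http://images.example.com/b")` return the same proxy, because both URLs reduce to the key `example.com`.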
This approach is also useful when large amounts of data are involved, and it can even be used to switch proxies during the same connection. For example, if someone is using a proxy to stream video (such as in this article, BBC iPlayer France), the connection will be live for a considerable time and will carry a significant amount of data. In these situations there is also little need for caching, particularly with live video streams.
This approach may be subject to “hot spots”, that is, sites that are very well known and receive a tremendous number of requests. However, while the load may indeed be tremendous at those sites’ servers, the hot spots are considerably scaled down at each proxy server. From the proxy’s point of view there are several smaller hot spots, and they start to balance each other out. Hash-function-based load balancing in the client can be accomplished by using the client proxy auto-configuration feature. In proxy servers, this is done through the proxy server’s configuration file or its API.
The Cache Array Routing Protocol [CARP] is an advanced hash-function-based proxy selection mechanism. It allows proxies to be added to and removed from the proxy array without relocating more than a single proxy’s share of documents. More simplistic hash functions use the modulo of the URL hash to determine which proxy the URL belongs to. If a proxy is added or removed, most of the documents get relocated; that is, the storage place assigned to them by the hash function changes.
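The property that only one proxy’s share of documents moves can be illustrated with highest-random-weight (rendezvous) hashing, which is the general idea CARP builds on. The sketch below is not the CARP hash function itself: the real protocol defines its own hash combination and per-proxy load factors, and SHA-256 plus the names `score`, `pick_proxy_hrw` and `moved_fraction` are assumptions made for this example.

```python
import hashlib
from typing import Sequence

def score(proxy: str, key: str) -> int:
    """Combined score for a (proxy, key) pair; higher wins."""
    data = f"{proxy}|{key}".encode("utf-8")
    return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")

def pick_proxy_hrw(key: str, proxies: Sequence[str]) -> str:
    """Choose the proxy with the highest score for this key."""
    return max(proxies, key=lambda p: score(p, key))

def moved_fraction(keys: Sequence[str],
                   before: Sequence[str],
                   after: Sequence[str]) -> float:
    """Fraction of keys whose assigned proxy differs between two arrays."""
    moved = sum(
        1 for k in keys
        if pick_proxy_hrw(k, before) != pick_proxy_hrw(k, after)
    )
    return moved / len(keys)
```

Because each key’s scores for the existing proxies do not change when a proxy is added, a key can only move if the new proxy scores highest for it. Going from three to four proxies therefore relocates roughly a quarter of the keys, and every relocated key lands on the new proxy.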
The effect becomes clear when the allocations for three and four proxies are compared: most of the documents in the three-proxy scenario end up on a differently numbered proxy in the four-proxy scenario. Figure: Simplistic hash-function-based proxy allocation, using the modulo of the hash value to determine which proxy to use. When a fourth proxy server is added, many of the proxy assignments change; the changed locations are marked with a diamond. The proxies are numbered starting from zero so that the hash modulo can be used directly.
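The relocation problem with modulo-based allocation can be reproduced with a small Python sketch. The hash values 0 through 11 are made up for illustration and stand in for real URL hashes; the function names are likewise hypothetical. Changed assignments are marked with an asterisk rather than a diamond.

```python
def modulo_proxy(hash_value: int, proxy_count: int) -> int:
    """Proxy number (starting from zero) under plain modulo allocation."""
    return hash_value % proxy_count

def relocated(hash_values, old_count: int, new_count: int):
    """Hash values whose proxy number changes when the array is resized."""
    return [
        h for h in hash_values
        if modulo_proxy(h, old_count) != modulo_proxy(h, new_count)
    ]

hash_values = list(range(12))  # illustrative hash values, not real URL hashes
changed = relocated(hash_values, 3, 4)

for h in hash_values:
    marker = "*" if h in changed else " "
    print(f"hash {h:2d}: 3 proxies -> {h % 3}, 4 proxies -> {h % 4} {marker}")

print(f"{len(changed)} of {len(hash_values)} assignments changed")
# prints "9 of 12 assignments changed"
```

Only hash values 0, 1 and 2 keep their proxy in this example. Every other document would have to be fetched again or moved, which is the cost that CARP-style schemes are designed to avoid.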