Proxy Selection Using Hash Based Function

One of the difficulties in running a large scale proxy infrastructure is how to choose which proxy to use. This is not as straight forward as it sounds and there are various methods commonly used in selecting the best proxy to be used.

In hash-function-based proxy selection, a hash value is calculated from some information in the URL, and the resulting hash value is used to pick the proxy that is used. One approach could be to use the entire URL as data for the hash Function. However, as we’ve seen before, it is harmful to make the proxy selection completely random: some applications expect a given client to contact a given origin server using the same proxy chain.

For this reason, it makes more sense to use the DNS host or domain name in the URL as the basis for the hash function. This way, every URL from a certain origin server host, or domain, will always go through the same proxy server (chain). In practice, it is even safer to use the domain name instead of the full host name (that is, drop the first part of the host- name)—this avoids any cookie problems where a cookie is shared across several servers in the same domain.

It’s also useful when large amounts of data are involved and can indeed be used to switch proxies even during the same connection.  For example if someone is using a proxy to stream video – such as in this article – BBC iPlayer France, then the connection will be live for a considerable time with a significant amount of data.  In these situations, there is also limited requirement for any caching facilities particularly with live video streams.

This approach may be subject to “hot spots”—that is, sites that are very well known and have a tremendous number of requests. However, while the high load may indeed be tremendous at those sites’ servers, the hot spots are considerably scaled down in each proxy server. There are several smaller hot spots from the proxy’s point of view, and they start to balance each other out. I-lash—function-based load balancing in the client can be accomplished by using the client proxy auto-configuration feature (page 322). In proxy servers, this is done through the proxy server’s configuration file, or its API.

Cache Array Routing Protocol [CARP], is an advanced hash function based proxy selection mechanism. It allows proxies to be added and removed from the proxy array Without relocating more than a single proxy’s share of documents. More simplistic hash functions use the module of the URL hash to determine which proxy the URL belongs to. If a proxy gets added or deleted, most of the documents get relocated—that is, their storage place assigned by the hash function changes.

Where the allocations are shown for three and four proxies. Note how most of the documents in the three-proxy scenario are on a different numbered proxy in the four-proxy scenario. Simplistic hash-function-based proxy allocation using modulo of the hash function to determine which proxy to use. When adding a fourth proxy server, many of the proxy assignments change, these changed locations are marked with a diamond. Note that we have numbered the proxies starting from zero in order to be able to use the hash module directly.

John Ferris:

Further Reading Link

Tips on Debugging with telnet

It’s rather old school and can seem very time consuming in a world of automated and visual debugging tools, but sometimes the older tools can be extremely effective. It’s been a long time since telnet was used as a proper terminal emulator simply because it is so insecure, yet it’s still extremely useful as troubleshooting tool as it engages on a very simple level. Although it should be noted that it can be used securely using a VPN connection which will at least encrypt the connection.

One of the biggest benefits of the fact that HTTP is an ASCII protocol is that it is possible to debug it using the telnet program. A binary protocol Would be much harder to debug, as the binary data would have to be translated into a human-readable format. Debugging with telnet is done by establishing a telnet connection to the port that the proxy server is running on.

On UNIX, the port number can be specified as a second parameter to the telnet program:


For example, let’s say the proxy server’s hostname is step, and it is listening to port 8080. To establish a telnet session, type this at the UNIX shell prompt:

telnet step 8080

The telnet program will attempt to connect to the proxy server; you
will see the line


If the server is up and running without problems, you will immediately get the connection, and telnet will display
Connected to
Escape character is ’“]’.

(Above, the “_” sign signifies the cursor.) After that, any characters you
type will be forwarded to the server, and the server’s response will be dis-
played on your terminal. You Will need to type in a legitimate HTTP

In short, the request consists of the actual request line containing the method, URL, and the protocol version; the header section; and a single empty line terminating the header section.
With POST and PUT requests, the empty line is Followed by the request body. This section contains the HTML form field values, the file that is being uploaded, or other data that is being posted to the server.

The simplest HTTP request is one that has just the request line and no header section. Remember the empty line at the end! That is, press RETURN twice after typing in the request line.

(remember to hit RETURN twice)

The response will come back, such as,
HTTP/1.1 200 OK
Server: Google—Enterprise/3.0
Date: Mon, 30 Jun 1997 22:37:25 GMT
Content—type: text/html
Connection: close

This can then be used to perform further troubleshooting steps, simply type individual commands into the terminal and you can see the direct response. You should have permission to perform these functions on the server you are using. Typically these will be troubleshooting connections, however it can be a remote attack. Many attacks using this method will use something like a proxy or online IP changer in order to hide the true location.

Components of a Web Proxy Cache

There are several important components to the standard cache architecture of your typical web proxy server. In order to implement a fully functional Web proxy cache, a cache architecture requires several components:

  • A storage mechanism for storing the cache data.
  • A mapping mechanism to the establish relationship between the URLs to their respective cached copies.
  • Format of the cached object content and its metadata.

These components may vary from implementation to implementation, and certain architectures can do away with some components. Storage The main Web cache storage type is persistent disk storage. However, it is common to have a combination of disk and in-memory caches, so that frequently accessed documents remain in the main memory of the proxy server and don’t have to be constantly reread from the disk.

The disk storage may be deployed in different ways:

  • The disk maybe used as a raw partition and the proxy performs all space management, data addressing, and lookup-related tasks.
  • The cache may be in a single or a few large files which contain an internal structure capable of storing any number of cached documents.

The proxy deals with the issues of space management and addressing. ‘ The filesystem provided by the operating system may be used to create a hierarchical structure (a directory tree); data is then stored in filesystem files and addressed by filesystem paths. The operating system will do the work of locating the file(s). ° An object database may be used.

Again, the database may internally use the disk as a raw partition and perform all space manage- ment tasks, or it may create a single file, or a set of files, and create its own “filesystem” within those files. Mapping In order to cache the document, a mapping has to be established such that, given the URL, the cached document can be looked up Fast. The mapping may be a straight-forward mapping to a file system path, although this can be stored internally as a static route.

Typically a proxy would store any resource that is accessed frequently. For example in many UK proxies, the BBC website is extremely popular and it’s essential that this is cached. even for satellite offices it allows people to access BBC VPN through the companies internal network. This is because the page is requested and cached by the proxy which is based in the UK, so instead of the BBC being blocked outside the UK it is still accessible.

Indeed many large multinational corporations sometimes inadvertently offer these facilities. Employees who have the technical know how can connect their remote access clients to specific servers in order to obtain access to normally blocked resources. So they would connect through the British proxy to access the BBC and then switch to a French proxy in order to access a media site like M6 Replay which only allows French IP addresses.

It is also important to remember that direct mappings are normally reversible, that is if you have the correct cache file name then you can use it to produce the unique URL for that document. There are lots of applications which can make use of this function and include some sort of mapping function based on hashes.

Programming Terms: Garbage Collection

There are lots of IT terms thrown about which can be quite confusing for even the experienced IT worker. Particularly in the world of network programming and proxies, sometimes similar words have completely different meanings depending on where you are in the world.

Let’s step back for a minute, and look at what garbage collection means in the programming language world. Though not strictly relevant to the subject of this blog, it is a good way to illustrate the benefits and draw- backs of garbage collection type memory management, whether on disk or in memory. Compiled programming languages, such as C or Pascal, typically do not have run-time garbage collection type memory management.  Which is why they are often not suitable for heavy duty network resources such as the BBC servers which supply millions  streaming live TV such as Match of the Day from VPNs and home connections anywhere.

Instead, those languages require the program authors to explicitly manage the dynamic memory: memory is allocated by a call to malloc ( ) , and the allocated memory must be freed by a call to free ( ) once the memory is no longer needed. Otherwise, the memory space will get cluttered and may run out. Other programming languages, such as Lisp, use an easier [1] memory management style: dynamic memory that gets allocated does not have to be explicitly freed. Instead, the run-time system will periodically inspect its dynamic memory pool and figure out which chunks of memory are still used, and which are no longer needed and can be marked free.

Usually programming languages that are interpreted or object oriented (Lisp, ]ava, Smalltalk) use garbage collection techniques for their dynamic memory management. The determination of what is still used is done by determining whether the memory area is still referenced somewhere-—that is, if there is still a pointer pointing to that area. If all references are lost for example if it has been thrown away by the program—-the memory could no longer be accessed and therefore could be freed.

There are several different approaches to doing this reference detection. One approach is to make each memory block contain an explicit reference counter which gets incremented when a new reference is created and decremented when the reference is deleted or changed to point somewhere else. This requires more work from the run-time system when managing memory references. Another approach is simply to use brute force periodically and traverse the entire memory arena of the program looking for memory references and determine which chunks still get referenced.

This makes it easier and faster to manage memory references as reference counters don’t have to be updated constantly. However, at the same time it introduces a rather heavyweight operation of having to traverse the entire memory scanning for references.

John Williams

From – Web Site

No Comments Networks, Protocols

Subroutine – Passing Parameters

Passing parameters into Subroutines, following examples are from Perl scripts.

Parameters are passed into subroutines in a list with a special name — it’s called @_ and it doesn’t conform to the usual rules of variable naming. This name isn’t descriptive, so it’s usual to copy the incoming variables into other variables within the subroutine.

Here’s what we did at the start of the getplayer subroutine: $angle = $_[O]; If multiple parameters are going to be passed, you’ll write something like: ($angle,$units) = @_; Or if a list is passed to a subroutine: @pqr = @_; In each of these examples, you’ve taken a copy of each of the incoming parameters; this means that if you alter the value held in the variable, that will not alter the value of any variable in the calling code.

This copying is a wise thing to do; later on, when other people use your subroutines, they may get a little annoyed if you change the value of an incoming variable!   Although this method can also be used to hack into websites or divert video streams to bypass geo-blocking for example to watch BBC News outside the UK  like this.

Returning values Our first example concludes the subroutine with a return statement: return ($response); which very clearly states that the value of $response is to returned as the result of running the subroutine. Note that if you execute a return statement earlier in your subroutine, the rest of the code in the subroutine will be skipped over.

For example: sub flines { $fnrd = $_[0]; open (FH,$fnrd) or return (—1); @tda = ; close PH; return (scalar (@tda)); l will return a -1 value if the file requested couldn’t be opened.

Writing subroutines in a separate file
Subroutines are often reused between programs. You really won’t want to rewrite the same code many times, and you’ll
certainly not want to have to maintain the same thing many times over Here’s a simple technique and checklist that you can use in your own programs. This is from a Perl coding lesson, but can be used in any high level programming language
which supports subroutines.

Plan of action:
a) Place the subroutines in a separate file, using a file extension .pm
b) Add a use statement at the top of your main program, calling
in that tile of subroutines
c) Add a 1; at the end oi the file of subroutines. This is necessary since use executes any code that’s not included in subroutine blocks as the tile is loaded, and that code must return a true value — a safety feature to prevent people using TV channels and files that weren’t designed to be used.

No Comments News, Protocols, VPN

Network Programming : What are Subroutines?

What are subroutines and why would you use them?The limitations of “single block code” You won’t be the first person in the world to want to :

  • be able to read options from the command line
  • interpret form input in a CGI script –
  • pluralize words in English

But it doesn’t stop there, lets choose a few other seemingly simple but useful tasks that your code may need to accomplish.  You won’t be the first person in your organisation to want to

  • output your organisation’s copyright statement
  • validate an employee code
  • automatically contact a resource on your web site

These are the sort of tasks that may need to happen again and again, both in the same piece of codes or perhaps across different programs. You may need to handle the same data in several programs, or to handle in your programs the same data that your colleagues handle in theirs. And you may want to perform the same series of instructions at several places within the same program. Almost all programming languages, at least the high level ones can handle these operation including things like Perl. Even the beginners who start off with all your code has been in a single file and indeed has “flowed” from top to bottom.

You can use these subroutines to perform tasks that need to be repeated over and over again. In the context of network programming you could use a specific subroutine to assign a British IP address to a client or hardware device,
You have not been able to call the same code in two different places ‘ You have not been able to share code between programs — copying is not normally an option as it gives maintenance problems ~ You have not used your colleague’s code, nor code that’s available for everyone on the CPAN, nor additional code that’s so often needed that it’s shipped with the Perl distribution. First use of subroutines The first computer programs were written rather like the ones that we’ve written so far.

Each one for its own specific task. In time, programmers (said to be naturally lazy people) noticed that they could save effort by placing commonly used sections of code into separate blocks which could be called whenever and wherever they were needed. Such separate blocks were variously known as functions, procedures or subroutines.

We’ll use the word “subroutine” because Perl does! Structured programming The subroutine approach was then taken to extreme so that all the code was put into separate blocks, each of which could be described as performing a single task. For example, the program I run might be described as performing the task of “reporting on all towns with names matching a pattern”.  You could then split that task into multiple tasks for example creating multiple network connections to different servers.  On a multimedia server you could call the relevant subroutines depending on which channel was to be displayed e’g one for English channel, one for commercial ITV channel abroad  and another for a French variant.  All of these could be separate subroutines called from within the main code when the user presses a button.

No Comments News, Protocols

Intrusion Detection – Post Attack Phase

If you’re protecting any network then understanding the options and various phases of an attack can be crucial.  When you detect an intrusion, it’s important to quickly assess what stage the attack is at and what possible developments are likely.  Whether it’s a skilled attacker of some opportunist kid with some technical skill makes a huge difference in the possible outcomes.

Even regular, normal traffic in suspicious or unusual situations can indicate a possible intrusion. If you suddenly notice TCP three-way handshakes completing on TCP ports 20 and 21 on a home Web server, but you know that you do not run an FTP server at home, it is safe to assume that something suspicious is going on. Post—Attack Phase After an attacker has successfully penetrated a host on your network, the further actions he will take for the most part follow no predictable pattern.   Obviously the danger is much greater if the attacker is both skilled and has plans to further exploit your network while many will simply deface a few pages or use it as  a VPN to watch US or UK TV channels abroad.

This phase is where the attacker carries out his plan and makes use of any information resources as he sees fit. Some of the different options available to the attacker at this point include the following:

  • Covering tracks
  • Penetrating deeper into network infrastructure
  • Using the host to attack other networks
  • Gathering, manipulating, or destroying data
  • Handing over the host to a friend or hacker group
  • Walking or running away

If the attacker is even somewhat skilled, he is likely to attempt to cover his tracks. There are several methods; most involve the removal of evidence and the replacement of system files with modified versions.The replaced versions of system files are designed to hide the presence of the intruder. On a Linux box, netstat would be modified to hide a Trojan listening on a particular port. Hackers can also cover their tracks by destroying system or security log files that would alert an administrator to their presence. Removing logs can also disable an HIDS that relies on them to detect malicious activity. There are automated scripts available that can perform all these actions with a single command. These scripts are commonly referred to as root/ens.

Externally facing servers in large network topologies usually contain very little in terms of useful data for the attacker. Application logic and data is usually stored in subsequent tiers separated by firewalls.The attacker may use the compromised host to cycle through the first three attack phases to penetrate deeper into the system infrastructure. Another possibility for the black hat is to make use of the host as an attack or scanning box.When skilled hackers want to penetrate a high—profile network, they often compromise a chain of hosts to hide their tracks.   It’s not unusual for the attackers to relay their connections through multiple servers, bouncing from remote sites such as Russian, Czech and a German proxy for example before attacking the network.

The most obvious possibilities for the attacker are to gather, manipulate, or destroy data. The attacker may steal credit card numbers and then format the server. The cracker could subtract monies from a transactional database.The possibilities are endless. Sometimes the attackers motivation is solely to intrude into vulnerable hosts to see whether he can. Skilled hackers take pride in pulling off complicated hacks and do not desire to cause damage. He may turn the compromised system over to a friend to play with or to a hacker group he belongs to. The cracker may realize that he has gotten in over his head and attacked a highly visible host, such as the military’s or major financial institutions host, and want to walk away from it praying he isn’t later discovered.

Cryptographic Methods and Authentication

It used to be the domain of mathematicians and spies but know cryptography plays an important part in all our lives. It is important if we want to continue to use the internet for commerce and any sort of financial transactions. All our basic web traffic exists in the clear and is transported via a myriad of shared network equipment. Which means basically anything can be intercepted and read unless we protect it in some way – the most accessible option is to use encryption.

Cryptographic methods are utilized by software to maintain computing and data resources safe-,effectively shielding them with secret code or their,’key.   It’s not always necessary of course, the requirements are heavily dependent on what the connection is being used for.  For example there’s little point encrypting compressed streams like audio and video in normal circumstance, no-one is at risk from intercepting you streaming UK TV abroad from your computer.The key holder is the only individual who has access to the secure information. That individual might share the key with others, permitting them to also get into the information. In a digital world, and especially from the envisaged world of electronic commerce, the demand for safety which is backed by cryptographic systems is paramount. At the future, a person’s initial approach to most electronic devices, and especially to networked electronic devices, will demand cryptography working from the background. Whenever security is necessary, the first point from the human-to machine interface is that of authentication.

The electronic system should know with whom it’s dealing. But just how is this done?  Strong authentication is based on three characteristics which a user needs to have:

  • What the user knows.
  • What the user has.
  • Who the user is.

Today, a typical authentication routine will be to present what you’ve, a token like an identification card, then to uncover what you know, a pin number or password. In a very brief time in the future, the ,who you are kind of identification would be common, first on computers, and after that on an entire selection of merchandise, progressively phasing out the need for us to memorize contact numbers and passwords.  Indeed many entertainment websites are looking at developments in this field with a view to incorporating identity checks in a seamless way.  For example to allow access to UK TV license fee payers who want to watch the BBC from Ireland for example.

But where does the cryptography come to the equation? . In the easiest level, you might offer a system. Like a pc terminal, a password. The system checks your password. You can be logged on to the system. In this example of quite weak authentication, cryptographic methods are utilized to encrypt your password stored inside the system. If your password was held in clear text, rather than cipher text, then a person with an aptitude for programming could soon find the password inside the system and start to usurp and obtaining access to all of the information and system resources you’re permitted to use.

Cryptography does its best to defend the secret, which is your password. Now consider a system that requires stronger authentication. The automatic teller machine is a good example. To perform transactions in an Automated teller machine terminal, you want an ATM Card and a pin number. Inside the terminal, information is encrypted. The information the terminal transmits to the bank is also encrypted. Security is better, but not perfect, since the system will authenticate an individual who isn’t the owner of the card / pin number. The person might be a relative utilizing your card by permission, or he can be a burglar who has just relieved you of your pocket and is about to save you of your life savings. Time, you could think, for stronger authentication. Systems currently in field tests require an additional attribute based on your identity to strengthen the authentication procedure.

TCP/UDP Port Numbers

Both TCP and UDP require port numbers in order to communicate with the upper layers.  These port numbers are used to keep track of varying conversations which criss-cross the network simultaneously. The origin port numbers are dynamically assigned by the source host, most of them will be  at some number above 1024.   All the numbers below 1024 are reserved for specific services as defined in RFC 1700 – they are known as well known port numbers.

Any virtual circuit which is not assigned with a specified service will always be assigned a random port number from this range above 1024.    The port numbers will identify the source and destination in the TCP segment.    Here’s some common port numbers that are associated with well known services:

  • FTP – 21
  • Telnet -23
  • DNS – 53
  • TFTP – 69
  • POP3 – 110
  • News – 144

As you can see all the port numbers assigned are under 1023, whereas above 1024 and above are assigned by the upper layers to set up connections with other hosts.

The internet layer exists for two main reasons, routing and providing a specific network interface to the upper layers. As regards to routing none of the upper or lower layer protocols have any specific functions. Al the routing functionality is primarily the job of the internet layer. As well as routing the internet layer has a second function – to provide a single network interface and gateway to the upper layer protocols.
Application programmers, use this layer to to access the functionality into their application for network access. It is important as it ensures that there is a standardization to access the network layer. Therefore the same functions apply whether you’re on a ethernet or Token ring network.

IP provides a single network interface to access all of these upper layer protocols. The following protocols specifically work at the internet layer:

  • Internet Protocol (IP)
  • Internet Protocol (ICMP)
  • Address Resolution Protocol (ARP)
  • Reverse Address Resolution Protocol (RARP)

The internet protocol is essentially the Internet layer, all the other protocols merely support this functionality. So if for instance you buy UK proxy connections then IP would look at each packet’s address. Then using a routing table, the protocol would decide where the packet should be routed next. The other protocols, the network access layer ones at the bottom of the OSI model are not able to see the entire network topology as they only have connections to the physical addresses.

In order to decide on the specific route, the IP layer needs to answer two specific questions,. The first is which network is the destination host on and the second is what is the ID on that network.   these can be determined and allocated as the logical and hardware address.  The logical address is better known as the IP address and is a unique identifier on any network of the location of a specific host.  These are allocated by specific location and are used by websites to determine resources, so for example to watch BBC iPlayer in Ireland you’d need to route through a UK IP address and not your assigned Irish address.


Data Encapsulation and the OSI Model

When a client needs to transmit data across the network to another device an important process happens.  This process is called encapsulation and involved adding protocol information from each layer of the OSI model.   Every layer in the model only communicates with it’s peer layer on the receiving device.

In order to communicate and exchange information, each layer uses something called PDU which are Protocol Data Units.   These are extremely important and contain the control information attached to the data at each layer of the model.  It’s normally attached to the header of the data field however it can also be attached to the trailer at the end of the data.

The encapsulation process is how the PDU is attached to the data at each layer of the OSI model.  Every PDU has a specific name which is dependent on the information contained in each header.   The PDU is only read by the peer layer on the receiving device at which point it is stripped off and the data handed to the next layer.

Upper layer information only is passed onto the next level and then transmitted onto the network.    After this process the data is converted and handed down to the Transport layer this is done by setting up a virtual circuit to the receiving device by sending a synch packet.   In most cases the data needs to be broken up into smaller segments then a Transport layer PDU attached to the header of the field.

Network addressing and routing through the internetwork happens at the network layer and each data segment.    Logical addressing for example IP is used to transport every data segment to it’s destination network.  When the Network layer protocol adds the control header from the data received from the transport layer it is then described as packet or datagram.  This addressing information is essential to ensure the data reaches it’s destination.  It can allow data to traverse all sorts of networks and devices with the right delivery information added to subsequent PDUs on it’s journey.

One aspect that often causes confusion is the later where packets are taken from the network layer and placed in the actual delivery medium (e.g. cable or wireless for example). This can be even more confusing when other complications such as VPNs are included which involve routing the data through a specified path.   For example people route through a VPN server in order to access BBC iPlayer abroad like this post which will add additional PDUs to the data.   This stage is covered by the Data Link layer which encapsulates all the data into a frame and adds to the header the hardware address of both the source and the destination.

Remember for this data to be transmitted over a physical network it must be converted into a digital signal.  A frame is therefore simply a logical group of binary digits – 1 and 0s which is read only by devices on the local networks.   Receiving devices will synchronize the digital signal and extract all the 1s and 0s.  Here the devices build the frames and run a CRC (Cyclic Redundancy Check) in order to ensure it matches with the transmitted frame.

Additional Information 

No Comments Networks, Protocols, VPN