Sound advice - blog

Tales from the homeworld

Sat, 2005-Jun-04

Naming Services for Ports

I've been going on ad nauseam about the handling of ad hoc user daemons for a few days now. I've been somewhat under the weather, even to the point of missing some of both work and HUMBUG, so some of it may just be the dreaded lurgie talking. On the premise that the line between genius and madness is thin and the line between madness and fever even thinner, I'll pen a few more words.

So I'm looking for a stable URI scheme for getting to servers on the local machine that are started by users. I thought a little about the use of UNIX sockets, but they're not widely enough implemented to gain traction and don't scale to thin-client architectures. It seems that IP is still the way forward, and because we're talking REST, that's HTTP/TCP/IP.

Now the question turns to the more practical "how?". We have a good abstraction for IP addresses that allows us to use a static URI even though we're talking to multiple machines. We call it DNS, or perhaps NIS or LDAP. Maybe we call it /etc/hosts, and perhaps we even call it gethostbyname(3) or some more recent incarnation of that API. These models allow us to keep the http://example.com/ URI even if example.com decides to move from one IP address to another. It even works if multiple machines are sharing the load of handling this URI through various clever hacks. It scales to handling multiple machines per service nicely, but we're still left with this problem of handling multiple services per machine. You see, when we talk about http://example.com/ we're really referring to http://example.com:80/.
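
To make the name-to-address indirection concrete, here is a minimal sketch using Python's socket.getaddrinfo, one of the "more recent incarnations" of the gethostbyname(3) idea. The helper name addresses_for is my own invention for illustration, and what you get back depends entirely on how your resolver is configured.

```python
import socket

def addresses_for(host, port=80):
    """Return the distinct IP addresses the local resolver gives for host.

    The caller only ever names the host; whichever addresses currently
    sit behind that name are looked up at call time.
    """
    infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    # Each entry is (family, type, proto, canonname, sockaddr), and
    # sockaddr[0] is the textual IP address for both IPv4 and IPv6.
    return sorted({info[4][0] for info in infos})
```

Calling addresses_for("localhost") will typically report a loopback address, and the calling code never has to change when the mapping behind a name does.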

There's another way to refer to that URI, which may or may not be available on your particular platform setup. It's http://example.com:http/. It's obviously not something you want to type all of the time and is a little on the weird side with its repeated invocation of http. On the other hand, it might allow you to vary the port number associated with http without changing the URI. Because we don't change the URI we can gain benefits for long-term hyperlinking as well as short-term caching mechanisms. We just edit /etc/services on every client machine, and set the port number to 81 instead.
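
As a rough illustration of what a client would do with a named port, the sketch below parses text in the /etc/services format into a lookup table. The function name parse_services and the SAMPLE contents (including http moved to port 81) are assumptions made up for the example; on a real system socket.getservbyname("http", "tcp") consults the actual services database.

```python
def parse_services(text):
    """Map (service name, protocol) to a port number.

    Expects lines in the /etc/services layout:
        name  port/protocol  [aliases...]  [# comment]
    """
    table = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and padding
        if not line:
            continue
        fields = line.split()
        if len(fields) < 2 or "/" not in fields[1]:
            continue  # not a well-formed entry, skip it
        port_text, protocol = fields[1].split("/", 1)
        if not port_text.isdigit():
            continue
        # The official name and every alias resolve to the same port.
        for name in [fields[0]] + fields[2:]:
            table.setdefault((name, protocol), int(port_text))
    return table

# Hypothetical file contents, with http moved to port 81 as in the text.
SAMPLE = """\
# name  port/protocol  aliases
http    81/tcp    www    # moved from 80 for illustration
ssh     22/tcp
"""
```

With this table, parse_services(SAMPLE)[("http", "tcp")] gives the port a client would connect to for the name http, so the number can change while the name stays put.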

Hrrm... slight problem there, obviously. Although DNS allows the meaning of URIs that contain domain names to be defined by the URI owner, port naming is handled in a less dynamic manner. With DNS, the owner can trust clients to discover the new meaning of the URI when they come to actually retrieve data. Clients bear extra expense for doing so, but it is worth the benefit.

Let's assume that we'll be able to overcome this client discovery problem for the moment, and move over to the other side of the bridge. You have a process that gets started as part of your login operation to serve your own data to you via a more convenient interface than the underlying flat files would provide. Maybe you have a service that answers SQL queries for your sqlite files. You want to be able to enter http://localhost:myname.sqlite/home/myname/mydatabase?SELECT%20*%20FROM%20Foo into your web browser and get your results back, perhaps in an XML format. Maybe you yourself don't want to, but a program you use does. It doesn't want to link against the sqlite libraries itself, so it takes the distributed application approach and replaces an API with a protocol. Now it doesn't need to be upgraded when you change from v2.x to v3.x of sqlite. Put support in for http://localhost:postgres/* and http://localhost:mysql/* and you're half-way towards never having to link to a database library again. So we have to start this application (or a stand-in for it) at start-up. What happens next?

This is the divide I want to cross. Your application opens ports on the physical interfaces you ask it to, and implicitly names them. The trick is to associate these names with something a client can look up. On the local machine you can edit /etc/services directly, so producing an API a program can use to announce its existence in /etc/services might be a good way to start the ball rolling. Huhh. I just noticed something. When I went to check that full stop (.) characters were permitted in URI port names I found they weren't. In fact, I found that rfc3986 and the older rfc2396 rule out not only the full stop, but any non-digit character. Oh, bugger. I thought I might have actually been onto something... and if I type a port name into Firefox 1.0.4 it happily chops that part of the URL for me.
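
The grammar in question is short enough to check by hand: RFC 3986 defines the port as zero or more digits and nothing else. The sketch below encodes just that rule. The helper names are hypothetical, and the authority split is deliberately naive (it ignores userinfo and bracketed IPv6 literals).

```python
import re

# RFC 3986, section 3.2.3:  port = *DIGIT
PORT_RE = re.compile(r"[0-9]*")

def port_is_valid(port):
    """True if port matches the RFC 3986 port rule (digits only, may be empty)."""
    return PORT_RE.fullmatch(port) is not None

def split_host_port(authority):
    """Split "host:port" on the last colon; port is None when there is no colon.

    Naive on purpose: userinfo and IPv6 literals are not handled.
    """
    host, colon, port = authority.rpartition(":")
    if not colon:
        return authority, None
    return host, port
```

Feeding it the authority from the sqlite example, split_host_port("localhost:myname.sqlite") yields the port text "myname.sqlite", which port_is_valid rejects, as it would "http" or any other service name.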

Well, what I would have said if I hadn't come across that wonderful piece of news is that once you provide that API you can access your varied HTTP-protocol services on their dynamically allocated port numbers with a static URI because the URI only refers to names, not actual numbers. That would have been a leading off point into how you might share this port ownership information in the small-(virtual)-network and thin-client cases where it would matter.

That's a blow :-/

So it seems we can't virtualise port identification in a compliant URI scheme. Now that rfc3986 has come out you can't even come up with alternative authority components of the URI. DNS and explicit ports are all that are permitted, although rfc2396 allows for alternate naming authorities to that of DNS. The only way to virtualise would be to use some form of port forwarding, in which case we're back to the case of pre-allocating the little buggers for their purposes and making sure they're available for use by the chosen user's specific invocation.

Well, root owns ports below 1024 under UNIX. Maybe it's time to allocate other port spaces to specific users. The user could keep a registry of the ports themselves, and would just have to live with the ugliness of seeing magic numbers in the URIs all of the time. It's that, or become resigned to all services that accept connections defaulting to a root-owned daemon mode that performs a setuid(2) after forking to handle the requests of a specific user. Modelling after ssh shouldn't be all that detrimental, although while ssh sessions are long-lived, HTTP sessions are typically short. The best performance will always be gained by avoiding the bottleneck and allowing a long-lived server process to handle the entire conversation with its client processes. Starting the process when a connection is received is asking for a hit, just as using an intermediate process to pass data from one process to another will hurt things.

Another alternative would be to try and make these "splicing" processes more efficient. Perhaps a system call could save the day. Consider the case of processes A, B, and C. A connects to B, and B determines that A wants to talk to C. It could push bytes between the two ad nauseam, or it could tell the kernel that bytes from the file descriptor associated with A should be sent directly to the file descriptor associated with C. No extra context switches would be required and theoretically the interference of B could end up with no further performance hit.
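
For comparison, this is roughly what the "push bytes" option looks like when B does the work itself. It is only a sketch: relay and demo are names invented here, and the socket pairs stand in for real connections from A to B and from B to C. Every chunk is copied into B's memory and back out again, which is the overhead a kernel-level splice would avoid.

```python
import socket

def relay(src, dst, chunk_size=4096):
    """Copy bytes from src to dst until src reports end-of-file.

    Returns the number of bytes copied. Each chunk crosses into this
    process and out again, costing two system calls per chunk.
    """
    total = 0
    while True:
        data = src.recv(chunk_size)
        if not data:
            break
        dst.sendall(data)
        total += len(data)
    return total

def demo():
    # Stand-ins: A's connection to B, and B's connection to C.
    a_client, a_at_b = socket.socketpair()
    b_at_c, c_server = socket.socketpair()
    try:
        a_client.sendall(b"GET / HTTP/1.0\r\n\r\n")
        a_client.shutdown(socket.SHUT_WR)  # A has finished sending
        copied = relay(a_at_b, b_at_c)     # B shovels the bytes along
        received = c_server.recv(4096)     # C sees what A wrote
        return copied, received
    finally:
        for sock in (a_client, a_at_b, b_at_c, c_server):
            sock.close()
```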

Maybe a simple way of passing file descriptors between processes would be an easy solution to this kind of problem. I seem to recall a mechanism to do this in UNIX Network Programming, but that is currently at work and I am currently at home. Passing the file descriptor associated with A between B and C as required could reduce the bottleneck effect of B.
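
A minimal sketch of descriptor passing, assuming a UNIX-like system and Python 3.9 or later, where socket.send_fds and socket.recv_fds wrap the underlying sendmsg/recvmsg calls with an SCM_RIGHTS ancillary message. The function names hand_over, take_over and demo are hypothetical, and the whole exchange runs inside one process purely so that it can be tried out; in practice B and C would be separate processes joined by a UNIX socket.

```python
import socket

def hand_over(channel, fd):
    """B's side: send an open descriptor across a UNIX socket."""
    # At least one byte of ordinary data must accompany the descriptor.
    socket.send_fds(channel, [b"fd"], [fd])

def take_over(channel):
    """C's side: receive the descriptor that B sent."""
    _data, fds, _flags, _addr = socket.recv_fds(channel, 16, 1)
    return fds[0]

def demo():
    # b_side/c_side stand in for the UNIX socket between B and C.
    b_side, c_side = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
    # a_client/a_conn stand in for A's connection as accepted by B.
    a_client, a_conn = socket.socketpair()
    try:
        hand_over(b_side, a_conn.fileno())
        a_conn.close()  # B drops its copy and is out of the data path
        c_fd = take_over(c_side)
        with socket.socket(fileno=c_fd) as c_conn:
            c_conn.sendall(b"hello from C")  # C answers A directly
        return a_client.recv(64)
    finally:
        for sock in (b_side, c_side, a_client):
            sock.close()
```

Once the descriptor has been handed over, B is no longer involved in the conversation between A and C, which is exactly the bottleneck reduction being hoped for.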

Oh well, I'm fairly disillusioned as I tend to be at the end of my more ponderous blog entries.

Benjamin