Sound advice - blog

Tales from the homeworld

Sat, 2006-Sep-30

Publish/Subscribe and XMPP

I have a long-standing interest in protocols and technologies. In the proprietary system I work with professionally, publish/subscribe is the cornerstone of realtime data collection. Client machines are capable of displaying updates from monitored field equipment with latencies measured according to the speed of light, plus processing delays.

My implementation is proprietary, so I have long been keeping an eye out for promising standards and research that may emerge into something positive. The solution must be architecturally sound. In particular, it should be scalable to the size of the Internet. I have some thoughts about this which mainly stem back to the protocol, Rohit Khare's dissertation Extending the REpresentational State Transfer Architectural Style for Decentralized Systems and my responses to it: Consensus on the Internet Scale, The Estimated Web, Routed REST, REST Trust Relationships, Infinite Buffering, and Use of HTTP verbs in ARREST architectural style.

I like the direct client to server nature of HTTP. You figure out who to connect to using DNS, then make a direct TCP/IP connection. Or indirect. For scalability purposes you can introduce intermediaries. These intermediaries are not confused about their role. It is to direct traffic on to the origin server. Sometimes this involves additional intermediaries, however these proxies are not expected to explicitly route data. That is a job for the network.

XMPP takes an instant-messenger approach to communications. JEP-0060 specifies a publish/subscribe mechanism for the XMPP protocol that apparently is seeing use as a transport for Atom to notify interested parties when news feeds are updated. I don't mind saying that the fundamental architecture irks me. Instead of talking directly to an end server or being transparently pushed through layers that improve network performance, we start out with the assumption that we are talking to an XMPP server. This server could be anywhere. Chances are that unlike your web proxy, it is not being hosted by your ISP. Instead of measuring the request in terms of the speed of light between source and destination plus processing delays, we need to consider the speed of light and processing delays across a disorganised mishmash of servers from here to Antarctica. XMPP itself also appears to be a poor match to the REST architectural style. On the face of it, XMPP appears to have confusing identifier schemes, nouns, content types, and a mishmash of associated standards and extensions that remind me more of the WS-* stack than of specifications or software stacks that are still used by the generation that follows their specifiers.

Nevertheless, GENA is dead outside of UPnP. The internet drafts submitted by Microsoft to the IETF don't match up with the specification that forms part of UPnP. Neither specification matches up to GENA implementations I have seen in the wild. I think that the fundamental reason for this is not that HTTP forms a poor transport for subscription at a base technological level, but that firewalls are generally set up to make requests back from HTTP servers impossible as part of a subscription mechanism. As such, a protocol that already supports bidirectional communication and is acceptable to firewalls yields a better chance of ongoing success. For the moment, it is a technology that works on the small scale and in the wild Internet today. Perhaps from that seed the organisational issue between servers will simply work itself out as the technology and associated traffic volume become more substantial and more important. After all, the web itself did not start out as the well-oiled, reliable and high-performance machine it is today.

So, it seems reasonable that when it comes to rolling out a standards-based subscription mechanism today, JEP-0060 should be the preferred option ahead of trying to define and promote an HTTP-based specification. That said, there are a number of principles that must be transferable to this XMPP-based solution:

In good RESTful style, subscriptions transfer a summarised sequence of the states of a resource. The first such state is the resource's state at the time the subscription request was received. This allows the state of the resource to be mirrored within a client and for the client to respond to changes in the resource's state. However, it is reasonable to also consider subscription to transient data that is never retained as application state in any resource. This data has a null initial state, no matter when it is subscribed to.
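As a minimal sketch of that principle (the `Resource` class, its `retain` flag and the method names are all hypothetical, not part of JEP-0060 or any library), a subscription might deliver the current state first and then each later state, with transient data modelled as a resource that never retains anything:

```python
class Resource:
    """Hypothetical in-memory resource that supports subscription."""

    def __init__(self, initial=None, retain=True):
        self._state = initial      # None models a null initial state
        self._retain = retain      # False: transient data, never kept as state
        self._subscribers = []

    def subscribe(self, callback):
        # The first notification is the state at the time of subscription.
        callback(self._state)
        self._subscribers.append(callback)

    def update(self, state):
        if self._retain:
            self._state = state
        # Every subscriber sees the change, retained or not.
        for callback in list(self._subscribers):
            callback(state)


mirror = []
breaker = Resource(initial="closed")
breaker.update("open")
breaker.subscribe(mirror.append)   # mirror starts from the current state
breaker.update("closed")           # and then follows later changes
```

A client that subscribes to a `retain=False` resource receives `None` first, however many messages went before it.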

Working through the XMPP protocol adds a great deal of complexity to the subscription relationship. Intermediaries handle the subscription, so they must also handle authorisation and other issues normally left out of the protocol to be handled within the origin server. In XMPP, the subscription effectively becomes a channel that certain users have a voice in and that other users can receive messages from. My expertise is very thin about XMPP, but on the face of things it appears that subscription data is routed through a server that manages the particular channel, the pubsub service. Perhaps this service could be replaced with an origin server if that was desired.

In terms of matching up with my expectations of a subscription service, well... localised resynchronisation and patch updates can both be supported, but not at the same time. The pubsub service can forward the last message to a new subscriber. If that message contains the entire state of the resource, the client is synchronised. If it is a patch update, the client cannot synchronise. There does not appear to be a way to negotiate or inform the client of the nature of the update. "Message" appears to be the only recognised semantic. This is understandable, I suppose, and fits at least a niche of what a pubsub system can be expected to do.

Summarisation seems to be on the cards only at the edge of the network (i.e. the origin server). This is probably the best place for summarisation, however the lack of differential flow control is a concern. The server appears to simply send messages to the pubsub service at the rate that service can accept them. What happens from there is not clearly cemented in my mind. Either the rate is slowed to meet the slowest client, messages are buffered infinitely (until the pubsub service crashes), or messages are buffered to a set limit and messages or clients are dropped past that point. There doesn't seem to be any way of reporting flow control back to the origin server in order to shape the summarisation activity at that point. If message dropping is occurring in the pubsub service then this should be more explicit. Other forms of summarisation may be preferable to the wholesale discard of arbitrary messages.
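To illustrate one alternative to discarding arbitrary messages, here is a rough sketch of a per-subscriber buffer that coalesces updates. It assumes every message carries the complete state of a named resource (it would not work for patch updates), and the class and method names are invented for illustration rather than taken from any pubsub implementation:

```python
from collections import OrderedDict


class SummarisingBuffer:
    """Hypothetical outbound buffer for one slow subscriber."""

    def __init__(self):
        self._pending = OrderedDict()   # resource name -> latest full state
        self.coalesced = 0              # intermediate states summarised away

    def offer(self, resource, state):
        if resource in self._pending:
            # The undelivered older state is superseded, not queued behind.
            self.coalesced += 1
        # Assigning to an existing key keeps its place in the delivery order.
        self._pending[resource] = state

    def take(self):
        """Return the oldest pending (resource, state) pair, or None if empty."""
        if not self._pending:
            return None
        return self._pending.popitem(last=False)


buf = SummarisingBuffer()
buf.offer("/pump/1", 10)
buf.offer("/pump/2", 5)
buf.offer("/pump/1", 11)
buf.offer("/pump/1", 12)   # the slow client will only ever see 12
```

The buffer can never grow beyond the number of distinct resources, and the `coalesced` counter is the kind of figure that could be reported back to an origin server if the protocol had somewhere to put it.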

JEP-0060 is long (really long) and full of inane examples. It is difficult to get a feel for what problems it does and does not solve. It doesn't contain text like "flow control", "loss", "missed", "sequence", "drop"... anything recognisable as how the subscription model relates to the underlying transport's guarantees. Every time I look through it I feel like crying. Perhaps I am just missing the point, but when it comes to internet-scale subscription I don't think this document puts a standards-based solution in play.

I need to be able to synchronise the state of a resource. I need the subscription mechanism to handle exceptional load or high latency situations effectively. I need it to be able to deal with thousands of changes per second across a disparate client base even in my small example. On the Internet I expect it to deal with millions or billions of changes per second. Will a jabber-style network handle that kind of load without breaking client service guarantees? How are overflow conditions handled? Can messages be lost, reordered, or summarised? Are messages self-descriptive enough to allow summarisation by the pubsub server?

Perhaps I should go and pen an internet draft after all. GENA isn't that far off the mark, and really does work effectively when no firewalls are in the way. Perhaps it would be a useful mechanism to reliably and safely transfer data between jabber pubsub islands.


Sat, 2006-Sep-30

Using TCP Keepalive for Client Failover

I recently covered my foray into using mechanisms that are as standard as possible between client and server to facilitate a fixed-period failover time. A client may have a request outstanding and may be waiting for a response. A client may have subscriptions outstanding to the server. Even a server that transfers its IP or MAC address to its backup during failover does not completely isolate its clients from the failover process. Failover and server restart both cause a loss of the state of the server's TCP/IP stack. When that happens, clients must detect it in order to successfully move their processing to the new server instance.

I had originally pooh-poohed TCP/IP keepalive as a limited option. Most (all?) operating systems that support keepalive use system-wide timeout settings, so values can't be tuned based on who you are talking to. I think this might be able to be overcome by Solaris zones, however. Also, the failover characteristics of a particular host with respect to the services it talks to are often similar enough that this is not a problem.

I want to keep end-to-end pinging to a minimum, so I only want keepalive to be turned on while a client has requests outstanding. An idle connection should not generate traffic. Interestingly, this seems to be possible by using the socket option. It should be possible to turn the keepalive on when a request is sent, and turn it back off again when the last outstanding response is received. In the meantime the active TCP/IP connection will often be sending data, so keepalives will most often be sent during network lull times while the server is taking time processing.

If I want my four second failover, it should just be a matter of setting the appropriate kernel variables to send requests every second or so and give up after a corresponding number of failures. Combined with IP-level server failover, and subscriptions that are persistent across the failover, this provides a consistent failover experience with a minimum of network load.
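A minimal sketch of that toggling, assuming the standard `SO_KEEPALIVE` socket option is the one meant above. The helper names and the 1s/1s/3-probe values are illustrative only. The per-socket `TCP_KEEPIDLE`, `TCP_KEEPINTVL` and `TCP_KEEPCNT` options are platform-specific (Linux has them), so the code applies them only where they exist and otherwise leaves the system-wide kernel settings in charge:

```python
import socket


def keepalive_budget(idle_s, interval_s, probes):
    """Rough worst-case seconds to notice a dead peer on a silent connection."""
    return idle_s + interval_s * probes


def set_keepalive(sock, enabled, idle_s=1, interval_s=1, probes=3):
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1 if enabled else 0)
    if not enabled:
        return
    # Optional per-socket tuning; skipped where the platform lacks an option.
    for name, value in (("TCP_KEEPIDLE", idle_s),
                        ("TCP_KEEPINTVL", interval_s),
                        ("TCP_KEEPCNT", probes)):
        option = getattr(socket, name, None)
        if option is None:
            continue
        try:
            sock.setsockopt(socket.IPPROTO_TCP, option, value)
        except OSError:
            pass  # fall back to the system-wide setting


class KeepaliveWhileBusy:
    """Hypothetical helper: keepalive is on only while requests are outstanding."""

    def __init__(self, sock):
        self._sock = sock
        self._outstanding = 0

    def request_sent(self):
        self._outstanding += 1
        if self._outstanding == 1:
            set_keepalive(self._sock, True)

    def response_received(self):
        self._outstanding -= 1
        if self._outstanding == 0:
            set_keepalive(self._sock, False)
```

With one second of idle time, one second between probes and three probes, `keepalive_budget(1, 1, 3)` comes to the four seconds I am after, give or take kernel timer granularity.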


Sat, 2006-Sep-30

Common REST Questions

I just came across a blog entry that includes a number of common misconceptions and questions about REST.

I posted a response in comments, but I thought I might repeat it here also:

RESTwiki contains some useful information on how REST models things differently to Object-Orientation. See:

and others. Also, see the REST Wikipedia article, which sums some aspects of REST up nicely:

The core of prevailing REST philosophy is the REST triangle, where naming of resources is separated from the set of operations that can be performed on resources, and again from the kinds of information representations at those resources. Verbs and content types must be standard if messages are to be self-descriptive, and the requirements of the REST style met. Also, there should be no crossover between the corners of the REST triangle. Names should not be found in verbs or content types, except as hyperlinks. Content should not be found in names or verbs. Verbs should not be found in names or content.

REST can be seen as a document-oriented subset of Object-Orientation. It deliberately reduces the expressiveness of Objects down to the capabilities of resources to ensure compatibility and interoperability between components of the architecture. Object-Orientation allows too great a scope of variation for internet-scale software systems such as the world-wide-web to develop, and doesn't evolve well as demands on the feature set change. REST is Object-Orientation that works between agencies, between opposing interests. For that you need to make compromises rather than doing things your own way.

Now, to address your example:
Verbs should not be part of the noun-space, so your urls

should not be things you POST to. They should demarcate the "void" state and the "reverse" state of your journal entry. When you GET the void URL it should return the text/plain "true" if the transaction is void and "false" if the transaction is not void. A PUT of the text/plain "true" will void the transaction, possibly impacting the state demarcated by other resources. Reverse is similar. The URL should be "reversal" rather than "reverse". It should return the URL of the reversing transaction, or indicate 404 Not Found to show no reversal. A PUT to the reversal would return 201 Created and further GETs would show the reversal transaction.
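A toy model of those two resources, just to make the GET and PUT behaviour concrete. Everything here is a sketch: the `handle` function, the `JournalEntry` class and the idea that the PUT body for "reversal" is the URL of the reversing transaction are my assumptions, and no HTTP library is involved:

```python
class JournalEntry:
    """Hypothetical application object behind the two resources."""

    def __init__(self):
        self.void = False
        self.reversal_url = None


def handle(entry, method, name, body=None):
    """Return (status, body) for a request to the entry's 'void' or 'reversal'."""
    if name == "void":
        if method == "GET":
            return 200, "true" if entry.void else "false"
        if method == "PUT":
            if body not in ("true", "false"):
                return 400, ""
            entry.void = (body == "true")
            return 200, body
    elif name == "reversal":
        if method == "GET":
            if entry.reversal_url is None:
                return 404, ""            # no reversal exists
            return 200, entry.reversal_url
        if method == "PUT":
            # Assumption: the body names the reversing transaction.
            entry.reversal_url = body
            return 201, body
    else:
        return 404, ""
    return 405, ""                         # e.g. POST is not supported here


entry = JournalEntry()
handle(entry, "PUT", "void", "true")       # voids the transaction
status, text = handle(entry, "GET", "void")
```

Note that nothing in the names says what to do; the verbs stay in the method and the state stays in the content.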

Creation in REST is simple. Either the client knows the URL of the resource they want to create and PUT the resource's state to that URL, or the client requests a factory resource add the state it provides to itself. This is designed to either append the state provided or create a new resource to demarcate the new state. POST is more common. The PUT approach requires clients to know something about the namespace that they often shouldn't know outside of some kind of test environment.

On swapping: This is something of an edge case, and this sort of thing comes up less often than you think when you are designing RESTfully from the start. The canonical approach would be to include the position of the resource as part of its content. PUTting over the top of that position would move it. This is messy because it crosses between noun and content spaces. Introducing a SWAP operation is also a problem. HTTP operates on a single resource, so there is no unmunged way to issue a SWAP request. Any such SWAP request would have to assume both of the resources of the unordered list are held by the same server, or that the server of one of these resources was able to operate on the ordered list.

On transactions: The CRUD verb analogy is something of a bane for REST. I prefer cut-and-paste. Interestingly, cut-and-paste on the desktop is quite RESTful. A small number of verbs are able to transfer information in a small number of widely-understood formats from one application to another. The cursor identifies and demarcates the information that will be COPIED (GET) or CUT (GET + DELETE) and the position where the information or state will be PASTED to (PUT to paste over, POST to paste after). The CRUD analogy leaves us wondering how to do transactions, but with the cut-and-paste analogy the answer is obvious: Don't.

In REST, updates are almost universally atomic. You do everything you need to do atomically in a single request, rather than trying to spread it out over several requests and having to add transaction semantics. If you can't see how to do without transactions you are probably applying REST at a lower level than it is typically applied. In this example, whenever you post a new journal entry you do so as a single operation. POST a complete representation of the journal entry to a factory resource.

That is not to say that REST can't do transactions. Just POST to a transaction factory resource, perform several POSTs to the transaction that was created, then DELETE (roll-back) or POST a commit marker to the transaction.
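As a rough in-memory illustration of that sequence (the `TransactionStore` class, the `/transactions/N` URL shape, the status codes and the literal "commit" marker are all hypothetical choices of mine):

```python
import itertools


class TransactionStore:
    """Hypothetical server-side state behind a transaction factory resource."""

    def __init__(self):
        self.journal = []                  # committed entries only
        self._open = {}                    # transaction URL -> staged entries
        self._ids = itertools.count(1)

    def post_factory(self):
        """POST to the factory: create a new, empty transaction resource."""
        url = "/transactions/%d" % next(self._ids)
        self._open[url] = []
        return 201, url

    def post(self, url, body):
        """POST to a transaction: stage an entry, or commit on the marker."""
        if url not in self._open:
            return 404, None
        if body == "commit":
            # All staged entries become visible in one step.
            self.journal.extend(self._open.pop(url))
        else:
            self._open[url].append(body)
        return 200, None

    def delete(self, url):
        """DELETE a transaction: roll back by discarding what was staged."""
        if self._open.pop(url, None) is None:
            return 404, None
        return 204, None


store = TransactionStore()
_, txn = store.post_factory()
store.post(txn, "debit 10")
store.post(txn, "credit 10")
store.post(txn, "commit")
```

Until the commit marker arrives, nothing staged in the transaction shows up in `store.journal`.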

How REST maps to objects is up to the implementation. You can evolve your objects independently of the namespace, which is expected to remain stable forever once clients start to use it. The URI space is not a map of your objects, it is a virtual view of the state of your application. Resources are not required or even expected to map directly onto objects. One method of a resource may operate on one object but another may operate on a different object. This is especially the case when state is being created or destroyed.

REST is about modelling the state of your application as resources, then operating on that virtualised state using state transfer methods rather than arbitrary methods with arbitrary parameter lists. REST advocates such as myself will claim this has significant benefits, but I'll refer you to the literature (especially the Wikipedia page) rather than list them here.


Mon, 2006-Sep-18

High Availability at the Network Level

I have been reading up over the last week on an area of my knowledge that is sorely lacking. Despite being deeply involved in the design of a high-availability distributed software architecture, I don't have a good understanding of how high availability can be provided at the network level. Given the cost of these solutions in the past and the relatively small scale of the systems I have worked with, we have generally considered the network to be a dumb transport. Deciding which network interface of which server to talk to has been a problem for the application layer. A typical allowable failover time in this architecture would be four seconds (4s). This is achieved with end-to-end pinging between the servers to ensure they are all up, and from clients to their selected server to ensure it is still handling their requests and subscriptions.

The book I have been reading is a Cisco title, Building Resilient IP Networks. It has improved my understanding of layer 2 switching, an area which has changed significantly since I last really looked at networking. Back then, the hub was still king. It also delved into features for supporting host redundancy including NIC teaming, clustering and the combination of Server Load Balancing and a Global Site Selector.

It may be just my lack of imagination, but the book seems to get tantalisingly close to a complete solution for the kinds of systems I build. Just not quite there. It talks about clustering and NIC teaming within a single access module, and that offers at least a half-way solution. It seems you could add a VLAN out to another site (thus another access module) for disaster recovery, but without offering a clear alternative the book repeatedly warns against such an architecture.

So, I have three servers. Two are at my main site. One is at my disaster recovery site. I can issue pings using layer three protocols, so I don't strictly need my servers to be on the same subnet. However, I need my clients to fail over from one to the other within a fixed period after any single point failure. It looks like I need IP address takeover between the sites to solve my failover problem at the network level.

The DNS-based Global Site Selector option discussed in the book is fine if we want the failover to affect only new clients. Old clients will retain cached DNS records, and may not issue another DNS query for requests that are still pending. Issuing a mass DNS cache expiry multicast or using very short DNS cache periods both seem like poor options. Ideally we would contain the failover event within the cluster and its immediate network somehow.

A routing-based failover solution might allow a floating IP address to be taken over by a different node within the cluster as failover occurs. For this to occur we would need a fast-converging OSPF network that allowed a single IP to be served from multiple sites. Failover of connections would be handled at the OSPF level. This solution (if implementable) would have similar characteristics to any multiple-site VLAN solution based on RSTP. The problem remains in either case of clients that are already in particular communication states with the failed server.

A current client may be either part-way through issuing a request, or may be holding a subscription to resources at a server. If the client is to reissue its request to the new server after a failover, the client can wait only as long as the failover time before declaring the original request failed and in an unknown state of completion. The maximum request processing time on the server is therefore bounded by the failover time, less the network latency to the client with the particular failover time.

An alternative to timing out when the failover time is reached would be to sample the state of the connection or request at a rate faster than that of the failover time. If your failover time is four seconds (4s), you could sample the state every three seconds, or two, or one. A timeout would not be necessary if the sampling indicated that the request was still being processed.
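A sketch of that sampling loop, with the details invented for illustration: `is_done` and `probe` stand in for whatever the client uses to check for a response and to ask whether the request is still being processed, and the default periods are arbitrary. The much longer `max_wait_s` is the eventual backstop for a request that never returns:

```python
import time


def await_response(is_done, probe, sample_period_s=1.0, max_wait_s=60.0,
                   sleep=time.sleep):
    """Wait for a response by sampling rather than by a failover-length timeout.

    is_done(): True once the response has arrived.
    probe():   True while the server still reports the request as in progress.
    sample_period_s should be shorter than the failover time.
    """
    waited = 0.0
    while not is_done():
        if waited >= max_wait_s:
            return "timed out"      # long backstop, not the failover time
        sleep(sample_period_s)
        waited += sample_period_s
        if not is_done() and not probe():
            # The server no longer knows the request: its state is unknown,
            # so the client can reissue it to the new server.
            return "failed"
    return "completed"


# Simulated run: the response turns up after three samples.
ticks = []
result = await_response(lambda: len(ticks) >= 3, lambda: True,
                        sleep=ticks.append)
```

The `sleep` parameter exists only so the loop can be exercised without real delays.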

The sampling itself could come in the form of a pipelined "PING" request queued up behind the main request. Whenever the transmit queue is nonempty on the client side, TCP will transmit packets on an exponential back-off strategy. So long as routes to the new server of the cluster IP address are established before too many packets are lost, the new server should respond indicating that it doesn't know about the connection. Another option would be to employ a dedicated request-state sampling protocol or to craft specially-designed TCP packets for transmission to sample the state.

Subscription is a problem in its own right. The server has agreed to report changes to a particular input as they come in, however the server may fail. The client must therefore sample the state of its subscriptions at a rate faster than the failover time if it is to detect failure, and issue requests to the new server to reestablish the subscription. This again is an intensive process that we would rather do without. One solution is to persist subscription state across failover. Clients should not receive an acknowledgement to their subscription requests until news of the subscription request has spread sufficiently far and wide.

Both the outstanding request and outstanding subscription client states can be resolved through these mechanisms when the server is behaving itself. However, there is the possibility that an outstanding request will never return. Likewise there is the possibility that a client's subscription state will be lost. For this reason, outstanding requests do demand an eventual timeout. Accordingly, outstanding subscriptions do need to be periodically sampled and renewed. These periods can be much longer than the failover period.

Clustering and network-based solutions can be expensive, but they can also provide scalable failover solutions for the service of new clients. Existing clients still need some belts and straps to ensure they fail over safely to the new server.

My high-availability operating-system support wishlist:


Sat, 2006-Sep-02

REST Triangle, URLConstruction, RESTfulDesign

I have put a few more draft documents up on restwiki:

The article names perhaps don't quite do them justice. The REST Triangle article is about how REST decouples various problem domains from each other to be solved separately. It lays out what those problem domains are, what the purpose of each problem domain is, and why you should avoid crossover between the problem domains. URL Construction is something of a splinter discussion about why URLs shouldn't be used to convey information to clients, and why clients shouldn't construct URLs.

RESTful design is my summary of how to design a RESTful interface. It is based on Object-Oriented design, at least until you get past the resource definition into the definition of hyperlinks and content schemas. I think the information there is good, although I haven't included a specific case study as yet. Also, restwiki doesn't look like it supports images. It is a bit hard to convey some of the diagramming that should go on in this kind of design without image support.

All documents are subject to future change, and your mileage may vary. Good luck, and I hope they mean something to you.