Sound advice

Tales from the homeworld

My current feeds

Fri, 2006-Jan-27

Internet-scale Client Failover

Failover is the process of electing a new piece of hardware to take over the role of a failed piece of hardware (or sometimes software), and the process of bringing everyone on board with the new management structure. Detecting failure and electing a new master are not hard problems. Telling everyone about it is hard. You can attack the problem at various levels. You can have the new master take over the IP of the old and broadcast a new arp reply to take over from the MAC address. You can even have the new master take over the IP and MAC addres of the old. If new and old are not on the same subnet, you can try to solve the problem through DNS. The trouble with all of these approaches is that while they solve the problem for new clients that may come along, they don't solve the problem for clients with existing cached DNS entries or existing TCP/IP connections.

Imagine you are a client app, and you have sent a HTTP request to the server. The server fails over, and a new piece of hardware is now serving that IP address. You can still ping it. You can still see it in DNS. The problem is, it doesn't know about your TCP/IP connection to it, or the connection's "waiting for HTTP response" state. Until a new TCP/IP packet associated with the connection hits the new server it won't know you are there. Only when that happens and it returns a packet to that effect will the client learn its connection state is not reflected by the server side. Such a packet won't usually be generated until new request data is sent by the client, and often that just won't ever happen.

Under high load conditions clients should wait patiently to avoid putting extra strain on the server. If a client knows that a response will eventually be forthcoming it should be willing to wait for as long as it takes to generate the response. With the possibility of failover, the problem is that a client cannot know whether the server state reflects its own and cannot know whether a response really will be forthcoming or not. How often it must sample the remote state is determined by the desired failover time. In industrial applications the time may be as low as four or two seconds, and sampling must take place at a rate several times as quickly to allow for lost packets. If sampling is not possible the desired failover time represents the maximum time a server has to respond to its clients, plus network latency. Another means must be used to return the results of processing if any single request takes longer. Clients must use the desired failover time as their request timeout.

If you take the short request route, HTTP permits you to return 202 Accepted to indicate a request has been accepted for processing but without indicating success or failure of the request. If this were used as a matter of course, conventions could be set up to return the HTTP response via a request back to a call-back url. Alternatively, the response could be modelled as a resource on the server which is periodically polled by the client until it exhibits a success or failure status. Neither of these approaches is directly supported by today's browser software, however the latter could be performed using a little meta-refresh magic.

You may not have sufficient information at the application level to support sampling at the TCP/IP level. You would need to know the current sequence numbers of the stack in order to generate a packet that would be rejected by the server in an appropriate way. In practice what you need is a closer vantage point. Someone who is close in terms of network topology to both the old and the new master can easily tell when a failover occurs and publish that information for clients to monitor. On the face of it this just moving the problem around, however a specialised service can more easily ensure that it doesn't ever spend a long time responding to requests. This allows us to employ the techniques which rely on quick responses.

Like the state of http subscriptions, the state of http requests must be sampled if a client is to wait indefinately for a response. How long it should wait depends on the client's service guarantees, and has little to do with what the server considers an appropriate timeframe. Nevertheless, the client's demands put hard limits on the profile of behaviour acceptable on the server side. In subscription the server can simply renew whenever a renew is requested of it, and time a subscription out after a long period. It seems that the handling of a simple request/response couples clients and servers together more closely than even a subscription does, because of the hard limits client timeout puts onto the server side.


Sat, 2005-Nov-19

HTTP in Control Systems

HTTP may not be the first protocol that comes to mind when you think SCADA, or when you think of other kinds of control systems. Even the Internet Protocol is not a traditional SCADA component. SCADA traditionally works of good old serial or radio communications with field devices, and uses specialised protocols that keep bandwidth usage to an absolute minimum. SCADA has two sides, though, and I don't just mean the "Supervisory Control" and the "Data Acquisition" sides. A SCADA system is an information concentration system for operational control of your plant. Having already gotten your information into a concentrated form and place, it makes sense to feed summaries of that data into other systems. In the old parlence of the corporation I happen to work for this was called "Sensor to Boardroom".

One of my drivers in trying to understand some of the characteristics of the web as a distributed architecture has been in trying to expose the data of a SCADA system to other ad hoc systems that may need to utilise SCADA data. SCADA has also come a long way over the years, and now stands more for integration of operational data from various sources than simple plant control. It makes sense to me to think about whether the ways SCADA might expose its data to other systems may also work within a SCADA system composed of different parts. We're in the land of ethernet here, and fast processors. Using a more heavy-weight protocol such as HTTP shouldn't be a concern from the performance perspective, but what else might we have to consider?

Let's draw out a very simple model of a SCADA system. In it we have two server machines running redundantly, plus one client machine seeking information from the servers. This model is effectively replicated over and over for different services and extra clients. I'll quickly summarise some possible issues and work through them one by one:

  1. Timely arrival of data
  2. Deciding who to ask
  3. Quick failover between server machines
  4. Dealing with redundant networks

Timely Data

When I use the word timely, I mean that our client would not get data that is any fresher by polling rapidly. The simplest implementation of this requirement would be... well... to poll rapidly. However, this loads the network and all CPUs unnecessarily and should be avoided in order to maintain adequate system performance. Timely arrival of data in the SCADA world is all about subscription, either ad hoc or preconfigured. I have worked fairly extensively on the appropriate models for this. A client requests subscription of a server. The subscription is periodically renewed and may eventually be deleted. While the subscription is active it delivers state updates to a client URL over some appropriate protocol. Easy. The complications start to appear in the next few points.

Who is the Master?

Deciding who to ask for subscriptions and other services is not as simple as you might think. You could use DNS (or a DNS-like service) in one of two ways. You could use static records, or your could change your records as the availability of servers changes. Dynamic updates would work through some DNS updater application running on one or more machines. It would detect the failure of one host, and nominate the other as the IP address to connect to for your service. Doing it dynamically has a problem that you're working from pretty much a single point of view. What you as the dynamic DNS modifier sees may not be the same as what all clients see. In addition you have the basic problem of the static DNS: Where do you host it? In SCADA everything has to be redundant and robust against failure. No downtime is acceptable. The static approach also pushes the failure detection problem to clients, which may be a problem they aren't capable of solving due to their inherent "dumb" generic functionality.

Rather than solving the problem at the application level you could rely on IP-level failover, however this works best when machines are situated on the same subnet. It becomes more complex to design when main and backup servers are situated in separate control centres for disaster recovery.

Whichever way you turn there are issues. My current direction is to use static DNS (or eqivalent) that specifies all IP addresses that are or may be relevant for the name. Each server should forward requests onto the main if it is not currently master, meaning that it doesn't matter which one is chosen when both servers are up (apart from a slight additonal lag should the wrong server be chosen). Clients should connect to all IP addresses simultaneously if they want to get their request through quickly when one or more servers are down. They should submit their request to the first connected IP, and be prepared to retry on failure to get their message through. TCP/IP has timeouts tuned for operating over the Internet, but these kinds of interactions between clients and servers in the same network are typically much faster. It may be important to ping hosts you have connections to in order to ensure they are still responsive.

It would be nice if TCP/IP timeouts could be tuned more finely. Most operating systems allow tuning of the entire system's connections. Few support tuning on a per-connection basis. If I know the connection I'm making is going to a host that is very close to me in terms of network topology it may be better to declare failures earlier using the standard TCP/IP mechanisms rather than supplimenting with ICMP. Also, the ICMP method for supplimenting TCP/IP in this way relies on not using an IP-level failover techniques between servers.

Client Failover

Quick failover follows on from discovering who to talk to. The same kinds of failture detection mechanisms are required. Fundamentally clients must be able to quickly detect any premature loss of their subscription resource and recreate it. This is made more complicated by the different server side implementations that may make subscription loss more or less likely, and thus the necessary corrective actions that clients may need to take. If a subscription is lost when a single server host fails, it is important that clients check their subscriptions often and also monitor the state of the host that is maintaining their subscription resource. If the host goes down then the subscription must be reestablished as soon as this is discovered. As such the subscription must be periodically tested for existence, preferrably through a RENEW request. Regular RENEW requests over an ICMP-supported TCP/IP connection as described above should be sufficent for even a slowly-responding server application to adequately inform clients that their subscriptions remain active and they should not reattempt creation.

Redundant Networks

SCADA systems typically utilise redundant networks as well as redundant servers. Not only can clients access the servers on two different physical media, the servers can to the same to clients. Like server failover, this could be dealt with at the IP level... however your IP stack would need to work in a very well-defined way with respect to packets you send. I would suggest that each packet be sent to both networks with duplicates discarded on the recieving end. This would very neatly deal with temporary outages in either network without any delays or network hiccups. Ultimately the whole system must be able to run over the single network, so trying to load balance while both are up may be hiding inherent problems in the network topology. Using them both should provide the best network architecture overall.

Unfortunately, I'm not aware of any network stacks that do what I would like. Hey, if you happen to know how to set it up feel free to drop me a line. In the mean-time this is usually dealt with at the application level with two IP addresses per machine. I tell you what: This complicates matters more than you'd think. You end up needing a DNS name for the whole server pair with four IP addresses. You then need an additional DNS name for each of the servers, each with two IP addresses. When you subscribe to a resource you specify the whole server pair DNS name on connection, but the subscrpition resource may only exist on one service. It would be returned with only that sevice's DNS name, but that's still two IP addresses to deal with and ping. All the way through your code you have to deal with this multiple address problem. In the end it doesn't cause a huge theoretical problem to deal with this at the application level, but it does make development and testing a pain in the arse all around.


Because this is all SIL2 software you end up having to write most of it yourself. I've been developing HTTP client and sever software is spurts over the last six months or so, but concertedly over the last few weeks. The beauty is that once you have the bits that need to be SIL2 in place you can access them with off the shelf implementation of both interfaces. Mozilla and curl both get a big workout on my desktop. I expect Apache, maybe Tomcat or Websphere will start getting a workout soon. By rearchitecting around existing web standards it should make it easier for me to produce non-SIL2 implementations of the same basic principles. Parts of the SCADA system that are not safety-related could be built out of commodity components while the ones that are can still work through carefully-crafted proprietary implementations. It's also possible that off the shelf implementations will eventually become so accepted in the industry that they can be used where safety is an issue. We may one day think of apache like we do the operating systems we use. They provide a commodity service that we undertand and have validated very well in our own industry and environment to help us to only have to write software that really adds value to our customers.

On that note, we do have a few jobs going at Westinghouse Rail Systems Australia's Brisbane office to support a few projects that are coming up. Hmm... I don't seem to be able to find them on seek. Email me if you're intersted and I'll pass them on to my manager. You'd be best to use my ben.carlyle at address for this purpose.


Sat, 2005-Oct-15

Internet-scale Subscription Lease Durations

Depending on your standpoint you may have different ideas about how long a subscription lease should be. From an independent standpoint we may say that infinite subscription leases are the best way forward. That produces the lowest overall network and processing overhead and thus the best result overall. There are, however, competing interests that influence this number downwards. It is likely the lease should be of finite duration and that duration is likely to count more on the reliability of the server and the demands of the client than on anything else.

As a server I want to free up resources as soon as I can after clients that are uncontactable go away. This is especially the case when the same clients may have reregistered and are effectively consuming my resources twice. The new live registration takes up legitimate resources, but the stale ghost registration takes additional illegitimate resources. I want to balance the cost of holding onto resources against the cost of subscription renewals to decide my desired lease period. I'll probably choose something in the order of the time it takes for a tcp/ip connection to expire, but may choose a smaller number if I expect this to be happening regularly. I don't have an imperative to clean up except for resource consumption. In fact, whenever I'm delivering messages to clients that are up but have forgotten about their subscriptions I should get feedback from them indicating they think I'm sending them spam. It's only subscriptions that are both stale and inactive that chew my resources unnecessarily, and it doesn't cost a lot to manage a subscription in that mode.

As a client, if I lease a subscription I expect the subscription to be honoured. That is to say that I expect to be given timely updates of the information I requested. By timely I mean that I couldn't get the information any sooner by polling. Waiting for the notification should get me the data first. The risk to a client is that the subscription will not be honoured. I may get notifications too late. More importantly my subscription might be lost entirely. REST says that the state of any client and server interaction should be held within the last message that passed between them. Subscription puts a spanner in these works and places an expectation of synchronised interaction state between a falliable client and server.

Depending on the server implementation it may be possible to see a server fail and come back up without any saved subscriptions. It might fail over to a backup instance that isn't aware of some or all of the subscriptions. This would introduce a risk to the client that its data isn't timely. The client might get its data more quickly by polling, or by checking or renewing the subscription at the rate it would otherwise poll. This period for sending renewal messages is defined by need rather than simple resource utilisation. The client must have the data in a timely manner or it may fail to meet its own service obligations. Seconds may count. It must check the subscription over a shorter duration than the limit it itself can have on how out of date its data may be under these circumstances. If it is responsible for getting data to an operator console from the field within five (5) seconds it must check its subscription more frequently than at that rate, or someone must do it on their behalf.

Non-failure subscription loss conditions may exist. It may be more convenient for a server to drop subscriptions and allow clients to resubscribe than to maintain them over certain internal reconfiguration activities. These cases are potentially easier to resolve than server death. They don't result in system failure so the owner of subscriptions can notifiy clients as appropriate. It must in fact do so, and once clients have recieved timely update of the end of their subscriptions they should be free to attempt resubscription. It is server death which is tricky. Whichever way you paint things there is always the chance your server and its cluster will terminate in such a way that your subscriptions are lost. Clients must be able to measure this risk and poll at a rate that provides adequate certainty that timely updates are still being sent.


Fri, 2005-Oct-07

Application Consolidation

Sean McGrath talks about consolidating the data from different applications for the purpose of business reporting. He says that the wrong way to do it is usually to redevelop the functions of both applications into one. He says the right way to do it is usually to grab reports from both system and combine them. There are two issues that must be solved during this process. The first is one of simultaneous agreement. The second is a common language of discourse. I'll address the second point first.

Without a common terminology the information of the two systems can't be combined. If one system thinks of your business as the monitoring of network equipment by asset id while another thinks of your business as the monitoring of network equipment by IP address the results aren't going to mesh together well. You need a reasonably sophisticated tool to combine the data, especially when there isn't a 1:1 mapping between asset id and IP address.

Without simultaneous agreement you may be combining reports that are about different things. If a new customer signs on and has been entered only into one system the combined report may be a nonsense. Even if they are enetered at the same time there is necessarily some difference beteween the time queries are issued to each system. The greater the latency between the report consolidation application and the original systems the less likely it is that the data you get back will still be correct when it is recieved. The greater the difference in latencies between the systems you are consolidating the greater the likelyhood that those reports will be referring to different data. This problem is discussed in some detail in Rohit Khare's 2003 dissertation from the point of view of a single datum. I suspect the results regarding simultaneous agreement for different data will be even more complicated.

If the report is historical in nature or for some other reason isn't sensitive to instantaneous change and if the systems you are consolidating do speak a common language, I suggest that Sean is right. Writing a report consolidator application is probably going to be easier than redeveloping the applications themselves. If you lie in the middle somewhere you'll have some thinking to do.


Sat, 2005-Oct-01

Use of HTTP verbs in ARREST architectural style

This should be my last post on Rohit Khare's Decentralizing REST thesis. I apologise to my readers for somewhat of a blogging glut around this paper but there have been a number of topics I wanted to touch apon. This post concerns the use of HTTP verbs for subscription and notification activities.

In section 5.1 Rohit describes the A+REST architectural style. It uses a WATCH request and and NOTIFY response. By the time he reaches the R+REST and ARREST styles of sections 5.2 and 5.3 he is using SUBSCRIBE requests and POST responses. I feel that the jump to use POST (a standard HTTP request) is unfortunate.

I think Rohit sees POST as the obvious choice here. The server wants to return something to the client, therefore mutating the state of the client, therefore POST is appropriate. rfc2616 has this to say about POST:

The POST method is used to request that the origin server accept the entity enclosed in the request as a new subordinate of the resource identified by the Request-URI in the Request-Line. POST is designed to allow a uniform method to cover the following functions:

  • Annotation of existing resources;
  • Posting a message to a bulletin board, newsgroup, mailing list, or similar group of articles;
  • Providing a block of data, such as the result of submitting a form, to a data-handling process;
  • Extending a database through an append operation.

POST is often used outside of these kinds of context, especially as means of tunnelling alternate protocols or architectural styles over HTTP. In this case though, I think that its use is particularly aggregious. Consider this text from section 9.1.1 of rfc2616:

the convention has been established that the GET and HEAD methods SHOULD NOT have the significance of taking an action other than retrieval. These methods ought to be considered "safe". This allows user agents to represent other methods, such as POST, PUT and DELETE, in a special way, so that the user is made aware of the fact that a possibly unsafe action is being requested.

Naturally, it is not possible to ensure that the server does not generate side-effects as a result of performing a GET request; in fact, some dynamic resources consider that a feature. The important distinction here is that the user did not request the side-effects, so therefore cannot be held accountable for them.

When the server issues a POST it should be acting on behalf of its user. Who its user is is a little unclear at this point. There is the client owned by some agency, the server owned by another, and finally the POST destination possibly owned by an additional agency. If the server is acting on behalf of its owner it should do so with extreme care and be absolutely sure it is willing to take responsibilty for the actions taken in response to the POST. It should provide its own credentials and act in a responsible manner, operating in accordance with the policy its owner sets forward for it.

I see the use of POST in this way as a great security risk. If the server generating POST requests is trusted by anybody then by using POST as a notification it is transferring that trust to its client. Unless client and server are owned by the same organisation or individual an interagency conflict exists and an unjustified trust relationship is created. Instead, the server must provide the client's credentials only or notify the destination server in a non-accountable way. It is important that the server not be seen to be requesting any of the sideeffects the client may generate in response to the notification but instead that those sideeffects are part of the intrinsic behaviour of the destination when provided with trustworthy updates.

Ultimately I don't think it is possible or reasonable for a server to present its client's credentials as its own. There is too much risk that domain name or IP address records will be taken into account when processing trust relationships. I think therefore that a new method is required, just as is provided in the GENA protocol which introduces NOTIFY for the purpose.

It's not a dumb idea to use POST as a notification mechanism. I certainly thought that way when I first came to this area. Other examples also exist. Rohit himself talks about the difficulty of introducing new methods to the web and having to work around this problem in mod_pubsub using HTTP headers. In the end though, I think that the introduction of subscription to the web is something worthy of at least one new verb.

I'm still not sure whether an explicit SUBSCRIBE verb is required. Something tells me that a subscribable resource will always be subscribed to anyway and that GET is ultimately enough to setup a subscription if appropriate headers are supplied. I'm still hoping I'll be able to do something in this area to reach a standard approach.

The established use of GENA in UPnP may tip the cards. The fact that it exists and is widely deployed may outweigh its lack of formal specification and standardisation. Its defects mostly appear asthetic rather than fundamental, and it still may be possible to extend it in a backwards-compatible way.


Sat, 2005-Oct-01

Infinite Buffering

Rohit Khare's 2003 paper on Decentralizing REST makes an important point in section 7.2.4 during his discussion of REST+E (REST with Estimates):

It is impossible to sustain an information transfer rate in excess of a channel's capacity. Unchecked, guaranteed message delivery can inexorably increase latency. To ensure this does not occur - ensuring that buffers, even if necessary, remain finite - information buffered for transmission may need to be summarized, updated, or even dropped while still queued.

It is a classic mistake in message-oriented middleware to combine guaranteed delivery of messages with guaranteed acceptance of messages. Just when the system is overloaded or stressed, everything starts to unravel. Latency increases, and can even reach a point where feedback loops are created in the middleware: Messages designed to keep the middleware operating normally are delayed so much that they cause more messages to be generated that can eventually consume all available bandwidth and resources.

Rohit cites summarisation as a solution, and it is. On the other hand it is important to look at where this summarisation should take place. Generic middleware can't summarise data effectively. It has few choices but to drop the data. I favour an end-to-end solution for summarisation: The messaging middleware must not accept a message unless it is ready to deliver it into finite buffers. The source of the message must wait to send and when it becomes possible to send should be able to alter its choice of message. It should be able to summarise in its own way for its own purposes. Summarisation should only take place at this one location. From this location it is possible to summarise, not based on a set of messages that have not been sent, but on any other currently- accessible data. For instance, a summariser can be written that always sends the current state of a resource rather than the state it had at the first change after the last successful delivery. This doesn't require any extra state information (except for changed/not-changed) if the summariser has access to a common representation of the origin resource. The summariser's program can be as simple as:

  1. Wait for change to occur
  2. Wait until a message can be sent, ignoring further changes
  3. Send a copy of the current resource representation
  4. Repeat

More complicated summarisation is likely to be useful on complex resources such as lists. Avoiding sending the whole list over and over again can make significant inroads to reducing bandwidth and processing costs. This kind of summariser requires more sophisticated state, including the difference between the current and previosly-transmitted list contents.

Relying on end-to-end summarisers on finite buffers allows systems to operate efficiently under extreme load conditions.


Sat, 2005-Oct-01

REST Trust Relationships

Rohit Khare's 2003 paper Decentralizing REST introduces the ARREST architectural style for routed event notifications when agency conflicts exist. The theory is that it can be used according R+REST principles to establish communication channels between network nodes that don't trust each other but which are both trusted by a common client.

Rohit has this to say about that three-way trust relationship in chapter 5.3.2:

Note that while a subscription must be owned by the same agency that owns S, the event source, it can be created by anyone that S's owner trusts. Formally, creating a subscription does not even require the consent of D's owner [the owner of the resource being notified by S], because any resource must be prepared for the possibilty of unwanted notifications ("spam").

If Rohit was only talking about R+REST's single routed notifications I would agree. One notification of the result of some calculation should be dropped by D as spam. Certainly no unauthorised alterations to D should be permitted by D, and this is the basis of Rohit's claim that D need not authorise notifications. Importantly, however, this section is not referring to a one-off notification but to a notification stream. It is essential in this case to have the authority of D before sending more than a finite message sequence to D. Selecting a number of high-volume event sources and directing them to send notifications to an unsuspecting victim is a classic denial of service attack technique. It is therefore imperative to establish D's complicity in the subscription before pummeling the resource with an arbitrary number of notifications.

The classic email technique associated with mailing lists is to send a single email first, requesting authorisation to send further messages. If a positive confirmation is recieved to the email (either as a return email, or a web site submission) then further data can flow. Yahoo has the concept of a set of email addresses which a user has demonstrated are their own, and any new mailing list subscriptions can be requested by the authorised user to be sent to any of those addresses. New addresses require individual confirmation.

I believe that a similar technique is required for HTTP or XMPP notifications before a flood arrives. The receiving resource must acknowledge successful receipt of a single notification message before the subsequent flood is permitted. This avoids the notifying server becoming complicit in the nefarious activities of its authorised users. In the end it may come down to what those users are authorised to do and who they are authorised to do it with. Since many sites on the internet are effectively open to any user, authorised or not, the importance of handling how much trust your site has in its users may be important in the extreme.


Sat, 2005-Oct-01

Routed REST

I think that Rohit Khare's contribution with his 2003 paper on Decentralizing REST holds many valuable insights. In some areas, though, I think he has it wrong. One such area is his chapter on Routed REST (R+REST), chapter 5.2.

The purported reason for deriving this architectural style is to allow agencies that don't trust each other to collaborate in implementing useful functions. Trust is not uniform across the Internet. I may trust my bank and my Internet provider, but they may not trust each other. Proxies on the web may or may not be trusted, so authentication must be done in an end-to-end way between network nodes. Rohit wants to build a collaboration mechanism between servers that don't trust each other implicitly, but are instructed to trust each ohter in specific ways by their client which trusts each service along the chain.

Rohit gives the example of a client, a printer, a notary watermark, and an accounting system. The client trusts its notary watermark to sign and authenticate the printed document, and trusts the printer. The printer trusts the accounting system and uses it to bill the client. Rohit tries to include the notary watermark and the accounting system not as communication endpoints in their own right, but as proxies that transform data that passes through them. To this end he places the notary watermark between the client and printer to transform the request, but places the accounting system betwen printer and client on an alternate return route. He seems to get very excited about this kind of composition and starts talking about MustUnderstand headers and about SOAP- and WS- routing as existing implementations. The summary of communications is as follows:

  1. Send print job from client to notary
  2. Forward notarised job from notary to printer
  3. Send response not back to the notary, but to the printer's accounting system
  4. The accounting system passes the response back to the client

I think that in this chapter he's off the rails. His example can be cleanly implemented in a REST style by maintaining state at relevant points in the pipeline. Instead of transforming the client's request in transit, the client should ask the notary to sign the document and have it returned to the client. The client should only submit a notarised document to the printer. The printer in turn should negotiate with the accounting system to ensure proper billing for the client before returning any kind of success code to the client. The summary of communications is as follows:

  1. Send print job from client to notary
  2. Return notarised job from notary to client
  3. Send notarised job from client to printer
  4. Send request from printer to accounting system
  5. Receive response from accounting system back to printer
  6. Send response from printer back to client

Each interaction is independent of the others and only moves across justified trust relationships.

Rohit cites improved performance out of a routing based system. He says that only four message transmissions need to occur, instead of six using alternate approaches. His alternate approach is different to mine, but both his alternate and my alternate have the same number of messages needing transmission: six. But let's consider that TCP/IP startup requires at least three traversals of the network and also begins transmission slowly. Establishing a TCP/IP connection is quite expensive. If we consider the number of connections involved in his solution (four) compared to the number in my solution and his non-ideal alternative (three) it starts to look more like the routing approach is the less efficient one. This effect should be multiplied over high latency links. When responses are able to make use of the same TCP/IP connection as the request was made on, the system as a whole should actually be more efficient and responsive. Even when it is not more efficient, I would argue that it is signficantly simpler.

Rohit uses this style to build the ARREST style, however using this style as a basis weakens ARREST. He uses routing as a basis for subscription, however in practice whether subscription results come back over the same TCP/IP connection or are routed to a web server using a different TCP/IP connection is a matter of tradeoff of server resources and load.


Fri, 2005-Sep-30

The Estimated Web

Rohit Khare's dissertation describes the ARREST+E architectural style for building estimated systems on a web-like foundation. He derives this architecture based on styles developed to reduce the time it takes for a client and server to reach consensus on a value (they have the same value). The styles he derives from are based on leases of values which mirror the web's cache expiry model. An assumption of this work is that servers must not change their value until all caches have expired, which is to say that the most recent expiry date provided to any client has passed. He doesn't explicitly cover the way the web already works as an estimated system: by lying about the expiry field.

Most applications on the web do not treat the expirty date as a lease that must be honoured. Instead, they change whenever they want to change and provide an expiry date to clients that represents a value much less than they expect the real page to change at. Most DNS server records remain static for at least months at a time, so an expiry model that permits caching for 24 hours saves bandwith while still ensuring that clients have a high probability of having records that are correct. Simply speaking, if a DNS record changes once every thirty days then a one-day cache expiry gives clients something like a 29 in 30 chance of having data that is up to date. In the mean time caching saves each DNS server from having to answer the same queries every time someone wants to look up a web page or send some email.

The question on the web of today becomes "What will it cost me to have clients with the wrong data?". If you're managing a careful transition then both old and new DNS records should be valid for the duration of the cache expiry. In this case it costs nothing for clients to have the old data. Because it costs them nothing it also costs you nothing, excepting the cost of such a careful transition. When we're talking about web resources the problem becomes more complicated because of the wide variety of things that a web resource could represent. If it is time-critical data of commercial importance to your clients then even short delays in getting the freshest data can be a problem for clients. Clients wishing to have a commercial advantage over each other are likely to poll the resource rapidly to ensure they get the new result first. Whether cache expiry dates are used or not clients will consume your bandwidth and processing power in proportion to their interest.

The ARREST+E style provides an alternative model. Clients no longer need to poll because you're giving them the freshest data possible as it arrives. ARREST+E invovles subscription by clients and notification back to the clients as data changes. It allows for summarisation of data to avoid irrelevant information from being transmitted (such as stale updates) and also for prediction on the client side to try and reduce the estimated error in their copy of the data. If your clients simply must know the result and are willing to compete with each other by effectively launching denial of service attacks on your server then the extra costs of ARREST+E style may be worth it. Storing state for each subscription (including summariser state) may be cheaper than handling the excess load.

On the other hand, most of the Internet doesn't work this way. Clients are polite because they don't have a strong commercial interest in changes to your data. Clients that aren't polite can be denied access for a while until their manners improve. A server-side choice of how good an estimate your copy of the resource representation will be is enough to satisfy everyone.

Whether or not you use ARREST+E with its subscription model some aspects of the architecture may still be of some use. I periodically see complaints about how inefficient HTTP is at accessing RSS feeds. A summariser could reduce the bandwith associated with transferring large feeds. It would be nice to be able to detect feed reader clients to a resource specifically tailored for them each time they arrive. Consider the possibility of returning today's feed data to a client, but directing them next time to a resource that represents only feed items after today. Each time they get a new set of items they would be directed onto a resource that represents items since that set.


Fri, 2005-Sep-30

Consensus on the Internet Scale

I've had a little downtime this week and have used some of it to read a 2003 paper by Rohit Khare, Extending the REpresentational State Transfer (REST) Architectural Style for Decentralized Systems. As of 2002, Rohit was a founder and Chief Technology Officer of KnowNow, itself a commercialisation of technology originally developed for mod_pubsub. I noticed this paper through a Danny Ayers blog entry. I have an interest in publish subscribe systems and this paper lays the theoretical groundwork for the behaviour of such systems on high-latency networks such as the internet. Rohit's thesis weighs in at almost three hundred pages, including associated bibliography and supporting material. As such, I don't intend to cover all of my observations in this blog entry. I'll try to confine myself to the question of consensus on high-latency networks and on the "now" horizon.

One of the first things Rohit does in his thesis is establish a theoretical limit to the latency between two nodes and the rate at which data can change at a source node while being agreed at the sink node. He bases his model around the HTTP concept of expiry and calculates if it takes at most d seconds for the data to get from point A to point B, then A must hold its value constant until at least d seconds and note this period as an expiry in the message it sends. Clearly if A wants to change its value more frequently the data will be expired before it reaches B. This result guides much of the subsequent work as he tries to develop additional REST-like styles to achieve as close as possible to the theoretical result in practice. He ends up with A being able to change its value about every 2d if appropriate architectural components are included.

Two times maximum latency may not be terribly significant on simple short distance networks, however even on ethernet d is theoretically unbounded. It follows then that over a network it isn't possible to always have a fresh copy of an origin server's value. This is where things get interesting. Rohit defines the concept of an "estimated" system. He contrasts this from a centralised system by saying that an estimated system does not guarantee consensus between network nodes. Instead of saying "I know my copy of the variable is the same as the server holds, and thus my decisions will be equivalent to everyone else's", we say "I have a P% chance that my copy is the same as the server holds, and thus my decisions may differ from those of other network nodes". A "now" horizon exists between those components that have a low enough latency to maintain consensus with the origin server's data and those that are too far from the source of data to track changes at the rate they occur.

In SCADA systems we measure physical phenomenon. Even if we only sample that phenomenon at a certain rate, the actual source change can occur at any frequency. If we consider the device that monitors an electric current (our RTU) as the origin server we could attempt to synchronise with the RTU's data before it resamples. If latencies are low enough we could possibly indicate to the user of our system the actual value as we saw it within the "now" horizon and thus in agreement with the RTU's values. However, another way to see the RTU is as a proxy. Because we can't control the frequency of change of the current, the "now" horizon is arbitrarily small. All SCADA systems are estimated rather than consensus-based.

Rohit doesn't directly cover the possibility of an infinitely-small "now" horizon, however his ultimate architecture for estimated systems generally is ARREST+E (Asynchronous Routed REST+Estimation). It contains a client that retries requests to the field device when they fail in order to traverse unreliable networks. It includes credentials for the client to access the field device. It includes a trust manager that rejects bad credentials at the field device, and includes the capability to drop duplicate messages from clients that have retried when a retry wasn't required. The field device would send updates back to the client as often as the network permitted. To keep latency low it would avoid passing on stale data and may perform other summarisation activities to maximise the relevance of data returned to the client. On the other side of this returned data connection the client would include a predictor to provide the best possible estimate of actual state based on the stale data it has at its disposal. Finally, this return path also includes credential transmission and trust management.

This grand unifying model is practical when the number of clients for our field device is small. Unlike the base REST and HTTP models this architecture requires state to be maintained in our field device to keep track of subscriptions and to provide appropriate summaries. The predictor on the client side can be as simple as the null predictor (we assume that the latest value at our disposal is still current) or could include domain-specific knowledge (we know the maximum rate at which this piece of equipment can change state and we know the direction it is moving so we perform a linear extrapolation based on past data input). As I mentioned in my previous entry state may be too expensive to maintain on behalf of the arbitrarily large client base that exists on the Internet at large. There must be some means of aquiring the money to scale servers up appropraitely. If that end of things is secure, however, I think it is feasible.

Importantly, I think it is necessary. In practice I think that most mutable resources on the web are represnting physical phenomenon of one kind or another. Whether it is "the state the user left this document in last time they wrote it" or "the current trading price of MSFT on wall street", people are physical phenomenon who can act on software at any time and in any way. The existing system on the web of using expiry dates is actually a form of estimation that is driven by the server deciding how often a client should poll or recheck their data. This server-driven approach may not suit all clients, which is why web browsers need a "reload" button!

I'll talk some specific issues I had with the thesis in a subsequent post.