Sound advice - blog

Tales from the homeworld


Tue, 2005-Sep-20

State Tradeoffs

HTTP is stateless between requests, even though each request rides on a stateful TCP/IP connection. Any subscription protocol requires state to be retained in the network for the lifetime of the subscription, to keep track of who the subscribers are and what they want. Whenever I POST to a web server I can be pretty confident that I'm creating some kind of state, although how long-lived that state is, or whether it has associated costs in terms of memory utilisation or other resources, is unknown to me. REST sides with HTTP and says we should be stateless between requests, but is that always possible, and what are the tradeoffs?

Mark Baker points to Fielding's dissertation. Under the Related Work heading, alongside the discussion of pre-existing architectural styles, Fielding says this:

The interaction method of sending representations of resources to consuming components has some parallels with event-based integration (EBI) styles. The key difference is that EBI styles are push-based. The component containing the state (equivalent to an origin server in REST) issues an event whenever the state changes, whether or not any component is actually interested in or listening for such an event. In the REST style, consuming components usually pull representations. Although this is less efficient when viewed as a single client wishing to monitor a single resource, the scale of the Web makes an unregulated push model infeasible.

He discusses Chiron-2 as an example EBI style.

This opinion is interesting to me because I can perhaps see, for the first time, the difference between "corporate" and "internet" scales of software. If Fielding is to be believed, that difference is captured in exactly how much state a server can afford to maintain on behalf of its clients. In computer science courses students are taught that polling on the network is an obvious evil: it increases response latency, consumes extra network bandwidth, or both. When the cost of the network is thought of only in terms of its bandwidth and latency characteristics, the urge to use a publish/subscribe scheme rather than polling is strong. Fielding gives us a different way to think about the problem.
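To make the tradeoff concrete, here is a rough back-of-the-envelope comparison. The client count, poll interval and change rate are made-up figures for illustration only; they don't come from Fielding or from any real deployment:

    # Back-of-envelope comparison of polling versus push for one resource.
    # All three inputs are assumptions for illustration only.
    clients = 10_000          # interested clients
    poll_interval_s = 30      # how often each client polls
    changes_per_hour = 6      # how often the resource actually changes

    # Polling: request rate is independent of how often the data changes,
    # and on average a change sits unseen for half the poll interval.
    poll_requests_per_s = clients / poll_interval_s
    poll_avg_staleness_s = poll_interval_s / 2

    # Push: traffic is proportional to the change rate and latency is near
    # zero, but the server must hold a record for every subscriber.
    push_messages_per_s = clients * changes_per_hour / 3600
    push_state_entries = clients

    print(f"polling: {poll_requests_per_s:.0f} req/s, "
          f"~{poll_avg_staleness_s:.0f} s average staleness, no per-client state")
    print(f"push: {push_messages_per_s:.1f} msg/s, near-zero staleness, "
          f"{push_state_entries} subscription records held")

Polling burns bandwidth and processing in proportion to the number of clients whether or not anything changed; push burns memory in proportion to the number of clients whether or not anything is ever delivered.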

I come from the SCADA world. Typically, a stable set of client and server applications runs on a secure or relatively isolated network. Extreme load conditions usually come about when the equipment being monitored changes state frequently and servers must process and pass on data under what is termed an avalanche condition. Varying quality constraints apply to the different kinds of client in this scenario. Historical storage of change records seeks to store as many interim states as possible without lagging too far behind the present. User-oriented services typically seek to have the very freshest data available at all times. Low latency and effective use of bandwidth are critical to making this work, as is the processing performance of the clients themselves.

Extreme load conditions on the web usually come about from client activity, an effect known on today's internet as the slashdot effect. Instead of thousands of changes being processed and passed on to clients, thousands of clients are either "just having a look" or actively participating in your web site's mechanics. It's not just Fielding who thinks that state is a barrier to high performance under these conditions. Paul Vixie wrote the following earlier this year on namedroppers@ops.ietf.org:

tcp requires state. state quotas are a vulnerability in the face of a DoS attack. tcp must remain a fallback for queries, and be reserved in the common case for updates and zone transfers. doesn't matter who runs the servers, if they're supposed to stay "up" then you can't depend on them to have room for another tcp protocol control block in the common case.

The sentiment has been echoed periodically throughout the mailing list's lifetime. Paul is also the author of this earlier quote:

speaking for f-root, if we had to build a protocol control block for each of the 5000 to 10000 queries that came in every second, and exchange packets between our state machine and remote state machine, before finally sending the response and reclaiming the state-memory, we'd need a lot more hardware than we have today. it's not the bytes, or the packets-- those we can provision easily. it's the state-memory occupancy time that would dominate.

So in the extreme world of DNS, even the state associated with a TCP/IP connection is considered too much. In Fielding's view of the web, anything more than TCP is too much. Scale back down to a corporate setting and state is no longer the problem: bandwidth and latency dominate. Clearly there is a sliding scale and there are tradeoffs to be made. Where do the tradeoffs lie?
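Vixie's point about state-memory occupancy can be sketched with Little's law (items held = arrival rate x holding time). The query rate below comes from his quote; the per-connection size and holding time are my own guesses for illustration:

    # Rough sketch of the "state-memory occupancy time" argument.
    # queries_per_s is taken from the quote above; the other two figures
    # are assumptions, not measurements of any real name server.
    queries_per_s = 10_000        # upper figure quoted for f-root
    bytes_per_connection = 2_000  # assumed TCP control block plus buffers
    holding_time_s = 60.0         # assumed handshake + exchange + TIME_WAIT

    # Little's law: connections in flight = arrival rate * holding time.
    concurrent_connections = queries_per_s * holding_time_s
    state_memory_bytes = concurrent_connections * bytes_per_connection

    print(f"~{concurrent_connections:,.0f} connections' state held at once, "
          f"~{state_memory_bytes / 2**20:,.0f} MiB of memory occupied")

The bytes and packets of each exchange are trivial; it is the minute or so that each connection's state hangs around that multiplies a modest query rate into a gigabyte or more of occupied memory.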

When we're talking about real-time, SCADA-like data, clients incur a cost if they aren't kept up to date. This applies to any time-sensitive data, from stock market quotes to eBay auction prices. To reduce the cost of getting information late, clients must poll the service, incurring bandwidth and node processing costs. Memory isn't used beyond what is required to carry on TCP/IP communications between the relevant network nodes, so a large number of requests can be handled with the main costs being processing and bandwidth. Those can be limited explicitly by the server, so it should be possible for the server to ride out the storm while clients see longer and longer response times during peak demand.

The cost of bandwidth and processing can be traded off for extra memory by offering a publish/subscribe state synchronisation model to clients. This provides low-latency return of data without polling and may reduce both processing and bandwidth costs. It still allows those costs to be managed by the server infrastructure, which may choose to drop interim state updates or perform other load-shedding activities. It does cost extra memory, though. Unlike bandwidth and processing, memory isn't a renewable resource. If you expend it all, your server can't recover by slowing the rate of requests or responses. Your server is likely to simply fall over.
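On the polling side of that tradeoff, the client can carry all the state itself. A minimal sketch might look like the following; the URL and intervals are placeholders, and the only memory of past requests is a Last-Modified value held by the client, not by the server:

    # A rough polling-client sketch. The URL and intervals are placeholders;
    # the point is that the only state carried between requests is a
    # Last-Modified value held by the client, not by the server.
    import time
    import urllib.error
    import urllib.request

    URL = "http://example.com/resource"   # hypothetical resource to monitor
    BASE_INTERVAL = 30                    # seconds between polls under normal load

    last_modified = None
    while True:
        request = urllib.request.Request(URL)
        if last_modified:
            # Conditional GET lets the server answer 304 Not Modified cheaply.
            request.add_header("If-Modified-Since", last_modified)

        started = time.time()
        try:
            with urllib.request.urlopen(request) as response:
                last_modified = response.headers.get("Last-Modified", last_modified)
                body = response.read()
                print("new representation:", len(body), "bytes")
        except urllib.error.HTTPError as error:
            if error.code != 304:         # 304 just means "nothing new yet"
                print("poll failed:", error.code)

        # If the server is riding out a storm by responding slowly, back off:
        # latency grows at the client, but the server holds no state for us.
        elapsed = time.time() - started
        time.sleep(max(BASE_INTERVAL, elapsed * 10))

The back-off rule at the bottom is the client-side half of the bargain: the server slows down, the client polls less often, and nobody has to remember anything between requests.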

State is sometimes necessary. If I book a flight online, the flight booking state must be retained somewhere, or my booking won't be honoured. In cases like these I pay money to store the state, so the accounting infrastructure exists to fund memory expansion. Even if the cost is amortised against actual successful bookings, the fact that the site is doing business mitigates the raw "interesting site" effect, where thousands of users may decide to store small amounts of state on your server if they are allowed to. Importantly, this example also has an approximate limit to the amount of state stored. The number of planes in the sky at any one time controls the number of seats available. Even if a proportion of the bookings have to be cancelled and remade, and the state of those changes recorded, that proportion should be roughly predictable on the large scale as a percentage of the seats booked or available.

Other examples include Wikipedia and del.icio.us. Both are open to the slashdot effect of having a large number of clients. As well as serving stateless queries, these services allow users to store encyclopedia articles or bookmarks on the sites. They must have storage capacity in proportion to the size of their user base, based on the average number of bytes a user contributes. Wikipedia has a small advantage over other sites in that once an article is written and authoritative, it seems likely that a proportion of the user base is cut out of contributing to that topic, because everything they know about it has already been written down. I suspect del.icio.us doesn't get this effect. In fact, my experience is that bookmarks found via del.icio.us are often inserted back into del.icio.us by the discoverer. This suggests to me that the storage requirements for del.icio.us are in rough proportion to its user base, but with a risk that they will eventually grow in rough proportion to the size of the web itself. Is such a service sustainable? Will its funding model allow for this possibly unbounded growth in resource consumption?
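As a toy model of that growth risk, every figure below is invented purely to show the shape of the two curves, not to describe del.icio.us:

    # Two growth curves for bookmark storage: one tracking the user base,
    # one tracking the web itself. All figures are invented.
    users = 1_000_000                  # assumed registered users
    bytes_per_user = 50_000            # assumed average contribution per user
    web_pages = 10_000_000_000         # assumed size of the indexable web
    bytes_per_bookmarked_page = 500    # assumed size of one bookmark record

    storage_if_user_bound = users * bytes_per_user
    storage_if_web_bound = web_pages * bytes_per_bookmarked_page

    print(f"{storage_if_user_bound / 2**30:,.0f} GiB if storage tracks the user base")
    print(f"{storage_if_web_bound / 2**40:,.1f} TiB if storage eventually tracks the web")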

To answer that question, let's take a look at a bigger example. Google certainly seems to have been able to make a go of it so far. Their storage requirements follow the same curve but with a much larger constant: instead of storing a bunch of URLs and the people who like them, Google is effectively a full-text index of the internet. Other projects of that scale exist too, within and without the search engine manifold. The key seems to be ensuring your funding model grows along the same curve (or a steeper one if you can manage it). If you have storage needs in proportion to the size of an internet user base, then you probably need funding proportional to that size. This is important in paying for bandwidth too. You need to get paid in proportion to how popular you are, or you aren't going anywhere on the Internet. It seems that for the moment del.icio.us is working from a venture capital pool, but at some point it will have to start returning money to its investors at a greater rate than the cost of scaling up the servers to handle the load.

I don't know whether subscription is a viable concept on the web when the number of users is effectively unbounded. Servers can still reject subscriptions or refuse to renew them so that a bounded number is maintained... but you'll be giving a proportion of your user base error messages, and that is never a good look. In the end it probably comes down to a matter of economics. Are you being paid in proportion to the number of subscriptions you provide? Is there an advertising stream alongside the subscription? Are users paying for the subscription itself, perhaps as a premium service? If you can get that side of things off the ground, and are ready to scale your servers as your user base grows and spikes (or in anticipation of the spike), then you can probably make a go of subscription. Otherwise you might be better off letting your server be polled and seeing your web site respond more slowly to users in moderate overload conditions, rather than cutting them out entirely. In the end the biggest overload conditions will still cause errors to be returned, so it is really a matter of how much you can afford to spend and what it will cost you if clients don't see your offering as reliable.
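A hypothetical sketch of that bounded-subscription idea: the server keeps a capped table of leases and turns new subscribers away once the cap is reached, trading error responses for a predictable memory ceiling. The cap, the lease length and the names are all made up for illustration:

    # Hypothetical sketch of a bounded subscription table. The cap, the lease
    # length and the names are invented; real delivery of events is out of
    # scope here, only the state bookkeeping matters.
    import time

    MAX_SUBSCRIPTIONS = 10_000   # assumed memory budget, expressed as a count
    LEASE_SECONDS = 300          # subscriptions expire unless renewed

    subscriptions = {}           # subscriber id -> lease expiry time

    def subscribe(subscriber_id):
        now = time.time()
        # Reclaim expired leases before deciding whether there is room.
        for sid, expiry in list(subscriptions.items()):
            if expiry <= now:
                del subscriptions[sid]
        if subscriber_id not in subscriptions and len(subscriptions) >= MAX_SUBSCRIPTIONS:
            return None                   # refused; caller falls back to polling
        subscriptions[subscriber_id] = now + LEASE_SECONDS
        return LEASE_SECONDS              # lease granted or renewed

    def publish(event):
        now = time.time()
        live = [sid for sid, expiry in subscriptions.items() if expiry > now]
        print(f"would deliver {event!r} to {len(live)} subscribers")

Leases mean the table shrinks by itself when clients lose interest; the hard cap means the memory bill is known in advance, at the price of turning some would-be subscribers back to polling.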

Benjamin