Sound advice

HTTP may not be the first protocol that comes to mind when you think SCADA, or when you think of other kinds of control systems. Even the Internet Protocol is not a traditional SCADA component. SCADA traditionally works of good old serial or radio communications with field devices, and uses specialised protocols that keep bandwidth usage to an absolute minimum. SCADA has two sides, though, and I don't just mean the "Supervisory Control" and the "Data Acquisition" sides. A SCADA system is an information concentration system for operational control of your plant. Having already gotten your information into a concentrated form and place, it makes sense to feed summaries of that data into other systems. In the old parlence of the corporation I happen to work for this was called "Sensor to Boardroom".

One of my drivers in trying to understand some of the characteristics of the web as a distributed architecture has been in trying to expose the data of a SCADA system to other ad hoc systems that may need to utilise SCADA data. SCADA has also come a long way over the years, and now stands more for integration of operational data from various sources than simple plant control. It makes sense to me to think about whether the ways SCADA might expose its data to other systems may also work within a SCADA system composed of different parts. We're in the land of ethernet here, and fast processors. Using a more heavy-weight protocol such as HTTP shouldn't be a concern from the performance perspective, but what else might we have to consider?

Let's draw out a very simple model of a SCADA system. In it we have two server machines running redundantly, plus one client machine seeking information from the servers. This model is effectively replicated over and over for different services and extra clients. I'll quickly summarise some possible issues and work through them one by one:

Timely arrival of data
Deciding who to ask
Quick failover between server machines
Dealing with redundant networks

Timely Data

When I use the word timely, I mean that our client would not get data that is any fresher by polling rapidly. The simplest implementation of this requirement would be... well... to poll rapidly. However, this loads the network and all CPUs unnecessarily and should be avoided in order to maintain adequate system performance. Timely arrival of data in the SCADA world is all about subscription, either ad hoc or preconfigured. I have worked fairly extensively on the appropriate models for this. A client requests subscription of a server. The subscription is periodically renewed and may eventually be deleted. While the subscription is active it delivers state updates to a client URL over some appropriate protocol. Easy. The complications start to appear in the next few points.

Who is the Master?

Deciding who to ask for subscriptions and other services is not as simple as you might think. You could use DNS (or a DNS-like service) in one of two ways. You could use static records, or your could change your records as the availability of servers changes. Dynamic updates would work through some DNS updater application running on one or more machines. It would detect the failure of one host, and nominate the other as the IP address to connect to for your service. Doing it dynamically has a problem that you're working from pretty much a single point of view. What you as the dynamic DNS modifier sees may not be the same as what all clients see. In addition you have the basic problem of the static DNS: Where do you host it? In SCADA everything has to be redundant and robust against failure. No downtime is acceptable. The static approach also pushes the failure detection problem to clients, which may be a problem they aren't capable of solving due to their inherent "dumb" generic functionality.

Rather than solving the problem at the application level you could rely on IP-level failover, however this works best when machines are situated on the same subnet. It becomes more complex to design when main and backup servers are situated in separate control centres for disaster recovery.

Whichever way you turn there are issues. My current direction is to use static DNS (or eqivalent) that specifies all IP addresses that are or may be relevant for the name. Each server should forward requests onto the main if it is not currently master, meaning that it doesn't matter which one is chosen when both servers are up (apart from a slight additonal lag should the wrong server be chosen). Clients should connect to all IP addresses simultaneously if they want to get their request through quickly when one or more servers are down. They should submit their request to the first connected IP, and be prepared to retry on failure to get their message through. TCP/IP has timeouts tuned for operating over the Internet, but these kinds of interactions between clients and servers in the same network are typically much faster. It may be important to ping hosts you have connections to in order to ensure they are still responsive.

It would be nice if TCP/IP timeouts could be tuned more finely. Most operating systems allow tuning of the entire system's connections. Few support tuning on a per-connection basis. If I know the connection I'm making is going to a host that is very close to me in terms of network topology it may be better to declare failures earlier using the standard TCP/IP mechanisms rather than supplimenting with ICMP. Also, the ICMP method for supplimenting TCP/IP in this way relies on not using an IP-level failover techniques between servers.

Client Failover

Quick failover follows on from discovering who to talk to. The same kinds of failture detection mechanisms are required. Fundamentally clients must be able to quickly detect any premature loss of their subscription resource and recreate it. This is made more complicated by the different server side implementations that may make subscription loss more or less likely, and thus the necessary corrective actions that clients may need to take. If a subscription is lost when a single server host fails, it is important that clients check their subscriptions often and also monitor the state of the host that is maintaining their subscription resource. If the host goes down then the subscription must be reestablished as soon as this is discovered. As such the subscription must be periodically tested for existence, preferrably through a RENEW request. Regular RENEW requests over an ICMP-supported TCP/IP connection as described above should be sufficent for even a slowly-responding server application to adequately inform clients that their subscriptions remain active and they should not reattempt creation.

Redundant Networks

SCADA systems typically utilise redundant networks as well as redundant servers. Not only can clients access the servers on two different physical media, the servers can to the same to clients. Like server failover, this could be dealt with at the IP level... however your IP stack would need to work in a very well-defined way with respect to packets you send. I would suggest that each packet be sent to both networks with duplicates discarded on the recieving end. This would very neatly deal with temporary outages in either network without any delays or network hiccups. Ultimately the whole system must be able to run over the single network, so trying to load balance while both are up may be hiding inherent problems in the network topology. Using them both should provide the best network architecture overall.

Unfortunately, I'm not aware of any network stacks that do what I would like. Hey, if you happen to know how to set it up feel free to drop me a line. In the mean-time this is usually dealt with at the application level with two IP addresses per machine. I tell you what: This complicates matters more than you'd think. You end up needing a DNS name for the whole server pair with four IP addresses. You then need an additional DNS name for each of the servers, each with two IP addresses. When you subscribe to a resource you specify the whole server pair DNS name on connection, but the subscrpition resource may only exist on one service. It would be returned with only that sevice's DNS name, but that's still two IP addresses to deal with and ping. All the way through your code you have to deal with this multiple address problem. In the end it doesn't cause a huge theoretical problem to deal with this at the application level, but it does make development and testing a pain in the arse all around.

Conclusion

Because this is all SIL2 software you end up having to write most of it yourself. I've been developing HTTP client and sever software is spurts over the last six months or so, but concertedly over the last few weeks. The beauty is that once you have the bits that need to be SIL2 in place you can access them with off the shelf implementation of both interfaces. Mozilla and curl both get a big workout on my desktop. I expect Apache, maybe Tomcat or Websphere will start getting a workout soon. By rearchitecting around existing web standards it should make it easier for me to produce non-SIL2 implementations of the same basic principles. Parts of the SCADA system that are not safety-related could be built out of commodity components while the ones that are can still work through carefully-crafted proprietary implementations. It's also possible that off the shelf implementations will eventually become so accepted in the industry that they can be used where safety is an issue. We may one day think of apache like we do the operating systems we use. They provide a commodity service that we undertand and have validated very well in our own industry and environment to help us to only have to write software that really adds value to our customers.

On that note, we do have a few jobs going at Westinghouse Rail Systems Australia's Brisbane office to support a few projects that are coming up. Hmm... I don't seem to be able to find them on seek. Email me if you're intersted and I'll pass them on to my manager. You'd be best to use my ben.carlyle at invensys.com address for this purpose.

Benjamin

Sound advice - blog

Lifesigns

Subscribe

RDF

Feedback and Social Software

Support Software

Site Statistics

License

My recent bookmarks

HTTP in Control Systems

Timely Data

Who is the Master?

Client Failover

Redundant Networks

Conclusion

Sound advice - blog

Lifesigns

Subscribe

RDF

Feedback and Social Software

Support Software

Site Statistics

License

My current feeds

My recent bookmarks

HTTP in Control Systems

Timely Data

Who is the Master?

Client Failover

Redundant Networks

Conclusion