HTTP may not be the first protocol that comes to mind when you think of SCADA,
or when you think of other kinds of control systems. Even the Internet Protocol is
not a traditional SCADA component. SCADA traditionally works off good old serial or radio
communications with field devices, and uses specialised protocols that keep bandwidth
usage to an absolute minimum. SCADA has two sides, though, and I don't just mean the
"Supervisory Control" and the "Data Acquisition" sides. A SCADA system is an information
concentration system for operational control of your plant. Having already gotten your
information into a concentrated form and place, it makes sense to feed summaries of that
data into other systems. In the old parlance of the corporation I happen to work for this
was called "Sensor to Boardroom".
One of my drivers in trying to understand some of the characteristics of the web as
a distributed architecture has been in trying to expose the data of a SCADA system to
other ad hoc systems that may need to utilise SCADA data. SCADA has also come a long way
over the years, and now stands more for integration of operational data from various
sources than simple plant control. It makes sense to me to think about whether the ways
SCADA might expose its data to other systems may also work within a SCADA system composed
of different parts. We're in the land of ethernet here, and fast processors. Using a more
heavy-weight protocol such as HTTP shouldn't be a concern from the performance
perspective, but what else might we have to consider?
Let's draw out a very simple model of a SCADA system. In it we have two server
machines running redundantly, plus one client machine seeking information from the
servers. This model is effectively replicated over and over for different services and
extra clients. I'll quickly summarise some possible issues and work through them one by
one:
- Timely arrival of data
- Deciding who to ask
- Quick failover between server machines
- Dealing with redundant networks
Timely Data
When I use the word timely, I mean that our client would not get data that is any
fresher by polling rapidly. The simplest implementation of this requirement would be...
well... to poll rapidly. However, this loads the network and all CPUs unnecessarily and
should be avoided in order to maintain adequate system performance. Timely arrival of
data in the SCADA world is all about subscription, either ad hoc or preconfigured.
I have worked fairly extensively on the appropriate models for this. A client requests
a subscription from a server. The subscription is periodically renewed and may eventually
be deleted. While the subscription is active it delivers state updates to a client URL
over some appropriate protocol. Easy. The complications start to appear in the next few
points.
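A minimal sketch of that subscription lifecycle in Python may make the moving parts concrete. Everything named here (`SubscriptionServer`, the lease length, the `deliver` callable standing in for "some appropriate protocol") is an assumption for illustration, not an existing implementation.

```python
import time


class SubscriptionServer:
    """Hypothetical in-memory model: subscribe, renew, delete, publish."""

    def __init__(self, lease_seconds=30.0, clock=time.monotonic):
        self.lease_seconds = lease_seconds
        self.clock = clock          # injectable so expiry can be tested
        self._subs = {}             # sub_id -> [callback_url, expires_at]
        self._next_id = 1

    def subscribe(self, callback_url):
        sub_id = self._next_id
        self._next_id += 1
        self._subs[sub_id] = [callback_url, self.clock() + self.lease_seconds]
        return sub_id

    def renew(self, sub_id):
        """Extend the lease; False tells the client to subscribe again."""
        self._expire()
        if sub_id not in self._subs:
            return False
        self._subs[sub_id][1] = self.clock() + self.lease_seconds
        return True

    def delete(self, sub_id):
        self._subs.pop(sub_id, None)

    def publish(self, state, deliver):
        """Push a state update to every live subscriber's URL."""
        self._expire()
        for callback_url, _expires_at in self._subs.values():
            deliver(callback_url, state)
        return len(self._subs)

    def _expire(self):
        # Drop subscriptions whose lease ran out without a renewal.
        now = self.clock()
        for sub_id in [i for i, s in self._subs.items() if s[1] <= now]:
            del self._subs[sub_id]
```

A subscription that is not renewed within its lease simply disappears, which is what lets the server clean up after clients that have gone away.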
Who is the Master?
Deciding who to ask for subscriptions and other services is not as simple as you might
think. You could use DNS (or a DNS-like service) in one of two ways. You could use static
records, or you could change your records as the availability of servers changes. Dynamic
updates would work through some DNS updater application running on one or more machines.
It would detect the failure of one host, and nominate the other as the IP address to
connect to for your service. Doing it dynamically has the problem that you're working from
pretty much a single point of view. What you as the dynamic DNS modifier see may not be
the same as what all clients see. In addition you have the basic problem of the static
DNS: Where do you host it? In SCADA everything has to be redundant and robust against
failure. No downtime is acceptable. The static approach also pushes the failure detection
problem to clients, which may be a problem they aren't capable of solving due to their
inherent "dumb" generic functionality.
Rather than solving the problem at the application level you could rely on IP-level failover.
However, this works best when machines are situated on the same subnet. It becomes more
complex to design when main and backup servers are situated in separate control centres
for disaster recovery.
Whichever way you turn there are issues. My current direction is to use static DNS
(or equivalent) that specifies all IP addresses that are or may be relevant for the name.
Each server should forward requests on to the main if it is not currently master, meaning
that it doesn't matter which one is chosen when both servers are up (apart from a slight
additional lag should the wrong server be chosen). Clients should connect to all IP
addresses simultaneously if they want to get their request through quickly when one or
more servers are down. They should submit their request to the first connected IP, and
be prepared to retry on failure to get their message through. TCP/IP has timeouts tuned
for operating over the Internet, but these kinds of interactions between clients and
servers in the same network are typically much faster. It may be important to ping hosts
you have connections to in order to ensure they are still responsive.
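The "connect to everything, use whichever answers first" idea can be sketched in Python as follows. This is only an illustration under assumptions: `connect_first` is a hypothetical helper name, the two-second timeout is arbitrary, and the `connect` parameter exists purely so the function can be exercised without a real network.

```python
import socket
from concurrent.futures import ThreadPoolExecutor, as_completed


def _close_if_connected(future):
    # A slower attempt that still succeeded is no longer needed.
    if not future.cancelled() and future.exception() is None:
        future.result().close()


def connect_first(addresses, timeout=2.0, connect=socket.create_connection):
    """Try every (host, port) at once; return (address, connection) for the first success."""
    pool = ThreadPoolExecutor(max_workers=max(1, len(addresses)))
    futures = {pool.submit(connect, addr, timeout): addr for addr in addresses}
    errors = []
    try:
        for future in as_completed(futures):
            try:
                conn = future.result()
            except OSError as exc:
                errors.append((futures[future], exc))
                continue
            # First success wins; arrange for the other attempts to be closed.
            for other in futures:
                if other is not future:
                    other.add_done_callback(_close_if_connected)
            return futures[future], conn
    finally:
        pool.shutdown(wait=False)   # do not block on attempts still in flight
    raise ConnectionError(f"all addresses failed: {errors}")
```

The caller would submit its request on the returned connection and call `connect_first` again if that request fails, which is the retry behaviour described above.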
It would be nice if TCP/IP timeouts could be tuned more finely. Most operating systems
allow tuning of the entire system's connections. Few support tuning on a per-connection
basis. If I know the connection I'm making is going to a host that is very close to me
in terms of network topology it may be better to declare failures earlier using the
standard TCP/IP mechanisms rather than supplementing with ICMP. Also, the ICMP method
for supplementing TCP/IP in this way relies on not using any IP-level failover techniques
between servers.
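Where a platform does offer per-connection tuning, it tends to look like the sketch below. The option names are the ones Linux exposes through Python's `socket` module; the helper name and the timing values are assumptions picked for a nearby host, and options missing on a given platform are simply skipped.

```python
import socket


def tune_for_local_network(sock, idle_s=1, interval_s=1, probes=3,
                           user_timeout_ms=3000):
    """Ask the kernel to declare this one connection dead quickly.

    Returns the names of the options that were actually applied.
    """
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    applied = ["SO_KEEPALIVE"]
    wanted = (
        ("TCP_KEEPIDLE", idle_s),              # idle seconds before the first probe
        ("TCP_KEEPINTVL", interval_s),         # seconds between probes
        ("TCP_KEEPCNT", probes),               # unanswered probes before giving up
        ("TCP_USER_TIMEOUT", user_timeout_ms), # max ms that sent data may stay unacknowledged
    )
    for name, value in wanted:
        option = getattr(socket, name, None)
        if option is None:
            continue                           # not defined on this platform
        try:
            sock.setsockopt(socket.IPPROTO_TCP, option, value)
            applied.append(name)
        except OSError:
            pass                               # defined but refused by this kernel
    return applied
```

With values like these a dead peer would be noticed in a few seconds rather than the Internet-scale defaults, without changing the settings for every other connection on the machine.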
Client Failover
Quick failover follows on from discovering who to talk to. The same kinds of failure
detection mechanisms are required. Fundamentally clients must be able to quickly detect
any premature loss of their subscription resource and recreate it. This is made more
complicated by the different server side implementations that may make subscription loss
more or less likely, and thus the necessary corrective actions that clients may need to
take. If a subscription is lost when a single server host fails, it is important that
clients check their subscriptions often and also monitor the state of the host that
is maintaining their subscription resource. If the host goes down then the subscription
must be reestablished as soon as this is discovered. As such the subscription must be
periodically tested for existence, preferably through a RENEW request. Regular RENEW
requests over an ICMP-supported TCP/IP connection as described above should be sufficient
for even a slowly-responding server application to adequately inform clients that their
subscriptions remain active and they should not reattempt creation.
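One cycle of that client-side check might look like the following Python sketch. The callables `create`, `renew` and `host_alive` are placeholders for whatever the real client uses (an HTTP request, an ICMP ping and so on), and `SubscriptionLost` is a hypothetical exception name.

```python
class SubscriptionLost(Exception):
    """Raised by renew() when the server no longer knows the subscription."""


def keep_subscribed(create, renew, host_alive, sub_id=None):
    """Run one check cycle and return the id of the subscription now in force.

    create()     -> new subscription id
    renew(id)    -> returns normally, or raises SubscriptionLost / OSError
    host_alive() -> True while the host holding the subscription responds
    """
    if sub_id is not None and host_alive():
        try:
            renew(sub_id)
            return sub_id            # still active: do not recreate
        except (SubscriptionLost, OSError):
            pass                     # fall through and recreate
    return create()
```

A client would call this on a timer somewhat shorter than the subscription lease, keeping the returned id for the next cycle.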
Redundant Networks
SCADA systems typically utilise redundant networks as well as redundant servers. Not
only can clients access the servers on two different physical media, the servers can do
the same to clients. Like server failover, this could be dealt with at the IP level...
however your IP stack would need to work in a very well-defined way with respect to
packets you send. I would suggest that each packet be sent to both networks with
duplicates discarded on the receiving end. This would very neatly deal with temporary
outages in either network without any delays or network hiccups. Ultimately the whole
system must be able to run over a single network, so trying to load balance while both
are up may be hiding inherent problems in the network topology. Using them both should
provide the best network architecture overall.
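The send-on-both, discard-duplicates idea can be illustrated above the IP layer with a sequence number on each message. This Python sketch is not an IP stack: the class names, the four-byte header and the window size are all assumptions, and it does nothing about ordering or retransmission.

```python
import struct
from collections import deque

HEADER = struct.Struct("!I")   # 32-bit sequence number, network byte order


class DualSender:
    """Sends every message over both networks with the same sequence number."""

    def __init__(self, send_a, send_b):
        self._sends = (send_a, send_b)
        self._seq = 0

    def send(self, payload):
        frame = HEADER.pack(self._seq) + payload
        self._seq = (self._seq + 1) & 0xFFFFFFFF
        for send in self._sends:
            try:
                send(frame)
            except OSError:
                pass               # the other network may still deliver it


class DedupReceiver:
    """Returns each payload once, whichever network delivers it first."""

    def __init__(self, window=1024):
        self._seen = set()
        self._order = deque()
        self._window = window      # how many recent sequence numbers to remember

    def receive(self, frame):
        (seq,) = HEADER.unpack_from(frame)
        if seq in self._seen:
            return None            # duplicate from the second network
        self._seen.add(seq)
        self._order.append(seq)
        if len(self._order) > self._window:
            self._seen.discard(self._order.popleft())
        return frame[HEADER.size:]
```

Because both copies always go out, losing either network costs nothing more than the duplicates that would have been thrown away anyway.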
Unfortunately, I'm not aware of any network stacks that do what I would like. Hey,
if you happen to know how to set it up feel free to drop me a line. In the meantime this
is usually dealt with at the application level with two IP addresses per machine. I tell
you what: This complicates matters more than you'd think. You end up needing a DNS name
for the whole server pair with four IP addresses. You then need an additional DNS name
for each of the servers, each with two IP addresses. When you subscribe to a resource you
specify the whole server pair DNS name on connection, but the subscription resource may
only exist on one service. It would be returned with only that service's DNS name, but
that's still two IP addresses to deal with and ping. All the way through your code you
have to deal with this multiple address problem. In the end it doesn't cause a huge
theoretical problem to deal with this at the application level, but it does make
development and testing a pain in the arse all around.
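To make the naming layout concrete, here it is written out as a small Python table. The host names are invented and the addresses come from the ranges reserved for documentation; in a real system this would live in DNS or its equivalent rather than in code.

```python
# Hypothetical names; one address per network for each server.
SERVERS = {
    "scada-a.example": ("192.0.2.1", "198.51.100.1"),
    "scada-b.example": ("192.0.2.2", "198.51.100.2"),
}
PAIR_NAME = "scada.example"    # name for the redundant pair as a whole


def addresses_for(name):
    """Return every address a client must try (and ping) for a name."""
    if name == PAIR_NAME:
        return [ip for ips in SERVERS.values() for ip in ips]
    return list(SERVERS[name])
```

A subscription request would go to the four addresses of `scada.example`, while the subscription resource that comes back would be tied to the two addresses of whichever single server created it.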
Conclusion
Because this is all SIL2
software you end up having to write most of it yourself. I've been developing HTTP client
and server software in spurts over the last six months or so, but concertedly over the
last few weeks. The beauty is that once you have the bits that need to be SIL2 in place
you can access them with off the shelf implementations of both interfaces. Mozilla and
curl both get a big workout on my desktop. I expect Apache, maybe Tomcat or Websphere
will start getting a workout soon. By rearchitecting around existing web standards it
should make it easier for me to produce non-SIL2 implementations of the same basic
principles. Parts of the SCADA system that are not safety-related could be built out
of commodity components while the ones that are can still work through carefully-crafted
proprietary implementations. It's also possible that off the shelf implementations will
eventually become so accepted in the industry that they can be used where safety is an
issue. We may one day think of Apache like we do the operating systems we use. They
provide a commodity service that we understand and have validated very well in our own
industry and environment to help us to only have to write software that really adds
value to our customers.
On that note, we do have a few jobs going at
Westinghouse Rail Systems Australia's
Brisbane office to support a few projects that are coming up. Hmm... I don't seem to be
able to find them on seek. Email me if you're interested and I'll pass them on to my
manager. You'd be best to use my ben.carlyle at invensys.com address for this purpose.
Benjamin