The Hacker Crackdown: part 1, section 7

From The Hacker Crackdown, by Bruce Sterling

See: The Hacker Crackdown: Preface to the electronic release for copying info

To the average citizen, the idea of the telephone is represented by, well, a *telephone:* a device that you talk into. To a telco professional, however, the telephone itself is known, in lordly fashion, as a "subset." The "subset" in your house is a mere adjunct, a distant nerve ending, of the central switching stations, which are ranked in levels of hierarchy, up to the long-distance electronic switching stations, which are some of the largest computers on earth.

Let us imagine that it is, say, 1925, before the introduction of computers, when the phone system was simpler and somewhat easier to grasp. Let's further imagine that you are Miss Leticia Luthor, a fictional operator for Ma Bell in New York City of the 20s. Basically, you, Miss Luthor, *are* the "switching system." You are sitting in front of a large vertical switchboard, known as a "cordboard," made of shiny wooden panels, with ten thousand metal-rimmed holes punched in them, known as jacks. The engineers would have put more holes into your switchboard, but ten thousand is as many as you can reach without actually having to get up out of your chair.

Each of these ten thousand holes has its own little electric lightbulb, known as a "lamp," and its own neatly printed number code.

With the ease of long habit, you are scanning your board for lit-up bulbs. This is what you do most of the time, so you are used to it.

A lamp lights up. This means that the phone at the end of that line has been taken off the hook. Whenever a handset is taken off the hook, that closes a circuit inside the phone which then signals the local office, i.e. you, automatically. There might be somebody calling, or then again the phone might be simply off the hook, but this does not matter to you yet. The first thing you do is record that number in your logbook, in your fine American public-school handwriting. This comes first, naturally, since it is done for billing purposes.

You now take the plug of your answering cord, which goes directly to your headset, and plug it into the lit-up hole. "Operator," you announce.

In operator's classes, before taking this job, you have been issued a large pamphlet full of canned operator's responses for all kinds of contingencies, which you had to memorize. You have also been trained in a proper non-regional, non-ethnic pronunciation and tone of voice. You rarely have the occasion to make any spontaneous remark to a customer, and in fact this is frowned upon (except out on the rural lines where people have time on their hands and get up to all kinds of mischief). A tough-sounding user's voice at the end of the line gives you a number. Immediately, you write that number down in your logbook, next to the caller's number, which you just wrote earlier. You then look and see if the number this guy wants is in fact on your switchboard, which it generally is, since it's generally a local call. Long distance costs so much that people use it sparingly.

Only then do you pick up a calling-cord from a shelf at the base of the switchboard. This is a long elastic cord mounted on a kind of reel so that it will zip back in when you unplug it. There are a lot of cords down there, and when a bunch of them are out at once they look like a nest of snakes. Some of the girls think there are bugs living in those cable-holes. They're called "cable mites" and are supposed to bite your hands and give you rashes. You don't believe this, yourself.

Gripping the head of your calling-cord, you slip the tip of it deftly into the sleeve of the jack for the called person. Not all the way in, though. You just touch it. If you hear a clicking sound, that means the line is busy and you can't put the call through. If the line is busy, you have to stick the calling-cord into a "busy-tone jack," which will give the guy a busy-tone. This way you don't have to talk to him yourself and absorb his natural human frustration.

But the line isn't busy. So you pop the cord all the way in. Relay circuits in your board make the distant phone ring, and if somebody picks it up off the hook, then a phone conversation starts. You can hear this conversation on your answering cord, until you unplug it. In fact you could listen to the whole conversation if you wanted, but this is sternly frowned upon by management, and frankly, when you've overheard one, you've pretty much heard 'em all.

You can tell how long the conversation lasts by the glow of the calling-cord's lamp, down on the calling-cord's shelf. When it's over, you unplug and the calling-cord zips back into place.

Having done this stuff a few hundred thousand times, you become quite good at it. In fact you're plugging, and connecting, and disconnecting, ten, twenty, forty cords at a time. It's a manual handicraft, really, quite satisfying in a way, rather like weaving on an upright loom.

Should a long-distance call come up, it would be different, but not all that different. Instead of connecting the call through your own local switchboard, you have to go up the hierarchy, onto the long-distance lines, known as "trunklines." Depending on how far the call goes, it may have to work its way through a whole series of operators, which can take quite a while. The caller doesn't wait on the line while this complex process is negotiated across the country by the gaggle of operators. Instead, the caller hangs up, and you call him back yourself when the call has finally worked its way through.

After four or five years of this work, you get married, and you have to quit your job, this being the natural order of womanhood in the American 1920s. The phone company has to train somebody else -- maybe two people, since the phone system has grown somewhat in the meantime. And this costs money.

In fact, to use any kind of human being as a switching system is a very expensive proposition. Eight thousand Leticia Luthors would be bad enough, but a quarter of a million of them is a military-scale proposition and makes drastic measures in automation financially worthwhile.

Although the phone system continues to grow today, the number of human beings employed by telcos has been dropping steadily for years. Phone "operators" now deal with nothing but unusual contingencies, all routine operations having been shrugged off onto machines. Consequently, telephone operators are considerably less machine-like nowadays, and have been known to have accents and actual character in their voices. When you reach a human operator today, the operators are rather more "human" than they were in Leticia's day -- but on the other hand, human beings in the phone system are much harder to reach in the first place.

Over the first half of the twentieth century, "electromechanical" switching systems of growing complexity were cautiously introduced into the phone system. In certain backwaters, some of these hybrid systems are still in use. But after 1965, the phone system began to go completely electronic, and this is by far the dominant mode today. Electromechanical systems have "crossbars," and "brushes," and other large moving mechanical parts, which, while faster and cheaper than Leticia, are still slow, and tend to wear out fairly quickly.

But fully electronic systems are inscribed on silicon chips, and are lightning-fast, very cheap, and quite durable. They are much cheaper to maintain than even the best electromechanical systems, and they fit into half the space. And with every year, the silicon chip grows smaller, faster, and cheaper yet. Best of all, automated electronics work around the clock and don't have salaries or health insurance.

There are, however, quite serious drawbacks to the use of computer-chips. When they do break down, it is a daunting challenge to figure out what the heck has gone wrong with them. A broken cordboard generally had a problem in it big enough to see. A broken chip has invisible, microscopic faults. And the faults in bad software can be so subtle as to be practically theological.

If you want a mechanical system to do something new, then you must travel to where it is, and pull pieces out of it, and wire in new pieces. This costs money. However, if you want a chip to do something new, all you have to do is change its software, which is easy, fast and dirt-cheap. You don't even have to see the chip to change its program. Even if you did see the chip, it wouldn't look like much. A chip with program X doesn't look one whit different from a chip with program Y.

With the proper codes and sequences, and access to specialized phone-lines, you can change electronic switching systems all over America from anywhere you please.

And so can other people. If they know how, and if they want to, they can sneak into a microchip via the special phonelines and diddle with it, leaving no physical trace at all. If they broke into the operator's station and held Leticia at gunpoint, that would be very obvious. If they broke into a telco building and went after an electromechanical switch with a toolbelt, that would at least leave many traces. But people can do all manner of amazing things to computer switches just by typing on a keyboard, and keyboards are everywhere today. The extent of this vulnerability is deep, dark, broad, almost mind-boggling, and yet this is a basic, primal fact of life about any computer on a network. Security experts over the past twenty years have insisted, with growing urgency, that this basic vulnerability of computers represents an entirely new level of risk, of unknown but obviously dire potential to society. And they are right.

An electronic switching station does pretty much everything Leticia did, except in nanoseconds and on a much larger scale. Compared to Miss Luthor's ten thousand jacks, even a primitive 1ESS switching computer, 60s vintage, has 128,000 lines. And the current AT&T system of choice is the monstrous fifth-generation 5ESS.

An Electronic Switching Station can scan every line on its "board" in a tenth of a second, and it does this over and over, tirelessly, around the clock. Instead of eyes, it uses "ferrod scanners" to check the condition of local lines and trunks. Instead of hands, it has "signal distributors," "central pulse distributors," "magnetic latching relays," and "reed switches," which complete and break the calls. Instead of a brain, it has a "central processor." Instead of an instruction manual, it has a program. Instead of a handwritten logbook for recording and billing calls, it has magnetic tapes. And it never has to talk to anybody. Everything a customer might say to it is done by punching the direct-dial tone buttons on your subset.
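The scanning pass described above amounts to a polling loop: look at every line, compare it with what was seen last time, and note the lines that have just gone off the hook. Here is a minimal sketch in C, assuming a simple two-state model of each line. The type and function names are hypothetical, and a real ferrod scanner is hardware rather than an array in memory.

```c
#include <stddef.h>

/* Hypothetical two-state model of a subscriber line. */
enum line_state { ON_HOOK = 0, OFF_HOOK = 1 };

/*
 * One scan pass over every line on the "board".
 * now[]  : the state of each line as read on this pass
 * prev[] : the state seen on the previous pass (updated in place)
 * out[]  : receives the indices of lines that have just gone off-hook
 * Returns the number of indices written to out[].
 */
size_t scan_lines(const enum line_state *now, enum line_state *prev,
                  size_t n_lines, size_t *out, size_t max_out)
{
    size_t found = 0;

    for (size_t i = 0; i < n_lines; i++) {
        if (now[i] == OFF_HOOK && prev[i] == ON_HOOK) {
            if (found == max_out)
                continue;      /* no room: leave prev[i] so the next pass retries */
            out[found++] = i;  /* the equivalent of a lamp lighting up */
        }
        prev[i] = now[i];
    }
    return found;
}
```

Calling `scan_lines` repeatedly on fresh readings gives the tireless round-the-clock sweep the text describes. A line that stays off the hook is reported only once, on the pass where the change is first seen.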

Although an Electronic Switching Station can't talk, it does need an interface, some way to relate to its, er, employers. This interface is known as the "master control center." (This interface might be better known simply as "the interface," since it doesn't actually "control" phone calls directly. However, a term like "Master Control Center" is just the kind of rhetoric that telco maintenance engineers -- and hackers -- find particularly satisfying.)

Using the master control center, a phone engineer can test local and trunk lines for malfunctions. He (rarely she) can check various alarm displays, measure traffic on the lines, examine the records of telephone usage and the charges for those calls, and change the programming.

And, of course, anybody else who gets into the master control center by remote control can also do these things, if he (rarely she) has managed to figure them out, or, more likely, has somehow swiped the knowledge from people who already know.

In 1989 and 1990, one particular RBOC, BellSouth, which felt particularly troubled, spent a purported $1.2 million on computer security. Some think it spent as much as two million, if you count all the associated costs. Two million dollars is still very little compared to the great cost-saving utility of telephonic computer systems. Unfortunately, computers are also stupid. Unlike human beings, computers possess the truly profound stupidity of the inanimate.

In the 1960s, in the first shocks of spreading computerization, there was much easy talk about the stupidity of computers -- how they could "only follow the program" and were rigidly required to do "only what they were told." There has been rather less talk about the stupidity of computers since they began to achieve grandmaster status in chess tournaments, and to manifest many other impressive forms of apparent cleverness.

Nevertheless, computers *still* are profoundly brittle and stupid; they are simply vastly more subtle in their stupidity and brittleness. The computers of the 1990s are much more reliable in their components than earlier computer systems, but they are also called upon to do far more complex things, under far more challenging conditions.

On a basic mathematical level, every single line of a software program offers a chance for some possible screwup. Software does not sit still when it works; it "runs," it interacts with itself and with its own inputs and outputs. By analogy, it stretches like putty into millions of possible shapes and conditions, so many shapes that they can never all be successfully tested, not even in the lifespan of the universe. Sometimes the putty snaps.

The stuff we call "software" is not like anything that human society is used to thinking about. Software is something like a machine, and something like mathematics, and something like language, and something like thought, and art, and information.... but software is not in fact any of those other things. The protean quality of software is one of the great sources of its fascination. It also makes software very powerful, very subtle, very unpredictable, and very risky.

Some software is bad and buggy. Some is "robust," even "bulletproof." The best software is that which has been tested by thousands of users under thousands of different conditions, over years. It is then known as "stable." This does *not* mean that the software is now flawless, free of bugs. It generally means that there are plenty of bugs in it, but the bugs are well-identified and fairly well understood.

There is simply no way to assure that software is free of flaws. Though software is mathematical in nature, it cannot be "proven" like a mathematical theorem; software is more like language, with inherent ambiguities, with different definitions, different assumptions, different levels of meaning that can conflict.

Human beings can manage, more or less, with human language because we can catch the gist of it. Computers, despite years of effort in "artificial intelligence," have proven spectacularly bad in "catching the gist" of anything at all. The tiniest bit of semantic grit may still bring the mightiest computer tumbling down. One of the most hazardous things you can do to a computer program is try to improve it -- to try to make it safer. Software "patches" represent new, untried un-"stable" software, which is by definition riskier.

The modern telephone system has come to depend, utterly and irretrievably, upon software. And the System Crash of January 15, 1990, was caused by an *improvement* in software. Or rather, an *attempted* improvement.

As it happened, the problem itself -- the problem per se -- took this form. A piece of telco software had been written in the C language, a standard language of the telco field. Within the C software was a long "do... while" construct. The "do... while" construct contained a "switch" statement. The "switch" statement contained an "if" clause. The "if" clause contained a "break." The "break" was *supposed* to "break" the "if" clause. Instead, the "break" broke the "switch" statement.
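The trap can be reproduced in a few lines of C. The sketch below only illustrates the language rule. It is not the 4ESS code, which this text does not reproduce, and the function name, the message kinds and the `buffer_full` flag are all invented for the example. In C, a "break" leaves the innermost enclosing loop or "switch"; it never applies to an "if."

```c
/* Hypothetical message kinds; the names are illustrative only. */
enum msg_kind { MSG_STATUS_OK, MSG_CALL };

/*
 * Walks through `count` messages and returns how many times the
 * bookkeeping step ran. `buffer_full` is an invented condition that
 * stands in for whatever the real "if" clause tested. `*deferred`
 * receives the number of messages set aside inside that "if".
 */
int process_messages(const enum msg_kind *msgs, int count,
                     int buffer_full, int *deferred)
{
    int bookkeeping_runs = 0;
    int i = 0;

    *deferred = 0;
    if (count <= 0)
        return 0;

    do {
        switch (msgs[i]) {
        case MSG_STATUS_OK:
            if (buffer_full) {
                (*deferred)++;
                /* Intended meaning: "done with this if, carry on below".
                 * Actual meaning in C: leave the whole switch. */
                break;
            }
            bookkeeping_runs++;  /* silently skipped when the break fires */
            break;
        case MSG_CALL:
            /* ordinary call handling would go here */
            break;
        }
        i++;
    } while (i < count);

    return bookkeeping_runs;
}
```

With `buffer_full` set to 0, every status message reaches the bookkeeping step. With it set to 1, the function still runs to completion without complaint, but the bookkeeping never happens. A flaw of this kind survives testing because nothing visibly fails at the line where the mistake is.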

That was the problem, the actual reason why people picking up phones on January 15, 1990, could not talk to one another.

Or at least, that was the subtle, abstract, cyberspatial seed of the problem. This is how the problem manifested itself from the realm of programming into the realm of real life. The System 7 software for AT&T's 4ESS switching station, the "Generic 44E14 Central Office Switch Software," had been extensively tested, and was considered very stable. By the end of 1989, eighty of AT&T's switching systems nationwide had been programmed with the new software. Cautiously, thirty-four stations were left to run the slower, less-capable System 6, because AT&T suspected there might be shakedown problems with the new and unprecedentedly sophisticated System 7 network.

The stations with System 7 were programmed to switch over to a backup net in case of any problems. In mid-December 1989, however, a new high-velocity, high-security software patch was distributed to each of the 4ESS switches that would enable them to switch over even more quickly, making the System 7 network that much more secure. Unfortunately, every one of these 4ESS switches was now in possession of a small but deadly flaw.

In order to maintain the network, switches must monitor the condition of other switches -- whether they are up and running, whether they have temporarily shut down, whether they are overloaded and in need of assistance, and so forth. The new software helped control this bookkeeping function by monitoring the status calls from other switches. It only takes four to six seconds for a troubled 4ESS switch to rid itself of all its calls, drop everything temporarily, and re-boot its software from scratch. Starting over from scratch will generally rid the switch of any software problems that may have developed in the course of running the system. Bugs that arise will be simply wiped out by this process. It is a clever idea. This process of automatically re-booting from scratch is known as the "normal fault recovery routine." Since AT&T's software is in fact exceptionally stable, systems rarely have to go into "fault recovery" in the first place; but AT&T has always boasted of its "real world" reliability, and this tactic is a belt-and-suspenders routine.

The 4ESS switch used its new software to monitor its fellow switches as they recovered from faults. As other switches came back on line after recovery, they would send their "OK" signals to the switch. The switch would make a little note to that effect in its "status map," recognizing that the fellow switch was back and ready to go, and should be sent some calls and put back to regular work. Unfortunately, while it was busy bookkeeping with the status map, the tiny flaw in the brand-new software came into play. The flaw caused the 4ESS switch to interact, subtly but drastically, with incoming telephone calls from human users. If -- and only if -- two incoming phone-calls happened to hit the switch within a hundredth of a second, then a small patch of data would be garbled by the flaw. But the switch had been programmed to monitor itself constantly for any possible damage to its data. When the switch perceived that its data had been somehow garbled, then it too would go down, for swift repairs to its software. It would signal its fellow switches not to send any more work. It would go into the fault-recovery mode for four to six seconds. And then the switch would be fine again, and would send out its "OK, ready for work" signal.

However, the "OK, ready for work" signal was the *very thing that had caused the switch to go down in the first place.* And *all* the System 7 switches had the same flaw in their status-map software. As soon as they stopped to make the bookkeeping note that their fellow switch was "OK," then they too would become vulnerable to the slight chance that two phone-calls would hit them within a hundredth of a second. At approximately 2:25 p.m. EST on Monday, January 15, one of AT&T's 4ESS toll switching systems in New York City had an actual, legitimate, minor problem. It went into fault recovery routines, announced "I'm going down," then announced, "I'm back, I'm OK." And this cheery message then blasted throughout the network to many of its fellow 4ESS switches.

Many of the switches, at first, completely escaped trouble. These lucky switches were not hit by the coincidence of two phone calls within a hundredth of a second. Their software did not fail -- at first. But three switches -- in Atlanta, St. Louis, and Detroit -- were unlucky, and were caught with their hands full. And they went down. And they came back up, almost immediately. And they too began to broadcast the lethal message that they, too, were "OK" again, activating the lurking software bug in yet other switches. As more and more switches did have that bit of bad luck and collapsed, the call-traffic became more and more densely packed in the remaining switches, which were groaning to keep up with the load. And of course, as the calls became more densely packed, the switches were *much more likely* to be hit twice within a hundredth of a second.

It only took four seconds for a switch to get well. There was no *physical* damage of any kind to the switches, after all. Physically, they were working perfectly. This situation was "only" a software problem. But the 4ESS switches were leaping up and down every four to six seconds, in a virulent spreading wave all over America, in utter, manic, mechanical stupidity. They kept *knocking* one another down with their contagious "OK" messages. It took about ten minutes for the chain reaction to cripple the network. Even then, switches would periodically luck-out and manage to resume their normal work. Many calls -- millions of them -- were managing to get through. But millions weren't. The switching stations that used System 6 were not directly affected. Thanks to these old-fashioned switches, AT&T's national system avoided complete collapse. This fact also made it clear to engineers that System 7 was at fault.
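The chain reaction described above can be imitated with a toy model. The sketch below, in C, is an assumption-laden illustration and not a reconstruction of AT&T's network. Time advances in whole recovery periods, every recovering switch sends one "OK" to every working switch, and the chance that an "OK" coincides with two closely spaced calls is a made-up parameter, `base_permille`, scaled up as fewer working switches share the traffic. All names are hypothetical.

```c
#include <stdint.h>

#define MAX_SWITCHES 128

/* Small linear congruential generator so that runs are repeatable. */
static uint32_t next_random(uint32_t *state)
{
    *state = *state * 1664525u + 1013904223u;
    return *state >> 16;
}

/*
 * Returns the total number of crashes over `steps` recovery periods,
 * counting the one genuine fault that starts the cascade, or -1 if the
 * arguments are out of range. base_permille is the chance, in
 * thousandths, that handling one "OK" on a normally loaded switch
 * coincides with two closely spaced calls.
 */
int simulate_cascade(int n_switches, int base_permille, int steps,
                     uint32_t seed)
{
    int down[MAX_SWITCHES] = {0};
    int next_down[MAX_SWITCHES];
    int crashes = 1;
    uint32_t rng = seed;

    if (n_switches < 2 || n_switches > MAX_SWITCHES ||
        base_permille < 0 || base_permille > 1000)
        return -1;

    down[0] = 1;  /* the single legitimate minor problem */

    for (int t = 0; t < steps; t++) {
        int recovering = 0;
        for (int i = 0; i < n_switches; i++)
            recovering += down[i];
        if (recovering == 0)
            break;  /* nobody is announcing "OK"; the network is calm */

        /* A switch can only be knocked down while it is up, so at
         * least one switch is always up here and `up` is never zero. */
        int up = n_switches - recovering;
        int permille = base_permille * n_switches / up;
        if (permille > 1000)
            permille = 1000;

        for (int i = 0; i < n_switches; i++) {
            next_down[i] = 0;
            if (down[i])
                continue;  /* this one comes back up and says "OK" */
            for (int m = 0; m < recovering; m++) {
                if ((int)(next_random(&rng) % 1000u) < permille) {
                    next_down[i] = 1;  /* hit twice while bookkeeping */
                    crashes++;
                    break;
                }
            }
        }
        for (int i = 0; i < n_switches; i++)
            down[i] = next_down[i];
    }
    return crashes;
}
```

With `base_permille` at 0 the first fault stays isolated and the function returns 1. With any positive value, each crash produces more "OK" messages and heavier load on the survivors, so the crash count tends to climb with every period. This is the feedback loop that the text describes.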

Bell Labs engineers, working feverishly in New Jersey, Illinois, and Ohio, first tried their entire repertoire of standard network remedies on the malfunctioning System 7. None of the remedies worked, of course, because nothing like this had ever happened to any phone system before.

By cutting out the backup safety network entirely, they were able to reduce the frenzy of "OK" messages by about half. The system then began to recover, as the chain reaction slowed. By 11:30 p.m. on Monday, January 15, sweating engineers on the midnight shift breathed a sigh of relief as the last switch cleared up.

By Tuesday they were pulling all the brand-new 4ESS software and replacing it with an earlier version of System 7. If these had been human operators, rather than computers at work, someone would simply have eventually stopped screaming. It would have been *obvious* that the situation was not "OK," and common sense would have kicked in. Humans possess common sense -- at least to some extent. Computers simply don't.

On the other hand, computers can handle hundreds of calls per second. Humans simply can't. If every single human being in America worked for the phone company, we couldn't match the performance of digital switches: direct-dialling, three-way calling, speed-calling, call-waiting, Caller ID, all the rest of the cornucopia of digital bounty. Replacing computers with operators is simply not an option any more.

And yet we still, anachronistically, expect humans to be running our phone system. It is hard for us to understand that we have sacrificed huge amounts of initiative and control to senseless yet powerful machines. When the phones fail, we want somebody to be responsible. We want somebody to blame. When the Crash of January 15 happened, the American populace was simply not prepared to understand that enormous landslides in cyberspace, like the Crash itself, can happen, and can be nobody's fault in particular. It was easier to believe, maybe even in some odd way more reassuring to believe, that some evil person, or evil group, had done this to us. "Hackers" had done it. With a virus. A trojan horse. A software bomb. A dirty plot of some kind. People believed this, responsible people. In 1990, they were looking hard for evidence to confirm their heartfelt suspicions.

And they would look in a lot of places.

Come 1991, however, the outlines of an apparent new reality would begin to emerge from the fog. On July 1 and 2, 1991, computer-software collapses in telephone switching stations disrupted service in Washington DC, Pittsburgh, Los Angeles and San Francisco. Once again, seemingly minor maintenance problems had crippled the digital System 7. About twelve million people were affected in the Crash of July 1, 1991. Said the New York Times Service: "Telephone company executives and federal regulators said they were not ruling out the possibility of sabotage by computer hackers, but most seemed to think the problems stemmed from some unknown defect in the software running the networks."

And sure enough, within the week, a red-faced software company, DSC Communications Corporation of Plano, Texas, owned up to "glitches" in the "signal transfer point" software that DSC had designed for Bell Atlantic and Pacific Bell. The immediate cause of the July 1 Crash was a single mistyped character: one tiny typographical flaw in one single line of the software. One mistyped letter, in one single line, had deprived the nation's capital of phone service. It was not particularly surprising that this tiny flaw had escaped attention: a typical System 7 station requires *ten million* lines of code. On Tuesday, September 17, 1991, came the most spectacular outage yet. This case had nothing to do with software failures -- at least, not directly. Instead, a group of AT&T's switching stations in New York City had simply run out of electrical power and shut down cold. Their back-up batteries had failed. Automatic warning systems were supposed to warn of the loss of battery power, but those automatic systems had failed as well.

This time, Kennedy, La Guardia, and Newark airports all had their voice and data communications cut. This horrifying event was particularly ironic, as attacks on airport computers by hackers had long been a standard nightmare scenario, much trumpeted by computer-security experts who feared the computer underground. There had even been a Hollywood thriller about sinister hackers ruining airport computers -- *Die Hard II.* Now AT&T itself had crippled airports with computer malfunctions -- not just one airport, but three at once, some of the busiest in the world. Air traffic came to a standstill throughout the Greater New York area, causing more than 500 flights to be cancelled, in a spreading wave all over America and even into Europe. Another 500 or so flights were delayed, affecting, all in all, about 85,000 passengers. (One of these passengers was the chairman of the Federal Communications Commission.)

Stranded passengers in New York and New Jersey were further infuriated to discover that they could not even manage to make a long distance phone call, to explain their delay to loved ones or business associates. Thanks to the crash, about four and a half million domestic calls, and half a million international calls, failed to get through.

The September 17 NYC Crash, unlike the previous ones, involved not a whisper of "hacker" misdeeds. On the contrary, by 1991, AT&T itself was suffering much of the vilification that had formerly been directed at hackers. Congressmen were grumbling. So were state and federal regulators. And so was the press. For their part, ancient rival MCI took out snide full-page newspaper ads in New York, offering their own long-distance services for the "next time that AT&T goes down."

"You wouldn't find a classy company like AT&T using such advertising," protested AT&T Chairman Robert Allen, unconvincingly. Once again, out came the full-page AT&T apologies in newspapers, apologies for "an inexcusable culmination of both human and mechanical failure." (This time, however, AT&T offered no discount on later calls. Unkind critics suggested that AT&T were worried about setting any precedent for refunding the financial losses caused by telephone crashes.)

Industry journals asked publicly if AT&T was "asleep at the switch." The telephone network, America's purported marvel of high-tech reliability, had gone down three times in 18 months. *Fortune* magazine listed the Crash of September 17 among the "Biggest Business Goofs of 1991," cruelly parodying AT&T's ad campaign in an article entitled "AT&T Wants You Back (Safely On the Ground, God Willing)."

Why had those New York switching systems simply run out of power? Because no human being had attended to the alarm system. Why did the alarm systems blare automatically, without any human being noticing? Because the three telco technicians who *should* have been listening were absent from their stations in the power-room, on another floor of the building -- attending a training class. A training class about the alarm systems for the power room!

"Crashing the System" was no longer "unprecedented" by late 1991. On the contrary, it no longer even seemed an oddity. By 1991, it was clear that all the policemen in the world could no longer "protect" the phone system from crashes. By far the worst crashes the system had ever had, had been inflicted, by the system, upon *itself.* And this time nobody was making cocksure statements that this was an anomaly, something that would never happen again. By 1991 the System's defenders had met their nebulous Enemy, and the Enemy was -- the System.

Bruce Sterling
bruces@well.sf.ca.us

Literary Freeware: Not for Commercial Use

THE HACKER CRACKDOWN: Law and Disorder on the Electronic Frontier