Engineer’s Codex

How one line of code caused a $60 million loss

60,000 people lost full phone service, half of AT&T's network was down, and 500 airline flights were delayed

Nov 13, 2023

Engineer’s Codex is about practical lessons and stories of real-world software engineering.


On January 15th, 1990, AT&T's New Jersey operations center detected a widespread system malfunction, shown by a plethora of red warnings on their network display.

Despite attempts to rectify the situation, the network remained compromised for 9 hours, leading to a 50% failure rate in call connections.

AT&T lost over $60 million as a result, and over 60,000 Americans were left with fully disconnected phones.

Furthermore, 500 airline flights were delayed, affecting 85,000 people.

AT&T's long-distance network was supposedly a paragon of efficiency, handling a substantial portion of the nation's calls with its advanced electronic switches and signaling system. This system usually completed call routing within seconds.

However, on this day, a fault originating in a New York switch cascaded through the network. A recent software update to the network's 114 switches contained a critical bug. When the New York switch reset itself and sent out signals, this bug caused a domino effect, leading to widespread network disruption.
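The domino effect can be illustrated with a toy model. The sketch below is an assumption-heavy simplification, not AT&T's actual topology or protocol: switches sit in a ring, a recovering switch signals its two neighbors, and any neighbor that is busy when the signal arrives hits the bug and resets too. The function name `count_cascade_resets`, the `busy` flags, and the ring layout are all hypothetical.

```c
#include <stdbool.h>

#define MAX_SWITCHES 128  /* enough for the 114 switches in the article */

/*
 * Toy model (assumption): switches form a ring. When a switch resets and
 * comes back, it signals both neighbors. A neighbor that is busy at that
 * moment hits the bug and resets as well. Each switch resets at most once
 * here; the real switches could reset repeatedly.
 *
 * Returns the number of switches that reset, or -1 on invalid input.
 */
int count_cascade_resets(const bool busy[], int n, int first)
{
    bool has_reset[MAX_SWITCHES] = { false };
    int queue[MAX_SWITCHES];
    int head = 0, tail = 0, resets = 0;

    if (n <= 0 || n > MAX_SWITCHES || first < 0 || first >= n) {
        return -1;
    }

    /* The first switch resets on its own (the New York switch's role). */
    has_reset[first] = true;
    queue[tail++] = first;

    while (head < tail) {
        int s = queue[head++];
        int neighbors[2] = { (s + 1) % n, (s + n - 1) % n };
        resets++;

        for (int k = 0; k < 2; k++) {
            int nb = neighbors[k];
            /* Only busy switches trip over the bug in this model. */
            if (busy[nb] && !has_reset[nb]) {
                has_reset[nb] = true;
                queue[tail++] = nb;
            }
        }
    }
    return resets;
}
```

In this model an idle network contains the fault to a single switch, while a fully busy one loses every switch, which matches the intuition that the same flawed software everywhere turns one reset into many.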

Interestingly, this small software patch was not tested. Testing was actually bypassed as per management’s request because the code change was small.

However, most of AT&T’s code was rigorously tested.

The Problem

The root cause was traced back to a coding error in a software update implemented across the network's switches.

The error, within a C program, involved a misplaced break statement inside an if-else nested in a switch case, leading to data overwrites and system resets.

The pseudocode:

1  while (ring receive buffer not empty
          and side buffer not empty):
2    Initialize pointer to first message in side buffer
     or ring receive buffer
3    get copy of buffer
4    switch (message):
5       case (incoming_message):
6             if (sending switch is out of service):
7                 if (ring write buffer is empty):
8                     send "in service" to status map
9                 else:
10                    break // The error was here!
                  END IF
              END IF
11            process incoming message, set up pointers to
              optional parameters
12            break
     END SWITCH
13   do optional parameter work

The problem:

  • If the ring write buffer is NOT empty, the condition on line 7 is false, so the `else` branch runs and the break statement on line 10 is hit.

  • In C, a break inside an `if` does not exit the `if`. It exits the enclosing `switch`, so execution jumps straight past line 11 to line 13.

  • For the program to function properly, line 11 should have run first.

  • Because the incoming message was never processed and the pointers to optional parameters were never set up, the optional parameter work on line 13 overwrote data (the pointers that should’ve been held).

  • The error correction software identified the data overwrite and initiated a shutdown of the switch for a reset. This issue was compounded because this flawed software was present in all switches across the network, leading to a chain reaction of resets that ultimately crippled the entire network system.
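The control-flow mistake described above can be reproduced in a few lines of C. This is a minimal sketch, not AT&T's code: `SwitchState`, `Outcome`, `handle_buggy`, and `handle_fixed` are hypothetical names, and the flags simply record which pseudocode lines ran. The fix shown (dropping the `else`/`break`) is an assumption about one plausible correction; the sources do not describe the actual patch.

```c
#include <stdbool.h>

enum { INCOMING_MESSAGE = 1 };

/* Hypothetical stand-ins for the conditions on pseudocode lines 6 and 7. */
typedef struct {
    bool sender_out_of_service;
    bool ring_write_buffer_empty;
} SwitchState;

/* Records which steps ran, so the skipped work is visible. */
typedef struct {
    bool status_map_updated;  /* pseudocode line 8 ran */
    bool message_processed;   /* pseudocode line 11 ran */
} Outcome;

/* Mirrors the buggy flow: the break leaves the whole switch statement. */
Outcome handle_buggy(SwitchState s, int message)
{
    Outcome out = { false, false };
    switch (message) {
    case INCOMING_MESSAGE:
        if (s.sender_out_of_service) {
            if (s.ring_write_buffer_empty) {
                out.status_map_updated = true;
            } else {
                break;  /* exits the switch, not just the if */
            }
        }
        out.message_processed = true;  /* skipped when the break fires */
        break;
    default:
        break;
    }
    return out;
}

/* One plausible correction (assumption): no early break at all. */
Outcome handle_fixed(SwitchState s, int message)
{
    Outcome out = { false, false };
    switch (message) {
    case INCOMING_MESSAGE:
        if (s.sender_out_of_service && s.ring_write_buffer_empty) {
            out.status_map_updated = true;
        }
        out.message_processed = true;  /* always runs for this case */
        break;
    default:
        break;
    }
    return out;
}
```

With the sender out of service and a non-empty write buffer, `handle_buggy` returns with `message_processed` still false, while `handle_fixed` sets it to true. In every other state the two functions behave identically, which is part of why a bug like this can hide until the network is under load.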

Despite having a network designed for resilience, one line of code was able to bring down half the country’s main line of communication.

The Fix

It took engineers 9 hours to get AT&T’s system fully back online. They did so mostly by rolling back the switches to a previous, working version of code.

It then took software engineers two weeks of rigorous code reading, testing, and replication to understand where the bug actually was.

Conclusion

For AT&T, unfortunately, this wasn’t even their biggest system crash of the 90s. They encountered many more issues later in the decade.

In reality, it wasn’t one line of code that brought down a system. It was a failure in processes.

Today’s companies have even better processes in place, and even then, bugs slip through. Google wrote a great retrospective on 20 years of Site Reliability Engineering, where they reflect on YouTube’s first global outage in 2016.

Outages at this scale are costly, and there are lessons to be learned from each one. Most, however, come down to human error and gaps in processes.

My favorite resources for reliability engineering

Google SRE

GitLab’s SRE Role

Site Reliability Engineering Resources


Sources:

https://users.csc.calpoly.edu/~jdalbey/SWE/Papers/att_collapse

https://telephoneworld.org/landline-telephone-history/the-crash-of-the-att-network-in-1990/

https://www.mit.edu/hacker/part1.html

https://prezi.com/qxeu8iayvwpu/1990-att-long-distance-network-crash/

9 Comments
Abhinav Upadhyay
Writes Confessions of a Code Addict
Nov 13 ∙ Liked by Leonardo Creed

Fascinating story. Early days of "testing in production".

Rich
Nov 20 ∙ Liked by Leonardo Creed

I remember this event well. I was working for a large competing telecom equipment manufacturer at the time and we were all stunned at the extent of the outage. A lot of new customers were gained as a result of that day and there was a renewed emphasis on testing everything before deploying to production equipment.

© 2023 Engineer’s Codex