Michael Mainelli, The Z/Yen Group
[An edited version of this article originally appeared as “Y2K in 1999: From Risk Reduction to Contingency Planning”, Kluwer Handbook of Risk Management, Issue 25, Kluwer Publishing (12 February 1999) pages 6-8.]
The "60's" and "70's" are to computing history what the construction of the pyramids is to recorded history. Yet some programming tricks used by the early pharaohs of computing are threatening to bring the lights down on the biggest New Year's Eve party in 1,000 years. It is almost as if faulty pyramid construction techniques were threatening to destroy public buildings in major conurbations worldwide over the next year. Surely this is too alarmist - we're being overly wound up about the scale of the problem. With a bit of luck and a fair wind, we can muddle through. Right?
Prehistory of a Panic Attack
The Y2K problem (or millennium bug) began in the "60's", "70's" and early "80's" (sic - two digits) when computer programmers were chronically short of memory, disk space and processor speed. The differences between that period and today are large. This author began programming in the mid-70's with a luxurious 4 kilobytes of memory on a laboratory mini-computer and is writing this article with 64 megabytes on a PC at home, approximately a 16,000-fold increase. Programmers were told that systems were being built for a finite period of time and therefore used a common trick of recording only two digits for the year in each date, which saved significant space on large files. So how do some programming tricks from long ago affect us today?
Computations on those files depended on two digits being interpreted as "1900 + two digits" and often resorted to further efficiency tricks such as using 98 or 99 as special triggers or adding extra months and days that don't exist but start specific computer routines. For instance, 98 might mean end of record and 99 end of file. Clearly, problems arise when the real 1998 or 1999 comes along and the program jumps off to do what it normally did on seeing the trigger. The Y2K problem has an extra zing when you realise that 2000 is a leap year and that many programmers mistakenly thought it wasn't (a leap year occurs in every year divisible by four, except when divisible by 100, UNLESS divisible by 400). Some even trickier tricks would take too long to explain, but they will certainly blow up on an appointed date.
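The three tricks described above can be sketched in a few lines of Python. This is an illustrative sketch only, not code from any real legacy system: the function names, the choice of 99 as an end-of-file trigger and the sample values are all assumptions made for the example.

```python
END_OF_FILE = 99  # hypothetical trigger value, as in the "99 = end of file" example


def expand_two_digit_year(yy: int) -> int:
    """Legacy interpretation: every stored year is 1900 + two digits."""
    return 1900 + yy


def years_between(start_yy: int, end_yy: int) -> int:
    """Interval between two stored years, as a legacy program would compute it."""
    return expand_two_digit_year(end_yy) - expand_two_digit_year(start_yy)


def read_years(stored_years: list[int]) -> list[int]:
    """Read stored years until the trigger value appears.

    A genuine 1999 record is indistinguishable from the trigger,
    so everything from that record onwards is silently dropped.
    """
    result = []
    for yy in stored_years:
        if yy == END_OF_FILE:
            break
        result.append(expand_two_digit_year(yy))
    return result


def is_leap_year_naive(year: int) -> bool:
    """The mistaken rule: century years are never leap years."""
    return year % 4 == 0 and year % 100 != 0


def is_leap_year(year: int) -> bool:
    """The full rule: divisible by 4, except by 100, unless by 400."""
    return year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)


if __name__ == "__main__":
    # Someone who joined in 1965, evaluated in 1999 and then in 2000 ("00").
    print(years_between(65, 99))      # 34
    print(years_between(65, 0))       # -65
    # Records dated 1997 to 2001: reading stops at the real 1999.
    print(read_years([97, 98, 99, 0, 1]))  # [1997, 1998]
    print(is_leap_year_naive(2000), is_leap_year(2000))  # False True
```

In this sketch the interval calculation turns negative as soon as the stored year rolls over to 00, the genuine 1999 record is mistaken for the end of the file, and the mistaken leap-year rule loses 29 February 2000.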
Most systems built wholly after the mid-1980's appear to be largely free of such efficiency tricks, although many ‘new’ systems rely on older ones. Digging deeper, computing systems and much non-computing equipment rely on microprocessors embedded in cooling systems, power supplies or security systems which are, in turn, programmed devices which use date functions - the "embedded chip" problem.
Programmers and electricians worldwide are working hard at solving the Y2K problem. Organisations in certain industries - air, rail, road and sea transportation; utilities; emergency services; many heavy industries; and most of the high-volume financial sector - almost certainly have well-developed Y2K programmes underway. Estimates are that possibly half of the world's information systems people will be employed on Y2K during 1999 (last minute preparations) and 2000 (cleaning up after the party). Global estimates of the remedial costs till the end of 2000 range from a fairly well-researched minimum of US$860 billion up to several trillion - excluding the costs of any disasters. The bulk of the work is hard going, revisiting, re-analysing, re-writing or patching, re-testing and re-implementing millions of lines of code and integrated applications. There are no ‘silver bullet’ solutions. The work gets even harder when you realise the need to get out and visit installed equipment on oil rigs, radar stations, submarines or mine shafts in order to test embedded devices.
Facing the Music
Y2K falls into an interesting class of risk. Risks are typically measured by severity and likelihood. Y2K severity is difficult to ascertain. There would be catastrophic failure if everyone had done nothing at all - power stations would fail, emergency services would not arrive, somewhere people would die. As these potential catastrophes recede in likelihood, largely due to significant analytical effort and early responses, most organisations now probably have an ability to muddle through. Estimating likelihood is equally difficult. Even if all available resources were spent trying to eliminate Y2K risk, there would still be some residual risk. Y2K risk cannot be eliminated totally because it is a singularity in two ways, event singularity and digital singularity. As an event singularity, previous millennium or century changes shed little light on this one - computers and microprocessors didn't exist. Other, better-known risks, e.g. summer/winter time change problems, do not shed much light either. Digital singularity arises from the distinction between digital and analogue. Clearing 90% of a digital system from Y2K risk does not reduce the risk in the same way as strengthening a bridge to 90% of its desired strength. A single line of poor code can render the system inoperable, e.g. calculating pension monies due based on an extra 100 years, or cause major clean-up problems, e.g. unwinding the faulty pension calculation a month after discovery, recovering some monies and re-calculating all of the transactions. Digital systems tend to be more ‘brittle’ than analogue systems.
What Should We Do?
Let's assume you are a small or medium-sized business without any life threatening machinery or vulnerable customers. There are traditionally four responses to risk - accept, avoid, transfer or mitigate. One or more responses can be appropriate:
Accept: For small and medium-sized businesses, some risk must be accepted. However, businesses which have done nothing to date should be asked why they need to accept all of the risk.
Avoid: Y2K risk is hard and expensive to avoid. Recent supposition, based on the ease of the conversion to the Euro and the paucity of severe 1998 problems, is that the danger posed by Y2K is receding as the most vulnerable organisations continue to reduce it. Because of the event singularity mentioned above, all suppositions remain suppositions. Because of digital singularity, it remains impossible to reduce the likelihood to nil. At the level of the small and medium-sized business, the time for direct reduction action has probably passed. With less than a year to go, possibly as little as three or four months, computer systems of any size are unlikely to be replaceable.
Transfer: Insurers, one traditional transfer approach, are unable to assess the scale of risk they might be accepting, so virtually no sensible insurance is available. Other forms of transfer, e.g. holding a supplier responsible, are fraught with difficulty, even where possible. Some suppliers, e.g. popular software companies, face too many potential claims to act as reliable risk counter-parties; lawyers are unsure of the contractual ground; and some of the risks are so severe that they would make the plaintiff bankrupt well before any court settlement. Government action in many countries is helping to keep people out of the courts and at their workstations re-programming. Pledge programmes are campaigns where organisations pledge to share information and governments pledge to help them avoid liability if their information turns out to be unreliable, e.g. X Corp thinks a specific escalator chip is OK and tells everyone, but acting on this advice Y Corp has severe troubles and normally would have sued. Government could help to unleash the power of insurance risk management if it were to act as a re-insurer of last resort. Even at this late stage Government re-insurance would bring insurers into the market, permit premium assessments, promote information sharing (through the insurers) and alert some organisations to the scale of financial risk they are ignoring.
Mitigate: Mitigation largely takes the form of contingency planning - how will we cope with resultant problems? Mitigation varies widely from sector to sector, from stocking food and fuel to arranging for temporary bookkeepers to having alternative premises fitted out. Risk mitigation is probably best started by examining the asset base for vulnerability and then assessing the impact on sites, projects, people and legal entities. Risk analysis classifies areas into those suitable for risk reduction (time permitting), acceptance or mitigation.
Proper Planning Prevents Poor Parties
We are not being totally wound up by pundits on Y2K - there is real risk. At the same time, getting wound up doesn't help. As the clock ticks down, it has become clear in the last few months that there has been a big shift from emphasis on risk avoidance to emphasis on risk mitigation. Fortunately, people are sharing their experience and knowledge of certain systems and equipment, mostly over the Internet - Taskforce 2000 (www.taskforce2000.co.uk) and Action 2000 (www.bug2000.co.uk) are good starting points. For risk managers, the new emphasis on risk mitigation clearly shows the importance of starting promptly with proper risk analysis.
Who's to Blame?
A natural human response in such situations is to ask how this could possibly come about and who's at fault before getting on to what can be done about it. A first port of call is the programmers: clearly they built the systems using shortcuts which would not stand the test of time and now they have the audacity to charge for fixing them. However, these systems were almost always built for a finite period of time. In the 70's this time period could be as short as two or three years or possibly as long as five or seven before "we buy a software package", "we move to a fully-relational database", or "we upgrade all our systems". A next port of call is the accountants who left these systems off the books when they were key business assets or failed to fund the asset maintenance costs which should have existed. However, accountants had, and have, great trouble getting sensible lifetimes and valuations for computer-based systems. Information systems managers should have been warning everyone earlier, but many actually were, often from the late 1980's. The last port is probably management, who don't seem to have paid enough attention to how their businesses were increasing in risk with each passing year. In fact, in most organisations, no one is to blame.