Risk Management & The Millennium Bug

By Professor Michael Mainelli
Published by Croner Business Networks Briefing, Issue 21, pages 5-8.

You've already read the book: Y2K gurus slugging it out in the press, weighty reports, quick market surveys, government warnings and announcements. You are an actor in the film: mailshots from recently arrived software suppliers and consultants, enormous sums spent by large organisations, endless conferences. If you had the T-shirt, it would have to compress all the hype into the expert's refrain: "Who is to say? Remember, I told you so!" Yet, in the end, Y2K is inescapable - are you missing something?

Origins of a Panic Attack

The Y2K problem (or millennium bug) began in the "60s", "70s" and early "80s" (sic - two digits) when computer programmers were chronically short of memory, disk space and processor speed. The differences between that period and today are large. This author began programming in the mid-70s with a luxurious 4 kilobytes of memory on a laboratory mini-computer and is writing this article with 64 megabytes on a PC at home, approximately a 16,000-fold increase. Programmers were told that systems were being built for a finite period and therefore used a common trick of recording only two digits for the year, which saved significant space on large files. Computations on those files depended on the two digits being interpreted as "1900 + two digits" and often resorted to further efficiency tricks, such as using 98 or 99 as special triggers or adding extra months and days that don't exist. For instance, 98 might mean end of record and 99 end of file. Clearly, problems arise when the real 1998 or 1999 comes along. The Y2K problem has an extra zing when you realise that the year 2000 is a leap year and that many programmers mistakenly thought it wasn't (a leap year falls in every year divisible by four, except years divisible by 100, unless they are also divisible by 400).
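To see the trick and its sting in miniature, here is a small sketch - in modern Python purely for illustration, since systems of that era would typically have been written in COBOL, assembler or similar - of the "1900 + two digits" assumption and the full leap-year rule:

    # Illustration only: how the two-digit year assumption and the leap-year
    # rule interact. The function names are invented for this sketch.

    def year_from_two_digits(yy):
        # The classic assumption: two stored digits always mean 1900 + yy.
        return 1900 + yy

    def is_leap(year):
        # Divisible by four, except years divisible by 100, unless also by 400.
        return year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)

    stored_yy = 0                                  # a record stores "00", meaning 2000
    print(2000 - year_from_two_digits(stored_yy))  # 100 years of error
    print(is_leap(1900), is_leap(2000))            # False True - 2000 is a leap year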

Most systems built wholly after the mid-1980s appear to be free of such efficiency tricks, although many 'new' systems rely on older ones. Digging even deeper, many systems and much installed non-computing equipment rely on microprocessors embedded in cooling systems, power supplies, security systems and fail-safe devices - devices which are themselves programmed and can use date functions - the "embedded chip" problem. Programmers world-wide are working hard at solving the Y2K problem. Estimates are that possibly half of the world's information systems people will be employed on Y2K during 1999 (last-minute preparations) and 2000 (cleaning up after the party). The bulk of the work is hard going: revisiting, re-analysing, re-writing or patching, re-testing and re-implementing millions of lines of code and integrated applications. The work gets harder when you need to get out and visit installed equipment on oil rigs, radar stations, submarines or mine shafts in order to test embedded devices.

Who's to Blame

A natural human response in such situations is to ask how this could possibly come about and who is at fault before getting on to what can be done about it. A first port of call is the programmers: clearly they built the systems using shortcuts which would not stand the test of time, and now they have the audacity to charge for fixing them. However, these systems were almost always built for a finite period of time. In the '70s this period could be as short as two or three years, or possibly as long as five or seven, before "we buy a software package", "we move to a fully-relational database" or "we upgrade all our systems". A next port of call is the accountants, who left these systems off the books when they were key business assets or failed to fund the asset maintenance costs which should have existed. However, accountants had, and have, great trouble getting sensible lifetimes and valuations for computer-based systems.

Information systems managers certainly should have been warning everyone earlier, but in fact many were, often starting in the late 1980s. The last port of call is probably management, who don't seem to have paid enough attention to how their businesses were increasing in risk with each passing year. In truth, in most organisations no-one is to blame. The problem has arisen gradually and, provided it is tackled in time, can still be reasonably solved by most organisations.

Facing the Music?

If you are in certain industries - air, rail, road and sea transportation; utilities; emergency services; many heavy industries; and most of the high-volume financial sector - you almost certainly have a well-developed Y2K programme and probably aren't wasting time with this article. If you are a small or medium-sized business in a non-life-threatening area, you may not quite have got around to doing anything. What might your response be?

Panic: At one extreme, there is the possibility of a series of successive small failures leading to some forms of large-scale damage, e.g. failure of street lamps and traffic lights leads to accidents which crowd hospitals and impede ambulances, combined with some power losses and alarm failures which lead to looting. Disaster scenarios are a possibility, but a small one. Taking such scenarios seriously, you could install redundant systems, emergency power supplies, better security and totally renew all of your equipment, both computing and general machinery - oh, and find a secure place to spend the first week of the New Year.

Fatalism: Neither you nor your competitors seem to be doing very much, and the issues are certainly being blown out of proportion. You've got company. There has been a lot of hype and even scare-mongering. Only a few of the predicted difficulties in 1998 actually arose. Certain commentators note that no one can actually point to any microprocessor that is going to fail in home or office environments. Toasters, cars, home telephones, security systems, etc, all appear to function after changing the date. One analyst has even promised to eat the first device which someone shows will fail. In effect, the great Y2K problem is no worse than a summer-time/winter-time change. If it turns out to be much worse, there is probably little you can do anyway.

Pragmatism: Somewhere, a lurking microchip might bring an aeroplane down, but society as we know it is not at risk. Assess the scale of the problem as best you can, take precautions on vital systems and keep a watch on the wider field. Although 1998 failed to produce many of the predicted problems, much of the work required for Y2K is actually upgrading, replacing or enhancing systems - work which was probably necessary in any event, e.g. replacing older operating systems with new ones on which parts of the organisation have already standardised.

Why Is It So Difficult to Get Appropriate Advice?

Y2K falls into an interesting class of risk. Risks are measured by severity and likelihood. Y2K severity is difficult to ascertain. There would be catastrophic failure if everyone had done nothing at all - power stations would fail, emergency services would not arrive, somewhere people would die. Yet most organisations probably have an ability to muddle through if everyone else has done a reasonable amount. Estimating likelihood is equally difficult. If all available resource were spent trying to eliminate Y2K risk, there would still be some residual risk - it cannot be eliminated totally.

Y2K is a singularity at two levels. At the risk level, previous millennium or century changes shed little light on this one - computers and microprocessors didn't exist. Other, better-known risks don't shed much light either, e.g. summer/winter time change problems. At the computing level the singularity lies in the distinction between digital and analogue. Examining 90% of a system and clearing it of Y2K risk does not reduce the risk in the same way as strengthening a bridge to 90% of its desired strength. A single line of poor code can render the system inoperable, e.g. calculating pension monies due based on an extra 100 years, or cause major clean-up problems, e.g. unwinding the faulty pension calculation a month after discovery, recovering some monies and re-calculating all of the transactions. Because Y2K is an unusual risk, two traditional approaches, transfer (insurance) and share (mutualise), are not readily available. The most likely pragmatic responses are to reduce the likelihood or to minimise the impact. Fortunately, people are sharing their experience and knowledge of certain systems and equipment, mostly over the Internet.
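To make the pension example concrete, here is a hedged sketch (the routine, names and figures are invented, not drawn from any real system) of how one faulty line of two-digit date arithmetic demands a century of arrears:

    # Hypothetical pension sketch: a single faulty line produces roughly
    # 100 years of pension due. All names and figures are illustrative.

    def months_of_pension_due(retired_yy, current_year):
        # The faulty line: the stored two-digit retirement year is read as 1900 + yy.
        retirement_year = 1900 + retired_yy
        return (current_year - retirement_year) * 12

    # Someone retires in January 2000, stored on file as "00":
    print(months_of_pension_due(0, 2000))   # 1200 months of "arrears" - an extra 100 years

Unwinding such a payment a month after discovery is exactly the kind of clean-up problem described above.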

What Should We Do?

Let's assume you are a small or medium-sized business without any life threatening machinery, e.g. turnover £1M to £100M, 10 to 1,000 computer users, a small data network, telephone switchboards, 50 sq m to 5,000 sq m of space. Sadly, you have done absolutely nothing except pass around a few articles, hold a few committee meetings and decide to appoint a Y2K project manager to begin tackling the problem in January. Looking at the year ahead for the Y2K project manager, this might be typical, if not ideal:

January 1999 - start the year right with a proper risk analysis based on the assets and operations you have. There are several checklists and methods available, or use an external consultant with risk assessment experience. A good place to start is your fixed asset register. Get a listing and fill it in - equipment, purchase date, value, model, etc. This work can be combined with an update of the inventory or the asset register. Grade all assets and operations in terms of the likelihood of any problem and its severity to the business (a small grading sketch follows the list below). Look in particular at:

Inputs
  • individual computers; local area networks; servers;
  • mini and mainframe computers;
  • switchboards, telecomms networks and telecomms devices;
  • plant and machinery;
  • building mechanical and electrical devices, lifts, escalators, air conditioning, heating, security systems;
  • suppliers, particularly utilities;
  • network feeds - information services, EDI networks, etc;
Processes
  • shopfloor plant and machinery;
  • distribution and logistics;
  • core database applications;
Outputs
  • embedded chips in your products;
  • links to client systems;
  • invoicing, payroll.
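
As a rough sketch of the grading itself (the asset names, scores and 1-to-5 scales are invented purely for illustration), the arithmetic need be no more than likelihood multiplied by severity, ranked from worst downwards:

    # Minimal risk-grading sketch under assumed 1-5 scales; every entry
    # below is invented to illustrate ranking assets by likelihood x severity.

    assets = [
        # (asset, likelihood of a Y2K problem 1-5, severity to the business 1-5)
        ("Core accounting database", 4, 5),
        ("Telephone switchboard",    2, 4),
        ("Office PCs",               3, 2),
        ("Lift controller",          1, 3),
    ]

    for name, likelihood, severity in sorted(assets, key=lambda a: a[1] * a[2], reverse=True):
        print(f"{name:26s} likelihood={likelihood} severity={severity} score={likelihood * severity}")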

February 1999 - write to the suppliers of the highest-risk equipment for advice and their knowledge of any problems with their equipment. Surf the Internet for advice. Taskforce 2000 and Action 2000 are good starting points for networking around. Look in particular for information on any known problems with the assets you hold. For computer-based assets, see if automated testing tools are relevant.
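As a sketch of the kind of check such tools automate (the routine under test here is hypothetical and, unlike much legacy code, already handles four-digit years), the essential question is whether date arithmetic still behaves across the 1999/2000 boundary and over 29 February 2000:

    # Not a real testing tool - just the sort of rollover check such tools run
    # automatically against date routines. The routine under test is hypothetical.

    from datetime import date

    def days_between(d1, d2):
        # Routine under test; it uses four-digit years, so it behaves correctly here.
        return (date(*d2) - date(*d1)).days

    assert days_between((1999, 12, 31), (2000, 1, 1)) == 1    # century rollover
    assert days_between((2000, 2, 28), (2000, 3, 1)) == 2     # 2000 is a leap year
    print("rollover checks passed")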

March 1999 - based on the risk assessment, draw up a plan for senior management commitment to reduce the likelihood and minimise the impact. A key item is likely to be the development of a disaster recovery plan to ensure business continuity. Consider the use of bureaux for upgrading software if the task is very large.

April 1999 - share the key points of the plan with your staff, perhaps via a monthly newsletter and helpline. Ask for their assessments of hotspots and their ideas for disaster recovery. Order the systems you have decided to upgrade in plenty of time for installation - services and computer people are in short supply.

May 1999 - you have probably by now developed a three-track plan. The first track is focused on the critical systems. The second is probably seeing what can get done in the time available. The third may be bringing forward a major project which would have occurred in any case, but which also provides a Y2K risk reduction. As an example of the three tracks: you are installing a more modern GPS in your offshore vessel; you have two people going around upgrading older PCs who will certainly get to almost all of them; and your new financial system is going live in October rather than in the second quarter of next year.

June 1999 - ensure that the critical systems plan is due to complete with sufficient contingency. Run a simulated disaster workshop with a few staff over one morning. Feed the results into the disaster recovery plan.

July 1999 - look on the market for extra security protection, back-up services, temporary staff, disaster recovery centres. See what the market has made available.

August 1999 - the last opportunity for a safe holiday until the next year.

September 1999 - be particularly on the alert for 09/99 problems, where genuine September 1999 dates collide with the 98 and 99 values some older systems use as special triggers. Use this as a last opportunity to determine the scale of the problem, if any, within suppliers or customers.
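One plausible form of the 09/99 hazard, sketched below with an invented record format, is an older routine that treats "99" in a two-digit year field as an end-of-file marker and therefore silently drops genuine September 1999 transactions:

    # Sketch of the 09/99 hazard; the record layout and figures are invented.
    records = [
        "15/08/97 payment 120.00",
        "09/09/99 payment  80.00",   # a genuine September 1999 transaction
        "01/10/99 payment  45.00",
    ]

    def read_until_sentinel(lines):
        for line in lines:
            if line[6:8] == "99":    # the old trick: 99 in the year field means "end of file"
                break                # ...so real 1999 data is silently dropped
            yield line

    print(list(read_until_sentinel(records)))   # only the 1997 record survives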

October 1999 - work out how to wind down any projects which are likely to cause more disruption than benefit, e.g. perhaps that new financial system will have to wait until the second quarter of 2000 after all.

November 1999 - surprise the organisation with an emergency test of the disaster recovery plan.

December 1999 - test marshalling arrangements for staff over the holiday period and ensure that the disaster recovery plan is ready to go.

January 2000 - ensure that a hit team of some form goes out and looks for problems which may arise during the month. Scrutinise stock, payroll, invoicing, receipts, everything. Possibly bring forward some of the annual stocktaking or internal audit so that it occurs before these problems spread more widely.

February 2000 - if you've successfully weathered everything, have a celebration and then look forward to the year 3000, when we may need to lose an entire day in order to get things lined up again with the sun.

[A version of this article originally appeared as “Risk Management and the Millennium Bug”, Croner Business Networks Briefing, Issue 21 (17 December 1998), pages 5-8.]