Millions of computers were affected, and airports, supermarkets, and TV stations worldwide had their activities compromised. All due to a software failure. INESC-ID board member Miguel Pupo Correia, from the Distributed Parallel and Secure Systems research area and head of the Computer Science and Engineering Department of Técnico, explains what we can do to prevent such a blackout and reveals what has been learned from this episode, which was not the first of its kind and certainly will not be the last.

A failure at Microsoft resulted in a ‘blackout’ that is said to have affected 8.5 million computers. What kind of failure are we talking about?

In reality, it wasn’t one failure but two. The one with the most impact wasn’t at Microsoft, but at a cybersecurity company called CrowdStrike. As far as we know, someone at this company left a bug in a software product, which was then distributed to millions of computers worldwide and caused them to stop working. So the company made several errors: it introduced a bug, it did not run the tests that would have detected it, and it pushed the buggy software version to computers worldwide. Worse, all these computers had, and still have, to be fixed manually, one by one. The second failure was indeed at Microsoft, specifically in its cloud service, Microsoft Azure, which had a data center down for several hours during the night of July 18 to 19.

Was it of malicious or accidental origin? How can one distinguish between the two?

The causes appear to have been accidental, or rather, there is no reason to believe they were intentional or malicious. Telling the two apart is not easy, because the distinction lies in the intent of whoever caused the failure. What we know is that neither company presented the case as having an intentional cause. It also seems evident that, had the failures been intentional, the perpetrator or perpetrators would be easily identified and would suffer the associated consequences.

Despite the impact, only about one percent of companies using Windows were affected. What do these companies have in common, with the most notable examples being from the aviation sector?

According to Gartner, a market research firm, CrowdStrike’s cybersecurity software (“endpoint protection”) is currently the market leader in this type of product. Therefore, the companies that fell victim to the problem were those concerned with the cybersecurity of their systems to the point of investing in and using the most sophisticated product available. Apparently, the choice was not the best from a reliability standpoint, although it might have been from a cybersecurity perspective.

How can this type of problem be prevented?

The problem cannot be completely avoided. It must be managed, and the risk of it happening must be kept at an acceptable level. The scientific field that studies the problem of avoiding failures like this – Dependability – has existed for several decades and is a very active research area. In this field, we know well that there are four complementary categories of mechanisms to avoid system failures: 1) fault prevention, which tries to avoid the occurrence and introduction of faults in systems (the bug in CrowdStrike’s case); 2) fault tolerance, which aims to prevent faults from leading to failures (the stoppage of computers in this case); 3) fault removal, which attempts to reduce the number and severity of faults; 4) fault forecasting, which aims to estimate the number, future incidence, and consequences of faults.

We have witnessed the impact at the business level. But this type of problem can also affect citizens. What can each of us, individually, do to avoid suffering such a blackout?

Both companies and individuals are increasingly dependent on computers and, I would say, want to depend more and more on them. In the case of companies this is evident, but citizens also increasingly depend on personal computing devices: mobile phones, laptops, tablets, smartwatches, etc. What can be done is to avoid critical dependence. There are numerous examples. One I see as a professor: students who keep their thesis presentation in cloud software (usually Google Slides). Because it is in the cloud, being able to use this software depends on Internet access being available. It seems to me a bad idea to depend on the Internet at an important moment like a master’s or doctoral defence, not to mention that it is unnecessary, since the presentation can simply be downloaded in advance. Identifying these dependencies is not trivial, but each of us needs to ask whether, at an important moment, we will depend on IT and what we can do to avoid it. I once heard Admiral Gameiro Marques, who is the National Security Authority, say that companies should maintain the ability to perform much of their activity manually, without using computers. This may be possible in some cases and impossible in others, but it seems to me a good principle. He was thinking specifically about one company, the IMPRESA group, whose IT infrastructure suffered a devastating cyberattack and which lost the ability to edit the Expresso newspaper using the IT systems it had relied on for several years. They might have thought it was impossible to continue producing the newspaper manually, but they had no choice.

What do we learn – companies and citizens – from this incident?

A few decades ago, public and business services worked quite poorly. Today we are used to them generally working well, efficiently, and without major delays. What we need to learn is that reality is not perfect and that, at certain times, something that seemed as routine as catching a plane can be delayed by hours or even days, or become impossible altogether. We need to learn to manage our expectations.

There has been talk of an increase in the occurrence of such problems – whether accidental or malicious in origin. Is this your opinion? If so, is it an inevitable fact, or can precautions prevent its occurrence?

I agree that there has been talk and that there is a perception that such problems, whether accidental or intentional, have become more frequent. However, I am not certain that this is true. This type of problem has always happened. Our dependence on such IT systems is growing, and therefore the problems affect us more, seem to happen more often, and get more visibility in the media.