Something no one tells you about "bugs" in software



The software industry has its own quirks. What the rest of the world calls "defects", we call "bugs". The word has become so common that it is used in other contexts as well. For example, does a car company say that a car model has a "bug" in its transmission? No, but the same car company will use the word "bug" for any defect caused by a flaw in the software running on the onboard computer of one of its models. I learnt this recently, when the service agent for my car told me that he had to upgrade the software in the ECU due to a "bug". When I asked him what he meant by "bug"... well, he couldn't define it.

Computing folklore tells us that the word was first used by engineers, colleagues of the pioneer Grace Hopper, while working on the early electromechanical computers. It is said that they traced a particular fault to an actual insect that had lodged itself in one of the relays of the massive Mark II computer. The word is benign, maybe even friendly, compared to "defect" (which is harsh, as a defect can almost always be traced to a human error). Programmers have grown up believing that there will always be bugs, as if they appear in some random manner like Grace Hopper's original (and real) bug. In general, programmers shy away from attributing "bugs" to their own mistakes.

A couple of years ago, I did an analysis of about one thousand "bugs" for one of my clients. In many large software services organisations that are certified for various quality processes (think ISO, SEI...), such analyses (which go by the name of RCA - root cause analysis) are fairly common. Except, in most cases, due to the ritualistic nature of these exercises, they tend to be ineffective. More often than not, such reviews hardly refer to the actual work product (i.e., the code), relying instead on input from the developers themselves.

In any case, I went through these bugs, discussed them with the developers who fixed them and reviewed each and every piece of code that was responsible for the malfunction. Nearly 850 bugs could be attributed to errors by programmers - logical errors, poor construction and composition of code, use of shared mutable state and so on. With a little bit of care, most of these could have been avoided. So you see, these "bugs" were not in the same class as the one found by Grace Hopper's team. They were due to defective workmanship, not some unknown cause.
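To make the "shared mutable state" class of error concrete, here is a tiny, contrived Python sketch (the function names are my own, purely for illustration). Python's mutable default argument is a classic instance: state created once is silently shared across every call.

```python
def append_item(item, items=[]):  # BUG: the default list is created once and shared
    items.append(item)
    return items

first = append_item("a")   # ["a"]
second = append_item("b")  # ["a", "b"] - state from the first call leaks in!

# Corrected version: create a fresh list on each call
def append_item_fixed(item, items=None):
    if items is None:
        items = []
    items.append(item)
    return items
```

The defect here is not random, like Hopper's moth; it is a construction error that a careful reviewer would catch immediately.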

The rest of the bugs could mainly be classified into two categories: configuration management - installing wrong versions of own or third-party artefacts - and poor documentation, mostly of third-party libraries used in the code. A small number could be attributed to poor communication of requirements. To be fair, this system did not require much domain-specific knowledge on the part of the developers.

In the software engineering world, the popular view is that most bugs are due to poor definition and communication of requirements. In this instance, that assumption proved to be spectacularly wrong.

Is it really possible to write error-free software? In theory, yes. But in the real world, we are writing software on layers upon layers of code (OS, browser, frameworks...) that run on chips that are themselves designed and built using software. None of these "other" pieces of software are error-free. Therefore, even if one does manage to write error-free code, the system as a whole might still malfunction due to errors in any of the layers below.

Is it possible to prove that a piece of software is error-free? No; here we reach the inherent limits of computer science. Edsger Dijkstra, one of the pioneers of computer science, observed that "Program testing can be used to show the presence of bugs, but never to show their absence!" Alas, one can never prove, by testing alone, that a program has no errors.
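Dijkstra's point is easy to demonstrate with a small, contrived sketch (the function and its tests are hypothetical): a test suite can pass in its entirety while a defect lurks in an input the tests never exercise.

```python
def is_leap_year(year):
    # BUG: implements the 4-year and 100-year rules, but misses the 400-year rule
    return year % 4 == 0 and year % 100 != 0

# These checks all pass, yet the function is wrong:
assert is_leap_year(2024)        # divisible by 4 - correct
assert not is_leap_year(1900)    # century year - correct
assert not is_leap_year(2023)    # ordinary year - correct

# A case the tests never covered reveals the defect:
# is_leap_year(2000) returns False, but 2000 was a leap year.
```

Three passing tests showed the presence of no bugs; they could never show their absence.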


Unfortunately, "bugs will always be there" has become an accepted mantra amongst developers, and in many cases it leads to a lack of rigour in the approach to writing code. The process-led engineering practices of most software development organisations have moved the focus firmly away from the primary work product, code, to other things - methodologies and process improvements - to address the question of quality (or the lack of it).

Process improvements, in some cases, have resulted in better practices and led to the automation of error-prone tasks such as building and deploying systems. However, it might be too optimistic to expect that process improvements alone will result in people automatically writing better-quality code. Programming is a craft, and craftsmen must craft their artefacts with pride!

Eventually, the errors or bugs are traced to faulty code written by developers. To tackle this, one must bring the focus of the organisation back to the primary work product - the code. People who write good code must be valued, and people who write poor code must be called out. Developers must follow the rigorous discipline that is required to master the art of writing and testing code.

Companies must build a culture where developers obsessively avoid writing erroneous code, and visibly reward those who do so. To achieve this, management, at the highest levels, must bring the focus squarely onto the primary work product - code. For that is what runs your business.


