One of the foundational principles that makes DevOps work well for delivering software is that of “test early, test often.” Indeed, part of the CI/CD (Continuous Integration/Continuous Delivery) domain is that of builds and automated testing of those builds. Not just of individual components, but of the whole application.
Test Driven Development (TDD) is one method; another is simply integrating unit and system testing into the build and release process itself. Functional and regression testing are imperative, and in highly agile environments are often triggered by a simple “commit” to the repository.
Thus, one would hope that by the time the software is “delivered” into the hands of those responsible for deployment into production, there would be a great deal of confidence in its readiness.
Of course if there were, I wouldn’t be writing this post, would I?
The wall over which software is delivered into production (and make no mistake, that wall still exists) is where we often run afoul of what is known in philosophy as the “fallacy of composition.” This logical fallacy is generally applied to arguments and proofs, but interestingly it is software that forms the basis of a simple explanation of this “bad argument” by author Ali Almossawi in An Illustrated Book of Bad Arguments (which I highly recommend for budding philosophers/debate champions of all ages):
Informal fallacy, unwarranted assumption, composition and division
Each module in this software system has been subjected to a set of unit tests and has passed them all. Therefore, when the modules are integrated, the software system will not violate any of the invariants verified by those unit tests. The reality is that the integration of individual parts introduces new complexities to a system due to dependencies that may in turn introduce additional avenues for potential failure. -- https://bookofbadarguments.com/ p.46
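Almossawi’s point can be made concrete in a few lines of code. Below is a deliberately contrived sketch (the function names and the dollars-versus-cents mismatch are my own illustration, not from the book): each unit passes its own tests, yet composing them violates an invariant neither test ever exercised.

```python
def parse_amount(text: str) -> float:
    """Parse a price like '19.99' into dollars (unit-tested in isolation)."""
    return float(text)

def charge(amount_cents: int) -> str:
    """Charge an amount expressed in integer cents (unit-tested in isolation)."""
    return f"charged {amount_cents} cents"

# Each unit test passes on its own:
assert parse_amount("19.99") == 19.99
assert charge(1999) == "charged 1999 cents"

# Integration: dollars flow into a function expecting cents. Both units
# are "correct" by their own tests, but the composed system is wrong.
result = charge(int(parse_amount("19.99")))
assert result == "charged 19 cents"  # off by a factor of 100
```

The dependency between the two modules – an implicit unit-of-measure contract – is exactly the kind of “new complexity” that only appears when the parts are put together.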
Now, at the final stage of the CI/CD process, the software is not necessarily prone to this fallacy. It has been subjected to not only unit tests, but tests on the integration of those units to form a complete “system” or application.
Once it lands in production, however, we’re back to square three. We can’t assume that the system will continue to operate as expected based on those tests. That’s because the definition of the application just changed to encompass not just its software and platforms, but the network and app service components required to make the app go, as it were, and deliver it to the screens of eager consumers and corporate users.
Network and app services sit in the data path that requests and responses must traverse. Many, though not all, of those services may in fact modify those requests and responses in ways developers did not anticipate. Thus, it is possible (and often likely) that once in the production environment – even if every individual service and app component has been rigorously tested – the application will experience a fault. A failure. Make a mistake.
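A minimal sketch of how this happens, assuming a common real-world behavior: a proxy in the production data path normalizes header names to lowercase. The app’s own tests never see the proxy, so a case-sensitive header lookup passes in pre-production and fails in production. The `handle` and `proxy` functions here are hypothetical stand-ins, not any particular product’s API.

```python
def handle(request: dict) -> str:
    """App code, tested without the proxy: case-sensitive header lookup."""
    return f"hello {request['headers']['X-User']}"

def proxy(request: dict) -> dict:
    """Simulated app service in the data path: lowercases header names."""
    return {**request,
            "headers": {k.lower(): v for k, v in request["headers"].items()}}

req = {"headers": {"X-User": "alice"}}
assert handle(req) == "hello alice"  # passes in pre-production testing

failed = False
try:
    handle(proxy(req))  # production data path: the proxy rewrote the request
except KeyError:
    failed = True
assert failed  # same code, same test input -- different result in production
```

Neither the app nor the proxy is broken in isolation; the fault lives entirely in their composition.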
This is because we have fallen prey to the fallacy of composition in the production environment. While app developers (and DevOps) understand this fallacy and address it in pre-production testing, we still often fail to recognize that integration at the network layer is still integration, and may in fact impact the operational integrity of the whole.
The answer seems obvious: well, we’ll just test in production then!
Except we won’t, and you and I know we won’t. Because production is a shared environment, and rigorous testing in a production environment increases the risk of collateral damage to shared resources and systems, which can cause outages. Outages mean lower profits and productivity, and no one wants to be responsible for that. It’s far easier to perform whatever individual tests are possible in production, and then point the finger at developers later when something breaks or fails to work as advertised.
This is ultimately one of the quiet drivers of the software revolution that is eating production networks. In ages past, it was too difficult and costly to replicate and maintain a test environment that matched the production environment. Some of that shared infrastructure is expensive, and takes up a lot of space. Duplicating that network is neither financially nor operationally wise.
But the introduction and subsequent embrace of software thanks to cloud computing and virtualization has started to bring to the fore the notion that replicating software-based networks is not only affordable, but easier than ever thanks to concepts like infrastructure as code. Not only that, but you can stand the environment up, test across it, and tear it down – meaning resources can be reused to do the same for other applications, too.
We’re not there yet, but we’re getting closer. The integration of virtual (software-based) network and application services into the CI/CD pipeline is actually far more real than some might think. Traditionally infrastructure-based (network) services – load balancing, app routing, web app security – are becoming tightly integrated into the software build cycle thanks to their presence in environments like container-based architectures. As software-based solutions, they can be included in at least the testing phase of the build process, raising confidence that the deployment in production will not introduce faults or errors.
Building on that software foundation, the application of “infrastructure as code” takes it even further. When policies and configurations can be designed and fine-tuned during the build and release cycles and then deployed into production with little or no change, we are definitely closer to eliminating the compositional fallacies that exist in production.
The more these services – the most app-centric of the app services – are integrated into the CI/CD testing phase, the more confidence everyone can have in a successful production deployment.
Because the other fallacy that exists today is the assumption that testing against “production” means testing only production systems. That testing doesn’t often include the app services that form the data path in production. It needs to, so we can reduce the errors to the truly esoteric ones and actually have the time and resources to address them.