Use the Cloud as your “sandbox” to experiment and do R&D for on-prem systems.
This document discusses using a cloud model to architecturally validate the possibility of consolidating multiple application servers or instances into a smaller number of physical resources that will ultimately remain on-prem. For this document, the cloud offering from Skytap is used as the example cloud for the possible approach, although the same techniques can be leveraged in other cloud offerings.
It is important to note that this document is not advocating for reengineering applications from on-prem to the cloud, though that is a possibility. Instead, the focus of this document is to describe how to leverage the cloud to help validate the design of re-organizing a large number of physical on-prem servers down to a smaller number of resources also hosted on-prem. In this case, the cloud is used as the R&D “sandbox” for key design assumptions.
Undertaking a project of this type requires a foundational set of questions to be addressed:
- Is it even possible to consolidate multiple resources to a more efficient target number? How can this basic assumption be proven without investing significant additional costs in existing infrastructure if it turns out that the basic premise is false?
- How can the basic assumption be proven out in the fastest time possible? If it takes weeks or months to acquire and provision the resources required to perform experiments against the basic premise, how will that impact the project timeline?
- How do we quickly adapt the final target model as new information is learned? For example, if initial calculations on CPU, memory, storage are incorrect, how can adjustments be made quickly and easily without incurring massive unexpected costs?
- How can we provide R&D, test, and lab environments for all the constituents involved in the project? If the project is to consolidate servers, how can all the various application, database, networking, administrative, and QA testing groups gain access to a production-like representation of the final target architecture? Will all the groups have to schedule time on a limited set of testing resources? How will integration testing from all the teams be done?
Questions like these and others can be addressed by leveraging the cloud as a “placeholder” for the final target system that will eventually exist on-prem. The flexibility of the cloud offers solutions to many of the classical problems faced when working with traditional on-prem physical resources. Most notably, issues such as these can be addressed:
- How do we create “multiple” isolated test environments identical down to hostnames and IP addresses?
QA environments “Test1”, “Test2”, and “Test3” should all have a host called “Database” with an IP address of 192.168.0.1. How can identical R&D and test systems exist without colliding with each other? How can these identical systems also communicate with shared resources without colliding at the network layer? Traditional Enterprise thinking would require that hosts go through a “re-IP” process so that no IP addresses are duplicated on the same network address space. The downside of this approach is that the test systems are no longer exact duplicates. There is a greater possibility of having multiple configuration problems that need to be debugged by hand. The cloud model offers “environment cloning,” where multiple environments with the same network topology can exist in harmony.
- How do we reset a test or R&D environment that has become corrupt?
Once an application environment has been used extensively, the test data may become stale, or automated test cases may have created data that must be reset or removed before the next test run can be executed. Another variation is that multiple, slightly different test datasets are needed to validate all configurations of the target system. Traditional Enterprise thinking would use scripts and possibly other automation techniques to delete database data, reset configuration files, remove log files, etc. Each of these options can be time-consuming and error-prone. The cloud approach leverages “image templates,” where complete, ready-to-use VMs along with their network topology and data are saved into templates. If a database becomes corrupt or heavily modified by previous testing, instead of resetting the data via scripts, the cloud model replaces the entire database VM along with all of its data in one step. The reset process can often be done in seconds or minutes versus hours or days. Complete, ready-to-use environments containing dozens of VMs can be saved as templates and reconstituted in very short amounts of time. For example, instead of taking weeks or months to rebuild a complex multi-VM/LPAR environment from scratch, what if it took only a few hours?
- How do we provide “self-service” so that a project sprint requiring a new, clean environment is not delayed weeks or more waiting for an environment to become available?
Traditional Enterprise thinking would restrict access to cloud resources so that only a select few have direct access, and those few then create environments for everyone else. The historical reasons given for this approach are steeped in antiquated cultural models, as documented in “Quit Hiding the Cloud from your Developers.” The cloud approach delivers “self-service with IT control and oversight.” Users and groups are given direct access to the cloud but have restrictions that limit their consumption. The overall system is protected from a runaway script that incurs excessive charges or consumes all available resources. Users and groups have a “Quota” that limits the amount of consumption possible at any one time. Users are assigned “Projects” that define what environments they can see. A QA user sees environments “QA1” and “QA2” while a developer might see both of those as well as “R&D1” and “R&D2”. Finally, the cloud system provides universal “Auditing” so that user activities are tracked and available for reporting. The question of “Who deleted that AIX LPAR?” is no longer a mystery.
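The “self-service with IT control and oversight” idea above can be sketched in a few lines of Python. This is a toy model, not Skytap's API: the class names, the choice of “concurrent VMs” as the quota unit, and the audit record layout are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


class QuotaExceeded(Exception):
    """Raised when a request would push a user past their quota."""


@dataclass
class User:
    name: str
    projects: set          # names of the Projects the user belongs to
    quota_vms: int         # hypothetical quota unit: concurrently running VMs
    running_vms: int = 0


@dataclass
class Cloud:
    environments: dict                      # environment name -> Project name
    audit_log: list = field(default_factory=list)

    def _audit(self, user, action, target):
        # Every action, allowed or denied, is recorded with a UTC timestamp.
        stamp = datetime.now(timezone.utc).isoformat()
        self.audit_log.append((stamp, user.name, action, target))

    def visible_environments(self, user):
        # Users only see environments assigned to one of their Projects.
        return sorted(e for e, p in self.environments.items()
                      if p in user.projects)

    def start_vms(self, user, env, count):
        if env not in self.visible_environments(user):
            self._audit(user, "start_denied", env)
            raise PermissionError(f"{user.name} cannot see {env}")
        if user.running_vms + count > user.quota_vms:
            self._audit(user, "quota_denied", env)
            raise QuotaExceeded(f"{user.name} is limited to {user.quota_vms} VMs")
        user.running_vms += count
        self._audit(user, "start", env)

    def delete_environment(self, user, env):
        if env not in self.visible_environments(user):
            self._audit(user, "delete_denied", env)
            raise PermissionError(f"{user.name} cannot see {env}")
        del self.environments[env]
        self._audit(user, "delete", env)
```

With a QA user in the hypothetical `qa` Project and a developer in both `qa` and `dev`, `visible_environments` reproduces the QA1/QA2 versus QA1/QA2/R&D1/R&D2 split described above, and filtering `audit_log` for `"delete"` entries answers the “who deleted it?” question.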
With the above points laid out, here is the recommended Cloud-based model for projects involving the consolidation of multiple on-prem servers/LPARs that will remain on-prem after consolidation:
- Classic “lift and shift” of existing, working on-prem resources
The recommended approach is to “create or recreate” a representation of the final target system in the cloud, but not re-engineer any components into cloud-native equivalents. The same number of LPARs, the same memory/disk/CPU allocations, the same file system structures, the same exact IP addresses, the same exact hostnames, and the same network subnets are created in the cloud to represent, as closely as possible, a “clone” of the eventual system of record that will exist on-prem. The benefit of this approach is that you can apply “cloud flexibility” to what was historically a “cloud stubborn” system. Fast cloning, ephemeral longevity, software-defined networking, and API automation can all be applied to the temporary stand-in running in the cloud. As design principles are finalized based on research performed on the cloud version of the system, those findings can be applied to the on-prem final buildout.
To jump-start the cloud build-out process, it is possible to reuse existing on-prem assets as the foundation for components built in the cloud. LPARs in the cloud can be based on existing mksysb images already created on-prem. Other alternatives like ‘alt_disk_copy’ can be used to take snapshots of root and data volume groups and move them to LPARs running in the cloud.
- Save to Template
Once a collection of LPARs representing an “Environment” has been created in the cloud, the environment is saved as a single object called a Template. The Template is used to “clone” other working environments. Clones are exact duplicates of the Template, down to the hostname, IP address, subnet, and disk allocations. In Skytap, multiple environment clones can be running simultaneously without colliding. Creating ready-to-use environments from a template is the most powerful component of the cloud-based approach. It provides the ability for multiple exact copies of the reference system to be handed out to numerous ENG/DEV/TEST groups, all of which can be running in parallel. There is no need to change the IP addresses of individual servers or their hostnames. Each environment runs in a virtual data-center in harmony with the others. If environments have to communicate with other on-prem resources, they are differentiated via an isolated NAT mechanism, as described below, even though many of them contain the same VM base image(s) with the same hostnames and IP addresses.
On-Prem to Cloud Workflow
- Assign Templates to Skytap Projects
Once Templates are created, they are assigned in the Skytap portal to a Project. Projects are then assigned to groups of users. Skytap provides a built-in access/security model, so users can only see or access environments that have been assigned to them via the Project mechanism. A QA user cannot see an environment assigned solely to ENG, for example. Users also have role assignments that allow them to view/edit/admin VMs/LPARs defined in an environment assigned to a project. The Skytap portal provides a complete and audited access control mechanism.
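The “Save to Template” and clone steps described above can be illustrated with a small in-memory model. This is only a sketch of the concept under stated assumptions: `VM`, `Environment`, and `Template` are hypothetical names, a real template stores full disk images rather than a Python dict, and nothing here calls Skytap.

```python
import copy
from dataclasses import dataclass, field
from itertools import count


@dataclass
class VM:
    hostname: str
    ip: str
    data: dict = field(default_factory=dict)   # stand-in for disk contents


@dataclass
class Environment:
    env_id: int        # the only thing that differs between clones
    name: str
    vms: list


class Template:
    """A frozen snapshot of a whole environment (hypothetical model)."""

    _ids = count(1)    # shared counter so every clone gets a unique id

    def __init__(self, name, vms):
        self.name = name
        # Deep copy so later changes to the source VMs cannot alter the template.
        self._vms = copy.deepcopy(vms)

    def clone(self, env_name):
        # Each clone is an exact, independent copy: same hostnames, same IPs.
        return Environment(next(Template._ids), env_name,
                           copy.deepcopy(self._vms))
```

Because every clone is built from the stored snapshot, “resetting” a damaged environment amounts to discarding it and calling `clone` again, with no cleanup scripts involved.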
How to create cloned environments with duplicate address spaces
As stated previously, there are many benefits associated with creating multiple working environments that replicate the same network topology as the final target system. Here, “replicate” means reusing the same hostnames, IP addresses, and subnets within each environment. To achieve this, some form of isolation must be implemented to avoid collisions across duplicate environments. Within Skytap, each environment exists within its own software-defined networking space that is not visible to other running environments. Using this mechanism, it is possible to create exact clones of multi-VM architectures with multiple subnets containing replicated address spaces. Each environment becomes a virtual private data-center.
Cloned environments communicate back to upstream on-prem resources via a single focal point called an “environment virtual router” (VR). The VR hides the lower VMs containing duplicate host-names and IP addresses and exposes a unique IP address to the greater on-prem network. Using this mechanism creates a simplified and elegant way for multiple duplicate environments to exist in harmony without breaking basic network constructs. By allowing duplicated host-names and IP addresses to exist, individual hosts do not have to go through a “re-IP” process, which is error-prone and time-consuming. The VR becomes the “jump-host” that allows operations like SSH into each unique environment. From on-prem, users first SSH to the jump-host, which exposes a unique IP address to on-prem, and then relays down to the VM within an individual environment.
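A minimal sketch of the addressing idea follows, using the “Database” host at 192.168.0.1 from the earlier example. The external pool `10.50.0.0/29`, the `EnvRouter` class, and both helper functions are assumptions for illustration; they model the mapping only and do not configure any real router or Skytap component.

```python
import ipaddress
from dataclasses import dataclass


@dataclass
class EnvRouter:
    env_name: str
    external_ip: ipaddress.IPv4Address   # unique on the on-prem side
    internal_hosts: dict                 # hostname -> internal IP (repeats across envs)


def build_routers(env_names, internal_hosts, external_pool):
    """Give each environment one unique external address from the pool."""
    pool = iter(ipaddress.ip_network(external_pool).hosts())
    routers = {}
    for name in env_names:
        external_ip = next(pool, None)
        if external_ip is None:
            raise ValueError("external pool exhausted")
        routers[name] = EnvRouter(name, external_ip, dict(internal_hosts))
    return routers


def resolve(routers, env_name, hostname):
    """Return (jump_host_ip, internal_ip): connect to the first, relay to the second."""
    router = routers[env_name]
    return str(router.external_ip), router.internal_hosts[hostname]


routers = build_routers(
    ["Test1", "Test2", "Test3"],
    {"Database": "192.168.0.1"},   # identical inside every environment
    "10.50.0.0/29",                # hypothetical on-prem-facing pool
)
```

The internal address is the same in every environment; only the jump-host address returned by `resolve` tells on-prem callers which copy of “Database” they are reaching.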
The following use cases apply to many server consolidation projects where different groups of participants need access to a representation of the final target system or some subset of the final design. The key is to “quickly” deliver the infrastructure needed to meet a specific need or perform a specific task. Waiting weeks for infrastructure delivery should be considered an “anti-pattern,” since the cumulative waiting time over the course of the project would be considerable. Building internal resource delivery processes with slow delivery times goes against the concepts described in works such as “The Phoenix Project” and “The Goal.”
- R&D Sandbox
Developers need a way to create their own “environment” of components that represent the target system. See Gene Kim, “DevOps for High Performing Enterprises.” Providing “representative” environments, instead of mock environments running on local workstations or laptops, makes it possible to reliably prove out many of the architectural assumptions being made in the project.
- Classical QA Automation testing
Instead of sharing a limited number of environments that commonly develop “configuration drift” from the current versions of components, QA should be able to easily create ‘n’ QA environments: “QA-1”, “QA-2”, “QA-3”, “QA-n.” Leveraging cloud representations of the target system allows QA to completely destroy and rebuild the correct target environment from scratch within minutes or hours. No more scripting to back out or reset test data. No more “reset” scripts to return configurations to a starting state. The QA environment is completely ephemeral (short-lived) and may only exist for the duration of the test run. If tests fail, the entire environment is “saved” as a complete snapshot, aka ‘Template,’ and attached to the defect report to be reconstituted by ENG when diagnosing the problem. For the next test run, a completely new environment is generated from the Template and is separate from the environments used in past test runs that may contain defects.
- Integration testing
Traditional Enterprise thinking historically creates a limited number of Integration test environments shared among many groups. Because of this, the Integration environment is often broken, misconfigured, out of date, stale, or unusable in some way. Environment “drift” becomes a barrier to doing regular testing.
Applying cloud thinking to the building of on-prem systems allows different Integration testing approaches to be used in an economical and efficient manner. The cloud can create multiple Integration environments that are all “identical” based on the current target goal. R&D and ENG subgroups can each have a dedicated Integration environment that combines work from multiple squads of the same discipline, without colliding with other system components. For example, all the teams working on database changes can first integrate their work in a localized Integration testing environment. Once that succeeds, the combined modifications move to the higher level where all system components are being combined.
- Chaos Engineering, “What if we take server X offline?”
“Chaos Engineering for Traditional Applications” documents the justification and use cases where cloud-native Chaos Engineering theory can be applied to traditional applications running on-prem. Server consolidation projects have a clear need to apply “Chaos Testing” to the intermediate release candidates being built. Combining multiple systems into a smaller number raises a new level of “What if XYZ happens?” questions. Some categories of problem areas that need “random failure testing” would include:
– Low memory
– Not enough CPU
– Full disk volumes
– Low network bandwidth, high latency
– Hardware failures like a failed disk drive, failed server, disconnected network
And not so obvious ones could be:
– Database/server process down
– Microservice down
– Application code failure
– Expired Certificate(s)
And even less obvious:
– Is there sufficient monitoring, and have alarms been validated?
Each category of items requires the execution of multiple “experiments” to understand how the overall system reacts when a chaotic event is introduced into the system. The cloud can then be used to recover the system back to a stable state. Quick environment recovery allows for the execution of multiple, potentially destructive experiments.
- Disaster Recovery
Creating a consolidated server solution creates a whole new application architecture that did not exist before. This raises the question of how Disaster Recovery (DR) will be implemented for the new system. The DR requirements for a consolidated system are higher than those of the individual, disparate systems that existed before consolidation. The consolidated system creates an “all or nothing” approach to DR, since the DR mechanism now holds all of the previously individual components as a single unit of failure. Before consolidation, one of the individual components could fail without impacting the others. After consolidation, all components are in a single unit of failure, so a DR event may cause more “ripples in the pond” than before.
But once again, a cloud-based thinking model allows for experimentation and trial and error during the design of the DR implementation. Viable approaches can be tested in production-like mock environments running in the cloud, which stand in for the traditional on-prem systems that will eventually replace them. The cloud becomes the DR “sandbox,” so the right approach can be validated without purchasing non-disposable fixed assets.
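The destroy-and-rebuild pattern that runs through the use cases above (QA resets, chaos experiments, DR trials) can be sketched as a loop that takes a fresh clone per experiment, injects one fault, and records which alarms fire. Everything in this example is hypothetical: the `BASELINE` state, the fault functions, and the deliberately incomplete `alarms` check exist only to show how an experiment can expose a monitoring gap.

```python
import copy

# Hypothetical known-good state captured in a template.
BASELINE = {"db_process": "up", "disk_free_pct": 40, "cert_valid": True}


def stop_database(env):
    env["db_process"] = "down"


def fill_disk(env):
    env["disk_free_pct"] = 0


def expire_certificate(env):
    env["cert_valid"] = False


def alarms(env):
    """Toy monitoring: it checks the database and disk, but not certificates."""
    fired = []
    if env["db_process"] != "up":
        fired.append("db_down")
    if env["disk_free_pct"] < 5:
        fired.append("disk_full")
    return fired


def run_experiments(template, faults):
    """Run each fault against its own fresh clone and collect the alarms raised."""
    results = {}
    for fault in faults:
        env = copy.deepcopy(template)   # fresh environment per experiment
        fault(env)                      # potentially destructive change
        results[fault.__name__] = alarms(env)
        # The damaged clone is simply discarded; the template is untouched.
    return results


results = run_experiments(BASELINE, [stop_database, fill_disk, expire_certificate])
```

In this toy run the expired certificate triggers no alarm at all, which is the kind of finding the “have alarms been validated?” question is meant to surface, and the baseline remains unchanged for the next round.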
This document outlines an approach that can be taken to provide a “safe path” for server consolidation projects. The temporary use of the cloud as a placeholder for eventual on-prem resources allows for experimentation in the design, greater amounts of QA testing, and advanced testing concepts like those described as “Chaos Engineering.” The outlined approach “unblocks” traditional Enterprise construction models by giving “Production-Like” environments to all groups that need them (Gene Kim), and eliminates the constraint of environment resources as described in “The Goal.” The cloud-to-on-prem design model described here is the solution to the anti-patterns historically created on-prem, where Agile sprint teams and squads wait “weeks or months” for needed environments.