The main criteria which determine whether an application or solution is suitable for IT Production use are as follows:
- Reliability and Stability
- Backup and Recovery
- Monitoring and Management
Each of these buzzwords has its own precise definition. When assessing a software application or system, make sure that you have these terms properly defined.
The following notes give an outline of what questions to ask, in order to create an environment which meets stringent Production suitability criteria.
Creating an application or system with this level of reliability, security etc. is a significant co-operative task which must, by definition, involve all streams of IT - Development, IT Production, Security, Service Management, Project Control, and Processes. It cannot really be put in place without input from all these teams. Particularly in the case of modern Service-Oriented Architectures, it is essential that software development becomes more collaborative.
Production Suitability requires a change of approach - a change of culture - you need to "think Production".
Too often, the responsibility for a solution is left at the door of the development team. This is understandable - after all, the development team are responsible for the key Business Functionality which differentiates the project from all others which went before it. However, development teams, unless they are particularly enlightened, are often not the right people to ensure the production worthiness of the final solution.
In practice, companies can sometimes have "Chinese walls" between the different departments responsible for each of these functions. The only way to deal with this sort of xenophobia is to ensure that a Project has adequate sponsorship from top management, who can ensure that all the necessary tasks are addressed at the appropriate time in the Project life cycle.
Is your solution Scalable - having been tested with 1,000 page hits, what happens when a million pages have to be served up ?
Scalability is not, unfortunately, simply a question of "throw hardware at it". To do so makes the implicit assumption that the performance capability of the application is proportional to the hardware, be it CPU, memory, I/O capacity etc. This is simply not true. Consider a simple example: suppose a database used a "process_id" counter, internally, to keep track of each user thread it was responsible for. This counter was, at one time, a signed 2-byte (16-bit) integer. Net result: it was physically impossible to have more than 32,768 users on the system at any one time - irrespective of the amount of hardware.
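The arithmetic behind this kind of hard limit is easy to check. A ceiling of 32,768 is what a signed 16-bit (2-byte) counter gives you: the non-negative values 0 to 32,767. The sketch below is illustrative only; the function names are hypothetical, and it simply uses Python's struct module to show that a value one past the limit cannot be stored in such a field.

```python
import struct


def id_capacity(counter_bytes: int) -> int:
    """Number of distinct non-negative IDs in a signed counter of this width."""
    return 2 ** (counter_bytes * 8 - 1)


def fits_in_counter(process_id: int) -> bool:
    """True if the ID can be stored in a signed 16-bit field."""
    try:
        struct.pack("<h", process_id)  # "<h" = little-endian signed 16-bit
        return True
    except struct.error:
        # Raised when the value is outside -32768..32767
        return False


if __name__ == "__main__":
    print(id_capacity(2))          # capacity of a 2-byte signed counter
    print(fits_in_counter(32767))  # the last usable ID
    print(fits_in_counter(32768))  # one user too many
```

No amount of extra CPU or memory changes the result of the last call; only a wider counter does.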
Such design/physical limitations are everywhere. If a software supplier tells you that their package has "unlimited" scalability - do not believe them !
You need to examine your hardware and software combinations, to look for such limitations. This should be as much part of the Quality Control of your software, as the review of the Business Logic. In addition, some companies take the trouble to bench-test systems with thousands of pseudo-users, using a Scalability Test Harness system, such as those available from Mercury Interactive.
Reliability and Stability
Reliability could be defined as "the ability to take knocks without falling over". As with children, there needs to be a level of maturity before this happens.
Reliability is not just a hardware, or even a software issue. It refers to the entire environment. Whilst we are used to the idea of testing a program, in order to see how it performs, we often forget to test it in context. For example, the software may be capable of grabbing additional memory "as required", but to what extent is that memory released when the demand is reduced? The classic "memory leak" is one reason why systems often have to be restarted periodically.
How Stable is it ? Or does it require on-line support at 2:00 am ?
Systems are often designed on the assumption that all components are available. For example, I once came across a message-based system for an international company. It required dozens of Servers, all in different parts of the world, to transfer messages. Not a problem, until you realised that all messages were delivered synchronously. If just one server failed, anywhere in the world, all messages would "hang". A reliable store-and-forward approach would have been far more stable.
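As a rough illustration of the store-and-forward idea, the sketch below accepts messages locally and forwards them only when the remote end is reachable. Everything here is an assumption for illustration: the class name, the deliver callable, and the use of ConnectionError to signal an unreachable server. The store is an in-memory queue; a real system would persist it to disk so that messages survive a restart.

```python
from collections import deque
from typing import Callable, Deque


class StoreAndForwardSender:
    """Queue messages locally; forward them when the remote server is up."""

    def __init__(self, deliver: Callable[[str], None]) -> None:
        self._deliver = deliver  # hypothetical transport function
        self._pending: Deque[str] = deque()

    def send(self, message: str) -> None:
        """Accept the message immediately; never waits on the remote server."""
        self._pending.append(message)

    def flush(self) -> int:
        """Forward queued messages in order; stop at the first failure.

        Returns the number of messages delivered in this attempt.
        """
        delivered = 0
        while self._pending:
            try:
                self._deliver(self._pending[0])
            except ConnectionError:
                break  # remote is down: keep the message and retry later
            self._pending.popleft()
            delivered += 1
        return delivered

    @property
    def pending(self) -> int:
        """Number of messages still waiting to be forwarded."""
        return len(self._pending)
```

The caller of send() is never blocked by a failed server elsewhere in the world; the cost is that delivery becomes eventual rather than immediate.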
Sit down with your operational teams, and dream up "war stories" - what would happen if ...? Then go back and review your system, top-down, to resolve these issues.
Is it Resilient ? This system may need to be available 24 X 7. Easy to do with hardware, but what about the network, and other components ? Are the database and applications "cluster-aware" ?
It is sometimes (wrongly) assumed that introducing a clustering technology (such as Microsoft Cluster Server) will automatically solve any reliability problems. But Operating System failover, alone, will not guarantee reliability. The Application must be capable of detecting a cluster failover - and performing appropriate recovery operations. This is particularly true with .NET or J2EE environments, where "persistence" does not necessarily mean "recovery from known point". Remember also the key importance of having atomic transactions - i.e. they either succeed completely, or fail completely - there should be no "partial success" - even if hardware has crashed.
These days, resilience means Multi-Site presence. In the 24 X 7 world, it is no longer realistic to allow for a "site-outage". Tragically, the events of September 11, 2001, showed the world what could happen if a company centralises its IT in one place. The internet itself has broken new ground in showing how infrastructure can fail, but the overall solution can still run, thanks to multiple POPs, wide-area clustering etc. The technology is there - use IT !
Resilience also comes into play when looking at hardware solutions - dual power supplies, multiple disk arrays (RAID) etc. all ensure that we have multiple ways of delivering the application, with no single points of failure.
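To make "no partial success" concrete, here is a minimal sketch using Python's built-in sqlite3 module. The accounts table, the names, and the transfer function are invented for illustration. The point is that both updates sit inside one transaction, so a failure part-way through rolls back the first update as well.

```python
import sqlite3


def open_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with two hypothetical accounts."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
    with conn:  # commit the initial rows
        conn.executemany(
            "INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 0)]
        )
    return conn


def transfer(conn: sqlite3.Connection, src: str, dst: str, amount: int) -> bool:
    """Move money atomically: both updates happen, or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
            row = conn.execute(
                "SELECT balance FROM accounts WHERE name = ?", (src,)
            ).fetchone()
            if row is None or row[0] < 0:
                raise ValueError("insufficient funds")  # triggers rollback
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
        return True
    except ValueError:
        return False


def balances(conn: sqlite3.Connection) -> dict:
    """Return the current balance of every account."""
    return dict(conn.execute("SELECT name, balance FROM accounts"))
```

After a failed transfer, the source account's debit has been undone; the database never shows money that has left one account without arriving in the other.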
Have the Failover Procedures been tested ?
If you are buying a packaged solution, make sure that the manufacturers are able to prove that it can handle clustering failover. Look for single points of Failure - and plan accordingly.
Backup and Recoverability
Have we considered Backup and Recovery in the design ?
One day, the unlikely will happen, and those backup tapes will be needed. It has often been said that there is no problem with doing system backups - the big issue is recovery. You may moan about the necessity of having a "backup window", but, like the life-jackets in the airliner you fly in, one day it could save your life.
It is also important that you understand what you are backing up, as well as why. For example, backing up the internal files relating to an Exchange Server or Oracle Database may give the impression that the data is secure. However, if the data were to be recovered, it could be useless, since these files are open and being written to at the time of the backup. What is required is that any operations on these files are quiesced first. Then, and only then, can you be confident that the data is in an internally consistent state.
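The principle of pausing writes before copying can be sketched in a few lines. This is a toy model, not how Exchange or Oracle do it: the QuiescableLog class and its methods are hypothetical, and a single lock stands in for whatever quiesce mechanism the real product provides. Writers and the backup share the lock, so the copy is never taken in the middle of a write.

```python
import shutil
import threading
from pathlib import Path


class QuiescableLog:
    """A tiny append-only file whose writes can be paused for a backup."""

    def __init__(self, path: Path) -> None:
        self._path = path
        self._lock = threading.Lock()
        self._file = open(path, "a", encoding="utf-8")

    def append(self, record: str) -> None:
        """Write one record; blocks while a backup is in progress."""
        with self._lock:
            self._file.write(record + "\n")

    def backup_to(self, target: Path) -> None:
        """Quiesce writers, flush buffered data, then copy the file."""
        with self._lock:        # no writer can run until the copy is done
            self._file.flush()  # push buffered records out of the process
            shutil.copyfile(self._path, target)

    def close(self) -> None:
        """Release the underlying file handle."""
        self._file.close()
```

For real products, use the vendor's own backup interface or backup mode rather than copying the files yourself.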
Backup can be a Cinderella operation - sometimes just done for the sake of conscience. Instead, IT professionals should assume that backups will eventually be needed (if only by Internal Auditors to check their figures). You must ensure that you can rebuild your systems from backups.
Put plans in place for periodic full Disaster Recovery testing. Crash the system, and try to rebuild it from backups. You don't want to discover "missing file" syndrome when you do things for real.
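One simple way to catch missing or damaged files during a recovery test is to compare checksums of the original tree against the restored one. The sketch below assumes both trees are reachable as local directories; the function names are hypothetical, and SHA-256 is just one reasonable choice of checksum.

```python
import hashlib
from pathlib import Path
from typing import Dict, List


def manifest(root: Path) -> Dict[str, str]:
    """Map each file's path (relative to root) to its SHA-256 digest."""
    return {
        p.relative_to(root).as_posix(): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def restore_problems(
    original: Dict[str, str], restored: Dict[str, str]
) -> Dict[str, List[str]]:
    """List files that are missing from, or differ in, the restored tree."""
    missing = sorted(set(original) - set(restored))
    corrupt = sorted(
        name
        for name in set(original) & set(restored)
        if original[name] != restored[name]
    )
    return {"missing": missing, "corrupt": corrupt}
```

In practice the manifest of the live system would be captured at backup time and stored with the backup, since the original may no longer exist when you need to compare.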
Has Security been designed in from the start, or was it just a final thought ?
Too often, Security (access rights, passwords etc.) is not even included in the original design of an application. The emphasis is on the business logic - "we'll leave security until later" (!). The results of this approach are well known - applications which have been hard-wired with the "Admin" username and password, databases which are open to anyone etc.
It's not just the databases which need to be secure. In these days of network-centric development, you need to be aware that anyone could be on your network - even the internal one. So if you really want to build an Application using Java remoting, be aware that there is no security written in. Anyone, including that disgruntled contract programmer, is capable of instantiating a remote object class. Given this, how many people use internal encryption for messages between Production Systems ? What sort of network authentication is needed to speak to a Production Application Server ?
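As one small example of authenticating messages between internal systems, the sketch below attaches a keyed hash (HMAC-SHA256) to each message so that the receiver can reject anything not produced by a holder of the shared key. The function names are hypothetical, and how the key is distributed and stored is assumed to be solved elsewhere. Note that this authenticates the message but does not encrypt it; confidentiality needs a separate mechanism such as TLS.

```python
import hashlib
import hmac


def sign(key: bytes, message: bytes) -> bytes:
    """Compute an HMAC-SHA256 tag for the message."""
    return hmac.new(key, message, hashlib.sha256).digest()


def verify(key: bytes, message: bytes, tag: bytes) -> bool:
    """Check the tag using a constant-time comparison."""
    return hmac.compare_digest(sign(key, message), tag)
```

A receiver would call verify() on every incoming message and drop any that fail, whether they were tampered with in transit or sent by someone without the key.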
Security involves a lot more than just denying access to the "root" password.
Any design intended for Production deployment should be approved by the Security team in your company, before the design is translated to code. This may add time, and effort, to your application. But we have all heard the war stories of companies who have lost their livelihood as a result of security breaches.
Monitoring and Maintenance
Monitoring needs to cover two main areas: (1) the ability to raise alerts if the system behaves outside its normal expected window, and (2) the ability to extract trend data to find out what the application is actually doing (number of transactions etc.). These two purposes are often confused. Make sure that both are supported.
Maintenance, on the other hand, involves the provision of suitable automated tools to configure and manage the system.
Ensure that your proposed solution supports standard monitoring interfaces to enable alerts to be created. Ensure that time-series trend data can be extracted, and fed into your capacity plan. Understand the manual cost of configuring and maintaining the system.
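The two monitoring purposes described above can be kept apart even in a very small design. In this sketch (class name, fields, and thresholds are all hypothetical), record() answers the alerting question for each sample as it arrives, while trend() summarises the stored samples into time buckets suitable for feeding a capacity plan.

```python
from dataclasses import dataclass, field
from statistics import mean
from typing import Dict, List, Tuple


@dataclass
class Monitor:
    low: float   # lower edge of the normal expected window
    high: float  # upper edge of the normal expected window
    samples: List[Tuple[int, float]] = field(default_factory=list)

    def record(self, timestamp: int, value: float) -> bool:
        """Store the sample; return True if an alert should be raised."""
        self.samples.append((timestamp, value))
        return not (self.low <= value <= self.high)

    def trend(self, bucket_seconds: int) -> List[Tuple[int, float]]:
        """Average the samples per time bucket, ordered by bucket start."""
        buckets: Dict[int, List[float]] = {}
        for timestamp, value in self.samples:
            start = timestamp - timestamp % bucket_seconds
            buckets.setdefault(start, []).append(value)
        return [(start, mean(values)) for start, values in sorted(buckets.items())]
```

An alert is about one moment; a trend is about many. A sample that never triggers an alert still belongs in the trend data.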
Will this solution really deliver the required Return on Investment when Production Costs are taken into account ?
The cost of an Application is hard to measure. This is one pragmatic reason why we rarely hear about "real" ROI figures from Applications. Partly, of course, the problem is that the world has moved on since the Application was designed. Instead of delivering improved performance to existing business practice, the application enables new business practices to evolve.
But don't let the non-existence of a real ROI blind you to the real costs of an Application. It is sometimes assumed that once an App. goes into production, it falls into the "black hole" of support, where all applications are treated the same, and have the same costs associated with them. This is simply not the case. Having worked in IT for 20 years, I have often met "demon" applications - the ones which themselves behave like a black hole to Production. Studies have shown that the cost of supporting systems can be multiple times the cost of the original development.
Don't just include hardware and development time in the ROI for your application. Ensure that you factor in the Production Support as well. That way, you may find yourself taking very different decisions.
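A short calculation shows how much the picture can change. All figures below are invented purely for illustration, and ROI is taken here in its simplest form, (benefit - cost) / cost.

```python
def roi(benefit: float, total_cost: float) -> float:
    """Simple return on investment: net gain divided by cost."""
    return (benefit - total_cost) / total_cost


# Hypothetical figures, in thousands, over a three-year life.
BENEFIT = 600.0
DEVELOPMENT = 200.0
HARDWARE = 100.0
SUPPORT_PER_YEAR = 100.0
YEARS = 3

# The optimistic view: hardware and development time only.
build_only_cost = DEVELOPMENT + HARDWARE
# The fuller view: add Production Support for the life of the system.
full_cost = build_only_cost + SUPPORT_PER_YEAR * YEARS

roi_without_support = roi(BENEFIT, build_only_cost)
roi_with_support = roi(BENEFIT, full_cost)

if __name__ == "__main__":
    print(f"ROI ignoring support:  {roi_without_support:.0%}")
    print(f"ROI including support: {roi_with_support:.0%}")
```

With these made-up numbers, a project that appears to double its money only breaks even once three years of support are counted.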