Rules for Developing Safety-Critical Code (NASA) vs. Needs for Commercial or Government Business Systems

The rules for building safety-critical systems are very different from the rules for building enterprise business systems - even "high value" systems whose success or failure has a significant economic impact.  When a bank or fee-collection agency builds software, it commonly arrives at a target level of product quality through an economic trade-off: the cost to build the system versus the cost of mistakes in its transaction processing.  When NASA builds a system, there is a risk to human life for the crew and anyone living under the flight path - hence "safety critical" or "cost of human life" guidelines come into play.
As a result, the software guidelines for a business system and those for a flight system are very different.  The general difference is the level of rigor applied to the software development and, consequently, the cost to deliver a particular piece of functionality.  The quality assurance on a display that can be verified by an ad hoc business report is different from the quality assurance required for a real-time display used during a launch, where seconds separate success from fatal disaster.
Gerard Holzmann at the NASA/JPL Laboratory for Reliable Software wrote an article in the June 2006 issue of IEEE's Computer magazine that gave a very nice introduction to a few rules for developing safety-critical code.  Below is a summary of his points with my annotations:
Rule 1 - Restrict all code to the simplest possible control flow constructs.  This rule governs the use of goto, do-while vs. while-do, if/if-then/if-then-else, case, and other constructs; recursion and similar constructs are forbidden.  Benefits include making the code easier to understand, safer to modify, and easier to verify.  Another common benefit is preventing run-time data structures that hostile code (viruses et al.) can exploit.  The trade-off is more verbose code, which can be more costly to author (there are literally more lines to type) and can create run-time components that require more computing resources (memory, disk, etc.).
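As a sketch of what Rule 1 implies in practice (my example, not Holzmann's), a function like factorial, which a business developer might write recursively, is rewritten as a plain bounded loop:

```c
#include <assert.h>

/* Illustrative only: factorial written iteratively rather than
 * recursively, keeping control flow to a simple loop that a static
 * analyzer can follow - no recursion, no goto. */
unsigned long factorial(unsigned int n)
{
    unsigned long result = 1;
    unsigned int i;
    for (i = 2; i <= n; i++) {  /* simple, analyzable control flow */
        result *= i;
    }
    return result;
}
```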
Rule 2 - Give all loops a fixed upper bound.  In its most restrictive form, this means a loop's upper bound cannot be defined at run-time - e.g., based upon user input, data from a database, information from the processing stream, etc.  This assures that the system has been designed, coded, tested, and verified for specific conditions and that it will not operate in a state, or upon data, for which it is unproven.  A consequence of this rule is that changes in scope that are relatively common in a business processing environment (e.g. "...we are now processing from 9 service centers instead of 5...") cannot be implemented by changing a run-time setting but instead intentionally force a re-design/review, re-build, and re-verify.  Even more common logic (e.g. "...for each transaction in the submission...") is not allowed - completely impractical for business applications.
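A minimal sketch of a Rule 2-style loop (the names and the bound of 100 are mine, chosen for illustration): the bound is fixed at compile time, and an assertion catches any caller that tries to exceed it.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_ITEMS 100  /* compile-time bound (hypothetical value) */

/* Sum at most MAX_ITEMS entries.  The loop can be statically proven
 * to terminate within MAX_ITEMS iterations regardless of run-time
 * input; the assertion flags callers that violate the bound. */
long sum_items(const int *items, size_t count)
{
    long total = 0;
    assert(count <= MAX_ITEMS);
    for (size_t i = 0; i < count && i < MAX_ITEMS; i++) {
        total += items[i];
    }
    return total;
}
```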
Rule 3 - Do not use dynamic memory allocation after initialization.  The basis for this rule is that memory allocation (malloc, new, garbage collection, etc.) is unpredictable, and allowing dynamic allocation at run-time permits the system to have unpredictable throughput and availability characteristics.  One impact of this rule is that it forces a design and implementation strategy that is completely unfamiliar to most business systems developers.  While a fixed memory model prevents the application from consuming memory unchecked, it also prevents the application from releasing unused memory back to the run-time environment so that it can be used by other applications.
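One common way to satisfy Rule 3 is a fixed-size pool reserved at initialization; the sketch below is my own illustration of the idea (the pool size and names are hypothetical), with no malloc or free anywhere after startup.

```c
#include <stddef.h>

#define POOL_SLOTS 8  /* fixed capacity, chosen at design time */

/* All storage is allocated statically, before the program runs;
 * memory behavior is therefore fully predictable. */
static double pool[POOL_SLOTS];
static size_t next_slot = 0;

/* Hands out a pre-allocated slot, or NULL when the pool is
 * exhausted - the pool never grows at run-time. */
double *pool_acquire(void)
{
    if (next_slot >= POOL_SLOTS) {
        return NULL;  /* exhaustion is an explicit, testable state */
    }
    return &pool[next_slot++];
}
```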
Rule 4 - No function shall be longer than what can be printed on a single page or viewed completely on a computer display.  The rationale is to ensure that all of the logic in the system can be read and understood properly during reviews.  This is desirable in business systems as well, but the difference comes in how strictly the rule is applied and enforced - safety-critical systems enforce a "no function shall..." version of this rule, while business systems apply a "most functions should..." version.
Rule 5 - The code's assertion density should average a minimum of two assertions per function.  The use of true-false, side-effect-free assertions to check for anomalies in logic or data has a demonstrated impact on reducing the defect density of large bodies of code.  These assertions should be purely defensive and make no other contribution to the program logic, so that they do not require their own test/verify effort and can be turned off in performance-critical components.  Note the word "minimum" in this rule - some functions will have 5, 10, 15, or more assertions.  The impact on the development effort is the extra cost and time to write these assertions - both of which may be inappropriate for applications with lower quality requirements that are adequately served by a much lower assertion rate.  Business applications have a different need for assertion density because they have a different tolerance for defect density.
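A sketch of what two-assertions-per-function looks like (the function and its bounds are my invention, for illustration): one assertion guards the inputs, one sanity-checks the output, and neither contributes to the program logic.

```c
#include <assert.h>

/* Defensive, side-effect-free assertions: a precondition on the
 * inputs and a postcondition on the result.  Compiling with -DNDEBUG
 * removes them without changing program behavior. */
int scale_percentage(int value, int percent)
{
    int result;
    assert(value >= 0 && percent >= 0 && percent <= 100);
    result = (value * percent) / 100;
    assert(result <= value);  /* scaling down can never enlarge */
    return result;
}
```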
Rule 6 - Declare all data objects at the smallest possible scope.  This goes far beyond the "don't use global variables" rule.  One benefit is preventing data from being in scope where it could be inadvertently modified; another is narrowing the code that must be examined when test cases fail.  One impact is that it can require a design and coding style that is unfamiliar to most developers.  It will also typically increase the computing resources (memory, CPU, disk, etc.) required at run-time.
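As a small illustration of Rule 6 (my example): the accumulator lives only inside the function and the loop index only inside the loop, so neither can be touched - or even seen - by any other code.

```c
/* Every data object sits at the narrowest scope that works: no file-
 * scope state, and the loop index is confined to the loop itself. */
int count_positive(const int *data, int n)
{
    int count = 0;                 /* function scope only */
    for (int i = 0; i < n; i++) {  /* i visible only inside the loop */
        if (data[i] > 0) {
            count++;
        }
    }
    return count;
}
```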
Rule 7 - Each calling and called function must check return values.  This rule imposes a double check: every function checks the values it returns, and every calling function checks the return values of the functions it calls.  It prohibits, for example, calling any function (even a printf or cout) without checking the return value, and it requires even functions that return void (C++, Java, C#, etc.) to be double-checked.  This often means that standard function libraries normally reused as-is in business applications have to be put within a "wrapper" of safety-critical code to ensure these rules are enforced.  This practice is unfamiliar to most developers, it increases the development and O&M effort, and it also increases the processing requirements for the system being built.
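A hypothetical wrapper in the spirit of Rule 7 (the name log_message is mine): even printf's return value is inspected, and a failure is propagated as an error code rather than silently dropped.

```c
#include <stdio.h>

/* Wraps a standard-library call so that its return value is always
 * checked; callers in turn must check the code this wrapper returns. */
int log_message(const char *msg)
{
    int written = printf("%s\n", msg);
    if (written < 0) {
        return -1;  /* propagate the I/O failure to the caller */
    }
    return 0;
}
```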
Rule 8 - The use of tools outside the compiler (e.g. a pre-processor) must be confined to aggregation of the code body.  This bars the use of the pre-processor (C/C++) for anything other than assembling the body of code to be compiled.  It also prohibits inserting into the code body any data or logic from configuration management systems or build scripts - commonly done to insert version information into software products.  The rationale for this rule is to ensure that all of the program logic and data are subject to the intended scrutiny of the compiler, profiler, unit tests, etc.  It also helps prevent code that is difficult to understand, review, and test, and it prevents an explosive (n!) number of possible variations in the application configuration.  An impact is that this imposes build and configuration management practices unfamiliar to many developers, and it often requires procedures that are at odds with commonly used features in CM/build tools.
Rule 9 - Restrict or forbid the use of pointers.  Pointers are often misused, even by experienced programmers.  They can make it difficult to follow the program logic and data flow, and they may limit the ability of automated tools (static or run-time) to analyze and verify the code.  This has less impact today in that newer languages (Java, C#, etc.) do not expose raw pointers.  For languages that do (C, C++), the impact will be changes to design and coding practices, which may be difficult for long-time developers in those languages.
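For C, "restrict" typically means at most one level of dereference and no pointer arithmetic; the sketch below (my illustration) uses plain array indexing instead of walking a pointer.

```c
/* Pointer use kept minimal: a single const pointer parameter, read
 * only through array indexing - no *(values + i), no pointer-to-
 * pointer, no function pointers. */
int max_of(const int *values, int n)
{
    int best = values[0];  /* caller guarantees n >= 1 */
    for (int i = 1; i < n; i++) {
        if (values[i] > best) {
            best = values[i];
        }
    }
    return best;
}
```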
Rule 10 - All code must be compiled from the first day, and every day, with all automated reviews (compiler, unit tests, etc.) applied at the greatest severity, and all warnings/errors reviewed.  This rule often comes with an associated requirement that all code pass every static analysis rule without warning at least once per day.  The rationale is one of "test first" and "early discovery".  With static and run-time analyzers increasingly built into development tools, this practice is increasingly common even in applications that are not safety critical.  Again, however, the impact is one of degree - a typical business application team may appropriately be more lenient about this rule, where a safety-critical development team would enforce it strictly.  The impact of enforcement shows up primarily in schedule and cost - it simply takes more time and effort to resolve every error or warning so frequently.
- Brian
