Project-Specific Coding Guidelines
I’m about to embark on the development of a distributed, scalable, data-centric, real-time sensor system. Since some technical risks are at this point disturbingly high, especially meeting CPU loading and latency requirements, a team of three of us (two software engineers and one system engineer) is going to prototype several CPU-intensive system functions.
Assuming that our prototyping effort proves that our proposed application layer functional architecture is feasible, a much larger team will be applied to the effort in an attempt to meet schedule. In order to promote consistency across the code base, facilitate on-boarding of new team members, and lower long term maintenance costs, I’m proposing the following 15 design and C++ coding guidelines:
- Minimize macro usage.
- Use STL containers instead of homegrown ones.
- No unessential 3rd party libraries, with the exception of Boost.
- Strive for a clear and intuitive namespace-to-directory mapping.
- Use a consistent, uniform communication scheme between application processes. Deviations must be justified.
- Use the same threading library (Boost) when multi-threading is needed within a process.
- Design a little, code a little, unit test a little, integrate a little, document a little. Repeat.
- Avoid casting. When casting is unavoidable, use the C++ cast operators so that the cast hacks stick out like a sore thumb.
- No naked pointers when heap memory is required. Use C++ auto and Boost smart pointers.
- Strive for pure layering. Document all non-adjacent layer interface breaches and wrap all forays into OS-specific functionality.
- Strive for < 100 lines of code per member function.
- Strive for < 4 nested if-then-else code sections and inheritance tree depths.
- Minimize “using” directives; liberally employ “using” declarations to keep verbosity low.
- Run the code through a static code analyzer frequently.
- Strive for zero compiler warnings.
Notice that the list is short, (maybe) memorize-able, and (maybe) enforceable. It’s an attempt to avoid imposing a 100 page tome of rules that nobody will read, let alone use or enforce. What do you think? What is YOUR list of 15?
Iteration Frequency
Obviously, even the most iterative agile development process marches forward in time. Thus, at least to a small extent, every process exhibits some properties of the classic and much maligned waterfall metaphor. On big projects, the schedule is necessarily partitioned into sequential time phases. The figure below (Reqs. = requirements, Freq. = frequency, HW = hardware, SW = software) attempts to model both forward progress and iteration as a function of time.
If the phases are designed to be internally cohesive and externally loosely coupled, and the project is managed correctly, the frequency of iteration triggered by the natural process of human learning and fixing mistakes is:
- high within a given project phase
- low between adjacent project phases
- very low between non-adjacent project phases
Of course, if managed stupidly by explicitly or implicitly “disallowing” any iterative learning loops in order to meet equally stupid and “aggressive” schedules handed down from the heavens, errors and mistakes will accumulate and weave themselves undetected into the product fabric – until the customer starts using the contraption. D’oh!
The Loop Of Woe
When a “side view” of a distributed software architecture is communicated, it’s sometimes presented with a specific instantiation of something like this four layer drawing, where COTS = Commercial Off The Shelf and FOSS = Free Open Source Software:
I think that neglecting the artifacts that capture the thinking and rationale in the more abstract higher layers of the stack is a recipe for high downstream maintenance costs, competitive disadvantage, and all around stakeholder damage. For “big” systems, trying to find/fix bugs, or determining where new feature source code must be inserted among hundreds of thousands of lines of code, is a huge cost sink when a coherent full stack of artifacts is not available to steer the hunt. The artifacts don’t have to be high ceremony, heavyweight boat anchors; they just have to be useful. Simple, but not simplistic.
For safety-critical systems, besides being a boon to maintenance, another increasingly important reason for treating the upper layers with respect is certification. All certification agencies require an auditable and scrutably connected path from requirements down through the source code. The classic end run around the certification obstacle when the content of the upper layers is non-existent or resembles Swiss cheese is to get the system classified as “advisory”.
Frenetic, clock-watching managers and illiterate software developers are the obvious culprits of upper layer neglect but, ironically, the biggest contributors to undependable and uncertifiable systems are customers themselves. By consistently selecting the lowest bidder during acquisition, customers unconsciously encourage corner-cutting and apathy towards safety.
Got any ideas for breaking the loop of woe? I wish I did, but I don’t.
Background Daemons
Assume that you have to build a distributed real-time system where your continuously communicating application components must run 24×7 on multiple processor nodes netted together over a local area network. In order to save development time and shield your programmers from the arcane coding details of setting up and monitoring many inter-component communication channels, you decide to investigate pre-written communication packages for inclusion into your product. After all, why would you want your programmers wasting company dollars developing non-application layer software that experts with decades of battle-hardened experience have already created?
Now, assume that the figure below represents a two node portion of your many-node product where a distributed architecture middleware package has been linked (statically or dynamically) into each of your application components. By distributed architecture, I mean that the middleware doesn’t require any single point-of-failure daemons running in the background on each node.
Next, assume that to increase reliability, your system design requires an application layer health monitor component running on each node as in the figure below. Since there are no daemons in the middleware architecture that can crash a node even when all the application components on that node are running flawlessly, the overall system architecture is more reliable than a daemon-based one; dontcha think? In both distributed and daemon-based architectures, a single application process crash may or may not bring down the system; the effect of failure is application-specific and not related to the middleware architecture.
The two figures below represent a daemon-based alternative to the truly distributed design previously discussed. Note the added latency in the communication path between nodes introduced by the required insertion of two daemons between the application layer communication endpoints. Also note in the second figure that each “Node Health Monitor” now has to include “daemon aware” functionality that monitors daemon state in addition to the co-resident application components. All other things being equal (which they rarely are), which architecture would you choose for a system with high availability and low latency requirements? Can you see any benefits of choosing the daemon-based middleware package over a truly distributed alternative?
The most reliable part in a system is the one that is not there – because it isn’t needed.
Dependable Mission Critical Software
In this post, Embedded.com – Software for dependable systems, Jack Ganssle introduced me to the book: Software for Dependable Systems–Sufficient Evidence?. It was written by the “Committee on Certifiably Dependable Software Systems” and it’s available as a free PDF download.
Despite being written by a committee (blech!), and despite the bland title (yawn), I agree with Jack in that it’s a riveting geek read. It’s understandable to field-hardened practitioners and it’s filled with streetwise wisdom about building dependability into large, mission critical software systems that can kill people or cause massive financial loss if they collapse under stress. Essentially, it says that all the bloated, costly, high-falutin safety and security and certification processes in existence today don’t guarantee squat – except jobs for self-important bureaucrats and wanna-be-engineers. They don’t say it THAT way of course, but that’s my warped and unprofessional interpretation of their message.
Here are a few gems from the 149-page PDF:
As is well known to software engineers, by far the largest class of problems arises from errors made in eliciting, recording, and analysis of requirements.
Undependable software suffers from an absence of a coherent and well articulated conceptual model.
Today’s certification regimes and consensus standards have a mixed record. Some are largely ineffective, and some are counterproductive. (<- This one is mind blowing to me)
The goal of certifiably dependable software cannot be achieved by mandating particular processes and approaches regardless of their effectiveness in certain situations.
In addition to lampooning the “way things are currently done” for certifying software-centric dependability, the committee dudes actually make some recommendations for improving the so-called state of the art. Stunningly, they don’t prescribe yet another costly, heavyweight process of dubious effectiveness. They recommend any process composed of best practices, as long as there is scrutable connectivity from phase to phase and from start to end to “preserve the chain of evidence” for a claim of dependability that vendors of such software should be required to make. Where there is a gap between links in the chain of scrutability, they recommend rigorous analysis to fill it.
To make the transition to the new mindset of scrutable connectivity, they say that these skills, which are rare today and difficult to acquire, will be required in the future:
- True systems thinking (not just specialized, localized, algorithmic thinking that’s erroneously praised as systems thinking by corpocracies) of the properties of the system as a whole and the interactions among its components.
- The art of simplifying complex concepts, which is difficult to appreciate since the awareness of the need for simplification usually only comes (if it DOES come at all) with bitter experience and the humility gained from years of practice.
Drum roll please, because my absolute favorite entry in the book, which tugs at my heart, is as follows:
To achieve high levels of dependability in the foreseeable future, striving for simplicity is likely to be by far the most cost-effective of all interventions. Simplicity is not easy or cheap but its rewards far outweigh its costs.
That passage resonates deeply with me because, even though I’m not good at it, that’s what my primary professional goal has been for 20+ years. Clueless companies that put complexifying and obfuscating experts that nobody can understand up on a pedestal deserve what they get:
- incomprehensible, unmaintainable, and undependable products
- a disconnected and apathetic workforce
- low (if any) profit margins.
As my Irish friend would say, they are all fecked up. They’re innocent and ignorant, but still fecked up.
My Velocity
The figure below shows some source code level metrics that I collected on my last C++ programming project. I only collected them because the process was low ceremony, simple, and unobtrusive. I ran the source code tree through an easy to use metrics tool on a daily basis. The plots in the figure show the sequential growth in:
- The number of Source Lines Of Code (SLOC)
- The number of classes
- The number of class methods (functions)
- The number of source code files
So Whoopee. I kept track of metrics during the 60 day construction phase of this project. The question is: “How can a graph like this help me improve my personal software development process?”.
The slope of the SLOC curve, which measured my velocity throughout the project, doesn’t tell me anything my intuition can’t deduce. For the first 30 days, my velocity was relatively constant as I coded, unit tested, and integrated my way toward the finished program. Whoopee. During the last 30 days, my velocity essentially went to zero as I ran end-to-end system tests (which were designed and documented before the construction phase, BTW) and refactored my way to the end game. Whoopee. Did I need a plot to tell me this?
I’ll assert that the pattern in the plot will be unspectacularly similar for each project I undertake in the future. Depending on the nature/complexity/size of the application functionality that will need to be implemented, only the “tilt” and the time length will be different. Nevertheless, I can foresee a historical collection of these graphs being used to predict better future cost estimates, but not being used much to help me improve my personal “process”.
What’s not represented in the graph is a metric that captures the first 60 days of problem analysis and high level design effort that I did during the front end. OMG! Did I use the dreaded waterfall methodology? Shame on me.
iSpeed OPOS
A couple of years ago, I designed a “big” system development process blandly called MPDP2 = Modified Product Development Process version 2. It’s version 2 because I screwed up version 1 badly. Privately, I named it iSpeed to signify both quality (the Apple-esque “i”) and speed but didn’t promote it as such because it didn’t sound nerdy enough. Plus, I was too chicken to introduce the moniker into a conservative engineering culture that innocently but surely suppresses individuality.
One of the MPDP2 activities, which stretches across and runs in parallel to the time sequenced development phases, is called OPOS = Ongoing Planning, Ongoing Steering. The figure below shows the OPOS activity gaz-intaz and gaz-outaz.
In the iSpeed process, the top priority of the project leader (no self-serving BMs allowed) is to buffer and shield the engineering team from external demands and distractions. Other lower priority OPOS tasks are to periodically “sample the value stream”, assess the project state, steer progress, and provide helpful actions to the multi-disciplined product development team. What do you think? Good, bad, fugly? Missing something?
The Rise Of The “ilities”
The title of this post should have been “The Rise Of Non-Functional Requirements”, but that sounds so much more gauche than the chosen title.
As software-centric systems get larger and necessarily more complex, they take commensurately more time to develop and build. Making poor up front architectural decisions on how to satisfy the cross-cutting non-functional requirements (scalability, distribute-ability, response-ability (latency), availability, usability, maintainability, evolvability, portability, secure-ability, etc.) imposed on the system is way more costly downstream than making bad up front decisions regarding localized, domain-specific functionality. To exacerbate the problem, the unglamorous “ilities” have been traditionally neglected and they’re typically hard to quantify and measure until the system is almost completely built. Adding fuel to the fire, many of the “ilities” conflict with each other (e.g. latency vs maintainability, usability vs. security). Optimizing one often marginalizes one or more others.
When a failure to meet one or more non-functional requirements is discovered, correcting the mistake(s) can, at best, consume a lot of time and money, and at worst, cause the project to crash and burn (the money’s gone, the time’s gone, and the damn thang don’t work). That’s because the mechanisms and structures used to meet the “ilities” requirements cut globally across the entire system and they’re pervasively weaved into the fabric of the product.
If you’re a software engineer trying to grow past the coding and design patterns phases of your profession, educating yourself on the techniques, methods, and COTS technologies (stay away from homegrown crap – including your own) that effectively tackle the highest priority “ilities” in your product domain and industry should be high on your list of priorities.
Because of the ubiquitous propensity of managers to obsess on short term results and avoid changing their mindsets while simultaneously calling for everyone else to change theirs, it’s highly likely that your employer doesn’t understand and appreciate the far reaching effects of hosing up the “ilities” during the front end design effort (the new age agile crowd doesn’t help very much here either). It’s equally likely that your employer ain’t gonna train you to learn how to confront the growing “ilities” menace.
Incremental Chunked Construction
Assume that the green monster at the top of the figure below represents a stratospheric vision of a pipelined, data-centric, software-intensive system that needs to be developed and maintained over a long lifecycle. By data-centric, I mean that all the connectors, both internal and external, represent 24×7 real-time flows of streaming data – not “client requests” for data or transactional “services”. If the Herculean development is successful, the product will both solve a customer’s problem and make money for the developer org. Solving a problem and making money at the same time – what a concept, eh?
One disciplined way to build the system is what can be called “incremental chunked construction”. The system entities are called “chunks” to reinforce the thought that their granularity is much larger than a fine grained “unit” – which everybody in the agile, enterprise IT, transaction-centric, software systems world seems to be fixated on these days.
Follow the progression in the non-standard, ad-hoc diagram downward to better understand the process of incremental chunked development. It’s not much different than the classic “unit testing and continuous integration” concept. The real difference is in the size, granularity, complexity and automation-ability of the individual chunk and multi-chunk integration test harnesses that need to be co-developed. Often, these harnesses are as large and complex as the product’s chunks and subsystems themselves. Sadly, mostly due to pressure from STSJ management (most of whom have no software background, mysteriously forget repeated past schedule/cost performance shortfalls, and don’t have to get their hands dirty spending months building the contraption themselves), the effort to develop these test support entities is often underestimated as much as, if not more than, the product code. Bummer.
Data-Centric, Transaction-Centric
The market (ka-ching!, ka-ching!) for transaction-centric enterprise IT (Information Technology) systems dwarfs that for real-time, data-centric sensor control systems. Because of this market disparity, the lion’s share of investment dollars is naturally and rightfully allocated to creating new software technologies that facilitate the efficient development of behemoth enterprise IT systems.
I work in an industry that develops and sells distributed, real-time, data-centric sensor systems and it frustrates me to no end when people don’t “see” (or are ignorant of) the difference between the domains. With innocence, and unfortunately, ignorance embedded in their psyche, these people try to jam-fit risky, transaction-centric technologies into data-centric sensor system designs. By risky, I mean an elevated chance of failing to meet scalability, real-time throughput, and latency requirements that are much more stringent for data-centric systems than they are for most transaction-centric systems. Attributes like scalability, latency, capacity, and throughput are usually only measurable after a large investment has been made in the system development. To add salt to the wound, re-architecting a system after the mistake is discovered and (more importantly) acknowledged, delays release and consumes resources by amounts that can seriously damage a company’s long term viability.
As an example, consider the CORBA and DDS OMG standard middleware technologies. CORBA was initially designed (by committee) from scratch to accommodate the development of big, distributed, client-server, transaction-centric systems. Thus, minimizing latency and maximizing throughput were not the major design drivers in its definition and development. DDS was designed to accommodate the development of big, distributed, publisher-subscriber, data-centric systems. It was not designed by a committee of competing vendors each eager to throw their pet features into a fragmented, overly complex quagmire. DDS was derived from the merger of two fielded and proven implementations, one by Thales (distributed naval shipboard sensor control systems) and the other by RTI (distributed robotic sensor and control systems). In contrast, as a well meaning attempt to be all things to all people, publish-subscribe capability was tacked on to the CORBA beast after the fact. Meanwhile, DDS has remained lean and mean. Because of the architecture busting risk for the types of applications DDS targets, no client-server capability has been back-fitted into its design.
Consider the example distributed, data-centric, sensor system application layer design below. If this application sits on top of a DDS middleware layer, there is no “intrusion” of a (single point of failure) CORBA broker into the application layer. Each application layer system component simply boots up, subscribes to the topic (message) streams it needs, starts crunching them with its app-specific algorithms, and publishes its own topic instances (messages) to the other system components that have subscribed to the topic.
Now consider a CORBA broker-based instantiation of this application example (refer to the figure below). Because of the CORBA requirement for each system component to register with an all knowing centralized ORB (Object Request Broker) authority, CORBA “leaks” into the application layer design. After registration, and after each service has found (via post-registration ORB lookup) the other services it needs to subscribe to, the ORB can disappear from the application layer – until a crash occurs and the system gets hosed. DDS avoids leakage into the value-added application layer by avoiding the centralized broker concept and providing for fully distributed, under-the-covers, “auto-discovery” of publishers and subscribers. No DDS application process has to interact with a registrar at startup to find other components – it only has to tell DDS which topics it will publish and which topics it needs to subscribe to. In contrast, each CORBA component has to know not only which topics it needs, but also which specific services it must look up and subscribe to.
The more a system is required to accommodate future growth, the more inapplicable a centralized ORB-based architecture is. Relative to a fully distributed coordination and auto-discovery mechanism that’s transparent to each application component, attempting to jam fit a single, centralized coordinator into a large scale distributed system so that the components can find and interact with each other reduces robustness, fault tolerance, and increases long term maintenance costs.