Feedback: Software Development Paradigm Trap

Dr. Carl Dreher - 5/18/2006

Dear Mr. Bereit,

I found your article very interesting and thought-provoking. I've been thinking a great deal about the one-process-one-processor architecture ever since that article last year in Embedded Systems, which described the development of a custom, one-off solution based on this idea. However, after much thought, I've come to the conclusion it won't help (much).

In my experience (29 years), the bugs that cause the major headaches are very rarely in the individual processes. Good coders write lots of test code that encapsulates processes, subroutines, functions, etc. These can be tested and perfected relatively easily, and if a bug slips through, it is quickly found and fixed. That is one of the main advantages of OO languages. In that sense, we're already at the one-process-one-processor ideal. Where the debugging headaches start is in how the processes interact: the architecture and the inter-process communications. Those types of bugs will not be eliminated by adding processors.

In all the projects I've worked on in the last few years, there were lots of separate, asynchronous processes running, albeit on the same processor. With a good OS and a good OO language, keeping them separate was no problem. Each was tested very thoroughly. As far as each coder knew, his code was running on a separate processor with some magic communication channel between him and the other processes. Making them all work together was invariably the problem, especially with asynchronous inputs.

I'm still thinking about the one-process-one-processor architecture, and I believe it has merits where you can afford a huge, power-hungry $500 FPGA with dozens of embedded processors, and where developer costs (and time to market) outweigh final-product costs. But I don't think it is going to solve more than a small percentage of the real problems that show up in today's market, where dimes and microwatts count.

Sincerely,
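As a rough illustration of the isolation Carl describes, here is a minimal C sketch of a module whose entire interface is one message handler, so test code can exercise it with no knowledge of the transport underneath. The message format and names (msg_t, temp_module_handle) are invented for illustration, not taken from any system mentioned in these letters.

/* A module that only sees messages; the "magic communication channel"
 * is irrelevant to it, so it can be tested in complete isolation. */
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

typedef struct {
    uint16_t id;      /* which request this is            */
    int32_t  arg;     /* request-specific argument        */
    int32_t  result;  /* filled in by the handling module */
} msg_t;

enum { MSG_SET_LIMIT = 1, MSG_CHECK = 2 };

/* The module keeps its own private state; nothing else may touch it. */
static int32_t limit = 100;

/* The module's entire interface: one message in, one message out. */
void temp_module_handle(msg_t *m)
{
    switch (m->id) {
    case MSG_SET_LIMIT: limit = m->arg;                   m->result = 0;  break;
    case MSG_CHECK:     m->result = (m->arg > limit) ? 1 : 0;             break;
    default:            m->result = -1;                                   break;
    }
}

/* Test harness: here the "channel" is just a function call; the module
 * neither knows nor cares whether it runs on its own processor. */
int main(void)
{
    msg_t m = { MSG_SET_LIMIT, 50, 0 };
    temp_module_handle(&m);
    assert(m.result == 0);

    m = (msg_t){ MSG_CHECK, 75, 0 };
    temp_module_handle(&m);
    assert(m.result == 1);   /* 75 exceeds the limit of 50 */

    puts("module tests passed");
    return 0;
}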
Mark Bereit - 5/18/2006

Carl,

Thanks for your comments! The approach I suggested, a processor per routine, is certainly not universally applicable or practical. My goal in writing that article (more of an extended rant, really) was not to say "here's the best way to do things"--if I had everything worked out, it would have been an article on "here's what I did and why you should too." Rather, my goal was to ask people to stop and think about things differently, to see if there are approaches we are missing, because the normal software/system design approach has (in my experience) accumulated a number of stupidities over the course of its evolution from processing simple algorithms.

We're seeing multi-processors and, for economy, multi-cores moving into the mainstream to address the insatiable Need For Speed in the desktop world, but they are all built to execute the existing single-core, multi-thread approach: they are doing their best to share the memory bus and external resources. Is this because it's the best way to do it? No, I think it's because it's the easiest sell from where we've come from. What is the biggest thing getting in the way of the performance of these added cores? Fighting for the bus! What is the biggest thing allowing processes to derail each other? Sharing the memory space. What is the trickiest part of multi-threading? Protecting shared resources. So the hardware designers and the OS designers and the software designers are all fighting the issues that come from putting everything in the same basket, which forces me to ask whether this is actually a good idea.

For argument's sake, let's leave the hardware alone. Suppose I wanted to build a complex system using a single-core Pentium desktop motherboard, but built an OS that used virtual memory mapping heavily to put hundreds of different "objects" in their own completely isolated memory spaces, each its own task, with gates into ring 0 used only to pass inter-module messages (carefully marshalled and bounds-checked by the OS as it dispatches them). Can I do most of what I want? Sure. This approach has more overhead than if I used, say, COM, with objects holding pointers directly into each other's function tables, but it keeps a strong isolation. The trick would be that, coming from my existing worldview, I would keep wanting to use some global variables to speed things along... and I would want to debug by setting breakpoints and stepping into other functions... Should I compromise the OS, or stick to principles and compromise speed? Because as complexity grows (and it does!), my only option is to crank up that motherboard speed: I can't just add compute capacity the way I can add RAM.

But if I followed such a deliberately constrained object model, then at some point I would have to ask why, for example, the process that needs disk access and the process that needs graphics memory access have to share bandwidth. Why does a disk fetch cost me video refresh time? Why does an incoming network packet for still another process hurt the first two? And so I would start to reason that the processes doing disk crunching should be on their own bus... just as, in accelerated graphics cards, the tricky video work is done by an outside processor. And the modules working with the network? Give them a processor, too. If we went to the trouble of making our OS keep everything isolated and talk through asynchronous messaging anyway, then performance gains through more processors become easy and natural.
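As a rough sketch of the messaging gate Mark imagines, and not an implementation of any real OS, the following C fragment shows the key idea: every inter-module message is copied (marshalled) into the dispatcher's own storage and bounds-checked before delivery, so no task ever holds a pointer into another task's memory. The names (os_send, os_receive, MAX_PAYLOAD) and limits are hypothetical.

/* User-space model of an OS message gate between isolated tasks. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define MAX_PAYLOAD 64
#define QUEUE_DEPTH 8
#define NUM_TASKS   4

typedef struct {
    uint8_t  src, dst;                /* sending and receiving task IDs */
    uint16_t len;                     /* payload length actually used   */
    uint8_t  payload[MAX_PAYLOAD];    /* copied, never shared           */
} os_msg_t;

typedef struct {
    os_msg_t slots[QUEUE_DEPTH];
    int      head, count;
} task_queue_t;

static task_queue_t queues[NUM_TASKS];   /* one inbox per isolated task */

/* The "gate into ring 0": validate everything, then copy. */
int os_send(uint8_t src, uint8_t dst, const void *data, uint16_t len)
{
    if (dst >= NUM_TASKS || len > MAX_PAYLOAD)
        return -1;                                   /* bounds check */
    task_queue_t *q = &queues[dst];
    if (q->count == QUEUE_DEPTH)
        return -2;                                   /* inbox full   */
    os_msg_t *slot = &q->slots[(q->head + q->count) % QUEUE_DEPTH];
    slot->src = src;
    slot->dst = dst;
    slot->len = len;
    memcpy(slot->payload, data, len);                /* marshal by copying */
    q->count++;
    return 0;
}

int os_receive(uint8_t dst, os_msg_t *out)
{
    task_queue_t *q = &queues[dst];
    if (q->count == 0)
        return -1;
    *out = q->slots[q->head];
    q->head = (q->head + 1) % QUEUE_DEPTH;
    q->count--;
    return 0;
}

int main(void)
{
    const char *req = "draw frame 42";
    os_send(0 /* app task */, 1 /* video task */, req, (uint16_t)(strlen(req) + 1));

    os_msg_t m;
    if (os_receive(1, &m) == 0)
        printf("task 1 got \"%s\" from task %u\n", (char *)m.payload, (unsigned)m.src);
    return 0;
}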
In such a world, the same economics that today puts multiple cores on the same chip would then put multiple processors with isolated buses on the same chip. And why not? A system with PCI and AGP is already doing this; what is the addition of a few more buses when we keep adding pins to BGA chips anyway? So whether you start by ripping things down to lots of hardware and then consolidating, or start by creating a strict operating and coding environment and then accelerating, you end up in the same place... a place which I strongly suspect is better than where more-threads-and-cores-for-ever-more-shaky-shortcut-ridden-code is taking us. You're right: the hard work is in making the pieces work properly together, and that's where I don't see much effort when the normal approach is to just build bigger algorithms.

Those are my thoughts; thank you for sharing yours, and if you have further comments please throw them my way. Or tell others about them, because I think we could all use some fresh ideas now and again!

Mark Bereit

Dr. Carl Dreher - 5/24/2006

Mark-

Thanks for the long email detailing your thoughts. It seems you've headed off in a different direction than I expected, namely performance rather than reliability. Yes, it would be great to have lots of processors, each with its own internal set of buses. That would definitely speed up the whole system. And yes, it would increase reliability by forcing the system programmers to use messaging instead of a shared memory space. Of course, it would slow down the system, since messaging is slower than shared memory, although I think it is safe to assume that the local processing speed gains would outweigh the losses from relatively infrequent messaging.

The big problems I see (software reliability and bugs) are still in how the pieces fit together. No matter how good the message transport is, or how isolated the processes are, if the system design has bugs, it will still break. In the last system I designed (running on Windows CE) we had a custom high-priority process acting as a message handler between the more mundane processes. Any hardware (machine interface, relays, etc.) had to be isolated by a driver in its own process space, and each process had a specific message interface. If we'd had more time and coders, I would have had each process inspect every message it received for correctness (beyond the normal packet checking). Despite all that, we still ran into problems with one process telling a driver to Do-This and another saying Do-That. Each process was working perfectly, but the interactions caused headaches.

I suppose I work in a different sphere than you do. In my work, we have all the processing power and speed we need, and I can trade speed-optimization tricks like shared memory for cleaner messaging designs. I'm already at the one-process-one-processor level (if only virtually). The tough bugs exist in how the processes relate.
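As a rough illustration of the message-handler arbitration Carl describes, the C sketch below shows a router that inspects each command for basic correctness and catches the Do-This/Do-That conflict before it reaches the driver. The relay example and all names (route, relay_owner, CMD_*) are hypothetical, not from his Windows CE system.

/* A message router between application processes and a hardware driver. */
#include <stdio.h>

enum { CMD_RELAY_ON = 1, CMD_RELAY_OFF = 2, CMD_RELAY_RELEASE = 3 };

typedef struct {
    int sender;   /* process ID of the requester */
    int cmd;      /* one of CMD_*                */
} cmd_msg_t;

static int relay_owner = -1;   /* which process currently "owns" the relay */

/* Inspect the message, then either forward it to the driver or reject it. */
int route(const cmd_msg_t *m)
{
    if (m->cmd < CMD_RELAY_ON || m->cmd > CMD_RELAY_RELEASE)
        return -1;                         /* malformed: unknown command */

    if (m->cmd == CMD_RELAY_RELEASE) {
        if (relay_owner == m->sender)
            relay_owner = -1;
        return 0;
    }

    if (relay_owner != -1 && relay_owner != m->sender) {
        printf("conflict: process %d wants cmd %d but process %d owns the relay\n",
               m->sender, m->cmd, relay_owner);
        return -2;                         /* the Do-This/Do-That case */
    }

    relay_owner = m->sender;
    printf("forwarding cmd %d from process %d to relay driver\n",
           m->cmd, m->sender);             /* actual driver call would go here */
    return 0;
}

int main(void)
{
    cmd_msg_t a = { 7, CMD_RELAY_ON  };    /* process 7: Do-This */
    cmd_msg_t b = { 9, CMD_RELAY_OFF };    /* process 9: Do-That */
    route(&a);
    route(&b);    /* rejected: the conflict is caught at the router, not at the relay */
    return 0;
}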
I must say, it would be VERY cool to be able to configure X number of processors on a chip and change X as needed while developing the system. FPGAs can do that, but not at a cost acceptable for most products. Maybe someone could develop a line of simple 16- or 32-bit CPUs with 4, 6, 8... 50? processors per device. That would require a serial DRAM interface, since there simply aren't enough I/O pins for all the needed independent memory buses. In fact, the same company could offer an I/O chip in various sizes to bring out I/O from the processors that need it, again using a serial interface. Sounds interesting, no? Maybe Microchip would be interested? Well, we can dream.

- Carl

Mark Bereit - 5/31/2006

Carl,

Thanks for further food for thought! I wasn't trying to optimize for performance, actually: reliability is my number one goal. The idea of using a horde of low-spec processors is certainly not a universal solution, and I agree that the interconnect becomes the complication. I guess my main argument is that we have to focus on making the interconnects work, because the huge-algorithm approach has already hit its wall.

In my article I described the construction of a building, involving a lot of different people working, and there are two kinds of complexity in such a task: the specific sub-tasks themselves (cut this, nail that, etc.), and the communications that let it happen as a team. I agree that you can cut out all communication by having one person do everything, one step at a time, according to a huge algorithm. I also maintain that we don't do large tasks that way. Performance is a reason, but not the only one. Another advantage is having sub-tasks as trusted commodities rather than parts of one highly specialized knowledge worker. Another aspect is the inherent concurrency of some real-time tasks: sometimes you need to lift, hold and fasten (for example) all at the same time in order to get a thing done; doesn't it make sense for there to be parallel workers? These are all fringe benefits of the main idea that you can get a task done "right" and then trust it as a reusable component... a goal that continues to elude us.

So, no, I don't want to minimize the complexity of "managing" the "workers." Certainly in the human world that can be complicated too! I simply think that if there is going to be progress in the development of reliable, maintainable systems, it is going to require that people focus their creative energies on that task, rather than on band-aids for the prevailing approach.

I was just reading this week about the architecture of the Cell processor, used inside the upcoming Sony PlayStation 3. It is designed with some elements in common with what I was describing. A total of twelve nodes work cooperatively (eight symmetrical vector processors, a single supervisor processor, and three I/O subsystems) to divide up tasks and message among themselves as needed. I've also read several game developers complaining about the painfully complicated architecture. Once people have climbed the development learning curve, will the Cell turn out to have been a step in the evolution toward large numbers of semi-autonomous processors? Or will this architecture be added to the "too clever" design graveyard? It's certainly too early to tell. If something like the Cell, or like what we've been talking about, is in the future, then I am sure that the chips will come. But none of it will come without some pain and some focus on the really hard stuff: making a complex set of workers help each other more than they get in the way.

Mark Bereit
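To make the supervisor-plus-workers shape Mark sees in the Cell a little more concrete, here is a rough POSIX C sketch in which each worker runs in its own process (its own address space), receives its assignment over a pipe, and reports back over another pipe; nothing is shared, and coordination is all messages. The work itself (summing squares of a range) and every name are stand-ins, not anything from the Cell toolchain.

/* Supervisor hands out work to isolated worker processes over pipes. */
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

#define NUM_WORKERS 4

typedef struct { long first, last; } work_t;   /* the "message" to a worker */

int main(void)
{
    int to_worker[NUM_WORKERS][2], from_worker[NUM_WORKERS][2];

    for (int i = 0; i < NUM_WORKERS; i++) {
        pipe(to_worker[i]);
        pipe(from_worker[i]);
        if (fork() == 0) {                         /* worker process      */
            work_t w;
            read(to_worker[i][0], &w, sizeof w);   /* receive assignment  */
            long sum = 0;
            for (long n = w.first; n <= w.last; n++)
                sum += n * n;
            write(from_worker[i][1], &sum, sizeof sum);  /* report result */
            _exit(0);
        }
    }

    /* Supervisor: hand out ranges, then gather partial results. */
    long total = 0;
    for (int i = 0; i < NUM_WORKERS; i++) {
        work_t w = { i * 250 + 1, (i + 1) * 250 };     /* 1..1000 in chunks */
        write(to_worker[i][1], &w, sizeof w);
    }
    for (int i = 0; i < NUM_WORKERS; i++) {
        long partial;
        read(from_worker[i][0], &partial, sizeof partial);
        total += partial;
        wait(NULL);
    }
    printf("sum of squares 1..1000 = %ld\n", total);   /* expect 333833500 */
    return 0;
}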