To get the answer, I called Grant Martin, Chief Scientist at Tensilica. Grant had no problem providing a clear answer to my question, while confusing me at the same time. Oh well. The topic wouldn't qualify as a Mocha Mystery if it were easy to understand.
Before you read on, please keep in mind that prior to Tensilica, Grant worked for Burroughs in Scotland, for Nortel/BNR in Canada, and for Cadence in San Jose, where he was named a Cadence Fellow. After our phone call, I decided Grant deserves a new title ...
Heterogeneous Asymmetric Multiprocessing Fellow
Q – Is multicore a reality today?
Grant Martin – It all depends on the definition. Some take a narrow definition, but I take a wider one. Some say, for instance, that multicore just refers to symmetric multiprocessing – 2, 4, or 8 cores – which is what companies like Sun, IBM, and Intel mean when they talk about their SPARC, PowerPC, or x86 chips.
Here at Tensilica, however, we think of multicore as being blended with multiprocessing – multiprocessing being the design style that many consumer, handset, and large equipment manufacturers have been using for a long, long time, going all the way back to the original cell phones of the mid-1990s. They used RISC cores along with DSPs to do voice encoding and decoding.
When you have heterogeneous multiprocessors, some may be multicore and some may be domain-specific compute engines for particular end applications. You might have an audio processor, for instance, to do encoding/decoding, or you might have a special broadband processor on a single core that might involve multiprocessing.
In theory, you could have a homogeneous set of processor cores. But more likely, a homogeneous multicore device is an N-way symmetric device, such as you might get from IBM. Other devices are definitely more heterogeneous. If you split something into multiprocessing for different applications, then you’re more likely to want to take advantage of those features.
Q – This is really confusing!
Grant Martin – Yes, it is. But again, it’s typically because companies like IBM, AMD, and Intel say multicore, and then only talk about homogeneous multiprocessing. But, there's a lot of heterogeneous asymmetric multiprocessing in consumer devices.
In many ways, the conversation is driven by lots of people still remembering the old mainframe CPU, where everybody had to line up to run their jobs. Today, however, the modern distributed smart phone can have 4, 8, or more processors – they have so much more processing power than your old mainframe!
And, now it’s battery life that’s driving innovation. Widespread use of heterogeneous asymmetric multiprocessing does not need to wait for the development of better batteries, because it is more power efficient than many homogeneous approaches.
Q – So, the term you like is: heterogeneous asymmetric multiprocessing?
Grant Martin – Yes, because it reflects the wider scheme. But remember, inside there could still be a multicore device, which would give you the capacity to load more applications in the future – not in real time, but more in the user interface or controller domain.
Q – Moving past the confusing hardware terminology, can today’s software be parsed effectively to take advantage of what multicore has to offer?
Grant Martin – No, because you would need a tool to be able to convert standard applications into a number of threads that would cooperate – multi-threading the application for a symmetric, cache-coherent multicore. But, such a tool remains a profound research project.
That doesn’t mean all is lost, however. There continue to be multicore support libraries to help programmers handle multi-threading to run on multicore devices, and there are different companies who provide those libraries today.
Last year, Intel bought several small, independent companies offering those sorts of libraries – Cilk Arts and RapidMind. Intel bought them specifically to add to their already well-developed programming environment and toolsets. Presumably, Intel is now providing an even richer set of offerings.
Q – Have you vetted the Intel multi-threading programming tools or environment?
Grant Martin – I haven’t looked at Intel specifically, but I had heard about both RapidMind and Cilk Arts long before they were acquired.
These kinds of libraries have been around for a long time, but they are basically manual assists for creating multi-threaded applications. There’s nothing, as yet, that's an automatic tool for parsing an application into multiple threads.
I am aware, however, of a tool from one company – Critical Blue – which does what-if analysis to help you analyze whether your application can be parsed into multiple threads. If you want to port the application to a symmetrical multicore device, you need to know up front if you can take advantage of the multi-threading capacity you'll find there. There may be other tools out there like the one from Critical Blue, but I’m not aware of them.
Q – Is all of this an issue for Tensilica – tools for automatic parsing?
Grant Martin – No, it’s not, because we focus on heterogeneous multiprocessors. Of course, there are issues about splitting software and porting it, but it’s more ad hoc. People here focus on a different task – determining the kind of data they need to send across different kinds of control paths.
However, we definitely do work on methods to help our customers in terms of our own multiprocessing systems, but it’s a far less regular process than what you’re talking about – taking existing software and splitting it to take advantage of multicore hardware.
Q – In a perfect world, there would be a tool that could take any software and split it, to optimize any kind of homogeneous or heterogeneous multicore platform. Right?
Grant Martin – Yes, that would be very nice! But for now, it still only falls into the category of a research topic.
Think about it – to automatically split or suggest cut points in a 2D application is not realistic. In reality, you have the datawise dimension of the application for mapping onto symmetric homogeneous multicore hardware, and you also have the time-based dimension of the application, such as you find in signal processing, where you are using asymmetric multiprocessing. In fact, to split most applications you need a little bit of both.
You need to explore the datawise multi-threading and timewise dataflow. Today, you have cores stacked beside each other, and cores stacked on top of each other. It’s all very complicated, and nothing that I am aware of in the research community comes close to solving all of this.
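As a rough illustration of the datawise dimension Grant describes, the sketch below cuts one data set into chunks and hands each chunk to a separate worker running the same code. It is only a minimal sketch: the function names `scale_chunk` and `split_datawise` and the doubling workload are hypothetical, and Python threads are used here just to show the structure of the split, not to claim a real speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def scale_chunk(chunk):
    # Hypothetical workload: the same operation is applied to every element.
    return [2 * x for x in chunk]


def split_datawise(data, workers):
    # Cut the data into contiguous chunks of roughly equal size.
    size = max(1, -(-len(data) // workers))  # ceiling division, at least 1
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in the same order as the input chunks.
        parts = list(pool.map(scale_chunk, chunks))
    # Stitch the per-worker results back into one list.
    return [y for part in parts for y in part]


if __name__ == "__main__":
    print(split_datawise(list(range(10)), workers=4))
```

The split only works this cleanly because no element depends on any other. That independence is exactly what an automatic tool would have to prove before suggesting a cut point.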
Q – If you could fund a university to work on these things, which university would you choose?
Grant Martin – There are people at U.C. Berkeley who have done some work in this domain – but in truth, most other centers of this type of research work are in Europe. I was at a Summer School last year in Barcelona called ACACES [Advanced Computer Architecture and Compilation for Embedded Systems], which was driven by large applications and some for embedded systems.
Q – Regarding embedded systems: isn’t splitting an application across multiple threads an ideal way to minimize the footprint?
Grant Martin – Yes, but we shouldn’t overstate the advantages. Often people understand an application domain – especially if they’re mapping it onto multicore – but they may not understand the application at each stage. Things are sequentialized into different types of processing at different points in time. It’s not like we’re starting with a totally blank slate, but there are interesting, unanswered questions in understanding how to shunt data back and forth.
Interestingly enough, we could have talked about these same issues 2 years ago. There really has been no dramatic progress in recent years. There is still today this overwhelming need to have an automatic solution, which hasn’t arrived as yet, because people are finding adequate ways to deal with the problem today.
We currently have hardware that allows us to do things on a symmetric multicore, but there are limits on inherent datawise concurrency in many applications. Today, you might be able to use 4 cores on a device, but not 8.
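One common way to put numbers on this kind of limit is Amdahl's law, which Grant did not mention by name and which is brought in here only as an illustration. The sketch below assumes a hypothetical application in which 80 percent of the run time can be spread across cores and the rest stays sequential; the function name `amdahl_speedup` and the 0.8 figure are assumptions, not measurements.

```python
def amdahl_speedup(parallel_fraction, cores):
    # Amdahl's law: the serial part of the run time does not shrink,
    # while the parallel part is divided evenly across the cores.
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


if __name__ == "__main__":
    # Hypothetical application: 80% of the work is parallelizable.
    for cores in (1, 2, 4, 8):
        print(cores, "cores:", round(amdahl_speedup(0.8, cores), 2))
```

Under that assumption, 4 cores give a speedup of 2.5, while doubling to 8 cores lifts it only to about 3.3, which is one way to read the remark about using 4 cores but not 8.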
Q – Is there an argument for abandoning the effort, for just resolving to stick with sequential code?
Grant Martin – It depends on what part of the application you’re in. If it’s not dominated by computation time, but only by user interface time, then usually a sequential application is good enough, and no one would worry about parallelizing it.
But, the minute you want to process large amounts of data, then you want to explore the concurrency – then you want to worry about the concurrency in time and space. You want to ask: Can I pipeline a bunch of tasks, optimizing through different cores, each optimized for different tasks?
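The pipelining question can be sketched in a few lines as well. In the example below, each stage runs in its own thread and stands in for a core optimized for one task, with queues carrying data from stage to stage. The names `stage` and `run_pipeline` and the two toy stage functions are hypothetical, and threads in standard Python show the shape of the dataflow rather than deliver real parallel speed for compute-heavy work.

```python
import queue
import threading

_DONE = object()  # sentinel marking the end of the data stream


def stage(func, inbox, outbox):
    # One pipeline stage: take an item, process it, pass it downstream.
    while True:
        item = inbox.get()
        if item is _DONE:
            outbox.put(_DONE)  # tell the next stage the stream has ended
            return
        outbox.put(func(item))


def run_pipeline(items, funcs):
    # One queue in front of each stage, plus one for the final output.
    queues = [queue.Queue() for _ in range(len(funcs) + 1)]
    threads = [
        threading.Thread(target=stage, args=(f, queues[i], queues[i + 1]))
        for i, f in enumerate(funcs)
    ]
    for t in threads:
        t.start()
    for item in items:
        queues[0].put(item)
    queues[0].put(_DONE)
    results = []
    while True:
        out = queues[-1].get()
        if out is _DONE:
            break
        results.append(out)
    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    # Two toy stages standing in for, say, decode and filter steps.
    print(run_pipeline([1, 2, 3], [lambda x: x + 1, lambda x: x * 10]))
```

While the second stage works on the first item, the first stage is already free to start on the next one. That overlap in time is the concurrency a pipeline is meant to exploit.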
Q – It’s really an N-dimensional problem, right?
Grant Martin – Yes, it’s N-dimensional. But, a lot of good thinking comes from some of those dataflow-oriented tools like Simulink from MathWorks and the old SPW tool from CoWare, now part of Synopsys, which also has System Studio in this area. Those kinds of tools encourage people to think of defining computation as data flow, and then deciding what might be parallel at each stage of the application.
Q – How does this whole discussion apply to Tensilica?
Grant Martin – Many of our customers use multicore in their products. Some use a control processor, plus a number of different accelerator cores. For them, partitioning is easy.
Also, along with the new advanced standards in baseband processing for wireless devices, comes the use of heterogeneous data style multiprocessing. At Tensilica, we’re developing capability along those lines, and seeing a lot of interest from our customers. This is not optimizing, and is by no means an automated process, but it’s also not impossible – many applications are being ported onto heterogeneous multiprocessors today.
Using heterogeneous multiprocessing means having to deal with the reality of what can be done today, and finding more tractable ways of doing programming than just a simple emphasis on homogeneous multicore machines. We need to learn how to explore the whole problem if we want to find the whole solution.
Looking at multicore machines, and thinking about multiprocessing and multi-threading, are orthogonal dimensions of a complex problem.
Q – If this is what you get to think about all day, is it possible you’ve got the best job in the world?
Grant Martin – I definitely enjoy what I do, and have every day since I joined Tensilica 6 years ago. The work is challenging, but we do really good stuff here!