“Can I have that in English, please?” could well have been the reaction of those pondering a dual-core processor purchase on reading my previous article. In this follow-up we’ll explore a little more closely the situation confronting programmers when dual- or multi-core processors are introduced into desktop PCs.
My previous article, Multicore Processors: Tomorrow or Today?, attracted a number of bemused comments. I’ll address some of those first, before continuing with the discussion of programming for multi-core processors.
Clarifying the Queries
Thanks to forum members jwenting and benna for their interesting feedback to the earlier article. Their comments raise the following considerations:
“It may be that GPUs are being developed faster than single-core CPUs, but Moore’s law continues to apply, and CPU speed continues to increase. The chip makers have not hit any brick wall, and I’m not sure where you got the idea that they did.”
No, I’m sorry to say that it doesn’t. Intel ceased further development of its Pentium 4 core short of the 4GHz level for good reason.
It wasn’t achievable. Current and future developments are focused on adding ‘features’ and extra cores to the processor, so that the unit can perform more work. AMD have pulled up with the Athlon 64 FX-57 processor for the same reason. Advances slowed to a crawl when the 3GHz (or equivalent) performance level was reached, and have now stopped altogether. We may see, in the future, different processor architectures which are more powerful than what we currently have, but they most certainly won’t be faster versions of the systems we use today. They’ll be a whole new playing field.
The pedal is fully to the metal, and we now have to make the vehicle bigger to get more work done. That’s why we’re at the onset of the multi-core era!
“The real drive behind multicore processors is NOT the games industry…. The real power will come first from large scientific and financial applications, maybe CAD applications: in general, things that are often run on multi-CPU machines today.”
The real benefit of multi-core processors will only really arrive when ALL computing applications make use of the available hardware resources in a fundamental load-sharing environment. Some more specialized applications, such as the ones mentioned above, have been developed with that environment in mind from the outset. They transfer to a multi-core desktop environment seamlessly, giving even better performance than they enjoyed on the ‘hyperthreaded’ desktop processors we’ve seen previously. But an everyday PC application isn’t built on a framework which can seamlessly make use of multiple processing cores.
“They’ll just program OSs to put that game in one core while the OS merrily steams along in another, thus giving both more room than they have now and increasing performance.”
Such a response to the dilemma might seem to be an answer, but in reality it’s only the most basic way in which more than one core can be utilized. Sending one application to one core and another application to a second core might share the load out, but it does nothing to ensure that the sharing occurs in a balanced way. Half of the available processor resource might well be handling only a small percentage of the overall workload with such an approach.
What was David Kirk on about?
When commentators such as Nvidia’s David Kirk claim that we are facing a ‘crisis’, they are referring primarily to the way in which load sharing is currently approached. As good as current programming technology is, some of the fundamental work has only been confronted in the past several years. Trainees are taught to approach problems in a single-threaded, one-dimensional manner, and the apportioning of workload becomes a process of queuing work to be performed sequentially. Having code interact with a processor which can act on parallel threads is a rather large paradigm shift, and an exponential increase in complexity. Currently, trainees are only being introduced to such considerations in post-graduate training, and the fundamental frameworks for code to interact with CPUs in this manner don’t really exist; the few that do are rather limited in applicability.
Consider that you have a roomful of people to be put to work at a particular task. Your job in overseeing the task needs to accommodate:
• Splitting the job into small, manageable, non-interacting parts.
• Designating some workers to be ‘job creators’ who generate the individual parts.
• Designating some workers to be ‘job finishers’ who work on the completed individual parts.
Ensuring that this is performed in a way which maximizes the effectiveness of the available workers is ‘load balancing’, and this is the key component of programming for a multi-core environment which currently does not exist in any practically employable form. Parallelism in programming for GPUs (graphics processors) is widespread, because graphics processors are built with parallelism in mind. CPUs (central processors) are not. To make the most effective use of dual- or multi-core processors, programming needs to occur with parallelism in mind from the outset. Break the BIG problem down into SMALL units. Have the SMALL units handled concurrently. Have a load balancer managing the overall workflow and keeping the timing right!
To do this, the programmer needs a framework which provides job queues, job creators, job finishers, and a load balancer. At present, programmers are some way off having that framework available to them. What is needed is, as a programmer friend explained to me recently:
Someone smart to build a framework based on dynamic get/put queues for smaller sub-units of work.
If you get the framework right for task communication around job/task create/solve, then life is a lot simpler. It’s a bit like middleware: once you have jobs broken into smaller tasks, each task, rather than being the sole thing current, simply gets placed on a FIFO queue until something is ready to do it. You don’t need to know if you have 1 or 1 million processors – that’s what a load balancer does. All other parts of the problem either generate work or do the work, whilst the load balancer tells the work creators how ‘big’ a job to create before they place it on a FIFO queue.
* * *
Bottom line:
1. Build task-handling FIFO queues and message handlers - a couple of hundred lines of C (sketched below this list).
2. Show programmers how to transform existing algorithms for a parallel machine - say, a 30-page PowerPoint presentation.
3. Write a clever load balancer that takes inputs on how fast your CPU/GPU is and how many pipelines or co-processors it has, and sets the tasks that create work to do it in chunks that are optimised towards these parameters - between 300 and 2,000 lines of C, depending on how sophisticated you want to be.
What's the hassle?
* * *
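To make item 1 on that list a little more concrete, here is a minimal sketch of what such a task-handling FIFO queue might look like in C with POSIX threads. Every name in it (task, task_queue, q_put, q_get) is my own invention for illustration, not part of any existing framework:

    #include <pthread.h>
    #include <stdlib.h>

    /* One unit of work: sort the sub-array elements [low..high) */
    typedef struct task {
        int low, high;
        struct task *next;
    } task;

    /* A thread-safe FIFO: PUT adds at the tail, GET removes from the head */
    typedef struct {
        task *head, *tail;
        pthread_mutex_t lock;
        pthread_cond_t  not_empty;
    } task_queue;

    void q_init(task_queue *q)
    {
        q->head = q->tail = NULL;
        pthread_mutex_init(&q->lock, NULL);
        pthread_cond_init(&q->not_empty, NULL);
    }

    void q_put(task_queue *q, task *t)        /* the PUT of the sketch below */
    {
        t->next = NULL;
        pthread_mutex_lock(&q->lock);
        if (q->tail) q->tail->next = t; else q->head = t;
        q->tail = t;
        pthread_cond_signal(&q->not_empty);   /* wake one waiting worker */
        pthread_mutex_unlock(&q->lock);
    }

    task *q_get(task_queue *q)                /* the GET: blocks until work arrives */
    {
        pthread_mutex_lock(&q->lock);
        while (q->head == NULL)               /* sleep until a job is queued */
            pthread_cond_wait(&q->not_empty, &q->lock);
        task *t = q->head;
        q->head = t->next;
        if (q->head == NULL) q->tail = NULL;
        pthread_mutex_unlock(&q->lock);
        return t;
    }

A couple of hundred lines of C, as promised, once error handling and an orderly shutdown are added.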
So, for instance, a parallel quicksort becomes:
    qsort(array[low..high])
    {
        /* Split the array into two parts: everything below the midpoint is
           lower than it (but likely unsequenced), and the converse is true
           for numbers bigger than the midpoint. */
        split(array[low..high], midpoint);

        if (midpoint - low > MIN_PARALLEL_TASK)  /* task too big? split it again and again! */
        {
            qsort(array[low..midpoint]);         /* sort the lower half */
            qsort(array[midpoint..high]);        /* sort the higher half */
        }
        else if (midpoint > low)                 /* a sort interval of more than 1 number? */
        {
            PUT(array[low..midpoint]);           /* add to task queue */
            PUT(array[midpoint..high]);          /* add to task queue */
        }
        /* else the job is finished */
    }
Then, if you have say 8 pipelines, you create 8 tasks that read the queue the function PUT writes to (with, say, a GET function), then simply implement the code in the else statement above.
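In C with POSIX threads, creating those 8 tasks might look like the sketch below. It reuses the hypothetical q_get and task from the queue sketch earlier, and sort_leaf is another invented name, standing in for whatever serial sort handles the else branch:

    #include <pthread.h>
    #include <stdlib.h>

    #define NUM_WORKERS 8                    /* one per pipeline/core */

    void sort_leaf(int low, int high);       /* serial sort of one small chunk */

    /* Each worker loops forever: GET a small job, solve it, repeat */
    static void *worker(void *arg)
    {
        task_queue *q = arg;
        for (;;) {
            task *t = q_get(q);              /* blocks until a job is available */
            sort_leaf(t->low, t->high);      /* the 'else' code from the qsort above */
            free(t);
        }
        return NULL;
    }

    void start_workers(task_queue *q)
    {
        pthread_t tid[NUM_WORKERS];
        for (int i = 0; i < NUM_WORKERS; i++)
            pthread_create(&tid[i], NULL, worker, q);
    }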
The load balancer has to monitor the PUT and GET queue sizes to work out how big MIN_PARALLEL_TASK should be.
So if you had a million numbers to sort and ten processors, MIN_PARALLEL_TASK should be the problem size divided by the number of co-processors, i.e. one million numbers / 10 co-processors = 100,000-number job chunks. Split the array into at least ten segments, 0..100K, 100K..200K, ..., 800K..900K, 900K..1M, and let each co-processor do an equal amount of work!
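In its simplest form, that arithmetic is all the ‘clever load balancer’ of item 3 has to do. A naive version, again only a sketch with invented names, might be:

    /* Naive load balancing: one equal-sized chunk per available co-processor.
       A smarter version would also watch the queue sizes at run time. */
    int min_parallel_task(int problem_size, int num_processors)
    {
        int chunk = problem_size / num_processors;  /* 1,000,000 / 10 = 100,000 */
        return chunk > 0 ? chunk : 1;               /* never go below one element */
    }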
PUT just adds jobs to a queue; GET takes them off. MASTERS split the job until it’s small enough to be PUT onto the work queue; SLAVES GET the sub-jobs, solve them, and write the solution back to a shared memory segment. Each SLAVE process works on a separate memory block with no shared variables.
That's the basics!
The framework currently doesn’t exist. Whilst it is relatively easily achievable, there aren’t currently enough people with the skills to implement it. If code has been poorly written initially (and a lot of code has), then there is little incentive to port it across. The concepts need to be introduced at undergraduate level for us to get very far down the road in the short term.
What does all that waffle mean to me?
The situation is this:
• We are currently being confronted with dual-core desktop processors, and informed that they are the next ‘big thing’ in desktop processing power.
• We currently have graphics processors in many machines which are, in some respects, even more powerful than the CPUs we have installed.
• Almost all of the software we use is designed with single-threaded operation in mind, and cannot make meaningful use of either the extra CPU core or the computing power of the graphics processor.
• It will be quite some way down the track before these hurdles are overcome.
It places a dilemma on those of us making the decision to purchase or upgrade our PCs, and that dilemma affects not one but two of the major expense components in the PC. Is it worth the expense of going dual-core at this point in time for application performance? Well, yes, but only if the major use of your PC will be running those tasks for which we’d previously have considered a dual-processor system anyway. Is it worth the expense of going for the latest and greatest in 3D display cards? Well, who really knows? We’ve already heard the stories of how the previous generation of display cards was bottlenecked by the CPU, and CPUs really haven’t become any more capable at running what we run.
Not yet, anyway!
There is no doubt that a dual-core processor is potentially a much more capable unit than a single-core processor, even if the clock speed of each core is lower than that of the single-core unit. But unless that clock speed matches or betters that of the single-core processor, we will effectively be spending extra to get a component which will underperform in most tasks, because we simply don’t have suitable software to run on it!
Tomorrow is most certainly going to remain tomorrow, I’m afraid, for most of us anyway!