Parallelizing applications for scalable performance can be a daunting task, not least because multi- and many-core processors force you to think in two separate directions. One skill set is required for threading: analyzing the workload, spawning threads to do the work while preventing races and other forms of contention, retrieving and ordering results, and so on. Another skill set is required for vectorization: exploiting the powerful built-in instructions that process arrays of data elements stored in vector registers.
The latter can be done with inline assembler, of course. But that’s difficult, and it won’t scale forward easily as vector registers grow wider with each chip generation (in today’s multicore CPUs, 256-bit vector registers are the norm; on the many-core Xeon Phi, they are 512 bits wide). Auto-vectorization – code analysis and vectorization by the compiler – offers a potential stopgap solution for some parts of some codebases. But serial code semantics are often too ambiguous – containing too many implicit dependencies, such as potentially aliased pointers – for today’s compilers to vectorize directly.
A better answer, as more and more C/C++ developers are discovering, is Intel Cilk Plus: an extended C/C++ syntax with compiler enhancements and a runtime engine (plus associated tools) for building parallel applications that exploit vectorization and multithreading in limited but extremely powerful ways.
via Parallel Programming Simplified – Intel Cilk Plus Webinar Brings Clarity, Offers Performance Insight.