
Webinar: Get Ready for Intel® Math Kernel Library on Intel® Xeon Phi™ Coprocessors


Intel recently unveiled the new Intel® Xeon Phi™ product – a coprocessor based on the Intel® Many Integrated Core architecture. Intel® Math Kernel Library (Intel® MKL) 11.0 introduces high-performance and comprehensive math functionality support for the Intel® Xeon Phi™ coprocessor. You can download the audio recording of the webinar and the presentation slides from the links below.

  • Webinar video recording (Link)
  • Webinar presentation slides (Link)

More information can be found on our "Intel® Math Kernel Library on Intel® Xeon Phi™ Coprocessors" central page. If you have questions, please ask them either on the public Intel MKL forum or through the private, secure Intel® Premier Support channel.

Also, please visit this page for a replay of the highly popular webinar series that introduces you to other Intel software tools for the Intel® Xeon Phi™ coprocessor.

Questions and Answers from the webinar

  • Is anyone using the Intel Xeon Phi product? What kind of applications do they run on it?
    Many users have successfully benefited from it. For example, seven supercomputers on the most recent Top 500 list already use Intel Xeon Phi coprocessors in combination with Intel Xeon processors. A lot of HPC applications, for example, those in the areas of new drug discovery, weather prediction, global financial analysis, oil exploration, Hollywood movie special effects, can make good use of all the power provided by Intel Xeon Phi.

  • Is Intel® Cluster Studio XE 2013 or Intel® Parallel Studio XE 2013 required in order to use Intel Xeon Phi coprocessors?
    Intel Cluster Studio XE 2013 and Intel Parallel Studio XE 2013 are bundle products that contain the necessary tools for programming the coprocessor. For example, Intel compilers (FORTRAN or C/C++) are required to build code for native execution on the coprocessor. The pragmas and directives used to offload computations to the coprocessor are only supported by Intel compilers. Intel MKL provides highly optimized math functions for the coprocessor. Intel MPI (a component of Intel Cluster Studio XE) enables building code scalable to multiple coprocessors and hosts. These bundle products also provide tools for threading assistance, performance and thread profiling, memory and threading error detection, and so on.


  • What if a system has multiple coprocessors? Does Intel MKL try to balance the tasks across them?
    In the case of automatic offload, MKL will try to make use of multiple coprocessors for a computation. Users can also pick which coprocessors to use. In the case of compiler-assisted offload, it is up to the user to specify which coprocessors to use and to orchestrate the work division among them.
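    A minimal sketch of enabling automatic offload from source is shown below (assuming Intel MKL 11.0 or later with at least one installed coprocessor; setting the environment variable MKL_MIC_ENABLE=1 is an equivalent alternative):

/* Automatic offload sketch; assumes Intel MKL 11.0+ and an installed
   Intel Xeon Phi coprocessor. Build with the Intel compiler and MKL,
   e.g. "icc ao_dgemm.c -mkl" (the file name is illustrative). */
#include <stdio.h>
#include <stdlib.h>
#include "mkl.h"

int main(void)
{
    int i, n = 4096;
    double *a = (double *)malloc(n * n * sizeof(double));
    double *b = (double *)malloc(n * n * sizeof(double));
    double *c = (double *)malloc(n * n * sizeof(double));
    for (i = 0; i < n * n; i++) { a[i] = 1.0; b[i] = 2.0; c[i] = 0.0; }

    /* Enable automatic offload for this process (same effect as MKL_MIC_ENABLE=1). */
    if (mkl_mic_enable() != 0)
        printf("Automatic offload not available; running on the host only\n");

    /* Sufficiently large DGEMM calls are candidates for automatic offload. */
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n, n, n, 1.0, a, n, b, n, 0.0, c, n);

    free(a); free(b); free(c);
    return 0;
}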

  • Do the performance charts published online include cost of data transfer between host and coprocessors?
    The performance charts compare native execution performance on the coprocessor with host execution performance on the host processor. Hence, data transfer cost is not reflected.

  • Do the performance charts published online compare the dual-socket E5-2680 CPU performance against single coprocessor performance?
    Yes. The host system used to obtain the performance charts has two Intel Xeon E5-2680 CPUs (2 sockets, 8 cores per socket). The coprocessor is an Intel Xeon Phi SE10 with 61 cores. Each of the online performance charts has the detailed configuration listed at the bottom.

  • What happens if multiple user processes or threads call Intel MKL functions with automatic offload?
    Currently, a process/thread doing automatic offload is not aware of other processes/threads that may also be offloading at the same time. In this scenario, all processes/threads will offload to a coprocessor. This leads to the risks of thread oversubscription and running out of memory on the coprocessor. It is possible, however, with careful memory management and thread affinity settings, to have multiple offloading processes/threads use different groups of cores on the coprocessor at the same time.

  • Will more routines get automatic offload support in the future?
    Automatic offload works well when there is enough computation in a function to offset the data transfer overhead. Currently, only GEMM, TRSM, TRMM and LU, QR, Cholesky are supported with this model. There might be other functions in Intel MKL that can be good candidates for automatic offload. We are investigating all opportunities. Please contact us via our support channels if you see more needs for automatic offload.

  • Can you show us in detail the configurations of running the LINPACK benchmark natively on the coprocessor?
    Intel optimized SMP LINPACK benchmark is included in Intel MKL 11.0 installation packages. Please find it in $MKLROOT/benchmarks/linpack. See the execution scripts in this location for the default configuration.

  • Does the memory allocated for the arguments of an Intel MKL routine reside on the coprocessor or on the host?
    Unless input data already exists on the coprocessor or output data is not needed on the host, MKL routine input arguments are allocated on the host and then copied to the coprocessor. Enough space needs to be allocated on the coprocessor to receive the data. Output arguments are copied back to the host. The offload pragmas offer a rich set of controls for data transfer and memory management on the coprocessor, as sketched below. In the case of MKL automatic offload, however, the MKL runtime system handles all of this transparently.
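    As a rough sketch of compiler-assisted offload (the function, buffer names, and sizes below are placeholders for illustration), the in/out clauses control which arrays are copied to the coprocessor and which results are copied back:

/* Compiler-assisted offload sketch using Intel compiler offload pragmas;
   the buffer names and sizes are placeholders for illustration. */
#include "mkl.h"

void offload_dgemm(double *a, double *b, double *c, int n)
{
    /* Copy a and b to the coprocessor, run DGEMM there, copy c back to the host. */
    #pragma offload target(mic) in(a : length(n * n)) in(b : length(n * n)) out(c : length(n * n))
    {
        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    n, n, n, 1.0, a, n, b, n, 0.0, c, n);
    }
}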

  • If data placement between the host and coprocessor is transparent, you now have two copies of the data. What about data synching?
    In the case of Intel MKL automatic offload, data synching is taken care of transparently by the Intel MKL runtime. If a function call is offloaded using pragmas, then the user needs to rely on the facilities provided by the pragmas to synch/merge data. Intel Xeon Phi coprocessor also supports a shared memory programming model called MYO (“mine”, “yours”, “ours”). Data synching between host processors and coprocessors is taken care of implicitly in this model.
    Refer to this article for more information.
  • If I have two automatic offload function calls with a non-automatic offload function call in between, and these functions reuse data, will the data persist on the coprocessor to be reused?
    Data persistence on the coprocessor is currently not supported for function calls using Intel MKL automatic offload. The input data is always copied from host to coprocessor at the beginning of an automatic offload execution, and output data is always copied back at the end.

  • Can PARDISO and other sparse solvers make use of the coprocessor? How does the performance compare with, say, running on an 8-core Xeon processor?
    Yes. Intel MKL sparse solvers, including PARDISO, can make use of the coprocessor. However, our optimization effort has so far been focused on dense matrices (BLAS and LAPACK); sparse solvers at present are not optimized to the same extent. Performance of sparse solvers, on the processor or on the coprocessor, largely depends on the properties of the sparse systems. It is hard to make a performance comparison without a particular sparse system as context.

  • Is Intel® Integrated Performance Primitives (Intel® IPP) supported on Intel Xeon Phi product?
    Support for Intel IPP is still to be determined. If you have a request for supporting Intel IPP on the Intel Xeon Phi coprocessor, please follow the regular Intel IPP support channel to submit a formal request.

  • There are a lot of pragmas to set. Are there any preprocessors to scan one's FORTRAN code for LAPACK calls and automatically insert all the appropriate pragmas?
    There is no such tool to automatically scan your code and insert pragmas. But if you use MKL automatic offload (when applicable), you can take advantage of computation offloading without using pragmas.

  • The offload pragmas from Intel compilers are very different from OpenACC. Can users use either one for the Intel Xeon Phi coprocessor?
    Intel compilers do not have plans to support OpenACC.

  • What is the difference, if any, between using the Intel specific compiler directives to offload to the coprocessor and using the newly proposed OpenMP coprocessor/accelerator directives? Am I correct that these new OpenMP directives will be added to the Intel compilers next year?
    Intel compiler offload directives offer a much richer set of features than OpenMP offload directives. Intel Compiler 13.0 update 2 (both FORTRAN and C/C++) will add support for OpenMP offload directives.
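    For comparison, below is a minimal sketch in the OpenMP 4.0 target-directive style (the function and array names are placeholders):

/* OpenMP 4.0-style target-directive sketch; names and sizes are placeholders. */
void scale_on_device(float *x, float *y, float a, int n)
{
    int i;
    /* Map x to the coprocessor, compute there, and map y back to the host. */
    #pragma omp target map(to: x[0:n]) map(from: y[0:n])
    #pragma omp parallel for
    for (i = 0; i < n; i++)
        y[i] = a * x[i];
}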

  • Does GCC support Intel Xeon Phi?
    Please see this page for information on third-party tools available with support for Intel Xeon Phi coprocessor.
    Our changes to the GCC tool chain, available as of June 2012, allow it to build the coprocessor’s Linux environment, including our drivers, for the Intel Xeon Phi coprocessor. The changes do not include support for vector instructions and related optimization improvements. GCC for Intel Xeon Phi is really only for building the kernel and related tools; it is not for building applications. Using GCC to build an application for the Intel Xeon Phi coprocessor will most often result in low-performance code due to its current inability to vectorize for the new Knights Corner vector instructions. Future changes to give full usage of Knights Corner vector instructions would require work on the GCC vectorizer to utilize those instructions’ masking capabilities.

  • Is debugging supported on the coprocessor?
    Yes. Debugging is supported. At this time, the Intel debugger, GDB, TotalView, and Allinea DDT are debuggers available with support for the Intel Xeon Phi coprocessor. See this page for more information.

  • Is the 8GB memory shared by all cores on the coprocessor? Are there memory hierarchies on the Intel Xeon Phi coprocessor?
    Yes. All cores on a coprocessor share the 8GB memory. The memory hierarchy includes the shared 8GB memory and, for each core, a 32KB L1 instruction cache, a 32KB L1 data cache, and a 512KB unified L2 cache. The caches are fully coherent and implement the x86 memory order model. See here for a description of the Intel Many Integrated Core architecture.

  • How lightweight are threads on the coprocessor? Is context switching expensive?
    Context switching is more expensive on Intel Xeon Phi coprocessors than on Intel Xeon processors. This is because the coprocessor has more vector registers, and a coprocessor core is typically slower than a processor core.

  • What MPI implementations are supported?
    At present, Intel MPI and MPICH2 are the two implementations that support Intel Xeon Phi coprocessors.

  • Can I put an MPI rank on the host processor and another MPI rank on the coprocessor to have a 2-node MPI environment?
    Yes. This usage model has been supported since Intel MPI 4.1. Please refer to the Intel MPI product page for more information on Intel MPI support for Intel Xeon Phi coprocessors.
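    For illustration (the executable names and rank counts here are placeholders, and mic0 is the default hostname of the first coprocessor), such a heterogeneous job can typically be launched with something like "mpirun -n 8 -host localhost ./app_host : -n 16 -host mic0 ./app_mic", where app_mic is the binary built for native execution on the coprocessor.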

  • Can you explain the motherboard requirements for Intel Xeon Phi coprocessors, e.g. power, BIOS, PCI bandwidth?
    Please contact your OEMs for information on system configurations for Intel Xeon Phi coprocessors. Find a list of OEMs that support the coprocessor on this page.
  • What is the estimated price of Intel Xeon Phi coprocessor?
    Please contact your OEMs or your Intel field representatives to get estimated pricing of Intel Xeon Phi coprocessor.

  • Where can I buy Intel software tools that support the Intel Xeon Phi coprocessor?
    Please contact your local Intel® Software Development Products Resellers for more details.

VecAnalysis Python* Script for Annotating Intel C++ & Fortran Compilers Vectorization Reports


 

This is the Python* script used to annotate Intel® C++ and Fortran compiler 13.1 (Intel® C++/Fortran/Visual Fortran Composer XE 2013 Update 2 and later) vectorization reports produced at -vec-report7.  The attached zip file contains:

  • vecanalysis.py 
  • vecmessages.py
  • README-vecanalysis.txt

NOTE: You will need Python* version 2.6.5 or higher. For more information and download instructions, please click here.

The new -vec-report7 compiler option (Linux*; /Qvec-report7 on Windows*), available in Intel® C++ and Fortran compilers version 13.1, allows the compiler to emit vector code quality messages, the corresponding message IDs, and data values for vectorized loops. The messages provide information such as the expected speedup, memory access patterns, and the number of vector idioms for vectorized loops. Below is a sample of the type of messages the compiler will emit at -vec-report7:

  • loop was vectorized (with peel / with remainder)
  • unmasked aligned unit stride loads: 4
  • unmasked aligned unit stride stores: 2
  • saturating add/subtract: 3
  • estimated potential speedup: 6.270000

The attached Python* script takes the message IDs produced by the compiler as input and produces a .txt file that includes the original source code annotated with -vec-report7 messages. The information gives more insight into the generated vector code quality without the need to analyze the assembly code. The naming convention for the output file is (filename_extension_vr.txt); for example, the output file corresponding to satSub.c would be satSub_c_vr.txt. The compiler does not invoke the Python script automatically; the user needs to apply the Python script manually to the output produced by the compiler as shown below. The command below assumes the vecanalysis Python script files are located in the "vecanalysis" directory:

Example: icc -c -vec-report7 satSub.c 2>&1 | ./vecanalysis/vecanalysis.py --list

For more information please see the README-vecanalysis.txt file provided.

$ python
Python 2.6.5 (r265:79063, Jul  5 2010, 11:46:13)
[GCC 4.5.0 20100604 [gcc-4_5-branch revision 160292]] on linux2
Type "help", "copyright", "credits" or "license" for more information.

$ icc -c -vec-report7 satSub.c 2>&1 | ./vecanalysis/vecanalysis.py --list
satSub.c(9): (col. 3) remark: SIMD LOOP WAS VECTORIZED.
satSub.c(9): (col. 3) remark: VEC#00001WPWR 1.
satSub.c(9): (col. 3) remark: VEC#00052 1.
satSub.c(9): (col. 3) remark: VEC#00101UASL 4.
satSub.c(9): (col. 3) remark: VEC#00101UASS 2.
satSub.c(9): (col. 3) remark: VEC#00101UUSL 2.
satSub.c(9): (col. 3) remark: VEC#00101UUSS 1.
satSub.c(9): (col. 3) remark: VEC#00201 5.
satSub.c(9): (col. 3) remark: VEC#00202 0.310000.
satSub.c(9): (col. 3) remark: VEC#00203 6.270000.
satSub.c(9): (col. 3) remark: VEC#00204 15.
satSub.c(9): (col. 3) remark: VEC#00405 3.
Writing satSub_c_vr.txt ... done
Statistics for all files

// Below is the vectorization summary for satSub.c
                                                        Source Locations
Message                                                      Count     %

// This line says there were 3 saturating add/subtract operations.
// 100% means the message refers to a single location/loop in the program.
// (Count = 1) means there is one instance of this message for the loops in the program.
saturating add/subtract: 3.                                      1 100.0%
unmasked unaligned unit stride loads: 2.                         1 100.0%
loop was vectorized (with peel/with remainder)                   1 100.0%
unmasked aligned unit stride stores: 2.                          1 100.0%

// 100% of all loops (in this case a single loop) in the program were vectorized
// If there were 10 loops out of which 6 got vectorized, the % would be 60%

SIMD LOOP WAS VECTORIZED.                                        1 100.0%
unmasked aligned unit stride loads: 4.                           1 100.0%
scalar loop cost: 5.                                             1 100.0%
lightweight vector operations: 15.                               1 100.0%
vector loop cost: 0.310000.                                      1 100.0%
loop inside vectorized loop at nesting level: 1.                 1 100.0%
unmasked unaligned unit stride stores: 1.                        1 100.0%
estimated potential speedup: 6.270000.                           1 100.0%
Total Source Locations:                                          1

$ more satSub_c_vr.txt
VECRPT satSub.c
VECRPT                                                          Source Locations
VECRPT Message                                                       Count     %
VECRPT saturating add/subtract: 3.                                       1 100.0%
VECRPT unmasked unaligned unit stride loads: 2.                          1 100.0%
VECRPT loop was vectorized (with peel/with remainder)                    1 100.0%
VECRPT unmasked aligned unit stride stores: 2.                           1 100.0%
VECRPT scalar loop cost: 5.                                              1 100.0%
VECRPT unmasked aligned unit stride loads: 4.                            1 100.0%
VECRPT SIMD LOOP WAS VECTORIZED.                                         1 100.0%
VECRPT lightweight vector operations: 15.                                1 100.0%
VECRPT vector loop cost: 0.310000.                                       1 100.0%
VECRPT loop inside vectorized loop at nesting level: 1.                  1 100.0%
VECRPT unmasked unaligned unit stride stores: 1.                         1 100.0%
VECRPT estimated potential speedup: 6.270000.                            1 100.0%
VECRPT Total Source Locations:                                           1

   1: #define SAT_U8(x) ((x) < 0 ? 0 : (x))
   2: void satsub(
   3:   unsigned char *a,
   4:   unsigned char *b,
   5:   int n
   6: ){
   7:   int i;
   8: #pragma simd
VECRPT (col. 3) SIMD LOOP WAS VECTORIZED.
VECRPT (col. 3) estimated potential speedup: 6.270000.
VECRPT (col. 3) lightweight vector operations: 15.
VECRPT (col. 3) loop inside vectorized loop at nesting level: 1.
VECRPT (col. 3) loop was vectorized (with peel/with remainder)
VECRPT (col. 3) saturating add/subtract: 3.
VECRPT (col. 3) scalar loop cost: 5.
VECRPT (col. 3) unmasked aligned unit stride loads: 4.
VECRPT (col. 3) unmasked aligned unit stride stores: 2.
VECRPT (col. 3) unmasked unaligned unit stride loads: 2.
VECRPT (col. 3) unmasked unaligned unit stride stores: 1.
VECRPT (col. 3) vector loop cost: 0.310000.
   9:   for (i=0; i<n; i++){
  10:     a[i] = SAT_U8(a[i] - b[i]);
  11:   }
  12: }
$

Samples for Intel® C++ Composer XE


The Intel® C++ Compiler is an industry-leading C/C++ compiler that includes optimization features such as auto-vectorization and auto-parallelization, OpenMP*, and Intel® Cilk™ Plus multithreading capabilities, plus highly optimized performance libraries.

We have created a list of articles with samples explaining the features in detail and how and when to use them in source code. They are:

By installing or copying all or any part of the sample source code, you agree to the terms of the Intel(R) Sample Source Code License Agreement.

Auto-vectorization articles and samples
Article Name / Description / Download

A Guide to Auto-vectorization with Intel® C++ Compilers
This article provides guidelines for enabling Intel C++ compiler auto-vectorization using the sample source code; it targets Intel® processors or compatible non-Intel processors that support SIMD instructions such as Intel® Streaming SIMD Extensions (Intel® SSE).
Download: Source Code in C/C++

Intel® Cilk™ Plus

Please visit the Intel® C++ Compiler Code Samples page.

Building Open Source Applications using Intel C++ Compiler
Article Name / Description / Platforms

How to Build POV-Ray* with Intel C++ Compiler on Windows
The article provides detailed instructions on building POV-Ray* using the Intel® C++ Compiler for Windows.
Version information
  • Povray* beta version 3.7
  • Intel(R) C++ for Windows: 11.0
Windows
Building Boost C++ Libraries with Intel® C++ Compiler on Windows XP
Boost is a set of libraries for the C++ language; visit www.boost.org for more information. The article provides detailed instructions on how to build the Boost* library with the Intel C++ Compiler on Windows.
Version information
  • Boost: v1.39.0
  • Intel C++ Compiler for Windows: 11.1
Windows
Building Open MPI* with the Intel compilers
The article helps Intel® compiler customers build and use the Open MPI* library with the Intel C++ and Fortran Compilers for Linux and OS X.
Version information
  • Open MPI: 1.2
  • Intel(R) C++ and Fortran Compilers for Linux* or Mac OS* X: 10.0
Linux*, OS X*
Building UPC* to utilize the Intel C++ Compiler
The Berkeley* Unified Parallel C* (UPC) is a programming language that adds parallelization extensions to the C language. The article explains how to build the UPC* compiler with the Intel C++ Compiler and configure it for use with symmetric multiprocessing (SMP) machines.
Version information
  • UPC: version 2.4.0
  • Intel(R) C++ Compiler for Linux*: 10.0
Linux
Building Quantlib with Intel C++ Compiler
Quantlib is a free/open-source library for modeling, trading, and risk management in real life, written in C++. The article explains how to configure and build the Quantlib* library (http://quantlib.org/) and an example provided with Quantlib.
Version information
  • Quantlib: Quantlib-0.3.13.tar.gz
  • Boost*: boost_1_33_1
  • Intel(R) C++ Compiler for Linux*: 10.0
Linux
Building Xerces with Intel C++ Compiler
The article describes how to build Xerces-C++ with the Intel® C++ Compiler for Linux*.
Version information
  • Xerces: 2.7.0
  • Intel(R) C++ Compiler for Linux*: 10.0
Linux
Building FFTW* With the Intel Compilers
The FFTW library is used for high-performance computation of the Discrete Fourier Transform (DFT). The article describes how to build the FFTW* library on Linux* using the Intel C++ Compiler for Linux.
Version information
  • FFTW* library v3.1.2
  • Intel(R) C++ Compiler for Linux*: 10.0
Linux
Building PGPLOT* with the Intel compilers
PGPLOT is a library for creating two-dimensional plots and graphs. The article provides instructions on how to build the PGPLOT* graphics library using Intel C++ and Fortran Compilers for Linux.
Version information
  • PGPLOT* graphics library v5.2.2
  • Intel(R) C++ and Fortran Compilers for Linux*: 10.x
Linux
Building WRF v2.x with the Intel compilers
The Weather Research and Forecasting (WRF) Model (http://wrf-model.org/index.php) is a next-generation mesoscale numerical weather prediction system designed to serve both operational forecasting and atmospheric research needs. The article is created to help users of WRF make use of the Intel C++ and Fortran compilers.
Version information
  • WRF: version 2.2 and 2.2.1
  • Intel(R) C++ and Fortran Compilers for Linux* or Mac OS* X: 10.x, 11.x
Linux
Building WRF v3.1.1 with the Intel compilers
The article is created to help users of WRF v3.1.1 make use of the Intel C++ and Fortran compilers.
Version information
  • WRF: version 3.1.1
  • Intel(R) C++ and Fortran Compilers for Linux* or Mac OS* X: 11.1
Linux
Building the HPCC* benchmark with Intel C++ and Fortran Compilers
The HPC Challenge (HPCC) benchmark is used to evaluate and test a wide variety of performance parameters for high-performance computing systems. The article provides instructions on how to build the HPCC* benchmark.
Version information
  • HPCC: 1.0.0
  • Intel(R) C++ and Fortran Compilers for Linux* or Mac OS* X: 10.0
  • Intel(R) Math Kernel Library: 9.1
Linux, OS X
Building HDF5* with Intel® compilers
The article provides instructions on how to build and use the HDF5 library with the Intel C++ and Fortran Compilers on Linux* or OS X*. HDF5 (http://www.hdfgroup.org/HDF5/) is the latest generation of the HDF libraries, a general purpose library and associated file formats for storing and sharing scientific data.
Version information
  • HDF5 1.8.9
  • Intel C++ and Fortran Compiler for Linux* or Mac OS* X: 13.0 Update 1
Linux, OS X

Linux* ABI


by Milind Girkar, Hongjiu Lu, David Kreitzer, and Vyacheslav Zakharin (Intel)

Description of the Intel® AVX, Intel® AVX2, Intel® AVX-512 and Intel® MPX extensions required for the Intel® 64 architecture application binary interface.

How to detect New Instruction support in the 4th generation Intel® Core™ processor family


Downloads


How to detect New Instruction support in the 4th generation Intel® Core™ processor family [PDF 342.3KB]

The 4th generation Intel® Core™ processor family (codenamed Haswell) introduces support for many new instructions that are specifically designed to provide better performance to a broad range of applications such as media, gaming, data processing, hashing, and cryptography. The new instructions can be divided into the following categories:

  • Intel® Advanced Vector Extensions 2 (Intel® AVX2)
  • Fused Multiply Add (FMA)
  • Bit Manipulation New Instructions (BMI)
  • MOVBE instruction (previously supported by the Intel® Atom™ processor)
  • Intel® Transactional Synchronization Extensions (Intel® TSX) (available in some models)

The details of these instructions can be found in Intel® 64 and IA-32 Architectures Software Developer Manuals and Intel® Advanced Vector Extensions Programming Reference manual.

In order to correctly use the new instructions and avoid runtime crashes, applications must properly detect hardware support for the new instructions using CPUID checks. It is important to understand that a new instruction is supported on a particular processor only if the corresponding CPUID feature flag is set. Applications must not assume support of any instruction set extension simply based on, for example, checking a CPU model or family and must instead always check for _all_ the feature CPUID bits of the instructions being used.

Software developers can take advantage of the new instructions via writing assembly code, using intrinsic functions, or relying on compiler automatic code generation. In the latter case, it is crucial to understand what instructions the compiler(s) can generate with given switches and implement proper CPUID feature checks accordingly.

Generally, compilers and libraries generating code for 4th generation Intel Core processors are expected and allowed to use all the instructions listed above, with the exception of Intel TSX. Below is the complete list of CPUID flags that generally must be checked:

CPUID.(EAX=01H, ECX=0H):ECX.FMA[bit 12]==1 &&
CPUID.(EAX=07H, ECX=0H):EBX.AVX2[bit 5]==1 &&
CPUID.(EAX=07H, ECX=0H):EBX.BMI1[bit 3]==1 &&
CPUID.(EAX=07H, ECX=0H):EBX.BMI2[bit 8]==1 &&
CPUID.(EAX=80000001H):ECX.LZCNT[bit 5]==1 &&
CPUID.(EAX=01H, ECX=0H):ECX.MOVBE[bit 22]==1

Note: Applications using instructions from the RTM subset of Intel TSX extension need to guard the code by checking the CPUID.(EAX=07H, ECX=0H).EBX.RTM[bit 11]==1. Applications can also, but are not required to, check CPUID.(EAX=07H, ECX=0H).EBX.HLE[bit 4]==1 for HLE, because legacy processors ignore HLE hints.
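A minimal sketch of such an RTM guard is shown below; it relies on a CPUID helper equivalent to the run_cpuid() function defined in the detection example later in this article:

#include <stdint.h>

/* Declared here for illustration; see the run_cpuid() definition in the
   detection example below. */
void run_cpuid(uint32_t eax, uint32_t ecx, uint32_t* abcd);

int check_rtm_support()
{
    uint32_t abcd[4];
    /* CPUID.(EAX=07H, ECX=0H):EBX.RTM[bit 11]==1 */
    run_cpuid( 7, 0, abcd );
    return ( (abcd[1] & (1 << 11)) != 0 );
}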

For example, Intel® Composer XE 2013 can automatically generate all the new instructions guarded by the CPUID features in the above list, using the -QaxCORE-AVX2 and -QxCORE-AVX2 switches on Microsoft Windows* (-axCORE-AVX2 and -xCORE-AVX2 on Linux*). The compiler switch -[Q]axCORE-AVX2 generates an automatic CPUID check and dispatch to the code using the new instructions, while the -[Q]xCORE-AVX2 switch assumes the new instructions are supported and thus requires a manual implementation of the CPUID check for all the features in the list above. The Microsoft Visual C++* 2012 compiler supports these new instructions via intrinsics as well as the 32-bit inline assembler, and the GCC compiler supports both auto-generation and intrinsics with the -march=core-avx2 switch starting with version 4.7; in both cases the complete list of CPUID features above must be checked whenever such code is called.

Additionally, libraries such as Intel® Integrated Performance Primitives (Intel® IPP), beginning with version 7.1, may also use these new instructions. In the case of Intel IPP, two types of interfaces are available: an automatically dispatched interface, which is the default, and a CPU-specific interface available via prefixes such as ‘h9_’ (32-bit) or ‘l9_’ (64-bit). In the case of functions optimized for the 4th generation Intel Core processor family, applications must check for support of all the features in the list above before calling these functions.

And finally, new instructions using VEX prefixes and operating on vector YMM/XMM registers continue to require checking for OS support of YMM state before use; this is the same check as for Intel AVX instructions.

Below is a code example you can use to detect the support of new instructions:

#if defined(__INTEL_COMPILER) && (__INTEL_COMPILER >= 1300)

#include <immintrin.h>

int check_4th_gen_intel_core_features()
{
    const int the_4th_gen_features = 
        (_FEATURE_AVX2 | _FEATURE_FMA | _FEATURE_BMI | _FEATURE_LZCNT | _FEATURE_MOVBE);
    return _may_i_use_cpu_feature( the_4th_gen_features );
}

#else /* non-Intel compiler */

#include <stdint.h>
#if defined(_MSC_VER)
# include <intrin.h>
#endif

void run_cpuid(uint32_t eax, uint32_t ecx, uint32_t* abcd)
{
#if defined(_MSC_VER)
    __cpuidex(abcd, eax, ecx);
#else
    uint32_t ebx, edx;
# if defined( __i386__ ) && defined ( __PIC__ )
     /* in case of PIC under 32-bit EBX cannot be clobbered */
    __asm__ ( "movl %%ebx, %%edi \n\t cpuid \n\t xchgl %%ebx, %%edi" : "=D" (ebx),
# else
    __asm__ ( "cpuid" : "+b" (ebx),
# endif
              "+a" (eax), "+c" (ecx), "=d" (edx) );
    abcd[0] = eax; abcd[1] = ebx; abcd[2] = ecx; abcd[3] = edx;
#endif
}     

int check_xcr0_ymm() 
{
    uint32_t xcr0;
#if defined(_MSC_VER)
    xcr0 = (uint32_t)_xgetbv(0);  /* min VS2010 SP1 compiler is required */
#else
    __asm__ ("xgetbv" : "=a" (xcr0) : "c" (0) : "%edx" );
#endif
    return ((xcr0 & 6) == 6); /* checking if xmm and ymm state are enabled in XCR0 */
}


int check_4th_gen_intel_core_features()
{
    uint32_t abcd[4];
    uint32_t fma_movbe_osxsave_mask = ((1 << 12) | (1 << 22) | (1 << 27));
    uint32_t avx2_bmi12_mask = (1 << 5) | (1 << 3) | (1 << 8);

    /* CPUID.(EAX=01H, ECX=0H):ECX.FMA[bit 12]==1   && 
       CPUID.(EAX=01H, ECX=0H):ECX.MOVBE[bit 22]==1 && 
       CPUID.(EAX=01H, ECX=0H):ECX.OSXSAVE[bit 27]==1 */
    run_cpuid( 1, 0, abcd );
    if ( (abcd[2] & fma_movbe_osxsave_mask) != fma_movbe_osxsave_mask ) 
        return 0;

    if ( ! check_xcr0_ymm() )
        return 0;

    /*  CPUID.(EAX=07H, ECX=0H):EBX.AVX2[bit 5]==1  &&
        CPUID.(EAX=07H, ECX=0H):EBX.BMI1[bit 3]==1  &&
        CPUID.(EAX=07H, ECX=0H):EBX.BMI2[bit 8]==1  */
    run_cpuid( 7, 0, abcd );
    if ( (abcd[1] & avx2_bmi12_mask) != avx2_bmi12_mask ) 
        return 0;

    /* CPUID.(EAX=80000001H):ECX.LZCNT[bit 5]==1 */
    run_cpuid( 0x80000001, 0, abcd );
    if ( (abcd[2] & (1 << 5)) == 0)
        return 0;

    return 1;
}

#endif /* non-Intel compiler */


static int can_use_intel_core_4th_gen_features()
{
    static int the_4th_gen_features_available = -1;
    /* test is performed once */
    if (the_4th_gen_features_available < 0 )
        the_4th_gen_features_available = check_4th_gen_intel_core_features();

    return the_4th_gen_features_available;
}

#include <stdio.h>

int main(int argc, char** argv)
{
    if ( can_use_intel_core_4th_gen_features() )
        printf("This CPU supports ISA extensions introduced in Haswell\n");
    else
        printf("This CPU does not support all ISA extensions introduced in Haswell\n");

    return 1;
}

 
 

Intel, the Intel logo, Atom, and Core are trademarks of Intel Corporation in the U.S. and/or other countries.
Copyright © 2013 Intel Corporation. All rights reserved.
*Other names and brands may be claimed as the property of others.

Webinar -"Intel® System Studio: Embedded application development and debugging tools"


Abstract

The Intel® System Studio is a flexible, complete software development studio that allows you to optimize Intel® Architecture based intelligent embedded systems and devices. It combines Eclipse* CDT integrated optimizing compiler solutions, signal and media processing libraries, whole-platform power and performance tuning capabilities, in-depth memory and thread checking, application debug with instruction trace and data race detection, and the deep insight of a JTAG based system software debug solution.

This session will give an overview for the Intel® System Studio and introduce the key features which cover the following topics:

  • Debugging with GDB* debugger with enhanced instruction trace and data race detection support, and  Intel® JTAG Debugger with deep insight into processor architecture, flashing, and source level debug from EFI* to OS Kernel and to driver development
  • Building with the Intel® C++ Compiler in a cross-build environment to extract the best performance for your embedded target.
  • Tuning with Intel® VTune™ Amplifier to find performance hotspots, identify architectural performance bottlenecks, and analyze system power and frequency.
  • Verifying with Intel® Inspector to find memory and threading issues in your embedded applications.

 



Presenter

Naveen Gv & Sukruth H V

PLAYBACK

The following is a selected list of questions and answers from the webinar "Intel® System Studio: Embedded application development and debugging tools". We thought these may be useful to other developers as a reference.

Q: We have XDP Debug interface on our board. Is Intel® JTAG Debugger interface different?

A: The Intel® JTAG Debugger 3.0 provides Linux* hosted cross-debug solutions for software developers to debug the Linux* kernel sources and dynamically loaded drivers and kernel modules on Intel® Atom™ Processor based devices. It does so using the In-Target Probe eXtended Debug Port (ITP-XDP) on Intel® Atom™ Processor (N2xxx, D2xxx, E6xx, CE42xx, and CE 53xx) based platforms. 

For more information, refer to Intel® System Studio Installation guide and Release Notes Intel® JTAG Debugger 3.0 : http://software.intel.com/sites/default/files/article/365160/jtag-release-install.pdf

Q: Does Intel System Studio support Menlow and Baytrail platforms?

A: Intel® System Studio 2013 supports Intel® Atom™ Z5xx series Processors (Menlow). The upcoming release of the Intel® System Studio 2014 beta will support the Intel® Atom™ Processor E3xxx and Z3xxx series code-named “Baytrail” from a Windows* host with the Intel® ITP-XDP3 device.

Q: Do you have an auto configuration option for connecting the target device? We are using American Arium, which does not support auto configuration.

A: We do have script files for specific Intel® Atom™ based target boards which will set the environment variables and bring up the Intel® JTAG debugger GUI, provided you have connected either an Intel® ITP-XDP3 or Macraigor* usb2demon onto the target board.

Q: Our focus is primarily board bring-up and BIOS/UEFI porting efforts.

A: The Intel® JTAG Debugger included with Intel® System Studio supports BIOS/UEFI debugging. As UEFI code is usually compiled in Microsoft* COFF PDB format, our debugger relies on Microsoft Visual Studio* redistributables for symbol info resolution, and thus the Windows* hosted version of the Intel® JTAG Debugger should be used.

Q: Is the mac probe available with the purchase of the System Studio?

A: Intel System Studio is just a software tools suite and NO hardware is shipped along with this product. To order the Intel® ITP-XDP3 device, please contact the Hibbert Group* at Intelvtg@hibbertgroup.com and request the VTG order form.

To order Macraigor* usb2Demon*, Go to http://www.macraigor.com/usbDemon.htm and select the Intel® Atom™ Processor target with the appropriate 24, 31 or 60 pin connector for your target device.

Q: What does SoC mean?

A: SoC - System on Chip, Please refer to http://www.intel.com/pressroom/kits/soc/ for more information.

Q: How do I collect data from the target device to the host?

A: Target board analysis can be done using Intel® VTune™ Amplifier. You have to set up an SSH (Secure Shell) connection, and the analysis data is automatically copied back onto the host machine. Please refer to the remote collection article for more details:

http://software.intel.com/en-us/articles/how-to-use-remote-collection-on-intel-vtune-amplifier-2013-for-systems

Q: Does the Intel JTAG debugger support all controllers?

A: The Intel® System Studio JTAG debugger supports only the Intel® Atom™ x86 architecture as of today. With the 2014 release it will also support Intel® Core™ processors code-named “Haswell” and newer.

Q: Does Intel System Studio support the GCC environment/GCC commands?

A: Yes, Intel® System Studio tools are compatible with GCC compiled binaries and the Intel® C++ Compiler accepts GCC command-line options.

Q: Can we develop bare metal applications with this studio, i.e., when the target does not have any operating system?

A: The build tools included with Intel® System Studio are targeted towards a variety of embedded Linux* flavors such as Yocto Project* and Wind River Linux*. The Intel® C++ Compiler relies on the presence of the GNU binutils. It is not intended for bare metal applications. You can, however, use the Intel® JTAG debugger to debug and analyze your bare metal code.

Q: What about Ivy bridge and Haswell support?

A: The build tools, including the Intel® C++ Compiler, already support the latest generation of Intel® Core™ processors. Optimizations will be further improved in future versions of Intel® System Studio. With Intel® System Studio 2014 we add support for the processor code-named “Haswell” in the analysis and system debug tools as well.

Q: Is the studio helpful in developing Linux drivers?

A: Yes, it can be used to build, optimize, and debug Linux drivers and kernel modules.

Q: What do you mean by SSH connection?

A: http://en.wikipedia.org/wiki/Secure_Shell 

Q: What is the Yocto Project?

A: Yocto Project* is a Linux Foundation open source framework for embedded Linux* development. It is OpenEmbedded-compatible and provides reference OS builds as well as the setup and build environment to build your own compatible custom embedded Linux* - https://www.yoctoproject.org/

Q: Is the Wind River ICE JTAG probe supported?

A: No, as of now we do not support the Wind River* ICE JTAG probe.

Q: If we want to implement our own load balancing mechanism, how can we bypass Cilk's automatic load balancing?

A: Load balancing is done by default by Intel® Cilk™ Plus (which is part of the Intel compiler). Since this is done automatically by the Cilk Plus runtime, it is not advisable to change this behavior. However, you can get the source of the runtime under the GPL v3 license: https://www.cilkplus.org/which-license.

Q: Does this support all versions of Linux, like Fedora and Ubuntu?

A: We have validated and listed some of the supported host and target OSs here: http://software.intel.com/en-us/articles/intel-system-studio-system-requirements. You may want to ensure that the Linux kernel version is 2.6.32 or above.

Q: Does the trial version have limited code compilation?

A: The feature set of the evaluation version is identical to the commercial version; the only limitation is that the license expires after 30 days.

Q: Does this studio support C language programming, or does it include another C compiler?

A: The Intel C/C++ Compiler, which is part of Intel® System Studio, supports the C and C++ languages.

Q: How can you analyze multi-core processing?

A: Intel® System Studio’s VTune Amplifier for Systems and Inspector for Systems can help you find the CPU usage on a multi-core processor, using concurrency analysis and threading analysis respectively.

Q: Will OpenMP be supported by this compiler?

A: No, OpenMP is not supported as a language extension by the Intel® C++ Compiler for Embedded Linux* OS included with Intel® System Studio. Pre-existing OpenMP based binaries and shared objects that rely on OpenMP runtimes will, however, be executed correctly. We are open to considering requests if more developers ask for OpenMP support in the future.

Q: We observed a situation of 100% CPU usage and wanted to analyze the root cause.

A: You can use Intel® VTune Amplifier for Systems.

Q: Does the studio software take care of cross compilation?

A: Yes, the Intel® C++ Compiler supports sysroot and chroot based cross-build setups. For cross compilation using sysroot we offer a "-platform" compiler option on the host that already takes care of the cross-build integration for multiple target Linux* OSs. The provided cross-build integration can also be used as a template for other cross-build environments. Please refer to the detailed procedure on cross compilation here: http://software.intel.com/en-us/articles/using-intel-c-compiler-for-embedded-system

Using Intel® SDE's chip-check feature


Intel® SDE includes a software validation mechanism to restrict executed instructions to a particular microprocessor. This is intended to be a helpful diagnostic tool for use when deploying new software. Use chip check when you want to make sure that your program is not using instruction features that are not present on a specific microarchitecture implementation.

In the output of "sde -long-help" there is a section describing the controls for this feature:

-chip_check  [default ]
        Restrict to a specific XED chip.
-chip_check_die  [default 1]
        Die on errors. 0=warn, 1=die
-chip_check_disable  [default 0]
        Disable the chip checking mechanism.
-chip_check_emit_file  [default 0]
        Emit messages to a file. 0=no file, 1=file
-chip_check_file  [default sde-chip-check.txt]
        Output file chip-check errors.
-chip_check_jit  [default 0]
        Check during JIT'ing only. Checked code might not be executed due to
        speculative JIT'ing, but this mode is a little faster.
-chip_check_list  [default 0]
        List valid chip names and exit.
-chip_check_stderr  [default 1]
        Try to emit messages to stderr. 0=no stderr, 1=stderr
-chip_check_vsyscall  [default 0]
        Enable the chip checking checking in the vsyscall area.

To list all the chips that Intel SDE knows about, you can use "sde -chip-check-list". The output will vary depending on the version of Intel SDE you use. For the current version, you will see this output:

% kits/current/sde -chip-check-list -- /bin/ls
        INVALID             I86           I86FP            I186 
         I186FP        I286REAL            I286         I2186FP 
       I386REAL            I386          I386FP        I486REAL 
           I486     PENTIUMREAL         PENTIUM  PENTIUMMMXREAL 
     PENTIUMMMX         ALLREAL      PENTIUMPRO        PENTIUM2 
       PENTIUM3        PENTIUM4      P4PRESCOTT   P4PRESCOTT642 
   P4PRESCOTT2M           CORE2          PENRYN        PENRYN_E 
        NEHALEM        WESTMERE         BONNELL        SALTWELL 
     SILVERMONT             AMD             KNL       IVYBRIDGE 
    SANDYBRIDGE         SKYLAKE       BROADWELL         HASWELL 
       GOLDMONT             ALL 

To limit instructions to the Intel Westmere microarchitecture, use "sde -chip-check WESTMERE -- yourapp". If you do not want to limit instructions to a particular chip, use "-chip-check ALL". To limit the allowed instructions to just those implemented on the current Intel(R) Quark processors, use "-chip-check PENTIUM".

By default, Intel SDE emits warnings to a file called sde-chip-check.txt and also to stderr (if the application has not closed stderr). This behavior can be customized using the above knobs.

On Linux, there are instructions in the virtual system call area that are not under direct user control. To avoid flagging those instructions, the chip-check mechanism defaults to ignoring instructions in that region. If you want to check the instructions in the vsyscall area for some reason, use "-chip-check-vsyscall".

There is a performance cost for using the chip-check feature. At instrumentation (JIT) time, we must do an extra check on each instruction, and at run time every instruction that is not valid for the particular chip gets code inserted before it to trigger the error (or warning). There is also JIT-time work to find the function symbols (if any) associated with the unwanted instructions.

Using the "-chip-check-jit" option, the JIT instrumentor can report disallowed instructions at JIT instrumentation time. This may be too aggressive as the JIT speculates and the JITted code may never execute due to the dynamic control flow in the program. It is more conservative though if you want to be sure there are no unwanted instructions.

Example

Here is a little example of the error message you get when your program does not have symbols:

% kits/current/sde -chip-check PENTIUM -- /bin/ls
TID 0 SDE-ERROR: Executed instruction not valid for specified chip (PENTIUM): 0x2b3db3fdc447: cmovnbe rdx, rax
Instruction bytes are: 48 0f 47 d0 

If your program was compiled in debug mode or has function symbols, Intel SDE tries to provide additional information when it reports problems. This can be very useful for figuring out where the unwanted instructions are coming from.

kits/current/sde -chip-check IVYBRIDGE -- tests/a.out
TID 0 SDE-ERROR: Executed instruction not valid for specified chip (IVYBRIDGE): 0x400623: vfmadd231sd xmm1, xmm2, xmm3

Function: main
File Name: /tmp/fma1.c:36
Instruction bytes are: c4 e2 e9 b9 cb 

In binaries without debug symbols, sde will still show the function name when it can be located.

Finding more errors

By default, Intel SDE issues the above error message and terminates when it encounters an unwanted instruction. Sometimes there is more than one unwanted instruction in a large program. By using the "-chip-check-die 0" option, Intel SDE will continue to execute after reporting an error.
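For example, a command along the lines of "sde -chip-check IVYBRIDGE -chip-check-die 0 -- tests/a.out" would report every disallowed instruction it encounters instead of stopping at the first one.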

If an unwanted instruction is not executed, it will not be flagged by Intel SDE. Intel SDE is built upon the Pin dynamic binary instrumentation system; Pin is a JIT (just-in-time compiler) for the application being run. As with all path-based checking mechanisms, you must exercise any code paths you want to be checked.

Intel® Xeon® Processor E5-2600 V2 Product Family Technical Overview


Download Article


Intel® Xeon® Processor E5-2600 V2 Product Family Technical Overview [PDF 780KB]

Contents

  1. Executive Summary
  2. Introduction
  3. Intel Xeon processor E5-2600 V2 product family enhancements
    1. Intel® Secure Key (DRNG)
    2. Intel® OS Guard (SMEP)
    3. Intel® Advanced Vector Extensions (Intel® AVX): Float 16 Format Conversion
    4. Advanced Programmable Interrupt Controller (APIC) Virtualization (APICv)
    5. PCI Express Enhancements
  4. Conclusion
  5. About the Author

1. Executive Summary


The Intel® Xeon® processor E5-2600 V2 product family, codenamed “Ivy Bridge EP”, is a 2-socket platform based on Intel’s most recent microarchitecture. Ivy Bridge is the 22-nanometer shrink of the Intel® Xeon® processor E5-2600 (codenamed “Sandy Bridge EP”) microarchitecture. This product brings additional capabilities for data centers: more cores and more memory bandwidth. As a result, platforms based on the Intel Xeon processor E5-2600 V2 product family will yield up to 50% improvement in performance1 compared to the previous generation “Sandy Bridge EP”.

2. Introduction


The Intel Xeon processor E5-2600 V2 product family is based on Ivy Bridge EP microarchitecture, an enhanced version of the Sandy Bridge EP microarchitecture (http://software.intel.com/en-us/articles/intel-xeon-processor-e5-26004600-product-family-technical-overview). The platform supporting the Intel Xeon processor E5-2600 V2 product family is named “Romley.” This paper discusses the new features available in the Intel Xeon processor E5-2600 V2 product family compared to the Intel Xeon processor E5-2600 product family. Each section includes information about what developers need to do to take advantage of new features for improving application performance and security.

3. Intel Xeon processor E5-2600 V2 product family enhancements


Some of the new features that come with the Intel Xeon processor E5-2600 V2 product family include:

  1. 22-nm process technology
  2. Security: Intel® Secure Key (DRNG)
  3. Security: Intel® OS Guard (SMEP)
  4. Intel® Advanced Vector Extensions (Intel® AVX): Float 16 Format Conversion
  5. Virtualization: APIC Virtualization (APICv)
  6. PCI Express* (PCIe): Support for atomic operation, x16 Non Transparent Bridge


Figure 1. The Intel® Xeon® processor E5-2600 V2 product family Microarchitecture

Figure 1 shows a block diagram of the Intel Xeon processor E5-2600 V2 product family microarchitecture. Processors in the family have up to 12 cores (compared to 8 cores in the predecessor), which bring additional computing power to the table. They also have 50% more cache (30 MB) and more memory bandwidth. With the 22-nm process technology, the Intel Xeon processor E5-2600 V2 product family has lower idle power and is capable of delivering 25% more performance2 while consuming less power compared to the earlier version.

Table 1 shows a comparison of the Intel Xeon processor E5-2600 V2 product family features compared to its predecessor, the Intel Xeon processor E5-2600.

Table 1. Comparison of the Intel® Xeon® processor E5–2600 product family to the Intel® Xeon® processor E5–2600 V2 product family

1 Requires a system with Intel® Turbo Boost Technology. Intel Turbo Boost Technology and Intel Turbo Boost Technology 2.0 are only available on select Intel® processors. Consult your PC manufacturer. Performance varies depending on hardware, software, and system configuration. For more information, visit http://www.intel.com/go/turbo

The rest of this paper discusses some of the main enhancements in this product family.

a. Intel® Secure Key (DRNG)

Intel Secure Key (Digital Random Number Generator: DRNG) is a hardware approach to high-quality and high-performance entropy and random number generation. The entropy source is thermal noise within the silicon.


Figure 2. Digital Random Number Generator using RDRAND instruction

Figure 2 shows a block diagram of the Digital Random Number Generator. The entropy source outputs a random stream of bits at a rate of 3 GHz that is sent to the conditioner for further processing. The conditioner takes pairs of 256-bit raw entropy samples generated by the entropy source and reduces them to a single 256-bit conditioned entropy sample. This is passed to a deterministic random bit generator (DRBG) that spreads the sample into a large set of random values, thus increasing the number of random values available from the module. The DRNG is compliant with ANSI X9.82 and NIST SP800-90 and is certifiable to FIPS-140-2.

Since DRNG is implemented in hardware as a part of the processor chip, both the entropy source and DRBG execute at processor clock speeds. There is no system I/O required to obtain entropy samples and no off-chip bus latencies to slow entropy transfer. DRNG is scalable enough to support heavy server application workloads and multiple VMs.

DRNG can be accessed through a new instruction named RDRAND. RDRAND takes the random value generated by the DRNG and stores it in a 16-, 32-, or 64-bit destination register (the size of the destination register determines the size of the random value). RDRAND support can be enumerated via CPUID.1.ECX[30], and the instruction is available at all privilege levels and operating modes. Performance of the RDRAND instruction depends on the bus infrastructure; it varies between processor generations and families.

Software developers can use the RDRAND instruction either through cryptographic libraries (OpenSSL* 1.0.1) or through direct application use (assembly functions). Intel® Compiler (starting with version 12.1), Microsoft Visual Studio* 2012, and GCC* 4.6 support the RDRAND instruction.
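Below is a minimal sketch of direct application use via the _rdrand32_step intrinsic (assuming a compiler that exposes the RDRAND intrinsics, e.g. GCC built with -mrdrnd; a retry loop is used because the instruction can transiently return no data):

/* RDRAND usage sketch via the _rdrand32_step intrinsic. Assumes hardware
   support has already been verified through CPUID.1.ECX[30]. */
#include <stdio.h>
#include <immintrin.h>

int get_random32(unsigned int *value)
{
    int retries = 10;
    while (retries-- > 0) {
        /* _rdrand32_step returns 1 when a random value was delivered. */
        if (_rdrand32_step(value))
            return 1;
    }
    return 0; /* no entropy currently available */
}

int main(void)
{
    unsigned int r;
    if (get_random32(&r))
        printf("random value: %u\n", r);
    return 0;
}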

Microsoft Windows* 8 uses the DRNG as an entropy source to improve the quality of output from its cryptographically secure random number generator. Linux* distributions based on the 3.2 kernel use DRNG inside the kernel for random timings. Linux distributions based on the 3.3 kernel use it to improve the quality of random numbers coming from /dev/random and /dev/urandom, but not the quantity. That being said, Red Hat Fedora* Core 18 ships with the rngd daemon enabled by default, which will use DRNG to increase both the quality and quantity of random numbers in /dev/random and /dev/urandom.

For more details on DRNG and RDRAND instruction, refer to the Intel DRNG Software Implementation Guide.

b. Intel® OS Guard (SMEP)

Intel OS Guard (Supervisor Mode Execution Protection: SMEP) prevents execution out of untrusted application memory while operating at a more privileged level. By doing this, Intel OS Guard helps prevent Escalation of Privilege (EoP) security attacks. Intel OS Guard is available in both 32-bit and 64-bit operating modes and can be enumerated via CPUID.7.0.EBX[7].


Figure 3. Pictorial description of Intel® OS Guard operation

Support for Intel OS Guard needs to be in the operating system (OS) or Virtual Machine Monitor (VMM) you are using. Please contact your OS or VMM providers to determine which versions include this support. No changes are required in the BIOS or application level to use this feature.

c. Intel® Advanced Vector Extensions (Intel® AVX): Float 16 Format Conversion

The “Sandy Bridge” microarchitecture introduced Intel AVX, a new 256-bit instruction set extension to Intel® SSE designed for applications that are floating-point (FP) intensive. The “Ivy Bridge” microarchitecture enhances this with the addition of float 16 format conversion instructions.


Figure 4. Intel® Advanced Vector Extensions Instruction Format

Intel Xeon processor E5-2600 V2 product family supports half-precision (16-bit) floating-point data types. Half-precision floating-point data types provide 2x more compact data representation than single-precision (32-bit) floating-point data format, but sacrifice data range and accuracy. In particular, half-floats may provide better performance than 32-bit floats when the 32-bit float data does not fit into the L1 cache. This format is widely used in graphics and imaging applications to reduce dataset size and memory bandwidth consumption.

Because the half-precision floating-point format is a storage format, the only operation performed on half-floats is conversion to and from 32-bit floats. The Intel Xeon processor E5-2600 V2 product family introduces two half-float conversion instructions: vcvtps2ph for converting from 32-bit float to half-float (4x speedup compared to an alternative Intel AVX code implementation), and vcvtph2ps for converting from half-float to 32-bit float (2.5x speedup compared to an alternative Intel AVX implementation). A developer can utilize these instructions without writing assembly by using the corresponding intrinsic functions: _mm256_cvtps_ph for converting from 32-bit float to half-float, and _mm256_cvtph_ps for converting from half-float to 32-bit float (_mm_cvtps_ph and _mm_cvtph_ps for 128-bit vectors).
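Below is a minimal sketch of these intrinsics in use (the buffer names are placeholders; _MM_FROUND_TO_NEAREST_INT selects the rounding mode for the float-to-half direction):

/* Half-float conversion sketch; each call converts 8 elements. The buffer
   names are placeholders for illustration. */
#include <immintrin.h>

void float_to_half8(const float *src, unsigned short *dst)
{
    __m256  f = _mm256_loadu_ps(src);                           /* 8 single-precision floats */
    __m128i h = _mm256_cvtps_ph(f, _MM_FROUND_TO_NEAREST_INT);  /* pack to 8 half-floats */
    _mm_storeu_si128((__m128i *)dst, h);
}

void half_to_float8(const unsigned short *src, float *dst)
{
    __m128i h = _mm_loadu_si128((const __m128i *)src);          /* 8 half-floats */
    __m256  f = _mm256_cvtph_ps(h);                             /* expand to 8 floats */
    _mm256_storeu_ps(dst, f);
}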

The compilers that support these instructions include the Intel Compiler (starting with version 12.1), Visual Studio 2012, and GCC 4.6. To direct the Intel Compiler to produce the conversion instructions for execution on the Intel Xeon processor E5-2600 V2 product family (or later), a developer can either compile the entire application with the -xCORE-AVX-I flag (/QxCORE-AVX-I on Windows), or use the Intel®-specific optimization pragma with target_arch=CORE-AVX-I for the individual function(s).

For more details on half precision floating point instructions, refer to: http://software.intel.com/en-us/articles/performance-benefits-of-half-precision-floats

d. Advanced Programmable Interrupt Controller (APIC) Virtualization (APICv)

A significant amount of performance overhead in machine virtualization is due to Virtual Machine (VM) exits. Every VM exit can cause a penalty of approximately 2,000 – 7,000 CPU cycles (see Figure 5), and a significant portion of these exits are for APIC and interrupt virtualization. Whenever a guest operating system tries to read an APIC register, the VM has to exit and the Virtual Machine Monitor (VMM) has to fetch and decode the instruction.

The Intel Xeon processor E5-2600 V2 product family introduces support for APIC virtualization (APICv); in this context, the guest OS can read most APIC registers without requiring VM exits. Hardware and microcode emulate (virtualize) the APIC controller, thus saving thousands of CPU cycles and improving VM performance.


Figure 5. APIC Virtualization

This feature must be enabled at the VMM layer: please contact your VMM supplier for their roadmap on APICv support. No application-level changes are required to take advantage of this feature.

e. PCI Express Enhancements

The Intel Xeon processor E5-2600 V2 product family supports PCIe atomic operations (as a completer). Today, message-based transactions are used for PCIe devices, and these use interrupts that can experience long latency, unlike CPU updates to main memory that use atomic transactions. An Atomic Operation (AtomicOp) is a single PCIe transaction that targets a location in memory space, reads the location’s value, potentially writes a new value back to the location, and returns the original value. This “read-modify-write” sequence to the location is performed atomically. AtomicOps were added in the PCIe 3.0 specification; FetchAdd, Swap, and CAS (Compare and Swap) are the new atomic transactions.

The benefits of atomic operations include:

  • Lower overhead for synchronization
  • Lock-free statistics (e.g. counter updates)
  • Performance enhancement for device drivers

The Intel Xeon processor E5-2600 V2 product family also supports x16 non-transparent bridging. All of these contribute to better I/O performance.

These PCIe features are inherently transparent and require no application changes.

For more details on these PCIe features, refer to:

5. Conclusion


In summary, the Intel Xeon processor E5-2600 V2 product family combined with the Romley platform provides many new and improved features that could significantly change your performance and power experience on enterprise platforms. Developers can make use of most of these new features without making any changes to their applications.

6. About the Author

Sree Syamalakumari is a software engineer in the Software & Service Group at Intel Corporation. Sree holds a Master's degree in Computer Engineering from Wright State University, Dayton, Ohio.

Intel, the Intel logo, and Xeon are trademarks of Intel Corporation in the U.S. and/or other countries.
Copyright © 2013 Intel Corporation. All rights reserved.
*Other names and brands may be claimed as the property of others.

1 Baseline Configuration and Score on SPECVirt_sc2013* benchmark: Platform with two Intel® Xeon® Processor E5-2690, 256GB memory, RHEL 6.4(KVM). Baseline source as of July 2013. Score: 624.9 @ 37 VMs. New Configuration: IBM System x3650 M4* platform with two Intel® Xeon® Processor E5-2697 v2, 512GB memory, RHEL 6.4(KVM). Source as of Sept. 2013. Score: 947.9 @ 57 VMs. For more information go to http://www.intel.com/performance.

2 Results have been estimated based on internal Intel analysis and are provided for informational purposes only. Any difference in system hardware or software design or configuration may affect actual performance. Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more information go to http://www.intel.com/performance.


How to install Intel® System Studio on Windows* OS


Topic: How to install Intel® System Studio 2014 Beta on Windows* OS

Objective: This article explains, step by step, how to install Intel® System Studio 2014 Beta on Windows* OS.

Installation: After downloading Intel® System Studio 2014 Beta – Windows* Host from the Intel® Registration Center (https://registrationcenter.intel.com), follow the steps below:

Step 1: Double-click the executable file that you downloaded. You should see a “User Account Control” window asking for your permission; press “Yes”.

Step 2: You should now see the installer being launched, as shown below:

Step 3: A “Welcome” page describes the products that will be installed; click “Next”.

Step 4: You should now see a licensing page with two options:

  1. “I have a serial number and want to activate and install my product”: Use this option if you have a serial number and an internet connection on the system on which you are installing.
  2. “Choose alternate activation”: Use this option if you want to activate the product using one of these three methods:
    1. Remote activation
    2. Using a license file
    3. Using a license manager

Step 5: You should now see an “Options” window that lets you customize and select the components you want to install. If you want all the listed components installed, click “Install” and skip to Step 7; otherwise, click “Customize” and continue with Step 6.

Step 6: Customize your installation. In this step you can deselect the components that you do not want installed; you can add removed components back later by modifying the installation.

As shown above, you can right-click a component and select “Do not install” if you do not want that component installed. After customizing the components, click “Next”.

Step 7: Click “Install”. Installing the whole product may take several minutes.

Step 8: Once the installation completes, click the “Finish” button.

Video : Installation of Intel® System Studio on Windows* Host


This video shows how to install Intel® System Studio 2014 Beta on a Windows* host.

Intel System Studio Installation - Windows Host.wmv

Size : 7.34MB

How to get the Intel System Studio 2014 Beta - Windows* Host package

Upon registering for the program you will receive a serial number and an email with a license file. You will need one of these two to complete the installation process. If you want to use the license file, you can point to it during the install, or you can copy it to C:\Program Files (x86)\Common Files\Intel\Licenses\ for automatic pickup by the installer.

Execute one of the installer executables:

w_cembd_2014.0.xxx.exe or w_cembd_2014.0.xxx_online.exe

The latter is an online installer that reduces the initial package download size.

Prerequisites for Eclipse* IDE Integration

The Intel® C++ Compiler and SVEN SDK can be automatically integrated into a preexisting Eclipse* CDT installation. The Eclipse* CDK, Eclipse* JRE and the Eclipse* CDT integrated development environment are not shipped with this package of the Intel® System Studio. The Eclipse* integration is automatically offered as one of the last steps of the installation process. If you decide against integration during an earlier install, simply rerun the Intel® System Studio installer.

When asked, point the installer to the installation directory of your Eclipse* install. Usually this is C:\Program Files (x86)\eclipse\.

The prerequisites for successful Eclipse integration are:

1. Eclipse* 3.7 (Indigo) – Eclipse* 4.3 (Kepler)

2. Eclipse* CDT 8.0 – 8.1

3. Java Runtime Environment (JRE) version 6.0 (also called 1.6) update 11 or later.

Programming for Multicore and Many-core Products including Intel® Xeon® processors and Intel® Xeon Phi™ coprocessors


Programming for Multicore and Many-core Products including Intel® Xeon® processors and Intel® Xeon Phi™ coprocessors (including language extensions for offloading to Intel® Xeon Phi™ coprocessors)


Abstract

The programming models in use today, used for multicore processors every day, are available for many-core coprocessors as well. Therefore, explaining how to program both Intel Xeon processors and the Intel Xeon Phi coprocessor is best done by explaining the options for parallel programming. This paper provides the foundation for understanding how multicore processors and many-core coprocessors are best programmed using a unified programming approach that is abstracted, intuitive and effective. This approach is simple and natural because it fits applications of today easily and yields strong results. When combined with the common base of Intel® architecture instructions utilized by Intel® multicore processors and Intel® many-core coprocessors, the result is performance for highly parallel computing with substantially less difficulty than with other less intuitive approaches.

Programs that utilize multicore processors and many-core coprocessors have a wide variety of options to meet varying needs. These options fully utilize existing widely adopted solutions, such as C, C++, Fortran, OpenMP*, MPI and Intel® Threading Building Blocks (Intel® TBB), and are rapidly driving the development of additional emerging standards such as OpenCL* as well as new open entrants such as Intel® Cilk™ Plus.

Introduction

Single core processors are a shrinking minority of all the processors in the world. Multicore processors, offering parallel computing, have displaced single core processors permanently. The future of computing is parallel computing, and the future of programming is parallel programming.

The methods to utilize multicore processors have evolved in recent years, offering more and better choices for programmers than ever. Nothing exemplifies this more than the rapid rise in popularity of Intel TBB or the industry interest and support behind OpenCL.

At the same time that multicore processors and programming methods are becoming common, Intel is introducing many-core processors that will participate in this evolution without sacrificing the benefits of Intel architecture. Additional capabilities that are new with many-core processors are addressed in a natural and intuitive manner: Intel® many-core processors allow use of the same tools, programming languages, programming models, execution models, memory models and behaviors as Intel’s multicore processors.

This paper explains the programming methods available for multicore processors and many-core processors with a focus on widely adopted solutions and emerging standards.

Parallel Programming Today

Since the goal of using Intel architecture in both multicore processors and many-core coprocessors is intuitive and common programming methods, it is important to first review where parallel programming for multicore stands today and understand where it is headed. Because of their common Intel architecture foundations, this will also precisely define the basis for parallel programming for many-core processors.

Libraries

Libraries provide an important abstract parallel programming method that needs to be considered before jumping into programming. Library implementations of algorithms including BLAS, video or audio encoders and decoders, Fast Fourier Transforms (FFT), solvers and sorters are important to consider. Libraries such as the Intel® Math Kernel Library (Intel® MKL) already offer advanced implementations of many algorithms that are highly tuned to utilize Intel® Streaming SIMD Extensions (Intel® SSE), Intel® Advanced Vector Extensions (Intel® AVX), multicore processors and many-core coprocessors. A program can start to get these benefits by adding a single call to an Intel MKL routine; Intel MKL includes support for industry-standard interfaces in both Fortran and C, such as the Linear Algebra PACKage (LAPACK). Standards, combined with Intel’s pursuit of high performance, make libraries an easy choice to utilize as the first preference in parallel programming.
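
As a minimal sketch of that “single call” idea (assuming Intel MKL is installed and linked; matrix sizes and the function name are chosen only for illustration), one call to the standard CBLAS interface that Intel MKL implements performs a threaded, vectorized matrix multiply:

#include <mkl.h>

/* C = A * B for n-by-n row-major matrices; MKL decides internally how to
   thread and vectorize the work. */
void multiply(const double *A, const double *B, double *C, int n)
{
   cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
               n, n, n,       /* M, N, K       */
               1.0, A, n,     /* alpha, A, lda */
               B, n,          /* B, ldb        */
               0.0, C, n);    /* beta, C, ldc  */
}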

When libraries do not solve specific programming needs, developers turn to programming languages that have been in use for many years.

None of the most popular programming languages were designed for parallel programming. This has brought about many proposals for new programming languages as well as extensions for the pre-existing languages. In the end, these experiences have led to the emergence of a number of widely deployed solutions for parallel programming using C, C++ and Fortran.

The most widely used abstractions for parallel programming are OpenMP (primarily C and Fortran), Intel Threading Building Blocks (primarily C++) and MPI (C, C++ and Fortran). These support a diverse range of processors and operating systems, making them truly versatile and reliable choices for programming.

Additionally, the native threading methods of the operating system are directly available for programmers. These interfaces, including POSIX threads (pthreads) and Windows* threads, offer a low level interface for full control but without the benefits of high level programming abstractions. These interfaces are essentially assembly language programming for parallel computing. Programmers have all but completely moved to higher levels of abstraction and abandoned assembly language programming. Similarly, avoiding direct use of threading models has been a strong trend that has accelerated with the introduction of multicore processors. This shift to program in “tasks” and not “threads” is a fundamental and critical change in programming habits that is well supported by the abstract programming models.

Most deployed parallel programming today is either done with one of the three most popular abstractions for parallelism (OpenMP, MPI or Intel TBB), or done using the raw threading interfaces of the operating system.

These standards continue to evolve and new methods are proposed. Today, the principal technical drivers of these evolutions are highly data parallel hardware and advancing compiler technology. Both of these driving forces are motivated by a strong desire to program at higher levels of abstraction so as to increase programmer productivity, leading to faster time-to-money and reduced development and maintenance costs.

Most Composable Parallel Programming Models

Learn more at http://Intel.com/go/parallel

For reasons explained in this paper, the most composable parallel programming methods are Intel TBB and Intel® Cilk™ Plus. They offer consistent advantages for effective abstract programming that yield performance and preserve programming investments. They provide recombinant components that can be selected and assembled for effective parallel programs. Even though they can be described and studied individually, they are best thought of as a collection of capabilities that are easily utilized both individually and together. This is incredibly important since it offers composability for mixing modules including libraries. Both have self-composability, which is not the case for threading and for OpenMP. Uses of OpenMP and OpenCL have limitations on composability, but are still preferable to the use of raw threads. One of the benefits of Intel architecture multicore processors and many-core coprocessors is the strong support for all these methods that is available, offering the solution that best fits your current and future programming needs.

OpenMP

Learn more at http://openmp.org/wp/

In 1996, the OpenMP standard was proposed as a way for compilers to assist in the utilization of parallel hardware. Now, after more than a decade, every major compiler for C, C++ and Fortran supports OpenMP. OpenMP is especially well suited for the needs of Fortran programs as well as scientific programs written in C. Intel is a member of the OpenMP work group and a leading vendor of implementations of OpenMP and supporting tools. OpenMP is applicable to both multicore and many-core programming.

OpenMP dot product example, in Fortran:

!$omp parallel do reduction(+: adotb)
   do j = 1, n
      adotb = adotb + a(j) * b(j)
   end do
!$omp end parallel do

 

OpenMP summation (reduction) example, in C:

#pragma omp parallel for reduction(+: s)
for (int i = 0; i < n; i++)
   s += x[i];

 

In the future, the OpenMP specification will expand to standardize the emerging controls for attached computing often called “offloading” or “accelerating.” Today, Intel offers non-standard extensions to OpenMP called “Language Extensions for Offload” (LEO). The OpenMP committee is reviewing LEO as well as a set of non-OpenMP offload directives for GPUs known as OpenACC, with an eye towards convergence to serve both Intel Xeon Phi coprocessors and GPUs.

Intel TBB

Learn more at http://threadingbuildingblocks.org/

Intel introduced Intel® TBB in 2006 and the open source project for Intel TBB was started in 2007. By 2009, it had grown in popularity to exceed that of OpenMP in terms of number of developers using it (per research from Evans Data Corp: http://www.evansdata.com/research/market_alerts.php, and support in subsequent research reports as well). Intel TBB is especially well suited for the needs of C++ programmers, and since OpenMP is designed to address the needs of C and Fortran developers there is virtually no competition between Intel TBB and OpenMP.  It is worth noting that for C++ programmers, using OpenMP and Intel TBB in the same program is possible as well.

Parallel function invocation example, in C++, using Intel TBB:

parallel_for(0, n,
   [=](int i) {
      Foo(a[i]);
   });

 

The emergence of Intel TBB, which does not directly require nor leverage compiler technology, emphasized the value of programming to tasks and led the way for wide acceptance of using task-stealing systems. Compiler technology continues to evolve to help address parallel programming and led to the creation of the Intel Cilk™ Plus project. Increased use of compiler technology is better able to unlock the full potential of parallelism. Intel remains a leading participant and contributor in the Intel TBB open source project as well as a leading supplier of Intel TBB support and supporting tools. Intel TBB is applicable to multicore and many-core programming.

MPI

Learn more at http://Intel.com/go/mpi

For programmers utilizing a cluster, in which processors are connected by the ability to pass messages but not always the ability to share memory, the Message Passing Interface (MPI) is the most common programming method. In a cluster, communication continues to use MPI, as it does today, regardless of whether a node has many-core coprocessors or not.

Today’s MPI based programs move easily to Intel Xeon Phi coprocessor based systems because the Intel coprocessors support ranks that can talk to other coprocessor ranks and multicore (e.g., Intel Xeon® processors) ranks. An Intel Xeon Phi coprocessor, like a multicore processor, may create as many ranks as the programmer desires. Such ranks communicate with other ranks regardless of whether they are on multicore or many-core processors.

Because Intel Xeon Phi coprocessors are general-purpose, MPI jobs run on the coprocessors. This is very powerful because no algorithmic recoding or refactoring is required to get working results from an existing MPI program.  The general capabilities of the coprocessors combined with the power of MPI support on the Intel Xeon Phi coprocessors produce immediate results in a manner that is intuitive for MPI programmers.
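
As a minimal sketch (generic MPI, nothing coprocessor-specific), the same source produces ranks whether the process is launched on a host processor or on a coprocessor, and each rank simply reports where it runs:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
   int rank, size, len;
   char name[MPI_MAX_PROCESSOR_NAME];

   MPI_Init(&argc, &argv);
   MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this rank's id            */
   MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total number of ranks     */
   MPI_Get_processor_name(name, &len);     /* host or coprocessor name  */
   printf("rank %d of %d on %s\n", rank, size, name);
   MPI_Finalize();
   return 0;
}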

The widely used Intel® MPI library offers both high performance and support for virtually all interconnects. The Intel MPI library supports both multicore and many-core systems, creating ranks on multicore processors and many-core coprocessors in a fashion that is familiar and consistent with MPI programming today.

MPI, on Intel Xeon Phi coprocessors, composes with other threading models (e.g., OpenMP, Intel TBB, Intel® Cilk™ Plus), as has become common on systems based on multicore processors.

Intel is a leading vendor of MPI implementations and tools. MPI is applicable to multicore and many-core programming.

Parallel Programming Emerging Standards

For data parallel hardware, the emergence of support for certain extensions to C and C++ offers important options for developers and addresses programmer productivity.

Intel® Cilk™ Plus

Learn more at http://cilk.com

Intel introduced Intel Cilk Plus in late 2010. Built on research from M.I.T. and product experience from industry leader Cilk Arts, Intel implemented support for task stealing in compilers for Linux* and Windows. Intel has published full specifications for Intel Cilk Plus to help enable other implementations, as well as optional usage of the Intel runtime or construction of interchangeable runtimes via API compliance. Intel is actively working with other compiler vendors to broaden support in the future. Intel is proud to be the industry’s leading supporter of Intel Cilk Plus with products and tools.

Intel Cilk Plus provides three new keywords, special support for reduction operations, and data parallel extensions. The keyword cilk_spawn can be applied to a function call, as in x = cilk_spawn fib(n-1), to indicate that the function fib can execute concurrently with the subsequent code. The keyword cilk_sync indicates that execution has to wait until all spawns from the function have returned. The use of the function as a unit of spawn makes the code readable, relies on the baseline C/C++ language to define scoping rules of variables, and allows Intel Cilk Plus programs to be composable.

Parallel spawn in a recursive fibonacci computation, in C, using Intel Cilk Plus:

#include <cilk/cilk.h>   /* maps cilk_spawn and cilk_sync to the compiler keywords */

int fib (int n) {
   if (n < 2) return 1;
   else {
      int x, y;
      x = cilk_spawn fib(n-1);
      y = fib(n-2);
      cilk_sync;
      return x + y;
   }
}

 

Cilk offers exceptionally intuitive and effective compiler support for C and C++ programmers. Cilk is very easy to learn and poised to be widely adopted. A regular “for” loop, without inter-loop dependencies, can be transformed into a parallel loop by simply changing the keyword “for” into “cilk_for.” This indicates to the compiler that there is no ordering among the iterations of the loop.

Parallel function invocation, in C, using Intel Cilk Plus:

cilk_for (int i = 0; i < n; ++i) {
   Foo(a[i]);
}

 

Cilk programmers still utilize Intel TBB for certain algorithms or features where new compiler keywords or optimizations are not needed, such as the thread aware memory allocator or a sort routine. Intel Cilk Plus is applicable to multicore and many-core programming.

C/C++ data parallel extensions

Learn more at http://cilk.com

Debate about how to extend C (and C++) to directly offer data parallel extensions is ongoing. Implementations, experience and adoption are important steps toward standardization. Intel has implemented extensions for fundamental data parallelism as part of Intel Cilk Plus for Linux, Windows and Mac* OS X systems. Intel is actively working with other compilers to offer support in the future. An intuitive syntactic extension, similar to the array operations of Fortran 90, is provided as a key element of Intel Cilk Plus and allows simple operations on arrays. The C/C++ languages do not provide a way to express operations on arrays. A programmer has to write a loop and express the operation in terms of elements of the arrays, creating unnecessary explicit serial ordering. A better opportunity exists to write a[:] = b[:] + c[:]; to indicate the per-element additions without specifying unnecessary serial ordering. These simplified semantics free a compiler to generate vector code instead of non-optimal scalar code.
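
As a minimal sketch of the array notation just described (assuming a compiler with Intel Cilk Plus support; the explicit start:length form is used so the example works with pointers whose extent is known only at run time):

void vector_add(float *a, const float *b, const float *c, int n)
{
   a[0:n] = b[0:n] + c[0:n];   /* per-element addition, no serial ordering implied */
}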

An additional method to avoid unintended serialization allows a programmer to write a scalar function in standard C/C++ and declare it as a “SIMD enabled function” (occasionally this has previously been called by the less descriptive name “elemental function”). This will trigger the compiler to generate a short vector version of that function, which instead of operating on a single set of arguments to the function, will operate on short vectors of arguments by utilizing the vector registers and vector instructions. In common cases, where the control flow within the function does not depend on data values, the execution of the short vector function can yield a vector of results in roughly the same time it would take the regular function to produce a single result.

SIMD enabled function, in C++, using Intel Cilk Plus:

__declspec (vector) void saxpy(float a, float x, float &y)
{
   y += a * x;
}

 

Intel is supporting these syntactic extensions for C and C++ with products and tools as well as discussions with other compiler vendors for wider support. C, C++ and data parallel extensions are applicable to multicore and many-core programming.

OpenCL*

Learn more at http://Intel.com/go/opencl

OpenCL was first proposed by Apple* and then moved to an industry standards body of which Intel is a participant and supporter. OpenCL offers a “close to the hardware” interface, providing some important abstraction and substantial control coupled with wide industry interest and commitment. OpenCL may require the most refactoring of any of the solutions covered in this whitepaper, specifically refactoring based on advanced knowledge of the underlying hardware. Results from refactoring work may be significant for multicore and many-core performance, and the resulting performance may or may not be possible without such refactoring. A goal of OpenCL is to make an investment in refactoring productive when it is undertaken. Solutions other than OpenCL may offer alternatives to avoid the need for refactoring (which is best done when based on advanced knowledge of the underlying hardware).

Simple per element multiplication using OpenCL:

kernel void dotprod(global const float *a,
                    global const float *b,
                    global float *c)
{
   int myid = get_global_id(0);
   c[myid] = a[myid] * b[myid];
}

 

Intel is a leading participant in the OpenCL standard efforts, and a vendor of solutions and related tools with early implementations available today. OpenCL is applicable to multicore, many-core and GPU programming although the code within an OpenCL program is usually separate or duplicated for each target. Intel currently ships OpenCL support for both Intel multi-core processors (using Intel SSE and Intel AVX instructions) and Intel® HD Graphics (integrated graphics available as part of many Third Generation Intel® Core™ processors).

Composability Using Multiple Models

Composability is an important concept. With multiple programming options to fit differing needs, it is essential that these methods not be mutually exclusive. The abstract programming methods discussed above can be mixed in a single application. By offering newer programming models that support composable programming, programmers are freed from subtle and unproductive limitations on the mixing and matching of programming methods in a single application.

The most composable methods are Intel TBB and Intel Cilk Plus (including the C/C++ data parallel extensions). Use of OpenMP and OpenCL have limitations on composability, but are still preferable to the use of raw threads.

Intel TBB and Intel Cilk Plus provide recombinant components that can be selected and assembled for effective parallel programs. This is incredibly important since it offers composability for mixing modules including libraries. Both have self-composability, which is not the case for threading and for OpenMP or OpenCL.

Harnessing Many-core

Combining the power of both multicore and many-core, and utilizing them together, offers enormous possibilities.

Intel Xeon Phi coprocessors are designed to offer power efficient processing for highly parallel work while remaining highly programmable. Platforms containing both multicore processors and Intel Xeon Phi coprocessors can be referred to as heterogeneous platforms. Such a heterogeneous platform offers the best of both worlds, multicore processors that are very flexible to handle general-purpose serial and parallel workloads as well as more specialized many-core processing capabilities for highly parallel workloads. A heterogeneous platform can be programmed as such and utilize a programming model to manage copying of data and transfer of control.

Applications are still built with a single source base. The versatility of  Intel architecture multicore processors and many-core coprocessors allows for programming that is both intuitive and effective.

Explicit vs. Implicit use of Many-core

Many-core processors may be used implicitly through the use of libraries, like Intel MKL, by provisioning code to detect and utilize many-core processors when present. Explicit controls for Intel libraries are available to the developer, but the simple approach of relying on a library to decide if and when to use the attached many-core coprocessors can be quite effective.

Additional programming opportunities are possible by explicit directions from the programmer in the source code. Writing an application to explicitly utilize many-core is done by writing a heterogeneous program. This program would consist of writing a parallel application and splitting the work between the multicore processors and many-core coprocessors.

Even with explicit control, Intel has designed the extensions to be flexible enough to work if no many-core processors are present and to also be ready for a converged future.  These two benefits are incredibly important. First, a single source program can provide direction to offload to an Intel Xeon Phi coprocessor. However at runtime, if the coprocessor is not present on the system being utilized, the use of  Intel architecture on both the multicore processors and many-core coprocessors means that the code available for offloading to a coprocessor can be executed seamlessly on either type of processor.

Offloading

The reality of today’s hardware is that a heterogeneous platform contains multiple distinct memory spaces, one (or more in a cluster) for the multicore processors and one for each many-core processor. The connection between multicore processors and many-core coprocessors can be a bottleneck that needs some consideration.

There are two approaches to utilizing such a heterogeneous platform. One approach treats the memory spaces as wholly distinct, and uses offload directives to move control and data to and from the multicore processors. Another approach simplifies data concerns by utilizing a software illusion of shared memory called MYO to allow sharing between multicore processors and many-core coprocessors that reside in a single system or a single node on a cluster. MYO is an acronym for “Mine Yours Ours” and refers to the software abstraction that shares memory within a system for determining current access and ownership privileges.

The first approach exposes completely that the multicore processors and many-core processors do not share memory. Compiler support of directives for this execution model is able to free the programmer from specifying the low-level details of the system, while exposing the fundamental property that the target is heterogeneous and leaving the programmer to devote their time to solving harder problems.

Simple offload, in Fortran:

!dir$ offload target(MIC1)
!$omp parallel do
      do i=1,10
         A(i) = B(i) * C(i)
      enddo
!$omp end parallel do

 

The compiler provides a pragma for offload (#pragma offload) that a programmer can use to indicate that the subsequent language construct may execute on the Intel Xeon Phi coprocessor. The pragma also offers clauses that allow the programmer to specify data items that need to be copied between processor and coprocessor memories before the offloaded code executes. The clauses also allow the developer to specify data that should be copied back to multicore processor memory afterwards. The offload pragma is available for C, C++ and Fortran.

Simple offload, in C, with data transfer:

float *a, *b, *c;

#pragma offload target(MIC1) \
   in(a, b : length(s)) \
   out(c : length(s) alloc_if(0))
for (int i = 0; i < s; i++) {
   c[i] = a[i] + b[i];
}

 

An alternate approach is a run time user mode library called MYO. MYO allows synchronization of data between the multicore processors and an Intel Xeon Phi coprocessor, and with compiler support enabling allocation of data at the same virtual addresses. The implication is that data pointers can be shared between the multicore and many-core memory spaces. Copying of pointer based data structures such as trees, linked lists, etc. is supported fully without the need for data marshaling. To use the MYO capability, the programmer will mark data items that are meant to be visible from both sides with the _Cilk_shared keyword, and use offloading to invoke work on the Intel Xeon Phi coprocessor. The statement x = _Offload func(y); means that the function func() is executed on the Intel Xeon Phi coprocessor, and the return value is assigned to the variable x. The function may read and modify shared data as part of its execution, and the modified values will be visible for code executing on all processors.
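
As a minimal sketch, using the keywords exactly as named above (keyword spellings have varied across compiler versions, so treat this as illustrative rather than a definitive syntax reference):

_Cilk_shared int data[1000];         /* visible to both host and coprocessor        */

_Cilk_shared int sum_data(int n)     /* function that may run on either side        */
{
   int s = 0;
   for (int i = 0; i < n; i++)
      s += data[i];
   return s;
}

void example(void)
{
   int x = _Offload sum_data(1000);  /* execute on the coprocessor; result returned */
}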

The offload approach is very explicit and fits some programs quite well. The MYO approach is more implicit in nature and has several advantages. MYO allows copying of C++ classes without marshaling, which is not supported using offload pragmas. Importantly, MYO does not copy all shared variables upon synchronization. Instead, it only copies the values that have changed between two synchronization points.

These offload programming methods, while designed to allow control to direct work to many-core processors, are applicable to multicore and many-core programming so as to allow code to be highly portable and long lasting even as systems evolve. Source code will not need to differ for systems with and without many-core processors.

Additional Offload Capabilities

Both the keyword and pragma mechanisms perform the copying of the data triggered by the invocation of work on the Intel Xeon Phi coprocessor. Future directive options allow initiation of data copying ahead of invoking computation in order to be able to schedule other work while data is being copied.

Since systems may be configured with multicore processors, and more than one Intel Xeon Phi coprocessor per node, additional language support will allow the programmer to choose between forcing offloading or allowing a run time determination to be made. This option offers the potential for more dynamic and optimal decisions depending on the environment.

Standards

By utilizing Intel architecture instructions on multicore and many-core, programming tools and models are best able to serve both. With insights and use of the right models, a single source base can be constructed that is well equipped to utilize multicore processor systems, heterogeneous systems and future converged systems in an intuitive and effective manner.  This can be accomplished with a single source base in familiar and current programming languages.

Tried and true solutions, including C, C++, Fortran, OpenMP, MPI and Intel TBB apply to these  Intel architecture multicore and many-core systems.

Emerging efforts including Intel Cilk Plus, offload extensions and OpenCL are strongly supported by Intel, and are poised for broader adoption and support in the future.

The path to standardization starts with strong products and published specifications, progresses to users (customers) and support by additional vendors. Viable standards will follow. With OpenCL, Intel Cilk Plus, and offload extensions, the product support and specifications exist and customer usage is well under way. It is reasonable to expect that wider support and standards refined based on user experiences will follow.

Summary

By utilizing Intel architecture and industry-standard programming tools, multicore processors and many-core coprocessors offer parallel programming methods that can be applied across both. These methods can employ a single source code base using familiar tools, programming languages and previous source code investments. Current and emerging solutions allow applications to grow into a single code base that best utilizes multicore processors and many-core coprocessors together. The methods available to utilize multicore and many-core parallelism offer performance while preserving investments and offering intuitive programming methods.

Standards play an important role in programming methods. Intel has invested heavily to support and implement standard programming models and methods. In addition, Intel has been a leader in the evolution of standards to solve new challenges.

When programming for Intel Xeon Phi coprocessors, developers can harness the power of the coprocessors in a maintainable, performant application that is highly portable, scales to future architectures, and fully supports multicore systems with the same code.

Clusters of multicore processors and many-core coprocessors, organized in nodes, will be able to take advantage of this very rich set of tools and programming models available for Intel architecture in an intuitive, maintainable and effective manner.

The ability to utilize existing developer tools, standards, performance and offer flexibility puts Intel multi-core and Intel many-core solutions in a class of their own.

About the Author

James Reinders, Director, Software Evangelist, Intel Corporation

James Reinders is a senior engineer who joined Intel Corporation in 1989 and has contributed to projects including systolic arrays systems WARP and iWarp, the world's first TeraFLOP/sec supercomputer (ASCI Red), and the world’s first TeraFLOP/sec single-chip computing device known at the time as Knights Corner and now as the first Intel® Xeon Phi™ Coprocessor, as well as compilers and architecture work for multiple Intel® processors and parallel systems. James has been a leader in the emergence of Intel as a major provider of software development products, and serves as their chief software evangelist. James is the author of “Intel Threading Building Blocks” from O'Reilly Media. It has been translated to Japanese, Chinese and Korean. James is coauthor of “Structured Parallel Programming,” ©2012, from Morgan Kaufmann Publishing and "Intel® Xeon Phi™ Coprocessor High Performance Programming,"©2013, from Morgan Kaufmann Publishing. James has published numerous articles, contributed to several books. James received his B.S.E. in Electrical and Computing Engineering and M.S.E. in Computer Engineering from the University of Michigan.

Notices

INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.

UNLESS OTHERWISE AGREED IN WRITING BY INTEL, THE INTEL PRODUCTS ARE NOT DESIGNED NOR INTENDED FOR ANY APPLICATION IN WHICH THE FAILURE OF THE INTEL PRODUCT COULD CREATE A SITUATION WHERE PERSONAL INJURY OR DEATH MAY OCCUR.

Intel may make changes to specifications and product descriptions at any time, without notice. Designers must not rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined." Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them. The information here is subject to change without notice. Do not finalize a design with this information.

The products described in this document may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.

Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your product order.

Copies of documents which have an order number and are referenced in this document, or other Intel literature, may be obtained by calling 1-800-548-4725, or go to: http://www.Intel.com/design/literature.htm

Intel, the Intel logo, Cilk, VTune, Xeon and Xeon Phi are trademarks of Intel Corporation in the U.S. and other countries.

*Other names and brands may be claimed as the property of others.

 Copyright© 2013 Intel Corporation. All rights reserved.

Optimization Notice

Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Notice revision #20110804

 

Intel® System Studio - Solutions, Tips and Tricks


Using Intel® Math Kernel Library with MathWorks* MATLAB* on Intel® Xeon Phi™ Coprocessor System


Overview

This guide is intended to help developers use the latest version of Intel® Math Kernel Library (Intel® MKL) with MathWorks* MATLAB* on Intel® Xeon Phi™ Coprocessor System.

Intel MKL is a computational math library designed to accelerate application performance and reduce development time. It includes highly optimized and threaded dense and sparse Linear Algebra routines, Fast Fourier transforms (FFT) routines, Vector Math routines, and Statistical functions for Intel processors and coprocessors.

MATLAB is an interactive software program that performs mathematical computations and visualization. Internally, MATLAB uses Intel MKL Basic Linear Algebra Subroutines (BLAS) and Linear Algebra PACKage (LAPACK) routines to perform the underlying computations when running on Intel processors.

Intel MKL now includes a new Automatic Offload (AO) feature that enables computationally intensive Intel MKL functions to offload partial workload to attached Intel Xeon Phi coprocessors automatically and transparently.

As a result, MATLAB performance can benefit from Intel Xeon Phi coprocessors via the Intel MKL AO feature when problem sizes are large enough to amortize the cost of transferring data to the coprocessors. The article describes how to enable Intel MKL AO when Intel Xeon Phi coprocessors are present within a MATLAB computing environment.

Prerequisite

Prior to getting started, obtain access to the following software and hardware:

  1. The latest version of Intel MKL or Intel® Composer XE (which includes the Intel® C/C++ Compiler and Intel MKL), available from https://registrationcenter.intel.com/regcenter/register.aspx; or register at https://software.intel.com/en-us/ to get a free 30-day evaluation copy
  2. The latest version of MATLAB, available from http://www.mathworks.com/products/matlab/
  3. An Intel Xeon Phi Coprocessor Development System as described at https://software.intel.com/en-us/mic-developer

The 64-bit versions of Intel MKL and MATLAB should be installed at least on the development system. This article was created based on MATLAB R2014a and Intel MKL 11.1 for Windows* (update 1 and update 2) on the following system+:

Host machine: Intel® Xeon® CPU E5-2697 v2, 2 Twelve-Core CPUs (30MB LLC, 2.7GHz), 128GB of RAM; OS: Windows Server 2008 R2 Enterprise

Coprocessors: 2 Intel® Xeon Phi™ Coprocessors 7120A, each with 61 cores (30.5MB total cache, 1.2GHz), 16GB GDDR5 Memory

Software: Intel® Math Kernel Library (Intel® MKL) 11.1 update 1 and update 2, Intel® Manycore Platform Software Stack (MPSS) 3.2.27270.1

+ 11.1 update 1 was upgraded to update 2 when the article was drafted, so two versions were tested in the article.

Below is an outline of the steps performed. Here is the link to the whole article.

Steps

Step 1: Determine which version of Intel MKL is used within MATLAB via the MATLAB command “version -blas”

  • Intel MKL version 11.0.5 is used within MATLAB R2014a

Step 2: Check if the Intel MKL version inside of MATLAB supports Intel Xeon Phi coprocessors

  • Intel MKL has supported the Intel Xeon Phi coprocessor since release 11.0 for Linux* OS, and since release 11.1 for Windows* OS.

Step 3: Upgrade Intel MKL version in MATLAB

  • Use mkl_rt.dll 
  • Creation of custom dynamic library (Optional, click Download Button)

Step 4: Enable Intel MKL Automatic Offload (AO) in MATLAB via MKL_MIC_ENABLE 

  • Set MKL_MIC_MAX_MEMORY=16G; set MKL_MIC_ENABLE=1

Step 5: Verify the Intel MKL version and ensure that AO is enabled on the Intel Xeon Phi coprocessors

  • Run the version -blas, version -lapack, and getenv('MKL_MIC_ENABLE') commands and check the output

Step 6: Compare performance 

  • Accelerate the commonly used matrix multiplication A*B in MATLAB
  • Accelerate the BLAS function dgemm() in MATLAB (Optional, click Download Button to get matrixMultiplyM.c file)

Summary

Intel MKL provides an automatic offload (AO) feature for the Intel Xeon Phi coprocessor. With the AO feature, certain MKL functions can transfer part of the computation to the Intel Xeon Phi coprocessor automatically. When problem sizes are large enough to amortize the cost of data transfer, MKL function performance can benefit from using both the host CPU and the Intel Xeon Phi coprocessor for computation. Because offloading happens transparently with AO, third-party software that uses Intel MKL functions can automatically benefit from this feature, easily making it run faster on systems with Intel Xeon Phi coprocessors.
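
For code that calls Intel MKL directly (rather than through MATLAB), AO can also be requested programmatically. A minimal sketch, assuming the Automatic Offload support function mkl_mic_enable() provided by Intel MKL 11.x; this is equivalent to setting MKL_MIC_ENABLE=1 in the environment as described above:

#include <mkl.h>

int enable_automatic_offload(void)
{
   /* Ask MKL to use Automatic Offload for subsequent calls such as dgemm;
      returns 0 on success (assumed behavior; check the MKL User's Guide). */
   return mkl_mic_enable();
}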

The article describes how to enable Intel MKL AO for MathWorks MATLAB on a system with Intel Xeon Phi coprocessors. The general steps are as follows:

  1. Source environment using compilervars.sh or mklvars.sh intel64
  2. Upgrade the intel MKL version in MATLAB to the latest version supporting Intel Xeon Phi coprocessors
  3. Set MKL_MIC_MAX_MEMORY=16G; set MKL_MIC_ENABLE=1
  4. Run MATLAB

A simple test shows that on one system with two Intel Xeon Phi coprocessors, the commonly used matrix multiplication within MATLAB (C=A*B) achieves a 2.6x speedup when Intel MKL AO is enabled, compared to doing the same computation on the CPU only.

Applying Vectorization Techniques for B-Spline Surface Evaluation


Abstract

In this paper we analyze relevance of vectorization for evaluation of Non-Uniform Rational B-Spline (NURBS) surfaces broadly used in Computer Aided Design (CAD) industry to describe free-form surfaces. NURBS evaluation (i.e. computation of surface 3D points and derivatives for the given u, v parameters) is a core component of numerous CAD algorithms and can have a significant performance impact. We achieved up to 5.8x speedup using Intel® Advanced Vector Extensions (Intel® AVX) instructions generated by Intel® C/C++ compiler, and up to 16x speedup including minor algorithmic refactoring, which demonstrates high potential offered by the vectorization technique to NURBS evaluation.

Introduction

Vectorization, or Single Instruction Multiple Data (SIMD), is a parallelization technique available on modern computer processors, which allows the same computational operation (e.g., addition or multiplication) to be applied to several data elements at once. For example, on a processor with a 128-bit register, a single addition operation can add 4 pairs of integers (each taking 32 bits) or 2 pairs of doubles (64 bits each). With the help of vectorization one can speed up computations due to the reduced time required to process the same data sets. SIMD was introduced on Intel® architecture processors back in the 1990s, with MMX™ technology as its first generation.
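
As a minimal sketch of that idea (assuming a compiler with Intel AVX support and n a multiple of 8, both illustrative choices), one 256-bit vector instruction adds 8 pairs of single-precision floats per loop iteration:

#include <immintrin.h>

void add_arrays(const float *a, const float *b, float *c, int n)
{
   for (int i = 0; i < n; i += 8) {
      __m256 va = _mm256_loadu_ps(a + i);             /* load 8 floats from a            */
      __m256 vb = _mm256_loadu_ps(b + i);             /* load 8 floats from b            */
      _mm256_storeu_ps(c + i, _mm256_add_ps(va, vb)); /* 8 additions in one instruction  */
   }
}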

In this paper we analyze the relevance of vectorization for evaluation of NURBS surfaces [1]. NURBS is a standard method used in the CAD industry to describe free-form surfaces, e.g. car bodies, ship hulls, aircraft wings, consumer products and so on. Examples of 3D models (from [3]) containing NURBS surfaces are given in Fig. 1:

NURBS evaluation (i.e. a computation of surface 3D points and derivatives) is a core component of numerous CAD algorithms. For instance...

(For further reading please refer to the attached pdf document)

Samples for Intel® C++ Composer XE


The Intel® C++ Compiler is an industry-leading C/C++ compiler that includes optimization features like auto-vectorization and auto-parallelization, OpenMP*, and Intel® Cilk™ Plus multithreading capabilities, plus highly optimized performance libraries.

We have created a list of articles with samples explaining the features in detail and how or when to use them in source code. They are:

By installing or copying all or any part of the sample source code, you agree to the terms of the Intel(R) Sample Source Code License Agreement.

Auto-vectorization articles and samples
Article Name: A Guide to Auto-vectorization with Intel® C++ Compilers
Description: This article provides guidelines for enabling Intel C++ compiler auto-vectorization using the sample source code; it targets the Intel® processors or compatible non-Intel processors that support SIMD instructions such as Intel® Streaming SIMD Extensions (Intel® SSE).
Download: Source Code in C/C++

Intel® Cilk™ Plus

Please visit Intel® C++ Compiler Code Samples page.

Building Open Source Applications using Intel C++ Compiler
Article Name: How to Build POV-Ray* with Intel C++ Compiler on Windows
Description: The article provides detailed instructions on building POV-Ray* using the Intel® C++ Compiler for Windows.
Version information
  • POV-Ray* beta version 3.7
  • Intel(R) C++ for Windows: 11.0
Platforms: Windows
Article Name: Building Boost C++ Libraries with Intel® C++ Compiler on Windows XP
Description: Boost is a set of libraries for the C++ language; visit www.boost.org for more information. The article provides detailed instructions on how to build the Boost* library with the Intel C++ Compiler on Windows.
Version information
  • Boost: v1.39.0
  • Intel C++ Compiler for Windows: 11.1
Platforms: Windows
Article Name: Building Open MPI* with the Intel compilers
Description: The article helps Intel® compiler customers build and use the Open MPI* library with Intel C++ and Fortran Compilers for Linux and OS X.
Version information
  • Open MPI: 1.2
  • Intel(R) C++ and Fortran Compilers for Linux* or Mac OS* X: 10.0
Platforms: Linux*, OS X*
Article Name: Building UPC* to utilize the Intel C++ Compiler
Description: The Berkeley* Unified Parallel C* (UPC) is a programming language that adds parallelization extensions to the C language. The article explains how to build the UPC* compiler with the Intel C++ Compiler and configure it for use with symmetric multiprocessing (SMP) machines.
Version information
  • UPC: version 2.4.0
  • Intel(R) C++ Compiler for Linux*: 10.0
Platforms: Linux
Article Name: Building Quantlib with Intel C++ Compiler
Description: Quantlib is a free/open-source library for modeling, trading, and risk management in real life, written in C++. The article explains how to configure and build the Quantlib* library (http://quantlib.org/) and an example provided with Quantlib.
Version information
  • Quantlib: Quantlib-0.3.13.tar.gz
  • Boost*: boost_1_33_1
  • Intel(R) C++ Compiler for Linux*: 10.0
Platforms: Linux
Article Name: Building Xerces with Intel C++ Compiler
Description: The article describes how to build Xerces-C++ with the Intel® C++ Compiler for Linux*.
Version information
  • Xerces: 2.7.0
  • Intel(R) C++ Compiler for Linux*: 10.0
Platforms: Linux
Article Name: Building FFTW* With the Intel Compilers
Description: The FFTW library is used for high-performance computation of the Discrete Fourier Transform (DFT). The article describes how to build the FFTW* library on Linux* using the Intel C++ Compiler for Linux.
Version information
  • FFTW* library v3.1.2
  • Intel(R) C++ Compiler for Linux*: 10.0
Platforms: Linux
Article Name: Building PGPLOT* with the Intel compilers
Description: PGPLOT is a library for creating two-dimensional plots and graphs. The article provides instructions on how to build the PGPLOT* graphics library using Intel C++ and Fortran Compilers for Linux.
Version information
  • PGPLOT* graphics library v5.2.2
  • Intel(R) C++ and Fortran Compilers for Linux*: 10.x
Platforms: Linux
Article Name: Building WRF v2.x with the Intel compilers
Description: The Weather Research and Forecasting (WRF) Model (http://wrf-model.org/index.php) is a next-generation mesoscale numerical weather prediction system designed to serve both operational forecasting and atmospheric research needs. The article was created to help users of WRF make use of the Intel C++ and Fortran compilers.
Version information
  • WRF: version 2.2 and 2.2.1
  • Intel(R) C++ and Fortran Compilers for Linux* or Mac OS* X: 10.x, 11.x
Platforms: Linux
Article Name: Building WRF v3.1.1 with the Intel compilers
Description: The article was created to help users of WRF v3.1.1 make use of the Intel C++ and Fortran compilers.
Version information
  • WRF: version 3.1.1
  • Intel(R) C++ and Fortran Compilers for Linux* or Mac OS* X: 11.1
Platforms: Linux
Article Name: Building the HPCC* benchmark with Intel C++ and Fortran Compilers
Description: The HPC Challenge (HPCC) benchmark is used to evaluate and test a wide variety of performance parameters for high-performance computing systems. The article provides instructions on how to build the HPCC* benchmark.
Version information
  • HPCC: 1.0.0
  • Intel(R) C++ and Fortran Compilers for Linux* or Mac OS* X: 10.0
  • Intel(R) Math Kernel Library: 9.1
Platforms: Linux, OS X
Article Name: Building HDF5* with Intel® compilers
Description: The article provides instructions on how to build and use the HDF5 library with Intel C++ and Fortran Compilers on Linux* or OS X*. HDF5 (http://www.hdfgroup.org/HDF5/) is the latest generation of the HDF libraries, a general-purpose library and associated file formats for storing and sharing scientific data.
Version information
  • HDF5 1.8.9
  • Intel C++ and Fortran Compiler for Linux* or Mac OS* X: 13.0 Update 1
Platforms: Linux, OS X

Intel® System Studio - Solutions, Tips and Tricks


Migrating from SSE2 Vector Operations to AVX2 Vector Operations


Abstract

Intel® Architecture CPUs continue to evolve and offer improved performance and power efficiency. However, software engineers often fail to explore new hardware capabilities and enable their products to take advantage of leading-edge platforms. For example, Intel recently introduced a new SIMD instruction format, called AVX2, for the Haswell microarchitecture, with 256-bit registers and three-operand instructions, compared with the older SSE instructions, which use 128-bit registers and two-operand instructions. In this paper, I will demonstrate one technique for migrating SSE2 code to AVX2, using a video decoder loop filter as the example.

Introduction

Legacy SSE2 code, as well as MMX code, can be optimized by shifting to AVX2. MMX and SSE2 are examples of Single Instruction Multiple Data (SIMD) instruction sets that operate at various data widths. Vectorization is the aggregation of multiple data elements into a single, wider register (see the Wikipedia article on "Vectorization (parallel computing)" in the reference section for more details). Here, we characterize a vector by the number of elements and the number of bits per element.

SSE2 uses a 128-bit register composed of 2, 4, 8, or 16 elements of 64, 32, 16, or 8 bits; all elements are required to be of the same size. AVX2 doubles the register size to 256 bits, and uses the same size elements as in the 128-bit register.

By vectorizing C code using SSE2 instructions, the elements in a 128-bit register can be processed in parallel. An AVX2 register can process twice as many elements as an SSE2 register. Sometimes this approach requires substantial changes to the kernel code structure that affect other parts of the code. For example, if the inner loop in a kernel is changed to process 32 elements at a time, we have to create new code to handle conditions that cause us to stride across array boundaries. So, we have potentially introduced substantial changes to the code that will require careful and thorough validation.

If we step back from the obvious ways to use the wider AVX2 register, we see the concept of the "vector" can be generalized: we can use a vector of vectors as the way to look at the data. For example, an AVX register can hold two SSE2 vectors, so if we find cases without data dependencies between two SSE2 sub-vectors, we can interleave two SSE2-sized vector operations into a single AVX2 operation. We will demonstrate this approach with the video decoder loop filter.
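
A minimal sketch of this "vector of vectors" idea (with illustrative names and data rather than the actual loop filter code): two independent 128-bit SSE2-sized vectors are packed into the low and high lanes of one 256-bit register, so a single AVX2 instruction does the work of two SSE2 additions:

#include <immintrin.h>

__m256i add_two_blocks(__m128i lo, __m128i hi, __m256i addend)
{
   /* pack the two 128-bit vectors into one 256-bit register */
   __m256i pair = _mm256_inserti128_si256(_mm256_castsi128_si256(lo), hi, 1);
   /* one AVX2 instruction performs 16 parallel 16-bit additions */
   return _mm256_add_epi16(pair, addend);
}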

Example of Mapping SSE2 Vectors Within an AVX2 Vector

We have a loop filter for a video decoder.

Before calling that function, the elements need to be transposed so they can be processed in a 128-bit register in parallel. In the SSE2 version of the loop filter, there are 8 elements, each one byte in size.

The SSE2 loop filter vector is composed of the variables q3,q2,q1,q0,p0,p1,p2,p3, and the vector elements for processing are laid out consecutively in memory. The vector being used is 128 bits wide.

However, the variables themselves are laid out as an array of variables; i.e.
q3[16],q2[16],q1[16],q0[16],p0[16],p1[16],p2[16],p3[16]

We must read data from memory in such a way as to create an interleave of the variables.

The values lie at a constant stride of X from each other in memory.
The author of the SSE2 version of the loop filter decided to transpose the data because:

  1. The number of operations in the function is expensive, or
  2. The function is too big, has many instructions, and requires many cycles to finish.

Sometimes it is worth paying the penalty of transposing the data to produce more optimized code.

 

The SSE2 code loads 8 elements (q3,q2,q1,q0,p0,p1,p2,p3) in one chunk into a 128 bit register:

Each line in the table of image 1.0 is a load of 8 bytes into a 128 bit register:


movq  xmm4, [rsi]         // 8-byte load into the low half of xmm4
movq  xmm1, [rdi]         // each movq brings in one row of 8 elements
movq  xmm0, [rsi+2*rax]   // rax holds the row stride
movq  xmm7, [rdi+2*rax]
movq  xmm5, [rsi+4*rax]
movq  xmm2, [rdi+4*rax]



Every register loads 8 elements in the code above.

For the transpose, we perform 16 loads of 8 elements each, because every element qi/pi, where i=0..3, should eventually fill a 128 bit register of its own.
The transpose begins by merging pairs of registers so that each qi/pi ends up in its own 128 bit register.

Figure 1.0 shows how the merge works between register A to register B:

Figure 1.1 illustrates what happens to xmm4 and xmm1 after the merge:

The first step is to merge the bytes of every two registers until we have 16 values for each qi, pi where i=0..3.
In our case the first merge (punpcklbw) already fills a 128 bit register.
The next stage is to merge words from every two registers. However, there is not enough room in one 128 bit register to hold the result of this merge; we need two 128 bit registers.
One word merge is done on the low half of the 128 bit register (the low half of an xmm register) and one word merge on the high half (the high half of an xmm register).


movdqa      xmm3,  xmm4         //save the data in a different register



The next stage is to merge double words, which also requires two operations: a double word merge on the low half and a double word merge on the high half:

The next and last stage is to merge quad words, which likewise requires two operations:

In sum, this transpose requires:

  1. 16 loads of 8 elements into 16 128 bit registers.
  2. A byte merge for every 2 registers - with 16 loads this means 8 byte merge operations.
  3. A word merge for every 2 registers - because each 128 bit register was filled by a previous byte merge, the word merge must be executed separately on the low half and the high half of the 128 bit register. And because this is SSE2 code (only 2 arguments per instruction), the data must first be saved to another register before those 2 operations can run - 8 merge operations for the high and low halves plus 8 operations for saving registers.
  4. A double word merge for every 2 registers - this also requires separately merging the low and high halves of the 128 bit register - 8 merge operations for the high and low halves plus 8 operations for saving registers.
  5. A quad word merge, which also requires separately merging the low and high halves, and therefore takes 8 merge operations plus 8 register-save operations.

In summary, there are 32 merge operations plus 24 register-save operations.
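For illustration (hypothetical row variables, not the original kernel), the first two SSE2 merge stages look roughly like this in intrinsics; the comments note where the two-operand assembly forces the extra register copies counted above:

#include <emmintrin.h>

// Hypothetical illustration of the first two SSE2 merge stages.
// rowA..rowD each hold 8 bytes loaded from memory (as in the movq loads above).
void sse2_merge_stages(__m128i rowA, __m128i rowB, __m128i rowC, __m128i rowD,
                       __m128i *lo, __m128i *hi) {
    // Byte merge (punpcklbw): interleave the low 8 bytes of two rows.
    __m128i ab = _mm_unpacklo_epi8(rowA, rowB);
    __m128i cd = _mm_unpacklo_epi8(rowC, rowD);

    // Word merge (punpcklwd/punpckhwd): both halves of the 128 bit result are
    // needed, so in two-operand assembly the source register must first be
    // copied (movdqa) before each unpack overwrites it.
    *lo = _mm_unpacklo_epi16(ab, cd);   // low-half word merge
    *hi = _mm_unpackhi_epi16(ab, cd);   // high-half word merge
}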
This is the input of the SSE2 Transpose:

And this is the output:

By migrating to AVX2 and using 256 bit registers, the transpose can be performed on independent 8x8 matrices in a 256 bit register in parallel:

The first eight loads are placed in the lower half of a 256 bit register and the second eight loads in the upper half.
The number of loads stays the same - 16 - but eight of them go into the upper halves of registers that already hold data in their lower halves.

In order to put two loads on a 256 bit register:

1. Load the eight elements into a register:

Broadcast the independent data from memory into a 256 bit register:

2. Blend the two loads together in the 256 bit register:

From this point on, the number of merge operations at each size (byte/word/double word) is half that of the SSE2 version, because the two merges happen within the same 256 bit register.

You do not need to save a register before merging, because AVX2 instructions support 3 arguments (see the sketch after the list below):

  1. One destination register for the output.
  2. Two input register arguments.
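A minimal intrinsics sketch of the load/broadcast/blend pattern and a three-operand merge (hypothetical row pointers and a second input register; not the original assembly):

#include <immintrin.h>

// Place two independent 8-byte rows into one 256 bit register, then merge.
__m256i pack_and_merge(const unsigned char *rowA, const unsigned char *rowB,
                       __m256i other) {
    // 1. Load eight bytes into the low half of an xmm register.
    __m128i a = _mm_loadl_epi64((const __m128i *)rowA);
    __m128i b = _mm_loadl_epi64((const __m128i *)rowB);

    // Broadcast each 64-bit row to both 128-bit lanes of a ymm register.
    __m256i ya = _mm256_broadcastq_epi64(a);
    __m256i yb = _mm256_broadcastq_epi64(b);

    // 2. Blend: keep rowA in the low lane and rowB in the high lane.
    __m256i packed = _mm256_blend_epi32(ya, yb, 0xF0);

    // Three-operand AVX2 merge: the result goes to a new register, so no
    // extra copy is needed to preserve the inputs.
    return _mm256_unpacklo_epi8(packed, other);
}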

 

In summary:

  1. There are still 16 loads of data (we cannot change that), implemented as eight load operations plus eight broadcast operations plus eight blend operations.
  2. There are four byte merges.
  3. There are four word merges.
  4. There are four double word merges.

 

In this case, a quad word merge was not necessary because the data was already arranged in the 256 bit register such that all the values for each of the elements q3,q2,q1,q0,p0,p1,p2,p3 were in the same register. See below.

When running the transpose function 100,000,000 times for both the SSE2 and AVX2 versions, Intel® VTune™ Amplifier shows that the CPU time spent in the transpose is 1.521 seconds for the SSE2 version and 0.900 seconds for the AVX2 version - a 41% improvement with AVX2.

Next, rearrange the data in the registers so that one register holds two elements qi/qj or pi/pj that are independent of each other; that is, the code performs no computation that combines those two elements. For example:


signed char mask = 0;
mask |= (abs(p3 - p2) > limit);
mask |= (abs(p2 - p1) > limit);
mask |= (abs(p1 - p0) > limit);
mask |= (abs(q1 - q0) > limit);
mask |= (abs(q2 - q1) > limit);
mask |= (abs(q3 - q2) > limit);
mask |= (abs(p0 - q0) * 2 + abs(p1 - q1) / 2  > blimit);
return mask - 1;

This code shows that p0,q0 and p1,q1 have to be in separate registers but in the same lane:


mask |= (abs(p0 - q0) * 2 + abs(p1 - q1) / 2  > blimit);

The registers above already satisfy this requirement (they are in separate registers). The only problem is that the values are not in the same lane. A simple shuffle operation fixes this and lets us have p0,q0 and p1,q1 in the same lane.
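One way to perform such a cross-lane rearrangement (a hedged sketch using vpermq; the article's actual shuffle instruction is shown in its figures and may differ):

#include <immintrin.h>

// Exchange the two 128-bit lanes of a ymm register so that values currently
// sitting in different lanes can be paired up lane-for-lane with another
// register. vpermq moves 64-bit chunks across lanes in a single operation.
__m256i swap_lanes(__m256i v) {
    // 0x4E selects qwords {2,3,0,1}, i.e. the two 128-bit lanes are swapped.
    return _mm256_permute4x64_epi64(v, 0x4E);
}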

For this section of code:


mask |= (abs(p3 - p2) > limit);
mask |= (abs(p2 - p1) > limit);
mask |= (abs(p1 - p0) > limit);
mask |= (abs(q1 - q0) > limit);
mask |= (abs(q2 - q1) > limit);
mask |= (abs(q3 - q2) > limit);

Rearranging every register to hold a qi,pi pair, where i=0..3, cuts the number of operations in half.

This is done with shuffle and blend operations. For example, to put p2 and q2 in the same register, this is the starting point:

We change the second register to:

We change the first register to:

Not all elements require three operations (shuffle, shuffle, blend); some require only a blend, for example q3,p3:

The same applies to q1,p1:

To get p0 and q0 into the same register, two shifts and one blend are needed:

So, in the end, this code, which contains six operations:


mask |= (abs(p3 - p2) > limit);
mask |= (abs(p2 - p1) > limit);
mask |= (abs(p1 - p0) > limit);
mask |= (abs(q1 - q0) > limit);
mask |= (abs(q2 - q1) > limit);
mask |= (abs(q3 - q2) > limit);

is executed with just three operations.

This is the original code:

In the original SSE2 code, every instruction can handle only two operands, so some registers need to be saved to a temporary register for later use.

The assembly instruction for that operation:


abs(p3 - p2);



This needs to be done with three operations (excluding the move):

In SSE2, to avoid overwriting q2, it has to be saved in another register. In AVX2 there is no need to save registers because instructions take 3 operands. Only three operations are needed in AVX2 to calculate this code, versus the six necessary in the SSE2 code.
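Assuming the data has been paired as described, a hedged intrinsics sketch of one such comparison in AVX2 (hypothetical helper using saturating unsigned arithmetic; not the article's exact assembly):

#include <immintrin.h>

// Hypothetical sketch of one scalar line, mask |= (abs(p3 - p2) > limit),
// applied to 32 unsigned bytes at a time with AVX2.
__m256i accumulate_mask(__m256i mask, __m256i p3, __m256i p2, __m256i limit) {
    // |p3 - p2| for unsigned bytes: OR of the two saturating differences.
    __m256i absdiff = _mm256_or_si256(_mm256_subs_epu8(p3, p2),
                                      _mm256_subs_epu8(p2, p3));
    // (absdiff > limit) leaves a nonzero byte wherever the limit is exceeded;
    // OR it into the running mask. A final compare against zero (not shown)
    // turns the accumulated bytes into the 0x00/0xFF lane mask the filter uses.
    return _mm256_or_si256(mask, _mm256_subs_epu8(absdiff, limit));
}

Each intrinsic maps to a single three-operand VEX instruction, so no input register has to be copied to a temporary first.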

Overall, the assembly needed to calculate the code above and the code that follows uses half as many instructions as the original assembly.
After calculating the mask, all the elements from q3 to p3 are filtered.


signed char s, u;
signed char filter_value, Filter1, Filter2;
signed char ps2 = (signed char) * op2 ^ 0x80;
signed char ps1 = (signed char) * op1 ^ 0x80;
signed char ps0 = (signed char) * op0 ^ 0x80;
signed char qs0 = (signed char) * oq0 ^ 0x80;
signed char qs1 = (signed char) * oq1 ^ 0x80;
signed char qs2 = (signed char) * oq2 ^ 0x80;
/* add outer taps if we have high edge variance */

filter_value = vp8_signed_char_clamp(ps1 - qs1);
filter_value = vp8_signed_char_clamp(filter_value + 3 * (qs0 - ps0));
filter_value &= mask;
Filter2 = filter_value;
Filter2 &= hev;

/* save bottom 3 bits so that we round one side +4 and the other +3 */
Filter1 = vp8_signed_char_clamp(Filter2 + 4);
Filter2 = vp8_signed_char_clamp(Filter2 + 3);
Filter1 >>= 3;
Filter2 >>= 3;
qs0 = vp8_signed_char_clamp(qs0 - Filter1);
ps0 = vp8_signed_char_clamp(ps0 + Filter2);

/* only apply wider filter if not high edge variance */
filter_value &= ~hev;
Filter2 = filter_value;

/* roughly 3/7th difference across boundary */
u = vp8_signed_char_clamp((63 + Filter2 * 27) >> 7);
s = vp8_signed_char_clamp(qs0 - u);
*oq0 = s ^ 0x80;
s = vp8_signed_char_clamp(ps0 + u);
*op0 = s ^ 0x80;

/* roughly 2/7th difference across boundary */
u = vp8_signed_char_clamp((63 + Filter2 * 18) >> 7);
s = vp8_signed_char_clamp(qs1 - u);
*oq1 = s ^ 0x80;
s = vp8_signed_char_clamp(ps1 + u);
*op1 = s ^ 0x80;

/* roughly 1/7th difference across boundary */
u = vp8_signed_char_clamp((63 + Filter2 * 9) >> 7);
s = vp8_signed_char_clamp(qs2 - u);
*oq2 = s ^ 0x80;
s = vp8_signed_char_clamp(ps2 + u);
*op2 = s ^ 0x80;
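As a hedged sketch of how the clamped filter arithmetic above maps to AVX2: the vp8_signed_char_clamp calls correspond to saturating signed byte operations, so the clamp needs no explicit compare-and-select code. The variable names and the per-step clamping below are assumptions for illustration, not the article's exact code.

#include <immintrin.h>

// Compute filter_value = clamp(ps1 - qs1), then clamp(filter_value + 3*(qs0 - ps0)),
// then mask it, for 32 bytes at a time.
__m256i filter_value_avx2(__m256i ps1, __m256i qs1, __m256i ps0, __m256i qs0,
                          __m256i mask) {
    __m256i filt = _mm256_subs_epi8(ps1, qs1);        // clamp(ps1 - qs1)
    __m256i diff = _mm256_subs_epi8(qs0, ps0);        // per-byte (qs0 - ps0)
    filt = _mm256_adds_epi8(filt, diff);              // + (qs0 - ps0), clamped
    filt = _mm256_adds_epi8(filt, diff);              // + (qs0 - ps0), clamped
    filt = _mm256_adds_epi8(filt, diff);              // + (qs0 - ps0), clamped
    return _mm256_and_si256(filt, mask);              // filter_value &= mask
}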

Except for the filter calculations, all the other operations can fully utilize the 256 bit register.
This part:


signed char ps2 = (signed char) * op2 ^ 0x80;
signed char ps1 = (signed char) * op1 ^ 0x80;
signed char ps0 = (signed char) * op0 ^ 0x80;
signed char qs0 = (signed char) * oq0 ^ 0x80;
signed char qs1 = (signed char) * oq1 ^ 0x80;
signed char qs2 = (signed char) * oq2 ^ 0x80;



Can be done by two assembly instructions:


vpxor ymm0, ymm0, [GLOBAL(t80)]    // flip the sign bit of 32 bytes at once
vpxor ymm4, ymm4, [GLOBAL(t80)]    // (the scalar "^ 0x80" for a whole register)



Instead of four assembly instructions.
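The equivalent in intrinsics form is a single XOR per 256 bit register against a 0x80-filled constant (a minimal sketch, assuming the t80 constant from the assembly above):

#include <immintrin.h>

// Flip the sign bit of 32 bytes at once, converting between unsigned and
// signed representations (the scalar code's "^ 0x80").
__m256i flip_sign(__m256i v) {
    const __m256i t80 = _mm256_set1_epi8((char)0x80);
    return _mm256_xor_si256(v, t80);
}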

And in order to calculate "op2", "op1","op0","oq0","oq1","oq2":


s = vp8_signed_char_clamp(qs0 - u);
*oq0 = s ^ 0x80;
s = vp8_signed_char_clamp(ps0 + u);
*op0 = s ^ 0x80;



To do this in parallel, (-u) and (u) need to be in the same register; this can be done by using the following instruction:

Now every op0/1/2 and oq0/1/2 can be processed in parallel.
When the filtering is complete, the data needs to be transposed back to where it was.
The data is still laid out as it was before:

Every [q3,p3,q3,p3] needs to be byte merged with [q2,p2,q2,p2] in the low and high parts of each lane, giving [q2q3,q2q3,q2q3,q2q3] and [p3p2,p3p2,p3p2,p3p2].
And every [q1,p1,q1,p1] needs to be byte merged with [q0,p0,q0,p0] in the low and high parts of each lane, giving [q0q1,q0q1,q0q1,q0q1] and [p1p0,p1p0,p1p0,p1p0].

The next merge should be with every word:

And the following merge with every double word:

The number of operations for transposing the data back is 12: 4 byte merges, 4 word merges, and 4 double word merges.
In the original code there are 24 operations, because every qi, i=0..3, occupies a full 128 bit register.

When running the whole loop filter 100,000,000 times there is a 21% gain at the function level.
The following Intel VTune Amplifier results illustrate this:


Image 1.1 shows that the SSE2 loop filter consumed 12 trillion CPU cycles over 100,000,000 loop iterations.


Image 1.2 shows that the AVX2 loop filter consumed 9 trillion CPU cycles over 100,000,000 loop iterations.

Conclusion

Vectorizing SSE2 code with AVX2 can speed up an SSE2 function by almost 50%, depending on the type of work being done. It is important to understand that this kind of vectorization is only possible when there are no dependencies between some of the variables in the function, and only worthwhile when the function does enough computation that the extra operations needed to rearrange the data in the 256 bit register eventually pay off.

In our case, vectorizing the SSE2 transpose improved it by 41%, and the overall function improved by 21%. If the function were more computationally heavy, the gain could approach 50%.

Other References

Vectorization definition: http://en.wikipedia.org/wiki/Vectorization_(parallel_computing)
AVX2 (Advanced Vector Extensions): http://en.wikipedia.org/wiki/Advanced_Vector_Extensions
Intel VTune Amplifier download and documentation: http://software.intel.com/en-us/intel-vtune-amplifier-xe

Full Pipeline Optimization for Immersive Video



Abstract

Tencent (Tencent Technology Company Ltd) integrated the Intel® Media SDK to optimize performance and reduce power consumption of its video conferencing app, QQ*. The app went from a maximum resolution of 480p at a low frame rate to 720p at 15-30 frames per second (fps) while consuming only 35% of the original amount of power. It now supports 4-way conferencing while lowering CPU utilization from 80% to under 20%, reducing power consumption from 14 W to 6 W, and cutting RAM usage in half.

These techniques to optimize the entire pipeline using the hardware acceleration of Intel® graphics from camera capture through decoding, encoding, and final display can also be used by other media applications.

Introduction

Tencent QQ is a popular instant messaging service for mobile devices and computers. QQ boasts a worldwide base of more than one billion registered users and is particularly popular in China. QQ has more than 100 million people logged in at any time and offers not only video calls, voice chats, rich texting, and built-in translation (text) but also file and photo sharing.

Like all video on the Internet, QQ performs best when there’s plenty of data bandwidth available, but video conferencing is bi-directional so both uplink and download speeds are important. Unfortunately in many countries, including China, uplink speed may only be 512kbps. So to please customers, Tencent needed good compression and low latency while still leaving CPU and RAM bandwidth available for multitasking. Plus the devices need to remain cool and power efficient while balancing high quality with available bandwidth.

So Tencent engineers worked with Intel engineer Youwei Wang to first diagnose the bottlenecks and power consumption of their app and then improve the performance of the data flow pipeline. The main changes involved using the CPU and GPU in parallel to increase performance and making major memory-handling changes to decrease memory usage, both of which significantly reduced power consumption.

This article details how the improvements were accomplished using the special features of Intel® processors by integrating the Intel Media SDK and using Intel® Streaming SIMD Extensions (Intel® SSE4) instructions.

Performance and Power Analysis Tools

Significant data capture and analysis can be done using tools currently available free on the Internet. From Microsoft, the team used the Windows* Assessment and Deployment Kit (Windows ADK) (available at http://go.microsoft.com/fwlink/p/?LinkID=293840), which includes:

  • Windows Performance Analyzer (WPA)
  • Windows Performance Toolkit (WPT)
  • GPUView
  • Windows Performance Recorder (WPR)

The Intel® tools used were:

The Intel® Media Software Development Kit (Intel® Media SDK)

The Intel® Media SDK is a cross-platform API that includes features for video editing and processing, media conversion, streaming and playback, and video conferencing. The SDK makes it easy for developers to optimize applications for Intel® HD Graphics hardware acceleration, which is available starting with the 2nd generation Intel® Core™ processors as well as the latest Intel® Celeron® and Intel® Atom™ processors.

Features of the Intel® Media SDK include:

  • Low Latency Encode and Decode - Allows dynamic control of bit rate via settings shown in the UI, including:
    mfxVideoParam::AsyncDepth (limits internal frame buffering and forces per-frame sync)
    mfxInfoMFX::GopRefDist (stops use of B frames)
    mfxInfoMFX::NumRefFrame (can set to only use the previous P-frame)
    mfxExtCodingOption::MaxDecFrameBuffering (extends the buffer; can set to show a frame immediately)
  • Dynamic Bit Rate and Resolution Control - Adapts target and maximum Kbps to actual bandwidth at any time, or customizes bit rate encoding per frame with the Constant Quantization Parameter (CQP) DataFlag.
  • Reference List Selection - Uses client-side frame reception feedback to adjust reference frames, which can improve robustness and error resilience. Provides 3 types of lists: Preferred, Rejected, and Long Term.
  • Reference Picture Marking Repetition SEI Message - Repeats the decoded reference picture marking syntax structures of earlier decoded pictures to maintain the status of the reference picture buffer and reference picture lists even if frames were lost.
  • Long Term Reference - Allows temporal scalability through the use of layers providing different frame rates.
  • MJPEG decoder - Accelerates H.264 encode/decode and video processing filters. Allows delivery of NV12 and RGB4 color format decoded video frames.
  • Blit Process - Option to combine multiple input video samples into a single output frame. Post-processing can then apply filters to the image buffer (before display) and use de-interlacing, color-space conversion, and sub-stream mixing.
  • Hardware-accelerated and software-optimized media libraries built on top of Microsoft DirectX*, DirectX Video Acceleration (DXVA) APIs, and platform graphics drivers.

Understanding the Video Pipeline

Sending video data between devices is more complex than most people imagine. Figure 3 shows the key steps that the QQ app takes to send video data from a camera (device A) to the user’s screen (device B).


Figure 3: Serial processing

As you can see, many steps require data format conversion, or 'data swizzling'. When these are handled serially on the CPU, significant latency occurs. The pre-optimized version of QQ had limited pre- and post-processing. But since each packet of data is independent of the next, the Intel Media SDK can parallelize the tasks, split them between the CPU and GPU, and optimize the flow.


Figure 4: Optimized multi-thread flow

Changing SIMD instructions

Another major improvement came from replacing the older Intel SIMD instruction set (MMX) with Intel® Streaming SIMD Extensions (Intel® SSE4) instructions. This doubled throughput by moving from the 64-bit MMX registers (which alias the x87 floating point registers, and where two 32-bit integers can be swizzled simultaneously) to 128-bit registers, using the _mm_stream_load_si128 and _mm_store_si128 functions. Besides the larger registers, Intel SSE uses its own register file, separate from the x87/MMX registers. This means the processor can work on multiple data sets within a single CPU cycle, which greatly improves data throughput and execution efficiency. The change from MMX to SSE4 calls alone increased QQ performance 10x. (See Additional References at the end of this article for more information on how to rewrite copy functions using Intel SSE4 and on converting SIMD instructions.)
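A minimal sketch of the kind of 16-byte-at-a-time copy these intrinsics enable (hypothetical function and buffers, assuming 16-byte alignment and a size that is a multiple of 16; production code must also handle unaligned tails):

#include <smmintrin.h>  // SSE4.1
#include <cstddef>

// Copy 'bytes' from a 16-byte-aligned source (e.g., a mapped video surface)
// to an aligned destination, 16 bytes per iteration.
void copy_stream(const void *src, void *dst, size_t bytes) {
    const __m128i *s = (const __m128i *)src;
    __m128i *d = (__m128i *)dst;
    for (size_t i = 0; i < bytes / 16; ++i) {
        // movntdqa: streaming load that avoids polluting the cache when
        // reading from write-combining (USWC) memory such as GPU surfaces.
        __m128i v = _mm_stream_load_si128((__m128i *)(s + i));
        _mm_store_si128(d + i, v);
    }
}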

Additionally, Tencent was using C library routines for the many large memory copies needed for each frame, which was too slow for HD video. The code was changed so that the software pipeline uses system memory only, and the hardware pipeline was changed so that D3D surfaces handle all the sessions/threads. For copies between system memory and the D3D surfaces, the engineers used Intel SSE and Intel® Advanced Vector Extensions 2 (Intel® AVX2) instructions and eliminated unnecessary memory copies in the pipeline.

Using Dynamic Features of the Intel® Media SDK

Another improvement was to use the proper codec level when encoding (doing MJPEG decoding on the GPU). The team used the Intel Media SDK dynamic buffer and dynamic bit and frame rates, which decreased latency and reduced buffer use. Moving the pre- and post-processing into hardware improved compression, helping performance on low-bandwidth networks.

To improve the user experience, the teams added de-noising in preprocessing and used post-processing to adjust colors (hue/saturation/contrast). Integrated skin tone detection and face color adjustment improved the result further.


Figure 5: Optimized Skin Tones

 

Changing Reference Frames

Regardless of the efficiency of the encode and decode processing, the user's experience in a video conference will suffer if the network connection can't consistently deliver the data. Without data, the decoder will skip ahead to a new reference frame (since the incremental frames arrive late or are missing). Both frame type selection and accurate bit-rate control are necessary for a stable bit-stream transfer. Tencent found that setting I-frames to 30% of bandwidth gave the best balance. In addition, the Intel Media SDK allowed the elimination of B frames and allows changes to the maximum frame size and the buffer size.

Moving away from only using I-Intra frames and P-Inter frames, the new SP frames in H.264 allow switching between different bit rate streams without requiring an intra-frame. Tencent moved to using SP frames between P frames (reducing the importance of the P frame) and allowed dynamic adjustment to get the best balance between network conditions and video quality.

 

Reducing Power Consumption

In addition to improving the performance of QQ, the changes to memory copies, reference frames, and post-processing also reduced the power consumption of the app. This is a natural consequence of doing the same amount of work in less time. But Tencent further reduced the power required by throttling down the power states of the processor cores when they weren't actually processing. Using the findings from the power tests, the engineers reworked areas that were keeping the processor unnecessarily active. Video conferencing apps don't need to run the CPU continuously, since data supplied by the network is never continuous and there is no value in drawing new frames faster than the screen refresh rate. Tencent added short, timed low-power states using the Windows API Sleep and WaitForSingleObject functions; the latter is triggered by events such as data arriving on the network. The resulting improvements can be seen in Figure 8:


Figure 8: Power savings per release
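A minimal sketch of the idea (hypothetical event handle and loop, not Tencent's code): the render path blocks on WaitForSingleObject so the cores can drop to a lower power state until data arrives or the frame interval expires.

#include <windows.h>

// Render loop that lets the CPU enter a low-power state between frames.
void render_loop(HANDLE dataArrivedEvent, volatile bool *running) {
    const DWORD frameIntervalMs = 33;  // ~30 fps; no point drawing faster
    while (*running) {
        // Block until the network thread signals new data, or the frame
        // interval elapses, whichever comes first.
        DWORD r = WaitForSingleObject(dataArrivedEvent, frameIntervalMs);
        if (r == WAIT_OBJECT_0 || r == WAIT_TIMEOUT) {
            // decode / compose / present the next frame here
        }
    }
}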

Summary of QQ Improvements

Using the Intel Media SDK and changing to the Intel SSE4 instruction set, Tencent made the following improvements to the QQ app:

  • Offloaded H.264 and MJPEG encode and decode tasks to GPU
  • Moved pre and post process tasks (when possible) to hardware
  • Used both CPU and GPU simultaneously
  • Reduced memory copies
  • Reduced processor high power states (sleep calls, WaitForSingleObject, and timers)
  • Changed MMX to Intel SSE4 instructions
  • Optimized reference frame flow


Figure 9: Pre optimized 640x480 versus optimized 1280x720

Conclusion

The performance of Tencent QQ was dramatically increased by using key features of the Intel Media SDK. QQ was transformed from an app that could deliver 480p resolution images at low frame rate over a DSL connection into an app that could deliver 720p resolution images at 30 fps over that same DSL connection and support 4-way conferencing.

After moving key functions into hardware using the Intel Media SDK, the power consumption of QQ was reduced to almost 50% of its initial value. Then, by optimizing processor power states, power usage was further reduced to about 35% of its initial value. This is a remarkable power savings that permits QQ users to run the optimized app for more than twice as long as the preoptimized app. While improving customer satisfaction, QQ became a more capable (and greener) app by integrating the Intel Media SDK.

If you are a media software developer, be sure to evaluate how the Intel Media SDK can help increase performance and decrease memory usage and power consumption of your app by providing an efficient data flow pipeline with improved video quality and user experience even with limited bandwidth. And don’t forget the Intel tools available to help find problem spots and bottlenecks.

Additional Resources:

About the Authors

Colleen Culbertson is an Application Engineer in Intel’s Developer Relation Division Scale Enabling in Oregon. She has worked for Intel for more than 15 years. She works with various teams and customers to enable developers to optimize their code.

Youwei Wang is an Application Engineer in Intel’s Developer Relation Division Client Enabling in Shanghai. Youwei has worked at Intel for more than 10 years. He works with ISVs on performance and power optimization of applications.

Testing Configuration

Some performance results were provided by Tencent. Intel performance results were obtained on a Lenovo* Yoga 2 Pro 2-in-1 platform with a 4th generation Intel® Core™ mobile processor and Intel® HD Graphics 4400.

Notices

INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.

UNLESS OTHERWISE AGREED IN WRITING BY INTEL, THE INTEL PRODUCTS ARE NOT DESIGNED NOR INTENDED FOR ANY APPLICATION IN WHICH THE FAILURE OF THE INTEL PRODUCT COULD CREATE A SITUATION WHERE PERSONAL INJURY OR DEATH MAY OCCUR.

Intel may make changes to specifications and product descriptions at any time, without notice. Designers must not rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined." Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them. The information here is subject to change without notice. Do not finalize a design with this information.

The products described in this document may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.

Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your product order.

Copies of documents which have an order number and are referenced in this document, or other Intel literature, may be obtained by calling 1-800-548-4725, or go to: http://www.intel.com/design/literature.htm

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark* and MobileMark*, are measured using specific computer systems, components, software, operations, and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.

Any software source code reprinted in this document is furnished under a software license and may only be used or copied in accordance with the terms of that license.

Intel, the Intel logo, Intel Atom, Intel Celeron, and Intel Core are trademarks of Intel Corporation in the U.S. and/or other countries.

Copyright © 2014 Intel Corporation. All rights reserved.

*Other names and brands may be claimed as the property of others.

Digital Security and Surveillance on 4th generation Intel® Core™ processors Using Intel® System Studio 2015


This article presents the advantages of developing embedded digital video surveillance systems to run on 4th generation Intel® Core™ processors with Intel® HD Graphics, in combination with the Intel® System Studio 2015 software development suite. Intel® HD Graphics is useful for implementing many types of computer vision functions in video management software, while Intel® System Studio 2015 is an embedded application development suite that is useful for developing robust digital video surveillance applications.

Memory profiling techniques using Intel System Studio


Introduction

One of the challenges in developing embedded systems is the detection of memory errors, such as:

  • Memory leaks
  • Memory corruption
  • Allocation / de-allocation API mismatches
  • Inconsistent memory API usage

These memory errors degrade the performance of any embedded system. Designing and programming an embedded application requires great care: the application must be robust enough to handle every error that can occur, and these errors should be anticipated and handled accordingly, especially in the area of memory.

This article describes how to use Intel® System Studio to find dynamic memory issues in an embedded application.

Intel® System Studio 2015

Intel® System Studio is a comprehensive, integrated tool suite that provides developers with advanced system tools and technologies to help accelerate the delivery of next-generation power-efficient, high-performance, and reliable embedded and mobile devices.

For more information about Intel® System Studio, see http://software.intel.com/en-us/intel-system-studio

Dynamic Memory Analysis

Dynamic memory analysis is the testing and evaluation of an embedded application for memory errors at runtime.

Advantage of dynamic memory analysis: the analysis is performed by actually executing the application. For dynamic memory analysis to be effective, the target program must be executed with enough test inputs to exercise the entire program.

Intel® Inspector for Systems

Intel® Inspector for Systems helps developers identify and resolve memory and threading correctness issues in their unmanaged C, C++ and Fortran programs as well as in the unmanaged portion of mixed managed and unmanaged programs. Additionally the tool identifies threading correctness issues in managed .NET C# programs.

Intel® Inspector for Systems currently identifies the following types of dynamic memory problems.

  • Incorrect memcpy call - When an application calls the memcpy function with two pointers that overlap within the range to be copied.
  • Invalid deallocation - When an application calls a deallocation function with an address that does not correspond to dynamically allocated memory.
  • Invalid memory access - When a read or write instruction references memory that is logically or physically invalid.
  • Invalid partial memory access - When a read or write instruction references a block (2 bytes or more) of memory where part of the block is logically invalid.
  • Memory growth - When a block of memory is allocated but not deallocated within a specific time segment during application execution.
  • Memory leak - When a block of memory is allocated, never deallocated, and not reachable at application exit (there is no pointer available to deallocate the block).
  • Memory not deallocated - When a block of memory is allocated, never deallocated, but still reachable at application exit (there is a pointer available to deallocate the block).
  • Mismatched allocation/deallocation - When a deallocation is attempted with a function that is not the logical reflection of the allocator used.
  • Missing allocation - When an invalid pointer is passed to a deallocation function. The invalid address may point to a previously released heap block.
  • Uninitialized memory access - When a read of an uninitialized memory location is reported.
  • Uninitialized partial memory access - When a read instruction references a block (2 bytes or more) of memory where part of the block is uninitialized.
  • Cross-thread stack access - When a thread accesses a different thread's stack.
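For illustration only, here is a small, deliberately buggy C++ fragment (not from any real application) containing three of the problem types listed above; a dynamic memory analysis run would be expected to flag each of them:

#include <cstdio>
#include <cstdlib>

int main() {
    // Memory leak: the block becomes unreachable once the pointer is overwritten.
    int *data = (int *)malloc(100 * sizeof(int));
    data[0] = 1;
    data = NULL;

    // Mismatched allocation/deallocation: memory from new[] released with free().
    int *table = new int[10];
    free(table);

    // Uninitialized memory access: reading a local that was never written.
    int values[4];
    printf("%d\n", values[2]);

    return 0;
}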

 

Conclusion: Intel® System Studio provides a dynamic memory analysis feature to help you build robust embedded applications.
