Improve Server Application Performance with Intel® Advanced Vector Extensions 2

The Intel® Xeon® processor E7 v3 family now includes an instruction set called Intel® Advanced Vector Extensions 2 (Intel® AVX2), which can potentially improve the performance of applications related to high performance computing, databases, and video processing. To validate this claim, I performed a simple experiment using the Intel® Optimized LINPACK benchmark. The results, shown in Table 1, indicate a greater than 2x performance increase with Intel AVX2 over Intel® Streaming SIMD Extensions (Intel® SSE) and a 1.7x increase with Intel AVX2 over Intel® Advanced Vector Extensions (Intel® AVX).

The results in Table 1 come from three problem sizes (30K, 75K, and 100K) run on Linux* with three instruction-set settings (Intel AVX2, Intel AVX, and Intel SSE4). The last two columns show the performance gain of Intel AVX2 over Intel SSE4 and over Intel AVX. Running an Intel AVX2-optimized LINPACK on an Intel AVX2-capable processor delivered ~2.89x-3.49x better performance than Intel SSE and ~1.73x-2.12x better performance than Intel AVX. These numbers illustrate the potential boost for LINPACK only; for other applications, the gain will vary with the code optimizations and the hardware environment.

Table 1 – Results and Performance Gain from Running the LINPACK Benchmark on Quad Intel® Xeon® Processor E7-8890 v3.

Linux* LINPACK v11.2.2 | Intel® AVX2 (Gflops) | Intel® AVX (Gflops) | Intel® SSE4 (Gflops) | Performance Gain over Intel SSE4 | Performance Gain over Intel AVX
30K | 1835.83 | 867.06 | 525.38 | 3.49 | 2.12
75K | 2092.87 | 1211.89 | 724.40 | 2.89 | 1.73
100K | 2130.31 | 1224.44 | 731.42 | 2.91 | 1.74

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark* and MobileMark*, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.

Configurations: Intel® Xeon® processor E7-8890 v3 @ 2.50GHz, 45MB L3 cache, 18-core pre-production system. 2x Intel® SSD DC P3700 Series @ 800GB, 256GB memory (32x8GB DDR4-2133MHz), BIOS by Intel Corporation Version: BRHSXSD1.86B.0063.R00.1503261059 (63.R00) BMC 70.7.5334 ME 2.3.0 SDR Package D.00, Power supply: 2x1200W non-redundant, running Microsoft Windows* Server 2012 R2 / SLES 11 SP3 Linux*

For more information go to http://www.intel.com/performance

How to take advantage of Intel® AVX2 in existing vectorized code

Vectorized code that uses floating point operations can get a potential performance boost when running on newer platforms such as the Intel Xeon processor E7 v3 family by doing the following:

  1. Recompile the code using the Intel® compiler with the proper Intel AVX2 switch to convert existing Intel SSE code. See the Intel® Compiler Options for Intel® SSE and Intel® AVX generation white paper for more details.
  2. Modify the code's function calls to leverage the Intel® Math Kernel Library (Intel® MKL), which is already optimized to use Intel AVX2 where supported.
  3. Use Intel AVX2 intrinsic instructions. High-level language (such as C or C++) developers can use Intel® intrinsic instructions to make the calls and recompile code; a minimal sketch follows this list. See the Intel® Intrinsics Guide and the Intel® 64 and IA-32 Architectures Optimization Reference Manual for more details.
  4. Code in assembly instructions directly. Low-level language (assembly) developers can use the equivalent Intel AVX2 instructions for their existing Intel SSE code. See the Intel® 64 and IA-32 Architectures Optimization Reference Manual for more details.
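As an illustration of the intrinsics option above, the following minimal sketch (not from the original article) replaces a scalar double-precision loop with Intel AVX2 intrinsics. The function and array names are hypothetical, and the code assumes a compiler with AVX2 support (for example, -march=core-avx2 on the Intel compiler or -mavx2 on GCC).

#include <immintrin.h>  /* Intel AVX2 intrinsics */

/* Element-wise multiply two double arrays (VMULPD) and add a third (VADDPD).
   For brevity, n is assumed to be a multiple of 4. */
void mul_add_avx2(const double *a, const double *b, const double *c,
                  double *out, int n)
{
    for (int i = 0; i < n; i += 4) {
        __m256d va = _mm256_loadu_pd(a + i);   /* VMOVUPD load  */
        __m256d vb = _mm256_loadu_pd(b + i);
        __m256d vc = _mm256_loadu_pd(c + i);
        __m256d vr = _mm256_add_pd(_mm256_mul_pd(va, vb), vc);
        _mm256_storeu_pd(out + i, vr);         /* VMOVUPD store */
    }
}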

Equivalent instructions for Intel® AVX2, Intel® AVX, and Intel® SSE used in the tests

Table 2 lists the equivalent instructions for Intel AVX2, Intel AVX, and Intel SSE (SSE/SSE2/SSE3/SSE4) that may be useful for migrating code. It contains three sets of instructions: the first set lists equivalent instructions across all three instruction sets (Intel AVX2, Intel AVX, and Intel SSE); the second set lists equivalent instructions across two instruction sets (Intel AVX2 and Intel AVX); and the last set lists Intel AVX2 instructions.

Table 2 – Intel® AVX2, Intel® AVX, and Intel® SSE Equivalent Instructions

Intel® AVX and Intel® AVX2 | Equivalent Intel® SSE | Definitions
VADDPD | ADDPD | Add packed double-precision floating-point values
VDIVSD | DIVSD | Divide low double-precision floating-point value in xmm2 by low double-precision floating-point value in xmm3/m64
VMOVSD | MOVSD | Move or merge scalar double-precision floating-point value
VMOVUPD | MOVUPD | Move unaligned packed double-precision floating-point values
VMULPD | MULPD | Multiply packed double-precision floating-point values
VPXOR | PXOR | Logical exclusive OR
VUCOMISD | UCOMISD | Unordered compare scalar double-precision floating-point values and set EFLAGS
VUNPCKHPD | UNPCKHPD | Unpack and interleave high packed double-precision floating-point values
VUNPCKLPD | UNPCKLPD | Unpack and interleave low packed double-precision floating-point values
VXORPD | XORPD | Bitwise logical XOR of double-precision floating-point values
Intel® AVX and Intel® AVX2 | Definitions
VADDSD | Add scalar (low) double-precision floating-point values
VBROADCASTSD | Broadcast a 64-bit double-precision floating-point element from memory to all elements of a YMM register
VCMPPD | Compare packed double-precision floating-point values
VCOMISD | Perform ordered comparison of scalar double-precision floating-point values and set flags in the EFLAGS register
VINSERTF128 | Replace one half of a 256-bit YMM register with the value of a 128-bit source operand; the other half is unchanged
VMAXSD | Return the maximum scalar double-precision floating-point value
VMOVQ | Move quadword
VMOVUPS | Move unaligned packed single-precision floating-point values
VMULSD | Multiply scalar (low) double-precision floating-point values
VPERM2F128 | Permute 128-bit floating-point fields in ymm2 and ymm3/mem using controls from imm8 and store the result in ymm1
VPSHUFD | Permute 32-bit blocks of an int32 vector
VXORPS | Perform bitwise logical XOR operation on float32 vectors
VZEROUPPER | Set the upper half of all YMM registers to zero. Used when switching between 128-bit use and 256-bit use.
Intel® AVX2 | Definitions
VEXTRACTF128 | Extract 128 bits of float data from ymm2 and store the result in xmm1/mem
VEXTRACTI128 | Extract 128 bits of integer data from ymm2 and store the result in xmm1/mem
VFMADD213PD | Multiply packed double-precision floating-point values from xmm0 and xmm1, add to xmm2/mem, and put the result in xmm0
VFMADD213SD | Multiply the scalar double-precision floating-point value from xmm0 and xmm1, add to xmm2/mem, and put the result in xmm0
VFMADD231PD | Multiply packed double-precision floating-point values from xmm1 and xmm2/mem, add to xmm0, and put the result in xmm0
VFMADD231SD | Multiply the scalar double-precision floating-point value in xmm1 and xmm2/mem, add to xmm0, and put the result in xmm0
VFNMADD213PD | Multiply packed double-precision floating-point values from xmm0 and xmm1, negate the product, add xmm2/mem, and put the result in xmm0
VFNMADD213SD | Multiply the low double-precision floating-point value of the second source operand by the low double-precision floating-point value of the first source operand, negate the infinite-precision intermediate result, add it to the low double-precision floating-point value of the third source operand, round, and store the result in the destination operand (first source operand)
VFNMADD231PD | Multiply packed double-precision floating-point values from ymm1 and ymm2/mem, negate the product, add ymm0, and put the result in ymm0
VMAXPD | Determine the maximum of packed double-precision floating-point values
VPADDQ | Add packed quadword integers
VPBLENDVB | Conditionally blend byte elements of the source vectors depending on bits in a mask vector
VPBROADCASTQ | Broadcast a quadword from the source operand to all elements of the result vector
VPCMPEQD | Compare packed doublewords of two source vectors for equality
VPCMPGTQ | Compare packed quadwords of two source vectors for greater-than

Table 2 lists just the instructions used in these tests. You can obtain the full list from the Intel® 64 and IA-32 Architectures Optimization Reference Manual. When the compiler is set to generate Intel AVX2 code, it uses instructions from all three instruction sets as needed.

Procedure for running LINPACK

  1. Download and install the following:
    1. Intel MKL – LINPACK Download
      http://software.intel.com/en-us/articles/intel-math-kernel-library-linpack-download
    2. Intel MKL
      http://software.intel.com/en-us/intel-math-kernel-library-evaluation-options
  2. Create input files for 30K, 75K, 100K from the “...\linpack” directory
  3. For optimal performance, make the following operating system and BIOS setting changes before running LINPACK:
    1. Turn off Intel® Hyper-Threading Technology (Intel® HT Technology) in the BIOS.
    2. For Linux, export the “MKL_CBWR=AVX2” setting on the command line and update the runme_xeon64 shell script file to use the input files you created.
    3. The results will be reported in Gflops, similar to Table 1.
  4. For Intel AVX runs, set the “MKL_CBWR=AVX” and repeat the above steps.
  5. For Intel SSE runs, set the “MKL_CBWR=SSE4_2” and repeat the above steps.

Platform Configuration

CPU & Chipset
  • Model/Speed/Cache: Intel® Xeon® processor E7-8890 v3 (code named Haswell-EX), 2.5GHz, 45MB, QGUA D0 step
  • # of cores per chip: 18
  • # of sockets: 4
  • Chipset: (code named Patsburg) (J C1 step)
  • System bus: 9.6GT/s QPI
Platform
  • Brand/model: (code named Brickland)
  • Chassis: Intel 4U Rackable
  • Baseboard: code named Brickland, 3 SPC DDR4
  • BIOS: BRHSXSD1.86B.0063.R00.1503261059 (63.R00)
  • DIMM slots: 96
  • Power supply: 2x1200W non-redundant
  • CD ROM: TEAC Slim
  • Network (NIC): 1x Intel® Ethernet Converged Network Adapter X540-T2 (code named "Twin Pond") (OEM-GEN)
Memory
  • Memory size: 256GB (32x8GB) DDR4 1.2V ECC 2133MHz RDIMMs
  • Brand/model: Micron MTA18ASF1G72PDZ-2G1A1HG
  • DIMM info: 8GB 2Rx8 PC4-2133P
Mass storage
  • Brand & model: Intel® S3700 Series SSD
  • Number/size/RPM/Cache: 2/800GB/NA
Operating system
  • Microsoft Windows* Server 2012 R2 / SLES 11 SP3 Linux*

Conclusion

From our LINPACK experiment, we see compelling performance benefits when moving to an Intel AVX2-enabled Intel Xeon processor. In this specific case, we saw a performance increase of ~2.89x-3.49x for Intel AVX2 vs. Intel SSE and ~1.73x-2.12x for Intel AVX2 vs. Intel AVX in our test environment. That is a strong case for developers who have Intel SSE-enabled code and are weighing the benefit of moving to a newer Intel Xeon processor-based system with Intel AVX2. To learn how to migrate existing Intel SSE code to Intel AVX2 code, refer to the materials below.

References


Intel® Xeon® Processor D Product Family Technical Overview

Contents

1. Form Factor Overview
2. Intel® Xeon® Processor D Product Family Overview
3. Intel® Xeon® Processor D Product Family Feature Overview
4. Intel® Xeon® processor D Product Family introduces new instructions as well as enhancements of previous instructions4
5. Intel® Advanced Vector Extensions 2 (Intel® AVX2) Instructions
6. VT Cache QoS Monitoring/Enforcement and Memory Bandwidth Monitoring4
7. A/D Bits for EPT
8. Intel® Virtual Machine Control Structure Shadowing (Intel® VMCS Shadowing)
9. APICv
10. Supervisor Mode Access Protection (SMAP)
11. RDSEED4
12. Intel® Trusted Execution Technology (Intel® TXT)
13. Intel® Node Manager
14. RAS – Reliability Availability Serviceability
15. Intel® Processor Trace4
16. Non-Transparent Bridge (NTB)
17. Asynchronous DRAM Refresh (ADR)
18. Intel® QuickData Technology
19. Resources

 

1. Form Factor Overview 

Microservers are an emerging form of servers designed to process lightweight, scale out workloads for hyper-scale data centers. They’re a good form factor example to use to describe the design considerations when implementing an Intel® SoC. Typical workloads suited for microservers include dynamic and static web page serving, entry dedicated hosting, cold and warm storage, and basic content delivery, among others. A microserver consists of a collection of nodes that share a common backplane.  Each node contains a system-on-chip (SoC), local memory for the SoC, and ideally all required IO components for the desired implementation. Because of the microserver’s high-density and energy-efficient design, its infrastructure (including the fan and power supply) can be shared by tens or even hundreds of SoCs, eliminating the space and power consumption demands of duplicate infrastructure components. Even within the microserver category, there is no one-size-fits-all answer to system design or processor choice. Some microservers may have high-performing single-socket processors with robust memory and storage, while others may have a far higher number of miniature dense configurations with lower power and relatively lower compute capacity per SoC.

Figure 1. Comparison of server form factors

To meet the full breadth of these requirements, Intel provides a range of processors that provide a spectrum of performance options so companies can select what’s appropriate for their lightweight scale out workloads. The Intel® Xeon® processor D product family offers new options for infrastructure optimization, by bringing the performance and advanced intelligence of Intel® Xeon® processors into dense, lower-power SoCs. The Intel® Xeon® processor E3 family offers a choice of integrated graphics, node performance, performance per watt, and flexibility. The Intel® Atom™ processor C2000 product family provides extreme low power and higher density.

The Intel® Xeon® processor D-1500 product family is Intel's first-generation SoC based on the Intel Xeon processor line and is manufactured using Intel's low-power 14nm process. This SoC adds performance capabilities to Intel's SoC lineup with features such as hyper-threading, larger cache sizes, DDR4 memory capability, Intel® 10GbE network support, and more. Power is also a point of focus, with an SoC thermal design power of 20-45 Watts and additional power capabilities such as Intel® Node Manager. Multiple redundancy features are also available to help mitigate memory and storage failures.

The data center environment is diversifying both in terms of the infrastructure and the market segments including storage, network, and cloud. Each area has unique requirements, providing opportunities for targeted solutions to best cover these needs. The Intel Xeon processor D-1500 product family extends market segment coverage beyond Intel’s previous microserver product line based on the Intel Atom processor C2000 product family. Cloud service providers can benefit from the SoC with compute-focused workloads associated with hyper scale out such as distributed memcaching, web frontend, content delivery, and dedicated hosting. The Intel Xeon processor D-1500 product family is also beneficial for mid-range network-focused workloads such as those associated with compact PCI advanced mezzanine cards (AMC) found in router mid-range control. For storage-focused workloads it can also provide benefit with entry enterprise SAN/NAS, cloud storage nodes, or warm cloud storage.

These SoCs offer a significant step up from the Intel® Atom™ SoC C2750, delivering up to 3.4 times the performance per node1,3 and up to 1.7x estimated better performance per watt.2,3 With exceptional node performance, up to 12 MB of last level cache, and support for up to 128 GB of high-speed DDR4 memory, these SoCs are ideal for emerging lightweight hyper-scale workloads, including memory caching, dynamic web serving, and dedicated hosting.

 

2. Intel® Xeon® Processor D Product Family Overview 

Table 1 provides a high-level summary of the hardware differences between the Intel Xeon processor D-1500 product family and the Intel Atom SoC C2000 product family. Some of the more notable changes introduced with the Intel Xeon processor D-1500 product family include Intel® Hyper-Threading Technology (Intel® HT Technology), an L3 cache, greater memory capacity and speed, C-states, and more.

Table 1. Comparison of the Intel® Atom™ Processor C2000 Product Family to the Intel® Xeon® Processor D Product Family

Feature | Intel® Atom™ Processor C2000 Product Family (Edisonville platform) | Intel® Xeon® Processor D-1500 Product Family (Grangeville platform)
Silicon core process technology | 22nm | 14nm
Core / thread count | Up to 8 cores / 8 threads | Up to 8 cores / 16 threads
Core frequency | Up to 2.4GHz (2.6GHz with Turbo) | Up to 2.0GHz (2.6GHz with Turbo)
L1 cache | 32KB data, 24KB instruction per core | 32KB data, 32KB instruction per core
L2 cache | 1MB shared per 2 cores | 256KB per core
L3 cache | None | 1.5MB per core
SoC thermal design power | 5W - 20W | ~20W - 45W
C-states | No | Yes
Memory addressing | 38 bits physical / 48 bits virtual | 48 bits physical / 48 bits virtual
Memory | 2 channels, 2 DIMMs per channel; 1600 DDR3/L; 64GB max capacity; SODIMM, UDIMM, VLP UDIMM ECC | 2 channels, 2 DIMMs per channel; 1600 DDR3/L and 2133 DDR4; 128GB max capacity; RDIMM, UDIMM, SODIMM ECC
IO: PCI Express* (PCIe) lanes | 16x PCIe Gen2 | 24x Gen3, 8x Gen2
IO: GbE | 4x 1GbE/2.5GbE | 2x 1GbE / 2.5GbE / 10GbE
IO: SATA ports | 4x SATA2, 2x SATA3 | 6x SATA3
IO: USB ports | 4x USB 2.0 | 4x USB 2.0, 4x USB 3.0

Figure 2. A block diagram of the Intel® Xeon® processor D-1500 product

 

3. Intel® Xeon® Processor D Product Family Feature Overview 

The rest of this paper discusses some of the new features in the Intel Xeon processor D-1500 product family. In Table 2, the items denoted with a superscript 4 are newly introduced with this version of the silicon; the other features previously existed on other Intel Xeon processor product families but are new to Intel's SoC product line, which previously contained only Intel Atom processors.

Table 2. Features and associated workload segments

Workload segments:

  • COMPUTE: Hyper Scale Out, Distributed Memcaching, Web Frontend, Content Delivery, Dedicated Hosting
  • NETWORK: Router Mid Control, such as with high-density, compact PCI Advanced Mezzanine Cards (AMC)

Features/Technologies:

  • New or Enhanced Instructions (ADC, SBB, ADCX, ADOX, PREFETCHW, MWAIT)4
  • Intel® Advanced Vector Extensions 2 (Intel® AVX2)
  • VT Cache QoS Monitoring/Enforcement4
  • Memory Bandwidth Monitoring4
  • A/D Bits for EPT
  • Intel® Virtual Machine Control Structure Shadowing (Intel® VMCS Shadowing)
  • Posted Interrupts
  • APICv
  • RDSEED4
  • Supervisor Mode Access Protection (SMAP)4
  • Intel® Trusted Execution Technology
  • Intel® Node Manager
  • RAS
  • Intel® Processor Trace4
  • Intel® QuickAssist Technology
  • Intel® QuickData Technology
  • Non-Transparent Bridge
  • Asynchronous DRAM Refresh

 

4. Intel® Xeon® processor D Product Family introduces new instructions as well as enhancements of previous instructions4 

ADCX (unsigned integer add with carry) and ADOX (unsigned integer add with overflow) have been introduced for Asymmetric Crypto Assist5, in addition to faster ADC/SBB instructions (no recompilation is required to get the ADC/SBB benefits). ADCX and ADOX are extensions of the ADC (add with carry) instruction for use in large integer arithmetic, that is, arithmetic on integers wider than 64 bits. The performance improvement comes from two parallel carry chains being supported at the same time. ADCX/ADOX can be combined with MULX for additional performance improvements in public key encryption such as RSA. Large integer arithmetic is also used for Elliptic Curve Cryptography (ECC) and Diffie-Hellman (DH) key exchange. Beyond cryptography, there are many use cases in research and high performance computing (HPC). The demand for this functionality is high enough to warrant a number of commonly used optimized libraries, such as the GNU Multiple Precision (GMP) library used by, for example, Mathematica; see New Instructions Supporting Large Integer Arithmetic on Intel® Architecture Processors. Taking advantage of these new instructions requires an updated software library and recompilation (Intel® Compiler 14.1+, GCC 4.7+, and Microsoft Visual Studio* 2013+).
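To make the carry-chain idea concrete, here is a minimal sketch (not from the original article) that adds two 256-bit integers stored as 64-bit limbs using the ADX intrinsic _addcarryx_u64 from <immintrin.h>. A real ADCX/ADOX kernel interleaves two independent carry chains inside a MULX-based multiply; this sketch shows a single chain only, and the function name and limb count are illustrative.

#include <immintrin.h>  /* _addcarryx_u64 (requires ADX support, e.g., -madx on GCC) */

/* Add two 256-bit integers stored as four 64-bit limbs, least-significant
   limb first, propagating the carry through the chain.
   Returns the final carry out. */
unsigned char add_256bit(const unsigned long long a[4],
                         const unsigned long long b[4],
                         unsigned long long sum[4])
{
    unsigned char carry = 0;
    for (int i = 0; i < 4; i++)
        carry = _addcarryx_u64(carry, a[i], b[i], &sum[i]);
    return carry;
}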

MWAIT extensions for advanced power management can be used by the Operating System to implement power management policy.

PREFETCHW, which prefetches data into the cache in anticipation of a write, now helps optimize the network stack.

For more information about these instructions see the Intel® 64 and IA-32 Architectures Software Developer's Manual. Currently, Intel® Compiler 14.1+, GCC 4.7+, and Microsoft Visual Studio* 2013+ support these instructions.

 

5. Intel® Advanced Vector Extensions 2 (Intel® AVX2) Instructions 

With Intel® AVX, all the floating point vector instructions were extended from 128 bits to 256 bits. The Intel Xeon processor D family further improves performance by reducing the latency of floating point multiply (MULPS/MULPD) to 3 cycles from 5 cycles on the previous generation of Intel Xeon processors. Intel® AVX2 also extends the integer vector instructions to 256 bits. Intel AVX2 uses the same 256-bit YMM registers as Intel AVX. Intel AVX2 instructions benefit high performance computing (HPC) applications, databases, and audio and video applications. Intel AVX2 instructions include fused multiply add (FMA), gather, shift, and permute instructions.

The FMA instruction computes ±(a×b)±c with only one rounding. The a×b intermediate result is not rounded, which brings increased accuracy compared to separate MUL and ADD instructions. FMA increases the performance and accuracy of many floating point computations, such as matrix multiplication, dot product, and polynomial evaluation. With 256 bits, each instruction operates on 8 single precision or 4 double precision elements. Since FMA combines two operations into one, floating point operations per second (FLOPS) are increased. Additionally, because there are two FMA units, the peak FLOPS are doubled.
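As an illustration (a minimal sketch, not taken from the article), a dot product written with the FMA intrinsic _mm256_fmadd_pd accumulates four double-precision products per instruction with a single rounding per element. The function name is hypothetical, and the code assumes FMA/AVX2 compiler support (e.g., -mfma -mavx2 on GCC).

#include <immintrin.h>  /* AVX2/FMA intrinsics */

/* Dot product of two double arrays using fused multiply-add.
   For brevity, n is assumed to be a multiple of 4. */
double dot_fma(const double *a, const double *b, int n)
{
    __m256d acc = _mm256_setzero_pd();
    for (int i = 0; i < n; i += 4) {
        __m256d va = _mm256_loadu_pd(a + i);
        __m256d vb = _mm256_loadu_pd(b + i);
        acc = _mm256_fmadd_pd(va, vb, acc);    /* acc = va * vb + acc, one rounding */
    }
    double lanes[4];
    _mm256_storeu_pd(lanes, acc);
    return lanes[0] + lanes[1] + lanes[2] + lanes[3];
}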

The gather instruction loads sparse elements to a single vector. It can gather 8 single precision (Dword) or 4 double precision (Qword) data elements into a vector register in a single operation. A base address points to the data structure in memory, and an Index (offset) gives the offset of each element from the base address. The mask register tracks which elements need to be gathered. Gather is complete when the mask register is all zeros. The gather instruction enables vectorization for workloads that could previously not be vectorized for various reasons.
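A minimal sketch of the gather operation described above (not from the article; the indices are arbitrary, and the unmasked intrinsic is used for simplicity):

#include <immintrin.h>  /* AVX2 gather intrinsics */

/* Gather four double-precision elements from sparse positions in 'table'
   with a single VGATHERDPD. The scale of 8 is the size of a double. */
__m256d gather_four(const double *table)
{
    __m128i idx = _mm_setr_epi32(0, 5, 17, 42);   /* four 32-bit indices */
    return _mm256_i32gather_pd(table, idx, 8);    /* table[0], table[5], table[17], table[42] */
}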

Intel Xeon processor D product family adds additional hardware capability with a gather index table (GIT) to improve performance (Figure 3). No recompiling is required to take advantage of this new feature. The GIT provides storage for full width indices near the address generation unit. A special load grabs the correct index, simplifying the index handling. Loaded elements are merged directly into the destination.

Figure 3. Gather Index Table Conceptual Block Diagram

Other new operations in Intel AVX2 include integer versions of the permute instructions, new broadcast instructions, and blend instructions. A radix-1024 divider for reduced latency, along with a "split" operation for scalar divides in which two scalar divides occur simultaneously, improves performance over previous generations of Intel Xeon processors.
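For illustration (a minimal sketch with arbitrary values, not from the article), the integer broadcast, compare, and blend intrinsics can be combined to select bytes from one of two vectors based on a threshold:

#include <immintrin.h>  /* AVX2 integer intrinsics */

/* Broadcast a threshold byte, compare (VPCMPGTB), and conditionally blend
   byte elements (VPBLENDVB): take bytes from 'b' wherever the corresponding
   signed byte of 'a' exceeds the threshold, otherwise keep 'a'. */
__m256i blend_above_threshold(__m256i a, __m256i b, char threshold)
{
    __m256i t    = _mm256_set1_epi8(threshold);   /* broadcast to all 32 bytes */
    __m256i mask = _mm256_cmpgt_epi8(a, t);       /* 0xFF where a > threshold  */
    return _mm256_blendv_epi8(a, b, mask);        /* mask set -> b, else a     */
}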

Currently, the Intel Compiler 14.1+, GCC 4.7+, and Microsoft Visual Studio 2013+ support these instructions.

 

6. VT Cache QoS Monitoring/Enforcement and Memory Bandwidth Monitoring4 

The Intel Xeon processor D product family has the ability to monitor the last level of processor cache on a per-thread, per-application, or per-VM basis. This allows the VMM or OS scheduler to make changes based on policy enforcement. One scenario where this can be of benefit is a multi-tenant environment in which one VM is causing a lot of cache thrashing. This feature allows the VMM or OS to migrate this "noisy neighbor" to a different location where it may have less of an impact on other VMs. This product family also introduces a new capability to manage the processor LLC based on pre-defined levels of service, independent of the OS or VMM. A QoS mask can be used to provide 16 different levels of enforcement to limit the amount of cache that a thread can consume.

The Intel® 64 and IA-32 Architectures Software Developer's Manual (SDM), Volume 3, Section 17.14 provides the CQM and MBM programming details; Section 17.15 provides the CQE programming details. To convert a raw value read from the IA32_QM_CTR register to bytes, multiply it by the upscaling factor reported in CPUID.0xF.1:EBX.
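As a small illustration of that conversion (a sketch only, assuming a recent GCC/Clang toolchain for __get_cpuid_count; reading IA32_QM_CTR itself requires ring-0 access, for example through the Linux msr driver, and is not shown):

#include <cpuid.h>    /* __get_cpuid_count (GCC/Clang) */
#include <stdint.h>

/* Convert a raw IA32_QM_CTR occupancy sample to bytes using the
   upscaling factor reported in CPUID.(EAX=0xF, ECX=1):EBX. */
uint64_t qm_ctr_to_bytes(uint64_t raw_counter)
{
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid_count(0xF, 1, &eax, &ebx, &ecx, &edx))
        return 0;                        /* CQM sub-leaf not supported */
    return raw_counter * (uint64_t)ebx;  /* EBX holds bytes per counter unit */
}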

For additional resources see: Benefits of Intel Cache Monitoring Technology in the Intel® Xeon™ Processor E5 v3 Family, Intel's Cache Monitoring Technology Software-Visible Interfaces, Intel's Cache Monitoring Technology: Use Models and Data, or Intel's Cache Monitoring Technology: Software Support and Tools

Figure 4. Cache and memory bandwidth monitoring and enforcement vectors.

Another new capability enables the OS or VMM to monitor memory bandwidth. This allows scheduling decisions to be made based on memory bandwidth usage on a per core or thread basis. An example of this situation is when one core is being heavily utilized by two applications, while another core is being underutilized by two other applications. With memory bandwidth monitoring the OS or VMM now has the ability to schedule a VM or an application to a different core to balance out memory bandwidth utilization. In Figure 5 two high memory bandwidth applications are competing for the same resource. The OS or VMM can move one of the high bandwidth memory applications to another resource to balance out the load on the cores.

Figure 5. Memory Bandwidth Monitoring use case

 

7. A/D Bits for EPT 

In the previous generation, accessed and dirty bits (A/D bits) were emulated in VMM and accessing them caused VM exits. EPT A/D bits are implemented in hardware to reduce VM exits. This enables efficient live migration of VMs and fault tolerance.

Figure 6. VM exits with EPT A/D in hardware vs emulation

This feature requires enabling VT-x at the BIOS level. Currently it is supported by KVM with 3.6+ kernel and Xen* 4.3+. For other VM providers please contact them to find out when this feature will be supported.

 

8. Intel® Virtual Machine Control Structure Shadowing (Intel® VMCS Shadowing) 

Nested virtualization allows a root Virtual Machine Monitor (VMM) to support guest VMMs. However, additional Virtual Machine (VM) exits can impact performance. As shown in Figure 7, Intel® VMCS Shadowing directs the guest VMM VMREAD/VMWRITE to a VMCS shadow structure. This reduces nesting induced VM exits. Intel VMCS Shadowing increases efficiency by reducing virtualization latency.

Figure 7. VM exits with Intel® VMCS Shadowing vs software-only

This feature requires enabling VT-x at the BIOS level. Currently it is supported by KVM with Linux Kernel 3.10+ and Xen 4.3+. For other VM providers please contact them to find out when this feature will be supported.

 

9. APICv 

The Virtual Machine Monitor emulates most guest accesses to interrupts and the Advanced Programmable Interrupt Controller (APIC) in a virtual environment. This causes VM exits, creating overhead on the system. APICv offloads this task to the hardware, eliminating VM exits and increasing I/O throughput.

Figure 8. VM exits with APICv vs without APICv

This feature requires enabling VT-x at the BIOS level. Currently it is supported by KVM with Linux Kernel 3.10+, ESX(i)* 4.0+. For other VM providers please contact them to find out when this feature will be supported.

 

10. Supervisor Mode Access Protection (SMAP) 4 

Supervisor Mode Access Protection (SMAP) is a new CPU-based mechanism for user-mode address-space protection. It extends the protection that previously was provided by Supervisor Mode Execution Prevention (SMEP). SMEP prevents supervisor mode execution from user pages, while SMAP prevents unintended supervisor mode accesses to data on user pages. There are legitimate instances where the operating system needs to access user pages, and SMAP does provide support for those situations.

Figure 9. SMAP conceptual diagram

SMAP was developed with the Linux community and is supported on kernel 3.12+ and KVM version 3.15+. Support for this feature depends on which operating system or VMM you are using.

 

11. RDSEED4 

The RDSEED instruction is intended for seeding a pseudorandom number generator (PRNG) of arbitrary width, which can be useful when you want to create stronger cryptographic keys. If you do not need to seed another PRNG, use the RDRAND instruction instead. For more information see Table 3, Figure 10, and The Difference Between RDRAND and RDSEED.

Table 3. RDSEED and RDRAND compliance and source information

Instruction | Source | NIST Compliance
RDRAND | Cryptographically secure pseudorandom number generator | SP 800-90A
RDSEED | Non-deterministic random bit generator | SP 800-90B & C (drafts)

Figure 10. RDSEED and RDRAND conceptual block diagram

Currently the Intel® Compiler 15+, GCC 4.8+, and Microsoft Visual Studio* 2013+ support RDSEED.

RDSEED loads a hardware-generated random value and stores it in the destination register. The random value is generated from an Enhanced NRBG (Non-deterministic Random Bit Generator) that is compliant with NIST SP800-90B and NIST SP800-90C in the XOR construction mode.

In order for the hardware design to meet its security goals, the random number generator continuously tests itself and the random data it is generating. The self-test hardware detects run-time failures in the random number generator circuitry or statistically anomalous data occurring by chance and flags the resulting data as bad. In such extremely rare cases, the RDSEED instruction will return no data instead of bad data.

Intel C/C++ Compiler Intrinsic Equivalent:

  • RDSEED: int _rdseed16_step(unsigned short *);
  • RDSEED: int _rdseed32_step(unsigned int *);
  • RDSEED: int _rdseed64_step(unsigned __int64 *);

As with RDRAND, RDSEED will avoid any OS or library enabling dependencies and can be used directly by any software at any protection level or processor state.
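Because RDSEED can transiently return no data (as noted above), callers typically retry. A minimal sketch using the _rdseed64_step intrinsic follows; the retry limit and function name are illustrative, and the code assumes RDSEED compiler support (e.g., -mrdseed on GCC).

#include <immintrin.h>  /* _rdseed64_step, _mm_pause */
#include <stdint.h>

/* Obtain a 64-bit seed with RDSEED, retrying while the generator reports
   that no data is available (the intrinsic returns 0 in that case).
   Returns 1 on success, 0 if the retry limit is exhausted. */
int get_seed64(uint64_t *seed)
{
    for (int tries = 0; tries < 100; tries++) {
        unsigned long long value;
        if (_rdseed64_step(&value)) {
            *seed = value;
            return 1;
        }
        _mm_pause();    /* brief pause before retrying */
    }
    return 0;
}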

For more information see section 7.3.17.2 of the Intel® 64 and IA-32 Architectures Software Developer’s Manual (SDM).

 

12. Intel® Trusted Execution Technology (Intel® TXT) 

Intel® TXT is the hardware basis for mechanisms that validate platform trustworthiness during boot and launch, which enables reliable evaluation of the computing platform and its protection level. Intel TXT is compact and difficult to defeat or subvert, and it allows for flexibility and extensibility to verify the integrity during boot and launch of platform components, including BIOS, operating system loader, and hypervisor. Because of the escalating sophistication of malicious threats, mainstream organizations must employ ever-more stringent security requirements and scrutinize every aspect of the execution environment.

Intel TXT reduces the overall attack surface for both individual systems and compute pools. The technology provides a signature that represents the state of an intact system’s launch environment. The corresponding signature at the time of future launches can then be compared against that known-good state to verify a trusted software launch, to execute system software, and to ensure that cloud infrastructure as a service (IaaS) has not been tampered with. Security policies based on a trusted platform or pool status can then be set to restrict (or allow) the deployment or redeployment of virtual machines (VMs) and data to trusted platforms with known security profiles. Rather than relying on the detection of malware, Intel TXT builds trust into a known software environment and thus ensures that the software being executed hasn’t been compromised. This advances security to address key stealth attack mechanisms used to gain access to parts of the data center in order to access or compromise information. Intel TXT works with Intel® Virtualization Technology (Intel® VT) to create a trusted, isolated environment for VMs.

Figure 11. Simplified Intel® TXT Component diagram

For more details on Intel TXT and its implementation see Intel® TXT Enabling Guide.

 

13. Intel® Node Manager 

Intel® Node Manager is a core set of power management features providing a smart way to optimize and manage power, cooling, and compute resources in the data center. This server management technology extends component instrumentation to the platform level and can be used to make the most of every watt consumed in the data center. First, Intel Node Manager reports vital platform information, such as power, temperature, and resource utilization, using standards-based, out-of-band communications. Second, it provides fine-grained controls to limit platform power in compliance with IT policy. This feature can be found across Intel product segments, providing consistency within the data center.

Table 4. Intel® Node Manager features


To use this feature you must enable the BMC LAN and the associated BMC user configuration at the BIOS level, which should be available under the server management menu. The Programmer's Reference Kit is very simple to use and requires no additional external libraries to compile or run: all that is needed is a C/C++ compiler, after which you run the configuration and compilation scripts.

Intel® Node Manager website

Intel® Node Manager Programmer’s Reference Kit

Open Source Reference Kit

How to set up Intel® Node Manager

 

14. RAS – Reliability Availability Serviceability 

Server reliability, availability, and serviceability (RAS) are crucial issues for modern enterprise IT data centers that deliver mission-critical applications and services, as application delivery failures can be extremely costly per hour of system downtime. Furthermore, the likelihood of such failures increases statistically with the size of the servers, data, and memory required for these deployments. The Intel Xeon processor D product family offers a set of RAS features in silicon to provide error detection, correction, containment, and recovery. This feature set is a powerful foundation for hardware and software vendors to build higher-level RAS layers and provide overall server reliability across the entire hardware-software stack from silicon to application delivery and services. Table 5 shows a comparison of the RAS features available on the Intel Xeon processor D product family vs the Intel Atom processor C2000 series.

Table 5. Comparison of RAS features

Category | Feature | Intel® Atom™ Processor C2000 Product Family (Edisonville platform) | Intel® Xeon® Processor D-1500 Product Family (Grangeville platform)
Memory | ECC | |
Memory | Error detection and correction coverage | |
Memory | Failed DIMM identification | |
Memory | Memory address parity protection on reads/writes | No |
Memory | Memory demand and patrol scrubbing | |
Memory | Memory thermal throttling | |
Memory | Memory BIST including error injection | No |
Memory | Data scrambling with address | |
Memory | SDDC | No |
Platform | PCIe* device surprise removal | No |
Platform | PCIe and GbE Advanced Error Reporting (AER) | |
Platform | PCIe device hot add / remove / swap | No |
Platform | ECRC on PCIe | No |
Platform | Data poisoning - containment | Via parity |
Platform | Corrected error cloaking from OS | |
Platform | Disable CMCI | No CMCI support |
Platform | Uncorrected error signaling to SMI (dual-signaling) | |
Platform | Intel® Silicon View Technology | No |

 

15. Intel® Processor Trace4 

Intel® Processor Trace enables low-overhead instruction tracing of workloads to memory. This can be of value for low-level debugging, fine-tuning performance, or post-mortem analysis (core dumps, save on crash, etc.). The output includes control flow details, enabling precise reconstruction of the path of software execution. It also provides timing information, software context details, processor frequency indication, and more. Intel Processor Trace has a sampling mode to estimate the number of function calls and loop iterations in an application being profiled. It has limited impact on system execution and does not require any application enabling; you simply need Intel® VTune™ Amplifier 2015 Update 1 or newer.

For additional information see the Intel® Processor Trace lecture or pdf given at IDF14.

Figure 12. Overview of Intel® Processor Trace

 

16. Non-Transparent Bridge (NTB) 

Non-Transparent Bridge (NTB) reduces loss of data by allowing a secondary system to take over the PCIe* storage devices in the event of a CPU failure, providing high availability for your storage devices.

Figure 13. Overview of Non-Transparent Bridge with a local and remote host on the Intel® Xeon® processor D product family

 

17. Asynchronous DRAM Refresh (ADR) 

Asynchronous DRAM Refresh (ADR) preserves key data in the battery-backed DRAM in the event of AC power supply failure.

Figure 14. Overview of Asynchronous DRAM Refresh

 

18. Intel® QuickData Technology 

Intel® QuickData Technology is a platform solution designed to maximize the throughput of server data traffic across a broader range of configurations and server environments to achieve faster, scalable, and more reliable I/O. It enables the chipset instead of the CPU to copy data, which allows data to move more efficiently through the server.  This technology is supported on Linux kernel 2.6.18+ and Windows* Server 2008 R2 and will require enabling within the BIOS.

For more information, see the Intel® QuickData Technology Software Guide for Linux.

Figure 15. Overview of Intel® QuickData Technology

 

19. Resources 

Intel® Xeon® processor D product family performance comparisons for general compute, cloud, storage and network.

Intel® 64 and IA-32 Architectures Software Developer’s Manual (SDM)

Intel® Processor Trace IDF 2014 Video Presentation

Intel® Processor Trace IDF 2014 PDF Presentation

Benefits of Intel Cache Monitoring Technology in the Intel® Xeon™ Processor E5 v3 Family

Intel’s Cache Monitoring Technology Software-Visible Interfaces

Intel's Cache Monitoring Technology: Use Models and Data

Intel's Cache Monitoring Technology: Software Support and Tools

The Difference Between RDRAND and RDSEED

Intel® Node Manager website

Intel® Node Manager Programmer’s Reference Kit

Open Source Reference Kit

How to set up Intel® Node Manager

Intel® QuickData Technology Software Guide for Linux

Haswell Cryptographic Performance

Intel® TXT Enabling Guide

Intel® Atom™ processor C2000 product family

Intel® Xeon® processor E3 family

  1. Up to 3.4x better performance on dynamic web serving Intel® Xeon® processor D-based reference platform with one Intel Xeon processor D (8C, 1.9GHz, 45W, ES2), Intel® Turbo Boost Technology enabled, Intel® Hyper-Threading Technology enabled, 64GB memory (4x16GB DDR4-2133 RDIMM ECC), 2x10GBase-T X552, 3x S3700 SATA SSD, Fedora* 20 (3.17.8-200.fc20.x86_64, Nginx* 1.4.4, Php-fpm* 15.4.14, Memcached* 1.4.14, Simultaneous users=43844 Supermicro SuperServer* 5018A-TN4 with one Intel® Atom™ processor C2750 (8C, 2.4GHz,20W), Intel Turbo Boost Technology enabled, 32GB memory (4x8GB DDR3-1600 SO-DIMM ECC), 1x10GBase-T X520, 2x S3700 SATA SSD, Ubuntu* 14.10 (3.16.0-23 generic), Nginx 1.4.4, Php-fpm 15.4.14, Memcached 1.4.14, Simultaneous users=12896.2
  2. Up to 1.7x (estimated) better performance per watt on dynamic web serving Intel® Xeon® processor D-based reference platform with one Intel Xeon processor D (8C, 1.9GHz, 45W, ES2), Intel® Turbo Boost Technology enabled, Intel® Hyper-Threading Technology enabled, 64GB memory (4x16GB DDR4-2133 RDIMM ECC), 2x10GBase-T X552, 3x S3700 SATA SSD, Fedora* 20 (3.17.8-200.fc20.x86_64, Nginx* 1.4.4, Php-fpm* 15.4.14, Memcached* 1.4.14, Simultaneous users=43844, Estimated wall power based on microserver chassis, power=90W, Perf/W=487.15 users/W Supermicro SuperServer* 5018A-TN4 with one Intel® Atom™ processor C2750 (8C, 2.4GHz,20W), Intel® Turbo Boost Technology enabled, 32GB memory (4x8GB DDR3-1600 SO-DIMM ECC), 1x10GBase-T X520, 2x S3700 SATA SSD, Ubuntu* 14.10 (3.16.0-23 generic), Nginx 1.4.4, Php-fpm 15.4.14, Memcached 1.4.14, Simultaneous users=12896. Maximum wall power =46W, Perf/W=280.3 users/W
  3. Software and workloads used in performance tests may have been optimized for performance only on Intel® microprocessors. Performance tests, such as SYSmark* and MobileMark*, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more information go to http://www.intel.com/performance.
  4. New feature introduced with the Intel® Xeon® processor D product family. Intel technologies may require enabled hardware, specific software, or services activation. Performance varies depending on system configuration. Check with your system manufacturer or retailer.
  5. Intel® processors do not contain crypto algorithms, but support math functionality that accelerates the sub-operations.

Improving OpenSSL Performance

Contents

Abstract
Overview of OpenSSL
      What are SSL/TLS
      What is OpenSSL
      Goals of OpenSSL 1.0.2 Cryptographic Improvements
Key Components of OpenSSL 1.0.2
      Function Stitching
      Applying Multi-Buffer to OpenSSL
System Configuration and Experimental Setup
      Speed tests
Performance
      AES Results
      Public Key Cryptography Results
      Stitching Results
      Multi-Buffer Results
Authors
Contributors
Conclusion
References

 

Abstract 


The SSL/TLS protocols involve two compute-intensive cryptographic phases: session initiation and bulk data transfer. OpenSSL 1.0.2 introduces a comprehensive set of enhancements to cryptographic functions, such as AES in different modes, the SHA1, SHA256, and SHA512 hash functions (for bulk data transfer), and public key cryptography such as RSA, DSA, and ECC (for session initiation). The optimizations target Intel® Core™ processors and Intel® Atom™ processors running in 32-bit and 64-bit modes.

OpenSSL [1] is one of the leading open source implementations of cryptographic functions and the go-to library for applications requiring the use of the SSL/TLS [2] protocols. The results for the cryptographic functions commonly used during the SSL/TLS session initiation/handshake and bulk data transfer phases are given.

 

Overview of OpenSSL 


What are SSL/TLS 

TLS (Transport Layer Security) [2] and its predecessor, SSL (Secure Sockets Layer), are cryptographic protocols that are used to provide secure communications over networks.

These protocols allow applications to communicate over the network while preventing eavesdropping and tampering. That is, third parties cannot read the content being transferred and cannot modify that content without the receiver detecting it.

These protocols operate in two phases. In the first phase, a session is initiated. The server and client negotiate to select a cipher-suite for encryption and authentication and a shared secret key. In the second phase, the bulk data is transferred. The protocols use encryption of the data packets to ensure that third parties cannot read the contents of the data packets. They use a message authentication code (MAC), based on a cryptographic hash of the data, to ensure that the data is not modified in transit.

During session initiation, before a shared secret key has been generated, the client must communicate private messages to the server using a public key encryption method. The most popular such method is RSA, which is based on modular exponentiation. Modular exponentiation is a compute-intensive operation, which accounts for the majority of the session initiation cycles. A faster modular exponentiation implementation directly translates to a lower session initiation cost.
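For illustration only (a minimal sketch, not OpenSSL's implementation, which operates on multi-limb big integers with Montgomery reduction), square-and-multiply modular exponentiation shows where the cycles go: one modular squaring per exponent bit plus a modular multiply per set bit. The 64-bit operand width and the use of the GCC/Clang __int128 type are assumptions made for brevity.

#include <stdint.h>

/* (base^exp) mod m via right-to-left square-and-multiply. */
static uint64_t mulmod(uint64_t a, uint64_t b, uint64_t m)
{
    return (uint64_t)(((unsigned __int128)a * b) % m);   /* 128-bit intermediate */
}

uint64_t modexp(uint64_t base, uint64_t exp, uint64_t m)
{
    uint64_t result = 1 % m;
    base %= m;
    while (exp) {
        if (exp & 1)
            result = mulmod(result, base, m);   /* multiply for each set bit */
        base = mulmod(base, base, m);           /* square once per bit */
        exp >>= 1;
    }
    return result;
}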

Under SSL, the bulk data being transferred is broken into records with a maximum size of 16KBytes (for SSLv3 and TLSv1).

Figure 1: SSL Computations of Cipher and MAC

A header is added, and a message authentication code (MAC) is computed over the header and data using a cryptographic hash function. The MAC is appended to the end of the message, and the message is padded. Then everything other than the header is encrypted with the chosen cipher.

The key point here is that all of the bulk data buffers have two algorithms applied to them: encryption and authentication. In many cases, these two algorithms can be stitched [3] together to increase the overall performance. Some cipher suites use modes such as GCM that combine encryption and authentication; in these cases, stitching the computations is easier.

What is OpenSSL 

OpenSSL [1] is an open-source implementation of the SSL and TLS protocols, used by many applications and large companies.

For these companies, the most interesting aspect of OpenSSL’s implementation is the number of connections that a server can handle (per second), as this translates directly to the number of servers needed to service their client base. The way to maximize the number of connections is to minimize the cost of each connection, which can be done by minimizing the cost of initiating a session and by minimizing the cost of transferring the data for that session.

Goals of OpenSSL 1.0.2 Cryptographic Improvements 

Some of the OpenSSL Project’s goals for the cryptographic optimizations were:

  1. Augment the OpenSSL software architecture to support multi-buffer processing techniques to extract maximum performance from the processor’s SIMD architecture.
  2. Deliver market-leading SSL/TLS performance using highly optimized stitched algorithms.
  3. Extend SIMD utilization in the crypto space (e.g., Intel® Streaming SIMD Extensions (Intel® SSE)-based SHA2 implementations).
  4. Utilize Intel® Advanced Vector Extensions 2 (Intel® AVX2) for a wide range of crypto algorithms like RSA and SHA.
  5. Wherever possible, extract maximum algorithmic performance using the new instructions MULX, ADCX, ADOX, RORX, and RDSEED.
  6. SSL/TLS payload processing performance tradeoffs should favor payloads that are less than ~1400 bytes.
  7. Integrate all the functionality into the OpenSSL 1.0.x and future 1.1.x codelines in a manner that allows their automatic use by applications using existing OpenSSL interfaces, without any additional required initializations.

 

Key Components of OpenSSL 1.0.2 


Some of the key cryptographic optimizations in OpenSSL 1.0.2 include:

  • Multi-buffer [4] support for AES [128|256] CBC encryption
  • Multi-buffer support for [SHA-1|SHA-256] utilizing architectural features [Intel SSE | Intel AVX | Intel AVX2-BMI2]
  • Single-buffer support for “Stitched” AES [128|256] CBC [SHA-1|SHA-256] utilizing architectural features [Intel SSE | Intel AVX | Intel AVX2]
    • AES-128-CBC-Encrypt-SHA-1-AVX2-BMI2
    • AES-256-CBC-Encrypt-SHA-1-AVX2-BMI2
    • AES-128-CBC-Encrypt-SHA-256-SSE
    • AES-256-CBC-Encrypt-SHA-256-SSE
    • AES-128-CBC-Encrypt-SHA-256-AVX
    • AES-256-CBC-Encrypt-SHA-256-AVX
    • AES-128-CBC-Encrypt-SHA-256-AVX2-BMI2
    • AES-256-CBC-Encrypt-SHA-256-AVX2-BMI2
  • Single-buffer support for “stitched” AES [128|256] GCM
  • Single-buffer SHA-1 performance enhancements utilizing Intel AVX2 and BMI2
  • Single-buffer SHA-2 suite SHA[224|256|384|512] performance enhancements utilizing [Intel SSE | Intel AVX | Intel AVX2-BMI2] [5]
  • RSA and DSA (Key size >= 1024) support using [legacy | MULX | ADCX – ADOX] instructions [6]
  • ECC – ECDH and ECDSA [MULX | ADCX – ADOX]
  • Intel® Secure Hash Algorithm Extensions (Intel® SHA Extensions) new instructions [7]

The RSA/DSA/ECC optimizations target the session initiation phase. The rest are for improved performance of the bulk data transfer phase. Multi-buffer implementations provide the largest speedup but are currently designed to work only for encryption flows.

Pairs of algorithms to implement via function stitching were chosen based on the most commonly used cipher-suites today and in the near future. For scenarios that cannot be covered with function stitching, the singular encryption or authentication functions were optimized.

Function Stitching 

Function stitching is a technique for optimizing two algorithms that typically run in combination yet sequentially, by finely interleaving their operations to maximize use of the processor's compute resources. This section presents just a brief overview of stitching. A more detailed description can be found in [3].

Function stitching is the fine-grained interleaving of the instructions from each algorithm so that both algorithms are executed simultaneously. The advantage of doing this is that execution units that would otherwise be idle when executing a single algorithm (due to either data dependencies or instruction latencies) can be used to execute instructions from the other algorithm, and vice versa [3].
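As a conceptual illustration only (OpenSSL's actual stitched kernels are hand-written assembly), the sketch below interleaves the rounds of one AES-128 encryption, via the AES-NI intrinsics, with a toy hash-style update in the same loop body, so that independent instructions from the two computations can fill each other's latency gaps. The round keys are assumed to be already expanded, and the hash update is a stand-in for real SHA rounds, not an actual hash.

#include <immintrin.h>  /* AES-NI intrinsics (e.g., -maes on GCC) */
#include <stdint.h>

/* Interleave one AES-128 block encryption with a toy ALU-bound "hash" update.
   round_keys[0..10] must hold the expanded AES-128 key schedule. */
__m128i stitched_demo(__m128i block, const __m128i round_keys[11],
                      uint32_t *hash_word, const uint32_t msg[9])
{
    __m128i s = _mm_xor_si128(block, round_keys[0]);   /* initial AddRoundKey */
    uint32_t h = *hash_word;
    for (int r = 1; r <= 9; r++) {
        s = _mm_aesenc_si128(s, round_keys[r]);        /* cipher work: AES unit   */
        h = ((h << 5) | (h >> 27)) + msg[r - 1];       /* "hash" work: ALU/rotate */
    }
    s = _mm_aesenclast_si128(s, round_keys[10]);       /* final AES round */
    *hash_word = h;
    return s;                                          /* encrypted block */
}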

Applying Multi-Buffer to OpenSSL 

Multi-buffer [4] is an efficient method to process multiple independent data buffers in parallel for cryptographic algorithms, such as hashing and encryption. Processing multiple buffers at the same time can result in significant performance improvements—both for the case where the code can take advantage of SIMD (Intel SSE/Intel AVX) instructions (e.g., Intel SHA Extensions), and even in some cases where it can’t (e.g., AES CBC encrypt using Intel® Advanced Encryption Standard New Instructions (Intel® AES-NI)).

Figure 2: Multi-Buffer Processing

Multi-buffer generally requires a scheduler that can process multiple data buffers of different sizes with minimal performance overhead, and good solutions exist for this. Integrating multi-buffer into serially designed synchronous applications/frameworks, however, can be challenging and was one of the key problems when applying multi-buffer to OpenSSL. We solved it by breaking up records during encryption into smaller, equal-sized sub-records. This solution, however, does not apply to decryption flows.

 

System Configuration and Test Setup 


The performance results provided in this section were measured on 3 Intel Core processors and 2 Intel Atom processors. The systems were:

  1. Intel® Core™ i7-3770 processor @ 3.4 GHz         (codenamed Ivy Bridge (IVB))
  2. Intel® Core™ i5-4250U processor @ 1.30 GHz      (codenamed Haswell (HSW))
  3. Intel® Core™ i5-5200U processor @ 2.20 GHz      (codenamed Broadwell (BDW))
  4. Intel® Atom™ processor N450 @ 1.66GHz           (codenamed Bonnell (BNL))
  5. Intel® Atom™ processor N2810 @ 2.00GHz         (codenamed Silvermont (SLM))

The tests were run on a single core with Intel® Turbo Boost Technology off, and with Intel® Hyper-Threading Technology (Intel® HT Technology) on for the three Intel Core processors. Note that the Intel Core i5-5200U processor was defaulting to "power saving" mode at boot and was running at 800 MHz for these tests. However, all test results are given in terms of cycles in order to provide an accurate representation of the microarchitecture’s capabilities and to eliminate any frequency discrepancies.

Speed tests 

The OpenSSL 'speed' benchmark was run for the performance tests. Some example command lines are:

./bin/64/openssl speed -evp aes-128-gcm

./bin/64/openssl speed -decrypt -evp aes-128-gcm

./bin/64/openssl speed -evp aes-128-cbc-hmac-sha1

./bin/64/openssl speed -decrypt -evp aes-128-cbc-hmac-sha1

./bin/64/openssl speed -mb -evp aes-128-cbc-hmac-sha1

Note that the “-mb” switch is new and has been added to enable running multi-buffer performance tests.

 

Performance 


Results are normalized and in most cases converted to 'cycles per byte' (CPB) of processed data. CPB is the standard metric for cryptographic algorithm efficiency.

The following graphs show the performance for 32 and 64-bit code.

Note: Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance. Buyers should consult other sources of information to evaluate the performance of systems or components they are considering purchasing. For more information on performance tests and on the performance of Intel products, Go to: http://www.intel.com/performance/resources/benchmark_limitations.htm

AES Results 

Figure 3: AES Encrypt (Intel® Core™ processors)

AES-CBC encryption gains from IVB to HSW are due to a 1-cycle latency reduction in the AESENC[LAST] and AESDEC[LAST] instructions.

AES-GCM performance gains from IVB to HSW are due to Intel AVX and PCLMULQDQ microarchitecture enhancements, and from HSW to BDW due to further PCLMULQDQ microarchitecture enhancements.

Figure 4: AES Decrypt (Intel® Core™ processors)

Most of the popular AES decrypt modes are throughput limited, rather than latency limited. We implemented parallel AES-CBC decrypt processing 6 blocks at a time.

Figure 5: AES Encrypt (Intel® Atom™ processors)

SLM introduces the AES and PCLMULQDQ instructions, resulting in a huge speedup for both CBC and GCM modes.

Figure 6: AES Decrypt (Intel® Atom™ processors)

Public Key Cryptography Results 

Figure 7: Public Key Cryptography (Intel® Core™ processors)

IVB gains on RSA are due to algorithmic optimizations.

HSW RSA2048 is a special case where some of the gain is due to an Intel AVX2 implementation. All the rest of the gains are due to scalar code tuning/algorithmic improvements.

On BDW the addition of MULX/ADOX/ADCX (LIA instructions) results in large performance gains over HSW.

We added generic code in the Montgomery multiply function so it scales across all RSA sizes, DSA, DH, and ECDH.

Figure 8: Public Key Cryptography (Intel® Atom™ processors)

On SLM, architectural scalar improvements are due to out-of-order execution.

Stitching Results 

Figure 9: AES-CBC-HMAC-SHA (Encrypt) Cycles/Byte

Figure 10: AES-CBC-HMAC-SHA (Decrypt) Cycles/Byte

IVB to HSW performance gains are due to Intel AVX2 code.

AES instruction latency improvements do not yield much performance gains in the case of Decrypt, as the results become SHA bound.

The stitched ciphers are only available in 64-bit implementations due to the expanded register set. In v1.0.1, stitched ciphers supported only encryption.

Multi-Buffer Results 

Figure 11: Multi-Buffer AES-CBC-HMAC-SHA (Encrypt) Cycles/Byte

Figure 12: Multi-Buffer Speedup over Stitched

 

Authors 


Vinodh Gopal, Sean Gulley, and Wajdi Feghali are architects in the Data Center Group, specializing in software and hardware features relating to cryptography and compression.

Ilya Albrekht and Dan Zimmerman are Application Engineers driving enabling and performance optimization efforts for cryptographic projects and libraries.

Contributors 


We thank Andy Polyakov and Steve Marquess of the OpenSSL Software Foundation and Max Locktyukhin, John Mechalas, and Shay Gueron from Intel for their contributions.

Conclusion 


This paper illustrates the goals and main features in OpenSSL 1.0.2 for improved cryptographic performance. By leveraging architectural features in the processors such as SIMD and new instructions, and combining innovative software techniques such as function stitching and Multi-Buffer, large performance gains are possible (e.g., ~3X for Multi-Buffer).

References 


[1] OpenSSL: http://www.openssl.org/

[2] The TLS Protocol http://www.ietf.org/rfc/rfc2246.txt

[3] Fast Cryptographic computation on Intel® Architecture processors via Function Stitching https://www-ssl.intel.com/content/www/us/en/intelligent-systems/wireless-infrastructure/cryptographic-computation-architecture-function-stitching-paper.html

[4] Processing Multiple Buffers in Parallel to Increase Performance on Intel® Architecture Processors - https://www-ssl.intel.com/content/www/us/en/communications/communications-ia-multi-buffer-paper.html

[5] Fast SHA-256 Implementations on Intel® Architecture Processors -https://www-ssl.intel.com/content/www/us/en/intelligent-systems/intel-technology/sha-256-implementations-paper.html

[6] New Instructions Supporting Large Integer Arithmetic on Intel® Architecture Processors - https://www-ssl.intel.com/content/www/us/en/intelligent-systems/intel-technology/ia-large-integer-arithmetic-paper.html

[7] Intel® SHA Extensions: New Instructions Supporting the Secure Hash Algorithm on Intel® Architecture Processors - http://software.intel.com/en-us/articles/intel-sha-extensions

Java* Application Performance Improvement with Intel® Xeon® Processor E7 v3


Background

Java [1, 2] is a programming language used for developing applications that can run on any operating system (OS). To do that, Java applications need to be compiled to bytecode [3]. This bytecode can then be run on any Java Virtual Machine (JVM) [4] without recompiling. To run Java applications on OSs like Windows* and Linux*, a Java Runtime Environment (JRE) [7] must be installed.

The advantage of Java-based applications is that their performance can be improved without having to change code or even recompile when the underlying Java Virtual Machine (JVM) is optimized for the platform the application is running on.

The next sections discuss improvements made in JVM for Intel® Xeon® processor E7 v3 and show one example Java-based application, TYDIC* Online Charging System (OCS), and how its performance improved on Intel® Xeon® processor E7-8800 v3.

JVM improvement for Intel® Xeon® processor E7 v3

Intel has been working closely with Oracle to optimize the JVM for the Intel Xeon processor E7 v3 product family. This section lists some new features of the Intel Xeon processor E7 v3 that benefit Java performance compared to the previous generation of Intel Xeon processors (E7 v2).

  • More cores and larger cache: Intel Xeon processors E7 v3 have more cores and larger cache than previous generations of Intel Xeon processors like E7 v2. For example, Intel Xeon processor E7-8890 v3 has 18 versus 15 cores and 45MB versus 37.5MB in cache compared to the Intel Xeon processor E7-4890 v2.
  • Better memory bandwidth: Intel® QuickPath Interconnect Technology (Intel® QPI) [8] and integrated memory controllers deliver fast core-to-core and core-to-memory communications, resulting in improved performance for Java applications that are memory bound. The garbage collection (GC) processes are improved as well. The Intel Xeon processor E7 v3 has 102GB/s versus 85GB/s memory bandwidth and 9.6GT/s versus 8GT/s QPI speed compared to the Intel Xeon processor E7-4890 v2.
  • Faster array and string operations: Intel® Advanced Vector Extensions 2 (Intel® AVX2) provides vectorization benefits that improve the performance of array and string operations. Intel AVX2 is supported in all versions of JDK8 [9].
  • Hardware support for best-effort “transactional memory” [10]: Multithreaded Java applications with lock contention can benefit greatly from Intel® Transactional Synchronization Extensions (Intel® TSX) [11]. When the -XX:+UseRTMLocking option is specified on the command line, the JVM automatically uses this feature, improving the performance of such Java applications; a launch example follows this list. Intel TSX is supported from JDK8u20 onwards.
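For illustration, a minimal launch command might look like the following. The class name and heap size are placeholders, not from the original article; on JDK 8u20+ the RTM locking flag is experimental, so it typically has to be unlocked first:

java -XX:+UnlockExperimentalVMOptions -XX:+UseRTMLocking -Xmx64g com.example.ChargingServer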

The next section shows how TYDIC OCS benefits when run on an Intel Xeon processor E7-8890 v3-based system.

TYDIC OCS

An OCS is software that gives a communications service provider the ability to charge its customers, in real time, based on service usage [5].

Figure 1 shows TYDIC’s approach of evolving from an application running on a large, monolithic RISC-based system to one that utilizes a distributed Intel Xeon processor-based architecture.


Figure 1. Comparison between RISC and Intel® Xeon® processor-based OCS architectures

The new architecture encompasses three trends:

  • Using 4-socket Intel Xeon processor E7 platforms as ‘node of choice’
  • Re-architecting the OCS application to a cloud architecture:
    • The application assumes shared responsibility for high availability/uptime with the Intel Xeon processor E7 platform (not just relying on the historic RISC platform).
    • The application provides the means for scaling to customer demands just by adding another node as demand grows.
  • Embracing “in-memory” by forming a memory data-grid across the Intel Xeon processor E7 platforms for very fast transactional processing.

Performance test procedure

To show how the OCS application benefits from being run on an Intel Xeon processor E7 v3 system, we performed tests on two platforms. One system was equipped with Intel® Xeon® processor E7-8890 v3 and the other with Intel® Xeon® processor E7-4890 v2.

This workload is based on data grid [6] technology and purely in-memory computing (all business data are kept in memory), so it benefits from the large memory capacity and bandwidth of the Intel Xeon processor E7 v3 platform. Because in-memory computing removes I/O bottlenecks, it can fully utilize the CPU and take advantage of the additional cores in the Intel Xeon processor E7 v3.

Note: This test is about comparing the performance of TYDIC OCS on two systems: one equipped with Intel Xeon processor E7-8890 v3 and the other with Intel Xeon processor E7-4890 v2. The test is not comparing performance of JVM on these systems.

Test configurations

System equipped with Intel Xeon processor E7-8890 v3

  • System: Pre-production
  • Processors: Intel Xeon processor E7-8890 v3 @2.5GHz
  • Last-level Cache: 45MB
  • Cores per socket: 18
  • Memory: 128GB DDR4-1600MHz

System equipped with Intel Xeon processor E7-4890 v2

  • System: Pre-production
  • Processors: Intel Xeon processor E7-4890 v2 @2.8GHz
  • Last-level Cache: 37.5MB
  • Cores per socket: 15
  • Memory: 128GB DDR3-1333MHz

Operating System: Red Hat* Enterprise Linux* 7

Application: OCS 2.1

Test results

Figure 2 shows the results on the systems equipped with the two versions of the Intel Xeon processor. Performance improved by 1.22x due to faster DDR4 memory, more cores, and the evolved microarchitecture of the Intel Xeon processor E7-8890 v3.


Figure 2. Comparison between Intel® Xeon® processor E7-8890 v3 and Intel® Xeon® processor E7-4890 v2

Note: Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark* and MobileMark*, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more information go to http://www.intel.com/performance

Conclusion

The JVM included in JDK versions from JDK8u20 onwards is optimized to take advantage of many new features of the Intel Xeon processor E7 v3. More cores, faster memory, and features like Intel AVX2 and Intel TSX enable Java applications that are memory bound, perform heavy array or string operations, or suffer from lock contention to improve their performance on the JVM without having to change code or even recompile.

References

[1] http://en.wikipedia.org/wiki/Java_%28programming_language%29

[2] https://java.com/en/download/faq/whatis_java.xml

[3] https://en.wikipedia.org/wiki/Java_bytecode

[4] http://en.wikipedia.org/wiki/Java_virtual_machine

[5] http://en.wikipedia.org/wiki/Online_charging_system

[6] http://en.wikipedia.org/wiki/Data_grid

[7] http://searchsoa.techtarget.com/definition/Java-Runtime-Environment

[8] http://www.intel.com/content/dam/doc/white-paper/quick-path-interconnect-introduction-paper.pdf

[9] https://en.wikipedia.org/wiki/Java_Development_Kit

[10] https://en.wikipedia.org/wiki/Transactional_Synchronization_Extensions

[11] https://software.intel.com/en-us/blogs/2012/02/07/transactional-synchronization-in-haswell

Notices

No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document.

Intel disclaims all express and implied warranties, including without limitation, the implied warranties of merchantability, fitness for a particular purpose, and non-infringement, as well as any warranty arising from course of performance, course of dealing, or usage in trade.

This document contains information on products, services and/or processes in development. All information provided here is subject to change without notice. Contact your Intel representative to obtain the latest forecast, schedule, specifications and roadmaps.

The products and services described may contain defects or errors known as errata which may cause deviations from published specifications. Current characterized errata are available on request.

Copies of documents which have an order number and are referenced in this document may be obtained by calling 1-800-548-4725 or by visiting www.intel.com/design/literature.htm.

Intel, the Intel logo, and Intel Xeon are trademarks of Intel Corporation in the U.S. and/or other countries.

*Other names and brands may be claimed as the property of others

© 2015 Intel Corporation.

Diagnostic 15398: loop was not vectorized: loop was transformed to memset or memcpy


Product Version: Intel® Fortran Compiler 15.0 and later versions 

Cause:

When code contains a loop or array syntax that performs a simple initialization or copy, the compiler may replace the loop with a call to either set memory (memset) or copy memory (memcpy). The vectorization report generated with the Intel® Fortran Compiler's optimization and vectorization report options then includes a non-vectorized loop instance:

Windows* OS:  /O2  /Qopt-report:2  /Qopt-report-phase:vec    

Linux OS or OS X:  -O2 -qopt-report2  -qopt-report-phase=vec 

Example:

The example below generates the following remark in the optimization report:

program f15398
implicit none
integer, parameter :: N=32
integer :: i,a(N)
  !...initialize array using DO
  do i=1,N
    a(i) = 0
  end do
  a = 0   !...same, with array syntax
  print*, a(1)

end program f15398

ifort -c /O2 /Qopt-report:2 /Qopt-report-phase:vec /Qopt-report-file:stdout f15398.f90
 
Begin optimization report for: F15398

    Report from: Vector optimizations [vec]
    
LOOP BEGIN at f15398.f90(6,3)
   remark #15398: loop was not vectorized: loop was transformed to memset or memcpy
LOOP END

LOOP BEGIN at f15398.f90(9,3)
   remark #15398: loop was not vectorized: loop was transformed to memset or memcpy
LOOP END

See also:

Requirements for Vectorizable Loops

Vectorization Essentials

Vectorization and Optimization Reports

Back to the list of vectorization diagnostics for Intel® Fortran

Simple optimization methodology with Intel System Studio (VTune, C++ Compiler, Cilk Plus)


Introduction:

 In this article, we introduce a simple optimization methodology that combines Intel® Cilk™ Plus and the Intel® C++ Compiler with performance analysis using Intel® VTune™ Amplifier. Intel® System Studio 2015, which contains these components, was used for this article.

  • Intel® VTune™ Amplifier is an integrated performance analyzer that helps developers analyze complex code and identify bottlenecks quickly.  
  • The Intel® C++ Compiler generates optimized code that runs on IA-32 and Intel 64 architectures. It also provides a number of features to help developers easily improve performance.
  • Intel® Cilk™ Plus, a C/C++ language extension included in the Intel® C++ Compiler, allows you to improve performance by adding parallelism to new or existing C or C++ programs. 

Strategy:

 We will use one of the code examples from the VTune Amplifier tutorials, tachyon_amp_xe, as our target code for performance optimization. This example draws a picture of complicated objects.

 


 

 

The performance optimization methodology applicable to this sample is described below.

  1. Run Basic Hotspots Analysis or General Exploration Analysis on the example project in the integrated IDE, for instance Visual Studio* 2013.
  2. Identify hotspots and other optimization opportunities.
  3. Apply code modifications to the detected hotspots.
  4. Examine the optimization options of the compiler.
  5. Apply parallelism to parallelization candidates.

Optimization :

< Test Environment >

 OS : Windows 8.1

 Tool Suite : Intel® System Studio for Windows Update 3 

 IDE : Microsoft Visual Studio 2013

 

< Step 1 : Interpret & Analyze the result data >

  • Run General Exploration Analysis (if that is not possible, use Basic Hotspots Analysis) and find hotspots. Since this example code is designed for practicing hotspot analysis and performance improvement, it is helpful to follow the tachyon_amp_xe example page for this particular hotspot-finding exercise. After running the example with VTune Amplifier, we see the following results.
  • We can observe that the elapsed time of this application was 44.834s, and this is the performance baseline we will concentrate on reducing.
  • Also, for this sample application, the 'initialize_2D_buffer' function, which took 18.945s to execute, shows up at the top of the list as the hottest function. We will try to optimize this most time-consuming function.

 

  • The CPU Usage Histogram above shows that this sample does not make use of parallelism. Therefore, we may be able to use multiple threads to handle the heavy tasks more quickly.

 

< Step 2 : Algorithmic approach for 'initialize_2D_buffer' >

 

 

 

  • As we saw earlier, the 'initialize_2D_buffer' function took the longest time to execute, and the largest number of instructions were retired by this function, which means that if we can optimize it, this is where we can get the largest benefit.  

  • By double-clicking the function name, VTune Amplifier opens the source file positioned at the most time-consuming code line of this function. For 'initialize_2D_buffer', this is the line used to initialize a memory array using non-sequential memory locations. The sample code already contains a faster alternative 'for' loop.

  • The code listed below is the actual code of the 'initialize_2D_buffer' function. The first for loop does not fill the target array consecutively, while the second for loop is designed to do the same task with consecutive accesses. By using the second for loop, we gain a performance benefit (a simplified sketch of the two patterns appears after this list).

  • After replacing the first for loop with the second one, we can observe some performance improvement. Let's look at the new VTune profiling results.

  • Compared to the previous results, the total elapsed time has been reduced from 44.834s to 35.742s, which is about 1.25x faster than before, and the target function alone went from 18.945s to 11.318s, which is about 1.67x faster.
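The following is a simplified sketch of the two initialization patterns described above, not the actual tachyon source: the first version walks memory with a large stride, while the second fills it sequentially, which is much friendlier to caches and hardware prefetchers.

#include <stddef.h>

/* Non-sequential fill: the inner loop jumps 'width' elements per iteration. */
static void fill_nonsequential(float *buffer, size_t width, size_t height)
{
    for (size_t x = 0; x < width; x++)
        for (size_t y = 0; y < height; y++)
            buffer[y * width + x] = 0.0f;
}

/* Sequential fill: consecutive memory locations are written one after another. */
static void fill_sequential(float *buffer, size_t width, size_t height)
{
    for (size_t y = 0; y < height; y++)
        for (size_t x = 0; x < width; x++)
            buffer[y * width + x] = 0.0f;
}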

 

< Step 3: Compiler Optimization Options >

  • We often overlook the automatic optimization capabilities that compilers have. In this case, we simply enable the Intel C++ Compiler's optimization option by adding '/O3' while compiling; this can also be enabled through the GUI. First, setting the Intel C++ Compiler as the project's compiler is required in order to use '/O3'.

 

  • Just changing the above option sometimes brings great performance benefits. For a detailed explanation of the '/O[n]' optimization options, please click here. The new results below show 24.979s to finish the task, down from 35.742s, which is about a 1.43x speedup.

< Step 4 : Adding Parallelism by Cilk Plus >

  • Parallel programming is a very broad area by itself, and there are many ways to implement parallelism on multi-core platforms. Here we introduce Intel Cilk Plus, a language extension that is fairly easy to apply and works well.

  • By investigating and analyzing the code with VTune's results, we can find the point where the heaviest routine is called repeatedly, which makes it a promising parallelization candidate. Usually this is done by looking at the caller/callee tree and walking back from the root hotspot until you find a parallelizable spot to test.

  • In this case, it was the 'draw_trace' function in find_hotspots.cpp. Adding a simple 'cilk_for' to parallelize the target task dynamically distributes the lines to draw across multiple threads instead of a single thread (a minimal sketch follows this list). As a result, you can visually observe 4 threads (the test machine is dual-core with Hyper-Threading technology) drawing different lines simultaneously.

  • If you look at the time it took to finish the painting job, it is 11.656s, which is a big improvement over how long it took at the beginning. Let's take a look at the VTune results.

  • We can see a total elapsed time of 13.117s, which is about 1.9x faster than the previous result. We can also observe that the multiple cores are being utilized efficiently.
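The change described above amounts to something like the following minimal sketch; the function and helper names follow the tachyon sample but are assumptions, not the exact source.

#include <cilk/cilk.h>

extern void trace_line(unsigned int *frame_buffer, int y);  /* hypothetical per-line renderer */

/* Replacing the serial loop over scan lines with cilk_for lets the Cilk Plus
 * runtime distribute the lines across its worker threads. */
void draw_trace(unsigned int *frame_buffer, int start_y, int stop_y)
{
    cilk_for (int y = start_y; y < stop_y; y++) {   /* was: for (...) */
        trace_line(frame_buffer, y);
    }
}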

Summary:

  • The total elapsed time has been decreased from 44.834s to 13.117s – a 3.41x speedup.
  • This optimization was achieved with simple VTune analysis, one Intel C++ Compiler option, and one Cilk Plus feature.
  • Intel System Studio's components are designed as a solution to help developers easily improve their products.

Fast Gathering-based SpMxV for Linear Feature Extraction


1. Background

Sparse Matrix-Vector Multiplication (SpMxV) is a common linear algebra function that often appears in real recognition-related problems, such as speech recognition. In a standard speech/facial recognition framework, the input data extracted from outside are not directly suitable for pattern matching. A required step is to transform the input data into more compact and amenable feature data by multiplying them with a huge constant sparse parameter matrix.

Figure 1: Linear Feature Extraction Equation

A matrix is characterized as sparse if most of its elements are zero. The density of a matrix is defined as the percentage of non-zero elements in the matrix, which here varies from 0% to 50%. The basic idea in optimizing SpMxV is to concentrate the non-zero elements so as to avoid as many unnecessary multiply-by-zero operations as possible. In general, concentration methods can be classified into two kinds.

The first is the widely used Compressed Row Storage (CRS), which stores only the non-zero elements and their position information for each row. But it is so unfriendly to modern SIMD architectures that it can hardly be vectorized with SIMD, and it only outperforms SIMD-accelerated ordinary matrix-vector multiplication when the matrix is extremely sparse. A variation of this approach, tailored for SIMD implementation, is Blocked Compressed Row Storage (BCRS), in which a fixed-size block instead of a single element is handled in the same way. Because of the indirect memory accesses involved, its performance may degrade severely when matrix density increases.
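For reference, a minimal C sketch of a CRS-style multiplication y = A*x is shown below (array names are illustrative); the indirect access x[col[k]] is what makes SIMD vectorization difficult.

/* 'val' holds the non-zero elements row by row, 'col' their column indices,
 * and row_ptr[r]..row_ptr[r+1] delimits row r. */
void spmv_crs(int rows, const float *val, const int *col,
              const int *row_ptr, const float *x, float *y)
{
    for (int r = 0; r < rows; r++) {
        float sum = 0.0f;
        for (int k = row_ptr[r]; k < row_ptr[r + 1]; k++)
            sum += val[k] * x[col[k]];   /* indirect load of x: hard to vectorize */
        y[r] = sum;
    }
}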

The second is to reorder matrix rows/columns via permutation. The key to these algorithms is finding the best matrix permutation scheme as measured by some criterion correlated with the degree of non-zero concentration, such as:

  • Group non-zero elements together to facilitate partitioning matrix to sub-matrices
  • Minimize total count of continuous non-zeros N x 1 blocks

Figure 2:  Permutation to minimize N x 1 blocks

However, in some applications, such as speech/facial recognition, there exist permutation-insensitive sparse matrices. That is to say, no permutation operation brings about a significant improvement for SpMxV. An extremely simplified example matrix is:

Figure 3: Simplest permutation-insensitive matrix

If non-zero elements are uniformly distributed inside a sparse matrix, it may happen that exchanging any two columns benefits nearly as many rows as it hurts. When this situation happens, the matrix is permutation-insensitive.

Additionally, for sparse matrices of somewhat higher density, if no help can be expected from the two methods above, we have to resort to ordinary matrix-vector multiplication merely accelerated by SIMD instructions, illustrated in Figure 4, which is totally sparseness-unaware. In hopes of alleviating this problem, we devised and generalized a gathering-based SpMxV algorithm that is effective not only for evenly distributed but also for irregular constant sparse matrices.

 

2. Terms and Task

Before detailing the algorithm, we introduce some terms/definitions/assumptions to ease description.

  • A SIMD Block is a memory block that is same-sized as SIMD register. A SIMD BlockSet consists of one or several SIMD Blocks. A SIMD value is either a SIMD Block or a SIMD register, which can be a SIMD instruction operand.

  • An element is underlying basic data unit of SIMD value. Type of element can be built-in integer or float. Type of whole SIMD value is called SIMD type, and is vector of element types. Element index is element LSB order in the SIMD value, equal to element-offset/element-byte-size.

  • Instructions of loading a SIMD Block into a SIMD register are symbolized as SIMD_LOAD. For most element types, there are corresponding SIMD multiplication or multiplication-accumulation instructions. On X86, examples are PMADDUBSW/PMADDWD for integer, MULPS/MULPD for float. These instructions are symbolized as SIMD_MUL.

  • Angular bracket “< >” is used to indicate parameterization similar to C++ template.

  • For a value X in memory or register, X<L>[i] is the ith L-bit slice of X, in LSB order.

On modern SIMD processors, an ordinary matrix-vector multiplication can be greatly accelerated with the help of SIMD instructions, as in the following pseudo-code:

Figure 4: Plain Matrix-Vector SIMD Multiplication
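As a concrete illustration of this plain, sparseness-unaware SIMD multiplication, here is a minimal C sketch assuming AVX, single-precision data, and a column count that is a multiple of 8; names are illustrative and not taken from the original figure.

#include <immintrin.h>

void mxv_plain_avx(int rows, int cols, const float *m, const float *v, float *out)
{
    for (int r = 0; r < rows; r++) {
        __m256 acc = _mm256_setzero_ps();
        for (int c = 0; c < cols; c += 8) {
            __m256 mb = _mm256_loadu_ps(&m[r * cols + c]);   /* SIMD_LOAD of matrix block */
            __m256 vb = _mm256_loadu_ps(&v[c]);              /* SIMD_LOAD of vector block */
            acc = _mm256_add_ps(acc, _mm256_mul_ps(mb, vb)); /* SIMD_MUL + accumulate    */
        }
        /* Horizontal sum of the 8 partial sums. */
        __m128 lo = _mm256_castps256_ps128(acc);
        __m128 hi = _mm256_extractf128_ps(acc, 1);
        __m128 s  = _mm_add_ps(lo, hi);
        s = _mm_hadd_ps(s, s);
        s = _mm_hadd_ps(s, s);
        out[r] = _mm_cvtss_f32(s);
    }
}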

In the case of a sparse matrix, we propose a technique to compact the non-zeros of the matrix while preserving SpMxV's implementability via the SIMD ISA as in the above pseudo-code, with the goal of reducing unnecessary SIMD_MUL instructions. Since the matrix is assumed to be constant, the operation of compacting non-zeros is treated as preprocessing on the matrix, which can be completed during program initialization or off-line matrix data preparation, so that no runtime cost is incurred for a matrix-vector multiplication.

 

3. Description

GATHER Operation

First of all, we should define a conceptual GATHER operation, which is the basis of this work. And its general description is:

GATHER<T, K>(destination = [D0, D1, …, DE–1],
                           source       = [S0, S1, …, SK*E–1],
                           hint            = [H0, H1, …, HE–1])

The parameters destination and source are SIMD values whose SIMD type is specified by T. destination is one SIMD value whose element count is denoted by E, while source consists of K SIMD value(s) whose total element count is K*E. The parameter hint, called the Relocation Hint, has E integer values, each of which is called a Relocation Index. A Relocation Index is derived from a virtual index ranging between –1 and K*E–1, and can be described by a mathematical mapping as:    

 RELOCATION_INDEX<T>(index), abbreviated as RI<T>(index)

GATHER operation will move elements of source into destination based on Relocation Indices as:

  • If Hi is RI<T>(–1), GATHER will retain context of Di.
  • If Hi is RI<T>(j) (0 ≤ j < K*E), GATHER will move Sj to Di.

Implementation of GATHER operation is specific to processor’s ISA. Correspondingly, RI mapping depends on instruction selection for GATHER. Likewise, materialization of hint may be a SIMD value or an integer array, or even mixed with other Relocation Hints, which is totally instruction-specific.

According to ISA availability of certain SIMD processor, we only consider those, called fast or intrinsic GATHER operation, which can be translated to simple and efficient instruction sequence with low CPU cycles.

 

Fast GATHER on X86

On X86 processor, we propose a method to construct fast GATHER using BLEND and SHUFFLE instruction pair.

Given a SIMD type T, imagined BLEND and SHUFFLE instruction are defined as:

  • BLEND<T, L>(operand1, operand2,mask)  ->  result

    L is power of 2, not more than element bit length of T. And operand1, operand2 and result are values of T; mask is a SIMD value whose element is L-bit integer, and its element count is denoted by E. For the ith (0 ≤ i < E) element of mask, we have:

    • operand1<L>[i]  ->  result<L>[i]      (if the element’s MSB is 0)
    • operand2<L>[i]  ->  result<L>[i]      (if the element’s MSB is 1)
  • SHUFFLE<T, L>(operand1, mask)  ->  result

    Parameters description is same as BLEND. In element of mask, only low log2(E) bits, called SHUFFLE INDEX BITS, and MSB are significant. For the ith (0 ≤ i < E) element of mask, we have:

    • operand1<L>[mask<L>[i] & (E–1) ]  ->  result<L>[i]          (if the element’s MSB is 0)
    • instruction specific value  ->  result<L>[i]                          (if the element’s MSB is 1)

Then we construct fast GATHER<T, K> using a SHUFFLE<T, LS> and BLEND<T, LB> instruction pair. The element bit length of T is denoted by LT, and SHUFFLE INDEX BITS by SIB. The Relocation Hint is materialized as one SIMD value, and each Relocation Index is an LT-bit integer. The mathematical mapping RI<T>() is defined as:

  • RI<T>(virtual index = –1) = –1

  • If virtual index ≥ 0, suppose the element indicated by this index is actually the pth element of the kth (0 ≤ k < K) source SIMD value. The final result, denoted by rid, is computed according to the following formulas:

    • LS ≤ LB   (0 ≤ i < LT/LS)
      rid<LS>[i] = k * 2^SIB + p * (LT/LS) + i            (i = integer * LB/LS – 1)
      rid<LS>[i] = ? * 2^SIB + p * (LT/LS) + i            (i ≠ integer * LB/LS – 1)

    • LS > LB   (0 ≤ i < LT/LB)
      rid<LB>[i] = k * 2^SIB + p * (LT/LS) + i * (LB/LS)  (i = integer * LS/LB)
      rid<LB>[i] = k * 2^SIB + (? & (2^SIB – 1))          (i ≠ integer * LS/LB)

 Figure 5 is an example illustrating Relocation Hint for a GATHER<8*int16, 2> while LS = LB = 8.

Figure 5:  Relocation Hint For Gathering 2 SSE Blocks

The code sequence of fast GATHER<T, K> is depicted in Figure 6. The destination and Relocation Hint are symbolized as D and H, and the source values are represented by B0, B1, …, BK–1. In addition, an essential SIMD constant I is used, whose element bit length is min(LS, LB) and whose every element is the integer 2^SIB. Furthermore, the condition K ≤ 2^(min(LS, LB) – SIB – 1) must be satisfied, which gives K ≤ 8 for the above case.

Figure 6:  Fast GATHER Code Sequence

Depending on SIMD type and processor SIMD ISA, SHUFFLE and BLEND should be mapped to specific instructions as optimal as possible. Some existing instruction selections are listed as examples.

SIMD type           SHUFFLE + BLEND selection    Parameters
SSE128 - Integer    PSHUFB + PBLENDV             LS=8,  LB=8
SSE128 - Float      VPERMILPS + BLENDPS          LS=32, LB=32
SSE128 - Double     VPERMILPD + BLENDPD          LS=64, LB=64
AVX256 - Int32/64   VPERMD + PBLENDV             LS=32, LB=8
AVX256 - Float      VPERMPS + BLENDPS            LS=32, LB=32

 

Sparse Matrix Re-organization

In a SpMxV, the two operands, the matrix and the vector, are expressed by M and V respectively. Each row in M is partitioned into several pieces in units of SIMD Blocks according to a certain scheme. Non-zero elements in a piece are compacted into one SIMD Block as far as possible. If some non-zero elements remain outside of the compaction, the piece's SIMD Blocks containing them should be as few as possible; these leftover elements are moved to a left-over matrix ML. Obviously, M*V is theoretically broken up into (M–ML)*V and ML*V. When a proper partition scheme is adopted, which is especially possible for nearly evenly distributed sparse matrices, ML is intended to be an ultra-sparse matrix that is far sparser than M, so that the computation time of ML*V is insignificant in the total time. We can apply a standard compression-based algorithm or the like, which will not be covered here, to ML*V. The organization of ML is subject to its multiplication algorithm, and its storage is separate from M's compacted data, whose organization is detailed in the following.

Given a piece, suppose it contains N+1 SIMD Blocks of type T, expressed by MB0, MB1, …, MBN. We use MB0 as the containing Block, and select and gather non-zero elements of the other N Blocks into MB0. Without loss of generality, we assume that this gathering-N-Block operation is synthesized from one or several intrinsic GATHERs, whose 'K' parameters are K1, K2, …, KG, subject to N = K1 + K2 + … + KG. That is to say, the N Blocks are divided into G groups sized K1, K2, …, KG, and these groups are individually gathered into MB0 one by one. To achieve the best performance, we should find a decomposition that minimizes G. This is a classical knapsack-type problem and can be solved with either dynamic programming or a greedy method. As a special case, when an intrinsic GATHER<T, N> exists, G=1.

Relocation Hints for those G intrinsic GATHERs are expressed by MH1, MH2, …, MHG. So the piece will be replaced with its compacted form, consisting of two parts: MB0 after compaction and (MH1, MH2, …, MHG). The former is called the Data Block. The latter is called the Relocation Block and is some combined form of all the Relocation Hints, which is specific to implementation or optimization considerations that are out of the discussion of this paper. The combination form may be affected by alignment enforcement, memory optimization, or other instruction-specific reasons. For example, if a Relocation Index occupies only half a byte, we can merge two Relocation Indices from two Relocation Hints into one byte so as to reduce memory usage. Ordinarily, a simple way is to lay out the Relocation Hints end to end. Figure 5 also shows how to create the Data Block and Relocation Block for a 3-Block piece. A blank in a SIMD Block means a zero-valued element.

 

Sparse Matrix Partitioning Scheme

To guide the decision on how to partition a row of the matrix, we introduce a cost model. For a piece of N+1 SIMD Blocks, suppose that there will be R (R ≤ N) SIMD Blocks containing non-zero elements to be moved to ML. The cost of this piece is 1 + N*CostG + R*(1+CostL), in which:


  • 1 is cost of a SIMD multiplication in the piece.
  • CostG (CostG < 1) means cost of gathering one SIMD Block.
  • CostL means the extra effort for a SIMD multiplication in ML, and is always a very small value.
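For example, with illustrative values (not from the original text) of CostG = 0.2 and CostL = 0.05, a piece with N = 2 and R = 1 costs 1 + 2*0.2 + 1*(1 + 0.05) = 2.45, compared with a cost of 3 for the three SIMD multiplications that the plain code of Figure 4 would spend on the same three Blocks.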

In the following description, one or several adjacent pieces in a row will be referred to as a whole, termed a piece clique. All rows of the matrix share the same partitioning scheme:

  • A row is cut into identical primary cliques, except for a possible leftover clique with fewer pieces than the primary one.
  • The number of pieces in any clique should be no more than a pre-defined count limit C (C ≥ 1), which is statically deduced from the characteristics of the non-zero distribution of the sparse matrix and is also used to control code complexity in the final implementation.
  • The total cost of all pieces in the matrix should be minimal for the given count limit C. To find this optimal scheme, we may rely on an exhaustive search or an improved beam algorithm. The beam algorithm will be covered in a new patent and is not described here.

An example partitioning of a 40-Block row when C=3 is [4, 5, 2], [4, 5, 2], [4, 5, 2], [2, 5], where '[ ]' denotes a piece clique. For evenly distributed matrices, C=1 is always chosen.

 

Gather-Based Matrix-Vector Multiplication

Multiplication between the vector V and a row of M is broken up into sub-multiplications on the partitioned pieces. Given a piece in M whose original form has N+1 SIMD Blocks, the corresponding SIMD Blocks in vector V are expressed by VB0, VB1, …, VBN. The previous symbol definitions for a piece carry over to this section.

With the new compacted form, a piece multiplication between [MB0, MB1, …, MBN] and [VB0, VB1, …, VBN] is transformed into operations that gather the effective vector elements into VB0, plus a single SIMD multiplication of the Data Block with VB0. Figure 7 depicts the pseudo-code of the new multiplication, in which the Data Block is MD, the Relocation Block is MR, and the vector is VB. We also refer to a conceptual function EXTRACT_HINT(MR, i) (1 ≤ i ≤ G), which extracts the ith Relocation Hint from MR and is the reverse of the aforementioned combination of (MH1, MH2, …, MHG). To improve performance, this function may keep some internal temporaries; for example, the register value of the previous Relocation Hint can be retained to avoid a memory access. The details of this function are out of the scope of this article.

Figure 7:  Multiplication For Compacted Form of N+1 SIMD Blocks

In the code, the original N SIMD multiplications are replaced by G gathering operations. Therefore, computation acceleration is possible and meaningful only if the former is much more time-consuming than the latter. We should compose an efficient intrinsic GATHER to guarantee this assertion. This is easily done on some processors, such as ARM, on which an intrinsic GATHER of SIMD integer type maps directly to a single low-cost hardware instruction. More specifically, the fast GATHER elaborately constructed on X86 also satisfies the assertion: for the ith (1 ≤ i ≤ G) SIMD Block group in the piece, Ki SIMD_MULs are replaced by Ki rather faster BLEND and SHUFFLE pairs, and Ki–1 SIMD_LOADs from the matrix are avoided, replaced by Ki–1 much more CPU-cycle-saving SIMD_SUBs.

Finally, the new SpMxV algorithm can be described by the following flowchart:

Figure 8:  New Sparse Matrix-Vector Multiplication

 

4. Summary

The algorithm can be used to improve sparse matrix-vector and matrix-matrix multiplication in any numerical computation. There are many applications involving semi-sparse matrix computation in High Performance Computing. Additionally, in popular perceptual computing low-level engines, especially speech and facial recognition, semi-sparse matrices are very common. Therefore, this technique can be applied to mathematical libraries dedicated to these kinds of recognition engines.

Vectorization Advisor FAQ


General questions

What is Vectorization Advisor?

Vectorization Advisor is one of the two major features of the Intel® Advisor XE 2016 product. Intel® Advisor XE comprises Vectorization Advisor and Threading Advisor.

Vectorization Advisor is an analysis tool that lets you identify whether loops utilize modern SIMD instructions, what prevents vectorization, how efficient the vectorization is, and how to increase it. It presents compiler optimization reports in a user-friendly way and extends them with other metrics, such as loop trip counts, CPU time, memory access patterns, and recommendations for optimization.

Where can I download Vectorization Advisor?
Intel® Advisor XE 2016 with Vectorization Advisor is available as part of the Intel® Parallel Studio XE 2016 suite. Visit the product web site for more information, including evaluation copies and purchasing.
What is the difference between “Threading Advisor” and “Vectorization Advisor”?

Intel® Advisor XE version 2015 and earlier had only Threading Advisor workflow. Read more on the product website.

Starting from Intel® Advisor XE 2016, the product includes two major workflows or feature sets:

  • Vectorization Advisor is a vectorization analysis tool that lets you identify loops that will benefit most from vectorization, identify what is blocking effective vectorization, explore the benefit of alternative data reorganizations, and increase the confidence that vectorization is safe.
  • Threading Advisor is a threading design and prototyping tool that lets you analyze, design, tune, and check threading design options without disrupting your normal development.
What Compilers and programming languages are supported?

Vectorization Advisor supports C/C++ and Fortran programming languages.  

Vectorization Advisor requires Intel Compiler 15.0 or later to collect the full set of analysis data. However, a subset of metrics is available for binaries built with the GCC* or Microsoft* compilers.

How do I get support and provide feedback?

Visit our product support page.

Vectorization analysis workflow

Where do I start?
Check prerequisites and build settings in Getting Started with Intel® Advisor XE 2016. Create a project – just specify the executable to analyze and command line parameters.

 

Start by running the Survey analysis – it will give you the main statistics about vectorized and scalar loops:

First things to look at in the Survey Report:

  1. Self and Total CPU time – focus on the most time-consuming loops. Use the Top Down tab (in the blue area in the middle of the Advisor window) to explore the call tree.
  2. Find hot scalar loops in the Loop Type column. The "Why No Vectorization" column and the loop summary explain the reason that prevented the compiler from generating SIMD code.
  3. For vectorized loops, expand "Vectorized Loops" and the other columns in the grid. Check the efficiency metrics, instruction set, vector length, and Traits.
  4. Click on a "Lamp" icon with a digit – it will bring you to the Recommendations tab at the bottom, which might contain optimization hints.
Can I use Vectorization Advisor from a command line?

Yes. Use “advixe-cl --help” command to learn about syntax and see some examples. Please be aware that Intel Advisor XE 2016 documentation for command line syntax may not be up to date, and not all CLI options may be covered. We’re working on addressing this gap.

Hint: use “Command Line” link on workflow to generate command line for selected analysis type and project settings:

Does Vectorization Advisor help in improving already vectorized codes?

Yes, Vectorization Advisor has multiple features to detect inefficient usage of SIMD instructions. Some typical examples:

  • Efficiency metric is significantly lower than ideal value
  • Using instruction set lower than supported by hardware (e.g. SSE2 on a machine supporting AVX)
  • Vectorization traits detection, e.g. using gather and scatter instructions
  • Non-uniform and unaligned data accesses (use Memory Access Patterns analysis)
  • Partial loop vectorization, when scalar peel or remainder takes noticeable CPU time
  • Other bottlenecks described in Recommendations tab
Can I run Vectorization Advisor with an MPI application?

Yes. Use the command line syntax for analyzing MPI applications; see details and examples. Below is an example with mpirun and the "-gtool" option. This command launches the "./your_app" application on 4 ranks, and only ranks 2 and 3 are analyzed by Intel Advisor:

mpirun -n 4 -gtool "advixe-cl -collect survey:2,3" ./your_app
How do I explore results on a cluster node without a GUI?

You can perform an MPI analysis only through the Intel Advisor command line interface; however, there are several ways to view an Intel Advisor result:

  • If you have an Intel Advisor GUI in your cluster environment, open a result in the GUI. E.g. a login node may have X server configured, and you can use a shared directory for storing Intel Advisor project.
  • If you do not have an Intel Advisor GUI on your cluster node, copy the result directory to another machine with the Intel Advisor GUI and open the result there. You can use a Windows machine to browse results collected on Linux. In this case, you might need to configure search directories in project properties to locate source files.
  • Use the Intel Advisor command line reports to browse results on a cluster node. E.g. default survey report:

    advixe-cl -report survey -project-dir ./my_proj

What data will I get with an application built with GCC* or Microsoft* compilers?

Vectorization Advisor requires the Intel Compiler to collect the full set of analysis data. However, a subset of metrics is available for binaries built with the GCC or Microsoft compilers:

  • CPU time and call tree (Top Down tab)
  • Vector Instruction Set, Vector length, Data types
  • Loop trip counts
  • Dependencies analysis (loop dependencies)
  • Memory Access Patterns analysis
Do I need source code annotations?

No. Vectorization Advisor does not require source code modification. You can select loops for analysis using checkboxes on Survey tab:

Source code annotations are needed for Threading Advisor only.

How do I specify which loops to analyze with the Memory Access Patterns or Dependencies features? How do I do it from the command line and in the GUI?

In GUI, you can select loops for analysis using checkboxes on Survey tab:

From the command line, print the survey report and note the "ID" column before each loop:

advixe-cl -report survey -project-dir ./my_proj
ID Function Call Sites and Loops Self Time Total Time
 69 -[loop at test.cpp:190 ...] 1.06054 1.06054
 83 -[loop at test.cpp:89 ...] 0.841134 0.841134
 51 -[loop at test.cpp:113 ...] 0.799016 0.799016

Then use “-mark-up-list” option to specify loop IDs for Dependencies or Memory Access Patterns analysis:

advixe-cl -collect map -mark-up-list=83,51 -project-dir ./my_proj -- my_application

Tip: open the result in GUI, select loops using checkboxes and press “Get Command Line” button. It will generate command line for Dependencies or Memory Access Patterns analysis automatically.

How can I decrease analysis time?

The Survey analysis in Vectorization Advisor is the least intrusive and should not slow down the application significantly. However, analyses like "Dependencies" and "Memory Access Patterns" have significant overhead. You can mitigate the application slowdown in several ways:

  1. Decrease the workload. How to do this depends on your application: provide less data to process, or decrease the complexity of the computations.
  2. Use separate settings for Survey and the other analysis types. By default, it is enough to configure the Survey settings only, but if you can control the workload via command line parameters, you can keep separate command line settings for different analysis types.
  3. Decrease the number of loops selected for Dependencies or Memory Access Patterns analysis.
  4. Look at the Refinement report tab while the analysis runs. Data is shown as soon as it appears; you don't have to wait until the application finishes. Press the "Stop" button early when you see that the analysis for all loops of interest has already finished (in the Memory Access Patterns or Dependencies view).

Understanding Vectorization Advisor results

What kind of data does Vectorization Advisor provide? How does it collect information?

Key Vectorization Advisor features include a:

  • Correlation of CPU time and vectorization metrics with compiler optimization and vectorization reports
  • Ability to explore relevant loop data all in one place, including CPU time, if loop is vectorized, compiler diagnostics about vectorization constraints, instruction set, source code, and assembly code
  • Dependencies analysis that checks for loop-carried dependencies, so you can decide if it is safe to force vectorization with pragmas
  • Memory Access Patterns analysis that identifies non-unit stride array element accesses. Non-unit stride memory accesses can prevent automatic vectorization or hurt performance.
  • Loop Trip counts and call counts
  • Recommendations based on the static and dynamic analysis data.

Tip: Workflow panel helps to navigate through the steps and analysis types.

 
What data do I get from Survey analysis?

Most statistics are gathered by the Survey analysis. It combines dynamic analysis (CPU sampling), static binary analysis (instruction set, data types, etc.), and compiler diagnostics. All analysis types include binary instrumentation and dynamic analysis; this means that Intel Advisor has to execute the application, even if collecting some of the data doesn't require an actual run.

The Survey Report provides a wealth of information, including the following:

  • Vectorized loop parts, such as Body, Peeled, and Remainder, which are automatically grouped as a hierarchical row in the top table.
  • Why No Vectorization?– Why a loop was not vectorized
  • Vectorized loops columns:
    • Vector Instruction Set–  For example, SSE, SSE2, and AVX
    • Efficiency – available with Intel Compiler 16.0 and later
    • Gain – Advisor estimate of relative loop performance speed-up achieved due to vectorization
    • Vector length– number of data elements of the given type fitting in a SIMD lane
  • Instruction set analysis (compiler-independent SIMD statistics):
    • Traits– Important loop characteristics, potentially hurting performance. For example, Divisions, Shuffles, and Masked Stores
    • Data Types
  • Optimization info columns:
    • Transformations– How a loop was modified if it was modified by the compiler (for example, loop unrolling)
    • Unroll factor
    • Estimated Achieved Gain - theoretical estimate of achievable or achieved vectorization gain, provided directly by compiler
    • Vector width, Vectorization Details and Optimization Details.
  • Tabs on the bottom of Survey report:
    • Top Down– call tree of loops and functions
    • Source and Assembly views with embedded optimization and vectorization info
    • Recommendations– description of typical problems and tips to optimize
    • Compiler Diagnostic Details – detailed description of compiler diagnostics with examples
What data do I get from Trip Counts analysis?

Trip Counts analysis counts the minimum, maximum, and median trip counts (i.e., the number of times a loop body was executed) and call counts (the number of times a loop is invoked) for all the loops in the application. You should run Survey first, then the Trip Counts analysis. NOTE: Do not re-build your binary between running Survey and Trip Counts; doing so can produce wrong results. Trip Counts results are added to the existing Survey report in a new column group:

What data do I get from Dependencies analysis?

Dependencies analysis checks for cross-iteration ("loop-carried") dependencies. The most common reason to use it is when you see the "assumed dependence prevents vectorization" message in the "Why No Vectorization" column. If the Dependencies analysis reports no dependencies, it is safe to force vectorization. If dependencies are detected, you will get detailed information about where they are:

NOTE: Dependencies analysis is only applicable to scalar (not vectorized) loops.
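For example, once the Dependencies analysis confirms that a loop has no loop-carried dependency for your workloads, you can tell the compiler to ignore its assumed dependence. Below is a hedged sketch using the Intel compiler's #pragma ivdep (names are illustrative); the pragma is only safe because the analysis has shown that the indices never collide.

/* The compiler must assume a[idx[i]] may alias a store from another iteration
 * and therefore refuses to vectorize. #pragma ivdep (or #pragma omp simd)
 * tells it to ignore the assumed dependence; this is valid only if, as the
 * Dependencies analysis verified, idx never repeats a value in this loop. */
void scatter_add(float *a, const float *b, const int *idx, int n)
{
#pragma ivdep
    for (int i = 0; i < n; i++)
        a[idx[i]] += 2.0f * b[i];
}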

What data do I get from Memory Access Patterns analysis?

Memory Access Patterns (MAP) analysis traces memory access instructions and detects patterns: unit stride, non-unit "constant" stride (as in the picture below), and non-unit variable stride (gather-scatter patterns). Operand size and unaligned data accesses are also reported; a small illustration appears after the list below.

Example results of MAP analysis in source view:

Run Memory Access Patterns (MAP) analysis in the following cases:

  • You addressed other vectorization problems, but the performance of the vectorized loop is still not satisfactory, while “Traits” indicate presence of Shuffles, Inserts, Gathers.
  • You want to eliminate non-unit stride memory accesses to refactor the code, either for optimizing vectorization or memory and cache usage.
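As a small illustration of the patterns MAP reports (names are illustrative), the first loop below reads the same field from an array of structures and therefore has a constant 16-byte stride, while the second uses a structure-of-arrays layout and accesses memory with unit stride.

struct point_aos { float x, y, z, w; };

/* Constant stride of 16 bytes on p[i].x – reported as non-unit constant stride. */
float sum_x_aos(const struct point_aos *p, int n)
{
    float s = 0.0f;
    for (int i = 0; i < n; i++)
        s += p[i].x;
    return s;
}

/* Unit stride on x[i] – the vectorizer's preferred layout. */
float sum_x_soa(const float *x, int n)
{
    float s = 0.0f;
    for (int i = 0; i < n; i++)
        s += x[i];
    return s;
}
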
How do I save results?

By default, Intel Advisor stores only the most recent result. That means that if you run Survey (or any other analysis) two times, you will see only the last one, with no option to get back to the initial experiment.

You can manually save Intel Advisor experiments using the "Snapshot" button in the Result window or on the product toolbar:

This saves all analysis results (Survey, Trip Counts, Dependencies, and MAP) in a read-only experiment folder. You will be able to browse it at any time; further experiments will not overwrite it. You can access the historical snapshots using the Project Navigator.

How are Survey, Trip Counts and Dependencies results correlated?

Intel Advisor has a complex structure of result versions. There are four analysis types: Survey, Trip Counts, Dependencies, and Memory Access Patterns. All results are contained in an "experiment" folder, usually called "e000". The experiment contains the most recent version of each result type. By default, only one (the latest) experiment version is stored; however, you can create "snapshots" – historical copies of the current experiment for future analysis and comparison.

Basic analysis type is Survey. All other analysis types depend on Survey results, but don’t depend on each other:

e000:
Survey <- Trip Counts
Survey <- Dependencies
Survey <- MAP

Different analysis types are matched by address in the target application binary. That means that when you select loops in Survey for further Dependencies analysis, they are identified by their address in the binary. Changing the binary (re-building) between running Survey and Dependencies breaks this connection and the results will be wrong. The same applies to the MAP and Trip Counts analyses. So if the binary is changed, run Survey again before running other analysis types.

You may run Survey 5 times but run Dependencies only once (say, for Survey result #2). In this case, the most recent Survey will not match the Dependencies report; they can apply to different binary versions. If it is important to keep them matched, make a Snapshot before updating the binary and running further analyses.


A Mission-Critical Big Data Platform for the Real-Time Enterprise


As the volume and velocity of enterprise data continue to grow, extracting high-value insight is becoming more challenging and more important. Businesses that can analyze fresh operational data instantly—without the delays of traditional data warehouses and data marts—can make the right decisions faster to deliver better outcomes.

Business Oriented Solution (BOS) software from the Nomura Research Institute (NRI) answers this challenge. Running on enterprise-class servers based on the Intel® Xeon® processor E7 v3 family, BOS turns operational data into a rich source of real-time business insight. There is no need to duplicate, pre-manipulate, or move data. A highly optimized in-memory database schema enables transactions and queries to be performed simultaneously and at high speed on the same data set.

Download the complete article (PDF)

 

 

Intel System Studio Matrix Multiplication Sample


This is a "matrix multiplication" example that illustrates different features of Intel® System Studio on Microsoft* Visual Studio* IDE, Eclipse* IDE and on Yacto* Target System

By Downloading or copying all or any part of the sample source code, you agree to the terms of the Intel® Sample Source Code License Agreement

Windows* System : system_studio_sample_matrix_multiply.zip(829 KB)

Linux* System : system_studio_sample_matrix_multiply.tar.gz (1380 KB)

This package contains four samples that demonstrate the use of the Intel® C++ Compiler, Intel® VTune™ Amplifier for Systems, Intel® Cilk™ Plus, and Intel® MKL. 

  • Using Intel® C++ Compiler for Systems to get better performance
  • Using Intel® VTune™ Amplifier for Systems to identify performance bottleneck
  • Using Intel® Cilk™ Plus to parallelize the application
  • Using optimized functions from Intel® Math Kernel Library

Intel® System Studio Samples and Tutorials


Intel® System Studio is a comprehensive and integrated tool suite that provides developers with advanced system tools and technologies to help accelerate the delivery of the next-generation, energy-efficient, high-performance, and reliable embedded and mobile devices.

We have created a list of samples demonstrating different features of Intel System Studio. The tutorials show how to use these features in your applications.

By Downloading or copying all or any part of the sample source code, you agree to the terms of the Intel® Sample Source Code License Agreement

Samples

Sample Code Name

Description

Hello World

This is a simple "Hello World" example that illustrates how to set up the environment to build embedded application with Intel Compiler (ICC) for Windows*, Linux* Host and Yocto* Linux* Target , in various usage models like command line, IDEs.

Matrix Multiplication

This is a "matrix multiplication" example that illustrates different features of Intel® System Studio like Intel® C/C++ Compiler, Intel® MKL, Intel® VTune Amplifier and Intel® Cilk Plus.

System Trace – a sample trace

 A sample trace file (sampleTrace.tracecpt) is included in this Intel System Debugger NDA package. The trace was collected from a real Intel® Skylake machine and includes multiple trace packet types, such as BIOS, CSME, TSCU, and global error packets. Before you start using the System Trace tool (Eclipse plugin) to debug your own system issues, this sample trace can help you become familiar with the UI operations and functionality the tool provides, such as searching for keywords, opening a new field, and exporting partial logs.

Processor Trace Sample 

Intel® Processor Trace is hardware-based, low-overhead code execution logging at the instruction level. It provides powerful and deep insight into past instruction flow, combined with interactive debug.

Image Blurring and Rotation

This tutorial demonstrates how to:

  •  Implement box blurring of an image with the Intel IPP filtering functions
  •  Rotate an image with the Intel IPP functions for affine warping
  •  Set up environment to build the Intel IPP application
  •  Compile and link your image processing application

 

Averaging Filter(Image Processing)

An averaging filter is a commonly used filter in the field of image processing and is mainly used for removing noise from a given image. This sample demonstrates how to increase the performance of an averaging filter using Intel® Cilk™ Plus. Both threading and SIMD solutions are explored in the performance tuning, and their corresponding contributions to the speedup are evaluated.

Discrete Cosine Transforms(DCT)

Discrete Cosine Transform (DCT) and quantization are the first two steps in the JPEG compression standard. This sample demonstrates how the DCT and quantization stages can be implemented to run faster using Intel® Cilk™ Plus. 

Image Processing: Sepia Filter

A sepia tone image is a monochromatic image with a distinctive brown-gray color that gives a photograph the look of the black-and-white film era. The program works by converting each pixel in a bitmap file to a sepia tone. This sample demonstrates how to improve the performance of a sepia filter using Intel® Cilk™ Plus. To demonstrate the performance increase, you will use a program that converts a bitmap file from a color image to a sepia tone image. 

Tutorials

Title/Link to Tutorial demo

Description 

Using Intel® C++ Compiler for Embedded Linux Systems

The Intel® C++ Compiler, also known as icc, is a high-performance compiler that lets you build and optimize your C/C++ applications for Linux*-based operating systems. Embedded system development is cross-platform development in most cases: application development normally requires cross-compilation, which involves a host compilation system and a target embedded system. The Intel® C++ Compiler fully supports this cross-platform compilation as well. 

Intel® VTune™ Amplifier for Systems Usage Models

Intel® VTune™ Amplifier for Systems is a software performance analysis tool for users developing serial and multithreaded applications on embedded and mobile systems. VTune Amplifier supports multiple usage modes for various target systems, depending on your development environment and target environment. In this article, we describe the VTune Amplifier usage modes and the recommended modes for different target systems.

Signal Processing Usage for Intel® System Studio – Intel® MKL vs. Intel® IPP

Employing performance libraries can be a great way to streamline and unify the computational execution flow for data intensive tasks, thus minimizing the risk of data stream timing issues and heisenbugs. Here we will describe the two libraries that can be used for signal processing within Intel® System Studio.

Debugging an Intel® Quark SoC based target platform using OpenOCD*

This tutorial will help you understand how to set up an OpenOCD*-based connection to Intel Quark based target systems and how to use Intel System Studio for debugging system software.

 

Performance Gains for SunGard’s Adaptiv Analytics* on the Intel® Xeon® Processor E7-8890 V3


Introduction

SunGard’s Adaptiv Analytics* allows traders to run pre-deal cost-of-credit calculations. Due to the volume and complexity of products, these calculations are often time consuming, causing delays that can lead to missed opportunities or taking action with incomplete information.

Since SunGard’s customer usage model is often running multiple instances simultaneously instead of running a single instance as fast as possible, running SunGard’s Adaptiv Analytics on systems with more cores can help improve the performance dramatically. SunGard’s adoption of Intel® Advanced Vector Extensions 2 (Intel® AVX2) and Intel’s investment in parallel computing through the use of vectorization lanes and registers has helped provide superior scalability and performance for SunGard’s industry-leading risk management solution. These improvements are helping to meet the growing computational requirements of the market and the regulatory environment.

This paper describes how Adaptiv Analytics running on systems equipped with Intel® Xeon® processor E7-8890 v3 gained a performance improvement over running on systems with the previous generation of Intel® Xeon® processor E7-4890 v2.

SunGard’s Adaptiv Analytic and Intel® Xeon® Processor E7-8890 V3

In terms of hardware, the Intel Xeon processor E7-8890 v3 has 18 cores compared to the 15 cores of the Intel Xeon processor E7-4890 v2, allowing more parallelism on the E7-8890 v3. In addition, the E7-8890 v3 has larger memory bandwidth than the E7-4890 v2 and uses DDR4 memory, while the E7-4890 v2 uses DDR3 memory, further speeding up execution.

In terms of instruction set support, the Intel Xeon processor E7-8890 v3 supports Intel AVX2, while the Intel Xeon processor E7-4890 v2 supports only Intel® Advanced Vector Extensions (Intel® AVX). Let's see how Intel AVX2 improves the performance of this product.

 

Processor                   # Cores   # Threads   Memory   Vectorization
Intel® Xeon® E7-4890 v2     15        30          DDR3     Intel® AVX
Intel® Xeon® E7-8890 v3     18        36          DDR4     Intel® AVX2

Table 1. Processors Comparison

SunGard’s Adaptiv Analytics uses the Monte Carlo simulation to perform risk analysis. The Monte Carlo simulation is often used whenever there is a need to analyze the behavior of activities or processes that involve uncertainty, such as risk management. This simulation calculates the results multiple times using a random set of values, giving the decision maker a range of possible outcomes. The random set of values is generated from the probability functions.

To increase the accuracy of the possible outcome, Monte Carlo simulation needs to run for a long time period, possibly repeating up to 10,000 times. This is where Intel® AVX2 along with features of E7-8890 v3 mentioned above can provide advantages over those of E7-4890 v2.
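
As a rough illustration of the simulation pattern described above (not SunGard's actual engine), the minimal C sketch below repeats a calculation many times with randomly drawn inputs and collects the range of outcomes; the payoff function and constants are placeholders, and only the 10,000-repetition count is taken from the text.

#include <stdio.h>
#include <stdlib.h>

/* Hypothetical "result of one scenario" -- a stand-in for a real pricing model. */
static double one_scenario(void)
{
    double shock = (double)rand() / RAND_MAX - 0.5;   /* random input value     */
    return 100.0 * (1.0 + shock * 0.2);               /* toy payoff calculation */
}

int main(void)
{
    enum { SCENARIOS = 10000 };                       /* repetition count cited above */
    double sum = 0.0, min = 1e300, max = -1e300;

    for (int i = 0; i < SCENARIOS; i++) {             /* run the calculation many times ...       */
        double r = one_scenario();                    /* ... each with a new random set of values */
        sum += r;
        if (r < min) min = r;
        if (r > max) max = r;
    }
    printf("mean %.2f, range [%.2f, %.2f]\n", sum / SCENARIOS, min, max);
    return 0;
}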

The following sections describe functions that are frequently used by the Monte Carlo simulation for vector and matrix manipulation and that are optimized with Intel AVX2.

daxpy

Function daxpy computes the following operation on double-precision values:

A × α + B

Where:

A and B: vectors

α: a scalar constant

 

dgemv

Function dgemv calculates the following operation on double-precision values:

α × A × x + β × y

or

α × Aᵀ × x + β × y

Where:

α and β: scalar constants

x and y: vectors

A: a matrix

Functions daxpy and dgemv are implemented in the Intel® Math Kernel Library (Intel® MKL) and Intel® Integrated Performance Primitives (Intel® IPP). Starting with version 11 of Intel MKL and version 8 of Intel IPP, the two functions are optimized using Intel AVX2. SunGard’s Adaptiv Analytics uses the daxpy and dgemv versions from Intel MKL and Intel IPP, thus taking advantage of Intel AVX2 performance improvements in the Intel Xeon processor E7-8890 v3. Using Intel’s libraries means that developers don’t have to modify their code to take advantage of new features in future Intel® Xeon® processors.
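
For illustration, here is a minimal sketch (not SunGard's code) of calling the Intel MKL CBLAS interfaces for these two routines; the array sizes and values are arbitrary. Because MKL dispatches internally, the same calls use Intel AVX2 code paths on processors that support them. Link against MKL, for example with -lmkl_rt.

#include <mkl.h>   /* Intel MKL CBLAS interface */

int main(void)
{
    double x[4] = {1, 2, 3, 4}, y[4] = {5, 6, 7, 8};
    double A[4 * 4] = {0};                            /* 4x4 matrix, row-major      */
    for (int i = 0; i < 4; i++) A[i * 4 + i] = 1.0;   /* identity, just for example */

    /* daxpy: y = 2.0 * x + y (vector-vector update) */
    cblas_daxpy(4, 2.0, x, 1, y, 1);

    /* dgemv: y = 1.5 * A * x + 0.5 * y (matrix-vector product) */
    cblas_dgemv(CblasRowMajor, CblasNoTrans, 4, 4, 1.5, A, 4, x, 1, 0.5, y, 1);

    return 0;
}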

Performance test procedure

To prove that Intel AVX2 along with the new microarchitecture in the Intel Xeon processor E7 v3 improve the performance of SunGard’s Adaptiv Analytics, we performed tests on two platforms. One system was equipped with the Intel Xeon processor E7-8890 v3 and the other with the Intel Xeon processor E7-4890 v2.

We created a launcher to execute a given number of instances of a command-line tool called RunCalcDef.exe that performs the calculations using the SunGard Adaptiv Analytics engine. On the system equipped with the Intel Xeon processor E7-8890 v3, we launched 18 instances with 8 nodes per instance, while on the system equipped with the Intel Xeon processor E7-4890 v2 we launched 30 instances with 4 nodes per instance. "Node" is the term used in SunGard’s Adaptiv Analytics to specify how many threads operate on a subset of the 10,000 scenarios of the Monte Carlo simulation.

Why didn’t we use the same number of instances and nodes on both systems? The reason: the system equipped with the Intel Xeon processor E7-8890 v3 has 4 sockets, each of which can handle 36 threads with Hyper-Threading on, for a total of 144 threads for the whole system. On the other hand, the system equipped with the Intel Xeon processor E7-4890 v2 has 4 sockets, each of which can handle 30 threads with Hyper-Threading on, for a total of 120 threads for the whole system. Using the same number of instances and nodes on the system with the Intel Xeon processor E7-8890 v3 as on the system with the Intel Xeon processor E7-4890 v2 would result in over-subscribing the cores, leading to a decrease in performance.

The tests computed the throughput, calculations per second, by dividing the total number of calculations executed (a pre-known value based on the number of instances) by the average execution time in seconds.

Test configurations

System equipped with Intel Xeon processor E7-8890 v3

  • System: Pre-production
  • Processors: Intel Xeon processor E7-8890 v3 @2.5 GHz
  • Cores: 18
  • Memory: 384 GB DDR4-2133 MHz

System equipped with Intel Xeon processor E7-4890 v2

  • System: Pre-production
  • Processors: Intel Xeon processor E7-4890 v2 @2.8 GHz
  • Cores: 15
  • Memory: 512 GB DDR3-1600 MHz

Operating System: Microsoft Windows Server* 2012 R2

Application: SunGard Adaptiv Benchmark v13.1

Test results


Figure 1. Performance comparison between processors.

Figure 1 shows a 1.47x performance gain of the system with the Intel Xeon processor E7-8890 v3 over that of the system with the Intel Xeon processor E7-4890 v2. The performance gain is due to the enhanced microarchitecture, increase in core count, better memory type (DDR4 over DDR3), and Intel AVX2.

Conclusion

More cores, enhanced microarchitecture, and the support of DDR4 memory contributed to the performance improvement of SunGard’s Adaptiv Analytics on systems equipped with the Intel Xeon processor E7-8890 v3 compared to those with Intel Xeon processor E7-4890 v2. With the introduction of Intel AVX2, matrix manipulations get a boost. In addition, applications that make use of Intel MKL and Intel IPP will receive a performance boost without having to change the source code, since their functions are optimized using Intel AVX2.

References

[1] Wikipedia. Basic Linear Algebra Subprograms. https://en.wikipedia.org/wiki/Basic_Linear_Algebra_Subprograms

[2] Investopedia. Corporate Finance – Risk-Analysis Techniques. http://www.investopedia.com/exam-guide/cfa-level-1/corporate-finance/risk-analysis-techniques.asp

[3] Intel® Integrated Performance Primitives (Intel® IPP). https://software.intel.com/en-us/intel-ipp

[4] Intel® Math Kernel Library (Intel® MKL) https://software.intel.com/en-us/intel-mkl

[5] LAPACK: Linear Algebra PACKage – dgemv.f. http://www.netlib.org/lapack/explore-html/dc/da8/dgemv_8f_source.html

[6] LAPACK: Linear Algebra PACKage – daxpy. http://www.netlib.org/lapack/explore-html/d9/dcd/daxpy_8f.html

Storage: Accelerate Hash Function Performance Using the Intel® Intelligent Storage Acceleration Library


Abstract

With the growing number of devices connected to the cloud and the Internet, data is being generated from many different sources, including smartphones, tablets, and Internet of Things devices. The demand for storage is growing every year. For cloud storage developers looking for ways to speed up their storage performance, the optimized hash functions in the Intel® Intelligent Storage Acceleration Library (Intel® ISA-L) accelerate the computation, providing up to an 8x performance gain over OpenSSL* algorithms. A performance study using version 2.14, the latest version of Intel ISA-L, shows the potential gain developers can get by applying Intel ISA-L to their existing applications.

This article captures the performance data and the system configuration for developers interested in reproducing this experiment in their own environment. Intel ISA-L can run on various Intel® server processors and provides operation acceleration through the following instruction sets:

  • Intel® Advanced Encryption Standard New Instructions (Intel® AES-NI)
  • Intel® Streaming SIMD Extensions (Intel® SSE)
  • Intel® Advanced Vector Extensions (Intel® AVX)
  • Intel® Advanced Vector Extensions 2 (Intel® AVX2)

Benefits

Intel ISA-L multibinary support functions allow an appropriate version of a function to be selected at first run (based on the supported instruction set) and called instead of an architecture-specific version. Developers can deploy a single binary with multiple function versions and let the feature selection happen at runtime. If code size is a concern, you can call an architecture-specific version directly to reduce code size. The base functions are written in C, and the multibinary function calls those if none of the required instruction sets are enabled.

For example, if the code runs on a processor from the Intel® Xeon® E5 v3 family and there are three versions of a particular function (func1_sse(), func1_avx(), func1_avx2()), the multibinary function func1() determines that the appropriate version to call is func1_avx2(), since that processor family supports Intel AVX2. There is also a base function (func1_base()), which the multibinary function calls if none of the required instruction sets are enabled.
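
The dispatch pattern itself is straightforward. The sketch below is an illustration of the idea rather than Intel ISA-L's actual implementation; the func1_* names are the hypothetical ones from the example above, and the CPU feature check uses the GCC __builtin_cpu_supports() helper.

#include <stdio.h>

/* Hypothetical architecture-specific implementations. */
static void func1_base(void) { puts("base (plain C)"); }
static void func1_sse(void)  { puts("SSE version");    }
static void func1_avx(void)  { puts("AVX version");    }
static void func1_avx2(void) { puts("AVX2 version");   }

static void func1_resolve(void);

/* Pointer resolved on first call to the best supported version. */
static void (*func1_ptr)(void) = func1_resolve;

static void func1_resolve(void)
{
    __builtin_cpu_init();                               /* initialize CPU feature detection */
    if (__builtin_cpu_supports("avx2"))        func1_ptr = func1_avx2;
    else if (__builtin_cpu_supports("avx"))    func1_ptr = func1_avx;
    else if (__builtin_cpu_supports("sse4.2")) func1_ptr = func1_sse;
    else                                       func1_ptr = func1_base;
    func1_ptr();                                        /* run the chosen version this time too */
}

/* Callers always use func1(); the first call picks the implementation. */
static void func1(void) { func1_ptr(); }

int main(void) { func1(); func1(); return 0; }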

By including the Intel® instruction extensions listed above, Intel ISA-L reduces the number of instructions required, providing the ability to manipulate multiple data elements in one instruction. See the reference section below to learn more about the extensions. Selecting the right instruction extension for the processor allows the application to take full advantage of the system bandwidth. Figure 1 below shows one way a developer can apply the Intel ISA-L functions in a deduplication application. In a quick study (see Figure 2), the hash functions achieved up to an 8x performance gain on the Intel® Xeon® processor E5-2650 v3.


Figure 1. One method of applying Intel® Intelligent Storage Acceleration Library into the data deduplication process.


Figure 2. Hash functions’ relative performance using OpenSSL* versus Intel® Intelligent Storage Acceleration Library.

Setting Up Intel® Intelligent Storage Acceleration Library On the System

  1. To access the full suite of Intel ISA-L functions, please fill out and submit this request form.
    You will receive an email that provides information on how to get the complete ISA-L zip file.
  2. Download and unzip the library source into the OS.
  3. Read the ISA-L_Getting_Started.pdf and Release_notes.txt supplied with the source. From the Guide, choose and follow the instructions to build the source depending on your needs.

Running the Provided Benchmarks

  1. Install “automake” to build the library and included unit tests.
  2. Run “make perfs”. This builds all unit function tests set for ‘cache cold – larger data set exceeds LLC size.’
  3. Run “make perf”. This runs each unit test supported by the platform architecture. Performance results are output to the console.

Optional: Run “make igzip/igzip_file_perf” and “make igzip/igzip_stateless_file_perf”. This builds additional compression functions and unit tests. Compression tests (igzip_file_perf and igzip_stateless_file_perf) are run using each file of a standard corpus—The Calgary Corpus— as an input. It is available here

Table 1 describes the platform configuration we used in our testing.

Table 1. Tested System Configuration

Related Links and Resources

Performance Gains for Ayasdi Analytics* on the Intel® Xeon® Processor E7-8890 V3


Introduction

Ayasdi deploys vertical applications that utilize Topological Data Analysis to extract value from large and complex data. The Ayasdi platform incorporates statistical, geometric, and machine-learning methods through a topological framework to more precisely segment populations, detect anomalies, and extract features.

This paper describes how Ayasdi’s Analytics* running on systems equipped with the Intel® Xeon® processor E7-8890 v3 gained a performance improvement over running on systems with the previous generation of Intel® Xeon® processor E7-4890 v2.

Ayasdi’s Analytic and Intel® Xeon® Processor E7-8890 V3

 

Processor                           Number of Cores   Number of Threads   Memory   Vectorization
Intel® Xeon® processor E7-4890 v2   15                30                  DDR3     Intel® Advanced Vector Extensions
Intel® Xeon® processor E7-8890 v3   18                36                  DDR4     Intel® Advanced Vector Extensions 2

Table 1. Comparison between processors

Table 1 shows the comparison between the Intel Xeon processor E7-8890 v3 and the Intel Xeon processor E7-4890 v2. The Intel Xeon processor E7-8890 v3 has 18 cores compared to the 15 cores of the Intel Xeon processor E7-4890 v2, allowing more parallelism and therefore better performance. Furthermore, the Intel Xeon processor E7-8890 v3 has larger memory bandwidth than the Intel Xeon processor E7-4890 v2 and uses DDR4 memory while the E7-4890 v2 uses DDR3 memory, thus speeding up execution.

In terms of software advantages, the Intel Xeon processor E7-8890 v3 supports Intel® Advanced Vector Extensions 2 (Intel® AVX2) while the Intel Xeon processor E7-4890 v2 supports only Intel® Advanced Vector Extensions (Intel AVX). In addition, the Intel Xeon processor E7-8890 v3 introduces Bit Manipulation Instruction sets, BMI1 and BMI2. These instruction sets speed up vector and matrix operations and the core computations of complex machine-learning algorithms.

Ayasdi Analytics was optimized for the Intel Xeon processor E7-8890 v3 by using the new Intel AVX2 intrinsic functions, especially Fused Multiply Add and BMIs. This optimization was accomplished by hand-coding in C++ and through the use of the Intel® Math Kernel Library (Intel® MKL)—Intel MKL version 11.2.
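
As a simple illustration of the kind of intrinsics involved (not Ayasdi's actual code), the sketch below uses the fused multiply-add intrinsic _mm256_fmadd_pd on four double-precision values at a time; compile for an FMA-capable target (for example, -march=haswell).

#include <immintrin.h>
#include <stdio.h>

int main(void)
{
    /* Four doubles per 256-bit register. */
    __m256d a = _mm256_set_pd(4.0, 3.0, 2.0, 1.0);
    __m256d b = _mm256_set_pd(8.0, 7.0, 6.0, 5.0);
    __m256d c = _mm256_set_pd(0.5, 0.5, 0.5, 0.5);

    /* r = a * b + c, computed as a single fused multiply-add per lane. */
    __m256d r = _mm256_fmadd_pd(a, b, c);

    double out[4];
    _mm256_storeu_pd(out, r);
    printf("%g %g %g %g\n", out[0], out[1], out[2], out[3]);
    return 0;
}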

Performance Test Procedure

To show that Intel AVX2 along with the new microarchitecture in the Intel Xeon processor E7 v3 Family increase the throughput of Ayasdi Analytics, we performed tests on two platforms. One system was equipped with the Intel Xeon processor E7-8890 v3 and the other with the Intel Xeon processor E7-4890 v2.

Performance is measured in terms of the following:

  • The throughput of analyses (analyses per hour) that can be supported by the cluster, with acceptable latency.
  • The job latency of the analyses in minutes. The job latencies were measured with nine users concurrently accessing the systems.

Test Configurations

System equipped with the Intel Xeon processor E7-8890 v3

  • System: Pre-production
  • Processor: Intel Xeon processor E7-8890 v3 @2.5 GHz
  • Cores: 18
  • Memory: 1 TB DDR4-1600 MHz

System equipped with the Intel Xeon processor E7-4890 v2

  • System: Pre-production
  • Processor: Intel Xeon processor E7-4890 v2 @2.8 GHz
  • Cores: 15
  • Memory: 1 TB DDR3-1333 MHz

Operating system: Red Hat Enterprise Linux* 7.0

Application: Ayasdi Analytics Benchmark

Test Results


Figure 1. Performance comparison between processors.

Figure 1 shows a 1.85x performance gain of the system with the Intel Xeon processor E7-8890 v3 over that of the system with the Intel Xeon processor E7-4890 v2. The performance gain is due to the enhanced microarchitecture, increase in core count, better memory type (DDR4 over DDR3), and Intel AVX2.


Figure 2. Latency comparison between processors.

Figure 2 shows the reduction in latency on the system with the Intel Xeon processor E7-8890 v3 compared to that of the system with the Intel Xeon processor E7-4890 v2. The decrease in latency is credited to the enhanced microarchitecture, increased core count, better memory type (DDR4 over DDR3), and Intel AVX2.

Conclusion

More cores, enhanced microarchitecture, and the support of DDR4 memory contributed to the performance improvement of Ayasdi Analytics on systems equipped with the Intel Xeon processor E7-8890 v3 compared to those with the Intel Xeon processor E7-4890 v2. With the introduction of Intel AVX2, matrix manipulations get a performance boost. In addition, applications that make use of Intel MKL will receive a performance improvement without having to change the source code, since their functions are optimized using Intel AVX2.

For More Information

Ayasdi Official Website http://www.ayasdi.com

Topological Data Analysis http://www.ayasdi.com/blog/topology/topological-data-analysis-a-framework-for-machine-learning/
http://www.ayasdi.com/wp-content/uploads/2015/02/Topology_and_Data.pdf

Machine learning http://www.sas.com/en_us/insights/analytics/machine-learning.html

Fused Multiply-Add http://rd.springer.com/chapter/10.1007%2F978-0-8176-4705-6_5

Intel Math Kernel Library (Intel MKL) https://software.intel.com/en-us/intel-mkl

Intel Advanced Vector Extensions https://software.intel.com/en-us/articles/intel-mkl-support-for-intel-avx2

Intel Xeon E7 v3 processor product family https://software.intel.com/en-us/articles/intel-xeon-e7-4800-v3-family

Evaluating the Power Efficiency and Performance of Multi-core Platforms Using HEP Workloads


As Moore’s Law drives the silicon industry toward higher transistor counts, processor designs are becoming more and more complex. The areas of development include core count, execution ports, vector units, uncore architecture and, finally, instruction sets. This increasing complexity leads us to a point where access to shared memory is the major limiting factor, making it a real challenge to feed the cores with data. At the same time, the significant focus on power efficiency is paving the way for power-aware computing and less complex architectures in data centers. In this paper we examine these trends and present the results of our experiments with the Intel® Xeon® E5 v3 (code named Haswell-EP) processor family and highly scalable High-Energy Physics (HEP) workloads.

 


Using Hardware Features in Intel® Architecture to Achieve High Performance in NFV


Introduction

Communications software requires extremely high performance, with data being exchanged in a huge number of small packets. One of the tenets of developing Network Functions Virtualization (NFV) applications is that you virtualize as far as possible, but still optimize for the underlying hardware where necessary.

In this paper, I will talk you through three features of Intel® processors that you can use to optimize the performance of your NFV applications: cache allocation technology (CAT), Intel® Advanced Vector Extensions 2 (Intel® AVX2) for processing vectors of data, and Intel® Transactional Synchronization Extensions (Intel® TSX).

Solving priority inversion with CAT

When a low priority function steals resources from a high priority function, we call it priority inversion.

Not all virtual functions are equally important. A routing function, for example, would be time and performance critical, but a media encoding function wouldn’t be. It could afford to sometimes drop a packet without affecting the user experience because nobody will notice if a video drops from 20 frames per second to 19 frames per second.

The cache is organized by default so that the heaviest user gets the biggest share of it. The heaviest user won’t necessarily be the most important application, though. In fact, the opposite is often true. High priority applications are optimized by reducing their data to the smallest set possible. Low priority applications aren’t worth optimizing in that way, and so tend to consume more memory. Some are inherently memory-hungry too: a packet inspection function for statistical analysis would be low priority, for example, but would require a lot of memory and cache use.

Developers often assume that if they put a single high priority application on a particular core, it’s safe and can’t be affected by low priority applications. That’s not true, unfortunately. Each core has its own level 1 cache (L1, the fastest but smallest cache) and level 2 cache (L2, which is slightly bigger, and somewhat slower). There are separate L1 caches for data (L1D) and program code (L1I, where I stands for instructions). The slowest cache, L3, is shared between the cores in a processor. In Intel® processor architectures up to and including Broadwell, the L3 cache is fully inclusive, which means it contains everything in the L1 and L2 caches. Because of the way the fully inclusive cache works, if something is evicted from L3, it also disappears from the associated L1 and L2 caches. This means that a low priority application that needs space in the L3 cache can evict data from the L1 and L2 caches of a high priority application, even if it’s on a different core.

In the past, there has been a workaround to resolve this, called ‘warming up’. When functions compete for L3 cache, the winner is the application that accesses the memory more often. One solution, then, is for the high priority function to keep accessing the cache when it is idle. It’s not an elegant solution, but it is often good enough, and until recently there wasn’t an alternative. Now there is: The Intel® Xeon® processor E5 v3 family introduced cache allocation technology (CAT), which enables you to allocate cache according to your applications and classes of service.

Understanding the impact of priority inversion

To demonstrate the impact of priority inversion, I wrote a simple microbenchmark that periodically runs a linked list traversal in a high priority thread, while a memory copy function is constantly running in a low priority thread. The threads are pinned to different cores on the same socket. This simulates the worst possible contention, with the copy operation being memory hungry and highly likely to disturb the more important list access thread.

Here’s the C code:

// The article shows only these three functions; the headers, the list_item
// type, and the globals below are minimal assumptions added so the excerpt
// compiles on its own.
#include <stdbool.h>
#include <x86intrin.h>              // __rdtsc()

#define POOL_SIZE_L2_LINES 4096     // assumed threshold: list no longer fits in L2

typedef struct list_item {
    struct list_item *next;
    unsigned long long tick;
    char pad[48];                   // pad each element to one 64-byte cache line
} list_item;

extern volatile bool in_copy;       // set by the low-priority memory-copy thread
extern unsigned long long results[];
extern int result;
extern void spin_sleep(int ms);

// Build a linked list of size N with pseudo-random pattern
void init_pool(list_item *head, int N, int A, int B)
{
    int C = B;
    list_item *current = head;
    for (int i = 0; i < N - 1; i++) {
        current->tick = 0;
        C = (A*C + B) % N;
        current->next = (list_item*)&(head[C]);
        current = current->next;
    }
}

// Touch first N elements in a linked list
void warmup_list(list_item* current, int N)
{
    bool write = (N > POOL_SIZE_L2_LINES) ? true : false;
    for (int i = 0; i < N - 1; i++) {
        current = current->next;
        if (write) current->tick++;
    }
}

// Time one list traversal, averaged over 50 runs
void measure(list_item* head, int N)
{
    unsigned long long i1, i2, avg = 0;

    for (int j = 0; j < 50; j++) {
        list_item* current = head;
#if WARMUP_ON
        while (in_copy) warmup_list(head, N);   // keep the list warm while waiting
#else
        while (in_copy) spin_sleep(1);          // or just sleep while waiting
#endif
        i1 = __rdtsc();
        for (int i = 0; i < N; i++) {
            current->tick++;
            current = current->next;
        }
        i2 = __rdtsc();
        avg += (i2 - i1) / 50;
        in_copy = true;
    }
    results[result++] = avg / N;                // average cycles per list access
}

It contains three functions:

  • The init_pool() function initializes a linked list located in a big and sparse memory area using a simple pseudorandom number generator. This avoids list elements being close together in memory, which would enable spatial locality and disturb our measurements, as some elements would be automatically prefetched. Each item in the list is exactly one cache line.
  • The warmup_list() function constantly traverses the linked list. We have to touch the specific data we want to keep in the cache, so this function stops the linked list from being evicted from the L3 cache by the other threads.
  • The measure() function either sleeps for 1 millisecond or calls warmup_list() while the copy thread is running, depending on which benchmark we are running, and then measures a single traversal of the list. The measure() function then averages the results per list element.

The results of the microbenchmark running on a 5th generation Intel® Core™ i7 processor are shown on the graph below, where the X axis is the total number of cache lines in the linked list, and the Y axis shows the average number of CPU cycles per linked list access. As the size of the linked list increases, it spills over from the L1D cache into L2 and L3 cache, and finally into main memory.

The baseline is the red-brown line that shows the program running without the memory copy thread, and so without any contention. The blue line shows the effect of priority inversion: the memory copy function results in the list access taking significantly longer. The impact is particularly strong when the list fits into the high speed L1 cache or the L2 cache. The impact is insignificant when the list is larger than can fit into the L3 cache.

The green line shows the effect of warming up when the memory copy function is running: it dramatically cuts the access times, bringing them much closer to the baseline.

If we enable CAT and allocate parts of the L3 cache for the exclusive use of each core, the results are very close to the baseline (too close to plot here!), which is exactly our goal.

How to enable CAT

First you should make sure the platform supports CAT. You can use a CPUID instruction to check leaf 7, subleaf 0 which was added to indicate that CAT is available.

If CAT is enabled and supported, there are model specific registers (MSRs) that can be programmed to allocate different parts of L3 to different cores.

Each socket has the MSRs IA32_L3_MASKn (e.g. 0xc90, 0xc91, 0xc92, 0xc93). These registers store a bitmask that indicates how much of the L3 cache to allocate to each class of service (COS). 0xc90 stores the cache allocation for COS0, 0xc91 for COS1 and so on.

For example, this chart shows some possible bitmasks for the different classes of service, showing how the cache might be shared with COS0 getting half, COS1 getting a quarter, and COS2 and COS3 getting an eighth each. 0xc90 would contain 11110000, and 0xc93 would contain 00000001, for example.

Direct Data I/O (DDIO) has its own hidden bitmask that allows streaming data from high speed PCIe devices, such as network cards, to certain parts of the L3 cache. There is a possibility that this will conflict with the classes of service you define, so you have to take it into account when designing high throughput NFV applications. To test for conflict, use Intel® VTune™ Amplifier XE to test for cache misses. Some BIOSes have a setting to view and change the DDIO mask.

Each core has an MSR IA32_PQR_ASSOC (0xc8f), which is used to specify which class of service applies to that core. The default is 0, which means the bitmask in MSR 0xc90 is used. (By default, the bitmask of 0xc90 is all 1s, to provide maximum cache availability).
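
On Linux*, with the msr kernel module loaded, these registers can be programmed from user space through the /dev/cpu/*/msr interface. The sketch below is a minimal illustration (run as root) that gives class of service 1 two ways of an 8-way mask and then assigns a core to COS1 via IA32_PQR_ASSOC; the mask value, MSR addresses, and core number are example choices based on the description above, not a recommended configuration.

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

/* Write a 64-bit value to an MSR on a given CPU via /dev/cpu/<cpu>/msr. */
static int wrmsr_on_cpu(int cpu, uint32_t msr, uint64_t value)
{
    char path[64];
    snprintf(path, sizeof(path), "/dev/cpu/%d/msr", cpu);
    int fd = open(path, O_WRONLY);
    if (fd < 0) { perror("open msr"); return -1; }
    int ok = (pwrite(fd, &value, sizeof(value), msr) == sizeof(value)) ? 0 : -1;
    close(fd);
    return ok;
}

int main(void)
{
    /* IA32_L3_MASK1 (0xc91): give COS1 two ways out of an 8-way bitmask. */
    if (wrmsr_on_cpu(2, 0xc91, 0x30ULL) != 0) return 1;

    /* IA32_PQR_ASSOC (0xc8f): the class of service lives in bits 63:32,
       so selecting COS1 for core 2 means writing 1 << 32. */
    if (wrmsr_on_cpu(2, 0xc8f, 1ULL << 32) != 0) return 1;

    puts("core 2 now uses COS1's L3 mask");
    return 0;
}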

The most straightforward usage model for CAT in NFV is to allocate chunks of L3 using isolated bitmasks to different cores, and then pin your threads or VMs to the cores. If VMs have to share cores for execution, it is also possible to make a trivial patch to an OS scheduler to add a cache mask to threads running VMs, and dynamically enable it on each process schedule event.

There is also an unconventional way of using CAT for locking data in the cache. First, make an active cache mask and touch the data in memory so it is loaded to L3. Then disable the bits that represent this part of the L3 cache in any CAT bitmask that will be used in the future. The data will then be locked into L3 because there is no way to evict it (apart from DDIO). In an NFV application, this mechanism is useful to lock medium sized lookup tables for routing and packet inspection in the L3 cache, to enable constant access.

  

CAT configuration is described in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Chapter 17.15.

Using Intel AVX2 for processing vectors

Single instruction multiple data (SIMD) instructions enable you to carry out the same operation on different pieces of data at the same time. They’re often used to speed up floating point processing, but integer versions of arithmetic, logical and data manipulation instructions are also available.

Depending on which processor you are using, you will have a different family of SIMD instructions available to you, and a different size of vector that the commands can process:

  • SSE offers 128-bit vectors;
  • Intel AVX2 offers integer instructions for 256-bit vectors and also introduces instructions for gather operations;
  • AVX3, coming in future Intel® Architecture, will offer 512-bit vectors.

A 128-bit vector could be used for two 64-bit variables, four 32-bit variables, or eight 16-bit variables, depending on the SIMD instructions you use. Larger vectors can accommodate more data items. Given the need for high-performance throughput in NFV applications, you should use the most advanced SIMD instructions (and supporting hardware) available, which is currently Intel AVX2.

The most common use of SIMD instructions is to perform the same operation with a vector of values at the same time, as shown in the picture. Here, the operation for generating X1opY1 to X4opY4 is a single instruction that handles the data items X1 to X4 and Y1 to Y4 at the same time. In this example, the speed-up would be 4x compared to normal (scalar) execution, because four operations are processed at the same time. The speed-up can be as large as the SIMD vector size. NFV applications often involve processing multiple packet streams in the same way, so SIMD is a natural fit for optimizing performance.
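
To make the X1opY1 … X4opY4 picture concrete, here is a small illustrative example (not taken from an NFV codebase) showing the same idea with Intel AVX2's 256-bit integer registers: eight 32-bit additions performed by a single instruction.

#include <immintrin.h>
#include <stdio.h>

int main(void)
{
    /* Eight 32-bit integers per 256-bit register. */
    int x[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    int y[8] = {10, 20, 30, 40, 50, 60, 70, 80};
    int r[8];

    __m256i vx = _mm256_loadu_si256((const __m256i *)x);
    __m256i vy = _mm256_loadu_si256((const __m256i *)y);

    /* One AVX2 instruction performs all eight additions at once. */
    __m256i vr = _mm256_add_epi32(vx, vy);

    _mm256_storeu_si256((__m256i *)r, vr);
    for (int i = 0; i < 8; i++) printf("%d ", r[i]);
    printf("\n");
    return 0;
}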

For simple loops, the compiler is often able to automatically vectorize operations using the latest SIMD instructions available in the CPU (if you use the correct compiler flags). The code can be optimized to use the most advanced instructions available on the hardware at run-time, or can be compiled for a specific target architecture.

SIMD operations also enable memory loads, copying up to 32 bytes (or 256 bits) from memory to a register, streaming loads between memory and the register bypassing the cache, and gathering data from different memory locations. You can also perform vector permutations, which shuffle data in a single register, and vector stores, which write up to 32 bytes from a register to memory.

Memcpy and memmov are famous examples of essential routines that historically were implemented using SIMD instructions because the REP MOV instruction was too slow. The memcpy code has been regularly updated in the system libraries to take advantage of later SIMD instructions, and a CPUID dispatch table has been used to see which is the latest one that can be used. But the libraries tend to lag behind the SIMD generations in their implementation.

For example, the following memcpy routine using a trivial loop is based on an intrinsic (rather than using library code) so the compiler can optimize it for the latest SIMD instructions:

_mm256_store_si256((__m256i*)(dest++), _mm256_load_si256((const __m256i*)(src++)));

It compiles to the following assembly code, to deliver twice the performance of recent libraries:

c5 fd 6f 04 04          vmovdqa (%rsp,%rax,1),%ymm0
c5 fd 7f 84 04 00 00    vmovdqa %ymm0,0x10000(%rsp,%rax,1)

The assembly code from the intrinsic will copy 32 bytes (256 bits) using the latest SIMD instructions available, while library code using SSE would only copy 16 bytes (128 bits).

NFV applications also often need to perform a gather operation, loading data from several non-consecutive memory locations. For example, the network card might place the incoming packets in the cache using DDIO. The NFV application might only need to access the part of the network header with the destination IP address. Using a gather operation, the application could collect the data on eight packets at the same time.

For a gather, there’s no need to use an intrinsic or inline assembly because the compiler can vectorize code similar to the program below, which is based on a benchmark that sums numbers from pseudorandom memory locations:

  int a[1024];
  int b[64];
  int i, sum = 0;
  for (i = 0; i < 1024; i++) a[i] = i;
  for (i = 0; i < 64; i++) b[i] = (i*1051) % 1024;
  for (i = 0; i < 64; i++) sum += a[b[i]]; // This line is vectorized using gather.

The last line compiles to the following assembly:

c5 fe 6f 40 80      vmovdqu -0x80(%rax),%ymm0
c5 ed fe f3         vpaddd %ymm3,%ymm2,%ymm6
c5 e5 ef db         vpxor  %ymm3,%ymm3,%ymm3
c5 d5 76 ed         vpcmpeqd %ymm5,%ymm5,%ymm5
c4 e2 55 90 3c a0   vpgatherdd %ymm5,(%rax,%ymm4,4),%ymm7

While a single gather operation is significantly faster than a sequence of loads, it only matters if the data is already in the cache. If it is not, the data has to be fetched from memory, which can cost tens or hundreds of CPU cycles. With data in the cache, a speed-up of 10x (1000%) is possible. If not, the speed-up might only be 5%.

When you’re using techniques like this, it’s important to measure your application to identify where the bottlenecks are, and whether your application is spending time on copying or gathering data. You can measure your program performance using Intel VTune Amplifier.

Other features of Intel AVX2 and the SIMD instruction sets that are useful for NFV workloads are the bitwise and logical operations. These are used to speed up the implementation of custom cryptography code, and bit checks are useful for ASN.1 coders, often used for data in telecommunications. Intel AVX2 can also be used for faster string matching using advanced algorithms such as the Multiple Pattern Streaming SIMD Extensions Filter (MPSSEF).

Intel AVX2 works well in virtual machines. There is no difference in performance and it does not normally cause virtual machine exits.

Using Intel TSX for greater scalability

One of the challenges of parallel programs is to avoid data races, which can occur when several threads are trying to use the same data item and at least one of them is modifying it. To avoid unpredictable results, the concept of the lock has often been used, with the first thread to use a data item blocking others from using it until it’s finished. That can be inefficient, though, if you have frequently contested locks or if the locks control a larger area of memory than strictly necessary.

Intel Transactional Synchronization Extensions provide processor instructions to elide locks with hardware memory transactions. This helps to achieve better scalability. It works like this: when the program enters a section that uses Intel TSX to guard memory locations, all memory accesses are recorded, and at the end of the protected section they are either atomically committed or rolled back. The rollback happens if there was a conflicting memory access during the execution from another thread that would cause a race condition (such as writing to a location that another transaction has read). A rollback can also occur if the memory access record becomes too big for the Intel TSX implementation, if there is an I/O instruction or syscall, or if there are exceptions or virtual machine exits. I/O causes a rollback because it cannot be executed speculatively, as it interferes with the outside world. A syscall is a very complex operation that changes the ring and memory descriptors, so it is also very difficult to roll back.

A frequently seen usage example of Intel TSX is managing hash table accesses. Usually a hash table lock is implemented to guarantee consistent table accesses, but it comes at the expense of the waiting time for contending threads. The lock is often too coarse, locking the entire hash table, although it’s usually rare for threads to attempt to access the same elements of it at the same time. As the number of cores (and threads) goes up, the coarse lock prevents scaling.

As the diagram below shows, the coarse lock can result in one thread waiting for another thread to release the hash table, even though the threads are using different elements. Using Intel TSX enables both threads to execute straight away, with their results committed when they successfully reach the end of the transaction. The hardware detects conflicts on the fly, and aborts transactions which violate correctness. In the Intel TSX implementation, thread 2 should experience no waiting, and both threads complete much sooner. The per-hash-table lock is effectively converted into a fine-grained lock delivering improved performance. Intel TSX has tracking granularity for conflicts down to the level of a cache line (64 bytes).

There are two software interfaces used in Intel TSX to indicate code sections for transactional execution:

  • Hardware Lock Elision (HLE) is backwards compatible and could be used relatively easily to improve scalability without large modifications to the lock library. HLE introduces a prefix for locked instructions. The HLE instruction prefix provides hints to the hardware to track the status of the lock without acquiring it. In the example above, doing that would mean that unless there is a conflicting write access to a value stored in the hash table, accesses to other hash table elements will no longer lead to locking. As a result, they will not serialize access, so scalability will be greatly improved across four threads.
  • The Restricted Transactional Memory (RTM) interface introduces explicit instructions to start (XBEGIN), commit (XEND), and abort (XABORT) a transaction, and to test a transaction’s status (XTEST). These instructions give locking libraries a more flexible way to implement lock elision. RTM allows the library to implement flexible transaction abort handling algorithms, which can be used to improve Intel TSX performance through optimistic transaction retry, transaction back-off and other advanced techniques. Using the CPUID instruction, the library can fall back to an older lock implementation without RTM, keeping backwards compatibility for the user-level code. A minimal RTM sketch is shown after this list.
  • To learn more about HLE and RTM, I recommend the following articles on Intel Developer Zone:

    https://software.intel.com/en-us/blogs/2013/06/07/web-resources-about-intelr-transactional-synchronization-extensions
    https://software.intel.com/en-us/blogs/2012/02/07/transactional-synchronization-in-haswell
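
As promised above, here is a minimal, illustrative RTM sketch (not production lock-elision code): it tries to update a shared counter inside a hardware transaction and falls back to a conventional spinlock if the transaction aborts. It uses the RTM intrinsics from immintrin.h and must be compiled for an RTM-capable target (for example, with -mrtm); the variable names are placeholders.

#include <immintrin.h>   /* _xbegin, _xend, _xabort, _XBEGIN_STARTED */

static volatile int fallback_lock;    /* 0 = free, 1 = held */
static long shared_counter;

void increment_counter(void)
{
    unsigned status = _xbegin();
    if (status == _XBEGIN_STARTED) {
        /* Reading the lock adds it to the transaction's read set, so if
           another thread acquires it, this transaction aborts. */
        if (fallback_lock)
            _xabort(0xff);
        shared_counter++;             /* speculative update */
        _xend();                      /* commit atomically  */
        return;
    }

    /* Fallback path: the transaction aborted, take a real (coarse) lock. */
    while (__sync_lock_test_and_set(&fallback_lock, 1))
        while (fallback_lock)
            ;                         /* spin until free */
    shared_counter++;
    __sync_lock_release(&fallback_lock);
}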

As well as improving your own synchronization primitives with HLE or RTM, data plane NFV functions can benefit from Intel TSX by using the Data Plane Development Kit (DPDK).

When using Intel TSX, the main challenge is not implementing it, but estimating and measuring the performance. There are Performance Monitoring Unit counters that can be used by Linux* perf, Intel® Performance Counter Monitor and Intel VTune Amplifier to see how often Intel TSX has been executed and how successful the execution was (committed vs. aborted cycles).

Intel TSX should be used cautiously in NFV applications and tested thoroughly because I/O operations in an Intel TSX-protected region always involve a rollback, and many NFV functions use a lot of I/O. NFV applications should avoid contended locks. If they have to have locks, then lock elision can help to improve scalability.

The full specification of Intel TSX can be found in Intel® 64 and IA-32 Architectures Software Developer’s Manual, Chapter 15

About the Author

Alexander Komarov is an Application Engineer in the Intel Software and Services Group. For the last 10 years, Alexander’s main job has been optimizing customers’ code to achieve better performance on current and upcoming Intel server platforms. This involves using Intel software development tools (profiler, compiler, libraries) and always utilizing the latest instructions, microarchitecture and architecture advancements of the newest x86 CPUs and chipsets.

Further information

For more information on NFV, see the following videos:

  • Discover the top 10 differences between NFV and cloud environments
  • How to set up DPDK for your NFV applications
  • Using DPPD PROX to test NFV applications - Setting up your System
  • Using DPPD PROX to measure NFV applications - Running the Tests
  • Get a Helping Hand from the Vectorization Advisor


    Get a Helping Hand from the Vectorization Advisor

    If you have not tried the new Vectorization Advisor yet, you might find this story of how it helped one customer sufficiently motivating to give it a look!  It is like having a trusted friend look over your code and give you advice based on what he sees.  As you’ll see in this article, user feedback on the tool has included, “there are significant speedups produced by following advisor output, I'm already sold on this tool!”

    What’s common among airplane surfaces, inkjet printing, and separation of oil/water mixtures or protein solutions used in brand new shampoos? Computational chemists will tell you that advancing in all these areas normally requires performing mesoscopic simulations of condensed matter. “Mesoscopic” means that such simulations have to deal with quantities of matter slightly bigger than the size of atoms, while “condensed matter” implies that quite likely you will be modeling liquids or solid states.

    To address a wide range of mesoscale-related science and industry demands, research scientists at STFC Daresbury Laboratory in the United Kingdom have developed a mesoscopic simulation package called “DL_MESO.” DL_MESO has been adopted in European industry by companies such as Unilever, Syngenta and Infineum, who use mesoscale simulation when figuring out optimal formula [computer-aided formulation (CAF)] for shampoos, detergent powders, agrochemicals or petroleum additives.

    Figure 1. Visualization of the 3D_PhaseSeparation benchmark.

    The computer-aided formulation simulation process is often time and resource consuming, so from the very beginning Daresbury experts were interested in performance-aware design and optimization of DL_MESO for modern platforms; that helps explain why DL_MESO was chosen as one of the Intel® Parallel Computing Center(s) (Intel® PCC) collaboration projects between Hartree and Intel Corporation1.  Intel PCC projects focus on using the latest techniques for code modernization on the latest systems.  As such, it is a natural fit for Intel PCC projects to look to adopt new technology that may assist in code modernization.

    The DL_MESO engineers made good use of early pre-release versions of the Vectorization Advisor analysis tool in early 2015 (the product version of Vectorization Advisor is now part of Intel® Advisor 2016 within Intel® Parallel Studio XE 2016.)

    The interest in new technology was driven by DL_MESO developers’ intention to fully exploit vector parallelism capabilities on modern Intel platforms. For multi-core Intel® Xeon® processors or many-core Intel® Xeon Phi™ platforms, code can only reach good performance if it exploits both levels of CPU parallelism: multi-core parallelism and vector data parallelism. With 512-bit-wide SIMD instructions, efficiently vectorized code becomes theoretically capable of delivering 8x more performance for double-precision (or 16x for single-precision) floating point computations over the performance of non-vectorized code. DL_MESO developers did not want to leave so much performance on the table.

    In this article, we show how the Vectorization Advisor was used by Michael Seaton and Luke Mason, computational scientists in the Daresbury lab, to analyze the DL_MESO Lattice Boltzmann Equation code2. One of the lead developers at Hartree was so impressed by the results of using the Vectorization Advisor that he enthusiastically wrote, “there are significant speedups produced by following advisor output, I'm already sold on this tool!”

    On new multi-core Intel Xeon processors and many-core Intel Xeon Phi coprocessors, you can achieve optimal performance by ensuring that your applications exploit two levels of CPU parallelism: multi-core parallelism and vector data parallelism.

    There is a wide range of techniques available to vectorize your application, including:

    • Using libraries that are already vectorized, such as the Intel® Math Kernel Library (Intel® MKL). The advantage of this approach is that you bypass much of the programming effort needed to vectorize code by using the optimized library functions.

    • Letting the compiler automatically vectorize your code. This has been the traditional route that many developers have relied on when using the Intel compiler – and the Intel compiler does an amazing job!

    • Explicitly adding pragmas or directives, such as the OpenMP* SIMD pragmas/directives. Increasingly, this is becoming the option of choice among developers, giving a greater level of vectorization control compared to relying on auto-vectorization alone, without being locked into too low a programming level.

    • Inserting vector-aware code using vector intrinsic functions, C++ vector classes or assembler instructions. This technique requires that you have a good working knowledge of the functions and instructions that support vectorization. Any code you write in this way will be much less portable than the other techniques mentioned above.

    Whichever way you choose to produce vectorized code, it’s important that the resultant code efficiently exercises the vector units of the processor. In the DL_MESO library, the Daresbury programmers are using the OpenMP 4.x programming standard to improve the vectorized performance.

    Vectorization Advisor

    The Vectorization Advisor is one of the two major features of Intel® Advisor 2016. Intel Advisor includes the Vectorization Advisor and a Threading Advisor.

    The Vectorization Advisor is an analysis tool that lets you:

    • For unvectorized loops, discover what prevents code from being vectorized and get tips on how to vectorize it.
    • For vectorized loops that use modern SIMD instructions, measure their performance efficiency and get tips on how to increase it.
    • For both vectorized and unvectorized loops, explore how the memory layout and data structures can be made more vector friendly.

    You can use the Vectorization Advisor with any compiler, but the tool really excels when coupled with the Intel compilers. Not only does Intel Advisor give a more user-friendly view of various Intel compiler-generated reports, it elegantly brings together the results of the compile-time analysis, static analysis of the contributing binaries, and runtime workload metrics such as CPU hotspots and exact loop trip counts.

    Along with the merging of this static and dynamic analysis comes a set of recommendations that you can use in your optimization efforts.  With the Vectorization Advisor, the gap is filled between static compiler-time and dynamic runtime knowledge, giving the benefits of interactive feedback and a rich set of dynamic binary profiles3.

    Intel Advisor Survey: one-stop-shop DL_MESO performance overview

    The Intel Advisor user interface is cleverly designed to bring together all the salient vectorization features of your code into one place, almost like a one-stop shop – as can be seen in Figure 2, which shows the initial analysis of the Lattice Boltzman component using the vectorization survey analysis and trip counts features of Intel Advisor.

    Looking at the Intel Advisor Survey Report, you can see that about half of the total execution time is consumed by the top ten hotspots, with no outstanding leaders among them; all time-consuming loops take no more than 12% of cumulative program time. This type of profile could be characterized as relatively flat. A flat profile is usually bad news for a software developer, because in order to achieve noticeable cumulative workload speedup, there is a need to go through many hotspots and individually profile and optimize each of them, which is time consuming unless there is some software tool assistance provided.

    Figure 2. Survey Report with Trip Counts.

    The Vectorization Advisor enabled the quick categorization of their hotspots as follows: 

    1. Vectorizable, but not vectorized loops that required some minimal program changes (mostly with the help of OpenMP 4.x) to enable compiler-driven SIMD parallelism.  The top four hotspots in the Survey Report belong to this category.
    2. Vectorized loops whose performance could be improved using low-hanging optimization techniques.
    3. Vectorized loops whose performance was limited by data layout (and thus requiring code refactoring to further speed up execution). As we will see later, after applying techniques corresponding to the two categories above, hotspots #1 and #2 transition to this category.
    4. Vectorized loops that performed well.
    5. All other cases (including non-vectorizable kernels).

    Not only does Vectorization Advisor give information about the loops, the Recommendations and Compiler Diagnostic Details tabs can be used to learn more about specific issues and to find out how to fix them.

    In our case, the third hotspot fGetSpeedSite could not be vectorized because the compiler could not work out how many loop iterations there would be. Figure 3 shows the Intel Advisor Compiler Diagnostic Details window for this problem, along with an example and suggestions how to fix the problem. By following the given suggestion, the given loop has been easily vectorized and transitioned from category #2 to category #4.

    Figure 3. Interactive Compiler Diagnostics Details window in the Intel Advisor Survey Report.

    Even when code can be vectorized, simply enabling vectorization doesn’t always lead to a performance improvement – that is, loops that fall into category #2 and #3. That’s why it is important to examine loops that have already been vectorized to confirm that they are performing well. In the next section we will briefly discuss optimization results achieved by the Daresbury lab when applying Intel Advisor to inefficiently vectorized loops.

    Low-hanging fruit optimization: loop padding

    The code for the hottest loop from the DL_MESO profile is shown in Figure 4.

    The array lbv stores the velocities for the lattice in each dimension, the loop count variable lbsy.nq being the number of velocities. In our case, the model represents the three-dimensional 19-velocity lattice (D3Q19 schema), so the value of lbsy.nq is 19.  The resulting equilibrium is stored in the array feq[i].

    In the initial analysis, the loop was reported as being a scalar loop – that is, the code was not vectorized. By simply adding #pragma omp simd just in front of the for loop, the loop was vectorized, with its impact on the total run time dropping from 13% to 9%. Even with this addition, there is still more scope for optimization.

    int fGetEquilibriumF(double *feq, double *v, double rho)
    {
      double modv = v[0]*v[0] + v[1]*v[1] + v[2]*v[2];
      double uv;
    
      for(int i=0; i<lbsy.nq; i++)
      {
        uv = lbv[i*3] * v[0]
           + lbv[i*3+1] * v[1]
           + lbv[i*3+2] * v[2];
    
        feq[i] = rho * lbw[i]
               * (1 + 3.0 * uv + 4.5 * uv * uv - 1.5 * modv);
       }
      return 0;
    }
    

    Figure 4. Code listing – loop for calculating equilibrium distribution.
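
    As described above, the first step was simply to ask the compiler to vectorize this loop with an OpenMP SIMD directive. A minimal sketch of that change, built with OpenMP SIMD support enabled (for example, -qopenmp-simd with the Intel compiler), might look like the following; lbsy, lbv and lbw are the same globals assumed in the listing above.

    int fGetEquilibriumF(double *feq, double *v, double rho)
    {
      double modv = v[0]*v[0] + v[1]*v[1] + v[2]*v[2];
      double uv;

      // Ask the compiler to generate SIMD code; uv is private per iteration.
      #pragma omp simd private(uv)
      for(int i=0; i<lbsy.nq; i++)
      {
        uv = lbv[i*3] * v[0]
           + lbv[i*3+1] * v[1]
           + lbv[i*3+2] * v[2];

        feq[i] = rho * lbw[i]
               * (1 + 3.0 * uv + 4.5 * uv * uv - 1.5 * modv);
      }
      return 0;
    }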

    The new results displayed by Intel Advisor showed that the compiler generated not one, but two loops:

    • A vectorized loop body with a vector length (VL) of 4 – that is, four doubles held in the 256-bit-wide AVX registers
    • A scalar remainder that consumes almost 30% of loop time

    Such a scalar remainder is an unnecessary overhead. The existence of this remainder loop has a detrimental effect on the parallel efficiency – that is, the maximum speedup that could be achieved.  Such a big remainder overhead is actually caused by the loop (trip) count not being a multiple of the vector length.  When the compiler vectorizes the loop, it generates the vectorized body, which in our case executes loop iterations number 0-15.  The remaining three iterations, 16-18, are executed by the scalar remainder code.  Since the total loop count is quite small, then the three iterations remaining become a significant part of elapsed loop time.  In an ideally optimized loop, and especially those with a low trip count, there should be no remainder code.

    One technique we can apply to this code is to increase the loop iterations count to become a multiple of VL, which is 20 in our case.  This technique is called “data padding” and this is exactly what Intel Advisor explicitly suggests in the Recommendations window for this loop (as seen in Figure 5). In order to pad the data, we need to increase the size of the arrays feq[], lbv[]and lbw[] so that accessing the (unused) 20th location will not cause a segmentation violation or similar problem.

    The second row of Figure 11 gives an example of the change that is needed.  The value lbsy.nqpad is a summation of the original loop trip count and the padding value (NQPAD_COUNT).

    You will also see that DL_MESO code developers have added the #pragma loop count directive. By telling the compiler what the loop count will be, the compiler sees that the count is a multiple of the vector length and optimizes code generation for the particular trip count value so that scalar remainder invocation code is omitted in runtime.
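
    A sketch of what the padded loop might look like, using the lbsy.nqpad and NQPAD_COUNT names mentioned above (the exact DL_MESO code differs), is shown below. The Intel compiler's loop count pragma tells the compiler the trip count, which is now a multiple of the vector length, so no scalar remainder needs to be generated.

    // Inside fGetEquilibriumF: lbsy.nqpad = lbsy.nq + NQPAD_COUNT, i.e. 19
    // rounded up to 20; feq, lbv and lbw are allocated with the padded size
    // so touching the extra element is safe.
    #pragma loop_count (20)
    #pragma omp simd private(uv)
    for(int i=0; i<lbsy.nqpad; i++)
    {
      uv = lbv[i*3] * v[0]
         + lbv[i*3+1] * v[1]
         + lbv[i*3+2] * v[2];

      feq[i] = rho * lbw[i]
             * (1 + 3.0 * uv + 4.5 * uv * uv - 1.5 * modv);
    }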

    Figure 5. The Vectorization Advisor recommendations for padding the data.

    In the DL_MESO code, there are a number of similar equilibrium distribution code constructs that can be modified in the same way.  In our example, we modified three other loops in the same source file and achieved a speedup of 15% on each loop.

    Balancing overheads and optimization trade-offs

    The padding technique that we used for the first two loops has both performance and code maintenance costs.

    • From a performance perspective, by padding we avoid overhead in the scalar part, but we introduce extra computations in the vector part.
    • From a code maintenance perspective, we have to rework data structure allocations and potentially introduce workload-dependent pragma definitions.

    Fortunately, in our case, the performance benefit outweighs any performance loss, and the code maintenance burden is light.

    Further progress with data layout Structure of Arrays transformation

    Vectorization, loop padding and data alignment techniques boosted the number 1 hotspot's performance by 25-30%, while the parallel vectorization efficiency reported by Intel Advisor4 grew to 56%.

    Since 56% is pretty far from the ideal 100%, Daresbury developers wanted to further investigate performance blockers preventing loops from achieving higher efficiencies. One more time they looked at Vector Issues/Recommendations. This time, the Vector Issues column highlighted a new problem: Possible inefficient memory access patterns present. The associated recommendation was to run a Memory Access Patterns (MAP) analysis. A similar suggestion was also emitted in the Instruction Set Architecture/ Traits column (Figure 6).

    Figure 6.“Inefficient Memory Access patterns present”: Vector Issue, Trait and associated Recommendations.

    MAP is a deeper-dive Intel Advisor analysis type, making it possible to identify and characterize inefficient memory access patterns in detail. In order to run the MAP tool, DL_MESO optimizers have used the following GUI–based scheme:

    • First, developers marked the loop of interest at line 730 by selecting the appropriate checkbox in the second column in the Survey Report (Figure 7).
    • After that, they ran Memory Access Patterns collection using the Workflow panel.

    Figure 7. Selecting loops for deeper MAP or Dependencies analysis.

The high-level Strides Distribution measured as a result of the MAP analysis indicated that both unit-stride and non-unit constant stride accesses took place in the loop (see Figure 9). A further dive into the MAP Problems view and Source view helped to identify the presence of stride-3 (in the original scalar version) or stride-12 (in the padded, vectorized loop) accesses corresponding to manipulation of the lbv array.

The presence of a constant stride means that from iteration to iteration, the access to some array elements shifts by a fixed, predictable amount rather than moving to the next contiguous element. In our case, stride-3 access to the lbv velocities array of integer elements meant that on each successive iteration, the access to the lbv array shifts by 3 integer elements. The value of 3 was not surprising, since the corresponding expression looks like lbv[i*3+X].

    Figure 8. Inefficient memory access… Vector Issue and corresponding Recommendations.

Non-contiguous constant stride is not really good for vectorization, because it usually means that in the vectorized code version it is impossible to load all the array elements into the resulting vector register with a single packed memory move instruction.5 On the other hand, constant stride access can often be transformed into unit (contiguous) stride access by applying an Array of Structures (AoS) to Structure of Arrays (SoA) transformation technique.6 Noticeably, after running the MAP analysis, the original recommendation for the loop in fGetEquilibriumF was automatically updated with a suggestion to apply the AoS->SoA transformation (Figure 8).

    Figure 9. Strides Distribution loop analysis and corresponding tooltip with stride taxonomy explanation.

Daresbury engineers decided to apply this data layout optimization to the lbv array. It was the last optimization technique applied to the loop in fGetEquilibrium. To perform the transformation, they needed to replace the single lbv array (which packs the velocities in the X, Y, and Z dimensions together) with three separate lbvx, lbvy, and lbvz arrays.
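Schematically, the change replaces the interleaved lbv[i*3+X] accesses with three unit-stride arrays. The sketch below is illustrative only; the loop body is a stand-in for the real equilibrium expression, and the function name and ux/uy/uz parameters are invented for the example.

#include <vector>

// After the AoS->SoA transformation: one array per velocity component,
// so every component is read with unit stride instead of stride-3.
void equilibrium_soa(int n,
                     const std::vector<float>& lbvx,
                     const std::vector<float>& lbvy,
                     const std::vector<float>& lbvz,
                     const std::vector<float>& lbw,
                     float ux, float uy, float uz,
                     std::vector<float>& feq) {
    for (int i = 0; i < n; ++i) {
        // contiguous loads allow packed vector instructions rather than
        // strided or gather-style accesses
        feq[i] = lbw[i] * (ux * lbvx[i] + uy * lbvy[i] + uz * lbvz[i]);
    }
}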

    All associated DL_MESO loop and data structure transformations for padding and AoS->SoA are summarized in Figures 10 and 11 below, with accompanying Intel Advisor Vectorization Efficiency and Memory Access Patterns Report metrics.

DL_MESO engineers told us that while the refactoring was relatively time consuming and cannot be considered low-hanging fruit (as opposed to padding), the resulting speedup confirmed that it was definitely worth doing: the loop in fGetEquilibrium gained another 2x speedup on top of the already optimized version. Similar speedups were observed in several other loops manipulating the lbv array.

    Figure 10. Padding and data layout (AoS -> SoA) transformations impact for loop in fGetEquilibriumF, accompanied by data from the Intel Advisor Survey analysis, Trip Counts analysis and MAP analysis.


    Figure 11. Data allocation, loop implementation and Intel Advisor MAP stride data for padding and data layout (AoS -> SoA) transformations for loop in fGetEquilibriumF.

    Summary

By using the Vectorization Advisor to analyze DL_MESO and by adding some pragmas to the code, the Hartree Centre was able to shave between 10% and 19% off the time of the top three hotspots. All of the optimizations were based on suggestions given by the Vectorization Advisor. This work included enabling vectorization and improving loop performance by applying padding optimization techniques. By then applying similar techniques to several other, less significant hotspots, the total speedup of the application reached 18%.

    A further significant performance improvement was achieved by changing the data layout of some variables from Array of Structure to Structure of Arrays – again based on recommendations given by the Vectorization Advisor.

    Although at the time of this work the Vectorization Advisor was only available for regular Intel Xeon processors, when the same optimizations were applied to the code running on Intel Xeon Phi coprocessors, similar speedups were achieved – it’s clearly a win-win!

Figure 12 shows the speedup obtained on one of the main hotspot functions on both a regular server (marked AVX) and on an Intel® Xeon Phi™ coprocessor (code named Knights Corner). These optimizations resulted in a speedup of 2.5x on the Intel® Xeon® processor and 4.1x on the coprocessor.

    Figure 12. The impact of various optimizations (bigger is better).

    In conclusion, the engineers were delighted with the way the Vectorization Advisor helped them get real speedups in their DL_MESO code – leading one of the key developers to say, “I'm already sold on this tool, going forward this will really help us on our Xeon Phi work!”


    1 https://software.intel.com/en-us/articles/intel-parallel-computing-center-at-hartree-centre-stfc
    2 DL_MESO consists of two simulation packages implementing the Lattice Boltzmann Equation (LBE) and Dissipative Particle Dynamics (DPD) methods. The LBE package supports simulation of lattice-gas systems with multiple fluid components, solutes and coupled heat transfers.
3 The Vectorization Advisor requires the Intel compiler to collect a full set of analysis data. However, a solid subset of metrics is available for binaries built with other compilers as well.
    4 The Intel Advisor Vectorization Efficiency metric is currently available only when profiling code compiled using the Intel Compiler 16.x (2016) release.
    5 For more details on strides, see educational videos: https://software.intel.com/en-us/videos/memory-access-101 and https://software.intel.com/en-us/videos/stride-and-memory-access-patterns
    6 For more details on Structure of Arrays, see https://software.intel.com/en-us/articles/a-case-study-comparing-aos-arrays-of-structures-and-soa-structures-of-arrays-data-layouts

    High-Performance, Modern Code Optimizations for Computational Fluid Dynamics


    Modern server farms consist of a large number of heterogeneous, energy-efficient, and very high-performance computing nodes connected with each other through a high-bandwidth network interconnect.  Such systems pose one of the biggest challenges for engineers and scientists today:  how to solve complex, real-world problems by efficiently using the enormous computational horsepower available from the vast number of multi-core arrays comprising these systems. To accomplish this, we need to understand the hierarchical nature of the underlying hardware and use hierarchical parallelism in software to break down the computational problems into discrete parts that fully use the computing power available.

    A major trend in recent years is the growing gap between processor and memory speeds. Because of this, getting the data from memory is usually the most expensive operation, while the computation itself is cheap. Thus, to get optimal performance, software must also be tuned for efficient memory utilization, which requires careful use of various memory hierarchy layers available in the form of caches.

    The governing partial differential equations (PDEs) used for solving the computational fluid dynamics (CFD) challenges  in aerospace apply to various other fields of science as well. Moreover, the methodology for discretizing these PDEs on a finite mesh and solving them using explicit and implicit time-integration schemes can be adapted to applications from various industries and to scientific study. As such, the straightforward code-modernization framework presented here—with upper-bounds on performance set by the Roofline Model and Amdahl’s Law—can be extrapolated to optimize a variety of applications and workloads, yielding faster execution times, and allowing you to improve the performance of your software.

    About SU2

SU2 is an open-source CFD analysis and design software suite released by the Aerospace Design Laboratory at Stanford University in 2012. The suite enables high-performance, scalable Reynolds-Averaged Navier-Stokes (RANS) calculations using explicit and implicit time integration. A recent paper jointly authored by Intel and the Stanford University team focused on performance optimizations of SU2. The team investigated opportunities for parallelism in the software components and searched for highly scalable algorithms. This work is an outcome of the Intel® Parallel Computing Center (IPCC) established at Stanford University with Prof. Juan Alonso’s research group.

      The SU2 optimizations are classified into three categories:

1. Fine-grained parallelism using Open Multi-Processing (OpenMP or OMP)
    2. Single-Instruction, Multiple-Data (SIMD) Vectorization
    3. Memory optimizations

    Hierarchical Parallelization

Current-day processors expose parallelism at multiple levels even within a single node: multiple processors (2-4 sockets), many cores/threads within each processor (up to 72 threads), and SIMD execution units within each core. The compiler takes care of these to a great extent. However, to effectively use all these levels of parallelism, consider using explicit hierarchical parallelism in your software.

    Exploiting the hierarchical nature of the hardware via hierarchical parallelism in software is important for optimal performance. An unstructured-grid flow solver comprises a diverse range of kernels with varying compute and memory requirements, irregular data accesses, as well as variable and limited amounts of instruction-, vector-, and thread-level parallelism. By breaking the problem down into pieces, however, solutions do emerge. For a high-level breakdown of the types of compute kernels associated with computational fluid dynamics, please see a recent paper by Mudigere, et al.

For the work discussed here, the authors chose to optimize an inviscid, transonic ONERA M6 workload together with the Runge-Kutta (RK) explicit time-stepping scheme. This forms a building block for more involved turbulent and implicit time-stepping simulations, while retaining the edge loops that form the top hotspots for any other workload. Moreover, because this forms the bare bones of the SU2 solver, it has one of the lowest overall compute intensities (that is, compute flops per byte of memory accessed) and is therefore the most difficult to optimize.

    Fine-grained Parallelization with OpenMP* (OMP)

    OMP is an API that supports multi-platform, shared-memory, multiprocessing programming in C*, C++*, and Fortran* on most processor architectures and operating systems. Consisting of a set of compiler directives, library routines, and environment variables, OMP can be implemented in two ways:

1. The loop-level method places an OMP parallel region around each loop and is easy to adopt by incrementally adding OMP parallelization. It is less error prone thanks to the implicit barriers at the end of each parallel region, but it incurs a large fork-join overhead because the number of parallel regions is usually fairly high.
2. The high-level, functional approach uses a single OMP parallel region at a very high level in the program, similar to the well-known Message Passing Interface (MPI) domain decomposition. Here the iteration space of the edge loops is pre-divided by coloring the edges so that all edges of one color belong to a given OMP thread (a minimal sketch contrasting the two styles follows this list).
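The sketch below contrasts the two styles. The loop bodies and the pre-computed per-thread edge ranges are placeholders invented for the illustration, not SU2 code.

#include <omp.h>

void loop_level_style(int n_edges, float* residual) {
    // Style 1: one parallel region per loop; simple, but the fork/join
    // cost is paid at every loop.
    #pragma omp parallel for schedule(static)
    for (int e = 0; e < n_edges; ++e) {
        residual[e] = 0.0f;                   // placeholder edge work
    }
}

void high_level_style(int n_edges, float* residual,
                      const int* edge_begin, const int* edge_end) {
    // Style 2: a single high-level parallel region; each thread works on
    // its pre-assigned ("colored") range of edges, MPI-decomposition style.
    #pragma omp parallel
    {
        const int tid = omp_get_thread_num();
        for (int e = edge_begin[tid]; e < edge_end[tid]; ++e) {
            residual[e] = 0.0f;               // placeholder edge work
        }
    }
}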

Both OMP approaches were implemented in SU2; the high-level approach was retained because it showed better performance. Figures 1(a) and (b) show the OMP strong scaling for the Intel® Xeon® processor and the Intel® Xeon Phi™ coprocessor, respectively. Note that Intel® Hyper-Threading Technology is enabled for the Intel Xeon processors, so two OMP threads are affinitized (compactly) to a physical core for these processors, while four OMP threads are affinitized (again compactly) to a physical core for the Intel Xeon Phi coprocessors to take advantage of the four hardware threads per physical core. This helps hide the latency associated with in-order execution on an Intel Xeon Phi coprocessor core. Figures 1(a) and (b) show the results for both small and large meshes for the processor (Figure 1(a)) and the coprocessor (Figure 1(b)). For the Intel Xeon processor, the maximum scaling achieved is 11.06x for the small mesh and 12.28x for the large mesh. For the Intel Xeon Phi coprocessor, the corresponding scaling results are 31.72x and 44.28x. The large mesh scales better than the small mesh because the effect of OMP load imbalance shrinks as the amount of computation increases. This is even more pronounced for the Intel Xeon Phi coprocessor because the number of OMP threads is higher.

    (a) Intel® Xeon® results.

    (b) Intel® Xeon® Phi™ coprocessor results.

    Figure 1. OMP strong scaling.  As indicated, two OMP threads are affinitized to a physical core for the Intel® Xeon® processor and four OMP threads are affinitized to a physical core for the Intel® Xeon Phi™ coprocessor. Compact affinity is used in both cases. 

    The speedup of a program using multiple processors in parallel computing is limited by the sequential fraction of the program. Amdahl’s Law, also known as Amdahl’s Argument, is used here to find the maximum expected improvement to an overall system when only part of the system is improved. This is often used in parallel computing to predict the theoretical maximum speedup that can be achieved using multiple processors.
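Stated as a formula (the standard result, not specific to SU2): if a fraction p of the runtime can be parallelized and N processors are used, the overall speedup is bounded by

S(N) \le \frac{1}{(1-p) + \dfrac{p}{N}}, \qquad \lim_{N \to \infty} S(N) = \frac{1}{1-p}

so even a small serial fraction (1-p) caps the achievable speedup regardless of the thread count.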

    Balancing Work and Minimizing Dependencies

    For OMP thread scheduling, a combination of dynamic- and static-scheduling for different edge-loops is used to optimize and balance various costs. The atomic operations are required in the dynamic scheduling case because of write-contention.

    In the statically scheduled case, the edges of the mesh are “pre-colored” so each thread knows what part of the mesh it owns. Decomposing the edge graph balances work by evenly distributing edges while minimizing dependencies at shared nodes (the “edge cuts” of the edge graph). In order to eliminate contention at the shared nodes, all edges that touch a shared node are replicated on each thread that shares the node. The appropriate data structures for these repeated edges are then added to the code to eliminate contention, and the result is similar to a halo layer approach in a distributed-memory application. The subdomains can then be further reordered, vectorized, and so forth. In this methodology, no atomic operations are required.

    Vectorization

    Vectorization is extremely critical for achieving high performance on modern CPUs and co-processors. In addition to scalar units, the Intel Xeon processor used in this study has four-wide double-precision (DP) SIMD units that support a wide range of SIMD instructions through Intel® Advanced Vector Extensions (Intel® AVX).  In a single cycle, the processor can issue a four-wide DP floating-point multiply or add.

The Intel Xeon Phi coprocessor used in this study has eight-wide DP SIMD units for vector instructions. Thus, one can achieve up to a 4x and 8x speedup over scalar code for double-precision computations on an Intel Xeon processor and an Intel Xeon Phi coprocessor, respectively. As such, vectorization is even more important for the coprocessor.

    Note that Amdahl’s Law also extends to vectorization. If the vector units are not used efficiently, that is, a large portion of the code is scalar, a calculable performance penalty is incurred.

    Strategies for High Compute Intensity

Functions with a very high compute intensity are great candidates for vectorization. By implementing an outer-loop vectorization, savings are achieved by computing on multiple edges simultaneously in multiple SIMD lanes. This contrasts with vectorizing within edges, a method with lower SIMD efficiency. Oftentimes, outer-loop vectorization is better than innermost-loop vectorization, especially if the innermost loops perform very little computation and have low trip counts.

Critically important is that the parameters passed into the vectorized (elemental) functions are accessed in a unit-stride way for different values of the loop iteration index (elemental functions are a feature of the Intel compiler). When this is not possible for the original variables of the application, which may be two-dimensional arrays or double pointers, packing a local copy of these variables into temporary arrays is useful. Typically these temporary arrays are small (a small multiple of the SIMD length is sufficient). Sometimes a transpose is required while copying the original variables into the temporary arrays so that the fastest changing dimension varies linearly with the vector stride. Other clauses can be used in your elemental function definition to give hints to the compiler, such as the uniform and linear clauses.
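As a sketch of the idea, the following uses the OpenMP declare simd spelling of an elemental function (the Intel-specific vector attribute accepts the same uniform and linear clauses). The function itself and its names are invented for the illustration.

// Elemental (vector) function: "pos" and "coeff" are the same for every
// SIMD lane (uniform), while "i" advances by 1 per lane (linear), so the
// pos[] accesses end up unit-stride across the lanes.
#pragma omp declare simd uniform(pos, coeff) linear(i:1)
float scaled_component(const float* pos, float coeff, int i) {
    return coeff * pos[i];
}

void scale_all(int n, const float* pos, float coeff, float* out) {
    #pragma omp simd
    for (int i = 0; i < n; ++i) {
        out[i] = scaled_component(pos, coeff, i);
    }
}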

     As mentioned before, vectorization is done on the outer-loop, which is a loop over the edges of the mesh. This leads to four edges (on the Intel Xeon processor) being processed concurrently within a thread. To address the possible dependency across these edges, the write-out part is separated from the compute part, with the computed values stored in a temporary buffer for each SIMD width of edges. After the compute, scalar operations are used to write out results from the temporary buffer. The performance impact from the scalar write-out is minimal, because it is amortized by a large amount of compute in the vectorized kernels.
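The structure described above looks roughly like the sketch below. The flux computation is a placeholder and the function and array names are invented; only the split between a vectorizable compute part and a scalar write-out part reflects the approach described in the text.

#include <algorithm>

constexpr int SIMD_WIDTH = 4;   // four DP lanes on the Intel AVX processor used here

void centered_residual(int n_edges, const int* node_i, const int* node_j,
                       const double* state, double* residual) {
    for (int e0 = 0; e0 < n_edges; e0 += SIMD_WIDTH) {
        double flux[SIMD_WIDTH];                          // temporary per-lane buffer
        const int block = std::min(SIMD_WIDTH, n_edges - e0);

        // Compute part: independent work on up to SIMD_WIDTH edges, vectorizable.
        #pragma omp simd
        for (int k = 0; k < block; ++k) {
            const int e = e0 + k;
            flux[k] = 0.5 * (state[node_i[e]] + state[node_j[e]]);  // placeholder flux
        }

        // Write-out part: scalar, because several edges in the block may
        // touch the same node, so scatter conflicts are avoided.
        for (int k = 0; k < block; ++k) {
            const int e = e0 + k;
            residual[node_i[e]] += flux[k];
            residual[node_j[e]] -= flux[k];
        }
    }
}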

    Memory Optimizations

    Roofline is a performance model used to estimate an upper bound on the performance of various numerical methods and operations running on multi-core, many-core, or accelerator processor architectures. The most basic Roofline model can be used to bound floating-point performance as a function of machine peak performance, machine peak bandwidth, and arithmetic intensity. The model can be used to assess the quality of attained performance by combining locality, bandwidth, and different parallelization paradigms into a single performance figure. One can examine the resultant Roofline figure in order to determine both the implementation and inherent performance limitations. Most kernels in second-order accurate CFD codes are memory bandwidth bound; therefore, the Roofline model gives a good upper bound on performance for these.
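In its most basic form the bound is (the standard Roofline formulation, not taken from the SU2 paper):

\text{Attainable GFLOP/s} = \min\bigl(\text{Peak GFLOP/s},\ \text{Peak Bandwidth (GB/s)} \times \text{Arithmetic Intensity (FLOP/byte)}\bigr)

Kernels with low arithmetic intensity, which is typical for second-order CFD, sit on the bandwidth-limited slope of the roof.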

    A number of approaches are available when attempting to improve memory performance, and specific techniques are described here as implemented in SU2. In general, the idea is to apply optimizations in order to improve the spatial and temporal locality of data.

    Three particular techniques used for improving data locality compared to the baseline SU2 version are as follows:

    1. Minimize cache misses with edge/vertex reordering
    2. Allocate class objects more intelligently
    3. Change the data structures from array-of-structures (AOS) to structures-of-arrays (SOA)

    Minimize Cache Misses

    The first approach for memory optimization is a reordering of the nodes (unknowns) to minimize cache misses. This is accomplished via a Reverse Cuthill-McKee (RCM) algorithm that minimizes the bandwidth of the adjacency matrix of the mesh. By using RCM, the overall bandwidth of the adjacency matrix of the unstructured mesh used in SU2 was significantly reduced. An adjacency matrix essentially shows the edge connections in an unstructured mesh. The rows and columns of this matrix are vertices; a non-zero entry in the matrix means that the vertices are connected by an edge. This can be seen in Figures 2(a) and (b), which show the adjacency matrix before and after the RCM transformation, respectively. The matrix bandwidth reduces from 170,691 to 15,515 by applying RCM re-numbering for the smaller tetrahedral ONERA M6 mesh. This reduction in matrix bandwidth directly translates into improved cache utilization.

    Smarter Memory Allocation

    The second approach for improving memory performance involves reworking some of the class structure in SU2 in order to support more parallelization- and cache-friendly initializations of class data. The key idea is to reduce the working set sizes and to reduce the number of indirect memory accesses as much as possible. Indirect memory accesses lead to gather-scatter instructions which incur a very large performance penalty. For this purpose, the CNumerics class (parent class) has been modified so that it is purely virtual with no class data, while the child classes allocate all of the data that is necessary for computing fluxes along the edges. This leads to a speed up of the code and also simplifies the parallelization of the flux loops using OMP. Another example is an improved (contiguous) memory allocation approach for the variables that are stored at each node (our unknowns) within the CVariable class. This guarantees that memory for the objects is allocated in a contiguous array (in C-style), rather than using typical C++ allocations.

    Reduce Memory Footprint

    Another major memory optimization performed is a change in class structure from AOS (array-of-structures) to SOA (structures-of-arrays) as explained in detail in a recent AOS-to-SOA case study. This is another strategy for reducing the indirect memory accesses.  The baseline code is written in AOS form for the key C++ CSolver class of the code. That is, the CSolver class contains a double pointer to an object of the CVariable class, which contains the solution variables (unknowns) at a given vertex of the mesh (such as fluid pressure p, or fluid velocity vector <u, v, w> for example). Thus, when accessing these quantities in an edge-loop, one requires an indirect access by de-referencing the CVariable object at the given vertex. These variables are stored in memory as [p1,u1,v1,w1], [p2,u2,v2,w2], …, where xi denotes the variable x at point i. The memory address for each set of vertex data can be spread out quite a bit, which results in expanded working sets. This structure has been modified to SOA format where the CVariable class has been removed entirely, and the required variables at all of the vertices are stored contiguously as members of the CSolver class itself. This avoids the need for indirect access. In SOA format, the variables are stored in memory as: [p1,u1,v1,w1,p2,u2,v2,w2,…], which compacts the working sets and results in cache-efficient traversal of the edge-loop.
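In heavily simplified form, the change looks like the sketch below. The real CSolver and CVariable classes contain far more than this; the field names simply mirror the p, u, v, w example above.

// AoS-style layout (before): one CVariable-like object per vertex, reached
// through an indirection from the solver.
struct CVariableAoS {
    double p, u, v, w;                      // unknowns at one vertex
};
struct CSolverAoS {
    CVariableAoS** node;                    // node[i] -> heap object for vertex i
};

// SoA-style layout (after): the unknowns live contiguously in the solver
// itself, so an edge loop streams through memory without indirection.
struct CSolverSoA {
    double* p;                              // p[i] for vertex i
    double* u;
    double* v;
    double* w;
};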

    (a) Before RCM

    (b) After RCM

    Figure 2. Effect of RCM re-numbering on the edge adjacency matrix of the ONERA M6 mesh.

    Because AOS provides a more modular software design, a tension exists between performance and programming flexibility that must be balanced as needed. In this example, the AOS-to-SOA transformations are coded using pre-processor directives. Be sure to compile with AOS-to-SOA enabled if you desire better performance.

    Performance Results Achieved

Performance results given in this section were obtained using all of the optimizations described in this article. The results were gathered on the Intel Xeon processor and on the Intel Xeon Phi coprocessor with native execution. (Native execution on the Intel Xeon Phi coprocessor means that the code binaries are compiled for direct execution on the coprocessor; the host is not involved in the computation at all.)

Host with Intel® Xeon® Processor: Intel® Xeon® E5, 2.70 GHz, 2 x 12 cores (dual-socket workstation), 64GB DDR3 1600MHz RAM, Hyper-Threading (HT) enabled
Intel® Xeon Phi™ Coprocessor: Intel® Xeon Phi™ C0-7120A, 1.238 GHz, 61 cores, 16 GB GDDR5 RAM, Turbo enabled
Tools: Intel® Composer XE 2015 (beta)

    Table 1. Machine and tools configuration used to generate performance results.

    Figures 3(a) and (b) show the speedups obtained by adding various optimizations for both the Intel Xeon processor and the Intel Xeon Phi coprocessor, respectively. Results for both the small and large ONERA M6 meshes are shown. The simulation is run for 100 nonlinear iterations, and the time per iteration for the 100th iteration is taken as the performance metric. 

    (a) Intel® Xeon® results.

    (b) Intel® Xeon® Phi™ coprocessor results.

    Figure 3. Fine-grained single-node optimizations.

    Results on the Intel® Xeon® Processor

    The speedups shown in Figure 3(a) are relative to the Base (Message Passing Interface, or MPI only) code, which is run with 48 MPI ranks using all 24 physical cores (48 ranks because Intel® Hyper-Threading Technology is enabled). Note that hybridization to MPI+OMP improves the performance by 1.11x for the small mesh. However, it does not make a difference for the large mesh. This is because the small mesh is more sensitive to memory latencies. For the large mesh, sufficient compute occurs to hide some of these latencies and, hence, a significant speedup is not achieved from hybridization.

    By further adding AOS-to-SOA transformations, a very noticeable jump in speedup is obtained for both small and large meshes. This is the most productive optimization. By adding auto OMP scheduling (as shown in Roland W. Green’s paper on OMP open-loop scheduling) more speedup is derived (even more so for the small mesh again because it is more sensitive to load-imbalance among threads than is the large mesh). Finally, by adding vectorization, about 10 percent overall speedup is derived for both small and large meshes. Note that this is a significant gain from vectorization given the fact that only a single kernel (Centered Residual) was vectorized. The other kernels didn’t have enough compute intensity for the vectorization to be worthwhile.

    Results on the Intel® Xeon Phi™ Coprocessor

    For the Intel Xeon Phi coprocessor, the code was executed natively on the coprocessor without host involvement. Overall, the picture for the coprocessor looks similar to that for the Intel Xeon processor. This is a big advantage for the Intel Xeon Phi coprocessor because there is no need to write and maintain different codebases for host and coprocessor. The optimizations done to improve host performance help significantly in improving the coprocessor performance and vice-versa.

    Conclusion

    By placing a particular emphasis on parallelism (both fine- and coarse-grained), vectorization, efficient memory usage, and identification of the best-suited algorithms for modern hardware, the authors have discussed how to optimize SU2 for massively parallel simulations in several key areas:

    1. Code profiling and understanding current bottlenecks
    2. Implementation of coarse- and fine-grain parallelism approaches via a hierarchical framework in software
    3. Optimizing the code for vectorization and efficient memory usage

Future work will build on the current effort and focus on the performance of the code in massively parallel settings for other workloads, including the implicit turbulent solver. We are also investigating other novel, scalable linear system solvers for the implicit solution of the governing equations of computational fluid dynamics.

    The Team

The key members of the Intel team include Gaurav Bansal from the Intel Software and Services Group and Dheevatsa Mudigere, Alexander Heinecke, and Mikhail Smelyanskiy from Intel Labs; the key members of the Stanford University team are Thomas Economon and Prof. Juan J. Alonso.

    Resources

    Read the complete paper entitled Towards High-Performance Optimizations of the Unstructured Open-Source SU2 Suite at:  arc.aiaa.org/doi/abs/10.2514/6.2015-1949

    For details on the Intel® Parallel Studio see:  software.intel.com/en-us/intel-parallel-studio-xe

    Case Study: Optimized Code for Neural Cell Simulations


    About

Intel held the Intel® Modern Code Developer Challenge, in which about 2,000 students from 130 universities in 19 countries registered to participate. They were provided access to Intel® Xeon® processors and Intel® Xeon Phi™ coprocessors to optimize code used in a CERN openlab brain simulation research project. The goal of the research project is to find treatments and cures for neurological disorders, such as schizophrenia, epilepsy, and autism. The contestants' task was to look at the code for cell clustering and 3D movement and then modify the algorithms for parallel performance, optimizing the code to reduce the runtime while maintaining correctness.

In this article, Daniel Vea Falguera (one of the Challenge winners) shares the original code as well as the optimized code (the changed code lines are noted after each listing) and describes many of the optimizations he implemented. In some cases the optimizations did not work, but he gives insight into how other changes to the code would work.

    INCLUDES

    Original Code

    #include <cstring>
    #include <cstdlib>
    #include <ctime>
    #include <cmath>
    #include <getopt.h>
    #include "util.hpp"

    Optimized Code

    #include <cstring>
    #include <cstdlib>
    #include <ctime>
    #include <cmath>
    #include <omp.h>
    #include <getopt.h>
    #include "util.hpp"
    #include <malloc.h>     //Useless
    #include <mkl.h>        //Useless
    #include <cilk/cilk.h> //Useless

    Changed code line​: 5

    Optimization Notes

    The addition of #include <omp.h> is required to run OpenMP*. The OpenMP functions and clauses are used in the optimized code.

    In the completed code, malloc.h, mkl.h, and cilk.h were included during the development process to optimize memory blocks (among other things), but they didn’t show any improvements. They are included to show that there was an attempt to use them. More information on the Intel® Math Kernel Library (Intel® MKL) and Intel® Cilk™ Plus is given in the Other Optimizations section.

    RandomFloatPos

The function RandomFloatPos() is called three times to generate a random 3D position inside the cellMovementAndDuplication function.

NOTE: Only the computation-related functions are covered in this article, so functions such as menu printing are not discussed.

    Original Code

    static float RandomFloatPos() {
        // returns a random number between a given minimum and maximum
        float random = ((float) rand()) / (float) RAND_MAX;
        float a = 0;
        float r = random;
        return a + r;
    }

    Changed code line(s)​:  3, 4, 5, 6

    Optimized Code

    static void RandomFloatPos(float input[3],unsigned sedin) {
        // returns a random number between a given minimum and maximum
        __assume_aligned((float*)input, 64);
        unsigned int seed=sedin,i;
        for(i=0;i<3;++i){
            input[i]=(((float)rand_r(&seed))/(float)(RAND_MAX))-0.5;
        }
    }

    Changed code line​: 1

    Optimization Notes

The original function returns a random number while doing some needless steps (for instance, adding zero to the random number). The original code can't be parallelized directly because of the function rand(), which can only be executed by one thread at a time. The parallelization problem was fixed by using the rand_r() function, which allows each thread to generate random numbers concurrently.

    The function RandomFloatPos() is called to generate a random 3D position minus a constant offset of 0.5, so it can be simplified as a for loop with three iterations. The optimized code does exactly this. The offset is used to make the position values range from -0.5 to 0.5, placing the (0,0,0) point in the middle of the space of cell movement.

The optimized function has two parameters: input[3] and sedin. input[3] returns the three values of the randomly generated 3D position and is 64-byte aligned. sedin is passed as a parameter and contains the value of the seed generated for each call to this function.

    The for loop generates the 3D coordinate and stores it into input.

    getNorm

    Original Code

    static float getNorm(float* currArray) {
        // computes L2 norm of input array
        int c;
        float arraySum=0;
        for (c=0; c<3; c++) {
            arraySum += currArray[c]*currArray[c];
        }
        float res = sqrt(arraySum);
        return res;
    }

    Changed code line(s)​: 3, 6, 8, 9

    Optimized Code

    static float getNorm(float* currArray) {
    	// computes L2 norm of input array
    	float arraySum=0;
    	for (int c=0; c<3; ++c) {
    		arraySum += pow(currArray[c],2);
    	}
    	return sqrt(arraySum);
    }

    Changed code line(s)​: 5, 7

    Optimization Notes

    This computes the norm from a given array of numbers by the input float currArray. There are two ways that the code was optimized for this function:

    • The res variable was removed.
    • The pow() function was added.
  Extensive research done afterwards has shown that pow() is not a significant improvement. Further, given the Intel® Xeon Phi™ coprocessor’s ability to execute multiple floating point operations per cycle, the original currArray[c]*currArray[c] is provably faster in those cases. The Intel compiler vectorizes the pow() function, but adds more code to do so, resulting in slower runtime speed.

    getL2Distance

    This function is used to determine the linear distance between two points in 3D space.

    Original Code

    static float getL2Distance(float pos1x, float pos1y, float pos1z, float
    pos2x, float pos2y, float pos2z) {
    	// returns distance (L2 norm) between two positions in 3D
    	float distArray[3];
    	distArray[0] = pos2x-pos1x;
    	distArray[1] = pos2y-pos1y;
    	distArray[2] = pos2z-pos1z;
    	float l2Norm = getNorm(distArray);
    	return l2Norm;
    }

    Changed code line(s)​: 1, 2, 5, 6, 7, 8, 9

    Optimized Code

    static float getL2Distance(float* pos1, float* pos2) {
    	// returns distance (L2 norm) between two positions in 3D
    	float distArray[3] __attribute__((aligned(64)));
    	distArray[0] = pos2[0]-pos1[0];
    	distArray[1] = pos2[1]-pos1[1];
    	distArray[2] = pos2[2]-pos1[2];
    	return getNorm(distArray);
    }

    Changed code line(s)​: 1-8

    Optimization Notes

The original function had six inputs representing the two 3D points between which the distance is calculated. The optimized function has only two inputs; each is an array of three elements representing a 3D point. The distArray variable used inside is 64-byte aligned.

    The optimized function seems to be SIMD executable, but when using the vector notation (for example, P[0:2]=a[0:2]*b[0:2]) the execution time was slower. It may be that defining and using an elemental function will further optimize the function.

    A white paper on elemental functions is available at: http://software.intel.com/sites/default/files/article/181418/whitepaperonelementalfunctions.pdf

    produceSubstances

    This function increases the concentration of substances for each cell position to a maximum limit of 1 unit per cell position.

    Original Code

static void produceSubstances(float**** Conc, float** posAll, int* typesAll, int L, int n) {

	produceSubstances_sw.reset();
	// increases the concentration of substances at the location of the cells
	float sideLength = 1/(float)L; // length of a side of a diffusion voxel
	int c, i1, i2, i3;
	for (c=0; c< n; c++) {
		i1 = std::min((int)floor(posAll[c][0]/sideLength),(L-1));
		i2 = std::min((int)floor(posAll[c][1]/sideLength),(L-1));
		i3 = std::min((int)floor(posAll[c][2]/sideLength),(L-1));
		if (typesAll[c]==1) {
			Conc[0][i1][i2][i3]+=0.1;
			if (Conc[0][i1][i2][i3]>1) {
				Conc[0][i1][i2][i3]=1;
			}
		} else {
			Conc[1][i1][i2][i3]+=0.1;
			if (Conc[1][i1][i2][i3]>1) {
				Conc[1][i1][i2][i3]=1;
			}
		}
	}
	produceSubstances_sw.mark();
}

    Changed code line(s)​: 1, 5, 6, 8, 9, 10, 11, 13, 14, 17, 18, 19

    Optimized Code

    static void produceSubstances(int L, float Conc[2][L][L][L], float posAll[][3], int* typesAll, int n) {
    
    	produceSubstances_sw.reset();
    	// increases the concentration of substances at the location of the  cells
    
    	const int auxL=L;
    	--L;
    	int c,i[3] __attribute__((aligned(32))); //i array aligned
    	omp_set_num_threads(240);
    
    	#pragma omp parallel for schedule(static) private(i,c)
    	for (c=0; c< n; ++c) {
    		__assume_aligned((int*)i, 32);
    		__assume_aligned((float*)posAll, 64);
    		i[0] = std::min((int)floor(posAll[c][0]*auxL),L);
    		i[1] = std::min((int)floor(posAll[c][1]*auxL),L);
    		i[2] = std::min((int)floor(posAll[c][2]*auxL),L);
    
    		if (typesAll[c]==1) {
    			(Conc[0][i[0]][i[1]][i[2]]>0.9)?
    			Conc[0][i[0]][i[1]][i[2]]=1 :
    			Conc[0][i[0]][i[1]][i[2]]+=0.1;
    		} else {
    			(Conc[1][i[0]][i[1]][i[2]]>0.9)?
    			Conc[1][i[0]][i[1]][i[2]]=1 :
    			Conc[1][i[0]][i[1]][i[2]]+=0.1;
    		}
    	}
    	produceSubstances_sw.mark();
    }

    Changed code line(s)​: 1,6-11, 13-17, 20-22, 24-26

    Optimization Notes

The order of the optimized function's inputs was changed so that the sizes of the arrays are defined in the function header. This way we avoid the use of pointers and work directly with the arrays, so the compiler knows the size of the elements passed into the function beforehand.

The original code used pointers to initialize the arrays, but since those arrays are static (their length does not change), it is easier and faster to declare them directly as arrays with a defined size, without using pointers.

The optimized code uses an OpenMP parallel for directive with static scheduling to distribute the load equally across the cores. This is possible because the function can be executed in parallel without affecting the result; that is, every iteration of the for loop can be executed independently of the others.

During the execution of the main code (described in a later section), this function is called multiple times, each time with a larger value of n. So while it is possible to make the function parallel, the parallelism only pays off once n grows large; while the values of n are still low (less than 10,000), the added threading overhead may slow the code.

The operation posAll[c][0]/sideLength is the same as posAll[c][0]/(1/L), which is the same as posAll[c][0]*L. Since posAll[c][0]*L (written with auxL in the optimized code) generates the same result using a multiplication instead of a division, it is a benefit.

The use of the L variable was also changed in the optimization. The original code computes L-1 three times, costing three extra operations; in the optimized version we simply decrement L once beforehand. The auxL constant (a read-only variable is faster to use than a read/write variable) stores the original value of L, which does not change during the execution of this function.

    Similarly the method for using the Conc variable was changed to optimize the code. In the original code the if clause will increment the Conc variable by 0.1, then check if the value is greater than 1 and if true, limit its value to 1, making the previous addition useless. The solution was to check if the value of Conc is greater than 0.9, and if true set the value of Conc to 1; if false then increment Conc by 0.1.

    runDiffusionStep

This function has two parts. The first part copies the Conc variable into tempConc. The second part iterates through the Conc array, checking the upper and lower boundaries in each of the three dimensions.

    Original Code

    static void runDiffusionStep(float**** Conc, int L, float D) {
    	runDiffusionStep_sw.reset();
    	// computes the changes in substance concentrations due to diffusion
    
    	int i1,i2,i3, subInd;
    	float tempConc[2][L][L][L];
    	for (i1 = 0; i1 < L; i1++) {
    		for (i2 = 0; i2 < L; i2++) {
    			for (i3 = 0; i3 < L; i3++) {
    				tempConc[0][i1][i2][i3] = Conc[0][i1][i2][i3];
    				tempConc[1][i1][i2][i3] = Conc[1][i1][i2][i3];
    			}
    		}
    	}
    
    	int xUp, xDown, yUp, yDown, zUp, zDown;
    
    	for (i1 = 0; i1 < L; i1++) {
    		for (i2 = 0; i2 < L; i2++) {
    			for (i3 = 0; i3 < L; i3++) {
    				xUp = (i1+1);
    				xDown = (i1-1);
    				yUp = (i2+1);
    				yDown = (i2-1);
    				zUp = (i3+1);
    				zDown = (i3-1);
    				for (subInd = 0; subInd < 2; subInd++) {
    					if (xUp<L) {
    						Conc[subInd][i1][i2][i3] += (tempConc[subInd][xUp][i2][i3]-tempConc[subInd][i1][i2][i3])*D/6;
    					}
    					if (xDown>=0) {
    						Conc[subInd][i1][i2][i3] += (tempConc[subInd][xDown][i2][i3]-tempConc[subInd][i1][i2][i3])*D/6;
    					}
    					if (yUp<L) {
    						Conc[subInd][i1][i2][i3] += (tempConc[subInd][i1][yUp][i3]-tempConc[subInd][i1][i2][i3])*D/6;
    					}
    					if (yDown>=0) {
    						Conc[subInd][i1][i2][i3] += (tempConc[subInd][i1][yDown][i3]-tempConc[subInd][i1][i2][i3])*D/6;
    					}
    					if (zUp<L) {
    						Conc[subInd][i1][i2][i3] += (tempConc[subInd][i1][i2][zUp]-tempConc[subInd][i1][i2][i3])*D/6;
    					}
    					if (zDown>=0) {
    						Conc[subInd][i1][i2][i3] += (tempConc[subInd][i1][i2][zDown]-tempConc[subInd][i1][i2][i3])*D/6;
    					}
    				}
    			}
    		}
    	}
    	runDiffusionStep_sw.mark();
    }

    Changed code line(s)​:  1, 5, 9-12, 16, 21-45

    Optimized Code

    static void runDiffusionStep(int L, float Conc[2][L][L][L], float D) {
    	runDiffusionStep_sw.reset();
    	// computes the changes in substance concentrations due to diffusion
    	int i1,i2,i3,auxx;
    	const float auxD=D/6;
    	const int auxL=L-1;
    	float tempConc[2][L][L][L] __attribute__((aligned(64)));
    	omp_set_num_threads(240);
    	#pragma omp parallel
    	{
    		#pragma omp for schedule(static) private(i1,i2) collapse(2)
    		for (i1 = 0; i1 < L; ++i1) {
    			for (i2 = 0; i2 < L; ++i2) {
    				memcpy(tempConc[0][i1][i2],
    				Conc[0][i1][i2],sizeof(float)*L);
    				memcpy(tempConc[1][i1][i2],
    				Conc[1][i1][i2],sizeof(float)*L);
    			}
    		}
    
    		#pragma omp for schedule(static) private(i1,i2,i3) collapse(2)
    		for (i1 = 0; i1 < L; ++i1) {
    			for (i2 = 0; i2 < L; ++i2) {
    				Conc[0][i1][i2][0] += (tempConc[0][i1][i2][1]-
    				tempConc[0][i1][i2][0])*auxD;
    				Conc[1][i1][i2][0] += (tempConc[1][i1][i2][1]-
    				tempConc[1][i1][i2][0])*auxD;
    				for (i3 = 1; i3 < auxL; ++i3) {
    					const float aux=tempConc[0][i1][i2][i3];
    					const float aux1=tempConc[1][i1][i2][i3];
    					__assume_aligned((float*)tempConc[0], 64);
    					__assume_aligned((float*)tempConc[1], 64);
    					__assume_aligned((float*)Conc[0], 64);
    					__assume_aligned((float*)Conc[1], 64);
    					if (i1<auxL) {
    						Conc[0][i1][i2][i3] +=
    						(tempConc[0][(i1+1)][i2][i3]-aux)*auxD;
    						Conc[1][i1][i2][i3] +=
    						(tempConc[1][(i1+1)][i2][i3]-aux1)*auxD;
    					}
    					if (i1>0) {
    						Conc[0][i1][i2][i3] +=
    						(tempConc[0][(i1-1)][i2][i3]-aux)*auxD;
    						Conc[1][i1][i2][i3] +=
    						(tempConc[1][(i1-1)][i2][i3]-aux1)*auxD;
    					}
    					if (i2<auxL) {
    						Conc[0][i1][i2][i3] +=
    						(tempConc[0][i1][(i2+1)][i3]-aux)*auxD;
    						Conc[1][i1][i2][i3] +=
    						(tempConc[1][i1][(i2+1)][i3]-aux1)*auxD;
    					}
    					if (i2>0) {
    						Conc[0][i1][i2][i3] +=
    						(tempConc[0][i1][(i2-1)][i3]-aux)*auxD;
    						Conc[1][i1][i2][i3] += (tempConc[1][i1][(i2-1)][i3]-aux1)*auxD;
    					}
    					Conc[0][i1][i2][i3] += (tempConc[0][i1][i2][(i3+1)]-aux)*auxD;
    					Conc[1][i1][i2][i3] += (tempConc[1][i1][i2][(i3+1)]-aux1)*auxD;
    					Conc[0][i1][i2][i3] +=  (tempConc[0][i1][i2][(i3-1)]-aux)*auxD;
    					Conc[1][i1][i2][i3] += (tempConc[1][i1][i2][(i3-1)]-aux1)*auxD;
    				}
				Conc[0][i1][i2][auxL-1] += (tempConc[0][i1][i2][auxL]-tempConc[0][i1][i2][auxL-1])*auxD;
				Conc[1][i1][i2][auxL-1] += (tempConc[1][i1][i2][auxL]-tempConc[1][i1][i2][auxL-1])*auxD;
			}
		}
	}
	runDiffusionStep_sw.mark();
}

    Changed code line(s)​:  1, 4-11, 14-17, 21, 24-64

    Optimization Notes

This function has two main parts: the first copies the Conc variable (using a for loop) to tempConc, and the second iterates through the Conc array while constantly checking the upper and lower dimensional bounds. Those two parts can be parallelized without any difficulty: all the threads copy their portions of Conc, and then all the threads execute their portions of the next big loop. The most expensive part of this function is the second big loop, where each iteration has to increment and decrement the coordinates and check the bounds. If the calculated coordinate is inside the bounds, the calculations are executed.

The optimized function creates all the threads once. Each thread executes its portion of the memcpy loop, and when all have finished they start executing the dimensional loop. The use of memcpy increases the amount of data copied per iteration.

    The variables are optimized in the same way previous functions were: using constants when possible and aligning the arrays.

    Both loops are ‘collapsed’ under the OpenMP pragma header and are considered in this way:

    • The first loop is under
      #pragma omp for schedule(static) private(i1,i2) collapse(2)
    • The second loop is under
      #pragma omp for schedule(static) private(i1,i2,i3) collapse(2)

The first loop, the tempConc copy loop, is optimized with multithreading. The loop is multithreaded with a ‘parallel for’ and optimized by moving larger portions of data per iteration using memcpy (as noted previously). I don’t know whether all the available memory bandwidth is occupied during the process of copying Conc to tempConc, so one way to improve this further would be to use all the available memory bandwidth (it will be approx. L*float(64bit)*240).

    The second loop, handling the dimensions, has different optimizations.

    The initial optimization has to do with the i3 boundary check. This check takes place in the four lines of code that start with:

    Conc[0][i1][i2][0]
    Conc[1][i1][i2][0]
    
    Conc[0][i1][i2][auxL-1]
    Conc[1][i1][i2][auxL-1]

The i3 bound check can be avoided if the loop runs for L-2 iterations (i3 goes from 1 to auxL), so the first and last operations have to be added manually. Inside the loop, the i3 bounds are then never exceeded by these four lines:

    Conc[0][i1][i2][i3] += (tempConc[0][i1][i2][(i3+1)]-aux)*auxD;
    Conc[1][i1][i2][i3] += (tempConc[1][i1][i2][(i3+1)]-aux1)*auxD;
    Conc[0][i1][i2][i3] += (tempConc[0][i1][i2][(i3-1)]-aux)*auxD;
    Conc[1][i1][i2][i3] += (tempConc[1][i1][i2][(i3-1)]-aux1)*auxD;

When the i1 and i2 indices are not at the boundaries, the code inside the corresponding if statements is executed.

That said, this loop could be improved further if all the if checks inside were removed. One possible solution would be to manually unroll the loop and handle the bounds explicitly, then distribute the resulting tasks among the threads with OpenMP task constructs.

    runDecayStep

This function iterates through all the elements of Conc, applying a decay multiplication to each.

    Original Code

    static void runDecayStep(float**** Conc, int L, float mu) {
    	runDecayStep_sw.reset();
    	// computes the changes in substance concentrations due to decay
    	int i1,i2,i3;
    	for (i1 = 0; i1 < L; i1++) {
    		for (i2 = 0; i2 < L; i2++) {
    			for (i3 = 0; i3 < L; i3++) {
    				Conc[0][i1][i2][i3] = Conc[0][i1][i2][i3]*(1-mu);
    				Conc[1][i1][i2][i3] = Conc[1][i1][i2][i3]*(1-mu);
    			}
    		}
    	}
    	runDecayStep_sw.mark();
    }

    Changed code line(s)​:  1, 8, 9

    Optimized Code

    static void runDecayStep(int L, float Conc[2][L][L][L], float mu) {
    	runDecayStep_sw.reset();
    	// computes the changes in substance concentrations due to decay
    	const float muu=1-mu;
    	int i1,i2,i3;
    	omp_set_num_threads(240);
    	#pragma omp parallel for schedule(static) private(i1,i2,i3) collapse(3)
    	#pragma simd
    	for (i1 = 0; i1 < L; ++i1) {
    		for (i2 = 0; i2 < L; ++i2) {
    			for (i3 = 0; i3 < L; ++i3) {
    				__assume_aligned((float*)Conc[0][i1][i2], 64);
    				__assume_aligned((float*)Conc[1][i1][i2], 64);
    				Conc[0][i1][i2][i3] = Conc[0][i1][i2][i3]*muu;
    				Conc[1][i1][i2][i3] = Conc[1][i1][i2][i3]*muu;
    			}
    		}
    	}
    	runDecayStep_sw.mark();
    }

    Changed code line(s)​:  1, 4, 6-8, 12-15

    Optimization Notes

    There are a few simple changes to optimize this function.

    • The input arrays of the function are defined on the header.
    • The variables are optimized as in previous functions (using constants when possible and aligning the arrays).
    • Loops are ‘collapsed’ under the OpenMP pragma header.
• The use of #pragma simd. This loop can be vectorized with SIMD instructions, which improves the execution time of the optimized code. SIMD instructions make it possible to execute multiple operations at the same time.

An additional optimization would be to use the Intel MKL functions to their fullest. The Intel MKL functions are optimized for large arrays of data, so if the Conc array were flattened to one dimension, Intel MKL could perform the decay scaling faster.
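A hedged sketch of that idea follows, assuming the Conc data has been flattened into a single contiguous buffer (the flattening and the function name are assumptions; cblas_sscal is the standard BLAS/MKL in-place scaling routine):

#include <mkl.h>

// Decay step on a flattened concentration buffer: one BLAS call scales all
// 2*L*L*L values by (1 - mu) in place, letting Intel MKL pick its
// vectorized, threaded kernel. (For very large L, watch MKL_INT overflow.)
static void runDecayStepFlat(int L, float* concFlat, float mu) {
	const MKL_INT n = (MKL_INT)2 * L * L * L;
	cblas_sscal(n, 1.0f - mu, concFlat, 1);
}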

    cellMovementAndDuplication

    This function is used to generate the random movement of each cell and each cell’s duplicate. However, not every cell duplicates.

    Original Code

    static int cellMovementAndDuplication(float** posAll, float* pathTraveled,
     int* typesAll, int* numberDivisions, float pathThreshold, int
     divThreshold, int n) {
    	cellMovementAndDuplication_sw.reset();
    	int c;
    	currentNumberCells = n;
    	float currentNorm;
    	float currentCellMovement[3];
    	float duplicatedCellOffset[3];
    	for (c=0; c<n; c++) {
    		// random cell movement
    		currentCellMovement[0]=RandomFloatPos()-0.5;
    		currentCellMovement[1]=RandomFloatPos()-0.5;
    		currentCellMovement[2]=RandomFloatPos()-0.5;
    		currentNorm = getNorm(currentCellMovement);
    		posAll[c][0]+=0.1*currentCellMovement[0]/currentNorm;
    		posAll[c][1]+=0.1*currentCellMovement[1]/currentNorm;
    		posAll[c][2]+=0.1*currentCellMovement[2]/currentNorm;
    		pathTraveled[c]+=0.1;
    		// cell duplication if conditions fulfilled
    		if (numberDivisions[c]<divThreshold) {
    			if (pathTraveled[c]>pathThreshold) {
    				pathTraveled[c]-=pathThreshold;
    				numberDivisions[c]+=1; // update number of divisions this cell has undergone
    				currentNumberCells++;  // update number of cells in
		 // the simulation
    				numberDivisions[currentNumberCells-1]=numberDivisions[c]; // update number of divisions the  duplicated cell has undergone
    				typesAll[currentNumberCells-1]=-typesAll[c];
    				// assign type of duplicated cell (opposite to current cell)
    
    				// assign location of duplicated cell
    				duplicatedCellOffset[0]=RandomFloatPos()-0.5;
    				duplicatedCellOffset[1]=RandomFloatPos()-0.5;
    				duplicatedCellOffset[2]=RandomFloatPos()-0.5;
    				currentNorm = getNorm(duplicatedCellOffset);
    				posAll[currentNumberCells-1][0]=posAll[c][0]+0.05*duplicatedCellOffset[0]/currentNorm;
    				posAll[currentNumberCells-1][1]=posAll[c][1]+0.05*duplicatedCellOffset[1]/currentNorm;
    				posAll[currentNumberCells-1][2]=posAll[c][2]+0.05*duplicatedCellOffset[2]/currentNorm;
    			}
    		}
    	}
    	cellMovementAndDuplication_sw.mark();
    	return currentNumberCells;
    }

    Changed code line(s)​:  1, 12-14, 16-18, 21, 22, 24-34, 36-38

    Optimized Code

    static int cellMovementAndDuplication(float posAll[][3], float* pathTraveled,
     int* typesAll, int* numberDivisions, float pathThreshold, int
     divThreshold, int n) {
    	cellMovementAndDuplication_sw.reset();
    	int c,currentNumberCells = n;
    	float currentNorm;
    	unsigned int seed=rand();
    	float currentCellMovement[3] __attribute__((aligned(64)));
    	float duplicatedCellOffset[3] __attribute__((aligned(64)));
    	omp_set_num_threads(240);
    
    	#pragma omp parallel for simd schedule (static) shared(posAll) private(c,currentNorm,currentCellMovement)
    
    	for (c=0; c<n; ++c) {
    		// random cell movement
    		RandomFloatPos(currentCellMovement,seed+c);
    		currentNorm = getNorm(currentCellMovement)*10;
    		__assume_aligned((float*)posAll, 64);
    		__assume_aligned((float*)currentCellMovement, 64);
    		posAll[c][0:3]+=currentCellMovement[0:3]/currentNorm;
    		__assume_aligned((float*)pathTraveled, 64);
    		pathTraveled[c]+=0.1;
    	}
    	seed=rand();
    	for (c=0; c<n; ++c) {
    		if ((numberDivisions[c]<divThreshold) && (pathTraveled[c]>pathThreshold)) {
    			pathTraveled[c]-=pathThreshold;
    			++numberDivisions[c]; // update number of divisions this cell has undergone
    			numberDivisions[currentNumberCells]=numberDivisions[c]; // update number of divisions the duplicated cell has undergone
    			typesAll[currentNumberCells]=-typesAll[c]; // assign type of
		 // duplicated cell (opposite to current cell)
    
    			// assign location of duplicated cell
    			RandomFloatPos(duplicatedCellOffset,seed+c); //The seed+c value will be different for each thread and iteration, this way the random value is random always.
    			currentNorm = getNorm(duplicatedCellOffset)*20;
    			__assume_aligned((float*)posAll, 64);
    			__assume_aligned((float*)duplicatedCellOffset, 64);
    			posAll[currentNumberCells][0:3]=posAll[c][0:3]+duplicatedCellOffset[0:3]/currentNorm;
    			++currentNumberCells; // update number of cells in the simulation
    		}
    	}
    	cellMovementAndDuplication_sw.mark();
    	return currentNumberCells;
    }

    Changed code line(s)​:  1, 8-12, 16-21, 26, 28, 29, 34, 36-39

    Optimization Notes

    The optimized code for cellMovementAndDuplication has two portions that were optimized in the same way as described for previous functions:

    • RandomFloatPos() functions. The optimized function can be multithreaded and returns the whole position array (the 3D position array).
    • The code lines incorporating currentNorm can be vectorized with SIMD and also mathematically simplified.

    Additionally, the optimized code updates the header and aligns the arrays.

The new RandomFloatPos() function requires a seed, generated beforehand with seed=rand(), and each thread needs a different value of this seed. To achieve this, every thread adds its own for-loop iteration index to the generated seed value. This way the number is both random and different for each thread.

    To multithread this function I divided it into two big loops. The first (random cell movement) can be parallelized without problems. The second loop (starting with seed=rand();) must be executed in a single thread, because every iteration depends on the previous iteration’s values.

    runDiffusionClusterStep

    This function is used to determine cell movement based on the substance gradients.

    Original Code

    static void runDiffusionClusterStep(float**** Conc, float** movVec, float** posAll, int* typesAll, int n, int L, float speed) {
    	runDiffusionClusterStep_sw.reset();
    	// computes movements of all cells based on gradients of the two substances
    	float sideLength = 1/(float)L; // length of a side of a diffusion voxel
    
    	float gradSub1[3];
    	float gradSub2[3];
    	float normGrad1, normGrad2;
    	int c, i1, i2, i3, xUp, xDown, yUp, yDown, zUp, zDown;
    
    	for (c = 0; c < n; c++) {
    		i1 = std::min((int)floor(posAll[c][0]/sideLength),(L-1));
    		i2 = std::min((int)floor(posAll[c][1]/sideLength),(L-1));
    		i3 = std::min((int)floor(posAll[c][2]/sideLength),(L-1));
    
    		xUp = std::min((i1+1),L-1);
    		xDown = std::max((i1-1),0);
    		yUp = std::min((i2+1),L-1);
    		yDown = std::max((i2-1),0);
    		zUp = std::min((i3+1),L-1);
    		zDown = std::max((i3-1),0);
    
    		gradSub1[0] = (Conc[0][xUp][i2][i3]-Conc[0][xDown][i2][i3])/(sideLength*(xUp-xDown));
    		gradSub1[1] = (Conc[0][i1][yUp][i3]-Conc[0][i1][yDown][i3])/(sideLength*(yUp-yDown));
    		gradSub1[2] = (Conc[0][i1][i2][zUp]-Conc[0][i1][i2][zDown])/(sideLength*(zUp-zDown));
    		gradSub2[0] = (Conc[1][xUp][i2][i3]-Conc[1][xDown][i2][i3])/(sideLength*(xUp-xDown));
    		gradSub2[1] = (Conc[1][i1][yUp][i3]-Conc[1][i1][yDown][i3])/(sideLength*(yUp-yDown));
    		gradSub2[2] = (Conc[1][i1][i2][zUp]-Conc[1][i1][i2][zDown])/(sideLength*(zUp-zDown));
    		normGrad1 = getNorm(gradSub1);
    		normGrad2 = getNorm(gradSub2);
    		if ((normGrad1>0)&&(normGrad2>0)) {
    			movVec[c][0]=typesAll[c]*(gradSub1[0]/normGrad1-gradSub2[0]/normGrad2)*speed;
    			movVec[c][1]=typesAll[c]*(gradSub1[1]/normGrad1-gradSub2[1]/normGrad2)*speed;
    			movVec[c][2]=typesAll[c]*(gradSub1[2]/normGrad1-gradSub2[2]/normGrad2)*speed;
    		} else {
    			movVec[c][0]=0;
    			movVec[c][1]=0;
    			movVec[c][2]=0;
    		}
    	}
    	runDiffusionClusterStep_sw.mark();
    }

    Changed code line(s)​:  1-2, 5-10, 13-39 

    Optimized Code

    static void runDiffusionClusterStep(int L, float Conc[2][L][L][L], float movVec[][3], float posAll[][3], int* typesAll, int n, float speed) {
    	runDiffusionClusterStep_sw.reset();
    	// computes movements of all cells based on gradients of the two substances
    	const float auxL=L;
    	--L;
    	float gradSub[6] __attribute__((aligned(64)));
    	float aux[3] __attribute__((aligned(64)));
    	int i[3] __attribute__((aligned(32)));
    	float normGrad[2] __attribute__((aligned(64)));
    	int c, xUp, xDown, yUp, yDown, zUp, zDown;
    	omp_set_num_threads(240);
    
    	#pragma omp parallel for schedule(static) private(i,xUp,xDown,yUp,yDown,zUp,zDown,gradSub,normGrad,aux) if (n>240)
    	for (c = 0; c < n; ++c) {
    		__assume_aligned((int*)i, 32);
    		__assume_aligned((float*)posAll, 64);
    		__assume_aligned((float*)gradSub, 64);
    		__assume_aligned((float*)normGrad, 64);
    		__assume_aligned((float*)movVec, 64);
    		__assume_aligned((float*)typesAll, 64);
    		i[0:3] = std::min((int)floor(posAll[c][0:3]*auxL),L)-1;
    		xDown = std::max(i[0],0);
    		yDown = std::max(i[1],0);
    		zDown = std::max(i[2],0);
    		xUp = std::min((i[0]+2),L);
    		yUp = std::min((i[1]+2),L);
    		zUp = std::min((i[2]+2),L);
    		aux[0]=auxL/((xUp-xDown));
    		aux[1]=auxL/((yUp-yDown));
    		aux[2]=auxL/((zUp-zDown));
    		gradSub[0] = (Conc[0][xUp][i[1]][i[2]]-Conc[0][xDown][i[1]][i[2]])*aux[0];
    		gradSub[1] = (Conc[0][i[0]][yUp][i[2]]-Conc[0][i[0]][yDown][i[2]])*aux[1];
    		gradSub[2] = (Conc[0][i[0]][i[1]][zUp]-Conc[0][i[0]][i[1]][zDown])*aux[2];
    		normGrad[0] = getNorm(gradSub);
    		if (normGrad[0]>0){
    			gradSub[3] = (Conc[1][i[0]][yUp][i[2]]-Conc[1][i[0]][yDown][i[2]])*aux[1];
    			gradSub[4] = (Conc[1][xUp][i[1]][i[2]]-Conc[1][xDown][i[1]][i[2]])*aux[0];
    			gradSub[5] = (Conc[1][i[0]][i[1]][zUp]-Conc[1][i[0]][i[1]][zDown])*aux[2];
    			normGrad[1] = getNorm(gradSub+3);
    			if ( normGrad[1]>0) {
    				movVec[c][0:3]=typesAll[c]*(gradSub[0:3]/normGrad[0]-gradSub[3:3]/normGrad[1])*speed;
    			} else movVec[c][0:3]=0;
    		}
    	}
    	runDiffusionClusterStep_sw.mark();
    }

    Changed code line(s)​:  1, 5-10, 12-14, 12-43

    Optimization Notes

    This function is optimized using some small simplifications, vectorization and multithreading, much in the same way other functions were optimized.

    • In the header the dimensions of the input arrays are specified.
    • It is array aligned.
    • Replacing variables with constants whenever possible.
    • OpenMP parallelization.
    • Simplification of operations.
    • Simplification of vectorization.

    The if condition filters out some of the work. For example, calculating normGrad[1] is unnecessary when normGrad[0] is not greater than 0, so nesting the checks results in fewer comparisons and avoids computing values that would never be used.

    This function can be further improved using more SIMD optimizations. However, to use the SIMD operations, the max and min functions must be implemented differently.

    getEnergy

    This function computes the energy measure of a subvolume of cells by assuming uniform distribution within the entire volume and determining a volume of a target number of cells.

    Original Code

    static float getEnergy(float** posAll, int* typesAll, int n, float spatialRange, int targetN) {
    	getEnergy_sw.reset();
    	// Computes an energy measure of clusteredness within a subvolume. The
    	// size of the subvolume is computed by assuming roughly uniform
    	// distribution within the whole volume, and selecting a volume
    	// comprising approximately targetN cells.
    	int i1, i2;
    	float currDist;
    	float** posSubvol=0; // array of all 3 dimensional cell positions
    	posSubvol = new float*[n];
    	int typesSubvol[n];
    	float subVolMax = pow(float(targetN)/float(n),1.0/3.0)/2;
    	if(quiet < 1)
    		printf("subVolMax: %f\n", subVolMax);
    	int nrCellsSubVol = 0;
    	float intraClusterEnergy = 0.0;
    	float extraClusterEnergy = 0.0;
    	float nrSmallDist=0.0;
    
    	for (i1 = 0; i1 < n; i1++) {
    		posSubvol[i1] = new float[3];
    		if ((fabs(posAll[i1][0]-0.5)<subVolMax) && (fabs(posAll[i1][1]-0.5)<subVolMax) && (fabs(posAll[i1][2]-0.5)<subVolMax)) {
    			posSubvol[nrCellsSubVol][0] = posAll[i1][0];
    			posSubvol[nrCellsSubVol][1] = posAll[i1][1];
    			posSubvol[nrCellsSubVol][2] = posAll[i1][2];
    			typesSubvol[nrCellsSubVol] = typesAll[i1];
    			nrCellsSubVol++;
    		}
    	}
    
    	for (i1 = 0; i1 < nrCellsSubVol; i1++) {
    		for (i2 = i1+1; i2 < nrCellsSubVol; i2++) {
    			currDist = getL2Distance(posSubvol[i1][0],posSubvol[i1][1],posSubvol[i1][2],posSubvol[i2][0],
    			posSubvol[i2][1],posSubvol[i2][2]);
    			if (currDist<spatialRange) {
    				nrSmallDist = nrSmallDist+1;//currDist/spatialRange;
    				if (typesSubvol[i1]*typesSubvol[i2]>0) {
    					intraClusterEnergy = intraClusterEnergy+fmin(100.0,spatialRange/currDist);
    				} else {
    					extraClusterEnergy = extraClusterEnergy+fmin(100.0,spatialRange/currDist);
    				}
    			}
    		}
    	}
    	float totalEnergy = (extraClusterEnergy-intraClusterEnergy)/(1.0+100.0*nrSmallDist);
    	getEnergy_sw.mark();
    	return totalEnergy;
    }

    Changed code line(s)​:  1, 22, 34, 35, 37-41, 46, 48

    Optimized Code

    static float getEnergy(float posAll[][3], int* typesAll, int n, float spatialRange, int targetN) {
    	getEnergy_sw.reset();
    	// Computes an energy measure of clusteredness within a subvolume. The
    	// size of the subvolume is computed by assuming roughly uniform
    	// distribution within the whole volume, and selecting a volume
    	// comprising approximately targetN cells.
    	int i1, i2;
    	float currDist;
    	float posSubvol[n][3] __attribute__((aligned(64)));
    	int typesSubvol[n] __attribute__((aligned(64)));
    	const float subVolMax = pow(float(targetN)/float(n),1.0/3.0)/2;
    	if(quiet < 1)printf("subVolMax: %f\n", subVolMax);
    
    	int nrCellsSubVol = 0;
    	float intraClusterEnergy = 0.0;
    	float extraClusterEnergy = 0.0;
    	float nrSmallDist=0.0;
    	for (i1 = 0; i1 < n; ++i1) {
    		__assume_aligned((float*)posAll, 64);
    		if ((fabs(posAll[i1][0]-0.5)<subVolMax) && (fabs(posAll[i1][1]-0.5)<subVolMax) && (fabs(posAll[i1][2]-
    			0.5)<subVolMax)) {
    			__assume_aligned((float*)posSubvol[nrCellsSubVol], 64);
    			__assume_aligned((float*)typesAll, 64);
    			__assume_aligned((int*)typesSubvol, 64);
    			posSubvol[nrCellsSubVol][0] = posAll[i1][0];
    			posSubvol[nrCellsSubVol][1] = posAll[i1][1];
    			posSubvol[nrCellsSubVol][2] = posAll[i1][2];
    			typesSubvol[nrCellsSubVol] = typesAll[i1];
    			++nrCellsSubVol;
    		}
    	}
    	omp_set_num_threads(240);
    	#pragma omp parallel for schedule(static) reduction(+:nrSmallDist,intraClusterEnergy,extraClusterEnergy) private(i1,i2,currDist)
    	for (i1 = 0; i1 < nrCellsSubVol; ++i1) {
    		for (i2 = i1+1; i2 < nrCellsSubVol; ++i2) {
    			currDist = getL2Distance(posSubvol[i1],posSubvol[i2]);
    			if (currDist<spatialRange) {
    				++nrSmallDist; //currDist/spatialRange;
    				(typesSubvol[i1]*typesSubvol[i2]>0)? intraClusterEnergy += fmin(100.0,spatialRange/currDist) :
    				extraClusterEnergy += fmin(100.0,spatialRange/currDist);
    			}
    		}
    	}
    	getEnergy_sw.mark();
    	return (extraClusterEnergy-intraClusterEnergy)/(1.0+100.0*nrSmallDist);
    }

    Changed code line(s)​:  1, 9-11, 19, 22-24, 32, 33, 36-40

    Optimization Notes

    Easy targets for optimization in this code include the ‘new’ constructors, variables that are really constants, and the fact that the second loop depends on the first. The first loop is not parallelizable because it depends on the running value of nrCellsSubVol, and the three-condition if inside it makes parallelization even harder. The second loop, however, can be parallelized without problems.

    The specific optimizations done to this function are much the same as in the other functions:

    • The dimensions of the input arrays are specified in the header.
    • Use of constants when possible.
    • Array aligned.
    • OpenMP parallelization on the second ‘for loop’ using a reduction.

    The next optimization step would be to parallelize the first loop despite its if condition. It is difficult but possible, and would likely yield a measurable gain; one possible approach is sketched below.
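    One way this could be done (a sketch only, not the article’s code; the helper name and signature are mine) is to let every thread claim an output slot with an OpenMP atomic capture. The resulting order of cells differs from the serial version, which does not matter for the pairwise energy computation:

    #include <math.h>

    // Copy the cells that fall inside the subvolume into posSubvol/typesSubvol.
    // The counter is advanced atomically, so the loop can run in parallel.
    static int selectSubvolume(const float posAll[][3], const int* typesAll, int n,
                               float subVolMax, float posSubvol[][3], int* typesSubvol)
    {
    	int nrCellsSubVol = 0;
    	#pragma omp parallel for schedule(static)
    	for (int i1 = 0; i1 < n; ++i1) {
    		if ((fabsf(posAll[i1][0]-0.5f)<subVolMax) && (fabsf(posAll[i1][1]-0.5f)<subVolMax) &&
    		    (fabsf(posAll[i1][2]-0.5f)<subVolMax)) {
    			int slot;
    			#pragma omp atomic capture
    			slot = nrCellsSubVol++;            // claim a unique output index
    			posSubvol[slot][0] = posAll[i1][0];
    			posSubvol[slot][1] = posAll[i1][1];
    			posSubvol[slot][2] = posAll[i1][2];
    			typesSubvol[slot] = typesAll[i1];
    		}
    	}
    	return nrCellsSubVol;
    }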

    getCriterion

    This function is used to determine if the cells in the subvolume are in clusters.

    NOTE: Both the original code and the optimized code are truncated versions, showing only the code portions relevant to this paper.

    Original Code

    static bool getCriterion(float** posAll, int* typesAll, int n, float spatialRange, int targetN) {
    	getCriterion_sw.reset();
    	// Returns 0 if the cell locations within a subvolume of the total
    	// system, comprising approximately targetN cells, are arranged as clusters, and 1 otherwise.
    	int i1, i2;
    	int nrClose=0; // number of cells that are close (i.e. within a distance of spatialRange)
    	float currDist;
    	int sameTypeClose=0; // number of cells of the same type, and that are close (i.e. within a distance of spatialRange)
    	int diffTypeClose=0; //number of cells of opposite types, and that are close (i.e. within a distance of spatialRange)
    	float** posSubvol=0; // array of all 3 dimensional cell positions in the subcube
    	posSubvol = new float*[n];
    	int typesSubvol[n];
    	float subVolMax = pow(float(targetN)/float(n),1.0/3.0)/2;
    	int nrCellsSubVol = 0;
    
    	// the locations of all cells within the subvolume are copied to array PosSubvol
    	for (i1 = 0; i1 < n; i1++) {
    		posSubvol[i1] = new float[3];
    		if ((fabs(posAll[i1][0]-0.5)<subVolMax) && (fabs(posAll[i1][1]-0.5)<subVolMax) && (fabs(posAll[i1][2]
    			-0.5)<subVolMax)) {
    			posSubvol[nrCellsSubVol][0] = posAll[i1][0];
    			posSubvol[nrCellsSubVol][1] = posAll[i1][1];
    			posSubvol[nrCellsSubVol][2] = posAll[i1][2];
    			typesSubvol[nrCellsSubVol] = typesAll[i1];
    			nrCellsSubVol++;
    		}
    	}
    
    [section of truncated code]
    
    	for (i1 = 0; i1 < nrCellsSubVol; i1++) {
    		for (i2 = i1+1; i2 < nrCellsSubVol; i2++) {
    			currDist = getL2Distance(posSubvol[i1][0],posSubvol[i1][1],posSubvol[i1][2],posSubvol[i2][0], posSubvol[i2][1],posSubvol[i2][2]);
    			if (currDist<spatialRange) {
    				nrClose++;
    				if (typesSubvol[i1]*typesSubvol[i2]<0) {
    					diffTypeClose++;
    				} else {
    					sameTypeClose++;
    				}
    			}
    		}
    	}
    
    [section of truncated code]
    
    }

    Changed code line(s)​:  9, 10, 32-39

    Optimized Code

    static bool getCriterion(float posAll[][3], int* typesAll, int n, float spatialRange, int targetN) {
    	getCriterion_sw.reset();
    	// Returns 0 if the cell locations within a subvolume of the total
    	// system, comprising approximately targetN cells, are arranged as clusters, and 1 otherwise.
    	int i1, i2;
    	int nrClose=0; // number of cells that are close (i.e. within a  distance of spatialRange)
    	int sameTypeClose=0; // number of cells of the same type, and that are  close (i.e. within a distance of spatialRange)
    	int diffTypeClose=0; // number of cells of opposite types, and that are  close (i.e. within a distance of spatialRange)
    	float posSubvol[n][3] __attribute__((aligned(64)));
    	int typesSubvol[n] __attribute__((aligned(64)));
    	const float subVolMax = pow(float(targetN)/float(n),1.0/3.0)/2;
    	int nrCellsSubVol = 0;
    
    	// the locations of all cells within the subvolume are copied to array  posSubvol
    
    	for (i1 = 0; i1 < n; ++i1) {
    		__assume_aligned((float*)posAll, 64);
    		if ((fabs(posAll[i1][0]-0.5)<subVolMax) && (fabs(posAll[i1][1]-0.5)<subVolMax) && (fabs(posAll[i1][2]
    			-0.5)<subVolMax)) {
    			__assume_aligned((float*)posSubvol[nrCellsSubVol], 64);
    			__assume_aligned((float*)typesAll, 64);
    			__assume_aligned((int*)typesSubvol, 64);
    			posSubvol[nrCellsSubVol][0] = posAll[i1][0];
    			posSubvol[nrCellsSubVol][1] = posAll[i1][1];
    			posSubvol[nrCellsSubVol][2] = posAll[i1][2];
    			typesSubvol[nrCellsSubVol] = typesAll[i1];
    			++nrCellsSubVol;
    		}
    	}
    
    [section of truncated code]
    
    	omp_set_num_threads(240);
    	#pragma omp parallel for schedule(static) reduction(+:nrClose,diffTypeClose,sameTypeClose) private(i1,i2)
    	for (i1 = 0; i1 < nrCellsSubVol; ++i1) {
    		for (i2 = i1+1; i2 < nrCellsSubVol; ++i2) {
    				if (getL2Distance(posSubvol[i1],posSubvol[i2])<spatialRange) {
    					++nrClose;
    					(typesSubvol[i1]*typesSubvol[i2]<0) ? ++diffTypeClose: ++sameTypeClose;
    				}
    		}
    	}
    
    [section of truncated code]
    
    }

    Changed code line(s)​:  1, 9-11, 17, 20-22, 33-35, 38-40

    Optimization Notes

    The optimizable parts of this function include the variable declaration and initialization. The first loop is less optimizable, but optimizing the second loop can make up the difference.

    The specific optimizations done to this function are much the same as in the other functions:

    • The dimensions of the input arrays are specified in the header.
    • Use of constants when possible.
    • OpenMP parallelization on the second ‘for loop’ using a reduction.

    For further optimization, it may be possible to parallelize the first loop despite its if condition. Additionally, flattening the posAll and posSubvol arrays can lead to faster execution; a minimal sketch of the flattening idea follows.
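    The sketch below (names are illustrative, not from the original code) keeps the n-by-3 positions in a single contiguous block and addresses element (i, d) as posFlat[3*i + d], which removes one level of indirection and keeps the data dense for vectorization:

    int main()
    {
    	const int n = 1024;                     // illustrative cell count
    	float* posFlat = new float[3 * n];      // flattened replacement for posAll[n][3]
    	for (int i = 0; i < n; ++i)
    		for (int d = 0; d < 3; ++d)
    			posFlat[3 * i + d] = 0.5f;  // posAll[i][d] becomes posFlat[3*i + d]
    	delete[] posFlat;
    	return 0;
    }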

    Main – Variable Declaration

    The main code is large so for this paper it is divided into meaningful sections. This section deals with the code and optimized code for the variable declarations.

    Original Code

    int i,c,d;
    int i1, i2, i3, i4;
    float energy; // value that quantifies the quality of the cell clustering   output. The smaller this value, the better the clustering
    float** posAll=0; // array of all 3 dimensional cell positions
    posAll = new float*[finalNumberCells];
    float** currMov=0; // array of all 3 dimensional cell movements at the last  time point
    currMov = new float*[finalNumberCells]; // array of all cell movements in the  last time step
    float zeroFloat = 0.0;
    float pathTraveled[finalNumberCells]; // array keeping track of length of path traveled until cell divides
    int numberDivisions[finalNumberCells]; //array keeping track of number of  division a cell has undergone
    int typesAll[finalNumberCells]; // array specifying cell type (+1 or -1)

    Changed code line(s):  2, 4-7

    Optimized Code

    int i,c,d,i1;
    float energy; // value that quantifies the quality of the cell clustering output. The smaller this value, the better the clustering
    float posAll[finalNumberCells][3] __attribute__((aligned(64)));
    float currMov[finalNumberCells][3] __attribute__((aligned(64))); //array of
      // all cell movements in the last time step
    float pathTraveled[finalNumberCells] __attribute__((aligned(64))); // array
      // keeping track of length of path traveled until cell divides
    int numberDivisions[finalNumberCells] __attribute__((aligned(64))); //array
      // keeping track of number of division a cell has undergone
    int typesAll[finalNumberCells] __attribute__((aligned(64))); // array
      // specifying cell type (+1 or -1)
    float Conc[2][L][L][L] __attribute__((aligned(64)));

    Changed code line(s)​:  3-12

    Optimization Notes

    There were many ways to optimize the code in this small section.

    • Avoid the use of ‘new’ constructors.
    • Delete useless variables (for example, zeroFloat).
    • Avoid using pointers when we can declare an array of a known size and static.
    • Memory alignment; memory access is faster though it can use more memory.
    • Define the variable value at initialization time; this is faster than defining the value later.

    Main – Variable Initialization

    The main code is large so for this paper it is divided into meaningful sections. This section deals with the code and optimized code for the variable initialization. The size and values of the variables are declared in this code, including the largest arrays currMov, posAll, pathTraveled, and Conc.

    Original Code

    // Initialization of the various arrays
    for (i1 = 0; i1 < finalNumberCells; i1++) {
    	currMov[i1] = new float[3];
    	posAll[i1] = new float[3];
    	pathTraveled[i1] = zeroFloat;
    	pathTraveled[i1] = 0;
    	for (i2 = 0; i2 < 3; i2++) {
    		currMov[i1][i2] = zeroFloat;
    		posAll[i1][i2] = 0.5;
    	}
    }
    // create 3D concentration matrix
    float**** Conc;
    Conc = new float***[L];
    for (i1 = 0; i1 < 2; i1++) {
    	Conc[i1] = new float**[L];
    	for (i2 = 0; i2 < L; i2++) {
    		Conc[i1][i2] = new float*[L];
    		for (i3 = 0; i3 < L; i3++) {
    			Conc[i1][i2][i3] = new float[L];
    			for (i4 = 0; i4 < L; i4++) {
    				Conc[i1][i2][i3][i4] = zeroFloat;
    			}
    		}
    	}
    }

    Changed code line(s)​:  3-5, 7-10, 13, 14, 16, 18, 20, 22

    Optimized Code

    // Initialization of the various arrays
    omp_set_num_threads(240);
    #pragma omp parallel
    {
    	#pragma omp for simd schedule (static) private (i1) nowait
    	for (i1 = 0; i1 < finalNumberCells; ++i1) {
    		__assume_aligned((float*)posAll, 64);
    		__assume_aligned((float*)currMov[i1], 64);
    		__assume_aligned((float*)pathTraveled, 64);
    		currMov[i1][0:3]=0;
    		posAll[i1][0]=0.5;
    		posAll[i1][1]=0.5;
    		posAll[i1][2]=0.5;
    		pathTraveled[i1] = 0;
    	}
    	#pragma omp for schedule (static) private (i1,c,d) collapse(3)
    	for (i1 = 0; i1 < 2; ++i1) {
    		for (c = 0; c < L; ++c) {
    			for (d = 0; d < L; ++d) {
    				Conc[i1][c][d][0:L]=0;
    			}
    		}
    	}
    }

    Changed code line(s)​:  2-5, 7-13, 16, 20

    Optimization Notes

    There are three ways the code in this portion was optimized. The most basic is the reduction of the for loops from four to three. The other optimizations are memory alignment and the OpenMP thread parallelization with SIMD.

    Main – Phase 1

    The main code is large, so for this paper it is divided into meaningful sections. This section deals with the first steps of the simulation.

    Original Code

    // Phase 1: Cells move randomly and divide until final number of cells is reached
    while (n<finalNumberCells) {
    	produceSubstances(Conc, posAll, typesAll, L, n); // Cells produce
    	// substances. Depending on the cell type, one of the two substances is produced.
    	runDiffusionStep(Conc, L, D); // Simulation of substance diffusion
    	runDecayStep(Conc, L, mu);
    	n = cellMovementAndDuplication(posAll, pathTraveled, typesAll, numberDivisions, pathThreshold, divThreshold, n);
    
    	for (c=0; c<n; c++) {
    		// boundary conditions
    		for (d=0; d<3; d++) {
    			if (posAll[c][d]<0) {posAll[c][d]=0;}
    			if (posAll[c][d]>1) {posAll[c][d]=1;}
    		}
    	}
    }

    Changed code line(s)​:  4, 6, 7, 14

    Optimized Code

    // Phase 1: Cells move randomly and divide until final number of cells is reached
    while (n<finalNumberCells) {
    	produceSubstances(L,Conc, posAll, typesAll, n); // Cells produce substances. Depending on the cell type, one of the two substances is produced.
    	runDiffusionStep(L,Conc, D); // Simulation of substance diffusion
    	runDecayStep(L,Conc, mu);
    	n = cellMovementAndDuplication(posAll, pathTraveled, typesAll, numberDivisions, pathThreshold, divThreshold, n);
    	omp_set_num_threads(240);
    	#pragma omp parallel for simd schedule (static) private (c,d) if (n>500)
    	for (c=0; c<n; ++c) {
    		// boundary conditions
    		for (d=0; d<3; d++) {
    			if (posAll[c][d]<0) {
    				posAll[c][d]=0;
    			}else if (posAll[c][d]>1) posAll[c][d]=1;
    		}
    	}
    }

    Changed code line(s)​:  4-6, 8, 9, 15

    Optimization Notes

    There are two optimizations used in this portion of the code: OpenMP thread parallelization with SIMD and if clause reduction.

    The OpenMP thread parallelization with SIMD is a good fit here because the schedule is static: we know the size of the loop and every iteration has the same cost, so the work is balanced evenly across the threads. The if (n>500) clause avoids the threading overhead while n is still small.

    The original code always performs both boundary checks. The optimized code performs the first check and only falls through to the second when the first fails, which reduces the total number of comparisons. A branch-free alternative is sketched below.
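    As a branch-free alternative (a sketch, not the article’s code; the helper name is mine), each coordinate can be clamped to [0,1] with fminf/fmaxf, which also vectorizes well:

    #include <math.h>

    // Clamp a 3D position to the unit cube [0,1]^3 without branches.
    static inline void clampPosition(float p[3])
    {
    	for (int d = 0; d < 3; ++d)
    		p[d] = fminf(fmaxf(p[d], 0.0f), 1.0f);
    }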

    Main – Phase 2 Ending

    Original code

    for (c=0; c<n; c++) {
    	posAll[c][0] = posAll[c][0]+currMov[c][0];
    	posAll[c][1] = posAll[c][1]+currMov[c][1];
    	posAll[c][2] = posAll[c][2]+currMov[c][2];
    
    	// boundary conditions: cells can not move out of the cube [0,1]^3
    	for (d=0; d<3; d++) {
    		if (posAll[c][d]<0) {posAll[c][d]=0;}
    		if (posAll[c][d]>1) {posAll[c][d]=1;}
    	}
    }

    Changed code line(s)​:  2-4, 9

    Optimized code

    #pragma omp parallel for simd schedule (static) private (c,d) if (n>500)
    for (c=0; c<n; ++c) {
    	__assume_aligned((float*)posAll, 64);
    	__assume_aligned((float*)currMov[c], 64);
    	posAll[c][0:3] += currMov[c][0:3];
    	// boundary conditions: cells can not move out of the cube [0,1]^3
    	for (d=0; d<3; d++) {
    		if (posAll[c][d]<0) {
    			posAll[c][d]=0;
    		}else if (posAll[c][d]>1) posAll[c][d]=1;
    	}
    }

    Changed code line(s)​:  1, 3-5, 10

    Optimization Notes

    This final part of the code is optimized in the same way as the Phase 1 optimization: OpenMP thread parallelization with SIMD and if clause reduction.

    As in Phase 1, the static schedule with equal per-iteration cost keeps the threads evenly loaded, the if (n>500) clause avoids threading overhead while n is still small, and performing the second boundary check only when the first fails reduces the total number of comparisons.

    Other Optimizations

    Besides the optimizations listed in each section of the code previously, there were other optimizations made that were not explicitly remarked on:

    • Defining all post-increments as pre-increments. A post-increment makes a copy of the variable and then increments it; a pre-increment does not make this copy, so it is faster. This is only a gain when the point of incrementation does not affect the result.
    • Memory alignment on all arrays. When the arrays are aligned, the memory operations are faster and the CPU doesn’t need to mask or convert the data.

    Rejected Optimization Techniques

    During the development of this final code I experimented with other parallelization and vectorization techniques, such as Intel Cilk Plus and Intel® Threading Building Blocks (Intel® TBB), as well as some thread-optimized math functions from the Intel MKL. Using Intel Cilk Plus functions made the code slower, though this might be because I didn’t fully understand how to implement them.

    Since the OpenMP functions performed well without problems, I did not do additional testing with the Intel TBB functions, so it is possible that Intel TBB could improve performance as well.

    Another possibility was using Intel MKL functions, which perform best on large amounts of data. While the code moves a large volume of data overall, the amount handled at any one time is small. Because of this low instantaneous data volume (and possibly other reasons), the Intel MKL functions were not a good option for this code.

    About the Author

     

    Daniel Vea Falguera is an Electronic Systems Engineering student and entrepreneur with interests and knowledge about electronics and computer programming.

    Links

    Most of the optimizations done in this code are based on this Intel Xeon Phi coprocessor article: https://software.intel.com/en-us/articles/optimization-and-performance-tuning-for-intel-xeon-phi-coprocessors-part-1-optimization

    Other similar commented options and solutions were provided at the Intel Modern Code for Parallel Architectures forums:

    https://software.intel.com/en-us/forums/intel-moderncode-for-parallel-architectures

    Intel® MKL: https://software.intel.com/en-us/mkl-reference-manual-for-c

    Intel® TBB: https://www.threadingbuildingblocks.org/

    Intel® Cilk Plus: https://www.cilkplus.org/

    OpenMP*: http://openmp.org/wp/

     

    Fast Computation of Fletcher Checksums


    Abstract

    Checksums are widely used for checking the integrity of data in applications such as storage and networking. We present fast methods of computing checksums on Intel® processors. Instead of computing the checksum of the input with a traditional linear method, we describe a faster method to split the data into a number of interleaved parallel streams, compute the checksum on these segments in parallel, followed by a recombination step of computing the effective checksum using the partial checksums.

    Introduction

    Fletcher’s Checksum is a checksum that was designed to give an error detection capability close to that of a CRC, but with greatly improved performance (on general purpose processors).

    It computes a number of sums, where C(0) is the sum of the input words, C(1) is the sum of the C(0) sums, etc. In general the sums are computed modulo M. In some cases M=2^K, so high-order bits are just dropped. In other variants, M=2^K-1.

    ZFS is a popular file system originally designed by Sun Microsystems and open sourced. One of its features is protection against data corruption. For this purpose, it makes widespread use of either a Fletcher-based checksum, or a SHA-256 hash. The checksum has a much lower cost compared to a cryptographic hash function such as SHA-256, but is still significant enough to benefit from optimizations.

    The variant of Fletcher that ZFS uses is based on having four sums (i.e., C(0)…C(3)), processing 32-bit chunks of input data (DWORDS), and computing the sums modulo 2^64.

    While scalar implementations of Fletcher can achieve reasonable performance, this paper presents a way to further improve the performance by using the vector processing feature of Intel processors. The implementation was tailored to the particular Fletcher variant used in ZFS, but the same approach could be applied to other variants.

    Implementation

    If the input stream is considered to be an array of DWORDs (data) and the checksum consists of 4 QWORDS (A, B, C, D; initialized to 0), then the checksum can be defined as:

    for (i=0; i<end; i++) {
    	A += data[i];
    	B += A;
    	C += B;
    	D += C;
    }
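    Wrapped as a self-contained reference function (a sketch; the name and signature are mine, not ZFS’s), with the modulo-2^64 arithmetic coming for free from unsigned 64-bit wraparound:

    #include <stdint.h>
    #include <stddef.h>

    // Scalar reference for the ZFS-style Fletcher variant described above:
    // four 64-bit running sums over 32-bit input words, modulo 2^64.
    void fletcher4_scalar(const uint32_t* data, size_t ndwords, uint64_t sums[4])
    {
    	uint64_t A = 0, B = 0, C = 0, D = 0;
    	for (size_t i = 0; i < ndwords; i++) {
    		A += data[i];
    		B += A;
    		C += B;
    		D += C;
    	}
    	sums[0] = A; sums[1] = B; sums[2] = C; sums[3] = D;
    }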

    When trying to speed this up, an obvious approach is to use the SIMD instructions and registers. But there are multiple ways to do this.

    One approach is to divide the input buffer into 4 contiguous chunks (i.e., the first quarter, the second quarter, etc.), compute the checksum on each one separately, and then combine them. There are two problems with this approach. Unless the buffers are all of a fixed size, the way in which the partial results are combined will vary with each different buffer size. Furthermore, that first addition will require a gather operation to assemble the four input values, which adds extra overhead. This could be reduced by unrolling the loop, reading multiple DWORDS from each quarter, and then “transposing” the registers. Still, non-trivial overhead will be associated with reading the data.

    The most efficient approach is to take the simple scalar loop and implement it in SIMD instructions, e.g.:

    .sloop
    	vpmovzxdq  data, [buf]	; loads DWORD data into QWORD lanes
    	add	   buf, 16
    	vpaddq	   a, a, data
    	vpaddq	   b, b, a
    	vpaddq	   c, c, b
    	vpaddq	   d, d, c
    	cmp	   buf, end
    	jb	   .sloop

    This effectively stripes the buffer, so that lane j computes a checksum on DWORDS (4 i + j). It has no extraneous overhead (e.g., trying to marshal the input data); however, you need to compute the actual checksum from these four partial checksums. Fortunately in this approach the calculation does not depend on the size of the input buffer.
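    For reference, a rough C intrinsics equivalent of the assembly loop above (a sketch under the assumption that the number of DWORDs is a multiple of 4; the function and variable names are illustrative):

    #include <immintrin.h>
    #include <stdint.h>
    #include <stddef.h>

    // Compute the four per-lane partial checksums; lane j accumulates DWORDs 4*i + j.
    void fletcher4_avx2_partials(const uint32_t* buf, size_t ndwords,
                                 uint64_t a[4], uint64_t b[4], uint64_t c[4], uint64_t d[4])
    {
    	__m256i va = _mm256_setzero_si256();
    	__m256i vb = va, vc = va, vd = va;
    	for (size_t i = 0; i < ndwords; i += 4) {
    		__m256i data = _mm256_cvtepu32_epi64(               // vpmovzxdq
    			_mm_loadu_si128((const __m128i*)(buf + i)));
    		va = _mm256_add_epi64(va, data);
    		vb = _mm256_add_epi64(vb, va);
    		vc = _mm256_add_epi64(vc, vb);
    		vd = _mm256_add_epi64(vd, vc);
    	}
    	_mm256_storeu_si256((__m256i*)a, va);
    	_mm256_storeu_si256((__m256i*)b, vb);
    	_mm256_storeu_si256((__m256i*)c, vc);
    	_mm256_storeu_si256((__m256i*)d, vd);
    }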

    If “a” is a pointer to the four 64-bit (QWORD) lane values of the “a” ymm register, and similarly for “b”, “c”, and “d”, then the final checksum can be simply computed as:

        A =    a[0] +    a[1] +    a[2] +    a[3];
    
        B =         -    a[1] -  2*a[2] -  3*a[3]
          +  4*b[0] +  4*b[1] +  4*b[2] +  4*b[3];
    
        C =                        a[2] +  3*a[3]
          -  6*b[0] - 10*b[1] - 14*b[2] - 18*b[3]
          + 16*c[0] + 16*c[1] + 16*c[2] + 16*c[3];
    
        D =                             -    a[3]
          +  4*b[0] + 10*b[1] + 20*b[2] + 34*b[3]
          - 48*c[0] - 64*c[1] - 80*c[2] - 96*c[3]
          + 64*d[0] + 64*d[1] + 64*d[2] + 64*d[3];
    
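    The same recombination written as a small C function (a sketch; the function name is mine, and a, b, c, d hold the four 64-bit lane values of the corresponding ymm registers, with all arithmetic wrapping modulo 2^64):

    #include <stdint.h>

    // Combine the four per-lane partial checksums into the final A, B, C, D.
    void fletcher4_recombine(const uint64_t a[4], const uint64_t b[4],
                             const uint64_t c[4], const uint64_t d[4], uint64_t out[4])
    {
    	out[0] =    a[0] +    a[1] +    a[2] +    a[3];

    	out[1] =         -    a[1] -  2*a[2] -  3*a[3]
    	       +  4*b[0] +  4*b[1] +  4*b[2] +  4*b[3];

    	out[2] =                        a[2] +  3*a[3]
    	       -  6*b[0] - 10*b[1] - 14*b[2] - 18*b[3]
    	       + 16*c[0] + 16*c[1] + 16*c[2] + 16*c[3];

    	out[3] =                             -    a[3]
    	       +  4*b[0] + 10*b[1] + 20*b[2] + 34*b[3]
    	       - 48*c[0] - 64*c[1] - 80*c[2] - 96*c[3]
    	       + 64*d[0] + 64*d[1] + 64*d[2] + 64*d[3];
    }

    Pairing this with the fletcher4_avx2_partials() sketch above and comparing the result against fletcher4_scalar() is a convenient way to sanity-check the lane math.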

    Justification

    Let F_i be the checksum after processing i DWORDS.

    Let the input stream be a series of DWORDS: x1, x2, x3, …

    Then by unrolling the loop, you can see that the elements of F are just weighted sums of the input DWORDS:

    Figure 1

    This is inconvenient, as the coefficients for x1 (for example) vary with the number of elements being considered. It is more convenient to renumber the input, so that y1 is the latest DWORD processed, y2 is the next most recent one, etc. That is, if we’ve processed n DWORDS, then y_i = x_{n+1-i}.

    Then we find that

    Figure 2

    Or, in general

    Figure 3
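    The figures did not survive extraction. As a hedged reconstruction from the surrounding text (the article’s exact notation may differ), the general form after processing n DWORDs is presumably:

    A_n = \sum_{i=1}^{n} y_i, \qquad
    B_n = \sum_{i=1}^{n} i\,y_i, \qquad
    C_n = \sum_{i=1}^{n} \frac{i(i+1)}{2}\,y_i, \qquad
    D_n = \sum_{i=1}^{n} \frac{i(i+1)(i+2)}{6}\,y_i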

    This is what we are trying to compute. But when we use the SIMD instructions, we are computing the checksums on every 4th DWORD. If we consider lane 3, which corresponds to y1, y5, etc., then we are actually computing:

    Figure 4

    And what we really want for that partial sum is (replacing i with (4i-3)):

    Figure 5

    If we compute the following weighted sums of the partial checksums:

    Figure 6
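    Figure 6 is also missing; reading the lane 3 coefficients off the recombination code above (so this is a reconstruction, with the primed names being my notation for the lane 3 partial checksums), the weighted sums are:

    A''_3 = A'_3, \qquad
    B''_3 = 4B'_3 - 3A'_3, \qquad
    C''_3 = 16C'_3 - 18B'_3 + 3A'_3, \qquad
    D''_3 = 64D'_3 - 96C'_3 + 34B'_3 - A'_3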

    It can be shown that (A″3, B″3, C″3, D″3) is the same as (A3, B3, C3, D3) and corresponds to the lane 3 terms (those involving a[3], b[3], c[3], and d[3]) in the recombination calculation shown in the previous section.

    You can go through similar calculations for the other three lanes.

    Performance

    The performance results provided in this section were measured on an Intel® Core™ i7-4770 processor. The tests were run on a single core with Intel® Turbo Boost Technology off and with Intel® Hyper-Threading Technology (Intel® HT Technology) off and on¹. The tests were run on a 16MB buffer that was warm in the cache, so that the cost of the recombination was insignificant.

    The SIMD version described in this paper was compared against the best-known scalar implementation, which is the basic scalar implementation with the loop unrolled by a factor of 4. The results (in cycles/DWORD) were²:

                HT off    HT on
    Scalar      1.62      1.56
    SIMD        0.97      0.78

    The performance of the simple scalar version is good, due to the super scalar nature of the processor. Since the processing for each DWORD requires four additions, this is operating at a rate of 2.5 additions per cycle (not including address updates or compares).

    The Intel® Advanced Vector Extensions 2 (Intel® AVX2) SIMD version runs at approximately 4 additions per cycle.

    The SIMD version shows better scaling under Intel HT Technology, gaining approximately 24% in performance, as opposed to the scalar one, which only improved by about 4%.

    The final merging operation does add some cycles, so for small buffers the scalar approach would be faster, but for large buffers, the SIMD approach can improve performance significantly. For processors that support Intel® Advanced Vector Extensions 512 (Intel® AVX-512), the vector width could be doubled to handle 8 parallel operations. This could approximately double the performance over the Intel AVX2 version for large data buffers.

    Conclusion

    This paper illustrates a method for improved checksum performance. By leveraging architectural features such as SIMD in the processors and combining innovative software techniques, large performance gains are possible.

    Authors

    Jim Guilford and Vinodh Gopal are architects in the Intel Data Center Group, specializing in software and hardware features relating to cryptography and compression.

    Notes

    1 Intel technologies may require enabled hardware, specific software, or services activation. Performance varies depending on system configuration. Check with your system manufacturer or retailer.

    2 Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark* and MobileMark*, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.
