Programming models and operating systems

1. Research challenge

Despite all the experience in parallel programming gathered over the last decades, and in implementing runtime systems to support parallel execution, the productive expression and efficient exploitation of parallelism on current multicore architectures remain far from straightforward.

Many-/multi-core architectures currently include tens of cores and will include hundreds of them in the near future, with memory organizations that will probably move away from the easy-to-program shared-memory address space. In addition, resource specialization is considered the path to building power-efficient architectures, leading to heterogeneous architectures. New programming paradigms and OS/hardware support are necessary to enable effective programming of these architectures with 100+ cores, in terms of scalability and portability.

2. Objectives

The main objective of this cluster was to create a powerful European research community for programming models and runtime environments for exascale architectures based on future many-/multi-core architectures. During HiPEAC-2, the cluster has been working with the following objectives in mind:

  • Propose novel programming models (or evolutions of the current ones) for future homogeneous/heterogeneous many-/multi-core chips. Expressiveness and productivity are equally important. Accelerators (SIMD units, GPUs, FPGAs, ...) should be considered. The introduction of dataflow concepts is necessary to exploit the distant parallelism that is hidden by the use of global synchronization mechanisms (e.g. barriers). Research on compiler and runtime support for parallel programming models should hide the particularities of the target architecture (e.g. local memories and DMA transfers) from the programmer as much as possible.
  • Propose novel programming models (or evolutions of the current ones) for future exascale clustered architectures. PGAS (Partitioned Global Address Space) programming models are gaining much momentum as a novel approach to improve programmability and productivity of large-scale computing architectures. PGAS programming models offer the programming flexibility of a globally shared-memory programming model whilst introducing data locality at a much higher level than the distributed-memory MPI model.
  • Address the inter-operability of standards (e.g. MPI/OpenMP, OpenMP/OpenCL, ...) in order to provide scalability and portability.
  • Explore and propose new architectural features to support the programming model and its runtime implementation: thread creation, task off-loading, locality optimization, transactional memory, ...
  • Evaluate and propose OS support for architectures including heterogeneous cores (CPU, SIMD, GPU), reconfigurable cores (FPGA), ... that guarantees fast task creation and synchronization in the many-/multi-core environment, provides efficient task scheduling and allocation mechanisms and thread management for power, temperature, and reliability, provides QoS in real-time systems, ...
  • Promote the use and development of common tools to support our research activities as well as new benchmarks and methodologies that could serve as a basis for the evaluation of ideas proposed in our community.
  • Form interdisciplinary (intra- and inter-cluster) research groups that could make project proposals on programmability and parallelism of homogeneous or heterogeneous multi-core and/or reconfigurable architectures, with a holistic approach that addresses issues related to the underlying hardware, the operating system and the system software.
  • Contribute to a common HiPEAC 2012-2020 vision.

3. Research activities carried out during HiPEAC-2

3.1 Novel programming models and their runtime support
  • Exploring and proposing novel paradigms based on the exploitation of parallelism at runtime, supported with program annotations. In this direction, the activity of the groups at BSC-UPC, U. of Siena, U. Cyprus and U. of Manchester was the incubator of the TERAFLUX project (FP7 IP 2010-13, Grant Agreement 249013). These groups are addressing the programmability challenges of many-core architectures by combining an underlying dataflow-based thread execution with advanced programming models like transactional memory. Simple extensions to BSC-UPC's OmpSs programming model that introduce programmer-directed speculation mechanisms to increase performance were explored during 2011, as a joint research activity between BSC-UPC and Technion supported with cluster funding (Integrating Speculation in StarSs).
  • Proposing extensions to OpenMP for accelerator-based architectures and implementation of a prototype for GPUs. Two groups were involved: U. Castellon and BSC-UPC. The collaboration resulted in two papers where the main proposal is described. The mobility associated to this collaboration was supported with cluster funding (Extension of the StarSs Programming Model for Platforms with Multiple GPUs). During 2011, as a result of a collaboration between BSC-UPC and UIUC supported with cluster funding (Fine-grained coherence mechanisms for accelerator/CPU architectures), new functionalities were implemented in GMAC, the Asymmetric Distributed Shared Memory runtime developed by the group at BSC-UPC in conjunction with the IMPACT group at UIUC to make GPU programming more productive. The possibility of integrating GMAC and OmpSs has been left for future research.
  • Exploring the use of FPGA-based accelerators to off-load tasks in OpenMP. This work was done in collaboration with the Reconfigurable Computing cluster and resulted in a paper presented at SAMOS 2009 (OpenMP extensions for FPGA Accelerators).
  • The integration of OmpSs and OpenCL has been explored thanks to a mobility supported with cluster funding (Pipeline Scheduling of OpenCL Kernels on Multi-core Architectures). The mobility has been the seed for a strong collaboration between the programming models group at BSC-UPC and the group at Politecnico di Milano.
  • Exploring the applicability of the hybrid MPI/SMPSs model in real applications, propagating the main characteristics of SMPSs to the global cluster level. This is also a way to leverage the huge number of applications written in MPI today and to provide them with a smooth migration path. BSC-UPC, FORTH-ICS and U. of Castellon from HiPEAC are participating in the EU TEXT project, in addition to other research institutions and supercomputing centers.
  • Exploring the use of MapReduce-like programming paradigms for clusters and distributed-memory multi-core architectures. Two groups were involved: FORTH-ICS and BSC-CNS. The work on MapReduce continued during 2011 with a collaboration between BSC-CNS and IBM, creating formal models to describe the behaviour of MapReduce applications and to drive scheduling decisions, paying special attention to energy consumption management in addition to performance management. Mobility associated to this collaboration was supported with cluster funding (MapReduce Modeling and Energy Management).
3.2 Architectural support for programming models
  • Promoting common research activities around transactional memory, including language support, runtime and hardware support. Two groups were initially involved (U. Manchester and BSC-CNS) with mobility (Investigating Runtime-Adaptable Software/Hardware Transactional Systems) to support it. During 2011, in collaboration with Onur Mutlu at CMU, the use of Hardware Transactional Memory to provide runtime systems with support for fault tolerance was explored, leveraging the abort mechanism of TM for error recovery. Mobility associated to this collaboration was supported with cluster funding (Cost-Effective Fault Tolerance).
  • Exploring hardware acceleration for managing task dependencies at runtime. The groups at U. Delft and BSC collaborated on this topic. The mobility associated to this collaboration was supported with cluster funding (Collaboration funding Hardware Dependency Resolution). In addition, the activity in this cluster and in the multi-core architectures cluster was the incubator of the ENCORE project (FP7 STREP 2010-12). The project is investigating the appropriate hardware support for parallel programming and for a runtime environment that ensures scalability, performance, and cost-efficiency. It is also developing a runtime system to dynamically detect, manage, and exploit parallelism, data locality, and resources across several parallel architectures.
  • Implementing automatic data block prefetching techniques in dataflow programming models. ARM and BSC-UPC have collaborated in porting the OmpSs runtime to the ARM Cortex architectures and implementing data prefetching based on the data directionality clauses provided by the programmer and the information gathered at runtime. Mobility associated to this collaboration was supported with cluster funding (Implement automatic data block prefetch in OmpSs runtime for the ARMs Cortex-A9). This work was an initial step before the start of the Mont-Blanc EU project.
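Several of the activities above rely on the same dataflow idea: dependencies between tasks are derived at runtime from the data each task declares it reads and writes, rather than being enforced with global barriers. The following is a minimal Python sketch of that idea only. It is not the OmpSs or SMPSs API: the TaskGraph class, its add_task and run methods, and the inputs/outputs arguments are hypothetical names chosen for this illustration, and tasks run sequentially in a dependency-respecting order rather than in parallel.

```python
from collections import defaultdict, deque


class TaskGraph:
    """Toy dependency tracker (hypothetical, for illustration only).

    Each task declares the data names it reads (inputs) and writes
    (outputs), loosely mimicking data directionality annotations.
    """

    def __init__(self):
        self.tasks = []                    # (name, fn, inputs, outputs)
        self.last_writer = {}              # data name -> last task that wrote it
        self.readers = defaultdict(list)   # data name -> readers since last write
        self.deps = defaultdict(set)       # task index -> predecessor indices

    def add_task(self, name, fn, inputs=(), outputs=()):
        idx = len(self.tasks)
        self.tasks.append((name, fn, tuple(inputs), tuple(outputs)))
        for d in inputs:
            # Read-after-write: wait for the producer of d.
            if d in self.last_writer:
                self.deps[idx].add(self.last_writer[d])
        for d in outputs:
            # Write-after-write and write-after-read on d.
            if d in self.last_writer:
                self.deps[idx].add(self.last_writer[d])
            self.deps[idx].update(self.readers[d])
        # Record this task's accesses only after its dependencies are known.
        for d in inputs:
            self.readers[d].append(idx)
        for d in outputs:
            self.last_writer[d] = idx
            self.readers[d] = []
        return idx

    def run(self, store):
        """Run every task once all of its predecessors have finished."""
        n = len(self.tasks)
        pending = {i: len(self.deps[i]) for i in range(n)}
        successors = defaultdict(list)
        for i in range(n):
            for p in self.deps[i]:
                successors[p].append(i)
        ready = deque(i for i in range(n) if pending[i] == 0)
        order = []
        while ready:
            i = ready.popleft()
            name, fn, _, _ = self.tasks[i]
            fn(store)                      # a real runtime would offload this
            order.append(name)
            for s in successors[i]:
                pending[s] -= 1
                if pending[s] == 0:
                    ready.append(s)
        return order


def demo():
    g = TaskGraph()
    g.add_task("init_a", lambda s: s.update(a=2), outputs=["a"])
    g.add_task("init_b", lambda s: s.update(b=3), outputs=["b"])
    g.add_task("square_a", lambda s: s.update(a=s["a"] ** 2),
               inputs=["a"], outputs=["a"])
    g.add_task("sum", lambda s: s.update(c=s["a"] + s["b"]),
               inputs=["a", "b"], outputs=["c"])
    store = {}
    order = g.run(store)
    return g, store, order
```

In this sketch, "init_a" and "init_b" have no dependencies and could run concurrently, while "sum" only waits for the two tasks that last wrote its inputs; no barrier separates the phases. The same access information is what a runtime could, in principle, use to decide which data blocks to prefetch before a task becomes ready.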
3.3 Tools to support research activities
  • Building a new thermal, power and performance simulation and modeling environment. This new environment aims to cover both the microscopic (high-performance microarchitecture) and macroscopic (processing rack and complete network of processing nodes) levels during the execution of typical high-performance computing applications. Thus, this environment enables exploration of new programming models for supercomputers, low-power architectures, compiler tool chains, execution environments, operating systems, and applications. Three research groups were involved: Barcelona Supercomputing Center (Spain), EPFL (Switzerland) and Complutense University of Madrid (Spain). The mobility associated to this activity was supported with cluster funding (Performance, Power and Thermal Simulation Frameworks for High-Performance Computing Systems).

4. Other activities

5. Information about cluster meetings

Coordinating partner: BSC