Mechanisms and techniques for scheduling in supercomputers
- José Antonio Lozano Alonso (Advisor)
- José Miguel Alonso (Advisor)
Defending university: Universidad del País Vasco - Euskal Herriko Unibertsitatea
Defense date: June 28, 2013
- Clemente Rodríguez Lafuente (Committee chair)
- Alexander Mendiburu Alberro (Committee secretary)
- Javier Navaridas Palma (Committee member)
- José Ángel Gregorio Monasterio (Committee member)
- Francisco Fernández de Vega (Committee member)
Type: Thesis
Abstract
This thesis analyzes the performance of the scheduling process in space-shared, large-scale supercomputers. These systems are specifically designed to run fine-grained parallel applications in which the communication-to-computation ratio is high. How the interconnection network is used has a significant bearing on application performance and, therefore, on overall system performance. The scheduling process can be divided into three stages, each driven by a set of policies or strategies. Assuming that users submit parallel jobs to a single scheduling queue, (1) a job is selected to run, (2) the resources (set of nodes) required by the job are located in the system and reserved for it, and (3) the job's tasks are mapped onto the selected nodes. This dissertation studies ways of improving the performance of the scheduling process, focusing on stages 2 (partitioning) and 3 (mapping). In particular, we use contiguous partitioning as the strategy for assigning partitions to jobs. Contiguous partitioning strategies have a well-known disadvantage: high fragmentation, which results in low levels of system utilization. However, they provide jobs with a running environment that, owing to the locality of communications and the lack of interference with other running jobs, substantially reduces running times. To exploit these advantages effectively, an appropriate task-to-node mapping has to be implemented. Through extensive simulation-based experimentation, it is demonstrated that combinations of contiguous partitioning and application-aware mappings achieve excellent job throughput compared to non-contiguous partitioning alternatives. Although the main topic of our work is contiguous partitioning, we have also explored some aspects of non-contiguous partitioning strategies, due to their common use in production environments.
Regarding system topology, we mainly focus on cube-shaped topologies such as meshes and tori, but we also study alternative partitioning strategies for tree topologies.
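As an illustration of stages 2 and 3 and of the fragmentation problem mentioned above, the following is a minimal sketch, not the method used in the thesis. It assumes a 2D mesh stored as a grid of busy flags, a simple first-fit search for a free rectangular submesh, and a row-major task-to-node mapping; the names `find_submesh` and `allocate` are hypothetical.

```python
from typing import List, Optional, Tuple

Node = Tuple[int, int]


def find_submesh(busy: List[List[bool]], w: int, h: int) -> Optional[Node]:
    """First-fit search: return the (row, col) origin of the first free
    h-by-w rectangular submesh, or None if no contiguous block exists."""
    rows, cols = len(busy), len(busy[0])
    for r in range(rows - h + 1):
        for c in range(cols - w + 1):
            if all(not busy[r + i][c + j] for i in range(h) for j in range(w)):
                return (r, c)
    return None


def allocate(busy: List[List[bool]], w: int, h: int) -> Optional[List[Node]]:
    """Stage 2 (partitioning): reserve a contiguous submesh.
    Stage 3 (mapping): task k runs on the k-th node of the returned list,
    i.e. a plain row-major mapping (an assumption, not application-aware)."""
    origin = find_submesh(busy, w, h)
    if origin is None:
        return None
    r, c = origin
    nodes = []
    for i in range(h):
        for j in range(w):
            busy[r + i][c + j] = True
            nodes.append((r + i, c + j))
    return nodes


# A 4x4 mesh: a 2x2 job fits, but a following 3x3 job does not,
# even though 12 nodes are still free (external fragmentation).
mesh = [[False] * 4 for _ in range(4)]
first = allocate(mesh, 2, 2)
second = allocate(mesh, 3, 3)
free_nodes = sum(not cell for row in mesh for cell in row)
```

In this sketch the second request fails because every possible 3x3 placement in a 4x4 mesh overlaps the corner already taken by the first job, which is the kind of utilization loss that contiguous strategies trade for communication locality.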