Multi-core Scheduler


Experimental Setup: We developed several JS-based real applications and benchmarks and ran them on two types of multi-core parallel computers, shared-memory machines (m01-02) and a heterogeneous cluster (HC, an aggregation of the m01-02 and k01-03 machines):

Experimental Setup

The experiments on the m01-02 machines comprise up to three executions: an optimised execution by the JS scheduler (using the PF lists), a default Linux-scheduled execution (LSO), and a JS scheduler-based execution with the factors of the PF lists shuffled (PRM), in which the co-scheduling factor is applied before the latency factor. The experiments on the HC cluster comprise up to four executions: an optimised execution by the JS scheduler (using the PF lists); an unoptimised Linux-scheduled execution (LSO) with the machine access order m01-02, then k01-03; a Linux-scheduled execution with the reverse machine access order k01-03, then m01-02 (LSOR); and a JS-scheduled execution with shuffled factors in the PF list (PRM). In the PRM execution on the HC cluster, the latency factor is swapped with the co-scheduling factor for compute-intensive applications, and the machine load factor is swapped with the processor speed factor for communication-intensive applications.
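
To make the difference between the optimised and the PRM executions concrete, the following sketch ranks candidate machines by an ordered list of factors and then repeats the ranking with two factors swapped. The factor names, the machine values, and the lexicographic ranking rule are illustrative assumptions; how the JS scheduler actually evaluates its PF lists is not specified here.

```python
# Hypothetical illustration: rank machines by an ordered list of factors.
# Factor names, values, and the lexicographic rule are assumptions,
# not the JS scheduler's actual PF-list semantics.

def rank(machines, factors):
    """Sort machine names so that earlier factors dominate later ones.

    Every factor in this sketch is treated as 'lower is better'.
    """
    return sorted(machines, key=lambda name: tuple(machines[name][f] for f in factors))


def swap(factors, a, b):
    """Return a copy of the factor list with factors a and b exchanged."""
    out = list(factors)
    i, j = out.index(a), out.index(b)
    out[i], out[j] = out[j], out[i]
    return out


# Made-up measurements for three machines (lower is better for each factor).
machines = {
    "m01": {"latency": 1, "co_scheduling": 3, "load": 2},
    "m02": {"latency": 2, "co_scheduling": 1, "load": 1},
    "k01": {"latency": 3, "co_scheduling": 2, "load": 3},
}

pf_list = ["latency", "co_scheduling", "load"]
shuffled = swap(pf_list, "latency", "co_scheduling")

print(rank(machines, pf_list))   # latency considered first
print(rank(machines, shuffled))  # co-scheduling considered first
```

With these made-up values the two orderings pick different machines first, which is the kind of effect the PRM executions are meant to expose.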


Selected Experimental Results


Variance Co-Variance Matrix computation (VCMatrix): The VCMatrix application computes the variances and co-variances of a matrix. The diagonal values of the resultant matrix are the variances and the off-diagonal values are the co-variances. Figure 1 shows the results of the experiment (matrix size: 2200 × 2200) on the HC cluster. The JS scheduled execution outperformed the other three executions (PRM, LSO, and LSOR) for most of the machine sizes, achieving up to 24.18% better speedup than the LSO and LSOR executions and up to 39.96% better speedup than the PRM execution. The JS scheduler optimises the performance of the application on the HC cluster by utilising the fastest available resources (e.g., cores, processors, and machines), reducing data contention on m01-02 (by balancing the machine load), and reducing memory latencies (on the m01-02 NUMA machines).

Figure 1: VC Matrix - HC Cluster
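
As a minimal sequential sketch of what a variance co-variance computation produces, the function below builds the matrix for a small data set. It assumes that each row is one observation and each column one variable, and that the sample (n - 1) normalisation is used; neither detail is stated for the VCMatrix application, and the function name is hypothetical.

```python
def vc_matrix(rows):
    """Variance-covariance matrix of the columns of `rows`.

    Assumes each row is one observation and each column one variable,
    and uses the sample (n - 1) normalisation; both are assumptions.
    """
    n = len(rows)
    k = len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(k)]
    result = [[0.0] * k for _ in range(k)]
    for i in range(k):
        for j in range(i, k):
            s = sum((r[i] - means[i]) * (r[j] - means[j]) for r in rows)
            # The matrix is symmetric, so fill both halves at once.
            result[i][j] = result[j][i] = s / (n - 1)
    return result


data = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]
vc = vc_matrix(data)
print(vc)  # diagonal: variances, off-diagonal: co-variances
```

Each entry of the result depends only on two columns of the input, so the entries can be computed independently of one another.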


Matrix Transposition with Floating Point Operations (MatrixFPO): The MatrixFPO application transposes a matrix and performs several floating-point operations (e.g., addition, multiplication, and division) at each transpose step. Figure 2(a) shows the results of the experiment (matrix size: 10000 × 10000) on m01. The JS scheduled execution achieves up to 31.09% better speedup than the LSO execution. Figure 2(b) shows the results on the HC cluster. The JS scheduled execution achieves better speedups for most of the machine sizes: up to 48.75% better than the LSO, 35.47% better than the LSOR, and 44.05% better than the PRM execution. The JS scheduled execution performs better than the other executions because of the optimisations applied by the JS scheduler (using the PF list for the target application class and architecture).


Figure 2: MatrixFPO - experimental results.
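
The kind of work MatrixFPO performs can be sketched sequentially as follows. The particular arithmetic applied to each element is a placeholder, since the operations used by the actual benchmark are not given, and the function name is hypothetical.

```python
def transpose_fpo(matrix):
    """Transpose `matrix`, doing a few floating-point operations per element.

    The specific arithmetic (add, multiply, divide by constants) is a
    placeholder; the original benchmark's operations are not given.
    """
    rows, cols = len(matrix), len(matrix[0])
    out = [[0.0] * rows for _ in range(cols)]
    for i in range(rows):
        for j in range(cols):
            x = matrix[i][j]
            x = (x + 1.0) * 2.0 / 4.0  # addition, multiplication, division
            out[j][i] = x              # element (i, j) lands at (j, i)
    return out


m = [[1.0, 3.0, 5.0], [7.0, 9.0, 11.0]]
t = transpose_fpo(m)
print(t)
```

Reading the input by rows while writing the output by columns gives one of the two traversals a strided memory access pattern, so workloads of this shape tend to be sensitive to memory latency.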



NAS benchmarks EP kernel: The NAS EP kernel is used to measure the computational performance of parallel computers. Figure 3 shows the results of the experiment (data size: 16777216 × 100) on m01. We compared the JS scheduled, LSO, PRM, and Proactive-based executions. The JS scheduled execution achieves better speedups than all the other executions: up to 16.82% better than the LSO, up to 12.13% better than the Proactive, and up to 30.32% better than the PRM execution. Proactive performs worse than the JS scheduled execution because it cannot map active objects to specific cores. Although the EP kernel has a small memory footprint, the remotely scheduled Proactive objects still cause some performance degradation compared to the JS scheduled execution.


Figure 3: NAS benchmarks EP kernel - experimental results.
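
For readers unfamiliar with the EP ("embarrassingly parallel") kernel, the sketch below shows the general shape of its computation: it generates pairs of Gaussian deviates with the polar method and counts them in square annuli. This is a simplified illustration, not the benchmark itself. It uses Python's `random` module instead of the benchmark's own random number generator, and the function name and parameters are assumptions.

```python
import math
import random


def ep_sketch(num_pairs, seed=0, num_bins=10):
    """Simplified EP-style kernel: Gaussian pairs via the polar method.

    Uses Python's `random` module rather than the benchmark's own
    generator, so results are not comparable with NAS reference values.
    """
    rng = random.Random(seed)
    counts = [0] * num_bins
    sum_x = sum_y = 0.0
    for _ in range(num_pairs):
        x = 2.0 * rng.random() - 1.0
        y = 2.0 * rng.random() - 1.0
        t = x * x + y * y
        if t > 1.0 or t == 0.0:
            continue  # reject points outside the unit circle
        factor = math.sqrt(-2.0 * math.log(t) / t)
        gx, gy = x * factor, y * factor
        sum_x += gx
        sum_y += gy
        # Tally the pair by the square annulus it falls into.
        bin_index = int(max(abs(gx), abs(gy)))
        if bin_index < num_bins:
            counts[bin_index] += 1
    return sum_x, sum_y, counts


sums_and_counts = ep_sketch(1000, seed=1)
print(sums_and_counts[2])
```

Because each pair is generated independently, the loop can be split among workers with only the final sums and counts to combine, which makes the kernel a measure of raw compute performance rather than of communication.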