The experiments conducted on the m01-02 machines comprise up to three
executions: a JS-scheduled execution optimised using the PF lists, a
default Linux-scheduled execution without optimisations (LSO), and a
JS-scheduled execution optimised using shuffled PF-list factors (PRM),
in which the co-scheduling factor is applied before the latency factor.
The experiments conducted on the HC cluster comprise up to four
executions: a JS-scheduled execution optimised using the PF lists; a
Linux-scheduled unoptimised execution (LSO) with the machine access
order m01-02, then k01-03; a Linux-scheduled unoptimised execution with
the reverse machine access order, k01-03, then m01-02 (LSOR); and a
JS-scheduled execution with shuffled factors in the PF list (PRM). In
the PRM execution, the latency factor is swapped with the co-scheduling
factor for compute-intensive applications, and the machine load factor
is swapped with the processor speed factor for communication-intensive
applications.
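
The factor shuffling used for the PRM executions on the HC cluster can
be illustrated with a minimal sketch. The factor names are taken from
the text, but the initial ordering of PF_LIST and the helper names
swap_factors and prm_list are assumptions made only for illustration;
they are not the JS scheduler's actual data structures.

```python
# Hypothetical PF list. The factor names come from the text; this ordering
# is an assumption chosen only to make the example concrete.
PF_LIST = ["processor_speed", "machine_load", "latency", "co_scheduling"]


def swap_factors(pf_list, first, second):
    """Return a copy of pf_list with the positions of two factors exchanged."""
    shuffled = list(pf_list)
    i, j = shuffled.index(first), shuffled.index(second)
    shuffled[i], shuffled[j] = shuffled[j], shuffled[i]
    return shuffled


def prm_list(pf_list, app_class):
    """Build the shuffled (PRM) factor list for a given application class."""
    if app_class == "compute-intensive":
        # Latency is swapped with co-scheduling.
        return swap_factors(pf_list, "latency", "co_scheduling")
    if app_class == "communication-intensive":
        # Machine load is swapped with processor speed.
        return swap_factors(pf_list, "machine_load", "processor_speed")
    raise ValueError(f"unknown application class: {app_class}")
```

Under the assumed ordering, prm_list(PF_LIST, "compute-intensive")
places the co-scheduling factor before the latency factor and leaves the
original list unchanged.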
Variance Co-Variance Matrix computation (VCMatrix):
The VCMatrix application computes the variances and covariances of a
matrix. The diagonal values of the resulting matrix are the variances
and the off-diagonal values are the covariances. Figure 1 shows the
results of the experiment (matrix size: 2200 × 2200) on the HC cluster.
The JS-scheduled execution outperforms the other three executions (PRM,
LSO, and LSOR) for most of the machine sizes, achieving up to 24.18%
better speedup than LSO and LSOR, and up to 39.96% better speedup than
PRM. The JS scheduler optimises the performance of the application on
the HC cluster by utilising the fastest available resources (e.g.,
cores, processors, and machines), reducing data contention on m01-02
(by balancing the machine load), and reducing memory latencies on the
m01-02 NUMA machines.
Figure 1: VCMatrix - experimental results on the HC cluster.
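
The computation performed by VCMatrix can be sketched sequentially as
follows. This is a minimal illustration, not the benchmarked
implementation: it assumes that rows are observations and columns are
variables, and that the sample (n - 1) normalisation is used, neither of
which is stated above. The function name covariance_matrix is
hypothetical.

```python
def covariance_matrix(data):
    """Variance-covariance matrix of data (rows: observations, columns: variables).

    Assumption: sample covariance, i.e. division by (n - 1).
    """
    n = len(data)
    m = len(data[0])
    if n < 2:
        raise ValueError("need at least two observations")
    # Column means.
    means = [sum(row[j] for row in data) / n for j in range(m)]
    cov = [[0.0] * m for _ in range(m)]
    for a in range(m):
        # The matrix is symmetric, so only the upper triangle is computed.
        for b in range(a, m):
            s = sum((row[a] - means[a]) * (row[b] - means[b]) for row in data)
            cov[a][b] = cov[b][a] = s / (n - 1)
    return cov


sample = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]
cov = covariance_matrix(sample)
# cov[0][0] and cov[1][1] are the variances of the two columns;
# cov[0][1] == cov[1][0] is their covariance.
```

Each entry of the result depends only on two columns of the input, so
the entries can be computed independently of one another.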
Matrix Transposition with Floating Point Operations (MatrixFPO):
The MatrixFPO application transposes a matrix and performs several
floating-point operations (e.g., addition, multiplication, and
division) at each transpose step. Figure 2(a) shows the results of the
experiment (matrix size: 10000 × 10000) on m01. The JS-scheduled
execution achieves up to 31.09% better speedup than LSO. Figure 2(b)
shows the results on the HC cluster. The JS-scheduled execution achieves
better speedups for most of the machine sizes: up to 48.75% better than
LSO, up to 35.47% better than LSOR, and up to 44.05% better than PRM.
The JS-scheduled execution outperforms the other executions because of
the optimisations that the JS scheduler applies using the PF list for
the target application class and architecture.
Figure 2: MatrixFPO - experimental results.
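
The structure of the MatrixFPO workload can be sketched as follows. The
text names only the kinds of operations (addition, multiplication, and
division), so the specific arithmetic below is a placeholder assumption,
and the function name transpose_with_fpo is hypothetical.

```python
def transpose_with_fpo(matrix):
    """Transpose matrix, applying floating-point work to each element moved."""
    rows, cols = len(matrix), len(matrix[0])
    result = [[0.0] * rows for _ in range(cols)]
    for i in range(rows):
        for j in range(cols):
            value = matrix[i][j]
            # Placeholder arithmetic (an assumption): one addition, one
            # multiplication, and one division per transpose step.
            value = (value + 1.0) * 2.0 / 4.0
            # Element (i, j) of the input lands at (j, i) of the output.
            result[j][i] = value
    return result


out = transpose_with_fpo([[1.0, 3.0, 5.0]])
# A 1 x 3 input yields a 3 x 1 output.
```

Reading the input row by row while writing the output column by column
gives the non-contiguous memory access pattern typical of a
transposition.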
NAS benchmarks EP kernel:
The NAS EP kernel is used to measure the computational performance of
parallel computers. Figure 3 shows the results of the experiment (data
size: 16777216 × 100) on m01. We compared the JS-scheduled, LSO, PRM,
and Proactive-based executions. The JS-scheduled execution achieves
better speedups than all the other executions: up to 16.82% better than
LSO, up to 12.13% better than Proactive, and up to 30.32% better than
PRM. Proactive performs worse than the JS-scheduled execution because it
cannot map active objects to specific cores. Although the EP kernel has
a small memory footprint, the remotely scheduled Proactive objects still
cause some performance degradation compared to the JS-scheduled
execution.
Figure 3: NAS benchmarks EP kernel - experimental results.
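
For readers unfamiliar with the EP kernel, its core loop can be sketched
as follows: pairs of uniform deviates are converted to Gaussian deviates
by an acceptance-rejection (polar) scheme, and the accepted pairs are
tallied in ten square annuli. This is a simplified illustration and not
the benchmarked code: it uses Python's random module in place of the
linear congruential generator prescribed by the NAS specification, and
the name ep_sketch is hypothetical.

```python
import math
import random


def ep_sketch(num_pairs, seed=0):
    """Tally Gaussian pairs in square annuli, in the style of the EP kernel."""
    rng = random.Random(seed)  # assumption: stand-in for the NAS generator
    counts = [0] * 10
    sum_x = sum_y = 0.0
    for _ in range(num_pairs):
        # Uniform point in the square (-1, 1) x (-1, 1).
        x = 2.0 * rng.random() - 1.0
        y = 2.0 * rng.random() - 1.0
        t = x * x + y * y
        if t > 1.0 or t == 0.0:
            continue  # reject points outside the unit circle
        factor = math.sqrt(-2.0 * math.log(t) / t)
        gx, gy = x * factor, y * factor
        # Annulus index l such that l <= max(|gx|, |gy|) < l + 1.
        annulus = int(max(abs(gx), abs(gy)))
        if annulus < len(counts):
            counts[annulus] += 1
        sum_x += gx
        sum_y += gy
    return counts, sum_x, sum_y
```

Each pair is processed independently, so the work can be divided among
workers with communication needed only to combine the final counts and
sums.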