Idbool
Overview
Idbool is a 192-core CC-NUMA system built with the Numascale NumaConnect interconnect.
Technically, the machine is composed of 4 chassis/motherboards, each equipped with 3 AMD Opteron(tm) 6376 processors (Abu Dhabi, 16 cores each) and connected to the other nodes through the NumaConnect interconnect in a torus configuration with double links.
This NumaConnect interconnect provides a full hardware Single System Image (SSI), with a single cache-coherent memory space. As a result, the system appears as a single Linux system.
Currently the system runs Ubuntu 14.04 LTS.
Technical documentation and other resources
- http://www.numascale.com/numa_support.html
- https://resources.numascale.com/
- https://wiki.numascale.com
For questions related to the performance achievable on this machine, please see the Performance section below.
Installation notes
Instructions to install the system with Ubuntu 14.04 LTS:
- Alter /etc/sysctl.conf and kernel parameters (taken from the Numascale wiki: https://wiki.numascale.com/tips/os-tips); a sketch follows this list
- Apply the patch from http://askubuntu.com/questions/468466/why-this-occurs-error-diskfilter-writes-are-not-supported, needed because of the software RAID
- Remove irqbalance, as suggested by Numascale
- Disable SELinux and AppArmor in /etc/default/grub, then run update-grub; also disable the AppArmor startup script (see the GRUB sketch after this list)
- Blacklist the edac drivers, because they caused an error during boot, visible in dmesg (/etc/modprobe.d/blacklist.conf)
  - Not recommended by Numascale, so the steps above were reverted; the traces can be considered warnings
  - They are due to scalability issues in the kernel, which should be fixed by the Numascale-provided kernel
- Install the linux-image-3.15.10-numascale17+_3.15.10-numascale17+-2_amd64.deb package (see the install sketch after this list)
  - Works perfectly and scales quite well, but swap support is not in this kernel, so no swap space is usable. Swapping on a NUMA system of this kind does not make sense anyway, as it slows things down even more than on a normal system
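As a sketch of the first step: the authoritative keys and values are those on the Numascale os-tips page linked above; the two keys shown here (kernel.numa_balancing, vm.zone_reclaim_mode) are plausible NUMA-related settings given for illustration only, not the actual idbool configuration.

 # Illustrative only: take the real keys/values from the Numascale os-tips page.
 cat <<'EOF' | sudo tee -a /etc/sysctl.conf
 kernel.numa_balancing = 0
 vm.zone_reclaim_mode = 0
 EOF
 sudo sysctl -p   # apply the new settings without rebooting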
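For the SELinux/AppArmor step, this is roughly what the change looks like on Ubuntu 14.04; the exact kernel command-line options used on idbool are an assumption:

 # In /etc/default/grub, add the options to the kernel command line, e.g.:
 #   GRUB_CMDLINE_LINUX_DEFAULT="... apparmor=0 selinux=0"
 sudo update-grub                   # regenerate grub.cfg
 sudo update-rc.d apparmor disable  # disable the AppArmor startup script (sysvinit on 14.04)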
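Finally, installing the Numascale-patched kernel package is a plain dpkg operation (sketch):

 sudo dpkg -i linux-image-3.15.10-numascale17+_3.15.10-numascale17+-2_amd64.deb
 sudo update-grub   # make sure the new kernel appears in the boot menu
 sudo reboot        # boot into the Numascale kernel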
How to experiment
Reserving and accessing idbool
- By default, OAR only gives access to 1 of the 4 hosts (motherboards) of the machine:
 [pneyron@digitalis ~]$ oarsub -I -p "machine='idbool'"
 Properties: machine='idbool'
 [ADMISSION RULE] Modify resource description with type constraints
 Import job key from file: /home/pneyron/.ssh/id_rsa
 OAR_JOB_ID=8348
 Interactive mode : waiting...
 Starting...
 Connect to OAR job 8348 via the node idbool.grenoble.grid5000.fr
 [OAR] OAR_JOB_ID=8348
 [OAR] Your nodes are:
       idbool-1.grenoble.grid5000.fr*48
 [pneyron@idbool ~](8348-->60mn)$
Then see:
 [pneyron@idbool ~](8348-->57mn)$ cat /dev/cpuset/$(grep -o "/oar/.*" /proc/self/cgroup)/cpus
 0-47
 [pneyron@idbool ~](8348-->57mn)$ cat /dev/cpuset/$(grep -o "/oar/.*" /proc/self/cgroup)/mems
 0-5
This job only gives access to the resources of the first host (motherboard) of the machine: logical CPUs (cores) 0 to 47 and NUMA nodes 0 to 5. Other resources of the machine can be seen (e.g. in `top') but are not reachable, because they are isolated by the Linux cpuset of your job.
- To reserve the complete machine, one must specify `-l machine=1'.
Furthermore, we request a 4-hour job in the example below:
 [pneyron@digitalis ~]$ oarsub -I -p "machine='idbool'" -l machine=1,walltime=4
 Properties: machine='idbool'
 [ADMISSION RULE] Modify resource description with type constraints
 Import job key from file: /home/pneyron/.ssh/id_rsa
 OAR_JOB_ID=8349
 Interactive mode : waiting...
 Starting...
 Connect to OAR job 8349 via the node idbool.grenoble.grid5000.fr
 [OAR] OAR_JOB_ID=8349
 [OAR] Your nodes are:
       idbool-1.grenoble.grid5000.fr*48
       idbool-2.grenoble.grid5000.fr*48
       idbool-3.grenoble.grid5000.fr*48
       idbool-4.grenoble.grid5000.fr*48
 [pneyron@idbool ~](8349-->59mn)$
Privileged commands
Currently, the following commands can be run via sudo in exclusive jobs (a short usage example follows the list):
- sudo /usr/bin/whoami (provided for testing the mechanism, should return "root")
- sudo /usr/bin/schedtool
- sudo /usr/bin/opcontrol
- sudo /usr/bin/perf
- sudo /usr/bin/lstopo
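For example, within an exclusive job (a sketch; the perf and lstopo options shown are standard ones, not a documented idbool recipe):

 sudo /usr/bin/whoami                    # should print "root"
 sudo /usr/bin/lstopo topo.png           # dump the machine topology to an image
 sudo /usr/bin/perf stat -a -- sleep 10  # system-wide hardware counters for 10 seconds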
Performance
In order to get good performance when using the whole machine (see the `machine=1' case above), special care must be taken with regard to the placement of data in memory relative to the CPUs. Indeed, the NUMA factor between NUMA nodes located on different motherboards is very high: the bandwidth can be as low as 90 MB/s when a CPU accesses the memory of a remote NUMA node. Numascale strongly advises reading https://resources.numascale.com/numascale-scaling-best-practice.pdf.
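In practice, this means pinning computation and its data to the same NUMA nodes, for instance with numactl (a sketch; `./my_app' is a placeholder for your own binary):

 numactl --hardware                            # show NUMA nodes, their memory and distances
 numactl --cpunodebind=0 --membind=0 ./my_app  # run on node 0's cores with node 0's memory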