yitpinfo




Japanese version
Yukawa-21: How to use the Super Computer Yukawa-21

大規模計算サーバ Yukawa-21 及び フロントエンドの利用方法

        1. 概要
        2. ログイン方法
        3. ディスク領域
        4. モジュールファイルによる環境設定
        5. ジョブ投入方法
           5.1 ジョブクラス
           5.2 ジョブスクリプト
        6. コンパイラ環境
        7. ライブラリ
        8. CUDAについて
        9. Pythonについて
       10. Julia
       11. 開発用ツール
       12. Intel OneAPI
       13. Yukawa-21 へのファイル転送方法
       14. その他



1. 概要

　大規模計算サーバは、1ノード112コア/1.5TiBの共有メモリーを有するノード135台、及びGPU2基搭載した
　ノード2台からなるDell Technologies社製クラスタ計算機です。
　1CPUあたり最大2.4TFLOPSの倍精度浮動小数点演算性能を有し、1ノード当たりの最大性能は9.6TFLOPSです。
　システム全体の浮動小数点演算性能は最大1.3PFLOPS, 総メモリ容量は205TiB です。
　Yukawa-21のbinary dataはlittle endianです。


2. ログイン方法

　基研ログインノード経由でフロントエンド(front)へログインして利用します。
　ログインノードへのログイン方法は、
　
　http://www.yukawa.kyoto-u.ac.jp/computer1/sshman/ssh-keylogin-man.html
　
　を参照してください。
　
　ログインノードから
　
	$ ssh -l (ユーザ名) front
　
　にてフロントエンドにログインできます。
　フロントエンドは2台あり、front1 or front2に自動的に分散します。


3. ディスク領域

　ログインノードとフロントエンド上のディスクは共有されています。
　フロントエンドにログインすると自動的に/sc/home/(ユーザ名)ディレクトリに移動します。
　
	/home/(所属グループ名)/(ユーザ名) ホームディレクトリ(NFS)
	/sc/home/(ユーザ名)     計算用ディレクトリ(Lustre)
　
　注: 計算ノードで利用するファイルは、アクセスが高速な/sc領域に置いてください。
　
　ファイル使用量の調査には、 lfs quota コマンド(単位: kB)を利用します。
　quotaはデフォルトで1TBです。　（不足の場合は基研計算室までご相談ください。）
　
　実行例:
　$ lfs quota /sc
　Disk quotas for user USERNAME (uid XXXX):
		Filesystem  kbytes    quota        limit    grace   files   quota   limit   grace
        	   /sc      24  1073741824  1073741824      -       7       0       0       -
　Disk quotas for group GROUP (gid YYY):
		Filesystem  kbytes    quota        limit    grace   files   quota   limit   grace
        	   /sc 255593212      0            0        -  460112       0       0       -
　


4. モジュールファイルによる環境設定

  Yukawa-21 には複数のコンパイラ、複数のライブラリが用意されており、また、各々のコンパイラ、
　ライブラリには、複数のバージョンが用意されています。
　
  コンパイラ、ライブラリを利用する際の環境設定には、Environment Modules ソフトウェアを利用します。
　Modules では、環境設定を定義したモジュールファイルと呼ばれるファイルが多数用意されており、
　必要なモジュールファイルをロードすることにより、柔軟に環境を設定することが可能です。

4-1 主なコマンド

　(1)利用するモジュールファイルのロード
	module load <モジュールファイル名>[/<バージョン>]
　
　(2)利用するモジュールファイルのアンロード
	module unload <モジュールファイル名>[/<バージョン>]
　
　(3)モジュールの切り替え
	module swap <ロード済みモジュールファイル> <ロードしたいモジュールファイル>
　
　(4)現在ロードされている module の確認方法
	module list
　
　(5)使用可能な module の確認方法
	module avail [モジュール名]
　
　注: モジュール名が省略された場合、すべての利用可能なモジュールファイルが表示されます。
　
　詳しくは、 man module を参照して下さい。

4-2 利用例

　(1)コンパイラ環境の切り替え
	module swap <ロード済み module> <切り替え先 module>
　
	例: Intel環境からNVIDIA HPC環境への切り替え
		module swap intel nvhpc
　
　(2)バージョン切り替え:
	module swap <モジュール名> <モジュール名>/<ロードしたいバージョン>
　
	例: intel デフォルトバージョン から intel 2019.5.281 への切り替え
		module swap intel intel/2019.5.281
　
	注: バージョン番号を指定しない場合は、デフォルトバージョンがロードされます。
　
　(3)アプリケーションの利用
	module load <利用したいアプリケーション>
　
	例: arm Forge の利用
		module load forge

4-3 初期状態

　Intel環境がロードされており、Intelコンパイラ, intel MPI, intel MKLなどの計算が可能です。


5. ジョブの投入方法

　Yukawa-21 では、ワークロードマネージャとして SLURM を利用しています。
　ログインノードからジョブ投入コマンド sbatch にてジョブを投入すると、計算ノード上でジョブが実行されます。
　プログラムの実行の際は、srun というコマンドを利用します。

5-1 ジョブクラス

　投入可能なジョブクラスは5種あり、実行ジョブの条件に合わせて選択してください。
　
　 Name    ノード数　  コア数上限   使用上限   経過時間  メモリ 　　　概要　
　-----------------------------------------------------------------------------------
　 L       32-32(排他)    ----     32node/user    1日     ----　　　　大規模ジョブ用
　 M       1 -24(排他)    ----     24node/user    1日     ----　　　　中規模ジョブ用
　 S     　    1(共有)　  8/job    10core/user    7日    13.7GB/core　小規模ジョブ用
　 DEBUG 　    1(排他)　  ----      1node/user   15分     ----      　デバッグ用
　 GPU   　    1(排他)　  ----      1node/user    1日     ----      　GPUジョブ用
　-----------------------------------------------------------------------------------
　注意）L/M/S/GPUについては同時利用可能なリソース量(ノード数/コア数)を制限しています
        同時実行可能ジョブ数の目安は
　　　　L: 合計32node (実質1ジョブ 同時に投入可能なジョブ数が10job/userまで ＊RUN+PENDの数が10までです。)
　　　　M: 合計24node (1nodeジョブであれば24個、24nodeジョブであれば1個)
　　　　S: 合計8core (1coreジョブであれば8個、8coreジョブであれば1個)
　　　　GPU: 合計1node (実質1ジョブ)

        DEBUGクラスについて
　　　　・優先度は高いですが、15分までです。
　　　　・同時実行ジョブは1job/userまで
　　　　　　＊1つのジョブが実行中の場合、2nodeに満たない場合でも実行できません
　　　　・同時に投入可能なジョブ数も1job/userまで
　　　　　　＊RUN+PENDの数が1までです。
　　　　　　　つまり、ジョブを事前に複数積んでおくことはできません。

5-2 ジョブスクリプト

　ジョブの投入は、ジョブスクリプトを作成し、sbatch コマンドを利用して行ないます。
　ジョブスクリプトは、 sbatch コマンドに渡すパラメータ設定部とシェルスクリプト本体部で記述されます。
　
　ジョブスクリプト例１  シングルジョブ(Partition S)
        #!/bin/bash
        #--- パラメータ設定部
        #SBATCH -J SINGLE_job
        #SBATCH -N 1
        #SBATCH -n 1
        #SBATCH -c 1
        #SBATCH -p S
		
        #--- ジョブスクリプト本体
        srun ./a.out
		
　ジョブスクリプト例２ OpenMPジョブ(class S, 4 SMP threads)
　(4SMP並列)実行の場合
        #!/bin/bash
        #--- パラメータ設定部
        #SBATCH -J SMP_job
        #SBATCH -N 1
        #SBATCH -n 1
        #SBATCH -c 4
        #SBATCH -p S
		
        #--- ジョブスクリプト本体
        export OMP_NUM_THREADS=4
        srun ./a.out
		
　ジョブスクリプトファイル例３ MPI並列ジョブ(class M, 2 nodes, 224 MPI processes)
　2ノードを使用して224MPI並列（1 ノードあたり 112 プロセス実行) の場合
        #!/bin/bash
        #--- パラメータ設定部
        #SBATCH -J MPI_job
        #SBATCH -N 2
        #SBATCH -n 224
        #SBATCH -c 1
        #SBATCH -p M
		
        #--- ジョブスクリプト本体
        srun ./a.out
		
　ジョブスクリプトファイル例４ MPI並列+OpenMPジョブ(class M, 4 nodes, 32 MPI processes, 14 SMP threads)
　4ノードを使用して32MPI並列, 1プロセスあたり14SMP並列実行の場合
        #!/bin/bash
        #--- パラメータ設定部
        #SBATCH -J HYBRID_job
        #SBATCH -N 4
        #SBATCH -n 32
        #SBATCH -c 14
        #SBATCH -p M
		
        #--- ジョブスクリプト本体
        export OMP_NUM_THREADS=14
        srun ./a.out
		
　スクリプトの中で"#SBATCH"で始まる行で、 sbatch コマンドにわたすパラメータを設定します。
　
　主なキーワード
        -p : 実行ジョブクラスを指定 (default: S)
        -o : スクリプトの標準出力ファイルを指定 (default: slurm-%j.out, %j はジョブ ID)
        -e : スクリプトの標準エラー出力ファイルを指定 (default: 標準出力と同じ)
        -J : ジョブ名を指定 (default: スクリプトファイル名)
        -n : プロセス数を指定 (default: 1)
        -c : プロセスあたりの利用コア数を指定 (default: 1)
        -N : 利用するノード数を指定 (default: -n, -c オプションより計算)
        -t : 経過時間制限 (default: キューで指定された経過時間制限)
        --mail-type= : メール送信するジョブ状態。詳細はman参照 (default: 送信しない)
        --mail-user= : メールアドレス。--mail-typeを指定した場合は必ず指定。
	-G 2: GPUのDEBUGを行いたいときに指定。（-p DEBUG or -p GPU の時に有効）
　注: 
	(1) 初期ディレクトリはジョブを投入したディレクトリです。
	(2) MPI を利用する、しないにかかわらず、 srun を利用して下さい。
	(3) (プロセス数)×(プロセスあたり利用コア数) ≦ (ノード数)×112 となるように指定してください。
	(4) -cオプションで指定する値は、使用する利用コア数と同じ値を指定してください。
		例: OMP_NUM_THREADS=8の時は"-c 8"を指定
　
　ジョブ投入方法（sbatch)
	$ sbatch go.sh
　
　ジョブの状態を参照する方法
  	$ squeue
　
　特定のアカウントのジョブの状態を参照する方法
        $ squeue -u userid
　
　ジョブクラスの状態を参照する方法
  	$ sinfo
　
　PENDジョブの優先順位を参照する方法
  	$ sprio
　
　ジョブをキャンセルする方法
  	$ scancel (JOB_ID)

  ジョブの開始予定時刻の表示方法
	$ scontrol show job JOBID

   該当ジョブの詳細情報が表示されますが、その中にStartTimeという項目があり、
    待ち状態のジョブについては、ジョブの開始予定時刻に相当します。

   ただし、下記の点にご注意ください。
　・あくまで開始の目安と考えてください。
　・他ユーザのジョブ投入や実行の影響で変化します。
　　　＊一度確認しても、その後に変わる可能性があります。
　・あまりに待ちジョブが多い場合や、依存性や実行制限の影響を受けた場合、
　　StartTimeが「Unknown」となることがあります。

6. コンパイラ環境

　(1) 利用可能なコンパイラ環境
　
　　　下記のコンパイラ環境が利用可能です。
　
		Intel コンパイラ環境       (intel): CPU用 (Parallel Studio XE と 後述のOneAPI)
		NVIDIA HPC コンパイラ環境  (nvhpc): GPU用
　
　　　＊NVIDIA HPC コンパイラは旧 PGI コンパイラに相当します。
　
　　　利用したいコンパイラのモジュールファイルへ swap してご利用下さい。
　
　　　例: NVIDIA HPCコンパイラ環境の利用
		module swap intel nvhpc
　
　　　注: Intel コンパイラ環境はログイン時にロードされていますので、この操作は必要ありません。
　
　(2) FORTRAN
　
  　　コンパイル時のコマンドは、 使用するコンパイラによって異なります。 
		|-----------------+-----------------|
		| コンパイラ      | コマンド        |
		|-----------------+-----------------|
		| Intel           | ifort           |
		| NVIDIA HPC      | nvfortran       |
		|-----------------+-----------------|
　　　必要に応じてコンパイラ固有のオプションを与えてください。
　
　(3) C/C++ コンパイラ
　
　　　コンパイル時のコマンドは、使用するコンパイラによって異なります。 
		|-----------------+-----------------|
		| コンパイラ      | コマンド        |
		|-----------------+-----------------|
		| Intel           | icc / icpc      |
		| NVIDIA HPC      | nvc / nvc++     |
		|-----------------+-----------------|
　　　必要に応じてコンパイラ固有のオプションを与えてください。
　
　(4) コンパイラ・オプション
　
　　　各コンパイラの主なオプションは下記の通りです。
　
　(4-1) Intel コンパイラ
		|-----------------------+----------------------------------------|
		| オプション            | 機能                                   |
		|-----------------------+----------------------------------------|
		| -qopenmp              | OpenMP 指示行による並列化を有効化      |
		| -parallel             | コンパイラによる自動並列化の有効化     |
		| -mkl                  | Intel MKL の利用                       |
		| -O [0|1|2|3|fast]     | 最適化レベルの指定                     |
		| -xCORE-AVX512         | AVX512までのSIMD拡張命令セットで最適化 |
		| -qopt-report          | コンパイルレポートの出力               |
		| -convert big_endian   | バイトスワップ入出力の指定             |
		|-----------------------+----------------------------------------|
		(詳細は man icc, man icpc, man ifort を参照して下さい)

　(4-2) NVIDIA HPC コンパイラ
		|-----------------------+----------------------------------------|
		| オプション            | 機能                                   |
		|-----------------------+----------------------------------------|
		| -mp                   | OpenMP 指示行による並列化を有効化      |
		| -Mconcur              | コンパイラによる自動並列化の有効化     |
		| -O [0|1|2|3|4]        | 最適化レベルの指定                     |
		| -Minfo                | コンパイルレポートの出力               |
		| -Mbyteswapio          | バイトスワップ入出力の指定             |
		|-----------------------+----------------------------------------|
		(詳細は man nvc, man nvc++, man nvfortran を参照して下さい)

　(5) binary dataのエンディアン

　Yukawa-21のbinary dataはlittle endianです。
　(前システムであるCray XC40のbinary dataはlittle endian、さらに以前のSR16000のbinary dataは
   big endianでした。)
　
　big endianのシステムで作成したbinary dataはlittle endianへ変換して読み込む必要があります。
　Fortranの場合は、コンパイルオプションまたは環境変数を使用することで、ソースやデータの修正なしに
　読み込むことが可能です。
　
　(5-1) Intelコンパイラ
　
　　・全てbig endianとして扱う場合
　　　　$ ifort -convert big_endian src.f90
　
　　・一部のファイルのみbig endianとして扱う場合
　　　　ジョブスクリプト中で環境変数を指定してからプログラムを実行
　　　　例(bashの場合):
　　　　　export F_UFMTENDIAN=10　#unit 10のみbig endianで扱う
　
　(5-2) NVIDIA HPCコンパイラ
　
　　・全てbig endianとして扱う場合
　　　　$ nvfortran -Mbyteswapio src.f90
　
　(6) スタックサイズ

　　　デフォルトのスタックサイズは8MBです。
　　　プログラムがSegmentation Faultで停止する場合、スタックサイズが不足している可能性がありますので、
　　　ジョブスクリプト中でスタックサイズをunlimitedにして
　　　実行してみてください。
　　　
　　　例:
		ulimit -s unlimited        #bashの場合
		limit stacksize unlimited  #cshの場合
		srun ./a.out


7. ライブラリ

7-1 MPI ライブラリ
　
　Intel MPIまたはOpenMPIが利用できます。
　適宜ラッパーコマンドを使用することで、自動的にインクルードファイルの指定、ライブラリのリンクが行なわれます。
　
　|-------------+------------------+-----------------------------|
　| コンパイラ  | MPIライブラリ    | ラッパーコマンド            |
　|-------------+------------------+-----------------------------|
　| Intel       | Intel MPI        | mpiicc / mpiicpc / mpiifort |
  |-------------+------------------+-----------------------------|
　| NVIDIA HPC  | OpenMPI　　　　  | mpicc  / mpicxx  / mpifort  |
  |-------------+------------------|-----------------------------|
　| CUDA        | OpenMPI　　　　  | 利用できません　　　　　　  |
  |-------------+------------------|-----------------------------|
　
　＊CUDA環境でOpenMPIを使用する場合は、コンパイル時インクルードパスを明示的に指定し、リンク時には
　　ライブラリパスの指定とリンクするライブラリの明示的な指定が必要です。
　
7-2 Intel Math Kernel Library (Intel MKL)
　
　MKLライブラリは、工学,科学,金融系ソフトウェアの開発者向けに、
　　線形代数ルーチン,高速フーリエ変換,ベクトル・マス・ライブラリー関数,乱数生成関数
　を提供します。これらのルーチンや関数は、Intelプロセッサ用に最適化されています。
　インテル・コンパイラで利用可能です。
　
　利用方法: Intel コンパイラを利用の上、ビルド時に -mkl オプションを付与して利用して下さい。
FFTWを使用する場合、-I ${MKLROOT}/include/fftwも追加してください。
  	$ ifort -mkl    # fortran 利用
	$ icc   -mkl    # C   利用
	$ icpc  -mkl    # C++ 利用


8. CUDA について

　NVIDIA CUDA ToolkitはNVIDIA社が提供する高性能のGPUアクセラレーションアプリケーションを作成するための開発環境です。(2024年8月の時点で default version は　cuda: 12.5-4.1.6,  nvhpc: 24.7-4.1.6　です。）
　
　利用方法：cuda モジュールファイルをロードしてご利用ください。
　＊NVIDIA HPCコンパイラとは異なります。"nvc"ではなく"nvcc"となる点にご注意ください。
	$ module load cuda
	$ nvcc [options] 
	$ nvcc -V
	nvcc: NVIDIA (R) Cuda compiler driver
	Copyright (c) 2005-2020 NVIDIA Corporation
	Built on Mon_Oct_12_20:09:46_PDT_2020
	Cuda compilation tools, release 11.1, V11.1.105
	Build cuda_11.1.TC455_06.29190527_0
　
　使用例：NVIDIA CUDA Toolkitに同梱されているsamplesのコンパイル方法と実行方法です。
　　　　　nvidia-smiコマンドの出力例も記載します。
	$ cuda-install-samples-11.1.sh ~
	$ cd ~/NVIDIA_CUDA-11.1_Samples/5_Simulations/nbody
	$ make
	
	#### in job script ####
	#SBATCH -p GPU
	#SBATCH -N 1
	
	nvidia-smi
	srun ./nbody -benchmark
	#######################
	+-----------------------------------------------------------------------------+
	| NVIDIA-SMI 455.32.00    Driver Version: 455.32.00    CUDA Version: 11.1     |
	|-------------------------------+----------------------+----------------------+
	| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
	| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
	|                               |                      |               MIG M. |
	|===============================+======================+======================|
	|   0  Tesla V100-PCIE...  Off  | 00000000:25:00.0 Off |                    0 |
	| N/A   39C    P0    37W / 250W |      0MiB / 16160MiB |      0%      Default |
	|                               |                      |                  N/A |
	+-------------------------------+----------------------+----------------------+
	|   1  Tesla V100-PCIE...  Off  | 00000000:C8:00.0 Off |                    0 |
	| N/A   38C    P0    36W / 250W |      0MiB / 16160MiB |      0%      Default |
	|                               |                      |                  N/A |
	+-------------------------------+----------------------+----------------------+
	                                                                               
	+-----------------------------------------------------------------------------+
	| Processes:                                                                  |
	|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
	|        ID   ID                                                   Usage      |
	|=============================================================================|
	|  No running processes found                                                 |
	+-----------------------------------------------------------------------------+
	
	> Windowed mode
	> Simulation data stored in video memory
	> Single precision floating point simulation
	> 1 Devices used for simulation
	GPU Device 0: "Volta" with compute capability 7.0
	
	> Compute 7.0 CUDA device: [Tesla V100-PCIE-16GB]
	81920 bodies, total time for 10 iterations: 145.415 ms
	= 461.498 billion interactions per second
	= 9229.970 single-precision GFLOP/s at 20 flops per interaction


9. Python について

　Intel Python をお使いください。標準でnumpy, scipy, mpi4py等が利用可能です。
　ただし、mpi4pyは計算ノード(バッチジョブ)で実行してください。front上では動作しません。
　＊Yukawa-21およびフロントエンドは外部ネットワークと繋がっていませんので、任意のパッケージを
　　インストールされたい場合は、オフライン環境におけるインストール方法をお試しください。
　
	$ module load intelpython3
	$ python -V
	Python 3.6.9 :: Intel Corporation
　
	$ module load intelpython2
	$ python -V
	Python 2.7.16 :: Intel Corporation
　
　基本的な使い方としては、anacondaによる仮想環境をご利用ください。
　
　使用例：
　デフォルトで用意されている"root"環境のクローンを作成します。
　"myenv"のところは任意の仮想環境名を指定してください。
　なお、"--clone root"オプションを付けないとconda管理下では空の仮想環境が作成されます。
　＊"conda list"は空ですが、"pip list"では複数のパッケージがインストールされています。
　
	$ conda create --offline --clone root -n myenv
　
　作成済みの仮想環境を確認します。
　
	$ conda info --envs
	# conda environments:
	#
	myenv                    (user_home)/.conda/envs/myenv
	root                  *  /sc/system/ap/intel/intelpython3
　
　作成した仮想環境をアクティブにします。成功すると、プロンプトの先頭に（仮想環境名）が付きます。
　必ずsourceコマンドから始めてください。sourceコマンドが無いとエラーになります。
　
	$ source activate myenv
	(myenv) $
	$ conda info --envs
	# conda environments:
	#
	myenv                 *  (user_home)/.conda/envs/myenv
	root                     /sc/system/ap/intel/intelpython3
	(myenv) $ conda list
	# packages in environment at (user_home)/.conda/envs/myenv:
	#
	asn1crypto                0.24.0                   py36_3    intel
	(省略)
	zlib                      1.2.11                        5    intel
　
　仮想環境上での作業が完了したら、アクティブな仮想環境から抜けます。
　正常に仮想環境から抜けられると、プロンプトから（仮想環境名）の表示が無くなります。
　
	(myenv) $ source deactivate myenv
	$ 
　
　バッチジョブで任意の仮想環境を使った実行をする場合は、ジョブスクリプト中で利用したい仮想環境を
　アクティブにするコマンドを挿入してください。
　＊moduleをloadしただけで仮想環境をアクティブにしないまま実行すると、デフォルトの"root"環境が
　　使われます。
　＊ジョブスクリプト中ではdeactivateはしなくても大丈夫です。
　
	### in job script ###
	source activate myenv
	srun python -c "import numpy; print(numpy.__version__)"
	################
	1.17.0

10. Julia 

    $ module load julia
    
  とすることで、version1.6.1が利用できるようになります。

    $ module load julia/1.9.2

  とすることで、version1.9.2が利用できるようになります。


11. 開発用ツール

　下記の開発用ツールが利用可能です。

　　arm Forge(arm DDT, arm MAP)
　　intel Vtune Profiler
    intel TraceAnalyzer and Collector

　(1) arm Forgeの利用方法
　　　
　　　arm Forge はソフトウェア開発のためのツール スイートです。以下が含まれます。
      　arm DDT : C/C++、Fortran/F90デバッガ
　　　　arm MAP : マルチスレッド/マルチプロセス向けの高速プロファイラ
　　　
　　　$ module load forge
　　　
　(2) intel Vtune Profilerの利用方法
　　　
　　　C/C++、Fortran、Python等に対応した高度なパフォーマンス/スレッド・プロファイラです。
　　　
　　　$ source /sc/system/ap/intel/vtune_profiler/vtune-vars.sh		(bashの場合)
　　　$ source /sc/system/ap/intel/vtune_profiler/vtune-vars.csh	(csh/tcshの場合)
　　　
　(3) intel TraceAnalyzer and Collectorの利用方法

　　　MPIアプリケーションのパフォーマンス分析、チューニング・ツールです。
　　　
　　　$ source /sc/system/ap/intel/itac_latest/bin/itacvars.sh	(bashの場合)
　　　$ source /sc/system/ap/intel/itac_latest/bin/itacvars.csh	(csh/tcshの場合)


12. Intel OneAPI の利用

      $ module sw intel/2020.4.304 intel/2023.2.0

   OneAPI の利用方法については　以下のURLを参照してください。
   
   https://jp.xlsoft.com/documents/intel/oneapi/download/programming-guide.pdf

   （注意）Yukawa-21のGPUは NVIDIA製なので nvhpc をお使いください。

13. Yukawa-21へのファイル転送について

　ログインサーバの/home, /scはYukawa-21と共有されています。
　mercury.yukawa.kyoto-u.ac.jpに対して外部からscp, sftpをかけてください。


14. その他

14-1. $SCHOME

　front にログインした時の カレントディレクトリ /sc/home/ユーザ名　ですが、
　環境変数 $SCHOME でスクリプトなどから参照することができます。


English version

---------------------------------------------------------------------------
Yukawa-21: How to use the Super Computer Yukawa-21
---------------------------------------------------------------------------
	1. System outline
	2. How to login
	3. Disk Storage
	4. Setting Environment by Modules
	5. How to Submit Batch Jobs
		5.1 Job Class
		5.2 Job Script
	6. Compiling Programs
	7. Numerical Analysis Library
	8. CUDA
	9. Python
	10. Julia
	11. Development Tools
	12. Intel OneAPI
	13. How to Transfer Files to Yukawa-21
	14. Others

1. System outline

The large-scale computing server is a Dell Technologies cluster computer consisting of 135 nodes,
each with 112 cores and 1.5 TiB of shared memory, and 2 nodes each equipped with 2 GPUs.
Each CPU delivers up to 2.4 TFLOPS of double precision floating point performance,
and the maximum performance per node is 9.6 TFLOPS.
The floating point performance of the entire system is up to 1.3 PFLOPS,
and the total memory capacity is 205 TiB.
Binary data on Yukawa-21 is little endian.


2. How to login

Log in to the Yukawa-21 frontend (front) by way of the YITP login node.
Please see the following URL for how to log in to the login node.
http://www.yukawa.kyoto-u.ac.jp/computer1/sshman/ssh-keylogin-man.html
From the login node, the command

	$ ssh -l your-user-name front

logs you in to the frontend.
There are two frontends, front1 and front2, and you are automatically logged in to the one with the lower load.


3. Disk Storage

The login node and front share the same disk storage.
When you log in to front, the working directory is automatically changed to "/sc/home/your-user-name".

	/home/group-name/your-user-name	: HOME directory (NFS)
	/sc/home/your-user-name			: Directory for computation (Lustre)

NOTE: Please put the files used on the compute nodes under /sc, because access to it is faster.

The disk quota of /sc is 1TB per user by default. Use the "lfs quota" command to check your usage (values are in kB).
If you need more, please contact the YITP computer room.

Example:
$ lfs quota /sc
Disk quotas for user USERNAME (uid XXXX):
	Filesystem   kbytes     quota      limit     grace  files  quota  limit  grace
	       /sc       24  1073741824  1073741824      -      7      0      0      -
Disk quotas for group GROUP (gid YYY):
	Filesystem   kbytes     quota      limit     grace  files  quota  limit  grace
	       /sc  255593212       0          0         -  460112     0      0      -
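
The kB figures in this output are hard to read at a glance. The following is a minimal sketch of a helper that converts them to GiB and a percentage. The function name quota_percent is hypothetical, and the sketch assumes the layout of the sample output above: the first row whose first field is "/sc" is the per-user row, with usage in column 2 and the hard limit in column 4.

```shell
# Hypothetical helper: read `lfs quota /sc` output on stdin and summarize
# the first /sc row (the per-user row in the sample output above).
quota_percent() {
    awk '$1 == "/sc" {
        used = $2; limit = $4            # kB used, hard limit in kB
        if (limit > 0)
            printf "%.1f GiB used of %.1f GiB (%.2f%%)\n", used / 1048576, limit / 1048576, 100 * used / limit
        else
            printf "%.1f GiB used, no limit set\n", used / 1048576
        exit                             # stop after the first matching row
    }'
}
```

On front this would be used as "lfs quota /sc | quota_percent".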


4. Setting Environment by Modules

Yukawa-21 provides several compilers and numerical analysis libraries, each in several versions.
To select the compilers and libraries you use, please use the Environment Modules software.
Many module files, each defining environment settings, are provided on Yukawa-21,
and you can configure your environment flexibly by loading the module files you need.

4-1  Commands to load modules
(1) Loading a module that you want to use
	$ module load (module filename)[/version]
(2) Unloading a module
	$ module unload (module filename)[/version]
(3) Swapping modules
	$ module swap (already loaded module filename) (module filename to load)
(4) Checking the modules currently loaded
	$ module list
(5) Checking the available modules
	$ module avail [module name]

NOTE: If the module name is omitted, all available modules are listed.
      Please see "man module" for details.

4-2  Examples
(1) Changing compiler environment
	ex) changing from Intel compiler environment to NVIDIA HPC compiler environment
	$ module swap intel nvhpc
(2) Changing compiler environment versions
	ex) changing from intel default version to intel/2019.5.281
	$ module swap intel intel/2019.5.281
	NOTE: if you do not specify the version, the default version will be loaded.
(3) Using applications
	ex) arm Forge
	$ module load forge

4-3 Default environment
By default, the intel module file is loaded.
In this environment, you can use the Intel compiler, Intel MPI and Intel MKL.
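
A job script may want to check which environment is active before it swaps modules. The following is a minimal sketch that relies on the LOADEDMODULES environment variable, a colon-separated list that Environment Modules maintains. The function name is_module_loaded is hypothetical.

```shell
# Hypothetical helper: succeed if module "$1" is loaded.
# "$1" may be a bare name (any version matches) or a full name/version.
is_module_loaded() {
    local want=$1 entry
    local IFS=:                          # LOADEDMODULES is colon-separated
    for entry in ${LOADEDMODULES:-}; do
        case $entry in
            "$want" | "$want"/*) return 0 ;;
        esac
    done
    return 1
}
```

For example, "is_module_loaded nvhpc || module swap intel nvhpc" swaps only when the NVIDIA HPC environment is not yet loaded.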


5. Submitting Jobs

Yukawa-21 uses SLURM as the workload manager.
First write a batch file (job script) for your job, as shown in the examples in 5-2,
and then submit it with the "sbatch" command on front. The job is executed on the compute nodes.
Inside the job script, the program itself is launched with the "srun" command.

5-1 Job Class

We have five job classes (L, M, S, DEBUG and GPU). Choose one of them according to
the computational resources that your program requires.

 Name   Nodes             Core limit  Usage limit  Elapsed time  Memory       Purpose
-----------------------------------------------------------------------------------------
 L      32-32(exclusive)  ----        32node/user  1day          ----         Large job
 M      1-24(exclusive)   ----        24node/user  1day          ----         Medium job
 S      1(shared)         8/job       10core/user  7days         13.7GB/core  Small job
 DEBUG  1(exclusive)      ----        1node/user   15min         ----         Debug job
 GPU    1(exclusive)      ----        1node/user   1day          ----         GPU job
-----------------------------------------------------------------------------------------
[NOTE] For the L/M/S/GPU classes, the limit is on the resources (nodes/cores) one user may use
       simultaneously, not on the number of jobs one user may run simultaneously.
       L: total 32 nodes (effectively 1 job/user; up to 10 jobs/user may be submitted, counting running plus pending jobs)
       M: total 24 nodes (24 jobs if each job uses one node, 1 job if a job uses 24 nodes)
       S: total 8 cores (8 jobs if each job uses one core, 1 job if a job uses 8 cores)
       GPU: total 1 node (effectively 1 job/user)

       DEBUG class
        * The class priority is high, but the elapsed time is limited to 15 minutes.
        * Only 1 job/user can run at a time.
          (If one job is running, another job cannot start even if you are using fewer than 2 nodes.)
        * Only 1 job/user can be submitted at a time, counting running plus pending jobs.
          (You cannot queue multiple jobs in the DEBUG class in advance.)
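
The node and core columns of the class table can be read as a simple decision rule. The following is a rough sketch of that rule only. The function name suggest_class is hypothetical, and the sketch ignores the per-user limits, the elapsed time limits, the S memory limit and the DEBUG class.

```shell
# Hypothetical helper: suggest a job class from the node count, the core
# count (only relevant for single-node jobs) and whether GPUs are needed.
suggest_class() {
    local nodes=$1 cores=${2:-112} gpu=${3:-no}
    if [ "$gpu" = yes ] && [ "$nodes" -eq 1 ]; then
        echo GPU                         # GPU: 1 exclusive node
    elif [ "$nodes" -eq 1 ] && [ "$cores" -le 8 ]; then
        echo S                           # S: shared node, up to 8 cores per job
    elif [ "$nodes" -ge 1 ] && [ "$nodes" -le 24 ]; then
        echo M                           # M: 1 to 24 exclusive nodes
    elif [ "$nodes" -eq 32 ]; then
        echo L                           # L: exactly 32 exclusive nodes
    else
        echo none                        # no class in the table fits
        return 1
    fi
}
```

For example, "suggest_class 1 4" prints S and "suggest_class 4" prints M.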

5-2  Job Script

Here we show some examples of the batch file. This can be submitted by

   $ sbatch go.sh

on front (the batch file name is arbitrary).

Example 1.  Single Job (Partition S)
   #!/bin/bash
   #--- parameter define
   #SBATCH -J SINGLE_job
   #SBATCH -N 1
   #SBATCH -n 1
   #SBATCH -c 1
   #SBATCH -p S

   #--- Job Script
   srun ./a.out

Example 2   OpenMP Job (class S, 4 SMP threads)
   #!/bin/bash
   #--- parameter define
   #SBATCH -J SMP_job
   #SBATCH -N 1
   #SBATCH -n 1
   #SBATCH -c 4
   #SBATCH -p S

   #--- Job Script
   export OMP_NUM_THREADS=4
   srun ./a.out

Example 3   MPI Parallel Job (class M, 2 nodes, 224 MPI processes)
   #!/bin/bash
   #--- parameter define
   #SBATCH -J MPI_job
   #SBATCH -N 2
   #SBATCH -n 224
   #SBATCH -c 1
   #SBATCH -p M

   #--- Job Script
   srun ./a.out

Example 4   MPI Parallel plus OpenMP Job (class M, 4 nodes, 32 MPI processes, 14 SMP threads)
   #!/bin/bash
   #--- parameter define
   #SBATCH -J HYBRID_job
   #SBATCH -N 4
   #SBATCH -n 32
   #SBATCH -c 14
   #SBATCH -p M

   #--- Job Script
   export OMP_NUM_THREADS=14
   srun ./a.out

Lines starting with "#SBATCH" set the options passed to the sbatch command.

Main options
            -p : Job class (default: S)
            -o : Standard output file of the job script (default: slurm-%j.out, where %j is the job ID)
            -e : Standard error file of the job script (default: same as the standard output file)
            -J : Job name (default: script file name)
            -n : Number of processes (default: 1)
            -c : Number of cores per process (default: 1)
            -N : Number of nodes to use (default: calculated from the -n and -c options)
            -t : Elapsed time limit (default: the limit defined for the job class)
  --mail-type= : Job states for which email is sent; see the man page for details (default: no email)
  --mail-user= : Email address to send to (required if --mail-type is specified)
          -G 2 : Specify this option to run a GPU debug job (effective with -p DEBUG or -p GPU)
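
The four examples above do not use -o, -e or -t. The following is a minimal sketch of a generator that prints a job script combining these options with the ones already shown. The function name make_job_script, the job name my_job and the output file names are hypothetical. The time value uses Slurm's HH:MM:SS form.

```shell
# Hypothetical generator: print a job script for the given class, node
# count, process count, cores per process and elapsed time limit.
make_job_script() {
    local class=$1 nodes=$2 procs=$3 cores=$4 limit=$5
    cat <<EOF
#!/bin/bash
#SBATCH -J my_job
#SBATCH -p $class
#SBATCH -N $nodes
#SBATCH -n $procs
#SBATCH -c $cores
#SBATCH -t $limit
#SBATCH -o my_job-%j.out
#SBATCH -e my_job-%j.err

export OMP_NUM_THREADS=$cores
srun ./a.out
EOF
}
```

For example, "make_job_script M 4 32 14 12:00:00 > go.sh" writes a script equivalent to Example 4 with a 12-hour limit and separate output and error files.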

Note:
(1) The initial directory is the directory from which you submitted the job.
(2) Please use "srun" regardless of whether you use MPI or not.
(3) Specify the values so that (number of processes [-n]) x (number of cores per process [-c]) <= (number of nodes [-N]) x 112.
(4) The value given to the -c option must equal the number of cores each process uses.
    example: In case of OMP_NUM_THREADS=8, please specify "-c 8".
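
The core-count condition in note (3) can be checked before submitting. The following is a minimal sketch that uses the 112 cores per node stated in the system outline. The function name check_layout and the variable CORES_PER_NODE are hypothetical.

```shell
CORES_PER_NODE=112   # cores per node, from the system outline

# Hypothetical helper: check that (processes) x (cores per process)
# fits into (nodes) x (cores per node).
check_layout() {
    local nodes=$1 procs=$2 cores=$3
    local need=$(( procs * cores ))
    local have=$(( nodes * CORES_PER_NODE ))
    if [ "$need" -le "$have" ]; then
        echo "OK: $need of $have cores"
    else
        echo "too many cores requested: $need > $have"
        return 1
    fi
}
```

Example 3 (check_layout 2 224 1) and Example 4 (check_layout 4 32 14) both use every core of their nodes, while "check_layout 1 8 15" is rejected because 120 cores do not fit on one node.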

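Submitting jobs  (sbatch)
        $ sbatch go.sh

Checking job status
        $ squeue

Checking job status of a specified user
        $ squeue -u userid

Checking the status of the job classes
        $ sinfo

Checking the priority of pending jobs
        $ sprio

Cancelling jobs
        $ scancel (JOB_ID)

Checking the expected start time of a submitted job
        $ scontrol show job JOBID

 Detailed information on the job is displayed. For a pending job, the "StartTime" item
 gives the expected start time. Please note the following.
  * Treat it only as an estimate.
  * It changes depending on other users' job submissions and executions, so it may differ from
    what you saw earlier.
  * If there are too many pending jobs, or the job is affected by dependencies or execution limits,
    "StartTime" may be shown as "Unknown".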
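
The output of "scontrol show job" is long. The following is a minimal sketch that extracts only the "StartTime" value. The function name job_start_time is hypothetical, and the sketch assumes the usual "Key=Value" layout of the scontrol output.

```shell
# Hypothetical helper: read `scontrol show job JOBID` output on stdin and
# print the value of the first StartTime= field.
job_start_time() {
    sed -n 's/.*StartTime=\([^ ]*\).*/\1/p' | head -n 1
}
```

On front this would be used as "scontrol show job JOBID | job_start_time".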

6. Compiling Environments

(1) Available compiler environments

The following compiler environments are available.

  Intel compiler environment      (intel): for CPU  (Parallel Studio XE and OneAPI)
  NVIDIA HPC compiler environment (nvhpc): for GPU

* The NVIDIA HPC compiler corresponds to the former PGI compiler.

To use a compiler environment, swap to its module file.

example: To switch to the NVIDIA HPC compiler environment,
    $ module swap intel nvhpc

NOTE: The Intel compiler environment is already loaded at login, so no such operation is needed to use it.

(2) FORTRAN

The compile commands depend on the compiler you are using.
　|-----------------+-----------------|
　| Compiler        | Command         |
　|-----------------+-----------------|
　| Intel           | ifort           |
　| NVIDIA HPC      | nvfortran       |
　|-----------------+-----------------|
Give compiler-specific options as needed.

(3) C/C++

The compile commands depend on the compiler you are using.
　|-----------------+-----------------|
　| Compiler        | Command         |
　|-----------------+-----------------|
　| Intel           | icc / icpc      |
　| NVIDIA HPC      | nvc / nvc++     |
　|-----------------+-----------------|
Give compiler-specific options as needed.

(4) Compiler Options

(4-1) Intel Compiler
|-----------------------+---------------------------------------------------------------------|
|    Options            |                   Purpose                                           |
|-----------------------+---------------------------------------------------------------------|
| -qopenmp              | enable parallelization by OpenMP directives                         |
| -parallel             | enable auto-parallelization by compiler                             |
| -mkl                  | use Intel MKL                                                       |
| -O [0|1|2|3|fast]     | specify optimization level                                          |
| -xCORE-AVX512         | optimize using SIMD instruction set extensions up to AVX-512        |
| -qopt-report          | output compile report                                               |
| -convert big_endian   | specify byteswap I/O                                                |
|-----------------------+---------------------------------------------------------------------|
 (see "man icc", "man icpc" and "man ifort" for details)

(4-2) NVIDIA HPC Compiler
|-------------------+-----------------------------------------------------|
|     Options       |                       Purpose                       |
|-------------------+-----------------------------------------------------|
| -mp               | enable parallelization by OpenMP directives         |
| -Mconcur          | enable auto-parallelization by compiler             |
| -O [0|1|2|3|4]    | specify optimization level                          |
| -Minfo            | output compile report                               |
| -Mbyteswapio      | specify byteswap I/O                                |
|-------------------+-----------------------------------------------------|
 (see "man nvc", "man nvc++" and "man nvfortran" for details)

(5) Binary data endian
Binary data on Yukawa-21 is little endian. (The previous system, Cray XC40, was also little endian;
the earlier SR16000 was big endian.)

Binary data created on a big endian system must be converted to little endian when it is read.
For Fortran, compile options or environment variables let you read such data without modifying the source or the data.
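
When the origin of an old Fortran unformatted file is unknown, its byte order can often be guessed before a conversion option is chosen. The following is a rough sketch under one assumption: the file is a sequential unformatted file whose records are framed by 4-byte length markers. The function name guess_endian is hypothetical. It reads the first marker in both byte orders and accepts the order for which the first record fits in the file.

```shell
# Hypothetical helper: guess the byte order of a Fortran sequential
# unformatted file, assuming 4-byte record length markers.
guess_endian() {
    local file=$1 size le be
    size=$(( $(wc -c < "$file") ))
    [ "$size" -ge 8 ] || { echo unknown; return; }
    set -- $(od -An -tx1 -N4 "$file")    # the first four bytes, as hex
    le=$(( (0x$4 << 24) | (0x$3 << 16) | (0x$2 << 8) | 0x$1 ))
    be=$(( (0x$1 << 24) | (0x$2 << 16) | (0x$3 << 8) | 0x$4 ))
    # A valid first record needs its length plus two 4-byte markers.
    if [ $(( le + 8 )) -le "$size" ] && [ $(( be + 8 )) -gt "$size" ]; then
        echo little
    elif [ $(( be + 8 )) -le "$size" ] && [ $(( le + 8 )) -gt "$size" ]; then
        echo big
    else
        echo unknown
    fi
}
```

If the result is "big", the file is a candidate for the conversion options described in this section. A result of "unknown" means that both or neither reading is plausible.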

(5-1) Intel Compiler

  * To treat all unformatted files as big endian
      $ ifort -convert big_endian src.f90

  * To treat only some files as big endian
      Set the environment variable in the job script before executing the program.
      Example(bash):
        export F_UFMTENDIAN=10    # only unit 10 is handled as big endian

(5-2) NVIDIA HPC Compiler

  * To treat all unformatted files as big endian
      $ nvfortran -Mbyteswapio src.f90

(6) Stacksize
The default stack size is 8MB. If the program stops with a Segmentation Fault,
the stack size may be insufficient. Try setting the stack size to unlimited
in the job script, like this.
example:
ulimit -s unlimited          # in case of bash
limit stacksize unlimited    # in case of csh
srun ./a.out


7. Library

7-1  MPI Library

Intel MPI and OpenMPI are available.
The wrapper commands automatically specify the include files and link the libraries.

  |-------------+------------------+-----------------------------|
  | Compiler    | MPI Library      | Wrapper command             |
  |-------------+------------------+-----------------------------|
  | Intel       | Intel MPI        | mpiicc / mpiicpc / mpiifort |
  |-------------+------------------+-----------------------------|
  | NVIDIA HPC  | OpenMPI          | mpicc  / mpicxx  / mpifort  |
  |-------------+------------------+-----------------------------|
  | CUDA        | OpenMPI          | not available               |
  |-------------+------------------+-----------------------------|

  * To use OpenMPI in a CUDA environment, you must explicitly specify the include path when compiling,
    and the library path and the libraries to link when linking.
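
The following is a minimal sketch of what such an explicit nvcc command line can look like. The function name nvcc_mpi_command and the installation prefix passed to it are hypothetical; the real OpenMPI location on Yukawa-21 is not given here. The sketch prints the command instead of running it and links the OpenMPI C library with -lmpi.

```shell
# Hypothetical helper: print an nvcc command that names the OpenMPI
# include directory, library directory and library explicitly.
# $1 = OpenMPI installation prefix (an assumption, site dependent)
# $2 = CUDA source file, $3 = output name (default: a.out)
nvcc_mpi_command() {
    local mpi_root=$1 src=$2 out=${3:-a.out}
    echo "nvcc -I$mpi_root/include -L$mpi_root/lib -lmpi -o $out $src"
}
```

For example, "nvcc_mpi_command /path/to/openmpi hello.cu" prints the command for a placeholder prefix. The OpenMPI wrappers can also report the flags they would use with "mpicc --showme:compile" and "mpicc --showme:link".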

7-2 Intel Math Kernel Library (Intel MKL)

MKL library is designed for engineering scientific and finace system software developers.
It includes linear algebra routines, FFTW, vector math library functions and random number generating functions. These routines and functions are optimized for Intel processors. You can use them with Intel compiler.

Usage: Build with the Intel compiler and add the -mkl option.
       To use FFTW, "-I ${MKLROOT}/include/fftw" is also required.

  $ ifort -mkl      # Fortran
  $ icc   -mkl      # C
  $ icpc  -mkl      # C++

8. CUDA

The NVIDIA CUDA Toolkit provides a development environment for creating high-performance GPU-accelerated applications. (As of Aug 2024, the default versions are cuda: 12.5-4.1.6 and nvhpc: 24.7-4.1.6.) The sample output below was recorded with CUDA 11.1, so version strings will differ.

Usage: Load the module file "cuda".
* This is not the same as the NVIDIA HPC compiler. Note that the command is "nvcc", not "nvc".
$ module load cuda
$ nvcc [options] 
$ nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Mon_Oct_12_20:09:46_PDT_2020
Cuda compilation tools, release 11.1, V11.1.105
Build cuda_11.1.TC455_06.29190527_0

Example: How to compile and run the samples included with the NVIDIA CUDA Toolkit.
　　　　 The output of the "nvidia-smi" command is also shown.

$ cuda-install-samples-11.1.sh ~
$ cd ~/NVIDIA_CUDA-11.1_Samples/5_Simulations/nbody
$ make

#### in job script ####
#SBATCH -p GPU
#SBATCH -N 1

nvidia-smi
srun ./nbody -benchmark
#######################
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 455.32.00    Driver Version: 455.32.00    CUDA Version: 11.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-PCIE...  Off  | 00000000:25:00.0 Off |                    0 |
| N/A   39C    P0    37W / 250W |      0MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-PCIE...  Off  | 00000000:C8:00.0 Off |                    0 |
| N/A   38C    P0    36W / 250W |      0MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

> Windowed mode
> Simulation data stored in video memory
> Single precision floating point simulation
> 1 Devices used for simulation
GPU Device 0: "Volta" with compute capability 7.0

> Compute 7.0 CUDA device: [Tesla V100-PCIE-16GB]
81920 bodies, total time for 10 iterations: 145.415 ms
= 461.498 billion interactions per second
= 9229.970 single-precision GFLOP/s at 20 flops per interaction
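
The three figures in the benchmark output are related by simple arithmetic. The sketch below reproduces them, assuming that the sample counts one interaction for every ordered pair of bodies (N*N per iteration). The function name nbody_rates is hypothetical.

```python
def nbody_rates(bodies, iterations, elapsed_ms, flops_per_interaction=20):
    """Return (billion interactions/s, GFLOP/s) for an all-pairs N-body run."""
    # Assumption: every body interacts with every body once per iteration.
    interactions = bodies * bodies * iterations
    per_second = interactions / (elapsed_ms * 1e-3)
    billions = per_second * 1e-9
    return billions, billions * flops_per_interaction


# Values taken from the output above: 81920 bodies, 10 iterations, 145.415 ms.
billions, gflops = nbody_rates(81920, 10, 145.415)
print(f"{billions:.3f} billion interactions per second")
print(f"{gflops:.2f} single-precision GFLOP/s")
```

The results agree with the printed values to within the rounding of the displayed time.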


9. Python

Please use Intel Python.
* numpy, scipy, and mpi4py are available by default.
  Please note that mpi4py works only on the Yukawa-21 compute nodes,
  not on front. Please run it as a batch job.
* Yukawa-21 and the front end are not connected to the external network,
  so to install additional packages, use an offline installation method.

$ module load intelpython3
$ python -V
Python 3.6.9 :: Intel Corporation

$ module load intelpython2
$ python -V
Python 2.7.16 :: Intel Corporation

We recommend working in a virtual environment managed with conda (Anaconda).

Example:

Clone the "root" environment provided by default.
Replace "myenv" with any virtual environment name you like.
Without the "--clone root" option, an empty virtual environment is created under conda management.
* In that case "conda list" is empty, but "pip list" still shows multiple installed packages.

$ conda create --offline --clone root -n myenv

Check the created virtual environment.

$ conda info --envs
# conda environments:
#
myenv                    (user_home)/.conda/envs/myenv
root                  *  /sc/system/ap/intel/intelpython3

Activate the virtual environment you created. If activation succeeds, the prompt is prefixed with (virtual environment name).
Be sure to run activate through the "source" command; an error occurs without it.

$ source activate myenv
(myenv) $ conda info --envs
# conda environments:
#
myenv                 *  (user_home)/.conda/envs/myenv
root                     /sc/system/ap/intel/intelpython3
(myenv) $ conda list
# packages in environment at (user_home)/.conda/envs/myenv:
#
asn1crypto                0.24.0                   py36_3    intel
(snip)
zlib                      1.2.11                        5    intel

Deactivate the virtual environment when you are done working in it.
If deactivation succeeds, the (virtual environment name) prefix disappears from the prompt.

(myenv) $ source deactivate myenv
$ 

To run a batch job in a particular virtual environment, add the command that activates it to the job script.
* If you only load the intelpython module and run the job without activating a virtual environment, the default "root" environment is used.
* There is no need to deactivate in the job script.

### in job script ###
source activate myenv
srun python -c "import numpy; print(numpy.__version__)"
################
1.17.0
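
When a batch job picks up an unexpected package version, it helps to print which interpreter and environment the job is really using. The following is a minimal sketch. The script name show_env.py and the function describe_environment are hypothetical, and it assumes that conda sets the CONDA_DEFAULT_ENV variable on activation, which may depend on the conda version.

```python
import os
import sys


def describe_environment(environ=os.environ):
    """Collect the interpreter path, its prefix, and the active conda env name."""
    return {
        "executable": sys.executable,
        "prefix": sys.prefix,
        # Assumption: conda exports CONDA_DEFAULT_ENV when an env is activated.
        "conda_env": environ.get("CONDA_DEFAULT_ENV", "(not set)"),
    }


for key, value in describe_environment().items():
    print(f"{key}: {value}")
```

In the job script, "srun python show_env.py" can be run in place of the one-line numpy check. If the activation worked, the prefix should point under the (user_home)/.conda/envs directory rather than the system-wide "root" environment.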

10. Julia

    To use Julia v1.9.2:

    $ module load julia/1.9.2

    To use Julia v1.6.1 (loaded when no version is given):

    $ module load julia


11. Development Tools

The following development tools are available.
	Arm Forge (Arm DDT, Arm MAP)
	Intel VTune Profiler
	Intel Trace Analyzer and Collector

(1) Arm Forge
	
　　Arm Forge is an HPC development tool suite for C, C++, Fortran, and Python. It includes:
　　  Arm DDT : C/C++, Fortran/F90 debugger
　　  Arm MAP : parallel profiler
　　
　　$ module load forge
　　
(2) Intel VTune Profiler
	
　　Intel VTune Profiler is an advanced performance and thread profiler for C/C++, Fortran, and Python.
　　
　　$ source /sc/system/ap/intel/vtune_profiler/vtune-vars.sh	(bash)
　　$ source /sc/system/ap/intel/vtune_profiler/vtune-vars.csh	(csh/tcsh)
　　
(3) Intel Trace Analyzer and Collector
	
　　Intel Trace Analyzer and Collector is a performance analysis and tuning tool for MPI applications.
　　
　　$ source /sc/system/ap/intel/itac_latest/bin/itacvars.sh	(bash)
　　$ source /sc/system/ap/intel/itac_latest/bin/itacvars.csh	(csh/tcsh)


12. Intel OneAPI

    To use Intel OneAPI, switch the intel module:

    $ module sw intel/2020.4.304 intel/2023.2.0

 Please refer to the following URL for OneAPI documentation:

    https://www.intel.com/content/www/us/en/docs/oneapi/programming-guide/2023-2/overview.html
 
 (Note: The Yukawa-21 GPUs are NVIDIA products, so please use nvhpc for GPU programming.)

13. File transfer to Yukawa-21

/home (the home directory on the login server) is also visible on Yukawa-21.
To transfer files from outside, use scp (or sftp) to mercury.yukawa.kyoto-u.ac.jp.


14. Others

14-1. $SCHOME

  The current working directory when you log in to front (/sc/home/USERNAME) is defined as
  the environment variable $SCHOME.