Working on high-performance computing (HPC) systems requires a solid understanding of the tools and processes that enable efficient resource management. This guide covers essential steps to access the MBZUAI HPC environment using SSH, manage jobs with Slurm, and maintain persistent sessions with Tmux. Whether you’re a beginner or looking to refine your skills, this blog post will help you get the most out of your HPC experience.

1. Quick Access Links

Before diving into the technical details, make sure you have easy access to these essential resources:

2. HPC Access via SSH

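To access the MBZUAI HPC, you’ll need to use SSH. Add the following entry to your SSH configuration file (~/.ssh/config) to define a shortcut for the lab workstation, replacing the User value with your own username: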

Host MBZUAI_lab_workstation
    HostName login-student-lab.mbzu.ae
    User haobo.yang
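
If you prefer to script this step, the entry can be appended from the shell. The following is a minimal sketch, not an official setup script: the function name add_host_entry and the SSH_CONFIG override variable are hypothetical names introduced here for illustration, and the username is passed as an argument so you can substitute your own.

```shell
# Hypothetical helper: append the host entry to an SSH config file only if it
# is not already present. SSH_CONFIG is an assumed override for trying this
# out safely; it defaults to the standard ~/.ssh/config location.
add_host_entry() {
    user="$1"
    config_file="${SSH_CONFIG:-$HOME/.ssh/config}"

    mkdir -p "$(dirname "$config_file")"
    touch "$config_file"

    # Skip the append when the alias is already defined.
    if ! grep -q '^Host MBZUAI_lab_workstation$' "$config_file"; then
        printf '%s\n' \
            'Host MBZUAI_lab_workstation' \
            '    HostName login-student-lab.mbzu.ae' \
            "    User $user" >> "$config_file"
    fi

    # OpenSSH expects the config file to be writable only by its owner.
    chmod 600 "$config_file"
}
```

For example, add_host_entry haobo.yang writes the entry, after which ssh MBZUAI_lab_workstation should connect using the alias.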

3. Managing Resources with Slurm

Slurm is a powerful workload manager that helps allocate and manage computing resources on HPC systems. Here are some key commands to get you started:

3.1. Viewing Partition and Node Information

To see the available partitions and nodes, use the following command:

sinfo
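
When the full listing is long, it can help to filter it. The sketch below assumes you only care about idle nodes. idle_only is a hypothetical helper name, while -h (no header) and the -o format fields %P (partition), %t (compact state), %D (node count), and %N (node list) are standard sinfo options.

```shell
# Hypothetical filter: keep only rows whose state is exactly "idle".
# Expected input columns: partition, state, node count, node list.
idle_only() {
    awk '$2 == "idle" { printf "%s: %s idle node(s) [%s]\n", $1, $3, $4 }'
}

# On the cluster, where sinfo is available:
#   sinfo -h -o "%P %t %D %N" | idle_only
```

Because the helper only reads standard input, it can be tried on saved sinfo output before using it on the live cluster.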

3.2. Viewing Job and Job Step Information

To view every job currently in the queue, use:

squeue

For only your jobs, add the --me flag:

squeue --me
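
If you have many jobs queued, a per-state count is often easier to read than the full table. This is a rough sketch: count_by_state is a hypothetical helper name, and it assumes the input is one job state per line, which is what squeue prints with -h (no header) and -o "%T" (state only).

```shell
# Hypothetical helper: count how many jobs are in each state.
# Reads one state name per line on standard input.
count_by_state() {
    sort | uniq -c | awk '{ print $2 ": " $1 }'
}

# On the cluster, where squeue is available:
#   squeue --me -h -o "%T" | count_by_state
```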

3.3. Allocating Resources

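You can allocate resources interactively with salloc by requesting nodes and tasks. For example, the following requests one node (-N1) and 24 tasks (-n24), where each task gets one CPU core by default: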

salloc -N1 -n24

To request a specific workstation by name, pass it with -w (short for --nodelist):

salloc -w ws-l6-007
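
Once salloc grants a request, it starts a shell in which Slurm exports variables such as SLURM_JOB_ID and SLURM_JOB_NODELIST, plus SLURM_NTASKS when a task count was requested. A quick way to confirm that you are inside an allocation could look like the sketch below; allocation_summary is a hypothetical helper name, not a Slurm command.

```shell
# Hypothetical helper: report whether the current shell is inside a Slurm
# allocation, based on the environment variables that salloc exports.
allocation_summary() {
    if [ -n "${SLURM_JOB_ID:-}" ]; then
        echo "job ${SLURM_JOB_ID} on ${SLURM_JOB_NODELIST:-unknown} (${SLURM_NTASKS:-?} tasks)"
    else
        echo "no active Slurm allocation"
    fi
}
```

Typing exit in the shell started by salloc releases the allocation, after which the helper reports no active allocation again.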

4. Managing Persistent Sessions with Tmux

Tmux is an essential tool for anyone working on long-running tasks or multiple sessions in an HPC environment. With Tmux, your sessions remain active even if you disconnect, ensuring that your tasks continue uninterrupted.

4.1. Starting a Tmux Session

To start a basic Tmux session, simply type:

tmux

4.2. Creating Named Tmux Sessions

If you manage multiple Tmux sessions, it’s useful to name them. Start a new named session like this:

tmux new -s session_name

4.3. Detaching from a Tmux Session

You can temporarily leave a Tmux session by detaching from it with:

Ctrl + b then d

This allows your processes to continue running in the background.

4.4. Re-attaching to a Tmux Session

To reconnect to a Tmux session, first list all running sessions:

tmux ls

Identify the session name from the list and re-attach using:

tmux attach-session -t session_name

For example, to re-attach to a session named 0:

tmux attach-session -t 0
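
Creating and re-attaching can also be folded into one step, since tmux new-session accepts -A to attach when the named session already exists. The sketch below is one possible wrapper under a few assumptions: start_or_attach, the default session name main, and the TMUX_BIN override are hypothetical names introduced here, while the TMUX variable is the one tmux itself sets inside a session.

```shell
# Hypothetical wrapper: create the named session, or attach if it exists.
# TMUX_BIN is an assumed override (defaults to tmux) so the command can be
# swapped out when experimenting.
start_or_attach() {
    name="${1:-main}"

    # tmux sets TMUX inside a session; refuse to nest sessions by accident.
    if [ -n "${TMUX:-}" ]; then
        echo "already inside tmux; detach first (Ctrl + b then d)" >&2
        return 1
    fi

    "${TMUX_BIN:-tmux}" new-session -A -s "$name"
}
```

Calling start_or_attach train after logging in would then either resume the train session or create it.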

5. Conclusion

Leveraging SSH, Slurm, and Tmux effectively can significantly improve your productivity and resource management on the MBZUAI HPC. With these tools, you can confidently manage your jobs, maintain persistent sessions, and ensure that your computational tasks are running smoothly, even in your absence.

6. Additional Info

To launch VS Code from the terminal, load the VSCODE module and then run code:

(base) 127 ws-l1-007:~ $ module load VSCODE
(base) 0 ws-l1-007:~ $ code
(base) 0 ws-l1-007:~ $

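🍀 Postscript 🍀
This blog focuses on programming, algorithms, robotics, artificial intelligence, mathematics, and related topics, with new posts published regularly.
🌸 QQ chat group: 兔叽の魔术工房 (942848525)
⭐ Bilibili account: 白拾Official (active in the Knowledge and Animation sections)
✨ GitHub: YangSierCode000 (project files)
⛳ Discord community: AierLab (an AI community)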