{"id":2338,"date":"2024-01-17T10:56:18","date_gmt":"2024-01-17T01:56:18","guid":{"rendered":"https:\/\/gotocloud.co.kr\/?p=2338"},"modified":"2024-01-17T10:57:45","modified_gmt":"2024-01-17T01:57:45","slug":"integration-of-mlops-hpc-and-cae-hpc-using-scheduler","status":"publish","type":"post","link":"https:\/\/gotocloud.co.kr\/?p=2338","title":{"rendered":"Integration of MLOps HPC and CAE HPC using a scheduler"},"content":{"rendered":"<p>I asked ChatGPT about integrating MLOps HPC and CAE HPC using a scheduler.<\/p>\n<h3>(Question) Integration of MLOps HPC and CAE HPC using a scheduler<\/h3>\n<h3>(Answer)<\/h3>\n<p>MLOps HPC is a set of practices that aim to automate and streamline the end-to-end lifecycle of machine learning models, from development to deployment and monitoring, on high-performance computing (HPC) systems. CAE HPC is the use of HPC systems for computer-aided engineering (CAE) applications, such as computational fluid dynamics (CFD), finite element analysis (FEA), and structural optimization.<\/p>\n<p>One of the challenges of integrating MLOps HPC and CAE HPC is managing workload distribution and resource allocation on shared HPC clusters. A scheduler helps with this task by accepting, scheduling, and monitoring jobs on the HPC nodes. Several schedulers can be used for MLOps HPC and CAE HPC, such as:<\/p>\n<ul>\n<li><strong>Slurm<\/strong>: an open-source, scalable, and fault-tolerant scheduler that is widely used for HPC clusters. Slurm can schedule jobs built on GPU-capable linear solver libraries, such as PETSc and AmgX, and can be paired with machine learning platforms, such as Kubeflow, for MLOps.\n<ul>\n<li><a href=\"https:\/\/www.dkube.io\/post\/mlops-on-hpc-slurm-with-kubeflow\">MLOps on HPC\/Slurm with Kubeflow (dkube.io)<\/a><\/li>\n<li><a href=\"https:\/\/www.run.ai\/guides\/slurm\/slurm-vs-lsf-vs-kubernetes-scheduler-which-is-right-for-you\">Slurm vs LSF vs Kubernetes Scheduler: Which is Right for You? 
(run.ai)<\/a><\/li>\n<\/ul>\n<\/li>\n<li><strong>LSF<\/strong>: a commercial, enterprise-grade scheduler that can handle complex and heterogeneous HPC workloads. LSF can integrate with various machine learning platforms, such as IBM Watson Machine Learning Accelerator, and CAE software, such as Ansys Fluent and Siemens Simcenter STAR-CCM+.\n<ul>\n<li><a href=\"https:\/\/www.intel.com\/content\/dam\/www\/central-libraries\/us\/en\/documents\/2022-06\/select-solutions-for-hpc-manufacturing-google-cloud-paper.pdf\">Manufacturing and CAE with Intel Select Solutions for HPC on Google Cloud<\/a><\/li>\n<\/ul>\n<\/li>\n<li><strong>Kubernetes Scheduler<\/strong>: the default scheduler for Kubernetes, a popular platform for containerized applications. The Kubernetes scheduler can place AI\/ML workloads at scale in clusters that use tools like Kubeflow and the NVIDIA GPU Operator. It can also run CAE workloads using tools like OpenFOAM and Singularity.\n<ul>\n<li><a href=\"https:\/\/link.springer.com\/chapter\/10.1007\/978-3-030-88224-2_7\">A HPC Co-scheduler with Reinforcement Learning | SpringerLink<\/a><\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<p>The choice of scheduler depends on the specific requirements and preferences of the MLOps HPC and CAE HPC users. 
Some factors to consider are:<\/p>\n<ul>\n<li><strong>Scalability<\/strong>: how well the scheduler handles large and diverse workloads and resources<\/li>\n<li><strong>Flexibility<\/strong>: how easily the scheduler can be customized and configured to suit different needs and scenarios<\/li>\n<li><strong>Compatibility<\/strong>: how well the scheduler integrates with the existing HPC infrastructure and software<\/li>\n<li><strong>Performance<\/strong>: how quickly and efficiently the scheduler can execute and monitor jobs<\/li>\n<li><strong>Cost<\/strong>: how much the scheduler costs to acquire and maintain<\/li>\n<\/ul>\n<p>&nbsp;<\/p>\n<p>MLops HPC\ub294 HPC(\uace0\uc131\ub2a5 \ucef4\ud4e8\ud305) \uc2dc\uc2a4\ud15c\uc5d0\uc11c \uba38\uc2e0\ub7ec\ub2dd \ubaa8\ub378\uc758 \uac1c\ubc1c\ubd80\ud130 \ubc30\ud3ec \ubc0f \ubaa8\ub2c8\ud130\ub9c1\uc5d0 \uc774\ub974\ub294 \uc804\uccb4 \uc218\uba85\uc8fc\uae30\ub97c \uc790\ub3d9\ud654\ud558\uace0 \uac04\uc18c\ud654\ud558\ub294 \uc77c\ub828\uc758 \uad00\ud589\uc785\ub2c8\ub2e4. CAE HPC\ub294 CFD(\uacc4\uc0b0 \uc720\uccb4 \uc5ed\ud559), FEA(\uc720\ud55c \uc694\uc18c \ubd84\uc11d), \uad6c\uc870 \ucd5c\uc801\ud654 \ub4f1\uc758 \ucef4\ud4e8\ud130 \ubcf4\uc870 \uacf5\ud559(CAE) \uc751\uc6a9 \ud504\ub85c\uadf8\ub7a8\uc744 \uc704\ud574 HPC \uc2dc\uc2a4\ud15c\uc744 \uc0ac\uc6a9\ud558\ub294 \uac83\uc785\ub2c8\ub2e4.<\/p>\n<p>MLops HPC\uc640 CAE HPC\ub97c \ud1b5\ud569\ud558\ub294 \ub3c4\uc804 \uacfc\uc81c \uc911 \ud558\ub098\ub294 \uacf5\uc720 HPC \ud074\ub7ec\uc2a4\ud130\uc5d0\uc11c \uc791\uc5c5 \ubd84\ubc30\uc640 \uc790\uc6d0 \ud560\ub2f9\uc744 \uad00\ub9ac\ud558\ub294 \uac83\uc785\ub2c8\ub2e4. \uc2a4\ucf00\uc904\ub7ec\ub294 HPC \ub178\ub4dc\uc5d0\uc11c \uc791\uc5c5\uc744 \uc218\ub77d, \uc2a4\ucf00\uc904\ub9c1, \ubaa8\ub2c8\ud130\ub9c1\ud558\ub294 \ub3c4\uad6c\ub85c, \uc774 \uc791\uc5c5\uc744 \ub3d5\uc2b5\ub2c8\ub2e4. 
MLops HPC\uc640 CAE HPC\uc5d0 \uc0ac\uc6a9\ud560 \uc218 \uc788\ub294 \ub2e4\uc591\ud55c \uc720\ud615\uc758 \uc2a4\ucf00\uc904\ub7ec\uac00 \uc788\uc2b5\ub2c8\ub2e4. \uc608\ub97c \ub4e4\uba74:<\/p>\n<ul>\n<li><strong>Slurm<\/strong>: HPC \ud074\ub7ec\uc2a4\ud130\uc5d0 \ub110\ub9ac \uc0ac\uc6a9\ub418\ub294 \uc624\ud508\uc18c\uc2a4, \ud655\uc7a5\uc131, \ub0b4\uacb0\ud568\uc131 \uc2a4\ucf00\uc904\ub7ec\uc785\ub2c8\ub2e4. Slurm\uc740 \uc120\ud615 \uc194\ubc84\uc640 \uba38\uc2e0\ub7ec\ub2dd \ud504\ub808\uc784\uc6cc\ud06c\ub97c \uc704\ud55c \ub2e4\uc591\ud55c GPU \ub77c\uc774\ube0c\ub7ec\ub9ac, \uc608\ub97c \ub4e4\uc5b4 PETSc\uc640 AmgX, Kubeflow \ub4f1\uc744 \uc9c0\uc6d0\ud569\ub2c8\ub2e4\n<ul>\n<li><a href=\"https:\/\/www.dkube.io\/post\/mlops-on-hpc-slurm-with-kubeflow\">MLOps on HPC\/Slurm with Kubeflow (dkube.io)<\/a><\/li>\n<li><a href=\"https:\/\/www.run.ai\/guides\/slurm\/slurm-vs-lsf-vs-kubernetes-scheduler-which-is-right-for-you\">Slurm vs LSF vs Kubernetes Scheduler: Which is Right for You? (run.ai)<\/a><\/li>\n<\/ul>\n<\/li>\n<li><strong>LSF<\/strong>: \ubcf5\uc7a1\ud558\uace0 \uc774\uae30\uc885\uc758 HPC \uc791\uc5c5\uc744 \ucc98\ub9ac\ud560 \uc218 \uc788\ub294 \uc0c1\uc5c5\uc6a9, \uc5d4\ud130\ud504\ub77c\uc774\uc988\uae09 \uc2a4\ucf00\uc904\ub7ec\uc785\ub2c8\ub2e4. 
LSF\ub294 IBM Watson Machine Learning Accelerator\uc640 \uac19\uc740 \ub2e4\uc591\ud55c \uba38\uc2e0\ub7ec\ub2dd \ud50c\ub7ab\ud3fc\uacfc Ansys Fluent, Siemens StarCCM+\uc640 \uac19\uc740 CAE \uc18c\ud504\ud2b8\uc6e8\uc5b4\uc640 \ud1b5\ud569\ud560 \uc218 \uc788\uc2b5\ub2c8\ub2e4\n<ul>\n<li><a href=\"https:\/\/www.intel.com\/content\/dam\/www\/central-libraries\/us\/en\/documents\/2022-06\/select-solutions-for-hpc-manufacturing-google-cloud-paper.pdf\">Manufacturing and CAE with Intel Select Solutions for HPC on Google Cloud<\/a><\/li>\n<\/ul>\n<\/li>\n<li><strong>Kubernetes Scheduler<\/strong>: \ucee8\ud14c\uc774\ub108\ud654\ub41c \uc560\ud50c\ub9ac\ucf00\uc774\uc158\uc744 \uc704\ud55c \uc778\uae30\uc788\ub294 \ud50c\ub7ab\ud3fc\uc778 Kubernetes\uc758 \uae30\ubcf8 \uc2a4\ucf00\uc904\ub7ec\uc785\ub2c8\ub2e4. Kubernetes Scheduler\ub294 Kubeflow\uc640 NVIDIA GPU Operator\uc640 \uac19\uc740 \ub3c4\uad6c\ub97c \uc0ac\uc6a9\ud558\uc5ec \ub300\uaddc\ubaa8 AI\/ML \uc791\uc5c5\uc744 \uc2e4\ud589\ud560 \uc218 \uc788\uc2b5\ub2c8\ub2e4. \ub610\ud55c OpenFOAM\uacfc Singularity\uc640 \uac19\uc740 \ub3c4\uad6c\ub97c \uc0ac\uc6a9\ud558\uc5ec CAE \uc791\uc5c5\uc744 \uc2e4\ud589\ud560 \uc218 \uc788\uc2b5\ub2c8\ub2e4.\n<ul>\n<li><a href=\"https:\/\/link.springer.com\/chapter\/10.1007\/978-3-030-88224-2_7\">A HPC Co-scheduler with Reinforcement Learning | SpringerLink<\/a><\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<p>\uc2a4\ucf00\uc904\ub7ec\uc758 \uc120\ud0dd\uc740 MLops HPC\uc640 CAE HPC \uc0ac\uc6a9\uc790\uc758 \ud2b9\uc815 \uc694\uad6c\uc0ac\ud56d\uacfc \uc120\ud638\ub3c4\uc5d0 \ub530\ub77c \ub2ec\ub77c\uc9d1\ub2c8\ub2e4. 
\uace0\ub824\ud574\uc57c \ud560 \uba87 \uac00\uc9c0 \uc694\uc778\uc740 \ub2e4\uc74c\uacfc \uac19\uc2b5\ub2c8\ub2e4:<\/p>\n<ul>\n<li><strong>\ud655\uc7a5\uc131<\/strong>: \uc2a4\ucf00\uc904\ub7ec\uac00 \ud06c\uace0 \ub2e4\uc591\ud55c \uc791\uc5c5\uacfc \uc790\uc6d0\uc744 \uc5bc\ub9c8\ub098 \uc798 \ucc98\ub9ac\ud560 \uc218 \uc788\ub294\uc9c0<\/li>\n<li><strong>\uc720\uc5f0\uc131<\/strong>: \uc2a4\ucf00\uc904\ub7ec\ub97c \ub2e4\ub978 \uc694\uad6c\uc0ac\ud56d\uacfc \uc2dc\ub098\ub9ac\uc624\uc5d0 \ub9de\uac8c \uc5bc\ub9c8\ub098 \uc27d\uac8c \uc0ac\uc6a9\uc790 \uc815\uc758\ud558\uace0 \uc124\uc815\ud560 \uc218 \uc788\ub294\uc9c0<\/li>\n<li><strong>\ud638\ud658\uc131<\/strong>: \uc2a4\ucf00\uc904\ub7ec\uac00 \uae30\uc874\uc758 HPC \uc778\ud504\ub77c\uc640 \uc18c\ud504\ud2b8\uc6e8\uc5b4\uc640 \uc5bc\ub9c8\ub098 \uc798 \ud1b5\ud569\ud560 \uc218 \uc788\ub294\uc9c0<\/li>\n<li><strong>\uc131\ub2a5<\/strong>: \uc2a4\ucf00\uc904\ub7ec\uac00 \uc791\uc5c5\uc744 \uc5bc\ub9c8\ub098 \ube60\ub974\uace0 \ud6a8\uc728\uc801\uc73c\ub85c \uc2e4\ud589\ud558\uace0 \ubaa8\ub2c8\ud130\ub9c1\ud560 \uc218 \uc788\ub294\uc9c0<\/li>\n<li><strong>\ube44\uc6a9<\/strong>: \uc2a4\ucf00\uc904\ub7ec\ub97c \uad6c\uc785\ud558\uace0 \uc720\uc9c0\ud558\ub294 \ub370 \ub4dc\ub294 \ube44\uc6a9\uc740 \uc5bc\ub9c8\uc778\uc9c0<\/li>\n<\/ul>\n<p>&nbsp;<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Ask to ChatGPT about the integration of MLops HPC and CAE HPC based on the scheduler. 
(Question) Integration of MLops HPC and CAE HPC using [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":2341,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[409,420,405],"tags":[100,427,430],"_links":{"self":[{"href":"https:\/\/gotocloud.co.kr\/index.php?rest_route=\/wp\/v2\/posts\/2338"}],"collection":[{"href":"https:\/\/gotocloud.co.kr\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/gotocloud.co.kr\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/gotocloud.co.kr\/index.php?rest_route=\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/gotocloud.co.kr\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=2338"}],"version-history":[{"count":1,"href":"https:\/\/gotocloud.co.kr\/index.php?rest_route=\/wp\/v2\/posts\/2338\/revisions"}],"predecessor-version":[{"id":2340,"href":"https:\/\/gotocloud.co.kr\/index.php?rest_route=\/wp\/v2\/posts\/2338\/revisions\/2340"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/gotocloud.co.kr\/index.php?rest_route=\/wp\/v2\/media\/2341"}],"wp:attachment":[{"href":"https:\/\/gotocloud.co.kr\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=2338"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/gotocloud.co.kr\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=2338"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/gotocloud.co.kr\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=2338"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}