
Multi-account support for Amazon SageMaker HyperPod task governance


GPUs are a precious resource; they are in short supply and much more costly than traditional CPUs. They are also highly adaptable to many different use cases. Organizations building or adopting generative AI use GPUs to run simulations, run inference (for both internal and external usage), build agentic workloads, and run data scientists’ experiments. The workloads range from ephemeral single-GPU experiments run by scientists to long multi-node continuous pre-training runs. Many organizations need to share a centralized, high-performance GPU computing infrastructure across different teams, business units, or accounts within their organization. With this infrastructure, they can maximize the utilization of expensive accelerated computing resources like GPUs, rather than having siloed infrastructure that might be underutilized.

Organizations also use multiple AWS accounts for their users. Larger enterprises might want to separate different business units, teams, or environments (production, staging, development) into different AWS accounts. This provides more granular control and isolation between these different parts of the organization. It also makes it straightforward to track and allocate cloud costs to the appropriate teams or business units for better financial oversight.

The specific reasons and setup can vary depending on the size, structure, and requirements of the enterprise, but in general, a multi-account strategy provides greater flexibility, security, and manageability for large-scale cloud deployments. In this post, we discuss how an enterprise with multiple accounts can access a shared Amazon SageMaker HyperPod cluster for running their heterogeneous workloads. We use SageMaker HyperPod task governance to enable this feature.

Solution overview

SageMaker HyperPod task governance streamlines resource allocation and provides cluster administrators the capability to set up policies to maximize compute utilization in a cluster. Task governance can be used to create distinct teams with their own unique namespace, compute quotas, and borrowing limits. In a multi-account setting, you can restrict which accounts have access to which team’s compute quota using role-based access control.

In this post, we describe the settings required to set up multi-account access for SageMaker HyperPod clusters orchestrated by Amazon Elastic Kubernetes Service (Amazon EKS) and how to use SageMaker HyperPod task governance to allocate accelerated compute to multiple teams in different accounts.

The following diagram illustrates the solution architecture.

In this architecture, one organization is splitting resources across a few accounts. Account A hosts the SageMaker HyperPod cluster. Account B is where the data scientists reside. Account C is where the data is prepared and stored for training usage. In the following sections, we demonstrate how to set up multi-account access so that data scientists in Account B can train a model on Account A’s SageMaker HyperPod and EKS cluster, using the preprocessed data stored in Account C. We break down this setup in two sections: cross-account access for data scientists and cross-account access for prepared data.

Cross-account access for data scientists

When you create a compute allocation with SageMaker HyperPod task governance, your EKS cluster creates a unique Kubernetes namespace per team. For this walkthrough, we create an AWS Identity and Access Management (IAM) role per team, called cluster access roles, each scoped to access only its team’s task governance-generated namespace in the shared EKS cluster. Role-based access control is how we make sure the data science members of Team A can’t submit tasks on behalf of Team B.

To access Account A’s EKS cluster as a user in Account B, you will need to assume a cluster access role in Account A. The cluster access role will have only the needed permissions for data scientists to access the EKS cluster. For an example of IAM roles for data scientists using SageMaker HyperPod, see IAM users for scientists.
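As a minimal sketch (not the complete policy from that example), the permissions policy on the cluster access role might only need eks:DescribeCluster, which aws eks update-kubeconfig calls when generating a kubeconfig entry; the Region and cluster name placeholders here are illustrative:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "eks:DescribeCluster",
      "Resource": "arn:aws:eks:<region>:XXXXXXXXXXAAA:cluster/<cluster-name>"
    }
  ]
}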

Next, you will need to assume the cluster access role from a role in Account B, which we call the data scientist role. The cluster access role in Account A will then need a trust policy that allows the data scientist role in Account B to assume it. The following code is an example of the policy statement for the data scientist role so that it can assume the cluster access role in Account A:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "sts:AssumeRole",
      "Resource": "arn:aws:iam::XXXXXXXXXXAAA:role/ClusterAccessRole"
    }
  ]
}

The following code is an example of the trust policy for the cluster access role so that it allows the data scientist role to assume it:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::XXXXXXXXXXBBB:role/DataScientistRole"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}
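With both policies in place, a data scientist in Account B can assume the cluster access role and then point kubectl at the shared cluster. The following is a sketch; the session name, cluster name, and Region are illustrative placeholders:

# Assume the cluster access role in Account A from the data scientist role in Account B
aws sts assume-role \
    --role-arn arn:aws:iam::XXXXXXXXXXAAA:role/ClusterAccessRole \
    --role-session-name team-a-data-scientist

# After exporting the returned temporary credentials, generate a kubeconfig entry
aws eks update-kubeconfig \
    --name <cluster-name> \
    --region <region>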

The final step is to create an access entry for the team’s cluster access role in the EKS cluster. This access entry should also have an access policy, such as AmazonEKSEditPolicy, that is scoped to the team’s namespace, as in the sketch that follows. This makes sure that Team A users in Account B can’t launch tasks outside of their assigned namespace. You can also optionally set up custom role-based access control; see Setting up Kubernetes role-based access control for more information.
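One way to create the access entry and the namespace-scoped policy association is with the AWS CLI; this sketch assumes the team’s namespace is hyperpod-ns-team-a, and the cluster name is a placeholder:

# Register the cluster access role as a principal in the EKS cluster
aws eks create-access-entry \
    --cluster-name <cluster-name> \
    --principal-arn arn:aws:iam::XXXXXXXXXXAAA:role/ClusterAccessRole

# Scope the role to edit-level access within the team's namespace only
aws eks associate-access-policy \
    --cluster-name <cluster-name> \
    --principal-arn arn:aws:iam::XXXXXXXXXXAAA:role/ClusterAccessRole \
    --policy-arn arn:aws:eks::aws:cluster-access-policy/AmazonEKSEditPolicy \
    --access-scope type=namespace,namespaces=hyperpod-ns-team-a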

For users in Account B, you can repeat the same setup for each team. Create a unique cluster access role for each team so that each role is aligned with its team’s associated namespace. To summarize, we use two different IAM roles:

  • Data scientist role – The role in Account B used to assume the cluster access role in Account A. This role just needs to be able to assume the cluster access role.
  • Cluster access role – The role in Account A used to give access to the EKS cluster. For an example, see IAM role for SageMaker HyperPod.

Cross-account access to prepared data

In this section, we demonstrate how to set up EKS Pod Identity and S3 Access Points so that pods running training tasks in Account A’s EKS cluster have access to data stored in Account C. EKS Pod Identity allows you to map an IAM role to a service account in a namespace. If a pod uses a service account that has this association, Amazon EKS sets the AWS credentials environment variables in the containers of the pod.

S3 Access Points are named network endpoints that simplify data access for shared datasets in S3 buckets. They act as a way to grant fine-grained access control to specific users or applications accessing a shared dataset within an S3 bucket, without requiring those users or applications to have full access to the entire bucket. Permissions to an access point are granted through access point policies. Each S3 Access Point is configured with an access policy specific to a use case or application. Because the SageMaker HyperPod cluster in this post can be used by multiple teams, each team could have its own S3 access point and access point policy.

Before following these steps, make sure you have the Amazon EKS Pod Identity Agent add-on installed on your EKS cluster.
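If it isn’t installed yet, one way to add it is with the AWS CLI (the cluster name is a placeholder):

aws eks create-addon \
    --cluster-name <cluster-name> \
    --addon-name eks-pod-identity-agent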

  1. In Account A, create an IAM role that contains S3 permissions (such as s3:ListBucket and s3:GetObject on the access point resource) and has a trust relationship with EKS Pod Identity; this will be your data access role. The following is an example trust policy:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowEksAuthToAssumeRoleForPodIdentity",
      "Effect": "Allow",
      "Principal": {
        "Service": "pods.eks.amazonaws.com"
      },
      "Action": [
        "sts:AssumeRole",
        "sts:TagSession"
      ]
    }
  ]
}
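The permissions half of the role could look like the following sketch, scoped to the Account C access point; the Region, account ID, and access point name are illustrative placeholders:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:ListBucket",
        "s3:GetObject"
      ],
      "Resource": [
        "arn:aws:s3:<region>:<account-c-id>:accesspoint/<access-point-name>",
        "arn:aws:s3:<region>:<account-c-id>:accesspoint/<access-point-name>/object/*"
      ]
    }
  ]
}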
  2. In Account C, create an S3 access point by following the steps in the Amazon S3 documentation.
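For example, with the AWS CLI (the account ID, access point name, and bucket name are placeholders):

aws s3control create-access-point \
    --account-id <account-c-id> \
    --name <access-point-name> \
    --bucket <bucket-name>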
  3. Configure your S3 access point to allow access to the role created in step 1. The following example access point policy gives the data access role in Account A permission to the access point in Account C:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::<account-a-id>:role/<data-access-role-name>"
      },
      "Action": [
        "s3:ListBucket",
        "s3:GetObject"
      ],
      "Resource": [
        "arn:aws:s3:<region>:<account-c-id>:accesspoint/<access-point-name>",
        "arn:aws:s3:<region>:<account-c-id>:accesspoint/<access-point-name>/object/*"
      ]
    }
  ]
}
  4. Update your S3 bucket policy to allow Account A access. The following example bucket policy allows any request made through an access point owned by Account C, which includes requests from Account A’s data access role:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": "*",
      "Action": [
        "s3:GetObject",
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:::<bucket-name>",
        "arn:aws:s3:::<bucket-name>/*"
      ],
      "Condition": {
        "StringEquals": {
          "s3:DataAccessPointAccount": "<account-c-id>"
        }
      }
    }
  ]
}
  5. In Account A, create a pod identity association for your EKS cluster using the AWS CLI:
aws eks create-pod-identity-association \
    --cluster-name <cluster-name> \
    --role-arn arn:aws:iam::<account-a-id>:role/<data-access-role-name> \
    --namespace hyperpod-ns-team-a \
    --service-account my-service-account

  6. Reference the service account name in the specification of pods that need access to the cross-account S3 bucket, as in the sketch that follows.
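A minimal pod specification sketch that picks up the association above; the pod name matches the test pod used next, and the image and command are illustrative:

apiVersion: v1
kind: Pod
metadata:
  name: aws-test
  namespace: hyperpod-ns-team-a
spec:
  # Must match the service account in the pod identity association
  serviceAccountName: my-service-account
  containers:
    - name: aws-cli
      image: amazon/aws-cli:latest
      # Keep the container running so we can exec into it
      command: ["sleep", "3600"]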

You can test cross-account data access by spinning up a test pod and then executing into the pod to run Amazon S3 commands:

kubectl exec -it aws-test -n hyperpod-ns-team-a -- aws s3 ls s3://<access-point-alias>

This example shows creating a single data access role for a single team. For multiple teams, use a namespace-specific ServiceAccount with its own data access role to help prevent overlapping resource access across teams. You can also configure cross-account Amazon S3 access for an Amazon FSx for Lustre file system in Account A, as described in Use Amazon FSx for Lustre to share Amazon S3 data across accounts. FSx for Lustre and Amazon S3 will need to be in the same AWS Region, and the FSx for Lustre file system will need to be in the same Availability Zone as your SageMaker HyperPod cluster.

Conclusion

In this post, we provided guidance on how to set up cross-account access for data scientists accessing a centralized SageMaker HyperPod cluster orchestrated by Amazon EKS. In addition, we covered how to provide Amazon S3 data access from one account to an EKS cluster in another account. With SageMaker HyperPod task governance, you can restrict access and compute allocation to specific teams. This architecture can be used at scale by organizations wanting to share a large compute cluster across accounts within their organization. To get started with SageMaker HyperPod task governance, refer to the Amazon EKS Support in Amazon SageMaker HyperPod workshop and the SageMaker HyperPod task governance documentation.


About the Authors

Nisha Nadkarni is a Senior GenAI Specialist Solutions Architect at AWS, where she guides companies through best practices when deploying large scale distributed training and inference on AWS. Prior to her current role, she spent several years at AWS focused on helping emerging GenAI startups develop models from ideation to production.

Anoop Saha is a Sr GTM Specialist at Amazon Web Services (AWS) focusing on generative AI model training and inference. He partners with top frontier model builders, strategic customers, and AWS service teams to enable distributed training and inference at scale on AWS and lead joint GTM motions. Before AWS, Anoop held several leadership roles at startups and large corporations, primarily focusing on silicon and system architecture of AI infrastructure.

Kareem Syed-Mohammed is a Product Manager at AWS. He is focused on compute optimization and cost governance. Prior to this, at Amazon QuickSight, he led embedded analytics, and developer experience. In addition to QuickSight, he has been with AWS Marketplace and Amazon retail as a Product Manager. Kareem started his career as a developer for call center technologies, Local Expert and Ads for Expedia, and management consultant at McKinsey.

Rajesh Ramchander is a Principal ML Engineer in Professional Services at AWS. He helps customers at various stages in their AI/ML and GenAI journey, from those that are just getting started all the way to those that are leading their business with an AI-first strategy.

