Training large language models on Amazon SageMaker: Best practices
AWS Machine Learning Blog
MARCH 6, 2023
Automated checkpoint to Amazon S3 – This helps you checkpoint your progress and reload a past state on new jobs.

Special thanks to Amr Ragab, Rashika Kheria, Zmnako Awrahman, Arun Nagarajan, and Gal Oshri for their helpful reviews and teachings.

Data parallelism degree is k, pipeline parallelism 6, and tensor parallelism 4.
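To make the parallelism arithmetic concrete, here is a minimal sketch. It assumes the common convention that the three degrees multiply to give the total number of GPUs in the job, so a pipeline degree of 6 and a tensor degree of 4 need 24 GPUs per model replica. The function names and the example cluster size are hypothetical and do not come from the post or from any SageMaker API.

```python
def required_gpus(data_parallel: int, pipeline_parallel: int, tensor_parallel: int) -> int:
    """Total GPUs needed, assuming the three degrees multiply together."""
    degrees = {
        "data_parallel": data_parallel,
        "pipeline_parallel": pipeline_parallel,
        "tensor_parallel": tensor_parallel,
    }
    for name, degree in degrees.items():
        if degree < 1:
            raise ValueError(f"{name} must be a positive integer, got {degree}")
    return data_parallel * pipeline_parallel * tensor_parallel


def data_parallel_degree(total_gpus: int, pipeline_parallel: int, tensor_parallel: int) -> int:
    """Solve for the data parallelism degree k given a fixed GPU count."""
    gpus_per_replica = pipeline_parallel * tensor_parallel
    if total_gpus < gpus_per_replica or total_gpus % gpus_per_replica != 0:
        raise ValueError(
            f"{total_gpus} GPUs cannot be split into whole replicas of {gpus_per_replica} GPUs"
        )
    return total_gpus // gpus_per_replica


if __name__ == "__main__":
    # Hypothetical cluster size, chosen only to illustrate the calculation.
    k = data_parallel_degree(total_gpus=48, pipeline_parallel=6, tensor_parallel=4)
    print(f"data parallelism degree k = {k}")
    print(f"total GPUs = {required_gpus(k, 6, 4)}")
```

Under this assumption, k is whatever is left after dividing the cluster's GPU count by 24, which is why the GPU count must be a multiple of 24 for this configuration.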