
Host ML models on Amazon SageMaker using Triton: TensorRT models

AWS Machine Learning Blog

With kernel auto-tuning, the engine selects the best algorithm for the target GPU, maximizing hardware utilization. The post covers two steps: generating serialized engines from models, and loading the TensorRT engine into Triton Inference Server, where it is used to perform inference on incoming requests.
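As a rough illustration of those two steps, here is a minimal sketch that builds a serialized engine from an ONNX model with the TensorRT Python API. The file names (model.onnx, model.plan), the 1 GiB workspace limit, and the model repository path in the comments are assumptions for illustration, not details taken from the post.

import tensorrt as trt

# Build a serialized TensorRT engine ("plan") from an ONNX model.
# Kernel auto-tuning happens during this build: TensorRT times
# candidate kernels on the target GPU and keeps the fastest ones.
logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
)

parser = trt.OnnxParser(network, logger)
with open("model.onnx", "rb") as f:  # assumed input model file
    if not parser.parse(f.read()):
        raise RuntimeError(parser.get_error(0))

config = builder.create_builder_config()
# Workspace size is an arbitrary choice here (1 GiB).
config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, 1 << 30)

serialized_engine = builder.build_serialized_network(network, config)

# To serve this with Triton, place the plan file in a model repository,
# e.g. model_repository/<model_name>/1/model.plan, next to a config.pbtxt
# that declares platform: "tensorrt_plan".
with open("model.plan", "wb") as f:
    f.write(serialized_engine)

Because kernel timing runs on the machine doing the build, the engine should be built on the same GPU type that Triton will serve from.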


Training large language models on Amazon SageMaker: Best practices

AWS Machine Learning Blog

Storage – We see data loading and checkpointing done in two ways, depending on skills and preferences: with an Amazon FSx for Lustre file system, or with Amazon Simple Storage Service (Amazon S3) only. Resiliency – At scale, hardware failures can happen; in the case of the SageMaker Training API, the compute fleet can be heterogeneous.
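To make the FSx for Lustre option concrete, here is a minimal sketch using the SageMaker Python SDK to point a training job at a Lustre file system. The file system ID, mount path, role ARN, instance settings, and VPC identifiers are hypothetical placeholders, not values from the post.

from sagemaker.inputs import FileSystemInput
from sagemaker.pytorch import PyTorch

# Training data served from FSx for Lustre instead of plain S3.
# All identifiers below are placeholders.
train_input = FileSystemInput(
    file_system_id="fs-0123456789abcdef0",
    file_system_type="FSxLustre",
    directory_path="/abcdef/train",  # Lustre mount name + path
    file_system_access_mode="ro",
)

estimator = PyTorch(
    entry_point="train.py",
    role="arn:aws:iam::111122223333:role/SageMakerExecutionRole",
    instance_type="ml.p4d.24xlarge",
    instance_count=2,
    framework_version="2.0",
    py_version="py310",
    # A file system input requires the job to run inside your VPC.
    subnets=["subnet-0123456789abcdef0"],
    security_group_ids=["sg-0123456789abcdef0"],
)

estimator.fit({"train": train_input})

For the S3-only approach, the same estimator can instead take s3:// URIs in fit(), and checkpointing to S3 can be enabled via the checkpoint_s3_uri parameter, which SageMaker syncs from the container's local checkpoint directory.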