
Host ML models on Amazon SageMaker using Triton: TensorRT models

AWS Machine Learning Blog

With kernel auto-tuning, TensorRT times candidate kernel implementations and selects the best algorithm for the target GPU, maximizing hardware utilization. Overall, TensorRT’s combination of techniques results in faster inference and lower latency compared to other inference engines. These optimizations are applied when the engine is built, and the resulting optimized engine is then used during the inference step (see the build sketch below).

ML 88
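
The snippet above describes behavior that happens at engine-build time. As a minimal sketch (not the AWS post's exact code), the following Python builds a TensorRT engine from an ONNX model; kernel auto-tuning runs inside build_serialized_network, where TensorRT benchmarks candidate kernels on the attached GPU and keeps the fastest tactics. The file names and workspace size are illustrative assumptions.

```python
import tensorrt as trt

# Assumes TensorRT 8.x+ and an ONNX model at "model.onnx" (illustrative path).
logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
)
parser = trt.OnnxParser(network, logger)

with open("model.onnx", "rb") as f:
    if not parser.parse(f.read()):
        for i in range(parser.num_errors):
            print(parser.get_error(i))
        raise RuntimeError("Failed to parse the ONNX model")

config = builder.create_builder_config()
# Workspace memory is used while timing candidate kernels (tactics).
config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, 1 << 30)  # 1 GiB

# Kernel auto-tuning happens here: TensorRT benchmarks implementations
# on the current GPU and serializes the fastest ones into the plan.
serialized_engine = builder.build_serialized_network(network, config)
with open("model.plan", "wb") as f:
    f.write(serialized_engine)
```

Because tactic selection is benchmarked on the GPU that runs the build, the resulting plan file is tied to that GPU architecture and is typically rebuilt when deploying to a different instance type.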

The NLP Cypher | 02.14.21

Towards AI

Their core repos on GitHub consist of SparseML, a toolkit of APIs, CLIs, scripts, and libraries that apply optimization algorithms such as pruning and quantization to any neural network; DeepSparse, a CPU inference engine for sparse models; and SparseZoo, a model repo for sparse models (see the DeepSparse sketch below).

NLP 96
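
As a rough companion sketch to the Neural Magic description above (not code from the newsletter), the following Python compiles a sparse ONNX model with DeepSparse and runs a single batch on CPU. The model path and input shape are hypothetical placeholders; a real model would come from SparseZoo or be produced by pruning and quantizing with SparseML.

```python
import numpy as np
from deepsparse import compile_model

# Hypothetical sparse ONNX model and input shape, for illustration only.
onnx_path = "sparse_resnet50.onnx"
batch_size = 1

# DeepSparse compiles the sparse graph into an engine optimized for CPU inference.
engine = compile_model(onnx_path, batch_size=batch_size)

# Inputs are passed as a list of NumPy arrays matching the model's input signature.
inputs = [np.random.rand(batch_size, 3, 224, 224).astype(np.float32)]
outputs = engine.run(inputs)
print([o.shape for o in outputs])
```

The speedup DeepSparse achieves on CPU depends on how much sparsity and quantization the model actually contains, which is where SparseML recipes and SparseZoo checkpoints come in.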