Frontier trained a ChatGPT-sized large language model with only 3,000 of its 37,888 Radeon GPUs — the world's fastest supercomputer blasts through one trillion parameter model with only 8 percent of its MI250X GPUs

Frontier (Image credit: ORNL)

Researchers at Oak Ridge National Laboratory trained a large language model (LLM) the size of ChatGPT on the Frontier supercomputer and only needed 3,072 of its 37,888 GPUs to do it. The team published a research paper that details how it pulled off the feat and the challenges it faced along the way.

The Frontier supercomputer is equipped with 9,472 Epyc 7A53 CPUs and 37,888 Radeon Instinct MI250X GPUs. However, the team only used 3,072 GPUs to train an LLM with one trillion parameters and 1,024 to train another LLM with 175 billion parameters.

The paper notes that the key challenge in training such a large LLM is the amount of memory required, which was 14 terabytes at minimum. This meant spreading the model across many MI250X GPUs with 64GB of VRAM each, which introduced a new problem: parallelism. Throwing more GPUs at an LLM requires increasingly efficient communication between them to actually put the extra resources to work; otherwise, most or all of that additional GPU horsepower is wasted.
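As a rough back-of-the-envelope sketch (using only the figures quoted above, not the paper's own accounting, and ignoring optimizer state and activation overhead), the memory requirement alone puts a floor under the GPU count:

```python
# Hypothetical memory-floor estimate; figures taken from the article above.
MIN_MODEL_MEMORY_TB = 14   # minimum memory cited for the 1-trillion-parameter model
GPU_MEMORY_GB = 64         # VRAM per GPU as cited in the article

min_gpus = MIN_MODEL_MEMORY_TB * 1024 / GPU_MEMORY_GB
print(f"Memory floor: {min_gpus:.0f} GPUs")  # -> 224 GPUs

# The team used 3,072 GPUs, far above this floor, because the parallelism
# strategy and communication efficiency, not raw capacity, determine how
# many GPUs can be kept productively busy.
```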

The research paper dives into the details of exactly how these computer engineers did it, but the short version is that they iterated on frameworks like Megatron-DeepSpeed and FSDP, tuning them so that the training program ran more efficiently on Frontier. In the end, the results were impressive: weak scaling efficiency stood at 100%, meaning that as the workload grew in proportion to the GPU count, the extra GPUs delivered essentially their full share of additional throughput.
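For context, the snippet below is a minimal sketch of the kind of sharded training that frameworks like FSDP provide: parameters, gradients, and optimizer state are split across processes so that no single GPU has to hold the whole model. It uses PyTorch's public FSDP API with a placeholder model and made-up sizes; it is not the ORNL team's training code, and the actual work layers Megatron-style tensor and pipeline parallelism on top of ideas like this.

```python
# Minimal FSDP sketch (illustrative; not the Frontier training code).
# Launch with: torchrun --nproc_per_node=<gpus> fsdp_sketch.py
import os
import torch
import torch.nn as nn
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

def main():
    # One process per GPU; torchrun provides RANK / LOCAL_RANK / WORLD_SIZE.
    dist.init_process_group(backend="nccl")  # maps to RCCL on ROCm builds
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Placeholder stack of layers standing in for a transformer decoder.
    model = nn.Sequential(*[nn.Linear(4096, 4096) for _ in range(8)]).cuda()

    # FSDP shards parameters, gradients, and optimizer state across ranks.
    model = FSDP(model)
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

    for _ in range(10):  # dummy training loop on random data
        x = torch.randn(8, 4096, device="cuda")
        loss = model(x).pow(2).mean()
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```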

Meanwhile, strong scaling efficiency was slightly lower at 89% for the 175 billion parameter LLM and 87% for the one trillion parameter LLM. Strong scaling refers to increasing processor count without changing the size of the workload, and this tends to be where higher core counts become less useful, according to Amdahl's law. Even 87% is a decent result, given how many GPUs they used.
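For readers who want the definitions behind those percentages, here is a small illustrative calculation. The formulas are the standard ones for scaling efficiency; the step times and GPU counts below are invented purely to show how a roughly 89% figure can arise, and are not measurements from the paper.

```python
# Standard scaling-efficiency definitions with made-up timings (illustrative only).

def weak_scaling_efficiency(t_base: float, t_scaled: float) -> float:
    """Workload grows with GPU count; ideal is constant time per step."""
    return t_base / t_scaled

def strong_scaling_efficiency(t_base: float, n_base: int,
                              t_scaled: float, n_scaled: int) -> float:
    """Workload stays fixed; ideal is time shrinking linearly with GPU count."""
    return (t_base * n_base) / (t_scaled * n_scaled)

# Hypothetical example: a fixed workload takes 100 s/step on 1,024 GPUs and
# 56.2 s/step on 2,048 GPUs, giving roughly 89% strong scaling efficiency.
print(f"{strong_scaling_efficiency(100.0, 1024, 56.2, 2048):.0%}")  # -> 89%

# Hypothetical example: doubling both the GPUs and the workload leaves the
# step time unchanged, which is 100% weak scaling efficiency.
print(f"{weak_scaling_efficiency(100.0, 100.0):.0%}")               # -> 100%
```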

However, the team noted some issues achieving this efficiency on Frontier, stating "there needs to be more work exploring efficient training performance on AMD GPUs, and the ROCm platform is sparse." As the paper says, most machine learning at this scale is done within Nvidia's CUDA hardware-software ecosystem, which leaves AMD's and Intel's solutions underdeveloped by comparison. Naturally, efforts like these will foster the development of these ecosystems. 

Nevertheless, the fastest supercomputer in the world continues to be Frontier, with its all-AMD hardware. In second place stands Aurora with its purely Intel hardware, including GPUs, though at the moment, only half of it has been used for benchmark submissions. Nvidia GPUs power the third fastest supercomputer, Eagle. If AMD and Intel want to keep the rankings this way, the two companies will need to catch up to Nvidia's software solutions.

Matthew Connatser

Matthew Connatser is a freelance writer for Tom's Hardware US. He writes articles about CPUs, GPUs, SSDs, and computers in general.

  • peachpuff
    Soon the whole article will be in the headline...
  • hotaru251
any developing tech should be focused on open-source tech, not proprietary; if one day Nvidia just decided "we are no longer making it," you're out to sea w/o a paddle.
  • rluker5
    Frontier needs 3072 GPUs for a trillion parameter model, Aurora needs 384 GPUs (64 nodes*6GPUs apiece) for a trillion parameter model: https://www.tomshardware.com/news/intel-supercomputing-2023-aurora-xeon-max-gpu-gaudi-granite-rapids
    And Intel is in second place.

    More specifics are probably needed for an accurate comparison, but AMD looks to be trailing very badly. I wonder how much more power Frontier had to consume to perform the same task?
  • H4UnT3R
How long did it take to train?
  • jthill
    rluker5 said:
    Frontier needs 3072 GPUs for a trillion parameter model, Aurora needs 384 GPUs (64 nodes*6GPUs apiece) for a trillion parameter model: https://www.tomshardware.com/news/intel-supercomputing-2023-aurora-xeon-max-gpu-gaudi-granite-rapids
    And Intel is in second place.

    More specifics are probably needed for an accurate comparison, but AMD looks to be trailing very badly. I wonder how much more power Frontier had to consume to perform the same task?
    How many GPUs seems less relevant than how much time is needed.

And according to Wikipedia, Aurora

    has around 10 petabytes of memory and 230 petabytes of storage. The machine is estimated to consume around 60 MW of power. For comparison, the fastest computer in the world today, Frontier uses 21 MW while Summit uses 13 MW.
  • hotaru251
    rluker5 said:
    Frontier needs 3072 GPUs for a trillion parameter model, Aurora needs 384 GPUs (64 nodes*6GPUs apiece) for a trillion parameter model: https://www.tomshardware.com/news/intel-supercomputing-2023-aurora-xeon-max-gpu-gaudi-granite-rapids
    And Intel is in second place.

    More specifics are probably needed for an accurate comparison, but AMD looks to be trailing very badly. I wonder how much more power Frontier had to consume to perform the same task?
From what I took from the article, they only used as many GPUs as the memory needed (more GPUs weren't beneficial, but they still needed the memory).
  • bit_user
    H4UnT3R said:
How long did it take to train?
    Exactly. The most crucial piece of information was omitted. I glanced at the paper, but a quick search didn't find any unit of time. Maybe someone wants to have a closer read?
    https://arxiv.org/pdf/2312.12705.pdf
    rluker5 said:
    AMD looks to be trailing very badly. I wonder how much more power Frontier had to consume to perform the same task?
    How can you say they're trailing, when you don't know how much time (or power) either used? If Intel used 1/10th the GPUs, but took 20x as long and used more power in the process, can we really count that as a win?
  • bit_user
    hotaru251 said:
more GPUs weren't beneficial, but they still needed the memory
Given that GPT-3 was rumored to take 10k A100 GPUs an entire month to train, I don't see how you can claim more GPUs aren't beneficial.

    BTW, I'm sure GPT-3 was trained on a far larger dataset, which would explain the apparent contradiction in compute power used for that vs. these examples.
  • rluker5
    bit_user said:
    Exactly. The most crucial piece of information was omitted. I glanced at the paper, but a quick search didn't find any unit of time. Maybe someone wants to have a closer read?
    https://arxiv.org/pdf/2312.12705.pdf

    How can you say they're trailing, when you don't know how much time (or power) either used? If Intel used 1/10th the GPUs, but took 20x as long and used more power in the process, can we really count that as a win?
    That is all true, which is why I said more specifics are probably needed for an accurate comparison.
But this article, since it lacks any information beyond the number of GPUs used, makes it look like AMD needs 8x as many and makes AMD look like it is trailing badly.

We don't know that to be the case. We don't know how long it took, whether the models were equally complex, or even whether the "training" was done to the same standards.

But it's still entertaining to put that in perspective against the headline's loaded language: "Frontier trained a ChatGPT-sized large language model with only 3,000 of its 37,888 Radeon GPUs — the world's fastest supercomputer blasts through one trillion parameter model with only 8 percent of its MI250X GPUs"

    When Intel used 0.64% of Aurora's GPUs to do what sounds like the same thing a few months back.

Since there is not enough information to even know which GPU is faster at this point, I'll just gloat about throwing egg at the baiting headline.
  • DavidC1
    rluker5 said:
    Frontier needs 3072 GPUs for a trillion parameter model, Aurora needs 384 GPUs (64 nodes*6GPUs apiece) for a trillion parameter model: https://www.tomshardware.com/news/intel-supercomputing-2023-aurora-xeon-max-gpu-gaudi-granite-rapids
    And Intel is in second place.

    More specifics are probably needed for an accurate comparison, but AMD looks to be trailing very badly. I wonder how much more power Frontier had to consume to perform the same task?
Each MI250X has 383 TOPS of Int8 capability while each Data Center GPU 1550 has 1,678 TOPS of Int8, meaning more than 4x as much per GPU. It's 500W for the MI250X and 600W for the GPU 1550. The Intel GPU also has 408MB of "Rambo Cache" while the MI250X has only 16MB of L2, which will have real-world benefits.

    Just by that metric it's like having 3072 Frontier GPUs versus 1536 Aurora GPUs.

Intel's GPU also has twice the number of transistors and is much harder to fabricate. One would think it would be faster at something.