Facing Nvidia’s Dominance: Agile ML Development Strategies for Non-Big Tech Players (Amid Supply and Cost Challenges) - Unite.AI

Thought Leaders

Facing Nvidia’s Dominance: Agile ML Development Strategies for Non-Big Tech Players (Amid Supply and Cost Challenges)


Building a business alongside the real big players has never been an easy task. In 2023, competition in the AI sector reached unprecedented heights, fueled by real, mind-bending breakthroughs: the release of OpenAI’s GPT-4, the integration of ChatGPT with Bing, Google’s launch of Bard, and Meta’s controversial “open-source” Llama 2 release. It sounds like a long list of big names, right? As exciting as it might sound, the majority of innovation lies where money flows, and the competition smaller tech players have to get through is getting more intense by the day.

In the ever-evolving landscape of the tech industry, Nvidia continues to solidify its position as the key player in AI infrastructure. During an August financial-report teleconference, Jensen Huang, President and CEO of Nvidia, highlighted the soaring demand for Nvidia processors. This claim is backed by revenue data from Nvidia's Q3 Investor Presentation, which reveals record year-on-year growth through November. Meanwhile, Gartner's projections indicate a significant uptick in chip spending over the next four years. At present, Nvidia's software stack and processors stand unrivaled, leaving the industry uncertain about when a credible competitor might emerge.

Recent reports from Bloomberg and the Financial Times shed light on negotiations between Sam Altman, the CEO of OpenAI, and Middle-Eastern investors to initiate chip production, aiming to reduce the AI sector's reliance on Nvidia chips. Challenging Nvidia, whose market capitalization is approaching $1.5 trillion, is likely to cost Altman between $5 trillion and $7 trillion and take several years.

Nevertheless, addressing the cost-effectiveness of ML models for business is something companies have to do now. For businesses beyond the realms of big tech, developing cost-efficient ML models is more than just a business process — it's a vital survival strategy. This article explores four pragmatic strategies that empower businesses of all sizes to develop their models without extensive R&D investments and remain flexible to avoid vendor lock-in.

Why Nvidia Dominates the AI Market

Long story short, Nvidia has created the ideal model training workflow by achieving synergy between high-performance GPUs and its proprietary model training software stack, the widely acclaimed CUDA toolkit.

CUDA (introduced in 2007) is a comprehensive parallel computing toolkit and API for optimally utilizing Nvidia GPUs. The main reason it's so popular is its unmatched capability for accelerating the complex mathematical computations crucial to deep learning. Additionally, it offers a rich ecosystem of libraries such as cuDNN for deep neural networks, enhancing performance and ease of use. It's essential for developers due to its seamless integration with major deep learning frameworks, enabling rapid model development and iteration.

The combination of such a robust software stack with highly efficient hardware has proven to be the key to capturing the market. While some argue that Nvidia's dominance may be a temporary phenomenon, it's hard to make such predictions in the current landscape.

The Heavy Toll of Nvidia's Dominance

Nvidia's upper hand in the machine learning development field has raised numerous concerns, not only in the ethical realm but also with regard to widening research and development budget disparities, one of the reasons breaking into the market has become exponentially harder for smaller players, let alone startups. Add in the decline in investor interest due to higher risks, and acquiring R&D investments on the scale of Nvidia's becomes outright impossible, creating a very uneven playing field.

Yet this heavy reliance on Nvidia’s hardware puts even more pressure on supply chain consistency, opening up the risk of disruptions and vendor lock-in, reducing market flexibility, and escalating market entry barriers.

“Some are pooling cash to ensure that they won’t be leaving users in the lurch. Everywhere, engineering terms like ‘optimization' and ‘smaller model size' are in vogue as companies try to cut their GPU needs, and investors this year have bet hundreds of millions of dollars on startups whose software helps companies make do with the GPUs they’ve got.”

From “Nvidia Chip Shortages Leave AI Startups Scrambling for Computing Power” by Paresh Dave

Now is the time to adopt strategic approaches, since this may be the very thing that will give your enterprise the chance to thrive amidst Nvidia’s far-reaching influence in ML development.

Strategies Non-Big Tech Players Can Adopt in Response to Nvidia's Dominance:

1. Start exploring AMD's ROCm

AMD has been actively narrowing its AI development gap with Nvidia, thanks in large part to consistent ROCm support in PyTorch's main releases over the past year. This ongoing effort has resulted in improved compatibility and performance, showcased prominently by the MI300, AMD's latest accelerator. The MI300 has demonstrated robust performance in large language model (LLM) inference tasks, particularly excelling with models like Llama-70B. This success underscores the significant advancements in processing power and efficiency AMD has achieved.

2. Find other hardware alternatives

In addition to AMD's strides, Google has introduced Tensor Processing Units (TPUs), specialized hardware designed explicitly to accelerate machine learning workloads, offering a robust alternative for training large-scale AI models.

Beyond these industry giants, smaller yet impactful players like Graphcore and Cerebras are making notable contributions to the AI hardware space. Graphcore's Intelligence Processing Unit (IPU), tailored for efficiency in AI computations, has garnered attention for its potential in high-performance tasks, as demonstrated by Twitter's experimentation. Cerebras, on the other hand, is pushing boundaries with its advanced chips, emphasizing scalability and raw computational power for AI applications.

The collective efforts of these companies signify a shift towards a more diverse AI hardware ecosystem. This diversification presents viable strategies to reduce dependence on NVIDIA, providing developers and researchers with a broader range of platforms for AI development.

3. Start investing in performance optimization

In addition to exploring hardware alternatives, optimizing software proves to be a crucial factor in lessening the impact of Nvidia's dominance. By utilizing efficient algorithms, reducing unnecessary computations, and implementing parallel processing techniques, non-big tech players can maximize the performance of their ML models on existing hardware, offering a pragmatic approach to bridging the gap without solely depending on expensive hardware upgrades.
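As a toy illustration of "reducing unnecessary computations" and "parallel processing" (this is not AutoNAC or any specific vendor's method, and the function names are hypothetical), the sketch below caches repeated sub-computations and fans independent work out across threads using only Python's standard library:

```python
from functools import lru_cache
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for an expensive per-token computation;
# in a real workload this would be a heavy tensor operation.
@lru_cache(maxsize=None)
def expensive_feature(token: str) -> int:
    return sum(ord(c) ** 2 for c in token) % 997

def score_batch(tokens: list[str]) -> list[int]:
    # Independent items are scored in parallel; thanks to the cache,
    # repeated tokens cost almost nothing the second time around.
    with ThreadPoolExecutor(max_workers=4) as pool:
        return list(pool.map(expensive_feature, tokens))

if __name__ == "__main__":
    batch = ["the", "model", "the", "the", "model"]  # many repeats
    print(score_batch(batch))
```

The same two levers, eliminating redundant work and exploiting independence between work items, are what production-grade optimizers apply at the level of model graphs and kernels.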

An illustration of this approach is Deci AI's AutoNAC technology, which has demonstrated the ability to accelerate model inference by an impressive factor of 3-10x, as substantiated by the widely recognized MLPerf benchmark. Advances like these make it evident that software optimization can significantly enhance the efficiency of ML development, presenting a viable way to mitigate the influence of Nvidia's dominance in the field.

4. Start collaborating with other organizations to create decentralized clusters

This collaborative approach can involve sharing research findings, jointly investing in alternative hardware options, and fostering the development of new ML technologies through open-source projects. By decentralizing inference and utilizing distributed computing resources, non-big tech players can level the playing field and create a more competitive landscape in the ML development industry.

Today, the strategy of sharing computing resources is gaining momentum across the tech industry. Google Kubernetes Engine (GKE) exemplifies this by supporting cluster multi-tenancy, enabling efficient resource utilization and integration with third-party services. This trend is further evidenced by community-led initiatives such as Petals, which offers a distributed network for running AI models, making high-powered computing accessible without significant investment. Additionally, platforms like Together.ai provide serverless access to a broad array of open-source models, streamlining development and fostering collaboration. Such platforms offer access to computational resources and collaborative development opportunities, helping organizations of any size optimize their development process and reduce costs.
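To make the pooling idea concrete, here is a minimal sketch (assumed names and a simulated RPC, not Petals' or GKE's actual API) of a scheduler that spreads inference requests round-robin across worker nodes contributed by partner organizations:

```python
import itertools
from concurrent.futures import ThreadPoolExecutor

# Hypothetical registry: each entry simulates a GPU node
# contributed by a partner organization to a shared cluster.
WORKERS = ["org-a/gpu0", "org-b/gpu0", "org-c/gpu0"]

def run_inference(worker: str, prompt: str) -> str:
    # Placeholder for a real remote call (e.g. over gRPC or HTTP).
    return f"{worker} -> echo:{prompt}"

def dispatch(prompts: list[str]) -> list[str]:
    # Round-robin assignment spreads load evenly across the pool,
    # and results come back in the order requests were submitted.
    assignments = zip(itertools.cycle(WORKERS), prompts)
    with ThreadPoolExecutor(max_workers=len(WORKERS)) as pool:
        return list(pool.map(lambda a: run_inference(*a), assignments))

if __name__ == "__main__":
    print(dispatch(["hello", "world", "foo", "bar"]))
```

A real deployment would add node discovery, failure handling, and capacity-aware scheduling, but the core idea is the same: no single participant needs to own enough GPUs to serve the whole workload.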

Conclusion 

On a global scale, the necessity for the aforementioned strategies becomes apparent. When one entity dominates the market, it stifles development and hinders the establishment of reasonable pricing.

Non-big tech players can counter Nvidia's dominance by exploring alternatives like AMD's ROCm, investing in performance optimization through efficient algorithms and parallel processing, and fostering collaboration with other organizations to create decentralized clusters. This promotes a more diverse and competitive landscape in the AI hardware and development industry, allowing smaller players to have a say in the future of AI development.

These strategies aim to diminish reliance on Nvidia's prices and supplies, thereby enhancing investment appeal, minimizing the risk of business development slowdown amid hardware competition, and fostering organic growth within the entire industry.

With over 8 years of experience in AI solution development and team management at industry leaders like Meta and VK.com, Sergey brings a solid track record of innovation. His commitment to the field is also evidenced by his contribution to the academic field with over 6 years of lecturing at top world institutions, including Lomonosov Moscow State University and Bauman Moscow State Technical University, among others. With expertise spanning ML strategy, big data, and cloud operations, Sergey's contributions are grounded in substantial achievements and recognition, both in the professional realm and in educational endeavors.