Meet MPT-7B: The Game-Changing Open-Source/Commercially Viable Foundation Model from Mosaic ML

Sriram Parthasarathy
7 min read · May 19, 2023

Start training your custom LLM today on MosaicML’s platform


Deploying applications with large language models (LLMs) poses two challenges. Firstly, while LLMs are revolutionizing numerous sectors, the complexity of training and deploying these models has limited accessibility for individuals outside of well-resourced industry labs. For example, a small startup aiming to develop a language-based virtual assistant may face difficulties in acquiring the necessary resources and expertise to effectively deploy a large language model. Secondly, certain open source models are restricted to research purposes and cannot be utilized for commercial applications. This restriction can pose challenges for businesses looking to leverage open source models to build customer-facing applications that require commercial usage rights. This has spurred a wave of activity around open source models, with notable examples including the LLaMA series from Meta, the Pythia series from EleutherAI, the StableLM series from StabilityAI, and the OpenLLaMA model from Berkeley AI Research.

MosaicML’s response to these challenges is the introduction of a new foundation model series, the MPT (MosaicML Pretrained Transformer). This series aims to circumvent the limitations of the existing models, providing a commercially viable, open-source model that equals and often surpasses the capabilities of LLaMA-7B.

Example application built with an open-source foundation model

For example, a social media platform can use open-source foundation models to develop a domain-specific LLM that enhances content moderation, detects and mitigates harmful or misleading information, and fosters a safer online environment. By leveraging the accessibility and customization options of open-source LLMs, the platform can build tailored solutions to its specific content challenges, benefiting from the transparency and collaborative development of the open-source community. This keeps costs down, supports continuous improvement, and lets the platform proactively address emerging issues, ultimately improving the user experience and community well-being.

More concretely, such a company could fine-tune a foundation model to accurately detect and remove hate speech, fostering a safer and more inclusive online community.

What is MosaicML’s open-source model?

MosaicML’s MPT-7B is a transformer-based language model trained from scratch on a staggering 1 trillion tokens of text and code, and it is now open source and available for commercial use. Not only does MPT-7B match the quality of models like LLaMA-7B, it also addresses the limitations of other open-source LLMs. Training MPT-7B took about 9.5 days on the MosaicML platform, with no human intervention, and cost approximately $200,000, an extraordinary feat in terms of efficiency and affordability.
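To show how accessible the released weights are, here is a minimal sketch of loading MPT-7B from the Hugging Face Hub with the transformers library. The specifics (the bfloat16 dtype, the trust_remote_code flag required by MPT's custom architecture, and the GPT-NeoX tokenizer) follow the public model card and may change over time, so treat this as a starting point rather than a definitive recipe.

```python
# Minimal sketch: load MPT-7B from the Hugging Face Hub and generate text.
# Assumes the `transformers` and `torch` packages are installed; details
# follow the public model card and may change as the card evolves.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "mosaicml/mpt-7b"

# MPT ships custom modeling code, so trust_remote_code=True is required.
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)

# MPT-7B uses the EleutherAI/gpt-neox-20b tokenizer.
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")

inputs = tokenizer("MosaicML is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```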


Here’s why MPT stands out from the competition:

  1. Commercial Licensing: Unlike LLaMA, MPT is licensed for commercial use, making it a viable choice for individuals and businesses alike. For example, a retail company looking to enhance its customer support system can utilize MPT to build a domain-specific model specifically trained on their own customer service data. This custom model, built with MPT’s commercial license, enables the retail company to provide personalized and efficient customer support, tailored to their unique products and services.
  2. Extensive Training Data: MPT models are trained on a vast amount of data, with 1 trillion tokens rivaling the scale of LLaMA. This far surpasses the training data size of other popular open-source models, such as Pythia, OpenLLaMA, and StableLM.
  3. Handling Long Inputs: Thanks to ALiBi, MPT models can process extremely long inputs, accommodating up to 84,000 tokens (see the configuration sketch after this list). This far exceeds the input length capacities of other open-source models, which typically range from 2,000 to 4,000 tokens. In the legal space, this extended input capacity proves invaluable. For instance, a law firm can use MPT’s ability to handle long inputs to analyze lengthy contracts or complex legal documents without tedious manual segmentation, helping legal professionals extract critical information from extensive legal texts more efficiently and accurately.
  4. Enhanced Efficiency: MPT models are optimized for both training and inference, leveraging FlashAttention and FasterTransformer to deliver lightning-fast performance. This efficiency enables users to unlock the full potential of MPT models.
  5. Open-Source Training Code: MosaicML is committed to fostering a collaborative community and, as part of that mission, provides highly efficient and transparent open-source training code, facilitating advancements in the field of LLMs.
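To make points 3 and 4 above concrete, here is a hedged sketch of how the published MPT config can be adjusted to raise the maximum sequence length (possible because ALiBi lets the model extrapolate beyond its 2,048-token training length) and to switch to a faster attention implementation. The field names max_seq_len and attn_config follow the public model card; treat them as assumptions that may differ in future releases.

```python
# Sketch: adjust MPT-7B's context length and attention implementation via
# its Hugging Face config. Field names follow the public model card and
# should be verified against the current card before use.
from transformers import AutoConfig, AutoModelForCausalLM

config = AutoConfig.from_pretrained("mosaicml/mpt-7b", trust_remote_code=True)

# ALiBi lets the model extrapolate beyond the 2,048-token length it was
# trained on, so the maximum sequence length can be raised at load time.
config.max_seq_len = 8192

# Optionally switch to the Triton FlashAttention kernel for faster
# training and inference (requires a compatible GPU and the triton package).
config.attn_config["attn_impl"] = "triton"

model = AutoModelForCausalLM.from_pretrained(
    "mosaicml/mpt-7b",
    config=config,
    trust_remote_code=True,
)
```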

Alongside the base MPT model, MosaicML is also releasing three fine-tuned variants that showcase the broad range of possibilities with this transformative model. These are described in the next section.

The New Frontier of MPT-7B: Diverse Variants for Various Applications

The value of MPT-7B is not confined to its standalone capabilities. The true potential of this novel transformer emerges when its base is fine-tuned to cater to different applications.

Along with the base model, MosaicML has released three fine-tuned models, each demonstrating a different way to leverage the MPT-7B foundation.

By fine-tuning the MPT foundation model on instruction data, chat data, and novels, MosaicML has derived three distinct models: MPT-7B-Instruct, MPT-7B-Chat, and MPT-7B-StoryWriter (a short usage sketch follows the list).
  • MPT-7B-Instruct: This model presents a promising way of using MPT-7B for short-form instructions. It was fine-tuned on a dataset derived from the Databricks Dolly-15k and Anthropic Helpful and Harmless (HH-RLHF) datasets, using 9.6M tokens. Hugging Face link: https://huggingface.co/mosaicml/mpt-7b-instruct
From the MosaicML announcement. The model correctly converts content formatted as YAML into the same content formatted as JSON.
  • MPT-7B-Chat: As the name suggests, MPT-7B-Chat turns the transformer into a chatbot, demonstrating the model’s versatility in real-time interactions. It was fine-tuned with 86M tokens. Hugging Face link: https://huggingface.co/mosaicml/mpt-7b-chat
From the MosaicML announcement. Fine-tuned chat model based on MPT.
  • MPT-7B-StoryWriter-65k+: Perhaps the most ambitious of the three, MPT-7B-StoryWriter-65k+ supports a whopping context length of 65k tokens. This is particularly advantageous in applications requiring long-term context retention, such as storytelling, documentation, or large-scale data analysis. It was fine-tuned with 5B tokens. Hugging Face link: https://huggingface.co/mosaicml/mpt-7b-storywriter
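As promised above, here is a minimal sketch of prompting MPT-7B-Instruct through transformers. The Dolly-style prompt template is an assumption based on the data the model was fine-tuned on; check the model card for the exact format it expects.

```python
# Sketch: prompt MPT-7B-Instruct for a short-form instruction task.
# The prompt template below mirrors the Dolly-style format associated with
# the fine-tuning data; verify it against the model card before relying on it.
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

model = AutoModelForCausalLM.from_pretrained(
    "mosaicml/mpt-7b-instruct", trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")

generate = pipeline("text-generation", model=model, tokenizer=tokenizer)

prompt = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\nConvert this YAML snippet to JSON: name: MPT\n\n"
    "### Response:\n"
)
print(generate(prompt, max_new_tokens=64, do_sample=False)[0]["generated_text"])
```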

A 65k-token context window brings significant benefits; you can read more about this in MosaicML’s announcement. In one example provided by MosaicML, they gave the model the entire text of The Great Gatsby (about 68k tokens) as input, followed by the word “Epilogue”, and let the model continue generating from there.
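A practical question when working with contexts this long is whether a given document actually fits. Below is a small sketch that counts tokens with the same GPT-NeoX tokenizer family the MPT models build on; the file path is a hypothetical placeholder, and the 65,536-token figure is StoryWriter’s quoted training context (ALiBi extrapolation is what allows going somewhat beyond it, as in the Gatsby example).

```python
# Sketch: check whether a long document fits in MPT-7B-StoryWriter's context.
# Uses the EleutherAI/gpt-neox-20b tokenizer that the MPT models build on;
# the input file is a hypothetical local copy of the text.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")

with open("great_gatsby.txt") as f:  # hypothetical placeholder path
    text = f.read()

n_tokens = len(tokenizer(text)["input_ids"])
context_window = 65_536  # StoryWriter's 65k training context

verdict = "fits within" if n_tokens <= context_window else "exceeds"
print(f"Document is {n_tokens} tokens and {verdict} a {context_window}-token window.")
```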

A language model that can remember 65k tokens could be used to assist in creative writing projects, such as writing novels or screenplays. By remembering previous dialogue and character interactions, the model could help writers maintain consistency in their storytelling and character development.

Streaming data for training: automatic recovery during training

The MosaicML team used StreamingDataset to efficiently host and stream their training data from a standard cloud object store to the compute cluster during training. This approach offers several benefits, including eliminating the need to download the entire dataset before training begins. It also allows training to resume seamlessly from any point in the dataset (after a crash, for example) without fast-forwarding the dataloader from the beginning.
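Here is a minimal sketch of that pattern using the open-source mosaicml-streaming library; the S3 path and batch size are hypothetical placeholders, and the arguments reflect the library’s documented usage rather than MosaicML’s exact training setup.

```python
# Sketch: stream training shards from object storage instead of downloading
# the whole dataset up front. Assumes the `mosaicml-streaming` package is
# installed; the bucket path below is a hypothetical placeholder.
from torch.utils.data import DataLoader
from streaming import StreamingDataset

dataset = StreamingDataset(
    remote="s3://my-bucket/mpt-train-shards",  # hypothetical remote shards
    local="/tmp/streaming-cache",              # local cache directory
    shuffle=True,
    batch_size=8,
)

# The dataset plugs into a regular PyTorch DataLoader; the library keeps
# track of sample order so training can resume mid-epoch after a failure.
loader = DataLoader(dataset, batch_size=8, num_workers=4)
for batch in loader:
    pass  # the training step would go here
```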

From the MosaicML announcement. If hardware failures occur while a job is running, the MosaicML platform automatically detects the failure, pauses the job, cordons any broken nodes, and resumes the job. During the MPT-7B training run, they encountered four such failures, and each time the job was automatically resumed.

Custom training on your own data in the MosaicML cloud

For instance, let’s say you want to create a specialized language model for manufacturing data. You can use MosaicML’s training platform today to train a new model from scratch. All you need to do is choose the model size and token budget you want, upload your manufacturing data to a storage platform such as S3, and start a job using the MosaicML Command Line Interface (MCLI). In just a few days, you will have your very own custom language model tailored specifically to your manufacturing text.
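To give a feel for that workflow, here is a hedged sketch of submitting such a run from Python by writing a run configuration and shelling out to the MosaicML CLI. The YAML fields, image name, S3 path, and the llm-foundry entry point are illustrative assumptions, so consult the MosaicML platform documentation for the current run-config schema.

```python
# Sketch: submit a custom MPT training run through the MosaicML CLI (mcli).
# The YAML fields and training command are illustrative assumptions; check
# the MosaicML platform docs for the current run-configuration schema.
import subprocess
from pathlib import Path

run_config = """
name: mpt-manufacturing-7b          # hypothetical run name
image: mosaicml/pytorch:latest      # hypothetical training image
compute:
  gpus: 8
integrations:
  - integration_type: git_repo
    git_repo: mosaicml/llm-foundry  # MosaicML's open-source training code
command: |
  cd llm-foundry/scripts
  composer train/train.py train/yamls/pretrain/mpt-7b.yaml data_remote=s3://my-bucket/manufacturing-data
"""

Path("manufacturing_run.yaml").write_text(run_config)

# `mcli run -f <file>` submits the job to the MosaicML platform.
subprocess.run(["mcli", "run", "-f", "manufacturing_run.yaml"], check=True)
```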

Closing Remarks: Looking Ahead with MosaicML

MosaicML’s launch of the MPT series serves as a reminder that the boundaries of AI are ever-expanding. Open-source innovation and commercial usability continue to fuel the progression of AI accessibility.

MPT-7B represents a significant stride in AI’s journey, offering a robust, high-quality, commercially viable solution. Its successful benchmarking against LLaMA-7B, coupled with the ability to handle a vast range of inputs, sets the stage for a new era of AI exploration.

With an open-source, commercially viable foundation model now available, I am anticipating the remarkable innovations that will arise from a vibrant community of researchers, developers, and enthusiasts, all empowered to train, fine-tune, and deploy their own MPT models.
