AI

Hugging Face and ServiceNow release a free code-generating model

Comment

Blue binary code on black background interspersed with open and closed locks.
Image Credits: JuSun / Getty Images

AI startup Hugging Face and ServiceNow Research, ServiceNow’s R&D division, have released StarCoder, a free alternative to code-generating AI systems along the lines of GitHub’s Copilot.

Code-generating systems like DeepMind’s AlphaCode; Amazon’s CodeWhisperer; and OpenAI’s Codex, which powers Copilot, provide a tantalizing glimpse at what’s possible with AI within the realm of computer programming. Assuming the ethical, technical and legal issues are someday ironed out (and AI-powered coding tools don’t cause more bugs and security exploits than they solve), they could cut development costs substantially while allowing coders to focus on more creative tasks.

According to a study from the University of Cambridge, at least half of developers’ efforts are spent debugging and not actively programming, which costs the software industry an estimated $312 billion per year. But so far, only a handful of code-generating AI systems have been made freely available to the public — reflecting the commercial incentives of the organizations building them (see: Replit).

StarCoder, which by contrast is licensed to allow for royalty-free use by anyone, including corporations, was trained on over 80 programming languages as well as text from GitHub repositories, including documentation and programming notebooks. StarCoder integrates with Microsoft’s Visual Studio Code code editor and, like OpenAI’s ChatGPT, can follow basic instructions (e.g., “create an app UI”) and answer questions about code.

Leandro von Werra, a machine learning engineer at Hugging Face and a co-lead on StarCoder, claims that StarCoder matches or outperforms the AI model from OpenAI that was used to power initial versions of Copilot.

“One thing we learned from releases such as Stable Diffusion last year is the creativity and capability of the open-source community,” von Werra told TechCrunch in an email interview. “Within weeks of the release the community had built dozens of variants of the model as well as custom applications. Releasing a powerful code generation model allows anybody to fine-tune and adapt it to their own use-cases and will enable countless downstream applications.”

Building a model

StarCoder is a part of Hugging Face’s and ServiceNow’s over-600-person BigCode project, launched late last year, which aims to develop “state-of-the-art” AI systems for code in an “open and responsible” way. Hugging Face supplied an in-house compute cluster of 512 Nvidia V100 GPUs to train the StarCoder model.

Various BigCode working groups focus on subtopics like collecting datasets, implementing methods for training code models, developing an evaluation suite and discussing ethical best practices. For example, the Legal, Ethics and Governance working group explored questions on data licensing, attribution of generated code to original code, the redaction of personally identifiable information (PII), and the risks of outputting malicious code.

Inspired by Hugging Face’s previous efforts to open source sophisticated text-generating systems, BigCode seeks to address some of the controversies arising around the practice of AI-powered code generation. The nonprofit Software Freedom Conservancy among others has criticized GitHub and OpenAI for using public source code, not all of which is under a permissive license, to train and monetize Codex. Codex is available through OpenAI’s and Microsoft’s paid APIs, while GitHub recently began charging for access to Copilot.

For their parts, GitHub and OpenAI assert that Codex and Copilot — protected by the doctrine of fair use, at least in the U.S. — don’t run afoul of any licensing agreements.

“Releasing a capable code-generating system can serve as a research platform for institutions that are interested in the topic but don’t have the necessary resources or know-how to train such models,” von Werra said. “We believe that in the long run this leads to fruitful research on safety, capabilities and limits of code-generating systems.”

Unlike Copilot, the 15-billion-parameter StarCoder was trained over the course of several days on an open source dataset called The Stack, which has over 19 million curated, permissively licensed repositories and more than six terabytes of code in over 350 programming languages. In machine learning, parameters are the parts of an AI system learned from historical training data and essentially define the skill of the system on a problem, such as generating code.

The Stack
A graphic breaking down the contents of The Stack dataset. Image Credits: BigCode

Because it’s permissively licensed, code from The Stack can be copied, modified and redistributed. But the BigCode project also provides a way for developers to “opt out” of The Stack, similar to efforts elsewhere to let artists remove their work from text-to-image AI training datasets.

The BigCode team also worked to remove PII from The Stack, such as names, usernames, email and IP addresses, and keys and passwords. They created a separate dataset of 12,000 files containing PII, which they plan to release to researchers through “gated access.”

Beyond this, the BigCode team used Hugging Face’s malicious code detection tool to remove files from The Stack that might be considered “unsafe,” such as those with known exploits.

The privacy and security issues with generative AI systems, which for the most part are trained on relatively unfiltered data from the web, are well-established. ChatGPT once volunteered a journalist’s phone number. And GitHub has acknowledged that Copilot may generate keys, credentials and passwords seen in its training data on novel strings.

“Code poses some of the most sensitive intellectual property for most companies,” von Werra said. “In particular, sharing it outside their infrastructure poses immense challenges.”

To his point, some legal experts have argued that code-generating AI systems could put companies at risk if they were to unwittingly incorporate copyrighted or sensitive text from the tools into their production software. As Elaine Atwell notes in a piece on Kolide’s corporate blog, because systems like Copilot strip code of its licenses, it’s difficult to tell which code is permissible to deploy and which might have incompatible terms of use.

In response to the criticisms, GitHub added a toggle that lets customers prevent suggested code that matches public, potentially copyrighted content from GitHub from being shown. Amazon, following suit, has CodeWhisperer highlight and optionally filter the license associated with functions it suggests that bear a resemblance to snippets found in its training data.

Commercial drivers

So what does ServiceNow, a company that deals mostly in enterprise automation software, get out of this? A “strong-performing model and a responsible AI model license that permits commercial use,” said Harm de Vries, the lead of the Large Language Model Lab at ServiceNow Research and the co-lead of the BigCode project.

One imagines that ServiceNow will eventually build StarCoder into its commercial products. The company wouldn’t reveal how much, in dollars, it’s invested in the BigCode project, save that the amount of donated compute was “substantial.”

“The Large Language Models Lab at ServiceNow Research is building up expertise on the responsible development of generative AI models to ensure the safe and ethical deployment of these powerful models for our customers,” de Vries said. “The open-scientific research approach to BigCode provides ServiceNow developers and customers with full transparency into how everything was developed and demonstrates ServiceNow’s commitment to making socially responsible contributions to the community.”

StarCoder isn’t open source in the strictest sense. Rather, it’s being released under a licensing scheme, OpenRAIL-M, that includes “legally enforceable” use case restrictions that derivatives of the model — and apps using the model — are required to comply with.

For example, StarCoder users must agree not to leverage the model to generate or distribute malicious code. While real-world examples are few and far between (at least for now), researchers have demonstrated how AI like StarCoder could be used in malware to evade basic forms of detection.

Whether developers actually respect the terms of the license remains to be seen. Legal threats aside, there’s nothing at the base technical level to prevent them from disregarding the terms to their own ends.

That’s what happened with the aforementioned Stable Diffusion, whose similarly restrictive license was ignored by developers who used the generative AI model to create pictures of celebrity deepfakes.

But the possibility hasn’t discouraged von Werra, who feels the downsides of not releasing StarCoder aren’t outweighed by the upsides.

“At launch, StarCoder will not ship as many features as GitHub Copilot, but with its open-source nature, the community can help improve it along the way as well as integrate custom models,” he said.

The StarCoder code repositories, model training framework, dataset-filtering methods, code evaluation suite and research analysis notebooks are available on GitHub as of this week. The BigCode project will maintain them going forward as the groups look to develop more capable code-generating models, fueled by input from the community.

There’s certainly work to be done. In the technical paper accompanying StarCoder’s release, Hugging Face and ServiceNow say that the model may produce inaccurate, offensive, and misleading content as well as PII and malicious code that managed to make it past the dataset filtering stage.

More TechCrunch

The National Democratic Alliance (NDA) has emerged victorious in India’s 2024 general election, but with a smaller majority compared to 2019. According to post-election analysis by Goldman Sachs, JP Morgan,…

Modi-led coalition’s election win signals policy continuity in India – but also spending cuts

Featured Article

A comprehensive list of 2024 tech layoffs

The tech layoff wave is still going strong in 2024. Following significant workforce reductions in 2022 and 2023, this year has already seen 60,000 job cuts across 254 companies, according to independent layoffs tracker Layoffs.fyi. Companies like Tesla, Amazon, Google, TikTok, Snap and Microsoft have conducted sizable layoffs in the…

12 hours ago
A comprehensive list of 2024 tech layoffs

Featured Article

What to expect from WWDC 2024: iOS 18, macOS 15 and so much AI

Apple is hoping to make WWDC 2024 memorable as it finally spells out its generative AI plans.

13 hours ago
What to expect from WWDC 2024: iOS 18, macOS 15 and so much AI

We just announced the breakout session winners last week. Now meet the roundtable sessions that really “rounded” out the competition for this year’s Disrupt 2024 audience choice program. With five…

The votes are in: Meet the Disrupt 2024 audience choice roundtable winners

The malicious attack appears to have involved malware transmitted through TikTok’s DMs.

TikTok acknowledges exploit targeting high-profile accounts

It’s unusual for three major AI providers to all be down at the same time, which could signal a broader infrastructure issues or internet-scale problem.

AI apocalypse? ChatGPT, Claude and Perplexity all went down at the same time

Welcome to TechCrunch Fintech! This week, we’re looking at LoanSnap’s woes, Nubank’s and Monzo’s positive milestones, a plethora of fintech fundraises and more! To get a roundup of TechCrunch’s biggest…

A look at LoanSnap’s troubles and which neobanks are having a moment

Databricks, the analytics and AI giant, has acquired data management company Tabular for an undisclosed sum. (CNBC reports that Databricks paid over $1 billion.) According to Tabular co-founder Ryan Blue,…

Databricks acquires Tabular to build a common data lakehouse standard

ChatGPT, OpenAI’s text-generating AI chatbot, has taken the world by storm. What started as a tool to hyper-charge productivity through writing essays and code with short text prompts has evolved…

ChatGPT: Everything you need to know about the AI-powered chatbot

The next few weeks could be pivotal for Worldcoin, the controversial eyeball-scanning crypto venture co-founded by OpenAI’s Sam Altman, whose operations remain almost entirely shuttered in the European Union following…

Worldcoin faces pivotal EU privacy decision within weeks

OpenAI’s chatbot ChatGPT has been down for several users across the globe for the last few hours.

OpenAI fixes the issue that caused ChatGPT outage for several hours

True Fit, the AI-powered size-and-fit personalization tool, has offered its size recommendation solution to thousands of retailers for nearly 20 years. Now, the company is venturing into the generative AI…

True Fit leverages generative AI to help online shoppers find clothes that fit

Audio streaming service TuneIn is teaming up with Discord to bring free live radio to the platform. This is TuneIn’s first collaboration with a social platform and one that is…

Discord and TuneIn partner to bring live radio to the social platform

The early victors in the AI gold rush are selling the picks and shovels needed to develop and apply artificial intelligence. Just take a look at data-labeling startup Scale AI…

Scale AI founder Alexandr Wang is coming to Disrupt 2024

Try to imagine the number of parts that go into making a rocket engine. Now imagine requesting and comparing quotes for each of those parts, getting approvals to purchase the…

Engineer brothers found Forge to modernize hardware procurement

Raspberry Pi has released a $70 AI extension kit with a neural network inference accelerator that can be used for local inferencing, for the Raspberry Pi 5.

Raspberry Pi partners with Hailo for its AI extension kit

When Stacklet’s founders, Travis Stanfield and Kapil Thangavelu, came out of Capital One in 2020 to launch their startup, most companies weren’t all that concerned with constraining cloud costs. But…

Stacklet sees demand grow as companies take cloud cost control more seriously

Fivetran’s Managed Data Lake Service aims to remove the repetitive work of managing data lakes.

Fivetran launches a managed data lake service

Lance Riedel and Nigel Daley both spent decades in search discovery, but it was while working at Pinterest that they began trying to understand how to use search engines to…

How a couple of former Pinterest search experts caught Biz Stone’s attention

GetWhy helps businesses carry out market studies and extract insights from video-based interviews using AI.

GetWhy, a market research AI platform that extracts insights from video interviews, raises $34.5M

AI-powered virtual physical therapy platform Sword Health has seen its valuation soar 50% to $3 billion.

Sword Health raises $130M and its valuation soars to $3B

Jeffrey Katzenberg and Sujay Jaswa, along with three general partners, manage $1.5 billion in assets today through their Build, Venture and Seed strategies.

WndrCo officially gets into venture capital with fresh $460M across two funds

The startup targets the middle ground between platforms that offer rigid templates, and those that facilitate a full-control approach.

Storyblok raises $80M to add more AI to its ‘headless’ CMS aimed at non-technical people

The startup has been pursuing a ground-up redesign of a well-understood technology.

‘Star Wars’ lasers and waterfalls of molten salt: How Xcimer plans to make fusion power happen

Sēkr, a startup that offers a mobile app for outdoor enthusiasts and campers, is launching a new AI tool for planning road trips. The new tool, called Copilot, is available…

Travel app Sēkr can plan your next road trip with its new AI tool

Microsoft’s education-focused flavor of its cloud productivity suite, Microsoft 365 Education, is facing investigation in the European Union. Privacy rights nonprofit noyb has just lodged two complaints with Austria’s data…

Microsoft hit with EU privacy complaints over schools’ use of 365 Education suite

Since the shock of Russia’s 2022 invasion of Ukraine, solar energy has been having a moment in Europe. Electricity prices have been going up while the investment required to get…

Samara is accelerating the energy transition in Spain one solar panel at a time

Featured Article

DEI backlash: Stay up-to-date on the latest legal and corporate challenges

It’s clear that this year will be a turning point for DEI.

1 day ago
DEI backlash: Stay up-to-date on the latest legal and corporate challenges

The keynote will be focused on Apple’s software offerings and the developers that power them, including the latest versions of iOS, iPadOS, macOS, tvOS, visionOS and watchOS.

Watch Apple kick off WWDC 2024 right here

Hello and welcome back to TechCrunch Space. Unfortunately, Boeing’s Starliner launch was delayed yet again, this time due to issues with one of the three redundant computers used by United…

TechCrunch Space: China’s victory