
Developers Put AI Bots to the Test of Writing Code

With so many AI-powered assistants waiting to be explored, we've now entered the phase where excited coders are trying their own homegrown experiments — and sharing the results online.
May 14th, 2023 6:00am

One Bay Area technical support specialist told me he’d had a secret advantage when a potential employer assigned a take-home programming problem. He’d used ChatGPT to generate a solution, then turned it in as his own work.

OpenAI reminds users that its ChatGPT is currently in a free “research preview” to “learn about its strengths and weaknesses.” And there are plenty of other options to explore as well.

The last month has also seen the launch of Hugging Face’s open source alternative, “HuggingChat” — and a set of dedicated coding tools like StarCoder Playground.

With so many AI-powered assistants waiting to be explored, we’ve now entered the phase where excited users try their own homegrown experiments — and share the results online.

Can these new AI-powered tools really generate code? With a few qualifications and caveats, the answer appears to be yes.

Informal Tests

It’s always been the provocative question lurking behind the arrival of powerful AI systems. In early 2022 Alphabet reported that its AI research lab DeepMind had created a computer programming system called “AlphaCode,” which was already ranking “within the top 54%” of the coders competing on the site Codeforces. By November GitHub was experimenting with adding a voice interface to its impressive AI-powered pair programmer, Copilot.

But now the systems are facing some more informal tests.

Last month a game developer on the “Candlesan” YouTube channel shared ChatGPT’s efforts to recreate the popular mobile game Flappy Bird. While it took several iterations, the code was completed in about 90 minutes. It was written in C# for the Unity game engine — and even incorporated art the developer generated with the AI tool Midjourney.

The video hints at a possible future where developers use AI to get their work done faster.

“What I really like about this process is that while ChatGPT is taking care of the code, I get to focus my attention on design work,” explains the video’s enthusiastic game developer. “I get to position text elements on the screen, I decide the distance between the pipes, or the exact tuning numbers for how hard the bird flaps its wings.”

And in a later video, the same developer uses ChatGPT to code bots to play the game ChatGPT just built.

Acing the Coding Test

Can AI pass a professional coding test? Other experiments suggest the answer there is also “yes” — but not for every AI system. One such test appeared last month on the tech site HackerNoon, when Seattle-based full-stack developer Jorge Villegas tested GPT-4, Claude+, Bard, and GitHub Copilot on a practice exercise from the coding site Leetcode.com. Villegas distilled the question down to an unambiguous five-word prompt: “Solve Leetcode 214. Shortest Palindrome.”

Leetcode’s practice puzzle #214 challenges coders to turn a string into the shortest possible palindrome by adding characters only to the front of the string. “While I could have asked follow-up questions, I chose to only consider the initial response,” Villegas added.
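
For readers who want a feel for the puzzle, here is a minimal Python sketch of one straightforward approach (illustrative only, not Villegas’s code or any of the AI-generated answers): find the longest prefix that is already a palindrome, then prepend the reverse of whatever is left over.

```python
def shortest_palindrome(s: str) -> str:
    """Return the shortest palindrome formed by adding characters in front of s."""
    # Find the longest prefix of s that is itself a palindrome...
    for i in range(len(s), 0, -1):
        prefix = s[:i]
        if prefix == prefix[::-1]:
            # ...then prepend the reversed leftover suffix.
            return s[i:][::-1] + s
    return s  # An empty string is already a palindrome.


print(shortest_palindrome("aacecaaa"))  # "aaacecaaa"
print(shortest_palindrome("abcd"))      # "dcbabcd"
```

This brute-force scan is quadratic in the worst case; submissions that need to beat Leetcode’s time limits typically reach for a linear-time trick such as a KMP failure table, which is part of what makes the puzzle a meaningful test for the AI assistants.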

It’s a tricky puzzle — and the results were some hits and some misses…

  • GPT-4 wrote code that passed all of Leetcode’s tests — and even ran faster than 47% of submissions to the site by (presumably human) users. Villegas’s only caveat was that GPT-4 is slower to respond than the others — and that using its API “is also a lot more expensive and costs could ramp up quickly.”
  • Villegas also tested the Claude+ “AI assistant” from Anthropic, a company describing itself as “an AI safety and research company” that builds “reliable, interpretable, and steerable AI systems.” But unfortunately, the code it produced failed all but one of Leetcode’s 121 tests.
  • Google’s “experimental AI service” Bard failed all but two of Leetcode’s 121 tests. (Although Bard’s code also contained a bug so obvious that Villegas felt compelled to correct it himself: the function’s definition was missing Python’s self parameter, which Leetcode’s class-based solutions require; see the sketch after this list.)
  • Villegas tested GitHub Copilot (asking the question by typing it as a comment in Microsoft’s Copilot-powered VSCode). And it passed every one of Leetcode’s tests — scoring better than 30% of submissions (from presumably human coders).
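
To make that self slip concrete, here is a hypothetical reconstruction (not Bard’s actual output): Leetcode expects the solution as a method on a Solution class, so a definition that omits self fails as soon as the grader calls it.

```python
class Solution:
    # The kind of bug described above: defining the method as
    #     def shortestPalindrome(s: str) -> str:
    # makes Solution().shortestPalindrome("abcd") raise a TypeError,
    # because Python passes the instance itself as the first argument.
    # Corrected signature:
    def shortestPalindrome(self, s: str) -> str:
        for i in range(len(s), 0, -1):
            if s[:i] == s[:i][::-1]:
                return s[i:][::-1] + s
        return s


print(Solution().shortestPalindrome("abcd"))  # "dcbabcd"
```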

Villegas’s essay closes with an important caveat. “It is unclear whether any of these models were pre-trained on Leetcode data.” So in early May Villegas tried another more specialized test, using a slightly longer prompt that requested four different CSS features written with a specific framework.

“Create a header component using Tailwind CSS that includes a logo on the left, navigation links in the center, and a search bar on the right. Make the header dark purple.”

GPT-4’s result “overall looks very good” and Claude+ made “a pretty good attempt,” while for Bard’s response, “the nav links have no space between them, the search bar is illegible against the background… I guess it still got the main parts of the prompt correct, all the content is in the correct order.” And Bing’s version of GPT-4 was the only one that actually centered the navigation links.

Villegas’s ultimate verdict is that AI-generated code “often lacks attention to detail and can result in design flaws. Additionally, AI still struggles with context awareness, and it can be challenging to provide precise instructions that an AI can follow accurately.

“These difficulties demonstrate that AI cannot replace human designers entirely but can be a valuable tool to assist them in their work.”

Plugins and PHP

ZDNet attempted some even more ambitious tests.

Senior contributing editor David Gewirtz had used ChatGPT back in February to generate a working WordPress plugin for his wife. It randomized items on a list — though a series of additional feature requests eventually tripped it up, with ChatGPT failing to sanitize the input when calling PHP within HTML.

While Gewirtz decided this was only coding at the “good enough” level, he also noted that this is what many clients actually want. This led Gewirtz to conclude that AI will “almost undoubtedly” reduce the number of human programming gigs, adding that even today AI is “definitely an option for quick and easy projects… this surge in high-quality generative AI has been startling to me.”

In April he’d tried the same test using Google’s Bard, but it generated a plugin that didn’t work: it produced blank output rather than a list of names in random order. Bard also got tripped up when asked for a simple rewrite of an input checker so it would allow decimal values as well as integers (its rewrite would have allowed letters and symbols to be placed to the right of the decimal point). And when testing both Bard and ChatGPT on some buggy PHP code, only ChatGPT correctly identified the flaw. “For the record, I looked at all three of Bard’s drafts for this answer, and they were all wrong.”
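
The actual checker was PHP inside a WordPress plugin, but the pitfall is easy to sketch as a hypothetical Python analogue: “accept decimals” has to mean “accept digits after the dot,” not “accept anything after the dot.”

```python
import re

# Hypothetical analogue of the input check described above (the real code
# was PHP in a WordPress plugin). It accepts integers and decimal numbers,
# but rejects values with letters or symbols after the decimal point.
NUMBER_RE = re.compile(r"^\d+(\.\d+)?$")

def is_valid_quantity(value: str) -> bool:
    return bool(NUMBER_RE.match(value.strip()))


print(is_valid_quantity("3"))     # True
print(is_valid_quantity("3.25"))  # True
print(is_valid_quantity("3.2x"))  # False: the flawed rewrite would let this through
```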

But then Gewirtz decided to push ChatGPT to write a “hello world” program in 12 different programming languages. Gewirtz used the top 12 most popular programming languages (as ranked by O’Reilly) — Java, Python, Rust, Go, C++, JavaScript, C#, C, TypeScript, R, Kotlin, and Scala — and ChatGPT dutifully complied (even providing the appropriate syntax coloring for them all).

To make things more challenging, his prompt even requested different messages for the morning, evening, and afternoon. While Gewirtz didn’t run the code, “I did read through the generated code and — for most languages — the code looked good.” And a quick test of the JavaScript code shows it does indeed perform as expected.
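
ChatGPT’s answers spanned a dozen languages, but the behavior the prompt asked for is simple enough to sketch in a few lines of Python (an illustration of the requested logic, not ChatGPT’s output):

```python
from datetime import datetime

def hello_world() -> str:
    # Choose a greeting based on the current hour, as the prompt requested.
    hour = datetime.now().hour
    if hour < 12:
        return "Good morning, world!"
    elif hour < 18:
        return "Good afternoon, world!"
    else:
        return "Good evening, world!"


print(hello_world())
```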

Just for fun, Gewirtz also asked it to produce results using the legacy Forth programming language — and it did. So then in a later article, Gewirtz challenged ChatGPT to write code in 10 more “relatively obscure languages,” including Fortran, COBOL, Lisp, Algol, Simula, RPG (Report Program Generator), IBM’s BAL (Basic Assembly Language), and Xerox PARC’s Smalltalk.

In short, Gewirtz took ChatGPT through a history of programming languages dating as far back as the 1950s. And he described the results as “cool beyond belief.” Though he didn’t run the generated code, “most look right, and show the appropriate indicators telling us that the language presented is the language I asked for…”

ChatGPT even rose to Gewirtz’s challenge of writing code in another ancient language, APL, which uses its own non-standard character set — though the font used to display the code transformed those characters into what Gewirtz calls “little glyphs.”

But perhaps the most thought-provoking result of all came when ChatGPT generated code in the equally ancient Prolog. This one is especially notable because, Gewirtz suggests, ChatGPT itself relies on Prolog — at least partially: he describes a mode that translates Prolog logical forms into sentences in natural language.

With so many examples of AI assistants already generating code, maybe it’s time to move on to the question of how they’ll ultimately be used. That is a question we’ll be watching out for in the months and years to come.
