Can Open-Source LLMs Solve AI’s Democratization Problem?


Raja Koduri, former chief architect and executive VP at Intel, can be forgiven for worrying aloud about the exorbitant cost of training the large language models (LLMs) that will give rise to more ChatGPTs.

“I am very concerned that generative AI control is in the hands of the few that can afford the dollars to train and deploy models at scale,” he told EE Times. “I have been a big proponent of compute democratization. Exascale compute needs to be accessible to every human and their devices.”

OpenAI CEO Sam Altman has said it cost more than $100 million to train GPT-4, a next-generation version of the model ChatGPT is based on, according to a Wired report.

It follows, then, that at the cutting edge—models with hundreds of billions of parameters—only a select group of private companies have the resources required to train from scratch. One key risk is that powerful LLMs like GPT develop only in a direction that suits the commercial objectives of these companies.

Raja Koduri (Source: Tenstorrent)

Koduri said that with his new generative AI software startup, he intends to collaborate with Tenstorrent (where Koduri has joined the board), and the rest of the RISC-V ecosystem to address this issue.

Open-source models

As with other types of software, there are both proprietary and open-source AI algorithms. But recent industry trends have seen trained models being open sourced, in part to enable greater democratization of the technology. This includes work on GPT, Bloom and others.

Amongst the largest is HuggingFace’s HuggingChat, at 30 billion parameters: It is designed as a competitor to ChatGPT, though its license does not allow commercial use.

Stability AI released StableLM, an open-source LLM that is free to use commercially. It currently comes in 3 billion and 7 billion parameter versions, with 15-65 billion parameter versions on the way, according to the company. The company open sources its models to promote transparency and foster trust, it said, noting that researchers can use the models to work on identifying potential risks and help develop safeguards.

Graphcore has Dolly 2.0, based on Eleuther’s Pythia, running on its IPU hardware. Dolly 2.0 is large at 12 billion parameters, and it is trained on data gathered specifically for the task by its creator, Databricks. It has been trained and fine-tuned for instructions; the model, weights and dataset have been open sourced with a license that permits commercial use.

StableLM and Dolly 2.0 are still one to two orders of magnitude smaller than ChatGPT, at 175 billion parameters.
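The size gap can be checked with a few lines of arithmetic. The sketch below uses only the parameter counts quoted in this article; the helper name `size_gap` is made up for illustration, and an "order of magnitude" is taken to mean a factor of 10.

```python
import math

# Parameter counts quoted in the article, in billions.
CHATGPT_PARAMS_B = 175
open_models_b = {"StableLM 3B": 3, "StableLM 7B": 7, "Dolly 2.0": 12}


def size_gap(reference_b, model_b):
    """Return (ratio, orders of magnitude) between two parameter counts."""
    ratio = reference_b / model_b
    return ratio, math.log10(ratio)


for name, params_b in open_models_b.items():
    ratio, orders = size_gap(CHATGPT_PARAMS_B, params_b)
    print(f"{name}: {ratio:.1f}x smaller, {orders:.2f} orders of magnitude")
```

On these figures, each of the three models falls between one and two orders of magnitude below the 175 billion figure.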

Open sourcing trained models under a license that allows companies to fine-tune and use them can go a long way toward opening up access to this technology for many enterprises and researchers. Fine-tuning is a process in which already trained models are trained further to specialize them for particular tasks, which requires far fewer resources than training from scratch.

“Open-source models will become the standard: All the best models in the world will come to open source,” Anton McGonnell, senior director of product at SambaNova, told EE Times. “Our thesis is that the winners are the platforms that will be able to host the complexity, to actually be able to run these models efficiently at scale and have velocity, because the state of the art [in models] is going to change so much.”

SambaNova has trained a collection of open-source LLMs, including GPT and Bloom, on a variety of domain-specific datasets. The models are intended to be fine-tuned with customers’ proprietary task-specific data.

Open-source datasets

Cerebras has open sourced a series of GPT models ranging from 111 million to 3 billion parameters under the permissive Apache 2.0 license. Cerebras-GPT models have been downloaded more than 200,000 times.

Andrew Feldman (Source: Cerebras)

“I think if we’re not careful, we end up in this situation where a small handful of companies holds the keys to large language models,” Cerebras CEO Andrew Feldman told EE Times, noting that OpenAI’s GPT-4 is a black box, and that Llama, Meta’s open-source trained model, is not available for commercial use.

As well as the resources needed for training, access to the huge amounts of data required is also a barrier to entry.

ChatGPT was trained on about 300 billion words of text (570 GB of data). Cerebras’ open-source LLMs are trained on The Pile dataset from Eleuther, which is itself open source. (StableLM is trained on an experimental dataset “based on The Pile,” which the company says it plans to release details of later.)

Open sourcing datasets not only helps remove the barrier to entry, but also allows datasets to be scrutinized for traits like bias.

“There is a sanitizing impact of being out in the open, for good or for bad,” Feldman said. “There’s plenty to criticize about The Pile, but it’s out in the open and we can make it better… It can be criticized and improved, and its biases can be challenged… When it’s in the open, it can get better, but when it’s closed, you have no idea where it’s from.”

The Pile includes data from books, scientific and medical papers, patents, Wikipedia, and even YouTube.

Going forward, not all data will be open sourced.

Feldman was careful to distinguish between open-source datasets used to train foundation models to understand language, and companies’ proprietary data, which is used for more specialized tasks.

In his example, pharmaceutical companies that have spent billions of dollars generating data that captures scientific discovery can now use open-source models, trained on open-source datasets, combined with their own data “to put forward things that are really valuable, and we’re going to see a lot of that,” he said.


Open sourcing trained foundation models for enterprises to fine-tune allows enterprises to keep their proprietary data secret. Additionally, it allows for greater differentiation than simply using the ChatGPT API.

“We are at the iPhone moment of AI,” Nvidia CEO Jensen Huang said during his keynote at GTC recently. “Startups are racing to build disruptive products and business models, while incumbents are looking to respond. Generative AI has triggered a sense of urgency in enterprises worldwide to develop AI strategies.”

This sense of urgency has resulted in many companies using APIs offered by OpenAI and others for automation and co-creation.

Building a business on top of an API is risky, however, because competitors with access to the same API can often replicate the product easily.

While Huang said during his speech that only some customers need to customize models, Nvidia has built powerful tools for enterprises to fine-tune models it has trained.

“The industry needs a foundry, a TSMC, for large language models,” he said. Nvidia’s AI Foundations allows customers to fine-tune text-generation LLMs, image generation and drug-discovery AIs. Customers include Shutterstock and Getty, which are using their large proprietary image and metadata databases to customize image and video generation AIs.


Yann LeCun (Source: Meta)

It isn’t just training that costs money; deploying LLMs at scale is also not cheap. Analyst estimates from SemiAnalysis suggest that ChatGPT inference costs 0.36 cents per query, or about $700,000 per day.
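Taken together, the two figures imply a query volume, which the rough sketch below works out. It assumes both numbers describe the same workload and that cost scales linearly with queries; the function name is illustrative and the result is a back-of-the-envelope inference, not a figure reported by SemiAnalysis.

```python
# Figures quoted in the article.
COST_PER_QUERY_USD = 0.36 / 100  # 0.36 cents, expressed in dollars
DAILY_COST_USD = 700_000


def implied_queries_per_day(daily_cost_usd, cost_per_query_usd):
    """Queries per day implied by a daily cost and a per-query cost."""
    return daily_cost_usd / cost_per_query_usd


queries = implied_queries_per_day(DAILY_COST_USD, COST_PER_QUERY_USD)
annual_cost_usd = DAILY_COST_USD * 365

print(f"Implied volume: about {queries / 1e6:.0f} million queries per day")
print(f"Annualized cost: about ${annual_cost_usd / 1e6:.1f} million")
```

Under those assumptions, the estimate corresponds to a little under 200 million queries per day and roughly a quarter of a billion dollars per year in inference cost.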

Meta’s head of AI, Yann LeCun, speaking in a livestreamed discussion on AI ethics, was not worried that LLMs would end up in the hands of the few.

“I think this is going to get democratized actually fairly quickly,” he said, adding that simpler LLMs will become more widely available—and may even come to edge hardware like mobile devices.

LeCun, too, had no doubt that the future of LLMs is open source.

“You’re going to have a lot of such LLMs, or systems of that type, available with different degrees of openness for either research groups or products, in short order,” he said. “It’s competitive, which means there is a lot of motivation for people to put things out there, [but] some of them are going to be more open than others.”
