Editor’s Note: As we head into 2025, Corporate Board Member is kicking off a new “ask me anything” style column with Karen Silverman and JoAnn Stonier of The Cantellus Group, global experts in practical governance strategies for AI and other technology. To start, they’re delving into the vexing question of whether to invest in building your own AI models based on proprietary data or to rely on public models developed by companies such as OpenAI, Meta and Google. Here’s what they had to say:
It’s a very important and tough question, and the answer depends very much on what you have at stake. You will probably deploy both approaches over time, but a clear-eyed assessment of your business model, resources and capabilities will help you decide where to start.
Public LLMs are all trained on largely the same (and mostly Western) body of the world’s data. The leading AI labs’ LLMs are well-engineered and are improving in power, accuracy and capability all the time. They are a good starting place for many tasks, including summarizing large volumes of non-sensitive material, generating drafts and brainstorming large quantities of generic options from different perspectives.
Considerations include the need for substantial oversight and user training, even for repetitive tasks; protections for proprietary data when interacting with APIs; the lack of competitive differentiation; reliance on training data that may be subject to litigation; a lack of control over the development of features (everyone gets the same features at the same time); a lack of explainability; the potential for bias, drift and performance loss; and a lack of precision, along with challenges with reliability, accuracy and security.
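On the data-protection point: before any text is sent to a public model’s API, many teams interpose a simple redaction layer. The sketch below is a minimal, illustrative Python example, not a prescribed implementation; the regex patterns are placeholders, and send_to_public_llm is a hypothetical stand-in for a vendor’s API client rather than real data-loss-prevention tooling.

```python
import re

# Illustrative patterns only; real data-loss-prevention tooling goes far beyond regex.
REDACTION_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "PHONE": re.compile(r"\b(?:\+?1[-.\s]?)?\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}\b"),
    "ACCOUNT_NO": re.compile(r"\b\d{8,16}\b"),
}


def redact(text: str) -> str:
    """Replace obviously sensitive substrings with typed placeholders."""
    for label, pattern in REDACTION_PATTERNS.items():
        text = pattern.sub(f"[{label} REDACTED]", text)
    return text


def send_to_public_llm(prompt: str) -> str:
    # Hypothetical stand-in for a vendor API call; swap in your provider's client here.
    return f"[model response to {len(prompt)} characters of redacted input]"


def summarize_externally(document: str) -> str:
    """Redact first, then hand the text to the public-LLM client."""
    safe_text = redact(document)
    return send_to_public_llm(f"Summarize the following:\n{safe_text}")


if __name__ == "__main__":
    memo = "Contact jane.doe@example.com about account 123456789012 before 212-555-0100 calls."
    print(summarize_externally(memo))
```

Even a thin layer like this can give compliance teams a single place to audit what leaves the company.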
Proprietary LLMs, in contrast, are developed in-house, tailored to specific needs and trained on proprietary data. They are great for highly differentiated tasks, where results reference company-specific text, facts or practices, or where security, reliability and/or regulatory requirements are at a premium.
Considerations here include non-trivial costs (in time, talent and treasure) to develop, test and validate; access to a sufficient quantity and quality of training and production data; data with appropriate permissions and consents for use; internal standards and capabilities; and ongoing maintenance and retraining requirements. Of course, this approach also puts you in competition for scarce AI engineering talent.
Hybrid models leverage existing public LLMs and adapt them to proprietary data through fine-tuning and/or strict data filtering tools and techniques, offering some level of customization. They can reduce up-front investment and allow gradual acquisition of internal capabilities and organizational skills.
Here, consider the investments needed to prepare technology and proprietary data to protect personal, sensitive or trade secret information; create the right technical infrastructure with strict access controls and authentication; choose a base model with well-documented fine-tuning processes; and maintain detailed records for compliance and for ongoing evaluation of model behavior, drift and usage.
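To make the record-keeping point concrete, here is one minimal way it can look in practice, sketched in Python under stated assumptions: the log location, field names and hashing choice are illustrative, not a prescribed standard. Hashing prompts and responses preserves a tamper-evident trace for drift and usage reviews without copying sensitive content into the log itself.

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path

# Illustrative location; in practice, point this at governed, access-controlled storage.
AUDIT_LOG = Path("model_audit_log.jsonl")


def record_model_call(user: str, model_version: str, prompt: str, response: str) -> None:
    """Append one JSON-lines audit record per model call."""
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "user": user,
        "model_version": model_version,
        "prompt_sha256": hashlib.sha256(prompt.encode("utf-8")).hexdigest(),
        "response_sha256": hashlib.sha256(response.encode("utf-8")).hexdigest(),
        "prompt_chars": len(prompt),
        "response_chars": len(response),
    }
    with AUDIT_LOG.open("a", encoding="utf-8") as f:
        f.write(json.dumps(entry) + "\n")


# Example usage with placeholder values:
record_model_call("analyst-17", "internal-ft-2024-11", "Draft a summary of ...", "Here is a summary ...")
```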
Where To Begin?
To help think this through, here are some key areas to start board discussion:
Are you a regulated or highly sensitive business? These businesses tend to adopt proprietary solutions soonest, as custom models keep data within company boundaries, reducing the risk of leaking sensitive data and improving compliance controls.
Do you demand competitively differentiated capabilities? If so, training your models on unique proprietary data sets will capture domain knowledge and create solutions tailored to specific business and industry requirements.
Is a hybrid approach viable? Using carefully prepared proprietary data in a private cloud instance or air-gapped system to fine-tune public models might be a good solution; eventually, most public LLM users will invest in retrieval-augmented generation (RAG) and other proprietary techniques to improve reliability, accuracy and context specificity (a minimal sketch of the retrieval idea follows these questions).
What problems are you trying to solve? Which approach is best still turns on use-case context.
What can you afford? Where to put up-front investments (e.g., toward reliable data sets, model development or product development) is also context-specific.
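For directors who want to see what RAG actually involves, here is a minimal, illustrative Python sketch under simplifying assumptions: the embed function is a toy word-count stand-in for a real embedding model, and the passages are invented placeholders. The point is the shape of the technique, namely retrieving the most relevant proprietary passages and grounding the model’s prompt in them.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding'; a real system would use a learned embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(question: str, documents: list[str], k: int = 2) -> list[str]:
    """Return the k proprietary passages most similar to the question."""
    q = embed(question)
    return sorted(documents, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]


def build_grounded_prompt(question: str, documents: list[str]) -> str:
    """Stuff retrieved passages into the prompt so the model answers from company sources."""
    context = "\n---\n".join(retrieve(question, documents))
    return f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {question}"


# Placeholder proprietary passages; in practice these would live in a governed vector store.
docs = [
    "Our 2024 claims-handling policy requires supervisor sign-off above $50,000.",
    "Vendor onboarding must include a data-processing agreement reviewed by legal.",
    "The cafeteria menu rotates weekly.",
]
print(build_grounded_prompt("What sign-off is required for claims above $50,000?", docs))
```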
Finally, you’ll also want to realistically evaluate:
• Access to engineering talent, applicable and usable data sets and capital
• Need for competitive differentiation
• Timing requirements
• Availability of infrastructure
Generative AI raises build-or-buy decisions, just as your existing tech stack does. Where interoperability is critical, for instance, the utility of a universal solution may outweigh its associated risks. But where competitive differentiation is critical, public LLMs generate content that others can replicate (they are trained on the same data), so a proprietary approach might be better, if your own data contains unique insights and is of sufficient volume. And, of course, if you can afford it.