Are Public LLMs Putting Your Company at Risk?

August 22, 2023
Dr. Vishal Misra

To avoid IP leaks and copyright risks, businesses need a smarter way to leverage GenAI tools

Apple isn’t known for being tech-averse — but the Cupertino company recently barred its employees from using third-party generative AI (GenAI) tools. The reason? Apple fears that using large language models (LLMs) such as ChatGPT or GenAI coding tool GitHub Copilot could leave the company vulnerable to copyright violations, IP leaks, and other serious business risks. 

And they’re not the only ones. Samsung banned ChatGPT and other GenAI tools after employees accidentally leaked sensitive company data through the chatbot, and J.P. Morgan, Verizon, and Amazon have all restricted ChatGPT use among employees.

As Samsung and many others have quickly — and, at times, painfully — learned, the public, cloud-hosted nature of these GenAI tools brings about a whole host of risks, beyond simply the risk of data leakage. But avoiding LLMs altogether would mean missing out on valuable technological tools. That’s why companies are starting to modify pre-existing LLMs with their own data. 

The Trouble with Public LLMs

The biggest problem with public LLMs is that pre-trained models necessarily contain large volumes of public data, over which organizations have little control. That training data can’t be unlearned, just as cake batter can’t be unmixed. Any inaccurate or unlawful public data incorporated during the initial training will inevitably taint the results subsequently generated by team members. That could leave companies legally liable for consumer harm resulting from inaccurate product information generated by their AI models.

Consider the case of Stability AI, which wound up embroiled in a lawsuit after its imaging tool began adding the Getty Images logo to the images it produced — a clear sign that Getty’s proprietary images had been scraped during training. Adobe’s Photoshop AI system, by contrast, was designed more cautiously, using algorithms trained only on stock and public domain images. 

It’s also important to remember that the data used to create bespoke GenAI models isn’t forgotten once the LLM is fully trained and tuned. The data persists within the GenAI model and can potentially be extracted from it later, which creates the risk of IP leakage if sensitive private data is used. 

The risk of leaks is exacerbated by the fact that GenAI tools typically live in the cloud. That means private data leveraged for public LLM use, plus prompts and responses generated by the model, can easily slip into the wrong hands. If you aren’t careful, sensitive data can accidentally be transmitted to outside organizations — and exploited by competitors — without your approval or awareness.
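One basic guardrail against this kind of accidental transmission is to scrub obvious sensitive identifiers from text before it ever leaves your network. The sketch below is a minimal, illustrative example using simple regular expressions; the patterns and function names are our own assumptions, not any vendor’s API, and a real deployment would need far broader coverage (names, internal project codes, source snippets, and so on).

```python
import re

# Illustrative patterns only; real redaction needs much broader coverage.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "API_KEY": re.compile(r"\bsk-[A-Za-z0-9]{16,}\b"),
}

def redact(text: str) -> str:
    """Replace each pattern match with a [LABEL] placeholder."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

prompt = "Contact jane.doe@example.com, SSN 123-45-6789, token sk-abcdef1234567890ab."
print(redact(prompt))  # sensitive values replaced before the prompt leaves the company
```

Even a simple filter like this, run on every outbound prompt, turns an unmonitored data pipe into an auditable checkpoint.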

Weighing GenAI Options 

To mitigate such risks, Apple plans to build the LLMs and other GenAI tools it needs in-house, maintaining full control over the data they contain. That’s a solid strategy if you’re a tech giant with near-infinite resources — but given the time and money needed to create and train new LLMs, building bespoke AI solutions from scratch isn’t practical for most businesses. 

The alternative is to adapt existing LLMs to suit an organization's individual needs, while also minimizing their exposure to the kinds of risks Apple is trying to sidestep. 

There are several ways to modify pre-existing LLMs, but one of the most common is fine-tuning or prompt-tuning an existing model using the organization's private data. This is the digital equivalent of swirling flavoring into premade cake batter: you can skip many of the foundational steps while still cooking up a final product that's customized to your business needs.

The benefits of this kind of customization are obvious: organizations get an AI model that “knows” their business and delivers tailored responses even to generic questions. Achieving it requires strategies that safeguard existing datasets while maximizing the data available for effectively fine-tuning or prompt-tuning AI models. 
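At its simplest, the prompt side of this customization amounts to injecting curated private records into the context handed to a general-purpose model. The sketch below is a deliberately naive, library-free illustration of that idea; every record and function name is hypothetical, and production systems would use learned embeddings and a vector store rather than keyword overlap.

```python
# Hypothetical private records a company might curate for its model.
PRIVATE_RECORDS = [
    "Model X-200 ships with a 2-year warranty and IP67 rating.",
    "Returns are accepted within 30 days with proof of purchase.",
    "Support hours are 9am-5pm ET, Monday through Friday.",
]

def retrieve(question: str, records: list[str], k: int = 2) -> list[str]:
    """Rank records by naive keyword overlap with the question."""
    q_words = set(question.lower().split())
    scored = sorted(records, key=lambda r: -len(q_words & set(r.lower().split())))
    return scored[:k]

def build_prompt(question: str) -> str:
    """Prepend the most relevant private context to a generic question."""
    context = "\n".join(retrieve(question, PRIVATE_RECORDS))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

print(build_prompt("What is the warranty period?"))
```

The key point is that the base model is never retrained on the private data; the curated records stay under the organization’s control and are supplied only at query time.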

Finding a Solution

That’s why Conveyer is helping companies to organize and prepare their data for GenAI applications, maximizing the utility they can obtain from unstructured data sources including documents, notes, emails, images, and even videos. Instead of dumping unstructured and unmonitored data into algorithms, we make it possible to curate appropriately structured private datasets that can be used to tune and prompt large language models without exposing businesses to undue risk. 

Using contextual cues and natural language processing, Conveyer’s TopicLake™ repository automates the process of turning unstructured data into structured topics, enabling us to generate metadata, unlock new data sources, and create vector data and other custom data to power GenAI tools. Our engine can digest content far more accurately and efficiently than manual processes, enabling companies to rapidly and accurately sort, classify, structure, and generate data for AI initiatives. 
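To make the general idea of “vector data” concrete, here is a generic bag-of-words sketch of turning free text into sparse vectors and matching a query against them. This is a textbook illustration under our own assumptions, not Conveyer’s actual pipeline, which uses far more sophisticated NLP.

```python
import math
import re
from collections import Counter

def vectorize(text: str) -> Counter:
    """Turn raw text into a sparse bag-of-words vector (term counts)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical structured topics extracted from unstructured documents.
docs = {
    "warranty_policy": "All devices carry a two year warranty covering defects.",
    "support_hours": "Support is available weekdays from nine to five.",
}
query = vectorize("What does the warranty cover for devices?")
best = max(docs, key=lambda name: cosine(query, vectorize(docs[name])))
print(best)  # prints "warranty_policy"
```

Once documents live in a vector space like this, “find the right private data for this prompt” becomes a similarity search rather than guesswork.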

Using an expanded array of properly structured datasets, businesses can take the guesswork out of fine-tuning and prompt-tuning LLMs. By ensuring that only appropriate data is exposed to AI algorithms, organizations can develop models that deliver fully customized responses, without errors or copyrighted materials inherited from large public datasets.

Beat the Backlash

Until now, the evolution of AI systems has been guided by a more-is-more mindset; vast amounts of public and semi-public data have been scraped from the web and poured into algorithms with relatively little oversight or guidance. 

As the sector matures, however, we’ll need to move toward more disciplined and sustainable data practices — namely, training models on pristine, curated data. The ubiquity of GenAI technologies will increase awareness not just of their potential, but also of the risks they bring. Companies that fail to move with the times will find themselves not only targeted by regulators, but also punished by consumers, who are increasingly wary of the ways in which their data is being fed into AI algorithms. 

To stay ahead of that curve — leveraging the full potential of GenAI tools, but also cementing reputations as responsible stewards of consumer data — businesses will need to get serious about structuring and curating the datasets they use to train and tune their LLMs. 

This is the problem Conveyer’s TopicLake™ repository was built to solve. Whether you’re fine-tuning or prompt-tuning LLMs, or using other strategies to optimize AI models, we’re here to help you prepare your unstructured data and augment your usable datasets — helping you to effectively leverage GenAI tools and drive the results you need. Get in touch to find out more about how we can propel your business into the GenAI era.