Getting Healthcare Data To Train An AI Model - with Protege

And all the different ways you can “train” a model

Looking to hire the best talent in healthcare? Check out the OOP Talent Collective - where vetted candidates are looking for their next gig. Learn more here or check it out yourself.

Hire from the Out-Of-Pocket talent collective

Claims Data 101 Course

Practical introduction to claims. Ideal for all healthcare backgrounds, this course covers what claims are, why they exist, and how to most effectively use them.
Learn more

Featured Jobs

Finance Associate - Spark Advisors

  • Spark Advisors helps seniors enroll in Medicare and understand their benefits by monitoring coverage, figuring out the right benefits, and dealing with insurance issues. They're hiring a finance associate.

Data Engineer - firsthand

  • firsthand is building technology and services to dramatically change the lives of those with serious mental illness who have fallen through the gaps in the safety net. They are hiring a data engineer to build first of its kind infrastructure to empower their peer-led care team.

Data Scientist - J2 Health

  • J2 Health brings together best in class data and purpose built software to enable healthcare organizations to optimize provider network performance. They're hiring a data scientist.

Looking for a job in health tech? Check out the other awesome healthcare jobs on the job board + give your preferences to get alerted to new postings.

Check Out The Job Board

TL;DR 

Protege provides a platform that helps healthcare AI companies get data to train and fine-tune models. They also help companies with data monetize it.

We go through the different types of AI training techniques and talk about why it’s hard to get data for these models today.

We also talk about how the demand for unstructured data and new data types now exists thanks to AI, plus some of the challenges the company might face in the form of competition/platform dynamics.

This is a sponsored post.  You can read more about my rules/thoughts on sponsored posts here. If you’re interested in having a sponsored post done, we do four per year. You can inquire here.

_______

How do you Train an AI Model?

It’s worth understanding how AI models get trained. For today’s post I’m mostly going to focus on Large Language Models (LLMs). I’m omitting a ton of details from this because both you and I, reader, are too dumb to go deeper than this.

First, you may have heard of something called foundation models. This is when you take an absolutely massive amount of unstructured and unlabeled data (e.g. the entire text of the internet) and a model learns patterns within this massive dataset. The result of that “learning” is that the LLM can then “predict” the next thing to come out based on a given set of inputs. For example - given the context words in the prompt, the most likely continuation after “Nikhil” would be “is hot”.
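If you want to see that “predict the next thing” idea in code, here’s a tiny sketch using the open-source Hugging Face transformers library and GPT-2 - my choice purely for illustration, not how any particular foundation model is actually built or served:

```python
# Minimal sketch of next-token prediction with a small open model (GPT-2).
# Illustrative only; real foundation models are vastly larger and trained differently.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "Nikhil is"  # hypothetical prompt for illustration
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits  # shape: (1, sequence_length, vocab_size)

# The model's guess for the next token is the highest-scoring entry
# in the last position's distribution over the vocabulary.
next_token_id = int(logits[0, -1].argmax())
print(tokenizer.decode(next_token_id))
```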

Additional decisions get made during training about how to align the model’s outputs with what you actually want - which data types to include, which training techniques to use, what labeled data to add, and how to optimize the model. You can decide how off the rails you want the model to get. It goes from “LinkedIn commenter” to “Kanye”.

Large language models do this specifically with text, but many models now work with other modalities like images, audio, etc. Part of the training includes turning all of these different types of data into a common set of numerical values, so if you ask something via text, an image can pop up.

Source: An image is converted into values based on its pixel intensities - do this across thousands of “8”s and a computer learns how to identify an “8”.
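To make that concrete, here’s a toy sketch using scikit-learn’s built-in 8x8 digit images - an illustrative dataset choice on my part, not anything a real foundation model trains on. Flatten each image into 64 pixel values, and a simple classifier learns to spot the “8”s:

```python
# Toy "pixels become numbers" example with scikit-learn's small digits dataset.
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

digits = load_digits()                                # each image is an 8x8 grid of pixel intensities
X = digits.images.reshape(len(digits.images), -1)     # flatten to 64 numbers per image
y = (digits.target == 8).astype(int)                  # 1 if the image is an "8", else 0

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(f"Held-out accuracy at spotting 8s: {clf.score(X_test, y_test):.2f}")
```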

The end result is a trained model with a certain set of weights that will influence its output for your given input. OpenAI, Anthropic, Facebook’s Llama, etc. all have different foundation models that optimize for different things (latency, ability to reason, types of outputs, etc.). 

You can use these out-of-the-box models to do very powerful stuff. But sometimes being trained on all of the internet can cause very weird behaviors when the model needs to do very specific tasks. It might hallucinate facts or talk about seed oils at inappropriate times.

So if you’re a business that wants to use these models to do a specific task really well, you might take one of these foundation models and train it on that specific thing. Depending on the model, some will let you take the existing weights/rules learned during training and feed in your own data or processes to make the model specific to you.

This is called “fine-tuning”. Fine-tuning will typically require data that’s cleaner and pre-processed. A bunch of other information about what’s happening might get added or labeled later by a human (because somehow the robots took the fun part of learning the trends and left us with that). Fine-tuning is very sensitive, so you need to make sure your data is clean and you have some ground-truth way to validate the right answers.

Here’s an example of the Virgo team fine-tuning an open-source model to recognize polyps in endoscopies. Below are some examples that AWS has around fine-tuning Anthropic’s models for specific use cases. It’s like getting an intern - they have a foundation from what they learned in college, but it’s actually not relevant at all to the job so you need to train them.
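As a rough sketch of what task-specific fine-tuning looks like in practice, here’s a minimal text-classification stand-in (not Virgo’s imaging setup or AWS’s actual workflow) using Hugging Face’s Trainer on a couple of made-up labeled notes:

```python
# Minimal fine-tuning sketch: adapt a small pre-trained model to one narrow task.
# The example "notes" and labels are placeholders, not real clinical data.
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2)

# In practice this would be thousands of clean, expert-labeled examples.
data = Dataset.from_dict({
    "text": ["no polyp visualized in the sigmoid colon",
             "6mm sessile polyp found in the ascending colon"],
    "label": [0, 1],
})
data = data.map(lambda x: tokenizer(x["text"], truncation=True), batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="finetuned-model", num_train_epochs=3),
    train_dataset=data,
    tokenizer=tokenizer,  # enables dynamic padding of each batch
)
trainer.train()
```

The point of the sketch: fine-tuning starts from the foundation model’s existing weights and only nudges them with a clean, labeled, task-specific dataset - which is exactly why getting that dataset is the hard part.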


In fact, many products will use multiple fine-tuned foundation models for different tasks under the hood if they’re particularly good in some areas vs. others (e.g. image classification, text generation, etc.).

If you’re getting really sophisticated, you might want to have a foundation model that has a general understanding of your domain. This can make sense if you need to do a bunch of tasks within a domain/industry vs. trying to fine-tune to each specific task individually. For example, if you’re an AI scribing company that needs to know a lot about the nuances of the healthcare industry to write notes, highlight the important parts, learn what a “moat” is, etc., you’d start with a pre-trained model.  

You could then customize a pre-trained model with clinical language or a company’s unique terminology and concepts. This requires dumping a bunch of unstructured data within a domain and letting the foundation model learn the new rules and syntaxes within the constrained space.
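Here’s a rough sketch of that “continued pre-training on domain text” step, assuming Hugging Face tooling and a couple of placeholder clinical snippets standing in for a real in-domain corpus:

```python
# Sketch of continued pre-training: same next-token objective as the original
# foundation model, but run over unlabeled domain text so the model picks up
# the domain's vocabulary and syntax. Snippets below are placeholders.
from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained("gpt2")

corpus = Dataset.from_dict({"text": [
    "Patient presents with dyspnea on exertion; started on a diuretic.",
    "Prior auth denied; appeal submitted with supporting imaging.",
]})
corpus = corpus.map(lambda x: tokenizer(x["text"], truncation=True), batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="domain-adapted-model", num_train_epochs=1),
    train_dataset=corpus,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),  # plain next-token objective
)
trainer.train()
```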

[Psst - We go through all of the specifics in our LLMs in Healthcare course, starting 4/7.]

The important things to take away from this:

  1. Foundation models are now multimodal and train on text, images, video, and more.
  2. Fine-tuning these models requires access to very clean reference data that’s meant for a specific use case or set of use cases.
  3. Training your own model requires a lot of domain-specific data to learn the rules.

What’s the Pain Point Being Solved?

Healthcare AI companies increasingly need more data that’s difficult to obtain.

Part of this is for differentiation purposes. As soon as data becomes part of the public domain, every company has access to it, which means it’s no longer an edge. This also skews benchmarking results – it’s like testing someone who was staring at an answer key. You wouldn’t know whether they could synthesize a similar response to a similar question asked differently.

Another part is that you need ALL of the data from a lot of different places and in different formats to get the correct answer. Answering a question about radiology not only requires the scan itself but all of the context about the patient (the record, history, etc.), which lives in different places. You also need all of that data over long periods of time, which means even more places. 

And the fault tolerance for AI healthcare applications is low. So you really need to make sure your outputs are right. As more companies also differentiate based on how AI tools work in real-world settings, building confidence in more specific tasks becomes more important and requires niche datasets. 

Clean, longitudinal, and multimodal data is just hard to get. If you wanted to train or fine-tune an AI model today, you would need to:

  • Figure out the different types of data needed and where to find it
  • Find the right person at the hospital/company with the data who can discuss this and actually take a meeting
  • Go through a bunch of security hoops and also get samples of data to make sure it’s the right kind you need
  • Negotiate data licensing agreements with each place where the data lives (in many cases, the data holder will want equity in your company)
  • Connect it to the rest of the patient’s record if you want more context
  • Make sure the data is de-identified properly each time before it’s sent to you
  • Get someone with expertise to label the data if needed
Calling it gross feels unprofessional, cmon.

This process would have to happen every time with each hospital, MRI imaging company, digital pathology company, etc. if you wanted to get a training data set. This can take a very long time (1+ year each), which comes directly from your life expectancy. It can also be very expensive both from a personnel hours standpoint and from an equity/cash basis.

For many companies, that’s not where they want to build their core competency; they just want to create a model that works for their business process.

What does Protege do?

Protege is a platform to make it easy for companies to get data they need to train AI models and for companies with that data to monetize it.

If you’re a company that needs healthcare data, you make a request for the kinds of data you need and which features of those datasets you want. Datasets they have include:

  • Structured and unstructured EMR data (doctors’ notes, etc.)
  • Pathology slides
  • Claims and remittance data
  • Mortality data
  • Demographic data
  • Images (X-rays, CT scans, etc.)
  • Lab reports and PDFs
  • Genomic data
  • Tens of thousands of medical research articles
  • De-identified transcripts of patient/doctor Q&A
  • Wearables data
  • Social determinants of health

You might want different permutations across these various datasets over really long periods of time for specific use cases. For example:

  1. You want to train a radiology AI to assess images and create risk scores. You’ll need very well-labeled imaging data, the patient records to understand progression and the medications they were on, death data to understand what their survival looked like, etc. 
  2. You want to outline the best next step for a patient’s treatment if you’re a clinical decision support tool. You’ll want to understand as much of their medical history to date as possible, across their structured and unstructured EHR data, any imaging, any pathology slides if they have cancer, etc.
  3. You want curated and labeled examples of doctors’ notes tied to specific diagnosis codes to help train your AI scribe. 

Once you choose the specifics of the dataset you want, Protege has a pipeline that tells you how many rows of data there are, makes the data compliant, and then makes it available for export to the customer.

On the backend, the company has done a few things to streamline this process.

First, Protege goes to all different data vendors and creates revenue-share licensing agreements with them. This means they take on the contracting complexity. Have you ever slept and seen redlines enter your dreamscape?

Second, they work with different companies that can de-identify the data + connect it across datasets or provide data labeling and structuring services as needed. We talked about how this happens in our “how data gets sold” post.

Sometimes, the companies with rich data also don’t have it ready for commercial use like this, so Protege will find vendors that help them do that (including actually digitizing data that isn’t digital yet). 

Third, they harmonize the data so it’s easy to query across datasets. You can search diagnoses, zip codes, and birth years, and they’ll refer to the same concepts across datasets.
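As a purely illustrative example of what harmonization can mean (this is not Protege’s actual pipeline, and the column names are made up): map each source’s schema onto shared concepts so one query works across datasets.

```python
# Illustrative-only sketch: two sources with different schemas get mapped
# onto a shared vocabulary, then joined on a common patient key.
import pandas as pd

claims = pd.DataFrame({"member_id": ["A1"], "dx_code": ["E11.9"], "zip": ["10001"]})
emr    = pd.DataFrame({"patient_id": ["A1"], "diagnosis": ["E11.9"], "birth_yr": [1984]})

# Rename each vendor's columns to the shared concept names.
claims = claims.rename(columns={"member_id": "patient_id", "dx_code": "diagnosis", "zip": "zip_code"})
emr    = emr.rename(columns={"birth_yr": "birth_year"})

# Now one query ("show me everyone with this diagnosis") spans both datasets.
harmonized = claims.merge(emr, on=["patient_id", "diagnosis"], how="outer")
print(harmonized[harmonized["diagnosis"] == "E11.9"])
```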

What Is The Business Model And Who Is The End User?

Protege makes its money in three ways.

  1. They charge platform fees to their customers to use the service.
  2. They charge a revenue share to their data vendor partners.
  3. If additional services are needed to make the data useful (data labeling, digitization, identity resolution, etc.), they work with third-party partners to do that and split the revenue.

There are a few different persona types that Protege is targeting:

Healthcare AI applications - Every healthcare AI company needs training data. One customer is building AI for medical coding, and another is building a multimodal AI model for medical imaging and radiology. Both need lots of clean data across many sources.

Academic researchers - AI research is booming, and there’s a massive need in the academic community for easier access to data. Academics today don’t have ready access to the most valuable data types in AI (including unstructured data). Some examples: a group at an academic hospital trying to predict when a pregnant woman might give birth, or government-led programs like NIH’s “Advancing Health Research through Ethical, Multimodal AI” initiative.

Foundation models - Foundation model companies are regularly striking 7- and 8-figure deals with data owners that have rich assets, and are increasingly turning to healthcare to fill gaps in their training data (e.g. imaging data, genomic data, etc.). Healthcare use cases can mean much larger contracts but with way smaller tolerance for errors, so building for accuracy is key and requires a lot of specific data.

In-house AI teams - Every company is thinking about how to embed AI into their workflows. Life insurance companies are trying to generate better underwriting policies, medical device companies are looking to understand the likelihood of a procedure being reimbursed based on a scan, and EMRs are using AI to move the dropdown menus to new places as soon as you learned where they were.

Job Openings

Protege is hiring applied data scientists and data engineers. 

They’re looking for folks who’ve worked with the personas above or any of their data partners. Or if you’re just a beast with data, shoot your shot.

Out-Of-Pocket Take

A few things I like about Protege: 

Making healthcare data commercialization easier - I get pinged all the time by companies saying “hey we have all this interesting data, but no idea if people would be willing to pay for it”. To which I respond, “how did you get this phone number?”. Especially now as companies are trying to beef up their revenue, many are looking for alternative areas to monetize, like data commercialization.

However, going from “I have this data” to “what is this data worth” to getting paid has historically required a ton of steps for an unknown payoff. Protege is betting that by creating a marketplace for this data, they’ll also make it easier for companies who have it to know what it’s worth and simplify the process for getting paid for it. 

One additional benefit might be newly digitized datasets, thanks to demand from a new use case - training and evaluating AI models. For example, Protege has been getting more requests for pathology slides, images, etc., which require manual digitization. A lot of current AI models have worked by training on all the same public datasets (e.g. the open internet, MIMIC, etc.), so companies want to differentiate by training and benchmarking on harder-to-get data.

Healthcare + non-healthcare data mixing - While they’re starting in healthcare, Protege is trying to become a general data marketplace for AI. They even acquired a company that owns a bunch of global media, sports, and news footage. Maybe one day we can find out if cohorts that watch “Selling Sunset” score high on PHQ-9s. 

Because healthcare data has traditionally been so siloed and requires so many hoops to jump through, it’s been difficult to do anything that requires mixing it. Only recently have large pools of unstructured data been considered valuable; historically they’ve been viewed mostly as noise. But this kind of data is particularly good for AI, because a model needs to train on unstructured data to learn the rules before it can handle future unstructured data use cases.

By making it easier to get the data, we might see novel explorations about how non-healthcare factors impact our health, and vice versa. Plus, Protege can work with customers that require many different types of data (e.g. foundation models) who might prefer working with a single vendor for all of it.

Potential regulatory changes - Some budding regulations are focused on requiring AI companies to know more specifically where their training data is coming from and how it affects their outputs. HHS AI transparency rules require disclosure of training data acquisition and sources, and Washington has new bills that require AI companies to disclose what data they’re trained on. In Europe, they’re going to require you to physically shake hands with every single user you have data on or you’ll get sterilized for GDPR non-compliance.

These requirements mean that AI companies will have to have answers if they get audited, and companies like Protege can make that process easier.

As with all companies, here are a few things Protege might struggle with as it grows:

Competition - Competition is probably the biggest challenge, and it comes in a few flavors, like a rivalrous Neapolitan ice cream.

The first is other data marketplaces. There are companies in the healthcare data broker space, and most of their work is with pharma. They provide data for everything from sales/marketing optimization to finding clinical trial sites to building a case for why a drug should get reimbursed $X, and more. This is typically structured data in very high-dollar areas like oncology.

Protege believes that the AI training use case has very different requirements from the pharma use cases. AI use cases need way more types of unstructured data, over long periods of time, and across many different modalities, independent of disease states. Plus, the scale for training is different - it’s rare a pharma company will need a million medical images. They’re not data freaks like that.

The second competition vector is data vendors going directly to the companies that want to buy from them. Part of this depends on how much training data becomes necessary and how rich/self-contained the data is within that one source. Training data requirements do seem to be getting smaller depending on the use case, so it’s possible that one source has most of what you need. You have coalitions of companies banding together, like Blue Health Intelligence and Truveta, to aggregate data from multiple first-party data repositories. Or, companies that offer one service have access to that data and can monetize it (e.g. EHR vendors offering de-identified data from partner hospitals).

However, even most of these datasets will probably have gaps that need to be complemented with other data sources. Or the scale of their datasets will be naturally smaller, or come with strings attached for using them.


Platform dynamics - Most successful platforms have network effects. Users go to a platform because there are lots of suppliers/applications built on top and the platform provides some benefit to the users (vets the apps, shows reviews, etc.). The supply side goes to a platform because it has users, plus services that make it easy for them to build on top.

This can generally make a very strong business, but can be tricky if the supply side uses every platform to chase the best economics and/or if every user is on every platform. For example, basically all riders and all drivers are on both Lyft and Uber, which generally means competing on price, offering a much wider product portfolio, or building loyalty features. Protege might face similar dynamics, with data suppliers and customers working with whichever platform gives them the best economics. 

Protege is betting that the hard part of this business, coordinating a lot of different stakeholders, is a moat. It’s just hard to find data buyers, contract with data sellers, and make data ready to be sold. Right now, they offer a lot of white glove services to make that happen (e.g. going out to find a data vendor if a customer asks for something they don’t have), but that also could limit their scalability.

Healthcare vs. non-healthcare focus - The beauty of creating a data marketplace for more than just healthcare is that you increase the TAM, you can do interesting data mixing with non-healthcare data, and you can execute larger contracts with customers that are looking for all different kinds of data. 

The downside, however, is that increasing the surface area of industries means you could potentially lose to someone that just focuses on one industry at a time. Especially if a business requires a lot of white glove service and coordination to enable any transaction - a team that offers value-add services specific to an industry or has more relationships might edge them out.

Double-edged sword. Is there such a thing as Adderall, but for a business?

Conclusion And Parting Thoughts

In the new AI era, very different types of data are now precious. In healthcare in particular, the lack of access to data is kneecapping companies that want to create and evaluate their AI products.

Protege is trying to make it easier to access that data, which will hopefully improve healthcare AI products and make it easier to build on the foundation models. Somehow the machines are going to get access to my own record faster than I can.

Thinkboi out,

Nikhil aka “From RAG to Riches”

Twitter: ​@nikillinit​

IG: ​@outofpockethealth​

Other posts: ​outofpocket.health/posts​

--

{{sub-form}}

‎If you’re enjoying the newsletter, do me a solid and shoot this over to a friend or healthcare slack channel and tell them to sign up. The line between unemployment and founder of a startup is traction and whether your parents believe you have a job.
