PALO ALTO — Hippocratic AI launched out of stealth this month to announce the industry’s first safety-focused Large Language Model (LLM) designed specifically for healthcare, as well as a $50M seed round co-led by General Catalyst and Andreessen Horowitz.
Large language models (LLMs) and foundation models (FMs) such as ChatGPT and GPT-4 have surprised the world with their abilities. While researchers have shown that these AI models can pass the United States Medical Licensing Examination (USMLE), no company has built a commercial model specifically tuned for healthcare applications. Hippocratic AI is building the first LLM for healthcare, with an initial focus on non-diagnostic, patient-facing applications. This focus will allow the company to ensure patient safety while improving healthcare access and outcomes.
“The healthcare industry needs its own AI platform, one that is focused on empowering the workforce, reducing burnout, and improving patient safety and experiences with the healthcare system. We joined forces with the Hippocratic AI team, our health assurance ecosystem, and the a16z team to build this platform. Our goal is to fundamentally increase the supply and scalability of healthcare professionals. This is the key to achieving the health assurance vision: a more proactive, more affordable, and equitable system of care for all,” said Hemant Taneja, CEO and Managing Director at General Catalyst.
Hippocratic AI was founded by a group of physicians, hospital administrators, Medicare professionals, and artificial intelligence researchers from El Camino Health, Johns Hopkins, Washington University in St. Louis, Stanford, UPenn, Google, and Nvidia.
“After working with Munjal and team for years in his prior company, we know that his lived experience as a healthcare and tech operator gives him an edge in understanding what it takes to bring high-ROI products to market – especially at a time when existing industry players are in such dire need of better operating leverage and financial sustainability. We believe Hippocratic AI’s cross-disciplinary, safety-first approach is what the healthcare industry needs to be able to maintain trust in the power of responsible deployment of generative AI solutions,” said Julie Yoo, General Partner at Andreessen Horowitz.
To build a safer large language model, the company has focused on three main areas: certification, RLHF with healthcare professionals, and bedside manner.
Certification
Passing the USMLE is not enough to ensure a model is ready for the wide variety of healthcare roles that exist in care and payor settings. Hippocratic AI therefore tested its model on 114 healthcare certifications and roles. The company aimed not just to achieve a passing score but to outperform existing state-of-the-art language models such as GPT-4 and other commercially available models. It outperformed GPT-4 on 105 of the 114 tests and certifications, by 5% or more on 74 of them, and by 10% or more on 43 of them. Below are some sample results. Full results are available at www.HippocraticAI.com/benchmarks.
Name | Description | Commercial LLM #1 | Commercial LLM #2 | GPT-4 | Hippocratic | Δ Improvement vs Best Competitor |
NAPLEX | North American Pharmacist Licensure Examination | 51.0% | 0.0% | 70.9% | 91.1% | 20.2% |
NCLEX-RN | Registered Nurse | 58.8% | 25.8% | 76.2% | 88.6% | 12.4% |
CPNP-AC | Acute Care Certified Pediatric NP | 64.0% | 22.0% | 86.7% | 96.0% | 9.3% |
CPC | Certified Professional Coder | 54.7% | 50.0% | 65.3% | 71.0% | 5.7% |
ABOG | American Board of Obstetrics and Gynecology Licensing Exam | 44.00% | 24.00% | 80.30% | 92.33% | 12.03% |
ABU | American Board of Urology – Licensing Exam | 42.09% | 24.24% | 67.30% | 77.10% | 9.80% |
Hospital Safety Training | Hospital Safety Training Compliance Quiz | 39.4% | 27.3% | 48.5% | 72.7% | 24.2% |
RD | Registered Dietician | 57.1% | 46.9% | 71.4% | 83.7% | 12.3% |
CLC | Certified Lactation Consultant | 60.9% | 51.7% | 79.3% | 98.9% | 19.6% |
CPCO | Certified Professional Compliance Officer | 60.7% | 54.0% | 67.3% | 86.0% | 18.7% |
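For readers who want to check the arithmetic, the Δ column in the sample rows above equals the Hippocratic score minus the highest competing score, expressed in percentage points. The short sketch below reproduces that calculation for four of the sample rows and counts how many clear a given threshold. The function and variable names are illustrative assumptions, not part of any published Hippocratic AI tooling, and the counts apply only to these sample rows, not to the full set of 114 tests.

```python
# Sample rows copied from the table above:
# (name, Commercial LLM #1, Commercial LLM #2, GPT-4, Hippocratic), in percent.
SAMPLE_RESULTS = [
    ("NAPLEX", 51.0, 0.0, 70.9, 91.1),
    ("NCLEX-RN", 58.8, 25.8, 76.2, 88.6),
    ("CPNP-AC", 64.0, 22.0, 86.7, 96.0),
    ("CPC", 54.7, 50.0, 65.3, 71.0),
]


def improvement_vs_best(llm1, llm2, gpt4, hippocratic):
    """Return the gap, in percentage points, to the best competing score."""
    best_competitor = max(llm1, llm2, gpt4)
    # Round to two decimals to hide floating-point noise.
    return round(hippocratic - best_competitor, 2)


def count_at_least(rows, threshold):
    """Count rows whose gap to the best competitor is >= threshold points."""
    return sum(
        1 for _, *scores in rows if improvement_vs_best(*scores) >= threshold
    )


# Hypothetical helper output: one delta per sample exam.
deltas = {name: improvement_vs_best(*scores) for name, *scores in SAMPLE_RESULTS}

for name, delta in deltas.items():
    print(f"{name}: +{delta} points")
print("Rows ahead by 5+ points:", count_at_least(SAMPLE_RESULTS, 5))
print("Rows ahead by 10+ points:", count_at_least(SAMPLE_RESULTS, 10))
```

In every sample row GPT-4 is the strongest competitor, so the gap to the best competitor and the gap to GPT-4 coincide here.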
RLHF with Healthcare Professionals
Hippocratic AI has decided that the best people to determine whether an LLM is ready for deployment in a given healthcare role are the experts who serve in that role today. Large language models can be shaped with human feedback through a technique called Reinforcement Learning from Human Feedback (RLHF). Many believe this technique is what led to the remarkable performance of ChatGPT compared with prior versions of OpenAI’s language models.
In building Hippocratic AI, the company has engaged healthcare professionals to help guide and train the LLM by rating its responses.
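The announcement does not describe how these ratings are used in training. As a general illustration only: publicly described RLHF pipelines often convert human ratings into preference pairs (a preferred response and a rejected one), which then train a reward model. The sketch below shows that conversion step under stated assumptions: a 1 to 5 rating scale, several raters per response, and hypothetical names throughout. It is not Hippocratic AI's pipeline.

```python
from itertools import combinations
from statistics import mean

# Hypothetical data: candidate responses to one prompt, each scored 1-5
# by several healthcare professionals (scale and names are assumptions).
ratings = {
    "response_a": [5, 4, 5],
    "response_b": [2, 3, 2],
    "response_c": [4, 4, 3],
}


def preference_pairs(ratings_by_response):
    """Return (preferred, rejected) pairs ordered by mean rater score.

    Responses with equal mean scores produce no pair.
    """
    avg = {resp: mean(scores) for resp, scores in ratings_by_response.items()}
    pairs = []
    for a, b in combinations(avg, 2):
        if avg[a] > avg[b]:
            pairs.append((a, b))
        elif avg[b] > avg[a]:
            pairs.append((b, a))
    return pairs


pairs = preference_pairs(ratings)
for preferred, rejected in pairs:
    print(f"{preferred} preferred over {rejected}")
```

Averaging is the simplest way to combine raters; a real system might instead weight raters by role or keep disagreements for review.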
“RLHF with healthcare professionals isn’t just a feature but is really our commitment to partner deeply with the industry,” said Munjal Shah, Co-Founder and CEO of Hippocratic AI. “We aren’t just saying these professions will help us evaluate our system. We are saying we won’t launch each unique role for the LLM unless the professionals who do that exact task today agree the system is ready and safe.”
Some of the roles and tasks the company is exploring include patient navigator, dietician, genetic counselor, enrollment specialist, medication reminders, and more.
Bedside Manner
“In healthcare settings, it isn’t just important to answer the patient accurately. It is equally important that it is done with great bedside manner. Many studies have shown that bedside manner impacts emotional well-being and quality of outcomes. This isn’t just true for doctors but also true for everyone interacting with patients: billing agents, schedulers, and more,” said Meenesh Bhimani MD, Co-Founder and Chief Medical Officer of Hippocratic AI.
To date there are no benchmarks for evaluating the bedside manner of a language model when interacting with patients. Hippocratic AI will be releasing the first of many bedside manner benchmarks for the entire community to use. Below are the initial results the company has achieved against these benchmarks.
Name | Commercial LLM #1 | GPT-4 | Hippocratic | Δ Improvement vs Best Competitor |
Shows Empathy | 30.0% | 68.3% | 75.0% | 6.7% |
Shows care and compassion | 43.3% | 75.0% | 85.0% | 10.0% |
Making Patient feel at ease | 5.0% | 29.2% | 57.5% | 28.3% |
Taking a personal interest in patient’s life | 33.3% | 63.3% | 70.0% | 6.7% |
Helps patient take control | 35.0% | 61.7% | 65.0% | 3.3% |