Retnovi AI Blog.

Why Ophthalmology Needs a Foundation Model: Building the Future of Eye Care AI

Amir F Yazdanyar MD PhD

The Fragmented Reality of Ophthalmology AI Today

Imagine an ophthalmologist's workflow: a single patient visit generates multiple imaging modalities: fundus photography, optical coherence tomography (OCT), fluorescein angiography, and potentially more. Each image requires analysis, comparison to previous visits, and integration with clinical notes. Now multiply this by 80-100 patients per day, and you begin to understand the scale of the challenge.

The current AI landscape in ophthalmology mirrors this fragmentation. We have specialized models for diabetic retinopathy detection, separate models for AMD classification, different systems for OCT analysis, and yet others for glaucoma screening. Each model is trained on its own dataset, optimized for its specific task, and requires separate integration into clinical workflows.

This approach has fundamental limitations:

  • No Longitudinal Understanding: A diabetic retinopathy model sees only one moment in time. It doesn't remember that this patient had minimal changes six months ago, or that their HbA1c improved after medication changes. Context is lost.

  • Fragmented Workflows: Clinicians must switch between multiple AI tools, each with different interfaces, confidence metrics, and output formats. This creates cognitive overhead and reduces adoption.

  • Data Inefficiency: Each new task requires collecting and annotating a new dataset from scratch. Training a model for a rare condition becomes prohibitively expensive when you can't leverage knowledge from related tasks.

  • Limited Generalization: A model trained only on fundus photos from one demographic struggles when applied to different populations or imaging devices. Generalization requires diverse pre-training.

  • Missing Multimodal Integration: Real clinical decision-making integrates visual findings with patient history, symptoms, and lab results. Single-modality models can't capture this holistic picture.

The Foundation Model Paradigm Shift

Foundation models represent a fundamental shift in how we approach medical AI. Instead of building separate models for each task, we train one large model on diverse, multimodal data that can be adapted to many downstream tasks with minimal fine-tuning.

In natural language processing, foundation models like GPT and BERT transformed the field by learning general language representations that could then be fine-tuned for specific tasks. Vision-language models like CLIP demonstrated that understanding images and text together enables capabilities like zero-shot classification and cross-modal retrieval.

In ophthalmology, foundation models offer transformative advantages:

1. Unified Representation Learning

A foundation model trained on millions of ophthalmic images across multiple modalities (fundus, OCT, angiography, autofluorescence) learns generalizable visual representations. These representations capture anatomical structures, pathological patterns, and disease signatures that transfer across tasks.

When a new task emerges (say, detecting a rare retinal dystrophy), the foundation model can leverage its learned representations. Fine-tuning requires far fewer labeled examples than training from scratch. This is especially critical for rare diseases where collecting large datasets is impractical.

2. Multimodal Understanding

Clinical practice is inherently multimodal. An ophthalmologist considers patient history, symptoms, imaging findings, and lab results together. A foundation model can integrate these modalities, learning the relationships between visual patterns and clinical context.

Recent research demonstrates this potential. EyeCLIP, a multimodal visual-language foundation model trained on 2.77 million ophthalmology images across 11 modalities, achieved state-of-the-art performance in disease classification, visual question answering, and cross-modal retrieval [1]. The model can answer questions like "What is the severity of diabetic retinopathy in this image?" and retrieve relevant images based on textual descriptions.

3. Longitudinal Patient Understanding

Perhaps the most significant advantage of foundation models is their ability to maintain context across time. A patient's retinal images from 2019, 2021, and 2024 aren't just independent snapshots; they represent a disease trajectory. A foundation model can learn temporal patterns, predict progression, and identify subtle changes that might be missed when viewing images in isolation.

This capability is particularly valuable for chronic conditions like diabetic retinopathy, AMD, and glaucoma, where monitoring disease progression over years is critical for treatment decisions.
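As a toy illustration of the longitudinal idea (not our actual method), one could track how far each follow-up visit's image embedding drifts from the baseline visit. The 2-d embeddings below are made up; a real foundation model would produce high-dimensional ones.

```python
import numpy as np

def progression_signal(visit_embeddings):
    """Cosine distance of each follow-up visit's embedding from the
    baseline visit: a rising curve hints at anatomical change."""
    base = visit_embeddings[0] / np.linalg.norm(visit_embeddings[0])
    drift = []
    for v in visit_embeddings[1:]:
        v = v / np.linalg.norm(v)
        drift.append(1.0 - float(base @ v))
    return drift

# Toy 2-d embeddings for visits in 2019, 2021, and 2024: the last
# visit has drifted furthest from baseline, suggesting the most change
visits = [np.array([1.0, 0.0]),
          np.array([1.0, 0.1]),
          np.array([0.5, 1.0])]
drift = progression_signal(visits)   # increasing values
```

A real system would learn a temporal model over embedding sequences rather than use raw cosine drift, but the sketch captures why comparable representations across visits matter.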

4. Zero-Shot and Few-Shot Learning

Traditional models require extensive labeled data for each new task. Foundation models, through their broad pre-training, can perform zero-shot tasks (without any task-specific training) or few-shot tasks (with minimal examples).

For example, RETFound, one of the first ophthalmology foundation models, outperformed traditional deep learning models even when fine-tuned on small datasets [2]. This democratizes AI development, making it feasible for rare conditions and resource-limited settings.
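To make "zero-shot" concrete, here is a minimal CLIP-style classification sketch in NumPy. The 4-d embeddings and label set are made up for illustration; a real model would produce them with learned image and text encoders.

```python
import numpy as np

def zero_shot_classify(image_emb, text_embs, labels):
    """Pick the label whose text-prompt embedding is most similar to the
    image embedding (CLIP-style zero-shot classification)."""
    # L2-normalize so the dot product is cosine similarity
    img = image_emb / np.linalg.norm(image_emb)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    sims = txt @ img                              # one similarity per label
    probs = np.exp(sims) / np.exp(sims).sum()     # softmax over labels
    return labels[int(np.argmax(sims))], probs

# Toy 4-d embeddings standing in for real encoder outputs
labels = ["no DR", "mild NPDR", "proliferative DR"]
text_embs = np.array([[1.0, 0.0, 0.0, 0.0],
                      [0.0, 1.0, 0.0, 0.0],
                      [0.0, 0.0, 1.0, 0.0]])
image_emb = np.array([0.1, 0.9, 0.2, 0.0])        # closest to "mild NPDR"
label, probs = zero_shot_classify(image_emb, text_embs, labels)
```

No task-specific training happened here: classification falls out of comparing image and text embeddings in a shared space, which is exactly what makes zero-shot transfer possible.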

The Current State: Pioneering Efforts

Several research groups have recognized the potential of foundation models in ophthalmology. Their work validates the approach and provides a foundation for broader implementation:

EyeFound: Multimodal Generalist Foundation Model

Developed by researchers from multiple institutions, EyeFound was trained on 2.78 million images from 227 hospitals across 11 ophthalmic modalities [3]. The model demonstrated superior performance in:

  • Diagnosing eye diseases across multiple conditions
  • Predicting systemic disease incidence (e.g., cardiovascular events from retinal images)
  • Zero-shot multimodal visual question answering

This work demonstrates that a single foundation model can handle diverse ophthalmic tasks, from disease classification to prognostic prediction.

VisionUnite: Clinical Knowledge Enhancement

VisionUnite extended the foundation model approach by explicitly incorporating clinical knowledge during pre-training [4]. Trained on 1.24 million image-text pairs, the model achieved diagnostic capabilities comparable to junior ophthalmologists in various clinical scenarios, including open-ended multi-disease diagnosis.

The model's ability to handle open-ended questions like "What do you see in this image?" rather than just binary classification represents a significant step toward more natural AI-assisted clinical workflows.

EyeCLIP: Visual-Language Integration

EyeCLIP demonstrated the power of visual-language foundation models in ophthalmology [1]. By training on multimodal data with partial text annotations, the model learned to connect visual patterns with clinical descriptions. This enables capabilities like:

  • Generating detailed reports from images
  • Answering clinical questions about images
  • Retrieving similar cases based on descriptions

The Clinical Imperative: Why This Matters Now

The need for foundation models in ophthalmology isn't just a technical improvement; it's a clinical necessity driven by several converging trends:

Exponential Growth in Imaging Volume

Ophthalmic imaging volume is growing exponentially. Advances in imaging technology have made OCT and fundus photography routine in clinical practice. A mid-size retina practice may generate 20,000-25,000 patient encounters annually, each producing multiple images.

The challenge: Human capacity to review, remember, and compare these images longitudinally is fundamentally limited. A clinician seeing 80-100 patients per day cannot maintain detailed memory of each patient's imaging history. AI systems that can maintain perfect recall and compare images across years become essential.

The Access Crisis

The global shortage of ophthalmologists is well-documented. While the United States has approximately 50 ophthalmologists per 1 million people, Sub-Saharan Africa has only 2 per million [5]. With an aging population and increasing prevalence of diabetes (projected to reach 1.3 billion people worldwide by 2050), the gap between need and capacity widens.

Foundation models can scale expert-level analysis to underserved populations through telemedicine and automated screening. But only if they work across diverse populations, imaging devices, and clinical contexts, which is exactly the generalization foundation models are designed to achieve.

The Administrative Burden

Healthcare employment data reveals a troubling trend: administrative roles have grown about five times faster than physician roles since 1997 [6]. This represents time shifted away from patient care toward documentation, billing, and administrative tasks.

AI can help reverse this trend by automating routine analysis and documentation. But fragmented, single-task models require constant switching between tools and manual integration of results. A foundation model that provides comprehensive analysis in one step, from image interpretation to report generation, can reduce this burden.

The Evidence Gap

Modern healthcare requires evidence-based justification for treatment decisions, especially for expensive interventions. Payers increasingly demand clear documentation of disease severity, progression, and treatment necessity.

Foundation models can automatically generate evidence packets that document disease findings, compare to previous visits, and provide confidence metrics. This capability becomes more powerful when the model understands the full patient context across time and modalities.

Why We're Building Retnovi's Foundation Model

At Retnovi AI, we're building a foundation model specifically for ophthalmology because we believe the current fragmented approach is fundamentally limiting the potential of AI in eye care.

Our Vision: "All Images. One Model."

Our foundation model will:

  • Remember across time: Maintain patient context across years of visits, comparing current images to historical ones
  • Understand all modalities: Process fundus photos, OCT, angiography, and other imaging together
  • Integrate text and images: Connect clinical notes, patient history, and imaging findings
  • Forecast with confidence: Predict disease progression and treatment response with uncertainty quantification
  • Explain its reasoning: Show clinicians why it made specific recommendations

Starting with Ophthalmology

We're starting with ophthalmology for several reasons:

  1. High imaging volume: Ophthalmology generates massive amounts of imaging data, providing the scale needed for foundation model training
  2. Clear use cases: The need for longitudinal tracking, multimodal analysis, and scalable screening is well-defined
  3. Diverse modalities: The field uses multiple imaging types, making it an ideal testbed for multimodal foundation models
  4. Clinical expertise: Our team includes practicing ophthalmologists who understand real-world clinical needs

The Path Forward

Our foundation model development follows a phased approach:

Phase 1: Core Foundation Model

  • Train on diverse ophthalmic imaging datasets across multiple modalities
  • Implement self-supervised learning to leverage unlabeled data
  • Develop multimodal understanding of images and clinical text

Phase 2: Temporal Understanding

  • Add longitudinal learning capabilities
  • Enable comparison across patient visits
  • Develop progression prediction and risk forecasting

Phase 3: Clinical Integration

  • Fine-tune for specific clinical workflows
  • Integrate with PACS and EMR systems
  • Enable automated report generation and evidence documentation

Phase 4: Generalization

  • Extend to additional ophthalmic subspecialties
  • Adapt to different imaging devices and populations
  • Enable few-shot learning for rare conditions

The Technical Foundation

Building a foundation model for ophthalmology requires addressing several technical challenges:

Data Curation and Diversity

Foundation models require diverse, large-scale datasets. We're aggregating data from multiple sources:

  • Public ophthalmic imaging datasets
  • Collaborations with academic medical centers
  • De-identified clinical data from partner practices

Ensuring diversity across demographics, imaging devices, disease presentations, and imaging modalities is critical for generalization.

Self-Supervised Learning

Labeling millions of medical images is prohibitively expensive. Self-supervised learning techniques allow models to learn useful representations from unlabeled data. Approaches like contrastive learning, masked image modeling, and temporal consistency learning can leverage the vast amount of unlabeled ophthalmic images available.

Multimodal Architecture

Our architecture must seamlessly integrate:

  • Visual encoders for different imaging modalities
  • Text encoders for clinical notes and reports
  • Cross-modal attention mechanisms to connect visual and textual information
  • Temporal modeling for longitudinal analysis
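The cross-modal attention piece can be sketched in a few lines of NumPy: text tokens query over image-patch features and take softmax-weighted averages of them. This is a deliberately stripped-down, single-head version; learned Wq/Wk/Wv projections and multi-head structure are omitted.

```python
import numpy as np

def cross_attention(queries, keys_values):
    """Single-head cross-modal attention: each text token (query) takes a
    softmax-weighted average of image-patch features (keys == values here)."""
    d = queries.shape[1]
    scores = queries @ keys_values.T / np.sqrt(d)     # (n_text, n_patches)
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)     # each row sums to 1
    return weights @ keys_values                      # image-informed text features

rng = np.random.default_rng(1)
text_tokens   = rng.normal(size=(4, 16))   # e.g. 4 clinical-note tokens
image_patches = rng.normal(size=(9, 16))   # e.g. a 3x3 grid of OCT patch features
fused = cross_attention(text_tokens, image_patches)
```

Each output row is a convex combination of patch features, so textual context literally selects which image regions inform the fused representation.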

Uncertainty Quantification

Medical AI must know when it's uncertain. Our foundation model includes built-in uncertainty quantification, allowing it to abstain from predictions when confidence is low. This is critical for clinical trust and safety.
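The simplest form of abstention is selective prediction on the model's own confidence; the sketch below uses a plain softmax threshold, and the 0.85 cutoff is an arbitrary placeholder rather than a validated clinical threshold.

```python
import numpy as np

def predict_or_abstain(probs, threshold=0.85):
    """Selective prediction: return the top class index only when the
    model's confidence clears the threshold; otherwise return None,
    signalling that the case should be deferred to a clinician."""
    top = int(np.argmax(probs))
    return top if probs[top] >= threshold else None

confident = predict_or_abstain(np.array([0.95, 0.03, 0.02]))  # returns 0
deferred  = predict_or_abstain(np.array([0.45, 0.35, 0.20]))  # returns None (abstain)
```

In practice, raw softmax scores are often miscalibrated, so production systems pair thresholds like this with calibration or ensemble-based uncertainty estimates.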

Explainability

Understanding why a model made a specific recommendation is essential for clinical adoption. Our foundation model provides:

  • Visual attention maps highlighting relevant image regions
  • Textual explanations connecting findings to recommendations
  • Confidence scores for different aspects of the analysis
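One model-agnostic way to produce attention-style maps is occlusion saliency: grey out each image patch in turn and measure how much the prediction drops. The scoring function below is a toy stand-in for a real classifier's disease probability.

```python
import numpy as np

def occlusion_saliency(image, score_fn, patch=4):
    """Occlusion saliency: grey out each patch in turn and record how much
    the model's score drops. Patches that matter most drop it most."""
    base = score_fn(image)
    h, w = image.shape
    saliency = np.zeros((h // patch, w // patch))
    for i in range(0, h, patch):
        for j in range(0, w, patch):
            occluded = image.copy()
            occluded[i:i+patch, j:j+patch] = image.mean()   # grey square
            saliency[i // patch, j // patch] = base - score_fn(occluded)
    return saliency

# Toy score function that only looks at the top-left corner, standing in
# for a real model's predicted disease probability
toy_score = lambda img: float(img[:4, :4].mean())
img = np.zeros((8, 8)); img[:4, :4] = 1.0       # bright "lesion" top-left
saliency = occlusion_saliency(img, toy_score)   # hotspot at (0, 0)
```

Gradient-based attention maps are cheaper at inference time, but occlusion has the advantage of probing the model's actual behaviour rather than its internals.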

Addressing the Challenges

Foundation models in medical AI face several challenges that we're actively addressing:

Data Privacy and Security

Medical data is highly sensitive. We implement:

  • Federated learning approaches where possible
  • Strict data de-identification protocols
  • Compliance with HIPAA and other regulations
  • Secure model training and deployment infrastructure

Bias and Generalization

Foundation models can perpetuate biases present in training data. We mitigate this through:

  • Diverse dataset curation across demographics and populations
  • Regular bias audits and model evaluation
  • Testing on held-out populations before deployment
  • Continuous monitoring after deployment

Clinical Validation

Foundation models must be validated rigorously before clinical use. We:

  • Collaborate with ophthalmologists to design validation studies
  • Test on diverse patient populations and clinical scenarios
  • Compare performance to expert clinicians
  • Conduct prospective studies to assess real-world impact

Regulatory Considerations

Medical AI requires regulatory approval. We:

  • Design our model architecture with regulatory requirements in mind
  • Implement robust quality control and monitoring
  • Prepare for FDA submission pathways
  • Maintain detailed documentation for regulatory review

The Future Landscape

Foundation models in ophthalmology represent more than an incremental improvement; they enable fundamentally new capabilities:

Predictive Medicine

By understanding disease trajectories across thousands of patients, foundation models can predict individual patient outcomes. This enables:

  • Early intervention for patients at high risk of progression
  • Personalized treatment selection based on predicted response
  • Optimized follow-up scheduling based on progression likelihood

Discovery and Research

Foundation models can identify patterns that might not be obvious to human observers:

  • Novel biomarkers for disease progression
  • Relationships between different conditions
  • Population-level insights from aggregated data

Democratized Expertise

Foundation models can make expert-level analysis available to:

  • Primary care providers in underserved areas
  • Telemedicine platforms serving remote populations
  • Screening programs in resource-limited settings

Continuous Learning

Unlike traditional models that are static after training, foundation models can continuously incorporate new data and knowledge:

  • Learning from new cases as they're encountered
  • Adapting to new imaging technologies
  • Incorporating latest research findings

Conclusion: A New Paradigm for Ophthalmic AI

The current fragmented landscape of ophthalmology AI, with separate models for each task, each disease, and each modality, is fundamentally limiting. Foundation models represent a paradigm shift toward unified, generalizable AI systems that can understand the full complexity of ophthalmic care.

The need is clear: growing imaging volumes, clinician shortages, and the complexity of longitudinal patient care require AI that can understand context, integrate modalities, and maintain continuity across time. Foundation models are uniquely positioned to address these needs.

At Retnovi, we're building this future. Our foundation model will transform ophthalmology AI from a collection of specialized tools into a unified system that understands all images, all modalities, and all time—enabling clinicians to provide better care to more patients.

The journey is just beginning, but the foundation is being laid. Within five years, we envision every medical image passing through a foundation model that provides comprehensive, contextual, and continuously improving analysis. This isn't just a technical achievement; it's a transformation of how eye care is delivered globally.


References

  1. EyeCLIP: A Multimodal Visual-Language Foundation Model for Computational Ophthalmology
    PubMed | arXiv
    A multimodal foundation model trained on 2.77 million ophthalmology images demonstrating state-of-the-art performance in disease classification, visual question answering, and cross-modal retrieval.

  2. RETFound: A Foundation Model for Retinal Imaging
    PubMed
    One of the first foundation models in ophthalmology, demonstrating superior performance even with limited fine-tuning data.

  3. EyeFound: A Multimodal Generalist Foundation Model for Ophthalmic Imaging
    arXiv
    A foundation model trained on 2.78 million images from 227 hospitals across 11 ophthalmic modalities, showing superior performance in disease diagnosis and systemic disease prediction.

  4. VisionUnite: A Vision-Language Foundation Model for Ophthalmology Enhanced with Clinical Knowledge
    arXiv
    A foundation model pretrained on 1.24 million image-text pairs, achieving diagnostic capabilities comparable to junior ophthalmologists.

  5. Global Ophthalmology Statistics
    American Academy of Ophthalmology
    Statistics on global distribution of ophthalmologists and eye care resources.

  6. U.S. Bureau of Labor Statistics, Occupational Employment and Wage Statistics
    BLS OEWS
    Data on healthcare employment trends, showing disproportionate growth in administrative roles compared to clinical roles.

  7. Foundation Models in Medical Imaging: Opportunities and Challenges
    Medical Image Analysis
    Comprehensive review of foundation models in medical imaging, including technical challenges and clinical applications.

  8. On the Challenges and Perspectives of Foundation Models for Medical Image Analysis
    Medical Image Analysis
    Analysis of the opportunities and challenges presented by foundation models in medical imaging.

  9. Recent Advances in Foundation Models for Ophthalmology
    ScienceDirect
    Latest research on foundation models in ophthalmology (2025).


Contact Us

Interested in learning more about foundation models in ophthalmology or collaborating on this effort? Reach out to us at support@retnovi.ai.