Evaluating Artificial Intelligence’s Accuracy in Identifying Anatomical Structures on Cadavers: Implications for Osteopathic Medical Education

Journal: Journal of Osteopathic Medicine. Date: 2025/12, 125(12). Pages: A652–653. doi: 10.1515/jom-2025-2000. Type of study: observational study

Full text    (https://www.degruyterbrill.com/document/doi/10.1515/jom-2025-2000/html)

Keywords:

anatomy [104]
AI [1922]
artificial intelligence [8]
cadaver [21]
ChatGPT [6]
observational study [228]
osteopathic medicine [2055]

Abstract:

Context: Artificial intelligence (AI) refers to computational systems capable of carrying out tasks that would normally require a human, such as complex problem-solving or pattern recognition. Large language models (LLMs) are a subset of AI trained on vast datasets. AI and LLMs are becoming increasingly common in both osteopathic and allopathic medical education as study tools. As their adoption increases within osteopathic medical education, particularly in the anatomical sciences, evaluating the accuracy of these models is crucial for medical students, clinicians, and patients. OpenAI’s ChatGPT-4.0 is an LLM designed to generate human-like responses to prompts and questions, and it has been trained on extensive datasets. It is one of the most well-known LLMs and is becoming increasingly popular in undergraduate medical education on topics ranging from physiology to pharmacology and the anatomical sciences. It is also highly capable, having passed the United States Medical Licensing Examination. However, previous studies have focused on the text-based abilities of ChatGPT-4.0, and there is a gap in the literature regarding its abilities in the more difficult task of processing images.

The University of New England College of Osteopathic Medicine (UNECOM) is a small but prestigious osteopathic medical school on the coast of Maine that offers a comprehensive course in the anatomical sciences for osteopathic medical students. As part of its educational resources, UNECOM maintains a robust archive of cadaver images from previous body donor lab practical exams. These practical exams involve placing "tags," using materials such as metal alligator clips, strings, pipe cleaners, or other objects, on or near anatomical structures or spaces within cadavers. After each exam, images of the tagged structures are taken and saved along with the correct answers. These images are then made available to the following year’s class for exam preparation. Access to this image archive is restricted exclusively to students within the College of Osteopathic Medicine, so it is highly unlikely that AI models would have encountered these images in their training data. Additionally, the anatomical tagging is performed by experts, including surgeons and doctoral-level anatomists with decades of experience, making these images a strong benchmark for evaluating LLM performance.

Objective: This study aimed to evaluate the accuracy of ChatGPT-4.0 in identifying anatomical structures in images of cadavers.

Methods: A total of 57 cadaver images were selected from an anatomy practical exam administered at UNECOM in 2022. The images featured structures from a variety of regions, including the extremities, back, and thoracic cavity; the exam did not cover the abdominal or pelvic cavities. No images of the face were used, and no personally identifying information about the cadavers was collected. Two images were of teaching skeletons and were included because they featured anatomical elements. Radiographic and histological images from the practical exams were excluded because they would not serve the primary endpoint of investigating the AI’s ability to recognize anatomical structures on cadavers. Prior to the image analysis, GPT-4.0 was prompted with the following instruction: "You are a first-year osteopathic medical student taking an anatomy course. You are about to take an anatomy lab practical in which you will be asked to identify structures and possibly provide context regarding their function." Each image was presented to the AI with a clearly tagged or indicated structure, and the model was given only one attempt to identify it. In cases where the tagging was not self-evident, additional context, comparable to what would be provided in an actual exam, was supplied by the investigator. To prevent contextual bias, screenshots were pasted directly into a new ChatGPT-4.0 conversation and left unnamed to avoid any image metadata that could influence the response. Only cadaver images from the UNECOM anatomy department were used to ensure the AI had not, or was extremely unlikely to have, previously encountered them during training. The model’s responses were evaluated in two ways: (1) exact identification, whether the model named the correct anatomical structure; and (2) regional accuracy, whether the model identified a structure within the correct anatomical region (e.g., forearm vs. leg). Overall accuracy and error rates were calculated for each category.

Results: Out of 57 images, ChatGPT-4.0 correctly identified the general anatomical region 47 times, yielding a regional accuracy of 82% and an error rate of 18%. However, it correctly identified the specific tagged structure in only 5 of the 57 images, a specific-structure accuracy of 8.8% with an error rate of 91.2%. These figures include the two teaching-skeleton images, whose tagged structures the AI was unable to identify. Removing those two images brings the regional accuracy and error rate to 81.8% and 18.2%, and the specific-structure accuracy and error rate to 9.1% and 90.9%, respectively.

Conclusion: While ChatGPT-4.0 demonstrates strong performance in recognizing general anatomical regions, its ability to correctly identify specific structures on cadaveric images remains limited. These findings suggest that although ChatGPT-4.0 may be a helpful tool for general anatomical orientation, it is not yet reliable for detailed structural identification in cadaver-based contexts. Students using ChatGPT-4.0 to prepare for anatomy practicals should be advised to verify its outputs against trusted sources or faculty guidance. Future improvements in image-based training or multimodal learning may enhance the accuracy of AI tools in the anatomical sciences. Future research should compare different LLMs on similar anatomical recognition tasks (such as Bard, another LLM), explore whether changes to the prompt improve accuracy, include a more varied selection of anatomical images, and further evaluate the image-processing capabilities of ChatGPT-4.0 and other LLMs.
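The percentages reported in the Results section follow directly from the raw counts. A minimal sketch, assuming only the counts stated in the abstract (47/57 regionally correct, 5/57 specifically correct, 2 teaching-skeleton images that were regionally but not specifically correct), reproduces them:

```python
# Reproduce the accuracy and error percentages reported in the abstract
# from the raw counts. The helper below is illustrative, not part of the study.

def accuracy_and_error(correct: int, total: int) -> tuple[float, float]:
    """Return (accuracy %, error %) rounded to one decimal place."""
    acc = round(100 * correct / total, 1)
    return acc, round(100 - acc, 1)

# Full sample of 57 images
print(accuracy_and_error(47, 57))  # regional: 82.5/17.5, reported rounded as 82%/18%
print(accuracy_and_error(5, 57))   # specific: 8.8%/91.2%

# Excluding the two teaching-skeleton images leaves 55 images,
# with 45 regionally correct and still 5 specifically correct
print(accuracy_and_error(45, 55))  # regional: 81.8%/18.2%
print(accuracy_and_error(5, 55))   # specific: 9.1%/90.9%
```

Note that the 82%/18% regional figures for the full sample are integer roundings of 82.5%/17.5%; all other reported values match to one decimal place.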



 
 
 








ostlib.de/data_acxgwszyhfpmktbvunqd


