Discussion
In this study, we used GPT-3.5 and GPT-4 to answer the sample CFPC questions, and both models responded satisfactorily to our complex sample questions. When the reviewers scored the questions using the fixed answer key provided by the CFPC website, the mean score across all five rounds was 76.0±27.7 for GPT-3.5 and 85.2±23.7 for GPT-4. Additionally, the authors found that most of the answers, although not explicitly stated in the answer key, were reasonable and acceptable, and only about 16% of the answer lines provided by GPT-3.5 and 7% of those provided by GPT-4 were deemed incorrect in the Reviewers’ scoring.
Although ChatGPT has been used to respond to medical examination questions,6 9–18 only one study has evaluated its efficacy in preparing for the Canadian family medicine exam.8 In that study, Huang and colleagues demonstrated that GPT-4 significantly outperformed the other test takers, achieving an impressive accuracy rate of 82.4%, whereas GPT-3.5 achieved 57.4% accuracy and family medicine residents answered 56.9% correctly.8 In our study, the mean CFPC score across five rounds was 85.2 for GPT-4, which closely matched their result, while GPT-3.5 scored lower at 76.0. However, it is important to note that Huang and colleagues’ questionnaire comprised multiple-choice questions, differing from the open-ended format of the SAMPs exam. Furthermore, their questionnaire was sourced from their university, was specifically designed to prepare their family medicine residents for the exam, and may lack standardisation. In contrast, our study employed a comprehensive and standardised set of questions sourced directly from the CFPC website. These questions were open-ended, mirroring the SAMPs structure, and included official answer keys approved by the CFPC, providing a more accurate representation of the CFPC exam format.
Thirunavukarasu and coworkers used GPT-3.5 to answer the AKT exam designed for Membership of the Royal College of General Practitioners in the UK. The model achieved a performance level of 60.17%, which was lower than our score and fell short of the 70.45% passing threshold for this primary care examination.12 Nevertheless, like the University of Toronto study,8 this study employed a multiple-choice questionnaire and was not specific to a Canadian family medicine exam. Other studies have reported similar scores for GPT-3.5 on various medical examinations at the undergraduate level. Kung and colleagues reported that ChatGPT achieved near-passing accuracy levels of around 60% for Step 1, Step 2 CK and Step 3 of the USMLE.9 Similarly, Gilson and colleagues observed an accuracy range of 44% to 64.4% for sample USMLE Step 1 and Step 2 questions.10 ChatGPT’s performance on the Chinese NMLE lagged behind that of medical students and was below the passing threshold.17
Similar to our study, scores were higher when GPT-4 was used instead of GPT-3.5 in other studies. For instance, while GPT-3.5 fell short of the passing criteria for the Japanese medical licensing examination, GPT-4 met the threshold criteria.16 Nori et al used GPT-4 and observed that it exceeded the USMLE passing score by over 20 points.11 Finally, GPT-4 accurately answered 81.3% of the questions on the Iranian Medical Residency Examination.18
The combined analysis of the five rounds using the GEE model revealed that the CFPC Score Percentages were significantly higher for GPT-4 than for GPT-3.5 (p<0.001). Likewise, when the responses were re-evaluated using the reviewers’ medical expertise, the Reviewers’ Score percentages over the five rounds were significantly higher for GPT-4 than for GPT-3.5 (p=0.009). This finding probably reflects GPT-4’s greater ability to handle challenging questions arising from complex situations.3 4 This trend has previously been shown in assessments of ChatGPT (GPT-3.5) and ChatGPT Plus (GPT-4) on various exams, including a sample of multiple-choice progress tests from the University of Toronto,8 two sets of official practice materials for the USMLE exam from the National Board of Medical Examiners,11 the Japanese Medical Licensing Examination,16 the StatPearls ophthalmology Question Bank13 and the 2022 SCE neurology examination.14 However, those studies primarily involved multiple-choice questions,8 11 were related to the undergraduate level,11 were conducted in different languages16 or focused on other specialties.13 14 Our study focused on the complex task of open-ended Canadian family medicine questions and demonstrated that GPT-4 can provide more accurate answers to complex Canadian SAMPs exam questions than GPT-3.5 (the free version).
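For readers wishing to reproduce a similar repeated-measures comparison, the sketch below illustrates one way such a GEE analysis could be specified in Python with statsmodels. It is not the study's own code: the file name, column names, Gaussian family and exchangeable working correlation are illustrative assumptions, and the authors' exact model specification may differ.

```python
# Minimal sketch (assumptions noted above) of a GEE comparison of per-question
# score percentages between GPT-3.5 and GPT-4 across repeated rounds.
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Hypothetical long-format data: one row per question, per round, per model.
df = pd.read_csv("scores_long.csv")  # assumed columns: question_id, round, gpt_version, score_pct

gee_model = smf.gee(
    "score_pct ~ C(gpt_version, Treatment(reference='GPT-3.5')) + C(round)",
    groups="question_id",                     # repeated measures on the same question
    data=df,
    family=sm.families.Gaussian(),            # score percentages treated as continuous
    cov_struct=sm.cov_struct.Exchangeable(),  # assumed working correlation structure
)
result = gee_model.fit()
print(result.summary())  # the gpt_version coefficient tests GPT-4 vs GPT-3.5
```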
In the fifth round of our study, when the AI was not specifically instructed to offer brief responses, it consistently provided informative justifications and reasoning. These responses were highly instructive and aligned well with our educational objectives (see online supplemental appendix box 1). Therefore, our study demonstrated that GPT-3.5 and GPT-4 can be used to generate answers to complex tasks such as those examined in this study, making them a potential aid for CFPC exam preparation. However, using these technologies to learn family medicine and prepare for exams needs further study.
Despite their benefits and potential roles in medical education and research, LLMs have several pitfalls. These include the absence of up-to-date sources of literature1 (the versions of ChatGPT used here were trained on data up to September 2021), inaccurate data,13 14 an inability to distinguish between fake and reliable information21 and the generation of incorrect answers known as hallucinations,6 7 21–24 which are potentially misleading or dangerous in a healthcare context.7 24 ChatGPT is still in an experimental phase and is not intended for medical application.7 Therefore, using ChatGPT in preparation for exams should serve as a prompt to reinforce existing knowledge derived from reliable sources. Responses generated by ChatGPT should undergo rigorous fact-checking by human experts before being considered a primary knowledge resource.
Our testing comprised several rounds, including repeating identical prompts at intervals, modifying the prompts by removing the reference to the ‘CFPC exam’, regenerating responses and removing the prompts altogether to evaluate the outcomes. When comparing rounds 1 and 2, which used the same ‘Prompt 1’ approximately 1 week apart, both GPT-3.5 and GPT-4 demonstrated high consistency and accuracy. This observation suggests that the passage of time does not significantly impact the chatbot’s performance. Instead, future improvements may arise through the AI’s learning curve and the introduction of newer versions of LLMs trained on updated material, warranting further investigation.
Removing the phrase ‘CFPC exam’ in round 4 led to an unexpected outcome. The accuracy, indicated by the ‘CFPC Score Percentage’, noticeably increased for GPT-3.5 and showed an upward trend for GPT-4, contrary to our initial hypothesis. We had speculated that omitting the exam’s name might limit GPT’s access to the source questions, potentially reducing the scores. However, the observed increase may be coincidental or may reflect other underlying factors, necessitating further investigation to understand these results.
The comparison between rounds 1 and 5 aimed to determine whether prompting influenced the responses and resulted in consistently accurate outcomes. The absence of a significant change in the ‘CFPC Score Percentage’ for both GPT-3.5 and GPT-4 may suggest that prompting did not significantly alter the accuracy of the responses. Moreover, for most of the questions the CFPC score remained unchanged (67.5% for GPT-3.5 and 83.1% for GPT-4). This result suggests that running ChatGPT without any prompt could yield detailed, justified responses of similar accuracy, which could be valuable for candidates preparing for the CFPC exam.
Finally, the regeneration of the round 2 responses in round 3 was conducted to assess whether response regeneration could enhance accuracy. We deleted the output of each round before starting the next, except for the third round, which regenerated the responses of the second round, to minimise potential learning curve effects on GPT’s performance. With this approach, the ‘CFPC Score Percentage’ tended to increase for GPT-3.5 while remaining unchanged for GPT-4. This finding further suggests that regenerating responses may improve the results for GPT-3.5 but not for GPT-4.
In summary, GPT-4 showed considerable consistency across our comparisons. This consistency was even more notable when the reviewers found that changes in the answer choices made by GPT did not affect the scores (table 4). In most cases, GPT-4 repeated its answers more frequently than GPT-3.5, or at least showed a trend towards higher repetition. In a related study, Thirunavukarasu et al conducted two independent sessions of the AKT exam with ChatGPT over 10 days and observed consistent performance.12
Study limitations
It is important to acknowledge that there is no established cut-off score for passing the SAMPs component of the CFPC exam. Instead, the minimal passing score is set for each exam based on the performance of a reference group of first-time test-takers graduating from Canadian family medicine residency programmes.19 Consequently, whether ChatGPT’s current performance would be sufficient to pass the exam remains inconclusive. Additionally, we lacked access to candidates’ scores, making it impossible to compare ChatGPT’s performance with that of human candidates. Comparing ChatGPT’s performance on a sample question with that of candidates could reveal whether ChatGPT outperforms, or is at least not inferior to, human candidates. It is necessary to emphasise that ChatGPT is not designed to practise family medicine or to pass the related exam. Instead, we suggest that it could be used to assist candidates with exam preparation by helping them determine correct responses.
A significant component of learning in family medicine involves the interpretation of images, such as ECGs, X-rays and skin conditions, which text-based models such as ChatGPT cannot process. In our study, we encountered this limitation with one question that included an ECG image, which we had to exclude. Interestingly, our two reviewers found that the absence of this image did not affect the accuracy or relevance of ChatGPT’s answers to the associated clinical scenario question.
In this study, we used GPT-3.5 and GPT-4 from OpenAI, which were trained on data up to September 2021 and were not specialised for medical purposes.1 It is important to note that other LLMs may use more recent sources of information, potentially yielding different results and warranting further investigation. Furthermore, even within the same OpenAI version, GPT’s performance can be influenced by the repetition of questions and the feedback provided over time, meaning that the performance of ChatGPT may evolve. To avoid the possibility of learning curve effects and memory retention bias affecting the AI’s performance, we took the precaution of erasing the results of each round from the ChatGPT window before initiating a new session for the subsequent round.
In an actual exam setting, residents typically read the clinical scenario once and then respond to the two to seven related questions; the scenario is not repeated before each question. We adopted a similar approach and did not reiterate the clinical scenario before each related question. Nevertheless, ChatGPT’s responses might differ if the clinical scenario were repeated before each question. Confirming this hypothesis would necessitate further investigation.
In this study, we examined a sample of SAMP questions provided by the CFPC, which is very similar to the actual exam. These question sets comprised only 19 clinical scenarios and 77 questions. Expanding the number of questions examined could enhance the study’s reliability. However, it is important to note that many of the sample questions available from other sources on the market may not represent the actual examination, or their answer keys may not be reliable.