What is the MCAT?

The Medical College Admission Test® (*MCAT*®) is a computer-based standardized examination for prospective medical students in the United States, Australia, Canada, and the Caribbean Islands. The MCAT is a seven-hour exam that tests prospective medical students in the following four categories: Chemical and Physical Foundations of Biological Systems (CP), Critical Analysis and Reasoning Skills (CARS), Biological and Biochemical Foundations of Living Systems (BB), and Psychological, Social, and Biological Foundations of Behavior (PS). Each category is scored from a minimum of 118 to a maximum of 132 [1].

The Association of American Medical Colleges (AAMC) is the organization that creates the MCAT. To prepare students for the MCAT, the AAMC has created three practice exams of the same length and difficulty [2].

The Purpose of this Page

Have you ever wondered how your AAMC practice exam scores might compare to the real MCAT? Look no further: MDbuddy has your back. By analyzing over 1000 students from Reddit's MCAT community (r/MCAT), we are able to predict where your scores might fall on exam day.

MDbuddy does its calculations in real time, meaning that it updates its graphs and machine learning models whenever a new student is added to the database!

For best results, please use a computer, laptop, or iPad.

In this figure, we can see the distributions of r/Mcat's MCAT exam scores and AAMC practice exam scores.

Surprisingly, r/Mcat seems to score much higher than the average MCAT test taker. According to the AAMC, an individual category score of 125 corresponds to roughly the 50th percentile. r/Mcat's median appears to be around 130 for CP, 128 for CARS, 130 for BB, and 130 for PS, which correspond to the 97th, 90th, 96th, and 95th percentiles respectively. That is very impressive.

As for the score distribution of the AAMC practice exams, the AAMC does not release those numbers publicly. However, we can expect them to be similar to those of the MCAT exam. r/Mcat's AAMC practice exam scores are just as impressive as their MCAT scores.

One thing that stands out in this figure is that all the distributions above are skewed to the left, unlike the unskewed distributions shown in the official AAMC report linked above.

In this figure, we can see the distribution of differences between r/Mcat's MCAT exam scores and their AAMC practice exam scores.

A positive difference means a better score on the MCAT exam than on the AAMC practice exam.

A negative difference means a worse score on the MCAT exam than on the AAMC practice exam.

For the most part, the AAMC practice exams look representative of the real MCAT exam across all three practice exams and all categories (with the exception of PS). However, a closer look at the width of the distributions shows that this is not the whole picture. The standard deviation in each category is about 1.8. While 1.8 may not seem like a lot for an exam, it is important to remember that category scores only range from 118 to 132. **Therefore, the AAMC practice exam scores are representative on average, but there is a lot of room for variability.**
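To make the spread concrete, here is a minimal Python sketch. The score pairs are made-up illustrative values, not data from the r/Mcat database, and `score_differences` is a hypothetical helper. The second half shows what a standard deviation of 1.8 would imply if the differences were roughly normal and centered on zero, which is an assumption rather than something the figure establishes.

```python
from statistics import mean, stdev

def score_differences(mcat_scores, practice_scores):
    """Return MCAT-minus-practice differences for paired category scores."""
    return [m - p for m, p in zip(mcat_scores, practice_scores)]

# Made-up example pairs for one category (NOT real r/Mcat data).
practice = [127, 129, 130, 125, 128, 131]
mcat = [128, 127, 131, 126, 130, 129]

diffs = score_differences(mcat, practice)
print("differences:", diffs)
print("mean:", round(mean(diffs), 2), "sd:", round(stdev(diffs), 2))

# What a standard deviation of 1.8 (the figure quoted above) implies,
# ASSUMING the differences are roughly normal and centered on zero.
SD = 1.8
half_width = 1.96 * SD       # about 95% of differences fall within +/- this
scale_range = 132 - 118      # width of the category score scale
print(f"95% band: +/-{half_width:.1f} points")
print(f"share of the score scale covered: {2 * half_width / scale_range:.0%}")
```

Under that assumption, the band of plausible differences spans about half of the 14-point scale, which is why a one-to-one comparison is a weak predictor even though the average difference is close to zero.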

Predictive Modeling

The high variability between the AAMC practice exams and the MCAT exam makes it unreliable to predict your MCAT exam score by comparing category scores one-to-one. Instead, what if the predictions were based on *all exams* across *all categories* in order to give a more precise and accurate estimate? This is what machine learning aims to do!

In the field below you can type your AAMC practice exam scores and a linear machine learning model will give you a prediction and a 95% prediction interval of what your score will be on the MCAT exam.

| Category | Prediction | Prediction (Lower 95%) | Prediction (Upper 95%) |
|---|---|---|---|
| Chem/Phys | 125 | 125 | 125 |
| CARS | 125 | 125 | 125 |
| Bio/Biochem | 125 | 125 | 125 |
| Psych/Socio | 125 | 125 | 125 |

- r/MCAT is not a representative sample of MCAT test-takers.
- People with higher scores are more likely to contribute to the community database than people with lower scores (volunteer bias).
- The relationship between the order in which the AAMC practice exams were completed and the change in score has not been explored, and could bias the model towards AAMC practice exam #3.

Predictive Modeling Process

At this point, all the data analysis has been presented. However, if you'd like to know more about how the machine learning model was created, please continue reading.

**Objective**

To create an ML model to predict MCAT exam scores using AAMC practice exams.

**Requirements**

The ML model must be able to be deployed easily to the front-end server.

The ML model must be fast to train and not too CPU intensive.

**Methods - Data collection**

The Google API was used to fetch, download, and store r/Mcat's 2018 and 2019 scores in the database.

An automatic web scraper was created to update the database on a monthly basis.

User data that was incomplete or did not include all three AAMC practice exams was discarded.

Over 1000 points of user data were collected, which were then split into training, validation, and testing sets.
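The filtering and splitting steps above can be sketched as follows. This is a minimal illustration, not the site's actual pipeline: the field names, the 70/15/15 split ratio, and the fixed seed are all assumptions, and the records are placeholders rather than real user data.

```python
import random

# Hypothetical field names for the three practice exams and the real exam.
REQUIRED_FIELDS = ("practice_1", "practice_2", "practice_3", "mcat")

def is_complete(record):
    """True when every required score is present."""
    return all(record.get(field) is not None for field in REQUIRED_FIELDS)

def split_dataset(records, train_frac=0.70, val_frac=0.15, seed=0):
    """Shuffle records and split them into train, validation, and test lists."""
    if not 0 < train_frac + val_frac < 1:
        raise ValueError("train_frac + val_frac must be between 0 and 1")
    shuffled = list(records)                # copy so the input is not mutated
    random.Random(seed).shuffle(shuffled)   # seeded for reproducibility
    n_train = round(len(shuffled) * train_frac)
    n_val = round(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]       # whatever remains
    return train, val, test

# Placeholder records (NOT real user data): every 10th lacks practice exam 3.
records = [
    {"practice_1": 127, "practice_2": 128,
     "practice_3": None if i % 10 == 0 else 129, "mcat": 128}
    for i in range(1000)
]

complete = [r for r in records if is_complete(r)]
train, val, test = split_dataset(complete)
print(len(complete), len(train), len(val), len(test))
```

Shuffling before slicing keeps the three sets comparable, and holding out a test set that is never used for model selection gives an honest estimate of how the chosen model performs on new users.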

**Methods - Preprocessing**

Normalization (standard scaler) and feature expansion (polynomial expansion) were tested on a regular linear regression model. They produced only a minimal reduction in the loss, so these methods were not used.

**Methods - Modeling**

Regression methods, including linear regression, ridge regression, LASSO, and random forests, were compared using the R² score.

There wasn't much difference between the R² values of ridge regression and random forests, so ridge regression was chosen, as parametric models are easier to integrate and deploy at the front end of a website.
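As a rough illustration of the ridge model and the R² comparison, here is a minimal sketch using NumPy and the closed-form ridge solution. It is not the site's implementation: the data is synthetic, the regularization strength `alpha=1.0` is arbitrary, and `fit_ridge` and `r2_score` are hypothetical helpers written for this example. A mean-only predictor is included as a simple baseline.

```python
import numpy as np

def fit_ridge(X, y, alpha=1.0):
    """Closed-form ridge fit on centered data; the intercept is not penalized."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    # Solve (Xc^T Xc + alpha * I) w = Xc^T yc for the weights.
    w = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X.shape[1]), Xc.T @ yc)
    return w, y_mean - x_mean @ w

def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Synthetic data (NOT real scores): three noisy practice scores and one
# noisy exam score that all depend on the same underlying ability.
rng = np.random.default_rng(0)
ability = rng.normal(128, 2, size=400)
X = np.column_stack([ability + rng.normal(0, 1.5, size=400) for _ in range(3)])
y = ability + rng.normal(0, 1.5, size=400)

X_train, X_test = X[:300], X[300:]
y_train, y_test = y[:300], y[300:]

w, b = fit_ridge(X_train, y_train, alpha=1.0)
ridge_r2 = r2_score(y_test, X_test @ w + b)

# Baseline: always predict the training-set mean.
baseline_r2 = r2_score(y_test, np.full_like(y_test, y_train.mean()))

print("weights:", np.round(w, 3), "intercept:", round(float(b), 3))
print("ridge R2:", round(float(ridge_r2), 3))
print("baseline R2:", round(float(baseline_r2), 3))
```

The fitted model reduces to one weight per input plus an intercept, so a prediction is a single weighted sum. That small footprint is consistent with the deployment argument above for preferring a parametric model.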

A function was created to generate the 95% prediction interval for the prediction specified by the user.
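The page does not describe how the interval function works. One simple possibility, shown here purely as an assumption, is to take the spread of the model's residuals on held-out data and build a normal-approximation band around the point prediction, clipped to the valid 118 to 132 range. `prediction_interval` is a hypothetical helper, the residuals are made up, and this simplified version ignores uncertainty in the fitted coefficients.

```python
from math import sqrt
from statistics import NormalDist

def prediction_interval(prediction, residuals, level=0.95, lo=118, hi=132):
    """Approximate interval: prediction +/- z * RMS(residuals), clipped to [lo, hi]."""
    z = NormalDist().inv_cdf(0.5 + level / 2)    # about 1.96 for level=0.95
    rms = sqrt(sum(r * r for r in residuals) / len(residuals))
    lower = max(lo, prediction - z * rms)
    upper = min(hi, prediction + z * rms)
    return lower, upper

# Made-up held-out residuals (actual minus predicted), NOT real model output.
residuals = [1, -1, 2, -2]

lower, upper = prediction_interval(127, residuals)
print(f"prediction 127 -> [{lower:.1f}, {upper:.1f}]")

# Near the top of the scale, the upper bound is clipped to the maximum score.
lower_hi, upper_hi = prediction_interval(131, residuals)
print(f"prediction 131 -> [{lower_hi:.1f}, {upper_hi:.1f}]")
```

Clipping keeps the reported bounds inside the range of scores that can actually occur, at the cost of making intervals near 118 or 132 asymmetric.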

An automatic model trainer was implemented to retrain the model on a monthly basis using new data collected by the web scraper.

**Methods - Summary**

A ridge regression model predicts a user's MCAT exam scores, with a 95% prediction interval, from their AAMC practice exam scores, and regularly updates its parameters as new r/Mcat user scores are submitted.

**Conclusion**

By generating a predictive linear model using all practice exams, across all categories, I was able to increase the precision of the predictions by more than a single standard deviation of the dummy model (analysis-2).