Guide to Cross-validation with Julius

Zach Fickenworth 10 May, 2024 • 7 min read

Introduction

Cross-validation is a machine learning technique for evaluating how well a model performs on data it has not seen. It involves dividing a dataset into multiple subsets, training the model on some of them, and testing it on the held-out portion. This helps guard against overfitting by checking whether the model has learned the underlying trends in the data rather than memorizing the training examples. The goal is to develop a model that accurately predicts outcomes on new datasets. Julius simplifies this process, making it easier for users to train models and perform cross-validation.

Cross-validation is a powerful tool in fields like statistics, economics, bioinformatics, and finance. However, it's crucial to understand which technique to use, since each comes with its own bias and variance trade-offs. This guide walks through the cross-validation techniques available in Julius, highlighting the situations each suits and its potential pitfalls.

Types of Cross-Validations

Let us explore the main types of cross-validation.

Hold-out Cross-Validation

The hold-out method is the simplest and quickest form of cross-validation. When bringing in your dataset, you can simply prompt Julius to perform it. As you can see below, Julius has taken my dataset and split it into two sets: a training set and a testing set. As previously discussed, the model is trained on the training set (blue) and then evaluated on the testing set (red).

The split ratio for training and testing is typically 70/30, though it can vary with the dataset size. The model learns trends and adjusts its parameters based on the training set alone. After training, its performance is evaluated on the test set, which serves as unseen data and indicates how the model would perform in real-world scenarios.

[Figure: hold-out split into training (blue) and testing (red) sets]

Example: you have a dataset of 10,000 emails, each marked as spam or not spam. You can prompt Julius to run hold-out cross-validation with a 70/30 split. This means that out of the 10,000 emails, 7,000 will be randomly selected for the training set and the remaining 3,000 used as the testing set. You get the following:

[Output: hold-out cross-validation results from Julius]
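For readers who want to see the mechanics outside Julius, here is a minimal sketch of this kind of 70/30 hold-out split in scikit-learn. The synthetic data and the logistic regression classifier are illustrative assumptions, not the actual email features or the model Julius fit:

```python
# Minimal sketch of a 70/30 hold-out split (synthetic stand-in for the emails).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, recall_score
from sklearn.model_selection import train_test_split

# 10,000 synthetic "emails": X holds feature vectors, y holds spam (1) / not spam (0).
X, y = make_classification(n_samples=10_000, n_features=20, random_state=42)

# 70% training, 30% testing; stratify keeps the class ratio similar in both sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = model.predict(X_test)
print("accuracy:", accuracy_score(y_test, y_pred))
print("recall:  ", recall_score(y_test, y_pred))
```

Changing `test_size` to 0.20 gives the 80/20 split tried below.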

We can ask Julius for ways to improve the model, and it will return a rundown of improvement strategies: trying different splits, switching to k-fold cross-validation, examining other metrics, and so on. You can experiment with these and compare the outputs to see whether the model performs better. Let's see what happens when we change the split to 80/20.

[Prompt and output: hold-out cross-validation with an 80/20 split]

We got a lower recall, which can happen when retraining these models on a different split. Accordingly, Julius suggested further tuning or a different model. Let's take a look at some other cross-validation techniques.

K-Fold Cross-Validation

This technique gives a more thorough and stable performance estimate because it tests the model repeatedly rather than relying on a single fixed split. Unlike hold-out, which uses fixed subsets for training and testing, k-fold uses all of the data for both training and testing across K equal-sized folds. For simplicity, let's use 5 folds. Julius will divide the data into 5 equal parts, then train and evaluate the model 5 times, each time using a different fold as the test set. It then averages the results across the folds to estimate the model's performance.

[Figure: k-fold cross-validation folds]
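As a rough sketch of the mechanics, here is 5-fold cross-validation in scikit-learn; the dataset and estimator are illustrative assumptions:

```python
# Minimal sketch of 5-fold cross-validation: every point is used for
# training in four folds and for testing in exactly one.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = make_classification(n_samples=1_000, n_features=20, random_state=0)

kfold = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=kfold)

print("per-fold accuracy:", scores)
print("mean accuracy:    ", scores.mean())
```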

Let's run the spam email dataset through 5-fold cross-validation and see how successful the model is at identifying spam versus non-spam emails:

[Output: 5-fold cross-validation results from Julius]

As you can see, both approaches show an average accuracy of around 50%, with hold-out cross-validation slightly higher (52.2%) than k-fold (50.45% averaged across 5 folds). Let's move away from this example and on to some other cross-validation techniques.

Special Case of K-Fold

We will now explore some special cases of k-fold cross-validation. Let's get started:

Leave-One-Out Cross-Validation (LOOCV)

Leave-one-out cross-validation is the special case of k-fold cross-validation where K equals the number of observations in the dataset. When you ask Julius to run this test, it will hold out a single data point as the test set and use all remaining points as the training set, repeating the process until every data point has been tested. It provides a nearly unbiased estimate of the model's performance. Because the model is retrained once per observation, LOOCV can demand a lot of computation, especially on large datasets, so it is best suited to smaller ones.

[Figure: K-fold cross-validation result]

Example: you have a dataset of exam records for 100 students from a local high school. Each record tells you whether the student passed or failed an exam, and you want to build a model that predicts that pass/fail outcome. Julius will evaluate the model 100 times, using each data point once as the test set with the remaining 99 as the training set.
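A minimal sketch of LOOCV, with synthetic data standing in for the 100 exam records:

```python
# Minimal sketch of LOOCV: the model is fit 100 times, each time
# holding out a single student as the test set.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

# Synthetic stand-in for 100 pass/fail exam records.
X, y = make_classification(n_samples=100, n_features=5, random_state=0)

scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=LeaveOneOut())
print("fits performed:", len(scores))  # 100, one per student
print("mean accuracy: ", scores.mean())
```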

Leave-p-out Cross-Validation (LpOCV)

As you can probably tell, this is a generalization of LOOCV: instead of one point, you leave out p data points at a time. When you prompt Julius to run this cross-validation, it will iterate over all possible combinations of p data points, using each combination as the test set while the remaining points serve as the training set. This is repeated until every combination has been used. Like LOOCV, LpOCV requires a lot of computational power, so smaller datasets are far easier to compute.

Example: taking the same dataset of student exam records, we can tell Julius to run LpOCV with p = 2, leaving out 2 data points at a time as the test set and using the rest as training data (i.e., leave out points 1 and 2, then 1 and 3, then 1 and 4, and so on). This is repeated until every pair of points has served in a test set.
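A sketch of LpOCV with p = 2 follows; the dataset is kept deliberately small here because the number of train/test combinations grows combinatorially:

```python
# Minimal sketch of leave-p-out with p=2: every pair of points serves
# as the test set exactly once.
from math import comb
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeavePOut, cross_val_score

# 30 points already yield C(30, 2) = 435 separate fits.
X, y = make_classification(n_samples=30, n_features=5, random_state=0)

scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=LeavePOut(p=2))
print("splits evaluated:", len(scores), "=", comb(30, 2))
print("mean accuracy:   ", scores.mean())
```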

Repeated K-fold Cross-validation

Repeated k-fold cross-validation is an extension of k-fold that helps reduce the variance of the model's performance estimate. It repeats the k-fold process multiple times, partitioning the data into different folds on each repetition. The results are then averaged to give a more comprehensive picture of the model's performance.

Example: if you had a dataset with 1,000 points, you could instruct Julius to use repeated 5-fold cross-validation with 3 repetitions, meaning that it will perform 5-fold cross-validation 3 times, each with a different random partition of the data. The model's performance on each fold is evaluated, and all the results are averaged for an overall estimate of the model's performance.
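A minimal sketch of this scheme on illustrative synthetic data:

```python
# Minimal sketch of repeated 5-fold CV with 3 repetitions: 15 fits in
# total, each repetition reshuffling the data into new folds.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedKFold, cross_val_score

X, y = make_classification(n_samples=1_000, n_features=20, random_state=0)

rkf = RepeatedKFold(n_splits=5, n_repeats=3, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=rkf)

print("total fits:   ", len(scores))  # 5 folds x 3 repeats = 15
print("mean accuracy:", scores.mean())
```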

Stratified K-Fold Cross-Validation

This technique is often used with imbalanced datasets, where the target variable has a skewed distribution. When prompted, Julius will create folds that each contain approximately the same proportion of samples from each class or target value, so the model sees the original distribution of the target variable in every fold.

Example: you have a dataset that contains 110 emails, 5 of which are spam, and you want to build a model that can detect the spam. You can instruct Julius to use stratified 5-fold cross-validation, which places approximately 21 non-spam emails and 1 spam email in each fold. This ensures that the model is trained and tested on subsets that are representative of the full dataset.
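A sketch of the stratified split, using synthetic features with the same 5-in-110 spam ratio; it verifies that each fold's test set receives exactly one spam email:

```python
# Minimal sketch of stratified 5-fold CV on an imbalanced dataset:
# each fold preserves the roughly 5-in-110 spam ratio.
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in: 110 emails, the first 5 labeled spam (1).
X = np.random.default_rng(0).normal(size=(110, 4))
y = np.array([1] * 5 + [0] * 105)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for fold, (train_idx, test_idx) in enumerate(skf.split(X, y), start=1):
    print(f"fold {fold}: test size={len(test_idx)}, spam in test={y[test_idx].sum()}")
# Each test set holds 22 emails, exactly 1 of them spam.
```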

Time Series Cross-Validation

Temporal datasets are a special case because observations have time dependencies between them. When prompted, Julius takes this into consideration and deploys techniques suited to such data: it avoids disrupting the temporal structure of the dataset and prevents future observations from being used to predict past values. Rolling window and blocked cross-validation are two such techniques.

Rolling Window Cross-Validation

When prompted to run rolling window cross-validation, Julius takes a window of past data, trains the model on it, and then evaluates the model on the set of observations that immediately follows. As the name implies, the window is then rolled forward through the rest of the dataset, and the process is repeated as new data enters the window.

[Figure: rolling window cross-validation]

Example: you have a dataset of daily stock prices for your company over a five-year period. Each row represents a single day (date, opening price, highest price, lowest price, closing price, and trading volume). You instruct Julius to use a 30-day window: it trains the model on those 30 days and then evaluates it on the next 7 days. The process is then repeated by shifting the window forward 7 days and retraining and re-evaluating the model.
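One way to sketch this scheme is with scikit-learn's TimeSeriesSplit, which caps each training window at 30 days and anchors it immediately before its 7-day test window; the day counts are illustrative:

```python
# Minimal sketch of a rolling 30-day training window evaluated on the
# next 7 days; order is preserved, so no future data leaks into training.
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

n_days = 250  # stand-in for the daily price history
X = np.arange(n_days).reshape(-1, 1)

tscv = TimeSeriesSplit(n_splits=5, max_train_size=30, test_size=7)
for fold, (train_idx, test_idx) in enumerate(tscv.split(X), start=1):
    print(f"fold {fold}: train days {train_idx[0]}-{train_idx[-1]}, "
          f"test days {test_idx[0]}-{test_idx[-1]}")
```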


Blocked Cross-Validation

For blocked cross-validation, Julius divides the dataset into separate, non-overlapping blocks. The model is trained on some of the blocks and then tested and evaluated on the remaining ones. Because the blocks are never shuffled, the time series structure is maintained throughout the cross-validation process.

[Figure: blocked cross-validation]

Example: you want to predict quarterly sales for a retail company from its historical sales dataset, which covers quarterly sales over the last 5 years. Julius divides the dataset into 5 blocks, each containing 4 quarters (1 year), trains the model on two of the five blocks, and then evaluates it on the 3 remaining unseen blocks. Like rolling window cross-validation, this approach keeps the temporal structure of the dataset intact.
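A minimal sketch of the blocked split, with plain NumPy index blocks standing in for the quarterly sales data:

```python
# Minimal sketch of blocked cross-validation: 5 years of quarterly data
# split into 5 non-overlapping year blocks; train on two blocks and
# evaluate on the remaining three, never shuffling across time.
import numpy as np

quarters = np.arange(20)               # stand-in for 20 quarters of sales
blocks = np.array_split(quarters, 5)   # 5 blocks of 4 quarters (1 year) each

train = np.concatenate(blocks[:2])     # first two year-blocks for training
test_blocks = blocks[2:]               # remaining three blocks held out

print("train quarters:", train)
for year, block in enumerate(test_blocks, start=3):
    print(f"test block (year {year}):", block)
```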


Conclusion

Cross-validation is a powerful tool for estimating how well a model will perform on data it has never seen. With Julius, you can perform cross-validation with ease. By understanding the core attributes of your dataset and the different cross-validation techniques Julius can employ, you can make an informed decision about which method to use. This is just another example of how Julius can help you analyze your dataset based on its characteristics and the outcome you want. With Julius, you can feel confident in your cross-validation process, as it walks you through the steps and helps you choose the right approach.

