How to Calculate the Correlation Between Categorical and Continuous Values

Theoretical Explanations and Practical Examples of Correlation between Categorical and Continuous Values

4 min read · Feb 28, 2023

Without any doubt, after obtaining a dataset, feeding the entire data to an ML model without any data analysis, such as missing data analysis, outlier analysis, and correlation analysis, is not good practice. If excellent hardware existed and were accessible to everyone, we could run our ML experiments on all possible data combinations. Exploratory Data Analysis (EDA) might then be skipped, as long as we had a sufficiently isolated validation dataset and could get results within seconds. (Sorry, but I still do not accept this suggestion. Keep working on EDAs!)

Image by author: Chestnut Seller in Istanbul (source)

Introduction

When I review various posts on correlation analysis, I see the corr method of the Pandas package being called regardless of the data types. Unfortunately, Pearson is the default technique in the corr method, and it does not perform well for correlation analysis between categorical and continuous values, so we need to involve other statistical approaches to draw better inferences.
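To make the problem concrete, here is a minimal sketch with made-up toy data (the `city` and `price` columns and both encodings are assumptions for illustration only). It shows that the Pearson value for a nominal column depends on how the categories happen to be numbered, which is why the result is hard to interpret.

```python
import pandas as pd

# Toy data, invented for illustration: a nominal column and a numeric column.
df = pd.DataFrame({
    "city": ["A", "B", "C", "A", "B", "C"],
    "price": [10.0, 20.0, 30.0, 12.0, 22.0, 32.0],
})

# Two equally valid integer encodings of the same nominal categories.
encoding_1 = {"A": 0, "B": 1, "C": 2}
encoding_2 = {"A": 1, "B": 0, "C": 2}

# Series.corr uses method="pearson" by default, like DataFrame.corr.
r1 = df["city"].map(encoding_1).corr(df["price"])
r2 = df["city"].map(encoding_2).corr(df["price"])

# Same data, different arbitrary numbering, different "correlation".
print(round(r1, 3), round(r2, 3))
```

Since the categories have no natural order, neither number is more "correct" than the other.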

Briefly, correlation analysis can help us to:

understand the data, then have more knowledge of the domain or the business
reduce data size by filtering out highly correlated features
predict missing values from correlated columns if required
provide better visual reports to deliver information to non-tech people
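As a rough illustration of the second point, the sketch below drops one column from every highly correlated pair. The helper name `drop_highly_correlated`, the 0.9 threshold, and the synthetic columns are all assumptions chosen for this example, not a recommended setting.

```python
import numpy as np
import pandas as pd

def drop_highly_correlated(df, threshold=0.9):
    """Drop one column from each pair whose absolute Pearson correlation exceeds threshold."""
    corr = df.corr().abs()
    # Keep only the upper triangle so each pair is inspected once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop)

# Synthetic data: "a_scaled" is a linear copy of "a", "b" is independent noise.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
demo = pd.DataFrame({
    "a": a,
    "a_scaled": 2.0 * a + 1.0,
    "b": rng.normal(size=200),
})

reduced = drop_highly_correlated(demo)
print(list(reduced.columns))
```

Which column of a pair gets dropped depends only on column order here, so in practice it is worth checking the pairs by hand before removing anything.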

As I tried to do in the previous post, I would like to provide a reference point for applying correlation analysis across different data types. In this post, you can find the correlation analysis for the possible combinations of continuous and categorical data values.

Continuous - Continuous

We can find the correlation between two sets of continuous data using the Pearson technique. It measures the linear correlation by dividing the covariance of the two variables by the product of their standard deviations. In Pearson Correlation Analysis, the two sets are interchangeable (symmetric), so the correlation of x and y and the correlation of y and x give identical results.
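A minimal sketch of this computation, assuming synthetic data and using `scipy.stats.pearsonr` as a cross-check. The variable names and the generated values are illustrative only.

```python
import numpy as np
from scipy import stats

# Synthetic continuous data with a linear relationship plus noise.
rng = np.random.default_rng(42)
x = rng.normal(size=100)
y = 0.8 * x + rng.normal(scale=0.5, size=100)

# r = cov(x, y) / (std(x) * std(y)), using the same ddof in all three terms.
r_manual = np.cov(x, y, ddof=1)[0, 1] / (np.std(x, ddof=1) * np.std(y, ddof=1))

# SciPy returns the coefficient together with a two-sided p-value.
r_xy, p_value = stats.pearsonr(x, y)
r_yx, _ = stats.pearsonr(y, x)  # swapped arguments, to check symmetry

print(r_manual, r_xy, r_yx)
```

The manual value and both SciPy values should agree up to floating-point error.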

r represents the correlation between the variables, and its value lies in the interval [-1, 1].

r = -1 : perfect negative linear correlation
r = 1 : perfect positive linear correlation
r = 0 : no linear correlation

In Pearson Correlation, it’s good to know that a linear transformation with a positive scale factor does not change the value. For example, using inches instead of meters or adding a constant value to each weight will give the same correlation value.
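This property is easy to check numerically. The sketch below uses synthetic heights and weights (invented values, not real measurements) and also shows the one caveat: a negative scale factor flips the sign of r.

```python
import numpy as np

# Synthetic heights (meters) and weights (kg), invented for illustration.
rng = np.random.default_rng(7)
height_m = rng.normal(loc=1.70, scale=0.10, size=50)
weight_kg = 60 + 40 * (height_m - 1.70) + rng.normal(scale=5, size=50)

r_original = np.corrcoef(height_m, weight_kg)[0, 1]

# Change of unit (meters -> inches) and a constant shift of every weight.
height_in = height_m * 39.3701
weight_shifted = weight_kg + 2.5
r_transformed = np.corrcoef(height_in, weight_shifted)[0, 1]

# A negative scale factor keeps the magnitude but flips the sign.
r_flipped = np.corrcoef(-height_m, weight_kg)[0, 1]

print(r_original, r_transformed, r_flipped)
```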

Categorical - Categorical

With the rise of intricate ML algorithms such as tree-based models and Neural Networks, coping with categorical data became feasible in addition to continuous data. Encoding categorical data as integer labels and passing it to Linear Regression or KNN models is not good practice, since these models will treat the encoded values as ordered, continuous numbers rather than discrete categories.

As in modeling, using the Pearson technique to calculate the correlation between categorical values would not give meaningful results. Cramer’s V can help us measure the strength of association between two categorical sets of values with two or more levels each; its value lies in the interval [0, 1].

The computation is based on the Chi-Square Test and is symmetric, like Pearson Correlation. The Chi-Square test is applied to the categorical data to evaluate how likely it is that the observed values occurred by chance. Finally, Cramer’s V is computed from the Chi-Square statistic, the sample size, and the dimensions of the contingency matrix.

As a final note, it’s crucial to check the p-value of the Chi-Square Test to see whether the association between the values is statistically significant.
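The steps above can be sketched as follows, assuming the common form V = sqrt(chi2 / (n * (min(rows, cols) - 1))). The helper name `cramers_v` and the toy series are hypothetical, and Yates' continuity correction is switched off here as a deliberate choice so that 2x2 tables follow the same formula.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

def cramers_v(x, y):
    """Return (Cramer's V, Chi-Square p-value) for two categorical series."""
    table = pd.crosstab(x, y)  # contingency matrix of observed counts
    # correction=False skips Yates' correction, which SciPy applies to 2x2 tables by default.
    chi2, p_value, _, _ = chi2_contingency(table, correction=False)
    n = table.to_numpy().sum()        # sample size
    min_dim = min(table.shape) - 1    # min(rows, cols) - 1
    return float(np.sqrt(chi2 / (n * min_dim))), float(p_value)

# Toy data: y_same is a relabeling of x, y_indep is balanced across x.
x = pd.Series(["a", "b", "c"] * 100)
y_same = x.map({"a": "u", "b": "v", "c": "w"})
y_indep = pd.Series(["u"] * 150 + ["v"] * 150)

v_same, p_same = cramers_v(x, y_same)
v_indep, p_indep = cramers_v(x, y_indep)
v_swapped, _ = cramers_v(y_same, x)  # symmetry check

print(v_same, p_same)
print(v_indep, p_indep)
```

The relabeled pair should come out at the top of the [0, 1] range with a tiny p-value, while the balanced pair should sit at the bottom with a p-value close to 1.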

Categorical - Continuous

Finally, we arrive at the correlation between categorical and continuous features. Specifically, in classification problems, it can be practical to measure the correlation between binary and continuous features in addition to group-by aggregations. Point-biserial correlation computes this correlation from the standard deviation of the sample, the mean value of each binary group, and the proportion of each binary category.

Like the methods above, the Point-biserial correlation is symmetric: the correlation between x and y gives the same value as the correlation between y and x.
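A minimal sketch of the point-biserial computation, assuming synthetic data (the `group` and `value` arrays are invented for illustration). It builds r from the group means, the sample standard deviation, and the group proportions, then compares the result with `scipy.stats.pointbiserialr`.

```python
import numpy as np
from scipy import stats

# Synthetic data: a binary feature and a continuous feature that depends on it.
rng = np.random.default_rng(0)
group = rng.integers(0, 2, size=200)  # binary values 0 / 1
value = 50 + 10 * group + rng.normal(scale=8, size=200)

m1 = value[group == 1].mean()  # mean of the continuous feature in group 1
m0 = value[group == 0].mean()  # mean of the continuous feature in group 0
p = group.mean()               # proportion of group 1
q = 1 - p                      # proportion of group 0
s = value.std()                # standard deviation of the whole sample (ddof=0)

r_manual = (m1 - m0) / s * np.sqrt(p * q)

# SciPy returns the coefficient together with a two-sided p-value.
r_pb, p_value = stats.pointbiserialr(group, value)

# Point-biserial is Pearson correlation applied to a 0/1 variable.
r_pearson, _ = stats.pearsonr(group, value)

print(r_manual, r_pb, r_pearson, p_value)
```

All three coefficients should match up to floating-point error, which also explains why the measure is symmetric.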

As I noted for Cramer’s V, it’s critical to check the p-value to see how statistically significant the correlation is. In my example, even though the correlation between gender and the monthly charge is not statistically significant, I wanted to show the contrast between the gender and monthly charge pair and the churn and monthly charge pair. As you might assume, there is no correlation between the gender and monthly charge features, whereas there is a correlation between the churn and monthly charge features.

Final Words

I wanted to share 3 methods along with their theory and implementation. I hope these scripts will help during your data analysis and help you reduce the dimension of the dataset. Also, using heatmaps with correlation values would improve the visual quality of your analysis! :)

If you have any feedback or recommendations for further advancement, please leave your ideas in the comments! :)

To say “hi” or ask me anything:

linkedin: https://www.linkedin.com/in/ktoprakucar/

github: https://github.com/ktoprakucar