Linear Regression for tech start-up company Cars4U in Python

Feb 28, 2023

Photo Credit: Shamoil from Unsplash.com

Hello world!

Today’s post is about building a pricing model for used cars in India. Read on to learn more about this project!

Before we dive in, I want to share one piece of information about the usage of linear regression. Linear regression can be used when the dependent variable is continuous. Examples of continuous variables include height, weight, temperature, and currency.

Introduction

There is a huge demand for used cars in the Indian market today. As sales of new cars have slowed in the recent past, the pre-owned car market has continued to grow and is now larger than the new car market. For this reason, Cars4U was created as a budding tech start-up that aims to find a foothold in this market.

In 2018–2019, while new car sales were recorded at 3.6 million units, around 4 million second-hand cars were bought and sold. The slowdown in new car sales could mean that demand is shifting towards the pre-owned market. In fact, some car owners replace their old cars with pre-owned cars instead of buying new ones. Unlike new cars, where price and supply are fairly deterministic and managed by OEMs (Original Equipment Manufacturers), except for dealership-level discounts that come into play only in the last stage of the customer journey, used cars are very different beasts with huge uncertainty in both pricing and supply. With this in mind, the pricing scheme for used cars becomes important for growing in the market.

As a data scientist at Cars4U, I had to come up with a pricing model that can effectively predict the price of used cars and can help the business in devising profitable strategies using differential pricing. For example, if the business knows the market price, it will never sell anything below it.

In this analysis, I:

provided summary statistics and exploratory data analysis of the data.
investigated correlations between the variables using a pairplot and jointplot.
created a linear regression model to predict the prices of used cars.

The Data

The dataset is a CSV file with 14 columns and 7,253 rows. Each row holds the data collected for one car across these columns. You can find the dataset here.

Here are the columns that I used for this analysis:

Year: Manufacturing year of the car
Name: Name of the car, which includes the brand name and model name
Location: Location in which the car is being sold or is available for purchase (cities)
Seats: The number of seats in the car
New Price: The price of a new car of the same model in INR lakhs (1 lakh = 100,000 INR)
Mileage: The standard mileage offered by the car company in kmpl or km/kg
Engine: The displacement volume of the engine in CC
Power: The maximum power of the engine in bhp
Kilometers Driven: The total kilometers driven in the car by the previous owner(s) in km
Fuel Type: The type of fuel used by the car (Petrol, Diesel, Electric, CNG, LPG)
Transmission: The type of transmission used by the car (Automatic/Manual)
Owner Type: Type of ownership

The Analysis

I decided to use Python for this project because thorough statistical summaries and data visualizations often take just one line of code.

I began by importing pandas, matplotlib, and seaborn into my notebook. These are common Python libraries used for data analysis and visualization.
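
The loading step itself is not shown in this post. As a minimal sketch, assuming the data is read with pandas, the example below loads a small in-memory CSV and inspects the column types. The two sample rows and the variable name csv_text are hypothetical stand-ins, since the real file name is not given here.

```python
import io

import pandas as pd

# Hypothetical two-row excerpt standing in for the real CSV file;
# the actual file name and full contents are not shown in this post.
csv_text = """Name,Year,Fuel_Type,Mileage,Engine,Power
Maruti Wagon R LXI CNG,2010,CNG,26.6 km/kg,998 CC,58.16 bhp
Hyundai Creta 1.6 CRDi SX Option,2015,Diesel,19.67 kmpl,1582 CC,126.2 bhp
"""

# With a real file this would be pd.read_csv("<path to the CSV>")
df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)   # number of rows and columns
print(df.dtypes)  # Mileage, Engine and Power are read as text, not numbers
```

Because values such as "26.6 km/kg" carry their unit inside the cell, pandas cannot read those columns as numbers, which is what makes a cleaning step necessary.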

Data Preprocessing

I noticed that Mileage, Engine, and Power are object-type columns when they should be numerical. For the Mileage column, I fixed this with the str.split() function and then created two new columns, mileage_num and mileage_unit, for the mileage value and its unit, respectively. The same process was applied to Engine and Power.

df_mileage = df["Mileage"].str.split(" ", expand=True)
df_mileage.head()

# let's verify that there are two units
df_mileage[1].value_counts()

# create two new columns for mileage values and units
df["mileage_num"] = df_mileage[0].astype(float)
df["mileage_unit"] = df_mileage[1]

# Checking the new dataframe
df.head()

# Let's check if the units correspond to the fuel types
df.groupby(by=["Fuel_Type", "mileage_unit"]).size()
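
The Engine and Power steps are not shown above. A possible sketch of the same split-and-convert idea follows. The column names engine_num and power_num are assumptions modeled on mileage_num, the sample rows are hypothetical, and pd.to_numeric with errors="coerce" is swapped in for astype(float) on the assumption that some entries may not be numeric.

```python
import pandas as pd

# Hypothetical sample rows; the "null bhp" entry is an assumed example of a
# non-numeric value, used to show why errors="coerce" can be helpful.
df = pd.DataFrame(
    {
        "Engine": ["998 CC", "1582 CC", "1199 CC"],
        "Power": ["58.16 bhp", "126.2 bhp", "null bhp"],
    }
)

for col, num_col in [("Engine", "engine_num"), ("Power", "power_num")]:
    # split "998 CC" into the value ("998") and the unit ("CC")
    parts = df[col].str.split(" ", expand=True)
    # unparseable text becomes NaN instead of raising an error
    df[num_col] = pd.to_numeric(parts[0], errors="coerce")

print(df[["engine_num", "power_num"]])
```

Coercing to NaN keeps the row in the data so that the missing value can be handled later, rather than failing on the first bad entry.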

Feature Engineering

The Name column in the current format might not be very useful in our analysis. Since the name contains both the brand name and the model name of the vehicle, the column would have too many unique values to be useful in prediction.

# checking number of unique values
df["Name"].nunique()

# extracting brand names
df["Brand"] = df["Name"].apply(lambda x: x.split(" ")[0].lower())
df.head()

# checking the unique values and their number of occurrences
df["Brand"].value_counts()

This is much better: I can now see that there are 32 car brands in the data. Next, I will extract the car’s model name.

# extracting model names
df["Model"] = df["Name"].apply(lambda x: x.split(" ")[1].lower())
df.head()

# checking the unique values and their number of occurrences
df["Model"].value_counts()

Here, there are 218 different car models in the data.

After examining missing values and dropping redundant columns, the statistical summary was examined.
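
These steps could look roughly like the sketch below. The sample values are hypothetical, and treating the raw Mileage text column as the redundant one is an assumption made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with one missing value
df = pd.DataFrame(
    {
        "Mileage": ["26.6 km/kg", "19.67 kmpl", "18.2 kmpl"],
        "mileage_num": [26.6, 19.67, 18.2],
        "Seats": [5.0, np.nan, 5.0],
        "Price": [1.75, 12.5, 4.5],
    }
)

# count missing values per column
missing_counts = df.isnull().sum()
print(missing_counts)

# drop a column that is now redundant (assumed here: the raw text column)
df = df.drop(columns=["Mileage"])

# statistical summary of the numeric columns, one row per column
summary = df.describe().T
print(summary)
```

Transposing the summary with .T puts each variable on its own row, which makes it easier to compare mean, median (the 50% column), minimum, and maximum side by side.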

Kilometers_Driven values have an incredibly high range, which is expected. A few extreme values must be investigated to get a sense of the data. On average, a car has 5 seats, which is plausible. There are used cars being sold for less than 1 lakh INR and for as much as 160 lakh INR, as I saw for the Lamborghini earlier. This is why it is important to check for outliers in order to build a robust model. Some values, such as the minimum mileage of 0, are a concern and must be investigated. Additionally, the mean and median values of Power and Engine are not very different. Someone with automotive engineering knowledge would be able to comment further on these attributes. Finally, the new price range seems right: both budget-friendly Maruti cars and Lamborghinis are in our stock. The mean being almost twice the median suggests that there are only a few very high-end brands, which again makes sense.

Exploratory Data Analysis (EDA)

Univariate EDA

Price: The price of a used car is the target variable and has a highly skewed distribution. The log transformation was applied to this column to reduce skewness. The displacement volume of the engine, the maximum power of the engine, and the price of a new car of the same model are highly correlated with the price of a used car.
Mileage: This attribute has a close to normal distribution. As mileage increases, engine displacement and power decrease.
Engine: There are a few upper outliers, indicating that a few cars have a higher engine displacement volume. Higher priced cars have higher engine displacement. It is also highly correlated with the maximum engine power.
Power: There are a few upper outliers, indicating that a few cars have higher power. Higher priced cars have higher maximum power. It is also highly correlated with the engine displacement volume.
Kilometers_Driven: The number of kilometers a used car has been driven has a highly skewed distribution, with a median value of around 53.5 thousand. The log transformation was applied to this column to reduce skewness.
New_Price: The price of a new car of the same model has a highly skewed distribution, with a median value of around 11.3 lakh INR. The log transformation was applied to this column to reduce skewness.
Seats: 84% of the cars in the dataset are 5-seater cars.
Year: More than half the cars in the data were manufactured in or after 2014. Cars from more recent manufacturing years sell for higher prices.
Brand: Most of the cars in the data belong to Maruti or Hyundai. The price of used cars is lower for budget brands like Maruti, Tata, Fiat, etc. The price of used cars is higher for premium brands like Porsche, Bentley, Lamborghini, etc.
Model: Maruti Swift is the most common car up for resale. The dataset contains used cars from luxury as well as budget-friendly brands.
Location: Hyderabad and Mumbai have the most demand for used cars. The price of used cars has a large IQR in Coimbatore and Bangalore.
Fuel_Type: Around 1% of the cars in the dataset do not run on diesel or petrol. Electric cars have the highest median price, followed by diesel cars.
Transmission: More than 70% of the cars have manual transmission. The price is higher for used cars with automatic transmission.
Owner_Type: More than 80% of the used cars are being sold for the first time. The price of cars decreases as they keep getting resold.
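
The log transformation mentioned for the skewed columns can be sketched as follows. The data here is synthetic, and kilometers_driven_log is an assumed column name (price_log is the name used later in this post).

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed data standing in for Price and Kilometers_Driven
rng = np.random.default_rng(0)
df = pd.DataFrame(
    {
        "Price": rng.lognormal(mean=1.5, sigma=0.8, size=500),
        "Kilometers_Driven": rng.lognormal(mean=10.9, sigma=0.6, size=500),
    }
)

# natural log; this requires strictly positive values
df["price_log"] = np.log(df["Price"])
df["kilometers_driven_log"] = np.log(df["Kilometers_Driven"])

# skewness near 0 means a roughly symmetric distribution
cols = ["Price", "price_log", "Kilometers_Driven", "kilometers_driven_log"]
print(df[cols].skew())
```

The log compresses the long right tail, so a handful of very expensive or very high-mileage cars no longer dominates the scale of the variable.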

Bivariate EDA

Contrary to intuition, Kilometers_Driven does not seem to have a relationship with the price. Price has a positive relationship with Year, i.e., the newer the car, the higher the price; the temporal element of variation is captured in the Year column. Two-seater cars are all luxury variants, while cars with 8–10 seats are exclusively mid to high range. Mileage does not seem to show much of a relationship with the price of used cars. Engine displacement and power have a positive relationship with the price. New_Price and used car price are also positively correlated, which is expected. Kilometers_Driven has a peculiar relationship with Year: generally, the newer the car, the less distance it has traveled, but this is not always true. CNG cars are conspicuous outliers when it comes to Mileage, as their mileage is very high. The mileage and power of newer cars are increasing owing to advancements in technology. Mileage has a negative correlation with engine displacement and power: the more powerful the engine, the more fuel it consumes.
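
Relationships like these can also be checked numerically with a correlation matrix. The sketch below uses synthetic data deliberately built to mimic the directions described above, so it only demonstrates the mechanics and does not reproduce the actual results. The column names engine_num and power_num are assumptions.

```python
import numpy as np
import pandas as pd

# Synthetic data: power rises with engine size, mileage falls with it
rng = np.random.default_rng(1)
n = 200
engine = rng.uniform(800, 3000, size=n)                     # CC
power = 0.07 * engine + rng.normal(0, 10, size=n)           # bhp
mileage = 30 - 0.006 * engine + rng.normal(0, 1.5, size=n)  # kmpl

df = pd.DataFrame(
    {"engine_num": engine, "power_num": power, "mileage_num": mileage}
)

# pairwise Pearson correlations between the numeric columns
corr = df.corr()
print(corr.round(2))
```

A value close to +1 or -1 signals a strong linear relationship, while a value near 0 means there is little linear relationship between the two columns.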

Multivariate EDA

Histogram showing Mileage vs. Name vs. price_log

The variables Mileage, Name, and price_log were used because it is important to see how the three relate to one another. Both histograms are slightly right-skewed. Most buyers bought used cars in Mumbai, with average prices between 13–14 lakhs and mileage between 15 and 20 kmpl. The mean and median are similar for both price and mileage.

Linear Model Building

The target to predict is the used car price (Price). As Price is a skewed variable, I built one model on the actual variable and one on its log-transformed version, price_log. Before building a model, the categorical features have to be encoded. Splitting the data into train and test sets lets me evaluate the model that I build on the train data. I then fit a linear regression model on the train data and check its performance.
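
The encode, split, and fit steps can be sketched as below. This is an illustration on synthetic data. The 70/30 split, the random_state values, the use of pd.get_dummies for the encoding, and the names X, y, x_train, and lin_reg_model are all assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data: one numeric and one categorical feature
rng = np.random.default_rng(42)
n = 300
df = pd.DataFrame(
    {
        "power_num": rng.uniform(50, 300, size=n),
        "Transmission": rng.choice(["Manual", "Automatic"], size=n),
    }
)
# synthetic target built from the features plus a little noise
df["price_log"] = (
    0.5
    + 0.008 * df["power_num"]
    + 0.3 * (df["Transmission"] == "Automatic")
    + rng.normal(0, 0.1, size=n)
)

# one-hot encode categorical columns; drop_first avoids a redundant dummy
X = pd.get_dummies(df.drop(columns=["price_log"]), drop_first=True, dtype=float)
y = df["price_log"]

# hold out 30% of the rows for testing
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1
)

lin_reg_model = LinearRegression()
lin_reg_model.fit(x_train, y_train)

print("Train R^2:", lin_reg_model.score(x_train, y_train))
print("Test R^2:", lin_reg_model.score(x_test, y_test))
```

Fitting only on the train rows and scoring on the held-out rows is what allows the train and test metrics to be compared when checking for overfitting.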

The metric functions defined in sklearn for MAPE, RMSE, MAE, and 𝑅² were also used. The mean absolute percentage error (MAPE) measures the accuracy of predictions as a percentage: for each observation, take the absolute difference between the predicted and actual value, divide it by the actual value, and then average these ratios. It works best if there are no extreme values in the data and none of the actual values are 0.
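
To make the MAPE definition concrete, here is a minimal NumPy sketch, together with the standard adjusted R-squared formula that is reported for the models in this post. The function names mape_score and adj_r2_score are hypothetical helpers, not sklearn functions.

```python
import numpy as np


def mape_score(targets, predictions):
    """Mean absolute percentage error, in percent."""
    targets = np.asarray(targets, dtype=float)
    predictions = np.asarray(predictions, dtype=float)
    # undefined when any actual value is 0 (division by zero)
    return np.mean(np.abs(targets - predictions) / np.abs(targets)) * 100


def adj_r2_score(r2, n_samples, n_features):
    """Adjusted R^2 from plain R^2, the sample count and the predictor count."""
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_features - 1)


print(mape_score([100, 200], [110, 180]))
print(adj_r2_score(0.9, n_samples=11, n_features=2))
```

In the first call both predictions are off by 10% of the actual value, so the MAPE is 10. Adjusted R-squared shrinks the plain R-squared as more predictors are added, which discourages adding features that do not help.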

For the Price variable, let's call this Model 1.

Both the R-squared and adjusted R-squared of the model are high, a clear indication that the model explains the variance in the price of used cars well, up to 87%. The model is therefore not underfitting.

Root Mean Squared Error (RMSE) and Mean Absolute Error (MAE) of train and test data are close, which indicates that our model is not overfitting the train data.

MAE indicates that the model predicts used car prices within a mean error of 4.7 lakhs on the test data. RMSE and MAE have the same units, lakhs in this case, but RMSE is greater than MAE because it penalizes large errors more heavily. A MAPE of 43.55 on the test data indicates that the model predicts within ~44% of the used car price.
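
A tiny worked example shows why RMSE comes out larger than MAE when a few errors are large. The error values below are hypothetical.

```python
import numpy as np

# Hypothetical absolute errors in lakhs: three small misses and one large one
errors = np.array([1.0, 1.0, 1.0, 9.0])

mae = np.mean(np.abs(errors))         # (1 + 1 + 1 + 9) / 4
rmse = np.sqrt(np.mean(errors ** 2))  # sqrt((1 + 1 + 1 + 81) / 4)

print(f"MAE:  {mae:.2f}")   # MAE:  3.00
print(f"RMSE: {rmse:.2f}")  # RMSE: 4.58
```

Squaring makes the single 9-lakh miss contribute 81 of the 84 squared-error units, which pulls RMSE well above MAE.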

For the price_log variable, let's call this Model 2.

Both the R-squared and adjusted R-squared of this model are higher than before, and the model is able to explain up to 94% of the variance in the price of used cars. RMSE and MAE of train and test data are close and are much lower than those of the previous model. MAE indicates that the model is able to predict used car prices within a mean error of 1.3 lakhs on test data. A MAPE of 12.96 on the test data indicates that the model can predict within ~13% of the used car price.

Finally, it is time to check the model performance on the actual prices rather than the log values. To do this, I created a function that converts the log prices back to actual prices and then checks the performance. It uses the metric functions defined in sklearn for RMSE, MAE, and 𝑅², along with custom functions to calculate MAPE and adjusted 𝑅².
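
One way such a function could look is sketched below. The name performance_on_actual_prices and the sample numbers are hypothetical, the back-conversion assumes the natural log was used (so np.exp reverses it), and adjusted 𝑅² is left out because it also needs the number of predictors.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score


def performance_on_actual_prices(actual_log, predicted_log):
    """Convert log prices back to prices, then compute metrics on that scale."""
    actual = np.exp(np.asarray(actual_log, dtype=float))
    predicted = np.exp(np.asarray(predicted_log, dtype=float))
    return pd.DataFrame(
        {
            "RMSE": [np.sqrt(mean_squared_error(actual, predicted))],
            "MAE": [mean_absolute_error(actual, predicted)],
            "R-squared": [r2_score(actual, predicted)],
            "MAPE": [np.mean(np.abs(actual - predicted) / actual) * 100],
        }
    )


# hypothetical actual and predicted prices (lakhs), expressed as logs
actual_log = np.log([2.0, 5.0, 10.0])
predicted_log = np.log([2.2, 4.5, 11.0])

perf = performance_on_actual_prices(actual_log, predicted_log)
print(perf.T)
```

Returning a one-row DataFrame means the result can be transposed with .T and concatenated column-wise, in the same way the comparison code handles the performance frames.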

# training performance comparison

models_train_comp_df = pd.concat(
    [lin_reg_model1_perf_train.T, lin_reg_model2_perf_train.T,], axis=1,
)

models_train_comp_df.columns = [
    "Linear Regression (Price)",
    "Linear Regression (price_log)",
]

print("Training performance comparison:")
models_train_comp_df

# test performance comparison

models_test_comp_df = pd.concat(
    [lin_reg_model1_perf_test.T, lin_reg_model2_perf_test.T,], axis=1,
)

models_test_comp_df.columns = [
    "Linear Regression (Price)",
    "Linear Regression (price_log)",
]

print("Test performance comparison:")
models_test_comp_df

The side-by-side comparison shows that the price_log model has higher R-squared and adjusted R-squared, explaining up to 94% of the variance in used car prices, and lower errors than the Price model (a test MAE of 1.3 lakhs and a test MAPE of 12.96, i.e., within ~13% of the used car price). Overall, lin_reg_model2 (the model using price_log as the target) is chosen as the final model.

Business Insights and Recommendations

Cars with fewer kilometers driven would be preferred by customers.
Some markets tend to have higher prices. Cars4U should focus its strategy on these markets and set up offices in these areas if needed.
If profitability is a concern, it is best to discuss the threshold cost of used cars versus new cars.
The next step after that would be to cluster different subsets of the data and see whether separate models should be created for different locations and car types.

I hope you enjoyed this post. If you would like to see the full notebook, you can find it here. Feel free to leave comments or any recommendations on the usage of linear regression or anything in general. If you would like to see more of this content, like this post and give me a follow! You can also find me on LinkedIn. Thank you for reading!
