Exploratory Data Analysis Project on Retail

17 min read · Apr 16, 2023

Istanbul is one of the most popular cities in the world, and for many reasons. Its historical monuments, cosmopolitan structure and lively nightlife make it a center of attraction. It is also a shopping destination for millions of people. In this article, we will dive into the shopping world of Istanbul. Without further ado, let’s get started…

Photograph Via : Steven Yu | Pexels, Pixabay

Hello! Three weeks after my previous work, Analyzing and Visualizing Earthquake Data Received with USGS API in Python Environment, I have prepared a new study. Previously, I did CRM work and time series forecasting on the retail industry. Now, I will conduct an exploratory data analysis.

In this study, I will explore the Customer Shopping Dataset - Retail Sales Data dataset, published on Kaggle by my dear friend Mehmet Tahir Arslan, who is among my LinkedIn connections.

We will get to know the dataset and create new variables (see Feature Engineering). Afterwards, we will visualize the data using Seaborn, Matplotlib and Plotly, and discuss the findings we obtain from these graphs.

Shopping occupies an important place in people’s lives. Shopping has become a necessity for people for reasons such as meeting daily needs, satisfying personal pleasures, and discovering new products.

Although the time spent in shopping centers varies according to personal preferences and reasons, it is generally between 1–2 hours on an average day. However, longer periods can be spent, especially on weekends and holidays.

Expenditures made in shopping centers also vary according to personal preferences and needs. On average, daily expenses can be between 50–100 dollars. In our study, however, the transactions were made in Turkish Lira.

People usually buy clothing, shoes, cosmetics, electronics, household goods, food and beverage items from shopping malls. However, this may vary according to personal preferences and needs.

In recent years, with the widespread use of online shopping, people are now shopping online. This has led to changes in shopping habits.

However, shopping centers still hold people’s interest: they are a colorful world that brings many people together, they host different activities, and they serve as entertainment centers where people spend time away from the computer in their daily lives.

First of all, I would like to introduce the data set that we will use in our study.

Our dataset contains data on the shopping transactions that took place between the years 2021–2023 in 10 different shopping centers located in different parts of Istanbul. These data include information such as categories of shopping, gender and age of customers, shopping center, transaction date, unit price of the purchased product, and the number of products purchased. Detailed descriptions of the data set are given below.

Content

Attribute Information:

invoice_no: Invoice number. Nominal. A combination of the letter ‘I’ and a 6-digit integer uniquely assigned to each operation.
customer_id: Customer number. Nominal. A combination of the letter ‘C’ and a 6-digit integer uniquely assigned to each operation.
gender: String variable of the customer’s gender.
age: Positive Integer variable of the customer’s age.
category: String variable of the category of the purchased product.
quantity: The quantities of each product (item) per transaction. Numeric.
price: Unit price. Numeric. Product price per unit in Turkish Liras (TL).
payment_method: String variable of the payment method (cash, credit card or debit card) used for the transaction.
invoice_date: Invoice date. The day when a transaction was generated.
shopping_mall: String variable of the name of the shopping mall where the transaction was made.

In addition to these variables, we add several new ones to the data set: the TotalSales variable, created by multiplying the quantity and price values to allow a more detailed analysis; the NEW_AGE_CAT variable, which groups the observations into categories using the age variable; two variables holding the vehicle capacity and the number of stores of the relevant shopping center; and the Year, Month and DayOfWeek variables, which let us see how shopping transactions change over time.

Let’s dive in!

Importing the Necessary Libraries

First, we import the libraries for data manipulation, numerical analysis and data visualization. You will also see that I import additional libraries at some points in the study.


# Data manipulation
import numpy as np
import pandas as pd

# Visualization
import seaborn as sns
from matplotlib import pyplot as plt
import plotly.express as px
import plotly.graph_objects as go

# Display options for pandas output
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
pd.set_option('display.float_format', lambda x: '%.2f' % x)
pd.set_option('display.expand_frame_repr', False)
pd.set_option('display.width', 500)

Dataset

dataset = pd.read_csv('/kaggle/input/customer-shopping-dataset/customer_shopping_data.csv')

df = dataset.copy()

df.head()

Descriptive Statistics Of Dataset

In my studies, instead of writing separate commands, I define a function to see the shape value of the data set, missing observations, variable types, descriptive statistics of numeric variables. This not only saves me time, but also helps me access descriptive information about the data set with a single function.

def check_df(dataframe, head=5):
    # Print a quick overview: shape, dtypes, first/last rows, missing values and summary statistics.
    print("#"*30 + " " + "Dataset Shape" + " " + "#"*30)
    print(dataframe.shape)
    print("#"*30 + " " + "Dataset Dtypes" + " " + "#"*30)
    print(dataframe.dtypes)
    print("#"*30 + " " + f"First {head} Rows Of Dataset" + " " + "#"*30)
    print(dataframe.head(head))
    print("#"*30 + " " + f"Last {head} Rows Of Dataset" + " " + "#"*30)
    print(dataframe.tail(head))
    print("#"*30 + " " + "Is There Any NaN Observation In Dataset ?" + " " + "#"*30)
    print(dataframe.isnull().sum())
    print("#"*30 + " " + "Descriptive Statistics Of (Default) Numerical Features" + " " + "#"*30)
    print(dataframe.describe().T)
    print("#"*30 + " " + "Quantile Values Of (Default) Numerical Features" + " " + "#"*30)
    # Restrict to numeric columns: quantile() is not defined for string columns.
    print(dataframe.select_dtypes(include='number').quantile([0.00, 0.05, 0.25, 0.50, 0.75, 0.95, 1.00]).T)

Next, I define a separate function to see the unique categories of the variables in the data set.

def unique_categories(dataframe):
    # Summarize every object (string) column: number of unique values and,
    # for columns with fewer than 15 classes, the count of each class.
    cat_cols = dataframe.select_dtypes(include=['object']).columns
    rows = []
    for col in cat_cols:
        n_unique = dataframe[col].nunique()
        value_counts = dataframe[col].value_counts().to_dict() if n_unique < 15 else {}
        rows.append({'Column': col, 'Unique Values': n_unique, 'Value Counts': value_counts})
    # DataFrame.append was removed in pandas 2.0, so the rows are collected in a list first.
    return pd.DataFrame(rows, columns=['Column', 'Unique Values', 'Value Counts'])

I am writing a visualization function to see the class distributions of the categorical variables in the dataset.

def plot_categorical_frequencies(df, n_unique=10, figsize=(8, 6), orient='v'):
    # Draw a count plot with percentage labels for every low-cardinality categorical column.
    cat_cols = df.select_dtypes(include=['object']).columns
    total = float(len(df))
    for col in cat_cols:
        if df[col].nunique() <= n_unique:
            plt.figure(figsize=figsize)
            if orient == 'v':
                # Vertical bars: categories on the x-axis, label above each bar.
                ax = sns.countplot(x=col, data=df)
                plt.setp(ax.get_xticklabels(), rotation=60, ha='right')
                for p in ax.patches:
                    height = p.get_height()
                    ax.text(p.get_x() + p.get_width() / 2., height + 3, '{:.1%}'.format(height / total), ha="center")
            else:
                # Horizontal bars: categories on the y-axis, label to the right of each bar.
                ax = sns.countplot(y=col, data=df)
                for p in ax.patches:
                    width = p.get_width()
                    ax.text(width + 3, p.get_y() + p.get_height() / 2., '{:.1%}'.format(width / total), va="center")
            ax.set_title(col)
            plt.show()

plot_categorical_frequencies(df, 10, (8,6), 'v')

When we look at the graphs, we can clearly see that the women in the observations shopped more than the men.

Afterwards, when we examine the purchased products, we see that clothing shopping, cosmetic product shopping and food shopping are the most purchased product types.

Interestingly, technology, jewelry and book purchases appear at a similar frequency, near the bottom of the ranking. For technology, this may be due to the high exchange rates that Turkish shoppers face; for books, it may reflect reading habits.

When we look at the payment types, we see that cash is the most common. Cash is followed by credit card, and the least used payment method is debit card. Here we can make a simple deduction:

As my personal interpretation, in countries with high inflation like Turkey, people generally tend to hold money for fear that the value of their money will decrease rapidly. This, in turn, can be a hindrance to economic growth as it can reduce consumer spending. However, some people may want to spend their money quickly and dispose of it before it loses value. This can vary depending on inflation rates, income levels, consumption habits and personal preferences.

Looking at the chart overall, it appears that people tend to convert their money into the products they want to buy before it loses value.

Finally, when we examine the shopping centers in the data set, the shopping centers named Kanyon, Mall Of Istanbul and Metrocity are at the top of the ranking. Zorlu Center and Cevahir are in the lower ranks.
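The percentages shown on the bars can also be read as a table. The sketch below is only an illustration on a made-up sample (the `toy` frame and its values are hypothetical, not the Kaggle data); it uses pandas’ value_counts with normalize=True to turn counts into shares.

```python
import pandas as pd

# Hypothetical toy sample; the values are made up for illustration only.
toy = pd.DataFrame({
    "payment_method": ["Cash", "Cash", "Credit Card", "Debit Card", "Cash", "Credit Card"],
})

# normalize=True returns each class's share of all rows instead of a raw count.
# The result is sorted from the most frequent class to the least frequent one.
shares = toy["payment_method"].value_counts(normalize=True)
print(shares.round(3))
```

Applied to a column of the real data set, the same call would give the class shares behind the corresponding count plot.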

After this step, we create our first Additional Feature here. We create a new variable called TotalSales by multiplying Price and Quantity values.

df['TotalSales'] = df['quantity'] * df['price']

I have previously shown descriptive statistics for the numeric variables with the check_df function. Now, I’m writing a function called desc_stats to display them as a figure.

def desc_stats(dataframe):
    desc = dataframe.describe().T
    desc_df = pd.DataFrame(index= dataframe.columns, 
                           columns= desc.columns,
                           data= desc)
    
    f,ax = plt.subplots(figsize=(10,
                                 desc_df.shape[0]*0.78))
    sns.heatmap(desc_df,
                annot=True,
                cmap = "Wistia",
                fmt= '.2f',
                ax=ax,
                linecolor='white',
                linewidths = 1.3,
                cbar = False,
                annot_kws={"size": 12})
    plt.xticks(size = 18)
    plt.yticks(size = 14,
               rotation = 0)
    plt.title("Descriptive Statistics", size = 14)
    plt.show()


desc_stats(df[[col for col in df.columns if df[col].dtype != 'O']])

When we look at the customers in the data set, we see that they are between the ages of 18–69. The age distribution may differ between the genders. We’ll be reviewing that later.

When we look at the quantity variable, the products purchased generally vary between 1 and 5 pieces. (I wrote the number here, but the unit may vary depending on the product purchased)

When we look at the price variable and the total sales variable, we see that the standard deviation is high. In both variables, the standard deviation is greater than the mean. For numerical data, a standard deviation greater than the mean indicates that the values are widely spread around the mean, that is, the data has a more heterogeneous distribution.
For example, if the standard deviation of students’ math grades in a class is greater than their average, the grades differ greatly from each other. However, standard deviation alone may not adequately describe a whole data set, so it should be used together with other statistical measures. Let’s examine the mean and median values of these two numerical variables.

If a numeric data set has a high standard deviation and also a very high mean relative to its median, this indicates that the data has a right-skewed distribution. In a right-skewed distribution, most of the data is collected at low values, while a few high values drive the mean value up. In this case, the median will be lower than the mean value.

For example, if a company’s distribution of employee salaries is skewed to the right, a few employees’ higher salaries will push the average salary up. The median, however, will be lower, and most workers’ salaries will be below the average. Therefore, standard deviation and mean alone are not sufficient to understand the distribution of a data set; other statistical measures must also be taken into account.
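To make the relationship between mean, median and standard deviation concrete, here is a small sketch on made-up salary figures. The numbers are hypothetical and chosen only to produce a right-skewed sample.

```python
import pandas as pd

# Hypothetical salaries: most are modest, one is very large.
salaries = pd.Series([3000, 3200, 3500, 3800, 4000, 4200, 50000])

mean = salaries.mean()
median = salaries.median()
std = salaries.std()      # sample standard deviation (ddof=1)
skew = salaries.skew()    # positive for a right-skewed sample

# Share of observations that fall below the mean.
share_below_mean = (salaries < mean).mean()

print(f"mean={mean:.2f} median={median:.2f} std={std:.2f} skew={skew:.2f}")
print(f"share below mean: {share_below_mean:.0%}")
```

In this sample, the single large salary pulls the mean well above the median, the standard deviation exceeds the mean, and the skewness is positive.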

I want to visualize the age and TotalSales variables in the dataset.

import matplotlib.pyplot as plt
plt.figure(figsize=(12,6))
ax1 = plt.subplot(1,2,1)
ax2 = plt.subplot(1,2,2)

sns.boxplot(x="gender", y="age", data=df, ax=ax1)
sns.boxplot(x="gender", y="TotalSales", data=df, ax=ax2)

ax1.set_title("Age Distribution by Gender")
ax1.set_xlabel("Gender")
ax1.set_ylabel("Age")
ax1.grid(True)

ax2.set_title("Sales Distribution by Gender")
ax2.set_xlabel("Gender")
ax2.set_ylabel("Sales")
ax2.grid(True)

female_stats_age = df.groupby("gender")["age"].describe().loc["Female"]
male_stats_age = df.groupby("gender")["age"].describe().loc["Male"]

female_stats_sales = df.groupby("gender")["TotalSales"].describe().loc["Female"]
male_stats_sales = df.groupby("gender")["TotalSales"].describe().loc["Male"]


ax1.text(0, female_stats_age["25%"], f"Q1: {female_stats_age['25%']}", horizontalalignment="right")
ax1.text(0, female_stats_age["50%"], f"Median: {female_stats_age['50%']}", horizontalalignment="right")
ax1.text(0, female_stats_age["75%"], f"Q3: {female_stats_age['75%']}", horizontalalignment="right")
ax1.text(0, female_stats_age["mean"], f"Mean: {female_stats_age['mean']:.2f}", horizontalalignment="left")

ax1.text(1, male_stats_age["25%"], f"Q1: {male_stats_age['25%']}", horizontalalignment="right")
ax1.text(1, male_stats_age["50%"], f"Median: {male_stats_age['50%']}", horizontalalignment="right")
ax1.text(1, male_stats_age["75%"], f"Q3: {male_stats_age['75%']}", horizontalalignment="right")
ax1.text(1, male_stats_age["mean"], f"Mean: {male_stats_age['mean']:.2f}", horizontalalignment="left")

ax2.text(0, female_stats_sales["25%"], f"Q1: {female_stats_sales['25%']:.2f}", verticalalignment="top")
ax2.text(0, female_stats_sales["50%"], f"Median: {female_stats_sales['50%']:.2f}", verticalalignment="top")
ax2.text(0, female_stats_sales["75%"], f"Q3: {female_stats_sales['75%']:.2f}", verticalalignment="top")
ax2.text(0, female_stats_sales["mean"], f"Mean: {female_stats_sales['mean']:.2f}", verticalalignment="bottom")

ax2.text(1, male_stats_sales["25%"], f"Q1: {male_stats_sales['25%']:.2f}", verticalalignment="top")
ax2.text(1, male_stats_sales["50%"], f"Median: {male_stats_sales['50%']:.2f}", verticalalignment="top")
ax2.text(1, male_stats_sales["75%"], f"Q3: {male_stats_sales['75%']:.2f}", verticalalignment="top")
ax2.text(1, male_stats_sales["mean"], f"Mean: {male_stats_sales['mean']:.2f}", verticalalignment="bottom")

plt.show()

import matplotlib.pyplot as plt 
import seaborn as sns 

plt.figure(figsize=(20,6)) 
ax1 = plt.subplot(1,2,1) 
ax2 = plt.subplot(1,2,2)

sns.kdeplot(x="age", hue="gender", data=df, fill=True, ax=ax1) 
sns.kdeplot(x="TotalSales", hue="gender", data=df, ax=ax2)

ax1.set_title("Age Distribution by Gender")
ax1.set_xlabel("Ages")
ax1.set_ylabel("Density")
ax1.grid(True)

ax2.set_title("Sales Distribution by Gender")
ax2.set_xlabel("Sales")
ax2.set_ylabel("Density")
ax2.grid(True)

The data set is fairly balanced. No bias appears in the age distribution between the genders. When we go into the details, we will be able to see the separation.

When we look at the average age of the men and women in the data set, the average age of both groups seems to be very close to each other.

Likewise, the average purchase amounts of men and women are quite close to each other.

Also, when we look at the gender breakdown of the distribution of our TotalSales variable, we see that the data is skewed to the right. In case of positive skewness, this shows us that the mass of the TotalSales variable is concentrated on the left side.

*As additional information, when a distribution is skewed in this way, the median is used to evaluate the central tendency of the data. Here, the median of the TotalSales variable is 600.17.

When we examine the density graph of the TotalSales variable, we can see that there is a multi-peaked structure. This suggests that the data contains distinct subgroups. These subgroups may correspond to categorical variables such as the shopping center or the product category.

Now, we create a new categorical variable using the age variable we have. This variable will help us to group according to age ranges and to see certain patterns more easily.

I divide the age variable into ‘0_18’, ‘19_25’, ‘26_35’, ‘36_45’, ‘46_55’, ‘56_65’, ‘66_70’ categories. Thus, we examine shopping habits according to age categories.

df.loc[:, 'NEW_AGE_CAT'] = pd.cut(df['age'], bins = [0,18,25,35,45,55,65,70], labels = ['0_18', '19_25', '26_35', '36_45', '46_55', '56_65', '66_70'])
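To see how the bin edges behave, the binning above can be tried on a few boundary ages. This is a minimal sketch: the `ages` series is hypothetical, while the bins and labels are the ones used in this study.

```python
import pandas as pd

# Hypothetical ages chosen to sit on or next to the bin edges.
ages = pd.Series([18, 19, 25, 26, 45, 46, 69])

bins = [0, 18, 25, 35, 45, 55, 65, 70]
labels = ['0_18', '19_25', '26_35', '36_45', '46_55', '56_65', '66_70']

# pd.cut uses right-closed intervals by default: (0, 18], (18, 25], (25, 35], ...
age_cat = pd.cut(ages, bins=bins, labels=labels)

print(pd.DataFrame({"age": ages, "NEW_AGE_CAT": age_cat}))
```

Because the intervals are closed on the right, age 18 falls into ‘0_18’ and age 25 into ‘19_25’. Since the youngest customers in the data set are 18, the ‘0_18’ category contains only 18-year-olds.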

import matplotlib.pyplot as plt 
import seaborn as sns 
import matplotlib.ticker as mtick

plt.figure(figsize=(20,8)) 
ax1 = plt.subplot(1,2,1) 
ax2 = plt.subplot(1,2,2)

sns.kdeplot(x="age", hue="NEW_AGE_CAT", data=df, fill=True, ax=ax1) 
sns.kdeplot(x="TotalSales", hue="NEW_AGE_CAT", data=df, ax=ax2)

ax1.set_title("Age Distribution by New Age Cat")
ax1.set_xlabel("Ages")
ax1.set_ylabel("Density")
ax1.grid(True)


ax2.set_title("Sales Distribution by New Age Cat")
ax2.set_xlabel('Sales')
ax2.set_ylabel("Density")
ax2.grid(True)

We have created the age categories. Now, let’s combine the gender variable and the age category variable and compare the shopping habits of the two genders.

import matplotlib.pyplot as plt
import seaborn as sns
import matplotlib.ticker as mtick

plt.figure(figsize=(20,8)) 
ax1 = plt.subplot(1,2,1) 
ax2 = plt.subplot(1,2,2)


grouped_ = df.groupby(['NEW_AGE_CAT', 'gender', 'category']).agg({'TotalSales':'sum'}).reset_index()


sns.barplot(x='NEW_AGE_CAT', y='TotalSales', hue='category', data=grouped_[grouped_['gender'] == 'Male'], ax=ax1, palette = 'Set2')
sns.barplot(x='NEW_AGE_CAT', y='TotalSales', hue='category', data=grouped_[grouped_['gender'] == 'Female'], ax=ax2,palette = 'Set2')


formatter = mtick.FuncFormatter(lambda x, p: format(int(x/1), ','))
ax1.set_xlabel('Age Group')
ax1.set_ylabel('Total Sales')
ax1.set_ylim(0,15000000)
ax1.set_title('Shopping Preferences by Age Group for Males')

handles, labels = ax1.get_legend_handles_labels()
ax1.legend(handles, labels, loc='best')
ax1.yaxis.set_major_formatter(formatter)

ax2.set_xlabel('Age Group')
ax2.set_ylabel('Total Sales')
ax2.set_ylim(0,15000000)
ax2.set_title('Shopping Preferences by Age Group for Females')

handles, labels = ax2.get_legend_handles_labels()
ax2.legend(handles, labels, loc='best')
ax2.yaxis.set_major_formatter(formatter)

plt.show()

The graph on the left shows the amount in the categories that male customers spend on their purchases, and the graph on the right shows the amount in the categories that female customers spend on their purchases.

When we divide this information into age categories, we clearly see that the total price paid in clothing, shoes and technology categories is at the highest level in all age categories.


grouped = df.groupby(['NEW_AGE_CAT', 'gender', 'category', 'payment_method']).agg({'TotalSales':'sum'}).reset_index()


g = sns.catplot(data=grouped, x='NEW_AGE_CAT', y='TotalSales', hue='category', col='gender', row='payment_method', kind='bar', height=5, aspect=2, palette='Set2')
g.set_axis_labels("Age Group", "Total Sales")
g.fig.suptitle("Shopping Preferences by Age Group and Gender, Payment Method")


formatter = mtick.FuncFormatter(lambda x, p: format(int(x/1), ','))

for ax in g.axes.flat:
    
    ax.yaxis.set_major_formatter(formatter)


handles, labels = g.axes.flat[-1].get_legend_handles_labels()
g.fig.legend(handles, labels, loc='upper right')

plt.show()

In the graphs, we see that the debit card is the least preferred payment method for both genders. Generally, people paid with cash or credit card.

When comparing the graphs, I compare each gender within itself, since we have already seen that women shop more. You can interpret the graphs in the same way. An example of the kind of comment we can make between the genders is:

“When men and women are compared according to their payment methods, men in the 36–45 age category prefer to use their credit cards more for their technology expenditures, whereas female customers in the same age category are more inclined to use credit cards for their shoe expenditures.”

In addition, female customers aged 36–45 seem more likely to pay for their technology purchases in cash. Comparing the credit card and cash graphs for female customers, the gap between the technology and shoes categories is smaller in the cash graph (the two are almost equal), whereas in the credit card graph the gap is wider. But this is, of course, only a high-level interpretation.

In clothing expenses, customers of both genders preferred cash. However, looking carefully, female customers in the 36_45 and 56_65 age categories appear to prefer cash even more for their clothing expenditures, because the difference between the clothing columns of the two graphs is most striking in these age categories.
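Readings like the ones above come from comparing bar heights by eye, so it can help to compute the payment shares directly. The following is a minimal sketch on a made-up sample (the `toy` frame and all of its values are hypothetical); it builds a pivot table of total sales and converts each row into shares per payment method.

```python
import pandas as pd

# Hypothetical toy transactions; the values are made up for illustration only.
toy = pd.DataFrame({
    "gender": ["Female", "Female", "Female", "Male", "Male", "Male"],
    "category": ["Shoes", "Shoes", "Technology", "Technology", "Technology", "Shoes"],
    "payment_method": ["Cash", "Credit Card", "Cash", "Credit Card", "Cash", "Cash"],
    "TotalSales": [600.0, 1800.0, 2100.0, 3150.0, 1050.0, 1200.0],
})

# Total sales per (gender, category) row and payment-method column.
totals = toy.pivot_table(index=["gender", "category"], columns="payment_method",
                         values="TotalSales", aggfunc="sum", fill_value=0)

# Divide each row by its own sum so that every row adds up to 1.
shares = totals.div(totals.sum(axis=1), axis=0)
print(shares.round(2))
```

On the real data, adding NEW_AGE_CAT to the index would give the share of each payment method per gender, age category and product category, which is the comparison made visually above.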

A different graph I want to see is the distribution of the sales values of the categories in the dataset. For this, I use plotly as a different visualization library.

px.box(df , y= 'TotalSales', color='category')

When we draw the boxplot of the categories, we see that the widest spread is in the technology category. The categories with the highest total sales are Clothing, Shoes and Technology, but when we look at the distributions, we see that the greatest variation is in Technology.

The other categorical variable in my data set, shopping_mall, holds the shopping mall where each customer shopped. To see which age category shopped the most at which shopping center, I draw the following chart:
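The spread that a boxplot shows can also be expressed as a number, for example the interquartile range (IQR) per category. The sketch below uses a made-up sample (the `toy` frame and its values are hypothetical) and is only meant to illustrate the calculation.

```python
import pandas as pd

# Hypothetical toy sales per category; the values are made up for illustration only.
toy = pd.DataFrame({
    "category": ["Books"] * 5 + ["Technology"] * 5,
    "TotalSales": [15.0, 30.0, 45.0, 60.0, 75.0,
                   1050.0, 4200.0, 9450.0, 16800.0, 26250.0],
})

# First and third quartile per category, one column per quantile.
quartiles = toy.groupby("category")["TotalSales"].quantile([0.25, 0.75]).unstack()

# The IQR is the height of the box in a boxplot.
iqr = quartiles[0.75] - quartiles[0.25]
print(iqr.sort_values(ascending=False))
```

A larger IQR corresponds to a taller box, so sorting the result gives the same ranking of spreads that the boxplot shows visually.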

import plotly.express as px

grouped = df.groupby(['NEW_AGE_CAT', 'shopping_mall']).agg({'customer_id': 'nunique'}).reset_index()

fig = px.bar(grouped, x='shopping_mall', y='customer_id', color='NEW_AGE_CAT', barmode='group')

fig.update_layout(title='Number of Customers by Shopping Center and Age Group')
fig.show()

When we examine the visitors coming to the shopping centers, we see that Kanyon, Mall Of Istanbul and Metrocity are dominant. It seems that the highest number of customers in almost every age category has come to these 3 shopping malls. I would like to examine the same chart according to the shopping made and the amounts paid.

What I want to see now is how many transactions each age category makes in which shopping mall, and how much is paid for each product category in these malls.

import seaborn as sns
import matplotlib.pyplot as plt


df_count = df.groupby(['NEW_AGE_CAT', 'shopping_mall']).agg({'invoice_no': 'count'}).reset_index()


df_total = df.groupby(['NEW_AGE_CAT', 'shopping_mall', 'category']).agg({'TotalSales': 'sum'}).reset_index()


plt.figure(figsize=(20, 10))


ax1 = plt.subplot(1, 2, 1)
sns.barplot(x='shopping_mall', y='invoice_no', hue='NEW_AGE_CAT', data=df_count, ax=ax1, palette='Set2')

ax1.set_xlabel('Shopping Center')
ax1.set_xticklabels(ax1.get_xticklabels(), rotation=60, ha='right')
ax1.set_ylabel('Number of Invoices')
ax1.set_title('Number of Invoices by Shopping Center and Age Group')


ax2 = plt.subplot(1, 2, 2)
sns.barplot(x='shopping_mall', y='TotalSales', hue='category', data=df_total, ax=ax2, palette='Set2')

ax2.set_xlabel('Shopping Center')
ax2.set_xticklabels(ax2.get_xticklabels(), rotation=60, ha='right')
ax2.set_ylabel('Total Price')
ax2.set_title('Total Payments by Shopping Center, Age Group, and Category')


plt.show()

Here are the graphs. So could there be a reason for this? Is there a valid reason why Kanyon and Mall of Istanbul are the most visited malls with the most shopping? Thinking that these shopping malls may be larger than the others, I added the number of stores and the parking capacity of each shopping mall to the data set. Let us now examine this in detail.

In addition, although the number of shoes purchased from Zorlu Center and Viaport shopping malls is almost equal to each other, the price paid for the shoe category at Zorlu Center is slightly higher than at Viaport.

df.loc[(df["shopping_mall"] == "Mall of Istanbul") , "VEHICLE_CAP(~)-MALL"] = 6500
df.loc[(df["shopping_mall"] == "Mall of Istanbul") , "SHOP_COUNT(~)-MALL"] = 350

df.loc[(df["shopping_mall"] == "Kanyon"), "VEHICLE_CAP(~)-MALL"] = 2300
df.loc[(df["shopping_mall"] == "Kanyon"), "SHOP_COUNT(~)-MALL"] = 160

df.loc[(df["shopping_mall"] == "Metrocity"), "VEHICLE_CAP(~)-MALL"] = 1200
df.loc[(df["shopping_mall"] == "Metrocity"), "SHOP_COUNT(~)-MALL"] = 175

df.loc[(df["shopping_mall"] == "Metropol AVM"), "VEHICLE_CAP(~)-MALL"] = 4000
df.loc[(df["shopping_mall"] == "Metropol AVM"), "SHOP_COUNT(~)-MALL"] = 250

df.loc[(df["shopping_mall"] == "Istinye Park"), "VEHICLE_CAP(~)-MALL"] = 3600
df.loc[(df["shopping_mall"] == "Istinye Park"), "SHOP_COUNT(~)-MALL"] = 280

df.loc[(df["shopping_mall"] == "Zorlu Center"), "VEHICLE_CAP(~)-MALL"] = 3600
df.loc[(df["shopping_mall"] == "Zorlu Center"), "SHOP_COUNT(~)-MALL"] = 200

df.loc[(df["shopping_mall"] == "Cevahir AVM"), "VEHICLE_CAP(~)-MALL"] = 2500
df.loc[(df["shopping_mall"] == "Cevahir AVM"), "SHOP_COUNT(~)-MALL"] = 224

df.loc[(df["shopping_mall"] == "Forum Istanbul"), "VEHICLE_CAP(~)-MALL"] = 5000
df.loc[(df["shopping_mall"] == "Forum Istanbul"), "SHOP_COUNT(~)-MALL"] = 265

df.loc[(df["shopping_mall"] == "Viaport Outlet"), "VEHICLE_CAP(~)-MALL"] = 3000
df.loc[(df["shopping_mall"] == "Viaport Outlet"), "SHOP_COUNT(~)-MALL"] = 150

df.loc[(df["shopping_mall"] == "Emaar Square Mall"), "VEHICLE_CAP(~)-MALL"] = 3000
df.loc[(df["shopping_mall"] == "Emaar Square Mall"), "SHOP_COUNT(~)-MALL"] = 200

df.groupby("shopping_mall").agg({"VEHICLE_CAP(~)-MALL" : "mean",
                                "SHOP_COUNT(~)-MALL": "mean"}).sort_values(by =["VEHICLE_CAP(~)-MALL", "SHOP_COUNT(~)-MALL"], ascending = False)

When we examine the results, Mall of Istanbul seems to be the only shopping center that fits our hypothesis. Although Kanyon and Metrocity are in the lower rankings, I think the reason they are centers of attraction is the other opportunities these malls offer to their visitors.

I would also like to examine the totals and averages of the purchases made in these shopping centers. Likewise, I would like to examine the average amounts paid in the various shopping categories in these malls. Perhaps these shopping centers have become centers of attraction because of their products.
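As a side note, the per-mall assignments above can also be written as a small lookup table that is merged into the data. This is only an alternative sketch, not the code used in this study: the `mall_info` and `toy` names are hypothetical, and the figures are the same approximate values assigned above.

```python
import pandas as pd

# Approximate capacity and shop counts, copied from the assignments above.
mall_info = pd.DataFrame({
    "shopping_mall": ["Mall of Istanbul", "Kanyon", "Metrocity", "Metropol AVM",
                      "Istinye Park", "Zorlu Center", "Cevahir AVM",
                      "Forum Istanbul", "Viaport Outlet", "Emaar Square Mall"],
    "VEHICLE_CAP(~)-MALL": [6500, 2300, 1200, 4000, 3600, 3600, 2500, 5000, 3000, 3000],
    "SHOP_COUNT(~)-MALL": [350, 160, 175, 250, 280, 200, 224, 265, 150, 200],
})

# Hypothetical toy transactions standing in for the real data set.
toy = pd.DataFrame({"shopping_mall": ["Kanyon", "Metrocity", "Kanyon", "Forum Istanbul"]})

# A left merge keeps every transaction and attaches the mall attributes to it.
enriched = toy.merge(mall_info, on="shopping_mall", how="left")
print(enriched)
```

Keeping the figures in one table makes them easier to check and update than twenty separate assignments.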

Now we will look at the effect of time. Here, I want to see how shopping amounts change over time. For example, I would like to see how the amount of shopping changes during religious holidays in Turkey or during periods such as semester holidays. So I’m splitting the invoice_date column in the dataset.

df['invoice_date'] = pd.to_datetime(df['invoice_date'])
df['Year'] = df['invoice_date'].dt.year
df['Month'] = df['invoice_date'].dt.month
df['DayOfWeek'] = df['invoice_date'].dt.day_name()
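One thing worth checking at this step is how the date strings are parsed. The sketch below assumes that the dates are stored day-first (DD/MM/YYYY); this is an assumption, and the sample strings are hypothetical. It also shows one way to keep weekdays in calendar order rather than alphabetical order when grouping or plotting.

```python
import pandas as pd

# Hypothetical day-first date strings (assumption: DD/MM/YYYY).
raw = pd.Series(["5/8/2022", "16/05/2021", "24/10/2021"])

# An explicit format avoids any ambiguity between day and month.
dates = pd.to_datetime(raw, format="%d/%m/%Y")

# An ordered categorical keeps Monday..Sunday in calendar order.
day_order = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]
day_of_week = pd.Categorical(dates.dt.day_name(), categories=day_order, ordered=True)

print(dates.dt.month.tolist())
print(list(day_of_week))
```

With an explicit format, "5/8/2022" is read as 5 August rather than 8 May, and a line plot over the ordered weekdays runs from Monday to Sunday.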

import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(20,10))
sns.set_style("darkgrid")
ax1 = plt.subplot(2,2,1) 
ax2 = plt.subplot(2,2,2)
ax3 = plt.subplot(2,2,3)
ax4 =  plt.subplot(2,2,4)



df_sales = df.groupby(['Year', 'Month', 'DayOfWeek', 'NEW_AGE_CAT','shopping_mall','category']).agg({'TotalSales':'sum'}).reset_index()

sns.barplot(data=df_sales, x="Year", y="TotalSales", ax= ax1)
ax1.set_xlabel("Year")
ax1.set_ylabel("Sales")
ax1.set_title("Sales by Years")

sns.lineplot(data=df_sales, x="Month", y="TotalSales", ax= ax2)
ax2.set_xlabel("Month")
ax2.set_ylabel("Sales")
ax2.set_title("Sales by Month")

sns.lineplot(data=df_sales, x="DayOfWeek", y="TotalSales", ax= ax3)
ax3.set_xlabel("Day Of Week")
ax3.set_ylabel("Sales")
ax3.set_title("Sales by Day Of Week")


sns.lineplot(data=df_sales, x="DayOfWeek", y="TotalSales", hue = 'NEW_AGE_CAT', errorbar = None, ax= ax4)
ax4.set_xlabel("Day Of Week")
ax4.set_ylabel("Sales")
ax4.set_title("Sales by Day Of Week and Age Group")


plt.show()

When the graphs are examined, we can see that sales values change considerably, especially by month and by day of the week.

We have come to the end of our exploratory data analysis of this data set. In this article, we created variables with feature engineering, wrote some functions, viewed the descriptive statistics of the data set in a more practical way, and visualized the data. See you in the next article.
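Note that seaborn’s barplot and lineplot aggregate with the mean by default, so the plots above show averages of the grouped sums rather than totals. If totals per month are of interest, they can be computed directly. The sketch below uses a made-up sample (the `toy` frame and its values are hypothetical).

```python
import pandas as pd

# Hypothetical toy transactions; the values are made up for illustration only.
toy = pd.DataFrame({
    "invoice_date": pd.to_datetime(["2021-01-05", "2021-01-20", "2021-02-03", "2022-01-11"]),
    "TotalSales": [100.0, 250.0, 400.0, 50.0],
})

# Group by calendar month (year and month together) and sum the sales.
monthly = toy.groupby(toy["invoice_date"].dt.to_period("M"))["TotalSales"].sum()
print(monthly)
```

Grouping by a monthly period keeps January 2021 and January 2022 apart, which a plain Month column would merge.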

Contact :

LinkedIn: Cem Ozcelik

Github : ccemozclk

Kaggle : Cem Ozcelik