Build a Stocks Price Prediction App powered by Snowflake, AWS, Python and Streamlit — Part 2 of 3

Vinod Dhole
16 min read · Mar 14, 2023

A comprehensive guide to develop machine learning applications from start to finish.


Introduction

Welcome back! Let's continue our data science journey to create the stock price prediction web application. This is Part 2 of a three-part series; if you are here for the first time, please check out the first article. I have monitored the AWS S3 bucket and the Snowflake tables for a couple of days, and the data pipeline is working as expected: a daily file is created in the AWS S3 bucket, and the same file is pushed into the Snowflake table.

The scope of this article is quite broad: we will work through the core steps of data science. Let's get started…

Project Layout

Here are the high-level steps for this project. In this article, we will cover the third and fourth sections, i.e. Data Extraction, Preprocessing & EDA and Machine Learning Model development.

  1. Data collection: Automatically download the stock historical prices data in CSV format and save it to the AWS S3 bucket.
  2. Data storage: Store the data in a Snowflake data warehouse by creating a data pipe between AWS and Snowflake.
  3. Data Extraction, Preprocessing & EDA: Extract & Pre-process the data using Python and perform basic Exploratory Data Analysis.
  4. Machine Learning Model development: Develop a machine learning model, Train the model on historical data, Evaluate the model and perform hyperparameter tuning
  5. Machine Learning Model deployment: Deploy the final model on Snowflake.
  6. Web App Development: Build a web app using Streamlit and Python to interact with the deployed model and display the predictions. And Deploy the final app on Streamlit Cloud.

3. Data Extraction, Preprocessing & EDA

In this phase, we connect to Snowflake from Python, extract the data, and then perform data preprocessing and exploratory data analysis. We will perform the following steps:

  • Set up Python environment
  • Configurations on the Snowflake side
  • Connect Snowflake & Extract Data
  • Data Preprocessing
  • Exploratory Data Analysis (EDA)

Set up Python environment

First, we will set up the Python environment.

Prerequisites

  • Snowflake: We will use the same Snowflake account used in the first blog.
  • Anaconda or Miniconda: Use one of the following URLs to download and install Anaconda or Miniconda (either one works).

Anaconda: https://www.anaconda.com/products/distribution

Miniconda: https://docs.conda.io/en/latest/miniconda.html

Create Python environment

Let's create a Python environment for our project.

Open Anaconda / Miniconda terminal: This step can differ based on the operating system.

You can open it directly from the command prompt,

or you can open Anaconda Navigator -> Environments -> select the Base environment -> Open Terminal.

Create Python env

Run the following commands one by one in the terminal; type Y if prompted.

For Windows OS

conda deactivate
conda create --name stocksenv --override-channels -c https://repo.anaconda.com/pkgs/snowflake python=3.8 numpy pandas
conda activate stocksenv

For macOS

conda deactivate
CONDA_SUBDIR=osx-64 conda create -n stocksenv python=3.8 numpy pandas --override-channels -c https://repo.anaconda.com/pkgs/snowflake
conda activate stocksenv
conda config --env --set subdir osx-64

Install Required Python Packages

Run the following commands one by one in the terminal; type Y if prompted.

python -m pip install pip==23.0.1
conda install snowflake-snowpark-python==0.10.0
pip install scikit-learn==1.1.1
pip install plotly
pip install matplotlib
pip install holidays==0.18
pip install prophet==1.0.1
pip install seaborn
pip install jupyter
python -m ipykernel install --user --name snowpack --display-name "stocksenv_3.8"
pip install streamlit
pip install pyarrow==1.0.1
pip install pmdarima

We are done with the Python environment setup.

Configurations on the Snowflake side

Before we start we need to configure a few more things on the Snowflake side.

Acknowledge the Use of Third-Party Packages

  • Log in to the Snowflake web console at https://app.snowflake.com/
  • From the Left Panel, Click the dropdown menu next to your login name, then click Switch Role -> ORGADMIN.
  • After that on Left Panel Click Admin -> Billing & Terms
  • Scroll to the Anaconda section and click the Enable button. The Anaconda Packages (Preview Feature) dialog opens.
  • Click the link to review the Snowflake Third Party Terms.
  • If you agree to the terms, click the Acknowledge & Continue button.

Identify Snowflake Account Name

We need to find the Snowflake account name; this will be used to connect to Snowflake from Python (Jupyter Notebook).

  • At the bottom of the left panel, you will see a Snowflake icon, some text, and a down-arrow icon.
  • Click the down arrow -> this pops up another window -> hover over that window -> this opens a second window.
  • On the top right side of the second window, there should be a copy account identifier icon -> clicking that icon copies the account identifier to the clipboard.
  • Paste the account identifier into a notepad. Note that the account identifier is in the format <OrganizationName>.<AccountName>.
  • Replace the dot (.) with a hyphen (-) so that the account identifier looks like <OrganizationName>-<AccountName>. Save this for future use.

You can refer to this link to read more about account identifiers.

Connect Snowflake & Extract Data

In this section, we will use a Jupyter notebook to run the Python code, connect to Snowflake, and extract the pricing data into Python.

Create a project folder in a suitable location; we will save all configuration files and code in this project directory.

Create secrets.toml file

Generally, we do not hard-code values in the code; instead, we create a file (conf, ini, toml, etc.), save parameters and credentials in it, and access that file from the code. In this project we will use a .toml file.

Create a “secrets.toml” file (.toml is the extension), paste the content below into it, and save the file in the project directory.

[SNOWFLAKE]
account="Type the account identifier saved earlier: <OrganizationName>-<AccountName>"
user="Type Snowflake User"
password="Type Snowflake Password"
role="ACCOUNTADMIN"
database="PRICE_PRED_DB"
schema="PRICE_PRED_SCHEMA"
warehouse="PRICE_PRED"

IMPORTANT NOTE: Do not share or upload this file or any user credentials

Jupyter Notebook

  • Open Anaconda / Miniconda Terminal (as shown earlier)
  • Type the following commands:
conda deactivate
conda activate stocksenv
  • Navigate to the Project directory using the cd command
  • Type the command jupyter notebook in the terminal; this will open Jupyter Notebook in a web browser.

You should always use the above steps to open a Jupyter notebook for this project.

  • Click the New button on the top right, then select stocksenv_3.8; this will create a new Jupyter notebook using our Python environment.

You can set a suitable name for this Jupyter notebook. Let's begin the Python coding.

Snowpark Python

To interact with Snowflake through an expressive and extensible interface, we are using Snowpark Python. Snowpark provides a native Python experience, with pandas-like DataFrames for data extraction and manipulation.

Import required packages

## Data Connect / process
from snowflake.snowpark.session import Session
import toml
import numpy as np
import pandas as pd
import sys

## Machine Learning
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error, mean_absolute_percentage_error
from pmdarima.arima import ARIMA
from prophet import Prophet
from prophet.plot import plot_plotly
from prophet.diagnostics import cross_validation, performance_metrics
import itertools
##Data viz
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.graph_objs as go
import plotly.offline as py
py.init_notebook_mode()
%matplotlib inline

## Default settings
sns.set_style('darkgrid')
matplotlib.rcParams['font.size'] = 14
matplotlib.rcParams['figure.figsize'] = (9, 5)
matplotlib.rcParams['figure.facecolor'] = '#00000000'

Load the credentials from the toml file & Connect to Snowflake

secrets = toml.load("secrets.toml")

accountname = secrets["SNOWFLAKE"]["account"]
user = secrets["SNOWFLAKE"]["user"]
password = secrets["SNOWFLAKE"]["password"]
role = secrets["SNOWFLAKE"]["role"]
database = secrets["SNOWFLAKE"]["database"]
schema = secrets["SNOWFLAKE"]["schema"]
warehouse = secrets["SNOWFLAKE"]["warehouse"]

connection_parameters = {
    "account": accountname,
    "user": user,
    "password": password,
    "role": role,
    "database": database,
    "schema": schema,
    "warehouse": warehouse,
    "ocsp_fail_open": "False"
}

session = Session.builder.configs(connection_parameters).create()

Test Connection

Using the session object, you can run queries on Snowflake or retrieve table data. Let's check whether the connection is working properly.
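
Here is a minimal sanity check (a sketch); we simply ask Snowflake for the current session context with session.sql(), the exact query being illustrative:

# Query the warehouse, database and schema the session is currently using
session.sql("SELECT CURRENT_WAREHOUSE(), CURRENT_DATABASE(), CURRENT_SCHEMA()").collect()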

The warehouse, database, and schema names are displayed in the output, which means the connection is successful.

Getting data from Snowflake into Python

session.table(TABLE_NAME) returns a Snowpark DataFrame, which is similar to a pandas DataFrame. It doesn't hold the data; it just points to the table in Snowflake, and we can perform many pandas-like operations on it. Please refer to this documentation link.

Let's pull data from the table historical_prices
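
A sketch, assuming the table created in Part 1 is HISTORICAL_PRICES in the configured database and schema:

# Lazily reference the Snowflake table as a Snowpark DataFrame
price_snow_df = session.table("HISTORICAL_PRICES")
price_snow_df.show()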

We can convert the Snowpark DataFrame to Pandas DataFrame

View Pricing data
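
For example, continuing from the Snowpark DataFrame above:

# Pull the data into a pandas DataFrame and preview it
price_df = price_snow_df.to_pandas()
price_df.head()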

Data Preprocessing

After data extraction, we will check some basic information and statistics about the dataset and perform data preprocessing, which involves handling null and duplicate data, taking care of outliers, converting categorical data, etc.

Basic information about the dataset

Here we will check some basic information like rows, columns, nulls etc.

View the Number of rows & Columns
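
A quick check with pandas:

price_df.shape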

Check for the Null Values
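
Counting nulls per column:

price_df.isnull().sum()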

No null values were found. That's one less thing to worry about; otherwise we would need to address them with one of the various techniques for handling nulls.

Check for duplicate data

There are duplicate rows in the dataset. We should get rid of duplicate data as it may impact the model's predictions; we can use drop_duplicates to remove them, as shown below.
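
A sketch of the check and the cleanup:

print(price_df.duplicated().sum())                      # number of duplicate rows
price_df = price_df.drop_duplicates(ignore_index=True)  # keep the first occurrence of each row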

Sort the DataFrame

It looks like the data is not sorted. Let's sort the data by DATE.

price_df = price_df.sort_values(by='DATE',ignore_index=True)

Basic Stats of the dataset

Checking statistical information about the dataset

df.info

We will use the df.info method, which prints information about a DataFrame including the index dtype and columns, non-null values and memory usage.
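
For example:

price_df.info()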

Looking at the data and the info() output, we can see there is no categorical data in the dataset. In general, we would need to encode categorical data using one of several techniques, but in this case we don't have such data.

Format DATE Column

The data type of the DATE column is object, so we will change it to datetime format. This will be useful for EDA and feature extraction.

price_df['DATE'] = pd.to_datetime(price_df['DATE'])

Create New Features

We add new features (YEAR, MONTH, DAY & MONTHYEAR) derived from the DATE column; we will use these during visualization.

price_df['YEAR'] = price_df['DATE'].dt.year
price_df['MONTH'] = price_df['DATE'].dt.month
price_df['DAY'] = price_df['DATE'].dt.day
price_df['MONTHYEAR'] = price_df['DATE'].dt.strftime('%Y%m').astype(int)

Let’s check the revised data

df.describe

Let's find out more statistics using the df.describe method. This method calculates statistical measures like percentiles, mean, and standard deviation for the numerical values of a Series or DataFrame.
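
For example:

price_df.describe()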

Observing the statistical info, we can see that the data is fairly consistent, and I do not see any outliers.

The data is in good shape. Let’s move forward to Exploratory Data Analysis.

Exploratory Data Analysis (EDA)

In this step, we will perform exploratory data analysis through data visualization to get insights from the data that we can use in the model-building phase.

Closing price Trend

I am using the Python library Plotly to create the closing price trend; it is a very powerful tool for generating interactive visualizations.
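
A sketch of the trend chart, assuming the DATE and CLOSE column names used so far:

fig = px.line(price_df, x='DATE', y='CLOSE', title='Closing Price Trend')
fig.show()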

Prices keep climbing until the year 2021.

BAR Plots

Let’s plot the average Closing Price based on Day, Month & Year.
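
A sketch using the derived DAY, MONTH and YEAR columns:

for col in ['DAY', 'MONTH', 'YEAR']:
    # Average closing price per calendar unit
    avg_close = price_df.groupby(col, as_index=False)['CLOSE'].mean()
    fig = px.bar(avg_close, x=col, y='CLOSE', title=f'Average Closing Price by {col}')
    fig.show()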

We cannot identify any pattern by DAY or MONTH, but there is a definite trend in the average closing price by YEAR: prices increased until 2021 and have been falling since.

Correlation

Let's find out whether there is any correlation between the features (columns).
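
A sketch of a correlation heatmap over the numeric columns (dropping the datetime DATE column first):

corr = price_df.drop(columns=['DATE']).corr()
plt.figure(figsize=(9, 6))
sns.heatmap(corr, annot=True, cmap='coolwarm')
plt.title('Feature Correlation')
plt.show()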

Again, I see some correlation between YEAR, MONTHYEAR, and the closing price. Since this is a time-series dataset, that correlation and trend are expected.

We will use Date & Closing Price as features for the price prediction model.

I think that's enough visualization for this project. I have also performed an in-depth exploratory data analysis of market data; you can check out the following article.

4. Machine Learning Model development

During this phase, we will utilize various machine learning models to forecast stock prices for future dates. For each model, we will carry out the following steps:

  • Determine the appropriate model
  • Split the data into a Training Set and Testing Set
  • Build and train the model
  • Predict outcomes & Visualize the data
  • Evaluate the model’s performance

Afterward, we will choose the most promising model and conduct hyperparameter tuning to attain optimal results.

Linear Regression Model

Let's start with a baseline model. A baseline model is a simple model that serves as a point of reference for comparing the performance of more complex models; the performance of other models is measured against it to determine whether the more complex models actually provide any improvement.

We will use a simple linear regression model as the baseline (note that linear regression is not commonly used in time-series analysis).

What is Linear Regression

Linear regression is a statistical approach used to identify the linear relationship between a dependent variable and one or more independent variables. It is commonly used for prediction and forecasting purposes. Linear regression works by calculating the best-fit line through the data, which can be used to make predictions about the dependent variable based on the values of the independent variables.

Preparing Data for Linear Regression

Here I am using only one feature (column), the closing price. The idea is to predict the price against the day number, so I have created a new column, index, that represents the row number (or day number) of the dataset.

We will use the day number as input and try to predict the output as the Closing price.
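
A sketch of this preparation, assuming the preprocessed price_df from the previous section (sorted by DATE, so the row number doubles as the day number):

data = price_df[['CLOSE']].reset_index()   # 'index' column holds the row (day) number
data.head()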

In a typical machine learning model, X refers to the input features or independent variables, and y refers to the output or dependent variable that we are trying to predict. In summary, X represents the input data and y represents the output data we are trying to predict.

X = np.array(data[['index']])
y = np.array(data['CLOSE'])

Split the data into a Training Set and Testing Set

Generally we segment the data into Training Set & Testing Set. A Training set is a subset of a dataset that is used to train a machine learning model. It is the data on which the model learns patterns and relationships between the input variables and the output variable. A Testing set, on the other hand, is a subset of the data that is held back and used to evaluate the performance of the model on new, unseen data.

A train-test split is usually done randomly, but since this is a time-series dataset, we will use the first 80% of the data as the training set and the remainder as the test set.

train_len = int(len(X)* 0.8)
X_train, X_test = X[:train_len], X[train_len:]
y_train, y_test = y[:train_len], y[train_len:]

Build and train the model

Let's create a linear regression model and train it on the training set.
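
A minimal sketch:

# Fit scikit-learn's LinearRegression on the training split
linear_model = LinearRegression()
linear_model.fit(X_train, y_train)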

Predict outcomes & Visualize the data

Here we will predict the outcomes using the predict method, then visualize the predictions against the actual data.

y_pred = linear_model.predict(X_test)

Let's create a function plot_prediction for plotting the data, since we will plot this for multiple models.

def plot_prediction(actual, prediction, title, y_label='Closing Price', x_label='Trading Days'):

    fig = plt.figure()
    ax = fig.add_subplot(111)

    # Add labels
    plt.ylabel(y_label)
    plt.xlabel(x_label)

    # Plot actual and predicted close values
    plt.plot(actual, '#00FF00', label='Actual Price')
    plt.plot(prediction, '#0000FF', label='Predicted Price')

    # Set title
    ax.set_title(title)
    ax.legend(loc='upper left')

    plt.show()
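
For the linear regression baseline, the call looks like this:

plot_prediction(y_test, y_pred, 'Linear Regression: Predictions vs Actual')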

Evaluate Linear Regression model performance

Model evaluation is the process of assessing the performance of a trained machine learning model using various metrics to determine its accuracy, precision and other measures of prediction quality. The evaluation results help to determine if the model is suitable for its intended use or if it requires further improvement.

The following three metrics are widely used for evaluating time-series forecasting models, and we will use the same:

  • RMSE (Root Mean Square Error)
  • MAE (Mean Absolute Error)
  • MAPE (Mean Absolute Percentage Error)

To study more on evaluation metrics you can refer to this article.

I am creating a function evaluate_model

def evaluate_model(y_true, y_pred):
    print(f"Root Mean squared error: {mean_squared_error(y_true, y_pred, squared=False)}")
    print(f"Mean absolute error: {mean_absolute_error(y_true, y_pred)}")
    print(f"Mean absolute percentage error: {mean_absolute_percentage_error(y_true, y_pred)}")
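
Applying it to the linear regression predictions:

evaluate_model(y_test, y_pred)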

Let’s try Different models

ARIMA (AutoRegressive Integrated Moving Average) model

ARIMA is a time series forecasting model used to make predictions based on past data. It models the time series as a combination of autoregressive (AR) and moving average (MA) components, and incorporates differencing to remove non-stationarity. ARIMA is commonly used for univariate time series analysis, where a single variable is observed over time.

The model is specified by three parameters:

  • p: order of autoregressive component
  • d: order of differencing
  • q: order of moving average component

Split the data into a Training Set and Testing Set

train_len = int(price_df.shape[0] * 0.8)
train_data, test_data = price_df[:train_len], price_df[train_len:]

y_train = train_data['CLOSE'].values
y_test = test_data['CLOSE'].values

Build and train the model
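
A minimal sketch, with an assumed (p, d, q) order of (1, 1, 1); in practice the order can be chosen with pmdarima's auto_arima:

# The order below is illustrative, not the tuned order from the original run
parima_model = ARIMA(order=(1, 1, 1), suppress_warnings=True)
parima_model.fit(y_train)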

Predict outcomes & Visualize the data

y_parima_pred = parima_model.predict(len(y_test))
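
Plotting the forecast against the held-out prices with the helper defined earlier:

plot_prediction(y_test, y_parima_pred, 'ARIMA: Predictions vs Actual')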

Evaluate ARIMA model performance
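
Reusing the evaluate_model helper:

evaluate_model(y_test, y_parima_pred)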

The ARIMA model's evaluation figures look better than linear regression's. Let's move ahead with one more interesting model.

Prophet Model

Prophet is a time series forecasting model developed by Facebook’s Core Data Science team. It is a statistical model that uses an additive approach to capture various components of a time series, including trends, seasonality, holidays, and other effects. Prophet also employs Bayesian modeling techniques to estimate model parameters and uncertainty, which helps to provide more accurate predictions and forecast intervals.

Preparing Data for Prophet Model

The input to the Prophet model is always a DataFrame with two columns, ds and y. The ds (datestamp) column holds the dates, and y represents the measurement we wish to forecast; the y column must be numeric.
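
A sketch of this preparation, assuming the DATE and CLOSE columns used so far:

data = price_df[['DATE', 'CLOSE']].rename(columns={'DATE': 'ds', 'CLOSE': 'y'})
data.head()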

Split the data into a Training Set and Testing Set

train_len = int(data.shape[0] * 0.8)
train_data, test_data = data[:train_len], data[train_len:]

Build and train the model
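
A minimal sketch with Prophet's default settings:

prophet_model = Prophet()
prophet_model.fit(train_data)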

Predict outcomes & Visualize the data

The predict method returns a DataFrame with the forecast dates and predictions. The column yhat is the predicted value; there are also other components like uncertainty intervals, trends, etc.
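
A sketch, forecasting for the test-set dates so the predictions line up with the held-out actuals:

prophed_pred_df = prophet_model.predict(test_data[['ds']])
prophed_pred_df[['ds', 'yhat', 'yhat_lower', 'yhat_upper']].tail()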

The Prophet model also provides a couple of visualization methods, .plot() and .plot_components(), which give richer visualizations and statistical info.
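
For example:

fig1 = prophet_model.plot(prophed_pred_df)
fig2 = prophet_model.plot_components(prophed_pred_df)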

Evaluate Prophet model performance

evaluate_model(test_data[['y']], prophed_pred_df[['yhat']])
Root Mean squared error: 32.31935000092778
Mean absolute error: 23.52914380257662
Mean absolute percentage error: 0.19923299525805302

Based on the evaluation metrics of the three models, it is evident that the Prophet model outperforms the others

Hyperparameter Tuning of Prophet model

Hyperparameter tuning is the process of selecting the optimal values for the parameters of a machine learning model that are not learned during training. The goal of hyperparameter tuning is to improve the performance of a model by finding the best combination of hyperparameters that minimize the error on the validation set or maximize some other performance metric. This is typically done by performing a grid search or a randomized search over a range of hyperparameter values.

I have used this documentation for hyperparameter tuning.

We will use the cross_validation and performance_metrics utilities provided by Prophet for hyperparameter tuning.

cross_validation

Cross-validation is a resampling method that uses different portions of the data to test and train a model on different iterations. It is mainly used to estimate how accurately a predictive model will perform in practice. Prophet provides an inbuilt cross_validation function.

performance_metrics: This utility takes the output of cross_validation and provides multiple evaluation metrics.

Let’s see how it works.

Baseline Prophet model

Let's apply these techniques to a plain-vanilla baseline model.
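
A minimal sketch, assuming the full (ds, y) DataFrame data prepared earlier; the initial, period and horizon windows below are illustrative assumptions, not the exact values of the original run:

baseline_model = Prophet()
baseline_model.fit(data)

# Rolling-origin cross-validation: train on an initial window, forecast a horizon
# ahead, then slide the cutoff forward by `period` and repeat
df_cv = cross_validation(baseline_model, initial='365 days', period='90 days', horizon='90 days')
df_cv.head()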

Next, we get the evaluation metrics using performance_metrics (a sketch follows below). Prophet provides six commonly used performance metrics:

  • Mean Squared Error(MSE)
  • Root Mean Square Error(RMSE)
  • Mean Absolute Error(MAE)
  • Mean Absolute Percentage Error(MAPE)
  • Median Absolute Percentage Error(MDAPE)
  • Symmetric Mean Absolute Percentage Error(SMAPE)

Refer to the following links to learn more about these evaluation metrics.

We will use Mean Absolute Percentage Error (MAPE) to compare the model performance.
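
A sketch of how the two utilities fit together; df_cv is the cross-validation output from the baseline run above:

df_p = performance_metrics(df_cv)
df_p[['horizon', 'mape']].head()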

Manual Hyperparameter Tuning

According to the documentation, the changepoint_range hyperparameter can be tuned manually. changepoint_range is a value between 0 and 1 indicating the proportion of the historical data in which trend changes are allowed.

Read more about changepoint_range here

Let’s increase changepoint_range from 0.8 to 0.99 and calculate the MAPE
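
A sketch of this manual experiment (the cross-validation windows are the same illustrative values as above):

manual_model = Prophet(changepoint_range=0.99)
manual_model.fit(data)

df_cv_manual = cross_validation(manual_model, initial='365 days', period='90 days', horizon='90 days')
performance_metrics(df_cv_manual)[['horizon', 'mape']].head()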

The MAPE values have decreased; that's a good sign. The model performs better with this setting.

Automatic Hyperparameter Tuning

According to the documentation, the following hyperparameters are suitable for automatic tuning; we will use a grid search for the tuning (see the sketch after this list).

  • changepoint_prior_scale
  • seasonality_prior_scale
  • seasonality_mode.

Read more about each parameter here
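
A sketch of the grid search, following the pattern in the Prophet documentation; the candidate values below are illustrative, and changepoint_range is kept at 0.99 from the manual step:

param_grid = {
    'changepoint_prior_scale': [0.001, 0.01, 0.1, 0.5],
    'seasonality_prior_scale': [0.01, 0.1, 1.0, 10.0],
    'seasonality_mode': ['additive', 'multiplicative'],
}

# All combinations of the candidate hyperparameters
all_params = [dict(zip(param_grid.keys(), v)) for v in itertools.product(*param_grid.values())]
mapes = []

for params in all_params:
    m = Prophet(changepoint_range=0.99, **params)
    m.fit(data)
    df_cv = cross_validation(m, initial='365 days', period='90 days', horizon='90 days')
    df_p = performance_metrics(df_cv, rolling_window=1)   # single aggregated row
    mapes.append(df_p['mape'].values[0])

tuning_results = pd.DataFrame(all_params)
tuning_results['mape'] = mapes
print(all_params[np.argmin(mapes)])   # best combination by MAPE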

Best parameters

  • changepoint_prior_scale : 0.5
  • seasonality_prior_scale : 0.01
  • seasonality_mode : ‘additive’
  • changepoint_range: 0.99

Train the final model using the Best (Tuned) Hyperparameters
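
Training the final model with these values, reusing the full (ds, y) DataFrame:

auto_model = Prophet(changepoint_prior_scale=0.5,
                     seasonality_prior_scale=0.01,
                     seasonality_mode='additive',
                     changepoint_range=0.99)
auto_model.fit(data)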

The MAPE value of the tuned model is much better than that of the baseline Prophet model.

In this section we have learned how to perform hyperparameter tuning to get the best performance from the Model.

Prediction using finalized Tuned Model

The Prophet model provides a special method make_future_dataframe to create future dates in a suitable format. We can pass the number of days to forecast forward and the flag to include historical dates.

The output is a DataFrame containing dates; we can pass this DataFrame to the .predict() method to forecast prices for those dates.

Let's predict prices for the next 180 days, including historical dates as well.

future = auto_model.make_future_dataframe(periods=180, include_history=True)
prediction = auto_model.predict(future)

Visualize Actual Prices vs Predicted Prices
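
One way to visualize it (a sketch), using the plot_plotly helper imported earlier:

fig = plot_plotly(auto_model, prediction)
fig.show()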

We will finalize this tuned Prophet model and use the same model in further steps.

What’s Next

We have made a good amount of progress, so I will stop here. In Part 3, the final chapter of this series, we will see how to deploy the ML model and develop the web application.

Here are the links to the whole guide:

Part 1: Data Collection And Data Storage.

Part 2: Data Extraction, Preprocessing, EDA And Machine Learning Model development (This article)

Part 3: Machine Learning Model deployment And Web App development

If you have any questions or feedback, feel free to post a comment or contact me on LinkedIn. Thank you for reading and if you liked this post, please consider following me. Until next time… Happy coding !!

Don’t forget to give your 👏 !
