Building a Fraud Transaction Classifier Using Logistic Regression and Docker: A Comprehensive Guide

Sohail Hosseini
8 min read · Jun 10, 2023

In the digital age, the ability to detect fraudulent transactions is a crucial aspect of any financial system. Machine learning models, with their ability to learn patterns and make predictions, are a powerful tool in this fight against fraud. This article will walk you through the process of building a fraud transaction classifier using logistic regression and deploying it using Docker. We’ll be using a GitHub repository as our guide, which can be found here.

Overview

The repository we’re using as a guide contains a simple yet effective machine learning pipeline. It uses logistic regression to train a model that can classify transactions as either fraudulent or legitimate. The model is trained on a dataset from Kaggle, which can be found here. Once trained, the model is saved in a joblib file for future use. The repository also contains a Flask application that uses the trained model to make predictions based on POST requests. Finally, the entire application is containerized using Docker for easy deployment.

Understanding the Dataset

The dataset used for training the model is a collection of credit card transactions made by European cardholders in September 2013. It contains transactions that occurred over two days, where 492 transactions are fraudulent out of 284,807 transactions. The dataset is highly unbalanced, with the positive class (frauds) accounting for only 0.172% of all transactions.

The dataset contains only numerical input variables, which are the result of a PCA transformation. Due to confidentiality issues, the original features and more background information about the data cannot be provided. Features V1, V2, … V28 are the principal components obtained with PCA; the only features that have not been transformed with PCA are ‘Time’ and ‘Amount’. The ‘Time’ feature contains the seconds elapsed between each transaction and the first transaction in the dataset, and ‘Amount’ is the transaction amount. The ‘Class’ feature is the response variable; it takes the value 1 in case of fraud and 0 otherwise.
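
If you want to verify the class imbalance yourself before training, a quick sanity check (a sketch separate from the repository’s scripts, assuming the Kaggle CSV is saved locally as creditcard.csv) looks like this:

import pandas as pd

# Load the Kaggle credit card fraud dataset
data = pd.read_csv('creditcard.csv')

# Count legitimate (0) vs. fraudulent (1) transactions
print(data['Class'].value_counts())

# Fraction of fraudulent transactions, roughly 0.00172 (0.172%)
print(data['Class'].mean())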

Training the Model

The trained.py script is responsible for training the logistic regression model. Logistic regression is a statistical model that uses a logistic function to model a binary dependent variable. In this case, the dependent variable is the 'Class' feature, which indicates whether a transaction is fraudulent (1) or not (0).
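
To build some intuition before reading the code, here is a minimal, self-contained sketch (not part of the repository) of the logistic function that gives the model its name. It squashes any real-valued linear score into a probability between 0 and 1, which is then thresholded to produce the 0/1 class label:

import numpy as np

def sigmoid(z):
    # Maps any real number into the interval (0, 1)
    return 1 / (1 + np.exp(-z))

# Large positive scores approach 1 (fraud), large negative scores
# approach 0 (legitimate), and a score of 0 maps to exactly 0.5
print(sigmoid(4.0), sigmoid(0.0), sigmoid(-4.0))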

Let’s break down the code in trained.py:

1. Import necessary libraries: The script begins by importing the necessary libraries. These include pandas for data manipulation, sklearn for machine learning, and joblib for saving the model.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.preprocessing import StandardScaler
from joblib import dump

2. Load the dataset: The script then loads the dataset using pandas’ read_csv function.

data = pd.read_csv('creditcard.csv')

3. Split the data into features and target: The ‘Time’ feature is dropped first; the remaining input variables (V1, V2, … V28, Amount) are the features, and ‘Class’ is the target. The data is then split into a training set and a test set. Because frauds are so rare, passing stratify=y to train_test_split would be a sensible refinement to keep the class ratio consistent across the two splits.

# Drop the 'Time' column from the DataFrame loaded above as 'data'
df = data.drop("Time", axis=1)

# Separate the features from the target
X = df.iloc[:, :-1] # all features
y = df['Class'] # target variable

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

4. Scale the features and train the logistic regression model: The features are standardized with a StandardScaler, which is fitted on the training data only and then applied to the test data to avoid leakage. The logistic regression model is then instantiated and fitted to the scaled training data.

# Initialize a scaler, then apply it to the features
scaler = StandardScaler() # initialize
X_train = scaler.fit_transform(X_train) # Fit to the training data and transform it
X_test = scaler.transform(X_test) # transform the testing data

clf = LogisticRegression(random_state=42)

# Train the model
clf.fit(X_train, y_train)

5. Evaluate the model: The model’s performance is evaluated on the test set, and the accuracy score and confusion matrix are printed to the console. (Because the classes are so imbalanced, accuracy alone can be misleading; a richer evaluation is sketched after this walkthrough.)

y_pred = clf.predict(X_test)
print('Accuracy:', accuracy_score(y_test, y_pred))
print('Confusion Matrix:', confusion_matrix(y_test, y_pred))

6. Save the model: Finally, the trained model is saved in a joblib file for future use. The fitted scaler is saved alongside it, so that exactly the same scaling can be applied to new data at prediction time.

# Save the model and the scaler fitted on the training data
dump(clf, 'trained_model.joblib')
dump(scaler, 'scaler.joblib')
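
Accuracy alone is a weak yardstick here: with only 0.172% of transactions fraudulent, a model that never predicts fraud would still exceed 99.8% accuracy. As an optional addition to trained.py (not part of the repository as described), a per-class report makes the fraud-class performance visible:

from sklearn.metrics import classification_report

# Precision and recall for the fraud class (1) reveal far more than
# overall accuracy on a dataset this imbalanced; y_test and y_pred
# are the variables defined earlier in trained.py
print(classification_report(y_test, y_pred, digits=4))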

Making Predictions with app.py and make_prediction.py

The app.py script is a Flask application that uses the trained model to make predictions. The application defines a single route, /predict, which accepts POST requests. The POST request should contain a JSON object whose keys are the input features (V1 through V28 and Amount) and whose values are the corresponding numbers. The application loads the trained model, makes a prediction for the input data, and returns the prediction as a JSON object.

Here’s a breakdown of the code in app.py:

1. Import necessary libraries: The script begins by importing the necessary libraries. These include Flask (with request and jsonify) for the web application, joblib for loading the model, and numpy for shaping the input data.
from flask import Flask, request, jsonify
from joblib import load
import numpy as np

2. Load the trained model: The trained model is loaded from its joblib file, along with the scaler that was fitted during training.

# Load the trained model and the fitted scaler
model = load('trained_model.joblib')
scaler = load('scaler.joblib')

3. Create the Flask application and define the /predict route: The Flask application is created, and the /predict route is defined. The route accepts POST requests, scales the incoming values with the scaler saved during training (fitting a fresh StandardScaler on a single sample would zero it out), and uses the trained model to make a prediction.

app = Flask(__name__)

# Prediction route
@app.route('/predict', methods=['POST'])
def predict():
    # Get the data from the request
    data = request.json

    # Convert the input data values to a list
    data = list(data.values())

    # Convert the data to a numpy array with a single row
    input_data = np.array(data).reshape(1, -1)

    # Scale the input with the scaler fitted during training;
    # fitting a new scaler on a single sample would zero it out
    input_data = scaler.transform(input_data)

    # Make predictions using the loaded model
    predictions = model.predict(input_data)

    # Return the predictions as a JSON response
    return jsonify({'predictions': predictions.tolist()})

4. Run the Flask application: Finally, the Flask application is run if the script is executed directly.

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000)

The make_prediction.py script is used to send a POST request to the Flask application. The script creates a JSON object with some sample input data, sends a POST request to the /predict route, and prints the response.

Here’s a breakdown of the code in make_prediction.py:

1. Import necessary libraries: The script begins by importing the necessary libraries. These include requests for sending the POST request and json for creating the JSON object.
import requests
import json

2. Create the JSON object and send the POST request: The script creates a JSON object with some sample input data and sends a POST request to the /predict route.
# Define the input data
input_data = {
    "V1": -1.3598071336738,
    "V2": -0.0727811733098497,
    "V3": 2.53634673796914,
    "V4": 1.37815522427443,
    "V5": -0.338320769942518,
    "V6": 0.462387777762292,
    "V7": 0.239598554061257,
    "V8": 0.0986979012610507,
    "V9": 0.363786969611213,
    "V10": 0.0907941719789316,
    "V11": -0.551599533260813,
    "V12": -0.617800855762348,
    "V13": -0.991389847235408,
    "V14": -0.311169353699879,
    "V15": 1.46817697209427,
    "V16": -0.470400525259478,
    "V17": 0.207971241929242,
    "V18": 0.0257905801985591,
    "V19": 0.403992960255733,
    "V20": 0.251412098239705,
    "V21": -0.018306777944153,
    "V22": 0.277837575558899,
    "V23": -0.110473910188767,
    "V24": 0.0669280749146731,
    "V25": 0.128539358273528,
    "V26": -0.189114843888824,
    "V27": 0.133558376740387,
    "V28": -0.0210530534538215,
    "Amount": 149.62
}

# Convert the input values to JSON
data = json.dumps(input_data)

# Set the request headers
headers = {'Content-Type': 'application/json'}

# Send the POST request to the prediction route
response = requests.post('http://127.0.0.1:5000/predict', data=data, headers=headers)

3. Print the response: The script prints the prediction returned by the model, falling back to the raw response content if the JSON cannot be parsed.

try:
    # Get the predictions from the response
    predictions = response.json()['predictions']
    # Print the predictions
    print("Predictions:", predictions)
except json.decoder.JSONDecodeError:
    print("Failed to parse JSON response.")
    print("Response content:", response.content)

Dockerizing the Application

The application is containerized using Docker, which allows it to be easily deployed on any system that has Docker installed. Docker is a platform that allows developers to package applications into containers — standardized executable components that combine application source code with the operating system (OS) libraries and dependencies required to run that code in any environment.

The Dockerfile in the repository defines how to build a Docker image for the application. The Dockerfile specifies a base image, which is an image that the Docker image is built on top of. In this case, the base image is a Python image, which comes with Python and pip pre-installed.

Here’s a breakdown of the code in the Dockerfile:

1. Specify the base image: The base image is specified using the FROM instruction. In this case, the base image is a slim Python image, which comes with Python and pip pre-installed.
FROM python:3.8-slim-buster

2. Set the working directory: The working directory in the Docker image is set using the WORKDIR instruction. This is the directory that any subsequent instructions will be run in.

WORKDIR /app

3. Copy the application code into the Docker image: The application code is copied into the Docker image using the COPY instruction.

COPY . /app

4. Install the necessary Python packages: The Python dependencies are installed using pip. The packages are listed in the requirements.txt file (a plausible example of which is sketched after this list).

RUN pip install --no-cache-dir -r requirements.txt

5. Specify the command to run when the Docker container starts: This is done with the CMD instruction; in this case, the command launches the Flask application.
CMD ["python", "app.py"]
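
The contents of requirements.txt are not shown in this article; a plausible minimal version for this application might look like the following (the exact package list is an assumption, not the repository’s actual file):

flask
scikit-learn
pandas
numpy
joblib
requests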
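
Before the container can be run, the image has to be built. Assuming the Dockerfile sits in the repository root, the build command looks like this (the image name fraudtransaction matches the docker run command below):

docker build -t fraudtransaction .

The -t flag tags the image with a name so it can be referenced in later commands.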

Running the Docker Container

After building the Docker image, you can run the Docker container using the docker run command. This command creates a new container and runs it. The -p option is used to map a network port on the host to a port in the container. In this case, we're mapping port 5000 on the host to port 5000 in the container, which is where our Flask application is running.

Here’s the command to run the Docker container:

docker run -p 5000:5000 fraudtransaction

Replace fraudtransaction with the name of your Docker image. Once the Docker container is running, you can access the Flask application at http://localhost:5000.

Testing the Application

To test the application, you can use the make_prediction.py script to send a POST request to the /predict route. The script should print the prediction made by the model.


If everything is set up correctly, you should see a prediction printed on the console. This prediction is the result of the logistic regression model classifying the sample input data as either fraudulent or legitimate.

Conclusion

In this guide, we’ve walked you through the process of building a fraud transaction classifier using logistic regression and deploying it using Docker. We’ve covered everything from understanding the dataset and training the model, to making predictions, dockerizing the application, and testing it.

This process demonstrates the power of machine learning in detecting fraudulent transactions, and the convenience of Docker in deploying machine learning applications. With this knowledge, you should be well-equipped to build and deploy your own machine learning models for a variety of tasks.

Remember, while this guide uses logistic regression for the classification task, the same process can be applied to other machine learning algorithms. Feel free to experiment with different models and see how they perform on the task of fraud detection.
