Reduce False Positives with Ensemble Learning

Penny Li
12 min read · Apr 17, 2023

Research Paper Code Reconstruct: False Positives in Credit Card Fraud Detection: Measurement and Mitigation

In this article, I would like to reconstruct the algorithm used in the 2022 paper "False Positives in Credit Card Fraud Detection: Measurement and Mitigation," published in the Proceedings of the 55th Hawaii International Conference on System Sciences.

In this research paper, ensemble learning is presented as a solution to the challenge of false positives in fraud detection. Ensemble learning is the practice of combining several different models so that they can complement each other's shortcomings. The following is an overview of the false positive count for each commonly used machine learning model when run separately, and that of the ensemble model.

For a brief overview of these models, please see below:

The above models can be further elevated into AI models; the article below gives a high-level overview of the potential risk implications as well as the rewards:

1. Data

The researcher used the same credit card fraud transaction dataset from Kaggle that I used in my previous code reconstruction article. In the paper, the dataset is described as containing 590,540 labeled credit card transactions extracted from a production fraud detection system.

2. Feature Engineering

The researcher followed a three-step process to get the features in the dataset ready for model testing: a baseline feature set containing only original features, an augmented feature set containing engineered features, and a reduced feature set that removes correlated and unimportant features.

Because the credit card numbers are removed from the dataset for privacy, we need to engineer a cardholder identifier by using adversarial validation.

Below is a detailed post on adversarial validation for feature engineering:

In my experiment, logistic regression was used in the adversarial validation to derive an importance score for each feature; after picking out the ten features with the highest importance scores, a "card identifier" could be generated by adding up the four most important of them.
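Below is a minimal sketch of that step, assuming the raw train and test frames are available (train_df, test_df, and the features list are hypothetical names, not from the original notebook): rows are labeled by which set they came from, a classifier learns to tell the sets apart, and the absolute coefficients serve as importance scores.

import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Label training rows 0 and test rows 1, then stack them
adv_X = pd.concat([train_df[features], test_df[features]])
adv_y = np.concatenate([np.zeros(len(train_df)), np.ones(len(test_df))])

# Fit logistic regression to distinguish train from test
adv_model = LogisticRegression(max_iter=1000)
adv_model.fit(adv_X.fillna(0), adv_y)

# Rank features by the magnitude of their coefficients
importance = pd.Series(np.abs(adv_model.coef_[0]), index=features).sort_values(ascending=False)
print(importance.head(10))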

# Combine fields to create a unique identifier for each cardholder
# (fill missing values first, since astype(int) fails on NaNs)
data['card_identifier'] = data['TransactionID'].fillna(0).astype(int) + \
    data['card2'].fillna(0).astype(int) + \
    data['addr1'].fillna(0).astype(int) + \
    data['card5'].fillna(0).astype(int)
data.head()

Once the card identifier was generated, time-based features were added to the dataset: the average timespan between transactions in a given timeframe, the number of transactions in a given timeframe, and the average transaction amount in the past t hours for each cardholder identifier.

# Create a new column to represent the transaction timestamp in hours
data['TransactionHour'] = data['TransactionDT'] / 3600

# Compute the timespan since the previous transaction for each cardholder
data['TimeSinceLastTransaction'] = data.groupby('card_identifier')['TransactionHour'].diff().fillna(0)

# Compute the number of transactions in a given timeframe for each cardholder
# (note: an integer window counts the last t rows per cardholder, which only
# approximates a window of t hours)
t = 24  # Replace with the desired timeframe in hours
data['TransactionCount'] = data.groupby('card_identifier')['TransactionHour'] \
    .rolling(window=t, min_periods=1).count() \
    .reset_index(level=0, drop=True)  # drop only the group level so rows align on the original index

# Compute the average transaction amount over the window for each cardholder
data['TransactionAmtMean'] = data.groupby('card_identifier')['TransactionAmt'] \
    .rolling(window=t, min_periods=1).mean() \
    .reset_index(level=0, drop=True)

# Keep the card identifier as an ordered categorical for the models below
data['card_identifier'] = data['card_identifier'].astype('category').cat.as_ordered()

Finally, the aggregation values above are calculated for each of 8 timeframes {1h, 3h, 6h, 12h, 18h, 24h, 72h, 168h}.

# Create a new column to represent the transaction timestamp in hours
data['TransactionHour'] = data['TransactionDT'] / 3600

# Define a list of timeframes in hours
timeframes = [1, 3, 6, 12, 18, 24, 72, 168]

# Loop over the timeframes and compute the aggregation values
for t in timeframes:
    # Compute the timespan since the previous transaction for each cardholder
    # (note: diff() does not depend on t, so this column is the same for every timeframe)
    data[f'TimeSinceLastTransaction_{t}'] = data.groupby('card_identifier')['TransactionHour'].diff().fillna(0)

    # Compute the number of transactions in the window for each cardholder
    data[f'TransactionCount_{t}'] = data.groupby('card_identifier')['TransactionHour'] \
        .rolling(window=t, min_periods=1).count() \
        .reset_index(level=0, drop=True)

    # Compute the average transaction amount over the window for each cardholder
    data[f'TransactionAmtMean_{t}'] = data.groupby('card_identifier')['TransactionAmt'] \
        .rolling(window=t, min_periods=1).mean() \
        .reset_index(level=0, drop=True)
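
The integer rolling windows above count the last t rows rather than the last t hours. As a hedged alternative (not from the paper), pandas supports genuinely time-based windows once a datetime column exists; TransactionDT is a second offset from an arbitrary reference point, so any reference date works for windowing purposes.

import pandas as pd

# Build a datetime column from the second offsets (the reference date is an
# arbitrary assumption; only relative times matter for windowing)
data['TransactionTime'] = pd.Timestamp('2017-12-01') + pd.to_timedelta(data['TransactionDT'], unit='s')

# Sort so each cardholder's rows are contiguous and chronological, index by
# time, and roll over a true 24-hour window
data = data.sort_values(['card_identifier', 'TransactionTime']).set_index('TransactionTime')
data['TransactionAmtMean_24h'] = (
    data.groupby('card_identifier')['TransactionAmt']
        .rolling('24h')
        .mean()
        .values  # row order matches the sorted frame, so positional assignment is safe
)
data = data.reset_index()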

At this point, there are 118 features in the dataset, which is already a handful to process in the Google Colaboratory environment.

3. Feature Selection

The next step is to remove unimportant features. The paper mentions two approaches: correlation analysis and backward feature selection. We're going to focus on correlation analysis in this article and will cover backward feature selection in a subsequent post.

Below is a detailed post on correlation analysis in Python.

The paper specified a correlation coefficient greater than 0.9 as the threshold for treating two features as correlated. I changed this to 0.65 so that the code could run on Google Colab, with more features dropped in the process.

correlation_threshold = 0.65

I was able to get the number of features down to 41. The code for this feature removal process is here, and a minimal sketch of the idea is below.
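The sketch assumes a list of numeric feature names (numeric_features is a hypothetical name): compute the absolute correlation matrix, keep only its upper triangle so each pair is considered once, and drop one feature from every pair above the threshold.

import numpy as np

correlation_threshold = 0.65

# Absolute pairwise correlations between numeric features
corr = data[numeric_features].corr().abs()

# Keep only the upper triangle so each pair is counted once
upper = corr.where(np.triu(np.ones(corr.shape), k=1).astype(bool))

# Drop one feature from each highly correlated pair
to_drop = [col for col in upper.columns if (upper[col] > correlation_threshold).any()]
data = data.drop(columns=to_drop)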

4. Model Testing

Random Forest

The paper experimented with up to 1,000 trees; I had to limit the forest to 50 trees for the code to run through. The paper performed a grid search over different subsampling ratios and found the best results when training each tree on 20% of the features and 80% of the transaction rows. Below is a detailed post on how grid search works:

Using the following code, I found the best results when all the features in the dataset were considered at each node split and 75% of the transactions were used for each tree.

The full code for random forest is here.

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Base estimator and an example search space over the subsampling ratios
# (the exact grid used in the notebook may differ)
rf = RandomForestClassifier(n_estimators=50, random_state=42)
param_distributions = {
    'max_features': [0.2, 0.5, 0.75, 1.0],
    'max_samples': [0.5, 0.75, 0.8],
}

# Create a RandomizedSearchCV object (a sampled search rather than an exhaustive grid)
random_search = RandomizedSearchCV(estimator=rf, param_distributions=param_distributions, n_iter=5, cv=3, n_jobs=4)

# Fit the RandomizedSearchCV object to the data
random_search.fit(X_train, y_train)

# Get the best hyperparameters
best_params = random_search.best_params_

# Extract the best values for max_features and max_samples
best_max_features = best_params['max_features']
best_max_samples = best_params['max_samples']

# Create a new random forest classifier with the best hyperparameters
best_rf = RandomForestClassifier(n_estimators=50, random_state=42, **best_params)

# Fit the best random forest classifier to the training data
best_rf.fit(X_train, y_train)

Below is an example tree used in the Random Forest model:

eXtreme Gradient Boosting

eXtreme Gradient Boosting (XGBoost) is the second model used in the paper. It is similar to random forest in that it combines multiple decision trees to create a strong learner, but it builds the trees sequentially, with each tree trained to correct the errors made by the previous trees. The model is also better suited to larger datasets than random forest. Below is a detailed post on the XGBoost model:

The paper identified a learning rate of 0.02, a row subsampling rate of 0.8, and an ensemble size of 2,000 estimators. Each categorical feature is mapped to an integer value, since the XGBoost setup used here does not consume raw categorical features. The code looks like the below, with the full code here.

import numpy as np
import xgboost as xgb
from sklearn.model_selection import train_test_split

# Define hyperparameters
learning_rate = 0.02
subsample = 0.8
n_estimators = 2000

# Split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Convert the data to DMatrix format
dtrain = xgb.DMatrix(X_train, label=y_train, enable_categorical=True)
dtest = xgb.DMatrix(X_test, label=y_test, enable_categorical=True)

# Set the parameters for the XGBoost model
# (the hyperparameters above must be passed in here, or they are silently ignored)
params = {
    'objective': 'binary:logistic',
    'learning_rate': learning_rate,
    'subsample': subsample,
    'seed': 42
}

# Train the XGBoost model for n_estimators boosting rounds
xgb_model = xgb.train(params, dtrain, num_boost_round=n_estimators)

# Make predictions with the trained model (probabilities of the positive class)
y_pred = xgb_model.predict(dtest)

# Convert predicted probabilities to binary predictions
y_pred_binary1 = np.round(y_pred)

# Evaluate the model
accuracy = np.sum(y_pred_binary1 == y_test) / len(y_test)
print("Accuracy: {:.2f}%".format(accuracy * 100))

Below is the visual of the first tree in the XGBoost model:

CatBoost

The CatBoost model is the next model used in the paper. CatBoost grew out of a challenge with traditional gradient boosting algorithms, which typically require categorical features to be converted into numerical representations using techniques like one-hot encoding or label encoding. The label encoding used for the XGBoost model above is below:

from sklearn.preprocessing import LabelEncoder

# Convert categorical features to numerical using LabelEncoder
cat_cols = []
for f in merged_df.columns:
    if merged_df[f].dtype == 'object' or merged_df[f].dtype == 'bool':
        cat_cols.append(f)
        lbl = LabelEncoder()
        lbl.fit(list(merged_df[f].values))
        merged_df[f] = lbl.transform(list(merged_df[f].values))

To address this challenge, CatBoost introduced ordered target statistics, closely tied to its "ordered boosting" scheme, which handle categorical features natively without explicit encoding. A detailed explanation of CatBoost can be found below:
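As a tiny illustration (purely hypothetical toy data), CatBoost can consume raw string categories directly; you only tell it which columns are categorical:

import pandas as pd
from catboost import CatBoostClassifier

toy = pd.DataFrame({'color': ['red', 'blue', 'red', 'green'],
                    'amount': [1.0, 2.5, 0.7, 3.1]})
toy_y = [0, 1, 0, 1]

# No label encoding or one-hot encoding needed for 'color'
CatBoostClassifier(iterations=10, verbose=0).fit(toy, toy_y, cat_features=['color'])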

In the paper, the ensemble size is limited to 1000 because of RAM constraints. The code for this model looks like the below, with the full code here.

from catboost import CatBoostClassifier

# Create the CatBoost classifier with an ensemble size of 1000
model = CatBoostClassifier(iterations=1000, random_seed=42, verbose=100)

# Specify the categorical columns in the DataFrame
cat_cols = ['ProductCD', 'card4', 'card6', 'P_emaildomain', 'M1', 'M2', 'M3', 'M4', 'M6', 'card_identifier']

# Fit the model with the categorical columns specified
model.fit(X_train, y_train, cat_features=cat_cols)

# Make predictions with the trained model (predict() already returns class labels)
y_pred = model.predict(X_test)

# Cast the predictions to a binary integer array for the ensemble step
y_pred_binary2 = np.round(y_pred).astype(int)

# Evaluate the model
accuracy = np.sum(y_pred_binary2 == y_test) / len(y_test)
print("Accuracy: {:.2f}%".format(accuracy * 100))

Neural Networks

Neural networks are also used in the paper. A detailed dive into neural networks in general can be found in the post below, and a code reconstruction of a neural network paper was presented in my previous Medium article.

In this paper, a neural network with three hidden layers of 256 nodes each, a learning rate of 0.001, and a weight decay coefficient of 0.0005 was used. I had to use a learning rate of 0.1 to get a reasonable runtime.

from sklearn.neural_network import MLPClassifier

mlp = MLPClassifier(hidden_layer_sizes=(256, 256, 256), activation='relu', solver='adam', random_state=42, alpha=0.0005, learning_rate_init=0.1)

Before training, the paper standardized all numerical features and log-transformed those with a large value range. The code for this would look like the below:

import numpy as np
from sklearn.preprocessing import FunctionTransformer, RobustScaler

# Custom transformer that log-transforms the numeric columns and then applies
# robust scaling
class RobustScalerTransformer(FunctionTransformer):
    def __init__(self, colnames=None):
        super().__init__()
        self.colnames = colnames
        self.scaler = RobustScaler()

    def _clean_and_log(self, X):
        X = X.copy()

        # Replace infinite values with a large finite value
        X = X.replace([np.inf, -np.inf], np.nan)
        X.fillna(1e9, inplace=True)

        X[self.colnames] = np.log(X[self.colnames] + 1)  # Apply log transformation
        X.replace([np.inf, -np.inf], 1e9, inplace=True)
        return X

    def fit(self, X, y=None):
        # Fit the scaler on the cleaned, log-transformed columns so that fit
        # and transform see the same representation
        self.scaler.fit(self._clean_and_log(X)[self.colnames])
        return self

    def transform(self, X, y=None):
        X_transformed = self._clean_and_log(X)
        # Apply the fitted robust scaling (the original snippet fit the scaler
        # but never applied it)
        X_transformed[self.colnames] = self.scaler.transform(X_transformed[self.colnames])
        return X_transformed

For categorical features, the paper used one-hot encoding after reducing the cardinality of high-cardinality features by replacing infrequent categories with a uniform label. The code would look like the below:

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Define the threshold count for high cardinality features
threshold = 500

# Create an empty list to store high cardinality features
high_cardinality_features = []

# Loop through each column in the DataFrame
for column in merged_df.columns:
    # Check if the column is categorical
    if merged_df[column].dtype == 'object':
        # Check if the number of distinct values exceeds the threshold
        if merged_df[column].nunique() > threshold:
            # If so, add the column name to the list of high cardinality features
            high_cardinality_features.append(column)

# Subset the DataFrame to only include non-high cardinality features
non_high_cardinality_features = list(set(X.columns) - set(high_cardinality_features))
X = X[non_high_cardinality_features]

# Select only the desired categorical columns
categorical_cols = ['ProductCD', 'card4', 'card6', 'P_emaildomain', 'M1', 'M2', 'M3', 'M4', 'M6']

# Create a column transformer for numerical and categorical features
# (numerical_cols is the list of numeric feature names defined earlier;
# handle_unknown='ignore' makes the encoder emit all-zero columns for
# categories it did not see during fitting)
preprocessor = ColumnTransformer(
    transformers=[
        ('robust_scaler', RobustScalerTransformer(colnames=numerical_cols), numerical_cols),
        ('encoder', OneHotEncoder(handle_unknown='ignore'), categorical_cols)
    ])

In the OneHotEncoder code chunk above, the uniform label comes from the encoder itself: with handle_unknown='ignore', it treats missing or unknown values as outside its fitted categories and assigns a binary value of 0 across all of their indicator columns. This acts as a uniform label for the "infrequent categories" that were replaced during the cardinality reduction further above.
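For completeness, here is a hedged sketch of the relabeling the paper describes (my reconstruction, not the notebook's exact code): categories that appear fewer than threshold times are replaced with a single uniform label before encoding.

# Replace infrequent categories with one uniform label per column
for column in categorical_cols:
    counts = merged_df[column].value_counts()
    rare = counts[counts < threshold].index
    merged_df[column] = merged_df[column].mask(merged_df[column].isin(rare), 'RARE')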

One important point to note for neural networks is the issue of oversampling, discussed in my previous Medium article: merely following the methods described in the paper without oversampling results in no fraud being detected at all, as shown below.
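A hedged sketch of the fix, using imbalanced-learn's RandomOverSampler (my choice of oversampler; the earlier article's method may differ) to balance the classes before fitting the MLP:

from imblearn.over_sampling import RandomOverSampler

# Duplicate minority-class (fraud) rows until the classes are balanced
ros = RandomOverSampler(random_state=42)
X_train_res, y_train_res = ros.fit_resample(X_train, y_train)

# Train the neural network on the balanced training set
mlp.fit(X_train_res, y_train_res)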

The full code for Neural Networks is here.

5. Ensembling

The paper used the Jaccard score to measure the prediction similarity of the models above. Given two models' predictions, it is defined as:

J = N₁,₁ / (N₀,₁ + N₁,₀ + N₁,₁)

where Nᵢ,ⱼ is the number of transactions for which model 1 predicted class i and model 2 predicted class j. For i = j we say the models agree, and disagree otherwise. A Jaccard score of 1 means that both models made identical predictions on all transactions, and a Jaccard score of 0 means that the models never agreed on a fraud prediction.

So, theoretically, adding a model with a low prediction-similarity score to the models already in the ensemble should improve performance more than adding a model with high similarity. Below is a summary of the Jaccard scores for all the models above, as well as for the ensemble model:

from sklearn.metrics import jaccard_score

# Jaccard score of the ensemble's test-set predictions (jaccard1 through
# jaccard4 are computed the same way from the individual models' predictions)
jaccard_ensemble = jaccard_score(y_test, ensemble_preds)

# Print the Jaccard Scores
print("Jaccard Score for Random Forest:", jaccard1)
print("Jaccard Score for eXtreme Gradient Boosting:", jaccard2)
print("Jaccard Score for CatBoost:", jaccard3)
print("Jaccard Score for Neural Network:", jaccard4)
print("Jaccard Score for ensemble_preds:", jaccard_ensemble)

The paper used majority voting to combine predictions: for a given transaction, the ensemble predicts the class (0 for no fraud, 1 for fraud) that the majority of the models predicted. With four models and the > 2 threshold below, at least three models must vote fraud, so ties resolve to no fraud.

ensemble_preds = np.where(np.sum([y_pred1, y_pred_binary1, y_pred_binary2, y_pred2a], axis=0) > 2, 1, 0)

Below is the result of the ensemble prediction:

Ensemble

This is based on a Receiver Operating Characteristic (ROC) curve, as shown below, where the 45-degree line shows the result of random guessing and the solid line shows the predictive performance of the binary classifier.
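A minimal sketch of how such a curve can be produced, assuming probability scores from one of the fitted models (here best_rf from the random forest step):

import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc

# Probability of the fraud class for each test transaction
scores = best_rf.predict_proba(X_test)[:, 1]

fpr, tpr, _ = roc_curve(y_test, scores)
plt.plot(fpr, tpr, label='Random Forest (AUC = {:.3f})'.format(auc(fpr, tpr)))
plt.plot([0, 1], [0, 1], linestyle='--', label='Random guessing')
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.legend()
plt.show()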

And below is an exhibit of confusion matrices of all models discussed above that were part of the ensemble:

Random Forest

eXtreme Gradient Boosting

CatBoost

Neural Network
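
Each of these matrices can be reproduced with scikit-learn's confusion_matrix on the corresponding predictions, for example for the ensemble:

from sklearn.metrics import confusion_matrix

# Rows are actual classes, columns are predicted classes
print(confusion_matrix(y_test, ensemble_preds))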

The full experiment and reconstructed research paper code can be found in the Google Colaboratory page below:

There is still so much more to discover in this research paper, so stay tuned for upcoming posts on other topics raised that I haven’t covered yet.

CodeChat

I also hold code talks on Google Meet on the last Friday of every month at 5:00 p.m. EST. The topic for the next chat is Digesting Decision Trees in Python, along with any questions you may have related to coding. You may sign up here for a meeting reminder, or use the meeting link here if you would like to join directly.

Feedback

If you have thoughts to share with me, please let me know on the Data Analysts’ Message Board!
