Reduce False Positives with Ensemble Learning

Penny Li
12 min read · Apr 17, 2023

Research Paper Code Reconstruct: False Positives in Credit Card Fraud Detection: Measurement and Mitigation

In this article, I would like to reconstruct the algorithm used in the 2022 paper "False Positives in Credit Card Fraud Detection: Measurement and Mitigation," published in the Proceedings of the 55th Hawaii International Conference on System Sciences.

In this research paper, ensemble learning is presented as a solution to the challenge of false positives in fraud detection. Ensemble learning is the practice of combining several different models so that they can complement each other's shortcomings. The following is an overview of the false positive count for each commonly used machine learning model when run separately, and that of the ensemble model.

For a brief overview of these models, please see below:

The above models can be further elevated into AI models; the article below gives a high-level overview of the potential risk implications as well as the rewards:

1. Data

The researcher used the same credit card fraud transaction dataset from Kaggle that I used in my previous code reconstruction article. In the paper, the dataset is described as containing 590,540 labeled credit card transactions extracted from a production fraud detection system.

2. Feature Engineering

The researcher followed a three-step process to get the features in the dataset ready for model testing: a baseline feature set containing only original features, an augmented feature set containing engineered features, and a reduced feature set that removes correlated and unimportant features.

Because the credit card numbers are removed from the dataset for privacy, we need to engineer a cardholder identifier by using adversarial validation.

Below is a detailed post on adversarial validation for feature engineering:

In my experiment, logistic regression was used in the adversarial validation to derive an importance score for each feature; after picking out the ten features with the highest importance scores, a "card identifier" could be generated by adding up the four most important of them.
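Below is a minimal sketch of that step, assuming the raw train and test frames are available (train_df, test_df, and the features list are hypothetical names, not from the original notebook): rows are labeled by which set they came from, a classifier learns to tell the sets apart, and the absolute coefficients serve as importance scores.

import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Label training rows 0 and test rows 1, then stack them
adv_X = pd.concat([train_df[features], test_df[features]])
adv_y = np.concatenate([np.zeros(len(train_df)), np.ones(len(test_df))])

# Fit logistic regression to distinguish train from test
adv_model = LogisticRegression(max_iter=1000)
adv_model.fit(adv_X.fillna(0), adv_y)

# Rank features by the magnitude of their coefficients
importance = pd.Series(np.abs(adv_model.coef_[0]), index=features).sort_values(ascending=False)
print(importance.head(10))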

# Combine fields to create a unique identifier for each cardholder
# (fill missing values first, since astype(int) fails on NaNs)
data['card_identifier'] = data['TransactionID'].fillna(0).astype(int) + \
    data['card2'].fillna(0).astype(int) + \
    data['addr1'].fillna(0).astype(int) + \
    data['card5'].fillna(0).astype(int)
data.head()

Once the card identifier was generated, time-based features were added to the dataset: the average timespan between transactions in a given timeframe, the number of transactions in a given timeframe, and the average transaction amount in the past t hours for each cardholder identifier.

# Create a new column to represent the transaction timestamp in hours
data['TransactionHour'] = data['TransactionDT'] / 3600

# Compute the timespan since the previous transaction for each cardholder
data['TimeSinceLastTransaction'] = data.groupby('card_identifier')['TransactionHour'].diff().fillna(0)

# Compute the number of transactions in a given timeframe for each cardholder
# (note: an integer window counts the last t rows per cardholder, which only
# approximates a window of t hours)
t = 24  # Replace with the desired timeframe in hours
data['TransactionCount'] = data.groupby('card_identifier')['TransactionHour'] \
    .rolling(window=t, min_periods=1).count() \
    .reset_index(level=0, drop=True)  # drop only the group level so rows align on the original index

# Compute the average transaction amount over the window for each cardholder
data['TransactionAmtMean'] = data.groupby('card_identifier')['TransactionAmt'] \
    .rolling(window=t, min_periods=1).mean() \
    .reset_index(level=0, drop=True)

# Keep the card identifier as an ordered categorical for the models below
data['card_identifier'] = data['card_identifier'].astype('category').cat.as_ordered()

Finally, the aggregation values above are calculated for each of 8 timeframes {1h, 3h, 6h, 12h, 18h, 24h, 72h, 168h}.

# Create a new column to represent the transaction timestamp in hours
data['TransactionHour'] = data['TransactionDT'] / 3600

# Define a list of timeframes in hours
timeframes = [1, 3, 6, 12, 18, 24, 72, 168]

# Loop over the timeframes and compute the aggregation values
for t in timeframes:
    # Compute the timespan since the previous transaction for each cardholder
    # (note: diff() does not depend on t, so this column is the same for every timeframe)
    data[f'TimeSinceLastTransaction_{t}'] = data.groupby('card_identifier')['TransactionHour'].diff().fillna(0)

    # Compute the number of transactions in the window for each cardholder
    data[f'TransactionCount_{t}'] = data.groupby('card_identifier')['TransactionHour'] \
        .rolling(window=t, min_periods=1).count() \
        .reset_index(level=0, drop=True)

    # Compute the average transaction amount over the window for each cardholder
    data[f'TransactionAmtMean_{t}'] = data.groupby('card_identifier')['TransactionAmt'] \
        .rolling(window=t, min_periods=1).mean() \
        .reset_index(level=0, drop=True)
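
The integer rolling windows above count the last t rows rather than the last t hours. As a hedged alternative (not from the paper), pandas supports genuinely time-based windows once a datetime column exists; TransactionDT is a second offset from an arbitrary reference point, so any reference date works for windowing purposes.

import pandas as pd

# Build a datetime column from the second offsets (the reference date is an
# arbitrary assumption; only relative times matter for windowing)
data['TransactionTime'] = pd.Timestamp('2017-12-01') + pd.to_timedelta(data['TransactionDT'], unit='s')

# Sort so each cardholder's rows are contiguous and chronological, index by
# time, and roll over a true 24-hour window
data = data.sort_values(['card_identifier', 'TransactionTime']).set_index('TransactionTime')
data['TransactionAmtMean_24h'] = (
    data.groupby('card_identifier')['TransactionAmt']
        .rolling('24h')
        .mean()
        .values  # row order matches the sorted frame, so positional assignment is safe
)
data = data.reset_index()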

At this point, there are 118 features in the dataset, which is already a handful to process in the Google Colaboratory environment.

3. Feature Selection

The next step is to remove unimportant features. The paper mentions two approaches: correlation analysis and backward feature selection. We're going to focus on correlation analysis in this article and will cover backward feature selection in a subsequent post.

Below is a detailed post on correlation analysis in Python.

The paper specified a correlation coefficient greater than 0.9 as the threshold for treating two features as correlated. I changed this to 0.65 so that the code could run on Google Colab, with more features dropped in the process.

correlation_threshold = 0.65

I was able to get the number of features down to 41. The code for this feature removal process is here, and a minimal sketch of the idea is below.
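The sketch assumes a list of numeric feature names (numeric_features is a hypothetical name): compute the absolute correlation matrix, keep only its upper triangle so each pair is considered once, and drop one feature from every pair above the threshold.

import numpy as np

correlation_threshold = 0.65

# Absolute pairwise correlations between numeric features
corr = data[numeric_features].corr().abs()

# Keep only the upper triangle so each pair is counted once
upper = corr.where(np.triu(np.ones(corr.shape), k=1).astype(bool))

# Drop one feature from each highly correlated pair
to_drop = [col for col in upper.columns if (upper[col] > correlation_threshold).any()]
data = data.drop(columns=to_drop)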

4. Model Testing

Random Forest

The paper experimented with up to 1,000 trees; I had to limit the forest to 50 trees for the code to run through. The paper performed a grid search over different subsampling ratios and found the best results when training each tree on 20% of the features and 80% of the transaction rows. Below is a detailed post on how grid search works:

Using the following code, I found the best results when all the features in the dataset were considered at each node split and 75% of the transactions were used for each tree.

The full code for random forest is here.

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Base estimator and an example search space over the subsampling ratios
# (the exact grid used in the notebook may differ)
rf = RandomForestClassifier(n_estimators=50, random_state=42)
param_distributions = {
    'max_features': [0.2, 0.5, 0.75, 1.0],
    'max_samples': [0.5, 0.75, 0.8],
}

# Create a RandomizedSearchCV object (a sampled search rather than an exhaustive grid)
random_search = RandomizedSearchCV(estimator=rf, param_distributions=param_distributions, n_iter=5, cv=3, n_jobs=4)

# Fit the RandomizedSearchCV object to the data
random_search.fit(X_train, y_train)

# Get the best hyperparameters
best_params = random_search.best_params_

# Extract the best values for max_features and max_samples
best_max_features = best_params['max_features']
best_max_samples = best_params['max_samples']

# Create a new random forest classifier with the best hyperparameters
best_rf = RandomForestClassifier(n_estimators=50, random_state=42, **best_params)

# Fit the best random forest classifier to the training data
best_rf.fit(X_train, y_train)

Below is an example tree used in the Random Forest model:

eXtreme Gradient Boosting

eXtreme Gradient Boosting (XGBoost) is the second model used in the paper. It is similar to random forest in that it combines multiple decision trees to create a strong learner, but it builds the trees sequentially, with each tree trained to correct the errors made by the previous trees. The model is also better suited to larger datasets than random forest. Below is a detailed post on the XGBoost model:

The paper identified a learning rate of 0.02, a row subsampling rate of 0.8, and an ensemble size of 2,000 estimators. Each categorical feature is mapped to an integer value, since the XGBoost setup used here does not consume raw categorical features. The code looks like the below, with the full code here.

import numpy as np
import xgboost as xgb
from sklearn.model_selection import train_test_split

# Define hyperparameters
learning_rate = 0.02
subsample = 0.8
n_estimators = 2000

# Split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Convert the data to DMatrix format
dtrain = xgb.DMatrix(X_train, label=y_train, enable_categorical=True)
dtest = xgb.DMatrix(X_test, label=y_test, enable_categorical=True)

# Set the parameters for the XGBoost model
# (the hyperparameters above must be passed in here, or they are silently ignored)
params = {
    'objective': 'binary:logistic',
    'learning_rate': learning_rate,
    'subsample': subsample,
    'seed': 42
}

# Train the XGBoost model for n_estimators boosting rounds
xgb_model = xgb.train(params, dtrain, num_boost_round=n_estimators)

# Make predictions with the trained model (probabilities of the positive class)
y_pred = xgb_model.predict(dtest)

# Convert predicted probabilities to binary predictions
y_pred_binary1 = np.round(y_pred)

# Evaluate the model
accuracy = np.sum(y_pred_binary1 == y_test) / len(y_test)
print("Accuracy: {:.2f}%".format(accuracy * 100))

Below is the visual of the first tree in the XGBoost model:

CatBoost

The CatBoost model is the next model used in the paper. CatBoost grew out of a challenge with traditional gradient boosting algorithms, which typically require categorical features to be converted into numerical representations using techniques like one-hot encoding or label encoding. The label encoding used for the XGBoost model above is below:

from sklearn.preprocessing import LabelEncoder

# Convert categorical features to numerical using LabelEncoder
cat_cols = []
for f in merged_df.columns:
    if merged_df[f].dtype == 'object' or merged_df[f].dtype == 'bool':
        cat_cols.append(f)
        lbl = LabelEncoder()
        lbl.fit(list(merged_df[f].values))
        merged_df[f] = lbl.transform(list(merged_df[f].values))

To address this challenge, CatBoost introduced ordered target statistics, closely tied to its "ordered boosting" scheme, which handle categorical features natively without explicit encoding. A detailed explanation of CatBoost can be found below:
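As a tiny illustration (purely hypothetical toy data), CatBoost can consume raw string categories directly; you only tell it which columns are categorical:

import pandas as pd
from catboost import CatBoostClassifier

toy = pd.DataFrame({'color': ['red', 'blue', 'red', 'green'],
                    'amount': [1.0, 2.5, 0.7, 3.1]})
toy_y = [0, 1, 0, 1]

# No label encoding or one-hot encoding needed for 'color'
CatBoostClassifier(iterations=10, verbose=0).fit(toy, toy_y, cat_features=['color'])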

In the paper, the ensemble size is limited to 1000 because of RAM constraints. The code for this model looks like the below, with the full code here.

from catboost import CatBoostClassifier

# Create the CatBoost classifier with an ensemble size of 1000
model = CatBoostClassifier(iterations=1000, random_seed=42, verbose=100)

# Specify the categorical columns in the DataFrame
cat_cols = ['ProductCD', 'card4', 'card6', 'P_emaildomain', 'M1', 'M2', 'M3', 'M4', 'M6', 'card_identifier']

# Fit the model with the categorical columns specified
model.fit(X_train, y_train, cat_features=cat_cols)

# Make predictions with the trained model (predict() already returns class labels)
y_pred = model.predict(X_test)

# Cast the predictions to a binary integer array for the ensemble step
y_pred_binary2 = np.round(y_pred).astype(int)

# Evaluate the model
accuracy = np.sum(y_pred_binary2 == y_test) / len(y_test)
print("Accuracy: {:.2f}%".format(accuracy * 100))

Neural Networks

Neural networks are also used in the paper. A detailed dive into neural networks in general can be found in the post below, and a code reconstruction of a neural network paper was presented in my previous Medium article.

In this paper, a neural network with three hidden layers of 256 nodes each, a learning rate of 0.001, and a weight decay coefficient of 0.0005 was used. I had to use a learning rate of 0.1 to get a reasonable runtime.

from sklearn.neural_network import MLPClassifier

mlp = MLPClassifier(hidden_layer_sizes=(256, 256, 256), activation='relu', solver='adam', random_state=42, alpha=0.0005, learning_rate_init=0.1)

Before training, the paper standardized all numerical features and log-transformed those with a large value range. The code for this would look like the below:

import numpy as np
from sklearn.preprocessing import FunctionTransformer, RobustScaler

# Custom transformer that log-transforms the numeric columns and then applies
# robust scaling
class RobustScalerTransformer(FunctionTransformer):
    def __init__(self, colnames=None):
        super().__init__()
        self.colnames = colnames
        self.scaler = RobustScaler()

    def _clean_and_log(self, X):
        X = X.copy()

        # Replace infinite values with a large finite value
        X = X.replace([np.inf, -np.inf], np.nan)
        X.fillna(1e9, inplace=True)

        X[self.colnames] = np.log(X[self.colnames] + 1)  # Apply log transformation
        X.replace([np.inf, -np.inf], 1e9, inplace=True)
        return X

    def fit(self, X, y=None):
        # Fit the scaler on the cleaned, log-transformed columns so that fit
        # and transform see the same representation
        self.scaler.fit(self._clean_and_log(X)[self.colnames])
        return self

    def transform(self, X, y=None):
        X_transformed = self._clean_and_log(X)
        # Apply the fitted robust scaling (the original snippet fit the scaler
        # but never applied it)
        X_transformed[self.colnames] = self.scaler.transform(X_transformed[self.colnames])
        return X_transformed

For categorical features, the paper used one-hot encoding after reducing the cardinality of high-cardinality features by replacing infrequent categories with a uniform label. The code would look like the below:

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Define the threshold count for high cardinality features
threshold = 500

# Create an empty list to store high cardinality features
high_cardinality_features = []

# Loop through each column in the DataFrame
for column in merged_df.columns:
    # Check if the column is categorical
    if merged_df[column].dtype == 'object':
        # Check if the number of distinct values exceeds the threshold
        if merged_df[column].nunique() > threshold:
            # If so, add the column name to the list of high cardinality features
            high_cardinality_features.append(column)

# Subset the DataFrame to only include non-high cardinality features
non_high_cardinality_features = list(set(X.columns) - set(high_cardinality_features))
X = X[non_high_cardinality_features]

# Select only the desired categorical columns
categorical_cols = ['ProductCD', 'card4', 'card6', 'P_emaildomain', 'M1', 'M2', 'M3', 'M4', 'M6']

# Create a column transformer for numerical and categorical features
# (numerical_cols is the list of numeric feature names defined earlier;
# handle_unknown='ignore' makes the encoder emit all-zero columns for
# categories it did not see during fitting)
preprocessor = ColumnTransformer(
    transformers=[
        ('robust_scaler', RobustScalerTransformer(colnames=numerical_cols), numerical_cols),
        ('encoder', OneHotEncoder(handle_unknown='ignore'), categorical_cols)
    ])

In the OneHotEncoder code chunk above, the uniform label comes from the encoder itself: with handle_unknown='ignore', it treats missing or unknown values as outside its fitted categories and assigns a binary value of 0 across all of their indicator columns. This acts as a uniform label for the "infrequent categories" that were replaced during the cardinality reduction further above.
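For completeness, here is a hedged sketch of the relabeling the paper describes (my reconstruction, not the notebook's exact code): categories that appear fewer than threshold times are replaced with a single uniform label before encoding.

# Replace infrequent categories with one uniform label per column
for column in categorical_cols:
    counts = merged_df[column].value_counts()
    rare = counts[counts < threshold].index
    merged_df[column] = merged_df[column].mask(merged_df[column].isin(rare), 'RARE')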

One important point to note for neural networks is the issue of oversampling, discussed in my previous Medium article: merely following the methods described in the paper without oversampling results in no fraud being detected at all, as shown below.
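A hedged sketch of the fix, using imbalanced-learn's RandomOverSampler (my choice of oversampler; the earlier article's method may differ) to balance the classes before fitting the MLP:

from imblearn.over_sampling import RandomOverSampler

# Duplicate minority-class (fraud) rows until the classes are balanced
ros = RandomOverSampler(random_state=42)
X_train_res, y_train_res = ros.fit_resample(X_train, y_train)

# Train the neural network on the balanced training set
mlp.fit(X_train_res, y_train_res)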

The full code for Neural Networks is here.

5. Ensembling

The paper used the Jaccard score to measure the prediction similarity of the models above. Given two models' predictions, it is defined as:

J = N₁,₁ / (N₀,₁ + N₁,₀ + N₁,₁)

where Nᵢ,ⱼ is the number of transactions for which model 1 predicted class i and model 2 predicted class j. For i = j we say the models agree, and disagree otherwise. A Jaccard score of 1 means that both models made identical predictions on all transactions, and a Jaccard score of 0 means that the models never agreed on a fraud prediction.

So, theoretically, adding a model with a low prediction-similarity score to the models already in the ensemble should improve performance more than adding a model with high similarity. Below is a summary of the Jaccard scores for all the models above, as well as for the ensemble model:

from sklearn.metrics import jaccard_score

# Jaccard score of the ensemble's test-set predictions (jaccard1 through
# jaccard4 are computed the same way from the individual models' predictions)
jaccard_ensemble = jaccard_score(y_test, ensemble_preds)

# Print the Jaccard Scores
print("Jaccard Score for Random Forest:", jaccard1)
print("Jaccard Score for eXtreme Gradient Boosting:", jaccard2)
print("Jaccard Score for CatBoost:", jaccard3)
print("Jaccard Score for Neural Network:", jaccard4)
print("Jaccard Score for ensemble_preds:", jaccard_ensemble)

The paper used majority voting to combine predictions: for a given transaction, the ensemble predicts the class (0 for no fraud, 1 for fraud) that the majority of the models predicted. With four models and the > 2 threshold below, at least three models must vote fraud, so ties resolve to no fraud.

ensemble_preds = np.where(np.sum([y_pred1, y_pred_binary1, y_pred_binary2, y_pred2a], axis=0) > 2, 1, 0)

Below is the result of the ensemble prediction:

Ensemble

This is based on a Receiver Operating Characteristic (ROC) curve, as shown below, where the 45-degree line shows the result of random guessing and the solid line shows the predictive performance of the binary classifier.
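A minimal sketch of how such a curve can be produced, assuming probability scores from one of the fitted models (here best_rf from the random forest step):

import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc

# Probability of the fraud class for each test transaction
scores = best_rf.predict_proba(X_test)[:, 1]

fpr, tpr, _ = roc_curve(y_test, scores)
plt.plot(fpr, tpr, label='Random Forest (AUC = {:.3f})'.format(auc(fpr, tpr)))
plt.plot([0, 1], [0, 1], linestyle='--', label='Random guessing')
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.legend()
plt.show()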

And below is an exhibit of confusion matrices of all models discussed above that were part of the ensemble:

Random Forest

eXtreme Gradient Boosting

CatBoost

Neural Network
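
Each of these matrices can be reproduced with scikit-learn's confusion_matrix on the corresponding predictions, for example for the ensemble:

from sklearn.metrics import confusion_matrix

# Rows are actual classes, columns are predicted classes
print(confusion_matrix(y_test, ensemble_preds))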

The full experiment and reconstructed research paper code can be found in the Google Colaboratory page below:

There is still so much more to discover in this research paper, so stay tuned for upcoming posts on other topics raised that I haven’t covered yet.

CodeChat

I also hold code talks on Google Meet on the last Friday of every month at 5:00 p.m. EST. The topic for the next chat is Digesting Decision Trees in Python, along with any questions you may have related to coding. You may sign up here for a meeting reminder, or use the meeting link here if you would like to join directly.

Feedback

If you have thoughts to share with me, please let me know on the Data Analysts’ Message Board!
