Building an AI-Ethics Personalized Budget Predictor: A Comprehensive Guide Part 1
We should not just build AI systems that do well, but ones that also do good.
In a world where new trends emerge in the AI space almost daily, especially around fast-growing large language models (LLMs), it is of great importance to consider the ethical implications of their development.
I recently stumbled upon a tweet where a user expressed her displeasure about the results of an AI LinkedIn/CV picture generator.
In response to one of the comments, I made it clear that problems like this are the result of a development cycle that is not ethical.
To be honest, a lot of work has to be done to curb issues like this. In building AI solutions, it is very important to take privileged and unprivileged groups into consideration and also analyze the fairness of the solutions.
Concerns like this have led me to create this comprehensive guide on building an AI Ethics Personalized Budget Predictor using Python. With this guide, machine learning engineers and data professionals will have an idea of how to develop ethical AI solutions.
In this guide, I will provide a step-by-step approach I adopted in developing a budget predictor model that does not just provide accurate predictions but also prioritizes personalization and, most importantly, ethics. This guide will span from data collection and preprocessing to model development and ethical considerations. I will give a breakdown of the effective and ethical tools I used in building the budget predictor.
Please note that this comprehensive guide will be in parts, and I will be focusing on AI ethics and the data cleaning processes in this guide.
If you are looking to start a career in data science and analytics, I am starting a newsletter where I will be sharing weekly articles, how-to guides, and case studies surrounding the data domain.
You can subscribe to my newsletter, All About Data & More, for early access to all my articles.
Before we delve into the step-by-step approach, let’s have a brief understanding of AI ethics and an overview of the project.
AI ethics focuses on the moral and ethical issues surrounding the design, development, and deployment of AI systems. With AI ethics in view, issues like bias, discrimination, and privacy violations can be mitigated in AI solutions.
In a similar article, I discussed the ethical implications and debates surrounding AI-generated arts, which shed more light on the emerging trends in AI art and further sparked more conversations surrounding AI and art.
Overview of the AI System
This AI Ethics Personalized Budget Predictor project is divided into three steps: building a confusion matrix, measuring bias, and constructing a model card. In this solution, I will be using an already-provided dataset and performing operations such as exploration, analysis, and evaluation of fairness and bias issues in the data. The dataset used in this project will not be shared because of the privacy guidelines surrounding it.
In the course of this article, I will implement a bias mitigation strategy and then evaluate the improvements in the fairness of the data using the algorithms supported by the IBM AIF360 toolkit.
Problem Statement
A virtual financial institution (IDOOU) would like not only to predict the budgets of its customers but also to know whether customers with education credentials beyond high school have a budget equal to or greater than 300,000 USD. This will be compared to customers who are high school graduates.
This project is focused on:
Building an AI model for predicting user budgets (USD) based on gender, age, and education level.
Assessing fairness and bias for analyzing the provided data and the AI model for potential biases based on education level.
This guide details the step-by-step approach taken, including:
Data exploration and fairness analysis
Building and evaluating machine learning models (Logistic Regression, Gaussian Naive Bayes)
Implementing and evaluating a bias mitigation strategy (Reweighing)
Utilizing Explainable AI techniques to understand model predictions
This project strives to contribute to building responsible and ethical AI models while providing valuable insights into the interplay between education and financial behaviors.
In the course of this guide, I will show you a step-by-step approach that you can utilize to build a similar system and also show you how to explore the provided data by evaluating the budget predictor’s fairness and bias issues.
Let’s get started!
Setting Up the Environment
This AI system was built using the Python programming language. Python is a popular high-level interpreted language with applications that cut across various fields, including data science, machine learning, and more.
The popularity of the Jupyter Notebook environment around data science and machine learning projects makes it the best fit for this project.
Before starting any machine learning-related project, it is essential to set up the environment correctly. In this section, I will provide the step-by-step approach I utilized in setting up the environment for this project. Walk with me!
The first step is to install the tools and libraries that will be needed to run the models and algorithms. In this tutorial, I will be using the Anaconda distribution tool, which includes Python and Jupyter Notebook.
After installing Anaconda, you can then create a new Conda environment that will have all the packages for the project. The essence of creating a new Conda environment is to make it easier for you to manage and share the project by isolating the project’s dependencies.
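If you haven't created the environment yet, a command like the one below will do it (the environment name and Python version here are placeholders, so use whatever suits your setup).
conda create --name environment-name python=3.9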
After navigating to the project's directory, you can activate the Conda environment with this line of code.
conda activate environment-name
Once this is done, you can then install the Jupyter notebook with this line of code.
conda install jupyter notebook
With this, you can then proceed to launch the notebook and install the necessary libraries and frameworks. Before I go further, let’s have a brief overview of the libraries that are applicable to this project.
There are quite a number of libraries in the data science and machine learning space, but for the sake of this tutorial, we will be considering a few of them: the IBM AIF360 toolkit, TensorFlow, ScikitLearn, Jinja2, and Fairlearn.
We will be using the IBM AIF360 toolkit to measure and mitigate bias in the machine learning models we will build later in this tutorial.
With Jinja2, we will be able to generate HTML files in a dynamic way.
The Fairlearn Python package will be very handy in assessing and improving the fairness of the machine learning models we will be working with.
TensorFlow is an open-source software library for dataflow and differentiable programming across a range of tasks.
ScikitLearn, also known as Sklearn, is an open-source machine learning library for the Python programming language. It provides a wide range of tools for machine learning tasks.
To install the libraries for this project, we will be using the pip Python package installer. There is no need to install ScikitLearn since it comes preinstalled in the Anaconda distribution.
!pip install aif360
!pip install tensorflow
!pip install jinja2
!pip install fairlearn
After a successful installation, it will be necessary for you to restart the kernel before importing the packages. Now that you are done with that, you can import the following libraries:
import pandas as pd
import numpy as np
import seaborn as sns
import tempfile
from aif360.datasets import StandardDataset, BinaryLabelDataset
from aif360.metrics import ClassificationMetric, BinaryLabelDatasetMetric
from sklearn.tree import DecisionTreeClassifier
from aif360.algorithms.postprocessing import RejectOptionClassification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score
import joblib
import matplotlib.pyplot as plt
from collections import defaultdict
from sklearn.inspection import permutation_importance
from aif360.algorithms.preprocessing.reweighing import Reweighing
from aif360.explainers import MetricTextExplainer
from sklearn.preprocessing import StandardScaler
from tqdm import tqdm
Now that all the libraries are imported, you can proceed with data collection and preprocessing. While this project utilizes a proprietary dataset (ai_ethics_project_data.csv) that is not publicly available, the approach applies to any dataset you have access to. Feel free to experiment with publicly available datasets to follow along!
We begin by importing the data using Pandas' read_csv() function. This loads the data into a variable named act_rec_dataset. To get a glimpse of the data structure, we'll explore the first few rows using the head() method.
# Load the dataset for this project
act_rec_dataset = pd.read_csv('ai_ethics_project_data.csv')
act_rec_dataset.head()
Before performing any operation on the dataset, it is important to get a general picture of it. With this information, we will be able to perform more accurate and specific operations on the data.
print(act_rec_dataset.info())
print(act_rec_dataset.describe())
The first line prints a concise summary of the dataset, including the column names, data types, and the number of non-null values in each column. The second prints descriptive statistics for the numerical columns, such as count, mean, standard deviation, minimum, maximum, and quartile values (25th, 50th, and 75th percentiles).
To make the data clearer and easier to analyze, you can rename some columns using the rename() method. We renamed "Budget (in dollars)" to "Budget_dollars" and "With children?" to "With_Children" for better readability. This helps ensure everyone's on the same page when we dig into the data later.
# To rename some columns in the dataset
act_rec_dataset.rename(columns = {'Budget (in dollars)':'Budget_dollars',
'With children?':'With_Children'}, inplace = True)
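To confirm the rename took effect, you can list the column names; this is a quick optional check.
# To confirm the new column names
print(act_rec_dataset.columns.tolist())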
Let’s take a quick peek at the “Education_Level” column. You might notice some inconsistencies in the way people’s degrees are written. For example, some entries might read “Bachelor’s Degree” with an apostrophe, while others read “Bachelor Degree” without it.
To ensure consistency and make it easier for our model to understand, we’ll quickly replace “Bachelor’s Degree” and “Master’s Degree” with “Bachelor Degree” and “Master Degree”. Just grab your handy-dandy replace() method and you're all set!
# To standardize the degree names in the Education_Level column using the replace method
act_rec_dataset['Education_Level'] = act_rec_dataset['Education_Level'].replace(['Bachelor’s Degree', 'Master’s Degree'],
['Bachelor Degree', 'Master Degree'])
Now, everyone’s education levels are on the same page, making it easier for our model to analyze and understand the data.
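If you want to double-check the cleanup, listing the unique values in the column should now show only the apostrophe-free versions; this is another quick optional check.
# To confirm the degree names were standardized
print(act_rec_dataset['Education_Level'].unique())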
Ever come across data that’s just, well, missing? In this dataset, we might have some empty values hiding out. Let’s use our detective skills to sniff them out!
We’ll use the isnull() method, like a superpowered magnifying glass, to see if any values are missing (it returns "True" for missing values and "False" for filled ones). Then, the sum() method acts as our counting tool, revealing the total number of missing values in each column.
# To determine the total number of empty values in the dataset for each column
act_rec_dataset.isnull().sum()
Now you know exactly where those sneaky missing values are lurking! Armed with this information, you can decide how to handle them (like filling them in, removing rows, or even dropping the entire column). The question now is: how can we decide the next line of action?
How about we build a function that can make this decision for us? In this function, we will specify instructions based on a few conditions: if a column's empty rows make up more than 30% of the entire dataset, we drop that column; otherwise, we fill the gaps. Categorical columns with empty rows can be filled with the mode, while numerical columns can be filled with the median.
# To handle null values in columns such as "Gender", "Education_Level", and "With_Children"
def drop(df):
    for i in df.columns:
        if ((df[i].isnull().sum() / len(df)) * 100) > 30:
            df.drop(columns=[i], inplace=True)  # Drop columns with more than 30% missing values
        elif df[i].dtype == 'O':  # For object (categorical) type columns
            df[i].fillna(df[i].mode()[0], inplace=True)  # Fill null values with the mode
        elif df[i].dtype in ['int64', 'float64']:  # For numeric type columns
            df[i].fillna(df[i].median(), inplace=True)  # Fill null values with the median
The above function, when called on the dataframe, can then be used to eliminate the null values.
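As a minimal sketch of how you might apply it, call the function on the dataframe and then re-run the null check to confirm the gaps are gone.
# Apply the cleaning function and confirm that no null values remain
drop(act_rec_dataset)
act_rec_dataset.isnull().sum()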
Imagine organizing your budget by categories like “Frugal Friday” or “Splurge Saturday.” We can do the same with our data! Instead of just raw numbers, let’s turn them into meaningful categories.
Take the “Budget_dollars” column, for example. We can group the values into buckets like “less than 300” and “300 or more.” To do this, we’ll use clever tools called “bins” and “labels.” Think of bins like sorting baskets and labels like the tags you put on them.
For “Budget_dollars,” we might create bins like [0, 300, 1000000] and labels like [“<300,” “>=300”]. Then, we use the amazing pd.cut() function to sort each budget amount into its rightful bin, transforming it into a neat and tidy category.
# For the Budget_dollars column
bins = [0,300,1000000]
labels = ['<300', '>=300']
# To convert the Budget_dollars column to a categorical column
act_rec_dataset['Budget_dollars'] = pd.cut(act_rec_dataset['Budget_dollars'], bins, labels = labels)
The ‘Budget_dollars’ column also needs to be checked to confirm that there are no null values. By ensuring that the transformation was successful and there are no missing values, we have prepared the dataset for further analysis and machine learning modeling.
# To confirm that there is no null value
act_rec_dataset.Budget_dollars.isnull().sum()
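You might also find it helpful to see how the records are distributed across the two budget categories; this is optional but handy.
# To view the distribution of the new budget categories
act_rec_dataset['Budget_dollars'].value_counts()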
We also need to convert the “Age” column into a categorical column using bins and labels. For example, we might create categories like “18–24” for young adults just starting out or “66–92” for experienced folks enjoying their golden years. To achieve this, we’ll use our trusty “bins” and “labels” again. Think of bins as age ranges, and labels as the titles for each range.
For “Age”, we might use bins like [17, 24, 44, 65, 92] and labels like [“18–24,” “25–44,” “45–65,” “66–92”]. Then, we’ll use the pd.cut() function to sort each person's age into its appropriate category.
# For the Age column
bins = [17,24,44,65,92]
labels = ['18-24', '25-44', '45-65', '66-92']
To convert the ‘Age’ column to a categorical column based on the defined bins and labels, we can then use the pd.cut() function from the Pandas library. Think of it as sorting people into different age groups, like sorting books by genre. The pd.cut() function takes the 'Age' column as its first argument, the bins as its second, and the labels through the labels parameter. It then categorizes the numerical values in the 'Age' column into the specified bins and assigns the corresponding label to each category.
# To convert the Age column to a categorical column
act_rec_dataset['Age'] = pd.cut(act_rec_dataset['Age'], bins, labels = labels)
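As a final sanity check, you can confirm that both columns are now categorical; again, this is optional.
# To confirm that both columns are now categorical
print(act_rec_dataset[['Age', 'Budget_dollars']].dtypes)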
This marks the end of the data collection, cleaning, and transformation stage. In subsequent parts of this comprehensive guide, we will be exploring:
Data Exploration and Fairness Analysis: Here, we will scrutinize the data, thereby uncovering hidden patterns and potential biases.
Building and Evaluating Models: Get ready to meet Logistic Regression and Gaussian Naive Bayes, our AI model superstars. We’ll train them to predict budgets and then put their skills to the test!
Bias Mitigation Strategies: We’ll tackle fairness head-on, implementing the “Reweighing” technique to ensure our models treat everyone equally.
Explainable AI: Demystifying the magic behind the models, we’ll use Explainable AI tools to understand how they make their predictions.
By the end of this guide, you’ll be equipped with the skills to build fair, responsible, and insightful AI models, ready to unlock the secrets hidden within your own data. So, stay tuned for the next exciting parts!
This guide has provided a foundational understanding of the process involved in building an AI model for budget prediction while emphasizing the importance of data fairness and responsible development. In the following chapters, we’ll delve deeper into each step, exploring advanced data analysis techniques, model evaluation and selection, bias mitigation strategies, and explainable AI methods. By equipping yourself with these comprehensive skills and considerations, you’ll be empowered to build effective and responsible AI models for various applications, unlocking valuable insights and predictions from your data.
If you like this article, here’s how you can stay connected:
Give it a boost by liking it. This will help this article reach more readers.
Stay in the loop:
Subscribe to my newsletter, All About Data & More, to get weekly insights and exclusive content directly in your inbox.
Follow me on Medium to discover more articles and updates on my profile.
Connect on LinkedIn to join the conversation and network with other data enthusiasts.