Sales Prophet: An AI-Driven Sales Forecasting Tool.
Boosting Sales by Unleashing The Power of Machine Learning for Sales Analytics and Forecasting
Since the start of the COVID-19 pandemic, many businesses have suffered losses that forced them to cut staff, leaving a large number of people without jobs. To work their way out of this situation, companies must undergo a digital transformation.
Digital transformation can help them keep accelerating their sales and improving the customer experience. Now that the pandemic is easing and work from the office has resumed, it has also revealed further opportunities for digital transformation. The main focus of this research was to identify how sales can be increased and which major factors affect sales in the retail industry.
Problem Statement
To conduct an in-depth examination of sales performance across diverse retail outlets, with a focus on identifying and assessing the underlying factors influencing these sales figures. Additionally, the study aims to categorize these influencing factors into distinct groups and quantify their respective contributions to overall sales growth.
The study also examines sales trends in the retail industry both before and during the COVID-19 era. These insights are intended to enable behavioral forecasting and to drive profitable push-sales strategies.
The dataset was sourced from data.world and is available for download from my GitHub repository. It spans a significant timeframe, encompassing data collected over several decades, from 1985 to 2021.
Business Process
Data Preparation
Missing value handling
The dataset has 5% null values, so to ensure that our data doesn't carry bias, we took a closer look at the missing values per column.
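For reference, the per-column check looks something like this (a minimal sketch, assuming the data has already been loaded into a pandas DataFrame called df, as in the snippets below):
df.isnull().sum()            # number of missing values in each column
df.isnull().mean() * 100     # share of missing values in each column, as a percentage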
Within the dataset, the columns "Item_Weight" and "Outlet_Size" contain missing values. Specifically, "Item_Weight," representing the weight of items, has 1463 instances of missing data. Given that "Item_Weight" consists of numerical values, these gaps were resolved through mean imputation, effectively replacing the missing values with the mean of the available data points.
df['Item_Weight'].fillna(df['Item_Weight'].mean(),inplace=True)
The "Outlet_Size" column, which contains categorical data, was identified to have 2410 missing values. To address these missing values, mode imputation was applied, effectively substituting the missing values with the most frequently occurring category within the column.
df['Outlet_Size'].mode()
df['Outlet_Size'].fillna(df['Outlet_Size'].mode()[0],inplace=True)
Additional details on our approach to cleaning the data for modelling are provided in this notebook.
Dimensionality reduction
To streamline the dataset and reduce dimensionality, irrelevant columns were eliminated. Notably, columns such as "Item_Identifier" and "Outlet_Identifier" were deemed unnecessary for achieving the project's objectives and were consequently removed from the dataset.
df.drop(['Item_Identifier','Outlet_Identifier'],axis=1,inplace=True)
Standardization
The dataset exhibits an extensive range of values, spanning from a minimum of 0 to potentially exceeding 5555. Consequently, the statistical characteristics, such as the mean and standard deviation, display a broad range of values, leading to potential challenges in data analysis and modeling. To mitigate this issue, a scaling technique known as standardization was applied. Standardization transforms the statistical distribution of the data to a standardized format where the mean becomes 0, and the standard deviation becomes 1.
In the process of standardization, two key operations are employed: "fit_transform" and "transform." The "fit_transform" operation is used during the initial step to calculate the mean and standard deviation of the dataset and subsequently transform it to meet the desired standard distribution. Once this transformation is determined, the "transform" operation is used to apply the same scaling parameters to new or additional data, ensuring consistent standardization across the dataset. This standardized format not only aids in data analysis but also enhances the compatibility of the data with various machine learning algorithms.
from sklearn.preprocessing import StandardScaler
sc= StandardScaler()
X_train_std= sc.fit_transform(X_train)
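When a held-out split is scaled later, the same fitted scaler should be reused with transform rather than fit_transform, so the test data is standardized with the training set's mean and standard deviation (a minimal sketch, assuming an X_test split exists):
X_test_std= sc.transform(X_test)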
Feature Selection and Engineering
In the process of feature selection, preference was given to columns devoid of any missing data. To assess the interrelationships between these features, a correlation heatmap was plotted.
If the correlation coefficient exceeded or equalled 0.7, it indicated a substantial correlation, prompting the removal of one of the variables. Conversely, if the correlation coefficient fell below 0.7, it signified a lack of significant correlation, allowing the retention of both variables.
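A minimal sketch of this correlation check (assuming df holds the numeric candidate features; the 0.7 cutoff mirrors the rule described above):
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

corr = df.corr(numeric_only=True)
sns.heatmap(corr, annot=True, cmap='coolwarm')   # visualize pairwise correlations
plt.show()

# Drop one variable from every pair whose absolute correlation is 0.7 or higher
upper = corr.abs().where(np.triu(np.ones(corr.shape), k=1).astype(bool))
to_drop = [col for col in upper.columns if (upper[col] >= 0.7).any()]
df = df.drop(columns=to_drop)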
Label Encoding
Given the considerable number of distinct categorical variables present across all categorical columns, a data transformation technique known as label encoding was systematically applied to these categorical features. This method effectively converts categorical data into a numerical format, facilitating their integration into various machine-learning algorithms and analyses.
from sklearn.preprocessing import LabelEncoder
le= LabelEncoder()
df['item_fat_content']= le.fit_transform(df['item_fat_content'])
df['item_type']= le.fit_transform(df['item_type'])
df['outlet_size']= le.fit_transform(df['outlet_size'])
df['outlet_location_type']= le.fit_transform(df['outlet_location_type'])
df['outlet_type']= le.fit_transform(df['outlet_type'])
Model Training and Evaluation
The LinearRegression algorithm from the sklearn.linear_model library was employed to train the machine learning model on the pre-processed dataset. Linear regression is a statistical method that models the relationship between one or more independent features and a dependent response variable. It does this by fitting a linear equation to the observed data, effectively capturing the linear association between the features and the response.
In essence, linear regression seeks to establish a linear relationship, often represented as a straight line, that best explains how changes in the independent features influence variations in the dependent response. This method is widely used for tasks like predicting numeric values, making it a fundamental tool in regression analysis and predictive modeling.
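In this project's terms, the fitted model has the form Sales = b0 + b1·x1 + b2·x2 + … + bn·xn, where the x values are the encoded item and outlet features, the b coefficients are learned from the training data, and b0 is the intercept.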
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error
lr= LinearRegression()
lr.fit(X_train_std,Y_train)
Y_pred_lr= lr.predict(X_test_std)   # predict on the standardized test features
print(r2_score(Y_test,Y_pred_lr))
print(mean_absolute_error(Y_test,Y_pred_lr))
print(np.sqrt(mean_squared_error(Y_test,Y_pred_lr)))
Hyperparameter Tuning
A machine learning model has parameters learned from data and hyperparameters set before training. Hyperparameters control aspects like model complexity and learning speed. One of the most widely used strategies for hyperparameter tuning is grid search with cross-validation, implemented in scikit-learn as GridSearchCV.
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LinearRegression
# Define the Linear Regression model
model = LinearRegression()
# Define the hyperparameter grid
param_grid = {
    'fit_intercept': [True, False],  # Whether to calculate the intercept
    'positive': [True, False]        # Whether to constrain the coefficients to be non-negative
}
# Initialize the GridSearchCV object
grid_search_lr = GridSearchCV(estimator=model, param_grid=param_grid, scoring='r2', cv=5)
# Fit the GridSearchCV object to the training data
grid_search_lr.fit(X_train_std, Y_train)
# Print the best hyperparameters and corresponding R-squared score
print("Best Hyperparameters: ", grid_search_lr.best_params_)
print("Best R-squared Score: {:.3f}".format(grid_search_lr.best_score_))
# Get the best model with tuned hyperparameters
best_lr_model = grid_search_lr.best_estimator_
To run the hyperparameter tuning, first import the necessary classes, GridSearchCV and LinearRegression. Then define the Linear Regression model (model) to tune and specify the hyperparameter search space in param_grid, which covers fit_intercept (whether to calculate the intercept) and positive (whether to constrain the coefficients to be non-negative; the older normalize option has been removed from recent scikit-learn releases, and scaling is already handled by StandardScaler). Initialize the GridSearchCV object (grid_search_lr) with the model, the parameter grid, the scoring metric ('r2' for R-squared), and the number of cross-validation folds (cv=5 for 5-fold cross-validation). The fit method then performs the search on the training data. Finally, print the best hyperparameters and their corresponding R-squared score, and retrieve the best model with the tuned hyperparameters as best_lr_model.
This code optimizes the hyperparameters of a Linear Regression model using GridSearchCV.
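As a quick follow-up (a sketch that assumes the X_test_std and Y_test splits used earlier), the tuned model can then be evaluated like any other fitted estimator:
Y_pred_best = best_lr_model.predict(X_test_std)
print(r2_score(Y_test, Y_pred_best))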
Deployment of Solution
During the development of my retail store sales project, I came to understand that deploying a machine learning model is a complex process that goes beyond running tests within a notebook. This project required a multi-step approach, including extensive feature selection and engineering techniques.
To streamline the deployment process, I utilized the joblib library to save both the trained StandardScaler and the Linear Regression model. When making predictions, these artifacts are loaded sequentially, creating an efficient pipeline for generating accurate sales forecasts. This approach ensures that the model can be readily used for future sales predictions in a retail store environment.
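A minimal sketch of that save-and-load flow (the file names here are illustrative placeholders, not the project's actual artifact names):
import joblib

# Persist the fitted scaler and model once training is complete
joblib.dump(sc, 'scaler.pkl')
joblib.dump(lr, 'lr_model.pkl')

# At prediction time, load both and chain them into a small pipeline
scaler = joblib.load('scaler.pkl')
model = joblib.load('lr_model.pkl')
prediction = model.predict(scaler.transform(new_data))   # new_data: a 2-D array of input features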
Furthermore, I leveraged Flask to develop a web application and seamlessly integrated the forecasting tool within it via a Flask API. The tool predicts sales based on specific Item and Outlet details: Item Weight, Item Fat Content, Item Visibility, Item Type, Item MRP, Outlet Establishment Year, Outlet Size, Outlet Location Type, and Outlet Type. Users can input these values into the web app to receive accurate sales predictions, enhancing decision-making within a retail context.
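For illustration, a minimal Flask endpoint along these lines could expose the saved artifacts as a prediction API (the route, field names, and file names below are assumptions rather than the project's exact code, and categorical fields are expected as already label-encoded integers):
from flask import Flask, request, jsonify
import joblib
import numpy as np

app = Flask(__name__)
scaler = joblib.load('scaler.pkl')     # assumed artifact names
model = joblib.load('lr_model.pkl')

# Input fields, in the same order as the training columns
FEATURES = ['item_weight', 'item_fat_content', 'item_visibility', 'item_type',
            'item_mrp', 'outlet_establishment_year', 'outlet_size',
            'outlet_location_type', 'outlet_type']

@app.route('/predict', methods=['POST'])
def predict():
    data = request.get_json()
    row = np.array([[data[f] for f in FEATURES]])          # order must match training columns
    sales = model.predict(scaler.transform(row))[0]
    return jsonify({'predicted_sales': float(sales)})

if __name__ == '__main__':
    app.run(debug=True)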
Summary of Solution
The solution offers an in-depth analysis of sales patterns and performance in diverse retail stores, shedding light on the impact of various factors and factor groups. Additionally, the analysis uncovers trends in sales over the years and identifies the key drivers behind the sales variations among the top and bottom-performing stores.
These findings emphasize the significance of considering location, time, and influencing factors when analyzing retail sales data, thereby providing valuable insights for informed business decisions.
All the project files and code for this project can be found right here.
If you made it this far, here's a virtual cookie for you! 🍪